Engineering Proteins from Sequence Statistics: Identifying and Understanding the Roles
of Conservation and Correlation in Triosephosphate Isomerase
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University
By
Brandon Joseph Sullivan B.S.
Graduate Program in the Ohio State Biochemistry Program
The Ohio State University
2011
Dissertation Committee:
Thomas J. Magliery, Advisor
Mark P. Foster
William C. Ray
Copyright by
Brandon Joseph Sullivan
2011
Abstract
The structure, function and dynamics of proteins are determined by the physical and chemical properties of their amino acids. Unfortunately, the information encapsulated within a position or between positions is poorly understood. Multiple sequence alignments of protein families allow us to interrogate these questions statistically. Here, we describe the characterization of bioinformatically-designed variants of triosephosphate isomerase (TIM). First, we review the state-of-the-art for engineering proteins with increased stability. We examine two methodologies that benefit from the availability of large numbers - high-throughput screening and sequence statistics of protein families. Second, we have deconvoluted what properties are encoded within a position (conservation) and between positions (correlations) by designing TIMs in which each position is the most common amino acid in the multiple sequence alignment. We found that a consensus TIM from a raw sequence database performs the complex isomerization reaction with weak activity as a dynamic molten globule. Furthermore, we have confirmed that the monomeric species is the catalytically active conformation despite being designed from 600+ dimeric proteins. A second consensus TIM from a curated dataset is well folded, has wild-type activity and is dimeric, but it only differs from the raw consensus TIM at 35 nonconserved positions. These two TIMs differ in the
fraction of dataset sequences from eukaryotes and prokaryotes. These distribution
differences have led to the breaking and altering of networks of statistical correlations at
nonconserved positions which we demonstrate with mutual information and subset
perturbation calculations. Additionally, we show that the curated consensus TIM is an extremely thermostable enzyme. The protein remains half folded at 95 °C and may be the only TIM to completely refold after thermal denaturation.
Third, we wished to understand the determinants of protein stability -- one of biochemistry's most difficult questions. It has been shown that consensus mutations improve the stability of native proteins approximately half the time, but there is no a
priori technique to predict which consensus mutations will be stabilizing. We have
developed a double-sieve filter that selects stabilizing mutations based on extent of
conservation and statistical independence from other positions within the multiple
sequence alignment. These two mathematical tests reliably predict stabilizing mutations
with greater than 90% accuracy. The statistical algorithm was used to select 15
consensus mutations that together improved the melting temperature of wild-type TIM
by nearly 10 °C.
Finally, we designed and characterized a model system for testing the effects of
statistically correlated residues. The TIM-knockout from the Keio Collection was
engineered for T7 expression and tested for TIM activity complementation. The single
gene knockout exhibits differential growth that correlates well to in vitro specific
activities. The design and characterization of two libraries are proposed to test the relationship between correlations and protein fitness.
Dedication
Mòran taing
To my family - Heidi, Keegan, Killian, Merlin and Addison.
To my Parents - Brian and Kathy Sullivan
Acknowledgments
Trí na chéile a thógtar na cáisléain
As Venuka, Tom and I developed this project we observed several residues that interact with many other positions - we deemed these as very important residues that define the protein. Looking back, I have adopted a different perspective. I believe that these residues alone are meaningless and only achieve importance through the contribution of the other positions. In that same light, all of my accomplishments have been dependent on the support of my family, friends, mentors and labmates. For that, I owe them everything.
I first want to thank my undergraduate advisor, David Wells and my many professors who developed my love for science. When I wished to return to academics, it was Sean
Taylor who forwarded my curriculum vitae to newly hired faculty, Thomas Magliery. I thank Sean for the introduction and thank Tom for the great opportunity to join and start his lab.
I am highly appreciative of all the administrative help and talent I have been blessed with throughout these years, particularly Che Maxwell, Nicole Wade, Peter Sanders, Judith
Brown and Jennifer Hambach. I also thank Kevin Dill and Jerry Park for all their help. I
have thoroughly loved my opportunities to teach for the chemistry department. I thank the entire chemistry staff for their training, support, advice and guidance - especially
Yiying Wu, Steven Kroner, Christopher Callam, Matthew Stoltzfus, Mary Bailey, Robert
Tatz, Eric Heine, Holly Wheaton, Tami Sizemore and Christopher Hadad. I am also grateful to my many students who have brought me great pride and inspiration. On the other side of the desk, I have enjoyed all of my coursework because of my amazing professors: Dehua Pei, Russ Hille, David Bisaro, Paul Herman, George Marzluf, Mark
Foster, Ross Dalbey, Mark Pfeiffer, Jovica Badjic, Thomas Magliery, Charles Bell,
Chenglong Li, Michael Chan and Will Ray. I am also highly appreciative of Ross
Dalbey and Jill Rafael-Fortney for directing the Ohio State Biochemistry Program and providing incredible mentorship.
I thank David Hart for the weekly Science magazines and for being awesome. I thank surface tension for saving numerous experiments in my graduate career. I am incredibly thankful for our Graduate School and the wonderful work of Kathleen Wallace, Karen
Mayer and Dean Patrick Osmer. I am humbled by their nominations and kind words.
Additional thanks are due to Kathleen Wallace and Karen Mayer for their efforts regarding the Preparing Future Faculty Program and my mentor Heather Rhodes of
Denison University.
I am also grateful for the many professors who have inspired me in my tenure at Ohio
State. Their efforts continue to push and inspire me, especially: Pehr Harbury, Vern
Schramm, Jay Keasling, Amy Keating, David Liu, Shelley Copley, David Baker, Peter
Schultz, Julius Rebek, David Eisenberg, Dan Bolon, Patricia Babbitt and Daniel Nocera.
I am additionally thankful to Pehr and his student Kierstin for MPAX training.
Receiving a doctorate is truly an endurance event. I often feel that completing the journey has been made easier through my participation in endurance sports. Whether it has been a century ride, a triathlon, a marathon or a 15 mile swim, the mental strength, stubbornness and stress relief have been invaluable. These traits have been second only to the friends and teammates who have inspired and pushed me to completion. In particular, I am thankful to labmates turned workout buddies: Matthew Heberling, Ely
Porter, Sarah Johnston and Ted Schoenfeldt. I thank my swim team, the Columbus
Sharks, for making 5:30 am workouts seem like a good idea. In particular I thank my coaches Tracy Hendershot and Bo Martin and teammate Evan Morrison for cultivating some truly insane swims.
I am grateful for the many friends and classmates I have made in graduate school. In particular, I wish to thank Christopher Jones, Jeffrey Joyner, John Shimko, Ross Wilson,
Ian Kleckner and Kevin Fiala. These students set high benchmarks and provided friendly competition. I am deeply thankful for my talented (and crazy) labmates. Jason Lavinder was a great friend, fellow football fan, and labmate. If I had known Jason before graduate school, I would have bought stock in Wendy's and retired after receiving my
Ph.D. Some of my favorite Magliery memories took place on the golf course with Jason
and the awkward brilliance of Sanjay Hari. Quiz team, snow pants, printer cartridges, the
-20, lab lingo, wooden seat, B, the Inner Game. Man - I miss Sanjay. I am also incredibly fortunate to have adopted two lab sisters in Lihua Nie and Brinda R-a-m-a-s-u-b-r-a-m-a-n-i-a-n. Lihua has taught me that no detail is too small, to be persistent and never come up short, and that setbacks are often little. She has also taught me that a trilingual-pint-sized-analytical-organic-biological-chemist wielding a hammer is a force to be reckoned with. In all seriousness, Lihua is the hardest working individual I have ever met. It has been an honor working with her. Brinda has been a fantastic friend. She has worked on one of the most difficult projects in the lab with grace, patience and unparalleled persistence. It was a great joy for our families to bring Sahana Umbah@@ and Keegan into the world only six weeks apart. Brinda also taught me that "isosbestic point" is a curse word. I also wish to thank the Genomic Design Group, especially
Venuka Durani, Nicholas Callahan, Deepamali Perera and Sidharth Mohan. Of this group, Venuka deserves a special thanks. This project would not have been possible without her help. She continually provided invaluable feedback, suggestions, time and the techniques that brought this dissertation to fruition. I am most grateful and happy to have shared authorships with her. I thank David Mata for autoclaving the rotor in the middle of my electrocompetent cell preparation.
One of my favorite aspects of graduate school has been the mentoring of undergraduates.
I have been so lucky to have shared my project with incredibly talented people including
Trixy Syu, Miriam Thomas, Deepti Mathur, Tran Nguyen and Samantha Rojas. These
students continually surprised me, contributed significantly to the work written here, and
brightened each day in the lab. I am so happy that I was able to work side by side with
these amazing undergraduates.
I also thank my committee members for their advice, time and support - Ross Dalbey,
Mark Foster, Will Ray and Tom Magliery. In particular, I wish to thank my advisor,
Thomas Magliery. I will never be able to repay Tom for all that I have received. He gave me the opportunity, the resources, the mentorship and the confidence and tools to succeed. He has changed my life and for that I will always be indebted. Additionally, I thank Tom for showing me in my second year how to read a meniscus and for pointing out that "6-0-start" is quicker to type than "1-0-0-start" on the microwave.
I also wish to thank those that funded my education. I thank the faculty of the Ohio State
Biochemistry Program for funding our first year of study in addition to the Ohio State
University. I am grateful to the National Institutes of Health, which provided two years of funding through the Chemistry-Biology Interface Training Program. I appreciated the
University and Department of Chemistry for allowing me to teach and earn my tuition and stipend. I am thankful to Tom for supporting me as a Graduate Research Assistant.
Lastly, I thank the University for the Presidential Fellowship which supported my dissertation year at Ohio State.
Finally and most deservingly, I thank my family. I thank my in-laws Clarence and
Sharon Zielke for their support and temporary roof. I thank my parents for their love, pride and support. None of this would have been possible without their continued lessons and advice. I am also thankful for the rounds of golf I have shared with my parents - the perfect four hour distraction from a rough week in lab. I am incredibly grateful for Heidi,
Keegan, Killian, Merlin and Addison. Merlin and Addison, my pups, have provided infinite smiles and ridiculousness to my life. It's cliché, but I love their excitement and happiness every evening when I come home from lab. I want to thank my son, Keegan - the happiest kid I have ever seen. He ensures that every day starts and ends with a smile.
Elmo, bananas, elephants, a stuffed dog named Kona, Rocket and "more" - these are the things that now bring me happiness. I am thankful for Killian and look forward to forging many memories with him. Finally, I want to thank Heidi. I have come to realize that receiving a Ph.D. takes persistence and hard work, but being married to someone earning that degree takes unparalleled patience and sacrifice. I thank Heidi for that and for meeting those daily challenges with absolute grace. This degree would not have been possible without her love and support. Heidi, thank you.
Trí na chéile a thógtar na cáisléain
Vita
2000...... Saint Charles Preparatory School
2004...... B.S. Biology, The Ohio State University
2006-2011 ...... Graduate Associate, Ohio State
Biochemistry Program, The Ohio State
University
Publications
High-throughput thermal scanning: a general, rapid dye-binding thermal shift screen for protein engineering. Jason J. Lavinder, Sanjay B. Hari, Brandon J. Sullivan and Thomas J. Magliery. J Am Chem Soc 2009. 131(11): 3794-3795.
Protein stability by number: high-throughput and statistical approaches to one of protein science’s most difficult problems. Thomas J. Magliery, Jason J. Lavinder and Brandon J. Sullivan. Curr Opin Chem Biol 2011. 15(3): 443-451.
Triosephosphate isomerase by consensus design: Dramatic differences in physical properties and activity of related variants. Brandon J. Sullivan, Venuka Durani and Thomas J. Magliery. J Mol Biol 2011. 413(1): 195-208.
Fields of Study
Major Field: Biochemistry
Specialization: Protein Engineering and Sequence Statistics
Table of Contents
Abstract ...... ii
Dedication ...... v
Acknowledgments...... vi
Vita ...... xii
Publications ...... xii
Fields of Study ...... xii
Table of Contents ...... xiii
List of Tables ...... xviii
List of Figures ...... xix
Chapter 1: Introduction ...... 1
1.1 The Importance of Proteins ...... 1
1.2 Protein Structure and Function ...... 5
1.3 The Thermodynamic Hypothesis ...... 9
1.4 The Sequence-Structure-Function Relationship ...... 15
1.5 The Protein Folding Problem ...... 18
1.6 Computational Structure Prediction ...... 21
1.7 The Inverse Folding Problem ...... 24
1.8 Empirical Protein Science ...... 32
1.9 The Genomic Era and Sequence Statistics ...... 40
1.9.1 Conservation Statistics ...... 47
1.9.2 Ancestral Statistics ...... 59
1.9.3 Correlation Statistics ...... 61
1.10 Triosephosphate Isomerase ...... 68
1.11 Dissertation Synopsis ...... 75
Chapter 2: Protein Stability by Number ...... 77
2.0 Contributions ...... 77
2.1 Abstract ...... 77
2.2 Introduction ...... 78
2.3 Screening for Protein Stability ...... 81
2.4 Inferential Screens for Protein Stability ...... 81
2.5 Direct, Small Scale Screens for Stability ...... 85
2.6 Membrane Proteins and Antibodies ...... 90
2.7 Protein Stability from Sequence Statistics ...... 92
2.8 Consensus ...... 93
2.9 Correlation ...... 95
2.10 Conclusions and Outlook ...... 96
2.11 Acknowledgements ...... 97
Chapter 3: Consensus Design of Triosephosphate Isomerase ...... 98
3.0 Contributions ...... 98
3.1 Abstract ...... 98
3.2 Introduction ...... 99
3.3 Results ...... 103
3.4 Discussion ...... 117
3.5 Acknowledgements ...... 125
3.6 Kinetic Plots ...... 126
Chapter 4: Protein Stability and Sequence Statistics ...... 128
4.0 Contributions ...... 128
4.1 Abstract ...... 129
4.2 Introduction ...... 130
4.3 Results ...... 134
4.4 Discussion ...... 150
4.5 Acknowledgements ...... 158
Chapter 5: In vivo Analyses of Triosephosphate Isomerase ...... 159
5.0 Contributions ...... 159
5.1 Abstract ...... 159
5.2 Introduction ...... 160
5.3 Results and Discussion ...... 164
5.4 Future Directions ...... 172
5.5 Outlook ...... 175
5.6 Acknowledgements ...... 177
Chapter 6: Materials and Methods ...... 178
6.1 Sequence Statistics ...... 178
6.1.1 Databases and Curation ...... 178
6.1.2 Conservation ...... 179
6.1.3 Correlation ...... 181
6.2 Chapter 3 Methods ...... 182
6.2.1 Sequences and Cloning ...... 182
6.2.2 Expression ...... 187
6.2.3 Purification ...... 187
6.2.4 Purity and Yield ...... 188
6.2.5 Circular Dichroism ...... 189
6.2.6 Gel Filtration ...... 190
6.2.7 Analytical Ultracentrifugation ...... 190
6.2.8 Hydrophobic Dye Binding ...... 191
6.2.9 Activity ...... 191
6.2.10 Nuclear Magnetic Resonance ...... 193
6.3 Chapter 4 Methods ...... 194
6.3.1 Sequences and Cloning ...... 194
6.3.2 Circular Dichroism ...... 197
6.3.3 Differential Static Light Scattering...... 198
6.3.4 FoldX Calculations ...... 199
6.4 Chapter 5 Methods ...... 199
6.4.1 Keio(DE3) Construction ...... 199
6.4.2 Cloning of E165D Mutants...... 201
6.4.3 Solid and Liquid Minimal Media Growth ...... 201
References ...... 204
List of Tables
Table 1: Protein Stability by Number...... 80
Table 2: Kinetic Data for Consensus TIMs...... 105
Table 3: Differences between cTIM and ccTIM...... 120
Table 4: Characterization of Mutants...... 138
Table 5: Kinetic values for Keio characterization...... 165
List of Figures
Figure 1: The Importance of Proteins...... 2
Figure 2: The Structure-Function Relationship...... 7
Figure 3: Protein Folding Energy Landscape...... 11
Figure 4: Transformation Efficiencies...... 36
Figure 5: Redesign of Rop...... 39
Figure 6: Example Multiple Sequence Alignment...... 44
Figure 7: Binding Specificities of SH2 Domains...... 46
Figure 8: SCA and WW-Domains...... 63
Figure 9: Beyond Consensus Analysis of TPRs...... 65
Figure 10: StickWRLD...... 68
Figure 11: Triosephosphate Isomerase...... 70
Figure 12: monoTIM...... 71
Figure 13: The Activity of TIM...... 72
Figure 14: Principles of Screening for Protein Stability...... 85
Figure 15: cTIM Structure and Stability...... 106
Figure 16: ir-cTIM Design...... 106
Figure 17: ir-cTIM Characterization...... 108
Figure 18: 1H,15N-HSQC NMR of cTIM and S.c. TIM...... 109
Figure 19: Temperature and Concentration Dependent Activity...... 110
Figure 20: Comparison of Consensus TIM sequences...... 112
Figure 21: ccTIM Characterization...... 113
Figure 22: Structure of ccTIM...... 114
Figure 23: Arsenate Inhibition of ccTIM...... 115
Figure 24: Sequence Differences between cTIM and ccTIM...... 117
Figure 25: Taxonomy Statistics...... 118
Figure 26: Sequence Correlations in TIMs...... 119
Figure 27: Kinetic Plots for S.c. TIM...... 126
Figure 28: Kinetic Plots for cTIM...... 126
Figure 29: Kinetic Plots for ir-cTIM...... 127
Figure 30: Kinetic Plots of ccTIM...... 127
Figure 31: re-S.c.TIM...... 135
Figure 32: CD Characterization of Highly Conserved Mutations...... 136
Figure 33: Thermal Assays...... 137
Figure 34: Concordance of Stability Assays...... 139
Figure 35: Correlation of Thermal Methods...... 139
Figure 36: Filtering by Conservation...... 143
Figure 37: Mutual Information and Protein Stability...... 145
Figure 38: Hidden Correlations...... 146
Figure 39: Characterization of comboTIM and algoTIM...... 148
Figure 40: Kinetic Unfolding of Consensus Variants...... 150
Figure 41: Mutual Information for comboTIM and algoTIM...... 153
Figure 42: Correlations in TIM...... 155
Figure 43: Physical Properties...... 157
Figure 44: Differential growth in solid media...... 166
Figure 45: Differential growth in liquid media...... 167
Figure 46: Protein expression in TIM-knockout...... 169
Figure 47: Analytical digest schemes to assess fitness...... 171
Figure 48: Analytical digests to determine populations...... 172
Figure 49: Library sites in triosephosphate isomerase...... 174
Chapter 1: Introduction
1.1 The Importance of Proteins
DNA is most simply the blueprint for life. It gives us our identity, but more importantly provides systematic directions for the assembly of our cellular machines. Nearly every cellular event and life process is the result of the actions of proteins. A simple bacterium employs roughly 2,000 different proteins while more complex organisms like humans make use of more than 20,000 diverse proteins. Numerous polypeptides are required,
because they perform highly specialized and unique roles in their native environments.
There are essentially six classes of functions performed by these amazing
macromolecules: binding, movement, structure, signaling, transport and catalysis (Fig. 1).
First, proteins have the ability to bind small molecules and other macromolecules like
carbohydrates, nucleic acids, lipids or even other proteins. For example, the heme-containing protein, myoglobin, reversibly binds oxygen (O2) under saturating conditions as an oxygen storage protein. Whales harbor high concentrations of myoglobin within muscle tissue enabling them to dive deeply in the water for long periods of time. A second example of binding is witnessed in the immune system with antibodies.
Antibodies are multi-chained proteins that tightly bind substances known as antigens.
These antigens are generally foreign molecules (e.g. viral proteins) that are recognized as foreign threats and tagged for destruction by the binding of antibodies.
Figure 1: The Importance of Proteins.
A. Myoglobin is an example of a binding protein. The crystal structure shown here binds diatomic oxygen via a heme group. B. Myosin is one of the protein components of muscle that allow the tissue to contract and move. C. Ras is an important signaling protein that binds GTP. D. Triosephosphate isomerase is a diffusion-controlled enzyme that catalyzes the isomerization of two three-carbon sugars. E. This membrane protein serves as a chloride channel that transports ions into the cell. F. The collagen triple helix provides structure, strength and flexibility to skin, cartilage and bone. Images were rendered in Pymol 1.4.1 with PDB IDs: 1MBN, 1W9I, 3GFT, 2YPI, 1KPL, 1CAG.
Other protein molecules like actin and myosin orchestrate movement via muscular contractions. Likewise, some ATPases and flagellum motors choreograph complex rotation. Some proteins serve structural roles. Microtubules and filaments provide an architectural scaffold for each cell. Cross-linked keratin provides the structure and strength of nails and hair. As previously mentioned, many binding events set into motion a series of further incidents. This communication allows for complicated signaling
between cells, within a cell, and with the environment. Many membrane proteins, like the acetylcholine receptor protein, bind extracellular molecules that trigger intracellular events. Another class, transport, is responsible for moving molecules from point A to
point B. This class may operate as membrane transporters, vesicle transporters or carrier
proteins.
The hallmark function of proteins is their ability to catalyze biochemical reactions.
Virtually every chemical reaction in the cell is catalyzed by proteins known as enzymes.
There are ten proteins that serially break down the six-carbon sugar, glucose, to two
smaller molecules of pyruvate. From here, further proteins continue to alter pyruvate in a
process that yields energy. A second example, DNA polymerase, catalyzes the
replication of our DNA genes with extremely high fidelity.
These examples represent only a small snapshot of the many diverse, but imperative roles
of proteins. While the variety is truly remarkable, the fidelity with which proteins
perform their tasks is truly impressive. The cell requires this fidelity to maintain
homeostasis, replicate and perform specialized tasks. Unfortunately, alterations in DNA
genes may lead to downstream changes in protein composition that may impair the
molecule's function. This is often the case in disease. For example, tumor suppressor
protein p53 has been implicated in more than 50% of human cancers.1 This protein's
native function is to respond to cellular stresses that lead to cell growth, proliferation and division. Functional p53 operates as an early transcription factor that essentially turns off
these pathways. In the absence of active p53 these pathways continuously run unregulated leading to uncontrolled cell growth - tumors. The study of p53's mechanism, structure, stability, and effects of mutation are major goals in biochemistry. Likewise, a deeper understanding of all physiological proteins is necessary to better understand life and disease.
In addition, the field of protein science is highly interested in the action of proteins outside of their native environment and the manipulation of proteins for novel tasks.
With the plethora of functions proteins perform, it is not surprising that many wish to exploit these and similar functions for use in therapeutics, industry and academics. DNA polymerases have been isolated from extreme thermophilic organisms for use in polymerase chain reactions and DNA amplification. Antibodies have been raised and isolated from many species as a method of identification and quantification. Other industries have supplemented detergents with protein molecules like lipases. These proteins hydrolyze lipids which are often difficult to remove from greasy stains with standard chemical procedures. The use of proteins in pharmaceutics for in vitro screening and even therapeutics is becoming increasingly common. Small proteins such as insulin and human growth hormone are long-standing staples in the drug industry.
Recently, protein antibodies have gained favor as therapeutic reagents for their ability to bind receptor molecules which trigger intracellular events. Herceptin, a monoclonal antibody, binds the extracellular domain of HER2 - a receptor protein often overexpressed in breast cancer patients.2 The binding of Herceptin to HER2 arrests the cell
during the G1 phase of the cell cycle reducing proliferation. The Magliery Lab has formed a collaboration with several research groups to improve the drug-like properties of human paraoxonase 1 (huPON1). huPON1's native function is unknown, but it is found bound to high density lipoprotein (HDL) in the blood serum.3 Experiments have shown the enzyme has esterase and lactonase activities, and has been linked to cardioprotection.4-7 In addition to these proposed functions, huPON1 weakly hydrolyzes organophosphate nerve agents found in pesticides and potentially used in chemical warfare. Our lab and others have sought to fine tune the properties of human paraoxonase 1 to increase its usefulness as a therapeutic agent.
In summary, the native roles of proteins, their abilities to bind targets and catalyze chemical reactions, and their role in disease make them one of science's most important molecules. Despite decades of intense study and research, and despite knowing great details of certain proteins and systems, we are still far from understanding many of the fundamental properties of proteins. In order to truly understand life, prepare for life's challenges, and manipulate nature it is imperative to understand these elementary properties.
1.2 Protein Structure and Function
Proteins are assembled from the linkage of twenty natural building blocks known as amino acids. The twenty amino acids vary in size, shape, branching, charge and chemical composition. The carboxyl- and amino- groups of amino acids undergo a condensation
reaction at the ribosome. This repetitive reaction can link several amino acids to form an
oligopeptide or dozens, hundreds, or thousands to assemble polypeptides.
The linear polymer synthesized at the ribosome generally assumes a compact three-dimensional structure. This is imperative because nearly all proteins are biologically
active in their folded conformations. To further demonstrate, let us consider two
examples - an SH2 binding domain and the enzyme, triosephosphate isomerase. Src
Homology 2 (SH2) domains are structurally conserved domains that are involved in
signal transduction.8 They mediate key pathways by recognizing and binding specific
peptide sequences that include phosphorylated tyrosines. In order to fulfill this role, each
SH2 domain must bind its target sequences tightly and specifically. The ~100 residue
domain achieves this via its unique three-dimensional structure (Fig. 2A). The extended
polypeptide chain folds into an anti-parallel β-sheet surrounded by two α-helices. A
conserved arginine forms strong electrostatic interactions with the phosphorylated
tyrosine and nonconserved sites within the binding cleft provide the correct shape,
geometry and hydrogen bond donors and acceptors for specificity. There are over 3000
SH2 domains recorded in the Protein Families Database (Pfam) and slightly over 200
known structures. Although the overall fold is conserved, each of these has slight
differences at the atomic level allowing them to recognize different target sequences.
A second example is well illustrated in the glycolytic enzyme triosephosphate isomerase
(TIM). This complex protein catalyzes the isomerization of dihydroxyacetone phosphate
and glyceraldehyde-3-phosphate. Similar to the SH2 domain, TIM must recognize and
bind both of its substrates with affinity and specificity. The enzyme is ~240 residues, but
only three of those are directly involved in catalysis (Fig. 2B). This means that the
majority of the protein is providing a stable architecture to position and align the
functional groups of only a handful of amino acids - in this case a lysine, histidine and
glutamic acid. As is the case with most enzymes, the positioning of catalytic residues
must be precise for the chemistry to follow. In triosephosphate isomerase, mutation of
the catalytic glutamic acid to aspartic acid diminishes the activity of the enzyme by 500-
fold.9 In this case, the carboxylate is maintained, but moved one methylene group (~1.5
Å) away from the substrate. Clearly, the exquisite structure of proteins is imperative for function.
Figure 2: The Structure-Function Relationship.
A. The crystal structure of SHP2 SH2 domain shown in gray. The SH2 domain binds the phosphorylated peptide, RLNpYAQLWHR, shown in green (PDB: 3TL0). B. A crystal structure of triosephosphate isomerase with bound inhibitor shown in spheres. The active site glutamate (12 o'clock) is 500-fold more active than a mutant with aspartate at this position (Figure rendered in Pymol 1.4.1 with PDB: 2YPI).
Thus far we have described protein structure as a consequence of its function. This makes intuitive sense as nature evolves based on phenotypic differences that affect fitness. Triosephosphate isomerase, as mentioned, has evolved with glutamic acid rather than the shorter aspartate because that amino acid provides the cell with a vital advantage. In this sense, natural selection is the true "designer" of protein structures. If we consider the diversity of reactions in organisms it is clear why nature has raised many distinct proteins. Further examination of this conclusion leads to many more complicated questions regarding protein structure and function. Most importantly, how has nature evolved thousands of structures and functions from only twenty building blocks?
First, the great diversity seen in proteins is driven by combinatorics. While twenty amino acids appear modest, a hypothetical protein of length n has $20^n$ possible sequences (a short protein of only 50 residues has more than $10^{65}$ solutions). Furthermore, the amino acids themselves are quite diverse in size, shape and composition. The sizes range from small (glycine) to large (tryptophan) with 18 intermediate volumes. Most of the amino acids are neutral, but several can be positively charged and others can form anions. Even within the nonpolar hydrocarbons there is a great degree of diversity in size, shape and branching. Finally, each of the amino acids hosts a handful of common rotamers and inherent flexibilities. Taken together, the twenty diverse amino acids may combine to form nearly infinite structures. The next question is how do the extended chains fold into biologically relevant conformations?
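The size of this sequence space is easy to verify numerically. The short sketch below is illustrative only; the function names are our own and are not part of any analysis in this dissertation. It computes the number of possible sequences for a chain of a given length and confirms that a 50-residue protein exceeds 10^65 possibilities.

```python
import math

AMINO_ACIDS = 20  # the twenty natural amino acids


def sequence_space(length: int) -> int:
    """Exact number of distinct sequences of the given length (20**length)."""
    return AMINO_ACIDS ** length


def log10_sequence_space(length: int) -> float:
    """Order of magnitude of the sequence space, useful for long chains."""
    return length * math.log10(AMINO_ACIDS)


if __name__ == "__main__":
    for n in (10, 50, 240):  # 240 is roughly the length of TIM
        print(f"n = {n:3d}: about 10^{log10_sequence_space(n):.1f} sequences")
    # A 50-residue protein has more than 10^65 possible sequences.
    print(sequence_space(50) > 10 ** 65)
```

Because Python integers have arbitrary precision, the exact count is available even for long chains; the logarithmic form is simply easier to read.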
1.3 The Thermodynamic Hypothesis
In the early 1950s, Christian B. Anfinsen began his fundamental studies on the model
protein, ribonuclease A.10-12 RNase A is a well-studied enzyme that catalyzes the cleavage of
RNA through general acid-base catalysis.13 Here, the 2'-OH is activated to attack the
phosphodiester linkage ultimately leading to an upstream and downstream product.
Anfinsen's lab showed that RNase unfolds in high concentrations of urea and β-mercaptoethanol. The native protein contains eight cysteine residues that form four disulfide bonds. As predicted, the reduced enzyme unfolded in urea and lost its biological activity. When the protein was dialyzed into nondenaturing buffer to promote refolding, the enzyme retained less than 1% of its original activity. The refolding experiment produced a population of semi-folded proteins with scrambled disulfide bonds. If the primary sequence contains eight cysteine residues, one expects 7 x 5 x 3 x 1 = 105 possible outcomes from random disulfide bond formation. If the native structure represents the only active conformation, 0.95% ((1/105) x 100%) is expected to be active. This agrees closely with the assayed data. When the refolding experiment was repeated with trace amounts of β-mercaptoethanol, the starting activity was recovered nearly quantitatively. The addition of reducing agent allows the protein to sample multiple disulfide bond arrangements until the thermodynamically favored result is achieved. These experiments demonstrated several phenomena that revolutionized the field of biochemistry.
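The disulfide counting argument above can be sketched in Python. The helper name `disulfide_pairings` is hypothetical, and the calculation assumes that every cysteine is paired and that all pairings are equally likely, as in the estimate in the text.

```python
def disulfide_pairings(n_cysteines: int) -> int:
    """Count the ways to pair all cysteines into disulfide bonds.

    The first cysteine can pair with any of the other (n - 1); the next
    unpaired cysteine then has (n - 3) possible partners, and so on.
    """
    if n_cysteines % 2 != 0:
        raise ValueError("an even number of cysteines is needed to pair them all")
    count = 1
    for partners in range(n_cysteines - 1, 0, -2):
        count *= partners
    return count


pairings = disulfide_pairings(8)       # 7 * 5 * 3 * 1
native_percent = 100.0 / pairings      # percent expected to be native by chance
print(pairings, f"{native_percent:.2f}%")
```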
First, Anfinsen showed that there is not a clear dichotomy between what happens within the cell (in vivo) and what occurs inside a test tube (in vitro). The RNase molecule was able to obtain its catalytically competent conformation whether the starting point was on or off the ribosome. Anfinsen later discovered and described a second enzyme, protein disulfide isomerase (PDI), that catalyzes the reshuffling of disulfide bonds within the cell.11 PDI, much like β-mercaptoethanol, rescues proteins trapped in nonnative conformations. These experiments paved the way for more rigorous analysis of proteins in vitro, where investigators can control variables that are inaccessible in many biological contexts. The lack of division between in vivo and in vitro was further demonstrated by Robert Bruce Merrifield in 1969, when his lab synthesized RNase A on solid phase support.14 Merrifield received the 1984 Nobel Prize in Chemistry for his groundbreaking synthetic protocols.
Anfinsen's work also provided some of the first insights into the thermodynamics of protein folding and structure. These discoveries are described as the Thermodynamic
Hypothesis. Here, the native structure is a unique, stable and kinetically accessible minimum of the free energy. This means that a polypeptide with given length and composition will fold into a single three dimensional structure under the environmental conditions in which folding occurs - barring conformational dynamics and allostery. In order for this to be true, the energy landscape for protein folding must resemble a rugged funnel (Fig. 3). The wide flange represents many possible conformations with similar high and unfavorable energies. As one progresses down the folding funnel the number of
possible conformations decreases, as does the free energy. The native state conformation is found at the energy minimum and generally has high activation barriers before reaching other nonnative folds. Because the native state has the lowest energy, thermodynamics dictates that this will be the true structure and hence unique. The large activation barriers between the native state and nonnative folds provide inherent stability
to the biologically relevant conformation. For protein folding to be a thermodynamic
process, the native state must be kinetically accessible and for nearly all proteins this is
true. α-Lytic protease is a rare example where the thermodynamic minimum is
kinetically inaccessible.15 The protease is synthesized with a proregion that serves as a
folding catalyst and is degraded after proper folding making unfolding essentially
irreversible (t1/2 = 1.2 years). This unique mechanism decouples the folding and unfolding events, providing escape from Anfinsen's thermodynamic descriptions of protein folding.
Figure 3: Protein Folding Energy Landscape.
The native state conformation (N) is the energy minimum, with high activation barriers to other conformations. In this figure, other "valleys" represent local energy minima. The top of the funnel is wide, representing numerous unfavorable folds. As a protein proceeds down the energy funnel, the free energy and number of conformations decrease. Image used with permission of Ken Dill.16
Anfinsen showed that the structural fold of the protein is literally dictated by physical chemistry and thermodynamics. This leads to interesting questions: 1) What are the forces that lead to protein folding and energy minimization and 2) how does Nature sample and examine these potential structures? First, proteins fold in aqueous solution as a result of large but nearly balancing forces - enthalpy (ΔH) and entropy (ΔS). From
Gibbs free energy equations one can express the protein states as: ΔH_unfolding - TΔS_unfolding
≈ ΔH_folding - TΔS_folding, or ΔG_unfolding ≈ ΔG_folding. Enthalpy consists of all the possible
interactive forces in protein folding including ionic interactions, dipoles, hydrogen bonds
and van der Waals forces. Each amino acid residue can form two hydrogen bonds using
the backbone amide and carbonyl as donor and acceptor, respectively. Polar amino acids
such as glutamine, asparagine, serine, threonine and tyrosine can form additional
hydrogen bonds using their side chains. Other amino acids including histidine, arginine,
lysine, glutamate and aspartate can form salt bridges with their ionizable side chains. A folded polypeptide will maximize the number of intramolecular hydrogen bonds and salt bridges while limiting steric clashes (surface exposed residues may interact with the environment to satisfy hydrogen bonds). Additionally, the native state will minimize cavities to achieve tight packing to maximize van der Waals interactions. The average hydrogen bond strength is ~1-4 kcal mol⁻¹ and these forces add up quickly when
examining the folded state of proteins. It would be easy to conclude that these
interactions drive protein folding and provide the "glue" for structural stability, however,
this is not entirely true. Nearly every, if not all, hydrogen bond donor and acceptor can
be equally satisfied by hydrogen bonds to water molecules. Thus the n × (1-4 kcal mol⁻¹) contributed by n hydrogen bonds is nearly the same in the folded and unfolded states. The same is true for hydrophobic packing interactions such as London dispersion forces. So what is the driving force for protein folding?
Although counterintuitive, entropy is the dominant force that drives polypeptides into their native and active conformations. An unfolded polypeptide with many degenerate states has more degrees of freedom. Each side chain has multiple accessible rotamers, which is entropically very favorable and thus stabilizes the unfolded state. The examination of crystal and NMR structures shows that tight packing interactions severely limit the accessible degrees of freedom for most residues. While this is necessary for proper function, it is entropically unfavorable. To fully understand the entropic consequences of protein folding, one must examine the water molecules surrounding the unfolded and folded peptide.
A typical organism, and therefore cell, is composed of ~70% water. Pure water has a concentration of 55.5 M, while most cellular proteins are present at micromolar (10⁻⁶ M) concentrations or lower. A protein folds by means of hydrophobic collapse, where nonpolar amino acids are buried within the protein's core and hydrophilic amino acids remain solvent exposed. This is the same phenomenon seen in micelles formed from detergent molecules. Interestingly, the hydrophobic collapse of the protein molecule ultimately leads to an increase in the system's entropy. An unfolded, linear peptide has a large
solvent accessible surface area. Submerged in an aqueous environment, water molecules
must surround the polymer in a highly organized fashion greatly reducing the system's
overall entropy.
Water surrounding large macromolecules is often described as "ice" or "solid" water due
to its lack of freedom and low entropy. As the protein folds, the solvent accessible surface
area diminishes freeing many water molecules to behave more "liquid" or "bulk-like."
For a given volume, a sphere will always have the minimum surface area, therefore, it is
no surprise that the majority of native proteins are globular. To illustrate this point,
consider a sphere and a cylinder, both of which have a volume of 100 Å³. If one describes the cylinder as a rod with height 100-fold greater than the radius, this shape has a solvent accessible surface area of ~300 Å². The sphere, however, only has a surface
area of ~100 Å² - representing a three-fold decrease over the elongated cylinder. Clearly, more ordered water is required to encase an unfolded peptide versus a globular protein.
This view is somewhat exaggerated as unfolded proteins are still approximately spherical; however, these states are larger and exclude less water. This increase in randomness, known as entropy, drives protein folding by ultimately lowering the free energy (ΔG).
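The sphere versus rod comparison above follows from elementary geometry and can be reproduced with the sketch below. The function names are hypothetical, and the rod is assumed to be a closed cylinder (side wall plus both end caps), which the text does not specify.

```python
import math


def sphere_area(volume: float) -> float:
    """Surface area of a sphere with the given volume."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 4.0 * math.pi * radius ** 2


def rod_area(volume: float, aspect: float) -> float:
    """Surface area of a closed cylinder whose height is aspect * radius."""
    # V = pi * r^2 * h = pi * aspect * r^3, solved for r
    radius = (volume / (math.pi * aspect)) ** (1.0 / 3.0)
    height = aspect * radius
    return 2.0 * math.pi * radius * height + 2.0 * math.pi * radius ** 2


volume = 100.0  # cubic angstroms, as in the text
sphere = sphere_area(volume)
rod = rod_area(volume, aspect=100.0)
print(f"sphere: {sphere:.0f} A^2, rod: {rod:.0f} A^2, ratio: {rod / sphere:.1f}")
```

Under these assumptions the ratio comes out slightly below three, consistent with the rounded values of ~300 Å² and ~100 Å² quoted above.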
Anfinsen's Thermodynamic Hypothesis states that most native proteins adopt an active structure with the lowest free energy. Does this mean that the molecule samples all conformations, evaluates the energy and selects the most stable structure? To date, the
mechanism of protein folding remains largely unknown, but Cyrus Levinthal's
computational thought experiment performed in 1969 ruled out the "sample all"
approach.17 Levinthal demonstrated that proteins must fold on a linear (or reasonably
linear) pathway where proteins sample far less than 1 % of all possible conformations.
He considered a small protein with only 100 amino acids and therefore 99 peptide bonds
with 198 phi and psi angles. He limited the calculation to 3 possible rotamers for each
bond angle yielding 3¹⁹⁸ (~3 x 10⁹⁴) different conformers. Note that here we ignore many of the possible rotations amongst single bonds within the complex structure. A single side chain rotation occurs in ~10⁻⁸ seconds, meaning it would take longer than the age of
the universe to sample all combinatorial solutions. This value is far longer than the
average millisecond (10⁻³ sec) to seconds required for protein folding. This
computational observation became known as Levinthal's Paradox. To rationalize these
impossibilities, Levinthal suggested that proteins fold on one or several trajectories in
which local assembly precedes global events. The details of this mechanism are still
poorly understood but many hypotheses are gaining traction.18-23 The pioneering work by
Anfinsen and Levinthal described many complex details in protein folding and function,
but also generated new and imperative questions.
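Levinthal's estimate described above can be reproduced numerically. The sketch below uses the numbers given in the text (100 residues, three states per angle, ~10⁻⁸ seconds per sampled conformation); the age of the universe, taken here as roughly 13.8 billion years, is an assumed value that does not come from the source.

```python
import math

residues = 100
angles = 2 * (residues - 1)                # 198 phi and psi angles
states_per_angle = 3
conformers = states_per_angle ** angles    # 3^198, an exact integer

seconds_per_sample = 1e-8                  # one rotation, from the text
AGE_OF_UNIVERSE_S = 4.35e17                # ~13.8 billion years (assumed value)

search_time_s = conformers * seconds_per_sample
print(f"conformers: about 10^{math.log10(conformers):.1f}")
print(f"exhaustive search: about 10^{math.log10(search_time_s):.1f} s")
print(f"universe lifetimes needed: about 10^{math.log10(search_time_s / AGE_OF_UNIVERSE_S):.0f}")
```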
1.4 The Sequence-Structure-Function Relationship
Christian Anfinsen's conclusions drawn from the RNase studies earned him the 1972
Nobel Prize in Chemistry. Interestingly, his lab was not the only group studying the
thermodynamic properties and folding of ribonuclease A. At the 25th Anniversary
Symposium of the Protein Society, David Eisenberg (University of California, Los
Angeles) shared the story of a second lab whose studies preceded the work by Anfinsen
and colleagues. Yale University medical student, Lisa Steiner, and her advisor Fred
Richards had performed nearly the exact experiments several years prior to Anfinsen.
The work went unpublished until Steiner's thesis, "The Reduction of the Disulfide Bonds
of Ribonuclease," was released in 1959 - after Anfinsen's Science paper.
Anfinsen's work revolutionized the field and dramatically affected the trajectory of biochemistry. The primary papers on RNase refolding and activity have each earned over
300 citations while his Nobel Lecture, "Principles that Govern the Folding of Protein
Chains," stands at 3,720 citations at the time this dissertation was written. Why was the
research of Steiner and Richards left for a medical thesis and not published in widely distributed journals?
The 1950s were an exciting time for protein science and molecular biology. It was
during this decade that Francis Crick, James Watson, and Linus Pauling were active,
joining Anfinsen in a quest to change the scientific landscape. One such discovery, first
posited by Francis Crick, was the Central Dogma of Biology, which was a further description of his Sequence Hypothesis.24 The Sequence Hypothesis states:
“…..In its simplest form it assumes that the specificity of a piece of nucleic acid is expressed solely by the sequence of its bases, and that this sequence is a simple code for the amino acid sequence of a particular protein.”
The Central Dogma states that the transfer of information from DNA to RNA and
RNA to protein is irreversible:
"This states that once “information” has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein."
We now know that the above quotation is not entirely true with the discovery of reverse transcriptase and nonribosomal peptides; nevertheless, the overall flow of information from DNA to RNA to Protein is canonical. Crick's hypotheses were first articulated in
1958, coming years after the Yale University work on ribonuclease, but concurrent with the findings of Anfinsen. Anfinsen's laboratory was now armed with all the puzzle pieces to change protein science. The mating of the Central Dogma (DNA→RNA→Protein) with the Thermodynamic Hypothesis formed the basis of the sequence-structure-function relationship. Lisa Steiner's work was simply ahead of its time.
The sequence-structure-function relationship describes the intimate flow of information from the protein's primary sequence (ultimately the DNA gene) to its biological activity.
Earlier examples chronicling the binding site of SH2 domains and the active site of triosephosphate isomerase showed conclusive evidence to support the structure-function relationship. Anfinsen's RNase experiments elucidated the sequence-structure paradigm.
In vivo, or in the case of in vitro translation, the mRNA template and ribosome catalyze the linkage of amino acid monomers. This amino acid chain will spontaneously fold into
the biologically relevant conformation. This means that the sequence of amino acids provides all the information required to describe the folded state. One should note that some (perhaps the majority of) proteins do employ chaperones to accelerate the folding process and prevent nonnative conformations in vivo. The example of RNase, and many others in the last 50 years, has shown that proteins can be unfolded and refolded in vitro without any "magic" that extends beyond the simple thermodynamics that describe the systems of supramolecular chemistry. This observation led to the birth of modern protein biochemistry.
1.5 The Protein Folding Problem
The Protein Folding Problem is most simply a question, "How does a protein's primary sequence dictate the three-dimensional structure?" Ken Dill elaborates on this question by describing the problem as three closely related puzzles25: 1) The folding code: the thermodynamic question of what balance of forces dictate the structure of a protein for a given amino acid sequence; 2) Protein structure prediction: the computational problem of predicting the native structure from sequence; and 3) The folding process: the kinetic question of how proteins fold so quickly.
The forces that dictate structure were described earlier. Prior to Anfinsen's work it was believed that the amino acid sequence predetermined secondary structure which drove tertiary contacts and hence protein folding. Many of the early views of protein folding are reviewed by Anfinsen.26 Linus Pauling, Robert Corey and Herman Branson proposed
the first description of helices within proteins and further hypothesized that proteins would form regular internal lattices. Diverse sequences of DNA have the same structure regardless of sequence, and initial speculations suggested proteins would behave similarly. This made sense since secondary structures such as α-helices are mediated by backbone hydrogen bonds, which all amino acids have regardless of side chain composition. The first crystal structure of myoglobin in 1958 dramatically changed these viewpoints, and later studies revealed that secondary structures are actually stabilized by hydrophobic collapse.27-29
As Dill points out, and crystal structures indicate, folding relies on side chains and therefore the folding code is distributed both locally and globally within primary sequences. This is why paradigmatic proteins such as lysozyme and ribonuclease have distinct structures. In summary, the folding code is globally distributed among the side chains. Many details of this puzzle remain unknown. For example, how do mutations to primary structure affect the stability (free energy)? To date, there is still no physicochemical model that can predict the effects of even a single point mutation.30
The kinetic puzzle - how do proteins fold so quickly? - is a corollary of Levinthal's paradox. Many labs have sought to understand the mechanism of protein folding. This has proven to be even more difficult than predicting structures, but promising avenues are being paved in this field. Further review of the kinetics of folding and folding pathways
is beyond the scope of this chapter. Robert Matthews (University of Massachusetts
Medical School) has provided a thorough review.31
Perhaps the greatest revolution of Anfinsen's work was the idea that protein structure could be predicted from the sequence alone since it necessarily carries the entire folding code. Furthermore, the native structure could be confirmed by comparing its computed free energy to other structures, ensuring it is the most stable (lowest free energy). This became a major, but challenging goal in computational biology.
Predicting structure is an intuitive but complicated enterprise. Imagine using an organic chemistry molecular model kit and building a small protein like hen egg white lysozyme.
The protein has roughly 130 residues and slightly more than 1,000 atoms. Given the sequence and the polymer nature of proteins, all of the covalent bonds are known (less disulfides). The builder would now bend, fold, and twist the backbone, rotate side chains and collapse the extended peptide into a globular three-dimensional structure. During the process, the architect would attempt to optimize several features: 1) Minimize the surface area to increase the entropy of hypothetical water molecules; 2) Avoid steric clashes and minimize cavities to increase van der Waals interactions; and 3) Maximize the number of hydrogen bonds and salt bridges. From a purely theoretical perspective, the above goals seem possible, but there are simply too many theoretical structures to build and evaluate. To alleviate this burden computational biologists began developing algorithms and modeling procedures to increase the throughput.
With an influx of computational programs in the late 1980s and early 1990s, many began to erroneously feel that the Protein Folding Problem had been solved. In response, computational biologist John Moult (University of Maryland) founded the Critical
Assessment of Techniques for Protein Structure Prediction (CASP) in 1994. Prior to the first meeting, Moult and colleagues gathered primary sequences for proteins with solved, but unpublished structures. These sequences were offered to the computational community who were given a month to submit predicted structures. The inaugural CASP meeting attracted 35 labs and 24 target sequences ranging from easy to difficult.32 Many labs were able to accurately predict small domains and proteins (scored as GDT_TS), but success rates fell quickly as protein sizes increased and targets became more challenging.33 The CASP meeting is held every two years and is currently preparing for its ninth gathering. Data collected from previous contests shows modest improvements between contests, but still demonstrates the need for enhanced algorithms and search strategies. In the next section we highlight several success stories in the field of computational structural biology.
1.6 Computational Structure Prediction
The Protein Folding Problem is generally framed as the question: what structure will a given sequence adopt? CASP meetings have chosen to focus on this computational avenue. At recent CASP meetings three groups routinely outperformed other labs and servers - The
Zhang Lab's I-TASSER and QUARK, Baker's Rosetta and Xu's RAPTOR
(http://predictioncenter.org/casp9/).
TASSER (Threading/ASSEmbly/Refinement) combines homology modeling and Monte
Carlo simulations to predict the tertiary structure of proteins.34-36 First, the primary sequence is threaded through a template library constructed from the Protein Data Bank
(PDB). Sequence segments that share homology to sequences in the PDB are modeled to the corresponding structures. These stretches of amino acids are used as building blocks for structure prediction. In the next step, the building blocks are held constant, and regions not selected by the threading procedure are optimized and refined by Monte
Carlo methods. I-TASSER was used to predict the structure of T0437_1 at CASP8.37
The model protein predicted by the Zhang lab exhibited an RMSD of 1.13 Å to the actual crystal structure. This method has been generalized in the latest algorithm, QUARK
(manuscript in preparation). Here, the building blocks are designed by replica-exchange rather than extraction from homologous PDB structures.
Jinbo Xu, from the University of Chicago, has developed CASP-winning software suites under the title of RAPTOR (RApid Protein Threading predictOR).38,39 Similar to I-TASSER and QUARK, RAPTOR uses protein threading in the initial stages of prediction. One problem with current threading techniques is that the problem is "NP-hard," making the matching process computationally expensive. No polynomial-time algorithm, that is, one whose running time is bounded by O(n^k) for input size n, is known for NP-hard problems, so exact searches become intractable as the input grows. RAPTOR improves these search algorithms by applying the mathematical theory of linear programming. Linear programming finds the extreme value of a linear objective function subject to linear constraints, in this case the minimal energy. These adaptations have improved the throughput and speed, making RAPTOR ideal for full automation. In fact, RAPTOR has routinely taken the top honors at CASP in the
CAFASP division (CA[fully automated]SP).
The David Baker lab (University of Washington/HHMI) has spent the past decade developing computer programs around a scaffold script known as Rosetta.40,41 The algorithm uses potential functions to compute interaction energies within and between molecules, ultimately searching for the lowest energy solutions using Monte Carlo simulated annealing. Since its inception in 1998, the code has been modified for several distinct procedures including RosettaDesign, RosettaDock, Robetta and others.42-44
Rosetta threads three and nine residue stretches through the PDB identifying homologous sequences. Like TASSER, Rosetta starts by identifying local building blocks in the PDB and optimizes the global structure around these short stretches. Although the protocols are similar to other prediction technologies, the Rosetta suite employs highly tuned potential energy functions. It distributes greater weight to compact structures, buried hydrophobic residues, and paired β-strands. Rosetta has gone beyond complete structure prediction by guiding search algorithms with biophysical and evolutionary data.45,46 In
2010, the program accurately determined the structure of larger proteins using only backbone NMR data.47
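The Monte Carlo simulated annealing search mentioned above can be illustrated with a toy example. The sketch below is not Rosetta code and does not use its energy functions; the one-dimensional landscape `toy_energy`, the cooling schedule and all parameter values are hypothetical choices made only to show the Metropolis acceptance rule on a rugged, funnel-like surface.

```python
import math
import random


def toy_energy(x: float) -> float:
    """Hypothetical rugged 1D landscape: a broad funnel plus small ripples."""
    return 0.1 * x * x + math.sin(5.0 * x)


def anneal(start: float, steps: int = 5000, t_start: float = 2.0,
           t_end: float = 0.01, seed: int = 0):
    """Return the lowest-energy point visited during a simulated annealing run."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    x, energy = start, toy_energy(start)
    best_x, best_energy = x, energy
    for i in range(steps):
        # Geometric cooling schedule from t_start down to t_end
        temperature = t_start * (t_end / t_start) ** (i / max(steps - 1, 1))
        candidate = x + rng.uniform(-0.5, 0.5)
        candidate_energy = toy_energy(candidate)
        delta = candidate_energy - energy
        # Metropolis criterion: always accept downhill moves, and accept
        # uphill moves with probability exp(-delta / temperature)
        if delta <= 0 or rng.random() < math.exp(-delta / temperature):
            x, energy = candidate, candidate_energy
            if energy < best_energy:
                best_x, best_energy = x, energy
    return best_x, best_energy


best_x, best_energy = anneal(start=8.0)
print(f"lowest energy found: {best_energy:.3f} at x = {best_x:.3f}")
```

Accepting occasional uphill moves at high temperature is what lets the search escape local minima before the temperature drops and the trajectory settles.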
Even after two decades of in silico structure prediction, the field is still learning. Despite numerous advances, the algorithms still fall short of predicting structures for some of our most biologically interesting molecules - large human proteins and membrane proteins.
The intimate pairing of biophysical data and computational algorithms is a promising avenue for improving structural predictions.
1.7 The Inverse Folding Problem
A second and related question is to ask, what sequence will adopt a given structure? This is known as the Inverse Folding Problem.48 In the genomic era, we are essentially faced with two imperative challenges. First, the number of available structures pales in comparison to the number of available sequences. This is the impetus for in silico structure prediction. Second is the design of novel sequences with desired physical features such as increased stability, solubility, or activity. This, the Inverse Folding
Problem, is highly related to the sequence-structure relationship.
A significant hurdle with designing sequences that adopt a particular fold is that the margin of error is minimal. Native proteins are generally only stabilized over the unfolded state by ~5-15 kcal mol⁻¹. This means, from a computational perspective, that missing even one hydrogen bond can result in rejected structures. Protein folding can be written as the equation: unfolded peptide → folded peptide. If we take the folded state as the product, the equilibrium constant, Keq, can be expressed as relative concentrations
(Keq = [folded peptide]/[unfolded peptide]). From the equation, ΔG = -RT ln Keq, we see that the incorrect placement of a single hydrogen bond worth ~4 kcal mol⁻¹ shifts the equilibrium nearly
1000-fold. Several labs have accepted the challenge of computationally predicting stabilities - an intuitively simple test of the Inverse Folding Problem. Here, we highlight the work of two labs, Nikolay Dokholyan's (University of North Carolina) and Luis
Serrano's (Centre de Regulació Genòmica).
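The sensitivity of the folding equilibrium to small energy errors follows directly from ΔG = -RT ln Keq. The sketch below assumes a temperature of 25 °C (298.15 K), which the text does not state, and the helper names are hypothetical.

```python
import math

R_KCAL = 1.987e-3   # gas constant in kcal mol^-1 K^-1
T = 298.15          # assumed temperature in kelvin (25 degrees C)


def equilibrium_constant(delta_g_folding: float) -> float:
    """Keq = [folded]/[unfolded] for a folding free energy in kcal/mol."""
    return math.exp(-delta_g_folding / (R_KCAL * T))


def fold_change(ddg: float) -> float:
    """Factor by which Keq shifts when the stability changes by ddg kcal/mol."""
    return math.exp(abs(ddg) / (R_KCAL * T))


for ddg in (1.0, 4.0):
    print(f"an error of {ddg:.0f} kcal/mol shifts Keq about {fold_change(ddg):.0f}-fold")
```

At this temperature an error of 1 kcal mol⁻¹ changes Keq by roughly five-fold, while an error at the upper end of the hydrogen bond range (4 kcal mol⁻¹) approaches the 1000-fold figure cited above.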
The Dokholyan lab uses computational and biophysical methods to study the physical nature of interactions within and between molecules. In 2007, they introduced a protein stability prediction server named Eris.49 Eris is a physical force field with atomic modeling that features fast side-chain packing and algorithms to relax the backbone.50
The authors point out several key advantages of Eris compared to other predictive software. First, most competitors rely on extensive training of the algorithms on known stability data. While the general consensus is that including empirical data helps calculations, it unfortunately biases the output. In the case of protein stability prediction, most libraries of stability-characterized proteins include a large fraction of amino acid to alanine mutations - also known as alanine scanning. Inclusion of these datasets tends to bias the programs towards large-to-small mutations, as this is what they encounter during training. Second, the majority of algorithms slight or completely ignore the effects of backbone motions and flexibility. Recently, Sachdev Sidhu and colleagues designed binary libraries of antibodies where binding site residues were limited to tyrosine or serine.51,52 These limited libraries were able to bind diverse substrates with nanomolar
affinities. One interpretation of this work is the importance of backbone flexibility, which added a degree of diversification to an already limited library. The Eris server, with the ability to model backbone flexibility, was tested against 595 mutations from five different proteins. The authors found significant agreement (Pearson correlation coefficient = 0.75) between the calculated and experimentally derived ΔΔGs, but the standard deviations are reported as ~2-3 kcal mol⁻¹.
The Serrano lab developed FoldX in 2005 as a web server, and recently packaged the code for use in the graphical interface, YASARA.53,54 Similar to Eris, FoldX uses a full atomic description of the structure of proteins, which includes flexible backbones. In contrast to
Eris, the weighting of atomic interactions in FoldX has been modeled using empirical data from
1,088 point mutations spanning many diverse folding scaffolds. The original FoldX predictions yielded a correlation coefficient of 0.81 with a standard deviation of 0.46 kcal mol⁻¹. In the YASARA interface, researchers are given the option of doing local or global repacking after mutations are modeled. In 2010, the effectiveness of FoldX and eleven other stability predictors was tested by Khan and Vihinen.55 In this study
Dmutant and FoldX were the best of the predictors, but the authors left the audience with a strong caution:
"In conclusion, at best, the methods predicted the changes in stability caused by mutations with only moderate accuracies. However, the number of false positives and false negatives returned by the programs was substantial. As so many factors affect protein stability, even small differences in the ΔΔG values between a wild type and its mutant can be significant. Molecular dynamics and Monte Carlo simulations provide more accurate results in general; however, characterization of mutational effects is still problematic even when these methods are used.
Additionally, the computational power demands of these two methods are prohibitively great for the analysis of large datasets.
For mutation effect investigations the tested methods have only limited applicability, and should thus be used preferably together with other prediction approaches. One way to improve the performance of predictors might be to use additional features."
Despite these warnings for predicting stabilities, many labs have pushed the Inverse
Folding Problem even further in the computational design of de novo proteins with surprising success.
The first in silico design efforts were based on Ramachandran plots and common side chain rotamers seen in the PDB.56 These angle constraints were used to mold short sequences into given backbone architectures. Initial design efforts were limited to small peptides or the redesign of hydrophobic cores. In the years that followed, further development of rotamer libraries greatly reduced computational space, and improved search algorithms such as Monte Carlo methods made larger design efforts more promising. The Steve Mayo lab (California Institute of Technology) was an early innovator in the computational design of proteins. In 1996, they developed algorithms that quantitatively considered side-chain packing, sterics, and their relationship to packing specificity.57 A year later, the same authors used these core lessons and expanded the design criteria to include the solvent exposed surface.58,59 Dead-end elimination was used to select the optimal sequence for the redesign of the ββα-motif found in zinc-finger DNA-binding domains. The designed fold successfully repacked the core in the absence of zinc.
In 2003, exciting research began to pour out of the previously mentioned Baker lab.
Using the RosettaDesign software, Kuhlman et al. attempted to engineer a protein with novel structure (i.e., a structural scaffold not known in Nature).60 The lab undertook the design of a 93 residue peptide that would harbor five antiparallel β-strands and two α-helices. An iterative approach was prescribed that cycles between sequence optimization and structure prediction. After each cycle of sequence generation, the backbone was relaxed and optimized until the outcome stabilities were comparable to native proteins.
The structure of one designed variant, Top7, was characterized in detail. Top7 was only
31 % sequence identical to the initial model sequence, but had a backbone RMSD of 1.1
Å to the starting model. The model agrees with the crystal structure at the atomic level, with an RMSD of 1.17 Å.
The Baker lab has taken these successes and applied them to the design of functional proteins in recent years in collaboration with Donald Hilvert's lab (Swiss Federal Institute of Technology - ETH Zurich). In 2008, Hilvert and Baker reported the de novo computational design of retro-aldol enzymes.61 These designer enzymes catalyzed the carbon-carbon cleavage of nonnatural substrates with multiple turnover and rate enhancements approaching four orders of magnitude. Crystal structures of the catalysts nearly superimposed on the designed models. Dan Tawfik (Weizmann Institute) collaborated with Baker to design enzymes that could transfer protons from carbon atoms via the Kemp elimination.62 These eliminases showed similar catalytic improvements as
the retro-aldol enzymes and were further optimized with rounds of in vitro evolution.63,64
In 2010, the Seattle lab furthered their design legacy by computationally engineering proteins that could catalyze the bimolecular Diels-Alder reaction.65
These experiments illustrate the power and promise of computational design, but it is important not to oversell these successes. The prediction of stabilizing mutations, the design of folds, and the computational proposal of catalytic proteins make up a young and error-prone field. Even with the most advanced algorithms, one can only expect the computer to predict stabilizing mutations with ~60 % accuracy.55 The design of novel folds and redesign of native folds is still difficult and often requires additional rational or irrational design. Top7 is the tip of an arrow - for every success there are hundreds of failures. At symposia, David Baker is famous for leaving the audience with the following take home message, "computational design of proteins does not work." Even the famously designed catalysts described here and elsewhere fail to rival the efficiencies, speed and fidelity of native enzymes, which are up to three orders of magnitude faster than those currently described in the literature. In addition, these computational experiments are highly specialized and generally require years of training in mathematics and computational science. In the coming years these technologies will likely improve and become more user friendly.
As a final caveat to the use of computational programming, we analyze the reports of
Regan, Elber and Bryan. Many computational algorithms rely on sequence homology,
either by alignment or protein threading through the PDB. The work presented here provides a succinct warning (even if anomalous) against such practices. In the early
1990s, George Rose and Trevor Creamer formulated the Paracelsus challenge: convert a globular protein into a second fold without altering more than 50 % of its sequence. The confident pair guaranteed $1,000 to the first successful alchemist. The Lynne Regan lab
(Yale University) started with the predominantly β-sheet Protein G (the B1 domain of
Streptococcal IgG-binding protein). Regan's lab mutated ~40 % of the amino acids based on secondary structure propensities, energy minimizations, visual modeling and intuition.66,67 The end result, Janus, was a four-helix bundle protein similar to the structure of a single chain Rop variant (Repressor of primer). Regan accepted the award at a Johns Hopkins ceremony in 1997 and donated the prize money to charity.
Inspired by this work, Ron Elber (University of Texas) began a computational project describing networks of sequence flow between protein structures.68,69 Elber noticed that the sequence-structure relationship is highly asymmetric as many sequences fold into relatively few structures. He postulated that given two sequences of two different folds, mutations could be made serially in a Markov chain leading one structure to another.
Many of these mutations would retain the native fold, but as these mutations amass there would eventually be a single missense mutation that flipped structures. Continued mutations towards sequence B would retain the flipped fold. Elber's lab has evaluated many such Markov chains in the PDB and has found many evolutionarily interesting
sequence networks. The Magliery lab is collaborating to empirically validate these chains.
The Elber lab has postulated the existence of a solution to the Paracelsus challenge with more extreme criteria (a single amino acid mutation versus 50% sequence divergence).
His lab has chosen to pursue native protein chains, as his interests lie primarily with the mechanisms of protein evolution. Stepping outside these criteria, the Philip Bryan lab
(University of Maryland) has furthered the work of the Regan lab using in vitro evolution. Alexander et al. began with the IgG binding domains of streptococcal protein G and staphylococcal protein A. These two domains are similar in size, but have drastically different folds. Phage display was employed to drive the generation of similar sequence variants with different folds and binding functions.70 This method produced two heteromorphs with 59 % sequence identity. The tertiary structures of these molecules were later confirmed with high resolution NMR structures.71 In a third publication, the authors narrowed down which residues contributed to the folding code for each structure. After rounds of mutations at nonidentical and nonepitope binding sites, the authors were able to produce two functional variants, with different folds, that shared 88% sequence identity.72
Their original estimates suggested that only 10% of the residues within the small protein contained the folding code. The resultant NMR structures guided further rounds of mutagenesis that led to 95% sequence identity.73,74 Finally, the authors discovered a single amino acid mutation that completes the Markov chain from GA to GB - L/Y at
position 45.75 This work further supports Ron Elber's hypothesis, but much work is
still required to prove these Markov chains exist in Nature.
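The sequence identities and serial mutation chains discussed above can be made concrete with a minimal sketch. It assumes two pre-aligned sequences of equal length; the toy sequences and the function names (`percent_identity`, `single_substitution_path`) are hypothetical illustrations, not data or code from the studies cited. The left-to-right order of substitutions is arbitrary, and the sketch says nothing about which fold each intermediate would adopt.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percent of aligned positions carrying the same residue."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


def single_substitution_path(start: str, end: str) -> list[str]:
    """Walk from start to end, changing one differing position per step.

    Positions are converted left to right; this ordering is an assumption
    made for illustration only.
    """
    if len(start) != len(end):
        raise ValueError("sequences must be pre-aligned to equal length")
    current = list(start)
    path = [start]
    for i, target in enumerate(end):
        if current[i] != target:
            current[i] = target
            path.append("".join(current))
    return path


# Hypothetical toy sequences (not real protein domains).
seq_a = "MKVLAT"
seq_b = "MRVIAS"

for step in single_substitution_path(seq_a, seq_b):
    print(step, f"{percent_identity(step, seq_b):.0f}% identical to target")
```

Each step in the printed chain differs from its neighbor by one residue, so identity to the target rises monotonically. In the experiments described above, the question of interest is at which step the fold switches, which this sketch does not attempt to predict.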
In conclusion, the computational design of proteins has sparked a new era of protein
science that has been interesting and useful. However, it seems that for every success
story there are countless failures and exceptions setting the stage for the next generation
of improved algorithms and protocols. In the next section we examine the counter
discipline - empirical data collection, its successes and challenges.
1.8 Empirical Protein Science
The most reliable data comes from direct experimentation. Unfortunately, this path is
generally the slowest and most expensive. The structure-function relationship began with
chemical modification of side chains. Ribonuclease's activity and stability have been
challenged by modifying arginines, carboxylic acids, tryptophans and the alcohols.76-79
The general acid-base mechanism of RNase was elucidated from these studies, together with pH profiling and crystal structures. Similar approaches using radiation have also been used to modify side chains. These technologies, however, are not specific and are limited to certain functional groups.
In the early 1970s methods began that altered the DNA sequences ultimately leading to missense mutations in proteins. Now, amino acids could be substituted at specific sites allowing the investigators to construct sequence-structure-function relationships. The
earliest technologies used restriction endonucleases to fragment phage genomes. The
fragments were annealed to new, intact phage. Some of the ΦX174 progeny phage had genetic markers derived from the fragmented genomes.80 In 1978, the method was
further generalized by the addition of a single-stranded DNA primer and the Klenow
fragment of E. coli DNA polymerase I.81 This method granted more rigorous control of
the mutations, but was still limited to phage genomes, although the authors suggest
similar applications with plasmids.
This method was optimized to modern day form using Kunkel and QuikChange
mutagenesis. In these updated protocols, the background (wild-type) progeny are
removed by digestion of the original template. In Kunkel mutagenesis this is accomplished by producing the template DNA in E. coli strains that are dut- ung-, resulting in uracil-containing DNA.82 After primer addition and replication, the template strand
is removed post-transformation by the host's native uracil-DNA glycosylase. In QuikChange methods
(Stratagene), the template DNA is degraded by DpnI - a restriction endonuclease that
cleaves methylated DNA at GATC sites. The mutagenic strand of DNA is synthesized in vitro and
therefore lacks adenine methylation, allowing for selection of mutants. Altered DNA
can also be generated by cassette PCR using restriction endonucleases and ligase.83
These methods greatly improved our ability to generate and examine mutant proteins.
Some of the earlier empirical work on protein sequence, structure and function was
performed in the laboratory of Jeffrey Miller (University of California, Los Angeles).
The Miller lab employed nucleotide mutagens like 2-aminopurine to probe the activity and quaternary structure of lac repressor.84,85 Later, Miller amassed nearly 4,000 single mutations at 328 total positions using nonsense suppression methods.86,87 From this collection, Miller determined that most sites within lac repressor are tolerant to substitution, but some sites required conservative mutations. Exceptions included the
DNA binding site and the inducer binding site, where substitutions ablated activity by disrupting necessary intermolecular contacts. Furthermore, they discovered that positions within the hydrophobic core were considerably more intolerant to mutation.
The observation of intolerance led Michael Hecht and Robert Sauer (Princeton University and Massachusetts Institute of Technology, respectively) to survey the relative mutabilities of surface and core positions of λ repressor.88 The lab generated 52 single mutants and assayed each for function. The low activity variants fell into two categories.
First, there was a series of surface mutations that colocalized on the solved crystal structure - these mutants directly impaired DNA binding.89 A second class was distributed randomly throughout the hydrophobic core of the repressor. Seven surface positions and one core position were biophysically characterized in detail. The authors concluded that completely or partially buried mutations exert functional effects by altering stability and/or foldedness.
Brian Matthews, a crystallographer at the University of Oregon, has studied mutational effects in T4 lysozyme. In a radical study, Matthews challenged the hydrophobic core
with methionine mutations at positions with native leucines, isoleucines, valines and phenylalanines.90 All of these substitutions were destabilizing, but the native fold could be reproduced with ten simultaneous mutations to methionine. The Matthews lab and others have made substantially more mutations over the years trying to elucidate the determinants between sequence, structure and stability.91,92
The above handful of examples represents only a snapshot of the single mutants that have been constructed and analyzed in the past 30 years. Many labs have performed similar experiments in a variety of host structures. Unfortunately, we are still far from understanding the details of the sequence-structure-function relationship despite direct probes into this problem. Miller was able to assay nearly 4,000 mutants of the E. coli lac repressor - roughly two-thirds of the 6,270 (330 x 19) possible individual mutants. In the crystal structures of proteins it is clear that amino acids have to be physically and chemically compatible with neighboring residues. As such, to truly understand the Protein Folding
Problem, one needs to study more than one site simultaneously. This dramatically affects sequence space and opens the door for combinatorial biochemistry.
Combinatorial biochemistry involves the design, creation and analysis of libraries of proteins. Currently, the most straight-forward approach to creating libraries starts at the gene level. Researchers order oligonucleotides with randomized nucleotide stretches.
For example, if one wishes to test all possible codons at position x, the cloning would proceed with a primer with codon NNN at x (N=A/T/G/C). The NNK (K=T/G) codon
still grants all twenty amino acids, but eliminates two of the three stop codons and some of the bias from amino acid codon degeneracy. Furthermore, one could analyze just the hydrophobes and alcohols by replacing the wild-type codon with DYV (D=A/G/T, Y=C/T, V=A/C/G).
The biased mixing of codons can be used to create many distinct, but related libraries due to the evolution of the genetic code.
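The degenerate codons described above can be checked by direct enumeration. The following is a minimal sketch assuming the standard genetic code and the IUPAC base definitions given in the text; the helper names `expand` and `summarize` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from itertools import product

# IUPAC nucleotide codes used in the text (N, K, D, Y, V) plus the plain bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "K": "GT", "D": "AGT", "Y": "CT", "V": "ACG",
}

# Standard genetic code, written in the conventional TCAG ordering.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def expand(degenerate_codon: str) -> list[str]:
    """List every concrete codon encoded by a degenerate codon."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def summarize(degenerate_codon: str) -> dict:
    """Count codons, encoded amino acids and stop codons."""
    translated = Counter(CODON_TABLE[c] for c in expand(degenerate_codon))
    n_stops = translated.pop("*", 0)
    return {
        "codons": sum(translated.values()) + n_stops,
        "amino_acids": "".join(sorted(translated)),
        "stops": n_stops,
    }


for codon in ("NNN", "NNK", "DYV"):
    print(codon, summarize(codon))
```

Under these assumptions, NNN spans 64 codons with three stops, NNK spans 32 codons that still encode all twenty amino acids with a single remaining stop (TAG), and DYV spans 18 codons encoding only A, F, I, L, M, S, T and V.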
Figure 4: Transformation Efficiencies.
The library size is shown as a function of the number of positions to be randomized to all twenty amino acids. The library size increases by more than an order of magnitude for each added position. Using logarithmic axes, we show standard transformation efficiencies for bacteria and yeast.
It is called combinatorial biochemistry because the addition of library positions expands
the sequence space combinatorially. Randomizing one position with an NNK codon
creates twenty library members (at the amino acid level). Randomizing two positions
creates a library size of 400, four yields 160,000, etc. (Fig. 4). In theory, one can
randomize as many positions as desired, but the number of variants that can be studied is
limited by several factors. One limitation is the transformation efficiency, which
describes the efficiency with which cells take up extracellular DNA. Electrocompetent
transformations of E. coli routinely yield 10⁸ clones.93 Many screens and proteins require
the further complexity of eukaryotic cells, of which yeast are the easiest to handle.
Competent cell preparations of S. cerevisiae generally cover smaller libraries nearing 10⁶ variants. This means that it is very difficult to cover libraries with more than 4 and 7 randomized positions in yeast and bacteria, respectively.
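The combinatorial growth of library size can be tabulated against transformation capacity in a few lines. This is a rough sketch that counts variants at the amino acid level (20 per position) and treats the transformation yields quoted above as hard ceilings; the function names are hypothetical, and real experiments typically require several-fold oversampling for confident coverage.

```python
def library_size(n_positions: int, alphabet: int = 20) -> int:
    """Number of distinct protein variants for fully randomized positions."""
    return alphabet ** n_positions


def max_covered_positions(n_transformants: int, alphabet: int = 20) -> int:
    """Largest number of positions whose full library fits in the transformants."""
    n = 0
    while alphabet ** (n + 1) <= n_transformants:
        n += 1
    return n


# Assumed round-number capacities taken from the text.
YEAST_CLONES = 10**6
ECOLI_CLONES = 10**8

for n in range(1, 8):
    print(f"{n} positions: {library_size(n):,} variants")

print("yeast:", max_covered_positions(YEAST_CLONES), "positions")
print("E. coli:", max_covered_positions(ECOLI_CLONES), "positions")
```

With these round numbers the sketch returns four positions for yeast and six for bacteria. A seventh position (about 1.3 x 10⁹ variants) would need roughly ten-fold more transformants, so the cutoffs are best read as order-of-magnitude guides rather than sharp limits.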
As discussed above, cloning is a significant limitation to library sizes, but it is not the only constraint. Combinatorial experiments are also limited by the data collection method. Genetic screens that rely on life or death are the highest throughput - here one is simply limited by the number of cells one wishes to plate. Screens that rely on fluorescent or chromophoric reporters are lower in throughput, although the application of Fluorescence-Activated Cell Sorting increases the feasibility of screening large numbers.94,95 The methods described above are both in vivo techniques for screening
protein function. If one wishes to analyze the folding, stability, or activity, lower
throughput methods are required. Many of these are detailed in the second chapter.
One of the earliest applications of PCR-generated libraries was to examine the core
of barnase in the Alan Fersht Lab (University of Cambridge). In this study, twelve of the thirteen hydrophobic core residues were randomized, or subrandomized to a set of
hydrophobic amino acids (VILMF).96 Note here, that it is impossible to include all hydrophobic amino acids, such as glycine and alanine without adding the alcohols.
Barnase, a bacterial ribonuclease, is lethal to the cell when synthesized without its binding inhibitor, Barstar. The library is first cloned into a nonsuppressing E. coli strain, with two
serine codons in the gene replaced by stop codons, thus leading to early truncation and
inactivation of Barnase. In theory, all library members complement growth in this strain.
In a second cloning step, the library is expressed in a suppressing bacterial strain where
the nonnative stop codons are read as serines. Under this scenario, library members that
form colonies must have inactive, or severely diminished (< 0.2 %) activity. Fersht
found that ~ 23 % of the library retained activity, including several variants that lack any
wild-type amino acids at the twelve core positions. These results led the authors to
conclude that hydrophobicity is sufficient to construct functional cores. They also
suggest that novel proteins should be designable from primitive cores.
Further examples of combinatorial libraries have been reported leading to the description
of two models for protein core packing, the oil droplet and jigsaw models.97 The oil droplet model suggests that the core simply requires greasy residues with approximate volumes and low packing specificities.98 This would imply that mutation to other
hydrophobic residues would have little effect on protein structure and function, which is
somewhat supported by the Barnase experiment. Other examples suggest a higher degree of specificity in the packing of hydrophobic cores, much like a jigsaw puzzle.56
Which of the models is correct?
The likely answer is somewhere in between. Ken Dill proposes a model that he refers to as "nuts and bolts in a jar."97 In the jigsaw model, packing resembles a series of lock
and key fits requiring specific size and shape complementation. The side chains are
frozen and only gain rotational freedom once the protein is fully denatured. The nuts
and bolts model has no shape complementarity and no critical separation. The nuts and
bolts have rotational freedom at near maximal packing density. Work within the
Magliery lab supports the nuts and bolts model. Lavinder et al. repacked the two central layers of Rop with hydrophobic amino acids (manuscript in preparation). Active variants were found with core volumes ranging from 230-440 Å³, lending initial credence to the
oil droplet model. The active variants were then assayed for relative stability using High-Throughput
Thermal Scanning.99 These results displayed a surprising trend in which
overpacked variants generally yielded molten globular proteins, and stable variants had core volumes approximately equal to native Rop (320 Å³) (Fig. 5). This suggests there
are more stringent packing guidelines than simple hydrophobicity. It was also discovered
that stable variants were found at all core volumes, suggesting leniency to the jigsaw
model. The best way to reconcile these observations is to describe the packing as the
"nuts and bolts" model posited by Dill.
Figure 5: Redesign of Rop.
The central two layers of Rop were repacked with hydrophobic amino acids. Here, the stability and library fraction are plotted as a function of core volumes. Native Rop has a core volume of 320 Å³. Image from Lavinder thesis.
Combinatorial biophysics is a powerful tool; however, there are still several challenges
that prevent its application towards solving the Protein Folding Problem. First, the
design and construction of large libraries is not trivial. As shown in Figure 4, our
current molecular biology tools are limited to the construction of ~10¹⁰ variants. In vitro
methods may expand these numbers further, but present new challenges. Even if one
could generate larger libraries, the field is still limited by screening and analysis methods.
This is a particular problem when the "hit" rates ([active variants/total variants] x 100 %)
are not within a reasonable range. Excessively large hit rates require further winnowing
before lower-throughput technologies can be applied for detailed characterization. Excessively low hit rates may require too many resources to amass a meaningful dataset, which may or may not be
statistically significant for detailed characterization. Finally, many of these methods are
expensive and time consuming. In the next section, we examine a new field that mates
bioinformatics and genomic sequencing to examine the Protein Folding Problem.
1.9 The Genomic Era and Sequence Statistics
Many cite April 14th, 2003 as the precise date of scission between the pre- and post-
Genomic Eras.100 This is the date the Human Genome Project announced the completion of their final milestone: the complete sequencing of the human genome. I would argue
that the Genomic Era actually began a few years prior as DNA sequencing technologies
were improving, thus enabling the Human Genome Project.
The sequencing of DNA was a major hurdle in biochemistry due to its size. Techniques
for sequencing both protein and RNA preceded it, and many of these protocols were
applied to DNA with limited success.101,102 Frederick Sanger won the 1958 Nobel Prize in Chemistry for protein sequencing and soon turned his interests to nucleic acids. He
began with a method called "plus-minus" sequencing where up to 80 base pairs could be
determined in a single experiment.103 Sanger and colleagues were able to determine the
5,386 nucleotide genome of ΦX174 phage with this technique.104 The method was
unfortunately slow, error prone and limited to single stranded DNA. In 1976, Allan
Maxam and Walter Gilbert improved the throughput and fidelity of DNA sequencing
using reagents that selectively modify and cleave nucleotide bases.105 The cleavage
reactions were run in parallel lanes on an electrophoresis gel, where the sequence could be
read sequentially. This marked a significant improvement over partial digestion
methods and the "plus-minus" technique, but was quickly replaced by the next iteration
of Sanger's methods.
Sanger turned to the use of chain terminating dideoxynucleotides (ddNTPs). Here, the
radiolabeled terminators lack the 3' hydroxyl needed for nucleophilic attack and chain
elongation.106 The sequencing reaction produces a ladder of products that can be
separated by gel electrophoresis. In this protocol, four reactions are run in parallel - one
for each nucleotide. This method was limited by the resolution of polyacrylamide gels to ~100
nucleotides. During the Human Genome Project the technique was enhanced by
replacing the gel slabs with capillary electrophoresis and the radiolabeled ddNTPs with
dye-modified bases.107 These improvements yield ~1,000 base pair reads and eliminate
the necessity of parallel reactions since different fluorophores distinguish the bases. The
application of BigDye ddNTPs and capillary electrophoresis dramatically accelerated the
rates of sequencing. Gilbert and Sanger shared the 1980 Nobel Prize in Chemistry for
their advancements. Parallel improvements in molecular biology methods such as
shotgun cloning, the use of bacterial artificial chromosomes and bioinformatics enabled
the complete assembly of the human genome in 2003. There are currently more than 200
organisms with complete published genomes (http://www.genomenewsnetwork.org) and
there are vastly more partial sequences available (Pfam). The velocity of genome
sequencing will continue to accelerate with further advances in sequencing technologies,
most notably, Roche's 454 and Illumina.108,109
The advances in sequencing technologies have given birth to a new field - protein design
from sequence statistics. The design of proteins is important for two reasons. First, it
allows for the exploitation of certain features, such as enhanced activity or stability.
Second, the ability to design proteins is a rigorous test of our first principles that govern
protein science. Many fundamental properties of proteins have been deciphered through
reverse engineering efforts. As described earlier, the computational design and redesign
of proteins suffers from calculation errors and sampling. Empirical experiments, single
or combinatorial, suffer from sampling throughput. Furthermore, many variants in a
combinatorial experiment are inactive and provide little information. In contrast, the
genomic design and study of proteins provides solutions to many of these drawbacks.
The quintessential tool of genomic design is the multiple sequence alignment (Fig. 6).
Here, one makes use of genomic sequencing efforts to compare and contrast protein sequences since they can be directly inferred from DNA using codon tables. Datasets of interest can be downloaded from several servers including the Protein Families Database
(Pfam), the Protein Data Bank (PDB), UniProt/Swiss-Prot, GenBank, Kabat etc. The work presented in this dissertation utilizes the Pfam dataset, which is a repository of protein sequences organized by family and aligned by hidden Markov models.110 The
Pfam database strives to provide accuracy and complete coverage, which are often competing factors.111 The database curators manage two alignments for each family.
The first is a small, high quality seed alignment that may not change between updates.
The seed alignments are hand curated, annotated and confirmed as true members.
Triosephosphate isomerase (TIM) has a seed alignment of 58 sequences in the current version, 25.0. While the seed alignment provides the highest quality data, it severely under-samples the number of known protein sequences for any given family.
Note that it would be impossible to hand curate the entire Pfam database, which contains over 12,000 families. To address this concern, and automate sequence identification,
Pfam generates a hidden Markov model-based profile (HMMER) derived solely from the seed alignment. This HMMER profile is used to search the existing database, locate, and align additional members. In the case of triosephosphate isomerase, the HMMER profile identifies 3,894 TIM sequences from 2,946 species. The division of seed/full sequence libraries is the main novelty of Pfam's approach.
Figure 6: Example Multiple Sequence Alignment.
A multiple sequence alignment aligns the protein sequences of multiple organisms. These alignments provide information regarding the properties of proteins. Here we show two sites, red and blue, that are highly and weakly conserved, respectively. In purple, two positions are shown which are statistically correlated. Positively charged amino acids such as arginine (R) and lysine (K) form salt bridges with the negatively charged residues aspartate (D) and glutamate (E) at a neighboring position.
What can we learn from multiple sequence alignments (MSAs)? MSAs provide a wealth
of data that, correctly harnessed, serves many applications. Similarity between sequences of known and unknown inputs may reflect evolution from a common ancestor and therefore similar functions and/or structures. As genomic efforts continue to produce unannotated sequences, BLAST and FASTA searches are often employed to provide clues to guide further experimentation. MSAs are used to construct phylogenetic trees from clustering algorithms that trace the trajectory of mutations based on evolutionary likelihoods. As described in the computational section, multiple sequence alignments can predict secondary, tertiary and even quaternary structures - if homologs have known folds. MSAs can also be used to identify functionally important positions such as catalytic or binding site residues. For example, MSAs may identify the catalytic triad of a putative serine protease (serine-aspartate-histidine). Casari et al. were among the first
to predict functional residues from MSAs using model cyclins.112 The state of the art has
continually improved since this study.113,114 Identification of catalytic sites coupled with
high sequence identity can be used to ascribe function to unknown sequences with
reasonable accuracy. However, divergent evolution can lead to proteins
with similar sequences and structures but different functions. The analysis of MSAs can
also elucidate the folding code. Previous examples have demonstrated that the folding
code is necessarily contained within the primary sequence. It may be easier to decipher
the code in the context of many aligned and related sequences. How has nature evolved
through sequence space while maintaining function?
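To make the idea of statistically correlated positions concrete, the sketch below scores a pair of alignment columns by mutual information. Mutual information is one common choice and is used here as an assumption; it is not necessarily the statistic used later in this work. The four-sequence toy alignment and the function name are hypothetical, and real analyses must also correct for phylogenetic bias and small sample sizes.

```python
from collections import Counter
from math import log2


def mutual_information(msa: list[str], i: int, j: int) -> float:
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    counts_i = Counter(seq[i] for seq in msa)
    counts_j = Counter(seq[j] for seq in msa)
    counts_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), n_ab in counts_ij.items():
        p_ab = n_ab / n
        p_a = counts_i[a] / n
        p_b = counts_j[b] / n
        mi += p_ab * log2(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy alignment: column 0 is invariant, column 1 (K/R) always
# pairs with column 3 (D/E), and column 2 (L/V) varies independently of column 1.
toy_msa = ["AKLD", "ARLE", "AKVD", "ARVE"]

print("invariant vs variable :", mutual_information(toy_msa, 0, 1))
print("independent columns   :", mutual_information(toy_msa, 1, 2))
print("co-varying columns    :", mutual_information(toy_msa, 1, 3))
```

An invariant column carries no covariation signal, however functionally important it may be, which is one reason conservation and correlation are treated as separate sources of information.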
Multiple sequence alignments mate the best of computation and empirical resources.
Unlike purely computational data, these sequences have been validated for folding,
stability and activity. If they were not, they would not persist in nature. They are similar to
the output of a combinatorial experiment, but less laborious. In the limit, all the protein
properties are encoded in MSAs, and they represent many more mutations leading to stable,
active proteins than could be created and characterized in the laboratory. For example,
consider the screening of binding peptides to SH2 domains of interest (Fig. 7).115 Here, the Dehua Pei lab (The Ohio State University) profiled the recognition specificities of several SH2 domains against libraries of pY-containing peptides. Identified binding partners are assembled into multiple sequence alignments. The following image shows the alignment as amino acid percentages. Looking at one such example, the Grb2 SH2
domain shows little preference at the -2 position, but requires asparagine at the +2
position.
Figure 7: Binding Specificities of SH2 Domains.
Three SH2 domains are shown with distinct binding specificities. Here the phosphorylated tyrosine (pY) is located at the 0 position. A library with amino acids from -2 to +3 was screened and analyzed. Figure from Wavreille et al., reprinted with permission from Dehua Pei.
In theory, a genomically derived MSA of peptide sequences that bind Grb2 may reveal
the same conclusions - a strong preference for asparagine at the +2 position and little to no bias at the -2 and -1 positions. In the combinatorial experiment the library is designed,
constructed, screened and analyzed. Analyzing genomic data simplifies this process as
the first three steps - generally the hardest three steps - are performed by Nature. The
libraries are generated by random mutation, drift, recombination and gene transfer. The
DNA sequences and proteins are produced natively by the organisms and Natural
Selection provides the screening power. Here, only variants that confer fitness populate
the alignment. The laboratory SH2 example was limited to 20⁵ ≈ 3 x 10⁶ peptides. The long evolutionary history suggests that Nature has sampled significantly greater sequence space.
The post genomic era provides a novel approach to understand the relationship between sequence and biophysical properties. Sequences of protein families can be studied to deduce how information is "printed" within primary structures. This may be considered a bioinformatics formulation of the Inverse Folding Problem for the post-genomic era.48,116
Furthermore, the application of sequence statistics may be used to improve the stability of native proteins (Chapter 2).
1.9.1 Conservation Statistics
The most intuitive analysis of multiple sequence alignments is the observation of conservation. If there is zero bias in the sequence determinants at a given position one expects all amino acids to be present roughly equally (100% / 20 aa = 5 % each). This value may be adjusted for codon usage since the amino acids do not equally populate genomes (e.g., leucine codons populate the S. cerevisiae genome at 10%, but tryptophan
is seen at only 1%). Alternatively, one could use global mean propensities to normalize
codon usage by correcting for the number of codons (e.g., leucine has six codons versus
one for tryptophan). The differences between these three treatments are marginal when
calculating the consensus amino acid at highly conserved sites, but may be noticeable at nonconserved positions. For example, a consensus TPR is unfolded, but
the global mean propensity TPR is folded and stable (Regan, unpublished). As noted
above, the conservation at different sites can be used to identify key residues, which in
turn may help annotate unknown proteins. How can this information be used to design
proteins of interest and further our understanding of the Protein Folding Problem? Here
we discuss the engineering of proteins using consensus-guided mutations. The following
discussion presents two formulations of consensus design. First is the literal selection of mutations based on multiple sequence alignments. Second are related methods that employ additional rational and irrational design considerations.
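The literal form of consensus design, taking the most common amino acid at each column, can be sketched as follows. This is a minimal illustration that assumes gaps are written as "-" and ignored, that ties are broken alphabetically, and that the unbiased expectation is the uniform 5 % per amino acid described above (no codon-usage correction). The toy alignment and function names are hypothetical.

```python
from collections import Counter


def column_frequencies(msa: list[str], i: int, gap: str = "-") -> dict[str, float]:
    """Fractional amino acid frequencies at column i, ignoring gaps."""
    residues = [seq[i] for seq in msa if seq[i] != gap]
    counts = Counter(residues)
    return {aa: n / len(residues) for aa, n in counts.items()}


def consensus(msa: list[str], gap: str = "-") -> str:
    """Most common residue at each column; ties resolved alphabetically."""
    result = []
    for i in range(len(msa[0])):
        freqs = column_frequencies(msa, i, gap)
        if not freqs:  # column made up entirely of gaps
            result.append(gap)
            continue
        result.append(max(sorted(freqs), key=freqs.get))
    return "".join(result)


def enrichment_over_uniform(frequency: float, n_amino_acids: int = 20) -> float:
    """Fold enrichment relative to the unbiased expectation of 1/20."""
    return frequency * n_amino_acids


# Hypothetical toy alignment (not TIM sequences).
toy_msa = ["MKVLA", "MKILA", "MRVLS", "MKV-A"]

print("consensus:", consensus(toy_msa))
k_freq = column_frequencies(toy_msa, 1)["K"]
print("K at column 1:", k_freq, "->", enrichment_over_uniform(k_freq), "fold over uniform")
```

Replacing the uniform expectation with genome-wide or codon-count-corrected background frequencies would give the alternative normalizations mentioned above; the choice matters most at weakly conserved columns.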
Boris Steipe and Andreas Pluckthun (University of Zurich) were among the first to engineer proteins from sequence statistics. Antibodies are multi-chained proteins that tightly bind antigens within the immune system. These molecules bind a nearly unlimited range of targets through a mechanism of recombination and somatic hypermutation at complementarity-determining regions (CDRs) of the heavy and light chains. The heavy chain has ~300 variable and 4 joining domains genetically encoded, while the light chain has 100-1000 variable, 4 joining and 12 diversity domain copies. These domains generate combinatorial solutions during B-cell differentiation. Upon antigen binding and
proliferation, further rounds of recombination and mutation lead to an estimated diversity
of >10^15 - orders of magnitude more than the ~20,000 human genes. Antigen binding mutations
often carry significant penalties to domain stability, complicating engineering efforts. The
immunoglobulin fold is therefore an attractive candidate for stability enhancement.
Steipe et al. addressed the problem of domain stability by analyzing the Kabat immunoglobulin sequence database (www.kabatdatabase.com).117 The authors
hypothesized that: 1) The repertoire of sequences represents a canonical ensemble of
sequences compatible with antibody function; 2) The stability of an average member is
marginal; and 3) Sequence mutations that affect stability are in equilibrium. This is
because under physiological conditions there is no selection for stability beyond a certain
threshold, meaning that destabilizing mutations are selectively neutral assuming the
domain stability remains above a certain threshold. Likewise, there is no selection
pressure to produce hyperstable domains. Statistical mechanics allow us to quantify the
amino acid populations at each site as free energies that quantitatively describe selection
at that position. Therefore, mutations to consensus amino acids should only select for
domain stabilization (barring biases from database construction). To test these
hypotheses, Steipe predicted ten stability-enhancing consensus mutants to the Vκ domain of anti-phosphorylcholine antibody, McPC603. The ten single mutations were constructed via site-directed mutagenesis and assayed for stability by chemical denaturation. Six of the ten mutants were stabilized, three were neutral and only one mutation affected the stability negatively. Random mutations would have performed significantly worse, perhaps yielding zero stabilized variants.
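The relationship between column frequencies and statistical free energies can be sketched in a few lines of Python. This is a minimal illustration and not Steipe's published procedure: the toy alignment, the function names and the value RT = 0.593 kcal mol-1 (roughly 298 K) are assumptions made for the example, and the Boltzmann-like form ΔΔG = -RT ln(f_consensus / f_wild-type) is only the simplest reading of the third hypothesis.

```python
import math
from collections import Counter

RT_KCAL = 0.593  # assumed value of RT in kcal/mol near 298 K


def column_frequencies(alignment, position):
    """Return {amino acid: frequency} for one column, ignoring gap characters."""
    residues = [seq[position] for seq in alignment if seq[position] != "-"]
    counts = Counter(residues)
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def most_common_residue(freqs):
    """Most frequent amino acid; ties are broken alphabetically."""
    return max(sorted(freqs), key=lambda aa: freqs[aa])


def consensus_sequence(alignment):
    """Consensus of an alignment whose columns each contain at least one residue."""
    length = len(alignment[0])
    return "".join(
        most_common_residue(column_frequencies(alignment, pos)) for pos in range(length)
    )


def predicted_ddg(alignment, position, wild_type_aa, rt=RT_KCAL):
    """Statistical ddG (kcal/mol) for mutating wild_type_aa to the consensus residue.

    Negative values predict stabilization under the Boltzmann-like assumption.
    """
    freqs = column_frequencies(alignment, position)
    if wild_type_aa not in freqs:
        raise ValueError("wild-type residue is not observed in this column")
    consensus_aa = most_common_residue(freqs)
    ddg = -rt * math.log(freqs[consensus_aa] / freqs[wild_type_aa])
    return consensus_aa, ddg


# Hypothetical four-sequence alignment, for illustration only.
toy_alignment = ["MKVLA", "MKILA", "MRVLG", "MKVLA"]
```

For the toy alignment the consensus is MKVLA, and an Ile to Val change at the third position is predicted to be stabilizing by RT ln 3, roughly 0.65 kcal mol-1. A real alignment would need sequence weighting and pseudocounts, which are omitted here.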
Steipe continued to statistically design antibodies in his independent career at the
University of Toronto. His lab made panels of consensus mutations to the VL domain of
murine intrabodies and characterized the variants for expression, solubility and stability.
Steipe saw excellent correlations between stability and yield within the VL domains. The
individually stable mutants were combined in one molecule resulting in an antibody with
ΔGF = -34.3 kJ mol-1 compared to -13.5 kJ mol-1 for wild-type.118 The combining loops
from the esterolytic antibody, 17E8, were grafted onto the host architecture of Steipe's
dramatically stabilized VL.119 This hybrid enzyme was as active as 17E8 when expressed
and purified from E. coli, without any additional engineering. These lessons were further
tested in the consensus design of VH domains. The heavy domains are generally less
stable and more prone to aggregation. Wirtz and Steipe predicted and validated six stable
consensus mutations to the heavy chain. When combined into a single domain the TM increased by 6 °C and the D50 in urea increased by 1 M. Later, Andreas Pluckthun
observed that only seven VL and seven VH families cover more than 95 % of the human
germ-line repertoire. Knappik et al. designed fully consensus antibodies of each family.
The goal of this engineering project was to create starting scaffolds that were stable,
modular, and avoid the HAMA (Human Anti-Mouse Antibodies) response. The
consensus domains expressed well in E. coli and were surprisingly stable by thermal
denaturation. The sequences were designed to include convenient restriction
endonuclease sites to facilitate CDR engineering. Using phage display the authors were
able to generate libraries approaching 10^9 molecules. To add modularity to the system,
the seven heavy and light chains can be differentially matched, creating 49 unique starting
scaffolds. The panning of these libraries against multiple substrates produced low nanomolar affinity binders with specificity.
Antibodies are difficult proteins to engineer, characterize and produce in mass quantities.
This is largely because the proteins are multi-chained, contain disulfides, lack inherent stability, and require special conditions to express in bacteria. These factors have led laboratories to search for similar scaffolds with more amenable features. The Pluckthun lab has approached this problem by engineering repeat proteins such as the Ankyrins and
Armadillo repeats. Ankyrin repeats (Anks) contain 33 amino acids that fold into two antiparallel α-helices. The repeat proteins generally contain several of these helical units.
The native domains mediate protein-protein interactions where contact is driven by the loops, much akin to antibodies. Advantageously, these proteins are highly soluble, express well and contain no disulfide bonds. Pluckthun's lab has engineered these molecules for several tasks coining the name: Designed Ankyrin Repeat ProteINS
(DARPins).
The original consensus Anks were designed with the most common amino acid at each position, but randomized loop sequences.120 Six randomly chosen library members were well folded with stabilities ranging from stable (-10 kcal mol-1) to highly stable (-21 kcal mol-1). The X-ray diffraction structure was solved for one of these variants at 2.0 Å.
The Pluckthun lab has published over 30 articles demonstrating the use of DARPins. The
domains are generally very stable with TMs between 66-85 °C.121 In 2004, the lab
reported high affinity binding of consensus Anks to several targets including MBP, JNK2
and p38.122 The KDs from Surface Plasmon Resonance were reported in the single
nanomolar range, and the binders failed to bind other targets, confirming specificity. Consensus
DARPins have also been engineered to inhibit intracellular kinases and
phosphotransferases, as well as HER2 and neurotensin.123-126 These represent a small
slice of the work performed with DARPins - for further review of design and application
please see Pluckthun's review.127 The lab has recently begun evaluating the effectiveness
of consensus Armadillo repeat proteins.128
The Regan lab (Yale University) has consensus engineered the tetratricopeptide repeat
proteins (TPR). These repeat domains are similar in structure to ankyrin repeats and
likewise, mediate protein-protein interactions. The Regan consensus TPRs are actually
global mean propensity TPRs: P_G = (n_ij / N_j) / (n_i / N), where n_ij is the number of i amino
acids at position j, N_j is the total number of amino acids at position j, n_i is the total
number of i amino acids in the OWL/PFAM database and N is the total number of amino acids in that database.129 Accounting for global mean
propensities allows for permissible amino acids at every position. For example,
tryptophans and cysteines may be underrepresented in consensus alone because they have
fewer codons. Global mean propensities correct for this, putting the amino acids on more
level ground. Native TPRs have 3-16 repeats, but it is not clear if this is due to stability or functional constraints. Main et al. designed GMP-TPRs with a single, double and
triple repeat. Previous work demonstrated stability enhancements from consensus
design, and this was again seen in the TPR work. CTPR1, CTPR2 and CTPR3
(consensus TPR # repeats) were all monomeric and folded. Circular Dichroism ellipticity was monitored with increasing temperature to assay thermostability. The authors found that the TMs increased with the number of repeats and all were reasonably reversible.
The structures were determined by solution state NMR and X-ray diffraction. Later, the
Regan lab analyzed the binding of ~10 native TPRs to Heat Shock Protein (Hsp90).130
Consensus binding residues, determined from known Hsp90 binders, were hosted on the
CTPR3 architecture. The designed protein was stable and bound Hsp90 with KD = 200
µM. The consensus TPR was specific, but bound the ligand 40-fold weaker than native
HSP-binders. Magliery and Regan quantified conservation with statistical free energies
and relative entropy calculations.131 For proteins with diverse binding specificities (e.g.
TPRs, ankyrins, His2-Cys2 Zn fingers and PDZ domains) they observed that positions in
contact with peptide ligands are more variable than average surface positions.
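The global mean propensity defined above can be illustrated with a short Python sketch. The column and the background counts below are invented for the example (they are not taken from the OWL/PFAM database), and the function names are hypothetical. The point is only to show how dividing by the background frequency can change which residue is selected at a position.

```python
from collections import Counter


def global_mean_propensities(column, background_counts):
    """P_G = (n_ij / N_j) / (n_i / N) for every amino acid seen in one column.

    column            -- residues observed at position j (gaps written as "-")
    background_counts -- {amino acid: count} over the whole reference database
    """
    residues = [aa for aa in column if aa != "-"]
    n_ij = Counter(residues)
    N_j = len(residues)
    N = sum(background_counts.values())
    return {
        aa: (count / N_j) / (background_counts[aa] / N)
        for aa, count in n_ij.items()
    }


def highest_propensity_residue(column, background_counts):
    """Residue with the largest P_G; ties are broken alphabetically."""
    props = global_mean_propensities(column, background_counts)
    return max(sorted(props), key=lambda aa: props[aa])


# Invented numbers: Leu is common in the background, Trp is rare.
toy_column = ["L", "L", "L", "W", "W"]
toy_background = {"L": 100, "W": 10, "A": 890}
```

In this toy column leucine is the consensus residue (three of five sequences), yet tryptophan has the higher propensity (40 versus 6) because it is ten times rarer in the background. This is the sense in which propensities put rare amino acids on more level ground.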
Sir Alan Fersht (University of Cambridge) has devoted his career to the elucidation of
two scientific problems: The structure and mechanism of proteins, and tumor suppressor
protein p53. For him, these two disciplines are intimately related. As noted in Section
1.1, mutations to p53 are directly implicated in 50 % of human cancers, and p53 is likely
involved in 100% of cancer cases.1 This apoptosis-inducing protein requires tight
regulation for homeostasis, which it achieves through numerous binding partners and
weak stability resulting in increased proteolysis. Unfortunately, this weak stability makes
p53 susceptible to loss of function through mutation. Improving the stability of p53 with
small molecules or gene therapy is a long-standing goal in cancer research. Furthermore,
the detailed characterization of p53 is complicated by its poor bacterial expression and
solubility. In order to study the protein in an academic environment Fersht's lab designed
and characterized consensus mutations in p53.132 They began with an alignment of 23
p53 homologs from different species. M133L, V203A, N239Y and N269D were
combined into a quad mutant that is 2.6 kcal mol-1 and 5.6 °C more stable than the wild-
type core domain with no effects on activity.
Consensus design has also been applied to the engineering of fluorescent and
chemiluminescent proteins. Dai et al. aligned 31 fluorescent proteins with at least 62 %
sequence identity to monomeric Azami Green (mAG).133 The consensus amino acid was
selected at each position, but at highly variable sites the mAG residues were chosen. The
novel consensus protein is slightly less stable than mAG and has red-shifted
fluorescence, but expresses better and is brighter. Loening et al. sought to overcome the
serum lability of luciferase for use as a bioluminescent reporter.134 The authors made eight consensus-guided mutations to form RLuc8. RLuc8 is 200-fold more resistant to inactivation in murine serum and exhibits a 4-fold improvement in light output.
One of the most impressive uses of consensus design was the reengineering of fungal phytases. Phytases are enzymes belonging to the histidine acid phosphatase family that hydrolyze inorganic phosphate from phytic acid. Livestock such as poultry and pigs lack
the phytase activity required to liberate phosphate from plant diets and therefore require
expensive and ecologically threatening supplementation. An attractive solution would be
to add an active phytase enzyme to food sources. Unfortunately, the feed is pelleted at
manufacturing temperatures ranging from 60-90 °C and all known phytases unfolded
around 50 °C. Lehmann and Wyss aligned 13 known phytases from six different fungal
species.135 In this case, additional rational design was also employed: 1) The first 26 aa,
which contain a signal sequence, were taken from A. terreus CBS phytase. 2) At 18
ambiguous positions the authors chose A. niger or A. fumigatus amino acids. The
resulting sequence showed between 58 and 80 % sequence identity to its parent
sequences. The melting temperature measured by DSC and heat inactivation was 78 °C,
15-22 °C higher than its predecessors. The consensus phytase showed maximal activity
at 70 °C, but was as active as its mesophilic parents at 37 °C. All wild-type sequences
increased activity with rising temperatures until the TM was reached. The authors later
increased the TM of the consensus protein beyond 90 °C with an improved alignment and
site-directed mutagenesis at predominantly surface-exposed sites.136,137
The Abp1p SH3 domain from S. cerevisiae is considerably less stable than comparable
SH3 domains.138 A large alignment of SH3 domains revealed eight atypical residues in
Abp1p. Eight consensus mutations at these positions were constructed individually, and
three resulted in resistance to chemical and thermal denaturation. The three stabilizing
mutations were combined in the SH3 domain, leading to dramatic stabilization. The TM rose from 60 to 90 °C and the D50 in urea doubled. The folding and unfolding rates were
measured to assess the kinetic stability of the designed binding motif. The engineered
domain folded ten times faster and unfolded five-fold slower. Thermodynamic stability
has also been reported in consensus variants of papain, Fibroblast Growth Factor 1,
subtilisin and glucose dehydrogenase.139-142 In all of these studies additional design
constraints were employed including structure-based alignments and computational
modeling.
TAG-72 is a glycoprotein overexpressed on the surface of cancer cells. CC49 is a
clinically validated antibody that binds TAG-72.143 Roberge et al. designed a CC49 scFV-β-lactamase fusion, TAB2.4, for use in antibody-dependent enzyme prodrug therapy (ADEPT).144 Unfortunately, TAB2.4 expressed poorly and was prone to
degradation. Combinatorial Consensus Mutagenesis (CCM) was recruited to increase the
native stability of the CC49 scFV-BLA fusion. Multiple sequence alignment of
antibodies revealed 11 positions in TAB2.4 that harbored amino acids seen in < 5 % of
homologs. Consensus mutations were made at sites of high conservation, and a library of
higher frequency amino acids was tested combinatorially at nonconserved sites.
TAB2.5 was identified from this library with enhanced features - 4-fold improved
expression and a 2.5 °C increase in TM. Recently, Biogen Idec has designed stabilizing libraries based on Shannon entropies from the multiple sequence alignment of Antibodies
CH3 and αTT.145,146
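Shannon entropy gives a single number for how variable an alignment column is. The sketch below shows one way per-position entropies could be computed and ranked; it is not the protocol used in the cited work, and the toy alignment and function names are assumptions for illustration. Entropy is reported in bits, so a fully conserved column scores 0 and a column with all twenty amino acids equally populated scores log2(20), about 4.32.

```python
import math
from collections import Counter


def shannon_entropy(column):
    """Shannon entropy (bits) of one alignment column, ignoring gap characters."""
    residues = [aa for aa in column if aa != "-"]
    counts = Counter(residues)
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def rank_positions_by_entropy(alignment):
    """Return [(position, entropy)] sorted from most to least variable."""
    length = len(alignment[0])
    entropies = [
        (pos, shannon_entropy([seq[pos] for seq in alignment]))
        for pos in range(length)
    ]
    return sorted(entropies, key=lambda item: item[1], reverse=True)


# Hypothetical alignment: column 0 is invariant, column 2 is fully variable.
toy_alignment = ["AKL", "AKV", "ARI", "AKM"]
```

Ranking the toy alignment places position 2 first (2 bits), then position 1 (about 0.81 bits), then the invariant position 0 (0 bits). A library design could then concentrate diversity at the high-entropy positions.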
An interesting philosophical debate comes to light when analyzing conservation in
multiple sequence alignments. That is, what are the driving forces that lead to amino acid
bias at different positions? Anywhere from a handful to the majority of sites are nonconserved, meaning
many amino acid identities are tolerated. Some positions can be directly tied to function
such as catalytic residues, interaction and binding sites. Other positions may be involved
in dynamics such as loop motions and allostery. But for most proteins the number of
defined "functional" residues will be low leading us to wonder how the folding code is
distributed throughout the remainder of the positions with medium to high conservation.
Are these positions structurally significant, stability determining, or are they simply
phylogenetic artifacts? In general, once a position mutates from an ancestral sequence it
is unlikely that further mutations will occur at that same site. For example, the rate of
mutation in E. coli is approximately 1 per 10^8 base pairs under neutral conditions.147
While this makes subsequent mutations at the same site appear unlikely, it is not that improbable on the timescale of evolutionary history. Here, we argue that a conserved position that does not directly relate to function must have implications for the overall fitness of the organism (protein structure, stability, dynamics, etc.), or else it would have been subject to further mutation and lost its amino acid bias. Unfortunately, experiments that directly test this theory remain elusive.
On the other side of the argument, Frances Arnold (California Institute of Technology) and Donald Hilvert have attempted to demonstrate consensus protein design without phylogenetic bias.148 Here, the authors argue that Steipe's idea that consensus stability is
correlated to statistical free energies fails, because the sequences are not independent as
they are biased by common ancestry. In other words, this violates the logarithmic
relationship between an amino acid's stability contribution and its frequency in an MSA.
The authors suggest that these biases can be avoided by making large libraries of a single
protein sequence and constructing alignments based on these libraries. Combinatorial
strategies were applied to chorismate mutase (CM) and library members were screened in
an E. coli strain deficient in the desired activity. The screen identified 26 catalytically
active variants that were used to generate a synthetic-consensus CM. The consensus CM
is 9 °C and 2.6 kcal mol-1 more stable than any of the 26 library members. The specific
activity was 2-fold greater than the E. coli CM. To provide generality to the method,
they also repeated the procedure with M. jannaschii CM. Similar results were obtained, but the consensus M. jannaschii CM is 30-fold less active than wild-type.
It is unclear to me whether these experiments lend any credence to this method over standard consensus design. First, in terms of stability and activity, this work matched or failed to match the results seen in pure consensus designs. Second, the design and characterization of the libraries represent significant hurdles in both time and resources.
Finally, while it may not be phylogenetic bias, the sequence datasets themselves are
biased - perhaps to an even greater degree than raw consensus approaches. This method
forces bias by demanding the selection of a wild-type starting sequence and adds further
stereotyping by controlling the library mutagenesis and screening organism, not to mention
inherent biases in PCR. It is conceivable that this method could be powerful for
engineering proteins lacking homologs, like Rop. The Magliery lab has studied the central core (ITLA) of this four helix bundle. A hydrophobic/alcohol library of these four positions produced a consensus sequence of IVVA that was considerably more stable than all other library members.
1.9.2 Ancestral Statistics
While some have chosen to argue that the phylogenetic bias of consensus design is problematic, others have chosen to embrace the ancestral nature of multiple sequence alignments in a new field coined Ancestral Design.
The antiquity of thermophilic organisms has been suggested by Woese.149 He notes that thermophiles exist in both bacteria and archaea and generally have the deepest and shortest branches in the phylogenetic tree. The idea of a thermostable ancestor is well supported by other labs.150,151 Forterre, however, believes that the thermostabilities seen in bacteria
and archaea are a result of convergent evolution and not common ancestry.152 To directly test the thermophilic ancestor hypothesis, the Akihiko Yamagishi lab (Tokyo University) has designed and characterized ancestral sequences. The lab began with the CLUSTAL alignment of all 3-isopropylmalate dehydrogenases from the GenBank database.153 The
MSA was phylogenetically analyzed by PHYLIP, constructed into a phylogenetic tree, and the ancestral sequence was determined.154 The lab analyzed a handful of single and
double mutations in addition to one quad mutant. The estimated melting temperatures by
CD were recorded at > 95 °C for all variants, but the wild-type started with TM = 96 °C.
Two of the seven variants suffered from diminished activity, but the heat inactivation
profiles for all ancestrally mutated enzymes performed better than wild-type. Several
years later, the Yamagishi lab turned to ancestral design of isocitrate dehydrogenase
again showing that four out of the five single mutations improved thermostability.155 The
enzyme 3-isopropylmalate dehydrogenase was revisited in 2006.156 This time, the authors made twelve individual ancestral mutations to 3IPMD from Thermus
thermophilus - another thermophilic organism, but 10 °C less stable than the previous
model organism, Sulfolobus sp. This paper reports that six of the twelve mutations were
stabilizing. Later, the authors combined several of these mutations into single
variants and saw additive effects on stability.157 While these results present an
intriguing alternative to consensus design, they offer little added benefit. The authors of
these articles do not cite the work of Steipe, Lehmann and Wyss, and offer no
comparison between their approach and consensus design. In addition, the majority of
their ancestral mutations were, in fact, consensus. Interestingly, the authors posit that
similar mutations will have greater effects in mesophilic proteins and that such
studies are under way. Unfortunately, none of these studies have been published.
Recently, Tawfik has used phylogenetic information, akin to Yamagishi, to guide
combinatorial libraries of serum paraoxonases and sulfotransferases.158 In a medium-
throughput search of 300 variants, Tawfik identified a sulfotransferase with 50-fold
enhanced activity. Tawfik states that ancestral libraries comprise a means of focusing
diversity to positions that readily trigger changes in reaction specificity, thereby
facilitating the isolation of new variants by medium-throughput or even low-throughput
screens. This makes sense for promiscuous enzymes, like paraoxonase and
sulfotransferase, but it is unlikely that activity enhancements would be seen for well
diverged enzymes.
In 2007, the Joseph Thornton lab (University of Oregon/HHMI) published the crystal
structure of a resurrected ancient enzyme.159 Here, the authors deciphered the common
ancestral sequence of glucocorticoid and mineralocorticoid receptors. Thornton suggests
that the primitive protein had a broader range of substrate specificities and that evolution
of function occurred by a series of mutations that destabilized the receptor structure with
all hormones, but compensated with novel interactions specific to the new ligand.
Similar results have been reported in TIM-barrel enzymes.160
1.9.3 Correlation Statistics
Conservation statistics are applied independently to positions in multiple sequence
alignments - meaning the distribution at position x has no bearing on the statistical
information derived at position y. Even a cursory analysis of atomic structures and mutation studies shows that some - if not most - positions are chemically and physically intertwined
with other positions. These interactions are potentially ablated in consensus design.
Furthermore, correlations are another level of information encoded within the folding
code. Much like the elucidation of secondary structure propensities, understanding the
roles of conservation and correlation aids our ability to solve the Protein Folding Problem.
The calculation of correlations is considerably more complicated than consensus. Given a 100-residue protein there are 10,000 (100 x 100) pairwise correlations. If one wishes to understand the amino acid contributions at each site, the twenty identities need to be explicit within the calculation. This expands the correlation matrix into a third dimension with a total complexity of 4,000,000 data points (10,000 x [20 x 20]). Many bioinformatics labs have proposed ways of calculating these correlations, but a survey of those methods is beyond the scope of this introduction. Here we highlight several examples where bioinformatics-driven hypotheses have been refuted or validated with empirical experiments.
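The bookkeeping behind these numbers can be made concrete with a small Python sketch. The function names and the three-sequence alignment are hypothetical. The first function reproduces the arithmetic in the text, and the second tabulates which amino acid pairs actually co-occur at each pair of positions, which is the raw material for any correlation measure. Because the position-by-position matrix is symmetric, only pairs with i < j are stored.

```python
from collections import Counter
from itertools import combinations


def correlation_table_size(length, alphabet_size=20):
    """Number of entries in a full (position x position x aa x aa) table."""
    return length * length * alphabet_size * alphabet_size


def pair_counts(alignment):
    """Count co-occurring residue pairs for every pair of columns i < j.

    Returns {(i, j): Counter({(aa_i, aa_j): n})}; sequences with a gap at
    either position are skipped for that pair.
    """
    length = len(alignment[0])
    table = {}
    for i, j in combinations(range(length), 2):
        table[(i, j)] = Counter(
            (seq[i], seq[j])
            for seq in alignment
            if seq[i] != "-" and seq[j] != "-"
        )
    return table


# Hypothetical alignment, for illustration only.
toy_alignment = ["AKL", "AKV", "GRL"]
```

For a 100-residue protein the full table has 4,000,000 entries, but most are empty in practice: the sparse dictionary above stores only the pairs that are actually observed, and the symmetry reduces 10,000 position pairs to 4,950 unique ones.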
How do distant sites in three-dimensional structure communicate within proteins? This is an important question in allostery, where binding at one site has distal effects. Here, information is "communicated" or "propagated" through a network of interactions that connect the distant sites. Thermodynamic double-mutant cycles have been used to validate these networks, but they are low-throughput and their practicality is limited to small proteins.161-163 Rama Ranganathan (University of Texas Southwestern Medical
Center) proposed that these interaction networks could be statistically calculated from sequences. If two positions are functionally coupled, their evolution should be mutually constrained which should be represented in the statistical coupling of the amino acid distributions. Ranganathan calculated these interactions using Statistical Coupling
Analysis (SCA).164 First, they calculate the root-mean-square of the binomial probabilities for each amino acid appearing at its observed frequency compared to a
reference frequency (ΔGstat). The statistical energy at site x is measured for two conditions: 1) The full MSA and 2) A subset of the MSA where the amino acid identity at position y is held constant. These calculations (ΔΔGstat) revealed pathways of physical connectivity between the peptide binding sites and cores of PDZ and POZ domains. The authors propose that binding energies are propagated along these pathways and similar phenomena may play roles in allosteric regulation. To further validate these assertions,
Ranganathan examined three protein families: G protein-coupled receptors, chymotrypsin and hemoglobins.165 Sparse, but connected networks were discovered within each family. It would be interesting if the authors demonstrated the allosteric effect of these correlations with mutations that ablate and conserve statistical interactions.
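The perturbation idea behind ΔΔGstat can be sketched in Python. This is a deliberately simplified stand-in and not the published SCA implementation: instead of binomial probabilities it uses log ratios of observed to reference frequencies, the reference frequencies are assumed uniform (1/20), a pseudocount of 1 is added to every amino acid, and the values are in arbitrary units. All names are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def log_ratios(alignment, site, pseudocount=1.0):
    """ln(observed / reference frequency) for each of the 20 amino acids at a site."""
    counts = Counter(seq[site] for seq in alignment if seq[site] in AMINO_ACIDS)
    total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
    reference = 1.0 / len(AMINO_ACIDS)  # assumed uniform background
    return [
        math.log(((counts[aa] + pseudocount) / total) / reference)
        for aa in AMINO_ACIDS
    ]


def delta_g_stat(alignment, site):
    """Conservation-like score: root of the summed squared log ratios."""
    return math.sqrt(sum(x * x for x in log_ratios(alignment, site)))


def delta_delta_g_stat(alignment, site_x, site_y, fixed_aa):
    """Change in the site_x distribution when site_y is held at fixed_aa."""
    subset = [seq for seq in alignment if seq[site_y] == fixed_aa]
    if not subset:
        raise ValueError("no sequences carry fixed_aa at site_y")
    full = log_ratios(alignment, site_x)
    perturbed = log_ratios(subset, site_x)
    return math.sqrt(sum((p - f) ** 2 for p, f in zip(perturbed, full)))


# Hypothetical alignment: sites 0 and 1 covary perfectly, site 2 is invariant.
toy_alignment = ["AKG", "AKG", "AKG", "DEG", "DEG", "DEG"]
```

Holding site 1 at Lys changes the distribution at the coupled site 0 more than at the invariant site 2, so the first perturbation scores higher. With so few sequences the pseudocount alone produces a nonzero score for the uncoupled site, which is one reason real analyses require large alignments and sizeable subsets.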
Figure 8: SCA and WW-Domains.
WW-domain shown with bound peptide. Pie charts showing the outcome of folding studies for natural (n=42), CC (n=43), IC (n=43), or random (n=19) WW sequences. Red, natively folded; blue, soluble but unfolded; yellow, insoluble; grey, poor expressing. Figure adapted from Socolich et al. with permission from Rama Ranganathan.
In 2005, Ranganathan applied the SCA method to the Protein Folding Problem, trying to
define the sequence rules for specifying a fold. The lab created three libraries of the 36
residue WW-domain. The first library was constructed with no evolutionary information
and represents random sequences. The second library was constructed using site-
independent conservation (consensus only) and the final library was designed with coupled conservation (consensus and correlation).166 All libraries produced soluble
members. The inclusion of correlated data did not improve the solubility of the library,
but did increase the fraction of native-like folded sequences (0 to 28 %). Several
members from the CC-Library were characterized in detail confirming WW-domain
structure and function.167 It is important to note that Ranganathan's site-independent
conservation WW-domains are not true consensus proteins as reported in section 1.9.1.
Here, the amino acid at each position was chosen at random based on amino acid
frequencies derived from the multiple sequence alignments. This procedure likely
scrambles conserved correlations that populate the consensus ankyrins, TPRs, etc.
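The distinction between a true consensus sequence and a site-independent sequence drawn from column frequencies can be shown with a short Python sketch. The toy alignment and function names are assumptions for illustration; the sampling routine is only a plausible reading of the procedure described above, not the authors' code.

```python
import random
from collections import Counter


def column_counts(alignment):
    """One Counter of residues per alignment column, ignoring gap characters."""
    length = len(alignment[0])
    return [
        Counter(seq[pos] for seq in alignment if seq[pos] != "-")
        for pos in range(length)
    ]


def consensus(alignment):
    """Most common residue at every column (ties broken alphabetically)."""
    return "".join(
        max(sorted(counts), key=lambda aa: counts[aa])
        for counts in column_counts(alignment)
    )


def sample_site_independent(alignment, rng):
    """Draw each position independently, weighted by its column frequencies."""
    sequence = []
    for counts in column_counts(alignment):
        residues = sorted(counts)
        weights = [counts[aa] for aa in residues]
        sequence.append(rng.choices(residues, weights=weights, k=1)[0])
    return "".join(sequence)


# Hypothetical alignment with a perfectly coupled charge pair (K-E or E-K).
toy_alignment = ["KE", "KE", "KE", "EK", "EK"]
rng = random.Random(0)  # seeded only so the sketch is repeatable
samples = [sample_site_independent(toy_alignment, rng) for _ in range(200)]
```

The consensus of the toy alignment is KE, which preserves the coupled pair. Independent sampling, by contrast, can produce KK or EE, combinations that never occur in the alignment, which illustrates how this procedure can scramble correlations that a consensus sequence retains.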
Recently, the Ranganathan lab has reported an updated SCA calculation that measures the significance of observed correlations as judged by the conservation of the amino acids
under consideration.168 Further calculations remove insignificant correlations (noise)
based on eigenvalues and the remaining pairwise correlations are clustered according to
eigenvectors. The authors showed the raw conservation-weighted covariance matrix
between all sequence positions in the S1A serine protease family. Here, relatively few
positions exhibit strong correlations to primary sequence neighbors. After "spectral
cleaning" and clustering, three sectors are observed. These sectors represent positions with strong intra-sector correlations, but sparse inter-sector correlations. The authors found that these sectors represented distinct tertiary sites within the protein that likely coevolved for fitness.
Figure 9: Beyond Consensus Analysis of TPRs.
Networks of statistically interacting residues implied from perturbation analysis. Statistically significant interactions can be arranged into “networks” by examining the differences in amino acid distribution for various TPR subsets. Lines between different positions represent direct correlations. (a) The identity of residue 8, almost always Gly or Ala, is affected by residues 4–9 and 21–24. Residue 24 tends to get larger or smaller inversely with residue 8. (b) Positions 26 and 29 tend to have opposite charges. (c) TPRs with Leu7 tend to have Tyr11, DE16, Lys19 and DE22. TPRs with KR7 tend to have Leu11, KR16 and Glu 19, in addition to Lys2, Tyr 4, Arg6, AC10, Asp23 and Asp31. Image and caption from reference Magliery and Regan.
Thomas Magliery and Lynne Regan used Ranganathan's initial protocols to study the conservation and interaction profiles in the tetratricopeptide repeat motif (TPR).169
Statistical free energies were used to determine the significance of each position within the motif. The most conserved residues (high ΔGstat) were located within the
hydrophobic core, although conserved glycines and prolines were seen at turn positions.
As noted earlier, the binding site exhibited lower ΔGstat values than average surface
residues. This is because the ensemble of sequences used in the multiple sequence
alignment has evolved diverse pockets to bind diverse ligands. Statistical Coupling
Analysis reveals several interesting correlated networks. These correlated networks
represent real examples where the folding code has been deconvoluted - even if only to a
small degree.
Biogen Idec recently reported the statistical analysis of VH and VL domains from the V-
class of Ig-folds.170 The covariation between sites was calculated by correlation
coefficients. The correlations had interesting implications in the quaternary structure of
multi-chained antibodies. First, the strongest correlations were localized to the VH-VL interface, suggesting that evolution stringently selected for heterodimerization.
Additional correlations were observed at the VH-C interface, but not the VL-C interface.
The authors published the entire roster of correlated residue pairs to support engineering
efforts in the antibody community.
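One simple way to express covariation between two sites as a correlation coefficient is the phi coefficient, which is the Pearson correlation between two presence/absence indicators. The sketch below is an assumed illustration of that general idea and is not the calculation used in the cited antibody study; the alignment and function name are hypothetical.

```python
import math


def phi_coefficient(alignment, pos_i, aa_i, pos_j, aa_j):
    """Pearson correlation between 'aa_i at pos_i' and 'aa_j at pos_j'.

    Returns 0.0 when either indicator is constant across the alignment,
    since the correlation is undefined in that case.
    """
    x = [1 if seq[pos_i] == aa_i else 0 for seq in alignment]
    y = [1 if seq[pos_j] == aa_j else 0 for seq in alignment]
    n = len(alignment)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = mean_x * (1 - mean_x)  # variance of a 0/1 indicator
    var_y = mean_y * (1 - mean_y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return covariance / math.sqrt(var_x * var_y)


# Hypothetical alignment: Lys at site 0 always pairs with Glu at site 1.
toy_alignment = ["KE", "KE", "EK", "EK"]
```

In the toy alignment Lys at the first site is perfectly correlated with Glu at the second (phi = 1) and perfectly anticorrelated with Lys at the second (phi = -1). Real alignments give intermediate values, and their significance depends on alignment depth and phylogenetic redundancy.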
Finally, we examine an integral dilemma in the analysis of correlated occurrences of
amino acids. As demonstrated earlier, even a small protein of 100 amino acids has
10,000 pairwise interactions and millions of data points generated by methods like
correlation coefficients. The standard for viewing these values is limited to tables,
spreadsheets and heat maps. The rapid acceleration in the throughput of data collection
(microarrays, genomic efforts, etc.) requires the use of new tools for data visualization.
Developments from the Patricia Babbitt lab (University of California, San Francisco) and
the William Ray lab (The Ohio State University) are brilliant examples of integration
between bioinformatics and visualization. In particular, the Ray lab developed a Java-
enabled program that simplifies the data produced in correlation analyses - first in nucleic
acids, then in proteins.171-173 A standard heat map displays redundant information as the
matrix is symmetrical (identical axes). Ray's innovation bent the top axis from a linear
line into a circle and replaced the color-coding of correlations with connecting lines
(Fig.10). The genius of this approach is that the data can be viewed in multiple dimensions and at multiple levels. If the circle is imaged from the top side, lines indicate
correlations between positions. This interface displays the twenty amino acids and their
distributions as a cylinder extending from the circle. Rotating the cylinder allows one to
visualize how the amino acids contribute to the statistical correlation (meaning instead of
position x correlating to position y, Ala at position x correlates to Phe at position y). This
interface, named StickWRLD, is highly intuitive and especially useful for visualizing
networks of interactions. Olzer and Ray used StickWRLD to unveil two networks of
interaction in the adenylate kinase family. One network, found in Gram-negative bacteria,
stabilizes the lid domain through a series of hydrophobic interactions and hydrogen
bonds. A second network, commonly found in Gram-positive strains, replaces a subset of these amino acids with zinc-chelating residues. The Magliery and Ray labs are currently testing the role of these networks with empirical experiments.
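The scale of the visualization problem described above can be made concrete with a short calculation. The following is a minimal sketch, assuming a 100-position alignment and a 20-letter amino acid alphabet; the function name is hypothetical and is not part of StickWRLD or any published tool.

```python
from math import comb

def correlation_matrix_sizes(n_positions: int, n_residues: int = 20) -> dict:
    """Count the entries a pairwise correlation analysis can generate."""
    # Full position-by-position matrix, as drawn in a symmetric heat map.
    full_matrix = n_positions * n_positions
    # Unique pairs of distinct positions (one triangle of that matrix).
    unique_pairs = comb(n_positions, 2)
    # Residue-level view: every residue at position x against every residue at position y.
    residue_level = full_matrix * n_residues * n_residues
    return {
        "full_matrix": full_matrix,
        "unique_pairs": unique_pairs,
        "residue_level": residue_level,
    }

sizes = correlation_matrix_sizes(100)
print(sizes)
```

Only about half of the heat map is unique, which is the redundancy the circular layout removes, while the residue-level count shows why tables and spreadsheets become unmanageable.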
Figure 10: StickWRLD.
A StickWRLD diagram showing positions in the ADK lid domain. The amino acid identities are arranged vertically by their Kyte–Doolittle hydropathy score. Consensus identities in each position are highlighted by a transparent unit cube. In a live VRML browser, this diagram is completely navigable and the viewer can rotate, move and zoom the 3D diagram to examine details. Figure and caption reproduced with permission from Will Ray.
1.10 Triosephosphate Isomerase
Triosephosphate isomerase (TIM) is the archetypical member of the (β/α)8-barrel fold.
This fold is particularly important as more than 10 % of all native enzymes host their
catalytic residues on this architecture.174 TIMs are composed of eight parallel β-strands
surrounded by eight α-helices, resulting in concentric hydrophobic cores (Fig. 11).175
The loops that connect β-strands and α-helices hold functionally important residues for catalysis and binding. The diversity of functions seen in natural TIM-barrels has not been replicated in protein engineering studies.176,177 This is likely due to the marginal
stability of the β-sheet core. Fersht reported the directed evolution of the TIM-barrel indole-3-glycerol-phosphate synthase into a fellow TIM-barrel, phosphoribosylanthranilate
isomerase.178,179 The article was later retracted citing in vivo contamination, and it was
later determined that the designed proteins aggregate in vitro. Proteins with high β-sheet
content are prone to aggregation and kinetic instability. This is often observed in
amyloids where β-secondary structures form sandwiches leading to aggregation and
precipitation. Pehr Harbury (Stanford University) "reverse engineered" the (β/α)8-fold of
S. cerevisiae triosephosphate isomerase.180 Here, the authors constructed libraries of conservative mutations based on a small multiple sequence alignment. They observed that the majority of structural positions tolerated conservative mutations (e.g. Glu→Asp) with minimal consequences for activity. However, when all positions were simultaneously varied between wild-type and conservative residues, only 1 in 10¹⁰ library members was active. In particular, the central hydrophobic core (β-sheets) would not tolerate mutations that changed the core volume by as little as one methylene group.
Figure 11: Triosephosphate Isomerase.
TIM is a homodimeric enzyme shown here with one monomer in gray cartoon and the second as a red carbon-α trace. Note that the majority of the interaction surface is contributed by a single loop that penetrates the active site of the adjacent monomer. Active site residues are shown as sticks and dynamic loop 6 is shown in purple. This image was rendered in PyMOL from the S. cerevisiae crystal structure, 1YPI.
With the exception of a few thermophilic tetramers, all known TIMs are homodimers
with picomolar KDs. The oligomeric nature of triosephosphate isomerase is proposed to
be essential for activity based on crystal structures and engineering experiments. The
third and longest loop contains ~15 residues that interdigitate into the adjacent monomer.
The tip of this loop forms van der Waals packing interactions with active site residues,
K12 and H95. These interactions may be responsible for the exquisite alignment and spacing within the active site. The Rik Wierenga Lab (University of Oulu) has studied the third loop in trypanosomal triosephosphate isomerase. First, they computationally modeled a truncated loop of only eight residues that ablated much of the protein-protein interaction surface (Fig. 12).181,182 This variant was monomeric at physiological concentrations (0.02-2.0 mg/mL), earning it the name monoTIM. The crystal structure
revealed significant rearrangement of the three active site residues leading to diminished
activity (kcat/KM = 10⁴ M⁻¹ min⁻¹ versus 10⁸ M⁻¹ min⁻¹ for wild-type). A single point mutant
at the interface, H47N, produces monomeric TIM at concentrations below 3 mg mL⁻¹, but
is less stable than native trypanosomal TIM.183 A series of crystal structures with point
mutations and inhibitors was published in 1995.184,185
Figure 12: monoTIM.
monoTIM (right) is a designed variant of triosephosphate isomerase that replaces the 15-residue interface loop with a computationally designed 8-residue loop. The second monomer is depicted as a green sphere with active site residues. Images were rendered in PyMOL 1.4.1 with PDB IDs 1YPI and 1TRI.
The residual activity of monoTIM was explained from these crystal structures based on
the flexibility of loops 1, 4 and 8. In wild-type TIM these loops are rigidified through
subunit-subunit contacts - ensuring the alignment of catalytic residues. Further
engineering of a seven-residue loop and double mutants yielded similar monomeric
TIMs.186,187 Directed evolution of monoTIM has yielded an enzyme with 44-fold
improved specific activity.188 Engineered TIMs from human have also been
characterized.189 The double interface mutant (M14Q, R98Q) produces monomers that
are significantly less stable than the wild-type dimeric hTIM. Design strategies were
applied to both the monomeric and dimeric species that successfully increased stability in both constructs. There is a rare autosomal recessive disease associated with mutations in
TIM. The most common disease allele, E104D, does not significantly affect in vivo
activity, but does affect dimer stability.190
Figure 13: The Activity of TIM.
TIM catalyzes the interconversion of dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP) in the fifth step of glycolysis. To study the Michaelis-Menten parameters of these spectroscopically-silent substrates the reaction is coupled to the redox reaction of NAD+/NADH. The coupled enzymes, glyceraldehyde-3-phosphate dehydrogenase (GAPD) and α-glycerol-3-phosphate dehydrogenase, allow the measurements to be taken under conditions where the TIM reaction is irreversible.
Triosephosphate isomerase's activity plays a pivotal role in glycolysis - the metabolic
pathway that chemically breaks down glucose to two molecules of pyruvate (Fig. 13). The fourth step of this pathway cleaves six-carbon fructose-1,6-bisphosphate into two three-carbon molecules, dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP). Only one of these products (GAP) can continue in the glycolytic pathway for energy production. To increase the efficiency of glucose metabolism, TIM interconverts DHAP and GAP through an enediol intermediate at the diffusion limit. Most
evidence supports a mechanism where protons are shuttled directly to and from
DHAP/GAP by a catalytic glutamate and histidine and the negatively charged
intermediate is stabilized by the catalytic lysine.191-193 Jeremy Knowles (Harvard
University) designed two coupled assays to determine the Michaelis-Menten parameters
for TIM (Fig. 13).194 These assays and site-directed mutagenesis have led to detailed active-
site analysis. Mutation of the catalytic glutamate to the shorter aspartate diminishes
activity 500-fold.9 Mutation of the active site lysine to glycine results in a 3,000-fold loss
in activity and mutation of the histidine completely inactivates TIM.195 In the
DHAP:GAP equilibrium, formation of DHAP is thermodynamically favored 22:1.196
Catalysts may change the rate of reactions, but cannot affect the underlying thermodynamics and equilibria. TIM is able to continually convert more DHAP into GAP because GAP is quickly metabolized to 1,3-bisphosphoglycerate, pulling the equilibrium forward in accordance with Le Chatelier's principle. Triosephosphate isomerase also allows other carbon sources to enter glycolysis. Glycerol is converted to glycerol-3-phosphate, which can be oxidized into
DHAP allowing entry into glycolysis. Alternatively, DHAP can be reduced to glycerol-3-phosphate, which provides adipose tissue with a source of activated glycerol for the synthesis
of triacylglycerides. Lactic acid can also be fed into glycolysis at the fifth step by a series
of chemical modifications to glyceraldehyde-3-phosphate.
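The 22:1 equilibrium quoted above corresponds to a modest free energy difference, which can be checked with the standard relation ΔG° = -RT ln K. The sketch below is illustrative only: the temperature of 298.15 K is an assumption (the text does not state the conditions of the measurement), and the function name is hypothetical.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1

def standard_free_energy(k_eq: float, temp_k: float = 298.15) -> float:
    """Standard free energy change (kcal/mol) from an equilibrium constant."""
    return -R_KCAL * temp_k * math.log(k_eq)

# K = [DHAP]/[GAP] at equilibrium, taken from the 22:1 ratio in the text.
k_gap_to_dhap = 22.0
dg_gap_to_dhap = standard_free_energy(k_gap_to_dhap)

# Fraction of the triose phosphate pool present as GAP at equilibrium.
fraction_gap = 1.0 / (1.0 + k_gap_to_dhap)

print(f"dG (GAP -> DHAP) = {dg_gap_to_dhap:.2f} kcal/mol")
print(f"GAP at equilibrium = {100 * fraction_gap:.1f} % of the pool")
```

Under these assumptions only a few percent of the pool is GAP at equilibrium, which is why continual removal of GAP by the next glycolytic step is needed to sustain net flux from DHAP.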
The catalytic glutamate is located on dynamic loop six of triosephosphate isomerase. The
motion of this loop has been exhaustively studied by Arthur Palmer (Columbia
University), Ann McDermott (Columbia University) and Nicole Sampson (Stony Brook
University). This loop acts as a rigid lid to the active site with two hinges. The lid must
remain closed during catalysis to exclude water from the active site, which would otherwise lead to the toxic
side product methylglyoxal. At the same time, the loop needs to open to allow entrance
of substrate and product release. The rate of loop motion is on the same time scale as
catalysis as determined by fluorescence, solid-state and solution-state NMR.197-202
Nearly every sequenced organism contains the gene for triosephosphate isomerase.
Exceptions include ureaplasmas - a branch of bacteria that do not perform glycolysis. At the time of writing, there are 3,894 triosephosphate isomerase sequences in Pfam from 2,946 species. There are 270 triosephosphate isomerase structures in the
PDB and 2,090 TIM-barrel structures. The ubiquity and thorough evolution of this ancient enzyme make it ideal for statistical study. Furthermore, the plethora of biophysical and biochemical data collected aids in the study and understanding of our bioinformatically-derived variants.
1.11 Dissertation Synopsis
The work presented in this dissertation aims to understand the fundamental properties of
proteins - specifically, how is information encoded in the folding code? Anfinsen's work
thoroughly proved that the directions for folding and activity are solely encoded in the
primary sequence and many labs have designed experiments to interpret that code. We
believe that the "code" can be "broken" by comparing and contrasting sequences of
homologous proteins.
First, in chapter two we analyze protein stability as a consequence of numbers - both
combinatorial numbers and numbers of sequences. We highlight several technologies
that have allowed protein scientists to study large libraries that allow us to find stabilized variants and deduce the mechanisms of stability from aggregated data. Furthermore, we explain the role of conservation and correlation in protein stability. In the third chapter, we describe the design and characterization of fully consensus TIMs. The results of this chapter indicate that consensus design is sufficient for generating stable and active proteins. We also discuss the role of correlation and networks of interactions that provide
fine tuning of thermodynamic properties and activity. Our fourth chapter describes a
double-sieve statistical filter that accurately predicts stabilizing mutations based on
consensus and correlation. Here, we characterize 23 individual consensus variants that
vary in conservation, solvent-accessible surface area, secondary structure and covariation
with other positions. We determined that consensus and correlation, alone, can predict
stabilizing mutants with > 90 % accuracy. Furthermore, the double-sieve filter selected
15 substitutions that led to dramatic stabilization of S. cerevisiae TIM. In the fifth chapter we describe the engineering and characterization of a novel TIM-deficient E. coli
from the Keio Collection. Here, we design a model system to test the interplay of
correlations and fitness using deep sequencing. The sixth chapter chronicles a high-
throughput assay for measuring the relative stability of libraries of proteins based on hydrophobic dye binding. The final experimental chapter describes the design and application of a novel vector for ligation independent cloning and traceless hexahistidine
protein purification. The sum of this dissertation presents methods for generating,
assaying and analyzing the fundamental properties of proteins. In particular, how is
information captured statistically in consensus and conservation?
Chapter 2: Protein Stability by Number
Protein stability by number: high-throughput and statistical approaches to one of protein
science's most difficult problems.
2.0 Contributions
The following review was published in Current Opinion in Chemical Biology under the authorship of Thomas J. Magliery, Jason J. Lavinder and Brandon J. Sullivan. The literature review and analysis of current and past techniques were performed by all authors. Jason Lavinder was instrumental in assembling sources detailing high-throughput procedures. Brandon Sullivan was instrumental in assembling sources detailing statistical approaches to protein stability. Thomas Magliery wrote the review and contributed sources. The author order is Thomas Magliery, Jason Lavinder and Brandon Sullivan.
2.1 Abstract
Most natural proteins are only barely stable, which impedes structural studies, protein engineering and use in therapeutic and industrial applications. It also makes proteins susceptible to single mutations that completely destabilize the native state, which
underlies numerous disease pathologies. Our ability to predict the thermodynamic
consequences of even single point mutations is still surprisingly limited, and the low-
throughput nature of protein stability measurements slows engineering efforts and
investigations to understand sequence-stability relationships better. A number of recent
methods are bringing protein stability studies into the practical high-throughput realm.
Some of these methods are based on inferential read-outs such as activity, proteolytic
resistance or split-protein fragment reassembly. Other methods use miniaturization of
direct measurements of stability, such as intrinsic fluorescence, H/D exchange, cysteine reactivity, aggregation and hydrophobic dye binding (DSF). Applications of these screens to difficult targets such as antibodies and membrane proteins are discussed. A second way that large-number approaches are intersecting with protein stability studies is in statistical analysis of sequence databases. Protein engineering based on both consensus and correlated occurrences of amino acids is promising, but much work remains to understand and implement these methods.
2.2 Introduction
Site-directed mutagenesis, still the core technology of protein engineering, will turn 30 next year. The last three decades have seen well in excess of 100,000 mutations made
(many more if we count combinatorial approaches) to probe and alter the structure, activity, folding and stability of a vast array of proteins with different folds and functions.
A huge number of stability measurements have been amassed, in addition to a massive body of hypothesis-driven experiments designed to tease out the basis of protein stability.
But predicting the stability of protein mutants remains one of the great unsolved
problems of protein science, proving itself more difficult than even the prediction of
protein structure or even the design of fairly efficient enzymes. This difficulty is in spite
of our actually knowing a great deal about the forces that dominate in protein folding and
perhaps even more about the atomic-resolution structures of folded proteins.203,204 So what’s the problem? One problem is that despite large forces being at work in the structure of the folded state, such as the enthalpies associated with all the hydrogen bonds that form, the net stabilities of proteins are small, only 5-15 kcal mol⁻¹. This is because the
forces acting on the unfolded state, such as all the hydrogen bond donors and acceptors
that are satisfied by solvent, are also large. This marginal differential means that exquisite
accuracy is required from fairly crude potential functions, and the problem is exacerbated
by our inability to meaningfully model the unfolded state. Furthermore, it is difficult or
impossible to model key aspects of protein folding, such as backbone motion or solvent
entropy. Even empirical approaches that attempt to extrapolate from training sets of
thermodynamic data do not capture sufficient information to solve the problem, but it is
less clear if the reasons for this are fundamental. On one hand, the standard methods of
characterization—calorimetry or spectroscopically observed chemical or thermal
denaturation—are slow and laborious. On the other hand, even “large” databases are
easily dwarfed by the size of sequence space, and it is certainly clear that the effect of a
mutation is only meaningful in context. Mutating alanine to serine is a vastly different thing in different scaffolds, in different secondary structures, with different packing densities or solvent exposures, or with different amino acids nearby. So while insight
may not follow from numbers alone, there is a degree to which having large numbers of well-characterized and highly-related mutants will shed light on the problem of protein stability. And even if it does not, the technology to enable those measurements will also enable brute-force approaches for engineering stability. In recent years, the problem of protein stability has intersected with problems of large numbers in two interesting ways, each of which is proving useful for engineering proteins for improved stability and elucidating the underlying reasons. The first is the development of fairly general high-throughput methods for measuring protein stability. The second is the use of statistics from the very large number of sequences that have resulted from 15 years of genome sequencing to predict stabilizing mutations. Here we will highlight some of the most important recent advances in these two areas.
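The point about marginal stability and the accuracy demanded of potential functions can be illustrated numerically. The following is a rough sketch under a simple two-state assumption (folded and unfolded only) at an assumed temperature of 298.15 K; neither the function names nor the temperature come from the original review.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1
TEMP_K = 298.15    # assumed temperature (25 °C); not specified in the text

def fraction_unfolded(dg_unfold: float) -> float:
    """Two-state population of the unfolded state for an unfolding free energy in kcal/mol."""
    return 1.0 / (1.0 + math.exp(dg_unfold / (R_KCAL * TEMP_K)))

def fold_error_in_k(dg_error: float) -> float:
    """Multiplicative error in the folding equilibrium constant caused by an energy error."""
    return math.exp(abs(dg_error) / (R_KCAL * TEMP_K))

# The 5-15 kcal/mol range quoted in the text.
for dg in (5.0, 10.0, 15.0):
    print(f"dG = {dg:4.1f} kcal/mol -> fraction unfolded = {fraction_unfolded(dg):.1e}")

print(f"1 kcal/mol error -> {fold_error_in_k(1.0):.1f}-fold error in K")
```

In this simplified picture, an error of a single kcal/mol, which is small compared with the summed enthalpies of the folded state, already shifts the predicted equilibrium constant several-fold.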
Table 1: Protein Stability by Number.
2.3 Screening for Protein Stability
High-throughput approaches for measuring or improving protein stability generally fall
into two categories; either they attempt to infer the stability from properties that are
typically measured close to physiological conditions, or they perturb the conditions of the
protein in some way and read out the stability (more or less) directly. For example, protein expression level, solubility, secretion, binding and enzymatic activity, and resistance to proteolysis may all be taken as indications of a stable protein.205 In general,
the Achilles’ heel of these approaches is a lack of broad applicability (for example, many
interesting proteins do not have an enzymatic function) and “you-get-what-you-select-for” kinds of escape variants (for example, unstable but protease-resistant mutants). On the other hand, the problem with measuring the stability directly is the difficulty of
miniaturization; circular dichroism and differential scanning calorimetry are not well
suited to 96-well plates. But some creative ideas have been applied recently with both
types of approaches, which we highlight here. Several other recent reviews highlight
other aspects of combinatorial approaches to protein biophysical properties.205-207
2.4 Inferential Screens for Protein Stability
A straightforward approach for selecting for thermostable proteins is to monitor protein activity at elevated temperatures or after heating. For example, thermal inactivation was used to engineer an esterase with good tolerance of high temperatures but robust room-temperature activity, a feat that was not universally thought to be possible until it was
demonstrated directly.208 This approach is limited to proteins with an activity that can be
assayed easily, and there is not a clear correlation between the degree of thermal
inactivation and the stability since it is complicated by aggregation and folding rates. But
this is a very practical approach to screening for proteins with improved stability and
activity under various perturbing conditions. Screens for achieving a stability threshold
based on binding or catalytic activity have formed the basis for several notable
combinatorial experiments in protein design using λ repressor, barnase, chorismate mutase, and Rop, to name a few.205 Resistance to proteolysis has been used broadly to
identify structured variants, particularly on phage particles. It has been difficult to rely on
nonspecific proteolysis as a read-out of stability, because proteolysis rates are related not
just to global stability but to local stability and substrate specificity. Recently, Bardwell
and coworkers developed a system in which a protein of interest (POI) is inserted into a
loop of TEM-1 β-lactamase, where it was hypothesized that lower-stability mutants of the
POI would generally lead to greater degradation by cellular proteases.209 For several
proteins, the log of the minimum inhibitory concentration (MIC) of antibiotic showed a
striking correlation to the stability of mutants (R² > 0.6). The relationship was especially
good for Immunity protein 7, where it was clear that expression level was correlated to
stability. Often, expression level differences from varying rates of transcription and
translation, solubility differences, or display differences (on phage or yeast) can be confounding factors to these types of inferential screens. But the authors convincingly showed that the system could be used to select for Im7 variants with improvements in both thermodynamic and kinetic stability. The selection is also tunable and demands that
selected variants be soluble and expressible. Marqusee and coworkers have recently
employed pulse proteolysis in increasing concentrations of urea using thermolysin, which
retains its activity in high urea, to measure folding ΔG values.210 The method is read out by SDS-PAGE, but it can be applied to unpurified protein in crude lysate with sufficient overexpression or specific detection, making it suitable for fairly high-throughput quantitative determinations of stability. By adjusting the pulse time and using chemical denaturant, one can directly measure the fraction folded and avoid confounding differences in protease susceptibility under native conditions. This observation led Park et al. to challenge the entire E. coli proteome with protease under native conditions to specifically identify resistant proteins.211 Maltose binding protein was a notable survivor
of thermolysin treatment, and it achieves its resistance through kinetic stability to
unfolding. While it may be a challenge to apply to libraries of mutants, this screening
principle may be useful to shed light on determinants of kinetic stability. Split-protein
reassembly, also called protein-fragment complementation, has proven useful for
identifying protein interactions in living cells, wherein reassembly of fragments of
DHFR, GFP, luciferase or other proteins is driven by the interaction between POIs fused to the fragments.212-215 Split fluorescent proteins reassemble irreversibly, making them
useful for detecting weak interactions but generally unsuitable for measuring binding constants.213 In contrast, split luciferase reassembles reversibly, which has been exploited to look at interaction dynamics in cells but so far not to look at stability
directly.216 Koide and coworkers recently combined yeast surface display with protein
reassembly, and they demonstrated that FACS detected reassembly could be used to
measure stabilities and enrich in mutants with a defined range of stabilities.217 A human
fibronectin type III domain (FN3) was split, with one fragment displayed on the cell
surface and the other secreted into the medium. The fragments were fused to two epitopes for fluorescently-labeled antibodies, such that FACS could resolve the display and reassembly levels on each cell. The log of this ratio correlated well with the change in binding energy (R² = 0.8) for a series of mutants that form a β-bulge in FN3. While not
every protein will reassemble and the binding energies may not perfectly match the
stabilities of the full-length proteins, the ability to rapidly determine and sort for absolute
protein stabilities is especially useful. Despite the complicating irreversibility of split
GFP reassembly, Linse and colleagues demonstrated that the split fragments of the B1
domain of protein G (GB1) could drive the reassembly of the known interaction-detection
fragments of GFP, and that fluorescence was related to the thermal stability of the
corresponding GB1 full-length protein with the same mutations.218 This result is
somewhat surprising, considering that in general cellular fluorescence from split GFP
reassembly does not quantitatively correspond to the binding affinity.213 This screen is
also limited to proteins that can be dissected to reassemble, and while it lacks inherent
controls for expression level differences, it is simpler in its implementation than yeast
display complementation screening. Waldo and colleagues introduced a screen for
soluble proteins based on fusion of a “folding reporter” GFP to the C-terminus of a
POI.219 The GFP only folds and becomes fluorescent if the fused POI folds and is
soluble. One substantial improvement to the screen was the dissection of a “super folder”
GFP into a tagging fragment of 15 amino acids from the C-terminus and a 215-amino-acid
“detector” fragment.220 These GFP fragments spontaneously reassemble, but only if the peptide is fused to a folded, soluble POI, and the tag influences the solubility of the fusion less than the original folding reporter. Procedures for HT implementation have been described recently.221 Solubility and stability are not generally directly related, but solubility is another key biophysical property that cannot be predicted and requires HT
screening methods.
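The pulse proteolysis readout discussed above, in which the fraction of intact protein is followed across a urea series, is commonly interpreted with a two-state linear extrapolation model. The sketch below is a hedged illustration of that general model, not a reproduction of the published analysis: the stability and m-value are synthetic, hypothetical numbers, the m-value is assumed to be known or estimated separately, and the temperature is assumed to be 298.15 K.

```python
import math

R_KCAL = 1.987e-3  # gas constant in kcal mol^-1 K^-1

def fraction_folded(urea_m: float, dg_water: float, m_value: float,
                    temp_k: float = 298.15) -> float:
    """Two-state linear extrapolation model: dG(urea) = dG_water - m * [urea]."""
    dg = dg_water - m_value * urea_m
    return 1.0 / (1.0 + math.exp(-dg / (R_KCAL * temp_k)))

def midpoint(urea_series, folded_series) -> float:
    """Linearly interpolate the urea concentration at which half the protein is folded."""
    points = list(zip(urea_series, folded_series))
    for (c0, f0), (c1, f1) in zip(points, points[1:]):
        if f0 >= 0.5 >= f1 and f0 != f1:
            return c0 + (f0 - 0.5) * (c1 - c0) / (f0 - f1)
    raise ValueError("curve does not cross a folded fraction of 0.5")

# Synthetic example; both parameters are hypothetical.
DG_WATER = 6.0  # kcal/mol
M_VALUE = 2.0   # kcal/mol per molar urea

urea = [0.5 * i for i in range(17)]  # 0 to 8 M in 0.5 M steps
folded = [fraction_folded(c, DG_WATER, M_VALUE) for c in urea]

cm = midpoint(urea, folded)   # denaturation midpoint
dg_estimate = M_VALUE * cm    # stability recovered from the midpoint
print(f"Cm = {cm:.2f} M urea, dG = {dg_estimate:.2f} kcal/mol")
```

In a real experiment the folded fractions would come from band intensities on the gel rather than from the model itself, and the recovered stability is only as good as the assumed m-value.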
Figure 14: Principles of Screening for Protein Stability.
Most methods of screening for stability modify the unfolded state or observe some unique property of it, for example by proteolysis, reaction with an exposed cysteine, amide proton exchange, hydrophobic dye binding, or aggregation. Typically, the protein solution is heated or challenged with increasing concentration of chemical denaturant to establish when the signal is observed. Because most HT stability screening methods involve observation of an irreversible reaction, care must be taken in interpreting the data as a change in equilibrium stability. While some methods enable measurements in the presence of other proteins, most require sufficient purification so that other proteins do not produce the unfolding signal.
2.5 Direct, Small Scale Screens for Stability
Of all the traditional methods of measuring protein stability, thermal and chemical
denaturation monitored by intrinsic fluorescence of aromatic amino acids is the most
straightforward to miniaturize. Stites and colleagues developed an early home-built auto
titrator for semi-automated denaturation measurements.222 Edgell, Pielak and colleagues
carried out pioneering work in this area using auto titration methods and robotics with a
standard fluorimeter.223 Dalby and colleagues extended the method considerably by
adapting it to microtiter plates with auto titration, which dramatically increased the
throughput.224 Mayo and coworkers recently coupled this system with computational
design of protein libraries, enabling exhaustive characterization of computational predictions in a reasonable experimental timeframe.225 Dalby and coworkers have gone
on to further miniaturize their method to nanoliter scale using microfluidics, which
enables screening with very small amounts of proteins (perhaps only 10⁸ molecules).226
This approaches a scale where concomitant miniaturization of protein production is a challenge for libraries, but it has great promise immediately for protein-ligand interactions. Of course, these methods rely on the presence of an intrinsic fluorophore, which not all proteins possess, and they require a fair amount of specialized equipment even in their simplest implementation. However, they are likely to produce measurements that directly compare to those taken by standard methods. Hydrogen-deuterium exchange has been used extensively to measure the stability, dynamics and folding of proteins by NMR and mass spectrometry. Oas and Fitzgerald developed an HT screen called stability of unpurified proteins from rates of H/D exchange, or SUPREX.227
Cell lysate from 200 μL of cell culture is exposed to a pulse of D2O in varying
concentrations of chemical denaturant, and the sample is dried with MALDI matrix for
rapid acquisition. The method is complicated by aggregation or low expression. Also,
EX2 conditions (wherein folding is faster than the intrinsic exchange rates of the protons)
are required to extract thermodynamic parameters. But Oas has successfully used this
method to measure protein stabilities in living cells.228 (Gierasch and colleagues have
made similar measurements recently using biarsenical dyes as the readout instead of
MALDI mass spectrometry.229) Fitzgerald and colleagues recently described a variant of
SUPREX based on oxidation rates (SPROX) which addresses some of the complications
of H/D exchange for these sorts of experiments, such as resolving power, ion
suppression, chromatographic separation and reversibility of modification.230 Other reactions can also be used to monitor protein stability. For example, Harbury and colleagues developed a method called misincorporation proton-alkyl exchange (MPAX), which uses weak missense suppressors to make random, residue-specific Cys mutations throughout a protein of interest.231 The burial of these Cys residues is interrogated by
alkylation, which can be read out through mass spectrometry or chemical scission and
PAGE. The method is especially useful for measuring the stabilities of proteins that do
not refold reversibly, since the measurement is made under native conditions. Hellinga,
Oas and colleagues have developed a related HT method called quantitative cysteine
reactivity (QCR), which uses gel-shift as a read out, as well as a fast (fQCR) variant with
a fluorescence readout.232,233 Like H/D exchange, thermodynamic parameters can only
be extracted in the EX2 regime. Also, an appropriate buried Cys residue (ideally only one) is required, or the protein must be engineered with some peril of changing the protein thermodynamics. This method was demonstrated at picomole (nanogram) scale using HT gene fabrication and cell-free transcription-translation, which is a very exciting frontier in HT stability measurements. A somewhat different measurement that is applicable to proteins that unfold irreversibly and aggregate—which represents a large
fraction of interesting proteins—has been called differential static light scattering
(DSLS). Senisterra et al. reported the use of a home-built instrument (which is now
commercially available) that is capable of light scattering measurements of protein
aggregation in 384-well format.234 It is worth noting that 600 nm absorbance in a standard plate reader is also a reasonable way to measure aggregation. The chief
advantage of this method is its simplicity, as no intrinsic or extrinsic probes are required.
Besides the limitation to aggregating proteins, this non-equilibrium method could be confounded by dramatic changes in the kinetics of unfolding or aggregation for different mutants. But these effects appear to be small in proof-of-principle experiments. A variation on this theme is isothermal denaturation (ITD), in which the rate of irreversible denaturation is observed, typically at a temperature just below that of melting. In principle, this denaturation can be observed by loss of a signal such as CD or shift of a signal such as UV absorbance or fluorescence. For proteins that aggregate, light scattering is also possible. ITD measurements are highly reproducible and have been
reported to be more sensitive to small changes in stability, which is especially useful for ligand binding studies. Senisterra et al. adapted the method to HT using their 384-well
scattering apparatus, with the additional advantage that ITD required less protein than
comparable methods.235 ITD measurements do require a priori knowledge of the
protein’s approximate melting temperature, which could be problematic for protein
libraries, and presumably could be very sensitive to changes in kinetics that may not be directly linked to equilibrium stability. Schaeffer and colleagues have introduced an in vitro hybrid of the GFP fusion method for solubility and ITD, which can measure
stability without purification.236 Here, the POI is fused to the N-terminus of GFP. The
protein, purified or in lysate, is then subjected to ITD in HT format. The method, called
GFP-Basta, is only applicable to proteins that aggregate upon unfolding and is limited by
the GFP aggregation and photophysics, but practically these are not very significant
limitations for most POIs. One especially promising method called differential scanning
fluorimetry (DSF) is simple, broadly applicable and requires little specialized equipment.
A method called Thermofluor was developed by 3D Pharmaceuticals, now owned by
Johnson & Johnson, which reports on the perturbation of the melting temperature of a
receptor by a potential ligand through addition of an extrinsic fluorophore.237
Hydrophobic dyes such as ANS are quenched in aqueous solvent but become fluorescent in organic solvent or when bound to molten globules or protein unfolding intermediates.
Most laboratory implementations of DSF, which is used extensively to optimize buffer conditions for crystallography,238,239 use real-time PCR machines, which typically lack filter sets in the blue. Consequently, dyes such as SYPRO Orange have been widely used instead of ANS. RT-PCR machines enable DSF in 96- and 384-well formats with ~20 μL
of solution, where ~1 μg μL-1 solutions are required. Nordlund and colleagues have shown that DSF is applicable to a broad range of proteins but that some proteins bind to
SYPRO Orange in the folded state.238 Magliery and co-workers demonstrated that for a
series of related mutants of a protein, the correspondence between Tm values determined
from CD thermal denaturation and DSF is excellent.99 The reverse-format protein-
engineering implementation of Thermofluor, in which the conditions and ligands are held
constant and the protein varied, was called High-Throughput Thermal Scanning (HTTS).
It has been applied to core and loop libraries of four-helix bundle proteins to elucidate
determinants of stability (Lavinder, J.J., Hari, S.B., Sen, S. and TJM, in preparation).
DSF is surprisingly reversible through the melting point, although dye-protein aggregates
appear upon extended heating of the denatured state.240 It is likely that the dye itself will
perturb the apparent melting point, but the ΔTm values correspond quite well to
calorimetric and spectroscopic measurements. All of these methods also require
miniaturized, high-throughput handling of protein expression, purification, and
conceivably library construction. Growth of bacteria in 1-2 mL of culture in 96 deep-well
plates is the technology of choice for most of these methods, where some amount of
robotic liquid handling for plate or bead-based affinity purifications (particularly IMAC)
is helpful. For the most part, these are achieved with considerable home optimization at
present. Platforms for HT oligonucleotide and gene synthesis and in vitro protein
expression stand to expand the screening front-end further, but these are still far from
straightforward implementation in most labs.241-243
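To make the DSF and HTTS readout discussed above concrete, the following is a minimal sketch of how an apparent Tm and a ΔTm between two variants might be extracted from melting curves. It assumes an idealized two-state sigmoidal signal and takes Tm as the temperature of steepest signal increase. The function names, the synthetic curves, and the temperature grid are all illustrative assumptions and are not taken from any of the cited studies. Real dye-based curves often decline after the transition, so fitting a sigmoid over the transition region is a common alternative.

```python
import math

def melting_temperature(temps, signal):
    """Estimate Tm as the temperature where the central-difference slope dF/dT is largest."""
    best_t, best_slope = None, float("-inf")
    for k in range(1, len(temps) - 1):
        slope = (signal[k + 1] - signal[k - 1]) / (temps[k + 1] - temps[k - 1])
        if slope > best_slope:
            best_t, best_slope = temps[k], slope
    return best_t

def synthetic_melt(temps, tm, width=2.0):
    """Idealized two-state sigmoid; a stand-in for real dye fluorescence data (assumption)."""
    return [1.0 / (1.0 + math.exp((tm - t) / width)) for t in temps]

# Hypothetical scan from 25 to 95 degrees C in 0.5 degree steps.
temps = [25.0 + 0.5 * k for k in range(141)]

# Two made-up variants whose true midpoints differ by 3.5 degrees.
tm_parent = melting_temperature(temps, synthetic_melt(temps, 60.0))
tm_mutant = melting_temperature(temps, synthetic_melt(temps, 63.5))
delta_tm = tm_mutant - tm_parent
print(tm_parent, tm_mutant, delta_tm)
```

Because any dye-induced shift is expected to be similar across closely related variants, the difference delta_tm is the quantity of interest in such a comparison, rather than the absolute Tm values.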
2.6 Membrane Proteins and Antibodies
Two targets of great interest in the pharmaceutical industry are particularly challenging
for adapting to stability screens. Membrane proteins make up a large fraction of all drug
targets, but they are difficult to work with in vitro, particularly for structural studies.
Many recent successes in membrane protein crystallography have been born of strategies
to stabilize the POI.244 The most rapidly expanding area in pharmaceuticals is that of
biologics, which antibodies and antibody-like molecules dominate at present. But their
generally poor biophysical properties make them difficult to engineer and formulate as
drugs. Membrane proteins are often difficult to express or purify and are only stable in
detergent or lipid formulations. Stevens and colleagues adapted cysteine reactivity
reported by fluorescence, similar to the fQCR method, for membrane proteins in
detergent.245 They demonstrated its use on a lipophilic model protein, a monotopic
membrane protein, and an integral membrane protein from the GPCR family. More
recently, Cherezov, Stevens and colleagues have expanded this method by examining
protein unfolding in the lipidic cubic phase with intrinsic fluorescence or fluorescence
upon cysteine modification as a read-out (LCP-Tm).246 The method was additionally used for ITD over long time frames for membrane proteins. Baldwin and coworkers also developed an ITD screen for membrane proteins in detergent.247 DSF has been applied to
membrane proteins, but the high fluorescence background of the dye in detergent is a
complicating factor.248 Dyes that are more specific for proteins over lipids may improve this.
Even many full-length monoclonal antibodies are limited in their use as therapeutics by marginal stability and aggregation. Formats that are more straightforward to engineer and express, such as Fabs and scFvs, often suffer from decreased stability, and more significant engineering for humanization and generation of bispecific species can compromise stability further. DSF and DSLS have both been applied successfully to formulation studies of monoclonal antibodies.249,250 Thermal inactivation screening has
also been used as a means of establishing sufficient stability for scFv variants.251
Cysteine reactivity has been applied to mAb stability, which seems particularly apt given
the importance of disulfide bonds for antibody stability.252 Little has been published on the application of these sorts of screens to engineering antibody stability, but much of this
work is behind industry doors at present.
2.7 Protein Stability from Sequence Statistics
Most of the methods described above are useful for sorting out which members of a library are folded and stable. This enables the researcher to make mutations according to some hypothesis of design or even entirely at random, and locate variants with suitable
physical properties. But it is often not simple to find rare stabilizing mutations, and such
mutations may compromise other features of the protein, such as enzymatic function or expression level. An alternative approach to identifying sites of stabilizing mutations is to turn to statistical analysis of the natural repertoire of the motif, domain, protein or fold of interest. An attractive idea is that making mutations to the most common amino acid in some position of a protein is likely to be beneficial, and indeed these so-called consensus mutations are tolerated and stabilizing far more frequently than at random or even from the best predictions today. But implementation is harder than it sounds.
Multiple-sequence alignment (MSA) is often challenging, especially in poorly conserved regions or loops, leading to high noise. Most sites in proteins are not well conserved, and taking the most common amino acid in these positions is often little better than picking one at random. Moreover, some positions, especially weakly conserved ones, can be seen to vary together—that is, to be correlated—although these correlations are only
sometimes close in space and are of uncertain significance in most cases. The very large
number of protein sequences available today makes these kinds of approaches worthy of
greater attention in the years ahead.
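As a minimal sketch of the consensus idea described above, the code below takes the most common amino acid in each column of a toy alignment and reports how strongly that residue dominates its column. The toy sequences, the gap symbol, the occupancy threshold used to skip gap-rich columns, and the alphabetical tie-breaking rule are all assumptions made for illustration; they are not a description of any particular published pipeline.

```python
from collections import Counter

GAP = "-"  # gap symbol assumed for this sketch

def consensus(msa, min_occupancy=0.45):
    """Return (consensus sequence, frequency of the consensus residue per kept column).

    Columns whose non-gap fraction is at or below min_occupancy are skipped.
    """
    n_seqs = len(msa)
    residues, freqs = [], []
    for column in zip(*msa):
        counts = Counter(aa for aa in column if aa != GAP)
        occupied = sum(counts.values())
        if occupied / n_seqs <= min_occupancy:
            continue  # mostly gaps: treat as an insertion and drop it
        # Highest count wins; ties are broken alphabetically so the result is deterministic.
        aa, n = min(counts.items(), key=lambda kv: (-kv[1], kv[0]))
        residues.append(aa)
        freqs.append(n / occupied)
    return "".join(residues), freqs

# Hypothetical four-sequence alignment, for illustration only.
toy_msa = [
    "MKV-A",
    "MKI-A",
    "MRVGS",
    "MKV-T",
]
seq, freqs = consensus(toy_msa)
print(seq, freqs)
```

In this toy case the last kept column has a consensus residue present in only half of the sequences, which illustrates why the most common amino acid at a weakly conserved site carries little information.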
2.8 Consensus
Steinbacher, Plückthun and colleagues found that about half of the consensus mutations made to an
antibody Vκ domain were stabilizing.117 Steipe and coworkers went on to use this
concept to generate hyperstable VH domains for intracellular expression of Fvs in E.
coli.253 Wyss and colleagues made a number of variants of fungal phytases based on the
consensus of a very small number of closely-related sequences (less than 20), and they
found that even the full consensus sequences were active and significantly
thermostabilized.136,137 (‘Full consensus’ means the most common amino acid from the
MSA was used in every position of the protein.) More recent efforts have focused on the
design of ubiquitous motifs, such as ankyrin repeats and tetratricopeptide repeats.120,129,254
Consensus variants of both of these repeats have been assembled into very stable domains and engineered using library and rational methods for novel binding properties.
The origin of consensus stabilization is not entirely clear. One possibility is that individual proteins in the MSA only avail themselves of as many stabilizing mutations as necessary for function, but that consensus amalgamates these mostly additive mutations.
It is also not yet clear why only half of the mutations are stabilizing. Some light was shed on this recently by Arnold, Hilvert and coworkers, who showed that consensus mutations from library selections were also stabilizing.148 The authors suggested that this
method benefited from the removal of phylogenetic artifacts. Other factors stemming from correlation and poorly conserved residues also likely play a role. A related approach to making consensus mutations is to make “ancestral” mutations by tracing mutations back to early sequences along the phylogenetic path. These mutations also turn out to be stabilizing about half the time.156 Tawfik and colleagues have
incorporated ancestral mutations into the family shuffling of paraoxonase-3 to
successfully yield stable, active chimeric enzymes.255
Box 1 Consensus and correlation
A ‘consensus’ residue is simply the most common amino acid in one position of a family of proteins—that is, in a column of a multiple sequence alignment. It is not always easy to determine the consensus sequence of a protein, because it is not easy to align stretches of sequence that are poorly conserved or have insertions or deletions (such as loops). Also, many positions are only weakly conserved and may use nearly all 20 amino acids with some frequency.
A correlation is, fundamentally, when a pair of residues in two positions is observed more or less frequently than expected by chance. For example, if Ala is seen in position A in 20% of sequences, and if it is in position B in 20% of sequences, then we would expect Ala–Ala pairs in
A–B in 4% of sequences. If we observed it in all 20% of the sequences, that might represent a strong correlation. Compared to consensus, many more sequences are necessary to be confident of the significance of correlations since there are 400 possible pairs in any two positions.
Information theory (e.g. relative entropy and mutual information) can be used to quantify how biased or conserved a position is, and how interconnected the distributions of two positions are.
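The pair-frequency example in Box 1 can be sketched in a few lines of code. The block below compares the observed frequency of a residue pair with the frequency expected if the two positions were independent, and then computes the mutual information between two columns. The ten two-residue "sequences" are invented so that Ala occurs at each position in 20% of sequences and always together; the function names are hypothetical. Note that mutual information estimated from a small alignment is biased upward, which is one reason many sequences are needed.

```python
import math
from collections import Counter

def pair_enrichment(msa, i, j, aa_i, aa_j):
    """Observed and expected (under independence) frequency of aa_i at column i with aa_j at column j."""
    n = len(msa)
    f_i = sum(s[i] == aa_i for s in msa) / n
    f_j = sum(s[j] == aa_j for s in msa) / n
    observed = sum(s[i] == aa_i and s[j] == aa_j for s in msa) / n
    return observed, f_i * f_j

def mutual_information(msa, i, j):
    """Mutual information in bits between columns i and j of the alignment."""
    n = len(msa)
    c_i = Counter(s[i] for s in msa)
    c_j = Counter(s[j] for s in msa)
    c_ij = Counter((s[i], s[j]) for s in msa)
    # Sum of p(a,b) * log2( p(a,b) / (p(a) p(b)) ), written with raw counts.
    return sum((c / n) * math.log2(c * n / (c_i[a] * c_j[b]))
               for (a, b), c in c_ij.items())

# Hypothetical alignment: Ala at each position in 2 of 10 sequences, always paired.
toy_msa = ["AA", "AA", "GL", "GL", "GL", "GL", "SV", "SV", "SV", "SV"]
observed, expected = pair_enrichment(toy_msa, 0, 1, "A", "A")
mi = mutual_information(toy_msa, 0, 1)
print(f"observed {observed:.2f}, expected {expected:.2f}, MI {mi:.2f} bits")
```

For two columns that vary independently the same function returns zero, so the value can be read as a measure of how far the joint distribution departs from the product of the positional distributions.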
2.9 Correlation
An additional layer of complication in the statistical analysis of MSAs is that not all positions are statistically independent. Ranganathan and colleagues developed a method called statistical coupling analysis (SCA), which is a perturbation-based approach to identifying overrepresented pairs of amino acids.164 They showed that inclusion of both
consensus and correlation information was necessary and sufficient for the design of
folded WW domains—meaning that variants that were plausible from positional
distributions alone were often unfolded if they did not capture sufficient correlation
information.166 The meaning of these types of correlations is even less well understood
than the etiology of consensus stabilization. Many correlated residues are not close in
space, but many can also be assembled into networks of interacting residues that connect
distant regions of a protein fold. A number of studies have identified roles for correlated
residues in allosteric regulation.256-258 Ranganathan has recently devised a new kind of
SCA calculation and has used it to identify independent clusters of co-evolving residues
in protein families (sectors).168 Mutations to different sectors in trypsin demonstrated
that one had structural and the other had functional consequences. There is a great deal
of work that lies ahead to understand the meaning of correlations in a general way.
Magliery and Regan applied consensus analysis and SCA to TPR motifs, which
uncovered two subfamilies with distinct alternative networks of interacting residues and
resulted in an algorithm for identifying active-site residues.131,169 Among the most
striking of the results of that study was an explanation for the unusually high charge of a
consensus TPR motif (-7) despite the average TPR having zero net charge. Charge
neutralization occurred by correlations in weakly conserved positions on the surface of
the motif. This effect was not always local (for example, two residues close in space
forming a salt bridge), suggesting at least one mechanism for important non-local correlations. Magliery and colleagues recently engineered two closely related consensus variants of triosephosphate isomerase (TIM) from slightly different sequence databases.259 The two variants differed dramatically in their physical properties and
activity, with one of them having wild-type like kinetics and the other being weakly
active and poorly folded. Both variants were much more substantially different from any
natural TIM than the two variants were from each other. The only apparent difference
between these two variants is the extent to which they complete networks of correlated
residues. This type of host-guest approach will hopefully shed light on the physical
meaning of correlated positions.
2.10 Conclusions and Outlook
First-principles computational methods are likely to remain far from a comprehensive
predictive model for some time, until better potential functions and better treatments of the unfolded state can be incorporated. Even empirical parameterization is very difficult
given our sparse coverage of sequence space in thermodynamics studies. But there has
been a dramatic increase in efforts to bridge that gap in the last 5-10 years in the form of
new HT screens for foldedness, thermodynamic stability, solubility and kinetic stability.
In the next decade, with improved methods of HT gene construction, handling,
expression and purification, these new stability screening methods will give us a vastly
richer and more detailed view of the effects of mutations on protein physical properties.
And in the meantime, these methods are immediately adaptable for screening random
libraries to improve the physical properties of proteins for easier handling, crystallization
and structural studies, and superior biotherapeutics. Application of these methods to
biophysically “unfriendly” proteins that are larger, more complex and do not refold spontaneously is likely to change our view of protein folding for the majority of proteins.
Random screening in the absence of any information is often a slog, and any information to narrow down libraries to find stabilizing mutations is welcome. Protein sequence statistics can be a useful tool for guiding combinatorial experiments and limiting possibilities in difficult engineering experiments. The molecular etiology of the effects of consensus and correlated mutations remains a difficult problem, but the combination of screening methods with these kinds of calculations in the next decade will accelerate research towards an understanding of those effects. Protein stability remains one of the most difficult problems in protein science, but its illumination by experiments that take advantage of large numbers, both experimentally and statistically, offers new hope for a solution in the years ahead.
2.11 Acknowledgements
The authors thank the NIH (R01 GM083114 and U54 NS058183 to TJM) and The Ohio
State University for support. JJL was an NIH CBIP fellow and a fellow of the Great
Rivers affiliate of the AHA. BJS was an NIH CBIP fellow and is a Presidential fellow of
The Ohio State University.
Chapter 3: Consensus Design of Triosephosphate Isomerase
Triosephosphate isomerase by consensus design: dramatic differences in physical
properties and activity of related variants
3.0 Contributions
The following research article was published in the Journal of Molecular Biology under the authorship of Brandon J. Sullivan, Venuka Durani and Thomas J. Magliery. Brandon
Sullivan, Venuka Durani and Thomas Magliery designed the experiments. Brandon
Sullivan and Venuka Durani executed the experiments and all authors examined and interpreted the data. The paper was written by all authors.
3.1 Abstract
Consensus design, the selection of mutations based on the most common amino acid in each position of a multiple sequence alignment, has proven to be an efficient way to engineer stabilized mutants and even to design entire proteins. However, its application has been limited to small motifs or small families of highly related proteins. Also, we have little idea of how information that specifies a protein's properties is distributed between positional effects (consensus) and interactions between positions (correlated
occurrences of amino acids). Here, we designed several consensus variants of triosephosphate isomerase (TIM), a large, diverse family of complex enzymes. The first variant was only weakly active, had molten globular characteristics, and was monomeric at 25 °C despite being based on nearly all dimeric enzymes. A closely related variant from curation of the sequence database resulted in a native-like dimeric TIM with near-diffusion-controlled kinetics. Both enzymes vary substantially (30–40%) from any natural TIM, but they differ from each other in only a relatively small number of unconserved positions. We demonstrate that consensus design is sufficient to engineer a sophisticated protein that requires precise substrate positioning and coordinated loop motion. The difference in oligomeric states and native-like properties for the two consensus variants is not a result of defects in the dimerization interface but rather disparate global properties of the proteins. These results have important implications for the role of correlated amino acids, the ability of TIM to function as a monomer, and the ability of molten globular proteins to carry out complex reactions.
3.2 Introduction
The sequence of amino acids in a protein encodes its physical and functional properties, but our ability to read that code is still very limited.10 For example, there have been great successes in computational prediction and design of proteins in recent years, but we are still far from a comprehensive, accurate model of the thermodynamic consequences of mutations.30,58,60,260 In part, this is because natural proteins are typically only stabilized by 5–15 kcal mol-1 over the unfolded state, and our knowledge of how to model the
unfolded state is poor.203,204 Remarkable functional designs of enzymes have also been achieved recently, but it remains exceedingly difficult to achieve catalytic efficiencies that compare to natural enzymes.61,62,65 The effects of solvation, backbone motion, dynamics, and entropy are largely beyond our ability to predict or design. One method of designing nonnatural sequences with native-like structures and functions is to look to statistical analysis of families of natural proteins. Genomic sequencing has given us vast databases of sequences of proteins that all have approximately the same structure and activity. This is basically a postgenomic formulation of the so-called “inverse folding problem”: what are all sequences in nature that adopt a particular fold?48 In the limit, the conservation and variation of sequence features in a multiple sequence alignment (MSA) must contain all of the information necessary to design stable, active sequences. The question is: how do we read and apply that information? We were particularly interested in determining what information is encoded at the positional level
(consensus/conservation) versus what is encoded by coupling between sites (correlation).
The idea of designing proteins, domains, or motifs from consensus is attractive because it makes intuitive sense that the most common amino acid in each position of an MSA is there for a reason (structural, functional, dynamic, etc.). Consensus sequences of motifs such as the tetratricopeptide repeat (TPR) and ankyrin repeat have been shown to be folded.121,129,254 Enzymes, such as the fungal phytases, have been engineered using sequence consensus and have been shown to be active and stable. These consensus phytases were generated from 13 to 21 highly homologous sequences from near-neighbors in phylogeny.135-137 Consensus-designed proteins generally have had higher
thermal stabilities than the average proteins from which the consensus sequence was
derived; however, some rational design considerations were applied to unconserved sites
in many of these studies. Data from the phytases, antibodies, and thioredoxin suggest
that about half the time, mutation of an amino acid to the most common amino acid in the
MSA for that position is stabilizing.117,118,261-263 On the other hand, the most common
amino acid in an unconserved site presumably has little informational value, and
furthermore, unconserved sites may still be correlated to each other, which is lost in the
consensus. For example, the consensus sequence of TPR motifs has a canonical charge
of −7 although individual TPRs have a 0 ± 2.5 net charge, because the charged residues are largely poorly conserved surface residues that exhibit charge neutralization only when
correlation is considered.169 The distribution of information between consensus and
correlation is not known, although design of WW domains using only consensus versus
consensus plus correlation yielded a much larger fraction of folded proteins with
incorporation of the correlation data.166,167 When triosephosphate isomerase (TIM) was extensively mutated, virtually all structural positions could individually be mutated conservatively (e.g., Gln to Asn) with little effect on activity, but when all positions were simultaneously varied between the natural residue and a conservative replacement, only about 1 in 10^10 was active.264 Therefore, interactions among sites appear to account for a great deal of the information in specifying a folded, active protein, but no experiments to date have elucidated the exact effects of these correlated mutations. To start to answer this question, we proposed to engineer the pure consensus sequence of a complex protein architecture from a large, diverse enzyme family. Presumably, this pure consensus
sequence would scramble or ablate many of the sequence correlations at poorly
conserved sites and, as such, could act as “host” for interrogating the effects of “guest”
correlation mutations. We selected the TIMs for this study, because they are a very well
studied archetypal member of the (β/α)8 proteins that make up 10 % of all biological
catalysts.174,191,265 Because of their glycolytic function in the isomerization of
dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP), virtually
every organism has a TIM and therefore hundreds of sequences are available. TIM
catalyzes a sophisticated reaction with nearly diffusion-limited kinetics and with
coordinated motion in the catalytic cycle.197,199-202 Furthermore, TIM barrel proteins have
generally been difficult to engineer despite their ubiquity in nature.176 Here, we report the construction and characterization of closely related TIM proteins based purely on
consensus, one from a “raw” sequence database and one from a later database curated of
fragments and repeats. The raw consensus TIM (cTIM) is weakly active, poorly folded,
and monomeric, in contrast to nearly all known natural TIMs, which are dimers. The
curated consensus TIM (ccTIM) is dimeric, well folded, and fully active. We demonstrate
that the oligomeric states are not a result of defects at the interface but rather that global
properties of the proteins differ dramatically. Those properties arise from sequence
variations at unconserved sites, where correlated occurrences of amino acids may play a
significant role.
3.3 Results
Consensus TIM. The consensus sequence of all TIMs was determined from the most
common amino acid in each position of the Pfam alignment (version 18.04) of 639
sequences. Because hidden Markov model alignment is not well suited to deal with insertions relative to the seed alignment, the total number of positions in the alignment
(373) is much larger than the average length of a TIM sequence (235 aligned positions).
Consequently, only positions with greater than 45 % occupancy were selected, resulting in a sequence of 248 aa including four unaligned N- and C-terminal residues from
Saccharomyces cerevisiae TIM (S.c. TIM). (S.c. TIM is also 248 aa long.) Because of
the great evolutionary diversity of this ancient enzyme family, the consensus amino acid
sequence is only 70 % identical with that of Tenebrio molitor TIM, its closest known
homolog. The gene for the cTIM was assembled from synthetic oligonucleotides using
a PCR scheme similar to the reassembly step in DNA shuffling.266 The gene was cloned
into two expression vectors, one under the control of the tac promoter and one under the
control of the T7 promoter. The tac construct was transformed into DF502, an
Escherichia coli strain deficient in TIM and several other genes nearby in the
chromosome.267 Growth on lactate and glycerol minimal media was comparable to
complementation with S.c. TIM using the same construct. However, DF502 growth was
inconsistent in our hands, perhaps because of the very slow growth on minimal media
due to the large number of metabolic genes knocked out in this strain. We turned to the
recent Keio collection single-gene knockout of TIM, which we lysogenized with DE3
phage to support transcription from the T7 promoter.268 At 5 μM IPTG, cTIM supported
growth on lactate minimal media in 2–3 days and on glycerol minimal media in 4 days,
while S.c. TIM resulted in growth in about 1 day on both media. The cTIM protein could
be overexpressed at very high levels in E. coli and was purified to near homogeneity using two-step IMAC purification with 6×His tag cleavage by tobacco etch virus (TEV) protease. To eliminate contamination by the endogenous E. coli TIM, the engineered
TIMs were purified from the Keio TIM knockout DE3 strain. The Michaelis–Menten parameters were determined from steady-state kinetics for both directions of the isomerization reaction. The apparent Km values for DHAP and GAP are comparable to
those for S.c. TIM, but the apparent kcat values are reduced by about 10^4-fold (Table 2).
Wild-type TIMs exhibit bimolecular kinetics close to the diffusion limit, but apparently
weak growth can be supported with significant reductions in activity. Therefore, an
active TIM was derived from consensus alone, albeit one with significantly reduced activity. Far-UV circular dichroism (CD) spectra for cTIM and S.c. TIM are similar and consistent with similar (β/α)8 architecture (Fig. 15a). Thermal denaturation was followed
by CD spectroscopy at 222 nm (Fig. 15b). S.c. TIM unfolds in a single, irreversible step
at about 60 °C. cTIM exhibits a similar pretransition baseline to S.c. TIM but does not unfold in a single step and is only ∼50 % unfolded at 95 °C. Unlike S.c. TIM, which precipitates at 95 °C, cTIM shows no signs of precipitation and exhibits some reversibility on cooling from 95 °C. This behavior is consistent with the thermal stabilization that has been observed for consensus mutations, although it is possible that more molten globule character is also exhibited by cTIM.121,129,135-137,254,269 With the
exception of a few tetrameric TIMs from thermophiles, all known TIMs are
homodimeric. The structure of TIM suggests that dimerization is necessary for full assembly of the active site by the interdigitation of loop 3 from the opposite monomer,
and engineered monomeric TIMs exhibit kcat/Km values reduced by about 10^4-fold.182,183,187,189,270 The quaternary structure of cTIM was determined by gel-filtration chromatography (Fig. 15c). cTIM elutes significantly after S.c. TIM. Elution volumes were compared to a standard curve to determine apparent molecular masses; S.c. TIM eluted as the expected dimer (∼56 kDa), but the consensus enzyme elutes as a monomer at room temperature with an apparent molecular mass of ∼29 kDa. Surprisingly, the consensus sequence of over 600 dimeric proteins is a monomer.
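The conversion of an elution volume into an apparent molecular mass by way of a standard curve can be sketched as follows. The sketch assumes the common practice of fitting log10(molecular mass) as a linear function of elution volume for a set of globular standards. The standards and elution volumes below are invented for illustration and are not the calibration data used in this work.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def apparent_mass_kda(elution_ml, slope, intercept):
    """Read an apparent mass (kDa) off the log-linear standard curve."""
    return 10 ** (slope * elution_ml + intercept)

# Hypothetical standards as (elution volume in mL, mass in kDa); not from the study.
standards = [(10.0, 100.0), (12.5, 31.6), (15.0, 10.0)]
slope, intercept = fit_line([v for v, _ in standards],
                            [math.log10(m) for _, m in standards])

# A hypothetical sample eluting at 11.5 mL.
sample_mass = apparent_mass_kda(11.5, slope, intercept)
print(round(sample_mass, 1))
```

Because the curve is calibrated on compact globular standards, an expanded or poorly folded species can elute earlier than its true mass would predict, which is worth keeping in mind when interpreting apparent masses for molten globular proteins.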
Table 2: Kinetic Data for Consensus TIMs.
Figure 15: cTIM Structure and Stability.
(a) CD wavelength spectrum of cTIM and S.c. TIM. (b) Thermal melt and cooling of cTIM and S.c. TIM from the 222-nm CD data. Data collected at increasing temperatures are shown as closed points while data points collected during the reverse melt are shown open. (c) Gel-filtration chromatography shows that S.c. TIM elutes as a dimer, but cTIM elutes later with calculated molecular mass corresponding to monomeric TIM.
Figure 16: ir-cTIM Design.
(a) The crystal structure of S.c. TIM (2YPI) is shown as an open monomer. The active-site bound inhibitor 2PG is shown in purple. The 12 mutations between cTIM and ir-cTIM are shown as sticks. These residues are all within 5 Å of the second chain, which reaches nearly into the active site. (b) The same rendering as in (a) but with the full dimer shown. (c) The CD wavelength spectrum of ir-cTIM shows similar ellipticity at 222 nm but significantly more signal at 205 nm.
Engineering the interface of cTIM. Although the monomeric state of cTIM was a surprise, its activity is consistent with TIM variants intentionally engineered to be monomers.182,183,187,189,270 These attempts to monomerize TIM involved deletions in the
interfacial loop 3 and mutations that reversed charge pairing. We hypothesized that by
choosing the most common amino acid at each position of cTIM, we had scrambled
necessary amino acid interactions (i.e., correlations) at the dimer interface. To examine
this hypothesis, we reverted the dimerization interface to the sequence observed in S.c.
TIM, which is known to be dimeric. The 1YPI crystal structure reveals 40 residues within 5 Å of the opposite monomer. The 12 interface residues that differed between cTIM and S.c. TIM were mutated in cTIM to create an interface reversion cTIM (ir-cTIM; Fig. 16a/b). The ir-cTIM was purified in similar yield to the original cTIM. CD spectra are similar, but ir-cTIM exhibits greater signal at 205 nm, suggesting more random coil (Fig. 16c). The thermal melts monitored at 222 nm were essentially identical.
By gel-filtration chromatography, ir-cTIM elutes at a calculated molecular mass slightly larger than that of cTIM at room temperature (∼42 kDa, Fig. 17a). Sedimentation velocity by analytical ultracentrifugation (AUC) confirmed that the protein is still monomeric at room temperature (Fig. 17b). Furthermore, ir-cTIM did not exhibit concentration-dependent oligomerization over a 10-fold range of concentrations (0.15–
1.5 mg mL-1). The activity of ir-cTIM was decreased compared to cTIM, and ir-cTIM failed to
complement the Keio TIM knockout on minimal media. When the gel-filtration
chromatography was repeated at 4 °C (Fig. 17a), all three of the proteins (S.c. TIM,
cTIM, and ir-cTIM) eluted as dimers. For cTIM, a shoulder on the dimer-weight peak
suggests that both monomer and dimer are populated at 4 °C and 37 μM (1 mg mL-1), which implies that this concentration is close to the Kd at this temperature. These results
together suggest that the monomeric states of cTIM and ir-cTIM at room temperature
may not be the result of inherent defects in the dimerization interface but rather
nonnative global properties of the cTIM scaffold. We also analyzed the binding of the
three proteins to the hydrophobic dye 1-anilinonaphthalene-8-sulfonic acid (ANS). ANS is quenched in aqueous buffer but fluoresces strongly in lower dielectric environments
such as organic solvent or when bound in the core of a protein. ANS binding is taken to be a sign of fluid tertiary structure exhibited by molten globules.271,272 S.c. TIM shows a
weak fluorescence emission peak at 418 nm, but both cTIM and ir-cTIM have strong red-shifted fluorescence with peaks at 460 nm (Fig. 17c). The 600-MHz ¹H,¹⁵N-heteronuclear single quantum coherence (HSQC) NMR spectrum of cTIM, however, displays a fair amount of amide peak dispersion for a protein of this size (Fig. 18). Taken together,
the biophysical data suggest that cTIM is monomeric and not as well folded as native
TIMs at room temperature and above.
Figure 17: ir-cTIM Characterization.
(a) The elution volume from gel-filtration chromatography of ir-cTIM corresponds to a molecular mass close to monomer. At lower temperatures (4 °C), cTIM and ir-cTIM elute as dimers with a shoulder for monomeric species. (b) Sedimentation velocity shows that ir-cTIM is monomeric with no concentration-dependent oligomerization. (c) ANS binding of S.c. TIM exhibits a weak fluorescence peak at 420 nm. cTIM and ir-cTIM exhibit strong fluorescence with a red-shifted maximum at 460 nm, suggesting that they are both molten globular.
Figure 18: ¹H,¹⁵N-HSQC NMR of cTIM and S.c. TIM.
The NMR spectrum of cTIM (MW = 26 kDa) is shown on the left and that of S.c. TIM (MW = 52 kDa) on the right.
Concentration and temperature studies. One could imagine that the weak activity of
cTIM is due to weak activity in the monomer or to a small population of dimer. To
examine further the weak activity of cTIM, we observed single-point kinetics over a
range of enzyme concentrations at 4 and 37 °C (Fig. 19). S.c. TIM, which is dimeric at
both temperatures across the whole range of concentrations (16, 32, and 64 pM),
increased in activity linearly with respect to concentration at both temperatures.
Furthermore, there was a 13-fold decrease in activity at each concentration when the
reaction was performed at 4 °C versus 37 °C. When cTIM was assayed under the same
conditions (at 60–240 μM enzyme), we still observed a linear increase in activity with
respect to concentration at both temperatures, but the activity was 80-fold lower at the
lower temperature for all three concentrations. If activity required dimerization, we
would have expected a nonlinear increase in activity at increasing concentration, as more of the dimeric state is populated, and we would have expected a smaller decrease in activity between 37 and 4 °C at all concentrations, since cTIM goes from mostly
monomeric to mostly dimeric under these conditions. The composite data suggest that
cTIM is active as a monomer with molten globular properties. It is worth noting that the dimeric species seen at 4 °C in cTIM and ir-cTIM may not be native-like dimers.
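The expectation of a nonlinear concentration dependence can be illustrated with a simple monomer-dimer equilibrium. The sketch below is not part of the original analysis: it assumes an ideal 2M ⇌ D equilibrium with Kd = [M]²/[D], and the function names and the Kd of 37 μM (borrowed loosely from the gel-filtration observation at 4 °C) are illustrative assumptions only.

```python
from math import sqrt


def monomer_conc(total, kd):
    """Free monomer concentration for 2M <=> D with Kd = [M]^2 / [D].

    `total` is the total protein concentration in monomer units,
    i.e. [M] + 2[D]; both arguments share the same units (e.g. uM).
    """
    # Mass balance gives 2[M]^2/Kd + [M] - total = 0; take the positive root.
    return kd * (sqrt(1.0 + 8.0 * total / kd) - 1.0) / 4.0


def dimer_conc(total, kd):
    """Dimer concentration implied by the mass balance."""
    return (total - monomer_conc(total, kd)) / 2.0


def fold_change_on_doubling(total, kd):
    """Factor by which [D] rises when the total protein is doubled."""
    return dimer_conc(2.0 * total, kd) / dimer_conc(total, kd)


if __name__ == "__main__":
    kd = 37.0  # uM; hypothetical value, not a measured constant
    for total in (60.0, 120.0, 240.0):  # uM, the cTIM range quoted in the text
        frac = 2.0 * dimer_conc(total, kd) / total
        print(f"{total:6.1f} uM total: {frac:.2f} of chains in dimer, "
              f"x{fold_change_on_doubling(total, kd):.2f} dimer on doubling")
```

In this idealized model, doubling the total protein always raises the dimer concentration by more than twofold, approaching fourfold well below Kd and twofold well above it. An activity that tracked only the dimer would therefore be expected to rise faster than linearly with enzyme concentration whenever a substantial monomer population remains.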
Figure 19: Temperature and Concentration Dependent Activity.
The activity of cTIM and S.c. TIM studied under a series of temperatures and concentrations. The activity at 37 °C and 1×[E] was arbitrarily set at unity (100%) for both enzymes. The activity doubled and halved for the wild-type enzyme when the concentrations were increased and decreased twofold, respectively. This occurred at both temperatures. If cTIM were active as the dimer, one would expect doubling the concentration of enzyme to have a nonlinear effect on activity. Lowering the temperature from 37 to 4 °C led to a 13-fold reduction in reaction rate for the wild-type enzyme at each concentration. At 4 °C, we observe cTIM dimers by gel filtration. All other things being equal, if cTIM dimers were the active unit, we would expect less than a 13-fold decrease for cTIM. In fact, we see the opposite; the average activity decreases 80-fold between 37 and 4 °C at each enzyme concentration.
Database curation. A third consensus TIM variant that we engineered shed light on the properties of the original cTIM. When we began the analysis for correlated occurrences of amino acids, we downloaded the then-current version (22.0) of the Pfam database and curated it to remove repeated sequences and sequence fragments that did not represent full genes. More precisely, sequences with fewer than 205 aa (351 sequences) and exact sequence repeats (107 sequences) were removed from the 1239-sequence database to yield 781 nonredundant full-length sequences. A new ccTIM was created using a similar approach to occupancy as described for cTIM, resulting in a 248-aa sequence with 36 sequence differences from cTIM (34 substitutions, 1 insertion, and 1 deletion, Fig. 20a/b). There was a single position in the alignment (which aligned with S.c. TIM residue 49) that was equally occupied by two residues: alanine and glutamine. The position was arbitrarily chosen to be Gln. The differences between cTIM and ccTIM arise from unconserved positions in which the most common amino acid differs, and consequently, we expected these changes to have little impact. The amino acid bias of a position can be quantified by calculating the relative entropy between the positional distribution and the distribution of amino acids in a neutral reference state, such as amino acid usage in all open reading frames in yeast. From this calculation, it is evident that only unbiased or weakly biased positions were affected (Fig. 20b). These positions tolerate virtually any amino acid in all TIMs, and therefore, only minor differences were anticipated between cTIM and ccTIM.
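The curation and scoring steps described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the scripts used for this work: the function names are hypothetical, gaps are assumed to be written as "-" and are ignored when counting, the relative entropy is taken as D = Σ p(a) ln[p(a)/q(a)] with natural logarithms, and a uniform background stands in for the yeast open-reading-frame usage, whose values are not reproduced here.

```python
from collections import Counter
from math import log

GAP = "-"  # assumed gap symbol in the alignment
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def curate(seqs, min_len=205):
    """Drop fragments shorter than min_len and exact repeats (first copy kept)."""
    seen, kept = set(), []
    for s in seqs:
        if len(s) < min_len or s in seen:
            continue
        seen.add(s)
        kept.append(s)
    return kept


def column_counts(alignment, col):
    """Count residues (gaps ignored) in one column of an aligned set."""
    return Counter(seq[col] for seq in alignment if seq[col] != GAP)


def consensus_candidates(counts):
    """All residues tied for most common; more than one entry means a tie."""
    if not counts:
        return []
    top = max(counts.values())
    return sorted(a for a, n in counts.items() if n == top)


def uniform_background():
    """Placeholder reference state; real work would use measured usage."""
    return {a: 1.0 / len(AMINO_ACIDS) for a in AMINO_ACIDS}


def relative_entropy(counts, background):
    """D(p || q) = sum_a p(a) * ln(p(a) / q(a)) for one alignment column."""
    total = sum(counts.values())
    return sum((n / total) * log((n / total) / background[a])
               for a, n in counts.items())
```

Returning every tied residue, rather than silently picking one, makes cases like the Ala/Gln tie explicit so that the arbitrary choice is made deliberately. A column scoring near zero matches the reference distribution (unbiased), while a fully conserved column scores ln 20 against the uniform placeholder.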
Figure 20: Comparison of Consensus TIM sequences.
(a) Sequence alignment of S.c. TIM and consensus TIMs. Secondary structure shown for S.c. TIM with interface residues (within 5 Å of chain b) shown in red and active-site residues marked as stars. Periods denote the same amino acid as cTIM. (b) Plot showing the relative entropy (i.e., conservation) of each position in the TIM alignment. Residues that are mutated between cTIM and ccTIM are shown in black, while all other positions are shown in gray.
Curated consensus TIM. ccTIM expresses well in bacteria with yields approaching 50 mg L⁻¹. CD wavelength spectra and thermal melt traces were essentially the same as
those of cTIM (Fig. 21a/b). The ellipticities for the 222-nm minima corresponding to α-helical structure are all within 7% when normalized for protein concentration, which was
confirmed by SDS-PAGE and amino acid analysis. However, other biophysical
properties turned out to be starkly different. When the thermal melt is reversed from 95
to 25 °C, ccTIM refolds almost quantitatively. There is a red shift in emission upon ANS
binding, but the very low level of fluorescence suggests that ccTIM is much less molten than cTIM (Fig. 22c). The protein elutes from a gel-filtration column at room
temperature with an apparent molecular mass of 66 kDa, slightly more than that of S.c.
TIM or the calculated dimeric mass (Fig. 22a/b). AUC sedimentation velocity studies confirm that the protein is dimeric (50.5 kDa with 95% confidence) with less than 2% forming higher aggregates (Fig. 22b). ccTIM is nearly as active as wild-type TIMs, with comparable DHAP and GAP Km values and kcat values of 10⁴–10⁵ min⁻¹. ccTIM
complements growth in the Keio TIM knockout, leading to growth on minimal media
similar to that of S.c. TIM and faster than that of cTIM (Fig. 21c). Surprisingly, although cTIM and ccTIM differ only in a relatively small number of unconserved positions and have similar structural and thermodynamic properties, cTIM is a molten globular monomer with weak activity and ccTIM is a native-like structured dimer with wild-type
activity.
Figure 21: ccTIM Characterization.
(a) CD wavelength spectra of consensus TIMs. (b) The consensus variants share similar unfolding patterns when ellipticity is monitored at 222 nm with temperatures ramping from 25 to 95 °C. If the melted samples are cooled back to room temperature, cTIM and ccTIM refold significantly as judged by an increase in ellipticity. ccTIM regains 95% of its initial ellipticity. (c) In vivo characterization of TIMs on lactate minimal media in the absence of IPTG. After 3 days of leaky expression at 37 °C, all but ir-cTIM complement the Keio(DE3) TIM knockout.
Figure 22: Structure of ccTIM.
(a) ccTIM elutes near the calculated volume corresponding to dimer by gel-filtration chromatography. (b) Sedimentation velocity confirms that ccTIM is dimeric with no concentration dependence between 0.16 and 1.6 mg mL⁻¹. (c) ANS binding of ccTIM yields a very weak fluorescence at 460 nm.
Details of kinetic characterization. The catalyst of an isomerization reaction cannot alter the thermodynamic equilibrium between its substrates, so the kinetic constants measured in the two directions are constrained by the Haldane relationship. For TIM, the Haldane ratio, (kcat/Km for GAP)/(kcat/Km for DHAP), has been reported196 to be about 22. The consensus-designed variants reported in Table 2 apparently have Haldane ratios of 50 for cTIM and 75 for ccTIM, representing 2- to 3-fold combined error in the kcat/Km values.
The majority of this error is manifested in the inflation of DHAP Km values by
competitive arsenate inhibition.273 For cTIM, accurate determination of kcat and Km is further complicated by what appears to be substrate inhibition at high concentrations of DHAP (Fig. 28). In the case of ccTIM, we estimated the Ki of arsenate
by analyzing the DHAP reaction in the presence and absence of arsenate (Fig. 23). The
Ki, 5 ± 2 mM, yields an adjusted DHAP Km of ∼2.4 mM, which translates to a Haldane relationship of 35 ± 13. This is within the range of previously reported data.
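The arithmetic behind this correction can be sketched with the textbook relation for competitive inhibition, Km,app = Km(1 + [I]/Ki). The code below is an illustration under that assumption rather than the exact fitting procedure used here; the function names are hypothetical, and the Km and kcat values in the usage example are placeholders, not entries from Table 2.

```python
def apparent_km(km, inhibitor, ki):
    """Km observed in the presence of a competitive inhibitor."""
    return km * (1.0 + inhibitor / ki)


def corrected_km(km_app, inhibitor, ki):
    """Invert the relation to recover the inhibitor-free Km."""
    return km_app / (1.0 + inhibitor / ki)


def ki_from_km_shift(km, km_app, inhibitor):
    """Estimate Ki from Km measured without and with inhibitor."""
    return inhibitor / (km_app / km - 1.0)


def haldane_ratio(kcat_gap, km_gap, kcat_dhap, km_dhap):
    """(kcat/Km for GAP) / (kcat/Km for DHAP)."""
    return (kcat_gap / km_gap) / (kcat_dhap / km_dhap)


if __name__ == "__main__":
    arsenate, ki = 8.3, 5.0  # mM; concentration from Fig. 23, Ki from the text
    km_app = 6.0             # mM; placeholder apparent DHAP Km
    km_true = corrected_km(km_app, arsenate, ki)
    # Placeholder kcat and GAP Km values, chosen only to show the direction
    # of the effect: lowering the DHAP Km lowers the Haldane ratio.
    before = haldane_ratio(1.0e5, 1.0, 1.0e4, km_app)
    after = haldane_ratio(1.0e5, 1.0, 1.0e4, km_true)
    print(f"corrected DHAP Km = {km_true:.2f} mM; "
          f"Haldane ratio {before:.1f} -> {after:.1f}")
```

Because arsenate inflates only the DHAP Km, removing its contribution raises kcat/Km for DHAP and lowers the apparent Haldane ratio by the factor (1 + [I]/Ki), which is the direction of the adjustment described above.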
Figure 23: Arsenate Inhibition of ccTIM.
The Michaelis-Menten kinetics for ccTIM were measured in the absence and presence of 8.3 mM sodium arsenate. The increase in Km with maintenance of Vmax suggests arsenate is a competitive inhibitor, which is supported by previous data. The Ki and Km for arsenate were estimated using the equations: