Engineering Proteins from Sequence Statistics: Identifying and Understanding the Roles

of Conservation and Correlation in Triosephosphate Isomerase

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Brandon Joseph Sullivan, B.S.

Graduate Program in the Ohio State Biochemistry Program

The Ohio State University

2011

Dissertation Committee:

Thomas J. Magliery, Advisor

Mark P. Foster

William C. Ray

Copyright by

Brandon Joseph Sullivan

2011

Abstract

The structure, function and dynamics of proteins are determined by the physical and chemical properties of their amino acids. Unfortunately, the information encapsulated within a position or between positions is poorly understood. Multiple sequence alignments of protein families allow us to interrogate these questions statistically. Here, we describe the characterization of bioinformatically designed variants of triosephosphate isomerase (TIM). First, we review the state of the art for engineering proteins with increased stability. We examine two methodologies that benefit from the availability of large numbers - high-throughput screening and sequence statistics of protein families. Second, we have deconvoluted what properties are encoded within a position (conservation) and between positions (correlations) by designing TIMs in which each position is the most common amino acid in the multiple sequence alignment. We found that a consensus TIM from a raw sequence database performs the complex isomerization reaction with weak activity as a dynamic molten globule. Furthermore, we have confirmed that the monomeric species is the catalytically active conformation despite being designed from 600+ dimeric proteins. A second consensus TIM from a curated dataset is well folded, has wild-type activity and is dimeric, but it differs from the raw consensus TIM at only 35 nonconserved positions. These two TIMs differ in the

fraction of dataset sequences from and . These distribution

differences have led to the breaking and altering of networks of statistical correlations at

nonconserved positions which we demonstrate with mutual information and subset

perturbation calculations. Additionally, we show that the curated consensus TIM is an extreme thermostable . The protein remains half folded at 95 °C and may be the only TIM to completely refold after thermal denaturation.

Third, we wished to understand the determinants of protein stability -- one of biochemistry's most difficult questions. It has been shown that consensus mutations improve the stability of native proteins approximately half the time, but there is no a priori technique to predict which consensus mutations will be stabilizing. We have developed a double-sieve filter that selects stabilizing mutations based on extent of conservation and statistical independence from other positions within the multiple sequence alignment. These two mathematical tests reliably predict stabilizing mutations with greater than 90% accuracy. The statistical algorithm was used to select 15 consensus mutations that together improved the melting temperature of wild-type TIM by nearly 10 °C.

Finally, we designed and characterized a model system for testing the effects of statistically correlated residues. The TIM-knockout from the Keio Collection was engineered for T7 expression and tested for TIM activity complementation. The single knockout exhibits differential growth that correlates well with in vitro specific activities. The design and characterization of two libraries are proposed to test the relationship between correlations and protein fitness.

Dedication

Mòran taing

To my family - Heidi, Keegan, Killian, Merlin and Addison.

To my Parents - Brian and Kathy Sullivan

Acknowledgments

Trí na chéile a thógtar na cáisléain

As Venuka, Tom and I developed this project we observed several residues that interact with many other positions - we deemed these as very important residues that define the protein. Looking back, I have adopted a different perspective. I believe that these residues alone are meaningless and only achieve importance through the contribution of the other positions. In that same light, all of my accomplishments have been dependent on the support of my family, friends, mentors and labmates. For that, I owe them everything.

I first want to thank my undergraduate advisor, David Wells and my many professors who developed my love for . When I wished to return to academics, it was Sean

Taylor who forwarded my curriculum vitae to newly hired faculty, Thomas Magliery. I thank Sean for the introduction and thank Tom for the great opportunity to join and start his lab.

I am highly appreciative of all the administrative help and talent I have been blessed with throughout these years, particularly Che Maxwell, Nicole Wade, Peter Sanders, Judith Brown and Jennifer Hambach. I also thank Kevin Dill and Jerry Park for all their help. I have thoroughly loved my opportunities to teach for the department. I thank the entire chemistry staff for their training, support, advice and guidance - especially Yiying Wu, Steven Kroner, Christopher Callam, Matthew Stoltzfus, Mary Bailey, Robert Tatz, Eric Heine, Holly Wheaton, Tami Sizemore and Christopher Hadad. I am also grateful to my many students who have brought me great pride and inspiration. On the other side of the desk, I have enjoyed all of my coursework because of my amazing professors: Dehua Pei, Russ Hille, David Bisaro, Paul Herman, George Marzluf, Mark Foster, Ross Dalbey, Mark Pfeiffer, Jovica Badjic, Thomas Magliery, Charles Bell, Chenglong Li, Michael Chan and Will Ray. I am also highly appreciative of Ross Dalbey and Jill Rafael-Fortney for directing the Ohio State Biochemistry Program and providing incredible mentorship.

I thank David Hart for the weekly Science magazines and for being awesome. I thank surface tension for saving numerous experiments in my graduate career. I am incredibly thankful for our Graduate School and the wonderful work of Kathleen Wallace, Karen Mayer and Dean Patrick Osmer. I am humbled by their nominations and kind words. Additional thanks are due to Kathleen Wallace and Karen Mayer for their efforts regarding the Preparing Future Faculty Program and my mentor Heather Rhodes of Denison University.

I am also grateful for the many professors who have inspired me in my tenure at Ohio

State. Their efforts continue to push and inspire me, especially: Pehr Harbury, Vern

Schramm, Jay Keasling, Amy Keating, David Liu, Shelley Copley, David Baker, Peter

Schultz, Julius Rebek, , Dan Bolon, Patricia Babbitt and Daniel Nocera.

I am additionally thankful to Pehr and his student Kierstin for MPAX training.

Receiving a doctorate is truly an endurance event. I often feel that completing the journey has been made easier through my participation in endurance sports. Whether it has been a century ride, a triathlon, a marathon or a 15 mile swim, the mental strength, stubbornness and stress relief have been invaluable. These traits have been second only to the friends and teammates who have inspired and pushed me to completion. In particular, I am thankful to labmates turned workout buddies: Matthew Heberling, Ely Porter, Sarah Johnston and Ted Schoenfeldt. I thank my swim team, the Columbus Sharks, for making 5:30 am workouts seem like a good idea. In particular I thank my coaches Tracy Hendershot and Bo Martin and teammate Evan Morrison for cultivating some truly insane swims.

I am grateful for the many friends and classmates I have made in graduate school. In particular, I wish to thank Christopher Jones, Jeffrey Joyner, John Shimko, Ross Wilson, Ian Kleckner and Kevin Fiala. These students set high benchmarks and provided friendly competition. I am deeply thankful for my talented (and crazy) labmates. Jason Lavinder was a great friend, fellow football fan, and labmate. If I had known Jason before graduate school, I would have bought stock in Wendy's and retired after receiving my Ph.D. Some of my favorite Magliery memories took place on the golf course with Jason and the awkward brilliance of Sanjay Hari. Quiz team, snow pants, printer cartridges, the -20, lab lingo, wooden seat, B, the Inner Game. Man - I miss Sanjay. I am also incredibly fortunate to have adopted two lab sisters in Lihua Nie and Brinda R-a-m-a-s-u-b-r-a-m-a-n-i-a-n. Lihua has taught me that no detail is too small, to be persistent and never come up short, and that setbacks are often little. She has also taught me that a trilingual-pint-sized-analytical-organic-biological- wielding a hammer is a force to be reckoned with. In all seriousness, Lihua is the hardest working individual I have ever met. It has been an honor working with her. Brinda has been a fantastic friend. She has worked on one of the most difficult projects in the lab with grace, patience and unparalleled persistence. It was a great joy for our families to bring Sahana Umbah@@ and Keegan into the world only six weeks apart. Brinda also taught me that "isosbestic point" is a curse word. I also wish to thank the Genomic Design Group, especially Venuka Durani, Nicholas Callahan, Deepamali Perera and Sidharth Mohan. Of this group, Venuka deserves a special thanks. This project would not have been possible without her help. She continually provided invaluable feedback, suggestions, time and the techniques that brought this dissertation to fruition. I am most grateful and happy to have shared authorships with her. I thank David Mata for autoclaving the rotor in the middle of my electrocompetent cell preparation.

One of my favorite aspects of graduate school has been the mentoring of undergraduates. I have been so lucky to have shared my project with incredibly talented people including Trixy Syu, Miriam Thomas, Deepti Mathur, Tran Nguyen and Samantha Rojas. These students continually surprised me, contributed significantly to the work written here, and brightened each day in the lab. I am so happy that I was able to work side by side with these amazing undergraduates.

I also thank my committee members for their advice, time and support - Ross Dalbey, Mark Foster, Will Ray and Tom Magliery. In particular, I wish to thank my advisor, Thomas Magliery. I will never be able to repay Tom for all that I have received. He gave me the opportunity, the resources, the mentorship and the confidence and tools to succeed. He has changed my life and for that I will always be indebted. Additionally, I thank Tom for showing me in my second year how to read a meniscus and for pointing out that "6-0-start" is quicker to type than "1-0-0-start" on the microwave.

I also wish to thank those who funded my education. I thank the faculty of the Ohio State Biochemistry Program, in addition to The Ohio State University, for funding our first year of study. I am grateful to the National Institutes of Health, which provided two years of funding through the Chemistry-Biology Interface Training Program. I appreciate the University and Department of Chemistry for allowing me to teach and earn my tuition and stipend. I am thankful to Tom for supporting me as a Graduate Research Assistant. Lastly, I thank the University for the Presidential Fellowship which supported my dissertation year at Ohio State.

Finally and most deservingly, I thank my family. I thank my in-laws Clarence and Sharon Zielke for their support and temporary roof. I thank my parents for their love, pride and support. None of this would have been possible without their continued lessons and advice. I am also thankful for the rounds of golf I have shared with my parents - the perfect four-hour distraction from a rough week in lab. I am incredibly grateful for Heidi, Keegan, Killian, Merlin and Addison. Merlin and Addison, my pups, have provided infinite smiles and ridiculousness to my life. It's cliché, but I love their excitement and happiness every evening when I come home from lab. I want to thank my son, Keegan - the happiest kid I have ever seen. He ensures that every day starts and ends with a smile. Elmo, bananas, elephants, a stuffed dog named Kona, Rocket and "more" - these are the things that now bring me happiness. I am thankful for Killian and look forward to forging many memories with him. Finally, I want to thank Heidi. I have come to realize that receiving a Ph.D. takes persistence and hard work, but being married to someone earning that degree takes unparalleled patience and sacrifice. I thank Heidi for that and for meeting those daily challenges with absolute grace. This degree would not have been possible without her love and support. Heidi, thank you.

Trí na chéile a thógtar na cáisléain

Vita

2000 ...... Saint Charles Preparatory School

2004 ...... B.S. Biology, The Ohio State University

2006-2011 ...... Graduate Associate, Ohio State Biochemistry Program, The Ohio State University

Publications

High-throughput thermal scanning: a general, rapid dye-binding thermal shift screen for protein engineering. Jason J. Lavinder, Sanjay B. Hari, Brandon J. Sullivan and Thomas J. Magliery. J Am Chem Soc 2009. 131(11): 3794-3795.

Protein stability by number: high-throughput and statistical approaches to one of protein science’s most difficult problems. Thomas J. Magliery, Jason J. Lavinder and Brandon J. Sullivan. Curr Opin Chem Biol 2011. 15(3): 443-451.

Triosephosphate isomerase by consensus design: Dramatic differences in physical properties and activity of related variants. Brandon J. Sullivan, Venuka Durani and Thomas J. Magliery. J Mol Biol 2011. 413(1): 195-208.

Fields of Study

Major Field: Biochemistry

Specialization: Protein Engineering and Sequence Statistics

Table of Contents

Abstract
Dedication
Acknowledgments
Vita
Publications
Fields of Study
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 The Importance of Proteins
1.2 Protein Structure and Function
1.3 The Thermodynamic Hypothesis
1.4 The Sequence-Structure-Function Relationship
1.5 The Protein Folding Problem
1.6 Computational Structure Prediction
1.7 The Inverse Folding Problem
1.8 Empirical Protein Science
1.9 The Genomic Era and Sequence Statistics
1.9.1 Conservation Statistics
1.9.2 Ancestral Statistics
1.9.3 Correlation Statistics
1.10 Triosephosphate Isomerase
1.11 Dissertation Synopsis

Chapter 2: Protein Stability by Number
2.0 Contributions
2.1 Abstract
2.2 Introduction
2.3 Screening for Protein Stability
2.4 Inferential Screens for Protein Stability
2.5 Direct, Small Scale Screens for Stability
2.6 Membrane Proteins and
2.7 Protein Stability from Sequence Statistics
2.8 Consensus
2.9 Correlation
2.10 Conclusions and Outlook
2.11 Acknowledgements

Chapter 3: Consensus Design of Triosephosphate Isomerase
3.0 Contributions
3.1 Abstract
3.2 Introduction
3.3 Results
3.4 Discussion
3.5 Acknowledgements
3.6 Kinetic Plots

Chapter 4: Protein Stability and Sequence Statistics
4.0 Contributions
4.1 Abstract
4.2 Introduction
4.3 Results
4.4 Discussion
4.5 Acknowledgements

Chapter 5: In vivo Analyses of Triosephosphate Isomerase
5.0 Contributions
5.1 Abstract
5.2 Introduction
5.3 Results and Discussion
5.4 Future Directions
5.5 Outlook
5.6 Acknowledgements

Chapter 6: Materials and Methods
6.1 Sequence Statistics
6.1.1 Databases and Curation
6.1.2 Conservation
6.1.3 Correlation
6.2 Chapter 3 Methods
6.2.1 Sequences and Cloning
6.2.2 Expression
6.2.3 Purification
6.2.4 Purity and Yield
6.2.5
6.2.6 Gel Filtration
6.2.7 Analytical Ultracentrifugation
6.2.8 Hydrophobic Dye Binding
6.2.9 Activity
6.2.10 Nuclear Magnetic Resonance
6.3 Chapter 4 Methods
6.3.1 Sequences and Cloning
6.3.2 Circular Dichroism
6.3.3 Differential Static Light Scattering
6.3.4 FoldX Calculations
6.4 Chapter 5 Methods
6.4.1 Keio(DE3) Construction
6.4.2 Cloning of E165D Mutants
6.4.3 Solid and Liquid Minimal Media Growth

References

List of Tables

Table 1: Protein Stability by Number
Table 2: Kinetic Data for Consensus TIMs
Table 3: Differences between cTIM and ccTIM
Table 4: Characterization of Mutants
Table 5: Kinetic values for Keio characterization

List of Figures

Figure 1: The Importance of Proteins
Figure 2: The Structure-Function Relationship
Figure 3: Protein Folding Energy Landscape
Figure 4: Transformation Efficiencies
Figure 5: Redesign of Rop
Figure 6: Example Multiple Sequence Alignment
Figure 7: Binding Specificities of SH2 Domains
Figure 8: SCA and WW-Domains
Figure 9: Beyond Consensus Analysis of TPRs
Figure 10: StickWRLD
Figure 11: Triosephosphate Isomerase
Figure 12: monoTIM
Figure 13: The Activity of TIM
Figure 14: Principles of Screening for Protein Stability
Figure 15: cTIM Structure and Stability
Figure 16: ir-cTIM Design
Figure 17: ir-cTIM Characterization
Figure 18: 1H,15N-HSQC NMR of cTIM and S.c. TIM
Figure 19: Temperature and Concentration Dependent Activity
Figure 20: Comparison of Consensus TIM sequences
Figure 21: ccTIM Characterization
Figure 22: Structure of ccTIM
Figure 23: Arsenate Inhibition of ccTIM
Figure 24: Sequence Differences between cTIM and ccTIM
Figure 25: Taxonomy Statistics
Figure 26: Sequence Correlations in TIMs
Figure 27: Kinetic Plots for S.c. TIM
Figure 28: Kinetic Plots for cTIM
Figure 29: Kinetic Plots for ir-cTIM
Figure 30: Kinetic Plots of ccTIM
Figure 31: re-S.c.TIM
Figure 32: CD Characterization of Highly Conserved Mutations
Figure 33: Thermal Assays
Figure 34: Concordance of Stability Assays
Figure 35: Correlation of Thermal Methods
Figure 36: Filtering by Conservation
Figure 37: Mutual Information and Protein Stability
Figure 38: Hidden Correlations
Figure 39: Characterization of comboTIM and algoTIM
Figure 40: Kinetic Unfolding of Consensus Variants
Figure 41: Mutual Information for comboTIM and algoTIM
Figure 42: Correlations in TIM
Figure 43: Physical Properties
Figure 44: Differential growth in solid media
Figure 45: Differential growth in liquid media
Figure 46: Protein expression in TIM-knockout
Figure 47: Analytical digest schemes to assess fitness
Figure 48: Analytical digests to determine populations
Figure 49: Library sites in triosephosphate isomerase

Chapter 1: Introduction

1.1 The Importance of Proteins

DNA is most simply the blueprint for life. It gives us our identity, but more importantly provides systematic directions for the assembly of our cellular machines. Nearly every cellular event and life process is the result of the actions of proteins. A simple bacterium employs roughly 2,000 different proteins while more complex organisms like humans make use of more than 20,000 diverse proteins. Numerous polypeptides are required because they perform highly specialized and unique roles in their native environments. There are essentially six classes of functions performed by these amazing macromolecules: binding, movement, structure, signaling, transport and catalysis (Fig. 1).

First, proteins have the ability to bind small molecules and other macromolecules like carbohydrates, nucleic acids, or even other proteins. For example, the heme-containing protein, myoglobin, reversibly binds oxygen (O2) under saturating conditions as an oxygen storage protein. Whales harbor high concentrations of myoglobin within muscle tissue, enabling them to dive deeply for long periods of time. A second example of binding is witnessed in the immune system with antibodies. Antibodies are multi-chained proteins that tightly bind substances known as antigens. These antigens are generally foreign molecules (e.g. viral proteins) that are recognized as threats and tagged for destruction by the binding of antibodies.

Figure 1: The Importance of Proteins.

A. Myoglobin is an example of a binding protein. The crystal structure shown here binds diatomic oxygen via a heme group. B. Myosin is one of the protein components of muscle that allow the tissue to contract and move. C. Ras is an important signaling protein that binds GTP. D. Triosephosphate isomerase is a diffusion-controlled enzyme that catalyzes the isomerization of two three-carbon sugars. E. This serves as a chloride channel that transports ions into the cell. F. The collagen triple helix provides structure, strength and flexibility to skin, cartilage and bone. Images were rendered in PyMOL 1.4.1 with PDB IDs: 1MBN, 1W9I, 3GFT, 2YPI, 1KPL, 1CAG.

Other protein molecules like actin and myosin orchestrate movement via muscular contractions. Likewise, some ATPases and flagellar motors choreograph complex rotation. Some proteins serve structural roles. Microtubules and filaments provide an architectural scaffold for each cell. Cross-linked keratin provides the structure and strength of nails and hair. As previously mentioned, many binding events set into motion a series of further incidents. This communication allows for complicated signaling between cells, within a cell, and with the environment. Many membrane proteins, like the acetylcholine receptor protein, bind extracellular molecules that trigger intracellular events. Another class, transport, is responsible for moving molecules from point A to point B. This class may operate as membrane transporters, vesicle transporters or carrier proteins.

The hallmark function of proteins is their ability to catalyze biochemical reactions. Virtually every chemical reaction in the cell is catalyzed by proteins known as enzymes. There are ten proteins that serially break down the six-carbon sugar, glucose, to two smaller molecules of pyruvate. From here, further proteins continue to alter pyruvate in a process that yields energy. A second example, DNA polymerase, catalyzes the replication of our DNA with extremely high fidelity.

These examples represent only a small snapshot of the many diverse, but imperative roles of proteins. While the variety is remarkable, the fidelity with which proteins perform their tasks is truly impressive. The cell requires this fidelity to maintain homeostasis, replicate and perform specialized tasks. Unfortunately, alterations in genes may lead to downstream changes in protein composition that may impair the molecule's function. This is often the case in disease. For example, tumor suppressor protein p53 has been implicated in more than 50% of cancers.1 This protein's native function is to respond to cellular stresses that lead to cell growth, proliferation and division. Functional p53 operates as an early transcription factor that essentially turns off these pathways. In the absence of active p53 these pathways run unregulated, leading to uncontrolled cell growth - tumors. The study of p53's mechanism, structure, stability, and effects of mutations are major goals in biochemistry. Likewise, a deeper understanding of all physiological proteins is necessary to better understand life and disease.

In addition, the field of protein science is highly interested in the action of proteins outside of their native environment and the manipulation of proteins for novel tasks. With the plethora of functions proteins perform, it is not surprising that many wish to exploit these and similar functions for use in therapeutics, industry and academics. DNA polymerases have been isolated from extreme thermophilic organisms for use in polymerase chain reactions and DNA amplification. Antibodies have been raised and isolated from many species as a method of identification and quantification. Other industries have supplemented detergents with protein molecules like lipases. These proteins hydrolyze lipids, which are often difficult to remove from greasy stains with standard chemical procedures. The use of proteins in pharmaceutics for in vitro screening and even therapeutics is becoming increasingly common. Small proteins such as insulin and human growth hormone are long-standing staples in the drug industry.

Recently, protein antibodies have gained favor as therapeutic reagents for their ability to bind receptor molecules which trigger intracellular events. Herceptin, a monoclonal antibody, binds the extracellular domain of HER2 - a receptor protein often overexpressed in breast cancer patients.2 The binding of Herceptin to HER2 arrests the cell during the G1 phase of the cell cycle, reducing proliferation. The Magliery Lab has formed a collaboration with several research groups to improve the drug-like properties of human paraoxonase 1 (huPON1). huPON1's native function is unknown, but it is found bound to high density lipoprotein (HDL) in the blood serum.3 Experiments have shown the enzyme has esterase and lactonase activities, and it has been linked to cardioprotection.4-7 In addition to these proposed functions, huPON1 weakly hydrolyzes organophosphate nerve agents found in and potentially used in chemical warfare. Our lab and others have sought to fine tune the properties of human paraoxonase 1 to increase its usefulness as a therapeutic agent.

In summary, the native roles of proteins, their abilities to bind targets and catalyze chemical reactions, and their role in disease make them one of science's most important molecules. Despite decades of intense study and research, and despite knowing great details of certain proteins and systems, we are still far from understanding many of the fundamental properties of proteins. In order to truly understand life, prepare for life's challenges, and manipulate nature it is imperative to understand these elementary properties.

1.2 Protein Structure and Function

Proteins are assembled from the linkage of twenty natural building blocks known as amino acids. The twenty amino acids vary in size, shape, branching, charge and chemical composition. The carboxyl- and amino- groups of amino acids undergo a condensation reaction at the ribosome. This repetitive reaction can link several amino acids to form an oligopeptide or dozens, hundreds, or thousands to assemble polypeptides.

The linear polymer synthesized at the ribosome generally assumes a compact three-dimensional structure. This is imperative because nearly all proteins are biologically active in their folded conformations. To further demonstrate, let us consider two examples - an SH2 binding domain and the enzyme, triosephosphate isomerase. Src Homology 2 (SH2) domains are structurally conserved domains that are involved in signal transduction.8 They mediate key pathways by recognizing and binding specific sequences that include phosphorylated tyrosine. In order to fulfill this role, each SH2 domain must bind its target sequences tightly and specifically. The ~100 residue domain achieves this via its unique three-dimensional structure (Fig. 2A). The extended polypeptide chain folds into an anti-parallel β-sheet surrounded by two α-helices. A conserved arginine forms strong electrostatic interactions with the phosphorylated tyrosine, and nonconserved sites within the binding cleft provide the correct shape, geometry and hydrogen bond donors and acceptors for specificity. There are over 3000 SH2 domains recorded in the Protein Families Databank (Pfam) and slightly over 200 known structures. Although the overall fold is conserved, each of these has slight differences at the atomic level, allowing them to recognize different target sequences.

A second example is well illustrated in the glycolytic enzyme triosephosphate isomerase (TIM). This complex protein catalyzes the isomerization of dihydroxyacetone phosphate and glyceraldehyde-3-phosphate. Similar to the SH2 domain, TIM must recognize and bind both of its substrates with affinity and specificity. The enzyme is ~240 residues, but only three of those are directly involved in catalysis (Fig. 2B). This means that the majority of the protein is providing a stable architecture to position and align the functional groups of only a handful of amino acids - in this case a lysine, histidine and glutamic acid. As is the case with most enzymes, the positioning of catalytic residues must be precise for the chemistry to follow. In triosephosphate isomerase, mutation of the catalytic glutamic acid to aspartic acid diminishes the activity of the enzyme by 500-fold.9 In this case, the carboxylate is maintained, but moved one methylene group (~1.5 Å) away from the . Clearly, the exquisite structure of proteins is imperative for function.

Figure 2: The Structure-Function Relationship.

A. The crystal structure of the SHP2 SH2 domain shown in gray. The SH2 domain binds the phosphorylated peptide, RLNpYAQLWHR, shown in green (PDB: 3TL0). B. A crystal structure of triosephosphate isomerase with bound inhibitor shown in spheres. The enzyme with the catalytic glutamate (12 o'clock) is 500-fold more active than a mutant with aspartate at this position (Figure rendered in PyMOL 1.4.1 with PDB: 2YPI).

Thus far we have described protein structure as a consequence of its function. This makes intuitive sense as nature evolves based on phenotypic differences that affect fitness. Triosephosphate isomerase, as mentioned, has evolved with glutamic acid rather than the shorter aspartate because that amino acid provides the cell with a vital advantage. In this sense, is the true "designer" of protein structures. If we consider the diversity of reactions in organisms it is clear why nature has raised many distinct proteins. Further examination of this conclusion leads to many more complicated questions regarding protein structure and function. Most importantly, how has nature evolved thousands of structures and functions from only twenty building blocks?

First, the great diversity seen in proteins is driven by combinatorics. While twenty amino acids appear modest, a hypothetical protein of length n has 20^n possible sequences (a short protein of only 50 residues has >10^65 solutions). Furthermore, the amino acids themselves are quite diverse in size, shape and composition. The sizes range from small (glycine) to large (tryptophan) with 18 intermediate volumes. Most of the amino acids are neutral, but several can be positively charged and others can form anions. Even within the nonpolar hydrocarbons there is a great degree of diversity in size, shape and branching. Finally, each of the amino acids hosts a handful of common rotamers and inherent flexibilities. Taken together, the twenty diverse amino acids may combine to form nearly infinite structures. The next question is how do the extended chains fold into biologically relevant conformations?
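The size of sequence space quoted above can be checked directly. The following is a minimal sketch; the function name `sequence_space` is introduced here for illustration only and is not part of any software used in this work.

```python
def sequence_space(length: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences of `length` residues drawn from
    an alphabet of `alphabet_size` amino acids (20 by default)."""
    return alphabet_size ** length


# A short protein of 50 residues, as in the text.
n = 50
count = sequence_space(n)
print(f"20^{n} is approximately {count:.3e}")  # exceeds 10^65
```

Python integers have arbitrary precision, so the count is exact; it is converted to floating point only for display.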

1.3 The Thermodynamic Hypothesis

In the early 1950s, Christian B. Anfinsen began his fundamental studies on the model protein, ribonuclease A (RNase A).10-12 RNase A is a well-studied enzyme that catalyzes the cleavage of RNA through general acid-base catalysis.13 Here, the 2'-OH is activated to attack the phosphodiester linkage, ultimately leading to an upstream and downstream product.

Anfinsen's lab showed that RNase unfolds in high concentrations of urea and β-mercaptoethanol. The native protein contains eight cysteine residues that form four disulfide bonds. As predicted, the reduced enzyme unfolded in urea and lost its biological activity. When the protein was dialyzed into nondenaturing buffer to promote refolding, the enzyme retained less than 1% of its original activity. The refolding experiment produced a population of semi-folded proteins with scrambled disulfide bonds. If the primary sequence contains eight cysteine residues, one expects 7 x 5 x 3 x 1 = 105 possible outcomes from random disulfide bond formation: the first cysteine can pair with any of seven partners, the next unpaired cysteine with any of five, and so on. If the native structure represents the only active conformation, 0.95% ((1/105) x 100 %) is expected to be active. This agrees nearly perfectly with the assayed data. When the refolding experiment was repeated with trace amounts of β-mercaptoethanol, the starting activity was recovered nearly quantitatively. The addition of reducing agent allows the protein to sample multiple disulfide bond solutions until the thermodynamically favored result is achieved. These experiments demonstrated several phenomena that revolutionized the field of biochemistry.
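The pairing count generalizes to any even number of cysteines. The sketch below, with illustrative function names, computes the number of complete disulfide pairings and the fraction expected to be native if pairing were random and only one arrangement were active.

```python
# Number of ways to pair 2m cysteines into m disulfide bonds:
# (2m - 1) x (2m - 3) x ... x 3 x 1.


def disulfide_pairings(n_cysteines: int) -> int:
    """Count complete pairings of an even number of cysteines."""
    if n_cysteines < 0 or n_cysteines % 2 != 0:
        raise ValueError("need a non-negative, even number of cysteines")
    count = 1
    # The first unpaired cysteine has (n - 1) possible partners, the
    # next unpaired one has (n - 3), and so on.
    for partners in range(n_cysteines - 1, 0, -2):
        count *= partners
    return count


def percent_native(n_cysteines: int) -> float:
    """Percent of random pairings that match one native arrangement."""
    return 100.0 / disulfide_pairings(n_cysteines)


if __name__ == "__main__":
    print(disulfide_pairings(8))        # pairings for RNase A
    print(round(percent_native(8), 2))  # percent expected active
```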

First, Anfinsen showed that there is not a clear dichotomy between what happens within the cell (in vivo) and what occurs inside a test tube (in vitro). The RNase molecule was able to obtain its catalytically competent conformation whether the starting point was on or off the ribosome. Anfinsen later discovered and described a second enzyme, protein disulfide isomerase (PDI), that catalyzes the rearrangement of disulfide bonds within the cell.11 PDI, much like β-mercaptoethanol, rescues proteins trapped in nonnative conformations. These experiments paved the way for more rigorous analysis of proteins in settings where investigators could control variables better than is possible in many biological contexts. The lack of division between in vivo and in vitro was further demonstrated by Bruce Merrifield in 1969, when his lab synthesized RNase A on solid phase support.14 Merrifield received the 1984 Nobel Prize in Chemistry for his groundbreaking synthetic protocols.

Anfinsen's work also provided some of the first insights into the thermodynamics of protein folding and structure. These discoveries are described as the Thermodynamic Hypothesis. Here, the native structure is a unique, stable and kinetically accessible minimum of the free energy. This means that a polypeptide with given length and composition will fold into a single three-dimensional structure under the environmental conditions in which folding occurs - barring conformational dynamics and allostery. In order for this to be true, the energy landscape for protein folding must resemble a rugged funnel (Fig. 3). The wide flange represents many possible conformations with similar high and unfavorable energies. As one progresses down the folding funnel, the number of possible conformations decreases, as does the free energy. The native state conformation is found at the energy minimum and generally has high activation barriers before reaching other nonnative folds. Because the native state has the lowest energy, thermodynamics dictates that this will be the true structure and hence unique. The large activation barriers between the native state and nonnative folds provide inherent stability to the biologically relevant conformation. For protein folding to be a thermodynamic process, the native state must be kinetically accessible, and for nearly all proteins this is true. α-Lytic protease is a rare example where the thermodynamic minimum is kinetically inaccessible.15 The protease is synthesized with a proregion that serves as a folding catalyst and is degraded after proper folding, making unfolding essentially irreversible (t1/2 = 1.2 years). This unique mechanism decouples the folding and unfolding events, providing escape from Anfinsen's thermodynamic descriptions of protein folding.

Figure 3: Protein Folding Energy Landscape.

The native state conformation (N) is the energetic minimum with high activation barriers to other conformations. In this figure, other "valleys" represent local energy minima. The top of the funnel is wide, representing numerous unfavorable folds. As a protein proceeds down the energy funnel, the free energy and number of conformations decrease. Image used with permission of Ken Dill.16

Anfinsen showed that the structural fold of the protein is literally dictated by physical chemistry and thermodynamics. This leads to interesting questions: 1) What are the forces that lead to protein folding and energy minimization, and 2) how does Nature sample and examine these potential structures? First, proteins fold in aqueous solution as a result of large but nearly balancing forces - enthalpy (ΔH) and entropy (ΔS). From the Gibbs free energy equation one can express the protein states as: ΔH_unfolding - TΔS_unfolding ≈ ΔH_folding - TΔS_folding, or ΔG_unfolding ≈ ΔG_folding; that is, the folded and unfolded states differ only slightly in free energy. Enthalpy consists of all the possible interactive forces in protein folding including ionic interactions, dipoles, hydrogen bonds and van der Waals forces. Each amino acid residue can form two hydrogen bonds using the backbone amide N-H and carbonyl C=O as donor and acceptor, respectively. Polar amino acids such as glutamine, asparagine, serine, threonine and tyrosine can form additional hydrogen bonds using their side chains. Other amino acids including histidine, arginine, lysine, glutamate and aspartate can form salt bridges with their ionizable side chains. A folded polypeptide will maximize the number of intramolecular hydrogen bonds and salt bridges while limiting steric clashes (surface exposed residues may interact with the environment to satisfy hydrogen bonds). Additionally, the native state will minimize cavities to achieve tight packing to maximize van der Waals interactions. The average hydrogen bond strength is ~1-4 kcal mol⁻¹ and these forces add up quickly when examining the folded state of proteins. It would be easy to conclude that these interactions drive protein folding and provide the "glue" for structural stability; however, this is not entirely true. Nearly every, if not all, hydrogen bond donor and acceptor can be equally satisfied by intermolecular bonds to water molecules. Thus the n x (1-4 kcal mol⁻¹) contributions are nearly degenerate between the folded and unfolded states. The same phenomenon is true for hydrophobic packing interactions such as London dispersion forces. So what is the driving force for protein folding?

Although counterintuitive, entropy is the dominant force that drives polypeptides into their native and active conformations. An unfolded polypeptide with many degenerate states has more degrees of freedom. Each side chain has multiple favorable rotamers, which is entropically very favorable and thus stabilizes the unfolded state. The examination of crystal and NMR structures shows that tight packing interactions severely limit the accessible degrees of freedom for most residues. While this is necessary for proper function it is entropically unfavorable. To fully understand the entropic consequences of protein folding, one must examine the water molecules surrounding the unfolded and folded peptide.

A typical organism, and therefore cell, is comprised of ~70 % water. Pure water has a concentration of 55.5 M, while most cellular proteins are present at micromolar (10⁻⁶ M) concentrations or lower. A protein folds by means of hydrophobic collapse, where nonpolar amino acids are buried within the protein's core and hydrophilic amino acids remain solvent exposed. This is the same phenomenon seen in micelles from detergent molecules. Interestingly, the hydrophobic collapse of the protein molecule ultimately leads to an increase in the system's entropy. An unfolded, linear peptide has a large solvent accessible surface area. When the peptide is submerged in an aqueous environment, water molecules must surround the polymer in a highly organized fashion, greatly reducing the system's overall entropy.

Water surrounding large macromolecules is often described as "ice" or "solid" water due to its lack of freedom and low entropy. As the protein folds, the solvent accessible surface area diminishes, freeing many water molecules to behave more "liquid" or "bulk-like." For a given volume, a sphere will always have the minimum surface area; therefore, it is no surprise that the majority of native proteins are globular. To illustrate this point, consider a sphere and a cylinder, both of which have a volume of 100 Å³. If one describes the cylinder as a rod with height 100-fold greater than the radius, this shape has a solvent accessible surface area of ~300 Å². The sphere, however, only has a surface area of ~100 Å² - representing a three-fold decrease over the elongated cylinder. Clearly, more ordered water is required to encase an unfolded peptide versus a globular protein.
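The sphere and rod comparison follows from elementary geometry. A minimal sketch, assuming an ideal sphere and a closed right circular cylinder whose height is 100 times its radius (the function names are illustrative):

```python
import math


def sphere_area(volume: float) -> float:
    """Surface area of a sphere with the given volume."""
    radius = (3.0 * volume / (4.0 * math.pi)) ** (1.0 / 3.0)
    return 4.0 * math.pi * radius ** 2


def rod_area(volume: float, aspect: float = 100.0) -> float:
    """Surface area of a closed cylinder with height = aspect * radius.

    V = pi * r^2 * h = aspect * pi * r^3, so r = (V / (aspect * pi))^(1/3).
    """
    radius = (volume / (aspect * math.pi)) ** (1.0 / 3.0)
    height = aspect * radius
    side = 2.0 * math.pi * radius * height   # lateral surface
    caps = 2.0 * math.pi * radius ** 2       # two end caps
    return side + caps


if __name__ == "__main__":
    v = 100.0  # cubic angstroms, as in the text
    print(f"sphere: {sphere_area(v):.0f} A^2")
    print(f"rod:    {rod_area(v):.0f} A^2")
    print(f"ratio:  {rod_area(v) / sphere_area(v):.1f}")
```

Under these assumptions the sphere has roughly 104 Å² of surface and the rod roughly 296 Å², which matches the rounded figures of ~100 Å² and ~300 Å².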

This view is somewhat exaggerated as unfolded proteins are still approximately spherical; however, these states are larger and exclude less water. This increase in randomness, known as entropy, drives protein folding by ultimately lowering the free energy (ΔG).

Anfinsen's Thermodynamic Hypothesis states that most native proteins adopt an active structure with the lowest free energy. Does this mean that the molecule samples all conformations, evaluates the energy and selects the most stable structure? To date, the mechanism of protein folding remains largely unknown, but Cyrus Levinthal's computational thought experiment performed in 1969 ruled out the "sample all" approach.17 Levinthal demonstrated that proteins must fold on a linear (or reasonably linear) pathway where proteins sample far less than 1 % of all possible conformations. He considered a small protein with only 100 amino acids and therefore 99 peptide bonds with 198 phi and psi angles. He limited the calculation to 3 possible rotamers for each bond angle, yielding 3¹⁹⁸ (~3 x 10⁹⁴) different conformers. Note that here we ignore many of the possible rotations amongst single bonds within the complex structure. A single side chain rotation occurs in ~10⁻⁸ seconds, meaning it would take longer than the age of the universe to sample all combinatorial solutions. This value is far longer than the milliseconds (10⁻³ sec) to seconds typically required for protein folding. This computational observation became known as Levinthal's Paradox. To rationalize these impossibilities, Levinthal suggested that proteins fold on one or several trajectories in which local assembly precedes global events. The details of this mechanism are still poorly understood but many hypotheses are gaining traction.18-23 The pioneering work by Anfinsen and Levinthal described many complex details in protein folding and function, but also generated new and imperative questions.
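Levinthal's estimate can be reproduced in a few lines. The sketch below uses the numbers given above (198 angles, 3 states per angle, ~10⁻⁸ seconds per conformation). The age of the universe, taken here as about 13.8 billion years, is an assumption added for the comparison and does not come from the text.

```python
import math

N_ANGLES = 198                    # phi and psi angles for 99 peptide bonds
STATES_PER_ANGLE = 3              # rotamers allowed per angle
SECONDS_PER_CONFORMATION = 1e-8   # time to sample one conformation

# Assumption for comparison only: ~13.8 billion years in seconds.
AGE_OF_UNIVERSE_S = 13.8e9 * 365.25 * 24 * 3600


def log10_conformations() -> float:
    """log10 of the number of conformations, 3**198."""
    return N_ANGLES * math.log10(STATES_PER_ANGLE)


def exhaustive_search_seconds() -> float:
    """Time to visit every conformation once, in seconds."""
    return float(STATES_PER_ANGLE ** N_ANGLES) * SECONDS_PER_CONFORMATION


if __name__ == "__main__":
    print(f"conformations: ~10^{log10_conformations():.1f}")
    ratio = exhaustive_search_seconds() / AGE_OF_UNIVERSE_S
    print(f"search time / age of universe: ~10^{math.log10(ratio):.0f}")
```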

1.4 The Sequence-Structure-Function Relationship

Christian Anfinsen's conclusions drawn from the RNase studies earned him the 1972 Nobel Prize in Chemistry. Interestingly, his lab was not the only group studying the thermodynamic properties and folding of ribonuclease A. At the 25th Anniversary Symposium of the Protein Society, David Eisenberg (University of California, Los Angeles) shared the story of a second lab whose studies preceded the work by Anfinsen and colleagues. Yale University medical student, Lisa Steiner, and her advisor Fred Richards had performed nearly the exact experiments several years prior to Anfinsen. The work went unpublished until Steiner's thesis, "The Reduction of the Disulfide Bonds of Ribonuclease," was released in 1959 - after Anfinsen's Science paper.

Anfinsen's work revolutionized the field and dramatically affected the trajectory of biochemistry. The primary papers on RNase refolding and activity have each earned over 300 citations while his Nobel Lecture, "Principles that Govern the Folding of Protein Chains," stands at 3,720 citations at the time this dissertation was written. Why was the research of Steiner and Elliott left for a medical thesis and not published in highly distributed journals?

The 1950's were an exciting time for protein science and . It was

during this decade that , , and were active,

joining Anfinsen in a quest to change the scientific landscape. One such discovery, first

posited by Francis Crick, was the Central Dogma of Biology which was a further description of his Sequence Hypothesis.24 The Sequence Hypothesis states:

“…..In its simplest form it assumes that the specificity of a piece of nucleic acid is expressed solely by the sequence of its bases, and that this sequence is a simple code for the amino acid sequence of a particular protein.”

The Central Dogma states that the transfer of information from DNA to RNA and from RNA to protein is irreversible:

"This states that once “information” has passed into protein it cannot get out again. In more detail, the transfer of information from nucleic acid to nucleic acid, or from nucleic acid to protein may be possible, but transfer from protein to protein, or from protein to nucleic acid is impossible. Information means here the precise determination of sequence, either of bases in the nucleic acid or of amino acid residues in the protein."

We now know that the above quotation is not entirely true with the discovery of reverse transcriptase and nonribosomal peptides; nevertheless, the overall flow of information from DNA to RNA to Protein is canonical. Crick's hypotheses were first articulated in 1958, coming years after the Yale University work on ribonuclease, but concurrent with the findings of Anfinsen. Anfinsen's laboratory was now armed with all the puzzle pieces to change protein science. The mating of the Central Dogma (DNA→RNA→Protein) with the Thermodynamic Hypothesis formed the basis of the sequence-structure-function relationship. Lisa Steiner's work was simply ahead of its time.

The sequence-structure-function relationship describes the intimate flow of information from the protein's primary sequence (ultimately the DNA gene) to its biological activity. Earlier examples chronicling the binding site of SH2 domains and the active site of triosephosphate isomerase showed conclusive evidence to support the structure-function relationship. Anfinsen's RNase experiments elucidated the sequence-structure paradigm. In vivo, or in the case of in vitro translation, the mRNA template and ribosome catalyze the linkage of amino acid monomers. This amino acid chain will spontaneously fold into the biologically relevant conformation. This means that the sequence of amino acids provides all the information required to describe the folded state. One should note that some (perhaps the majority of) proteins do employ chaperones to accelerate the folding process and prevent nonnative conformations in vivo. The example of RNase, and many others in the last 50 years, have shown that proteins can be unfolded and refolded in vitro without any "magic" that extends beyond the simple thermodynamics that describe the systems of supramolecular chemistry. This observation led to the birth of modern protein biochemistry.

1.5 The Protein Folding Problem

The Protein Folding Problem is most simply a question, "How does a protein's primary sequence dictate the three-dimensional structure?" Ken Dill elaborates on this question by describing the problem as three closely related puzzles25: 1) The folding code: the thermodynamic question of what balance of forces dictates the structure of a protein for a given amino acid sequence; 2) Protein structure prediction: the computational problem of predicting the native structure from sequence; and 3) The folding process: the kinetic question of how proteins fold so quickly.

The forces that dictate structure were described earlier. Prior to Anfinsen's work it was believed that the amino acid sequence predetermined secondary structure which drove tertiary contacts and hence protein folding. Many of the early views of protein folding are reviewed by Anfinsen.26 Linus Pauling, Robert Corey and Herman Branson proposed the first description of helices within proteins and further hypothesized that proteins would form regular internal lattices. Diverse sequences of DNA have the same structure regardless of sequence and initial speculations suggested proteins would behave similarly. This made sense since secondary structures such as α-helices are mediated by backbone hydrogen bonds, which all amino acids have regardless of side-chain composition. The first crystal structure of myoglobin in 1958 dramatically changed these viewpoints and later studies revealed that secondary structures are actually stabilized by hydrophobic collapse.27-29

As Dill points out, and crystal structures indicate, folding relies on side chains and therefore the folding code is distributed both locally and globally within primary sequences. This is why paradigmatic proteins such as lysozyme and ribonuclease have distinct structures. In summary, the folding code is globally distributed among the side chains. Many details of this puzzle remain unknown. For example, how do mutations to primary structure affect the stability (free energy)? To date, there is still no physicochemical model that can predict the effects of even a single point mutation.30

The kinetic puzzle - how do proteins fold so quickly? - is a corollary of Levinthal's paradox. Many labs have sought to understand the mechanism of protein folding. This has proven to be even more difficult than predicting structures, but promising avenues are being paved in this field. Further review of the kinetics of folding and folding pathways is beyond the scope of this chapter. Robert Matthews (University of Massachusetts Medical School) has provided a thorough review.31

Perhaps the greatest revolution of Anfinsen's work was the idea that protein structure could be predicted from the sequence alone since it necessarily carries the entire folding code. Furthermore, the native structure could be confirmed by comparing its computed free energy to other structures, ensuring it is the most stable (lowest free energy). This became a major, but challenging goal in .

Predicting structure is an intuitive but complicated enterprise. Imagine using an organic chemistry molecular model kit and building a small protein like hen egg white lysozyme. The protein has roughly 130 residues and slightly more than 1,000 atoms. Given the sequence and the polymer nature of proteins, all of the covalent bonds are known (except the disulfides). The builder would now bend, fold, and twist the backbone, rotate side chains and collapse the extended peptide into a globular three-dimensional structure. During the process, the architect would attempt to optimize several features: 1) Minimize the surface area to increase the entropy of hypothetical water molecules; 2) Avoid steric clashes and minimize cavities to increase van der Waals interactions; and 3) Maximize the number of hydrogen bonds and salt bridges. From a purely theoretical perspective, the above goals seem possible, but there are simply too many theoretical structures to build and evaluate. To alleviate this burden, computational biologists began developing algorithms and modeling procedures to increase the throughput.

With an influx of computational programs in the late 80s and early 90s, many began to erroneously feel that the Protein Folding Problem had been solved. In response, computational biologist John Moult (University of Maryland) founded the Critical Assessment of Techniques for Protein Structure Prediction (CASP) in 1994. Prior to the first meeting, Moult and colleagues gathered primary sequences for proteins with solved, but unpublished structures. These sequences were offered to the computational community, who were given a month to submit predicted structures. The inaugural CASP meeting attracted 35 labs and 24 target sequences ranging from easy to difficult.32 Many labs were able to accurately predict small domains and proteins (scored as GDT_TS), but success rates fell quickly as protein sizes increased and targets became more challenging.33 The CASP meeting is held every two years and is currently preparing for its ninth gathering. Data collected from previous contests show modest improvements between contests, but still demonstrate the need for enhanced algorithms and search strategies. In the next section we highlight several success stories in the field of computational structure prediction.
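For readers unfamiliar with the GDT_TS score mentioned above, the following is a simplified sketch. GDT_TS averages the percentage of residues whose Cα atom lies within 1, 2, 4 and 8 Å of its position in the experimental structure. The official calculation searches over many superpositions for each cutoff; this sketch assumes the per-residue distances have already been obtained from a single superposition, and the distances in the example are hypothetical.

```python
# Simplified GDT_TS: mean percentage of residues within each distance
# cutoff. Assumes distances come from one fixed superposition, which
# can underestimate the official score.

CUTOFFS_ANGSTROM = (1.0, 2.0, 4.0, 8.0)


def percent_within(distances, cutoff):
    """Percentage of residues whose deviation is at most the cutoff."""
    if not distances:
        raise ValueError("need at least one residue")
    hits = sum(1 for d in distances if d <= cutoff)
    return 100.0 * hits / len(distances)


def gdt_ts(distances):
    """Average of the four cutoff percentages (0 to 100)."""
    scores = [percent_within(distances, c) for c in CUTOFFS_ANGSTROM]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Hypothetical C-alpha deviations (angstroms) for an 8-residue model.
    example = [0.5, 0.9, 1.5, 3.0, 5.0, 9.0, 0.2, 2.5]
    print(gdt_ts(example))
```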

1.6 Computational Structure Prediction

The Protein Folding Problem is generally thought of as the question: what structure will a given sequence adopt? CASP meetings have chosen to focus on this computational avenue. At recent CASP meetings three groups routinely outperformed other labs and servers - the Zhang Lab's I-TASSER and QUARK, Baker's Rosetta and Xu's RAPTOR (http://predictioncenter.org/casp9/).

TASSER (Threading/ASSEmbly/Refinement) combines modeling and Monte Carlo simulations to predict the tertiary structure of proteins.34-36 First, the primary sequence is threaded through a template library constructed from the Protein Data Bank (PDB). Sequence segments that share homology to sequences in the PDB are modeled to the corresponding structures. These stretches of amino acids are used as building blocks for structure prediction. In the next step, the building blocks are held constant, and regions not selected by the threading procedure are optimized and refined by Monte Carlo methods. I-TASSER was used to predict the structure of T0437_1 at CASP8.37 The model protein predicted by the Zhang lab exhibited an RMSD of 1.13 Å to the actual crystal structure. This method has been generalized in the latest algorithm, QUARK (manuscript in preparation). Here, the building blocks are designed by replica-exchange rather than extraction from homologous PDB structures.

Jinbo Xu, from the , has developed CASP-winning software suites under the title of RAPTOR (RApid Protein Threading predictOR).38,39 Similarly to I-TASSER and QUARK, RAPTOR uses protein threading in the initial stages of prediction. One problem with current threading techniques is that the procedure is "NP-hard," making the matching process computationally expensive. No polynomial-time algorithm (one whose running time is bounded by O(n^k) for input size n) is known for NP-hard problems, so the running time of exact methods can grow exponentially with the size of the input. RAPTOR improves these search algorithms by applying the mathematical theory of linear programming. Linear programming finds the optimum of a linear objective function subject to linear constraints, in this case the minimal energies. These adaptations have improved the throughput and speed, making RAPTOR ideal for full automation. In fact, RAPTOR has routinely taken the top honors at CASP in the CAFASP division (CA[fully automated]SP).

The David Baker lab (University of Washington/HHMI) has spent the past decade developing computer programs around a scaffold script known as Rosetta.40,41 The algorithm uses potential functions to compute interaction energies within and between molecules, ultimately searching for the lowest energy solutions using Monte Carlo simulated annealing. Since its inception in 1998, the code has been modified for several distinct procedures including RosettaDesign, RosettaDock, Robetta and others.42-44
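The search strategy named above, Monte Carlo simulated annealing, can be illustrated on a toy problem. The sketch below is not Rosetta's energy function or move set: the one-dimensional "energy landscape", the step size and the cooling schedule are all hypothetical choices made only to show the Metropolis acceptance rule and the gradual lowering of temperature.

```python
import math
import random


def energy(x: float) -> float:
    """Hypothetical rugged landscape: a broad well with local ripples."""
    return 0.1 * x * x + math.sin(3.0 * x)


def anneal(start, steps=5000, t_start=2.0, t_end=0.01,
           step_size=0.5, seed=0):
    """Return (best_x, best_energy) found by simulated annealing."""
    rng = random.Random(seed)        # seeded for reproducible runs
    x, e = start, energy(start)
    best_x, best_e = x, e
    for i in range(steps):
        # Geometric cooling from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (i / (steps - 1))
        candidate = x + rng.uniform(-step_size, step_size)
        e_candidate = energy(candidate)
        # Metropolis criterion: always accept downhill moves; accept
        # uphill moves with probability exp(-dE / T).
        if e_candidate <= e or rng.random() < math.exp(-(e_candidate - e) / t):
            x, e = candidate, e_candidate
            if e < best_e:
                best_x, best_e = x, e
    return best_x, best_e


if __name__ == "__main__":
    best_x, best_e = anneal(start=8.0)
    print(f"start energy {energy(8.0):.3f}, best energy {best_e:.3f}")
```

Accepting occasional uphill moves at high temperature lets the search escape local minima, while the falling temperature makes the walk increasingly greedy near the end.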

Rosetta threads three- and nine-residue stretches through the PDB identifying homologous sequences. Like TASSER, Rosetta starts by identifying local building blocks in the PDB and optimizes the global structure around these short stretches. Although the protocols are similar to other prediction technologies, the Rosetta suite employs highly tuned potential energy functions. It distributes greater weight to compact structures, buried hydrophobic residues, and paired β-strands. Rosetta has gone beyond complete structure prediction by guiding search algorithms with biophysical and evolutionary data.45,46 In 2010, the program accurately determined the structure of larger proteins using only backbone NMR data.47

Even after two decades of in silico structure prediction, the field is still learning. Despite numerous advances, the algorithms still fall short of predicting structures for some of our most biologically interesting molecules - large human proteins and membrane proteins.

The intimate pairing of biophysical data and computational algorithms is a promising avenue for improving structural predictions.

1.7 The Inverse Folding Problem

A second and related question is to ask, what sequence will adopt a given structure? This is known as the Inverse Folding Problem.48 In the genomic era, we are essentially faced with two imperative challenges. First, the number of available structures pales in comparison to the number of available sequences. This is the impetus for in silico structure prediction. The second is the design of novel sequences with desired physical features such as increased stability, solubility, or activity. This, the Inverse Folding Problem, is highly related to the sequence-structure relationship.

A significant hurdle with designing sequences that adopt a particular fold is that the margin of error is minimal. Native proteins are generally only stabilized over the unfolded state by ~5-15 kcal mol⁻¹. This means, from a computational perspective, that missing even one hydrogen bond can result in rejected structures. Protein folding can be written as the equation: unfolded peptide → folded peptide. If we take the folded state as the product, the equilibrium constant, Keq, can be expressed as relative concentrations (Keq = [folded peptide]/[unfolded peptide]). From the equation, ΔG = -RT ln Keq, we see that the incorrect placement of a single hydrogen bond affects the equilibrium nearly 1000-fold. Several labs have accepted the challenge of computationally predicting stabilities - an intuitively simple test of the Inverse Folding Problem. Here, we highlight the work of two labs, Nikolay Dokholyan's (University of North Carolina) and Luis Serrano's (Centre de Regulació Genòmica).
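The sensitivity of the folding equilibrium to small energy errors follows from ΔG = -RT ln Keq. The sketch below assumes a temperature of 298.15 K (not stated in the text) and takes 4 kcal mol⁻¹, the upper end of the hydrogen bond range quoted earlier, as the size of the error.

```python
import math

R_KCAL = 1.987e-3    # gas constant in kcal mol^-1 K^-1
T_KELVIN = 298.15    # assumed temperature (25 degrees C)


def k_eq(delta_g_folding: float) -> float:
    """Keq = [folded]/[unfolded] for a folding free energy in kcal/mol."""
    return math.exp(-delta_g_folding / (R_KCAL * T_KELVIN))


def fold_change(delta_delta_g: float) -> float:
    """Factor by which Keq changes for an energy error of this size."""
    return math.exp(abs(delta_delta_g) / (R_KCAL * T_KELVIN))


if __name__ == "__main__":
    # A protein stabilized by 10 kcal/mol (folding dG = -10).
    print(f"Keq at -10 kcal/mol: {k_eq(-10.0):.2e}")
    # Losing one 4 kcal/mol hydrogen bond.
    print(f"fold change for 4 kcal/mol: {fold_change(4.0):.0f}")
```

At this temperature a 4 kcal mol⁻¹ error shifts the equilibrium by a factor of several hundred, approaching the 1000-fold figure cited above.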

The Dokholyan lab uses computational and biophysical methods to study the physical nature of interactions within and between molecules. In 2007, they introduced a protein stability prediction server named Eris.49 Eris is a physical force field with atomic modeling that features fast side-chain packing and algorithms to relax the backbone.50

The authors point out several key advantages of Eris compared to other predictive software. First, most competitors use extensive training of the algorithms from known stability data. While the general consensus is that including empirical data helps calculations, it unfortunately biases the output. In the case of protein stability prediction, most libraries of stability-characterized proteins include a large fraction of amino acid to alanine mutations - also known as alanine scanning. Inclusion of these datasets tends to bias the computers towards large to small mutations as this is what they encounter upon training. Second, the majority of algorithms downplay or completely ignore the effects of backbone motions and flexibility. Recently, Sachdev Sidhu and colleagues designed binomial libraries of antibodies where binding site residues were limited to tyrosine or serine.51,52 These limited libraries were able to bind diverse substrates with nanomolar affinities. One interpretation from this work is the importance of backbone flexibility, which added a degree of diversification to an already limited library. The Eris server, with the ability to model backbone flexibility, was tested against 595 mutations from five different proteins. The authors found significant agreement (Pearson correlation coefficient = 0.75) between the calculated and experimentally derived ΔΔGs, but the standard deviations are reported as ~2-3 kcal mol⁻¹.
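The two statistics quoted for Eris, a correlation coefficient and a spread in kcal mol⁻¹, measure different things: a predictor can rank mutations well yet still miss individual values by a wide margin. The sketch below computes both for a small set of ΔΔG values that are entirely hypothetical and serve only to show the calculation.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rms_error(predicted, measured):
    """Root-mean-square difference between prediction and experiment."""
    n = len(predicted)
    return math.sqrt(sum((p - m) ** 2 for p, m in zip(predicted, measured)) / n)


if __name__ == "__main__":
    # Hypothetical ddG values in kcal/mol, for illustration only.
    measured = [0.5, 1.2, -0.3, 2.8, 3.5, 0.0]
    predicted = [1.5, 0.4, 1.0, 4.9, 2.0, -1.1]
    print(f"Pearson r: {pearson(predicted, measured):.2f}")
    print(f"RMS error: {rms_error(predicted, measured):.2f} kcal/mol")
```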

The Serrano lab developed FoldX in 2005 as a web server, and recently packaged the code for use in the graphical interface, YASARA.53,54 Similar to Eris, FoldX uses a full atomic description of the structure of proteins, which includes flexible backbones. Unlike Eris, the weighting of atomic interactions has been modeled using empirical data from 1,088 point mutations spanning many diverse folding scaffolds. The original FoldX predictions yielded a correlation coefficient of 0.81 with a standard deviation of 0.46 kcal mol⁻¹. In the YASARA interface, researchers are given the option of doing local or global repacking after mutations are modeled. In 2010, the effectiveness of FoldX and eleven other stability predictors was tested by Khan and Vihinen.55 In this study Dmutant and FoldX were the best of the predictors, but the authors left the audience with a heavy caution:

"In conclusion, at best, the methods predicted the changes in stability caused by mutations with only moderate accuracies. However, the number of false positives and false negatives returned by the programs was substantial. As so many factors affect protein stability, even small differences in the ΔΔG values between a wild type and its mutant can be significant. Molecular dynamics and Monte Carlo simulations provide more accurate results in general; however, characterization of mutational effects is still problematic even when these methods are used.

Additionally, the computational power demands of these two methods are prohibitively great for the analysis of large datasets.

For mutation effect investigations the tested methods have only limited applicability, and should thus be used preferably together with other prediction approaches. One way to improve the performance of predictors might be to use additional features."

Despite these warnings for predicting stabilities, many labs have pushed the Inverse Folding Problem even further in the computational design of de novo proteins with surprising success.

The first in silico design efforts were based on Ramachandran plots and common side chain rotamers seen in the PDB.56 These angle constraints were used to mold short sequences into a given backbone architecture. Initial design efforts were limited to small peptides or the redesign of hydrophobic cores. In the years that followed, further development of rotamer libraries greatly reduced computational space, and improved search algorithms such as the Monte Carlo methods made larger design efforts more promising. The Steve Mayo lab (California Institute of Technology) was an early innovator in the computational design of proteins. In 1996, they developed algorithms that quantitatively considered side-chain packing, sterics, and their relationship to packing specificity.57 A year later, the same authors used these core lessons and expanded the design criteria to include the solvent exposed surface.58,59 Dead-end elimination was used to select the optimal sequence for the redesign of the ββα-motif found in zinc-finger DNA-binding domains. The designed fold successfully repacked the core in the absence of zinc.

In 2003, exciting research began to pour out of the previously mentioned Baker lab.

Using the RosettaDesign software, Kuhlman et al. attempted to engineer a protein with novel structure (i.e. a structural scaffold not known in Nature).60 The lab undertook the design of a 93 residue peptide that would harbor five antiparallel β-strands and two α- helices. An iterative approach was prescribed that cycles between sequence optimization and structure prediction. After each cycle of sequence generation, the backbone was relaxed and optimized until the outcome stabilities were comparable to native proteins.

The structure of one designed variant, Top7, was characterized in detail. Top7 was only 31 % identical in sequence to the initial model sequence, but had a backbone RMSD of 1.1 Å to the starting model. The model agrees with the crystal structure at the atomic level, with an RMSD of 1.17 Å.

The Baker lab has taken these successes and applied them to the design of functional proteins in recent years, in collaboration with Donald Hilvert's lab (Swiss Federal Institute of Technology - ETH Zurich). In 2008, Hilvert and Baker reported the de novo computational design of retro-aldol enzymes.61 These designer enzymes catalyzed the carbon-carbon bond cleavage of nonnatural substrates with multiple turnover and rate enhancements approaching four orders of magnitude. Crystal structures of the catalysts nearly superimposed on the designed models. Dan Tawfik (Weizmann Institute) collaborated with Baker to design enzymes that could transfer protons from carbon atoms via the Kemp elimination.62 These eliminases showed catalytic improvements similar to those of the retro-aldol enzymes and were further optimized with rounds of in vitro evolution.63,64

In 2010, the Seattle lab furthered their design legacy by computationally engineering proteins that could catalyze the bimolecular Diels-Alder reaction.65

These experiments illustrate the power and promise of computational design, but it is important not to oversell these successes. The prediction of stabilizing mutations, the design of folds, and the computational proposal of catalytic proteins remain young and error-prone endeavors. Even with the most advanced algorithms, one can only expect the computer to predict stabilizing mutations with ~60 % accuracy.55 The design of novel folds and redesign of native folds is still difficult and often requires additional rational or irrational design. Top7 is the tip of an arrow - for every success there are hundreds of failures. At symposia, David Baker is famous for leaving the audience with the following take-home message: "computational design of proteins does not work." Even the famously designed catalysts described here and elsewhere fail to rival the efficiency, speed and fidelity of native-like enzymes, which are up to three orders of magnitude faster than those currently described in the literature. In addition, these computational experiments are highly specialized and generally require years of training in mathematics and computational science. It is clear that in the coming years these technologies will improve and become more user-friendly.

As a final caveat to the use of computational methods, we analyze the reports of Regan, Elber and Bryan. Many computational algorithms rely on sequence homology, either by alignment or protein threading through the PDB. The work reviewed here provides a succinct warning (even if anomalous) against such practices. In the early 1990s, George Rose and Trevor Creamer formulated the Paracelsus challenge: convert a globular protein into a second fold without altering more than 50 % of its sequence. The confident pair guaranteed $1,000 to the first successful alchemist. The Lynne Regan lab (Yale University) started with the predominantly β-sheet Protein G (the B1 domain of Streptococcal IgG-binding protein). Regan's lab mutated ~40 % of the amino acids based on secondary structure propensities, energy minimizations, visual modeling and intuition.66,67 The end result, Janus, was a four-helix bundle protein similar to the structure of a single chain Rop variant (Repressor of primer). Regan accepted the award at a Johns Hopkins ceremony in 1997 and donated the prize money to charity.

Inspired by this work, Ron Elber (University of Texas) began a computational project describing networks of sequence flow between protein structures.68,69 Elber noticed that the sequence-structure relationship is highly asymmetric as many sequences fold into relatively few structures. He postulated that given two sequences of two different folds, mutations could be made serially in a Markov chain leading one structure to another.

Many of these mutations would retain the native fold, but as these mutations amass there would eventually be a single missense mutation that flipped structures. Continued mutations towards sequence B would retain the flipped fold. Elber's lab has evaluated many such Markov chains in the PDB and has found many evolutionarily interesting sequence networks. The Magliery lab is collaborating to empirically validate these chains.

The Elber lab has postulated the existence of a solution to the Paracelsus challenge with more extreme criteria (a single amino acid mutation versus 50% sequence divergence).

His lab has chosen to pursue native protein chains, as his interests lie primarily with the mechanisms of protein evolution. Stepping outside these criteria, the Philip Bryan lab

(University of Maryland) has furthered the work of the Regan lab using in vitro evolution. Alexander et al. began with the IgG binding domains of streptococcal protein and staphylococcal protein A. These two proteins are similar in size, but have drastically different folds. was employed to drive the generation of similar sequence variants with different folds and binding functions.70 This method produced two heteromorphs with 59 % sequence identity. The tertiary structure of these molecules was later confirmed with high resolution NMR structures.71 In a third publication, the authors simplified which residues contributed to the folding code for each structure. After rounds of mutations at nonidentical and nonepitope binding sites, the authors were able to produce two functional variants, with different folds, that shared 88% sequence identity.72

Their original estimates suggested that only 10% of the residues within the small protein contained the folding code. The resultant NMR structures guided further rounds of mutagenesis that led to 95% sequence identity.73,74 Finally, the authors discovered a single amino acid mutation that completes the Markov chain from GA to GB - L/Y at position 45.75 This work further supports Ron Elber's hypothesis, but much work is still required to prove these Markov chains exist in Nature.

In conclusion, the computational design of proteins has sparked a new era of protein

science that has been interesting and useful. However, it seems that for every success

story there are countless failures and exceptions setting the stage for the next generation

of improved algorithms and protocols. In the next section we examine the counter

discipline - empirical data collection, its successes and challenges.

1.8 Empirical Protein Science

The most reliable data comes from direct experimentation. Unfortunately, this path is

generally the slowest and most expensive. The structure-function relationship began with

chemical modification of side chains. Ribonuclease's activity and stability have been

challenged by modifying , carboxylic acids, and the alcohols.76-79

The general acid-base mechanism of RNase was elucidated from these studies, together with pH profiling and crystal structures. Similar approaches using radiation have also been used to modify side chains. These technologies, however, are not specific and are limited to certain functional groups.

In the early 1970s, methods emerged that altered DNA sequences, ultimately leading to missense mutations in proteins. Now, amino acids could be substituted at specific sites, allowing investigators to construct sequence-structure-function relationships. The earliest technologies used restriction endonucleases to fragment phage genomes. The fragments were annealed to new, intact phage. Some of the ΦX174 progeny phage had genetic markers derived from the fragmented genomes.80 In 1978, the method was

further generalized by the addition of a single-stranded DNA primer and the Klenow

fragment of E. coli DNA polymerase I.81 This method granted more rigorous control of

the mutations, but was still limited to phage genomes, although the authors suggest

similar applications with plasmids.

This method was optimized to modern day form using Kunkel and QuikChange

mutagenesis. In these updated protocols, the background (wild-type) progeny are

removed by digestion of the original template. In Kunkel mutagenesis, this is accomplished by producing the template DNA in E. coli strains that are dut- ung-, resulting in uracil-containing DNA.82 After primer addition and replication, the template strand is removed post-transformation by the host's native uracil-DNA glycosylase. In QuikChange methods (Stratagene), the template DNA is degraded by DpnI, a restriction endonuclease that cleaves methylated DNA at GATC sites. The mutagenic strand of DNA is synthetic and therefore lacks adenine methylation, allowing for selection of mutants. Altered DNA

can also be generated by cassette PCR using restriction endonucleases and ligase.83

These methods greatly improved our ability to generate and examine mutant proteins.

Some of the earlier empirical work on protein sequence, structure and function was

performed in the laboratory of Jeffrey Miller (University of California, Los Angeles).

The Miller lab employed nucleotide mutagens like 2-aminopurine to probe the activity and quaternary structure of lac repressor.84,85 Later, Miller amassed nearly 4,000 single mutations at 328 total positions using nonsense suppression methods.86,87 From this collection, Miller determined that most sites within lac repressor are tolerant to substitution, but some sites required conservative mutations. Exceptions included the DNA binding site and the inducer binding site, where mutations ablated activity by disrupting necessary intermolecular contacts. Furthermore, they also discovered that positions within the hydrophobic core were considerably more intolerant to mutation.

The observation of intolerance led Michael Hecht and Robert Sauer (Princeton University and Massachusetts Institute of Technology, respectively) to survey the relative mutabilities of surface and core positions of λ repressor.88 The lab generated 52 single mutants and assayed each for function. The low activity variants fell into two categories. First, there was a series of surface mutations that colocalized on the solved crystal structure - these mutants directly impaired DNA binding.89 A second class was distributed randomly throughout the hydrophobic core of the repressor. Seven surface positions and one core position were biophysically characterized in detail. The authors concluded that completely or partially buried mutations exert functional effects by altering stability and/or foldedness.

Brian Matthews, a crystallographer at the University of Oregon, has studied mutational effects in T4 lysozyme. In a radical study, Matthews challenged the hydrophobic core with mutations at positions with native leucines, isoleucines, valines and phenylalanines.90 All of these substitutions were destabilizing, but the native fold could be reproduced with ten simultaneous mutations to methionine. The Matthews lab and others have made substantially more mutations over the years trying to elucidate the relationships between sequence, structure and stability.91,92

The above handful of examples represents only a snapshot of the single mutants that have been constructed and analyzed in the past 30 years. Many labs have performed similar experiments in a variety of host structures. Unfortunately, we are still far from understanding the details of the sequence-structure-function relationship despite direct probes into this problem. Miller was able to assay nearly 4,000 mutants of the E. coli lac repressor - nearly two-thirds of the 6,270 (330 x 19) possible individual mutants. In the crystal structures of proteins it is clear that amino acids have to be physically and chemically compatible with neighboring residues. As such, to truly understand the Protein Folding

Problem, one needs to study more than one site simultaneously. This dramatically affects sequence space and opens the door for combinatorial biochemistry.

Combinatorial biochemistry involves the design, creation and analysis of libraries of proteins. Currently, the most straightforward approach to creating libraries starts at the gene level. Researchers order oligonucleotides with randomized nucleotide stretches. For example, if one wishes to test all possible codons at position x, the cloning would proceed with a primer with codon NNN at x (N=A/T/G/C). The NNK (K=T/G) codon still grants all twenty amino acids, but eliminates two of the three stop codons and some of the bias from amino acid codon degeneracy. Furthermore, one could analyze just the hydrophobic amino acids and alcohols by replacing the wild-type codon with DYV (D=A/G/T, Y=C/T, V=A/C/G). The biased mixing of codons can be used to create many distinct, but related libraries due to the evolution of the genetic code.
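As a minimal sketch of the degenerate-codon arithmetic described above, the following Python snippet expands a degenerate codon into its concrete codons and translates them with the standard genetic code. The helper names (expand, summarize) are illustrative and not taken from any library, and only the three degenerate codons mentioned in the text are exercised.

```python
from itertools import product

# IUPAC degenerate nucleotide codes used in the text.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "N": "ACGT", "K": "GT", "D": "AGT", "Y": "CT", "V": "ACG",
}

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(degenerate_codon):
    """Return every concrete codon encoded by a degenerate codon such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def summarize(degenerate_codon):
    """Return (number of codons, sorted amino acids encoded, number of stop codons)."""
    translated = [CODON_TABLE[c] for c in expand(degenerate_codon)]
    residues = sorted(set(translated) - {"*"})
    return len(translated), residues, translated.count("*")


for codon in ("NNN", "NNK", "DYV"):
    n_codons, residues, n_stops = summarize(codon)
    print(codon, n_codons, "".join(residues), n_stops)
```

Under the standard genetic code, this enumeration indicates that NNN spans 64 codons including all three stops, that NNK spans 32 codons which still cover all twenty amino acids while retaining a single stop codon (TAG), and that DYV spans 18 codons encoding Ala, Phe, Ile, Leu, Met, Ser, Thr and Val with no stop codons.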

Figure 4: Transformation Efficiencies.

The library size is shown as a function of the number of positions to be randomized to all twenty amino acids. The library size increases by more than an order of magnitude for each added position. Using logarithmic axes, we show standard transformation efficiencies for bacteria and yeast.

It is called combinatorial biochemistry because the addition of library positions expands

the sequence space combinatorially. Randomizing one position with an NNK codon

creates twenty library members (at the amino acid level). Randomizing two positions

creates a library size of 400, four yields 160,000, etc. (Fig. 4). In theory, one can

randomize as many positions as desired, but the number of variants that can be studied is

limited by several factors. One limitation is the transformation efficiency, which describes the efficiency with which cells take up extracellular DNA. Electrocompetent transformations of E. coli routinely yield 10⁸ clones.93 Many screens and proteins require the further complexity of eukaryotic cells, of which yeast are the easiest to handle. Competent cell preparations of S. cerevisiae generally cover smaller libraries nearing 10⁶ variants. This means that it is very difficult to cover libraries with more than 4 and 7 randomized positions in yeast and bacteria, respectively.
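The coverage argument can be checked with a few lines of arithmetic. The sketch below assumes, as a simplification not made in the text, that a library is "covered" when the number of transformants at least equals the number of distinct variants (no oversampling). The function names are hypothetical.

```python
def library_size(positions, alphabet=20):
    """Number of distinct variants when `positions` sites are fully randomized."""
    return alphabet ** positions


def max_covered_positions(transformants, alphabet=20):
    """Largest number of randomized positions whose library fits in `transformants` clones."""
    positions = 0
    while library_size(positions + 1, alphabet) <= transformants:
        positions += 1
    return positions


# Amino-acid-level libraries (alphabet of 20) at three transformation efficiencies.
for label, clones in [("yeast", 10**6), ("E. coli", 10**8), ("upper bound", 10**10)]:
    print(label, clones, max_covered_positions(clones))
```

With these assumptions, 10⁶ transformants cover four positions (160,000 variants) but not five (3.2 x 10⁶); 10⁸ transformants cover six positions (6.4 x 10⁷), and roughly 10¹⁰ would be needed to cover seven (1.28 x 10⁹). Counting at the DNA level with NNK codons (alphabet=32) lowers each of these limits by one position.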

As discussed above, cloning is a significant limitation to library sizes, but it is not the only constraint. Combinatorial experiments are also limited by the data collection method. Genetic screens that rely on life or death are the highest throughput - here one is simply limited by the number of cells one wishes to plate. Screens that rely on fluorescent or chromophoric reporters are lower in throughput, although the application of Fluorescence-Activated Cell Sorting increases the feasibility of screening large numbers.94,95 The methods described above are both in vivo techniques for screening

protein function. If one wishes to analyze the folding, stability, or activity, lower

throughput methods are required. Many of these are detailed in the second chapter.

One of the earliest applications of PCR-generated libraries was used to examine the core

of barnase in the Alan Fersht lab (University of Cambridge). In this study, twelve of the thirteen hydrophobic core residues were randomized, or subrandomized to a set of hydrophobic amino acids (VILMF).96 Note that it is impossible to include all hydrophobic amino acids, such as glycine and alanine, without also adding the alcohols. Barnase, a bacterial ribonuclease, is lethal to the cell when synthesized without its binding inhibitor, barstar. The library is first cloned into a nonsuppressing E. coli strain with two serine codons replaced by stop codons, leading to early truncation and inactivation of barnase. In theory, all library members permit growth in this strain. In a second cloning step, the library is expressed in a suppressing bacterial strain in which the nonnative stop codons are read as serines. Under this scenario, library members that form colonies must be inactive or have severely diminished (< 0.2 %) activity. Fersht found that ~23 % of the library retained activity, including several variants that lack any wild-type amino acids at the twelve core positions. These results led the authors to conclude that hydrophobicity is sufficient to construct functional cores. They also suggest that novel proteins should be designable from primitive cores.

Further examples of combinatorial libraries have been reported leading to the description

of two models for protein core packing, the oil droplet and jigsaw models.97 The oil droplet model suggests that the core simply requires greasy residues with approximate volumes and low packing specificities.98 This would imply that mutation to other hydrophobic residues would have little effect on protein structure and function, which is partially supported by the barnase experiment. Other examples suggest a higher degree of specificity in the packing of hydrophobic cores, much like a jigsaw puzzle.56

Which of the models is correct?

The likely answer is somewhere in between. Ken Dill proposes a model that he refers to as "nuts and bolts in a jar."97 In the jigsaw model, packing resembles a series of lock and key fits requiring specific size and shape complementation. The side chains are frozen and only gain rotational freedom once the protein is fully denatured. The nuts and bolts model has no shape complementarity and no critical separation. The nuts and bolts have rotational freedom at near maximal packing density. Work within the Magliery lab supports the nuts and bolts model. Lavinder et al. repacked the two central layers of Rop with hydrophobic amino acids (manuscript in preparation). Active variants were found with core volumes ranging from 230-440 Å³, lending initial credence to the oil droplet model. The active variants were then assayed for relative stability using High-Throughput Thermal Scanning.99 These results displayed a surprising trend in which overpacked variants generally yielded molten globular proteins, and stable variants had core volumes approximately equal to native Rop (320 Å³) (Fig. 5). This suggests there are more stringent packing guidelines than simple hydrophobicity. It was also discovered that stable variants were found at all core volumes, suggesting leniency to the jigsaw model. The best way to reconcile these observations is to describe the packing as the "nuts and bolts" model posited by Dill.

Figure 5: Redesign of Rop.

The central two layers of Rop were repacked with hydrophobic amino acids. Here, the stability and library fraction are plotted as a function of core volume. Native Rop has a core volume of 320 Å³. Image from Lavinder thesis.

Combinatorial biophysics is a powerful tool, however, there are still several challenges

that prevent its application towards solving the Protein Folding Problem. First, the

design and construction of large libraries is not trivial. As shown in Figure 4, our current molecular biology tools are limited to the construction of ~10¹⁰ variants. In vitro

methods may expand these numbers further, but present new challenges. Even if one

could generate larger libraries, the field is still limited by screening and analysis methods.

This is a particular problem when the "hit" rates ([active variants/total variants] x 100 %)

are not within a reasonable range. Excessively large hit rates require further deductions

to apply lower-throughput technologies for detailed characterization. Excessively low hit rates may require too many resources to amass a meaningful dataset, which may or may not be

statistically significant for detailed characterization. Finally, many of these methods are

expensive and time consuming. In the next section, we examine a new field that mates

and genomic sequencing to examine the Protein Folding Problem.

1.9 The Genomic Era and Sequence Statistics

Many regard April 14th, 2003 as the precise dividing line between the pre- and post-Genomic Eras.100 This is the date the Human Genome Project announced the completion of their final milestone: the complete sequencing of the human genome. I would argue that the Genomic Era actually began a few years prior as DNA sequencing technologies were improving, thus enabling the Human Genome Project.


The sequencing of DNA was a major hurdle in biochemistry due to its size. Techniques

for sequencing both protein and RNA preceded it, and many of these protocols were

applied to DNA with limited success.101,102 Frederick Sanger won the 1958 Nobel Prize in Chemistry for protein sequencing and soon turned his interests to nucleic acids. He began with a method called "plus-minus" sequencing where up to 80 base pairs could be determined in a single experiment.103 Sanger and colleagues were able to determine the 5,386 nucleotide genome of ΦX174 phage with this technique.104 The method was unfortunately slow, error prone and limited to single stranded DNA. In 1976, Allan Maxam and Walter Gilbert improved the throughput and fidelity of DNA sequencing using reagents that selectively modify and cleave nucleotide bases.105 The cleavage reactions were run in parallel lanes on an electrophoresis gel where the sequence could be

sequentially mapped. This marked a significant improvement over partial digestion

methods and the "plus-minus" technique, but was quickly replaced by the next iteration

of Sanger's methods.

Sanger turned to the use of chain terminating dideoxynucleotides (ddNTPs). Here, the

radiolabeled terminators lack the 3' hydroxyl needed for nucleophilic attack and chain

elongation.106 The sequencing reaction produces a ladder of products that can be

separated by gel electrophoresis. In this protocol, four reactions are run in parallel - one

for each nucleotide. This method was limited by the resolution of agarose gels to ~100

nucleotides. During the the technique was enhanced by

replacing the gel slabs with capillary electrophoresis and the radiolabeled ddNTPs with


dye-modified bases.107 These improvements yield ~1,000 base pairs reads, and eliminate

the necessity of parallel reactions since different fluorophores distinguish the bases. The

application of BigDye ddNTPs and capillary electrophoresis dramatically accelerated the

rates of sequencing. Walter Gilbert and Sanger shared the 1980 Nobel Prize in Chemistry for

their advancements. Parallel improvements in molecular biology methods such as

shotgun cloning, the use of bacterial artificial chromosomes and bioinformatics enabled

the complete assembly of the human genome in 2003. There are currently more than 200

organisms with complete published genomes (http://www.genomenewsnetwork.org) and

there are vastly more partial sequences available (Pfam). The velocity of genome

sequencing will continue to accelerate with further advances in sequencing technologies,

most notably, Roche's 454 and Illumina.108,109

The advances in sequencing technologies have given birth to a new field - protein engineering from sequence statistics. The design of proteins is important for two reasons. First, it

allows for the exploitation of certain features, such as enhanced activity or stability.

Second, the ability to design proteins is a rigorous test of our first principles that govern

protein science. Many fundamental properties of proteins have been deciphered through

reverse engineering efforts. As described earlier, the computational design and redesign

of proteins suffers from calculation errors and sampling. Empirical experiments, single

or combinatorial, suffer from sampling throughput. Furthermore, many variants in a

combinatorial experiment are inactive and provide little information. In contrast, the

genomic design and study of proteins provides solutions to many of these drawbacks.


The quintessential tool of genomic design is the multiple sequence alignment (Fig. 6).

Here, one makes use of genomic sequencing efforts to compare and contrast protein sequences, since they can be directly inferred from DNA using codon tables. Datasets of interest can be downloaded from several servers including the Protein Families Database (Pfam), the Protein Data Bank (PDB), UniProt/Swiss-Prot, GenBank, Kabat, etc. The work presented in this dissertation utilizes the Pfam dataset, which is a repository of protein sequences organized by family and aligned by hidden Markov models.110 The Pfam database strives to provide accuracy and complete coverage, which are often competing factors.111 The database curators manage two alignments for each family. The first is a small, high quality seed alignment that may not change between updates. The seed alignments are hand curated, annotated and confirmed as true members. Triosephosphate isomerase (TIM) has a seed alignment of 58 sequences in the now current version, 25.0. While the seed alignment provides the highest quality data, it severely under-samples the number of known protein sequences for any given family. Note that it would be impossible to hand curate the entire Pfam database, which contains over 12,000 families. To address this concern, and automate sequence identification, Pfam generates a profile hidden Markov model (built with the HMMER software) derived solely from the seed alignment. This profile is used to search the existing database, locate, and align additional members. In the case of triosephosphate isomerase, the profile identifies 3,894 TIM sequences from 2,946 species. The division of seed/full sequence libraries is the main novelty of Pfam's approach.

Figure 6: Example Multiple Sequence Alignment.

A multiple sequence alignment aligns the protein sequences of multiple organisms. These alignments provide information regarding the properties of proteins. Here we show two sites, red and blue, that are highly and weakly conserved, respectively. In purple, two positions are shown which are statistically correlated. Positively charged amino acids such as arginine (R) and lysine (K) form salt bridges with negatively charged residues, aspartate (D) and glutamate (E), at neighboring positions.

What can we learn from multiple sequence alignments (MSAs)? MSAs provide a wealth of data that, correctly harnessed, serves many applications. Similarity between known and unknown sequences may reflect evolution from a common ancestor and therefore similar functions and/or structures. As genomic efforts continue to produce unannotated sequences, BLAST and FASTA searches are often employed to provide clues to guide further experimentation. MSAs are used to construct phylogenetic trees from clustering algorithms that trace the trajectory of mutations based on evolutionary likelihoods. As described in the computational section, multiple sequence alignments can predict secondary, tertiary and even quaternary structures - if homologs have known folds. MSAs can also be used to identify functionally important positions such as catalytic or binding site residues. For example, MSAs may identify the catalytic triad of a putative serine protease (serine-aspartate-histidine). Casari et al. were among the first to predict functional residues from MSAs using model cyclins.112 The state of the art has continually improved since this study.113,114 Identification of catalytic sites coupled with high sequence identity can be used to assign function to unknown sequences with reasonable accuracy. However, divergent evolution can lead to proteins with similar sequences and structures but different functions. The analysis of MSAs can also elucidate the folding code. Previous examples have demonstrated that the folding code is necessarily contained within the primary sequence. It may be easier to decipher the code in the context of many aligned and related sequences. How has nature evolved through sequence space while maintaining function?

Multiple sequence alignments mate the best of computational and empirical resources. Unlike purely computational data, these sequences have been validated for folding, stability and activity. If they had not been, they would not exist in nature. They are similar to the output of a combinatorial experiment, but less laborious. In the limit, all the protein properties are encoded in MSAs, and they represent many more mutations leading to stable, active proteins than could be created and characterized in the laboratory. For example, consider the screening of binding peptides to SH2 domains of interest (Fig. 7).115 Here, the Dehua Pei lab (The Ohio State University) profiled the recognition specificities of several SH2 domains against libraries of pY-containing peptides. Identified binding partners are assembled into multiple sequence alignments. The following image shows the alignment as amino acid percentages. Looking at one such example, the Grb2 SH2 domain shows little preference at the -2 position, but requires asparagine at the +2 position.

Figure 7: Binding Specificities of SH2 Domains.

Three SH2 domains are shown with distinct binding specificities. Here the phosphorylated tyrosine (pY) is located at the 0 position. A library with amino acids from -2 to +3 was screened and analyzed. Figure from Wavreille et al., reprinted with permission from Dehua Pei.

In theory, a genomically derived MSA of peptide sequences that bind Grb2 may reveal the same conclusions - a strong preference for asparagine at the +2 position and little to no bias at the -2 and -1 positions. In the combinatorial experiment the library is designed, constructed, screened and analyzed. Analyzing genomic data simplifies this process as the first three steps - generally the hardest three steps - are performed by Nature. The libraries are generated by random mutation, drift, recombination and gene transfer. The DNA sequences and proteins are produced natively by the organisms and Natural Selection provides the screening power. Here, only variants that confer fitness populate the alignment. The laboratory SH2 example was limited to 20⁵ ≈ 3 x 10⁶ peptides. The long evolutionary history suggests that Nature has sampled significantly greater sequence space.

The post-genomic era provides a novel approach to understand the relationship between sequence and biophysical properties. Sequences of protein families can be studied to deduce how information is "printed" within primary structures. This may be considered a bioinformatics formulation of the Inverse Folding Problem for the post-genomic era.48,116

Furthermore, the application of sequence statistics may be used to improve the stability of native proteins (Chapter 2).

1.9.1 Conservation Statistics

The most intuitive analysis of multiple sequence alignments is the observation of conservation. If there is zero bias in the sequence determinants at a given position, one expects all amino acids to be present roughly equally (100 % / 20 aa = 5 % each). This value may be adjusted for codon usage since the amino acids do not equally populate genomes (e.g., leucine codons populate the S. cerevisiae genome at 10 %, but tryptophan is seen at only 1 %). Alternatively, one could use global mean propensities to normalize codon usage by correcting for the number of codons (e.g., leucine has six codons versus one for tryptophan). The differences between these three treatments are marginal when calculating the consensus amino acid at highly conserved sites, but the treatments may differ at nonconserved positions. For example, a consensus TPR is unfolded, but the global mean propensity TPR is folded and stable (Regan, unpublished). As noted above, the conservation at different sites can be used to identify key residues, which in turn may help annotate unknown proteins. How can this information be used to design proteins of interest and further our understanding of the Protein Folding Problem? Here we discuss the engineering of proteins using consensus-guided mutations. The following discussion presents two formulations of consensus design. First is the literal selection of mutations based on multiple sequence alignments. Second are related methods that employ additional rational and irrational design considerations.
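The three treatments of conservation described above can be compared on a toy alignment. The sketch below is illustrative only: the nine-sequence alignment is invented, the background frequencies for leucine and tryptophan follow the approximate S. cerevisiae values quoted in the text while those for glycine, lysine and arginine are assumed placeholder values, and the function names are hypothetical.

```python
from collections import Counter

# Codons per amino acid in the standard genetic code (six for Leu, one for Trp).
CODONS_PER_RESIDUE = {
    "A": 4, "R": 6, "N": 2, "D": 2, "C": 2, "Q": 2, "E": 2, "G": 4,
    "H": 2, "I": 3, "L": 6, "K": 2, "M": 1, "F": 2, "P": 4, "S": 6,
    "T": 4, "W": 1, "Y": 2, "V": 4,
}

# Assumed genome-wide usage: L and W follow the text; G, K and R are placeholders.
BACKGROUND = {"G": 0.05, "L": 0.10, "W": 0.01, "K": 0.07, "R": 0.04}


def column_frequencies(alignment, col):
    """Observed amino acid frequencies in one alignment column, ignoring gaps."""
    counts = Counter(seq[col] for seq in alignment if seq[col] != "-")
    total = sum(counts.values())
    return {aa: n / total for aa, n in counts.items()}


def consensus(alignment, expected=None):
    """Pick the most enriched residue in every column.

    With expected=None the most frequent residue wins. Otherwise each observed
    frequency is divided by expected[aa] before the maximum is taken.
    """
    result = []
    for col in range(len(alignment[0])):
        scores = column_frequencies(alignment, col)
        if expected is not None:
            scores = {aa: f / expected[aa] for aa, f in scores.items()}
        # Sorting first resolves ties alphabetically; all-gap columns stay gaps.
        result.append(max(sorted(scores), key=scores.get) if scores else "-")
    return "".join(result)


# Invented alignment: column 1 is invariant, columns 2 and 3 are weakly conserved.
toy_msa = ["GLK", "GLK", "GLR", "GWR", "GWK", "GLK", "GWR", "GLR", "GWK"]

raw = consensus(toy_msa)
by_background = consensus(toy_msa, BACKGROUND)
by_codon_count = consensus(toy_msa, CODONS_PER_RESIDUE)
print(raw, by_background, by_codon_count)
```

On this toy input the three treatments agree at the invariant first column but select different residues at the two weakly conserved columns, which is consistent with the point that the choice of normalization matters mostly at nonconserved positions.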

Boris Steipe and Andreas Pluckthun (University of Zurich) were amongst the first to engineer proteins from sequence statistics. Antibodies are multi-chained proteins that tightly bind antigens within the immune system. These molecules bind a nearly unlimited range of antigens through a mechanism of recombination and somatic hypermutation at the complementarity-determining regions (CDRs) of the heavy and light chains. The light chain has ~300 variable and 4 joining gene segments genetically encoded, while the heavy chain has 100-1000 variable, 4 joining and 12 diversity gene segment copies. These segments generate combinatorial solutions during B-cell differentiation. Upon antigen binding and proliferation, further rounds of recombination and mutation lead to an estimated diversity of >10^15 - many orders of magnitude more than the ~20,000 human genes. Antigen binding mutations often carry significant penalties to domain stability, complicating engineering efforts. The immunoglobulin fold is therefore an attractive candidate for stability enhancement.

Steipe et al. addressed the problem of domain stability by analyzing the Kabat immunoglobulin sequence database (www.kabatdatabase.com).117 The authors hypothesized that: 1) The repertoire of sequences represents a canonical ensemble of sequences compatible with antibody function; 2) The stability of an average member is marginal; and 3) Sequence mutations that affect stability are in equilibrium. This is because under physiological conditions there is no selection for stability beyond a certain threshold, meaning that destabilizing mutations are selectively neutral as long as the domain stability remains above that threshold. Likewise, there is no selection pressure to produce hyperstable domains. Statistical mechanics allows us to treat the amino acid populations at each site as free energies that quantitatively describe selection at that position. Therefore, mutations to consensus amino acids should only select for domain stabilization (barring biases from database construction). To test these hypotheses, Steipe predicted ten stability-enhancing consensus mutations to the Vκ domain of the anti-phosphorylcholine antibody, McPC603. The ten single mutants were constructed via site-directed mutagenesis and assayed for stability by chemical denaturation. Six of the ten mutants were stabilized, three were neutral and only one mutation affected the stability negatively. Random mutations would have performed significantly worse, perhaps yielding zero stabilized variants.
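The idea that amino acid populations can be read as free energies can be illustrated with a Boltzmann-style relation. The sketch below is an assumption-laden illustration rather than Steipe's published calculation: the exact functional form, the use of a physical temperature, and the example frequencies are all hypothetical.

```python
import math

GAS_CONSTANT = 8.314e-3  # kJ mol^-1 K^-1

def statistical_ddg(freq_new, freq_old, temperature=298.0):
    """Boltzmann-style estimate (kJ/mol) for replacing the 'old' residue with the 'new' one.

    Assumes the frequencies behave like populations of a canonical ensemble,
    so that ddG = -RT ln(f_new / f_old). Negative values predict stabilization.
    """
    if freq_new <= 0 or freq_old <= 0:
        raise ValueError("frequencies must be positive")
    return -GAS_CONSTANT * temperature * math.log(freq_new / freq_old)

# Hypothetical position: the wild-type residue is seen in 5 % of sequences,
# the consensus residue in 60 %.
print(statistical_ddg(freq_new=0.60, freq_old=0.05))
```

Under this reading, a mutation toward a more populated residue always yields a negative (stabilizing) estimate, which is the prediction that the ten McPC603 mutants were designed to test.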


Steipe continued to statistically design antibodies in his independent career at the University of Toronto. His lab made panels of consensus mutations to the VL domain of murine intrabodies and characterized the variants for expression, solubility and stability. Steipe saw excellent correlations between stability and yield within the VL domains. The individually stabilizing mutations were combined in one molecule, resulting in an antibody with ΔGF = -34.3 kJ mol-1 compared to -13.5 kJ mol-1 for wild-type.118 The combining loops from the esterolytic antibody, 17E8, were grafted onto the host architecture of Steipe's dramatically stabilized VL.119 This hybrid enzyme was as active as 17E8 when expressed and purified from E. coli, without any additional engineering. These lessons were further tested in the consensus design of VH domains. The heavy domains are generally less stable and more prone to aggregation. Wirtz and Steipe predicted and validated six stabilizing consensus mutations to the heavy chain. When combined into a single domain, the TM increased by 6 °C and the D50 in urea increased by 1 M. Later, Andreas Pluckthun observed that only seven VL and seven VH families cover more than 95 % of the human germ-line repertoire. Knappik et al. designed fully consensus antibodies of each family. The goal of this engineering project was to create starting scaffolds that were stable and modular and that avoided the HAMA (Human Anti-Mouse Antibodies) response. The consensus domains expressed well in E. coli and were surprisingly stable by thermal denaturation. The sequences were designed to include convenient restriction endonuclease sites to facilitate CDR engineering. Using phage display, the authors were able to generate libraries approaching 10^9 molecules. To add modularity to the system, the seven heavy and seven light chains can be differentially matched, creating 49 unique starting scaffolds. The panning of these libraries against multiple substrates produced low nanomolar affinity binders with specificity.

Antibodies are difficult proteins to engineer, characterize and produce in mass quantities. This is largely because the proteins are multi-chained, contain disulfides, lack inherent stability, and require special conditions to express in bacteria. These factors have led laboratories to search for similar scaffolds with more amenable features. The Pluckthun lab has approached this problem by engineering repeat proteins such as the Ankyrin and Armadillo repeats. Ankyrin repeats (Anks) contain 33 amino acids that fold into two antiparallel α-helices. The repeat proteins generally contain several of these helical units. The native domains mediate protein-protein interactions where contact is driven by the loops, much akin to antibodies. Advantageously, these proteins are highly soluble, express well and contain no disulfide bonds. Pluckthun's lab has engineered these molecules for several tasks, coining the name Designed Ankyrin Repeat Proteins (DARPins).

The original consensus Anks were designed with the most common amino acid at each position, but randomized loop sequences.120 Six randomly chosen library members were well folded with stabilities ranging from stable (-10 kcal mol-1) to highly stable (-21 kcal mol-1). The X-ray diffraction structure was solved for one of these variants at 2.0 Å.


The Pluckthun lab has published over 30 articles demonstrating the use of DARPins. The domains are generally very stable with TMs between 66-85 °C.121 In 2004, the lab reported high affinity binding of consensus Anks to several targets including MBP, JNK2 and p38.122 The KDs measured by surface plasmon resonance were in the single-digit nanomolar range, and the binders failed to bind other targets, confirming specificity. Consensus DARPins have also been engineered to inhibit intracellular kinases and phosphotransferases, as well as HER2 and neurotensin.123-126 These represent a small slice of the work performed with DARPins - for further discussion of design and application please see Pluckthun's review.127 The lab has recently begun evaluating the effectiveness of consensus proteins.128

The Regan lab (Yale University) has consensus engineered the tetratricopeptide repeat proteins (TPR). These repeat domains are similar in structure to ankyrin repeats and likewise mediate protein-protein interactions. The Regan consensus TPRs are actually global mean propensity TPRs; PG = [nij/Nj]/[ni/N], where nij is the number of i amino acids at position j, Nj is the total number of amino acids at position j, ni is the total number of i amino acids in the OWL/PFAM database and N is the total number of amino acids in that database.129 Accounting for global mean propensities allows for permissible amino acids at every position. For example, tryptophans and methionines may be underrepresented in consensus alone because they have fewer codons. Global mean propensities correct for this, putting the amino acids on more level ground. Native TPRs have 3-16 repeats, but it is not clear if this is due to stability or functional constraints. Main et al. designed GMP-TPRs with a single, double and triple repeat. Previous work demonstrated stability enhancements from consensus design, and this was again seen in the TPR work. CTPR1, CTPR2 and CTPR3 (consensus TPR # repeats) were all monomeric and folded. Circular dichroism ellipticity was monitored with increasing temperature to assay thermal stability. The authors found that the TMs increased with the number of repeats and that all transitions were reasonably reversible. The structures were determined by solution state NMR and X-ray diffraction. Later, the Regan lab analyzed the binding of ~10 native TPRs to Heat Shock Protein 90 (Hsp90).130 Consensus binding residues, determined from known Hsp90 binders, were hosted on the CTPR3 architecture. The designed protein was stable and bound Hsp90 with KD = 200 µM. The consensus TPR was specific, but bound the ligand 40-fold weaker than native Hsp90 binders. Magliery and Regan quantified conservation with statistical free energies and relative entropy calculations.131 For proteins with diverse binding specificities (e.g. TPRs, ankyrins, His2-Cys2 Zn fingers and PDZ domains) they observed that positions in contact with peptide ligands are more variable than average surface positions.
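The global mean propensity defined above, PG = [nij/Nj]/[ni/N], can be sketched as follows. The column and the background fractions are hypothetical values chosen for illustration (they are not taken from the OWL/PFAM database), and the function name is an assumption.

```python
from collections import Counter

def global_mean_propensities(column, background):
    """PG for every residue observed in one alignment column.

    column:     string of residues found at position j (n_ij are its counts, N_j its length)
    background: {amino acid: n_i / N}, the global fraction of each amino acid in the database
    """
    counts = Counter(column)
    n_j = len(column)
    return {aa: (n_ij / n_j) / background[aa] for aa, n_ij in counts.items()}

# Hypothetical column: 6 Leu, 3 Trp, 1 Ala
column = "LLLLLLWWWA"
# Hypothetical global fractions (Leu common, Trp rare)
background = {"L": 0.10, "W": 0.01, "A": 0.08}

propensities = global_mean_propensities(column, background)
consensus_choice = Counter(column).most_common(1)[0][0]
propensity_choice = max(propensities, key=propensities.get)
print(consensus_choice, propensity_choice)
```

In this toy column the raw consensus residue is leucine, while the propensity-weighted choice is tryptophan, because a rare amino acid appearing in 30 % of sequences is far more enriched over its background than a common amino acid appearing in 60 %.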

Sir Alan Fersht (University of Cambridge) has devoted his career to the elucidation of two scientific problems: the structure and mechanism of proteins, and the tumor suppressor protein p53. For him, these two disciplines are intimately related. As noted in Section 1.1, mutations to p53 are directly implicated in 50 % of human cancers, and p53 is likely involved in 100 % of cancer cases.1 This apoptosis-inducing protein requires tight regulation for homeostasis, which it achieves through numerous binding partners and weak stability resulting in increased proteolysis. Unfortunately, this weak stability makes p53 susceptible to loss of function through mutation. Improving the stability of p53 with small molecules or gene therapy is a long-standing goal in cancer research. Furthermore, the detailed characterization of p53 is complicated by its poor bacterial expression and solubility. In order to study the protein in an academic environment, Fersht's lab designed and characterized consensus mutations in p53.132 They began with an alignment of 23 p53 homologs from different species. M133L, V203A, N239Y and N268D were combined into a quad mutant that is 2.6 kcal mol-1 and 5.6 °C more stable than the wild-type core domain with no effects on activity.

Consensus design has also been applied to the engineering of fluorescent and chemiluminescent proteins. Dai et al. aligned 31 fluorescent proteins with at least 62 % sequence identity to monomeric Azami Green (mAG).133 The consensus amino acid was selected at each position, but at highly variable sites the mAG residues were chosen. The novel consensus protein is slightly less stable than mAG and has red-shifted fluorescence, but expresses better and is brighter. Loening et al. sought to reduce the serum lability of Renilla luciferase for use as a bioluminescent reporter.134 The authors made eight consensus-guided mutations to form RLuc8. RLuc8 is 200-fold more resistant to inactivation in murine serum and exhibits a 4-fold improvement in light output.

One of the most impressive uses of consensus design was the reengineering of fungal phytases. Phytases are enzymes belonging to the histidine acid phosphatase family that hydrolyze phytic acid to release inorganic phosphate. Livestock such as poultry and pigs lack the phytase activity required to liberate phosphate from plant diets and therefore require expensive and ecologically threatening supplementation. An attractive solution would be to add an active phytase enzyme to food sources. Unfortunately, the feed is pelleted at manufacturing temperatures ranging from 60-90 °C and all known phytases unfold around 50 °C. Lehmann and Wyss aligned 13 known phytases from six different fungal species.135 In this case, additional rational design was also employed: 1) The first 26 aa, which contain a signal sequence, were taken from A. terreus CBS phytase. 2) At 18 ambiguous positions the authors chose A. niger or A. fumigatus amino acids. The resulting sequence showed between 58 and 80 % sequence identity to its parent sequences. The melting temperature measured by DSC and heat inactivation was 78 °C, 15-22 °C higher than its predecessors. The consensus phytase showed maximal activity at 70 °C, but was as active as its mesophilic parents at 37 °C. All wild-type sequences increased activity with rising temperatures until the TM was reached. The authors later increased the TM of the consensus protein beyond 90 °C with an improved alignment and site-directed mutagenesis at predominantly surface-exposed sites.136,137

The Abp1p SH3 domain from S. cerevisiae is considerably less stable than comparable SH3 domains.138 A large alignment of SH3 domains revealed eight atypical residues in Abp1p. Eight consensus mutations at these positions were constructed individually, and three resulted in resistance to chemical and thermal denaturation. The three stabilizing mutations were combined in the SH3 domain, leading to a dramatic gain in stability. The TM rose from 60 to 90 °C and the D50 in urea doubled. The folding and unfolding rates were measured to assess the kinetic stability of the designed binding motif. The engineered domain folded ten times faster and unfolded five-fold slower. Thermodynamic stability

has also been reported in consensus variants of papain, 1,

subtilisin and glucose dehydrogenase.139-142 In all of these studies additional design

constraints were employed including structure-based alignments and computational

modeling.

TAG-72 is a glycoprotein overexpressed on the surface of cancer cells. CC49 is a clinically validated antibody that binds TAG-72.143 Roberge et al. designed a CC49 scFv-β-lactamase fusion, TAB2.4, for use in antibody-directed enzyme prodrug therapy (ADEPT).144 Unfortunately, TAB2.4 expressed poorly and was prone to degradation. Combinatorial Consensus Mutagenesis (CCM) was recruited to increase the native stability of the CC49 scFv-BLA fusion. Multiple sequence alignment of antibodies revealed 11 positions in TAB2.4 that harbored amino acids seen in < 5 % of homologs. Consensus mutations were made at sites of high conservation, and a library of higher frequency amino acids was tested combinatorially at nonconserved sites. TAB2.5 was identified from this library with enhanced features - 4-fold improved expression and a 2.5 °C increase in TM. Recently, Biogen Idec has designed stabilizing libraries based on Shannon entropies from the multiple sequence alignment of Antibodies CH3 and αTT.145,146
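Two of the statistics mentioned in this paragraph, the < 5 % rarity cutoff and the Shannon entropy of an alignment column, can be sketched as below. This is only an illustration under stated assumptions: the function names, the gap handling, the base-2 logarithm and the toy data are hypothetical and are not taken from the CCM or Biogen Idec studies.

```python
import math
from collections import Counter

def rare_positions(target, alignment, threshold=0.05):
    """Indices where the target's residue occurs in fewer than `threshold` of the homologs."""
    flagged = []
    for j, residue in enumerate(target):
        column = [seq[j] for seq in alignment if seq[j] != "-"]
        if column and column.count(residue) / len(column) < threshold:
            flagged.append(j)
    return flagged

def shannon_entropy(column):
    """Shannon entropy (bits) of one column: 0 for full conservation, log2(20) at most."""
    residues = [aa for aa in column if aa != "-"]
    total = len(residues)
    return -sum((n / total) * math.log2(n / total) for n in Counter(residues).values())

# Hypothetical two-residue "alignment" of 25 homologs
homologs = ["AK"] * 24 + ["AR"]
print(rare_positions("AR", homologs))
print(shannon_entropy([seq[1] for seq in homologs]))
```

Positions flagged by the first function are candidates for consensus mutation, while the entropy gives a single number per column that separates highly conserved sites from nonconserved sites where a combinatorial library may be more appropriate.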


An interesting philosophical debate comes to light when analyzing conservation in multiple sequence alignments. That is, what are the driving forces that lead to amino acid bias at different positions? Anywhere from a handful to the majority of sites are nonconserved, meaning many amino acid identities are tolerated. Some positions can be directly tied to function, such as catalytic residues and interaction and binding sites. Other positions may be involved in dynamics such as loop motions and allostery. But for most proteins the number of defined "functional" residues will be low, leading us to wonder how the folding code is distributed throughout the remainder of the positions with medium to high conservation. Are these positions structurally significant, stability determining, or are they simply phylogenetic artifacts? In general, once a position mutates from an ancestral sequence it is unlikely that further mutations will occur at that same site. For example, the rate of mutation in E. coli is approximately 1 per 10^8 base pairs under neutral conditions.147 While this makes subsequent mutations at the same site appear unlikely, it is not that improbable on the timescale of evolutionary history. Here, we argue that a conserved position that does not directly relate to function must have implications for the overall fitness of the organism (protein structure, stability, dynamics, etc.), or else it would have been subject to further mutation and lost its amino acid bias. Unfortunately, experiments to directly test this theory remain elusive.

On the other side of the argument, Frances Arnold (California Institute of Technology) and Donald Hilvert have attempted to demonstrate consensus protein design without phylogenetic bias.148 Here, the authors argue that Steipe's idea that consensus stability is correlated to statistical free energies fails, because the sequences are not independent as they are biased by common ancestry. In other words, this violates the logarithmic relationship between an amino acid's stability contribution and its frequency in an MSA. The authors suggest that these biases can be avoided by making large libraries of a single protein sequence and constructing alignments based on these libraries. Combinatorial strategies were applied to chorismate mutase (CM) and library members were screened in an E. coli strain deficient in the desired activity. The screen identified 26 catalytically active variants that were used to generate a synthetic-consensus CM. The consensus CM is 9 °C and 2.6 kcal mol-1 more stable than any of the 26 library members. The specific activity was 2-fold greater than the E. coli CM. To provide generality to the method, they also repeated the procedure with M. jannaschii CM. Similar results were obtained, but the consensus M. jannaschii CM is 30-fold less active than wild-type.

It is unclear to me whether these experiments lend any credence to this method over standard consensus design. First, in terms of stability and activity, this work at best matched, and in some cases failed to match, the results seen in pure consensus designs. Second, the design and characterization of the libraries represent significant hurdles in both time and resources. Finally, while it may not be phylogenetic bias, the sequence datasets themselves are biased - perhaps to an even greater degree than raw consensus approaches. This method forces bias by demanding the selection of a wild-type starting sequence and adds further stereotyping by controlling the library mutagenesis and screening organism, not to mention inherent biases in PCR. It is conceivable that this method could be powerful for engineering proteins lacking homologs, like Rop. The Magliery lab has studied the central core (ITLA) of this four-helix bundle. A hydrophobic/alcohol library of these four positions produced a consensus sequence of IVVA that was considerably more stable than all other library members.

1.9.2 Ancestral Statistics

While some have chosen to argue that the phylogenetic bias of consensus design is problematic, others have chosen to embrace the ancestral nature of multiple sequence alignments in a new field coined, Ancestral Design.

The antiquity of thermophilic organisms has been suggested by Woese.149 He notes that thermophiles exist in both bacteria and archaea and generally have the deepest and shortest branches in the phylogenetic tree. The idea of a thermostable ancestor is well supported by other labs.150,151 Forterre, in contrast, believes that the thermostabilities seen in bacteria and archaea are a result of convergent evolution and not common ancestry.152 To directly test the thermophilic ancestor hypothesis, the Akihiko Yamagishi lab (Tokyo University) has designed and characterized ancestral sequences. The lab began with the CLUSTAL alignment of all 3-isopropylmalate dehydrogenases from the GenBank database.153 The MSA was phylogenetically analyzed by PHYLIP, a phylogenetic tree was constructed, and the ancestral sequence was determined.154 The lab analyzed a handful of single and double mutations in addition to one quad mutant. The estimated melting temperatures by CD were recorded at > 95 °C for all variants, but the wild-type started with TM = 96 °C. Two of the seven variants suffered from diminished activity, but all ancestrally mutated enzymes performed better than wild-type in heat inactivation profiles. Several years later, the Yamagishi lab turned to ancestral design of isocitrate dehydrogenase, again showing that four out of the five single mutations improved thermostability.155 The enzyme 3-isopropylmalate dehydrogenase was revisited in 2006.156 This time, the authors made twelve individual ancestral mutations to 3IPMD from Thermus thermophilus - another thermophilic organism, but 10 °C less stable than the previous model organism, Sulfolobus sp. This paper reports that six of the twelve mutations were stabilizing. Later, the authors combined several of these mutations into multiply substituted variants and saw additive effects on the stability.157 While these results present an intriguing alternative to consensus design, they offer little added benefit. The authors of these articles do not cite the work of Steipe, Lehmann and Wyss, and offer no comparison between their approach and consensus design. In addition, the majority of their ancestral mutations were, in fact, consensus. Interestingly, the authors posit that similar mutations will have greater effects in mesophilic proteins and state that such studies are under way. Unfortunately, none of these studies have been published.

Recently, Tawfik has used phylogenetic information, akin to Yamagishi, to guide

combinatorial libraries of serum paraoxonases and sulfotransferases.158 In a medium-

throughput search of 300 variants, Tawfik identified a sulfotransferase with 50-fold

enhanced activity. Tawfik states that ancestral libraries comprise a means of focusing

diversity to positions that readily trigger changes in reaction specificity, thereby


facilitating the isolation of new variants by medium-throughput or even low-throughput

screens. This makes sense for promiscuous enzymes, like paraoxonase and

sulfotransferase, but it is unlikely that activity enhancements would be seen for well

diverged enzymes.

In 2007, the Joseph Thornton lab (University of Oregon/HHMI) published the crystal

structure of a resurrected ancient enzyme.159 Here, the authors deciphered the common

ancestral sequence of glucocorticoid and mineralocorticoid receptors. Thornton suggests

that the primitive protein had a broader range of substrate specificities and that evolution

of function occurred by a series of mutations that destabilized the receptor structure with all ligands, but compensated with novel interactions specific to the new ligand.

Similar results have been reported in TIM-barrel enzymes.160

1.9.3 Correlation Statistics

Conservation statistics are applied independently to positions in multiple sequence alignments - meaning the distribution at position x has no bearing on the statistical information derived at position y. Even a basic analysis of atomic structures and mutation studies shows that some - if not most - positions are chemically and physically intertwined with other positions. These interactions are potentially ablated in consensus design. Furthermore, correlations are another level of information encoded within the folding code. Much like the elucidation of secondary structure propensities, understanding the roles of conservation and correlation aids our ability to solve the Protein Folding Problem.


The calculation of correlations is considerably more complicated than consensus. Given a 100-residue protein there are 10,000 (100 x 100) pairwise correlations. If one wishes to understand the amino acid contributions at each site, the twenty identities need to be explicit within the calculation. This expands the correlation matrix into a third dimension with a total complexity of 4,000,000 data points (10,000 x [20 x 20]). Many bioinformatics labs have proposed ways of calculating these correlations, but that is beyond the scope of this introduction. Here we highlight several examples where bioinformatics-driven hypotheses have been refuted or validated with empirical experiments.
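The size of the correlation problem quoted above can be checked with a short calculation. The function name is hypothetical; the count of unique pairs is added here only to show how much of the square matrix is redundant.

```python
def correlation_data_points(n_positions, n_amino_acids=20):
    """Return (ordered position pairs, unique position pairs, residue-resolved data points)."""
    ordered_pairs = n_positions * n_positions               # full square matrix
    unique_pairs = n_positions * (n_positions - 1) // 2     # x < y only, no self-pairs
    per_pair = n_amino_acids * n_amino_acids                # every residue-residue combination
    return ordered_pairs, unique_pairs, ordered_pairs * per_pair

ordered, unique, total = correlation_data_points(100)
print(ordered, unique, total)  # 10000 4950 4000000
```

Because the matrix is symmetric, fewer than half of the 10,000 position pairs are distinct, but the residue-resolved table still runs to millions of entries for even a small protein.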

How do distant sites in three-dimensional structure communicate within proteins? This is an important question in allostery, where binding at one site has distal effects. Here, information is "communicated" or "propagated" through a network of interactions that connect the distant sites. Thermodynamic double-mutant cycles have been used to validate these networks, but they are low-throughput and their practicality is limited to small proteins.161-163 Rama Ranganathan (University of Texas Southwestern Medical Center) proposed that these interaction networks could be statistically calculated from sequences. If two positions are functionally coupled, their evolution should be mutually constrained, which should be represented in the statistical coupling of the amino acid distributions. Ranganathan calculated these interactions using Statistical Coupling Analysis (SCA).164 First, they calculate a statistical energy (ΔGstat) as the root-mean-square of the binomial probabilities for each amino acid appearing at its observed frequency compared to a reference frequency. The statistical energy at site x is measured for two conditions: 1) The full MSA and 2) A subset of the MSA where the amino acid identity at position y is held constant. The differences between these calculations (ΔΔGstat) revealed pathways of physical connectivity between the peptide binding sites and cores of PDZ and POZ domains. The authors propose that binding energies are propagated along these pathways and similar phenomena may play roles in allosteric regulation. To further validate these assertions, Ranganathan examined three protein families: G protein-coupled receptors, chymotrypsin and hemoglobins.165 Sparse, but connected networks were discovered within each family. It would be interesting if the authors demonstrated the allosteric effect of these correlations with mutations that ablate and conserve statistical interactions.
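The perturbation logic described above (compare the distribution at site x in the full alignment with the distribution in the subset where site y is fixed) can be sketched as follows. This is a deliberately simplified stand-in, not the published SCA calculation: it substitutes log frequency ratios for the binomial probabilities, works in arbitrary units, uses a uniform reference distribution, and all names and the toy alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
UNIFORM_BACKGROUND = {aa: 1 / 20 for aa in AMINO_ACIDS}  # assumed reference frequencies

def frequencies(column, epsilon=1e-3):
    """Smoothed residue frequencies; epsilon keeps unseen residues away from log(0)."""
    residues = [aa for aa in column if aa in AMINO_ACIDS]
    if not residues:
        raise ValueError("column has no residues")
    counts = Counter(residues)
    n = len(residues)
    return {aa: (counts[aa] / n + epsilon) / (1 + 20 * epsilon) for aa in AMINO_ACIDS}

def log_ratios(column, background):
    freqs = frequencies(column)
    return [math.log(freqs[aa] / background[aa]) for aa in AMINO_ACIDS]

def delta_g_stat(column, background=UNIFORM_BACKGROUND):
    """Conservation of one site: magnitude of the log-ratio vector (arbitrary units)."""
    return math.sqrt(sum(v * v for v in log_ratios(column, background)))

def delta_delta_g_stat(alignment, site_x, site_y, residue_y, background=UNIFORM_BACKGROUND):
    """Change at site_x when the alignment is restricted to sequences with residue_y at site_y."""
    full = [seq[site_x] for seq in alignment]
    subset = [seq[site_x] for seq in alignment if seq[site_y] == residue_y]
    full_vec = log_ratios(full, background)
    subset_vec = log_ratios(subset, background)
    return math.sqrt(sum((s - f) ** 2 for s, f in zip(subset_vec, full_vec)))

# Hypothetical alignment: sites 0 and 1 co-vary (A with K, D with E); site 2 is independent.
toy = ["AKL", "AKV", "DEL", "DEV"] * 10
print(delta_delta_g_stat(toy, site_x=0, site_y=1, residue_y="K"))  # large: coupled
print(delta_delta_g_stat(toy, site_x=2, site_y=1, residue_y="K"))  # zero: independent
```

In this simplified form the reference frequencies cancel in the difference, so only the conservation term depends on the choice of background. A coupled pair yields a large perturbation, whereas an independent site is left unchanged by fixing its neighbor.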

Figure 8: SCA and WW-Domains.

WW-domain shown with bound peptide. Pie charts showing the outcome of folding studies for natural (n=42), CC (n=43), IC (n=43), or random (n=19) WW sequences. Red, natively folded; blue, soluble but unfolded; yellow, insoluble; grey, poor expressing. Figure adapted from Socolich et al. with permission from Rama Ranganathan.


In 2005, Ranganathan applied the SCA method to the Protein Folding Problem, trying to

define the sequence rules for specifying a fold. The lab created three libraries of the 36

residue WW-domain. The first library was constructed with no evolutionary information

and represents random sequences. The second library was constructed using site-

independent conservation (consensus only) and the final library was designed with coupled conservation (consensus and correlation).166 All libraries produced soluble

members. The inclusion of correlated data did not improve the solubility of the library,

but did increase the fraction of native-like folded sequences (0 to 28 %). Several

members from the CC-Library were characterized in detail confirming WW-domain

structure and function.167 It is important to note that Ranganathan's site-independent

conservation WW-domains are not true consensus proteins as reported in section 1.9.1.

Here, the amino acid at each position was chosen at random based on amino acid

frequencies derived from the multiple sequence alignments. This procedure likely

scrambles conserved correlations that populate the consensus ankyrins, TPRs, etc.
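The distinction drawn here, between picking the most common residue at each site and drawing each site at random from its observed frequencies, can be made concrete with a small sketch. The toy alignment, function name and random seed are assumptions for illustration and do not reproduce the published library design.

```python
import random

def site_independent_sequence(alignment, rng):
    """Draw every position independently, in proportion to that column's residue frequencies."""
    length = len(alignment[0])
    # rng.choice over the raw column samples each residue with its observed frequency
    return "".join(rng.choice([seq[j] for seq in alignment]) for j in range(length))

# Hypothetical alignment in which the two sites are perfectly coupled (A with K, D with E)
coupled_alignment = ["AK"] * 10 + ["DE"] * 10
natural_pairs = set(coupled_alignment)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
samples = [site_independent_sequence(coupled_alignment, rng) for _ in range(200)]
scrambled = [s for s in samples if s not in natural_pairs]
print(len(scrambled), "of", len(samples), "samples carry a pairing absent from the alignment")
```

Every sampled sequence preserves the per-site frequencies, yet a substantial fraction combines A with E or D with K, pairings that never occur in the input. This is the sense in which site-independent sampling scrambles correlations that a true consensus sequence may retain.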

Recently, the Ranganathan lab has reported an updated SCA calculation that measures the significance of observed correlations as judged by the conservation of the amino acids under consideration.168 Further calculations remove insignificant correlations (noise) based on eigenvalues, and the remaining pairwise correlations are clustered according to eigenvectors. The authors presented the raw conservation-weighted covariance matrix between all sequence positions in the S1A serine protease family. Here, relatively few positions exhibit strong correlations to primary sequence neighbors. After "spectral cleaning" and clustering, three sectors are observed. These sectors represent positions with strong intra-sector correlations, but sparse inter-sector correlations. The authors found that these sectors represented distinct tertiary sites within the protein that likely coevolved for fitness.

Figure 9: Beyond Consensus Analysis of TPRs.

Networks of statistically interacting residues implied from perturbation analysis. Statistically significant interactions can be arranged into “networks” by examining the differences in amino acid distribution for various TPR subsets. Lines between different positions represent direct correlations. (a) The identity of residue 8, almost always Gly or Ala, is affected by residues 4–9 and 21–24. Residue 24 tends to get larger or smaller inversely with residue 8. (b) Positions 26 and 29 tend to have opposite charges. (c) TPRs with Leu7 tend to have Tyr11, DE16, Lys19 and DE22. TPRs with KR7 tend to have Leu11, KR16 and Glu 19, in addition to Lys2, Tyr 4, Arg6, AC10, Asp23 and Asp31. Image and caption from reference Magliery and Regan.

Thomas Magliery and Lynne Regan used Ranganathan's initial protocols to study the conservation and interaction profiles in the tetratricopeptide repeat motif (TPR).169

Statistical free energies were used to determine the significance of each position within the motif. The most conserved residues (high ΔGstat) were located within the


hydrophobic core, although conserved glycines and prolines were seen at turn positions.

As noted earlier, the binding site exhibited lower ΔGstat values than average surface

residues. This is because the ensemble of sequences used in the multiple sequence

alignment has evolved diverse pockets to bind diverse ligands. Statistical Coupling

Analysis reveals several interesting correlated networks. These correlated networks

represent real examples where the folding code has been deconvoluted - even if only to a

small degree.

Biogen Idec recently reported the statistical analysis of VH and VL domains from the V-

class of Ig-folds.170 The covariation between sites was calculated by correlation

coefficients. The correlations had interesting implications in the quaternary structure of

multi-chained antibodies. First, the strongest correlations were localized to the VH-VL interface, suggesting that evolution stringently selected for heterodimerization.

Additional correlations were observed at the VH-C interface, but not the VL-C interface.

The authors published the entire roster of correlated residue pairs to support engineering

efforts in the antibody community.
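The source does not specify which correlation coefficient was used for the antibody domains. As one plausible illustration, the sketch below computes the phi coefficient (the Pearson correlation of two binary indicators) for a specific residue pair; the choice of coefficient, the function name and the toy alignment are all assumptions.

```python
import math

def phi_coefficient(alignment, site_x, aa_x, site_y, aa_y):
    """Pearson correlation between 'aa_x at site_x' and 'aa_y at site_y' across sequences."""
    n = len(alignment)
    n_x = sum(seq[site_x] == aa_x for seq in alignment)
    n_y = sum(seq[site_y] == aa_y for seq in alignment)
    n_xy = sum(seq[site_x] == aa_x and seq[site_y] == aa_y for seq in alignment)
    denominator = math.sqrt(n_x * (n - n_x) * n_y * (n - n_y))
    if denominator == 0:
        return 0.0  # one of the residues is absent or invariant, so correlation is undefined
    return (n * n_xy - n_x * n_y) / denominator

# Hypothetical alignment: A at site 0 always accompanies K at site 1
toy = ["AK"] * 5 + ["DE"] * 5
print(phi_coefficient(toy, 0, "A", 1, "K"))  # 1.0
print(phi_coefficient(toy, 0, "A", 1, "E"))  # -1.0
```

Evaluating this quantity for all residue pairs at all position pairs produces the residue-resolved correlation table whose size is discussed earlier in this section.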

Finally, we examine a integral dilemma in the analysis of correlated occurrences of

amino acids. As demonstrated earlier, even a small protein of 100 amino acids has

10,000 pairwise interactions and millions of data points generated by methods like

correlation coefficients. The standard for viewing these values is limited to tables,

spreadsheets and heat maps. The rapid acceleration in the throughput of data collection

66

(microarrays, genomic efforts, etc.) requires the use of new tools for data visualization.

Developments from the Patricia Babbitt lab (University of California, San Francisco) and

the William Ray lab (The Ohio State University) are brilliant examples of integration

between bioinformatics and visualization. In particular, the Ray lab developed a Java-

enabled program that simplifies the data produced in correlation analyses - first in nucleic

acids, then in proteins.171-173 A standard heat map displays redundant information as the

matrix is symmetrical (identical axes). Ray's innovation bent the top axis from a linear

line into a circle and replaced the color-coding of correlations with connecting lines

(Fig.10). The genius of this approach is that the data can be viewed in multiple dimensions and at multiple levels. If the circle is imaged from the top side, lines indicate

correlations between positions. This interface displays the twenty amino acids and their

distributions as a cylinder extending from the circle. Rotating the cylinder allows one to

visualize how the amino acids contribute to the statistical correlation (meaning instead of

position x correlating to position y, Ala at position x correlates to Phe at position y). This

interface, named StickWRLD, is highly intuitive and especially useful for visualizing

networks of interactions. Olzer and Ray used StickWRLD to unveil two networks of

interaction in the adenylate kinase family. One network, found in gram negative bacteria,

stabilizes the lid domain through a series of hydrophobic interactions and hydrogen bonds. A second network, commonly found in gram positive strains, replaces a subset of these amino acids with residues that chelate zinc. The Magliery and Ray labs are currently testing the role of these networks with empirical experiments.
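As a rough illustration of the scale discussed in this section, the sketch below counts the position pairs and residue-level data points for a protein of a given length, and computes a simple observed-minus-expected residual for one residue pair (for example, Ala at position x with Phe at position y). The function names, the toy alignment and the residual statistic are illustrative assumptions; they are not taken from StickWRLD or from any of the cited analyses.

```python
def pair_counts(length, alphabet_size=20):
    """Count entries in the full (symmetric) position-by-position matrix and
    the residue-level data points behind it (20 x 20 identities per pair)."""
    position_pairs = length * length
    residue_points = position_pairs * alphabet_size ** 2
    return position_pairs, residue_points


def residue_pair_residual(column_x, column_y, res_x, res_y):
    """Observed minus expected joint frequency of res_x at position x and
    res_y at position y. The expected value assumes the two alignment
    columns are statistically independent."""
    n = len(column_x)
    observed = sum(
        1 for a, b in zip(column_x, column_y) if a == res_x and b == res_y
    ) / n
    expected = (column_x.count(res_x) / n) * (column_y.count(res_y) / n)
    return observed - expected


# Hypothetical two-column alignment: Ala always pairs with Phe, Gly with Leu.
toy_alignment = ["AF", "AF", "AF", "GL", "GL", "GL"]
col_x = [seq[0] for seq in toy_alignment]
col_y = [seq[1] for seq in toy_alignment]

pairs, points = pair_counts(100)
ala_phe = residue_pair_residual(col_x, col_y, "A", "F")  # positive: co-occur
ala_leu = residue_pair_residual(col_x, col_y, "A", "L")  # negative: avoided
```

For a 100-residue protein this gives 10,000 position pairs and 4,000,000 residue-level values, which is why tables and heat maps quickly become unwieldy.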

Figure 10: StickWRLD.

A StickWRLD diagram showing positions in the ADK lid domain. The amino acid identities are arranged vertically by their Kyte–Doolittle hydropathy score. Consensus identities in each position are highlighted by a transparent unit cube. In a live VRML browser, this diagram is completely navigable and the viewer can rotate, move and zoom the 3D diagram to examine details. Figure and caption reproduced with permission from Will Ray.

1.10 Triosephosphate Isomerase

Triosephosphate isomerase (TIM) is the archetypical member of the (β/α)8-barrel fold.

This fold is particularly important as more than 10 % of all native enzymes host their

catalytic residues on this architecture.174 TIMs are composed of eight parallel β-strands

surrounded by eight α-helices, resulting in concentric hydrophobic cores (Fig. 11).175

The loops that connect β-strands and α-helices hold functionally important residues for catalysis and binding. The ubiquity of functions seen in natural TIM-barrels has not been replicated in protein engineering studies.176,177 This is likely due to the marginal stability of the β-sheet core. Fersht reported the conversion of TIM-barrel indole-3-glycerol-phosphate dehydrogenase to fellow TIM-barrel phosphoribosylanthranilate isomerase.178,179 The article was later retracted citing in vivo contamination, and it was

later determined that the designed proteins aggregate in vitro. Proteins with high β-sheet

content are prone to aggregation and kinetic instability. This is often observed in

amyloids where β-secondary structures form sandwiches leading to aggregation and

precipitation. Pehr Harbury (Stanford University) "reverse engineered" the (β/α)8-fold of

S. cerevisiae triosephosphate isomerase.180 Here, the authors constructed libraries of conservative mutations based on a small multiple sequence alignment. They observed that the majority of structural positions tolerated conservative mutations (e.g. Glu→Asp) with minimal consequences on activity. However, when all positions were simultaneously varied between wild-type and conservative residues, only 1 in 10^10 members was active. In particular, the central hydrophobic core (β-sheets) would not tolerate substitutions that changed the core volume by as little as one methylene group.

Figure 11: Triosephosphate Isomerase.

TIM is a homodimeric enzyme shown here with one monomer in gray cartoon and the second as a red α-carbon trace. Note that the majority of the interaction surface is contributed by a single loop that penetrates the active site of the adjacent monomer. Active site residues are shown as sticks and dynamic loop 6 is shown in purple. This image was rendered in PyMOL from the S. cerevisiae crystal structure, 1YPI.

With the exception of a few thermophilic tetramers, all known TIMs are homodimers

with picomolar KDs. The oligomeric nature of triosephosphate isomerase is proposed to

be essential for activity based on crystal structures and engineering experiments. The

third and longest loop contains ~15 residues that interdigitate into the adjacent monomer.

The tip of this loop forms van der Waals packing interactions with active site residues,

K12 and H95. These interactions may be responsible for the exquisite alignment and spacing within the active site. The Rik Wierenga Lab (University of Oulu) has studied the third loop in trypanosomal triosephosphate isomerase. First, they computationally modeled a shortened loop of only eight residues that ablated much of the protein-protein interaction surface (Fig. 12).181,182 This variant was monomeric at physiological concentrations (0.02-2.0 mg/mL), earning it the name monoTIM. The crystal structure revealed significant rearrangement of the three active site residues leading to diminished activity (kcat/KM = 10^4 M^-1 min^-1 versus 10^8 M^-1 min^-1 for wild-type). A single point mutant

at the interface, H47N, produces monomeric TIM at concentrations below 3 mg mL-1, but

is less stable than native trypanosomal TIM.183 A series of crystal structures with point

mutations and inhibitors was published in 1995.184,185

Figure 12: monoTIM.

monoTIM (right) is a designed variant of triosephosphate isomerase that replaces the 15-residue interface loop with a computationally designed 8-residue loop. The second monomer is depicted as a green sphere with active site residues. Images were rendered in PyMOL 1.4.1 with PDB IDs 1YPI and 1TRI.

The residual activity of monoTIM was deduced from these crystal structures based on

the flexibility of loops 1, 4 and 8. In wild-type TIM these loops are rigidified through

subunit-subunit contacts - ensuring the alignment of catalytic residues. Further

engineering of seven-residue loops and double mutants yielded similar monomeric

TIMs.186,187 Directed evolution of monoTIM has yielded an enzyme with 44-fold

improved specific activity.188 Engineered TIMs from human have also been

characterized.189 The double interface mutant (M14Q, R98Q) produces monomers that

are significantly less stable than the wild-type dimeric hTIM. Design strategies were

applied to both the monomeric and dimeric species that successfully increased stability in both constructs. There is a rare autosomal recessive disease associated with mutations in

TIM. The most common disease allele, E104D, does not significantly affect in vivo

activity, but does affect dimer stability.190

Figure 13: The Activity of TIM.

TIM catalyzes the interconversion of dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP) in the fifth step of glycolysis. To study the Michaelis-Menten parameters of these spectroscopically-silent substrates the reaction is coupled to the redox reaction of NAD+/NADH. The coupled enzymes, glyceraldehyde-3-phosphate dehydrogenase (GAPD) and α-glycerol-3-phosphate dehydrogenase, allow the measurements to be taken under conditions where the TIM reaction is irreversible.

Triosephosphate isomerase's activity plays a pivotal role in glycolysis - the metabolic

pathway that chemically breaks down glucose to two molecules of pyruvate (Fig. 13). The fourth step of this reaction pathway cleaves six-carbon fructose-1,6-bisphosphate into two three-carbon molecules, dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP). Only one of these molecules (GAP) can continue in the glycolytic pathway for energy production. To increase the efficiency of glucose metabolism, TIM interconverts DHAP and GAP through an enediol intermediate at the diffusion limit. Most

evidence supports a mechanism where protons are shuttled directly to and from

DHAP/GAP by a catalytic glutamate and histidine and the negatively charged

intermediate is stabilized by the catalytic lysine.191-193 Jeremy Knowles (Harvard

University) designed two coupled assays to determine the Michaelis-Menten parameters

for TIM (Fig. 13).194 These assays and site-directed mutagenesis have led to detailed active-site analysis. Mutation of the catalytic glutamate to the shorter aspartate diminishes activity 500-fold.9 Mutation of the active site lysine to glycine results in a 3,000-fold loss in activity and mutation of the histidine completely inactivates TIM.195 In the

DHAP:GAP equilibrium, formation of DHAP is thermodynamically favored 22:1.196

Catalysts may change the rate of reactions, but cannot affect the underlying thermodynamics and equilibria. TIM is able to continually convert DHAP into GAP because GAP is quickly metabolized to 1,3-bisphosphoglycerate, which pulls the equilibrium forward in accordance with Le Chatelier's principle. Triosephosphate isomerase also allows other carbon sources to enter glycolysis. Glycerol is converted to glycerol-3-phosphate, which can be oxidized into

DHAP allowing entry into glycolysis. Alternatively, DHAP can be reduced to glycerol-3-phosphate, which provides adipose tissue with a source of activated glycerol for the synthesis

of triacylglycerides. Lactic acid can also be fed into glycolysis at the fifth step by a series

of chemical modifications to glyceraldehyde-3-phosphate.
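The thermodynamic point made above can be illustrated with a short calculation. This is a minimal sketch that takes the 22:1 DHAP:GAP ratio from the text and assumes a temperature of 25 °C (298.15 K), which the text does not state; the function name is hypothetical.

```python
from math import log

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def dhap_gap_equilibrium(ratio_dhap_to_gap=22.0, temperature_k=298.15):
    """Return (equilibrium fraction of GAP, standard free energy change in
    kcal/mol for DHAP -> GAP) for a given DHAP:GAP equilibrium ratio.
    The temperature is an assumption, not a value from the text."""
    fraction_gap = 1.0 / (1.0 + ratio_dhap_to_gap)
    k_eq = 1.0 / ratio_dhap_to_gap  # [GAP]/[DHAP] at equilibrium
    delta_g = -R_KCAL * temperature_k * log(k_eq)
    return fraction_gap, delta_g


fraction_gap, delta_g = dhap_gap_equilibrium()
```

Under these assumptions only about 4 % of the triose phosphate pool is GAP at equilibrium, and the DHAP to GAP direction is uphill by roughly 1.8 kcal mol^-1. The catalyst does not change either number; continuous removal of GAP by the next glycolytic step is what keeps the net flux moving forward.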

The catalytic glutamate is located on dynamic loop six of triosephosphate isomerase. The

motion of this loop has been exhaustively studied by Arthur Palmer (Columbia

University), Ann McDermott (Columbia University) and Nicole Sampson (Stony Brook

University). This loop acts as a rigid lid over the active site with two hinges. The lid must remain closed during catalysis to exclude water from the active site, which would otherwise lead to the toxic side product methylglyoxal. At the same time, the loop needs to open to allow entrance

of substrate and product release. The rate of loop motion is on the same time scale as

catalysis as determined by fluorescence, solid-state and solution-state NMR.197-202

Nearly every sequenced organism contains the gene for triosephosphate isomerase.

Exceptions include ureaplasmas - a branch of bacteria that do not perform glycolysis. At the time of writing, there are 3,894 triosephosphate isomerase sequences in Pfam from 2,946 species. There are 270 triosephosphate isomerase structures in the

PDB and 2,090 TIM-barrel structures. The ubiquity and thorough evolution of this ancient enzyme make it ideal for statistical study. Furthermore, the plethora of biophysical and biochemical data collected aids in the study and understanding of our bioinformatically-derived variants.

1.11 Dissertation Synopsis

The work presented in this dissertation aims to understand the fundamental properties of

proteins - specifically, how is information encoded in the folding code? Anfinsen's work

thoroughly proved that the directions for folding and activity are solely encoded in the

primary sequence and many labs have designed experiments to interpret that code. We

believe that the "code" can be "broken" by comparing and contrasting sequences of

homologous proteins.

First, in chapter two we analyze protein stability as a consequence of numbers - both

combinatorial numbers and numbers of sequences. We highlight several technologies

that have allowed protein scientists to study large libraries, find stabilized variants and deduce the mechanisms of stability from aggregated data. Furthermore, we explain the role of conservation and correlation in protein stability. In the third chapter, we describe the design and characterization of fully consensus TIMs. The results of this chapter indicate that consensus design is sufficient for generating stable and active proteins. We also discuss the role of correlation and networks of interactions that provide

fine tuning of thermodynamic properties and activity. Our fourth chapter describes a

double-sieve statistical filter that accurately predicts stabilizing mutations based on

consensus and correlation. Here, we characterize 23 individual consensus variants that

vary in conservation, solvent accessible surface area, secondary structure and covariation

with other positions. We determined that consensus and correlation, alone, can predict

stabilizing mutants with > 90 % accuracy. Furthermore, the double-sieve filter selected

15 substitutions that led to dramatic stabilization of S. cerevisiae TIM. In the fifth chapter we describe the engineering and characterization of a novel TIM-deficient E. coli

from the Keio Collection. Here, we design a model system to test the interplay of

correlations and fitness using deep sequencing. The sixth chapter chronicles a high-

throughput assay for measuring the relative stability of libraries of proteins based on hydrophobic dye binding. The final experimental chapter describes the design and application of a novel vector for ligation independent cloning and traceless hexahistidine

protein purification. The sum of this dissertation presents methods for generating,

assaying and analyzing the fundamental properties of proteins. In particular, how is

information captured statistically in consensus and conservation?

Chapter 2: Protein Stability by Number

Protein stability by number: high-throughput and statistical approaches to one of protein

science's most difficult problems.

2.0 Contributions

The following review was published in Current Opinions in under the authorship of Thomas J. Magliery, Jason J. Lavinder and Brandon J. Sullivan. The literature review and analysis of current and past techniques were performed by all authors. Jason Lavinder was instrumental in assembling sources detailing high throughput procedures. Brandon Sullivan was instrumental in assembling sources detailing statistical techniques towards protein stability. Thomas Magliery wrote the review and contributed sources. The author order is Thomas Magliery, Jason Lavinder and Brandon Sullivan.

2.1 Abstract

Most natural proteins are only barely stable, which impedes structural studies, protein engineering and use in therapeutic and industrial applications. It also makes proteins susceptible to single mutations that completely destabilize the native state, which

underlies numerous disease pathologies. Our ability to predict the thermodynamic

consequences of even single point mutations is still surprisingly limited, and the low-

throughput nature of protein stability measurements slows engineering efforts and

investigations to understand sequence-stability relationships better. A number of recent

methods are bringing protein stability studies into the practical high-throughput realm.

Some of these methods are based on inferential read-outs such as activity, proteolytic

resistance or split-protein fragment reassembly. Other methods use miniaturization of

direct measurements of stability, such as intrinsic fluorescence, H/D exchange, cysteine reactivity, aggregation and hydrophobic dye binding (DSF). Applications of these screens to difficult targets such as antibodies and membrane proteins are discussed. A second way that large-number approaches are intersecting with protein stability studies is in statistical analysis of sequence databases. Protein engineering based on both consensus and correlated occurrences of amino acids is promising, but much work remains to understand and implement these methods.

2.2 Introduction

Site-directed mutagenesis, still the core technology of protein engineering, will turn 30 next year. The last three decades have seen well in excess of 100,000 mutations made

(many more if we count combinatorial approaches) to probe and alter the structure, activity, folding and stability of a vast array of proteins with different folds and functions.

A huge number of stability measurements have been amassed, in addition to a massive body of hypothesis-driven experiments designed to tease out the basis of protein stability.

But predicting the stability of protein mutants remains one of the great unsolved

problems of protein science, proving itself more difficult than even the prediction of

protein structure or even the design of fairly efficient enzymes. This difficulty is in spite

of our actually knowing a great deal about the forces that dominate in protein folding and

perhaps even more about the atomic-resolution structures of folded proteins.203,204 So what’s the problem? One problem is that despite large forces being at work in the structure of the folded state, such as the enthalpies associated with all the hydrogen bonds that form, the net stabilities of proteins are small—5-15 kcal mol-1. This is because the

forces acting on the unfolded state, such as all the hydrogen bond donors and acceptors

that are satisfied by solvent, are also large. This marginal differential means that exquisite

accuracy is required from fairly crude potential functions, and the problem is exacerbated

by our inability to meaningfully model the unfolded state. Furthermore, it is difficult or

impossible to model key aspects of protein folding, such as backbone motion or solvent

entropy. Even empirical approaches that attempt to extrapolate from training sets of

thermodynamic data do not capture sufficient information to solve the problem, but it is

less clear if the reasons for this are fundamental. On one hand, the standard methods of

characterization—calorimetry or spectroscopically observed chemical or thermal

denaturation—are slow and laborious. On the other hand, even “large” databases are

easily dwarfed by the size of sequence space, and it is certainly clear that the effect of a

mutation is only meaningful in context. Mutating alanine to serine is a vastly different thing in different scaffolds, in different secondary structures, with different packing densities or solvent exposures, or with different amino acids nearby. So while insight

may not follow from numbers alone, there is a degree to which having large numbers of well-characterized and highly-related mutants will shed light on the problem of protein stability. And even if it does not, the technology to enable those measurements will also enable brute-force approaches for engineering stability. In recent years, the problem of protein stability has intersected with problems of large numbers in two interesting ways, each of which is proving useful for engineering proteins for improved stability and elucidating the underlying reasons. The first is the development of fairly general high-throughput methods for measuring protein stability. The second is the use of statistics from the very large number of sequences that have resulted from 15 years of genome sequencing to predict stabilizing mutations. Here we will highlight some of the most important recent advances in these two areas.
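The marginal stabilities quoted above can be put in perspective with a two-state (folded/unfolded) model, in which the equilibrium unfolded fraction follows directly from the free energy of unfolding. The sketch below is illustrative only: the two-state assumption, the 25 °C temperature and the function name are assumptions made for this example, not values or methods taken from the review.

```python
from math import exp

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def fraction_unfolded(delta_g_unfold, temperature_k=298.15):
    """Equilibrium fraction of unfolded protein for a two-state system,
    given the free energy of unfolding in kcal/mol (positive = stable)."""
    k_unfold = exp(-delta_g_unfold / (R_KCAL * temperature_k))
    return k_unfold / (1.0 + k_unfold)


# Hypothetical example: a protein with 5 kcal/mol of net stability and a
# single mutation that removes all 5 kcal/mol.
wild_type = fraction_unfolded(5.0)
mutant = fraction_unfolded(5.0 - 5.0)  # half of the population is unfolded
```

Because the net stability is a small difference between large opposing terms, a prediction error of only a few kcal mol^-1 is enough to move a variant from almost fully folded to substantially unfolded in this model.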

Table 1: Protein Stability by Number.

2.3 Screening for Protein Stability

High-throughput approaches for measuring or improving protein stability generally fall

into two categories; either they attempt to infer the stability from properties that are

typically measured close to physiological conditions, or they perturb the conditions of the

protein in some way and read out the stability (more or less) directly. For example, protein expression level, solubility, secretion, binding and enzymatic activity, and resistance to proteolysis may all be taken as indications of a stable protein.205 In general,

the Achilles’ heel of these approaches is a lack of broad applicability (for example, many

interesting proteins do not have an enzymatic function) and “you-get-what-you-select-for” kinds of escape variants (for example, unstable but protease-resistant mutants). On the other hand, the problem with measuring the stability directly is the difficulty of

miniaturization; circular dichroism and differential scanning calorimetry are not well

suited to 96-well plates. But some creative ideas have been applied recently with both

types of approaches, which we highlight here. Several other recent reviews highlight

other aspects of combinatorial approaches to protein biophysical properties.205-207

2.4 Inferential Screens for Protein Stability

A straightforward approach for selecting for thermostable proteins is to monitor protein activity at elevated temperatures or after heating. For example, thermal inactivation was used to engineer an esterase with good tolerance of high temperatures but robust room-temperature activity, a feat that was not universally thought to be possible until it was

demonstrated directly.208 This approach is limited to proteins with an activity that can be

assayed easily, and there is not a clear correlation between the degree of thermal

inactivation and the stability since it is complicated by aggregation and folding rates. But

this is a very practical approach to screening for proteins with improved stability and

activity under various perturbing conditions. Screens for achieving a stability threshold

based on binding or catalytic activity have formed the basis for several notable

combinatorial experiments in protein design using λ repressor, barnase, chorismate mutase, and Rop, to name a few.205 Resistance to proteolysis has been used broadly to

identify structured variants, particularly on phage particles. It has been difficult to rely on

nonspecific proteolysis as a read-out of stability, because proteolysis rates are related not

just to global stability but to local stability and substrate specificity. Recently, Bardwell

and coworkers developed a system in which a protein of interest (POI) is inserted into a

loop of TEM-1 β-lactamase, where it was hypothesized that lower-stability mutants of the

POI would generally lead to greater degradation by cellular proteases.209 For several

proteins, the log of the minimum inhibitory concentration (MIC) of antibiotic showed a

striking correlation to the stability of mutants (R^2 > 0.6). The relationship was especially

good for Immunity protein 7, where it was clear that expression level was correlated to

stability. Often, expression level differences from varying rates of transcription and

translation, solubility differences, or display differences (on phage or yeast) can be confounding factors to these types of inferential screens. But the authors convincingly showed that the system could be used to select for Im7 variants with improvements in both thermodynamic and kinetic stability. The selection is also tunable and demands that

selected variants be soluble and expressible. Marqusee and coworkers have recently

employed pulse proteolysis in increasing concentrations of urea using thermolysin, which

retains its activity in high urea, to measure folding ΔG values.210 The method is read out by SDS-PAGE, but it can be applied to unpurified protein in crude lysate with sufficient overexpression or specific detection, making it suitable for fairly high-throughput quantitative determinations of stability. By adjusting the pulse time and using chemical denaturant, one can directly measure the fraction folded and avoid confounding differences in protease susceptibility under native conditions. This observation led Park et al. to challenge the entire E. coli proteome with protease under native conditions to specifically identify resistant proteins.211 Maltose binding protein was a notable survivor

of thermolysin treatment, and it achieves its resistance through kinetic stability to

unfolding. While it may be a challenge to apply to libraries of mutants, this screening

principle may be useful to shed light on determinants of kinetic stability. Split-protein

reassembly, also called protein-fragment complementation, has proven useful for

identifying protein interactions in living cells, wherein reassembly of fragments of

DHFR, GFP, luciferase or other proteins is driven by the interaction between POIs fused to the fragments.212-215 Split fluorescent proteins reassemble irreversibly, making them

useful for detecting weak interactions but generally unsuitable for measuring binding constants.213 In contrast, split luciferase reassembles reversibly, which has been exploited to look at interaction dynamics in cells but so far not to look at stability

directly.216 Koide and coworkers recently combined yeast surface display with protein

reassembly, and they demonstrated that FACS-detected reassembly could be used to

measure stabilities and enrich in mutants with a defined range of stabilities.217 A human

fibronectin type III domain (FN3) was split, with one fragment displayed on the cell

surface and the other secreted into the medium. The fragments were fused to two epitopes for fluorescently-labeled antibodies, such that FACS could resolve the display and reassembly levels on each cell. The log of this ratio correlated well with the change in binding energy (R^2 = 0.8) for a series of mutants that form a β-bulge in FN3. While not

every protein will reassemble and the binding energies may not perfectly match the

stabilities of the full-length proteins, the ability to rapidly determine and sort for absolute

protein stabilities is especially useful. Despite the complicating irreversibility of split

GFP reassembly, Linse and colleagues demonstrated that the split fragments of the B1

domain of protein G (GB1) could drive the reassembly of the known interaction-detection

fragments of GFP, and that fluorescence was related to the thermal stability of the

corresponding GB1 full-length protein with the same mutations.218 This result is

somewhat surprising, considering that in general cellular fluorescence from split GFP

reassembly does not quantitatively correspond to the binding affinity.213 This screen is

also limited to proteins that can be dissected to reassemble, and while it lacks inherent

controls for expression level differences, it is simpler in its implementation than yeast

display complementation screening. Waldo and colleagues introduced a screen for

soluble proteins based on fusion of a “folding reporter” GFP to the C-terminus of a

POI.219 The GFP only folds and becomes fluorescent if the fused POI folds and is

soluble. One substantial improvement to the screen was the dissection of a “super folder”

GFP into a tagging fragment of 15 amino acids from the C-terminus and a 215-amino-acid

“detector” fragment.220 These GFP fragments spontaneously reassemble, but only if the peptide is fused to a folded, soluble POI, and the tag influences the solubility of the fusion less than the original folding reporter. Procedures for HT implementation have been described recently.221 Solubility and stability are not generally directly related, but solubility is another key biophysical property that cannot be predicted and requires HT

screening methods.
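The denaturant-based readouts described in this section, such as pulse proteolysis across a urea gradient, are commonly interpreted with a two-state linear extrapolation model, in which the free energy of unfolding decreases linearly with denaturant concentration. The sketch below illustrates that general model only; it is not the analysis used in the cited work, and the parameter values and function names are hypothetical.

```python
from math import exp

R_KCAL = 1.987204e-3  # gas constant in kcal mol^-1 K^-1


def fraction_folded(urea_molar, delta_g_water, m_value, temperature_k=298.15):
    """Two-state fraction folded at a given urea concentration, assuming
    delta_G_unfold = delta_g_water - m_value * [urea] (linear extrapolation).
    delta_g_water is in kcal/mol and m_value in kcal/mol per molar urea."""
    delta_g = delta_g_water - m_value * urea_molar
    return 1.0 / (1.0 + exp(-delta_g / (R_KCAL * temperature_k)))


def midpoint(delta_g_water, m_value):
    """Urea concentration at which half of the protein is folded."""
    return delta_g_water / m_value


# Hypothetical protein: 6 kcal/mol stability in water, m = 2 kcal/mol/M.
cm = midpoint(6.0, 2.0)
curve = [(c, fraction_folded(c, 6.0, 2.0)) for c in (0.0, 1.5, 3.0, 4.5, 6.0)]
```

In this model a stabilizing mutation shifts the midpoint to higher denaturant, so comparing midpoints across a library gives a relative stability ranking even when absolute ΔG values are uncertain.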

Figure 14: Principles of Screening for Protein Stability.

Most methods of screening for stability modify the unfolded state or observe some unique property of it, for example by proteolysis, reaction with an exposed cysteine, amide proton exchange, hydrophobic dye binding, or aggregation. Typically, the protein solution is heated or challenged with increasing concentration of chemical denaturant to establish when the signal is observed. Because most HT stability screening methods involve observation of an irreversible reaction, care must be taken in interpreting the data as a change in equilibrium stability. While some methods enable measurements in the presence of other proteins, most require sufficient purification so that other proteins do not produce the unfolding signal.

2.5 Direct, Small Scale Screens for Stability

Of all the traditional methods of measuring protein stability, thermal and chemical

denaturation monitored by intrinsic fluorescence of aromatic amino acids is the most

straightforward to miniaturize. Stites and colleagues developed an early home-built auto

titrator for semi-automated denaturation measurements.222 Edgell, Pielak and colleagues

carried out pioneering work in this area using auto titration methods and robotics with a

standard fluorimeter.223 Dalby and colleagues extended the method considerably by

adapting it to microtiter plates with auto titration, which dramatically increased the

throughput.224 Mayo and coworkers recently coupled this system with computational

design of protein libraries, enabling exhaustive characterization of computational predictions in a reasonable experimental timeframe.225 Dalby and coworkers have gone

on to further miniaturize their method to nanoliter scale using , which

enables screening with very small amounts of proteins (perhaps only 10^8 molecules).226

This approaches a scale where concomitant miniaturization of protein production is a challenge for libraries, but it has great promise immediately for protein-ligand interactions. Of course, these methods rely on the presence of an intrinsic fluorophore, which not all proteins possess, and they require a fair amount of specialized equipment even in their simplest implementation. However, they are likely to produce measurements that directly compare to those taken by standard methods. Hydrogen-deuterium exchange has been used extensively to measure the stability, dynamics and folding of proteins by NMR and mass spectrometry. Oas and Fitzgerald developed a HT screen called stability of unpurified proteins from rates of H/D exchange, or SUPREX.227

Cell lysate from 200 μL of cell culture is exposed to a pulse of D2O in varying

concentrations of chemical denaturant, and the sample is dried with MALDI matrix for

rapid acquisition. The method is complicated by aggregation or low expression. Also,

EX2 conditions (wherein folding is faster than the intrinsic exchange rates of the protons)

are required to extract thermodynamic parameters. But Oas has successfully used this

method to measure protein stabilities in living cells.228 (Gierasch and colleagues have

made similar measurements recently using biarsenical dyes as the readout instead of

MALDI mass spectrometry.229) Fitzgerald and colleagues recently described a variant of

SUPREX based on oxidation rates (SPROX) which addresses some of the complications

of H/D exchange for these sorts of experiments, such as resolving power, ion

suppression, chromatographic separation and reversibility of modification.230 Other reactions can also be used to monitor protein stability. For example, Harbury and colleagues developed a method called misincorporation proton-alkyl exchange (MPAX), which uses weak missense suppressors to make random, residue specific Cys mutations throughout a protein of interest.231 The burial of these Cys residues is interrogated by

alkylation, which can be read out through mass spectrometry or chemical scission and

PAGE. The method is especially useful for measuring the stabilities of proteins that do

not refold reversibly, since the measurement is made under native conditions. Hellinga,

Oas and colleagues have developed a related HT method called quantitative cysteine

reactivity (QCR), which uses gel-shift as a read out, as well as a fast (fQCR) variant with

a fluorescence readout.232,233 Like H/D exchange, thermodynamic parameters can only

be extracted in the EX2 regime. Also, an appropriate buried Cys residue (ideally only one) is required, or the protein must be engineered with some peril of changing the protein thermodynamics. This method was demonstrated at picomole (nanogram) scale using HT gene fabrication and cell-free transcription-translation, which is a very exciting frontier in HT stability measurements. A somewhat different measurement that is applicable to proteins that unfold irreversibly and aggregate—which represents a large

fraction of interesting proteins—has been called differential static light scattering

(DSLS). Senisterra et al. reported the use of a home-built instrument (which is now

commercially available) that is capable of light scattering measurements of protein

aggregation in 384-well format.234 It is worth noting that 600 nm absorbance in a standard plate reader is also a reasonable way to measure aggregation. The chief

advantage of this method is its simplicity, as no intrinsic or extrinsic probes are required.

Besides the limitation to aggregating proteins, this non-equilibrium method could be confounded by dramatic changes in the kinetics of unfolding or aggregation for different mutants. But these effects appear to be small in proof-of-principle experiments. A variation on this theme is isothermal denaturation (ITD), in which the rate of irreversible denaturation is observed, typically at a temperature just below that of melting. In principle, this denaturation can be observed by loss of a signal such as CD or shift of a signal such as UV absorbance or fluorescence. For proteins that aggregate, light scattering is also possible. ITD measurements are highly reproducible and have been

reported to be more sensitive to small changes in stability, which is especially useful for ligand binding studies. Senisterra et al. adapted the method to HT using their 384-well

scattering apparatus, with the additional advantage that ITD required less protein than

comparable methods.235 ITD measurements do require a priori knowledge of the

protein’s approximate melting temperature, which could be problematic for protein

libraries, and presumably could be very sensitive to changes in kinetics that may not be directly linked to equilibrium stability. Schaeffer and colleagues have introduced an in vitro hybrid of the GFP fusion method for solubility and ITD, which can measure

stability without purification.236 Here, the POI is fused to the N-terminus of GFP. The

protein, purified or in lysate, is then subjected to ITD in HT format. The method, called

GFP-Basta, is only applicable to proteins that aggregate upon unfolding and is limited by

the GFP aggregation and photophysics, but practically these are not very significant

limitations for most POIs. One especially promising method called differential scanning

fluorimetry (DSF) is simple, broadly applicable and requires little specialized equipment.

A method called Thermofluor was developed by 3D Pharmaceuticals, now owned by

Johnson & Johnson, which reports on the perturbation of the melting temperature of a

receptor by a potential ligand through addition of an extrinsic fluorophore.237

Hydrophobic dyes such as ANS are quenched in aqueous solvent but become fluorescent in organic solvent or when bound to molten globules or protein unfolding intermediates.

Most laboratory implementations of DSF, which is used extensively to optimize buffer conditions for crystallography,238,239 use real-time PCR machines which typically lack filter sets in the blue. Consequently, dyes such as SYPRO Orange have been widely used instead of ANS. RT-PCR machines enable DSF in 96 and 384-well formats with ~20 μL

of solution, where ~1 μg μL⁻¹ solutions are required. Nordlund and colleagues have shown that DSF is applicable to a broad range of proteins but that some proteins bind to

SYPRO Orange in the folded state.238 Magliery and co-workers demonstrated that for a

series of related mutants of a protein, the correspondence between Tm values determined

from CD thermal denaturation and DSF is excellent.99 The reverse-format protein-

engineering implementation of Thermofluor, in which the conditions and ligands are held

constant and the protein varied, was called High-Throughput Thermal Scanning (HTTS).

It has been applied to core and loop libraries of four-helix bundle proteins to elucidate

determinants of stability (Lavinder, J.J., Hari, S.B., Sen, S. and TJM, in preparation).

DSF is surprisingly reversible through the melting point, although dye-protein aggregates

appear upon extended heating of the denatured state.240 It is likely that the dye itself will

perturb the apparent melting point, but the ΔTm values correspond quite well to

calorimetric and spectroscopic measurements. All of these miniaturized methods also require miniaturized, high-throughput handling of protein expression, purification, and conceivably library construction. Growth of bacteria in 1–2 mL of culture in 96 deep-well

plates is the technology of choice for most of these methods, where some amount of

robotic liquid handling for plate or bead-based affinity purifications (particularly IMAC)

is helpful. For the most part, these are achieved with considerable home optimization at

present. Platforms for HT oligonucleotide and gene synthesis and in vitro protein

expression stand to expand the screening front-end further, but these are still far from

straightforward implementation in most labs.241-243
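The Tm read-out that DSF and HTTS rely on can be illustrated with a short sketch. It assumes an idealized trace in which dye fluorescence rises sigmoidally as the protein unfolds, and it takes Tm as the temperature of steepest increase. The function names, the synthetic data and the 0.5 °C step are all illustrative assumptions, not values from the studies cited above.

```python
import math

def tm_from_melt(temperatures, fluorescence):
    """Return the temperature at which fluorescence rises most steeply.

    Uses central differences, so the two end points are never returned.
    """
    best_slope, best_temp = float("-inf"), None
    for i in range(1, len(temperatures) - 1):
        slope = (fluorescence[i + 1] - fluorescence[i - 1]) / (
            temperatures[i + 1] - temperatures[i - 1]
        )
        if slope > best_slope:
            best_slope, best_temp = slope, temperatures[i]
    return best_temp

def synthetic_melt(temperatures, tm, width=2.0):
    """Idealized two-state sigmoid, used only to exercise the function."""
    return [1.0 / (1.0 + math.exp((tm - t) / width)) for t in temperatures]

# Hypothetical scan: 25 to 95 degC in 0.5 degC steps
temps = [25.0 + 0.5 * i for i in range(141)]
tm_parent = tm_from_melt(temps, synthetic_melt(temps, 55.0))
tm_variant = tm_from_melt(temps, synthetic_melt(temps, 59.0))
delta_tm = tm_variant - tm_parent  # the quantity compared across a library
print(tm_parent, tm_variant, delta_tm)
```

Real traces usually lose signal after the transition as dye-protein aggregates form, so in practice the search would likely be restricted to the rising part of the curve or replaced by a sigmoid fit.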

2.6 Membrane Proteins and Antibodies

Two targets of great interest in the pharmaceutical industry are particularly challenging

for adapting to stability screens. Membrane proteins make up a large fraction of all drug

targets, but they are difficult to work with in vitro, particularly for structural studies.

Many recent successes in membrane protein crystallography have been born of strategies

to stabilize the POI.244 The most rapidly expanding area in pharmaceuticals is that of

biologics, which antibodies and antibody-like molecules dominate at present. But their

generally poor biophysical properties make them difficult to engineer and formulate as

drugs. Membrane proteins are often difficult to express or purify and are only stable in

detergent or formulations. Stevens and colleagues adapted cysteine reactivity

reported by fluorescence, similar to the fQCR method, for membrane proteins in

detergent.245 They demonstrated its use on a lipophilic model protein, a monotopic

membrane protein, and an integral membrane protein from the GPCR family. More

recently, Cherezov, Stevens and colleagues have expanded this method by examining

protein unfolding in the lipidic cubic phase with intrinsic fluorescence or fluorescence upon cysteine modification as a read-out (LCP-Tm).246 The method was additionally used for ITD over long time frames for membrane proteins. Baldwin and coworkers also developed an ITD screen for membrane proteins in detergent.247 DSF has been applied to

membrane proteins, but the high fluorescence background of the dye in detergent is a

complicating factor.248 Dyes that are more specific for proteins over lipids may improve this.

Even many full-length monoclonal antibodies are limited in their use as therapeutics by marginal stability and aggregation. Formats that are more straightforward to engineer and express, such as Fabs and scFvs, often suffer from decreased stability, and more significant engineering for humanization and generation of bispecific species can compromise stability further. DSF and DSLS have both been applied successfully to formulation studies of monoclonal antibodies.249,250 Thermal inactivation screening has

also been used as a means of establishing sufficient stability for scFv variants.251

Cysteine reactivity has been applied to mAb stability, which seems particularly apt given

the importance of disulfide bonds for antibody stability.252 Little has been published on the application of these sorts of screens to engineering antibody stability, but much of this

work is behind industry doors at present.

2.7 Protein Stability from Sequence Statistics

Most of the methods described above are useful for sorting out which members of a library are folded and stable. This enables the researcher to make mutations according to some hypothesis of design or even entirely at random, and locate variants with suitable

physical properties. But it is often not simple to find rare stabilizing mutations, and such

mutations may compromise other features of the protein, such as enzymatic function or expression level. An alternative approach to identifying sites of stabilizing mutations is to turn to statistical analysis of the natural repertoire of the motif, domain, protein or fold of interest. An attractive idea is that making mutations to the most common amino acid in some position of a protein is likely to be beneficial, and indeed these so-called consensus mutations are tolerated and stabilizing far more frequently than at random or even from the best predictions today. But implementation is harder than it sounds.

Multiple-sequence alignment (MSA) is often challenging, especially in poorly conserved regions or loops, leading to high noise. Most sites in proteins are not well conserved, and taking the most common amino acid in these positions is often little better than picking one at random. Moreover, some positions, especially weakly conserved ones, can be seen to vary together—that is, to be correlated—although these correlations are only

sometimes close in space and are of uncertain significance in most cases. The very large

number of protein sequences available today makes these kinds of approaches worthy of

greater attention in the years ahead.

2.8 Consensus

Steinbacher, Pluckthun and colleagues found that about half of the mutations made to an

antibody Vκ domain were stabilizing.117 Steipe and coworkers went on to use this

concept to generate hyperstable VH domains for intracellular expression of Fvs in E.

coli.253 Wyss and colleagues made a number of variants of fungal phytases based on the

consensus of a very small number of closely-related sequences (less than 20), and they

found that even the full consensus sequences were active and significantly

thermostabilized.136,137 (‘Full consensus’ means the most common amino acid from the

MSA was used in every position of the protein.) More recent efforts have focused on the

design of ubiquitous motifs, such as ankyrin repeats and tetratricopeptide repeats.120,129,254

Consensus variants of both of these repeats have been assembled into very stable domains and engineered using library and rational methods for novel binding properties.

The origin of consensus stabilization is not entirely clear. One possibility is that individual proteins in the MSA only avail themselves of as many stabilizing mutations as necessary for function, but that consensus amalgamates these mostly additive mutations.

It is also not yet clear why only half of the mutations are stabilizing. Some light was shed on this recently by Arnold, Hilvert and coworkers, who showed that consensus mutations from library selections were also stabilizing.148 The authors suggested that this

method benefited from the removal of phylogenetic artifacts. Other factors stemming from correlation and poorly conserved residues also likely play a role. A related approach to making consensus mutations is to make “ancestral” mutations by tracing mutations back to early sequences along the phylogenetic path. These mutations also turn out to be stabilizing about half the time.156 Tawfik and colleagues have

incorporated ancestral mutations into the family shuffling of paraoxonase-3 to

successfully yield stable, active chimeric enzymes.255

Box 1 Consensus and correlation

A ‘consensus’ residue is simply the most common amino acid in one position of a family of proteins—that is, in a column of a multiple sequence alignment. It is not always easy to determine the consensus sequence of a protein, because it is not easy to align stretches of sequence that are poorly conserved or have insertions or deletions (such as loops). Also, many positions are only weakly conserved and may use nearly all 20 amino acids with some frequency.
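As a minimal sketch of this definition, the consensus can be read column by column from an alignment. The toy alignment, the gap symbol and the `min_occupancy` parameter are illustrative assumptions. The occupancy filter simply skips columns in which few sequences have a residue, which is one way of handling the insertion-rich regions mentioned above.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol in the alignment

def consensus(msa, min_occupancy=0.0):
    """Return the consensus of a list of aligned, equal-length sequences.

    Columns whose fraction of non-gap residues is not greater than
    `min_occupancy` are skipped (an illustrative filter).
    """
    if len({len(seq) for seq in msa}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    result = []
    for column in zip(*msa):
        residues = [aa for aa in column if aa != GAP]
        occupancy = len(residues) / len(column)
        if not residues or occupancy <= min_occupancy:
            continue
        # Most frequent residue; ties go to the residue seen first.
        result.append(Counter(residues).most_common(1)[0][0])
    return "".join(result)

# Hypothetical four-sequence alignment with one sparsely occupied column
toy_msa = [
    "MKAL-V",
    "MRAL-I",
    "MKGLQV",
    "MKAL-V",
]
print(consensus(toy_msa))                      # keeps the sparse column
print(consensus(toy_msa, min_occupancy=0.45))  # drops the sparse column
```

With so few sequences, the weakly conserved columns are decided by a single vote, which is the practical face of the point that the most common residue at an unconserved site carries little information.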

A correlation is, fundamentally, when a pair of residues in two positions is observed more or less frequently than expected by chance. For example, if Ala is seen in position A in 20% of sequences, and if it is in position B in 20% of sequences, then we would expect Ala–Ala pairs in A–B in 4% of sequences. If we observed the pair in all 20% of the sequences, that might represent a strong correlation. Compared to consensus, many more sequences are necessary to be confident of the significance of correlations since there are 400 possible pairs in any two positions.

Information theory (e.g. relative entropy and mutual information) can be used to quantify how biased or conserved a position is, and how interconnected the distributions of two positions are.
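The Ala example and the two information-theoretic quantities can be made concrete with a small sketch. The 25-sequence toy columns and the uniform background distribution are assumptions chosen so that the numbers match the 20% and 4% figures above; real analyses would use measured background frequencies and far more sequences.

```python
import math
from collections import Counter

def column_frequencies(column):
    """Fraction of sequences carrying each symbol in one MSA column."""
    counts = Counter(column)
    total = len(column)
    return {symbol: n / total for symbol, n in counts.items()}

def relative_entropy(column, background):
    """Relative entropy (bits) of a column against background frequencies."""
    freqs = column_frequencies(column)
    return sum(p * math.log2(p / background[aa]) for aa, p in freqs.items())

def mutual_information(col_a, col_b):
    """Mutual information (bits) between two equal-length MSA columns."""
    fa = column_frequencies(col_a)
    fb = column_frequencies(col_b)
    fab = column_frequencies(list(zip(col_a, col_b)))
    return sum(p * math.log2(p / (fa[a] * fb[b])) for (a, b), p in fab.items())

# Assumed uniform background over the 20 amino acids
uniform = {aa: 1 / 20 for aa in "ACDEFGHIKLMNPQRSTVWY"}

# Position A: Ala in 5 of 25 sequences (20%)
pos_a = "A" * 5 + "G" * 20
# Position B, independent case: Ala in 20%, Ala-Ala pair in 1 of 25 (4%)
pos_b_independent = "A" + "G" * 4 + "A" * 4 + "G" * 16
# Position B, fully correlated case: Ala-Ala pair in all 20%
pos_b_correlated = pos_a

print(relative_entropy(pos_a, uniform))
print(mutual_information(pos_a, pos_b_independent))
print(mutual_information(pos_a, pos_b_correlated))
```

In the independent case every pair occurs at the product of its single-site frequencies, so the mutual information is zero; in the correlated case it equals the entropy of either column.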

2.9 Correlation

An additional layer of complication in the statistical analysis of MSAs is that not all positions are statistically independent. Ranganathan and colleagues developed a method called statistical coupling analysis (SCA), which is a perturbation-based approach to identifying overrepresented pairs of amino acids.164 They showed that inclusion of both

consensus and correlation information was necessary and sufficient for the design of

folded WW domains—meaning that variants that were plausible from positional

distributions alone were often unfolded if they did not capture sufficient correlation

information.166 The meaning of these types of correlations is even less well understood

than the etiology of consensus stabilization. Many correlated residues are not close in

space, but many can also be assembled into networks of interacting residues that connect

distant regions of a protein fold. A number of studies have identified roles for correlated

residues in allosteric regulation.256-258 Ranganathan has recently devised a new kind of

SCA calculation and has used it to identify independent clusters of co-evolving residues

in protein families (sectors).168 Mutations to different sectors in trypsin demonstrated

that one had structural and the other had functional consequences. There is a great deal

of work that lies ahead to understand the meaning of correlations in a general way.

Magliery and Regan applied consensus analysis and SCA to TPR motifs, which

uncovered two subfamilies with distinct alternative networks of interacting residues and

resulted in an algorithm for identifying active-site residues.131,169 Among the most

striking of the results of that study was an explanation for the unusually high charge of a

consensus TPR motif (−7) even though the average TPR has zero net charge. Charge

neutralization occurred by correlations in weakly conserved positions on the surface of

the motif. This effect was not always local (for example, two residues close in space

forming a salt bridge), suggesting at least one mechanism for important non-local correlations. Magliery and colleagues recently engineered two closely related consensus variants of triosephosphate isomerase (TIM) from slightly different sequence databases.259 The two variants differed dramatically in their physical properties and

activity, with one of them having wild-type like kinetics and the other being weakly

active and poorly folded. Both variants were much more substantially different from any

natural TIM than the two variants were from each other. The only apparent difference

between these two variants is the extent to which they complete networks of correlated

residues. This type of host-guest approach will hopefully shed light on the physical

meaning of correlated positions.

2.10 Conclusions and Outlook

First-principles computational methods are likely to remain far from a comprehensive

predictive model for some time, until better potential functions and better treatments of the unfolded state can be incorporated. Even empirical parameterization is very difficult

given our sparse coverage of sequence space in thermodynamics studies. But there has

been a dramatic increase in efforts to bridge that gap in the last 5-10 years in the form of

new HT screens for foldedness, thermodynamic stability, solubility and kinetic stability.

In the next decade, with improved methods of HT gene construction, handling,

expression and purification, these new stability screening methods will give us a vastly

richer and more detailed view of the effects of mutations on protein physical properties.

And in the meantime, these methods are immediately adaptable for screening random

libraries to improve the physical properties of proteins for easier handling, crystallization

and structural studies, and superior biotherapeutics. Application of these methods to

biophysically “unfriendly” proteins that are larger, more complex and do not refold spontaneously is likely to change our view of protein folding for the majority of proteins.

Random screening in the absence of any information is often a slog, and any information to narrow down libraries to find stabilizing mutations is welcome. Protein sequence statistics can be a useful tool for guiding combinatorial experiments and limiting possibilities in difficult engineering experiments. The molecular etiology of the effects of consensus and correlated mutations remains a difficult problem, but the combination of screening methods with these kinds of calculations in the next decade will accelerate research towards an understanding of those effects. Protein stability remains one of the most difficult problems in protein science, but its illumination by experiments that take advantage of large numbers, both experimentally and statistically, offers new hope for a solution in the years ahead.

2.11 Acknowledgements

The authors thank the NIH (R01 GM083114 and U54 NS058183 to TJM) and The Ohio

State University for support. JJL was an NIH CBIP fellow and a fellow of the Great

Rivers affiliate of the AHA. BJS was an NIH CBIP fellow and is a Presidential fellow of

The Ohio State University.

Chapter 3: Consensus Design of Triosephosphate Isomerase

Triosephosphate isomerase by consensus design: dramatic differences in physical

properties and activity of related variants

3.0 Contributions

The following research article was published in the Journal of Molecular Biology under the authorship of Brandon J. Sullivan, Venuka Durani and Thomas J. Magliery. Brandon

Sullivan, Venuka Durani and Thomas Magliery designed the experiments. Brandon

Sullivan and Venuka Durani executed the experiments and all authors examined and interpreted the data. The paper was written by all authors.

3.1 Abstract

Consensus design, the selection of mutations based on the most common amino acid in each position of a multiple sequence alignment, has proven to be an efficient way to engineer stabilized mutants and even to design entire proteins. However, its application has been limited to small motifs or small families of highly related proteins. Also, we have little idea of how information that specifies a protein's properties is distributed between positional effects (consensus) and interactions between positions (correlated

occurrences of amino acids). Here, we designed several consensus variants of triosephosphate isomerase (TIM), a large, diverse family of complex enzymes. The first variant was only weakly active, had molten globular characteristics, and was monomeric at 25 °C despite being based almost entirely on dimeric enzymes. A closely related variant from curation of the sequence database resulted in a native-like dimeric TIM with near-diffusion-controlled kinetics. Both enzymes vary substantially (30–40%) from any natural TIM, but they differ from each other in only a relatively small number of unconserved positions. We demonstrate that consensus design is sufficient to engineer a sophisticated protein that requires precise substrate positioning and coordinated loop motion. The difference in oligomeric states and native-like properties for the two consensus variants is not a result of defects in the dimerization interface but rather disparate global properties of the proteins. These results have important implications for the role of correlated amino acids, the ability of TIM to function as a monomer, and the ability of molten globular proteins to carry out complex reactions.

3.2 Introduction

The sequence of amino acids in a protein encodes its physical and functional properties, but our ability to read that code is still very limited.10 For example, there have been great successes in computational prediction and design of proteins in recent years, but we are still far from a comprehensive, accurate model of the thermodynamic consequences of mutations.30,58,60,260 In part, this is because natural proteins are typically only stabilized by 5–15 kcal mol⁻¹ over the unfolded state, and our knowledge of how to model the unfolded state is poor.203,204 Remarkable functional designs of enzymes have also been achieved recently, but it remains exceedingly difficult to achieve catalytic efficiencies that compare to natural enzymes.61,62,65 The effects of solvation, backbone motion, dynamics, and entropy are largely beyond our ability to predict or design. One method of designing nonnatural sequences with native-like structures and functions is to look to statistical analysis of families of natural proteins. Genomic sequencing has given us vast databases of sequences of proteins that all have approximately the same structure and activity. This is basically a postgenomic formulation of the so-called “inverse folding problem”: what are all sequences in nature that adopt a particular fold?48 In the limit, the conservation and variation of sequence features in a multiple sequence alignment (MSA) must contain all of the information necessary to design stable, active sequences. The question is: how do we read and apply that information? We were particularly interested in determining what information is encoded at the positional level

(consensus/conservation) versus what is encoded by coupling between sites (correlation).

The idea of designing proteins, domains, or motifs from consensus is attractive because it makes intuitive sense that the most common amino acid in each position of an MSA is there for a reason (structural, functional, dynamic, etc.). Consensus sequences of motifs such as the tetratricopeptide repeat (TPR) and ankyrin repeat have been shown to be folded.121,129,254 Enzymes, such as the fungal phytases, have been engineered using sequence consensus and have been shown to be active and stable. These consensus phytases were generated from 13 to 21 highly homologous sequences from near neighbors in phylogeny.135-137 Consensus-designed proteins generally have had higher

thermal stabilities than the average proteins from which the consensus sequence was

derived; however, some rational design considerations were applied to unconserved sites

in many of these studies. Data from the phytases, antibodies, and thioredoxin suggest

that about half the time, mutation of an amino acid to the most common amino acid in the

MSA for that position is stabilizing.117,118,261-263 On the other hand, the most common

amino acid in an unconserved site presumably has little informational value, and

furthermore, unconserved sites may still be correlated to each other, which is lost in the

consensus. For example, the consensus sequence of TPR motifs has a canonical charge

of −7 although individual TPRs have a 0 ± 2.5 net charge, because the charged residues are largely poorly conserved surface residues that exhibit charge neutralization only when

correlation is considered.169 The distribution of information between consensus and

correlation is not known, although design of WW domains using only consensus versus

consensus plus correlation yielded a much larger fraction of folded proteins with

incorporation of the correlation data.166,167 When triosephosphate isomerase (TIM) was extensively mutated, virtually all structural positions could individually be mutated conservatively (e.g., Gln to Asn) with little effect on activity, but when all positions were simultaneously varied between the natural residue and a conservative replacement, only about 1 in 10¹⁰ was active.264 Therefore, interactions among sites appear to account for a great deal of the information in specifying a folded, active protein, but no experiments to date have elucidated the exact effects of these correlated mutations. To start to answer this question, we proposed to engineer the pure consensus sequence of a complex protein architecture from a large, diverse enzyme family. Presumably, this pure consensus

sequence would scramble or ablate many of the sequence correlations at poorly

conserved sites and, as such, could act as “host” for interrogating the effects of “guest”

correlation mutations. We selected the TIMs for this study, because they are a very well

studied archetypal member of the (β/α)8 proteins that make up 10 % of all biological

catalysts.174,191,265 Because of their glycolytic function in the isomerization of

dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-phosphate (GAP), virtually

every organism has a TIM and therefore hundreds of sequences are available. TIM

catalyzes a sophisticated reaction with nearly diffusion-limited kinetics and with

coordinated motion in the catalytic cycle.197,199-202 Furthermore, TIM barrel proteins have

generally been difficult to engineer despite their ubiquity in nature.176 Here, we report the construction and characterization of closely related TIM proteins based purely on

consensus, one from a “raw” sequence database and one from a later database curated of

fragments and repeats. The raw consensus TIM (cTIM) is weakly active, poorly folded,

and monomeric, in contrast to nearly all known natural TIMs, which are dimers. The

curated consensus TIM (ccTIM) is dimeric, well folded, and fully active. We demonstrate

that the oligomeric states are not a result of defects at the interface but rather that global

properties of the proteins differ dramatically. Those properties arise from sequence

variations at unconserved sites, where correlated occurrences of amino acids may play a

significant role.

3.3 Results

Consensus TIM. The consensus sequence of all TIMs was determined from the most

common amino acid in each position of the Pfam alignment (version 18.04) of 639

sequences. Because hidden Markov model alignment is not well suited to deal with insertions relative to the seed alignment, the total number of positions in the alignment

(373) is much larger than the average length of a TIM sequence (235 aligned positions).

Consequently, only positions with greater than 45 % occupancy were selected, resulting in a sequence of 248 aa including four unaligned N- and C-terminal residues from

Saccharomyces cerevisiae TIM (S.c. TIM). (S.c. TIM is also 248 aa long.) Because of

the great evolutionary diversity of this ancient enzyme family, the consensus amino acid

sequence is only 70 % identical with that of Tenebrio molitor TIM, its closest known

homolog. The gene for the cTIM was assembled from synthetic oligonucleotides using

a PCR scheme similar to the reassembly step in DNA shuffling.266 The gene was cloned

into two expression vectors, one under the control of the tac promoter and one under the

control of the T7 promoter. The tac construct was transformed into DF502, an

Escherichia coli strain deficient in TIM and several other genes nearby in the

chromosome.267 Growth on lactate and glycerol minimal media was comparable to

complementation with S.c. TIM using the same construct. However, DF502 growth was

inconsistent in our hands, perhaps because of the very slow growth on minimal media

due to the large number of metabolic genes knocked out in this strain. We turned to the

recent Keio collection single-gene knockout of TIM, which we lysogenized with DE3

phage to support transcription from the T7 promoter.268 At 5 μM IPTG, cTIM supported

growth on lactate minimal media in 2–3 days and on glycerol minimal media in 4 days,

while S.c. TIM resulted in growth in about 1 day on both media. The cTIM protein could

be overexpressed at very high levels in E. coli and was purified to near homogeneity using two-step IMAC purification with 6×His tag cleavage by tobacco etch virus (TEV) protease. To eliminate contamination by the endogenous E. coli TIM, the engineered

TIMs were purified from the Keio TIM knockout DE3 strain. The Michaelis–Menten parameters were determined from steady-state kinetics for both directions of the isomerization reaction. The apparent Km values for DHAP and GAP are comparable to those for S.c. TIM, but the apparent kcat values are reduced by about 10⁴-fold (Table 2).

Wild-type TIMs exhibit bimolecular kinetics close to the diffusion limit, but apparently

weak growth can be supported with significant reductions in activity. Therefore, an

active TIM was derived from consensus alone, albeit one with significantly reduced activity. Far-UV circular dichroism (CD) spectra for cTIM and S.c. TIM are similar and consistent with similar (β/α)8 architecture (Fig. 15a). Thermal denaturation was followed

by CD spectroscopy at 222 nm (Fig. 15b). S.c. TIM unfolds in a single, irreversible step

at about 60 °C. cTIM exhibits a similar pretransition baseline to S.c. TIM but does not unfold in a single step and is only ∼50 % unfolded at 95 °C. Unlike S.c. TIM, which precipitates at 95 °C, cTIM shows no signs of precipitation and exhibits some reversibility on cooling from 95 °C. This behavior is consistent with the thermal stabilization that has been observed for consensus mutations, although it is possible that more molten globule character is also exhibited by cTIM.121,129,135-137,254,269 With the

exception of a few tetrameric TIMs from thermophiles, all known TIMs are

homodimeric. The structure of TIM suggests that dimerization is necessary for full assembly of the active site by the interdigitation of loop 3 from the opposite monomer, and engineered monomeric TIMs exhibit kcat/Km values reduced by about 10⁴-fold.182,183,187,189,270 The quaternary structure of cTIM was determined by gel-filtration chromatography (Fig. 15c). cTIM elutes significantly after S.c. TIM. Elution volumes were compared to a standard curve to determine apparent molecular masses; S.c. TIM eluted as the expected dimer (∼56 kDa), but the consensus enzyme elutes as a monomer at room temperature with an apparent molecular mass of ∼29 kDa. Surprisingly, the consensus sequence of over 600 dimeric proteins is a monomer.
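The conversion of elution volumes to apparent molecular masses can be sketched as a linear calibration of log10(mass) against elution volume. This is a common approximation rather than a description of the exact procedure used here, and the standards and volumes below are hypothetical values invented for illustration.

```python
import math

def fit_calibration(standards):
    """Least-squares line log10(mass) = slope * volume + intercept.

    `standards` is a list of (elution volume in mL, mass in kDa) pairs.
    """
    xs = [volume for volume, _ in standards]
    ys = [math.log10(mass) for _, mass in standards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def apparent_mass(volume, slope, intercept):
    """Apparent molecular mass (kDa) for a measured elution volume (mL)."""
    return 10 ** (slope * volume + intercept)

# Hypothetical calibration standards, not the ones used in this work
standards = [(10.0, 100.0), (12.5, 31.6), (15.0, 10.0)]
slope, intercept = fit_calibration(standards)

# Hypothetical elution volumes for an earlier- and a later-eluting species
early = apparent_mass(11.3, slope, intercept)
late = apparent_mass(12.7, slope, intercept)
print(round(early, 1), round(late, 1))
```

Because the calibration assumes a compact globular shape, an expanded or molten globular monomer can elute earlier than its mass predicts, which is worth keeping in mind when interpreting apparent masses that fall between monomer and dimer.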

Table 2: Kinetic Data for Consensus TIMs.

Figure 15: cTIM Structure and Stability.

(a) CD wavelength spectrum of cTIM and S.c. TIM. (b) Thermal melt and cooling of cTIM and S.c. TIM from the 222-nm CD data. Data collected at increasing temperatures are shown as closed points while data points collected during the reverse melt are shown open. (c) Gel-filtration chromatography shows that S.c. TIM elutes as a dimer, but cTIM elutes later with calculated molecular mass corresponding to monomeric TIM.

Figure 16: ir-cTIM Design.

(a) The crystal structure of S.c. TIM (2YPI) is shown as an open monomer. The active-site bound inhibitor 2PG is shown in purple. The 12 mutations between cTIM and ir-cTIM are shown as sticks. These residues are all within 5 Å of the second chain, which reaches nearly into the active site. (b) The same rendering as in (a) but with the full dimer shown. (c) The CD wavelength spectrum of ir-cTIM shows similar ellipticity at 222 nm but significantly more signal at 205 nm.

Engineering the interface of cTIM. Although the monomeric state of cTIM was a surprise, its activity is consistent with TIM variants intentionally engineered to be monomers.182,183,187,189,270 These attempts to monomerize TIM involved deletions in the

interfacial loop 3 and mutations that reversed charge pairing. We hypothesized that by

choosing the most common amino acid at each position of cTIM, we had scrambled

necessary amino acid interactions (i.e., correlations) at the dimer interface. To examine

this hypothesis, we reverted the dimerization interface to the sequence observed in S.c.

TIM, which is known to be dimeric. The 1YPI crystal structure reveals 40 residues within 5 Å of the opposite monomer. The 12 interface residues that differed between cTIM and S.c. TIM were mutated in cTIM to create an interface reversion cTIM (ir-cTIM; Fig. 16a,b). The ir-cTIM was purified in similar yield to the original cTIM. CD spectra are similar, but ir-cTIM exhibits greater signal at 205 nm, suggesting more random coil (Fig. 16c). The thermal melts monitored at 222 nm were essentially identical.

By gel-filtration chromatography, ir-cTIM elutes at a calculated molecular mass slightly larger than that of cTIM at room temperature (∼42 kDa, Fig. 17a). Sedimentation velocity by analytical ultracentrifugation (AUC) confirmed that the protein is still monomeric at room temperature (Fig. 17b). Furthermore, ir-cTIM did not exhibit concentration-dependent oligomerization over a 10-fold range of concentrations (0.15–1.5 mg mL⁻¹). The activity of ir-cTIM was decreased compared to that of cTIM, and ir-cTIM failed to

complement the Keio TIM knockout on minimal media. When the gel-filtration

chromatography was repeated at 4 °C (Fig. 17a), all three of the proteins (S.c. TIM,

cTIM, and ir-cTIM) eluted as dimers. For cTIM, a shoulder on the dimer-weight peak

suggests that both monomer and dimer are populated at 4 °C and 37 μM (1 mg mL-1), which implies that this concentration is close to the Kd at this temperature. These results

together suggest that the monomeric states of cTIM and ir-cTIM at room temperature

may not be the result of inherent defects in the dimerization interface but rather

nonnative global properties of the cTIM scaffold. We also analyzed the binding of the

three proteins to the hydrophobic dye 1-anilinonaphthalene-8-sulfonic acid (ANS). ANS is quenched in aqueous buffer but fluoresces strongly in lower dielectric environments

such as organic solvent or when bound in the core of a protein. ANS binding is taken to be a sign of fluid tertiary structure exhibited by molten globules.271,272 S.c. TIM shows a

weak fluorescence emission peak at 418 nm, but both cTIM and ir-cTIM have strong red-

shifted fluorescence with peaks at 460 nm (Fig. 17c). The 600-MHz 1H, 15N-

heteronuclear single quantum coherence NMR spectrum of cTIM, however, displays a fair amount of amide peak dispersion for a protein of this size (Fig. 18). Taken together,

the biophysical data suggest that cTIM is monomeric and not as well folded as native

TIMs at room temperature and above.

Figure 17: ir-cTIM Characterization.

(a) The elution volume from gel-filtration chromatography of ir-cTIM corresponds to a molecular mass close to monomer. At lower temperatures (4 °C), cTIM and ir-cTIM elute as dimers with a shoulder for monomeric species. (b) Sedimentation velocity shows that ir-cTIM is monomeric with no concentration-dependent oligomerization. (c) ANS binding of S.c. TIM exhibits a weak fluorescence peak at 420 nm. cTIM and ir-cTIM exhibit strong fluorescence with a red-shifted maximum of 460 nm, suggesting that they are both molten globular.

Figure 18: 1H,15N-HSQC NMR of cTIM and S.c. TIM.

The NMR spectrum of cTIM (MW = 26 kDa) is shown on the left and that of S.c. TIM (MW = 52 kDa) on the right.

Concentration and temperature studies. One could imagine that the weak activity of

cTIM is due to weak activity in the monomer or to a small population of dimer. To

examine further the weak activity of cTIM, we observed single-point kinetics over a

range of enzyme concentrations at 4 and 37 °C (Fig. 19). S.c. TIM, which is dimeric at

both temperatures across the whole range of concentrations (16, 32, and 64 pM),

increased in activity linearly with respect to concentration at both temperatures.

Furthermore, there was a 13-fold decrease in activity at each concentration when the

reaction was performed at 4 °C versus 37 °C. When cTIM was assayed under the same

conditions (at 60–240 μM enzyme), we still observed a linear increase in activity with

respect to concentration at both temperatures, but the activity was 80-fold lower at the

lower temperature for all three concentrations. If activity required dimerization, we

would have expected a nonlinear increase in activity at increasing concentration, as more of the dimeric state is populated, and we would have expected a smaller decrease in activity between 37 and 4 °C at all concentrations, since cTIM goes from mostly

monomeric to mostly dimeric under these conditions. The composite data suggest that

cTIM is active as a monomer with molten globular properties. It is worth noting that the dimeric species seen at 4 °C in cTIM and ir-cTIM may not be native-like dimers.
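The concentration argument above can be made concrete with a minimal sketch of a simple monomer-dimer equilibrium (2M ⇌ D). It assumes that only the dimer is active and uses a hypothetical Kd of 40 μM, chosen near the 37 μM estimate at 4 °C; the function names and the Kd value are illustrative assumptions, not fitted parameters from this work.

```python
import math

def dimer_concentration(total, kd):
    """Dimer concentration for 2M <=> D with Kd = [M]^2 / [D].

    `total` is the total protein concentration in monomer units,
    so total = [M] + 2[D]. `total` and `kd` must share the same units.
    """
    # Solve 2[M]^2 + Kd*[M] - Kd*total = 0 for the free monomer.
    monomer = (-kd + math.sqrt(kd * kd + 8.0 * kd * total)) / 4.0
    return monomer * monomer / kd

def fold_change_on_doubling(total, kd):
    """Activity ratio on doubling the enzyme, if only the dimer were active."""
    return dimer_concentration(2.0 * total, kd) / dimer_concentration(total, kd)

# Hypothetical Kd of 40 uM, evaluated within the 60-240 uM range used for cTIM.
for total_um in (60.0, 120.0):
    print(total_um, round(fold_change_on_doubling(total_um, kd=40.0), 2))
```

Under this model the ratio always lies between 2 and 4 and approaches 2 only when the protein is already almost fully dimeric, which is why a strictly linear response argues against an active species that forms by association.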

Figure 19: Temperature and Concentration Dependent Activity.

The activity of cTIM and S.c. TIM studied under a series of temperatures and concentrations. The activity at 37 °C and 1×[E] was arbitrarily set at unity (100%) for both enzymes. The activity doubled and halved for the wild-type enzyme when the concentrations were increased and decreased twofold, respectively. This occurred at both temperatures. If cTIM were active as the dimer, one would expect doubling the concentration of enzyme to have a nonlinear effect on activity. Lowering the temperature from 37 to 4 °C led to a 13-fold reduction in reaction rate for the wild-type enzyme at each concentration. At 4 °C, we observe cTIM dimers by gel filtration. All other things being equal, if cTIM dimers were the active unit, we would expect less than a 13-fold decrease for cTIM. In fact, we see the opposite; the average activity decreases 80-fold between 37 and 4 °C at each enzyme concentration.

Database curation. A third consensus TIM variant that we engineered shed light on the properties of the original cTIM. When we began the analysis for correlated occurrences of amino acids, we downloaded the then current version (22.0) of the Pfam database and curated it to remove repeated sequences and sequence fragments that did not represent full genes. More precisely, sequences with fewer than 205 aa (351 sequences) and exact sequence repeats (107 sequences) were removed from the 1239 sequence database to yield 781 nonredundant full length sequences. A new ccTIM was created using a similar approach to occupancy as described for cTIM, resulting in a 248-aa sequence with 36 sequence differences from cTIM (34 substitutions, 1 insertion, and 1 deletion, Fig.

20a/b). There was a single position in the alignment (which aligned with S.c. TIM residue 49) that was equally occupied by two residues: alanine and glutamine. The position was arbitrarily chosen to be Gln. The differences between cTIM and ccTIM arise from unconserved positions in which the most common amino acid differs, and consequently, we expected these changes to have little impact. The amino acid bias of a position can be quantified by calculating the relative entropy between the positional distribution and the distribution of amino acids in a neutral reference state, such as amino acid usage in all open reading frames in yeast. From this calculation, it is evident that only unbiased or weakly biased positions were affected (Fig. 20b). These positions tolerate virtually any amino acid in all TIMs, and therefore, only minor differences were anticipated between cTIM and ccTIM.
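A minimal sketch of the relative entropy calculation for one alignment column is given here. It assumes natural logarithms and, purely as a placeholder, a uniform reference distribution; the yeast open-reading-frame frequencies used as the neutral reference state are not reproduced, and the function and variable names are illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def relative_entropy(column, background):
    """Kullback-Leibler divergence (in nats) of one alignment column.

    `column` is a string of residues observed at one position; characters
    absent from `background` (such as gaps) are skipped.
    `background` maps each amino acid to its reference frequency.
    """
    residues = [aa for aa in column if aa in background]
    counts = Counter(residues)
    n = len(residues)
    total = 0.0
    for aa, count in counts.items():
        p = count / n
        total += p * math.log(p / background[aa])
    return total

# Placeholder reference state: uniform amino acid usage (an assumption).
uniform = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}

conserved = "K" * 50              # invariant position
mixed = "PPPPPAAAAAGGSSTT--"      # tolerant position, with two gaps
print(relative_entropy(conserved, uniform), relative_entropy(mixed, uniform))
```

A strongly biased column scores high and a column resembling the reference distribution scores near zero, which is the sense in which the positions that differ between cTIM and ccTIM are described as weakly biased.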

Figure 20: Comparison of Consensus TIM sequences.

(a) Sequence alignment of S.c. TIM and consensus TIMs. Secondary structure shown for S.c. TIM with interface residues (within 5 Å of chain b) shown in red and active-site residues marked as stars. Periods denote the same amino acid as cTIM. (b) Plot showing the relative entropy (i.e., conservation) of each position in the TIM alignment. Residues that are mutated between cTIM and ccTIM are shown in black, while all other positions are shown in gray.

Curated consensus TIM. ccTIM expresses well in bacteria with yields approaching 50

mg L-1. CD wavelength spectra and thermal melt traces were essentially the same as

those of cTIM (Fig. 21a/b). The ellipticities for the 222-nm minima corresponding to α-

helical structure are all within 7 % when normalized for protein concentration, which was

confirmed by SDS-PAGE and amino acid analysis. However, other biophysical

properties turned out to be starkly different. When the thermal melt is reversed from 95

to 25 °C, ccTIM refolds almost quantitatively. There is a red shift in emission upon ANS

binding, but the very low level of fluorescence suggests that ccTIM is much less molten than cTIM (Fig. 22c). The protein elutes from a gel-filtration column at room

temperature with an apparent molecular mass of 66 kDa, slightly more than that of S.c.

TIM or the calculated dimeric mass (Fig. 22a/b). AUC sedimentation velocity studies confirm that the protein is dimeric (50.5 kDa with 95 % confidence) with less than 2 % forming higher aggregates (Fig. 22b). ccTIM is nearly as active as wild-type TIMs, with

comparable DHAP and GAP Km values and kcat values of 10⁴–10⁵ min⁻¹. ccTIM

complements growth in the Keio TIM knockout, leading to growth on minimal media

similar to that of S.c. TIM and faster than that of cTIM (Fig. 21c). Surprisingly, although cTIM and ccTIM differ only in a relatively small number of unconserved positions and have similar structural and thermodynamic properties, cTIM is a molten globular monomer with weak activity and ccTIM is a native-like structured dimer with wild-type

activity.

Figure 21: ccTIM Characterization.

(a) CD wavelength spectra of consensus TIMs. (b) The consensus variants share similar unfolding patterns when ellipticity is monitored at 222 nm with temperatures ramping from 25 to 95 °C. If the melted samples are cooled back to room temperature, cTIM and ccTIM refold significantly as judged by an increase in ellipticity. ccTIM regains 95% of its initial ellipticity. (c) In vivo characterization of TIMs on lactate minimal media in the absence of IPTG. After 3 days of leaky expression at 37 °C, all but ir-cTIM complement the Keio(DE3) TIM knockout.

Figure 22: Structure of ccTIM.

(a) ccTIM elutes near the calculated volume corresponding to dimer by gel-filtration chromatography. (b) Sedimentation velocity confirms that ccTIM is dimeric with no concentration dependence between 0.16 and 1.6 mg mL-1. (c) ANS binding of ccTIM yields a very weak fluorescence at 460 nm.

Details of kinetic characterization. The catalyst of an isomerization reaction cannot affect the thermodynamic equilibrium of its substrates. The Haldane relationship, (kcat/Km for GAP)/(kcat/Km for DHAP), for TIM has been reported196 to be about 22. The consensus-designed variants reported in Table 2 apparently have Haldane ratios of 50 for cTIM and 75 for ccTIM, representing 2- to 3-fold combined error in the kcat/Km values.

The majority of this error is manifested in the inflation of DHAP Km values by

competitive arsenate inhibition.273 For cTIM, accurate determination of kcat and Km is further complicated by some type of substrate inhibition at high concentrations of DHAP (Fig. 28). In the case of ccTIM, we estimated the Ki of arsenate

by analyzing the DHAP reaction in the presence and absence of arsenate (Fig. 23). The

Ki, 5 ± 2 mM, yields an adjusted DHAP Km of ∼2.4 mM, which translates to a Haldane relationship of 35 ± 13. This is within the range of previously reported data.
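The arithmetic behind the arsenate correction can be sketched as follows, assuming the standard competitive-inhibition relation Km,app = Km(1 + [I]/Ki). The function names and the two Km values are hypothetical and serve only to illustrate the calculation; 8.3 mM is the arsenate concentration quoted for Fig. 23.

```python
def true_km(km_apparent, inhibitor, ki):
    """Km corrected for a competitive inhibitor: Km,app = Km * (1 + [I]/Ki)."""
    return km_apparent / (1.0 + inhibitor / ki)

def ki_from_km_shift(km_without, km_with, inhibitor):
    """Ki estimated from Km measured without and with a competitive inhibitor."""
    return inhibitor / (km_with / km_without - 1.0)

def haldane_ratio(kcat_gap, km_gap, kcat_dhap, km_dhap):
    """(kcat/Km for GAP) / (kcat/Km for DHAP)."""
    return (kcat_gap / km_gap) / (kcat_dhap / km_dhap)

# Hypothetical Km values (mM), chosen only to illustrate the arithmetic.
km_no_arsenate, km_with_arsenate, arsenate = 2.4, 6.4, 8.3
ki = ki_from_km_shift(km_no_arsenate, km_with_arsenate, arsenate)
print(ki, true_km(km_with_arsenate, arsenate, ki))
```

Because an inflated DHAP Km lowers kcat/Km for DHAP, it raises the apparent Haldane ratio; correcting Km downward moves the ratio back toward the reported value.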

Figure 23: Arsenate Inhibition of ccTIM.

The Michaelis-Menten kinetics for ccTIM were measured in the absence and presence of 8.3 mM sodium arsenate. The increase in Km with maintenance of vmax suggests arsenate is a competitive inhibitor, which is supported by previous data. The Ki and Km for arsenate were estimated using the equations:

A comparison between consensus TIMs. While cTIM is 70 % identical with T. molitor

TIM, ccTIM is only 61 % identical with its nearest natural sequence neighbor,

Roseiflexus sp. TIM, but cTIM and ccTIM are 85 % identical with one another. Only one

residue mutated between cTIM and ccTIM is within 5 Å of the active-site residues (K12,

H95, and E165), the active-site lid (residues 166–176), or the 2PG inhibitor bound in

crystal structure 2YPI (Fig. 24). The only proximal mutated position (I127V) is close by virtue of a backbone–backbone interaction with E165. The mutations are spread throughout the protein secondary structures (17 in helices, 10 in sheets, and 8 in loops), and they are mainly solvent exposed (21 are more than 10 % exposed in the dimer with an average exposure of 21%, Fig. 24).274 Except for F224A, the eight nonconservative mutations were on the protein surface.275 Stated simply, there is no obvious reason for

the dramatic differences between the properties of cTIM and ccTIM. The 36 differences between the consensus TIMs are at largely unconserved positions (Figs. 20b and 26c).

The average relative entropy compared to the neutral reference state is 1.42 for all positions versus 0.82 for the 36 varying positions. Most of the 12 mutated positions with

relative entropies greater than 1.00 arise from distributions with a significant number of

sequences occupied by 2 or 3 aa. For example, position 238 has a relative entropy of

1.38. The initial distribution was 169 Pro and 137 Ala out of 407 sequences occupied at

this site. The curated distribution changed to 221 Pro and 325 Ala out of 720, switching

the most common and next most common residues. A large fraction of the positions (11)

were mutated to Ala. The mutations result in a significant decrease in the charge of the

protein (−11 in the 240 aligned positions versus −5.5 for cTIM, −3.5 for S.c. TIM, and −5

± 5 for TIMs overall). This phenomenon was seen before with the consensus sequence of

the TPR motif, where it was shown to arise from scrambling of correlated surface

charges.169 We speculate that one major difference between cTIM and ccTIM may be in

the extent of correlated occurrences of amino acids that are scrambled or broken. To test

this hypothesis, we analyzed statistical correlations between all positions in the MSA.

These correlations, calculated here as the mutual information between the amino acid

distributions at each pair of positions in the MSA, reveal that most sites in TIM do not

exhibit strong correlations (Fig. 26a). However, the weakly conserved positions that

change between cTIM and ccTIM are highly enriched in positions with strong sequence

correlations (Fig. 26b/c). Another difference between cTIM and ccTIM is that the

sequence databases used to construct them differ substantially in their phylogenetic

distributions. Specifically, a greater fraction of sequences leading to ccTIM came from

bacteria (Fig. 25). Our preliminary analysis of the correlation data suggests that networks

of correlated residues differ in differing branches of phylogeny. A full analysis of TIM

sequence correlations will be presented separately. Studies of further mutants of cTIM

and ccTIM designed to assess the roles of individual mutations and correlated pairs or networks of mutations are underway.
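A minimal sketch of the pairwise correlation measure, mutual information between two alignment columns, is shown here. It assumes natural logarithms, ignores sequences gapped at either position, and applies no finite-sample correction; these choices and the function name are assumptions made for illustration, not a description of the exact procedure used for Fig. 26.

```python
import math
from collections import Counter

def mutual_information(col_i, col_j, gap="-"):
    """Mutual information (nats) between two equal-length alignment columns.

    Sequences with a gap at either position are ignored.
    """
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != gap and b != gap]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    mi = 0.0
    for (a, b), count in joint.items():
        # p(a,b) / (p(a) * p(b)) simplifies to count * n / (left[a] * right[b]).
        mi += (count / n) * math.log(count * n / (left[a] * right[b]))
    return mi

# Toy columns: a charge-swapped pair versus an uncorrelated pair.
coupled = mutual_information("DDDDKKKK", "KKKKDDDD")
independent = mutual_information("DDKK", "DKDK")
print(coupled, independent)
```

Two weakly conserved positions can still share high mutual information when their residues co-vary, which is the situation described for the sites that differ between cTIM and ccTIM.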

Figure 24: Sequence Differences between cTIM and ccTIM.

(a) The 36 mutations are shown as colored sticks with the active site-bound inhibitor colored purple. Buried residues are cyan and surface-exposed residues are green. (b) The active site is shown by highlighting the catalytic residues (K12, H95, and E165) as gray sticks. Only six residues have any atoms within 8 Å of the inhibitor, active-site residues, or active-site lid, loop 6, and none appear to be intimately involved with the active-site residues. (c) Eight mutations occur within 5 Å of the other monomer. The majority of these residues are surface exposed. A44F and V45T (A44 and T45 in S.c. TIM) are the only two positions that change solvent exposure between the monomer and the dimer. All figures were rendered in PyMOL using the 2PG inhibitor-bound crystal structure, 2YPI.

3.4 Discussion

One important lesson from this work is that, even for a large family of enzymes with

significant evolutionary and sequence diversity that carry out a sophisticated and highly

tuned reaction, native-like activity can be engineered from consensus alone. Natural

TIMs exhibit nearly diffusion-controlled kinetics, which are believed to arise from a

highly orchestrated cycle of loop motion and precise positioning of residues in the active site to stabilize the enediol intermediate and avoid the formation of a toxic methylglyoxal by-product. ccTIM is able to carry out this reaction at wild-type rates despite differing from the nearest natural TIM in 40 % of its amino acids and never having been subject itself to evolution. This strongly argues that the vast majority of information for protein structure and function is encoded positionally, at the level of consensus, and not in higher-order correlations. It would be interesting in the future to examine methylglyoxal formation by the TIM variants engineered here. However, the stark differences between ccTIM, cTIM, and ir-cTIM illustrate that there is more information in the sequence

Figure 25: Taxonomy Statistics.

Taxonomic distribution of sequences used to generate the cTIM and ccTIM sequences. The cTIM database had a much higher fraction of eukaryotic and metazoan sequences than the ccTIM database, which was dominated by bacterial sequences.

Figure 26: Sequence Correlations in TIMs.

(a) Heat map of the mutual information for all 57,600 pairwise positional correlations in the TIM alignment. Cool colors (blue and green) represent weak correlations, and warm colors (orange and red) represent strong correlations. (b) Heat map of the correlations observed for the 35 positions that differ between cTIM and ccTIM. (c) On the left, the relative entropy (sequence bias) of each position is plotted with those corresponding to the mutated positions in black. On the right, the maximum mutual information value for each position is plotted with those corresponding to the mutated positions in black. Most of the mutations between the consensus TIMs occur at nonconserved positions, but these sites are enriched in positions with strong sequence correlations.

Table 3: Differences between cTIM and ccTIM. (a) The S.c. TIM residue and position that aligns with cTIM and ccTIM in the MSA. (b) The consensus TIM residue. (c) The curated consensus TIM residue. (d) The relative entropy for each position. Increasing relative entropies show increasing conservation within the amino acid distributions. (e) The three most common residues at each position. (f) The percent abundance of the three most common residues at each position. (g) The Henikoff score is a quantitative value to describe the conservativeness of a mutation based on phylogenetic analysis. Here, neutral mutations are given a score of 0, common mutations score positive while rare mutations are scored as negative values. (h) The solvent exposure of each residue was calculated in MOLMOL version 2K.2.0 with a 1.4 Å sphere. Buried residues generally have a solvent exposure of less than 10%. (i) X indicates residues that are within 5 Å of chain B in the crystal structure, 1YPI. Note that many of these residues are surface exposed, except positions 44 and 45, which pack into the second monomer. (j) The secondary structures based on DSS analysis in PyMOL. (k) The distances of all atoms from the mutated residues and the active site residues (K12, H95, E165), the 2PG inhibitor, and active site loop 6 (residues 166-176) were calculated in the inhibitor-bound crystal structure, 2YPI. The minimum distance is shown in Å with its closest active site neighbor (K-Lys12, H-His95, E-Glu165, 2PG-Inhibitor).

families than just the positional information. These proteins are all, in a sense,

“consensus” variants. They differ in sites that are highly tolerant to mutation, and they

arise from variations between the most common amino acids at those unconserved

positions. There is no obvious reason that the particular set of amino acids at the 36

positions that differ between cTIM and ccTIM results in a weakly active monomer in the

former case and a wild type-like dimer in the latter. However, the two proteins appear to

differ in sites that are enriched in stronger statistical correlations, and the phylogenetic

compositions of the databases leading to them differ substantially. We therefore

speculate that differences in the extent to which sequence correlations were preserved or

“broken” in the two consensus sequences may play a role in their properties.

Experiments to test this idea are underway. It is worth noting that the large number of

sequences used to construct these variants makes it possible to meaningfully assess

sequence correlations. While a native-like protein resulted from the curated database and

a less-active molten globular protein resulted from the uncurated one, this does not necessarily suggest that curation is the key to successful consensus engineering. The

sequence collections that are available significantly undersample complete evolutionary

history and are affected by researcher interest and organism availability. For consensus design, it is difficult to articulate a convincing reason that any one sequence (or even

sequence fragment) should be included or omitted from a sequence library, since the

process by which the library was created was inherently biased. Sequence fragments are

a complication for correlation analysis, but they simply add information to consensus

analysis for the sites to which they correspond. Similarly, it is possible to imagine that

many duplicates of a small number of unique sequences might bias the consensus

sequence, but that was not the case here. Of the 1239 TIM sequences in Pfam 22, over

1000 of the sequences are unique and only 2 full-length sequences were found to be

repeated more than 3 times (8 and 10 times). We also do not think that the difference in

size of the final cTIM and ccTIM databases (639 versus 781) had any significant effect in

itself on the consensus sequence. We randomly removed 142 sequences from the ccTIM

database in proportion to the taxonomic distribution in 20 separate trials. On average, the

consensus sequence adopted 3 ± 2 aa mutations relative to ccTIM with a range of 0–8. It is therefore possible to produce the same ccTIM sequence without additional sequences.
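The subsampling control can be sketched as below. This is an illustrative reconstruction under stated assumptions, not the script used for the analysis: records are taken to be (taxon, aligned sequence) pairs, removals are allocated to taxa by a largest-remainder rule, and ties in the consensus are broken by first occurrence. All function names are hypothetical.

```python
import random
from collections import Counter, defaultdict

def consensus(alignment):
    """Most common non-gap residue in each column of equal-length sequences."""
    result = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != "-")
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)

def stratified_subsample(records, n_remove, rng):
    """Drop `n_remove` records, spread across taxa in proportion to taxon size."""
    by_taxon = defaultdict(list)
    for taxon, seq in records:
        by_taxon[taxon].append(seq)
    total = len(records)
    # Largest-remainder allocation so the per-taxon quotas sum to n_remove.
    exact = {t: n_remove * len(s) / total for t, s in by_taxon.items()}
    quota = {t: int(x) for t, x in exact.items()}
    leftover = n_remove - sum(quota.values())
    ranked = sorted(exact, key=lambda t: exact[t] - quota[t], reverse=True)
    for t in ranked[:leftover]:
        quota[t] += 1
    kept = []
    for t, seqs in by_taxon.items():
        kept.extend(rng.sample(seqs, len(seqs) - quota[t]))
    return kept

def consensus_drift(records, n_remove, trials, seed=0):
    """Number of consensus changes in each random subsampling trial."""
    rng = random.Random(seed)
    full = consensus([seq for _, seq in records])
    drifts = []
    for _ in range(trials):
        sub = consensus(stratified_subsample(records, n_remove, rng))
        drifts.append(sum(a != b for a, b in zip(full, sub)))
    return drifts

# Toy alignment standing in for the curated database.
records = [("bacteria", "ACDA")] * 6 + [("eukaryota", "ACEA")] * 4
print(consensus_drift(records, n_remove=5, trials=3))
```

With the real alignment, a call such as `consensus_drift(records, n_remove=142, trials=20)` would correspond to the 20 trials described above.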

A related factor that we completely neglect here is that sequence alignment quality is likely to have some effect, especially on weakly conserved positions. Weakly conserved stretches and regions with length heterogeneity (such as loops) are the most difficult to align with certitude. Larger numbers of sequences improve alignment quality, and further expansion of sequence databases will likely improve our understanding of weakly conserved positions and correlations among them. The biophysical differences between cTIM and ccTIM are especially fascinating. Because of the way that the enzymes are designed, all of the conserved residues required for function (e.g., the Glu, His, and Lys in the active site) are present. The consensus enzymes exhibit similar CD spectra to yeast

TIM, and even the weak activity of cTIM suggests that the proteins exhibit or at least sample highly similar structures to the natural TIMs. However, the oligomeric states and

ANS binding data suggest that the primary difference between cTIM and ccTIM is in their global properties; that is, cTIM is more fluid and only dimerizes significantly at low

temperature. It is still unclear how evolutionarily common mutations at 36 unconserved

positions result in this difference. Structural and dynamic studies on cTIM and ccTIM

are underway to understand better the nature of this change. While it is difficult to prove

beyond a shadow of a doubt, the preponderance of the evidence argues that cTIM is

active as a monomer. The most convincing evidence is that cTIM activity increases in

direct proportion to concentration (i.e., the specific activity is not concentration

dependent, implying that any additional dimerization is not increasing activity) and that

cTIM is reduced further in activity than S.c. TIM upon cooling to 4 °C, although S.c.

TIM is a dimer at both temperatures and cTIM is significantly dimeric only at 4 °C.

Further purification of cTIM by ion exchange chromatography did not result in higher activity, and multiple preparations yielded similar activities, suggesting that the problem is not simply that there is a large inactive population. Careful controls, including purification from a TIM-free strain, ensure that wild-type TIM contamination is not the cause of the activity. Wierenga et al. have engineered several versions of trypanosomal

TIM to be monomeric, which turned out to be a surprisingly difficult undertaking.182,183,187,189,270 Even relatively radical mutations or deletions to the

interfacial loop 3 resulted in significant amounts of dimer at higher concentrations.

Similarly, Goraj et al. attempted to engineer monomeric TIMs from human TIM by

interface mutations, but the results were monomer–dimer equilibria, as well as inactive proteins or concentration-dependent specific activities, implying that activity arises from the dimer. We believe that our concentration- and temperature-dependent kinetic studies provide some of the strongest evidence that TIM can function as a monomer. However,

it is interesting that the trypanosomal monomeric mutants have similar kcat values to

cTIM and that the mechanism of monomerization is so different in cTIM/ir-cTIM (i.e.,

global scaffold changes versus interface mutations). The unusual dynamic nature of cTIM calls to mind the loop motions present in the TIM catalytic cycle.197-202 Movement

of loop 6 occurs on the same time scale as catalysis. As it appears to form a lid on the

active site, its motion is thought to be coordinated with catalysis. This loop motion has

been observed directly by fluorescence and by solution-state and solid-state NMR. One

possibility is that cTIM's low activity is due in part to dysregulation of the loop motions.

We attempted to make single-Trp mutants of cTIM for 19F-Trp incorporation and NMR

studies analogous to those of McDermott et al., but the single-Trp168 mutant (W11F

W157F W191F) of cTIM is inactive. Further experiments to probe this issue in cTIM

and ccTIM are underway. Finally, it is a surprise that cTIM is even weakly active given

its fluid nature, because the TIM reaction is thought to result from highly precise

positioning of catalytic residues. The result is reminiscent of the recent discovery of

Hilvert et al. that an engineered monomeric chorismate mutase from Methanococcus

jannaschii (mMjCM) has similar catalytic efficiency to its native-like dimeric

counterpart.276,277 The balance of enthalpy and entropy changes upon substrate binding

was dramatically altered for mMjCM but with little net effect on the overall free energy.

It will be interesting to calorimetrically analyze the binding of cTIM to inhibitors.

3.5 Acknowledgements

B.J.S. was a National Institutes of Health Chemistry-Biology Interface Program Fellow and Ohio State Presidential Fellow. We are grateful to Deepti Mathur for technical assistance with some of the enzyme preparation and kinetics. We thank Christopher

Jaroniec and Jeffrey Lary for their expertise in NMR and AUC, respectively. This work was supported by The Ohio State University.

3.6 Kinetic Plots

Figure 27: Kinetic Plots for S.c. TIM.

Figure 28: Kinetic Plots for cTIM.

For the forward (GAP) reaction, apparent vmax and Km values were calculated from 0-4.5 mM GAP. The velocity of the reaction appeared to reach a maximum around 4.5 mM GAP and then decrease, suggesting substrate inhibition similar to that observed for rabbit TIM. Fits performed with standard substrate inhibition models did not match the data well, suggesting the process of inhibition is more complex.

Because the reaction is coupled to reduction by NADH, it is much more likely that the inhibition is from substrate than from product. It has been suggested that this inhibition occurs when two substrate molecules bind the enzyme in a manner that inhibits activity. No evidence of substrate inhibition was seen in the other consensus-designed variants reported here.

Figure 29: Kinetic Plots for ir-cTIM.

Values for ir-cTIM in the forward reaction were barely visible above the background reactions even at high enzyme concentrations. Using JP Richard’s method of background subtraction we were able to calculate rates for the GAP reactions. When the same method was applied to the DHAP reaction, vobs ≈ vo.

Figure 30: Kinetic Plots of ccTIM.

Chapter 4: Protein Stability and Sequence Statistics

A statistical protocol for stabilizing proteins; The interplay of conservation and

correlation in triosephosphate isomerase stability

4.0 Contributions

The following research article is authored by Brandon J. Sullivan, Tran Nguyen, Venuka

Durani, Deepti Mathur, Samantha Rojas, Miriam Thomas, Trixy Syu and Thomas J.

Magliery. The idea of filtering consensus mutations with relative entropies for protein

stabilization was conceived by Thomas Magliery. The experimental protocols were

performed by Brandon Sullivan, Tran Nguyen, Deepti Mathur, Samantha Rojas, Miriam

Thomas and Trixy Syu. Venuka Durani was instrumental in the statistical analysis of protein sequences and cloned and characterized several compensatory mutations. The

paper was written primarily by Brandon Sullivan and Thomas Magliery with help from

the remaining authors.

4.1 Abstract

Understanding the determinants of protein stability remains one of protein science’s

greatest challenges. There are still no computational solutions that calculate the stability

effects of even point mutations with sufficient reliability for practical use. Amino acid

substitutions rarely increase the stability of native proteins, so large libraries and high-

throughput screens or selections are needed to stabilize proteins using directed evolution.

Consensus mutations have proven effective for increasing stability, but these mutations

are successful only about half the time. We set out to understand why some consensus

mutations fail to stabilize, and how this can be predicted a priori. Overall, consensus

mutations at more conserved positions were more likely to be stabilizing in our model,

triosephosphate isomerase (TIM) from Saccharomyces cerevisiae. However, positions

coupled to other sites were more likely not to stabilize upon mutation. Destabilizing

mutations could be removed both by removing sites with high statistical correlations to other positions, and by removing nearly invariant positions at which ‘hidden correlations’ can occur. Application of these rules resulted in identification of stabilizing mutations in

9 out of 10 positions, and amalgamation of all predicted stabilizing positions resulted in

the most stable yeast TIM variant we produced (+8 °C). In contrast, a multimutant with

14 mutations found to stabilize TIM independently was destabilized by 2 °C. Our results

are a practical extension to the consensus concept of protein stabilization, and they

further suggest the importance of positional independence in the mechanism of consensus

stabilization.

4.2 Introduction

Most native proteins are only marginally stable, meaning the folded and unfolded states

are generally separated by no more than 5-15 kcal mol-1.203,204 Many natural proteins are

not stable enough for research, pharmaceutical or industrial applications, and many

disease pathologies arise from single mutations that destabilize proteins. For example,

most of the “hot-spot” mutations observed in the tumor suppressor p53 in cancer are far from the DNA binding site and merely reduce the stability of the protein.278 However, the prediction of protein stability remains one of the most difficult

problems in protein biochemistry, due to inadequate performance of potential functions,

difficulty in sampling backbone motion, lack of knowledge of the unfolded state, and the

challenge of modeling entropic effects.30,260,269 A systematic analysis of the performance

of eleven stability prediction algorithms by Khan and Vihinen55 concluded that

Dmutant279 and FoldX53 were among the most reliable, but even these were only about

60% accurate in correctly predicting qualitatively if mutations were stabilizing or

destabilizing. For example, for FoldX, the standard deviation of the difference between

the experimental and calculated ΔΔG values for a mutation is 0.5-1.0 kcal mol-1

(depending on the implementation and elimination of outliers), but the mean experimental ΔΔG values are about 2.5 kcal mol-1.53 Part of the challenge in

understanding protein stability is that its measurement, by calorimetry or spectroscopic

observation of thermal or chemical denaturation, is slow and labor- and material-

intensive. In general, library-based strategies to improve protein stability have been very

successful, but these require library construction, an appropriate screen, and/or some


rational design.99,223,224,280 These types of experiments demonstrate that few mutations to

natural proteins are stabilizing, on the scale of 1% or less.

Advances in DNA sequencing technologies have provided a wealth of genomic data

which can be readily translated into protein sequences. Many families of proteins now

have hundreds to thousands of known sequences, allowing one to interrogate the

determinants of protein fitness statistically. One such approach, consensus design, or the

replacement of an amino acid with the most common amino acid in a multiple sequence

alignment (MSA), has been shown to increase the stability of antibodies as well as other

proteins.117,118,261-263 For example, Steipe et al. engineered ten consensus mutations in the

Vκ domain of murine antibody McPC603.117 Enhanced stability was observed in six

variants, three were neutral, and only one was less stable than wild-type McPC603. This

and other studies show that consensus mutations stabilize proteins about 50% of the time,

which is dramatically better than random mutagenesis. Consensus design has also been

applied to full consensus repeats such as the tetratricopeptide repeats (TPRs) and

ankyrins, in addition to whole enzymes including the fungal phytases and, recently,

triosephosphate isomerase (TIM).121,129,135-137,254,259 In general, these full-consensus

proteins are dramatically more stable than the proteins from which their sequences arise.

A consensus fungal phytase was 15-22 °C more stable than its parental sequences, and

previously-constructed consensus TIM variants cannot be fully melted at 95 ºC.135,259

Recently, the concept of ancestral design, replacing an amino acid with one from a common ancestor in phylogeny, has seen similar results for stabilization.153,156,281,282 The


Yamagishi lab individually replaced twelve residues with ancestral amino acids in 3-isopropylmalate dehydrogenase and found that half of the mutations improved stability.156
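The consensus operation described above can be sketched in a few lines. This is a minimal illustration, assuming the alignment is a list of equal-length strings with "-" for gaps; the function names and the toy sequences are hypothetical and are not taken from the TIM alignment.

```python
from collections import Counter


def consensus_sequence(alignment):
    """Return the most common residue in each column, ignoring gaps."""
    consensus = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa != "-")
        # An all-gap column has no consensus residue; keep it as a gap.
        consensus.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(consensus)


def consensus_mutations(host, consensus):
    """List host -> consensus substitutions, numbered by alignment column."""
    return [
        f"{h}{pos}{c}"
        for pos, (h, c) in enumerate(zip(host, consensus), start=1)
        if h != c and h != "-" and c != "-"
    ]


# Toy alignment (hypothetical sequences, for illustration only).
toy_alignment = ["MKVL", "MKIL", "MRVL", "MKV-"]
toy_consensus = consensus_sequence(toy_alignment)
toy_mutations = consensus_mutations("MRIL", toy_consensus)
```

Each entry returned by consensus_mutations corresponds to one candidate single consensus mutation of the kind tested individually in this chapter.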

We wished to understand why consensus mutations are only stabilizing about half the

time and, ideally, to predict which half would be stabilizing. For one thing, we

hypothesized that positions that are highly variable (i.e., not conserved) are not likely to

be stabilized by the consensus mutation, since those sites contain relatively little

information. For another, we hypothesized that consensus mutations in sites that are

strongly coupled to other positions might result in destabilization, at least without some kind of compensatory mutation. For example, one can imagine that mutation of a residue

in a buried polar interaction to a consensus hydrophobic residue would be destabilizing

unless the partner polar amino acid were also mutated. To test these ideas, we used the

well-studied triosephosphate isomerase from Saccharomyces cerevisiae as a host for a

large number of consensus mutations, and we examined the effects on thermal stability

for different levels of sequence conservation and correlation, as well as structural properties like surface exposure and secondary structure.

Triosephosphate isomerase is the archetypical member of the (β/α)8-barrel fold family, which is seen in more than 10% of all natural enzymes.174,191 TIM catalyzes the

isomerization between dihydroxyacetone phosphate (DHAP) and glyceraldehyde-3-

phosphate (GAP) in glycolysis; therefore, it is present in nearly every organism and


amenable to statistical analysis. The enzyme, a homodimer in most species, has been characterized in detail from several organisms including , Saccharomyces cerevisiae, Trypanosoma brucei and Homo sapiens.175,194,283-287 The active site residues

of (β/α)8-barrel proteins are typically found on the surface loops connecting the β-strand core to the α-helical surface, as are those in TIM (e.g., K12, H95 and E165 in yeast TIM).

Other loops are critical for function, including loop 3, which is interdigitated into the

other monomer, and loop 6, the opening and closing motion of which is coordinated with

catalytic activity. Despite their ubiquity and apparently modular nature, loop swapping

and other TIM-barrel engineering has proven more difficult than expected.176 The mutability of triosephosphate isomerase has been studied in the Harbury Lab. Silverman et al. found that many single conservative mutations (e.g., Glu to Asp) of yeast TIM were tolerated, but libraries of conservative mutations resulted in only 1 in 10¹⁰ active variants,

suggesting the importance of coupling between those mutations.180

We present the characterization of single consensus mutations made in a large number of

sites in S. cerevisiae TIM. We demonstrate that, in general, higher levels of conservation

lead to stabilization, but that both the most highly conserved sites and the most highly

correlated sites are less likely to be stabilizing, due to coupling effects including ‘hidden

correlations.’ Application of the resulting algorithm allows one to predict stabilizing

mutations in TIM with high reliability (9 of 10 tested were stabilizing). Furthermore,

while aggregation of all the mutations found to be individually stabilizing actually

resulted in net destabilization, aggregation of all of the mutations predicted to be stabilizing by our algorithm resulted in dramatic thermostabilization.

4.3 Results

re-S.c. TIM. We hypothesized that highly conserved positions imply greater importance in defining the family, and therefore consensus mutations at these positions might result in greater thermostabilization. To quantify the extent of conservation, the relative entropy between the distribution of amino acids in a neutral reference state and the distribution in each position in the multiple sequence alignment of triosephosphate isomerase was calculated. Relative entropy is an easily-calculated information theoretic estimate of the log of the probability of observing a given distribution if one expects a reference distribution.131,288 The reference state was taken from the codon usage in the yeast genome,289 which approximates equal usage with slight deviations from codon bias and chemical constraints of the amino acids. Positions 31 and 126 in TIM were the least and most conserved, respectively, with relative entropies of 0.31 and 4.31, and the average relative entropy was 1.42 (Fig. 31a). We chose to simultaneously mutate the six most conserved positions in wild-type Saccharomyces cerevisiae TIM (S.c. TIM) that were not already consensus amino acids. This yielded the TIM variant re-S.c. TIM

(F11W L13M Q82M W90Y K134R A212V).
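The relative entropy calculation can be sketched as follows. This is a minimal illustration rather than the code used in this work: the base-2 logarithm, the gap handling and the uniform reference distribution are assumptions made for the example (the reference state used here was derived from yeast codon usage), and the function name is hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def relative_entropy(column, reference):
    """Relative entropy (bits) of one alignment column versus a reference.

    column    -- string of residues at one aligned position
    reference -- dict mapping each amino acid to its reference frequency
    """
    # Assumption: gaps and non-standard symbols are simply skipped.
    residues = [aa for aa in column if aa in reference]
    if not residues:
        return 0.0
    counts = Counter(residues)
    total = len(residues)
    score = 0.0
    for aa, count in counts.items():
        p = count / total
        score += p * math.log2(p / reference[aa])
    return score


# Assumed stand-in for the neutral reference state: equal usage of all 20.
uniform_reference = {aa: 1 / 20 for aa in AMINO_ACIDS}

invariant_column = "W" * 100      # a fully conserved position
variable_column = AMINO_ACIDS * 5  # every amino acid equally often

re_invariant = relative_entropy(invariant_column, uniform_reference)
re_variable = relative_entropy(variable_column, uniform_reference)
```

Under these assumptions a fully conserved column scores log2(20), about 4.32 bits, and a column that matches the reference scores zero, which brackets the range of values reported for the TIM alignment.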

The gene was assembled from synthetic oligonucleotides, cloned into an E. coli overexpression vector, and purified to near-homogeneity by Ni-NTA chromatography


before and after cleavage of an N-terminal hexahistidine tag with the capsid protein

protease of tobacco etch virus. Purification of re-S.c. TIM yielded 5-10 mg L-1 of culture

from the soluble fraction. The enzyme was assayed for activity under Vmax (saturating)

conditions at 4 mM GAP (meaning, ~5× Km for wild-type S.c. TIM). The specific

activity is within two-fold of wild-type (~10⁴ µmol min⁻¹ mg⁻¹). Far-UV circular dichroism produced nearly identical spectra with broad minima spanning the 208, 215 and 222 nm peaks observed for mixed alpha/beta proteins (Fig. 32a). The ellipticity at

222 nm was monitored with increasing temperature to compare relative stabilities (Fig.

31c). Both proteins maintain folded baselines until ~50 °C before unfolding through a

single, cooperative transition. The wild-type enzyme remains half folded at 59.1 °C, but

the engineered re-S.c. TIM unfolds with a T½ of 57.0 °C. (We say T½ here because all of

the variants in this study unfold irreversibly, with precipitation upon continued heating in the unfolded state.) In contrast to our initial expectation, combining consensus mutations from the most conserved sites did not stabilize the protein.
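One simple way to read an apparent T½ from a melt curve is to find where the signal crosses the halfway point between the folded and unfolded plateaus. The sketch below assumes flat baselines taken from the first and last data points and linear interpolation between measurements; it is an illustration only, not the fitting procedure used for the data in this chapter, and the function name and toy values are hypothetical.

```python
def apparent_t_half(temperatures, signal):
    """Temperature at which the signal is halfway between its end points.

    Assumes the first point is fully folded, the last point is fully
    unfolded, and the transition is monotonic in between.
    """
    folded, unfolded = signal[0], signal[-1]
    if folded == unfolded:
        raise ValueError("no transition: end points are identical")
    # Convert the raw signal to fraction unfolded (0 = folded, 1 = unfolded).
    fraction = [(s - folded) / (unfolded - folded) for s in signal]
    for k in range(1, len(fraction)):
        low, high = fraction[k - 1], fraction[k]
        if low < 0.5 <= high:
            t0, t1 = temperatures[k - 1], temperatures[k]
            # Linear interpolation between the two bracketing points.
            return t0 + (0.5 - low) * (t1 - t0) / (high - low)
    raise ValueError("signal never crosses the midpoint")


# Toy melt curve (hypothetical ellipticity values, arbitrary units).
toy_temperatures = [50.0, 55.0, 60.0, 65.0]
toy_signal = [-10.0, -10.0, -4.0, -2.0]
toy_t_half = apparent_t_half(toy_temperatures, toy_signal)
```

Because the unfolding is irreversible, a midpoint obtained this way is an apparent value that is useful for ranking variants measured under identical conditions, not a thermodynamic melting temperature.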

Figure 31: re-S.c.TIM.

(a) Histogram of relative entropy values for all 240 aligned positions in the triosephosphate isomerase family. The mean RE is 1.42. (b) The six most conserved positions in S.c. TIM that are not consensus amino acids are shown in green sticks. The active site residues are shown in orange on the 1YPI crystal structure. (c) Ellipticity at 222 nm is followed with increasing temperature. The wild-type melts at 59.1 °C, but re-S.c. TIM melts at 57.0 °C.

Individual consensus mutants. To determine why the six highly-conserved mutations did not stabilize the protein, we constructed the mutations individually. All variants are within an order of magnitude of the wild-type activity. Five of the six variants share similarly-shaped CD spectra with comparable mean residue ellipticities at 222 nm (Fig.

32a). W90Y is the only exception, which taken together with the activity data suggests this variant may be partially unfolded. Thermal denaturation reveals that F11W and

W90Y are destabilized, but L13M, Q82M, K134R and A212V are more stable than wild-type (Fig. 32b). re-S.c. TIM was 2 ºC less stable than S.c. TIM, and the individual mutants ranged ±4 °C from the T½ of S.c. TIM. Surprisingly, mutation to even some of the most conserved residues in the MSA was destabilizing.

Figure 32: CD Characterization of Highly Conserved Mutations.

(a) The circular dichroism spectra of wild-type and consensus variants of TIM. All have similar ellipticity when normalized for concentration except W90Y, which may be partially unfolded. (b) The CD thermal melts indicate that 4 individual consensus variants are more stable than wild-type, but the remaining two are less stable.

To further understand the consensus mutation phenomenon and its role in stabilization, we engineered and assayed a variety of consensus mutations with varying levels of

conservation. There are 240 aligned positions in the TIM family, and S.c. TIM deviates from the consensus sequence at 43% of them. Of these 103 positions, we chose to characterize 23 individual consensus mutations that vary in solvent exposure, secondary structure, conservation and evolutionary substitution frequency.

Figure 33: Thermal Assays.

Thermal stabilities of consensus TIM variants. We monitored the loss of secondary structure with increasing temperature at 222 nm for helices (a) and 215 nm for sheets (b). (c) The optical density at 600 nm from aggregation reports similar two-state unfolding profiles as the CD thermal melts. (d) High-Throughput Thermal Scanning was used to assay the melting temperatures based on hydrophobic dye binding. Note that the same colors are used for the same variants in parts a, b, c and d.


Table 4: Characterization of Mutants. aRelative entropy between the positional distribution and the yeast neutral reference state. bDerived from CD thermal data at 222 nm. cT½ - 59.1 °C, where 59.1 °C is the T½ of S.c. TIM. d(+) indicates mutations that are expected to increase the stability of S.c. TIM based on conservation and correlation filters. eThe specific activity for the turnover of GAP to DHAP. The enzymes were assayed at 4 mM GAP, which corresponds to >5× Km of S.c. TIM. fThe solvent exposure of each residue in PDB entry 1YPI was calculated in MOLMOL version 2K.2.0 with a 1.4 Å sphere. gThe Henikoff score is a quantitative value to describe the conservativeness of a mutation based on phylogenetic analysis. Here, neutral mutations are given a score of 0, common mutations score positive, and rare mutations are scored as negative values. hWe consider a site to be highly correlated if its maximal mutual information score is greater than 0.5. iThe ΔΔG values were calculated for each mutant using FoldX (see Materials and Methods). Here, a destabilized and stabilized mutation have positive and negative values, respectively. jDid not express.


Figure 34: Concordance of Stability Assays.

The variants are arranged by the T½s derived from CD thermal denaturation at 222 nm. Data was not collected by HTTS for A66C, I109V, D180Q, and A212V.

Figure 35: Correlation of Thermal Methods.

The T½ values are compared between four complementary methods: (1) Loss of CD ellipticity at 222 nm for α-helices, (2) Loss of CD ellipticity at 215 nm for β-strands, (3) Scattering of light at 600 nm for detection of precipitation products and (4) High-Throughput Thermal Scanning. Here, the T½ values are plotted for comparison.

The 23 variants were expressed and purified from BL21(DE3) E. coli in similar yield to

S.c. TIM and re-S.c. TIM. The I20A, G122T and F229A mutants did not express in sufficient quantities to characterize. Multiple codons and contexts were tested for I20A

(GCG, GCT) and G122T (ACA, ACT, ACC, ACG) with similar results. Consequently, these three mutations were classified as destabilizing. The remaining twenty proteins were assayed for catalytic activity by monitoring the turnover of GAP to DHAP. All the variants turned over substrate with specific activity values of ~10³-10⁴ µmol min⁻¹ mg⁻¹, which is on par with wild-type. All variants displayed similar mean residue ellipticities at 222 nm (data not shown). The T½ for each variant was determined by CD thermal

denaturation (Fig. 33a & b). The T½s of twelve of the 23 mutants were greater than wild-

type, one was the same as wild-type, and the remaining ten exhibited a loss in stability—

essentially the same as the 50% rate of stabilization previously seen for both consensus and ancestral mutations.

The most destabilized variant in our dataset was N213K at ΔT½ = -5.1 °C, and the most

stabilizing mutation was L13M at ΔT½ = +4.0 °C. Since the entire dataset differs by

only 9.1 °C, we relied on several other thermal assays to accurately rank the relative

stabilities of our variants: (1) We also observed thermal denaturation of the TIMs by

215 nm CD signal. (2) TIM thermal denaturation leads to aggregation and precipitation

upon unfolding. We measured the T½ values from the scattering of light (optical density) at 600 nm. This method is essentially the same as what is referred to as Differential


Static Light Scattering (DSLS).207 (3) We previously showed that TIM stability differences could be accurately ascertained by High-Throughput Thermal Scanning.99

The data from each of these methods are highly concordant, as shown in Figures 33,

34 and 35. Taken together, these assays allow us to accurately measure small differences in stability. (Throughout this report, unless specified, T½ values refer to those obtained by

loss of ellipticity at 222 nm on thermal denaturation.)

Structural nature of consensus mutations. For each of the consensus mutations, we

analyzed the physical and chemical properties of the amino acids, their context within the

folded protein and their sequence statistics (Table 4). On average, residues in TIM are

17% solvent exposed and our consensus mutants average 13% ranging from 0% to 60%.

In our dataset there is no correlation between solvent accessible surface area and T½ (R² =

0.20, see Figure 43). Five of the seven loop mutants were stabilized, six of the ten helical

mutations were stabilizing, and only one of the six β-sheet mutants was more stable. The

Harbury Lab has previously shown that the β-strand core of TIM is highly sensitive to mutations.180 BLOSUM scores are based on the mutational propensities between amino

acids as calculated across phylogeny, and are consequently a way of quantifying how

conservative a mutation is.275 Here, six out of ten common substitutions (positive

BLOSUM) were more stable, and six out of 13 rarer substitutions (zero to negative

BLOSUM) were more stable. Therefore, except that few positions in β-strands result in

stabilization, the general structural properties of the mutations were not predictive of

stabilization.


Sequence statistics and stability of tested mutants. Our initial results from re-S.c. TIM

and its individual constituents suggested that high positional conservation alone is not

more predictive of which consensus mutations will be stabilizing. We re-examined this

in light of the full set of 23 mutations. Mutations at sites that are more conserved than

average (relative entropy greater than 1.42) yielded more stable mutants in nine of 14

consensus variants, while only three of nine consensus mutations at weakly conserved

positions were stabilizing (Fig. 36a). Overall, limiting consensus mutations to sites with

more than average conservation improves the chances of making stabilizing mutations

from about 1-in-2 to about 2-in-3, even though some mutations at highly-conserved

positions were destabilizing.

Our second initial hypothesis was that some consensus mutations would fail to stabilize

because of coupling. One potential way to predict coupling is from statistical correlation

of positions in a MSA. Here, statistical correlation was determined from the mutual

information between the amino acid distributions at each pair of positions (Fig. 37).

Mutual information is the relative entropy between the observed pairwise distribution and

the joint distribution calculated from the positional frequencies. As an illustration, if

position i is amino acid a 25% of the time, and position j is amino acid b 10% of the time, then we randomly expect a-b pairs in 2.5% of sequences. The degree to which we see more (or fewer) a-b pairs than this increases the information in one distribution about the

other distribution, and implies correlation (or anti-correlation).
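The pairwise calculation can be illustrated with a short sketch. It assumes a base-2 logarithm and that sequences with a gap at either position are skipped; neither detail is specified in the text, and the function name and toy columns are hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two aligned columns.

    Compares the observed pair frequencies with the product of the
    single-position frequencies (the frequency expected at random).
    """
    # Assumption: sequences with a gap at either position are skipped.
    pairs = [(a, b) for a, b in zip(col_i, col_j) if a != "-" and b != "-"]
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    freq_i = Counter(a for a, _ in pairs)
    freq_j = Counter(b for _, b in pairs)
    score = 0.0
    for (a, b), count in joint.items():
        observed = count / n
        expected = (freq_i[a] / n) * (freq_j[b] / n)
        score += observed * math.log2(observed / expected)
    return score


# Expected a-b pair frequency from the illustration in the text.
expected_ab = 0.25 * 0.10  # 2.5% of sequences

# Toy columns (hypothetical): perfectly coupled, independent, invariant.
mi_coupled = mutual_information("AAAACCCC", "DDDDEEEE")
mi_independent = mutual_information("AACC", "DEDE")
mi_invariant = mutual_information("WWWW", "DEDE")
```

The last case returns zero: a position that never varies carries no pairwise information regardless of what it physically contacts.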


Figure 36: Filtering by Conservation.

(a) All positions in TIM have been plotted against their relative entropies from the neutral reference state. All sites are shown in gray, and stabilizing and destabilizing consensus mutations are shown in green and red, respectively. Note that the stable mutations aggregate above the black arrow which indicates the mean relative entropy of 1.42. (b) Amino acid distributions for the yeast neutral reference state and positions with relative entropy values of 0.5, 1.0, 1.5, 2.0, and 2.5 are shown.

By this calculation, only a small number of positions are seen to interact strongly (Fig.

37a). This was also observed in WW and PDZ domains using a related metric for sequence correlation (SCA).166,167 Of the 14 positions with above average conservation, the five that were destabilized (Fig. 37d) upon mutation to the consensus residue are more highly correlated to other positions than the nine that were stabilized (Fig. 37c).

Three of the five less stable variants are hubs for interaction networks, with detectable correlations to multiple positions. To estimate the importance of mutual information scores, we generated mock alignments in which the positional distributions were maintained but amino acids were scrambled between the sequences, and then recalculated the mutual information scores. Detectable here means above this ‘noise’ threshold

(MI=0.23). The strengths of several correlations to the destabilized conserved positions are significant. Pairwise correlations between positions 90-157 (MI=0.72), 123-90

(MI=0.63), and 180-229 (MI=0.51) are all within the top 0.4% of the 28,680 possible unique correlations for 240 positions. A full analysis of the statistical interactions in the

TIM family is beyond the scope of this paper, and will be presented separately (VD, BJS and TJM, manuscript in preparation).
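The mock-alignment control can be sketched as a column shuffle: permuting residues within each column keeps every positional distribution intact while breaking any pairing between positions. The sketch below is illustrative only. How the threshold was read off the scrambled scores is not stated here, so the use of a high quantile, the number of shuffles and all function names are assumptions.

```python
import math
import random
from collections import Counter
from itertools import combinations


def mutual_information(col_i, col_j):
    """Mutual information (bits) between two gap-free columns."""
    n = len(col_i)
    joint = Counter(zip(col_i, col_j))
    freq_i, freq_j = Counter(col_i), Counter(col_j)
    return sum(
        (count / n) * math.log2(count * n / (freq_i[a] * freq_j[b]))
        for (a, b), count in joint.items()
    )


def shuffled_mi_threshold(columns, n_shuffles=20, quantile=0.99, seed=0):
    """Estimate a noise level for MI from column-shuffled mock alignments."""
    rng = random.Random(seed)
    null_scores = []
    for _ in range(n_shuffles):
        mock = []
        for column in columns:
            shuffled = list(column)
            rng.shuffle(shuffled)  # same composition, scrambled sequences
            mock.append(shuffled)
        null_scores.extend(
            mutual_information(a, b) for a, b in combinations(mock, 2)
        )
    null_scores.sort()
    # Assumption: take a high quantile of the scrambled scores as 'noise'.
    index = min(len(null_scores) - 1, int(quantile * len(null_scores)))
    return null_scores[index]


# Number of unique position pairs for 240 aligned positions: 240 * 239 / 2.
n_pairs = math.comb(240, 2)

# Toy columns (hypothetical), listed position by position.
toy_columns = ["AAAACCCC", "DDDDEEEE", "GGHHGGHH"]
toy_threshold = shuffled_mi_threshold(toy_columns)
```

Scores from the real alignment that exceed a threshold estimated this way would be treated as detectable correlations.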

Two variants with above-average conservation and little sequence correlation, F11W

(-3.7 ºC) and V226I (-0.6 ºC), were destabilizing. Position 11 is so conserved that it is almost invariant in the TIM family; it is Trp in 595 out of 719 sequences in our MSA.

Coupling to very highly conserved positions cannot be detected by sequence correlation because, if position 11 is nearly always Trp, then i-11 pairs will nearly always be a-Trp, and the pairwise distribution carries no more information than would be expected at random. But it is still possible that a highly conserved position could be physically

coupled to another position, in the sense that mutation of the conserved position might require a compensatory mutation at a second position to rescue stability or function.

These types of ‘hidden correlations’ can only occur at the most conserved sites, and so they can be eliminated from sites for potential consensus mutations by putting an upper limit on conservation (e.g., excluding sites with a relative entropy greater than 3), in addition to the lower limit already described.

Figure 37: Mutual Information and Protein Stability.

(a) The mutual information matrix for all 240 positions in TIM is shown. The matrix is symmetric (x-y is the same as y-x), and there is no meaning to the self-correlations (x-x), which were not calculated. (b) The distribution of mutual information scores is shown for the entire matrix. Here, approximately 30% of all pairwise correlations are above the noise threshold of 0.23. The distributions of mutual information scores are shown for stabilizing mutations (c) and destabilizing mutations (d). Note that there is a significantly higher fraction of strong correlations at the positions that lead to a loss in stability.

To explore the role of both statistical correlations and hidden correlations further, we

attempted to design compensatory mutants for consensus variants F11W and W90Y. We

analyzed the MSA and the crystal structures of TIMs with these different amino acids.

As noted, yeast TIM is one of very few with Phe at position 11; TIM from most

organisms, such as T. maritima, has a Trp at this position. It appears that position 20, in

van der Waals contact with F11, is a larger amino acid, Ile, than is typically seen in this

position (Ala in T. maritima). Alignment of the crystal structures from S. cerevisiae and

T. maritima shows that the F11W mutation would sterically clash with the Ile in position

20 (Fig. 38a). The compensatory mutant F11W I20A in yeast TIM was 4.3 °C more stable than yeast TIM. In the context of F11W, the I20A mutation netted 8 °C of thermal

stabilization. The I20A mutant alone did not express, perhaps because of destabilization due to underpacking against Phe11. Thus positions 11 and 20 are coupled in TIM, although this could not be detected by correlation statistics.

Figure 38: Hidden Correlations.

A ‘hidden correlation’ between positions 11 and 20. (a) The crystal structures of S.c. TIM and T. maritima TIM [PDB entries 1YPI (pink) and 1B9B (green)] are aligned and residues 11 and 20 are highlighted. The F11W mutation may have introduced a steric clash resulting in destabilization. (b) CD thermal denaturation of F11W and F11W I20A. I20A alone did not express in appreciable quantities.

W90Y was a second mutation where above-average conservation did not yield consensus

thermostabilization. Mutual information shows this position is a hub of statistical

interactions. We attempted to ‘correct’ the strongest broken correlation, 90-122 by

mutating the glycine at position 122 to the larger threonine. Thr is the consensus amino

acid at position 122 and co-evolves with Tyr at position 90. The G122T substitution did

not express in the context of S.c. TIM or W90Y S.c. TIM with codons ACA or ACC.

Mutual information analysis also suggested that position 123 co-evolved with 90 and

122. A V123P consensus mutation was also constructed, but did not express in any

scaffold (V123P, W90Y/V123P, W90Y/V123P/G122T). There are 16 residues that cage

the aromatic ring at position 90, half of which directly pack against the side chain. All

are consensus amino acids except 122 and 123, at which we tested possible substitutions.

G122R is a known human mutation that leads to thermolability.284 The hub-like nature of

position 90 makes it difficult to engineer compensatory mutations without disrupting

other possible interactions.

If the three criteria described here are taken together—above-average conservation (here,

relative entropy greater than 1.42), below-average coupling (here, maximal mutual

information less than 0.50), and elimination of nearly-invariant sites with possible hidden

correlations (here, relative entropy greater than 3)—then 14 consensus mutations would

be predicted to be stabilizing. We tested 10 of those individually, and 9 were stabilizing.
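The three criteria amount to a simple filter over candidate consensus sites. The sketch below uses the thresholds quoted above (relative entropy above 1.42 and not above 3, maximal mutual information below 0.50); the Site record, the function name and the example values are hypothetical and do not correspond to entries in Table 4.

```python
from dataclasses import dataclass


@dataclass
class Site:
    """One candidate consensus mutation and its sequence statistics."""
    label: str
    relative_entropy: float
    max_mutual_information: float


def passes_filters(site, re_min=1.42, re_max=3.0, mi_max=0.50):
    """Apply the conservation, hidden-correlation and coupling filters."""
    conserved_enough = site.relative_entropy > re_min
    not_nearly_invariant = site.relative_entropy <= re_max
    weakly_coupled = site.max_mutual_information < mi_max
    return conserved_enough and not_nearly_invariant and weakly_coupled


# Hypothetical sites, one per outcome of the filters.
candidates = [
    Site("site_A", relative_entropy=2.1, max_mutual_information=0.30),  # kept
    Site("site_B", relative_entropy=0.9, max_mutual_information=0.10),  # weakly conserved
    Site("site_C", relative_entropy=2.5, max_mutual_information=0.72),  # highly correlated
    Site("site_D", relative_entropy=3.6, max_mutual_information=0.05),  # nearly invariant
]
selected = [site.label for site in candidates if passes_filters(site)]
```

Keeping the thresholds as parameters makes it straightforward to re-tune them for a different protein family, where the mean relative entropy and the noise level of the mutual information would differ.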


Figure 39: Characterization of comboTIM and algoTIM.

(a) The CD wavelength scans of wild-type S.c. TIM and algoTIM are nearly identical. comboTIM shows less ellipticity at 222 nm and has its deepest minima at 205 nm, suggesting some random coil. (b) The CD thermal melts monitored at 222 nm are shown for all characterized proteins in gray, with comboTIM, S.c. TIM, the F11W I20A mutant, and algoTIM highlighted.

Multimutants. After the failure of the original re-S.c. TIM variant, we wanted to test

whether a super-stable mutant of yeast TIM could be made by amalgamating predicted

stabilizing mutations. Of the 240 aligned positions in the TIM alignment, 103 are not

consensus. Only 19 positions have relative entropies between 1.42 and 3.00, and four of

those positions (C41A, W90Y, V123P and D180Q) have high maximal mutual

information values as well as large numbers of significant correlations (see Figure 42).

As a result, we designed algoTIM which includes 15 consensus mutations (L13M I40V

A66C N78I Q82M I83L I109V V121L I127V K134R K135E V162I I184V A212V

V226I). These mutations include 9 stabilizing mutations, 1 destabilizing mutation and 5 uncharacterized mutations. In addition, we characterized a second TIM we named comboTIM that simply combines all stabilizing mutations characterized in this study

(F11W L13M I20A S31K Y49Q A66C Q82M I83L I109V V121L K134R A175T I184V


A212V). Note that this variant contains the stabilizing F11W I20A pair, and does not

contain the destabilizing V226I mutation or any of the mutations we removed from

algoTIM due to high correlations.

The algorithmic multimutant algoTIM melted with a T½ of 67.2 °C, about 8 °C greater

than S.c. TIM and an additional 4 °C more stable than any variant previously

characterized (Fig. 39). In stark contrast, comboTIM was destabilized (T½ = 56.7 °C)

from wild-type despite harboring 14 known stabilizing mutations.

Kinetic stability. The mechanism of consensus stabilization is poorly understood.

Recent studies on thioredoxin suggest that such stabilization might principally arise from

decreased unfolding rates; however, thermal denaturation was used to measure stability,

and chemical denaturation was used to measure unfolding rate.262,290 Although chemical unfolding kinetics can generally be measured more reliably, S.c. TIM does not unfold entirely in 8 M urea and unfolds too quickly for practical measurement by CD in low concentrations of guanidinium chloride. We assayed the apparent unfolding rates of the

TIM variants with no chemical denaturant by temperature-jump experiments, monitoring the loss of 222 nm ellipticity at 70 °C (Fig. 40). While this decision was practical, it also provides a direct comparison to measured T½ values. All of the destabilized variants

unfolded more rapidly than S.c. TIM. However, only about half of the stabilized variants

unfolded more quickly, and there was not a strong correlation overall between the

apparent thermal unfolding rate constants and T½ (R² = 0.30). The half-times of


unfolding varied over a small range, from ~10-20 seconds, but the stabilities also vary

only over a small range. This suggests that the stabilizing effects of consensus mutations

have both thermodynamic and kinetic components.
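For orientation, unfolding half-times and rate constants can be interconverted if one assumes simple first-order (single-exponential) unfolding. That kinetic model is an assumption made for this sketch, and the function names are hypothetical.

```python
import math


def rate_constant(half_time_s):
    """First-order rate constant (per second) from a half-time in seconds."""
    if half_time_s <= 0:
        raise ValueError("half-time must be positive")
    return math.log(2) / half_time_s


def fraction_folded(k, time_s):
    """Fraction still folded after time_s for first-order unfolding."""
    return math.exp(-k * time_s)


# Half-times of roughly 10 to 20 seconds, as observed at 70 °C.
k_fast = rate_constant(10.0)  # about 0.069 per second
k_slow = rate_constant(20.0)  # about 0.035 per second
```

On this model the whole dataset spans only a two-fold range in rate constant, which is consistent with the narrow range of T½ values.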

Figure 40: Kinetic Unfolding of Consensus Variants.

The unfolding rate constants, k, have been plotted against the T½ values. All of the destabilized variants unfold more quickly than wild-type, but there is no correlation between unfolding rates and melting temperatures for stabilized mutants. S.c. TIM is shown as a black square at the intersection of crosshairs.

4.4 Discussion

A number of lines of evidence show that about 50% of consensus mutations are stabilizing. We set out to understand how to identify which half are stabilizing and the basis for that distinction. Our two fundamental hypotheses were that consensus


mutations at weakly conserved positions would be less likely to stabilize, and that

mutations at positions that are coupled to other sites might destabilize more frequently.

We originally tried to simply amalgamate the consensus mutations at the six most conserved sites that were not already the consensus residue in yeast TIM, but this actually

resulted in slight destabilization. Dissection of the re-S.c. TIM multimutant into its

constituent mutations showed that two of the mutations, F11W and W90Y, were

destabilizing. Position 90 is both strongly correlated to several other positions, and is

correlated at least weakly to a large number of positions. Our attempts to generate

compensatory mutants for the W90Y mutation illustrate how difficult it can be to mutate

highly-correlated positions. All but two positions around 90 are already consensus

residues, but mutation of those two positions to consensus residues, G122T and V123P,

alone, in combination with W90Y, or all together, resulted in no expression.

Consequently we suggest removal of highly correlated positions from the set of stabilizing mutations to test.

F11W represents a different and more subtle kind of coupling. Position 11 is Trp in virtually every TIM, and we initially were quite surprised that F11W was destabilizing in yeast TIM. In retrospect, it seems reasonable that in order for yeast TIM to have a mutation at the highly conserved W11 seen in most TIMs, something else might also have mutated in response. That turns out to be the case. Mutation of adjacent position 20 from the larger Ile to the smaller (consensus) Ala seen in most TIMs apparently compensates for the larger Trp in position 11; moreover, I20A alone (that is, in the

context of F11) results in no expression. The F11W I20A dual mutant is the most stable simple mutant we engineered here, which suggests that consensus mutations at the most conserved positions can have a big payoff, but with some peril. Namely, we cannot statistically detect correlations to invariant positions. If two residues are highly conserved in a protein, it is impossible to say if they are conserved together or separately, unless the single mutants reduce fitness and a double mutant rescues it. We think of this as a kind of ‘hidden correlation’ that, like the statistical correlations we can detect, is best avoided to maximize the number of stabilizing mutations.

When we look at the entire group of 23 consensus mutations made here, the fraction of mutations that stabilize versus destabilize or abrogate expression is better in the more-conserved half of positions (two-thirds are stabilizing, versus half overall). There is little pattern to which mutations stabilize otherwise. Few stabilizing mutations were found in beta strands, but many stabilizing mutations were solvent exposed or in loops, where we might not expect stabilization and certainly cannot meaningfully predict it computationally. Even fairly non-conservative mutations, like Y49Q, were often stabilizing. We analyzed the consensus mutations with the computational protein stability predictor FoldX (Table 4).53 There was essentially no correlation between the predicted ΔΔGs and ΔT½s (see Supporting Figure 43). FoldX is able to identify which mutations are stabilizing (i.e., the sign of ΔΔG is consistent with the sign of ΔT½) in about 60% of cases, but this is about the fraction of consensus mutations that are stabilizing overall.

FoldX did predict large destabilizations (3-8.5 kcal mol-1) for the six mutants that did not

express, which is valuable information for the protein engineer. It is important to note that while ΔΔGs and ΔT½s both report relative stabilities, they are not the same thermodynamically, and the irreversible thermal denaturations here are not under equilibrium conditions. It is possible that some variants have, for example, decreased thermal stability but a greater free energy difference between the folded and unfolded states.

Figure 41: Mutual Information for comboTIM and algoTIM.

The positions of mutation for comboTIM and algoTIM have been isolated from the mutual information matrix of all pairwise interactions. (a) The positions of mutation in algoTIM have virtually no strong (red, orange) correlations to other sites in the protein. (b) In contrast, the 14 positions of mutation in comboTIM have many strong correlations with other positions within TIM. (c) The 15 mutations in algoTIM are assembled into a matrix with the correlations displayed as a heat map. The positions of mutation are not correlated to each other. (d) The 14 mutations in comboTIM are assembled into a matrix with the correlations displayed as a heat map. Although these mutations were stabilizing independently, there are many strong correlations between sites of mutation in comboTIM, perhaps leading to non-additive effects.
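
As a rough illustration of the quantity displayed in these heat maps, the mutual information between two alignment columns can be computed from their joint and marginal amino acid frequencies. The sketch below uses the textbook definition with natural logarithms. The function name and the toy four-sequence columns are hypothetical, and the sketch applies no gap handling, sequence weighting or finite-sample correction, so it should not be read as the procedure used to build the matrix in this work.

```python
from collections import Counter
from math import log


def mutual_information(col_i, col_j):
    """Mutual information (in nats) between two aligned MSA columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must come from the same alignment")
    n = len(col_i)
    count_i = Counter(col_i)            # marginal counts at position i
    count_j = Counter(col_j)            # marginal counts at position j
    count_ij = Counter(zip(col_i, col_j))  # joint counts of residue pairs
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        p_ab = n_ab / n
        p_a = count_i[a] / n
        p_b = count_j[b] / n
        mi += p_ab * log(p_ab / (p_a * p_b))
    return mi


# Hypothetical toy columns (one character per sequence), for illustration only.
covarying = mutual_information("FFWW", "IIAA")    # F always pairs with I, W with A
independent = mutual_information("FFWW", "IAIA")  # every pairing equally common
print(round(covarying, 3), round(independent, 3))
```

In this toy case the perfectly covarying pair reaches the maximum possible for two equally populated states (ln 2), while the independent pair scores zero.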

One interesting note about the consensus mutations explored here is that, except for the variants that do not express, all of the variants, including algoTIM with 15 mutations, have extremely high catalytic activity. None is reduced even an order of magnitude, and wild-type TIM is among the most efficient enzymes known. This is not because TIM is especially mutable. It is a highly tuned enzyme that works by exquisite positioning of catalytic residues with coordinated loop dynamics in the catalytic cycle. Harbury found that vanishingly few variants with multiple conservative mutations were active.264 Unlike most mutations designed by humans and computers, consensus mutations have been tested for fitness by Nature in a variety of contexts. When we choose to replace an amino acid with the most common one in an MSA, there is greater confidence in the maintenance of function. Interestingly, Hilvert recently reported that ‘consensus’ mutations in libraries of chorismate mutase from directed evolution were also stabilizing, but the consensus variants ranged significantly in activity (from 2-fold higher to 30-fold lower).148

Three multimutants were constructed for this study, re-S.c. TIM, comboTIM and algoTIM. The sum of the ΔT½ values for the six re-S.c. TIM mutants is -0.6 ºC, but re-S.c. TIM is actually destabilized about 2.1 ºC. More strikingly, comboTIM is made up of the stabilized F11W I20A mutant and 12 additional mutations all found to be stabilizing. The sum of those ΔT½s is +22.9 ºC, but the protein is actually destabilized by 2.5 ºC. In contrast, amalgamating the 15 residues suggested by our conservation-correlation algorithm results in 8.2 ºC of stabilization, the most we saw in this study. This variant includes one mutation we know to be destabilizing and 5 that we did not test separately.

We suggest that, besides helping to identify which consensus mutations will be stabilizing, removal of coupled positions also increases the additivity of the mutations. comboTIM includes several residues with below-average relative entropy that are enriched in statistical interactions to other sites (Fig. 41b and d). Although each of these mutations is stabilizing in the context of wild-type, coupling among the 16 mutated positions negates additive gains in T½ (Fig. 41b and d). In contrast, the positions in algoTIM were selected for independence (Fig. 41a and c).
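
The departure from additivity can be made explicit by subtracting the sum of the single-mutant ΔT½ values from the ΔT½ observed for the multimutant. The short sketch below does this with the values quoted in the text; the function and variable names are hypothetical, and algoTIM is omitted because five of its mutations were not tested separately, so no sum of singles is available for it.

```python
def additivity_gap(sum_of_singles, observed_multimutant):
    """Observed multimutant change in T1/2 minus the sum of single-mutant changes (deg C).

    A value near zero indicates additive behavior; a large negative value
    indicates that the combined mutations stabilize far less than expected.
    """
    return observed_multimutant - sum_of_singles


# Values quoted in the text, in degrees C.
multimutants = {
    "re-S.c. TIM": {"sum_of_singles": -0.6, "observed": -2.1},
    "comboTIM": {"sum_of_singles": 22.9, "observed": -2.5},
}

gaps = {
    name: round(additivity_gap(v["sum_of_singles"], v["observed"]), 1)
    for name, v in multimutants.items()
}
for name, gap in gaps.items():
    print(f"{name}: deviation from additivity = {gap} deg C")
```

On these numbers re-S.c. TIM falls about 1.5 ºC short of the additive expectation, whereas comboTIM falls about 25 ºC short.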

Figure 42: Correlations in TIM.

The positions of mutation for comboTIM and algoTIM have been isolated from the mutual information matrix of all pairwise interactions. (a) The positions of mutation in algoTIM have virtually no strong (red, orange) correlations to other sites in the protein. (b) In contrast, the 14 positions of mutation in comboTIM have many strong correlations with other positions within TIM. (c) The 15 mutations in algoTIM are assembled into a matrix with the correlations displayed as a heat map. The positions of mutation are not correlated to each other. (d) The 14 mutations in comboTIM are assembled into a matrix with the correlations displayed as a heat map. Although these mutations were stabilizing independently, there are many strong correlations between sites of mutation in comboTIM, perhaps leading to non-additive effects.

We cannot definitively say from this work what exact quantitative standards should be applied for the conceptual filters proposed here. We chose to make a relative entropy of 3 from the yeast codon usage reference state our upper limit on conservation (for removing ‘hidden correlations’). If the most common amino acid at a position is Leu, a relative entropy of 2.27 corresponds to 99% conservation; but if the most common amino acid is Trp, it is 4.46, because Trp is used less overall, and so it would be more improbable for it to dominate a site. In practice, values above 2.5 or even 2.0 represent very highly biased positions, and it would take a much larger data set to quantitatively set this limit. Likewise, it is not clear if the mean relative entropy score is the optimal lower limit on conservation, and, further, this value might change substantially for proteins enriched in rarer amino acids. But until a much larger dataset is available, the top half of conservation scores is a reasonable place to look. It is much more difficult to articulate a quantitative criterion for “too correlated.” The residues we removed here had both very high maximal mutual information values (in the top 1%) and also had a large number of significant correlations overall. It is unclear if one of these criteria is more important than the other. Again, until much more data is available, removing positions with the top 1% of mutual information scores is practical.
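
To make the Leu versus Trp comparison above concrete, the following is a minimal sketch of a relative entropy calculation under stated assumptions. It uses natural logarithms and lumps every non-consensus residue into a single "other" state, which is a simplification of the full 20-state sum. The background frequencies are approximate, assumed values for a yeast reference state (roughly 9.5% Leu and 1% Trp) and are not taken from this work.

```python
from math import log


def two_state_relative_entropy(p_top, q_top):
    """Relative entropy (nats) of a site, lumping all non-consensus residues together.

    p_top: observed frequency of the most common residue at the site
    q_top: background frequency of that residue in the reference state
    """
    if not (0.0 < p_top < 1.0 and 0.0 < q_top < 1.0):
        raise ValueError("frequencies must lie strictly between 0 and 1")
    consensus_term = p_top * log(p_top / q_top)
    other_term = (1.0 - p_top) * log((1.0 - p_top) / (1.0 - q_top))
    return consensus_term + other_term


# ASSUMED approximate background frequencies; illustrative, not from the source.
ASSUMED_BACKGROUND = {"Leu": 0.095, "Trp": 0.0104}

leu_score = two_state_relative_entropy(0.99, ASSUMED_BACKGROUND["Leu"])
trp_score = two_state_relative_entropy(0.99, ASSUMED_BACKGROUND["Trp"])
print(f"Leu at 99%: {leu_score:.2f}   Trp at 99%: {trp_score:.2f}")
```

With these assumed backgrounds the two scores land close to the 2.27 and 4.46 quoted above, which illustrates why a single upper cutoff treats common and rare amino acids differently.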

Finally, it is worth noting that while this work offers some explanation for why some consensus mutations are not stabilizing, it does not tell us why consensus mutations are stabilizing in general. Several groups have articulated the notion that adding consensus mutations to a protein generates a superposition of stabilizing interactions, only a fraction of which are necessary in any one protein to achieve sufficient stability for fitness. That necessarily implies that the effects of consensus mutations are mostly additive, which virtually must be true for fully consensus enzymes with sequences far from any natural variant to be stable and active, as they sometimes are. Here we see that the consensus mutations that are most likely to stabilize are the ones that are the most independent, which is consistent with the importance of additivity for consensus stabilization. Still, given the multitude of evolutionary pressures for fitness besides adequate stability (activity, solubility, folding rate, etc.), it is remarkable that so many consensus mutations stabilize proteins.

Figure 43: Physical Properties.

(a) The percent solvent exposure for each mutation is plotted against the T1/2. (b) The computationally predicted ΔΔG from FoldX is plotted against the T1/2 .

In summary, we have demonstrated that consensus mutations at more conserved sites were more likely to stabilize yeast TIM, and that removal of mutations at nearly invariant and highly correlated positions increased the likelihood of stabilization. These mutations could be amalgamated into a highly stable multimutant, probably in part because of their independence. The high activity of all resulting proteins suggests that application of this algorithm, even to proteins for which little is known about the structure or mechanism, is a promising way to rapidly generate stable proteins for research and applied uses. At least in the case of TIM, our method improves the likelihood of predicting stabilizing mutations from ~50% to ~90%, which is of great practical use to the protein engineer.

4.5 Acknowledgements

BJS was a National Institutes of Health Chemistry-Biology Interface Program Fellow and Ohio State Presidential Fellow. TN was an ASC Research Scholar and Dean's Research Fund recipient. We thank Nicholas Callahan and Deepamali Perera for helpful conversations and suggestions. This work was supported by The Ohio State University.

Chapter 5: In vivo Analyses of Triosephosphate Isomerase

Statistical correlations and protein fitness: A model system to interrogate these phenomena in triosephosphate isomerase

5.0 Contributions

The following research article is authored by Brandon J. Sullivan, Sidharth Mohan, Samantha Rojas, Venuka Durani and Thomas J. Magliery. Brandon J. Sullivan engineered and characterized the (DE3) lysogenized Keio strain. Brandon J. Sullivan, Sidharth Mohan, Samantha Rojas and Venuka Durani characterized growth rates for solid and liquid media. Samantha Rojas and Brandon J. Sullivan performed Michaelis-Menten kinetics on the knockdown-mutants of triosephosphate isomerase. Brandon J. Sullivan, Venuka Durani, and Thomas J. Magliery designed the libraries. The research article was primarily written by Brandon J. Sullivan and Thomas J. Magliery with help from all authors.

5.1 Abstract

Triosephosphate isomerase is the archetypical member of the (β/α)₈-barrel fold, which is the host scaffold for more than 10 % of all enzymes. Directed evolution and engineering of TIM-barrels have been long-standing goals in biochemistry, but that task has proven difficult. How has Nature developed such diverse activities from a single starting scaffold? Furthermore, how do conservation and statistical correlation of amino acids affect activity and fitness? In order to answer these questions, large libraries will need to be amassed and characterized. Here, we report the design and engineering of a TIM knockout obtained from the Keio Collection. This engineered strain fails to grow on media lacking six-carbon sugars unless complemented with an active triosephosphate isomerase. We screened variants with kcat/Km values ranging from 10² to 10⁸ M⁻¹ min⁻¹ and saw good correlation between growth rates and specific activities in solid and liquid media. The dynamic range of this in vivo assay makes it ideal for sorting large libraries of protein variants based on differential activity. Furthermore, the differential growth rates are ideal for testing the role of statistical conservation and correlations using deep sequencing. Finally, the λ(DE3) lysogen makes the strain ideal for protein expression and purification for in vitro analysis. Hexahistidine-tagged fusions were purified in high yield (>20 mg/L) free of endogenous E. coli TIM, which may otherwise copurify with engineered triosephosphate isomerase.

5.2 Introduction

The physical and chemical properties of amino acids determine the structure and activity of proteins, but our ability to read that code is still simplistic.10 The computational prediction of protein structure and stability is a promising route, but is currently limited by sampling and exceedingly large error bars.55 Native proteins are only stable by 5-15 kcal mol⁻¹, therefore requiring extreme accuracies from fairly crude energy potentials. Mutagenesis, both individually and combinatorially, is powerful for understanding the sequence-structure relationship, but is likewise limited by sampling and throughput.

Sequence statistics of multiple sequence alignments provide insights into a protein's history and physical chemistry. Sequence conservation has been used to stabilize antibodies, repeat domains, and even enzymes.117,121,135-137,259 The inclusion of statistically correlated information has improved the folded fraction of WW-domain libraries.166,167 Magliery and Regan later showed that correlations were important for charge neutralization in tetratricopeptide repeats.169 These studies have recognized that correlated occurrences of amino acids are important for native structures and activities, but the details of these mechanisms remain poorly understood.

We have applied several mathematical strategies to identify the most important pairwise and network correlations within the triosephosphate isomerase family (VD, BJS and TJM - manuscript in preparation). We wish to interrogate the interplay of statistical correlations in both physical properties and evolutionary history. Are interactions assembled from sequence statistics of MSAs phylogenetic artifacts? Are correlations critical to folding, stability, activity? Here, we describe a model system to analyze sequence correlations by quantifying fitness via population dynamics. The Bolon lab has recently reported the characterization of Hsp90 mutations using deep sequencing as a reporter of protein fitness.291 In that work, a library of single mutations competed in a single reaction vessel. Library members that provided the organism (an Hsp90-knockout) with a selective advantage grew and divided more rapidly, biasing output populations, which were quantified with deep sequencing. Similar approaches have been used to study protein binding and infer stability.292,293

We wish to assay libraries of triosephosphate isomerase where correlated pairs and networks are replaced with degenerate codons. We hypothesized that after amplification, the population dynamics should indicate preferred amino acid pairings that can be compared to the distributions seen in multiple sequence alignments (MSA). If these distributions do not match the MSA, the correlations are likely phylogenetic artifacts - although we argue that such occurrences are rare. Furthermore, the populations provide real experimental data to better understand how statistical correlations manifest physical properties within proteins.

We have chosen to study the triosephosphate isomerase family (TIM). TIM is the archetypical member of the (β/α)₈-barrel fold, which constitutes ~10 % of all enzyme scaffolds.174 The scaffold is an ancient superfold that has diverged to contain the active site residues of many catalysts. TIM, itself, is an ancient enzyme with a rich evolutionary history (i.e., TIMs from different organisms have diverged significantly in sequence). There are currently more than 1,000 TIM sequences in the Protein families database (Pfam), which makes the family well suited to statistical study. Our lab has already analyzed consensus variants of TIM, used sequence statistics to stabilize the enzyme, and reported a detailed statistical analysis of the protein (BJS et al.; VD, BJS and TJM - manuscripts in preparation).259

Triosephosphate isomerase catalyzes the fifth step in glycolysis by interconverting glyceraldehyde-3-phosphate (GAP) and dihydroxyacetone phosphate (DHAP). A TIM-knockout, DF502, has been used to study the activity of triosephosphate isomerase in vivo.180,267 DF502 fails to grow on minimal media lacking six-carbon sugars. Active variants of TIM can complement DF502 growth on minimal media supplemented with glycerol or lactate. From glycerol, the molecule is activated to glycerol-3-phosphate and oxidized to DHAP, which enters glycolysis through TIM, providing energy. From lactate, the molecule is modified to GAP, which is converted to DHAP by TIM. DHAP can be shunted into the necessary fatty acid and amino acid syntheses. Unfortunately, DF502 is difficult to work with, perhaps due to the concomitant knockout of 16 metabolic genes (rhaD, rhaA, rhaB, rhaS, rhaR, rhaT, sodA, kdgT, yiiM, cpxA, cpxR, cpxP, fieF, pfkA, sbp, and cdh).

In this paper we engineer and characterize a novel TIM-knockout from the Keio Collection.268 We demonstrate that the single gene knockout is robust and well behaved. Furthermore, we show that the DE3 lysogeny is ideal for protein expression for in vitro analysis and screening in vivo. We also analyze proof-of-principle fitness experiments, and describe libraries to be screened by deep sequencing.

5.3 Results and Discussion

TIM-knockouts. The TIM-knockout DF502 has become a staple in triosephosphate isomerase studies since it was engineered just over thirty years ago. We acquired the strain to assay the in vivo activity of consensus TIMs and aimed to use the strain for library screening. Unfortunately, in our hands DF502 growth was not consistent, perhaps due to the loss of 16 metabolic genes nearby in the chromosome. DF502 fails to grow on media lacking six-carbon sugars. The metabolic pressure resulting from these 16 deletions yields slow growth even when complemented with wild-type triosephosphate isomerases.

In search of a more robust host, we acquired the ΔtpiA knockout from the Keio Collection. Unlike DF502, the Keio Collection strains are single gene knockouts engineered by homologous recombination. The chromosomal gene of interest is replaced with the neomycin phosphotransferase II gene, which grants kanamycin resistance. Not only is the single gene knockout less perturbing to cellular homeostasis, but the antibiotic resistance provides further means for avoiding contamination. We wished to make the ΔtpiA strain tunable and compatible with high levels of expression, so we lysogenized the genome with λ(DE3) phage. After infection, clones were selected for minimal leaky expression, but maximal TIM production at 0.1 mM IPTG. The best lysogen yields 5-25 mg L⁻¹, depending on the expressed TIM variant - similar to BL21(DE3). We refer to this engineered Keio strain as ΔtpiA-Keio(DE3).

E165D mutants. Our lab recently published the design and characterization of fully consensus triosephosphate isomerases.259 We described the in vitro activities of S. cerevisiae TIM (GAP turnover: 10⁸, DHAP turnover: 10⁷ M⁻¹ min⁻¹), ccTIM (10⁸, 10⁷), cTIM (10⁴, 10²), ir-cTIM (10³, undetectable) and pHLIC (null). Plasmids harboring these genes were transformed into ΔtpiA-Keio(DE3) and grown on minimal media with either lactate or glycerol as the sole carbon source. S.c. TIM and ccTIM grew within one day, cTIM was visible in two days on lactate and four days on glycerol media, and ir-cTIM failed to complement. We wished to further test the dynamic range and sensitivity of ΔtpiA-Keio(DE3), but did not possess TIM variants with intermediate activities (kcat/Km = 10⁴-10⁶ M⁻¹ min⁻¹). We chose to mutate the catalytic glutamate of S.c. TIM and ccTIM to the shorter aspartate, as this mutation has been shown to knock down activity 500-1000 fold.9 Additionally, we mutated the same residue in a bacterial consensus TIM with native kcat/Km = 10⁷ (VD, BJS and TJM - manuscript in preparation). The mutations were constructed by overlapping PCR and subcloned into a T7 expression vector. As expected, the specific activities for each variant fell ~10³-fold with nearly all the contribution in kcat (Table 5). We now possessed TIM variants with activities distributed from the diffusion limit (10⁸) to the lower cusp of activity (low 10³).

Table 5: Kinetic values for Keio characterization.

Figure 44: Differential growth in solid media.

The solid media growth of ΔtpiA-Keio(DE3) with eight TIM variants and a null plasmid are reported on glycerol and lactate minimal media. The most active variant, S.c. TIM, is the first to grow, quickly followed by ccTIM and bacTIM. After three days the E165D variants with intermediate activities are seen, followed by cTIM at 4 days. Under these conditions of leaky expression, 0.2 % w/v carbon source, and 37 °C, both ir-cTIM and the null fail to grow. The lactate screen behaves similarly to glycerol, but all growth rates are accelerated.

Differential growth. The eight variants shown in Table 5 and a null vector were transformed into electrocompetent ΔtpiA-Keio(DE3) cells and grown on M63 minimal media plates supplemented with 0.2 % w/v lactate or glycerol (see Materials and Methods for detailed screening conditions). The growth of each transformant was monitored for five days at 37 °C under conditions of leaky expression (0 mM IPTG) (Fig. 44). After one day, robust growth was seen for the most active variant (S.c. TIM) with ccTIM following slightly slower. At 36 hours, the three top-tier proteins were all visible, and by 72 hours the E165D variants with intermediate activities began to form lawns. As the screen reached the four-day mark, cTIM colonies were visible, but ir-cTIM and the null vector still failed to grow. Qualitatively, there is a perfect correlation between the in vitro activities and growth rates in ΔtpiA-Keio(DE3) on agar supplemented with glycerol or lactate. The lactate screen performs similarly to the glycerol screen, but all growth rates are accelerated.

The same experiment was performed in liquid media with glycerol as the sole carbon source. Under the initial conditions, cultures failed to reach saturation, even for the diffusion-controlled variants, and the relatively clear M63 media turned bright yellow with a concomitant drop in pH post inoculation. We increased the buffering capacity of the M63 media from 100 mM to 150 mM and induced with 2.5 µM IPTG to increase protein expression. These modifications led to robust growth in highly active variants without compromising the dynamic range and sensitivity of the screen (Fig. 45).

Figure 45: Differential growth in liquid media.

The optical density of each variant was monitored by collecting the absorbance at 600 nm. The variants with optimal activities grow and divide more quickly in glycerol minimal media at 37 °C with 2.5 µM IPTG. The results presented here are the averages of triplicate measurements. The final optical density at 13 hours is shown in the inset with error bars.

Effects of protein expression. Previous characterization of yeast TIM versus consensus variants showed that there may be differences in expression. We typically purify 1-5 mg L⁻¹ of S.c. TIM while we consistently produce 20-50 mg L⁻¹ of cTIM and ccTIM from BL21(DE3) in rich media. To test the expression levels in ΔtpiA-Keio(DE3) we normalized the number of cells for each variant, lysed the cells and assayed the soluble fraction by SDS-PAGE (Fig. 46). The sequence identities and protein properties within this dataset are highly divergent, which may manifest as differences in protein expression levels.

In minimal media, the expression of all variants is fairly consistent (within 2-fold), with the exception of bacTIM. Several proteins within this dataset share less than 60 % sequence identity with each other, a far greater divergence than expected from libraries of a single variant. A possible drawback of in vivo screens for fitness is that stability and activity may be trumped by escape variants with high expression. ΔtpiA-Keio(DE3) appears to be robust to expression levels. There is a slight deviation in growth rates versus the activities between ccTIM E165D and S.c. TIM E165D. This may be an effect of protein concentration, but the activities and growths share overlapping error bars.

Figure 46: Protein expression in TIM-knockout.

The soluble fractions of cells grown in glycerol and lactate are shown here arranged by specific activities. Most of the variants express equally well in ΔtpiA-Keio(DE3), with the possible exceptions of S.c. TIM and bacTIM. The cells were normalized for concentration and lysed by glass beads and HEW lysozyme. Lysates were then pelleted to remove the insoluble fraction before loading on a denaturing gel. The TIM variants run in the middle of the gel, while the lysozyme added to disrupt the cell wall is seen at the bottom.

Fitness competitions. Previous studies in our lab and others have hypothesized the importance of sequence correlations in protein structure, stability and function. Several experiments have tried to elucidate the role of these statistical interactions, but much remains unknown in this field. One fundamental hurdle is choosing the right model and experiment to validate the importance of coevolving pairs. To this end we previously designed and characterized consensus variants of triosephosphate isomerase. These sequences were constructed by choosing the most common amino acid in each column of the family's multiple sequence alignment (MSA). Because this approach considers each site independently, many native-like correlations are scrambled or ablated. We hypothesized that these consensus variants would be ideal models for "host-guest" experiments where the consensus scaffold would "host" correlated "guest" amino acids.

Preliminary analyses of the strongest correlations in TIM showed that correcting scrambled correlations had essentially zero effect on thermodynamic properties, but did yield minor changes in activities. These results, coupled with differential growth patterns in ΔtpiA-Keio(DE3), suggest a powerful route for screening libraries of correlations in vivo. The relative abundance of coevolving pairs can be determined with next generation sequencing (e.g., Illumina), providing a high-throughput handle that links statistical correlations to organismal fitness.

Low-throughput and cost-effective experiments were designed to test these hypotheses. Four overnight seed cultures were inoculated with S.c. TIM, cTIM, ccTIM or pHLIC. The seed cultures were grown in rich media at 37 °C to saturation. The next morning the cells were washed twice with phosphate buffered saline and normalized by optical density at 600 nm. Competitive experiments were initiated for four pairings: ccTIM v. pHLIC, cTIM v. pHLIC, cTIM v. ccTIM and ccTIM v. S.c. TIM. All one liter M63-glycerol tests contained the same number of starting cells (10 mL of O.D.600 = 1). The competitions were set up with the following ratios of cells (shown as mL of O.D.600 = 1 culture): ccTIM-1, pHLIC-9; cTIM-1, pHLIC-9; cTIM-5, ccTIM-5; ccTIM-5, S.c. TIM-5. Every 24 hours, 10 mL of O.D.600 = 1 culture was diluted into one liter of fresh minimal media. Serial dilutions were performed up to ten 24 hour generations. An equal number of cells from each generation were extracted for plasmid isolation and identification. Purified vector, representing both species in the competition, was digested with AlwNI to linearize the vector. A second and third digestion were performed to distinguish the population for each experiment (Fig. 47).

Figure 47: Analytical digest schemes to assess fitness.

Schematic diagrams are shown for three plasmids. The AlwNI restriction endonuclease cleaves double stranded DNA at the sequence CAGNNNCTG, which is specifically found in the ColE1 origin. The enzymes PsiI, KpnI and NheI were chosen because they cleave sequences within the ccTIM gene, cTIM gene, and pHLIC stuffer fragment, respectively. These three enzymes lack recognition sequences in the orthogonal plasmids. The cleavage of plasmid DNA into two fragments is a positive result for identification.
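
The logic of this digest scheme can be checked in silico by scanning each plasmid sequence for the recognition sites. The sketch below is a minimal illustration and not the procedure used here. The AlwNI site is the one given in the caption; the PsiI, KpnI and NheI recognition sequences are the standard ones listed in enzyme references rather than values stated in this text. The function name and the short test sequence are hypothetical. All four sites are palindromic, so scanning one strand is sufficient.

```python
import re

# Recognition sequences; N stands for any base.
RECOGNITION_SITES = {
    "AlwNI": "CAGNNNCTG",
    "PsiI": "TTATAA",
    "KpnI": "GGTACC",
    "NheI": "GCTAGC",
}


def site_positions(enzyme, sequence):
    """Return 0-based start positions of an enzyme's recognition sequence."""
    pattern = RECOGNITION_SITES[enzyme].replace("N", "[ACGT]")
    # A lookahead is used so that overlapping sites are all reported.
    return [m.start() for m in re.finditer(f"(?=({pattern}))", sequence.upper())]


# Hypothetical toy sequence with one AlwNI site and one KpnI site.
toy_plasmid = "ACGTCAGATCCTGTTGACGGTACCAAT"
for enzyme in RECOGNITION_SITES:
    print(enzyme, site_positions(enzyme, toy_plasmid))
```

A plasmid is diagnostic in this scheme when it carries exactly one AlwNI site and exactly one site for its own identifying enzyme, with no sites for the other two.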

The analytical restriction digests were electrophoresed on an agarose gel to separate cleavage products. The highly active ccTIM enriched ~10 fold in early generations compared to pHLIC. By the second or third generation, ccTIM appears to be the only species, although our method of quantification, ethidium bromide staining, is not very sensitive.

Next we compared two active variants, diffusion controlled ccTIM and weakly active cTIM. The two samples were mixed in a 1:1 ratio and amplified for four generations. At 24 hours and later, the culture was dominated by ccTIM. Similar results were seen between cTIM and pHLIC at a 1:10 ratio. cTIM was the majority species after one generation. These results confirm that differences in fitness can be quantified by population fraction.
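
A back-of-the-envelope model helps relate these observations to the passaging scheme. The sketch below makes two simplifying assumptions that are not established by the data: each culture regrows to the same density before the next 10 mL into 1 L dilution, and the roughly 10-fold enrichment per 24 hour generation estimated from the gels stays constant from passage to passage. The function names are hypothetical.

```python
from math import log2


def doublings_per_passage(inoculum_ml, fresh_volume_ml):
    """Doublings needed to regrow to the starting density after dilution."""
    return log2(fresh_volume_ml / inoculum_ml)


def fraction_after_passages(start_a, start_b, fold_enrichment, passages):
    """Population fraction of variant A after a number of passages.

    Assumes the A:B ratio is multiplied by a constant factor each passage.
    """
    ratio = (start_a / start_b) * fold_enrichment ** passages
    return ratio / (1.0 + ratio)


# 10 mL of culture diluted into 1 L of fresh media, as described in the text.
print(round(doublings_per_passage(10, 1000), 2))

# ccTIM v. pHLIC started at 1:9, with an assumed constant 10-fold enrichment.
for n in range(4):
    print(n, round(fraction_after_passages(1, 9, 10, n), 3))
```

Under these assumptions the fitter variant rises from 10% of the population to a slight majority after one passage and to more than 90% after two, which is in line with ccTIM appearing to be the only species by the second or third generation.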

Figure 48: Analytical digests to determine populations.

Top. ccTIM and pHLIC are inoculated in the same minimal media at a 1:10 ratio. After 24 hours 10 mL of saturated culture is extracted to inoculate 1 L of fresh minimal media. ccTIM is enriched approximately 10-fold in each generation and is the majority species by the second 24 hour generation. Middle. The same experiment is performed with cTIM and ccTIM at a 1:1 starting ratio. Bottom. cTIM and pHLIC with starting ratio 1:10. The second lane in each gel is a digestion with AlwNI only. Lanes marked with "ccTIM" are digested with AlwNI and PsiI. Lanes marked with "cTIM" are digested with AlwNI and KpnI. Lanes marked with "pHLIC" are digested with AlwNI and NheI.

5.4 Future Directions

The populations from the one-on-one fitness competitions were determined by low-throughput analytical restriction digests. These methods allowed us to estimate the relative distributions between two variants and can possibly be scaled to a handful of dissimilar variants (i.e. proteins with diverse sequences and restriction endonuclease sites). We wish to use these competition assays to quantify the fitness of correlations in libraries. To accomplish this, one requires tools to analyze hundreds to thousands of members and collect data that is statistically significant. Next generation sequencing, like Illumina, is an ideal solution for these goals. First, Illumina uses spatial separation on microchips that allows parallel DNA sequencing of mixed populations. This means large libraries can be assayed in single experiments, yet quantified in parallel, without any up-front separation.

We wish to validate our screening procedures with a final medium-throughput test. In a previous experiment our lab characterized two compensatory mutations in S. cerevisiae triosephosphate isomerase. The first pair is folded, stable and active with either F11/I20 (native) or W11/A20 (compensatory) (Fig 62). The F11W and I20A mutations are destabilizing on their own (ΔT1/2 = -3.7 °C for F11W; I20A does not express). Nearly every organism hosts the W11/A20 pair, but yeast TIM is a rare exception. A library that interrogates these positions with NNK codons provides a theoretical diversity of 400 protein sequences, but we expect only a small fraction of sequences will be identified after 3-4 amplified generations. Position 11 is highly intolerant to mutation based on sequence statistics, and position 20 is tightly correlated to many other positions. It is possible that by the fourth generation only wild-type and the stable compensatory pair will be selected. This means a single sequenced plate of 48 or 96 variants will provide statistical significance. A second hub of correlated residues was identified at positions 90, 122, and 123 (Fig. 49). A W90Y mutation was destabilizing and two-fold less active. We were not able to predict a compensatory mutation through sequence statistics. We attempted G122T and G122T/V123P mutations at these sites, but position 122 is the site of a known human mutation that leads to loss in stability.284 A library that interrogates these positions with NNK codons provides a theoretical diversity of 8000 protein sequences. It will be interesting to see if rigorous amplification to select only wild-type-like fitness reveals any amino acid solutions beyond the native sequence.
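
The theoretical diversities quoted above follow from the composition of the NNK codon (N = any base, K = G or T). The sketch below enumerates the NNK codons against the standard genetic code and counts the encoded amino acids; the variable names are hypothetical and the code is only an illustrative check of the arithmetic.

```python
from itertools import product

# Standard genetic code, with codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNK: any base at positions 1 and 2, G or T at position 3.
nnk_codons = [a + b + k for a, b, k in product("ACGT", "ACGT", "GT")]
encoded = {CODON_TABLE[codon] for codon in nnk_codons}
amino_acids = encoded - {"*"}
stop_codons = [codon for codon in nnk_codons if CODON_TABLE[codon] == "*"]

print(len(nnk_codons), "codons,", len(amino_acids), "amino acids, stops:", stop_codons)
for sites in (2, 3):
    dna_diversity = len(nnk_codons) ** sites
    protein_diversity = len(amino_acids) ** sites
    print(sites, "sites:", dna_diversity, "DNA sequences,", protein_diversity, "proteins")
```

Because several amino acids are encoded by more than one NNK codon and one codon is a stop, the DNA-level diversity exceeds the protein-level diversity, which matters when estimating how many clones must be sequenced to cover a library.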

Figure 49: Library sites in triosephosphate isomerase.

The two libraries are shown within the S. cerevisiae TIM crystal structure in green. The 11/20 library will test the fitness of 400 library members, while the 90/122/123 library will assay 8000 variants.

Our lab is currently constructing these two libraries via overlapping PCR with mutagenic primers. The libraries will be screened for fitness as described above in liquid M63 media supplemented with glycerol. The members will be amplified for 3-4 generations and 48 or 96 variants will be colony sequenced. If these experiments perform as expected, our lab will be ready to clone larger libraries that interrogate networks of interaction and begin optimizing procedures for deep sequencing.

5.5 Outlook

Genetic screens are powerful tools for analyzing libraries, especially if the knockout is lethal under selective conditions. These screens are well used in directed evolution experiments. The drawback is that individual variants need to be isolated and sequenced, which is usually done on agarose media at the expense of throughput and budget. Deep sequencing enables novel experiments where selections are performed in liquid media and sequenced en masse. These experiments provide the same useful data as current directed evolution and screening procedures with new layers of deeper information. First, thousands to millions of cells can be sequenced, thus expanding our data analysis. Second, in the competition experiments the redundant sequencing provides statistical significance and population profiles which directly correlate to organism fitness. This methodology provides a more-or-less direct link between protein sequence modification and fitness, in extremely high-throughput.

One necessity to employ this technology is a robust host. Bacteria and yeast are ideal for

their short generation times and transformation efficiencies. DF502, a TIM-knockout

bacterial strain, has been used for directed evolution experiments and computational design

validation.294 Unfortunately, this strain is exceedingly hard to maintain and, in our own

hands, was inconsistent. DF502 was originally used as a knockout for phosphofructokinase, but gained favor as a triosephosphate isomerase knockout. Both

enzymes are deleted, as are 15 other metabolic genes, mostly those for rhamnose


metabolism. We sought a single-gene knockout that would provide robust growth in rich

media and when complemented with wild-type activities. The recently reported Keio

Collection fit this description after λ(DE3) infection.

Here, we illustrate the use of ΔtpiA-Keio(DE3) for several biochemical experiments.

First, we show that the knockout can be complemented by exogenous TIMs with activities ranging from 10^3 to the diffusion limit (10^8). These tests can validate the

observance of in vitro activity or select active members from libraries. Second, ΔtpiA-

Keio(DE3) supports T7 expression allowing researchers to purify enzymes of interest

from a TIM-free host. Because these purifications lack endogenous E. coli TIM, there is

greater confidence in enzymes with low levels of activity. Third, the differential growth

rates in both liquid and minimal media provide a platform for assaying fitness by

populations - which can be ascertained in high-throughput.

These experiments enable the interrogation of hypotheses generated from sequence

statistics. Our lab and others have begun reporting the importance of correlated

occurrences of amino acids, but correctly designed experiments to test these observed

pairs and networks are not trivial. Additionally, correlated residues may be evolutionary

artifacts. Finally, wild-type sequences are optimized by evolution and the context of

amino acids may be individual. Therefore ablating, swapping, or rebuilding correlated

networks within this framework may be further complicated by individuality.


Instead of choosing a specific amino acid at each site of the correlation and assaying the effects, we propose to create libraries at each position and compare the output sequences to those generated by Nature. It will be interesting to see if the output sequences recapitulate those seen in Nature. The libraries will be made in wild-type and consensus variants. In the wild-type scaffolds we will be interested in seeing if the wild-type network is the most fit, or if other orthogonal networks will come to light. In the consensus architectures, we assume that many native-like correlations are ablated since each site is considered independently. Will the libraries reveal the amino acid identities predicted by the MSA? These studies will shed further light on the role of correlations within proteins.

We have studied triosephosphate isomerase because it is a well-studied enzyme with statistical significance. The Keio Collection contained ~2,000 single-gene knockouts and the majority of these proteins can be studied in an analogous manner. This includes membrane proteins, which are otherwise difficult to work with, since the only components are in vivo screening and

DNA sequencing. Further collections of datasets analogous to those obtained for TIM will provide deeper insights into the Protein Folding Problem and its implications in evolution.

5.6 Acknowledgements

B.J.S. was a National Institutes of Health Chemistry-Biology Interface Program Fellow and Ohio State Presidential Fellow. This work was supported by The Ohio State

University.


Chapter 6: Materials and Methods

6.1 Sequence Statistics

Sanger DNA sequencing changed dramatically when the radiolabeled methods were

replaced with Sanger Big-Dye dideoxynucleotide terminators. This coupled with capillary electrophoresis exponentially increased the throughput of DNA sequencing.

These advancements allowed researchers to characterize the genomes of many organisms and assemble those sequences into comprehensive databases assembled as either gene or protein sequences. The work presented here statistically analyzes the Pfam database.

The Protein Families database contains nearly 10 million sequences organized by families ranging from a few dozen to several thousand sequences. Each family is aligned using a hidden Markov model built from a seed alignment. The database is user-friendly and available for download in many formats.

6.1.1 Databases and Curation

The original multiple sequence alignment of triosephosphate isomerase was obtained from Pfam version 18.04 in October 2005. The raw database contains 639 triosephosphate isomerase sequences. The output alignment contains 373 positions, which is significantly longer than native-like TIMs which are on average 235 amino acid


residues. Rare insertions, mostly in loop structures, are ancestral artifacts that expand the

multiple sequence alignment. These columns within the alignment have lower occupancy

than canonical positions. A percent occupancy of 45 % or greater was chosen to yield a

polypeptide length on par with native TIMs and the canonical sequence of 240 amino

acid residues. That is to say, any column in the multiple sequence alignment that did not

contain an amino acid in at least 45 % of its sequences was removed from the database.
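The occupancy filter described here can be sketched in a few lines. This is a minimal illustration, assuming the alignment is held as a list of equal-length strings and that "-" and "." mark gaps; the function name and the toy alignment are hypothetical.

```python
def filter_by_occupancy(alignment, min_occupancy=0.45, gap_chars="-."):
    """Drop alignment columns whose fraction of non-gap residues is below min_occupancy.

    alignment: list of equal-length aligned sequences (strings).
    Returns the alignment restricted to the retained columns.
    """
    n_seqs = len(alignment)
    n_cols = len(alignment[0])
    keep = []
    for col in range(n_cols):
        occupied = sum(1 for seq in alignment if seq[col] not in gap_chars)
        if occupied / n_seqs >= min_occupancy:
            keep.append(col)
    return ["".join(seq[c] for c in keep) for seq in alignment]


if __name__ == "__main__":
    toy = ["AC-D", "A--D", "ACGD", "A--D"]
    # Column occupancies are 1.00, 0.50, 0.25 and 1.00; the third column is removed.
    print(filter_by_occupancy(toy, min_occupancy=0.45))
```

The threshold is a tuning parameter: it is raised or lowered until the number of retained columns matches the length of native TIMs.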

In February 2008 the then current multiple sequence alignment of triosephosphate isomerase was downloaded from Pfam version 22.0. The updated database contains 1239

TIM sequences. Preliminary correlation analysis of the initial database revealed that sequence repeats and fragments were biasing the mutual information calculations; therefore, these artifacts were curated from the new database. Within the

1239 sequences, 584 bin to 230-240 aa. All sequences with fewer than 205 residues were

removed yielding a database with 888 sequences. Next, all exact duplicate sequences

were removed. The final curated database contains 781 nonredundant full-length sequences. In order to produce a multiple sequence alignment analogous to the initial

18.04 database, an occupancy cut off of 70 % was chosen to yield 240 aligned canonical positions.
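The two curation steps (removal of fragments, then removal of exact duplicates) can be sketched as follows. This is an illustration under stated assumptions: sequences are (name, aligned string) pairs, length is counted on ungapped residues, and duplicates are detected on the ungapped sequence. The function name is hypothetical.

```python
def curate(sequences, min_length=205, gap_chars="-."):
    """Remove fragments and exact duplicates from an aligned sequence set.

    sequences: iterable of (name, aligned_sequence) pairs.
    Sequences with fewer than min_length residues are dropped first;
    of any set of identical sequences, only the first encountered is kept.
    """
    seen = set()
    curated = []
    for name, aligned in sequences:
        residues = "".join(ch for ch in aligned if ch not in gap_chars).upper()
        if len(residues) < min_length:
            continue  # fragment
        if residues in seen:
            continue  # exact duplicate of an earlier entry
        seen.add(residues)
        curated.append((name, aligned))
    return curated


if __name__ == "__main__":
    toy = [("a", "AC-DE"), ("b", "ACDE-"), ("c", "AC---"), ("d", "AC-DE")]
    # With a toy cutoff of 3 residues: "c" is a fragment, "b" and "d" duplicate "a".
    print(curate(toy, min_length=3))
```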

6.1.2 Conservation

The consensus proteins were created by choosing the most common amino acid at each

position regardless of distribution. For example, if alanine occurs in 95 % of sequences at


position x, alanine is chosen as the consensus amino acid. Similarly, if alanine populates position y in only 20 % of the sequences, but no other residue is present more often, alanine is chosen as the consensus amino acid. Unlike the global mean propensity mutants of the TPR domain, the consensus amino acids are not adjusted for codon usage and bias.
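The consensus rule (take the most common residue in each column, however weakly it is preferred) is simple to express in code. The sketch below assumes an alignment stored as equal-length strings and ignores gap characters when counting; ties are broken by first appearance in the column, which is an arbitrary choice made here for illustration.

```python
from collections import Counter


def consensus(alignment, gap_chars="-."):
    """Return the most common residue at each aligned position.

    alignment: list of equal-length aligned sequences (strings).
    Columns containing only gaps yield "-".
    """
    result = []
    for column in zip(*alignment):
        counts = Counter(aa for aa in column if aa not in gap_chars)
        result.append(counts.most_common(1)[0][0] if counts else "-")
    return "".join(result)


if __name__ == "__main__":
    toy = ["AKV", "AKL", "ARL", "GKI", "SEL"]
    print(consensus(toy))
```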

To quantify the extent of conservation, the relative entropy was calculated for all 240 aligned positions in triosephosphate isomerase. The relative entropy measures the deviation between populations, in this case the actual distribution versus an expected distribution. The expected, or reference, distribution for these studies is the S. cerevisiae codon usage in open reading frames. This organism was chosen because it is a close approximation to all eukaryotic open reading frames, but will not change as more sequences are deposited from genomic efforts. If all amino acids were used equally the expected distribution would be 100 % / 20 = 5 % for each amino acid. The actual distribution of S. cerevisiae is: A-6, C-1, D-6, E-7, F-4, G-5, H-2, I-7, K-7, L-10, M-2, N-6, P-4, Q-4, R-4, S-9, T-6, V-6, W-1, Y-3 %. The relative entropy is calculated as:

\[ \text{relative entropy} = \sum_{x} p_x \ln \frac{p_x}{f_x} \]

Here, px is the actual amino acid distribution and fx are the expected distributions derived

from yeast codon usage. The theoretical values range from 0 (the actual distribution of

amino acids is identical to the reference state) to a maximal value of 2.3-4.6. The upper

limit depends not only on the distribution, but also on which amino acid is preferred. A

leucine that is 100 % conserved at position x yields a lower relative entropy than a completely conserved

cysteine, since the latter is used only 10 % as often in yeast ORFs. The calculated relative

entropies for the TIM database range from 0.18 to 4.31.
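As a check on the quoted bounds, the relative entropy of a single alignment column can be computed directly against the yeast reference distribution listed in this section. The sketch below assumes natural logarithms (which reproduce the 2.3 to 4.6 upper limits) and ignores gap characters; the function name is illustrative.

```python
import math
from collections import Counter

# S. cerevisiae reference frequencies from the text, converted from percent.
YEAST_BACKGROUND = {
    "A": 0.06, "C": 0.01, "D": 0.06, "E": 0.07, "F": 0.04,
    "G": 0.05, "H": 0.02, "I": 0.07, "K": 0.07, "L": 0.10,
    "M": 0.02, "N": 0.06, "P": 0.04, "Q": 0.04, "R": 0.04,
    "S": 0.09, "T": 0.06, "V": 0.06, "W": 0.01, "Y": 0.03,
}


def relative_entropy(column, background=YEAST_BACKGROUND, gap_chars="-."):
    """Sum over residues of p * ln(p / f) for one alignment column.

    column: string of residues observed at one aligned position.
    Gaps are skipped (an assumption made for this sketch).
    """
    residues = [aa for aa in column if aa not in gap_chars]
    total = len(residues)
    counts = Counter(residues)
    return sum(
        (n / total) * math.log((n / total) / background[aa])
        for aa, n in counts.items()
    )


if __name__ == "__main__":
    print(round(relative_entropy("L" * 100), 2))  # fully conserved leucine
    print(round(relative_entropy("C" * 100), 2))  # fully conserved cysteine
```

A fully conserved leucine gives ln(1/0.10) and a fully conserved cysteine gives ln(1/0.01), which is the source of the residue-dependent upper limit.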

6.1.3 Correlation

The statistical interactions between positions are calculated as mutual information. Mutual

information is essentially the relative entropy of the joint distribution and is a polynomial function. Relative entropies calculate the difference in distributions between actual and expected distributions. For conservation one compares populations of amino acids within a column versus a neutral reference state. In mutual information one calculates the probability of two events concurring at random versus the actual rate of concurrence. By

example, if alanine occurs at position x in 70 % of TIM sequences and valine occurs at position y in 50 % of TIM sequences, one would expect 35 % (70 % x 50 %) of sequences in the multiple sequence alignment to have the alanine-valine pair at random.

If the percentage of sequences that contain the pair is higher or lower than 35 %, the residues are correlated or anticorrelated, respectively. The mutual information is calculated as:

\[ \text{MI} = \sum_{x,y} p_{x,y} \ln \frac{p_{x,y}}{p_x \, p_y} \]

In this calculation px is the frequency of an amino acid at position x, while py is the frequency of an amino acid at position y. The product of these two terms is the expected distribution at random. The term px,y is the actual percentage of sequences with the x

and y pairing. The values for mutual information increase with increasing correlations.
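The mutual information between two alignment columns can be sketched as below, comparing each observed pair frequency with the product of the single-site frequencies. This is a minimal illustration that assumes natural logarithms and treats every character, including gaps, as a symbol; the function name is hypothetical.

```python
import math
from collections import Counter


def mutual_information(col_x, col_y):
    """Mutual information between two equal-length alignment columns.

    Sums p_xy * ln(p_xy / (p_x * p_y)) over all observed residue pairs.
    """
    n = len(col_x)
    count_x = Counter(col_x)
    count_y = Counter(col_y)
    count_xy = Counter(zip(col_x, col_y))
    mi = 0.0
    for (a, b), n_ab in count_xy.items():
        p_ab = n_ab / n
        expected = (count_x[a] / n) * (count_y[b] / n)  # pairing at random
        mi += p_ab * math.log(p_ab / expected)
    return mi


if __name__ == "__main__":
    print(mutual_information("AAGG", "VLVL"))  # independent columns
    print(mutual_information("AAGG", "VVLL"))  # perfectly coupled columns
```

Independent columns give zero, and two perfectly coupled two-state columns give ln 2, so larger values indicate stronger statistical coupling.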


6.2 Chapter 3 Methods

6.2.1 Sequences and Cloning

The consensus TIM genes (cTIM, ir-cTIM and ccTIM) were constructed from synthetic

oligonucleotides purchased from Sigma Genosys or Integrated DNA Technologies (see

below). The oligonucleotides ranged from ~50-100 nucleotides and were designed with

the following thermodynamic parameters: Loop TMs < 50 °C, overlap primer-to-target

TMs > 70 °C and the 5' and 3' ends of each primer culminate in a G or C. The genes were

assembled from 10-20 synthetic primers where odd numbered primers are sense-coding,

even numbered primers are anti-sense coding - also known as reverse complements. The

overlapping primers were assembled into full length genes using reassembly PCR - a

technique similar to the reassembly step in DNA shuffling. Reassembly Polymerase Chain

Reactions consisted of 10 to 25 cycles of [95 °C for 30 seconds, annealing temperatures

ranged from 50 °C to 72 °C for 30 seconds, with 72 °C extension for 45 seconds]. A

final 72 °C step of 4 minutes followed to ensure complete elongation. The wild-type S.

cerevisiae TIM gene was cloned by PCR of lysed YPH 499 cells. The wild-type E. coli

TIM gene was cloned by PCR of lysed DH10B cells.

All PCR reactions are performed with either Taq (M0267), Deep Vent (M0258S), Pfu or

Phusion (M0530S) polymerase. Polymerases are purchased from New England Biolabs except Pfu, which is purified recombinantly within the Magliery Lab. Reactions with Taq are performed with 94 °C melting steps. Reactions are 100 µL with 1 µM primers, template masses ranging from ~ 0.06-60 ng, and 250 µM dNTPs.


The amino acid sequence for each construct is listed here. The TEV-cleavable hexahistidine tag is underlined. The oligonucleotides for gene synthesis are shown below each protein sequence in italics. cTIM MAHHHHHHGGENLYFQARTPFVGGNWKMNGTKAEAKELVEALKAKLPDDVEVVVAPPAVYLDTAREALKGS KIKVAAQNCYKEAKGAFTGEISPEMLKDLGADYVILGHSERRHYFGETDELVAKKVAHALEHGLKVIACIG ETLEEREAGKTEEVVFRQTKALLAGLGDEWKNVVIAYEPVWAIGTGKTATPEQAQEVHAFIRKWLAENVSA EVAESVRILYGGSVKPANAKELAAQPDIDGFLVGGASLKPEFLDIINSRN*

1.ATGGCTCGTACGCCGTTTGTTGGTGGGAATTGGAAAATGAATGGTACCAAGGCCGAGGCTAAAGAACTG GTGGAAGCGCTTAAAGCCAAAC 2.CTTTGAGGGCTTCGCGTGCTGTATCCAGATACACAGCCGGAGGGGCTACCACTACCTCAACATCGTCTG GCAGTTTGGCTTTAAGCGCTTC

3.CACGCGAAGCCCTCAAAGGTAGCAAAATTAAGGTTGCGGCTCAGAACTGCTATAAAGAGGCAAAAGGTG CGTTCACCGGTGAAATCTCTCC

4.CCCCAAAGTAATGACGACGCTCAGAGTGGCCGAGGATGACATAATCAGCGCCCAAATCTTTCAGCATTT CAGGAGAGATTTCACCGGTG

5.GTCGTCATTACTTTGGGGAAACTGATGAATTAGTAGCGAAGAAAGTCGCACACGCTTTAGAACATGGCT TAAAGGTAATTGCATGCATAGG

6.GGCAAGCAGCGCTTTCGTCTGACGAAAAACCACTTCCTCCGTTTTGCCCGCTTCGCGCTCTTCCAGGGT TTCGCCTATGCATGCAATTACC

7.CGAAAGCGCTGCTTGCCGGCTTGGGGGACGAATGGAAAAACGTCGTGATCGCCTACGAGCCTGTGTGGG CGATTGGCACAGGAAAGACTGC

8.CCACTTCTGCACTCACATTTTCAGCCAGCCATTTACGAATGAAAGCATGAACTTCTTGCGCCTGCTCGG GGGTCGCAGTCTTTCCTGTGCC

9.GAAAATGTGAGTGCAGAAGTGGCGGAAAGCGTTCGCATTCTTTATGGAGGTTCCGTCAAACCGGCAAAT GCAAAAGAACTGGCCGCACAACCG

10.GTTCCGCGAGTTTATGATATCCAGAAACTCCGGCTTCAGTGATGCTCCGCCGACCAAGAAGCCGTCAA TATCCGGTTGTGCGGCCAGTTC

ir-cTIM MGHHHHHHGHGGENLYFQARTPFVGGNFKLNGSKAEAKELVEALKAKLPDDVEVVVAPPATYLDYAREALK GSKIKVAAQNCYKKASGAFTGENSPEQIKDVGADYVILGHSERRHYFGETDEFVAKKVAHALEHGLKVIAC IGETLEEREAGKTEEVVFRQTKALLAGLGDEWKNVVIAYEPVWAIGTGKTATPEQAQEVHAFIRKWLAENV SAEVAESVRILYGGSVKPANAKELAAQPDIDGFLVGGASLKPEFLDIINSRN*

1.ATGGCTCGTACTCCGTTTGTTGGTGGGAATTTTAAACTGAATGGTAGCAAGGCCGAGGCTAAAGAACTG GTGGAAGCGCTTAAAGCCAAAC

2.CTTTGAGGGCTTCGCGTGCATAATCCAGATAGGTAGCCGGTGGTGCAACTACTACCTCAACATCGTCTG GCAGTTTGGCTTTAAGCGCTTC

3.CACGCGAAGCCCTCAAAGGTAGCAAAATTAAGGTAGCGGCTCAGAACTGTTATAAAAAAGCATCTGGAG CGTTCACCGGTGAAAACTCTCC

4.CCCCAAAGTAATGACGACGCTCAGAGTGGCCGAGGATGACATAATCAGCGCCCACATCTTTAATCTGTT CAGGAGAGTTTTCACCGGTG

5.GTCGTCATTACTTTGGGGAAACTGATGAATTTGTAGCAAAGAAAGTCGCACACGCTTTAGAACATGGCT TAAAGGTAATTGCATGCATAGG

6-10. The same oligonucleotides as cTIM above.

ccTIM MGHHHHHHGGENLYFQGSSGARTPLVAGNWKMNGTLAEAKELVEALAAKLPDDVEVVVAPPFTYLDQVREL LKGSKIAVGAQNCYKEDSGAFTGEISPAMLKDLGASYVILGHSERRQYFGETDELVAKKVAAALAAGLTPI LCVGETLEEREAGKTEEVVARQLKAVLAGLSDEWSNVVIAYEPVWAIGTGKTATPEQAQEVHAFIRKWLAE LSAEVAEKVRILYGGSVKPANAAELAAQPDIDGALVGGASLKAEDFLAIINSRN*

1.ATGGCTCGAACGCCGTTAGTGGCTGGAAATTGGAAAATGAATGGTACTCTTGCCGAAGCGAAGGAGCTC GTCGAAGCCTTAGCAGCGAAGTTACCCGATG

2.CACGTACCTGGTCCAGATATGTAAAGGGAGGGGCTACCACAACTTCCACGTCATCGGGTAACTTCGCTG CTAAGGCTTCGACGAGCTCCTTCGCTTCGGC

3.CGTGGAAGTTGTGGTAGCCCCTCCCTTTACATATCTGGACCAGGTACGTGAGCTGCTTAAGGGTTCTAA GATTGCTGTAGGCGCTCAAAACTGCTACAAAG

4.GTCCTTTAGCATGGCTGGACTGATCTCACCCGTGAACGCTCCACTATCTTCTTTGTAGCAGTTTTGAGC GCCTACAGCAATCTTAGAACCCTTAAGCAGC

5.GAAGATAGTGGAGCGTTCACGGGTGAGATCAGTCCAGCCATGCTAAAGGACCTCGGAGCGTCGTACGTA ATACTAGGCCATAGCGAAAGGCGGCAATAC

6.GCCAAGGCTGCCGCGACCTTTTTAGCCACGAGCTCATCGGTTTCTCCAAAGTATTGCCGCCTTTCGCTA TGGCCTAGTATTACGTACGACGCTCCGAGG

7.CTTTGGAGAAACCGATGAGCTCGTGGCTAAAAAGGTCGCGGCAGCCTTGGCAGCTGGATTGACCCCAAT CTTATGTGTCGGCGAAACTTTGGAAGAGAGAG

8.GAGAACAGCCTTTAGTTGTCGCGCAACTACTTCCTCTGTTTTCCCAGCTTCTCTCTCTTCCAAAGTTTC GCCGACACATAAGATTGGGGTCAATCCAGC

9.GCTGGGAAAACAGAGGAAGTAGTTGCGCGACAACTAAAGGCTGTTCTCGCAGGTCTATCAGATGAATGG TCGAATGTTGTGATCGCATATGAGCCTGTC

10.CCTCTTGTGCCTGCTCTGGTGTTGCAGTTTTGCCCGTCCCAATCGCCCAGACAGGCTCATATGCGATC ACAACATTCGACCATTCATCTGATAGACCTG

11.CTGGGCGATTGGGACGGGCAAAACTGCAACACCAGAGCAGGCACAAGAGGTACACGCCTTTATAAGGA AATGGTTAGCGGAACTTTCAGCTGAGGTGGCG

12.CAGCGGCGTTGGCAGGCTTAACTGACCCACCATACAATATTCTGACTTTTTCCGCCACCTCAGCTGAA AGTTCCGCTAACCATTTCCTTATAAAGGCGTG

13.GTCAGAATATTGTATGGTGGGTCAGTTAAGCCTGCCAACGCCGCTGAGTTAGCTGCTCAACCGGACAT TGATGGAGCACTTGTTGGTGGGGCATC

14.GTTGCGACTATTAATTATTGCCAGGAAGTCTTCTGCTTTCAAAGATGCCCCACCAACAAGTGCTCCAT CAATGTCCGGTTGAGCAGCTAACTC

S. cerevisiae TIM MAHHHHHHGGENLYFQGSSGARTFFVGGNFKLNGSKQSIKEIVERLNTASIPENVEVVICPPATYLDYSVS LVKKPQVTVGAQNAYLKASGAFTGENSVDQIKDVGAKWVILGHSERRSYFHEDDKFIADKTKFALGQGVGV ILCIGETLEEKKAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLAATPEDAQDIHASIRKFLAS KLGDKAASELRILYGGSANGSNAVTFKDKADVDGFLVGGASLKPEFVDIINSRN*

Reactions are assessed by agarose gel electrophoresis. Successful PCRs are purified by

Phenol Chloroform extraction. Here an equal volume of PCR and phenol chloroform are vortexed for 2 minutes and centrifuged. The organic layer is discarded before addition of equal volume chloroform isoamyl alcohol. The mixture is again vortexed for 2 minutes and centrifuged. The organic layer is discarded. A 1/10 volume of 3 M sodium acetate pH 5 and 2.5 volumes of 100 % ethanol are added to the aqueous layer before freezing at

-80 °C. Next, the samples are spun at 4 °C for 30 minutes to pellet the DNA. The ethanol is decanted and the pellet is washed with 70 % ethanol to remove salts and centrifuged at 4 °C for an additional 30 minutes. The 70 % ethanol is decanted and the pellet is air dried. The purified DNA may be resuspended in water or any desired buffer.

DNA genes are prepared for ligation by restriction endonuclease digest. All enzymes are purchased from New England Biolabs. Purified DNA is resuspended in 30 µL water. 2

µL of each restriction enzyme, 1 µL BSA and 4 µL NEB buffer are added to constitute a


40 µL reaction. The digestions generally occur at 37 °C for 3-4 hours. Reactions are

quenched by equal volume addition of DNA loading buffer.

The desired product from restriction digests is purified by agarose gel electrophoresis.

The desired band is excised from the gel and purified with Qiagen's Qiaquick gel extraction kit (28704). Samples are eluted from the column with 50 µL of an 8-fold diluted EB buffer. Samples destined for ligation were lyophilized and resuspended in 10

µL water.

All DNA plasmids are produced in E. coli DH10B grown in 2xYT media at 37 °C.

Plasmids are purified with Qiagen's QIAprep spin miniprep kit (28704). Samples are

eluted from the column in 60 µL of an 8-fold diluted EB buffer. Similar restriction digest

conditions are followed for the plasmid as were described for the insert. Generally 30 µL

of miniprep are added to each reaction, digested, purified and eluted in 50 µL 8-fold

diluted EB. 1 µL of purified digested plasmid is transformed into DH10B and plated on

selective media. Appearance of zero to several colonies is taken as an indication that the

plasmid was completely digested. If >10 colonies appeared from a 1 mL quench and 100

µL plating, the DNA is redigested. Plasmid passing the quality control procedure is

lyophilized dry and resuspended in 10 µL water.

Ligations are done in 5 µL reactions to conserve resources (digested plasmid). These

reactions contain 0.5 µL T4 DNA Ligase (M0202S - NEB), 0.5 µL supplied buffer, 0.5


µL digested plasmid (irrespective of concentration) and 3.5 µL digested insert (irrespective of

concentration). The ligations are placed at 16 °C overnight. 1 µL of ligated material is transformed into 40 µL of electrocompetent DH10B via electroporation. These transformations are recovered in 1 mL 2xYT at 37 °C before plating on selective media.

Colonies are grown to saturation in liquid media, miniprepped and submitted for DNA sequencing to confirm successful cloning.

6.2.2 Expression

Triosephosphate isomerase proteins are expressed recombinantly in E. coli. Proteins purified for biophysical characterization are expressed in BL21(DE3), but proteins for kinetic analyses are expressed in the TIM-deficient KEIO(DE3). Seed cultures are grown overnight at 37 °C in 2xYT. The saturated seed cultures are diluted 40-fold into 2xYT and grown at 37 °C. Generally, TIM variants are produced from 500-1000 mL. The cultures are induced with 0.1 mM IPTG at O.D.600 = ~0.75. Induced cultures

continue to grow at 37 °C for 3-4 hours or are moved to 30 °C for 6-8 hours. Cells are

harvested by centrifugation and stored at -80 °C till purification.

6.2.3 Purification

Cell pellets from 1 L cultures are resuspended to 30 mL with Lysis Buffer (50 mM

Tris·HCl pH 8, 300 mM NaCl, 10 mM imidazole, 2 mM β-mercaptoethanol, 1 mM

TCEP). 5 mM MgCl2, 0.5 mM CaCl2, 2 U/mL DNase (Pierce), 200 ng/mL RNase

(Fisher), 0.25 mg/mL lysozyme (Fisher), and 0.1 % Triton X-100 are added to the


suspensions for 30 minutes at 4 ºC. Post incubation, the cell pastes are either sonicated

(if to be used for biophysical characterization) or disrupted by glass beads (if to be used

for kinetic analysis). Centrifugation is performed to separate the soluble and insoluble

fractions. The soluble fraction is bound to 1.0-1.5 mL 50% Ni-NTA agarose (30210 -

Qiagen). After one hour the sample is poured into a pre-fritted column and washed with

12 mL (50 mM Tris·HCl pH 8, 300 mM NaCl, 20 mM imidazole, 2 mM β-

mercaptoethanol, 1 mM TCEP) before elution with 2 mL (50 mM Tris·HCl pH 8, 300

mM NaCl, 250 mM imidazole, 2 mM β-mercaptoethanol, 1 mM TCEP). The 6×His-

TEV-TIM fusion is digested overnight by 6×His-TEV protease (40 µg) with 5 mM DTT.

After quantitative cleavage the sample is moved to 100 mM potassium phosphate pH 7.5,

300 mM NaCl, 2 mM β-mercaptoethanol, 1 mM TCEP by size exclusion chromatography (17-0851-01 - GE). The 6×His tag and 6×His-TEV protease are

removed by a second Ni-NTA step. The purified sample is concentrated and stored at 4

°C in 50 mM phosphate buffer, 300 mM NaCl, 2 mM DTT, 2 mM TCEP pH 8.

6.2.4 Purity and Yield

The purity and yield of TIM variants are assayed by SDS-PAGE and UV-Vis

spectrometry. Samples are diluted 1:1 in SDS loading buffer, heated at 95 °C for 10

minutes and then spun down. The supernatant is loaded on a 12.5 % SDS PAGE gel

with known concentrations of bovine serum albumin. Concentration of TIM variants is

also determined by Absorbance at 280 nm and Beer's Law. The extinction coefficient of

all variants is calculated from the primary sequence using Scripps Protein Calculator v3.3


excluding disulfide bonds. In order to verify absorbance at 280 nm, cTIM and S.c. TIM

were submitted for amino acid analysis. Gel, A280 and AAA are all consistent. For future

determinations, SDS-PAGE and A280 were used to determine concentration and purity.

6.2.5 Circular Dichroism

Data were collected on a Jasco J-815 Circular Dichroism spectrometer. TIM samples

were diluted to 12 µM in 50 mM potassium phosphate, 300 mM NaCl pH 8. Far-UV

spectra were recorded in triplicate from 195 to 250 nm. Wavelength scans recorded

ellipticity every nanometer with 2 second data integration time (DIT) and 100 nm min-1 scanning speed. Data points reporting HT voltages greater than 600 V were discarded.

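For thermal denaturation, ellipticity was monitored at 222 nm with temperatures increasing from 25 to 95 °C. Data were collected in 1 °C steps with 6 second temperature equilibration, 1 °C min-1 ramping and 2 second data integration. The T1/2s were calculated by fitting the thermal denaturation data to the equation described by Koepf.

\[ y(T) = \frac{(y_n + m_n T) + (y_d + m_d T)\,\exp\!\left[\frac{\Delta H_m}{R}\left(\frac{1}{T_{1/2}} - \frac{1}{T}\right)\right]}{1 + \exp\!\left[\frac{\Delta H_m}{R}\left(\frac{1}{T_{1/2}} - \frac{1}{T}\right)\right]} \]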

ΔHm, enthalpy at the unfolding transition; T1/2, temperature in K at which half the protein

is unfolded; T, temperature in K; R, universal gas constant; mn, slope of the pretransition

baseline; yn, intercept of the pretransition baseline; md, slope of the post-transition

baseline; yd, intercept of the post-transition baseline. Data was exported and analyzed in

Microsoft Excel 2007.
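For readers who prefer a scripted analysis to a spreadsheet, the two-state model implied by the parameters defined here can be written as a single function. This is a sketch that assumes the standard two-state form with linear pre- and post-transition baselines and no heat capacity term; the exact expression used in the original fits should be checked against the cited reference. Function and argument names are illustrative.

```python
import math

R = 8.314  # universal gas constant, J mol^-1 K^-1


def two_state_signal(T, dHm, T_half, yn, mn, yd, md):
    """Ellipticity predicted by a two-state unfolding model.

    T, T_half: temperatures in K; dHm: unfolding enthalpy in J mol^-1.
    (yn, mn): intercept and slope of the pretransition (native) baseline.
    (yd, md): intercept and slope of the post-transition (denatured) baseline.
    """
    # Equilibrium constant for unfolding; equals 1 when T == T_half.
    K = math.exp((dHm / R) * (1.0 / T_half - 1.0 / T))
    native = yn + mn * T
    denatured = yd + md * T
    return (native + denatured * K) / (1.0 + K)


if __name__ == "__main__":
    # Hypothetical parameters: flat baselines at -20 and -5, T1/2 = 330 K.
    for temp in (300.0, 330.0, 360.0):
        print(temp, round(two_state_signal(temp, 400e3, 330.0, -20.0, 0.0, -5.0, 0.0), 3))
```

At T = T1/2 the predicted signal is the midpoint of the two baselines, which is a quick sanity check on any fit.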


6.2.6 Gel Filtration

Size-exclusion chromatography was performed on a Pharmacia-LKB FPLC. Protein

samples were diluted to 37 µM (1 mg/mL) in 50 mM Tris·HCl, 100 mM NaCl pH 8 and

eluted from a Superdex 75 10/300 column (GE-Amersham) with 50 mM Tris·HCl, 100

mM NaCl pH 8 at 0.4 mL min-1. Protein peaks were collected by observing the

absorbance at 280 nm. Molecular weights for the TIM variants were calculated based on

fits from known standards. Alcohol dehydrogenase 9.6 mL-150 kD, Albumin 11.0 mL-

66 kD, Carbonic anhydrase 11.8 mL-29.0 kD, Cytochrome c 13.6 mL-12.4 kD and

Aprotinin 15.6 mL-6.5 kD. Chromatograms were collected at room temperature and 4

°C.
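The form of the calibration fit is not stated. A common choice, assumed here purely for illustration, is a straight-line fit of log10(molecular weight) against elution volume. The sketch below applies that assumption to the five standards listed in this section; the function names are hypothetical.

```python
import math

# (elution volume in mL, molecular weight in kD) for the listed standards.
STANDARDS = [
    (9.6, 150.0),   # alcohol dehydrogenase
    (11.0, 66.0),   # albumin
    (11.8, 29.0),   # carbonic anhydrase
    (13.6, 12.4),   # cytochrome c
    (15.6, 6.5),    # aprotinin
]


def fit_calibration(standards):
    """Least-squares line through log10(MW) versus elution volume."""
    xs = [v for v, _ in standards]
    ys = [math.log10(mw) for _, mw in standards]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    return slope, intercept


def estimate_mw_kd(volume_ml, slope, intercept):
    """Apparent molecular weight (kD) for a peak eluting at volume_ml."""
    return 10 ** (slope * volume_ml + intercept)


if __name__ == "__main__":
    slope, intercept = fit_calibration(STANDARDS)
    print(round(estimate_mw_kd(12.0, slope, intercept), 1))
```

Apparent molecular weights from such a fit depend on hydrodynamic shape as well as mass, so expanded or partially unfolded species elute earlier than compact proteins of the same mass.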

6.2.7 Analytical Ultracentrifugation

Sedimentation velocity studies were performed by the University of Connecticut AUC

Facility. TIM samples were tested at concentrations ranging from 0.1 to 1.5 mg mL-1 in

50 mM potassium phosphate, 300 mM NaCl, 1 mM TCEP, 2 mM DTT, pH 8. Samples were exhaustively dialyzed before submitting both sample and dialysate. Runs were performed at 55,000 rpm at 20 °C in a Beckman-Coulter XL-I analytical ultracentrifuge.

Data was collected at 60 second intervals for 4 to 6 hours. Data was analyzed with

DcDt+, version 2.1.0, Sedfit, version 11.3 and Sedphat, version 5.01.


6.2.8 Hydrophobic Dye Binding

Fluorescence spectra were recorded using a Perkin-Elmer LS50B fluorimeter. The stock

concentration of ANS in methanol was determined by absorbance at 372 nm using an extinction coefficient of 8 x 10^3 M-1cm-1. 25 µM protein was incubated with 5 µM dye at room temperature for 5 to 10 minutes. The samples were excited at 372 nm and the

emission spectra were obtained from 400 to 600 nm.

6.2.9 Activity

in vitro

The kcat and Km were calculated using the assays described by Plaut and Knowles at 37

°C and pH 7.4. Our data were collected in 96-well plates using a Molecular Devices

Spectramax M5 plate reader. The path length was calculated using known concentrations

of NADH and Beer’s Law with ɛ = 6220 M-1cm-1. For a given volume, path lengths were

consistent across the plate regardless of well position. The stock concentrations of

DHAP and DL-GAP were determined enzymatically under conditions where each was

the limiting reagent and the reaction was run to apparent completion. Here, Af -Ai = εb(cf

- ci), where ci is 0 and cf is stoichiometrically related to the initial substrate concentration

for the reaction. Reaction rates were determined by monitoring the appearance or

disappearance of NADH at 340 nm when coupled to glyceraldehyde-3-phosphate

dehydrogenase and α-glycerol-3-phosphate dehydrogenase, respectively. Each reaction

was monitored for ten minutes in the absence of TIM to calculate the background

reaction rate (vo) from trace TIM contamination in the commercial coupling enzymes.


After ten minutes, an aliquot of TIM was added to the well and the entire volume was

pipetted 2 to 3 times to ensure proper mixing. The reaction rate (vobs ) was then

monitored for an additional 10 minutes. The initial reaction rate (vi ) used to calculate the

Michaelis-Menten parameters was calculated as (vi = vobs - vo). Absorbance values were

sampled every 45 seconds to limit photodegradation of the NADH. With GAP as the

substrate, the reactions contained 0.1 M TEA buffer pH 7.4, 5 mM EDTA, 0.16 mM

NADH, and 0 to 6.6 mM GAP. TIM concentration was 622 nM for cTIM, 1.24 µM for ir-cTIM, 12.4 pM for ccTIM and 10 pM for yeast TIM. The concentration of α-glycerol-3-phosphate dehydrogenase was optimized at 17 µg mL-1 to ensure it never limited vobs.

With DHAP as the substrate, the reactions contained 0.1 M TEA buffer pH 7.4, 5 mM

EDTA, 6 mM sodium arsenate, 1 mM NAD+, 0 to 12 mM DHAP. TIM concentration

was 1.24 µM for cTIM, 1.24 µM for ir-cTIM, 124 pM for ccTIM and 1.48 nM for yeast

TIM. The concentration of glyceraldehyde-3-phosphate dehydrogenase was optimized

at 0.2 mg mL-1 to ensure it never limited vobs. Triosephosphate isomerase concentrations

were optimized such that vobs would be linear for longer than 15 minutes, but statistically

greater than vo. Michaelis-Menten data sets were collected in triplicate and fitted in

KaleidaGraph.
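The arithmetic behind the plate-reader assay (path length from a known NADH concentration, conversion of an absorbance slope to a rate, background subtraction, and the Michaelis-Menten rate law) can be sketched as follows. This is an illustration only: the function names and example numbers are hypothetical, and the original fits were performed in KaleidaGraph.

```python
EPSILON_NADH = 6220.0  # M^-1 cm^-1 at 340 nm, as given in the text


def path_length_cm(absorbance, nadh_molar):
    """Path length from Beer's Law, A = epsilon * b * c, for a known NADH concentration."""
    return absorbance / (EPSILON_NADH * nadh_molar)


def rate_molar_per_s(abs_slope_per_s, path_cm):
    """Convert an absorbance slope (A340 per second) into a rate in M per second."""
    return abs(abs_slope_per_s) / (EPSILON_NADH * path_cm)


def initial_rate(v_obs, v_background):
    """Background-corrected initial rate, vi = vobs - vo."""
    return v_obs - v_background


def michaelis_menten(substrate, vmax, km):
    """Michaelis-Menten rate at a given substrate concentration."""
    return vmax * substrate / (km + substrate)


if __name__ == "__main__":
    b = path_length_cm(0.311, 1.0e-4)     # hypothetical well reading
    v_obs = rate_molar_per_s(-2.0e-3, b)  # NADH consumed after adding TIM
    v_o = rate_molar_per_s(-2.0e-4, b)    # background before adding TIM
    print(b, initial_rate(v_obs, v_o))
    print(michaelis_menten(2.0, 10.0, 2.0))  # half of vmax when [S] equals Km
```

Dividing the fitted vmax by the TIM concentration in the well gives kcat in the corresponding units.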

The Michaelis-Menten kinetics for ccTIM were measured in the absence and presence of

8.3 mM sodium arsenate. The increase in Km with maintenance of vmax suggests arsenate


is a competitive inhibitor, which is supported by previous data. The Ki and Km for

arsenate were estimated using the equations:

in vivo

The Keio(DE3) strain was grown on minimal media agar plates lacking six carbon

sugars. These plates were supplemented with M63 salts (2 g (NH4)2SO4, 13.6 g KH2PO4,

6 g KOH for 1 L), 0.2 % w/v glycerol or lactate, 1 mg L-1 thiamine, 80 mg L-1 histidine and 50 mg L-1 uracil. Plates contained ampicillin for plasmid selection and kanamycin for strain selection. For solid media, growth was seen from leaky expression (0 mM

IPTG) even from cTIM. Cells were grown for one to four days at 37 °C.

6.2.10 Nuclear Magnetic Resonance

Seed cultures were grown in 2×YT to saturation overnight. The next morning, cells were diluted 40-fold into 1 L minimal media supplemented with 5 mM sodium citrate, trace

metals, 500 μM MgSO4, 0.5% glucose, 0.1 mg thiamine and 1 g 15NH4Cl. Cells were

grown to OD600 ~ 0.75 before induction with 0.1 mM IPTG. The proteins were purified

as described above. Purified proteins were concentrated to 350 μM in 50 mM sodium

phosphate pH 7.4, 150 mM NaCl, 10 mM DTT and 1 mM TCEP. The 1H, 15N-HSQC

NMR spectra were obtained at 20 °C on a Bruker DRX 600 MHz.


6.3 Chapter 4 Methods

6.3.1 Sequences and Cloning

All point mutants were constructed from overlapping PCR reactions. Here a long primer

(~40 nucleotides) contains the desired mutation at the center and 18 nucleotides of

complementary overlap up and down stream of the mutation site. These primers are

optimized to have Loop TMs < 50 °C, overlap primer-to-target TMs > 70 °C and the 5' and

3' ends with G or C. Additionally, the reverse complement primer is ordered. Two initial

PCRs are performed for each mutant. One using the mutagenic primer and a reverse

complemented primer downstream of the target mutation. This reaction yields a double

stranded product containing the mutation and gene sequence 3' of the mutation. A second reaction is performed with the reverse complemented mutagenic primer and an upstream sense-coding primer. This reaction yields a double stranded product containing the

mutation and gene sequence 5' of the mutation. The two double stranded products are treated with DpnI (R0176S - NEB) to digest the methylated template. The DNA is purified and diluted into a final PCR reaction where the full length product is reassembled and amplified.

All PCR reactions are performed with either Taq (M0267), Deep Vent (M0258S), Pfu or

Phusion (M0530S) polymerase. Polymerases are purchased from New England Biolabs except Pfu, which is purified recombinantly within the Magliery Lab. Reactions with Taq are performed with 94 °C melting steps. Reactions are 100 µL with 1 µM primers, template masses ranging from ~ 0.06-60 ng, and 250 µM dNTPs.


The sense-coding mutagenic primer for each variant is shown below. Antisense primers are not shown, but are simply the reverse complement of each listed sequence.

F11W   CTTTCTTTGTCGGTGGTAACTGGAAATTAAACGGTTCCAAAC
L13M   GTCGGTGGTAACTTTAAAATGAACGGTTCCAAACAATC
I20A   CTTTAAATTAAACGGTTCCAAACAATCAGCGAAGGAAATCGTTGAAAGATTG
S31K   GAAAGATTGAACACTGCTAAAATCCCAGAAAATGTCGAAG
Y49Q   CCAGCTACCTACTTAGATCAGTCTGTCTCTTTGGTTAAG
T60A   GTTAAGAAGCCACAAGTCGCTGTCGGTGCTCAAAACGCC
A66C   CACTGTCGGTGCTCAAAACTGCTACCTGAAGGCTTCTGGTG
Q82M   GTGAAAACTCCGTTGACATGATCAAGGATGTTGGTGC
I83L   GAAAACTCCGTTGACCAACTCAAGGATGTCGGTGCTAAG
W90Y   CAAGGATGTTGGTGCTAAGTATGTTATTTTGGGTCACTC
I109V  CACGAAGATGACAAGTTTGTTGCTGACAAGACCAAGTTC
V121L  GTTCGCTTTAGGTCAAGGTCTCGGTGTCATCTTGTGTATC
G122T  GCTTTAGGTCAAGGTGTCACAGTCATCTTATGTATCGGTG
       GCTTTAGGTCAAGGTGTCACGGTCATCTTGTGTATCGGTG
       GCTTTAGGTCAAGGTGTCACCGTCATCTTGTGTATTGGTG
       GCTTTAGGTCAAGGTGTCACTGTCATCTTGTGTATCGGTG
V123P  CTTTAGGTCAAGGTGTCGGTCCCATCTTGTGTATCGGTGAAAC
K134R  GTGAAACTTTGGAAGAAAGGAAGGCCGGTAAGACTTTG
K135E  GAAACTTTGGAAGAAAAGGAGGCCGGTAAGACTTTGGATG
A175T  GCCATTGGTACCGGTTTGACAGCTACTCCAGAAGATGCTC
D180Q  GTTTGGCTGCTACTCCAGAACAAGCTCAAGATATTCACGCTTC
I184V  CCAGAAGATGCTCAAGATGTTCACGCTTCCATCAGAAAG
A212V  CTTATACGGTGGTTCCGTCAACGGTAGCAACGCCG
N213K  CTTATACGGTGGTTCAGCTAAAGGTAGCAACGCCGTTACC
T219E  CTAATGGTAGCAACGCCGTTGAGTTCAAGGACAAGGCTGATG
V226I  CTTCAAGGACAAGGCTGATATCGATGGTTTCTTGGTCGGTG
F229A  CAAGGCTGATGTCGATGGTGCGTTGGTCGGTGGTGCTTCTTTG
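Because only the sense-coding primers are listed, the antisense partner of each one has to be derived as its reverse complement. The sketch below does that and also checks the design rule that both primer ends are G or C. The helper names are illustrative and not part of the original protocol; the example primer is the F11W sequence from the list.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (A/C/G/T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def ends_in_gc(seq):
    """True if both the 5' and 3' terminal bases are G or C."""
    return seq[0].upper() in "GC" and seq[-1].upper() in "GC"


if __name__ == "__main__":
    f11w_sense = "CTTTCTTTGTCGGTGGTAACTGGAAATTAAACGGTTCCAAAC"
    f11w_antisense = reverse_complement(f11w_sense)
    print(f11w_antisense)
    print(ends_in_gc(f11w_sense), ends_in_gc(f11w_antisense))
```

Since complementing swaps G with C, a sense primer that ends in G or C at both termini always yields an antisense primer that satisfies the same rule.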

re2-S.c. TIM was constructed as described in 6.2.1 by gene reassembly. The oligonucleotides for gene construction are listed below the protein sequence in italics.

The cleavable hexahistidine tag is shown underlined.

MAHHHHHHGGENLYFQGSSGARTFFVGGNWKMNGSKQSIKEIVERLNTASIPENVEVVICPPATYLDYSVS LVKKPQVTVGAQNAYLKASGAFTGENSVDMIKDVGAKYVILGHSERRSYFHEDDKFIADKTKFALGQGVGV ILCIGETLEERKAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLAATPEDAQDIHASIRKFLAS KLGDKAASELRILYGGSANGSNAVTFKDKVDVDGFLVGGASLKPEFVDIINSRN*

1.GCGCGTACGTTCTTTGTTGGTGGTAACTGGAAAATGAACGGTTCTAAACAGTCTATCAAGGAAATCGTT GAACGTCTCAACACTGCGTC

2.CTTAACCAGAGAAACAGAGTAGTCCAGGTAGGTCGCCGGAGGGCAGATAACAACTTCAACGTTTTCCGG GATAGACGCAGTGTTGAGACGTTC

3.TGGACTACTCTGTTTCTCTGGTTAAGAAACCGCAGGTGACCGTTGGCGCCCAGAACGCGTACCTGAAGG CATCTGGTGCGTTCACCGGCGAG

4.GAAGTAAGAACGACGTTCAGAGTGACCGAGGATAACGTATTTCGCACCAACGTCTTTGATCATATCGAC AGAGTTCTCGCCGGTGAACGCAC

5.TCACTCTGAACGTCGTTCTTACTTCCACGAAGACGACAAGTTTATCGCGGACAAAACCAAATTCGCTCT GGGTCAGGGTGTTGGCGTGATTC

6.GTTGAGCTGACGTTCGACCACATCCAGAGTTTTACCCGCTTTACGCTCTTCCAGGGTTTCACCGATGCA CAGAATCACGCCAACACCCTGAC

7.GGTCGAACGTCAGCTCAACGCTGTTCTGGAGGAAGTTAAAGACTGGACCAATGTTGTCGTTGCGTACGA ACCGGTGTGGGCGATTGGTACTG

8.CCCAGTTTAGAAGCCAGGAATTTACGAATAGAGGCGTGGATGTCCTGCGCGTCTTCCGGGGTTGCCGCC AGACCAGTACCAATCGCCCACAC

9.AATTCCTGGCTTCTAAACTGGGTGACAAAGCGGCGTCTGAACTGCGTATCCTCTACGGTGGTTCTGTTA ACGGTAGCAACGCGGTAACCTTC

10.GATGATGTCCACAAATTCCGGCTTCAGAGACGCACCGCCAACCAGGAAACCGTCAACGTCCGCCTTAT CTTTGAAGGTTACCGCGTTGCTAC

comboTIM and algoTIM genes were purchased from Genewiz, Inc. The genes were received as full length products subcloned into pET vectors with flanking restriction enzyme sites for facile subcloning into pHLIC. The wild-type S. cerevisiae TIM gene was cloned by PCR of lysed YPH 499 cells. The protein and DNA sequence for algoTIM and comboTIM are listed below. The cleavable hexahistidine tag is shown underlined.

comboTIM MAHHHHHHGHGGENLYFQGSSGARTFFVGGNWKMNGSKQSAKEIVERNTAKIPENVEVVICPPATYLDQSV SLVKKPQVTVGAQNCYLKASGAFTGENSVDMLKDVGAKWVILGHSERRSYFHEDDKFVADKTKFALGQGLG VILCIGETLEERKAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLTATPEDAQDVHASIRKFLA SKLGDKAASELRILYGGSVNGSNAVTFKDKADVDGFLVGGASLKPEFVDIINSRN*


CCATGGCGGCGAGAATCTGTATTTTCAGGGCAGCTCTGGCGCACGTACCTTTTTTGTGGGCGGGAATTGGA AAATGAATGGCAGCAAACAGAGCGCGAAGGAAATTGTGGAACGGAACACCGCGAAGATTCCCGAAAATGTG GAAGTGGTGATTTGTCCACCGGCAACCTATCTGGATCAGTCCGTGAGCCTGGTGAAAAAACCGCAGGTGAC CGTGGGTGCGCAGAATTGCTATCTGAAAGCGAGCGGCGCATTTACCGGCGAAAATAGCGTGGACATGCTGA AGGACGTGGGGGCGAAATGGGTGATTCTGGGCCACAGCGAACGCCGTTCGTACTTCCATGAGGATGACAAG TTCGTGGCCGACAAAACGAAATTTGCGCTTGGCCAGGGCCTGGGCGTGATCCTGTGCATTGGCGAAACCCT TGAAGAACGTAAAGCGGGCAAAACCTTGGATGTGGTAGAACGTCAGCTGAATGCGGTGCTGGAAGAAGTGA AAGATTGGACCAATGTGGTGGTGGCGTATGAACCCGTCTGGGCGATTGGCACCGGCCTGACCGCGACCCCG GAGGATGCGCAGGATGTGCATGCGAGCATTCGTAAATTCCTGGCGAGCAAACTGGGGGATAAGGCGGCGAG CGAACTGCGTATTCTGTATGGCGGCTCCGTGAATGGTAGCAACGCGGTAACGTTTAAAGACAAAGCGGACG TTGATGGCTTTCTGGTAGGCGGAGCGAGCCTGAAACCGGAATTCGTGGATATTATCAATAGCCGTAATTAA GGATCC

algoTIM MAHHHHHHGHGENLYFQGSSGARTFFVGGNFKMNGSKQSIKEIVERNTASIPENVEVVVCPPATYLDYSVS LVKKPQVTVGAQNCYLKASGAFTGEISVDMLKDVGAKWVILGHSERRSYFHEDDKFVADKTKFALGQGLGV ILCVGETLEEREAGKTLDVVERQLNAVLEEVKDWTNVVVAYEPVWAIGTGLAATPEDAQDVHASIRKFLAS KLGDKAASELRILYGGSVNGSNAVTFKDKADIDGFLVGGASLKPEFVDIINSRN*

CCATGGCGGAGAAAATCTGTATTTTCAGGGGAGCAGTGGTGCGCGGACATTTTTTGTGGGTGGCAATTTCA AGATGAACGGCAGCAAGCAGAGCATAAAGGAAATTGTGGAGCGTAATACCGCGAGCATACCGGAAAATGTG GAAGTGGTGGTGTGTCCGCCTGCGACCTATCTGGACTATTCCGTGTCGTTGGTGAAAAAACCGCAGGTGAC TGTTGGGGCGCAGAATTGCTATCTGAAAGCATCGGGTGCTTTTACCGGGGAGATTAGCGTGGATATGCTTA AGGATGTTGGCGCGAAATGGGTGATTCTGGGTCACAGCGAGCGGCGGTCATACTTTCACGAAGATGACAAG TTCGTGGCCGACAAAACCAAGTTTGCGCTGGGGCAGGGCCTTGGCGTGATCTTGTGCGTAGGCGAGACCTT GGAAGAACGTGAAGCGGGTAAGACCCTGGACGTGGTGGAACGTCAGCTTAACGCAGTGCTGGAGGAGGTGA AAGACTGGACCAATGTCGTGGTCGCGTATGAGCCTGTGTGGGCTATAGGCACCGGACTGGCCGCGACCCCA GAAGATGCCCAGGATGTGCATGCGTCAATACGTAAGTTTCTGGCGAGTAAACTGGGCGATAAGGCGGCCTC AGAGCTGCGCATTCTGTATGGAGGCAGCGTGAATGGTAGCAATGCGGTGACCTTCAAAGACAAAGCGGATA TTGATGGCTTTTTGGTGGGAGGGGCGAGCCTGAAACCGGAATTTGTGGACATAATCAACAGCCGTAATTAA GGATCC

6.3.2 Circular Dichroism

Data were collected on a Jasco J-815 Circular Dichroism spectrometer. TIM samples were diluted to 12 µM in 50 mM potassium phosphate, 300 mM NaCl pH 8. Far-UV spectra were recorded in triplicate from 195 to 250 nm. Wavelength scans recorded ellipticity every nanometer with 2 second data integration time (DIT) and 100 nm min⁻¹ scanning speed. Data points reporting HT voltages greater than 600 V were discarded.

For thermal denaturation, ellipticity was monitored at 222 nm with temperatures increasing from 25 to 95 °C. Data were collected in 1 °C steps with 6 second temperature equilibration, 1 °C min⁻¹ ramping and 2 second data integration. The T1/2s were calculated by fitting the thermal denaturation data to the equation described by Koepf.

$$
y = \frac{(y_n + m_n T) + (y_d + m_d T)\,\exp\!\left[\dfrac{\Delta H_m}{R}\left(\dfrac{1}{T_{1/2}} - \dfrac{1}{T}\right)\right]}{1 + \exp\!\left[\dfrac{\Delta H_m}{R}\left(\dfrac{1}{T_{1/2}} - \dfrac{1}{T}\right)\right]}
$$

y, observed ellipticity; ΔHm, enthalpy at the unfolding transition; T1/2, temperature in K at which half the protein is unfolded; T, temperature in K; R, universal gas constant; mn, slope of the pretransition baseline; yn, intercept of the pretransition baseline; md, slope of the post-transition baseline; yd, intercept of the post-transition baseline. Data were exported and analyzed in Microsoft Excel 2007.
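To make the roles of the fitted parameters concrete, the sketch below evaluates a two-state unfolding model with linear pre- and post-transition baselines. The functional form is an assumption here (a van't Hoff expression with no heat-capacity term, built from the parameters defined above), and the function names are illustrative; the original analysis was carried out in Excel.

```python
import math

R = 8.314  # universal gas constant, J mol^-1 K^-1


def unfolding_constant(T, dHm, T_half):
    """Assumed equilibrium constant K = exp[(dHm/R)(1/T_half - 1/T)]; T in K."""
    return math.exp((dHm / R) * (1.0 / T_half - 1.0 / T))


def fraction_unfolded(T, dHm, T_half):
    """Fraction of unfolded protein at temperature T (K)."""
    K = unfolding_constant(T, dHm, T_half)
    return K / (1.0 + K)


def two_state_signal(T, dHm, T_half, mn, yn, md, yd):
    """Observed signal: baselines weighted by the folded/unfolded populations."""
    K = unfolding_constant(T, dHm, T_half)
    native = yn + mn * T      # pretransition baseline
    denatured = yd + md * T   # post-transition baseline
    return (native + denatured * K) / (1.0 + K)
```

At T = T1/2 the model returns the midpoint of the two baselines, which is the property that defines the reported T1/2 values.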

6.3.3 Differential Static Light Scattering

Data were collected on a Jasco J-815 Circular Dichroism spectrometer. TIM samples were diluted to 12 µM in 50 mM potassium phosphate, 300 mM NaCl pH 8. For thermal denaturation, absorbance was monitored at 600 nm with temperatures increasing from 25 to 95 °C. Data were collected in 1 °C steps with 6 second temperature equilibration, 1 °C min⁻¹ ramping and 2 second data integration. The relative melting temperatures (TM) were fit based on second derivatives or steepest slopes. Generally, a sliding 5 °C window was used to calculate slopes. The TM was determined as the central data point of the window with the greatest slope.
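The steepest-slope procedure can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually used: the window is taken as five consecutive 1 °C data points so that it has a central point, the slope is a least-squares fit within each window, and all names are hypothetical.

```python
def window_slope(xs, ys):
    """Least-squares slope of ys against xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def steepest_slope_tm(temps, signal, window_points=5):
    """Return (TM, slope) for the sliding window with the greatest slope.

    TM is reported as the central temperature of that window, so
    window_points is assumed to be odd.
    """
    best_tm, best_slope = None, float("-inf")
    for start in range(len(temps) - window_points + 1):
        stop = start + window_points
        slope = window_slope(temps[start:stop], signal[start:stop])
        if slope > best_slope:
            best_slope = slope
            best_tm = temps[start + window_points // 2]
    return best_tm, best_slope
```

A call such as `steepest_slope_tm(temperatures, absorbance_600nm)` would return the temperature at the center of the steepest window together with the slope found there.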

The unfolding rate constants were determined by monitoring the loss of ellipticity with time after temperature jump from 25 to 70 °C. Here, 30 µL of 14 µM sample is diluted into 270 µL of 70 °C buffer and transferred to a cuvette preincubated at 70 °C. Data were collected after 1 minute to ensure temperature equilibration and recorded for an additional 2-5 minutes during which all variants completely unfolded. Unfolding curves were fit to a single exponential with the following parameters: Yo, starting ellipticity; k, unfolding rate constant; t, time in seconds; C, horizontal asymptote. Data were exported and analyzed in Microsoft Excel 2007.
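As an illustration of the single-exponential fit, the sketch below assumes the form Y(t) = (Yo - C)·exp(-k·t) + C, which is one common way to write a decay that starts at Yo and levels off at C; the exact expression used in the original analysis is not reproduced here. It also assumes the asymptote C is supplied (for example, from the plateau of the trace), so that k can be obtained by linear regression on ln|Y - C|. Function and variable names are hypothetical.

```python
import math


def fit_unfolding_rate(times, signal, asymptote):
    """Estimate (k, Y0) for Y(t) = (Y0 - C) * exp(-k t) + C with C given.

    Uses least-squares regression of ln|Y - C| against t; points that
    coincide with the asymptote are skipped because their log is undefined.
    """
    pts = [(t, math.log(abs(y - asymptote)))
           for t, y in zip(times, signal) if y != asymptote]
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_l = sum(l for _, l in pts) / n
    sxy = sum((t - mean_t) * (l - mean_l) for t, l in pts)
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    slope = sxy / sxx
    intercept = mean_l - slope * mean_t
    # The starting value lies above or below the asymptote depending on the trace.
    sign = 1.0 if signal[0] > asymptote else -1.0
    y0 = asymptote + sign * math.exp(intercept)
    return -slope, y0
```

With noisy data a nonlinear least-squares fit of all three parameters would usually be preferable; the log-linear form is shown only because it keeps the arithmetic transparent.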

6.3.4 FoldX Calculations

The FoldX algorithm was downloaded as a YASARA add-in (www.yasara.org, http://foldxyasara.switchlab.org). The S. cerevisiae crystal structure, 1YPI, was "repaired" and saved before mutations. The stability change was calculated from the average of three runs at pH 7, 298 K, ionic strength 500 and van der Waals design 2. Neighboring residues were allowed to move during the energy minimization.

6.4 Chapter 5 Methods

6.4.1 Keio(DE3) Construction

DF502 was obtained from the E. coli Genetic Stock Center at Yale University. To screen in DF502 we use the tac promoter plasmid, Pinpoint Xa-1. To express proteins for purification we employed a second plasmid, pHLIC, under control of the T7 RNA polymerase promoter.


To simplify cloning procedures and make the entire system more amenable for higher-throughput study we wished to screen and express from the same construct and cell line. In order to accomplish this goal, we received the tpiA-knockout from the Keio Collection via the E. coli Genetic Stock Center at Yale University. First, we engineered the acquired strain to support T7 transcription. We ordered the λDE3 kit from Novagen (catalog #: 69734) to lysogenize the T7 RNA polymerase gene into the genome under control of the lac operon. A culture of glycerol-stocked Keio was grown to O.D.600 = 0.5 in 2xYT before infection. Phage stocks were as follows: λDE3, 2 × 10¹¹; Helper, 3 × 10¹⁰; Selection, 3 × 10¹⁰. Approximately 1 × 10⁸ PFU were added in the following volumes (µL):

Cells   λDE3   Helper   Selection
1       0.2    3.3      3.3
5       0.2    3.3      3.3
10      0.2    3.3      3.3

After 37 °C incubation for 20 minutes, 10 µL water was added to help plate on LB-kan. Colonies appeared on all titers and 18 were prepared for electrocompetent transformation. S.c. TIM.pHLIC was transformed into all eighteen variants to test for expression sensitivity at 0 and 0.1 mM IPTG. The transformed constructs were expressed with 0.1 mM IPTG at O.D.600 = 0.7. A single variant was chosen for low level of expression under uninduced conditions and high levels of expression at 0.1 mM. This strain was given the name ΔtpiA-Keio(DE3).


6.4.2 Cloning of E165D Mutants

The E165D knockdown mutation was engineered by overlapping PCR with mutagenic primers (see 6.3.1). The mutagenic primers are shown below for S.c. TIM, ccTIM and bacTIM (Durani, Sullivan and Magliery, manuscript in preparation). The antisense primers are not shown, but are simply the reverse complements.

S.c. TIM  GGACTAACGTCGTTGTCGCTTATGATCCAGTCTGGGCCATTGGGACCGG
ccTIM     GAATGTTGTGATCGCTTATGATCCTGTCTGGGCAATTGGGACGGG
bacTIM    CTGGTGATTGCGTATGATCCGGTGTGGGCGATTGGCAC
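A quick way to check that a mutagenic primer encodes the intended substitution is to translate it in frame. The sketch below uses the standard genetic code and the bacTIM primer listed above; the reading frame (starting at the first base of the primer) is an assumption, chosen because the translation then matches the bacTIM segment LVIAYEPVWAIG apart from the glutamate-to-aspartate change. The function name is illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA in the given frame, ignoring any trailing partial codon."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE[codon] for codon in codons)


bactim_primer = "CTGGTGATTGCGTATGATCCGGTGTGGGCGATTGGCAC"
print(translate(bactim_primer))  # LVIAYDPVWAIG
```

The GAT codon in the middle of the primer yields aspartate (D) where the parent sequence has glutamate (E).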

The protein sequence for bacTIM is shown below with the cleavable hexahistidine tag underlined.

bacTIM

MGHHHHHHGHGENLYFQGSSGMRHPLIAGNWKMNGTLAEAKALVEALAALLPDVDGVEVVVCPPFTYLAAV AEALEGSNIALGAQNVHWEDSGAFTGEISAAMLKDLGCSYVIVGHSERRQYFGETDELVNKKAKAALAAGL TPILCVGETLEEREAGKTEEVVARQLDAVLAGLGAEQFANLVIAYEPVWAIGTGKTATPEQAQEVHAFIRA TLAELFGAEVAEKVRILYGGSVKPDNAAELFAQPDIDGALVGGASLKAEDFLAIVKAAEAAKQA

6.4.3 Solid and Liquid Minimal Media Growth

Solid

The Keio(DE3) strain was grown on minimal media agar plates lacking six-carbon sugars. The plates were made by autoclaving 1 L of water with 2 g (NH4)2SO4, 13.6 g KH2PO4, KOH to pH 7.0, and 15 g Bacto Agar with a magnetic stir bar. After sterilization the media was placed on a magnetic stir plate and allowed to cool to 55 °C. The following sterile-filtered additives were pipetted into the media: 0.2 % w/v glycerol or lactic acid, 1 mg L⁻¹ thiamine, 80 mg L⁻¹ histidine, 0.5 mg L⁻¹ ferrous sulfate and 50 mg L⁻¹ uracil. Ampicillin is added for plasmid selection and kanamycin is added for selection of the Keio strain. Experiments were performed under conditions of leaky expression with 0 mM IPTG. Saturated cultures were grown overnight in 2xYT at 37 °C. These cells are spun down and washed twice with phosphate buffered saline before plating on minimal media. The number of cells was normalized by optical density at 600 nm. Cells are grown on solid media for one to four days at 37 °C.

Liquid

The liquid media was prepared identically to solid media with the following exceptions. First, 20.4 g of KH2PO4 was added as opposed to 13.6 g to increase the buffering capacity of the solution. Secondly, 2.5 µM IPTG was added to increase the protein concentration. Individual growth curves were determined by monitoring the optical density at 600 nm every hour for 13 hours at 37 °C.

Competition experiments were performed by inoculating a single 1 L culture with two liquid cultures harboring different TIM variants. The saturated overnight cultures were centrifuged and washed twice with phosphate buffered saline before normalization. Each culture was diluted to an O.D.600 = 1.0. The selective media was inoculated with 10 mL of cells as either 1 mL variant A + 9 mL variant B, or 5 mL variant A + 5 mL variant B depending on the experiment. Three milliliters of culture were extracted at 24 hours for plasmid isolation and quantification by restriction analytical digest. A fraction of the remaining culture was diluted to O.D.600 = 1.0. Ten milliliters of this sample was used to inoculate the next generation. The amplification process was performed for 2-10 generations of 24 hours each.
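The normalization and mixing steps in the competition experiments amount to simple dilution arithmetic (C1V1 = C2V2). The following is a minimal sketch, assuming optical density scales linearly with cell concentration over the range used; the function names and the example reading of O.D.600 = 4.0 are hypothetical and not taken from the original protocol.

```python
def dilution_volumes(od_measured, od_target, final_volume_ml):
    """Return (culture_ml, diluent_ml) needed to reach od_target."""
    if od_measured < od_target:
        raise ValueError("culture is already more dilute than the target")
    culture_ml = final_volume_ml * od_target / od_measured
    return culture_ml, final_volume_ml - culture_ml


def competition_inoculum(total_ml, fraction_a):
    """Split a normalized inoculum between variant A and variant B."""
    volume_a = total_ml * fraction_a
    return volume_a, total_ml - volume_a


# Example: normalize a washed culture that reads O.D.600 = 4.0 to 1.0 in 10 mL,
# then prepare a 1:9 mixture of variant A and variant B.
culture_ml, diluent_ml = dilution_volumes(4.0, 1.0, 10.0)
variant_a_ml, variant_b_ml = competition_inoculum(10.0, 0.1)
```

The same two helpers cover the 5 mL + 5 mL case by passing `fraction_a=0.5`.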


Lehmann, M., Kostrewa, D., Wyss, M., Brugger, R., D'Arcy, A., Pasamontes, L. & van Loon, A. P. (2000). From DNA sequence to improved functionality: using protein sequence comparisons to rapidly design a thermostable consensus phytase. Protein Eng 13, 49-57. 136. Lehmann, M., Pasamontes, L., Lassen, S. F. & Wyss, M. (2000). The consensus concept for thermostability engineering of proteins. Biochim Biophys Acta 1543, 408-415. 137. Lehmann, M., Loch, C., Middendorf, A., Studer, D., Lassen, S. F., Pasamontes, L., van Loon, A. P. & Wyss, M. (2002). The consensus concept for thermostability engineering of proteins: further proof of concept. Protein Eng 15, 403-11. 138. Rath, A. & Davidson, A. R. (2000). The design of a hyperstable mutant of the Abp1p SH3 domain by sequence alignment analysis. Protein Sci 9, 2457-69. 139. Choudhury, D., Biswas, S., Roy, S. & Dattagupta, J. K. (2010). Improving thermostability of papain through structure-based protein engineering. Protein Eng Des Sel 23, 457-67. 212

140. Zakrzewska, M., Krowarsch, D., Wiedlocha, A. & Otlewski, J. (2004). Design of fully active FGF-1 variants with increased stability. Protein Eng Des Sel 17, 603- 11. 141. Ditursi, M. K., Kwon, S. J., Reeder, P. J. & Dordick, J. S. (2006). Bioinformatics- driven, rational engineering of protein thermostability. Protein Eng Des Sel 19, 517-24. 142. Vazquez-Figueroa, E., Chaparro-Riggers, J. & Bommarius, A. S. (2007). Development of a thermostable glucose dehydrogenase by a structure-guided consensus concept. Chembiochem 8, 2295-301. 143. Muraro, R., Kuroki, M., Wunderlich, D., Poole, D. J., Colcher, D., Thor, A., Greiner, J. W., Simpson, J. F., Molinolo, A., Noguchi, P. & et al. (1988). Generation and characterization of B72.3 second generation monoclonal antibodies reactive with the tumor-associated glycoprotein 72 antigen. Cancer Res 48, 4588-96. 144. Roberge, M., Estabrook, M., Basler, J., Chin, R., Gualfetti, P., Liu, A., Wong, S. B., Rashid, M. H., Graycar, T., Babe, L. & Schellenberger, V. (2006). Construction and optimization of a CC49-based scFv-beta-lactamase for ADEPT. Protein Eng Des Sel 19, 141-5. 145. Demarest, S. J., Rogers, J. & Hansen, G. (2004). Optimization of the antibody C(H)3 domain by residue frequency analysis of IgG sequences. J Mol Biol 335, 41-8. 146. Demarest, S. J., Chen, G., Kimmel, B. E., Gustafson, D., Wu, J., Salbato, J., Poland, J., Elia, M., Tan, X., Wong, K., Short, J. & Hansen, G. (2006). Engineering stability into Escherichia coli secreted Fabs leads to increased functional expression. Protein Eng Des Sel 19, 325-36. 147. Pray, L. (2008). DNA Replication and Causes of Mutation. Nature Education 1. 148. Jackel, C., Bloom, J. D., Kast, P., Arnold, F. H. & Hilvert, D. (2010). Consensus protein design without phylogenetic bias. J Mol Biol 399, 541-6. 149. Woese, C. R. (1987). Bacterial evolition. Microbiology Review 51, 221-227. 150. Pace, N. R. (1991). Origin of life--facing up to the physical setting. Cell 65, 531- 3. 151. 
Nisbet, E. G. F., C.M.R. (1996). Some liked it hot. Nature 382, 404-405. 152. Forterre, P. (1996). A hot topic: the origin of hyperthermophiles. Cell 85, 789-92. 153. Miyazaki, J., Nakaya, S., Suzuki, T., Tamakoshi, M., Oshima, T. & Yamagishi, A. (2001). Ancestral residues stabilizing 3-isopropylmalate dehydrogenase of an extreme thermophile: experimental evidence supporting the thermophilic common ancestor hypothesis. J Biochem 129, 777-82. 154. Retief, J. D. (2000). Phylogenetic analysis using PHYLIP. Methods Mol Biol 132, 243-58. 155. Iwabata, H., Watanabe, K., Ohkuri, T., Yokobori, S. & Yamagishi, A. (2005). Thermostability of ancestral mutants of Caldococcus noboribetus isocitrate dehydrogenase. FEMS Microbiol Lett 243, 393-8.

213

156. Watanabe, K., Ohkuri, T., Yokobori, S. & Yamagishi, A. (2006). Designing thermostable proteins: ancestral mutants of 3-isopropylmalate dehydrogenase designed by using a phylogenetic tree. J Mol Biol 355, 664-74. 157. Watanabe, K. & Yamagishi, A. (2006). The effects of multiple ancestral residues on the Thermus thermophilus 3-isopropylmalate dehydrogenase. FEBS Lett 580, 3867-71. 158. Alcolombri, U., Elias, M. & Tawfik, D. S. (2011). Directed evolution of sulfotransferases and paraoxonases by ancestral libraries. J Mol Biol 411, 837-53. 159. Ortlund, E. A., Bridgham, J. T., Redinbo, M. R. & Thornton, J. W. (2007). Crystal structure of an ancient protein: evolution by conformational . Science 317, 1544-8. 160. Jurgens, C., Strom, A., Wegener, D., Hettwer, S., Wilmanns, M. & Sterner, R. (2000). Directed evolution of a (beta alpha)8-barrel enzyme to catalyze related reactions in two different metabolic pathways. Proc Natl Acad Sci U S A 97, 9925-30. 161. Schreiber, G. & Fersht, A. R. (1995). Energetics of protein-protein interactions: analysis of the barnase-barstar interface by single mutations and double mutant cycles. J Mol Biol 248, 478-86. 162. Carter, P. J., Winter, G., Wilkinson, A. J. & Fersht, A. R. (1984). The use of double mutants to detect structural changes in the active site of the tyrosyl-tRNA synthetase (Bacillus stearothermophilus). Cell 38, 835-40. 163. Hidalgo, P. & MacKinnon, R. (1995). Revealing the architecture of a K+ channel pore through mutant cycles with a peptide inhibitor. Science 268, 307-10. 164. Lockless, S. W. & Ranganathan, R. (1999). Evolutionarily conserved pathways of energetic connectivity in protein families. Science 286, 295-9. 165. Suel, G. M., Lockless, S. W., Wall, M. A. & Ranganathan, R. (2003). Evolutionarily conserved networks of residues mediate allosteric communication in proteins. Nat Struct Biol 10, 59-69. 166. Socolich, M., Lockless, S. W., Russ, W. P., Lee, H., Gardner, K. H. & Ranganathan, R. (2005). 
Evolutionary information for specifying a protein fold. Nature 437, 512-8. 167. Russ, W. P., Lowery, D. M., Mishra, P., Yaffe, M. B. & Ranganathan, R. (2005). Natural-like function in artificial WW domains. Nature 437, 579-83. 168. Halabi, N., Rivoire, O., Leibler, S. & Ranganathan, R. (2009). Protein sectors: evolutionary units of three-dimensional structure. Cell 138, 774-86. 169. Magliery, T. J. & Regan, L. (2004). Beyond consensus: statistical free energies reveal hidden interactions in the design of a TPR motif. J Mol Biol 343, 731-45. 170. Wang, N., Smith, W. F., Miller, B. R., Aivazian, D., Lugovskoy, A. A., Reff, M. E., Glaser, S. M., Croner, L. J. & Demarest, S. J. (2009). Conserved amino acid networks involved in antibody variable domain interactions. Proteins 76, 99-114. 171. Ozer, H. G. & Ray, W. C. (2006). MAVL/StickWRLD: analyzing structural constraints using interpositional dependencies in biomolecular sequence alignments. Nucleic Acids Res 34, W133-6.

214

172. Ray, W. C. (2004). MAVL and StickWRLD: visually exploring relationships in nucleic acid sequence alignments. Nucleic Acids Res 32, W59-63. 173. Ray, W. C. (2005). MAVL/StickWRLD for protein: visualizing protein sequence families to detect non-consensus features. Nucleic Acids Res 33, W315-9. 174. Nagano, N., Orengo, C. A. & Thornton, J. M. (2002). One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. J Mol Biol 321, 741-65. 175. Lolis, E., Alber, T., Davenport, R. C., Rose, D., Hartman, F. C. & Petsko, G. A. (1990). Structure of yeast triosephosphate isomerase at 1.9-A resolution. Biochemistry 29, 6609-18. 176. Gerlt, J. A. & Raushel, F. M. (2003). Evolution of function in (beta/alpha)8-barrel enzymes. Curr Opin Chem Biol 7, 252-64. 177. Gerlt, J. A. & Babbitt, P. C. (2009). Enzyme (re)design: lessons from natural evolution and computation. Curr Opin Chem Biol 13, 10-8. 178. Altamirano, M. M., Blackburn, J. M., Aguayo, C. & Fersht, A. R. (2000). Directed evolution of new catalytic activity using the alpha/beta-barrel scaffold. Nature 403, 617-22. 179. Altamirano, M. M., Blackburn, J. M., Aguayo, C. & Fersht, A. R. (2002). Retraction. Directed evolution of new catalytic activity using the alpha/beta-barrel scaffold. Nature 417, 468. 180. Silverman, J. A., Balakrishnan, R. & Harbury, P. B. (2001). Reverse engineering the (beta/alpha)8 barrel fold. Proc Natl Acad Sci USA 98, 3092-7. 181. Borchert, T. V., Abagyan, R., Kishan, K. V., Zeelen, J. P. & Wierenga, R. K. (1993). The crystal structure of an engineered monomeric triosephosphate isomerase, monoTIM: the correct modelling of an eight-residue loop. Structure 1, 205-13. 182. Borchert, T. V., Abagyan, R., Jaenicke, R. & Wierenga, R. K. (1994). Design, creation, and characterization of a stable, monomeric triosephosphate isomerase. Proc Natl Acad Sci USA 91, 1515-8. 183. Borchert, T. V., Zeelen, J. 
P., Schliebs, W., Callens, M., Minke, W., Jaenicke, R. & Wierenga, R. K. (1995). An interface point-mutation variant of triosephosphate isomerase is compactly folded and monomeric at low protein concentrations. FEBS Lett 367, 315-8. 184. Borchert, T. V., Kishan, K. V., Zeelen, J. P., Schliebs, W., Thanki, N., Abagyan, R., Jaenicke, R. & Wierenga, R. K. (1995). Three new crystal structures of point mutation variants of monoTIM: conformational flexibility of loop-1, loop-4 and loop-8. Structure 3, 669-79. 185. Schliebs, W., Thanki, N., Eritja, R. & Wierenga, R. (1996). Active site properties of monomeric triosephosphate isomerase (monoTIM) as deduced from mutational and structural studies. Protein Sci 5, 229-39. 186. Thanki, N., Zeelen, J. P., Mathieu, M., Jaenicke, R., Abagyan, R. A., Wierenga, R. K. & Schliebs, W. (1997). Protein engineering with monomeric triosephosphate isomerase (monoTIM): the modelling and structure verification of a seven-residue loop. Protein Eng 10, 159-67. 215

187. Schliebs, W., Thanki, N., Jaenicke, R. & Wierenga, R. K. (1997). A double mutation at the tip of the dimer interface loop of triosephosphate isomerase generates active monomers with reduced stability. Biochemistry 36, 9655-62. 188. Saab-Rincon, G., Juarez, V. R., Osuna, J., Sanchez, F. & Soberon, X. (2001). Different strategies to recover the activity of monomeric triosephosphate isomerase by directed evolution. Protein Eng 14, 149-55. 189. Mainfroid, V., Terpstra, P., Beauregard, M., Frere, J. M., Mande, S. C., Hol, W. G., Martial, J. A. & Goraj, K. (1996). Three hTIM mutants that provide new insights on why TIM is a dimer. J Mol Biol 257, 441-56. 190. Ralser, M., Heeren, G., Breitenbach, M., Lehrach, H. & Krobitsch, S. (2006). Triose phosphate isomerase deficiency is caused by altered dimerization--not catalytic inactivity--of the mutant enzymes. PLoS ONE 1, e30. 191. Alber, T., Banner, D. W., Bloomer, A. C., Petsko, G. A., Phillips, D., Rivers, P. S. & Wilson, I. A. (1981). On the three-dimensional structure and catalytic mechanism of triose phosphate isomerase. Philos Trans R Soc Lond B Biol Sci 293, 159-71. 192. Komives, E. A., Chang, L. C., Lolis, E., Tilton, R. F., Petsko, G. A. & Knowles, J. R. (1991). Electrophilic catalysis in triosephosphate isomerase: the role of histidine-95. Biochemistry 30, 3011-9. 193. Nickbarg, E. B., Davenport, R. C., Petsko, G. A. & Knowles, J. R. (1988). Triosephosphate isomerase: removal of a putatively electrophilic histidine residue results in a subtle change in catalytic mechanism. Biochemistry 27, 5948-60. 194. Plaut, B. & Knowles, J. R. (1972). pH-dependence of the triose phosphate isomerase reaction. Biochem J 129, 311-20. 195. Go, M. K., Koudelka, A., Amyes, T. L. & Richard, J. P. (2010). Role of Lys-12 in catalysis by triosephosphate isomerase: a two-part substrate approach. Biochemistry 49, 5377-89. 196. Veech, R. L., Raijman, L., Dalziel, K. & Krebs, H. A. (1969). 
Disequilibrium in the triose phosphate isomerase system in rat liver. Biochem J 115, 837-42. 197. Desamero, R., Rozovsky, S., Zhadin, N., McDermott, A. & Callender, R. (2003). Active site loop motion in triosephosphate isomerase: T-jump relaxation spectroscopy of thermal activation. Biochemistry 42, 2941-51. 198. Rozovsky, S., Jogl, G., Tong, L. & McDermott, A. E. (2001). Solution-state NMR investigations of triosephosphate isomerase active site loop motion: ligand release in relation to active site loop dynamics. J Mol Biol 310, 271-80. 199. Rozovsky, S. & McDermott, A. E. (2001). The time scale of the catalytic loop motion in triosephosphate isomerase. J Mol Biol 310, 259-70. 200. Williams, J. C. & McDermott, A. E. (1995). Dynamics of the flexible loop of triosephosphate isomerase: the loop motion is not ligand gated. Biochemistry 34, 8309-19. 201. Kempf, J. G., Jung, J. Y., Ragain, C., Sampson, N. S. & Loria, J. P. (2007). Dynamic requirements for a functional protein hinge. J Mol Biol 368, 131-49.

216

202. Wang, Y., Berlow, R. B. & Loria, J. P. (2009). Role of loop-loop interactions in coordinating motions and enzymatic function in triosephosphate isomerase. Biochemistry 48, 4548-56. 203. Dill, K. A. (1990). Dominant forces in protein folding. Biochemistry 29, 7133-55. 204. Rose, G. D. & Wolfenden, R. (1993). Hydrogen bonding, hydrophobicity, packing, and protein folding. Annu Rev Biophys Biomol Struct 22, 381-415. 205. Magliery, T. J. & Regan, L. (2004). Combinatorial approaches to protein stability and structure. Eur J Biochem 271, 1595-608. 206. Roodveldt, C., Aharoni, A. & Tawfik, D. S. (2005). Directed evolution of proteins for heterologous expression and stability. Curr Opin Struct Biol 15, 50-6. 207. Senisterra, G. A. & Finerty, P. J., Jr. (2009). High throughput methods of assessing protein stability and aggregation. Mol Biosyst 5, 217-23. 208. Giver, L., Gershenson, A., Freskgard, P. O. & Arnold, F. H. (1998). Directed evolution of a thermostable esterase. Proc Natl Acad Sci U S A 95, 12809-13. 209. Foit, L., Morgan, G. J., Kern, M. J., Steimer, L. R., von Hacht, A. A., Titchmarsh, J., Warriner, S. L., Radford, S. E. & Bardwell, J. C. (2009). Optimizing protein stability in vivo. Mol Cell 36, 861-71. 210. Park, C. & Marqusee, S. (2005). Pulse proteolysis: a simple method for quantitative determination of protein stability and ligand binding. Nat Methods 2, 207-12. 211. Park, C., Zhou, S., Gilmore, J. & Marqusee, S. (2007). Energetics-based protein profiling on a proteomic scale: identification of proteins resistant to proteolysis. J Mol Biol 368, 1426-37. 212. Pelletier, J. N., Campbell-Valois, F. X. & Michnick, S. W. (1998). Oligomerization domain-directed reassembly of active dihydrofolate reductase from rationally designed fragments. Proc Natl Acad Sci U S A 95, 12141-6. 213. Magliery, T. J., Wilson, C. G., Pan, W., Mishler, D., Ghosh, I., Hamilton, A. D. & Regan, L. (2005). 
Detecting protein-protein interactions with a green fluorescent protein fragment reassembly trap: scope and mechanism. J Am Chem Soc 127, 146-57. 214. Hu, C. D., Chinenov, Y. & Kerppola, T. K. (2002). Visualization of interactions among bZIP and Rel family proteins in living cells using bimolecular fluorescence complementation. Mol Cell 9, 789-98. 215. Paulmurugan, R., Umezawa, Y. & Gambhir, S. S. (2002). Noninvasive imaging of protein-protein interactions in living subjects by using reporter protein complementation and reconstitution strategies. Proc Natl Acad Sci U S A 99, 15608-13. 216. Stefan, E., Aquin, S., Berger, N., Landry, C. R., Nyfeler, B., Bouvier, M. & Michnick, S. W. (2007). Quantification of dynamic protein complexes using Renilla luciferase fragment complementation applied to protein kinase A activities in vivo. Proc Natl Acad Sci U S A 104, 16916-21. 217. Dutta, S., Koide, A. & Koide, S. (2008). High-throughput analysis of the protein sequence-stability landscape using a quantitative yeast surface two-hybrid system and fragment reconstitution. J Mol Biol 382, 721-33. 217

218. Lindman, S., Hernandez-Garcia, A., Szczepankiewicz, O., Frohm, B. & Linse, S. (2010). In vivo protein stabilization based on fragment complementation and a split GFP system. Proc Natl Acad Sci U S A 107, 19826-31. 219. Waldo, G. S., Standish, B. M., Berendzen, J. & Terwilliger, T. C. (1999). Rapid protein-folding assay using green fluorescent protein. Nat Biotechnol 17, 691-5. 220. Cabantous, S., Terwilliger, T. C. & Waldo, G. S. (2005). Protein tagging and detection with engineered self-assembling fragments of green fluorescent protein. Nat Biotechnol 23, 102-7. 221. Listwan, P., Terwilliger, T. C. & Waldo, G. S. (2009). Automated, high- throughput platform for protein solubility screening using a split-GFP system. J Struct Funct Genomics 10, 47-55. 222. Stites, W. E., Byrne, M. P., Aviv, J., Kaplan, M. & Curtis, P. M. (1995). Instrumentation for automated determination of protein stability. Anal Biochem 227, 112-22. 223. Edgell, M. H., Sims, D. A., Pielak, G. J. & Yi, F. (2003). High-precision, high- throughput stability determinations facilitated by robotics and a semiautomated titrating fluorometer. Biochemistry 42, 7587-93. 224. Aucamp, J. P., Cosme, A. M., Lye, G. J. & Dalby, P. A. (2005). High-throughput measurement of protein stability in microtiter plates. Biotechnol Bioeng 89, 599- 607. 225. Allen, B. D., Nisthal, A. & Mayo, S. L. (2010). Experimental library screening demonstrates the successful application of computational protein design to large structural ensembles. Proc Natl Acad Sci U S A 107, 19838-43. 226. Gaudet, M., Remtulla, N., Jackson, S. E., Main, E. R., Bracewell, D. G., Aeppli, G. & Dalby, P. A. (2010). Protein denaturation and protein:drugs interactions from intrinsic protein fluorescence measurements at the nanolitre scale. Protein Sci 19, 1544-54. 227. Ghaemmaghami, S., Fitzgerald, M. C. & Oas, T. G. (2000). A quantitative, high- throughput screen for protein stability. Proc Natl Acad Sci U S A 97, 8296-301. 228. Ghaemmaghami, S. 
& Oas, T. G. (2001). Quantitative protein stability measurement in vivo. Nat Struct Biol 8, 879-82. 229. Ignatova, Z. & Gierasch, L. M. (2004). Monitoring protein stability and aggregation in vivo by real-time fluorescent labeling. Proc Natl Acad Sci U S A 101, 523-8. 230. West, G. M., Tang, L. & Fitzgerald, M. C. (2008). Thermodynamic analysis of protein stability and ligand binding using a chemical modification- and mass spectrometry-based strategy. Anal Chem 80, 4175-85. 231. Silverman, J. A. & Harbury, P. B. (2002). Rapid mapping of protein structure, interactions, and ligand binding by misincorporation proton-alkyl exchange. J Biol Chem 277, 30968-75. 232. Isom, D. G., Marguet, P. R., Oas, T. G. & Hellinga, H. W. (2011). A miniaturized technique for assessing protein thermodynamics and function using fast determination of quantitative cysteine reactivity. Proteins 79, 1034-47.

218

233. Isom, D. G., Vardy, E., Oas, T. G. & Hellinga, H. W. (2010). Picomole-scale characterization of protein stability and function by quantitative cysteine reactivity. Proc Natl Acad Sci U S A 107, 4908-13. 234. Senisterra, G. A., Markin, E., Yamazaki, K., Hui, R., Vedadi, M. & Awrey, D. E. (2006). Screening for ligands using a generic and high-throughput light- scattering-based assay. J Biomol Screen 11, 940-8. 235. Senisterra, G. A., Soo Hong, B., Park, H. W. & Vedadi, M. (2008). Application of high-throughput isothermal denaturation to assess protein stability and screen for ligands. J Biomol Screen 13, 337-42. 236. Moreau, M. J., Morin, I. & Schaeffer, P. M. (2010). Quantitative determination of protein stability and ligand binding using a green fluorescent protein reporter system. Mol Biosyst 6, 1285-92. 237. Pantoliano, M. W., Petrella, E. C., Kwasnoski, J. D., Lobanov, V. S., Myslik, J., Graf, E., Carver, T., Asel, E., Springer, B. A., Lane, P. & Salemme, F. R. (2001). High-density miniaturized thermal shift assays as a general strategy for drug discovery. J Biomol Screen 6, 429-40. 238. Ericsson, U. B., Hallberg, B. M., Detitta, G. T., Dekker, N. & Nordlund, P. (2006). Thermofluor-based high-throughput stability optimization of proteins for structural studies. Anal Biochem 357, 289-98. 239. Vedadi, M., Niesen, F. H., Allali-Hassani, A., Fedorov, O. Y., Finerty, P. J., Jr., Wasney, G. A., Yeung, R., Arrowsmith, C., Ball, L. J., Berglund, H., Hui, R., Marsden, B. D., Nordlund, P., Sundstrom, M., Weigelt, J. & Edwards, A. M. (2006). Chemical screening methods to identify ligands that promote protein stability, protein crystallization, and structure determination. Proc Natl Acad Sci U S A 103, 15835-40. 240. Layton, C. J. & Hellinga, H. W. (2010). Thermodynamic analysis of ligand- induced changes in protein thermal unfolding applied to high-throughput determination of ligand affinities with extrinsic fluorescent dyes. Biochemistry 49, 10831-41. 241. Cox, J. 
C., Lape, J., Sayed, M. A. & Hellinga, H. W. (2007). Protein fabrication automation. Protein Sci 16, 379-90. 242. Kosuri, S., Eroshenko, N., Leproust, E. M., Super, M., Way, J., Li, J. B. & Church, G. M. (2010). Scalable gene synthesis by selective amplification of DNA pools from high-fidelity microchips. Nat Biotechnol 28, 1295-9. 243. Jewett, M. C. & Swartz, J. R. (2004). Rapid expression and purification of 100 nmol quantities of active protein using cell-free protein synthesis. Biotechnol Prog 20, 102-9. 244. Rosenbaum, D. M., Rasmussen, S. G. & Kobilka, B. K. (2009). The structure and function of G-protein-coupled receptors. Nature 459, 356-63. 245. Alexandrov, A. I., Mileni, M., Chien, E. Y., Hanson, M. A. & Stevens, R. C. (2008). Microscale fluorescent thermal stability assay for membrane proteins. Structure 16, 351-9.

219

246. Liu, W., Hanson, M. A., Stevens, R. C. & Cherezov, V. (2010). LCP-Tm: an assay to measure and understand stability of membrane proteins in a membrane environment. Biophys J 98, 1539-48. 247. Postis, V. L., Deacon, S. E., Roach, P. C., Wright, G. S., Xia, X., Ingram, J. C., Hadden, J. M., Henderson, P. J., Phillips, S. E., McPherson, M. J. & Baldwin, S. A. (2008). A high-throughput assay of membrane protein stability. Mol Membr Biol 25, 617-24. 248. Yeh, A. P., McMillan, A. & Stowell, M. H. (2006). Rapid and simple protein- stability screens: application to membrane proteins. Acta Crystallogr D Biol Crystallogr 62, 451-7. 249. He, F., Hogan, S., Latypov, R. F., Narhi, L. O. & Razinkov, V. I. (2010). High throughput thermostability screening of monoclonal antibody formulations. J Pharm Sci 99, 1707-20. 250. Goldberg, D. S., Bishop, S. M., Shah, A. U. & Sathish, H. A. (2010). Formulation development of therapeutic monoclonal antibodies using high-throughput fluorescence and static light scattering techniques: Role of conformational and colloidal stability. J Pharm Sci. 251. Miller, B. R., Demarest, S. J., Lugovskoy, A., Huang, F., Wu, X., Snyder, W. B., Croner, L. J., Wang, N., Amatucci, A., Michaelson, J. S. & Glaser, S. M. (2010). Stability engineering of scFvs for the development of bispecific and multivalent antibodies. Protein Eng Des Sel 23, 549-57. 252. Lacy, E. R., Baker, M. & Brigham-Burke, M. (2008). Free sulfhydryl measurement as an indicator of antibody stability. Anal Biochem 382, 66-8. 253. Wirtz, P. & Steipe, B. (1999). Intrabody construction and expression III: engineering hyperstable V(H) domains. Protein Sci 8, 2245-50. 254. Mosavi, L. K., Minor, D. L., Jr. & Peng, Z. Y. (2002). Consensus-derived structural determinants of the ankyrin repeat motif. Proc Natl Acad Sci USA 99, 16029-34. 255. Khersonsky, O., Rosenblat, M., Toker, L., Yacobson, S., Hugenmatter, A., Silman, I., Sussman, J. L., Aviram, M. & Tawfik, D. S. (2009). 
Directed evolution of serum paraoxonase PON3 by family shuffling and ancestor/consensus mutagenesis, and its biochemical characterization. Biochemistry 48, 6644-54. 256. Hatley, M. E., Lockless, S. W., Gibson, S. K., Gilman, A. G. & Ranganathan, R. (2003). Allosteric determinants in guanine nucleotide-binding proteins. Proc Natl Acad Sci U S A 100, 14445-50. 257. Lee, J., Natarajan, M., Nashine, V. C., Socolich, M., Vo, T., Russ, W. P., Benkovic, S. J. & Ranganathan, R. (2008). Surface sites for engineering allosteric control in proteins. Science 322, 438-42. 258. Smock, R. G., Rivoire, O., Russ, W. P., Swain, J. F., Leibler, S., Ranganathan, R. & Gierasch, L. M. (2010). An interdomain sector mediating allostery in molecular chaperones. Mol Syst Biol 6, 414. 259. Sullivan, B. J., Durani, V. & Magliery, T. J. (2011). Triosephosphate isomerase by consensus design: dramatic differences in physical properties and activity of related variants. J Mol Biol 413, 195-208. 220

260. Cordes, M. H., Davidson, A. R. & Sauer, R. T. (1996). Sequence space, folding and protein design. Curr Opin Struct Biol 6, 3-10. 261. Knappik, A., Ge, L., Honegger, A., Pack, P., Fischer, M., Wellnhofer, G., Hoess, A., Wolle, J., Pluckthun, A. & Virnekas, B. (2000). Fully synthetic human combinatorial antibody libraries (HuCAL) based on modular consensus frameworks and CDRs randomized with trinucleotides. J Mol Biol 296, 57-86. 262. Godoy-Ruiz, R., Perez-Jimenez, R., Ibarra-Molero, B. & Sanchez-Ruiz, J. M. (2005). A stability pattern of protein hydrophobic mutations that reflects evolutionary structural optimization. Biophys J 89, 3320-31. 263. Pey, A. L., Rodriguez-Larrea, D., Bomke, S., Dammers, S., Godoy-Ruiz, R., Garcia-Mira, M. M. & Sanchez-Ruiz, J. M. (2008). Engineering proteins with tunable thermodynamic and kinetic stabilities. Proteins 71, 165-74. 264. Silverman, J. A., Balakrishnan, R. & Harbury, P. B. (2001). Reverse engineering the (beta/alpha )8 barrel fold. Proc Natl Acad Sci U S A 98, 3092-7. 265. Nickbarg, E. B. & Knowles, J. R. (1988). Triosephosphate isomerase: energetics of the reaction catalyzed by the yeast enzyme expressed in Escherichia coli. Biochemistry 27, 5939-47. 266. Stemmer, W. P., Crameri, A., Ha, K. D., Brennan, T. M. & Heyneker, H. L. (1995). Single-step assembly of a gene and entire plasmid from large numbers of oligodeoxyribonucleotides. Gene 164, 49-53. 267. Babul, J. (1978). Phosphofructokinases from Escherichia coli. Purification and characterization of the nonallosteric isozyme. J Biol Chem 253, 4350-5. 268. Baba, T., Ara, T., Hasegawa, M., Takai, Y., Okumura, Y., Baba, M., Datsenko, K. A., Tomita, M., Wanner, B. L. & Mori, H. (2006). Construction of Escherichia coli K-12 in-frame, single-gene knockout mutants: the Keio collection. Mol Syst Biol 2, 2006 0008. 269. Magliery, T. J., Lavinder, J. J. & Sullivan, B. J. (2011). 
Protein stability by number: high-throughput and statistical approaches to one of protein science's most difficult problems. Curr Opin Chem Biol. 270. Borchert, T. V., Pratt, K., Zeelen, J. P., Callens, M., Noble, M. E., Opperdoes, F. R., Michels, P. A. & Wierenga, R. K. (1993). Overexpression of trypanosomal triosephosphate isomerase in Escherichia coli and characterisation of a dimer- interface mutant. Eur J Biochem 211, 703-10. 271. Christensen, H. & Pain, R. H. (1991). Molten globule intermediates and protein folding. Eur Biophys J 19, 221-9. 272. Ptitsyn, O. B., Pain, R. H., Semisotnov, G. V., Zerovnik, E. & Razgulyaev, O. I. (1990). Evidence for a molten globule state as a general intermediate in protein folding. FEBS Lett 262, 20-4. 273. Putman, S. J., Coulson, A. F., Farley, I. R., Riddleston, B. & Knowles, J. R. (1972). Specificity and kinetics of triose phosphate isomerase from chicken muscle. Biochem J 129, 301-10. 274. Koradi, R., Billeter, M. & Wuthrich, K. (1996). MOLMOL: a program for display and analysis of macromolecular structures. J Mol Graph 14, 51-5, 29-32.

221

275. Henikoff, S. & Henikoff, J. G. (1992). Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89, 10915-9.
276. Pervushin, K., Vamvaca, K., Vogeli, B. & Hilvert, D. (2007). Structure and dynamics of a molten globular enzyme. Nat Struct Mol Biol 14, 1202-6.
277. Vamvaca, K., Vogeli, B., Kast, P., Pervushin, K. & Hilvert, D. (2004). An enzymatic molten globule: efficient coupling of folding and catalysis. Proc Natl Acad Sci U S A 101, 12860-4.
278. Joerger, A. C., Ang, H. C. & Fersht, A. R. (2006). Structural basis for understanding oncogenic p53 mutations and designing rescue drugs. Proc Natl Acad Sci U S A 103, 15056-61.
279. Zhou, H., Zhang, C., Liu, S. & Zhou, Y. (2005). Web-based toolkits for topology prediction of transmembrane helical proteins, fold recognition, structure and binding scoring, folding-kinetics analysis and comparative analysis of domain combinations. Nucleic Acids Res 33, W193-7.
280. Arnold, F. H. (2001). Combinatorial and computational challenges for biocatalyst design. Nature 409, 253-7.
281. Shimizu, H., Yokobori, S., Ohkuri, T., Yokogawa, T., Nishikawa, K. & Yamagishi, A. (2007). Extremely thermophilic translation system in the common ancestor commonote: ancestral mutants of Glycyl-tRNA synthetase from the extreme thermophile Thermus thermophilus. J Mol Biol 369, 1060-9.
282. Yamashiro, K., Yokobori, S., Koikeda, S. & Yamagishi, A. (2010). Improvement of Bacillus circulans beta-amylase activity attained using the ancestral mutation method. Protein Eng Des Sel 23, 519-28.
283. Noble, M. E., Zeelen, J. P., Wierenga, R. K., Mainfroid, V., Goraj, K., Gohimont, A. C. & Martial, J. A. (1993). Structure of triosephosphate isomerase from Escherichia coli determined at 2.6 Å resolution. Acta Crystallogr D Biol Crystallogr 49, 403-17.
284. Mande, S. C., Mainfroid, V., Kalk, K. H., Goraj, K., Martial, J. A. & Hol, W. G. (1994). Crystal structure of recombinant human triosephosphate isomerase at 2.8 Å resolution. Triosephosphate isomerase-related human genetic disorders and comparison with the trypanosomal enzyme. Protein Sci 3, 810-21.
285. Wierenga, R. K., Kalk, K. H. & Hol, W. G. (1987). Structure determination of the glycosomal triosephosphate isomerase from Trypanosoma brucei brucei at 2.4 Å resolution. J Mol Biol 198, 109-21.
286. Lambeir, A. M., Opperdoes, F. R. & Wierenga, R. K. (1987). Kinetic properties of triose-phosphate isomerase from Trypanosoma brucei brucei. A comparison with the rabbit muscle and yeast enzymes. Eur J Biochem 168, 69-74.
287. Dabrowska, A., Kamrowska, I. & Baranowski, T. (1978). Purification, crystallization and properties of triosephosphate isomerase from human skeletal muscle. Acta Biochim Pol 25, 247-56.
288. Cover, T. M. & Thomas, J. A. (2006). Elements of Information Theory. 2nd edit, John Wiley & Sons, Inc., Hoboken.
289. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. (2002). The KEGG databases at GenomeNet. Nucleic Acids Res 30, 42-6.

290. Godoy-Ruiz, R., Ariza, F., Rodriguez-Larrea, D., Perez-Jimenez, R., Ibarra-Molero, B. & Sanchez-Ruiz, J. M. (2006). Natural selection for kinetic stability is a likely origin of correlations between mutational effects on protein energetics and frequencies of amino acid occurrences in sequence alignments. J Mol Biol 362, 966-78.
291. Hietpas, R. T., Jensen, J. D. & Bolon, D. N. (2011). Experimental illumination of a fitness landscape. Proc Natl Acad Sci U S A 108, 7896-901.
292. Ernst, A., Gfeller, D., Kan, Z., Seshagiri, S., Kim, P. M., Bader, G. D. & Sidhu, S. S. (2010). Coevolution of PDZ domain-ligand interactions analyzed by high-throughput phage display and deep sequencing. Mol Biosyst 6, 1782-90.
293. Fowler, D. M., Araya, C. L., Fleishman, S. J., Kellogg, E. H., Stephany, J. J., Baker, D. & Fields, S. (2010). High-resolution mapping of protein sequence-function relationships. Nat Methods 7, 741-6.
294. Dwyer, M. A., Looger, L. L. & Hellinga, H. W. (2004). Computational design of a biologically active enzyme. Science 304, 1967-71.
