Exploring the sequence landscape of the model protein Rop to gain insights into sequence-stability relationship in proteins

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Nishanthi Panneerselvam

Graduate Program in Biophysics

The Ohio State University

2017

Dissertation Committee:

Professor Thomas J. Magliery, Advisor

Professor Charles E. Bell

Professor Ralf A. Bundschuh

Copyrighted by

Nishanthi Panneerselvam

2017

Abstract

Surface residues and surface electrostatics play an active role in maintaining protein stability. A combinatorial library randomizing five surface positions in Rop to NNK (K = G or T) codons was constructed. Interestingly, the consensus sequence had positively charged residues in place of the negatively charged residues present in the wild type. Specifically, lysines were found more often than arginines. To investigate this, each of the five positions was individually mutated to Lys and to Arg, and four poly mutants were made by mutating multiple positions to lysines or arginines. Most of the point mutants were found to be active in the Rop screen. All mutants were well folded, as seen by circular dichroism. An important result was that most of the mutants were more thermally stable than the Cys-free wild-type scaffold AV-Rop. Thermodynamic parameters were determined by Gibbs-Helmholtz analysis, and the entropic change was higher for most mutants. Solubility was affected more in the poly mutants than in the point mutants. HSQC on selected variants revealed that the least stable mutant and the poly mutants had more shifted peaks than the most stable mutant. No considerable differences were found between Lys and Arg mutants. Poly mutants had varied effects on activity and solubility compared to point mutants.

If all possible point mutants of a protein are subjected to moderate selection for activity or function, high-throughput sequencing can measure the relative abundance of each mutant in a bulk competition experiment, which in turn gives the relative fitness of each mutant. This project involves making all possible point mutants of Rop by constructing 20-member positional libraries individually at 62 positions of Rop. Preliminary data include proof-of-principle selection experiments at three positions of Rop (F14, D30 and D43). Variants were enriched 1,000-fold for activity by performing six rounds of growth-based selection. Likewise, we are in the process of constructing 62 positional libraries, which will be combined in the selection screen, subjected to enrichment and then sequenced by high-throughput methods to give the sequence landscape. Analyzing the sequence/fitness landscape of Rop will yield each mutant’s competitive advantage or disadvantage.

Acknowledgments

First, I would like to thank my advisor, Dr. Thomas Magliery, for providing me the opportunity to work in his lab. I have learnt a great deal from him, and I have always admired his passion for research. This degree would not have been possible without the valuable input of his expertise. I thank him for his constant support and encouragement during tough times, and I owe him a lot for it. I thank the Magliery group members for providing support and guidance. I would like to thank Dr. Ralf Bundschuh for being on my committee and for advising me as the Director of Biophysics. I would also like to thank my committee members Dr. Chuck Bell and Dr. Christopher Jaroniec for providing me feedback and guidance. I would like to thank everyone else who supported me in some way through my graduate school career. I would like to thank my mom and dad, who played a big part in my finishing the degree by being there for me. I would also like to thank my bestie Geethi for every help, small and big, that she gave me. I want to thank my friends Kanu, Mithila, Aishu and Nitya for making graduate school fun. I want to thank my husband Karthik for his support during the completion of the degree. I would like to thank my sisters for always being there for me. I also want to thank all my other friends and family.

Vita

2009 ...... B. Tech. Industrial Biotechnology, Anna University

2013 ...... M.S. Biophysics, The Ohio State University

2013-present ...... Graduate associate, Biophysics program, The Ohio State University

Fields of Study

Major Field: Biophysics

Table of Contents

Abstract

Acknowledgments

Vita

List of Tables

List of Figures

Chapter 1: Introduction

1.1 Protein stability

1.2 Plasmid regulation by the protein ‘ROP’

1.3 ROP as a model protein

1.4 Role of surface residues in protein stability

1.5 Effect of surface electrostatics on protein stability

1.6 Other factors that affect protein stability

1.7 Lysines vs Arginines on the protein surface

1.8 Solubility

1.9 Deep sequencing

1.9.1 DNA Sequencing

1.9.2 Chain terminator sequencing

1.9.3 Second generation sequencing

1.9.4 Next generation sequencing methods

1.10 General applications of next generation sequencing

1.11 Applications of deep sequencing in protein stability and fitness studies

Chapter 2: Engineering surface residues of the four helix bundle protein Rop

2.1 Summary

2.2.1 Plasmids and strains

2.2.2 Rop NNK5-surface library construction

2.2.3 Screen and selection for active variants

2.2.4 Resequencing of pAC surface library

2.3 Results

2.4 Discussion

Chapter 3: Effect of surface lysines and arginines on stability and other characteristics of Rop

3.1 Summary

3.2 Materials and Methods

3.2.1 Cloning and Sequencing

3.2.2 Solubility

3.2.3 Activity

3.2.4 Protein expression and purification

3.2.5 Circular Dichroism

3.2.6 Gibbs-Helmholtz Analysis

3.2.7 HSQC NMR

3.3 Results

3.3.1 Construction of lysine and arginine variants

3.3.2 Activity

3.3.3 Stability

3.3.4 Solubility

3.3.5 Structural changes and electrostatics

3.4 Discussion

Chapter 4: Exploring the sequence landscape of the four helix bundle protein Rop using Illumina sequencing

4.1 Summary

4.2 Materials and Methods

4.2.1 Materials

4.2.2 Instrumentation

4.2.3 Plasmids and strains

4.2.4 Methods

4.3 Results

4.3.1 Construction of the pMR point mutant library

4.3.2 Mock growth selection

4.3.3 Preliminary analysis of the point mutant library

4.3.4 Construction of pAC point mutant library

4.4 Discussion

References

List of Tables

Table 1: Comparison of next generation sequencing platforms

Table 2: Oligonucleotide sequences used for the construction of surface library

Table 3: Sequences from round three of selection of pAC library

Table 4: Sequences of actives from round four of selection of pMR library

Table 5: Oligonucleotide sequences used for the construction of point mutants and multi-mutants of lysine and arginine variants in both pMR and pAC context

Table 6: Thermodynamic parameters

Table 7: Oligonucleotide sequences for pMR NNN point mutant library

Table 8: Primers for bar-coded Illumina sequencing DNA template generation

Table 9: Barcodes used for the four rounds which were Illumina sequenced

Table 10: Oligos for pAC NNN point mutant library

Table 11: Barcode oligos for pAC point mutant library (barcodes are highlighted in red)

List of Figures

Figure 1: Schematic of Rop

Figure 2: Negative and positive screen of Rop

Figure 3: An example showing the optimization of charge-charge interactions in a model protein; negative energies represent mutations with favorable energies

Figure 4: Schematic of chain terminator sequencing

Figure 5: Dye termination sequencing

Figure 6: Schematic of pyrosequencing

Figure 7: Schematic of Illumina sequencing

Figure 8: Side and top view of AV-Rop homodimer with the library residues highlighted

Figure 9: Results from sequencing of pAC surface library (top) and pMR surface library (bottom)

Figure 10: Distribution of library hits by position; pMR is in blue and pAC is in red

Figure 11: Homodimer of AV-Rop with library residues highlighted

Figure 12: Activity of Rop variants (a) Plasmid levels (b) In vivo activity

Figure 13: Circular dichroism scans of variants (top); Thermal melts of point mutants (bottom)

Figure 14: Thermal melts of multi-mutants (top); Urea melts of all variants (bottom)

Figure 15: Solubility levels of all variants

Figure 16: HSQC NMR spectra of variants

Figure 17: Gibbs-Helmholtz curves of variants

Figure 18: Maps of screening vector pUCBADGFPuv, pMR cloning vector pMRH6 and pAC cloning vector pACT7lacCm-AflIII

Figure 19: PCR design for the pMR NNN point mutant library in AV ROP

Figure 20: Overall experimental design for pMR point mutant library

Figure 21: Illumina sequencing

Figure 22: pAC enrichment (top) and pMR enrichment (bottom)

Figure 23: Graphs of amino acid count excluding the wild-type amino acid in each position. The wild-type amino acid occurred at very high amounts compared to other residues

Figure 24: Example heat map generated from the limited amount of information

Figure 25: PCR design for the pAC NNN point mutant library in AV Rop

Chapter 1: Introduction

1.1 Protein stability

Protein engineering studies necessitate an understanding of the protein folding problem. Anfinsen1 put forward the thermodynamic hypothesis based on his study of RNase and postulated that the native state of a protein is its thermodynamic minimum, or lowest free energy state. The conformational stability of a protein is obtained by calculating the free energy difference between the folded (native) state and the unfolded (denatured) state. The free energy of each of these two states consists of two sometimes counteracting contributions, enthalpy and entropy. The magnitude of these contributions depends on the interactions of the protein with itself and with the solvent. The entropy-driven burial of hydrophobic residues has been key to the study of sequence-stability relationships. According to early studies2-5, the hydrophobic effect is the main driving force in protein folding.

Despite many studies focused on understanding protein structure and folding, accurately measuring these large forces is difficult. This is because the free energy difference between the folded and unfolded states is only 5-15 kcal/mol, with the native state being energetically favored6. On the other hand, the enthalpy and entropy differences between the folded and unfolded states range up to hundreds of kcal/mol.
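
The point that a small free energy difference arises from two much larger, opposing terms can be illustrated with a short calculation. This is only a sketch: the enthalpy and entropy values are hypothetical numbers chosen to fall in the ranges quoted above, and the function name is introduced here for illustration.

```python
# Illustrative arithmetic only: delta_h and delta_s are hypothetical values,
# not measurements for Rop or any other protein.

def unfolding_free_energy(delta_h, delta_s, temperature):
    """Return dG = dH - T*dS (kcal/mol), with dH in kcal/mol and dS in kcal/(mol K)."""
    return delta_h - temperature * delta_s

T = 298.15        # temperature in K
delta_h = 100.0   # hypothetical unfolding enthalpy, kcal/mol
delta_s = 0.305   # hypothetical unfolding entropy, kcal/(mol K)

delta_g = unfolding_free_energy(delta_h, delta_s, T)

# A 2% error in the enthalpy term alone shifts dG by 2 kcal/mol,
# which is a large fraction of the net stability.
delta_g_perturbed = unfolding_free_energy(delta_h * 1.02, delta_s, T)

print(f"dG = {delta_g:.2f} kcal/mol")
print(f"dG with 2% larger dH = {delta_g_perturbed:.2f} kcal/mol")
```

With these assumed inputs the net stability is roughly a tenth of either term, so a small relative error in enthalpy or entropy becomes a large relative error in the free energy.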

Computational modeling and prediction can be useful in dealing with such complex problems. The conformational space of proteins is so large that predicting structure from sequence in the forward direction is very difficult. Instead, we can invert the problem and find sequences that fit a target structure of low free energy. This is the inverse protein folding problem7,8. As the sequence space available to proteins is also very large, the inverse protein folding problem is itself complex. Structural information from the literature aids greatly in predicting sequences that fit low-energy structures.

The relationship between sequence and structure is studied by either rational design or combinatorial design, apart from computational predictions of free energies. Rational design is carried out by site-directed mutagenesis of a protein with known structural information. Combinatorial design is based on the generation of libraries by randomizing certain positions in the protein to a desired set of amino acids, still guided by rational design principles. The randomization is done by PCR using degenerate codons. For example, NNN (N = A, C, G or T) covers all 64 possible codons. The degenerate codon can be varied as needed to favor a preferred set of amino acids, although the choice is still limited by codon degeneracy. A large number of the variants in any combinatorial library are not active or functional. Library construction must therefore be followed by selection of active variants, or “hits”, for a certain property of the protein, and these hits can then be sequenced and characterized. Sequencing the hits by traditional Sanger sequencing is limiting in time and cost, whereas the emergence of second generation sequencing (also called deep sequencing or next generation sequencing) technologies has reduced the time and money required for these purposes.
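
The effect of the degenerate codon choice on a library can be checked by enumerating codons against the standard genetic code. The sketch below compares NNN with NNK (K = G or T); the helper names are introduced here for illustration, and the five-position library size is simple arithmetic rather than a reported result.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# IUPAC degenerate base symbols used in this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "K": "GT"}

def expand(degenerate_codon):
    """List every concrete codon encoded by a degenerate codon such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]

def summarize(degenerate_codon):
    """Return (number of codons, number of amino acids, number of stop codons)."""
    encoded = [CODON_TABLE[c] for c in expand(degenerate_codon)]
    amino_acids = set(encoded) - {"*"}
    return len(encoded), len(amino_acids), encoded.count("*")

for scheme in ("NNN", "NNK"):
    n_codons, n_aa, n_stop = summarize(scheme)
    print(f"{scheme}: {n_codons} codons, {n_aa} amino acids, {n_stop} stop codons")

# Theoretical diversity when five positions are randomized with NNK.
dna_diversity = summarize("NNK")[0] ** 5
protein_diversity = 20 ** 5
print(dna_diversity, protein_diversity)
```

NNK still reaches all 20 amino acids with half the codons and a single stop codon (TAG), which is one common reason to prefer it over NNN for library construction.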

For many diagnostic and therapeutic applications, stabilizing proteins is an important goal. This has led to many protein engineering studies aimed at improving thermostability, and various strategies have been tested experimentally. One approach9 was oligonucleotide-directed mutagenesis in which alpha-helical glycine residues in Lambda repressor were replaced with alanine residues. Ala is a better helix-forming residue and had a stabilizing effect. Thermostability of the mutants was found to increase by up to 6 degrees compared to wild type. In two other studies10,11, mutations from any amino acid to Pro and the introduction of disulfide bonds appeared to decrease the configurational entropy of the unfolded state. Salt bridges have also been found to stabilize proteins in various studies. Salt bridge engineering was previously based only on the spatial orientation of the residues and not on their energetics. Two factors have been proposed to play a role in engineering salt bridges for stability12. The first factor depends on the spatial orientation, which in turn sets the distance between the oppositely charged residues, and on the degree of desolvation when the salt bridge forms. Calculations using solvation terms can be used to find the intrinsic strength of the salt bridge. The second factor is the interaction of the side chains forming the salt bridge with the rest of the protein. In other words, the local context of the salt bridge should be considered. Other minor factors include local structure propensity, helix dipole interactions and solvation. In spite of all these studies, there is still no specific set of rules for the thermostabilization of proteins.

1.2 Plasmid regulation by the protein ‘ROP’

Studies of plasmid DNA replication have a wide range of applications. The Colicin E1 (ColE1) plasmid from Escherichia coli is an example. Tomizawa et al13 shed light on the initiation of DNA replication by analyzing transcription products of the pNT7 plasmid, which contains the replication origin of ColE1 and the β-lactamase gene. They showed that the RNA transcript (RNA II) synthesized by RNA polymerase hybridizes to the template DNA and that this transcript is cleaved by the hydrolytic enzyme RNase H. DNA polymerase I then uses the processed transcript as the primer for replication, and the transcript is subsequently removed from the product by RNase H. Itoh et al14 showed that a small RNA species of 108 nucleotides (RNA I) inhibits primer formation by interacting with the 555-nucleotide mature transcript RNA II, as RNA I is complementary to the 5’ terminal region of RNA II. RNA I consists of three stem-loop domains and a free single-stranded 5’ region. Tomizawa15 performed binding experiments which showed that the interaction of the two RNAs begins at the 5’ end of RNA I with kissing of the complementary single-stranded loops, which in turn leads to pairing along the entire length of RNA I. Also, a 63 amino acid protein named Repressor Of Primer (Rop) was found to enhance the inhibitory effect of RNA I during in vitro transcription of RNA II16. The protein, which is also referred to as ROM (RNA One inhibition Modulator), was found to decrease the plasmid copy number by facilitating the binding of RNA I to RNA II. The structure of Rop was solved in 1986 (PDB ID: 1ROP) by X-ray crystallography, and it was found to be a four-helix bundle protein17. The crystal structure of Rop without RNA has been solved, but attempts to crystallize Rop bound to RNA were unsuccessful: the RNA molecules bound to each other in the crystals, leaving out the protein.

Rop is a homodimeric protein consisting of two helix-loop-helix monomers, each made of 63 amino acids. The helices pack in an antiparallel fashion, forming a four-helix coiled-coil bundle. Coiled coils generally have a heptad repeat of seven amino acids (a-g) in which the ‘a’ and ‘d’ positions are usually occupied by hydrophobic residues. When the protein folds, the hydrophobic residues are packed into the core so that they are sandwiched between two different helices, and Rop is unique with respect to the tight packing of its helices. Regan and coworkers18 identified the residues in Rop involved in RNA binding using alanine scanning mutagenesis, which revealed the residues that caused a defect in RNA binding when mutated to Ala. The Phe-14 residues of both monomers were found to be involved in RNA recognition by interacting with the loops of the hairpin pair. Also, Lys-3, Asn-10, Gln-18 and Lys-25 were found to contribute to RNA binding through possible contacts with the ribose phosphate backbone. Wild-type Rop binds RNA based on structure rather than sequence.
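
The heptad repeat described above can be made concrete by labeling consecutive helix positions a-g and picking out the ‘a’ and ‘d’ positions that typically face the core. This is a minimal sketch: the starting register is an assumption supplied by the caller, not the actual register of the Rop helices, and the function names are hypothetical.

```python
HEPTAD = "abcdefg"

def heptad_positions(length, first_position="a"):
    """Label `length` consecutive residues with heptad letters, starting at `first_position`."""
    offset = HEPTAD.index(first_position)
    return [HEPTAD[(offset + i) % 7] for i in range(length)]

def core_indices(length, first_position="a"):
    """Return the 0-based indices of residues at the 'a' and 'd' positions."""
    labels = heptad_positions(length, first_position)
    return [i for i, label in enumerate(labels) if label in "ad"]

# Two full heptads starting in register 'a' (an assumed register for illustration).
print("".join(heptad_positions(14)))
print(core_indices(14))
```

Because ‘a’ and ‘d’ recur every three or four residues, they fall on the same face of an alpha helix, which is why hydrophobic residues at these positions can pack together in the core.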

Figure 1: Schematic of the four-helix bundle protein Rop (PDB ID: 1ROP). The helices interact in an antiparallel fashion.

1.3 ROP as a model protein

Library design by combinatorial approaches requires methods that work at a high-throughput scale, because biophysical characterization of a large collection of variants needs to be performed. Low-throughput methods such as circular dichroism (CD), NMR, calorimetry, fluorescence spectroscopy and X-ray crystallography are not suitable for carrying out such experiments in a short amount of time. The Magliery lab19 developed the method of high-throughput thermal scanning (HTTS) for thermodynamic characterization of large numbers of protein variants. This methodology shares a similar idea with the ThermoFluor method, in which protein-ligand interactions are detected with a fluorescent hydrophobic dye akin to 1-anilinonaphthalene-8-sulfonic acid (ANS)20,21. The fluorescence of the dye molecules increases once the hydrophobic core becomes exposed, so the unfolding process can be monitored in a fluorimeter. This was achieved in a high-throughput manner by using a 96-well real-time PCR machine and a commercially available hydrophobic dye, SYPRO Orange22. Thus, the relative thermal stabilities of proteins can be obtained in a high-throughput fashion.
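
One common way to extract a relative thermal stability from such a dye-binding melt is to take the temperature at which the fluorescence rises most steeply as an apparent melting temperature. The sketch below applies that idea to a simulated, idealized two-state curve. It is not the analysis used in the cited work: the sigmoid model, the parameter values and the function names are all assumptions made for illustration.

```python
import math

def simulated_melt(temps, tm, slope=2.0):
    """Idealized sigmoidal fluorescence (0 to 1) for a two-state unfolding transition."""
    return [1.0 / (1.0 + math.exp((tm - t) / slope)) for t in temps]

def estimate_tm(temps, signal):
    """Return the midpoint of the interval where the finite-difference slope dF/dT is largest."""
    best_slope, best_temp = float("-inf"), None
    for i in range(len(temps) - 1):
        slope = (signal[i + 1] - signal[i]) / (temps[i + 1] - temps[i])
        if slope > best_slope:
            best_slope = slope
            best_temp = 0.5 * (temps[i] + temps[i + 1])
    return best_temp

# Hypothetical scan from 25 to 95 degrees C in 0.5 degree steps.
temps = [25.0 + 0.5 * i for i in range(141)]
curve = simulated_melt(temps, tm=64.2)
tm_est = estimate_tm(temps, curve)
print(f"apparent Tm = {tm_est:.2f} C")
```

Real dye-binding curves often lose signal after the transition and carry noise, so a practical analysis would usually smooth the data or fit a model before taking the derivative; the estimate is also only as fine as the temperature step.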

Many human diseases, such as cystic fibrosis, are caused by destabilized proteins, which makes the study of protein stability essential. This in turn requires good model proteins for analyzing the effects of mutation. Rop is a model protein that has been used successfully in many previous studies. As it is a small, all-alpha-helical protein, it can be easily expressed, purified and characterized. Rop does not contain disulfide bonds, proline residues or cofactors. In spite of this, wild-type Rop folds and unfolds extremely slowly. Regan and coworkers23 repacked the core of Rop with Ala and Leu residues packing against each other in the core positions. They made two mutants, one repacked in all eight layers and the other repacked in just the central six layers. These variants were well-folded, native-like structures, retained RNA binding and were more thermally stable than the wild type. This was encouraging, as repacking the core of a protein carries a substantial risk of producing molten globules (structures with secondary structure but without unique tertiary packing, and therefore with an exposed hydrophobic core). They also made the interesting observation that the mutant proteins fold and unfold faster than the wild type, which raises the question of whether the hydrophobic core governs the rate-determining step. Another study from the same group24 probed further into repacking of the Rop core by changing the number of Ala-Leu layers, reversing the positions of residues or changing the hydrophobic residues at the ‘a’ and ‘d’ positions to study various properties of the core. To make the notation clear, in the mutant A2L2-8 the first letter and subscript refer to the residues at the ‘a’ positions of the two helices, the second letter and subscript to those at the ‘d’ positions, and the number 8 refers to the number of repacked layers. Some mutants, such as A2L2-8-rev, were also made with the large ‘a’ and small ‘d’ residues reversed in the second and seventh layers, similar to wild-type Rop. From the results, the mutants were grouped into four classes. Class I, containing A2L2-2, 4, 6, 8 and 8-rev, were found to be structurally similar and to possess RNA binding properties. In addition to being native-like, they were more stable than the wild type. Class II, containing L2A2 and A2I2, resembled Class I but were unable to bind RNA and had a different structure. Class III, containing A2M2, A2V2 and A4, had decreased stability and no RNA binding ability. Class IV, containing the L4-8 variant, had an overpacked core and was found to be a tetramer with extreme thermal and chemical stability. The results mostly correlated well with the physical nature of the packed amino acids, and the study was useful for predicting the role of hydrophobic residues in the overall properties of a protein.

The effect of loop length on protein folding and stability has been studied using Rop as a model system25. In this study, the two-residue loop was replaced with a series of Gly linkers. There was an inverse correlation between loop length and stability: as the length of the Gly linker increased, the unfolding rate increased while the folding rate decreased. Converting a dimeric protein into a monomeric form has also been a subject of interest26. A monomeric Rop makes mutation studies easier: a point mutation introduced into the monomer appears as a double mutation in the dimeric version, whereas monomeric Rop avoids this issue.

Loops that are too long or too short can prevent correct association of helices and lead to misfolding or to the formation of higher order oligomers. In another study27, loops of various lengths were introduced into a monomeric four-helix bundle Rop, which made Rop fold into different topologies.

Due to the simplicity of its all-alpha-helical structure, Rop has become an ideal model for protein engineering and rational design. Any combinatorial library requires a screening method for identifying hits and characterizing them. A novel cell-based screen for Rop function developed by Magliery et al28 makes it possible to rapidly interrogate such libraries for proper protein folding and stability. The screen is based on the unique function of Rop in regulating the copy number of ColE1 plasmids. Rop modulates the copy number by facilitating the binding of the inhibitory RNA I to the priming RNA II of the ColE1 origin. RNA II is thus prevented from functioning as a primer for DNA replication, and the plasmid copy number is reduced. The screen expresses green fluorescent protein (GFP) from the ColE1 plasmid, so that the cellular fluorescence level is an indicator of the ColE1 copy number and therefore of Rop functionality. In the presence of active Rop, the copy number of the ColE1 plasmid diminishes and the amount of GFP is expected to decrease. The pAClacGFPuv-based screen is therefore a negative screen. It is difficult to always equate negative colonies with hits, as even a truncated protein can cause diminished activity of Rop. A positive screen was developed based on the arabinose promoter under particular conditions: 0.0005% arabinose and 100 µM IPTG. Cells grown at 42 °C were fluorescent when wild-type Rop was present and dim in the absence of Rop activity. Thus, in the positive screen, cellular fluorescence corresponds to active Rop.

Figure 2: Negative screen (top) and positive screen (bottom) of Rop28

Rop is a great model protein for protein engineering studies because of its well-defined structure and ease of characterization. Even though the critical residues involved in RNA recognition by Rop have been identified, the detailed mechanism by which Rop functions has not been fully explored. The main reason for this is the absence of a crystal structure of RNA-bound Rop. Also, the roles of all the residues in Rop have not been studied in detail. Mutant libraries of the repacked core, surface and loop of Rop have been made previously, resulting in a large dataset of functional mutants (Magliery, unpublished data). These libraries have revealed many non-covalent interactions between amino acids that may play an important role in the activity and stability of the protein.

1.4 Role of surface residues in protein stability

Amino acid residues on the protein surface have long been considered less important than residues in the hydrophobic core in contributing to protein stability. This is because surface residues are usually believed to be more tolerant of mutation to non-native residues than the protein core29. In spite of this, various studies using directed evolution and site-directed mutagenesis have explored the contributions of surface residues to protein stability. Previous studies30-32 provide considerable evidence that engineering surface residues and surface electrostatics can increase protein stability. Unfortunately, it is difficult to apply such engineering to any arbitrary protein one wants to stabilize. The surface residues in proteins like myoglobin and hemoglobin play an important role in binding oxygen, and changes in surface electrostatics can also have a direct effect on functional properties33. These studies show the promise of engineering surface charge-charge interactions to improve protein stability. However, there is not even a rough consensus on how surface residues control stability, and more work on model proteins is needed to add information in this area. Rop is a great scaffold for studying the effects of surface residues on stability.

Site-directed mutagenesis of solvent-exposed residues has been found to improve the stability of proteins. Sequences of homologous mesophilic and thermophilic cold shock proteins were compared to study protein stability at the molecular level34. Although the two homologs differ by twelve amino acid residues, only two mutations were required to increase the thermostability by 15 kJ/mol. This result was achieved by mutating Glu residues at two positions to Arg and Leu, respectively. This is very encouraging for finding strategies to increase protein stability by surface engineering. As seen from this study, stability can be achieved by mutating a few sites, and in an additive fashion.

Although rational site-directed mutagenesis has yielded fruitful results, it cannot be used in all cases, as thermophilic and mesophilic homologs often differ greatly in sequence. Also, mutating the sites important for stability must not have a negative effect on other important properties. For this, the residues important for protein stability must be identified by recombination experiments or by sequence comparisons using computational methods. Once the important sites have been selected, the optimal residues to mutate to can be identified by directed evolution methods. There are a number of examples that have used this method. One of the earlier studies35 used directed evolution to convert Bacillus subtilis subtilisin E into an enzyme functionally equivalent to its thermophilic homolog thermitase. Five generations of random mutagenesis, recombination and screening resulted in a mutant with a half-life and an optimal temperature for activity similar to those of thermitase. There were eight mutations in the optimized mutant, and only two of them were present in thermitase. Directed evolution is thus a powerful method for selecting mutants with increased thermostability without compromising activity. Another study36 improved the thermostability of a kanamycin resistance gene product by directed evolution in Thermus thermophilus, using growth temperature as the selective pressure. This resulted in a mutant with 19 mutations whose thermostability was improved by 20 °C. Interestingly, most of these mutations were located on the protein surface, and five of them were mutations to proline. One would not predict that mutations to proline would result in increased thermostability. Directed evolution has the power to uncover such rare but successful mutations. However, it works only for proteins with a selectable function.

1.5 Effect of surface electrostatics on protein stability

The role that surface residues play in protein stability has historically received less attention. The studies mentioned above help to change this notion and have encouraged more work in this area. Chakravarty et al37 performed a structural comparison of homologous proteins and identified several important factors for protein stabilization. Salt bridges were found considerably more frequently in thermophiles, exclusively in solvent-exposed regions. The increased thermostability of thermophilic proteins correlated with an increased number of charged residues. In another study38, the sequence composition of the surface and core regions of proteins from thermophilic and mesophilic bacteria was analyzed. As seen in many previous studies, thermophilic proteins had a significant bias towards charged residues in the composition of the protein surface.

Torrez et al39 focused on optimizing electrostatic surfaces to obtain mesophilic proteins with increased thermostability. Protein stabilities were calculated based on finite difference solutions of Poisson-Boltzmann continuum electrostatic theory, which were used to find the electrostatic free energies. This represented the first in silico screening to obtain mutant candidates with enhanced thermostability. The calculations were made for 25 cold shock protein mutants and an RNase T1 mutant that had been used for experimentation in previous studies. This study established a method of comprehensive in silico screening. It also supported previous experimental observations that eliminating like-charge repulsions and creating opposite-charge attractions on the protein surface is an effective way to improve the thermostability of mesophilic proteins.

Design of thermostable protein variants based on the calculation of energies of charge-charge interactions of ionizable residues was introduced by Sanchez-Ruiz et al40.

The Tanford-Kirkwood model was used for the calculations in this study. In this model, the native protein was assumed to be a low-dielectric-constant sphere surrounded by a high-dielectric constant solution and the charges were placed on the surface of the protein sphere according to the pairwise distance taken from protein’s three-dimensional structure41. The energies of charge-charge interactions of ionizable residues were plotted as a function of the residue number in the sequence for three homologous cold shock proteins which are mesophilic, thermophilic and hyperthermophilic proteins, respectively. The energies of interaction with positive values implied destabilizing interactions and the ones with negative values were considered as favorable interactions.

The plots of the charge-charge interaction energy versus individual residues for all three homologous proteins revealed that charge-charge interactions played a fundamental role in the stability differences between these homologs. The most important conclusion of this study was that stabilization of mesophilic proteins relies not only on the removal of unfavorable interactions but also on the creation of additional favorable charge-charge interactions. The rational design of thermostable protein variants was made easier using this model.
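The per-residue energy profile described above can be illustrated with a minimal sketch. Note that this is not the Tanford-Kirkwood model: it assumes a plain Coulomb sum in a uniform dielectric, and the charges, coordinates and the function name `per_residue_charge_energy` are hypothetical values chosen only for illustration. It shows the sign convention used in the text, where a positive total marks a residue with net unfavorable interactions and a negative total marks net favorable interactions.

```python
import math

# Coulomb constant in kJ*Angstrom/(mol*e^2); distances are in Angstrom.
COULOMB_KJ_A = 1389.35


def per_residue_charge_energy(charges, coords, dielectric=80.0):
    """Sum the pairwise Coulomb energies (kJ/mol) of each site with all others.

    Simplifying assumption: one uniform dielectric constant, unlike the
    two-dielectric sphere of the Tanford-Kirkwood model.
    """
    totals = [0.0] * len(charges)
    for i in range(len(charges)):
        for j in range(i + 1, len(charges)):
            distance = math.dist(coords[i], coords[j])
            pair_energy = COULOMB_KJ_A * charges[i] * charges[j] / (dielectric * distance)
            totals[i] += pair_energy
            totals[j] += pair_energy
    return totals


# Hypothetical three-site example: two positive charges and one negative charge.
charges = [+1, +1, -1]
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (0.0, 5.0, 0.0)]
energies = per_residue_charge_energy(charges, coords)
for index, energy in enumerate(energies):
    label = "unfavorable" if energy > 0 else "favorable or neutral"
    print(f"site {index}: {energy:+.2f} kJ/mol ({label})")
```

In this toy arrangement the second positive charge sits closer to the like charge than to the opposite charge, so its total is positive; it would be the kind of site a charge-optimization procedure flags for substitution.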

Makhatadze et al42,43 experimentally showed that the stability of a mesophilic protein can be significantly improved by optimizing electrostatic interactions on the protein surface. A hybrid cold-shock protein was designed which combined the core residues of the mesophilic protein and the surface charge distribution of the hyperthermophilic variant. The designed protein had secondary and tertiary structure similar to the parent protein, and had better affinity for single-stranded DNA than the mesophilic variant. Thermostability increased by 20 °C, and this was achieved by decreasing the enthalpy of the unfolded state. This study confirmed that surface electrostatic interactions play an important role in improving the thermostability of proteins. Another study by Strickler and coworkers44 showed that surface electrostatic interactions, and not just the interactions in the protein interior, contribute to protein stability.

Computational optimization of surface charge-charge interactions was performed for five proteins which differ in size and tertiary fold topology, and the increase in stability of the mutant proteins was experimentally tested. The computational predictions matched the experimental values, and this work once again showed that surface interactions can be optimized in a rational way to improve stability. Other studies45,46 used a computational approach to enhance the thermostability of an enzyme by rational design. For two human enzymes, acylphosphatase and Cdc42 GTPase, the designed proteins were significantly more stable than wild-type while retaining full enzyme activity. This is a good example where surface electrostatic interactions were optimized without perturbing the enzyme activity.


Figure 3: An example showing the optimization of energy of charge-charge interactions based on the number of amino-acid substitutions in a model protein acylphosphatase47. Each small dot corresponds to a different sequence. The sequences selected for experiment are shown in large symbols. Wild-type is in black bars and the experimental sample is in red bars. Positive energies represent unfavorable interactions of a given residue with all other ionizable residues in the protein whereas negative values of ΔGqq represent overall favorable energies.


Optimization of surface charge-charge interactions was tested as a way to increase stability with only a few substitutions in the beta sheet protein Fyn SH3 domain by Schweiker et al48. The thermodynamic stabilities obtained experimentally agreed well with the theoretically calculated values. The resulting mutant had an increase in melting temperature of 12 °C and an increase in stability of 7 kJ/mol. Along with this, the effect of stepwise addition of charges to the surface was observed for the Fyn SH3 domain and ubiquitin. The energy of favorable interactions was predicted to decrease after 13 substitutions and 10 substitutions for the Fyn domain and ubiquitin, respectively. Addition of new charges into the finite space occupied by the native protein topology will introduce both favorable and unfavorable interactions. The sites chosen determine which of these interactions dominates overall. Therefore, after a certain point, other interactions such as hydrophobic interactions must be optimized to further increase protein stability.

1.6 Other factors that affect protein stability

Now, it is clear that surface residues, especially surface charges, can be critical to protein stability. One of the methods49 to estimate the electrostatic contribution of ionizable groups to protein stability is to calculate the differences between pKa values in the folded and unfolded states of a protein. Hydration effects have an important role in determining the energetics of protein structure. A study by Makhatadze et al50 showed that the hydration effects of aliphatic groups stabilize the native structure whereas the hydration effects of the aromatic and polar groups destabilize the compact native state of the protein. In total, the hydration effects destabilize the native structure, as the negative contributions of the aromatic and polar groups are larger than the positive contribution of the aliphatic groups. The negative effects are counterbalanced by the enthalpy of interactions between the groups tightly packed in the interior. Another study by Makhatadze et al51 showed the solvent isotopic effects on protein stability. Thermodynamic changes were observed with even small changes in solvent water such as H to D substitution. The decrease in enthalpy was compensated by entropic changes, which left the stability of the protein unaffected. This study also showed that hydration effects are additive; in other words, they scale with water-accessible surface area. This served as a benchmark study to emphasize the role of enthalpy-entropy compensation effects on the overall energetics of protein structure.

1.7 Lysines vs Arginines on the protein surface

A library randomizing five amino acids on the surface of Rop was constructed in a previous study (C. Nguyen, K. Stephany, D. Williams, manuscript in preparation). The hits obtained after screening had a consensus sequence dominated by lysines. It should be noted that wild-type Rop has predominantly negatively charged amino acids on the surface. So, this was a surprising result, and the domination of lysines over arginines also raised questions. Lysine and arginine are positively charged basic amino acids mostly exposed at the protein surface. They form ionic interactions and can play an important role in protein stability. Although both are basic residues, arginine usually contributes more to the stability of protein structure than lysine due to its geometric structure.

This is due to the fact that the guanidinium group in arginine allows more interactions than lysine52. This results in arginine forming more hydrogen bonds and salt bridges compared to lysine. It is already known that thermophiles have a higher number of charged amino acids. It has also been shown that thermophiles have more thermal tolerance than mesophiles due to increased number of arginine based salt bridges53. This was found out by analyzing arginine to lysine ratio and the number of salt bridges formed by arginines and lysines in thermophilic and mesophilic proteins.

To compare the effects of arginines and lysines, green fluorescent protein (GFP) was used in a study54. This was an effort to analyze the advantageous properties of arginine and lysine with respect to protein stability. A variant of GFP was made in which the maximum possible number of lysine to arginine mutations was made on the surface of the protein. Then, the stability of the variant was analyzed. Thermal stability of the variant remained unchanged, whereas stability in the presence of chemical denaturants such as urea, alkaline pH and ionic detergents was enhanced. Structural analysis of the electrostatic interactions indicated, as previously mentioned, that the geometric properties of the guanidinium group in arginine led to such effects. The activity of the variant remained unchanged. However, its protein folding rate was adversely affected. The results led to the conclusion that surface lysine to arginine mutagenesis is one of the important parameters in protein stability studies. This may not be true for every protein and may depend upon the functional properties of the protein. The same research group made further mutants of GFP in which they introduced stabilizing mutations to counterbalance the negative effect on folding caused by the surface mutations55. They were successful in this attempt and were able to modulate the protein stability without affecting its activity and folding.

Arginine is widely used in protein refolding studies as it suppresses aggregation56.

In another computational study57, a high lysine/arginine ratio was shown to promote protein solubility at higher protein concentration and higher expression levels. The authors concluded that lysine could be more effective than arginine in promoting solubility by supercharging (increasing the charge of proteins either positively or negatively). This may be an important point to consider as to why the Rop library mentioned before had more lysines than arginines in the consensus sequence. Although arginines help in making the protein more stable, they may not be good candidates from the protein solubility point of view. In this case, lysines emerge as the more important residues for promoting protein solubility. Interestingly, in our surface library, the consensus hit sequence indicated supercharging by lysines.

1.8 Solubility

For most experiments in biochemistry, having proteins in soluble form is a major requirement. The solubility of protein is determined by various interactions such as within the protein and between protein and solvent. Low protein solubility is a major concern for protein pharmaceuticals in all the steps including preparation, sterilization, shipping, and storage. Therapeutic proteins also require the protein to be in soluble form.

In many cases, after the protein is purified, it is soluble at a low concentration but aggregates or precipitates at high concentration. But, protein is required at a high concentration for many applications.

Low protein solubility is an important problem and it can be seen in various scenarios. The main ones are low in vitro solubility of protein, low in vivo solubility of protein, and low protein solubility due to conformational changes. Low in vitro solubility is the case where proteins are expressed, purified and folded at room temperature but the maximum concentration is low and not enough for laboratory experiments or for pharmaceutical or industrial applications. Low in vivo solubility involves proteins that are overexpressed in recombinant cells of E. coli. In many cases, proteins aggregate and form inclusion bodies as a result of low protein solubility.

Addition of charged amino acids arginine and glutamate was found to increase the long-term stability of the sample and to prevent protein aggregation and precipitation58.

One of the strategies to improve protein solubility is site-directed mutagenesis.

Mutagenizing hydrophobic amino acids on the protein surface to hydrophilic amino acids is a popular strategy used in many applications. Substitution of six residues with arginine in an exposed hydrophobic patch in a four ankyrin repeat protein resulted in increased solubility59. In the same study, similar substitutions with arginine converted a partially folded ankyrin repeat protein to a fully folded conformation. However, not many hydrophobic residues are present at the protein surface to mutate for improved stability. Solubility is an important factor in structural studies. In a structural study60, to make the murine leukemia virus reverse transcriptase as soluble as the human homolog, a single amino acid substitution (L43K) was made in the connection domain of a truncated reverse transcriptase, which significantly improved the solubility of the enzyme. The mutant was found to be as active as the original protein, and it was also possible to obtain diffracting crystals due to the increased solubility.

There are several examples of using rational mutagenesis, mostly to charged residues, to increase solubility. The bacterial cytosine methyltransferase M. HhaI is a popular enzyme used in structural and mechanistic studies of biological DNA methylation, but it suffers from insolubility. By site-directed mutagenesis of three functionally unimportant hydrophobic patches on the surface of M. HhaI to polar amino acids, higher solubility was achieved for the engineered enzyme61. Surface redesign can also be used to gain insights into the structure of membrane proteins. In a few interesting studies, membrane proteins have been engineered for solubility by mutating their lipid-exposed residues. The final folded states of membrane and soluble proteins are achieved through different pathways. But, if the lipid-exposed surface of a membrane protein is redesigned for solubility in an aqueous environment, packing inside the membrane protein could maintain a similar fold. One such example is the redesign of the transmembrane domain of phospholamban, which was converted into a soluble protein without disrupting the oligomeric state. This indicates that the interior of the membrane protein has some determinants that dictate folding in an aqueous environment62. A similar attempt was made with the protein bacteriorhodopsin to change a membrane protein into a soluble protein63. Two strategies were employed in this study. The first strategy involved replacing all residues in the trimeric protein which had more than 35% solvent accessibility. This strategy, in which 14.9% of the residues were mutated, was unsuccessful. The second strategy involved categorizing the accessible residues as fully or partially accessible. In this strategy, 13.5-24.3% of the residues were mutated, which involved changing all of the accessible residues and some of the partially accessible residues. The second strategy was successful and was recommended as the solubilization strategy for membrane proteins. Through this study, the authors recommended that while redesigning the surface of a membrane protein, large hydrophobic patches must be avoided as they may lead to poor expression and aggregation. Also, when designing a soluble membrane protein, it was recommended to replace all or most of the accessible residues but only a few partially accessible residues to minimize interference with packing in the protein core. In addition to this, the introduction of pairs of salt bridges can increase the stability of the protein in aqueous solution. This solubilization strategy can be applied to membrane proteins without structures, as mutagenesis experiments will be able to determine the residues that are solvent exposed. For membrane proteins which have structural information, computational design can prove useful to select sites for mutagenesis. Water soluble variants of the potassium channel KcsA were generated by a computational approach64. In addition to being soluble, the variants obtained by mutating lipid-contacting side chains to more polar groups were expressed in high yield and had tight specific binding to the substrate.

Therapeutic proteins need to be stable and soluble during preparation, storage, shipping and delivery. In an attempt to stabilize therapeutic proteins such as human growth hormone and human leptin, a novel stabilizing peptide was fused at either the N or C terminus of these proteins65. The novel stabilizing peptide is the acidic tail of synuclein, which was previously used to protect fusion proteins from heat, pH and metal induced aggregation. These fusion proteins were significantly more stable and soluble. A study investigating the relative contributions of all twenty amino acids to protein solubility was also carried out66. A solvent exposed position in Ribonuclease Sa (RNase Sa) was mutated to all twenty amino acids individually. The results showed that aspartate, glutamate and serine contribute significantly more favorably to protein solubility compared to other hydrophilic amino acids. The results also suggested that the contribution of lysine and arginine to protein solubility depends strongly on the net charge of the protein. Ionized acidic residues contributed more favorably than protonated acidic residues.


Homology modeling can also be used, where sequence and structural alignments of a protein of interest and relatively more soluble homologous proteins are used to identify surface hydrophobic residues to be mutated into hydrophilic residues. One study used homology modeling to make human leptin as soluble as murine leptin. Sequence and structural alignments between murine and human leptin helped in identifying key hydrophobic residues contributing to insolubility. The resulting mutant human leptin, in which the identified residues were mutated to hydrophilic residues, had increased solubility67.

1.9 Deep sequencing

Deep sequencing is well suited to analyzing libraries as it is cost effective and yields statistically meaningful results. We previously cloned single combinatorial libraries (mutating a residue at a position to all 19 non-wild-type residues along with the wild-type residue) for 62 positions of Rop individually, conducted a growth selection by combining all 62 libraries, and sequenced naives and ‘hits’ from certain rounds by Illumina technology (Exploring the sequence landscape of the four-helix bundle protein Rop using deep sequencing, Nishanthi Panneerselvam, Master’s thesis). The libraries are made on a cysteine-free Rop which has similar structure, activity and stability to wild-type Rop. The Cys-free protein (C38A, C52V) has an advantage for purification and characterization studies68. As Rop controls plasmid copy number, an excellent in vivo screen has been developed for Rop by tying the plasmid copy number to the amount of fluorescence produced by the reporter protein GFP (green fluorescent protein) expressed from the plasmid. The screen can be used for monitoring enrichment of hits during selection and also to analyze a variant’s activity. Usually, variants with active Rop tend to grow faster than the ones with inactive Rop. This is because of the metabolic load imposed on the cells with inactive Rop due to higher plasmid copy number. For us, variants from a particular round of selection are of interest for deep sequencing. This round was picked based on a mock selection experiment of active and inactive Rop such that the actives are 1,000 times enriched over the inactives. This was done by calculating the enrichment fold for each round using the ratio of active to inactive cells in the in vivo screen. Our goal is to obtain a fitness landscape from Illumina sequencing results, as we can look at the distribution of enriched and depleted variants. From the distribution, enrichment ratios can be calculated for each variant and the variants of interest can be subjected to biophysical characterization. This may lead to interesting observations on how residue changes in a protein’s core, loop and surface can affect fitness.
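The enrichment ratio calculation mentioned above can be sketched as follows. This is a minimal illustration under stated assumptions, not the analysis pipeline used in this work: the variant names and read counts are hypothetical, the function name `log2_enrichment` is introduced here, and the pseudocount of 0.5 is an arbitrary choice that only keeps variants with zero reads from producing an undefined logarithm.

```python
import math


def log2_enrichment(naive_counts, selected_counts, pseudocount=0.5):
    """Return log2(frequency after selection / frequency in the naive library).

    Frequencies are read counts divided by the total reads in each sample.
    The pseudocount is an assumption added to handle variants with no reads.
    """
    naive_total = sum(naive_counts.values())
    selected_total = sum(selected_counts.values())
    scores = {}
    for variant, naive_reads in naive_counts.items():
        selected_reads = selected_counts.get(variant, 0)
        naive_freq = (naive_reads + pseudocount) / naive_total
        selected_freq = (selected_reads + pseudocount) / selected_total
        scores[variant] = math.log2(selected_freq / naive_freq)
    return scores


# Hypothetical read counts for three variants before and after selection.
naive = {"WT": 1000, "A31K": 1000, "L41P": 1000}
selected = {"WT": 2000, "A31K": 900, "L41P": 100}
scores = log2_enrichment(naive, selected)
for variant, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{variant}: {score:+.2f}")
```

Positive scores mark variants that became more frequent during selection and negative scores mark depleted variants. Subtracting the wild-type score from every variant is one common way to express the results relative to wild type.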


1.9.1 DNA Sequencing

Biological research depends to a large extent on DNA sequencing69. A brief timeline of important milestones in sequencing technology advancements is presented below.

-A huge gap was present between the discovery of the double helical structure of DNA by Watson & Crick in 1953 and determination of the first DNA sequence.

-Sanger introduced the plus and minus method of sequencing in 197570. In this method, the DNA polymerase primes to the randomly generated DNA fragments and the products of both plus (supplied with one deoxy nucleotide) and minus (supplied with three deoxy nucleotides) reactions are analyzed on a gel. The sequence is identified based on the varying fragment lengths. This is applicable to only single stranded DNA.

-Using the above method, the genome of bacteriophage ɸX174 was sequenced in 197771.

-Maxam and Gilbert developed a method of chemical sequencing in which 32P end labeled DNA is treated with base specific chemicals. This produces terminated fragments of DNA which are imaged by autoradiography72. The advantage of this method is that it can use double stranded DNA for sequencing but the limitation is the use of toxic chemicals and technical complexity.


-The chain termination method developed by Frederick Sanger revolutionized the sequencing field73.

1.9.2 Chain terminator sequencing

This method73 developed by Sanger employs dideoxynucleotide triphosphates (ddNTPs), which are dNTPs lacking a 3’-OH group. These ddNTPs act as chain terminators and prevent phosphodiester bond formation. The method requires a DNA template, alpha-32P-dATP, nonradioactive ddNTPs, a sequencing primer and DNA polymerase. Using the above reagents, four reactions are carried out, with all reactions containing all four dNTPs and each one containing one of the ddNTPs. This produces randomly terminated fragments of varying sizes. These reactions are then subjected to gel electrophoresis to separate the DNA fragments by size. This can be documented by autoradiography. Overlapping the randomly terminated fragments gives the full sequence of the DNA. Dye-terminator sequencing, which is a variation of this method, was developed later. It enables sequencing from a single reaction and replaces radioactively labeled ddNTPs with fluorescent dye labeled ddNTPs. Each base is read out as its respective colored fluorescent dye. As each band of color, which is caused by collections of dye terminated fragments of the same size, moves past the detector, it creates a peak in the signal which is plotted on a graph.
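The way the sequence is read from the four termination reactions can be illustrated with a small idealized sketch. It assumes that every possible terminated fragment is present in its lane and ignores the primer length; the function names and the example strand are hypothetical.

```python
def termination_lengths(new_strand):
    """Group fragment lengths by the ddNTP that terminated them.

    Each of the four reactions (lanes) contains fragments ending at every
    position where its ddNTP was incorporated into the synthesized strand.
    """
    lanes = {base: [] for base in "ACGT"}
    for length, base in enumerate(new_strand, start=1):
        lanes[base].append(length)
    return lanes


def read_gel(lanes):
    """Read the sequence from the shortest fragment to the longest."""
    bands = sorted(
        (length, base) for base, lengths in lanes.items() for length in lengths
    )
    return "".join(base for _, base in bands)


new_strand = "GATTCAG"  # hypothetical synthesized strand
lanes = termination_lengths(new_strand)
print(lanes)            # fragment lengths seen in the A, C, G and T lanes
print(read_gel(lanes))  # sequence recovered by ordering bands by size
```

Ordering the bands by size across the four lanes recovers the synthesized strand, which is the reasoning behind reading a sequencing gel from the bottom up.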


Figure 4: Schematic of chain terminator sequencing74. The method requires a DNA template, alpha-32P-dATP, nonradioactive ddNTPs, a sequencing primer and DNA polymerase. Using the above reagents, four reactions will be carried out with all reactions containing all four dNTPs and each one with one of the ddNTPs. This produces random terminated fragments with varying sizes. These reactions are then gel electrophoresed to obtain separation of the DNA fragments by size. This can be documented by autoradiography. Overlapping the random terminated fragments will give us the full sequence of the DNA.


Figure 5: Schematic of Dye termination sequencing75. This method enables us to sequence from a single reaction and replaces radioactive labeled ddNTPs with fluorescent dye labeled ddNTPs. Each base comes out as its respective colored fluorescent dye. As each band of color which is caused by collections of dye terminated fragments of the same size moves past the detector, it creates a peak in the signal which is produced on a graph.

1.9.3 Second generation sequencing

In spite of all these developments, sequencing remained a tedious process due to lack of automation, especially for complex genomes. The first automated sequencer was launched by Applied Biosystems in 1987 utilizing capillary electrophoresis. One of the first next generation sequencing methods, massively parallel signature sequencing, was introduced in 2000. This method sequences clonal copies of cDNA attached to microbeads. However, the cycles of adaptor ligation and cleavage in this method make it extremely complex and not so useful. Based on this method, many other deep sequencing technologies started developing rapidly. For smaller sample sizes and for simple purposes like cloning, Sanger sequencing remains the major method in use. Next generation sequencing methods are used for larger projects and various other useful applications.

1.9.4 Next generation sequencing methods

Pyrosequencing

Pyrosequencing was introduced in 2005 and it is the first kind of high-throughput sequencing76. The laborious cloning process in Sanger sequencing is replaced by an efficient in vitro DNA amplification method called emulsion PCR. The template DNA is fragmented and attached to two kinds of adaptors at each end. The adaptors consist of sites for amplification and sequencing in addition to a biotin tag at one end that attaches to streptavidin beads. The fragment library is diluted in such a way that a single molecule can be immobilized on each streptavidin coated magnetic bead.


Figure 6: Schematic of pyrosequencing77. It is based on the sequencing by synthesis principle. It relies on the detection of pyrophosphate release on nucleotide incorporation, rather than chain termination with dideoxynucleotides.


Each water droplet in the oil emulsion captures a bead together with the PCR components. The single DNA molecules bound to the beads are clonally amplified in these water-in-oil microreactors, producing 10^7 clonal copies on each bead. After amplification, the DNA positive beads are enriched and applied to a picotitre well plate followed by enzyme beads containing sulfurylase and luciferase. The sequencing by synthesis technique is utilized, where release of inorganic pyrophosphate is detected by chemiluminescence. Sequencing is carried out by adding primers and polymerase, and then each dNTP successively, to all wells at the same time. When a base is incorporated into the strand, pyrophosphate is released, which is converted into ATP by sulfurylase, and the ATP is used by luciferase to produce light. The light intensity is recorded by a charge coupled device (CCD) camera at the other end of the plate. The major disadvantage is the limited accuracy in detecting homopolymeric stretches of DNA, as the method relies on the magnitude of light and thus becomes less reliable with longer homopolymers. As a result, a large fraction of the errors encountered in this method are small insertions-deletions. The major advantage of 454 sequencing is the longer read lengths; it is therefore the best choice for applications such as de novo assembly where a reference genome is not available.
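The dependence of homopolymer calls on light intensity can be shown with a minimal sketch of flow-based base calling. The flow order, the normalized signal values and the function name are hypothetical, and real base callers use more elaborate signal models; the sketch simply rounds each signal to the nearest whole number of incorporated bases.

```python
def call_flowgram(flow_order, signals):
    """Convert normalized light intensities into a base sequence.

    One nucleotide is flowed at a time in a repeating order; a signal near 1
    means one base was incorporated, near 2 means a run of two, and so on.
    """
    bases = []
    for index, signal in enumerate(signals):
        nucleotide = flow_order[index % len(flow_order)]
        run_length = max(0, int(signal + 0.5))  # round to the nearest run length
        bases.append(nucleotide * run_length)
    return "".join(bases)


flow_order = "TACG"
# Hypothetical intensities: the last value (3.45) lies between 3 and 4.
signals = [1.02, 0.05, 2.10, 0.03, 0.00, 0.96, 0.08, 3.45]
sequence = call_flowgram(flow_order, signals)
print(sequence)
```

The final flow is the instructive case: a signal of 3.45 is called as three G bases, but a slightly stronger signal would be called as four, which is how insertion and deletion errors arise in long homopolymers.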


Illumina sequencing

In 2006, the first short read sequencing platform ‘Solexa’ was launched, which is now owned by Illumina. The technology applies sequencing by synthesis chemistry similar to Sanger sequencing. Illumina sequencing77,78 also utilizes cloning-free amplification of DNA fragments, similar to pyrosequencing. In this chemistry, the incorporation reaction is stopped after each base, the readout is given by the labeled fluorescent dyes and only then is the next base incorporated for the sequencing reaction to continue. For the amplification, the template DNA is fragmented (<800 bp), end repaired to produce blunt ends and then phosphorylated. Next, a single ‘A’ base is added to the 3’ end of the fragments, which lets them ligate to adapters with a ‘T’ overhang. The single stranded DNA fragments with adapters at one end are attached to a flow cell. The solid support flow cell is an optically transparent slide with 8 separate lanes coated with adapters and complementary adapters. Once these DNA fragments are attached at one end, they can bend over and hybridize with the complementary adapter in the flow cell, which is the basis of bridge amplification. After several cycles of bridge amplification, the initial single stranded DNA fragments give rise to clusters (approx. 1000 copies of the original sequence). As these clusters contain both the forward and reverse strands of the DNA fragments, one of them has to be removed to avoid interference in the sequencing reaction by complementary base pairing. Therefore, a chemical cleavage step is carried out such that only the forward strands are left for sequencing. The sequencing by synthesis approach employs cyclic reversible termination, which uses reversible terminators, each labeled with one of four different fluorescent dyes. It begins with the addition of sequencing primers, DNA polymerase and the aforementioned fluorescently labeled reversible dNTP terminators. Only a single base can be incorporated at a time because of a chemically cleavable terminator group present at the 3’-OH position, and the identity of each base is determined by one of the four chemically cleavable fluorescent dyes. Once the identity and position of the single incorporated base is recorded via a CCD camera, both the terminator group at the 3’ end of the base and the fluorescent dye are removed and the synthesis cycle continues. The error rate increases for higher read lengths. Genome Analyzer II was introduced in 2008 with the addition of a paired-end module and a decreased run time. Once the first run has been completed as described above, the second run can be performed from the opposite end of the fragments, as the templates can be regenerated in situ. After removal of the sequenced strands from the first run, the templates undergo bridge amplification. Now, the templates are removed and sequencing by synthesis is performed for the reverse strands. Currently, paired-end sequencing can sequence up to read lengths of 2 x 150 bp.


As pointed out before, the base call accuracy decreases with increasing read lengths due to dephasing noise from heterogeneous cluster cell populations. This can be due to various problems such as incomplete removal of the fluorescent labels or terminator groups. The Illumina technology is currently the most used for high-throughput sequencing. It produces data at lower cost and time compared to the traditional Sanger sequencing. The downside of the method is that it has higher error rates and much shorter reads. The schematic of an Illumina sequencer is shown in figure 7.


Figure 7: Schematic of Illumina sequencing77. This method uses sequencing by synthesis chemistry. DNA is fragmented, adaptors are ligated, the ligated DNA is selected and attached to the flow cell, and cloning-free amplification is carried out by bridge amplification. Then the sequencing primer is annealed and the sequence is read.


SoLid system

SoLid (Supported Oligonucleotide Ligation and detection) system is based on a hybridization and ligation chemistry scheme78. Similar to pyrosequencing, DNA amplification is carried out by emulsion PCR. In the SoLid system, however, the beads containing DNA fragments attached to adapters remain attached to a glass support surface after amplification. A primer oligonucleotide is annealed to the adapter to begin sequencing. During the first ligation sequencing step, the attached primer serves to ligate to the interrogation probe. The interrogation probe is a set of four fluorescently labeled octamers, and these compete for ligation to the primer. The first two bases starting from the 3’ end of the probe are one of the 16 possible dinucleotide combinations, followed by 6 degenerate bases and a corresponding fluorescent label at the 5’ end. When this attaches to the primer’s 5’ end, the sequence of the initial two bases is determined by fluorescence detection and cleavage happens after the fifth base to start the second ligation step. Now, the second interrogation probe ligates to the end of the cleaved probe, continuing to sequence the region. In this way, seven ligation cycles constitute one round. For the next round, a new primer offset by one base in the adaptor sequence is annealed and another round of sequencing is performed. Five primer reset rounds are performed, by which each template nucleotide is sequenced twice, each time with a different primer. This method is called ‘Two base encoding’ and it is useful in distinguishing sequencing errors from sequence polymorphisms. This contributes significantly to reducing error rates.
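How two base encoding helps separate errors from polymorphisms can be sketched with the commonly described color scheme, in which each of four colors represents four of the 16 dinucleotides. The mapping below (bases as two-bit numbers combined by exclusive OR) is an assumption about the specific color assignment, and the sequences and function names are hypothetical.

```python
BASE_TO_BITS = {"A": 0, "C": 1, "G": 2, "T": 3}
BITS_TO_BASE = "ACGT"


def encode_colors(sequence):
    """Encode each overlapping pair of bases as one of four colors (0-3)."""
    return [
        BASE_TO_BITS[first] ^ BASE_TO_BITS[second]
        for first, second in zip(sequence, sequence[1:])
    ]


def decode_colors(first_base, colors):
    """Decode a color string, given the known first base from the adapter."""
    bases = [first_base]
    for color in colors:
        bases.append(BITS_TO_BASE[BASE_TO_BITS[bases[-1]] ^ color])
    return "".join(bases)


reference = "ATGGCA"
variant = "ATGACA"  # hypothetical single-base polymorphism (G to A)
reference_colors = encode_colors(reference)
variant_colors = encode_colors(variant)
changed = [
    i for i, (a, b) in enumerate(zip(reference_colors, variant_colors)) if a != b
]
print(reference_colors, variant_colors, changed)
print(decode_colors("A", reference_colors))
```

Because every base is read in two overlapping dinucleotides, a true single-base change alters two adjacent colors, whereas an isolated single color change is more likely to be a measurement error.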

Single molecule sequencing

HeliScope genetic analysis system was introduced in 2003 by Helicos Biosciences based on the work of Braslavsky and coworkers79. This method sequences the actual DNA molecules directly, thus removing the need for clonal amplification. It uses a sensitive fluorescent detection method where the fragmented DNA is polyadenylated, with the end adenine being labeled with Cy-3 fluorescent dye. These strands are denatured and bound to complementary poly-dT primers on the flow surface of a glass slide. A single sequencing cycle, called a quad, involves washing in polymerase and one of the Cy-5 labeled dNTPs, which can bind and give rise to a fluorescent signal recorded by a CCD camera. The major type of error encountered is deletions. This is attributed to low levels of fluorescence detection from homopolymer regions.

Single molecule real time (SMRT) sequencing, developed by Pacific Biosciences, is considered a third generation sequencing method as it can observe the fluorescence of nucleotides in real time as they are added by the polymerase80. This is done in the zero-mode waveguide (ZMW) chip, which is a nanoscale visualization chamber that provides a detection volume of 20 zeptolitres (1 zL = 10^-21 L). The polymerase is at the bottom of the chip, and correctly paired nucleotides are retained longer than mismatched nucleotides. The biggest advantages of the method are the real-time monitoring and the longest read length of 1,300 bp, which is better than any of the second generation sequencers.

Other methods

Ion semiconductor sequencing

Ion semiconductor sequencing was developed in 2011 by Ion Torrent Systems81. It is based on the detection of hydrogen ions released during the DNA polymerization reaction. Template strands are exposed to one type of nucleotide at a time and a hypersensitive ion sensor records the electrical signal. Error rates are high for homopolymer reads.


DNA nanoball sequencing

This method utilizes DNA nanoballs containing copies of genomic DNA fragments generated by rolling circle replication82. Unchained sequencing by ligation, in which each cycle is independent of the products of the previous cycle, promises reduced error rates, with the disadvantage of shorter reads.

Nanopore

It is based on the use of the biological pore alpha-haemolysin as a detector of voltage fluctuations83. The pore is fused to an exonuclease which cleaves single-stranded DNA into individual nucleotides that pass through the pore. Disruption of the continuous ionic flow is recorded, and the size difference between the different deoxyribonucleoside monophosphates helps in decoding the readout.


Table 1: Comparison of next generation sequencing platforms84

Company | Platform | Sequence by | Detection | Run types | Run time | Read length (bp) | No. of reads per run
Roche | GS FLX Titanium | Synthesis | Pyrophosphate detection | Single end | 23 h | 700 | 1 million
Roche | GS Junior system | Synthesis | Pyrophosphate detection | Single end | 10 h | 400 | 0.1 million
Life technologies | Ion torrent | Synthesis | Proton release | Single end | 4 h | 200-400 | 4 million
Life technologies | Proton | Synthesis | Proton release | Single end | 4 h | 125 | 60-80 million
Life technologies | Abi/Solid | Ligation | Fluorescence detection of di-base probes | Single and paired end | 10 days | 75 + 35 | 2.7 billion
Illumina | HiSeq2000/2500 | Synthesis | Fluorescence; reversible terminators | Single and paired end | 12 days | 2 x 100 | 3 billion
Illumina | MiSeq | Synthesis | Fluorescence; reversible terminators | Single and paired end | 65 h | 2 x 300 | 25 million
Pacific biosciences | RSII | Single molecule synthesis | Fluorescence; terminally phospholinked | Single end | 2 days | 50% of reads >10 kb | 0.8 million
Helicos | Heliscope | Single molecule synthesis | Fluorescence; virtual terminator | Single end | 10 days | 30 | 500 million


1.10 General applications of next generation sequencing

Deep sequencing has great scope in applications such as genome sequencing, where low-cost reads are a necessity. Sanger sequencing is prohibitively expensive for such applications and its run time is substantially longer. Next generation sequencing (NGS) methods revolutionized this area by being both time and cost efficient.

Some primary applications for deep sequencing are as follows:

- De novo sequencing of a genome, for which longer read lengths are important owing to the absence of a reference genome; hence 454 sequencing is well suited here. Multiple examples can be cited of the use of NGS technologies for complete genome sequencing. To cite a few, 454 sequencing was used to sequence a human genome with 7.5x coverage85, and Illumina technology was used to sequence the genomes of a Chinese86 and a Korean person87.

- Targeted genome resequencing, which is the sequencing of a specific region from a genome from multiple strains and this can be used to identify single nucleotide polymorphisms and insertions-deletions. This is an excellent use of technology to identify markers of genetic diseases. This was useful in HLA genotyping in humans and has a potential for genome wide fetal genotyping by non-invasive high throughput sequencing of the mother’s blood88,89.


- Whole transcriptome shotgun sequencing, or RNA-Seq, helps to find the primary sequence and relative abundance of transcripts for a certain developmental stage85. This has set a new path in understanding transcriptomics and has taken over from DNA microarrays in many applications90,91.

- Epigenetic applications:

a) Chromatin immunoprecipitation followed by sequencing or ChIP-Seq is a method to quantitatively profile DNA binding proteins and has widely replaced the traditional ChIP experiment92.

b) DNA methylation profiling is carried out by bisulfite sequencing combined with deep sequencing and this has better advantages in gene regulation studies than traditional bisulfite sequencing93.

1.11 Applications of deep sequencing in protein stability and fitness studies

Deep sequencing is a powerful method to study protein stability, function and binding, and the studies done so far indicate how far next generation sequencing methods can take us in understanding evolution. It has been a challenge to understand the process of adaptation that happens when organisms evolve, and understanding this process can be useful, for example, in explaining antibiotic resistance.

Laboratory evolution of microorganisms has helped a great deal in understanding adaptation by organisms in nature. In any directed evolution experiment, a library of variants is produced for the gene encoding the protein of interest by recombinant methods. The variants are selected for a particular property such as binding, stability, function or any other relevant characteristic. The selected hits from the library can be amplified and deep sequenced to observe the population. Deep sequencing is extremely helpful for assessing the fitness of mutants in the library, as it gives an estimate of the accumulation or depletion of each mutant throughout the library selection. For example, when engineering a protein or when classifying mutations in a disease-related protein, the experimenter may be interested only in how single mutations affect protein activity. In this case, data for single amino acid substitutions derived from a deep mutational scan can be displayed as a heat map relating sequence to function. Various high-throughput directed evolution methods used in combination with deep sequencing have yielded very large and useful datasets, which have helped researchers arrive at meaningful conclusions. Apart from providing information about the protein itself, this large amount of information greatly helps in understanding the theory of protein evolution.

One study investigating laboratory evolution of a nutrient-limited yeast strain was done using Illumina sequencing94. Whole genome sequencing is very useful for identifying single nucleotide polymorphisms (SNPs) and copy number amplifications. In this study, the authors applied a heuristic approach to detect, from single-end reads, a regulatory SNP in a nutrient-responsive transcriptional regulator in the evolved strain, and also uncovered the exact structural rearrangement of the commonly found sulfate transporter gene SUL1 in the evolved strain.

Fields and coworkers carried out a sequence-function binding study95 which showed that moderate selection allows a larger number of library variants with a broad range of functional activity to be examined, rather than narrowing the population down to a few highly active proteins by stringent selection. In this study, the WW domain of human Yes-associated protein 65 (hYAP65) was selected for binding to its peptide ligand in 3-6 rounds of phage display, and Illumina sequencing was carried out for the 3rd to 6th rounds. The enrichment ratio is calculated by dividing the frequency of a mutation in round six by its frequency in the starting library. Library members with high affinity increase in number in each round, so a high-resolution map of sequence-function relationships can be obtained. The resulting heat map shows the effect of substituting almost every position with any other non-wild-type amino acid. Rosetta was used to predict folding energies for the single and double mutants. An inverse relationship was observed between enrichment ratios and folding energies, which led to the conclusion that thermodynamic instability plays a major role in the depletion of variants during selection. Thus, this study concluded that moderate selection by phage display combined with high-throughput sequencing is an efficient method to assay variants with a broad range of activities.
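The enrichment ratio described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the cited study: the read counts and variant names are hypothetical, the ratio is reported on a log2 scale for convenience, and variants with zero reads in either pool are simply skipped.

```python
import math


def enrichment_ratios(input_counts, selected_counts):
    """Return the log2 enrichment ratio (selected / input frequency) per variant."""
    total_in = sum(input_counts.values())
    total_sel = sum(selected_counts.values())
    ratios = {}
    for variant, n_in in input_counts.items():
        n_sel = selected_counts.get(variant, 0)
        if n_in == 0 or n_sel == 0:
            continue  # ratio undefined or zero; left out of this sketch
        f_in = n_in / total_in        # frequency in the starting library
        f_sel = n_sel / total_sel     # frequency after selection
        ratios[variant] = math.log2(f_sel / f_in)
    return ratios


# Hypothetical read counts before and after selection.
input_counts = {"WT": 500, "A10G": 250, "K21E": 250}
selected_counts = {"WT": 500, "A10G": 400, "K21E": 100}

log2_ratios = enrichment_ratios(input_counts, selected_counts)
```

Positive values indicate variants that accumulated during selection and negative values indicate depleted variants; arranging these values by position and amino acid gives the kind of heat map described in the text.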

One important structure-function study96 used pyrosequencing combined with high-throughput phage display to look at the coevolution of PDZ domains. PDZ domains, a class of peptide recognition modules (PRMs), bind to the C-terminus of a protein by recognizing short peptides. Ten core peptide-binding residues of the model system, the Erbin PDZ domain, were randomized and the library was displayed on phage. Such synthetic domains had been constructed before by Sidhu and coworkers, but were selected for proteolytic resistance. Here, the library was selected for binding to 15 peptide ligands, and at the end of selection six unique structural domains were able to perform C-terminal recognition for each ligand. A surprising result was that the evolved domains and the reference domains had a mean sequence identity of only 30%. Specificity profiles were calculated from the large dataset obtained from deep sequencing, and affinity assays were carried out. The evolved domains had higher affinity and lower specificity than the unevolved synthetic domains and natural domains. This means that the evolved domains need fewer residues to bind the ligand than the unevolved domains. The low specificity was attributed to the absence of competition for ligands in the in vitro evolution experiment. In spite of this, the study is a good example of how in vitro evolution can be studied by combining deep sequencing with high-throughput phage display.

Another example of peptide binding studies is the work done in 2010 by Keating and coworkers97 using yeast display screening, which was later characterized in more detail with the help of Illumina sequencing in 2012. The identification of peptide binding partners for the anti-apoptotic proteins of the Bcl-2 family plays a key role in drug design for cancer therapy98. Bim BH3 peptide libraries displayed on the surface of yeast cells were selected for specific binding to one of the receptors Mcl-1 or Bcl-xL while being selected against the other receptor. Keating and coworkers carried out another study99 using Illumina sequencing to identify peptides binding to the prosurvival protein Bfl-1, which has implications in cancer biology. The peptide ligands displayed on yeast were deep sequenced after selection by fluorescence-activated cell sorting (FACS), and the enrichment of the peptide variants was studied in detail. Thus, deep sequencing, and Illumina sequencing in particular, has been successful in answering important questions in protein evolution.

Bolon and coworkers100 utilized deep sequencing technology to study the fitness landscape of all possible single point mutants for a 9-amino acid region of the yeast chaperone Hsp90. The region was selected so that there is variability in both evolutionary conservation and the physical position of the amino acids in the protein. The approach, ‘EMPIRIC’ (extremely methodical and parallel investigation of randomized individual codons), was used to determine fitness landscapes. Here, selective pressure is monitored through plasmid abundance, which parallels cell growth. An exact measure of selection pressure is obtained by mixing all degenerate bases in a codon, thus including all stop codons and the wild-type sequence. Selection coefficients are calculated to give a measure of the relative fitness of each mutant. A great advantage is that fitness is calculated as the change in relative abundance over time, which rules out the possibility of mutational bias in the input library affecting the calculations. The selection coefficients for six mutants were compared between EMPIRIC and a classic two-mutant competition method. The EMPIRIC results were highly reproducible because the mutants in this approach are grown in the same flask and are thus subjected to identical conditions, in contrast to the classic method. The authors also observed that the fitness of stop codons falls with time, as seen from the relative plasmid abundance. The fitness effects of synonymous codon substitutions were near neutral, as expected. The fitness change on substitution with amino acids of similar physical properties was also examined at each position. Hydrophobic amino acid changes were better tolerated than polar amino acid changes, and the authors inferred that this may be because the core can always be repacked with similar hydrophobic amino acids, whereas polar amino acids have specific geometrical interactions forming hydrogen bonds with different energy profiles. Another important observation comes from comparing the amino acids with WT-like fitness at a given position to the amino acids found in the phylogenetic alignment. In one example, a glycine to phenylalanine mutation is highly favored in the selection whereas phenylalanine is completely absent from the evolutionary data, because the mutation requires two base changes. Simulations also showed that the genetic code is highly optimized for single base changes between codons of WT-like fitness. It is evident from the selection coefficients that a point mutation is favored if it is present in the evolutionary alignment and is mostly deleterious (80% in this case) if it is not seen in the alignment. Also, the varied codon usage in the phylogenetic data supports the occurrence of mutational sampling in nature. These results support the neutral model of molecular evolution, in which genetic drift acting on mostly neutral mutations plays a large role, and thus EMPIRIC is a promising method to study population genetics.

Point mutant libraries combined with Illumina sequencing prove to be a great strategy to study the fitness of library mutants101. Another deep sequencing study102 focused on analyzing the impact of all possible ubiquitin single point mutants on yeast growth rate. Illumina sequencing was used to observe the relative abundance of each point mutant in the library, and the relative fitness of the mutants was obtained by EMPIRIC analysis. From the analysis, a few surface residues appear to be ultrasensitive to mutation and a few residues on the opposite helical face are ultra-tolerant. Also, there is a non-uniform distribution of tolerance in the core, and a binding interface bearing the C-terminus was sensitive to mutation, indicating that binding is a major determinant of ubiquitin function. This study is another example showing that deep sequencing of a combined library of single point mutants is useful for obtaining the mutational profile of an entire gene.
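The idea of measuring fitness as a change in relative abundance over time, as in the EMPIRIC studies described above, can be sketched as follows. This is only an illustration under assumptions: the selection coefficient is taken here as the least-squares slope of ln(mutant/wild type) against time, which is one common definition and not necessarily the exact formula used in the cited work, and all abundances are hypothetical.

```python
import math


def selection_coefficient(times, mutant_abundance, wt_abundance):
    """Least-squares slope of ln(mutant / wild type) versus time."""
    ys = [math.log(m / w) for m, w in zip(mutant_abundance, wt_abundance)]
    n = len(times)
    mean_t = sum(times) / n
    mean_y = sum(ys) / n
    sxx = sum((t - mean_t) ** 2 for t in times)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times, ys))
    return sxy / sxx


# Hypothetical time course (in generations) for a mutant that is depleted
# relative to wild type at a constant rate of 0.1 per generation.
times = [0, 2, 4, 6]
wt_abundance = [1000.0, 1000.0, 1000.0, 1000.0]
mutant_abundance = [1000.0 * math.exp(-0.1 * t) for t in times]

s = selection_coefficient(times, mutant_abundance, wt_abundance)
```

Because only the slope is used, the starting ratio of mutant to wild type cancels out, which is why an uneven input library does not bias the estimate.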

As noted above, deep sequencing is a technique for which new applications continue to emerge, and the sophistication of the field is growing at a very high rate to match these needs. It is not an overstatement to say that deep sequencing has introduced a new era in science, as statistical analysis of an enormous amount of data greatly increases confidence in the conclusions of a study. As seen from the described examples, Illumina sequencing is a cost-effective and efficient deep sequencing method. We have utilized Illumina sequencing to study the fitness landscape of Rop.


In the following chapters, I have included the results from surface engineering of Rop and also point mutant library data of Rop. In the second chapter, I have included a confirmation study of work done by Nguyen in her Master’s thesis (Protein engineering surface residues of the four-helix bundle protein Rop: Evaluating the relationship of sequence-stability through high-throughput approaches, Chau Nguyen, Master’s thesis). An NNK5-surface library was first constructed in the pAC vector and screened for hits, and the consensus sequence was KKK(N/H)K. This was a surprising result, and the hits were taken and cloned into a pMR vector, in which the consensus sequence EEDH(Q/K) was very similar to the wild-type sequence. There was a question of whether the library could be biased, as the hits were recloned into pMR. In this chapter, I have described the work in which hits from the pAC library were subjected to a growth selection different from the previous method and were sequenced. Also, the pMR library was reconstructed from naives of the pAC library instead of from the hits as in the previous case. The results were quite similar to the previous case and helped address whether the earlier results were biased.

The greater number of lysines among the hits raised the question of whether they play an important role in some characteristic of Rop. It was also surprising that nature chose negatively charged residues over positively charged residues at these positions. In addition, the reason behind the selection of lysines over arginines was not known. In the third chapter, I have described the construction of lysine and arginine variants at these five positions of the library and the characterization results.

A point mutant of a protein yields relevant information on whether a particular residue at a position is crucial for the protein to be functional. A point mutant library is more useful for finding how tolerant a position is to all other non-wild-type amino acids. NNN (N = A, G, C or T) codon point mutant libraries were constructed in a cysteine-free mutant, AV-Rop, which has two core cysteine residues mutated to alanine (C38A) and valine (C52V). AV-Rop has wild-type-like properties, is stable, and is free of problems of oxidative dimerization owing to the absence of cysteines. The library was subjected to rounds of growth selection to select for hits and then sequenced by Illumina technology. The above-mentioned work was done as my Master’s thesis project. In the fourth chapter, I have included the analysis of Illumina sequencing results obtained from the above work and preliminary results from redesign of the experimental strategy.


Chapter 2: Engineering surface residues of the four helix bundle protein Rop


2.1 Summary

Originally, the NNK5-surface library was constructed by Nguyen (Protein engineering surface residues of the four-helix bundle protein Rop: Evaluating the relationship of sequence-stability through high-throughput approaches, Chau Nguyen, Master’s thesis), in which five solvent-exposed residues, E39, S40, D43, D46 and E47, were randomized to all twenty amino acids. It was first constructed in the pAC vector and screened for hits, and the consensus sequence was KKK(N/H)K. This was a surprising result, and the hits were taken and cloned into a pMR vector, in which the consensus sequence EEDH(Q/K) was very similar to the wild-type sequence. There was a question of whether the library could be biased, as the hits were recloned into pMR. In this chapter, I have described the work in which hits from the pAC library were subjected to a growth selection different from the previous method and were sequenced. Also, the pMR library was reconstructed from naives of the pAC library instead of from the hits as in the previous case. The results were quite similar to the previous case and helped address whether the earlier results were biased. It has to be noted that fewer sequences were obtained in this study than in the previous case, and hence the result was not replicated exactly. It did, however, produce a trend very similar to the previous study by Nguyen.


2.2 Materials

2.2.1 Plasmids and strains

The plasmids pACT7lac and pUCBADGFPuv were constructed by Magliery et al28. The plasmid pMRH6 was constructed in the Magliery lab at the Ohio State University. The DH10B(DE3) strain of E. coli was made by lysogenizing DH10B with the DE3 lambdoid phage using a kit from Novagen, by Jason Lavinder in the Magliery lab. The cloning sites are AflIII and BamHI for the pMRH6 vector and AflIII and BanI for pACT7lacCm-AflIII. Analytical tests were performed using an XmnI digest, which cuts pUCBADGFPuv at two sites and linearizes pMRH6 and pACT7lacCm-AflIII. Background digests were carried out with EcoRI and NdeI for pMRH6 and with NcoI and EcoRI for pACT7lacCm-AflIII.

2.2.2 Rop NNK5-surface library construction

In the work by Nguyen, in the NNK5-surface library (K=G/T), five solvent-exposed residues 39, 40, 43, 46 and 47 were randomized to all 20 amino acids. An engineered Cys-free Rop (C38A/C52V) called AV-Rop was used as a scaffold for the library. The library was constructed by reassembly PCR of the two fragments NNKsurf-pro and NNKsurf-term (Table 2) using an in-house produced Pfu polymerase. The reassembly product was PCR amplified using the two primers pAC-pro and pAC-term (Table 2). This library of inserts was digested with AflIII and BanI and then cloned into the pACT7lacCm-AflIII vector, a low copy number plasmid that expresses Rop from a synthetic lac promoter.

2.2.3 Screen and selection for active variants

Rop controls plasmid copy number, and cells carrying inactive variants therefore grow slowly because of the metabolic load imposed by the high amount of plasmid. For any library, the naives are grown in 1 L of 2YT at 42 °C for 17-18 hours until the culture reaches saturation. From this culture, 100 µL of cells at an OD (optical cell density) of 1 is transferred into another 1 L of media and grown for 17-18 hours. After each round, the culture is plated on LB-kan-amp-0.0005% arabinose to observe the phenotypes and the selection of active variants, using the cell-based screen for Rop function developed by Magliery and Regan28. Fluorescent phenotypes are picked from a selected round and plasmid levels are checked by performing analytical digests. The active phenotypes, which show controlled levels of the GFP plasmid, are sequenced from colonies.
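As a rough guide to how much growth each selection round represents, the dilution described above can be converted into an approximate number of doublings. This is a back-of-the-envelope sketch under a stated assumption: the culture is taken to regrow to the same cell density as the inoculum, so the figure is a lower bound if saturation exceeds an OD of 1. The function name is hypothetical.

```python
import math


def generations_per_round(culture_volume_ml, inoculum_volume_ml):
    """Doublings needed to regrow to the starting density after dilution."""
    dilution_factor = culture_volume_ml / inoculum_volume_ml
    return math.log2(dilution_factor)


# 100 uL of cells transferred into 1 L of fresh media (volumes from the text).
dilution = 1000.0 / 0.1
generations = generations_per_round(1000.0, 0.1)
```

With these volumes the dilution is about 10,000-fold, which corresponds to roughly 13 doublings per round under the assumption above.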

2.2.4 Resequencing of pAC surface library

The naïve library constructed by Nguyen in the pACT7lacCm-AflIII vector was transformed into the strain DH10B(DE3)-pUCBADGFPuv and subjected to several rounds of growth selection at 42 °C as described above. From round three, which had a mix of phenotypes, five fluorescent phenotypes and twelve intermediately fluorescent phenotypes were picked and grown overnight in 5 mL LB media. The next day, the cultures were spotted on LB-kan plates and sequenced directly from colonies by Genewiz Inc.

The naïve NNK surface library in pACT7lacCm-AflIII (made by Nguyen) was subcloned between the AflIII and BamHI sites of the expression vector pMRH6 using the primers pMR-pro and pMR-term (Table 2). The library in pMRH6 was transformed into the strain DH10B(DE3)-pUCBADGFPuv and subjected to five rounds of growth selection at 42 °C. Fluorescent phenotypes were picked from round four and plasmid levels were checked by performing XmnI and NcoI analytical digests. The active phenotypes, which show controlled levels of the GFP plasmid, were sequenced from colonies by Genewiz Inc. PMR-SEQ-REV and pac-seq.pro are the sequencing primers.


Table 2: Oligonucleotide sequences used for the construction of surface library

Oligo | Sequence
NNKsurf-pro | ACTAAGCAAGAGAAGACAGCACTTAATATGGCTCGTTTTATTCGTTCTCAAACTCTTACTCTTCTTGAAAAACTTAATGAACTTGATGCTGACG
NNKsurf-term | CAAAGCGAGCAAGAACAGAACGATAAAGMNNMNNAGCATGMNNATGTAAMNNMNNAGCAATATCAGCTTGTTCGTCAGCATCAAGTTCATTAAG
pMR-pro | AATATTACACGTGGTGAGAATCTGTATTTTCAGGGCACTAAGCAAGAGAAGACAGCAC
pMR-term | TAATAAGGATCCTCACAGATTCTCACCATCATCACCAAAGCGAGCAAGAACAGAAC
pAC-pro | AATAATACACGTATGACTAAGCAAGAGAAGACAGCAC
pAC-term | TAATAAGGCACCTCACAGATTCTCACCATCATCACCAAAGCGAGCAAGAACAGAAC
pac-seq.pro | CGGCTCGTATGTTGTGTGGAATTG
PMR-SEQ-REV | CTAGTTATTGCTCAGCGG
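The NNKsurf-term oligo in Table 2 contains MNN triplets rather than NNK. Assuming this oligo is written as the antisense strand, a short sketch shows that MNN (M = A or C) is the reverse complement of NNK, and that the 22-nucleotide 3’ tails of the two assembly oligos are complementary, as needed for reassembly PCR. The helper function is hypothetical and handles only the IUPAC codes that occur in these oligos.

```python
# Complements for the bases and the IUPAC codes used in Table 2
# (M = A/C pairs with K = G/T; N pairs with N).
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G",
              "M": "K", "K": "M", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


# An MNN triplet on the antisense oligo reads as NNK on the coding strand.
coding_codon = reverse_complement("MNN")

# Last 22 nucleotides of NNKsurf-pro and NNKsurf-term, copied from Table 2.
pro_tail = "CTTAATGAACTTGATGCTGACG"
term_tail = "CGTCAGCATCAAGTTCATTAAG"
tails_overlap = reverse_complement(term_tail) == pro_tail
```

The same function can be applied to the full NNKsurf-term sequence to read the randomized region in the coding orientation.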


2.3 Results

This chapter is a confirmation study of a previous NNK5-surface library, carried out using different conditions for selection, and is also the basis for the work in the next chapter. Previously, combinatorial libraries were constructed in Rop (Chau Nguyen) which randomized five surface positions to all twenty amino acids. The codon used for randomization was NNK (K=G or T), which reduces the number of stop codons incorporated in the gene and controls codon bias to some extent. The NNK5-surface library randomized five residues, E39, S40, D43, D46 and E47, located at the b, c and f positions of the heptad repeat, which are highly solvent-exposed regions of helices 2 and 2’ (Figure 8). The RNA recognition site, which is on the surface of helices 1 and 1’, was left undisturbed as the Rop screen is based on activity. Rop’s surface is highly negatively charged and the wild-type sequence mutated here is ESDDE. This library was also made on the AV-Rop scaffold and the theoretical diversity is 20^5 (3.2 x 10^6) variants.
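The effect of the NNK scheme on stop codons and on library size can be checked with a short sketch that enumerates codons against the standard genetic code. The function and variable names are hypothetical; the codon table is the standard one written in the usual TCAG order.

```python
from itertools import product

# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def codon_scheme_stats(third_bases):
    """Count codons, stop codons and encoded amino acids for NN + third base."""
    codons = [a + b + c for a in BASES for b in BASES for c in third_bases]
    stops = [c for c in codons if CODON_TABLE[c] == "*"]
    residues = {CODON_TABLE[c] for c in codons} - {"*"}
    return len(codons), len(stops), len(residues)


nnk = codon_scheme_stats("GT")     # K = G or T
nnn = codon_scheme_stats("ACGT")   # fully degenerate codon

positions = 5
protein_diversity = 20 ** positions   # distinct five-residue sequences
dna_diversity = 32 ** positions       # distinct NNK codon combinations
```

NNK keeps all twenty amino acids while retaining only one of the three stop codons (TAG), and five randomized positions give 3.2 x 10^6 protein sequences encoded by about 3.4 x 10^7 codon combinations.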

The library was first cloned into the activity screen vector pAC, controlled by the lac promoter, and then recloned into the pMR expression plasmid, controlled by the T7 promoter. The vectors used here are pACT7lacCm-AflIII, pMRH6 and pUCBADGFPuv (Figure 18). The surprising result was that the pAC library had a consensus of positively charged residues, whereas the pMR library was close to wild-type, with a consensus of negative charge. In particular, lysines were found more often than arginines in the pAC library. The sequences are listed in Tables 3 and 4. The chart with the active sequences is shown in Figure 9. The T7 construct incorporated an N-terminal His6 tag to facilitate affinity purification of overexpressed variants. For overexpression to work, the IPTG inducer was required and the library needed to be in a host containing the DE3 lysogen. For this, DH10B(DE3), a version of DH10B lysogenized with DE3 phage, was used. The T7 construct in the DH10B(DE3) host provided a system that allowed us to screen, sequence, and overexpress from the same host/vector combination. Since this is only a confirmation study, only 15 sequences were obtained for each of the libraries.

Comparison of hits by position is shown in Figure 10. The results don’t exactly replicate Nguyen’s library results. This is because her library had about a hundred sequences, whereas this repeat is based on only about twenty-five sequences. Still, it is a good representation of Nguyen’s results. It also has to be noted that a different method of growth selection was used and that naives, instead of hits, were cloned into the pMR vector. This may lead to some variation in results. Nevertheless, it is an interesting observation that the overall distribution is similar. Further biophysical characterization was done in the work by Nguyen. This chapter covers only the confirmation of sequencing results from both libraries.
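A per-position comparison of hits of the kind described above can be tallied with a few lines of code. The sketch below is only an illustration: it uses the first three sequences listed in Table 4 as a small example, the function names are hypothetical, and the net charge counts Lys and Arg as +1 and Asp and Glu as -1 while ignoring His, which is a simplifying assumption.

```python
from collections import Counter

# Simplified side-chain charges; histidine is treated as neutral here.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def position_counts(sequences):
    """Count amino acids at each position across equal-length sequences."""
    length = len(sequences[0])
    return [Counter(seq[i] for seq in sequences) for i in range(length)]


def net_charge(sequence):
    """Sum of the simplified charges over the five library positions."""
    return sum(CHARGE.get(residue, 0) for residue in sequence)


hits = ["SVHDA", "RCTHG", "KNLHW"]   # first three entries of Table 4
counts = position_counts(hits)
wild_type_charge = net_charge("ESDDE")   # wild-type residues at the library positions
```

Running the same functions over all sequences of each library gives the per-position distributions and makes the difference in net charge between the libraries easy to quantify.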


Figure 8: Side and top view of AV-Rop homodimer with the library residues highlighted


Table 3: Sequences from round three of selection of pAC library

Actives Intermediates EADDH WRKEK KQATE MYKTR ETKDK NTKQN ESTEK RNTTQ ETETK NTRLT HFSEK KLNDA LGEDQ NSTDV NHEEK NKQKQ KTTEN

Table 4: Sequences of actives from round four of selection of pMR library

SVHDA RCTHG KNLHW HSGDF KTQQQ MKNQQ ESDSL YETAQ MHLHE KQVTQ RQSNK RSAQV MTIVE RVTDE MTIVE EHWTQ DEDAM RVTDE

Figure 9: Results from sequencing of Nguyen’s pAC surface library (top left) and my pAC surface library (top right). Results from sequencing of Nguyen’s pMR surface library (bottom left) and my pMR surface library (bottom right). Nguyen’s figures were taken for comparison from her Master’s thesis.


Figure 10: Amino acid distribution of library hits by position; pMR is in blue and pAC is in red

2.4 Discussion

Studies have revealed that the surface residues and surface electrostatics of a protein are more important for stability than previously thought. Amino acid residues on the protein surface have generally been considered less important than residues in the hydrophobic core in contributing to protein stability. This is because of the belief that surface positions tolerate mutation to non-native residues better than the protein core29. In spite of this, various studies using directed evolution and site-directed mutagenesis have explored the contributions of surface residues to protein stability. There is considerable evidence from previous studies that engineering surface residues and surface electrostatics can increase protein stability30-32. This chapter examines the effect of surface residues on protein stability through the sequence-stability relationship. Using both rational and combinatorial design approaches, the NNK5-surface library was constructed at five highly solvent-exposed positions on the four-helix bundle protein Rop. Selection for stable and well-folded protein variants was performed by a cell-based selection method based on the RNA binding function of Rop variants.

Sequencing data of 15 active variants from the activity screen (the pAC library) and 15 active variants from the expression screen (the pMR library) showed significant differences in amino acid composition. Lys is highly favored in the pAC library, whereas the pMR library has more of the negatively charged residues Glu and Asp, as they appear in the AV-Rop sequence. Transferring the library from the pAC to the pMR screen either as naives or as hits didn’t make a difference in the results. The pAC screen is supposed to be more stringent than the pMR screen. Therefore, the intermediates in the pAC screen may be actives in the pMR screen. In support of this, colonies picked at random were more often intermediates than actives in the pAC screen and more often actives in the pMR screen. The sequence-stability relationship for a combinatorial library of surface mutations was obtained by this method. Further structural analysis needs to be carried out in order to establish solid conclusions on the role and contribution of surface residues to protein stability.


Chapter 3: Effect of surface lysines and arginines on stability and other characteristics of Rop


3.1 Summary

The NNK5-surface library was constructed by Nguyen, in which five solvent-exposed residues, E39, S40, D43, D46 and E47, were randomized to all twenty amino acids. It was first constructed in the pAC vector and screened for hits, and the consensus sequence was KTKKK. This was a surprising result. There was a question of whether the library could be biased, as the hits were recloned into pMR. In chapter two, I have described the work in which hits from the pAC library were subjected to a growth selection different from the previous method and were sequenced. The results were quite similar to the previous case and helped address whether the earlier results were biased. The greater number of lysines among the hits raised the question of whether they play an important role in some characteristic of Rop. It was also surprising that nature chose negatively charged residues over positively charged residues at these positions. In addition, the reason behind the selection of lysines over arginines was not known.

In this chapter, I have described the construction of lysine and arginine variants at these five positions of the library and the characterization results. Some of the results are very surprising and add value to the model studies done on Rop.


3.2 Materials and Methods

3.2.1 Cloning and Sequencing

The variants were created by PCR assembly of four synthetic oligonucleotides (Sigma-Aldrich) that make up the Rop gene. The amino acid positions from a previously made NNK5-surface library (39, 40, 43, 46 and 47) were taken, and either lysine or arginine point mutations were made at each of these positions. Four multi-mutants (KTKKK, KTKTH, RTRRR and RTRTH) at these five positions were also constructed. All these mutants were made in the background of an engineered Cys-free (C38A/C52V) “AV” Rop.

The oligonucleotide sequences used to construct the variants are listed in Table 5.
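The assembly step relies on neighbouring oligonucleotides annealing through complementary 3' ends before the polymerase fills in the rest. The sketch below checks one such junction using two sequences copied from Table 5. Treating AV-WT5' as the forward-strand oligo and the E39K oligo as its reverse-strand partner is an assumption drawn from reading the sequences, and the helper names are hypothetical; this is an illustration, not the procedure used to design the oligos.

```python
def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence (case-insensitive)."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq.upper()))


def three_prime_overlap(forward: str, reverse: str) -> str:
    """Longest suffix of `forward` that is also a prefix of the reverse
    complement of `reverse`, i.e. the stretch over which the two oligos
    can anneal through their 3' ends."""
    forward = forward.upper()
    target = reverse_complement(reverse)
    for size in range(min(len(forward), len(target)), 0, -1):
        if forward.endswith(target[:size]):
            return forward[-size:]
    return ""


# Sequences as listed in Table 5. The forward/reverse pairing is assumed.
AV_WT5 = (
    "ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAA"
    "GCCAGACATTAACGCTTCTGGAGAAACTCAACGAGCTGGACGCGGATG"
)
E39K = (
    "GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCA"
    "GCGTGGTCGTGAAGCGATTTAGCGATATCTGCCTGTTCATCCGCGTC"
    "CAGCTCG"
)

overlap = three_prime_overlap(AV_WT5, E39K)
print(f"3' overlap: {overlap} ({len(overlap)} nt)")
```

The same check can be run on any other forward/reverse pair from Table 5 to confirm that a mutant oligo still anneals to its partner.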

Following PCR assembly with Klenow DNA polymerase, the product was PCR amplified for 25 cycles using in-house made Pfu polymerase with the pMRTEV5 and pMRH6TEV3 primers, which contain the restriction sites and the sequence encoding a TEV protease cleavage site. This PCR product was ligated between the AflIII and BamHI sites of the expression vector pMRH6, which fuses a hexahistidine tag amino-terminal to the TEV cleavage site. The cloned vector was transformed into DH10B and sequence confirmed by Genewiz, Inc. (South Plainfield, New Jersey). Similarly, for in vivo functional assays, these variants were also cloned between the AflIII and BanI restriction sites of the screening vector pACT7lacCm-AflIII, a plasmid that uses a lac promoter. Plasmids used for expression were transformed into BL21 (DE3) for protein purification.

Table 5: Oligonucleotide sequences used for the construction of point mutants and multi-mutants of lysine and arginine variants in both pMR and pAC context

Oligo            Sequence
E39K             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGATTTAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
S40K             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGTTTTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
D43K             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGTTTGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
D46K             GCCATCGTCACCAAAGCGCGCGAGAACACTGCGGTAAAGCTCTTTAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
E47K             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGTTTATCAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
E39R             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGAACGAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
S40R             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGACGTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
D43R             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGACGGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
D46R             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCACGAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
E47R             GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGACGATCAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
AV-WT5'          ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGCTTCTGGAGAAACTCAACGAGCTGGACGCGGATG
SM-pMRTEV3       TAA TAA GGA TCC TCA GAG GTT TTC GCC ATC GTC ACC AAA GCG
pMRTEV5          AAT ATT ACA CGT GGT GAG AAC CTG TAT TTT CAG GGC ACC AAA CAG GAA AAA ACC
KTKTH            GCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAGatgggtAGCGTGtttGTGAAGggtTTTAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
KTKKK            GCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAGttttttAGCGTGtttGTGAAGggtTTTAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
RTRRR            GCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAGgcggcgAGCGTGgcgGTGAAGggtgcgAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
RTRTH            GCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAGatgggtAGCGTGgcgGTGAAGggtgcgAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
PAC-OP5          ACGAAACAAGAAAAGACGGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGCTTCTGGAGAAACTCAACGAGCTGGACGCGGATG
PAC-AP3-NP       TAATAAGGATCCTCAGAGGTTTTCGCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAG
PAC-AP5          AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACGAAACAAGAAAAGACG
pAC-E39K         gGCGAGAACaCTGCGGTAAAGCTCATCcGCGTGGTCGTGAAGCGAtttAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E39R         gGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGAgcgAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E40K         gGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGtttTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E40R         gGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGacgTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E43K         gGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGtttGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E43R         gGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGacgGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E46K         gGCGAGAACaCTGCGGTAAAGCTCtttAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E46R         gGCGAGAACaCTGCGGTAAAGCTCacgAGCGTGGTCaTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E47K         gGCGAGAACaCTGCGGTAAAGtttATCAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-E47R         gGCGAGAACaCTGCGGTAAAGacgATCAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
pAC-AP3-KR       TAATAAGGATCCTCAGAGGTTTTCGCCATCGTCACCAAAACGgGCGAGAACaCTGCGGTAAAG
pMR-E39R-II      GCCATCGTCACCAAAGCGCGCGAGAACaCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGAgcgAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
PAC-KTKTH-OP3    CGCGAGAACGCTGCGGTAAAGATGGGTAGCGTGTTTGTGAAGGGTTTTAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
PAC-KTKKK-OP3    CGCGAGAACGCTGCGGTAAAGTTTTTTAGCGTGTTTGTGAAGGGTTTTAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
PAC-RTRRR-OP3    CGCGAGAACGCTGCGGTAAAGGCGGCGAGCGTGGCGGTGAAGGGTGCGGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG
PAC-RTRTH-OP3    CGCGAGAACGCTGCGGTAAAGATGGGTAGCGTGGCGGTGAAGGGTGCGGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG

3.2.2 Solubility

A 5 mL culture of each variant was grown for a day for small-scale expression to assess protein solubility. In the evening, the cultures were diluted to a cell density (OD600) of 0.8, placed at 37 °C for 10 minutes and then induced with 50 µL of a 10 mM IPTG solution. Cultures were grown at 30 °C overnight, normalized for OD600 and collected by centrifugation. The pellets were resuspended in 125 µL of lysis buffer (50 mM Na phosphate, 300 mM NaCl, 10 mM imidazole, pH 7.4). 12 µL of a solution containing MgCl2, CaCl2, DNase and RNase was added to the resuspension for a final concentration of 5 mM MgCl2, 0.5 mM CaCl2, 2 U/mL DNase I and 6 mg/L RNase A. Cells were lysed by vortexing with glass beads. The lysate was removed and the pellets were washed three times with 1 mL of lysis buffer. 20 µL of SLB (SDS loading buffer) was added to the pellets, which were heated and vortexed five times. 5 µL of the pellet solution was mixed with another 15 µL of SLB, and the full 20 µL of each was loaded on an 18% SDS-PAGE gel. 10 µL of lysate of each variant mixed with 10 µL of SLB was also loaded.

3.2.3 Activity

Variants cloned in the pACT7lacCm-AflIII vector were transformed into electrocompetent DH10BtonA cells that contain the reporter plasmid pUCBADGFPuv. Cultures of each variant were grown overnight at 42 °C and normalized to an OD of 1, and 3 µL of each was plated on LB agar medium containing 100 µg/mL ampicillin, 30 µg/mL kanamycin and 0.0005% (w/v) arabinose and incubated for 16 h at 42 °C. In the cell-based screen for Rop function developed by Magliery and Regan, active Rop displays a high-fluorescence phenotype, while inactive variants do not fluoresce. Cellular fluorescence was documented using UV illumination. To compare the ColE1 plasmid levels directly, each doubly transformed variant was grown overnight in a 5 mL culture at 42 °C and normalized to an OD of 1, and an alkaline lysis miniprep was performed on those cultures. Analytical digests using XmnI were done on the variants and on control DNA isolated from DH10BtonA.gfp cells containing pAC-AVRop (active Rop) or pACT7lacCm-AflIII (inactive linker). These digests were analyzed by agarose gel electrophoresis to compare the plasmid levels of each variant, which correlate with its activity.

3.2.4 Protein expression and purification

Variants in the expression vector pMRH6 were overexpressed in BL21(DE3) in 1 L of 2YT media containing 30 µg/mL kanamycin. Cells were grown to log phase (OD600 ~0.7) and then induced with 0.1 mM IPTG. Cells were harvested by centrifugation after growing at 30 °C overnight. Cell pellets were stored at -80 °C, and the frozen pellets were resuspended in 25 mL of lysis buffer (50 mM Na phosphate pH 7.4, 300 mM NaCl, 10 mM imidazole), mixed with 5 mM MgCl2, 0.5 mM CaCl2, 2 U/mL DNase I, 5 mg/L RNase A and 0.1% Triton X-100 and then incubated on ice for 30 min. Cells were lysed using an Emulsiflex C-3 homogenizer. After lysis, the solution was incubated at room temperature for 30 min. The lysate was then centrifuged for 1 h at 15,000 rpm. The soluble fraction was bound to 1 mL of Ni-NTA agarose slurry (Qiagen) for 1.5 h at 4 °C. The slurry was passed through a prefritted column (Bio-Rad) and washed with 12 mL of wash buffer (50 mM Na phosphate pH 7.4, 300 mM NaCl, 20 mM imidazole), and the protein was eluted with 3 mL of elution buffer (50 mM Na phosphate pH 7.4, 300 mM NaCl, 250 mM imidazole). 0.5 mg of rTEV protease with 5 mM DTT was added to the protein in two reactions to cleave the hexahistidine tag. The first reaction was incubated overnight at room temperature and the second reaction was incubated at 30 °C for 3 h. The reaction was desalted and buffer exchanged into lysis buffer using PD10 columns (GE Healthcare). The eluent was then mixed with 1 mL of Ni-NTA for 1.5 h at 4 °C. The solution was passed through a prefritted column to collect the cleaved protein in the flow-through. The protein was buffer exchanged into CD buffer (50 mM Na phosphate pH 6.3, 300 mM NaCl) and concentrated by centrifugation for subsequent biophysical analysis.

For HSQC NMR studies, the variants were expressed in 15N-labelled minimal media. 100 mL of 10X CSB1 salt solution and 40 mL of CSB2 supplement were brought up to a volume of 1 L to make the minimal media for expression. 1 L of 10X CSB1 salt solution contains 45.5 g of NaH2PO4, 116.7 g of K2HPO4 and 1 g of 15N-labelled NH4Cl. The CSB2 supplement contains 5 mL of 1 M sodium citrate, 3 mL of trace metal solution, 0.5 mL of 1 M MgSO4, 30 mL of 20% glucose, 10 mg of thiamine and 1 mL of water. The trace metal solution contains 10 g FeCl3·6H2O, 2 g ZnCl2·4H2O, 2 g CoCl2·6H2O, 2 g Na2MoO4·2H2O, 1 g CaCl2·2H2O, 1 g CuCl2, 1 g MnCl2, 0.5 g H3BO3 and 100 mL HCl, made up to 1 L with water.

3.2.5 Circular Dichroism

CD (circular dichroism) data were collected using a Jasco J-185 CD spectrometer (Ohio State University Department of Chemistry). The variants were analyzed at a concentration of 50 µM monomer, determined by UV absorption at 280 nm, in CD buffer (50 mM Na phosphate pH 6.3, 300 mM NaCl). Extinction coefficients were calculated using the ExPASy ProtParam tool. Wavelength scans were collected from 190 to 260 nm. Thermal denaturation profiles were acquired at 222 nm from 25 to 95 °C at 1 °C/min with 6 s of equilibration at each temperature.
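As an illustration of how a melting temperature can be read from such a thermal denaturation profile, the sketch below normalizes the 222 nm signal to a fraction unfolded and interpolates the temperature at which it crosses 0.5. It assumes a two-state transition with flat baselines taken from the first and last data points; the fitting procedure actually used for the variants is not described in this section, and the function name and the synthetic data are hypothetical.

```python
import math


def melting_temperature(temps, signal):
    """Estimate Tm as the temperature where the fraction unfolded crosses 0.5.

    Assumes flat baselines: the first point is treated as fully folded and
    the last point as fully unfolded. Real melts usually need sloped
    baselines and a proper two-state fit.
    """
    folded, unfolded = signal[0], signal[-1]
    frac = [(s - folded) / (unfolded - folded) for s in signal]
    points = list(zip(temps, frac))
    for (t0, f0), (t1, f1) in zip(points, points[1:]):
        if f0 <= 0.5 <= f1:
            if f1 == f0:
                return t0
            # Linear interpolation between the two bracketing points.
            return t0 + (0.5 - f0) * (t1 - t0) / (f1 - f0)
    raise ValueError("fraction unfolded never crosses 0.5")


# Synthetic melt (hypothetical values): 25-95 °C in 1 °C steps,
# sigmoidal transition centred at 70 °C.
temps = list(range(25, 96))
signal = [-30.0 + 25.0 / (1.0 + math.exp(-(t - 70.0) / 2.0)) for t in temps]
tm = melting_temperature(temps, signal)
print(f"Estimated Tm: {tm:.1f} °C")
```

The midpoint approach is only a quick estimate; it is shown here because it makes explicit what a Tm value summarizes in the melts reported below.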

3.2.6 Gibbs-Helmholtz Analysis

The CD signal at 222 nm was recorded for the variants at 5 °C intervals from 5 °C to 85 °C at urea concentrations of 0, 1, 2, 3, 4, 5, 6, 7 and 8 M, with the protein concentration kept at 50 µM. The changes in Gibbs free energy, enthalpy, specific heat capacity and entropy were calculated using this analysis.
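For orientation, the commonly used form of the Gibbs-Helmholtz equation for two-state unfolding with a temperature-independent heat capacity change can be sketched as follows. The text does not state the exact functional form or fitting procedure used, so this form is an assumption, and the parameter values are hypothetical placeholders rather than measured values for any Rop variant.

```python
import math


def delta_g(temp_k, tm_k, dh_m, dcp):
    """Free energy of unfolding at temp_k (same energy units as dh_m).

    dG(T) = dHm * (1 - T/Tm) - dCp * ((Tm - T) + T * ln(T/Tm))
    where dHm is the enthalpy of unfolding at Tm and dCp is assumed
    to be independent of temperature.
    """
    return dh_m * (1.0 - temp_k / tm_k) - dcp * (
        (tm_k - temp_k) + temp_k * math.log(temp_k / tm_k)
    )


def temperature_of_zero_entropy(tm_k, dh_m, dcp):
    """Temperature at which dS = 0, where dG(T) reaches its maximum."""
    return tm_k * math.exp(-dh_m / (tm_k * dcp))


# Hypothetical parameters: Tm = 340 K, dHm = 60 kcal/mol, dCp = 1 kcal/(mol K).
tm_k, dh_m, dcp = 340.0, 60.0, 1.0
t_max = temperature_of_zero_entropy(tm_k, dh_m, dcp)
print(f"dG at 298 K: {delta_g(298.0, tm_k, dh_m, dcp):.2f} kcal/mol")
print(f"Temperature of maximum stability: {t_max:.1f} K")
```

By construction the free energy is zero at Tm and peaks where the entropy change vanishes, which is why the shape of the stability curve depends on both the enthalpy and the heat capacity change.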

3.2.7 HSQC NMR

A few variants expressed in 15N-labelled minimal media were purified for 1H-15N HSQC NMR. All the variants were kept in CD buffer and mixed with 10% D2O just before the spectra were taken. Spectra were collected on 15N-labelled protein at 0.5-3 mM concentration using a Bruker 600 MHz NMR spectrometer (Campus Chemical Instrument Center, Ohio State University). Data were processed and analyzed using TopSpin software.

3.3 Results

3.3.1 Construction of lysine and arginine variants

Improving the stability of proteins for a variety of industrial and therapeutic applications remains a continuing challenge in protein engineering. In particular, the contribution of surface residues and surface electrostatics to protein stability was previously underestimated, and many recent studies have focused on analyzing their importance. Rop (Repressor of primer), an antiparallel homodimer, is a model protein used to study protein stability. Rop is a bacterial RNA-binding protein whose function is to regulate plasmid copy number. AV-Rop is a Cys-free mutant of wild-type Rop in which the two cysteine residues are mutated to Ala and Val (C38A and C52V). AV-Rop has wild-type-like properties, is stable and is free of the problem of oxidative dimerization. This makes AV-Rop an excellent model system for cloning combinatorial libraries. The majority of the surface residues of Rop are negatively charged. To understand the sequence-stability relationship of solvent-exposed residues, the NNK5-surface library was constructed from two overlapping degenerate oligos that randomized five residues on the solvent-exposed region of helices 2 and 2’ (chapter 2). Mutations were not made in the RNA-protein recognition site, which was identified to be on the surface of helices 1 and 1’ of Rop, as this would affect its function. In addition, the in vivo GFP screen described in the next section is based on Rop activity and requires the RNA binding site to be intact. The five mutations were made at positions 39, 40, 43, 46 and 47, located at the b, c and f positions of the heptad repeat (Figure 11). These five surface residues carry a highly negative net charge, the wild-type amino acid sequence being ESDDE. In the NNK5-surface library (K = G or T nucleotide), all five residues were randomized to all 20 amino acids. Using NNK codons reduces the number of codons for each amino acid. This reduces codon bias and also restricts the number of stop codons to one instead of three. The pAC library was first cloned into an activity screen under the control of a lac promoter, using the vector pACT7lacCm-AflIII. The library was subjected to several rounds of growth selection to select for hits. Clones with inactive Rop have a higher plasmid copy number, which increases the metabolic load on the cell. This takes a toll on the growth rate, and cells with active Rop eventually dominate the pool. These hits were visually identified using the in vivo GFP reporter screen and sequenced. The consensus sequence from this library was KKK(N/H)K in place of the wild-type residues ESDDE. A surprisingly high number of positively charged residues were present in the library hits, given that wild-type Rop has a negative surface charge. In addition, the high number of lysine residues and the absence of arginine residues in the library hits was unexpected. We wanted to understand the reason behind the dominance of Lys residues, and in particular whether Lys contributes to some characteristic of the protein that Arg does not.
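The effect of NNK randomization on codon usage can be checked by enumeration. The sketch below builds the standard genetic code, lists every NNK codon (N = any base, K = G or T) and counts the stop codons and amino acids that result. It uses only the standard codon table; the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# NNK: any base at positions 1 and 2, G or T at position 3.
nnk_codons = ["".join(c) for c in product(BASES, BASES, "GT")]
nnk_stops = [c for c in nnk_codons if CODON_TABLE[c] == "*"]
nnk_amino_acids = {CODON_TABLE[c] for c in nnk_codons} - {"*"}
nnn_stops = [c for c, aa in CODON_TABLE.items() if aa == "*"]

print(f"NNK codons: {len(nnk_codons)} (NNN: {len(CODON_TABLE)})")
print(f"NNK stop codons: {nnk_stops} (NNN: {sorted(nnn_stops)})")
print(f"Amino acids encoded by NNK: {len(nnk_amino_acids)}")
print(f"DNA-level size of a five-position NNK library: {len(nnk_codons) ** 5}")
print(f"Protein-level size of a five-position library: {20 ** 5}")
```

The enumeration makes the trade-off explicit: NNK halves the codon set while keeping all 20 amino acids, and leaves a single stop codon in the pool.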

A point mutant of a protein yields relevant information on whether a particular residue at a position is crucial for the protein to be functional. Therefore, lysine and arginine point mutants were made at the five positions of the surface library. In addition, four multi-mutants were made at the five positions (KTKKK, KTKTH, RTRRR and RTRTH). All the mutants were made in the context of both the screening vector pACT7lacCm-AflIII and the expression vector pMRH6. The expression vector is under the control of a T7 promoter. The point mutants and multi-mutants were cloned by reassembling the gene fragments of Rop containing the mutation(s) through a Klenow reaction and then amplifying the reassembled products by PCR. This was possible because Rop is a small protein with only 63 amino acid residues.

Figure 11: Homodimer of AV-Rop with library residues highlighted

3.3.2 Activity

The function of Rop is to control the copy number of ColE1 plasmids in E. coli, and the screen is based on the expression of a GFP reporter from the ColE1 plasmid. It is a positive screen in which active variants result in a high level of GFP expression and inactive variants show a low level. As a result, on the screening plate the actives are more fluorescent, the intermediates are slightly fluorescent and the inactives are non-fluorescent. This is an in vivo screen. The screen results can also be confirmed in vitro by lysing the cells and examining the plasmid levels by gel electrophoresis. In the gel assay, active variants show a lower amount of GFP plasmid because active Rop controls plasmid copy number. Figure 12b shows the results of the in vivo functional assay of Rop. The positive and negative controls were pAC-AV-Rop and pACT7lacCm-AflIII, respectively. The fluorescence and plasmid levels of these controls were compared to those of the mutants. E39R, D43K and E47R appeared to have an intermediate level of expression. All the other point mutants looked active. Among the multi-mutants, KTKKK and RTRTH had intermediate activity, whereas RTRRR and KTKTH were inactive. The plasmid levels by gel, shown in Figure 12a, mostly matched the corresponding in vivo screen results. There were no inactive point mutants and no fully active multi-mutants. This implies that as more negatively charged surface residues are mutated to positively charged residues, the function of the protein is affected, whereas single point mutations do not affect the activity much. There was no clear difference between lysines and arginines in their effect on activity. More biophysical analysis is needed to see how activity correlates with other properties of the protein.

(a)

(b)

Figure 12: Activity of Rop variants by (a) plasmid levels and (b) in vivo screen. Active Rop has a lower amount of GFP plasmid and is brighter in the cell-based screen. Inactive Rop has a higher amount of GFP plasmid and is dimmer in the screen.

3.3.3 Stability

Circular dichroism (CD) spectroscopy can be used to determine the secondary structure, and thus the helical content, of a protein. Thermal or chemical denaturation monitored by CD spectroscopy gives information on the stability of the mutants. The mutants were expressed on a large scale and purified by affinity chromatography. It was not possible to purify the mutant E39K on a large scale for characterization, as its expression in the soluble fraction was very low. Wavelength scans of all the other mutants, shown in Figure 13 (top), were collected by CD at a fixed concentration of 50 µM to determine mean residue ellipticity. All but four of the mutants had mean residue ellipticity similar to that of AV-Rop, which indicates that they possess similar secondary structure. RTRRR, S40R, E47K and D46R had lower ellipticity values than AV-Rop. This may reflect some structural changes in these variants. Methods such as NMR analysis or X-ray crystallography may help in analyzing the structural changes.

Figures 13 (bottom) and 14 (top) show thermal denaturation melts of the point mutants and multi-mutants, respectively. In this method, the mutants were melted from 25 °C to 95 °C and monitored using CD spectroscopy. All the variants showed cooperative unfolding and were reversible upon thermal denaturation. A range of melting temperatures (Tm) was observed. The Tms ranged from 69.6 °C for RTRTH to 83.2 °C for D43R. The Tm of AV-Rop is 66.8 °C. Only D46R and D46K had lower Tms than AV-Rop. The remaining eleven variants had higher Tms than AV-Rop. All the multi-mutants were stabilized over AV-Rop. There was no obvious correlation between the number of mutations and stability. D43K, D43R, S40R and KTKTH had Tms that were higher than that of AV-Rop by 10 degrees. Other mutants were in the range of +5 degrees. An interesting observation was that the Tms were very close for the respective Lys and Arg versions. This indicates that the selection of lysines over arginines in the original surface library did not depend on the thermal stability of the mutants. In Nguyen’s library, most of the multi-mutants were slightly stabilized over AV-Rop. Surprisingly, in this study, the majority of the variants were stabilized over AV-Rop, and many point mutants were stabilized substantially. This is an important result because in the previous libraries of Rop, not many variants were found to be more stable than AV-Rop, and those that were more stable were not so to a great degree. This emphasizes the contribution of certain surface residues to stability in Rop. Many studies have examined the effect of a single point mutation on a protein. Here, single point mutations have produced proteins much more stable than the pseudo-wild-type.

Figure 14 (bottom) shows the melts from urea denaturation monitored by CD spectroscopy. The thermal and chemical denaturation results followed the same trend. D46K and D46R, which were less stable than AV-Rop by thermal denaturation, were also less stable by urea denaturation. With one exception, the other mutants were more stable than AV-Rop and thus had higher Cms by chemical denaturation. The exception was KTKKK, which was 4 degrees more stable than AV-Rop by thermal denaturation but less stable than AV-Rop by chemical denaturation. Otherwise, the two methods correlated. There was no obvious correlation between activity and stability.

Figure 13: Circular dichroism scans of variants (top); Thermal melts of point mutants (bottom)

Figure 14: Thermal melts of multi-mutants (top); Urea melts of all variants (bottom)

3.3.4 Solubility

Surface mutations are known to affect the solubility of a protein because surface residues interact with the solvent. All the variants were expressed on a small scale to compare the amount of protein in the soluble fraction (supernatant) and the insoluble fraction (pellet). As controls, variants from the original surface library (LMATA for pellet and DETHQ for supernatant) were also included. The mutations at the surface positions appeared to affect solubility to a great extent. As seen in Figure 15a, DETHQ was mostly present in the soluble fraction and LMATA was mostly seen in the insoluble fraction. E39K, E39R, E47K, E47R, D43R, KTKKK and KTKTH had equal amounts of protein in the soluble and insoluble fractions. Because the original expression of S40R was low, it was difficult to purify on a small scale and gave only a faint band. S40K, D43K, D46K and D46R had more protein in the soluble fraction. For all of the point mutants, although the pellet fraction had a considerable amount of protein, the soluble fraction also had a large amount. Among the multi-mutants, RTRRR and RTRTH had more protein in the pellet than in the supernatant. The surface changes made to the protein increased the amount of insoluble fraction. The amount of insoluble fraction also increased with the number of surface mutations, especially with arginines.

Figure 15: Solubility levels of all variants

3.3.5 Structural changes and electrostatics

1H-15N heteronuclear single quantum coherence (HSQC) spectroscopy was used to determine whether the protein is well folded and whether it has a different conformation. Five variants were selected based on their Tms. The least stable mutant (D46R), the mutant with a Tm similar to that of AV-Rop (S40K) and the most stable mutant (D43R) were chosen for NMR studies. Of the multi-mutants, KTKTH and RTRTH were also taken. 1H-15N HSQC spectra for these mutants, shown in Figure 16, were obtained (in red) and compared to that of AV-Rop (in blue). D43R had peaks very similar to those of AV-Rop and thus seems to have the same conformation. The least stable mutant, D46R, had many shifted peaks. The multi-mutants also had many shifted peaks and thus appear to have a different structure. This helped in selecting interesting leads for further crystallographic studies. There were some shifted peaks for S40K, and almost all peaks seem to be shifted from AV-Rop for the rest of the mutants. More surface mutations and lower stability seem to affect the foldedness and conformation. Further crystallographic or more sophisticated NMR analysis needs to be carried out to look into the structural details.

Figure 16: 1H15N HSQC NMR spectra of variants; Rop is in red and variant is in blue.

Determining the conformational stability of a protein is critical for understanding the physical interactions that stabilize it. The stability estimate of a protein is very often based on the analysis of chemically or thermally induced unfolding transitions measured either spectroscopically or calorimetrically. Gibbs-Helmholtz analysis combining thermal and chemical melts was done on some of the variants. This is useful for finding out whether changes in surface electrostatics affect the Gibbs free energy. A correlation between thermal and urea-induced denaturation was observed. Because this study requires highly concentrated protein, it was done on only nine of the variants. The curves are shown in Figure 17. The changes in Gibbs free energy, entropy and enthalpy, the temperature at maximum stability (Tg) and the temperature of conformational stability (Ts) are shown in Table 6. The change in entropy was significant in many variants. The changes in both enthalpy and entropy were highest for the most stable mutant, D43R, and the least stable mutants, D46K and D46R. This is an interesting observation.

Figure 17: Gibbs-Helmholtz curves of variants

Table 6: Thermodynamic parameters

3.4 Discussion

Surface residues and surface electrostatics of a protein are more important for stability than previously thought. Amino acid residues on the protein surface have long been considered less important than residues in the hydrophobic core in contributing to protein stability. This is because of the belief that surface residues are usually more tolerant of mutation to non-native residues than core residues [29]. There is considerable evidence from previous studies that engineering surface residues and surface electrostatics can increase protein stability [30-32]. However, there is not even a rough consensus on how surface residues control stability, and more work on model proteins is needed to add information to this area. Rop is a great scaffold for studying the effects of surface residues on stability.

The construction of the NNK5-surface library at five surface residues of Rop (E39, S40, D43, D46 and E47) was described in chapter 2. The consensus sequence from the pAC library had many lysine residues. This was an unexpected and interesting result, given that wild-type Rop has many negatively charged residues on the surface. Lysine and arginine point mutants were made at these five positions, along with four multi-mutants (KTKKK, KTKTH, RTRRR and RTRTH) at the same five positions. All the mutants were made in the context of both the screening vector pACT7lacCm-AflIII and the expression vector pMRH6. When tested for activity, E39R, D43K and E47R appeared to have an intermediate level of expression. All the other point mutants looked active. Among the multi-mutants, KTKKK and RTRTH had intermediate activity, whereas RTRRR and KTKTH were inactive. There were no inactive point mutants and no fully active multi-mutants. As the number of surface mutations to positively charged residues increased, the activity seemed to be affected. There was no obvious correlation between activity and the residue (Lys or Arg) introduced.

The secondary structure of the variants was assessed by CD spectroscopy. All but four of the variants had mean residue ellipticity similar to that of AV-Rop, which indicates that they possess similar secondary structure. RTRRR, S40R, E47K and D46R had lower ellipticity values than AV-Rop. This may reflect some structural changes in these variants. Tms of the variants were obtained from thermal denaturation, and Cms from urea denaturation, both monitored by CD. The Tms ranged from 69.6 °C for RTRTH to 83.2 °C for D43R. Only D46R and D46K were less stable than AV-Rop. The remaining eleven variants had higher Tms than AV-Rop. All the multi-mutants were stabilized over AV-Rop. D43K, D43R, S40R and KTKTH had Tms that were higher than that of AV-Rop by 10 degrees. Other mutants were in the range of +5 degrees. There was no obvious correlation between the number of mutations and stability. An interesting observation was that the Tms were very close for the respective Lys and Arg versions, which means that the selection of lysines over arginines in the original surface library did not depend on the thermal stability of the mutants. In Nguyen’s library, most of the multi-mutants were slightly stabilized over AV-Rop. Surprisingly, in this study, the majority of the variants were stabilized over AV-Rop. This is an important result because in the previous libraries of Rop, not many variants were found to be stabilized over AV-Rop. This emphasizes the contribution of certain surface residues to stability in Rop. Also, single point mutations have produced proteins much more stable than the pseudo-wild-type. A correlation was observed between the thermal and chemical denaturation results. D46K and D46R, which were less stable than AV-Rop by thermal denaturation, were also less stable by urea denaturation. The other mutants were more stable than AV-Rop and thus had higher Cms by chemical denaturation. Except for KTKKK, the two methods correlated well.

Solubility was tested by comparing the soluble and insoluble fractions of the variants. E39K, E39R, E47K, E47R, D43R, KTKKK and KTKTH had equal amounts of protein in the soluble and insoluble fractions. Because the original expression of S40R was low, it was difficult to purify on a small scale and gave only a faint band. S40K, D43K, D46K and D46R had more protein in the soluble fraction. For all of the point mutants, although the pellet fraction had a considerable amount of protein, the soluble fraction also had a good amount. Among the multi-mutants, RTRRR and RTRTH had more protein in the pellet than in the supernatant. The surface changes made to the protein increased the amount of insoluble fraction. The amount of insoluble fraction also increased with the number of surface mutations, especially with arginines.

1H-15N heteronuclear single quantum coherence (HSQC) spectroscopy was used to determine whether the protein is well folded and whether it has a different conformation. Five variants were selected based on their Tms. The most stable mutant, D43R, had peaks very similar to those of AV-Rop and thus seems to have the same conformation. The least stable mutant, D46R, and the multi-mutants had many shifted peaks and thus appear to have a different structure. There were some shifted peaks for S40K, and almost all peaks seem to be shifted from AV-Rop for the rest of the mutants. A higher number of surface mutations and lower stability seem to affect the foldedness and conformation. The stability estimate of a protein is very often based on the analysis of chemically or thermally induced unfolding transitions measured either spectroscopically or calorimetrically. Gibbs-Helmholtz analysis combining thermal and chemical melts was done on some of the variants, which is useful for finding out whether changes in surface electrostatics affect the Gibbs free energy. A correlation between thermal and urea-induced denaturation was observed. The change in entropy was significant in many variants. The changes in both enthalpy and entropy were highest for the most stable mutant, D43R, and the least stable mutants, D46K and D46R.

Although there was no overall correlation between the lysine and arginine mutations and the results, the effect of each point mutant or multi-mutant on each characteristic was interesting. In particular, the results from the stability analysis by thermal and urea denaturation melts stood out. Most of the mutants were stabilized over AV-Rop. This is an important result with respect to both Rop and protein stability in general. The effect of point mutants will be described further in the next chapter, but these results already show that it can be large. Gibbs-Helmholtz analysis provides more information on how enthalpy or entropy changes affect protein stability.


Chapter 4: Exploring the sequence landscape of the four helix bundle protein Rop using Illumina sequencing


4.1 Summary

Rop, an antiparallel homodimer, is an excellent model protein for studying protein stability. A point mutant of a protein indicates whether a particular residue at that position is crucial for the protein to be functional. A point mutant library is more useful still, because it shows how tolerant a position is to all other non-wild type amino acids. NNN (N = A, G, C or T) codon point mutant libraries were constructed as shown in Fig. 19 in a cysteine free mutant, AV-Rop, which has two core cysteine residues mutated to alanine (C38A) and valine (C52V). AV-Rop has wild type like properties, is stable and is free of the problems of oxidative dimerization because it lacks cysteines. The library was subjected to rounds of growth selection to select for hits and then sequenced by Illumina technology. The above mentioned work was done as my Master’s thesis project. In this chapter, I have included the analysis of the Illumina sequencing results obtained from that work and preliminary results from a redesign of the experimental strategy.


4.2 Materials and Methods

4.2.1 Materials

General items for experiments and reagents required to make media for cell cultures, such as tryptone, yeast extract and agarose, were purchased from Fisher. PCR tubes, microcentrifuge tubes, petri dishes and electroporation cuvettes were bought from USA Scientific. Antibiotics were bought from Gold Bio. 1000X stock solutions of antibiotics in sterile water (ampicillin at 100 mg/mL, kanamycin at 35 mg/mL) were made and sterile filtered using 0.22 μm syringe filters from Millipore. 100 mM deoxyribonucleotide mixtures (dNTPs), Tris, sodium chloride, and reagents for electrophoresis and phenol chloroform extraction were purchased from American Bioanalytical. DNA miniprep buffers and Pfu polymerase were made in house. Other DNA polymerases, DNA ladders, T4 DNA ligase and restriction enzymes were purchased from New England Biolabs (NEB). Oligonucleotide primers for PCR were purchased from Sigma-Genosys. Water for all experiments was purified using a Barnstead NANOpure Diamond system to 18 MΩ•cm. Buffers needed for PCR cleanup and agarose gel purification, as well as 50% glycerol, were filter sterilized using 0.22 μm syringe filters from Millipore.


4.2.2 Instrumentation

PCR reactions were carried out in a CFX96 real time thermal cycler (Bio-Rad). Measurements of DNA absorbance and cell density were typically done using an Agilent 8453 UV/VIS spectrophotometer or a SpectraMax M5 plate reader (Molecular Devices). DNA electrophoresis was usually done with a 1% agarose gel in a 10 cm horizontal gel apparatus, and gels were documented using a TFML UV transilluminator (UVP) and a Kodak Gel Logic documentation system. For comparing molecular weight, Lambda BstEII ladder and 100 bp ladder from NEB were used. Centrifugation of small volumes up to 2 mL was done in an Eppendorf 5415 unrefrigerated bench top centrifuge, and volumes up to 50 mL and 96 well plates were centrifuged in an Eppendorf 5810 refrigerated bench top centrifuge. DNA was concentrated using a Savant 1200P SpeedVac vacuum concentrator in low heating mode. Electroporation was done in a Bio-Rad Micropulser.

4.2.3 Plasmids and strains

The plasmids pACT7lac and pUCBADGFPuv were constructed by Magliery et al. (ref. 28). The plasmid pMRH6 was constructed in the Magliery lab at The Ohio State University. The DH10B(DE3) strain of E. coli was made by lysogenizing DH10B with the DE3 lambdoid phage using a kit from Novagen; this was done by Jason Lavinder in the Magliery lab. The cloning sites are AflIII and BamHI for the pMRH6 vector and AflIII and BanI for pACT7lacCm-AflIII. Analytical tests were performed using an XmnI digest, which cuts pUCBADGFPuv at two sites and linearizes pMRH6 and pACT7lacCm-AflIII. Background digest was carried out with EcoRI and NdeI for pMRH6 and with NcoI and EcoRI for pACT7lacCm-AflIII.

Figure 18: Maps of screening vector pUCBADGFPuv, pMR cloning vector pMRH6 and pAC cloning vector pACT7lacCm-AflIII

4.2.4 Methods

Molecular Cloning-pMR library

NNN (N = A, G, C or T) single codon mutant libraries were constructed individually for the 63 amino acid protein Rop, with one residue randomized to all 20 amino acids in each library. The library inserts were about 220 bases long and were cloned into the 4088 base pair pMRH6 vector between the AflIII and BamHI restriction sites. PCR inserts were generated by reassembly of the two synthetic 3’ and 5’ oligonucleotide fragments. The PCR strategy differed for the beginning, middle and end of the gene: two primers were needed for the reassembly product of the beginning and end regions, whereas four primers were used for the middle region of the gene. The reassembly library product was then PCR amplified with the two primers containing restriction endonuclease sites. PCR was typically set up with Pfu polymerase and Pfu buffer prepared in house. The standard conditions for a reassembly PCR were as follows: a 25 μL reaction in sterile water contained 1 unit of Pfu polymerase in 1X Pfu buffer, 25 mM dNTPs, and 2 μM primers. Amplification PCR used the same conditions along with approximately 0.5-2 ng/μL reassembly product as template. The PCR products were cleaned up using PB buffer and eluted using 10 mM Tris HCl as elution buffer, following the Qiagen kit procedure. The cleaned up PCR products were digested with AflIII and BamHI restriction enzymes in the buffers recommended by NEB.
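The codon content of an NNN-randomized position can be enumerated directly. The following is a minimal sketch, assuming the standard genetic code; the function name `library_summary` and the dictionary layout are illustrative choices and not part of the original workflow. It counts how many codons, amino acids and stop codons a degenerate codon scheme contains, so the NNN scheme used here can be compared with the NNK scheme (K = G or T) mentioned in the Abstract.

```python
from itertools import product
from collections import Counter

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for the first, second and third base.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def library_summary(third_bases="TCAG"):
    """Summarize a degenerate codon NN[third_bases] (illustrative helper).

    third_bases="TCAG" corresponds to NNN; third_bases="GT" corresponds to NNK.
    """
    codons = [c for c in CODON_TABLE if c[2] in third_bases]
    counts = Counter(CODON_TABLE[c] for c in codons)
    n_stop = counts.pop("*", 0)  # stop codons are counted separately
    return {
        "codons": len(codons),
        "amino_acids": len(counts),
        "stops": n_stop,
        "per_amino_acid": dict(counts),
    }


nnn = library_summary("TCAG")
nnk = library_summary("GT")
print(nnn["codons"], nnn["amino_acids"], nnn["stops"])
print(nnk["codons"], nnk["amino_acids"], nnk["stops"])
```

Under these assumptions NNN covers all 20 amino acids with uneven codon redundancy (for example, six codons for leucine but one for methionine), which is worth keeping in mind when read counts from the naive library are compared across amino acids.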


Figure 19: PCR design for the pMR NNN point mutant library in AV-Rop

Ligation

pMRH6 vector was miniprepped and digested with the same enzymes. Vector DNA of about 4000 base pairs and insert DNA of about 220 bp were purified by agarose gel extraction according to the QIAquick Gel Extraction Kit protocol. Vector and insert DNA resuspended in elution buffer were then used for cloning. The concentrations of vector and insert DNA were determined by agarose gel electrophoresis by comparing band intensities to the NEB Lambda BstEII digest marker. For ligation, the ratio of vector to insert was set to 1:3 even though the ratio originally intended is 1:1. This is because the insert is only about 200 bases long and the cut fragment is also very small, making it difficult to separate them completely by gel purification. Therefore, insert DNA is added in excess. Vector and insert DNA were concentrated to the desired amount using a SpeedVac in low heating mode. About 50 ng of vector and 10 ng of insert were used for cloning a single point mutant library. In addition, 50 ng of vector alone was ligated to check for self-ligation of the vector (background). Complete background elimination is especially important when cloning libraries. If self-ligation was seen, the vector digest was re-cut with the cloning enzymes and cleaned up for ligation. 10 μL ligations were incubated overnight with T4 ligase buffer and T4 ligase at 16 ºC. The overnight ligations were then kept at 65 ºC to heat kill the ligase and subjected to background vector digest using the restriction enzymes NdeI and EcoRI. The resulting libraries were cleaned up using PB buffer and eluted in 10% Qiagen Elution Buffer for transformation.
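The relationship between the DNA masses used in the ligation and the vector to insert molar ratio can be checked with a short calculation. This is a sketch only: the function name is hypothetical, and it assumes double-stranded DNA whose molar amount is proportional to mass divided by length, so that the average mass per base pair cancels in the ratio. The lengths are the 4088 bp vector and roughly 220 bp insert stated in the text.

```python
def insert_to_vector_molar_ratio(vector_ng, vector_bp, insert_ng, insert_bp):
    """Return moles of insert per mole of vector for double-stranded DNA.

    Moles are proportional to mass / length, so the average mass of a
    base pair drops out when the two quantities are divided.
    """
    return (insert_ng / insert_bp) / (vector_ng / vector_bp)


# Amounts quoted in the text: about 50 ng vector (4088 bp) and 10 ng insert (~220 bp).
ratio = insert_to_vector_molar_ratio(vector_ng=50, vector_bp=4088,
                                     insert_ng=10, insert_bp=220)
print(f"vector:insert is approximately 1:{ratio:.1f}")
```

With these inputs the insert is in a several-fold molar excess, of the same order as the 1:3 ratio described above; the exact value depends on the gel-based concentration estimates.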

Transformation

The cleaned up, background-digested ligation was transformed into electrocompetent DH10B cells made in house. Electrocompetent DH10B cells were prepared according to a protocol written by Tom Magliery in 2001. Cell growth is arrested in mid-log phase, and the cells are spun down, washed in 10% glycerol and frozen on dry ice; they remain usable for 2-3 months. The cells are tested before use by transforming a pUC plasmid (1 ng) into an aliquot (40 μL) of cells, quenching with 1 mL of 2YT media and recovering for an hour at 37 ºC; 10 μL of the recovery volume is then plated on the appropriate antibiotic plates and grown at 37 ºC. A transformation efficiency of 10⁸ transformants per μg of DNA or above is recommended for libraries and is calculated by the following equation: (A × B)/(C × D), where A is the number of colonies (transformants), B is the final recovery volume (mL), C is the amount of plasmid DNA transformed (μg) and D is the plating volume (mL). For libraries, the transformation is quenched in 5 mL of 2YT media and recovered for an hour at 37 ºC, and 100 μL of the culture is plated on appropriate antibiotic plates. The culture is then made up to 25 mL and grown overnight at 37 ºC. The following day, the complexity of the library is calculated from the plate count to confirm that it is 200 times greater than the library size of 20 variants. From the overnight culture, glycerol stocks and DNA minipreps are made and stored. For each library, at least 4 colonies were picked from a plate, miniprepped and sent to Genewiz, South Plainfield, NJ for sequencing to confirm the mutation at the desired position. Before being sent for sequencing, the minipreps were checked on a gel for insert by cutting with AflIII and BamHI, and for possible contamination by checking the 260/230 and 260/280 absorbance ratios in a UV-VIS spectrophotometer. The sequencing results were analyzed using Clustal Omega.
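The transformation efficiency equation and the library coverage check described above can be written out as follows. This is a minimal sketch: the function names are hypothetical, and the colony counts in the example are invented for illustration, not measured values from this work. The volumes and DNA amount are those given in the text (1 ng plasmid, 1 mL recovery, 10 μL plated for the test; 5 mL recovery, 100 μL plated for a library of 20 variants).

```python
def transformation_efficiency(colonies, recovery_ml, dna_ug, plated_ml):
    """(A x B) / (C x D): transformants per microgram of DNA."""
    return (colonies * recovery_ml) / (dna_ug * plated_ml)


def library_complexity(colonies, recovery_ml, plated_ml):
    """Total transformants in the recovery, scaled up from the plated aliquot."""
    return colonies * recovery_ml / plated_ml


# Competent cell test: hypothetical count of 1000 colonies.
efficiency = transformation_efficiency(colonies=1000, recovery_ml=1.0,
                                       dna_ug=0.001, plated_ml=0.010)

# Library transformation: hypothetical count of 80 colonies.
complexity = library_complexity(colonies=80, recovery_ml=5.0, plated_ml=0.100)
coverage = complexity / 20  # fold coverage over 20 amino acid variants

print(f"efficiency = {efficiency:.2e} transformants per ug")
print(f"complexity = {complexity:.0f} transformants, {coverage:.0f}-fold coverage")
```

In this illustration, 1000 colonies in the test transformation corresponds to the recommended 10⁸ transformants per μg, and 80 colonies on the library plate corresponds to 200-fold coverage of a 20-variant library.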

Growth based selection

As Rop controls plasmid levels in cells, co-culturing cells containing active and inactive Rop confers a growth advantage on the cells containing active Rop. Control strains of AV-Rop in pAC plasmid (pACT7lacRop) and of a chloramphenicol resistance gene “linker” in pAC (pACT7lacCm) were used for the mock selection experiment. Both of these plasmids were transformed into DH10B(DE3) cells containing the pUCBADGFPuv screening plasmid. The two control strains were grown together in 1 L of 2YT media with Kan and Amp at 42 ºC for 17-18 hours (1 round). The first round had 1:10 or 1:100 ratios of wild type to linker cells by volume, taken from overnight growths of seed cultures. Both strains were mixed after normalizing for cell density. At each round, the culture being inoculated was maintained at a cell density (OD600, optical density at 600 nm) of 1 (OD 1 is approximately 10⁸ cells per mL) and a culture volume of 100 μL. After the completion of each round, an aliquot was used for making minipreps and glycerol stocks and for plating on Kan-Amp-Arabinose screening plates. The minipreps were cut with XmnI so that the levels of each of the three plasmids could be observed on an agarose gel, distinguished by size. As active Rop reduces plasmid DNA levels, it was easy to differentiate the Rop containing strain (pACT7lacRop) from the other one based on band intensity. Comparing the plasmid intensity levels on the gel with the fluorescent phenotypic ratios from the plates of each round allowed us to calculate the fold enrichment of pACT7lacRop per round. This mock experiment was used to calculate the number of rounds of selection needed to achieve 1,000-fold enrichment of actives over inactives. Though the point mutant library is in pMR, previous work indicates that the difference between the enrichment fold values of pAC and pMR is not very large.
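The step from a per-round enrichment to the number of rounds needed for 1,000-fold enrichment can be sketched as below. This is an illustration under a simplifying assumption that is not stated in the text, namely that the fold enrichment is the same in every round; the function name and the example per-round values are hypothetical and are not the measured values from the mock selection.

```python
def rounds_needed(fold_per_round, target_fold=1000.0):
    """Smallest number of rounds whose cumulative enrichment reaches target_fold.

    Assumes a constant fold enrichment per round (a simplification).
    """
    if fold_per_round <= 1.0:
        raise ValueError("per-round enrichment must be greater than 1")
    rounds, cumulative = 0, 1.0
    while cumulative < target_fold:
        cumulative *= fold_per_round
        rounds += 1
    return rounds


# Hypothetical per-round enrichments, for illustration only.
for fold in (2.5, 5.0, 10.0):
    print(f"{fold}-fold per round -> {rounds_needed(fold)} rounds")
```

Because enrichment compounds multiplicatively, modest per-round values still reach a 1,000-fold target within a handful of rounds, which is why the mock selection is a practical way to set the number of rounds before running the library.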

Cell based screen for Rop

A cell based screen for in vivo activity of Rop was developed by Magliery and Regan. It links the expression of the reporter GFPuv (a recombinant GFP that expresses faster, is 18 fold brighter than native GFP and is excitable by UV light at 395 nm) to the copy number of the ColE1 plasmid using the arabinose promoter. In this screen, cells containing active Rop show high fluorescence on Kan-Amp-Arabinose selection plates under UV light, and those with inactive Rop are non-fluorescent. The pUCBADGFPuv plasmid contains GFPuv expressed from the arabinose promoter, the AraC gene, the pUC19 version of the ColE1 origin and an ampicillin resistance gene.

NNN Point mutant library

Point mutant libraries in DH10B were grown separately and mixed such that each library had almost the same cell density, and a single miniprep was made. The miniprep of the point mutant libraries in pMR vector was transformed into DH10B(DE3) cells containing pUCBADGFPuv and recovered in 25 mL for 1 hour. The recovery volume was then plated on an LB-Kan-Amp-0.0005% arabinose (KAA) agarose plate, which was grown at 42 ºC for 12-16 hours. In addition, the recovery was made up to 250 mL and grown at 37 ºC overnight, and glycerol stocks and minipreps were made from the overnight culture. The KAA plate of the naive library had about 50% fluorescent colonies (hits). The point mutant naive library was subjected to growth selection under the same conditions as the mock selection for eight rounds. For the test libraries F14, D30 and D43, each of the libraries in DH10B was individually transformed into DH10B(DE3) cells containing pUCBADGFPuv. Recovery, growth and selection for the three libraries were done the same way as for the entire point mutant library.

Illumina sequencing

The library products (minipreps) from the rounds to be deep sequenced (0, 3, 6 and 8) were amplified by PCR. These products were cleaned up using the Qiagen gel extraction method. Unique six-letter barcodes were attached to both ends of each round’s amplified product by PCR (Tables 8 and 9), followed by purification by Qiagen gel extraction. The DNA concentration for the four rounds of library was determined using Qubit fluorometric quantitation. The LabChip GX assay was used to estimate the size of the DNA for all four library samples, which was about 247 bp. After normalization for concentration, equal amounts of the libraries were pooled to a total of 400 ng to generate the Illumina library. The library was generated using the Illumina TruSeq DNA Sample Preparation Kit (low throughput protocol). In the library generation, the initial shearing step was skipped as the insert of the library being sequenced is already small. Ends were repaired, A overhangs were attached, TruSeq DNA adaptors were ligated and the DNA fragments with ligated adaptors were enriched using PCR. The size of the purified product was estimated by LabChip GX to be around 378 bp, as expected. The libraries were sequenced on the MiSeq using MiSeq reagents v2 in a 250 bp paired end run. Data were generated in FASTQ file format and are available in Illumina BaseSpace.
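The equal-mass pooling step can be illustrated with a short calculation. This is a sketch only: the function name is hypothetical and the concentrations below are invented for illustration, not the Qubit readings from this work. It splits the 400 ng total evenly across the barcoded samples and converts each share into a pipetting volume.

```python
def pooling_volumes(conc_ng_per_ul, total_ng=400.0):
    """Volume (uL) of each sample needed to pool equal masses summing to total_ng."""
    per_sample_ng = total_ng / len(conc_ng_per_ul)
    return {name: per_sample_ng / conc for name, conc in conc_ng_per_ul.items()}


# Hypothetical concentrations (ng/uL) for the four sequenced rounds.
concentrations = {"R0": 20.0, "R3": 25.0, "R6": 10.0, "R8": 50.0}
volumes = pooling_volumes(concentrations)
for name, vol in volumes.items():
    print(f"{name}: {vol:.1f} uL")
```

Because the four amplicons are essentially the same length, pooling equal masses also gives approximately equal molar amounts, so each round should receive a comparable share of the reads.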


Table 7: Oligonucleotide sequences for pMR NNN point mutant library

GCACTTAACATGGCCCGCTTTATCAGAAGTCAGACATTAACGCTGCTGGAGAAACTCAACG FWD2TO7 AGCTGGACGCGGATG GCACTTAACATGGCCCGCTTCATCAGAAGTCAGACATTAACGCTGCTGGAGAAACTCAACG FWDQ4 AGCTGGACGCGGATG T2PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGTNNNAAACAGGAAAAAACCGCACTTA TEV5 ACATGGCCCGCTTTATCAGA K3PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCNNNCAGGAAAAAACCGCACTTA TEV5 ACATGGCCCGCTTTATCAGA Q4PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCAAANNNGAAAAAACCGCACTTA TEV5 ACATGGCCCGCTTCATCAGA E5PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCAAACAGNNNAAAACCGCACTTA TEV5 ACATGGCCCGCTTTATCAGA K6PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCAAACAGGAANNNACCGCACTTA TEV5 ACATGGCCCGCTTTATCAGA T7PMR AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCAAACAGGAAAAANNNGCACTT TEV5 AACATGGCCCGCTTTATCAGA ACCAAACAGGAAAAAACCNNNCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC A8 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCNNNAACATGGCCCGCTTTATCAGAAGCCAGACATTAACG L9 CTTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTNNNATGGCCCGCTTTATCAGAAGCCAGACATTAACGC N10 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACNNNGCCCGCTTTATCAGAAGCCAGACATTAACGC M11 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGNNNCGCTTTATCAGAAGCCAGACATTAACGC A12 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCNNNTTTATCAGAAGCCAGACATTAACGC R13 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCNNNATCAGAAGCCAGACATTAACG F14 CTTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTNNNAGAAGCCAGACATTAACG I15 CTTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCNNNAGCCAGACATTAACGC R16 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGANNNCAGACATTAACGC S17 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG Table 7 continued


ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCNNNACATTAACGC Q18 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGNNNTTAACGC T19 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACANNNACG L20 CTTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTANNNC T21 TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGN L22 NNCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC L23 TTNNNGAGAAACTCAACGAGCTGGACGCGGATG

E24 ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC TTCTGNNNAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC K25 TTCTGGAGNNNCTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC L26 TTCTGGAGAAANNNAACGAGCTGGACGCGGATG GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCNNNGAGTTTCTCCA N27 GCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGNNNGTTGAGTTTCTCC E28 AGCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCNNNCTCGTTGAGTTTCTCCA L29 GCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCATCCGCNNNCAGCTCGTTGAGTTTCTCCA D30-NP GCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCATCNNNGTCCAGCTCGTTGAGTTTCTCCA A31 GCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGTTCNNNCGCGTCCAGCTCGTTGAGTTTCTCCA D32 GCAGCGTTAA GTCGTGCAGCGATTCAGCGATATCTGCCTGNNNATCCGCGTCCAGCTCGTTGAGTTTCTCC E33 AGCAGCGTTAA GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG Q34 CGATTCAGCGATATCTGCNNNTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG A35 CGATTCAGCGATATCNNNCTGTTCATCCGCGTCCAGCTCG Table 7 continued


GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG D36 CGATTCAGCGATNNNTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG I37 CGATTCAGCNNNATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG A38 CGATTCNNNGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG E39 CGANNNAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG S40 NNNTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGNNN L41 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCNNNAAG H42 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG ACCGTCATCACCGAAACGCGCGAGGCAGCTGCGGTAAAGCTCATCAGCGTGNNNGTGAAG D43 CGATTCACAGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAACGCGCGAGAACGCTGCGGTAAAGCTCATCAGCNNNGTCGTGAAG H44 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCNNNGTGGTCGTGAAG A45 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCNNNAGCGTGGTCGTGAA D46 GCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGNNNATCAGCGTGGTCGTGAA E47 GCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG

L48 GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTANNNCTCATCAGCGTGGTCGTGAAG CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGNNNAAGCTCATCAGCGTGGTCGTGAAG Y49 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAACGGGCGAGAACGCTNNNGTAAAGCTCATCAGCGTGGTCGTGAAG R50 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGGGCGAGAACNNNGCGGTAAAGCTCATCAGCGTGGTCGTGAA S51 GCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGGGCGAGNNNGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAA V52 GCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG GCCATCGTCACCAAAGCGCGCNNNAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG L53 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG Table 7 continued


GCCATCGTCACCAAAGCGNNNGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG A54 CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG H44R50SMP TAATAAGGATCCTCAGAGGTTTTCGCCATCGTCACCAAAACG MRTEV3 TGCGAGAACACTGCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATCTGCC REV55TO63 TGTTCATCCGCGTCCAGCTCG R55- TAATAAGGATCCTCAGAGGTTTTCACCATCGTCACCAAANNNTGCGAGAACACTGCGGTAA SMPMRTEV3 AGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC F56- TAATAAGGATCCTCAGAGGTTTTCACCATCGTCACCNNNGCGTGCGAGAACACTGCGGTAA SMPMRTEV3 AGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC G57- TAATAAGGATCCTCAGAGGTTTTCACCATCGTCNNNAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC D58- TAATAAGGATCCTCAGAGGTTTTCACCATCNNNACCAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC D59- TAATAAGGATCCTCAGAGGTTTTCACCNNNGTCACCAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC G60- TAATAAGGATCCTCAGAGGTTTTCNNNATCGTCACCAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC E61- TAATAAGGATCCTCAGAGGTTNNNGCCATCGTCACCAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC N62- TAATAAGGATCCTCAGAGNNNTTCACCATCGTCACCAAAGCGTGCGAGAACACTGCGGTA SMPMRTEV3 AAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC L63- TAATAAGGATCCTCANNNGTTTTCACCATCGTCACCAAAGCGTGCGAGAACACTGCGGTAA SMPMRTEV3 AGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATATC GCCATCGTCACCAAAGCGCGCGAGAACGCTGCGGTAAAGCTCATCAGCGTGGTCGTGAAG SM-AV-3P CGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG SM-PMRTEV3 TAATAAGGATCCTCAGAGGTTTTCGCCATCGTCACCAAAGCG pMRtev5 AATATTACACGTGGTGAGAACCTGTATTTTCAGGGCACCAAACAGGAAAAAACC ACCAAACAGGAAAAAACCGCCCTTAACATGGCCCGCTTTATCAGAAGCCAGACATTAACGC AV-WT5' TTCTGGAGAAACTCAACGAGCTGGACGCGGATG ACCAAACAGGAAAAAACCGCCCTTAACATGGCACGCTTTATTAGAAGCCAGACATTAACGC LOOP1-NP TGCTGGAGAAACTC LOOP3-NP ATCGCTGAATCGCTGCACGACCATGCTGATGAGCTTTACCGTAGTGTTCTC LOOP4-NP GCCATCGTCACCAAAGCGCGCGAGAACACTACGGTAAAGCTCATCAGC


Table 8: Primers for bar-coded Illumina sequencing DNA template generation

Primer Sequence

DSEQFWD GGTGAGAACCTGTATTTTCAG

DSEQREV GTTAGCAGCCGGATCCTC

R0-DSEQFWD ATGACGGGTGAGAACCTGTATTTTCAG

R3-DSEQFWD ATCAGCGGTGAGAACCTGTATTTTCAG

R6-DSEQFWD TACACGGGTGAGAACCTGTATTTTCAG

R8-DSEQFWD TAGAGCGGTGAGAACCTGTATTTTCAG

R0-DSEQREV TACGTGGTTAGCAGCCGGATCCTC

R3-DSEQREV TAGCACGTTAGCAGCCGGATCCTC

R6-DSEQREV AGTCTGGTTAGCAGCCGGATCCTC

R8-DSEQREV ATGGTCGTTAGCAGCCGGATCCTC

Table 9: Barcodes used for the four rounds which were Illumina sequenced

Round of selection Forward Reverse

R0 ATGACG CACGTA

R3 ATCAGC GAGCTA

R6 TACACG CAGACT

R8 TAGAGC GACCAT
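The forward barcodes in Table 9 are what allow reads from the pooled run to be assigned back to their round of selection. The following is a minimal sketch of that assignment and not the analysis pipeline actually used. It assumes that a read begins with the six-letter forward barcode followed immediately by the DSEQFWD primer sequence from Table 8, and it uses exact matching with no tolerance for sequencing errors. Reads in the opposite orientation, which can occur after adaptor ligation, would need to be handled with the reverse barcodes and are simply left unassigned here. All function and variable names are hypothetical.

```python
# Forward barcodes from Table 9 and the common forward primer from Table 8.
FORWARD_BARCODES = {"ATGACG": "R0", "ATCAGC": "R3", "TACACG": "R6", "TAGAGC": "R8"}
DSEQFWD = "GGTGAGAACCTGTATTTTCAG"


def assign_round(read):
    """Return the round label for a read, or None if it cannot be assigned."""
    barcode, rest = read[:6], read[6:]
    if barcode in FORWARD_BARCODES and rest.startswith(DSEQFWD):
        return FORWARD_BARCODES[barcode]
    return None


def demultiplex(reads):
    """Group reads by round; unmatched reads go to the 'unassigned' bin."""
    bins = {label: [] for label in FORWARD_BARCODES.values()}
    bins["unassigned"] = []
    for read in reads:
        bins[assign_round(read) or "unassigned"].append(read)
    return bins


# Toy reads built from the table sequences plus an arbitrary tail.
example_reads = [
    "ATGACG" + DSEQFWD + "GGCACCAAACAG",
    "TAGAGC" + DSEQFWD + "GGCACCAAACAG",
    "AAAAAA" + DSEQFWD + "GGCACCAAACAG",
]
bins = demultiplex(example_reads)
print({label: len(reads) for label, reads in bins.items()})
```

Requiring the primer sequence directly after the barcode is a simple guard against assigning a read on a chance six-letter match elsewhere in the amplicon.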


Table 10: Oligos for pAC NNN point mutant library

Oligos Sequence pACA ACACAGGAAACACGTATG P5_2- 23 pACO ACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGATT P3_2- CAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCGTTGAGTTTC 23 pACA CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCACCAAAGCGCGCGAG P3_2- 23 T2 CAGGAAACACGTATGNNNAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG K3 CAGGAAACACGTATGACGNNNCAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG Q4 CAGGAAACACGTATGACGAAANNNGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG E5 CAGGAAACACGTATGACGAAACAANNNAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG K6 CAGGAAACACGTATGACGAAACAAGAANNNACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG T7 CAGGAAACACGTATGACGAAACAAGAAAAGNNNGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG A8 CAGGAAACACGTATGACGAAACAAGAAAAGACGNNNCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG L9 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCNNNAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG N10 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTNNNATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG M11 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACNNNGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG A12 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGNNNCGCTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG R13 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgNNNTTTA TCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG F14 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCNNN ATCAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG I15 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTN NNAGgAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG R16 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCNNNAGCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG Table 10 continued


S17 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgNNNCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTG Q18 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCNNNACATTAACGCTgCTGGAGAAACTCAACGAGCTG T19 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGNNNTTAACGCTgCTGGAGAAACTCAACGAGCTG L20 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACANNNACGCTgCTGGAGAAACTCAACGAGCTG T21 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTANNNCTgCTGGAGAAACTCAACGAGCTG L22 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGNNNCTGGAGAAACTCAACGAGCTG L23 CAGGAAACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTA TCAGgAGCCAGACATTAACGCTgNNNGAGAAACTCAACGAGCTG pACA ACACAGGAAACACGTATGACGAAACAAGAAAAGACGGC P5_24- 33 pACO ACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTATCAGgAGCCAGACAT P1_24- TAACGCTgCTG 33 pACO GATATCGCTGAATCGCTTCACGACCACGCTGATGAGCTTTACCGtAGtGTTCTC P3_24- 33 pACO GCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGC P4_24- 33 pACA CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCACCAAAG P3_24- 33 E24 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCGTTGAGTTTNNNC AGcAGCGTTAATGTCTGG K25 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCGTTGAGNNNCTC CAGcAGCGTTAATGTCTGG L26 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCGTTNNNTTTCTCC AGcAGCGTTAATGTCTGG N27 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCNNNGAGTTTCTCC AGcAGCGTTAATGTCTGG E28 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGNNNGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG Table 10 continued


L29 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCNNNCTCGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG D30 GAAGCGATTCAGCGATATCTGCCTGTTCATCCGCNNNCAGCTCGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG A31 GAAGCGATTCAGCGATATCTGCCTGTTCATCNNNGTCCAGCTCGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG D32 GAAGCGATTCAGCGATATCTGCCTGTTCNNNCGCGTCCAGCTCGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG E33 GAAGCGATTCAGCGATATCTGCCTGNNNATCCGCGTCCAGCTCGTTGAGTTTCTCC AGcAGCGTTAATGTCTGG pACA ACACAGGAAACACGTATGACGAAACAAG P5_34- 54 pACO ACACGTATGACGAAACAAGAAAAGACGGCCCTTAACATGGCgCGCTTTATCAGgA P5_34- GCCAGACATTAACGCTgCTGGAGAAACTCAACGAGCTGGACGCGGATG 54 pACA CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCACCAAAGC P3_34- 54 Q34 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGATTCAGCGATATCTGCNNNTTCATCCGCGTCCAGCTCG A35 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGATTCAGCGATATCNNNCTGTTCATCCGCGTCCAGCTCG D36 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGATTCAGCGATNNNTGCCTGTTCATCCGCGTCCAGCTCG I37 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGATTCAGCNNNATCTGCCTGTTCATCCGCGTCCAGCTCG A38 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGATTCNNNGATATCTGCCTGTTCATCCGCGTCCAGCTCG E39 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGCGANNNAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG S40 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGAAGNNNTTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG L41 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCG TGNNNCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG H42 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCN NNAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG D43 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGNNN GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG Table 10 continued


H44 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCAGCNNNGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG A45 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCATCNNNGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG D46 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGCTCNNNAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG E47 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTAAAGNNNATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG L48 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGGTANNNCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG Y49 CGCCATCGTCACCAAAGCGCGCGAGAACaCTaCGNNNAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG R50 CGCCATCGTCACCAAAGCGCGCGAGAACaCTNNNGTAAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG S51 CGCCATCGTCACCAAAGCGCGCGAGAACNNNaCGGTAAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG V52 CGCCATCGTCACCAAAGCGCGCGAGNNNaCTaCGGTAAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG L53 CGCCATCGTCACCAAAGCGCGCNNNAACaCTaCGGTAAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG A54 CGCCATCGTCACCAAAGCGNNNGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTC GTGAAGCGATTCAGCGATATCTGCCTGTTCATCCGCGTCCAGCTCG pACA ACACAGGAAACACGTATGACGAAACAAGAAAAGACGGCCC P5_55- 63 pACO CAAGAAAAGACGGCCCTTAACATGGCgCGCTTTATCAGgAGCCAGACATTAACGC P5_55- TgCTGGAGAAACTCAACGAGCTGGACGCGGATGAACAGGCAG 63 pACO CGCGAGAACaCTaCGGTAAAGCTCATCAGCGTGGTCGTGAAGCGATTCAGCGATAT P3_55- CTGCCTGTTCATCCGC 63 R55 CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCACCAAANNNCGCGAGAACa CTaCGG F56 CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCACCNNNGCGCGCGAGAACa CTaCGG G57 CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCGTCNNNAAAGCGCGCGAGAACa CTaCGG D58 CGGCTGATTACCCTCTTCAGAGGTTTTCGCCATCNNNACCAAAGCGCGCGAGAAC aCTaCGG Table 10 continued


D59 CGGCTGATTACCCTCTTCAGAGGTTTTCGCCNNNGTCACCAAAGCGCGCGAGAACa CTaCGG G60 CGGCTGATTACCCTCTTCAGAGGTTTTCNNNATCGTCACCAAAGCGCGCGAGAACa CTaCGG E61 CGGCTGATTACCCTCTTCAGAGGTTNNNGCCATCGTCACCAAAGCGCGCGAGAACa CTaCGG N62 CGGCTGATTACCCTCTTCAGAGNNNTTCGCCATCGTCACCAAAGCGCGCGAGAACa CTaCGG L63 CGGCTGATTACCCTCTTCANNNGTTTTCGCCATCGTCACCAAAGCGCGCGAGAACa CTaCGG


Table 11: Barcode oligos for pAC point mutant library (barcodes are highlighted in red)

Barcode oligos Sequence

N18-FWD ACACAGGAAACACGTATG

N18-REV AATAATGGCACCNNNNNNNNNNNNNNNNNNCGGCTGATTACCCTCTTC

N18-DECODFWD NNNNNNACACAGGAAACACGTATG

N18-DECODREV NNNNNNCTTAATCAGTGAGGCACCGGCACC

ROUND-FWD NNNNNNGAAGAGGGTAATCAGCCG

R0 GACACTCTTAATCAGTGAGGCACCGGCACC

R1 GGCCCACTTAATCAGTGAGGCACCGGCACC

R2 GATTTCCTTAATCAGTGAGGCACCGGCACC

R3 TGCGCCCTTAATCAGTGAGGCACCGGCACC

R4 TCTAATCTTAATCAGTGAGGCACCGGCACC

R5 ACGAGCCTTAATCAGTGAGGCACCGGCACC

R6 GCGATACTTAATCAGTGAGGCACCGGCACC

R7 CGAGAGCTTAATCAGTGAGGCACCGGCACC

R8 AGTCTCCTTAATCAGTGAGGCACCGGCACC

R9 ATCAGTCTTAATCAGTGAGGCACCGGCACC


4.3 Results

4.3.1. Construction of the pMR point mutant library

A point mutant of a protein yields us relevant information on whether a particular residue in the position is crucial for the protein to be functional. A point mutant library would be more useful to find how tolerant a position is to all other non-wild type amino acids. NNN (N = A, G, C or T) codon point mutant libraries were made in a cysteine free mutant, AV-Rop. AV-Rop with wild type like properties is stable and free of problems of oxidative dimerization due to absence of cysteines. This makes AV-Rop an excellent model system to use for cloning combinatorial libraries. As Rop monomer has 63 amino acids, it is straight-forward to construct the gene insert from synthetic reassembly fragments. The PCR strategy was different for five sets of positions throughout the Rop gene as shown in figure 19. In general, the 5’ extension primer contained the hexahistidine fusion tag, T7 promoter, AflIII cloning site and TEV site. The 5’ reassembly fragment had about half of ROP gene, the 3’ reassembly fragment had most of the remaining half of Rop and 3’ extension primer had the last part of Rop gene with

BamHI cloning site. For positions 8 to 26, the 5’ reassembly fragment was made with a degenerate codon at the respective sites (A8-L26) and all these positions had a common overlapping 3’ reassembly fragment (SM-AV-3P). Similarly, for positions 34 to 54, the degenerate codon was put in the respective 3’ reassembly fragments (Q34-A54) with a 145 common overlapping 5’ reassembly fragment (AV-WT-5’). As it is the overlapping part of reassembly fragments for positions 27-33 in the middle loop region, a strategy involving four reassembly fragments was used with the third one containing the respective degenerate site (Loop1-NP, Loop2-NP, N27-E33 and Loop3-NP). Extension primers (pMRtev5 and SM-pMRtev3) at both ends were the same for positions 8-54. For the end regions, the degenerate codon was added in the extension primers as normally the end region would overlap between the reassembly fragment and the extension primer.

Library inserts for positions 2 to 7 were made by synthesizing a 5’ extension primer (T2pMRtev5-T7pMRtev5) containing a degenerate codon within the primer sequence. These six positions had a common 3’ extension primer (pMRtev3) and common 5’ and 3’ reassembly fragments (Fwd 2to7 and SM-AV-3P). Similarly, for positions 55-63, the degenerate codon was added in the 3’ extension primer (R55SMpMRtev3-L63pMRtev3) and the other primers (AV-WT-5’, SM-AV-3P and pMRtev5) were common. The experimental design is shown in Figure 20. The library inserts generated by PCR for each amino acid position were individually ligated into the pMRH6 vector and transformed into DH10B cells. Thus, 62 single point mutant constructs were cloned in DH10B. These constructs were grown and mixed together after normalizing for cell density, and plasmid DNA of the combined library was isolated. The combined point mutant library of Rop was then transformed into DH10B(DE3) cells containing the screening plasmid pUCBADGFPuv. The theoretical diversity of the combined point mutant library is 1,179 protein variants [(62 x 19) mutants + 1 wild-type]. The transformation efficiency of the DH10B(DE3) cells used was 10⁸ transformants per μg of DNA. By plating an aliquot of the transformation after recovery, the number of transformants was calculated to be ~50,000, which was about 20-fold larger than the library size. Coverage of several fold helps ensure that all library variants are represented among the transformants.

DH10B(DE3) cells containing the combined point mutant library were then enriched by growth-based selection. As Rop regulates plasmid copy number, the inactive variants in the library produce large amounts of plasmid. The resulting metabolic load on cells carrying inactive variants is very high, and this acts as a selective pressure when actives and inactives of a library are grown in the same pool. This is the basis of the growth selection performed on the library; each round of selection was done for 18 h at 42 ºC. Eight rounds of growth selection were performed for the combined point mutant library. The volume (100 μL) and density (optical density, OD600, of 1) of cells put into each round were kept constant for all rounds. After each round of growth selection, an aliquot of the population was plated on screening plates to assess in vivo activity using a GFP reporter screen. The GFP fluorescence level of the colonies on Kan-Amp-Arabinose (KAA) screening plates correlates with the plasmid level, and hence Rop activity, in the cells. A fluorescent colony on the GFP screen indicates that the corresponding variant has an active Rop in vivo, which binds RNA and thus modulates plasmid levels. The enrichment of the library was followed in every round by looking at the ratio of active variants (fluorescent or intermediately fluorescent colonies) to inactive variants (nonfluorescent colonies). The approximate hit rate for each round was calculated from the counts of fluorescent and nonfluorescent colonies in the aliquot plated on screening plates. It should be noted that the intermediately fluorescent colonies were also counted as actives. The approximate hit rate observed from the library screen was 50% for the input library population (round zero), 70% for round three, 90% for round six and 100% for round eight.
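
The theoretical diversity quoted above can be reproduced with a short enumeration. The following is a minimal sketch, assuming the standard genetic code and 62 randomized positions (2 to 63); the variable names are illustrative and are not part of the original analysis.

```python
from itertools import product

# Standard genetic code with codons ordered over bases T, C, A, G.
# Assumption: the NNN libraries are translated with the standard table;
# '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# An NNN codon samples all 64 triplets.
nnn_codons = ["".join(c) for c in product("ACGT", repeat=3)]
encoded = {CODON_TABLE[c] for c in nnn_codons}
amino_acids = encoded - {"*"}
stop_codons = [c for c in nnn_codons if CODON_TABLE[c] == "*"]

n_positions = 62  # positions 2 to 63 of the Rop monomer
# 19 non-wild-type amino acids per position, plus the single wild-type protein.
protein_variants = n_positions * (len(amino_acids) - 1) + 1

print(len(nnn_codons), len(amino_acids), len(stop_codons), protein_variants)
```

Counting at the protein level ignores synonymous codons and stop codons, both of which an NNN codon also samples.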

4.3.2 Mock growth selection

Rop controls plasmid copy number, and the absence of Rop activity in cells confers higher plasmid levels than in cells with active Rop. This implies that Rop-inactive cells grow at much slower rates than Rop-active cells. In this way, the inactives are selected out and the pool becomes dominated by actives. The difference in plasmid copy number between actives and inactives is most prominent at 42 ºC. Rop reduces plasmid copy number at both 37 ºC and 42 ºC, and the copy number difference between actives and inactives is larger at the higher temperature. Growth selection at 42 ºC therefore leads to enrichment of Rop-containing actives. As there is a large difference in plasmid copy number between actives and inactives, analytical digestion of the plasmids, which produces different banding patterns in gel electrophoresis, can be used to monitor the selection. From a population of cells harvested at the end of any round of growth selection, DNA is miniprepped, analytically digested and run on a gel. In the gel, the decreasing intensity of the GFP plasmid band indicates increasing Rop activity through the selection rounds. The in vivo GFP screening method lets us correlate fluorescence levels on Kan-Amp-Arabinose agar plates with the plasmid copy number of cells. The banding patterns on the gel are more reliable and are a useful supplement to the in vivo plate results. Selection of any library to greater than 90% hit rate is considered ideal, so that Sanger sequencing of random clones yields mostly hit sequences. As the goal here is to study the fitness landscape of a point mutant library by Illumina deep sequencing, it is of interest to look at the distribution of both actives and inactives in the library. At the same time, as deep sequencing yields more than a million reads, it is wasteful to spend money and resources on sequencing a large number of inactives. Therefore, we optimized the number of rounds of growth selection such that we end up with 1,000 times more actives than inactives.

The optimization was done using a mock selection experiment with positive control (pACT7lacRop) and negative control (pACT7lacCm) strains, both in DH10B(DE3) cells (the screening strain with pUCBADgfpuv). Both controls were inoculated together to start a growth selection round, with the negative control being 9 times as abundant as the positive control (1:9 positive to negative). For the first round, 200 μL of total inoculum was mixed from overnight cultures of both strains at an OD of 1. The cell density (OD of 1) and volume (200 μL) were kept the same for subsequent rounds. Aliquots taken from each round were miniprepped, and XmnI analytical digestion produced unique banding patterns for pUCBADgfpuv, pACT7lacRop and pACT7lacCm. The band sizes in bp were 3,446 for pACT7lacCm (negative strain), 1,961 and 1,017 for pACT7lacRop (positive strain), and 4,255 and 679 for pUCBADgfpuv (GFP plasmid). The decreasing band intensities of the GFP plasmid (due to increased Rop activity) and of the negative control, and the increasing intensity of the positive control, can be seen in Figure 22. This was verified by plating an aliquot of cells on KAA screening plates after every round. The fold enrichment per round was calculated from the relative enrichment of actives and was found to be approximately 10 for pAC and 3 for pMR. So, it will take three rounds in pAC for 1000-fold enrichment, whereas it will take more rounds for pMR to reach 1000-fold enrichment of positives.
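
The number of rounds implied by these per-round values can be worked out directly. This is a minimal sketch, assuming the fold enrichment stays constant from round to round (an idealization; the function name is hypothetical).

```python
def rounds_to_reach(fold_per_round, target_fold):
    """Count selection rounds until cumulative enrichment meets the target."""
    cumulative = 1
    rounds = 0
    while cumulative < target_fold:
        cumulative *= fold_per_round
        rounds += 1
    return rounds

# Approximate per-round fold enrichment from the mock selection.
for background, fold in {"pAC": 10, "pMR": 3}.items():
    print(background, rounds_to_reach(fold, 1000))
```

Under this assumption pAC reaches 1000-fold in three rounds (10 x 10 x 10), while pMR needs seven rounds, since six rounds give only 729-fold.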


Figure 20: Overall experimental design for pMR point mutant library


Figure 21: Bioanalyzer output graph. Rop’s size is 220 base pairs; the band seen at 378 base pairs includes the attached barcodes and adaptors.

4.3.3 Preliminary analysis of the point mutant library

From the deep sequencing data, we acquired 16,149,703 read pairs. The following criteria were applied by Dr. Bundschuh in the analysis of the deep sequencing data. First, all read pairs with more than 10 Ns (unidentified nucleotides) were eliminated; these were 47,900 (0.3%) of the read pairs in the pool. Second, the barcodes were screened. In checking the barcode, up to one mismatch and a missing first base were allowed. The barcodes of the two paired reads had to match, that is, come from the same round. This eliminated 11,066,439 (68.5%) read pairs. This step eliminated the majority of the sequences, which was a major disadvantage. Third, the sequences were checked for mutations. If a mutation occurred in only one of the two paired reads while the other showed the nucleotide as not mutated, the mutation was ignored. If the two reads of a pair showed two different mutations at the same position, the read pair was eliminated. This eliminated 2,063,927 (12.8%) of the read pairs. Fourth, the remaining mismatches were counted and any read pair with more than 20 nucleotide mismatches was eliminated. This eliminated 93,075 (0.6%) of the read pairs. Finally, the number of read pairs retained after the analysis was 814,780 (5.0%) for R0, 900,133 (5.6%) for R3, 579,918 (3.6%) for R6 and 583,531 (3.6%) for R8.

As can be seen from the results, the majority of the read pairs were lost in the barcode checking step. As the round barcodes were attached using PCR, many mutations were introduced in the barcodes, which was a major reason for this loss. There were also many forward and reverse read pairs with barcodes from different rounds, which was unexpected. In addition, there was an unusually high number of AV-Rop sequences (the scaffold for the library). Some wild-type sequence is expected because every position includes the wild-type residue as part of the library, but the number was higher than expected. From the mock selection, selection in pMR takes many rounds to achieve 1000-fold selection of actives, and this is a little inconsistent with the results. Enrichment in pAC takes 3 rounds for 1000-fold selection of actives, and this was consistent with the results. The results for both pAC and pMR enrichment are shown in Figure 22.
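
The read-pair bookkeeping for the filtering steps reported earlier in this section can be checked with a few lines. This is a minimal sketch using only the counts stated in the text; the step labels and the helper name are illustrative.

```python
total = 16_149_703  # read pairs acquired

# Read pairs removed at each filtering step, as reported in the text.
removed = {
    "more than 10 Ns": 47_900,
    "barcode check": 11_066_439,
    "conflicting mutations": 2_063_927,
    "more than 20 mismatches": 93_075,
}
# Read pairs retained for each selection round.
retained = {"R0": 814_780, "R3": 900_133, "R6": 579_918, "R8": 583_531}

def percent(count):
    """Share of the starting read pairs, rounded to one decimal place."""
    return round(100 * count / total, 1)

for step, count in removed.items():
    print(f"{step}: {count:,} ({percent(count)}%)")
for rnd, count in retained.items():
    print(f"{rnd}: {count:,} ({percent(count)}%)")

# Every read pair is either removed by one filter or assigned to a round.
accounted = sum(removed.values()) + sum(retained.values())
print(accounted == total)
```

The removed and retained counts add up exactly to the starting number of read pairs, so about 17.8% of the read pairs (2,878,362) survive all four filters.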

Graphs of the amino acid counts at positions 10 and 30, excluding the wild-type residue, are shown in Figure 23. Such graphs can be generated for every position. Figure 24 shows the type of heat map that can be generated from this information. The heat map will be generated by calculating the selection coefficient at each position. Selection coefficients give the relative fitness of the mutant at each position; here, the selection coefficient is taken as the ratio of mutant to wild-type residue occurrence at each time point. The plan is to carry out a reanalysis to rescue the read pairs lost in the barcode checking step by relaxing the criteria slightly. At the same time, these libraries are being constructed in the pAC background with a different strategy for deep sequencing.
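
The mutant to wild-type ratio described above can be illustrated as follows. This is a minimal sketch with hypothetical read counts (not taken from the sequencing data); the log2 change between rounds is an added convenience for reading the trend and is not a quantity defined in the text.

```python
import math

# Hypothetical read counts for one position (illustrative values only):
# round -> {residue: count}.
counts = {
    "R0": {"WT": 1000, "K": 100, "P": 100},
    "R3": {"WT": 2000, "K": 200, "P": 20},
}

def mutant_to_wt_ratio(round_counts, mutant, wild_type="WT"):
    """Ratio of mutant to wild-type occurrences at one time point."""
    return round_counts[mutant] / round_counts[wild_type]

def log2_change(mutant, first="R0", last="R3"):
    """log2 change of the mutant:wild-type ratio between two rounds."""
    start = mutant_to_wt_ratio(counts[first], mutant)
    end = mutant_to_wt_ratio(counts[last], mutant)
    return math.log2(end / start)

for mutant in ("K", "P"):
    ratios = {rnd: mutant_to_wt_ratio(c, mutant) for rnd, c in counts.items()}
    # 0 means the mutant keeps pace with wild-type; negative means depletion.
    print(mutant, ratios, round(log2_change(mutant), 2))
```

In this made-up example, K keeps the same ratio to wild-type across rounds, whereas P drops tenfold relative to wild-type.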


Figure 22: pAC enrichment (top) and pMR enrichment (bottom)


[Figure 23 panels: bar graphs of amino acid counts at positions N10 and D30; x-axis, amino acid (I L V F M C A G P T S Y W Q N H E D K R *); y-axis, count (0 to 1,200 for N10; 0 to 30,000 for D30)]

Figure 23: Graphs of amino acid count excluding the wild-type amino acid in each position. The wild-type amino acid occurred at much higher counts than the other residues.


[Figure 24 legend: fitness scale from wild-type to null; wild-type amino acid marked]

Figure 24: Example heat map generated from the limited amount of information


4.3.4 Construction of pAC point mutant library

Ongoing work involves a PCR strategy that is different for four sets of positions throughout the Rop gene (positions 2-23, 24-33, 34-54, and 55-63), as shown in Figure 25. The PCR consists of three steps. The first is the reassembly of synthetic gene fragments. The second is the extension with the AflIII cloning site. The third is the addition of a barcoding oligo containing a degenerate N18 site and the BanI cloning site. Barcoding with the degenerate N18 site makes every single mutant unique. This helps us identify the mutants easily even if the round barcodes are not consistent in the results. For positions 2 to 23, the 5’ reassembly fragment was made with a degenerate codon at the respective sites and all these positions had a common overlapping 3’ reassembly fragment. Similarly, for positions 23 to 33, the degenerate codon was put in the respective 3’ reassembly fragments with a common overlapping 5’ reassembly fragment. Because positions 27-33 in the middle loop region lie in the overlapping part of the reassembly fragments, a strategy involving four reassembly fragments was used, with the third one containing the respective degenerate site. Extension primers were separate for each of the four sets. For the end regions, the degenerate codon was added in the extension primers, as the end region would normally overlap between the reassembly fragment and the extension primer. N18-fwd and N18-rev were barcoding oligos common to all positions.
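
The size of the barcode space offered by a degenerate N18 site can be estimated as follows. This is a minimal sketch, assuming each of the 18 positions is drawn uniformly from A, C, G and T (real oligo synthesis is not perfectly uniform); the clone count used at the end is hypothetical.

```python
import random

BARCODE_LENGTH = 18  # degenerate N18 site

# Number of distinct sequences an N18 site can encode.
barcode_space = 4 ** BARCODE_LENGTH

def random_barcode(rng, length=BARCODE_LENGTH):
    """Draw one barcode with each position chosen uniformly from A, C, G, T."""
    return "".join(rng.choice("ACGT") for _ in range(length))

def expected_colliding_pairs(n_clones, space=barcode_space):
    """Expected number of clone pairs sharing a barcode (uniform sampling)."""
    return n_clones * (n_clones - 1) / 2 / space

rng = random.Random(0)  # fixed seed so the sketch is repeatable
print(f"{barcode_space:,}")
print(random_barcode(rng))
print(expected_colliding_pairs(50_000))  # 50,000 is a hypothetical clone count
```

With roughly 6.9 x 10^10 possible barcodes, the expected number of colliding pairs among tens of thousands of clones is well below one under the uniform-sampling assumption.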


Figure 25: PCR design for the pAC NNN point mutant library in AV-Rop


4.4 Discussion

A point mutant of a protein reveals whether a particular residue at that position is crucial for the protein to be functional. Directed evolution experiments usually probe a gene of interest for a particular property, and the interesting variants are subjected to some kind of selection. Deep sequencing of the selected variants and the input variants makes it possible to look specifically at the enrichment or depletion of variants. Protein fitness and binding studies have used deep sequencing to analyze whole populations going through the selection process. NNN (N = A, G, C or T) codon point mutant libraries were constructed in the pMR background in a cysteine-free mutant, AV-Rop, which has two core cysteine residues mutated to alanine (C38A) and valine (C52V). AV-Rop has wild-type-like properties, is stable, and is free of the oxidative dimerization problems caused by cysteines. The library was subjected to rounds of growth selection to select for hits and then sequenced by Illumina methodology. The above-mentioned work was done as my Master’s thesis project (Thesis title: Exploring the sequence landscape of the four-helix bundle protein Rop using deep sequencing, 2013). This chapter covered the analysis of Illumina sequencing results obtained from that work and preliminary results from the redesign of the experimental strategy.


From the analysis of the pMR library, much of the sequence information was lost due to errors in the barcode attachment step. From a starting number of 16,149,703 read pairs, 814,780 (5.0%) for R0, 900,133 (5.6%) for R3, 579,918 (3.6%) for R6 and 583,531 (3.6%) for R8 were finally obtained. This makes the results drawn from the available sequence information less reliable. The selection done here on the library is based on growth competition, and it should be noted that some factors may bias the growth of certain variants. E. coli codon preferences may affect the expression levels of a functional protein variant, and some variants that bind RNA in vivo may be completely unstable when purified out of the cell. These factors not only affect the fitness landscape results but also pose challenges for biophysical characterization. So, during analysis, codon bias should be accounted for by calculating values relative to the input library, so that the bias effects are nullified to a certain extent. Apart from that, the Cys-free wild-type, being the naturally fit sequence, is expected to dominate the population at later rounds. Reanalysis of the sequence data will be carried out.
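
One way to explore relaxed barcode criteria in such a reanalysis is to make the mismatch tolerance a parameter. The following is a minimal sketch and not the analysis code actually used: the 6-nt round barcodes are hypothetical, the same barcode is assumed to start both reads of a pair, and a missing first base is allowed in combination with mismatches.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def match_round(read_start, round_barcodes, max_mismatches=1):
    """Return the round whose barcode matches the start of a read, else None.

    A barcode matches if it differs by at most `max_mismatches` bases, either
    in full or with the first barcode base missing from the read.
    """
    hits = set()
    for rnd, barcode in round_barcodes.items():
        n = len(barcode)
        full = read_start[:n]
        if len(full) == n and hamming(full, barcode) <= max_mismatches:
            hits.add(rnd)
        # Missing first base: compare against the barcode without its 5' base.
        short = read_start[: n - 1]
        if len(short) == n - 1 and hamming(short, barcode[1:]) <= max_mismatches:
            hits.add(rnd)
    return hits.pop() if len(hits) == 1 else None  # ambiguous reads are rejected

def assign_pair(fwd, rev, round_barcodes, max_mismatches=1):
    """Keep a read pair only if both reads point to the same round."""
    f = match_round(fwd, round_barcodes, max_mismatches)
    r = match_round(rev, round_barcodes, max_mismatches)
    return f if f is not None and f == r else None

# Hypothetical 6-nt round barcodes, chosen only for illustration.
BARCODES = {"R0": "ACGTAC", "R3": "TGCATG", "R6": "GATCGA", "R8": "CTAGCT"}

print(assign_pair("CGTACGGGG", "ACGTTCTTT", BARCODES))   # missing base + 1 mismatch
print(assign_pair("ACGTACGGG", "TGCATGTTT", BARCODES))   # reads from different rounds
print(assign_pair("ACGTTGGGG", "ACGTACTTT", BARCODES, max_mismatches=2))
```

Raising `max_mismatches` rescues more read pairs but also raises the chance of assigning a read to the wrong round, so the tolerance would need to be chosen against the actual distances between the round barcodes.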

From the enrichment results, it was seen that pAC has a higher fold enrichment per round than pMR and that the results are more consistent in the pAC background. Therefore, the libraries are being reconstructed in pAC with a different experimental strategy. Degenerate oligos are attached at the PCR step so that the barcode will be unique for each mutant. Then, decoding oligos will be used for each round, and round barcodes will be attached after cloning the libraries. Finally, after sequencing, each mutant will be identifiable by its unique barcode. Selection coefficients can be calculated after analysis of the results, and a heat map can be generated showing each position. As it is a point mutant library, for each residue in Rop, enrichment ratios can be calculated by dividing the frequency of a mutation in a selected round by its frequency in the input variant population. This gives a direct measure of the mutational tolerance and amino acid preferences of each position.
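
The enrichment ratio described above can be sketched as follows, assuming per-variant read counts are available for the input and for a selected round. The counts and function names are hypothetical and serve only to illustrate the calculation.

```python
def frequencies(counts):
    """Convert raw counts to fractions of the total."""
    total = sum(counts.values())
    return {variant: n / total for variant, n in counts.items()}

def enrichment_ratios(selected_counts, input_counts):
    """Frequency in the selected round divided by frequency in the input."""
    selected = frequencies(selected_counts)
    inputs = frequencies(input_counts)
    return {
        variant: selected.get(variant, 0.0) / inputs[variant]
        for variant in inputs
        if inputs[variant] > 0  # variants absent from the input are skipped
    }

# Hypothetical counts at one position (illustrative values only).
input_counts = {"WT": 500, "K": 250, "P": 250}
selected_counts = {"WT": 600, "K": 300, "P": 100}

ratios = enrichment_ratios(selected_counts, input_counts)
for variant, ratio in ratios.items():
    print(variant, round(ratio, 2))
```

Because each variant is divided by its own input frequency, biases already present in the input library, such as uneven codon representation, cancel out to a first approximation.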

The Illumina deep sequencing study done here to analyze the protein fitness landscape using a single point mutant library is similar to the study of point mutants of a eukaryotic (yeast) chaperone and the study of all possible ubiquitin point mutants, where the relative fitness is calculated by the EMPIRIC method (101,102). EMPIRIC yields selection coefficients that give the mutant to wild-type ratios at each time point. WW domain peptide binding studies used moderate selection to get a broad range of functional residues, which is similar to the partial enrichment that we perform here (96).


The relative fitness landscape of the protein can be used as a mutation scale for Rop in further studies. This not only lets us analyze the sequence landscape of the protein but also helps us understand the mutational preferences of core, loop and surface positions in a four-helix bundle protein. More work involving purification and biophysical characterization of interesting variants from the library has to be carried out to reach final conclusions on the contributions of core, surface and loop residues to protein stability.


References

1. Anfinsen CB. The formation and stabilization of protein structure. Biochem J. 1972;128(4):737-749.

2. Tanford C. The hydrophobic effect and the organization of living matter. Science. 1978;200(4345):1012-1018.

3. Dill KA. Dominant forces in protein folding. Biochemistry. 1990;29(31).

4. Dill KA. The meaning of hydrophobicity. Science. 1990;250(4978):297-298.

5. Rose GD, Wolfenden R. Hydrogen bonding, hydrophobicity, packing, and protein folding. Annu Rev Biophys Biomol Struct. 1993;22:381-415.

6. Sandberg WS, Terwilliger TC. Influence of interior packing and hydrophobicity on the stability of a protein. Science. 1989;245(4913):54-57.

7. Drexler KE. Molecular engineering: An approach to the development of general capabilities for molecular manipulation. Proc Natl Acad Sci U S A. 1981;78(9):5275-5278.

8. Pabo C. Molecular technology. Designing proteins and peptides. Nature. 1983;301(5897):200.

9. Hecht MH, Sturtevant JM, Sauer RT. Stabilization of lambda repressor against thermal denaturation by site-directed Gly→Ala changes in alpha-helix 3. Proteins. 1986;1(1):43-46.

10. Matthews BW, Nicholson H, Becktel WJ. Enhanced protein thermostability from site-directed mutations that decrease the entropy of unfolding. Proc Natl Acad Sci U S A. 1987;84(19):6663-6667.

11. Matsumura M, Signor G, Matthews BW. Substantial increase of protein stability by multiple disulphide bonds. Nature. 1989;342(6247):291-293.

12. Makhatadze GI, Loladze VV, Ermolenko DN, Chen X, Thomas ST. Contribution of surface salt bridges to protein stability: Guidelines for protein engineering. J Mol Biol. 2003;327(5):1135-1148.

13. Itoh T, Tomizawa J. Formation of an RNA primer for initiation of replication of ColE1 DNA by ribonuclease H. Proc Natl Acad Sci U S A. 1980;77(5):2450-2454.

14. Tomizawa J, Itoh T. Plasmid ColE1 incompatibility determined by interaction of RNA I with primer transcript. Proc Natl Acad Sci U S A. 1981;78(10):6096-6100.

15. Tomizawa J. Control of ColE1 plasmid replication: The process of binding of RNA I to the primer transcript. Cell. 1984;38:861-870.

16. Som T, Tomizawa J. Regulatory regions of ColE1 that are involved in determination of plasmid copy number. Proc Natl Acad Sci U S A. 1983;80(11):3232-3236.

17. Banner DW, Kokkinidis M, Tsernoglou D. Structure of the ColE1 rop protein at 1.7 Å resolution. J Mol Biol. 1987;196:657-675.

18. Predki PF, Nayak LM, Gottlieb MB, Regan L. Dissecting RNA-protein interactions: RNA-RNA recognition by rop. Cell. 1995;80(1):41-50.

19. Lavinder JJ, Hari SB, Sullivan BJ, Magliery TJ. High-throughput thermal scanning: A general, rapid dye-binding thermal shift screen for protein engineering. J Am Chem Soc. 2009;131(11):3794-3795.

20. Pantoliano MW, Petrella EC, Kwasnoski JD, et al. High-density miniaturized thermal shift assays as a general strategy for drug discovery. J Biomol Screen. 2001;6(6):429-440.

21. Matulis D, Kranz JK, Salemme FR, Todd MJ. Thermodynamic stability of carbonic anhydrase: Measurements of binding affinity and stoichiometry using ThermoFluor. Biochemistry. 2005;44(13):5258-5266.

22. Lo MC, Aulabaugh A, Jin G, et al. Evaluation of fluorescence-based thermal shift assays for hit identification in drug discovery. Anal Biochem. 2004;332(1):153-159.

23. Munson M, O'Brien R, Sturtevant JM, Regan L. Redesigning the hydrophobic core of a four-helix-bundle protein. Protein Sci. 1994;3:2015-2022.

24. Munson M, Balasubramanian S, Fleming KG, et al. What makes a protein a protein? Hydrophobic core designs that specify stability and structural properties. Protein Sci. 1996;5(8):1584-93.

25. Nagi AD, Anderson KS, Regan L. Using loop length variants to dissect the folding pathway of a four-helix-bundle protein. J Mol Biol. 1999;286(1):257-65.

26. Predki PF, Regan L. Redesigning the topology of a four-helix-bundle protein: Monomeric rop. Biochemistry. 1995;34:9834-9839.

27. Kresse HP, Czubayko M, Nyakatura G, Vriend G, Sander C, Bloecker H. Four-helix bundle topology re-engineered: Monomeric rop protein variants with different loop arrangements. Protein Eng. 2001;14(11):897-901.


28. Magliery TJ, Regan L. A cell-based screen for function of the four-helix bundle protein rop: A new tool for combinatorial experiments in biophysics. Protein Eng Des Sel. 2004;17(1):77-83.

29. Yi F, Sims DA, Pielak GJ, Edgell MH. Testing hypotheses about determinants of protein structure with high-precision, high-throughput stability measurements and statistical modeling. Biochemistry. 2003;42(24):7594-7603.

30. van den Burg B, Eijsink VG. Selection of mutations for increased protein stability. Curr Opin Biotechnol. 2002;13(4):333-337.

31. Spector S, Wang M, Carp SA, et al. Rational modification of protein stability by the mutation of charged surface residues. Biochemistry. 2000;39(5):872-879.

32. Perl D, Schmid FX. Electrostatic stabilization of a thermophilic cold shock protein. J Mol Biol. 2001;313(2):343-357.

33. Saunders S, Hedlund BE. Electrostatic modification of protein surfaces: Effect on hemoglobin ligation and solubility. Biochemistry. 1984;23(7):1457-1461.

34. Perl D, Mueller U, Heinemann U, Schmid FX. Two exposed amino acid residues confer thermostability on a cold shock protein. Nat Struct Biol. 2000;7(5):380-383.


35. Zhao H, Arnold FH. Directed evolution converts subtilisin E into a functional equivalent of thermitase. Protein Eng. 1999;12(1):47-53.

36. Hoseki J, Yano T, Koyama Y, Kuramitsu S, Kagamiyama H. Directed evolution of thermostable kanamycin-resistance gene: A convenient selection marker for thermus thermophilus. J Biochem. 1999;126(5):951-956.

37. Chakravarty S, Varadarajan R. Elucidation of factors responsible for enhanced thermal stability of proteins: A structural genomics based study. Biochemistry. 2002;41(25):8152-8161.

38. Fukuchi S, Nishikawa K. Protein surface amino acid compositions distinctively differ between thermophilic and mesophilic bacteria. J Mol Biol. 2001;309(4):835-843.

39. Torrez M, Schultehenrich M, Livesay DR. Conferring thermostability to mesophilic proteins through optimized electrostatic surfaces. Biophys J. 2003;85(5):2845-2853.

40. Sanchez-Ruiz JM, Makhatadze GI. To charge or not to charge? Trends Biotechnol. 2001;19(4):132-135.

41. Ibarra-Molero B, Loladze VV, Makhatadze GI, Sanchez-Ruiz JM. Thermal versus guanidine-induced unfolding of ubiquitin. An analysis in terms of the contributions from charge-charge interactions to protein stability. Biochemistry. 1999;38(25):8138-8149.

42. Makhatadze GI, Loladze VV, Gribenko AV, Lopez MM. Mechanism of thermostabilization in a designed cold shock protein with optimized surface electrostatic interactions. J Mol Biol. 2004;336(4):929-942.

43. Loladze VV, Ibarra-Molero B, Sanchez-Ruiz JM, Makhatadze GI. Engineering a thermostable protein via optimization of charge-charge interactions on the protein surface. Biochemistry. 1999;38(50):16419-16423.

44. Strickler SS, Gribenko AV, Gribenko AV, et al. Protein stability and surface electrostatics: A charged relationship. Biochemistry. 2006;45(9):2761-2766.

45. Gribenko AV, Patel MM, Liu J, McCallum SA, Wang C, Makhatadze GI. Rational stabilization of enzymes by computational redesign of surface charge-charge interactions. Proc Natl Acad Sci U S A. 2009;106(8):2601-2606.

46. Schweiker KL, Makhatadze GI. A computational approach for the rational design of stable proteins and enzymes: Optimization of surface charge-charge interactions. Methods Enzymol. 2009;454:175-211.

47. Gribenko AV, Patel MM, Liu J, McCallum SA, Wang C, Makhatadze GI. Rational stabilization of enzymes by computational redesign of surface charge-charge interactions. Proc Natl Acad Sci U S A. 2009;106(8):2601-2606.

48. Schweiker KL, Zarrine-Afsar A, Davidson AR, Makhatadze GI. Computational design of the fyn SH3 domain with increased stability through optimization of surface charge charge interactions. Protein Sci. 2007;16(12):2694-2702.

49. Chan CH, Wilbanks CC, Makhatadze GI, Wong KB. Electrostatic contribution of surface charge residues to the stability of a thermophilic protein: Benchmarking experimental and predicted pKa values. PLoS One. 2012;7(1):e30296.

50. Makhatadze GI, Privalov PL. Hydration effects in protein unfolding. Biophys Chem. 1994;51(2-3):291-304; discussion 304-9.

51. Makhatadze GI, Clore GM, Gronenborn AM. Solvent isotope effect and protein stability. Nat Struct Biol. 1995;2(10):852-855.

52. Borders CL Jr, Broadwater JA, Bekeny PA, et al. A structural role for arginine in proteins: Multiple hydrogen bonds to backbone carbonyl oxygens. Protein Sci. 1994;3(4):541-548.

53. Matsutani M, Hirakawa H, Nishikura M, et al. Increased number of arginine-based salt bridges contributes to the thermotolerance of thermotolerant acetic acid bacteria, acetobacter tropicalis SKU1100. Biochem Biophys Res Commun. 2011;409(1):120-124.


54. Sokalingam S, Raghunathan G, Soundrarajan N, Lee SG. A study on the effect of surface lysine to arginine mutagenesis on protein stability and structure using green fluorescent protein. PLoS One. 2012;7(7):e40410.

55. Raghunathan G, Sokalingam S, Soundrarajan N, Madan B, Munussami G, Lee SG. Modulation of protein stability and aggregation properties by surface charge engineering. Mol Biosyst. 2013;9(9):2379-2389.

56. Tsumoto K, Umetsu M, Kumagai I, Ejima D, Philo JS, Arakawa T. Role of arginine in protein refolding, solubilization, and purification. Biotechnol Prog. 2004;20(5):1301-1308.

57. Warwicker J, Charonis S, Curtis RA. Lysine and arginine content of proteins: Computational analysis suggests a new tool for solubility design. Mol Pharm. 2014;11(1):294-303.

58. Golovanov AP, Hautbergue GM, Wilson SA, Lian LY. A simple method for improving protein solubility and long-term stability. J Am Chem Soc. 2004;126(29):8933-8939.

59. Mosavi LK, Peng ZY. Structure-based substitutions for increased solubility of a designed protein. Protein Eng. 2003;16(10):739-745.


60. Das D, Georgiadis MM. A directed approach to improving the solubility of moloney murine leukemia virus reverse transcriptase. Protein Sci. 2001;10(10):1936-1941.

61. Daujotyte D, Vilkaitis G, Manelyte L, Skalicky J, Szyperski T, Klimasauskas S. Solubility engineering of the HhaI methyltransferase. Protein Eng. 2003;16(4):295-301.

62. Li H, Cocco MJ, Steitz TA, Engelman DM. Conversion of phospholamban into a soluble pentameric helical bundle. Biochemistry. 2001;40(22):6636-6645.

63. Mitra K, Steitz TA, Engelman DM. Rational design of 'water-soluble' bacteriorhodopsin variants. Protein Eng. 2002;15(6):485-492.

64. Slovic AM, Kono H, Lear JD, Saven JG, DeGrado WF. Computational design of water-soluble analogues of the potassium channel KcsA. Proc Natl Acad Sci U S A. 2004;101(7):1828-1833.

65. Lee EN, Kim YM, Lee HJ, et al. Stabilizing peptide fusion for solving the stability and solubility problems of therapeutic proteins. Pharm Res. 2005;22(10):1735-1746.

66. Trevino SR, Scholtz JM, Pace CN. Amino acid contribution to protein solubility: Asp, Glu, and Ser contribute more favorably than the other hydrophilic amino acids in RNase Sa. J Mol Biol. 2007;366(2):449-460.

67. Trevino SR, Scholtz JM, Pace CN. Measuring and increasing protein solubility. J Pharm Sci. 2008;97(10):4155-4166.

68. Hari SB, Byeon C, Lavinder JJ, Magliery TJ. Cysteine-free rop: A four-helix bundle core mutant has wild-type stability and structure but dramatically different unfolding kinetics. Protein Sci. 2010;19(4):670-9.

69. Hutchison CA. DNA sequencing: Bench to bedside and beyond. Nucleic Acids Res. 2007;35(18):6227-37.

70. Sanger F, Coulson AR. A rapid method for determining sequences in DNA by primed synthesis with DNA polymerase. J Mol Biol. 1975;94(3):441-8.

71. Sanger F. Nucleotide sequence of bacteriophage phiX174 DNA. Nature. 1977;265.

72. Maxam AM, Gilbert W. A new method for sequencing DNA. 1977. Proc Natl Acad Sci U S A. 1992;24(2):99-103.

73. Sanger F, Nicklen S. DNA sequencing with chain-terminating inhibitors. Proc Natl Acad Sci U S A. 1977;74(12):5463-5467.

74. DNA: Restriction mapping and nucleotide sequencing. http://www.biologydiscussion.com/dna/dna-restriction-mapping-and-nucleotide-sequencing/12390.

75. Sequencing DNA using dye terminators. http://www.di.uq.edu.au/sparqseqdyeterm.

76. Margulies M, Egholm M, Altman WE, et al. Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005;437(7057):376-80.

77. Ansorge WJ. Next-generation DNA sequencing techniques. New Biotechnology. 2009;25(4):195-203.

78. Shendure J, Porreca GJ, Reppas NB, et al. Accurate multiplex polony sequencing of an evolved bacterial genome. Science. 2005;309(5741):1728-32.

79. Braslavsky I, Hebert B, Kartalov E, Quake SR. Sequence information can be obtained from single DNA molecules. Proc Natl Acad Sci U S A. 2003;2003.

80. Korlach J, Marks PJ, Cicero RL, et al. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc Natl Acad Sci U S A. 2008;105(4):1176-81.


81. Rothberg JM, Hinz W, Rearick TM, et al. An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011;475(7356):348-52.

82. Drmanac R, Sparks AB, Callow MJ, et al. Human genome sequencing using unchained base reads on self-assembling DNA nanoarrays. Science. 2010;327(5961):78-81.

83. Clarke J, Wu HC, Jayasinghe L, Patel A, Reid S, Bayley H. Continuous base identification for single-molecule nanopore DNA sequencing. Nat Nanotechnol. 2009;4(4):265-270.

84. Buermans HP, den Dunnen JT. Next generation sequencing technology: Advances and applications. Biochim Biophys Acta. 2014;1842(10):1932-1941.

85. Wang Z, Gerstein M, Snyder M. RNA-seq: A revolutionary tool for transcriptomics. Nat Rev Genet. 2009;10(1):57-63.

86. Wheeler DA, Srinivasan M, Egholm M, et al. The complete genome of an individual by massively parallel DNA sequencing. Nature. 2008;452(7189):872-6.

87. Kim J, Ju YS, Park H, et al. A highly annotated whole-genome sequence of a korean individual. Nature. 2009;460(7258):1011-5.


88. Lind C, Ferriola D, Mackiewicz K, et al. Next-generation sequencing: The solution for high-resolution, unambiguous human leukocyte antigen typing. Hum Immunol. 2010;71(10):1033-42.

89. Burgess DJ. Human disease: Next-generation sequencing of the next generation. Nat Rev Genet. 2011;12(2):78-78.

90. Bainbridge MN, Warren RL, Hirst M, et al. Analysis of the prostate cancer cell line LNCaP transcriptome using a sequencing-by-synthesis approach. BMC Genomics. 2006;7:246-246.

91. Nagalakshmi U, Wang Z, Waern K, et al. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science. 2008;320(5881):1344-9.

92. Costa V, Gallo MA, Letizia F, Aprile M, Casamassimi A, Ciccodicola A. PPARG: Gene expression regulation and next-generation sequencing for unsolved issues. PPAR Res. 2010.

93. Bormann Chung CA, Boyd VL, McKernan KJ, et al. Whole methylome analysis by ultra-deep sequencing using two-base encoding. PLoS One. 2010;5(2):e9320-e9320.

94. Araya CL, Payen C, Dunham MJ, Fields S. Whole-genome sequencing of a laboratory-evolved yeast strain. BMC Genomics. 2010;11:88-88.

95. Fowler DM, Araya CL, Fleishman SJ, et al. High-resolution mapping of protein sequence-function relationships. Nat Methods. 2010;7(9):741-6.

96. Ernst A, Gfeller D, Kan Z, et al. Coevolution of PDZ domain-ligand interactions analyzed by high-throughput phage display and deep sequencing. Mol Biosyst. 2010;6(10):1782-90.

97. Dutta S, Gullá S, Chen TS, Fire E, Grant RA, Keating AE. Determinants of BH3 binding specificity for mcl-1 versus bcl-xL. J Mol Biol. 2010;398(5):747-62.

98. DeBartolo J, Dutta S, Reich L, Keating AE. Predictive bcl-2 family binding models rooted in experiment or structure. J Mol Biol. 2012;422(1):124-44.

99. Dutta S, Chen TS, Keating AE. Peptide ligands for pro-survival protein bfl-1 from computationally guided library screening. ACS Chem Biol. 2013;8(4):778-88.

100. Hietpas RT, Jensen JD, Bolon DNA. Experimental illumination of a fitness landscape. Proc Natl Acad Sci U S A. 2011;108(19):7896-901.

101. Hietpas R, Roscoe B, Jiang L, Bolon DNA. Fitness analyses of all possible point mutations for regions of genes in yeast. Nat Protoc. 2012;7(7):1382-96.

102. Roscoe BP, Thayer KM, Zeldovich KB, Fushman D, Bolon DNA. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. J Mol Biol. 2013;425(8):1363-77.
