Developing methods to understand and engineer cleavage specificity

A Dissertation SUBMITTED TO THE FACULTY OF THE UNIVERSITY OF MINNESOTA BY

Michael Douglas Lane

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Burckhard Seelig

September 2016

© Michael D Lane, 2016

Acknowledgements

I would like to thank the many people who supported me along the way towards completing my thesis. Above all, I thank my advisor, Dr. Burckhard Seelig, who gave me the opportunity to pursue my ideas, embraced my research with thoughtful criticism and relentless optimism, and shaped my scientific approach through challenging me to be the best I can be. I also would like to thank previous members of the Seelig lab, Dr. Misha Golynskiy, Dr. John Haugner, Dr. Dana Morrone, and Dr. Aleardo Morelli, for being good friends and for the many scientific discussions we had that guided me through the rough experiments. To the new Seelig lab members, especially Dr. Matilda Newton, Dr. Yari Cabezas, and Dr. Nisha Kanwar, thank you for your critical feedback, friendship, and taking away all my lab responsibilities. To the MSTP, I would like to thank Dr. Yoji Shimizu, Susan Shurson, and Nicholas Berg for their unending help guiding me through the woven MD/PhD program. In particular, I thank the MSTP leadership for allowing me to pursue the program tailored to my interests. To my thesis committee, thank you for your support through the years. From annual reviews, to informal discussions, to grant application drafts, your feedback molded my abilities and experience. To my friends and family, especially my wife, Dr. Kathleen Lane, thank you for keeping me going through tough times, and understanding when I disappeared for weeks or months at a time. I would like to acknowledge the NIH MSTP Training Grant (T32 GM008244), American Heart Association Predoctoral Fellowship, NIH Exploratory/Developmental Research Grant (R21 AI113406), and the Doctoral Dissertation Fellowship for funding my training and making my ideas a reality. Finally, I would like to thank the administration of the Biotechnology Institute and the Biochemistry, Molecular Biology, and Biophysics Department for creating a remarkable research environment and training program. And to all the researchers in the Gortner Laboratory, thank you for making this old bunker a place of inspiring thought and collaboration.

Dedication

For my parents, Barry and Lyn Lane, who raised me to appreciate rational thought, cultivated my curiosity and creativity, and inspired me to push myself as hard as I can.


Abstract

Proteases are ubiquitous enzymes that comprise nearly 2% of all human genes. These robust enzymes are attractive potential therapeutics due to their catalytic turnover and capability for exquisite specificity. While most existing drugs require a stoichiometric ratio to function, therapeutic proteases could clear their targets much more efficiently. Unfortunately, existing technologies are inadequate for understanding and engineering therapeutic proteolytic specificities. My thesis work has focused on building the groundwork to enable these technologies to thrive.

To engineer a new protease, it is currently necessary to identify prototype proteases whose specificities are similar to the desired target substrate. Current technologies are unable to characterize proteases adequately for this goal. Accordingly, I invested in developing a method for the accurate characterization of protease cleavage specificity. Our unique combination of mRNA display technology, Next-Generation Sequencing, and mass spectrometry enables the sampling of all possible permutations of octamer substrates and the identification of millions of cleavage sites. The throughput of our approach is orders of magnitude greater than the current state-of-the-art methods. The resulting high-resolution specificity maps can be applied to identify promising protease prototypes, predict human cross-reactivity, or lead to a better understanding of this critical component of natural physiology.

In the work presented here, I applied my new specificity-screening method to assess the specificities of the proteases factor Xa, ADAM17, and streptopain. The resulting cleavage preference maps confirmed known specificities and revealed new insight into the preferences of both narrow- and broad-specificity proteases. In particular, disfavored amino acids were illuminated better than ever before.

The next focus of my work was to engineer novel protease specificity at multiple subsites. I chose streptopain as the prototype for my efforts to neutralize the superantigen exotoxin SpeA. I identified a target loop of SpeA wherein cleavage would result in inactive fragments. Further, I confirmed that streptopain can be successfully presented as an mRNA-displayed fusion.

In summary, my thesis work established crucial methodologies for applying mRNA display technology to enable the understanding, and ultimately the engineering, of the specificity of therapeutic proteases.


Table of Contents

Acknowledgements
Dedication
Abstract
Table of Contents
List of Tables
List of Figures

Chapter 1: Introduction
1.1 Thesis overview
1.2 Significance
1.2.1 The untapped potential of proteases as therapeutics
1.2.2 Protease specificity drives their unique function
1.2.3 Limited technologies hamper progress towards therapeutic proteases
1.2.4 Engineering therapeutic proteases can defend against bacterial exotoxins
1.3 Current methods for characterizing protease specificity
1.3.1 Probing specificity with chemically synthesized peptide libraries
1.3.2 Probing specificity with large combinatorial libraries
1.3.3 Current methods are too limited to characterize promiscuous proteases
1.4 Current methods for engineering protease specificity
1.4.1 Specificity engineering of protease OmpT by cell-surface display
1.4.2 TEV protease specificity engineering by YESS
1.4.3 Computational design of α-gliadin peptidase
1.4.4 Limitations of current protease specificity engineering efforts
1.5 Advances in the directed evolution of proteins
1.5.1 Overview
1.5.2 Introduction
1.5.3 Advancing selection technologies
1.5.4 Expanding the scope of selections to new properties
1.5.5 Conclusion
1.6 Outlook

Chapter 2: Highly efficient recombinant production and purification of streptococcal streptopain with increased enzymatic activity
2.1 Overview
2.2 Introduction
2.3 Materials and methods
2.3.1 Materials
2.3.2 Overexpression of streptopain by autoinduction in E. coli
2.3.3 Purification of streptopain by affinity chromatography
2.3.4 Confirmation of streptopain protein identity by mass spectrometry
2.3.5 Proteolytic activity measured by azocasein assay
2.3.6 Assay of streptopain cleavage specificity using peptides substrates
2.4 Results
2.4.1 Testing multiple strategies to improve soluble protein expression
2.4.2 Overexpression of mature streptopain by autoinduction
2.4.3 Purification of streptopain
2.4.4 Assays to confirm proteolytic activity and substrate specificity
2.5 Discussion
2.6 Conclusion
2.7 Supplementary information

Chapter 3: Developing a comprehensive specificity analysis method for promiscuous proteases
3.1 Overview
3.2 Introduction
3.3 Results
3.3.1 Design and optimization of the octamer peptide substrate construct
3.3.2 mRNA display of proof of principle with His6-based immobilization
3.3.3 Optimization of His6-based immobilization of octamer library
3.3.4 N-terminal labeling of fusions via NHS ester for subsequent immobilization
3.3.5 Incorporation of non-natural amino acid to improve immobilization
3.3.6 Pilot digestion of click-immobilized nnAA octamer library by streptopain
3.3.7 Optimization of click-immobilization of nnAA octamer library
3.3.8 Two rounds of screening for three different proteases
3.3.9 Preliminary analysis of Next-Generation Sequencing results
3.4 Discussion
3.5 Future work
3.6 Conclusions
3.7 Materials and methods
3.7.1 PCR construction of controls, octamer libraries, and sequencing pools
3.7.2 Expression and purification of tRNA synthetase
3.7.3 tRNA transcription, hammerhead ribozyme activation, and pPaF charging
3.7.4 mRNA display of octamer designs
3.7.6 Click immobilization of octamer fusions
3.7.7 Wash of azide-agarose immobilized fusions
3.7.8 Protease digestion of click-immobilized fusions
3.7.9 Next-Generation Sequencing and data analysis
3.8 Supplementary information

Chapter 4: Towards engineering proteases with multiple novel subsite specificities using mRNA display
4.1 Overview
4.2 Introduction
4.3 Results
4.3.1 Identifying an ideal SpeA cleavage target for proteolytic inactivation
4.3.2 Rabbit toxicity experiments verified therapeutic value of target cleavage
4.3.3 Streptopain can be mRNA displayed as zymogen or maturated lengths
4.4 Discussion
4.5 Future work
4.5.1 Build a library of streptopain variants
4.5.2 Selection of SpeA-cleaving streptopain variants
4.5.3 Optimize streptopain variant to cleave full-length SpeA protein
4.6 Conclusions
4.7 Materials and methods
4.7.1 Construction of control designs
4.7.2 mRNA display of control designs and digestion with streptopain
4.7.3 Cloning, expression, purification, and digestion of SpeA, SpeA fragments
4.7.4 Rabbit toxicity experiments
4.8 Supplementary information

Conclusions and Future Directions

Bibliography


List of Tables

Chapter 2
Table S2.1: PCR primers used for cloning of streptopain into pET24a vector.

Chapter 3
Table 3.1 Test substrate construct designs for protease screening by mRNA display
Table 3.2 Substrate construct designs for immobilization by click chemistry.
Table 3.3 Comparison of click immobilization yields for different alkyne incorporation strategies
Table 3.4 Copper improves click yields in a UAG-dependent manner.
Table 3.5 Streptopain digestion loses discrimination of preferred sequences when more digestion occurs.
Table 3.6 Magnetic resin immobilization attempts.
Table 3.7 Results from rounds 1 and 2 of cleavage specificity selection for three proteases.
Table S3.1 Primers used in construction of controls and libraries.
Table S3.2 Nucleotide sequence of all primers used in this chapter.
Table S3.3 Nucleotide sequence of all mRNA display constructs in this chapter.
Table S3.4 Next-Generation Sequencing reads from two rounds of selection that were used in analysis.

Chapter 4
Table 4.1 Evaluating potential target sequences in SpeA most likely to result in loss of toxin function upon cleavage.
Table 4.2 Toxicity tests in rabbits verified that toxin fragments that would result from cleavage in loop 1 were inactive.
Table S4.1 Primers used in PCR construction or amplification of controls.
Table S4.2 Nucleotide sequence of all primers used in this chapter.
Table S4.3 Nucleotide sequence of all mRNA display peptide controls in this chapter.
Table S4.4 Nucleotide sequence of all mRNA display streptopain controls in this chapter.


List of Figures

Chapter 1
Figure 1.1 Protease subsite and substrate cleavage nomenclature
Figure 1.2 Probing protease specificity with synthesized combinatorial libraries.
Figure 1.3 mRNA display links mRNA to its encoding protein
Figure 1.4 Overview of directed evolution.
Figure 1.5 Overview of recent technical advances.

Chapter 2
Figure 2.1 SDS-PAGE gel of the expression and purification of mature streptopain
Figure 2.2 Amino acid sequence of streptopain protein translated from pET24a vector
Figure 2.3 Azocasein assays to determine protease activity of purified streptopain
Figure 2.4 Digestion of test peptide substrates with and without streptopain
Figure S2.1 Impurity of streptopain when purified without HgCl2
Figure S2.2 Mass spectrum of the positive control peptide without streptopain.
Figure S2.3 Mass spectrum of the positive control peptide digested by streptopain.
Figure S2.4 Mass spectrum of the positive control peptide digested by streptopain.
Figure S2.5 Mass spectrum of the negative control peptide without streptopain.
Figure S2.6 Mass spectrum of the negative control peptide with streptopain.

Chapter 3
Figure 3.1 General scheme for screening protease specificity by mRNA display, Next-Generation Sequencing, and LC-MS/MS
Figure 3.2 Streptopain digest of octamer substrate constructs D1 to D5 in solution
Figure 3.3 Octamer substrate fusions are less susceptible to proteolysis when immobilized to Ni-NTA
Figure 3.4 Protease digestions and reverse transcription of early octamer libraries
Figure 3.5 Fusions leaching off Ni-NTA immobilization by digestion buffer washes without protease
Figure 3.6 Overview of click immobilization and screening for protease specificity
Figure 3.7 Next-Generation Sequencing revealed enrichment in pilot nnAA octamer library digestion
Figure 3.8 Optimization of octamer fusion washing
Figure 3.9 Next-Generation Sequencing reveals cleavage specificity of three proteases
Figure S3.1 Transcription product and hammerhead ribozyme activation of pPaFT

Chapter 4
Figure 4.1 mRNA display selection scheme for novel protease specificity
Figure 4.2 mRNA display of loop 1 targets reveal little natural activity
Figure 4.3 Streptopain can be mRNA displayed in zymogen or mature forms
Figure 4.4 Reverse transcription primers to be used in engineering selection

Chapter 1

Chapter 1: Introduction

Section 1.5 is a reprint of the article: Michael D Lane, Burckhard Seelig. (2014) Advances in the directed evolution of proteins. Curr Opin Chem Biol. 22, 129-136. The article is reproduced here with kind permission from Elsevier. The article was written in collaboration with Dr. Burckhard Seelig with both authors having significant contributions to each section.

Hyperlink to the original publication: http://www.sciencedirect.com/science/article/pii/S1367593114001240

1.1 Thesis overview

Chapter 1 begins with a glimpse into the wide-ranging potential of proteases as therapeutics. Basic protease properties are reviewed in order to present the potential and challenges of creating therapeutic proteases. The discussion continues with an overview of the state of the field, appraising the advantages and limitations of existing technologies with regard to both protease specificity analysis and protease specificity engineering. Specificity screening is a relatively mature field, yet it still has not yielded a method for adequately sampling broad-specificity proteases. In contrast, protease specificity engineering is a young field with few landmark examples of engineering novel specificities. A review of recent advances in methods of engineering proteins is included to give perspective on the latest achievements in protein engineering. Finally, an outlook is presented on the future of protease engineering.

Chapter 2 describes the work we performed while generating a protocol to express and purify the broad-specificity protease streptopain. As the remainder of this thesis concerns the study of streptopain substrate specificity in addition to commercially available proteases, it was critical that we obtain large quantities of highly purified and active streptopain. Published methods for obtaining recombinant streptopain proved unreliable and produced relatively little protease. We therefore developed a highly efficient method which can generate large quantities of streptopain that is several-fold more active than previously reported.

Chapter 3 details work towards understanding the natural specificity of proteases by using mRNA display. This approach combines the powerful methods of mRNA display, to screen all possible octamers, with Next-Generation Sequencing, to identify millions of cleaved sequences, and mass spectrometry, to discern the cleavage locations of the peptides. Extensive experiments were performed to optimize the immobilization of the mRNA-displayed peptides (fusions) and reduce the background rate of release. Two rounds of specificity screening were performed on three proteases of interest: factor Xa as a narrow-specificity and well-characterized control, ADAM17 as a broad-specificity and relatively well-characterized control, and streptopain as a broad-specificity and poorly characterized application. Method development, optimization, and early Next-Generation Sequencing result analysis are discussed.

Chapter 4 presents the progress we have made towards engineering a protease to inactivate a superantigen toxin. Through structural analysis, digestion, and subsequent mass spectrometry identification, a region of the superantigen SpeA was identified as an ideal cleavage site against which to evolve proteolytic activity. In vivo models confirmed that cleavage at this site yielded inactive SpeA. The broad-specificity protease streptopain was prepared for mRNA display and shown to be amenable to fusion formation at three potential display lengths. This potentially enables the display of streptopain in either zymogen or active lengths.

Finally, the thesis concludes with a discussion of the potential of protease engineering for future applications.


1.2 Significance

1.2.1 The untapped potential of proteases as therapeutics

Proteases are a rich source for the development of powerful therapeutics that can inactivate disease-causing proteins. While most drugs are binding agents that act at a stoichiometric ratio, proteases are capable of catalytic turnover. This allows a single protease molecule to cleave many molecules of the target protein. Therapeutic proteases are currently used to treat thrombosis, sepsis, coagulation, and neuromuscular and digestion disorders1. Significantly, all of these applications required identification of an existing protease in nature with the desired activity. Protein engineering efforts thus far have principally focused on improving only certain properties of these pre-existing proteases, including serum half-life2 but not yet cleavage specificity. To engineer proteases capable of therapeutic interventions, it will be necessary to repurpose the natural specificity towards novel targets as well as engineer narrowed specificity to reduce undesirable side effects.

1.2.2 Protease specificity drives their unique function

Proteases comprise roughly 2% of all human genes, making them the largest family of enzymes3. With this enormous number of proteases, it follows that they are responsible for a myriad of cell functions, including activation, recycling, signaling, post-translational modifications, defense, and cell survival4-6. The proteases responsible for recycling, such as those comprising the proteasome, are able to cleave nearly any peptide sequence. However, most human proteases have evolved to recognize very specific peptide motifs. The main factors contributing to whether a protein can be cleaved by a protease are the substrate’s amino acid sequence at the cleavage site, three-dimensional fold, and exosite regions. Of these, the amino acid sequence at the cleavage site is the dominant factor determining specificity. The active site region typically recognizes 2-8 target amino acids in pockets called subsites. These are labeled by the Schechter and Berger nomenclature7, which identifies the target amino acids on the target protein with respect to the cleavage site: “prime-side” indicates amino acids on the C-terminal side of the cleavage, and “non-prime side” indicates amino acids on the N-terminal side of the cleavage (Figure 1.1). The two new amino acid sequences resulting from a cleavage are identified as PX-P1, where P1 is the new C-terminus, and P1′-PX′ with P1′ as the new N-terminus. These two fragments fit into corresponding subsite pockets of the protease labelled SX-S1-S1′-SX′ with matching numbering. Frequently, subsites closer to the cleaved bond play a larger role in determining specificity.

Figure 1.1: Protease subsite and substrate cleavage nomenclature. Protease subsites S4-S4′ correspond to substrate amino acids P4-P4′, where cleavage occurs between the P1 and P1′ position.
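The numbering convention can be illustrated with a short sketch. The function name `label_subsites`, its arguments, and the example peptide are hypothetical and introduced here only for illustration; the sketch simply counts residues outward from the cleaved bond, as the nomenclature describes.

```python
def label_subsites(substrate: str, cleavage_index: int) -> dict[str, str]:
    """Map Schechter-Berger position labels to substrate residues.

    cleavage_index is the number of residues on the N-terminal side of the
    cleaved bond, so the bond lies between substrate[cleavage_index - 1] (P1)
    and substrate[cleavage_index] (P1').
    """
    if not 0 < cleavage_index < len(substrate):
        raise ValueError("the cleaved bond must lie between two residues")
    labels = {}
    # Non-prime side: count outward from the bond toward the N-terminus.
    for offset, residue in enumerate(reversed(substrate[:cleavage_index]), start=1):
        labels[f"P{offset}"] = residue
    # Prime side: count outward from the bond toward the C-terminus.
    for offset, residue in enumerate(substrate[cleavage_index:], start=1):
        labels[f"P{offset}'"] = residue
    return labels


# Hypothetical octamer cleaved between its fifth and sixth residues.
example = label_subsites("AIEGRSLK", 5)
print(example["P1"], example["P1'"])  # R S
```

In this toy case the residues AIEGR occupy P5 through P1 and SLK occupy P1′ through P3′; the corresponding protease pockets would carry the matching S labels.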

Because the substrate’s three-dimensional fold can make certain regions inaccessible, cleavage sites are frequently seen on surface regions and accessible loops. However, cleavage sites can be seen in both α-helical and β-sheet regions8. Notably, exosites can play a major role in fine-tuning the specificity of a protease. These are macromolecular recognition sites outside the catalytic region of the protease that can modify proteolytic activity significantly9-11. These sites are commonly responsible for fine-tuning substrate specificity and activity10, 12. Understanding and engineering of exosites is also an emerging field, with a recent notable example demonstrating that exosite specificity can be swapped between similar proteases13.

1.2.3 Limited technologies hamper progress towards therapeutic proteases

To maximize the chances of successfully engineering a protease towards a desired therapeutic activity, the protease used as the starting point would ideally have a specificity that already somewhat matches its intended target. In addition, a major concern for using proteases as therapeutics is the potential off-target cleavage of human proteins other than their intended target. Fortunately, proteases can have exquisite specificities such that only a single known sequence is cleaved14. As an example, the Staphylococcus aureus exfoliative toxins are proteases that have specificities so refined that it was initially difficult to determine their proteolytic activity and target15. Current methods for characterizing protease specificity have identified in some cases up to thousands of cleavage sites. Yet this information has still been inadequate for revealing human cross-reactivity3 that could be minimized or eliminated through engineering, or for identifying suitable prototypes for engineering therapeutic proteases. Although there have been multiple advances in characterizing protease specificity, no method has probed broad-specificity proteases adequately enough to reliably predict cleavage sites.

Despite the huge potential for therapeutic proteases, engineering efforts thus far have accomplished relatively small changes in specificity. The most advanced examples have successfully changed the specificity at a single subsite. Since proteases can frequently recognize up to 8 amino acids, there is a large amount of refinement possible. It will be critical to expand specificity at multiple subsites simultaneously to enable the cleavage of new substrates, but also to refine specificity to eliminate unwanted cross-reactivity with human proteins. It will require large advances in methodology to be able to modify multiple subsites simultaneously.
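The scale of the sequence space behind multi-subsite recognition can be made concrete with a short calculation. The sketch below assumes the 20 canonical amino acids are possible at each recognized position; the function name `sequence_space` is introduced here for illustration only.

```python
AMINO_ACIDS = 20  # assumption: the 20 canonical amino acids at every position


def sequence_space(subsites: int, alphabet: int = AMINO_ACIDS) -> int:
    """Number of distinct substrate sequences spanning `subsites` positions."""
    return alphabet ** subsites


for n in (1, 2, 4, 8):
    print(f"{n} subsite(s): {sequence_space(n):,} possible sequences")
# The last line printed reads: 8 subsite(s): 25,600,000,000 possible sequences
```

Under this assumption, altering a single subsite explores 20 alternatives, whereas the full eight-position space contains roughly 2.6 × 10^10 sequences, which gives a sense of how much room for refinement remains.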

1.2.4 Engineering therapeutic proteases can defend against bacterial exotoxins

As our initial goal, we are developing a method to engineer proteases capable of neutralizing deadly exotoxins. Bacteria such as Staphylococcus aureus and Streptococcus pyogenes can secrete extremely potent exotoxins named “superantigens”, which have been linked to toxic shock syndrome, necrotizing fasciitis, and Kawasaki-like diseases16, 17. We aim to create a protease to neutralize SpeA, a Streptococcus pyogenes superantigen linked to many diseases including Streptococcal Toxic Shock Syndrome (STSS)16, 17. The current standard of care for STSS is non-specific supportive therapies (intravenous fluids, surgical removal of infected tissue) and antibiotics, with adjunctive intravenous immunoglobulin (IVIG) therapy18 as the only agent potentially capable of directly neutralizing the superantigen. Even with maximal therapeutic intervention, case fatality rates remain high for STSS (30-70%)19. In addition, the multicenter 2014 ProCESS trial with 1,341 patients concluded that the most advanced therapy for septic shock failed to show improvement over usual care20. There have been no therapeutic advances in decades despite >30,000 patients enrolled and hundreds of millions of dollars invested in clinical trials21. Superantigen-neutralizing proteases created through engineering custom protease specificity would help address this void and revolutionize the treatment of STSS.

A protease capable of neutralizing superantigens through cleavage would have several advantages over any binding agents such as antibodies. First, the protease could inactivate superantigens at a catalytic rate as opposed to a stoichiometric ratio, resulting in significantly improved efficiency, i.e., thousands of superantigens neutralized per protease as opposed to one or two per inhibitor. Second, the fragments resulting from cleavage will be returned to the blood stream, thereby allowing a protective antibody response to develop and assist in future superantigen defense. Third, proteases can be extremely specific, as shown by protease cascades in the human body that control thrombosis, complement, or apoptosis2. Therefore, a proteolytic therapeutic can be refined to eliminate adverse effects. Lastly, proteases can be recombinantly expressed and purified, a much less expensive process than the production of IVIG. Overall, a proteolytic therapeutic could be efficient, immune-boosting, and cost-efficient, while having fewer adverse effects.

Our long-term goal is to establish methods to analyze and create novel protease specificities. These techniques can eventually be used to evolve multiple proteases, each capable of neutralizing a specific bacterial exotoxin. Furthermore, engineered proteases have the potential to treat a wide variety of non-infectious disorders including coagulation, inflammatory, or auto-immune disorders1. The methods launched from my thesis work may open the path to a new class of protease therapeutics with broad applications, from neutralizing the wide array of bacterial exotoxins in infectious disease, to treating a variety of additional disorders that involve aberrant proteins.


1.3 Current methods for characterizing protease specificity

1.3.1 Probing specificity with chemically synthesized peptide libraries

In the 1990s, protease specificities were characterized through the use of combinatorial libraries of peptides22. These libraries were labelled with a fluorophore (AMC) that would be released upon proteolysis. They were designed to separately assess positions P4, P3, and P2 by holding each of the 20 amino acids constant at a single position at a time, while the remaining positions contained a mixture of 19 amino acids (excluding cysteine) (Figure 1.2A). This was used to assess the non-prime-side specificity of IL-1β-converting enzyme23 and subsequently generate an inhibitor with pM affinity (Ki = 56 pM). This method has been used to characterize a wide range of proteases, including plasmin, thrombin, urokinase, tissue plasminogen activator, factor Xa, , and cruzain23.

Figure 1.2: Probing protease specificity with synthesized combinatorial libraries. (A) Design of combinatorial peptide libraries to test individual subsite specificities. (B) MSP-MS library design for characterizing protease specificity with 124 unique 14mer peptides that test all combinations of amino acid pairs separated by 0, 1, or 2 amino acids.
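The layout of such a positional library can be enumerated in a few lines. This is a minimal sketch under stated assumptions: it covers only the P4, P3, and P2 positions named in the text, it does not model the P1 residue or the AMC fluorophore, and the names `sublibraries` and `members` are hypothetical.

```python
from itertools import product

ALL_AA = "ACDEFGHIKLMNPQRSTVWY"    # the 20 canonical amino acids
MIXTURE = ALL_AA.replace("C", "")  # 19-residue mixture, cysteine excluded
POSITIONS = ("P4", "P3", "P2")     # positions assessed one at a time


def sublibraries():
    """Yield (fixed_position, fixed_residue) for every sub-library."""
    for position, residue in product(POSITIONS, ALL_AA):
        yield position, residue


def members(fixed_position: str, fixed_residue: str) -> list[str]:
    """Enumerate the P4-P3-P2 sequences contained in one sub-library."""
    pools = [fixed_residue if p == fixed_position else MIXTURE for p in POSITIONS]
    return ["".join(seq) for seq in product(*pools)]


print(len(list(sublibraries())))  # 60 sub-libraries (3 positions x 20 residues)
print(len(members("P3", "D")))    # 361 sequences (19 x 19) share Asp at P3
```

Each sub-library reports on one residue at one position, so the readout is a per-position preference rather than the activity of any individual sequence.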

Fluorescence-quenched peptide libraries were generated by synthesizing a peptide labelled with O-aminobenzoyl (Abz) on the N-terminus and nitro-tyrosine (Y(NO2)) on the C-terminus such that cleavage in the peptide will separate the quencher, allowing fluorescence. This approach was applied to assess both prime and non-prime side specificity of , Factor Xa, and streptopain24-26. While this method had the advantage of being able to assess both sides of the cleaved bond, the peptides must be chemically synthesized, a labor-intensive process, for each specific protease.

Positional scanning libraries were developed to use the above fluorescence-quenched strategy, but aimed to assess the optimal amino acid at a single subsite at a time. For example, each of the twenty amino acids is individually sampled at P1, while the other amino acids are randomized equally. All twenty pools are sampled to identify the optimal amino acid at P1. Then P1 is held constant with the optimal amino acid identified, while P2 is individually probed, and P3-P4 are evenly randomized. After scanning positions P1, P2, P3, or more, an optimized substrate is identified. This has been applied to a variety of proteases27.

While the trend in many methods has been to maximize the number of substrates that can be screened, Multiplex Substrate Profiling by Mass Spectrometry (MSP-MS) aimed to do the opposite: create a minimal library of well-designed peptides to assess specificity simply and quickly. This approach was applied using a rationally designed library of only 124 different 14mer peptides28. This library contained all possible doublets of amino acids placed next to each other, as well as spaced by 1 or 2 amino acids (Figure 1.2B). Notably, the library used in this approach intentionally does not include cysteines to eliminate potential disulfide bond formation, and methionines were replaced with norleucine to eliminate oxidation. The resulting cleaved peptides can be identified by mass spectrometry and compiled to generate a specificity map. Many proteases have been quickly assessed using this very efficient method.
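The doublet-coverage idea behind a minimal library can be checked with a small sketch. The code below is not the published MSP-MS design procedure; it is an illustrative coverage test on toy peptides, and the names `covered_pairs` and `missing_pairs` are hypothetical.

```python
def covered_pairs(peptides, max_gap=2):
    """Collect every (residue, gap, residue) doublet present in the peptides.

    gap is the number of residues separating the pair: 0 means adjacent.
    """
    covered = set()
    for peptide in peptides:
        for gap in range(max_gap + 1):
            step = gap + 1
            for i in range(len(peptide) - step):
                covered.add((peptide[i], gap, peptide[i + step]))
    return covered


def missing_pairs(peptides, alphabet, max_gap=2):
    """Return the doublets over `alphabet` that no peptide contains."""
    required = {
        (a, gap, b)
        for a in alphabet
        for b in alphabet
        for gap in range(max_gap + 1)
    }
    return required - covered_pairs(peptides, max_gap)


# Toy two-letter alphabet: one 5mer covers 8 of the 12 possible doublets.
print(len(missing_pairs(["AABBA"], "AB")))  # 4
```

A library designer could add peptides until `missing_pairs` returns an empty set, which is one way to reason about how a small, well-chosen set of peptides can still sample every neighbor and near-neighbor pair.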

1.3.2 Probing specificity with large combinatorial libraries While the above methods work well for proteases with relatively high substrate specificity, and the MSP-MS approach can be applied to begin to understand proteases with more broad-specificity, there is a need for characterizing the specificity of very broad 8

Chapter 1 specificity proteases. These proteases play important roles in infectious disease, intracellular proteolysis, and cell signaling. Frequently, broad-specificity proteases can cleave such a wide range of substrates that probing with dozens or hundreds of peptides, as described above, would be inadequate to understand their specificity. Many efforts have been made to screen as many substrates as possible and therefore maximize the assessed complexity. Alternative methods are required to allow the generation and screening of vast peptide libraries. Phage display is a method where a bacteriophage displays a unique peptide linker that attaches an affinity tag, such as a hexa-histidine (His6) to its surface. A library of bacteriophages is created with random unique peptide linkers for each variant. Cleavage within the random peptide sequence releases the tag and therefore allows selection against the presence of the affinity tag and subsequent enrichment of random peptides that were preferentially cleaved. The target libraries are typically randomized 4-6 amino acid lengths. This method was initially reported in the early 1990s, and is still being applied currently29, 30. Phage display can screen up to 107 unique sequences in a single round, and generally requires 5 or more rounds to enrich the pool such that Sanger sequencing of the final output can be used to identify hits. It can successfully identify prime and non-prime specificity and some subsite effects30. A number of proteases have been characterized, including subtilisin, factor Xa, HSV-1 protease, collagenase-3, and HIV-1 protease31. Cellular Libraries of Peptide Substrates (CLiPS) is a cellular display method which presents peptides on the surface of the cell for cleavage32. The major differences between this method and phage display is that CLiPS displays up to 104 copies of the substrate on the surface of a single cell, enabling fluorescence-based single-cell quantitative real-time measurement. 
Therefore, substrate cleavage kinetics are much easier to determine and can be included as part of a selection scheme from a library of cellular variants.

Proteomic Identification of Cleavage Sites (PICS) is a method based upon cleavage of peptides derived from a proteomic source³. The isolated proteins must first be processed by a conserved protease (such as trypsin, GluC, or chymotrypsin) into shorter peptides, which are then chemically modified to protect all free amines and cysteines. Next, these shorter peptides are digested with the protease of interest, and the newly freed amine termini are chemically labeled with biotin and isolated via streptavidin affinity purification. Finally, purified peptides are identified by liquid chromatography-tandem mass spectrometry (LC-MS/MS). The identified N-terminal protein sequence is then matched to the proteome to bioinformatically identify the C-terminal sequence of the cleavage location. This has been performed to identify hundreds of proteomic cleavage sites for a wide range of proteases³, ³³⁻³⁵.

Several derivative methods have improved upon PICS in various ways. Quantitative PICS (Q-PICS) incorporated Tandem Mass Tag (TMT) isobaric labeling to make a quantitative assessment of the proteins in the digestion³⁶. This method allowed simultaneous identification and quantification in a single experiment. Q-PICS was further improved to enable the labeling and subsequent identification of both the N- and C-terminal fragments by using TMT labeling and assessing the loss of C-terminal proteins³⁷. Recently, a PICS-based approach was described that could be performed without protecting free lysines or cysteines³⁸. This approach compared peptide libraries carrying different stable isotope labels to quantify peptides, and successfully identified almost 4000 cleavage sites caused by 7 proteases.

Finally, Terminal Amine Isotopic Labeling of Substrates (TAILS) is another method similar to PICS, but with the major difference that it allows cleavage of large proteome targets³⁹. With this approach, the proteome is first digested with the protease of interest, and the subsequent work-up prepares the digested proteome for identification by mass spectrometry. This is done by isotopically labeling amine groups, mixing with an undigested control, and digesting with trypsin to obtain short peptides.
Mass spectrometry analyzes the resulting complex mixture, after which computational analysis identifies differences in the N-termini between the treated and untreated samples. TAILS has been very successful at identifying strongly preferred cleavage sites in the proteome for several proteases.

Finally, mRNA display technology can be used to characterize proteases. mRNA display is an entirely in vitro technology in which the mRNA encoding a protein library of up to 10¹³ variants is translated and covalently linked to its translated protein via a puromycin linker region (Figure 1.3). Selected protein variants can then be amplified by error-prone PCR, introducing additional random mutations, and selected repeatedly. This mimics natural evolution by directly linking each variant's survival and replication to its functional capacity. Our laboratory has previously used this technology to create an artificial RNA ligase⁴⁰, and we applied this method throughout my thesis.

mRNA display has previously been applied once to characterize protease specificity by displaying a proteomic library⁴¹. This strategy incorporated a 15 amino acid sequence called an AviTag at the N-terminus of the fusions. The AviTag is specifically recognized by the E. coli biotin ligase (BirA), which covalently attaches a biotin to this sequence, thus enabling immobilization of the N-terminus on a streptavidin-decorated resin. After screening the proteomic library, hits were sequenced, allowing the identification of 115 caspase substrates. Cleavage locations were identified by bioinformatically scanning for the well-defined caspase cleavage motif DXXD in the identified proteins.
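The motif scan described above can be illustrated with a minimal sketch. It assumes only what the text states, namely that candidate sites are located by searching protein sequences for DXXD (aspartate, any two residues, aspartate). The function name and the example sequence are hypothetical and are not taken from the cited study.

```python
import re

# DXXD: Asp, any two residues, Asp. The lookahead lets overlapping
# occurrences of the motif be reported as separate matches.
DXXD = re.compile(r"(?=(D..D))")

def find_dxxd_sites(sequence):
    """Return (start, motif) pairs (0-based start) for every DXXD occurrence."""
    return [(m.start(), m.group(1)) for m in DXXD.finditer(sequence.upper())]

# Made-up sequence for illustration only; not a real caspase substrate.
example = "MKDEVDGSADDQDLT"
for start, motif in find_dxxd_sites(example):
    print(start, motif)
```

A scan of this kind only proposes candidate positions. It cannot by itself confirm that cleavage occurred at a given motif, which is why it depends on the motif being well defined for the protease in question.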

Figure 1.3: mRNA display links mRNA to its encoded protein. A DNA library is transcribed into RNA, modified with a puromycin moiety, and translated. When translation reaches the puromycin, the ribosome creates a covalent bond between the puromycin and the nascent protein. The mRNA is therefore covalently linked to its encoded protein via the puromycin.

1.3.3 Current methods are too limited to characterize promiscuous proteases

None of the methods developed thus far have been able to generate specificity maps capable of accurately predicting previously unknown cleavage sites in the human proteome. There are a number of proteases that are known to cleave specific human proteins, but the exact site of cleavage remains unidentified. For example, streptopain, a cysteine protease expressed by S. pyogenes, is known to cleave a number of human proteins, but many of the exact cleavage sites are still unknown. This broad-specificity protease has been characterized by methods based on chemically synthesized peptides, yet the resulting specificity map is inadequate for practical use: it identifies multiple preferred cleavage sites in a peptide that is known not to be cleaved, and it cannot reliably identify unknown cleavage sites.

First, a major downside of all of the methods described above is that they only identify dozens to low thousands of cleaved substrates as the output of their screens. For a protease with rather broad specificity, this number will only barely begin to assess its promiscuous specificity. All of the proteomic approaches (such as PICS) require LC-MS/MS identification of hits, which is currently limited to a maximum of ~1,000 sequences. Phage display screens, despite having the capacity to screen up to 10⁷ unique sequences, have been limited by the use of Sanger sequencing after enrichment, which typically reduces the output to dozens or hundreds of sequences. Furthermore, subsite-cooperativity effects are a significant factor in protease specificity⁴². Due to the combinatorial explosion of permutations when holding any given amino acid constant, it is necessary to assess orders of magnitude more cleaved substrates than has currently been done. Some of the most complex analyses performed so far have identified a handful of these effects, requiring hundreds of identified cleavage sequences to identify one or two cooperativity effects³⁰, ³⁵, ³⁸, ⁴³.

Second, fully sampling the potential input complexity requires unique technology. The published peptide screens typically test fewer than 10,000 peptides.
Phage display is the most successful current method for screening large libraries at a maximum of ~10⁷ peptides tested in a single experiment, yet this only corresponds roughly to the complexity of a hexapeptide (20 possible amino acids at each of 6 positions = 20⁶ = 6.4 × 10⁷). Many proteases have recognition motifs comprising more than 6 contiguous amino acids.


This requires longer peptide substrates to adequately sample this specificity, and thus exponentially more input substrates to screen.

Third, many of the most successful methods achieved higher complexities by using the proteome as the source of their libraries. While this is suited for identifying specific single cleavage sites in the proteome, it also heavily biases the peptide sequence diversity that is presented to the protease. Housekeeping proteins, for example, will be heavily over-represented, and the amino acid sequences sampled will be only a tiny fraction of the possible sequence space. The specificity maps generated from these methods must always be considered in the context of the proteome that was sampled, as opposed to a map of the natural specificity of the protease. This is a significant challenge if the resulting map is to be applied for identifying prototype proteases for engineering efforts.

Fourth, published protocols regularly need to modify or protect amino acids in the substrates prior to digestion by the protease of interest. Free lysines and cysteines are protected in almost all of the PICS-based approaches; in the MSP-MS approach, cysteine is absent and methionine is replaced by norleucine. Further, several of the methods require digestion with a starter protease to prepare short peptide substrates. Most PICS-based methods digest with trypsin or GluC to prepare multiple libraries, which are then characterized and compared after the analysis. This creates a strong bias by removing certain amino acid pairs from presentation to the protease.

Finally, display-based methods require cleavage to occur at specific locations to yield useful hits. For example, in phage display, cleavage must occur within the randomized peptide target rather than at a conserved location such as an affinity tag or linker region²⁹. If a conserved cleavage occurred, all of the library would be cleaved, eliminating the usefulness of the screen.
Similarly, CLiPS is an in vivo display method that depends on cleavage occurring on the cell surface within the desired region. If the protease cleaves the cell at a sensitive location that eliminates its viability, that cell would also be removed from the screen.

A useful case study that demonstrated the fallibility of existing specificity analysis methods, especially the limited peptide-based assessments, is the characterization of staphopains A, B, and C⁴⁴. In this study, three related staphopains were characterized by multiple methods sequentially: a set of synthetic peptidomimetic substrates, multiple positional scanning libraries, and finally a CLiPS library. As the scanning methods became progressively broader, each specificity map noticeably disagreed with that of the previous method. This revealed the inadequacy of limited library assessments of proteases, which can rarely consider subsite-cooperativity effects and can easily become trapped at a local optimum of activity.
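The library-size argument made in this section can be checked with a short calculation. The sketch below assumes only the 20 canonical amino acids and the approximate upper limits quoted in the text for phage display (~10⁷) and mRNA display (~10¹³). The function names are illustrative, and the calculation ignores sampling redundancy, so it gives an optimistic bound rather than a measured coverage.

```python
AMINO_ACIDS = 20  # canonical amino acids, as in the hexapeptide estimate

def sequence_space(length):
    """Number of distinct peptides of the given length."""
    return AMINO_ACIDS ** length

def max_fully_sampled_length(capacity):
    """Longest peptide whose complete sequence space fits within capacity."""
    length = 0
    while sequence_space(length + 1) <= capacity:
        length += 1
    return length

# Approximate upper limits quoted in the text (assumed, not measured).
PHAGE_DISPLAY = 10**7
MRNA_DISPLAY = 10**13

for n in range(4, 11):
    print(f"{n} residues: {sequence_space(n):.2e} sequences")
print("phage display:", max_fully_sampled_length(PHAGE_DISPLAY), "residues")
print("mRNA display:", max_fully_sampled_length(MRNA_DISPLAY), "residues")
```

Each additional randomized position multiplies the required library size by 20, which is why a gain of several orders of magnitude in library size extends the fully covered peptide length by only a few residues.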

1.4 Current methods for engineering protease specificity

Many enzymes other than proteases have been successfully engineered to improve a variety of properties, including stability and reactivity⁴⁵. However, the engineering of protease specificity has seen fewer successes thus far. Initial efforts at repurposing protease specificity concentrated on exchanging the amino acids responsible for specificity between extremely similar proteases. In a classic protease specificity modification effort, trypsin was modified to achieve chymotrypsin-like specificity through mutation of the surface loops⁴⁶. These two proteases have a very similar fold, and modifying a limited number of residues enabled a complete switch in specificity.

Engineering completely novel specificities has been a much more challenging endeavor. Recent efforts have principally combined a rational choice of the regions to randomize with directed evolution. In this approach, the structure-guided design of large libraries of mutant variants is combined with high-throughput screening technologies into a “directed evolution” approach. This method has successfully modified the substrate specificity of proteases, although the specificity changes have been limited to single subsites.

1.4.1 Specificity engineering of protease OmpT by cell-surface display

An excellent example of the use of directed evolution to alter the active site specificity of a protease is the single-subsite engineering of OmpT, an E. coli outer membrane protease. OmpT variants were produced through error-prone PCR, then displayed on the surface of individual E. coli cells. Fluorescence-quenched substrates were added such that cleavage by OmpT released a fluorescent product, which then adhered to the charged cell surface, allowing the isolation of desired mutants by fluorescence-activated cell sorting (FACS)⁴⁷, ⁴⁸. The specificity of OmpT was altered from arginine to glutamic acid, tyrosine, or threonine at a single subsite by 4, 7, or 9 active site mutations, respectively, while maintaining kcat/KM values above that of the wild-type protease⁴⁸. Thus, remarkable changes in specificity can be achieved with directed evolution while maintaining high efficiencies.

1.4.2 TEV protease specificity engineering by YESS

Yeast Endoplasmic Reticulum Sequestration Screening (YESS) successfully engineered novel specificity through an intracellular sequestration step. YESS was applied to engineer TEV protease for modified specificity⁴⁹. A C-terminal endoplasmic reticulum (ER) retention sequence was engineered onto the Aga2 peptide, which normally displays on the cell surface. In addition, protease variants were co-localized to the ER. Proteases capable of cleaving the retention sequence enabled Aga2 display, fluorescent labeling, and subsequent FACS enrichment. This was applied to sort up to 5 × 10⁷ variants and select for changes in specificity at a single subsite. A particular advantage was the inclusion of a counter-selection region in the design. The counter-selection region contained a peptide sequence that, if cleaved, would remove a critical transport marker. This allowed the scheme both to select for a desired specificity and to select against an unwanted specificity. Applying this method, specificity could be refined to a 5,000-fold and 1,100-fold change in selectivity for two different specificities at a single subsite.

1.4.3 Computational design of α-gliadin peptidase

The only example of engineering novel specificity through computational design and low-throughput screening was performed on the protease kumamolisin-As from the acidophilic bacterium Alicyclobacillus sendaiensis⁵⁰. Computational design was used to create 261 variants, which were screened individually to identify the best variant. This variant showed a 120-fold improved kcat/KM compared to the wild-type protease for the proline-glutamine (PQ) motif, which is very difficult to proteolyze. This work showed that we are nearing the ability to engineer novel specificities for single subsites by computation alone. Yet, as with most enzyme engineering, refining highly active proteolysis still requires incorporating directed evolution.

1.4.4 Limitations of current protease specificity engineering efforts

All successful engineering efforts have altered protease specificity at only a single subsite. To modify protease specificity more substantially, more efficient techniques will most likely be required. One option to increase the chances of finding multiple subsite specificity changes is to drastically increase the diversity sampled at a time. Each of the above methods screened on the order of 10⁷ variants at most. Finding variants with simultaneously altered specificity in multiple subsites will be exponentially less likely. Therefore, selecting for altered specificity at multiple subsites simultaneously will likely require screening exponentially more sequence space.

Another drawback of the most successful methods for modifying specificity is the requirement of an in vivo step to select for activity. This is a major challenge, since any protease variant capable of cleaving a critical cellular protein will be toxic to the host, removing itself from the selection. As described by the Innovation-Amplification-Divergence theory for the evolution of novel enzyme functions⁵¹, proteases will likely transition through a broad-specificity variant before narrowing to a refined specificity for a new substrate. This substantially increases the risk that an intermediate variant may be toxic to the cell.


1.5 Advances in the directed evolution of proteins

Similar to my efforts in creating technical advances in the screening of natural protease specificity and the selection of novel protease specificity, there have been many recent advances in the general methods of directed evolution of proteins. This is a field constantly in flux, as techniques enabling new types of selections, advances in methodologies allowing much more fluid or rigorous selections, and combinations of technologies reaching new plateaus of function are constantly being published. I wrote a review article on this topic, published in Current Opinion in Chemical Biology, in collaboration with Dr. Burckhard Seelig. We contributed equally in writing and reviewing this work. The following is a reprint of the published article.

1.5.1 Overview

Natural evolution has produced a great diversity of proteins that can be harnessed for numerous applications in biotechnology and pharmaceutical science. Commonly, specific applications require proteins to be tailored by protein engineering. Directed evolution is a type of protein engineering that yields proteins with the desired properties under well-defined conditions and in a practical time frame. While directed evolution has been employed for decades, recent creative developments enable the generation of proteins with previously inaccessible properties. Novel selection strategies, faster techniques, the inclusion of unnatural amino acids or modifications, and the symbiosis of rational design approaches and directed evolution continue to advance protein engineering.

1.5.2 Introduction

Synthetic biology describes the engineering of biological parts and whole systems by either modifying natural organisms or building new biosystems from scratch. To date, most proteins used as parts in synthetic biology are taken from nature. Utilizing naturally evolved proteins has led to numerous successful applications in biotechnology. Nevertheless, these applications invariably benefit from an optimization of the original natural proteins by protein engineering⁴⁵. In contrast, building entirely artificial proteins that do not resemble natural proteins is still a major challenge⁵²⁻⁵⁴ and therefore much less common than the engineering of natural proteins for new or improved properties.

Protein engineering has developed into a multi-faceted field with hundreds of publications in the last two years alone. This field encompasses a variety of approaches for creating desired protein properties, ranging from purely computational design to selecting proteins from entirely random polypeptide libraries. Due to the incredible breadth of the field, and to enable us to focus on recent advances, we direct the reader to excellent reviews on the fundamentals of directed evolution technologies⁵⁵⁻⁵⁹ and computational protein design⁶⁰⁻⁶². This review will therefore focus on the latest developments in the directed evolution of proteins (Figure 1.4).

Figure 1.4: Overview of directed evolution.

1.5.3 Advancing selection technologies

In any directed evolution experiment, the isolation of the desired protein from a library of gene variants is the crucial step. Many efforts have been made to push the boundaries of evolution schemes, attempting to create better protein libraries, new selection systems with improved features, and faster selection procedures (Figure 1.5).


Figure 1.5: Overview of recent technical advances.


Maximizing library quality

The chance of discovering desired protein variants is directly related to the quality and complexity of the starting library. For example, random mutations that destabilize a protein can be detrimental. Therefore, building libraries with a high potential of containing functional proteins is vital. ‘Smarter’ libraries have been pursued that are less complex but of high quality⁶³. To build those libraries, targeted mutagenesis guided by structural or phylogenetic information, the use of compensatory stabilizing mutations, and other approaches have successfully been applied⁶⁴, ⁶⁵. Alternatively, a library maximizing complexity while enriching for well-folded proteins was constructed based on one of nature’s most common enzyme folds, the (β/α)₈ barrel fold. All residues on the catalytic face of the protein scaffold were randomized and, simultaneously, the library was enriched for protease resistance by an mRNA display selection; protease resistance has been correlated with well-folded and therefore more likely functional proteins⁶⁶.

Refining selection steps

Directed evolution of membrane proteins has been challenging because either the target membrane protein or the selection conditions are toxic to the host cell in in vivo methods. Liposome display is a new method that has enabled in vitro directed evolution of toxic integral membrane proteins⁶⁷. This approach creates giant unilamellar liposomes that each encapsulate a single DNA molecule along with a cell-free translation system. Each liposome displays many copies of a single variant; liposomes containing functional membrane proteins can subsequently be identified by fluorescence and sorted by fluorescence-activated cell sorting (FACS). This approach was applied to evolve an α-hemolysin mutant with pore-forming activity 30-fold greater than wild-type.

An alternative approach aimed to select for activity of membrane-bound proteins in conditions that are hazardous for lipid barriers. G protein-coupled receptors (GPCRs) are an important group of drug targets and have recently been evolved for increased expression⁶⁸. However, finding mutants of these integral membrane proteins that are stable in detergents remained challenging. Rather than screening variants individually as had been done previously, a cellular high-throughput encapsulation, solubilization, and screening method (CHESS) was developed to screen an entire library of GPCR variants⁶⁹. A library of 10⁸ variants was expressed in E. coli and the cells were then encapsulated in a polymer. The cells were lysed, but the “nano-container” trapped the GPCR variants along with their encoding DNA while allowing free diffusion of fluorescent ligands, thereby enabling FACS. With this technique, functional receptors were identified in the presence of the detergent of choice.

The use of bead display for directed evolution has been limited by very few copies of DNA or displayed protein⁷⁰⁻⁷⁴. Recently, a “megavalent” bead surface display (BeSD) system was developed to allow the display of protein and its encoding DNA in defined quantities up to a million copies per bead⁷⁵. This method combines advantages of in vitro selections with the multivalency of in vivo display systems, enabling the ranking and sorting of the output variants of an in vitro selection by flow cytometry.

Protease enzymes have tremendous potential in medicine and biotechnology, but engineering their activities via directed evolution for altered specificity, instead of simply broadening activity, had until recently been successful in only a few select cases using E. coli cell surface display of the E. coli outer membrane protease T (OmpT)². This system is limited to the relatively few bacterial proteases that can be displayed and are active on the prokaryotic cell surface. To enable the engineering of more complex mammalian proteases, yeast surface display was modified to evolve novel protease specificity. In the revised system, both the protease variants and a yeast adhesion receptor were colocalized inside the endoplasmic reticulum (ER) through attached signal sequences⁴⁹.
Successful proteolytic cleavage of a linker region detached the ER retention signal and enabled the yeast surface display of the adhesion receptor including its FLAG tag, which was then detected by anti-FLAG antibodies. Counter-selection tags were also incorporated to improve the selectivity of resulting protease variants. This method was used to alter the specificity of tobacco etch virus protease, as well as granzyme K and hepatitis C virus protease, and was even modified to demonstrate in principle the selection of kinase activity.

A directed evolution approach was devised to improve the targeting specificity of an engineered methyltransferase. Methylation of only a single site in a target DNA was selected for by digesting with a target site-specific restriction endonuclease and a second, unusual restriction enzyme that digests DNA with two distally methylated sites⁷⁶. This method identified methyltransferase variants that showed 80% methylation at the target site and less than 1% methylation at off-target sites.

Phage-assisted continuous evolution (PACE) enables the sustained evolution of protein variants through hundreds of rounds of evolution in a week with little researcher intervention⁷⁷. This method was used to probe evolutionary pathway independence by evolving RNA polymerases for various promoter specificities⁷⁸. RNA polymerases that initially recognized the T7 promoter were evolved to recognize T3 or SP6 promoters separately, and then a final hybrid promoter of T3 and SP6. The resulting RNA polymerases from the SP6 pathway were ~3-4-fold more active than those from the T3 pathway, and further evolution did not diminish this gap. Sequencing at multiple steps along the evolutionary path further revealed that the divergent populations were unable to converge to the same solution. This suggests that it may be beneficial to evolve through multiple subpopulations instead of a single large population. In additional work, PACE was improved to allow the modulation of selection stringency by engineering phage propagation to be dependent on the small molecule anhydrotetracycline⁷⁹. Further, the authors enabled counter-selection to refine promoter specificity. The combination of these methods was used to create RNA polymerases with a 10,000-fold net change in promoter specificity. While this method now enables fast selections with advanced features such as counter-selection, it still can only be used to evolve proteins or activities that directly or indirectly involve expression, such as polymerase activity.
The use of chaperonins such as GroEL and GroES during directed evolution has been shown to allow more destabilizing mutations and mutations in the protein core to survive during evolution by stabilizing folding intermediates⁸⁰, ⁸¹. This chaperonin system was used to characterize the evolutionary pathway from a natural phosphotriesterase to a novel arylesterase⁸². This study demonstrated for the first time on a molecular level how mutations found early during an evolutionary optimization yield larger improvements than later mutations. The results suggest that mutations initially cluster near the active site and then radiate towards the rest of the enzyme to stabilize the early mutations.

Improving speed of selection

Two different strategies have significantly expedited the directed evolution process for in vitro selections. In the first strategy, many rounds of selection were performed very quickly, followed by sequencing and characterization of relatively few output sequences. For this purpose, a modified version of mRNA display, named “TRAP display”, was devised in which the puromycin linker was attached simply via base pairing instead of covalent modification, enabling a round of selection in as little as 2.5 hours compared to the traditional 2-3 days⁸³. In just 14 hours and 6 rounds of selection, macrocyclic peptides with low nanomolar affinity against human serum albumin were selected.

In the second strategy, only a single round of stringent selection was performed, but then a large number of selected clones were analyzed by high-throughput sequencing to enable population-level statistical analysis. Following this approach, nanomolar affinity binders were identified after a single round of selection from the small protein scaffold ¹⁰Fn3 (10th fibronectin type III domain) with two random sequence loops, using the continuous flow magnetic separation mRNA display technology⁸⁴. The key to this approach was identifying the clones with the most enriched copy numbers after selection. In another example of the same strategy, multiple rounds of phage display biopanning of a heptapeptide library were performed against target cells, but high-throughput sequencing was performed at each step to assess the value of each round⁸⁵. Overall, a single round of screening was capable of identifying the best binders when sequenced at sufficient depth, and multiple rounds were only helpful in decreasing the background of non-binders when sequencing a small pool.
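The idea of ranking clones by enriched copy number after a single round can be sketched as follows. This is an illustrative calculation rather than the analysis used in the cited studies: the function name, the pseudocount, and the toy reads are assumptions introduced here. Enrichment is taken as the frequency of a sequence among the reads after selection divided by its pseudocount-smoothed frequency before selection.

```python
from collections import Counter

def rank_by_enrichment(before_reads, after_reads, pseudocount=1):
    """Rank sequences by read frequency after selection relative to before.

    The pseudocount is an assumption of this sketch; it keeps the ratio
    finite for sequences that were never observed in the input pool.
    """
    before, after = Counter(before_reads), Counter(after_reads)
    n_before, n_after = len(before_reads), len(after_reads)
    scores = {}
    for seq, count in after.items():
        f_after = count / n_after
        f_before = (before[seq] + pseudocount) / (n_before + pseudocount)
        scores[seq] = f_after / f_before
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Toy reads for illustration only.
before = ["AAA", "BBB", "CCC", "DDD"]
after = ["AAA", "AAA", "AAA", "BBB"]
for seq, score in rank_by_enrichment(before, after):
    print(seq, round(score, 3))
```

With real data, sequencing depth limits how reliably low-abundance clones can be ranked, which is consistent with the observation above that a single round suffices only when the output is sequenced deeply.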


1.5.4 Expanding the scope of selections to new properties

Photoresponsive binders

Biological systems engineered to use light-sensing components have attracted attention due to their vast potential applications and have been recently reviewed⁸⁶. Similar efforts have developed photo-reactive peptide aptamers for future use with in vitro or in vivo photoregulation, immunoassays, or bio-imaging analyses. Ribosome display of peptides that contained azobenzene-modified lysine enabled the selection of UV-responsive streptavidin binders⁸⁷. A different approach used a ribosome display scheme with a benzoxadiazole-modified phenylalanine to select for calmodulin-binding peptides with single-digit micromolar affinity that fluoresce upon binding⁸⁸. Two laboratories created cyclized peptides via azobenzene linkers to select for light-responsive peptides by phage display technology. A peptide cyclized via azobenzene-modified cysteines flanking a randomized 7-residue peptide bound to streptavidin in the dark and ceased binding upon irradiation, with a 22-fold discrimination⁸⁹. Likewise, a synthesized photoswitchable azobenzene-based cyclization compound enabled the identification of peptide binders with single-digit micromolar affinity and a 3-fold change in affinity upon UV exposure⁹⁰. Photoresponsive peptide ligands can now be selected with large UV-induced binding affinity changes. In the future, in vivo activity of evolved ligands needs to be demonstrated to further their use in optogenetics.

Selecting unnatural peptide binders

Binding peptides are valuable for a variety of purposes including detection assays and therapeutic applications. mRNA display and cDNA display have been used recently to create short peptide aptamers with unusual structure or composition, creating new diversity and thereby enabling the selection of binders with unique properties. In one approach, up to 12 unnatural amino acids were incorporated into an mRNA-displayed peptide library using a custom-mixed cell-free PURE translation system that was reconstituted from the purified components necessary for E. coli translation. The peptides were then cyclized via cysteine residues and the selection yielded unnatural peptides with nanomolar binding affinities for thrombin⁹¹. A related approach created mRNA-displayed lantipeptides by using a translation system where lysine was substituted with 4-selenalysine, and inducing post-translational elimination via H₂O₂ to yield dehydroalanine. This provided an alternative cyclization mechanism for a drug candidate library and yielded binders with low micromolar affinity for Sortase A⁹². Similarly, a cDNA display library was cyclized⁹³ and a phage display library was bi-cyclized⁹⁴ through disulfide bonds formed in cysteine-rich peptides and used to select for peptide aptamers. Additional work on macrocyclic peptide selections has been recently reviewed⁹⁵.

Unnatural amino acids have also been used to decorate peptides and evolve multivalent glycopeptides⁹⁶. Alkyne-containing glycine residues were incorporated into an mRNA-displayed peptide library to enable glycosylation via click chemistry. A selection against a broadly neutralizing HIV antibody identified glycopeptides with potential as vaccine candidates.

Multi-subunit protein selections

Multi-subunit proteins are common, but directed evolution of these has been limited to in vivo methods, as opposed to in vitro methods that are capable of screening much larger libraries. To address this issue, an mRNA-displayed Fab fragment was entrapped in emulsion PCR, enabling the in vitro selection of heterodimeric Fab fragments⁹⁷. However, the emulsion step reduced the library complexity to a similar range as in vivo methods. To achieve an in vitro selection of multi-subunit proteins of potentially up to 10¹⁴ variants, ribosome display was performed in which one Fab subunit was randomized while the other subunit was held constant⁹⁸. This was carried out with both heavy chain and light chain libraries, yielding tight binders to VEGF and CEA. While cell-surface display of multi-subunit proteins has been performed before, a notable advance is the description of a mammalian cell-surface display that also features a titratable secretion of the same Fab fragments through alternate splicing of the pre-mRNA⁹⁹.


In vitro compartmentalization selections applied to new reactions

The performance of in vitro compartmentalization methods has been improved in recent years with reported screening speeds of 2,000 droplets per second for water-in-oil emulsion screening100. A number of creative protocols have been developed for this technology and applied to evolve enzymes capable of a range of chemical reactions. This includes a generalizable screen for hydrogenase activity101, a selection for meganuclease specificity102, an entirely microfluidic screen for hydrolytic activity of a sulfatase103, and a quantitative screen for glucose oxidase activity104. While the above methods predominantly used custom-made microfluidic chips to sort their water-in-oil emulsions, another study described a generalizable method to produce water-in-oil-in-water emulsions that can be sorted by standard fluorescence-activated cell sorting (FACS) equipment105. This protocol enabled the generation of monodisperse double emulsions at 6,000-12,000 droplets per second, which can be stored for months to years and manipulated to adjust their volume as needed. These droplets can be sorted in a commercial FACS machine at 10,000-15,000 droplets per second, while enriching active variants by up to 100,000-fold. The downside to this approach is that fluorescence must be linked to product formation.
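To put the quoted sorting rates in perspective, the following minimal Python sketch estimates how long it would take to pass a library through a sorter at a constant droplet rate. The library size of 10^8 droplets and the function name are illustrative assumptions and are not taken from the cited studies; only the rates come from the text above.

```python
def sorting_time_hours(n_droplets: float, droplets_per_second: float) -> float:
    """Hours needed to sort n_droplets at a constant sorting rate."""
    if droplets_per_second <= 0:
        raise ValueError("sorting rate must be positive")
    return n_droplets / droplets_per_second / 3600.0


# Hypothetical library size, assumed for illustration only.
LIBRARY_SIZE = 1e8

# Rates quoted in the text (droplets per second).
RATES = [
    ("microfluidic chip sorting, 2,000/s", 2_000),
    ("FACS of double emulsions, 10,000/s", 10_000),
    ("FACS of double emulsions, 15,000/s", 15_000),
]

for label, rate in RATES:
    print(f"{label}: {sorting_time_hours(LIBRARY_SIZE, rate):.1f} h")
```

Under these assumptions, the sorting time scales inversely with the droplet rate, so the faster FACS-based rates shorten a screen of fixed size proportionally.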

Combining directed evolution with rational design

The combination of rational design to create informed libraries of variants with directed evolution to refine activity and efficiency has become a recurrent theme in protein engineering. A good example is the optimization of a computationally designed Kemp eliminase by directed evolution. Through rounds of error-prone PCR, DNA shuffling, and site-directed mutagenesis, this artificial protein was refined to yield an enzyme that accelerated the reaction 6 × 10^8-fold, approaching the efficiency of natural enzymes61. In another study evolving an unrelated artificial Kemp eliminase, stabilizing consensus mutations were added during the library generation process. This stabilization facilitated the identification of a variant with >2,000-fold improved catalytic efficiency after rounds of DNA shuffling, error-prone PCR and selection106. Furthermore, an artificial Diels-Alderase was evolved by combining mutations from different rational design variants with rounds of error-prone PCR. The modestly active original enzyme was thereby turned into a proficient biocatalyst for this abiological [4+2] cycloaddition reaction107. Cytochrome P450-derived enzymes can perform a variety of reactions and have been engineered to improve multiple properties108. A collection of cytochrome P450 mutants was screened for cyclopropanation of styrenes, and optimized through informed site-directed mutagenesis109. This enzymatic activity has not been observed in nature but is very useful to synthetic chemists. The work presents a great example of how catalytic promiscuity of enzymes can be exploited. Using chemical intuition, the active site of one of those cytochrome P450s was rationally re-designed to change the reduction potential of the heme-bound FeII/III, which allowed the efficient NAD(P)H-driven cyclopropanation while suppressing the native monooxygenation activity110. The modification therefore enabled the use of the P450 variant as a whole-cell catalyst. In another case, highly regio- and stereoselective hydroxylation of unactivated C-H bonds by a cytochrome P450 enzyme was achieved through a creative combination of active site mutagenesis, high-throughput “fingerprinting” to identify functionally diverse variants, and fingerprint-driven reactivity predictions111. A mononuclear zinc metalloenzyme was computationally redesigned and then evolved with a combination of saturation mutagenesis and error-prone PCR to create a variant that was 2,500-fold better than the initial design112. Structure-guided design of libraries of paraoxonase 1 and directed evolution led to variants capable of up to 340-fold higher catalytic efficiency for toxic isomers of G-agents113. Similarly, iterative saturation mutagenesis was used to evolve variants of phosphotriesterase capable of hydrolyzing V-type nerve agents with a 230-fold improvement of catalytic efficiency114.
In a separate effort to create a toxin-neutralizing protein, a unique serine hydrolase was computationally designed and subsequently optimized by yeast display. The resulting protein reacted with a fluorophosphonate probe at rates comparable to natural serine hydrolases, yet it was incapable of catalytic turnover115. Rational design has also been used to construct whole artificial protein scaffolds. In some cases, directed evolution was applied to these artificial structures to select for desired functions. Proteins from a combinatorial library of artificial four-helix bundle proteins were found to function in vivo by rescuing E. coli strains that lacked a conditionally essential gene116. In addition, select four-helix bundle proteins bound to heme and exhibited peroxidase activity. This activity was improved by random mutagenesis and directed evolution117. Using structural principles of natural repeat proteins, designed ankyrin repeat proteins (DARPins) have been built and shown to function as artificial antibody mimetics. Recently, a flexibility loop and additional randomized regions were incorporated, creating the LoopDARPin scaffold118. A library based on this improved scaffold design yielded picomolar binding proteins after only a single round of selection by ribosome display.

1.5.5 Conclusion

With the investment of sufficient resources and determination, directed evolution generally appears to yield desired improvements of protein properties, sometimes producing remarkable results119. Therefore, one might be tempted to consider protein engineering a mature field. But those success stories mainly apply to improving or changing proteins that were provided by natural evolution. In contrast, the generation of novel activities without natural precedent is still in its infancy, although several examples have been reported40, 53, 56, 109, 120, 121. Achieving the synthetic biology goal of integrating artificial proteins into biological systems will introduce additional challenges, which again can be overcome with the help of directed evolution58, 122.


1.6 Outlook

Proteases are only beginning to be developed as therapeutics. This is a promising field with an enormous range of possible applications. The earliest applications have focused on improving general properties of proteases, such as solubility or serum half-life, yet avoided modifying their specificity. However, the myriad of natural cellular functions of proteases means there are countless possible targets for therapeutic intervention. Many protease targets are intermediary proteins involved in cascades of function, indicating that there are numerous checks and balances that could be subtly modified by proteases with refined specificities. This novel approach to therapeutic intervention has not been pursued yet because technologies for tailoring proteases remain limiting. Our laboratory’s long-term goal of developing therapeutic proteases combines the relatively matured field of protease specificity screening to identify prototypes with an emerging field of protease specificity engineering to create valuable new activities. Initially, repurposing existing specificity towards new targets will be the goal in engineering therapeutic proteases. The most difficult step of this process will likely be modifying the specificity; therefore, it will be much easier to begin with a protease with natural specificity that is similar to the intended therapeutic target. In this way, re-inventing the protease for every application can be avoided by choosing well-characterized prototypes. Unfortunately, existing methodologies still cannot fully probe the specificity of proteases, especially those with broad substrate specificity. With new technologies aimed at improving the depth, accuracy, and applicability of protease specificity maps, it may be possible to identify the best prototypes for engineering efforts to create therapeutic activity. In particular, broad-specificity proteases are likely a powerful starting point for engineering novel specificities.
Therefore, this is a field with a dire need for more powerful methods. After a promising protease candidate is identified, it will be necessary to engineer refined specificity for the new target. Current efforts have successfully modified subsite preference only for a single subsite47-50. However, this will need to be expanded to engineer multiple subsite preferences to repurpose existing proteases into therapeutically relevant activities. Further, unwanted cross-reactivity with human proteins will need to be removed, narrowing the specificity to eliminate likely side-effects. This has been performed with a counter-selection via a directed evolution strategy once, demonstrating that refining specificity is viable49. Identifying the best prototypes is necessary for current engineering efforts. In the future, as methodologies advance, it may be possible to engineer more significant specificity changes than at present, perhaps through major advances in computational design. The example described above (Section 1.4.3) demonstrates that computational design is just beginning to be able to design activity50. It may be possible in the future to achieve multiple-subsite specificity modifications through computational design of a manageable library and screening by other directed evolution means. In this case, other properties of the protease may be more important than having a natural specificity that closely aligns with the therapeutic target sequence. For example, understanding the exact residues involved in each subsite preference, exosite binding regions, or ability to unfold protein targets may be more important than the natural specificity. Subsequent refining towards temperature and pH tolerance, stability, or reduced antigenicity could then be performed with classic protein engineering. While custom-designed protease therapeutics are still a long way off, the building blocks and the technologies needed to achieve this feat exist today. Through unique combinations of existing methods, it will be possible to make the first examples of this new class of therapeutics. My thesis aimed to lay the groundwork towards achieving these goals by creating robust technologies for the screening and engineering of proteases.


Chapter 2: Highly efficient recombinant production and purification of streptococcal cysteine protease streptopain with increased enzymatic activity

The overall goal of my thesis was to create the tools needed to create proteases with novel specificities. To this end, we chose streptopain as our prototype protease (described in Chapter 4) and developed a method for characterizing its specificity (described in Chapter 3). However, we needed to obtain very active samples of streptopain to be used in these endeavors. The previously published methods for recombinantly producing streptopain were ineffective in our hands. Therefore, we set out to develop a rigorous method for producing large quantities of very pure streptopain that are extremely active. This task turned into a longer ordeal than expected, and the experiments culminated in a publication describing the challenges and achievements of these efforts. The following is a reprint of the article describing this journey: Lane, M. D. & Seelig, B. (2016) Highly efficient recombinant production and purification of streptococcal cysteine protease streptopain with increased enzymatic activity. Protein Expr & Purif 121, 66-72. The article is reprinted here with permission from Elsevier. Dr. Seelig and I designed the experiments and analyzed the data. I performed all experiments.

Hyperlink to original publication: http://www.sciencedirect.com/science/article/pii/S104659281630002X

2.1 Overview

Streptococcus pyogenes produces the cysteine protease streptopain (SpeB) as a critical virulence factor for pathogenesis. Despite having first been described seventy years ago, this protease still holds mysteries which are being investigated today. Streptopain can cleave a wide range of human proteins, including immunoglobulins, the complement activation system, chemokines, and structural proteins. Due to the broad activity of streptopain, it has been challenging to elucidate the functional results of its action and precise mechanisms for its contribution to S. pyogenes pathogenesis. To better study streptopain, several expression and purification schemes have been developed. These methods originally involved isolation from S. pyogenes culture but were more recently expanded to include recombinant E. coli expression systems. While substantially easier to implement, the latter recombinant approach can prove challenging to reproduce, often resulting in mostly insoluble protein and poor purification yields. After extensive optimization of a wide range of expression and purification conditions, we applied the autoinduction method of protein expression and developed a two-step column purification scheme that reliably produces large amounts of purified soluble and highly active streptopain. This method reproducibly yielded 3 mg of streptopain from 50 mL of expression culture at >95% purity, with an activity of 5,306 ± 315 U/mg, and no remaining affinity tags or artifacts from recombinant expression. This improved method therefore enables the facile production of the important virulence factor streptopain at higher yields, with no purification scars that might bias functional studies, and with an 8.1-fold increased enzymatic activity compared to previously described procedures.

2.2 Introduction

Streptococcus pyogenes is a human-specific pathogen responsible for over 500,000 deaths per year globally123. This ubiquitous bacterium commonly causes mild infections of the upper respiratory tract and skin. However, severe infections of the skin, blood stream, and soft tissues are possible and are frequently life-threatening. Additionally, recurrent infections can lead to a variety of autoimmune diseases including acute rheumatic fever, rheumatic heart disease, acute poststreptococcal glomerulonephritis, and possibly pediatric autoimmune neuropsychiatric disorders associated with streptococcal infections (PANDAS)124. S. pyogenes produces several virulence factors responsible for its infectivity, including secreted toxic superantigens and proteases125. Streptopain is a cysteine protease secreted by S. pyogenes that is critical for full host infectivity due to its ability to cleave host proteins (plasminogen, fibrinogen), antimicrobial peptides, and antibodies124. This protease is also known as SpeB (streptococcal pyrogenic exotoxin B) because it was initially believed to have superantigenic activity. However, this originally detected activity was found to be caused by contamination or co-purification with superantigens and therefore it was concluded that streptopain does not have superantigenic activity126. Despite first being isolated and characterized in the 1940s127, the detailed mechanism of streptopain’s proven role in bacterial pathogenesis is still poorly understood128. Streptopain frequently produces enigmatic results based on the proteins it is known to cleave. For example, its activity seems to both inhibit and activate systems such as inflammation, complement, immunoglobulin defense, as well as cleave numerous proteins produced by S. pyogenes128. It is these seemingly contradictory activities that continue to make streptopain a relevant and challenging research target today. Classically, streptopain was isolated from S. pyogenes culture supernatant by a variety of chromatography techniques129-133. These yielded purified protein because the bacteria secrete streptopain to act on the extracellular matrix. Recombinant production of streptopain variants in Escherichia coli was pursued for convenient exploration of point mutations134. These protein variants were purified by combinations of ion exchange chromatography135-139, dye-ligand chromatography135, 137, size-exclusion chromatography136, 137, 139, or Ni2+-chelating chromatography134, 138, 140-142. In S. pyogenes, streptopain is initially expressed as a 40 kDa zymogen. Maturation is caused by cleavage of the 138 N-terminal amino acids, resulting in a 28 kDa active protease143. This cleavage can be performed by mature streptopain or by exogenous proteases144.
Most previously published recombinant purifications yielded the zymogen, which was subsequently activated by incubation with mature streptopain134, 140-142, although some cases of streptopain self-activation during expression and purification were also reported139, 140. Our efforts at replicating recombinant streptopain expression and purification methods in E. coli repeatedly met challenges and did not achieve high yields, purity, or activity. Specifically, we frequently were able to express large quantities of streptopain, but the protein remained in the insoluble fraction. Accordingly, we trialed a variety of expression and purification strategies to identify an improved method of purification. Here we report our most successful expression system and purification method whereby we obtained the highest reported yield (3 mg / 50 mL) and activity (5,306 ± 315 U/mg by azocasein assay) of a highly purified (>95% by SDS-PAGE) activated streptopain. Our approach has the added benefit of fully maturing the protease with no remaining affinity tags that might bias its activity or structure in subsequent experiments.

2.3 Materials and methods

2.3.1 Materials

The streptopain-containing plasmid pUMN701 was generously donated by Dr. Patrick Schlievert. All primers were synthesized by the University of Minnesota Genomics Center. The restriction enzymes BamHI-HF and XhoI, the T4 DNA ligase, and BSA were purchased from New England Biolabs. BL21(DE3) cells, kanamycin, acetonitrile, and standard phosphate buffered saline were purchased from VWR. LB medium, tryptone, yeast extract, glycerol, and glucose were purchased from Fisher Scientific. The SP Sepharose FF resin and SP HP HiTrap column were products from GE Healthcare Life Sciences. All other reagents were purchased from Sigma-Aldrich.

2.3.2 Overexpression of streptopain by autoinduction in E. coli

The coding sequence of the streptopain zymogen was PCR-amplified from the pUMN701 plasmid with PCR primers “SpeB_BamHI_FW” and “SpeB_XhoI_RV” (Supplementary Table S1), which added an N-terminal His6 tag. The PCR product was cloned into the pET24a plasmid using restriction digestion with BamHI-HF / XhoI and ligation with T4 DNA Ligase. The resulting pET24a construct was sequenced to confirm insertion of the correctly oriented full coding sequence of streptopain (amino acids 1-398) with an N-terminal His6 tag. The plasmid was transformed into BL21(DE3) E. coli cells. A culture of 150 mL of standard LB medium and 36 µg/mL kanamycin was inoculated with a single colony and grown overnight at 37°C. 400 µL of the overnight culture was used to inoculate 200 mL of auto-induction media (50 mM Na2HPO4, 50 mM KH2PO4, 2 mM MgSO4, 36 µg/mL kanamycin, 2% tryptone, 0.5% yeast extract, 0.5% NaCl, 60% glycerol, 10% glucose, 8% lactose, w/v). Cultures were grown at 37°C for ~5-6 hours until OD600 ~0.3-0.4 and then transferred to 25°C for an additional 24 hours. Cells were divided into 50 mL aliquots, collected by centrifugation, and frozen at -80°C.

2.3.3 Purification of streptopain by affinity chromatography

A frozen cell pellet from 50 mL of culture medium was thawed at 4°C and resuspended in 5 mL of lysis buffer (20 mM sodium acetate, 50 mM NaCl, 1 mM HgCl2, pH 5.0). Mercury(II) chloride was added to all steps to a final concentration of 1 mM to reversibly inhibit the activity of streptopain. HgCl2 is a hazardous material, so standard precautions should be taken for all materials containing HgCl2, and contaminated waste should be disposed of properly. The suspension was lysed by sonication and the lysate was centrifuged to separate the insoluble fraction. The soluble fraction was run over 4 mL of SP Sepharose FF resin equilibrated with lysis buffer. The column was washed with 5 × 4 mL of lysis buffer. Streptopain was eluted with 5 aliquots of 4 mL elution buffer (20 mM sodium acetate, 100 mM NaCl, 1 mM HgCl2, pH 5.0) and contained about 5 mg of protein in fractions 2-5. These 4 fractions were combined, diluted with 20 mM sodium acetate, 25 mM NaCl, 1 mM HgCl2, pH 5.0 buffer to reach a final NaCl concentration of 50 mM, and concentrated with an Amicon Ultra centrifugal filter (3 kDa MWCO) to a final volume of 5.5 mL. This solution was applied to an SP HP HiTrap column equilibrated with 20 mM sodium acetate (pH 5.0), 50 mM NaCl, 1 mM HgCl2 and eluted with a gradient of NaCl from 50 mM to 200 mM via FPLC. Fractions of 1.5 mL were collected and protein concentrations were determined by measuring absorbance at 280 nm using a Nanodrop spectrophotometer.
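The dilution step between the two columns can be checked with a simple mass balance. The sketch below is illustrative only: it assumes that the pooled eluate is exactly 4 fractions × 4 mL = 16 mL at the 100 mM NaCl of the elution buffer, that volumes are additive, and that no liquid is lost. The function name is hypothetical.

```python
def diluent_volume_ml(v_sample_ml: float, c_sample_mm: float,
                      c_diluent_mm: float, c_target_mm: float) -> float:
    """Volume of diluent needed to bring a sample to a target concentration.

    Solves c_sample*V_s + c_diluent*V_d = c_target*(V_s + V_d) for V_d.
    """
    low, high = sorted((c_sample_mm, c_diluent_mm))
    if not low < c_target_mm < high:
        raise ValueError("target must lie strictly between sample and diluent")
    return v_sample_ml * (c_sample_mm - c_target_mm) / (c_target_mm - c_diluent_mm)


# Assumed pooled eluate: fractions 2-5, 4 mL each, at 100 mM NaCl.
pooled_ml = 4 * 4.0
added_ml = diluent_volume_ml(pooled_ml, c_sample_mm=100.0,
                             c_diluent_mm=25.0, c_target_mm=50.0)
total_ml = pooled_ml + added_ml

print(f"add {added_ml:.0f} mL of 25 mM NaCl buffer, total {total_ml:.0f} mL")
print(f"concentration factor to reach 5.5 mL: {total_ml / 5.5:.1f}x")
```

Under these assumptions, two volumes of the 25 mM NaCl buffer are needed per volume of eluate, which explains why the pooled sample had to be concentrated before loading onto the second column.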

2.3.4 Confirmation of streptopain protein identity by mass spectrometry

Cleared supernatant from an E. coli autoinduction expression of streptopain was separated on a 4-12% SDS-PAGE. The overexpressed band at 28 kDa was cut out, solubilized, digested with trypsin, concentrated in a SpeedVac vacuum concentrator, desalted by the Stage Tip procedure145, and dried in a SpeedVac. Tryptic peptides were rehydrated in water/acetonitrile (ACN)/formic acid (FA) 98:2:0.1 and loaded using a Paradigm AS1 autosampler system (Michrom Bioresources, Inc., Auburn, CA). Each sample was subjected to Paradigm Platinum Peptide Nanotrap (Michrom Bioresources, Inc.) pre-column (0.15×50 mm, 400 μL volume) followed by an analytical capillary column (100 μm×12 cm) packed with C18 resin (5 μm, 200 Å MagicC18AG, Michrom Bioresources, Inc.) at a flow rate of 320 nL/minute. Peptides were fractionated on a 60 minute (5-35% ACN) gradient on a flow MS4 flow splitter (Michrom Bioresources, Inc.). Mass spectrometry (MS) was performed on an LTQ (Thermo Electron Corp., San Jose, CA). Ionized peptides eluting from the capillary column were subjected to an ionizing voltage (1.9 kV) and selected for MS/MS using a data-dependent procedure alternating between an MS scan followed by five MS/MS scans for the five most abundant precursor ions.

2.3.5 Proteolytic activity measured by azocasein assay

Purified streptopain was assayed with azocasein substrate to determine proteolytic activity146 by measuring absorbance at 366 nm over time. Undigested azocasein is precipitated in the protocol, while digested azocasein is freed to absorb at 366 nm. 20 µL of protease was added to 160 µL of a reaction mixture containing 2.7 mg/mL azocasein, 5 mM DTT, 5 mM EDTA in standard phosphate buffered saline. Time points between 5 s and 1 h were quenched with 40 µL of ice-cold 15% trichloroacetic acid. Absorbance at 366 nm was measured using a Nanodrop spectrophotometer. As previously described140, one enzyme unit is defined as the amount of protease capable of releasing 1 µg of soluble azopeptides per minute using the reported specific absorption coefficient $A^{1\%}_{366} = 40\ \mathrm{M^{-1}\,cm^{-1}}$.
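The unit definition above can be turned into a short calculation. The following Python sketch is one possible reading, not the exact procedure used here. It assumes that the coefficient of 40 is the absorbance of a 1% (w/v, i.e. 10 mg/mL) azopeptide solution at a 1 cm path length, that absorbance readings are normalized to 1 cm, that the measured solution has the full quenched volume of 220 µL (20 + 160 + 40 µL), and that the absorbance increase is linear over the chosen interval. All input values in the example are hypothetical.

```python
A1PCT_366 = 40.0            # absorbance of a 1% (w/v) solution, 1 cm path (from the text)
QUENCHED_VOLUME_ML = 0.220  # 20 uL protease + 160 uL reaction mix + 40 uL TCA


def azopeptide_ug(a366: float, volume_ml: float = QUENCHED_VOLUME_ML) -> float:
    """Mass of soluble azopeptides (ug) corresponding to an absorbance at 366 nm."""
    percent_wv = a366 / A1PCT_366   # g per 100 mL
    mg_per_ml = percent_wv * 10.0   # 1% (w/v) equals 10 mg/mL
    return mg_per_ml * 1000.0 * volume_ml


def specific_activity_u_per_mg(delta_a366: float, minutes: float,
                               enzyme_mg: float) -> float:
    """Specific activity in U/mg, where 1 U releases 1 ug azopeptides per minute."""
    if minutes <= 0 or enzyme_mg <= 0:
        raise ValueError("time and enzyme mass must be positive")
    units = azopeptide_ug(delta_a366) / minutes
    return units / enzyme_mg


# Hypothetical reading: absorbance rises by 0.10 in 10 min with 0.1 ug of enzyme.
print(f"{azopeptide_ug(0.10):.2f} ug azopeptides released")
print(f"{specific_activity_u_per_mg(0.10, 10.0, 1e-4):.0f} U/mg")
```

In practice, the slope would be taken from the linear portion of the time course rather than from a single pair of time points.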


2.3.6 Assay of streptopain cleavage specificity using peptide substrates

A positive control peptide cleavage substrate (FAAIK↓AGARY) and a negative control peptide cleavage substrate (FAAGKAGARY) were synthesized by Chi Scientific to >90% purity. 1 nmol of each peptide was incubated with 10 pmol of purified active streptopain in 100 µL of 50 mM sodium phosphate (pH 7.0), 2 mM DTT, and 1 mM EDTA for 1 hour at 37°C. Streptopain was separated by filtering with a Nanosep 10k MWCO filter. The flow-through was further purified by the Stage Tips procedure145 using MCX extraction disks, a mixed-mode cation exchange sorbent. Disks were punctured and the small round piece of sorbent was placed in a 200 µL pipet tip and washed with 45 µL of 100% methanol. Peptide samples were acidified to pH 3 with formic acid, loaded into the tip, washed with 45 µL of 0.1% trifluoroacetic acid, washed with 45 µL of 100% methanol, and eluted with 50 µL of 5% ammonium hydroxide in methanol. The elution was dried in a SpeedVac vacuum concentrator and resuspended in 50 µL of 98% water, 2% acetonitrile, and 0.1% formic acid. Approximately 0.1 µg of peptide was loaded on the capillary LC column and analyzed by LC-MS/MS on the Orbitrap Velos system as described previously147, except for the following modifications to the MS acquisition parameters: the MS1 scan was 250-1200 m/z; the MS2 first mass value was 116; dynamic exclusion time duration was 15 seconds; the lock mass value was 371.1012 m/z; the LC gradient was 0-38% B in 28 minutes, then increased to 80% B at 28.1 minutes, held at 80% B for 4 minutes, then the system was re-equilibrated at 2% B for 8 minutes. Peaks were manually inspected and peptide fragment ions were assigned.

2.4 Results

2.4.1 Testing multiple strategies to improve soluble protein expression

Our first efforts at recombinant expression of streptopain were to clone streptopain into a pET24a vector to render streptopain IPTG-inducible in an E. coli strain. These initial efforts at recombinant expression and purification of streptopain failed to produce significant quantities of soluble protein. Nearly all of the overexpressed protein was in the insoluble fraction after sonication. To optimize soluble protein expression, we tested 3 E. coli strains (BL21(DE3), BL21(DE3)pLysS, BL21(DE3)Rosetta), 2 chaperone systems (pTF16, pGRO7), 5 expression temperatures (16°C, 25°C, 28°C, 30°C, 37°C), 4 induction concentrations of IPTG (0, 0.1, 0.5, 1.0 mM), 2 different culture media (LB and minimal media), 7 expression time points (0, 4, 8, 12, 16, 24, 48 hours of culture post induction), 4 lysis methods (lysozyme, BugBuster®, French press, sonication), and 3 Ni-NTA purification methods (native, urea-denaturing, guanidine-denaturing). None of these efforts resulted in sufficient yields of soluble streptopain.

2.4.2 Overexpression of mature streptopain by autoinduction

We then performed an expression using the autoinduction method148. In this scheme, minimal media was supplemented with glycerol, glucose, and lactose, but IPTG was not used to induce expression. Over a longer period of time (24 hours), protein was expressed to a lower intracellular concentration compared to an IPTG-induced expression due to the weaker induction by lactose, but the cells grew to a higher density. Using this protocol, we obtained approximately 10 mg of soluble protease from a 50 mL culture (Figure 2.1A). However, the autoinduction approach introduced a new challenge: the ~40 kDa protease zymogen was cleaved into its ~28 kDa active form, thereby removing its N-terminal His6 tag that was to be used in the purification of the inactive zymogen. This cleavage occurred despite the attempt to inhibit streptopain from self-activation by the inclusion of 1 mM HgCl2 in all lysis steps. In contrast, when we performed the purification in the absence of HgCl2, we observed much more extensive proteolysis of all proteins present in the lysate, resulting in many products <15 kDa in the lysate, cleared lysate, flow-through and wash, and the resulting purified streptopain had much higher concentrations of streptopain fragments at ~12 and ~26 kDa, likely due to uninhibited activity of streptopain (Supplementary Figure S1). We verified the identity of the 28 kDa protein product by isolating the respective band from an SDS-PAGE separation, digesting the protein with trypsin, and analyzing it by mass spectrometry. The ProteinPilot search engine was used to match proteins to the peptide spectra of trypsin-digested streptopain. 259 unique streptopain peptides were identified (Supplementary Tables S2 and S3). These matched with >95% confidence to the predicted 253 amino acid sequence of the activated 28 kDa streptopain sequence according to the ProteinPilot Paragon scoring algorithm (Figure 2.2). Peptides from the N-terminal cleaved region were also identified, and are likely present in low concentrations as residual cleavage fragments. These data confirmed that the principal soluble product was activated streptopain.

Figure 2.1: SDS-PAGE gel of the expression and purification of mature streptopain. (A) Autoinduction lysate and cleared lysate. Arrow indicates mature streptopain migrating at ~28 kDa as expected. (B) SP Sepharose FF purification with flow-through (FT), washes 1-3 (W1-3), and elutions 1-5 (E1-5). Arrows indicate mature streptopain or streptopain degradation fragments. (C) SP HP HiTrap purification by FPLC with elutions 37, 38, and 39 and the combined elutions 37-39 more than one year after initial purification.


Figure 2.2: Amino acid sequence of streptopain protein translated from pET24a vector. The first 20 amino acids indicated in light gray originate from the pET24a plasmid and are an N-terminal addition to the native streptopain zymogen. The ↓ indicates the beginning of the mature active streptopain sequence. Font styles indicate confidence of peptide matching from analyzing the 28 kDa gel band by mass spectrometry: bold underline uppercase is >95% confidence, lowercase is <50% confidence, italicized lowercase is no match.

2.4.3 Purification of streptopain

Due to the loss of the N-terminal His6 tag through zymogen cleavage during the expression and lysis process, a method for purifying streptopain without the affinity tag was required. We tested a variety of column purifications, including multiple sequential orthogonal purifications. Overall, we tested weak anion exchange resin (DEAE), strong cation exchange resins (SP Sepharose FF by gravity flow, or HiTrap SP HP by FPLC), and size exclusion resins (G-25 by gravity flow, or G-75 by FPLC). We found that streptopain did not bind tightly to DEAE resin, but that a G-25 column to perform a crude cleared lysate clean-up followed by an SP Sepharose FF resin could purify the protein (data not shown). However, we achieved higher yields when we instead used SP Sepharose FF gravity flow resin to directly purify the cleared lysate without a prior G-25 column. We could reproducibly obtain ~90% purity streptopain from this single-step purification (Figure 2.1B), and the impurities were largely degradation products of streptopain itself (described below). This purity may be adequate for a variety of applications. We lost a significant quantity of streptopain in W1, which is simply a rinse with lysis buffer. This indicates that streptopain is only weakly binding to the column at 50 mM NaCl. In some attempts, we re-used this W1 on a fresh column and obtained additional purified streptopain (data not shown), indicating that it may be possible to maximize yield further at the potential cost of purity.

To further purify streptopain beyond 90% purity, we tried adding a subsequent G-75 size-exclusion column step, and found that we were unable to separate the 26 kDa impurity (shown in Figure 2.1B, lane E5) from the 28 kDa desired product. In addition, the amount of streptopain degradation products (presumably caused by streptopain cleaving itself) would frequently increase during the size-exclusion column, even in the presence of 1 mM HgCl2 throughout (data not shown). We found that a combination of two successive strong cation exchange resins gave the best results for soluble, pure enzyme with high activity. First, we used an SP Sepharose FF column (strong cation exchange) to isolate streptopain from the crude cleared lysate by gravity flow (Figure 2.1B). This was followed by a HiTrap SP HP column (strong cation exchange) using an FPLC setup (Figure 2.1C). Analyzing the protein by SDS-PAGE, we found our product to be >95% pure in fractions 37, 38, and 39 (Figure 2.1C). The additional minor bands at ~26 kDa and ~12 kDa in the SP Sepharose FF were degradation products of streptopain as confirmed by mass spectrometry, performed as described above for the 28 kDa band (data not shown). The protease is stable in the elution buffer for more than a year at 4°C (Figure 2.1C). Our final yield after the second purification step was 3 mg of streptopain in these 3 fractions, from the original 50 mL culture. That was a 30% greater yield per volume of E. coli culture than the highest previously reported yield140.
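The approximate protein amounts quoted in this chapter (about 10 mg of soluble protease in the cleared lysate, about 5 mg after SP Sepharose FF, and 3 mg after HiTrap SP HP, all from 50 mL of culture) can be summarized as a rough recovery table. The Python sketch below is only a back-of-the-envelope illustration: the inputs are rounded estimates from the text that were obtained by different measurements, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    name: str
    protein_mg: float


# Approximate amounts per 50 mL of culture, taken from the text.
STEPS = [
    Step("cleared lysate", 10.0),
    Step("SP Sepharose FF pool", 5.0),
    Step("HiTrap SP HP fractions 37-39", 3.0),
]
CULTURE_ML = 50.0


def recoveries(steps):
    """Return (name, step recovery, cumulative recovery) for each step."""
    start = steps[0].protein_mg
    previous = start
    rows = []
    for step in steps:
        rows.append((step.name, step.protein_mg / previous, step.protein_mg / start))
        previous = step.protein_mg
    return rows


def yield_mg_per_liter(protein_mg: float, culture_ml: float) -> float:
    """Scale a yield from a given culture volume to mg per liter of culture."""
    return protein_mg / culture_ml * 1000.0


for name, step_rec, total_rec in recoveries(STEPS):
    print(f"{name}: step {step_rec:.0%}, overall {total_rec:.0%}")
print(f"final yield: {yield_mg_per_liter(STEPS[-1].protein_mg, CULTURE_ML):.0f} mg/L")
```

Read this way, roughly half of the soluble protease is lost on the first column, which is consistent with the observation that a significant quantity of streptopain appears in W1.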

2.4.4 Assays to confirm proteolytic activity and substrate specificity

We tested the activity of our purified streptopain protease by the standard azocasein protease activity assay. We found that the activity of our final purified protease in our pooled fractions was 5,306 +/- 315 U/mg (Figure 2.3). This is 8.1-fold more active than the highest value reported to date140. To further confirm that our purified and active protease had a cleavage preference similar to previous reports for streptopain, we designed two test peptides based on published streptopain specificity experiments26. A positive control peptide (FAAIK↓AGARY) and a negative control peptide (FAAGKAGARY) were synthesized such that a cleavage should occur in the center of the positive control sequence as indicated by the arrow, but no cleavage should occur in the negative control peptide. Those two peptides differed only at a single amino acid position (I in place of G at the fourth position). To facilitate the peptide analysis by liquid chromatography-mass spectrometry (LC-MS) by improving separation on the C18 column, the large bulky amino acids F and Y were added to the N- and C-termini of the peptides, respectively.

The two peptides were incubated with or without streptopain, purified, and then analyzed by MS (Figure 2.4). When no protease was added to the incubation, the intact full-length positive control peptide eluted at 25.2 minutes (Figure 2.4A). The tandem mass spectrum of the doubly charged precursor 534.3047 m/z (2.5 ppm mass error), labelled with all fragment ions, can be found in Supplementary Figure S2.2. When streptopain was incubated with the positive control peptide, two products eluted at 21.9 and 25.1 minutes (Figure 2.4C). The 21.9 minute peak is the expected cleavage product FAAIK (doubly charged precursor) (<1 ppm mass error), while the 25.1 minute peak is the methylated version of the same cleavage product (<1 ppm mass error). In contrast, only the uncleaved full-length negative control peptide was detected both without and with streptopain, at 21.9 and 21.3 minutes, respectively (Figure 2.4 B and D). The doubly charged precursor in both of these samples was identified with <1 ppm mass error. The ion-labelled tandem mass spectra of all samples are presented in Supplementary Figures S2.2 to S2.6. In summary, the cleavage specificity of our streptopain preparation is consistent with previous reports26.
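The precursor m/z values quoted for the test peptides can be cross-checked from standard monoisotopic residue masses. The following is a minimal sketch, not the analysis pipeline used in this work: the function names are hypothetical, and the methylated product is assumed to carry a single added CH2 (+14.01565 Da), consistent with a methyl ester.

```python
# Monoisotopic residue masses (Da) for the amino acids in the two test peptides.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "I": 113.08406, "K": 128.09496,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333,
}
WATER = 18.01056         # added once per peptide (free N- and C-termini)
PROTON = 1.007276        # mass of one charge-carrying proton
METHYL_SHIFT = 14.01565  # assumed modification: one added CH2 (methylation)


def precursor_mz(sequence, charge=2, extra_mass=0.0):
    """Theoretical m/z of a protonated peptide ion (hypothetical helper)."""
    neutral = sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + extra_mass
    return (neutral + charge * PROTON) / charge


def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


for name, seq, extra in [
    ("positive control, intact", "FAAIKAGARY", 0.0),
    ("cleavage product", "FAAIK", 0.0),
    ("methylated cleavage product", "FAAIK", METHYL_SHIFT),
    ("negative control, intact", "FAAGKAGARY", 0.0),
]:
    print(f"{name}: [M+2H]2+ = {precursor_mz(seq, 2, extra):.4f} m/z")

# Compare the observed intact positive control precursor with theory.
print(f"{ppm_error(534.3047, precursor_mz('FAAIKAGARY')):.1f} ppm")
```

Under these assumptions, the computed doubly charged values agree with the precursors listed in the Supplementary Figure captions (534.30, 275.17, 282.18, and 506.27 m/z).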

Figure 2.3: Azocasein assays to determine protease activity of purified streptopain. (A) A characteristic azocasein proteolysis curve. (B) Azocasein assay of our purified streptopain (combined fractions E37-39) measured in the linear activity range at early time points to determine protease activity. The measurement was performed in triplicate and standard deviation was calculated between the three measurements.

Figure 2.4: Digestion of test peptide substrates with and without streptopain. LC-MS traces of two test peptide substrates without protease (A and B) and after an incubation with streptopain (C and D). Peptide incubation with streptopain resulted in digestion of the cleavable positive control peptide (FAAIK↓AGARY, C) but no digestion of the non-cleavable negative control peptide (FAAGKAGARY, D). Peaks are labeled with their sequences identified by MS analysis.

2.5 Discussion

At the outset of this study, we attempted to replicate published protocols for the recombinant expression of streptopain. However, we found that, in our hands, recombinant E. coli production of streptopain yielded largely insoluble protein. The reason for this is unclear. Induction of BL21(DE3) cells with IPTG produced large quantities of zymogen, but the protein was exclusively in the insoluble fraction. Varying IPTG concentrations made little difference, resulting in about the same overexpression from induction by 0.1 mM or 1.0 mM IPTG. We were concerned about poor protein folding, since varying IPTG induction concentration did not improve solubility. We therefore varied induction temperature between 16°C and 37°C, but observed no increase in soluble protease. BL21(DE3)pLysS or BL21(DE3) Rosetta strains did not yield soluble protein either. Further, we tried expressing in cells that also contained the pTF16 or pGRO7 plasmids, which produced the TF or GroES/GroEL chaperones, respectively, yet still did not see any improvement in soluble protein expression. We hypothesized that perhaps our lysis method was not adequately disrupting cells, yet cell lysis using French press, sonication, BugBuster reagent, and lysozyme all gave similar results.

Interestingly, we occasionally saw protease activity in our samples that were stored on ice in Laemmli buffer, including lysates and insoluble fractions. In these samples, the ~40 kDa streptopain zymogen activated to its mature 28 kDa protein, and also digested the other proteins present in that sample. This suggested that the protein we produced was capable of folding and activity, yet the expression conditions were still unsuited for soluble expression. Consequently, we attempted denaturing purifications of our successful insoluble expressions. In contrast to 4.5 M urea, 6 M guanidine hydrochloride successfully solubilized streptopain from the insoluble fraction, which was then immobilized on Ni-NTA agarose. The unfolded streptopain could be purified, but renaturation attempts were unsuccessful, including on-column renaturing, dialysis, and step-wise renaturation. These attempts consistently precipitated or degraded the protein, suggesting that fully denaturing streptopain was an irreversible process.

We were finally able to produce large quantities of soluble streptopain using the autoinduction method for E. coli recombinant protein expression.
Our soluble streptopain was obtained as mature enzyme, indicating that at some point during autoinduction or cell lysis, the protease must have become active. If incubated long enough after lysis, the protease digested most proteins in the lysate (data not shown). Yet, the autoinduced bacteria continued to grow for at least 48 hours. These data suggested that the protease did not become active until after lysis. As this approach eliminated the N-terminal His6 tag, we explored a variety of methods that had previously been used to purify streptopain directly from cell culture supernatants of S. pyogenes instead of E. coli. A combination of two successive strong cation exchange resins gave the best results for soluble, pure enzyme with the highest activity.

Cloning and expressing the protease with a C-terminal His6 tag using the autoinduction method, as an alternative to the cleaved N-terminal tag, was not explored. Although C-terminal His6 tags have been used with success to purify streptopain139, we preferred to create a method to produce 100% native streptopain for subsequent downstream applications. A residual C-terminal His6 tag would not satisfy this goal. Furthermore, we did not attempt to directly clone and express the mature protein alone. The mature streptopain is capable of digesting the majority of proteins in the lysate; we therefore assumed that overexpression of the active version of this broad-specificity and robust protease would be incompatible with survival of the E. coli host.

Digesting our positive control peptide (FAAIK↓AGARY) with streptopain yielded the expected cleavage product FAAIK, and an additional C-terminally methylated version of that same FAAIK peptide. In contrast, streptopain digestion of our negative control peptide (FAAGKAGARY) yielded no detectable cleavage products. This result corroborates previous specificity analyses of streptopain26. The methylation product of FAAIK is a common adduct due to LC/MS sample preparation by purifying peptides with mixed-mode cation exchange sorbents and methanol (private communication with a Waters Corporation representative). We detected comparable methylation of other peptides in analogous experiments (data not shown).

Our optimized protocol improved streptopain activity by 8.1-fold and purification yield by 30%, compared to the best previously published protocols. The increase in protease yield is likely attributable to our use of the autoinduction method. This expression method has been employed previously for protease expression, where TEV protease was produced with high yields149. Autoinduction enables cell growth to a higher density, which results in higher protein yields, even though the expression per cell is lower.
In our experience, IPTG induction of streptopain expression consistently led to insoluble protein. This IPTG-induced overexpression may have led to streptopain accumulation in inclusion bodies, which we found to be accessible only by a denaturing purification. Since autoinduction expresses continuously at a lower level, this may have prevented streptopain from aggregating in inclusion bodies.

Furthermore, we can only speculate about the reason for the 8.1-fold increased activity of our purified streptopain over previous reports. Presumably, our autoinduction and dual cation exchange purification scheme yielded a higher fraction of properly folded and active streptopain. Unfortunately, of the few previously described streptopain purifications, even fewer reported any activity data, making it difficult to draw conclusions from our increased activity. Finally, a major strength of our approach is that we obtained fully mature streptopain with no residual affinity tags that could otherwise interfere with the activity or structure of the protease. Our purified protein is identical to wild type streptopain and functions with specificity appropriate for this protease.

2.6 Conclusion

Reliable production and purification methods are critical for continuing research on the challenging streptopain protease. We developed an expression protocol based on the autoinduction protein expression technique which produced large quantities of soluble, mature streptopain. Through applying classic column purification techniques, we found a purification scheme which resulted in highly pure (>95%) streptopain that also gave higher yield (3 mg / 50 mL) of pure enzyme with higher activity (5,306 +/- 315 U/mg) than previously reported in the literature. Further, since the protease cleaves off its N-terminus during the purification steps, the final purified product has no remaining affinity tags, eliminating potential undesired effects on streptopain’s natural activity due to modifications. This work combined techniques from modern and classic protein expression and purification systems into a scheme that is more reliable and reproducible in our hands and yields protein with the highest activity and yield reported. This will be useful for further studies on streptopain, as this robust expression and purification method will enable the facile production of large quantities of active protease.

2.7 Supplementary information

Table S2.1: PCR primers used for cloning of streptopain into pET24a vector.

Primer           Sequence
SpeB_BamHI_FW    5′-GGATCCGGATCCCATCATCATCATCATCATGATCAAAACTTTGCTCGTAACGAA-3′
SpeB_XhoI_RV     5′-GCACCTCGAGCTAAGGTTTGATGCCTACAACAG-3′

Figure S2.1: Impurity of streptopain when purified without HgCl2. Performing an identical purification to the described protocol but without 1 mM HgCl2 resulted in substantially more degradation of proteins from the lysate, yielding fragments of <15 kDa by SDS-PAGE which are labeled in the gel as “Low mass degradation products”.

Figure S2.2: Mass spectrum of the positive control peptide without streptopain. MS/MS of 534.30 m/z precursor.

Figure S2.3: Mass spectrum of the positive control peptide digested by streptopain. MS/MS of 275.17 m/z precursor.

Figure S2.4: Mass spectrum of the positive control peptide digested by streptopain. MS/MS of 282.18 m/z precursor.

Figure S2.5: Mass spectrum of the negative control peptide without streptopain. MS/MS of 506.27 m/z precursor.

Figure S2.6: Mass spectrum of the negative control peptide with streptopain. MS/MS of 506.27 m/z precursor.

Chapter 3: Developing a comprehensive specificity analysis method for promiscuous proteases

The following is a collection of unpublished experiments working towards the goal of developing a method for comprehensive profiling of the specificity of potentially any protease. Dr. Burckhard Seelig and I designed all experiments and analyzed all data. I performed all experiments unless otherwise stated.

3.1 Overview

Understanding the specificity of a protease is critical to understanding its function. Narrow-specificity proteases can generally be well understood with existing specificity-probing methods, but proteases with broad specificity have been challenging to fully elucidate. We here describe the progress we have made towards generating a method for fully characterizing broad-specificity proteases through a powerful combination of technologies: mRNA display selection, Next-Generation Sequencing, and mass spectrometry. The mRNA display selection technology can screen mixtures of >10¹² unique peptides, enabling sampling of all possible permutations of octamer substrates in a single experiment. Next-Generation Sequencing can screen millions of output cleaved sequences, allowing analysis of orders of magnitude more cleaved sequences than has previously been done. Mass spectrometry can identify cleaved locations in the peptides, revealing the exact cleavage sites and subsequently prime and non-prime side specificity. Using this unique combination, we were able to probe the specificity of three different proteases that are either narrow specificity (factor Xa) or broad specificity (ADAM17, streptopain). We performed two rounds of selection, and demonstrated the enrichment of preferred sequences for each protease. The resulting specificity maps yielded valuable insight into the specificity of these proteases.

3.2 Introduction

Proteases perform a vast array of critical functions in the cell. These activities are dependent upon the highly-refined specificity of the protease, including evolved natural cleavage specificities and co-evolved cleavage sites150. Recent efforts have assessed the specificity of broad-specificity proteases using human proteomic screens3, 33, 35, 38, 39, 41, 43, 151 or designed peptide libraries28, as described in Section 1.3. However, these methods have drawbacks that limit their ability to generate reliable specificity maps, as described in Section 1.3.3. We aimed to overcome these limitations by combining mRNA display technology, Next-Generation Sequencing, and LC-MS/MS to increase the number of analyzed cleavage sites by several orders of magnitude (Figure 3.1). With this combination, we are able to screen all possible sequences for an octamer substrate (26 billion: 20⁸ ≈ 2.6 × 10¹⁰) simultaneously for cleavage by a given protease and identify up to millions of cleavage sequences. This leads to an over 1,000-fold increase in the cleavage sequences identified, and subsequently generates improved-resolution specificity maps that will lead to a better understanding of broad-specificity proteases.
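The library-size figure above is simple combinatorics and can be checked directly. The sketch below is illustrative only: the variable and function names are hypothetical, and the 10¹² input is the order of magnitude quoted in the Overview for a single mRNA display experiment, not a measured value.

```python
AMINO_ACIDS = 20     # natural amino acids allowed at each position
OCTAMER_LENGTH = 8   # randomized positions in the substrate

# Number of distinct octamer peptides: 20^8.
library_size = AMINO_ACIDS ** OCTAMER_LENGTH


def fold_coverage(molecules, distinct_sequences):
    """Average number of copies per distinct sequence (hypothetical helper)."""
    return molecules / distinct_sequences


fusions_displayed = 1e12  # assumed order of magnitude for one display experiment

print(f"distinct octamers: {library_size:.2e}")
print(f"average coverage:  {fold_coverage(fusions_displayed, library_size):.1f}-fold")
```

This kind of check makes explicit how much headroom a display experiment has over the theoretical complexity of the library.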

Figure 3.1: General scheme for screening protease specificity by mRNA display, Next-Generation Sequencing, and LC-MS/MS. A library containing nearly all octamer peptide substrates is mRNA displayed, immobilized, and subsequently digested with a protease of interest. The mRNA released from immobilization by protease cleavage of the respective peptide substrate is reverse-transcribed and sequenced by Next-Generation Sequencing to identify the cleaved sequences. The remaining immobilized digested peptides are eluted and identified by LC-MS/MS to identify the cleavage locations. These two data streams are merged to generate a specificity map.
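The merging step in the caption can be sketched in a few lines: an N-terminal fragment from LC-MS/MS is matched against a full octamer from sequencing, which fixes the scissile bond and assigns the P and P′ subsites. This is a simplified illustration rather than the analysis actually used here; the function names are hypothetical, and it assumes the fragment has already been trimmed to the residues of the randomized octamer (the constant framing regions are ignored).

```python
def cleavage_position(octamer, n_terminal_fragment):
    """Number of octamer residues N-terminal to the cleaved bond.

    Returns None when the fragment is not a proper prefix of the octamer,
    i.e. when the two observations cannot describe the same cleavage event.
    """
    n = len(n_terminal_fragment)
    if 0 < n < len(octamer) and octamer.startswith(n_terminal_fragment):
        return n
    return None


def subsites(octamer, cut):
    """Label residues as P1, P2, ... (non-prime) and P1', P2', ... (prime)."""
    labels = {}
    # Non-prime side: count outward from the cleaved bond toward the N-terminus.
    for i, aa in enumerate(reversed(octamer[:cut]), start=1):
        labels[f"P{i}"] = aa
    # Prime side: count outward from the cleaved bond toward the C-terminus.
    for i, aa in enumerate(octamer[cut:], start=1):
        labels[f"P{i}'"] = aa
    return labels


cut = cleavage_position("AAIKAGAR", "AAIK")
print(cut, subsites("AAIKAGAR", cut))
```

For the positive control target AAIKAGAR with fragment AAIK, the sketch places K in P1, I in P2, and A in P1′.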

Our new specificity analysis approach has the following advantages over the existing state-of-the-art methods. (1) We eliminated the bias of amino acid diversity that results from using a proteomic library source. We used a library synthesized with nucleic acid trimers, as opposed to degenerate NNS or NNK codons, which allowed a relatively even distribution of the possible octamer substrates. Knowing the precise distribution of amino acids that were present in the library allows us to accurately assess the fold change of amino acids at each position, including both enrichment and depletion. This is an important advantage over proteomic source libraries, which do not present all possible permutations of amino acids, and therefore have more difficulty determining which amino acids are disfavored in each position. (2) Our library did not require protease processing prior to selection. PICS and some PICS-related methods require pre-processing with trypsin, for example, to prepare short peptide fragments, which leads to obvious biases. Our library evenly presents nearly all possible octamers with conserved frames. (3) We included all natural amino acids with no modifications. Several previous methods omit certain amino acids, such as cysteine or methionine, or modify amino acids, such as protecting free lysines. With mRNA display, we presented nearly all possible octamers as unmodified peptides. (4) By combining LC-MS/MS (to directly identify the N-terminal fragments of our cleavage products) and Next-Generation Sequencing (to identify the full octamer substrate), we assessed both the prime and non-prime cleavage specificity simultaneously. (5) mRNA display is a purely in vitro technology, which can allow for modifications of cleavage conditions to easily assess changes in protease function according to temperature, pH, ionic strength, or inhibitors56. (6) We were able to identify orders of magnitude more preferred cleavage sequences than the previous best methods. 
mRNA display presented nearly every possible octamer sequence and Next-Generation Sequencing created a maximum possible output of millions of sequences. These data will allow the generation of high-resolution specificity maps through bioinformatic processing that will subsequently enable the subsite cooperativity profile analysis.

We chose the octamer length because proteases typically recognize between 2 and 4 residues on either side of the cleaved bond152. By scanning all possible permutations of eight residues, we will comprehensively assess the protease specificity both C- and N-terminal to the cleaved bond. We hypothesized that exhaustive sampling of nearly all amino acid permutations will enable better characterization of the active site specificity and therefore enable better predictions of cleavage sites.

The three proteases characterized in this chapter are factor Xa, ADAM17, and streptopain, as a representative group of narrow- and broad-specificity, well- and poorly-characterized protease controls. Factor Xa is a serine protease and a critical component of the coagulation cascade. It is a narrow-specificity protease that has been characterized by several methods23, 25, 34, 153, 154. A PICS analysis of factor Xa identified 124 cleavage sequences, which confirmed the canonical glycine in P2 and arginine in P1 specificity, but also revealed P1′ and P2′ preference for small amino acids and a handful of alternative amino acids in the canonical P2-P1 positions34. This protease was included in our analysis as a benchmark of a narrow-specificity protease that has been well-characterized by existing methods.

ADAM17 is a zinc-dependent transmembrane metalloprotease that is involved in development, healthy physiology, and pathological mechanisms155. This enzyme is in a category of proteases named ADAM (A Disintegrin And Metalloprotease). ADAM17 is also known as TACE (TNFα (tumor necrosis factor α) Converting Enzyme) due to the early identification of TNFα as a substrate156. This protease has a broad specificity that has been best characterized by a peptide library approach155 and Q-PICS37.
The Q-PICS approach yielded 365 cleavage sequences, which identified that P2-P5 and P2′-P5′ preferred neutral or aliphatic amino acids, while P1′ preferred valine or leucine37. ADAM17 was included in our study as a comparison to the existing best methods for characterizing a broad-specificity protease.

Streptopain is a broad-specificity cysteine protease that has been introduced in Chapter 2. Briefly, it is produced by S. pyogenes and has a large number of known activities128. Streptopain’s specificity was initially characterized by identifying the autocatalytic processing sites that convert the zymogen to active mature streptopain157. The best characterization of streptopain used over 25 quenched fluorescent peptide substrates to probe positions P3-P1 and P1′26. This study revealed that P2 prefers hydrophobic residues and P1 slightly prefers lysine or arginine, but that there was little preference in P3 or P1′. Streptopain was included in our initial set of proteases as an example of applying the method to a poorly-understood broad-specificity protease.

The immobilization strategy used in our selections applied click chemistry to immobilize mRNA displayed peptide fusions to azide-modified agarose. Click chemistry is a bio-orthogonal reaction that has been applied for modifying various biomolecules158. This reaction uses copper as a catalyst for the cycloaddition reaction of azide to alkyne.

3.3 Results

3.3.1 Design and optimization of the octamer peptide substrate construct

Our goal is to present all possible octamer substrates for cleavage, digest with a protease of interest, and detect the output. Through mRNA display, we can generate an affinity tag linked to its encoding mRNA by the random octamer peptide. However, mRNA display requires additional constant regions on each terminus of the peptide sequence that is to be displayed. The starting DNA library must contain a T7 promoter, translation enhancer, and a start codon on the 5′ end. The 3′ end must contain a conserved site for cross-linking, and has traditionally also contained a linker region to improve flexibility and incorporate additional radioactive signal159.

Our first design of a peptide that contained a cleavable target substrate used the standard constant regions that have been used multiple times for other projects in our laboratory66, 159, 160. This includes a methionine as the first amino acid followed by a 6x histidine affinity tag (His6) on the N-terminus, and a conserved MGMSGSG-TGY coding sequence on the C-terminus. The initial design (D1-D2) simply placed the octamer target substrate to be tested between these elements. Table 3.1 summarizes all of the designs relevant to this section. The peptide sequences AAIK*AGAR and AAGKAGAR are positive and negative controls, as described in Section 2.3.6, where the positive control sequence should be cleaved after the K (an asterisk indicates the cleavage site).

Table 3.1: Test substrate construct designs for protease screening by mRNA display. Bolded amino acids highlight significant changes between designs. X represents any amino acid. An asterisk (*) represents the cleavage site for the positive cleavage sequence.

Design name   Start codon   Affinity tag   Frame   Target   Linker
D1   M   HHHHHH   AAIK*AGAR   MGMSGSG-TGY
D2   M   HHHHHH   AAGKAGAR   MGMSGSG-TGY
D3   M   HHHHHH   AAGKAGAR   MPMSPSG-TGY
D4   M   HHHHHH   PQP   AAIK*AGAR   PQPMP-TGY
D5   M   HHHHHH   PQP   AAGKAGAR   PQPMP-TGY
D6   M   HHHHHH   PQPMP-TGY
D7   M   HHHHHH   M   PQP-TGY
D8   M   HHHHHH   PQP   GG   PQPMP-TGY
First and second octamer libraries   M   HHHHHH   PQP   XXXXXXXX   PQPMP-TGY

In a proof-of-principle experiment to test a substrate construct that disfavors digestion outside the randomized octamer sequence, D1-D3 were constructed by overlap extension PCR, then transcribed, translated, and purified by oligo(dT) cellulose as described previously159. The purified mRNA-displayed peptide fusions were incubated with two concentrations of streptopain protease. At the higher protease concentration, both the positive (D1) and negative (D2) control sequences were digested to a similar extent (80-91%), indicating over-digestion (Figure 3.2A). However, at a 10-fold lower concentration of protease, there was a significant decrease in digestion of the negative peptide compared to the positive peptide (11% vs. 82%), showing that there was a preference for the positive sequence over the negative. In addition, I tested in parallel an alternative conserved C-terminal linker that also contained the negative design peptide (D3). This alternative negative control design incorporated prolines to discourage cleavage outside the desired target, because prolines are frequently more difficult for proteases to digest. Under the same conditions, D3 was digested less than D2 at the higher protease concentration (71% digestion), indicating that we could further decrease the undesired cleavage outside our desired motif by modifying the composition of the constant framing sequences to discourage their cleavage.

Figure 3.2: Streptopain digest of octamer substrate constructs D1 to D5 in solution. Fusions displaying positive (“Pos”) or negative (“Neg” or “Neg2”) peptide sequences are digested with streptopain. (A) Designs D1-3 or (B) designs D4-5 are digested with increasing concentrations of streptopain. Note that radioactive gels are intentionally shown over-exposed to better visualize faint bands.

To further reduce cleavage outside of the intended substrate region, we modified the design of the substrate construct by adding a proline-glutamine-proline (PQP) sequence to frame both sides of the cleavage motif and modified the C-terminus such that the PQP coding sequence will be part of the cross-linking region required for mRNA display. PQP was chosen as a frame because this motif is generally difficult for proteases to cleave. The designs D4-D5 (Table 3.1) used this modified motif along with the same positive and negative cleavage control sequences described above. The respective DNA constructs were synthesized, translated into mRNA-displayed peptides, oligo(dT) cellulose purified, and then incubated with streptopain at the same conditions as above (Figure 3.2B). The negative peptide fusion (D5) cleavage was undetectable (~0%), while the positive peptide fusion (D4) was almost completely digested (88%) at the lower protease concentration. This indicated that framing the peptide target sequence with PQP enabled selection for preferred over non-preferred peptide sequence digestions.

Surprisingly, at the highest protease concentration, the negative peptide fusion was still substantially digested (70%). This cleavage of D5 despite the PQP-framed motif indicated continued cleavage within the constant region. Therefore, we prepared minimal designs D6-8 that did not contain any cleavage peptide target to determine if cleavage still occurred in a construct that now only consisted of the constant framing regions. Design D6 just contained a methionine to start, followed by a His6 affinity tag and the conserved updated PQPMP frame that is necessary for mRNA display. D7 is similar to D6, except its PQPMP linker was simplified to PQP. D8 is similar to D4-5, but displays only two glycines in the target region, a small motif that is unlikely to be preferred. mRNA-displayed D6-8 fusions were incubated with streptopain, similar to the above conditions. All of these motifs exhibited significant cleavage (41-54%), therefore indicating continued cleavage in a conserved region (Figure 3.3A).

Figure 3.3: Octamer substrate fusions are less susceptible to proteolysis when immobilized to Ni-NTA. Fusions were incubated with streptopain either in solution (A) or while immobilized to Ni-NTA resin (B). Control designs D6-8 contained no peptide target, while controls D4-5 contain positive or negative peptide target respectively. Numbered products are the same as in Figure 3.2. When fusions were digested while immobilized to Ni-NTA resin, they were incubated for 6-times longer but with the same concentration of protease (2 μg/mL) as in solution. “OdT” is oligo(dT) purified fusions. “Buff” is the no-protease control. “SpeB” is the streptopain digestion. “RT” is the reverse transcription of the D4 flow-through (FT) sample.

The overall peptide display and cleavage selection strategy required immobilization of fusions by the His6 affinity tag, then incubation with streptopain to digest off preferred peptide sequences. While cleavage does occur within the conserved framing region when fusions are in solution, it was possible that this may not occur when the target is immobilized. For example, if the cleavage occurred within the His6 affinity tag, this region may not be accessible for cleavage if it was immobilized to the Ni-NTA resin. Therefore, we attempted immobilizing the fusions to Ni-NTA agarose resin, then incubating with the same concentration of streptopain (2 μg/mL) but for 6-times longer than the previous solution digestion. Digested fragments were collected (flow-through: FT), then fusions or digestion products were eluted with imidazole. When immobilized to Ni-NTA, no digestion of the design controls D6-8 was visible (Figure 3.3B upper). In summary, a substantial degree of undesired cleavage of the negative control designs (D6-8) was observed in solution, yet prior immobilization to Ni-NTA agarose solved this issue such that the same negative control fusions were stable against cleavage even during extended protease incubations.

3.3.2 mRNA display of proof of principle with His6-based immobilization

To confirm that the entire selection scheme was working appropriately, the above designs D4-D5 displaying the positive and negative peptide sequence were immobilized on Ni-NTA and digested with streptopain at identical conditions to D6-8. The flow-through was collected and remaining fusions were eluted with imidazole. Finally, reverse transcription (RT) was performed on the flow-through sample from D4 to generate a PCR-amplifiable cDNA strand. Analysis by gel electrophoresis showed that 87% of the positive control peptide fusions (D4) were digested, while substantially less of the negative peptide fusions (D5) were cleaved (only 10%) (Figure 3.3B). Interestingly, both the cleaved fusions (2) and even the cleaved-off peptides (3) could be recovered and visualized by SDS-PAGE. The successful reverse transcription of the fusion is visible as a band with lower electrophoretic mobility in the gel. The reverse transcription product was PCR-amplified, and agarose gel electrophoresis confirmed that it was the same length as the initial template, as expected. This experiment confirmed that the entire selection scheme was working.

3.3.3 Optimization of His6-based immobilization of octamer library

The first octamer library was synthesized as an oligonucleotide with degenerate NNS codons to incorporate the random diversity in the octamer region. This oligonucleotide was PCR-amplified using primers to create the full-length construct (Table S3.1), which was agarose gel purified. This dsDNA template was transcribed and cross-linked to puromycin-oligo159. At all steps, the number of copies of each unique sequence was kept >200, thus ensuring a >99.9% chance that any given octamer is present. The library was mRNA displayed, immobilized, and digested with streptopain for different amounts of time. The conditions were similar to the digests of D4-5 above, except that 3-fold more streptopain was used. The first octamer library was digested to 27% after 5 minutes and 71% after 60 minutes (Figure 3.4A). This indicates most of the library was accessible and available for digestion.
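The link between average copy number and the chance that a given sequence is represented can be illustrated with a simple sampling model. The sketch below assumes that copy numbers are Poisson-distributed around the mean; this is an illustrative assumption rather than a model stated in this work, and the function names are hypothetical.

```python
import math


def probability_present(mean_copies):
    """P(at least one copy) when copy number is Poisson with the given mean."""
    return 1.0 - math.exp(-mean_copies)


def copies_needed(target_probability):
    """Mean copy number required to reach a target presence probability."""
    return -math.log(1.0 - target_probability)


print(f"mean copies for 99.9% presence: {copies_needed(0.999):.1f}")
print(f"presence probability at 200 copies: {probability_present(200):.6f}")
```

Under this assumption, a mean of about seven copies would already give a 99.9% chance of presence, so maintaining more than 200 copies at every step leaves a wide margin for losses and uneven representation.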

Figure 3.4: Protease digestions and reverse transcription of early octamer libraries. Streptopain (A) and factor Xa (B) digestion of Ni-NTA immobilized octamer libraries resulted in FT samples with clearly digested fusions. Released peptides are appropriate for the amount of digestion. Reverse transcription optimization (C) identified reverse transcriptase concentration as the dominant factor in improving yields. Primer 1 is Spec RT (PQPMP) V6 and is 30 nucleotides long. Primer 2 is Spec RT (PQPMP) V9 and is 15 nucleotides long (Table S3.2).


During the subsequent PCR, it was found that one primer could also bind internally, in the second conserved PQP region. Therefore, the sequence was re-designed using silent mutations to avoid this problem (Table S3.3). The modified library was constructed as before using overlap PCR (Table S3.1), again at a scale such that the potential complexity of all octamers would be oversampled more than 200-fold at every step. mRNA display can present ~10^12 fusions for digestion, which oversamples the maximum complexity of all possible octamers by ~40-fold. Next-Generation Sequencing can obtain ~20 million reads. For a narrow-specificity protease, an estimated 0.1% or less of the library may be cleavable. In this case, the percentage of fusions released from immobilization by digestion could be as high as 4%. However, we would prefer a much lower digestion to enable release of highly preferred sequences before less preferred sequences, which will likely happen at lower levels of overall digestion. Therefore, we would ideally aim for 1% or less digestion for any given protease.

First digestion experiments of the octamer library revealed that, even after washing of immobilized fusions, incubation with the factor Xa digestion buffer alone, without the protease, already released ~2-3% of immobilized material as uncleaved fusions, which were clearly identified by gel electrophoresis (Figure 3.4B). In contrast, there were no detectable fusions in the similarly treated sample with streptopain digestion buffer. Note that the digestion buffers for the two conditions were different. During this experiment, the streptopain digestion buffer was 20 mM Tris (pH 8.0), 150 mM NaCl, 0.1 mM DTT, and 0.1 mM EDTA. The factor Xa digestion buffer was 20 mM Tris (pH 8.0), 150 mM NaCl, 1 mM CaCl2. The change in background elution was attributed to the difference in buffers with respect to 0.1 mM DTT and 0.1 mM EDTA for streptopain versus 1 mM CaCl2 for factor Xa. Fusions being released in the digestion buffer alone will increase background noise and obscure the protease-specific release of fusions. The amount of background release is significant because a background of just 1% corresponds to over 10^10 fusions, which is roughly 500-fold greater than our Next-Generation Sequencing capacity. For our target of 1% or less digestion with the protease, the background release of fusions from buffer alone could dominate the sample. Therefore, it was necessary to further minimize this background release.

One of our first strategies to reduce background was to wash with more stringent buffers. Initially, we had planned to reverse-transcribe the fusions after cleavage, but we decided to perform this step before immobilization in order to increase the stability of the fusions and thereby allow for harsher washing conditions. Because the reverse transcription would now be carried out on oligo(dT)-purified fusions, it was necessary to optimize the protocol, since a significant quantity of cross-linking oligo would remain in the sample. Therefore, we optimized the length of the reverse transcription primer, in addition to the concentrations of primer and reverse transcriptase in the reaction, and found that the amount of enzyme was the dominant factor in improving cDNA yield. Increasing the enzyme concentration 4-fold to 2 U/μL in the reaction achieved >85% cDNA formation (Figure 3.4C).

Following these alterations to the protocol, we hoped to decrease background noise due to leaching of poorly immobilized fusions. Increasing the wash volume to >225 column volumes of factor Xa buffer did not decrease signal below the desired minimum of 0.1%. This particular sample was then digested with factor Xa, but the buffer-only control again released ~3% of immobilized signal. The extensive wash therefore did not improve background signal under the digestion conditions. We also explored the correlation between rates of leaching and digestion buffer composition. For this experiment, we immobilized fusions to Ni-NTA resin and then washed with the various buffers used in the digestions. The digestion buffers continuously leached ~0.2-3% of remaining immobilized signal (Figure 3.5).
Certain components significantly increased this leaching, including EDTA, DTT, and E-64 (an irreversible inhibitor of cysteine proteases, which we decided not to use in the protocol). Furthermore, we found in multiple experiments that there was a time effect: the signal released correlated positively with the length of time that the bead-immobilized fusions were tumbled. We also discovered that replacing Tris buffer with sodium phosphate buffer successfully decreased background; unfortunately, sodium phosphate and CaCl2 form a calcium phosphate precipitate. Because CaCl2 is required for factor Xa activity, phosphate buffer could not be used in factor Xa digestions.
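The oversampling and background estimates quoted earlier in this section follow from simple arithmetic. A minimal check, using only the figures given in the text (~10^12 displayed fusions, ~20 million sequencing reads, a 1% background) and illustrative variable names:

```python
import math

AMINO_ACIDS = 20
OCTAMER_LENGTH = 8

# Maximum peptide complexity of a fully random octamer.
octamer_variants = AMINO_ACIDS ** OCTAMER_LENGTH   # 2.56e10

fusions_displayed = 1e12      # approximate number of fusions presented
ngs_reads = 20e6              # approximate sequencing capacity
background_fraction = 0.01    # 1% buffer-only release

# Fold oversampling of all possible octamers by the displayed fusions.
oversampling = fusions_displayed / octamer_variants

# Number of background fusions and their excess over sequencing capacity.
background_fusions = background_fraction * fusions_displayed
background_vs_reads = background_fusions / ngs_reads

print(f"possible octamers:       {octamer_variants:.2e}")
print(f"fold oversampling:       {oversampling:.1f}")
print(f"background fusions:      {background_fusions:.1e}")
print(f"background / NGS reads:  {background_vs_reads:.0f}")
```

The calculation reproduces the ~40-fold oversampling and the ~500-fold excess of background fusions over sequencing reads, which is why even a 1% buffer-only release could dominate the sequenced sample.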

Figure 3.5: Fusions leaching off Ni-NTA immobilization by digestion buffer washes without protease. Fusions immobilized to Ni-NTA resin were washed with a variety of digestion buffers. All buffers contained 150 mM NaCl and 0.1% Triton X-100. Resin was washed with 24 column volumes of each condition. Additives were 20 mM Tris (pH 8.0), 50 mM HEPES-KOH (pH 7.4), 1 mM CaCl2, 0.1 mM DTT, 0.1 mM EDTA, and/or 28 μM E-64 (an irreversible cysteine protease inhibitor).

Next, we explored multiple alternative methods for using the His6 tag to enrich for cleaved products by binding to Ni-NTA agarose and collecting the flow-through. We tried using a second Ni-NTA column following digestion to remove uncleaved fusions, pre-blocking the Ni-NTA agarose resin with bovine serum albumin (BSA), and three separate anti-His monoclonal antibody-based agarose resins. None of these methods solved the problem. Finally, we tested gel purification of the cleaved fusions. Because the cleaved and uncleaved fusions have slightly different electrophoretic mobilities in SDS-PAGE, we could theoretically enrich for the cleaved product by cutting out the respective band and purifying the fusions by the crush and soak method [161]. This was attempted, yet a subsequent SDS-PAGE of the purified sample showed that the cleaved band had not been significantly enriched. Overall, after testing multiple methods for refining the Ni-NTA agarose immobilization strategy, we were unable to reduce background signal below 2-3%. Therefore, we began exploring alternative immobilization strategies.

3.3.4 N-terminal labeling of fusions via NHS ester for subsequent immobilization

We explored using N-hydroxysuccinimide (NHS) with the aim of immobilizing fusions via chemical modification of their primary amine at the N-terminus. Our protocol required incubation of fusions with NHS-PEG3-azide and subsequent click chemistry to immobilize them to an alkyne-modified agarose. As NHS preferentially reacts with N-terminal amines over lysines at pH 6.5 [162], the reactions were performed under these conditions. Pilot experiments with purified fusions were carried out under multiple conditions, yet failed to immobilize more than 5% of fusions compared to controls. This yield was inadequate for our goals, so we abandoned this strategy.

3.3.5 Incorporation of non-natural amino acid to improve immobilization

We developed a method for covalent immobilization of our octamer library by incorporating a non-natural amino acid in our fusions to enable click chemistry, allowing very stringent washes to rigorously remove any non-covalently immobilized fusions (Figure 3.6A). We used an orthogonal non-natural amino acid tRNA synthetase and tRNA pair to incorporate an alkyne group specifically at amber stop codons (UAG), which could subsequently be immobilized by click chemistry (Figure 3.6B). This strategy has been used previously to incorporate a wide range of non-natural amino acids [163], and has been used with mRNA display with a eukaryotic amber suppressor tRNA [164].


Figure 3.6: Overview of click immobilization and screening for protease specificity. (A) The strategy used for mRNA display, immobilization, and cleavage of fusions. The extensive stringent washes (*) are described in Section 3.3.7. “SpeB” is streptopain. “FXa” is factor Xa. (B) Click chemistry immobilization. (C) Two strategies for incorporation of para-propargyloxy-phenylalanine (pPaF). (D) The design and mechanism of production of pPaF-specific tRNA.


In order to accomplish this immobilization with high efficiency, we altered how we generate mRNA-protein fusions. First, we switched to a specialized prokaryotic in vitro translation method based on the E. coli Protein synthesis Using Recombinant Elements (PURE) system [165]. The PURE system allows custom transcriptions and translations through its highly defined composition: ribosomes, initiation factors, elongation factors, release factors, and aminoacyl-tRNA synthetases, in addition to tRNA, nucleotides, amino acids, etc. This system was necessary because our protocol required the omission of release factor 1 (RF1). RF1 is specific for the amber stop codon, while the two remaining release factors do not recognize the amber stop codon. When RF1 is omitted, the amber stop codon can be repurposed as an additional codon to specifically incorporate a non-natural amino acid [166]. We used the NEB PURExpress system in all pPaF experiments. Second, we modified the translation enhancer regions in our construct design so that they would be compatible with the prokaryotic PURE translation system (Table S3.3). Third, we added a 6-base-pair region immediately after the start codon of our library, encoding a glycine and an amber (UAG) stop codon. The designs of the control substrates and the new non-natural amino acid (nnAA) octamer library are shown in Table 3.2.

Table 3.2: Substrate construct designs for immobilization by click chemistry. Bolded amino acids highlight significant changes between designs. Underlined show the unnatural amino acid, Z, or the analogous site in non-amber codon designs. X represents any amino acid. * represents the cleavage site for the positive cleavage sequence.

Design name          Start codon  nnAA  Affinity tag  Frame  Target     Linker
D9                   M            GZG   HHHHHH        PQP    AAIK*AGAR  PQPMP-TGY
D10                  M            GZG   HHHHHH        PQP    AAGKAGAR   PQPMP-TGY
D11                  M            GSG   HHHHHH        PQP    AAIK*AGAR  PQPMP-TGY
D12                  M            GZG   -             PQP    AAIK*AGAR  PQPMP-TGY
D13                  M            GZG   -             PQP    AAGKAGAR   PQPMP-TGY
D14                  M            GSG   -             PQP    AAIK*AGAR  PQPMP-TGY
nnAA octamer library M            GZ    HHHHHH        PQP    XXXXXXXX   PQPMP-TGY
nnAA octamer ΔUAG    M            GS    HHHHHH        PQP    XXXXXXXX   PQPMP-TGY


Finally, we used a synthetase-tRNA pair that had been optimized by previous work. Tyrosyl tRNA synthetase (pPaFS) from Methanococcus jannaschii can incorporate para-propargyloxy-phenylalanine (pPaF) [166] at amber stop codons. pPaFS was originally selected from a library in which five active site residues had been randomized and selected for improved pPaF incorporation during translation [167]. We used a tRNA sequence (pPaFT) that had been optimized for use with pPaFS [166]. In addition, we used a procedure to generate the tRNA by fusing its coding sequence to a hammerhead ribozyme [166, 168]. As the tRNA is transcribed, the hammerhead ribozyme will cleave at its 3′ terminus, releasing a tRNA with a free 5′ end (Figure 3.6D). pPaFS was cloned, expressed, and purified to >95% purity following published protocols [166]. We also tried expressing and purifying pyrrolysyl tRNA synthetase from Methanosarcina barkeri, but we were unable to express significant quantities of the protein, either by published protocols [169] or using the autoinduction method [148], so it was abandoned in favor of pPaFS.

Initially, we attempted to incorporate the non-natural amino acid using strategy #1 (Figure 3.6C). This strategy combined the non-natural amino acid, DNA encoding the tRNA, and the purified tRNA synthetase with the PURExpress kit in a one-pot reaction. The NEB PURExpress system is capable of both transcription and translation; we therefore expected transcription to yield functional tRNA (pPaFT), which could be charged by the tRNA synthetase (pPaFS) and incorporate the non-natural amino acid (pPaF) in the translated fusions. The potential advantages of this strategy were that (1) the tRNA does not need to be separately transcribed and purified and (2) there should be catalytic turnover of the pPaFS charging the tRNA with pPaF, so lower quantities of all reagents should be required.
Several attempts using strategy #1 (Figure 3.6C) gave lower total translation yields (3-4 fold) when pPaFS was present, as determined by yield after oligo(dT) cellulose purification. Temperature (30°C and 37°C) and translation time (2 hours and 18 hours) were varied with no improvement. To test whether these fusions had the non-natural amino acid incorporated correctly, we compared translations with templates D9 and D11 (Table 3.2), which differed by a single point mutation changing the amber stop codon to a serine codon. We also compared these to control translations that lacked all orthogonal machinery. We would expect no immobilization when either no amber stop codon is present or no nnAA-incorporating components are added. The samples were subjected to the procedure for click immobilization to azide-agarose resin (Figure 3.6B) and washed. When all components were included and the amber stop codon was present, the positive control (D9) immobilized at 20% (Table 3.3). When all non-natural amino acid machinery was absent from the translation (control), immobilization with or without the amber codon was ~1%. However, for the negative control in which all non-natural components were added yet there was no amber stop codon (D11), the immobilization was only slightly reduced from the positive control (15%). This suggested that the pPaFS was misincorporating pPaF, perhaps by charging the wrong tRNA with the non-natural amino acid. For these reasons, incorporation strategy #1 was not suitable for our purposes.

Table 3.3: Comparison of click immobilization yields for different alkyne incorporation strategies. Immobilization strategy #1 added all non-natural machinery into the translation mix. Strategy #2 pre-charged tRNA with the non-natural amino acid and only added the tRNA to the translation mix. The control added no non-natural machinery to the translation mix.

                                                   Immobilization after incorporation strategy
Design name        Peptide   His6 tag  Amber  Strategy #1  Strategy #2    Control
D9                 Positive  Yes       Yes    20%, 20%     76%, 61%, 76%  1.1%
D10                Negative  Yes       Yes    -            72%, 58%, 74%  -
D11                Positive  Yes       No     15%          1.2%, 0.5%     1.0%
D12                Positive  No        Yes    -            80%            -
D13                Negative  No        Yes    -            76%            -
D14                Positive  No        No     -            1.7%           -
nnAA octamer       Octamer   Yes       Yes    -            62%, 66%, 60%  -
nnAA octamer ΔUAG  Octamer   Yes       No     -            1.3%           -

We therefore tried the alternative strategy #2 (Figure 3.6C) to avoid possible mischarging by the synthetase pPaFS. In this strategy, the tRNA pPaFT was first pre-charged with pPaF, and the resulting aminoacyl-tRNA was then added directly to the translation reaction. This external pre-charging eliminated the possibility of the tRNA synthetase mischarging an incorrect tRNA. However, since there was no pPaFS present in the translation reaction, there could be no recharging of the tRNA. Therefore, the addition of a large amount of charged tRNA was required. We successfully generated tRNA using a combination of our standard protocol and previously published methods. After transcription of the pPaFT construct, the hammerhead ribozyme was activated [168] and the tRNA was gel purified by denaturing urea-PAGE (Figure S3.1). pPaFT was charged by combining the purified tRNA, pPaFS, and pPaF according to published methods [170]. The translation and click immobilization were performed using strategy #2 for templates D9-D11 (Figure 3.6C). Immobilization yields for constructs D9-D10, which both contained amber stop codons, were ~60-70%, while the yield for the D11 control without an amber stop codon was only ~1% (Table 3.3). This demonstrated that immobilization was occurring in a pPaF-dependent manner and that the rate of mis-incorporation with strategy #2 was low. Overall, this experiment showed that strategy #2 had both higher translation yields and higher final immobilization rates than strategy #1 (Table 3.3).

Next, we wanted to assess immobilization of our nnAA octamer library, which contained the amber stop codon for pPaF incorporation. This nnAA octamer library contains constant regions similar to D9 (Table 3.2). Notable differences include: the non-natural amino acid (Z) region was translated to GZ instead of GZG, there were a number of silent mutations in the PQP regions (to avoid RNA secondary structure formation, which led to unexpected SDS-PAGE bands following fusion formation), and the 3′ terminus ended in a single stop codon instead of three stop codons, one in each frame.
The translated product differs from construct D9 by only the single glycine next to the non-natural amino acid. The nucleotide designs of all constructs are shown in Table S3.3. The randomized region of the nnAA octamer library, the octamer, was synthesized using trinucleotide building blocks (Glen Research Corporation) by the W. M. Keck Biotechnology Resource Laboratory. A precise mix of 20 codons, one for each amino acid, was used such that there was a 5% chance of any amino acid at any given position. The use of trinucleotides provided a more even distribution of amino acids in the nnAA octamer library compared to the alternative approach that used degenerate codons. The nnAA octamer library fusions immobilized at rates typically between 60-70% (Table 3.3). We also tested the immobilization of an identical library that contained a point mutation substituting a serine codon for the amber stop codon (nnAA octamer library ΔUAG), which immobilized at only ~1% even when pre-charged pPaFT was added (Table 3.3).

We also assessed whether Cu(I) enhances the click chemistry reaction, as expected. Surprisingly, when the protocols were repeated for both the nnAA octamer and the nnAA octamer ΔUAG libraries, adding Cu(I) increased the immobilization by only about 40% relative to the copper-free reaction (Table 3.4). The nnAA octamer library immobilized at ~46% without copper. However, both of the nnAA octamer ΔUAG samples, with or without Cu(I), showed very low rates of immobilization (~1%). Therefore, immobilization was strongly dependent on the amber stop codon, confirming the pPaF-driven immobilization.

Table 3.4: Copper improves click yields in a UAG-dependent manner.

Library name               Amber codon  +Cu       -Cu
nnAA octamer library       Yes          60%, 68%  45%, 47%
nnAA octamer library ΔUAG  No           1.3%      1.1%

Therefore, in this section we have shown that we can immobilize fusions in a UAG codon-dependent manner with the NEB PURExpress system.
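The more even amino acid distribution obtained with trinucleotide building blocks, compared to degenerate codons, can be illustrated by enumerating the NNS codons used for the first library. The sketch below uses the standard genetic code and assumes equimolar incorporation of each base or trinucleotide, which is an idealization; real syntheses deviate from it, as the sequencing data in Section 3.3.6 show. All names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ..., GGG.
CODE = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), CODE)
}

def nns_distribution():
    """Fraction of NNS codons (N = A/C/G/T, S = C/G) per residue ('*' = stop).

    Assumes every base is incorporated with equal probability.
    """
    codons = ["".join(c) for c in product(BASES, BASES, "CG")]
    counts = Counter(CODON_TABLE[c] for c in codons)
    return {aa: n / len(codons) for aa, n in counts.items()}

nns = nns_distribution()
# Idealized trinucleotide mix: one codon per amino acid at 5% each.
trinucleotide = {aa: 0.05 for aa in "ACDEFGHIKLMNPQRSTVWY"}

for aa in sorted(trinucleotide):
    print(f"{aa}  NNS {nns[aa]:.3f}   trinucleotide {trinucleotide[aa]:.3f}")
print(f"stop (TAG) under NNS: {nns['*']:.3f}")
```

With NNS codons, leucine, serine, and arginine are each encoded three times as often as single-codon residues such as tryptophan or methionine, and one codon in 32 is the amber stop codon; the trinucleotide mix avoids both effects in this idealized picture.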

3.3.6 Pilot digestion of click-immobilized nnAA octamer library by streptopain

A pilot round of the entire protocol of mRNA display and protease digestion was performed using the nnAA octamer library for assessment of streptopain. The flow-through fraction after streptopain digestion was reverse-transcribed, PCR-amplified to regenerate the full-length template, then PCR-amplified to generate amplicons for Next-Generation Sequencing. In addition, a PCR-amplified sample of the chemically synthesized unselected library was prepared for sequencing as a control to assess its original amino acid distribution. The sequencing data were processed to eliminate all sequences that did not contain a perfect match for the constant regions flanking the random octamer region. A total of 35.5 million sequences from the chemically synthesized unselected library were processed to 19.2 million sequences used in the assessment. Since the nnAA octamer library was synthesized from 20 individual codons at a 5% mixture, the ideal distribution is 5% of each amino acid at each position. Sequencing revealed that most of the amino acids were present with less than 20% variance from expected, and the total representation was 5% +/- 1%. However, there were some biases for or against certain amino acids, which also shifted with position number (Figure 3.7A). The most biased amino acids were glycine (64-35% underrepresented), arginine, isoleucine, glutamine, valine, and tyrosine.

Figure 3.7: Next-Generation Sequencing revealed enrichment in pilot nnAA octamer library digestion. (A) The overall distribution of amino acids in the chemically synthesized library was 5% +/- 1%. (B) The amino acid distribution after streptopain digest was compared to the chemically synthesized starting library, showing changes of up to 48%.

The sequencing of the selected pilot library after cleavage with streptopain resulted in 28.9 million reads, which were processed to 14.1 million reads using the same pipeline as for the unselected library above. The amino acid prevalence at each position of the octamer was compared to the unselected library to determine enrichment, which revealed significant trends (Figure 3.7B). In particular, negative charge had decreased, with over 1/3 of the negatively charged amino acids eliminated at positions towards the C-terminus. Hydrophobic amino acids were generally enriched, especially the bulky hydrophobic amino acids. Interestingly, these preferences appeared at certain positions: tryptophan was enriched near the termini, while phenylalanine and tyrosine were enriched in the center of the peptide or perhaps slightly towards the C-terminus. Overall, this map revealed significant distribution changes after one round of selection by protease cleavage. However, this comparison took into account only the chemically synthesized starting library and the final selected library distribution. This pilot digestion experiment did not account for other potential biases during transcription, translation, or immobilization. Therefore, for the final selection it was necessary to also sequence the immobilized library immediately prior to the protease digestion, as described in Section 3.3.8.
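The read processing and enrichment comparison described in this section can be sketched as follows. This is a minimal illustration, not the pipeline that was actually used: it operates on already-translated reads for brevity (an assumption), uses the peptide-level flanks PQP and PQPMP from Table 3.2 in place of the nucleotide constant regions of Table S3.3, and runs on toy reads. All function names are hypothetical.

```python
from collections import Counter

def extract_region(read, left_flank, right_flank, length=8):
    """Return the variable region if both flanks match perfectly, else None."""
    start = read.find(left_flank)
    if start < 0:
        return None
    start += len(left_flank)
    end = start + length
    # The right flank must follow the variable region immediately.
    if read[end:end + len(right_flank)] != right_flank:
        return None
    return read[start:end]

def position_frequencies(regions, length=8):
    """Per-position residue frequencies for a list of equal-length regions."""
    if not regions:
        raise ValueError("no regions passed the flank filter")
    counts = [Counter() for _ in range(length)]
    for region in regions:
        for i, residue in enumerate(region):
            counts[i][residue] += 1
    return [{aa: n / len(regions) for aa, n in c.items()} for c in counts]

def percent_change(selected, unselected):
    """Percent change in frequency, per position, relative to the unselected pool."""
    return [
        {aa: 100.0 * (sel.get(aa, 0.0) / freq - 1.0) for aa, freq in unsel.items()}
        for sel, unsel in zip(selected, unselected)
    ]

LEFT, RIGHT = "PQP", "PQPMP"

def filter_reads(reads):
    regions = (extract_region(r, LEFT, RIGHT) for r in reads)
    return [r for r in regions if r is not None]

# Toy data only; the third unselected read has an imperfect right flank.
unselected_reads = [
    "HHHHHHPQPAAIKAGARPQPMPTGY",
    "HHHHHHPQPAAGKAGARPQPMPTGY",
    "HHHHHHPQPAAIKAGARPQAMPTGY",
]
selected_reads = ["HHHHHHPQPAAIKAGARPQPMPTGY"] * 2

unselected = filter_reads(unselected_reads)
selected = filter_reads(selected_reads)
change = percent_change(position_frequencies(selected),
                        position_frequencies(unselected))
print(unselected)   # regions that passed the perfect-match filter
print(change[2])    # percent change at the third octamer position
```

Normalizing against the unselected pool, rather than against the ideal 5%, is what lets synthesis biases cancel out; the same function could be pointed at the immobilized pre-digestion library to account for biases introduced before the protease step.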

3.3.7 Optimization of click-immobilization of nnAA octamer library

While the initial pilot showed significant enrichment after the single round of digestion (Figure 3.7B), we wanted to further optimize the signal-to-noise ratio to maximize the signal for preferred cleavage sequences. For this purpose, we used the positive and negative control fusion constructs (D9-D10), immobilized them to azide-agarose with strategy #2 described above, and varied the protease concentrations and duration of digestion. Ideally, the positive control should exhibit 100% digestion and the negative control 0%. The data indicated that the preferred peptide fusions were being released up to ~6-fold more than the negative peptide fusions. This discrimination decreased when more protease was added (Table 3.5).

Table 3.5: Streptopain digestion loses discrimination of preferred sequences when more digestion occurs.

Protease excess (fold)  Time (minutes)  Enrichment
~30                     30              6.3×
~20                     60              5.0×
~40                     60              4.3×
~700                    60              1.2×

The above results led to concern about non-specific digestion, despite the PQP frames. To assess the possibility that undesired cleavage might be occurring inside the His6 tag, we created constructs D12-D14 (Table 3.2), which removed this region. However, while these constructs immobilized at rates similar to D9-D11 (76-80% with UAG, 1.7% without UAG) (Table 3.3), the digestion discrimination between positive and negative substrate was not improved (~6-fold). Therefore, we concluded that the non-specific cleavage was not due to a preference for the conserved His6 tag.

The protease digestion results (Table 3.5) showed that lower amounts of protease improved the preference for the positive over the negative peptide sequence. However, if we reduced the digestion too much, the amount released due to protease quickly decreased to less than the background release caused by buffer alone. Despite moving to the click-chemistry immobilization strategy, we found the background rates to be still ~1% of immobilized signal. As with the His6 strategy, it was important to minimize background as much as possible, especially because we needed to decrease the digestion further to improve separation between preferred and less preferred sequences. We attempted numerous modifications to the protocol with this aim. Below we review these efforts before describing the final best strategy for the full-scale selection digestions in Section 3.3.8.

Initially, we explored reducing the background signal (release) of fusions through extensive washing. We immobilized reverse-transcribed fusions by click chemistry, washed with an 8 M urea denaturing wash, then incubated the fusions in digestion conditions without protease but otherwise identical to the actual digestion: rotating in a column at 37°C for 1 hour. After 8 of these washes, the released signal approached the radiation detection limit of ~0.01% (Figure 3.8A). However, when these beads were then split into 2 aliquots for the +/- streptopain conditions, the background for the no-protease mock digestion control increased again >20-fold compared to the background seen in the last wash. This indicated that successive washing is not enough to minimize background. Next, instead of performing multiple 1-hour incubations and washes, we immobilized the fusions and washed with a very large volume of wash buffers (~750 mL; a 42-fold increase). The background was still 0.24% in the mock digestion. This was repeated with a column custom-built from a Pasteur pipet to decrease the volume of buffer exposed to the resin at any time. This column was washed with ~1 L, yet the background was still 1.6% in the mock digestion. A final attempt to reduce the background was made by combining both of the above methods: a long overnight wash with continuous rotation. After >400 mL of washing, the mock digestion background was 1.1%. Therefore, we determined that neither the length of the wash nor the tumbling was the sole determinant of background signal release.


Figure 3.8: Optimization of octamer fusion washing. (A) Fusions were washed with sequential 1-hour washes. Released signal (line) neared radiation detection limit (dotted line). After resin recovery and mock digestion, signal released increased over 20-fold (square). (B) Digested fusions confirmed by SDS-PAGE, see numbered key. (C) Fusions bound to azide-agarose in low salt conditions and did not elute until washes reached 150 mM NaCl (line).

Next, we tested various steps of the entire procedure to confirm they were not contributing to this problem. We assessed the function of PCR clean-up kits to purify the fusions following cleavage. We found that the kits reliably purified 60-80% of loaded fusions (as quantified by qPCR) after a single round of binding and elution. In addition, we separated these samples via agarose gel electrophoresis to confirm that the fusions were being cleaved appropriately. Oligo(dT)-purified fusions showed appropriate gel shifts (Figure 3.8B): a shift up during reverse transcription, then a shift down after cleavage from azide-agarose. Importantly, we tested whether the fusions were binding to agarose resin non-specifically in low-salt conditions. We incubated cleaved fusions, which no longer contained the alkyne-bearing amino acid, with azide-agarose resin in a no-salt buffer, then attempted to elute the fusions with increasing concentrations of salt, up to the digestion buffer condition of 150 mM NaCl. To our great surprise, fusions bound tightly to the azide-agarose in low-salt conditions (up to 50 mM NaCl), then began to elute at 150 mM NaCl (Figure 3.8C). Therefore, cleaved fusions that no longer contained an alkyne group still bound very tightly and non-specifically to the azide-modified agarose.

From this finding we concluded that significant non-specific binding was occurring that could not be eliminated with extensive washing, and we attempted to modify the click chemistry conditions to reduce it. We used a urea-click method suggested by our resin supplier (ClickChemistryTools) to reduce background. This variation uses 4 M urea and 500 mM NaCl in the click reaction, which runs for 16-20 hours instead of 3 hours. However, we immobilized only 26% of our fusions under these conditions (a third of normal), and the final background signal was still ~2-3%. The urea-click immobilization therefore did not reduce our background.

Finally, we tested a number of alternative azide-modified resins, since the agarose azide resin led to significant non-specific binding of our fusions. We used four different magnetic resins: non-styrene resin with PEG-linked azide, “TurboBeads” with benzylic azide groups, “TurboBeads” with a PEGylated surface modified with azide, and “FG beads” (ferrite particles coated with glycidyl methacrylate) modified with azide. We allowed the click reaction to occur overnight, with conditions appropriate to each resin. After performing the denaturing wash (as detailed in methods), we were able to immobilize only 0.5-12% of our fusions (Table 3.6), a large decrease from our previous experience of ~60-75% immobilization with azide-modified agarose (Table 3.3). Further, when +/- protease digestions were performed, the signal-to-background ratio was still very poor for all conditions. After two attempts, this option was not explored further.

Table 3.6: Magnetic resin immobilization attempts.

Resin type              Azide link      % Immobilization  % Background  % Signal
Non-styrene resin       PEG-linked      12                0.05          0.1
TurboBeads              Benzylic azide  5                 0.5           0.5
TurboBeads – PEGylated  PEG-linked      3                 3.9           9.2
FG beads                Azide           0.5               14            38


In summary, we attempted to reduce background through extensive washes, modifying click conditions, and using alternative azide-modified resins. The most successful approach was to perform multiple extensive washes of the resin. For still unclear reasons, the background signal increases dramatically when the resin is recovered and split into multiple digestion conditions, including a no protease mock digestion.

3.3.8 Two rounds of screening for three different proteases

The best background signal in the optimization of Section 3.3.7 was achieved by washing multiple times in a single column, to ~0.2% (Figure 3.8A). The signal had then increased >20-fold when the beads were recovered and split into multiple columns. Therefore, we modified the protocol so that we could keep the beads in a single column for the entire wash and digestion. We performed serial washes on the column until the background reached a minimum, then added protease in sequentially increasing concentrations to digest off preferred peptides (Figure 3.6A). This was performed for two of the three proteases we planned to test: streptopain and factor Xa. Unfortunately, ADAM17 requires complete absence of salt during the digestion, so it was necessary to use split column digestions, as described above. The overall digestion scheme is shown in Figure 3.6A.

We prepared a library of fusions for each protease experiment and washed with the digestion buffer appropriate for each protease until background levels reached approximately 0.1%. The number of washes needed to reach this goal varied significantly between the three digestion buffers. Streptopain buffer (containing 0.1 mM EDTA and 0.1 mM DTT) reached 0.03% background after a single overnight wash and 3 subsequent 1-hour washes. In contrast, factor Xa buffer (containing 1 mM CaCl2) required two overnight washes and 11 one-hour washes to reach 0.12% background. Digestions were prepared with three increasing concentrations of protease (streptopain, factor Xa), or in split columns with two different concentrations of protease (ADAM17). The results for both rounds of selection are summarized in Table 3.7. In round 1, streptopain digestions released fusions above the background signal at the intermediate protease concentration. The factor Xa digest did not result in a significantly increased signal above background at any protease concentration. The ADAM17 digestion was above background only at the high protease concentration.


Table 3.7: Results from rounds 1 and 2 of cleavage specificity selection for three proteases. Samples marked with an asterisk (*) were pooled, for each protease, and continued to round 2.

                            Round 1                                          Round 2
Protease     Concentration  Fusions/   Fusions     Background  Signal        Fusions/   Fusions     Background  Signal
                            protease   presented   signal                    protease   presented   signal
                                       (×10^11)                                         (×10^11)
Streptopain  Low            480        8.6         0.03%       <0.05% *      660        12          0.02%       <0.07%
Streptopain  Intermediate   24         8.6         0.03%       0.31% *       33         12          0.02%       0.51%
Streptopain  High           1.2        8.6         0.03%       3.7%          1.7        12          0.02%       4.4%
Factor Xa    Low            72         9.4         0.11%       0.20% *       63         8.2         0.03%       <0.10%
Factor Xa    Intermediate   3.6        9.4         0.11%       0.19% *       3.1        8.2         0.03%       <0.07%
Factor Xa    High           0.18       9.4         0.11%       0.21% *       0.16       8.2         0.03%       0.15%
ADAM17       Low            9.5        2.7         0.95%       0.87%         11         3.1         0.37%       0.48%
ADAM17       High           0.48       2.7         0.95%       2.4% *        0.49       2.8         0.37%       3.2%


All 8 digestion outputs were PCR-amplified to regenerate full-length constructs and create multiple copies of the sequences. For round 2, we wanted to use an output from round 1 that had a detectable level of digestion, to be sure that the enriched sequences carried forward were mostly hits from digestion rather than background. Because we performed serial digestions for factor Xa and streptopain, it was also necessary to include the outputs of all lower-concentration digestions, so that we did not lose highly preferred sequences that had already been released at the lower levels of digestion. The chosen digestions were mixed at an equimolar ratio in preparation for round 2. For streptopain, we pooled the selected DNA that resulted from the low and intermediate protease conditions as input for round 2. For factor Xa, we pooled the round 1 output from all three protease concentrations, since none had resulted in a signal significantly above background. For ADAM17, we did not need to pool the samples because they were digested separately instead of sequentially. Therefore, we used the DNA resulting from the high protease condition, which showed clear digestion above background, as input for round 2.

Round 2 digestions were performed similarly to round 1, except that the length and number of washes were substantially increased. This entire round was performed by Dr. Bo Zhu in our laboratory, under my supervision. Instead of washing many times for 1 hour, more and longer washes of 6-17 hours were performed. This resulted in a significantly reduced background for all three protease digestion conditions. The results are summarized in Table 3.7. Overall, digestion yields were similar between rounds 1 and 2, but there was a trend towards increased digestion in the second round. This is expected for an enriched pool of peptides that were selected for cleavage preference by the protease.
Interestingly, in round 2 the factor Xa digestion at the highest protease condition resulted in a release of fusions significantly over background, whereas the signal had been within the background for round 1. The output from round 2 was PCR-amplified to regenerate the constant regions of the design and prepare copies.

We also included an additional control sample for sequencing to assess any potential bias arising from transcription, translation, or immobilization (described in Section 3.3.6). For this purpose, nnAA octamer library fusions were immobilized and washed identically to the above protocol. The resin was recovered and used as a template for PCR-amplification to generate a library of immobilized sequences, identical to the sequences immediately before incubation with protease. Therefore, this library was used as a control for biases that might occur during the transcription, translation, and immobilization steps.

The selection outputs from protease digests of rounds 1 and 2, the original chemically synthesized library, and the immobilized library were all prepared for Next-Generation Sequencing. This was done in a single PCR using primers that anneal to the constant regions flanking the random region.

3.3.9 Preliminary analysis of Next-Generation Sequencing results

The sequencing results were processed identically to Section 3.3.6, keeping only those sequences that perfectly matched the designed constant regions surrounding the random octamer region. We received 18 total pools of sequencing: the chemically synthesized library (1), the immobilized library (2), streptopain-digested samples at three concentrations and two rounds (3-8), factor Xa-digested samples at three concentrations and two rounds (9-14), and ADAM17-digested samples at two concentrations and two rounds (15-18). We received ~22 million reads for each pool, which were processed to yield ~14 million usable reads (~63%) (Table S3.4).

The unselected library showed amino acid distributions close to the ideal of 5% at each position, similar to the analysis of the same sample performed during the pilot selection (see Section 3.3.6). The amino acid distribution of the immobilized library, which had not yet been incubated with a protease, showed a pattern of enrichment extremely similar to the pilot enrichment in Section 3.3.6 (Figure 3.9A). This indicated that the distribution shown in Section 3.3.6 was mainly the result of a bias from the preparation process, which includes transcription, translation, and click-immobilization. We refer to this bias as immobilization bias. Since the immobilization bias is consistent and reproducible, it can be accounted for when analyzing the sequences that are released from protease incubations.
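As a rough illustration of how such a bias can be quantified, the following minimal Python sketch tabulates per-position amino acid frequencies for two libraries and reports their difference. It is not the analysis pipeline used in this work: the function names, the restriction to the 20 standard one-letter codes, and the use of a simple frequency difference (rather than, for example, a ratio) are assumptions made for illustration.

```python
from collections import Counter

# Assumption: peptides are written with the 20 standard one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def position_frequencies(peptides, length=8):
    """Return one dict per position giving the fraction of each amino acid."""
    counts = [Counter() for _ in range(length)]
    for pep in peptides:
        if len(pep) != length:
            continue  # ignore anything that is not a full octamer
        for i, aa in enumerate(pep):
            counts[i][aa] += 1
    freqs = []
    for c in counts:
        total = sum(c.values())
        freqs.append({aa: (c[aa] / total if total else 0.0) for aa in AMINO_ACIDS})
    return freqs


def frequency_change(reference, sample):
    """Per-position difference (sample minus reference) in amino acid frequency."""
    return [
        {aa: s[aa] - r[aa] for aa in AMINO_ACIDS}
        for r, s in zip(reference, sample)
    ]


# Hypothetical toy libraries, only to show the call pattern.
synthesized = position_frequencies(["AAAAAAAA", "CCCCCCCC"])
immobilized = position_frequencies(["AAAAAAAA", "AAAAAAAA"])
bias = frequency_change(synthesized, immobilized)
```

In this toy case alanine rises and cysteine falls at every position, which is the kind of change a bias map such as Figure 3.9A would display.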


Figure 3.9: Next-Generation Sequencing reveals cleavage specificity of three proteases. (A) The amino acid change of the immobilized library compared to the chemically synthesized library reveals the “immobilization bias”. (B-D) Amino acid preferences for streptopain (B), factor Xa (C), and ADAM17 (D) after two rounds of enrichment and immobilization bias subtraction.


For each protease, we first compared the enrichment between the immobilized library and the round 1 digestions. This analysis accounts for the immobilization bias that occurred prior to protease digestion. Each of the resulting maps had a pattern unique to each protease, which agreed with known specificities, described further below. As protease concentrations were increased, these patterns increased in strength. This makes sense: if the background number of sequences is constant, then increased digestion emerging above background should reveal a pattern specific to that protease. Also, because streptopain and factor Xa digestions were performed in serially increased concentrations, it was necessary to sum the reads from lower and higher digestions. For example, when analyzing the lowest digestion, only the reads from this digestion were used. However, when analyzing the intermediate digestion, both the lowest and intermediate digestion needed to be pooled. This is necessary because highly preferred sequences may be removed by the lowest digestion, then be depleted from higher digestions performed subsequently. All data presented for factor Xa or streptopain are summations of all three digestion reads. ADAM17 did not require summation because this was a set of split digestions. We next computationally generated an input distribution for each protease for their round 2 digestions. This was done by averaging the distributions for each of the libraries that were pooled (i.e. streptopain low and mid protease digestions were averaged, because they were experimentally pooled to continue to round 2). We then compared the changes from the putative round 2 input libraries to the round 2 digestion outputs. Interestingly, we saw a bias pattern in every sample that was nearly identical to the immobilization bias pattern seen between the chemically synthesized library and the immobilized library. 
Therefore, this immobilization bias occurs in every round and should be subtracted for each additional round to best analyze the resulting data. When we computationally subtracted an additional round of immobilization bias from the second-round maps, the unique protease patterns first visualized in round 1 became even more vivid.
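The bookkeeping described in this section (summing reads from serial digestions, averaging the pooled round 1 outputs to estimate a round 2 input, and subtracting the immobilization bias) can be sketched as follows. This is a minimal sketch under stated assumptions, not the code used for the analysis: the helper names are hypothetical, counts are held as one dict per octamer position, and both the bias and the enrichment are treated as simple frequency differences.

```python
def sum_counts(tables):
    """Add per-position amino acid counts from several digestion outputs."""
    summed = [dict() for _ in tables[0]]
    for table in tables:
        for pos, counts in enumerate(table):
            for aa, n in counts.items():
                summed[pos][aa] = summed[pos].get(aa, 0) + n
    return summed


def to_frequencies(table):
    """Convert per-position counts to per-position fractions."""
    freqs = []
    for counts in table:
        total = sum(counts.values())
        freqs.append({aa: n / total for aa, n in counts.items()} if total else {})
    return freqs


def average_frequencies(freq_tables):
    """Unweighted mean of frequency tables (assumed model of an equimolar pool)."""
    n_tables = len(freq_tables)
    averaged = []
    for pos in range(len(freq_tables[0])):
        residues = set()
        for table in freq_tables:
            residues.update(table[pos])
        averaged.append({
            aa: sum(t[pos].get(aa, 0.0) for t in freq_tables) / n_tables
            for aa in residues
        })
    return averaged


def corrected_change(output_freq, input_freq, bias):
    """(output - input) - bias for every position and residue."""
    corrected = []
    for out_pos, in_pos, bias_pos in zip(output_freq, input_freq, bias):
        residues = set(out_pos) | set(in_pos) | set(bias_pos)
        corrected.append({
            aa: out_pos.get(aa, 0.0) - in_pos.get(aa, 0.0) - bias_pos.get(aa, 0.0)
            for aa in residues
        })
    return corrected
```

For a serial digestion, `sum_counts` would be applied to the low, intermediate, and high outputs before converting to frequencies, whereas a split digestion such as ADAM17 would skip that step.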


Finally, we generated bias-corrected and enriched specificity maps for each protease to assess their preferences (Figure 3.9 B-D). This was performed on the highest digested samples from round 2 and incorporated two rounds of immobilization bias subtraction. Comparing the three final maps, it is readily apparent that each of the proteases has a strikingly different pattern. Both enrichments and depletions of amino acids differ significantly between maps. This is critical for our approach: because we presented nearly all possible octamers initially, we are able to determine both favored and disfavored amino acids simultaneously. Further, certain amino acids show significant changes depending on whether they are presented towards the N- or C-terminus of the octamer. This suggests that we will be able to determine subsite-specific preferences after computationally aligning sequences to their cleavage site.

Notably, there appears to be a framing bias for all three proteases, to varying degrees. Amino acid changes in positions 1 and 8 frequently differ significantly from nearby positions. This is prominent for ADAM17 in positions 1 and 2 (Figure 3.9D), and less so for streptopain and factor Xa in position 1 (Figure 3.9 B-C). For ADAM17, this recapitulates a known specificity preference. In position -1, there is a 100% conserved proline. ADAM17 has a known preference for proline-valine/isoleucine. Because this was presented at a much higher rate than other combinations, it visually appears over-enriched. However, through subsequent computational weighting algorithms, we will be able to account for these discrepancies for all three proteases.

The analysis of substrate cleavage by streptopain revealed a specificity motif that largely prefers isoleucine, phenylalanine, and methionine, although there may also be a significant enrichment for alanine, leucine, tyrosine, and threonine towards the C-terminal side of the octamer (Figure 3.9B). Through a private communication with a leading investigator of protease specificity, we compared our specificity map to an unpublished dataset in which streptopain was characterized. Our map confirms the favored preferences, especially the preference for phenylalanine, isoleucine, and methionine. However, we can also clearly identify disfavored amino acids using our approach. In particular, glycine, tryptophan, cysteine, proline, and all charged or polar amino acids are disfavored.


Cleavage analysis for factor Xa revealed a specificity map that confirmed the canonical cleavage specificity of glycine-arginine in positions P2-P1 (Figure 3.9C). Comparing our data to the most thorough assessment of factor Xa using PICS [34], we find frequent similarities, although the disfavored amino acids are strikingly different. The PICS analysis found that serine, threonine, and glutamine were frequently disfavored, yet these appear largely neutral in our data. We found much stronger negative preferences against valine, isoleucine, leucine, phenylalanine, and histidine.

Comparing our ADAM17 cleavage specificity analysis to the best prior assessment also confirms known specificities (Figure 3.9D). Q-PICS found a preference for proline-valine/leucine in positions P1-P1′ [37], which is clear in our map. We also confirmed known preferences in the remaining sites, such as alanine, leucine, methionine, and even our framing amino acids proline and glutamine. However, our method also identifies novel negative preferences against several amino acids, including cysteine, histidine, phenylalanine, and tryptophan.

3.4 Discussion

The characterization of protease specificity by mRNA display and Next-Generation Sequencing is a large step forward in fidelity over previously developed methods. By presenting nearly all 26 billion possible octamer substrate sequences and analyzing millions of cleaved sequences, we were able to identify both favored and disfavored amino acids and thereby characterize the natural specificity of each protease.

In the pilot streptopain digestion experiment (Section 3.3.6), an enrichment pattern was detectable. However, when a control sample for the immobilized library was included for comparison in the subsequent series of experiments (Section 3.3.9), it was revealed that this enrichment pattern was dominated by a bias from transcription, translation, and immobilization of fusions. Once identified, this immobilization bias can be accounted for and subtracted from the specificity maps to reveal the actual cleavage specificities. When this was applied to the maps generated from two rounds of enrichment, factor Xa’s specificity map confirmed its canonical specificity, while both streptopain’s and ADAM17’s maps corroborated known specificities.

The strength of our method can be seen in the resolution with which previously poorly probed amino acids are now revealed as favored or disfavored. Factor Xa’s specificity map markedly departed from the previous best analysis performed by PICS, with several negative preferences against valine, isoleucine, leucine, phenylalanine, and histidine [34]. The specificity map generated for ADAM17 better resolved both the positive and negative preferences compared to a Q-PICS analysis [37]. Streptopain, the least characterized of the three proteases tested here, yielded a specificity map that identified many previously unknown disfavored amino acids, leading to a better understanding than ever before.

Using this procedure and one or two successive rounds of selection, it is possible to characterize the specificity of nearly any protease. After a single round, we could identify many known preferences. The enrichment from the second round was similar to the enrichment from the first round, indicating that the pool could be further enriched with subsequent rounds. However, the immobilization bias removes a significant percentage of the presented sequences in each round. This effect was nearly as strong as the protease-selected enrichment in some cases. While two rounds clearly revealed unique specificity maps for each protease, it is possible that further computational analysis will show that a single round is adequate for identifying specificity. More than two rounds may become infeasible because the immobilization bias compounds with every additional round. Importantly, we found that it is critical to assess the actual immobilized library to identify the change from each round.
In subsequent efforts, we will sequence the input library for each round, the immobilized library presented for each protease, and the final washed off fusions prior to protease digestion.

During this work, we repeatedly found confounding results compared to expectations. The mRNA display peptide fusions exhibit properties of both proteins and nucleic acids, are extremely dilute, and contain unique chemistry features. These attributes make their behavior difficult to predict reliably. Ni-NTA resin should be able to enrich for fusions missing a His6 tag by negative selection, yet we found other effects to be more dominant. N-hydroxysuccinimide (NHS) labeling of the N-terminal primary amine of the fusions was expected to be as efficient as labeling whole proteins, but we could not demonstrate significant modifications. Click chemistry immobilization should yield a covalent bond that enables extremely stringent washes to eliminate nearly all non-specific background; however, many very stringent washing procedures failed to reduce the background signal below 1%. Washing a column with several thousand column volumes of a given buffer should typically reduce background well below 1%, yet agitation of the beads caused by a simple resin recovery apparently caused significant additional release of fusions that had not been released during the preceding washes.

The quality of data is always dependent on the background noise of a given method. In our method, the immobilization and subsequent cleavage of the peptide substrates is the crucial selective step. The optimization to reduce the significant non-selected release of fusions proved surprisingly difficult. The most successful approach was to simplify the handling as much as possible. A procedure that included serial extensive washing steps and subsequent digestion with increasing amounts of protease, while avoiding any bead transfer or agitation, resulted in the lowest background of any method tested: 0.02%. For ADAM17, where we could not apply this method, the background was as high as ~1%. For an input of 10^12 sequences, this rate still corresponds to as many as 10^10 sequences, which is 500-fold more than can be sequenced by our method of Next-Generation Sequencing. Further reducing this background would increase the signal-to-noise ratio and therefore the quality of our data.
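The background estimate above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the sequencing depth of 2 × 10^7 reads per pool is an assumption chosen because it is the depth implied by the 500-fold figure and is close to the ~22 million reads received per pool.

```python
def background_sequences(input_sequences, background_percent):
    """Number of input sequences expected to be released non-specifically."""
    return input_sequences * background_percent / 100.0


def fold_over_reads(n_sequences, reads_per_pool):
    """How many times the released sequences exceed the sequencing depth."""
    return n_sequences / reads_per_pool


INPUT_SEQUENCES = 1e12  # input stated in the text
READS_PER_POOL = 2e7    # assumption: roughly 20 million reads per pool

# ~1% background, as observed for the split-column ADAM17 digestions
adam17_like = background_sequences(INPUT_SEQUENCES, 1.0)
# 0.02% background, the lowest value reported for the single-column procedure
best_case = background_sequences(INPUT_SEQUENCES, 0.02)

for label, n in (("1% background", adam17_like), ("0.02% background", best_case)):
    fold = fold_over_reads(n, READS_PER_POOL)
    print(f"{label}: {n:.1e} sequences, about {fold:.0f}-fold over sequencing depth")
```

Under these assumptions, even the lowest background would still release more non-selected sequences than a single pool can sample, which is consistent with the emphasis on further background reduction.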

3.5 Future work

The Next-Generation Sequencing results showed that two rounds are adequate to reveal specificity maps. However, these data need to be further processed to align the sequences to their cleavage sites. Initially, this can be performed through motif-scanning algorithms, where known specificities are used as anchors to align the data. In addition, an independent confirmation of cleavage locations will improve this bioinformatics processing scheme. We are collaborating with Dr. Joshua Baller of the Minnesota Supercomputing Institute to complete the computational analysis of our data. He is currently analyzing the existing sequencing data and searching for 4mer pattern motifs.

Next, we will work towards generating the second data stream for identifying cleaved sequences: LC-MS/MS of cleaved peptides to identify exact cleavage locations (Figure 3.1). While modern mass spectrometry can separate up to thousands of different peptides, this complexity is still substantially less than our selection output. Therefore, we will dilute our second round output digestion to prepare a simplified library with a complexity between 100 and 1000 sequences. This simplified sub-library will be sequenced separately by Next-Generation Sequencing to identify exactly all sequences contained in the input MS sample. Then, this library will be translated into fusions as described above, but immobilized on Ni-NTA resin instead of azide-modified agarose. The immobilized fusions will be digested by the appropriate protease to yield cleaved peptides. After extensive washing, the N-terminal peptide containing the His6 tag will be eluted using imidazole (Figure 3.1). These peptides will be further purified by the STAGE-tip procedure [145] and analyzed by LC-MS/MS. The identification of individual peptides will be enabled by comparing the LC-MS/MS data to the Next-Generation Sequencing data of the sequences contained in the input sample.

A database of identified cleavage sites will be prepared for this simplified library. Through computational analysis, this database of detected cleavage sites for the sub-library will be used to identify a conserved motif, which will be used to assist in aligning the full-length digested peptide sequences for the full complexity libraries. Through iterative rounds of analysis, matching, and capture, as much of the library will be aligned as possible.
The resulting high-resolution specificity map will be used to identify subsite cooperativity effects and possible human cleavage sites. As a final confirmation, we will use the refined specificity maps to predict poor and favored peptide sequences. A selection of those will be chemically synthesized and cleaved individually with the respective proteases to confirm the predictive power of the specificity maps.
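To make the anchor-based alignment idea in this section concrete, the following minimal Python sketch assigns a putative cleavage position to each octamer that contains a known motif exactly once. It is only a sketch of one possible motif-scanning step, not the planned algorithm: the function names are hypothetical, and it assumes that cleavage occurs directly after the anchor, as it would for a P2-P1 anchor such as the glycine-arginine motif of factor Xa.

```python
def find_all(peptide, motif):
    """Return every start index of motif in peptide, allowing overlaps."""
    hits = []
    start = peptide.find(motif)
    while start != -1:
        hits.append(start)
        start = peptide.find(motif, start + 1)
    return hits


def align_on_anchor(peptides, anchor):
    """Assign a putative cleavage index to peptides with exactly one anchor hit.

    The cleavage index is the 0-based position of the residue immediately
    after the anchor, i.e. the assumed P1' residue. Peptides with no hit or
    with several hits are set aside because their frame is ambiguous.
    """
    aligned, skipped = [], []
    for pep in peptides:
        hits = find_all(pep, anchor)
        if len(hits) == 1:
            aligned.append((pep, hits[0] + len(anchor)))
        else:
            skipped.append(pep)
    return aligned, skipped


# Hypothetical octamers, only to show the call pattern.
aligned, skipped = align_on_anchor(
    ["AAGRSSSS", "GRAAAAAA", "AAAAAAAA", "GRGRAAAA"], anchor="GR"
)
```

Sequences left in `skipped` are the ones for which an independent measurement of the cleavage site, such as the LC-MS/MS data stream described above, would be most informative.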


3.6 Conclusions

The method described in this chapter is an unprecedented leap forward in specificity determination and can be applied to identify the specificity of potentially any protease. By scanning all possible permutations of eight residues, we can comprehensively assess both the prime and nonprime side specificity. The resulting high-resolution specificity maps will therefore allow for the prediction of human protein cross-reactivity by mining human genomic data for shared motifs. Finally, proteases are ubiquitous enzymes that play critical roles in hormone activation, digestion, and immune response, and are the target of about 10% of all pharmaceuticals [27]. The generalizable, high-throughput, and powerful specificity sampling tool created here will lead to a better understanding of the activity of potentially any protease.

3.7 Materials and methods

All chemicals were purchased from Sigma-Aldrich unless otherwise stated.

3.7.1 PCR construction of controls, octamer libraries, and sequencing pools

The DNA for each control peptide sequence and for the first and second octamer libraries was synthesized as primers and assembled by a 1-3 step PCR assembly. The PCR primers used for each construction are shown in Table S3.1, the primer sequences are shown in Table S3.2, and the nucleic acid sequences for all constructs are in Table S3.3. The nnAA octamer library with the amber stop codon was synthesized with trimer codons in the random octamer region by the Keck facility. The library was amplified with primers D18 FW Ext 2 and 8mer RV Ext 5. The nnAA octamer library ΔUAG was constructed by PCR-amplifying the nnAA octamer library with primers 8mer V4 FW 1 and 8mer RV Ext 5. The final pools to be sent for sequencing were amplified with primers NGS FW V4 and NGS RV V4.


3.7.2 Expression and purification of tRNA synthetase

We replicated the published protocol for the cloning, expression, and purification of pPaFS [166] with the following changes: (1) Cells were split into 4 × 250 mL aliquots, spun down for 15 minutes, and not washed prior to freezing at -80°C. (2) A 250 mL aliquot was resuspended in lysis buffer. (3) Sonication was used to lyse the cells. (4) 1.5 mL of Ni-NTA resin was used in a gravity flow system. (5) 3.5k MWCO Slide-A-Lyzer cassettes (0.5-3 mL size) were used for dialysis. (6) The sample was dialyzed against 3 × 2 L of 1x PBS buffer for 2 hours, 2 hours, and overnight. (7) Recovered samples were filtered through a 0.45 μm SpinFilter and not concentrated further.

3.7.3 tRNA transcription, hammerhead ribozyme activation, and pPaF charging

The sequences of the DNA encoding pPaFT were synthesized by GenScript and PCR-amplified with primers T7 FW1 and pPaFT RV. These were agarose-gel purified and transcribed by T7 polymerase using standard procedures. The activation protocol was optimized by Kun-Hwa Lee in the Seelig laboratory. The transcription product was mixed with MgCl2 to 30 mM and Tris-HCl (pH 8.3) to 40 mM. This mixture was split into 100 μL aliquots, transferred to PCR tubes, and heat-cycled using the thermocycler program: (1) 1 minute at 72°C, 5 minutes at 65°C, 5 minutes at 37°C, repeated for a total of 15 cycles. (2) 2 minutes at 60°C, hold at 4°C. The product was ethanol precipitated by standard procedures and purified by denaturing 8% urea-PAGE. The dominant middle band (Figure S3.1) was cut out and purified by a standard crush-and-soak protocol to obtain purified pPaFT.

Charging of pPaFT was performed by combining the following: 100 mM HEPES-KOH (pH 7.2), 30 mM KCl, 12 mM MgCl2, 4 mM ATP, 2 mM DTT, 1 mM pPaF purchased from Chem-Impex International Inc., 10 μM pPaFT, 3 μM pPaFS. These were split into 2.5 nmol pPaFT aliquots, incubated at 37°C for 1 hour, then ethanol precipitated twice. The final ethanol precipitation was stored dry at -80°C until immediately prior to use.


3.7.4 mRNA display of octamer designs

The transcription, cross-linking, and translation using rabbit reticulocyte lysate were performed as described previously [159]. The translation using NEB PURExpress and pre-charged tRNA was performed by combining the following for a 125 μL final reaction volume: (1) NEB PURExpress-provided “solution A Δ aa/tRNA”, “solution B Δ RF123”, tRNA, RF2, and RF3 according to the manufacturer’s protocol. (2) Amino acid mixture Δ methionine from Promega to 100 μM. (3) 35S-labeled methionine to 0.344 μM. (4) Cross-linked RNA template to 1 μM. The reaction was incubated at 37°C for 2 hours. KCl was then added to 550 mM and MgCl2 to 50 mM, and the mixture was incubated at room temperature for 30 minutes. Oligo(dT) purification was performed as described previously [159].

Reverse transcription was optimized for an oligo(dT) purified product as described in Section 3.3.3. The final conditions were identical to those previously described [159], except that we used a 100-fold excess of RT primer and 3000 U/mL Superscript II. The RT primer used was Spec RT (PQPMP) V6.

3.7.5 Digestion of octamer control peptides

Oligo(dT) purified fusions of each control peptide (D1-D8) were incubated with 0, 2, or 20 μg/mL of purified streptopain in streptopain digestion buffer (20 mM Tris (pH 8), 150 mM NaCl, 0.1 mM DTT, 0.1 mM EDTA) at a final volume of 15 μL. The mixture was incubated at 37°C for 10 minutes, then 15 μL of 2x Laemmli buffer was added, and the sample was run on SDS-PAGE to assess digestion. All SDS-PAGE digestions were quantified using ImageJ [171].

For Ni-NTA immobilization and subsequent digestion of control peptides D4-D8, oligo(dT) purified fusions were mixed with 25 μL of Ni-NTA resin in streptopain digestion buffer to a final volume of 125 μL. After inverting at 4°C for 30 minutes, the resin was washed in batch and resuspended in 75 μL of streptopain digestion buffer. 2 μg/mL of purified streptopain was added, and the mixture was inverted at 37°C for 1 hour. The flow-through was separated from the resin by a 0.45 μm SpinFilter, then the immobilized fusions were eluted using elution buffer (20 mM Tris (pH 8), 150 mM NaCl, 250 mM imidazole).


For Ni-NTA immobilization and digestion of octamer libraries, conditions were similar to those used above for D4-D8, with the following changes: (1) 6 μg/mL of purified streptopain was added. (2) Digestions were performed for between 5 and 60 minutes. When factor Xa was used, the following changes were made: (1) Digestion buffer contained 1 mM CaCl2 instead of DTT or EDTA. (2) The factor Xa final concentration was 20 μg/mL. (3) Digestion was incubated for 5 minutes.

3.7.6 Click immobilization of octamer fusions

The click immobilization strategy used for protease selections was derived from a protocol used to modify mRNA displayed fusions [96]. Solution A was prepared in bulk and stored at 4°C until use: 200 mM HEPES-KOH (pH 7.6), 10 mM aminoguanidine hemisulfate, 0.05% Triton X-100. Solution B was prepared fresh for every click immobilization: 2 mM CuSO4, 2 mM THPTA. 350 μL of azide-agarose resin 50% slurry from ClickChemistryTools was rinsed with 3 × 1 mL H2O, then blocked with 0.5 mg/mL yeast tRNA for at least 20 minutes. Four 0.5 mL tubes were cut to remove their caps. The resin was rinsed with 3 × 1 mL H2O then 3 × 1 mL of Solution B. 120 μL of Solution B was added to resuspend and transfer the resin to a capless 0.5 mL tube. 120 μL of Solution A and 150 μL H2O were each transferred to a capless 0.5 mL tube. 1.5-2 mg of sodium L-ascorbate was weighed into a capless 0.5 mL tube. The H2O and ascorbate tubes were placed in a 25 mL heart-shaped flask, while the Solution A and Solution B tubes were placed in a 50 mL heart-shaped flask. The flasks were purged under argon flow for 1 hour. 50 μL H2O was added per 1 mg of ascorbate in the purged flask, and the ascorbate was suspended. 24 μL of the suspended ascorbate was transferred to the Solution B tube. All of the Solution A was transferred to the Solution B tube. The reaction was incubated at room temperature for 3 hours, and the resin was resuspended after each hour.


3.7.7 Wash of azide-agarose immobilized fusions

After immobilization, fusions were washed with a variety of methods summarized in Section 3.3.7. Here we describe the method used in the final rounds of screening. The resin slurry from the click reaction was recovered and transferred to a 10 mL Poly-Prep Bio-Rad column. The resin was washed with 8 × 500 μL then 4 × 10 mL of denaturing urea buffer warmed to 60°C (8 M urea, 500 mM NaCl, 20 mM Na-Phosphate (pH 6.5), 0.1% Triton X-100). Each wash was measured by scintillation counting to track immobilization and washes. The resin was then washed with the digestion buffer appropriate for the protease with which it would subsequently be digested: (1) Streptopain: 50 mM HEPES-KOH (pH 7.4), 150 mM NaCl, 0.1 mM EDTA, 0.1 mM DTT, 0.1% Triton X-100. (2) Factor Xa: 50 mM HEPES-KOH (pH 7.4), 150 mM NaCl, 1 mM CaCl2, 0.1% Triton X-100. (3) ADAM17: 50 mM HEPES-KOH (pH 7.4), 150 mM NaCl, 2.5 μM ZnCl2, 0.1% Triton X-100. Note that ADAM17 digestion buffer normally does not contain NaCl, but it is necessary for the wash. 10 × 1 mL, then 4 × 500 μL of digestion-wash buffer was used, with scintillation counting to track the decrease in signal. Next, the column was capped and 800 μL (streptopain, factor Xa) or 600 μL (ADAM17) of digestion buffer was added. The column was rotated at 37°C for 1 hour, then uncapped and the flow-through collected. Each column was then washed with 2-4 × 500-800 μL of digestion buffer. This column wash was repeated until the flow-through signal was reduced to 0.1% or less of the immobilized signal.
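The stopping rule for the wash series (repeat until the flow-through is 0.1% or less of the immobilized signal) amounts to a simple ratio of scintillation counts. The following minimal Python sketch shows that bookkeeping; the function names and the example counts are hypothetical and were chosen only to illustrate the calculation.

```python
def percent_of_immobilized(flowthrough_counts, immobilized_counts):
    """Flow-through signal expressed as a percentage of the immobilized signal."""
    return 100.0 * flowthrough_counts / immobilized_counts


def first_wash_below(flowthrough_series, immobilized_counts, threshold_percent=0.1):
    """Return the 1-based index of the first wash at or below the threshold.

    Returns None if no wash in the series reaches the threshold, meaning
    that washing should continue.
    """
    for i, counts in enumerate(flowthrough_series, start=1):
        if percent_of_immobilized(counts, immobilized_counts) <= threshold_percent:
            return i
    return None


# Hypothetical scintillation counts for four successive 1-hour washes.
immobilized = 100_000
washes = [5_000, 1_200, 300, 90]
stop_at = first_wash_below(washes, immobilized)
```

With these made-up counts, the fourth wash is the first to fall below the threshold, so its signal would be recorded as the background before protease is added.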

3.7.8 Protease digestion of click-immobilized fusions

For both streptopain and factor Xa, digestions were performed with increasing concentrations of protease. The final 1-hour wash was recorded as the background signal. Then 800 μL of the appropriate digestion buffer (described above) was added, followed by the lowest amount of protease to be used. The column was capped and rotated at 37°C, then washed to release cleaved fusions. This was repeated for the intermediate and high protease concentrations. For streptopain, 82 pg, 1.6 ng, and 33 ng were added for the low, intermediate, and high digestions. For factor Xa, 250 pg, 5 ng, and 100 ng were added for the low, intermediate, and high digestions. A Qiagen PCR clean-up kit was used to capture and concentrate the fusions.

For the ADAM17 digestions, the resin needed to be recovered and split into three columns, because the digestion needed to be performed with no salt present. The resin in each column was washed with digestion buffer (50 mM HEPES-KOH (pH 7.4), 2.5 μM ZnCl2, 0.1% Triton X-100), then resuspended in 200 μL of digestion buffer. Next, either buffer only, low protease (1 ng), or high protease (50 ng) was added. ADAM17 was purchased from R&D Systems. The digestion continued for 16 hours, rotating, at 37°C. During recovery, 600 μL of elution buffer (50 mM HEPES-KOH (pH 7.4), 200 mM NaCl, 2.5 μM ZnCl2, 0.1% Triton X-100) was added to wash fusions off the azide-agarose. The flow-through was collected. A Qiagen PCR clean-up kit was used to capture and concentrate the fusions.

3.7.9 Next-Generation Sequencing and data analysis

Both the pilot sample of streptopain and the full two rounds were sequenced by ACGT, Inc. using an Illumina NextSeq500 with paired-end, 75 bp reads. The data were processed with the following programs: (1) PEAR to merge the forward and reverse reads. (2) Manipulate FASTQ to reverse complement incorrectly oriented sequences. (3) Manipulate FASTQ to require 100% identity on the 5′ and 3′ flanking regions of the octamer and trim off these nucleotides. (4) Filter FASTQ to remove all sequences that are not exactly 24 nucleotides. (5) Manipulate FASTQ to chop each octamer into 8 × 3 nucleotide segments. (6) CUSP to count each codon at each subsite. The resulting codon frequencies were compiled to generate amino acid frequencies at each site. The input libraries for round 2 were calculated as the average of each input pool. For factor Xa, the input was the average of all three digestions from round 1. For streptopain, it was the average of the low and intermediate digestions. For ADAM17, only the high digestion was used as the input library. The immobilization bias was identified as the change in amino acid frequencies at each site between the chemically synthesized library and the immobilized library. This bias was applied to the second round digestion inputs to estimate the sequences actually presented to the proteases for the round 2 enrichment. The final enrichments were calculated by summing the amino acid distributions from all the digestions for streptopain and factor Xa (ADAM17 used only the highest digestion), subtracting one round of immobilization bias, and comparing to the immobilized library.
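Steps (3) to (6) of the read processing above can be illustrated with a minimal Python sketch. This is not the software used in this work (which relied on PEAR, the FASTQ manipulation tools, and CUSP); it only mirrors the logic on merged, correctly oriented reads. The flank sequences are an assumption taken from the second octamer library in Table S3.3, and all function names are hypothetical.

```python
from collections import Counter

# Assumed flanks, copied from the second octamer library in Table S3.3.
# The exact flank lengths required in the original analysis are not stated.
FLANK_5 = "CCGCAACCG"
FLANK_3 = "CCACAGCCTATGCC"
OCTAMER_NT = 24  # 8 codons x 3 nucleotides

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def extract_octamer(read):
    """Steps 3-4: require exact flanks, trim them, keep only 24-nt inserts."""
    start = read.find(FLANK_5)
    if start == -1:
        return None
    start += len(FLANK_5)
    end = read.find(FLANK_3, start)
    if end == -1:
        return None
    insert = read[start:end]
    return insert if len(insert) == OCTAMER_NT else None


def split_codons(octamer):
    """Step 5: chop the octamer into 8 segments of 3 nucleotides."""
    return [octamer[i:i + 3] for i in range(0, len(octamer), 3)]


def position_frequencies(reads):
    """Step 6: count codons per subsite and report amino acid frequencies."""
    counts = [Counter() for _ in range(OCTAMER_NT // 3)]
    for read in reads:
        octamer = extract_octamer(read)
        if octamer is None:
            continue
        for pos, codon in enumerate(split_codons(octamer)):
            aa = CODON_TABLE.get(codon)  # codons with ambiguous bases are skipped
            if aa is not None:
                counts[pos][aa] += 1
    freqs = []
    for counter in counts:
        total = sum(counter.values())
        freqs.append({aa: n / total for aa, n in counter.items()} if total else {})
    return freqs


# Toy reads: two valid octamers and one insert of the wrong length.
reads = [
    FLANK_5 + "GCT" * 8 + FLANK_3,
    "AAA" + FLANK_5 + "GCTAAG" + "GCT" * 6 + FLANK_3 + "TT",
    FLANK_5 + "GCT" * 7 + FLANK_3,
]
freqs = position_frequencies(reads)
```

Per-site frequency tables of this kind, computed for the chemically synthesized, immobilized, and digested pools, are the inputs to the bias and enrichment comparisons described in the text.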


3.8 Supplementary information

Table S3.1: Primers used in construction of controls and libraries.

PCR 1 PCR 2 PCR 3 Construct Forward primer Reverse primer Forward primer Reverse primer Forward primer Reverse primer D1 Spec FW Pos cntrl 1 Spec RV Pos cntrl 1 Spec RV EXT 1 D2 FW Neg cntrl 1 RV Neg cntrl 1 D3 FW Neg cntrl 2 RV Neg cntrl 2 Spec FW EXT 1 Spec RV EXT 2 D4 Spec FW Pos cntrl 4 Spec RV Pos cntrl 5 Spec RV Pos cntrl 5 D5 Spec FW Neg cntrl 5 Spec RV Neg cntrl 6 Spec RV Neg cntrl 6 D6 Spec RV PQPMP 1 D7 Spec FW EXT 1 Spec RV MHM PQP 1

D8 Spec RV GG 1 D9 D21 RV 1 D21 FW 1 D10 D22 RV 1 D11 D23 FW 1 D21 RV 1 D18 FW Ext 2 D21 RV 2 D12 D27 FW 1 D33 RV 1 D13 D28 FW 1 D34 RV 1 D14 D29 FW 1 D33 RV 1 First octamer library 8mer Library FW 8mer RV Ext 1 8mer FW Ext 1 Spec PoP RV short 1 8mer RV Ext 1 8mer FW Ext 2 Second octamer library 8mer Library FW 2 8mer RV Ext 4 8mer FW Ext 5 8mer RV Ext 5 8mer RV Ext 4 nnAA octamer ΔUAG 8mer V4 FW 1 8mer RV Ext 5 D18 FW Ext 2 8mer RV Ext 5


Table S3.2: Nucleotide sequence of all primers used in this chapter.

Primer  Sequence
Spec FW Pos cntrl 1  5′-CAATTACAATGCATCACCACCATCATCACGCAGCTATCAAGGCAGGAGCAAGG
Spec RV Pos cntrl 1  5′-TTAATAGCCGGTGCCAGATCCAGACATTCCCATCCTTGCTCCTGCCTTGATAGCTGC
Spec FW EXT 1  5′-TCTAATACGACTCACTATAGGGACAATTACTATTTACAATTACAATGCATCACCACCATC
Spec RV EXT 1  5′-TTAATAGCCGGTGCCAGATCCAGACATTCCC
FW Neg cntrl 1  5′-CAATTACAATGCATCACCACCATCATCACGCAGCTGGAAAGGCAGGAGCAAGG
RV Neg cntrl 1  5′-TTAATAGCCGGTGCCAGATCCAGACATTCCCATCCTTGCTCCTGCCTTTCCAGCTGC
FW Neg cntrl 2  5′-CAATTACAATGCATCACCACCATCATCACGCTGCAGGAAAGGCTGGAGCACGC
RV Neg cntrl 2  5′-TTAATAGCCGGTGCCAGATGGAGACATAGGCATGCGTGCTCCAGCCTTTCCTGCAGC
Spec RV EXT 2  5′-TTAATAGCCGGTGCCAGATGGAGACATAGGC
Spec FW Pos cntrl 4  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGGCTGCTATCAAGGCAGGAGC
Spec RV Pos cntrl 5  5′-TTAATAGCCGGTGGGCATCGGTTGAGGTCTTGCTCCTGCCTTGATAGCAGC
Spec FW Neg cntrl 5  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGGCTGCTGGAAAGGCAGGAGC
Spec RV Neg cntrl 6  5′-TTAATAGCCGGTGGGCATCGGTTGAGGTCTTGCTCCTGCCTTTCCAGCAGC
Spec RV PQPMP 1  5′-TTAATAGCCGGTGGGCATCGGTTGAGGGTGATGATGGTGGTGATGCATTG
Spec RV MHM PQP 1  5′-TTAATAGCCGGTGGGTTGAGGCATGTGATGATGGTGGTGATGCATTG
Spec RV GG 1  5′-TTAATAGCCGGTGGGCATCGGTTGAGGTCCTCCCGGTTGAGGGTGATGATGGTGGTGATG
D21 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTAGGGACATCACCACCATCATCACCCGCAACCAGC
D21 RV 1  5′-GGGCATCGGTTGTGGTCTTGCTCCTGCCTTGATAGCAGCTGGTTGCGGGTGATGATGGTG
D18 FW Ext 2  5′-TCTAATACGACTCACTATAGGGTTAACTTTAGTAAGGAGGACAGCTAAATG
D21 RV 2  5′-TCAGTCACTTAATAGCCGGTGGGCATCGGTTGTGGTCTTGCTCCTGCC
D22 RV 1  5′-GGGCATCGGTTGTGGTCTTGCTCCTGCCTTTCCAGCAGCTGGTTGCGGGTGATGATGGTG
D23 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTCGGGACATCACCACCATCATCACCCGCAACCAGC
D27 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTAGGGACCGCAACCAGCTGCTATCAAGGCAGGAG
D33 RV 1  5′-GGGCATCGGTTGTGGTCTTGCTCCTGCCTTGATAGCAGCTGGTTGCGG
D28 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTAGGGACCGCAACCAGCTGCTGGAAAGGCAGGAG
D34 RV 1  5′-GGGCATCGGTTGTGGTCTTGCTCCTGCCTTTCCAGCAGCTGGTTGCGG
D29 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTCGGGACCGCAACCAGCTGCTATCAAGGCAGGAG
8mer Library FW  5′-CATCACCCTCAACCGNNSNNSNNSNNSNNSNNSNNSNNSCCTCAACCGATGCC
8mer RV Ext 1  5′-TTA ATAGCCGGTG GGCATCGGTTGAGG
8mer FW Ext 1  5′-CTATTTACAATTACAATGCATCACCACCATCATCACCCTCAACCG


Spec PoP RV short 1  5′-TTAATAGCCGGTGGGCATCG
8mer FW Ext 2  5′-TCTAATACGACTCACTATAGGGACAATTACTATTTACAATTACAATGC
8mer Library FW 2  5′-CATCACCCGCAACCGNNSNNSNNSNNSNNSNNSNNSNNSCCACAGCCTATGCC
8mer RV Ext 4  5′-TTAATAGCCGGTGGGCATAGGCTGTG
8mer FW Ext 5  5′-CTATTTACAATTACAATGCATCACCACCATCATCACCCGCAACCG
8mer RV Ext 5  5′-TTAATAGCCGGTGGGCATAG
Spec RT (PQPMP) V6  5′-TTTTTTTTTTTTTTTCGGCATAGGCTGTGG
Spec RT (PQPMP) V9  5′-TTTTTTTTTTTTTTTCGGCATAGGC
T7 FW 1  5′-TAATACGACTCACTATAGGG
pPaFT RV  5′-TGGTCCGGCGGAGGGGATTTG
8mer V4 FW 1  5′-TAGTAAGGAGGACAGCTAAATGGGCTCGCATCACCACCATCATCACCCGCAACCG
NGS FW V4  5′-CATCATCACCCGCAAC
NGS RV V4  5′-GCCGGTGGGCATAGG
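The library primers in this table randomize each octamer position with the degenerate codon NNS (N = A, C, G, or T; S = C or G). As a hedged illustration of what that degeneracy encodes, the following Python sketch enumerates the NNS codons and translates them with the standard genetic code. The function and variable names are hypothetical and not part of the original work.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand_nns():
    """Enumerate every codon matching NNS (any base, any base, C or G)."""
    return ["".join(codon) for codon in product("ACGT", "ACGT", "CG")]


nns_codons = expand_nns()
encoded = {CODON_TABLE[codon] for codon in nns_codons}
amino_acids = encoded - {"*"}                      # "*" marks a stop codon
stop_codons = [c for c in nns_codons if CODON_TABLE[c] == "*"]

print(len(nns_codons), "codons,", len(amino_acids), "amino acids, stops:", stop_codons)
```

The only stop codon reachable through NNS is TAG (UAG in the mRNA), which is consistent with the separate ΔUAG variant of the nnAA octamer library listed among the constructs.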


Table S3.3: Nucleotide sequence of all mRNA display constructs in this chapter.

5′ end of constructs

Construct T7 promoter Enhancers Start nnAA His6 D1 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D2 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D3 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D4 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D5 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D6 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D7 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D8 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC First octamer library 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC Second octamer library 5′-TCTAATACGACTCACTATAGGG ACAATTACTATTTACAATTACA ATG CATCACCACCATCATCAC D9 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTAGGGA CATCACCACCATCATCAC D10 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTAGGGA CATCACCACCATCATCAC D11 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTCGGGA CATCACCACCATCATCAC D12 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTAGGGA D13 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTAGGGA D14 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTCGGGA nnAA octamer library 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTAG CATCACCACCATCATCAC nnAA octamer library 5′-TCTAATACGACTCACTATAGGG TTAACTTTAGTAAGGAGGACAGCTAA ATG GGCTCG CATCACCACCATCATCAC ΔUAG


Continued from above (3′ end of constructs)

Construct Frame Target Frame-Linker Cross-link Stop D1 GCAGCTATCAAGGCAGGAGCAAGG ATGGGAATGTCTGGATCTGG CACCGGCTAT TAA-3′ D2 GCAGCTGGAAAGGCAGGAGCAAGG ATGGGAATGTCTGGATCTGG CACCGGCTAT TAA-3′ D3 GCAGCTGGAAAGGCAGGAGCAAGG ATGCCTATGTCTCCTTCTGG CACCGGCTAT TAA-3′ D4 CCTCAACCG GCTGCTATCAAGGCAGGAGCAAGA CCTCAACCGATGCC CACCGGCTAT TAA-3′ D5 CCTCAACCG GCTGCTGGAAAGGCAGGAGCAAGA CCTCAACCGATGCC CACCGGCTAT TAA-3′ D6 CCTCAACCGATGCC CACCGGCTAT TAA-3′ D7 ATG CCTCAACC CACCGGCTAT TAA-3′ D8 CCTCAACCG GGAGGA CCTCAACCGATGCC CACCGGCTAT TAA-3′ First octamer library CCTCAACCG (NNS)^8 CCTCAACCGATGCC CACCGGCTAT TAA-3′ Second octamer library CCGCAACCG (NNS)^8 CCACAGCCTATGCC CACCGGCTAT TAA-3′ D9 CCGCAACCA GCTGCTATCAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ D10 CCGCAACCA GCTGCTGGAAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ D11 CCGCAACCA GCTGCTATCAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ D12 CCGCAACCA GCTGCTATCAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ D13 CCGCAACCA GCTGCTGGAAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ D14 CCGCAACCA GCTGCTATCAAGGCAGGAGCAAGA CCACAACCGATGCC CACCGGCTAT TAAGTGACTGA-3′ nnAA octamer library CCGCAACCG (Trimer)^8 CCACAGCCTATGCC CACCGGCTAT TAA-3′ nnAA octamer library CCGCAACCG (Trimer)^8 CCACAGCCTATGCC CACCGGCTAT TAA-3′ ΔUAG


Table S3.4: Next-Generation Sequencing reads from two rounds of selection that were used in analysis.

Sample    Digestion               Round    Initial reads  Processed reads  % Used
Controls  Chemically synthesized  -        22,002,271     14,503,347       66%
Controls  Immobilized             -        22,432,133     14,215,161       63%
SpeB      Low                     Round 1  22,165,915     14,724,833       66%
SpeB      Low                     Round 2  22,550,106     14,451,007       64%
SpeB      Intermediate            Round 1  21,989,882     13,954,913       63%
SpeB      Intermediate            Round 2  21,528,255     14,050,077       65%
SpeB      High                    Round 1  21,224,393     13,954,172       66%
SpeB      High                    Round 2  21,857,399     14,336,110       66%
FXa       Low                     Round 1  21,740,644     13,446,976       62%
FXa       Low                     Round 2  22,973,098     14,380,069       63%
FXa       Intermediate            Round 1  27,425,979     16,605,287       61%
FXa       Intermediate            Round 2  21,043,985     12,835,348       61%
FXa       High                    Round 1  22,167,802     14,291,485       64%
FXa       High                    Round 2  22,077,234     12,958,423       59%
A17       Low                     Round 1  21,795,672     13,795,792       63%
A17       Low                     Round 2  22,370,194     13,768,261       62%
A17       High                    Round 1  21,785,137     13,470,161       62%
A17       High                    Round 2  21,587,761     13,329,556       62%


Figure S3.1: Transcription product and hammerhead ribozyme activation of pPaFT.


Chapter 4

Chapter 4: Towards engineering proteases with multiple novel subsite specificities using mRNA display

The following is a set of unpublished experiments working towards applying mRNA display to engineer novel specificity. I performed all experiments.

4.1 Overview

Proteases have valuable attributes that could make them powerful therapeutics. They can have rapid turnover, discerning specificity, highly tuned activity, and resilience. However, our ability to engineer novel specificities has been limited to single subsites at a time. To create proteases capable of therapeutic interventions, such as degrading deadly superantigen proteins during an invasive bacterial infection, we will need to engineer multiple subsites simultaneously. In this chapter, I present my work towards creating a new method for engineering therapeutic activity of the broad specificity protease streptopain. I identified an ideal site in the superantigen SpeA where cleavage would result in an inactive toxin. Further, I demonstrated that streptopain can be mRNA displayed in different formats. I have outlined the future work for this project towards the eventual goal of engineering therapeutic proteases.

4.2 Introduction

Current methods of engineering protease specificity include structure-based computational design and directed evolution, which have been used independently and synergistically to modify specificities for single subsites2, 47-50, as reviewed in Section 1.4. Unfortunately, these methods have significant drawbacks, as described in Section 1.4.4. Our aim was to overcome these limitations by using mRNA display to select for multiple subsite specificity against a therapeutic target (Figure 4.1). Through this method, we can select from ten million times more variants than were screened previously to evolve novel protease specificities47, 48. Therefore, mRNA display is orders of magnitude more likely to identify potentially very rare enzyme variants compared to conventional screening technologies. In addition, mRNA display is performed entirely in vitro, which removes potential protease-host toxicity issues, and is capable of selecting against a full-length protein substrate.

Figure 4.1: mRNA display selection scheme for novel protease specificity. A DNA library (A) encoding protease mutants is transcribed into RNA, linked to puromycin, and in vitro translated to yield a library of protease variants that are each covalently linked to their encoding mRNA (B). This library of up to 10^12 mRNA-displayed proteins is reverse transcribed with a primer bearing the superantigen target to be degraded (C) and immobilized via click chemistry. Proteins that cleave the superantigen will thereby cut off the anchor group, releasing their cDNA from immobilization (D). Selected cDNA sequences are amplified by PCR and sequenced or used as input for additional rounds of mutagenesis and selection.

Our initial therapeutic target is the superantigen Streptococcal pyrogenic exotoxin A (SpeA), introduced in Section 1.2.4. Superantigens have shown natural resistance to proteolytic degradation16 and there is no known protease that neutralizes the SpeA toxin. Streptopain, a cysteine protease produced by S. pyogenes and introduced in Chapter 2, was shown to degrade SmeZ (the most potent superantigen) to biological inactivity yet did not degrade SpeA172 despite over 50% sequence identity and a very similar fold. This natural specificity makes streptopain an ideal prototype for engineering a protease specific for SpeA. It is not known why S. pyogenes produces a protease that cleaves its own superantigen toxins, but other likely functions of streptopain include host defense, bacterial life cycle modulation, or bacterial dissemination128, 142.


Structural characterization of streptopain with and without the cysteine protease inhibitor E-64 revealed that E-64 binds to the active site of streptopain, and implicated several regions near the active site as potentially important in substrate binding173. In addition, an NMR structure visualized a highly disordered loop at the C-terminus, which appears to act as an arm to hold on to the substrate region142. Further, a point mutation in this flexible loop region led to a drastic drop in activity142. Combining the known structural data and point mutation activity data, 26 residues are most likely important for substrate binding142.

4.3 Results

4.3.1 Identifying an ideal SpeA cleavage target for proteolytic inactivation

Superantigens must interact with both the T-cell receptor (TCR) and the class II Major Histocompatibility Complex (MHC II) in order to function17. These two binding sites and the resulting toxicities for SpeA have been well-elucidated from point-mutation studies174-181, so we used these data to identify regions where disruption should eliminate activity. We identified four surface exposed loops, based on the published crystal structures of SpeA174, 179-182, that are near residues that are critical for activity. Surface exposed loops are the regions most likely to be accessible to a protease because they are less protected by the substrate’s three-dimensional structure. The peptide sequences of these four targets (Table 4.1) were chemically synthesized, incubated with purified streptopain, and analyzed by mass spectrometry. No detectable cleavage was observed for any of the targets. Therefore, we evaluated the relevant features of each of the sites (Table 4.1) and opted to further explore the loop 1 target. The loop 1 target peptide (YNVSGPNYDK) has the largest number of functionally important residues nearby: 6 residues critical for TCR binding and 8 residues critical for MHC II binding.


Table 4.1: Evaluating potential target sequences in SpeA most likely to result in loss of toxin function upon cleavage. The last three columns give the number of nearby amino acids important for binding to the TCR, MHC II, or Zn2+ site.

Target  SpeA residues  Peptide sequence  TCR  MHC II  Zn2+ site
Loop 1  48-57          YNVSGPNYDK        6    8       0
Loop 2  105-112        NHEGNHLE          0    1       2
Loop 3  160-167        YTNGPSKY          3    3       0
Loop 4  175-182        IPKNKESF          1    3       0
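The comparison in Table 4.1 can be restated as a small Python sketch. The values are copied from the table; the scoring rule (the sum of nearby TCR and MHC II residues) is an assumption made here for illustration, since the text only states that loop 1 has the largest number of functionally important residues nearby. All names are hypothetical.

```python
from typing import NamedTuple


class LoopTarget(NamedTuple):
    name: str
    first_residue: int
    last_residue: int
    peptide: str
    tcr: int    # nearby residues important for TCR binding
    mhc2: int   # nearby residues important for MHC II binding
    zn: int     # nearby residues in the Zn2+ site


# Values copied from Table 4.1.
TARGETS = [
    LoopTarget("Loop 1", 48, 57, "YNVSGPNYDK", 6, 8, 0),
    LoopTarget("Loop 2", 105, 112, "NHEGNHLE", 0, 1, 2),
    LoopTarget("Loop 3", 160, 167, "YTNGPSKY", 3, 3, 0),
    LoopTarget("Loop 4", 175, 182, "IPKNKESF", 1, 3, 0),
]


def binding_score(target):
    """Assumed score: residues near the loop that matter for TCR or MHC II binding."""
    return target.tcr + target.mhc2


def span_matches_peptide(target):
    """Consistency check: the residue range should match the peptide length."""
    return target.last_residue - target.first_residue + 1 == len(target.peptide)


ranked = sorted(TARGETS, key=binding_score, reverse=True)
for target in ranked:
    print(target.name, target.peptide, binding_score(target))
```

Under this assumed score, loop 1 ranks first, matching the choice made in the text; a different weighting (for example, one that includes the Zn2+ site) could be substituted in `binding_score`.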

We then confirmed that the target peptide sequence was not naturally cleavable by streptopain. If streptopain could already cleave the peptide, then this would not be an ideal peptide sequence to evolve activity against. The loop 1 sequence YNVSGPNYDK and a derived sequence VSGPNY were mRNA displayed in the PQP frame motif described in Section 3.3.1 (Figure 4.2A). These fusions were immobilized to Ni-NTA, digested with streptopain, and eluted. The full (D15) or truncated loop 1 (D16) target peptide sequences were not digested, similar to the negative control (D5) peptide sequence (Figure 4.2B). Therefore, streptopain does not have significant natural activity towards this sequence and it is a good target to evolve activity against.

Figure 4.2: mRNA display of loop 1 targets reveals little natural activity. Ni-NTA immobilized fusions displaying the full or truncated loop 1 peptide sequence were digested with streptopain and eluted. (A) The amino acid sequence of all constructs. (B) Elution from Ni-NTA after incubation with streptopain.

4.3.2 Rabbit toxicity experiments verified therapeutic value of target cleavage

Next, we aimed to verify that cleavage at our desired target would result in inactive fragments of SpeA. Therefore, we cloned, expressed, and purified wild type SpeA and the two fragments of SpeA that would result from cleavage in our desired target region (loop 1). These were purified via N-terminal His6 tags, which were subsequently removed by thrombin, and purified further by size exclusion chromatography prior to use. Wild type SpeA, wild type SpeA incubated with streptopain, and combined complementary fragments of SpeA were sent to our collaborator Dr. Schlievert at the University of Iowa. He performed rabbit toxicity experiments with these protein mixtures and found that streptopain was unable to inactivate the toxicity of SpeA, and that the complementary SpeA halves were not toxic (Table 4.2). Therefore, the loop 1 region would be an ideal target for evolving therapeutic proteolytic activity.

Table 4.2: Toxicity tests in rabbits verified that toxin fragments that would result from cleavage in loop 1 were inactive.

Condition                  Number alive / total  Conclusion
SpeA                       0/3                   Purified SpeA is toxic
SpeA + streptopain         0/3                   Streptopain cannot inactivate SpeA
SpeA complementary halves  2/2                   Complementary fragments are not toxic

4.3.3 Streptopain can be mRNA displayed as zymogen or maturated lengths

Inside its host, streptopain is initially translated as a 371 amino acid zymogen, which is subsequently activated by cleaving off the N-terminal 118 amino acids, leaving the mature streptopain of 253 amino acids (Figure 4.3A). This poses a few possible challenges for mRNA display. First, this is a very large construct for mRNA display. The efficiency of mRNA display decreases for constructs over 300 amino acids183. After adding an affinity tag and mandatory linker regions, the final construct for the zymogen is 386 amino acids long. Second, the linker region commonly used in mRNA display contains a conserved MGMSGSG-TGY sequence at the C-terminus. This sequence is likely to be cleaved by streptopain, as shown in our early octamer digestion experiments in Section 3.3.1. Therefore, we replaced this linker region with the PQPMP-TGY linker that was successfully used for most octamer experiments described in Chapter 3. We expect this linker not to be cleaved by streptopain. To verify that streptopain can be mRNA displayed, we constructed designs S1-S2 composed of streptopain in zymogen or mature forms framed by a His6 affinity tag and PQPMP-TGY conserved region (Figure 4.3B). Upon mRNA display, a 100 μL scale translation in rabbit reticulocyte lysate yielded over 10^11 fusions after oligo(dT) purification, and the purified fusions appeared as a clear band by SDS-PAGE (Figure 4.3C). This is comparable to good mRNA display yields159; therefore, expression and purification yields will not be a challenge for the mRNA display of streptopain.
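The construct lengths quoted above can be checked with a short calculation. The sketch below assumes that the displayed construct consists of a start methionine, the His6 tag, the protease, and the eight-residue PQPMP-TGY region; this composition is inferred from the construct description and reproduces the stated 386 amino acids for the zymogen. The length of the mature construct is derived under the same assumption and is not stated in the text.

```python
# Lengths stated in the text.
ZYMOGEN_AA = 371        # streptopain zymogen
PRO_PEPTIDE_AA = 118    # N-terminal region removed on activation

# Assumed construct elements (inferred from the construct description).
START_MET_AA = 1
HIS6_TAG_AA = 6
LINKER = "PQPMPTGY"     # PQPMP-TGY conserved region


def construct_length(protease_aa):
    """Total displayed length: Met + His6 + protease + PQPMP-TGY."""
    return START_MET_AA + HIS6_TAG_AA + protease_aa + len(LINKER)


mature_aa = ZYMOGEN_AA - PRO_PEPTIDE_AA
zymogen_construct_aa = construct_length(ZYMOGEN_AA)
mature_construct_aa = construct_length(mature_aa)

print("mature streptopain:", mature_aa, "aa")
print("zymogen construct:", zymogen_construct_aa, "aa")
print("mature construct:", mature_construct_aa, "aa")
```

Under these assumptions, the zymogen construct exceeds the 300 amino acid range in which display efficiency begins to fall, while the mature construct stays below it.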

Figure 4.3: Streptopain can be mRNA displayed in zymogen or mature forms. Wild type streptopain in zymogen or mature lengths was framed with an N-terminal His6 affinity tag and a C- terminal linker region for mRNA display.


4.4 Discussion

Superantigens have shown natural resistance to proteases. Therefore, we attempted to maximize the chances of evolving a therapeutically relevant protease by confirming that cleavage at the target loop would inactivate the toxin. If we successfully evolve activity against this region, the resulting protease should consequently be able to neutralize the toxin. Accordingly, we determined that cleavage in the loop 1 target region should disrupt sites known to be critical for SpeA function, that this region is not currently a preferred substrate of streptopain, and that the fragments of SpeA that would result from cleavage are not toxic in vivo. Therefore, the loop 1 peptide is an ideal target for future directed evolution efforts.

Next, we needed to confirm that streptopain can be mRNA displayed with acceptable efficiency. We found that a small scale pilot was already capable of producing over 10^11 fusions for either the zymogen or maturated protease. This is advantageous because the protease would ideally be produced as a zymogen to minimize activity prior to selection, then immobilized via the target substrate to be evolved against, and finally activated with a small quantity of mature streptopain to cleave off the N-terminal pro-peptide. Because the full-length zymogen can form fusions with high efficiency, background activity can be reduced by not allowing any protease activity until the selection step.

4.5 Future work

4.5.1 Build a library of streptopain variants

We will generate the initial streptopain library by randomizing the 26 putative specificity-conferring residues142. Mutagenic DNA cassettes will be recombined with the wild type streptopain gene using PCR, restriction digestion, ligation, and purification as described previously66 to produce the final streptopain variant library.


4.5.2 Selection of SpeA-cleaving streptopain variants

The fusions will be immobilized via a reverse transcription (RT) primer that contains the SpeA loop 1 target peptide. The crucial selection step in our directed evolution scheme will be the cleavage of that peptide (Figure 4.1). We have designed a modular RT primer that links an oligonucleotide with the target peptide (via a maleimide-cysteine linkage), flexible polyethylene glycol linkers, and an immobilization anchor alkyne (Figure 4.4A). The RT primers will be used to reverse-transcribe the fusions and to immobilize the resulting complex to azide-modified beads via click chemistry. Washing will then remove all unbound fusions.

Figure 4.4: Reverse transcription primers to be used in engineering selection. The oligonucleotide priming the reverse transcription will be covalently linked by the reaction between maleimide and cysteine of the chemically synthesized target peptide and alkyne immobilization anchor compound (A). The two polyethylene glycol linkers provide flexibility. Alternatively, the full-length target protein (SpeA, expressed in E. coli) (B) carrying a His tag anchor will be linked to the oligonucleotide-maleimide through its only free cysteine.

The inactive mRNA-displayed streptopain zymogen will be incubated with separately purified wild type streptopain, cleaving off the pro-domain and activating all immobilized streptopain variant fusions. Active streptopain variants capable of cleaving the SpeA target region will release themselves from immobilization and be collected in the flow-through. Their DNA will be amplified by PCR for additional rounds of selection.


4.5.3 Optimize streptopain variant to cleave full-length SpeA protein

We will mRNA display the library resulting from the initial selection, which will be enriched for variants able to cleave the loop 1 target peptide, and select it against a full-length, folded SpeA protein target (Figure 4.4B). This will be performed to fine-tune cleavage specificity so that the protease can accommodate the 3-dimensional structure of SpeA and successfully cleave the protein at the desired target region. The maleimide moiety of the RT oligonucleotide will be reacted with E. coli expressed and purified SpeA using the only free cysteine residue, which is C-terminal to our desired cleavage loop. After RT, we will immobilize the mRNA-displayed streptopain complex via the N-terminal His tag of SpeA to a Ni-NTA resin. Cleavage at our desired loop will elute active variants. This approach of selecting for cleavage of the full-length SpeA will be taken only after initial activity has been created towards the target region of SpeA, to minimize the risk of selecting for protease activity towards other regions that might not yield inactive fragments.

4.6 Conclusions

The experiments described here are laying the groundwork for evolving streptopain to have therapeutic activity against the superantigen SpeA. The first key steps of identifying an evolution target and demonstrating that streptopain can be mRNA displayed with high efficiency of fusion-formation have been accomplished. Next, the library generation and subsequent selection for therapeutic specificity can be pursued rapidly.

4.7 Materials and methods

All chemicals were purchased from Sigma-Aldrich unless otherwise stated.

4.7.1 Construction of control designs

The DNA for each control peptide sequence was synthesized as primers and assembled by a 2-step PCR assembly. The PCR primers used for each construction are shown in Table S4.1, the primer sequences are shown in Table S4.2, the nucleic acid sequences for all peptide constructs are in Table S4.3, and the nucleic acid sequences for both streptopain constructs are in Table S4.4. Streptopain constructs S1 and S2 were prepared for mRNA display by two PCR steps, amplifying from wild type streptopain. The first PCR used primers “SpeB MD FW Ext 1” and “SpeB MD RV Ext 3” for S1 and primers “SpeB MD FW Ext 3” and “SpeB MD RV Ext 3” for S2. The second PCR used primers “SpeB MD FW Ext 2” and “SpeB MD RV Ext 4” for both designs. The final PCR product was purified by agarose gel extraction.

4.7.2 mRNA display of control designs and digestion with streptopain

The transcription, cross-linking, translation using rabbit reticulocyte lysate, and oligo(dT) purification of D4-5, D15-16, and S1-2 were performed as described previously159. Oligo(dT) purified fusions D4-5 and D15-16 were immobilized to Ni-NTA agarose resin and digested with streptopain identically to the described procedure in Section 3.7.5.

4.7.3 Cloning, expression, purification, and digestion of SpeA, SpeA fragments

The full length gene for SpeA was commercially synthesized by GenScript. The structural gene was PCR-amplified to add restriction sites with primers “SpeA NdeI FW” and “SpeA XhoI RV”. The amplicon was digested with NdeI and XhoI and cloned into pET15b. The plasmid was transformed into BL21 Rosetta cells. A culture of 50 mL of standard LB medium and 100 μg/mL Ampicillin was inoculated with a single colony and grown overnight at 37°C. 15 mL of the overnight culture was used to inoculate 1 L of LB medium. The culture was grown at 37°C for ~2 hours until OD600 ~0.6, induced with 1 mM IPTG, and incubated for an additional 4 hours. Appropriate precautions should be taken for safe handling of the superantigen toxin during expression, purification, and manipulation. Cells were divided into 250 mL aliquots, collected by centrifugation, and frozen at -80°C. The superantigen was purified by standard Ni-NTA His6 affinity tag purification, and dialyzed overnight against 2 L of 20 mM Tris, 150 mM NaCl.


The halves of SpeA were cloned by PCR-amplifying the appropriate length of SpeA from the wild type construct above. The first half fragment was PCR-amplified with primers “SpeA iFW NdeI 2” and “SpeA L1 RV BamHI 1” and the second half fragment used primers “SpeA L1 FW NdeI 1” and “SpeA RV BamHI 1”. These were purified by agarose gel extraction, digested with NdeI and XhoI, and cloned into pET15b. The plasmids were transformed into BL21 Rosetta cells. The first half SpeA fragment was expressed identically to full length SpeA. The second half SpeA fragment was grown in a culture of 50 mL of standard LB medium, 100 μg/mL Ampicillin, and 34 μg/mL Chloramphenicol at 37°C overnight. 400 μL of the overnight culture was added to 200 mL of autoinduction media148, grown at 37°C for ~3 hours until OD600 ~ 0.6, then grown at 25°C for 24 hours. Cells were collected by centrifugation, and frozen at -80°C. For each of the fragment constructs, a denaturing Ni-NTA affinity purification was performed to obtain soluble protein. The frozen 200 mL or 1 L cell pellet was resuspended in 10 mL denaturing binding buffer (20 mM Tris-HCl, 150 mM NaCl, 6 M guanidine). The cells were lysed by sonication, and the lysate was centrifuged to separate the insoluble fraction. The cleared lysate was incubated with 1 mL of Ni-NTA agarose resin at 4°C for >30 minutes. The resin was washed with denaturing binding buffer, then decreasing concentrations of guanidine binding buffer (6 M, 4.5 M, 3 M, 1.5 M, 1 M, 0.5 M, 0.3 M, 0 M), then eluted with 250 mM imidazole. The SpeA fragments were dialyzed against 2 L of 1x PBS overnight.

Purified SpeA and the purified SpeA fragments had thrombin-removable His6 affinity tags. These were removed by incubating with 1:50 or 1:100 ratio of agarose resin immobilized thrombin to target protein at 4-25°C for 1 hour to overnight. The thrombin-immobilized agarose was separated by 0.45 μm Spinfilter separation. The samples were purified from their cleavage products by size exclusion chromatography using a Superdex 75 column. The SpeA-containing fractions were pooled and dialyzed overnight against 2 L of 1x PBS. The first and second fragment halves of SpeA were mixed at an equal molar ratio.


Purified SpeA (170 μg) was incubated with streptopain (17 μg) in 3 mL total volume of 1x PBS with 5 mM DTT, 5 mM EDTA for 8 hours at 37°C. The reaction was stopped by adding 28 μM E-64. The mixture was dialyzed against 2 L of 1x PBS for 16 hours.

4.7.4 Rabbit toxicity experiments

All rabbit toxicity experiments were performed by our collaborator, Dr. Patrick Schlievert, at the University of Iowa. SpeA, SpeA incubated with streptopain, or SpeA complementary fragments (5 μg/kg) were administered intravenously in the marginal ear vein of American Dutch belted rabbits (1.5-2 kg). Three rabbits were tested for each control condition and two rabbits for the fragment condition. After 4 hours, the rabbits were injected with 0.15 μg/kg lipopolysaccharide (LPS) from Salmonella enterica serovar Typhimurium via the same route. In this animal model, LPS is used to potentiate the effects of SpeA. Pyrogenicity was determined by monitoring body temperature at 0 hours and 4 hours, and survival of rabbits after LPS treatment was monitored over 48 hours.
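As a worked illustration of the dosing above, the following Python sketch converts the per-kilogram doses into absolute amounts for the stated range of rabbit body mass. It is only an arithmetic aid; the function name is hypothetical, and the per-animal amounts are derived here rather than reported in the text.

```python
SPEA_DOSE_UG_PER_KG = 5.0    # SpeA, SpeA + streptopain, or SpeA fragments
LPS_DOSE_UG_PER_KG = 0.15    # lipopolysaccharide given 4 hours later
BODY_MASS_RANGE_KG = (1.5, 2.0)


def absolute_dose_ug(dose_ug_per_kg, body_mass_kg):
    """Amount to administer (in micrograms) for an animal of the given mass."""
    return dose_ug_per_kg * body_mass_kg


for mass_kg in BODY_MASS_RANGE_KG:
    spea_ug = absolute_dose_ug(SPEA_DOSE_UG_PER_KG, mass_kg)
    lps_ug = absolute_dose_ug(LPS_DOSE_UG_PER_KG, mass_kg)
    print(f"{mass_kg} kg rabbit: {spea_ug:.2f} ug SpeA, {lps_ug:.3f} ug LPS")
```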


4.8 Supplementary information

Table S4.1: Primers used in PCR construction or amplification of controls.

PCR 1 PCR 2 Construct Forward primer Reverse primer Forward primer Reverse primer D4 Spec FW Pos cntrl 4 Spec RV Pos cntrl 5 Spec RV Pos cntrl 5 D5 Spec FW Neg cntrl 5 Spec RV Neg cntrl 6 Spec RV Neg cntrl 6 Spec FW EXT 1 D15 Spec FW Loop1 4 Spec RV Loop1 2 Spec RV Loop1 2 D16 Spec FW Loop1 5 Spec RV Loop1 3 Spec RV Loop1 3 SpeA SpeA NdeI FW SpeA XhoI RV SpeA first half SpeA iFW NdeI 2 SpeA L1 RV BamHI 1 SpeA second half SpeA L1 FW NdeI 1 SpeA RV BamHI 1 S1 SpeB MD FW Ext 1 SpeB MD RV Ext 3 SpeB MD FW Ext 2 SpeB MD RV Ext 4 S2 SpeB MD FW Ext 3


Table S4.2: Nucleotide sequence of all primers used in this chapter.

Primer  Sequence
Spec FW EXT 1  5′-TCTAATACGACTCACTATAGGGACAATTACTATTTACAATTACAATGCATCACCACCATC
Spec FW Pos cntrl 4  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGGCTGCTATCAAGGCAGGAGC
Spec RV Pos cntrl 5  5′-TTAATAGCCGGTGGGCATCGGTTGAGGTCTTGCTCCTGCCTTGATAGCAGC
Spec FW Neg cntrl 5  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGGCTGCTGGAAAGGCAGGAGC
Spec RV Neg cntrl 6  5′-TTAATAGCCGGTGGGCATCGGTTGAGGTCTTGCTCCTGCCTTTCCAGCAGC
Spec FW Loop1 4  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGTACAACGTGAGCGGACC
Spec RV Loop1 2  5′-TTAATAGCCGGTGGGCATCGGTTGAGGCTTGTCGTAATTAGGTCCGCTCACG
Spec FW Loop1 5  5′-CAATTACAATGCATCACCACCATCATCACCCTCAACCGGTGAGCGGACCTAATTAC
Spec RV Loop1 3  5′-TTAATAGCCGGTGGGCATCGGTTGAGGGTAATTAGGTCCGCTCAC
SpeA NdeI FW  5′-GGAACCCATATGGAAAACAATAAAGAAGTATTGAAG
SpeA XhoI RV  5′-GCACCTCGAGTTACTTGGTTGTTAGGTAGACTTC
SpeA iFW NdeI 2  5′-CCGCGCGGCAGCCATATGCAACAAGACCCCGATCC
SpeA L1 RV BamHI 1  5′-GCAGCCGGATCCTCGAGTTACCCTGAAACATTATATATTAAATCG
SpeA L1 FW NdeI 1  5′-CCGCGCGGCAGCCATATGCCAAATTATGATAAATTAAAAAC
SpeA RV BamHI 1  5′-GCAGCCGGATCCTCGAGTTACTTGGTTGTTAGGTAGACTTC
SpeB MD FW Ext 1  5′-CTATTTACAATTACAATGCATCACCACCATCATCACGATCAAAACTTTGCTCGTAACG
SpeB MD RV Ext 3  5′-TTAATAGCCGGTGGGCATCGGTTGAGGAGGTTTGATGCCTACAACAGC
SpeB MD FW Ext 3  5′-CTATTTACAATTACAATGCATCACCACCATCATCACCAACCAGTTGTTAAATCTCTCC
SpeB MD FW Ext 2  5′-TCTAATACGACTCACTATAGGGACAATTACTATTTACAATTACAATGCATCACC
SpeB MD RV Ext 4  5′-TTAATAGCCGGTGGGCATCG


Table S4.3: Nucleotide sequence of all mRNA display peptide controls in this chapter.

5′ end of constructs

Construct  T7 promoter                Enhancers               Start  His6
D4         5′-TCTAATACGACTCACTATAGGG  ACAATTACTATTTACAATTACA  ATG    CATCACCACCATCATCAC
D5         5′-TCTAATACGACTCACTATAGGG  ACAATTACTATTTACAATTACA  ATG    CATCACCACCATCATCAC
D15        5′-TCTAATACGACTCACTATAGGG  ACAATTACTATTTACAATTACA  ATG    CATCACCACCATCATCAC
D16        5′-TCTAATACGACTCACTATAGGG  ACAATTACTATTTACAATTACA  ATG    CATCACCACCATCATCAC

Continued from above (3′ end of constructs)

Construct | Frame | Target | Frame-Linker | Cross-link | Stop
D4 | CCTCAACCG | GCTGCTATCAAGGCAGGAGCAAGA | CCTCAACCGATGCC | CACCGGCTAT | TAA-3′
D5 | CCTCAACCG | GCTGCTGGAAAGGCAGGAGCAAGA | CCTCAACCGATGCC | CACCGGCTAT | TAA-3′
D15 | CCTCAACCG | TACAACGTGAGCGGACCTAATTACGACAAG | CCTCAACCGATGCC | CACCGGCTAT | TAA-3′
D16 | CCTCAACCG | GTGAGCGGACCTAATTAC | CCTCAACC | CACCGGCTAT | TAA-3′
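As a worked illustration of how the segments in Table S4.3 fit together, the Python sketch below joins the D4 segments, translates the open reading frame with the standard genetic code, and checks that the reverse complement of primer Spec RV Pos cntrl 5 (Table S4.2) matches the 3′ end of the construct. It is a sketch under stated assumptions rather than a description of software used in this work: the helper names `translate` and `reverse_complement` are hypothetical, and translation is assumed to begin at the listed start codon.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna: str) -> str:
    """Translate DNA in frame from its first base, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


# D4 segments copied from Table S4.3, in 5' to 3' order.
D4_SEGMENTS = [
    ("T7 promoter", "TCTAATACGACTCACTATAGGG"),
    ("Enhancers", "ACAATTACTATTTACAATTACA"),
    ("Start", "ATG"),
    ("His6", "CATCACCACCATCATCAC"),
    ("Frame", "CCTCAACCG"),
    ("Target", "GCTGCTATCAAGGCAGGAGCAAGA"),
    ("Frame-Linker", "CCTCAACCGATGCC"),
    ("Cross-link", "CACCGGCTAT"),
    ("Stop", "TAA"),
]
construct = "".join(seq for _, seq in D4_SEGMENTS)

# The open reading frame begins at the start codon, after promoter and enhancers.
orf_start = len(D4_SEGMENTS[0][1]) + len(D4_SEGMENTS[1][1])
peptide = translate(construct[orf_start:])
target_peptide = translate(dict(D4_SEGMENTS)["Target"])

# Reverse primer from Table S4.2; its reverse complement should end the construct.
SPEC_RV_POS_CNTRL_5 = "TTAATAGCCGGTGGGCATCGGTTGAGGTCTTGCTCCTGCCTTGATAGCAGC"
primer_matches = construct.endswith(reverse_complement(SPEC_RV_POS_CNTRL_5))

print(peptide, target_peptide, primer_matches)
```

Replacing the Target entry with the D5, D15, or D16 sequence gives the corresponding translation for the other constructs.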


Table S4.4: Nucleotide sequence of all mRNA display streptopain controls in this chapter.

5′ end of construct

Construct | T7 promoter | Enhancers | Start | His6
S1 | 5′-TCTAATACGACTCACTATAGGG | ACAATTACTATTTACAATTACA | ATG | CATCACCACCATCATCAC
S2 | 5′-TCTAATACGACTCACTATAGGG | ACAATTACTATTTACAATTACA | ATG | CATCACCACCATCATCAC

Continued from above (3′ end of constructs)

Construct S1
Streptopain:
GATCAAAACTTTGCTCGTAACGAAAAAGAAGCAAAAGATAGCGCTATCACATTTATCCAAAAATCAGCAGCTATCAAAGC
AGGTGCACGAAGCGCAGAAGATATTAAGCTTGACAAAGTTAACTTAGGTGGAGAACTTTCTGGCTCTAATATGTATGTTT
ACAATATTTCTACTGGAGGATTTGTTATCGTTTCAGGAGATAAACGTTCTCCAGAAATTCTAGGATACTCTACCAGCGGA
TCATTTGACGCTAACGGTAAAGAAAACATTGCTTCCTTCATGGAAAGTTATGTCGAACAAATCAAAGAAAACAAAAAATT
AGACACTACTTATGCTGGTACCGCTGAGATTAAACAACCAGTTGTTAAATCTCTCCTTGATTCAAAAGGCATTCATTACA
ACCAAGGTAACCCTTACAACCTATTGACACCTGTTATTGAAAAAGTAAAACCAGGTGAACAATCTTTTGTAGGTCAACAT
GCAGCTACAGGATGTGTTGCTACTGCAACTGCTCAAATTATGAAATATCATAATTACCCTAACAAAGGGTTGAAAGACTA
CACTTACACACTAAGCTCAAATAACCCATATTTCAACCATCCTAAGAACTTGTTTGCAGCTATCTCTACTAGACAATACA
ACTGGAACAACATCCTACCTACTTATAGCGGAAGAGAATCTAACGTTCAAAAAATGGCGATTTCAGAATTGATGGCTGAT
GTTGGTATTTCAGTAGACATGGATTATGGTCCATCTAGTGGTTCTGCAGGTAGCTCTCGTGTTCAAAGAGCCTTGAAAGA
AAACTTTGGCTACAACCAATCTGTTCACCAAATTAACCGTAGCGACTTTAGCAAACAAGATTGGGAAGCACAAATTGACA
AAGAATTATCTCAAAACCAACCAGTATACTACCAAGGTGTCGGTAAAGTAGGCGGACATGCCTTTGTTATCGATGGTGCT
GACGGACGTAACTTCTACCATGTTAACTGGGGTTGGGGTGGAGTCTCTGACGGCTTCTTCCGTCTTGACGCACTAAACCC
TTCAGCTCTTGGTACTGGTGGCGGCGCAGGCGGCTTCAACGGTTACCAAAGTGCTGTTGTAGGCATCAAACCT
Frame-Linker: CCTCAACCGATGCC
Cross-link: CACCGGCTAT
Stop: TAA-3′

Construct S2
Streptopain:
CAACCAGTTGTTAAATCTCTCCTTGATTCAAAAGGCATTCATTACAACCAAGGTAACCCTTACAACCTATTGACACCTGT
TATTGAAAAAGTAAAACCAGGTGAACAATCTTTTGTAGGTCAACATGCAGCTACAGGATGTGTTGCTACTGCAACTGCTC
AAATTATGAAATATCATAATTACCCTAACAAAGGGTTGAAAGACTACACTTACACACTAAGCTCAAATAACCCATATTTC
AACCATCCTAAGAACTTGTTTGCAGCTATCTCTACTAGACAATACAACTGGAACAACATCCTACCTACTTATAGCGGAAG
AGAATCTAACGTTCAAAAAATGGCGATTTCAGAATTGATGGCTGATGTTGGTATTTCAGTAGACATGGATTATGGTCCAT
CTAGTGGTTCTGCAGGTAGCTCTCGTGTTCAAAGAGCCTTGAAAGAAAACTTTGGCTACAACCAATCTGTTCACCAAATT
AACCGTAGCGACTTTAGCAAACAAGATTGGGAAGCACAAATTGACAAAGAATTATCTCAAAACCAACCAGTATACTACCA
AGGTGTCGGTAAAGTAGGCGGACATGCCTTTGTTATCGATGGTGCTGACGGACGTAACTTCTACCATGTTAACTGGGGTT
GGGGTGGAGTCTCTGACGGCTTCTTCCGTCTTGACGCACTAAACCCTTCAGCTCTTGGTACTGGTGGCGGCGCAGGCGGC
TTCAACGGTTACCAAAGTGCTGTTGTAGGCATCAAACCT
Frame-Linker: CCTCAACCGATGCC
Cross-link: CACCGGCTAT
Stop: TAA-3′

Conclusions and Future Directions

Catalytic turnover and precise specificity give proteases tremendous potential for future medical applications. Currently, many therapeutics function by binding to a target at a stoichiometric ratio. If custom-engineered therapeutic proteases become a reality, a much smaller number of protease molecules could inactivate thousands of target molecules, defend against the myriad bacterial exotoxins, or tweak the natural proteome. This is a field with immense potential, albeit with correspondingly daunting challenges to overcome.

In my thesis, I worked towards inventing the methodology needed to create therapeutic proteases. Initially, I needed to obtain a large quantity of highly purified streptopain for the remainder of my thesis goals. Published methods were inadequate for my needs, so I designed, optimized, and executed an improved protocol for producing large quantities of purified and highly active streptopain, as described in Chapter 2. While this effort required significantly more troubleshooting than expected, the resulting method was such an improvement over published protocols that it was prepared as an independent publication.

I have created a robust method that can screen all possible substrate octamers in a single experiment. The final Next-Generation Sequencing results from two rounds of protease screening demonstrated the power of our technique. Factor Xa, an example of a narrow-specificity protease with a canonical preference, generated a map that confirmed the known specificity, yet also suggested alternative amino acid preferences that have not been explored previously. Our specificity map for ADAM17 confirmed its known broad specificity, as previously studied, and went further to identify disfavored amino acids.
Finally, applying our method to streptopain yielded a map that confirmed known specificities and also identified many disfavored amino acids. In the near future, the primary goal for the specificity screening effort will be to complete the mass spectrometry and computational analysis. While identifying protease prototypes as starting points for our engineering efforts was the initial goal that inspired the creation of this new technology, it will also be invaluable for furthering our understanding of the outputs from our engineering efforts. For the future applicability of therapeutic proteases, it will be critical to identify undesirable human cross-reactivity that needs to be removed in subsequent rounds of engineering. The method invented here may serve both of these goals. Furthermore, it could potentially be applied to understand the specificity of any protease.

Regarding my efforts to engineer therapeutic activity, I took the first steps towards using mRNA display to engineer protease specificity. While most of my work on this topic was project design and experimental planning, my completed experiments have fulfilled two critical tasks: identifying a suitable cleavage target sequence in SpeA to eventually inactivate this toxin, and demonstrating the feasibility of mRNA-displaying streptopain as either a zymogen or a matured protease. The next step of the engineering project will be to begin the selection of novel specificities derived from streptopain variants. I expect that modifying the specificity at multiple subsites of a broad-specificity protease will be achievable, but that refining the specificity to exclude unwanted cross-reactivities will be more challenging.

While streptopain has been studied for many years, its function still holds numerous mysteries. Because it is unknown which amino acids are responsible for specificity at each subsite, a large number of residues are predicted to be potentially important for specificity. This number of residues is too large to randomize and screen, even with mRNA display. Protease variants resulting from initial in vitro selection during the protease engineering project will help subsequent engineering efforts because they will likely reveal which amino acids are most critical for modifying specificity at each subsite.

In conclusion, I have contributed towards establishing two synergistic methods that will remove current barriers to the application of proteases as a new class of therapeutic drugs that act by proteolytic neutralization of exotoxins. Infectious disease sorely needs novel therapies due to the waning efficacy of antibiotics. Custom antibacterial proteases, including those for exotoxin defense, may be a reality in the foreseeable future. Beyond this, therapeutic proteases may be expanded to a broad range of applications, including autoimmune, inflammatory, or coagulation disorders.


Bibliography

1. Craik, C.S., Page, M.J. & Madison, E.L. Proteases as therapeutics. The Biochemical Journal 435, 1-16 (2011).

2. Pogson, M., Georgiou, G. & Iverson, B.L. Engineering next generation proteases. Current opinion in biotechnology 20, 390-397 (2009).

3. Schilling, O. & Overall, C.M. Proteome-derived, database-searchable peptide libraries for identifying protease cleavage sites. Nature Biotechnology 26, 685-694 (2008).

4. Turk, B., Turk, D. & Turk, V. Protease signalling: the cutting edge. EMBO J 31, 1630- 1643 (2012).

5. Puente, X.S., Sánchez, L.M., Gutiérrez-Fernández, A., Velasco, G. & López-Otín, C. A genomic view of the complexity of mammalian proteolytic systems. Biochem Soc Trans 33, 331-334 (2005).

6. Chang, H.Y. & Yang, X. Proteases for cell suicide: functions and regulation of caspases. Microbiol Mol Biol Rev 64, 821-846 (2000).

7. Schechter, I. & Berger, A. On the size of the active site in proteases. Biochemical and Biophysical Research Communications 27, 157-162 (1967).

8. Timmer, J.C. et al. Structural and kinetic determinants of protease substrates. Nat Struct Mol Biol 16, 1101-1108 (2009).

9. McQuibban, G.A. et al. Inflammation dampened by gelatinase A cleavage of monocyte chemoattractant protein-3. Science 289, 1202-1206 (2000).

10. Krishnaswamy, S. Exosite-driven substrate specificity and function in coagulation. J Thromb Haemost 3, 54-67 (2005).

11. Jabaiah, A.M., Getz, J.A., Witkowski, W.A., Hardy, J.A. & Daugherty, P.S. Identification of protease exosite-interacting peptides that enhance substrate cleavage kinetics. Biol Chem 393, 933-941 (2012).

12. Gettins, P.G. & Olson, S.T. Exosite determinants of serpin specificity. J Biol Chem 284, 20441-20445 (2009).

125

13. Gao, W. et al. Rearranging exosites in noncatalytic domains can redirect the substrate specificity of ADAMTS proteases. J Biol Chem 287, 26944-26952 (2012).

14. Hanakawa, Y. et al. Enzymatic and molecular characteristics of the efficiency and specificity of exfoliative toxin cleavage of desmoglein 1. The Journal of Biological Chemistry 279, 5268-5277 (2003).

15. Bukowski, M., Wladyka, B. & Dubin, G. Exfoliative toxins of Staphylococcus aureus. Toxins (Basel) 2, 1148-1165 (2010).

16. Lappin, E. & Ferguson, A.J. Gram-positive toxic shock syndromes. The Lancet Infectious Diseases 9, 281-290 (2009).

17. Bisno, A.L., Brito, M.O. & Collins, C.M. Molecular basis of group A streptococcal virulence. The Lancet Infectious Diseases 3, 191-200 (2003).

18. Norrby-Teglund, A. et al. Varying titers of neutralizing antibodies to streptococcal superantigens in different preparations of normal polyspecific immunoglobulin G: implications for therapeutic efficacy. Clinical Infectious Diseases 26, 631-638 (1998).

19. Stevens, D.L., Sexton, D.J. & Baron, E.L. Treatment of streptococcal toxic shock syndrome. (2013).

20. Yealy, D.M. et al. A randomized trial of protocol-based care for early septic shock. The New England Journal of Medicine 370, 1683-1693 (2014).

21. Opal, S.M., Dellinger, R.P., Vincent, J.-L., Masur, H. & Angus, D.C. The next generation of sepsis clinical trial designs: what is next after the demise of recombinant human activated protein C? Critical Care Medicine 42, 1714-1721 (2014).

22. Rano, T.A. et al. A combinatorial approach for determining protease specificities: application to interleukin-1beta converting enzyme (ICE). Chem Biol 4, 149-155 (1997).

23. Harris, J.L. et al. Rapid and general profiling of protease specificity by using combinatorial fluorogenic substrate libraries. Proc Natl Acad Sci U S A 97, 7754-7759 (2000).

24. Stennicke, H.R., Renatus, M., Meldal, M. & Salvesen, G.S. Internally quenched fluorescent peptide substrates disclose the subsite preferences of human caspases 1, 3, 6, 7 and 8. Biochem J 350 Pt 2, 563-568 (2000).

25. Bianchini, E.P. et al. Mapping of the catalytic groove preferences of factor Xa reveals an inadequate selectivity for its macromolecule substrates. J Biol Chem 277, 20527-20534 (2002). 126

26. Nomizu, M. et al. Substrate specificity of the streptococcal cysteine protease. The Journal of Biological Chemistry 276, 44551-44556 (2001).

27. Lim, M.D. & Craik, C.S. Using specificity to strategically target proteases. Bioorganic & Medicinal Chemistry 17, 1094-1100 (2009).

28. O'Donoghue, A.J. et al. Global identification of peptidase specificity by multiplex substrate profiling. Nat Methods 9, 1095-1100 (2012).

29. Matthews, D.J. & Wells, J.a. Substrate phage: selection of protease substrates by monovalent phage display. Science 260, 1113-1117 (1993).

30. Gallwitz, M., Enoksson, M., Thorpe, M. & Hellman, L. The extended cleavage specificity of human thrombin. PLoS One 7, e31756 (2012).

31. Deperthes, D. Phage display substrate: a blind method for determining protease specificity. Biological Chemistry 383, 1107-1112 (2002).

32. Boulware, K.T. & Daugherty, P.S. Protease specificity determination by using cellular libraries of peptide substrates (CLiPS). Proc Natl Acad Sci U S A 103, 7583-7588 (2006).

33. Schilling, O., Huesgen, P.F., Barré, O., Auf dem Keller, U. & Overall, C.M. Characterization of the prime and non-prime active site specificities of proteases by proteome-derived peptide libraries and tandem mass spectrometry. Nature Protocols 6, 111-120 (2011).

34. Schilling, O., auf dem Keller, U. & Overall, C.M. Factor Xa subsite mapping by proteome- derived peptide libraries improved using WebPICS, a resource for proteomic identification of cleavage sites. Biological Chemistry 392, 1031-1037 (2011).

35. Biniossek, M.L., Nagler, D.K., Becker-Pauly, C. & Schilling, O. Proteomic identification of protease cleavage sites characterizes prime and non-prime specificity of cysteine B, L, and S. Journal of Proteome Research 10, 5363-5373 (2011).

36. Jakoby, T., van den Berg, B.H. & Tholey, A. Quantitative protease cleavage site profiling using tandem-mass-tag labeling and LC-MALDI-TOF/TOF MS/MS analysis. J Proteome Res 11, 1812-1820 (2012).

37. Tucher, J. et al. LC-MS based cleavage site profiling of the proteases ADAM10 and ADAM17 using proteome-derived peptide libraries. J Proteome Res 13, 2205-2214 (2014).

127

38. Biniossek, M.L. et al. Identification of Protease Specificity by Combining Proteome- Derived Peptide Libraries and Quantitative Proteomics. Mol Cell Proteomics 15, 2515- 2524 (2016).

39. Kleifeld, O. et al. Isotopic labeling of terminal amines in complex samples identifies protein N-termini and protease cleavage products. Nat Biotechnol 28, 281-288 (2010).

40. Seelig, B. & Szostak, J.W. Selection and evolution of enzymes from a partially randomized non-catalytic scaffold. Nature 448, 828-831 (2007).

41. Ju, W. et al. Proteome-wide identification of family member-specific natural substrate repertoire of caspases. Proceedings of the National Academy of Sciences of the United States of America 104, 14294-14299 (2007).

42. Ng, N.M., Pike, R.N. & Boyd, S.E. Subsite cooperativity in protease specificity. Biological Chemistry 390, 401-407 (2009).

43. Eckhard, U. et al. Active site specificity profiling of the matrix metalloproteinase family: Proteomic identification of 4300 cleavage sites by nine MMPs explored with structural and synthetic peptide cleavage analyses. Matrix Biol 49, 37-60 (2016).

44. Kalińska, M. et al. Substrate specificity of Staphylococcus aureus cysteine proteases-- Staphopains A, B and C. Biochimie 94, 318-327 (2012).

45. Bornscheuer, U. & Kazlauskas, R.J. Survey of protein engineering strategies. Current Protocols in Protein Science Chapter 26, Unit26 27 (2011).

46. Hedstrom, L., Szilagyi, L. & Rutter, W.J. Converting trypsin to chymotrypsin: the role of surface loops. Science 255, 1249-1253 (1992).

47. Varadarajan, N., Georgiou, G. & Iverson, B.L. An engineered protease that cleaves specifically after sulfated tyrosine. Angewandte Chemie (International Ed. in English) 47, 7861-7863 (2008).

48. Varadarajan, N., Rodriguez, S., Hwang, B.Y., Georgiou, G. & Iverson, B.L. Highly active and selective endopeptidases with programmed substrate specificities. Nature Chemical Biology 4, 290-294 (2008).

49. Yi, L. et al. Engineering of TEV protease variants by yeast ER sequestration screening (YESS) of combinatorial libraries. Proceedings of the National Academy of Sciences of the United States of America 110, 7229-7234 (2013).

128

50. Gordon, S.R. et al. Computational design of an α-gliadin peptidase. Journal of the American Chemical Society 134, 20513-20520 (2012).

51. Näsvall, J., Sun, L., Roth, J.R. & Andersson, D.I. Real-time evolution of new genes by innovation, amplification, and divergence. Science 338, 384-387 (2012).

52. Chiarabelli, C. et al. Investigation of de novo totally random biosequences, Part II: On the folding frequency in a totally random library of de novo proteins obtained by phage display. Chem Biodivers 3, 840-859 (2006).

53. Urvoas, A., Valerio-Lepiniec, M. & Minard, P. Artificial proteins from combinatorial approaches. Trends Biotechnol 30, 512-520 (2012).

54. Chao, F.A. et al. Structure and dynamics of a primordial catalytic fold generated by in vitro evolution. Nat Chem Biol 9, 81-83 (2013).

55. Turner, N.J. Directed evolution drives the next generation of biocatalysts. Nat Chem Biol 5, 567-573 (2009).

56. Golynskiy, M.V. & Seelig, B. De novo enzymes: from computational design to mRNA display. Trends in Biotechnology 28, 340-345 (2010).

57. Brustad, E.M. & Arnold, F.H. Optimizing non-natural protein function with directed evolution. Current opinion in chemical biology 15, 201-210 (2011).

58. Cobb, R.E., Chao, R. & Zhao, H.M. Directed evolution: past, present, and future. AICHE Journal 59, 1432-1440 (2013).

59. Davids, T., Schmidt, M., Bottcher, D. & Bornscheuer, U.T. Strategies for the discovery and engineering of enzymes for biocatalysis. Current Opinion in Chemical Biology 17, 215-220 (2013).

60. Feldmeier, K. & Höcker, B. Computational protein design of ligand binding and catalysis. Current Opinion in Chemical Biology 17, 929-933 (2013).

61. Blomberg, R. et al. Precision is essential for efficient catalysis in an evolved Kemp eliminase. Nature 503, 418-421 (2013).

62. Damborsky, J. & Brezovsky, J. Computational tools for designing and engineering enzymes. Current Opinion in Chemical Biology 19, 8-16 (2014).

129

63. Shivange, A.V., Marienhagen, J., Mundhada, H., Schenk, A. & Schwaneberg, U. Advances in generating functional diversity for directed protein evolution. Current opinion in chemical biology 13, 19-25 (2009).

64. Goldsmith, M. & Tawfik, D.S. Directed enzyme evolution: beyond the low-hanging fruit. Current opinion in structural biology 22, 406-412 (2012).

65. Goldsmith, M. & Tawfik, D.S. Enzyme engineering by targeted libraries. Methods in enzymology 523, 257-283 (2013).

66. Golynskiy, M.V., Haugner, J.C. & Seelig, B. Highly diverse protein library based on the ubiquitous (β/α)8 enzyme fold yields well-structured proteins through in vitro folding Selection. Chembiochem 14, 1553-1563 (2013).

67. Fujii, S. et al. Liposome display for in vitro selection and evolution of membrane proteins. Nature protocols 9, 1578-1591 (2014).

68. Schlinkmann, K.M. et al. Critical features for biosynthesis, stability, and functionality of a G protein-coupled receptor uncovered by all-versus-all mutations. Proceedings of the National Academy of Sciences of the United States of America 109, 9810-9815 (2012).

69. Scott, D.J. & Plückthun, A. Direct molecular evolution of detergent-stable G protein- coupled receptors using polymer encapsulated cells. Journal of molecular biology 425, 662-677 (2013).

70. Sepp, A., Tawfik, D.S. & Griffiths, A.D. Microbead display by in vitro compartmentalisation: selection for binding using flow cytometry. FEBS letters 532, 455- 458 (2002).

71. Griffiths, A.D. & Tawfik, D.S. Directed evolution of an extremely fast phosphotriesterase by in vitro compartmentalization. The EMBO journal 22, 24-35 (2003).

72. Gan, R., Yamanaka, Y., Kojima, T. & Nakano, H. Microbeads display of proteins using emulsion PCR and cell-free protein synthesis. Biotechnology progress 24, 1107-1114 (2008).

73. Gan, R. et al. Directed evolution of angiotensin II-inhibiting peptides using a microbead display. Journal of bioscience and bioengineering 109, 411-417 (2010).

74. Paul, S., Stang, A., Lennartz, K., Tenbusch, M. & Überla, K. Selection of a T7 promoter mutant with enhanced in vitro activity by a novel multi-copy bead display approach for in vitro evolution. Nucleic acids research 41, e29 (2013).

130

75. Diamante, L., Gatti-Lafranconi, P., Schaerli, Y. & Hollfelder, F. In vitro affinity screening of protein and peptide binders by megavalent bead surface display. Protein engineering, design & selection 26, 713-724 (2013).

76. Chaikind, B. & Ostermeier, M. Directed evolution of improved zinc finger methyltransferases. PloS one 9, e96931 (2014).

77. Esvelt, K.M., Carlson, J.C. & Liu, D.R. A system for the continuous directed evolution of biomolecules. Nature 472, 499-503 (2011).

78. Dickinson, B.C., Leconte, A.M., Allen, B., Esvelt, K.M. & Liu, D.R. Experimental interrogation of the path dependence and stochasticity of protein evolution using phage- assisted continuous evolution. Proceedings of the National Academy of Sciences of the United States of America 110, 9007-9012 (2013).

79. Carlson, J.C., Badran, A.H., Guggiana-Nilo, D.a. & Liu, D.R. Negative selection and stringency modulation in phage-assisted continuous evolution. Nature chemical biology 10, 216-222 (2014).

80. Tokuriki, N. & Tawfik, D.S. Chaperonin overexpression promotes genetic variation and enzyme evolution. Nature 459, 668-673 (2009).

81. Wyganowski, K.T., Kaltenbach, M. & Tokuriki, N. GroEL/ES buffering and compensatory mutations promote protein evolution by stabilizing folding intermediates. Journal of molecular biology 425, 3403-3414 (2013).

82. Tokuriki, N. et al. Diminishing returns and tradeoffs constrain the laboratory optimization of an enzyme. Nature communications 3, 1257 (2012).

83. Ishizawa, T., Kawakami, T., Reid, P.C. & Murakami, H. TRAP display: a high-speed selection method for the generation of functional polypeptides. Journal of the American Chemical Society 135, 5433-5440 (2013).

84. Olson, C.A. et al. Single-round, multiplexed antibody mimetic design through mRNA display. Angewandte Chemie International Edition 51, 12449-12453 (2012).

85. Hoen, P.A. et al. Phage display screening without repetitious selection rounds. Analytical biochemistry 421, 622-631 (2012).

86. Olson, E.J. & Tabor, J.J. Optogenetic characterization methods overcome key challenges in synthetic and systems biology. Nature Chemical Biology 10, 502-511 (2014).

131

87. Liu, M., Tada, S., Ito, M., Abe, H. & Ito, Y. In vitro selection of a photo-responsive peptide aptamer using ribosome display. Chemical communications 48, 11871-11873 (2012).

88. Wang, W. et al. A fluorogenic peptide probe developed by in vitro selection using tRNA carrying a fluorogenic amino acid. Chemical communications 50, 2962-2964 (2014).

89. Jafari, M.R. et al. Discovery of light-responsive ligands through screening of a light- responsive genetically encoded library. ACS chemical biology 9, 443-450 (2014).

90. Bellotto, S., Chen, S., Rentero Rebollo, I., Wegner, H.a. & Heinis, C. Phage selection of photoswitchable peptide ligands. Journal of the American Chemical Society 136, 5880- 5883 (2014).

91. Schlippe, Y.V.G., Hartman, M.C.T., Josephson, K. & Szostak, J.W. In vitro selection of highly modified cyclic peptides that act as tight binding inhibitors. Journal of the American Chemical Society 134, 10469-10477 (2012).

92. Hofmann, F.T., Szostak, J.W. & Seebeck, F.P. In vitro selection of functional lantipeptides. Journal of the American Chemical Society 134, 8038-8041 (2012).

93. Mochizuki, Y., Nishigaki, K. & Nemoto, N. Amino group binding peptide aptamers with double disulphide-bridged loops selected by in vitro selection using cDNA display. Chemical communications 50, 5608-5610 (2014).

94. Chen, S. et al. Bicyclic peptide ligands pulled out of cysteine-rich peptide libraries. Journal of the American Chemical Society 135, 6562-6569 (2013).

95. Passioura, T., Katoh, T., Goto, Y. & Suga, H. Selection-based discovery of druglike macrocyclic peptides. Annual review of biochemistry 83, 727-752 (2014).

96. Horiya, S., Bailey, J.K., Temme, J.S., Guillen Schlippe, Y.V. & Krauss, I.J. Directed evolution of multivalent glycopeptides tightly recognized by HIV antibody 2G12. Journal of the American Chemical Society 136, 5407-5415 (2014).

97. Sumida, T., Yanagawa, H. & Doi, N. In vitro selection of Fab fragments by mRNA display and gene-linking emulsion PCR. Journal of nucleic acids 2012, 371379 (2012).

98. Stafford, R.L. et al. In vitro Fab display: a cell-free system for IgG discovery. Protein engineering, design & selection 27, 97-109 (2014).

99. Horlick, R.a. et al. Simultaneous surface display and secretion of proteins from mammalian cells facilitate efficient in vitro selection and maturation of antibodies. Journal of biological chemistry 288, 19861-19869 (2013). 132

100. Fallah-Araghi, A., Baret, J.-C., Ryckelynck, M. & Griffiths, A.D. A completely in vitro ultrahigh-throughput droplet-based microfluidic screening system for protein engineering and directed evolution. Lab on a chip 12, 882-891 (2012).

101. Stapleton, J.a. & Swartz, J.R. Development of an in vitro compartmentalization screen for high-throughput directed evolution of [FeFe] hydrogenases. PloS one 5, e15275 (2010).

102. Takeuchi, R., Choi, M. & Stoddard, B.L. Redesign of extensive protein-DNA interfaces of meganucleases using iterative cycles of in vitro compartmentalization. Proceedings of the National Academy of Sciences, USA 111, 4061-4066 (2014).

103. Kintses, B. et al. Picoliter cell lysate assays in microfluidic droplet compartments for directed enzyme evolution. Chemistry & Biology 19, 1001-1009 (2012).

104. Ostafe, R., Prodanovic, R., Nazor, J. & Fischer, R. Ultra-high-throughput screening method for the directed evolution of glucose oxidase. Chemistry & biology 21, 414-421 (2014).

105. Zinchenko, A. et al. One in a million: flow cytometric sorting of single cell-lysate assays in monodisperse picolitre double emulsion droplets for directed evolution. Analytical chemistry 86, 2526-2533 (2014).

106. Khersonsky, O. et al. Bridging the gaps in design methodologies by evolutionary optimization of the stability and proficiency of designed Kemp eliminase KE59. Proceedings of the National Academy of Sciences of the United States of America 109, 10358-10363 (2012).

107. Preiswerk, N. et al. Impact of scaffold rigidity on the design and evolution of an artificial Diels-Alderase. Proceedings of the National Academy of Sciences of the United States of America 111 (2014).

108. McIntosh, J.A., Farwell, C.C. & Arnold, F.H. Expanding P450 catalytic reaction space through evolution and engineering. Current Opinion in Chemical Biology 19, 126-134 (2014).

109. Coelho, P.S., Brustad, E.M., Kannan, A. & Arnold, F.H. Olefin cyclopropanation via carbene transfer catalyzed by engineered cytochrome P450 enzymes. Science 339, 307-310 (2013).

110. Coelho, P.S. et al. A serine-substituted P450 catalyzes highly efficient carbene transfer to olefins in vivo. Nature chemical biology 9, 485-487 (2013).

133

111. Zhang, K., Shafer, B.M., Demars, M.D., Stern, H.a. & Fasan, R. Controlled oxidation of remote sp3 C-H bonds in artemisinin via P450 catalysts with fine-tuned regio- and stereoselectivity. Journal of the American Chemical Society 134, 18695-18704 (2012).

112. Khare, S.D. et al. Computational redesign of a mononuclear zinc metalloenzyme for organophosphate hydrolysis. Nature chemical biology 8, 294-300 (2012).

113. Goldsmith, M. et al. Evolved stereoselective hydrolases for broad-spectrum G-type nerve agent detoxification. Chemistry & biology 19, 456-466 (2012).

114. Bigley, A.N., Xu, C., Henderson, T.J., Harvey, S.P. & Raushel, F.M. Enzymatic neutralization of the chemical warfare agent VX: evolution of phosphotriesterase for phosphorothiolate hydrolysis. Journal of the American Chemical Society 135, 10426- 10432 (2013).

115. Rajagopalan, S. et al. Design of activated serine-containing catalytic triads with atomic- level accuracy. Nature chemical biology 10, 386-391 (2014).

116. Fisher, M.a., McKinley, K.L., Bradley, L.H., Viola, S.R. & Hecht, M.H. De novo designed proteins from a library of artificial sequences function in Escherichia coli and enable cell growth. PloS one 6, e15364 (2011).

117. Patel, S.C. & Hecht, M.H. Directed evolution of the peroxidase activity of a de novo- designed protein. Protein Engineering, Design & Selection 25, 445-451 (2012).

118. Schilling, J., Schöppe, J. & Plückthun, A. From DARPins to LoopDARPins: novel LoopDARPin design allows the selection of low picomolar binders in a single round of ribosome display. Journal of molecular biology 426, 691-721 (2014).

119. Savile, C.K. et al. Biocatalytic asymmetric synthesis of chiral amines from ketones applied to sitagliptin manufacture. Science 329, 305-309 (2010).

120. Keefe, A.D. & Szostak, J.W. Functional proteins from a random-sequence library. Nature 410, 715-718 (2001).

121. Kries, H., Blomberg, R. & Hilvert, D. De novo enzymes by computational design. Current opinion in chemical biology 17, 221-228 (2013).

122. Dougherty, M.J. & Arnold, F.H. Directed evolution: new parts and optimized function. Current opinion in biotechnology 20, 486-491 (2009).

123. Carapetis, J.R., Steer, A.C. & Mulholland, E.K. (World Health Organization, 2005).

134

124. Walker, M.J. et al. Disease manifestations and pathogenic mechanisms of Group A Streptococcus. Clin Microbiol Rev 27, 264-301 (2014).

125. Olsen, J.G., Dagil, R., Niclasen, L.M., Sorensen, O.E. & Kragelund, B.B. Structure of the mature Streptococcal cysteine protease exotoxin mSpeB in its active dimeric form. Journal of Molecular Biology 393, 693-703 (2009).

126. Braun, M.A. et al. Stimulation of human T cells by streptococcal "superantigen" erythrogenic toxins (scarlet fever toxins). Journal of Immunology 150, 2457-2466 (1993).

127. Elliott, S.D. A Proteolytic Enzyme Produced By Group A Streptococci With Special Reference To Its Effect On The Type-Specific M Antigen. J Exp Med 81, 573-592 (1945).

128. Nelson, D.C., Garbe, J. & Collin, M. Cysteine proteinase SpeB from Streptococcus pyogenes - a potent modifier of immunologically important host and bacterial proteins. Biological Chemistry 392, 1077-1088 (2011).

129. Hauser, A.R. & Schlievert, P.M. Nucleotide sequence of the streptococcal pyrogenic exotoxin type B gene and relationship between the toxin and the streptococcal proteinase precursor. Journal of Bacteriology 172, 4536-4542 (1990).

130. Chaussee, M.S., Gerlach, D., Yu, C.E. & Ferretti, J.J. Inactivation of the streptococcal erythrogenic toxin B gene (speB) in Streptococcus pyogenes. Infect Immun 61, 3719-3723 (1993).

131. Kapur, V., Majesky, M.W., Li, L.L., Black, R.A. & Musser, J.M. Cleavage of interleukin 1 beta (IL-1 beta) precursor to produce active IL-1 beta by a conserved extracellular cysteine protease from Streptococcus pyogenes. Proc Natl Acad Sci U S A 90, 7676-7680 (1993).

132. Berge, A. & Björck, L. Streptococcal cysteine proteinase releases biologically active fragments of streptococcal surface proteins. J Biol Chem 270, 9862-9867 (1995).

133. Tsai, P.J. et al. Effect of group A streptococcal cysteine protease on invasion of epithelial cells. Infect Immun 66, 1460-1466 (1998).

134. Gubba, S., Low, D.E. & Musser, J.M. Expression and characterization of group A Streptococcus extracellular cysteine protease recombinant mutant proteins and documentation of seroconversion during human invasive disease episodes. Infect Immun 66, 765-770 (1998).

135. Musser, J.M., Stockbauer, K., Kapur, V. & Rudgers, G.W. Substitution of cysteine 192 in a highly conserved Streptococcus pyogenes extracellular cysteine protease (interleukin 135

1beta convertase) alters proteolytic activity and ablates zymogen processing. Infect Immun 64, 1913-1917 (1996).

136. Toyosaki, T. et al. Definition of the mitogenic factor (MF) as a novel streptococcal superantigen that is different from streptococcal pyrogenic exotoxins A, B, and C. Eur J Immunol 26, 2693-2701 (1996).

137. Watanabe, Y. et al. Cysteine protease activity and histamine release from the human mast cell line HMC-1 stimulated by recombinant streptococcal pyrogenic exotoxin B/streptococcal cysteine protease. Infect Immun 70, 3944-3947 (2002).

138. Terao, Y. et al. Group A streptococcal cysteine protease degrades C3 (C3b) and contributes to evasion of innate immunity. J Biol Chem 283, 6253-6260 (2008).

139. González-Páez, G.E. & Wolan, D.W. Ultrahigh and high resolution structures and mutational analysis of monomeric Streptococcus pyogenes SpeB reveal a functional role for the glycine-rich C-terminal loop. The Journal of Biological Chemistry 287, 24412-24426 (2012).

140. Chen, C.Y. et al. Maturation processing and characterization of streptopain. The Journal of Biological Chemistry 278, 17336-17343 (2003).

141. Kansal, R.G., Nizet, V., Jeng, A., Chuang, W.J. & Kotb, M. Selective modulation of superantigen-induced responses by streptococcal cysteine protease. J Infect Dis 187, 398-407 (2003).

142. Wang, C.C. et al. Solution structure and backbone dynamics of streptopain: insight into diverse substrate specificity. The Journal of Biological Chemistry 284, 10957-10967 (2009).

143. Carroll, R.K. & Musser, J.M. From transcription to activation: how group A streptococcus, the flesh-eating pathogen, regulates SpeB cysteine protease production. Molecular Microbiology 81, 588-601 (2011).

144. Liu, T.Y. & Elliott, S.D. Streptococcal Proteinase: The Zymogen to Enzyme Transformation. J Biol Chem 240, 1138-1142 (1965).

145. Rappsilber, J., Ishihama, Y. & Mann, M. Stop and go extraction tips for matrix-assisted laser desorption/ionization, nanoelectrospray, and LC/MS sample pretreatment in proteomics. Anal Chem 75, 663-670 (2003).

146. Stockbauer, K.E. et al. A natural variant of the cysteine protease virulence factor of group A Streptococcus with an arginine-glycine-aspartic acid (RGD) motif preferentially binds human integrins alphavbeta3 and alphaIIbbeta3. Proc Natl Acad Sci U S A 96, 242-247 (1999).

147. Lin-Moshier, Y. et al. Re-evaluation of the role of calcium homeostasis endoplasmic reticulum protein (CHERP) in cellular calcium signaling. J Biol Chem 288, 355-367 (2013).

148. Studier, F.W. Protein production by auto-induction in high density shaking cultures. Protein Expr Purif 41, 207-234 (2005).

149. Blommel, P.G. & Fox, B.G. A combined approach to improving large-scale production of tobacco etch virus protease. Protein Expr Purif 55, 53-68 (2007).

150. Kawaguchi, M., Inoue, K., Iuchi, I., Nishida, M. & Yasumasu, S. Molecular co-evolution of a protease and its substrate elucidated by analysis of the activity of predicted ancestral hatching enzyme. BMC Evol Biol 13, 231 (2013).

151. Valencia, C.A., Cotten, S.W., Dong, B.A. & Liu, R.H. mRNA-display-based selections for proteins with desired functions: A protease-substrate case study. Biotechnology Progress 24, 561-569 (2008).

152. Rawlings, N.D., Barrett, A.J. & Bateman, A. MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Research 40, D343-D350 (2012).

153. Gosalia, D.N., Salisbury, C.M., Maly, D.J., Ellman, J.A. & Diamond, S.L. Profiling serine protease substrate specificity with solution phase fluorogenic peptide microarrays. Proteomics 5, 1292-1298 (2005).

154. Scholle, M.D. et al. Mapping protease substrates by using a biotinylated phage substrate library. Chembiochem 7, 834-838 (2006).

155. Caescu, C.I., Jeschke, G.R. & Turk, B.E. Active-site determinants of substrate recognition by the metalloproteinases TACE and ADAM10. Biochem J 424, 79-88 (2009).

156. Black, R.A. et al. A metalloproteinase disintegrin that releases tumour-necrosis factor-alpha from cells. Nature 385, 729-733 (1997).

157. Doran, J.D. et al. Autocatalytic processing of the streptococcal cysteine protease zymogen: processing mechanism and characterization of the autoproteolytic cleavage sites. Eur J Biochem 263, 145-151 (1999).

158. Presolski, S.I., Hong, V.P. & Finn, M.G. Copper-Catalyzed Azide-Alkyne Click Chemistry for Bioconjugation. Curr Protoc Chem Biol 3, 153-162 (2011).

159. Seelig, B. mRNA display for the selection and evolution of enzymes from in vitro-translated protein libraries. Nature Protocols 6, 540-552 (2011).

160. Chao, F.-A. et al. Structure and dynamics of a primordial catalytic fold generated by in vitro evolution. Nature Chemical Biology 9, 81-83 (2013).

161. Sambrook, J. & Russell, D.W. Isolation of DNA fragments from polyacrylamide gels by the crush and soak method. CSH Protoc 2006 (2006).

162. Chan, A.O. et al. Modification of N-terminal α-amino groups of peptides and proteins using ketenes. J Am Chem Soc 134, 2589-2598 (2012).

163. Quast, R.B., Mrusek, D., Hoffmeister, C., Sonnabend, A. & Kubick, S. Cotranslational incorporation of non-standard amino acids using cell-free protein synthesis. FEBS Lett 589, 1703-1712 (2015).

164. Li, S., Millward, S. & Roberts, R. In vitro selection of mRNA display libraries containing an unnatural amino acid. J Am Chem Soc 124, 9972-9973 (2002).

165. Shimizu, Y. et al. Cell-free translation reconstituted with purified components. Nat Biotechnol 19, 751-755 (2001).

166. Hong, S.H. et al. Cell-free protein synthesis from a release factor 1 deficient Escherichia coli activates efficient and multiple site-specific nonstandard amino acid incorporation. ACS Synth Biol 3, 398-409 (2014).

167. Deiters, A. & Schultz, P.G. In vivo incorporation of an alkyne into proteins in Escherichia coli. Bioorg Med Chem Lett 15, 1521-1524 (2005).

168. Albayrak, C. & Swartz, J.R. Cell-free co-production of an orthogonal transfer RNA activates efficient site-specific non-natural amino acid incorporation. Nucleic Acids Res 41, 5949-5963 (2013).

169. Polycarpo, C. et al. Activation of the pyrrolysine suppressor tRNA requires formation of a ternary complex with class I and class II lysyl-tRNA synthetases. Mol Cell 12, 287-294 (2003).

170. Sheppard, K., Akochy, P.M. & Söll, D. Assays for transfer RNA-dependent amino acid biosynthesis. Methods 44, 139-145 (2008).

171. Schneider, C.A., Rasband, W.S. & Eliceiri, K.W. NIH Image to ImageJ: 25 years of image analysis. Nat Methods 9, 671-675 (2012).

172. Azuma, K. et al. Detection of circulating superantigens in an intensive care unit population. International Journal of Infectious Diseases 8, 292-298 (2004).

173. Kagawa, T.F. et al. Crystal structure of the zymogen form of the group A Streptococcus virulence factor SpeB: an integrin-binding cysteine protease. Proceedings of the National Academy of Sciences of the United States of America 97, 2235-2240 (2000).

174. Sundberg, E.J. et al. Structures of two streptococcal superantigens bound to TCR beta chains reveal diversity in the architecture of T cell signaling complexes. Structure 10, 687-699 (2002).

175. Kline, J.B. & Collins, C.M. Analysis of the superantigenic activity of mutant and allelic forms of streptococcal pyrogenic exotoxin A. Infection and Immunity 64, 861-869 (1996).

176. Kline, J.B. & Collins, C.M. Analysis of the interaction between the bacterial superantigen streptococcal pyrogenic exotoxin A (SpeA) and the human T-cell receptor. Molecular Microbiology 24, 191-202 (1997).

177. Roggiani, M., Stoehr, J.A., Leonard, B.A. & Schlievert, P.M. Analysis of toxicity of streptococcal pyrogenic exotoxin A mutants. Infection and Immunity 65, 2868-2875 (1997).

178. Carra, J.H., Welcher, B.C., Schokman, R.D., David, C.S. & Bavari, S. Mutational effects on protein folding stability and antigenicity: the case of streptococcal pyrogenic exotoxin A. Clinical Immunology 108, 60-68 (2003).

179. Papageorgiou, A.C. et al. Structural basis for the recognition of superantigen streptococcal pyrogenic exotoxin A (SpeA1) by MHC class II molecules and T-cell receptors. EMBO J 18, 9-21 (1999).

180. Earhart, C.A., Vath, G.M., Roggiani, M., Schlievert, P.M. & Ohlendorf, D.H. Structure of streptococcal pyrogenic exotoxin A reveals a novel metal cluster. Protein Sci 9, 1847-1851 (2000).

181. Baker, M., Gutman, D.M., Papageorgiou, A.C., Collins, C.M. & Acharya, K.R. Structural features of a zinc binding site in the superantigen strepococcal pyrogenic exotoxin A (SpeA1): implications for MHC class II recognition. Protein Sci 10, 1268-1273 (2001).

182. Baker, M.D., Gendlina, I., Collins, C.M. & Acharya, K.R. Crystal structure of a dimeric form of streptococcal pyrogenic exotoxin A (SpeA1). Protein Sci 13, 2285-2290 (2004).

183. Cotten, S.W., Zou, J., Valencia, C.A. & Liu, R. Selection of proteins with desired properties from natural proteome libraries using mRNA display. Nat Protoc 6, 1163-1182 (2011).
