Evolvability of a Viral

Experimental Evolution of

Catalysis, Robustness and Specificity

THOMAS SHAFEE

Department of Biochemistry and Gonville and Caius College,

University of Cambridge

Under the supervision of F. Hollfelder and R. Minter

A thesis submitted for the degree of Doctor of Philosophy

September of 2013

Declaration

This dissertation is the result of my own work and includes nothing which is the outcome of work done in collaboration except where specifically indicated in the text. This work is not being submitted for another degree or qualification at any academic institution and does not exceed 60,000 words.

The work presented in Chapter C has been submitted to Nature Genetics as ‘The evolvability of nucleophile exchange’ (Shafee, T., Gatti-Lafranconi, P., Minter, R., Hollfelder, F. 2013)

The work in Chapters D & E have been combined and prepared for submission to Molecular Biology and Evolution as ‘Neutral evolution – two combined mechanisms generate robust and evolvable ’ (Shafee, T., Gatti- Lafranconi, P., Minter, R., Hollfelder, F. 2013)

Thomas Shafee, 27th of September 2013

Acknowledgements

I am very grateful for the help and guidance I’ve received throughout my PhD. I have been given great intellectual freedom by my supervisor, Dr Florian Hollfelder, who has always greeted my ideas and research with enthusiasm, encouragement and support. I would also like to thank my second supervisor, Dr Ralph Minter, for his insightful suggestions and help through my projects.

In particular, I would like to thank Dr Pietro Gatti-Lafranconi who has helped me with everything from pipetting to manuscript preparation and his advice and input has been invaluable in all of my projects. Additionally, Nigel Miller’s patient help with flow cytometry has made many of my experiments possible.

I also thank the excellent PostDoc community in the group for lively debate and useful comments. Specifically, I have appreciated the attention to detail of Dr Sean Devenish, the encyclopaedic literature knowledge of Dr Stephane Emond, the level-headed advice of Dr Maren Butz, the tireless enthusiasm of Dr Albert Kwok, and the chemical know-how of Dr Mark Mohammed. I’ve also been fortunate to have the support and companionship of a superb peer group of PhD students, in particular Charlie Miton who has always gone out of her way to be helpful and Chris Bayer with whom I’ve had many a fascinating discussion.

I have been in Cambridge for seven years of undergraduate and graduate life and am hugely grateful to the people who have contributed to this personally and intellectually stimulating time in my life. I would like to thank my parents, brother, uncles and aunts (biological and appointed) and Grandmouse for their fantastic support and for being interested in my work even if, on occasion, they weren’t sure what I was talking about. Lastly, I would like to thank my wonderful partner Katrina, without whom my world would be a less loving, less fascinating and disappointingly sensible place.

Abstract

The aim of this thesis is to investigate aspects of molecular evolution and engineering using the experimental evolution of Tobacco Etch protease (TEV) as a model. I map key features of the local fitness landscape and characterise how they affect details of enzyme evolution.

In order to investigate the evolution of core machinery, I mutated the nucleophile of TEV to . The differing chemical properties of oxygen and sulphur force the enzyme into a fitness valley with a >104-fold activity reduction. Nevertheless, was able to recover function, resulting in an enzyme able to utilise either nucleophile. High-throughput screening and sequencing revealed how the array of possible beneficial mutations changes as the enzyme evolves. Potential adaptive mutations are abundant at each step along the evolutionary trajectory, enriched around the active site periphery.

It is currently unclear how seemingly neutral mutations affect further adaptive evolution. I used high-throughput directed evolution to accumulate neutral variation in large, evolving enzyme populations and deep sequencing to reconstruct the complex evolutionary dynamics within the lineages. Specifically I was able to observe the emergence of robust enzymes with improved mutation tolerance whose descendants overtake later populations.

Lastly, I investigate how evolvability towards new substrate specificities changed along these neutral lineages, dissecting the different determinants of immediate and long-term evolvability. Results demonstrate the utility of evolutionary understanding to protease engineering.

Together, these experiments forward our understanding of the molecular details of both fundamental evolution and enzyme engineering.

Abbreviations

ABBREVIATION IN FULL Å Angstrom (10-10 meters) Aa Amino acid AraC Arabinose inhibitor Bla β-lactamase BLIP β-lactamase inhibitory protein Cat Chloramphenicol acetyltransferase CFP Cyan fluorescent protein C-Y(x) Cyan and yellow fluorescent protein, linked by sequence ‘x’ DE Directed evolution DFE Distribution of fitness effects DNA Deoxyribonucleic acid dNTP Deoxy nucleotide tri-phosphate E. coli Escherichia coli epPCR Error-prone polymerase chain reaction FACS Fluorescence activated cell sorting FRET Fluorescence resonance energy transfer HT-SAS High throughput screening and sequencing IPTG Isopropyl β-D-1-thiogalactopyranoside LacI Lac inhibitor Nt nucleotide PCR Polymerase chain reaction PDB Protein database (a repository of solved protein structures) pH Hydrogen ion potential (measure of bulk solution acidity) pKa Acid dissociation constant (measure of specific residue acidity) Population A population of enzymes having undergone ‘x’ rounds of (x,y) neutral evolution and ‘y’ rounds of adaptive evolution TEM1 A class of β-lactamases commonly found in E. coli TEV Tobacco Etch Virus nuclear inclusion 1 wtTEV In the context of discussing population evolution TEVCys In the context of discussing different nucleophiles TEVAla C151A mutant of TEV TEVSer C151S mutant of TEV TM Temperature at which 50% of a protein has melted YFP Yellow fluorescent protein

Table of Contents

Chapter A - Introduction 1

1 Summary ______1 2 Enzymes and Evolution ______2 2.1 Catalysis of chemical reactions 2 2.2 Nucleophilic catalysis and catalytic triads 4 2.3 Promiscuity 5 2.4 Divergent molecular evolution 6 2.5 Protein sequence space and fitness landscapes 10 2.6 The distribution of fitness effects 14 3 Experimental Evolution of Proteins______16 3.1 Generating variation 18 3.2 Detecting fitness differences 19 3.3 Ensuring heredity 20 3.4 High throughput sequencing 21 4 Robustness ______24 4.1 Genomic and protein robustness 24 4.2 Neutral networks 25 5 Evolvability ______26 5.1 Definition and quantification 26 5.2 Determinants of evolvability 28 5.3 Evolution of evolvability 30 5.4 Theoretical and applied significance 31 6 TEV Protease – a Model Enzyme______32 6.1 Origin 32 6.2 Structure and function 32 6.3 Specificity 34 7 Overview of results ______36

Chapter B - Selection and screening methods for protease directed evolution 37

1 Summary ______37 2 Introduction (Inhibition relief selection) ______38 2.1 β-lactamase Inhibitor Protein and inhibition-relief selection 38 3 Results and Discussion (Inhibition relief selection) ______40

3.1 BLIP plasmid assembly 40 3.2 Insertion of cleavage sequence into BLIP preserves function 41 3.3 TEV of BLIP does not restore cell survival 43 4 Introduction (FRET screening) ______47 4.1 Dual-GFP FRET pair and fluorescence screening 47 5 Results and Discussion (FRET screening) ______49 5.1 C-Y(tev) FRET-pair plasmid assembly 49 5.2 Fluorescence of C-Y(tev) is changed by TEV proteolysis in cell lysate 50 5.3 TEV cleaves purified C-Y(tev) in vitro 53 5.4 TEV proteolysis of C-Y(tev) is detectable in individual cells 55 6 Conclusions ______58 6.1 Selection by BLIP cleavage is unsuitable for directed evolution 58 6.2 FRET screening provides multiple platforms for directed evolution 58

Chapter C - Handicap-recover evolution to accommodate a new nucleophile 61

1 Summary ______61 2 Introduction ______62 2.1 The effect of nucleophile identity on enzyme mechanism 62 2.2 Nucleophile exchange in natural evolution 64 3 Results and Discussion ______66 3.1 Replacement of nucleophile decreases TEV activity 104-fold 66 3.2 Nucleophile mutation has minimal effect on catalytic promiscuity 68 3.3 Directed evolution restores proteolytic function using serine 68 3.4 The evolved protease can use either nucleophile 72 3.5 Catalytic activity is restored at the expense of structural integrity 73 3.6 Mutations that recover activity cluster around the active site 74 3.7 Measurement of local fitness landscapes along the trajectory 77 3.8 Mutations have strong epistatic dependence on their predecessors 79 3.9 Adaptive mutations are common at all steps of the trajectory 81 3.10 Some protein regions are hotspots for adaptive mutations 82 4 Conclusions ______85 4.1 Evolutionary constraints on the evolved trajectory 85 4.2 High adaptive potential generates a versatile active site 86

Chapter D - Neutral evolution and emergent robustness 87

1 Summary ______87 2 Introduction ______88 2.1 Protein robustness and tolerance to mutations 88 2.2 Theoretical predictions of robustness evolution 89 2.3 Empirical evidence for the evolution of robustness 90 3 Results and Discussion ______94

3.1 The distribution of fitness effects of TEV mutants 94 3.2 Generating neutral evolution lineages 96 3.3 Enriched mutations confer population robustness 99 3.4 Later populations are descended from robust node sequences 103 3.5 Emergence of robustness requires large effective population size 107 4 Conclusions ______113

Chapter E - Adaptive evolvability and changing cleavage specificity 116

1 Summary ______116 2 Introduction ______117 2.1 Evolvability of enzymes 117 2.2 Neutral mutations and cryptic variation 118 2.3 Protease specificity engineering 120 2.4 Clinical applications 122 3 Results and Discussion ______124 3.1 Target sequence choice 124 3.2 Promiscuous cleavage of target sequences 127 3.3 Neutral evolution erodes average promiscuity 127 3.4 Neutral evolution generates fluctuating immediate evolvability 130 3.5 Neutral evolution increases long-term evolvability 132 3.6 Stepping-stones did not aid evolution on un-cleavable sequences 135 3.7 Mutations that underlie immediate and long-term evolvability 136 4 Conclusions ______141 4.1 Evolution of promiscuity, and immediate and long-term evolvability 141 4.2 Relevance to protein engineering 142

Chapter F - Discussion and outlook 145

1 Summary ______145 2 Evolution and Evolvability ______146 2.1 Neutral fitness plateau 146 2.2 Rugged fitness valley 148 2.3 Alternative fitness peaks 150 3 Enzymes and Engineering ______151 3.1 Neutral evolution and controlling protease specificity 151 3.2 Manipulating active site chemistry 152 4 Synopsis ______153

Chapter G - Materials and methods 155

1 Plasmids ______155

2 Primers ______156 3 Buffers ______158 4 Molecular biology______159 4.1 Gene amplification 159 4.2 User mutagenesis 159 4.3 Targeted mutagenesis 159 4.4 Random mutagenesis 159 4.5 Shuffling 159 4.6 Ligation and cloning 160 4.7 Preparation of chemo-competent cells 160 4.8 Preparation of electro-competent cells 160 4.9 Transformation 160 5 Biochemistry and assays ______161 5.1 Protein expression and purification 161 5.2 96-well expression and purification 161 5.3 Fluorescence 96-well plate screening 161 5.4 Fluorescence in vitro kinetics 161 5.5 Fluorescence Activated Cell Sorting (FACS) 162 5.6 Gel band densitometry 162 5.7 Differential scanning fluorimetry 163 6 Bioinformatics and data analysis ______164 6.1 PA clan alignment and phylogeny 164 6.2 High throughput sequencing and data analysis 164 6.3 FoldX 165 6.4 FACS mutation tolerance analysis 165

Chapter H - References 167

Table of Figures

Figure A-1 | Energetics of ______2 Figure A-2 | Common elements of nucleophilic catalysis using a ______4 Figure A-3 | Crosswise catalytic promiscuity in the alkaline phosphatase superfamily ______7 Figure A-4 | Neo- and sub- functionalization after gene duplication ______8 Figure A-5 | Protein co-evolution sectors ______9 Figure A-6 | Example fitness landscape ______11 Figure A-7 | Mutation epistasis and landscape ruggedness ______12 Figure A-8 | Fitness peak separation by valleys ______13 Figure A-9 | Possible distributions of fitness effects ______15 Figure A-10 | Directed evolution cycle ______17 Figure A-11 | Directed evolution of a cytochrome P450 ______17 Figure A-12 | Mutagenesis methods ______19 Figure A-13 | Genotype-phenotype linkage methods ______21 Figure A-14 | Experimentally measured fitness landscape of DNA binders ______22 Figure A-15 | Sequence-function relationship mapped on ______23 Figure A-16 | Neutral network ______25 Figure A-17 | Experimentally measured distributions of fitness effects ______27 Figure A-18 | Abundance of beneficial mutations ______27 Figure A-19 | Measured determinants of evolvability ______29 Figure A-20 | Filamentous potyvirus ______32 Figure A-21 | TEV protease β-barrels ______33 Figure A-22 | Substrate bound in the active site tunnel ______35 Figure B-1 | BLIP in complex with β-lactamase ______39 Figure B-2 | Scheme of how an active protease is coupled to cell survival ______39 Figure B-4 | Structure of BLIP showing insert sites ______42 Figure B-5 | Comparison of wtBLIP and iBLIP ______43 Figure B-6 | Comparison of STEV and STEVAla ______45 Figure B-7 | Purification and cleavage of iBLIP ______45 Figure B-8 | Proteolytic cleavage of a Dual-GFP fusion FRET-pair ______48 Figure B-9 | pC-Y plasmid map ______49 Figure B-10 | pMAA TEV plasmid map (±MBP) ______51 Figure B-11 | Effect of TEV protease on fluorescence in CFP-YFP FRET system ______52 Figure B-12 | Time-course of lysate FRET ratios ______52 Figure B-13 | Sample lysate screen of TEV mutant library ______53 Figure B-14 | Concentration dependence of FRET ratio ______54 Figure B-15 | Equilibration of C-Y FRET ratio ______54 Figure B-16 | Concentration dependent cleavage of C-Y(tev) by TEV and kinetic fitting ______55 Figure B-17 | Flow cytometric measurement of FRET in E. coli ______56 Figure B-18 | Sample flow cytometric of TEV mutant library ______57 Figure C-1 | Differences in cysteine and serine proteolysis mechanisms ______63 Figure C-2 | PA clan phylogeny ______65 Figure C-3 | Monophasic and biphasic curve fitting of TEVCys and TEVSer ______67 Figure C-4 | Promiscuous phosphonatase activity of TEVCys and TEVSer ______68 Figure C-5 | Evolutionary recovery of TEVSer→TEVSerX lineage ______71 Figure C-6 | Nucleophile activity tradeoff ______72 Figure C-7 | Stability losses during recovery of TEVSer lineage ______74 Figure C-8 | Mutations positions ______75

Figure C-9 | Schematic of sequence space flanking trajectory ______78 Figure C-10 | Mutation enrichments for TEVSer ______79 Figure C-11 | Mutation enrichments for the TEVSer→TEVSerX lineage libraries ______79 Figure C-12 | Enrichments of notable mutations ______80 Figure C-13 | Locations of beneficial or deleterious mutations ______81 Figure C-14 | Adaptive potential along the TEVSer→TEVSerX lineage ______81 Figure C-15 | The effect of mutations in the protein core ______82 Figure C-16 | Beneficial mutations more common in triad second shell ______83 Figure D-1 | Protein stability effects on fitness ______88 Figure D-2 | Robustness of a miRNA fold ______90 Figure D-3 | Viral robustness after evolution at different co-infection rates ______91 Figure D-4 | Mutation tolerance over neutral evolution of P450 oxygenase ______92 Figure D-5 | Enriched mutations after neutral evolution of TEM1 β-lactamase ______92 Figure D-6 | Bimodal distribution of fitness effects and purified mutant characterisation ______95 Figure D-7 | Mutation accumulation rate______97 Figure D-8 | Population mutation tolerance ______98 4 Figure D-9 | Mutation accumulation by structural region (Ne=10 lineages) ______100 4 Figure D-10 | Population average activity (µ=1.5%, Ne=10 lineage) ______101 4 Figure D-11 | Population average soluble expression (µ=1.5%, Ne=10 lineage) ______102 4 Figure D-12 | Ne=10 Lineage phylogenies ______104 4 Figure D-13 | Reconstructed node mutation locations (Ne=10 lineages)______105 Figure D-14 | Robustness of reconstructed node sequences ______106 2 Figure D-15 | Mutation tolerance (Ne=10 lineages) ______108 2 Figure D-16 | Mutation accumulation by structural region (Ne=10 lineages) ______109 2 Figure D-17 | Ne=10 lineage phylogenies and node mutation locations ______112 Figure E-1 | Two overlapping neutral networks ______118 Figure E-2 | Protease therapeutic concept ______122 Figure E-3 | Identification of flexible target loop regions ______125 Figure E-4 | Target structures and loop sequences ______126 Figure E-5 | Summary and nomenclature of C-Y(veg) evolution lineages ______128 Figure E-6 | Promiscuity on randomised substrate libraries ______129 Figure E-7 | Immediate evolvability along neutral evolution ______130 Figure E-8 | Long-term evolvability on target sequences of neutrally evolved populations. ______132 Figure E-9 | Activity on C-Y(veg) and C-Y(pdl) for wtTEV and evolved variants ______134 Figure E-10 | Activity of variants after 3 rounds of evolution on target sequence______134 Figure E-11 | Stepping stone substrate cleavage in lysate by wtTEV ______135 Figure E-12 | Mutation locations and viral substrate tunnel loop ______137 Figure E-13 | Substrate P6 binding by dual asparagines ______138 Figure E-14 | Phylogeny of C-Y(veg) active mutants ______139 Figure F-1 | Bimodal distribution of fitness effects as a landscape ______147 Figure F-2 | Explicit transect of a fitness landscape TEVSer→TEVSerX ______148 Figure F-3 | Schematic of fitness landscape flanking TEVSer→TEVSerX trajectory ______149

Table C-2 | Variant property summary ______70 Table C-3 | Regression of curve fitting of evolved variants ______71 4 Table D-1 | Reconstructed node summary (Ne=10 lineages) ______105 2 Table D-2 | Reconstructed node summary (Ne=10 lineages) ______112 Table E-1 | Previously measured immediate evolvabilities ______119 Table E-2 | Previously achieved TEV protease specificity changes ______121 Table E-4 | Target sequences ______126 Table E-5 | CFP-YFP target linker variants ______127 Table E-6 | Population average promiscuous activity on C-Y(veg) ______130 Table E-7 | Stepping stone substrate CFP-YFP linker variants______135 Table G-2 | Substrate plasmids ______155 Table G-3 | Primers mentioned in report ______156 Table G-4 | Multiplex identifier sequences ______157 Table G-5 | Sequences added to DALI alignment by BLASTp ______164

Chapter A - Introduction

Chapter A - Introduction

1 SUMMARY

This section introduces the key concepts of enzymes and their evolution that underpin the later chapters. Additionally it describes experimental evolution methods and explains some of what these have revealed about evolution and evolvability. It concludes with a brief overview of the results chapters.

1

2 ENZYMES AND EVOLUTION

2.1 CATALYSIS OF CHEMICAL REACTIONS

Enzymes are the biological catalysts that perform almost all chemical reactions that occur in an organism. They are produced as a linear polymer of amino acids but fold up into a 3D structure (defined by their sequence) to bring together a few key amino acids into a precise arrangement (the active site) to perform chemistry1. These residues are responsible for binding a substrate and lowering the activation energy required for its conversion to product2 (Figure A-1). This lower activation energy determines the rate of the catalysed reaction

3 (kcat/KM) . Enzymes do this so efficiently that some reactions that naturally take millions of years can be catalysed in seconds4.

S‡

ΔG kuncat ES‡

kcat/KM E+S E.S E.P E+P Figure A-1 | Energetics of enzyme catalysis A free energy profile (ΔG) of an abstract reaction of substrate to product (S→P) catalysed by enzyme (E). Uncatalysed the reaction passes through a high energy transition state (S‡). When catalysed by an enzyme, however, it binds to form a complex (E·S) which passes through a lower energy transition state (E·S‡). Finally the product complex (E·P) dissociates to release free product. The lower transition state energy means that the catalysed rate kcat/KM is much faster than the uncatalysed rate kuncat. (Adapted from reference3)

Activation energy is lowered by a combination of factors, some or all of which are used in different enzymes: Electrostatic stabilisation of

2 Chapter A - Introduction charge build-up on high energy transition states, the provision of general acids and bases, exclusion of solvent, orientation of substrates in reactions with more than one, and sometimes formation of a transient covalent bond. These effects are achieved using the backbone of the protein, the side chain functional groups and organic and inorganic cofactors such as metals and cytochromes.

The active site, is made up of a large, interconnected network of cooperative residues5–7. Some directly contact the substrate and are involved in lowering transition state energy by the mechanisms listed above. Others play a supporting role in orienting, supporting and tuning the electronegativity of these. Although active site residues must be precisely orientated, they are commonly polar or charged1 and hence repel each other. Internal active site repulsion is compensated for by the enzyme having a stable structure around them provided by a well packed, hydrophobic core scaffold8,9 which accounts for much of the rest of the protein.

Despite this stable fold, it’s important to note that enzymes are not static entities. They’re not even particularly rigid entities8,10. Enzymes are highly flexible and hence exist as an ensemble of conformations that are constantly interconverting. These motions range from very fast and small (such as side chain rotation11,12) to large and slow (such as loop or whole domain movements13–15). The role of these dynamics in activity is controversial16–19 but what is clear is that many enzymes have several distinct conformations which optimise the various requirements of substrate binding, catalysis and product release. Some enzymes will change between these states during their catalytic cycle2,13–15,19.

3

2.2 NUCLEOPHILIC CATALYSIS AND CATALYTIC TRIADS

One method of lowering a reaction’s activation energy is to split it into Nucleophile A chemical species capable two easier reactions by nucleophilic, covalent catalysis. In this specific of forming new bonds by donating electrons case, a single residue has great importance to the mechanism since it is tuned by its neighbours to lower its pKa and increase the availability of the deprotonated form. This nucleophile firstly attacks the substrate to form a covalent bond which is subsequently hydrolysed by activated water20. Each of these two steps requires lower energy than the uncatalysed reaction, increasing its rate.

A common method for generating a nucleophilic residue for catalysis is by using an Acid-Base-Nucleophile catalytic triad21,22. The triad forms a charge-relay network to polarise and activate the nucleophile, which attacks the substrate, forming a covalent intermediate which is then hydrolysed to regenerate free enzyme (Figure A-2).

a

b

Figure A-2 | Common elements of nucleophilic catalysis using a catalytic triad (a) The enzyme (black) performs a nucleophilic attack on substrate (red) via a tetrahedral intermediate to eject the first product, leaving an acyl-enzyme intermediate. (b) The intermediate is hydrolysed by an activated water molecule via a second tetrahedral intermediate to release the second product and regenerate free enzyme.

4 Chapter A - Introduction

This motif is so effective that the same geometry has arisen at least 25 times independently21,23 as a means of catalysing group transfer24,25 and hydrolysis22 reactions (including proteolysis). The two most commonly used nucleophiles are the oxygen (alcohol) of serine and the sulphur (thiol) of cysteine21,22. I introduce the evolution and significance of different nucleophiles in greater detail in Chapter C.

In addition to covalent catalysis, the triad and surrounding residues coordinate additional mechanistic effects. General acid catalysis protonates leaving groups. General base catalysis activates the water that hydrolyses the intermediate. Backbone electrostatics in the stabilise charge build-up on transition states and intermediates along the reaction pathway. Additional residues assist in binding substrate and, in some enzymes, coordinating large structural movements involved in the catalytic cycle such as lid opening and closing13,14.

2.3 PROMISCUITY

Having described how enzymes achieve catalysis of their target substrate, it is useful to introduce catalysis of alternative substrates. Classically, enzymes have been viewed as catalysing only one reaction. Over recent decades, however, it has become increasingly clear that besides efficiently catalysing their target reaction, enzymes often also have a set of side-activities. The ability to catalyse more than one reaction with the same active site is known as promiscuity26. The ability of an enzyme to catalyse more than one Promiscuity is subcategorised into 3 classes describing in what way reaction these reactions differ. Catalytic promiscuity refers to an enzyme that can cleave two different types of chemical bond, for example the hydrolysis of phosphonate and sulphate bonds by Pseudomonas aeruginosa arylsulfatase27. Substrate promiscuity is when an enzyme can bind several related substrates to perform the same catalysis on

5

them, for example some P450 enzymes can oxidise hundreds of substrates28. Lastly, product promiscuity occurs when an enzyme converts a substrate into a high energy intermediate (e.g. the conversion of E,E farnesyl diphosphate to a reactive cation by γ- humulene synthase), which can then be resolved to a variety of different products29.

The origins of promiscuity are still unclear. In many enzymes there is no evidence that promiscuity has been specifically selected for. In these cases it seems likely that promiscuous activities are simply the accidental by-products of extreme catalytic activity for the native reaction and the difficulty of discriminating between similar substrates. Lack of strong counter-selection against promiscuous activities allows their evolutionary persistence30. For example clotting such as merely need to be specific enough not to cause accidental cleavage.

Conversely some promiscuous enzymes are likely the result of direct selection for multiple activities such as enzymes that scavenge nutrients or remove environmental toxins (e.g. P450 oxidases28, serum paraoxonases31 and phosphotriesterases32). Occasionally promiscuity can be exquisitely refined such as the ribosome. To function it must catalyse the polymerisation of any of 20 natural amino acids (substrate promiscuity), but specifically organised, based on the sequence of an mRNA template rather than mere random polymerisation. In this way controlled promiscuity is necessary for its function.

Divergent evolution 2.4 DIVERGENT MOLECULAR EVOLUTION The accumulation of differences between genes that share a common Divergent evolution of proteins has created a wide array of enzymes ancestor descended from a common ancestor, but now encompassing a wide Superfamily 33,34 The largest set of proteins range of activities . The widest evolutionary category that we can for which common ancestry can be inferred currently place around a set of enzymes is a superfamily which

6 Chapter A - Introduction describes a group of enzymes with common ancestry inferred by fold Adaptive evolution and mechanistic similarity. The divergence of gene sequences occurs Change in gene sequences due to selection for through two main process: Adaptive evolution (natural selection for improved function improved or altered function35,36) and neutral evolution (the Neutral evolution / Genetic drift accumulation of mutations that have little or no effect on function Stochastic change in gene sequences when only leading to genetic drift37,38). selected for the retention of function

Evidence for the importance of promiscuous activities as a starting point for adaptive evolution comes, in part, from comparative studies of existing proteins39. The promiscuous activity of one member of a is often the native activity of another member of that family such as in the aminotransferase40 or alkaline phosphatase superfamilies41–43 (Figure A-3). Cross-wise promiscuity implies that divergent evolution has (partially) specialised each member towards one of a set of similar activities. For example, the pesticide-degrading phosphotriesterase still shows promiscuous activity towards the substrate of the enzyme from which it is descended44. Similarly, type III antifreeze is descended from sialic acid synthase, which itself shows promiscuous ice-binding activity45.

AP NPP

PMH AS

Figure A-3 | Crosswise catalytic promiscuity in the alkaline phosphatase superfamily Shown are the enzymes Alkaline Phosphatase (AP), nucleotide pyrophosphatase/ phosphodiesterase (NPP), phosphonate monoester hydrolase (PMH) and aryl sulfatase (AS). Each enzyme is shown with its native substrate in the circles. Promiscuous activities are indicated by connecting lines. (Adapted from reference43)

The promiscuity of enzymes is a key feature in modern models of molecular evolution. All models of gene diversification by duplication

7

rely on some level of promiscuous activity either before or after gene duplication46. Classical neofunctionalisation models postulate that

after a gene duplicates, one copy is released from selective pressure for Paralog the parental function47. This paralog accumulates mutations that Two genes descended from a duplicated ancestor either deactivate it (pseudogenisation48, Figure A-4a) or create new

Pseudogene activities (Figure A-4b). Conversely, subfunctionalisation models A gene inactivated by deleterious mutations assume some level of promiscuous activity before duplication such that when a gene duplicates each copy can specialise towards one of the parent activities49 (Figure A-4c). a

Ancestral gene

Functionless pseudogene

b Paralog

Ancestral gene

Neofunctionalised paralog c Subfunctionalised paralog A

Promiscuous ancestral gene Subfunctionalised paralog B

Figure A-4 | Neo- and sub- functionalization after gene duplication (a) A gene duplicates but one copy loses function as it mutates to random sequence. (b) A gene duplicates and mutations in one copy endow it with new function. (c) A promiscuous gene duplicates allowing each copy to specialise towards one function.

Both neo- and sub- functionalization theories fail to explain some aspects of molecular evolution. In neofunctionalisation, the relative abundance of deleterious mutations over beneficial mutations50 makes the likelihood that beneficial mutations will accumulate in a duplicated gene before it succumbs to pseudogenisation very low. Conversely, for subfunctionalisation to explain the development of the large range of contemporary functions, requires the assumption that a large number of genes are promiscuous. Indeed it has sometimes been misinterpreted to imply that ancestral genes were all promiscuous and have linearly

8 Chapter A - Introduction become more specialised over time, however several papers have attempted to clear up this confusion30,51.

Modern models tend to lie somewhere in between these extremes whereby some degree of promiscuity likely exists before duplication. Which model best explains divergent evolution is likely to be protein specific and depend on the relative strength of selection, the prevalence of beneficial promiscuity, and the trade-offs between activities50. Duplication may be neutral or be selected for gene copy amplification52 or relief of constraints if optimisation of the new activity trades-off with the native45,53,54.

Through these divergent evolution processes, all new genes are generated (with the rare exception of de-novo gene birth55). The evolution of enzyme active sites is constrained by chemical requirements on function. Other residues are less constrained and so their evolutionary divergence is more dominated by genetic drift (i.e. accumulation of neutral mutations). Therefore, active site residues tend to evolve more slowly except during periods where there is selection for change in function56,57. Additionally, whilst unconstrained residues evolve independently, some residues are highly dependent on the identity of their neighbours. Interactions between residues cause them to co-evolve as ‘protein sectors’58,59 (Figure A-5).

S195Ser His

Asp

Figure A-5 | Protein co-evolution sectors Structure of rat (PDB 3tgi) with co-evolutionary sectors shown as surfaces and substrate as sticks. Blue sector is the β-barrel scaffold. Green sector is the triad and barrel interface. Red sector is the main substrate binding pocket. (Adapted from reference58)

9

2.5 PROTEIN SEQUENCE SPACE AND FITNESS LANDSCAPES

Protein sequence space In order to understand evolution of the enormous diversity of protein All possible protein sequences arranged such sequence and function, it is helpful to talk of sequence space – an that neighbours differ by one amino acid imagined array of all protein sequences. Formally, each residue in a protein is a dimension with 20 possible positions along that axis corresponding to the possible amino acids. Hence there are 400 possible di-peptides arranged in a 20x20 space but that expands to 10130 for even a small protein of 100 amino acids arranged in a space with 100 dimensions. Despite the diversity of protein superfamilies, however, sequence space is extremely sparsely populated by functional proteins60. Most random protein sequences have no fold or function. Enzyme superfamilies, therefore, exist as tiny clusters of active proteins in a vast empty space of non-functional sequence.

A fitness landscape consists of adding an extra axis of fitness. For Fitness landscape A representation of the organisms, fitness formally describes the number of viable offspring fitnesses of all genotypes by height on a surface with that a particular sequence has. When discussing enzymes, similar genotypes closer together however, it is common to define fitness as a particular function of interest61. Explicit fitness landscapes are overwhelmingly multidimensional but features of the topology are often represented in a 3D simplification (e.g by principal component analysis). This generates a landscape with two horizontal axes of sequence space and a vertical axis of fitness62 (Figure A-6).

10 Chapter A - Introduction

Figure A-6 | Example fitness landscape A common representation of an abstract fitness landscape. Sequences are arranged in two horizontal dimensions in a simplified sequence space. The fitness of each sequence is then represented on the vertical axis. This leads to a surface where islands of functional sequences have higher vertical elevation (fitness) in a sea of non-functional sequences.

Since each point is a different enzyme sequence, mutations link adjacent points on the fitness landscape. Adaptive evolution of a single gene can therefore be thought of as a gene moving uphill from point to point in the fitness landscape. Similarly a population of gene variants inhabit a cloud of sequences which evolve such that higher points are more likely to proliferate and lower points are purged, leading the population to move uphill each generation. In this way, the structure of the fitness landscape guides evolutionary trajectories63 and evolutionary trajectories report on fitness landscape structure62.

In these landscapes there are a number of key features. The maximum peak height (the global maximum) represents the most active possible enzyme sequence. Some enzymes likely exist near these as they operate at the diffusion limit4,64,65 (i.e. diffusion of substrate in and product out limits rate) and so fundamentally cannot be any faster. If all mutations are additive, they can be acquired in any order (Figure A-7a). The Epistasis landscape, then, is perfectly smooth, with only one peak and all Interaction between two mutations such that their sequences can evolve uphill to it (Figure A-7d). effects are not additive

Ruggedness Conversely, if mutations interact with one another (epistasis Figure The landscape topology when the fitness of a A-7b,c), the fitness landscape becomes rugged as the effect of a sequence is dissimilar to the fitness of neighbouring mutation depends on the genetic background of other mutations66,67. sequences

11

Such interactions occur whenever one mutation alters the local environment of another residue (either by directly contacting it, or by inducing changes in the protein structure). For example in a disulphide bridge, a single cysteine has no effect on protein stability until a second is present at the correct location. At its most extreme interactions are so complex that the fitness is ‘uncorrelated’ with gene sequence and the topology of the landscape is random (Figure A-7f).

a AB b AB c

AB ab ab Fitness ab B a b A Additive Epistatic

d e f Fitness

Smooth Rugged

Figure A-7 | Mutation epistasis and landscape ruggedness Two mutations a⟶A and b⟶B. (a) When effects are additive mutations can be acquired in any order. (b) Epistasis restricts uphill paths to the order ab⟶aB⟶AB. (c) Sign epistasis prevents both mutations existing together. (d) On a larger scale landscape, high additivity creates a single, smooth peak. (e,f) Increasing average epistasis between mutations creates an increasingly rugged landscape

Most landscapes are likely in between these two extremes of ruggedness (Figure A-7e). Such landscapes contain many local maxima which, though not as high as the global maximum, are nevertheless the highest accessible nearby peak. These are sometimes referred to as multiple evolutionary solutions to a problem68,69. The number and separation of these peaks defines how accessible they are from each other (Figure A-8).

12 Chapter A - Introduction

Additionally, clustering of peaks in regions of the fitness landscape (i.e. landscape structure) indicates that the solutions contained are similar. Such structure comes about when chemical or physical constraints cause some regions of a fitness landscape to be particularly sparse21 in functional sequences, whilst others regions are rich70–72.

Figure A-8 | Fitness peak separation by valleys (a) If multiple fitness peaks are broad and close, then there are few valleys and a sequence starting at point 1 can evolve to the higher peaks 2 or 3. (b) If peaks are separated by wide, deep valleys then genes are trapped at their local optima.

Several important caveats exist. It is important to note that, since the human mind struggles to think in greater than three dimensions, 3D topologies can mislead73,74. In particular it is not clear whether peaks are ever truly separated by fitness valleys in such multidimensional landscapes, or whether they are connected by vastly long neutral ridges75,76. Additionally, the fitness landscape is not static in time but dependent on the changing environment and evolution of other genes. It is hence more of a seascape77, further affecting how separated

13

adaptive peaks can actually be. Finally, since it is common to use function as a proxy for fitness when discussing enzymes, any promiscuous activities exist as overlapping landscapes that together will determine the ultimate fitness of the organism.

With these limitations in mind fitness landscapes can still be an instructive way of thinking about evolution. It is fundamentally possible to measure (even if not visualise) some of the multidimensional parameters of landscape ruggedness and of peak number, height, separation, and clustering. Simplified 3D landscapes can then be used relative to each other to visually represent the relevant features (see section 3.4).

2.6 THE DISTRIBUTION OF FITNESS EFFECTS

The result of mutations to a particular sequence can be described by The Distribution of Fitness Effects (DFE) the Distribution of Fitness Effects (DFE), the spectrum of fitnesses of For a gene or genome, the relative frequencies of mutants. Mutations can be deleterious, neutral or beneficial (each to mutations of different fitness effect on an axis different degrees) and so the DFE describes the local fitness landscape from deleterious to beneficial within a ‘one mutation radius’ of the mutagenised sequence78. The relative frequencies of deleterious, neutral and beneficial mutation have large consequences for evolution. Selectionist theories proposed that almost all mutations were either deleterious or beneficial and hence variation in nature was exclusively adaptive35,36 (Figure A-9a). The neutral theory postulated that, in fact, many mutations were neutral and hence variation was the product of non-directional drift37 (Figure A-9b). This was later refined in the nearly-neutral theory that posited that mutations that are only mildly deleterious are also frequently fixed in populations38 (Figure A-9c).

14 Chapter A - Introduction

a b c 100% 100% 100%

80% 80% 80%

60% 60% 60%

40% 40% 40%

Frequency Frequency Frequency 20% 20% 20%

0% 0% 0%

Figure A-9 | Possible distributions of fitness effects The relative proportions of different classes of mutations in (a) selectionist theories (b) neutral (qualitative data from references38,79)

Studies of fitness landscapes have been largely theoretical73–76,80–82, relying either on mathematical and computation models or on bioinformatic measurements of natural diversity. Nevertheless, there are increasing attempts to study evolution in an experimental setting. Subsequent sections of this chapter will introduce some of the experimental work that is beginning to measure aspects of fitness landscapes and their effects on enzyme evolution.

15

3 EXPERIMENTAL EVOLUTION OF PROTEINS

The study of evolution is largely based on extant organisms and their genes. However, research is often limited by the lack of fossils (and particularly the lack of ancient DNA sequences83–85) and incomplete knowledge of ancient environmental conditions. Experimental evolution has been used in various formats to understand underlying evolutionary processes in a controlled system. Experimental evolution has been performed on multicellular86 and unicellular87 , prokaryotes88, viruses89 as well as individual enzyme90–92, ribozyme93 and replicator94,95 genes. Since experiments address topics as varied as the evolution of multicellularity87, maximum flight speed86 or heat adaptation96,97, I will concentrate here on the experimental evolution of enzyme genes.

Evolution requires three things to occur: variation between replicators,

that variation causes fitness differences upon which selection acts, and Directed Evolution (DE) An experimental process of that this variation is heritable98. Directed evolution (DE) is a mimic of mutation, selection and amplification that mimics the natural evolution cycle in which a single gene is evolved by iterative natural evolution on a single gene rounds of mutagenesis, selection or screening, and amplification (Figure A-10). It is frequently used for protein engineering as an alternative to rational design99, but can also be used to investigate fundamental questions of enzyme evolution62.

As a protein engineering tool, DE has been most successful in three areas. Firstly, improving enzyme stability for biotechnological use at high temperatures or in harsh solvents15,96. Secondly, for improving binding affinity of antibody therapeutics100 and the activity of de novo designed enzymes101. Thirdly in altering substrate specificity of existing enzymes102–105, again, often for use in industry99.

16 Chapter A - Introduction

Gene library E. coli expressing Gene library

Mutagenic Screening PCR Mutagenesis (fitness (variation) differences) Activity assay

Gene Gene amplification (heredity)

Isolation of Gene isolation desired variants

Figure A-10 | Directed evolution cycle An example of a directed evolution experiment with comparison to natural evolution. Inner circle indicates the 3 stages of the cycle with the natural process being mimicked in brackets. The outer circle demonstrates a typical experiment.

Of course, selecting for improvement in the assayed function simply generates improvements in the assayed function. To understand how these are achieved, the properties of the evolving enzyme have to be measured. Improvement of the assayed activity can be due to improvements in enzyme catalytic activity (Figure A-11) or enzyme concentration.

Figure A-11 | Directed evolution of a cytochrome P450 Directed evolution of P450-BM3 (hydroxylase of fatty acids) for hydroxylation of propane to propanol (round 9 selection for increased stability). First panel shows how the assayed activity of turnover number increased. Second and third panels show the improvements of kinetic parameters kcat (catalysis) and KM (binding). Final panel shows the concomitant 62 loss of T50 (thermal denaturation temperature). (Adapted from reference )

17

Additionally, there is no guarantee that improvement on one substrate will improve activity on another. This is particularly important when the desired activity cannot be screened or selected for and so a ‘proxy’ substrate is used which can lead to evolutionary specialisation to the proxy without improving the desired activity. Consequently, choosing appropriate screening or selection conditions is vital for successful directed evolution. In sections 3.1-3.3 I will outline the variety of techniques that can be used for each of the three steps in the directed evolution cycle.

3.1 GENERATING VARIATION

The first step in performing a cycle of DE is the generation of a library of variant genes. Random sequence is vast (10130 possible sequences for a 100 amino acid protein) and extremely sparsely populated by functional proteins. Neither experimental106, nor natural107 evolution can ever get close to sampling so many sequences. Of course, natural evolution samples variant sequences close to functional protein sequences and this is imitated in DE by mutagenising an already functional gene (Figure A-12).

The starting gene can be mutagenised by random point mutations (by chemical mutagens108 or error prone PCR109) and insertions and deletions (by transposons110, and S. Emond unpublished data). Gene recombination can be mimicked by DNA shuffling111 of several sequences (usually of more than 70% homology) to jump into regions of sequence space between the shuffled parent genes. Finally, specific regions of a gene can be systematically randomised112 for a more focused approach based on structure and function knowledge. Depending on the method, the library generated will vary in the proportion of functional variants it contains. Even if an organism is used to express the gene of interest, by mutagenising only that gene,

18 Chapter A - Introduction the rest of the organism’s genome remains the same and can be ignored for the evolution experiment (to the extent of providing a constant genetic environment).

Point mutations

Insertions and deletions

Shuffling

Figure A-12 | Mutagenesis methods Starting gene (left) and library of variants (right). Point mutations change single nucleotides, insertions and deletions add or remove sections of DNA, shuffling recombines two (or more) similar genes.

3.2 DETECTING FITNESS DIFFERENCES

Two main categories of method exist for isolating functional variants. Selection systems directly couple protein function to survival of the gene, whereas screening systems individually assay each variant and allow a quantitative threshold to be set for sorting a variant or population of variants of a desired activity.

Selection for binding activity is conceptually simple. The target molecule is immobilised on a solid support, a library of variant proteins is flowed over it, poor binders are washed away, and the remaining bound variants recovered to isolate their genes113. Binding of an enzyme to immobilised covalent inhibitor has been also used as an attempt to isolate active catalysts4. This approach, however, only selects for single catalytic turnover and is not a good model of substrate binding or true substrate reactivity. If an enzyme activity can be made necessary for cell survival, either by synthesizing a vital metabolite, or

19

destroying a toxin, then cell survival is a function of enzyme activity114,115. Such systems are generally only limited in throughput by the transformation efficiency of cells. They are also less expensive and labour intensive than screening, however they are typically difficult to engineer, prone to artefacts and give no information on the range of activities present in the library.

An alternative to selection is a screening system. Each variant gene is individually expressed and assayed to quantitatively measure the activity (most often by a colourgenic or fluorogenic product). The variants are then ranked and the experimenter decides which variants to use as temples for the next round of DE. Even the most high throughput assays usually have lower coverage than selection methods but give the advantage of producing detailed information on each one of the screened variants. This disaggregated data can also be used to characterise the distribution of activities in libraries which is not possible in simple selection systems. Screening systems, therefore, have advantages when it comes to experimentally characterising adaptive evolution and fitness landscapes.

3.3 ENSURING HEREDITY

When functional proteins have been isolated, it is necessary that their genes are too, therefore a genotype-phenotype link is required116 (Figure A-13). This can be covalent, such as mRNA display where the mRNA gene is linked to the protein at the end of translation by puromycin106. Alternatively the protein and its gene can be co- localised by compartmentalisation in living cells117 or emulsion droplets118. The gene sequences isolated are then amplified by PCR or by transformed host bacteria. Either the single best sequence, or a pool of sequences can be used as the template for the next round of mutagenesis. The repeated cycles of Diversification-Selection-

20 Chapter A - Introduction

Amplification generate protein variants adapted to the applied selection pressures.

Covalent link Compartment Protein Gene

Gene

Puromycin Protein

Figure A-13 | Genotype-phenotype linkage methods As the protein is expressed, it can either be covalently linked to its gene (as in mRNA, left) or compartmentalized with it (cells or artificial compartments, right). Either way ensures that the gene can be isolated based on the activity of the encoded protein.

3.4 HIGH THROUGHPUT SEQUENCING

The maximum library size that can be screened for active variants (also called library coverage) is limited by the throughput of the assay. Additionally, the number of active variants chosen to be taken through to the next round of DE is limited by data analysis of their sequences. If a single variant is selected from one round to the next, each member of the lineage can be characterised in molecular detail but stochastic events by genetic drift have a larger effect. Conversely, at each round a pool of improved variants can be selected however it has, until recently, not been possible to know the sequences present in this gene pool. This trade-off between depth and breadth in DE studies has been partly alleviated by the rise of ‘next generation’ sequencing technologies. It has become feasible to sequence 105 or more gene variants and so it is possible to keep track of the composition of large, mixed pools of variants.

Additionally, by sequencing a library of variants before and after an High Throughput Screening experimental selection, it is possible to measure the enrichment of and Sequencing (HT-SAS) Using next generation functional sequences and the purging of non-functional sequences. sequencing to monitor the enrichment of mutations Indeed, the degree of enrichment for each variant (its enrichment after experimental selection to ascribe fitnesses to a large set of mutants

21

factor) indicates the relative fitnesses of variants and therefore which residue mutations are adaptive to that selection.

This technique has been used in two relevant ways, with a focus either on DNA sequence or on protein structure. If screening is performed in combination with next generation sequencing, each sequence can have a fitness ascribed to it and be laid out as an explicit fitness landscape119. For example, when 30nt DNA molecules were evolved to bind a protein target120, the local fitness landscape could be mapped to show two distinct broad optima, one higher than the other (Figure A-14). Sequences were observed to evolve uphill towards these peaks over 8 rounds. These works are revealing the complexity of real fitness landscapes and in particular their tendency to have many peaks of similar fitness79,120–122.

a b

c

Figure A-14 | Experimentally measured fitness landscape of DNA binders 6000 DNA sequences (30nt) were evolved to bind allophycocyanin protein for 8 rounds. (a) Phylogeny of successful binders in the final round. (b) Sequence space diagram of all sequences (dots) from all rounds with related sequences clustered together. Point colour indicates binding score (fitness). Arrows indicate two lineages starting from the red circle and evolving to the yellow circle. (c) The sequence space of (b) converted into a fitness landscape (smoothed), spheres indicate 150 randomly picked initial variants with lined depicting the direction of subsequent evolution by round 7. (Adapted from reference120)

When adaptive and deleterious mutations are mapped on the protein’s structure they can indicate which regions are enriched in functional variation and which are constrained to the wild type amino acid122–126

22 Chapter A - Introduction

(Figure A-15). This can show structural and functional constraints that prevent adaptive mutation. Additionally, it can uncover regions where beneficial mutations are more common, acting as evolutionary hotspots.

Figure A-15 | Sequence-function relationship mapped on protein structure Structure of WW domain as spheres (PDB 1jm q) bound to peptide as sticks. Colour indicated evolutionary constraint after mutants were selected for peptide binding, blue is constrained to wt residue, white is no residue preference, and red is preference for non- wt residue. (Adapted from reference121)

Together, these technologies and methods are beginning to allow the understanding of complex evolutionary processes that have traditionally not been possible to investigate in detail. Examples of such processes are the evolution of robustness and evolvability, the structure of real fitness landscapes and the experimental evolution of large populations. The rest of this chapter will introduce these topics and the rest of this thesis will consist of their investigation.

23

4 ROBUSTNESS

4.1 GENOMIC AND PROTEIN ROBUSTNESS

Genomes mutate by environmental damage and imperfect replication, yet they display remarkable tolerance. For example >95% of point mutations in C. elegans have no detectable effect127 and even 90% of

single gene knockouts in E. coli are non-lethal128. This comes from Robustness The tendency to remain the robustness both at the genome level and protein level. Since Muller’s same after some perturbation, in this case ratchet (the fixation of slightly deleterious mutations) tends to reduce after mutation robustness, selection pressures must exist that act to increase it (see Chapter D, 2.1).

There are many mechanisms that provide genome robustness. For example, genetic redundancy129 reduces the effect of mutations in any one copy of a multi-copy gene. Additionally the flux through a metabolic pathway is typically limited by only a few of the steps, meaning that changes in function of many of the enzymes have little effect on fitness130,131. Similarly metabolic networks haven multiple alternative pathways to produce many key metabolites132. Protein mutation tolerance is the product of two main features: the structure of the genetic code and protein structural robustness133,134.

Proteins are resistant to mutations because many sequences can fold into highly similar structural folds135. A protein adopts a limited ensemble of native conformations because those conformers have lower energy than unfolded and misfolded states (ΔΔG of folding)136,137. This is achieved by a distributed, internal network of cooperative interactions (hydrophobic, polar and covalent)138. Protein structural robustness results from few single mutations being sufficiently disruptive to compromise function. Proteins have also evolved to avoid aggregation139 as partially folded proteins can

24 Chapter A - Introduction combine to form large, repeating, insoluble fibrils and masses140. There is evidence that proteins show negative design features to reduce the exposure of aggregation-prone β-sheet motifs in their structures141.

Additionally, there is some evidence that the genetic code itself may be optimised such that most point mutations lead to similar amino acids (conservative)142,143. These factors create a DFE of mutations that contains a high proportion of neutral and nearly-neutral mutations. Robustness is sometimes measured as the proportion of mutations that are neutral (roughly two thirds in proteins so far measured50,144,145).

4.2 NEUTRAL NETWORKS

This resistance to deleterious mutation allows access to a large neutral network of phenotypically equivalent genes146–148. Since many Neutral network A set of genes with sequences have equivalent function, they can be imagined as existing equivalent function or fitness all related by point on a broad, flat plateau on a fitness landscape. The more robust a mutations protein, the more neutral neighbours it has149. Since some sequences have more neutral neighbours than others, the population is predicted to evolve towards these robust sequences149 (Figure A-16). This is sometimes called circum-neutrality and represents the movement of populations away from cliffs in the fitness landscape150. Experiments are beginning to confirm these predicted processes (see Chapter D).

a b Sequence Sequence space

Sequence space

Figure A-16 | Neutral network Each circle represents a functional gene variant and lines represents point mutations between them. Light grid-regions are sequences with low fitness, dark regions have high fitness. (a) White circles have few neutral neighbours, black circles have many. Light grid- regions contain no circles because those sequences have low fitness. (b) The population is predicted to evolve towards the centre and away from ‘fitness cliffs’ (dark arrows).

25

5 EVOLVABILITY

5.1 DEFINITION AND QUANTIFICATION

Just as evolution is the change of gene sequences over time, evolvability is the propensity for change over time. Early definitions were mostly Evolvability The potential of a trait or concerned with heritability of variation and so whether that variation sequence for further evolution (sometimes could lead to evolution. Later definitions are broader – the latent measured as the proportion of adaptive mutations) potential for further evolution. Evolvability is sometimes quantified as the percentage of mutations that lead to a particular trait. At its minimum, zero evolvability indicates that the evolution of a trait is impossible (e.g. enzyme rates faster than the diffusion limit or insects thicker than 16 cm151,152). Evolvability of 10-3 would mean that in a 200 amino acid enzyme, only a single point mutation is adaptive. Higher evolvability indicates multiple possible solutions. In this sense, evolvability is dependent on the DFE of mutations. Neutral evolvability will depend on the proportion of mutations that are not deleterious, i.e. robustness. Adaptive evolvability will depend on the much smaller number of mutations that are beneficial to function and fitness.

Experimentally measured DFEs are, in general, roughly bimodal (Figure A-17) whether for individual proteins79 or whole organisms153,154. Some studies have also identified a discrete nearly- neutral set and measured the amount by which these mutations reduce fitness155. Additionally, the distribution of the rare set of beneficial mutations seems to be exponential with mutations of small effect being more common than mutations of large effect156–158.

26 Chapter A - Introduction

a b

Model frequency Model Observed frequency Observed

Selection coefficient

c

Observed frequency Observed Observed frequency Observed

Selection coefficient Selection coefficient Figure A-17 | Experimentally measured distributions of fitness effects In a DFE, the selection coefficient describes the relative enrichment by selection where 0 indicates neutrality, -1 indicates lethality, and values >0 indicate benefit. (a) The proportions of different mutation effects expected by models and the observed DFE from mutagenesis of 9 residues of Heat Shock Protein 90 in yeast to all possible amino acids. The red line shows the DFE of non-synonymous mutations, the grey line shows the DFE of synonymous mutations. Observed DFE in (b) 66 Tobacco Etch Virus mutants and (c) 45 ΦX174 virus mutants. (Adapted from references79,154)

Estimates of the abundance of beneficial mutations span 5 orders of magnitude in different systems50,97,123,126,157–162 (Figure A-18). This range is unsurprising since one can imagine different selection pressures for which it would be more or less difficult to find adaptive mutations.

10-0

10-1 Drosophila evolution to lab conditions Pseudomonas evolution to sorbitol, mannitol or glucose as sole carbon source -2 10 U3 ubiquitin evolution for optimised ligase activity β-lactamase evolution to hydrolyse ampicillin 10-3 β-lactamase evolution to hydrolyse cefotaxime 10-4 Bacteriophage φX174 evolution towards high temperature

10-5 Pseudomonas evolution to serine as a sole carbon source Abundance of adaptive mutations adaptive of Abundance 10-6 E. coli evolution to rich media

Figure A-18 | Abundance of beneficial mutations The proportion of spontaneous mutations that are adaptive to different selection pressures in different organisms. E3 ligase and β-lactamase evolution was performed by mutagenesis of that gene only, other evolutions by whole genome mutagenesis. The ‘lab conditions’ of the Drosophila evolution are detailed in reference161 and references therein. (Data from references50,97,123,157–162)

27

5.2 DETERMINANTS OF EVOLVABILITY

Evolvability depends on two aspects of a fitness landscape. Firstly, the Pleiotropy 163 When a single mutation or population’s movement speed through a fitness landscape gene has multiple effects (mutation164–166 and recombination111,167 rate). Secondly, the structure of that landscape (pleiotropy168, chemical/physical constraints21 and epistasis169–172) can limit the number of accessible fitness peaks.

Since functional sequences are not evenly distributed in sequence space, some biochemical properties are more readily accessible from some points in sequence space. Some key determinants the structure of the fitness landscape are detailed below.

Pleiotropy occurs when a single mutation affects multiple traits173,174. It was originally used to describe genes or mutations that affect multiple traits in organisms such as butterfly wing spots175. In biochemistry it is used for underlying protein properties. For example, it is commonly observed that mutations increasing activity decrease stability61,176 (negative pleiotropy). Conversely, proteins often struggle to discriminate between similar substrates and mutations improving an adaptive activity may increase an anti-adaptive one168 (positive pleiotropy). For example when β-isopropylmalate dehydrogenase is evolved to use NADP+ instead of the native NAD+ as a , it is unable to distinguish between NADP+ and the more abundant NADPH (Figure A-19a). The NADPH acts as an inhibitor and inability to uncouple their affinities leads to constraint on efficiency with the NADP+ cofactor. Similarly, RuBisCO cannot perfectly distinguish between carbon dioxide and oxygen, causing energy wastage177,178. Pleiotropic effects, therefore, reduce the power of selection to independently optimise enzyme properties.

28 Chapter A - Introduction

a b

/ µM /

NADPH

i K

NADP KM / µM

c d

Diffusionlimit

Relative frequency Relative

1E+00 1E+01 1E+02 1E+03 1E+04 1E+05 1E+06 1E+07 1E+08 1E+09 1E+10 k /K / s-1M-1 cat M Figure A-19 | Measured determinants of evolvability (a) In β-isopropylmalate dehydrogenase being experimentally evolved to use NADP+ as a cofactor, the pleiotropic affinity for the more abundant NADPH causes inhibition. This positive pleiotropy prevents discrimination between the desired cofactor and the reduced inhibitor. Black circles are random point mutants and white circles are engineered variants. (b) In serine and cysteine proteases, the nucleophile cannot be replaced with threonine as all possible rotamers of threonine cause a clash between the γ-methyl and backbone or basic triad member. This constraint causes all threonine proteases to have their nucleophile at the N-terminus. (c) The distribution of enzyme activities. The fastest enzymes on the far right are constrained by the diffusion limit and hence there are no 10 -1 -1 enzymes with kcat/KM > 10 s M . (d) In β-lactamase evolved to cleave cefotaxime, the best mutant has 5 mutations. Of all 120 possible routes from the wt (-----) to the final mutant (+++++), only the 18 shown here give consistent step-wise increases in kcat/KM. Each circle is an enzyme variant, the number indicates kcat/KM, and arrows indicate mutations. Wider arrows indicate more probable mutations, dashed and solid arrows refer to correlated and equal fixation probability models. (Adapted from references4,21,168,170)

Likewise, physical and chemical constraints prevent the evolution of some features. For example some proteases use threonine as their nucleophile. This has to be at the N-terminus of the protein since the methyl group sterically clashes with other residues in the enzyme in order to accommodate the substrate21 (Figure A-19b). Consequently, serine proteases can never have their nucleophile replaced with

29

threonine. In the same way, the speed of diffusion limits the maximum attainable rate of an enzyme or transport protein4,64,65 (Figure A-19c).

Finally, epistasis is usually considered a constraining factor on evolution as the lack of a smooth landscape makes it harder for evolution to access fitness peaks172. In highly rugged landscapes, fitness valleys block access to some genes, and even if ridges exist that allow access, these may be rare or prohibitively long. For example, 5 mutations together enable TEM1 β-lactamase to cleave cefotaxime (a 3rd generation antibiotic). However, of the 120 possible pathways to this 5-mutant variant, only 7% were accessible to evolution as the remainder passed through fitness valleys170 (Figure A-19d). Conversely, a 5-mutant variant of epoxide hydrolase with altered specificity can be accessed by 46% of pathways179.

5.3 EVOLUTION OF EVOLVABILITY

Of particular contention is whether evolvability is itself evolvable163,180,181 or whether it is an intrinsic, unselected property of biological systems172. Initially the idea of evolvability as being, itself, evolvable made many biologists uncomfortable as it seems teleological (the product of foresight)182–184.

Recent works are beginning to suggest that there may be mechanisms whereby evolvability can evolve. In some instances there is evidence that evolvability evolved as the accidental, pleiotropic by-product of direct selection of another trait185. Alternatively, an evolvability- enhancing mutation might be indirectly selected by hitchhiking on the fitter variants it facilitates186 (second-order selection).

These works have largely focussed on evolution of evolvability by the altered rate of mutation and recombination. There is now increasing interest in experimentally exploring whether populations can also

30 Chapter A - Introduction evolve towards regions of a fitness landscape187,188, which promote evolvability by being less constrained (see Chapter D) and hence possibly richer in beneficial mutations (see Chapter E).

5.4 THEORETICAL AND APPLIED SIGNIFICANCE

The study of evolvability has fundamental importance for understanding very long term evolution of protein superfamilies71,72 and organism phyla and kingdoms180,189,190. A thorough understanding of the details of long term evolution will likely form part of the Extended Evolutionary Synthesis180,191–193 (the update to the Modern Synthesis). In addition, these phenomena have two main practical applications. For protein engineering we wish to increase evolvability, and in medicine and agriculture we wish to decrease it.

Firstly, for protein engineering it is important to understand the factors that determine how much a protein function can be altered. In particular, both design and DE approaches aim to create changes rapidly through mutations with large effects194,195. Such mutations, however, commonly destroy enzyme function or at least reduce tolerance to further mutations134,176. Identifying evolvable proteins196 and manipulating their evolvability will be increasingly necessary in order to achieve ever larger functional modification of enzymes.

Secondly, many human pathologies are not static phenomena but capable of evolution. In medicine, bacteria, fungi, and cancers evolve to escape host immune responses and to develop drug resistance197–199. In agriculture, are becoming resistant to common herbicides200 and insects to insecticides201. It is possible that we are facing the end of the effective life of most available antibiotics202. Predicting evolvability of our pathogens203 and devising strategies to slow or circumvent204 it will require deeper knowledge of the complex forces driving evolution at the molecular level.

31

6 TEV PROTEASE – A MODEL ENZYME

For the work presented in this thesis, I use the nuclear inclusion cysteine protease of Tobacco Etch Virus (TEV protease) as a model enzyme which I shall introduce here.

6.1 ORIGIN

In order to minimise genome size, many viruses express their entire genome as one massive polyprotein which is subsequently cleaved into functional units by proteases205. Tobacco etch virus is a filamentous potyvirus (Figure A-20) for which proteolysis is largely performed by the nuclear inclusion cysteine protease206 to generate mature viral proteins. TEV protease can be expressed as an active but non-toxic, 27kDa protein in E. coli207 and is commonly used as a biochemical tool for the sequence-specific cleavage of fusion proteins207,208. It is consequently well characterised and shows several properties that make it a useful model system for experimental evolution.

a b

Figure A-20 | Filamentous potyvirus The structure of the closely related tobacco mosaic virus based on (a) the X-ray crystal structure of the capsid protein (2tmv) and (b) scanning electron microscopy. (Adapted from references209,210)

6.2 STRUCTURE AND FUNCTION

The structure of TEV protease has been solved by X-ray crystallography211. It is comprised of two β-barrels and a flexible C-

32 Chapter A - Introduction terminal tail (Figure A-21) and displays homology to the superfamily of proteases19 (see Chapter C). Covalent catalysis is performed with an Asp-His-Cys triad, split between the two barrels212 (Asp on β1 and His and Cys on β2). The substrate is held as a β- sheet213, forming an antiparallel interaction with the cleft between the barrels and a parallel interaction with the C-terminal tail. The enzyme therefore forms a binding tunnel around the substrate and side chain interactions control specificity (detailed description in section 6.3). Catalysis proceeds as described in Figure A-2.

β2 β1

Figure A-21 | TEV protease β-barrels Structure of TEV protease. The double β-barrels that define the super family are highlighted in red. The final 15 amino acids (222-236) of the enzyme C-terminus are not visible in the structure as they are too flexible and the final 12 are not necessary for proteolytic activity214. (PDB 1lvb)

As a biochemical tool, TEV protease has several limitations. It is prone to deactivation by self-cleavage (autolysis), though this can be abolished through a single S219V mutation in the internal cleavage site215. The protease expressed alone is also poorly soluble216 however several attempts have been made to improve its solubility through DE217 and computational design218. It has also been shown that expression can be improved208 by fusion to Maltose Binding Protein (MBP) which acts a solubility enhancing partner219. In the work

33

presented in later chapters I use the S219V variant of TEV fused to MBP for evolution.

6.3 SPECIFICITY

The reason for the use of TEV protease as a biochemical tool is its high sequence specificity. This specificity allows the controlled cleavage of proteins when the preferred sequence is inserted into flexible loops. It also makes it relatively non-toxic in vivo208 as the recognized sequence scarcely occurs in proteins.

The preferred cleavage sequence was first identified by examining the cut sites in the native polyprotein substrate for recurring sequence220. The consensus for these native cut sites is ENLYFQ\S where ‘\’ denotes the cleaved peptide bond. Residues of the substrate are labelled P6 to P1 before the cut site and P1’ after the cut site. Early works also measured cleavage of an array of similar substrates to characterise how specific the protease was for the native sequence205,221. Studies have subsequently used sequencing of cleaved substrates from a pool of randomised sequences to determine preference patterns222,223.

Though ENLYFQ\S is the optimal sequence, the protease is active to a greater or lesser extent on a range of substrates (i.e. shows some substrate promiscuity). The highest cleavage is of sequences closest to the consensus EXLYΦQ\φ where X is any residue, Φ is any large or medium hydrophobe and φ is any small hydrophobe205,221–223.

Specificity is endowed by the large contact area between enzyme and substrate. Proteases such as trypsin have specificity for one residue before and after the cleaved bond due to a shallow binding cleft with only one or two pockets that bind the substrate side chains. Conversely, viral proteases such as TEV protease have a long C- terminal tail which completely covers the substrate to create a binding

34 Chapter A - Introduction tunnel (Figure A-22a). This tunnel contains a set of tight binding pockets such that each side chain of the substrate peptide (P6 to P1’) is bound in a complementary site (S6 to S1’) (Figure A-22b)211.

a b

Figure A-22 | Substrate bound in the active site tunnel Surface model of TEV bound to uncleaved substrate (black), also showing the catalytic triad (red). (a) Substrate bound inside the active site and (b) enzyme cut through the plane of the active site to expose the internal organisation of the bound substrate (black) and triad (red). (TEV C151A nucleophile mutation - PDB 1lvb)

In particular, peptide side chain P6-Glu contacts a network of three hydrogen bonds; P5-Asn points into the solvent, making no specific interactions (hence the absence of substrate consensus at this position); P4-Leu is buried in a hydrophobic pocket; P3-Tyr is held in a hydrophobic pocket with a short hydrogen bond at the end; P2-Phe is also surrounded by hydrophobes including the face of the triad histidine; P1-Gln forms four hydrogen bonds; and P1’-Ser is only partly enclosed in a shallow hydrophobic groove.

Although rational design has had limited success in changing protease specificity, DE has been used to change the preferred residue either before224 or after114,214 the cleavage site. I discuss these works in Chapter E, 2.3, and approach specificity from an evolutionary point of view, attempting to change the specificity of multiple substrate residues simultaneously.

35

7 OVERVIEW OF RESULTS

In Chapter B, I describe the development of both a selection system and a screening system for protease activity detection and quantification. The screening system was then used for the subsequent evolution experiments of TEV protease.

Chapter C addresses the evolution of key active site residues by mutating the nucleophile from cysteine to serine. DE using small libraries recovered protease activity and combining this with HT-SAS, I characterised the evolutionary trajectory and its flanking fitness landscape.

I use high throughput DE and HT-SAS in Chapter D to investigate the evolution of protein robustness and tolerance to mutations. By selection only for retention of wild-type-like activity, I neutrally evolve a population of TEV protease genes. I use deep sequencing to reconstruct the complex phylogeny and show that robust sequences appear and that their tolerance to mutations allows their descendants to dominate later populations.

To test whether these populations also show altered evolvability, in Chapter E I look at adaptive evolution towards new (possibly clinically relevant) substrate sequences. By quantifying changes in evolvability of the population I separate out two distinct mechanisms whereby neutral mutations can increase different aspects of evolvability.

Finally, in Chapter F I draw these experiments together and comment on the characterisation of the fitness landscape surrounding TEV protease and its implications for evolvability and enzyme engineering.

36

Chapter B - Selection and screening methods for protease directed evolution

1 SUMMARY

High throughput protease assays are necessary to detect fitness differences between variants and isolating those required for directed evolution. Both selection and screening systems were attempted. A cell survival selection system was constructed, in which the cleavage recognition sequence of Tobacco Etch Virus protease was successfully cloned into the toxic β-lactamase inhibitory protein. Though inhibitory activity was retained, co-expression of the protease did not reduce this inhibition to restore cell survival. In parallel, a screening system was adapted, based on the proteolytic destruction of fluorescence resonance energy transfer in a fusion of donor and acceptor fluorescent proteins. The assay was first optimised for 96-well plate lysate screening (in conjunction with P. Gatti-Lafranconi) and then for fluorescence- activated cell sorting. The 96 well lysate screen proved the more sensitive method, but the cell sorting screen was high throughput enough to be mainly limited by cell transformation efficiency.

37

2 INTRODUCTION (SELECTION)

In a directed evolution cycle of diversification, selection/screening and amplification, only the second step is protein specific. As discussed in Chapter A, 3.2, this requires either for the particular activity to be coupled to gene enrichment (selection) or for a high throughput assay (screening). As proteases are active on peptides, the substrate can be synthesised directly by ribosomes in vivo. This affords the opportunity of a cell-survival selection system based on the destruction of a toxic protein. Because the preference sequence is so long, it occurs by random once every 207=109 times and so is highly unlikely to occur in a protein. Additionally, TEV protease cleaves in unstructured regions, so the sequence needs to be in a flexible loop to be accessible.

2.1 β-LACTAMASE INHIBITOR PROTEIN AND INHIBITION-RELIEF SELECTION

In general, selection systems are high throughput and allow a fast and crude sweep of a large library225. By engineering a cleavage site into a lethal protein, cleavage by the protease should restore vitality and allow isolation of active genes. An inhibitor protein of β-lactamase was β-lactamase Inhibitor Protein (BLIP) identified as a good candidate by P. Gatti-Lafranconi as the selection A competitive inhibitor of the antibiotic resistance pressure could be tuned by varying the ampicillin concentration. gene β-lactamase E. coli cells expressing β-lactamase are able to survive on ampicillin as β-lactamase hydrolyses the antibiotic. Co-expression of β-lactamase Inhibitor Protein (BLIP) in these cells prevents cell survival on ampicillin226. The structure of the interaction complex has been solved227 (Figure B-1) and the residues critical to this interaction mapped by mutagenic studies228. These demonstrated that BLIP competitively inhibits β-lactamase by binding over its active site and

38 Chapter B - Selection and screening methods for protease directed evolution inserting residues into the active site pocket. As β-lactamase is localised to the periplasm, export of BLIP is critical to function.

Figure B-1 | BLIP in complex with β-lactamase Complex of BLIP (grey) and β-lactamase (white). The yellow loop indicates a putative sites for insertion of a protease cleavage sequence. (PDB file 3c7v).

The proposed system consists of coupling the activity of the protease to cell survival by engineering a BLIP protein to act as a substrate of the protease (Figure B-2). The sequence must be inserted at a point that will not affect the binding of BLIP to β-lactamase whilst still being accessible to the protease. Cleavage of BLIP by the protease must destroy the interaction with β-lactamase. Thus, an active protease that is expressed in the periplasm would cleave BLIP, which could no longer inhibit β-lactamase, allowing cell survival on ampicillin.

Protease BLIP Β-lactamase Ampicillin

Active protease Survival Inactive protease Death

Figure B-2 | Scheme of how an active protease is coupled to cell survival Protease activity and cell survival are linked through a series of inhibitory interactions (described in text).

The ampicillin concentration can be used to control the stringency of selection so that selection pressure can be adapted to the protease library being used. Selection generates a population enriched in functional variants. By mutating the inserted cleavage site, the system would be adaptable to any other protease that is sequence-specific enough to be non-toxic in E. coli but active enough to cleave BLIP.

39

3 RESULTS AND DISCUSSION (SELECTION)

3.1 BLIP PLASMID ASSEMBLY

For a functional selection system, it was important to ensure inhibition of β-lactamase by BLIP that is efficient enough to abolish ampicillin resistance but sensitive enough to have that resistance restored by BLIP’s proteolysis. To achieve this it was necessary to construct a custom plasmid that ensures 1:1 stoichiometry of BLIP to β-lactamase. Additionally, the lactose inhibitor (LacI) was required for controlling inducible expression and the chloramphenicol acetyltransferase (CAT) as an alternative antibiotic resistance marker when BLIP is being expressed.

I therefore designed several cloning strategies, eventually constructing the plasmid pSB-BLIP (Figure B-3). β-lactamase was amplified from pUC18 by PCR (using primers blaF+blaR) which was ligated into pJET1.2/BLUNT to make pJET-bla. The pJET-Bla plasmid was subsequently digested and β-lactamase inserted into the multiple cloning site of pSU18. To insert wtBLIP as well as each loop variant, the BLIP gene and LacI were amplified by PCR from pGR32-BLIP (using primers blipF+blipR) and also inserted into the plasmid. The final plasmids were designated pSB-BLIP.

Both β-lactamase and BLIP are preceded by the TEM1 signal sequence for export to the periplasm. Chloramphenicol acetyltransferase was included so that the plasmid can be maintained when ampicillin selection is not being applied.

40 Chapter B - Selection and screening methods for protease directed evolution

TEM1 signal sequence

Bla promoter

Lac operon

pSB-BLIP 

Figure B-3 | pSB-BLIP plasmid map The plasmid used to express BLIP (and variants). Inside wedges indicate which starting plasmids were used to construct the final BLIP plasmid. Gene abbreviations are: BLIP = β- lactamase inhibitory protein; LacI = Lactose inhibitor; Cat = Chloramphenicol acetyltransferase; Bla = β-lactamase.

3.2 INSERTION OF CLEAVAGE SEQUENCE INTO BLIP PRESERVES FUNCTION

In order for the ampicillin inhibition-relief selection system (Figure B-2) to function, a cleavable BLIP protein needed to be engineered. As previously stated, the position of the insert has to satisfy the following criteria:

 The uncleaved loop in BLIP must not affect β-lactamase interaction  The loop must be exposed and accessible for the protease to cleave  Upon cleavage, the BLIP fragments must not inhibit β-lactamase

I identified two possible sites based on analysis of the structure of the BLIP/β-lactamase complex (Figure B-4). Loop regions were chosen in the hope that they would better tolerate the presence of insertions and minimally disrupt the rest of the structure. Both insert sites are situated away from the interaction surface with β-lactamase. Finally both sites roughly bisect the protein so that residues essential to the interaction228,229 with β-lactamase are split between the two halves.

41

a 1 b

1

2 2

Figure B-4 | Structure of BLIP showing insert sites BLIP is shown as cartoon with the insert loops 1 and 2 in yellow and key interaction residues in red. β-lactamase is shown as surface in the background. The structure is shown from (a) the front and (b) 90o to the side. (PDB file 3C7V)

Loops were inserted using the USER cloning method12 with primers ins1F+ins1R and ins2F+ins2R to generate iBLIP and i2BLIP (Table B-1) in the original pGR32 plasmid.

i_BLIP MSIQHFRVALIPFFAAFCLPVFAHPEHHHHHHAGVMTGAKFTQIQFGMTRQQVLDIA i2_BLIP MSIQHFRVALIPFFAAFCLPVFAHPEHHHHHHAGVMTGAKFTQIQFGMTRQQVLDIA wt_BLIP MSIQHFRVALIPFFAAFCLPVFAHPEHHHHHHAGVMTGAKFTQIQFGMTRQQVLDIA

i_BLIP GAENCETGGSFGDSIHCRGHAAGDYYAYATFGFTSAAENLYFQNADAKVDSKSQEKL i2_BLIP GAENCETGGSFGDSIHCRGHAAGDYYAYATFGFTSAA------ADAKVDSKSQEKL wt_BLIP GAENCETGGSFGDSIHCRGHAAGDYYAYATFGFTSAA------ADAKVDSKSQEKL

i_BLIP LAPS------APTLTLAKFNQVTVGMTRAQVLATVGQGSCTTWSEYYPAYPSTAGV i2_BLIP LAPSENLYFQNAPTLTLAKFNQVTVGMTRAQVLATVGQGSCTTWSEYYPAYPSTAGV wt_BLIP LAPS------APTLTLAKFNQVTVGMTRAQVLATVGQGSCTTWSEYYPAYPSTAGV

i_BLIP TLSLSCFDVDGYSSTGFYRGSAHLWFTDGVLQGKRQWDLVF 205 i2_BLIP TLSLSCFDVDGYSSTGFYRGSAHLWFTDGVLQGKRQWDLVF 205 wt_BLIP TLSLSCFDVDGYSSTGFYRGSAHLWFTDGVLQGKRQWDLVF 198

Table B-1 | Alignment of wild type BLIP and BLIP insert variants Residues known to strongly affect BLIP binding to β-lactamase are shown in red.

The inserted loops are required to not impair inhibition of β-lactamase. I therefore developed an assay (with P. Gatti-Lafranconi) of the efficiency of BLIP inhibition of β-lactamase based on the serial dilution of a bacterial culture, spotted onto solid media. At high cellular concentration, low β-lactamase activity allows cell survival because ampicillin is cleared by the cumulative activity of many cells. As cell concentration decreases, selection is more stringent. Addition of IPTG

42 Chapter B - Selection and screening methods for protease directed evolution induces expression of the BLIP construct. Cells expressing wtBLIP, had to be roughly 100-fold more concentrated in order to achieve the same survival on 100 µg/µl ampicillin hence BLIP reduces β-lactamase activity by approximately 100-fold in this assay (Figure B-5, upper half).

Having BLIP and β-lactamase expressed from a single plasmid (pSB- BLIP) gave repeatable results by reducing the effects of relative gene copy number variation. Cells expressing i2BLIP showed no decreased survival on ampicillin indicating that the insert has disrupted BLIP inhibition of β-lactamase. Conversely, iBLIP was only 2- to 4-fold less effective than wtBLIP at inhibiting growth on ampicillin (Figure B-5, lower half). This indicates that iBLIP still functions as an effective inhibitor and hence the insert loop minimally interferes with the BLIP/β-lactamase interaction. Given this success, iBLIP was further characterised for cleavability by TEV.

IPTG -

wtBLIP +

-

iBLIP +

2x dilutions of cell culture

Figure B-5 | Comparison of wtBLIP and iBLIP Top10 cells are transformed with pSB-wtBLIP or pSB-iBLIP. As the cell culture is diluted there is no change in survival when IPTG is absent and BLIP is not induced. However, when IPTG is added cells are unable to survive at low cell density. Reduction in survival is comparable between wtBLIP and iBLIP. 100 µg/µl ampicillin, 400 µM IPTG (as indicated) at 37oC.

3.3 TEV PROTEOLYSIS OF BLIP DOES NOT RESTORE CELL SURVIVAL

The final goal of this system was to select for active TEV variants so the sensitivity of iBLIP to proteolysis and restoration of ampicillin

43

resistance was crucial. Several plasmids were constructed that encode TEV protease but vary in its fusion state, localisation and activity.

MBP-TEV – TEV protease was provided on plasmid pMAT10 as an MBP fusion (P. Gatti-Lafranconi). As the resistance gene of pMAT10 is β-lactamase, (β-lactamase is present on pSB-iBLIP so cannot be used as the resistance gene on the TEV protease-containing plasmid) I cloned the MBP-TEV fusion into pET28a which encodes kanamycin resistance to produce the plasmid pET28-MBPTEV.

Signal sequence-TEV – As both β-lactamase and BLIP are exported to the periplasm, TEV needs to be co-localised in order to function. I therefore also cloned TEV protease alone into pET22b which contains the PelB signal sequence. The signal sequence and TEV protease sequence were then cloned together into pET28a (pET28-STEV).

Inactive TEV – An appropriate negative control was needed for defining the system’s dynamic range. I therefore made the active site C151A mutation (TEVAla, previously shown to be inactive212) by targeted mutagenesis of pET28-STEV (primers tevaF+tevaR). Clones were confirmed by sequencing (T7F primer). The TEVAla gene was subsequently cloned into the pET28-MBPTEV plasmid.

For the selection system to function, active protease must relieve iBLIP inhibition of β-lactamase. When co-expressed, TEV protease expressed in the cytosol (pET28-MBPTEV) has no effect on cell survival. Surprisingly, however, co-expression of TEV in the periplasm (pET28-STEV) also did not increase cell survival (Figure B-6) and was as ineffective as the inactive C151A mutant (pET28-STEVAla). This indicates that the expected inhibition relief of iBLIP cleavage by TEV does not occur. Likely explanations for this negative result include TEV mis-translocation, failure of TEV to cleave iBLIP, continued inhibitory function of the cleaved iBLIP.

44 Chapter B - Selection and screening methods for protease directed evolution

IPTG -

STEV +

iBLIP - Ala

STEV +

2x dilutions of cell culture

Figure B-6 | Comparison of STEV and STEVAla BL21 cells are transformed with pSB-iBLIP and either pET28-STEV or pET28-STEVAla In the presence of IPTG, both iBLIP and the protease are expressed. There is no improvement in cell survival in the presence of the protease (as compared to the inactive alanine mutant control). 100 µg/µl ampicillin, 50 µg/µl kanamycin, 400 µM IPTG (as indicated) at 37oC.

In order to understand which part of the pathway was not operating as planned, I expressed and purified both TEV and iBLIP. Purifications of iBLIP from whole lysate and periplasm only were performed. Expression of iBLIP was low (Figure B-7a) and multiple bands appeared around its expected size (21kDA) though it is unclear which, if any, was iBLIP. To characterise whether iBLIP is cleavable by TEV in vitro, the proteins were incubated together in TEV activity buffer.

b

a

20 kDA 15 kDa 10 kDa

1 2 3 4 5 6 7 8 Elution

Figure B-7 | Purification and cleavage of iBLIP Nickel affinity purification if iBLIP (6 hour expression at 25 °C and preparation of periplasm by osmotic shock). (b) Co-incubation of 2 μM TEV and 2 μM iBLIP for 24 hr at 25 °C.

Co-incubation of the iBLIP preparation with TEV in vitro gave no evidence of cleavage (Figure B-7b) since concentrations were so low. Together these results indicate that although the insertion of a cleavage loop into BLIP to create iBLIP was successful, this construct was not inactivated by co-expression of TEV. Troubleshooting of these

45

negative results was hindered by poor expression of the iBLIP construct.

Given the difficulty in addressing the failure point in the system, a more manageable system might be the cleavage of a cytosolic, modular repressor with an accessible, flexible, cleavable linker between the DNA binding and transcription inhibition domains. A fully unstructured region would ensure TEV cleavage and separate domains would ensure separation and inactivation after cleavage. In fact, during the course of this work, just such a system was published114 with either histidine biosynthesis, antibiotic resistance or fluorescent protein genes downstream of the repressor.

46 Chapter B - Selection and screening methods for protease directed evolution

4 INTRODUCTION (SCREENING)

In parallel with work on the BLIP-based selection system, I worked with P. Gatti Lafranconi on a screening system. This was envisaged as both an alternative to cell survival selection as well as a possible lower throughput but higher detail addition for DE.

4.1 DUAL-GFP FRET PAIR AND FLUORESCENCE SCREENING

There are many assays for measuring the activity of purified protease. The throughput of most is not high enough for use in directed evolution208,221 (>102 variants screened per round). High throughput in vivo assays of protease activity tend to be constructed for sensitive detection of inactive proteases for the screening of inhibitors. A few methods, however, are good candidates for conversion to a screening Fluorescence Resonance system for directed evolution. A dual GFP-variant Fluorescence Energy Transfer (FRET) The transfer of energy from 117,230,231 one fluorophore to another Resonance Energy Transfer (FRET) system was chosen for such that excitation of the first leads to emission by optimisation for protease library screening. the second

FRET occurs when the emission wavelength of one fluorophore coincides with the excitation wavelength of another nearby fluorophore such that the energy is directly transferred across. Thus excitation of the first fluorophore leads to emission by the second. FRET decreases as the sixth power of the distance between the fluorophores. In the case of covalently bonded fluorophores, cleavage can be considered to completely abolish FRET.

A dual-GFP FRET protease sensor117,230,231 has been adapted (with P. Gatti-Lafranconi) to be suitable for the directed evolution of Tobacco Etch Virus protease (TEV). A cyan GFP-variant (CyPet or CFP) is fused to a yellow GFP-variant (YPet or YFP) by a linker including the TEV substrate recognition sequence. The emission wavelength of CFP

47

and the excitation wavelength of YFP are very similar. When CFP is excited by a photon at 414 nm, the energy is transferred to YFP and emitted at 527 nm. Cleavage by TEV protease abolishes FRET (Figure B-8Error! Reference source not found.) such that when excited at 414 nm, light is emitted from CFP at 475 nm.

UV excitation UV excitation FRET

Protease

Yellow Cyan emission emission

Figure B-8 | Proteolytic cleavage of a Dual-GFP fusion FRET-pair If the linker is intact, excitation at the absorbance wavelength of CFP (414nm) causes emission by YFP (527nm) due to FRET. If the linker is cleaved by a protease, FRET is abolished and emission is at the CFP wavelength (425nm).

Cells co-expressing both the FRET sensor and TEV are induced and incubated so that cleavage occurs in vivo (in the cytosol). When CFP is excited, the ratio of CFP to YFP is used as an indicator of FRET efficiency and hence protease activity. This fluorescence ratio can

Flow cytometry either be measured in the cell lysate of a clonal population representing The assay of cells by moving them individually past a each variant (96 well plate screening) or in each individual cell by flow laser and detector array in a thin stream of liquid cytometry and Fluorescence Activated Cell Sorting (FACS).

Fluorescence Activated Cell Sorting A particular benefit is that modifications of the same core assay can be The use of electrical charge in flow cytometry to select used for in vitro kinetic characterisation of purified enzymes as well as a sub-population of cells to a separate container the high-throughput in vivo screen ensuring consistency of substrate.

48 Chapter B - Selection and screening methods for protease directed evolution

5 RESULTS AND DISCUSSION (SCREENING)

5.1 C-Y(TEV) FRET-PAIR PLASMID ASSEMBLY

I constructed a screening system based on the change in fluorescence of a substrate upon proteolysis. The CFP-YFP gene has previously been optimised for intramolecular FRET117 and caspase-3 cleavage. The initial stages of adapting the system for directed evolution was performed by P. Gatti-Lafranconi (introducing the TEV cleavage linker and hanging induction to L-arabinose to allow co-induction of substrate and protease. The substrate gene is on a plasmid also containing chloramphenicol acetyltransferase (Figure B-9) and the protease gene on a plasmid containing β-lactamase. Cells have to maintain both these plasmids when grown on ampicillin and chloramphenicol.

Linker containing cleavage sequence

pC-Y

Figure B-9 | pC-Y plasmid map The plasmid used to express C-Y. The wedges indicate which starting plasmids were used to construct the final pC-Y plasmid (the gap indicates synthetic oligonucleotides). Gene abbreviations: CFP = Cyan fluorescent protein; YFP = Yellow fluorescent protein; MBP = Maltose binding protein; AraC = Arabinose inhibitor; Cat = Chloramphenicol acetyltransferase.

49

In order to facilitate the easy manipulation of linker sequence between CFP and YFP, I introduced restriction sites (XhoI, SpeI). Whole gene synthesis was necessary as mutagenic PCR was impossible due to the high sequence identity of CFP and YFP causing non-specific primer annealing. These sites allow the DNA encoding the protease cleavage sequence to be inserted with synthetic oligonucleotides. The linker (discussed in more detail in Chapter E) can be either a single sequence or a set of sequences with randomised positions (NNS codon). To begin with, I inserted the cleavage preference sequence of TEV protease to create pC-Y(tev):

…GGSGGSENLYFQSGGSGGS… Spacer – Cleavage seq - Spacer

5.2 FLUORESCENCE OF C-Y(TEV) IS CHANGED BY TEV PROTEOLYSIS IN CELL LYSATE

A cell lysate screen was optimised for the parallel assay of TEV mutant libraries of 102. Due to low solubility, TEV was expressed as a MBP fusion protein to provide a constant solubility buffer219 (Figure B-10a). Both the enzyme and substrate were cloned into plasmids that give L- arabinose inducible expression so that co-expression is induced simultaneously. For expressing TEV without the MBP fusion for biophysical characterisation of the unfused enzyme, it was also cloned into pMAA NoMBP (Figure B-10b).

50 Chapter B - Selection and screening methods for protease directed evolution

a

pMAA (TEV)

His6 affinity tag

b

His6 affinity tag pMAA NoMBP (TEV)

Figure B-10 | pMAA TEV plasmid map (±MBP) The plasmid used to express TEV when using the C-Y(x) substrate where ‘x’ can be any desired linker sequence. (b) The plasmid used to express TEV without MBP (for TM measurements). Inside wedges indicate which starting plasmids were used to construct the final plasmid. Gene abbreviations: TEV = Tobacco Etch Virus Protease (or variants); MBP = Maltose binding protein; AraC = Arabinose inhibitor; Bla = β-lactamase.

I ran control screens comparing the FRET ratio of cells co-expressing the C-Y(tev) substrate with either wild type TEV, the C151A mutant (TEVAla), or MBP. In the absence of TEV, there was an emission peak at 525 nm (the emission wavelength of YFP indicative of FRET) (Figure

51

B-11). In the presence of TEV, emission was mostly at 475 nm (the emission wavelength of CFP). This gives a dynamic range of a 7-fold drop in FRET efficiency (525/475 nm) in the presence of TEV. This demonstrates that the FRET pair functions with the TEV cleavage sequence as a linker and that it can be successfully proteolysed.

a 475nm 525nm b 5 20 18 4 16 14 3 12 10 Uncleaved

8 ratio FRET Lysate 2

Intensity Intensity RFU / 6 4 1 2 Cleaved 0 0 pMAA-MBP pMAA-MBPTEV(Ala) pMAA-MBPTEV(Cys)

450 470 490 510 530 550 Ala

Emission wavelength - nm TEV

MBP TEV TEV

Figure B-11 | Effect of TEV protease on fluorescence in CFP-YFP FRET system Lysates of E. coli co-expressing the substrate C-Y(tev) an either MBP or TEV for 4 hours. (a) The fluorescence spectra of the lysates when excited at 414 nm (the excitation wavelength of CFP). (b) FRET ratio (525/475 nm) of lysates. Error bars are standard deviation and n=4.

The FRET ratio was time dependent (Figure B-12). Over time, expression of the substrate increased FRET ratio, similarly longer co- expression times with TEV lead to a lower FRET ratio. Readings were more repeatable at later time points as the signal to noise ratio drops.

3

2.5

2

1.5

1 Lysate FRET ratio FRET Lysate

0.5

0 0 2 4 6 8 Time / hours

Figure B-12 | Time-course of lysate FRET ratios Lysates of E. coli co-expressing the substrate C-Y(tev) an either MBP (black) or TEV (red) for different times. Error bars indicate standard deviation of 8 biological replicates.

52 Chapter B - Selection and screening methods for protease directed evolution

The assay can be performed in parallel in a 96 well plate on a library of protease variants generated by error prone PCR (epPCR) (Figure B-13a). Next generation sequencing (see Chapter D, 3.1) confirmed that variants each contained 1.5 nucleotide mutations on average (spread even along the gene) with 34% of these being silent mutations and <1% being frameshifts. The lysate assay allowed for screening of mutant library sizes of 102. The FRET ratios of mutants mostly cluster around the FRET ratios of cells co-expressing wild type TEV (cleaved substrate) or cells co-expressing MBP (uncleaved substrate) (Figure B-13b).

a

b 35 Cleaved Uncleaved

30

25

20

15 Frequency

10

5

0

Lysate FRET ratio

Figure B-13 | Sample lysate screen of TEV mutant library 96 well plate of E. coli lysates. All express C-Y(tev) but the culture in each well co- expressed a different point mutant of TEV generated by random mutagenesis. (b) A histogram of two such plates showing the distribution of activities present in the library.

5.3 TEV CLEAVES PURIFIED C-Y(TEV) IN VITRO

All proteases and C-Y(tev) variants contain a His6 tag for nickel affinity purification. C-Y(tev) variants could be purified to high concentrations

53

(up to 90 µM) but TEV had poor solubility (maximum 4 µM). At low concentrations the fluorescence of C-Y(tev) variants was insufficient to measure FRET ratio however at concentrations above 0.5 µM FRET ratio was constant (Figure B-14).

8

7

6

5

4

FRET ratio FRET 3

2

1

0 0 0.5 1 1.5 2 [C-Y(tev)] / μM

Figure B-14 | Concentration dependence of FRET ratio Dilution series of C-Y(tev) with the buffer TEV buffer (used for cleavage kinetics)

A concentration of 1 μM substrate was therefore used in all further kinetics in line with the concentrations used in the original report of the FRET pair117. When thawed, purified C-Y(tev) variants did not exhibit their maximum FRET ratio (Figure B-15).

8 7 6 5 4

3 FRET ratio FRET 2 1 0 0 2 4 6 8 10 12 14 Time / hours

Figure B-15 | Equilibration of C-Y FRET ratio Purified C-Y(tev) (1 μM) in TEV buffer at 25 °C after having been thawed from -80 °C at t= -0.6 hours. TEV (1 μM) added at t= 4 hours to half the sample (red).

A >4 hour equilibration period at the assay temperature was therefore required in order to give a constant negative control. FRET ratio remained stable for several days after thawing. It is only the FRET ratio that equilibrates in this fashion, not the total fluorescence. I would

54 Chapter B - Selection and screening methods for protease directed evolution hypothesise that this is indicative of slow structural rearrangements of the relative orientations of the fluorophores.

Cleavage of C-Y(tev) by purified TEV is specific (neither thrombin nor TEVAla showed substrate turnover) and concentration dependent (Figure B-16a). This indicates specific hydrolysis of the C-Y(tev) linker by TEV using the catalytic cysteine nucleophile. Kinetic traces could be globally fitted with the Michaelis Menten function (Methods eq. G- 2) in order to determine the composite kinetic activity parameter, kcat/KM (Figure B-16b).

a 1.0

0.8 M

μ 0.6

0.4

0.2 [Product] /

0.0

-0.2 0 60 120 180 240 Time / min b 1.0

0.8

M μ 0.6

0.4 [Product] / / [Product] 0.2

0.0 0 60 120 180 240 Time / min Figure B-16 | Concentration dependent cleavage of C-Y(tev) by TEV and kinetic fitting Activity of different 0.5, 1, 2, 4 μM TEV (red), 5 μM TEVAla (green) and 30 μM Thrombin (blue) on 1 μM C-Y(tev). FRET ratio used to determine product concentration by fitting to eq. C-1 (Methods) (b) Data for 1 μM TEV activity fitted with eq. C-2 (Methods)

5.4 TEV PROTEOLYSIS OF C-Y(TEV) IS DETECTABLE IN INDIVIDUAL CELLS

The lysate co-expression assay was modified to increase throughput from 102 to 106. To test the dynamic range of the assay in FACS, TOP10 E. coli containing pC-Y(tev) were transformed with either pMAA(TEV) or pMAA. After overnight co-expression the clonal populations were assayed by FACS. The populations were

55

distinguishable by the difference in FRET ratio (After excitation of CFP, the emission of YFP divided by emission of CFP - FL5/FL4) however there was significant overlap (Figure B-17a). YFP was used as a reporter for overall expression since the emission of YFP after its excitation (FL1) is independent of cleavage of the CFP-YFP fusion. When separated by FL1 it can be seen that cells with higher expression 10 4 also have less signal variance due to higher signal:noise (Figure B-17b,c,d).

3 606666 253 10 10 4 b 454499 189 303333 126

Counts

Counts

Counts 151166 63 2 Cleaved Uncleaved 10 3 a 10 0 0 FLLog 1 100 101010 101021 101032 101043 104 (FL 5 Log/FL 4 Log)(FL 5 Log/FL 4 Log) 41894324 c 4189 31413243 3141 1 10 10 2 20942162 2094

FLLog 1

Counts

Counts

Counts 10471081 1047 0 0 100 1010 1021 1032 1043 104 10 0 1 (FL 5 Log/FL 4 (FLLog) 5 Log/FL 4 Log) 100 10101 102 103 104 (FL 5 Log/FL 4 Log) 333358 d 358 249268 268 10 0 100 101 102166179 103179 104

Counts

Counts (FLCounts 5 Log/FL 4 Log) 8389 89 0 0 100 1010 1021 1032 1034 104 (FL 5 Log/FL 4 Log)(FL 5 Log/FL 4 Log) Figure B-17 | Flow cytometric measurement of FRET in E. coli Flow cytometry plots showing FL1 (Emission from YFP after excitation of YFP) against FL5/FL4 (FRET ratio after excitation of CFP) for cells co-expressing C-Y(tev) and either MBP (blue) or TEV (red). Horizontal dashed lines indicate top and bottom FL1 deciles. The same data plotted as a histogram of FL5/FL4 (FRET ratio) for (b) cells with the top 10% FL1, (c) all cells, (d) cells with the bottom 10% FL1.

Since the measurement of cleavage is ratiometric it is concentration independent, however high fluorescence gives higher signal:noise in the FRET ratio. Due to the small size of E. coli, achieving high enough substrate C-Y concentration required expression optimisation. It has previously been reported that low temperature expression improves fluorescence by reduced aggregation117. I found that overnight expression at 21 °C gave the highest signal. Additionally slow flow rate (<1000 cells s-1) to give longer laser beam passage time and high laser

56 Chapter B - Selection and screening methods for protease directed evolution

power (>100mW) were necessary for high signal:noise ratio. Though these increase the signal:noise ratio and give a reduced variance in the FRET ratio, the median ratio remains the same. The ratio was also insensitive to the concentration of inducer (0.02%-1.00%) which is important for consistent library screening.

A library of TEV variants generated by epPCR was co-expressed with C-Y(tev) and assayed by flow cytometry. The library showed two populations that correspond to the FRET ratios of populations co-

10 4 expressing TEV or expressing MBP (Figure B-18). The distribution of FRET ratios of cells in the top 10% of FL1 was roughly bimodal.

3 Bimodality indicates that, in this assay, mutations are either effectively 10 10 4 neutral or fully deleterious to TEV function (discussed in Chapter D).

10 4 2 10 a 10 3 b

FLLog 1

10 3 1 10 10 2

FLLog 1

10 2 0

10 1 FLLog 1 100 10101 102 103 104 (FL 5 Log/FL 4 Log) c d 606666 253 237 3031 3031 10 1 454499 189 177 10 0 2273 2273 100 101 102 103 104 303333 126 (FL 5 Log/FL 4118 Log)1515 1515

Counts

Counts

Counts

Counts

Counts

Counts 151166 63 59 757 757 10 0 0 0 0 0 0 0 1 2 3 4 0 10 21 32 100 1043 0 1010 104 1 101021 102 101032 103 101043 104 104 10 1010 1010 1010 1010 10 (FL 5 Log/FL 4 Log) (FL 5 Log/FL 4 Log)(FL 5 Log/FL 4 Log) (FL 5 Log/FL(FL 5 Log/FL4 Log)(FL 5 4Log/FL Log) 4 Log)

Figure B-18 | Sample flow cytometric of TEV mutant library Flow cytometry plots showing FL1 (Emission from YFP after excitation of YFP) against FL5/FL4 (FRET ratio after excitation of CFP) for (a) cells co-expressing C-Y(tev) and either MBP (blue) or TEV (red) (b) a density plot of a library of TEV mutants.

The majority of noise in the system is biological rather than technical. If a specific subpopulation of cells are sorted and immediately re- assayed, they show very little spread. If the sorted cells are grown and re-expressed, however, they produce a pattern identical to the original, unsorted population indicating that the re-expression has re- introduced biological variation.

57

6 CONCLUSIONS

6.1 SELECTION BY BLIP CLEAVAGE IS UNSUITABLE FOR DIRECTED EVOLUTION

The plasmids and genes required to test the system were successfully constructed. Of the two loops targeted, only one allowed retention of inhibitory activity after insertion of the TEV cleavage sequence as demonstrated by the reduction in survival when iBLIP is expressed (Error! Reference source not found.). This demonstrates that loop 1 tolerates the insertion of seven amino acids without disrupting function. Co-expression of TEV, however, fails to restore survival (Figure B-6) indicating that TEV fails to successfully inactivate BLIP and restore vitality. There could be a number of reasons for this failure:

 TEV is not soluble or active in the periplasm  TEV is not exported to the periplasm by PelB export sequence  TEV does not cleave iBLIP because the loop is not exposed  Cleavage of iBLIP does not destroy the interaction with β-lactamase

The main problem with the system is that it is difficult to isolate the point of failure in the multi-step pathway and the low expression of BLIP has made purification for in vitro analysis impractical. Since the main benefit of a cell survival selection system is its high throughput, the success of the high throughput FRET screening system in FACS system rendered this selection system obsolete.

6.2 FRET SCREENING PROVIDES MULTIPLE PLATFORMS FOR DIRECTED EVOLUTION

The CFP-YFP FRET-pair lends itself well to a variety of assay formats, each specialised for a different function. Purified proteins can be used to determine kinetic parameters by whole curve fitting (Figure B-16), cell lysate screening is a sensitive screen for medium throughput directed evolution (102) (Figure B-13) and flow cytometry allows for high throughput screening (106) (Figure B-18). Using the same

58 Chapter B - Selection and screening methods for protease directed evolution substrate for both screening and detailed characterisation avoids some artefacts common in directed evolution studies. In particular, directed evolution is rarely possible on the desired substrate and so one or more proxy colourmetric or fluorometric substrates are used. Improvements in turnover of the assayed substrate may not correlate to activity on the desired substrate as evolution optimises binding the colourmetric or fluorometric group232 or catalysis relies on their artificially high pKa values233. The FRET system is suitable for high and low throughput screening as well as subsequent kinetic characterisation.

The substrate cleavage sequence can also be manipulated to create individual variant substrates or a whole class of related substrates by randomising one or more codons, allowing the study of specificity by measuring the set of cleaved substrates.

The mutant library screened by lysate and FACS gives a good comparison between the assay formats. The overall agreement is good, both showing a bimodal distribution of mutant activities (Figure B-13

& Figure B-18). The lysate screening system has higher dynamic range and greater sensitivity at discriminating between similar variants. Additionally the signal:noise ratio is high because the lysate screens the average FRET ratio of a large clonal population of E. coli. It is therefore well suited to the evolution from a low starting activity by isolating the single best variant from each round. The FACS system, conversely, was able to screen 4 orders of magnitude more variants and gives the ability to sort large populations. It is therefore better suited to experiments where an evolving population is generated.

These FRET-based screening systems enabled the directed evolution of TEV and are used in the further experiments laid out in this dissertation.

59

60

Chapter C - Handicap-recover evolution to accommodate a new nucleophile

1 SUMMARY

To understand the evolution of active sites, handicap-recover directed evolution was performed to convert Tobacco Etch Virus cysteine protease to a functional . Catalysis using serine was improved >103-fold but rather than re-specialise, the resulting enzyme was able to use either nucleophile. Description of the fitness landscape flanking the evolved trajectory revealed features that constrain and guide the evolution of active site architecture. In particular, high epistasis creates a rugged landscape but there are consistently several available uphill pathways at each step along the trajectory. Though mutations in the protein core are penalised, the active site second periphery is enriched in adaptive potential.

Parts of this chapter have been submitted for publication:

Shafee, T., Gatti-Lafranconi, P., Minter, R., Hollfelder, F. The evolvability of nucleophile exchange Nature Genetics. (2013)

61

2 INTRODUCTION

2.1 THE EFFECT OF NUCLEOPHILE IDENTITY ON ENZYME MECHANISM

As introduced in Chapter A, 2.2, nucleophilic enzymes use an interconnected set of active site residues to achieve catalysis. The sophistication of the active site network causes residues involved in catalysis, and residues in contact with these, to be the most evolutionarily conserved within their families58. In catalytic triads, the most common nucleophiles are serine (an alcohol) or cysteine (a thiol). Compared to oxygen, sulphur’s extra d orbital makes it larger (by 0.4

234 Å ), softer, form longer bonds (dC-X and dX-H by 1.3-fold) and have

235 lower pKa (by 5 units ). Here I concentrate on chemical differences between cysteine and serine proteases on catalytic chemistry, however similar issues affect and in general.

The pKa of cysteine is low enough that some cysteine proteases (e.g. papain) have been shown to exist as an S- thiolate ion in the ground state enzyme236 (Figure C-1a) and many even lack the acidic triad member (Figure C-1b). Serine is also more dependent on other residues to reduce its pKa235 for concerted deprotonation with catalysis (Figure C-1c) by optimal orientation of the acid-base triad members (Figure C-1d)21. The low pKa of cysteine works to its disadvantage in the resolution of the first tetrahedral intermediate as unproductive reversal of the original nucleophilic attack is the more favourable breakdown product21. The triad base is therefore preferentially oriented to protonate the leaving group amide (Figure C-1e) to ensure that it is ejected to leave the enzyme sulphur covalently bound to the substrate N-terminus. Finally, resolution of the acyl-enzyme (to release the substrate C-terminus) requires serine to be re-protonated (Figure C-1f) whereas cysteine can leave as S-.

62

Chapter C - Handicap-recover evolution to accommodate a new nucleophile

) concerted concerted )

c

(

cysteine proteases, proteases, cysteine

) alcohol protonation of the serine leaving group. serine leaving of the protonation )alcohol

f

) aspartate (grey) not present in all in present not (grey) aspartate )

b

(

substrate (red) to form a tetrahedral intermediate. This breaks down by ejection of the of the ejection by down breaks This intermediate. tetrahedral a form to (red) substrate

peptide peptide

) the deprotonated cysteine, cysteine, deprotonated the )

a

are ( are

) amide protonation of the first leaving group, ( group, leaving first of the protonation )amide

enzyme enzyme intermediate. Water replaces the first product and hydrolysis occurs via a second tetrahedral

-

e

differences

, to form the acyl

terminus

-

(black) performs a nucleophilic attack on on attack nucleophilic a (black)performs

) aspartate hydrogen bonding, ( hydrogen )aspartate

d

protease

he he

,t

2

-

Differences in cysteine and serine proteolysis mechanisms serine proteolysis and cysteine in Differences

A

|

1

-

C

Figure Figure

Figure Figure in As first product, the substrate C Indicated enzyme. free regenerate to intermediate ( of serine, deprotonation

63

Sterically, the sulphur of cysteine also has longer bonds and a bulkier Van der Waals radius to fit in the active site237 and a mutated nucleophile can be trapped in unproductive orientations. For example the crystal structure of thio-trypsin indicates that cysteine points away from the substrate, instead forming interactions with the oxyanion hole234.

The evolutionary specialisation of enzymes around the needs of their nucleophile makes it unsurprising that nucleophiles cannot be interconverted in extant proteases212,238–243 (nor in most other enzymes244–249) and the large activity reductions (>104) observed can be explained as a result of compromised reactivity or structural misalignment.

2.2 NUCLEOPHILE EXCHANGE IN NATURAL EVOLUTION

Although extant enzymes cannot switch nucleophiles, there are several PA clan superfamilies that show divergent evolution of serine and cysteine The ‘Protease A’ clan of double β-barrel of serine 250 and cysteine protease proteases from a common ancestor. I use the PA clan of families chymotrypsin-like proteases as a model superfamily to study nucleophile switches in evolution. The PA clan contains a diverse array of proteases from eukaryotes, and viruses. It also encompasses varied functions including blood clotting (e.g. thrombin), (e.g. trypsin), snake venoms (e.g. pit viper haemotoxin), bacterial toxins (e.g. exfoliative toxin) and viral polyprotein processing (e.g. , , and TEV proteases).

Despite retaining as little as 10% sequence identity, PA clan members isolated from viruses, prokaryotes and eukaryotes show structural homology. Members were therefore aligned by structural similarity (with DALI251). The alignment was expanded by adding sequences without known structure by (with BLASTp, see

64 Chapter C - Handicap-recover evolution to accommodate a new nucleophile methods Table G-5) and this final alignment used to generate a phylogeny (Figure C-2). My phylogeny suggests three independent nucleophile switching events (crossed circles), though nodes 1 and 2 are so deeply rooted that they may be a single event. Cellular proteases use serine as the nucleophile, but both cysteine and serine protease families are found in the viruses (indicated by red and black shading respectively in inner circle). Despite the fact that replacing the nucleophile has detrimental effects on the activity of extant proteases212,238–243, the PA clan shows that the distinct active sites of cysteine and serine proteases can arise from divergent as well as . There is currently, however, no model that explains evolution of such core components of the catalytic machinery.

1 2 3

Figure C-2 | PA clan phylogeny The phylogeny of PA clan of proteases indicates that nucleophile changes have occurred in evolution (crossed circles). The colour of the branches and inner ring indicates the identity of the nucleophile (cysteine in red, serine in black), outer ring indicates organism (viral in white, in light blue, in blue). Proteases with known structure (marked by black dots) were aligned to TEVCys on the basis of structure using DALI, sequences of unknown structure were added by sequence alignment with BLASTp; the maximum likelihood phylogeny was generated using MEGA5.

65

3 RESULTS AND DISCUSSION

For this chapter, Tobacco Etch Virus nuclear inclusion cysteine protease will be referred to as TEVCys. The nucleophile identity of all variants will also be indicated by the superscript.

Directed Evolution (DE) To investigate nucleophile switches in evolution I mutated the An experimental process of mutation, selection and nucleophile of Tobacco Etch Virus nuclear inclusion cysteine protease amplification that mimics natural evolution on a (TEVCys) to serine (TEVSer). I performed 10 rounds of Directed single gene Evolution (DE) and characterised the recovery of activity by two High Throughput Screening and Sequencing (HT-SAS) complementary methodologies. Firstly, I used biochemistry to Using next generation sequencing to monitor the investigate changes in catalytic activity and protein stability. Secondly enrichment of mutations after experimental selection to ascribe I used High Throughput Screening and Sequencing (HT-SAS) to fitnesses to a large set of mutants characterise the local fitness landscapes flanking the DE trajectory.

3.1 REPLACEMENT OF NUCLEOPHILE DECREASES TEV ACTIVITY 104-FOLD

Targeted mutagenesis was used to introduce the C151S mutation into TEVCys to create TEVSer (primers tevsF+tevsR). The C151S mutation causes a large reduction in kinetic activity of the purified enzyme (TEVSer) in line with other proteases212,238–243. Additionally, the replacement of the nucleophile with serine induces biphasic kinetics (Figure C-3), indicating a fast first step followed by a slower, rate- limiting step. A standard steady-state interpolation (Equation C-1) can

Cys only accurately fit kcat/KM for TEV because the first step is rate- limiting. For TEVSer and variants, the modified Equation C-2 was used, as it allows measurement of the rate of all steps before and after the

rate-limiting step (kobs1 and kobs2) (Table C-1). Compared to the kcat/KM

Cys Ser of TEV , the kobs1 and kobs2 of TEV were found to be 80 and 20,000- fold lower, respectively, indicating that a strong handicap is associated with nucleophile replacement in TEVCys. Given the mechanism shown in section 2.1, it appears likely that the slow second step refers to de-

66 Chapter C - Handicap-recover evolution to accommodate a new nucleophile acylation of the covalent intermediate. It could also be a slowed product-release step after catalysis, however this sems less likely given the nature of the mutation.

[푷] = [푷]풎풂풙 − 풆−[푬]풌풄풂풕/푲푴풕 Equation C-1 | Standard Michalis Menten interpolation [P] = concentration of product [P]max = final concentration of product (should be = starting substrate) [E] = concentration of enzyme kcat/KM = rate of overall reaction t = time

−[푬]풌풐풃풔ퟏ풕 −[푬]풌풐풃풔ퟐ풕 [푷] = [푷]풎풂풙 − (푨ퟏ풆 + 푨ퟐ풆 ) Equation C-2 | Modification of Eq 1 for biphasic reaction [P] = concentration of product [P]max = final concentration of product (should be = starting substrate) A1 = Jump amplitude (ie. how much turnover is performed in first kinetic step) A2 = Second jump amplitude (should = [P]max-A1) [E] = concentration of enzyme kobs1 = rate of first kinetic step kobs2 = rate of second kinetic step t = time

a

uM [P] / / [P]

Time /min

b

uM [P] / [P]

Time /min

SINGLE EXPONENTIAL DOUBLE EXPONENTIAL 2 2 kcat/KM R Kobs1 Kobs2 R TEVCys 2.4 x 10-2 0.99 2.4 x 10-2 0.00 0.99 TEVSer 7.6 x 101 0.92 2.3 x 102 8.9E-01 0.98

Figure C-3 | Monophasic and biphasic curve fitting of TEVCys and TEVSer Kinetic data for cleavage of 1 µM substrate by (a) 4 µM TEVCys and (b) 20 µM TEVSer fit with single exponential (Eq. C-1) or double exponential (Eq. C-2). Table C-1 | Regression of curve fitting of TEVCys and TEVSer Summary of kinetic parameters and R2 values for fits to single or double exponential model.

67

3.2 NUCLEOPHILE MUTATION HAS MINIMAL EFFECT ON CATALYTIC PROMISCUITY

Trypsin shows a range of catalytic promiscuity39. I therefore checked for promiscuous hydrolysis of a range of substrates: phosphate, phosphonate, phospho-diester, sulphate, sulphonate, acetate, phosphoryl-choline and xylopyranoside. Of these, only activity on phosphonates was detectable (Figure C-4) and found to be 2.7 min-1M- 1, 104 worse than the native protease activity. Surprisingly, the activity of TEVSer was 0.8 min-1M-1 indicating that this promiscuous activity is less sensitive to the nucleophile mutation than the native activity.

3.E-03

2.E-03 Paranitrophenylphosphonate

1

- V0 /s V0

1.E-03

0.E+00 0 5 10 15 20 25 30 35 40 45 50 [Subtrate] /mM

Figure C-4 | Promiscuous phosphonatase activity of TEVCys and TEVSer Hydrolysis of different concentration of pNp-phenolphosphonate by 10 µM enzyme in TEV activity buffer at 25 °C.

3.3 DIRECTED EVOLUTION RESTORES PROTEOLYTIC FUNCTION USING SERINE

In order to investigate how the enzyme can compensate for the handicap of using a non-native nucleophile, 10 rounds of directed evolution (DE) were performed on TEVSer by selecting for activity recovery in the presence of the mutated nucleophile.

Libraries of variant genes were generated by error prone PCR (epPCR), such that on average each resulting gene contained 1.3±0.4 amino acid mutations. Libraries were screened by the lysate assay detailed in

68 Chapter C - Handicap-recover evolution to accommodate a new nucleophile

Chapter B, 5.2. Briefly, libraries were transformed into E. coli Top10 cells bearing the CFP-YFP substrate plasmid and the lysates from 350 individual colonies co-expressing protease and substrate were screened in each round. The single most active variant was used as the template for the next epPCR library and any S151C revertant rejected to force the recovery to follow a ‘forward’ evolutionary pathway. Revertants were predicted to occur once per thousand variants and results agreed with a revertant found in every second round (on average).

In this ratiometric, co-expression assay, the effect of the handicapped nucleophile caused a more moderate 6-fold cleavage reduction. Due to the sensitivity of the lysate assay, this was enough to allow directed evolution. A co-expression time of 4 hours gave good signal strength (high enough substrate concentration) whilst also giving good dynamic range (no saturation).

Over 10 rounds of evolution (TEVSer→TEVSerX), the in vivo activity was recovered from 17 to 75% of TEVCys in the cell lysate assay (Figure

C-5a & Table C-2). To determine which enzyme properties were responsible for the improvement, I purified each enzyme variant for kinetic characterisation. In vitro kinetics indicated that the in vivo activity increase is due to a larger burst amplitude and an improvement

3 3 in both kobs1 (2x10 -fold) and kobs2 (3x10 -fold) (Figure C-5b), with diminishing increments in later rounds, typical of DE trajectories53. Despite the recovery, kinetics of all isolated variants were still best fit to the biphasic model (Table C-3). Conversely, S151C revertants could be fit to monophasic kinetics.

69

R

X IX VIII VII VI V IV III II I TEV

OUND

K6N, H28L, K6N,H28L, K6N, H28L, K6N, T113S, R50G, H28L, K6N, T113S, R50G, H28L, K6N, L210M, T113S, R50G, H28L, K6N, L210M, T113S, R50G, H28L, K6N, K6N, T113S, R50G, K6N, R50G, K6N, R50G, E223G K6N, M

UTATIONS

stability( Detailing mutations accumulated in each round (new mutations in bold). For serine variants: k Table 1

M

H28L

-

1

), soluble expression (% of wt) and thermal stability ( stability thermal and wt) of (% expressionsoluble ),

C

-

, R50G, T113S, T113S, R50G, ,

2

S31T I35L

T113S

°

|

C).

Variantsummary property

*MutationsE K229R,

, R50G, T113S, R50G, ,

, I35L, R50G, T113S, L155V, N174K, A206T, L210M, A206T, T113S,L155V, N174K, I35L, R50G, ,

, E223G, E223G, ,

K215R

N174K A206T C130W

, E223G, A231V E223G, ,

A231V

L155V

, L210M, ,

,

, ,

A206T, L210M, A206T,

K215R, A231V K215R,E223G,

L210M

230K and 230K

, N174K, A206T, L210M, A206T, N174K, ,

,

E194D

K215R, E194D, E223G, (Frameshift) E223G, E194D, K215R,

K215R, E223G, A231V E223G, K215R,

Δ231

, E223G, ,

-

236 are 236 all caused by 1nt a frameshift.

K215R, E194D, E223G, (Frameshift) E223G, E194D, K215R,

°

(Frameshift*) C). For the cysteine back cysteinethe For C).

K215R, E194D, E223G, (Frameshift) E223G, K215R,E194D,

K215R, E194D, E223G, (Frameshift) E223G, E194D, K215R,

-

mutants: k mutants:

obs1

cat

/K

(min

M M

(min

-

1

M

-

1

-

1

), ), burst amplitude (Proportion of [E]), k

M

-

1

), soluble expression (% of wt) and thermal and wt) of (% expressionsoluble ),

7 6 4 4 3 4 2 5 7 2 3

X 10 X X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X K

OBS1

5 5 5 5 5 5 5 4 4 4 2

S

ERINE NUCLEOPHILE ERINE

25% 10% 11% 39% 38% 26% 4% 8% 7% 3%

1% BURST

AMPLITUDE

3 1 7 2 2 1 2 3 3 5 10 x 9

X 10 X X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X K

OBS2

-

3 2 2 3 3 3 2 2 2 1

1

15% 23% 30% 29% 44% 39% 18% 46% 40% 70% 89%

SOLUBILITY

47.0 46.0 44.6 44.6 47.4 obs2

STABILITY

(min

7 3 2 8 6 2 1 1 2 1 2 C

YSTEINENUCLEOPHILE

X 10 X X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X 10 X - K

CAT/KM

3 4 4 4 4 4 4 4 4 4 4

14% 24% 19% 30% 30% 25% 11% 25% 38% 66% 100%

SOLUBILITY

48.3 47.8

STABILITY

70 Chapter C - Handicap-recover evolution to accommodate a new nucleophile

The process of transcribing and translating a gene to make an enzyme is not perfect with an error made every 104 residues. It is possible that the apparent activity of TEVSer (and variants) is due contamination by enzymes with the original cysteine nucleophile. To rule this out, the identity of the nucleophile was confirmed in the proteins TEVSer and TEVSerIV by mass spectrometry and sensitivity to PMSF (a serine protease inhibitor) but not Io-Ac (a cysteine protease inhibitor).

a 120% R2 FIT 100% 100% 100% 100% 100% 100% 101% 100% 100% 100% 100% VARIANT EQN. 1 EQN. 2

100% TEVSer 0.92 0.98 80% C151S TEVSerI 0.92 0.96 TEVSerII 0.91 0.98 75% 60% 68% TEVSerIII 0.86 0.98 60% 62% 56% TEVSerIV 0.90 0.96 40% 51% 51% 48% 49% TEVSerV 0.73 0.98

Lysate activity / % / wtcleavage activity of Lysate 37% 20% TEVSerVI 0.31 0.97

Ser 17% TEV VII 0.26 0.98 0% TEVSerVIII 0.83 0.95 TEV I II III IV V VI VII VIII IX X Evolution round TEVSerIX 0.81 0.94 b 1E+06106 TEVSerX 0.54 0.97

1E+05105 TEVCys 0.99 0.99 TEVCysI 0.98 0.99 1E+04104

) Cys 1

C151S TEV II 0.98 0.98

-

M 1 1 - TEVCysIII 0.99 0.99

1E+03103 min ( TEVCysIV 0.99 0.99 TEVCysV 0.99 0.99 Activity Activity 1E+02102 TEVCysVI 0.99 0.99 TEVCysVII 0.99 0.99 1E+01101 TEVCysVIII 0.95 0.98

Cys 1E+00100 TEV IX 0.95 0.99 TEV I II III IV V VI VIII VIII IX X TEVCysX 0.93 0.99 Evolution round 1E-0110-1

Figure C-5 | Evolutionary recovery of TEVSer→TEVSerX lineage (a) Activity of TEVSer→TEVSerX (black) and TEVCys back-mutants (white) along evolution Cys lineage by cell lysate screening assay normalised to wtTEV. (b) In vitro kcat/KM for TEV Ser variants (white circles) and kobs1 and kobs2 for TEV variants (black and grey circles respectively). The red dashed arrow indicates the effect of the initially introduced handicap nucleophile mutation (C151S). Error bars represent standard deviation of: 8 biological repeats of lysate activity, 3-4 kinetics technical repeats of kinetic activity. Table C-3 | Regression of curve fitting of evolved variants Summary of R2 values for fits to single exponential (Eq. 1) or double exponential (Eq. 2) model for all variants. Kinetic data for cleavage of 1 µM substrate 20 µM TEVSer and 4 µM of all other variants.

71

3.4 THE EVOLVED PROTEASE CAN USE EITHER NUCLEOPHILE

Natural enzymes are sensitive to nucleophile mutations because of the evolutionary specialisation of their active site244–249. Therefore, DE mutations were expected to re-specialise the enzyme for the serine nucleophile. However, nucleophile S151C back-mutants of all TEVSer→TEVSerX variants (TEVCys→TEVCysX) retained in vivo activity levels similar to TEVCys in the lysate assay (Figure C-5a). The kinetics of purified enzyme variants show that the modifications to the enzyme active site architecture prove nearly-neutral to activity with the original cysteine nucleophile. Consequently while the kobs1 and kobs2 of

Ser 3 Ser TEV X are 10 -fold improved over TEV , the kcat/KM of TEVCys→TEVCysX variants fluctuated by less than 4-fold (Figure C-5b), giving a trajectory that lacks a strong trade-off in nucleophile use (Figure C-6).

a b VII VII 100,000 100,000 wt wt IV IV

10,000 I 10,000 I X X

1,000 1,000 Cys nucleophile Cys

100

using Cys nucleophile Cys using 100

using using

M

M

/K

/K

cat

cat k k 10 10

1 1 10 100 1,000 10,000 100,000 1,000,000 0 1 10 100 1,000 10,000 k using Ser nucleophile kobs1 using Ser nucleophile obs2 Figure C-6 | Nucleophile activity tradeoff Plot of kcat/KM activity using the original cysteine nucleophile versus (a) kobs1 and (b) kobs2 with a serine nucleophile for lineage.

This neutrality may be because the mutations that accumulate to improve serine’s nucleophilicity have little effect on cysteine catalysis due to its higher intrinsic nucleophilicity22. Mutations did not occur that fully specialise the active site towards serine (by failing to create a thiolate zwitterion239, sterically clashing237, or causing hyper-reactive oxidisation to sulphinic acid252).

72 Chapter C - Handicap-recover evolution to accommodate a new nucleophile

3.5 CATALYTIC ACTIVITY IS RESTORED AT THE EXPENSE OF STRUCTURAL INTEGRITY

Previous works have observed trade-offs between improved activity and structure destabilisation61,176,253. Function-altering mutations close to the active site also tend to be in the core, which does not tolerate mutations well. Since the total amount of active protein present in a cell is a complex function of expression, folding, unfolding and aggregation, I used a set of techniques to characterise the stability of variants.

Differential scanning fluorimetry measures protein melting temperature (TM). For this, enzyme variants had to be expressed and purified without the MBP fusion partner since this would mask the signal from thermal unfolding of TEV. Secondly, the soluble expression of variants still fused to MBP was measured by gel band densitometry and indicates the total amount of active protein present in the cell. Finally an in silico folding energy estimate could be evaluated by FoldX which calculates the total energy difference between the folded and unfolded states (ΔG of folding) as well as the change in this difference upon individually mutating of each residue to alanine (ΔΔG of alanine mutation).

The nucleophile mutation did not affect melting temperature (TM) of purified protein (Figure C-7a) and quantification of soluble expression in E. coli showed only a 10% decrease in TEVSer (Figure C-7b). FoldX also only predicted a minor destabilisation of the folded state (Figure C-7c). In contrast, during the evolution, there was a negative trade-off between activity and soluble expression (Figure C-7b), a frequently reported compromise61,253. FoldX calculations agreed with this by predicting increasing destabilisation of the folded state along the DE trajectory (Figure C-7c). By the tenth round (TEVSerX), the >1000-fold increase of kobs1 and kobs2 was accompanied by a 6-fold drop in soluble expression. In rounds where solubility decreased, mutations were in

73

residues with low solvent accessibility (0.1) whereas the average solvent accessibility of mutations in rounds where solubility increased was higher (0.6).

This negative pleiotropy was temporarily broken in rounds V and VI by two surface mutations (W130C and E194D) and a C-terminal truncation (due to a frameshift), which allowed the increase of both activity and solubility.

a 100% b 49

Cys 48 80%

47 C

60% o 46

/ / M

T 45 40% 44 20% 43

Soluble expression TEVof % / expression Soluble 0% 42 TEV I II III IV V VI VII VIII IX X Evolution round

c d 6 40

/ / 35

Cys 5 30 4 25

3 20

1 -

Frequency 15 2 kJmol 10 1 5

0 0

2 0 1 3 4 5 6 7 8 9 -1 TEV I II III IV V VI VII VIII IX X -2

G of folding energy TEVcompared toG energy folding of -1 ΔΔG of folding energy after residue Δ mutation to alanine / kJ Evolution round -2 Figure C-7 | Stability losses during recovery of TEVSer lineage (a) Soluble expression (gel band quantification) and (b) thermal denaturation temperature of selected variants (Differential scanning fluorimetry without MBP fusion partner). (c) Energy difference between the folded and unfolded states (ΔG) for all evolved variants (grey) and S151C revertants (white). Values are normalised to TEVCys, higher values indicate lower stability. (d) Distribution of the destabilisation effects of alanine mutations to all residues in TEVCys. Higher values indicate greater destabilisation by mutation to alanine. All variants show similar profiles. Error bars represent standard deviation of: 2 biological repeats of soluble expression, 3 technical repeats of DSF, 2 technical repeats of FoldX.

3.6 MUTATIONS THAT RECOVER ACTIVITY CLUSTER AROUND THE ACTIVE SITE

To examine the distribution of mutations within the protein structure I used the published structure211 (PDB code 1lvm) to separate the enzyme into concentric shells emanating outwards from the catalytic triad. The first shell includes all residues within 4 Å of His46, Asp81 and Ser151. The second shell is all residues 4 Å from the first shell, and

74 Chapter C - Handicap-recover evolution to accommodate a new nucleophile so on. Mutations accumulated in the DE lineage (Table C-2) were not evenly distributed through the structure (p=0.005). The triad first shell did not accumulate any mutations. The triad second shell, however, contained 54% of mutations despite comprising only 23% of the protein (Figure C-8). Shells further than the second, conversely, were depleted in mutations. These observations suggest a strong selective pressure for the improvement of amino acid interactions around the active site254, consistent with the idea that the non-native nucleophile had to be re-tuned.

Figure C-8 | Mutations positions Structure of TEV protease (1lvm) with 2nd shell (dark blue) and ≥3rd shell (light blue) mutations from directed evolution trajectory. Triad in red, substrate in black. The triad first shell is defined as residues 4A away from H48, D81 or S151. The second shell is any residues 4A from the first shell.

Given that more than one mutation accumulated in each round, I performed a back-shuffle of the round II, V and VIII variants with TEVSer to see which mutations could be neutrally reverted to wild type at different rounds. This involved digesting the variant genes into 50 nt fragments, mixing them together with the similarly digested TEVSer gene and using PCR to reassemble the shuffled products. For reassembly, each evolved variant was mixed with TEVSer in a ratio such that 2-3 of the evolved mutations would be randomly reverted in each member of the shuffled library. Sequencing of 10 active variants from each shuffled library was expected to reveal some mutations that are

75

universally required for improved function (and so active variants never show reversion) whereas others would be neutral (and so would be reverted in variants retaining high activity). Mutations close together, however, rarely recombined and so the effects of some mutations could not be separated (e.g. A206T & L210M).

What was actually found was evidence of epistasis between the mutations accumulated in the lineage (Table C-4). This is consistent with other previous work on evolution of active site residues which looked in detail at the epistatic interactions between three residues255. Reversions of H28L, L210M and K215R were tolerated from TEVSerV but not from TEVSerVIII. Similarly, reversion of E224G in TEVSerII did not reduce in vivo activity but its removal from TEVSerVIII does. There are therefore no mutations that remain completely neutral. The finding that some mutations are nearly-neutral in early rounds yet adaptive in later rounds suggests epistatic dependence of later adaptive mutations on these early nearly-neutral substitutions. This makes sense as the mutations are tightly clustered, with many contacting each other in the structure. Unfortunately, multiple attempts to crystalise variants along the DE trajectory were unsuccessful. Even the wild type protease TEVCys is reportedly hard to crystallise due to poor reproducibility211.

Table C-4 | Back-shuffle mutation retention Selection of 10 variants that retain activity after back-shuffle with TEVSer. Shown are the proportion of active variants that retain each mutation. Back-shuffles were performed by shuffling TEVSer with each of TEVSerII, TEVSerV and TEVSerVIII such that 2-3 reversions occurred on average per variant. Mutations K229R, E230K and Δ231-236 are all caused by a 1nt frameshift and so are included in the table as ‘STOP’.

The >103 increase in activity using serine was achieved by relatively few mutations (TEVSerX has 92.4% sequence identity to TEVCys) even

76 Chapter C - Handicap-recover evolution to accommodate a new nucleophile though the most closely related natural serine proteases have only 15% identity.

The closeness in protein sequence space of this nucleophile- permissiveness to an extant enzyme explains how divergent evolution of core catalytic machinery can occur within evolutionary superfamilies (such as the PA clan, Figure C-2). The ancestral gene basal to the divergent serine and cysteine branches of the PA clan could have been nucleophile-permissive despite modern members being unable to tolerate the switch.

3.7 MEASUREMENT OF LOCAL FITNESS LANDSCAPES ALONG THE TRAJECTORY

Directed evolution follows a single pathway through a fitness landscape and is a useful tool to understand how a protein can evolve, and what properties the evolving enzyme gains and loses62. The focus on detailed characterisation of a single trajectory, however, misses the broader context of the surrounding fitness landscape. In particular, it would be useful to know which properties of the evolved trajectory are general, and which occurred by chance in that specific lineage. I therefore used the Fluorescence Activated Cell Sorting (FACS) assay (Chapter B, 5.4) for High Throughput Screening and Sequencing (HT- SAS) to characterise the local fitness landscapes flanking the trajectory. I aimed to address to what extent the following observations were stochastic events, or reflective of features of the fitness landscape:

 Rapid evolutionary recovery of activity using the new nucleophile  Clustering of adaptive mutations around the active site periphery  Trade-offs between improved activity and protein stability  Epistatic interactions between the accumulated mutations

For each of the mutagenised libraries generated in rounds I to X of the evolutionary trajectory, ≈106 members were re-screened by flow cytometry. The library members exhibiting the top 1% activities were

77

isolated and a representative sample of ≈103 mutants subjected to 454 sequencing. Frequency of mutations before and after selection was measured and used to calculate the enrichment factor (for each position of the sequence). Mutations that are enriched are beneficial under the selection pressures applied, those that are purged are deleterious. Linking sequence and function79,126 provides insight into the series of local fitness landscapes flanking the DE trajectory(Figure C-9). Furthermore, mapping functional effects on the protein structure121 indicates which regions of the structure are constrained and which are enriched in adaptive potential.

Figure C-9 | Schematic of sequence space flanking trajectory A 2D simplification of the multi-dimensional space of all possible protein sequences. Similar sequences are closer together in this space. (a) The TEVSer→TEVSerX trajectory generated by directed evolution. Only the trajectory members (large circles) are characterised. (b) Many related sequences (small circles) surround members of the directed evolution trajectory. High throughput selection and sequencing allows some of their properties to be characterised.

After HT-SAS, 29% of TEVSer residues showed greater than 2-fold enrichment of mutations (Error! Reference source not found.). The protein surface and regions away from the active site were particularly enriched (52% and 36% respectively). The same analysis was performed for all variants along the DE lineage up to TEVSerIX (Figure C-11) to build an overlapping series of local fitness landscapes.

78 Chapter C - Handicap-recover evolution to accommodate a new nucleophile

a b 2 Neutral 10 5% 49% 4% 1 10 4% Deleterious Beneficial 3% 29% 22% 0 10 3% 2% -1 Frequency 10 2% 1% Enrichment after selection Enrichment -2 10 1%

0 20 40 60 80 100 120 140 160 180 200 220 0%

1 2 3 6

10 18 32

0.1 0.2 0.3 0.6 Residue number 0.1 0.03

Figure C-10 | Mutation enrichments for TEVSer ------c + EnrichmentG K S L F K G P R D Y N P I SforS T I C HeachL T N E S D G mutationH T T S L Y G I G F G P F IshownI T N K H L F R R N(a)N G T L LforV Q S L HeachG V F K V K NresidueT T T L Q Q H L I D GinR D M ItheI I R M P sequenceK D F P P F P Q K L K F R E andP Q R E E R I(b)C L V T TasN F Q TaK S M TEVSer I distribution.III IV V VI VII VIII

a IX b ------

+ + + + + + + + 2 + Neutral 10 5% 3.8 MSUTATIONSS M V S D T S C T F P S S D G I FHAVEW K H W I Q T K DSTRONGG Q S G S P L V S T R D G F EPISTATICI V G I H S A S N F T N T N N Y DEPENDENCEF T S V P K N F M E L L T N Q E A Q QW V S GONW R L N A DTHEIRS V L WG G H K V FPREDECESSORSM V K P 49%E E P F Q P V K E A T Q L MN X S151C TEVSer 4% I 1 II 10 4% Deleterious Beneficial III 3% 29% 22% IV 0 DueV to10 differences between the lysate and FACS3% screening techniques, VI VII 2% -1 Frequency VIII 10 2% IX the selection pressures in HT-SAS are not identical1% to those in DE Enrichment after selection Enrichment -2 10 1%

0 20 40 60 80 100 120 140 160 180 200 220 0%

1 2 3 6

10 18 32

0.1 0.2 0.3 0.6

(Chapter B 6.2). Residue number 0.1 0.03

------c +

G K S L F K G P R D Y N P I S S T I C H L T N E S D G H T T S L Y G I G F G P F I I T N K H L F R R N N G T L L V Q S L H G V F K V K N T T T L Q Q H L I D G R D M I I I R M P K D F P P F P Q K L K F R E P Q R E E R I C L V T T N F Q T K S M TEVSer I III IV V VI VII

VIII

IX ------+ + + + + + + + +

S S M V S D T S C T F P S S D G I F W K H W I Q T K D G Q S G S P L V S T R D G F I V G I H S A S N F T N T N N Y F T S V P K N F M E L L T N Q E A Q QW V S G W R L N A D S V L WG G H K V F M V K P E E P F Q P V K E A T Q L MN X S151C TEVSer I II III IV V VI VII VIII IX Figure C-11 | Mutation enrichments for the TEVSer→TEVSerX lineage libraries Heatmap (blue=purged, red=enriched) of mutation enrichment after selection from each round of TEVSer lineage mapped on primary sequence. Residues are marked (+) to show consistent enrichment ≥0 and (-) for enrichment ≤0. Nucleophile reversion S151C always enriched so all variants with S151C separated from other analyses. (Terminal half of TEVSerII shown only due to failed forward sequencing read).

Despite the lower sensitivity of the HT-SAS analysis at the single residue level, some striking correlations were evident (Figure C-12): the H28L mutation, introduced in TEVSerIV, showed enrichment of alternative residues in the two subsequent rounds but is increasingly selected for retention in later rounds. The back mutation of W130C in TEVSerV also coincided with its purge from the library. Finally the N174K of TEVSerVIII showed enrichment (both N174K and N174Y) before it is selected in the lineage, after which further mutations were no longer enriched. The previous observation, that nearly-neutral

79

mutations can become adaptive for later evolution, appears to represent a more general property of the local fitness landscapes.

1.5 1.5

1.0 1.0

0.5 0.5

0.0 0.0

I

S

II

V

III

IX

IV VI

VII

VIII Enrichemnt -0.5 Enrichemnt -0.5

-1.0 -1.0

-1.5 -1.5

-2.0 -2.0 S151C S151X H28X L28X C130X W130X C130X N174X K174X -2.5 -2.5 S151CFigure C-12 | EnrichmentsS151XS151C of notableS151X mutationsH28X C130XH28X N174XC130X N174X Mutation enrichment after selection from each round of TEVSer lineage of nucleophile mutations to cysteine (S151C) or any non-cysteine residue (S151X). Also shown are three residues which mutated during the DE lineage (new residue in grey boxes).

The real power of HT-SAS is not in analysing individual residues, but rather analysing trends in large sets of residues90. Results in section 3.6 suggested that there are epistatic interactions between the mutations accumulated in the DE lineage in line with some previous work255. To see if this was a general effect, I looked at whether the mutations accumulated in the DE lineage affected which residues subsequently showed adaptive mutations. The enrichment profiles (Figure C-11c) indicated that mutations at very few positions are universally enriched or universally purged throughout the DE lineage. The S151C nucleophile reversion was always enriched (average 9.8-fold, p=0.0004) and any other S151X mutation was selected against (average 43-fold,

p<0.0001) (Figure C-11c & Figure C-12). Only 9 other residues (4%) showed consistent mutation enrichment, on the surface (L98, T191), in the binding pocket (N177, S208, V209, V216) and in the unstructured C-terminus (V228, E230, L235). Similarly, only 29 (12%) positions are so constrained that they were consistently purged of any variation, mostly buried, hydrophobes (Figure C-13). The mutations accumulated in the DE lineage, therefore, had large effects on the adaptiveness of other possible mutations in the protein. At each round, the set of mutations that are adaptive appeared to change. The shifting

80 Chapter C - Handicap-recover evolution to accommodate a new nucleophile spectrum of adaptive mutations is consistent with a rugged fitness landscape in which the mutations accepted along the DE trajectory drastically influence the possible uphill routes by epistasis.

a b

10-1 100 101 Mutation Enrichment

Figure C-13 | Locations of beneficial or deleterious mutations (a) Structure of TEVCys (PDB 1lvm) showing enrichment of non-wt amino acids for each residue averaged over whole TEVSer→TEVSerX lineage. Thick, blue regions indicate positive enrichment of non-wt amino acids, thin red region indicate purging of non-wt amino acids. (b) Structure of TEV (PDB code 1lvm) overlaid with 29 consistently purged mutations (red), 6 consistently enriched mutations (blue). Also shown are catalytic triad (red sticks) and N-terminal product (black sticks).

3.9 ADAPTIVE MUTATIONS ARE COMMON AT ALL STEPS OF THE TRAJECTORY

Epistasis is usually considered a constraining factor on evolution as the lack of a smooth landscape makes it harder for evolution to access fitness peaks172. Despite this, at all steps along the adaptive trajectory there were residues (average 22% of 236) that display beneficial mutations (Figure C-14), indicating multiple uphill pathways in the flanking fitness landscape.

35% 30% 25% 20% 15%

% of residues with residues of % 10% enriched mutations enriched 5%

0%

I

S

II

V

III

IX

IV VI VII VIII Figure C-14 | Adaptive potential along the TEVSer→TEVSerX lineage Proportion of residues that show >2-fold enrichment for mutation in each library along evolutionary lineage.

81

That many alternatives to improve activity exist is consistent with the rapid evolutionary compensation of the mutated nucleophile handicap. It indicates that the abundance of adaptive mutations was sufficient to overcome the small number of variants screened at each round of DE.

3.10 SOME PROTEIN REGIONS ARE HOTSPOTS FOR ADAPTIVE MUTATIONS

The mutations accumulated in the DE lineage were overrepresented in the triad second shell. However, the available adaptive mutations vary from round to round. To resolve this apparent conflict, I looked at whether different regions of the protein were more or less likely to contain adaptive mutations.

As stated previously, the 12% of positions that were consistently constrained to the wild type residue are mostly located in the protein core. In fact, mutations in buried residues tended to be more deleterious on average (p=0.01 Figure C-15), supporting the previous observations of a trade-off between activity and solubility (Figure C-7).

10%

5%

S IV VIII II III V VI VII IX 0% I

-5%

mutations -10%

Relative Relative enrichment of -15%

-20% Figure C-15 | The effect of mutations in the protein core The average enrichments for residues with solvent accessibility <0.02 compared to the average for each round.

Additionally, adaptive mutations are disproportionately located in the active site periphery (triad second shell). Firstly beneficial mutations were more common in this region (average 28% of 53, p<0.0001, Figure C-16a). Secondly, the average enrichment of triad second shell

82 Chapter C - Handicap-recover evolution to accommodate a new nucleophile mutations was higher than for the protein as a whole (p=0.04, Figure

C-16b & Figure C-13b). The second shell encompasses some core residues which, if removed from the analysis, enhance this trend (p=0.01). Conversely, residues contacting the nucleophile (nucleophile first shell) were strongly constrained showing few beneficial mutations (average 16% of 11, p<0.0001) and certainly more detrimental on average (p<0.0001). The triad’s first shell showed the largest fluctuations, indicative of the strong effects the interactions have on the core catalytic machinery. Finally, mutations in shells further away (triad ≥third shell) showed no more beneficial mutations than the protein on average (p=0.18).

a 100%

0% I

S

II

V

III

IX

IV VI

VII

VIII

mutations Relative Relative frequencyof adaptive

-100% Nucleophile 1st shell Triad 1st shell Triad 2nd shell ≥Triad 3rd shell b 20%

10%

IX I II III IV V VI VII VIII 0% S

-10%

-20%

-30%

-40% Relative Relative enrichment mutations of -50% Nucleophile 1st shell Triad 1st shell Triad 2nd shell ≥Triad 3rd shell

Figure C-16 | Beneficial mutations more common in triad second shell (a) Proportion of residues that show >2-fold enrichment and (b) the enrichment of mutations in each shell normalised to the average per round. Nucleophile 1st shell: residues 4 Å from S151. Triad 1st shell: residues 4 Å from H48, D81 or S151. Triad 2nd shell: residues 4 Å from the first shell. Triad ≥3rd shell: remaining residues.

Together, these results indicate that even though the set of adaptive residues available changes from round to round, there are some underlying trends in their locations.

83

Underlying trends in the adaptive potential of different regions of the enzyme explain the biased accumulation of mutations in this region during the adaptive DE trajectory (Figure C-8). Although few individual residues show consistent effect upon mutation, several regions of the protein do. Therefore, despite high epistasis, the active site second shell is repeatedly enriched in adaptive mutations. This is consistent with meta-analysis of directed evolution experiments which show that the largest catalytic changes are often caused by mutations 4-8Å from the catalytic centre254.

Conversely, residues that showed improvement in activity using serine do not correlate to those found in natural serine proteases in the PA clan. In the PA clan, therefore, the adaptive differences between serine and cysteine proteases are buried by drift, consistent with their ancient divergence.

84 Chapter C - Handicap-recover evolution to accommodate a new nucleophile

4 CONCLUSIONS

The handicap-recover directed evolution demonstrated that it is possible to convert TEVCys into a functional serine protease TEVSerX. In addition to biochemically characterising this specific uphill trajectory (TEVSer→TEVSerX), I generate complementary data on the flanking fitness landscape by HT-SAS. These combined methods reveal how local fitness landscape topology guides the evolutionary trajectory.

4.1 EVOLUTIONARY CONSTRAINTS ON THE EVOLVED TRAJECTORY

The trajectory highlights several evolutionary constraints. The activity improvement negatively trades-off with structural integrity. The adaptive mutations accumulated along the evolutionary trajectory to TEVSerX destabilise the structure and promote aggregation. Characterising local fitness landscapes throughout the trajectory reveals that buried residues rarely contain adaptive mutations. Likewise, the 12% of residues consistently purged of mutations are mostly in the protein core. These biophysical constraints bias evolution away from these regions of sequence space.

Previous work on active site evolution found two supporting mutations that enabled exchange of an active site residue in phosphoglycerate kinase255 but that their effects were contingent on their order of introduction. Of eight possible routes to the three mutation variant, only one did not pass through a fitness valley, suggesting that rare mutation combinations dominate and constrain the evolutionary trajectory. Similarly my results are consistent with high epistasis between mutations within the trajectory and between trajectory mutations and the spectrum of possible mutations. This creates a rugged fitness landscape where the acceptance of each mutation

85

drastically alters which subsequent mutations are adaptive. This ruggedness reduces repeatability of evolution such that replicate trajectories would widely deviate.

4.2 HIGH ADAPTIVE POTENTIAL GENERATES A VERSATILE ACTIVE SITE

Despite these constraints, the 104-fold drop in activity was compensated for remarkably easily. In only 10 rounds, screening 350 variants per round, activity was recovered by >103 to improve the activity of TEVSerX to within an order of magnitude of the original TEVCys. The results indicate that the proportion of adaptive mutations is of the order of 10-1-10-2, at the high end of previous evolvability measurements (Chapter A, 5.1) and therefore that extra rounds of evolution might further increase activity. Additionally, this increase does not trade-off with the ability to use the original nucleophile, cysteine, generating an enzyme that efficiently uses either. This was achieved through very few mutations compared to the sequence difference between natural serine and cysteine proteases. These mutations cluster around the active site periphery and is representative of an underlying trend for triad second shell mutations to be rich in adaptive potential.

Indeed analysis of the flanking fitness landscape shows that the ease of this evolution reflects the large number of possible adaptive pathways at each step. By HT-SAS, we show that there are many possible beneficial mutations available at each round. Hence, whilst routes to any given solution may be rare255, the abundance of solutions results in high overall evolvability of nucleophile exchange. Thus the easily evolved enzyme TEVSerX is likely one of a great many similar solutions.

86

Chapter D - Neutral evolution and emergent robustness

1 SUMMARY

Populations of TEV protease genes were neutrally evolved in order to investigate how they spread through the local neutral network. Next generation sequencing revealed that evolution generated a number of highly robust variants whose descendants come to dominate the later populations due to their increased tolerance to mutations. These variants show improved structural robustness that correlates with a preferential distribution of mutations in exposed, disordered structural regions away from the active site that have minor impacts on protein stability. This shows how long-term neutral evolution contains implicit selection for robustness as genes diffuse though the local neutral network to extinction-resistant regions with more neutral neighbours.

Parts of this chapter and Chapter E have been combined and prepared for publication:

Shafee, T., Gatti-Lafranconi, P., Minter, R., Hollfelder, F. Neutral evolution – two combined mechanisms generate robust and evolvable enzymes. MBE. (2013)

87

2 INTRODUCTION

2.1 PROTEIN ROBUSTNESS AND TOLERANCE TO MUTATIONS

As introduced in more detail in Chapter A, 4.1, proteins achieve

stability through an internal network of interactions. Measurements of

Mutation tolerance (τ ) sub protein tolerance to random amino acid substitution (τsub) have found The probability that an 50,79,144,256 enzyme will retain wild type an average τsub=0.65±0.02 . This indicates that only a third of like function after random amino acid substitution random mutations destroy function in proteins so far studied.

Robustness The tendency to remain the In this robustness there is an apparent contradiction: proteins are same after some perturbation, in this case apparently stable enough to tolerate most mutations, yet they are only after mutation marginally stable before mutation, with ΔGfolding of around -10 kcal/mol257–259. This marginal stability has been explained by both adaptive and neutral mechanisms (Figure D-1). Firstly by selection to avoid accumulation of aggregates and be protease resistant139,259. Secondly there is evidence that excess stability can reduce the flexibility necessary for protein activity14,17,18,61. Finally excess stability above what can be selected for can be eroded by genetic drift8,260 since mutations are more likely to reduce than increase stability. Neutral

Selection Drift Selection Fitness

Less stable Stabilty (-ΔG ) More stable folding Figure D-1 | Protein stability effects on fitness Selection pressure on the left hand side of the curve increases stability to avoid aggregation, degradation and unfolding. Selection on the right hand side reduces stability to maintain activity and avoid uncontrolled and toxic accumulation of protein. Drift in the centre reduces stability down to the minimum viable threshold as random mutations erode stability.

88 Chapter D - Neutral evolution and emergent robustness

A deeper understanding of the evolution of robustness may be able to resolve the apparent paradox that proteins tolerate mutations surprisingly well despite being perpetually on the edge of unfolding even before those mutations.

2.2 THEORETICAL PREDICTIONS OF ROBUSTNESS EVOLUTION

Mathematical modelling in silico has been used to investigate neutral network structure and its evolutionary consequences and predicts that neutral networks of equivalent proteins are not Neutral neighbour Two enzyme variants homogeneous133,149,150,261–265. Some sequences in the network are related by a single mutation that retain the same more highly connected (have more neutral neighbours) and form function central nodes in the network, whilst others form sparsely connected edge regions. Both abstract mathematics149,265 and explicit models133,263,264 predict that large, neutrally-evolving populations can evolve towards regions of the network with more neutral neighbours, i.e. with increased robustness. This can be understood in terms of the disproportionate extinction of variants that, though perfectly fit, have fewer neutral neighbours and so are more likely to be deactivated by mutation. As a counterpoint, models of lethal mutagenesis therapy in viruses indicate that, although robustness can increase as an adaptation to high mutation rates266, it may not be high enough to prevent extinction267.

Classical population genetics also makes a number of predictions on

Effective population size (Ne) the conditions under which robustness could evolve. For example, The number of variants in a population that genetically population size (N=total number of variants) and effective population contribute to the next generation size (Ne=total number of breeding variants) have a strong effect on the balance of selection versus genetic drift in an evolving population. Small populations (low N) have a smaller pool of variation, important when beneficial mutations are rare and the rate limiting step of evolution is the generation of adaptive variation, not its fixation by

89

selection. In small effective populations (low Ne), the stochastic effect of random chance has a larger impact on the differential survival of mutants and can overwhelm selection. Consequently, the small selection coefficients hypothesised to be important in the evolution of robustness likely require large populations149.

2.3 EMPIRICAL EVIDENCE FOR THE EVOLUTION OF ROBUSTNESS

Although the theoretical foundations of the evolution of robustness have been in place for some time, experimental and empirical evidence is currently limited.

Bioinformatic analysis indicates that eukaryote miRNA hairpins are more robust to mutations (Figure D-2a) than random sequences which fold into the same secondary structure268 (Figure D-2b). This indicates that natural RNA structures have evolved to be tolerant to mutations.

a b

Figure D-2 | Robustness of a miRNA fold Around a central starting structure, all point mutants are arranged. Greater distance from the central circle indicates greater structural divergence after mutation. (a) An example of a natural miRNA. (b) An example of the same starting structure but produced using randomised nucleotides to produce correct starting base pairing. The resulting mutant structures are more dissimilar to the parent structure. (Adapted from reference268)

Additionally, experimental evolution of an RNA virus (phage φ6) has looked at the erosion of robustness under reduced selection pressure. Populations were either evolved under co-infection conditions (when complementation is expected to buffer deleterious mutations), or under single-infection conditions. Populations evolved with high co- infection were sensitive to mutagenesis which caused greater

90 Chapter D - Neutral evolution and emergent robustness reductions in fitness than mutation of the control populations (Figure D-3). This was interpreted to suggest that buffering by co-infection weakened selection for robustness.

0.2

0.0

-0.2

-0.4

-0.6

-0.8

fitness fitness mutation (logscale) after Low co-infection High co-infection Δ Figure D-3 | Viral robustness after evolution at different co-infection rates After evolution with low co-infection, populations retain wild type robustness levels when mutagenised. When evolved at high co-infection, the resulting viruses show greater variance and lower average fitnesses after mutation.

Two experimental protein evolution works have addressed how the accumulation of neutral mutations can increase robustness during neutral evolution187,188. This process is sometimes called ‘neutral drift’ although I will avoid this term for now, since it implies that the process is completely non-directional which is not necessarily the case.

Firstly, an experiment on P450 oxidase confirmed that robustness can be generated as an emergent property of a neutrally evolving population187. A lysate activity endpoint screening system was used to assay either 1 or 435 variants and sort all variants with >75% activity. It was found that when only 1 variant was selected from round to round the mutation tolerance fell from 48% to 39% (Figure D-4).

The larger evolving population became highly polymorphic with few mutations shared in multiple variants. In this case, the tolerance to mutations increased very slightly from 48% to 50% and the enzymes were found to be more thermostable and resistant to urea degradation.

91

100% 90% 80% 70% 60% 50% 40%

Mutation tolerance Mutation 30% 20% 10% 0% 0 5 10 15 20 25 30 Evolution round Figure D-4 | Mutation tolerance over neutral evolution of P450 oxygenase The fraction of variants that retain function after mutagenesis of µ=0.3 non-synonymous changes per codon i.e. roughly one mutation per 460a.a gene. The N=1 lineage (white) had a single mutant screened at each round, if it retained >75% activity, it was selected for the next. The N=102 lineage (grey) had 435 variants screened at each round and all variants retaining >75% activity were selected for the next. (Data from reference187)

Secondly, an experiment on β-lactamase concentrated on altering the strength of selection pressure188. A cell survival selection system was used to evolve the gene on either 250 or 12.5 µg/ml of ampicillin. At each round >106 variants were selected, however tolerance to mutations was not quantified and merely reported to be in general >50%. Several mutations were enriched (Figure D-5a) which were predicted to be stabilising. Some of these mutations were characterised and indeed increased thermostability (Figure D-5b) and some were able to compensate for an introduced destabilising mutation.

a b 60 *

59

) 3 - 58 * * * 57

56 C

* o

/ / 55 M T 54 53

52 Mutation frequency frequency Mutation (x10 51

Residue number in TEM1

Figure D-5 | Enriched mutations after neutral evolution of TEM1 β-lactamase (a) After neutral evolution for 18 rounds, the most enriched mutations (χ2 p<0.05) are shown here. Selection on 0 µg/ml (black bars), 12.5 µg/ml (blue bars), 250 µg/ml (red bars), * indicates variants chosen for biophysical characterisation. (b) Thermal denaturation temperatures of the chosen variants are higher than the wild type TEM starting enzyme. ((a) Adapted from reference188 and (b) data from reference188)

92 Chapter D - Neutral evolution and emergent robustness

Together, these two evolution experiments support the core theoretical prediction that mutations which increase robustness can be accumulated in neutral evolution.

In my work I have added to this body of results by providing evidence of how robust variants appear, some of their biochemical and biophysical properties, and how their descendants take over the population. Using a high throughput screening system I was able to generate disaggregated data on the distribution of activities after mutation (as with the P450 evolution), in addition to assaying large populations (as with the TEM1 evolution). This has enabled observation of the accumulation of mutation diversity in greater breadth and detail than previously possible. Use of phylogenetic methods and ancestor reconstruction allowed confirmation of the hypothesis that the ancestors of late round populations genuinely were more robust and mutation tolerant than the wild type starting point.

93

3 RESULTS AND DISCUSSION

For this chapter, Tobacco Etch Virus nuclear inclusion cysteine protease will be referred to as wtTEV and derived populations will be denoted by the number of rounds of evolution.

In Chapter C, I performed DE via small libraries using the lysate assay developed in Chapter B, 5.2. Local fitness landscapes flanking the trajectory were assessed by HT-SAS using a higher throughput, flow cytometric assay, described in Chapter B, 5.4. In this chapter, I have used the flow cytometric assay to perform DE by screening large libraries and selecting a pool of variants at each round that maintain wild type function.

3.1 THE DISTRIBUTION OF FITNESS EFFECTS OF TEV MUTANTS

The wtTEV gene was randomly mutagenised by epPCR, the library of mutants was cloned into pMAA and co-expressed with the C-Y(tev) substrate in E. coli. When assayed by FACS, the distribution of phenotypes was bimodal. This corresponds to a DFE where mutant enzymes either have activity similar to wild-type TEV, or are inactive (Figure D-6a). This agrees with previous findings that mutations in whole organisms153,154 and individual enzymes79 often show a bimodal distribution. The rate at which mutations were introduced did not alter the bimodality of the distribution, but affected the relative ratio between the two sub-populations.

94 Chapter D - Neutral evolution and emergent robustness

a b 30

20

wtTEV Frequency

237 Frequency 10 177 0 118 wtTEV

Counts -4 -2 0 2 4 6 8 10 12 14 16 59 protease Specific activity (V ) / μM min-1 0 c o 0 1 2 3 4 25310 10 10 10 10 (FL 5 Log/FL 4 Log) 15 189 Background

126 10 wtTEV

Counts

63 5 Frequency 0 100 101 102 103 0 104 (FL 5 Log/FL 4 Log) -0.20.0-0.1 0.10.20.2 0.30.40.4 0.50.60.6 0.70.8 d Expression / μM Protein

639 1

µ=0.5% 51% - 479 1.E+04101 319

Counts

159 M min μ 0 1.E+03 / 100 100 101 102 103 104 (FL 5 Log/FL 4 Log)

496 Library of protease mutants protease of Library 372 µ=1.5% 1.E+0210-1 248 32%

Counts 124 -2 1.E+01 activity Specific 10 0 0.0 0.2 0.4 0.6 0.8 100 101 102 103 0.0 0.5 104 1.0 (FL 5 Log/FLCell FRET ratio 4 Log) ExpressionSoluble expression / μM Protein / AU Figure D-6 | Bimodal distribution of fitness effects and purified mutant characterisation (a) FRET ratios as measured in FACS of cells expressing: wtTEV, no protease or library of protease variants after random mutagenesis at µ=0.5% (1 amino acid per gene) and µ=1.5% (3 amino acids per gene). For 96 purified variants, (b) the initial rate activity on purified substrate, and (c) total soluble protein in 300μl lysate from 1ml expression culture (measured by Bradford test). (d) Soluble expression vs specific activity for 92 members of a library of variants (wtTEV after mutagenesis at µ=1.5%)

Mutation tolerance of an enzyme or enzyme population can be calculated by the relative abundance of wt-like enzymes after error prone PCR (epPCR). Since epPCR generates some variants with 1 mutation, some 2, some 3 etc. the mutation range follows a Gaussian distribution. This spread can be used to calculate what proportion of

random mutations are neutral (τsub) using Equation D-1 (adapted from reference144). The equation makes a number of assumptions: Mutations are either neutral or lethal, and their effects are additive (i.e. two neutral mutations are always neutral together).

푛 0 1 2 푛 푆 = ∑ 푓푛 (휏푠푢푏) = 푓0(휏푠푢푏) + 푓1(휏푠푢푏) + 푓2(휏푠푢푏) … +푓푛(휏푠푢푏) 푛=0 Equation D-1 | The probability of enzyme activity retention after random mutation S = the fraction of enzyme variants displaying wt-like activity n = number of mutations

fn = the fraction of enzyme variants with n mutations

τsub = the probability that a random mutation will be neutral Nb. Frameshift mutations will cause underestimation τsub, however, these were rare enough that they made no difference to these estimates

95

When mutations were introduced at either µ=0.5% or µ=1.5% non- synonymous changes per codon (≈1 or 3 amino acid mutations per enzyme) in wtTEV, the probability of activity retention after a random amino acid substitution (τsub) was 0.78 and 0.74, respectively. This is higher than reported previously for other proteins (average

50,79,144,256 τsub=0.65±0.02) meaning that in wtTEV only a quarter of random amino acid changes destroy activity. This may be because viral proteins are hypothesised to have higher structural robustness269 to compensate for the lack of genomic robustness, high mutation rates and strong selection pressures inherent to the viral life cycle266,270,271.

I attempted to investigate the underlying protein properties by purifying 92 members of the µ=1.5% mutant library. Poor expression of the protease lead to a high degree of noise in the data. However, some broad observations could be made. The underlying distribution of purified enzyme specific activities appeared less defined (Figure D-6b) and the range of expression showed no bimodality (Figure D-6c). The lack of obvious bimodality in the activity of purified enzyme variants may indicate that the bimodal distribution of phenotypes was partly the effect of the ‘end-point’ style phenotype assay. This is consistent with the observation that changes in enzyme kcat/KM have non-linear effects on pathway flux130 and fitness272. There also appeared to be a negative trade-off between activity and soluble expression in the library (Figure D-6d) in line with results in Chapter C.

3.2 GENERATING NEUTRAL EVOLUTION LINEAGES

To investigate how mutation tolerance can change during evolution, I performed several neutral evolution experiments. For each generation, a population of 106 mutant enzymes was expressed, assayed, and a homogeneous sub-population of functionally neutral mutants isolated by FACS (with >99% of selected clones retaining activity on the native

96 Chapter D - Neutral evolution and emergent robustness cleavage sequence). In this way, mutations that were neutral were accepted for the next round of evolution. Better-than-wt cleavage was not preferentially selected, and so mutations that increase activity were not preferentially selected.

Population size was controlled at N=106 by screening 106 variants at each round. Similarly I controlled effective population size at 2x104 by limiting the number of neutral variants selected for the next round of mutagenesis. Parallel neutral evolution lineages compared the effect of mutation rate (µ=0.5% or µ=1.5% non-synonymous changes per codon per round). As mentioned in Chapter B, 5.2, mutations were evenly distributed through the gene with very few frameshifts. Keeping population and effective population sizes, mutation rate and selection pressure constant ensured that changes seen were due to the duration of neutral evolution, not changing conditions.

Since not all mutations are neutral, they were accumulated at a slightly lower rate than they were introduced (i.e. some mutations are

4 deleterious and purged from the evolving populations). The Ne=10 lineages accumulated mutations at roughly linear rates of 0.4% and 0.7% amino acid variation per round (at µ=0.5% and 1.5% respectively Figure D-7a) although the rate lowered after round 7. These rates were

2 slightly higher than in the equivalent Ne=10 lineages (0.3% and 0.7% Figure D-7b). a b 8% 8%

6% 6%

4% 4%

Variation to wt Variation to wt Variation 2% 2%

0% 0% 0 2 4 6 8 10 0 2 4 6 8 10 Evolution round Evolution round Figure D-7 | Mutation accumulation rate Accumulation of sequence diversity over course of experimental evolution at µ=0.5% (white) or µ=1.5% (grey) non-synonymous mutations per codon per round with effective 4 2 population size of (a) Ne=10 or (b) Ne=10 . Dashed lines indicate average trend rate.

97

Initially, as the populations accumulate mutations, their tolerance to further mutations (defined as τ0.5% and τ1.5% for µ=0.5% and µ=1.5%, respectively) decreased (Figure D-8). By the time populations had accumulated ≈2% sequence diversity, mutation tolerance fell from

τ0.5%=0.60 to 0.08 and from τ1.5%=0.28 to 0.06. This indicates that these populations contain variants that, though neutral in function, are less robust. This would be expected by a model in which proteins have stability in excess of some threshold134,273. Mutations, on average, erode this excess stability such that a greater proportion of the population sits on the stability threshold and is affected by mutations (less robust)91,134. 100%

90%

80% x

τ 70%

60%

50%

40%

Mutation tolerance Mutation / 30%

20%

10%

0% 0 1 2 3 4 5 6 7 8 9 10 11 Evolution round Figure D-8 | Population mutation tolerance Change in mutational tolerance over evolution at mutation rates µ=0.5% (white) or µ=1.5% (grey) non synonymous mutations per codon per round. Tolerance measured by proportion of variants retaining wt-like activity after mutagenesis (i.e. τ0.5% for µ=0.5% of codons per gene and τ1.5% for µ=1.5%).

From this low point in mutation tolerance, the decline halted. In the µ=0.5% lineage, tolerance initially fluctuated and then recovered to near wild type levels by later rounds. The trend in the µ=1.5% lineage was more pronounced and showed a general increase in the tolerance to mutations that begins at round 3 and reached τ1.5%>0.50 at rounds 9-11.

Consistent with the threshold hypothesis discussed above, once the population reaches a critical stability level, selective pressure to improve robustness sets in. In the case of wtTEV, this leads to the

98 Chapter D - Neutral evolution and emergent robustness recovery and stabilisation of mutation tolerance with efficiency that is relative to the rate at which mutations are introduced. The stronger increase in mutation tolerance in the µ=1.5% lineage is consistent with a greater selection pressure for robustness under higher mutation rate. It may also explain why this tolerance increase is greater than that seen in previous work on P450 oxidase187 (Figure D-4).

3.3 ENRICHED MUTATIONS CONFER POPULATION ROBUSTNESS

In order to investigate which mutations were accumulating in the evolving lineages, the populations of TEV variant genes were deep sequenced (1200±500 sequences per population). In early rounds, most residues accumulated diversity at rates at or below the mutation rate: increased diversity at individual positions in the sequence indicates the accumulation of neutral mutations is favoured.

Conversely, positions where non-wt mutations were purged indicates that those mutations are deleterious and under negative selection. In both lineages, after diversity accumulated to 2% (5 rounds at µ=0.5% or 3 rounds at µ=1.5% (Figure D-9a,g), previously neutral mutations started to be strongly enriched, indicative of positive selection. Residues in shells further from the active site triad accumulated more mutations (Figure D-9b,h). This is in sharp contrast to results in Chapter C, 3.10 where mutations in the triad second shell were enriched in variants more active with an alternative nucleophile.

Additionally, residues predicted by FoldX to be minimally destabilised by alanine mutation (indicating their importance in stable folding) accumulate fewer mutations (Figure D-9c,i). Later rounds also showed a bias towards mutation accumulation outside of the enzyme core and in disordered regions (Figure D-9d,j).

99

4 Figure D-9 | Mutation accumulation by structural region (Ne=10 lineages) (a) Changes in frequency of non-wt residues in the µ=0.5% lineage plotted by sequence. (b) Mutation accumulation separated by triad interaction shell with residues <4Å from the catalytic triad (light red), 4-8Å from the catalytic triad (red) and >8Å from the catalytic triad (dark red). (c) Mutation accumulation separated by alanine mutation sensitivity (FoldX prediction) with residues predicted to give >3kJmol-1 destabilisation (dark blue) and <1kJmol-1 destabilisation (light blue). (d) Mutation accumulation separated by disorder prediction (IUPred) with residues with disorder prediction >0.6 (light purple) and <0.2 (dark purple). (e) Mutation accumulation separated by residues with sequence conservation within the viral C4 proteases of >0.9 (green) and all residues (black) (f) Round 11 non-wt residue frequency mapped onto structure (residues purged of variation in red, enriched in blue). (g-l) Same data for µ=1.5%.

100 Chapter D - Neutral evolution and emergent robustness

However, there was no correlation between the accumulated mutations and sequence conservation in an alignment of naturally occurring viral proteases (Figure D-9e,k). This disagrees with previous experiments that suggested that back-to-ancestor mutations were key to the emergent robustness in an experimentally evolving population188,274s. These trends could also be seen when mutation accumulation was plotted on the structure as mutations were purged in the core, whereas the surface contained neutrally accepted and actively enriched mutations (Figure D-9f,l).

Although neutral to the selected phenotype, mutations did affect the underlying properties that give rise to that phenotype. To analyse this

4 in more detail, I focused on the µ=1.5%, Ne=10 lineage. The average kcat/KM activity of the population fell to 16% of wtTEV by round 3 after which it was never above 30% (Figure D-10). Average soluble expression of the populations was always lower than wtTEV (Figure D-11a) and showed increasing dependence on the MBP fusion partner (Figure D-11c).

1.E+055

1 10 -

1.E+04min 4 1

- 10 M

1.E+03 / ) 103

M /K

1.E+02cat 2

k 10

1.E+01101 Activity Activity ( 1.E+00100 wtTEV 1 3 5 7 9 11 Evolution round 4 Figure D-10 | Population average activity (µ=1.5%, Ne=10 lineage) Averaged Activity (kcat/KM) of populations in the µ=1.5% lineage. The whole population of enzyme variants was expressed and purified. Activity assays of the purified mixed enzyme reflects average activity.

This indicates that (at least at the whole-population level) selection for increased activity or concentration of active enzyme has not occurred and in fact both properties have fallen slightly, likely reflective of permissive selection for neutral function. A similar phenomenon has

101

been seen in other neutral evolution experiments where the activity of populations falls slightly to match the selection threshold187,188.

While the soluble expression of the evolving populations was always less than the wtTEV starting level (Figure D-11a), their soluble expression after mutagenesis is higher than wtTEV in later rounds (Figure D-11b). This indicates that later populations retain soluble expression after mutagenesis better than wtTEV. Therefore, despite the fall in average soluble expression in later populations, these suffered a smaller loss in solubility upon mutation. This suggested that the enriched mutations (Figure D-9a) were likely selected for their effect on improving enzyme robustness that increased mutation tolerance and extinction-resistance (Figure D-8).

a 100% 80% 60% 40% % of wtTEV%of 20% Soluble expression / Soluble expression 0% 1 3 5 7 9 Evolution round b 40% 30%

20%

%of wtTEV%of 10%

Soluble expression / Soluble expression 0% 1 2 3 4 5 6 7 8 9 Evolution round c 100%

50% % of wtTEV%of

Soluble expression Soluble / expression 0% 1 2 3 4 5 6 7 8 9 Evolution round 4 Figure D-11 | Population average soluble expression (µ=1.5%, Ne=10 lineage) Population average soluble expression for (a) evolved populations, (normalised to wtTEV) (b) evolved populations after mutagenesis (normalised to wtTEV) (c) evolved populations after mutagenesis without MBP fusion (normalised to wtTEV after mutagenesis without sMBP fusion).

102 Chapter D - Neutral evolution and emergent robustness

3.4 LATER POPULATIONS ARE DESCENDED FROM ROBUST NODE SEQUENCES

High-throughput sequencing allowed phylogenies of the evolving lineages to be generated. A standard phylogeny only shows the sequences of contemporary, extant sequences (in this case, the final round of the experimental evolution). Since lineages in this work were sequenced throughout evolution, it is possible to use these ‘fossil’ DNA sequences to view evolutionary dynamics on a phylogeny.

When viewed as a phylogeny it can be seen that early populations have wtTEV as last common ancestor (Figure D-12) in agreement with mutations being accumulated across the gene. Later populations, however, are descended from non-wt nodes, consistent with the enrichment of mutation cohorts in Figure D-9. The co-enrichment of mutation cohorts is common in situations of high mutation rate and strong selection275.

The phylogeny shows that these mutation cohorts were enriched, not as the result of a single sequence becoming more common, but due to the descendants of a particular sequence being more common.

In the case of the µ=1.5% lineage (Figure D-12b), node sequence TEV- β1 appears in round 3 and its descendants proliferate until they are outcompeted by descendants of TEV-β2 in rounds 9-11. A similar trend is seen for the µ=0.5% lineage (Figure D-12a) but with TEV-α1 and TEV-α2 cofounding later populations, i.e. neither are extinct by round 11. These phylogenetic topologies suggests that the node sequences (α1, α2, β1 & β2) are more extinction-resistant than wtTEV and other members of their population.

103

a

wtTEV Round 1 Round 3 α2 Round 5 Round 7 Round 9 α1 Round 11

5 nt mutations

b

β2

β1

4 Figure D-12 | Ne=10 Lineage phylogenies Phylogenetic trees of (a) the µ=0.5% lineage and (b) the µ=1.5% lineage. Small circles indicate 100 sequences from each of rounds 1,3,5,7,9,11 that were aligned and used to generate each phylogeny. Large circles indicate wtTEV (red) and reconstructed nodes (grey).

The observation that these node sequences appeared to be more

evolvable could be confirmed by reconstructing and characterising the

Ancestral gene node enzymes. Ancestral gene reconstruction is increasingly used to reconstruction 276,277 Inference of the most likely check predictions about the properties of ancient enzymes . sequence of an ancestral gene using a phylogeny of Phylogenetic methods implicitly generate the most likely sequence of modern descendants each node anyway and whole gene synthesis is becoming economical. The same technique is even more applicable to the phylogenies

104 Chapter D - Neutral evolution and emergent robustness generated here. The deep sequencing of very similar genes gives a high degree of certainty when inferring node sequences.

I therefore did this for the genes of four ‘ancestral’ node genes, two from each of the µ=0.5% and µ=1.5% lineages (Table D-1 & Figure D-13). The ancestral node sequences were inferred by the phylogenetic trees of late round populations (7, 9 & 11). I was able to observe all predicted node sequences in the sequencing results of earlier rounds, confirming that they genuinely existed. Since the forward and reverse halves of the gene were sequenced separately, the two half-reads were stitched together based on tree topology. Finally, the correct pairing of the two gene halves was confirmed by full length Sanger sequencing of individual descendants from later rounds.

NAME FIRST SEEN AMINO ACID MUTATIONS IN ROUND α1 7 P8Q, P13T, Q58R, N68T, F139L, S153N, R203G, α2 9 K6M, P8S, H61Q, F132S, D136E, F162L, T191R β1 3 Q58R, S123N, W202R β2 5 S15A, T29S, K67R, D148V, S200R, E223STOP

a b

c d

4 Table D-1 | Reconstructed node summary (Ne=10 lineages) List of the mutations in each reconstructed node sequences relative to wtTEV 4 Figure D-13 | Reconstructed node mutation locations (Ne=10 lineages) Amino acid mutation locations in α1, α2, β1 and β2 (a-d) mapped onto the structure of wtTEV (PDB 1lvm). Substrate in black, catalytic triad in red, mutations in grey.

The reconstructed nodes each contained 9-10 nucleotide mutations compared to wtTEV, only 3 to 7 of which were non-synonymous (Table

105

D-1). The only recurring mutation was Q58R independently evolved in α1 and β1 (using different codons). P8 was also mutated to polar residues in α1 and α2. The C-terminal truncation at residue 223 in β2 is interesting as a similar truncation was also observed in the TEV evolution experiment for increased activity using a mutant nucleophile reported in Chapter C. This convergence suggests that the C-terminal truncation in TEVSerVI may have been an adaptation to increase robustness in that system as well.

As hypothesised, the enzymes that these genes encode displayed higher tolerance to mutations (Figure D-14a). At low mutation rates (µ=0.5%) all reconstructed nodes showed >90% of variants that still retained wt-like activity upon mutation. Even at high mutation rates (µ=2.0%) which leave only 10% of wtTEV variants active, reconstructed nodes displayed >50% active mutants. Using the µ=2.0% mutation rate, I calculate τsub=0.89 for β1, τsub=0.95 for α1 and

τsub=0.96 for α2 and β2 (for wtTEV τsub=0.74). This validates the hypothesis that these node sequences encode particularly mutation tolerant enzymes. a b c 100% 1.E+05 4000040

35000

1 x - 4 τ 80% 1.E+0410

3000030

min

1 - 25000

60% 1.E+03M / ) 103

M /K

cat 2000020

40% 1.E+02102

Mutation / tolerance Mutation 15000

Activity Activity (k Soluble AU / expression 1000010 20% 1.E+01101 5000

0% 1.E+00100 00 wt 1 2 1 2 wt 1 2 1 2 wt 1 2 1 2 α β α β α β

Figure D-14 | Robustness of reconstructed node sequences (a) Mutational tolerance to µ=0.5% (white) µ=1.0% (grey) and µ=2.0% (black) amino acid -1 -1 mutations per protein, (b) kcat/KM activity in M min and (c) soluble expression after 4 hours (black) and 16 hours (grey).

Indeed, the smaller improvement in mutation tolerance of β1 over wtTEV may explain why it was outcompeted by β2. Conversely, α1 and

106 Chapter D - Neutral evolution and emergent robustness

α2 have almost the same mutation tolerance in agreement with the persistence of descendants of both variants in later rounds.

In order to assess the underlying biophysical and biochemical determinants of the observed robustness, I purified the enzymes for characterisation. All reconstructed nodes had reduced kcat/KM relative to wtTEV (Figure D-14b) in line with the generally reduced population average activity (Figure D-10). However, they showed increased soluble expression levels (Figure D-14c). The variants with the highest mutation tolerance (α2 and β2) showed both the highest activity and soluble expression among reconstructed nodes. This agrees with the high retention of solubility after mutation (Figure D-11b) in the populations descended from these nodes.

Mutations frequently reduce the stability of enzymes169,278 and it is known that wtTEV is poorly soluble and prone to aggregation216–218. Together, these results imply that structure robustness, particularly resistance to aggregation, is a key aspect of the extreme mutation tolerance observed in the reconstructed enzymes.

3.5 EMERGENCE OF ROBUSTNESS REQUIRES LARGE EFFECTIVE POPULATION SIZE

By way of comparison, analogous experiments were performed in

2 2 which a smaller effective population of only 2x10 variants (Ne=10 ) were selected at each round. As mutation rates (µ=0.5% or 1.5%) and screened library size (N=106) were kept constant, these parallel lineages report on the effects of bottlenecks in the effective population size only. In this way it is possible to disentangle the two effects of a smaller variation pool (N kept constant) and the effect of drift (Ne varied by two orders of magnitude) and address the theoretical predictions introduced in section 2.2.

107

Under these conditions, neither of the lineages recovered tolerance (Figure D-15Error! Reference source not found.). At µ=0.5%, tolerance remained constantly below τ1.5<0.15. When µ=1.5%, tolerance only transiently increased, peaking at τ1.5=0.75 in round 8 (≈4% amino acid variation) but quickly fell to τ1.5=0.00 causing lineage extinction. This

4 contrasts with the results of the Ne=10 lineages (Figure D-8) and indicates that restriction in the number of variants used leads to, in the best scenario, stochastic variations in robustness and does not allow the fixation extinction-resistant populations.

100%

90%

80% x

τ 70%

60%

50%

40%

Mutation tolerance Mutation / 30%

20%

10%

0% 0 1 2 3 4 5 6 7 8 9 10 11 Evolution round 2 Figure D-15 | Mutation tolerance (Ne=10 lineages) Change in mutational tolerance over evolution at mutation rates µ=0.5% (white circles) or µ=1.5% (grey circles) non synonymous mutations per codon per round in lineages with 2 Ne=10 . Tolerance measured as the proportion of variants retaining wt-like activity after mutagenesis (i.e. τ0.5% for µ=0.5% of codons per gene and τ1.5% for µ=1.5%)

108 Chapter D - Neutral evolution and emergent robustness

2 The restricted effective population size (Ne=10 ) had effects on the efficacy of selection for both populations. The lack of highly robust variants generated in the µ=0.5% lineage may be due to the low accepted mutation rate (total amino acid diversity is never >2%) and therefore insufficient sequences were sampled to find effective, robust variants. Alternatively it may be due to the stochastic purging of otherwise robust variants by genetic drift in the small effective populations. a 100%

80%

60%

40%

11 20%

9 Percentage wt not residue Percentage 0% 7 1 5 51 3 101 1 151 201 wtTEV b 4% c 4% 3% 3%

variation f

2% 2% acid acid

1% 1%

Amino acid acid Amino variation Amino Amino 0% 0% wtTEV 1 3 5 7 9 11 wtTEV 1 3 5 7 9 11 Evolution round Evolution round d 4% e 4%

3% 3%

2% 2%

1% 1%

amino acid variation acid amino amino acid variation acid amino 0% 0% 1 2 3 4 5 6 7 wtTEV 1 3 5 7 11 0% 10% 100% Evolution round Evolution round Mutation accumulation g 100%

80%

60%

40%

11 20%

9 Percentage wt not residue Percentage

0% 7 1 5 51 3 101 1 151 h 12% i 12% 201 wtTEV 10% 10% 8% 8% 6% 6% l 4% 4%

Amino acid acid Amino variation 2%

Amino acid acid Amino variation 2% 0% 0% wtTEV 1 3 5 7 9 11 wtTEV 1 3 5 7 9 11 Evolution round Evolution round j 12% k 12% 10% 10% 8% 8% 6% 6% 4% 4%

amino acid variation acid amino 2% 2% amino acid variation acid amino 0% 0% wtTEV 1 3 5 7 9 11 wtTEV 1 3 5 7 11 0% 10% 100% Evolution round Evolution round Mutation accumulation 2 Figure D-16 | Mutation accumulation by structural region (Ne=10 lineages)

109

(a) Changes in frequency of non-wt residues in the µ=0.5% lineage plotted by sequence. (b) Mutation accumulation separated by triad interaction shell with residues <4Å from the catalytic triad (light red), 4-8Å from the catalytic triad (red) and >8Å from the catalytic triad (dark red). (c) Mutation accumulation separated by alanine mutation sensitivity (FoldX prediction) with residues predicted to give >3kJmol-1 destabilisation (dark blue) and <1kJmol-1 destabilisation (light blue). (d) Mutation accumulation separated by disorder prediction (IUPred) with residues with disorder prediction >0.6 (light purple) and <0.2 (dark purple). (e) Mutation accumulation separated by residues with sequence conservation within the viral C4 proteases of >0.9 (green) and all residues (black) (f) Round 11 non-wt residue frequency mapped onto structure (residues purged of variation in red, enriched in blue). (g-l) Same data for µ=1.5%.

Mutations were tolerated in regions similar to those seen in in the

4 Ne=10 lineages. These regions were away from the catalytic triad (Figure D-16b), in residues predicted to be only marginally destabilised by alanine mutation (FoldX) (Figure D-16c), and in disordered regions (Figure D-16d). However, for the µ=1.5% lineage, the trend was weak and mutations were accumulated more evenly through the structure (Figure D-16h-l). In particular mutations close to the triad were not purged as they were in other lineages. It is tempting to speculate that the lack of confinement of mutations to certain structural regions was the cause of the extinction of this lineage.

2 I also constructed phylogenies for the Ne=10 lineages. The phylogeny indicates that the µ=0.5% lineage is dominated by the descendants of a variant (TEV-α3) that contains a single mutation V199D (Figure D-17a

& Table D-2). This did not particularly increase the population’s mutation tolerance (Figure D-15Error! Reference source not found.) and therefore is likely a chance fixation by genetic drift.

Conversely, the phylogeny of the µ=1.5% lineage shows dominance of round 7 by descendants of TEV-β3, a variant with 3 amino acid substitutions (Table D-2b). Later rounds are dominated by the descendants of TEV-β4, a direct descendant of β3 with 9 extra substitutions and a truncation. The truncation is in fact a convergence with the truncation observed in β2. Radiation from β4 is limited in the final rounds due to the extinction of the whole lineage when no variants remained active after epPCR.

110 Chapter D - Neutral evolution and emergent robustness

2 Therefore the phylogeny topologies for the Ne=10 lineages reflects their failure to evolve increased robustness, in line with predictions that large populations are required. Since the number of mutants screened at each round was the same (N=106), this is likely not due to a smaller pool of variation, but due to the greater stochastic effects of drift in agreement with the reduced purging of mutations in the active site and regions vital to structure (particularly when µ=1.5%). It is worth noting, however, that previous work performed on a P450

2 2 oxidase compared Ne=1 versus Ne=10 found that 10 was sufficient for the evolution of robustness187. Theoretical work suggest that 103 is roughly the threshold where emergent robustness is expected in populations of these mutation rates149.

111

NAME FIRST SEEN AMINO ACID MUTATIONS IN ROUND α3 9 V199D β3 3 D78E, S120N, D127E β4 7 K3R, T22M, D78E, K119E, S120N, D127E, S153C, T175S, K184E, F186L, S208P, W211R, E223STOP a

wtTEV c Round 1 Round 3 Round 5 α3 Round 7 Round 9 Round 11

5 nt mutations d b

β3 e

β4

2 Table D-2 | Reconstructed node summary (Ne=10 lineages) List of the mutations in each reconstructed node sequences relative to wtTEV 2 Figure D-17 | Ne=10 lineage phylogenies and node mutation locations Phylogenetic trees of (a) the µ=0.5% lineage and (b) the µ=1.5% lineage. Small circles indicate 100 sequences from each of rounds 1,3,5,7,9,11 that were aligned and used to generate each phylogeny. Large circles indicate wtTEV (red) and reconstructed nodes (grey). Amino acid mutation locations in α3, β3 and β4 (c-e) mapped onto the structure of wtTEV (PDB 1lvm). Substrate in black, catalytic triad in red, mutations in grey.

112 Chapter D - Neutral evolution and emergent robustness

4 CONCLUSIONS

By coupling high-throughput experimental evolution and deep sequencing I was able to reconstruct a complex sequence of evolutionary events. The results presented here indicate that neutrally evolving populations of enzymes can find robust enzyme variants in the local neutral network around wtTEV. These robust variants have higher mutation tolerance and hence their descendants are overrepresented in later populations. In this way robustness can evolve by simply selecting for preservation of wild-type function in the face of mutations. The robustness of natural enzyme folds may, in part, be an emergent property of such forces. Additionally, these results also indicate that mimicking neutral processes can be used as an enzyme engineering method to accumulate mutations to increase robustness and so manipulate evolvability. wtTEV already has higher mutation tolerance (τsub=0.76) than

50,79,144,256 previously measured proteins (average τsub=0.65±0.02), possibly an adaptation to the high mutation rate of viral geneomes267,269,279. Viruses are known to have lower tolerance to random nucleotide mutations than cellular organisms154. In the absence of robustness mechanisms such as genetic redundancy and alternative metabolic pathways, higher protein structure robustness may be an adaptation to achieve viable mutation tolerance.

Deep-sequencing allowed me to follow the populations’ evolution in molecular detail. This showed how mutations were purged from enzyme regions essential for its function254 (close to the catalytic triad) and its stability (those predicted to have high order propensity and to be key to fold stability). It also revealed how some mutations were enriched above the background rate of mutagenesis, indicating positive selection.

113

In both high and low mutation rate lineages, selection for neutrality lead to the erosion of mutation tolerance until amino acid diversity of ≈2% accumulated. Once this point was reached most variants, although functional, were sensitive to mutation (average τ0.5%=0.08 and τ1.5%=0.06) and those variants eventually went extinct. However,

4 when the effective population size is large (Ne=10 ), populations contain a small number of variants with increased robustness.

Deep-sequencing also allowed the phylogenies of the evolving lineages to be built. Using the lineage phylogenies, I reconstructed four node enzymes. Only a single mutation convergently evolved in two, highlighting the diverse set of solutions to the problem of mutation tolerance. Despite the lack of common mutations, there are common properties. All have reduced kcat/KM activities but increased soluble expression compared to wtTEV. Additionally their descendants retain solubility after mutation better than wtTEV. The wtTEV enzyme is known to be highly insoluble and this resistance to aggregation both before and after mutation indicates greater protein structure robustness.

This structural robustness drastically increases mutation tolerance of the reconstructed enzymes. Three of these (α1, α2, β2) have tolerance to random amino acid mutation of τsub>0.95 (compared to τsub=0.76 for

144 wtTEV and τsub≈0.66 for an average protein ). This mutation tolerance indicates that they have a high proportion of neutral neighbours and are therefore in a more interconnected region of the neutral network. Though these robust variants exist in the population only transiently, the high retention of fitness by their mutant descendants causes disproportionate neutral sequence radiation. This sequence radiation demonstrates how second order selection for robustness can result from direct selection for maintenance of wild type activity.

114 Chapter D - Neutral evolution and emergent robustness

115

Chapter E - Adaptive evolvability and changing cleavage specificity

1 SUMMARY

Building on the results of Chapter D, I demonstrate how the accumulation of mutations neutral to activity on the native substrate can increase evolvability for sequence specificity changes. During neutral evolution, the frequency of fortuitous adaptive mutations fluctuated stochastically, with pronounced effects on the ‘immediate evolvability’ in response to selection. Additionally, later, robust populations had increased ‘long-term evolvability’ enabling higher activity after several rounds of adaptive evolution. In addition to shedding light on the evolution of evolvability, these results suggest ways of manipulating the sequence specificity of proteases for therapeutic and biotechnological applications.

Parts of this chapter and Chapter D have been combined and prepared for publication:

Shafee, T., Gatti-Lafranconi, P., Minter, R., Hollfelder, F. Neutral evolution – two combined mechanisms generate robust and evolvable enzymes. MBE. (2013)

116 Chapter E - Adaptive evolvability and changing cleavage specificity

2 INTRODUCTION

2.1 EVOLVABILITY OF ENZYMES

Enzyme evolvability is the propensity of an enzyme (or population of enzyme variants) to acquire new functions by mutation and selection. The generation and improvement of, promiscuous functions provides a starting source of adaptive variation for evolution. In experimental measurements of adaptive evolvability of enzymes, several outcomes affect the response to a novel selection pressure. Even if the whole population does not display a promiscuous activity, some individual variants may be adapted, or have high potential for adaptive evolution. I will therefore separate ‘immediate’ and ‘long-term’ evolvability.

Immediate evolvability is how well a population can adapt by selection Standing variation from standing variation alone. It is dependent on the population size Genetic variation present in a heterogeneous population (number of variants), the number of those variants that are unique

(heterogeneity), and whether any variants have new promiscuous Immediate evolvability The potential for adaptive activities that do not compromise the native activity. Immediate evolution provided by standing variation by selection but no further evolvability, therefore, dictates the maximum promiscuous activity mutation that can be selected from that population without further mutations. Long-term evolvability The for adaptive evolution Long-term evolvability determines the maximum activity after several, provided by both standing variation and new mutations iterative rounds of mutagenesis and selection. It partly depends on adaptive standing (as with immediate evolvability) variation but also on the frequency and magnitude of beneficial, de novo mutations. For example, robustness could increase evolvability if a stable scaffold can compensate for destabilising adaptive mutations in the active site273,280–282. Experiments suggest that this is the case if the active site and scaffold are separated in the structure72,273,283 allowing ‘global suppressor’ mutations.

117

2.2 NEUTRAL MUTATIONS AND CRYPTIC VARIATION

Interest in the interplay between drift and selection started in the 1930s when the shifting-balance theory proposed that, in some situations, genetic drift could facilitate later adaptive evolution284. Although this theory was largely discredited285 it drew attention to possibility that Cryptic variation drift can generate cryptic variation that, though neutral to current Genetic variation that introduces promiscuous 127,189,286,287 function, may affect selection for new functions . activities to some members of a population

In terms of sequence space, current theories predict that if the neutral networks for two activities overlap, a neutrally evolving population may diffuse to regions of the neutral network of the first activity that allow it to access a second146,148,288,289 (Figure E-1). This would only be the case when the distance between activities is smaller than the distance that a neutrally evolving population can cover. The degree of inter-penetration of the two networks will determine how common cryptic variation for the promiscuous activity is in sequence space. Sequence space

Sequence space

Figure E-1 | Two overlapping neutral networks A large set of ‘lattice model’ protein genes generated in silico (top) contain two neutral networks of Structure A (blue) and Structure B (red). Each black dash represents a sequence, the lines connecting them represent mutations. When viewed as a fitness landscape (bottom), they form two overlapping peaks. It is possible for a neutrally evolving sequence to mutate to the edge of network A (a1⟶a2⟶a/b) and arrive at a promiscuous sequence that exists in both neutral networks. From there is this possible to evolve to new structure B specialist (a/b⟶b1). (Adapted from reference288)

118 Chapter E - Adaptive evolvability and changing cleavage specificity

Evidence is accumulating that mutations neutral to native function can be rich in cryptic variation that though unselected, may affect immediate evolvability188,290,291. For example, accumulation of neutral mutations in P450s, caused up to 4-fold changes in six measured promiscuous activities290 (Table E-1). Similarly, in β-lactamase, neutrally evolved populations contained slightly more variants that conferred resistance to cefotaxime than the mutagenised wild type188.

Conversely, experiments on immediate evolvability of β-glucuronidase indicate that pools of neutral mutants can be less rich in adaptive mutations than random mutants. In this case, the most adaptive mutations and mutation combinations trade-off with the original activity and so are not present in neutrally evolved populations26,53,292,293.

ROUNDS OF MOST NEUTRAL IMPROVED ENZYME EVOLUTION TARGET SUBSTRATE VARIANT P450 oxidase 15 propranolol 1.6 - fold 11-phenoxyundecanoic acid 2.2 - fold 2-amino-5-chlorobenzoxazole 3.2 - fold 1,2-methylenedioxybenzene 2.6 - fold 2- phenoxyethanol 4.1 - fold β-glucuronidase 4 β-galactose 3.9 - fold paraoxonase 1 1-3 mixed paraoxon 2.2 - fold 2-naphtylacetate 5.0 - fold dihydrocoumarin 1.8 - fold O-acetoxy-7-hydroxycoumarin 2.1 - fold 5-thiobutyl butyrolactone 1.9 - fold 7-O-diethylphosphoryl-3-cyano- 2.4 - fold 7-hydroxycoumarin FREQUENCY INCREASE β-lactamase 18 0.5 µM cefotaxime 1.7 - fold 0.8 µM cefotaxime 1.8 - fold 1.0 µM cefotaxime 1.1 - fold 2.0 µM cefotaxime 1.1 - fold 5.0 µM cefotaxime * Table E-1 | Previously measured immediate evolvabilities Improvement in a promiscuous activity in the best variant isolated from neutrally evolved populations. Fold improvements calculated from reported kcat/KM activities for P450 oxidase and β-glucuronidase, calculated from ‘specific activity’ for paraoxonase 1. Since activities of individual variants were not reported for β-lactamase, the frequency of variants surviving a particular antibiotic concentration (normalised to an error prone PCR library) was used. * indicates that the wt enzyme library had no surviving colonies, the neutrally evolved population had 1.5% survival. (Data from references188,290,291,293)

119

Observations that neutral evolution can increase enzyme robustness187 (see Chapter D) indicate that long-term evolvability might also be affected as more mutation-tolerant enzymes may better tolerate the introduction of adaptive, but destabilising mutations180,273,281. Long- term evolvability has been described for organisms including bacteria186 and algae294 but comparative studies of protein long-term evolvability are largely theoretical72,295,296. Substrate specificity is therefore a good model to study evolvability, although available studies have not, so far, addressed immediate and long term evolvability independently.

2.3 PROTEASE SPECIFICITY ENGINEERING

Directed evolution has proven to be a successful method for altering protease specificity297. In the last two years, three publications on TEV protease have shown changes in specificity for the residue before224 and after114,214 the cleaved peptide bond at the P1 and P1’ positions (Table E-2).

A cell survival selection system was created based on cleavage of a modular transcription repressor allowing expression of a kanamycin resistance gene. This system isolated a TEV variant active on a substrate with aspartate at the P1’ position114 after 3 rounds of selection. The final variant had 14 mutations and was 3-fold more active on the target substrate and 92-fold less active on the native.

A GFP-RFP fluorescence assay has been used to select for activity on a substrate with arginine at the P1’ position214. Though never kinetically quantified, the isolated variant after 4 rounds had almost 2-fold higher specific activity on the target sequence and showed general broadening of sequence preference at the P1’ position. This was achieved through

120 Chapter E - Adaptive evolvability and changing cleavage specificity a mutation that may increase flexibility of the loop forming the substrate binding tunnel.

Finally, altered specificity at the P1 position has been attempted224. This was a much greater challenge than the previous two works listed as the specificity at this position is strict and evolutionarily conserved. The Yeast Endoplasmic reticulum Screening System (YESS) couples protease activity to display of affinity tags on the surface of the yeast cell are then fluorescently labelled. A library of substrates randomised at the P1 position was screened against a library of proteases randomised around the P1 pocket whilst also counter-selecting against cleavage of the native sequence. The system isolated variants active on either P1 glutamate or histidine. The efficacy of the screening system led to one variant which had 500-fold improvement on the glutamate substrate with a 10-fold reduction on the native, whilst another was 20- fold improved on the histidine substrate and 50-fold reduced on the native.

SELECTION/ MOST TARGET DIFFERENCES SCREENING IMPROVED SEQUENCE TO NATIVE SYSTEM VARIANT ENLYFQD 1 Split repressor 3.4 - fold ENLYFQR 1 GFP-RFP FRET 1.8 - fold ENLYFES 1 YESS 500 - fold ENLYFHS 1 YESS 20 - fold Table E-2 | Previously achieved TEV protease specificity changes Improvement in a promiscuous cleavage activity in a variant isolated through directed evolution towards that sequence. Fold improvements calculated from reported kcat/KM activities for split repressor and YESS systems, calculated from ‘specific activity’ for GFP- RFP FRET screening system. (Data from references114,214,224)

A next step in controlling protease specificity is activity on substrates with multiple residue differences to the native preference sequence. Such substrates represent a far greater evolutionary challenge and so improved understanding of evolutionary potential and how to influence it is required. I therefore use protease specificity changes as a suitable measure of adaptive evolution when investigating evolvability.

121

2.4 CLINICAL APPLICATIONS

Protease therapeutics and biotechnology tools provide a suitable Therapeutic antibody benchmark for the practical application of evolvability research. The Use of antibodies to bind and neutralise target foremost application for the control of protease specificity is as an pathogenic proteins due to their high specificity and alternative to antibody therapeutics. Antibodies are protein binders affinity that evolved in vertebrates as part of the immune system to identify and neutralise foreign objects in the body298. Antibody therapeutics work by binding to a pathogenic target cell or protein to prevent its action299. Use of antibodies as drugs started in 1989 with Muromonab-CD3, a mouse anti-CD3 antibody that acts as an immunosuppressant299,300. Since then, antibodies have proved highly successful with Adalimumab the 7th highest grossing drug (£41 billion in 2011)301 and over a hundred antibody drugs currently in clinical trials302,303.

In a similar way, if a protease could be made to be specific to a single, chosen target (to avoid off-target effects), it should be able to perform much the same function by cleaving that target (Figure E-2). This would have the benefit over antibody therapeutics of being capable of multiple catalytic turnover, and so require a lower dose.

Protease targets Signal soluble signal ligand protein in blood

Signal Protease targets receptor extracellular portion (extracellular) of membrane anchored receptor Cell membrane

Signal receptor (intracellular)

Figure E-2 | Protease therapeutic concept A schematic of possible protease targets. A protease therapeutic could cleave a pathogenic soluble protein such as an overexpressed signal ligand. Alternatively, it could cleave the extracellular portion of a receptor in order to down-regulate it. (Adapted from reference304)

122 Chapter E - Adaptive evolvability and changing cleavage specificity

Generating bespoke antibodies is now routine as it is possible to manipulate the biological systems that already exist in nature for generating specific protein binders. Additionally, established selection protocols allow for affinity maturation in vitro to further enhance binding. Conversely, there are currently no such protocols for manipulating protease specificity (as mentioned in Section 2.3). The development of protease cleavage engineering will advance it as an alternative technology to binding-based biological therapeutics.

In this work I apply experimental evolutionary techniques in an attempt to understand the evolution of evolvability and, in so doing, develop methods for protease specificity engineering. I use the TEV protease lineages that were neutrally evolved on the native sequence (generated in Chapter D) to investigate the changes in promiscuity, immediate evolvability and long-term evolvability as a product of neutral evolution.

123

3 RESULTS AND DISCUSSION

3.1 TARGET SEQUENCE CHOICE

As suitable targets for adaptive evolution, I searched for sequences in human proteins that had the following properties:

 The sequence shows some level of promiscuous cleavage by wtTEV  Appears in a region that is flexible enough to be protease accessible  Cleavage of the protein is likely to have a biological effect

I used the list of therapeutic antibody Stage 3 and Stage 4 targets (provided by Ralph Minter, Table E-3). If stoichiometric binding by an antibody gives a biological effect, then multiple turnover by a protease also has a high chance of effect. Therapeutic antibodies were excluded if the effect has been shown to not be due to simple binding.

SECRETED MEMBRANE TARGET DISEASE GENERIC COMMERCIAL TARGET DISEASE GENERIC COMMERCIAL VEGF-A Cancer Bevacizumab Avastin a4b1 MS Natalizumab Tysabri IgE Asthma Omalizumab Xolair EGFR Cancer Cetuximab Erbitux

TNFa RA Adalimumab Humira HER2 Cancer Trastuzumab Herceptin IL6 RA Tocilizumab Actemra BLyS SLE Belimumab Benlysta IL12/23 Psoriasis Ustekinumab Stelara PPROVED IL1b CAPS Canakinumab Ilaris

A RANKL Osteoporosis Denusomab Prolia C5 PNH Eculizumab Soliris HGF Cancer Rilotumumab AMG102 CSF1R Cancer AMG820 IGF1/2 Cancer MEDI-573 HER3 Cancer AMG888 Ang-2 Cancer IGF1R Cancer Ganitumab AMG479 CCL2 Cancer FGFR Cancer RG7444 EGFL7 Cancer C-Met Cancer Onartuzumab RG3638

Flt-3 Cancer IMC-EB10 RON Cancer IMC-RON8

RIALS

T PDGFRa Cancer MEDI-575 VEGFR1 Cancer Icrucumab IMC-18F1 VEGFR2 Cancer Ramucirumab IMC-1121B

LINICAL LINICAL

C CTLA4 Cancer CXCR4 Cancer DLL4 Cancer FGFR3 Cancer PD1 Cancer PDL1 Cancer

Table E-3 | Target protein list Targets of antibody therapeutics (as of 2012). Excluded are antibody/complement- directed cellular cytotoxicity targets and agonist mAb targets. Highlighted are the targets found to have sequences in loop regions which contain the sequence pattern [EDVQRSGT]-x-[LIVAHTYP]-[YLIHWFMVRKAT]-x-Q-{P} by ExPASy ScanProsite305 (Table E-4).

124 Chapter E - Adaptive evolvability and changing cleavage specificity

Flexible loop regions were identified by examining the published B-factor structures for regions with high B-factor (Figure E-3a). These loops Displacement of an atom around its average position, used as an indication of were cross-correlated with computational predictions of disorder local flexibility (IUPred306, Figure E-3b) and compiled into a list (Figure E-3c).

1 a 2 3

b

1 2 3 Disorder score Disorder

Residue number

c 1

>PDL-1 >PDL-5 HGEEDLKVQHSSYRQ NSKREEK >PDL-2 >PD1-1 EKQLDLAA> SNWSEDL PDL-3 >PD1-2 VVDPVTSEH PSNQTE >PDL-4 >PD1-3 SDHQVLSG LHPKAK Figure E-3 | Identification of flexible target loop regions Structure of PDL1 (PDB code 3bik). (a) B-factor map (a measure of disorder within the X- ray crystal structure). Regions identified with high B-factor (e.g. 1, 2 & 3) (b) Predicted disorder score by primary sequence (IUpred306). Loops were cross-referenced with this disorder prediction (loops 1, 2 & 3 indicated). (c) Final loops shown on structure in red compiled into FASTA list which was searched for loops with sequence patterns that may be cleaved by wtTEV. Indicated is one such loop, PDL-1.

In order to maximize the probability of finding some level of promiscuous activity by wtTEV I searched this list for sequence patterns of residues similar to the canonical cleavage sequence of wtTEV211,222,223:

[EDVQRSGT]-[x]-[LIVAHTYP]-[YLIHWFMVRKAT]-[x]-[Q]-{P}

125

Residues are separated by dashes. Allowed residues are contained in [] brackets and disallowed residues in {} brackets ‘x’ indicates any amino acid. Four target proteins contained loops satisfying these criteria (Figure E-4), each with ≥4 differences to the native ENLYFQ\S sequence (Table E-4).

a b

c d

TARGET SEQUENCE & SIMILARITY e native ENLYFQS

PDL HGEEDLKVQHSSYRQ *:* .* N Y Q VEGFR2 RDLKTQSSGSEM E F L S .:* ** BLyS SETPTIQKGSY *. :*. IL23 STDILKDQKEPKNKT : * *. Figure E-4 | Target structures and loop sequences The 4 target proteins identified as having putative cleavable loops (shown in red). (a) Programmed death inhibitory ligand, (b) Vascular endothelial growth factor receptor 2, (c) B lymphocyte stimulator, (d) Interleukin-23. (e) Cutaway of the binding tunnel of wtTEV (PDB code 1lvb). Substrate shown with residues in alternating light and dark blue and labelled. Triad shown in red. Table E-4 | Target sequences The 4 target sequences identified as having putative cleavable loops by wtTEV. Sequence similarity is denoted by ‘*’ = identical. ‘:’ = conservative. ‘.’ = semi-conservative.

For eventual practical use of evolved TEV variants, the enzyme would need to cleave the sequence in the context of the true target structure. These experiments, however, concentrate on the evolution and manipulation of enzyme sequence specificity and so the sequences are only assayed in an idealized context of the flexible CFP-YFP linker.

126 Chapter E - Adaptive evolvability and changing cleavage specificity

3.2 PROMISCUOUS CLEAVAGE OF TARGET SEQUENCES

The 4 loop sequences were cloned into the linker sequence between CFP and YFP in the pC-Y plasmid by synthesising the oligonucleotide sequences encoding the linker and ligating that directly into the digested plasmid. The substrate names and linker sequences are listed in Table E-5. Activity of wtTEV on C-Y(pdl) and C-Y(veg) was 97- and 13,000-fold lower respectively compared to activity on the native sequence, C-Y(tev). Conversely, there was no detectable cleavage of the substrates with 5 residue differences, C-Y(il23) and C-Y(blys).

LINKER DIFFERENCE GENE NAME SEQUENCE TO NATIVE KCAT /KM C-Y(tev) ENLYFQS - 1.9x104 C-Y(pdl) EDLKVQH 4 2.0x102 C-Y(veg) RDLKTQS 4 1.5x100 C-Y(blys) ETPTIQK 5 < 1x10-1 C-Y(il23) DILKDQK 5 < 1x10-1

Table E-5 | CFP-YFP target linker variants Amino acid sequences of C-Y linkers. Residues differing from the native preference -1 -1 sequence are underlined. kcat/KM in min M .

Since C-Y(veg) had the lowest detectable activity it was chosen as the largest challenge for the most in depth analyses, however directed evolution results will be presented on all 4 substrate sequences.

3.3 NEUTRAL EVOLUTION ERODES AVERAGE PROMISCUITY

TEV was previously neutrally evolved on the native cleavage sequence C-Y(tev) in Chapter D to generate lineages of neutral variants. For this

4 specificity evolution study, I have used the µ=1.5%, Ne=10 lineage (mutations introduced at 1.5% per codon per round, 104 variants retaining wt-like activity selected each round). This population showed the greatest final tolerance to mutations (Chapter D & Figure D-8) and so was chosen to investigate the effects of neutral mutations on adaptive evolution for new substrate specificities. In this chapter I will refer to populations as ‘x,y’ where ‘x’ is the number of rounds of

127

neutral evolution on C-Y(tev), and ‘y’ is the number of rounds of adaptive evolution on C-Y(veg) (Figure E-5). Hence, population 7,3 indicates that seven rounds of evolution on C-Y(tev) were followed by three rounds of evolution on C-Y(veg), and so on.

0,3 7,3 Y(veg) - 3

2 5,1 1

0 -1 0 1 2 3 4 5 6 7 8 9 10 11 Rounds of adaptive evolution adaptive Rounds of C on Rounds of neutral evolution on C-Y(tev) Figure E-1-5 | Summary and nomenclature of C-Y(veg) evolution lineages Schematic of neutral evolution on the native sequence (black) from which several evolutions for new function (green) were performed. Populations are named by these coordinates: rounds of evolution on C-Y(tev), on C-Y(veg).

To test for detectability of promiscuous activities of wtTEV in this system, I first looked at cleavage of three ‘substrate libraries’. In each substrate library, two of the substrate residues were randomised (NNS codon) to generate 400 variant substrates. The library of substrates could be co-expressed with the wtTEV protease in the same way that previously I have expressed libraries of proteases with a single substrate and measures by FACS.

Substrates showed a roughly bimodal distribution of cleavage, as seen in libraries of proteases in Chapter D. The percentage of substrates cleaved by wtTEV is an indicator of how many of these sequences are promiscuous substrates (Figure E-6a). Equally, the same experiment with libraries of proteases shows the average promiscuity of the enzyme population on the substrate library. Although cleavage of these substrate libraries does not directly demonstrate which sequences were cleaved, sorting and sequencing cleaved substrates would allow measurement of sequence specificity222,223.

128 Chapter E - Adaptive evolvability and changing cleavage specificity

Since C-Y(P34xx) showed an intermediate number of substrates cleaved I used that substrate library to test how promiscuity of the neutrally evolving lineage changed over multiple rounds of neutral evolution (Figure E-6b). The neutrally evolving lineage showed slightly reduced and fluctuating promiscuity towards the C-Y(P34xx) library. a 100% LINKER 80% GENE NAME SEQUENCE C-Y(tev) ENLYFQS 60% C-Y(P12xx) ENLYXXS 40% C-Y(P34xx) ENXXFQS

20% C-Y(P56xx) XXLYFQS % of substrates cleaved substrates %of 0% TEV

b 30% 25% 20% 15% 10% 5% % of substrates cleaved substrates %of 0% wtTEV 1 3 5 7 9 11 Rounds of neutral evolution on C-Y(tev) Figure E-6 | Promiscuity on randomised substrate libraries (a) The percentage of substrates cleaved when wtTEV is co-expressed with C-Y(P56xx) (black), C-Y(P34xx) (grey), C-Y(P12xx) (white). (b) The percentage of substrates cleaved when neutrally evolved populations are co-expressed with C-Y(P34xx).

The mixed enzyme populations of the neutral lineage (at rounds 0, 1,

3, 5, 7, 9 & 11) were also purified in order to measure average kcat/KM activity on C-Y(veg) in vitro. The populations did not show any increase in activity on the promiscuous substrate, in fact the average promiscuous kcat/KM activity on C-Y(veg) falls below the detection threshold (Table E-6). Together, the reduction in promiscuity on the substrate C-Y(veg) and the substrate library C-Y(P34xx) indicate that the absence of selection leads to loss of promiscuity, not specificity.

129

KCAT /KM POPULATION C-Y(VEG) wtTEV 2.0x102 1,0 < 1x10-1 3,0 < 1x10-1 5,0 < 1x10-1 7,0 < 1x10-1 9,0 < 1x10-1 11,0 < 1x10-1

Table E-6 | Population average promiscuous activity on C-Y(veg) -1 -1 Kcat/KM values (M min ) in vitro for mixed purified enzyme population cleavage of C- Y(veg).

3.4 NEUTRAL EVOLUTION GENERATES FLUCTUATING IMMEDIATE EVOLVABILITY

Although the populations lost activity on C-Y(veg) on average, there might be minority members that do show promiscuous activity. To test whether promiscuity towards the target sequence C-Y(veg) changed significantly during the neutral evolution lineage, I performed a selection for C-Y(veg) cleavage on the enzyme populations without further mutagenesis (Figure E-7). Selection pressure was applied by 2 rounds of FACS enrichment of the top 1% most active variants on C- Y(veg). The response to selection for a particular substrate by a population is a measure of its immediate evolvability as it depends on whether there are fortuitous adaptive variants in the standing variation.

Va

120%

on Y(veg) - 100% 80% 60% 40%

) after selection selection C on )after 20%

Population median Population median activity 0%

Y(veg -

C wtTEV 1 3 5 7 9 11 Rounds of neutral evolution on C-Y(tev) Figure E-7 | Immediate evolvability along neutral evolution Median FRET ratio of populations being selected for activity on C-Y(veg) (two rounds of enrichment) after 0-11 rounds of neutral evolution on C-Y(tev). FRET ratio normalised to the median ratio of cells expressing wtTEV and C-Y(tev). The population from which variant Va was isolated is indicated.

130 Chapter E - Adaptive evolvability and changing cleavage specificity

Populations 5,0 and 11,0 responded to this selection pressure with an increase in the population average activity on C-Y(veg), demonstrating their high immediate evolvability. From population 5,0 (with the highest immediate evolvability) I isolated the most active single variant (TEV-

Va) which showed a 35-fold improvement in kcat/KM over wtTEV on C- Y(veg). This is somewhat larger than the typical 2 to 5-fold promiscuous activity improvements found in variants isolated from standing variation in a neutrally evolved lineages188,290,291,293 (Table E-1). It is also higher than most previous attempts at protease specificity engineering114,214,224. The best of these experiments reported a 500-fold improvment224, however this was from first randomising the substrate and selecting to ensure an easy target and only altering a single residue of specificity.

The non-linear increase in immediate evolvability as the population neutrally evolves indicates that mutations adaptive for C-Y(veg) activity are not directly selected for during neutral evolution. Lack of a consistent trend along the neutral lineage confirms that C-Y(veg) activity is truly an unselected promiscuous activity, i.e. is lost due to lack of selection. Therefore, even though the populations on average have no activity on C-Y(veg), some populations do contain members that do show improved activity.

The immediate evolvability towards C-Y(veg) also does not correlate with the general promiscuity towards C-Y(P34xx), although both measures fluctuate along the neutral lineage (Figure E-6b & Figure E-7). Therefore, the level of promiscuity in the population does not appear to be a good predictor of whether promiscuity for a particular target substrate is more likely to be present.

131

3.5 NEUTRAL EVOLUTION INCREASES LONG-TERM EVOLVABILITY

The immediate evolvability so far presented describes the prevalence of fortuitous mutations in a population and is the facet of evolvability most commonly addressed in previous studies. To expand upon these works, I also measured the potential for further beneficial mutations by performing 3 rounds of adaptive evolution on C-Y(veg) (Figure E-8a). At each round, diversity was introduced (µ=0.5% of codons per gene per round) after which the top 1% most active variants were sorted, re- expressed and re-sorted to ensure stringent selection (N=106 screened,

4 Ne=10 selected). As starting points I used wtTEV and populations 5,0 and 7,0 from the neutral evolution lineage. These populations compared the long-term evolvability of populations which previously showed very different levels of immediate evolvability (Figure E-6c).

Pa

a 100% b 100%

Y(veg)

Y(pdl) -

Vd - 80% 80% Vc 60% 60%

40% 40% Vb 20% 20%

0% 0% Population average on C average activity Population 0 1 2 3 on C average activity Population 0 1 2 3 Rounds of evolution on C-Y(veg) Rounds of evolution on C-Y(pdl)

c 100% d 100%

Y(il23)

Y(blys)

- - 80% 80%

60% 60%

40% 40%

20% 20%

0% 0% Population average on C average activity Population 0 1 2 3 on C average activity Population 0 1 2 3 Rounds of evolution on C-Y(il23) Rounds of evolution on C-Y(blys) Figure E-8 | Long-term evolvability on target sequences of neutrally evolved populations. Median FRET ratio on C-Y(veg) of populations being selected for (a) activity on C-Y(veg), (b) C-Y(pdl), (c) C-Y(blys), (d) C-Y(il23). Black circles indicate the lineage starting from wtTEV (0,00,3). Grey circles indicate the lineage starting from population neutrally evolved on C-Y(tev) for 5 rounds (5,05,3). White circles indicate the lineage starting from population neutrally evolved on C-Y(tev) for 7 rounds (7,07,3). The populations from which variants Vb, Vc, Vd and Pa were isolated are indicated.

132 Chapter E - Adaptive evolvability and changing cleavage specificity

For C-Y(veg), each round shows small increase in population median activity, however lineage 7,0⟶7,3 shows a fastest rate increase. The most active variant isolated from population 0,3 (TEV-Vb) has a 40- fold faster kcat/KM than wtTEV. This is barely higher than that of Va indicating that, by the measure of most active isolatable variant, long- term evolvability of wtTEV is barely higher than the immediate evolvability present in population 5,0 from variation generated by neutral evolution alone.

Average cleavage increased faster for populations that had been neutrally evolved. In particular the lineage 5,0⟶5,3 improves rapidly but plateaus whereas lineage 7,0⟶7,3 reaches a higher average activity. The most active variants from populations 5,3 (TEV-Vc) and 7,3 (TEV-

Vd) were 37 and 203-fold improved respectively (Figure E-9 & Figure E-10a). Of note is that Vc (from population 5,3) is directly descended from Va (from population 5,0) but shows minimal activity improvement, merely slightly increased solubility. Therefore populations that have been neutrally evolved show faster adaptive evolution and yield more active variants (i.e. long-term evolvability).

Long-term evolvability was also compared between wtTEV and population 7,0 for 3 rounds of adaptive evolution on C-Y(pdl), C- Y(il23) and C-Y(blys). After 3 rounds of evolution on C-Y(pdl), population 7,3(pdl) contained a variant 25-fold improved on C-Y(pdl) (TEV-Pa), whereas 3 rounds of evolution from wtTEV yielded no improved variants (Figure E-8b). Neither wtTEV nor population 7,0 showed any improvement on C-Y(il23) or C-Y(blys) after 3 rounds of evolution on those substrates (Figure E-8c,d). The higher long-term evolvability of population 7,0 appears to be an intrinsic property, rather than specific to C-Y(veg) since population 7,0 also showed higher long- term evolvability for C-Y(pdl) activity than wtTEV. Both C-Y(il23) and C-Y(blys) proved too great an evolutionary challenge for this system.

133

wtTEV a1E+05105 b C-Y(tev) Pa Va Vb Vc Vd C-Y(pdl) C-Y(veg) C-Y(veg) C-Y(veg) C-Y(veg)

1E+04104

)

1 -

M wtTEV 1

- 1E+03103 C-Y(pdl) (min

1E+02M 102

/K cat 1E+01k 101 wtTEV C-Y(veg) 1E+00100 wtTEV 0x wtTEVC-Y(tev) wtTEV 7x C-Y(tev) 5x C-Y(tev) 0x C-Y(tev) 5x C-Y(tev) 7x C-Y(tev) tev 0x C-Y(target)Pdl vegfR 3x C-Y(pdl) 0x C-Y(veg) 3x C-Y(veg)

Figure E-9 | Activity on C-Y(veg) and C-Y(pdl) for wtTEV and evolved variants Activity (kcat/KM) (a) wtTEV and (b) evolved variants on substrates C-Y(pdl) and C-Y(veg). Substrate indicated by bar colour (C-Y(tev) in white, C-Y(pdl) in purple, C-Y(veg) in green)

The purified variants, though showing improved kcat/KM on their target sequences, have comparatively little change in the kcat/KM on C- Y(tev) (Figure E-10b). This indicates that interactions with the new substrate are specific rather than a general increase in reactivity.

250 a On C-Y(pdl) On C-Y(veg) 203 200

150

100 improvememnt 50 35 41 37

Fold Fold 25 1 nd Pa 1 Va Vb Vc Vd 0 wtTEV 0x C-Y(tev) 7x C-Y(tev) wtTEV 5x C-Y(tex) 0x C-Y(tev) 5x C-Y(tev) 7x C-Y(tev) 3x C-Y(pdl) 0x C-Y(veg) 3x C-Y(veg)

b On C-Y(tev) 200

150

100 improvememnt 50

Fold Fold Pa Va Vb Vc Vd 1 nd 2.5 1 1.3 0.9 1.8 1.5 0 wtTEV 0x C-Y(tev) 7x C-Y(tev) wtTEV 5x C-Y(tex) 0x C-Y(tev) 5x C-Y(tev) 7x C-Y(tev) 3x C-Y(pdl) 0x C-Y(veg) 3x C-Y(veg)

Figure E-10 | Activity of variants after 3 rounds of evolution on target sequence (a) kcat/KM improvement of evolved variants on substrates C-Y(pdl) and C-Y(veg). (b) kcat/KM improvement on the original C-Y(tev) substrate of evolved variants. Top row indicates rounds of neutral evolution on C-Y(tev). Bottom row indicates rounds of adaptive evolution on C-Y(target).

134 Chapter E - Adaptive evolvability and changing cleavage specificity

3.6 STEPPING-STONES DID NOT AID EVOLUTION ON UN-CLEAVABLE SEQUENCES

An attempt was also made at circumventing the constraints that prevent the evolution of activity on C-Y(il23) and C-Y(blys). Since both C-Y(pdl) and C-Y(veg) are cleaved to some extend by wtTEV it was possible to narrow down two residues in each substrate that were likely the main factors preventing activity. Each of these residues was individually introduced into C-Y(tev) as well as being individually removed from C-Y(il23) or C-Y(blys) (Table E-7). Residues that were preventing cleavage could be spotted by their strong reduction in activity when added to C-Y(tev), or their restoration of activity when removed from C-Y(target) (Figure E-11).

C-Y(tev) GENE LINKER DIFFERENCE 100% NAME SEQUENCE TO NATIVE (t+I6) C-Y(tev) ENLYFQS - (t+T3) C-Y(t+D2) ENLYDQS 1 75% C-Y(t+I5) EILYFQS 1 C-Y(t+P4) ENPYFQS 1 50% (t+D2) C-Y(t+T3) ENLTFQS 1 C-Y(i-D2) DILKDQK 4 25% (i-D2) C-Y(i-I6) DILKDQK 4

Lysate activity / % of wt cleavage ofwt activity % / Lysate (b-P4) (i-I6) (b-T3) (t+P4) C-Y(b-P4) ETLTIQK 4 0% C-Y(b-T3) ETPYIQK 4 C-Y(il23) C-Y(blys) Table E-7 | Stepping stone substrate CFP-YFP linker variants Amino acid sequences of C-Y linkers for il23 and blys ‘stepping stone’ substrates. Substrate residues hypothesised to be inhibiting cleavage were introduced to the canonical TEV cleavage sequence and removed from the target sequence to produce hybrid sequences. Residues differing from the native sequence are underlined. Figure E-11 | Stepping stone substrate cleavage in lysate by wtTEV Cleavage level of each ‘stepping stone’ substrate by wtTEV in in vitro lysate assay. Top red line indicates cleavage of C-Y(tev) native sequence. Lower, dark blue lines indicate cleavage of C-Y(il23) and C-Y(blys). Light blue lies indicate cleavage levels of stepping stone substrates, either native sequence with target residue added, or target sequence with native residue added.

Some residues such as isoleucine in position P6 gave no change in cleavage either when added to C-Y(tev) or when removed from C- Y(il23), indicating that it is not the cause of cleavage inhibition. Addition of proline at P4 almost completely prevented cleavage of C- Y(tev), however its removal from C-Y(blys) failed to increase cleavage. Therefore, although the P4 proline is inhibitory (likely as a β-sheet

135

breaker), the blys sequence must contain several other residues that restrict cleavage.

Of the 4 residues tested, the aspartate in P2 of C-Y(il23) seemed the most promising as it caused a 50% reduction in activity when added to the native sequence and restored 15% activity when removed from C- Y(il23) (Figure E-11). Therefore, I performed 3 rounds of adaptive evolution on each of C-Y(t+D2) and C-Y(i-D2). I did this starting both from wtTEV and from population 7,0. After 3 rounds, however, none of these 4 lineages showed any detectable increase in cleavage.

Together these results indicate that some target sequences are simply too challenging for the approaches used here. For activities too far away (e.g. C-Y(blys) and C-Y(il23)), neither robust variants nor the wild-type showed any evolvability, although it is possible that populations other than 7,0 might have greater evolvability for this target. Additionally, attempting several stepping stone sequences between C-Y(tev) and C- Y(target) could make finding stepwise solutions more likely.

3.7 MUTATIONS THAT UNDERLIE IMMEDIATE AND LONG-TERM EVOLVABILITY

The isolated improved variants (TEV- Va, Vb, Vc & Vd) contain between 5 and 13 amino acid mutations (Figure E-12a-c). Populations previously neutrally evolved on C-Y(tev) yielded variants with the most mutations due to the accumulated neutral mutations before the adaptive evolution rounds.

The isolated variants all showed mutations in their C-termini which forms the flexible lid region of the substrate binding tunnel in viral proteases (Figure E-12d-f). This has previously been shown to be important in substrate recognition and mutations have previously been observed in directed evolution or specificity switched in TEV214,224.

136 Chapter E - Adaptive evolvability and changing cleavage specificity

Additionally, all isolated variants contained N171D or N176I (Figure E-12a-c). These two asparagines are responsible for the recognition of the aspartate of the native cleavage sequence (ENLYFQ\S) (Figure E-13a). In C-Y(veg) the substrate sequence contains arginine in the place of aspartate rendering these residues counter-productive and removing cationic residues from the P5 binding pocket likely reduces repulsion of the C-Y(veg) substrate sequence.

a b c d

N176I N176I N176I N171D e f g h

Figure E-12 | Mutation locations and viral substrate tunnel loop Mutation positions in isolated from indicated populations with the highest activity on C- Y(veg) activity mapped onto wtTEV (PDB 1lvm) for (a) TEV-Va (5,0), (b) TEV-Vb (0,3), and (c) TEV-Vd (7,3). The C-terminal extension only present in viral members of the PA clan of chymotrypsin-like proteases as (d) surface with loop in blue (e) secondary structure and (f) b-factor putty (thicker indicates greater flexibility) for the structure of TEVC151A (1lvb). Substrate in black, active site triad in red.

When either the N171D or N176I mutations were introduced into wtTEV, it increased activity on the C-Y(veg) substrate by more than an order of magnitude (Figure E-13b). N176I was neutral to activity on the native sequence but N171D changed specificity by reducing activity on C-Y(tev) by an order of magnitude (likely due to negative charge repulsion). Both mutations also reduced soluble expression of the protease (Figure E-13c).

In order to understand the sequence changes giving rise to fluctuating immediate evolvability, I reanalysed data from Chapter D, 3.3. Looking at the population sequence variation in the neutrally evolving lineage,

137

it can be seen that position 171 is mostly purged of mutations (Figure E-13d) indicating strong constraint at that position. No single mutation accumulated above 1.5% and only in conjunction with known stabilising mutations (Figure E-13e). This agrees with the reduction in activity on the native substrate sequence by the N171D mutation.

a b1.E+05105 On C-Y(veg) On C-Y(tev) c 120%

100%

1.E+04104

1

-

M 1 - 80% N F 1.E+03103 L Y

E min / 60% M 2

Q 1.E+02/K 10 cat

k 40% 1.E+01101

20% Soluble expression wtTEV% / Soluble of expression 1.E+00100 0% N171 N176

d e f 5% 5% 5% N176S N171S N176Y N171Y N176D N171D 4% 4% 4% N176I N171I N171K N176K N176T 3% 3% 3% N176H

2% 2% 2%

1% 1% 1%

0% 0%

Percentage Percentage of residue populationin Percentage Percentage of residue populationin 0% Percentage of residue populationin

Rounds of neutral evolution Rounds of neutral evolution Rounds of neutral evolution on C-Y(tev) on C-Y(tev) on C-Y(tev) Figure E-13 | Substrate P6 binding by dual asparagines (a) N171 and N176 mapped onto wtTEV (PDB 1lvm) with all substrate to enzyme hydrogen bonds indicated. Note that N171 and N176 make bonds with the glutamate of C-Y(tev) but would be absent arginine of C-Y(veg). Measurements of (b) in vitro activity (kcat/KM), and (c) soluble expression by gel band quantification (ImageJ). (d) Percentage of population with non-arginine residues at positions 171 (light green) and 176 (dark green) after rounds of neutral evolution on C-Y(tev). Dashed line indicates non-wt residues averaged over whole TEV sequence. Data separated for each non-arginine residue at (c) N171 and (d) N176.

Conversely, position 176 accumulated amino acid diversity at about the average rate for the whole gene during the neutral evolution (Figure E-13d). Because there was no direct selection for or against N176I, its concentration in the population fluctuated over the course of neutral evolution. The N176I mutation was most prevalent in population 5,0 (Figure E-13f), the population with the highest immediate evolvability towards C-Y(veg) (Figure E-6c). The same N176I mutation was

138 Chapter E - Adaptive evolvability and changing cleavage specificity therefore convergently evolved during adaptive evolution starting from wtTEV. Thus as neutral or slightly deleterious mutations stochastically fluctuate in concentration within the population, they have profound effects on immediate evolvability.

When viewed as a phylogeny, it can be seen that Vd and Pa (the most active variants found for their target substrates) are both descended from β2 (Figure E-14), the robust variant identified in Chapter D. Variant Vc, however, is directly descended from Va and so offers an explination for why the 5,0⟶5,3 lineage plateaued in average cleavage, and yielded only a 37-fold improved variant. By being highy adapted, Va outcompeted less the well adapted but more robust variant, β2. Therefore the lineage had high immediate evolvability, enriching Va, but low long-term evolvability as robust variants were absent.

Vc Vb

Va β2

Pa

Vd

5 nt mutations

Figure E-14 | Phylogeny of C-Y(veg) active mutants Phylogeny of lineage neutrally evolved on C-Y(tev) is indicated by lines (data as in Chapter D, Figure D-12b). Circles indicate sequences of plus top variants from populations then selected for activity on C-Y(veg) in green, and C-Y(pdl) in purple.

The exceptionally high mutation tolerance of β2 may aid adaptive evolution both by having fewer deleterious mutations, and a greater proportion of beneficial mutations by buffering beneficial but

139

destabilising mutations. This would explain both the fast rate of adaptation and highly active variant evolved from population 7,0.

In this way, the causal factors for immediate and long-term evolvability can be antagonistic. Fortuitous standing variation leads to high immediate evolvability but its enrichment may come at the cost of robust variants that, though they are not active towards the new substrate, would have given higher long-term evolvability.

140 Chapter E - Adaptive evolvability and changing cleavage specificity

4 CONCLUSIONS

4.1 EVOLUTION OF PROMISCUITY, AND IMMEDIATE AND LONG-TERM EVOLVABILITY

In this chapter I show how neutral evolution affects immediate evolvability though genetic drift of sequences and long-term evolvability through the emergence of robustness.

TEV protease is known for its high sequence specificity211,222,223 so unsurprisingly has very low promiscuous activity on alternative substrate sequences. Along the neutral evolution lineage, average promiscuity towards a set of 400 test substrates fell slightly indicating the general loss of promiscuous activities as did promiscuity towards a specific test substrate sequence, C-Y(veg). This is perhaps explained by the slight drop in native activity (Chapter D) which may cause some minor promiscuous activities to drop below the detection threshold.

However, as the population evolved through the local neutral network, adaptive standing variation (e.g. N176I) appeared and disappeared introducing chance increases in promiscuous cleavage activities into some members of the population (cryptic variation). These members (e.g. TEV-Va from population 5,0) could be isolated by selection to indicate the large fluctuations in immediate evolvability.

Simultaneously, later populations contained variants with mutations that improved robustness (Chapter D, 3.4) and influenced long-term evolvability. The mutation tolerance lead to a faster increase in population average activity and the isolation of the most active variants, Vd and Pa, after 3 rounds of adaptive evolution on C-Y(veg) and C-Y(pdl) from population 7,0. Therefore, in this case, longer durations of neutral evolution did increase long-term adaptive evolvability due to the emergence of robustness.

141

Immediate and long-term evolvability do not correlate as they represent two orthogonal library properties (likelihood of bearing a variant with altered specificity vs. improvement in mutational tolerance). Evidence comes from the long-term evolvability of population 7,0, which is higher than wtTEV despite population 7,0 starting with no greater immediate evolvability. Indeed there is evidence of antagonism between the two properties as population 5,0 shows the highest immediate evolvability, yielding variant Va, which outcompetes robust variants (such as β2) and lowers long-term evolvability.

4.2 RELEVANCE TO PROTEIN ENGINEERING

These experiments compare favorably with previously reported evolution attempts114,214,224. Although I was able to isolate the enzyme Va (35-fold improved on C-Y(veg)) from standing variation alone (also somewhat higher than is typical for adaptive variation in neutral evolution Table E-1), the inconsistency of immediate evolvability explains the limited adaptive potential of some previous neutral evolution experiments293. However, in later rounds, the enrichment of the descendants of robust variants does improve long-term evolvability. Starting from wtTEV, 3 rounds of adaptive evolution produced no improved variants on C-Y(pdl) and only a 40-fold improved variant on C-Y(veg) (Vb). The same adaptive evolution starting from population 7,0 found a 25-fold improved Pa (on C-Y(pdl)) and the >200-fold more active Vd (on C-Y(veg)), both descended from the robust variant β2.

Therefore, characterising the exploration of the local neutral network by an evolving population of genes allowed insight into the relationship between neutral evolution and adaptive evolvability. Additionally, it highlights the generation of robust enzymes as a key goal in improving long-term evolvability in practical applications of experimental evolution.

142 Chapter E - Adaptive evolvability and changing cleavage specificity

143

Chapter F - Discussion and outlook

1 SUMMARY

In this chapter I will draw together the results presented in this thesis into two themes. Firstly I present how my results advance the understanding of protein evolution by describing relevant features of the fitness landscape surrounding TEV protease. Secondly I discuss the relevance of these findings to future protein engineering efforts.

145

2 EVOLUTION AND EVOLVABILITY

Development of the multi-format FRET screening system in Chapter B allowed the generation of an integrated set of data using the same fluorescent assay. It was therefore possible to use the same substrate for high throughput FACS screening all the way to detailed kinetics studies. Although the results presented here are specific to TEV protease, in conjunction with other published works they shed light on several general aspects of enzymology.

2.1 NEUTRAL FITNESS PLATEAU

In the case of TEV protease the distribution of mutant phenotypes is bimodal with mutations either reducing activity to background levels or affecting it so little that a wt-like phenotype remains (in several different assay formats – Chapter D). This indicates a local fitness landscape that is mainly comprised of flat, high plateaus (a neutral network) traversed by deep fitness canyons (sudden, complete loss of function) (Figure F-1a,b).

This is similar to previous observations of a bimodal distribution of fitness effects of mutations on proteins and organisms79,153,278. In such landscape it has been predicted that populations will evolve towards regions of higher mutational tolerance149. This robustness is equivalent to the population moving in the landscape to regions with fewer nearby fitness cliffs and more so neutral neighbours263 and as such, the evolution of robustness can be an emergent property of landscape topology.

One way of moving away from fitness cliffs if to move towards the centre of a neutral network. A relatively smooth, additive landscape

146 Chapter F - Discussion and outlook forms a neutral network with a coherent peak (Figure F-1c). In this case it would be expected that intermediate sequences between the starting enzyme (wtTEV) and the robust variant (e.g. TEV-β2) would be progressively more robust. In the lineage that produced β2, the phylogeny gives no evidence of particularly mutation tolerant intermediates. This result is more in line with a rugged, epistatic landscape which creates a complex, irregular neutral network in which the population is funnelled towards a different area with fewer cliffs (Figure F-1d) and would not necessarily show more robust intermediates.

a b β2

wtTEV

Fitness Fitness

c d

β2 Sequence Sequence space Sequence Sequence space wtTEV

Sequence space Sequence space Figure F-1 | Bimodal distribution of fitness effects as a landscape A representation of a broad fitness peak (a) and a detailed view of the top plateau of a fitness peak (b) shows that even at high fitness, many single mutations can completely inactivate the enzyme causing cliffs that take fitness from ≈1 to ≈0. (c) A neutral network generated by a smooth, additive landscape that forms a ‘round’ peak. Each circle represents a functional gene variant and lines represents point mutations between them. Light grid-regions have low fitness, dark regions have high fitness. White circles have few neutral neighbours, black circles have many. Light grid-regions contain no circles because those sequences have low fitness (as in Chapter A, Figure A-16). (d) An alternative scenario, in which a large neutral network is criss-crossed with ‘fitness cliffs’ due to epistasis (a neutral network representation of (b)). In this case it is also possible for a population to evolve to regions with fewer local cliffs.

147

2.2 RUGGED FITNESS VALLEY

By forcing TEV into a local fitness valley (TEVSer) and experimentally evolving for activity recovery in Chapter C, I have been able to map out an uphill trajectory (TEVSer→TEVSerX) which lies parallel to a nearly flat neutral trajectory (TEVCys↔TEVCysX). In this case I can plot an explicit transect of a fitness landscape (Figure F-2) showing that the mutations accumulated in TEVSerX are nearly-neutral to activity using the original cysteine nucleophile. In this case TEVSerX could be described as being robust to nucleophile exchange, an alternative and far more specific facet of robustness than the more general structural robustness to random mutagenesis described in Chapter D. c

5

10 )

1 -

4 M 1

10 - 103 a b 102 101 Cysteine Activity (min 100 X VIII IX V VI VIII Serine III IV I II TEV Figure F-2 | Explicit transect of a fitness landscape TEVSer→TEVSerX Activity recovery of the TEVSer→TEVSerX directed evolution lineage and TEVCys back- mutants. Kinetic activity of purified enzyme variants is plotted as a transect of the local fitness landscape. Solid arrows indicate the route taken in handicap-recover experiment starting with (a) deleterious TEVCys→TEVSer mutation, then (b) TEVSer→TEVSerX recovery to arrive at neutral neighbours TEVSerX↔TEVCysX capable of using either nucleophile. (c) Dashed arrow indicates alternative, nearly-neutral TEVCys↔TEVCysX route that does not Cys Ser require any large drops in activity. Values are kcat/KM of TEV variants and kobs2 of TEV variants (measured at pH 8, 25°C), standard deviations were below 15% (Chapter C, Figure C-5).

These observations contrast with the weak negative trade-offs between old and new function that are typical for evolution for altered substrate specificity or catalytic activity26,54. When selective pressure is applied for new function, typically the lack of selection the old activity leads to its loss by genetic drift as mutations re-organise the active site53. Given the specialisation of natural serine protease towards their nucleophile, we might expect that further evolution would start to show a trade-off.

148 Chapter F - Discussion and outlook

Mutations that are adaptive, or even neutral towards use of the new serine nucleophile may gradually erode the potential to use the original cysteine. However the main constraints in this case were observed in the enzyme core and mutations in contact with the catalytic triad. The lack of a trade-off (so far) between old and new function indicates that active site optimisation around the serine nucleophile doesn’t disrupt the use of cysteine. It also suggests that neutral evolution could bridge widely differing active sites in natural evolution.

High throughput screening and sequencing in conjunction with traditional directed evolution approaches allowed characterisation of the local fitness landscapes flanking this trajectory (Figure F-3).

a b

TEVSerX

Fitness Fitness

TEVSerI TEVSer TEVSer

Figure F-3 | Schematic of fitness landscape flanking TEVSer→TEVSerX trajectory Although the wider fitness landscape is unknown, HT-SAS analysis reveals the topology of the local landscape (within white border). (a) HT-SAS of TEVSer characterises the local fitness landscape (dashed arrows indicate screened mutants, solid arrow represents the fittest chosen for the trajectory). (b) HT-SAS for each variant along the directed evolution trajectory reveals how the flanking fitness landscape changes round by round. High epistasis (non-additive effects of mutations) means that close genotypes can vary widely in fitness causing a rugged (as opposed to smooth) landscape and the most beneficial paths are not always in the same direction. High adaptive potential (many possible beneficial mutations) means that at each step there are many nearby uphill pathways. Constraints and trends (e.g. for changes in triad second shell but not in the core) mean that despite the rugged landscape, in general more peaks exist in one direction than others.

The analysis characterises the fitness valley around TEVSer as having a rugged topology yet an abundance of uphill pathways directed by constrains in core structural residues and adaptive potential in the active site perimeter. This underlying landscape guided the trajectory towards one possibility in a set of many but similar solutions. Therefore, although only one evolutionary pathway was characterised

149

in detail, it is likely representative of a large class of equivalent trajectories.

2.3 ALTERNATIVE FITNESS PEAKS

Finally, in Chapter E, I show evolution towards several nearby fitness peaks with alternative sequence specificities. By measuring both immediate and long-term evolvability I show that some variants with altered specificity could be found within the neutrally evolving lineage (e.g. TEV-Va), but that movement of population to regions of high robustness gave eventual access to better variants (e.g. TEV-Vc).

Fluctuating immediate evolvability also aids our understanding of divergent evolution after gene duplication. It suggests that in the evolutionary history of a population of enzyme variants, promiscuous activities are constantly being introduced and purged at random. As genes duplicate at a roughly constant rate, redundant copies are mostly inactivated by pseudogenisation. The chance periods of high promiscuity, however, may lead to greater likelihood of sub- and neo- functionalisation.

Immediate and long-term evolvability have different requirements. Immediate evolvability is achieved by a diverse, polymorphic population in order to fortuitously find rare beneficial mutations. Long-term evolvability, however, is improved by a robust starting enzyme (Chapter D) that, at the least has fewer deleterious mutations, and at most may have more beneficial ones compared to wtTEV. Consequently, it is possible that these even trade-off against one another if some active variants (higher immediate evolvability) displace less active but robust variants (higher long term evolvability) early in an adaptive evolutionary period.

150 Chapter F - Discussion and outlook

3 ENZYMES AND ENGINEERING

3.1 NEUTRAL EVOLUTION AND CONTROLLING PROTEASE SPECIFICITY

The relevance of ‘neutral drift’ for protein engineering has been a contentious topic188,274,290,293 with conflicting results as to whether the accumulation of neutral mutations increases evolvability, especially in a protein engineering context. By measuring the sequence evolution of populations (Chapters D & E) and reconstructing their phylogenies I was able to partly resolve this issue. Specifically, neutral evolution displayed two distinct processes with independent evolvability effects.

The first process is a classical genetic drift as the population accumulated neutral and nearly-neutral mutations. As an engineering tool this method has several drawbacks. Firstly, mutations have to be neutral to the original function as well as beneficial to the new. Though this is true of many mutations54,291, it is certainly not true of all292,293.

Additionally, neutral mutations appear and disappear stochastically in the population as they are under no specific selection for retention (Chapter E). This leads to large fluctuations in the immediate evolvability from standing variation and arbitrary lengths of neutral evolution are not guaranteed to explore productive regions of sequence space. Hence, screening for new activities after an arbitrary number of generations may miss adaptive potential that was present in earlier rounds. This stochastic element of neutral evolution can accurately be called ‘neutral drift’ since it describes non-directional diffusion in the neutral network. As an engineering technique then, the ‘drift’ part of neutral evolution is inefficient, as observed for previous experimental evolution of B-glucuronidase to B- galactosidase293.

151

The second process is a directional movement of populations towards enzymes with higher robustness (Chapter D). This is a process of second-order selection for extinction resistance186 and so the label of ‘drift’188,274,290,291,293 is misleading. The neutral evolution of robustness is more promising as an engineering technique since it reduces the proportion of mutations that are deleterious to function and enables faster adaptation to a new activity as well as reaching a more active final variant, improving long-term evolvability.

Finally, none of the variants evolved in Chapter E showed re- specialisation to the new sequence, activity was improved on the new sequence, sometimes by several orders of magnitude, but the native activity never varied more than 2-fold. This may be in issue in applications where tight specificity is required but could be solved with appropriate counter-selection224. However, the >200-fold improvement of TEV-Va on its target sequence (4 residue differences to native sequence) compares favourably with previous works that change the substrate specificity at only one substrate residue. This gives some optimism that the large specificity changes desired in protease engineering are possible.

3.2 MANIPULATING ACTIVE SITE CHEMISTRY

Chapter C shows the sensitivity of the enzyme to nucleophile mutation, exchange of a single atom amongst the tens of thousands in the enzyme. Results are consistent with previous nucleophile mutagenesis studies212,238–249 and with the difficulties of de novo enzyme design in which a designed active site307 is docked into existing protein structures308,309. Together, these results highlight that engineering of optimised catalysis is more than precise orientation, but also correct activation and tuning of chemistry101,310,311.

152 Chapter F - Discussion and outlook

Nevertheless, directed evolution of the enzyme was able to adapt the active site to accommodate a new nucleophile despite constraints such as epistasis and negative pleiotropy between activity and solubility. The proportion of beneficial mutations is at the high end of previous measurements and reflects that the local sequence space is much richer in solutions to this chemical problem than originally expected. The adaptive mutations for nucleophile re-optimisation were disproportionately located in the active site periphery (in strong contrast to the neutral evolution lineages in Chapter D where such mutations were purged).

Additionally, the lack of a strong trade-off between use of the original and new nucleophile supports manipulation of core catalytic machinery as an avenue of protein engineering105. In particular it may allow access to alternative chemistries, not normally associated with a particular protein scaffold or specificity.

4 SYNOPSIS

To close, I have demonstrated the utility of integrating traditional and high-throughput techniques both for directed evolution (Chapters D

& E) and for measuring fitness landscapes to aid the understanding of the process and outcomes of directed evolution (Chapter C). I have used these techniques to help create an integrated picture of neutral evolution, the emergence of robustness, evolvability, active site evolution, and substrate specificity. Together these aid our understanding both of enzyme evolution and engineering.

153

154 Chapter F - Discussion and outlook

Chapter G - Materials and methods

1 PLASMIDS

NAME RELEVANT FEATURES SOURCE pET28-MBPTEV Kanmx, T7 lac promoter, MBP-TEV fusion This project pET28-STEV Kanmx, T7 lac promoter, PelB export This project sequence, TEV pET28-STEVa Kanmx, T7 lac promoter, PelB export This project sequence, C151A TEV pMAA Bla, Ara promoter, MBP, AraC Novagen pMAT(MBPTEV) Bla, Ara promoter, MBP-TEV fusion, AraC Gatti-Lafranconi pMAA(MBPTEV) Bla, Ara promoter, MBP-TEV fusion, AraC Gatti-Lafranconi pMAA(MBPTEVs) Bla, Ara promoter, MBP-TEVSer fusion, AraC This project pMAA(MBPTEVa) Bla, Ara promoter, MBP-TEVAla fusion, AraC This project pMAA(TEV) Bla, Ara promoter, TEV, AraC Gatti-Lafranconi pMAA(TEVs) Bla, Ara promoter, TEVSer, AraC This project Table G-1 | Protease plasmids pMAA(MBPTEVs) was used to express all variants in Chapter C except for protein expression for thermal denaturation, when protease genes were subcloned into pMAA(TEVs). Similarly for Chapter C pMAA(MBPTEV) and pMAA(TEV) were used.

NAME RELEVANT FEATURES SOURCE pC-Y(cleavage) CAT, Dual-GFP FRET fusion, AraC Gatti-Lafranconi pGR32 CAT, lac promoter, BLIP, LacI Palzkill pJET1.2/Blunt Blunt ligation plasmid Fermentas pSU18 CAT, Novagen pUC18 Bla, Novagen pJET-bla Bla (+ required restriction sites) This project pSB-wtBLIP Bla , CAT, LacI, lac promoter, BLIP This project pSB-iBLIP Bla , CAT, LacI, lac promoter, iBLIP This project Table G-2 | Substrate plasmids pC-Y(cleavage) represents a set of plasmids all encoding the CFP-YFp substrate but with different cleavage sequences as the linker. The full table of linker sequences is detailed in Chapter C.

155

2 PRIMERS

NAME TM/°C SEQUENCE ins1F 71 ACTUCCAGAATGCCCCGACGCTCACCCTCGC ins1R 71 AAGUACAGATTCTCGCTCGGGGCCAGCAGCTTCTCC ins2F 69 ACTUCCAGAATGCCGACGCGAAGGTGGACTCG ins2R 71 AAGUACAGATTCTCGGCGGCGCTGGTGAAGCCG blaF 60 GATCGATCAAGCTTGGTCTGACGCTCAGTGGAACG blaR 58 GCTGCTGATCTAGATGGTTTCTTAGACGTCAGGTGG blipF 68 GCGATTCGGATCCTCTCCGGGAGCTGCATGTGTTCAG blipR 68 GTTGTACCCGCGGTGCAGGTCGACTCTAGAATACAAGGTCCC tevF 60 GTTCCGAGGGGATCCATGGG tevR 59 CCCCGATCCCTCGAGAAGC restF 66 TGAATAGAGATCTTACTGACTCGAGCCGTTTTTCCATAGGCTCCG restR 67 AGGCTACAGATCTCCTGATATCGATCTCGAATACAAGGTCCCACTGCCG tevaF 63 GGGCAGGCTGGATCCCCATTAGTATCAACTAGAGATGGG tevaR 64 CTAATGGGGATCCAGCCTGCCCATCCTTGGTTTGAATCC tevsF 63 GGGCAGTCTGGATCCCCATTAGTATCAACTAGAGATGGG tevsR 64 CTAATGGGGATCCAGACTGCCCATCCTTGGTTTGAATCC shufF 60 GAACATCCCGCAGATGTCCGC shufR 59 GTTATGCTAGTTATTGCTCAGCGG seqXF 60 CGTATCGCCTCCCTCGCGCCATCAG(MID-X)GTTCCGAGGGGATCCATGGG seqXR 59 CTATGCGCCTTGCCAGCCCGCTCAG(MID-X)CCCCGATCCCTCGAGAAGC Table G-3 | Primers mentioned in report Primers were designed with NetPrimer (primer TM was calculated under standard conditions, 1.5nM Mg2+) and synthesised by Invitrogen Life Technologies. Sequencing results were analyzed with FinchTV and CLC sequence viewer 6. Primers seqXF and seqXR contain multiplex identifier sequences (MID-X) detailed in Table G-4.

156 Chapter G - Materials and methods

MID-X SEQUENCE 1 ACGAGTGCGT 2 ACGCTCGACA 3 AGACGCACTC 4 AGCACTGTAG 5 ATCAGACACG 6 ATATCGCGAG 7 CGTGTCTCTA 8 CTCGCGTGTC 9 TCTCTATGCG 10 TGATACGTCT 11 CATAGTAGTG 12 CGAGAGATAC 13 ATACGACGTA 14 TCACGTACTA 15 CGTCTAGTAC 16 TCTACGTAGC 17 TGTACTACTC 18 ACGACTACAG 19 CGTAGACTAG 20 TACGAGTATG Table G-4 | Multiplex identifier sequences Standard multiplex sequences that are each differ by at least two nucleotides. Used to sequence multiple different populations simultaneously by 454 sequencing. The mixed sequences can then be separated afterwards (de-multiplexed) by grouping sequences by their MID. Each group then corresponds to a population before the samples were mixed for sequencing.

157

3 BUFFERS

LYSIS TEV STORAGE 50 mM Tris 50 mM Tris 300 mM NaCl 1 mM EDTA 5 mM Imidazole 10% sucrose pH 8 pH 8

WASH TEV ACTIVITY 50 mM Tris 50 mM Tris 300 mM NaCl 1 mM EDTA 50 mM Imidazole 10% sucrose pH 8 pH 8 with HCl 1 mM DTT (freshly added) ELUTION 50 mM Tris RF1 300 mM NaCl 100 mM RbCl 300 mM Imidazole 30 mM Potassium acetate pH 8 10 mM CaCl2 50 mM MnCl2 96W-LYSIS 10% glycerol pH 5.8 with acetic acid 50 mM NaH PO 2 4 Filter sterilised 300 mM NaCl pH 8 RF2 96W-WASH 10 mM MOPS 10 mM RbCl 83 mM NaH PO 2 4 75 mM CaCl 500 mM NaCl 2 10% glycerol 10 mM Imidazole pH 6.8 with acetic acid pH 7 Filter sterilised

96W-ELUTION PBS 50 mM NaH PO 2 4 137 mM NaCl 300 mM NaCl 2.7 mM KCl 150 mM Imidazole 10 mM NaH HPO pH 7 2 4 1.8 mM KH2PO4 pH 8 96W-MES 20 mM MES 100 mM NaCl 150 mM Imidazole pH 5

158 Chapter G - Materials and methods

4 MOLECULAR BIOLOGY

4.1 GENE AMPLIFICATION

PCR using primers containing the desired restriction sites was performed in an Eppendorf

Mastercyler (25 rounds of: 30 s at 94 °C, 30 s at (TM-5) °C, 1 min per kb of amplicon at 72 °C) using 0.4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, 100 ng total plasmid template in 20 µl.

Refer to Table G-3 for primer TM values.

4.2 USER MUTAGENESIS

Performed as published312. Briefly, the plasmid was amplified by PCR in an Eppendorf Mastercyler

(30 s at 94 °C, 30 s at 56 °C, 360 s at 68 °C) using Pfu CX polymerase according to the manufacturer’s instructions and uracil-containing primers. DpnI (1 µl) and USER (1 µl) were then added to the crude product (20 µl) and incubated 1hr at 37 °C. DNA was purified using the QIAquick PCR purification kit and eluted in H2O (30 µl). DNA ligase buffer and T4 DNA ligase (1 µl) was added and incubated 12 h at 16 °C.

4.3 TARGETED MUTAGENESIS

For targeted mutagenesis, a two-step, overlap-extension method was used. This requires each half of a gene to be amplified using outside primers and inside mutagenic primers that introduce the desired mutation. The two gene fragments are then used as primers for each other to generate the full length mutant gene. PCR was performed in an Eppendorf Mastercyler using flanking primers and primers containing the desired mutations tevF+(mutagenic primer) and tevR+(mutagenic primer) was performed (10 rounds of: 30 s at 94 °C, 30 s at (TM-5) °C, 60 s at 72 °C) using 0.4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, 100 ng amplicon template in 20 µl. These amplificates were mixed and overlap extension PCR with the flanking primers tevF+tevR completed the gene (10 rounds of: 30 s at 94 °C, 30 s at (TM-5) °C, 60 s at 72 °C) using 0.4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, 100 ng total plasmid template in 20 µl. The product was ligated into pMAA as described below.

4.4 RANDOM MUTAGENESIS

Error prone PCR in an Eppendorf Mastercyler using flanking primers tevF+tevR was performed (30 rounds of: 30 s at 94 °C, 30 s at 50 °C, 90 s at 72 °C, 30 rounds) using 0.4u Mutazyme II polymerase (Agilent), 1µM each primer, 0.8 mM dNTPs, 400ng plasmid template in 20 µl. The product was ligated into pMAA as described below.

4.5 SHUFFLING

The genes of interest were each amplified by PCR using external primers shufF+shufR (30 rounds of: 30 s at 94 °C, 30 s at (TM-5) °C, 90 s at 72 °C) using 4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, 1 µg amplicon template in 200 µl. The PCR products for the two genes to be shuffled were mixed such that product genes would have 2-3 amino ‘back-mutations’ from the earlier round shuffle parent on average, (total 4 µg) and digested in 50 µl with DNAase (0.005 units) for 10 minutes. The digested DNA was separated by gel electrophoresis and fragments between 50 and 70 bp were excised and purified. Assembly of the purified fragments was achieved by a 35 round PCR as above, except using a modified annealing step (65, 62, 59, 56, 53, 50, 47, 44, 41 °C each for 30s) using 1u Pfu Turbo (Agilent), 4 mM dNTPs, in 50 µl. 10% of the product was used as the template for a PCR reaction using the nested primers tevF+tevR to amplify the gene for cloning (30 rounds of: 30 s at 94 °C, 30 s at (TM-5) °C, 60 s at 72 °C) using 1u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, in 50 µl. The product was ligated into pMAA as described below.

159

4.6 LIGATION AND CLONING

Unless otherwise stated in results sections, the following protocol was used as a standard protocol for ligating genes into plasmids (library creation/evolution/shuffling etc). PCR product was purified by spin column (Zymo research), digested with XhoI, NcoI and DpnI (1u each, Fermentas) at 37°C for 1 hr, purified again, and ligated into the XhoI, NcoI digested pMAA vector at 16 °C for 16 h, 1u T4 ligase (NEB), 10 ng of each vector and insert in 10 µl. This was used to transform electro- competent E. cloni (lucigen) as per the manufacturer’s instructions. Restriction digests were performed according to the manufacturer’s instructions. DNA was analysed by agarose (0.5%) gel electrophoresis and stained with SybrSafe (1%) against a Hyperladder I (Bioline) standard. DNA was purified after each enzymatic reaction (QIAquick PCR purification kit) according to manufacturer’s instructions. DNA concentrations were determined by measuring the absorption at 260 nm with a Nanodrop spectrophotometer (Thermo Scientific). Ligations into JET1.2/Blunt vector were performed according to manufacturer’s instructions. All other ligations were performed using equimolar vector and insert DNA (10 ng) with T4 DNA ligase (1 μl) in ligase buffer (total 10 μl) and incubation for 16 h at 16 °C. When synthetic oligonucleotides were used as insert, 20 ng (in 10 μl

H2O) was prepared by heating the mix of forward and reverse oligonucleotides to 95 °C, cooling to 37 °C over 1 h and then phosphorylating by addition of PNK (3u) and T4 ligase buffer followed by incubation for 1 h at 37 °C.

4.7 PREPARATION OF CHEMO-COMPETENT CELLS

LB (500 ml, containing the appropriate antibiotics) was inoculated with cells (5 ml from overnight inoculum) and incubated at 37 °C (200 rpm) until OD600=0.4. The culture was divided into 50 ml aliquots and incubated on ice for 15 min then pelleted (25 min centrifugation at 3500 rpm). Pellets were resuspended in RF1 solution (8 ml per aliquot), incubated on ice again for 15 min and re- pelleted as before. Pellets were resuspended in RF2 solution (3 ml per aliquot), incubated on ice again for 15 min and 50 µl aliquots in PCR tubes were snap-frozen using liquid nitrogen.

4.8 PREPARATION OF ELECTRO-COMPETENT CELLS

LB (1 L, containing the appropriate antibiotics) was inoculated with cells (10 ml from overnight inoculum) and incubated at 37 °C (200 rpm) until OD600=0.4. The culture was divided into 50 ml aliquots and incubated on ice for 30 min then pelleted (25 min centrifugation at 3500 rpm). Pellets were resuspended in distilled H2O (25 ml per aliquot), incubated on ice again for 15 min and re- pelleted. The previous step is repeated, with resuspension in 12.5 ml, 2.5 ml then 100 µl of distilled

H2O +10% glycerol. 50 µl aliquots in PCR tubes were snap-frozen using liquid nitrogen.

4.9 TRANSFORMATION

Chemo-competent cells (50 µl) were incubated with DNA (1-100 ng) on ice for 15 min, heat-shocked for 30s at 42 °C then 60s on ice and incubated in LB (1 ml) for 1 h at 37 °C before plating on LB-agar containing the appropriate antibiotic.

Electro-competent cells (50 µl) were incubated with DNA (1-100 ng) on ice for 1 min in an electroporation cuvette. Electroporation of 1800v was followed immediately by the addition of Invitrogen ‘Recovery Medium’ (300 µl) and incubation for 1 h at 37 °C before plating on LB-agar containing the appropriate antibiotic.

160 Chapter G - Materials and methods

5 BIOCHEMISTRY AND ASSAYS

5.1 PROTEIN EXPRESSION AND PURIFICATION

LB (1 l, containing the appropriate antibiotics) was inoculated with cells (from an overnight inoculum) and incubated at 37 °C and 200 rpm for 4 hours. The cells were induced by addition of L-arabinose (0.2%) and incubated at 30 °C for a further 4 hours. The cells were pelleted and frozen for storage. The cell pellet was thawed and resuspended in 10 ml lysis. Cells were lysed on ice by sonication (2 s on, 10 s off, 5 min) and pelleted. The supernatant was loaded onto a HisTrap (GE Life Sciences, 5ml) column, washed in wash buffer (6 column volumes) and eluted with elution buffer

(500 µl fractions). Protein content was monitored by A280. Fractions containing the protein of interest were pooled and exchanged into TEV Storage buffer using a HiTrap desalting column (GE Life Sciences, 5 ml). 100 µl aliquots in PCR tubes were snap-frozen using liquid nitrogen.

5.2 96-WELL EXPRESSION AND PURIFICATION

In a 96-deepwell plate, LB (1 ml per well, containing the appropriate antibiotics) was inoculated with the cells to be tested and incubated for 12 hours at 30 °C. The cells were induced by addition of 50 µl L-arabinose (0.2% final concentration) and incubated at 25 °C for a further 16 hours. Cells were pelleted, then resuspended in 300 µl PBS + Bugbuster (Novagen) and incubated for 1 hour at 25 °C. Finally, cells were pelleted and the supernatant separated and 700 µl 96w-lysis buffer added to each.

A 96-well TALON resin plate (BD Biosciences) was used for parallel purification of enzymes. Liquid was drawn through using a vacuum manifold (BD Biosciences). The plate was prepared by washing with H2O (3 times 300 µl per well) then washing with 96w-lysis buffer (3 times 300 µl per well). The cell supernatants (1 ml each) were loaded into the wells and incubated on ice (10 min). The supernatant was then drawn through using the vacuum manifold and washed with 96w-lysis buffer (3 times 300 µl per well) then washed with 96w-wash buffer (6 times 300 µl per well). 300 µl 96w- elution buffer was added to each well and incubated on ice (10 min). The elutant was then drawn through using the vacuum manifold into individual tubes. Protein concentration of each sample was assayed by adding 30 µl elutant to 200 µl Bradford reagent (Sigma) and absorbance was measured by spectrophotometer (Spectromax M5 at 595 nm). The TALON plate was washed with

96w-MES (3 times 300 µl per well) then washed with H2O (3 times 300 µl per well).

5.3 FLUORESCENCE 96-WELL PLATE SCREENING

In a 96-deepwell plate, LB (1 ml per well, containing the appropriate antibiotics) was inoculated with the cells to be tested and incubated for 12 hours at 30 °C. The cells were induced by addition of 50 µl L-arabinose (0.2% final concentration) and incubated at 25 °C for a further 6 hours. Cells were pelleted, then resuspended in 300 µl PBS + Bugbuster (Novagen) and incubated for 1 hour at 25 °C. Finally, cells were pelleted and the supernatant was assayed in a Spectramax M5 (Molecular Devices, excitation 414 nm, observation 475 nm and 525 nm).

5.4 FLUORESCENCE IN VITRO KINETICS

CFP-YFP substrate was thawed overnight at 4 °C. Unless otherwise stated, 1 µM substrate was added to 1, 2, 4 and 8 µM purified enzyme in 200 µl TEV Storage buffer + 1 mM DTT in a 96-well plate. Solutions were assayed by spectrophotometer (Spectromax M5 - excitation 414 nm, observation 475 nm and 525 nm) reading every minute for 12 hours at 25 °C for 12 hours. Concentration of product was calculated using Equation G-1. Kinetic traces were fit first to simple Micaelis -Menten kinetics, Equation G-2.

161

푅0 − 푅푡 [푃] = [S] 푅0 − 푅∞ Equation G-1 | Product concentration [P] = Concentration of product [S] = Concentration of substrate

R0 = FRET ratio of substrate (at t=0) R∞ = FRET ratio of product (at t=∞)

Rt = FRET ratio at time t

[푃] = [푃]푚푎푥 − 푒−[퐸]푘푐푎푡/퐾푀푡 Equation G-2 | Michaelis Menten interpolation (also C-1)

The serine variants were found to exhibit biphasic kinetics, so were better fit by the two- exponential modification of Equation G-2, named Equation G-3.

−[퐸]푘표푏푠1푡 −[퐸]푘표푏푠2푡 [푃] = [푃]푚푎푥 − (퐴1푒 + 퐴2푒 ) Equation G-3 | Biphasic modification of Equation G-2 (also C-2) [P] = Concentration of product [P]max = Final concentration of product (should be = starting substrate) A1 = Jump amplitude (ie. how much turnover is performed in first kinetic step) A2 = Second jump amplitude (should = [P]max-A1) [E] = Concentration of enzyme kcat/KM = Rate of overall reaction kobs1 = Rate of first kinetic step kobs2 = Rate of second kinetic step t = Time

5.5 FLUORESCENCE ACTIVATED CELL SORTING (FACS)

Transformed cells were scraped off the agar plate in LB (2 ml), 4 µl of which was used to inoculate fresh culture in a 96-depwell plate, LB (1 ml per well, containing the appropriate antibiotics) After 4 hours at 37 °C, 4 µl of this culture was used to inoculate 1 ml fresh LB (+antibiotic) and incubated for 4 hours at 30 °C. The cells were induced by addition of L-arabinose (0.2%) and incubated at 20 °C overnight (16 hours). For FACS a MoFlo MLS (BeckmanCoulter) was used. Samples were diluted 50:50 with PBS and flowed at 1000-3000 cells per second (70 µm nozzle, 60 psi (4 bar), 514 nm ssc trigger). Excitation at 406 nm (100 mW Krypton laser) was monitored using filters at 470±10 nm (FL4) and 546±5 mn (FL5). Excitation at 514 nm (200 mw Argon laser) was monitored by filters at 546±5 nm (FL1). Cells expressing pMAA + pCtY and pMAA-TEV + pCtY were used as negative and positive controls, respectively. Sorting was performed by gating firstly above FL1 to ensure expression and secondly on the FL5/FL4 to select the 1% of cells with the lowest FRET ratio. Cells were sorted into LB (containing 100 µg/ml ampicillin and 37 µg/ml chloramphenicol) and incubated overnight at 37 °C before being harvested.

5.6 GEL BAND DENSITOMETRY

In a 96-deepwell plate, LB (1 ml per well, containing the appropriate antibiotics) was inoculated with the cells to be tested and incubated for 12 hours at 30 °C. The cells were induced by addition of L-arabinose (0.2%) and incubated at 25 °C for a further 4 hours. Cells pelleted by centrifugation and then resuspended in PBS (250 µl) + Bugbuster (Novagen) and incubated for 1 hour at 25 °C. Cells were pelleted and the supernatant (10 µl) was run on a polyacrylamide gel (1.5%). Gels were stained with InstantBlue (Expedion) according the manufacturer’s instructions. Stained gels were photographed on a lightbox and the image analysed with ImageJ.

162 Chapter G - Materials and methods

5.7 DIFFERENTIAL SCANNING FLUORIMETRY

Sypro Orange (Sigma) was added to purified protein (20 µl of 100 mg/ml). Samples were analysed in a real time PCR machine (Corbett Research 2plex HRM) by increasing temperature from 25 °C to 95 °C in 0.5 °C steps over 70 minutes (i.e. increasing 1 °C/min). The fluorescence curves were fitted to Equation G-4 to calculate the temperature at which half of the protein is melted (TM).

푌푚푎푥 − 푌푚푖푛 푌 = 푌푚푖푛 + 푇 −ℎ 1 + ( ) 푇푀 Equation G-4 | Thermal denaturation sigmoidal curve Y = Fluorescence Ymin = Minimum fluorescence Ymax = Maximum fluorescence T = Temperature TM = Melting temperature (infection point) h = Hill coefficient (unfolding )

163

6 BIOINFORMATICS AND DATA ANALYSIS

6.1 PA CLAN ALIGNMENT AND PHYLOGENY

An alignment based on structural homology to TEV protease (PDB 1lvm) was generated using DALIprotein251. From this primary homology assessment, sequences lacking known structures were added based on amino acid sequence similarity to several proteases from the phylogeny (Table G-5) using BLASTp search of non-redundant sequences. Columns with >90% gaps were removed (Gapstreeze313 online). The amino acid sequence alignment was used to generate a maximum likelihood phylogeny in MEGA5314.

BLAST QUERY PDB NO. SEQUENCES ADDED Tobacco etch 1lvm 75 Astrovirus 2w5e 14 Rhinovirus 2A* 2hrvB 14 Sesbania mosaic 1zyo 12 Rhinovirus 3C* 1cqqA 10 Alkaline Protease 3cp7 8 Norovirus 2fyq 6 Achromobacter 1arc 5 Hepatitus C 2obq 4 Human BB fragment 1rtk 3 Equine arteritis 1mbm 3 Hepatitis A 1hav 2 Table G-5 | Sequences added to DALI alignment by BLASTp The Protein sequences of each query were used to search the non-redundant sequence data base by BLASTp. * Rhinovirus has two paralogous copies of its protease. Protease 2A is smaller than 3C and has a different substrate specificity.

6.2 HIGH THROUGHPUT SEQUENCING AND DATA ANALYSIS

The plasmids to be sequenced by 454 were amplified by PCR using tevF+tevR (10 rounds of: 30 s at

94 °C, 30 s at (TM-5)°C, 60 s at 72 °C) using 0.4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs, 100 ng template in 20 µl. This product was purified to remove primer and used as the template for a second PCR to add the 454 recognition sequences seqXF+seqXR, each population using the appropriate multiplex identifier sequences (see Table G-4) (10 rounds of: 30 s at 94 °C, 30 s at (TM- 5) °C, 60 s at 72 °C) using 0.4u Pfu Turbo (Agilent), 1 µM each primer, 4 mM dNTPs in 20 µl. The products were mixed in equimolar ratios and sequenced using a GS Junior (Roche) by the Cambridge University Biochemistry Department Sequencing Facility. Sequencing generated forward reads for the first 400nt and reverse reads for the last 400nt of the 800nt total amplicon. Reads were aligned to the TEVCys DNA sequence using AVA aligner (Roche). Columns with >90% gaps were removed (Gapstreeze online) and remaining gaps replaced with ‘n’s in order to prevent frameshifts upon translation to protein. The alignments were used to generate a quantitative consensus (Jalview) which was exported to Excel (Microsoft) for analysis. Enrichment scores were calculated for each codon in the gene as the proportion of non-wt residue at each position after selection divided by the proportion before selection. To reflect the sensitivity of sequencing and to avoid dividing by zero, the minimum frequency of non-wt residues is set as 0.001. Protein regions were identified using PyMol315 for nucleophile and triad shells, IUPred306 for disordered regions, GetArea316 for solvent accessibility, Jalview for conserved regions, and FoldX317 for regions sensitive to alanine mutation in silico. Statistical significance was calculated by heteroscedastic t-test (since

164 Chapter G - Materials and methods equal variance could not be guaranteed) when comparing the average effect of a subset of mutations in a protein regions the whole protein. Similarly, χ2 was used when comparing the number of beneficial mutations in a protein region to the whole protein. A 3-category χ2 test was used for analysing the distribution of the 15 mutations accumulated in the DE lineage.

6.3 FOLDX

PDB file 1lvm was edited to remove the second chain and subjected to iterative rounds of the FoldX "repair" routine until energy levels reached a stable minimum. This PDB file was then used for the determination of total energies of all mutants in the serine lineage (and corresponding revertants) by FoldX. Two structures were generated for each mutant, and average values used for analysis. As the sequence of TEVCys used in this work differs from 1LVM, the ΔG of TEVCys was calculated and used as reference (set to zero). A single PDB file for each mutant was then used in the alanine scanning routine to calculate the contribution of each position to global stability.

6.4 FACS MUTATION TOLERANCE ANALYSIS

The FL5/FL4 ratio of cells above the FL1 threshold was recorded (Summit 4.3) and exported to Excel. The histogram was fitted to Equation G-5 to calculate the relative populations of active and inactive variants.

1 푥−휇 2 1 푥−휇 2 퐴1 − ( 1) 퐴2 − ( 2) 푦 = 푒 2 휎1 + 푒 2 휎2 휎1√2휋 휎2√2휋 Equation G-5 | Bimodal normal distribution y = frequency x = FRET ratio [A1] = Size of active population [A2] = Size of inactive population σ1 = standard deviation of active population σ2 = standard deviation of inactive population µ1 = mean FRET ratio of active population µ2 = mean FRET ratio of inactive population

When the sequence before mutagenesis is a single gene and the library afterwards has been sequenced, the proportion of variants that retained wt-like activity was fit to Equation D-1 to determine the tolerance to random amino acid substitution. When a library was generated from a mixed population it is not possible to know which variants acquired which number of mutations and so τsub cannot be calculated. In these cases tolerance is simply given at the mutation rate (τ0.5 for µ=0.5% and τ1.5 for µ=1.5%).

푛 0 1 2 푛 푆 = ∑ 푓푛 (휏푠푢푏) = 푓0(휏푠푢푏) + 푓1(휏푠푢푏) + 푓2(휏푠푢푏) … +푓푛(휏푠푢푏) 푛=0 Equation G-6 | The probability of enzyme activity retention after random mutation (also D-1) S = the fraction of enzyme variants displaying wt-like activity n = number of mutations

fn = the fraction of enzyme variants with n mutations τsub = the probability that a random mutation will be neutral Nb. Frameshift mutations will cause underestimation τsub, however, these were rare enough that they made no difference to these estimates

6.5 PROCRASTINATION

Avoiding work was mostly achieved through the news sites: BBC, Ars Technica and Nature news. On slow news days Facebook was occasionally required. For a full list see browser history.

165

166

Chapter H - References

1. Bartlett, G. J., Porter, C. T., Borkakoti, N. & 11. Li, M., Chen, C., Davies, D. R. & Chiu, T. K. Thornton, J. M. Analysis of Catalytic Induced-fit mechanism for prolyl Residues in Enzyme Active Sites. J. Mol. Biol. . J. Biol. Chem. 285, 21487–95 324, 105–121 (2002). (2010).

2. Benkovic, S. J., Hammes, G. G. & Hammes- 12. Zavodszky, M. I. & Kuhn, L. A. Side-chain Schiffer, S. Free-energy landscape of enzyme flexibility in protein-ligand binding: the catalysis. Biochemistry 47, 3317–21 (2008). minimal rotation hypothesis. Protein Sci. 14, 1104–14 (2005). 3. Babtie, A., Tokuriki, N. & Hollfelder, F. What makes an enzyme promiscuous? Curr. 13. Gunasekaran, K., Ma, B. & Nussinov, R. Opin. Chem. Biol. 14, 200–7 (2010). Triggering Loops and Enzyme Function: Identification of Loops that Trigger and 4. Bar-Even, A., Noor, E. & Savir, Y. The Modulate Movements. J. Mol. Biol. 332, 143– moderately efficient enzyme: evolutionary 159 (2003). and physicochemical trends shaping enzyme parameters. Biochemistry 50, 4402–10 (2011). 14. Nicholson, L. K. et al. Flexibility and function in HIV-1 protease. Nat. Struct. Biol. 2, 274– 5. Alden, R., Wright, C. S. & Kraut, J. A 280 (1995). Hydrogen-Bond Network at the Active Site of Subtilisin. Philos. Trans. R. Soc. B Biol. Sci. 15. Gatti-Lafranconi, P. et al. Evolution of 257, 119–124 (1970). Stability in a Cold-Active Enzyme Elicits Specificity Relaxation and Highlights 6. Shaltiel, S., Cox, S. & Taylor, S. S. Substrate-Related Effects on Temperature Conserved water molecules contribute to the Adaptation. J. Mol. Biol. 395, 155–166 (2010). extensive network of interactions at the active site of protein kinase A. Proc. Natl. 16. Eisenmesser, E. E. Z., Bosco, D., Akke, M. & Acad. Sci. U. S. A. 95, 484–91 (1998). Kern, D. Enzyme dynamics during catalysis. Science 295, 1520–3 (2002). 7. Miller, B. G., Snider, M. J., Wolfenden, R. & Short, S. A. Dissecting a charged network at 17. Villà, J. & Warshel, A. Energetics and the active site of orotidine-5’-phosphate Dynamics of Enzymatic Reactions. J. Phys. decarboxylase. J. Biol. Chem. 276, 15174–6 Chem. B 105, 7887–7907 (2001). (2001). 18. Eisenmesser, E. Z. et al. Intrinsic dynamics of 8. Taverna, D. M. & Goldstein, R. A. Why are an enzyme underlies catalysis. Nature 438, proteins marginally stable? Proteins 46, 105– 117–21 (2005). 109 (2002). 19. Benkovic, S. J. S. & Hammes-Schiffer, S. A 9. Jaenicke, R. Protein stability and molecular perspective on enzyme catalysis. Science 301, adaptation to extreme conditions. FEBS 202, 1196–202 (2003). 715–28 (1991). 20. Bender, M. L. Mechanisms of Catalysis of 10. Karplus, M. & McCammon, J. A. Molecular Nucleophilic Reactions of Carboxylic Acid dynamics simulations of biomolecules. Nat. Derivatives. Chem. Rev. 60, 53–113 (1960). Struct. Biol. 9, 646–52 (2002).

167

21. Buller, A. R. & Townsend, C. A. Intrinsic 33. Gerlt, J. & Babbitt, P. Mechanistically diverse evolutionary constraints on protease enzyme superfamilies: the importance of structure, enzyme acylation, and the identity chemistry in the evolution of catalysis. Curr. of the catalytic triad. Proc. Natl. Acad. Sci. 110, Opin. Chem. Biol. 2, 607–12 (1998). E653–61 (2013). 34. Babbitt, P. C. Understanding Enzyme 22. Dodson, G. & Wlodawer, A. Catalytic triads Superfamilies. Chemistry as the fundamental and their relatives. Trends Biochem. Sci. 23, determinant in the evolution of new catalytic 347–52 (1998). activities. J. Biol. Chem. 272, 30591–30594 (1997). 23. Garavito, R. M., Rossmann, M. G., Argos, P. & Eventoff, W. Convergence of active center 35. Nei, M. Selectionism and neutralism in geometries. Biochemistry 16, 5065–71 (1977). molecular evolution. Mol. Biol. Evol. 22, 2318– 42 (2005). 24. Peelman, F. et al. A proposed architecture for lecithin cholesterol acyl (LCAT): 36. Fay, J. C. Weighing the evidence for identification of the catalytic triad and adaptation at the molecular level. Trends molecular modeling. Protein Sci. 7, 587–99 Genet. 27, 343–9 (2011). (1998). 37. Kimura, M. Preponderance of synonymous 25. Mei, B. & Zalkin, H. A cysteine-histidine- changes as evidence for the neutral theory of aspartate catalytic triad is involved in molecular evolution. Nature 267, 275–276 glutamine amide transfer function in purF- (1977). type glutamine amidotransferases. J. Biol. 38. Ohta, T. The nearly neutral theory of Chem. 264, 16613–16619 (1989). molecular evolution. Annu. Rev. Ecol. Syst. 23, 26. Khersonsky, O., Roodveldt, C. & Tawfik, D. 263–286 (1992). S. Enzyme promiscuity: evolutionary and 39. O’Brien, P. J. & Herschlag, D. Catalytic mechanistic aspects. Curr. Opin. Chem. Biol. promiscuity and the evolution of new 10, 498–508 (2006). enzymatic activities. Chem. Biol. 6, R91–R105 27. Babtie, A. C., Bandyopadhyay, S., Olguin, L. (1999). F. & Hollfelder, F. Efficient catalytic 40. Patrick, W. M. & Matsumura, I. A study in promiscuity for chemically distinct reactions. molecular contingency: glutamine Angew. Chemie 48, 3692–4 (2009). phosphoribosylpyrophosphate 28. Ekroos, M. & Sjögren, T. Structural basis for amidotransferase is a promiscuous and ligand promiscuity in cytochrome P450 3A4. evolvable phosphoribosylanthranilate Proc. Natl. Acad. Sci. 103, 13682–7 (2006). . J. Mol. Biol. 377, 323–36 (2008).

29. Yoshikuni, Y., Ferrin, T. E. & Keasling, J. D. 41. Van Loo, B. et al. An efficient, multiply Designed divergent evolution of enzyme promiscuous hydrolase in the alkaline function. Nature 440, 1078–82 (2006). phosphatase superfamily. Proc. Natl. Acad. Sci. 107, 2740–5 (2010). 30. Nobeli, I., Favia, A. D. & Thornton, J. M. Protein promiscuity and its implications for 42. Olguin, L. F., Askew, S. E., O’Donoghue, A. biotechnology. Nat. Biotechnol. 27, 157–67 C. & Hollfelder, F. Efficient catalytic (2009). promiscuity in an enzyme superfamily: an arylsulfatase shows a rate acceleration of 31. Elias, M. & Tawfik, D. S. Divergence and 10(13) for phosphate monoester hydrolysis. J. Convergence in Enzyme Evolution: The Am. Chem. Soc. 130, 16547–55 (2008). Parallel Evolution of Paraoxonases from Quorum Quenching Lactonases. J. Biol. 43. Mohamed, M. F. & Hollfelder, F. Efficient, Chem. 287, 11–20 (2011). crosswise catalytic promiscuity among enzymes that catalyze phosphoryl transfer. 32. Jackson, C. J. et al. Conformational sampling, Biochim. Biophys. Acta 1834, 417–24 (2013). catalysis, and evolution of the bacterial phosphotriesterase. Proc. Natl. Acad. Sci. 106, 21631–6 (2009).

168 Chapter H - References

44. Roodveldt, C. & Tawfik, D. S. Directed 57. Gould, S. J. & Eldredge, N. Punctuated evolution of phosphotriesterase from equilibrium comes of age. Nature 366, 223–7 Pseudomonas diminuta for heterologous (1993). expression in Escherichia coli results in 58. Halabi, N., Rivoire, O., Leibler, S. & stabilization of the metal-free state. Protein Ranganathan, R. Protein sectors: Eng. Des. Sel. 18, 51–8 (2005). evolutionary units of three-dimensional 45. Deng, C., Cheng, C.-H. C., Ye, H., He, X. & structure. Cell 138, 774–86 (2009). Chen, L. Evolution of an antifreeze protein 59. McLaughlin, R. N., Poelwijk, F. J., Raman, A., by neofunctionalization under escape from Gosal, W. S. & Ranganathan, R. The spatial adaptive conflict. Proc. Natl. Acad. Sci. 107, architecture of protein function and 21593–8 (2010). adaptation. Nature 491, 138–42 (2012). 46. Jensen, R. A. Enzyme recruitment in 60. Keefe, A. D. & Szostak, J. W. Functional evolution of new function. Annu. Rev. proteins from a random-sequence library. Microbiol. 30, 409–25 (1976). Nature 410, 715–8 (2001). 47. Ohno, S. Evolution by Gene Duplication. 61. Tokuriki, N., Stricher, F., Serrano, L. & (publisher: Springer-Verlag, Berlin, 1970). Tawfik, D. S. How protein stability and new 48. Li, W. H., Gojobori, T. & Nei, M. functions trade off. PLoS Comput. Biol. 4, Pseudogenes as a paradigm of neutral e1000002 (2008). evolution. Nature 292, 237–9 (1981). 62. Romero, P. A. & Arnold, F. H. Exploring 49. Lynch, M. & Force, a. The probability of protein fitness landscapes by directed duplicate gene preservation by evolution. Nat. Rev. Mol. Cell Biol. 10, 866– subfunctionalization. Genetics 154, 459–73 876 (2009). (2000). 63. Martin, C. H. & Wainwright, P. C. Multiple 50. Bershtein, S. & Tawfik, D. S. Ohno’s Model fitness peaks on the adaptive landscape drive Revisited: Measuring the Frequency of adaptive radiation in the wild. Science. 339, Potentially Adaptive Mutations under 208–11 (2013). Various Mutational Drifts. Mol. Biol. Evol. 25, 64. Caldwell, S. R., Newcomb, J. R., Schlecht, K. 2311–2318 (2008). A. & Raushel, F. M. Limits of diffusion in 51. Copley, S. D. Toward a systems biology the hydrolysis of substrates by the perspective on enzyme evolution. J. Biol. phosphotriesterase from Pseudomonas Chem. 287, 3–10 (2012). diminuta. Biochemistry 30, 7438–7444 (1991).

52. Näsvall, J., Sun, L., Roth, J. R. & Andersson, 65. Knowles, J. & Albery, W. Perfection in D. I. Real-time evolution of new genes by enzyme catalysis: the energetics of innovation, amplification, and divergence. triosephosphate isomerase. Acc. Chem. Res. Science. 338, 384–7 (2012). 23, (1977).

53. Tokuriki, N. et al. Diminishing returns and 66. Ortlund, E. A., Bridgham, J. T., Redinbo, M. tradeoffs constrain the laboratory R. & Thornton, J. W. Crystal structure of an optimization of an enzyme. Nat. Commun. 3, ancient protein: evolution by conformational 1257 (2012). epistasis. Science 317, 1544–8 (2007).

54. Aharoni, A. et al. The “evolvability” of 67. Harms, M. J. & Thornton, J. W. Evolutionary promiscuous protein functions. Nat. Genet. biochemistry: revealing the historical and 37, 73–6 (2005). physical causes of protein properties. Nat. Rev. Genet. 14, 559–71 (2013). 55. Carvunis, A.-R. et al. Proto-genes and de novo gene birth. Nature 487, 370–4 (2012). 68. Mariño-Ramírez, L., Jordan, I. K. & Landsman, D. Multiple independent 56. Bak, P. & Sneppen, K. Punctuated evolutionary solutions to core histone gene equilibrium and criticality in a simple model regulation. Genome Biol. 7, R122 (2006). of evolution. Phys. Rev. Lett. 71, 4083–4086 (1993).

169

69. Tracewell, C. a & Arnold, F. H. Directed 82. Woodcock, G. & Higgs, P. G. Population enzyme evolution: climbing fitness peaks one evolution on a multiplicative single-peak amino acid at a time. Curr. Opin. Chem. Biol. fitness landscape. J. Theor. Biol. 179, 61–73 13, 3–9 (2009). (1996).

70. Koonin, E., Wolf, Y. & Karev, G. The 83. Höss, M., Jaruga, P., Zastawny, T. H., structure of the protein universe and genome Dizdaroglu, M. & Pääbo, S. DNA damage evolution. Nature 420, (2002). and DNA sequence retrieval from ancient tissues. Nucl. Acids Rec. 24, 1304–7 (1996). 71. Ranea, J. a G., Sillero, A., Thornton, J. M. & Orengo, C. a. Protein superfamily evolution 84. Pääbo, S. et al. Genetic analyses from ancient and the last universal common ancestor DNA. Annu. Rev. Genet. 38, 645–79 (2004). (LUCA). J. Mol. Evol. 63, 513–25 (2006). 85. Overballe-Petersen, S., Orlando, L. & 72. Dellus-Gur, E., Toth-Petroczy, A., Elias, M. Willerslev, E. Next-generation sequencing & Tawfik, D. S. What makes a protein fold offers new insights into DNA degradation. amenable to functional innovation? Fold Trends Biotechnol. 30, 364–8 (2012). polarity and stability trade-offs. J. Mol. Biol. 86. Marden, J. H., Wolf, M. R. & Weber, K. E. 425, 2609–21 (2013). Aerial performance of Drosophila melanogaster 73. McCandlish, D. M. Visualizing fitness from populations selected for upwind flight landscapes. Evolution (N. Y). 65, 1544–58 ability. J. Exp. Biol. 200, 2747–55 (1997). (2011). 87. Ratcliff, W. C., Denison, R. F., Borrello, M. 74. Pigliucci, M. Adaptive Landscapes, & Travisano, M. Experimental evolution of Phenotypic Space, and the Power of multicellularity. Proc. Natl. Acad. Sci. 109, Metaphors. Q. Rev. Biol. 83, 283–287 (2008). 1595–600 (2012).

75. Gavrilets, S. Evolution and speciation on 88. Barrick, J. E. et al. Genome evolution and holey adaptive landscapes. Trends Ecol. Evol. adaptation in a long-term experiment with 12, 307–12 (1997). Escherichia coli. Nature 461, 1243–7 (2009).

76. Kaplan, J. The end of the adaptive landscape 89. Heineman, R. H., Molineux, I. J. & Bull, J. J. metaphor? Biol. Philos. 23, 625–638 (2008). Evolutionary Robustness of an Optimal Phenotype: Re-evolution of Lysis in a 77. Mustonen, V. & Lässig, M. From fitness Bacteriophage Deleted for Its Lysin Gene. J. landscapes to seascapes: non-equilibrium Mol. Evol. 61, 181–191 (2005). dynamics of selection and adaptation. Trends Genet. 25, 111–9 (2009). 90. Moses, A. & Davidson, A. In vitro evolution goes deep. Proc. Natl. Acad. Sci. 108, 8071–2 78. Jacquier, H. et al. Capturing the mutational (2011). landscape of the β-lactamase TEM-1. Proc. Natl. Acad. Sci. U. S. A. 110, 13067–72 (2013). 91. Bloom, J. D. & Arnold, F. H. In the light of directed evolution: Pathways of adaptive 79. Hietpas, R. T., Jensen, J. D. & Bolon, D. N. A. protein evolution. Proc. Natl. Acad. Sci. 106, Experimental illumination of a fitness 9995–10000 (2009). landscape. Proc. Natl. Acad. Sci. 108, 7896– 901 (2011). 92. Goldsmith, M. & Tawfik, D. S. Directed enzyme evolution: beyond the low-hanging 80. Britain, G. et al. Evolution in a flat fitness fruit. Curr. Opin. Struct. Biol. 22, 406–12 landscape. Bull. Math. Biol. 53, 355–382 (2012). (1991). 93. Salehi-Ashtiani, K. & Szostak, J. W. In vitro 81. Flyvbjerg, H. & Lautrup, B. Evolution in a evolution suggests multiple origins for the rugged fitness landscape. Phys. Rev. A 46, hammerhead ribozyme. Nature 414, 82–84 6714–6723 (1992). (2001).

170 Chapter H - References

94. Mills, D. R., Peterson, R. L. & Spiegelman, S. 106. Lipovsek, D. & Plückthun, A. In-vitro An extracellular Darwinian experiment with protein evolution by ribosome display and a self-duplicating nucleic acid molecule. Proc. mRNA display. J. Immunol. Methods 290, 51– Natl. Acad. Sci. U. S. A. 58, 217–24 (1967). 67 (2004).

95. Sumper, M. & Luce, R. Evidence for de novo 107. Dryden, D. T. F., Thomson, A. R. & White, J. production of self-replicating and H. How much of protein sequence space has environmentally adapted RNA structures by been explored by life on Earth? J. R. Soc. 5, bacteriophage Qβ replicase. Proc. Natl. Acad. 953–6 (2008). Sci. U. S. A. 72, 162–6 (1975). 108. Kuchner, O. & Arnold, F. H. Directed 96. Zhao, H. & Arnold, F. H. Directed evolution evolution of enzyme catalysts. Trends converts subtilisin E into a functional Biotechnol. 15, 523–30 (1997). equivalent of thermitase. Protein Eng. 12, 47– 109. Sen, S., Venkata Dasu, V. & Mandal, B. 53 (1999). Developments in Directed Evolution for 97. Bull, J. J., Badgett, M. R. & Wichman, H. A. Improving Enzyme Functions. Appl. Biochem. Big-benefit mutations in a bacteriophage Biotechnol. 143, 212–223 (2007). inhibited with heat. Mol. Biol. Evol. 17, 942– 110. Jones, D. D. Triplet nucleotide removal at 50 (2000). random positions in a target gene: the 98. Darwin, C. On the Origin of Species. tolerance of TEM-1 β-lactamase to an amino (publisher: John Murray, 1859). acid deletion. Nucl. Acids Rec. 33, e80 (2005).

99. Turner, N. J. Directed evolution drives the 111. Crameri, A., Raillard, S. A., Bermudez, E. & next generation of biocatalysts. Nat. Chem. Stemmer, W. P. DNA shuffling of a family of Biol. 5, 567–573 (2009). genes from diverse species accelerates directed evolution. Nature 391, 288–91 100. Hawkins, R. E., Russell, S. J. & Winter, G. (1998). Selection of phage antibodies by binding affinity. Mimicking affinity maturation. J. 112. Reetz, M. T. & Carballeira, J. D. Iterative Mol. Biol. 226, 889–96 (1992). saturation mutagenesis (ISM) for rapid directed evolution of functional enzymes. 101. Giger, L. et al. Evolution of a designed retro- Nat. Protoc. 2, 891–903 (2007). aldolase leads to complete active site remodeling. Nat. Chem. Biol. 9, 494–8 (2013). 113. Willats, W. G. T. Phage display: practicalities and prospects. Plant Mol. Biol. 50, 837–54 102. Shaikh, F. A. & Withers, S. G. Teaching old (2002). enzymes new tricks: engineering and evolution of glycosidases and glycosyl 114. Verhoeven, K. D., Altstadt, O. C. & Savinov, transferases for improved glycoside synthesis. S. N. Intracellular detection and evolution of Biochem. Cell Biol. 86, 169–77 (2008). site-specific proteases using a genetic selection system. Appl. Biochem. Biotechnol. 103. MacBeath, G., Kast, P. & Hilvert, D. 166, 1340–54 (2012). Redesigning Enzyme Topology by Directed Evolution. Science. 279, 1958–1961 (1998). 115. Reetz, M. T., Höbenreich, H., Soni, P. & Fernández, L. A genetic selection system for 104. Cheriyan, M. et al. Directed evolution of a evolving enantioselectivity of enzymes. pyruvate aldolase to recognize a long chain Chem. Commun. (Camb). 5502–4 (2008). acyl substrate. Bioorg. Med. Chem. 19, 6447–53 doi:10.1039/b814538e (2011). 116. Leemhuis, H., Stein, V., Griffiths, A. D. & 105. Toscano, M. D., Woycechowsky, K. J. & Hollfelder, F. New genotype-phenotype Hilvert, D. Minimalist active-site redesign: linkages for directed evolution of functional teaching old enzymes new tricks. Angew. proteins. Curr. Opin. Struct. Biol. 15, 472–478 Chemie 46, 3212–36 (2007). (2005).

171

117. Nguyen, A. W. & Daugherty, P. S. 129. Gu, Z. et al. Role of duplicate genes in genetic Evolutionary optimization of fluorescent robustness against null mutations. Nature proteins for intracellular FRET. Nat. 421, 63–6 (2003). Biotechnol. 23, 355–360 (2005). 130. Kauffman, K. J., Prakash, P. & Edwards, J. S. 118. Schaerli, Y. & Hollfelder, F. The potential of Advances in flux balance analysis. Curr. microfluidic water-in-oil droplets in Opin. Biotechnol. 14, 491–496 (2003). experimental biology. Mol. Biosyst. 5, 1392– 131. Nam, H. et al. Network context and selection 1404 (2009). in the evolution to enzyme specificity. 119. Araya, C. L. & Fowler, D. M. Deep Science. 337, 1101–4 (2012). mutational scanning: assessing protein 132. Krakauer, D. C. & Plotkin, J. B. Redundancy, function on a massive scale. Trends Biotechnol. antiredundancy, and the robustness of 29, 435–42 (2011). . Proc. Natl. Acad. Sci. 99, 1405–1409 120. Knight, C. G. et al. Array-based evolution of (2002). DNA aptamers allows modelling of an 133. Taverna, D. & Goldstein, R. Why are explicit sequence-fitness landscape. Nucl. proteins so robust to site mutations? J. Mol. Acids Rec. 37, e6 (2009). Biol. 315, 479–84 (2002). 121. Fowler, D. M. et al. High-resolution mapping 134. Tokuriki, N. & Tawfik, D. S. Stability effects of protein sequence-function relationships. of mutations and protein evolvability. Curr. Nat. Methods 7, 741–6 (2010). Opin. Struct. Biol. 19, 596–604 (2009). 122. Robins, W. P., Faruque, S. M. & Mekalanos, J. 135. Meyerguz, L., Kleinberg, J. & Elber, R. The J. Coupling mutagenesis and parallel deep network of sequence flow between protein sequencing to probe essential residues in a structures. Proc. Natl. Acad. Sci. 104, 11627– genome or gene. Proc. Natl. Acad. Sci. 110, 32 (2007). E848–57 (2013). 136. Karplus, M. Behind the folding funnel 123. Starita, L. M. et al. Activity-enhancing diagram. Nat. Chem. Biol. 7, 401–4 (2011). mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc. Natl. 137. Tokuriki, N., Stricher, F., Schymkowitz, J., Acad. Sci. 110, E1263–72 (2013). Serrano, L. & Tawfik, D. S. The stability effects of protein mutations appear to be 124. Hinkley, T. et al. A systems analysis of universally distributed. J. Mol. Biol. 369, mutational effects in HIV-1 protease and 1318–32 (2007). reverse transcriptase. Nat. Genet. 43, 487–9 (2011). 138. Shakhnovich, B. E., Deeds, E., Delisi, C. & Shakhnovich, E. Protein structure and 125. Whitehead, T. et al. Optimization of affinity, evolutionary history determine sequence specificity and function of designed space topology. Genome Res. 15, 385–92 influenza inhibitors using deep sequencing. (2005). Nat. Biotechnol. 30, 543–8 (2012). 139. Monsellier, E. & Chiti, F. Prevention of 126. Deng, Z. et al. Deep sequencing of systematic amyloid-like aggregation as a driving force of combinatorial libraries reveals β-lactamase protein evolution. EMBO Rep. 8, 737–42 sequence constraints at high resolution. J. (2007). Mol. Biol. 424, 150–67 (2012). 140. Fink, A. L. Protein aggregation: folding 127. Davies, E. K. High Frequency of Cryptic aggregates, inclusion bodies and amyloid. Deleterious Mutations in Caenorhabditis Fold. Des. 3, R9–23 (1998). elegans. Science. 285, 1748–1751 (1999). 141. Richardson, J. S. & Richardson, D. C. 128. Baba, T. et al. Construction of Escherichia coli Natural β-sheet proteins use negative design K-12 in-frame, single-gene knockout to avoid edge-to-edge aggregation. Proc. Natl. mutants: the Keio collection. Mol. Syst. Biol. 2, Acad. Sci. U. S. A. 99, 2754–9 (2002). 2006.0008 (2006).

172 Chapter H - References

142. Müller, M. M. et al. Directed evolution of a 155. Burch, C. L., Guyader, S., Samarov, D. & model primordial enzyme provides insights Shen, H. Experimental estimate of the into the development of the genetic code. abundance and effects of nearly neutral PLoS Genet. 9, e1003187 (2013). mutations in the RNA virus phi 6. Genetics 176, 467–76 (2007). 143. Firnberg, E. & Ostermeier, M. The genetic code constrains yet facilitates Darwinian 156. Orr, H. A. The distribution of fitness effects evolution. Nucl. Acids Rec. 41, 7420–8 (2013). among beneficial mutations. Genetics 163, 1519–26 (2003). 144. Guo, H. H., Choe, J. & Loeb, L. A. Protein tolerance to random amino acid change. Proc. 157. Kassen, R. & Bataillon, T. Distribution of Natl. Acad. Sci. 101, 9205–9210 (2004). fitness effects among beneficial mutations before selection in experimental populations 145. Camps, M., Herman, A., Loh, E. & Loeb, L. of bacteria. Nat. Genet. 38, 484–8 (2006). A. Genetic constraints on protein evolution. Crit. Rev. Biochem. Mol. Biol. 42, 313–26 158. Perfeito, L., Fernandes, L., Mota, C. & (2007). Gordo, I. Adaptive mutations in bacteria: high rate and small effects. Science. 317, 813– 146. Wroe, R., Chan, H. S. & Bornberg-Bauer, E. 5 (2007). A structural model of latent evolutionary potentials underlying neutral networks in 159. Barrett, R. D. H., MacLean, R. C. & Bell, G. proteins. HFSP J. 1, 79 (2007). Mutations of intermediate effect are responsible for adaptation in evolving 147. Wilke, C. O. Adaptive evolution on neutral Pseudomonas fluorescens populations. Biol. networks. Bull. Math. Biol. 63, 715–30 (2001). Lett. 2, 236–8 (2006). 148. Ebner, M., Shackleton, M. & Shipman, R. O. 160. Bataillon, T. Estimation of spontaneous B. How Neutral Networks Influence genome‐wide mutation rate parameters: Evolvability. Complexity 7, 19–33 (2002). whither beneficial mutations? Heredity 149. Van Nimwegen, E., Crutchfield, J. & (Edinb). 84, 497–501 (2000). Huynen, M. Neutral evolution of mutational 161. García-Dorado, A. The rate and effects robustness. Proc. Natl. Acad. Sci. 96, 9716– distribution of viability mutation in 9720 (1999). Drosophila: minimum distance estimation. 150. Proulx, S. R. & Adler, F. R. The standard of Evolution (N. Y). 51, 1130–1139 (1997). neutrality: still flapping in the breeze? J. Evol. 162. Imhof, M. & Schlotterer, C. Fitness effects of Biol. 23, 1339–50 (2010). advantageous mutations in evolving 151. Schmidt-Nielsen, K. Scaling: why is size Escherichia coli populations. Proc. Natl. Acad. so important? (publisher: Cambridge Sci. 98, 1113–7 (2001). University Press, 1984). 163. Colegrave, N. & Collins, S. Experimental 152. Kaiser, A. et al. Increase in tracheal evolution: experimental evolution and investment with beetle size supports evolvability. Heredity (Edinb). 100, 464–70 hypothesis of oxygen limitation on insect (2008). gigantism. Proc. Natl. Acad. Sci. 104, 13198– 164. Shaver, A. C. et al. Fitness evolution and the 203 (2007). rise of mutator alleles in experimental 153. Eyre-Walker, A. & Keightley, P. D. The Escherichia coli populations. Genetics 162, distribution of fitness effects of new 557–66 (2002). mutations. Nat. Rev. Genet. 8, 610–8 (2007). 165. Lindgren, P., Karlsson, Å. & Hughes, D. 154. Sanjuán, R. Mutational fitness effects in Mutation rate and evolution of RNA and single-stranded DNA viruses: fluoroquinolone resistance in Escherichia coli common patterns revealed by site-directed isolates from patients with urinary tract mutagenesis studies. Philos. Trans. R. Soc. B infections. Antimicrob. Agents Chemother. 47, Biol. Sci. 365, 1975–82 (2010). 3222–3232 (2003).

173

166. Sniegowski, P., Gerrish, P. & Lenski, R. 179. Reetz, M. T. & Sanchis, J. Constructing and Evolution of high mutation rates in analyzing the fitness landscape of an experimental populations of E. coli. Nature experimental evolutionary process. 606, 703–705 (1997). ChemBioChem 9, 2260–7 (2008).

167. Colegrave, N. Sex releases the speed limit on 180. Pigliucci, M. Is evolvability evolvable? Nat. evolution. Nature 420, 664–666 (2002). Rev. Genet. 9, 75–82 (2008).

168. Miller, S. P., Lunzer, M. & Dean, A. M. 181. Earl, D. J. & Deem, M. W. Evolvability is a Direct demonstration of an adaptive selectable trait. Proc. Natl. Acad. Sci. 101, constraint. Science. 314, 458–61 (2006). 11531–6 (2004).

169. Bershtein, S., Segal, M., Bekerman, R., 182. Hanke, D. in Explan. styles Explan. Sci. 143 Tokuriki, N. & Tawfik, D. S. Robustness- (publisher: Oxford University Press, 2004). epistasis link shapes the fitness landscape of a 183. Mayr, E. The idea of teleology. J. Hist. Ideas randomly drifting protein. Nature 444, 929– 53, 117–135 (1992). 32 (2006). 184. Pigliucci, I. & Kaplan, I. The fall and rise of 170. Weinreich, D. M., Delaney, N. F., Depristo, Dr Pangloss: adaptationism and the M. A. & Hartl, D. L. Darwinian evolution Spandrels paper 20 years later. Trends Ecol. can follow only very few mutational paths to Evol. 15, 66–70 (2000). fitter proteins. Science. 312, 111–4 (2006). 185. Redfield, R. J. Do bacteria have sex? Nat. Rev. 171. Gong, L. I., Suchard, M. & Bloom, J. D. Genet. 2, 634–9 (2001). Stability-mediated epistasis constrains the evolution of an influenza protein. Elife 2, 186. Woods, R. J. et al. Second-order selection for e00631 (2013). evolvability in a large Escherichia coli population. Science. 331, 1433–6 (2011). 172. Kauffman, S. The origins of order: Self organization and selection in evolution. 187. Bloom, J. D. et al. Evolution favors protein (publisher: Oxford University Press, 1993). mutational robustness in sufficiently large populations. BMC Biol. 5, 29 (2007). 173. He, X. & Zhang, J. Toward a molecular understanding of pleiotropy. Genetics 173, 188. Bershtein, S., Goldin, K. & Tawfik, D. S. 1885–91 (2006). Intense Neutral Drifts Yield Robust and Evolvable Consensus Proteins. J. Mol. Biol. 174. Lenski, R. Experimental studies of pleiotropy 379, 1029–1044 (2008). and epistasis in Escherichia coli. II. Compensation for maldaptive effects 189. Wagner, A. The Origins of Evolutionary associated with resistance to virus T4. Innovations: A Theory of Transformative Change Evolution (N. Y). 42, 433–440 (1988). in Living Systems. 253 (publisher: Oxford University Press, 2011). 175. Brakefield, P. M. Evo-devo and constraints on selection. Trends Ecol. Evol. 21, 362–8 (2006). 190. Stansbury, M. & Moczek, A. Evolvability of Arthropods 479–493 (publisher: Springer 176. Wang, X., Minasov, G. & Shoichet, B. K. Berlin Heidelberg, 2013). Evolution of an antibiotic resistance enzyme constrained by stability and activity trade- 191. Danchin, E. et al. Beyond DNA: integrating offs. J. Mol. Biol. 320, 85–95 (2002). inclusive inheritance into an extended theory of evolution. Nat. Rev. Genet. 12, 475–86 177. Delgado, E. & Medrano, H. Species variation (2011). in Rubisco specificity factor. J. Exp. … 46, 1775–1777 (1995). 192. Pigliucci, M. Do we need an extended evolutionary synthesis? Evolution (N. Y). 61, 178. Bainbridge, G. & Madgwick, P. Engineering 2743–9 (2007). Rubisco to change its catalytic properties. J. Exp. Bot. 46, 1269–1276 (1995). 193. Pigliucci, M. An extended synthesis for evolutionary biology. Ann. N. Y. Acad. Sci. 1168, 218–28 (2009).

174 Chapter H - References

194. Carter, P. J. Introduction to current and 205. Dougherty, W. G., Cary, S. M., Dawn Parks, future protein therapeutics: a protein T. & Parks, T. D. Molecular genetic analysis engineering perspective. Exp. Cell Res. 317, of a plant virus polyprotein cleavage site: A 1261–9 (2011). model. Virology 171, 356–364 (1989).

195. Bommarius, A. S., Blum, J. K. & 206. Dougherty, W. G. & Dawn Parks, T. Post- Abrahamson, M. J. Status of protein translational processing of the tobacco etch engineering for biocatalysts: how to design virus 49-kDa small nuclear inclusion an industrially useful biocatalyst. Curr. Opin. polyprotein: Identification of an internal Chem. Biol. 15, 194–200 (2011). cleavage site and delimitation of VPg and proteinase domains. Virology 183, 449–456 196. O’Loughlin, T. L., Patrick, W. M. & (1991). Matsumura, I. Natural history as a predictor of protein evolvability. Protein Eng. Des. Sel. 207. Parks, T. D., Leuther, K. K., Howard, E. D., 19, 439–42 (2006). Johnston, S. A. & Dougherty, W. G. Release of proteins and peptides from fusion proteins 197. Merlo, L. M. F., Pepper, J. W., Reid, B. J. & using a recombinant plant virus proteinase. Maley, C. C. Cancer as an evolutionary and Anal. Biochem. 216, 413–7 (1994). ecological process. Nat. Rev. Genet. 6, 924–35 (2006). 208. Kapust, R. B. & Waugh, D. S. Controlled Intracellular Processing of Fusion Proteins by 198. Pan, D., Xue, W., Zhang, W., Liu, H. & Yao, TEV Protease. Protein Expr. Purif. 19, 312–318 X. Understanding the drug resistance (2000). mechanism of NS3/4A to ITMN-191 due to R155K, A156V, D168A/E 209. Goodsell, D. Tobacco Mosaic Virus. PDB mutations: A computational study. Biochim. Mol. Mon. (2009). at Biophys. Acta 1820, 1526–34 (2012). 199. Woodford, N. & Ellington, M. J. The emergence of antibiotic resistance by 210. Alonso, J. M., Ondarçuhu, T. & Bittner, A. mutation. Clin. Microbiol. Infect. 13, 5–18 M. Integration of plant viruses in electron (2007). beam lithography nanostructures. Nanotechnology 24, 105305 (2013). 200. Neve, P. Challenges for herbicide resistance evolution and management: 50 years after 211. Phan, J. et al. Structural basis for the substrate Harper. Weed Res. 47, 365–369 (2007). specificity of tobacco etch virus protease. J. Biol. Chem. 277, 50564–72 (2002). 201. Labbé, P. et al. Forty years of erratic insecticide resistance evolution in the 212. Dougherty, W. G. et al. Characterization of mosquito Culex pipiens. PLoS Genet. 3, e205 the catalytic residues of the tobacco etch (2007). virus 49-kDa proteinase. Virology 172, 302– 310 (1989). 202. Rodríguez-Rojas, A., Rodríguez-Beltrán, J., Couce, A. & Blázquez, J. Antibiotics and 213. Madala, P. K., Tyndall, J. D. a, Nall, T. & antibiotic resistance: A bitter fight against Fairlie, D. P. Update 1 of: Proteases evolution. Int. J. Med. Microbiol. 303, 293–7 universally recognize β strands in their active (2013). sites. Chem. Rev. 110, PR1–31 (2010).

203. Schenk, M. F., Szendro, I. G., Krug, J. & de 214. Renicke, C., Spadaccini, R. & Taxis, C. A Visser, J. A. G. M. Quantifying the adaptive Tobacco Etch Virus Protease with Increased potential of an antibiotic resistance enzyme. Substrate Tolerance at the P1’ position. PLoS PLoS Genet. 8, e1002783 (2012). One 8, e67915 (2013).

204. Read, A. F., Lynch, P. A. & Thomas, M. B. 215. Kapust, R. B. et al. Tobacco etch virus How to make evolution-proof insecticides for protease: mechanism of autolysis and rational malaria control. PLoS Biol. 7, e1000058 design of stable mutants with wild-type (2009). catalytic proficiency. Protein Eng. 14, 993– 1000 (2001).

175

216. Wei, L. et al. In vivo and in vitro 227. Hanes, M. S., Jude, K. M., Berger, J. M., characterization of TEV protease mutants. Bonomo, R. A. & Handel, T. M. Structural Protein Expr. Purif. 83, 157–63 (2012). and Biochemical Characterization of the Interaction between KPC-2 β-Lactamase and 217. Van den Berg, S., Löfdahl, P.-Å., Härd, T. & β-Lactamase Inhibitor Protein. Biochemistry Berglund, H. Improved solubility of TEV 48, 9185–9193 (2009). protease by directed evolution. J. Biotechnol. 121, 291–298 (2006). 228. Yuan, J., Huang, W., Chow, D.-C. & Palzkill, T. Fine Mapping of the Sequence 218. Cabrita, L. D. et al. Enhancing the stability Requirements for Binding of β-Lactamase and solubility of TEV protease using in silico Inhibitory Protein (BLIP) to TEM-1 β- design. Protein Sci. 16, 2360–7 (2007). Lactamase Using a Genetic Screen for BLIP 219. Kapust, R. B. & Waugh, D. S. Escherichia coli Function. J. Mol. Biol. 389, 401–412 (2009). maltose-binding protein is uncommonly 229. Reichmann, D. et al. Binding hot spots in the effective at promoting the solubility of TEM1-BLIP interface in light of its modular polypeptides to which it is fused. Protein Sci. architecture. J. Mol. Biol. 365, 663–79 (2007). 8, 1668–74 (1999). 230. Kimura, R. H., Steenblock, E. R. & 220. Carrington, J. C. & Dougherty, W. G. A viral Camarero, J. A. Development of a cell-based cleavage site cassette: identification of amino fluorescence resonance energy transfer acid sequences required for tobacco etch reporter for Bacillus anthracis lethal factor virus polyprotein processing. Proc. Natl. Acad. protease. Anal. Biochem. 369, 60–70 (2007). Sci. 85, 3391–5 (1988). 231. Vinkenborg, J. L., Evers, T. H., Reulen, S. W. 221. Kapust, R. B., Tözsér, J., Copeland, T. D. & A., Meijer, E. W. & Merkx, M. Enhanced Waugh, D. S. The P1’ specificity of tobacco Sensitivity of FRET-Based Protease Sensors etch virus protease. Biochem. Biophys. Res. by Redesign of the GFP Dimerization Commun. 294, 949–955 (2002). Interface. ChemBioChem 8, 1119–1121 222. Boulware, K. T., Jabaiah, A. & Daugherty, P. (2007). S. Evolutionary optimization of peptide 232. Cherny, I. et al. Engineering V-type nerve substrates for proteases that exhibit rapid agents detoxifying enzymes using hydrolysis kinetics. Biotechnol. Bioeng. 106, computationally focused libraries. ACS 339–346 (2010). Chem. Biol. (2013). doi:10.1021/cb4004892 223. Kostallas, G., Löfdahl, P.-Å. & Samuelson, P. 233. Menger, F. M. & Ladika, M. Origin of rate Substrate Profiling of Tobacco Etch Virus accelerations in an enzyme model: the p- Protease Using a Novel Fluorescence- nitrophenyl ester syndrome. J. Am. Chem. Soc. Assisted Whole-Cell Assay. PLoS One 6, 109, 3145–3146 (1987). e16136 (2011). 234. McGrath, M. E., Wilke, M. E., Higaki, J. N., 224. Yi, L. et al. Engineering of TEV protease Craik, C. S. & Fletterick, R. J. Crystal variants by yeast ER sequestration screening structures of two engineered thiol trypsins. (YESS) of combinatorial libraries. Proc. Natl. Biochemistry 28, 9264–70 (1989). Acad. Sci. U. S. A. 110, 7229–34 (2013). 235. Polgár, L. & Asbóth, B. The basic difference 225. Quinn, D. J., Cunningham, S., Walker, B. & in catalyses by serine and cysteine Scott, C. J. Activity-based selection of a proteinases resides in charge stabilization in proteolytic species using ribosome display. the transition state. J. Theor. Biol. 121, 323–6 Biochem. Biophys. Res. Commun. 370, 77–81 (1986). (2008). 236. Migliorini, M. & Creighton, D. Active-site 226. Doran, J. L., Leskiw, B. K., Aippersbach, S. & ionizations of papain. Eur. J. … 192, 189–192 Jensen, S. E. Isolation and characterization of (1986). a β-lactamase-inhibitory protein from Streptomyces clavuligerus and cloning and analysis of the corresponding gene. J. Bacteriol. 172, 4909–4918 (1990).

176 Chapter H - References

237. Abrahmsén, L. et al. Engineering subtilisin 247. Amara, A. & Rehm, B. H. a. Replacement of and its substrates for efficient ligation of the catalytic nucleophile cysteine-296 by peptide bonds in aqueous solution. serine in class II polyhydroxyalkanoate Biochemistry 30, 4151–9 (1991). synthase from Pseudomonas aeruginosa- mediated synthesis of a new polyester: 238. Neet, K. E. & Koshland, D. E. The identification of catalytic residues. Biochem. J. Conversion of Serine at the Active Site of 374, 413–21 (2003). Subtilisin to Cysteine: A ``Chemical Mutation’'. Proc. Natl. Acad. Sci. 56, 1606– 248. Sigal, I. S., Harwood, B. G. & Arentzen, R. 1611 (1966). Thiol-β-lactamase: replacement of the active- site serine of RTEM β-lactamase by a cysteine 239. Beveridge, A. J. A theoretical study of the residue. Proc. Natl. Acad. Sci. 79, 7157–60 active sites of papain and S195C rat trypsin: (1982). implications for the low reactivity of mutant serine proteinases. Protein Sci. 5, 1355–65 249. Kowal, A T. et al. Effect of cysteine to serine (1996). mutations on the properties of the [4Fe-4S] center in Escherichia coli fumarate reductase. 240. Hahn, C. S. & Strauss, J. H. Site-directed Biochemistry 34, 12284–93 (1995). mutagenesis of the proposed catalytic amino acids of the Sindbis virus capsid protein 250. Rawlings, N. D., Barrett, A. J. & Bateman, A. autoprotease. J. Virol. 64, 3069–73 (1990). MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. 241. Cheah, K. C., Leong, L. E. & Porter, a G. Nucl. Acids Rec. 40, D343–50 (2012). Site-directed mutagenesis suggests close functional relationship between a human 251. Holm, L. & Rosenström, P. Dali server: rhinovirus 3C cysteine protease and cellular conservation mapping in 3D. Nucl. Acids Rec. trypsin-like serine proteases. J. Biol. Chem. 38, W545–9 (2010). 265, 7180–7 (1990). 252. Shin, S. et al. Characterization of a novel Ser- 242. Lawson, M. A. & Semler, B. L. cisSer-Lys catalytic triad in comparison with thiol proteinase 3C can utilize a serine the classical Ser-His-Asp triad. J. Biol. Chem. nucleophile within the putative catalytic 278, 24937–43 (2003). triad. Proc. Natl. Acad. Sci. U. S. A. 88, 9919– 253. Beadle, B. M. & Shoichet, B. K. Structural 23 (1991). Bases of Stability–function Tradeoffs in 243. Turkenburg, J. P. et al. Structure of a Enzymes. J. Mol. Biol. 321, 285–296 (2002). Cys25→Ser mutant of human cathepsin S. 254. Morley, K. L. & Kazlauskas, R. J. Improving Acta Crystallogr. Sect. D Biol. Crystallogr. 58, enzyme properties: when are closer 451–455 (2002). mutations better? Trends Biotechnol. 23, 231–7 244. Sharp, J. D. et al. Serine 228 is essential for (2005). catalytic activities of 85-kDa cytosolic 255. Wellner, A., Raitses Gurevich, M. & Tawfik, phospholipase A2. J. Biol. Chem. 269, 23250– D. S. Mechanisms of protein sequence 4 (1994). divergence and incompatibility. PLoS Genet. 245. Li, J., Szittner, R., Derewenda, Z. S. & 9, e1003665 (2013). Meighen, E. A. Conversion of serine-114 to 256. Loh, E., Choe, J. & Loeb, L. A. Highly cysteine-114 and the role of the active site tolerated amino acid substitutions increase nucleophile in acyl transfer by myristoyl-ACP the fidelity of Escherichia coli DNA thioesterase from Vibrio harveyi. Biochemistry polymerase I. J. Biol. Chem. 282, 12201–9 35, 9967–73 (1996). (2007). 246. Walker, I., Easton, C. J. & Ollis, D. L. Site- 257. Savage, H. & Elliott, C. Lost hydrogen bonds directed mutagenesis of dienelactone and buried surface area: rationalising stability hydrolase produces dienelactone isomerase. in globular proteins. J. Chem. Soc. 89, 2609– Chem. Commun. 2, 671–672 (2000). 2617 (1993).

177

258. Privalov, P. L. & Khechinashvili, N. N. A 271. Lauring, A. S., Frydman, J. & Andino, R. The thermodynamic approach to the problem of role of mutational robustness in RNA virus stabilization of globular protein structure: a evolution. Nat. Rev. Microbiol. 11, 327–36 calorimetric study. J. Mol. Biol. 86, 665–84 (2013). (1974). 272. Walkiewicz, K. et al. Small changes in enzyme 259. DePristo, M. A., Weinreich, D. M. & Hartl, function can lead to surprisingly large fitness D. L. Missense meanderings in sequence effects during adaptive evolution of space: a biophysical view of protein antibiotic resistance. Proc. Natl. Acad. Sci. U. evolution. Nat. Rev. Genet. 6, 678–87 (2005). S. A. 109, 21408–13 (2012).

260. Sung, W. & Ackerman, M. Drift-barrier 273. Bloom, J. D., Labthavikul, S. T., Otey, C. R. hypothesis and mutation-rate evolution. Proc. & Arnold, F. H. Protein stability promotes Natl. Acad. Sci. 2012, (2012). evolvability. Proc. Natl. Acad. Sci. 103, 5869– 5874 (2006). 261. Visser, J. A. G. M. et al. Perspective: Evolution and detection of genetic 274. Gupta, R. D. & Tawfik, D. S. Directed robustness. Evolution (N. Y). 57, 1959–1972 enzyme evolution via small and effective (2003). neutral drift libraries. Nat. Methods 5, 939– 942 (2008). 262. Wilke, C. O. & Adami, C. Evolution of mutational robustness. Mutat. Res. 522, 3–11 275. Lang, G. I. et al. Pervasive genetic hitchhiking (2003). and clonal interference in forty evolving yeast populations. Nature 500, 571–574 263. Wilke, C. O., Wang, J. L., Ofria, C., Lenski, (2013). R. E. & Adami, C. Evolution of digital organisms at high mutation rates leads to 276. Finnigan, G. C., Hanson-Smith, V., Stevens, survival of the flattest. Nature 412, 331–3 T. H. & Thornton, J. W. Evolution of (2001). increased complexity in a molecular machine. Nature 481, 360–4 (2012). 264. Edlund, J. & Adami, C. Evolution of robustness in digital organisms. Artif. Life 10, 277. Bridgham, J. T., Ortlund, E. a & Thornton, J. 167–79 (2004). W. An epistatic ratchet constrains the direction of glucocorticoid receptor 265. Kaneko, K. Evolution of Robustness. 3–8 evolution. Nature 461, 515–9 (2009). 266. Graci, J. D. et al. Mutational robustness of an 278. Wylie, C. S. & Shakhnovich, E. I. A RNA virus influences sensitivity to lethal biophysical protein folding model accounts mutagenesis. J. Virol. 86, 2869–73 (2012). for most mutational fitness effects in viruses. 267. O’Dea, E. B., Keller, T. E. & Wilke, C. O. Proc. Natl. Acad. Sci. 108, 9916–21 (2011). Does mutational robustness inhibit 279. Sanjuán, R., Nebot, M. R., Chirico, N., extinction by lethal mutagenesis in viral Mansky, L. M. & Belshaw, R. Viral mutation populations? PLoS Comput. Biol. 6, e1000811 rates. J. Virol. 84, 9733–48 (2010). (2010). 280. Draghi, J. A., Parsons, T. L., Wagner, G. P. & 268. Borenstein, E. & Ruppin, E. Direct evolution Plotkin, J. B. Mutational robustness can of genetic robustness in microRNA. Proc. facilitate adaptation. Nature 463, 353–5 Natl. Acad. Sci. U. S. A. 103, 6593–8 (2006). (2010). 269. Tokuriki, N., Oldfield, C. J., Uversky, V. N., 281. Bornberg-Bauer, E. & Kramer, L. Robustness Berezovsky, I. N. & Tawfik, D. S. Do viral versus evolvability: a paradigm revisited. proteins possess unique biophysical features? HFSP J. 4, 105–8 (2010). Trends Biochem. Sci. 34, 53–59 (2009). 282. Whitacre, J. & Bender, A. Degeneracy: a 270. Elena, S. F. RNA virus genetic robustness: design principle for achieving robustness and possible causes and some consequences. evolvability. J. Theor. Biol. 263, 143–53 (2010). Curr. Opin. Virol. 2, 525–30 (2012).

178 Chapter H - References

283. Harms, M. J. et al. Biophysical mechanisms 295. Kirschner, M. & Gerhart, J. Evolvability. Proc. for large-effect mutations in the evolution of Natl. Acad. Sci. 95, 8420–8427 (1998). steroid hormone receptors. Proc. Natl. Acad. 296. Glasner, M. E., Gerlt, J. A. & Babbitt, P. C. Sci. U. S. A. 110, 11475–80 (2013). Evolution of enzyme superfamilies. Curr. 284. Wright, S. The roles of mutation, inbreeding, Opin. Chem. Biol. 10, 492–7 (2006). crossbreeding and selection in evolution. 297. Li, Q., Yi, L., Marek, P. & Iverson, B. L. Proc. sixth Int. Congr. Genet. 356–366 (1932). Commercial proteases: present and future. 285. Coyne, J., Barton, N. & Turelli, M. FEBS Lett. 587, 1155–63 (2013). Perspective: a critique of Sewall Wright’s 298. Matsunaga, T. & Andersson, E. Evolution of shifting balance theory of evolution. vertebrate antibody genes. Fish Shellfish Evolution (N. Y). 51, 643–671 (1997). Immunol. 4, 413–419 (1994). 286. Masel, J. Cryptic genetic variation is enriched 299. Leavy, O. Therapeutic antibodies: past, for potential adaptations. Genetics 172, 1985– present and future. Nat. Rev. Immunol. 10, 297 91 (2006). (2010). 287. Hayden, E. J., Ferrada, E. & Wagner, A. 300. Buss, N. A., Henderson, S. J., McFarlane, M., Cryptic genetic variation promotes rapid Shenton, J. M. & de Haan, L. Monoclonal evolutionary adaptation in an RNA enzyme. antibody therapeutics: history and future. Nature 474, 92–5 (2011). Curr. Opin. Pharmacol. 12, 615–22 (2012). 288. Bornberg-Bauer, E., Huylmans, A.-K. & 301. Mullin, R. Before the Storm. Chem. Eng. News Sikosek, T. How do new proteins arise? Curr. 89, 12–18 (2011). Opin. Struct. Biol. 20, 390–6 (2010). 302. Nelson, A. L., Dhimolea, E. & Reichert, J. M. 289. Wagner, A. Robustness and evolvability: a Development trends for human monoclonal paradox resolved. Proc. R. Soc. B 275, 91–100 antibody therapeutics. Nat. Rev. Drug Discov. (2008). 9, 767–74 (2010). 290. Bloom, J. D., Romero, P. A., Lu, Z. & Arnold, 303. Reichert, J. M. Antibodies to watch in 2013: F. H. Neutral genetic drift can alter Mid-year update. MAbs 5, 513–7 (2013). promiscuous protein functions, potentially aiding functional evolution. Biol. Direct 2, 17 304. Goodsell, D. Epidermal growth factor. PDB (2007). Mol. Mon. (2010). at Latent evolutionary potentials under the neutral mutational drift of an enzyme. HFSP 305. Sigrist, C. J. a et al. PROSITE: a documented J. 1, 67 (2007). database using patterns and profiles as motif descriptors. Brief. Bioinform. 3, 265–74 (2002). 292. Matsumura, I. & Ellington, A. D. In vitro evolution of β-glucuronidase into a β- 306. Dosztányi, Z., Csizmok, V., Tompa, P. & galactosidase proceeds through non-specific Simon, I. IUPred: web server for the intermediates. J. Mol. Biol. 305, 331–9 (2001). prediction of intrinsically unstructured regions of proteins based on estimated 293. Smith, W. S., Hale, J. R. & Neylon, C. energy content. Bioinformatics 21, 3433–4 Applying neutral drift to the directed (2005). molecular evolution of a β-glucuronidase into a β-galactosidase: Two different 307. Tantillo, D. J., Chen, J. & Houk, K. N. evolutionary pathways lead to the same Theozymes and compuzymes: theoretical variant. BMC Res. Notes 4, 138 (2011). models for biological catalysis. Curr. Opin. Chem. Biol. 2, 743–50 (1998). 294. Kaltz, O. & Bell, G. The ecology and genetics of fitness in Chlamydomonas. XII. Repeated 308. Richter, F., Leaver-Fay, A., Khare, S. D., sexual episodes increase rates of adaptation Bjelic, S. & Baker, D. De novo enzyme design to novel environments. Evolution (N. Y). 56, using Rosetta3. PLoS One 6, e19230 (2011). 1743–1753 (2002).

179

309. Zhou, G., Guo, J. & Huang, W. Crystal structure of a catalytic antibody with a serine protease active site. Science. 265, 1059–1064 (1994).

310. Baker, D. An exciting but challenging road ahead for computational enzyme design. Protein Sci. 19, 1817–9 (2010).

311. Wang, L. et al. Structural analyses of covalent enzyme-substrate analog complexes reveal strengths and limitations of de novo enzyme design. J. Mol. Biol. 415, 615–25 (2012).

312. Villiers, B. R. M., Stein, V. & Hollfelder, F. USER friendly DNA recombination (USERec): a simple and flexible near homology-independent method for gene library construction. Protein Eng. Des. Sel. 23, 1–8 (2010).

313. Gapstreeze. at

314. Tamura, K. et al. MEGA5: molecular evolutionary genetics analysis using maximum likelihood, evolutionary distance, and maximum parsimony methods. Mol. Biol. Evol. 28, 2731–9 (2011).

315. PyMol. at

316. Fraczkiewicz, R. & Braun, W. Exact and efficient analytical calculation of the accessible surface areas and their gradients for macromolecules. J. Comput. Chem. 19, (1998).

317. Schymkowitz, J. et al. The FoldX web server: an online force field. Nucl. Acids Rec. 33, W382–8 (2005).

180