
The Pennsylvania State University The Graduate School

COMPUTATIONAL REDESIGN OF CHANNEL PROTEINS, ENZYMES, AND ANTIBODIES

A Dissertation in Chemical Engineering

by Ratul Chowdhury © 2020 Ratul Chowdhury

Submitted in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

May 2020

The dissertation of Ratul Chowdhury was reviewed and approved* by the following:

Costas D. Maranas Donald B. Broughton Professor of Chemical Engineering Dissertation Advisor Chair of Committee

Manish Kumar Assistant Professor in Chemical Engineering

Michael Janik Professor in Chemical Engineering

Reka Albert Professor in Physics

Phillip E. Savage Department Head and Graduate Program Chair Professor in Chemical Engineering


ABSTRACT

Nature relies on a wide range of proteins with specific biocatalytic roles to carry out much of the chemistry needed to sustain life. Proteins catalyze the interconversion of a vast array of molecules with high specificity, from molecular nitrogen fixation to the synthesis of highly specialized hormones and quorum-sensing molecules; they also defend against disease-causing foreign proteins and maintain osmotic balance through transmembrane channel proteins. Ever-increasing emphasis on renewable sources for energy and waste minimization has turned biocatalytic proteins (enzymes) into key industrial workhorses for targeted chemical conversions. Modern protein engineering is central not only to food and beverage manufacturing processes; its products are also ingredients in countless consumer product formulations, such as proteolytic enzymes and amylases in detergents, and serve as peptide-based therapeutics in the form of designed antibodies to combat neurodegenerative diseases (such as Parkinson’s and Alzheimer’s) and outbreaks of Zika and Ebola virus. However, successful protein design, or tweaking an existing protein for a desired functionality, has remained a constant challenge. This is mainly because the complex energy landscape of proteins has not been fully discerned. Any change to a protein can bring about large-scale conformational changes that destabilize the whole structure. However, with readily available computing power, Monte Carlo based random sampling of the amino acid sequence has become more tractable and has led to the discovery of several de novo protein sequences that fold like natural analogues.

This thesis presents three new computational protein design tools targeted at the redesign of (a) channel proteins, for tunable pore size and chemistry to control the passage or rejection of desired solutes in a membrane-separation module; (b) enzymes, by allowing prediction of amino acid insertions and deletions along with substitutions, emulating the natural evolution of proteins, to switch their specificity towards or away from an intended substrate or cofactor; and (c) humanized antibody sequences that possess structural complementarity to a disease-causing antigen protein and thereby neutralize it. All these tools employ a mixed-integer linear optimization approach for designing the optimal protein sequence, molecular-mechanics force-field calculations (using the CHARMM and Rosetta packages) to assess the respective structures, and an iterative workflow that uses a Metropolis criterion to retain or reject the predicted designs and improve upon them. First, PoreDesigner provides a systematic approach to tune the pore size and inner pore wall chemistry of any channel protein, with potential applications from membrane-based separation of aqueous solutes to DNA sequencing. Next, we have developed an iterative redesign and optimization tool that enables the user to find active site mutations on the enzyme to alter substrate and cofactor specificity. Finally, our third tool assembles fragments of human antibodies using a mixed-integer linear optimization approach to obtain a complete library of antibody variable domains that bind to a user-provided disease-causing antigen protein with high specificity. Binding scores are computed using non-bonded enthalpic energy models comprising harmonic potential terms for electrostatics, solvation, hydrogen bonding, and van der Waals interactions.


TABLE OF CONTENTS

LIST OF FIGURES ...... vi
LIST OF TABLES ...... viii
NOMENCLATURE ...... ix
ACKNOWLEDGEMENTS ...... x

1. Chapter 1 POREDESIGNER FOR TUNING SOLUTE SELECTIVITY IN A ROBUST AND HIGHLY PERMEABLE OUTER MEMBRANE PORE ...... 1
1.1 Significance ...... 1
1.2 Introduction ...... 2
1.3 Results ...... 7
1.4 Discussion ...... 23
1.5 Materials and Methods ...... 26
1.6 References ...... 36

2. Chapter 2 BRIEF HISTORY OF ENZYME REDESIGN FROM TO COMPUTATIONAL ENZYME REDESIGN ...... 43
2.1 Background ...... 43
2.2 Successes ...... 56
2.3 New approaches and future directions ...... 69
2.4 References ...... 70

3. Chapter 3 IPRO+/- COMPUTATIONAL PROTEIN ALLOWING FOR INSERTIONS AND DELETIONS ...... 85
3.1 Significance ...... 85
3.2 Introduction ...... 86
3.3 Materials and Methods ...... 92
3.4 Results and Discussion ...... 97
3.6 References ...... 111

4. Chapter 4 OPTMAVEN-2.0 FOR DE NOVO DESIGN OF VARIABLE ANTIBODY REGIONS AGAINST TARGETED ANTIGEN EPITOPES ...... 116
4.1 Significance ...... 116
4.2 Introduction ...... 118
4.3 Materials and Methods ...... 120
4.4 Results ...... 141
4.5 Summary and Discussion ...... 154
4.6 References ...... 156

5. Chapter 5 SYNOPSIS ...... 163


LIST OF FIGURES

Figure 1. Water wires in the classical water channel, Aquaporin 1 (AQP1), were used as a template to redesign the outer membrane protein F (OmpF) channel pore resulting in three different types of designs...... 8

Figure 2. OCD-TFTrp design (left) shows steric clash between adjacent tryptophans resulting in a less occluded and larger pore size than expected. However, in a UCD design (right) a R82L mutation alleviates a steric clash with Trp62 (unlike OCD-TFTrp). UCD designs are seen to intersperse smaller side chain hydrophobic amino acids between longer ones so their side chains face the pore lumen resulting in smaller pore sizes...... 11

Figure 3. Twenty OmpF mutants, spanning the entire sub-nm range were designed ...... 14

Figure 4. Osmotic shock stopped flow light scattering experiments were used to demonstrate an order of magnitude or higher permeability than aquaporins for WT OmpF protein and its mutants as well as solute retention trends seen in OmpF protein mutants...... 17

Figure 5. MD simulations of OmpF provide support for experimentally observed permeability and selectivity trends...... 21

Figure 6. OmpF pore analysis...... 32

Figure 7. Seventy notable events in the history of enzyme engineering starting with the Egyptians using wild grain for bread-making and brewing to directed evolution and phage display techniques for which the Nobel Prize in chemistry was awarded in 2018. The computational milestones are indicated as purple lines...... 44

Figure 8. Two different views of the lysozyme binding site (marked in blue) and the active site residues highlighted in red. The peptidoglycan substrate is shown as yellow sticks...... 46

Figure 9. Conformational change in hexokinase during product release. The active site has been highlighted in bright green. The substrate and products have been marked as pink sticks. Accession IDs for closed and open hexokinase conformations are 2E2N, and 2E2Q...... 47

Figure 10. The seven-step DESADER schematic overview of enzyme redesign computational workflows of RosettaDesign and IPRO...... 56

Figure 11. The four mutations that led to enhanced catalysis in beta-glucosidase have been marked as blue sticks and the protein is represented as light pink cartoon. All four mutations are too far from the binding pocket to interact with the substrate...... 59

Figure 12. The histogram shows the distribution of 156 TEM1-beta lactamase homologs with respect to the number of amino acids that constitute the polypeptides. The sequences have been grouped into five amino acid long bins with the mean lengths indicated in the X-axis labels...... 88

Figure 13. IPRO+/- graphical schematic...... 93

Figure 14. The steps of IPRO+/- design cycle...... 96

Figure 15. Engineering beta-lactamase...... 98

Figure 16. Venn diagram shows the fraction of naturally occurring indels in TBL homologs that were recovered by IPRO+/- simulations on EcTBL...... 101

Figure 17. The sequence alignment of the seven 4CLs with specificities spanning small to large cinnamate-derivatives, reveals two possible deletion sites (V345 and L346) and thirteen possible substitution positions in Gm4CL2...... 104


Figure 18. Overview of Gm4CL2-intermediate complex ...... 105

Figure 19. Deletion of V345 in Gm4CL2 opens up the substrate binding groove, thus favorably accommodating larger sinapyl-group, which otherwise clashes with A235 and V345 alike...... 108

Figure 20. Performance of Indel-Maker...... 110

Figure 21. The workflow of OptMAVEn-2.0...... 121

Figure 22. The grid search procedure...... 127

Figure 23. The steps involved in the embedder module...... 134

Figure 24. The optimal gap penalty (g) is 8. For each category of MAPs parts and each gap penalty g (4 to 12), pairwise aligned (Dalign) and embedded (Dembed) distances were generated...... 137

Figure 25. A 3D coordinate was computed for each MAPs part. For each pair of MAPs parts within each category, the two parts’ embedded distance in Euclidean space was plotted against their sequence alignment distance...... 139

Figure 26. The native and 5gzn_R0 antigens remained stably bound (RMSD < 6 Å), while the 5gzn_R27 antigen remained partially bound (6 Å < RMSD < 12 Å) throughout the MD simulations. Heavy-atom RMSDs of antigen residues within a box at the antigen-antibody interface were computed after aligning the antibody residues within the box. The RMSDs for each complex are relative to the first frame (time 0 ns) of the production run...... 151

Figure 27. The key interactions at the antigen-antibody interface post-MD simulations for native, 5gzn_R0 and 5gzn_R27 have been depicted. The light and heavy chain residues are shown as magenta and cyan sticks respectively, while antigen residues are depicted as green sticks...... 152


LIST OF TABLES

Table 1. Ranges of values for the inner pore wall, outer pore wall, and overall free energies of transfer from water to ethanol (which serve as a surrogate to hydrophobicity scores) of the three selected OmpF mutants in contrast with the wild type...... 12

Table 2. List of 50 computational enzyme design successes to date (grouped by relevance). The experimental aspects of each of these endeavors have been noted as well, indicating that the majority of these successes are due to the synergistic effort of simulations and experiments...... 62

Table 3. CHARMM interaction and complex energy scores of 35 1ERM designs from IPRO+/-. The designs are arranged in descending order of variant-inhibitor interaction energy scores (column 4). These designs sample indels that are seen in natural homologs and have complex energy scores (stability metrics) not less than 90% of the wild type TEM1- β-lactamase complex energy with PEB inhibitor. Designs that constitute the most stable complexes are indicated with an asterisk (*)...... 102

Table 4. The difference in interaction energy scores of each of the 23 designs in comparison to wild type Gm4CL2, along with the amino acid substitutions and deletions with five AMP-conjugates of cinnamate-like substrates (sinapate, ferulate, caffeate, 4-coumarate, and cinnamate) in decreasing order of size have been listed...... 106

Table 5. The Spearman rank correlation coefficient (ρ) for each MAPs part category at each gap penalty g. ρ is independent of g for the J parts because the J parts do not have gaps...... 136

Table 6. The root mean squared error (RMSE) for each MAPs part category at each gap penalty g. RMSE is independent of g for the J parts because the J parts do not have gaps...... 136

Table 7. For each category of MAPs parts, the levels of the gap penalty g were ranked from 1 to 5 on the basis of ρ (highest ρ is rank 1) and RMSE (lowest RMSE is rank 1). J parts were excluded because for the J parts, ρ and RMSE are independent of g, as their sequences are devoid of residue gaps...... 136

Table 8. The performance of OptMAVEn on ten antigens for benchmarking. Tpos, Tener, TMILP, and TCPU are in hours; Dmax is in megabytes; Emin is in kcal/mol. *2R0W was excluded from analysis of Emin...... 143

Table 9. The performance of OptMAVEn-2.0 on ten antigens for benchmarking. Tpos, Tener, TMILP, and TCPU are in hours; Dmax is in megabytes; Emin is in kcal/mol. *2R0W was excluded from analysis of Emin...... 144

Table 10. Comparison of the performances of OptMAVEn and OptMAVEn-2.0 on ten antigens. Tpos, Tener, TMILP, TCPU, Dmax, and Npos report the log10 of the ratios of the corresponding OptMAVEn-2.0 and OptMAVEn values...... 145

Table 11. OptMAVEn-2.0 was tested on 54 antigens in addition to those used for benchmarking against OptMAVEn. TCPU is in hours, Dmax is in megabytes, and Emin is in kcal/mol...... 146

Table 12. The antigen chains and epitope residues of the designs used in the test cases...... 148

Table 13. Comparison of HScores of the top 5 de novo designs with the HScores of the native antibodies for zika envelope protein ...... 150

Table 14. Comparison of HScores of the top 5 de novo designs with the HScores of the native antibodies for lysozyme...... 153


NOMENCLATURE

Symbols

A     Arrhenius pre-exponential factor
Cp    Specific heat
D     Diffusion constant
E     Energy
Ea    Arrhenius activation energy
G     Gibbs free energy
H     Enthalpy
kB    Boltzmann constant
m     Mass
NH    Number of hydrogen bonds
P     Osmotic pressure
R     Gas constant
σ     Lennard-Jones collision diameter
Å     Angstrom
kD    Dissociation constant
Tm    Melting temperature

Abbreviations

MILP         Mixed-Integer Linear Program
OmpF         Outer membrane porin Type-F
AQP          Aquaporin
OCD          Off-center Closure Design
UCD          Uniform Closure Design
CSD          Cork Screw Design
IPRO         Iterative Protein Redesign and Optimization
IPRO+/-      Iterative Protein Redesign and Optimization with Indels
TBL          TEM1 Beta Lactamase
EcTBL        Escherichia coli TEM1 Beta Lactamase
OptMAVEn     Optimal Engineering of Modular Antibody Variable domains


ACKNOWLEDGEMENTS

I would like to express my sincere gratitude to Professors Costas D. Maranas and Manish Kumar for co-advising my thesis and making me a part of several wonderful projects where I got to learn a lot and contribute a little as well.

I also extend my sincere gratitude to Professor Brian Pfleger, University of Wisconsin Madison for several insightful conversations during the course of our collaboration and entertaining and encouraging me in formalizing any new actionable research idea. He has also helped me in all my post-doctoral applications by serving as a referee.

“Love is the only reality and it is not a mere sentiment. It is the ultimate truth that lies at the heart of a creation”- Rabindranath Tagore.

My journey would not be possible without the love and support of my family and friends, who all have become a part of my extended family in the course of these six and a half years. My parents, Braja Gopal Chowdhury and Mana Chowdhury, have served as the epitome of honesty, integrity, and hard work for me and I cannot thank them enough. Debolina, Debmalya, Chaitali and Vivek Sarkar are no less than family and have provided the due encouragement for each of my baby-steps till the completion of my thesis.

The silent contributions of Arpan-da, Samhita-di, and Partha-da outside the lab and Rajib, Akhil, and Ali from the lab need special mention. I appreciate them for being hitchhikers on the road I have travelled, and I always knew that I had less before I met them all. I further thank my co-workers, Lin, Charles, Hoang, Patrick, Shyam, John, Soodabeh, Veda, Vikas, and Deepro.

I thank the National Science Foundation for funding all the enzyme engineering efforts for this thesis. Details: Research Division of Chemical, Bioengineering, Environmental, and Transport Systems, Grant/Award Number: CBET-1703274


This work is dedicated to my beloved grandparents, Late Nani Gopal Chowdhury and Shikha Das. They always encouraged me to pursue my dreams and have smiled at my happiness.


Chapter 1

POREDESIGNER FOR TUNING SOLUTE SELECTIVITY IN A ROBUST AND HIGHLY PERMEABLE OUTER MEMBRANE PORE

This chapter has been previously published in a modified form in Nature Communications (Chowdhury R, Ren T, Shankla M, Decker K, Grisewood M, Prabhakar J, Baker C, Golbeck JH, Aksimentiev A, Kumar M, Maranas CD. PoreDesigner for tuning solute selectivity in a robust and highly permeable outer membrane pore. Nature Communications. 2018 Sep 10;9(1):3661.)

1.1. Significance

Angstrom-size pores with no polydispersity and embedded in a suitable matrix offer the promise of highly selective membrane-based separations. Such membranes can provide substantial energy savings in applications ranging from water treatment to small molecule bioseparations.

Monodisperse angstrom-sized pores in the form of membrane proteins are commonplace in biological membranes but difficult to implement in synthetic industrial membranes. They have only recently become available as commercial products in the form of aquaporin-based membranes. Unfortunately, in these membranes, improvements in selectivity and permeability have remained modest with no ability to control selectivity. Here we demonstrate the successful implementation of a pore design procedure, PoreDesigner, to redesign the robust beta-barrel Outer Membrane Protein F (OmpF) as a “scaffold” to access several pore sizes smaller than 4 Å with specific solute selectivity, including complete salt rejection. The elliptic pore constriction of OmpF has major and minor axis lengths of 11 Å and 7 Å, respectively. We combined PoreDesigner with simulations to redesign the wild type (WT) OmpF pore (which retains solutes with molecular weights of 600 Da or larger). We obtained an array of designs with varying pore sizes and profiles that could be categorized into three distinct pore topologies: off-center pore closure, uniform pore closure, and cork-screw design. Experimental testing of representative mutants from each category revealed a range of designs that maintain water permeabilities exceeding those of classical aquaporins by more than an order of magnitude, at over 10 billion water molecules per channel per second, while providing specific pore designs that exclude sucrose and larger solutes (>360 Da), glucose and larger solutes (>180 Da), or salt and larger solutes (>58 Da). PoreDesigner provides the ability to design any specified angstrom-scale pore size (spanning 3–10 Å), pore profile, and chemistry that may be ideal for conducting angstrom-scale aqueous separations.

1.2. Introduction

Precise chemical separations such as desalination and distillation are among the most challenging and resource-intensive industrial process operations practiced today, with an annual energy consumption of ~50 Quads (50×10¹⁵ BTU) in the United States alone1. Membranes are generally defined as thin, selective barriers that ideally allow only select molecules to permeate while rejecting others2,3. Permeability and selectivity are key metrics of performance for membrane separations. Membrane permeability is defined as the membrane flux (volume of liquid or gas that passes per unit membrane area per unit time) per unit driving force. For biological membranes, the permeability4 is expressed per unit of a dimensionless driving force, defined as the osmotic gradient times the molar volume of water. Membrane selectivity is consequently defined as the ratio between the permeabilities of two solutes, two solvents, or a solute and a solvent.
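The definitions above can be illustrated with a short numerical sketch. The function names and the example numbers are illustrative assumptions and are not taken from this work; only the definitions (flux per unit driving force, osmotic gradient times the molar volume of water, and the ratio of two permeabilities) follow the text.

```python
# Molar volume of liquid water, about 18 cm^3/mol = 0.018 L/mol.
WATER_MOLAR_VOLUME_L_PER_MOL = 0.018


def dimensionless_driving_force(osmotic_gradient_mol_per_l):
    """Osmotic gradient (mol/L) times the molar volume of water (L/mol)."""
    return osmotic_gradient_mol_per_l * WATER_MOLAR_VOLUME_L_PER_MOL


def permeability(flux, driving_force):
    """Permeability is the flux per unit driving force."""
    if driving_force <= 0:
        raise ValueError("driving force must be positive")
    return flux / driving_force


def selectivity(permeability_a, permeability_b):
    """Selectivity is the ratio of the permeabilities of two species."""
    return permeability_a / permeability_b


# Hypothetical example: a 0.5 mol/L osmotic gradient and a made-up
# single-channel water flux of 9e-15 cm^3/s.
force = dimensionless_driving_force(0.5)
print(force)                       # dimensionless
print(permeability(9e-15, force))  # cm^3/s per unit driving force
```

Because the driving force is dimensionless in this convention, the permeability carries the same units as the flux, which is why single-channel permeabilities are quoted in cm³/s.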

Membrane separation offers advantages such as higher selectivity, simpler operation, and higher compactness over other (in many cases thermally driven) separation processes5. Membranes are increasingly being applied in a number of industrial sectors including water treatment6, industrial gas separations7,8, CO2 capture9,10, food processing11, and biopharmaceutical separations12,13. A variety of materials such as synthetic polymers14, ceramics15, metals16, and cellulose17 can be used to synthesize membranes. Membrane technology applications have expanded rapidly in recent years and so has the understanding of the interaction between membrane chemistry, structure, and performance. However, several challenges remain in membrane materials design, particularly at the “pore” scale, and in the translation of such designs to the large areas necessary for application18–21. These challenges include: (1) overcoming the trade-off between selectivity and permeability to develop membranes with both high selectivity and high permeability, because improvements in selectivity would simplify multiple separation steps and decrease separation costs significantly; (2) designing angstrom-scale pores that result in the same angstrom-scale separations in synthesized membranes, which would be critical for the efficient separation of small molecules such as ions, gases, and small organics; and (3) synthesizing membranes with uniform pore size distributions, since the elimination of polydispersity in membrane pore size would greatly enhance selectivity performance19,22.

To meet these criteria, in recent years, new materials including zeolites23, carbon nanotubes24, graphene oxide25, membrane proteins26, and membrane protein mimics27,28 have emerged as advanced membrane materials to assemble the desired pore geometry. However, designing sub-nm pores with perfectly monodisperse distributions is still an unmet challenge, and no procedure exists to rationally design the continuum of pore sizes between 3 and 10 Å. Membrane protein channels and biomimetic membranes based on these proteins provide the possibility of realizing sub-nanometer pore size membranes with perfectly monodisperse pore size distributions, retaining high selectivity while maintaining high permeability19,21. A well-known example of such membranes uses the water transport membrane proteins aquaporins (AQPs), incorporated into liposomes and further stabilized in polyamide polymer membranes for water desalination26,29.

Membrane protein redesign studies have been performed and reported on gated ion channels (Bocquet et al.30, Hibbs and Gouaux31, Hilf and Dutzler32, Miyazawa et al.33) and helix-bundle ion transporters (John et al.34). In a recent work, Liu et al.35 engineered a ferrichrome outer membrane iron transporter (FhuA) to attain pore sizes larger than wild-type and explored its transport properties.

AQPs are well-studied tetrameric water channel proteins36–41 present in microbes, plants, and animals. In mammalian cells39, fourteen isoforms have been identified, distributed across a wide range of cell types. AQP1 is the most studied isoform, so we chose it as a model water channel for our study. Each monomer comprises eight alpha helices: six are transmembrane in nature and two only partially span the membrane (helices 3 and 7). The pore that results from folding of this protein has an hour-glass structure with a constriction diameter of ~2.7 Å. The high rate of selective water permeation through AQPs, at ~3 billion water molecules/channel/s, makes them ideal for desalination membranes because any solute larger than a water molecule is rejected. However, the advantages of AQPs for high permeability and high selectivity membrane applications are still being debated due to questions regarding the long-term stability of this alpha helical protein42, the low density of proteins embedded in membranes43, and its pore wall chemistry, which forms hydrogen bonds with the permeating water molecules44 that may impede single channel osmotic permeability. Further, AQP-based membranes do not allow for selective removal of larger solutes in the sub-nm range.

To address the limitations of AQP-based biomimetic membranes and to further enhance the promise of membrane protein channel-based membranes, we put forth a predictive platform to computationally redesign the pore-constricting residues of a member of the highly stable outer membrane protein45 β-barrel family of channels present in bacteria46. These proteins have been extensively engineered and shown to be stable under varying temperature and chemical conditions47,48. In particular, we worked with the trimeric Escherichia coli protein outer membrane porin type F (OmpF, wild type pore size of ~11 Å) to attain desired pore sizes that could enable precise molecular separations. OmpF49 is mainly involved in transporting ions50, antibiotics51, small sugars52, polyamines52, and amino acids52. There are more than 50 isoforms of OmpF in gram-negative bacteria such as E. coli. OmpF also has an hour-glass shaped structure, similar to that of AQP1, but the pore constriction diameter of OmpF is ~4 times larger. The stability and mutation tolerance of OmpF53,54 make it a suitable candidate for computational redesign and subsequent experimental validation for performing separations at the angstrom scale. In addition, it can be easily assembled into stable two-dimensional crystals formed within block-copolymer membrane matrices55, which may be ideal for preparation of larger scale separation membranes26.

In this work, we outline PoreDesigner, a systematic workflow and predictive platform that utilizes the OmpF “scaffold” to design angstrom-scale pore sizes with specific solute selectivity and high osmotic permeabilities compared to aquaporins. This was accomplished by leveraging the protein design suite IPRO (Iterative Protein Redesign and Optimization)56 developed earlier, followed by the subsequent application of molecular dynamics (MD) simulations and validation using stopped-flow light scattering experiments. The core computational module of PoreDesigner restricts the modification of the pore constriction residues to long side-chain and hydrophobic amino acids and identifies an optimal set of rotational isomers (from a rotamer library) for the altered residues that avoid backbone and side-chain clashes using a mixed-integer linear optimization program (MILP). Long side-chain hydrophobic amino acids were selected with the dual objective of obtaining smaller pores with hydrophobic side chains that extend into the pore lumen to provide selectivity while maintaining high osmotic permeability, based on the hypothesis that reducing water-pore wall interactions will lead to increased permeability57,58.

Designs identified by the MILP problem were retained so long as (1) they minimized water wire to pore wall interactions, or (2) had pore sizes smaller than the desired size, using an interaction energy calculation check and a pore area estimation criterion. Structural investigation of the designs revealed three distinct topologies of pore geometries: (a) uniform pore closure designs (UCD) with a smaller but nearly co-centric pore eyelet diameter resulting from an orderly distribution of hydrophobic amino acids of similar side-chain size along the pore perimeter, (b) off-center pore closure designs (OCD), in which the pore center is displaced towards the perimeter compared to the wild type OmpF pore by placing long side-chain hydrophobic residues on one side of the pore and smaller ones on the other, and (c) cork-screw designs (CSD), which introduce a lateral twist as we proceed along the pore axis, stemming from alternately stacking long and short side-chain amino acids (see Figure 1). We were able to redesign the 7×11 Å OmpF WT pore to obtain an array of designs with varying pore size profiles sampling pore sizes across the 3–10 Å range and experimentally tested a subset of these designs in the 3–4 Å range critical for the most challenging separations. The permeabilities of tested designs ranged from 1.0 (±0.23) × 10⁻¹² cm³/s for the WT to 4.4 (±0.93) × 10⁻¹² cm³/s for the selected CSD design, compared to ~9 × 10⁻¹⁴ cm³/s estimated for AQP1 (ref. 59). CSD osmotic water permeabilities were not only higher than AQP1 by an order of magnitude but were also an order of magnitude higher than the highest reported permeabilities of any channel of the aquaporin family of proteins (2.4 (±0.47) × 10⁻¹³ cm³/s determined for AqpZ, ref. 43). Solute rejection capabilities of these channels were evaluated using stopped flow experiments with various solutes. These results demonstrate the potential of computationally tuning the pore diameter of OmpF in response to desired separations.
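As a rough consistency check on the numbers quoted above, a single-channel permeability in cm³/s can be converted into water molecules per second by dividing by the volume occupied by one water molecule. The sketch below assumes a molar volume of about 18 cm³/mol for liquid water and treats the permeability as the volumetric rate at unit dimensionless driving force; the helper name is hypothetical.

```python
AVOGADRO = 6.02214076e23             # molecules per mole
WATER_MOLAR_VOLUME_CM3 = 18.0        # cm^3 per mole (approximate, liquid water)


def water_molecules_per_second(permeability_cm3_per_s):
    """Convert a volumetric single-channel rate into molecules per second."""
    volume_per_molecule = WATER_MOLAR_VOLUME_CM3 / AVOGADRO  # cm^3 per molecule
    return permeability_cm3_per_s / volume_per_molecule


# Mean permeabilities quoted in the text (cm^3/s); error bars are ignored here.
reported = {
    "WT OmpF": 1.0e-12,
    "CSD": 4.4e-12,
    "AQP1": 9.0e-14,
    "AqpZ": 2.4e-13,
}

for name, value in reported.items():
    print(f"{name}: {water_molecules_per_second(value):.2e} molecules/s")
```

Under these assumptions WT OmpF corresponds to roughly 3 × 10¹⁰ molecules per second and AQP1 to roughly 3 × 10⁹, in line with the "over 10 billion" and "~3 billion" figures given in the text.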

1.3. Results

1.3.1. Recapitulating the water wire geometry of AQP1 in OmpF mutants

Aquaporins have been proposed to have the ideal internal pore geometry for selective and highly permeable water channels, but the pore wall interacts strongly with the permeating water wire60, indicating that it may be possible to enhance permeability at similar size ranges without sacrificing selectivity. All aquaporins have a conserved asparagine-proline-alanine sequence (known as the NPA motif) near the constriction region, in which the Asn interacts with the water wire by forming hydrogen bonds. These interactions impede the hydraulic permeability through AQP1. A recent study60 showed that in addition to the NPA motif there are twelve amino acids along the internal pore profile of AQP1 that can form hydrogen bonds with the water wire. Further, the number of hydrogen bonds between the water wire and the inner pore wall of AQPs was directly related to the single channel permeability of the pore. Our aim is to redesign the water channel such that it does not interact with the permeating water wire, thus eliminating hydrogen bonds in the central part of the channel, while retaining the water wire geometry. To discern the unique water wire configurations through AQP1, we examined individual frames from molecular dynamics simulations of AQP1 with water. Water wires from each simulation frame were isolated and clustered using a k-means approach. Four clusters representing four types of water wire geometries were observed. Even though all four were non-uniform helical water wires, members of the same cluster had pitch per turn and major and minor axis values that were more similar to each other than to those of members of another cluster. Each water wire was subsequently positioned inside the OmpF pore and PoreDesigner was used to alter the pore constricting residues to form the equivalent of a molecular “mold” around the water wire (see Figure 1a).

Figure 1. Water wires in the classical water channel, Aquaporin 1 (AQP1), were used as a template to redesign the outer membrane protein F (OmpF) channel pore resulting in three different types of designs. (a) The left panel shows a frame from an MD simulation of single file water permeation through AQP1. The pore wall residues capable of forming hydrogen bonds with the permeating water wire have been highlighted in yellow. The water wire is isolated with its geometry preserved and is thereafter placed in the OmpF pore. The pore constricting residues are altered such that they fill in all space around the water wire, forming a molecular mold of the selective internal geometry of AQP1 within the OmpF beta-barrel scaffold. (b) The three distinct internal pore geometries of OmpF that resulted from the employed redesign procedure included: (i) off-center pore closure design (OCD), (ii) uniform pore closure design (UCD), and (iii) cork-screw design (CSD).


An all-atom 10 ns MD simulation was performed61 to identify the various geometric poses assumed by the water wire when permeating through AQP1 (see Figure 1), and ~30,000 frames with ~1,000,000 water wire trajectories were collected. Thereafter, the principal geometric modes of water transport were determined by clustering similar water wires using a k-means clustering protocol. All the water wires were observed to assume an elliptic helix shape, which can be represented by three parameters: the major and minor axes of the ellipse and the pitch per turn. K-means clustering of the ~1,000,000 water wires yielded four unique water wire geometries.
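The clustering step described above can be sketched as follows: each water wire is reduced to a three-component descriptor (major axis, minor axis, pitch per turn) and grouped with k-means. This is a minimal pure-Python illustration and not the protocol used in this work; the toy descriptors are made up, and the deterministic farthest-point initialization is an assumption chosen so that the example is reproducible.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two descriptors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Component-wise mean of a non-empty list of descriptors."""
    return tuple(sum(column) / len(points) for column in zip(*points))


def farthest_point_init(points, k):
    """Assumed initialization: start at the first point, then repeatedly
    add the point that is farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(squared_distance(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Plain Lloyd iteration: assign to nearest centroid, then recompute means."""
    centroids = farthest_point_init(points, k)
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = mean_point(members)
    return centroids, labels


# Made-up descriptors in angstroms: (major axis, minor axis, pitch per turn).
water_wires = [
    (4.0, 2.0, 5.0), (4.1, 2.1, 5.1),
    (6.0, 3.0, 8.0), (6.1, 2.9, 8.2),
    (5.0, 4.0, 12.0), (5.1, 4.1, 11.9),
    (8.0, 5.0, 6.0), (7.9, 5.1, 6.1),
]
centroids, labels = kmeans(water_wires, k=4)
print(labels)
```

With real data the three descriptors may span different ranges, so scaling them before clustering could be worth considering; whether this was done here is not stated in the text.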

1.3.2. Categorization of the designs based on resultant internal pore geometry

Each of the four unique water wires was placed, one at a time, in the pore of the crystal structure of wild-type OmpF (2omf.pdb) and used as input to PoreDesigner. The porin and the water wire were then aligned such that the constriction center was at the origin and the longitudinal pore axis coincided with the Z-axis. The pore constricting residues fill up the annular space using hydrophobic, long side-chain amino acids with the objective of designing a narrow yet hydrophobic pore with minimal interaction with the water molecules. An explicit constraint was imposed to ensure that the distance between any atom of a pore constriction residue side chain and the water wire oxygen was greater than the sum of their van der Waals radii (excluding hydrogen atoms). This precludes designs in which pore constriction residue side chains clash with the water wire. PoreDesigner reduces binding with the central water wire by maintaining their respective interaction energies at their maximum, replacing the pore constriction residues of wild-type OmpF with long side-chain, hydrophobic amino acids such as tryptophan (Trp), phenylalanine (Phe), and tyrosine (Tyr). By imposing a minimum required percentage (50%) of long side-chain residues in the redesign, we safeguard against trivial designs involving all alanines or valines that would be too far away to interact with the water wire. We appended a design assessment step at each iteration that accepts only designs whose constriction diameter is less than 4 Å.
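The clash constraint and the pore size acceptance test described above can be sketched as follows. This is a simplified illustration, not PoreDesigner code: the function names and the coordinate format are assumptions, and the radii are the commonly tabulated Bondi van der Waals radii for heavy atoms.

```python
import math

# Bondi van der Waals radii in Å (heavy atoms only, hydrogens excluded).
VDW_RADIUS = {"C": 1.70, "N": 1.55, "O": 1.52, "S": 1.80}

def clashes_with_water(side_chain_atoms, water_oxygens):
    """Return True if any side-chain heavy atom lies within the sum of the
    van der Waals radii of any water wire oxygen.

    side_chain_atoms: list of (element, (x, y, z)) tuples, coordinates in Å.
    water_oxygens:    list of (x, y, z) tuples, coordinates in Å.
    """
    for element, xyz in side_chain_atoms:
        for oxygen in water_oxygens:
            if math.dist(xyz, oxygen) <= VDW_RADIUS[element] + VDW_RADIUS["O"]:
                return True
    return False

def accept_design(side_chain_atoms, water_oxygens, constriction_diameter, max_diameter=4.0):
    """Accept a design only if it is clash-free and narrower than max_diameter (Å)."""
    return (not clashes_with_water(side_chain_atoms, water_oxygens)
            and constriction_diameter < max_diameter)

# Hypothetical example: one water oxygen at the origin, one side-chain carbon.
water = [(0.0, 0.0, 0.0)]
far_carbon = [("C", (3.5, 0.0, 0.0))]   # 3.5 Å > 1.70 + 1.52 Å, so no clash
near_carbon = [("C", (3.0, 0.0, 0.0))]  # 3.0 Å < 3.22 Å, so a clash
```

In this sketch a design with `far_carbon` and a 3.8 Å constriction would be accepted, while `near_carbon` or a 4.2 Å constriction would be rejected.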

PoreDesigner yielded 40 different OmpF designs with pore sizes less than 4 Å. Analysis of the engineered designs revealed that they conform to three categories (see Figure 1) based on the resultant internal pore geometry: (1) only Trp/Phe mutations resulting in a narrower but off-center pore lumen (OCD: off-center pore closure design), (2) a smaller co-centric pore with the bulky groups (such as Trp, Phe, and Tyr) interspersed with less bulky alanines and valines arranged in a single plane (UCD: uniform pore closure design), and (3) a regular pattern of larger and smaller side groups along the pore profile resulting in an internal pore geometry that involves a twist (CSD: cork-screw design). There were two off-center (OCD) designs, for which we restricted the mutation of the 25 pore constriction residues to (a) only Phe (TFPhe) or (b) only Trp (TFTrp). A biased distribution of bulky groups (Phe/Trp) towards one side of the pore periphery resulted in a smaller pore with its center away from the large residues. The presence of twenty-five Phe or Trp residues led to steric clashes, forcing most of the Phe/Trp side chains to face away from the pore lumen (see Figure 2). Although the non-lumen-facing Phe/Trp residues did not contribute to the pore size reduction, they enhanced the inner pore wall hydrophobicity. In the remaining 38 designs, all hydrophobic long side-chain amino acids were permitted. These designs generally sample smaller pore sizes by placing long side-chain residues interspersed with short amino acids (generally alanines, valines, and leucines).

These designs eliminate the Phe-Phe/Trp-Trp side-chain steric clashes often seen in the OCD designs (see Figure 1b). We identify these designs as UCD designs. However, a few of these 38 designs conform to the third type (CSD). We identified 31 UCD and seven CSD designs. We chose the design with the smallest predicted pore size from each type for subsequent hydraulic permeability and solute rejection experiments and molecular dynamics simulations. The predicted pore constriction dimensions after MD simulations for the three selected designs were 3.54 × 3.25 Å, 3.18 × 3.12 Å, and 3.05 × 3.01 Å for the OCD, CSD, and UCD protein designs, respectively. These OmpF mutant proteins and the wild type OmpF protein were produced by expression from synthetic genes cloned into the pET23a(+) expression plasmid vector and transformed into the E. coli BL21(DE3) Omp8 Rosetta (ΔlamB ompF::Tn5 ΔompA ΔompC) mutant strain. The purified proteins were incorporated into liposomes for assessment of single channel water permeability and solute passage as described in subsequent sections.

Figure 2. The OCD-TFTrp design (left) shows a steric clash between adjacent tryptophans, resulting in a less occluded and larger pore than expected. In a UCD design (right), however, an R82L mutation alleviates a steric clash with Trp62 (unlike OCD-TFTrp). UCD designs are seen to intersperse smaller side chain hydrophobic amino acids between longer ones so that their side chains face the pore lumen, resulting in smaller pore sizes.

Before testing the three designs for water transport, the Kyte-Doolittle (KD) hydrophobicity scores of the three designs were calculated (see Methods) and contrasted with that of wild type OmpF (see Table 1). Their respective hydrophobicity values were computed by summing the transfer free energy62 ($\Delta G_{transfer}^{water \rightarrow ethanol}$, from water to ethanol) of each of the amino acid side chains that constitute the inner and outer pore wall. Lou et al.47 define the inner and outer pore wall residues as those with side chains protruding into and from the beta-barrel, respectively. The relative order of inner pore wall hydrophobicities was seen to be CSD > OCD > UCD > wild type OmpF. We hypothesized that the experimentally measured single channel permeabilities would follow the same order as the inner pore wall hydrophobicity, based on our design principle of eliminating water-pore wall interactions to enhance permeability.

Table 1. Ranges of values for the inner pore wall, outer pore wall, and overall free energies of transfer from water to ethanol (which serve as a surrogate for hydrophobicity scores) of the three selected OmpF mutants in contrast with the wild type. All scores are $\Delta G_{transfer}^{water \rightarrow ethanol}$ in kcal mol$^{-1}$.

| Designs | Inner pore wall hydrophobicity score* | Outer pore wall hydrophobicity score | Overall hydrophobicity score |
|---|---|---|---|
| CSD | -97.7 | -76.5 | -174.2 |
| OCD | -69.2 | -81.2 | -150.4 |
| UCD | -65.6 | -71.3 | -136.9 |
| wild type | -61.3 | -68.5 | -129.8 |

* A higher negative value represents a higher hydrophobicity62. This scale reports the free energies of transfer of different amino acid side chains from the water phase to ethanol. As a result, the hydrophobic amino acids have a lower free energy of transfer than charged amino acids.

The hydrophobicity trends reveal that the OCD-TFTrp mutant has the highest estimated outer pore wall hydrophobicity. This is possibly due to the steric clashes between contiguous tryptophans (see Figure 2), where the majority of the 25 pore-constricting tryptophans are forced away from the lumen.
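As a quick consistency check on the values reported in Table 1, the short sketch below confirms that each overall score is the sum of the inner and outer pore wall scores and ranks the designs by hydrophobicity (most negative first). The dictionary layout and variable names are illustrative. The numbers are transcribed from Table 1.

```python
# (inner, outer, overall) transfer free energies in kcal/mol, from Table 1.
table1 = {
    "CSD": (-97.7, -76.5, -174.2),
    "OCD": (-69.2, -81.2, -150.4),
    "UCD": (-65.6, -71.3, -136.9),
    "wild type": (-61.3, -68.5, -129.8),
}

# Overall score should equal inner + outer (within rounding of the table).
sums_consistent = all(
    abs((inner + outer) - overall) < 0.05
    for inner, outer, overall in table1.values()
)

# A more negative value means a more hydrophobic pore wall.
inner_rank = sorted(table1, key=lambda name: table1[name][0])
outer_rank = sorted(table1, key=lambda name: table1[name][1])
```

Here `inner_rank` reproduces the order CSD > OCD > UCD > wild type stated in the text, and `outer_rank` places OCD first, in line with the observation about the outer pore wall.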

In addition, we used PoreDesigner to predict designs that span the remaining 4–10 Å range. The overall goal is to precisely match any desired pore size needed for separations spanning the sub-nm (3–10 Å) range. PoreDesigner was accordingly modified to accept only pore designs with a pore constriction diameter within a pre-specified range between Dmin and Dmax. For example, setting Dmin and Dmax to 5 Å and 6 Å, respectively, yields OmpF designs with pores predicted to be within this range. We identified 17 new designs (see Figure 3), with at least two designs within each 1 Å pore size bin from 4 Å to 9 Å. Using the aforementioned structural classification scheme, these comprise ten OCD, three UCD, and four CSD designs. Generally, the smaller the desired pore size, the higher the number of required mutations. OCD type designs were the most prevalent, spanning almost the entire sub-nm range (see Figure 3b), whereas CSD and UCD type designs were limited to the middle of the sub-nm spectrum.
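The Dmin/Dmax acceptance window and the 1 Å binning of designs can be sketched as below. The function names, the half-open bin convention, and the example diameters are hypothetical and serve only to illustrate the bookkeeping.

```python
def in_window(diameter, d_min, d_max):
    """Accept a design only if its constriction diameter (Å) lies in [d_min, d_max]."""
    return d_min <= diameter <= d_max

def bin_designs(diameters, start=4.0, stop=9.0, width=1.0):
    """Count designs per bin [edge, edge + width), keyed by the lower bin edge."""
    n_bins = int(round((stop - start) / width))
    counts = {start + n * width: 0 for n in range(n_bins)}
    for d in diameters:
        if start <= d < stop:
            index = int((d - start) // width)
            counts[start + index * width] += 1
    return counts

# Hypothetical predicted pore diameters in Å (not values from this work).
example_diameters = [4.3, 4.8, 5.1, 5.6, 6.2]
example_counts = bin_designs(example_diameters, start=4.0, stop=7.0)
```

With these made-up diameters, the 4 to 5 Å and 5 to 6 Å bins each hold two designs and the 6 to 7 Å bin holds one.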


Figure 3. Twenty OmpF mutants, spanning the entire sub-nm range were designed. (a) Plot of the number of mutations vs. pore diameter for 20 mutants (including three mutants that were validated experimentally before MD simulations). The general trend indicates that the smaller the desired pore the greater the number of mutations required. (b) Plot of the number of designs for each pore size and type classification (color coded).

1.3.3. Experimental validation of pore designs

The wild type OmpF has a pore size of around 7 × 11 Å63 and a molecular weight cutoff of around 600 Da64,65. We redesigned OmpF to target a range of sub-nm pore sizes as discussed above. We selected one mutant from each class of designs targeting a sub-4 Å pore size and measured their single channel permeability and solute passage rates experimentally. The three OmpF mutants that we chose were OCD, with an in silico estimated pore size of 3.25 Å (minor axis of the elliptical pore cross-section), CSD, with an estimated pore size of 3.12 Å, and UCD, with an estimated pore size of 3.01 Å.


OmpF can be reconstituted into lipid vesicles66 and allows passive diffusion of small molecules across the membrane. We characterized the influx of solutes of different molecular weights through wild type OmpF and its mutants under hypertonic conditions using stopped flow light scattering measurements67, and compared their solute transport trends. All OmpF mutants were reconstituted into L-α-phosphatidylcholine (PC)/L-α-phosphatidylserine (PS) vesicles using a detergent destabilization method at a lipid to protein mass ratio of 400 (LPR400)43. The proteoliposomes were rapidly mixed with hypertonic solutions in the mixing cell of a stopped flow setup, with NaCl, glycine, glucose, sucrose, and polyethylene glycol 600 (PEG600) as osmolytes. All experiments were conducted with proteoliposomes with a polydispersity index of <0.2, leading to a signal to noise ratio of >50 in the stopped flow curves that were used for permeability calculation. This is expected to provide high reliability in the parameters calculated from this experiment4.

Solute rejection of OmpF mutants.

In the method employed, proteoliposomes are subjected to a hyperosmotic shock, and time dependent light scattering data are collected upon mixing of the osmolyte and proteoliposomes. The resulting light scattering profile can be used to determine solute exclusion as well as water permeability of the incorporated proteins. As shown in Figure 4a, both when the solute is completely excluded by the channel (solute exclusion model) and when there is some solute leakage through the channel (solute permeable model), water flows outward from the vesicles during the first stage of the mixing process because of the high osmolarity outside the vesicles, leading to vesicle shrinkage. This shrinkage leads to an increase in light scattering intensity measured at 90 degrees to the incident light due to constructive interference of scattered light. This is because vesicles with a size comparable to the wavelength of light stop acting like point particles and show an increasing trend in scattering intensity with decreasing volume68 at the scattering angle used for measurements (90°). During the second stage of the mixing process (Figure 4b), the light scattering intensity trend changes based on whether or not the solute can diffuse through the porin67. When the solute molecular size is larger than the porin pore size (Figure 4b, solute exclusion model), water continues flowing outward from the vesicles until the osmotic equilibrium dictated by the solute concentration is reached, and the light scattering intensity levels off as measurement time increases. However, if the solute size is smaller than the porin pore size (Figure 4b, solute permeable model), in the second stage solutes diffuse through the porins and lead to a corresponding influx of water into the vesicles, which is observed as a decrease in light scattering intensity67. Based on the observed change in light scattering intensity, we can estimate the solute rejection trends of porins.

We estimated the approximate molecular weight limit at which solute rejection occurred for WT OmpF and the three OmpF mutants. For WT OmpF, the light scattering intensity decreased in the second stage when WT OmpF reconstituted liposomes were exposed to hypertonic solutions containing NaCl, glycine, glucose, or sucrose. This indicates that WT OmpF is permeable to these solutes. The light scattering intensity leveled off when the proteoliposomes were exposed to PEG600 containing hypertonic solutions (Figure 4c). This observation demonstrates that WT OmpF can reject PEG600 (600 Da) or larger molecules, which is consistent with previous reports49.

For UCD, the light scattering intensity leveled off when the liposomes were exposed to any of the solutes used, including NaCl (58.5 Da), leading us to conclude that this mutant can substantially reject molecules of 58.5 Da or larger (Figure 4d). We also estimated the approximate molecular weight exclusion limit for OCD and CSD. Based on the solute rejection experiments above, we estimated the molecular exclusion limits of the wild type and the three mutants to follow the sequence: wild type (~600 Da) > OCD (~342 Da) > CSD (~180 Da) > UCD (~58 Da) (see Figure 4e), which follows the same trend as the designed pore sizes. Thus, small molecule separation membranes can be developed by selecting OmpF mutants of different pore sizes for biomimetic membranes.

The mutant with the smallest pore size, UCD, has ionic solute rejection properties similar to aquaporins (while not excluding protons), and can be selected as a candidate protein for developing membrane protein based biomimetic desalination membranes.
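The way an approximate exclusion limit follows from the stopped flow observations can be sketched as follows: the limit is taken as the molecular weight of the smallest osmolyte whose light scattering curve levels off (that is, the smallest rejected solute). The function name is hypothetical, and only the WT and UCD observations stated in the text are used as examples.

```python
# Molecular weights (Da) of the osmolytes used in the stopped flow experiments.
SOLUTE_MW = {
    "NaCl": 58.5,
    "glycine": 75.1,
    "glucose": 180.2,
    "sucrose": 342.3,
    "PEG600": 600.0,
}

def exclusion_limit(rejected_solutes):
    """Approximate molecular weight exclusion limit: the molecular weight of the
    smallest rejected solute, or None if no tested solute was rejected."""
    if not rejected_solutes:
        return None
    return min(SOLUTE_MW[name] for name in rejected_solutes)

# WT OmpF: only the PEG600 curve leveled off.
wt_limit = exclusion_limit({"PEG600"})
# UCD: the curves leveled off for every solute tested, including NaCl.
ucd_limit = exclusion_limit(set(SOLUTE_MW))
```

Because only five osmolytes were tested, a limit obtained this way is bracketed by the tested molecular weights rather than resolved exactly.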


Figure 4. Osmotic shock stopped flow light scattering experiments were used to demonstrate permeabilities an order of magnitude or more higher than those of aquaporins for WT OmpF protein and its mutants, as well as the solute retention trends seen in OmpF protein mutants. (a) When OmpF (or OmpF mutant) containing proteoliposomes are mixed with hypertonic solutions, two different transport models can be observed depending on whether or not the porin is permeable to the solute. (b) In the stopped flow setup, for the solute exclusion model, the normalized light scattering intensity levels off during the “second stage” as there is no inflow of water and solutes; for the solute permeable model, the normalized light scattering intensity decreases during the second stage due to the inflow of water and solutes. (c) OmpF (WT) rejects PEG600 (600 Da) and larger molecules, and thus only the PEG600 curves show no decreasing portion. (d) UCD rejects NaCl (58.5 Da) and larger molecules, as there is no decreasing portion of the stopped flow curve for any of the solutes tested. (e) Summary of the estimated solute rejection (light bars) and single channel permeability (dark bars) of OmpF WT and the three OmpF mutants. The two y-axes represent permeability (black left y-axis) and the molecular weight cutoff (red right y-axis). Curves shown in panels c and d are averages of 6-10 traces from each stopped flow light scattering experiment. Each experiment was conducted at least three times with independent vesicle preparations.

Single channel permeability of OmpF mutants.

Recent literature has emphasized the importance of membrane design efforts that lead to high selectivity while maintaining or increasing current membrane permeabilities19. The solute rejection experiments showed that high molecular selectivity can be achieved by designing OmpF mutants with different pore sizes through the PoreDesigner workflow. In addition to estimating selectivity, we also evaluated OmpF mutant permeabilities by determining the single channel permeability of each mutant.

The permeability values of vesicle membranes containing various mutant OmpF proteins were calculated from the light scattering intensity curves obtained from stopped flow light scattering experiments conducted with the completely excluded solute PEG600 as the osmotic agent for all mutants. By fitting the normalized light scattering intensity curve to a double exponential curve, similar to that used in previous work26, we obtained a rate constant k (the larger constant from the double exponential fit). This rate constant was then used to calculate the osmotic water permeability (Pf) using the following equation26:

$$P_f = \frac{k}{\left(\frac{S}{V_o}\right) \times \Delta\pi \times V_w}$$

where $S$ is the initial surface area of the OmpF reconstituted vesicles, $V_o$ is the initial volume of the OmpF reconstituted vesicles, $\Delta\pi$ is the osmotic gradient across the lipid bilayers, and $V_w$ is the molar volume of water.
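A minimal numerical sketch of the permeability equation is given below. It assumes spherical vesicles (so that the surface to volume ratio is 3/r), treats the osmotic gradient as an osmolarity difference in mol/cm3, and uses 18 cm3/mol for the molar volume of water. The rate constant, vesicle radius, and gradient in the example are hypothetical values, not measurements from this work.

```python
import math

def osmotic_permeability(k, radius_cm, delta_osm_mol_per_cm3, v_w=18.0):
    """Osmotic water permeability P_f in cm/s.

    k:                      rate constant from the double exponential fit (1/s)
    radius_cm:              initial vesicle radius (cm), spherical vesicle assumed
    delta_osm_mol_per_cm3:  osmotic gradient expressed as an osmolarity difference
    v_w:                    molar volume of water (cm^3/mol)
    """
    surface = 4.0 * math.pi * radius_cm ** 2            # S
    volume = (4.0 / 3.0) * math.pi * radius_cm ** 3     # V_o
    return k / ((surface / volume) * v_w * delta_osm_mol_per_cm3)

# Hypothetical inputs: k = 100 1/s, 100 nm radius, 0.3 osmol/L gradient.
example_pf = osmotic_permeability(k=100.0, radius_cm=1e-5, delta_osm_mol_per_cm3=3e-4)
```

The units work out as (1/s) / ((1/cm) × (cm3/mol) × (mol/cm3)) = cm/s, and the result scales linearly with the fitted rate constant.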

Figure 4a shows light scattering curves obtained from vesicle membranes with reconstituted WT and mutant OmpF proteins at LPR400, which resulted in calculated net permeabilities between 2049 and 3411 cm3/s. Because net permeability can depend on both the number of proteins reconstituted and the single channel permeability43, we calculated the single channel permeability of wild type OmpF and its mutants for a more accurate comparison between mutants. Single channel permeability was calculated by combining the net permeability from stopped flow light scattering measurements with the number of proteins inserted per vesicle, determined from fluorescence correlation spectroscopy (FCS) experiments. This approach, pioneered by Pohl and coworkers69, has been used successfully to calculate the single channel permeability of Aquaporin Z43,70 and peptide-appended pillar[5]arene (PAP) channels71. OmpF proteins were first labeled using a pyrylium dye, which has been shown to have a detectable fluorescence signal only after conjugation with proteins43,72. We reconstituted these labeled OmpF proteins into vesicles and performed FCS to first determine the number of vesicles (Nves) by fitting the autocorrelation function obtained from FCS measurements of these fluorescent vesicles. We then added the membrane protein compatible detergent octyl glucoside (OG) to the same vesicle solutions (final OG concentration of 2.5%) to break down the vesicles into protein-detergent micelles. We could thus calculate the number of proteins (Npro) using the same method, by conducting FCS measurements on these solubilized proteins and assuming one protein trimer per micelle, similar to what has been reported for aquaporins70. Taking the ratio of the number of proteins to the number of vesicles gives the average number of proteins per vesicle (Npro/Nves). Combining the vesicle permeability and the average number of OmpF proteins per vesicle, we calculated the average single channel permeability of OmpF and its mutants (see Figure 4e). CSD had the highest single channel permeability, followed by OCD and UCD, which have similar single channel permeabilities, and all three mutants have single channel permeabilities higher than WT OmpF. This serves as experimental corroboration of the water permeation rates predicted from the inner pore wall hydrophobicities. Compared to aquaporins, the single channel permeabilities of wild type OmpF and its mutants are at least an order of magnitude higher73: the highest single channel water permeability among the OmpF mutants is ~18 times that of the E. coli AqpZ, which was measured using the same platform43, and ~49 times that of AQP1 reported in the literature59.
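The combination of vesicle permeability with FCS counts can be sketched as below. This is one plausible form of the calculation under stated assumptions: spherical vesicles, an optional background permeability of the protein-free lipid bilayer (set to zero by default), and purely hypothetical input values. The function names are illustrative.

```python
import math

def proteins_per_vesicle(n_pro, n_ves):
    """Average number of channels per vesicle from FCS counts (Npro / Nves)."""
    return n_pro / n_ves

def single_channel_permeability(p_f, radius_cm, n_per_vesicle, p_f_lipid=0.0):
    """Single channel permeability in cm^3/s.

    p_f:           vesicle osmotic permeability with protein (cm/s)
    radius_cm:     vesicle radius (cm), spherical vesicle assumed
    n_per_vesicle: average number of channels per vesicle
    p_f_lipid:     permeability of protein-free vesicles (cm/s), assumed 0 here
    """
    area = 4.0 * math.pi * radius_cm ** 2
    return (p_f - p_f_lipid) * area / n_per_vesicle

# Hypothetical example values (not measurements from this work).
n_avg = proteins_per_vesicle(n_pro=50.0, n_ves=10.0)
example_p_single = single_channel_permeability(p_f=0.05, radius_cm=1e-5, n_per_vesicle=n_avg)
```

Normalizing by the number of channels per vesicle is what makes mutants comparable even when their reconstitution efficiencies differ.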

1.3.4. Molecular dynamics simulation of the pore designs

Using the all-atom MD method, we independently assessed the osmotic permeability of the wild type OmpF and the three experimentally verified designs, starting from the molecular configurations suggested by PoreDesigner. The monomeric proteins were patched to form trimers using a VMD74 plugin, set in a lipid bilayer, and solvated in 1 M NaCl solution (see Figure 5a). Osmotic permeabilities were evaluated (from the rate of water displacements75) and averaged over the monomers of the trimeric molecule, and the variabilities during the last 30 ns of the 35 ns simulation were reported (see Figure 5b). The MD-computed osmotic permeabilities of OmpF and the three designs corroborate the single channel permeability trend seen in the stopped-flow light scattering experiments. The highest permeability of CSD reaffirms the applicability of inner pore wall hydrophobicity scores as a surrogate for predicting relative channel permeabilities.

Figure 5. MD simulations of OmpF provide support for experimentally observed permeability and selectivity trends. (a) Typical simulation system. (Top) Cut-away view of the system revealing a transmembrane water passage through an OmpF monomer. The OmpF monomer is depicted in purple, the lipid bilayer in cyan, water molecules as red and white spheres, and Na+ and Cl– ions as orange and green spheres, respectively. (Bottom) Top-down view of the system. The OmpF trimer is drawn using a cartoon representation and the lipid bilayer as cyan bonds; water and ions are not shown. (b) Simulated osmotic permeability of OmpF variants (red) and the corresponding experimental values (gray). (c) Ionic conductance of OmpF trimers obtained from applied field simulations under a 500 mV transmembrane voltage. (d) Water occupancy of OmpF variants. The green volume depicts the average location of water molecules in each channel, characterized as a 0.3 g/cm3 isosurface of water oxygen density. For reference, each channel is shown using a semitransparent cartoon representation. (e) Major axis dimensions of the pores measured by PoreDesigner before MD (gray) and from the last 100 frames of MD (red). The error bars represent standard deviations. The 0.4 nm line represents the PoreDesigner design constraint of identifying pore designs smaller than 0.4 nm. (f) The average number of hydrogen bonds made between water and an OmpF monomer in each of the regions depicted in panel d.

The ionic conductances of the WT and mutant pores were evaluated by simulating the systems under a transmembrane voltage of 500 mV (Figure 5c). The ionic conductance was determined by averaging instantaneous displacements of ions over the last 25 ns of the 35 ns MD trajectories76. All three mutants exhibited negligibly low conductances, about an order of magnitude lower than that of wild type OmpF, with the CSD mutant being the most conductive of the three, as expected from solute rejection experiments. While solute rejection experiments identified UCD to be more restrictive towards salt than OCD, the MD simulations, as conducted, are not conclusive regarding the difference between the two mutants (Figure 5c).
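A common way to turn ion displacements into a current is to sum the charge-weighted displacements of all ions along the membrane normal between consecutive frames and divide by the frame interval and the box length, then divide the mean current by the applied voltage. The sketch below implements that estimator under stated assumptions: unwrapped z coordinates, a fixed box length, and a made-up two-ion trajectory. The exact procedure of the cited reference may differ in detail.

```python
E_CHARGE = 1.602176634e-19  # elementary charge in coulombs

def mean_current_amperes(charges, z_frames, dt_ns, box_length_z):
    """Mean ionic current in amperes.

    charges:      ion charges in units of e
    z_frames:     list of frames, each a list of unwrapped z coordinates
                  (same length unit as box_length_z)
    dt_ns:        time between consecutive frames (ns)
    box_length_z: simulation box length along z
    """
    currents = []  # instantaneous currents in e/ns
    for z_prev, z_next in zip(z_frames[:-1], z_frames[1:]):
        q_dz = sum(q * (b - a) for q, a, b in zip(charges, z_prev, z_next))
        currents.append(q_dz / (dt_ns * box_length_z))
    mean_e_per_ns = sum(currents) / len(currents)
    return mean_e_per_ns * E_CHARGE / 1e-9  # convert e/ns to C/s

def conductance_nS(current_amperes, voltage_volts):
    """Conductance in nanosiemens from a current and the applied voltage."""
    return current_amperes / voltage_volts * 1e9

# Hypothetical trajectory: one cation moving up and one anion moving down by
# 1 Å per frame in a 100 Å box, with frames saved every 0.01 ns.
charges = [+1.0, -1.0]
z_frames = [[0.0, 50.0], [1.0, 49.0], [2.0, 48.0]]
example_current = mean_current_amperes(charges, z_frames, dt_ns=0.01, box_length_z=100.0)
example_conductance = conductance_nS(example_current, voltage_volts=0.5)
```

Cations and anions moving in opposite directions contribute with the same sign, which is why the charge weighting matters.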

Further analysis of the MD trajectories identified the inner volume of the WT and mutant pores accessible to water (see Figure 5d). The volumes were determined by averaging water density over the last 10 ns of the MD simulations carried out under a 500 mV bias. The volume density maps reveal the UCD mutant to have the narrowest pore constriction, which correlates with the best solute rejection performance of that mutant. Figure 5e shows the major pore constriction diameters as measured using PoreDesigner (before MD simulations were performed) along with those observed during the 5 ns of MD. The closely packed side chains of OCD and UCD allowed only marginal movement of pore constriction amino acid side chains, resulting in less variability in the pore size during the course of MD. However, the bulky groups of CSD are stacked in different planes, allowing some movement of the pore constriction side chains and leading to higher variability in pore sizes during water permeation. The same PoreAnalyzer module that was used in PoreDesigner was used for assessing the pores from the MD trajectories.

To assess the relative hydrophobicity of the mutant pores, we determined the average number of hydrogen bonds between the OmpF monomer and water located in each of the three regions of the pores (defined in Figure 5d). A hydrogen bond was reported if the water molecule was within 0.3 nm of a protein atom capable of forming a hydrogen bond, as described by Ireta et al.77 and Durrant et al.78, and if the protein atom–water hydrogen–water oxygen angle was 20 degrees or less. The numbers of hydrogen bonds were averaged over the entire 35 ns MD trajectories, with error bars showing standard deviations (see Figure 5f). The constriction region (see Figure 5f) shows a progressively decreasing number of hydrogen bonds as the number of mutations in the constriction region increases and the pore is occluded by more hydrophobic residues. This attests to the effectiveness of PoreDesigner’s design objective of replacing pore constriction residues with hydrophobic ones to limit pore wall–water wire interactions in order to arrive at designs with high single channel permeabilities while tuning size selectivity.
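The geometric hydrogen bond test can be sketched as follows for the case where water acts as the donor. The sketch assumes that the 20 degree cutoff refers to the deviation of the protein atom–water hydrogen–water oxygen angle from linearity (180 degrees), which is an interpretation and is not spelled out in the text. Coordinates are in nm and all example positions are hypothetical.

```python
import math

def angle_deg(a, vertex, b):
    """Angle a-vertex-b in degrees."""
    v1 = [x - y for x, y in zip(a, vertex)]
    v2 = [x - y for x, y in zip(b, vertex)]
    dot = sum(p * q for p, q in zip(v1, v2))
    norm = math.sqrt(sum(p * p for p in v1)) * math.sqrt(sum(q * q for q in v2))
    cosine = max(-1.0, min(1.0, dot / norm))  # clamp against rounding error
    return math.degrees(math.acos(cosine))

def is_hydrogen_bond(protein_atom, water_h, water_o, max_dist_nm=0.3, max_dev_deg=20.0):
    """True if the water oxygen is within max_dist_nm of the protein atom and the
    protein atom-H-O arrangement deviates from linear by at most max_dev_deg."""
    if math.dist(protein_atom, water_o) > max_dist_nm:
        return False
    deviation = 180.0 - angle_deg(protein_atom, water_h, water_o)
    return deviation <= max_dev_deg

# Hypothetical geometries (nm): water O at the origin, H along +x.
water_o = (0.0, 0.0, 0.0)
water_h = (0.096, 0.0, 0.0)
linear_acceptor = (0.28, 0.0, 0.0)   # in line with O-H and within 0.3 nm
bent_acceptor = (0.0, 0.28, 0.0)     # close enough, but far from linear
distant_acceptor = (0.35, 0.0, 0.0)  # linear, but beyond 0.3 nm
```

Counting the frames and water molecules for which `is_hydrogen_bond` is true in each pore region, and averaging over the trajectory, would give a per-region hydrogen bond count of the kind reported in Figure 5f.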

1.4. Discussion

Ultra-permeable membranes (dense solution-diffusion-based or channel-based) have emerged as a promising alternative to energy intensive separations, including desalination and water purification. AQPs are ideal candidates for channel-based membranes owing to their high permeabilities and selectivities, but the range of solutes that can be separated is limited. Here, we have put forth a computational (i.e., PoreDesigner) and experimental workflow that relies on the mechanically, chemically, and mutationally stable beta-barrel scaffold of OmpF to construct channel-based membranes for aqueous-phase separations of specific solutes while improving upon the single channel permeabilities of AQPs, thus expanding the selectivity range of potential biomimetic membranes. A simple theoretical analysis of the ion conductance targets for membranes based on OmpF channels was conducted. This analysis reveals that, for such membranes to be usable for seawater desalination, the estimated maximum total conductance is 0.018 nS (assuming a feed of 35 g/L NaCl). Similarly, for brackish water desalination this maximum is 0.034 nS (assuming a feed of 5 g/L NaCl), and for low salinity wastewater it is 0.056 nS (assuming a feed salinity of 2 g/L NaCl). All these conductance targets can be met by the derived UCD and OCD designs, as their simulated conductance values are in the requisite range. However, it is important to note that the error bars on these calculations are quite wide, implying that direct experimental validation is still needed.

The use of OmpF provides two distinct advantages over the use of AQPs for the envisioned applications in aqueous separations. First, AQPs are arguably overdesigned for water desalination, as they remove protons along with other monovalent and divalent ions, while OmpF can be designed to pass protons while removing other ions. The requirement to reject protons imposes the need for hydrogen bonding between translocating water molecules and the pore wall. Recent studies have indicated that an ideal water-conducting pore could transport water more efficiently if all hydrogen bonding between waters and the central section of the pore is eliminated60. Changing these residues (especially the key conserved residues of the NPA motif) in AQPs leads to lowered AQP permeability79 (see for example Yong et al., IUBMB Life, 2009). OmpF provides an excellent platform to tune (or potentially completely eliminate) hydrogen bonding. Second, the higher permeability of OmpF over AQPs can be advantageous for water purification and desalination in specific instances where space is at a premium or where energy savings can be substantial. Ultrapermeable membranes with high salt rejection appropriate for reverse osmosis (RO), such as those based on OmpF, have the potential to substantially reduce energy use (~45%) or plant infrastructure (pressure vessels, up to 65%) in low salinity streams9 such as brackish water desalination and water reuse. The energy advantage is lower for high salinity seawater applications (15% less energy), but the plant size reduction remains significant (i.e., 44%9, 10). Sub-nm pore size membranes for nanofiltration (NF) and ultrafiltration (UF) have diverse applications in water treatment,11, 12, 13 food production and processing,14, 15, 16 and energy applications17, 18, 19, which will also benefit from energy and capital cost reduction.

We demonstrated that PoreDesigner can precisely design sub-nm pore sizes within the stable beta barrel of a bacterial channel porin without jeopardizing the channel stability. PoreDesigner provides another powerful demonstration of de novo protein design on a class of proteins that so far has not been the subject of systematic protein design, unlike enzymes80, antibodies81,82, binding proteins56, or protein interfaces83. We experimentally validated our designs and showed excellent solute selectivity for a range of solutes while maintaining high permeabilities. We subsequently expanded the range of designs obtained by PoreDesigner to pore sizes spanning the entire sub-nm (3-10 Å) spectrum in 1 Å bins (see Figure 3). OCD designs were seen to span the entire sub-nm spectrum, whereas CSD and UCD designs were limited to the middle of the sub-nm spectrum. While these latter designs have not yet been experimentally tested and need to be further validated, the successful computation-driven designs for the 4 Å range suggest that the proposed framework lays the foundation of a new paradigm for membrane-based sub-nm aqueous separations, with applications ranging from desalination and vesicle-mediated drug delivery to separating solutes of biochemical importance with marginal differences in size. Moving beyond aquaporins, we believe that PoreDesigner could be applied to tune pore size and geometry for any other porin system with a known structure.

1.5. Materials and Methods

PoreDesigner implementation

PoreDesigner used results from molecular dynamics simulations (all-atom, 10 ns, with a 2 fs timestep) of pressure driven water transport through a tetrameric AQP1-membrane assembly to isolate water wires, which were then used to constrain the OmpF redesign process using a procedure in which the interaction energy between the water wire and the mutated residues was maximized. Further details, including a step-by-step procedure, are provided below.

The PoreDesigner iterative redesign cycle largely follows the sequence of steps defined in IPRO3.

Step 1: Backbone perturbation of an 11-amino acid window with a randomly chosen design position (DP) as the sixth (central) amino acid is performed. The side chains are stripped, and backbone phi and psi dihedral angles are randomly perturbed using values drawn from a normal distribution (µ = 0, σ = 1.5º).

Step 2: The amino acid side chains inside and within 4.5 Å of the perturbed region are repacked, and all DPs included within the perturbed region are redesigned to any of the allowed amino acids. This optimization step is carried out by solving a mixed-integer linear programming (MILP) problem with the objective of maximizing the interaction energy (van der Waals, electrostatics, and solvation). Constraints in the MILP formulation impose the selection of only one amino acid rotamer at each design position, the mutation of at least 50% of the DPs to longer side chain amino acids (Trp, Phe, Tyr, and Met), and the prevention of the same amino acid-rotamer combination being chosen at the same design position in follow-up iterations.


This MILP formulation that is solved in Step 2 is stated as follows:

Sets:

$i, j = 1, \dots, N$: set of all design positions

$r, s = 1, \dots, R$: set of rotamers for position $i$

$U = \{(i, r) \mid i = 1, \dots, N;\ r = 1, \dots, R\}$: universal set of all feasible residue position and amino acid rotamer combinations

$CUTS = \{(i, r) \mid y_{ir} = 1\}$: set of residue position and amino acid rotamer combinations for the design obtained from any iteration

$A = \{\text{Trp, Tyr, Phe, Met, Ile, Leu, Val, Pro, Ala}\}$: set of allowed amino acids

$UAA$: set of all amino acids

Parameters:

$EC_{ir}$: interaction energy of rotamer $r$ at position $i$ and the non-rotamer region

$ER_{ir}^{js}$: interaction energy between rotamer $r$ at position $i$ and rotamer $s$ at position $j$

$$LONG_r = \begin{cases} 1, & \text{if } r \text{ is a rotamer of a long side chain amino acid (Trp, Phe, Tyr, Met)} \\ 0, & \text{otherwise} \end{cases}$$

$$SHORT_r = \begin{cases} 1, & \text{if } r \text{ is a rotamer of a short side chain amino acid (Ile, Leu, Val, Pro, Ala)} \\ 0, & \text{otherwise} \end{cases}$$

$M$: maximum number of mutations allowed at each iteration

$$WT_{ir} = \begin{cases} 1, & \text{if rotamer } r \text{ is seen at design position } i \text{ in the wild type structure} \\ 0, & \text{otherwise} \end{cases}$$

Binary variables:

$$y_{ir} = \begin{cases} 1, & \text{if rotamer } r \text{ is selected at position } i \\ 0, & \text{otherwise} \end{cases}$$

$$w_{ir}^{js} = \begin{cases} 1, & \text{if rotamer } r \text{ is selected at position } i \text{ and } s \text{ at position } j \\ 0, & \text{otherwise} \end{cases}$$

$$z_{ir} = \begin{cases} 1, & \text{if rotamer } r \text{ is selected with amino acid from set } A \text{ at position } i \text{ upon mutation} \\ 0, & \text{if position } i \text{ is unmutated} \end{cases}$$

MILP formulation

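\[
\text{Maximize} \quad \sum_{i=1}^{N} \sum_{r=1}^{R_i} y_{ir}\, EC_{ir} + \sum_{i=1}^{N-1} \sum_{r=1}^{R_i} \sum_{j=i+1}^{N} \sum_{s=1}^{R_j} w_{ir}^{js}\, ER_{ir}^{js}
\]

subject to:

\[
\sum_{r=1}^{R} y_{ir} = 1, \quad \forall i = 1, \dots, N \tag{1}
\]

\[
y_{ir} = \sum_{s=1}^{R} w_{ir}^{js}, \quad \forall i = 1, \dots, N-1,\ \forall j = i+1, \dots, N,\ \forall r = 1, \dots, R \tag{2}
\]

\[
y_{js} = \sum_{r=1}^{R} w_{ir}^{js}, \quad \forall i = 1, \dots, N-1,\ \forall j = i+1, \dots, N,\ \forall s = 1, \dots, R \tag{3}
\]

\[
\sum_{i=1}^{N} y_{ir}\, LONG_r - \sum_{i=1}^{N} y_{ir}\, SHORT_r \ge 0, \quad \forall r \in A \tag{4}
\]

\[
\sum_{r=1}^{R} \sum_{i=1}^{N} z_{ir} \le M \tag{5}
\]

\[
\sum_{r=1}^{R} z_{ir} \le 1, \quad \forall i = 1, \dots, N \tag{6}
\]

\[
z_{ir} = 0, \quad \forall i = 1, \dots, N,\ \forall r \in UAA \setminus A \tag{7}
\]

\[
y_{ir} \ge z_{ir}, \quad \forall i = 1, \dots, N,\ \forall r = 1, \dots, R \tag{8}
\]

\[
y_{ir} \ge \Big(1 - \sum_{r'=1}^{R} z_{ir'}\Big)\, WT_{ir} \tag{9}
\]

\[
\sum_{r=1}^{R} \sum_{\substack{i=1 \\ (i,r) \in CUTS}}^{N} y_{ir} + \sum_{r=1}^{R} \sum_{\substack{i=1 \\ (i,r) \in U \setminus CUTS}}^{N} (1 - y_{ir}) \le (R \times N) - 1 \tag{10}
\]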

The objective function maximizes the net interaction energy of the rotamers with the non-rotamer portion of the binding assembly and with each other. Constraint 1 ensures that exactly one rotamer is selected at each design position. Constraints 2 and 3 ensure that $w_{ir}^{js}$ is one only when both $y_{ir}$ and $y_{js}$ have a value of one. Constraint 4 ensures that at least 50% of the DPs are mutated to longer side chain residues (Trp, Phe, Tyr, and Met). This alleviates the need to cycle through designs with all smaller side chain residues (Ile, Leu, Val, Pro, and Ala) and thus accelerates convergence. Using a very low percentage will result in longer run times, because designs rich in small side chains are identified first and are then weeded out at the pore size check in Step 5. On the other hand, a very high percentage will result in steric clashes between the chosen long side chain residues, yielding larger than expected pore sizes (akin to the OCD-TFTrp design).

Constraint 5 ascertains that at most M out of N design positions are allowed to mutate in a given iteration. The value of M is randomly generated for each PoreDesigner iteration and fed as a parameter to the MILP step. Constraint 6 makes sure that if a design position is mutated, it assumes only one new rotamer, and if unmutated it retains the wild-type amino acid rotamer, while constraint 7 prevents mutation to non-hydrophobic residues. Constraints 8 and 9 together pass on the information about the current design to the objective function using the binary variable $y_{ir}$. If a design position is mutated, constraint 8 uses $z_{ir}$ to set the $y_{ir}$ value for that position and rotamer combination to one. However, if a design position is not mutated (i.e., $z_{ir} = 0$), constraint 9 uses the parameter $WT_{ir}$ to set the $y_{ir}$ value to one for the wild-type configuration of that residue. Therefore, an MILP design has $y_{ir}$ values of one for each design position, obtained either from a mutation or from the wild-type configuration (if unmutated). At the end of each iteration the design is appended to the CUTS set. Constraint 10 makes sure that none of the existing designs in the CUTS set is chosen in the current iteration.
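A small check can make the logic of constraint 10 concrete. The sketch below is illustrative only: it evaluates the left-hand side of the cut for a toy problem using Python sets rather than an MILP solver, and the function names and toy sizes are assumptions, not part of PoreDesigner.

```python
def integer_cut_lhs(y, cuts, n_positions, n_rotamers):
    """Left-hand side of constraint 10 for a candidate selection.

    y    : set of (position, rotamer) pairs with y_ir = 1 in the candidate
    cuts : set of (position, rotamer) pairs with y_ir = 1 in a stored design
    """
    universe = {(i, r) for i in range(n_positions) for r in range(n_rotamers)}
    # First double sum: selected pairs that also appear in the stored design.
    in_cuts = sum(1 for pair in cuts if pair in y)
    # Second double sum: unselected pairs that lie outside the stored design.
    outside_cuts = sum(1 for pair in universe - cuts if pair not in y)
    return in_cuts + outside_cuts


def violates_cut(y, cuts, n_positions, n_rotamers):
    """True when the candidate repeats the stored design and is therefore excluded."""
    lhs = integer_cut_lhs(y, cuts, n_positions, n_rotamers)
    return lhs > n_rotamers * n_positions - 1


# Toy problem with N = 3 positions and R = 4 rotamers (assumed sizes).
stored = {(0, 1), (1, 0), (2, 3)}
repeat = {(0, 1), (1, 0), (2, 3)}
different = {(0, 2), (1, 0), (2, 3)}
```

For the repeated design the left-hand side equals R × N, which exceeds the bound, whereas a design that differs at a single position scores R × N − 2 and remains feasible.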

Step 3: A local rigid-body move of the water wire is performed using random translations in the X, Y, and Z directions, sampled from normal distributions centered at zero with standard deviations of 0.2 Å, 0.2 Å, and 2 Å, respectively.
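As a minimal sketch of Step 3, the snippet below draws one translation vector and applies it to every atom of a water wire so that the internal geometry is unchanged. The coordinate representation (a list of x, y, z tuples) and the function names are assumptions made for illustration.

```python
import random

SIGMAS = (0.2, 0.2, 2.0)  # standard deviations in Å along X, Y, Z (from the text)


def sample_translation(rng=random):
    """Draw a (dx, dy, dz) shift from zero-centered normal distributions."""
    return tuple(rng.gauss(0.0, s) for s in SIGMAS)


def translate_rigidly(coords, shift):
    """Apply the same shift to every atom, which keeps all interatomic distances."""
    dx, dy, dz = shift
    return [(x + dx, y + dy, z + dz) for x, y, z in coords]


# Hypothetical two-oxygen water wire, coordinates in Å.
wire = [(0.0, 0.0, 0.0), (0.0, 0.0, 2.8)]
moved = translate_rigidly(wire, sample_translation(random.Random(7)))
```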

Step 4: Energy minimization of the complex is performed in Cartesian coordinates (x, y, z) using a gradient-based search.

Step 5: Pore size analysis is performed using the PoreAnalyzer module (details in section 3) to check whether the design satisfies the desired pore size. The interaction energy is calculated if the pore opening is within the desired size range; otherwise the design is discarded.

Step 6: If the design is accepted in the previous step, the interaction energy between the water wire and the redesigned OmpF is calculated. Redesigns with a lower interaction energy than the current best are always accepted. Redesigns with larger interaction energies (in absolute magnitude) are accepted with a probability given by the Boltzmann factor $e^{-\Delta(\text{interaction energy})/kT}$ (i.e., the Metropolis criterion, where k is the Boltzmann constant, ~$0.33 \times 10^{-23}$ cal/K, and T is the temperature in K). A temperature of 3,640 K is used in the Boltzmann factor, which ensures that there is a 25% probability that a redesign with an interaction energy 10 kcal/mol more negative than the best so far will be retained.
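The 25% figure quoted in Step 6 can be reproduced with a few lines of Python. This is a sketch of the arithmetic only: the function name is hypothetical, and multiplying k by Avogadro's number to express kT per mole is an assumption about how the per-mole energy difference is meant to be compared with kT.

```python
import math

K_B_CAL_PER_K = 0.33e-23   # Boltzmann constant quoted in the text, cal/K
AVOGADRO = 6.022e23        # 1/mol, converts kT to a per-mole quantity (assumption)
TEMPERATURE_K = 3640.0     # temperature used in the Boltzmann factor (from the text)


def acceptance_probability(delta_kcal_per_mol, temperature_k=TEMPERATURE_K):
    """Metropolis probability of keeping a redesign whose interaction energy
    differs from the current best by `delta_kcal_per_mol` in the unfavorable
    direction (passed as a positive magnitude)."""
    if delta_kcal_per_mol <= 0.0:
        return 1.0  # designs that are not worse are always accepted
    kt_kcal_per_mol = K_B_CAL_PER_K * AVOGADRO * temperature_k / 1000.0
    return math.exp(-delta_kcal_per_mol / kt_kcal_per_mol)


p = acceptance_probability(10.0)   # close to 0.25 at 3,640 K
```

At 3,640 K the product kT is roughly 7.2 kcal/mol, so a 10 kcal/mol gap gives exp(−10/7.2), which is close to the stated 25%.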

Step 7: A cumulative set of integer cuts is updated with information about the current mutation (residue, position, rotamer). This ensures that no redesign (either accepted or rejected previously) is revisited.

Step 8: Perturbation/redesign iterations of PoreDesigner are repeated until a pre-specified number of accepted redesigns in terms of pore size are retrieved (typically, we require 30 accepted redesigns).

Post-redesign analysis of results

This step is used to estimate the pore constriction diameter of the designed OmpF mutant and accept or discard designs accordingly.

Step 1: Introduce perpendicular planes every 0.5 Å along the ~8 Å long constriction region (see Figure 6), giving approximately 16 slices.

Step 2: Apply the PoreAnalyzer algorithm, which comprises the following sub-steps:

1. Supply the oriented OmpF PDB structure.

2. Identify the list of pore center coordinates at each of the 16 slices (see Figure 6) of the pore constriction region.

3. At each slice, find the coordinates of the pore constricting atoms (including their van der Waals radii) nearest to the pore center coordinates.

4. At each slice, fit the largest ellipse (see Ellipse fitting method for details) that just touches the pore constricting atoms (with no atoms inside the ellipse). The ellipse is centered at the pore center coordinates for a given slice.

5. Store the major axis dimensions ($D_i$) of the ellipses from each slice in a list $D^{pore}$ such that $D_i \in D^{pore}$. The complete set of major axes for the ~8 Å constriction region is stored in $D^{pore}$.

6. The minimum value in $D^{pore}$ (i.e., $\min_{i} D^{pore} = D^{pore}_{min}$) determines the actual pore bottleneck diameter, or pore constriction diameter (see Figure 6).

7. The minor axis dimension corresponding to the ellipse whose major axis equals $D^{pore}_{min}$ is identified, and thus the pore constriction dimensions are determined.
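Sub-steps 5 to 7 amount to a minimum search over the per-slice ellipses. The sketch below assumes the ellipse fits are already available as (major axis, minor axis) pairs in Å; the function names and example numbers are hypothetical.

```python
def slice_offsets(length=8.0, spacing=0.5):
    """Offsets (Å) of the slicing planes along the constriction region."""
    n_slices = int(round(length / spacing))   # 16 slices for 8 Å at 0.5 Å spacing
    return [k * spacing for k in range(n_slices)]


def pore_constriction(ellipses):
    """Return (slice_index, D_min_pore, matching_minor_axis).

    ellipses: list of (major_axis, minor_axis) tuples, one per slice.
    """
    idx = min(range(len(ellipses)), key=lambda k: ellipses[k][0])
    major, minor = ellipses[idx]
    return idx, major, minor


# Made-up fits for three slices, for illustration only.
example = [(9.0, 6.0), (7.5, 5.0), (8.2, 4.1)]
bottleneck = pore_constriction(example)
```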

Figure 6. OmpF pore analysis. (a) The 8 Å constriction region is divided into slices every 0.5 Å. Pore area is calculated at each slice and the lowest of them determines the redesigned pore constriction area. (b) The OmpF pore profile was generated using the PoreAnalyzer module and visualized using PyMOL. The twenty-five pore constriction residues are highlighted in yellow. The channel cavity (pink) is composed of multiple pink spheres placed at regular intervals of 0.5 Å. A schematic hour-glass shaped internal pore geometry has been overlaid.

Step 3: Impose a design check to retain only the OmpF redesigns that meet the desired pore size criteria. The first check is imposed while designing AQP-like small pores (<4 Å diameter) that allow single-file water transport. The second is used for redesigning OmpF for selective separation of two aqueous solutes A and B (rejecting A but not B) with hydrodynamic diameters $D_A$ and $D_B$, respectively.

Check 1: Accept the design if $D^{pore}_{min} < 4$ Å is satisfied.

Check 2: Accept the design if both $D^{pore}_{min} < D_A$ and $D^{pore}_{min} > D_B$ are satisfied.
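The two checks translate directly into predicates on the pore constriction diameter. The following is a minimal sketch with hypothetical function names; all lengths are in Å.

```python
def passes_check_1(d_min_pore, limit=4.0):
    """Check 1: AQP-like pore, constriction diameter below 4 Å."""
    return d_min_pore < limit


def passes_check_2(d_min_pore, d_a, d_b):
    """Check 2: pore rejects solute A (diameter d_a) but passes solute B (d_b)."""
    return d_b < d_min_pore < d_a
```

Check 2 can only be satisfied when solute B is smaller than solute A, which is the situation described in the text.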


Accepted OmpF designs are sorted in decreasing order of the DM-TM interaction energy implying that redesigns with least interaction with the water wire are ranked higher. Thus, PoreDesigner can be used to create the selective internal structure of AQP1 (or any desired pore size) inside the stable beta-scaffold of OmpF.

Ultimately, we obtain the structures for the final 100 frames of MD simulation of pressure driven water transport through OmpF and the mutant designs. We use the PoreAnalyzer module on them to report the predicted pore sizes. This is to ensure we track any pore widening that might occur during water permeation.

Ellipse fitting method to compute pore dimensions

At each slice of the pore constriction region we fit the general equation of a rotated ellipse, $Ax^2 + Bxy + Cy^2 + Dx + Ey + F = 0$, using non-linear regression. Here, the coefficients A, B, C, D, E, and F are arbitrary real-valued constants with at least one of A, B, or C, as well as at least one of D, E, or F, nonzero. Although all conic sections can be represented in this way, some combinations of the constants give rise to one of the five degenerate conic sections (a point, a line, two intersecting lines, two parallel lines, or the empty set). We verified that the resulting figure is a non-degenerate conic section (circle or ellipse) by checking that the area inside the curve after the non-linear regression was less than 60 Å². This threshold was chosen because the wild-type OmpF pore area, with major and minor axis lengths of 11 Å and 7 Å respectively, is 60.47 Å² (π × 5.5 Å × 3.5 Å).
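Once the six coefficients have been fitted, the axis lengths and the enclosed area follow in closed form. The sketch below does not perform the regression; it only converts a set of coefficients into axes and area, and it rejects non-elliptical cases using the discriminant B² − 4AC, which is a standard test and not necessarily the check used in the original scripts. The function name is hypothetical.

```python
import math


def ellipse_from_conic(A, B, C, D, E, F):
    """Return (major_axis, minor_axis, area) of the ellipse
    A x^2 + B xy + C y^2 + D x + E y + F = 0, using full axis lengths.

    Raises ValueError when the coefficients do not describe a real ellipse.
    """
    disc = B * B - 4.0 * A * C
    if disc >= 0.0:
        raise ValueError("not an ellipse: B^2 - 4AC must be negative")
    det = -disc
    # Center of the conic, where the gradient of the quadratic vanishes.
    x0 = (B * E - 2.0 * C * D) / det
    y0 = (B * D - 2.0 * A * E) / det
    # Value of the quadratic at the center; the centered form is u^T Q u = -f0.
    f0 = A * x0 * x0 + B * x0 * y0 + C * y0 * y0 + D * x0 + E * y0 + F
    # Eigenvalues of Q = [[A, B/2], [B/2, C]]; they share a sign because det > 0.
    mean = 0.5 * (A + C)
    spread = math.hypot(0.5 * (A - C), 0.5 * B)
    ratios = [-f0 / (mean - spread), -f0 / (mean + spread)]
    if min(ratios) <= 0.0:
        raise ValueError("degenerate conic: a single point or the empty set")
    semi_minor, semi_major = sorted(math.sqrt(v) for v in ratios)
    area = math.pi * semi_major * semi_minor
    return 2.0 * semi_major, 2.0 * semi_minor, area


# Wild-type-like ellipse with axes of 11 Å and 7 Å centered at the origin.
major, minor, area = ellipse_from_conic(1 / 5.5**2, 0.0, 1 / 3.5**2, 0.0, 0.0, -1.0)
```

For the 11 Å by 7 Å case the computed area is about 60.48 Å², in line with the wild-type value quoted above.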

Computing inner, outer pore wall and overall hydrophobicity scores

We used the $\Delta G^{water \to ethanol}_{transfer}$ values4 (from the Kyte-Doolittle (KD) hydrophobicity scale) to evaluate the inner pore wall, outer pore wall, and overall KD-hydrophobicities of the OCD-TFTrp, UCD, and CSD designs, which were subsequently expressed, purified, and embedded in vesicles for transport experiments. KD-hydrophobicities have been used as a standard to assess the performance of novel hydrophobicity scales by Perunov et al.5. Furthermore, Kister et al.6 have reported that the KD scale is reliable for estimating protein hydrophobicities. Each of the three hydrophobicity scores was calculated by summing the products of the amino acid frequencies ($n_i$) and their individual $\Delta G^{water \to ethanol}_{transfer}$ values. The amino acid frequency $n_i$ is the number of times amino acid $i$ occurs. The overall hydrophobicity score of a channel protein is the sum of its inner pore wall and outer pore wall hydrophobicity scores calculated from transfer free energies.

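\[
\sum_{i=1}^{20} \left( n_i^{inner\_porewall} \times \Delta G^{water \to ethanol}_{transfer,\, i} \right) + \sum_{i=1}^{20} \left( n_i^{outer\_porewall} \times \Delta G^{water \to ethanol}_{transfer,\, i} \right) = \text{Overall hydrophobicity score}
\]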
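The score is a frequency-weighted sum, which the following sketch illustrates. The per-residue values are the published Kyte-Doolittle hydropathy indices, used here as stand-ins; whether they are numerically identical to the transfer free energies used in this work is an assumption, and the function names are hypothetical.

```python
from collections import Counter

# Kyte-Doolittle hydropathy indices by one-letter code (stand-in values, see lead-in).
KD = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydrophobicity_score(residues):
    """Sum over amino acid types of (frequency n_i) x (scale value for i)."""
    counts = Counter(residues)
    return sum(n * KD[aa] for aa, n in counts.items())


def overall_score(inner_wall, outer_wall):
    """Overall score = inner pore wall score + outer pore wall score."""
    return hydrophobicity_score(inner_wall) + hydrophobicity_score(outer_wall)


# Made-up residue lists, for illustration only.
inner = "WFY"
outer = "ILVA"
total = overall_score(inner, outer)
```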

To decide whether a given amino acid is part of the inner pore wall or the outer pore wall, Python 2.7 scripts were written. If the distance between the pore center and the Cα atom of an amino acid is greater than the distance between the pore center and its Cβ atom (or analogous atom), the residue is counted as an inner pore wall residue. Otherwise, it is counted as an outer pore wall residue.
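The classification rule can be sketched in a few lines. This is not the original script: it is written for Python 3.8 or later (it relies on `math.dist`), and the function name and example coordinates are hypothetical.

```python
import math


def wall_side(pore_center, c_alpha, c_beta):
    """Return 'inner' when the side chain points toward the pore center,
    i.e. when the C-alpha atom is farther from the center than the C-beta atom."""
    if math.dist(pore_center, c_alpha) > math.dist(pore_center, c_beta):
        return "inner"
    return "outer"


# Made-up coordinates in Å: the C-beta atom sits closer to the pore axis.
label = wall_side((0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (4.6, 0.3, 0.0))
```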

Molecular Dynamics Simulations

All MD simulations were performed using the program NAMD84, a 2 fs integration time step, and 2–2–6 multiple time-stepping. Parameters for the POPC lipid bilayer, OmpF protein, and ions were taken from the CHARMM36 parameter set with the CMAP corrections85. A TIP3P model was used for water86. All simulations employed a 10–12 Å cutoff for van der Waals and short-range electrostatic forces, the particle mesh Ewald (PME) method for long-range electrostatics87 computed over a 1.1 Å grid, and periodic boundary conditions. Equilibration simulations were performed in the NPT (constant number of particles N, pressure P, and temperature T) ensemble using a Lowe–Andersen thermostat88 and Nosé–Hoover Langevin piston pressure control89 set at 295 K and 1 atm, respectively. The simulations of ion conductance were carried out in the NVT (constant number of particles N, volume V, and temperature T) ensemble following a previously described protocol76. Visualization and analysis were performed using VMD74.

Experimental methods

OmpF and the three selected mutant design proteins were produced by expression of synthetic genes using pET23a(+) in an E. coli BL21(DE3) Omp8 Rosetta mutant strain according to the pET cloning and expression system (Novagen). Thereafter, the porins were purified in their native trimeric state and stabilized in a detergent solution before reconstitution, and subsequently passed through an equilibrated anion exchange chromatography column (HiScreen DEAE HF) and a Superose 12 size exclusion column. Bradford assays were used to determine the protein concentration. Dry lipid films and a rehydration buffer were used to reconstitute mutant and wild-type OmpF, which were extruded using a 200 nm track-etched membrane. A stopped-flow light scattering goniometer setup was used to assess hydraulic permeabilities. Finally, high concentrations of various solutes of different hydrodynamic radii were used in the mixing cell of the apparatus to determine the solute rejection performance of the mutants in contrast to the wild-type OmpF.

1.6. References

1. Sholl, D. S. & Lively, R. P. Seven chemical separations to change the world. Nature 532, 435–437 (2016).

2. Levin, R. J. The living barrier: a primer on transfer across biological membranes. (Butterworth-Heinemann, 2014).

3. Grzelakowski, M., Cherenet, M. F., Shen, Y. xiao & Kumar, M. A framework for accurate evaluation of the promise of aquaporin based biomimetic membranes. J. Memb. Sci. 479, 223–231 (2015).

4. Elimelech, M. & Phillip, W. A. The Future of Seawater Desalination: Energy, Technology, and the Environment. Science 333, 712–718 (2011).

5. Crittenden, J. C., Trussell, R. R., Hand, D. W., Howe, K. J. & Tchobanoglous, G. MWH’s Water Treatment: Principles and Design: Third Edition (2012). doi:10.1002/9781118131473

6. Sanders, D. F. et al. Energy-efficient polymeric gas separation membranes for a sustainable future: A review. Polymer (Guildf). 54, 4729–4761 (2013).

7. Wang, S. et al. Advances in high permeability polymer-based membrane materials for CO2 separations. Energy Environ. Sci. 9, 1863–1890 (2016).

8. Kotsanopoulos, K. V. & Arvanitoyannis, I. S. Membrane Processing Technology in the Food Industry: Food Processing, Wastewater Treatment, and Effects on Physical, Microbiological, Organoleptic, and Nutritional Properties of Foods. Crit. Rev. Food Sci. Nutr. 55, 1147–1175 (2013).

9. Xie, R., Chu, L.-Y. & Deng, J.-G. Membranes and membrane processes for chiral resolution. Chem. Soc. Rev. 37, 1243–1263 (2008).

10. Fane, A. G., Wang, R. & Hu, M. X. Synthetic Membranes for Water Purification: Status and Future. Angew. Chemie Int. Ed. 54, 3368–3386 (2015).

11. Zhu, L. et al. A low-cost mullite-titania composite ceramic hollow fiber microfiltration membrane for highly efficient separation of oil-in-water emulsion. Water Res. 90, 277–285 (2016).

12. Konno, M., Shindo, M., Sugawara, S. & Saito, S. A composite palladium and porous aluminum oxide membrane for hydrogen gas separation. J. Memb. Sci. 37, 193–197 (1988).

13. Yang, L., Hsiao, W. W. & Chen, P. Chitosan–cellulose composite membrane for affinity purification of biopolymers and immunoadsorption. J. Memb. Sci. 197, 185–197 (2002).

14. Livingston, A. & Baker, R. Membranes from academia to industry. Nature Materials 16, 280–282 (2017).

15. Park, H. B., Kamcev, J., Robeson, L. M., Elimelech, M. & Freeman, B. D. Maximizing the right stuff: The trade-off between membrane permeability and selectivity. Science 356, (2017).

16. Rangnekar, N., Mittal, N., Elyassi, B., Caro, J. & Tsapatsis, M. Zeolite membranes – a review and comparison with MOFs. Chem. Soc. Rev. 44, 7128–7154 (2015).

17. Hinds, B. J. et al. Aligned Multiwalled Carbon Nanotube Membranes. Science 303, 62 (2004).

18. Li, H. et al. Ultrathin, Molecular-Sieving Graphene Oxide Membranes for Selective Hydrogen Separation. Science 342, 95 (2013).

19. Kumar, M., Grzelakowski, M., Zilles, J., Clark, M. & Meier, W. Highly permeable polymeric membranes based on the incorporation of the functional water channel protein Aquaporin Z. Proc. Natl. Acad. Sci. 104, 20719–20724 (2007).

20. Shen, Y. xiao, Saboe, P. O., Sines, I. T., Erbakan, M. & Kumar, M. Biomimetic membranes: A review. Journal of Membrane Science 454, 359–381 (2014).

21. Werber, J. R., Osuji, C. O. & Elimelech, M. Materials for next-generation desalination and water purification membranes. Nature Reviews Materials 1, (2016).

22. Bocquet, N. et al. X-ray structure of a pentameric ligand-gated ion channel in an apparently open conformation. Nature 457, 111–114 (2009).

23. Hibbs, R. E. & Gouaux, E. Principles of activation and permeation in an anion-selective Cys-loop receptor. Nature 474, 54–60 (2011).

24. Hilf, R. J. C. & Dutzler, R. X-ray structure of a prokaryotic pentameric ligand-gated ion channel. Nature 452, 375–379 (2008).

25. Miyazawa, A., Fujiyoshi, Y. & Unwin, N. Structure and gating mechanism of the acetylcholine receptor pore. Nature 423, 949–955 (2003).

26. Joh, N. H. et al. De novo design of a transmembrane Zn2+-transporting four-helix bundle. Science 346, 1–6 (2014).

27. Liu, Z., Ghai, I., Winterhalter, M. & Schwaneberg, U. Engineering Enhanced Pore Sizes Using FhuA Δ1-160 from E. coli Outer Membrane as Template. ACS Sensors 2, 1619–1626 (2017).

28. To, J. & Torres, J. Can stabilization and inhibition of aquaporins contribute to future development of biomimetic membranes? Membranes 5, 352–368 (2015).

29. Ren, T. et al. Membrane Protein Insertion into and Compatibility with Biomimetic Membranes. Adv. Biosyst. 1, (2017).

30. Kong, Y. & Ma, J. Dynamic mechanisms of the membrane water channel aquaporin-1 (AQP1). Proc. Natl. Acad. Sci. U. S. A. 98, 14345–14349 (2001).

31. Tamm, L. K., Hong, H. & Liang, B. Folding and assembly of beta-barrel membrane proteins. Biochim. Biophys. Acta 1666, 250–63 (2004).

32. Yamashita, E., Zhalnina, M. V, Zakharov, S. D., Sharma, O. & Cramer, W. A. Crystal structures of the OmpF porin: function in a colicin translocon. EMBO J. 27, 2171–80 (2008).

33. Lou, K. et al. Structural and functional characterization of OmpF porin mutants selected for larger pore size. I. Crystallographic analysis. J. Biol. Chem. 271, 20669–20675 (1996).

34. Jeanteur, D. et al. Structural and functional alterations of a colicin-resistant mutant of OmpF porin from Escherichia coli. Proc. Natl. Acad. Sci. U. S. A. 91, 10675–10679 (1994).

35. Kefala, G. et al. Structures of the OmpF porin crystallized in the presence of foscholine-12. Protein Sci. 19, 1117–1125 (2010).

36. Nestorovich, E. M., Rostovtseva, T. K. & Bezrukov, S. M. Residue ionization and ion transport through OmpF channels. Biophys. J. 85, 3718–29 (2003).

37. Jaffe, A., Chabbert, Y. A. & Semonin, O. Role of porin proteins OmpF and OmpC in the permeation of β-lactams. Antimicrob. Agents Chemother. 22, 942–948 (1982).

38. Baslé, A., Iyer, R. & Delcour, A. H. Subconductance states in OmpF gating. Biochim. Biophys. Acta - Biomembr. 1664, 100–107 (2004).

39. Benson, S. A., Occi, J. L. L. & Sampson, B. A. Mutations that alter the pore function of the ompF porin of Escherichia coli K12. J. Mol. Biol. 203, 961–970 (1988).

40. Phale, P. S. et al. Role of charged residues at the OmpF porin channel constriction probed by mutagenesis and simulation. Biochemistry 40, 6319–6325 (2001).

41. Surrey, T., Schmid, A. & Jähnig, F. Folding and membrane insertion of the trimeric β-barrel protein OmpF. Biochemistry 35, 2283–2288 (1996).

42. Pantazes, R. J., Grisewood, M. J., Li, T., Gifford, N. P. & Maranas, C. D. The Iterative Protein Redesign and Optimization (IPRO) suite of programs. J. Comput. Chem. 36, 251–263 (2015).

43. Portella, G. & De Groot, B. L. Determinants of water permeability through nanoscopic hydrophilic channels. Biophys. J. 96, 925–938 (2009).

44. Portella, G., Pohl, P. & De Groot, B. L. Invariance of single-file water mobility in gramicidin-like peptidic pores as function of pore length. Biophys. J. 92, 3930–3937 (2007).

45. Kozono, D., Yasui, M., King, L. S. & Agre, P. Aquaporin water channels: Atomic structure and molecular dynamics meet clinical medicine. Journal of Clinical Investigation 109, 1395–1399 (2002).

46. Erbakan, M. et al. Molecular Cloning, Overexpression and Characterization of a Novel Water Channel Protein from Rhodobacter sphaeroides. PLoS One 9, e86830 (2014).

47. Horner, A. et al. The mobility of single-file water molecules is governed by the number of H-bonds they may form with channel-lining residues. Sci. Adv. 1, e1400083 (2015).

48. Kyte, J. & Doolittle, R. F. A simple method for displaying the hydropathic character of a protein. J. Mol. Biol. 157, 105–132 (1982).

49. Agre, P. & Kozono, D. Aquaporin water channels: molecular mechanisms for human diseases. FEBS Lett. 555, 72–78 (2003).

50. Humphrey, W., Dalke, A. & Schulten, K. VMD: Visual molecular dynamics. J. Mol. Graph. 14, 33–38 (1996).

51. Zhu, F., Tajkhorshid, E. & Schulten, K. Collective diffusion model for water permeation through microscopic channels. Phys. Rev. Lett. 93, (2004).

52. Aksimentiev, A. & Schulten, K. Imaging α-hemolysin with molecular dynamics: Ionic conductance, osmotic permeability, and the electrostatic potential map. Biophys. J. 88, 3745–3761 (2005).

53. Ireta, J., Neugebauer, J. & Scheffler, M. On the accuracy of DFT for describing hydrogen bonds: Dependence on the bond directionality. J. Phys. Chem. A 108, 5692–5698 (2004).

54. Durrant, J. D. & McCammon, J. A. HBonanza: A computer algorithm for molecular-dynamics-trajectory hydrogen-bond analysis. J. Mol. Graph. Model. 31, 5–9 (2011).

55. Lee, A., Elam, J. W. & Darling, S. B. Membrane materials for water purification: design, development, and application. Environ. Sci. Water Res. Technol. 2, 17–42 (2016).

56. de Fraiture, C., Molden, D. & Wichelns, D. Investing in water for food, ecosystems, and livelihoods: An overview of the comprehensive assessment of water management in agriculture. Agric. Water Manag. 97, 495–501 (2010).

57. Grisewood, M. J. et al. Computational Redesign of Acyl-ACP Thioesterase with Improved Selectivity toward Medium-Chain-Length Fatty Acids. ACS Catal. 7, 3837–3849 (2017).

58. Chowdhury, R., Allan, M. F. & Maranas, C. D. OptMAVEn-2.0: De novo Design of Variable Antibody Regions Against Targeted Antigen Epitopes. Antibodies 7, 23 (2018).

59. Mackerell, A. D. Empirical force fields for biological macromolecules: Overview and issues. Journal of Computational Chemistry 25, 1584–1604 (2004).

60. Jorgensen, W. L., Chandrasekhar, J., Madura, J. D., Impey, R. W. & Klein, M. L. Comparison of simple potential functions for simulating liquid water. J. Chem. Phys. 79, 926–935 (1983).

61. Darden, T., York, D. & Pedersen, L. Particle mesh Ewald: An N·log(N) method for Ewald sums in large systems. J. Chem. Phys. 98, 10089–10092 (1993).

62. Koopman, E. A. & Lowe, C. P. Advantages of a Lowe-Andersen thermostat in molecular dynamics simulations. J. Chem. Phys. 124, (2006).

63. Hoenger, A., Pagès, J.-M., Fourel, D. & Engel, A. The Orientation of Porin OmpF in the Outer Membrane of Escherichia coli. J. Mol. Biol. 233, 400–413 (1993).

64. Saparov, S. M. & Pohl, P. Beyond the diffusion limit: Water flow through the empty bacterial potassium channel. Proc. Natl. Acad. Sci. U. S. A. 101, 4805–9 (2004).

65. Shen, Y. et al. Highly permeable artificial water channels that can self-assemble into two-dimensional arrays. Proc. Natl. Acad. Sci. 112, 9810–9815 (2015).

66. Latimer, P. & Pyle, B. E. Light Scattering at Various Angles: Theoretical Predictions of the Effects of Particle Volume Changes. Biophys. J. 12, 764–773 (1972).

67. Kometani, T. & Kasai, M. Ionic permeability of sarcoplasmic reticulum vesicles measured by light scattering method. J. Membr. Biol. 41, 295–308 (1978).

Chapter 2

BRIEF HISTORY OF ENZYME REDESIGN FROM DIRECTED EVOLUTION TO COMPUTATIONAL ENZYME REDESIGN

This chapter has been previously published in a modified form in AIChE Journal (Chowdhury R,

Maranas CD. From directed evolution to computational enzyme engineering—A review. AIChE

Journal. 2019 Nov 7.)

2.1. Background

Background of enzymes, enzymology and enzyme design

Biological systems are masterful chemists that build complex molecules and systems from simple precursor compounds. At the heart of this complex machinery are enzymes, which account for ~4% of proteins1. The use of biocatalysts by humanity, which emerged as an accidental by-product of gathering wild grain, dates back at least to the ancient Egyptians (circa 10,000 BC), who used fermentation for bread-making and brewing. However, it was not until the 19th century that fermentation was recognized as being carried out by living cells2. In 1835, the Swedish chemist Jacob Berzelius used the term ‘proteins’ to describe similar molecules extracted from egg whites, blood, serum, fibrin, and wheat gluten, all of which had atomic ratios of C:H:N:O:S of approximately 1:1.58:0.28:0.3:0.01 (experiments performed by Johannes Mulder)3, and reported that some of the proteins are catalytic. It was much later, in 1878, that the German physiologist Wilhelm Kuhne coined the term ‘enzymes’. Figure 7 shows a chronological compilation of seventy key events in the history of enzyme engineering, starting from the 1830s up to 2018, when the Nobel Prize in Chemistry was co-awarded for the ‘directed evolution of enzymes’ and ‘phage display’.


Figure 7. Seventy notable events in the history of enzyme engineering starting with the Egyptians using wild grain for bread-making and brewing to directed evolution and phage display techniques for which the Nobel Prize in chemistry was awarded in 2018. The computational milestones are indicated as purple lines.

Edward Buchner in 1897 isolated an enzyme complex, which he called zymase, from cell-free yeast extracts and successfully demonstrated that it can catalyze the breakdown of sugars in alcoholic fermentation4. During the same time, the German chemist Emil Fischer postulated the ‘lock and key’ hypothesis5 for enzyme activity, in which the substrate (key) was thought to fit rigidly into the complementary groove of the enzyme (lock). Figure 8 shows the lysozyme binding pocket with a peptidoglycan substrate occupying the binding pocket. However, most of enzymatic catalysis could not be adequately explained by the rigid enzyme model6 until 1958, when Koshland laid out the ‘induced fit’ theory7 for enzyme-substrate action. The three principal tenets of the ‘induced fit’ theory as explained in the article were: “a) the precise orientation of catalytic groups is required for enzyme action, b) the substrate causes an appreciable change in the three-dimensional relationship of the amino acids at the active site, and c) the changes in the protein structure caused by the substrate bring the catalytic groups into the proper alignment, whereas a non-substrate does not”. Figure 9 shows the change in hexokinase structure (from closed to open) necessary for product release. In 1903, the French chemist Victor Henri derived a functional form8,9 of enzyme kinetics from his investigations on the invertase enzyme, which hydrolyzes sucrose to glucose and fructose. However, a simplified and more celebrated version of the equation, which equated the rate of the reaction with the rate at which the concentrations of the various species involved in the reaction change, was formalized in 1913 by Michaelis and Menten10. About a decade later, during the 1930s, Sumner, Northrop, and Stanley independently crystallized urease11, pepsin12, and the nucleoproteins responsible for tobacco mosaic virus activity13, respectively, for which they shared the Nobel Prize in Chemistry in 1946. These structural studies were soon complemented with methods for discerning the sequence of short peptide chains14, reported in 1950 by Pehr Edman, and in 1952 Frederick Sanger reported the complete amino acid sequence of polypeptide chains A and B of bovine insulin15,16, building on the work started by Charles Chibnall17. Concurrently, William Stein and Stanford Moore collaborated at The Rockefeller Institute of Medical Research on developing an analytical procedure to determine the amino acid content of any protein. Stein and Moore used potato starch in a column for fractionation of proteins from peptides along with simultaneous counting of the amino acids18. Subsequently, they followed up with a better and faster quantification approach19 for amino acids in peptides. The next two decades saw the isolation, purification, and characterization of various enzymes, ranging from myoglobin from sperm whales20 to high-resolution lysozyme structures from egg white using X-ray crystallography21. It was in 1978 that directed evolution revolutionized the search for better enzymes.

Figure 8. Two different views of the lysozyme binding site (marked in blue) and the active site residues highlighted in red. The peptidoglycan substrate is shown as yellow sticks.


Figure 9. Conformational change in hexokinase during product release. The active site has been highlighted in bright green. The substrate and products have been marked as pink sticks. Accession IDs for the closed and open hexokinase conformations are 2E2N and 2E2Q.

Mutation followed by natural selection was established as the organizing principle in biology by Darwin’s On the Origin of Species in 1859. However, for thousands of years before that, humans had unknowingly exploited this process in selective breeding and domestication. It was only in the late 1970s that evolution was brought inside the laboratory with the specific objective of discovering microbial phenotypes for better utilization of desired carbon substrates. Lerner et al. designed a xylitol utilization phenotype22 of Aerobacter aerogenes in 1964. In 1967, Spiegelman et al. performed in vitro reconstitution of RNA templates with pure RNA replicase to study the effect of selective pressures over several generations in the famous ‘Spiegelman’s monster’ experiments23,24. Inspired by these, Francis and Hansche performed ‘directed evolution’ in yeast and achieved 30% higher orthophosphate activity with a single mutation but with a growth rate trade-off of 83%25. This was soon followed by a more comprehensive demonstration of directed evolution by Barry Hall, in which up to four mutations in the β-galactosidase coding region in Escherichia coli cultured with lactose as the sole carbon source yielded phenotypes spanning a wide range of growth rates26.

Within a decade, Eigen and Gardiner proposed a cyclic ‘evolutionary machine’27 comprising genetic mutation, amplification, and selection to produce stable mutant proteins in vitro. The subsequent development of error-prone PCR (polymerase chain reaction) for random mutagenesis enabled the generation of large-scale mutant libraries with more than $10^{10}$ designs and has been a cornerstone in the history of enzyme engineering28–31.

Methods for directed evolution

Enzymes (and proteins in general) are modular biopolymers composed of 20 canonical amino acid monomers as encoded by their cognate nucleotide sequences (genes). They have the potential to evolve through changes in their amino acid sequence. This evolvability has been exploited to explore the combinatorial sequence space for catalyzing reactions with improved specificity, regioselectivity, and stereoselectivity32. Thus, directed evolution of enzymes and binding proteins is a synthetic procedure relying on molecular insights, which emulates the natural evolution process in the laboratory at an expedited rate. The procedure commits to intended variation of protein sequences with prescribed randomness of amino acid choices. This is further coupled to engineered screening and selection strategies. In other words, directed evolution involves iterative identification of a starting protein, diversification of its coding gene sequence, expression, and subsequent functional screening until an acceptable level of enzymatic activity, binding affinity, or specificity is accomplished.

Sampling the entire combinatorially explosive mutational landscape of any protein is impossible, as complete randomization of a mere decapeptide would yield ~10^13 (20^10) unique amino acid sequences. Gene diversification approaches are thus designed to perform an optimal sparse sampling of the multidimensional sequence space, with the objective of ascending the landscape of the desired phenotype by accruing beneficial mutations. Several gene diversification methods for directed evolution have been proposed over the last two decades. These strategies typically integrate random mutagenesis, focused mutagenesis, and homologous recombination.
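The size of the search space follows directly from 20^n for n fully randomized positions. The short sketch below tabulates it for a few lengths; the function name is illustrative and not part of any cited tool.

```python
def sequence_space_size(n_positions: int, alphabet_size: int = 20) -> int:
    """Number of distinct sequences when n positions are fully randomized
    over an alphabet (20 canonical amino acids by default)."""
    if n_positions < 0:
        raise ValueError("n_positions must be non-negative")
    return alphabet_size ** n_positions


if __name__ == "__main__":
    for n in (5, 10, 100):
        print(f"{n:>3} positions: {sequence_space_size(n):.3e} sequences")
```

Even a library of 10^10 variants therefore covers only a vanishing fraction of the space once more than a handful of positions are randomized, which is why sparse, guided sampling is needed.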

Random mutagenesis

Random mutagenesis starts by obtaining a library of point mutants from a single parent sequence and transforming the library into a strain to express the variant proteins. A high-throughput screen for the desired phenotype then identifies the successful candidates. Error-prone PCR (epPCR), first described by Goeddel et al.33, exploits the low fidelity of DNA polymerases to introduce point mutations during amplification of the gene that codes for the protein of interest. Gheraldi et al.34 and Kunkel et al.35 were able to enhance the rate of mutation (from 10^-10 to 10^-4) by adding mutagenic dNTP analogues or increasing magnesium concentrations in the epPCR setup.

Additional screening for properly expressed proteins by fusing the target gene with a green fluorescent protein reporter was soon demonstrated by Tawfik et al.36. A modified epPCR protocol developed by Joyce et al. used a combination of Taq polymerase, 0.2 mM dGTP, 0.2 mM dATP, 1 mM dCTP, 1 mM dTTP, higher MgCl2, and 0.5 mM MnCl2 to reduce polymerase fidelity without affecting gene amplification, and alleviated the strong bias towards A→G and T→C transitions faced by Goeddel and co-workers. Arnold and co-workers have documented several successes using random mutagenesis, including introducing activity towards a wide range of native-like substrates in cytochrome P450 enzymes37 and exploring novel carotenoid biosynthesis routes38.
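To make the effect of the per-base error rate concrete, the following is a minimal simulation sketch of error-prone copying. It is a deliberate simplification and not a model of any cited protocol: each library member is an independent lineage, substitutions are uniform over the three alternative bases (so the transition bias discussed above is ignored), and all function names are hypothetical.

```python
import random

BASES = "ACGT"


def error_prone_copy(template: str, error_rate: float, rng: random.Random) -> str:
    """Copy a DNA template, substituting each base with probability error_rate."""
    copy = []
    for base in template:
        if rng.random() < error_rate:
            # Uniform choice among the three other bases (assumption: no bias).
            copy.append(rng.choice([b for b in BASES if b != base]))
        else:
            copy.append(base)
    return "".join(copy)


def mutant_library(template: str, error_rate: float, cycles: int,
                   size: int, seed: int = 0) -> list:
    """Generate `size` variants, each copied `cycles` times from the template."""
    rng = random.Random(seed)
    library = []
    for _ in range(size):
        seq = template
        for _ in range(cycles):
            seq = error_prone_copy(seq, error_rate, rng)
        library.append(seq)
    return library


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


if __name__ == "__main__":
    parent = "ATGGCTAGCAAAGGAGAAGAACTTTTCACT" * 10  # arbitrary 300-nt example
    for variant in mutant_library(parent, error_rate=1e-3, cycles=25, size=5):
        print(hamming(parent, variant), "substitutions")
```

Under these assumptions the expected number of substitutions per variant is roughly gene length × error rate × cycles, which is the quantity that experimental conditions such as Mn2+ or dNTP imbalance are used to tune.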

Focused mutagenesis

The probability of identifying active redesigns that emerge from the synergism of simultaneous point mutations (each of which is only marginally useful on its own) is very low using random mutagenesis, as the number of possible unique sequences increases exponentially with the number of randomized sites. To this end, focused mutagenesis uses phylogenetic analyses of homologous proteins to identify specific amino acid substitutions that are likely to improve substrate binding or catalysis.

A mutagenic oligonucleotide cassette39 containing degenerate codons for a targeted amino acid change is inserted40 into a vector plasmid for expression of a desired enzyme variant. Parra et al.41 fed a focused mutagenesis library of xylanase into epPCR to identify 12 more-thermostable variants, with the best mutant showing a 4.3 °C increase in melting temperature.
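As an illustration of what a degenerate codon encodes, the sketch below expands an IUPAC degenerate codon into its concrete codons and translates them with the standard genetic code. NNK and NNS are used here only as common examples; the source does not state which degenerate codons the cited studies used, and the helper names are hypothetical.

```python
from itertools import product

# Standard genetic code in the conventional TCAG ordering (TTT, TTC, TTA, ...).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

# Subset of IUPAC nucleotide codes needed for the examples.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T",
         "N": "ACGT", "K": "GT", "S": "CG"}


def expand(degenerate_codon: str) -> list:
    """All concrete codons represented by a degenerate codon such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[b] for b in degenerate_codon))]


def summarize(degenerate_codon: str):
    """Return (number of codons, sorted amino acids encoded, number of stops)."""
    encoded = [CODON_TABLE[c] for c in expand(degenerate_codon)]
    amino_acids = sorted(set(encoded) - {"*"})
    return len(encoded), amino_acids, encoded.count("*")


if __name__ == "__main__":
    for codon in ("NNN", "NNK", "NNS"):
        n_codons, aas, n_stops = summarize(codon)
        print(f"{codon}: {n_codons} codons, {len(aas)} amino acids, {n_stops} stop(s)")
```

Restricting the third position reduces library size and the number of stop codons while still covering all 20 amino acids, which is the usual motivation for such codons in focused libraries.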

Homologous recombination

An alternate strategy to access beneficial combinations of mutations is homologous recombination, which mimics the natural process of biological evolution. One of the early approaches, DNA shuffling, involved DNase-mediated fragmentation of a target gene, followed by random re-stitching in a PCR setup. Monticello et al.42 replaced the random priming of DNA fragments with a more sophisticated random chimeragenesis technique (RACHITT). They achieved several-fold higher recombination than any other method in a dibenzothiophene monooxygenase gene. The expressed proteins not only exhibited higher than wild-type activity but also showed 20-fold higher affinity for several hydrophobic non-natural substrates. Arnold et al.43 also reported an optimized DNA shuffling workflow that controls the point mutagenesis rate to as low as 0.05% by adding Mn2+ and Mg2+ ions during DNase I digestion of the gene and choosing an appropriate DNA polymerase to effect high-fidelity recombination. A number of modeling frameworks were developed44 for estimating the occurrence of mutations in error-prone PCR after multiple generations45 and the location of crossovers in directed evolution experiments46,47.
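The notion of a crossover can be sketched with a toy chimera generator. This is only an abstraction of recombination between two pre-aligned, equal-length parents; it does not model fragment annealing, homology requirements, or the cited modeling frameworks, and the function name is hypothetical.

```python
import random


def make_chimera(parent_a: str, parent_b: str, n_crossovers: int,
                 rng: random.Random) -> str:
    """Build a chimera by switching between two aligned parents at random points."""
    if len(parent_a) != len(parent_b):
        raise ValueError("parents must be aligned to the same length")
    if not 0 <= n_crossovers < len(parent_a):
        raise ValueError("too many crossovers for this sequence length")
    # Crossover positions are drawn without replacement from interior sites.
    points = sorted(rng.sample(range(1, len(parent_a)), n_crossovers))
    parents = (parent_a, parent_b)
    current = rng.randrange(2)  # which parent donates the first segment
    segments, start = [], 0
    for point in points + [len(parent_a)]:
        segments.append(parents[current][start:point])
        start, current = point, 1 - current
    return "".join(segments)


if __name__ == "__main__":
    rng = random.Random(42)
    a, b = "A" * 30, "B" * 30  # placeholder parents to make segments visible
    for _ in range(3):
        print(make_chimera(a, b, n_crossovers=3, rng=rng))
```

With placeholder parents made of a single letter each, the printed chimeras show directly where the crossovers fell.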

Recent trends in directed evolution have seen attempts and successes at improving proteins with the biological proviso that they remain relevant to the metabolic pathways they belong to, thus creating novel whole-cell chemical factories for the synthesis of value-added chemicals48–50. More recently, biochemists have aimed at dialing in novel functionalities in enzymatic proteins that are not seen in nature51,52. A decade-old review by Toscano et al.53 on active site redesign strategies provides considerable insight into function-driven enzyme redesign.

Methods for computational protein design

Computational methods provide the means to screen many enzyme redesign alternatives in silico, thus narrowing the number of variants to be tested experimentally. Existing approaches generally use biophysics-inspired or statistical fitness functions to screen design alternatives in terms of conservation (or enhancement) of desired interactions and absence of aberrant ones. There is an ever-expanding literature on scoring functions54,55 and combinatorial search algorithms56 devoted to the efficient traversal of the combinatorial space of residue alternatives. Software tools that integrate all these tasks include RosettaDesign57, Osprey58, Tinker59, TransCent60, and IPRO61. The difficulty and success rate of computational design depend on how ambitious the enzyme redesign goal is. For example, attempts to switch the cofactor or substrate specificity of a well-characterized enzyme have met with many successes62–65; however, improving the catalytic affinity of a native enzyme towards its preferred substrate is much more difficult, with only a few success stories66,67. Introducing a novel enzymatic activity is also very difficult68. Nevertheless, over the past decade and a half there has been a lot of exciting, industrially relevant research focused on the generation of stable humanized immunoproteins69,70 with biopharmaceutical relevance and of enzymes with enhanced turnover71 and altered substrate- and stereo-selectivity72.

Statistical protein design approaches

Existing protein structures already contain a vast amount of information that correlates amino acid sequence with structure. Database-driven energy functions reliant on the frequency of certain structural arrangements of amino acid backbones and side chains have been used to create ‘knowledge-based potentials’. DrugScore73 was used to predict and score ligand conformations at the active site of an enzyme using entropic contributions and implicit solvation, after learning from 159 experimentally resolved enzyme-ligand complexes. However, the lack of hydrogen atoms meant that it failed to capture the effect of protonation states and also undermined electrostatic contributions to a great extent. Buchete et al.74 developed statistical potentials using the orientations of different amino acid side chains seen in experimentally resolved crystal structures to predict folded conformations for a given protein sequence. Lin et al.75, on the other hand, used evolutionary information from multiple sequence alignments of homologous proteins from closely related organisms to develop knowledge-based statistical potentials. Here, each protein was converted to several binary profiles, each containing information about a different parameter (dihedrals, solvent accessibility, and so on) for each position (instead of actual amino acid sequences). An associated scoring system assessed how close a designed structure would be to existing structures from the alignment, and hence whether it would fold consistently. A similar approach (TmFoldRec) for predicting folds in membrane segments of transmembrane channels by learning from 124 crystallized transmembrane folds was published by Kozma et al.76. Knowledge-based protein design tools offer the advantage that additional descriptor terms (such as helix propensity or solvent exposure) can be introduced without significantly increasing computing time. Poole and Ranganathan77 provide a comprehensive review of similar knowledge-based potentials used for computational protein design. An integrated approach that uses a library (rotamer libraries78) of statistically preferred amino acid side-chain conformations in the phi-psi dihedral space, together with molecular-mechanics calculations to score the choice of a substituent amino acid rotamer, forms the basis of most current-day protein design software61,79.
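A common way to turn observed frequencies into a knowledge-based potential is the inverse Boltzmann relation, E = -kT ln(f_observed / f_reference). The sketch below shows that relation only; it is not the functional form of DrugScore, TmFoldRec, or any other cited tool, and the contact counts, pseudocount, and temperature are made-up illustrative values.

```python
import math

K_B_T = 0.593  # kcal/mol at roughly 298 K (assumed temperature)


def inverse_boltzmann(observed: dict, reference: dict,
                      pseudocount: float = 1.0, kt: float = K_B_T) -> dict:
    """Statistical energy per feature: E = -kT * ln(f_obs / f_ref).

    `observed` and `reference` map a feature (e.g. a residue-pair contact)
    to a raw count; a pseudocount avoids taking the log of zero.
    """
    keys = sorted(set(observed) | set(reference))
    total_obs = sum(observed.get(k, 0) for k in keys) + pseudocount * len(keys)
    total_ref = sum(reference.get(k, 0) for k in keys) + pseudocount * len(keys)
    energies = {}
    for k in keys:
        f_obs = (observed.get(k, 0) + pseudocount) / total_obs
        f_ref = (reference.get(k, 0) + pseudocount) / total_ref
        energies[k] = -kt * math.log(f_obs / f_ref)
    return energies


if __name__ == "__main__":
    # Hypothetical contact counts: database structures vs. a random reference.
    observed = {"LEU-ILE": 300, "LYS-LYS": 20, "LYS-GLU": 180}
    reference = {"LEU-ILE": 150, "LYS-LYS": 150, "LYS-GLU": 200}
    for pair, energy in inverse_boltzmann(observed, reference).items():
        print(f"{pair}: {energy:+.2f} kcal/mol")
```

Features seen more often than expected receive a favorable (negative) score and those seen less often an unfavorable one; the choice of reference state is the main modeling decision in such potentials.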

Force fields for computational protein design

Force fields are used to compute interaction and overall stability energy scores of protein-ligand complexes or individual proteins. These energy terms (or scores) represent side chain and backbone geometries, protonation states, and the effect of solvents; only enthalpic contributions are factored in (not protein entropy). Force-field calculations help to assess enzyme-substrate affinities and to model side chains. The most popular force field parameters (bond spring constants, bond angles, dihedrals, improper dihedrals, partial charges) are computed using ab initio quantum mechanical and calculations. Knowledge-based force fields like Rosetta use extra potential energy terms obtained after refitting of statistical and experimental knowledge-based data. Unlike statistical knowledge-based potentials, these empirical force fields are capable of capturing actual forces between atoms (electrostatics, van der Waals, and solvent contributions). Several force fields have been independently developed to date, such as Amber80, CHARMM81, OPLS82,83, GROMOS84, and Rosetta85,86. Depending on whether every atom or only heavy atoms and polarizable hydrogens are represented within the force field, they are called “all atom” or “united atom” force fields, respectively. GROMOS is exclusively a united atom force field; Amber (ff14SB87 or ff15FB88), CHARMM, and Rosetta are all atom; and OPLS has both versions. MacKerell et al.81 provide a detailed discussion on the development of empirical force fields.
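The non-bonded part of such an energy function can be sketched as a sum of 12-6 Lennard-Jones and Coulomb terms over atom pairs. The code below is a generic illustration under stated assumptions: the Rmin form of the Lennard-Jones term, Lorentz-Berthelot combining rules, a constant dielectric, no cutoffs, and no exclusion of bonded neighbors. The `Atom` class and the parameter values are hypothetical and are not taken from any published parameter set.

```python
import math
from dataclasses import dataclass

# Approximate Coulomb conversion factor in kcal*Angstrom/(mol*e^2);
# the exact value differs slightly between simulation packages.
COULOMB_CONSTANT = 332.06


@dataclass
class Atom:
    position: tuple      # (x, y, z) in Angstrom
    charge: float        # partial charge in units of e
    epsilon: float       # Lennard-Jones well depth, kcal/mol (positive)
    r_min_half: float    # half of the LJ minimum-energy distance, Angstrom


def lennard_jones(r: float, epsilon: float, r_min: float) -> float:
    """12-6 Lennard-Jones energy with its minimum of -epsilon at r = r_min."""
    ratio6 = (r_min / r) ** 6
    return epsilon * (ratio6 ** 2 - 2.0 * ratio6)


def coulomb(r: float, q_i: float, q_j: float, dielectric: float = 1.0) -> float:
    """Coulomb energy between two point charges at separation r."""
    return COULOMB_CONSTANT * q_i * q_j / (dielectric * r)


def nonbonded_energy(atoms: list, dielectric: float = 1.0) -> float:
    """Sum LJ and Coulomb energies over all unique atom pairs (no exclusions)."""
    total = 0.0
    for i in range(len(atoms)):
        for j in range(i + 1, len(atoms)):
            a, b = atoms[i], atoms[j]
            r = math.dist(a.position, b.position)
            eps_ij = math.sqrt(a.epsilon * b.epsilon)      # geometric mean
            r_min_ij = a.r_min_half + b.r_min_half          # arithmetic sum
            total += lennard_jones(r, eps_ij, r_min_ij)
            total += coulomb(r, a.charge, b.charge, dielectric)
    return total


if __name__ == "__main__":
    pair = [Atom((0.0, 0.0, 0.0), 0.5, 0.1, 1.9),
            Atom((3.8, 0.0, 0.0), -0.5, 0.1, 1.9)]
    print(f"{nonbonded_energy(pair):.2f} kcal/mol")
```

A full force field adds the bonded terms (bonds, angles, dihedrals, impropers) and a solvation model on top of this pairwise sum.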

Biophysical protein design tools

Biophysical protein design tools compute the enthalpic energy contributions of covalently bonded amino acids along the polypeptide backbone of a protein and of pairwise non-covalent interactions (van der Waals, electrostatics, and solvent effects) between atoms in proximity to each other. These force-field based energy scores are used in iterative or random-substitution computational workflows to make design choices towards identifying stable enzyme variants with improved ligand affinity, altered cofactor specificities, and other biochemical objectives. Several tools using either full atomistic57,89,90 or coarse-grained91,92 representations of proteins have been developed over the last two decades. Go and Taketomi93 employed non-transferable potentials tailored to the native structure of a protein by evaluating the partial contributions of long-range and short-range forces at play throughout the molecule. Any variant of the native protein (referred to as a ‘Go-model’) attains its lowest energy score when the corresponding inter-residue root mean square deviation from the native structure is at a minimum. Even though Go-proteins cannot explore novel folds, they have had a high success rate in identifying functional variants that fold, as only an extremely restricted set of positions is permitted substitutions, and only to side chains with native-like properties (charge and size). The protein module of the Martini coarse-grained force field94 was developed for predicting peptide conformations in lipid bilayers. This was an extension to the lipid-exclusive Martini force field92. Using a dioleoylphosphatidylcholine (DOPC) bilayer and a series of pentapeptides as a model system, the potential of mean force for each amino acid was evaluated as a function of its distance from the center of the lipid region of the bilayer. These values were used as precedents to estimate the geometry of any new transmembrane protein whose overall geometry depends on interactions with the surrounding lipid molecules. For a detailed account of other coarse-grained models, we suggest the review by Ivan Coluzza95.

Full atomistic simulation packages, on the other hand, are capable of handling fully resolved all-atom structures of entire proteins and have a precise description of bonded and non-bonded parameters, and consequently involve longer compute times. RosettaDesign57, Maestro96 (Schrödinger Inc.), PoreDesigner90, and IPRO61 are examples of such full atom protein design packages. These packages have two essential compute modules: (a) a rotamer chooser, and (b) a force-field dependent evaluation of the redesigned protein. During protein design, RosettaDesign and Maestro both create large randomized libraries of protein variants with minimum deviation from the native structure or from a scored property (hydrophobicity, binding to an interacting partner, and so on), followed by evaluation of enthalpic energy scores for each design using their respective empirical force-field energy functions. These energy scores are subsequently used to rank the designs according to the design objective (such as interaction with a ligand). IPRO and PoreDesigner, on the other hand, iteratively use a mixed-integer linear program to identify unique combinations of amino acid substitutions that satisfy the design objective. These choices are driven by CHARMM force-field based energy scores accounting for bonded and non-bonded energy terms. IPRO is an iterative protein redesign and optimization tool that emulates focused mutagenesis to identify stable enzyme variants that accomplish intended binding or unbinding of a substrate (or improve binding with one substrate while simultaneously eliminating binding with another). Figure 10 provides a general seven-step schematic overview of the RosettaDesign and IPRO execution modules. The DESADER acronym represents the seven general steps: Dock, Ensure catalytic constraints, Substrate binding residue identification, Adjacent residue repacking, Designing sequence, Energy minimization, and Ranking of designs. PoreDesigner relies on similar principles and predicts designs that enable users to precisely tune the pore size of any channel protein, thus offering interaction- or size-based separations of aqueous solute mixtures. It has been experimentally validated to successfully redesign a bacterial porin to narrow pore sizes that performed perfect desalination using a membrane assembly. Donald et al.97 and Pantazes et al.98 provide comprehensive reviews of other tools for computational protein design.

Figure 10. The seven-step DESADER schematic overview of the enzyme redesign computational workflows of RosettaDesign and IPRO. RosettaDesign uses a stochastic Monte Carlo search to create a library of enzyme variants. IPRO uses a deterministic mixed-integer linear optimization program to identify amino acid substitutions, which are driven by the biochemical design objective (such as maximizing or minimizing the CHARMM-based interaction energy score with the ligand).
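The stochastic route can be illustrated with a generic Metropolis Monte Carlo search over sequences. This is a toy sketch, not the RosettaDesign or IPRO implementation: the energy function is a hypothetical stand-in (mismatches to an arbitrary target sequence) for a force-field score, there is no rotamer repacking or backbone relaxation, and the temperature and step count are arbitrary.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def toy_energy(sequence, target) -> int:
    """Hypothetical stand-in for a force-field score: mismatches to a target."""
    return sum(a != b for a, b in zip(sequence, target))


def metropolis_design(start: str, energy_fn, n_steps: int = 5000,
                      kt: float = 0.5, seed: int = 0):
    """Single-site substitution Monte Carlo with the Metropolis criterion."""
    rng = random.Random(seed)
    current = list(start)
    e_current = energy_fn(current)
    best, e_best = current[:], e_current
    for _ in range(n_steps):
        pos = rng.randrange(len(current))
        old = current[pos]
        current[pos] = rng.choice(AMINO_ACIDS)
        e_new = energy_fn(current)
        delta = e_new - e_current
        # Always accept downhill moves; accept uphill moves with Boltzmann probability.
        if delta <= 0 or rng.random() < math.exp(-delta / kt):
            e_current = e_new
            if e_new < e_best:
                best, e_best = current[:], e_new
        else:
            current[pos] = old  # reject: restore the previous residue
    return "".join(best), e_best


if __name__ == "__main__":
    target = "MKTAYIAKQR"  # arbitrary example sequence
    design, score = metropolis_design("A" * len(target),
                                      lambda s: toy_energy(s, target))
    print(design, score)
```

A deterministic formulation, by contrast, would encode the same residue choices as binary variables in an optimization problem and solve for the best combination directly rather than sampling.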

2.2. Successes

Successes in the ‘directed evolution’ of enzymes

Enzyme design is a difficult challenge, as only an infinitesimally small fraction of possible amino acid sequences adopts a functional fold. It has been estimated99 (using a beta-lactamase as a proxy) that the fraction of all sequences that fold into viable enzymes with some minimal activity is as low as 1 in 10^77. This implies that random mutations almost always adversely affect protein function. Directed evolution therefore capitalizes on the range of weak promiscuous activities of enzymes, which can be quickly driven towards a desired catalytic activity once a few key mutations have been pinpointed. Random mutagenesis, focused mutagenesis, and homologous recombination protocols, along with efficient expression and screening of variant proteins, have yielded several successes in redesigning enzymes for improved catalysis, altered substrate and cofactor specificities, and stability.

Reetz et al.100 used a two-step protocol: first, epPCR was employed to obtain a short library of enantioselective cyclohexanone monooxygenases with R or S selectivity that showed at least 95% activity compared to wild type; then, random mutagenesis with subsequent screening for activity yielded eight mutants with turnovers of the desired enantiomer improved two- to nine-fold over the wild type. Sequencing revealed only one to three amino acid changes in each of these eight mutants.

A separate endeavor by Arnold and co-workers101 pushed the activity-stability trade-off by using random mutagenesis, recombination, and screening of mesophilic Bacillus subtilis p-nitrobenzyl esterase, and designed seven thermostable variants with melting temperatures (Tm) higher than the wild type by 5–14 °C. Of these seven, the three best mutants showed activities higher than wild type. The best mutant (Tm = 66.5 °C; specific activity = 0.16 mmol product/(min mg enzyme), compared with a wild-type activity of 0.125 mmol product/(min mg enzyme)) was screened from 1,500 possible variants and exhibited stability on par with thermophilic enzymes. This was comparable with results from site-directed mutagenesis102, which however necessitates extensive sequence and structure information a priori. Ultimately, one of the most important insights gleaned from this study was that activity always increases with temperature until the enzyme denatures. This means that simultaneous low-temperature activity and thermostability screening is sufficient to produce highly active variants viable across a wide temperature range.

Random mutagenesis explorations have been instrumental in identifying beneficial mutations which are beyond the scope of other strategies. Kim et al.103 demonstrated that random mutagenesis of Agrobacterium sp. beta-glucosidase and screening using an in vitro endocellulase-coupled assay yielded two highly active mutants with two (A19T, E358G) and four (A19T, E358G, Q248R, M407V) mutations, with activities seven- and twenty-seven-fold higher than wild type, respectively. What sets this work apart from other similar studies is that all these mutations were at least 9 Å away from the substrate and could not directly interact with it to affect the turnover. This suggested that these mutations bring about conformational changes in the active site, thereby providing a congenial groove for the substrate to sit in and potentially react. We used the Agrobacterium sp. beta-glucosidase sequence (NCBI accession: WP_006316672.1), generated the best mutant sequence, and homology modeled it using SWISS-MODEL104 to show that these mutations are distal from the substrate-binding domain (see Figure 11).

In another effort, Zhao et al. developed a staggered extension process (StEP)105 for in vitro mutagenesis and recombination of polynucleotide sequences. In contrast to optimized DNA shuffling43, where DNase I digests a set of parent genes into an array of DNA fragments which are thermocycled into complete genes using DNA polymerase, StEP generates full-length recombination cassettes relying on template-based extension using DNA polymerase. They tested the recombination efficiency between two thermostable subtilisin E genes, which code for a protease. Adenine to guanine changes at base 1107 of gene 1 and base 995 of gene 2 led to the amino acid changes N181D and N218S in the final protein. The single variants N181D and N218S exhibited threefold and twofold longer half-lives, respectively, and the double mutant an eightfold longer half-life than wild type at 65 °C, and were stable even at 75 °C. Of the 368 clones that were screened, 84% were active and showed wild-type-like catalytic activity. Of the active ones, 21% exhibited thermostability like the double mutant, 61% were like the single mutants, and 18% were as thermostable as the wild type.

Figure 11. The four mutations that led to enhanced catalysis in beta-glucosidase have been marked as blue sticks and the protein is represented as light pink cartoon. All four mutations are too far from the binding pocket to interact with the substrate.

In contrast to most random mutagenesis studies, where an enzyme is engineered with the objective of finding a fitter variant with altered stereo-specificity, thermostability, or higher activity than wild type, Chen et al.106 engineered a serine protease from Bacillus subtilis to function in a highly non-natural environment with high concentrations of a polar organic solvent, dimethylformamide (DMF). Proteases and lipases are known to be promising catalysts for organic synthesis of acrylic and methacrylic esters107, which find applications as cement material for knee or hip arthroplasty surgeries108 and as PVC modifiers in the plastics industry. In this work, mutagenesis and screening were performed with the objective of identifying amino acid substitutions that recover the lost catalytic activity of serine protease in organic media. Through three rounds of sequential screening, ten amino acid changes that restored catalysis were pinpointed within the binding groove of the substrate, in loop regions that offered sequence variability without affecting the tertiary folds of the reactive pocket. Seven of the ten mutations were seen in protease homologs from other organisms. To investigate the effect of each mutation, ten single variants were generated and checked for enhancement of catalytic activity (Km and kcat). Results indicated that the G131D mutation alone enhanced substrate affinity in 20% DMF by ~90% (reduction of Km from 12.2 mM to 1.4 mM). Furthermore, the N181S, T255A, E156G, S182G, and S188P mutations were also identified to reduce Km, and all mutants carrying the last three mutations improved catalysis even in aqueous media. Overall, we noticed that mutations in which the substituted amino acids are hydrophobic led to lower Km in organic media and thus stabilized the substrate, whereas mutations involving charged residues (D60N, Q103R, and G131D) improved kcat and thus product formation. This is corroborated by the observation that the latter set of mutations also enhanced turnover in aqueous medium (0% DMF).
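The practical consequence of a lower Km can be sketched with the Michaelis-Menten rate law, v = kcat [E] [S] / (Km + [S]). In the snippet below only the two Km values (12.2 mM and 1.4 mM) come from the text; the kcat, enzyme concentration, and substrate concentrations are hypothetical placeholders chosen for illustration, and kcat is held equal for both variants.

```python
def michaelis_menten_rate(substrate_mM: float, km_mM: float,
                          kcat_per_s: float, enzyme_mM: float = 1e-3) -> float:
    """Initial reaction rate (mM/s) from the Michaelis-Menten equation."""
    return kcat_per_s * enzyme_mM * substrate_mM / (km_mM + substrate_mM)


if __name__ == "__main__":
    KM_WILD_TYPE = 12.2   # mM, from the text
    KM_G131D = 1.4        # mM, from the text
    KCAT = 10.0           # 1/s, hypothetical and assumed equal for both

    reduction = (KM_WILD_TYPE - KM_G131D) / KM_WILD_TYPE
    print(f"Km reduction: {reduction:.1%}")

    for s in (0.1, 1.0, 10.0, 100.0):  # substrate concentrations in mM
        v_wt = michaelis_menten_rate(s, KM_WILD_TYPE, KCAT)
        v_mut = michaelis_menten_rate(s, KM_G131D, KCAT)
        print(f"[S] = {s:>5} mM: rate ratio mutant/wild type = {v_mut / v_wt:.2f}")
```

With equal kcat, the benefit of the lower Km is largest at substrate concentrations well below Km and vanishes at saturation, where both variants approach the same maximal rate.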

Successes in computational enzyme redesign

Nearly all engineered enzymes that are used today emerged from structure-based protein-engineering efforts of the 1980s. The successes have been notable, but results came slowly until the advent of directed evolution in the 1990s, which led to major breakthroughs. However, most amino acid changes accumulated during evolution have marginal or no effect on the desired catalysis, making it a ‘needle in a haystack’ problem to pinpoint key positions. To this end, computational methods have shown promise in sampling thousands of amino acid combinations and conformations and assessing their impact on protein stability. Table 2 shows 50 key publications in protein design that use computational and experimental steps to generate stable de novo protein scaffolds, catalytic antibodies, and highly active enzyme redesigns. Several in silico tools have been able to glean design rules, which have been used to tune the substrate and cofactor specificities of various enzymes and to unravel novel, non-natural catalytic modes.

In 1997 the Mayo lab reported the first case of de novo redesign of the streptococcal protein G β1 domain, using a van der Waals potential to compute steric contributions and an atomic solvation potential to favor burial of non-polar residues. The selection algorithm iteratively scanned and identified the optimal side chain conformation for a given backbone pose and accepted designs based on the sum of two pairwise interaction terms: (a) side chain and backbone, and (b) side chain and side chain. A statistically preferred set of 1.1×10^62 side chain rotamers109 was used, a dead-end elimination theorem110 was employed to constrain the search space to non-clashing ones, and complete sequence design for a 50-residue window was achieved in every design run. The design process only targeted non-polar residues from the surface residues. The design exhibited a striking geometrical resemblance to the zinc-finger protein Zif268 even though the sequence similarity (39%) and identity (21%) were low, with most conserved residues located in buried and ordered regions of the protein, indicating this to be a novel sequence. NCBI p-BLAST111 revealed this sequence to have a similar alignment score (<39% identity) with any random amino acid sequence of similar length. This work paved the way for competing in silico methods to handle the immense combinatorial search required for computational protein design, and inspired the development of various molecular-mechanics based force fields (CHARMM112, Amber80, GROMACS113). These force fields started factoring in near-accurate contributions of van der Waals, electrostatics, and solvation terms. Furthermore, Sumners and Schulten introduced molecular dynamics for studying temporal fold changes and the stability of biomolecular complexes114 (such as enzyme-ligand) to determine their macroscopic thermodynamic properties following the ergodic hypothesis.
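The idea behind dead-end elimination can be sketched with its original single-rotamer criterion: rotamer r at position i can be discarded if some other rotamer t at i satisfies E(i_r) + Σ_j min_s E(i_r, j_s) > E(i_t) + Σ_j max_s E(i_t, j_s). The code below is a minimal illustration of that textbook criterion using the same two energy classes named above (side chain with backbone, and side chain with side chain); it is not the cited implementation, and the data layout and toy energies are assumptions.

```python
def pair(pair_energy: dict, i: int, r: int, j: int, s: int) -> float:
    """Pair energy between rotamer r at position i and rotamer s at position j."""
    if i < j:
        return pair_energy[(i, j)][r][s]
    return pair_energy[(j, i)][s][r]


def dead_end_elimination(self_energy: list, pair_energy: dict) -> list:
    """Iteratively remove rotamers that cannot be part of the global minimum.

    self_energy[i][r]       : side chain-backbone energy of rotamer r at position i
    pair_energy[(i, j)][r][s]: side chain-side chain energy for i < j
    Returns, for each position, the sorted list of surviving rotamer indices.
    """
    n = len(self_energy)
    alive = [set(range(len(rotamers))) for rotamers in self_energy]
    changed = True
    while changed:
        changed = False
        for i in range(n):
            others = [j for j in range(n) if j != i]
            for r in list(alive[i]):
                # Best energy rotamer r could possibly achieve.
                best_r = self_energy[i][r] + sum(
                    min(pair(pair_energy, i, r, j, s) for s in alive[j])
                    for j in others)
                for t in alive[i] - {r}:
                    # Worst energy competitor t could possibly have.
                    worst_t = self_energy[i][t] + sum(
                        max(pair(pair_energy, i, t, j, s) for s in alive[j])
                        for j in others)
                    if best_r > worst_t:
                        alive[i].discard(r)
                        changed = True
                        break
    return [sorted(a) for a in alive]


if __name__ == "__main__":
    # Toy system: two positions with two rotamers each (arbitrary energies).
    self_e = [[0.0, 10.0], [0.0, 0.5]]
    pair_e = {(0, 1): [[-1.0, 0.0], [0.0, 1.0]]}
    print(dead_end_elimination(self_e, pair_e))
```

Because a rotamer is removed only when a surviving competitor is guaranteed to be at least as good in every context, the global minimum-energy combination is never eliminated; the pruned space can then be searched exhaustively or stochastically.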

Table 2. List of 50 computational enzyme design successes to date (grouped by relevance). The experimental aspect of each of these endeavors has been noted as well, indicating that the majority of these successes are due to the synergistic effort of simulations and experiments.

Year Study Experimental contribution Reference

2004 Design of ‘bait-and-switch’ catalytic antibodies Reactive immunization Xu et al. 1

2000 Shape complementarity, binding-site dynamics, Chen et al. 2 and transition state (TS) stabilization using 1E9 antibody by QM study of Diels-Alder catalysis

1993 QM calculations on endo and exo stereoisomeric Diels-Alder cycloaddition Gouverneur et al. 3 TSs of Diels-Alder cycloaddition reaction to obtain enantiomerically pure products

1998 QM calculations identify AspH50 and TyrL36 as Crystal structure of exo Diels- Heine et al. 4 catalytic residues and AsnL91 for TS stability in Alderase inhibitor complex 13G5 catalyzed exo Diels-Alder between N- solved at 1.95 Å butadienyl carbamate and N,N- dimethylacrylamide

2003 MD relaxation of 13G5 antibody around rigid Absolute enantiomeric Cannizzaro et al.5 TS revealed role of three water molecules in selectivities of 13G5 and 4D5 orienting catalytic base AspH50 antibodies established

2002 Shape complementarity of TS and catalytic triad Crystal structures of Fab Hugot et al. 6 Trp-Phe-Ser identified in 10F11 antibody for 10F11 and 9D9 antibodies in retro Diels-Alder reaction complex with substrate analogs solved at 1.8 Å and 2.3 Å

1990 DFT calculations used to discern active site Leach et al. 7 interactions (TrpH104-PheH101) by π- π stacking to stabilize TS for 10F11 antibody during retro Diels-Alder reaction

2002 Proof of non-specific TS binding offered by Kinetic constants for TS- Kim et al. 8 antibodies guided by solvophobic effects, unlike antibody binding calculated enzymes in Diels-Alder reactions for 1E9, 39A11, 13G5, 4D5, 22C8 and 7D4 antibodies

62

1995 Proof of Kemp eliminase activity being related Catalytic antibody found for Casey et al. 9 1996 to TS geometry and polarity of the solvent in ring opening in Kemp 34E4 antibody elimination Kemp et al. 10

1988 QM calculations on aldol reactions exploring Li and Houk 11 relative stability of ‘chair’ and ‘twist boat’ TS structures

1995 TS studies on ab38C2, ab84G3, and ab33F12 Activities of aldolase Wagner et al. 12 1997 aldolase antibodies to catalyze aldol and retro- antibodies measured to be Barbas et al. 13 1998 aldol reactions akin to class I aldolases using ϵ- comparable to natural Hoffmann et al. 14 amino group of catalytic LysH93 aldolases 2003 QM study on polar residues at binding pockets Arnó and Domingo of aldolase Abs in C-C bond-formation step 15

1975 Aprotic polar solvents desolvate carboxylate Kinetic constants for this Kemp et al. 16 1995 reactant by stabilizing TS through dispersion reaction are experimentally Zipse et al. 17 interactions using Monte-Carlo free energy estimated perturbation (FEP) calculations

2003 QM, MD, and FEP calculations on 21D8 21D8 catalyzes Ujaque et al. 18 1991 antibody for decarboxylation-catalyzed ring- decarboxylation of 5-nitro-3- Lewis et al. 19 opening reaction carboxybenisoxazole by 61,000-fold than in water.

1993 QM study on endo-tet TS for cyclization of X-ray structure of antibody Na et al. 20 1995 trans-epoxy alcohols show SN1 behavior and Fab5C8 crystallized Gruber et al. 21 AspH95-HisL89 catalytic residues

1994 Homology model of 43C9 antibody variable Water-mediated hydrogen- Roberts et al. 22 1999 region revealed ArgL96 to be the oxyanion hole bonding network at the active Thayer et al. 23 and HisL91 the catalytic nucleophile for site is key for catalysis seen hydrolysis of aromatic amides. from X-ray crystal of 43C9

2003 QM, MD, and FEP calculations on 43C9 reveals Chong et al. 24 alternate mechanism using direct hydride attack

1995 QM calculations to mimic active site of Catalytic rate analysis Wiest et al. 25 1994 chorismite mutase antibodies IF7 and IIF1-2E11 performed after crystallizing Haynes et al. 26 (for Claisen rearrangement) revealed H-bond IF7 antibody. donors at the active site.

2010 Reconstructed evolutionary adaptive path Whole-gene synthesis (library Chen et al. 27 (REAP) analysis at active site of T. aquaticus size = 93) reveals predicted DNA polymerase to accept unnatural NTPs single amino acid changes efficiently catalyze unnatural NTPs

2010 Improved enantioselectivity via 3DM analysis at Site-saturation mutagenesis Jochens et al. 28 4 specific active sites on P. fluorescens esterase (library size = ~500) yielded ~200-fold improvement in activity and ~20-fold higher enantioselectivity

2008 Hot-spot selection to improve pH/protease Whole-gene synthesis (library Ehren et al. 29 stability of S. capsulata prolyl endopeptidase size = 91) revealed 200-fold based on multiple sequence alignment and ML higher protease resistance and on peptide library 20% higher activity

63

2009 MD simulations to identify mutational hotspots Site-directed and site- Pavlova et al. 30 in access tunnels to active site of R. rhodochrous saturation mutagenesis (library haloalkane dehalogenase size = 2500) showed 32-fold higher activity

2009 SCHEMA structure guided recombination of Whole gene synthesis (library Heinzelman et al. 31 peptide fragment from 3 CBH II cellulase for size = 48) showed 15°C higher increased thermostability thermostability

2010 MOE molecular modeling analysis for altered Site-saturation and random Savile et al. 32 substrate specificity, solvent tolerance and mutagenesis (library size = thermostability on Arthobacter sp. transaminase 36,000).

2009 K* algorithm and SCMF entropy-based protocol Site-directed mutagenesis Chen et al. 33 using rotamer library and flexible ligand (library size = 10) showed docking to switch specificity from Phe to Leu/ 600-fold specificity shift from Arg/ Lys/ Glu/ Asp on gramicidine S synthetase Phe to Leu by changes in kM A Phe-adenylation domain values 2019 IPRO used to explore promiscuity of A domain Site-directed mutagenesis Throckmorton et al. of Ser-specific NRPS from E. coli (library size = 160) identified 34 152 new Ser-specific domains

2009 RosettaDesign to vary active-site and loop- Site-directed mutagenesis and Murphy et al. 35 length composition for human guanine PCR assembly (library size = deaminase to switch specificity for ammelide/ 10) showed >106 specificity cytosine change

2010 QM/ MM simulations using RosettaMatch on Site-directed mutagenesis Siegel et al. 36 Diels-Alderase (library size = 100) showed activity similar to catalytic antibodies

2009 VMD modeling to reconstitute active site of Site-directed mutagenesis Yeung et al. 37 nitric oxide reductase (NOR) in myoglobin yielded functional NOR

2009 | Hotspot Wizard server to create mutability maps based on sequence-structure information from existing protein databases | Haloalkane dehalogenase (DhaA) engineering from Rhodococcus rhodochrous | Pavelka et al. 38; Pavlova et al. 30

2007 | Engineering proteinase K using machine learning and synthetic genes | 24 amino acid substitutions in 59 variants were tested for hydrolase activity on tetrapeptides at 68°C | Liao et al. 39

2010 | Functional benefits of distal mutations through induced allostery for enantioselective Baeyer-Villiger monooxygenase using MD simulations (also discerned active site geometry changes) | Directed evolution experiments (library size = 400) revealed one double mutant that induced allostery | Wu et al. 40

2003 | Pairwise alignment of N-acetylneuraminate lyase (NAL) and dihydrodipicolinate synthase (DHDPS) revealed Leu-Arg mismatch at active site | An L142R mutation in NAL abolished NAL activity and improved DHDPS activity by 8-fold | Joerger et al. 41

2005 | Four mutations to active site of keto-L-gluconate phosphate synthase identified to enhance promiscuity to arabinose-hex3-ulose 6 phosphate synthase (HPS) | 170-fold higher HPS activity was recorded | Yew et al. 42


2000 | De novo design of helical bundle scaffolds for metal-chelation activity recording using His-triad catalytic motif | Dinuclear metal-binding | Hill et al. 43

2018 | PoreDesigner to redesign beta-barrel scaffold from E. coli OmpF to access any user-defined sub-nm pore size | Stopped-flow light scattering experiments revealed that the narrowest design performs like an aquaporin | Chowdhury et al. 44

2003 | RosettaDesign used to design nine globular proteins | Circular dichroism experiments confirmed 8/9 of these proteins to be folded akin to native, and 6/9 showed up to 7 kcal/mol higher stability than wild type | Dantas et al. 45

2003 | alpha/beta for accessing novel folds by iterative search through sequence design and structure prediction | 93 residue alpha/beta fold protein crystallized and structural folds matched with RMSD = 1.2 Å | Kuhlman et al. 46

2008 | Computational design of periplasmic binding proteins through conformer sampling and continuous minimization revealed the importance of accurate capture of partial charges and electrostatic potentials | Evaluation of kinetic constants experimentally revealed the design to outperform the native Kd values (17 µM vs. 210 nM for native) | Boas and Harbury 47

2010 | De novo alpha-helical bundle designed to bind heme-like large cofactors | UV/visible and circular dichroism, size exclusion chromatography and analytic centrifugation indicate active enzyme but low activities | Fry et al. 48

2010 | MD simulations with all-atom Amber force fields were used to assess the integrity of a Kemp eliminase, identifying caveats that static simulations are agnostic to | - | Kiss et al. 49

2015 | Computational protocol for zeolites with detailed description of active site interactions | - | Sauer and Freund 50

2010, 2010 | Influence of structural fluctuations on active-site preorganization in RA22 using molecular dynamics revealed that an alternate conformation of substrate relative to His233 allows nucleophilic attack by Lys159, where Asp53 (original catalytic residue) is solvated and hence non-catalytic | A separate experimental endeavor discerned that catalysis is carried out mainly by Lys159 due to the favorable interaction with the naphthyl group of the substrate | Ruscio et al. 51; Lassila et al. 52

2010 | Empirical valence bond calculations using FEP umbrella sampling on Kemp eliminase (KE) designs (KE07, KE70, KE59) | Difficulty in improving the KE activity is due to improper partial charge characterization | Frushicheva et al. 53

2010 | Eight mutations identified on KE07 with the objective of improving activity further | 2.6-fold lower KM and 76-fold higher kcat value yielding 200-fold higher activity | Khersonsky et al. 54

2011 | Computational sequence optimization for increased activity of KE70 | Nine rounds of random mutagenesis along with computational predictions yielded 12-fold lower KM and 53-fold higher kcat | Khersonsky et al. 55

2012 | Fold-stabilizing mutations were predicted to enhance activity of KE59 | 16 rounds of directed evolution yielded >2000-fold increase in activity | Khersonsky et al. 56

2012 | Iterative approach to ensure every design cycle necessitates active enzyme redesigns and MD screening of mutants before experiments | Kinetic characterization of Kemp eliminases HG-3 and HG-2 shows higher activity of HG-3 | Privett et al. 57

2017 | Computational redesign of Acyl-ACP thioesterase with improved selectivity toward medium-chain-length fatty acids | 27 variants with enhanced C8-production titers were constructed and best mutant was crystallized (5TID) | Grisewood et al. 58

2017 | Highly active C8-Acyl-ACP using synthetic selection and computational modeling | 1.7 g/L C8-titers with >90% specificity towards C8 and 15-fold increase in kcat over WT | Lozada et al. 59

Soon after, Baker and colleagues used a novel computational enzyme design methodology115 to catalyze the Kemp elimination reaction, which has a high activation energy barrier and for which no naturally occurring enzyme exists. Eight in silico designs were generated, each containing one of the two proposed catalytic motifs. Directed evolution on these designs produced a >200-fold increase in kcat/Km values. The Kemp elimination is the amine-induced elimination of benzisoxazole to the corresponding o-cyanophenolate ion116. The reaction requires a base-mediated proton abstraction from a carbon, with subsequent dispersion of the resulting negative charge or stabilization of the partial negative charge on the phenolic oxygen. To this end, the authors designed two alternative idealized catalytic bases: (a) an Asp-His dyad, or (b) a single aspartate or glutamate. Quantum-mechanical calculations on the backbone of the desired binding pocket were used to choose an optimal combination of amino acids that served the dual objective of stabilizing the substrate and positioning the catalytic base at the appropriate distance from the substrate.
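Because catalytic efficiency is the ratio kcat/KM, a fold increase in kcat and a fold decrease in KM multiply into the overall fold change. The minimal sketch below illustrates this arithmetic with the rounded fold changes listed for KE07 and KE70 in the table above; the function name `efficiency_fold_change` is a hypothetical helper introduced only for illustration.

```python
def efficiency_fold_change(kcat_fold_increase: float, km_fold_decrease: float) -> float:
    """Fold change in kcat/KM when kcat rises and KM falls by the given factors."""
    if kcat_fold_increase <= 0 or km_fold_decrease <= 0:
        raise ValueError("fold changes must be positive")
    # (a * kcat) / (KM / b) = (a * b) * (kcat / KM)
    return kcat_fold_increase * km_fold_decrease


# Rounded fold changes taken from the table rows for KE07 and KE70.
ke07 = efficiency_fold_change(76, 2.6)
ke70 = efficiency_fold_change(53, 12)
print(f"KE07: ~{ke07:.0f}-fold, KE70: ~{ke70:.0f}-fold")
```

For KE07 the product is close to the roughly 200-fold gain in activity reported in the table; the small difference comes from rounding of the individual factors.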

RosettaMatch57 was used to screen about 10^5 binding pocket designs by finding the most stable side chain conformations of the pocket residues from each design, given their backbone conformation. The designs were scored based on binding free energies between the enzyme and the transition state. The 49 top designs were synthesized in vitro and eight showed catalytic activity. After seven rounds of mutagenesis and screening, one of the designs showed a kcat/Km value of 2600 M⁻¹ s⁻¹, which was a result of fine-tuning the pocket residues to accommodate the substrate better. A recent work by Kingsley et al.117 describes such binding pockets as ‘substrate tunnels’, and the authors demonstrate that turnover can be severely impacted by altering the pocket residues even if the catalytic motif is unperturbed. The Kemp eliminase study demonstrates the potential of a synergistic workflow between computational enzyme design, to create an overall active site framework, and molecular evolution, to explore novel enzyme-mediated reactions.

The Baker lab followed up with another breakthrough by designing enzymes for an energetically more demanding retro-aldol118 reaction that involved breaking a carbon-carbon bond of a hydroxy-carbonyl compound to form an aldehyde (or ketone) and another carbonyl moiety using acid-base catalysis initiated by a nucleophilic attack on the ketone. In this enamine catalysis, the carbinolamine intermediate undergoes spontaneous dehydration to yield an imine/iminium species. Subsequently, the enamine tautomerizes to another imine, which undergoes a similar dehydration to release the product and free the enzyme. The authors constructed several protein scaffolds that can simultaneously accommodate both the transition states for the two-step reaction and grafted four alternative quantum-mechanically optimized catalytic motifs that would initiate the acid-base catalysis.

Altogether 32 designs showed weak catalytic activity, with the most active designs containing a co-crystallized water molecule that served the dual role of stabilizing the intermediate and acting as a proton acceptor. Even though the X-ray crystallographic structure of the active site showed close agreement with the design model (RMSD < 1 Å), the catalytic efficiencies were low (0.74 M⁻¹ s⁻¹ and a turnover of one molecule of product every 2 hours). Interestingly, in both these studies, the most promising folds generated in silico that had high catalytic efficiencies were of the triose-phosphate-isomerase (TIM) barrel type, containing eight alpha helices and eight beta strands. TIM barrels are known to be very effective for enzymatic reactions119, and thus this shows convergence between in vivo and in silico fold preference.

Most computational enzyme redesign approaches have been aimed at improving substrate specificity and catalyzing non-natural reactions. One of the earliest examples of switching cofactor specificity was in Candida boidinii xylose reductase (CbXR)120. The authors gleaned and incorporated key cofactor-switching mutation information from previous studies and successfully altered the cofactor preference from NADPH to NADH. Amino acid changes in the CbXR binding pocket were systematically chosen using a mixed-integer linear program with the objective of simultaneously improving binding to NADH while eliminating binding to NADPH, where the binding score was expressed as a sum of van der Waals, electrostatics and solvation terms. After sampling nearly 8,000 possible CbXR variants, 10 were found to show enhanced affinity for NADH, and seven of the ten designs showed significant xylitol production. Eight out of ten designs showed more than 90% abolition of NADPH-dependent activity, while the remaining two showed equal preference for NADPH and NADH. The best design exhibited a 27-fold improvement in NADH-dependent activity. Other successes from the same group include OptGraft121 for grafting a binding site from one protein into another protein scaffold, rational design to obtain 200-fold higher D-hydantoinase activity in Bacillus stearothermophilus using just two amino acid changes122, OptZyme123 for redesigning enzymes by improving binding to a transition state analogue instead of the substrate, as it correlates with greater turnover, the IPRO Suite of programs89 for fully automated protein redesign, and altering the substrate specificity of a thioesterase enzyme from long-chain fatty acyl-ACP to medium-chain ones50.
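The dual objective used in the CbXR cofactor switch (tighter NADH binding, weaker NADPH binding, with the score summed over van der Waals, electrostatic and solvation terms) can be illustrated with a toy ranking. The sketch below is not the published mixed-integer linear program: it swaps in brute-force enumeration over two hypothetical pocket positions, and every energy value, position label and function name (`TOY_ENERGIES`, `binding_score`, `rank_variants`) is invented purely for illustration.

```python
from itertools import product

# Hypothetical interaction energies (kcal/mol, lower = tighter binding).
# position -> residue -> cofactor -> (van der Waals, electrostatics, solvation)
TOY_ENERGIES = {
    "pos1": {
        "K": {"NADH": (-1.0, -0.5, 0.2), "NADPH": (-1.2, -2.0, 0.3)},
        "D": {"NADH": (-0.9, -1.5, 0.1), "NADPH": (-0.8, 1.0, 0.4)},
    },
    "pos2": {
        "S": {"NADH": (-0.6, -0.2, 0.1), "NADPH": (-0.7, -0.9, 0.1)},
        "A": {"NADH": (-0.7, 0.0, 0.0), "NADPH": (-0.5, 0.0, 0.0)},
    },
}


def binding_score(variant, cofactor):
    """Sum the three energy terms over all pocket positions of a variant."""
    return sum(sum(TOY_ENERGIES[pos][res][cofactor]) for pos, res in variant.items())


def rank_variants():
    """Enumerate all residue combinations and sort by E(NADH) - E(NADPH)."""
    positions = list(TOY_ENERGIES)
    ranked = []
    for choice in product(*(TOY_ENERGIES[p] for p in positions)):
        variant = dict(zip(positions, choice))
        e_nadh = binding_score(variant, "NADH")
        e_nadph = binding_score(variant, "NADPH")
        # A more negative difference favors NADH over NADPH.
        ranked.append((e_nadh - e_nadph, variant, e_nadh, e_nadph))
    ranked.sort(key=lambda item: item[0])
    return ranked


for diff, variant, e_nadh, e_nadph in rank_variants():
    print(variant, f"E_NADH={e_nadh:.1f}", f"E_NADPH={e_nadph:.1f}", f"diff={diff:.1f}")
```

Exhaustive enumeration is only practical for a handful of positions; an optimization formulation such as the mixed-integer linear program mentioned above is what makes larger design spaces tractable.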


Although the articles discussed show that computational enzyme design is feasible, artificial enzymes with novel folds show significantly lower catalytic activities than natural enzymes, barring the high activities seen in Kemp eliminases. However, computational designs that maintain the wild-type binding groove geometry have remained extremely successful in exploiting the promiscuity of enzymes to drive a desired reaction with minimal residue interventions. Thus, it remains an open question whether computational designs alone will be able to outperform natural enzymes. A synergy between computational predictions and directed evolution remains the best bet to date.

2.3. New approaches and future directions

Even though designing a protein remains a challenging task due to the large sequence space that requires sampling, the number of resolved crystal structures is increasing day by day. A number of algorithms that use these sequence and structure databases to learn various sequence-to-structure features are emerging, and machine learning and deep-learning neural networks are becoming key players in this domain. Cadet et al.124 developed a supervised learning approach that uses the sequences and activities of enantioselective enzyme variants carrying n individual point mutations to predict the activity of all combinations (2^n) of these point mutations. The method involves numerically encoding the sequences (wild type and single mutants) and experimental activities, converting them to a signal using a Fourier transform, and using a partial least squares step to predict the activity of a mutant that combines multiple point mutations fed in during the learning step. The correlation between predicted and measured activities for 28 experimentally validated mutants was good (R² = 0.81). Foldit125, a non-conventional crowd-sourced online competitive gaming protocol, uses human intuition as a lever for accessing novel catalytic folds or predicting folded polypeptide geometries. Factoring in contributions from binding pocket geometries, alternate catalytic motifs, and hydrophobicity of the pocket would be a step forward in using these algorithms more reliably.
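The first two steps of the sequence-to-signal idea described above (numerical encoding followed by a Fourier transform), together with the enumeration of all 2^n combinations of n point mutations, can be sketched as follows. This is a minimal illustration under stated assumptions and not the implementation of Cadet et al.: the per-residue values in `TOY_INDEX`, the example sequence, the mutation list and all function names are hypothetical, and the partial least squares regression step is deliberately omitted.

```python
import cmath
from itertools import combinations

# Hypothetical numerical index per amino acid (illustrative values only).
TOY_INDEX = {"A": 0.6, "L": 1.1, "F": 1.2, "S": -0.2, "D": -0.9, "K": -1.5}


def encode(sequence):
    """Map each residue of a sequence to its numerical index."""
    return [TOY_INDEX[res] for res in sequence]


def power_spectrum(signal):
    """Squared magnitude of the discrete Fourier transform of a numeric signal."""
    n = len(signal)
    spectrum = []
    for k in range(n):
        coeff = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                    for t, x in enumerate(signal))
        spectrum.append(abs(coeff) ** 2)
    return spectrum


def all_combinations(wild_type, point_mutations):
    """Yield every subset of the point mutations applied to the wild type."""
    for size in range(len(point_mutations) + 1):
        for subset in combinations(point_mutations, size):
            residues = list(wild_type)
            for position, new_residue in subset:
                residues[position] = new_residue
            yield "".join(residues)


wild_type = "ALSK"  # hypothetical four-residue sequence
point_mutations = [(0, "F"), (2, "D"), (3, "L")]  # (0-based position, new residue)

variants = list(all_combinations(wild_type, point_mutations))
features = {seq: power_spectrum(encode(seq)) for seq in variants}
print(len(variants), "variants from", len(point_mutations), "point mutations")
```

In a full workflow, the spectra of the wild type and single mutants would serve as inputs to a regression model trained on measured activities, and the spectra of the remaining combinations would then be passed to that model for prediction.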

Popova et al.126 have developed a deep reinforcement learning tool for drug discovery to identify molecules with desired properties such as hydrophobicity, melting point, and inhibitory activity against specific enzymes. If, instead of constructing novel small-molecule libraries, this workflow can be used to screen whether a ligand will show activity against a library of an enzyme and its mutants, it could emerge as a useful enzyme engineering tool. Protein design thus remains an active field of research in search of a unified set of rules that can be used for tuning substrate and cofactor specificity and for tailoring novel functionalities or designing them anew. It is worth mentioning that directed evolution and computational design have also been aimed at creating synthetic pathways that take advantage of the new enzymes (for example, Schwander et al.127 and Siegel et al.128), along with several updated genome-scale networks of eukaryotes129,130 and pathway redesign tools131,132. The marriage of new algorithms and directed evolution approaches bears the promise of generating efficient catalysts needed by the food, pharmaceutical, and renewable energy industries.


2.4. References

1. Ponomarenko EA, Poverennaya E V., Ilgisonis E V., et al. The Size of the Human Proteome: The Width and Depth. Int J Anal Chem. 2016. doi:10.1155/2016/7436849

2. Alba-Lois L, Segal-Kischinevzky C. Yeast Fermentation and the Making of Beer and Wine. Nat Educ. 2010. doi:10.1038/455699a.

3. F. F. Protein Purification Using Zeolite Adsorbent. Dr Diss UMP. 2009.

4. Eduard B. Alkoholische Gärung ohne Hefezellen. Berichte der Dtsch Chem Gesellschaft. 1897;30(1):1110-1113. doi:10.1002/cber.189703001215

5. Emil Fischer. Ueber die Glucoside der Alkohole. Berichte der Dtsch Chem Gesellschaft. 1893;26(3):2400-2412. doi:10.1002/cber.18930260327

6. Koshland DE. The Key–Lock Theory and the Induced Fit Theory. Angew Chemie Int Ed English. 1995. doi:10.1002/anie.199423751

7. Koshland DE. Application of a Theory of Enzyme Specificity to Protein Synthesis. Proc Natl Acad Sci U S A. 1958;44(2):98-104. http://www.ncbi.nlm.nih.gov/pubmed/16590179; http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=PMC335371.

8. V Henri. Théorie générale de l’action de quelques diastases. C R Hebd Seances Acad Sci. 1902;135:916-919.

9. V Henri. Lois générales de l’action des diastases. Hermann Paris. 1903.

10. Michaelis L, Menten ML. Die Kinetik der Invertinwirkung (The kinetics of invertase activity). Biochem Z. 1913.


11. Sumner J. The isolation and crystallization of the enzyme urease. J Biol Chem. 1926;69:435.

12. Northrop JH. Crystalline pepsin. Science. 1929. doi:10.1126/science.69.1796.580

13. Stanley WM. Isolation of a crystalline protein possessing the properties of tobacco-mosaic virus. Science. 1935. doi:10.1126/science.81.2113.644

14. Edman P, Högfeldt E, Sillén LG, Kinell P-O. Method for Determination of the Amino Acid Sequence in Peptides. Acta Chem Scand. 1950. doi:10.3891/acta.chem.scand.04-0283

15. Sanger F, Tuppy H. The amino-acid sequence in the phenylalanyl chain of insulin. 1. The identification of lower peptides from partial hydrolysates. Biochem J. 1951. doi:10.1042/bj0490463

16. Sanger F, Thompson EOP. The amino-acid sequence in the glycyl chain of insulin. I. The identification of lower peptides from partial hydrolysates. Biochem J. 1953.

17. C. C. Bakerian lecture - Amino-acid analysis and the structure of proteins. Proc R Soc London Ser B - Biol Sci. 1942. doi:10.1098/rspb.1942.0021

18. Stein W, Moore S. Chromatography of amino acids on starch columns. Separation of phenylalanine, leucine, isoleucine, methionine, tyrosine and valine. J Biol Chem. 1948;176:337-365.

19. Moore S, Stein WH. Photometric ninhydrin method for use in the chromatography of amino acids. J Biol Chem. 1948.

20. Edmundson AB. Amino-acid sequence of sperm whale myoglobin. Nature. 1965. doi:10.1038/205883a0

21. Phillips DA. The Hen Egg-White Lysozyme Molecule. Proc Natl Acad Sci. 1967;57(3).


22. Lerner SA, Wu TT, Lin ECC. Evolution of a catabolic pathway in bacteria. Science. 1964. doi:10.1126/science.146.3649.1313

23. Mills DR, Peterson RL, Spiegelman S. An extracellular Darwinian experiment with a self-duplicating molecule. Proc Natl Acad Sci U S A. 1967.

24. Levisohn R, Spiegelman S. Further Extracellular Darwinian Experiments With Replicating RNA Molecules: Diverse Variants Isolated Under Different Selective Conditions. Proc Natl Acad Sci. 1969. doi:10.1073/pnas.63.3.805

25. Francis JC, Hansche PE. Directed evolution of metabolic pathways in microbial populations. I. Modification of the acid phosphatase pH optimum in S. cerevisiae. Genetics. 1972.

26. Hall BG. Number of mutations required to evolve a new lactase function in Escherichia coli. J Bacteriol. 1977.

27. Eigen M, Gardiner W. Evolutionary molecular engineering based on RNA replication. Pure Appl Chem. 1984. doi:10.1351/pac198456080967

28. Robertson DL, Joyce GF. Selection in vitro of an RNA enzyme that specifically cleaves single-stranded DNA. Nature. 1990. doi:10.1038/344467a0

29. Cadwell RC, Joyce GF. Mutagenic PCR. Genome Res. 1994. doi:10.1101/gr.3.6.S136

30. Cadwell RC, Joyce GF. Randomization of genes by PCR mutagenesis. Genome Res. 1992. doi:10.1101/gr.2.1.28

31. Beaudry AA, Joyce GF. Directed evolution of an RNA enzyme. Science. 1992. doi:10.1126/science.1496376

32. Farinas ET, Bulter T, Arnold FH. Directed enzyme evolution. Curr Opin Biotechnol. 2001. doi:10.1016/S0958-1669(01)00261-0


33. Leung DW, Chen E, Goeddel D V. A Method for random mutagenesis of a defined DNA segment using a modified polymerase chain reaction. Technique. 1989.

34. Zaccolo M, Williams DM, Brown DM, Gherardi E. An approach to random mutagenesis of DNA using mixtures of triphosphate derivatives of nucleoside analogues. J Mol Biol. 1996. doi:10.1006/jmbi.1996.0049

35. Tindall KR, Kunkel TA. Fidelity of DNA Synthesis by the Thermus aquaticus DNA Polymerase. Biochemistry. 1988. doi:10.1021/bi00416a027

36. Gupta RD, Tawfik DS. Directed enzyme evolution via small and effective neutral drift libraries. Nat Methods. 2008. doi:10.1038/nmeth.1262

37. Landwehr M, Hochrein L, Otey CR, Kasrayan A, Bäckvall JE, Arnold FH. Enantioselective α-hydroxylation of 2-arylacetic acid derivatives and buspirone catalyzed by engineered cytochrome P450 BM-3. J Am Chem Soc. 2006. doi:10.1021/ja061261x

38. Umeno D, Tobias A V., Arnold FH. Diversifying Carotenoid Biosynthetic Pathways by Directed Evolution. Microbiol Mol Biol Rev. 2005. doi:10.1128/mmbr.69.1.51-78.2005

39. Wells JA, Vasser M, Powers DB. Cassette mutagenesis: an efficient method for generation of multiple mutations at defined sites. Gene. 1985. doi:10.1016/0378-1119(85)90140-4

40. Gibson DG, Young L, Chuang RY, Venter JC, Hutchison CA, Smith HO. Enzymatic assembly of DNA molecules up to several hundred kilobases. Nat Methods. 2009. doi:10.1038/nmeth.1318

41. Acevedo JP, Reetz MT, Asenjo JA, Parra LP. One-step combined focused epPCR and saturation mutagenesis for thermostability evolution of a new cold-active xylanase. Enzyme Microb Technol. 2017. doi:10.1016/j.enzmictec.2017.02.005


42. Coco WM, Levinson WE, Crist MJ, et al. DNA shuffling method for generating highly recombined genes and evolved enzymes. Nat Biotechnol. 2001. doi:10.1038/86744

43. Zhao H, Arnold FH. Optimization of DNA shuffling for high fidelity recombination. Nucleic Acids Res. 1997. doi:10.1093/nar/25.6.1307

44. Moore GL, Maranas CD. Modeling DNA mutation and recombination for directed evolution experiments. J Theor Biol. 2000. doi:10.1006/jtbi.2000.2082

45. Pritchard L, Corne D, Kell D, Rowland J, Winson M. A general model of error-prone PCR. J Theor Biol. 2005. doi:10.1016/j.jtbi.2004.12.005

46. Wangkumhang P, Chaichoompu K, Ngamphiw C, et al. WASP: A Web-based Allele-Specific PCR assay designing tool for detecting SNPs and mutations. BMC Genomics. 2007. doi:10.1186/1471-2164-8-275

47. Shugay M, Zaretsky AR, Shagin DA, et al. MAGERI: Computational pipeline for molecular-barcoded targeted resequencing. PLoS Comput Biol. 2017. doi:10.1371/journal.pcbi.1005480

48. Arnold FH. Combinatorial and computational challenges for biocatalyst design. Nature. 2001. doi:10.1038/35051731

49. Chartrain M, Salmon PM, Robinson DK, Buckland BC. Metabolic engineering and directed evolution for the production of pharmaceuticals. Curr Opin Biotechnol. 2000. doi:10.1016/S0958-1669(00)00081-1

50. Hernández Lozada NJ, Lai RY, Simmons TR, et al. Highly Active C8-Acyl-ACP Thioesterase Variant Isolated by a Synthetic Selection Strategy. ACS Synth Biol. 2018. doi:10.1021/acssynbio.8b00215


51. Brustad EM, Arnold FH. Optimizing non-natural protein function with directed evolution. Curr Opin Chem Biol. 2011. doi:10.1016/j.cbpa.2010.11.020

52. Zhao H. Directed evolution of novel protein functions. Biotechnol Bioeng. 2007. doi:10.1002/bit.21455

53. Toscano MD, Woycechowsky KJ, Hilvert D. Minimalist active-site redesign: Teaching old enzymes new tricks. Angew Chemie - Int Ed. 2007. doi:10.1002/anie.200604205

54. Moal IH, Moretti R, Baker D, Fernández-Recio J. Scoring functions for protein-protein interactions. Curr Opin Struct Biol. 2013. doi:10.1016/j.sbi.2013.06.017

55. Huang SY, Grinter SZ, Zou X. Scoring functions and their evaluation methods for protein-ligand docking: Recent advances and future directions. Phys Chem Chem Phys. 2010. doi:10.1039/c0cp00151a

56. Istrail S, Lam F. Combinatorial Algorithms for Protein Folding in Lattice Models: A Survey of Mathematical Results. Commun Inf Syst. 2009. doi:10.4310/cis.2009.v9.n4.a2

57. Liu Y, Kuhlman B. RosettaDesign server for protein design. Nucleic Acids Res. 2006. doi:10.1093/nar/gkl163

58. Gainza P, Roberts KE, Georgiev I, et al. Osprey: Protein design with ensembles, flexibility, and provable algorithms. In: Methods in Enzymology. ; 2013. doi:10.1016/B978-0-12-394292-0.00005-9

59. Rackers JA, Wang Z, Lu C, et al. Tinker 8: Software Tools for Molecular Design. J Chem Theory Comput. 2018. doi:10.1021/acs.jctc.8b00529

60. Fischer A, Enkler N, Neudert G, Bocola M, Sterner R, Merkl R. TransCent: Computational enzyme design by transferring active sites and considering constraints relevant for catalysis. BMC Bioinformatics. 2009. doi:10.1186/1471-2105-10-54


61. Pantazes RJ, Grisewood MJ, Li T, Gifford NP, Maranas CD. The Iterative Protein Redesign and Optimization (IPRO) suite of programs. J Comput Chem. 2015;36(4):251-263. doi:10.1002/jcc.23796

62. Lehmann A, Saven JG. Computational design of four-helix bundle proteins that bind nonbiological cofactors. In: Biotechnology Progress. ; 2008. doi:10.1021/bp070178q

63. Khoury GA, Fazelinia H, Chin JW, Pantazes RJ, Cirino PC, Maranas CD. Computational design of Candida boidinii xylose reductase for altered cofactor specificity. Protein Sci. 2009. doi:10.1002/pro.227

64. Cui D, Zhang L, Jiang S, et al. A computational strategy for altering an enzyme in its cofactor preference to NAD(H) and/or NADP(H). FEBS J. 2015. doi:10.1111/febs.13282

65. García-Guevara F, Bravo I, Martínez-Anaya C, Segovia L. Cofactor specificity switch in Shikimate dehydrogenase by rational design and consensus engineering. Protein Eng Des Sel. 2016. doi:10.1093/protein/gzx031

66. Chen CY, Georgiev I, Anderson AC, Donald BR. Computational structure-based redesign of enzyme activity. Proc Natl Acad Sci U S A. 2009. doi:10.1073/pnas.0900266106

67. Grisewood MJ, Hernández-Lozada NJ, Thoden JB, et al. Computational Redesign of Acyl-ACP Thioesterase with Improved Selectivity toward Medium-Chain-Length Fatty Acids. ACS Catal. 2017;7(6):3837-3849. doi:10.1021/acscatal.7b00408

68. Khare SD, Fleishman SJ. Emerging themes in the computational design of novel enzymes and protein-protein interfaces. FEBS Lett. 2013. doi:10.1016/j.febslet.2012.12.009

69. Chowdhury R, Allan MF, Maranas CD. OptMAVEn-2.0: De novo Design of Variable Antibody Regions Against Targeted Antigen Epitopes. Antibodies. 2018;7(3):23.


70. Tiller KE, Chowdhury R, Li T, et al. Facile affinity maturation of antibody variable domains using natural diversity mutagenesis. Front Immunol. 2017. doi:10.3389/fimmu.2017.00986

71. Kirk O, Borchert TV, Fuglsang CC. Industrial enzyme applications. Curr Opin Biotechnol. 2002. doi:10.1016/S0958-1669(02)00328-2

72. Williams GJ, Domann S, Nelson A, Berry A. Modifying the stereochemistry of an enzyme-catalyzed reaction by directed evolution. Proc Natl Acad Sci. 2003. doi:10.1073/pnas.0635924100

73. Gohlke H, Hendlich M, Klebe G. Knowledge-based scoring function to predict protein-ligand interactions. J Mol Biol. 2000. doi:10.1006/jmbi.1999.3371

74. Buchete N V., Straub JE, Thirumalai D. Development of novel statistical potentials for protein fold recognition. Curr Opin Struct Biol. 2004. doi:10.1016/j.sbi.2004.03.002

75. Dong QW, Wang XL, Lin L. Novel knowledge-based mean force potential at the profile level. BMC Bioinformatics. 2006. doi:10.1186/1471-2105-7-324

76. Kozma D, Tusnády GE. TMFoldRec: A statistical potential-based transmembrane protein fold recognition tool. BMC Bioinformatics. 2015. doi:10.1186/s12859-015-0638-5

77. Poole AM, Ranganathan R. Knowledge-based potentials in protein design. Curr Opin Struct Biol. 2006. doi:10.1016/j.sbi.2006.06.013

78. Dunbrack RL. Rotamer libraries in the 21st century. Curr Opin Struct Biol. 2002. doi:10.1016/S0959-440X(02)00344-5

79. Lauck F, Smith CA, Friedland GF, Humphris EL, Kortemme T. RosettaBackrub-a web server for flexible backbone protein structure modeling and design. Nucleic Acids Res. 2010. doi:10.1093/nar/gkq369


80. Wang J, Wolf RM, Caldwell JW, Kollman PA, Case DA. Development and testing of a general Amber force field. J Comput Chem. 2004. doi:10.1002/jcc.20035

81. Mackerell AD. Empirical force fields for biological macromolecules: Overview and issues. J Comput Chem. 2004;25(13):1584-1604. doi:10.1002/jcc.20082

82. Jorgensen WL, Tirado-Rives J. The OPLS Potential Functions for Proteins. Energy Minimizations for Crystals of Cyclic Peptides and Crambin. J Am Chem Soc. 1988. doi:10.1021/ja00214a001

83. Tieleman DP, MacCallum JL, Ash WL, Kandt C, Xu Z, Monticelli L. Membrane protein simulations with a united-atom lipid and all-atom protein model: Lipid-protein interactions, side chain transfer free energies and model proteins. J Phys Condens Matter. 2006. doi:10.1088/0953-8984/18/28/S07

84. Scott WRP, Hünenberger PH, Tironi IG, et al. The GROMOS biomolecular simulation program package. J Phys Chem A. 1999. doi:10.1021/jp984217f

85. Alford RF, Leaver-Fay A, Jeliazkov JR, et al. The Rosetta All-Atom Energy Function for Macromolecular Modeling and Design. J Chem Theory Comput. 2017. doi:10.1021/acs.jctc.7b00125

86. Rohl CA. Protein structure estimation from minimal restraints using Rosetta. Methods Enzymol. 2005. doi:10.1016/S0076-6879(05)94009-3

87. Maier JA, Martinez C, Kasavajhala K, Wickstrom L, Hauser KE, Simmerling C. ff14SB: Improving the Accuracy of Protein Side Chain and Backbone Parameters from ff99SB. J Chem Theory Comput. 2015. doi:10.1021/acs.jctc.5b00255


88. Wang LP, McKiernan KA, Gomes J, et al. Building a More Predictive Protein Force Field: A Systematic and Reproducible Route to AMBER-FB15. J Phys Chem B. 2017. doi:10.1021/acs.jpcb.7b02320

89. Pantazes RJ, Grisewood MJ, Li T, Gifford NP, Maranas CD. The Iterative Protein Redesign and Optimization (IPRO) suite of programs. J Comput Chem. 2015. doi:10.1002/jcc.23796

90. Chowdhury R, Ren T, Shankla M, et al. PoreDesigner for tuning solute selectivity in a robust and highly permeable outer membrane pore. Nat Commun. 2018. doi:10.1038/s41467-018-06097-1

91. Tozzini V. Coarse-grained models for proteins. Curr Opin Struct Biol. 2005. doi:10.1016/j.sbi.2005.02.005

92. Marrink SJ, Risselada HJ, Yefimov S, Tieleman DP, De Vries AH. The MARTINI force field: Coarse grained model for biomolecular simulations. J Phys Chem B. 2007. doi:10.1021/jp071097f

93. Taketomi H, Ueda Y, Gō N. Studies on protein folding, unfolding and fluctuations by computer simulation. Int J Pept Protein Res. 1975. doi:10.1111/j.1399-3011.1975.tb02465.x

94. Monticelli L, Kandasamy SK, Periole X, Larson RG, Tieleman DP, Marrink SJ. The MARTINI coarse-grained force field: Extension to proteins. J Chem Theory Comput. 2008. doi:10.1021/ct700324x

95. Coluzza I. Computational protein design: A review. J Phys Condens Matter. 2017. doi:10.1088/1361-648X/aa5c76


96. Laimer J, Hofer H, Fritz M, Wegenkittl S, Lackner P. MAESTRO - multi agent stability prediction upon point mutations. BMC Bioinformatics. 2015. doi:10.1186/s12859-015-0548-6.

97. Gainza P, Nisonoff HM, Donald BR. Algorithms for protein design. Curr Opin Struct Biol. 2016. doi:10.1016/j.sbi.2016.03.006

98. Pantazes RJ, Grisewood MJ, Maranas CD. Recent advances in computational protein design. Curr Opin Struct Biol. 2011. doi:10.1016/j.sbi.2011.04.005

99. Axe DD. Estimating the prevalence of protein sequences adopting functional enzyme folds. J Mol Biol. 2004. doi:10.1016/j.jmb.2004.06.058

100. Reetz MT, Brunner B, Schneider T, Schulz F, Clouthier CM, Kayser MM. Directed evolution as a method to create enantioselective cyclohexanone monooxygenases for catalysis in Baeyer-Villiger reactions. Angew Chemie - Int Ed. 2004. doi:10.1002/anie.200460272

101. Giver L, Gershenson A, Freskgard PO, Arnold FH. Directed evolution of a thermostable esterase. Proc Natl Acad Sci U S A. 1998.

102. Stearman RS, Frankel AD, Freire E, Liu B, Pabo CO. Combining Thermostable Mutations Increases the Stability of λ Repressor. Biochemistry. 1988. doi:10.1021/bi00419a059

103. Kim YW, Lee SS, Warren RAJ, Withers SG. Directed evolution of a glycosynthase from Agrobacterium sp. increases its catalytic activity dramatically and expands its substrate repertoire. J Biol Chem. 2004. doi:10.1074/jbc.M406890200

104. Guex N, Peitsch MC. SWISS-MODEL and the Swiss-PdbViewer: An environment for comparative protein modeling. Electrophoresis. 1997. doi:10.1002/elps.1150181505

105. Zhao H, Giver L, Shao Z, Affholter JA, Arnold FH. Molecular evolution by staggered extension process (StEP) in vitro recombination. Nat Biotechnol. 1998. doi:10.1038/nbt0398-258


106. Chen K, Arnold F. Tuning the activity of an enzyme for unusual environments: in dimethylformamide. PNAS. 1992. doi:10.1073/pnas.90.12.5618

107. Warwel S, Steinke G, Klaas MRG. An efficient method for lipase-catalysed preparation of acrylic and methacrylic acid esters. Biotechnol Tech. 1996. doi:10.1007/BF00184030

108. Speeckaert AL, Brothers JG, Wingert NC, Graham JH, Klena JC. Airborne Exposure of Methyl Methacrylate During Simulated Total Hip Arthroplasty and Fabrication of Antibiotic Beads. J Arthroplasty. 2015. doi:10.1016/j.arth.2015.02.036

109. Ponder JW, Richards FM. Tertiary templates for proteins. Use of packing criteria in the enumeration of allowed sequences for different structural classes. J Mol Biol. 1987. doi:10.1016/0022-2836(87)90358-5

110. Desmet J, De Maeyer M, Hazes B, Lasters I. The dead-end elimination theorem and its use in protein side-chain positioning. Nature. 1992. doi:10.1038/356539a0

111. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990. doi:10.1016/S0022-2836(05)80360-2

112. Brooks BR, Brooks CL, Mackerell AD, et al. CHARMM: The biomolecular simulation program. J Comput Chem. 2009;30(10):1545-1614. doi:10.1002/jcc.21287

113. Lindahl E, Hess B, van der Spoel D. GROMACS 3.0: A package for molecular simulation and trajectory analysis. J Mol Model. 2001. doi:10.1007/S008940100045

114. Mesirov JP, Schulten K, Sumners DW. Mathematical Applications to Biomolecular

Structure and Dynamics, IMA Volumes in Mathematics and Its Applications. Springer-Verlag;

1996.

115. Röthlisberger D, Khersonsky O, Wollacott AM, et al. Kemp elimination catalysts by computational enzyme design. Nature. 2008. doi:10.1038/nature06879

82

116. D’Anna F, La Marca S, Noto R. Kemp elimination: A probe reaction to study ionic liquids properties. J Org Chem. 2008. doi:10.1021/jo702662z

117. Kingsley LJ, Lill MA. Substrate tunnels in enzymes: Structure-function relationships and computational methodology. Proteins Struct Funct Bioinforma. 2015. doi:10.1002/prot.24772

118. Jiang L, Althoff EA, Clemente FR, et al. De novo computational design of retro-aldol enzymes. Science (80- ). 2008. doi:10.1126/science.1152692

119. Richard JP, Zhai X, Malabanan MM. Reflections on the catalytic power of a TIM-barrel.

Bioorg Chem. 2014. doi:10.1016/j.bioorg.2014.07.001

120. Khoury GA, Fazelinia H, Chin JW, Pantazes RJ, Cirino PC, Maranas CD. Computational design of Candida boidinii xylose reductase for altered cofactor specificity. Protein Sci. 2009. doi:10.1002/pro.227

121. Fazelinia H, Cirino PC, Maranas CD. OptGraft: A computational procedure for transferring a binding site onto an existing protein scaffold. Protein Sci. 2009;18(1):180-195. doi:10.1002/pro.2

122. Lee SC, Chang YJ, Shin DM, et al. Designing the substrate specificity of d-hydantoinase using a rational approach. Enzyme Microb Technol. 2009. doi:10.1016/j.enzmictec.2008.10.020

123. Grisewood MJ, Gifford NP, Pantazes RJ, et al. OptZyme: Computational Enzyme

Redesign Using Transition State Analogues. PLoS One. 2013. doi:10.1371/journal.pone.0075358

124. Cadet F, Fontaine N, Li G, et al. A machine learning approach for reliable prediction of amino acid interactions and its application in the directed evolution of enantioselective enzymes.

Sci Rep. 2018. doi:10.1038/s41598-018-35033-y

125. Khatib F, Cooper S, Tyka MD, et al. Algorithm discovery by protein folding game players.

Proc Natl Acad Sci U S A. 2011. doi:10.1073/pnas.1115898108

83

126. Wang J, Cao H, Zhang JZH, Qi Y. Computational Protein Design with

Neural Networks. Sci Rep. 2018. doi:10.1038/s41598-018-24760-x

127. Schwander T, Von Borzyskowski LS, Burgener S, Cortina NS, Erb TJ. A synthetic pathway for the fixation of carbon dioxide in vitro. Science (80- ). 2016. doi:10.1126/science.aah5237

128. Siegel JB, Smith AL, Poust S, et al. Computational protein design enables a novel one- carbon assimilation pathway. Proc Natl Acad Sci U S A. 2015. doi:10.1073/pnas.1500545112

129. Thiele I, Swainston N, Fleming RMT, et al. A community-driven global reconstruction of human metabolism. Nat Biotechnol. 2013. doi:10.1038/nbt.2488

130. Chowdhury R, Chowdhury A, Maranas CD. Using Gene Essentiality and Synthetic

Lethality Information to Correct Yeast and CHO Cell Genome-Scale Models. Metabolites.

2015;5(4):536-570. doi:10.3390/metabo5040536

131. Burgard AP, Pharkya P, Maranas CD. OptKnock: A Bilevel Programming Framework for

Identifying Gene Knockout Strategies for Microbial Strain Optimization. Biotechnol Bioeng.

2003. doi:10.1002/bit.10803

132. Ranganathan S, Suthers PF, Maranas CD. OptForce: An optimization procedure for identifying all genetic manipulations leading to targeted overproductions. PLoS Comput Biol.

2010. doi:10.1371/journal.pcbi.1000744

84

Chapter 3

IPRO+/- COMPUTATIONAL PROTEIN DESIGN TOOL ALLOWING FOR INSERTIONS AND DELETIONS

This chapter has not been previously published anywhere. This work includes contributions from Dr. Matthew Grisewood (currently Research Scientist, Schrodinger Inc.) and Veda Sheersh Boorla. Prof. Costas D. Maranas provided technical insights in shaping up the study and has also edited the chapter.

3.1. Significance

Insertions and deletions in protein sequences alter the residue spacing along the polypeptide backbone and consequently open up possibilities of tuning protein function in a way that is inaccessible by amino acid substitutions alone. We describe an optimization-based computational protein redesign approach that allows for amino acid insertions and deletions besides substitutions to an existing protein backbone. This new algorithmic capability would be of interest for enzyme engineering and broadly inform other protein design tasks.

Current computational protein design approaches are limited by their inability to predict insertions and deletions in the polypeptide sequence in response to a protein redesign goal. In contrast, nature has extensively relied upon amino acid insertions or deletions (indels) to tune protein function, as evidenced by the prevalence of gaps in protein family sequence alignments. Here we introduce this new capability within the Iterative Protein Redesign and Optimization (IPRO+/-) tool for redesigning proteins with improved binding scores against relevant interaction partners including small molecules, peptides, and even other proteins. The algorithmic advancement is centered around predicting beneficial combinations of indels along with substitutions and obtaining putative substrate-docked structures for these protein variants. We highlight this new capability by recapitulating the sequences of existing active TEM-1 β-lactamase (TBL) variants of different sizes using overall protein stability as the protein design objective. In addition, starting from a 4-Coumarate:CoA Ligase (4CL) we identified shorter ligases with enhanced in vitro activities towards the non-native (larger) substrates sinapate and ferulate by shortening loops guarding the active site. IPRO+/- is made freely available as an open source tool from https://maranasgroup.com.

3.2. Introduction

Protein design is a core task that underpins many applications, from drug design1 and enzyme engineering for improved or altered substrate specificity2 to antibody design for nanomolar affinity for a specific epitope in an antigen3 or protein pores for (bio)separations4. At its core, protein design entails the identification of the exact sequence of the amino acids in a polypeptide chain that upon folding leads to the right structure for the desired function. The combinatorial nature of this task arises from the fact that there are twenty amino acid choices for each of the typically hundreds of positions in the polypeptide chain. This implies that exhaustive searches using combinatorial libraries can only sample a tiny fraction of the sequence space. Therefore, it is important to find ways to focus libraries on the most promising regions of sequence space.

Directed evolution5 protocols through a sequence of screens (or preferably selections) have been quite successful in steering libraries towards improved designs. However, the chances of success are problem-specific, limited insight is gained into the molecular mechanism of improvement, and lessons learned from one case study do not always translate to others. At the same time, computational design6 has emerged as an important tool for homing in on design alternatives that optimize a set of computationally accessible metrics (e.g., binding affinity, overall protein stability, decoy rejection, etc.). A number of success stories for the de novo design of enzymes7–10, antibodies11–13 and inhibitors14,15 have been reported.

Existing computational tools rely on biophysics-inspired scoring functions (e.g., CHARMM, AMBER, GROMACS) to quantify the energetics of the molecular interactions, allowing for the in silico exploration of the impact of amino acid mutations on binding or stability metrics. A number of these techniques form the basis of software platforms for protein design such as Rosetta16, Site Directed 17, OSPREY18, and Tinker19. Alternatively, a number of protein design tools rely on the analysis of the statistics of the amino acid combinations in a protein library that preserve (or enhance) a particular function20. Often these methods are supplemented by information21 associated with the desired functionality. Despite the many success stories and rapid progress, there exists a systematic bias associated with all existing tools. Even though the effect of specific deletions on the overall protein has been investigated before22, all existing computational protein redesign methods require that the original length of the polypeptide chain is pre-specified and remains unchanged during the design process.

For example, consider the family of TEM-1 β-lactamases (TBLs). These enzymes catalyze the formation of the hydroxyl-substituted beta-amino acid from the corresponding beta-lactam (such as penicillins and cephalosporins) using the conserved Met69, Ser130, and Arg244 catalytic triad. The high antibiotic resistance they confer makes them convenient candidates for assessing protein fitness using minimum inhibitory concentration (MIC)-like approaches in high-throughput protein evolution experiments. Multiple studies have shed light on the effect of insertions and deletions on ampicillin resistance. Two studies independently assaying 53 and 87 amino acid deletions across the protein reported that more than 26% of variants showed a 99% loss of the wild-type MIC score when the deletion occurred in α-helices or β-sheets, whereas deletions in loops and β-sheet-loop junctions were almost always tolerated.

To quantify the prevalence of indels in protein family sequences and demonstrate the problem that this causes for computational protein design, we generated a sequence alignment of a published set of 156 class A β-lactamases from several bacterial species. Despite the relatively high average sequence similarity (i.e., 86%), there are on average 6 gaps per sequence in the library and 5 insertions (see also Figure 12) with respect to Escherichia coli TEM-1 β-lactamase (referred to as EcTBL henceforth). The most prevalent backbone size has 275 amino acids, which represents only 34% of the total number of sequences in the family. This means that if a computational protein design algorithm attempted a TBL redesign starting from a member of the aa=275 grouping, only 34% of the TBL family diversity as encoded in the sequence alignment would be accessible. A random starting sequence would only access on average 15% of the protein family diversity. This becomes even more restrictive for most other protein families, which tend to have even more gaps in their alignment due to lower sequence identity. This means that even though nature seems to extensively use protein length as a “lever” to optimize the function of proteins, existing protein design tools are always restricted to a particular chain length. One could iteratively attempt to apply computational protein design for different lengths, but this is clearly inefficient as the design goal is not used to guide the search for the most advantageous protein length. This calls for a dedicated method that uses protein size as a design criterion.

[Figure 12: histogram. X-axis: average number of amino acids in each group of TBL homologs (265, 270, 275, 280, 285, 290, 295). Y-axis: % of TBL sequences. Bar values: 34, 17, 15, 12, 11, 6, 5.]

Figure 12. The histogram shows the distribution of 156 TEM1-beta lactamase homologs with respect to the number of amino acids that constitute the polypeptides. The sequences have been grouped into five amino acid long bins with the mean lengths indicated in the X-axis labels.
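The length grouping summarized in Figure 12 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the analysis script used for this work: the alignment is assumed to be a list of equal-length strings with `-` marking gaps, bins are labeled by their lower edge rather than by their mean length, and the function name `length_distribution` and the toy sequences are hypothetical.

```python
from collections import Counter


def length_distribution(aligned_seqs, bin_width=5):
    """Percentage of sequences per chain-length bin.

    aligned_seqs: equal-length aligned rows, '-' marks a gap (assumed format).
    Returns {bin_lower_edge: percent_of_sequences}.
    """
    # Chain length is the number of non-gap characters in each aligned row.
    lengths = [len(seq) - seq.count("-") for seq in aligned_seqs]
    bins = Counter((n // bin_width) * bin_width for n in lengths)
    total = len(lengths)
    return {start: 100.0 * count / total for start, count in sorted(bins.items())}


# Toy alignment (hypothetical sequences, not the 156-member TBL family).
toy = ["MKT-LLA--G", "MKTALLA--G", "MKT-LLAPSG", "MK--LLA--G"]
print(length_distribution(toy, bin_width=2))
```

The share of the most populated bin is the quantity quoted in the text as the fraction of family diversity reachable from a fixed-length starting point.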

In enzyme design, the objective is often to alter the specificity of the enzyme for a new substrate. For example, being able to switch the specificity of an acyl-ACP thioesterase from long-chain acyl-ACPs (C14 and higher) to short-chain (C8) acyl-ACPs could unlock microbial octanoate production with implications for the oleochemical industry. Thioesterases hydrolyze the thioester bond in acyl-ACPs and yield the corresponding acid. Thioesterases from different species vary both in length and substrate size preference. It is reasonable to assume that if substrate size is the only changing factor, then polypeptide chain length should be an important design variable. Deletions in the substrate binding groove can result in a more compact thioesterase with a remodeled pocket that can only accommodate shorter acyl-ACPs. Insertions, on the other hand, could be used to “open up” otherwise smaller pockets for larger substrates. This is observed for two Cuphea viscosissima acyl-ACP thioesterase variants (CvFatB1 and CvFatB2) sharing 70% sequence similarity. The longer variant CvFatB2 (80 more amino acids) acts only on C14 to C16-ACPs whereas the shorter variant CvFatB1 only accepts up to C8-ACPs23. However, a deletion or insertion could also have a counter-intuitive effect. For example, if the active site is partially occluded by neighboring loops, a deletion in a loop may enable access to larger substrates. Conversely, an insertion may preclude access to larger substrates, thus changing specificity towards smaller ones.

Motivated by these observations and the lack of computational tools, here we introduce IPRO+/-, a modified version of the protein design tool IPRO that allows for both residue insertions and deletions. IPRO+/- builds upon the existing suite of protein design programs IPRO24 that use the CHARMM25 energy function to quantify the energetics of the molecular interactions and a mixed-integer linear optimization algorithm to select residue-rotamer combinations that optimize the user-specified design objective (e.g., interaction energy). Conceptually, IPRO+/- allows for insertions and deletions by allowing every position in the protein sequence to accept either one of the twenty amino acids or remain unoccupied (i.e., gap). The family protein sequence alignment provides the blueprint for which positions can remain unoccupied by simply inspecting whether there exists any member that has a gap in the position of interest. In addition, it establishes the maximum number of residue positions and provides a universal numbering scheme for any redesigned protein. This implies that gaps are encoded as the 21st amino acid. Therefore, any chain length contained in the protein family alignment is accessible by IPRO+/-, with gaps only allowed at positions where at least one member of the family protein alignment already has a gap (see Figure 13A).
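The idea of treating a gap as a 21st state and reading the permitted gap positions off the family alignment can be sketched as follows. This is an illustrative sketch only, not the IPRO+/- source code: the alignment format (equal-length strings with `-` for gaps), the helper names `gap_permitted` and `chain_length_bounds`, and the toy sequences are all assumptions made for the example.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
GAP = "-"  # the gap acts as the 21st "amino acid"
STATES = AMINO_ACIDS + GAP


def gap_permitted(aligned_seqs):
    """True for every alignment column in which at least one member has a gap."""
    n_cols = len(aligned_seqs[0])
    return [any(seq[col] == GAP for seq in aligned_seqs) for col in range(n_cols)]


def chain_length_bounds(aligned_seqs):
    """(shortest, longest) chain length reachable under the gap rule above."""
    permitted = gap_permitted(aligned_seqs)
    n_cols = len(permitted)
    # Every column is a numbered position; only gap-permitted ones may be empty.
    return n_cols - sum(permitted), n_cols


# Toy alignment (hypothetical sequences).
toy = ["MKT-LLA--G", "MKTALLA--G", "MKT-LLAPSG", "MK--LLA--G"]
print(len(STATES), gap_permitted(toy), chain_length_bounds(toy))
```

The number of alignment columns plays the role of the universal numbering scheme: every redesigned variant, whatever its length, is addressed by column index.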

The key novel redesign steps in IPRO+/- are (i) deletion of a residue (aa → _) or (ii) insertion of a residue in a gapped position (i.e., _ → aa). The protein family alignment provides a straightforward way to quantify the probability of occurrence of these two transitions. For example, if at a given position A proteins out of N have a gap, then we choose the probability of opening a gap in that position to be A/N (see Figure 13A). Similarly, if B proteins out of N in the alignment have a residue at a currently gapped position, then we set the probability of adding a residue to B/N (see Figure 13A). One could envision more elaborate schemes for setting these transition probabilities or allow for direct user-supplied specifications. Residue deletion involves cutting the polypeptide chain, removing the residue in question, and then bringing together the two ends. This task (i.e., end-joining) is a frequently occurring problem in robotics for object retrieval. It arises when a sequence of rotations at different articulated joints needs to be calculated so that an articulated mechanical arm can grab a stationary target object26. End-joining of the polypeptide chain is accomplished in IPRO+/- using a modified cyclic coordinate descent method. At each cycle, two rotations around the protein backbone at symmetric positions from the end-joining location are carried out with the goal of minimizing the distance between the backbone N-Cα-C triplets of the end-joining segments (see Figure 13B). This sequence of pairwise rotations is initiated five residues away from the end-joining segments. Progressively, pairwise rotations move closer until they meet one another (root-mean-square deviation (RMSD) < 0.001 Å) at the joining segment during the last cycle (see Figure 13B). By carrying out rotations in a symmetric manner (bi-directional cyclic coordinate descent) around the joining end, we avoid any possible directional bias. The same end-joining cyclic procedure is called upon after both deletions and insertions of a residue at a given position. A glycine is exclusively introduced at all residue addition events, which can be changed into other residues in follow-up steps of IPRO+/-. In both cases, the torsion angles of the backbone chain are changed; therefore, new rotamer assignments need to be made using the MILP algorithm in IPRO. Figure 13B illustrates the basic steps of the loop closure protocol implemented inside the IPRO+/- algorithm. Algorithmic and implementation details are described in Materials and Methods.
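A single cyclic coordinate descent update can be written in closed form: for one rotation axis, the angle that minimizes the summed squared distance between a moving atom triplet and its target follows from an `atan2` of two sums. The sketch below shows this textbook single-axis update in plain Python. It is not the modified procedure implemented in IPRO+/-; the function names, the toy coordinates, and the choice of rotation axis are assumptions made for illustration.

```python
import math


def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotate(point, origin, axis, angle):
    """Rotate `point` by `angle` (radians) about the unit vector `axis` through `origin`."""
    v = sub(point, origin)
    c, s = math.cos(angle), math.sin(angle)
    axv, adv = cross(axis, v), dot(axis, v)
    # Rodrigues' rotation formula.
    return tuple(o + vi * c + wi * s + ai * adv * (1.0 - c)
                 for o, vi, wi, ai in zip(origin, v, axv, axis))


def ccd_angle(moving, target, origin, axis):
    """Rotation angle about `axis` that best superimposes `moving` onto `target`."""
    num = den = 0.0
    for m, t in zip(moving, target):
        r, f = sub(m, origin), sub(t, origin)
        along = dot(r, axis)
        r_perp = tuple(ri - along * ai for ri, ai in zip(r, axis))
        den += dot(f, r_perp)           # coefficient of cos(theta)
        num += dot(f, cross(axis, r))   # coefficient of sin(theta)
    return math.atan2(num, den)


def rmsd(a, b):
    return math.sqrt(sum(dot(sub(p, q), sub(p, q)) for p, q in zip(a, b)) / len(a))


# Toy check: the target triplet is the moving triplet rotated by 0.7 rad.
origin, axis = (0.0, 0.0, 0.0), (0.0, 0.0, 1.0)
moving = [(1.0, 0.0, 0.0), (1.5, 1.0, 0.3), (2.0, 1.5, 1.0)]
target = [rotate(p, origin, axis, 0.7) for p in moving]
theta = ccd_angle(moving, target, origin, axis)
closed = [rotate(p, origin, axis, theta) for p in moving]
print(round(theta, 6), rmsd(closed, target) < 1e-9)
```

In a real backbone the triplet usually cannot be matched by one rotation, which is why the update is cycled over several φ and ψ axes until the RMSD tolerance is met.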

As a demonstration, we tested IPRO+/- for the redesign of a TEM-1 β-lactamase (TBL) to assess whether IPRO+/- managed to identify designs with a different chain length than the starting point by opening gaps and/or inserting residues in positions consistent with the family sequence alignment. In addition, we carried out the redesign of 4CL2 (i.e., 4-Coumarate:CoA Ligase 2 from Glycine max) so that it shows new substrate specificity towards the larger cinnamate, caffeate, and ferulate substrates. We assessed whether IPRO+/- managed to recapitulate the pattern of deletions seen in the protein family alignment for 4CL2 enzyme variants that have activity for larger substrates (such as sinapate and ferulate).

3.3. Materials and Methods

The traditional IPRO design iteration consists of a backbone perturbation, rotamer repacking and amino acid selection using a mixed-integer approach, target molecule redocking, computation of interaction energy metrics, and a decision on whether to retain or reject the design, which is followed by a backbone perturbation at a different site. A new set of decisions and tasks is incorporated within the IPRO workflow such that insertions and deletions can be used as design choices along with substitutions. The probability of making an insertion or deletion at each design cycle is guided by the family sequence alignment (see Figure 13A). For either insertion or deletion, the polypeptide backbone must first be opened by performing a ψ-angle rotation on the residue to the left of the break such that the distance between 0N and RightC is at least 4 Å (see Figure 13B). The two new backbone ends are then generated by either appending a new GlyN-GlyCα-GlyC triplet in case of insertion or removing the 0N-0Cα-0C triplet in case of deletion (see Figure 13B). IPRO’s loop closure algorithm is an adaptation of the cyclic coordinate descent method (CCD), which has been employed in homology modeling27 and in robotics28 for solving inverse kinematics problems. The objective of the loop closure algorithm is to minimize the RMSD between the N-Cα-C triplets of the two free ends (see Figure 13B). IPRO+/- first renumbers residues in the protein by assigning numbers to all gapped positions in the original sequence that are occupied by amino acids in some members of the protein family (see Figure 13A). The insertion step is performed before the rotamer repacking and amino acid selection step, which provides an opportunity for the inserted glycine to be altered to a different amino acid. The deletion step, on the other hand, is executed before the target molecule redocking step of IPRO.

Figure 13. IPRO+/- graphical schematic. a) Family sequence alignment is used as the blueprint to determine allowable amino acid insertion and deletion locations and corresponding probabilities. b) Gap opening and gap closure steps for insertion and deletion tasks yield longer and shorter backbone lengths, respectively.

Step 1. Renumbering amino acids based on family sequence alignment

IPRO+/- treats a gap in the sequence alignment as the twenty-first amino acid (see Figure 13A). The family sequence alignment is used to identify design positions (DPs) on the starting sequence which could accept a different amino acid (aa→aa), insert a glycine (_→aa), or delete an amino acid (aa→_). Gaps on the starting sequence are assigned residue numbers and the rest of the amino acids are renumbered accordingly (see Figure 13A). Insertion and deletion probabilities for each position are computed as the fraction of sequences in the alignment that have amino acids or gaps in that position, respectively.
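Step 1 can be sketched as a single pass over the alignment columns. The snippet below is a hedged illustration rather than the IPRO+/- implementation: it assumes the alignment is a list of equal-length strings with `-` for gaps, numbers positions by 1-based alignment column, counts N over all rows of the alignment, and uses the hypothetical name `indel_probabilities`.

```python
GAP = "-"


def indel_probabilities(aligned_seqs, start_index=0):
    """Per-position move and probability relative to the starting sequence.

    A residue in the starting row may be deleted with probability A/N
    (A = rows with a gap in that column); a gap in the starting row may be
    filled with probability B/N (B = rows with a residue in that column).
    """
    start = aligned_seqs[start_index]
    n = len(aligned_seqs)
    table = []
    for col, state in enumerate(start):
        n_gaps = sum(seq[col] == GAP for seq in aligned_seqs)
        if state == GAP:
            move, prob = "insert", (n - n_gaps) / n
        else:
            move, prob = "delete", n_gaps / n
        # position is the universal (alignment-column) number, starting at 1
        table.append({"position": col + 1, "start": state, "move": move, "prob": prob})
    return table


# Toy alignment (hypothetical sequences); the first row is the starting sequence.
toy = ["MKT-LLA--G", "MKTALLA--G", "MKT-LLAPSG", "MK--LLA--G"]
for row in indel_probabilities(toy):
    if row["prob"] > 0:
        print(row)
```

Positions with probability zero are still numbered but can only undergo substitutions.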

Step 2. Loop opening

Both insertion and deletion steps start by introducing a break in the polypeptide backbone where an insertion or deletion needs to be made. As shown in Figure 13B, the break is introduced by performing a ψ-angle rotation on the residue to the left of the intended break point such that the distance between 0N and RightC is at least 4 Å. For an insertion task, a glycine (GlyN, GlyCα, GlyC triplet) is built onto the 0N atom and an extra set of backbone atoms (xN, xCα, and xC) is built onto the GlyN atom (see Figure 13B). Alternatively, in case of a deletion task, the 0N-0Cα-0C triplet is renamed to xN-xCα-xC.

Step 3. Loop closure

The newly generated ends of the polypeptide backbone during an insertion or deletion task are rejoined using bi-directional CCD to obtain the structure of the corresponding insertion or deletion variant. Loop closure consists of a series of φ and ψ dihedral rotations with the objective of reducing the RMSD between the xN-xCα-xC and RightN-RightCα-RightC triplets to less than 0.001 Å (see Figure 13B). The loop closure operation involves φ-ψ rotations performed alternately on residues lying to the left and right of the gap, starting with the fifth residue away from the gap and progressively moving towards the gap. Up to 20 rounds of such operations are performed until the polypeptide backbone ends assume the same coordinates (i.e., are rejoined). This bi-directional CCD27 approach safeguards against any directional bias (towards the left or right of the original polypeptide break point). The sequence of φ-ψ rotations is as follows:

$\phi\psi_{5}^{\mathrm{Left}} \rightarrow \phi\psi_{5}^{\mathrm{Right}} \rightarrow \phi\psi_{4}^{\mathrm{Left}} \rightarrow \phi\psi_{4}^{\mathrm{Right}} \rightarrow \phi\psi_{3}^{\mathrm{Left}} \rightarrow \phi\psi_{3}^{\mathrm{Right}} \rightarrow \phi\psi_{2}^{\mathrm{Left}} \rightarrow \phi\psi_{2}^{\mathrm{Right}} \rightarrow \phi\psi_{1}^{\mathrm{Left}} \rightarrow \phi\psi_{1}^{\mathrm{Right}} \rightarrow \phi\psi_{0}^{x}$

The Left/Right superscripts indicate residue location relative to the polypeptide break and the subscript number from 1 to 5 its corresponding distance (see Figure 13B).
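The rotation order and the stopping rule of Step 3 can be sketched as a small driver loop. This is a schematic illustration, not the IPRO+/- code: the callbacks `apply_rotation` and `triplet_rmsd` stand in for the actual geometry routines and are a hypothetical interface, and the mock in the demonstration simply halves a number to show the control flow.

```python
def ccd_schedule(start=5):
    """Order of phi-psi rotations in one round: 5 Left, 5 Right, ..., 1 Right, then 0 x."""
    order = []
    for distance in range(start, 0, -1):
        order.append((distance, "Left"))
        order.append((distance, "Right"))
    order.append((0, "x"))
    return order


def close_loop(apply_rotation, triplet_rmsd, max_rounds=20, tol=1e-3):
    """Run rounds of the schedule until the free ends coincide.

    apply_rotation(distance, side): performs one phi-psi rotation (caller-supplied).
    triplet_rmsd(): current RMSD in angstroms between the two end triplets.
    Returns the number of rounds used, or None if the loop did not close.
    """
    for round_no in range(1, max_rounds + 1):
        for distance, side in ccd_schedule():
            apply_rotation(distance, side)
        if triplet_rmsd() < tol:
            return round_no
    return None


# Toy demonstration: each mock rotation halves a pretend end-to-end RMSD.
state = {"rmsd": 1.0}


def mock_rotation(distance, side):
    state["rmsd"] *= 0.5


rounds_used = close_loop(mock_rotation, lambda: state["rmsd"])
print(ccd_schedule()[:3], rounds_used)
```

Returning `None` when the round limit is hit lets the caller reject an indel whose backbone could not be rejoined, which is one possible design choice among several.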

Step 4. Integration with remaining IPRO workflow

The original IPRO workflow iterates between five steps for deciding on amino acid substitutions in user-defined positions (i.e., DPs) that optimize a binding assembly metric (such as complex energy for enhanced stability or interaction energy for improved binding with a small molecule or another protein). The algorithmic and implementation details are described by Pantazes et al.29. Briefly, the first step of IPRO starts with picking a DP at random, followed by perturbing the backbone of an eleven-residue window centered around the DP. Repacking of amino acid side chains of this window is then performed by solving a MILP, where only the DP is allowed to receive rotamers different from the original amino acid type. In IPRO+/-, if the randomly chosen DP is a gap, a glycine is introduced with a probability equal to the insertion probability of that site from the family sequence alignment. In subsequent iterations, this glycine can be replaced and repacked with a different amino acid rotamer depending on its covalently bonded and non-bonded interaction energy scores. On the other hand, if the randomly chosen DP is a residue with a non-zero deletion probability, it is deleted with that probability after the rotamer repacking step of a design cycle. Subsequently, the target molecule is redocked with the protein, the protein is energy-minimized (in complex with its binding partner, if present), and CHARMM interaction energy scores are computed. A design is retained if it performs better than the current best variant for the intended design goal; otherwise, it is rejected with a probability determined by a Metropolis criterion. IPRO+/- currently uses a default of 3000 such design runs for a full simulation, on five nodes of 10-core Xeon E7-4830 processors with 4 GB physical memory. The schematic of IPRO+/- steps is outlined in Figure 14.
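The retain-or-reject decision at the end of each design cycle can be sketched with the standard Metropolis rule. The snippet is a minimal illustration under assumptions not stated in the text: lower energy is taken to be better, energies are assumed to be in kcal/mol, the temperature value is a free parameter chosen by the caller, and the function name `metropolis_accept` is hypothetical.

```python
import math
import random

K_B = 0.0019872041  # Boltzmann constant in kcal/(mol*K)


def metropolis_accept(new_energy, best_energy, temperature, rng=random):
    """Keep an improved design; keep a worse one with probability exp(-dE / kT)."""
    delta = new_energy - best_energy
    if delta <= 0:
        return True  # design is at least as good as the current best
    # A worse design survives only occasionally, which helps escape local minima.
    return rng.random() < math.exp(-delta / (K_B * temperature))


# Example: a design 0.5 kcal/mol worse than the current best at an assumed 300 K.
random.seed(0)
trials = 10000
kept = sum(metropolis_accept(-99.5, -100.0, 300.0) for _ in range(trials))
print(kept / trials)
```

Passing `rng` explicitly makes the decision reproducible in tests; any object with a `random()` method returning a float in [0, 1) will do.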

Figure 14. The steps of the IPRO+/- design cycle. Family sequence alignment guides the probabilities of making amino acid insertions or deletions along the polypeptide backbone. After an insertion step, the inserted Gly is allowed to be replaced with a different amino acid in subsequent design cycles.

3.4. Results and Discussion

The IPRO+/- algorithm was first used to identify shorter and longer variants of TEM1 β-lactamase with the Escherichia coli TBL (PDB accession ID 1ERM30) as the input design sequence. Bacterial resistance to penicillin and cephalosporin-like drugs is primarily governed by β-lactamase-mediated hydrolysis of the drug molecule. We hypothesize that variants that form stable complexes with the penicillin-like boronate inhibitor (PEB, phenylacetamido-carboxyphenyl ethyl boronate) are likely to confer antibiotic resistance. In addition, IPRO+/- was used to design shorter variants of a lignin biosynthesis pathway enzyme, starting with the soybean 4-Coumarate:CoA ligase (Gm4CL2) enzyme (homology modeled against Nt4CL2, PDB 5BST31), that show improved binding to substrates larger (such as sinapate and ferulate) than its native substrate (4-coumarate).

Searching for TEM-1 β-lactamases of different lengths starting from the E. coli native sequence

TEM-1 β-lactamase (TBL) is commonly used as a model for protein evolution experiments as it confers resistance to penicillin and cephalosporin-like drugs, which is used as a proxy for protein fitness. A recent study by Gonzalez et al.32 investigated a comprehensive library of 5270 amino acid insertion and 286 deletion variants of E. coli TBL (EcTBL PDB: 1ERM) capable of penicillin resistance. Overall, protein stability was found to be least affected by single amino acid indels in loops, followed by indels in tertiary structure-loop junctions, helices and sheets. We selected as design positions (DPs) within IPRO+/- thirteen positions spanning four of the six loops that involved indels in at least one enzyme family member (see Figures 15A-B) from a family sequence alignment with 156 other bacterial β-lactamases (alignment reported by Gonzalez et al.32). The enzyme-inhibitor complex energy score was minimized, and only designs with enzyme-inhibitor interaction scores not worse by more than 25% than that of the starting sequence (EcTBL) were accepted. Nine out of these thirteen positions could be replaced with a gap whereas four gapped positions in the starting sequence EcTBL could be filled with a residue (see Figure 15C). In addition, because substitution G120A co-occurred in 78% of β-lactamase family members that involved deletion D119_, we added Gly120 as a design position that could be substituted but not deleted. The family sequence alignment with 156 homologous TBLs provided the deletion/insertion probabilities at each one of the thirteen DPs. The goal was to assess whether IPRO+/- identified backbone length modifications and residue substitutions that mirrored the ones seen in the natural family of β-lactamases (see Figure 15C).

Figure 15. Engineering beta-lactamase. a) TEM1 β-lactamase from E. coli (1ERM) in complex with boronate inhibitor (PEB). b) Wild type TEM1 β-lactamase aligned with 48 beta-lactamase sequences (including gaps) shows the positions which can have an insertion or deletion. c) The insertion and deletion probabilities of each position were selected from the multiple sequence alignment with 48 other natural beta-lactamase homologs.

IPRO+/- could in principle sample designs with backbone lengths between 261 and 264 residues through the accumulation of deletions or additions at the 13 DPs by performing 3,000 IPRO+/- redesign runs starting from the 263-residue homologue (EcTBL). Each redesign run was capped at 20 loop closure iterations for re-stitching the backbone after an insertion or deletion step. This is a user-defined number and can be adjusted for performing larger contiguous insertions and deletions. Thirty-five redesigns with at least one insertion or deletion were identified with interaction energy scores not worse by more than 25% than that of the wild type (see Table 3). Notably, seven out of nine (78%) and five out of seven (71%) of the naturally occurring deletions and insertions, respectively, were recovered in these 35 IPRO+/- designs (see Figure 16). Each of the nine possible deletion sites could have one out of 21 fates (20 amino acids and one gap). Thus, the probability of recovering the best design (E64_, K55V, E64K; design R1.D1 in Table 3) by a random selection procedure following a binomial distribution after 3000 runs is estimated to be $3.8 \times 10^{-9}$ (computed as $3000 \times (1/21)^{9}$). This provides confidence that the redesigns proposed by IPRO+/- could not have been identified by random chance in 3000 design cycles. In addition, the identified insertions (_99P, _99V, _115G, _118G, and _118T) match the residues seen in the protein family alignment even though the originally added residue was Gly. Designs with deletions in loops 2 and 3 (positions 62-64, and 99, respectively) enhanced binding to the PEB inhibitor (see designs R1.D1 – R1.D19 except) by reorienting the neighboring polar residue side chains for electrostatic stabilization of the electronegative carboxyphenyl moiety of the PEB inhibitor, which corroborates Gonzalez et al.32. Insertions at position 99 in loop 3 yielded the most stable complexes while retaining wild-type substrate binding activity (see designs R1.D5, R1.D6, R1.D11, R1.D14, R1.D17, and R1.D19 indicated with an asterisk in Table 3). This is because even though the location of the inserted residue 99 is approximately 23 Å away from the substrate and does not affect substrate binding, the newly introduced residues (_99P, _99G, and _99V) serve as stabilizing anchors by establishing hydrophobic contacts with the neighboring Leu114 side chain, and electrostatic contacts (using the backbone N- and O-atoms of residue 99) with the side chains of Asp119 and Thr116 from loop 4.
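The random-chance estimate for the best design, 3000 × (1/21)^9, can be evaluated directly. The short check below only restates that arithmetic; the variable names are illustrative, and the "at least one hit" value is included as a cross-check under the added assumption that the 3000 runs are independent draws.

```python
import math

runs = 3000            # design runs per simulation
states_per_site = 21   # 20 amino acids plus one gap
sites = 9              # positions that may be deleted

p_single = (1.0 / states_per_site) ** sites   # one specific combination in one draw
estimate = runs * p_single                    # expected number of hits in 3000 draws

# Probability of at least one hit, computed stably for very small p_single.
at_least_once = -math.expm1(runs * math.log1p(-p_single))

print(f"{estimate:.2e}  {at_least_once:.2e}")
```

Both quantities are of order 10^-9, so the linear estimate and the exact binomial expression agree to the precision quoted.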

IPRO+/- also predicted six additional insertions in the native EcTBL sequence in the four gapped positions (99, 115, 117, and 118) where the inserted amino acid type in these positions is not observed in any of the family sequences (see Figure 16). Except for an insertion prediction (positively charged _117R) at position 117 (naturally occurring insertion is negatively charged _117D), all other predicted insertions had side chain types consistent with natural sequences (position 99 – hydrophobic, and positions 115 and 118 – polar). For example, arginine in _118R has a positively charged side chain similar to lysine seen in bacterial TBL from Bosea lupini (Uniprot ID A0A3Q9AU82) in that position. In addition to recovering _99P as seen in Cobeita sp., IPRO+/- designs predicted alternate hydrophobic amino acids such as glycine and alanine at position 99 with similar interaction and complex energy scores computed using CHARMM.

[Figure 16: Venn diagram. Set labels: Indels from naturally occurring β-lactamases; Indels from IPRO+/- designs.]

Figure 16. Venn diagram shows the fraction of naturally occurring indels in TBL homologs that were recovered by IPRO+/- simulations on EcTBL.

The computational models of the designed loops were very similar (RMSD < 0.24 Å) to those obtained from crystal structures of the loops containing the same indels. Twenty-three of the 156 TEM1 β-lactamase homologs used in the sequence alignment had reported crystal structures and were used for assessing the quality of IPRO+/- predicted structures of EcTBL variants. The average Cα-RMSDs of only the loop regions from designs and reported crystals with the same single amino acid deletions of Lys55, Pro62, Glu63, Ala86, Gly87, Gln88, and Asp119 were 0.21 Å, 0.17 Å, 0.05 Å, 0.14 Å, 0.15 Å, 0.17 Å, and 0.23 Å, respectively. Furthermore, the φ–ψ angles of the backbone atoms were in the exact same region of the Ramachandran plot for nearly 30% of all the insertions as observed from the crystal or homology modeled structures. The remaining inserted residues in the designed structures explored different backbone conformations and side chain orientations, and consequently established new electrostatic and hydrophobic contacts (with the Met121 side chain) to stabilize the substrate and the complex as a whole.

Structural aspects of the active site are preserved between the wild type and the indel designs.

Importantly, the catalytic Ser70 residue adopts the same wild-type like dihedral angles (and side chain orientation) in all designs. One particular design (design R1.D30 from Table 3) was able to perfectly recapitulate both the loop 4 configuration (Cα-RMSD ~0.02 Å) and sequence as seen in the TEM1 β-lactamase from Vibrio pectenicida (Uniprot ID: A0A427U5F7) which has a combination of an insertion, a deletion, and a substitution (_118G, D119_, G120A) with respect to the starting sequence – EcTBL. IPRO+/-, therefore, lays the foundation for future computational protein design approaches where predicting novel indels (along with substitutions) would aid the generation of focused libraries.

Table 3. CHARMM interaction and complex energy scores of 35 1ERM designs from IPRO+/-. The designs are arranged in descending order of variant-inhibitor interaction energy scores (column 4). These designs sample indels that are seen in natural homologs and have complex energy scores (stability metrics) not less than 90% of the wild type TEM1 β-lactamase complex energy with PEB inhibitor. Designs that constitute the most stable complexes are indicated with an asterisk (*).

| Variant ID | Amino acid transitions in EcTBL | Variant-inhibitor complex energy score (CHARMM energy units) | Variant-inhibitor interaction energy score (CHARMM energy units) | % increase (↑)/decrease (↓) in interaction energy score with respect to EcTBL WT |
|---|---|---|---|---|
| WT EcTBL | None | -10051.29 | -59.88 | 0 |
| R1.D1 | E63_, K55V, E64K | -10106.01 | -93.06 | 55.41 ↑ |
| R1.D2 | P62_, E63G | -10054.16 | -90.89 | 51.79 ↑ |
| R1.D3 | Q88_, E63R, K55G | -10077.52 | -89.78 | 49.93 ↑ |
| R1.D4 | E63_, P62A, E64D | -10135.22 | -89.64 | 49.70 ↑ |
| R1.D5 * | G87_, A86F, _99V | -10819.16 | -88.01 | 46.98 ↑ |
| R1.D6 * | A86_, G87A, _99V | -10974.09 | -87.61 | 46.31 ↑ |
| R1.D7 | Q88_, A86R, K55L | -10103.79 | -85.99 | 43.60 ↑ |
| R1.D8 | E63_, P62V, E64R | -10101.15 | -84.22 | 40.65 ↑ |
| R1.D9 | G87_, A86G, Q88K | -10059.36 | -83.47 | 39.40 ↑ |
| R1.D10 | G87_, A86W, Q88G | -10063.58 | -82.77 | 38.23 ↑ |
| R1.D11 * | Q88_, A86R, _99G | -10723.98 | -82.61 | 37.96 ↑ |
| R1.D12 | P62_ | -10103.02 | -81.07 | 35.39 ↑ |
| R1.D13 | Q88_, A86K, K55L | -10080.03 | -78.07 | 30.38 ↑ |
| R1.D14 * | _99G, P62K | -10904.63 | -74.62 | 24.62 ↑ |
| R1.D15 | _117G, G120A | -10103.06 | -72.30 | 20.74 ↑ |
| R1.D16 | _118R, D119G | -10103.21 | -69.99 | 16.88 ↑ |
| R1.D17 * | _99P, P62G, E63K | -10800.1 | -69.69 | 16.38 ↑ |
| R1.D18 | _118G, D119_, G120L | -10116.84 | -68.77 | 14.85 ↑ |
| R1.D19 * | _99G, P62A | -10884.59 | -67.65 | 12.98 ↑ |
| R1.D20 | _115G, T116A | -10282.3 | -67.32 | 12.42 ↑ |
| R1.D21 | _115V, D119S | -10023.44 | -65.88 | 10.02 ↑ |
| R1.D22 | _115D, T116G | -10090.24 | -64.32 | 7.41 ↑ |
| R1.D23 | _115V, T116E | -10103.52 | -63.44 | 5.95 ↑ |
| R1.D24 | _115R, T116A | -10306.72 | -62.92 | 5.08 ↑ |
| R1.D25 | _115R, T116V | -10090.24 | -60.70 | 1.37 ↑ |
| R1.D26 | _117G, D119T | -10203.52 | -60.56 | 1.14 ↑ |
| R1.D27 | K55_, P62D, E63G | -10186.72 | -59.40 | 0.80 ↓ |
| R1.D28 | _118G, D119_, G120K | -10097.63 | -59.05 | 1.39 ↓ |
| R1.D29 | _117R | -10130.38 | -58.73 | 1.92 ↓ |
| R1.D30 | _118G, D119_, G120A | -10080.11 | -57.82 | 3.44 ↓ |
| R1.D31 | D119_, G120P | -10067.84 | -54.98 | 8.18 ↓ |
| R1.D32 | K55_, P62F | -10073.95 | -54.87 | 8.37 ↓ |
| R1.D33 | D119_, G120A | -10202.92 | -54.69 | 8.67 ↓ |
| R1.D34 | D119_ | -10056.59 | -50.63 | 15.45 ↓ |
| R1.D35 | K55_ | -10060.07 | -46.66 | 22.08 ↓ |
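The percentage column of Table 3 appears to follow directly from the interaction energy column. The minimal sketch below assumes the change is measured relative to the magnitude of the wild type interaction energy score, with a more negative score counted as an increase in binding; the function name is illustrative and not part of IPRO+/-.

```python
WT_INTERACTION = -59.88  # EcTBL wild type interaction energy score (Table 3)


def percent_change(variant_score, wt_score=WT_INTERACTION):
    """Percent gain (positive) or loss (negative) in binding relative to WT.

    Assumption: a more negative CHARMM score means stronger binding, so the
    gain is (wt - variant) divided by the magnitude of the WT score.
    """
    return 100.0 * (wt_score - variant_score) / abs(wt_score)


# Interaction energy scores of the first and last designs in Table 3
for name, score in [("R1.D1", -93.06), ("R1.D35", -46.66)]:
    change = percent_change(score)
    direction = "increase" if change > 0 else "decrease"
    print(f"{name}: {abs(change):.2f}% {direction}")
```

Under this assumption the two rows give 55.41% (increase) and 22.08% (decrease), matching the tabulated values.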

Redesign of 4-Coumarate CoA Ligase 2 from soybean (Glycine max)

Plant 4CLs have been characterized from a wide range of species and have exhibited different isoform distribution patterns in terms of folded structure, with substrate specificities spanning several ring-substituted cinnamates. Lindermayr et al.33 reported that soybean (Glycine max) has three Gm4CL isoforms with a peptide motif that was functionally linked to turnover of three cinnamate ring substituents (namely sinapate, ferulate, and caffeate, all of which are bulkier than the native substrate 4-coumarate). Two out of three Gm4CL isoforms (4CL2 and 4CL3) have an extra amino acid at the center of this motif, which, when deleted, enables these isoforms to show enhanced turnover of the aforementioned larger cinnamic substrates along with compromised native substrate activity. An alignment of six 4CL sequences with high sinapate (and ferulate) activity comprising two Gm4CLs, three At4CLs (Arabidopsis thaliana) and one PheA (phenylalanine activating subunit of synthase 1 from Bacillus brevis) revealed two possible amino acid deletion sites, Val345 and Leu346 on Gm4CL2 (see Figure 17). Fifteen non-conserved binding pocket residues on Gm4CL2 (with two of them having non-zero deletion probabilities) were chosen as DPs (see Figure 17). Val285, Lys332, Leu333, Gly334, Gln335, Gly336, Met339, Ala342, Gly343, Pro344, Val345, Leu346, Thr347, Met348, Ser349, and Leu350 constituted the set of DPs. The deletion probabilities at positions Val345 and Leu346 were computed to be 14.3% (one out of seven) for both.
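The deletion probability quoted above is simply the fraction of aligned sequences that carry a gap at the position in question. The sketch below illustrates that arithmetic; the alignment column is a made-up example (the actual residues in each homolog are not reproduced here), and the function name is hypothetical.

```python
def deletion_probability(column, gap="-"):
    """Fraction of aligned sequences that have a gap at this alignment column."""
    if not column:
        raise ValueError("alignment column is empty")
    return sum(1 for residue in column if residue == gap) / len(column)


# Hypothetical column for position 345: seven aligned sequences, one with a gap
column_345 = ["V", "V", "-", "V", "I", "V", "V"]
print(f"{100 * deletion_probability(column_345):.1f}%")  # one out of seven
```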

Figure 17. The sequence alignment of the seven 4CLs with specificities spanning small to large cinnamate derivatives reveals two possible deletion sites (V345 and L346) and thirteen possible substitution positions in Gm4CL2. IPRO+/- redesign aims to enhance binding with larger substrates like sinapate and ferulate by accessing combinations of shorter 4CL sequences (such as Gm4CL1 and PheA) or similar length 4CL sequences with combinations of amino acid substitutions (similar to Gm4CL3, At4CL1, At4CL2, and At4CL3).

Preparation of enzyme structure and substrate docking

A homology-modeled structure of Gm4CL2 (see Figure 18), built using two luciferases from Photinus pyralis (PDB: 1BA3; ref. 34) and one 4CL from Nicotiana tabacum (PDB: 5BST; ref. 31) as templates, was prepared as described in Lindermayr et al.33. The ATP-mediated reaction mechanism proceeds by forming a 4-coumaroyl-AMP intermediate. Non-covalent interaction energy scores of Gm4CL2 variants with the 4-coumaroyl-, cinnamyl-, caffeyl-, ferulyl-, and sinapyl-AMP intermediates were used as in silico proxies for in vitro enzyme-substrate affinities. The reaction intermediates were first docked onto wild type Gm4CL2 using the torsion geometry of a 4-coumaroyl-AMP co-crystallized with Nt4CL2 (PDB: 5BST) as a guide. The very low overall interatomic RMSD (1.8 Å), even lower binding pocket RMSD (0.2 Å), and high (>89%) sequence similarity between Nt4CL2 and Gm4CL2 indicated that the substrate binding pose and catalytic distances would also be conserved.

Figure 18. Overview of the Gm4CL2-intermediate complex. a) Docked conformation of the reaction intermediate, 4-coumaroyl-AMP (yellow), in the catalytic groove of wild type Gm4CL2; the fifteen binding pocket residues are indicated in pink. b) A zoomed-in view of the binding pocket.

Design runs

IPRO+/- design runs were set up to identify variants that enhance binding with the non-native sinapyl-AMP substrate intermediate and to check whether this improvement comes at the cost of native 4-coumaryl-AMP binding. From the list of the four aforementioned non-native substrates (sinapate, caffeate, cinnamate, and ferulate), the largest substrate, sinapate, was chosen as the target substrate. The 13 (out of 15) DPs with zero deletion probability were allowed an unconstrained choice of substituent amino acids in order to improve binding to sinapyl-AMP. Overall, 298 unique trajectories were sampled and 23 Gm4CL2 variants with various combinations of amino acid deletions and substitutions were found (see Table 4). IPRO+/- drove the designs towards V345 and L346 deletion variants, which opened up more space at the binding pocket, allowing clash-free stabilization of larger substrates. Ten out of these 23 reflected a site-specific recovery of amino acids as seen in other naturally occurring 4CLs. One of these sequences had both Val345 and Leu346 deleted, while at least one deletion was observed in nine sequences. Table 4 shows the amino acid substitutions and deletions in each of the 23 successful designs and the corresponding in silico interaction energy scores with not only sinapyl-AMP, but also caffeyl-AMP, cinnamyl-AMP, ferulyl-AMP, and the native intermediate 4-coumaroyl-AMP.

Table 4. Differences in interaction energy scores of each of the 23 designs relative to wild type Gm4CL2, listed along with the amino acid substitutions and deletions, for five AMP-conjugates of cinnamate-like substrates (sinapate, ferulate, caffeate, 4-coumarate, and cinnamate) in decreasing order of size. The green-red color coding denotes improvement (green) or loss (red) of binding to a substrate in a Gm4CL2 variant in comparison to wild type. All these variants contain at least one amino acid change (deletion or substitution) seen in other naturally occurring 4CL2 homologs.

Values are the CHARMM interaction energy score* reduction (w.r.t. WT Gm4CL2) between enzyme variants and substrate-AMP intermediates of varying size (= Score_WT-intermediate – Score_variant-intermediate), in CHARMM energy units.

| Variant ID | Amino acid transitions in Gm4CL2 | Sinapate (C11H12O5) | Ferulate (C10H10O4) | Caffeate (C9H8O4) | Coumarate (C9H7O3) | Cinnamate (C9H8O2) |
|---|---|---|---|---|---|---|
| WT Gm4CL2 | None | 0 | 0 | 0 | 0 | 0 |
| R2.D1 | K332T, G334V, V345_ | 87.33 ↑ | 12.41 ↑ | -10.27 ↓ | -1.91 ↓ | -24.42 ↓ |
| R2.D2 | L346_, T347I, M348S, S349T | 86.56 ↑ | 46.00 ↑ | 27.39 ↑ | -4.36 ↓ | -49.22 ↓ |
| R2.D3 | V285I, L346G | 83.63 ↑ | 13.36 ↑ | -14.42 ↓ | -49.10 ↓ | -11.70 ↓ |
| R2.D4 | V345A, T347I | 73.99 ↑ | 30.74 ↑ | 0.93 ↑ | -17.91 ↓ | -12.41 ↓ |
| R2.D5 | L333F, V345_, L346_ | 71.04 ↑ | 52.77 ↑ | -5.72 ↓ | -39.18 ↓ | -63.95 ↓ |
| R2.D6 | K332R, V285I, V345G | 70.34 ↑ | 17.28 ↑ | 16.31 ↑ | 8.68 ↑ | -39.94 ↓ |
| R2.D7 | K332R, V345_ | 68.30 ↑ | 10.57 ↑ | 8.00 ↑ | -20.31 ↓ | -17.60 ↓ |
| R2.D8 | V345_, T347A | 67.77 ↑ | 40.72 ↑ | 36.47 ↑ | -21.87 ↓ | -57.59 ↓ |
| R2.D9 | V285I, K332R, V345_, L346A | 64.26 ↑ | 4.51 ↑ | 13.71 ↑ | 4.15 ↑ | -3.43 ↓ |
| R2.D10 | V345_, M348I | 61.90 ↑ | 15.91 ↑ | 16.47 ↑ | -24.30 ↓ | -62.75 ↓ |
| R2.D11 | V345C, L346_, T347G | 58.88 ↑ | 30.61 ↑ | -11.42 ↓ | -13.92 ↓ | -70.61 ↓ |
| R2.D12 | G336A, V345_ | 56.67 ↑ | 22.31 ↑ | 19.61 ↑ | -35.47 ↓ | -72.22 ↓ |
| R2.D13 | G336A, A342T | 52.20 ↑ | 30.31 ↑ | -4.51 ↓ | -8.21 ↓ | -73.04 ↓ |
| R2.D14 | G334A, Q335K, L346G, T347I | 51.99 ↑ | 13.08 ↑ | 9.44 ↑ | -21.15 ↓ | -19.06 ↓ |
| R2.D15 | L333Y, V345C | 47.63 ↑ | 27.00 ↑ | 4.32 ↑ | -20.06 ↓ | -64.62 ↓ |
| R2.D16 | P334G, G343S | 45.37 ↑ | 35.76 ↑ | 21.51 ↑ | -28.93 ↓ | -25.89 ↓ |
| R2.D17 | K332R, L346G, T347G | 43.07 ↑ | 29.66 ↑ | -22.38 ↓ | -17.43 ↓ | -77.61 ↓ |
| R2.D18 | K332S, L333G, G334F | 38.33 ↑ | 24.38 ↑ | 14.53 ↑ | -28.75 ↓ | -69.40 ↓ |
| R2.D19 | T347A, K332T | 35.70 ↑ | 11.28 ↑ | 9.45 ↑ | -14.32 ↓ | -23.52 ↓ |
| R2.D20 | V285L, K332T, G336A | 33.37 ↑ | 35.10 ↑ | 23.00 ↑ | -10.66 ↓ | -69.56 ↓ |
| R2.D21 | L346A, M348S, S349G | 31.48 ↑ | 34.86 ↑ | 1.92 ↑ | -19.30 ↓ | -28.40 ↓ |
| R2.D22 | K332S, L333R, G334A | 27.28 ↑ | 21.12 ↑ | 9.56 ↑ | -39.18 ↓ | -62.41 ↓ |
| R2.D23 | V285L, K332T, G334I | 25.34 ↑ | 21.11 ↑ | 14.29 ↑ | -11.16 ↓ | -72.17 ↓ |

*CHARMM interaction energy score between Gm4CL2 (WT) and (a) Sinapate = -8.05, (b) Ferulate = -19.31, (c) Caffeate = -20.21, (d) Coumarate = -79.74, and (e) Cinnamate = -109.63 CHARMM energy units. ↑ signifies better-than-wildtype binding affinity, and ↓ signifies lesser-than-wildtype affinity for a certain variant. A lower CHARMM interaction energy score reflects stronger enzyme-intermediate binding.
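Because Table 4 reports score reductions rather than absolute scores, the absolute interaction energy score of a variant can be recovered by rearranging the definition in the table header: Score_variant = Score_WT – reduction. The short sketch below does this using the wild type values from the footnote; the dictionary and function names are illustrative only.

```python
# Wild type Gm4CL2 interaction energy scores from the Table 4 footnote
WT_SCORES = {
    "sinapate": -8.05,
    "ferulate": -19.31,
    "caffeate": -20.21,
    "coumarate": -79.74,
    "cinnamate": -109.63,
}


def variant_score(substrate, reduction):
    """Absolute variant score, given reduction = Score_WT - Score_variant."""
    return WT_SCORES[substrate] - reduction


# Sinapate column entries for R2.D1 and R2.D5
print(round(variant_score("sinapate", 87.33), 2))
print(round(variant_score("sinapate", 71.04), 2))
```

For R2.D1 and R2.D5 this gives -95.38 and -79.09 CHARMM energy units, respectively.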

Lindermayr et al.33 experimentally validated that a ΔVal345:Gm4CL2 deletion variant alone was sufficient for introducing sinapate turnover, the major distinction between Gm4CL1 and Gm4CL2. The IPRO+/- interaction energy scores (Table 4) of Gm4CL2 variants containing V345_ (such as R2.D1 and R2.D5) show better sinapyl-AMP stabilization (-95.38 and -79.09 CHARMM energy units, respectively) as compared to wild type Gm4CL2 (-8.05). The experimentally measured sinapate affinities of ΔVal345:Gm4CL2 were nearly 250-fold lower (Km = 1208 µM) than wild type Gm4CL1 (Km = 4.7 µM), which, however, could not be captured from simulations. Nevertheless, IPRO+/- was able to make reasonable design decisions on two counts: (a) identify deletion of Val345 and Leu346 as strategies to improve sinapyl-AMP binding, and (b) improve sinapyl-AMP binding at the cost of 4-coumaryl-AMP (native substrate) binding (lowered from -79.74 to -57.87 in R2.D7), which is similar to the experimental observation (Km = 42 µM increased to 49 µM for ΔVal345+K332R:Gm4CL2). Structural comparison of the R2.D7 variant with wild type revealed that in the wild type Gm4CL2-sinapyl-AMP complex (see Figure 19), Val345 clashes with one of the methoxy groups of the substrate. Deletion of Val345 opens up the binding pocket by more than 11.2 Å³ on either side of the methoxy group, thus accommodating the sinapyl moiety while simultaneously rendering the pocket too open for efficient binding of the smaller 4-coumaryl-AMP and cinnamyl-AMP (which lack the methoxy groups).

Figure 19. Deletion of V345 in Gm4CL2 opens up the substrate binding groove, thus favorably accommodating the larger sinapyl group, which otherwise clashes with A235 and V345 alike. The binding pocket expands on either side by more than 3 Å (represented by dotted double arrows) upon V345 deletion.

Indel-Maker as a tool for constructing enzyme variants with desired indels and substitutions

We have created a Rosetta-based Indel-Maker tool (open-source and freely available from https://maranasgroup.com) that enables users to construct user-defined variant libraries containing combinations of amino acid insertions, deletions, and substitutions. This would be instrumental in discerning biophysical cues in experimentally tested variants to explain altered substrate/cofactor affinities or mutant stabilities. The workflow (see Figure 20) requires users to provide an input file specifying the insertions, deletions, and substitutions to be performed on the input PDB. The required modifications are performed one by one, each followed by loop closure (in case of insertions and deletions) and structure minimization using Rosetta's relax protocol (with the Rosetta all-atom force field). The resulting PDB and its corresponding Rosetta energy score are output at the end of each modification.
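At the sequence level, the bookkeeping behind such a list of modifications can be sketched as follows. This is only an illustration under stated assumptions: the tuple format for edits is hypothetical (it is not Indel-Maker's input file format), positions are taken to be 1-based and to refer to the original sequence, and an insertion is assumed to place the new residue immediately before the stated position. The loop closure and Rosetta relax steps that follow each modification in the actual tool are not represented.

```python
def apply_edits(sequence, edits):
    """Apply (kind, position, residue) edits to an amino acid sequence.

    kind is 'ins', 'del', or 'sub'. Positions are 1-based and refer to the
    ORIGINAL sequence, so edits are applied from the C-terminal end backwards
    to keep the numbering of earlier positions valid.
    """
    residues = list(sequence)
    for kind, pos, aa in sorted(edits, key=lambda e: e[1], reverse=True):
        i = pos - 1
        if kind == "sub":
            residues[i] = aa
        elif kind == "del":
            del residues[i]
        elif kind == "ins":
            residues.insert(i, aa)  # new residue lands before original position
        else:
            raise ValueError(f"unknown edit type: {kind}")
    return "".join(residues)


# Hypothetical example: insert K at 5, delete positions 9 and 12, mutate 6 to R
edits = [("ins", 5, "K"), ("del", 9, None), ("del", 12, None), ("sub", 6, "R")]
print(apply_edits("ACDEFGHIKLMNPQ", edits))
```

Processing edits from the highest position downwards is a simple way to avoid renumbering after each insertion or deletion.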

In order to benchmark Indel-Maker, we constructed 35 fluorescent and 35 non-fluorescent mutants of enhanced green fluorescent protein (egfp, from Arpino et al.35). The Indel-Maker models showed overall structural RMSDs of less than 1 Å for both sets, but the inactive variants showed more than 2 Å RMSD in the fluorophore region, thus providing structural insights into the inactivity of non-fluorescent variants.

Protein structures are often more conserved in evolution than sequences (Holm and Sander, Mapping the Protein Universe36). This has been exploited by protein design approaches to design non-natural sequences that enhance the thermal stabilities or binding affinities of proteins. On the other hand, protein sequences that adopt completely non-natural folds have also been designed (Baker and co-workers9,37). Herein we presented a novel way to increase the scope of targeted protein redesign by introducing insertions and deletions along with substitutions, leading to sampling of a larger sequence space. As demonstrated, integrating indels with protein design can enable design of versatile protein libraries.

[Figure 20 panels. A: Indel-Maker workflow. An input file listing insertions (e.g. Ins. 5_K), deletions (e.g. Del. 9,12), and substitutions (e.g. Sub. 6_R) is supplied together with an input PDB; Indel-Maker makes the mutations, minimizes the structure, and computes all-atom Rosetta energy scores, returning the designed PDB and the Rosetta energy scores of the designs. B, C: RMSD of the fluorophore region (Å) and overall RMSD (Å) per mutant, axis range 0.00 to 2.50 Å. D: Rosetta energy scores per mutant, axis range 950 to 1050. Series: active egfp mutants showing fluorescence and non-fluorescent egfp mutants. E, F: structural views.]

Figure 20. Performance of Indel-Maker. a) Indel-Maker workflow. b) Indel-Maker-generated egfp variants reveal more than 2 Å deviation of the fluorophore region. c) The overall structural RMSD (including side chains) of both the active and inactive variants was less than 1 Å. d) The Rosetta energy scores of all the variants ranged between 950 and 1025 Rosetta energy units. e) Deviation of the SYG fluorophore and the neighboring region in the inactive L64_ variant by 2.31 Å. f) Overall structural RMSD between WT egfp (yellow) and the L64_ variant (cyan) is 0.24 Å.

3.5. References

(1) Whitley, P.; Nilsson, I.; von Heijne, G. De Novo Design of Integral Membrane Proteins. Nat. Struct. Biol. 1994, 1 (12), 858–862.

(2) Hernández Lozada, N. J.; Lai, R. Y.; Simmons, T. R.; Thomas, K. A.; Chowdhury, R.; Maranas, C. D.; Pfleger, B. F. Highly Active C8-Acyl-ACP Thioesterase Variant Isolated by a Synthetic Selection Strategy. ACS Synth. Biol. 2018.

(3) Kumar, S.; Singh, S. K.; Wang, X.; Rup, B.; Gill, D. Coupling of Aggregation and Immunogenicity in Biotherapeutics: T- and B-Cell Immune Epitopes May Contain Aggregation-Prone Regions. Pharm. Res. 2011, pp 949–961.

(4) Chowdhury, R.; Ren, T.; Shankla, M.; Decker, K.; Grisewood, M.; Prabhakar, J.; Baker, C.; Golbeck, J. H.; Aksimentiev, A.; Kumar, M.; et al. PoreDesigner for Tuning Solute Selectivity in a Robust and Highly Permeable Outer Membrane Pore. Nat. Commun. 2018.

(5) Romero, P. A.; Arnold, F. H. Exploring Protein Fitness Landscapes by Directed Evolution. Nat. Rev. Mol. Cell Biol. 2009.

(6) Lippow, S. M.; Tidor, B. Progress in Computational Protein Design. Curr. Opin. Biotechnol. 2007.

(7) Khersonsky, O.; Röthlisberger, D.; Wollacott, A. M.; Murphy, P.; Dym, O.; Albeck, S.; Kiss, G.; Houk, K. N.; Baker, D.; Tawfik, D. S. Optimization of the In-Silico-Designed Kemp Eliminase KE70 by Computational Design and Directed Evolution. J. Mol. Biol. 2011.

(8) Kaplan, J.; DeGrado, W. F. De Novo Design of Catalytic Proteins. Proc. Natl. Acad. Sci. 2004.

(9) Richter, F.; Leaver-Fay, A.; Khare, S. D.; Bjelic, S.; Baker, D. De Novo Enzyme Design Using Rosetta3. PLoS One 2011.

(10) Hecht, M. H.; Das, A.; Go, A.; Bradley, L. H.; Wei, Y. De Novo Proteins from Designed Combinatorial Libraries. Protein Sci. 2004.

(11) Lapidoth, G. D.; Baran, D.; Pszolla, G. M.; Norn, C.; Alon, A.; Tyka, M. D.; Fleishman, S. J. AbDesign: An Algorithm for Combinatorial Backbone Design Guided by Natural Conformations and Sequences. Proteins Struct. Funct. Bioinforma. 2015, 83 (8), 1385–1406.

(12) Kuroda, D.; Shirai, H.; Jacobson, M. P.; Nakamura, H. Computer-Aided Antibody Design. Protein Eng. Des. Sel. 2012, 25 (10), 507–521.

(13) Lippow, S. M.; Wittrup, K. D.; Tidor, B. Computational Design of Antibody-Affinity Improvement beyond in Vivo Maturation. Nat. Biotechnol. 2007.

(14) Kortemme, T.; Joachimiak, L. A.; Bullock, A. N.; Schuler, A. D.; Stoddard, B. L.; Baker, D. Computational Redesign of Protein-Protein Interaction Specificity. Nat. Struct. Mol. Biol. 2004.

(15) Rämisch, S.; Weininger, U.; Martinsson, J.; Akke, M.; André, I. Computational Design of a Leucine-Rich Repeat Protein with a Predefined Geometry. Proc. Natl. Acad. Sci. 2014.

(16) Leaver-Fay, A.; Tyka, M.; Lewis, S. M.; Lange, O. F.; Thompson, J.; Jacak, R.; Kaufman, K.; Renfrew, P. D.; Smith, C. A.; Sheffler, W.; et al. Rosetta3: An Object-Oriented Software Suite for the Simulation and Design of Macromolecules. Methods Enzymol. 2011, 487 (C), 545–574.

(17) Pandurangan, A. P.; Ochoa-Montaño, B.; Ascher, D. B.; Blundell, T. L. SDM: A Server for Predicting Effects of Mutations on Protein Stability. Nucleic Acids Res. 2017.

(18) Gainza, P.; Roberts, K. E.; Georgiev, I.; Lilien, R. H.; Keedy, D. A.; Chen, C. Y.; Reza, F.; Anderson, A. C.; Richardson, D. C.; Richardson, J. S.; et al. Osprey: Protein Design with Ensembles, Flexibility, and Provable Algorithms. In Methods in Enzymology; 2013.

(19) Rackers, J. A.; Wang, Z.; Lu, C.; Laury, M. L.; Lagardère, L.; Schnieders, M. J.; Piquemal, J. P.; Ren, P.; Ponder, J. W. Tinker 8: Software Tools for Molecular Design. J. Chem. Theory Comput. 2018.

(20) Xiong, P.; Wang, M.; Zhou, X.; Zhang, T.; Zhang, J.; Chen, Q.; Liu, H. Protein Design with a Comprehensive Statistical Energy Function and Boosted by Experimental Selection for Foldability. Nat. Commun. 2014.

(21) Wu, S.; Zhang, Y. LOMETS: A Local Meta-Threading-Server for Protein Structure Prediction. Nucleic Acids Res. 2007.

(22) Lauck, F.; Smith, C. A.; Friedland, G. F.; Humphris, E. L.; Kortemme, T. RosettaBackrub - a Web Server for Flexible Backbone Protein Structure Modeling and Design. Nucleic Acids Res. 2010.

(23) Jing, F.; Zhao, L.; Yandeau-Nelson, M. D.; Nikolau, B. J. Two Distinct Domains Contribute to the Substrate Acyl Chain Length Selectivity of Plant Acyl-ACP Thioesterase. Nat. Commun. 2018.

(24) Pantazes, R. J.; Grisewood, M. J.; Li, T.; Gifford, N. P.; Maranas, C. D. The Iterative Protein Redesign and Optimization (IPRO) Suite of Programs. J. Comput. Chem. 2015, 36 (4), 251–263.

(25) Brooks, B. R.; Brooks, C. L.; Mackerell, A. D.; Nilsson, L.; Petrella, R. J.; Roux, B.; Won, Y.; Archontis, G.; Bartels, C.; Boresch, S.; et al. CHARMM: The Biomolecular Simulation Program. J. Comput. Chem. 2009, 30 (10), 1545–1614.

(26) Martín, A.; Barrientos, A.; del Cerro, J. The Natural-CCD Algorithm, a Novel Method to Solve the Inverse Kinematics of Hyper-Redundant and Soft Robots. Soft Robot. 2018.

(27) Canutescu, A. A.; Dunbrack, R. L. Cyclic Coordinate Descent: A Robotics Algorithm for Protein Loop Closure. Protein Sci. 2003.

(28) Kenwright, B. Inverse Kinematics – Cyclic Coordinate Descent (CCD). J. Graph. Tools 2013.

(29) Pantazes, R. J.; Grisewood, M. J.; Li, T.; Gifford, N. P.; Maranas, C. D. The Iterative Protein Redesign and Optimization (IPRO) Suite of Programs. J. Comput. Chem. 2015.

(30) Ness, S.; Martin, R.; Kindler, A. M.; Paetzel, M.; Gold, M.; Jensen, S. E.; Jones, J. B.; Strynadka, N. C. J. Structure-Based Design Guides the Improved Efficacy of Deacylation Transition State Analogue Inhibitors of TEM-1 β-Lactamase. Biochemistry 2000.

(31) Li, Z.; Nair, S. K. Structural Basis for Specificity and Flexibility in a Plant 4-Coumarate:CoA Ligase. Structure 2015.

(32) Gonzalez, C. E.; Roberts, P.; Ostermeier, M. Fitness Effects of Single Amino Acid Insertions and Deletions in TEM-1 β-Lactamase. J. Mol. Biol. 2019.

(33) Lindermayr, C.; Möllers, B.; Fliegmann, J.; Uhlmann, A.; Lottspeich, F.; Meimberg, H.; Ebel, J. Divergent Members of a Soybean (Glycine Max L.) 4-Coumarate:Coenzyme A Ligase Gene Family. Eur. J. Biochem. 2002.

(34) Franks, N. P.; Jenkins, A.; Conti, E.; Lieb, W. R.; Brick, P. Structural Basis for the Inhibition of Firefly Luciferase by a General Anesthetic. Biophys. J. 1998.

(35) Arpino, J. A. J.; Reddington, S. C.; Halliwell, L. M.; Rizkallah, P. J.; Jones, D. D. Random Single Amino Acid Deletion Sampling Unveils Structural Tolerance and the Benefits of Helical Registry Shift on GFP Folding and Structure. Structure 2014.

(36) Holm, L.; Sander, C. Mapping the Protein Universe. Science 1996.

(37) Kuhlman, B.; Dantas, G.; Ireton, G. C.; Varani, G.; Stoddard, B. L.; Baker, D. Design of a Novel Fold with Atomic-Level Accuracy. Science 2003.

Chapter 4

OPTMAVEN-2.0 FOR DE NOVO DESIGN OF VARIABLE ANTIBODY REGIONS AGAINST TARGETED ANTIGEN EPITOPES

This chapter has been previously published in a modified form in Antibodies (Chowdhury R, Allan M, Maranas C. OptMAVEn-2.0: de novo design of variable antibody regions against targeted antigen epitopes. Antibodies. 2018 Sep;7(3):23.)

4.1. Significance

Monoclonal antibodies are becoming increasingly important therapeutic agents for the treatment of cancers, infectious diseases, and autoimmune disorders. However, laboratory-based methods of developing therapeutic monoclonal antibodies (e.g. immunized mice, hybridomas, and phage display) are time-consuming and are often unable to target a specific antigen epitope or reach (sub)nanomolar levels of affinity. To this end, we developed OptMAVEn for de novo design of humanized monoclonal antibody variable regions targeting a specific antigen epitope. In this work, we introduce OptMAVEn-2.0, which improves upon OptMAVEn by (1) reducing computational resource requirements without compromising design quality, (2) clustering the designs to better identify high-affinity antibodies, and (3) eliminating intra-antibody steric clashes using an updated set of clashing parts from the MAPs database. Benchmarking on a set of 10 antigens revealed that OptMAVEn-2.0 uses an average of 74% less CPU time and 84% less disk storage relative to OptMAVEn. Testing on 54 additional antigens revealed that computational resource requirements of OptMAVEn-2.0 scale only sub-linearly with respect to antigen size. OptMAVEn-2.0 was used to design and rank variable antibody fragments targeting five epitopes of Zika envelope protein and three of hen egg white lysozyme. Among the top 5 ranked designs for each epitope, recovery of native residue identities is typically 45–65%. Molecular dynamics simulations of two designs targeting Zika suggest that at least one would bind with high affinity. OptMAVEn-2.0 can be downloaded from our GitHub repository (https://github.com/maranasgroup) and webpage (http://www.maranasgroup.com/software.htm).

4.2. Introduction

Antibodies are versatile molecules produced in B-cells and have become the basis of many therapeutics1–3 and diagnostics4–6 for cancers6–8, infectious diseases9, and autoimmune disorders10.

They are affinity proteins which are crucial for humoral immunity and are able to bind to foreign proteins with high specificity11. Administration of serum from survivors to treat patients during infectious disease outbreaks such as the 1918 influenza pandemic12 marks the early years of antibody-mediated therapeutics. The first monoclonal antibodies were developed by immunizing mice with a target antigen6. However, high immunogenicities of murine antibodies limit their efficacies in humans6. Subsequent efforts have resulted in chimeric constructs6 of murine variable domains grafted onto human constant domains. Although chimeras exhibit less immunogenicity relative to fully murine antibodies6, they are not entirely human6 and may still cause adverse reactions. Methods such as phage display13 and yeast display14 have been able to create high-affinity, completely humanized antibodies. However, all experimental methods of antibody development are time-consuming15, and none offers a general approach to target a specific antigen epitope, increase affinity without increasing immunogenicity, and categorize designs based on the primary sequence of the variable domain and the binding pose of the antigen16.

Computational methods of antibody design have addressed these limitations. Software exists for designing stable antibody-antigen complexes17–19, predicting the immunogenicities of antibody sequences20,21, and predicting stabilizing mutations to the antibody complementarity-determining regions (CDRs)17,22–24. Before our work, we knew of no software that could design antibodies de novo, that is, without an initial structure of an antibody bound to the antigen17–19. To this end, we first developed OptCDR17, which designed de novo CDRs of high affinities but not low immunogenicities. This limitation was addressed in the following effort, OptMAVEn16, which designs full antibody variable domains. Two subsequent efforts at antibody design were AbDesign18 by Lapidoth et al., and Rosetta Antibody Design (RAbD)19 by Adolf-Bryfogle et al. However, both these tools build upon existing antibodies and thus require an initial structure of the antigen-antibody complex.

In addition to designing antibodies without an input structure, OptMAVEn-2.0 performs computational affinity maturation while avoiding sequences likely to trigger an immune response.

During affinity maturation, OptMAVEn mimics natural mutation preferences by mutating residues in the CDRs with three times the frequency compared to residues in the framework regions.

OptMAVEn screens a large set of antigen poses, designs antibodies for each pose, and outputs the designs with the most favorable antigen-antibody interaction energies. However, OptMAVEn’s large computational time and storage requirements limit sampling of antigen poses, which reduces the likelihood of finding designs with favorable interaction energies.

Here, we introduce OptMAVEn-2.0, which is capable of sampling a larger set of antigen poses within roughly one day, while OptMAVEn required over one week. We have retained the mixed-integer linear programming (MILP) core module, which identifies six optimal parts from the Modular Antibody Parts (MAPs) database25 (HV, HCDR3, HJ, L/KV, L/KCDR3, and L/KJ) that constitute the variable domain. While OptMAVEn requires excessive disk storage by storing each antigen pose as a separate PDB file, OptMAVEn-2.0 alleviates this problem by storing only one reference pose and using transformation matrices to generate other poses as needed.
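The idea of regenerating poses from a single stored reference can be sketched as below. This is an illustration rather than OptMAVEn-2.0's actual implementation: it assumes a pose is fully described by a rotation about the z axis followed by a translation, and the function name and coordinates are hypothetical.

```python
import math


def make_pose(reference, z_angle_deg, translation):
    """Rotate reference coordinates about the z axis, then translate them.

    reference: list of (x, y, z) atom coordinates of the stored pose.
    Only the angle and the translation need to be kept for each pose.
    """
    theta = math.radians(z_angle_deg)
    c, s = math.cos(theta), math.sin(theta)
    tx, ty, tz = translation
    return [
        (c * x - s * y + tx, s * x + c * y + ty, z + tz)
        for x, y, z in reference
    ]


# Hypothetical two-atom reference pose (coordinates in Å)
reference = [(1.0, 0.0, 0.0), (0.0, 2.0, 1.0)]
pose = make_pose(reference, 90.0, (0.0, 0.0, 5.0))
print(pose)
```

Storing four numbers per pose instead of a full coordinate file is what makes this kind of scheme economical on disk.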

OptMAVEn-2.0 introduces a systematic procedure to classify antibody designs. Each MAPs part is assigned a three-dimensional coordinate that depends on the sequence similarity to other MAPs parts of the same type (HV, HCDR3, and so on). We compute a matrix of pairwise sequence similarity scores for each type of MAPs parts and then convert similarities into metric distances using Stojmirovic’s method26. We use Distance Geometry Optimization Software (DGSOL)27 to embed these distances in 3D-Euclidean space, yielding a 3D-coordinate for each MAPs part28.

After relaxing all designs, OptMAVEn-2.0 creates for each design a 23-dimensional vector consisting of the 3D-coordinates of its six MAPs parts (18 dimensions), the epitope centroid (3 dimensions), and the sine and cosine of the antigen z angle (2 dimensions). A Principal Component Analysis (PCA) step transforms these 23-dimensional vectors into 3-dimensional vectors, which are then used in k-means clustering of the designs. OptMAVEn-2.0 then ranks the designs from most to least promising by cycling through the clusters and selecting the design with the most negative interaction energy from each cluster until all designs have been selected.
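The cycling step can be sketched as a round-robin over clusters. The sketch below assumes that clusters are visited in the order given and that each design is summarized by an identifier and an interaction energy; both the data layout and the function name are illustrative, and the design identifiers and energies are made up.

```python
def rank_by_cycling(clusters):
    """Rank designs by cycling through clusters, best remaining design first.

    clusters: list of clusters, each a list of (design_id, interaction_energy).
    On each pass, the most negative remaining design of every non-empty
    cluster is appended to the ranking.
    """
    queues = [sorted(cluster, key=lambda d: d[1]) for cluster in clusters]
    ranking = []
    while any(queues):
        for queue in queues:
            if queue:
                ranking.append(queue.pop(0)[0])
    return ranking


# Hypothetical clusters of designs with made-up interaction energies
clusters = [
    [("a", -50.0), ("b", -80.0)],
    [("c", -60.0)],
    [("d", -40.0), ("e", -45.0), ("f", -70.0)],
]
print(rank_by_cycling(clusters))
```

Cycling in this way places one representative of every cluster near the top of the ranking, which favors diversity over taking the globally lowest energies alone.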

After ranking these germline designs (so named because they are assembled from MAPs parts that correspond to germline genes), the user has the option of assessing the stability of the germline designs bound to the epitope of interest using short (25 ns) molecular dynamics (MD) trajectories (using QwikMD29) and/or subjecting the designs to in silico affinity maturation while ensuring that the immunogenicity scores are reduced. The MD step assesses the stability of the most promising designs over 50 ns, ensuring that the best antibody designs bind stably to the antigen. Affinity maturation is implemented within the Iterative Protein Redesign and Optimization (IPRO) software30 and optimizes affinities of germline designs while ensuring that their immunogenicity does not increase. The immunogenicity of each design is assessed using the “human string content” (HSC)20, which estimates the potential of a sequence to elicit a T cell response when presented on MHC-II. HSC is used to calculate a “humanization score” (HScore)16: an antibody with a low HScore is relatively humanized and thus has low potential to trigger an immune response in the human body.

We use OptMAVEn-2.0 to design antibodies targeting five epitopes of Zika envelope (E) protein and three of hen egg white lysozyme. We assess the stability of two designs from one of the Zika cases using short MD simulations. Recovery of epitope-binding residues and sequence similarities have been reported for the top five designs for all the other cases.

4.3. Materials and Methods

4.3.1 Overview

OptMAVEn-2.0 (Figure 21) is de novo antibody design software that extends OptMAVEn16.

OptMAVEn-2.0 is fully automated (unlike OptMAVEn), requires less CPU time and disk storage, and features a novel clustering algorithm to increase the diversity of antibody designs raised against a specific antigen epitope. Both versions assemble antibodies from the MAPs database of antibody parts 25, which contains variable (V), CDR3, and joining (J) regions for the heavy (H), lambda (L), and kappa (K) chains. First, the user specifies the antigen and its epitope.

As in OptMAVEn, the antigen is rotated such that its epitope faces a framework antibody, and then an ensemble of antigen positions is generated by translating and rotating the antigen within a user-defined antigen binding site. Positions in which the antigen clashes with the framework antibody are discarded. At each remaining antigen position, the interaction energy between the antigen and each part in the MAPs database is calculated, and a set of six non-clashing MAPs


parts is selected so as to minimize the sum of the interaction energies between the parts and the antigen. These associations of an antigen position with a set of MAPs parts (i.e. designs) are clustered using a k-means approach. OptMAVEn-2.0 sequentially scans through all clusters, generating a PDB and FASTA file of the design with the most negative interaction energy in each cluster, repeating until files have been created for all designs. These designs can then undergo further validation (e.g. QwikMD29) or sequence optimization (e.g. affinity maturation and reduction of HScore16) to yield a set of designs for experimental validation or optimization

(e.g. with phage display13).


Figure 21. The workflow of OptMAVEn-2.0. First, the initial epitope positioning step rotates the antigen such that its epitope points downward with epitope centroid at the origin. The grid search step generates an ensemble of antigen positions; followed by the MILP step where the six lowest interaction energy MAPs are chosen to construct the variable antibody fragment. A

Euclidean coordinate for each part in the MAPs database was generated using the embedder


module. The k-means protocol uses these and the epitope centroid coordinates and rotation angle to cluster the antibodies. The antibodies with the most negative MILP energy in each cluster are then subjected to structural relaxation and a short MD routine to verify their high affinities. Stable designs emerging from this step could be affinity matured with the dual objective of enhancing their antigen-antibody affinities and lowering their immunogenic potentials.

Design and Implementation

OptMAVEn-2.0 runs continuously from the initial step (starting an experiment) to the output of germline designs. This feature reduces the effort on the part of the user and also makes

OptMAVEn-2.0 easier to use than OptMAVEn, which required manual initiation of each step in the workflow. OptMAVEn-2.0 is currently supported on UNIX platforms with Python 2.731,

NumPy32, SciPy33, and BioPython 1.734. Within its main directory, OptMAVEn-2.0/, are subdirectories src/ (source modules written in Python and Tool Command Language (TCL) scripts), experiments/ (all experiment directories), and data/ (files of antigen structures, topologies, and parameters). If the directory experiments/ does not exist, it is created automatically when the first experiment is started. The data/ directory contains three subdirectories: (1) pdbs/ stores structures of antigens, which may be in either PDB or mmCIF format; (2) input_files/ stores topology and parameter files needed for energy calculations in CHARMM35; and (3) antibodies/ stores framework antibody structures and the MAPs database. Before an experiment can be started, the structure of the antigen and all required topology and parameter files must be located in pdbs/ and input_files/, respectively. OptMAVEn-2.0 is pre-installed with default CHARMM topology

(top_all27_prot_na.rtf) and parameter (par_all27_prot_na.prm) files. The user may add additional files to support a wider range of antigens (or small drug molecules) that characterize these molecules’ types of bonds, angles, dihedrals and improper dihedral angles. An ./OptMAVEn-2.0


executable is also present in the OptMAVEn-2.0/ main directory and is used to initiate an experiment.

Starting an Experiment

To start an experiment, the user enters ./OptMAVEn-2.0 into a UNIX terminal from the main directory of OptMAVEn-2.0. First, the user names the experiment; OptMAVEn-2.0 creates a directory named OptMAVEn-2.0/experiments/name to hold all of the experiment’s results and temporary files. The user may customize the configuration of the experiment (e.g. by specifying topology and parameter files) or use the default configuration, defined in OptMAVEn-2.0/src/standards.py. The user then specifies the file containing the antigen’s structure, the chains that constitute the antigen, heteroatoms to exclude, and the residues of each chain that constitute the epitope region for which the antibody is to be designed. For each antigen chain, at least one epitope residue must be selected.

OptMAVEn-2.0 preprocesses the user-specified antigen structure file by automatically removing heteroatoms and chains that are not part of the antigen but are present in the crystal structure obtained from the Protein Data Bank (PDB). This feature makes initiating an experiment simpler.

Unlike OptMAVEn-2.0, in the older OptMAVEn, the user must remove these chains and heteroatoms manually and create a file listing the epitope residues; OptMAVEn does not check that these residues actually exist, but OptMAVEn-2.0 does. In OptMAVEn-2.0, users select antigen chains, heteroatoms, and epitope residues using a simple, single-line syntax. Ranges are indicated with hyphens, while individual items are delimited with commas: for example, A-C, E specifies chains A, B, C, and E of a certain molecule. Furthermore, OptMAVEn-2.0 makes it simpler for the user by listing the available chains of the antigen molecule to choose from. Overall, unlike in OptMAVEn, the user needs to know only the antigen PDB accession ID and the residues


that constitute the epitope of interest. OptMAVEn-2.0 automatically downloads the molecule from the Protein Data Bank using a package in BioPython34 and then performs the remaining steps.
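The single-line selection syntax described above (hyphens for ranges, commas between items) can be illustrated with a short sketch. This is not the OptMAVEn-2.0 source; the function name `parse_selection` and the idea of passing the ordered list of available items are assumptions made for illustration.

```python
def parse_selection(text, available):
    """Expand a selection such as 'A-C, E' against an ordered list of items.

    `available` is the ordered list of valid items (e.g. chain IDs found in
    the structure file). Hypothetical helper, not the OptMAVEn-2.0 code.
    """
    chosen = []
    for token in text.split(","):
        token = token.strip()
        if not token:
            continue
        if "-" in token:
            # A range: both ends must exist, and the start must come first.
            start, end = (part.strip() for part in token.split("-", 1))
            lo, hi = available.index(start), available.index(end)
            if lo > hi:
                raise ValueError("range is reversed: " + token)
            items = available[lo:hi + 1]
        else:
            if token not in available:
                raise ValueError("unknown item: " + token)
            items = [token]
        for item in items:
            if item not in chosen:  # keep the first occurrence only
                chosen.append(item)
    return chosen


chains = parse_selection("A-C, E", ["A", "B", "C", "D", "E"])
```

Because `list.index` raises `ValueError` for a missing item, the sketch also rejects selections that name chains or residues absent from the structure, which mirrors the existence check described above. Negative residue numbers would need a different range delimiter and are not handled here.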

Antigen Positioning

OptMAVEn-2.0 begins by adding missing atoms (e.g., hydrogens) to the antigen as necessary and performing an energy relaxation in CHARMM35. The user may configure this relaxation when starting the experiment by indicating the number of CHARMM relaxation iterations. Following the relaxation, the antigen is rotated to minimize the z-coordinate of the epitope’s centroid (i.e. the mean of the coordinates of the epitope’s Cα atoms, neglecting atomic masses). This step orients the epitope towards the ensemble of MAPs parts that will be assembled into the variable domain, thus ensuring that the antibody will bind to the intended epitope. The implementation of a similar antigen rotation step in OptMAVEn had two significant limitations, which are corrected in OptMAVEn-2.0. First, OptMAVEn uses an exhaustive search of rotations around the x and y axes in discrete increments of 3˚ (i.e. 120 angles per axis yielding $120^2 = 14{,}400$ rotations) to minimize the z-coordinate of the epitope’s centroid. This search requires extensive sampling and typically lasts several minutes. Second, the search has a finite resolution (3˚ in each axis): the desired rotation may lie between two search points and thus may not be sampled. To illustrate, let the desired rotation θopt = (θx,opt, θy,opt) consist of a rotation around the x axis by θx,opt followed by a rotation around the y axis by θy,opt. The discrete search will identify a point θopt’ = (θx,opt’, θy,opt’) such that θx,opt’, θy,opt’ ∈ {0˚, 3˚, 6˚, … , 357˚}. The maximum difference between θopt and θopt’ (for instance, if θopt = (1.5˚, 1.5˚)) is thus $\|\theta_{opt} - \theta_{opt}'\| = \sqrt{(1.5^\circ)^2 + (1.5^\circ)^2} = 1.5^\circ\sqrt{2} \approx 2.1^\circ$. Thus, the final rotated antigen conformation in OptMAVEn may be up to 2.1˚ off with respect to the desired rotation.


OptMAVEn-2.0 corrects both problems by using a single matrix to perform the rotation. First, the centroids of the antigen ($c_A$) and epitope ($c_E$), and the vector between them, $d = c_E - c_A$, are computed. Because the rotation does not change interatomic distances, $\|d\|^2 = d_x^2 + d_y^2 + d_z^2$ remains unchanged during the rotation; likewise, because $c_A$ is the center of rotation, $c_A$ must also remain unchanged. Thus, the rotation minimizes the z coordinate of the epitope’s centroid ($c_{Ez}$) subject to holding $\|d\|^2$ and $c_A$ constant. Because $d_x^2 + d_y^2 \geq 0$, it must be true that $0 \leq d_z^2 \leq \|d\|^2$. Because $d_z^2 = (c_{Ez} - c_{Az})^2$ and $c_{Az}$ is a constant, $c_{Ez}$ may be decreased until the point at which $(c_{Ez} - c_{Az})^2 = \|d\|^2$ and $d_x^2 = d_y^2 = 0$. Thus, the solution that minimizes $c_{Ez}$ is $c_{Ex} = c_{Ax}$, $c_{Ey} = c_{Ay}$, $c_{Ez} = c_{Az} - \|d\|$. This rotation is implemented using the trans procedure within Visual Molecular Dynamics (VMD)36 software. If $\|d\| = 0$ (e.g. if all antigen residues are part of the epitope), then no rotation is performed. This procedure outperforms OptMAVEn in that it requires no exhaustive search and yields an error of less than 0.01˚ in rotating the antigen such that the sum of the z-coordinates of its epitope is minimized.
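As an illustration of the closed-form solution above, the following sketch builds a single rotation matrix that maps the antigen-to-epitope vector d onto the negative z axis, using Rodrigues' rotation formula in NumPy. It is a stand-in under stated assumptions: OptMAVEn-2.0 itself performs this step in VMD, and the function name `orient_epitope_down` and the array layout (one row of x, y, z per atom) are hypothetical.

```python
import numpy as np


def orient_epitope_down(antigen_xyz, epitope_idx):
    """Rotate the antigen about its centroid so the epitope centroid lies
    directly below it (minimum z). Hypothetical helper for illustration."""
    c_a = antigen_xyz.mean(axis=0)               # antigen centroid (fixed point)
    c_e = antigen_xyz[epitope_idx].mean(axis=0)  # epitope centroid
    d = c_e - c_a
    norm = np.linalg.norm(d)
    if norm == 0.0:
        return antigen_xyz.copy()                # no rotation when ||d|| = 0
    u = d / norm
    target = np.array([0.0, 0.0, -1.0])
    axis = np.cross(u, target)
    sin_t = np.linalg.norm(axis)
    cos_t = float(np.dot(u, target))
    if sin_t < 1e-12:
        # d is already parallel (identity) or antiparallel (180 deg about x).
        rot = np.eye(3) if cos_t > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        k = axis / sin_t
        K = np.array([[0.0, -k[2], k[1]],
                      [k[2], 0.0, -k[0]],
                      [-k[1], k[0], 0.0]])
        # Rodrigues' formula: R = I + sin(t) K + (1 - cos(t)) K^2
        rot = np.eye(3) + sin_t * K + (1.0 - cos_t) * (K @ K)
    return (antigen_xyz - c_a) @ rot.T + c_a


antigen = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0], [0.0, 2.0, 0.0],
                    [0.0, 0.0, 2.0], [2.0, 2.0, 2.0]])
rotated = orient_epitope_down(antigen, [1, 4])
```

After the rotation, the epitope centroid sits at $c_A - (0, 0, \|d\|)$ up to floating-point error, so no angular grid is needed.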

Following the rotation, OptMAVEn-2.0 generates an ensemble of antigen positions using a grid search (Figure 22). This step has been made significantly more efficient relative to OptMAVEn.

An antigen-binding site is defined as the virtual box (obtained by inspecting 750 antigen-antibody binding regions) in which the x, y, and z coordinates of the epitope’s centroid are within the ranges

[-10 Å, 5 Å], [-5 Å, 10 Å], and [3.75 Å, 16.25 Å], respectively. This box is partitioned into a grid

(default x, y, and z intervals are 2.5, 2.5, and 1.25 Å, respectively). Furthermore, the antigen is rotated around the z axis to increase conformational sampling (the default is 6 rotations in increments of 60˚). Hence, each antigen position can be represented as a so-called position vector consisting of the epitope centroid (x, y, and z coordinates) and the rotation angle around the z axis


($\theta_z$). The default settings lead to 6×7×7×11 = 3,234 positions. OptMAVEn-2.0 introduces a precise definition of $\theta_z$ for peptide antigens, which was missing in OptMAVEn. Let $d_1 = c_1 - c_A$ be the vector extending from the centroid of the antigen to the coordinate $c_1$ of the C-alpha atom of the first residue in the antigen. Then $\theta_z$ is defined as the angle between the positive x-unit vector ($\vec{i}$) and the projection of $d_1$ onto the x-y plane. Using the relationship between angle and dot product, $\|\vec{i}\| \, \|\mathrm{proj}_{x,y}(d_1)\| \cos\theta_z = \vec{i} \cdot \mathrm{proj}_{x,y}(d_1)$, which leads to

\[ \theta_z = \mathrm{sign}\big(\mathrm{proj}_{x,y}(d_1)_y\big) \cdot \cos^{-1}\left(\frac{\vec{i} \cdot \mathrm{proj}_{x,y}(d_1)}{\|\vec{i}\| \, \|\mathrm{proj}_{x,y}(d_1)\|}\right) = \mathrm{sign}\big(\mathrm{proj}_{x,y}(d_1)_y\big) \cdot \cos^{-1}\left(\frac{\mathrm{proj}_{x,y}(d_1)_x}{\|\mathrm{proj}_{x,y}(d_1)\|}\right). \]
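The default grid and the definition of the z-rotation angle can be checked with a few lines of Python. The sketch below is illustrative only: the function names are hypothetical, and treating a zero y-component as positive (so that a vector along the negative x axis gives 180˚ rather than 0˚) is an assumption that the text does not specify.

```python
import math
from itertools import product


def axis_levels(lo, hi, step):
    """Evenly spaced grid levels from lo to hi inclusive."""
    n = int(round((hi - lo) / step)) + 1
    return [lo + i * step for i in range(n)]


def grid_positions(x_range=(-10.0, 5.0), y_range=(-5.0, 10.0),
                   z_range=(3.75, 16.25), steps=(2.5, 2.5, 1.25), n_rot=6):
    """All (x, y, z, theta_z) position vectors for the default binding box."""
    xs = axis_levels(x_range[0], x_range[1], steps[0])
    ys = axis_levels(y_range[0], y_range[1], steps[1])
    zs = axis_levels(z_range[0], z_range[1], steps[2])
    angles = [i * 360.0 / n_rot for i in range(n_rot)]
    return list(product(xs, ys, zs, angles))


def theta_z(c_antigen, c_first_ca):
    """Signed angle (degrees) between +x and the x-y projection of d1."""
    dx = c_first_ca[0] - c_antigen[0]
    dy = c_first_ca[1] - c_antigen[1]
    length = math.hypot(dx, dy)
    if length == 0.0:
        raise ValueError("d1 has no x-y component; theta_z is undefined")
    sign = -1.0 if dy < 0 else 1.0  # assumption: dy == 0 counts as positive
    cosine = max(-1.0, min(1.0, dx / length))
    return sign * math.degrees(math.acos(cosine))


positions = grid_positions()
```

With the default ranges and intervals, the grid has 7 levels in x, 7 in y, and 11 in z, which with 6 rotations reproduces the 3,234 positions quoted above.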

As in OptMAVEn, OptMAVEn-2.0 screens out antigen positions that will inevitably lead to steric clashes with the representative structure of the antibody framework regions. Antigen positions that clash with the framework will clash with any designed antibody and will therefore yield energetically unfavorable designs. Herein, a position is defined as clashing if any atom of the antigen is within 1.25 Å of any atom in the framework. For each antigen position, the number of clashes is counted. While OptMAVEn tolerates up to two clashes, OptMAVEn-2.0 tolerates no clashes, as the former often resulted in interlocked aromatic side chains between residues of the epitope and the designed antibody structure.
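The clash screen can be sketched as a pairwise distance check. The code below is a minimal illustration, not the OptMAVEn-2.0 implementation: the function names are hypothetical, and counting one clash per atom pair within the cutoff is an assumption, since the text does not say whether clashes are counted per pair or per atom.

```python
import math


def count_clashes(antigen_atoms, framework_atoms, cutoff=1.25):
    """Number of antigen/framework atom pairs within `cutoff` angstroms.

    Assumption: each close atom pair counts as one clash.
    """
    return sum(1 for a in antigen_atoms for f in framework_atoms
               if math.dist(a, f) <= cutoff)


def keep_position(antigen_atoms, framework_atoms, max_clashes=0):
    """True if the position passes the screen.

    max_clashes=0 mirrors the OptMAVEn-2.0 rule; max_clashes=2 mirrors the
    tolerance described for the original OptMAVEn.
    """
    return count_clashes(antigen_atoms, framework_atoms) <= max_clashes


antigen = [(0.0, 0.0, 0.0), (5.0, 5.0, 5.0)]
framework = [(1.0, 0.0, 0.0), (10.0, 10.0, 10.0)]
passes = keep_position(antigen, framework)
```

The double loop is quadratic in the number of atoms; a spatial index would be the natural choice for full-size structures, but the brute-force form keeps the rule itself easy to read.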


Figure 22. The grid search procedure. The antigen is first positioned such that (1) the centroid of its epitope is at the origin with the centroid of the antigen directly above it, and (2) the z-rotation angle of the antigen (the angle between P0 and the positive x axis) is zero. An ensemble of positions of the antigen is generated by translating the centroid of the epitope or rotating the antigen around the z axis or both.

OptMAVEn-2.0 significantly reduces disk storage requirements for antigen positioning by saving all non-clashing positions in a single text file (of a few kilobytes) and representing each as its position vector. Meanwhile, OptMAVEn saves each antigen conformation as its own PDB file.

Since PDB files of large antigens can be of the order of several megabytes, alleviating the requirement to save thousands of PDB files could save gigabytes of storage. This choice contributes in large part to reducing the average maximum disk usage by 84%.


MAPs Interaction Energy Calculations

At each non-clashing antigen position, the interaction energy between the antigen and each MAPs part is calculated. OptMAVEn uses C++ modules that require a separate PDB file for each antigen position. However, OptMAVEn-2.0 implements the energy calculations by calling the

NAMDEnergy37 module of VMD, which is able to translate and rotate the antigen after loading its initial structure. Thus, we are able to generate all antigen positions using only a reference (starting) structure of the antigen and a second file of position vectors (prepared during the Antigen positioning step), which together typically require only a few hundred kilobytes of disk space.

OptMAVEn-2.0, like OptMAVEn, uses electrostatic and van der Waals energy terms to choose the optimal antibody parts during the MILP step. Full antibody variable domain designs emerging from the optimal MAPs parts selection step are all relaxed using an energy function that also accounts for solvation effects. The binding scores thus calculated are then used to rank all the designs.

Optimal Selection of MAPs Parts

For each antigen position, OptMAVEn-2.0 selects one set of V, CDR3, and J parts from the H locus and one set from either the K or L locus, minimizing the sum of the interaction energies of the six parts using a mixed-integer linear program (MILP). In this program, we define the set I = {HV, HCDR3, HJ, LV, LCDR3, LJ, KV, KCDR3, KJ}, which contains the nine categories of MAPs parts. Each category i has a set of part indexes Pi = {1, 2, …, Ni}, where Ni is the number of parts listed in category i. Each MAPs part is represented as a tuple (i, p) of a category and a serial index within that category. Further, the set IPclash = {((i1, p1), (i2, p2)), … ((im, pm), (in, pn))} is the set of all pairs of parts that sterically clash. The parameter Ei,p is the interaction energy between


the antigen and MAPs part (i, p). The parameters Hd and Ld are set to 1 if the heavy and light variable domains, respectively, are being designed, and 0 otherwise. This allows the option of designing both domains (a full antibody) or a single domain (a nanobody). Finally, the binary variable Xi,p is equal to 1 if part (i, p) is chosen by the MILP to be a part of the final antibody design and is 0 otherwise. The optimization protocol uses an objective function subject to a set of five constraints as described below. The formulation is the same as that of OptMAVEn16.

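\[ \text{Minimize} \quad \sum_{i=1}^{9} \sum_{p=1}^{N_i} X_{i,p} E_{i,p} \]

subject to

\[ X_{i_1,p_1} + X_{i_2,p_2} \leq 1 \quad \forall \, \{(i_1, p_1), (i_2, p_2)\} \in IP_{clash} \tag{1} \]

\[ \sum_{p=1}^{N_i} X_{i,p} = H_d \quad \forall \, i \in \{HV, HCDR3, HJ\} \tag{2} \]

\[ \sum_{p=1}^{N_{KV}} X_{KV,p} + \sum_{p=1}^{N_{LV}} X_{LV,p} = L_d \tag{3} \]

\[ \sum_{p=1}^{N_{KV}} X_{KV,p} = \sum_{p=1}^{N_{KCDR3}} X_{KCDR3,p} = \sum_{p=1}^{N_{KJ}} X_{KJ,p} \tag{4} \]

\[ \sum_{p=1}^{N_{LV}} X_{LV,p} = \sum_{p=1}^{N_{LCDR3}} X_{LCDR3,p} = \sum_{p=1}^{N_{LJ}} X_{LJ,p} \tag{5} \]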


The objective function minimizes the interaction energy between the antigen and the set of MAPs parts that are selected. Constraint 1 prevents sterically clashing MAPs parts being chosen.

Constraint 2 ensures that while a heavy chain is being designed, exactly one HV, HCDR3, and HJ part is selected, and that no heavy chain parts are selected if the heavy chain is not being designed

(Hd = 0). Constraint 3 is analogous to constraint 2 and ensures that if a KV part is selected, no LV parts are selected and vice versa. Constraint 4 ensures that if a KV part is chosen by constraint 3, one each of KCDR3 and KJ parts are also chosen, else no K chain parts should be chosen.

Constraint 5 enforces the same for the L chain MAPs parts during the design. Together, constraints

3, 4, and 5 ensure that if a light chain is being designed, exactly one V, CDR3, and J part is selected for the light chain and prevent choosing a mix of kappa and lambda parts.
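To make the effect of the five constraints concrete, the sketch below solves a toy instance by brute-force enumeration rather than with an MILP solver. Everything in it is illustrative: the energies and the clash pair are invented, part indexes are 0-based (the text uses 1-based indexes), and enumeration is only feasible because the toy categories hold one or two parts each.

```python
from itertools import product

HEAVY = ("HV", "HCDR3", "HJ")
LIGHT = {"K": ("KV", "KCDR3", "KJ"), "L": ("LV", "LCDR3", "LJ")}


def chain_options(energies, categories):
    """Every way to pick exactly one part from each listed category."""
    ranges = [range(len(energies[c])) for c in categories]
    return [tuple(zip(categories, idx)) for idx in product(*ranges)]


def select_parts(energies, clashes, design_heavy=True, design_light=True):
    """Lowest-energy clash-free set of parts (brute-force stand-in for the MILP)."""
    # Constraint 2: one HV, HCDR3, HJ if the heavy chain is designed, else none.
    heavy = chain_options(energies, HEAVY) if design_heavy else [()]
    # Constraints 3-5: a complete kappa set or a complete lambda set, never a mix.
    light = [()]
    if design_light:
        light = [opt for cats in LIGHT.values()
                 for opt in chain_options(energies, cats)]
    best, best_energy = None, float("inf")
    for h, l in product(heavy, light):
        parts = h + l
        # Constraint 1: no two selected parts may sterically clash.
        if any(frozenset((a, b)) in clashes
               for n, a in enumerate(parts) for b in parts[n + 1:]):
            continue
        energy = sum(energies[c][p] for c, p in parts)  # objective function
        if energy < best_energy:
            best, best_energy = parts, energy
    return best, best_energy


# Invented toy data: interaction energies per part, and one clashing pair.
toy_energies = {"HV": [-5.0, -7.0], "HCDR3": [-3.0, -4.0], "HJ": [-1.0],
                "KV": [-6.0], "KCDR3": [-2.0], "KJ": [-1.0],
                "LV": [-6.5], "LCDR3": [-2.0], "LJ": [-1.0]}
toy_clashes = {frozenset({("HV", 1), ("LV", 0)})}
parts, energy = select_parts(toy_energies, toy_clashes)
```

In this toy case the lambda parts have the lower energy on their own, but the clash between the best HV part and the only LV part makes the kappa set the optimum. Setting `design_light=False` corresponds to Ld = 0 and yields a single-domain (nanobody) design.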

OptMAVEn-2.0 improves upon the design step of OptMAVEn in two ways. First, the IPclash set of OptMAVEn (48,800 pairs) was found to be incomplete, sometimes leading to designs with steric clashes between residues within the antibodies. Thus, despite having favorable interaction energies, these antibodies were structurally unstable. The current IPclash set has been updated to contain 66,604 additional pairs of MAPs parts and now identifies all pairs of parts for which any atom in one part is within 1 Å of any atom in the other (excluding pairs that cannot be selected simultaneously, such as HJ-1 and HJ-2 or LV-1 and KJ-1). A second improvement is that

OptMAVEn-2.0 designs only one antibody for each antigen position, while OptMAVEn designed five. As the additional four antibodies designed by OptMAVEn were always sub-optimal to the first design, eliminating them does not eliminate the optimal design for each position. Moreover, the subsequent clustering step would likely cluster together designs at the same position but ultimately choose only one or two designs from each cluster, and so the last three or four designs at each position would very seldom, if at all, appear on the final list of the best designs. Thus,


OptMAVEn-2.0 expends roughly one fifth of the effort during the design step without compromising the quality of the designs.

Antibody Assembly

OptMAVEn-2.0 creates a PDB file for each design by assembling the MAPs parts and positioning the antigen. These designs then undergo a structural relaxation (in CHARMM35) that first relieves any potential steric clashes, and then uses van der Waals, electrostatics, and Generalized-Born solvation energy terms to calculate the antigen-antibody interaction energy. These interaction energies are used for the clustering step and subsequent ranking of all the designs.

Clustering the Antibody Designs

Pre-processing step

OptMAVEn-2.0 clusters the antibody designs based on both their antigen positions (which are

Euclidean coordinates) and the sets of MAPs parts they comprise (which are not Euclidean coordinates). To simultaneously cluster by position and MAPs parts, a Euclidean coordinate was generated for each MAPs part. Methods exist to compute distances between two biological sequences (e.g. the amino acid sequences of MAPs parts)25 and to convert pairwise distance matrices into Euclidean coordinates. Such a conversion is possible only if (but not necessarily if) the distances satisfy the four criteria of a metric distance d26:

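\[ d(x, y) \geq 0 \quad \forall \, x, y \in M \tag{1} \]

\[ d(x, y) = 0 \Leftrightarrow x = y \quad \forall \, x, y \in M \tag{2} \]

\[ d(x, y) = d(y, x) \quad \forall \, x, y \in M \tag{3} \]

\[ d(x, y) + d(y, z) \geq d(x, z) \quad \forall \, x, y, z \in M \tag{4} \]

where x, y, and z are sequences; M is a category of MAPs parts; and d is the function that computes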


a distance between two sequences. The first condition requires that all distances be non-negative, the second that two sequences have a distance of zero if and only if they are identical, the third that the distance function is symmetric, and the fourth that the triangle inequality holds.

The method of Stojmirovic26 is particularly well-suited to this task because it yields metric distances from biological sequences in the following manner. Let s(x, y) be a similarity score between sequences x and y, such that s(x, y) is greater if x and y are more similar. The associated quasi-metric distance q of the similarity score s is q(x, y) = s(x, x) – s(x, y). Finally, the distance d(x, y) = max{q(x, y), q(y, x)} is a metric, provided that s satisfies the following conditions26:

\[ s(x, x) \geq s(x, y) \quad \forall \, x, y \in M \tag{1} \]

\[ s(x, x) = s(x, y) \wedge s(y, x) = s(y, y) \Rightarrow x = y \quad \forall \, x, y \in M \tag{2} \]

\[ s(x, y) + s(y, z) \leq s(x, z) + s(y, y) \quad \forall \, x, y, z \in M \tag{3} \]

where x, y, and z are sequences and M is a category of MAPs parts. Most protein alignment scoring systems satisfy these conditions. Because the MAPs parts follow the IMGT numbering system38, amino acids that have aligned with each other have the same residue number. Therefore, the similarity score between two sequences is the sum over all residue numbers of the alignment scores of the pair of aligned amino acids, or of a gap penalty if one sequence lacks a residue number:

\[ s(x, y) = \sum_{i \in A \cup B} s'(x_i, y_i) \]

where A and B are the sets of residue numbers in sequences x and y, respectively; $x_i$ denotes the amino acid of number i in sequence x (or $x_i$ is a gap if $i \notin A$); and $s'(x_i, y_i)$ is the similarity score between amino acids $x_i$ and $y_i$ in the BLOSUM62 matrix39 if $i \in A \cap B$ or a gap penalty g otherwise. The optimal value of g was not known a priori, and so five levels (4, 6, 8, 10, and 12) were tested. For each level, we computed the similarity scores between all pairs of MAPs parts within every category and verified that they satisfied the conditions for s. Five violations of


condition 2 revealed that there were five pairs of identical parts in the MAPs database: (HV-135,

HV-136), (KV-2, KV-3), (KV-25, KV-26), (KV-41, KV-42), and (LV-5, LV-6). After removing the higher-numbered of the two parts from the database, all three conditions were satisfied.
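The construction of a metric distance from a similarity score can be sketched as follows. This is a hedged illustration, not the embedder module: sequences are represented as dictionaries from residue number to amino acid, `toy_score` is a hypothetical stand-in for BLOSUM62 (match +4, mismatch −1), and a gapped position is assumed to contribute −g to the similarity.

```python
def toy_score(a, b):
    """Hypothetical stand-in for a BLOSUM62 lookup."""
    return 4 if a == b else -1


def similarity(x, y, gap_penalty, score=toy_score):
    """s(x, y): sum over the union of residue numbers in x and y."""
    total = 0
    for i in set(x) | set(y):
        if i in x and i in y:
            total += score(x[i], y[i])  # both sequences have residue i
        else:
            total -= gap_penalty        # assumption: a gap contributes -g
    return total


def quasi_metric(x, y, gap_penalty):
    """q(x, y) = s(x, x) - s(x, y)."""
    return similarity(x, x, gap_penalty) - similarity(x, y, gap_penalty)


def metric_distance(x, y, gap_penalty):
    """d(x, y) = max{q(x, y), q(y, x)}."""
    return max(quasi_metric(x, y, gap_penalty), quasi_metric(y, x, gap_penalty))


# Residue numbers stand in for IMGT positions: y lacks residue 3 and x lacks 4.
x = {1: "A", 2: "C", 3: "D"}
y = {1: "A", 2: "C", 4: "E"}
z = {1: "A", 2: "G", 3: "D"}
d_xy = metric_distance(x, y, gap_penalty=8)
d_xz = metric_distance(x, z, gap_penalty=8)
```

Two parts with identical sequences have d = 0, which is how duplicate entries such as those listed above surface as violations of condition 2.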

Although the resulting pairwise distance matrix for each MAPs category satisfied the conditions for a metric, all such matrices possessed negative eigenvalues, indicating that they could not be embedded in Euclidean space40. Therefore, we devised a method to approximate a Euclidean embedding of these distances (Figure 23). Several programs (including MD-jeep41, Xplor-NIH42, TINKER43, and DGSOL27) create approximate embeddings in 3D space. Although representing high-dimensional space in three dimensions causes the loss of some information, reducing the dimensionality helps to mitigate the so-called “curse of dimensionality” in the subsequent clustering step44. An attractive feature of DGSOL is that it accepts a lower and upper bound for each pairwise distance, enabling multiple sets of bounds to be tested. DGSOL computes a penalty function that depends on the extent to which the distances between embedded coordinates lie outside of the bounds; distances within the bounds are not penalized. The lower and upper bounds $LB_{ij}$ and $UB_{ij}$ were computed as $LB_{ij} = (1 - w) \times d(x_i, x_j)$ and $UB_{ij} = (1 + w) \times d(x_i, x_j)$, respectively, where $x_i$ and $x_j$ are two MAPs parts from the same category, d is the distance function, and w is a bound width parameter that was varied from 0.0 to 0.5 in increments of 0.05. For each level of w and of gap penalty g, DGSOL was used to generate an embedded coordinate $c_i$ for each MAPs part i. For each category of MAPs parts, the pairwise distances $c_{ij} = \|c_i - c_j\|$ between every pair of parts ($i \geq j$) in the category were compared to the alignment distances $d_{ij} = d(x_i, x_j)$. Specifically, the Spearman rank correlation $\rho$ between $C = \{c_{ij} \mid i \geq j\}$ and $D = \{d_{ij} \mid i \geq j\}$ was calculated, as was the root mean square error

\[ RMSE = \sqrt{\frac{\sum_{i \geq j} (c_{ij} - d_{ij})^2}{N}}, \]

where N = card(C) = card(D) is the number of pairs of parts. The optimal w was chosen such that $\rho$ was maximized. In the case of a tie, the w that minimized RMSE was chosen from among those w values that maximized $\rho$.
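The two quality measures, together with the distance bounds passed to the embedding program, can be written out in plain Python. This is a sketch for illustration with hypothetical function names; the embedding itself (performed by DGSOL in this work) is not reproduced, and the three distances in the usage lines are example values, the first pair taken from the HJ-1/HJ-2 example in Figure 23.

```python
import math


def bounds(distance, w):
    """Lower and upper bound for one pairwise distance at bound width w."""
    return (1 - w) * distance, (1 + w) * distance


def average_ranks(values):
    """1-based ranks, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def pearson(a, b):
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((p - mean_a) * (q - mean_b) for p, q in zip(a, b))
    var_a = sum((p - mean_a) ** 2 for p in a)
    var_b = sum((q - mean_b) ** 2 for q in b)
    return cov / math.sqrt(var_a * var_b)


def spearman(c, d):
    """Spearman rho: Pearson correlation of the rank-transformed values."""
    return pearson(average_ranks(c), average_ranks(d))


def rmse(c, d):
    """Root mean square error between embedded and aligned distances."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(c, d)) / len(c))


aligned = [4.0, 10.0, 7.0]    # d_ij, example values
embedded = [4.06, 9.5, 7.2]   # c_ij, example values
rho = spearman(embedded, aligned)
error = rmse(embedded, aligned)
```

Here the embedding preserves the order of the three distances, so ρ is 1 even though RMSE is non-zero; this is why the two measures are reported separately.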

Figure 23. The steps involved in the embedder module. Actual values for HJ parts are provided as an example. First, the sequences are used to compute pairwise sequence similarity scores Sxy using the BLOSUM62 matrix and a gap penalty g. From Sxy, quasi-metric distances Qxy and their associated metric distances Dxy are computed (e.g. DHJ-1,HJ-2 = 4). Dxy can be visualized as a matrix or as a set of points and pairwise distances that cannot be embedded in Euclidean space. DGSOL generates Euclidean 3D coordinates for the points and computes the distances Dembed between every pair of parts (e.g. Dembed: HJ-1,HJ-2 = 4.06). It minimizes the sum of squared differences between


corresponding aligned (Dxy) and embedded (Dembed) distances (e.g. [DHJ-1,HJ-2 − Dembed: HJ-1,HJ-2]² = 0.0036). The Spearman rank correlation between Dalign and Dembed is used to assess the quality of the embedding.

The gap penalty g is used to compute sequence alignment distances $d_{ij}$, which are embedded and used to compute pairwise distances $c_{ij}$. Thus, ρ and RMSE (which depend on $d_{ij}$ and $c_{ij}$) depend on g. A ρ close to unity indicates that the relative order of distances was preserved during the embedding, and an RMSE close to zero indicates that the distances themselves were minimally perturbed. The optimal gap penalty (g) would maximize ρ and minimize RMSE for each MAPs category. To identify this optimal g, we tested five values of g: 4, 6, 8, 10, and 12. Each g was used to generate a similarity matrix S and an alignment matrix Dalign for each category of parts. The distances in Dalign were embedded with DGSOL, and pairwise distances Dembed between the embedded coordinates were computed. Then, ρ (Figure 24a and Table 5) and RMSE (Figure 24b and Table 6) were computed using Dalign and Dembed. For each MAPs part category (HV, LV, and so on), we ranked the different g values in terms of the corresponding ρ (highest ρ yields rank 1 for the corresponding g and vice-versa) and of RMSE (lowest RMSE yields rank 1 for the corresponding g and vice-versa) (Table 7). Therefore, the rank of each g indicates how well the distances in Dalign could be embedded while preserving both relative and absolute distances. HJ, LJ, and KJ were excluded from this analysis because these parts contain no residue number gaps in the IMGT numbering; for these parts, Dalign and Dembed do not depend on g. We found that g = 8 had the best average rank (2.1) (Figure 24c) and thus used g = 8 hereafter. However, the user has the option of selecting a different g from among the levels tested.
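The average rank quoted above can be reproduced directly from the rankings in Table 7. The short sketch below transcribes those rankings (each list gives the gap penalties holding ranks 1 to 5) and averages the rank of every g over the twelve category/criterion rows; the variable and function names are illustrative.

```python
# Rankings transcribed from Table 7: for each (category, criterion), the gap
# penalties listed from rank 1 to rank 5.
TABLE_7 = {
    ("HV", "rho"): [12, 8, 6, 10, 4],
    ("HCDR3", "rho"): [12, 8, 6, 4, 10],
    ("KV", "rho"): [12, 8, 6, 4, 10],
    ("KCDR3", "rho"): [12, 8, 10, 6, 4],
    ("LV", "rho"): [12, 8, 6, 4, 10],
    ("LCDR3", "rho"): [12, 8, 6, 4, 10],
    ("HV", "RMSE"): [4, 6, 8, 10, 12],
    ("HCDR3", "RMSE"): [6, 8, 12, 4, 10],
    ("KV", "RMSE"): [4, 8, 6, 12, 10],
    ("KCDR3", "RMSE"): [8, 6, 12, 4, 10],
    ("LV", "RMSE"): [4, 8, 6, 12, 10],
    ("LCDR3", "RMSE"): [4, 6, 8, 12, 10],
}


def average_rank_per_gap(table):
    """Mean rank of each gap penalty across all rows of the table."""
    ranks = {}
    for ordering in table.values():
        for rank, g in enumerate(ordering, start=1):
            ranks.setdefault(g, []).append(rank)
    return {g: sum(r) / len(r) for g, r in ranks.items()}


averages = average_rank_per_gap(TABLE_7)
best_g = min(averages, key=averages.get)
```

For g = 8 the ranks sum to 25 over 12 rows, giving 25/12 ≈ 2.1, the lowest average among the five levels. The next best is g = 12 at 29/12 ≈ 2.4, which ranks first on ρ in every category but poorly on RMSE.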


Table 5. The Spearman rank correlation coefficient (ρ) for each MAPs part category at each gap penalty g. ρ is independent of g for the J parts because the J parts do not have gaps.

g    HV     HCDR3  HJ     LV     LCDR3  LJ     KV     KCDR3  KJ
4    0.932  0.804  0.982  0.987  0.818  1.000  0.934  0.910  0.996
6    0.935  0.831  0.982  0.988  0.838  1.000  0.935  0.922  0.996
8    0.939  0.855  0.982  0.987  0.852  1.000  0.939  0.931  0.996
10   0.935  0.774  0.982  0.986  0.839  1.000  0.921  0.894  0.996
12   0.948  0.891  0.982  0.987  0.875  1.000  0.941  0.946  0.996

Table 6. The root mean squared error (RMSE) for each MAPs part category at each gap penalty g. RMSE is independent of g for the J parts because the J parts do not have gaps.

g    HV      HCDR3   HJ     LV      LCDR3   LJ     KV      KCDR3  KJ
4    26.313  15.441  0.593  13.005  9.833   0.321  21.669  8.484  1.124
6    26.793  15.133  0.593  13.146  9.585   0.321  21.983  8.508  1.124
8    27.179  15.306  0.593  13.244  9.585   0.321  21.906  8.574  1.124
10   27.360  15.902  0.593  13.513  10.096  0.321  22.877  9.551  1.124
12   27.404  15.439  0.593  13.620  9.674   0.321  22.468  8.699  1.124

Table 7. For each category of MAPs parts, the levels of the gap penalty g were ranked from 1 to 5 on the basis of ρ (highest ρ is rank 1) and RMSE (lowest RMSE is rank 1). J parts were excluded because for the J parts, ρ and RMSE are independent of g, as their sequences are devoid of residue gaps. Each entry is the gap penalty (g) that received the given rank.

Category  Criterion  Rank 1  Rank 2  Rank 3  Rank 4  Rank 5
HV        ρ          12      8       6       10      4
HCDR3     ρ          12      8       6       4       10
KV        ρ          12      8       6       4       10
KCDR3     ρ          12      8       10      6       4
LV        ρ          12      8       6       4       10
LCDR3     ρ          12      8       6       4       10
HV        RMSE       4       6       8       10      12
HCDR3     RMSE       6       8       12      4       10
KV        RMSE       4       8       6       12      10
KCDR3     RMSE       8       6       12      4       10
LV        RMSE       4       8       6       12      10
LCDR3     RMSE       4       6       8       12      10

[Figure 24 axis labels: Spearman rank correlation (ρ); Z-score of RMSE; Average rank]

Figure 24. The optimal gap penalty (g) is 8. For each category of MAPs parts and each gap penalty g (4 to 12), pairwise aligned (Dalign) and embedded (Dembed) distances were generated. The value of ρ and z-score of RMSE between these distances were computed. Progressively increasing g led to a higher (desired) ρ and a higher (undesired) RMSE z-score with the exception of g = 10, which showed lower ρ and higher RMSE z-score than did g = 8. The average rank of each g level for ρ and RMSE reveals g = 8 to be the best with an average rank 2.1.


For g = 8, ρ was highest (ρ > 0.982) for the HJ, LJ, KJ and KV categories, showing that the Euclidean coordinates recapitulated the relative ranks of the distances in Dalign very well. The CDR3 regions had the lowest values (0.851 < ρ < 0.932), indicating that the optimal Euclidean approximations swapped the ranks of a greater number of distances. Lower ρ values can presumably be attributed to the greater number of structures N in each CDR3 set (39 ≤ N ≤ 428) than in each J set (5 ≤ N ≤ 7). In the distance geometry problem, a set of pairwise distances between N points can be embedded into a Euclidean space of at most N − 1 dimensions. Thus, the maximum potential dimensions of the spaces in which the CDR3 parts could be embedded are greater than those of the spaces in which the J parts could be embedded. Projecting higher-dimensional coordinates onto 3 dimensions crushes more dimensions and thus causes more pairs of points that are far apart in high-dimensional space to become close together in 3-dimensional space. Dimension crushing would create parts with large aligned distances but small embedded distances. Such parts appear most in the sets with the largest number of members (i.e. CDR3), less often in the medium-sized sets (i.e. V), and never in the smallest sets (i.e. J) (Figure 25).


[Figure 25 panels, with ρ as printed in the figure: HV (ρ = 0.939), HCDR3 (ρ = 0.855), HJ (ρ = 0.982); KV (ρ = 0.987), KCDR3 (ρ = 0.852), KJ (ρ = 1.000); LV (ρ = 0.939), LCDR3 (ρ = 0.931), LJ (ρ = 0.996). x axis: aligned distance from sequence alignment scores; y axis: embedded distance.]

Figure 25. A 3D coordinate was computed for each MAPs part. For each pair of MAPs parts within each category, the two parts’ embedded distance in Euclidean space was plotted against their sequence alignment distance.

k-means Clustering

Each antigen position and its associated optimal set of MAPs parts is converted into a 23-dimensional vector by concatenating the x, y, and z coordinates of the epitope’s centroid; the sine and cosine of θz; and the 3D coordinates representing the six MAPs parts. Clustering algorithms often fail to cluster high-dimensional data well due to the so-called “curse of dimensionality”44.
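A minimal sketch of how such a 23-dimensional vector could be assembled is shown below. The function name and argument layout are assumptions for illustration; the ordering of the components follows the description above.

```python
import math


def design_vector(epitope_centroid, theta_z_deg, part_coords):
    """Concatenate position and parts into one 23-dimensional vector.

    epitope_centroid: (x, y, z) of the epitope centroid
    theta_z_deg:      z-rotation angle in degrees
    part_coords:      six (x, y, z) embedded coordinates, one per MAPs part
    """
    if len(part_coords) != 6:
        raise ValueError("expected exactly six MAPs parts")
    theta = math.radians(theta_z_deg)
    vector = list(epitope_centroid) + [math.sin(theta), math.cos(theta)]
    for xyz in part_coords:
        vector.extend(xyz)
    return vector


parts = [(1.0, 2.0, 3.0)] * 6  # placeholder embedded coordinates
vec = design_vector((-2.5, 5.0, 7.5), 60.0, parts)
```

Encoding the angle by its sine and cosine, rather than by the angle itself, keeps rotations of 0˚ and 360˚ (or 0˚ and 300˚, which are only 60˚ apart) close together in the vector space.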

Thus, the 23-dimensional vectors are normalized such that each dimension has unit variance (if the original variance is not zero), and PCA is performed to reduce the dimensionality of each vector to 3. Because the optimal number of clusters k is unknown prior to clustering, the clustering procedure initializes k to 1 and increments k after each round of clustering. During each round, the


k clusters are initialized by randomly selecting (without replacement) one vector as the centroid for each cluster. Each vector is assigned to the cluster with the nearest centroid (measured by

Euclidean distance). If any cluster is empty, a vector selected randomly from another cluster is moved to the empty cluster. Each cluster centroid is then moved to the arithmetic mean of the vectors in the cluster; the root mean square (RMS) movement of the centroids is computed. The assignment and movement steps are repeated until the RMS movement falls below a threshold (default 0.01) or an iteration limit is reached (default 1000). For each cluster, the mean squared distance (MSD) between the centroid and the cluster members is computed; the maximum of these MSD values is taken as the MSD for that k value. For each k, the ratio of the MSD to the MSD for k = 1 is computed. The k value is incremented until this ratio falls below a threshold (default 0.2).
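The clustering loop described in this subsection can be sketched in plain Python as follows. This is an illustrative reimplementation under stated assumptions, not the OptMAVEn-2.0 module: the function names are hypothetical, the input is assumed to be the list of PCA-reduced vectors as tuples, and a fixed random seed is used so that the sketch is repeatable.

```python
import math
import random


def mean_point(points):
    """Arithmetic mean of a list of equal-length tuples."""
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(len(points[0])))


def kmeans(points, k, rng, tol=0.01, max_iter=1000):
    """One round of k-means with random initial centroids."""
    centroids = rng.sample(points, k)  # sampled without replacement
    for _ in range(max_iter):
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        for c in range(k):
            if c not in labels:
                # Refill an empty cluster from a cluster with more than one member.
                donors = [i for i, lab in enumerate(labels) if labels.count(lab) > 1]
                labels[rng.choice(donors)] = c
        new = [mean_point([p for p, lab in zip(points, labels) if lab == c])
               for c in range(k)]
        move = math.sqrt(sum(math.dist(a, b) ** 2
                             for a, b in zip(centroids, new)) / k)
        centroids = new
        if move < tol:  # RMS centroid movement below threshold
            break
    return labels, centroids


def max_msd(points, labels, centroids):
    """Largest per-cluster mean squared distance to the centroid."""
    worst = 0.0
    for c, centre in enumerate(centroids):
        members = [p for p, lab in zip(points, labels) if lab == c]
        worst = max(worst, sum(math.dist(p, centre) ** 2 for p in members) / len(members))
    return worst


def cluster_designs(points, ratio_threshold=0.2, seed=0):
    """Increase k from 1 until max-MSD(k) / max-MSD(1) drops below the threshold."""
    rng = random.Random(seed)
    baseline, labels = None, []
    for k in range(1, len(points) + 1):
        labels, centroids = kmeans(points, k, rng)
        msd = max_msd(points, labels, centroids)
        if k == 1:
            baseline = msd
        if baseline == 0 or msd / baseline < ratio_threshold:
            return k, labels
    return len(points), labels


vectors = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0),
           (100.0, 0.0, 0.0), (101.0, 0.0, 0.0), (102.0, 0.0, 0.0)]
k, labels = cluster_designs(vectors)
```

For the two well-separated groups in the usage lines, the ratio falls far below 0.2 at k = 2, so the loop stops there. Because the initial centroids are random, real data may converge to different partitions for different seeds.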

Ranking the Antibody Designs

OptMAVEn-2.0 ranks the designs using their clusters and their antigen-antibody interaction energies, ensuring that the highest-ranked designs are both structurally diverse and predicted to have high affinities.

Progressing from the cluster with the lowest to the cluster with the highest minimum energy, it collects the design with the minimum solvated interaction energy from each cluster and cycles back until all designs have been chosen. In this way, the best design from every cluster is selected first, followed by the second-best design from every cluster, then the third-best, and so forth.
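The round-robin ranking can be illustrated with a short Python sketch. The function name rank_designs and the (name, cluster, energy) tuple layout are assumptions made for this example and are not taken from the OptMAVEn-2.0 code.

```python
def rank_designs(designs):
    """Rank designs by cycling through clusters.

    designs: list of (name, cluster_id, energy) tuples (hypothetical layout),
    where a lower (more negative) energy is better.
    """
    clusters = {}
    for name, cluster_id, energy in designs:
        clusters.setdefault(cluster_id, []).append((energy, name))
    # Sort designs inside each cluster by energy, then order the clusters
    # by the energy of their best member.
    ordered = sorted((sorted(members) for members in clusters.values()),
                     key=lambda members: members[0][0])
    ranked = []
    depth = 0
    while len(ranked) < len(designs):
        # One pass collects the depth-th best design of every cluster.
        for members in ordered:
            if depth < len(members):
                ranked.append(members[depth][1])
        depth += 1
    return ranked


example = [("a", 0, -50.0), ("b", 0, -40.0), ("c", 1, -60.0),
           ("d", 2, -10.0), ("e", 1, -5.0)]
print(rank_designs(example))  # ['c', 'a', 'd', 'e', 'b']
```

In the example, the first three ranks go to the best design of each cluster, so design "d" outranks design "b" despite its weaker energy; this is how the ranking trades some predicted affinity for structural diversity.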

The relaxed structure of each design is output as a PDB and a FASTA file in the directory OptMAVEn-2.0/experiments/name/antigen-antibody-complexes/Result_#.

OptMAVEn-2.0 generates two additional files in the experiment’s directory. Summary.txt gives information about the experiment (e.g. antigen file, epitope). Results.csv lists all designs in descending order by rank and gives, for each, the antigen position, MAPs parts, antibody sequences, cluster number, and MILP, unrelaxed, and relaxed interaction energies.

4.4. Results

We first benchmarked OptMAVEn-2.0 against OptMAVEn with a set of 10 antigens and subsequently used 54 additional antigens to assess the performance of the current algorithm. We then used OptMAVEn-2.0 to design antibody variable fragments against two sets of Zika envelope proteins reported by Wang et al.45 (PDB: 5GZN) and Zhao et al.46 (5KVD, 5KVE, 5KVF, and 5KVG). We ranked our de novo designs along with the native antibody reported for 5GZN; 12 of 77 designs showed enhanced binding relative to the native. MD simulations performed on two out of these 12 designs showed that one design is stably bound to the antigen. Finally, we identified the key stabilizing antigen-antibody interactions in these two designs and the native antibody.

The second set of runs yielded reasonable native sequence recovery: 55% of the chains in the top five de novo designs showed at least 50% identity to the native chains, and 40% showed at least 75% similarity.

Thereafter, we used OptMAVEn-2.0 to design antibodies against three lysozyme structures (1BVK47, 4TSB48, and 4PGJ49), for each of which there exists an experimentally reported humanized antibody that binds to it. We analyzed the native sequence recovery in the five best-binding designs and also counted the native epitope-binding contacts that were reproduced in these designs.

Computational Benchmarking of OptMAVEn and OptMAVEn-2.0 on ten different antigens

OptMAVEn and OptMAVEn-2.0 were each used to design antibodies for a benchmarking set of ten antigens (PDB codes: 1NSN, 2IGF, 2R0W, 2VXQ, 2ZUQ, 3BKY, 3FFD, 3G5V, 3L5W, and 3MLS). These antigens were selected randomly from the 120 antigens used to benchmark OptMAVEn16.

Benchmarking was performed on a Linux InfiniBand cluster. We measured the amount of time taken for the steps of Antigen Positioning (Tpos), MAPs Interaction Energy Calculations (Tener), and Optimal Selection of MAPs Parts (TMILP), as well as the maximum disk usage of the experiment directory (Dmax), for OptMAVEn (Table 8) and OptMAVEn-2.0 (Table 9). Time taken for the k-means clustering step could not be compared because this step is unique to OptMAVEn-2.0. Thus, total CPU time (TCPU) for purposes of comparison was defined as TCPU = Tpos + Tener + TMILP. We also recorded the number of positions that did not clash with the framework antibody (Npos) and the interaction energy (including Generalized Born solvation) of the best antigen-antibody complex after structural relaxation with CHARMM (Emin).

One potential confounding factor was that we used a different antigen binding site for OptMAVEn and OptMAVEn-2.0 during the antigen positioning step. In previous work16, we used 750 antigen-antibody complexes from the Protein Data Bank to identify an antigen binding site of x: [-10 Å, 5 Å], y: [-5 Å, 10 Å], and z: [3.75 Å, 16.25 Å]. This binding site was used for OptMAVEn. During benchmarking of OptMAVEn-2.0, we interchanged the x and y dimensions of the binding site, that is x: [-5 Å, 10 Å], y: [-10 Å, 5 Å]. This change is not likely to have significantly affected TCPU, Dmax, or Npos because it did not change the total number of grid points sampled (3,234). However, this change would have affected Emin if the best design from OptMAVEn-2.0 was not within the original binding site of OptMAVEn (that is, if OptMAVEn could not have created the design). This was the case for only one antigen (2R0W) among the ten tested; thus, we excluded 2R0W from the analysis of Emin. There is no evidence that the difference in antigen binding sites confounded the results of OptMAVEn and OptMAVEn-2.0.


OptMAVEn-2.0 reduces time and disk requirements by 74% and 84%, respectively

OptMAVEn-2.0 ran significantly faster than OptMAVEn in terms of TCPU (mean 74% faster, P < 0.001), Tpos (mean 99.8% faster, P < 0.001), Tener (mean 64% faster, P = 0.006), and TMILP (mean 84% faster, P < 0.001). Additionally, average Dmax was 84% lower for OptMAVEn-2.0 than for OptMAVEn (P < 0.001). These substantial improvements in performance did not compromise design quality: there was no significant difference in Emin between the two programs (P = 0.62) (Table 10). Because all quantities but Emin were ratios between OptMAVEn and OptMAVEn-2.0, we computed their P values using two-tailed ratio t-tests of log10(Q / O), where Q and O are the values for OptMAVEn-2.0 and OptMAVEn, respectively. The P-value for Emin was computed using a standard paired t-test of Q – O. We verified our assumptions of normality using Shapiro-Wilk tests: all P-values were > 0.05.
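As a worked illustration of the two tests, the following Python sketch recomputes the summary statistics for TCPU (ratio t-test) and Emin (paired t-test) from the values in Tables 8 and 9. It is a sketch under stated assumptions, not the analysis script used for the dissertation: the helper one_sample_t is hypothetical, and only the t statistic is computed so that the example needs nothing beyond the standard library. A two-tailed P-value would follow from a Student's t distribution with n − 1 degrees of freedom, for example via scipy.stats.ttest_1samp.

```python
import math
import statistics

# TCPU in hours from Table 8 (OptMAVEn, O) and Table 9 (OptMAVEn-2.0, Q),
# in the same antigen order: 1NSN, 2IGF, 2R0W, 2VXQ, 2ZUQ, 3BKY, 3FFD,
# 3G5V, 3L5W, 3MLS.
tcpu_old = [273.7, 48.4, 40.0, 220.1, 351.4, 93.5, 59.8, 75.9, 227.9, 80.7]
tcpu_new = [24.2, 31.7, 27.4, 37.4, 43.6, 40.6, 13.0, 25.2, 40.2, 21.3]

# Emin in kcal/mol from the same tables, with 2R0W excluded.
emin_old = [-658.7, -76.4, -174.5, -346.0, -216.1, 576.6, -309.9, -281.4, -249.6]
emin_new = [-438.1, -118.5, -235.3, -131.3, -208.4, 92.6, -458.5, -394.0, -171.2]


def one_sample_t(values):
    """Return (mean, sample s.d., t statistic) for H0: population mean = 0."""
    n = len(values)
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return mean, sd, mean / (sd / math.sqrt(n))


# Ratio t-test: one-sample t-test on log10(Q / O).
log_ratios = [math.log10(q / o) for q, o in zip(tcpu_new, tcpu_old)]
mean_log, sd_log, t_ratio = one_sample_t(log_ratios)
mean_ratio = 10 ** mean_log
percent_reduction = 100 * (1 - mean_ratio)

# Paired t-test: one-sample t-test on the differences Q - O.
diffs = [q - o for q, o in zip(emin_new, emin_old)]
mean_diff, sd_diff, t_diff = one_sample_t(diffs)

print(f"TCPU: mean log10 ratio {mean_log:.3f}, mean ratio {mean_ratio:.3f}, "
      f"reduction {percent_reduction:.1f}%, t = {t_ratio:.2f}")
print(f"Emin: mean difference {mean_diff:.1f}, s.d. {sd_diff:.1f}, t = {t_diff:.2f}")
```

The TCPU values reproduce the mean log10 ratio, mean ratio, and % reduction reported in Table 10, and the small t statistic for Emin is consistent with the absence of a significant difference in design quality.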

Table 8. The performance of OptMAVEn on ten antigens for benchmarking. Tpos, Tener, TMILP, and TCPU are in hours; Dmax is in megabytes; Emin is in kcal/mol. *2R0W was excluded from analysis of Emin.

Antigen  Tpos   Tener   TMILP  TCPU    Dmax   Emin      Npos
1NSN     32.7   214.2   26.8   273.7   1004   -658.7    2428
2IGF     2.1    20.0    26.4   48.4    820    -76.4     3023
2R0W     2.0    17.8    20.2   40.0    779    -277.0*   2955
2VXQ     26.1   174.4   19.6   220.1   970    -174.5    2711
2ZUQ     41.6   290.9   18.8   351.4   1094   -346.0    2645
3BKY     5.0    54.8    33.7   93.5    824    -216.1    3035
3FFD     5.3    35.0    19.5   59.8    657    +576.6    2347
3G5V     22.0   33.1    20.8   75.9    808    -309.9    2976
3L5W     29.6   173.9   24.4   227.9   1008   -281.4    2798
3MLS     5.8    53.0    21.9   80.7    809    -249.6    2903


Table 9. The performance of OptMAVEn-2.0 on ten antigens for benchmarking. Tpos, Tener, TMILP, and TCPU are in hours; Dmax is in megabytes; Emin is in kcal/mol. *2R0W was excluded from analysis of Emin.

Antigen  Tpos    Tener  TMILP  TCPU   Dmax    Emin      Npos
1NSN     0.036   22.3   1.8    24.2   142.4   -438.1    442
2IGF     0.009   26.1   5.6    31.7   169.7   -118.5    1374
2R0W     0.010   22.4   4.9    27.4   152.9   -127.9*   1204
2VXQ     0.033   33.7   3.6    37.4   135.4   -235.3    893
2ZUQ     0.046   40.4   3.2    43.6   167.3   -131.3    774
3BKY     0.011   33.9   6.7    40.6   197.4   -208.4    1647
3FFD     0.014   10.9   2.0    13.0   83.8    +92.6     492
3G5V     0.012   21.0   4.2    25.2   137.6   -458.5    1035
3L5W     0.033   36.4   3.8    40.2   144.7   -394.0    910
3MLS     0.009   18.0   3.3    21.3   114.7   -171.2    807

Table 10. Comparison of the performances of OptMAVEn and OptMAVEn-2.0 on ten antigens. Tpos, Tener, TMILP, TCPU, Dmax, and Npos report the log10 of the ratios of the corresponding OptMAVEn-2.0 and OptMAVEn values. Emin reports the difference of the corresponding OptMAVEn-2.0 and OptMAVEn values. The Shapiro-Wilk test shows that every set of values is close to normal (P > 0.05). OptMAVEn-2.0 performed significantly better (P-value < 0.05) in Tpos, Tener, TMILP, TCPU, and Dmax and yielded designs of equivalent Emin (P-value = 0.62). The row mean (ratio) gives, for the quantities reported as log10 ratios, the value of the mean ratio (i.e. 10^mean). The % reduction is 100% × (1 – mean (ratio)). *2R0W was excluded from analysis of Emin.


Antigen        Tpos      Tener     TMILP     TCPU      Dmax      Emin      Npos
1NSN           -2.96     -0.982    -1.162    -1.053    -0.848    +220.6    -0.740
2IGF           -2.35     +0.116    -0.674    -0.184    -0.684    -42.1     -0.342
2R0W           -2.29     +0.102    -0.613    -0.165    -0.707    +149.1*   -0.390
2VXQ           -2.90     -0.714    -0.732    -0.770    -0.855    -60.8     -0.482
2ZUQ           -2.95     -0.857    -0.774    -0.906    -0.815    +214.6    -0.534
3BKY           -2.65     -0.208    -0.700    -0.362    -0.620    +7.7      -0.265
3FFD           -2.58     -0.505    -0.984    -0.663    -0.895    -484.0    -0.679
3G5V           -3.27     -0.198    -0.698    -0.479    -0.769    -148.6    -0.459
3L5W           -2.95     -0.680    -0.806    -0.753    -0.843    -112.6    -0.488
3MLS           -2.80     -0.469    -0.823    -0.578    -0.848    +78.4     -0.556
Shapiro P      6.0E-01   5.8E-01   1.0E-01   8.2E-01   1.8E-01   3.6E-01   9.4E-01
mean           -2.77     -0.440    -0.797    -0.591    -0.788    -36.3     -0.494
s. d.          0.303     0.383     0.164     0.296     0.090     213.2     0.145
P-value        3.5E-10   5.5E-03   9.2E-08   1.4E-04   5.0E-10   6.2E-01   1.9E-06
mean (ratio)   0.002     0.363     0.160     0.256     0.163     n/a       0.321
% reduction    99.8      63.7      84.0      74.4      83.7      n/a       67.9

Test of OptMAVEn-2.0 on 54 additional antigens reveals sub-linear scaling

In order to more fully analyze the relations between the performance metrics, we used OptMAVEn-2.0 to design antibodies for an additional 54 antigens (Table 11) that we selected randomly from the 120 antigens used to benchmark OptMAVEn. We found that Npos correlated with both TCPU (r = 0.663) and Dmax (r = 0.954) more strongly than any other feature of the antigen correlated with these performance metrics. The number of residues (Nres) or atoms (Natom) correlated only weakly with TCPU (r = 0.083, r = 0.075, respectively). Nres and Natom correlated moderately well with Dmax (r = -0.472, r = -0.482, respectively) but, as larger antigens should require larger files, the negative sign was unexpected. Given the strong negative correlation between Npos and Natom (r = -0.650), it seems that larger antigens (measured by Natom) unsurprisingly tend to clash with the framework antibody in a larger number of positions and thus have lower Npos values. Because Npos is also the number of antibodies designed, decreasing Npos reduces the number of files associated with antibody designs, decreasing Dmax. These results show that the computational resource requirements of OptMAVEn-2.0 scale in a sub-linear manner with the size of the antigen, ceteris paribus. Due to this feature, OptMAVEn-2.0 (unlike OptMAVEn) is capable of designing antibodies for very large antigens, e.g. Zika E protein (Natom = 6801).
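The correlations quoted above are Pearson coefficients. As a minimal sketch of the calculation, the following Python code applies a hand-written pearson_r helper (a hypothetical name) to eight rows copied from Table 11. Because only a subset of the 54 antigens is used, the resulting coefficients illustrate the signs of the relationships and are not expected to equal the values reported in the text.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# (antigen, Natom, Npos, Dmax in MB): an illustrative subset of Table 11.
rows = [
    ("1ACY", 156, 1558, 188.3),
    ("1DZB", 1958, 749, 136.3),
    ("1EGJ", 1643, 650, 106.9),
    ("1HH6", 159, 718, 104.6),
    ("1JRH", 1491, 397, 99.6),
    ("1XGY", 85, 1811, 212.8),
    ("1OBE", 195, 417, 77.9),
    ("3P30", 1437, 32, 65.2),
]
natom = [r[1] for r in rows]
npos = [r[2] for r in rows]
dmax = [r[3] for r in rows]

print(f"r(Npos, Dmax)  = {pearson_r(npos, dmax):+.3f}")
print(f"r(Natom, Npos) = {pearson_r(natom, npos):+.3f}")
```

Even on this small subset, Npos tracks Dmax closely while Natom and Npos move in opposite directions, which is the pattern the paragraph above uses to explain the unexpected negative correlation between antigen size and disk usage.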

Table 11. OptMAVEn-2.0 was tested on 54 antigens in addition to those used for benchmarking against OptMAVEn. TCPU is in hours, Dmax is in megabytes, and Emin is in kcal/mol.

Antigen  Nres  Natom  Npos  TCPU  Dmax   Emin
1ACY     10    156    1558  40.9  188.3  -370.6
1CE1     8     93     1694  44.3  200.9  -513.3
1CFT     5     84     1554  38.9  187.9  -253.5
1DZB     129   1958   749   42.2  136.3  -775.8
1EGJ     101   1643   650   34.1  106.9  -618.6
1F90     9     156    1328  35.0  165.8  -377.5
1FPT     11    162    1478  38.4  180.0  -455.6
1HH6     11    159    718   20.8  104.6  -385.5
1I8I     9     142    1480  38.4  179.7  -350.8
1JHL     129   1962   985   53.7  132.3  -766.6
1JRH     95    1491   397   21.9  99.6   -541.4
1KC5     8     119    1299  36.8  162.1  -376.1
1KIQ     129   1968   730   41.4  119.1  -750.1
1MLC     129   1968   618   35.9  111.2  -752.0
1N64     16    241    990   28.1  132.9  -386.6
1NAK     10    166    1192  41.5  154.1  -393.3
1OBE     13    195    417   13.5  77.9   -397.0
1ORS     132   2146   1001  55.7  162.4  -625.5
1PZ5     8     124    1348  34.1  167.4  -419.5
1QNZ     18    301    575   18.5  91.4   -367.3
1SM3     9     126    1354  34.8  167.9  -454.2
1TQB     102   1659   489   26.8  104.1  -534.6
1V7M     145   2258   588   37.5  115.4  -561.0
1XGY     6     85     1811  45.4  212.8  -293.1
1ZA3     91    1346   71    7.5   91.8   -758.7
2A6I     9     136    1093  29.1  141.8  -365.2
2BDN     68    1106   810   35.2  115.1  -740.6
2DQJ     129   1968   590   34.0  111.6  -852.4
2FJH     98    1565   312   18.4  99.7   -528.8
2H1P     11    182    561   17.0  90.4   -355.0
2HH0     9     151    1062  28.6  140.0  -282.7
2HRP     10    177    1013  27.9  135.4  -366.5
2IFF     129   1966   595   33.9  126.7  -594.4
2JEL     85    1293   596   28.1  101.9  -539.5
2OR9     11    181    734   21.1  106.7  -387.8
2QHR     11    185    761   20.3  111.3  -340.2
2R29     97    1553   641   33.2  105.3  -698.4
3AB0     136   1955   380   23.9  107.8  -765.0
3BDY     95    1521   779   36.3  133.9  -439.7
3CVH     8     142    1168  30.7  149.6  -333.7
3D85     133   2074   441   27.9  109.8  -717.0
3E8U     11    136    1481  38.1  180.8  -431.4
3ETB     144   2332   296   21.8  111.3  -898.6
3F58     11    136    1317  34.6  168.5  -322.6
3G6D     106   1667   418   24.2  103.2  -876.8
3GHB     10    146    1341  33.5  166.7  -383.4
3GHE     15    255    773   26.9  112.2  -430.1
3HR5     9     142    1340  38.4  166.5  -478.7
3KS0     92    1443   1148  54.3  148.0  -578.5
3MLX     14    235    621   20.5  94.7   -367.7
3NFP     124   1909   292   19.7  104.5  -771.6
3P30     84    1437   32    4.7   65.2   -714.9
3QG6     6     105    1425  36.1  175.2  -362.4
3RKD     146   2185   776   46.1  124.5  -793.7


Test cases on Zika E protein

We used OptMAVEn-2.0 to design antibodies targeting epitopes of Zika E protein that we identified in the PDB entries 5GZN45, 5KVD46, 5KVE46, 5KVF46, and 5KVG46. While the antibodies in 5GZN are from a human, those in 5KVD, 5KVE, 5KVF, and 5KVG were raised in mice. The reported native antibody in each PDB binds Zika E protein with an affinity in the low nanomolar to low micromolar range.

Setup for the test cases on Zika E protein

We defined an epitope residue as an antigen residue with at least one heavy atom within 4 Å of at least one heavy atom of the antibody. The epitope residues are given in Table 12. Note that if no structures of Zika in complex with an antibody had been available, we could have predicted these epitopes using existing software such as the tools described in Soria-Guerra et al.50. We used the default settings for OptMAVEn-2.0 and defined the antigen binding box with the following bounds: x: [-5 Å, 10 Å], y: [-10 Å, 5 Å], and z: [3.75 Å, 16.25 Å].
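The epitope definition amounts to a simple distance filter. The Python sketch below shows one way to apply it, assuming the atoms have already been parsed from the PDB file into plain tuples; the function name epitope_residues and the tuple layouts are hypothetical and chosen only for this illustration.

```python
import math


def epitope_residues(antigen_atoms, antibody_atoms, cutoff=4.0):
    """Return sorted antigen residue numbers that contact the antibody.

    antigen_atoms:  iterable of (residue_number, element, (x, y, z))
    antibody_atoms: iterable of (element, (x, y, z))
    A residue qualifies if any of its heavy (non-hydrogen) atoms lies
    within `cutoff` angstroms of any heavy atom of the antibody.
    """
    antibody_heavy = [xyz for element, xyz in antibody_atoms if element != "H"]
    epitope = set()
    for residue_number, element, xyz in antigen_atoms:
        if element == "H" or residue_number in epitope:
            continue
        if any(math.dist(xyz, other) <= cutoff for other in antibody_heavy):
            epitope.add(residue_number)
    return sorted(epitope)


# Toy coordinates (not taken from any PDB entry).
antigen = [
    (46, "C", (0.0, 0.0, 0.0)),   # 3.0 A from the antibody oxygen: epitope
    (46, "H", (0.0, 0.0, 1.0)),
    (47, "N", (10.0, 0.0, 0.0)),  # only close to an antibody hydrogen: not epitope
    (52, "H", (0.0, 0.0, 3.5)),   # hydrogen atoms are ignored
]
antibody = [("O", (0.0, 0.0, 3.0)), ("H", (10.0, 0.0, 1.0))]
print(epitope_residues(antigen, antibody))  # [46]
```

The all-pairs loop is adequate for a single antigen-antibody complex; a spatial index would only matter for much larger systems.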

Table 12. The antigen chains and epitope residues of the designs used in the test cases.

PDB   Antibody  Chain  Epitope residues
5GZN  Z3L1      A      46, 47, 52, 136, 138, 140, 156, 158, 159, 166, 168, 276, 277, 278, 279, 280, 281, 283
5KVD  ZV-2      E      301, 315, 316, 317, 318, 319, 320, 321, 322, 327, 329, 362, 364, 365, 366, 367, 372, 373, 374, 375, 377
5KVE  ZV-48     E      307, 340, 342, 343, 344, 347, 348, 350, 351, 352, 353, 354, 355, 358, 384, 386
5KVF  ZV-64     E      307, 340, 342, 343, 344, 347, 348, 350, 351, 352, 353, 354, 355, 358, 391
5KVG  ZV-67     E      309, 310, 311, 312, 313, 314, 331, 332, 333, 334, 335, 336, 337, 368, 370, 371, 393, 394, 395, 396, 397
1BVK  N/A       C      18, 19, 22, 23, 24, 27, 102, 103, 116, 117, 118, 119, 120, 121, 124, 125, 129
4PGJ  N/A       B      35, 46, 47, 48, 49, 52, 57, 59, 61, 62, 63, 70, 107, 108, 109, 110, 112
4TSB  N/A       A      21, 22, 23, 102, 103, 104, 106, 109, 111, 112, 113, 114, 116, 117, 118

Recovery of native residues in test cases of Zika E protein

For the four cases with murine antibody-bound epitopes, we used EMBOSS Needle51 to report the recovery of antigen-binding native interactions along with sequence similarities and identities for the top five designs. Out of the 40 chains (20 heavy and 20 light chains from the top five designs of four cases), 22 (55%) chains were at least 50% identical, and 16 (40%) were at least 75% similar.

Recovery of native L sequences was higher on average than that of H sequences: of the 22 chains that were at least 50% identical, 15 (68%) were L chains; and of the 16 chains that were at least 75% similar, 14 (88%) were L chains. This result likely arises because CDR-H3 is more diverse than CDR-L3.
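For readers unfamiliar with how such percentages arise, the sketch below computes percent identity from a pairwise alignment and re-derives the chain tallies quoted above. It assumes the common convention of dividing the number of identical columns by the full alignment length (gaps included); the function percent_identity is a hypothetical helper, not part of EMBOSS, and percent similarity is omitted because it additionally requires a substitution matrix.

```python
def percent_identity(aligned_a, aligned_b):
    """Percent identity of two gapped, equal-length aligned sequences."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    matches = sum(1 for a, b in zip(aligned_a, aligned_b)
                  if a == b and a != "-")
    return 100.0 * matches / len(aligned_a)


# Toy alignment: 4 identical columns out of 5.
print(percent_identity("ACDEF", "AC-EF"))  # 80.0

# Chain tallies from the text, as fractions of the relevant totals.
total_chains = 40
identical_50, similar_75 = 22, 16
light_among_identical, light_among_similar = 15, 14
print(round(100 * identical_50 / total_chains))           # 55
print(round(100 * similar_75 / total_chains))             # 40
print(round(100 * light_among_identical / identical_50))  # 68
print(round(100 * light_among_similar / similar_75, 1))   # 87.5, quoted as 88
```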

Humanization scores in test cases for Zika E protein

We assessed the HScores of the top 5 designs and compared them to those of the native structures (Table 13). The HScores of the de novo designs were consistently lower than those of the native antibody in all but two cases (5GZN light chain, 5KVF light chain, highlighted). This result is unsurprising because all native antibodies but 5GZN are murine. Even relative to a human antibody (5GZN), the heavy chain HScores for the top 5 designs are consistently lower, which compensates for the relatively larger HScores of the light chains. The HScores suggest that OptMAVEn-2.0 can design antibodies with immunogenicities similar to those of human antibodies, although these predictions await experimental confirmation.

Table 13. Comparison of HScores of the top 5 de novo designs with the HScores of the native antibodies for Zika envelope protein.

Accession  Antibody name  Native heavy  Designed heavy  Native light  Designed light
           (from paper)   chain HScore  chain HScores   chain HScore  chain HScores
5GZN       Z3L1           52            17–36           4             16–41
5KVD       ZV-2           152           6–59            56            0–31
5KVE       ZV-48          128           21–68           59            1–27
5KVF       ZV-64          107           21–44           22            22–30
5KVG       ZV-67          133           10–39           111           10–25

Molecular dynamics simulations

We performed fast MD simulations using the QwikMD29 protocol in VMD on three antibody-antigen complexes for 5GZN: the native complex, the top design (5gzn_R27, with the lowest interaction energy), and the design with the lowest MILP energy, which excludes solvation (5gzn_R0). The QwikMD trajectories were set up for 25 ns each of equilibration and production, with a time step of 2 fs; trajectory snapshots were kept every 1000 steps (2 ps). The simulations were run at 310 K with water as the implicit solvent.

We assessed the long-term stability of each of the three antigen-antibody complexes by calculating, once every 2.5 ns, the RMSD of the antigen with respect to the antigen at the beginning of the production run (i.e. time 0 ns). In order to analyze the stability of the antigen-antibody complex for the de novo designed antibody, we first identified the binding interface residues and tracked their fluctuations during the course of the 25 ns production run. Residues distal to the interface were neglected because unordered loop regions would contribute to larger RMSDs even though the interface might be fairly stable. The antibody residues that are a part of the binding interface were aligned to their starting conformation (at 0 ns) at the end of every 2.5 ns of the 25 ns run. Then the heavy-atom RMSD of the antigen residues within the interface was computed (Figure 26). RMSDs of the native complex and 5gzn_R0 were similar and remained below 6 Å in every frame examined, indicating that these complexes were stable throughout the entire simulations, according to a previous definition of stable binding by Poosarla et al.52. The RMSD of 5gzn_R27 exceeded 6 Å but did not exceed 12 Å, indicating that the antigen remained partially bound52. Figure 27 shows the key electrostatic interactions (polar and salt bridge) seen in the 5gzn_native, 5gzn_R0, and 5gzn_R27 designs.

[Plot: RMSD (Å) versus progress of MD simulation (ns) for 5gzn_native, 5gzn_R0, and 5gzn_R27]

Figure 26. The native and 5gzn_R0 antigens remained stably bound (RMSD < 6 Å), while the 5gzn_R27 antigen remained partially bound (6 Å < RMSD < 12 Å) throughout the MD simulations. Heavy-atom RMSDs of antigen residues within a box at the antigen-antibody interface were computed after aligning the antibody residues within the box. The RMSDs for each complex are relative to the first frame (time 0 ns) of the production run.

[Figure 27 panels: Native, 5gzn_R0, and 5gzn_R27, each showing labeled interface residues and interaction distances in Å. Legend: heavy chain; light chain; antigen residues at the binding interface; * antigen residues not a part of original epitope]

Figure 27. The key interactions at the antigen-antibody interface post-MD simulations for native, 5gzn_R0, and 5gzn_R27. The light and heavy chain residues are shown as magenta and cyan sticks respectively, while antigen residues are depicted as green sticks.

Test cases on hen egg white lysozyme

We identified three epitopes of hen egg white lysozyme from the PDB entries 1BVK47, 4PGJ49, and 4TSB48. The native antibodies in all three structures are human or humanized, though 4PGJ contains only the heavy chain in complex with lysozyme.

Setup for the test cases on lysozyme

We used the same definition of epitope residues as was used for Zika (see Table 12) and the default OptMAVEn-2.0 settings.

Recovery of native residues and contacts in test cases of lysozyme

For each test case, we assessed the recovery of native residues in the sequences of the top five designs. Of the 20 chains designed for 1BVK and 4TSB, 18 (90%) are more than 65% similar and 9 (45%) are more than 75% similar. For 4PGJ, we found lower similarities in the range of 37 – 46%, likely because the native antibody was engineered using phage display with a library of humanized sequences, rather than isolated directly from a human. Excluding 4PGJ, 15 (75%) of the designed chains were at least 50% identical; the lowest identity observed was 40.7%, the highest 85.6%. These results show that OptMAVEn-2.0 can recover high percentages of the residues in native human antibodies. We also report the percentage recovery of native antigen-antibody contacts in these top five designs and their HScores in comparison to those of the native structures (see Table 14).

Table 14. Comparison of HScores of the top 5 de novo designs with the HScores of the native antibodies for lysozyme.

Accession  Native heavy  Designed heavy  Native light  Designed light
           chain HScore  chain HScores   chain HScore  chain HScores
1BVK       85            10–37           57            7–27
4TSB       26            12–32           21            16–38
4PGJ       87            20–49           N/A           5–39


4.5. Summary and Discussion

OptMAVEn, an extension of the OptCDR framework, was the first software capable of designing entire variable domains de novo. However, OptMAVEn requires gigabytes of disk storage and weeks of CPU time, making it computationally intensive to target large antigens. We have developed OptMAVEn-2.0, which designs antibodies of equivalent affinities using significantly reduced disk storage (84% less) and CPU time (74% less). These improvements reduce the time needed to design germline antibodies from over a week to roughly one day and enable OptMAVEn-2.0 to handle large antigens, such as Zika E protein (407 residues)45.

Due to its increased speed, OptMAVEn-2.0 could now be integrated into laboratory-based workflows for designing antibodies. The most common technologies for antibody development in the laboratory are animal immunization and phage display13. Immunization can yield low-affinity antibodies de novo in 1 – 2 weeks53, while phage display can in some cases design high-affinity but non-specific antibodies in under one week and also requires an initial library of antigen binding fragments54. An integrated workflow would take advantage of the high affinities reached by phage display, as well as OptMAVEn-2.0’s speed (typically < 24 hours to design hundreds of variable domains) and abilities to minimize immunogenicity and target a specific epitope. Thus, we believe OptMAVEn-2.0 could enable the rapid design of candidate antibodies for experimental validation using only the antigen structure, unlike all other computational methods to our knowledge16–19,55.

OptMAVEn-2.0 introduces a new clustering step that retains designs with high (unfavorable) interaction energies if they are the best designs among those with similar antigen poses and antibody sequences. Following the generation of germline designs, the designs can be validated with MD simulations (e.g. in QwikMD29). Designs that are likely to bind with high affinity according to the MD simulations can be further optimized using affinity maturation in IPRO30, which increases affinity while lowering immunogenicity.

Despite these promising results, there are several limitations of OptMAVEn-2.0 on which we are currently working. As in OptMAVEn, the MILP step of OptMAVEn-2.0 still uses a simplified energy function that poorly estimates the chemical potential near the binding site; estimates worsen as the number of charged interactions increases. We have partially addressed this limitation by considering solvation when relaxing, clustering, and ranking the designs from the MILP step.

Future work involves improving estimates of chemical potentials by incorporating solvation and entropy terms. Checa et al.56 and Lazaridis and Karplus57 have found that solvation energy contributions to protein-protein interactions are important. Solvation energy calculations could be further augmented by accounting for intramolecular self-solvation terms, as described by Choi et al.58. Additionally, incorporating the conformational entropy of the antigen would capture effects of unordered loops and binding site rotamers which are not held in place by a stable interaction with another residue, thereby providing meaningful insights about antigen-antibody binding biophysics59.

Another limitation of OptMAVEn is that it does not explicitly consider the stability of the antibody itself. Antibodies are complex molecules and are prone to failure in multiple ways60. Aggregation of antibodies is a particular problem: when antibodies aggregate, they not only lose their ability to bind to the target ligand but also increase the risk of becoming immunogenic22, even for fully human antibodies60. Several methods have been developed to predict (e.g. Spatial Aggregation Propensity22) or remove (e.g. Rosetta Supercharge23,24) aggregation-prone regions of antibodies. Potentially, these tools or similar methods could be incorporated into the affinity maturation step of future versions of OptMAVEn. These methods would ensure that the aggregation risk did not increase during affinity maturation, just as the current implementation imposes a similar constraint on the HScore. Antibodies may also degrade chemically, such as through separation of the chains, oxidation, hydrolysis, or deamidation60. Future versions of OptMAVEn could include measures to reduce the risk of such degradation, thereby increasing shelf life or the tolerance of antibodies to a variety of conditions.

Currently, OptMAVEn-2.0 runs on the ICS-ACI cluster at Pennsylvania State University. In order to make it available to everyone without a CHARMM license, we plan to implement a web server on which users may submit jobs to be run. Like the command-line OptMAVEn-2.0 interface, this web server will prompt users for a structure file upload (or a PDB ID), the chain(s) in the antigen, and the epitope residues, as well as provide options to customize the settings of OptMAVEn-2.0.

4.6. References

1. Ecker, D. M., Jones, S. D. & Levine, H. L. The therapeutic monoclonal antibody market. mAbs 7, 9–14 (2015).

2. Mahmuda, A. et al. Monoclonal antibodies: A review of therapeutic applications and future prospects. Trop. J. Pharm. Res. 16, 713–713 (2017).

3. Shepard, H. M., Phillips, G. L., Thanos, C. D. & Feldmann, M. Developments in therapy with monoclonal antibodies and related proteins. Clin. Med. J. R. Coll. Physicians London 17, 220–232 (2017).

4. Schirrmann, T., Meyer, T., Schütte, M., Frenzel, A. & Hust, M. Phage display for the generation of antibodies for proteome research, diagnostics and therapy. Molecules 16, 412–426 (2011).

5. Byrne, H., Conroy, P. J., Whisstock, J. C. & O’Kennedy, R. J. A tale of two specificities: Bispecific antibodies for therapeutic and diagnostic applications. Trends in Biotechnology 31, 621–632 (2013).

6. Weiner, G. J. Building better monoclonal antibody-based therapeutics. Nature Reviews Cancer 15, 361–370 (2015).

7. Weiner, L. M., Surana, R. & Wang, S. Monoclonal antibodies: Versatile platforms for immunotherapy. Nature Reviews Immunology 10, 317–327 (2010).

8. Rudnick, S. I. & Adams, G. P. Affinity and Avidity in Antibody-Based Tumor Targeting. Cancer Biother. Radiopharm. 24, 155–161 (2009).

9. Irani, V. et al. Molecular properties of human IgG subclasses and their implications for designing therapeutic monoclonal antibodies against infectious diseases. Molecular Immunology 67, 171–182 (2015).

10. Simpson, E. L. et al. Two Phase 3 Trials of Dupilumab versus Placebo in Atopic Dermatitis. N. Engl. J. Med. 375, 2335–2348 (2016).

11. Saper, C. B. A guide to the perplexed on the specificity of antibodies. Journal of Histochemistry and Cytochemistry 57, 1–5 (2009).

12. Shriver, Z., Trevejo, J. M. & Sasisekharan, R. Antibody-based strategies to prevent and treat influenza. Frontiers in Immunology (2015). doi:10.3389/fimmu.2015.00315

13. Saeed, A. F. U. H., Wang, R., Ling, S. & Wang, S. Antibody engineering for pursuing a healthier future. Frontiers in Microbiology 8, (2017).

14. Boder, E. T., Raeeszadeh-Sarmazdeh, M. & Price, J. V. Engineering antibodies by yeast display. Archives of Biochemistry and Biophysics 526, 99–106 (2012).


15. Leenaars, M. & Hendriksen, C. F. M. Critical steps in the production of polyclonal and monoclonal antibodies: Evaluation and recommendations. ILAR J. 46, 269–279 (2005).

16. Li, T., Pantazes, R. J. & Maranas, C. D. OptMAVEn - A new framework for the de novo design of antibody variable region models targeting specific antigen epitopes. PLoS One 9, (2014).

17. Pantazes, R. J. & Maranas, C. D. OptCDR: A general computational method for the design of antibody complementarity determining regions for targeted epitope binding. Protein Eng. Des. Sel. 23, 849–858 (2010).

18. Lapidoth, G. D. et al. AbDesign: An algorithm for combinatorial backbone design guided by natural conformations and sequences. Proteins Struct. Funct. Bioinforma. 83, 1385–1406 (2015).

19. Adolf-Bryfogle, J. et al. Rosetta Antibody Design (RAbD): A General Framework for Computational Antibody Design. bioRxiv 183350 (2017). doi:10.1101/183350

20. Lazar, G. A., Desjarlais, J. R., Jacinto, J., Karki, S. & Hammond, P. W. A molecular immunology approach to antibody humanization and functional optimization. Mol. Immunol. 44, 1996–2008 (2007).

21. De Groot, A. S., McMurry, J. & Moise, L. Prediction of immunogenicity: in silico paradigms, ex vivo and in vivo correlates. Current Opinion in Pharmacology 8, 620–626 (2008).

22. Chennamsetty, N., Voynov, V., Kayser, V., Helk, B. & Trout, B. L. Design of therapeutic proteins with enhanced stability. Proc. Natl. Acad. Sci. 106, 11937–11942 (2009).

23. Miklos, A. E. et al. Structure-based design of supercharged, highly thermoresistant antibodies. Chem. Biol. 19, 449–455 (2012).

24. Der, B. S. et al. Alternative Computational Protocols for Supercharging Protein Surfaces for Reversible Unfolding and Retention of Stability. PLoS One 8, (2013).

25. Pantazes, R. J. & Maranas, C. D. MAPs: a database of modular antibody parts for predicting tertiary structures and designing affinity matured antibodies. BMC Bioinformatics 14, 168 (2013).

26. Stojmirovic, A. Quasi-metric spaces with measure. Topol. Proc. 28, 655–671 (2004).

27. Moré, J. J. & Wu, Z. Distance Geometry Optimization for Protein Structures. J. Glob. Optim. 15, 219–234 (1999).

28. Stojmirovic, A. & Yu, Y. Information channels in protein interaction networks. arXiv Prepr. arXiv0901.0287 (2009).

29. Ribeiro, J. V. et al. QwikMD - Integrative Molecular Dynamics Toolkit for Novices and Experts. Sci. Rep. 6, (2016).

30. Pantazes, R. J., Grisewood, M. J., Li, T., Gifford, N. P. & Maranas, C. D. The Iterative Protein Redesign and Optimization (IPRO) suite of programs. J. Comput. Chem. 36, 251–263 (2015).

31. Python Software Foundation. Python Language Reference, version 2.7. Python Softw. Found. (2013). doi:https://www.python.org/

32. Community, N. NumPy Reference. October (2011). doi:citeulike-article-id:11894772

33. Oliphant, T. E. SciPy: Open source scientific tools for Python. Comput. Sci. Eng. (2007). doi:10.1109/MCSE.2007.58

34. Cock, P. J. A. et al. Biopython: Freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 25, 1422–1423 (2009).

35. Brooks, B. R. et al. CHARMM: The biomolecular simulation program. J. Comput. Chem.

30, 1545–1614 (2009).

159

36. Humphrey, W., Dalke, A. & Schulten, K. VMD: Visual molecular dynamics. J. Mol.

Graph. 14, 33–38 (1996).

37. Phillips, J. C. et al. Scalable molecular dynamics with NAMD. Journal of Computational

Chemistry 26, 1781–1802 (2005).

38. Lefranc, M.-P. IMGT Unique Numbering for the Variable (V), Constant (C), and Groove (G) Domains of IG, TR, MH, IgSF, and MhSF. Cold Spring Harb. Protoc. 2011, 633–642 (2011).

39. Henikoff, S. & Henikoff, J. G. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci. 89, 10915–10919 (1992).

40. Havel, T., Kuntz, I. & Crippen, G. The theory and practice of distance geometry. Bull. Math. Biol. 45, 665–720 (1983).

41. Mucherino, A., Liberti, L. & Lavor, C. MD-jeep: An implementation of a Branch and Prune algorithm for distance geometry problems. in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) 6327 LNCS, 186–197 (2010).

42. Schwieters, C. D., Kuszewski, J. J., Tjandra, N. & Clore, G. M. The Xplor-NIH NMR molecular structure determination package. J. Magn. Reson. 160, 65–73 (2003).

43. Pappu, R. V., Hart, R. K. & Ponder, J. W. Tinker: a package for molecular dynamics simulation. J. Phys. Chem. B 102, 9725–9742 (1998).

44. Marimont, R. B. & Shapiro, M. B. Nearest neighbour searches and the curse of dimensionality. IMA J. Appl. Math. (Institute Math. Its Appl.) 24, 59–70 (1979).

45. Wang, Q. et al. Molecular determinants of human neutralizing antibodies isolated from a patient infected with Zika virus. Sci. Transl. Med. 8, 369ra179-369ra179 (2016).

46. Zhao, H. et al. Structural Basis of Zika Virus-Specific Antibody Protection. Cell (2016). doi:10.1016/j.cell.2016.07.020

47. Holmes, M. A., Buss, T. N. & Foote, J. Conformational correction mechanisms aiding antigen recognition by a humanized antibody. J. Exp. Med. (1998). doi:9463398

48. Wensley, B. Structure of a lysozyme antibody complex. TO BE Publ. doi:10.2210/PDB4TSB/PDB

49. Rouet, R., Dudgeon, K., Christie, M., Langley, D. & Christ, D. Fully human VH single domains that rival the stability and cleft recognition of camelid antibodies. J. Biol. Chem. (2015). doi:10.1074/jbc.M114.614842

50. Soria-Guerra, R. E., Nieto-Gomez, R., Govea-Alonso, D. O. & Rosales-Mendoza, S. An overview of bioinformatics tools for epitope prediction: Implications on vaccine development. J. Biomed. Inform. 53, 405–414 (2015).

51. Rice, P., Longden, L. & Bleasby, A. EMBOSS: The European Molecular Biology Open Software Suite. Trends Genet. (2000). doi:10.1016/S0168-9525(00)02024-2

52. Poosarla, V. G. et al. Computational de novo design of antibodies binding to a peptide with high affinity. Biotechnol. Bioeng. 114, 1331–1342 (2017).

53. Foote, J. & Eisen, H. N. Kinetic and affinity limits on antibodies produced during immune responses. Proc. Natl. Acad. Sci. (1995). doi:10.1073/pnas.92.5.1254

54. Fellouse, F. A. et al. High-throughput Generation of Synthetic Antibodies from Highly Functional Minimalist Phage-displayed Libraries. J. Mol. Biol. (2007). doi:10.1016/j.jmb.2007.08.005

55. Entzminger, K. C. et al. De novo design of antibody complementarity determining regions binding a FLAG tetra-peptide. Sci. Rep. 7, (2017).

56. Checa, A., Ortiz, A. R., De Pascual-Teresa, B. & Gago, F. Assessment of solvation effects on calculated binding affinity differences: Trypsin inhibition by flavonoids as a model system for congeneric series. J. Med. Chem. 40, 4136–4145 (1997).

57. Lazaridis, T. & Karplus, M. Effective energy function for proteins in solution. Proteins 35, 133–152 (1999).

58. Choi, H., Kang, H. & Park, H. New solvation free energy function comprising intermolecular solvation and intramolecular self-solvation terms. J. Cheminform. 5, (2013).

59. Duan, L., Liu, X. & Zhang, J. Z. H. Interaction Entropy: A New Paradigm for Highly Efficient and Reliable Computation of Protein-Ligand Binding Free Energy. J. Am. Chem. Soc. 138, 5722–5728 (2016).

60. Kumar, S., Singh, S. K., Wang, X., Rup, B. & Gill, D. Coupling of aggregation and immunogenicity in biotherapeutics: T- and B-cell immune epitopes may contain aggregation-prone regions. Pharm. Res. 28, 949–961 (2011).

Chapter 5

SYNOPSIS

In this thesis we have used in conjunction with molecular mechanics calculations to devise protein design tools that enable the redesign of three classes of proteins: channels, enzymes, and antibody variable fragments. These tools identify combinations of amino acid substitutions at specific loci along the polypeptide backbone, as well as additions or removals of amino acids. The algorithms predict these changes automatically so as to satisfy multiple biochemical objectives. For example, PoreDesigner can be used to precisely tune the pore size and inner pore-wall chemistry to enable targeted solute separations at the angstrom scale using biomimetic membrane-protein assemblies. This has applications not only in membrane separations but also in DNA-sequencing techniques.

IPRO+/-, on the other hand, is aimed at enzyme redesign for altered substrate specificity, with two major applications: (a) engineering thioesterases to catalyze C8-ACP to C8-acid conversion, which is relevant to the biofuel industry, and (b) engineering lignin biosynthesis enzymes to alter the S/G ratio in lignin and thereby ease its digestibility. Finally, OptMAVEn-2.0 is a rapid workflow for generating humanized libraries of highly specific antibodies that bind to a given antigen protein. It has been tested successfully on the α-Syn peptide implicated in Parkinson’s disease.

Moving forward, these algorithms can be improved by incorporating additional design constraints learned from publicly available proteins of the same family. To this end, convolutional neural networks would be extremely useful for extracting features from published crystal structures beyond their sequence and 3D atomic coordinates (such as alpha-helical content, loop flexibility, and disulfide bridges).

VITA

EDUCATION

Ph.D. in Chemical Engineering, Minor in Computational Sciences
August 2013 – December 2020
The Pennsylvania State University, University Park, PA

Bachelors in Chemical Engineering (B.E)
July 2009 – July 2013
Jadavpur University, Kolkata, India

EXPERIENCE

Research Assistant, Chemical Engineering
August 2013 – December 2019
The Pennsylvania State University, University Park, PA

Undergraduate Research Fellow, Biophysics
June 2012 – August 2012
National University of Singapore, Singapore

PUBLICATIONS (SELECTED)

Chowdhury, R., Ren, T., Shankla, M., Decker, K., Grisewood, M., Prabhakar, J., Baker, C., Golbeck, J.H., Aksimentiev, A., Kumar, M. and Maranas, C.D., 2018. PoreDesigner for tuning solute selectivity in a robust and highly permeable outer membrane pore. Nature Communications, 9(1), pp.1-10.

Song, W., Joshi, H., Chowdhury, R., Najem, J.S., Shen, Y.X., Lang, C., Henderson, C.B., Tu, Y.M., Farell, M., Pitz, M.E. and Maranas, C.D., 2020. Artificial water channels enable fast and selective water permeation through water-wire networks. Nature Nanotechnology, 15(1), pp.73-79.

Chowdhury, R., Chowdhury, A. and Maranas, C.D., 2015. Using gene essentiality and synthetic lethality information to correct yeast and CHO cell genome-scale models. Metabolites, 5(4), pp.536-570.

Tiller, K.E., Chowdhury, R., Li, T., Ludwig, S.D., Sen, S., Maranas, C.D. and Tessier, P.M., 2017. Facile affinity maturation of antibody variable domains using natural diversity mutagenesis. Frontiers in Immunology, 8, p.986.

Chowdhury, R., Allan, M.F. and Maranas, C.D., 2018. OptMAVEn-2.0: de novo design of variable antibody regions against targeted antigen epitopes. Antibodies, 7(3), p.23.

Hernández Lozada, N.J., Lai, R.Y., Simmons, T.R., Thomas, K.A., Chowdhury, R., Maranas, C.D. and Pfleger, B.F., 2018. Highly active C8-acyl-ACP thioesterase variant isolated by a synthetic selection strategy. ACS Synthetic Biology, 7(9), pp.2205-2215.

GOOGLE SCHOLAR

Ratul Chowdhury Publications List: LINK