
Building maps from genetic sequences to biological function

A dissertation presented

by

Adam Joseph Riesselman

to

The Division of Medical Sciences

in partial fulfillment of the requirements

for the degree of

Doctor of Philosophy

in the subject of

Biomedical Informatics

Harvard University

Cambridge, Massachusetts

December 2018

© 2018 Adam Joseph Riesselman. All rights reserved.

Dissertation Advisor: Debora S. Marks

Adam Joseph Riesselman

Building maps from genetic sequences to biological function

Abstract

Predicting how changes to the genetic code will alter the characteristics of an organism is a fundamental question in biology and genetics. Typically, measurements of the true functional landscape relating genotype to phenotype are noisy and costly to obtain. Though high-throughput DNA sequencing and synthesis can shed light on biological constraints in organisms, inferring relationships from these high-dimensional, multi-scale data to make predictions about new biological sequences is a formidable task. Here, I aim to build algorithms that map genetic sequences to biological function. In Chapter 1, I examine how deep latent variable models of evolutionary sequences can predict the effects of mutations in an unsupervised manner. In Chapter 2, I discuss how deep autoregressive models can be applied to genetic data for variant effect prediction and the synthesis of a diverse synthetic nanobody library. In Chapter 3, I explore how sparse Bayesian logistic regression can efficiently summarize laboratory affinity maturation experiments to improve nanobody binding affinity. In Chapter 4, I show how to integrate genetic, proteomic, and metabolomic data to optimize thiamine biosynthesis in E. coli. In Chapter 5, I propose future research directions, including extensions to both the analytical methods and biological systems discussed. These results show that probabilistic algorithms of genetic sequence data can both explain phenotypic variation and be used to design proteins and organisms with improved properties.

Table of Contents

Acknowledgements

Introduction
    Finding patterns in genetic data
    On designing useful statistical models
    Final remarks
    References

Chapter 1. Deep generative models of genetic variation capture the effects of mutations
    Abstract, Introduction, Results, Discussion, Methods, References

Chapter 2. Predicting the effects of mutations with alignment-free generative models of protein sequences
    Abstract, Introduction, Results, Discussion, Methods, References

Chapter 3. Refining nanobody affinity with algorithmically-guided directed evolution
    Abstract, Introduction, Results, Discussion, Methods, References

Chapter 4. Accurate prediction of the thiamine biosynthetic landscape using Gaussian process learning
    Abstract, Introduction, Results, Discussion, Methods, References

Chapter 5. Conclusions
    Specific Conclusions, General Conclusions, References

Appendix: Supplementary Figures and Tables

Acknowledgements

I came into graduate school with the grandiose idea of working in the lab and guiding my experiments with new algorithms to engineer organisms—I wanted to make sure the code we wrote made something tangible. After stumbling around for the first year or so, I realized that doing everything on my own was not the way to go. Science is made so much richer with collaborations and the exchange of new ideas and disciplines, and I am so happy and proud to have worked with many talented people over my career.

First and foremost, I would like to thank my advisor Debora Marks for truly caring about me, her students, and the science. She has taught me to “know thy data” and has provided me endless opportunities for my career. Most importantly, Debbie trained me to think like a scientist by making sure we always knew what we were looking for: “What is the question?” With her help and guidance, I’ve gotten everything I wanted from graduate school and more.

I would also like to thank my collaborators that have made this work possible. I cannot thank John Ingraham enough for working with me through our papers and always patiently helping me understand the math or answer any questions I had. I would also like to thank Sam Deutsch for being my mentor at the Joint Genome Institute, and Hans Genee, Morten Sommer, and everyone else on the thiamine project for being great collaborators working on exciting, impactful research. I would like to thank Andrew Kruse for being a wonderful, friendly, insightful collaborator with the nanobody projects. I would also like to thank Aashish Manglik, Conor McMahon, Ishan Deshpande, and Jiahao Liang for their work as well.

I am so fortunate to have joined the Marks lab and to have overlapped with so many amazing members over the years. I thank Anna Green for talking about science, dealing with stress, and appreciating my “avant garde” lab humor. A big thanks to all the other lab members: Aaron Kollasch, Agnes Toth-Petroczy, Benni Schubert, Caleb Weinreb, Charlotta Scharfe, Christian Dallago, David Ding, Elana Simon, Eli Weinstein, Hailey Cambra, Jonny Frazer, June Shin, Kelly Brock, Mafalda Dias, Perry Palmedo, Rohan Maddamsetti, Tessa Green, and Thomas Hopf.

I would also like to thank Chris Sander for being a constant source of support and help over the years. I’d also like to thank Frank Poelwijk and Nick Gauthier in his lab for their kindness and advice as well.

I would like to thank the Bioinformatics and Integrative Genomics PhD program for their support and guidance, including Peter Park, Isaac Kohane, Marissa DiSarno, Katherine Flannery, and Cathy Haskell.

A big thank you to the systems biology admin as well, including Barb Grant for keeping a roof over our heads, Jennie Epp for dealing with any lab emergency with unimaginable patience and grace, and Kathy Buhl for her perpetual kindness. I would also like to thank the Harvard Research Computing staff for their tireless support of my computational demands.

I would like to thank my PQE, DAC, and defense committee members for helping me make progress in my graduate school career: Shamil Sunyaev, Martha Bulyk, George Church, Sasha Rush, Stirling Churchman, Tim Mitchison, and Gene-Wei Li. I would also like to thank my previous rotation lab mentors: Jen Sheen, Pam Silver, and Jeff Way.

I am indebted to the Department of Energy Computational Science Graduate Fellowship for funding my graduate school education, providing me opportunities to network and expand my research, and forcing me to take a challenging yet thorough course load. Thank you to the Krell Institute as well for their help, including Lindsey Eilts, Michelle King, Lisa Frerichs, and Thomas O’Donnell. I would like to thank the Joint Genome Institute for hosting me in the summer of 2016, as well as Max, Joy, Jordan, and Jay, for making my time in Berkeley much more fun.

I would also like to thank all of my educators that have made me into the scientist I am today, including Kathleen Synder, Jim Blankman, Bram Govaerts, Paul Scott, Sona Pandey, Gholam Mirafzal, Jerry Honts, Timothy Urness, Eric Manley, Ajith Anand, Maren Roe, Bill Gordon-Kamm, Keith Lowe, and Todd Jones. I am also indebted to Norman Borlaug and the World Food Prize for being the catalyst to my career. In particular, I cannot thank Kenneth Quinn, Lisa Fleming, and Keegan Kautzky enough for their tireless support and for being fantastic role models.

I would like to thank all of the friends I’ve made in graduate school in Boston—you’ve all made this journey much more fun than going it alone: Katie Wu, Rebecca Fine, Brittany Mayweather, Thao Truong, Dan Foster, Marina Watanabe, Sindhu Carmen, Sisi Sarkizova, Rohit Garg, Kayla Davis, Max Schubert, Gleb Kuznetsov, Surge Biswas, Pierce Ogden, Isaac Plant, Marc Pressler, Evi Van Itallie, Hattie Chung, Steph Hayes, Brendan Colon, Ben Vincent, Sam Wolock, Armin Schoch, and Chuck Noble. I’d also like to thank all my friends from home for their constant support, especially, Matt Dalton, Domenic Lamberti, Drew Gibson, and Erin Austin.

I would like to thank my partner Chris Bandoro for his endless love, kindness, and patience with me and for always being willing to go on an adventure. I am extremely lucky to have a loving family that has always supported me. Thank you to all my grandparents (Walt, Lucille, Vic, Marge), aunts (Kristi, Carolyn, Dolores, Connie, Kathy), uncles (Jerome, Steve, Dave, Bob, Marv, Duane), and cousins for believing in me and supporting me through everything. Thank you to Nicole, Jon, Kevin, and Blair, and my nephews Matthew and Vincent, for their love and support. Finally, this was all made possible by my amazing parents, Mark and Kim. They taught me the value of hard work, the importance of education, and the virtues of patience and kindness for everyone. They always fostered my interest in science and have been an incredible support system. I love you all so much.

Introduction

“Do what you can, with what you have, where you are.”

--Theodore Roosevelt

DNA is the primary determinant of the phenotype of organisms. Millions of years of random mutations and competition to survive under a variety of conditions are the driving force behind the staggering diversity of life on earth. This process can be mimicked in the laboratory on a much shorter timeframe through directed evolution1,2. However, the constraints that determine the consequences of mutations in DNA are difficult and expensive to ascertain. Here I will introduce computational methods to characterize biological sequences that have undergone diversification and selection. These mathematical models can be used to understand functional constraints as well as engineer organisms with new and improved properties.

Finding patterns in genetic data

Modern society has become increasingly reliant on our ability to understand genetic variation. Crops and livestock have been selected over millennia to maintain high yields while being robust to a wide variety of conditions3. Precision medicine has allowed doctors to pinpoint the cause of disease and prescribe treatments accordingly4. Scientists have been able to engineer industrial biomolecules, lower the cost of medications, and tailor therapeutics5,6. Building links between changes in DNA and the outcome in vitro or in vivo is critical to improving these endeavors.

Forward genetics is the traditional approach to elucidating the effects of mutations. After observing a particularly interesting phenotype, the causative mutation is winnowed down through a process of breeding, screening, and mapping to pinpoint the changes responsible for the phenotype of interest7. Given enough time, patience, and resources, identifying the mutation can be completed with relatively little technology. High-throughput sequencing has accelerated this process by providing fine-grained resolution of reference genomes, but mapping novel variants or sequencing new genomes can be slow and expensive. Moreover, extrapolating observed phenotypes from one organism to another is nontrivial and requires homology-based approaches.

Alternatively, reverse genetics maps genetic variants to function by creating DNA sequences of interest and experimentally determining their phenotype7. Boosted by advances in DNA synthesis technology, thousands of mutations can be studied in a single experiment8. These approaches are much more generalizable, as diverse genetic components from recalcitrant organisms, including regulatory elements and biosynthetic clusters, can be transferred to more well-behaved systems in the lab. When approached from an engineering perspective, these techniques can be used to create DNA sequences with novel, valuable properties5.

This approach, most recently employed by deep mutational scanning9, has been tremendously successful in characterizing genetic systems. However, both technical and biological issues complicate brute-force reverse genetic approaches.

First, previously characterized genetic parts do not perform predictably with one another. For example, epistasis between residues in a biological sequence confounds assumptions regarding the additive effects of mutations10,11, and nonlinear dynamics may govern the metabolism of a system of interest12, leading to unpredictable phenotypic outcomes. Exhaustively testing all combinations of mutations is infeasible. Second, if more than one change is made to the system at once, the true cause of variation must be deconvoluted13. Third, biological measurements are noisy, expensive to obtain, and may be confounded with batch effects. These measurements, including DNA modifications and RNA, protein, and metabolite expression levels, are also typically high-dimensional and arise from hierarchical factors of variation. Finally, direct measurements of a biological system may not be available at all. Namely, large genetic sequence databases contain only the result of long-term evolution, not information on the functional constraints themselves13-16.

On designing useful statistical models

Testing all possible hypotheses to understand the function of a DNA sequence in the lab is combinatorially intractable. Alternatively, measurements from biological systems can be treated as an inverse problem17. Namely, computational biologists can build models to describe the causal factors that created the data, fit them to available examples, and use the parameters of the model to understand the biological underpinnings of the system. In this thesis, I utilize both discriminative and generative models to understand and make predictions for genetic systems.

Discriminative models are utilized when trying to predict a measured value from some known variables. Numerous algorithms can parameterize this form. The simplest is regression, which is a powerhouse of genetics and, more broadly, of data analysis. In a generic regression model, terms are additive and independent from one another. These robust models are both predictive and readily interpretable. Numerous extensions of these models are utilized in this thesis, including sparse priors for variable selection, terms to uncover interactions between variables, and tree-based structures to interpret nonlinear interactions in the data. Gaussian processes, which are nonparametric function approximators, can even be framed as Bayesian linear regression in which we integrate out the weights18 (see the sketch below). Though Gaussian process models are difficult to interpret, they are resistant to overfitting and can be fit to limited data. These regression-based models provide the basis for supervised learning in this thesis.
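To make this connection concrete, the following is a minimal numpy sketch (not code from this thesis; all names and values are illustrative) of Bayesian linear regression with the weights integrated out, which yields Gaussian process regression with a linear kernel18:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                 # 20 samples, 3 features
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.normal(size=20)

sigma_w2, sigma_n2 = 1.0, 0.01               # prior weight variance, noise variance

# Integrating out w ~ N(0, sigma_w2 * I) leaves y ~ N(0, K + sigma_n2 * I),
# a Gaussian process with the linear kernel K = sigma_w2 * X X^T.
K = sigma_w2 * (X @ X.T)

# Posterior predictive mean at new inputs, written purely in kernel form.
X_new = rng.normal(size=(5, 3))
K_star = sigma_w2 * (X_new @ X.T)
mean_pred = K_star @ np.linalg.solve(K + sigma_n2 * np.eye(len(y)), y)
```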

Generative models are particularly useful in an unsupervised machine learning setting, where a probability distribution is fit to a set of data. In this framework, we learn a parameterization with which we can generate new examples of the data by understanding the constraints that make any given datum probable. This approach has been particularly successful in analyzing biological sequence data from public databases because the sequences are very heterogeneous, have high information content, and do not have any matching measurements.

These models can be used to find other related members of a sequence family in an unsupervised manner, understand phylogeny and relatedness of sequences16, find epistatic constraints13,14, and predict the effects of mutations19. In this thesis, I parameterize generative models with deep neural networks to improve mutation effect prediction and extend the scope of sequences these models can handle.

Traditionally, deriving complicated mathematical models of high-dimensional data was slow and laborious. However, advances in automatic differentiation20 and variational inference21 have made building complex models that run on high-performance computing resources fast and simple. Using these tools and techniques, the causes of variation can be teased out of relatively few data samples, and the resulting model can be used to predict the effects of mutations in new sequences.

Final remarks

I do not expect biology to be found solely in silico: experimentation will always be the tried and true method for determining the biological function of a sequence. In this thesis, I aim to formulate biological questions regarding sparsely-characterized biological systems and design robust, interpretable, generalizable probabilistic algorithms to map genotype to phenotype.

References

1 Packer, M. S. & Liu, D. R. Methods for the directed evolution of proteins. Nature Reviews Genetics 16, 379 (2015).

2 Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology 10, 866 (2009).

3 Allard, R. W. Principles of plant breeding. (John Wiley & Sons, 1999).

4 Gonzaga-Jauregui, C., Lupski, J. R. & Gibbs, R. A. Human genome sequencing in health and disease. Annual Review of Medicine 63, 35-61 (2012).

5 Purnick, P. E. & Weiss, R. The second wave of synthetic biology: from modules to systems. Nature Reviews Molecular Cell Biology 10, 410 (2009).

6 Benner, S. A. & Sismour, A. M. Synthetic biology. Nature Reviews Genetics 6, 533 (2005).

7 Griffiths, A. J. et al. An introduction to genetic analysis. (Macmillan, 2005).

8 Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nature Methods 11, 499 (2014).

9 Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nature Methods 11, 801 (2014).

10 Phillips, P. C. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9, 855 (2008).

11 Poelwijk, F. J., Kiviet, D. J., Weinreich, D. M. & Tans, S. J. Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445, 383 (2007).

12 Mendes, P. & Kell, D. Non-linear optimization of biochemical pathways: applications to metabolic engineering and parameter estimation. Bioinformatics 14, 869-883 (1998).

13 Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).

14 Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108, E1293-E1301 (2011).

15 Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. arXiv preprint arXiv:1207.2484 (2012).

16 Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. (Cambridge University Press, 1998).

17 Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Computational Biology 11, e1004182 (2015).

18 Rasmussen, C. E. Gaussian processes in machine learning. In Advanced Lectures on Machine Learning, 63-71 (Springer, 2004).

19 Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nature Biotechnology 35, 128 (2017).

20 LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).

21 Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. An introduction to variational methods for graphical models. Machine Learning 37, 183-233 (1999).

Chapter 1

Deep generative models of genetic variation capture the effects of mutations

Adam J. Riesselman§1,2, John B. Ingraham§1,3, Debora S. Marks1

1 Department of Systems Biology, Harvard Medical School
2 Program in Biomedical Informatics, Harvard Medical School
3 Program in Systems Biology, Harvard University
§ Equal contribution

A.J.R., J.B.I., and D.S.M. designed the study. A.J.R. and J.B.I. performed the computation. A.J.R., J.B.I., and D.S.M. wrote the paper.

Abstract

The functions of proteins and RNAs are defined by the collective interactions of many residues, and yet most statistical models of biological sequences consider sites near-independently. Recent approaches have demonstrated benefits of including interactions to capture pairwise covariation, but leave higher-order dependencies out of reach. Here, we show how it is possible to capture higher-order, context-dependent constraints in biological sequences via latent variable models with nonlinear dependencies. We present a new probabilistic model for sequence families, DeepSequence, which can predict the effects of mutations across a variety of deep mutational scanning experiments significantly better than existing methods that are based on the same evolutionary data. The model, learned in an unsupervised manner solely from sequence information, is grounded with biologically motivated priors, reveals latent organization of sequence families, and can be used to extrapolate to new parts of sequence space.

Introduction

A major unmet challenge in biological research, clinical medicine and biotechnology is how we should decipher and exploit the effects of mutations on biomolecules. From interpreting which genetic variants in humans underlie disease, to developing modified proteins that have useful properties, to synthesizing large molecular libraries that are enriched with functional sequences, there is a need to be able to rapidly assess whether a given mutation to a protein or RNA will disrupt its function1, 2. Although high-throughput technologies can now simultaneously assess the effects of thousands of mutations in parallel3-27, sequence space is exponentially large and experiments are resource-intensive. Accurate computational methods are thus an important component for high-throughput sequence annotation and design.

Most improvements to computational predictions of mutation effects have been driven by leveraging the signal of evolutionary conservation among homologous sequences28-33. Historically, these tools analyze the conservation of single sites in a protein in a background-independent manner. Recent work has demonstrated that incorporating inter-site dependencies with a pairwise interaction model of genetic variation can more accurately predict the effects of mutations in high-throughput mutational scans34-36. However, numerous lines of evidence suggest that higher-order epistasis pervades the evolution of proteins and RNAs37-40, which pairwise models are unable to capture. Naïve extension of the pairwise models with third- or higher-order terms is statistically infeasible, as even a third-order interaction model for a modest-length protein of 100 amino acids will have approximately a billion parameters, $\binom{100}{3}\, 20^3 \approx 1.29 \times 10^9$. Even if such a model could be engineered or coarse-grained41 to be computationally and statistically tractable, it would only marginally improve the fraction of higher-order terms considered, leaving fourth- and higher-order interactions out of reach.

Directly parameterizing models of sequence variation with all possible interactions of order k leads to an intractable combinatorial explosion in the number of parameters to consider. An alternative to this fully observed approach for modeling data – where the correlations between positions are explained directly in terms of position-position couplings – is to model variation in terms of hidden variables to which the observed positions are coupled. Two widely used models for the analysis of genetic data, PCA and admixture analysis42-44, can be cast as latent variable (hidden variable) models in which the visible data (genotypes) depend on hidden variables (factors or populations) in a linear way. In principle, replacing those linear dependencies with flexible nonlinear transformations could facilitate modeling of arbitrary-order correlations within observed genotypes, but developing tractable inference algorithms for them is more complex. Recent advances in approximate inference45, 46 have made these kinds of nonlinear latent variable models tractable for modeling complex distributions for many kinds of data, including text, audio, and even chemical structures47, but their application to genetic data remains in its infancy.

Here, we develop nonlinear latent variable models for biological sequence families and leverage approximate inference techniques to infer them from large multiple sequence alignments. We show how a Bayesian deep latent variable model can be used to reveal latent structure in sequence families and predict the effects of mutations with accuracies exceeding those of site-independent or pairwise-interaction models.

Results

A deep generative model captures latent structure in sequence families.

The genes that we observe across species today are the results of long-term evolutionary processes that select for functional molecules. We seek to model the constraints underlying the evolutionary processes of these sequence data and to use those constraints to reason about what other mutations may be plausible. If we approximate the evolutionary process as a “sequence generator” that generates a sequence $x$ with probability $p(x|\theta)$ and parameters $\theta$ that are fit to reproduce the statistics of evolutionary data, we can use the probabilities that the model assigns to any given sequence as a proxy for the relative plausibility that a molecule satisfies functional constraints. We will consider the log-ratio

$$\log \frac{p\left(x^{\text{(Mutant)}}|\theta\right)}{p\left(x^{\text{(Wild-type)}}|\theta\right)}$$

as a heuristic metric for the relative favorability of a mutated sequence, $x^{\text{(Mutant)}}$, versus a wild-type sequence, $x^{\text{(Wild-type)}}$. This log-ratio heuristic has been previously shown to accurately predict the effects of mutations across multiple kinds of generative models $p(x|\theta)$34. Our innovation here is to consider a nonlinear latent variable model for $p(x|\theta)$ that is capable of capturing higher-order constraints (Figure 1.1a and Methods). This approach is fully unsupervised, as we do not train on observed mutation effect data but rather use the statistical patterns in observed sequences as a signal of selective constraint.
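In code, the heuristic is a one-liner. The sketch below assumes a fitted model object exposing a hypothetical log_prob method that returns $\log p(x|\theta)$ (or an approximation to it) for a sequence; both names are illustrative, not an interface defined in this thesis:

```python
def mutation_effect(model, wildtype: str, mutant: str) -> float:
    """Log-ratio heuristic: positive scores favor the mutant, negative
    scores predict a deleterious effect relative to wild type."""
    return model.log_prob(mutant) - model.log_prob(wildtype)
```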

a Sitewise factors Pairwise factors Latent factors

I L A V P I L A V P I L A V P

I Q A A P I Q A A P I Q A A P I N A A P I N A A P I N A A P H Q D M G H Q D M G H Q D M G K R D S G K R D S G K R D S G K A D N A K A D N A K A D N A K A D N A K A D N A K A D N A

b Approximate posterior Generative model 1500 30

1500 100

2000

K S V AEH YNT GR L AK MHAEKLYSTCVR MHGDR IFTNC

Figure 1.1. A nonlinear latent variable model captures higher-order dependencies in proteins and RNAs. a. In contrast to sitewise and pairwise models that factorize dependency in sequence families with low-order terms, a nonlinear latent variable model posits hidden variables z that can jointly influence many positions at the same time. b. The dependency p(x|z) (blue) of the sequence x on the latent variables z is modeled by a neural network, and inference and learning is made tractable by jointly training with an approximate inference network q(z|x) (yellow). This combination of model and inference is also known as a variational autoencoder. Sizes of the latent variables z and hidden dimensions of the neural network are shown.

We introduce a nonlinear latent variable model $p(x|\theta)$ to implicitly capture higher-order interactions between positions in a sequence. We imagine that when the data are generated, a hidden variable $z$ is sampled from a prior distribution $p(z)$, in our case a standard multivariate normal, and that a sequence $x$ is in turn generated based on a conditional distribution $p(x|z,\theta)$ that is parameterized by a neural network. If the system were fully observed, the probability of the data would be simple to compute as $p(x|z,\theta)\,p(z)$, but when $z$ is hidden we must contend with the marginal likelihood

$$p(x|\theta) = \int p(x|z,\theta)\, p(z)\, dz,$$

which considers all possible explanations for the hidden variables $z$ by integrating them out. Directly computing this probability is intractable in the general case, but we can use variational inference48 to form a lower bound on the (log) probability. This bound, known as the Evidence Lower Bound (ELBO) $\mathcal{L}(\phi; x)$, takes the form

$$\log p(x|\theta) \;\ge\; \mathcal{L}(\phi; x) \;\triangleq\; \mathbb{E}_{q}\left[\log p(x|z,\theta)\right] - D_{KL}\left(q(z|x,\phi)\,\|\,p(z)\right),$$

where $q(z|x,\phi)$ is a variational approximation for the posterior distribution $p(z|x,\theta)$ of hidden variables given the observed variables. We model both the conditional distribution $p(x|z,\theta)$ of the generative model and the approximate posterior $q(z|x,\phi)$ with neural networks, which results in a flexible model-inference combination known as a Variational Autoencoder (VAE)45, 46 (Figure 1.1b). After fitting the model to a given family by optimizing the variational parameters $\phi$, it can be readily applied to predict the effects of arbitrary types and numbers of mutations. We quantify effects with an approximation to the log-ratio $\log \frac{p(x^{\text{(Mutant)}}|\theta)}{p(x^{\text{(Wild-type)}}|\theta)}$ by replacing each log probability with the ELBO (Figure 1.2).
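A minimal PyTorch sketch of this model-inference pair is shown below. It is illustrative only: the layer sizes are arbitrary, and it omits the structured priors, Bayesian weight uncertainty, and sparse output layer described next.

```python
import torch
import torch.nn as nn

class TinySequenceVAE(nn.Module):
    def __init__(self, L=100, q=20, d_z=30, d_h=100):
        super().__init__()
        self.L, self.q = L, q
        self.encoder = nn.Sequential(nn.Linear(L * q, d_h), nn.ReLU())
        self.enc_mu = nn.Linear(d_h, d_z)
        self.enc_logvar = nn.Linear(d_h, d_z)
        self.decoder = nn.Sequential(
            nn.Linear(d_z, d_h), nn.ReLU(), nn.Linear(d_h, L * q))

    def elbo(self, x_onehot):                    # x_onehot: [B, L, q]
        B = x_onehot.shape[0]
        h = self.encoder(x_onehot.reshape(B, -1))
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)    # reparameterization
        logits = self.decoder(z).reshape(B, self.L, self.q)
        log_px = (x_onehot * torch.log_softmax(logits, -1)).sum((1, 2))
        kl = 0.5 * (mu.pow(2) + logvar.exp() - logvar - 1).sum(1)  # KL(q || N(0, I))
        return log_px - kl                       # per-sequence lower bound

# Mutation scoring then replaces each log-probability with the (sampled) ELBO:
# score = model.elbo(x_mutant).mean() - model.elbo(x_wildtype).mean()
```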

[Figure 1.2: schematic of mutation effect prediction via the likelihood ratio, with a heatmap of effect scores (neutral to deleterious) for every point mutation at each wild-type position of β-lactamase, annotated with secondary structure.]

Figure 1.2. Mutation effects can be quantified by likelihood ratios. After fitting a probabilistic model to a family of homologous sequences, we heuristically quantify the effect of mutation as the log ratio of mutant likelihood to wild type likelihood (as approximated by the ELBO; Methods). Below: mutation effect scores for all possible point mutations to β-lactamase.

We use a particular combination of priors, parameterizations, and learning algorithms in order to make the model more interpretable and more likely to generalize. First, we encourage sparse interactions with a group sparsity prior on the last layer of the neural network for $p(x|z,\theta)$. This prior encourages small subgroups of hidden units in the network to influence only a few positions at a time. Second, we encourage correlated amino acid usage by transforming all local predictions of the amino acids at each position with a shared linear map $D$, which we refer to as a dictionary. Finally, and in deviation from standard practice for VAEs, we learn distributions over the weights of the neural network for $p(x|z,\theta)$ with a variational approximation over both the global model parameters as well as the per-datum hidden variables. This means that rather than learning a single neural network for $p(x|z,\theta)$, we learn an infinite ensemble of networks.

The joint variational approximation over global and local parameters is optimized by stochastic gradient ascent on the ELBO to give a fully trained model (Methods). Since this is a non-convex optimization problem with multiple solutions, we fit five replicas of the model from different initial conditions. Throughout the analysis, we will consider both the average performance across these five fits as well as the performance of the ensemble prediction that averages the predictions together.

Model probabilities correlate with experimental mutation effects.

We compared predictions from DeepSequence to a collection of 42 high-throughput mutational scans (712,218 mutations across 108 sets of experiments on 34 proteins and a tRNA; Methods, Supplementary Figure S1.1). We found that the predictions of the DeepSequence ensemble correlate as well as or better with experimental mutation effects across a majority of the datasets when compared to both a pairwise interaction model (EVmutation34; 33/42 datasets, median difference in rank correlation Δρ = 0.036) and a site-independent model (34/42 datasets, median Δρ = 0.063) trained on the same data (Figure 1.3). The average performance of DeepSequence without ensembling reproduces this overall advantage against EVmutation (32/42 datasets, median Δρ = 0.024) and the site-independent model (31/42 datasets, median Δρ = 0.055). The clear exceptions to the overall improvement of DeepSequence are the comparisons to the viral protein mutation experiments, and especially the two HIV env experiments, suggesting that the VAE approach is more dependent on a larger diversity of sequences on which to train the model. The DeepSequence predictions were consistently more accurate than other commonly used methods, for instance BLOSUM62 (20/20, median Δρ = 0.32), SIFT32 (20/20, median Δρ = 0.24), and PolyPhen231 (19/20, median Δρ = 0.20).

[Figure 1.3, panels a and b: |Spearman ρ| per dataset for the latent (DeepSequence), pairwise (EVmutation), and independent models, across sequence families of prokaryotic/eukaryotic or viral origin and with varying numbers of mutations per experiment.]

Figure 1.3. A deep latent variable model predicts the effects of mutations better than site-independent or pairwise models. a. A nonlinear latent variable model (DeepSequence) captures the effects of mutations across deep mutational scanning experiments as measured by rank correlation. b. The latent variable model tends to be more predictive of mutational effects than pairwise and site-independent models when fit to prokaryotic and eukaryotic sequence families.

The deep mutational scans that we analyze typically involve only one or a few mutational steps from the assayed sequences (‘test set’) to the sequences that the model is trained on, raising the question of how well the model can generalize when the number of steps is larger. To test this, we reran the experiments for TEM-1 β-lactamase with artificially purged training sets in which sequences within 35, 60, 80, 95, or 100 percent identity to wild-type TEM-1 were removed. Surprisingly, we found that DeepSequence continued to exceed the performance of pairwise and site-independent models even when all sequences within 60% sequence identity were purged. The Spearman correlation dropped by only 0.07 despite all mutated test sequences being ~100 mutational steps away from the training set (Methods, Supplementary Figure S1.2).

We observed a consistent amino acid bias in the prediction accuracy of all three evolutionary models (independent, EVmutation, DeepSequence) when comparing the residuals of the rankings of the predictions versus the experiment for each amino acid transition (Supplementary Figure S1.3), but we were unable to find consistent patterns for this discrepancy. For instance, we could not explain the observed bias by codon usage. DeepSequence can be improved by accounting for this bias by fitting a linear model on top of the predictions, but the improvements are small (Supplementary Figure S1.4, Methods).

Sequence space in latent space.

Examining the low-dimensional latent spaces learned by a latent variable model can give insight into relationships between data points (sequences). To gain insight into the organization of latent space directly, we fit an otherwise-identical copy of the model with a 2-dimensional rather than 30-dimensional $z$ to β-lactamase (Figure 1.4). This visualization illustrates the comparative shallowness of deep mutational scans; all mutated sequences from the deep mutational scans of β-lactamase tightly concentrate in a small region of latent space. We also observe an uneven distribution in latent space with phylogenetically coherent structure; however, we caution against over-interpreting this distribution because it will depend strongly on the choice of prior and the variational approximations49.

Figure 1.4. Latent variables capture organization of sequence space. In a two-dimensional latent space for the β-lactamase family, closeness in latent space reflects phylogenetic groupings. When examining a single deep mutational scanning experiment, all variants (shown in pink) occupy only a very small portion of the sequence space of the entire evolutionary family.

Bayesian learning prevents overfitting and facilitates interpretation.

To test the importance of our specific choices for the architecture and learning algorithm, we performed an ablation study of model design decisions across a subset of proteins. We found that all of the major design decisions contributed to improved performance, including using sparse priors on the last layer, a learned dictionary of amino acid correlations, a global inverse-temperature parameter, and a Bayesian variational approximation for the weights (Table 1.1, Figure 1.5, and Methods). The largest improvement seems to come from using variational Bayes to learn distributions over the weights of the network rather than point estimates, even when point estimation is combined with group sparsity priors50 or Dropout51 regularization (Table 1.1).


Table 1.1. Biologically motivated priors and Bayesian learning improve model performance. Ablation studies of critical components of DeepSequence, showing the average Spearman ρ of predictions from five randomly-initialized models. We include combinations of components of the structured matrix decomposition and use either Bayesian approximation or maximum a posteriori (MAP) estimation of decoder weights. These can be compared to predictions made from EVmutation (Pair) and the site-independent model (Site). Inclusion is indicated with (✓), and top-performing model configurations for each dataset are bolded.

Columns, in the order given: Bayesian θ variants combining Sparsity [S], Convolution [C], and Temperature [T]; MAP θ variants (L2 Regularization, Dropout, [S+C+T], Final ReLU); Pair; Site.

β-lactamase: 0.73 0.73 0.73 0.73 0.73 0.74 0.53 0.61 0.04 0.40 0.56 0.37 0.34 0.42 0.70 0.60
PSD 95 (PDZ domain): 0.58 0.60 0.58 0.57 0.57 0.55 0.55 0.48 0.32 0.47 0.50 0.41 0.37 0.47 0.54 0.47
GAL4 (DNA-binding domain): 0.61 0.46 0.50 0.62 0.60 0.58 0.60 0.53 0.26 0.47 0.52 0.43 0.42 0.47 0.59 0.41
HSP90 (ATPase domain): 0.54 0.54 0.54 0.51 0.52 0.52 0.48 0.45 0.03 0.34 0.44 0.26 0.22 0.33 0.49 0.43
Kanamycin kinase APH(3’)-II: 0.62 0.62 0.62 0.60 0.59 0.60 0.53 0.49 0.09 0.38 0.49 0.40 0.39 0.38 0.59 0.33
DNA methyltransferase HaeIII: 0.70 0.70 0.69 0.70 0.68 0.68 0.64 0.64 0.12 0.54 0.64 0.50 0.49 0.54 0.69 0.44
PABP singles (RRM domain): 0.67 0.67 0.66 0.65 0.63 0.65 0.64 0.62 0.44 0.59 0.63 0.58 0.58 0.59 0.59 0.42
Ubiquitin: 0.50 0.46 0.46 0.44 0.48 0.43 0.37 0.39 0.09 0.38 0.37 0.29 0.31 0.38 0.43 0.46
YAP1 (WW domain): 0.64 0.64 0.64 0.63 0.63 0.64 0.63 0.58 0.28 0.50 0.61 0.49 0.44 0.50 0.57 0.58

We consider two kinds of structured correlations in multiple sequence alignments in the design of the final layer. The first is correlated bias in amino acid usage, where hydrophobic or polar amino acids tend to have correlated liability at a given position. We capture this with a shared linear transformation $D$ that is tied across all positions, and find that its implicit correlation structure ($DD^T$) reflects well-known amino acid similarities (Figure 1.6a). The second kind of correlation in multiple sequence alignments is between positions, which often coincides with structural proximity52-55. Our group sparsity prior captures these by learning 500 soft sub-groups of positions that are each connected to 4 hidden units. We computed the average pairwise distance between positions in each sub-group using a representative PDB structure (Methods) and found that most were less than a null expectation (Methods, Figure 1.6a), with subsets of residues close in 3D.

Figure 1.5. Structured priors over weights capture biological assumptions. Top: In a traditional fully-connected layer that outputs per-position logits over different letters ABCD at each position, every hidden unit can influence every logit at every position. Center: Placing a group-sparsity prior over the weights encourages block sparsity such that groups of k hidden units tend to influence all the logits at a small number of positions. Bottom: An additional global transform at every position in the sequence by a shared weight matrix captures correlations between letter usage.

Interpretation of mutation effect predictions.

We then explored which sets of mutations were most differentially predicted by DeepSequence (on a subset of eight experiments with a large overall difference to the independent methods; Supplementary Figure S1.5, Methods). DeepSequence was more accurate for all proteins across all residue classifications that we explored, including both evolutionary (“Conservation”, “Frequency”) and structural features (“Interaction”) (Figure 1.6b). The overall increased accuracy of the latent model predictions is particularly strong for mutations that are deleterious in the experiment, often where these deleterious sites are variable or proximal to interacting ligands. For example, in the RRM domain of the polyA-binding protein, the PDZ domain in PSD95, and kanamycin kinase, residues close to their ligands or cofactors are the most differentially accurate. These include a residue position involved in specificity switching, G330 in the PDZ domain, and RNA interaction sites in the RRM (Figure 1.6b). These results are consistent with the idea that the latent model is making better predictions for sites that are sensitive to change but still varied across evolution, and hence context-dependent.

[Figure 1.6, panels a (“Interpreting model parameters”: sub-group structural distances versus scale parameters, and the amino acid correlation structure of the dictionary) and b (“Interpreting model predictions”: root mean squared residual of DeepSequence versus the independent model by position and mutation attributes, with the most differentially predicted positions highlighted on structures 1be9, 4f02, and 1nd4).]

Figure 1.6. Differential improvement is strongest for deleterious effects. a. Left: sparse sub-groups targeted by the latent factors with a group sparsity prior are closer in structure than expected for random sub-groups. Box plots show the median (center line), interquartile range (hinges), and 1.5 times the interquartile range (whiskers); outliers are plotted as individual points. Right: correlations in the weights of the final width-1 convolution taken across all models reflect known amino acid correlations and are correlated with a well-known substitution matrix, BLOSUM62 (Spearman ρ = 0.83, N = 210). b. Left: comparison of relative rank error between DeepSequence and a position-independent conservation model across different position and mutation attributes. Right: top five positions with the largest reduction in rank error from the site-independent model to DeepSequence.


Discussion

We have presented a deep latent variable model that can capture higher-order correlations in biological sequence families and shown how it can be applied to predict the effects of mutations across diverse classes of proteins and RNAs. We find that the predictions of the deep latent variable model are more accurate than a previously published pairwise-interaction approach to modeling epistasis34, 56, which in turn was more accurate than commonly used supervised methods57, 58. In addition, both the latent variables and global variables of the model learn interpretable structure, capturing macrovariation and phylogeny as well as structural proximity of residues.

Deep latent variable models introduce additional flexibility to model higher-order constraints, but they incur the cost of reduced interpretability and increased potential for overfitting. Indeed, we found that even traditional approaches for regularization, such as Dropout51 and sparsity priors, were often worse than the already-established pairwise models. As this work was in progress, other nonlinear latent variable models were proposed for sequence families59, 60, evidencing the benefits of more parametrically powerful models for sequence variation. A key aspect of this work distinguishing it from both those and from other models in our ablation study was the use of approximate Bayesian inference, where we estimate distributions over model parameters and propagate that uncertainty into model predictions. While we found that mean-field approximate variational inference and group sparsity priors were sufficient to exceed the performance of a wide range of models, future work will likely benefit from other biologically-motivated priors as well as more accurate approximations for variational inference61, 62. Additionally, incorporating more rigidly structured probabilistic graphical models to model dependencies between latent variables could improve generality and interpretability63.

Lastly, the efficacy of DeepSequence for predicting experimental mutational scans will be contingent on both the quality and relevance of the evolutionary sequence data in two ways. First, DeepSequence and other mutation effect prediction methods train on multiple sequence alignments34, 53, 64-66, for which the quality of the available data may vary across families and for which decisions on inclusion thresholds are somewhat ad hoc. Second, the premise that evolutionary data can be applied to predict the outcomes of an experiment is highly contingent on the relevance of the experimental assay to long-term selective forces in the family. A mutation may be damaging with regard to some measurable protein feature, e.g. enzyme efficiency, but harmless for stability or even organism fitness, as others and we have previously discussed34, 36, 67. Both of these issues could be partially addressed by incorporating DeepSequence predictions as features in a supervised learning framework.

Despite challenges for deep models of sequence variation and the data used to train them, they are likely to be of increasing importance to the high-throughput design and annotation of biological sequences. Evolution has conducted, and continues to conduct, an unthinkably large number of protein experiments, and deep generative models can begin to identify the statistical patterns of constraint that characterize essential functions of molecules.

Methods

Datasets

We constructed a dataset of deep mutational scans that combined those analyzed in EVmutation3-6, 8-12, 14-18, 21, 22, 24, 68-70 with additional studies that have been published since25, 71-81.

Alignments

We used the multiple sequence alignments that were published with EVmutation for the 19 families that overlapped34 and repeated the same alignment-generation protocol for the 4 additional proteins that were added in this study. Briefly, for each protein (target sequence), multiple sequence alignments of the corresponding protein family were obtained by five search iterations of the profile HMM homology search tool jackhmmer82 against the UniRef100 database of non-redundant protein sequences83 (release 11/2017). We used a bit score of 0.5 bits/residue as a threshold for inclusion unless the alignment yielded < 80% coverage of the length of the target domain, or if there were not enough sequences (redundancy-reduced number of sequences ≥ 10L, where L = sequence length). For < 10L sequences, we decreased the required average bit score until satisfied, and when the coverage was < 80% we increased the bit score until satisfied. Proteins with < 2L sequences at < 70% coverage were excluded from the analysis. See previous work for the ParE-ParD toxin-antitoxin and tRNA alignment protocols.
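This threshold-adjustment logic can be summarized as the loop below. It is a sketch only: search_at(bits) is a hypothetical wrapper that runs a jackhmmer search at a given bits/residue inclusion threshold and returns an alignment with num_seqs (redundancy-reduced) and coverage attributes, and the step size is an arbitrary assumption.

```python
def build_alignment(search_at, L, bits=0.5, step=0.05):
    aln = search_at(bits)
    # Too few sequences: relax the inclusion threshold until >= 10L sequences.
    while aln.num_seqs < 10 * L and bits > step:
        bits -= step
        aln = search_at(bits)
    # Too little coverage: tighten the threshold until >= 80% coverage.
    while aln.coverage < 0.80:
        bits += step
        aln = search_at(bits)
    return aln
```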

Sequence weights

The distributions of protein and RNA sequences in genomic databases are biased by both (i) human sampling, where some parts of a phylogeny may be more studied and sequenced than others, and (ii) evolutionary sampling, where some groups of species have arisen from large radiations and will have proteins that are overrepresented. We aim to reduce these biases by reweighting the data distribution. We use the previously established procedure84 of computing each sequence weight $w_n$ as the reciprocal of the number of sequences within a given Hamming distance cutoff. If $D\left(x^{(n)}, x^{(m)}\right)$ is the normalized Hamming distance between the query sequence $x^{(n)}$ and another sequence in the alignment $x^{(m)}$, and $\theta$ is a pre-defined neighborhood size, the sequence weight is

$$w_n = \left( \sum_{m} \mathbb{1}\left[ D\left(x^{(n)}, x^{(m)}\right) < \theta \right] \right)^{-1}.$$

The effective sample size of a multiple sequence alignment can then be computed as the sum of these weights,

$$N_{\text{eff}} = \sum_{n} w_n.$$
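As a sketch, these weights can be computed directly from an integer-encoded alignment (an assumed [N, L] array; note that the naive pairwise computation is quadratic in the number of sequences):

```python
import numpy as np

def sequence_weights(msa: np.ndarray, theta: float = 0.2) -> np.ndarray:
    # Normalized Hamming distance between every pair of sequences.
    dist = (msa[:, None, :] != msa[None, :, :]).mean(axis=2)      # [N, N]
    # w_n = 1 / |{m : D(x_n, x_m) < theta}|; each sequence counts itself.
    return 1.0 / (dist < theta).sum(axis=1)

# The effective sample size is the sum of the weights:
# n_eff = sequence_weights(msa).sum()
```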

To fit a model to reweighted data, there are two common approaches. First, as was done previously84, one can reweight every log-likelihood in the objective by its sequence weight $w_n$. While this works well for batch optimization, we found it to lead to high-variance gradient estimates with mini-batch optimization that make stochastic gradient descent unstable. We instead used the approach of sampling data points in each mini-batch with probability $p_n$ proportional to their weight,

$$p_n = \frac{w_n}{N_{\text{eff}}}.$$
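A minimal sketch of this sampling scheme, assuming precomputed weights:

```python
import numpy as np

def sample_minibatch(weights: np.ndarray, batch_size: int, rng=None):
    rng = rng or np.random.default_rng()
    p = weights / weights.sum()          # p_n = w_n / N_eff
    return rng.choice(len(weights), size=batch_size, p=p)
```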

Following prior work34, we set $\theta = 0.2$ for all multiple sequence alignments (80% sequence identity), except those for viral proteins, where we set $\theta = 0.01$ (99% sequence identity) due to limited sequence diversity and the expectation that small differences in viral sequences have a higher probability of containing constraint information than the same diversity might in a sample of mammals, for instance.

Background: latent factor models

Probabilistic latent variable models reveal structure in data by positing an unobserved generative process that created the data and then doing inference to learn the parameters of the generative process. We will focus on models with a generative process in which an unobserved set of factors $z$ are drawn from an independent distribution and each data point arises according to a conditional distribution $p(x|z,\theta)$ that is parameterized by $\theta$. This process can be written as

$$z \sim \mathcal{N}(0, I),$$
$$x \sim p(x|z, \theta).$$

Principal Component Analysis (PCA) has been a foundational model for the analysis of genetic variation since its introduction by Cavalli-Sforza. PCA can be realized in this probabilistic framework as the zero-noise limit of Probabilistic PCA42, 85. With linear conditional dependencies $p(x|z,\theta)$, PCA can only model additive interactions between the latent factors $z$. This limitation could in principle be remedied by using a conditional model $p(x|z,\theta)$ with nonlinear dependencies on $z$.

Here, we will consider a conditional model for sequences $p(x|z,\theta)$ that differs from PCA in two ways: First, the conditional distribution of the data $p(x|z,\theta)$ will be categorical rather than Gaussian to model discrete characters. Second, the conditional distribution $p(x|z,\theta)$ will be parameterized by a neural network rather than a linear map. In this sense, our latent variable model may be thought of as a discrete, nonlinear analog of PCA.

Nonlinear categorical factor model

Our probabilistic model is specified by two components, a prior distribution $p(z)$ that specifies the marginal distribution of the hidden variables $z$ and a conditional distribution $p(x|z,\theta)$ that specifies how a sequence $x$ is generated given the hidden variables. In our model, the sequence $x$ is a length $L$ string of letters drawn from an alphabet of size $q$ and the hidden variables $z$ are a length $D$ vector of real numbers.

In this work, we structure our prior and conditional distribution similarly to original versions of the Variational Autoencoder. That is, we model the prior distribution $p(z)$ as a multivariate Normal of dimension $K$ with mean zero and the identity covariance, and we model the conditional distribution $p(x|z,\theta)$ as a simple fully-connected neural network with two hidden layers. Thus, the generative process for the joint distribution $p(z)\,p(x|z,\theta)$ can be written as

$$z \sim \mathcal{N}(0, I),$$
$$h^{(1)} = f_1\left(W^{(1)} z + b^{(1)}\right),$$
$$h^{(2)} = f_2\left(W^{(2)} h^{(1)} + b^{(2)}\right),$$
$$h^{(3,i)} = W^{(3,i)} h^{(2)} + b^{(3,i)} \quad \text{for } i = 1 \ldots L,$$
$$p\left(x_i = a\,|\,z\right) = \frac{\exp\left(h^{(3,i)}_a\right)}{\sum_{b} \exp\left(h^{(3,i)}_b\right)} \quad \text{for } i = 1 \ldots L,$$

where the nonlinearities were rectified linear units $f_1(u) = \max(0, u)$ on the first layer and sigmoidal $f_2(u) = \frac{1}{1 + e^{-u}}$ on the second. The sigmoidal nonlinearity was introduced to bound the magnitude of the pre-activations that are multiplied by the structured matrix $W^{(3,i)}$ (see next section). It is important to note that each letter $x_i$ is conditionally independent of every other position given the hidden variables $z$ and, as a result, all correlations between letters must be mediated by correlations with the hidden variables.
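The generative pass corresponds to only a few lines of code. The sketch below uses an unstructured final weight tensor W3 for clarity (the structured decomposition is introduced in the next section); all tensor names and shapes are illustrative.

```python
import torch

def decode(z, W1, b1, W2, b2, W3, b3):
    # z: [D]; W1: [H1, D]; W2: [H2, H1]; W3: [L, q, H2]; b3: [L, q]
    h1 = torch.relu(W1 @ z + b1)                       # f1 = ReLU
    h2 = torch.sigmoid(W2 @ h1 + b2)                   # f2 = sigmoid
    logits = torch.einsum('lqh,h->lq', W3, h2) + b3    # h^(3,i) per position
    return torch.log_softmax(logits, dim=-1)           # log p(x_i = a | z)
```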

28 Structured parameterization

All biologically motivated aspects of our model are captured in a structured parameterization of the final weight matrix. Our parameterization is motived by three assumptions: (i) Sparsely interacting subsystems. Hidden factors influence small subsystems of positions at a time rather than jointly affecting the entire sequence. (ii) Correlated amino acid usage. Certain amino acids are more likely functionally substitute other amino acids in the same position based on biochemistry. (iii) Selective uncertainty. Differences in the strength of selection may lead to varying effective “temperatures” of the constraint distribution. To capture these constraints, we parameterize the at each position i as

(,) , � = �� � ���� �� , where �(,) is a [q x H] matrix that linearly combines the H activations in the final hidden �() layer to q multinomial logits for the different characters (e.g. amino acids) at position i. The parameterization consists of four terms: a matrix � that captures amino acid correlations, a matrix � with elements on (0,1) that gates which hidden units can affect which positions, a scalar constant � capturing the overall selective constraint shared across all positions (inverse temperature), and underlying parameters � , that combine with the other terms to capture site- specific constraints. We describe these elements in turn.

To capture correlations in amino acid usage, the weights themselves are split into a combination of local weights at each position $W^{(i)}$ plus a global “dictionary” matrix $C$. The inner dimension of the product $C\,W^{(i)}$ is a hyperparameter $E$, such that $C$ is a global $[q \times E]$ matrix that left-multiplies each of the $W^{(i)}$ matrices, which are $[E \times H]$, to transform to the alphabet of the model (e.g. the amino acids).

To capture sparsely interacting subsystems, we learn an $[H \times L]$ matrix $S$ that masks which neurons in the final hidden layer $\mathbf{h}^{(2)}$ can influence which positions in the sequence. Each column vector $S_{:,i}$ captures which hidden units in $\mathbf{h}^{(2)}$ affect position $i$ in the sequence. We constrain the values of this matrix to be between 0 and 1, and also tie some of the rows of $S$ to be equal (Figure 1.5, see parameterization below). In the above expression we write $\mathrm{diag}(S_{:,i})$ to indicate the $[H \times H]$ matrix containing the column vector $S_{:,i}$ along its diagonal, which allows us to frame the matrix $S$ as a component of the parameterization of $W^{(3,i)}$. In practice, we actually compute the effect of $S$ with broadcasting as

$$\tilde{\mathbf{x}}^{(i)} = \lambda \, C \, W^{(i)} \left(S_{:,i} \odot \mathbf{h}^{(2)}\right) + \mathbf{b}^{(3,i)}.$$

Lastly, a positive scalar $\lambda$ captures the overall strength of the constraints. Typically, including a scalar prefactor in a weight matrix is fully redundant, but by placing a prior over this parameter and then modeling uncertainty with a variational approximation, this can capture global, sequence-wide correlations in the selective strength.

We parameterize the constrained parameters $\lambda$ and $S$ with unconstrained forms $\tilde{\lambda}$ and $\tilde{S}$ that can be optimized by gradient descent. The global “inverse temperature” $\lambda$ is parameterized by the softplus function as $\lambda = \log\left(1 + e^{\tilde{\lambda}}\right)$. The sparsity matrix $S$ is constrained to both (i) have elements on $(0,1)$ and (ii) have $H/k$ blocks of $k$ identical rows. We do this by parameterizing it in terms of an $[H/k \times L]$ matrix $\tilde{S}$, transforming it by a sigmoid, and tiling the rows $k$ times with the transform $S = \left(1 + \exp(-\tilde{S})\right)^{-1}$. When paired with Gaussian priors over $\tilde{S}$ and $W^{(i)}$, the scale parameters $S$ will be a priori logit-normally distributed, and the resulting product $W^{(i)}\,\mathrm{diag}(S_{:,i})$ can be seen as a continuous relaxation of a spike and slab prior (where it would be exact if the elements of $S$ were Bernoulli).
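The sketch below illustrates this structured parameterization under assumed dimensions; the names (C, W_loc, S_tilde, lam_tilde) mirror the symbols in the text, but the initialization values are placeholders rather than fitted parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
q, E, H, L, k = 20, 40, 2000, 50, 4                  # alphabet, dictionary, hidden, length, tying

C = rng.normal(size=(q, E))                          # global amino acid "dictionary"
W_loc = rng.normal(size=(L, E, H))                   # site-specific weights W^(i)
S_tilde = rng.normal(-12.36, 4.0, size=(H // k, L))  # unconstrained sparsity (prior mu, sigma)
lam_tilde = rng.normal()                             # unconstrained inverse temperature

S = 1.0 / (1.0 + np.exp(-S_tilde))                   # sigmoid: elements on (0, 1)
S = np.repeat(S, k, axis=0)                          # tile rows so k rows share one scale ([H x L])
lam = np.log1p(np.exp(lam_tilde))                    # softplus: positive scalar

def position_logits(i, h2, b_i):
    """Logits for position i given final hidden activations h2 ([H])."""
    # broadcasting form of lambda * C W^(i) diag(S[:, i]) h2 + b_i
    return lam * C @ (W_loc[i] @ (S[:, i] * h2)) + b_i

h2 = rng.uniform(size=H)                             # sigmoidal activations lie on (0, 1)
logits_0 = position_logits(0, h2, np.zeros(q))       # [q] multinomial logits for position 0
```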

Priors

We place Gaussian priors over all unconstrained parameters: the network weights and biases as $\theta \sim \mathcal{N}(0,1)$, the dictionary as $C \sim \mathcal{N}(0,1)$, the unconstrained inverse temperature as $\tilde{\lambda} \sim \mathcal{N}(0,1)$, and the unconstrained sparsity parameters as $\tilde{S} \sim \mathcal{N}(\mu_S, \sigma_S^2)$. To set the hyperparameters $\mu_S, \sigma_S^2$, we consider the effective logit-normal prior over $S$, which can be thought of as a smooth relaxation of a Bernoulli that can be made sharper by increasing the variance $\sigma_S^2$. We set $\mu_S = -12.36$ and $\sigma_S^2 = 16$.

Background: Variational Inference.

Nonlinear latent factor models are difficult to infer. Since the latent variables $\mathbf{z}$ are not observed, computing the marginal likelihood of the data requires integrating them out as

$$\log p(\mathbf{x}|\theta) = \log \int p(\mathbf{x}|\mathbf{z},\theta)\, p(\mathbf{z})\, d\mathbf{z}.$$

We must do this integral because we do not know a priori which $\mathbf{z}$ is responsible for each data point $\mathbf{x}$, and so we average over all possible explanations weighted by their relative probability. When this integral over $\mathbf{z}$ cannot be simplified, optimizing the marginal likelihood $\log p(\mathbf{x}|\theta)$ to fit a model to data will be intractable.

Variational Inference forms a lower bound on $\log p(\mathbf{x}|\theta)$ that is easier to optimize. First, we introduce $q(\mathbf{z}|\mathbf{x},\phi)$, an approximate distribution for $\mathbf{z}$ given $\mathbf{x}$ that is flexibly parameterized by variational parameters $\phi$. By Jensen's inequality we can lower bound the intractable marginal likelihood $\log p(\mathbf{x}|\theta)$ as

$$\log p(\mathbf{x}|\theta) = \log \int p(\mathbf{x}|\mathbf{z},\theta)\, p(\mathbf{z})\, d\mathbf{z} = \log \int q(\mathbf{z}|\mathbf{x},\phi)\, \frac{p(\mathbf{x}|\mathbf{z},\theta)\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x},\phi)}\, d\mathbf{z} \geq \int q(\mathbf{z}|\mathbf{x},\phi) \log \frac{p(\mathbf{x}|\mathbf{z},\theta)\, p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x},\phi)}\, d\mathbf{z}.$$

We write this lower bound as

$$\log p(\mathbf{x}|\theta) \geq \mathcal{L}(\phi) \triangleq \mathbb{E}_{q(\mathbf{z}|\mathbf{x},\phi)}\left[\log p(\mathbf{x}|\mathbf{z},\theta)\right] - D_{KL}\left(q(\mathbf{z}|\mathbf{x},\phi)\,\|\,p(\mathbf{z})\right),$$

which we refer to as the Evidence Lower BOund (ELBO). Maximizing the ELBO has the side effect of minimizing the KL divergence between the variational approximation $q(\mathbf{z}|\mathbf{x},\phi)$ and the true posterior distribution for $\mathbf{z}$ given $\mathbf{x}$, $p(\mathbf{z}|\mathbf{x},\theta) = \frac{p(\mathbf{x}|\mathbf{z},\theta)\,p(\mathbf{z})}{p(\mathbf{x}|\theta)}$.

Variational approximation for local posteriors $q(\mathbf{z}|\mathbf{x},\phi)$

We structure the functional form of the variational approximation for $\mathbf{z}$ as

$$\mathbf{h}^{(1)} = f\left(W^{(1)}_{\phi}\,\mathbf{x} + \mathbf{b}^{(1)}_{\phi}\right),$$
$$\mathbf{h}^{(2)} = f\left(W^{(2)}_{\phi}\,\mathbf{h}^{(1)} + \mathbf{b}^{(2)}_{\phi}\right),$$
$$\boldsymbol{\mu} = W^{(\mu)}_{\phi}\,\mathbf{h}^{(2)} + \mathbf{b}^{(\mu)}_{\phi},$$
$$\log \boldsymbol{\sigma}^{2} = W^{(\sigma)}_{\phi}\,\mathbf{h}^{(2)} + \mathbf{b}^{(\sigma)}_{\phi},$$
$$q(\mathbf{z}|\mathbf{x},\phi) = \mathcal{N}\left(\boldsymbol{\mu},\, \mathrm{diag}(\boldsymbol{\sigma}^{2})\right).$$

We apply the ‘reparameterization trick’ of Kingma and Welling45 and Rezende et al.46 to write the latent variables $\mathbf{z}$ as deterministic transforms of a noise source $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0},\mathbf{I})$ as $\mathbf{z} = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$. The symbol $\odot$ is an element-wise product.
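A single-sample Monte Carlo estimate of the per-datapoint ELBO can then be sketched as follows; encode and decode_logprobs are placeholder functions standing in for the networks defined above.

```python
import numpy as np

def elbo_one_sample(x_onehot, encode, decode_logprobs, rng):
    """One-sample Monte Carlo ELBO estimate; x_onehot is the [L x q] observed sequence."""
    mu, log_var = encode(x_onehot)                    # variational parameters of q(z|x)
    eps = rng.normal(size=mu.shape)                   # eps ~ N(0, I)
    z = mu + np.exp(0.5 * log_var) * eps              # reparameterization: z = mu + sigma * eps
    log_px_z = np.sum(x_onehot * decode_logprobs(z))  # reconstruction term E_q[log p(x|z)]
    # analytic KL(q(z|x) || N(0, I)) for a diagonal Gaussian
    kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var)
    return log_px_z - kl
```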

Variational approximation for global posteriors $q(\theta)$

We briefly review how to extend variational approximations to include both the latent variables $\mathbf{z}$ as well as the global parameters $\theta$45, 46. Because the posterior for the global parameters is conditioned on the entire dataset, we must consider the marginal likelihood of the full dataset $\mathbf{X} = \{\mathbf{x}^{(1)}, \ldots, \mathbf{x}^{(N)}\}$, which integrates out all the corresponding latent factors $\mathbf{Z} = \{\mathbf{z}^{(1)}, \ldots, \mathbf{z}^{(N)}\}$. The likelihood of the entire dataset $\log p(\mathbf{X})$ can be lower bounded by

$$\log p(\mathbf{X}) = \log \iint p(\mathbf{X}|\mathbf{Z},\theta)\, p(\mathbf{Z})\, p(\theta)\, d\mathbf{Z}\, d\theta = \log \iint q(\mathbf{Z},\theta|\mathbf{X},\phi)\, \frac{p(\mathbf{X}|\mathbf{Z},\theta)\, p(\mathbf{Z})\, p(\theta)}{q(\mathbf{Z},\theta|\mathbf{X},\phi)}\, d\mathbf{Z}\, d\theta \geq \iint q(\mathbf{Z},\theta|\mathbf{X},\phi) \log \frac{p(\mathbf{X}|\mathbf{Z},\theta)\, p(\mathbf{Z})\, p(\theta)}{q(\mathbf{Z},\theta|\mathbf{X},\phi)}\, d\mathbf{Z}\, d\theta.$$

The ELBO can then be written as

$$\log p(\mathbf{X}) \geq \mathcal{L}(\phi) \triangleq \mathbb{E}_{q(\theta)}\left[\sum_{n=1}^{N} \mathbb{E}_{q(\mathbf{z}^{(n)}|\mathbf{x}^{(n)},\phi)}\left[\log p(\mathbf{x}^{(n)}|\mathbf{z}^{(n)},\theta)\right] - D_{KL}\left(q(\mathbf{z}^{(n)}|\mathbf{x}^{(n)},\phi)\,\|\,p(\mathbf{z}^{(n)})\right)\right] - D_{KL}\left(q(\theta)\,\|\,p(\theta)\right).$$

We model all variational distributions over the parameters with fully-factorized mean-field Gaussian distributions. We factorize the variational approximation across the global and local parameters as $q(\mathbf{Z},\theta|\mathbf{X},\phi) = q(\mathbf{Z}|\mathbf{X},\phi)\,q(\theta)$, across $\mathbf{Z}$ as $q(\mathbf{Z}|\mathbf{X},\phi) = \prod_{n} q(\mathbf{z}^{(n)}|\mathbf{x}^{(n)},\phi)$, and across the model parameters as $q(\theta) = \prod_{i} q(\theta_i)$. In accordance with our data reweighting scheme, we set $N = N_{\mathrm{eff}}$, the effective number of sequences, which is the sum of the sequence weights.

Model hyperparameters

We used a fixed architecture across all sequence families. The encoder has architecture 1500-1500-30 with fully connected layers and ReLU nonlinearities. The decoder has two hidden layers: the first with size 100 and a ReLU nonlinearity, and the second with size 2000 and a sigmoid nonlinearity. The dictionary $C$ is a 40 by $q$ matrix, where the alphabet size $q$ was 20 for proteins and 4 for nucleic acids. A single set of sparsity scale parameters controlled 4 sets of dense weights. Dropout51 was set to 0.5 when used in ablation studies. Models were optimized with Adam86 with default parameters using a batch size of 100 until convergence, completing 300,000 updates.

Each model was fit five times to the same multiple sequence alignment using a different random seed. The mutation effect prediction ($\Delta E$) is calculated by taking the difference between the mean of 2,000 ELBO samples of a given mutated sequence and that of the wild-type sequence.
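A hedged sketch of this calculation, reusing the elbo_one_sample sketch above; the 2,000-sample average follows the text, while the function signature itself is illustrative.

```python
import numpy as np

def delta_E(x_mut, x_wt, encode, decode_logprobs, n_samples=2000, seed=0):
    """Mean ELBO difference between a mutated sequence and the wild type."""
    rng = np.random.default_rng(seed)
    e_mut = np.mean([elbo_one_sample(x_mut, encode, decode_logprobs, rng)
                     for _ in range(n_samples)])
    e_wt = np.mean([elbo_one_sample(x_wt, encode, decode_logprobs, rng)
                    for _ in range(n_samples)])
    return e_mut - e_wt        # more negative values predict more deleterious mutations
```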

Site-independent and pairwise model

We compare the VAE against two other kinds of probabilistic models, both of which can be characterized as undirected graphical models with probability

$$p(\mathbf{x}|\theta) = \frac{1}{Z} \exp\left(E(\mathbf{x})\right).$$

For these distributions, $E(\mathbf{x})$ is the log-potential that describes the favorability of a sequence, and $Z$ normalizes the distribution over sequence space. A site-independent model has site-additive terms for each amino acid in each position as

$$E(\mathbf{x}) = \sum_{i} h_i(x_i),$$

while the pairwise model includes additional parameters for every pairwise combination of amino acids as

$$E(\mathbf{x}) = \sum_{i} h_i(x_i) + \sum_{i<j} J_{ij}(x_i, x_j).$$

We estimated these models using the same methods previously described for EVmutation (L2-penalized maximum pseudolikelihood). To compute the effects of mutations, we again use the log-ratio

$$\log \frac{p(\mathbf{x}^{(\text{Mutant})}|\theta)}{p(\mathbf{x}^{(\text{Wild-type})}|\theta)},$$

which reduces to the difference of the energy values $E(\mathbf{x})$ for the mutated and wild-type sequences.
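For illustration, a minimal sketch of both log-potentials and the resulting mutation effect score, assuming integer-encoded sequences and dense storage of h ([L x q]) and J ([L x L x q x q]); since Z cancels in the log-ratio, only the energies are needed.

```python
import numpy as np

def energy_independent(seq, h):
    """Site-additive log-potential; seq is integer-encoded, h is [L x q]."""
    return sum(h[i, a] for i, a in enumerate(seq))

def energy_pairwise(seq, h, J):
    """Pairwise log-potential with couplings J ([L x L x q x q])."""
    E = energy_independent(seq, h)
    L = len(seq)
    for i in range(L):
        for j in range(i + 1, L):
            E += J[i, j, seq[i], seq[j]]
    return E

def delta_E_pairwise(seq_mut, seq_wt, h, J):
    """The log-ratio of probabilities reduces to an energy difference (Z cancels)."""
    return energy_pairwise(seq_mut, h, J) - energy_pairwise(seq_wt, h, J)
```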

Non-epistatic mutation effect predictors.

We also compare a subset of protein datasets to three commonly used mutation effect predictors, a BLOSUM62 matrix, SIFT32, and PolyPhen-231, as previously described for EVmutation34.

Group sparsity analysis.

We aimed to test if positional activations driven by specific neurons in the last hidden layer corresponded to structural proximity. We gathered ground-truth distance data from the PDB87 records of homologous sequences that were found via Jackhmmer. To aggregate distance information across multiple structures for the same family, we computed the median (across PDBs) of the minimum atom distances (across all atom pairs per position pair) for both intra- and homo-oligomer distances. The final aggregated distance matrix was the element-wise minimum of the intra- and oligomeric distance matrices.

To identify the dominant connections between the last hidden layer of neurons and specific positions in the sequence, we consider the (approximate) sparsity structure of the matrix $S$. The row vectors $S_{k,:}$ represent which positions in the sequence are affected by the hidden unit $h^{(2)}_k$. We quantify the ‘typical’ distances in these groupings by a weighted average of the distances, where the weighting within each group is computed as $w^{(k)}_{ij} = S_{ki}\,S_{kj}$. The ‘typical’ distance within a group $k$ is then

$$\bar{D}^{(k)} = \frac{\sum_{i \neq j} w^{(k)}_{ij}\, D_{ij}}{\sum_{i \neq j} w^{(k)}_{ij}}.$$

The null expectation for each $\bar{D}^{(k)}$ is the expected average distance under permuted groups, which is simply the average pairwise distance across the whole structure. We discarded any ‘disconnected’ hidden units for which the entire row of $S$ had negligible value ($\forall i,\ S_{ki} < 0.001$).
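A sketch of this calculation, assuming S is the [H x L] gate matrix and dist is the aggregated [L x L] distance matrix described above; the within-group weighting shown follows the description in the text.

```python
import numpy as np

def typical_group_distances(S, dist, cutoff=1e-3):
    """Weighted-average distance within each hidden unit's positional group."""
    H, L = S.shape
    off_diag = ~np.eye(L, dtype=bool)
    typical = {}
    for k in range(H):
        if np.all(S[k] < cutoff):
            continue                           # drop 'disconnected' hidden units
        w = np.outer(S[k], S[k])               # within-group weight w_ij = S_ki * S_kj
        typical[k] = np.sum(w[off_diag] * dist[off_diag]) / np.sum(w[off_diag])
    null = dist[off_diag].mean()               # null: average pairwise distance overall
    return typical, null
```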

Residual analysis

Spearman ρ is calculated by transforming paired data to ranked quantiles and then computing the Pearson correlation between the ranks. To determine where the model over- or under-predicted the $\Delta E$ for each mutation, we transformed the experimental measurements and mutation effect predictions to normalized ranks on the interval [0,1].

To quantify error between predictions, we fit a least-squares linear fit from the normalized ranks of the predictions to the normalized ranks of the data for each method and then measured the residuals of this fit. Thus, we define the residual effects as the residuals of a least-squares linear fit between the normalized ranks. Given a least-squares fit with slope $a$ and bias $b$, the residuals are then

$$r_{\Delta E} = y - \left(a\, x_{\Delta E} + b\right).$$

Thus positive residuals $r_{\Delta E} > 0$ represent underprediction of the rank of the experimental effect (and thus overprediction of deleteriousness), while negative $r_{\Delta E}$ values represent overprediction of the experimental rank (and thus underprediction of deleteriousness). Deep mutational scans with only single mutations were analyzed, using the most recent experimental data for each protein. Residuals were grouped by the identity of the amino acid either before mutation (wildtype) or after mutation (mutant).
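A compact sketch of this procedure using SciPy's rank transform; the function name and argument names are illustrative.

```python
import numpy as np
from scipy.stats import rankdata

def residual_effects(pred_dE, exp_meas):
    """Residuals of a least-squares fit between normalized ranks on [0, 1]."""
    x = (rankdata(pred_dE) - 1) / (len(pred_dE) - 1)    # prediction ranks
    y = (rankdata(exp_meas) - 1) / (len(exp_meas) - 1)  # experimental ranks
    a, b = np.polyfit(x, y, deg=1)                      # slope a and bias b
    return y - (a * x + b)                              # positive: rank underpredicted
```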

Bias correction.

To correct for biases between mutation effect predictions and experimental measurements, we created a feature matrix for each mutation that included $\Delta E$, amino acid identity before and after mutation, alignment column statistics (conservation and amino acid frequency), and residue hydrophobicity88. Leave-one-out cross validation (LOOCV) was used to correct the bias for each dataset. Using the most recent DMS experiment as the representative of that protein family (28 DMS datasets), the mutants of 27 datasets were used to fit a regression model to predict the residuals of each known mutation, $r_{\Delta E}$, given the feature matrix. After this model was fit, it was used to predict $\hat{r}_{\Delta E}$ for the mutants in the test dataset. This predicted residual bias $\hat{r}_{\Delta E}$ was subtracted off the normalized predicted rank, $\tilde{x}_{\Delta E} = x_{\Delta E} - \hat{r}_{\Delta E}$. These corrected predictions were then re-ranked and compared to the experimental results to calculate a corrected Spearman ρ. To predict the effects of mutations solely from DMS data, the same LOOCV procedure was used, excluding all evolutionary information in the feature matrix for each mutation. In this case, the feature matrix was used to directly compute a rank. These values were subsequently re-ranked and compared to the ranked experimental results to calculate a corrected Spearman ρ.

Generalizability analysis

We focused on the alignment and mutation effects of β-lactamase as a case study to test the generalizability of evolutionary models in predicting mutation effects (Supplementary Figure S1.2). We first made four reduced alignments by removing the query sequence (BLAT_ECOLX) and all sequences with a normalized Hamming distance greater than 0.95, 0.8, 0.6, and 0.35 to the query, respectively. Five latent variable models were fit to each alignment, as well as a pairwise and an independent model using the same sequence weighting and model fitting techniques.

Acknowledgements

We thank Chris Sander, Frank Poelwijk, David Duvenaud, Sam Sinai, Eric Kelsic and members of the Marks lab for helpful comments and discussions. A.J.R. is supported by DOE CSGF fellowship DE-FG02-97ER25308. D.S.M. and J.B.I. were funded by NIGMS (R01GM106303).

This work is published at https://www.nature.com/articles/s41592-018-0138-4

References

1. Fowler, D.M. & Fields, S. Deep mutational scanning: a new style of protein science. Nature methods 11, 801-807 (2014).

2. Kosuri, S. & Church, G.M. Large-scale de novo DNA synthesis: technologies and applications. Nature methods 11, 499-507 (2014).

3. Romero, P.A., Tran, T.M. & Abate, A.R. Dissecting enzyme function with microfluidic- based deep mutational scanning. Proc Natl Acad Sci U S A 112, 7159-7164 (2015).

4. Roscoe, B.P. & Bolon, D.N. Systematic exploration of ubiquitin sequence, E1 activation efficiency, and experimental fitness in yeast. Journal of molecular biology 426, 2854-2870 (2014).

5. Roscoe, B.P., Thayer, K.M., Zeldovich, K.B., Fushman, D. & Bolon, D.N. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. Journal of molecular biology 425, 1363-1377 (2013).

6. Melamed, D., Young, D.L., Gamble, C.E., Miller, C.R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein. Rna 19, 1537-1551 (2013).

7. Stiffler, M.A., Hekstra, D.R. & Ranganathan, R. Evolvability as a Function of Purifying Selection in TEM-1 beta-Lactamase. Cell 160, 882-892 (2015).

8. McLaughlin, R.N., Jr., Poelwijk, F.J., Raman, A., Gosal, W.S. & Ranganathan, R. The spatial architecture of protein function and adaptation. Nature 491, 138-142 (2012).

9. Kitzman, J.O., Starita, L.M., Lo, R.S., Fields, S. & Shendure, J. Massively parallel single-amino-acid mutagenesis. Nat Methods 12, 203-206, 204 p following 206 (2015).

10. Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T.S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic acids research 42, e112 (2014).

11. Araya, C.L. et al. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proc Natl Acad Sci U S A 109, 16858- 16863 (2012).

12. Firnberg, E., Labonte, J.W., Gray, J.J. & Ostermeier, M. A comprehensive, high- resolution map of a gene's fitness landscape. Mol Biol Evol 31, 1581-1592 (2014).

13. Starita, L.M. et al. Massively Parallel Functional Analysis of BRCA1 RING Domain Variants. Genetics (2015).

14. Rockah-Shmuel, L., Toth-Petroczy, A. & Tawfik, D.S. Systematic Mapping of Protein Mutational Space by Prolonged Drift Reveals the Deleterious Effects of Seemingly Neutral Mutations. PLoS Comput Biol 11, e1004421 (2015).

15. Jacquier, H. et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proc Natl Acad Sci U S A 110, 13067-13072 (2013).

16. Qi, H. et al. A quantitative high-resolution genetic profile rapidly identifies sequence determinants of hepatitis C viral fitness and drug sensitivity. PLoS Pathog 10, e1004064 (2014).

17. Wu, N.C. et al. Functional Constraint Profiling of a Viral Protein Reveals Discordance of Evolutionary Conservation and Functionality. PLoS Genet 11, e1005310 (2015).

18. Mishra, P., Flynn, J.M., Starr, T.N. & Bolon, D.N.A. Systematic Mutant Analyses Elucidate General and Client-Specific Aspects of Hsp90 Function. Cell Rep 15, 588-598 (2016).

19. Doud, M.B. & Bloom, J.D. Accurate measurement of the effects of all amino-acid mutations to influenza hemagglutinin. bioRxiv (2016).

20. Deng, Z. et al. Deep sequencing of systematic combinatorial libraries reveals beta- lactamase sequence constraints at high resolution. Journal of molecular biology 424, 150-167 (2012).

21. Starita, L.M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proc Natl Acad Sci U S A 110, E1263-1272 (2013).

22. Aakre, C.D. et al. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell 163, 594-606 (2015).

23. Julien, P., Minana, B., Baeza-Centurion, P., Valcarcel, J. & Lehner, B. The complete local genotype-phenotype landscape for the alternative splicing of a human exon. Nat Commun 7, 11558 (2016).

24. Li, C., Qian, W., Maclean, C.J. & Zhang, J. The fitness landscape of a tRNA gene. Science (2016).

25. Mavor, D. et al. Determination of ubiquitin fitness landscapes under different chemical stresses in a classroom setting. Elife 5 (2016).

26. Gasperini, M., Starita, L. & Shendure, J. The power of multiplexed functional analysis of genetic variants. Nat Protoc 11, 1782-1787 (2016).

27. Starita, L.M. et al. Variant Interpretation: Functional Assays to the Rescue. Am J Hum Genet 101, 315-325 (2017).

28. Adzhubei, I.A. et al. A method and server for predicting damaging missense mutations. Nature methods 7, 248-249 (2010).

29. Hecht, M., Bromberg, Y. & Rost, B. Better prediction of functional effects for sequence variants. BMC genomics 16, S1 (2015).

30. Huang, Y.-F., Gulko, B. & Siepel, A. Fast, scalable prediction of deleterious noncoding variants from functional and population genomic data. Nature genetics 49, 618-624 (2017).

31. Kircher, M. et al. A general framework for estimating the relative pathogenicity of human genetic variants. Nature genetics 46, 310-315 (2014).

32. Ng, P.C. & Henikoff, S. SIFT: Predicting amino acid changes that affect protein function. Nucleic acids research 31, 3812-3814 (2003).

33. Finn, R.D. et al. HMMER web server: 2015 update. Nucleic acids research 43, W30-38 (2015).

34. Hopf, T.A. et al. Mutation effects predicted from sequence co-variation. Nature biotechnology 35, 128-135 (2017).

35. Mann, J.K. et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology 10, e1003776 (2014).

36. Boucher, J.I., Bolon, D.N. & Tawfik, D.S. Quantifying and understanding the fitness effects of protein mutations: Laboratory versus nature. Protein Sci (2016).

37. Weinreich, D.M., Lan, Y., Wylie, C.S. & Heckendorn, R.B. Should evolutionary geneticists worry about higher-order epistasis? Curr Opin Genet Dev 23, 700-707 (2013).

38. Bendixsen, D.P., Ostman, B. & Hayden, E.J. Negative Epistasis in Experimental RNA Fitness Landscapes. J Mol Evol (2017).

39. Rodrigues, J.V. et al. Biophysical principles predict fitness landscapes of drug resistance. Proc Natl Acad Sci U S A 113, E1470-1478 (2016).

40. Echave, J. & Wilke, C.O. Biophysical Models of Protein Evolution: Understanding the Patterns of Evolutionary Sequence Divergence. Annu Rev Biophys 46, 85-103 (2017).

41. Schmidt, M. & Hamacher, K. Three-body interactions improve contact prediction within direct-coupling analysis. Physical Review E 96, 052405 (2017).

42. Roweis, S. & Ghahramani, Z. A unifying review of linear gaussian models. Neural Comput 11, 305-345 (1999).

43. Pritchard, J.K., Stephens, M. & Donnelly, P. Inference of population structure using multilocus genotype data. Genetics 155, 945-959 (2000).

44. Patterson, N., Price, A.L. & Reich, D. Population structure and eigenanalysis. PLoS Genet 2, e190 (2006).

45. Kingma, D.P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).

46. Rezende, D.J., Mohamed, S. & Wierstra, D. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082 (2014).

47. Gómez-Bombarelli, R. et al. Automatic chemical design using a data-driven continuous representation of molecules. arXiv preprint arXiv:1610.02415 (2016).

48. Wainwright, M.J. & Jordan, M.I. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning 1, 1-305 (2008).

49. Kingma, D.P. et al. in Advances in Neural Information Processing Systems 4743-4751 (2016).

50. Murphy, K.P. Machine learning: a probabilistic perspective. (MIT press, 2012).

51. Srivastava, N., Hinton, G.E., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. Journal of machine learning research 15, 1929-1958 (2014).

52. Hopf, T.A. et al. Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149, 1607-1621 (2012).

53. Marks, D.S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS One 6, e28766 (2011).

54. Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc Natl Acad Sci U S A 108, E1293-1301 (2011).

55. Jones, D.T., Singh, T., Kosciolek, T. & Tetchner, S. MetaPSICOV: combining coevolution methods for accurate prediction of contacts and long range hydrogen bonding in proteins. Bioinformatics 31, 999-1006 (2015).

56. Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary Landscape Inference and the Context-Dependence of Mutations in Beta-Lactamase TEM-1. Mol Biol Evol 33, 268-280 (2016).

57. Sim, N.L. et al. SIFT web server: predicting effects of amino acid substitutions on proteins. Nucleic acids research 40, W452-457 (2012).

58. Adzhubei, I., Jordan, D.M. & Sunyaev, S.R. Predicting functional effect of human missense mutations using PolyPhen-2. Current protocols in human genetics / editorial board, Jonathan L. Haines ... [et al.] Chapter 7, Unit7 20 (2013).

59. Tubiana, J., Cocco, S. & Monasson, R. Learning protein constitutive motifs from sequence data. arXiv preprint arXiv:1803.08718 (2018).

60. Sinai, S., Kelsic, E., Church, G.M. & Nowak, M.A. Variational auto-encoding of protein sequences. arXiv preprint arXiv:1712.03346 (2017).

61. Rezende, D.J. & Mohamed, S. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770 (2015).

62. Burda, Y., Grosse, R. & Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015).

63. Johnson, M., Duvenaud, D.K., Wiltschko, A., Adams, R.P. & Datta, S.R. Composing graphical models with neural networks for structured representations and fast inference. in Advances in Neural Information Processing Systems 2946-2954 (2016).

64. Ovchinnikov, S. et al. Large-scale determination of previously unsolved protein structures using evolutionary information. Elife 4, e09248 (2015).

65. Weinreb, C. et al. 3D RNA and Functional Interactions from Evolutionary Couplings. Cell 165, 963-975 (2016).

66. Toth-Petroczy, A. et al. Structured States of Disordered Proteins from Genomic Sequences. Cell 167, 158-170 e112 (2016).

67. Starita, L.M. et al. Massively parallel functional analysis of BRCA1 RING domain variants. Genetics 200, 413-422 (2015).

68. Stiffler, M.A., Hekstra, D.R. & Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882-892 (2015).

69. Deng, Z. et al. Deep sequencing of systematic combinatorial libraries reveals β-lactamase sequence constraints at high resolution. Journal of molecular biology 424, 150-167 (2012).

70. Doud, M.B. & Bloom, J.D. Accurate measurement of the effects of all amino-acid mutations on influenza hemagglutinin. Viruses 8, 155 (2016).

71. Wrenbeck, E.E., Azouz, L.R. & Whitehead, T.A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nature communications 8, 15695 (2017).

72. Chan, Y.H., Venev, S.V., Zeldovich, K.B. & Matthews, C.R. Correlation of fitness landscapes from three orthologous TIM barrels originates from sequence and structure constraints. Nature communications 8, 14614 (2017).

73. Kelsic, E.D. et al. RNA Structural Determinants of Optimal Codons Revealed by MAGE- Seq. Cell Syst 3, 563-571 e566 (2016).

74. Brenan, L. et al. Phenotypic characterization of a comprehensive set of MAPK1/ERK2 missense mutants. Cell reports 17, 1171-1183 (2016).

75. Bandaru, P. et al. Deconstruction of the Ras switching cycle through saturation mutagenesis. Elife 6 (2017).

76. Findlay, G.M. et al. Accurate functional classification of thousands of BRCA1 variants with saturation genome editing. bioRxiv, 294520 (2018).

77. Matreyek, K.A. et al. Multiplex Assessment of Protein Variant Abundance by Massively Parallel Sequencing. bioRxiv, 211011 (2018).

78. Klesmith, J.R., Bacik, J.-P., Michalczyk, R. & Whitehead, T.A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in E. coli. ACS synthetic biology 4, 1235-1243 (2015).

79. Haddox, H.K., Dingens, A.S., Hilton, S.K., Overbaugh, J. & Bloom, J.D. Mapping mutational effects along the evolutionary landscape of HIV envelope. Elife 7, e34420 (2018).

80. Pokusaeva, V. et al. Experimental assay of a fitness landscape on a macroevolutionary scale. bioRxiv, 222778 (2018).

81. Weile, J. et al. A framework for exhaustively mapping functional missense variants. Molecular systems biology 13, 957 (2017).

82. Eddy, S.R. Accelerated Profile HMM Searches. PLoS Comput Biol 7, e1002195 (2011).

83. Suzek, B.E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926-932 (2015).

84. Ekeberg, M., Lovkvist, C., Lan, Y., Weigt, M. & Aurell, E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E Stat Nonlin Soft Matter Phys 87, 012707 (2013).

85. Tipping, M.E. & Bishop, C.M. Probabilistic principal component analysis. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 61, 611-622 (1999).

86. Kingma, D. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

87. Berman, H.M. et al. The Protein Data Bank. Nucleic acids research 28, 235-242 (2000).

88. Kyte, J. & Doolittle, R.F. A simple method for displaying the hydropathic character of a protein. Journal of molecular biology 157, 105-132 (1982).

Chapter 2

Predicting the effects of mutations with alignment-free generative models of protein sequences

Adam J. Riesselman1,2, John B. Ingraham2, Aaron Kollasch2,4, Elana P. Simon3, Jung-Eun Shin2, Conor McMahon5, Aashish Manglik6,7, Andrew C. Kruse5, Debora S. Marks1,2*

1Department of Biomedical Informatics, Harvard Medical School
2Department of Systems Biology, Harvard Medical School
3Harvard College
4Division of Medical Sciences, Harvard Medical School
5Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School
6Department of Pharmaceutical Chemistry, University of California San Francisco
7Department of Anesthesia and Perioperative Care, University of California San Francisco
*Corresponding author

Author contributions: A.J.R., J.B.I., A.M., A.C.K., and D.S.M. conceived of the study. A.J.R. developed the model, made the model predictions, and formatted the data. A.K., E.P.S., J.S. and C.M. performed analysis. A.J.R. and D.S.M. wrote the manuscript.

Abstract

Protein sequences are constrained by complex interactions between the residues that compose them, and models that can capture genetic variation are powerful tools in predicting the effects of mutations and generating new sequences with improved function in the laboratory.

However, current state-of-the-art approaches rely on aligned protein sequences, which greatly limits the types of predictions that can be made. Borrowing from recent advances in natural language processing and speech synthesis, autoregressive models can be fit to a set of sequences to find functional constraints without imposing a strict structure on the data. We develop a deep neural network-powered autoregressive model for protein sequences and validate its performance on recently published high-throughput mutagenesis experiments. We then predict the effects of insertions and deletions on protein function. Finally, we fit our model to the llama immune repertoire to predict the functionality of new nanobody sequences as well as rationally design a library of nanobodies with improved biomolecular properties.

Introduction

Designing proteins with a targeted function is a major goal of biotechnology and biomedicine. Existing techniques to characterize variants are both expensive and time-consuming because the space of potentially functional sequences is astronomically large. DNA synthesis techniques have allowed for the phenotyping of thousands of variants in a single experiment1,2, and machine learning techniques applied to these data can refine variants with improved properties3-7. However, the sampled diversity of sequences in these experiments pales in comparison to that found in nature. Vast amounts of genetic variation can be used to guide the design of proteins8-12, but many of the sequence features that determine function are still unknown. An ideal approach would leverage and interpret natural genetic variation and translate it to testable sequences in the lab.

Generative models provide a principled way of deciphering biological sequence variation. Hidden Markov Models (HMMs) are a class of generative models that can both organize a set of related sequences into an alignment and find new members that are likely to be related from large sequence databases13,14. Once aligned, pairwise generative models fit to genetic data can uncover epistatic constraints in biomolecules as well as predict the effects of mutations15-18. Recently, we and others have shown that neural network-powered deep generative models can be fit to biological sequence data for variant effect prediction19,20. Though these techniques are extremely useful, they also have shortcomings. Both the pairwise and latent variable models hinge on accurate multiple sequence alignments. Moreover, insertions and deletions in sequences are either largely ignored or treated as another amino acid or nucleotide character, though they have large consequences for understanding human genetic variation, cancer genomics, and protein engineering. HMMs can model sequences of any length by probabilistically aligning residues to a set of states in a profile, but capturing higher-order dependencies between sets of residues or modelling low-complexity regions remains a challenge13.

Complex, sequential data are common in other scientific disciplines. Text-to-speech21,22 and language translation23,24 have no simple structured data preprocessing step available before modelling, analogous to alignments in genomics. Autoregressive models can naturally formulate a distribution over these data by decomposing the likelihood into per-time-step conditional distributions, each conditioned on all previous values25 (Figure 2.1). In our case, our dataset is simply a collection of genetic sequences, and each timestep is a residue in a sequence. Autoregressive models can be parameterized by deep neural networks and capture arbitrary-order, nonlinear interactions between residues26. Though many architectures can model an autoregressive likelihood, dilated convolutional neural networks have proven successful in the domains of speech generation27 and natural language processing28.


Figure 2.1 – Predicting characters with probabilistic models of sequences. Autoregressive models of sequences find probable characters conditioned on previous values. This relationship is easy to observe with natural language (above), but more cryptic constraints define the functional landscape of proteins (below).

One application ripe for an autoregressive framework is nanobodies, important molecular biology, therapeutic, and diagnostic tools29 that are difficult to model with traditional sequence analysis techniques30-32. These single-domain antibodies, found in camelids and certain shark species, undergo recombination and affinity maturation in the host to hone specific binding to an epitope while ensuring the protein is still folded. The complementarity determining regions (CDRs), three loop regions on the protein, are hypermutated and have the largest impact on binding29. CDR3, the third and most genetically and structurally diverse loop, is a primary determinant of nanobody binding31,33,34. To understand the constraints of nanobody function, more powerful tools to explain patterns in sequence variation are needed.

In this work, we describe an alignment-free model of protein families capable of parameterizing residue dependencies. We build autoregressive models of protein families with deep neural networks, validate these models with previously published deep mutational scanning data, and predict the effects of insertions and deletions. We then fit this model to a repertoire of natural nanobody sequences and use it to create a synthetic nanobody library of ~200K diverse CDR3 variants with improved biochemical properties.

Results

Autoregressive models of biological sequences

Over timescales ranging from billions of years for the most core, essential genes to a few weeks during affinity maturation of antibodies in the immune system, protein sequences observed in organisms today have undergone mutation and selection to be functional, folded biomolecules. Generative models can be used to parameterize this process. Namely, the constraints essential to making a functional sequence $\mathbf{x}$ are captured by a generative model $p(\mathbf{x}|\theta)$ encoded by parameters $\theta$.

Autoregressive models do not require a structured alignment input, nor do they utilize a finite number of states in a sequence profile. The probability distribution $p(\mathbf{x}|\theta)$ is decomposed into the product of conditional probabilities over previous characters along a sequence of length $L$ (Figure 2.2a):

$$p(\mathbf{x}|\theta) = p(x_1|\theta) \prod_{i=2}^{L} p(x_i \,|\, x_1, \ldots, x_{i-1}, \theta).$$

This temporal generative structure is similar to that of HMMs; however, here the dependency of a residue is conditioned on all the previous characters with a set of nonlinear transformations. We parameterize this process with dilated convolutional neural networks (Figure 2.2b, Methods). These are feed-forward deep neural networks that aggregate long-range dependencies in sequences over an exponentially large receptive field27,28. The causal structure of the model allows for efficient training on a set of sequences, inference of the effects of mutations, and sampling of new sequences.
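As a minimal sketch, the log-probability of a sequence under this decomposition can be computed by summing per-residue conditional log-probabilities; here conditional is a placeholder standing in for the trained network.

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"              # 20 amino acids plus a start/stop character

def sequence_log_prob(seq, conditional):
    """Sum of per-residue conditional log-probabilities, including the stop character.

    conditional(prefix) must return a length-21 probability vector for the next character.
    """
    log_p = 0.0
    for i, aa in enumerate(seq + "*"):
        p_next = conditional(seq[:i])           # p(x_i | x_1, ..., x_{i-1})
        log_p += np.log(p_next[ALPHABET.index(aa)])
    return log_p
```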

Figure 2.2 – Autoregressive models of biological sequences. a. Instead of finding correlations between columns in a multiple sequence alignment (left), the autoregressive model predicts a residue given all the preceding values (right). b. Causal dilated convolutions are used to model the autoregressive likelihood. This feed-forward architecture is able to capture long-range interactions with an exponentially large receptive field.

Predicting the effects of mutations

Once trained on a set of sequences, generative models $p(\mathbf{x}|\theta)$ can estimate the probability that a mutated sequence $\mathbf{x}^{(\text{mutant})}$ satisfies the constraints of the sequence family, relative to a functional sequence $\mathbf{x}^{(\text{wild-type})}$. In particular, the log-ratio

$$\log \frac{p(\mathbf{x}^{(\text{mutant})}|\theta)}{p(\mathbf{x}^{(\text{wild-type})}|\theta)}$$

has been shown to accurately predict the effect of a mutation in a purely unsupervised fashion15,20.
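Building on the sketch above, the mutation effect and the length-normalized bits-per-residue score (used below for insertions and deletions) can be written as:

```python
import numpy as np

def mutation_effect(seq_mut, seq_wt, conditional):
    """Log-ratio of mutant and wild-type sequence probabilities."""
    return sequence_log_prob(seq_mut, conditional) - sequence_log_prob(seq_wt, conditional)

def bits_per_residue(seq, conditional):
    """Length-normalized log-probability in bits, comparable across sequence lengths."""
    return sequence_log_prob(seq, conditional) / (np.log(2) * len(seq))
```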

To determine the sensitivity of the autoregressive model to mutation effects, we gathered 40 deep mutational scans from the literature across 33 different proteins, totaling 690,257 individual sequences. We fit three replicates of an autoregressive model to the unaligned sequences belonging to each sequence family in both the N-to-C and C-to-N orientation using two different network sizes (Methods). We compare the log-probability ratio for these mutants with other generative models trained on the same set of aligned sequences, including a profile HMM; a site-independent model; EVmutation, a Markov random field model with pairwise dependencies between positions; and DeepSequence, a latent variable model.

The autoregressive model is consistently able to predict the effects of mutations across a wide array of proteins and experimental assays (Figure 2.3). The autoregressive model with more parameters (channel width 48) was more predictive than the smaller model (Spearman ρ better on 24/40 datasets). However, the smaller model was more predictive of the effects of mutations in viral proteins (5/5 datasets), presumably due to overfitting of the larger models to limited sequence diversity (Supplementary Figure 2.1). It has previously been reported that reversing the order of the sequence can improve learning in the context of natural language processing24; models fit to sequences in an N-to-C orientation were slightly more predictive than the reverse direction (Supplementary Figure 2.2).


Figure 2.3 – Generative models predict the effects of mutations. We fit generative models to 33 different sequence families and compare effect predictions from 40 experiments.

We then compare the effect predictions made from the large autoregressive model to the other generative models fit to the same alignment. The autoregressive model consistently outperforms a profile HMM on the same data (Spearman ρ same or better on 34/40 datasets); is able to consistently match or outperform a model with only site-specific terms (26/40 datasets) and EVmutation (26/40 datasets); and competitively matches the state-of-the-art results of DeepSequence (16/40 datasets).

Previous generative models built using alignments are constrained to predicting missense mutations. However, in-frame insertions and deletions can also have large phenotypic consequences on protein function, yet these changes have proved difficult to model. By normalizing the log-probability of a sequence by its length, the negative bits per residue calculated by the autoregressive model can score the likelihood of a sequence harboring any number of missense mutations, insertions, and deletions. We find that the autoregressive model can capture these complex, nonlinear effects on imidazoleglycerol-phosphate dehydratase (AUC=0.92, Spearman ρ=0.68, N=6102; Figure 2.4a, Supplementary Figure 2.3). The autoregressive model is also predictive of single amino acid deletions in the PTEN phosphatase (AUC=0.83, Spearman ρ=0.67, N=340; Figure 2.4b). These predictions reveal the tolerance to amino acid deletions of surface-exposed loops and the C2 domain, but not the phosphatase domain, in the PTEN phosphatase, all in an unsupervised manner (Figure 2.4c).


Figure 2.4 – Predicting the effects of insertions and deletions. The negative bits per residue calculated by the autoregressive model can predict the effects of (a) multiple insertions and deletions in imidazoleglycerol-phosphate dehydratase and (b) single amino acid deletions in the PTEN phosphatase. c. These predictions reveal tolerance to deletions in surface loops and the C2 domain (pdb: 1d5r).


Generating a ‘smart’ synthetic nanobody library

Nanobodies are highly valued molecular tools that can be engineered to specifically target an antigen of interest. However, traditional in vivo generation of nanobodies is a slow, laborious process. Synthetic library approaches aim to shorten this process, typically by coupling library synthesis with codon-randomized CDRs to in vitro high-throughput sorting of high-affinity nanobody clones35-38. However, many of these clones have poor biochemical properties including instability, polyspecificity, and aggregation35. In vivo, nanobody sequences undergo affinity maturation, where these unfavorable properties are selected against29. We sought to learn the constraints that characterize functional nanobodies by fitting the autoregressive model to a set of ~1.2 million nanobody sequences from the llama immune repertoire39 (Methods).

We found that bits-per-residue calculations were able to predict the thermostability of new llama nanobody sequences40 (Spearman ρ=0.53, N=17; Figure 2.5a) and the expression of nanobody sequences35 (Supplementary Table 2.1).

To improve upon the synthetic nanobody library approach, we used our autoregressive model to generate a set of CDR3 sequences with improved properties (Figure 2.5b). Conditioned on the germline nanobody sequence, we generated ~3.7 million nonredundant, nonreactive (lacking sulfur-containing amino acids, glycosylation motifs, and asparagine deamidation motifs) CDR3 sequences by greedy sampling, all of which our model deemed putatively functional. To design a diverse library and increase the possibility of finding a high-affinity clone during selection, we used BIRCH clustering41 on the L2-normalized kmer vectors of the CDR3 sequences. The resulting library of 185,836 variants contains highly diverse, biochemically well-behaved, putatively fit CDR3 sequences that have the same distribution of properties as the naïve llama immune repertoire (Figure 2.5c).

Figure 2.5 – Autoregressive models for nanobody sequences. a. The autoregressive model predicts the melting temperature (Tm) of nanobody sequences. b. Using the same model, we then generate a diverse CDR3 library conditioned on the preceding germline sequence. c. Though never found in the original dataset, this synthetic library captures the modes of properties (length, isoelectric point, and hydrophobicity) of the natural nanobody repertoire.

Discussion

Here we show how neural network-powered autoregressive models can predict the effects of mutations in an unsupervised manner from evolutionary sequences alone. These techniques do not rely on alignments and can predict the effects of missense mutations, insertions, and deletions in a probabilistic manner. We then use the model to generate a diverse synthetic nanobody library with improved biochemical properties.

Many different neural network architectures can model an autoregressive likelihood, including attention-based models42 and recurrent neural networks25. However, we moved away from LSTM43 and GRU44 architectures because training led to exploding gradients45 on long sequence families. Latent variables can also be incorporated into autoregressive models for controlled generation of sequences46. Though others have reported difficulty training these models47, new machine learning techniques may make them practical in the future48-50. We also anticipate better strategies to explore diverse yet fit regions of sequence space during generation, either by exploiting variance explained by latent variables51 or diverse beam search strategies52.

Due to their flexibility, deep autoregressive models will also open the door to new opportunities in probabilistic biological sequence analysis. Because no explicit homology between sequences is required, unlike alignment-based techniques, generative models of disordered proteins, multiple protein families, promoters and enhancers, or even entire genomes can be fit with autoregressive likelihoods. With the increased number of available sequences and growth in both computing power and new machine learning algorithms, autoregressive sequence models will be indispensable for biological sequence analysis into the future.

Methods

Model

Autoregressive models formulate a probability distribution over a set of sequences $p(\mathbf{x}|\theta)$. Instead of transforming the set of sequences into a matrix via a multiple sequence alignment15,20, we adopt an autoregressive likelihood. This formulation decomposes the probability distribution into a product of conditional probabilities over previous residues:

$$p(\mathbf{x}|\theta) = p(x_1|\theta) \prod_{i=2}^{L} p(x_i \,|\, x_1, \ldots, x_{i-1}, \theta).$$

Protein sequences are represented by a 21-letter alphabet, one for each amino acid type, as well as a ‘start/stop’ character (*). The negative log-likelihood for a sequence is the sum over positions of the cross-entropy between the true amino acid at each position and the predicted distribution over possible amino acids, conditioned on the previous characters.

We adopt a residual causal dilated convolutional neural network architecture with both weight normalization53 and layer normalization54. To help prevent overfitting, we use L2 regularization on the weights and place Dropout layers (p = 0.5) immediately after each residual block55. We use a batch size of 30; 6 residual blocks, each with dilation rates of [1, 2, 4, 8, 16, 32, 64, 128, 200]; and a channel size of either 24 or 48 for all protein families tested. We also make an ensembled model by simultaneously training two models, each conditioning the generation of a character on the characters before or after it, respectively. In total, twelve models are built for each protein family. Each model is trained for 250,000 updates using Adam with default parameters56, and the gradient norm is clipped45 to 100.
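A hedged PyTorch sketch of one residual block in this architecture follows; the kernel size, activation, and exact placement of the normalizations are assumptions, not necessarily the trained configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalBlock(nn.Module):
    """One residual causal dilated convolution layer (illustrative configuration)."""
    def __init__(self, channels=48, dilation=1, kernel_size=2, p_drop=0.5):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation           # left-pad so no future positions leak
        conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        self.conv = nn.utils.weight_norm(conv)            # weight normalization
        self.norm = nn.LayerNorm(channels)                # layer normalization over channels
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                                 # x: [batch, channels, length]
        h = F.pad(x, (self.pad, 0))                       # causal (left-only) padding
        h = F.relu(self.conv(h))
        h = self.norm(h.transpose(1, 2)).transpose(1, 2)
        return x + self.drop(h)                           # residual connection

# one residual block in the text stacks the dilation rates 1 through 200:
stack = nn.Sequential(*[CausalBlock(dilation=d)
                        for d in [1, 2, 4, 8, 16, 32, 64, 128, 200]])
```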

Data collection

We first collected deep mutational scanning data from 40 experiments on 33 proteins from the literature16,57-91. To gather examples of sequences to fit the model, we generated multiple sequence alignments with Jackhmmer14, using a previously published procedure20 and searching against the UniRef100 database (release: November 2017)92. A bit score of 0.5 per residue was used as an inclusion threshold and tuned to ensure at least 80% column coverage with the query sequence.

Sequence weighting

Proteins and RNA from commonly sequenced organisms appear more frequently in sequence databases than rare members of a family. We weight sequences to help reduce this bias93. To calculate the weights for sequences in the validation experiments, we used a weighting scheme based on the multiple sequence alignment. This weighting function is defined over regions in the alignment that are less than 50% gap characters and columns with over 70% coverage. Each sequence is given a weight $w_n$ as the reciprocal of the number of sequences within a normalized Hamming distance cutoff $\theta$:

$$w_n = \left( \sum_{m} \mathbb{1}\!\left[ d(x_n, x_m) < \theta \right] \right)^{-1},$$

where $\theta = 0.2$ (80% sequence identity) and $d(x_n, x_m)$ is the normalized Hamming distance between a sequence of interest and another sequence in the alignment. This weight is then paired with the gapless sequence region retrieved from the database, which is used during training.

During training, sequences are chosen for a minibatch proportional to their weight with respect to the rest of their family, i.e. diverse sequences are chosen more frequently than more common sequences:

$$p(x_n) = \frac{w_n}{\sum_m w_m}.$$
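A minimal sketch of this weighting and sampling scheme, assuming an integer-encoded alignment restricted to the well-covered region described above; function names are illustrative.

```python
import numpy as np

def sequence_weights(msa, theta=0.2):
    """msa: [N x L] integer-encoded alignment restricted to well-covered columns."""
    N = msa.shape[0]
    weights = np.zeros(N)
    for n in range(N):                                # O(N^2) pairwise comparison
        dists = np.mean(msa != msa[n], axis=1)        # normalized Hamming distances
        weights[n] = 1.0 / np.sum(dists < theta)      # count includes the sequence itself
    return weights

def sample_minibatch(weights, batch_size, rng):
    """Draw sequence indices with probability proportional to their weights."""
    p = weights / weights.sum()
    return rng.choice(len(weights), size=batch_size, p=p)
```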

Due to the large number of sequences in the llama immune repertoire, calculating the full pairwise distance matrix for sequence weighting was infeasible. We approximate this weighting scheme by assigning each sequence to a cluster using Linclust94 at both 80% and 90% sequence identity thresholds (ID). During training, to select a sequence for the minibatch, we first select an 80% ID cluster, with replacement, with an equal probability of choosing each cluster. Next, within that 80% ID cluster, a 90% ID cluster is chosen. Finally, a sequence is randomly chosen from that 90% ID cluster.

Nanobody generation by model sampling

We sampled the N-to-C model with a greedy sampling algorithm conditioned on the germline framework-CDR1-CDR2 sequence, using an inverse temperature of 1.0. Using this strategy, a multinomial distribution over the amino acid alphabet is generated conditioned on previous outputs and sampled, one amino acid at a time. After an amino acid is chosen according to these probabilities, it is added to the end of the growing sequence. The process is terminated when the model generates a stop character. Here we adopted the germline nanobody sequence found most commonly in the repertoire:

EVQLVESGGGLVQAGGSLRLSCAASGFTFSSYAMGWYRQAPGKEREFVAAISWSGGSTYYADSVKGRFTISRDNAKNTVYLQMNSLKPEDTAVYYC[generation begins]

In total, 37,744,800 sequences were generated with greedy sampling. CDR3 sequences were only considered if they had not been generated previously and the generated final beta strand matched the reference nanobody sequence. CDR3 sequences were also removed if they contained glycosylation (NxS and NxT) sites, asparagine deamidation (NG) motifs, or sulfur-containing amino acids (cysteine and methionine). This process resulted in 3,690,554 sequences.
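A sketch of this sampling loop, with conditional again standing in for the trained network and the maximum CDR3 length chosen arbitrarily:

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY*"
GERMLINE = ("EVQLVESGGGLVQAGGSLRLSCAASGFTFSSYAMGWYRQAPGKEREFVAAISWSGGSTYYADSVKG"
            "RFTISRDNAKNTVYLQMNSLKPEDTAVYYC")

def sample_cdr3(conditional, rng, max_len=40):
    """Sample one CDR3, one amino acid at a time, conditioned on the germline prefix."""
    seq = GERMLINE
    while len(seq) < len(GERMLINE) + max_len:
        p_next = conditional(seq)                     # distribution over the 21 characters
        aa = ALPHABET[rng.choice(len(ALPHABET), p=p_next)]
        if aa == "*":                                 # stop character terminates generation
            break
        seq += aa
    return seq[len(GERMLINE):]                        # keep only the generated CDR3
```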

Length-normalized log-probabilities were then calculated for each sequence using both the N-to-C and C-to-N models for downstream processing.

Library selection

From this large number of sequences, we then sought to choose 185,836 CDR3 sequences that are both deemed fit by the model and as diverse from one another as possible, to cover the largest amount of sequence space. Since we cannot cluster CDR3 sequences with traditional alignment-based tools, we turned to clustering methods that use kmer counts. First, we featurized these sequences into a fixed-length vector by counting all possible kmers of size 1, 2, and 3, resulting in a 6174-dimensional vector of counts for each CDR3 sequence. The resulting vector was then L2-normalized to create a kmer unit vector.

We then used BIRCH clustering41 to find diverse members of the dataset in O(n) time. We tuned the threshold parameter until the number of initial clusters was ~2x larger than the number of clusters we want in the final library. For a final library size of 185,836 sequences, we used a threshold of 0.575, resulting in 382,675 clusters. Each cluster with more than one member was represented by a centroid sequence, the member that looks like the “average” member of that cluster. We then picked the top 185,836 most probable sequences from the set of representative sequences for the final library construction.
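A sketch of the featurization and clustering, using scikit-learn's Birch implementation; counting kmers of size 1-3 over the 18 amino acids remaining after excluding cysteine and methionine yields 18 + 18² + 18³ = 6174 dimensions, consistent with the text, though the exact alphabet used is an inference.

```python
import itertools
import numpy as np
from sklearn.cluster import Birch

AA = "ADEFGHIKLNPQRSTVWY"   # 18 letters: cysteine and methionine were excluded upstream
KMERS = [''.join(kmer) for n in (1, 2, 3) for kmer in itertools.product(AA, repeat=n)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}     # 18 + 18^2 + 18^3 = 6174 features

def kmer_unit_vector(seq):
    """Count kmers of size 1-3 and L2-normalize to a unit vector."""
    v = np.zeros(len(KMERS))
    for n in (1, 2, 3):
        for i in range(len(seq) - n + 1):
            v[INDEX[seq[i:i + n]]] += 1
    return v / np.linalg.norm(v)

def cluster_cdr3s(seqs, threshold=0.575):
    """Cluster CDR3 kmer vectors with BIRCH; returns a subcluster label per sequence."""
    X = np.stack([kmer_unit_vector(s) for s in seqs])
    return Birch(threshold=threshold, n_clusters=None).fit_predict(X)
```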

Hidden Markov Model

We fit a hidden Markov model to each sequence family with the HMMER suite14 using default parameters. The log-probability of each sequence is calculated using the forward algorithm.

References

1 Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nature methods 11, 499 (2014).

2 Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nature methods 11, 801 (2014).

3 Biswas, S. et al. Toward machine-guided design of proteins. bioRxiv, 337154 (2018).

4 Liao, J. et al. Engineering proteinase K using machine learning and synthetic genes. BMC biotechnology 7, 16 (2007).

5 Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences 110, E193-E201 (2013).

6 Fox, R. J. et al. Improving catalytic function by ProSAR-driven enzyme evolution. Nature biotechnology 25, 338 (2007).

7 Bedbrook, C. N., Yang, K. K., Rice, A. J., Gradinaru, V. & Arnold, F. H. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS computational biology 13, e1005786 (2017).

8 Reynolds, K. A., Russ, W. P., Socolich, M. & Ranganathan, R. in Methods in enzymology Vol. 523 213-235 (Elsevier, 2013).

9 Grisoni, F. et al. Designing anticancer peptides by constructive machine learning. ChemMedChem (2018).

10 Wheeler, L. C., Lim, S. A., Marqusee, S. & Harms, M. J. The thermostability and specificity of ancient proteins. Current opinion in structural biology 38, 37-43 (2016).

11 Currin, A., Swainston, N., Day, P. J. & Kell, D. B. Synthetic biology for the directed evolution of protein biocatalysts: navigating sequence space intelligently. Chemical Society Reviews 44, 1172-1239 (2015).

12 Tian, P., Louis, J. M., Baber, J. L., Aniana, A. & Best, R. B. Co-evolutionary fitness landscapes for sequence design. Angewandte Chemie International Edition 57, 5674-5678 (2018).

13 Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. (Cambridge university press, 1998).

14 Eddy, S. R. Accelerated profile HMM searches. PLoS computational biology 7, e1002195 (2011).

61 15 Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nature biotechnology 35, 128 (2017).

16 Mann, J. K. et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS computational biology 10, e1003776 (2014).

17 Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Molecular biology and evolution 33, 268-280 (2015).

18 Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. arXiv preprint arXiv:1207.2484 (2012).

19 Sinai, S., Kelsic, E., Church, G. M. & Nowak, M. A. Variational auto-encoding of protein sequences. arXiv preprint arXiv:1712.03346 (2017).

20 Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nature methods 15, 816-822 (2018).

21 Graves, A., Mohamed, A.-r. & Hinton, G. in 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 6645-6649 (IEEE).

22 Wang, Y. et al. Tacotron: Towards end-to-end speech synthesis. arXiv preprint arXiv:1703.10135 (2017).

23 Bahdanau, D., Cho, K. & Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014).

24 Sutskever, I., Vinyals, O. & Le, Q. V. in Advances in neural information processing systems. 3104-3112.

25 Sutskever, I., Martens, J. & Hinton, G. E. in Proceedings of the 28th International Conference on Machine Learning (ICML-11). 1017-1024.

26 LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. nature 521, 436 (2015).

27 Oord, A. v. d. et al. Wavenet: A generative model for raw audio. arXiv preprint arXiv:1609.03499 (2016).

28 Kalchbrenner, N. et al. Neural machine translation in linear time. arXiv preprint arXiv:1610.10099 (2016).

29 Muyldermans, S. Nanobodies: natural single-domain antibodies. Annual review of biochemistry 82, 775-797 (2013).

30 Lefranc, M. The IMGT unique numbering for immunoglobulins, T-cell receptors, and Ig- like domains. Immunologist 7, 132-136 (1999).

62 31 Elhanati, Y. et al. Inferring processes underlying B-cell repertoire diversity. Phil. Trans. R. Soc. B 370, 20140243 (2015).

32 Souto-Carneiro, M. M., Longo, N. S., Russ, D. E., Sun, H.-w. & Lipsky, P. E. Characterization of the human Ig heavy chain antigen binding complementarity determining region 3 using a newly developed software algorithm, JOINSOLVER. The Journal of Immunology 172, 6790-6802 (2004).

33 Xu, J. L. & Davis, M. M. Diversity in the CDR3 region of VH is sufficient for most antibody specificities. Immunity 13, 37-45 (2000).

34 Rock, E. P., Sibbald, P. R., Davis, M. M. & Chien, Y.-H. CDR3 length in antigen- specific immune receptors. Journal of Experimental Medicine 179, 323-328 (1994).

35 McMahon, C. et al. Yeast surface display platform for rapid discovery of conformationally selective nanobodies. Report No. 1545-9985, (Nature Publishing Group, 2018).

36 Moutel, S. et al. NaLi-H1: A universal synthetic library of humanized nanobodies providing highly functional antibodies and intrabodies. Elife 5, e16228 (2016).

37 Yan, J., Li, G., Hu, Y., Ou, W. & Wan, Y. Construction of a synthetic phage-displayed Nanobody library with CDR3 regions randomized by trinucleotide cassettes for diagnostic applications. Journal of translational medicine 12, 343 (2014).

38 Reiter, Y., Schuck, P., Boyd, L. F. & Plaksin, D. An antibody single-domain phage display library of a native heavy chain variable region: isolation of functional single-domain VH molecules with a unique interface1. Journal of molecular biology 290, 685-698 (1999).

39 McCoy, L. E. et al. Molecular evolution of broadly neutralizing Llama antibodies to the CD4-binding site of HIV-1. PLoS pathogens 10, e1004552 (2014).

40 Kunz, P. et al. Exploiting sequence and stability information for directing nanobody stability engineering. Biochimica et Biophysica Acta (BBA)-General Subjects 1861, 2196-2205 (2017).

41 Zhang, T., Ramakrishnan, R. & Livny, M. in ACM Sigmod Record. 103-114 (ACM).

42 Vaswani, A. et al. in Advances in Neural Information Processing Systems. 5998-6008 (2017).

43 Hochreiter, S. & Schmidhuber, J. Long short-term memory. Neural computation 9, 1735-1780 (1997).

44 Cho, K. et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).

45 Pascanu, R., Mikolov, T. & Bengio, Y. in International Conference on Machine Learning. 1310-1318 (2013).

46 Bowman, S. R. et al. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349 (2015).

47 Killoran, N., Lee, L. J., Delong, A., Duvenaud, D. & Frey, B. J. Generating and designing DNA with deep generative models. arXiv preprint arXiv:1712.06148 (2017).

48 Kim, Y., Wiseman, S., Miller, A. C., Sontag, D. & Rush, A. M. Semi-Amortized Variational Autoencoders. arXiv preprint arXiv:1802.02550 (2018).

49 Yang, Z., Hu, Z., Salakhutdinov, R. & Berg-Kirkpatrick, T. Improved variational autoencoders for text modeling using dilated convolutions. arXiv preprint arXiv:1702.08139 (2017).

50 van den Oord, A. & Vinyals, O. in Advances in Neural Information Processing Systems. 6306-6315 (2017).

51 Greener, J. G., Moffat, L. & Jones, D. T. Design of metalloproteins and novel protein folds using variational autoencoders. Scientific Reports 8, 16189 (2018).

52 Vijayakumar, A. K. et al. Diverse beam search: Decoding diverse solutions from neural sequence models. arXiv preprint arXiv:1610.02424 (2016).

53 Salimans, T. & Kingma, D. P. in Advances in Neural Information Processing Systems. 901-909 (2016).

54 Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).

55 Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15, 1929-1958 (2014).

56 Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

57 Weile, J. et al. A framework for exhaustively mapping functional missense variants. Molecular systems biology 13, 957 (2017).

58 Matreyek, K. A. et al. Multiplex assessment of protein variant abundance by massively parallel sequencing. Nature genetics, 1 (2018).

59 Mighell, T. L., Evans-Dutson, S. & O’Roak, B. J. A Saturation Mutagenesis Approach to Understanding PTEN Lipid Phosphatase Activity and Genotype-Phenotype Relationships. The American Journal of Human Genetics 102, 943-955 (2018).

60 Chan, Y. H., Venev, S. V., Zeldovich, K. B. & Matthews, C. R. Correlation of fitness landscapes from three orthologous TIM barrels originates from sequence and structure constraints. Nature communications 8, 14614 (2017).

61 Brenan, L. et al. Phenotypic characterization of a comprehensive set of MAPK1/ERK2 missense mutants. Cell reports 17, 1171-1183 (2016).

62 Pokusaeva, V. et al. Experimental assay of a fitness landscape on a macroevolutionary scale. bioRxiv, 222778 (2018).

63 Klesmith, J. R., Bacik, J.-P., Michalczyk, R. & Whitehead, T. A. Comprehensive sequence-flux mapping of a levoglucosan utilization pathway in E. coli. ACS synthetic biology 4, 1235-1243 (2015).

64 Haddox, H. K., Dingens, A. S., Hilton, S. K., Overbaugh, J. & Bloom, J. D. Mapping mutational effects along the evolutionary landscape of HIV envelope. Elife 7, e34420 (2018).

65 Wrenbeck, E. E., Azouz, L. R. & Whitehead, T. A. Single-mutation fitness landscapes for an enzyme on multiple substrates reveal specificity is globally encoded. Nature communications 8, 15695 (2017).

66 Findlay, G. M. et al. Accurate classification of BRCA1 variants with saturation genome editing. Nature 562, 217 (2018).

67 Wu, N. C. et al. Functional constraint profiling of a viral protein reveals discordance of evolutionary conservation and functionality. PLoS genetics 11, e1005310 (2015).

68 Rockah-Shmuel, L., Tóth-Petróczy, Á. & Tawfik, D. S. Systematic mapping of protein mutational space by prolonged drift reveals the deleterious effects of seemingly neutral mutations. PLoS computational biology 11, e1004421 (2015).

69 Qi, H. et al. A quantitative high-resolution genetic profile rapidly identifies sequence determinants of hepatitis C viral fitness and drug sensitivity. PLoS pathogens 10, e1004064 (2014).

70 Jacquier, H. et al. Capturing the mutational landscape of the beta-lactamase TEM-1. Proceedings of the National Academy of Sciences 110, 13067-13072 (2013).

71 Roscoe, B. P., Thayer, K. M., Zeldovich, K. B., Fushman, D. & Bolon, D. N. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. Journal of molecular biology 425, 1363-1377 (2013).

72 Kitzman, J. O., Starita, L. M., Lo, R. S., Fields, S. & Shendure, J. Massively parallel single-amino-acid mutagenesis. Nature methods 12, 203-206 (2015).

73 Bershtein, S., Mu, W. & Shakhnovich, E. I. Soluble oligomerization provides a beneficial fitness effect on destabilizing mutations. Proceedings of the National Academy of Sciences 109, 4857-4862 (2012).

74 Stiffler, M. A., Hekstra, D. R. & Ranganathan, R. Evolvability as a function of purifying selection in TEM-1 β-lactamase. Cell 160, 882-892 (2015).

75 McLaughlin Jr, R. N., Poelwijk, F. J., Raman, A., Gosal, W. S. & Ranganathan, R. The spatial architecture of protein function and adaptation. Nature 491, 138-142 (2012).

76 Halabi, N., Rivoire, O., Leibler, S. & Ranganathan, R. Protein sectors: evolutionary units of three-dimensional structure. Cell 138, 774-786 (2009).

77 Melnikov, A., Rogov, P., Wang, L., Gnirke, A. & Mikkelsen, T. S. Comprehensive mutational scanning of a kinase in vivo reveals substrate-dependent fitness landscapes. Nucleic acids research 42, e112-e112 (2014).

78 Firnberg, E., Labonte, J. W., Gray, J. J. & Ostermeier, M. A comprehensive, high-resolution map of a gene’s fitness landscape. Molecular biology and evolution 31, 1581-1592 (2014).

79 Deng, Z. et al. Deep sequencing of systematic combinatorial libraries reveals β-lactamase sequence constraints at high resolution. Journal of molecular biology 424, 150-167 (2012).

80 Starita, L. M. et al. Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis. Proceedings of the National Academy of Sciences 110, E1263-E1272 (2013).

81 Philip, A. F., Kumauchi, M. & Hoff, W. D. Robustness and evolvability in the functional anatomy of a PER-ARNT-SIM (PAS) domain. Proceedings of the National Academy of Sciences 107, 17986-17991 (2010).

82 Aakre, C. D. et al. Evolving new protein-protein interaction specificity through promiscuous intermediates. Cell 163, 594-606 (2015).

83 Starita, L. M. et al. Massively parallel functional analysis of BRCA1 RING domain variants. Genetics 200, 413-422 (2015).

84 Melamed, D., Young, D. L., Gamble, C. E., Miller, C. R. & Fields, S. Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly (A)-binding protein. Rna 19, 1537-1551 (2013).

85 Mavor, D. et al. Determination of ubiquitin fitness landscapes under different chemical stresses in a classroom setting. Elife 5, e15802 (2016).

86 Di Nardo, A. A., Larson, S. M. & Davidson, A. R. The relationship between conservation, thermodynamic stability, and function in the SH3 domain hydrophobic core. Journal of molecular biology 333, 641-655 (2003).

87 Araya, C. L. et al. A fundamental protein property, thermodynamic stability, revealed solely from large-scale measurements of protein function. Proceedings of the National Academy of Sciences 109, 16858-16863 (2012).

88 Roscoe, B. P. & Bolon, D. N. Systematic exploration of ubiquitin sequence, E1 activation efficiency, and experimental fitness in yeast. Journal of molecular biology 426, 2854-2870 (2014).

89 Mishra, P., Flynn, J. M., Starr, T. N. & Bolon, D. N. Systematic mutant analyses elucidate general and client-specific aspects of Hsp90 function. Cell reports 15, 588-598 (2016).

90 Doud, M. B. & Bloom, J. D. Accurate measurement of the effects of all amino-acid mutations on influenza hemagglutinin. Viruses 8, 155 (2016).

91 Romero, P. A., Tran, T. M. & Abate, A. R. Dissecting enzyme function with microfluidic-based deep mutational scanning. Proceedings of the National Academy of Sciences 112, 7159-7164 (2015).

92 Suzek, B. E. et al. UniRef clusters: a comprehensive and scalable alternative for improving sequence similarity searches. Bioinformatics 31, 926-932 (2014).

93 Ekeberg, M., Lövkvist, C., Lan, Y., Weigt, M. & Aurell, E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Physical Review E 87, 012707 (2013).

94 Steinegger, M. & Söding, J. Clustering huge protein sequence sets in linear time. Nature communications 9, 2542 (2018).

Chapter 3

Refining nanobody affinity with algorithmically-guided directed evolution

Adam J. Riesselman1,2, Ishan Deshpande4,5, Jiahao Liang4,5, Andrew C. Kruse3, Aashish Manglik4,5*, Debora S. Marks1,2*

1Department of Biomedical Informatics, Harvard Medical School
2Department of Systems Biology, Harvard Medical School
3Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School
4Department of Pharmaceutical Chemistry, University of California San Francisco
5Department of Anesthesia and Perioperative Care, University of California San Francisco
*Corresponding authors

Author contributions: A.J.R., A.C.K., A.M., D.S.M. conceived of the study. A.J.R. developed the model and performed the computational analysis. I.D. and J.L. performed the experiments. A.J.R. wrote the manuscript.

Abstract

Nanobodies are single-domain antibodies that are useful molecular tools in crystallography, molecular biology, and biomedicine. High affinity to the desired antigen is crucial to their function, but the molecular underpinnings of affinity are difficult to ascertain.

Though directed evolution can refine the desired characteristics of a protein, algorithmically aided optimization is necessary to navigate sequence space intelligently. Here we apply sparse logistic regression to deep sequencing data from a nanobody library that underwent diversification and selection against Smoothened, a protein implicated in cancer. This model scales to hundreds of thousands of sequences and can be used both to predict future rounds of selection and to find a small number of mutations responsible for increased affinity.

Introduction

Antibodies are a crucial component of the immune system, responsible for detecting biomolecules to elicit an immune response. The specificity of an antibody to an antigen is refined during affinity maturation, a process in which the antibody sequence is hypermutated by host enzymes, and B cells producing high-affinity clones expand and proliferate1, resulting in an antibody with extremely high specificity and fidelity. Currently, obtaining antibodies to a specific target requires immunization of an animal, but this process is slow, expensive, and impractical for a large number of molecules2.

The process of finding a high affinity antibody to a target can be replicated in the laboratory to bypass traditional in vivo generation methods. Combinatorial libraries of the complementarity determining regions (CDRs) of the antibody mimic the natural variation primarily responsible for antibody affinity3-7, and subsequent selection for high-affinity clones is

performed in vitro using high-throughput expression and sorting. However, clones selected from the original library may not have high enough affinity to their target for practical application in the lab or clinic and may need subsequent refinement to improve specificity and other molecular characteristics.

Directed evolution8,9 can be used to refine and improve the properties of biomolecules, including the affinity of antibodies to an antigen10. This process replicates natural affinity maturation: diversity is introduced into sequences typically via error-prone PCR or recombination over multiple rounds of increasing selection pressure until an optimized sequence is found. This technique has led to sequences with refined or unique properties with limited experimental time or cost8,9.

However, directed evolution may not always find a sequence that optimizes a given function. The functional landscape may be difficult to traverse, resulting in sequences that get stuck in local maxima8. Alternatively, sufficient sequence diversity may not exist, or selection pressure may not be strong enough to optimize a given function. Machine learning techniques can help parameterize the functional landscape of sequences to guide protein engineering efforts.

These techniques are crucial in the design-of-experiments optimization cycle, refining the sets of mutations to test to improve the function of a sequence11-16. However, many algorithms do not scale to the large number of datapoints obtained by deep sequencing11,15,17,18. Moreover, though

“black-box” machine learning techniques can improve predictive performance11,14,16, interpretability of mutation effects is key for iterative rounds of library design. Finally, since biological measurements are noisy, uncertainty quantification is necessary to find signals in biological data19-21.

Figure 3.1. Experimental pipeline of synthetic nanobody affinity maturation to Smoothened. (Pipeline steps shown in the schematic: select binders → amplify → mutagenize → deep sequence → fit model → test beneficial mutations.)

Regression-based models are widely adopted tools for understanding the genotype-to-phenotype relationship. They are frequently employed in genome-wide association studies22 and have been used previously to engineer genetic systems12,23. Regression models, when fit with variational inference techniques, provide principled regularization, are readily interpretable by biologists, are predictive of the outcome of experiments, and can be fit to data in a matter of minutes on a laptop computer.

Here we focus on optimizing the binding affinity of a nanobody, a single-domain antibody found naturally in camelids and some shark species24. Due to their small size and ease of synthesis, nanobodies have seen increasing adoption in molecular biology, crystallography, and biotherapeutics24. Our nanobody of interest targets Smoothened, an oncogene that is frequently a driver of cancer25,26. We couple directed evolution techniques with sparse

Bayesian logistic regression to improve the affinity of an initial nanobody candidate to

Smoothened. We show that a model fit to the first round of selection can strongly predict later rounds of increased selection pressure, and a small number of mutations can be identified to increase nanobody affinity.

Results

Synthetic affinity maturation data was generated for a nanobody clone sourced from a synthetic library5. After diversification with error-prone PCR, variants were sorted and deep sequenced under increasing levels of selection (Figure 3.1). However, no clone with sufficiently high affinity to Smoothened was obtained for either crystallography or in-depth molecular experiments. We therefore analyzed the deep sequencing data for information about mutations likely to increase affinity.

We formulate the identification of mutations that lead to a sequence being selected after random mutagenesis and sorting by binding activity as a sparse logistic regression problem

(Methods). The probability a sequence is found by sequencing after sorting is determined by summing learned site-independent coefficients for the variant at each position. We enforce

Bayesian sparsity27 on these coefficients to compensate for the noisy enrichment labels with uncertainty in the model parameters, and to find as few mutations as possible responsible for increased binding, simplifying follow-up experiments. Using stochastic variational inference20, our algorithm scales inference to hundreds of thousands of sequences.
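
As a minimal sketch of this scoring rule (a logistic model over summed site-independent coefficients; all names and shapes here are illustrative, not the study's code):

```python
import numpy as np

def enrichment_probability(x_onehot, coef, bias=0.0):
    """Sum the learned site-independent coefficient for the observed
    variant at each position, then map the score to a probability.
    x_onehot and coef are both L x q (positions x amino acids)."""
    score = np.sum(coef * x_onehot) + bias
    return 1.0 / (1.0 + np.exp(-score))
```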

Figure 3.2. Early rounds of selection predict enrichment under stronger selection strength. A model trained on enrichment scores from Round 1 (pink) can predict enrichment in subsequently stronger rounds of selection (blue). (Panels compare each round of selection against the naive library: Round 1 vs Naive, AUC = 0.67; Round 2 vs Naive, AUC = 0.92; Round 3 vs Naive, AUC = 0.96.)

We find that the logistic regression model is predictive of enrichment, both within and across different rounds of selection (Table 3.1). A model trained on the first round of enrichment is somewhat accurate (AUC=0.67), but this same model almost perfectly predicts the second

(AUC=0.92) and third (AUC=0.96) rounds of selection (Figure 3.2). This shows the regression model learns generalizable parameters that can predict future rounds of selection and suggests that only a single round of selection is required to identify mutations that increase affinity.
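
A sketch of this train-on-one-round, test-on-later-rounds protocol, using scikit-learn's L1-penalized logistic regression as a stand-in for the sparse Bayesian model, on random placeholder data (so the printed AUC will hover near 0.5, unlike the real results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
L, q, n = 120, 21, 5000                       # length, alphabet, reads (assumed)
X_r1 = rng.integers(0, 2, size=(n, L * q)).astype(float)
y_r1 = rng.integers(0, 2, size=n)             # 1 = observed after selection
X_r2 = rng.integers(0, 2, size=(n, L * q)).astype(float)
y_r2 = rng.integers(0, 2, size=n)

model = LogisticRegression(penalty="l1", solver="liblinear", C=1.0)
model.fit(X_r1, y_r1)                         # fit on Round 1 vs. naive only
auc = roc_auc_score(y_r2, model.predict_proba(X_r2)[:, 1])
print(f"Round 2 vs. naive: AUC = {auc:.2f}")
```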

Table 3.1. Cross validation of regularization strengths and selection rounds. Models are trained on enrichment from a round of selection relative to the starting library using sparse Bayesian logistic regression. Predictions are scored as the AUC. Five-fold cross-validation results (training and test on the same round) are shown in grey.

Training           Scale (λ)   Test: Round 1   Test: Round 2   Test: Round 3
                               vs Naïve        vs Naïve        vs Naïve
Round 1 vs Naïve   1           0.662           0.918           0.962
                   0.1         0.639           0.885           0.939
                   0.01        0.579           0.759           0.824
                   0.001       0.523           0.581           0.605
                   0.0001      0.501           0.507           0.515
Round 2 vs Naïve   1           0.661           0.935           0.973
                   0.1         0.653           0.924           0.968
                   0.01        0.608           0.866           0.933
                   0.001       0.546           0.724           0.809
                   0.0001      0.506           0.545           0.566
Round 3 vs Naïve   1           0.654           0.930           0.975
                   0.1         0.644           0.918           0.968
                   0.01        0.600           0.860           0.935
                   0.001       0.548           0.745           0.838
                   0.0001      0.507           0.572           0.600

In order to better prioritize a small subset of mutations for follow-up, we experimented with increasing the regularization strength to drive many of the parameters to zero. Increasing the regularization strength of the model fit to the third round of selection from λ = 1 to λ = 0.01 slightly reduced the cross-validated accuracy of the model from AUC = 0.98 to AUC = 0.94 (Table 3.1), but many more parameters are driven to zero (Figure 3.3a). This refines which mutations are likely to cause increased affinity and substantially reduces the number of mutations to test.

The parameters of the regression model correspond monotonically to the probability of enrichment of a sequence: large positive values indicate variants that increase affinity, large negative values indicate mutations that reduce affinity, and values near zero have little effect on the outcome of enrichment. The model finds a small number of enriched mutations that are likely to cause increased affinity to Smoothened (Figure 3.3b). CDRs are typically responsible for nanobody binding to the antigen: accordingly, three positions in CDR1 and one in CDR3 are obvious candidates for improving nanobody affinity. Surprisingly, many mutations that increase affinity to Smoothened are located in the framework, which may either directly contact Smoothened or stabilize the nanobody structure itself24. These critical variants were not present in the original nanobody library5.

Figure 3.3. A sparse number of variants are linked to increased affinity. a. Increasing the regularization strength from λ = 1 to λ = 0.01 results in more model parameters driven to zero (axes: regression parameter expectations under λ = 1 versus λ = 0.01). b. The predicted increase in affinity each variant confers on nanobody binding is plotted against the nanobody sequence. The CDRs are shown in magenta, yellow, and blue, respectively, contrasting the beta strands of the framework.

Discussion

Here we show how sparse logistic regression can be used to summarize synthetic affinity maturation data. This algorithm both finds mutations that are likely to increase the affinity of the nanobody to Smoothened and is predictive of the effects of mutations. This analysis also shows that the first round of sequencing and selection contained ample information to predict future rounds under increased selection pressure, hinting that only one round of selection and sequencing may be needed, decreasing cost and experimental turnaround. We anticipate that high-throughput sequencing and experimentation coupled with machine learning techniques will become increasingly useful in creating high-value biomolecules.

Methods

Data processing

Due to the low read count and lack of technical replicates of the sequencing, the log enrichment values themselves were not a reliable measure of protein fitness during selection.

Thus, we formulate the enrichment analysis as a classification problem. We assign a class to each of these sequences: 1 if the sequence was observed by sequencing after a round of selection, and 0 if it was only seen before enrichment. Sequences are encoded as a one-hot matrix of size $L \times q$, where each position is a vector of zeros except for a 1 at the entry corresponding to the amino acid at that position.
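
A minimal sketch of this encoding, assuming a 20-letter amino acid alphabet (the actual alphabet used may include additional symbols such as gaps):

```python
import numpy as np

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"             # 20 amino acids; q = 20 (assumed)
AA_INDEX = {aa: i for i, aa in enumerate(ALPHABET)}

def one_hot(seq: str) -> np.ndarray:
    """Encode a sequence as an L x q matrix: each row is all zeros
    except for a 1 at the column of that position's amino acid."""
    x = np.zeros((len(seq), len(ALPHABET)))
    for i, aa in enumerate(seq):
        x[i, AA_INDEX[aa]] = 1.0
    return x

x = one_hot("QVQLVESGG")                      # toy nanobody fragment
assert x.shape == (9, 20) and x.sum() == 9
```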

Model description

We would like to learn a set of weights $w$ to predict the probability that a sequence $x$ is enriched after selection, $y$: this is formulated as the conditional probability $p(y \mid x, w)$. Typically, these weights are learned by maximum likelihood estimation via gradient descent. However, under a Bayesian inference framework, a prior distribution is placed on the weights, $p(w)$. To fit this model, the posterior distribution must be evaluated; however, this requires an intractable integral over the weights:

$$p(w \mid x, y) = \frac{p(w)\,p(y \mid x, w)}{\int p(w)\,p(y \mid x, w)\,dw}$$

Instead we posit parameters $\phi$ that form a variational distribution over the weights, $q(w \mid \phi)$, to approximate the posterior $p(w \mid x, y)$ as measured by the Kullback-Leibler (KL) divergence. These parameters $\phi$ are found by maximizing the evidence lower bound (ELBO) on the marginal likelihood of the $N$ data points:

$$\log p(y \mid x) \geq \mathbb{E}_{q(w \mid \phi)}\!\left[\log p(y \mid x, w)\right] - D_{\mathrm{KL}}\!\left(q(w \mid \phi)\,\|\,p(w)\right)$$

The weights are defined with a scale-mixture prior to induce sparsity27. The preference of an amino acid $a$ at a given site $i$ is controlled by four sets of parameters: dense parameters $\mu_{ia}^{(w)}$, $\sigma_{ia}^{(w)}$ and sparse parameters $\mu_{ia}^{(s)}$, $\sigma_{ia}^{(s)}$. These parameters are sampled during training by first sampling from standard normals $\epsilon_{ia}^{(w)} \sim \mathcal{N}(0,1)$, $\epsilon_{ia}^{(s)} \sim \mathcal{N}(0,1)$ and utilizing the reparameterization trick19,28:

$$w_{ia} = \mu_{ia}^{(w)} + \sigma_{ia}^{(w)}\,\epsilon_{ia}^{(w)}$$

$$s_{ia} = \exp\!\left(\mu_{ia}^{(s)} + \sigma_{ia}^{(s)}\,\epsilon_{ia}^{(s)}\right)$$

We then define the probability of enrichment $p(y = 1 \mid x, w, s)$ as:

$$p(y = 1 \mid x, w, s) = \frac{1}{1 + \exp\!\left(-\left(\sum_{i=1}^{L}\sum_{a=1}^{q} s_{ia}\,w_{ia}\,x_{ia} + b\right)\right)}$$

where $L$ is the length of the sequence and $q$ is the alphabet size. We place a standard normal prior over the dense weights, $w \sim \mathcal{N}(0,1)$, and an inverse-gamma prior27 over the sparse weights, with the global scale parameter clamped at 10.
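
A minimal numerical sketch of one such reparameterized draw (variable names are illustrative, and parameterizing each σ through its logarithm is a standard positivity trick that the text does not specify):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sample_enrichment_prob(x, mu_w, log_sig_w, mu_s, log_sig_s, b, rng):
    """One Monte Carlo sample of p(y = 1 | x) under the variational
    posterior. x is an L x q one-hot matrix; every parameter array is
    also L x q. Dense weights w are Gaussian; scales s are log-normal,
    which concentrates many effective coefficients near zero."""
    eps_w = rng.standard_normal(mu_w.shape)
    eps_s = rng.standard_normal(mu_s.shape)
    w = mu_w + np.exp(log_sig_w) * eps_w            # reparameterization trick
    s = np.exp(mu_s + np.exp(log_sig_s) * eps_s)    # log-normal scale
    return sigmoid(np.sum(s * w * x) + b)
```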

Model fitting

Each model is trained for 2000 iterations using the Adam optimization algorithm29, with a learning rate of 0.01 for the first 1000 iterations and 0.001 for the final 1000.

To more strongly enforce regularization and sparsity, we reduce the effective number of data points30 so as to make the influence of the prior stronger, $N \rightarrow \lambda N$. We scan over a range of $\lambda \in \{1.0, 0.1, 0.01, 0.001, 0.0001\}$ and select $\lambda$ by five-fold cross validation as the most sparse, heavily-regularized yet still predictive model. Finally, we report the final coefficients of the model as their expectation:

$$\mathbb{E}\left[s_{ia}\,w_{ia}\right] = \mu_{ia}^{(w)}\,\exp\!\left(\mu_{ia}^{(s)} + \frac{\left(\sigma_{ia}^{(s)}\right)^{2}}{2}\right)$$

References

1 Teng, G. & Papavasiliou, F. N. Immunoglobulin somatic hypermutation. Annu. Rev. Genet. 41, 107-120 (2007).

2 Junge, F. et al. Large-scale production of functional membrane proteins. Cellular and Molecular Life Sciences 65, 1729-1755 (2008).

3 McCoy, L. E. et al. Molecular evolution of broadly neutralizing Llama antibodies to the CD4-binding site of HIV-1. PLoS pathogens 10, e1004552 (2014).

4 Yan, J., Li, G., Hu, Y., Ou, W. & Wan, Y. Construction of a synthetic phage-displayed Nanobody library with CDR3 regions randomized by trinucleotide cassettes for diagnostic applications. Journal of translational medicine 12, 343 (2014).

5 McMahon, C. et al. Yeast surface display platform for rapid discovery of conformationally selective nanobodies. Nature Structural & Molecular Biology 25, 289-296 (2018).

6 Reiter, Y., Schuck, P., Boyd, L. F. & Plaksin, D. An antibody single-domain phage display library of a native heavy chain variable region: isolation of functional single-domain VH molecules with a unique interface1. Journal of molecular biology 290, 685-698 (1999).

7 Hawkins, R. E., Russell, S. J. & Winter, G. Selection of phage antibodies by binding affinity: mimicking affinity maturation. Journal of molecular biology 226, 889-896 (1992).

8 Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nature reviews Molecular cell biology 10, 866 (2009).

9 Packer, M. S. & Liu, D. R. Methods for the directed evolution of proteins. Nature Reviews Genetics 16, 379 (2015).

10 Boder, E. T., Midelfort, K. S. & Wittrup, K. D. Directed evolution of antibody fragments with monovalent femtomolar antigen-binding affinity. Proceedings of the National Academy of Sciences 97, 10701-10705 (2000).

11 Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences 110, E193-E201 (2013).

12 Fox, R. J. et al. Improving catalytic function by ProSAR-driven enzyme evolution. Nature biotechnology 25, 338 (2007).

13 Liao, J. et al. Engineering proteinase K using machine learning and synthetic genes. BMC biotechnology 7, 16 (2007).

14 Bedbrook, C. N., Yang, K. K., Rice, A. J., Gradinaru, V. & Arnold, F. H. Machine learning to design integral membrane channelrhodopsins for efficient eukaryotic expression and plasma membrane localization. PLoS computational biology 13, e1005786 (2017).

15 Jenson, J. M., Xue, V., Stretz, L., Mandal, T. & Keating, A. E. Peptide design by optimization on a data-parameterized protein interaction landscape. Proceedings of the National Academy of Sciences 115, E10342-E10351 (2018).

16 Biswas, S. et al. Toward machine-guided design of proteins. bioRxiv, 337154 (2018).

17 Reid, S., Tibshirani, R. & Friedman, J. A study of error variance estimation in lasso regression. Statistica Sinica, 35-67 (2016).

18 Snelson, E. & Ghahramani, Z. in Advances in neural information processing systems. 1257-1264 (2006).

19 Blundell, C., Cornebise, J., Kavukcuoglu, K. & Wierstra, D. Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424 (2015).

20 Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. An introduction to variational methods for graphical models. Machine learning 37, 183-233 (1999).

21 Riesselman, A. J., Ingraham, J. B. & Marks, D. S. Deep generative models of genetic variation capture the effects of mutations. Nature methods 15, 816-822 (2018).

22 Visscher, P. M. et al. 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics 101, 5-22 (2017).

23 Kuznetsov, G. et al. Optimizing complex phenotypes through model-guided multiplex genome engineering. Genome biology 18, 100 (2017).

24 Muyldermans, S. Nanobodies: natural single-domain antibodies. Annual review of biochemistry 82, 775-797 (2013).

25 Rimkus, T., Carpenter, R., Qasem, S., Chan, M. & Lo, H.-W. Targeting the sonic hedgehog signaling pathway: review of smoothened and GLI inhibitors. Cancers 8, 22 (2016).

26 Xie, J. et al. Activating Smoothened mutations in sporadic basal-cell carcinoma. Nature 391, 90 (1998).

27 Ingraham, J. & Marks, D. Variational inference for sparse and undirected models. arXiv preprint arXiv:1602.03807 (2016).

28 Kingma, D. P. & Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114 (2013).

29 Kingma, D. P. & Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).

30 Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. arXiv preprint arXiv:1207.2484 (2012).

Chapter 4

Accurate prediction of the thiamine biosynthetic landscape using Gaussian process learning

Adam J. Riesselman3,4†, Hans J. Genee1,2†, Søren D. Petersen1, Sangeeta Nath5,6, Luisa S. Gronenberg2, Bo Salomonsen2, Anne P. Bali1,2, Kathleen Smart2, Leanne Jade G. Chan6,7, Melissa Nhan6,7, Edward E. K. Baidoo6,7, George Wang6,7, Ernst Oberortner5,6, Nathan J. Hillson6,7, Jay D. Keasling1,6,7,8,9, Debora S. Marks3, Christopher J. Petzold6,7, Samuel Deutsch5,6*, Morten O.A. Sommer1*

1 Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark
2 Biosyntia ApS, Fruebjergvej 3, DK-2100, Copenhagen, Denmark
3 Department of Systems Biology, Harvard Medical School, Boston, MA 02115, USA
4 Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115, USA
5 DOE Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, USA
6 Biological Systems and Engineering Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Rd., Berkeley, CA, USA
7 Joint BioEnergy Institute, Emeryville, California 94608, United States
8 Department of Chemical & Biomolecular Engineering, University of California, Berkeley, California 94720, United States
9 Department of Bioengineering, University of California, Berkeley, California 94720, United States
† Joint first authors, *Corresponding authors

H.J.G., S.D., and M.O.A.S. conceived the study. H.J.G. designed the combinatorial pathway assembly strategy. H.J.G. and S.D.P. constructed the combinatorial library, performed functional selections and characterized growth and productivity of individual clones. L.G. developed a method for the detection of thiamine moieties using HPLC. H.J.G., L.S.G. and B.S. developed the fluorescence assay for detection of thiamine moieties. E.E.K.B. and G.W. developed a method for detection of thiamine pyrophosphate using LC-MS. C.J.P., L.G.C., and M.N. performed proteomics. A.J.R. and S.D. modeled the data and designed validation strains. N.J.H., J.D.K., and D.S.M. provided scientific input and helped to support the study. A.P.B. and K.S. performed bioreactor experiments. H.J.G., A.J.R., D.S.M., S.D., and M.O.A.S. wrote the manuscript with contributions from all authors.

Abstract

Cellular metabolites are of high interest in biomedicine and biotechnology but can be costly or hard to synthesize de novo. A promising approach is to leverage endogenous bacterial pathways by refactoring their genetic components to maximize the output of the desired metabolite. However, endogenous pathways are the result of evolutionary processes under unidentified selection pressures, and the combinatorics of the search space to optimize metabolite production is too large to allow for exhaustive characterization. Here we present a generalizable approach to understand pathway production landscapes to guide pathway optimization. We demonstrate the method on the industrially relevant and biologically complex thiamine biosynthesis pathway. We built a 16,384-variant pathway library and performed detailed characterization by sequencing, proteomics and metabolomics on 52 strains that cover the phenotypic spectrum. The resulting data was used to build a predictive Gaussian process model. We validated the model across a large range of expected thiamine output levels and accurately predicted thiamine production (r2 = 0.82, P = 9.67e-7), including a very narrow optimum driven by the expression levels of a few key enzymes in the pathway. The best performing strain produced thiamine at 10.2 mg/l, the highest reported titer to date. Our results suggest that sparse characterization of refactored strains that sample the phenotypic spectrum, combined with Gaussian process modeling, can reduce the need for costlier, more exhaustive experimental searches and improve our ability to optimize metabolite pathways.

Introduction

Biological systems are increasingly being engineered for research and industrial purposes, including the production of high-value pharmaceuticals, chemicals, and biofuels1,2.

However, reconfiguring the tightly regulated metabolism of a cell through modification of gene expression levels and removal or introduction of enzymatic reactions often results in unpredictable phenotypes with undesired characteristics3,4. The underlying causes of poor predictability of phenotypic outcomes are complex and typically rooted in the introduction of imbalances of the native biochemical network. For example, imbalanced pathway flux may lead to accumulation of inhibitory feedback intermediates5, while too highly expressed enzymes may exhaust the cell’s co-factor reservoirs6. Accurately identifying optimal solutions for a specific phenotypic target necessitates in-depth understanding of the genotype-phenotype relations for such perturbations7.

Bottom-up models that integrate enzyme kinetics with metabolic networks require component properties to be characterized a priori, and these properties are generally not available for novel systems8-10. Alternatively, synthetic biology provides a modular molecular toolbox to build complex gene circuits to explore landscapes of small molecule production11-14. Though these techniques have seen success in biosynthetic pathway construction, providing statistical frameworks that help to interpret outputs and guide the next round of experimentation remains a defining challenge.

Machine learning approaches built upon empirically collected, high-quality sequencing and biochemical data offer an opportunity to develop predictive models of biological systems.

These algorithms would require much less exhaustive sampling than traditional bottom-up approaches, reducing cost and experimentation time. Though some computational techniques require complete knowledge of the biosynthetic pathway and all intermediate metabolites15, other algorithms have been successful in optimizing strains using simple statistical models16-19.

However, larger, complex biosynthetic pathways will require more sophisticated algorithms to

both model the small molecule production landscape as well as summarize model predictions to assist in the design of experiments for researchers. Moreover, since sampling variants of a new pathway can be both expensive and labor-intensive, techniques that can generalize to unseen regions of the production landscape using as few samples as possible are in critical demand.

(Figure 4.1 schematic: panel a, the E. coli thiamine biosynthesis pathway with its enzymes (ThiC, ThiE, ThiD, ThiF, ThiS, ThiG, ThiH, ThiI, ThiL, ThiK, Dxs, IscS) and intermediates (AIR, HMP-P, HMP-PP, DXP, DHG, THZ-P, TMP, TPP); panel b, combinatorial assembly of the 7 core genes, each preceded by 1 of 4 RBSs ordered from stronger to weaker, yielding 64 thiCED × 256 thiFSGH = 16,384 pathway variants.)

Figure 4.1. Construction of combinatorial thiamine biosynthesis pathway library and characterization using biosensor-based selection. a. The thiamine biosynthesis pathway in E. coli. The core genes in blue and red denote different branches of E. coli metabolism. Genes in gray represent additional non-core genes not included in the refactored pathway. b. The top diagram shows the native genomic configuration of the thiamine biosynthesis genes in E. coli.

The middle diagram shows the combinatorial assembly (Supplementary Figure S4.1) of the 7 core thiamine biosynthesis genes using 1 of 4 RBS variants with increasing predicted translation initiation rates (Supplementary Figure S4.2). The bottom diagram shows the

‘architecture’ of the refactored thiamine biosynthesis cluster organized into 2 operons driven by constitutive promoters. The color code for the 4 RBSs going from predicted strongest (dark red) to predicted weakest (dark blue) is shown on the right.

Machine learning methods applied to biosynthetic pathways are particularly useful when the metabolic pathway system is both industrially relevant and relatively uncharacterized. In this study we chose thiamine, or vitamin B1, a molecule for which thousands of tons are produced industrially by chemical synthesis. Efficient optimization of microbial thiamine biosynthesis has not been successful but could pave the way for industrial production by fermentation. The development of high-yield thiamine-producing strains is hampered by an exceptionally complex and poorly understood biosynthetic pathway (Figure 4.1a) that taps into 4 different metabolite pools20.

Results

To explore the landscape of thiamine biosynthesis in E. coli, we refactored the 7-gene core biosynthetic pathway, designed with a two-operon ‘architecture’ (thiCED and thiFSGH) to reflect the branches of the native thiamine pathway (Figure 4.1a). A 16,384-variant library was constructed by randomly introducing one of four distinct ribosomal binding sites (RBSs) in front of each gene21. These RBSs were chosen to span a predicted two-order-of-magnitude range of translation initiation rates22,23 (Figure 4.1b, Supplementary Figures S4.1 and S4.2).
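
As a worked check of the library arithmetic implied by this design:

```latex
% Four candidate RBSs placed independently in front of each of the 7 genes:
4^{3} = 64 \ \text{thiCED variants}, \quad
4^{4} = 256 \ \text{thiFSGH variants}, \quad
64 \times 256 = 4^{7} = 16{,}384 \ \text{full pathway variants}.
```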

To elucidate the molecular drivers leading to cellular thiamine production, we investigated a subset of full-length pathways that, when expressed in vivo, cover the phenotypic spectrum of thiamine production. To minimize screening, the constructed pathway library was transformed into an E. coli host containing a riboswitch-based biosensor plasmid that confers antibiotic resistance in a dose-dependent manner proportional to the intracellular thiamine

concentration (Methods, Supplementary Figure S4.3). A total of 72 colonies were picked from agar plates containing low to high antibiotic concentrations, cultured, and assayed for thiamines

(thiamine, thiamine monophosphate and TPP) production using a fluorescent assay developed as part of this study (Methods and Supplementary Figures S4.6 and S4.7). Fifty-two individual strains with thiamine production levels ranging from low to high were selected for in-depth characterization, including (i) sequencing to determine the RBS sequences for each gene in each clone, (ii) targeted proteomics quantification of pathway enzymes, and (iii) intra- and extracellular thiamines quantification.

Results from the targeted proteomics showed that the expression levels of the seven enzymes in the refactored pathway varied 10-100-fold between strains depending on the gene, confirming that our library design effectively sampled the protein expression space of the core thiamine biosynthetic pathway. We then compared the observed protein levels derived from the targeted proteomics to the in-silico translation initiation rate predictions. Translation Initiation Rate (TIR) estimates generated by the RBS Calculator22,23 were reasonably predictive of protein levels for 5 out of 7 genes, but poorly explained the observed inter-strain variation in protein levels for genes with the same RBS sequences (Supplementary Figure S4.4).


Figure 4.2. RBSs predict protein levels measured through targeted proteomics. a. Deriving an empirical operon model using linear regression. The first open reading frame only parameterizes adjacent RBS strengths, while subsequent open reading frames account for the strength of both upstream and adjacent RBSs. b. Interaction strengths for upstream and adjacent RBSs are revealed with the linear operon model. Positive, negligible, and negative interactions are colored red, white, and blue, respectively. c-i. Scatterplots showing the association between normalized natural log (LN) measured protein levels on the Y-axis and predicted protein levels based on the operon model on the X-axis. The red line indicates the best fit based on linear regression analysis. The text at the top shows the regression-based adjusted R-squared and p-values for each gene.

In line with previous studies, we saw a range of protein levels for each RBS, suggesting that additional sources of variation are involved. We surmised that translation levels of genes within these operons are not independent24-26 as we observed significant pairwise correlations between the protein levels of different genes (Supplementary Figure S4.5a). This motivated us to build an operon model that can capture translational coupling to improve protein level predictions from TIR estimates. Using a linear regression model that explicitly parameterizes operon architecture (Figure 4.2a), the regression coefficients reveal that in most cases, increasing the RBS strength directly adjacent to a gene, as expected, will increase protein expression

(Figure 4.2b). However, changes in RBS strength for upstream genes can also influence protein expression downstream: in the case of thiE and thiH, the strength of upstream interactions is comparable to that of adjacent RBSs. By applying our operon model, we observed improvements in predictive power, particularly for thiE and thiH where upstream interaction coefficients are relatively large (Figure 4.2b, Supplementary Figure S4.5b). Altogether, protein levels could be significantly predicted for 6 out of the 7 genes with r2 values ranging from 0.43 to 0.8 (Figure 4.2c-i).
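
A minimal sketch of the operon model's structure on hypothetical data; the feature layout (one column per upstream or adjacent RBS strength) is inferred from the description of Figure 4.2a rather than taken from the study's code:

```python
import numpy as np

# Predict one gene's protein level from the strength of its adjacent RBS
# plus the strengths of all upstream RBSs in the same operon, so that
# translational coupling appears as nonzero upstream coefficients.
rng = np.random.default_rng(1)
n_strains = 52
X = rng.normal(size=(n_strains, 3))       # e.g. 3rd gene: two upstream + adjacent
beta_true = np.array([0.2, 0.6, 1.0])     # upstream effects can be comparable
y = X @ beta_true + 0.1 * rng.normal(size=n_strains)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)                           # recovers adjacent + upstream effects
```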


Figure 4.3. Intracellular thiamine levels show a multifactorial association to protein levels of the refactored biosynthesis cluster. a. Sequencing and intracellular thiamine production for 52 characterized strains. Colored squares on the left-hand side indicate the RBSs for each of the seven genes in the refactored cluster according to the color-code legend at the bottom right. Squares with no color indicate the presence of a mutant RBS sequence. The horizontal bar chart on the right shows the levels of intracellular thiamines detected for each strain. b-h. Scatter plots showing intracellular thiamines level on the Y-axis and normalized LN protein levels on the X-axis. The red line indicates the best fit based on linear regression analysis, and the text at the top shows the regression adjusted R-squared and p-values for each gene.

Results from the thiamines quantification assays showed that intracellular thiamines correlated with extracellular thiamines with 5-10% of total thiamines being inside the cells

(Supplementary Figure S4.6 and S4.7). The highest producing strain accumulated 2.4 µM thiamines in the growth medium and ~13.0 µM/cell intracellular thiamines corresponding to a

~45-fold improvement relative to the wild-type (extracellular thiamines = 53 nM, intracellular thiamines = 0.5 µM/cell) (Figure 4.3a).

We then looked at the relationship between thiamine biosynthetic pathway protein expression levels and thiamine production. Protein levels of ThiH and ThiG significantly correlated with intracellular thiamine production, but products of other genes correlated less well or not at all (Figure 4.3b-h). Since no single protein level was highly predictive of metabolite production, we made more comprehensive models of the thiamine production landscape. We first explored the use of a Gaussian process to model the relationship between protein levels and thiamine production. Gaussian processes are Bayesian nonparametric function approximators that are resistant to overfitting, provide principled uncertainty estimates for predictions, and can capture nonlinear relationships in the data27,28. We compared predictions against a regression-based model, which has been used previously to optimize the expression of genes in biosynthetic pathways17,18. We performed automated feature selection using linear and interaction-based regression and converged on a linear model including ThiG, ThiC, and ThiE that had the best fit.
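
A minimal sketch of such a Gaussian process fit using scikit-learn on stand-in data; the RBF-plus-noise kernel and all numbers here are assumptions, since the chapter does not specify the covariance function used:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel, WhiteKernel

# Stand-in data: log protein levels of ThiC, ThiE, ThiG for 52 strains,
# with a synthetic bounded optimum standing in for measured thiamines.
rng = np.random.default_rng(2)
X = rng.uniform(5, 15, size=(52, 3))
y = np.exp(-((X[:, 0] - 12.5) ** 2 + (X[:, 2] - 14.4) ** 2) / 4.0)

kernel = ConstantKernel(1.0) * RBF(length_scale=[2.0, 2.0, 2.0]) \
    + WhiteKernel(noise_level=0.1)          # smooth surface + observation noise
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
gp.fit(X, y)
mean, std = gp.predict(X, return_std=True)  # predictions with uncertainty
```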


Figure 4.4. Validating model predictions through de-novo strain construction. a. Predicted thiamine production landscape plotted in 2 dimensions as a function of ThiC and ThiG protein levels. b. Decision tree summarization “Rules” for optimal thiamine production. Dials show the range of optimal protein expression for each gene. c. The integrated pathway model capturing data from all steps of the Design, Build, Test and Learn cycle informs the construction of the next round of strains. d-e. Scatter plots showing the evaluation of Gaussian process and regression models for predicting intracellular thiamines from protein levels. The Y-axis shows the measured intracellular thiamines level and the X-axis the predicted intracellular thiamines for each model. f. Comparison of predicted thiamine levels computed from regression and Gaussian process models over increasing levels of ThiG, highlighting the accuracy of the landscape predicted by the Gaussian process.

We used both the regression-based model and the Gaussian process model to generate thiamine production landscapes over the range of protein levels for ThiC, ThiE and ThiG.

Notably, since there are no explicit constraints, the linear model predicts a continuous increase in thiamine production with increasing protein expression (Supplementary Figure S4.8), which is unlikely to accurately represent the actual production landscape given physiological limitations29. However, the landscape from the Gaussian process model found a bounded optimal thiamine production level. Plotted in 2-D as a function of ThiC and ThiG (which had the strongest effects on thiamines production), non-linear constraints define the protein levels for peak thiamine production, which had a single maximum at optimum expression levels for each protein (Figure 4.4a). This constrained landscape is unlike predictions made from the regression model, which places peak thiamine production at maximal ThiC and ThiG levels (Supplementary

Figure S4.8).

The ability of the Gaussian process to capture nonlinear relationships in the thiamine production landscape and its robustness to overfitting come at the price of interpretability. How can we summarize experimentally-actionable rules from the Gaussian process to guide the design of experiments for optimal thiamine production? We coupled the statistical efficiency of the Gaussian process with the interpretability of a decision tree30 by first predicting thiamine levels over ranges of ThiC, ThiG, and ThiE; fitting a decision tree to the interpolated landscape; and summarizing the protein levels required to reach the branch with optimal thiamine production (Figure 4.4b and

Supplementary Figure S4.9). Generating such rules overcomes the common challenge of interpreting machine learning models predicated on complex nonlinear relationships.
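
A minimal sketch of this distillation step, continuing the Gaussian process sketch above (grid bounds, tree depth, and the feature order ThiC, ThiE, ThiG are illustrative):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Interpolate the landscape on a dense grid with the fitted GP, then
# distill it into a shallow tree whose branches read as "rules".
axes = [np.linspace(5, 15, 15)] * 3                   # ThiC, ThiE, ThiG ranges
grid = np.stack(np.meshgrid(*axes), axis=-1).reshape(-1, 3)
landscape = gp.predict(grid)                          # gp from the sketch above

tree = DecisionTreeRegressor(max_depth=3).fit(grid, landscape)
print(export_text(tree, feature_names=["ThiC", "ThiE", "ThiG"]))
# The leaf with the highest predicted thiamine defines the optimal ranges.
```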

These rules suggest that intracellular thiamine levels are maximized when the protein levels of ThiC, ThiG, and ThiE are between 12.1-13.2, 14.1-14.7, and 5.2-15.0, respectively (Figure 4.4b). Optimal ThiC levels are near the maximum of the observed range, whereas ThiG seems to have a narrower optimum at intermediate levels, below or above which thiamine production would decrease. Finally, ThiE has a wider optimum that is sufficient to support high production of thiamine.

To test the accuracy of the predicted production landscape generated through the

Gaussian process, we designed a set of validation strains (n=16) that would evaluate key features of the landscape. In particular, we focused on the complex relationship between the ThiC and

ThiG protein levels, as the predicted outcome differed significantly between our multivariate regression model and the Gaussian process model. We built a series of strains (n=9) with a predicted fixed ThiC protein level while varying ThiG protein levels and vice versa. Also, a series of strains (n=7) was built to cover parts of the thiamine production landscape that were not well covered in the original library. To design the validation strains we combined all of our data into an integrated model as shown in Figure 4.4c.

The validation strains were fully characterized using DNA sequencing, targeted proteomics and intracellular thiamines quantification as before. Our Gaussian process model accurately predicted intracellular thiamine production from the measured protein levels of the validation strains (r2 = 0.82, P = 9.67e-7, n=16) (Figure 4.4d), in contrast to the much poorer performance of the multivariate regression model (r2 = 0.27, P = 0.035, n=16) (Figure 4.4f). A model using just the decision tree to predict thiamine production resulted in similarly poor predictive performance (r2 = 0.30, P = 0.029, n=16), showing the effectiveness of integrating the predictive power of the Gaussian process with the interpretability of the decision tree. We also observed the expected correlation between operon-model predictions and protein levels for six of the seven gene products

(ThiE, ThiD, ThiF, ThiS, ThiG, ThiH) (Supplementary Figure S4.10), although as before we observed a range of protein levels for each RBS, which precludes accurate prediction of the thiamine output from the sequence data alone.

An intriguing feature of the Gaussian model was the narrow optimal range of ThiG protein level required for reaching high levels of intracellular thiamines (Figure 4.4b). The

Gaussian process predicted that ThiG protein levels between 14.1-14.7 would result in the highest intracellular thiamines production, with protein levels outside this range resulting in reduced thiamine. This prediction was precisely borne out in the designed validation strains

(Figure 4.4e), highlighting the differences from the regression model, which predicted increasing thiamine with increasing ThiG protein levels. Other features predicted to be important for the intracellular thiamine levels were the optimal levels of ThiC and ThiE, which were also confirmed in the validation data (Supplementary Figure S4.11).

Another interesting aspect of the Gaussian model is that it captures biological insight without a priori knowledge about pathway architecture or enzyme kinetics. Previous studies have shown that ThiC is the committing step in the left branch of the biosynthesis pathway (Figure 4.1a) while ThiG plays a key role in the right branch of the pathway by assembling the three precursor components, namely ThiS-thiocarboxylate, dehydroglycine

(DHG) and 1-deoxy-d-xylulose 5-phosphate (DXP) into the thiazole moiety (THZ-P) (Figure

4.1a). Finally, ThiE combines the two branches and forms thiamine monophosphate by assembly of the pyrimidine and thiazole moieties20 (Figure 4.1a). The requirement for high levels of

ThiC is in good agreement with a reported very low catalytic turnover of this enzyme31. The narrow optimum of ThiG levels is surprising and its mechanism is not clear; it could stem from a complex trade-off between burden and productivity.

Previous efforts on thiamine metabolic engineering have had limited success32, and the best previously reported titer is 1.2 mg/l based on engineered Bacillus subtilis33. When tested in fed-batch bioreactors, our strain produced up to 10.2 mg/l extracellular thiamines (mean = 9 mg/l, s.d. = 2.6 mg/l, n = 3) from glucose and hence represents the highest titer reported to date (Methods, Supplementary Figure S4.12). Further engineering of this strain, e.g. by functionally expressing a thiamine phosphatase34, could lead to the production of pure unphosphorylated thiamine with future potential for replacing current chemical production of thiamine, which exceeds several thousand metric tons.

Discussion

Biosynthetic pathways have evolved over long timescales based on hard-to-decipher selective pressures, resulting in complex interactions that affect biological function. As a result, the production landscape of various cellular metabolites exhibits non-intuitive features, which may be more accurately modeled by integrating new statistical techniques with orthogonally derived experimental data. Remarkably, our data integration into hierarchical models is able to accurately predict complex features of the thiamine production landscape in E. coli, and to inform the next round of strain designs based on clearly defined actionable rules.

Building on previous studies that have explored phenotypic landscapes through combinatorial pathway refactoring approaches15,16,35,36, our efforts differ by successfully applying machine learning to build a model that enables the accurate prediction of metabolic outputs from protein levels and is applicable to any organism in which protein expression levels can be tuned.

This technique can be applied to any pathway of interest by generating a large pool of variants; screening production levels of a few hundred individual clones; deeply characterizing a few dozen variants that cover the phenotypic spectrum with sequencing, proteomics, and metabolomics; and building statistical models as outlined in this study to parameterize the explored metabolic landscape. Generalizable application of the framework outlined in this study has the potential to significantly accelerate biological engineering applications.

Methods

Strains and media

Escherichia coli DH10B was used for cloning, plasmid propagation, and as a host for functional selection and characterization of combinatorial thiamine pyrophosphate (TPP) over-expression pathways. LB medium (10 g/L tryptone, 5 g/L yeast extract, 10 g/L NaCl; VWR

#90003-350) with appropriate antibiotic supplementation (kanamycin (50 µg/ml) and/or ampicillin (50 µg/ml)) was used for strain maintenance. A modified rich defined MOPS

(mrMOPS) medium as described previously37, supplemented with glucose (10 g/L) and appropriate antibiotics, was used for functional selections and for growth, proteomics, and productivity characterizations.

Construction of vector backbones for combinatorial assembly

The combinatorial assembly vector-set pGEN49, pGEN50, and pGEN51 was designed at the nucleotide level in silico (see Supplementary Figure S4.1 for plasmid maps). Each plasmid contains various strong constitutive promoters and different strong terminators to reduce the risk of homologous recombination when combined. Transcriptional element parts were chosen from the BioFAB registry of parts38,39. For construction, the three assembly and expression cassettes were synthesized as gBlocks (IDT). Next, the vector backbone of the pZE21 expression vector40, excluding promoter, multiple cloning site, and terminator, was amplified by PCR using primers oGENJB1 and oGENJB2, and the PCR fragment was DpnI-treated and gel-purified. The three gBlocks were assembled individually with the pZE21 backbone by Gibson cloning, yielding pGEN49, pGEN50, and pGEN51, and the ligation mixes were transformed into E. coli by standard procedures. The resulting plasmids were sequenced by Sanger sequencing (Beckman

Coulter U.K.).

Preparation of biobricks for combinatorial assembly

The open reading frames of thiC, thiE, thiD, thiF, thiS, thiG, and thiH were amplified by

PCR as individual fragments from E. coli MG1655 genomic DNA. For each open reading frame, four distinct forward primers were used that, in addition to the Gibson linker, contain a 16 bp 5’ untranslated region (5’UTR) encoding one of the following four ribosome binding sites (RBSs):

1) AGCTAAGGAGGTAAAT, 2) AGCGAGGTAATACTAG, 3) AGCGTGGTAATACTAG, and 4) AGCGTGCTAATACTAG. These RBS sequences were chosen based on their predicted translation initiation rates ranging from low to high as calculated using the RBS Calculator23,41

(V2.0). The resulting 28 fragments were gel-purified. Similarly, pGEN49 and pGEN50 were

PCR amplified using whole plasmid amplification, treated with DpnI restriction enzyme to remove template DNA and gel-purified.

Construction of the combinatorial TPP pathway library

Construction of the full TPP pathway library was performed following a two-step hierarchical cloning procedure. First the PCR fragments based on pGEN49, thiC, thiE, and thiD

(n=13) were mixed in equimolar concentrations and assembled by Gibson cloning (NEB #

E2611) using the manufacturer’s instructions, resulting in potentially 64 pooled constructs representing combinatorial expression variants of the thiCED cistron (named pGEN77).

Likewise, the PCR fragments based on pGEN50, thiF, thiS, thiG, and thiH were assembled resulting in potentially 256 constructs representing combinatorial expression variants of the thiFSGH cistron (named pGEN80). The pGEN77-mix was transformed into electrocompetent E. coli DH10B (Invitrogen # 18290-015) following standard procedures, recovered for 1.5 h in 1 mL Super Optimal broth with Catabolite repression (SOC) medium to which 9 mL 2xYT supplemented with kanamycin (50 µg/ml) was added and the culture was incubated at 37°C

overnight. 10 µL of the 1.5 h recovery culture was plated on LB agar plates supplemented with kanamycin (50 µg/ml) to evaluate cloning and transformation efficiencies. The following day, the total plasmid pool of 5 mL of the overnight culture was purified and digested for > 8 h using

SwaI (NEB # R0604), and linearized DNA was gel-purified. Using the in vitro assembly mix of the pGEN80-mix as template for PCR, the thiFSGH cistrons were amplified by PCR from the A to the C linker and gel-purified. Finally, the linearized pGEN77-mix was mixed in equimolar amounts with the amplified thiFSGH cistron mix and assembled into one plasmid library

(pGEN83) by Gibson assembly. The full library was transformed into electrocompetent E. coli

DH10B (Invitrogen # 18290-015), recovered for 1.5 h in 1 mL SOC medium to which 9 mL

2xYT supplemented with kanamycin (50 µg/ml) was added and the culture was incubated at

37°C overnight. 10 µL of the 1.5 h recovery culture was plated on LB agar plates supplemented with kanamycin (50 µg/ml) to evaluate cloning and transformation efficiencies. The following day, the total plasmid pool of 5 mL of the overnight culture was purified and the remaining culture (EcGEN204) was stored as glycerol stocks (15% glycerol) at -80°C. A gel- electrophoresis analysis of the purified plasmid mix indicated that a proportion of the plasmids were smaller than the expected 9 kb, and the DNA bands corresponding to 7-11 kb were excised from the gel and purified.

Functional selection for TPP-producing clones

Of the size-selected pGEN83 plasmid mix, 400 ng was used to transform electrocompetent E. coli DH10B already harboring the TPP riboswitch selection plasmid pGEN37 (TPP selection host strain, EcGEN46). Following electroporation, the cells were recovered for 1.5 h in 1 mL SOC medium and subsequently washed two times in mrMOPS medium before adding 10 mL mrMOPS supplemented with kanamycin (50 µg/ml) and ampicillin (50 µg/ml). 10 µL of the 1.5 h recovery culture was plated on LB agar plates supplemented with kanamycin (50 µg/ml) and ampicillin (50 µg/ml) to evaluate transformation efficiencies. The 10 mL culture was grown overnight and 5 ml was stored as glycerol stocks (15% glycerol) at -80°C (EcGEN198). The remaining culture was plated, in volumes corresponding to 10⁵, 10⁶, and 10⁷ colony forming units in total, on fresh, pre-dried mrMOPS agar plates containing kanamycin (50 µg/ml) and the following chloramphenicol/spectinomycin concentrations (in µg/ml): (0, 0), (10, 20), (14, 20), (18, 30), (25, 50), (40, 50), (60, 50), and (80, 50). As a negative control, an overnight culture of the selection host strain carrying an empty expression vector (pGEN49) (strain EcGEN126) was plated in parallel. The same strain was used as a positive control but plated on media supplemented with 50 µM thiamine. The plates were incubated at 37°C for 48 h. Single colonies were picked from selective plates and streaked on fresh selective agar plates to ensure monoclonal isolates and confirm antibiotic resistance. Three single colonies of each strain were used to inoculate a 1 mL overnight culture in mrMOPS, from which glycerol stocks were prepared (15% glycerol) and stored at -80°C. Similarly, colonies from non-selective conditions were picked immediately after transformation from the plates used to evaluate transformation efficiencies, streaked at non-selective conditions, and three single colonies of each strain were grown and stored as described above. The plasmid was extracted from each individual strain and fully sequenced by Sanger sequencing. For deep sequencing of RBSs of the full library, the remaining colonies from each selection condition were harvested, and the total plasmids from each condition were extracted and sequenced as described below.

HPLC measurement of intra- and extracellular thiamines

10 µL aliquots were withdrawn from each of the glycerol stocks of the strains and used to inoculate 1 mL cultures in mrMOPS media containing kanamycin (50 µg/ml) in a 96 deep-well plate. Additionally, 10 µL aliquots from the glycerol stocks of four replicates of the wild-type strain (EcGEN126) were included. The plate was closed with a breathable metal lid and incubated at 37°C with shaking (300 RPM) in an Innova 44 shaker (Eppendorf) for 24 h. To estimate final cell turbidity, 20 µL of the well-mixed 24 h cultures was suspended in 80 µL mrMOPS and the OD630 was measured in an ELx808 Absorbance Microplate Reader (BioTek). From cell cultures prepared as described above, cells were harvested by centrifugation (4000 G at 4°C). The supernatant was withdrawn for extracellular analysis. For intracellular analysis, all of the remaining supernatant was removed and the cell pellet was vortexed. To lyse the cells, 200 µL of ice-cold HPLC-grade methanol was added and the lysate was stored at -20°C for one hour before the cell debris was pelleted by centrifugation (4000 G at 4°C). Individual thiamine compounds in the intra- and extracellular extracts were then measured using a modified thiochrome high-performance liquid chromatography (HPLC) method as described previously42. Briefly, thiamine, TMP, and TPP were derivatized to fluorescent thiochromes by the following procedure: 40 µl of the supernatant (either intra- or extracellular) was added to 80 µl 4 M potassium acetate and mixed by pipetting. Then, 40 µl freshly prepared 3.8 M potassium ferricyanide in 7 M NaOH was added. The solution was mixed by pipetting and finally quenched by addition of 40 µl fresh 0.06% H2O2 in saturated KH2PO4. The HPLC procedure was performed as previously described42, and concentrations were estimated by comparison to standard curves for thiamine, TMP, and TPP (Sigma-Aldrich). For calculations of intracellular concentrations of thiamines, we assumed an OD-specific total cell volume of 3.4 µl per ml culture per OD600 (ref. 7).

Construction of validation pathways

For the construction of 16 validation pathway plasmids, in which the genes were expressed from specific predefined RBS sequences, we followed a procedure similar to that described for the full pathway library, using only the relevant biobricks in 17 separate assembly reactions as described above. Strains carrying the resulting pathways were sequence-validated using the PacBio RSII system. The PacBio raw reads were aligned to the full pathway sequences using the SMRT analysis suite (Pacific Biosciences). We then used the Broad Institute’s Genome Analysis Toolkit (GATK)43 and Integrated Genome Viewer (IGV)44 software to call variants and to visualize the sequence validation results, respectively. Verified plasmids were transformed into the TPP selection host strain as described above, and three individual colonies of each strain were stored in glycerol stocks for TPP phenotyping and proteomics analysis.

Measurement of thiamines by thiochrome fluorescence assay

As a faster alternative to quantifying thiamines by HPLC, we developed a new assay in which we used a fluorescence plate reader (Synergy H4, BioTek) to directly measure thiochromes derivatized by the thiochrome assay (as described in the HPLC section) in a 96-well plate. An excitation of 365 nm and emission of 444 nm was found to work optimally. The assay successfully quantifies the total amount of thiamine and thiamine phosphates, and the fluorescence correlates linearly with the concentration of thiamines in mrMOPS, water, and methanol. The assay was demonstrated to work for both intracellular and extracellular thiamines, although the background level was substantially higher for extracellular thiamines (Supplementary Figures S4.5 and S4.6). The assay was applied to measure the total levels of both intra- and extracellular thiamines for all strains in biological triplicates. A good correlation was obtained when comparing the levels of TPP measured by HPLC to the levels of total thiamines measured by the fluorescence assay (Supplementary Figures S4.5 and S4.6).

Proteomics

20 µL aliquots were withdrawn from each of the glycerol stocks and used to inoculate 10 mL cultures in mrMOPS media containing kanamycin (50 µg/ml) in 50 mL Falcon tubes. Additionally, 20 µL aliquots from the glycerol stocks of four replicates of the wild-type control strain (EcGEN126) were included. The cultures were incubated at 37°C for 24 h with shaking (250 RPM). The cells were pelleted by centrifugation (4000 G at 4°C) and the supernatant was discarded. The pellet was vortexed and 1 mL of ice-cold HPLC-grade methanol was added. The pellets were stored at -20°C. Proteomics were performed as reported previously45. Briefly, peptides were analyzed using an Agilent 1290 liquid chromatography system coupled to an Agilent 6460 QQQ mass spectrometer (Agilent Technologies, Santa Clara, CA). The peptides were separated on an Ascentis Express Peptide ES-C18 column (2.7 µm particle size, 160 Å pore size, 10 cm length x 2.1 mm i.d., coupled with a 5 mm x 2.1 mm i.d. guard column; Sigma-Aldrich, St. Louis, MO) operating at a flow rate of 400 µl/min and heated to 60 ºC. The chromatographic conditions were as follows: initial condition 98% Buffer A (99.9% water, 0.1% formic acid) and 2% Buffer B (99.9% acetonitrile, 0.1% formic acid), held constant for 2 minutes; increased to 10% Buffer B in 0.5 minutes; increased to 40% Buffer B over 3.5 minutes; increased to 90% Buffer B in 0.5 minutes; held constant at 90% B for 2 minutes; then returned to 2% B in 0.5 minutes, where it was held for 1 minute to re-equilibrate the column for the next sample. Selected-reaction-monitoring (SRM) transitions to detect peptides from thiamine pathway proteins were generated and validated in Skyline. The data were analyzed and refined using Skyline, and peptide quantification was achieved by summing the integrated peak areas of the SRM transitions (available at Panoramaweb: https://goo.gl/1vCCWi). Abundances of peptides from the same protein were summed to assign an abundance to that protein.

Growth rates

Growth rates (r) were determined by growing cultures in 96-well plates and measuring OD630 every five minutes using an ELx808 Absorbance Microplate Reader from BioTek. Only measurements obtained in mid-exponential phase (OD630 0.1 – 0.3) were used. log(1+r) was determined as the slope in the linear regression log(OD630) = log(1+r)*t + C, in which C is a constant and t is time. The doubling time, T2, was obtained as T2 = log(2)/log(1+r).
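For illustration, this calculation reduces to a one-line regression; the following is a minimal sketch with made-up time points and OD readings, not the actual analysis script.

```python
# Minimal sketch of the growth-rate calculation described above;
# the time points and OD630 readings here are illustrative placeholders.
import numpy as np

t = np.array([0.0, 5.0, 10.0, 15.0, 20.0, 25.0])        # time in minutes
od = np.array([0.08, 0.11, 0.14, 0.18, 0.23, 0.29])     # OD630 readings

mask = (od >= 0.1) & (od <= 0.3)                        # mid-exponential window only
slope, intercept = np.polyfit(t[mask], np.log(od[mask]), 1)  # log(OD630) = log(1+r)*t + C
doubling_time = np.log(2) / slope                       # T2 = log(2)/log(1+r)
print(doubling_time)
```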

Linear regression of protein expression

Linear regression can predict protein levels given the translation initiation rate (TIR) and describe interactions between genes in an operon. A site-independent linear regression model makes the fewest assumptions: the predicted protein level p_i at gene i is linearly related to the TIR x_i via the coefficient β_i and the mean measured protein level at that site, µ_i:

p_i = β_i x_i + µ_i

Interactions between genes that alter protein expression levels can be expressed with additional parameters. We first instantiated interaction parameters between the TIR at a given position, x_i, and the TIR of the adjacent upstream gene, x_{i−1}. The first gene in an operon then corresponds to a site-independent model, while the rest of the genes are modeled via forward interactions. This model closely mimics the processivity of the ribosome on an RNA transcript:

p_i = β_i x_i + β_{i−1} x_{i−1} + µ_i

A similar model with the same number of parameters can be used to parameterize reverse interactions, where the TIR of the downstream gene affects protein levels at gene i:

p_i = β_i x_i + β_{i+1} x_{i+1} + µ_i

To parameterize all long-range, upstream interactions between all genes in an operon of length L, all upstream TIRs are given a linear, additive effect on predicted protein expression. This corresponds to an upper covariance interaction model:

p_i = Σ_{j ≤ i} β_j x_j + µ_i

Similarly, all long-range downstream interactions can be parameterized via a lower covariance interaction model:

p_i = Σ_{j ≥ i} β_j x_j + µ_i

Finally, a full covariance interaction model parameterizes all interactions upstream and downstream of a gene in an operon:

p_i = Σ_{1 ≤ j ≤ L} β_j x_j + µ_i

To control for the number of predictor variables in the regression analysis and choose the best-fitting yet most parsimonious model, the operon model was chosen using the adjusted r-squared (Supplementary Table S4.1). The operon models were implemented in Python using scipy’s least-squares regression routine (linalg.lstsq).
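To make the model family concrete, the following is a minimal sketch of the forward-interaction operon model and the adjusted r-squared criterion, fit with scipy's linalg.lstsq as described above; the arrays tir and prot (strains x genes, on a log scale) and the random data in the usage example are hypothetical stand-ins for the measured library.

```python
# A minimal sketch of the forward-interaction operon model, assuming
# hypothetical arrays `tir` and `prot` of shape (n_strains, n_genes)
# holding log predicted TIRs and log measured protein levels.
import numpy as np
from scipy import linalg

def fit_forward_operon(tir, prot):
    """Fit p_i = b_i*x_i + b_(i-1)*x_(i-1) + mu_i for each gene by least squares."""
    n_strains, n_genes = tir.shape
    coefs, preds = [], np.zeros_like(prot)
    for i in range(n_genes):
        # First gene is site-independent; later genes add the upstream TIR.
        cols = [tir[:, i]] if i == 0 else [tir[:, i], tir[:, i - 1]]
        X = np.column_stack(cols + [np.ones(n_strains)])  # last column gives mu_i
        beta, _, _, _ = linalg.lstsq(X, prot[:, i])
        coefs.append(beta)
        preds[:, i] = X @ beta
    return coefs, preds

def adjusted_r_squared(y, y_hat, n_params):
    """Adjusted r-squared used to compare operon models (Supplementary Table S4.1)."""
    n = y.size
    r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)
    return 1 - (1 - r2) * (n - 1) / (n - n_params - 1)

# Illustrative call with random stand-in data (52 strains, 7 genes).
rng = np.random.default_rng(0)
tir, prot = rng.normal(size=(52, 7)), rng.normal(size=(52, 7))
coefs, preds = fit_forward_operon(tir, prot)
```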

Models mapping protein levels to thiamine production

Modeling of the thiamine production landscape was done with a Gaussian process using the best linear unbiased predictor framework46. Under this assumption, the vector of thiamine production measurements y ∈ ℝ^N follows a mean-zero multivariate Gaussian distribution dictated by the covariance function of all protein expression levels X ∈ ℝ^(N×D) in the dataset:

y ~ N(0, K(X, X))

We subsequently z-score both thiamine and log protein levels before fitting the Gaussian process model. The covariance matrix K(X, X) ∈ ℝ^(N×N) is created via a kernel function, which describes how similar each pair of datapoints in the dataset is. Here we use a squared-exponential kernel, an infinitely differentiable function that can capture nonlinear relationships between datapoints and encourages a smooth optimization landscape. The similarity between two protein expression profiles x and x′ is then:

k(x, x′) = exp(−|x − x′|² / ℓ²)

We introduce length-scale parameters ℓ ∈ ℝ^D that reweigh the importance of each of the protein expression levels in the kernel during covariance function evaluation. These parameters are fit by maximizing the marginal likelihood of the model on the training data.

To predict the thiamine production levels of M new datapoints X* ∈ ℝ^(M×D), the kernel is evaluated against all the training data, K* = k(X*, X) ∈ ℝ^(M×N). Predictions ŷ for these new datapoints are created via:

ŷ = K* K(X, X)⁻¹ y

Once the Gaussian process model is fit to data, we evaluate new constructs in silico by varying the log TIR levels of ThiC, ThiG, and ThiE in the background of the highest-producing strain (TS11_A2) over a range of 0 to 17 in increments of 1, predicting protein levels with the forward operon model, and predicting thiamine production with the Gaussian process.
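The predictive equations above can be implemented in a few lines of numpy; the sketch below is a simplified, noise-free version that uses a small jitter term for numerical stability and fixed, illustrative length scales rather than ones fit by maximizing the marginal likelihood.

```python
# Sketch of the Gaussian process predictor defined above. X_train
# (N x D) holds z-scored log protein levels, y_train (N,) z-scored
# thiamine production; length scales are illustrative, not fitted.
import numpy as np

def sq_exp_kernel(A, B, lengthscales):
    """Squared-exponential kernel with per-dimension length scales."""
    diff = A[:, None, :] - B[None, :, :]
    return np.exp(-np.sum((diff / lengthscales) ** 2, axis=-1))

def gp_predict(X_train, y_train, X_new, lengthscales):
    """Predictive mean y* = K(X*, X) K(X, X)^-1 y."""
    K = sq_exp_kernel(X_train, X_train, lengthscales)
    K_star = sq_exp_kernel(X_new, X_train, lengthscales)
    alpha = np.linalg.solve(K + 1e-8 * np.eye(len(X_train)), y_train)  # jitter
    return K_star @ alpha

# Illustrative call with random stand-in data (N=52 strains, D=7 proteins).
rng = np.random.default_rng(0)
X, y = rng.normal(size=(52, 7)), rng.normal(size=52)
y_star = gp_predict(X, y, X[:5], lengthscales=np.ones(7))
```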

Decision trees are a nonparametric supervised learning technique that partitions input data to predict output variables. The data are organized into a tree structure, where nodes are introduced to decrease the mean squared error among their leaves. Importantly, decision trees only partition the input data and do not extrapolate or make reliable predictions outside of the data provided. To fully enumerate potential protein levels outside of the range of the input library, the Gaussian process was used to predict thiamine production given the varied protein levels, which was then summarized by the decision tree. We set the maximum number of leaf nodes in the tree to 50 and the minimum samples per leaf to 10 to interpret both the log TIR and log protein levels predicted to optimize thiamine production.

Both the Gaussian process and the decision tree were implemented with the Python sklearn package (Version 0.16.1)47.
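A sketch of this summarization step is given below; the TIR grid mirrors the 0 to 17 enumeration described above, while the Gaussian process predictions are replaced by a random placeholder since the fitted model is not reproduced here.

```python
# Sketch of summarizing the (GP-predicted) thiamine landscape with a
# decision tree, using the settings reported in the text.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Enumerate log TIRs of ThiC, ThiG, and ThiE from 0 to 17 in steps of 1.
grid = np.array(np.meshgrid(*[np.arange(18)] * 3)).reshape(3, -1).T
gp_predictions = np.random.rand(len(grid))  # placeholder for GP output

tree = DecisionTreeRegressor(max_leaf_nodes=50, min_samples_leaf=10)
tree.fit(grid, gp_predictions)  # the tree's splits summarize the landscape
```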

We first performed linear regression using all protein levels to predict thiamine production. The ThiC, ThiG, and ThiE protein levels were significant (p < 0.005). To build a more powerful regressor while reducing the number of terms in the model, we used stepwise regression with backwards elimination to predict thiamine production from log protein levels. Briefly, starting with the ThiC, ThiG, and ThiE protein levels and all multiplicative pairwise interactions, a regression model was fit after removing each variable from the model in turn. The model whose removed term gave the lowest Akaike information criterion (AIC) was selected for the next round of variable elimination. This process was repeated until the AIC could not be improved. The procedure converged on the solution:

ŷ = β₁ ThiC + β₂ ThiE + β₃ ThiG + β₄ (ThiC × ThiG) + c

Stepwise regression was performed using the “step” function in R.
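The analysis itself used R's step(); purely for illustration, an equivalent backwards-elimination loop can be written in Python with statsmodels, as in the hedged sketch below, where the DataFrame and its columns are hypothetical stand-ins for the measured data.

```python
# Hedged Python sketch of backwards elimination by AIC (the original
# analysis used R's step() function); `df` is a hypothetical DataFrame
# of z-scored log protein levels plus measured thiamine production.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, response, terms):
    """Repeatedly drop the term whose removal lowers AIC the most."""
    current = list(terms)
    best_aic = smf.ols(f"{response} ~ {' + '.join(current)}", df).fit().aic
    while len(current) > 1:
        # Fit a model with each term removed in turn and record its AIC.
        trials = [(smf.ols(f"{response} ~ {' + '.join(t)}", df).fit().aic, t)
                  for t in ([x for x in current if x != term] for term in current)]
        aic, reduced = min(trials, key=lambda pair: pair[0])
        if aic >= best_aic:
            break  # no removal improves the AIC
        best_aic, current = aic, reduced
    return current, best_aic

# Main effects plus all multiplicative pairwise interactions, as in the text.
terms = ["ThiC", "ThiG", "ThiE", "ThiC:ThiG", "ThiC:ThiE", "ThiG:ThiE"]
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(52, 4)),
                  columns=["ThiC", "ThiG", "ThiE", "thiamine"])
selected, aic = backward_eliminate(df, "thiamine", terms)
```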

Strain enrichment

To predict the number of strains that need to be built to reach a target phenotypic output based on the integrated model, we performed a simulation study. We first examined the distribution of the error, calculated as the difference between the integrated model prediction and the validation result. This distribution closely followed a normal distribution (Shapiro-Wilk test p-value = 0.9757) with mean = 1.3438 and standard deviation = 3.539603. We simulated 1000 strain outputs using a normal distribution with these parameters. We then randomly sampled 3, 4, or 5 strains 10,000 times and scored how many times we reached the target phenotype, which we set at ±1.5 µM of intracellular thiamine, close to the measurement accuracy of our fluorescence assay. By building 5 strains, we would expect to reach the target phenotype 85% of the time.
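In outline, this simulation can be reproduced as follows; the sketch uses the reported error distribution and target window but is a minimal illustration rather than the original script.

```python
# Sketch of the strain-enrichment simulation using the reported error
# distribution (mean 1.3438, sd 3.539603) and the +/-1.5 uM target window.
import numpy as np

rng = np.random.default_rng(0)
errors = rng.normal(loc=1.3438, scale=3.539603, size=1000)  # simulated strain errors

for n_strains in (3, 4, 5):
    draws = rng.choice(errors, size=(10_000, n_strains))
    # A trial succeeds if any built strain lands within the target window.
    success = np.any(np.abs(draws) <= 1.5, axis=1).mean()
    print(f"{n_strains} strains: {success:.0%} reach the target phenotype")
```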

Bioreactor fermentation

Strain EcGENTS_A11 was chosen for bioreactor fed-batch experiments. 16 biological replicates were screened for production in 400 µl mrMOPS supplemented with 0.4% Cas Amino Acids (BD Biosciences, 228830), 50 µg/ml kanamycin, and 20 µg/ml chloramphenicol. Cultures were inoculated to an initial OD600 of 0.01 in a deep-well plate and incubated for 24 hours at 37°C with shaking at 275 rpm. OD600 was measured before cells were spun down at 5000 G for 5 minutes, and the supernatant was evaluated for thiamine content using the thiochrome assay as described above.

50 mL mrMOPS in 250 mL baffled shake flasks, supplemented with 0.4% Cas Amino Acids (BD Biosciences, 228830), 50 µg/ml kanamycin, and 20 µg/ml chloramphenicol, was inoculated with 2 mL of an overnight culture in the same media of three biological replicates validated for thiamine production by the thiochrome assay. After 24 hours of incubation at 37°C with shaking at 275 rpm, the pre-cultures were evaluated for thiamine production and 0.5 L of fermentation media was inoculated at a 5% inoculum density. The fermentation media contained 10 g/L glucose, 10 g/L yeast extract, 1 g/L MgSO4·7H2O, and 10 g/L Cas Amino Acids supplemented with 10 g/L (NH4)2SO4, 4 g/L K2HPO4, 4 g/L KH2PO4, 3 g/L Na3Citrate·2H2O, 2 g/L Na2SO4, and 1 g/L NH4Cl. After the batch phase (depletion of glucose), glucose feeding was performed at a constant rate of 4 g/h using a 51.5% glucose solution supplemented with 1 g/L MgSO4·7H2O. 50 mL of 10 g/L branched-chain amino acids was added 24 hours and 55 hours after inoculation, and 1.5 mL of trace metals was added 24 hours after inoculation. The trace metal solution consisted of 0.11 g/L ZnSO4·7H2O, 0.1 g/L CuSO4·5H2O, 0.06 g/L MnSO4·H2O, 8.36 g/L FeCl3·6H2O, 0.12 g/L CoCl2·6H2O, and 0.4 g/L CaCl2·2H2O. 1 mL of 1% Antifoam 204 (Sigma-Aldrich) was added twice during the fermentations to avoid overfoaming. The fermentations were run for 55 hours.

Thiamine bioassay

Due to the complexity of measuring thiamines in the high-cell-density fermentation broth using the previous assays, we used a thiamine bioassay to determine thiamine titers. A thiamine-deficient strain, E. coli BW25113 ΔthiE, was used to establish a growth-based bioassay, in which the strain responds in a thiamine-dependent way when grown in minimal media. The dynamic range of the strain was determined to be 0.4-5 nM thiamines when grown in mrMOPS. Samples of interest were diluted into this range in mrMOPS media with E. coli BW25113 ΔthiE at an OD600 of 0.001. A volume of 150 µl was transferred into a microtiter plate and incubated at 37°C for 20 hours with shaking at 275 rpm before the OD600 was measured and the thiamine content in the samples evaluated.

Acknowledgements

We acknowledge Jonathan Vu for help with proteomics experiments. The work was supported by the Novo Nordisk Foundation and the European Union Seventh Framework Programme (FP7-KBBE-2013-7-single-stage) under grant agreement no. 613745, Promys. The work conducted by the U.S. Department of Energy Joint Genome Institute and Joint BioEnergy Institute is supported by the Office of Science, Office of Biological and Environmental Research, of the U.S. Department of Energy under Contract No. DE-AC02-05CH11231. The work conducted by Biosyntia ApS was supported with funding from the European Union’s Horizon 2020 research and innovation program under grant agreement No. 686070, DD-Decaf. A.J.R. was additionally supported by the DOE Computational Science Graduate Fellowship (CSGF) program No. DE-FG02-97ER25308.

Hans J. Genee, Luisa S. Gronenberg, Bo Salomonsen, Anne P. Bali, Kathleen Smart, and Morten O.A. Sommer have financial interests in Biosyntia ApS.

References

1 Nielsen, J. & Keasling, J. D. Engineering Cellular Metabolism. Cell 164, 1185-1197, doi:10.1016/j.cell.2016.02.004 (2016).

2 Awan, A. R., Shaw, W. M. & Ellis, T. Biosynthesis of therapeutic natural products using synthetic biology. Advanced drug delivery reviews 105, 96-106, doi:10.1016/j.addr.2016.04.010 (2016).

3 Lo, T. M. et al. Microbial engineering strategies to improve cell viability for biochemical production. Biotechnology advances 31, 903-914, doi:10.1016/j.biotechadv.2013.02.001 (2013).

4 Solomon, K. V. & Prather, K. L. The zero-sum game of pathway optimization: emerging paradigms for tuning gene expression. Biotechnology journal 6, 1064-1070, doi:10.1002/biot.201100086 (2011).

5 Nielsen, J. Metabolic engineering: techniques for analysis of targets for genetic manipulations. Biotechnology and bioengineering 58, 125-132 (1998).

6 Zhu, L., Zhu, Y., Zhang, Y. & Li, Y. Engineering the robustness of industrial microbes through synthetic biology. Trends in microbiology 20, 94-101, doi:10.1016/j.tim.2011.12.003 (2012).

7 Bonde, M. T. et al. Predictable tuning of protein expression in bacteria. Nature methods 13, 233-236, doi:10.1038/nmeth.3727 (2016).

8 King, Z. A., Lloyd, C. J., Feist, A. M. & Palsson, B. O. Next-generation genome-scale models for metabolic engineering. Current opinion in biotechnology 35, 23-29 (2015).

9 Saa, P. A. & Nielsen, L. K. Construction of feasible and accurate kinetic models of metabolism: A Bayesian approach. Scientific reports 6, 29635 (2016).

10 Xu, P., Ranganathan, S., Fowler, Z. L., Maranas, C. D. & Koffas, M. A. Genome-scale metabolic network modeling results in minimal interventions that cooperatively force carbon flux towards malonyl-CoA. Metabolic engineering 13, 578-587 (2011).

11 Nielsen, A. A. et al. Genetic circuit design automation. Science 352, aac7341 (2016).

12 Roehner, N., Young, E. M., Voigt, C. A., Gordon, D. B. & Densmore, D. Double Dutch: a tool for designing combinatorial libraries of biological systems. ACS synthetic biology 5, 507-517 (2016).

13 Smanski, M. J. et al. Functional optimization of gene clusters by combinatorial design and assembly. Nature biotechnology 32, 1241 (2014).

14 Yaman, F., Bhatia, S., Adler, A., Densmore, D. & Beal, J. Automated selection of synthetic biology parts for genetic regulatory networks. ACS synthetic biology 1, 332-344 (2012).

15 Farasat, I. et al. Efficient search, mapping, and optimization of multi-protein genetic systems in diverse bacteria. Molecular systems biology 10, 731 (2014).

16 Lee, M. E., Aswani, A., Han, A. S., Tomlin, C. J. & Dueber, J. E. Expression-level optimization of a multi-enzyme pathway in the absence of a high-throughput assay. Nucleic acids research 41, 10668-10678 (2013).

17 Xu, P., Rizzoni, E. A., Sul, S.-Y. & Stephanopoulos, G. Improving metabolic pathway efficiency by statistical model-based multivariate regulatory metabolic engineering. ACS synthetic biology 6, 148-158 (2016).

18 Young, E. M. et al. Iterative algorithm-guided design of massive strain libraries, applied to itaconic acid production in yeast. Metabolic engineering 48, 33-43 (2018).

19 Zhou, H., Vonk, B., Roubos, J. A., Bovenberg, R. A. & Voigt, C. A. Algorithmic co-optimization of genetic constructs and growth conditions: application to 6-ACA, a potential nylon-6 precursor. Nucleic acids research 43, 10560-10570 (2015).

20 Jurgenson, C. T., Begley, T. P. & Ealick, S. E. The structural and biochemical foundations of thiamin biosynthesis. Annual review of biochemistry 78, 569-603, doi:10.1146/annurev.biochem.78.072407.102340 (2009).

21 Gibson, D. G. Enzymatic assembly of overlapping DNA fragments. Methods in enzymology 498, 349-361, doi:10.1016/B978-0-12-385120-8.00015-2 (2011).

22 Espah Borujeni, A., Channarasappa, A. S. & Salis, H. M. Translation rate is controlled by coupled trade-offs between site accessibility, selective RNA unfolding and sliding at upstream standby sites. Nucleic acids research 42, 2646-2659, doi:10.1093/nar/gkt1139 (2014).

23 Salis, H. M. The ribosome binding site calculator. Methods in enzymology 498, 19-42, doi:10.1016/B978-0-12-385120-8.00002-4 (2011).

24 Oppenheim, D. S. & Yanofsky, C. Translational coupling during expression of the tryptophan operon of Escherichia coli. Genetics 95, 785-795 (1980).

25 Schumperli, D., McKenney, K., Sobieski, D. A. & Rosenberg, M. Translational coupling at an intercistronic boundary of the Escherichia coli galactose operon. Cell 30, 865-871 (1982).

26 Tian, T. & Salis, H. M. A predictive biophysical model of translational coupling to coordinate and control protein expression in bacterial operons. Nucleic acids research 43, 7137-7151, doi:10.1093/nar/gkv635 (2015).

27 Rasmussen, C. E. & Williams, C. K. Gaussian processes for machine learning. Vol. 1 (MIT press Cambridge, 2006).

28 Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences of the United States of America 110, E193-201, doi:10.1073/pnas.1215251110 (2013).

29 O'Brien, E. J., Lerman, J. A., Chang, R. L., Hyduke, D. R. & Palsson, B. O. Genome-scale models of metabolism and gene expression extend and refine growth phenotype prediction. Molecular systems biology 9, 693, doi:10.1038/msb.2013.52 (2013).

30 Frosst, N. & Hinton, G. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784 (2017).

31 Palmer, L. D. & Downs, D. M. The thiamine biosynthetic enzyme ThiC catalyzes multiple turnovers and is inhibited by S-adenosylmethionine (AdoMet) metabolites. The Journal of biological chemistry 288, 30693-30699, doi:10.1074/jbc.M113.500280 (2013).

32 zu Berstenhorst, S. M., Hohmann, H.-P. & Stahmann, K.-P. Vitamins and vitamin-like compounds: microbial production. (2009).

33 Schyns, G. et al. Isolation and characterization of new thiamine-deregulated mutants of Bacillus subtilis. Journal of bacteriology 187, 8127-8136 (2005).

34 Hasnain, G. et al. Bacterial and plant HAD enzymes catalyze a missing phosphatase step in thiamin diphosphate biosynthesis. Biochemical Journal, BJ20150805 (2015).

35 Jeschek, M., Gerngross, D. & Panke, S. Rationally reduced libraries for combinatorial pathway optimization minimizing experimental effort. Nature communications 7, 11163, doi:10.1038/ncomms11163 (2016).

36 Smanski, M. J. et al. Functional optimization of gene clusters by combinatorial design and assembly. Nat Biotechnol 32, 1241-1249, doi:10.1038/nbt.3063 (2014).

37 Genee, H. J. et al. Functional mining of transporters using synthetic selections. Nature chemical biology 12, 1015-1022, doi:10.1038/nchembio.2189 (2016).

38 Cambray, G. et al. Measurement and modeling of intrinsic transcription terminators. Nucleic acids research 41, 5139-5148, doi:10.1093/nar/gkt163 (2013).

39 Mutalik, V. K. et al. Quantitative estimation of activity and quality for collections of functional genetic elements. Nature methods 10, 347-353, doi:10.1038/nmeth.2403 (2013).

40 Lutz, R. & Bujard, H. Independent and tight regulation of transcriptional units in Escherichia coli via the LacR/O, the TetR/O and AraC/I1-I2 regulatory elements. Nucleic acids research 25, 1203-1210 (1997).

41 Espah Borujeni, A. & Salis, H. M. Translation Initiation is Controlled by RNA Folding Kinetics via a Ribosome Drafting Mechanism. Journal of the American Chemical Society 138, 7016-7023, doi:10.1021/jacs.6b01453 (2016).

42 Schyns, G. et al. Isolation and characterization of new thiamine-deregulated mutants of Bacillus subtilis. Journal of bacteriology 187, 8127-8136, doi:10.1128/JB.187.23.8127-8136.2005 (2005).

43 McKenna, A. et al. The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data. Genome research 20, 1297-1303, doi:10.1101/gr.107524.110 (2010).

44 Robinson, J. T. et al. Integrative genomics viewer. Nat Biotechnol 29, 24-26, doi:10.1038/nbt.1754 (2011).

45 Mendez-Perez, D. et al. Production of jet fuel precursor monoterpenoids from engineered Escherichia coli. Biotechnology and bioengineering 114, 1703-1712, doi:10.1002/bit.26296 (2017).

46 Welch, W. J. et al. Screening, predicting, and computer experiments. Technometrics 34, 15-25 (1992).

47 Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825-2830 (2011).

Chapter 5

Conclusions

Here I have shown how to use probabilistic models to predict phenotype from genetic sequences. These characterizations range from in vitro studies of protein function, to organismal phenotypes in the laboratory, to mutations that can cause disease in humans. Predictions are made either directly from a collection of sequences alone, summarizing constraints from evolutionary sequences, or by utilizing sequencing, proteomic, and metabolomic measurements from the system being optimized.

Specific Conclusions

In Chapter 1, I discussed DeepSequence, a deep latent variable model fit to evolutionary sequences to predict the effects of mutations in an unsupervised manner. This work opened the door to utilizing other new deep generative models for unsupervised variant effect prediction. It also stressed the importance of using approximate Bayesian inference to avoid overfitting to limited biological data. Since its release, new approximate inference techniques have emerged that can put a tighter bound on the variational autoencoder objective, including inverse autoregressive flows1 and importance weighting2. Latent variable models that utilize normalizing flows3,4 can calculate an exact likelihood instead of a lower bound. Generative Adversarial Networks5 are also generative models, in which the discriminator models P(X); this component of the network could in theory be used for variant effect prediction. However, difficulties in training, a lack of principled regularization, and mode collapse remain challenges in this framework6. Finally, DeepSequence struggled to model sequence families with low sequence diversity, particularly viral families. To leverage the small amount of sequence information available, nonparametric density estimators like Gaussian process latent variable models7,8 may be less likely to overfit.

In Chapter 2, I discussed how autoregressive generative models can be fit to protein families to both predict the effects of mutations and generate a synthetic nanobody library. Deep autoregressive models for biological sequences are particularly interesting because they do not explicitly require sequence homology in the data being modelled. The current algorithm is only fit to a single sequence family, and, paradoxically, those sequences were gathered by an alignment-based search tool9. How, then, can we fit the autoregressive model to sequences that cannot be modeled with an alignment if we cannot gather examples of those sequences without alignments? K-mer-based methods or deep HMMs10 may be able to find unalignable sequences in public databases. This framework also allows fitting multiple sequence families at once; ideally, all sequences could be fit in one big model that could be used to infer the effect of a mutation for any sequence quickly. However, I briefly tried fitting the entire PFAM database11 at once with a causal dilated convolutional neural network and ran into two issues. First, it is unclear how to weight sequences when fitting the model because of the large class imbalance of domain representation in biological sequence databases. Second, more powerful architectures with more parameters are needed to fit this much data. Finally, as mentioned previously, latent variables can be added to explain additional variance in the data to improve predictions or for feature engineering of sequences. It has been reported that the hidden representations of each timestep in autoregressive models are not useful for semisupervised learning4. Integrating latent variables per residue12,13 may provide the key to more interpretable models for controlled generation of sequences.

In Chapter 3, I discussed how to use sparse Bayesian logistic regression to identify mutations in a nanobody sequence enriched during affinity maturation due to increased binding to an antigen. We were surprised by how well a simple sparse logistic regression model, when trained on the first round of selection, could predict future rounds of selection. These results imply that subsequent rounds of selection may not need to be performed, reducing the time and money required to find a nanobody with sufficiently good properties. I did try a model with hierarchically sparse pairwise terms14, but I could not find support for epistatic relationships between pairs of positions. Since a site-independent model could sufficiently describe the data, higher-order terms were not required. In the future, the hierarchical priors of the sparse regression model could be used to integrate external information. Similar to previous efforts15, if the crystal structure is known, priors on the pairwise terms can encourage epistatic terms between positions that are close in 3D structure.

In Chapter 4, I discussed how to model and optimize thiamine production from measurements of a combinatorial biosynthetic ribosome binding site library. Since I was working closely with experimentalists, interpretability was paramount. The decision tree by itself did a poor job of characterizing the thiamine production landscape; using the Gaussian process model as the main predictor and interpreting this model with the decision tree revealed to me and my collaborators the constraints to optimize. In the future, these models can be merged to make a more powerful, interpretable model16. Since our thiamine system had an antibiotic-resistance reporter, it would be interesting to know whether a different or better optimum could be achieved by both improving the combinatorial space of ribosome binding sites17 tested at the beginning of each of the genes and increasing the stringency of selection. Finally, I envision that modulating both the expression level of proteins and introducing mutations into the protein sequences themselves will be an interesting avenue of future research that can integrate all the modelling paradigms described in this thesis.

General Conclusions

One continuing challenge is the conflict between the predictive power and the interpretability of machine learning algorithms. In site-independent, pairwise (Ising), or hidden Markov models, the parameter values themselves are interpretable and typically have a monotonic relationship with the outcome of the model. Deep learning has earned a reputation for being a “black-box” predictor, and for good reason: the parameters of the model are convolved with nonlinear, data-specific hidden variables. In the case of Gaussian processes, there are hardly any (or potentially no) parameters to interpret. When using these more powerful algorithms, predictions must either be valuable enough that they do not warrant explanation by the model, or additional analysis, such as regression, decision trees, or residual analysis, must be used to understand the predictions made by the model. As discussed previously, additional use of latent variables may be key to interpreting deep learning algorithms.

Generative models are particularly useful in summarizing evolutionary information. The algorithms described in Chapters 1 and 2 add to the toolbox of generative models available to researchers and describe general heuristics for fitting deep generative models to biological data. Each generative model specializes in certain tasks (Table 5.1).

Table 5.1 – Analyzed classes of generative models for computational biology.

Hidden Markov model (HMM): gene finding, homology detection
Ising/Potts model (pairwise): contact prediction, mutation effect prediction (missense)
Latent variable model: mutation effect prediction (missense), dimensionality reduction
Autoregressive model: mutation effect prediction (missense, insertion, deletion), protein design

Generally, integrating experimental and evolutionary signals together in model building is still an open question in computational biology. How can we make models that are powerful enough to parameterize an entire sequence family with thousands of members, yet sensitive enough to individual mutations of a single sequence? In the field of machine learning, this can be cast as either a semi-supervised or a transfer learning problem. Parameterizing the objective function to capture additional structural or evolutionary data while remaining sensitive enough to model laboratory variation in a probabilistic framework is still an open question.

As our ability to synthesize and sequence DNA improves, better algorithms are necessary to leverage this technology. I am confident that the tools described here, and the generalizable rules therein, can improve our understanding of biological systems and expand our reach into more complicated, diverse, and valuable systems.

References

1 Kingma, D. P. et al. in Advances in Neural Information Processing Systems. 4743-4751.

2 Burda, Y., Grosse, R. & Salakhutdinov, R. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519 (2015).

3 Dinh, L., Sohl-Dickstein, J. & Bengio, S. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803 (2016).

4 Kingma, D. P. & Dhariwal, P. Glow: Generative flow with invertible 1x1 convolutions. arXiv preprint arXiv:1807.03039 (2018).

5 Goodfellow, I. et al. in Advances in neural information processing systems. 2672-2680.

6 Arjovsky, M., Chintala, S. & Bottou, L. Wasserstein GAN. arXiv preprint arXiv:1701.07875 (2017).

7 Lawrence, N. D. in Advances in neural information processing systems. 329-336.

8 Gal, Y., Chen, Y. & Ghahramani, Z. Latent Gaussian processes for distribution estimation of multivariate categorical data. (2015).

9 Eddy, S. R. Accelerated profile HMM searches. PLoS computational biology 7, e1002195 (2011).

10 Krishnan, R. G., Shalit, U. & Sontag, D. in AAAI. 2101-2109.

11 Finn, R. D. et al. The Pfam protein families database: towards a more sustainable future. Nucleic acids research 44, D279-D285 (2015).

12 Prenger, R., Valle, R. & Catanzaro, B. WaveGlow: A Flow-based Generative Network for Speech Synthesis. arXiv preprint arXiv:1811.00002 (2018).

13 Chung, J. et al. in Advances in neural information processing systems. 2980-2988.

14 Ingraham, J. B. & Marks, D. Bayesian sparsity for intractable distributions. arXiv preprint arXiv:1602.03807 (2016).

15 Romero, P. A., Krause, A. & Arnold, F. H. Navigating the protein fitness landscape with Gaussian processes. Proceedings of the National Academy of Sciences 110, E193-E201 (2013).

16 Frosst, N. & Hinton, G. Distilling a neural network into a soft decision tree. arXiv preprint arXiv:1711.09784 (2017).

17 Jeschek, M., Gerngross, D. & Panke, S. Rationally reduced libraries for combinatorial pathway optimization minimizing experimental effort. Nature communications 7, 11163 (2016).


Appendix: Supplementary Figures and Tables


Supplementary Figure S1.1. Distribution of experimental mutation effects and predictions made by DeepSequence.


Supplementary Figure S1.2. Mutation-effect predictions from generative models can be generalized to unseen sequences. (above) Spearman ρ of mutation effect prediction of β-lactamase7 for each of the three generative models (N=4788). Sequences with a normalized Hamming distance greater than 0.53, 0.6, 0.8, and 0.95 with respect to the reference sequence are removed from the alignment before model fitting and inference. The distribution of Hamming distances of the alignment and the cutoff of inclusion into each alignment is shown below.


Supplementary Figure S1.3. Predictions from all generative models for sequence families exhibited biases when compared to experimental data. By transforming all model predictions and mutations to normalized ranks, we can compare effect predictions to experimental data across all biological datasets and models. The site-independent, pairwise, and latent variable models systematically over- and under-predict the effects of mutations according to amino acid identity. These biases vary in magnitude and direction depending on the amino acid identity before mutation (wildtype) or the residue identity it is mutated to (mutant).

[Figure: bar chart of Spearman ρ (0.0-0.9) for DeepSequence, EVmutation, independent, and supervised predictions across the deep mutational scanning datasets, ordered by number of mutations.]

Supplementary Figure S1.4. Supervised calibration of mutation-effect predictions improves predictive performance. Amino acid bias was corrected with linear regression for all generative models, leaving one protein out for testing and training a model on the rest (Methods). The bottom of each bar is the Spearman ρ before correction, while the top is the Spearman ρ after correction. Predictions without any evolutionary information (Supervised) performed considerably worse than the other predictors.


Supplementary Figure S1.5 – Differential improvement was strongest for deleterious effects. The top five positions with the largest reduction in rank error from the independent model to DeepSequence for eight proteins are shown on the crystal structure of each protein.


Supplementary Figure S2.1 – Spearman ρ of effect predictions for all reported deep mutational scans and generative models.


Supplementary Figure S2.2. Difference in Spearman ρ between the average effect predictions of models trained in an N-to-C orientation and those trained in a C-to-N orientation, compared against effects measured in the laboratory.


Supplementary Figure S2.3. Distribution of sequence lengths and mutation effects of imidazoleglycerol-phosphate dehydratase compared in Figure 2.4. There is no correlation between protein length and experimentally-determined function.


Supplementary Figure S4.1. Construction of combinatorial TPP pathway library. a. The open reading frames of thiC, thiE, thiD, thiF, thiS, thiG, and thiH were amplified by PCR as individual fragments from E. coli MG1655 genomic DNA. For each open reading frame, four different forward primers were used that, in addition to the Gibson linker, contain a 16 bp 5’ untranslated region (5’UTR) encoding one of the following four ribosome binding sites (RBSs): 1) AGCTAAGGAGGTAAAT, 2) AGCGAGGTAATACTAG, 3) AGCGTGGTAATACTAG, and 4) AGCGTGCTAATACTAG. The resulting 28 fragments were gel-purified. Similarly, pGEN49 and pGEN50 were PCR amplified using whole-plasmid amplification, treated with DpnI restriction enzyme to remove template DNA, and gel-purified. b. Construction of the full TPP pathway library was performed following a two-step hierarchical cloning procedure. First, the PCR fragments based on pGEN49, thiC, thiE, and thiD (13 in total) were mixed in equimolar concentrations and assembled by Gibson cloning (NEB # E2611) using the manufacturer’s instructions, resulting in potentially 64 pooled constructs representing combinatorial expression variants of the thiCED cistron (named pGEN77). Likewise, the PCR fragments based on pGEN50, thiF, thiS, thiG, and thiH were assembled, resulting in potentially 256 constructs representing combinatorial expression variants of the thiFSGH cistron (named pGEN80). c. The pGEN77-mix was transformed into electrocompetent E. coli DH10B, recovered for 1.5 h in 1 mL Super Optimal broth with Catabolite repression (SOC) medium, to which 9 mL 2xYT supplemented with kanamycin (50 µg/ml) was added, and the culture was incubated at 37°C overnight. The following day, the total plasmid pool of 5 mL of the overnight culture was purified and digested for > 8 h using SwaI (NEB # R0604), and the linearized DNA was gel-purified. Using the in vitro assembly mix of the pGEN80-mix as template for PCR, the thiFSGH cistrons were amplified from the A to the C linker and gel-purified. Finally, the linearized pGEN77-mix was mixed in equimolar amounts with the amplified thiFSGH cistron mix and assembled into one plasmid library (pGEN83) by Gibson assembly.

[Figure: bar chart of predicted translation initiation rates (TIR, 1 to 1,000,000, log scale) for RBS 1-4 across ThiC, ThiE, ThiD, ThiF, ThiS, ThiG, and ThiH.]

Supplementary Figure S4.2. Predicted translation initiation rates. Bar chart showing the predicted translation initiation rates (TIRs) for each RBS in each gene of the refactored thiamine biosynthetic cluster as calculated by the RBS calculator v2.0.

[Figure: a. schematic of the TPP riboswitch biosensor controlling cat (chloramphenicol resistance) and aadA (spectinomycin resistance) expression; b. growth of control and library strains across chloramphenicol (0-80 µg/mL) and spectinomycin (0-50 µg/mL) concentrations, with and without thiamine.]

Supplementary Figure S4.3. RNA-encoded biosensor detects intracellular thiamine concentrations. a. The biosensor works through a thiamine pyrophosphate (TPP) riboswitch that activates the expression of antibiotic resistance genes in a dose-dependent manner. b. Survival of control strains (empty vector), library strains, and control strains with thiamine (50 µM) added to the medium, in an E. coli host harboring the thiamine biosensor. Increasing concentrations of the two antibiotics are shown at the top in blue and red.

[Figure: scatter plots of normalized LN protein level versus LN predicted TIR for each gene. Panel fits: ThiC Rsq=0.70, p=3.62e−14; ThiE Rsq=0.63, p=2.44e−12; ThiD Rsq=0.46, p=3.70e−7; ThiF Rsq=0, p=0.964; ThiS Rsq=0.63, p=6.18e−12; ThiG Rsq=0.8, p=2.20e−16; ThiH Rsq=0.07, p=0.098.]

Supplementary Figure S4.4. RBS to protein level association in 52 characterized strains. The correlation between normalized natural log (LN) measured protein levels on the Y-axis and LN predicted translation initiation rates (TIR) on the X-axis. The colored bars at the bottom of the X-axis indicate the predicted LN TIR for each of the four RBSs. The red line indicates the best fit based on linear regression analysis. The text at the top shows the regression-based R-squared and p-values for each gene.

[Figure: a. Pearson correlation matrix of LN protein levels across ThiC, ThiE, ThiD, ThiF, ThiS, ThiG, and ThiH; b. regression coefficients of the final linear operon model, with adjacent-interaction coefficients per gene (ThiC 0.234, ThiE 1.186, ThiD 0.318, ThiF 0.004, ThiS 0.402, ThiG 0.540, ThiH 0.232) and upstream-interaction coefficients (0.288, -0.072, 0.029, -0.014, 0.246).]

Supplementary Figure S4.5. Detecting interactions between genes in protein levels and RBS strengths. a. Correlation coefficients between protein levels of different genes. Correlations between ThiC-ThiF (P=0.02799), ThiF-ThiS (P=7.269e-09), and ThiG-ThiH (P=4.726e-07) were statistically significant. b. Regression coefficients of the final linear operon model. Positive upstream interactions are found between ThiC-ThiE, ThiF-ThiS, and ThiG-ThiH.


Supplementary Figure S4.6. Validation of thiamine quantification assays. Total intracellular thiamines (thiamine, thiamine monophosphate, and thiamine pyrophosphate (TPP)) of 52 samples from 52 different thiamine production strains were quantified using a thiochrome method developed in this study (see online methods). The thiochrome quantifications were compared to our measurements of intracellular TPP of the same samples using high-performance liquid chromatography (HPLC). A linear correlation between the levels from the two methods was observed (R-squared = 0.92), showing that the thiochrome method agrees with HPLC measurements and that total intracellular thiamines correlate with intracellular TPP levels.


Supplementary Figure S4.7. Total thiamine of 52 characterized strains. Total extracellular thiamines (thiamine, thiamine monophosphate, and thiamine pyrophosphate (TPP)) of 52 samples from 52 different thiamine production strains were quantified using a thiochrome method developed in this study (see online methods). The thiochrome quantifications were compared to our measurements of intracellular TPP of the same samples using high-performance liquid chromatography (HPLC). A linear correlation between the levels from the two methods was observed (R-squared = 0.81), showing that total extracellular thiamines correlate with intracellular TPP levels. The limit of detection for the fluorescence assay was > 10 nM/OD.


Supplementary Figure S4.8. Multivariate regression model of thiamine production. Thiamine production landscape over values of ThiC and ThiG log(protein levels) using the multivariate linear regression model (Thiamine ≈ ThiC + ThiE + ThiG + ThiC*ThiG). Thiamine production is predicted to increase linearly with production of both ThiC and ThiG proteins.


Supplementary Figure S4.9. Decision tree summarizing the constraints of log TIRs in the thiamine production landscape. This tree was summarized to determine the rules of thiamine production.


[Figure: scatter plots of measured versus predicted protein levels for each gene. Panel fits: ThiC Rsq=0.077, p=0.2986; ThiE Rsq=0.737, p=2.06e−5; ThiD Rsq=0.573, p=6.81e−4; ThiF Rsq=0.209, p=0.075; ThiS Rsq=0.194, p=0.087; ThiG Rsq=0.893, p=3.46e−8; ThiH Rsq=0.703, p=4.93e−5.]

Supplementary Figure S4.10. Operon model predicts protein levels in validation strains. Scatter plots show the association between natural log (LN) measured protein levels on the Y-axis and LN protein levels predicted by the operon model on the X-axis. The red line indicates the best fit based on linear regression analysis. The text at the top shows the regression-based R-squared and p-values for each gene.


[Figure: ThiC and ThiE model comparison to validation experiment. Intracellular thiamines (µM) versus ThiC and ThiE protein levels, comparing the regression model, the Gaussian process model, and the validation experiment.]

Supplementary Figure S4.11. Constraints on ThiE and ThiC protein levels in high-producing thiamine strains predicted by the Gaussian process.


Supplementary Figure S4.12. Bioreactor fed-batch fermentations of three biological replicates of strain EcGENTS_A11. Optical density (OD600) (red) and extracellular thiamines (thiamine + TMP + TPP) (blue) were measured on samples withdrawn from the fermentations. Error bars indicate the standard deviation of three technical replicates.


Supplementary Table S2.1. Effect predictions of nanobody sequences compared to the phenotype observed during size exclusion chromatography reported in McMahon et al., 2018.

Name | Phenotype | Bits per residue
Nb.BV035 | Monodisperse, low expression | -1.266
Nb.BV009 | Poorly expressed | -1.247
Nb.BV056 | Monodisperse | -1.214
Nb.BV045 | Monodisperse | -1.211
Nb.BV008 | Monodisperse | -1.203
Nb.BV025 | Monodisperse | -1.181
Nb.BV018 | Monodisperse | -1.160
Nb.BV052 | Monodisperse | -1.139
Nb.BV049 | Monodisperse | -1.131
Nb.BV047 | Somewhat polydisperse | -1.053
Nb.BV032 | Mostly monodisperse | -1.043


Supplementary Table S4.1 – Model selection of the linear operon model using the adjusted r-squared.

Model | Number of parameters | r-squared | Adjusted r-squared | r-squared (validation)
Single | 7 | 0.82 | 0.789 | 0.756
Forward | 12 | 0.851 | 0.802 | 0.767
Reverse | 12 | 0.821 | 0.762 | 0.757
Forward Tri. | 28 | 0.857 | 0.656 | 0.765
Reverse Tri. | 28 | 0.83 | 0.592 | 0.767
Full | 49 | 0.865 | NaN | 0.781
