Building maps from genetic sequences to biological function
Citation Riesselman, Adam Joseph. 2019. Building maps from genetic sequences to biological function. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:41121291
Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA
Building maps from genetic sequences to biological function
A dissertation presented
by
Adam Joseph Riesselman
to
The Division of Medical Sciences
in partial fulfillment of the requirements
for the degree of
Doctor of Philosophy
in the subject of
Biomedical Informatics
Harvard University
Cambridge, Massachusetts
December 2018
©2018 Adam Joseph Riesselman All rights reserved.
Dissertation Advisor: Debora S. Marks Adam Joseph Riesselman
Building maps from genetic sequences to biological function
Abstract
Predicting how changes to the genetic code will alter the characteristics of an organism is a fundamental question in biology and genetics. Typically, measurements of the true functional landscape relating genotype to phenotype are noisy and costly to obtain. Though high-throughput DNA sequencing and synthesis can shed light on biological constraints in organisms, inferring relationships from these high-dimensional, multi-scale data to make predictions about new biological sequences is a formidable task. Here, I aim to build algorithms that map genetic sequences to biological function. In Chapter 1, I examine how deep latent variable models of evolutionary sequences can predict the effects of mutations in an unsupervised manner. In Chapter 2, I discuss how deep autoregressive models can be applied to genetic data for variant effect prediction and the synthesis of a diverse synthetic nanobody library. In Chapter 3, I explore how sparse Bayesian logistic regression can efficiently summarize laboratory affinity maturation experiments to improve nanobody binding affinity. In Chapter 4, I show how to integrate genetic, proteomic, and metabolomic data to optimize thiamine biosynthesis in E. coli. In Chapter 5, I propose future research directions, including extensions to both the analytical methods and biological systems discussed. These results show that probabilistic algorithms of genetic sequence data can both explain phenotypic variation and be used to design proteins and organisms with improved properties.
Table of Contents
Acknowledgements
Introduction
    Finding patterns in genetic data
    On designing useful statistical models
    Final remarks
    References
Chapter 1: Deep generative models of genetic variation capture the effects of mutations
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
Chapter 2: Predicting the effects of mutations with alignment-free generative models of protein sequences
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
Chapter 3: Refining nanobody affinity with algorithmically-guided directed evolution
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
Chapter 4: Accurate prediction of the thiamine biosynthetic landscape using Gaussian process learning
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
Chapter 5: Conclusions
    Specific Conclusions
    General Conclusions
    References
Appendix: Supplementary Figures and Tables
Acknowledgements
I came into graduate school with the grandiose idea of working in the lab and guiding my experiments with new algorithms to engineer organisms—I wanted to make sure the code we wrote made something tangible. After stumbling around for the first year or so, I realized that doing everything on my own was not the way to go. Science is made so much richer with collaborations and the exchange of new ideas and disciplines, and I am so happy and proud to have worked with many talented people over my career.
First and foremost, I would like to thank my advisor Debora Marks for truly caring about me, her students, and the science. She has taught me to “know thy data” and has provided me endless opportunities for my career. Most importantly, Debbie trained me to think like a scientist by making sure to know what we are looking for: “What is the question?”. With her help and guidance, I’ve gotten everything and more I wanted from graduate school.
I would also like to thank my collaborators that have made this work possible. I cannot thank John Ingraham enough for working with me through our papers and always patiently helping me understand the math or answer any questions I had. I would also like to thank Sam Deutsch for being my mentor at the Joint Genome Institute, and Hans Genee, Morten Sommer, and everyone else on the thiamine project for being great collaborators working on exciting, impactful research. I would like to thank Andrew Kruse for being a wonderful, friendly, insightful collaborator with the nanobody projects. I would also like to thank Aashish Manglik, Conor McMahon, Ishan Deshpande, and Jiahao Liang for their work as well.
I am so fortunate to have joined the Marks lab for all of the amazing members I have had over the years. I thank Anna Green for talking about science, dealing with stress, and appreciating my “avant garde” lab humor. A big thanks to all the other lab members: Aaron Kollasch, Agnes Toth-Petroczy, Benni Schubert, Caleb Weinreb, Charlotta Scharfe, Christian Dallago, David Ding, Elana Simon, Eli Weinstein, Hailey Cambra, Jonny Frazer, June Shin, Kelly Brock, Malfalda Dias, Perry Palmedo, Rohan Maddamsetti, Tessa Green, and Thomas Hopf.
I would also like to thank Chris Sander for being a constant in support and help over the years. I’d also like to thank Frank Poelwijk and Nick Gauthier in his lab for their kindness and advice as well.
I would like to thank the Bioinformatics and Integrative Genomics PhD program for their support and guidance, including Peter Park, Isaac Kohanne, Marissa DiSarno, Katherine Flannery, and Cathy Haskell.
A big thank you to the systems biology admin as well, including Barb Grant for keeping a roof over our heads, Jennie Epp for dealing with any lab emergency with unimaginable patience and grace, and Kathy Buhl for her perpetual kindness. I would also like to thank the Harvard Research Computing staff for their tireless support of my computational demands.
I would like to thank my PQE, DAC, and defense members for helping me make progress in my graduate school career: Shamil Sunyaev, Martha Bulyk, George Church, Sasha Rush, Stirling Churchman, Tim Mitchison, and Gene-Wei Li. I would also like to thank my previous rotation lab mentors: Jen Sheen, Pam Silver, and Jeff Way.
I am indebted to the Department of Energy Computational Science Graduate Fellowship for funding my graduate school education, providing me opportunities to network and expand my research, and forcing me to take a challenging yet thorough course load. Thank you to the Krell Institute as well for their help, including Lindsey Eilts, Michelle King, Lisa Frerichs, and Thomas O’Donnell. I would like to thank the Joint Genome Institute for hosting me in the summer of 2016 as well as Max, Joy, Jordan, and Jay, for making my time in Berkeley much more fun.
I would also like to thank all of my educators that have made me into the scientist I am today, including Kathleen Synder, Jim Blankman, Bram Govaerts, Paul Scott, Sona Pandey, Gholam Mirafzal, Jerry Honts, Timothy Urness, Eric Manley, Ajith Anand, Maren Roe, Bill Gordon- Kamm, Keith Lowe, and Todd Jones. I am also indebted to Norman Borlaug and the World Food Prize for being the catalyst to my career. In particular, I cannot thank Kenneth Quinn, Lisa Fleming, and Keegan Kautzky enough for their tireless support and being fantastic role models.
I would like to thank all of the friends I’ve made in graduate school in Boston—you’ve all made this journey much more fun than going it alone: Katie Wu, Rebecca Fine, Brittany Mayweather, Thao Truong, Dan Foster, Marina Watanabe, Sindhu Carmen, Sisi Sarkizova, Rohit Garg, Kayla Davis, Max Schubert, Gleb Kuznetsov, Surge Biswas, Pierce Ogden, Isaac Plant, Marc Pressler, Evi Van Itallie, Hattie Chung, Steph Hayes, Brendan Colon, Ben Vincent, Sam Wolock, Armin Schoch, and Chuck Noble. I’d also like to thank all my friends from home for their constant support, especially, Matt Dalton, Domenic Lamberti, Drew Gibson, and Erin Austin.
I would like to thank my partner Chris Bandoro for his endless love, kindness, and patience with me and for always being willing to go on an adventure. I am extremely lucky to have a loving family that has always supported me. Thank you to all my grandparents (Walt, Lucille, Vic, Marge), aunts (Kristi, Carolyn, Dolores, Connie, Kathy), uncles (Jerome, Steve, Dave, Bob, Marv, Duane), and cousins for believing in me and supporting me through everything. Thank you to Nicole, Jon, Kevin, and Blair, and my nephews Matthew and Vincent, for their love and support. Finally, this was all made possible by my amazing parents, Mark and Kim. They taught me the value of hard work, the importance of education, and the virtues of patience and kindness for everyone. They always fostered my interest in science and have been an incredible support system. I love you all so much.
Introduction
“Do what you can, with what you have, where you are.”
--Theodore Roosevelt
DNA is the primary determinant of the phenotype of organisms. Millions of years of random mutation and competition to survive under a variety of conditions are the driving force behind the staggering diversity of life on earth. This process can be mimicked in the laboratory on a much shorter timeframe through directed evolution1,2. However, the constraints that determine the consequences of mutations in DNA are difficult and expensive to ascertain. Here I will introduce computational methods to characterize biological sequences that have undergone diversification and selection. These mathematical models can be used to understand functional constraints as well as engineer organisms with new and improved properties.
Finding patterns in genetic data
Modern society has become increasingly reliant on our ability to understand genetic variation. Crops and livestock have been selected over millennia to maintain high yields while being robust to a wide variety of conditions3. Precision medicine has allowed doctors to pinpoint the cause of disease and prescribe treatments accordingly4. Scientists have been able to engineer industrial biomolecules, lower the cost of medications, and tailor therapeutics5,6. Building links between changes in DNA and the outcome in vitro or in vivo is critical to improving these endeavors.
Forward genetics is the traditional approach to elucidating the effects of mutations. After observing a particularly interesting phenotype, the causative mutation is winnowed down through a process of breeding, screening, and mapping to pinpoint the changes responsible for the phenotype of interest7. Given enough time, patience, and resources, identifying the mutation can be completed with relatively little technology. High-throughput sequencing has accelerated this process by providing fine-grained resolution of reference genomes, but mapping novel variants or sequencing new genomes can be slow and expensive. Moreover, extrapolating observed phenotypes from one organism to another is nontrivial and requires homology-based approaches.
Alternatively, reverse genetics maps genetic variants to function by creating DNA sequences of interest and experimentally determining their phenotype7. Boosted by advances in DNA synthesis technology, thousands of mutations can be studied in a single experiment8. These approaches are much more generalizable, as diverse genetic components from recalcitrant organisms, including regulatory elements and biosynthetic clusters, can be transferred to more well-behaved systems in the lab. When approached from an engineering perspective, these techniques can be used to create DNA sequences with novel, valuable properties5.
This approach, most recently employed by deep mutational scanning9, has been tremendously successful in characterizing genetic systems. However, both technical and biological issues complicate brute-force reverse genetic approaches.
First, previously characterized genetic parts do not perform predictably with one another. For example, epistasis between residues in a biological sequence confounds assumptions regarding the additive effects of mutations10,11, or nonlinear dynamics may govern the metabolism of a system of interest12, leading to unpredictable phenotypic outcomes. Exhaustively testing all combinations of mutations is unfeasible. Second, if more than one change is made to the system at once, the true cause of variation must be deconvoluted13. Third, biological measurements are noisy, expensive to obtain, and may be confounded with batch effects. These measurements, including DNA modifications and RNA, protein, and metabolite expression levels, are also typically high-dimensional and arise from hierarchical factors of variation. Finally, direct measurements of a biological system may not be available at all. Namely, large genetic sequence databases contain only the result of long-term evolution, not information on the functional constraints themselves13-16.
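To make the first point concrete, the following minimal sketch computes pairwise epistasis as the deviation of a double mutant from the additive expectation built from the two single mutants. The function name and the numerical values are hypothetical and chosen only for illustration; the measurements are assumed to be on a scale (such as log fitness) where independent effects would add.

```python
def pairwise_epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    All arguments are phenotype measurements on an additive scale
    (assumed here to be log fitness): wild type, single mutant A,
    single mutant B, and the double mutant AB.
    """
    # Under a purely additive model, the two single-mutant effects sum.
    expected_additive = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - expected_additive


# Hypothetical measurements, for illustration only.
f_wt, f_a, f_b, f_ab = 0.0, -1.0, -0.5, -0.2
epsilon = pairwise_epistasis(f_wt, f_a, f_b, f_ab)
print(f"additive expectation: {f_a + f_b - f_wt:.2f}")
print(f"observed double mutant: {f_ab:.2f}")
print(f"epistasis: {epsilon:.2f}")
```

A nonzero value means the single-mutant measurements alone cannot predict the double mutant, which is why combinations would otherwise have to be tested directly.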
On designing useful statistical models
Testing all possible hypotheses to understand the function of a DNA sequence in the lab is combinatorially intractable. Alternatively, measurements from biological systems can be treated as an inverse problem17. Namely, computational biologists can build models to describe the causal factors that created the data, fit them to available examples, and use the parameters of the model to understand the biological underpinnings of the system. In this thesis, I utilize both discriminative and generative models to understand and make predictions for genetic systems.
Discriminative models are utilized when trying to predict a measured value from some known variables. Numerous algorithms can parameterize this form. The simplest is regression, which is a powerhouse of genetics and, more broadly, of data analysis. In a generic regression model, terms are additive and independent of one another. These robust models are both predictive and readily interpretable. Numerous extensions of these models exist and are utilized in this thesis, including sparse priors for variable selection, terms to uncover interactions between variables, and tree-based structure to interpret nonlinear interactions in the data. Gaussian processes, which are nonparametric function approximators, can even be framed as Bayesian linear regression in which we integrate out the weights18. Though Gaussian process models are difficult to interpret, they are resistant to overfitting and can be fit to limited data. These regression-based models provide the basis for supervised learning in this thesis.
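The correspondence between Gaussian processes and Bayesian linear regression can be checked numerically. The sketch below is an illustration under simple assumptions (a zero-mean isotropic Gaussian prior on the weights, Gaussian observation noise, and synthetic data); the function names and variance values are hypothetical and are not taken from the models used later in this thesis. It computes predictions once with an explicit posterior over weights and once with a Gaussian process whose kernel is the linear kernel obtained by integrating the weights out.

```python
import numpy as np


def blr_predict(X, y, X_new, prior_var=1.0, noise_var=0.1):
    """Weight-space view: explicit Gaussian posterior over the weights."""
    d = X.shape[1]
    # Posterior precision of the weights under a N(0, prior_var * I) prior.
    A = X.T @ X / noise_var + np.eye(d) / prior_var
    w_mean = np.linalg.solve(A, X.T @ y) / noise_var
    mean = X_new @ w_mean
    cov = X_new @ np.linalg.solve(A, X_new.T)
    return mean, cov


def gp_predict(X, y, X_new, prior_var=1.0, noise_var=0.1):
    """Function-space view: GP with linear kernel k(x, x') = prior_var * x.x'."""
    K = prior_var * X @ X.T
    K_s = prior_var * X_new @ X.T
    K_ss = prior_var * X_new @ X_new.T
    K_noisy = K + noise_var * np.eye(len(X))
    mean = K_s @ np.linalg.solve(K_noisy, y)
    cov = K_ss - K_s @ np.linalg.solve(K_noisy, K_s.T)
    return mean, cov


# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=20)
X_new = rng.normal(size=(5, 3))

blr_mean, blr_cov = blr_predict(X, y, X_new)
gp_mean, gp_cov = gp_predict(X, y, X_new)
print("means agree:", np.allclose(blr_mean, gp_mean))
print("covariances agree:", np.allclose(blr_cov, gp_cov))
```

The two views give the same predictive distribution for the latent function; the function-space form never represents the weights, which is what allows the linear kernel to be swapped for other kernels.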
Generative models are particularly useful in an unsupervised machine learning setting, where a probability distribution is fit to a set of data. In this framework, we learn a parameterization in which we can generate new examples of the data by understanding the constraints that make any given datum probable. This approach has been particularly successful in analyzing biological sequence data from public databases because the sequences are very heterogeneous, have a high information content, and don’t have any matching measurement. These models can be used to find other related members of a sequence family in an unsupervised manner, understand phylogeny and relatedness of sequences16, find epistatic constraints13,14, and predict the effects of mutations19. In this thesis, I parameterize generative models with deep neural networks to improve mutation effect prediction and extend the scope of sequences these models can handle.
Traditionally, deriving complicated mathematical models of high-dimensional data was slow and laborious. However, advances in automatic differentiation20 and variational inference21 have made the process of building complex models running on high-performance computing resources fast and simple. Using these tools and techniques, the causes of variation can be teased out of relatively few data samples, and the resulting model can be used to predict the effects of mutations in new sequences.
Final remarks
I do not expect biology to be found solely in silico: experimentation will always be the tried and true method for determining biological function of a sequence. In this thesis, I aim to formulate biological questions regarding sparsely-characterized biological systems and design robust, interpretable, generalizable probabilistic algorithms to map genotype-to-phenotype.
References
1 Packer, M. S. & Liu, D. R. Methods for the directed evolution of proteins. Nature Reviews Genetics 16, 379 (2015).
2 Romero, P. A. & Arnold, F. H. Exploring protein fitness landscapes by directed evolution. Nature Reviews Molecular Cell Biology 10, 866 (2009).
3 Allard, R. W. Principles of plant breeding. (John Wiley & Sons, 1999).
4 Gonzaga-Jauregui, C., Lupski, J. R. & Gibbs, R. A. Human genome sequencing in health and disease. Annual Review of Medicine 63, 35-61 (2012).
5 Purnick, P. E. & Weiss, R. The second wave of synthetic biology: from modules to systems. Nature Reviews Molecular Cell Biology 10, 410 (2009).
6 Benner, S. A. & Sismour, A. M. Synthetic biology. Nature Reviews Genetics 6, 533 (2005).
7 Griffiths, A. J. et al. An introduction to genetic analysis. (Macmillan, 2005).
8 Kosuri, S. & Church, G. M. Large-scale de novo DNA synthesis: technologies and applications. Nature Methods 11, 499 (2014).
9 Fowler, D. M. & Fields, S. Deep mutational scanning: a new style of protein science. Nature Methods 11, 801 (2014).
10 Phillips, P. C. Epistasis—the essential role of gene interactions in the structure and evolution of genetic systems. Nature Reviews Genetics 9, 855 (2008).
11 Poelwijk, F. J., Kiviet, D. J., Weinreich, D. M. & Tans, S. J. Empirical fitness landscapes reveal accessible evolutionary paths. Nature 445, 383 (2007).
12 Mendes, P. & Kell, D. Non-linear optimization of biochemical pathways: applications to metabolic engineering and parameter estimation. Bioinformatics (Oxford, England) 14, 869-883 (1998).
13 Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011).
14 Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proceedings of the National Academy of Sciences 108, E1293-E1301 (2011).
15 Lapedes, A., Giraud, B. & Jarzynski, C. Using sequence alignments to predict protein structure and stability with high accuracy. arXiv preprint arXiv:1207.2484 (2012).
16 Durbin, R., Eddy, S. R., Krogh, A. & Mitchison, G. Biological sequence analysis: probabilistic models of proteins and nucleic acids. (Cambridge University Press, 1998).
17 Stein, R. R., Marks, D. S. & Sander, C. Inferring pairwise interactions from biological data using maximum-entropy probability models. PLoS Computational Biology 11, e1004182 (2015).
18 Rasmussen, C. E. in Advanced lectures on machine learning 63-71 (Springer, 2004).
19 Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nature Biotechnology 35, 128 (2017).
20 LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436 (2015).
21 Jordan, M. I., Ghahramani, Z., Jaakkola, T. S. & Saul, L. K. An introduction to variational methods for graphical models. Machine Learning 37, 183-233 (1999).
Chapter 1
Deep generative models of genetic variation capture the effects of mutations
Adam J. Riesselman§1,2, John B. Ingraham§1,3, Debora S. Marks1
1 Department of Systems Biology, Harvard Medical School
2 Program in Biomedical Informatics, Harvard Medical School
3 Program in Systems Biology, Harvard University
§ Equal contribution
A.J.R., J.B.I., and D.S.M. designed the study. A.J.R. and J.B.I. performed the computation. A.J.R., J.B.I., and D.S.M. wrote the paper.
Abstract
The functions of proteins and RNAs are defined by the collective interactions of many residues, and yet most statistical models of biological sequences consider sites near-independently. Recent approaches have demonstrated benefits of including interactions to capture pairwise covariation, but leave higher-order dependencies out of reach. Here, we show how it is possible to capture higher-order, context-dependent constraints in biological sequences via latent variable models with nonlinear dependencies. We present a new probabilistic model for sequence families, DeepSequence, which can predict the effects of mutations across a variety of deep mutational scanning experiments significantly better than existing methods that are based on the same evolutionary data. The model, learned in an unsupervised manner solely from sequence information, is grounded with biologically motivated priors, reveals latent organization of sequence families, and can be used to extrapolate to new parts of sequence space.
Introduction
A major unmet challenge in biological research, clinical medicine and biotechnology is how we should decipher and exploit the effect of mutations on biomolecules. From interpreting which genetic variants in humans underlie disease, to developing modified proteins that have useful properties, to synthesizing large molecular libraries that are enriched with functional sequences, there is a need to be able to rapidly assess whether a given mutation to a protein or RNA will disrupt its function1, 2. Although high-throughput technologies can now assess the effects of thousands of mutations in parallel3-25 1 26, 27, sequence space is exponentially large and experiments are resource-intensive. Accurate computational methods are thus an important component for high-throughput sequence annotation and design.
Most improvements to computational predictions of mutation effects have been driven by leveraging the signal of evolutionary conservation among homologous sequences28-33. Historically, these tools analyze the conservation of single sites in a protein in a background-independent manner. Recent work has demonstrated that incorporating inter-site dependencies with a pairwise interaction model of genetic variation can more accurately predict the effects of mutations in high-throughput mutational scans34-36. However, numerous lines of evidence suggest that higher-order epistasis pervades the evolution of proteins and RNAs37-40, which pairwise models are unable to capture. Naïve extension of the pairwise models with third- or higher-order terms is statistically unfeasible, as even third-order interaction models for a modest-length protein of 100 amino acids will have approximately a billion parameters ($\binom{100}{3} \cdot 20^3 \approx 1.29 \times 10^9$). Even if such a model could be engineered or coarse-grained41 to be computationally and statistically tractable, it will only marginally improve the fraction of higher-order terms considered, leaving 4th and higher order interactions out of reach.
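The parameter count quoted above can be reproduced with a few lines of arithmetic. This is a minimal sketch that assumes one parameter for every choice of k positions and every assignment of amino acids to those positions; the function name is hypothetical.

```python
from math import comb


def interaction_parameter_count(length, order, alphabet_size=20):
    """Number of order-k interaction terms for a sequence of given length.

    Assumes one parameter per combination of `order` positions and per
    assignment of one of `alphabet_size` residues to each of them.
    """
    return comb(length, order) * alphabet_size ** order


# Growth in parameters with interaction order for a 100-residue protein.
for order in (1, 2, 3, 4):
    count = interaction_parameter_count(100, order)
    print(f"order {order}: {count:.3e} parameters")
```

For order 3 this gives 161,700 position triples times 8,000 residue combinations, or 1,293,600,000 parameters, and each additional order multiplies the count by roughly two to three orders of magnitude.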
Directly parameterizing models of sequence variation with all possible interactions of order k leads to an intractable combinatorial explosion in the number of parameters to consider. An alternative to this fully observed approach for modeling data – where the correlations between positions are explained directly in terms of position-position couplings – is to model variations in terms of hidden variables to which the observed positions are coupled. Two widely used models for the analysis of genetic data, PCA and admixture analysis,42-44 can be cast as latent variable (hidden variable) models in which the visible data (genotypes) depend on hidden variables (factors or populations) in a linear way. In principle, replacing those linear dependencies with flexible nonlinear transformations could facilitate modeling of arbitrary-order correlations within observed genotypes, but developing tractable inference algorithms for them is more complex. Recent advances in approximate inference45, 46 have made these kinds of nonlinear latent variable models tractable for modeling complex distributions for many kinds of data, including text, audio, and even chemical structures47, but their application to genetic data remains in its infancy.
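The contrast between linear and nonlinear dependence on hidden variables can be sketched as two toy decoders that map a latent vector to per-position residue probabilities. This is only an illustration under stated assumptions: the dimensions, the random weights, the tanh hidden layer, and the softmax output are all hypothetical choices, and the code is not the architecture developed in this chapter.

```python
import numpy as np


def softmax(logits):
    """Normalize the last axis into probabilities."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def linear_decoder(z, W, b, shape):
    """Per-position logits are a linear function of the hidden variables."""
    return softmax((z @ W + b).reshape(shape))


def nonlinear_decoder(z, W1, b1, W2, b2, shape):
    """A hidden layer lets positions depend jointly on z in a nonlinear way."""
    hidden = np.tanh(z @ W1 + b1)
    return softmax((hidden @ W2 + b2).reshape(shape))


# Toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
length, alphabet, latent_dim, hidden_dim = 6, 4, 2, 8
shape = (length, alphabet)

z = rng.normal(size=latent_dim)  # hidden variables for one sequence
W = rng.normal(size=(latent_dim, length * alphabet))
b = np.zeros(length * alphabet)
W1 = rng.normal(size=(latent_dim, hidden_dim))
b1 = np.zeros(hidden_dim)
W2 = rng.normal(size=(hidden_dim, length * alphabet))
b2 = np.zeros(length * alphabet)

probs_linear = linear_decoder(z, W, b, shape)
probs_nonlinear = nonlinear_decoder(z, W1, b1, W2, b2, shape)

# Sample one sequence (as residue indices) from the nonlinear decoder.
sequence = np.array([rng.choice(alphabet, p=p) for p in probs_nonlinear])
print(probs_linear.shape, probs_nonlinear.shape, sequence)
```

In both decoders the positions are conditionally independent given z; correlations between positions arise only after z is integrated out, which is the step that requires approximate inference in the nonlinear case.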
Here, we develop nonlinear latent variable models for biological sequence families and leverage approximate inference techniques to infer them from large multiple sequence alignments. We show how a Bayesian deep latent variable model can be used to reveal latent structure in sequence families and predict the effects of mutations with accuracies exceeding those of site-independent or pairwise-interaction models.
Results
A deep generative model captures latent structure in sequence families.
The genes that we observe across species today are the results of long-term evolutionary processes that select for functional molecules. We seek to model the constraints underlying the evolutionary processes of these sequence data and to use those constraints to reason about what other mutations may be plausible. If we approximate the evolutionary process as a “sequence generator” that generates a sequence $x$ with probability $p(x \mid \theta)$ and parameters $\theta$ that are fit to reproduce the statistics of evolutionary data, we can use the probabilities that the model assigns to any given sequence as a proxy for the relative plausibility that a molecule satisfies functional