Building Maps from Genetic Sequences to Biological Function
Citation: Riesselman, Adam Joseph. 2019. Building maps from genetic sequences to biological function. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences.
Citable link: http://nrs.harvard.edu/urn-3:HUL.InstRepos:41121291
Terms of Use: This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http://nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of-use#LAA

A dissertation presented by Adam Joseph Riesselman to The Division of Medical Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Biomedical Informatics
Harvard University, Cambridge, Massachusetts
December 2018
©2018 Adam Joseph Riesselman. All rights reserved.
Dissertation Advisor: Debora S. Marks

Abstract
Predicting how changes to the genetic code will alter the characteristics of an organism is a fundamental question in biology and genetics. Typically, measurements of the true functional landscape relating genotype to phenotype are noisy and costly to obtain. Though high-throughput DNA sequencing and synthesis can shed light on biological constraints in organisms, inferring relationships from these high-dimensional, multi-scale data to make predictions about new biological sequences is a formidable task. Here, I aim to build algorithms that map genetic sequences to biological function. In Chapter 1, I examine how deep latent variable models of evolutionary sequences can predict the effects of mutations in an unsupervised manner.
In Chapter 2, I discuss how deep autoregressive models can be applied to genetic data for variant effect prediction and the synthesis of a diverse synthetic nanobody library. In Chapter 3, I explore how sparse Bayesian logistic regression can efficiently summarize laboratory affinity maturation experiments to improve nanobody binding affinity. In Chapter 4, I show how to integrate genetic, proteomic, and metabolomic data to optimize thiamine biosynthesis in E. coli. In Chapter 5, I propose future research directions, including extensions to both the analytical methods and biological systems discussed. These results show that probabilistic algorithms of genetic sequence data can both explain phenotypic variation and be used to design proteins and organisms with improved properties.

Table of Contents
Acknowledgements
INTRODUCTION
    Finding patterns in genetic data
    On designing useful statistical models
    Final remarks
    References
CHAPTER 1
Deep generative models of genetic variation capture the effects of mutations
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
CHAPTER 2
Predicting the effects of mutations with alignment-free generative models of protein sequences
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
CHAPTER 3
Refining nanobody affinity with algorithmically-guided directed evolution
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
CHAPTER 4
Accurate prediction of the thiamine biosynthetic landscape using Gaussian process learning
    Abstract
    Introduction
    Results
    Discussion
    Methods
    References
CHAPTER 5
Conclusions
    Specific Conclusions
    General Conclusions
    References
Appendix: Supplementary Figures and Tables

Acknowledgements
I came into graduate school with the grandiose idea of working in the lab and guiding my experiments with new algorithms to engineer organisms; I wanted to make sure the code we wrote made something tangible. After stumbling around for the first year or so, I realized that doing everything on my own was not the way to go. Science is made so much richer with collaborations and the exchange of new ideas and disciplines, and I am so happy and proud to have worked