Phenotype Prediction from Evolutionary Sequence Covariation

TECHNISCHE UNIVERSITAT¨ MUNCHEN¨ Fakultat¨ fur¨ Informatik Lehrstuhl fur¨ Bioinformatik Phenotype Prediction From Evolutionary Sequence Covariation Thomas A. Hopf Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Univer- sitat¨ Munchen¨ zur Erlangung des akademisches Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation. Vorsitzende: Univ.-Prof. Gudrun J. Klinker, Ph.D. Prufer¨ der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Assistant-Prof. Dr. Debora Marks (Harvard Medical School, Boston/USA) 3. Prof. Dr. Yitzhak Pilpel (Weizmann Institute of Science, Rehovot/Israel) Die Dissertation wurde am 29.10.2015 bei der Technischen Universitat¨ Munchen¨ ein- gereicht und durch die Fakultat¨ fur¨ Informatik am 28.01.2016 angenommen. Dedicated to my family Abstract Disentangling the relationship between genotype and phenotype is a major open ques- tion in biology. Since natural selection acts on phenotypes, the need to maintain organismal fitness leaves a trace of functional constraints in the underlying genotypes. We analyze abundantly available genotype data of homologous proteins from genomic sequencing for patterns of amino acid conservation and covariation. Based on statistical maximum entropy models of sequence coevolution to identify such evolutionary couplings, we developed computational methods to predict phenotypes that are difficult to obtain by experiment, on three different scales: (i) the three-dimensional structures of membrane proteins through coevolving sites in spatial contact; (ii) the interactions between pairs of proteins in protein complexes through coevolution of contacting sites in different molecules; and (iii) the quantitative, context-dependent impact of genotype changes on molecular and organismal function. For each of these approaches, we validated our phenotype predictions against available experimental data and show that, given sufficient genotype information, (i) accurate residue-residue contacts and three-dimensional structure predictions can be obtained for a wide range of proteins, allowing the de novo prediction of experimentally unsolved structures; (ii) coevolving positions between interacting proteins can be identified with high accuracy, enabling the detection of protein-protein interactions and the reconstruction of complex structures through molecular docking; and that (iii) the computed probabilistic, context- dependent effects of amino acid substitutions quantitatively agree with experimental measurements of biochemical function and organismal growth. In summary, our re- sults show that statistical sequence modeling gives both a quantitative mapping from genotype to high-level phenotypes as well as biologically relevant intermediate molecular phenotypes such as protein structures or interactions. We anticipate that our approaches will allow to extract useful phenotypic information from the exponen- tially increasing amounts of sequence data and contribute to a deeper understanding of protein evolution and design as well as clinical applications. v Zusammenfassung Die Entschlusselung¨ des Zusammenhangs zwischen Genotyp und Phanotyp¨ ist eine ungeloste¨ wichtige Fragestellung in der Biologie. Da die naturliche¨ Selektion auf der Ebene des Phanotyps¨ einwirkt, hinterlasst¨ die Notwendigkeit, die Fitness des Or- ganismus zu erhalten, Spuren funktioneller Beschrankungen¨ auf der zugrunde lie- genden Ebene des Genotyps. Wir analysieren die in großer Fulle¨ verfugbaren¨ Geno- typdaten homologer Proteine aus Genomsequenzierungen hinsichtlich Mustern von Aminosaurekonservierung¨ und -kovariation. Basierend auf statistischen Maximum- Entropie-Modellen der Sequenzkovariation zur Identifizierung dieser evolutionaren¨ Kopplungen entwickelten wir rechnergestutzte¨ Methoden zur Vorhersage von Pha-¨ notypen, die nur aufwandig¨ mittels experimenteller Techniken gewonnen werden konnen,¨ auf drei verschiedenen Ebenen: (i) Die dreidimensionale Struktur von Mem- branproteinen durch Koevolution von Positionen in raumlichem¨ Kontakt; (ii) die In- teraktion zwischen mehreren Proteinen durch die Koevolution von wechselwirkenden Positionen in Proteinkomplexen; sowie (iii) die quantitativen, kontextabhangigen¨ Ef- fekte von Veranderungen¨ des Genotyps auf molekulare und organismische Funkti- on. Fur¨ jede dieser Methoden validierten wir unsere Phanotypvorhersagen¨ gegen die verfugbaren¨ experimentellen Daten und zeigen, dass, ausreichend Genotypinformati- on vorausgesetzt, (i) genaue Kontakte zwischen Aminosaureresten¨ sowie dreidimensionale Strukturvorhersagen berechnet werden konnen,¨ was die de novo-Vorhersage von experimentell ungelosten¨ Proteinen ermoglicht;¨ (ii) koevolvierende Positionen zwischen interagierenden Proteinen mit hoher Genauigkeit identifiziert werden konnen,¨ was die Erkennung von Protein-Protein-Interaktionen sowie die Rekonstruktion von Komplexstrukturen mittels molekularem Docking erlaubt; und dass (iii) die probabi- listische, kontextabhangige¨ Modellierung von Aminosauresubstitutionen¨ experimentellen Messungen von biochemischer Funktion und organismischem Wachstum quan- titativ entspricht. Zusammengefasst zeigen unsere Ergebnisse, dass die statistische Se- quenzmodellierung sowohl eine quantitative Abbildung von Genotyp zu Phanotyp¨ als auch biologisch relevante, dazwischenliegende molekulare Phanotypen¨ wie Protein- strukturen oder -interaktionen liefert. Wir erwarten, dass unsere Methoden die Extra- hierung nutzlicher¨ phanotypischer¨ Informationen aus der exponentiell zunehmenden Zahl an Sequenzen erlauben und zu einem tieferen Verstandnis¨ von Proteinevolution und -design sowie klinischen Anwendungen beitragen werden. vii Publications Manuscripts published during PhD studies: Hopf, T. A. et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3 (2014) Hopf, T. A. et al. Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors. Nat Commun 6, 6077 (2015) Hopf, T. A. et al. Quantification of the effect of mutations using a global probability model of natural sequence variation. arXiv: 1510.04612 (2015) Marks, D. S. et al. Protein structure prediction from sequence variation. Nat. Biotechnol. 30, 1072–1080 (2012) Kajan,´ L. et al. FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics 15, 85 (2014) Tang, Y. et al. Protein structure determination by combining sparse NMR data with evolutionary couplings. Nat. Methods 12, 751–754 (2015) Sheridan, R. et al. EVfold.org: Evolutionary Couplings and Protein 3D Structure Prediction. bioRxiv, 021022 (2015) Related manuscripts published prior to PhD studies: Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011) Hopf, T. A. et al. Three-dimensional structures of membrane proteins from genomic sequencing. Cell 149, 1607–1621 (2012) Other manuscripts (not related to the contents of this thesis): Hopf, T. & Kramer, S. in Discovery Science (eds Pfahringer, B. et al.) Lecture Notes in Computer Science 6332, 311–325 (Springer Berlin Heidelberg, 2010) Radivojac, P. et al. A large-scale evaluation of computational protein function prediction. Nat. Methods 10, 221–227 (2013) Hamp, T. et al. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 14 Suppl 3, S7 (2013) ix Acknowledgements The past three years of research and writing this dissertation would not have been possible without the contributions of many other people. For this reason, I want to express my gratitude to: Burkhard Rost, Chris Sander, and in particular Debora Marks, for making this transat- lantic endeavor possible, exciting projects to work on, their mentorship, hospitality, inspiring discussions, and support in many other ways. Gudrun Klinker, for chairing the examining committee, and Yitzhak Pilpel, for review- ing the dissertation. All members of the Marks, Rost and Sander labs for an enriching scientific and per- sonal environment; in particular Charlotta Scharfe¨ for a great collaboration and proof- reading of this dissertation, Frank Poelwijk and Agnes´ Toth-Petr´ oczy´ for exciting discussions on protein evolution, Tobias Hamp for his input on protein complexes, John Ingraham for his contributions to the mutation effect project, Richard Stein for sharing his knowledge on maximum entropy models, and Li Yang Smith for her work on the transmembrane proteins. My co-authors, for their valuable scientific contributions and insights. The administrative and technical staff at TUM (Marlena Drabik, Inga Weise, Mar- tina Hilla, Manuela Fischer and Tim Karl) and HMS (Maria Ferreira, Mason Miranda, Pamela Needham and Jake DeSilva, Research Computing), for their constant help and keeping things running smoothly. The scientific Python community for providing such a fantastic set of tools, in particular the developers of the IPython/Jupyter notebook. All my friends at home, in Munich and in Boston for a great time and offering distrac- tion at times when things did not work as they should. My family, for their never-ending and unconditional support throughout all the years. Thank you all! xi Contents List of Figures xv List of Tables xvii Abbreviations xix 1. Introduction 1 1.1. The relationship of genotype and phenotype . 1 1.1.1. Biological relevance . 1 1.1.2. Experimental studies of the genotype-phenotype map . 2 1.1.3. Context-dependence of genetic variants . 3 1.2. Computational

Phenotype Prediction from Evolutionary Sequence Covariation

2018 Program Book.Indd

Mutational Scanning Symposium Virtual Event

Mutational Scanning Symposium

2016 ISCB Overton Prize Awarded to Debora Marks[Version 1; Peer

Hits Symp 2017

Proceedings of the Eighteenth International Conference on Machine Learning., 282 – 289

Building Maps from Genetic Sequences to Biological Function

SUNDAY, August 19 MONDAY, August 20

Transformer Protein Language Models Are Unsupervised Structure Learners

Arxiv:2010.08162V2 [Q-Bio.BM] 15 Nov 2020 1.1 Protein Structure Data Availability and Information Leakage

Generative Models for Graph-Based Protein Design

Mutational Scanning Symposium