Phenotype Prediction from Evolutionary Sequence Covariation

Total Page:16

File Type:pdf, Size:1020Kb

Phenotype Prediction from Evolutionary Sequence Covariation TECHNISCHE UNIVERSITAT¨ MUNCHEN¨ Fakultat¨ fur¨ Informatik Lehrstuhl fur¨ Bioinformatik Phenotype Prediction From Evolutionary Sequence Covariation Thomas A. Hopf Vollstandiger¨ Abdruck der von der Fakultat¨ fur¨ Informatik der Technischen Univer- sitat¨ Munchen¨ zur Erlangung des akademisches Grades eines Doktors der Naturwissenschaften (Dr. rer. nat.) genehmigten Dissertation. Vorsitzende: Univ.-Prof. Gudrun J. Klinker, Ph.D. Prufer¨ der Dissertation: 1. Univ.-Prof. Dr. Burkhard Rost 2. Assistant-Prof. Dr. Debora Marks (Harvard Medical School, Boston/USA) 3. Prof. Dr. Yitzhak Pilpel (Weizmann Institute of Science, Rehovot/Israel) Die Dissertation wurde am 29.10.2015 bei der Technischen Universitat¨ Munchen¨ ein- gereicht und durch die Fakultat¨ fur¨ Informatik am 28.01.2016 angenommen. Dedicated to my family Abstract Disentangling the relationship between genotype and phenotype is a major open ques- tion in biology. Since natural selection acts on phenotypes, the need to maintain organ- ismal fitness leaves a trace of functional constraints in the underlying genotypes. We analyze abundantly available genotype data of homologous proteins from genomic se- quencing for patterns of amino acid conservation and covariation. Based on statistical maximum entropy models of sequence coevolution to identify such evolutionary cou- plings, we developed computational methods to predict phenotypes that are difficult to obtain by experiment, on three different scales: (i) the three-dimensional structures of membrane proteins through coevolving sites in spatial contact; (ii) the interactions between pairs of proteins in protein complexes through coevolution of contacting sites in different molecules; and (iii) the quantitative, context-dependent impact of geno- type changes on molecular and organismal function. For each of these approaches, we validated our phenotype predictions against available experimental data and show that, given sufficient genotype information, (i) accurate residue-residue contacts and three-dimensional structure predictions can be obtained for a wide range of proteins, allowing the de novo prediction of experimentally unsolved structures; (ii) coevolving positions between interacting proteins can be identified with high accuracy, enabling the detection of protein-protein interactions and the reconstruction of complex struc- tures through molecular docking; and that (iii) the computed probabilistic, context- dependent effects of amino acid substitutions quantitatively agree with experimental measurements of biochemical function and organismal growth. In summary, our re- sults show that statistical sequence modeling gives both a quantitative mapping from genotype to high-level phenotypes as well as biologically relevant intermediate molec- ular phenotypes such as protein structures or interactions. We anticipate that our approaches will allow to extract useful phenotypic information from the exponen- tially increasing amounts of sequence data and contribute to a deeper understanding of protein evolution and design as well as clinical applications. v Zusammenfassung Die Entschlusselung¨ des Zusammenhangs zwischen Genotyp und Phanotyp¨ ist ei- ne ungeloste¨ wichtige Fragestellung in der Biologie. Da die naturliche¨ Selektion auf der Ebene des Phanotyps¨ einwirkt, hinterlasst¨ die Notwendigkeit, die Fitness des Or- ganismus zu erhalten, Spuren funktioneller Beschrankungen¨ auf der zugrunde lie- genden Ebene des Genotyps. Wir analysieren die in großer Fulle¨ verfugbaren¨ Geno- typdaten homologer Proteine aus Genomsequenzierungen hinsichtlich Mustern von Aminosaurekonservierung¨ und -kovariation. Basierend auf statistischen Maximum- Entropie-Modellen der Sequenzkovariation zur Identifizierung dieser evolutionaren¨ Kopplungen entwickelten wir rechnergestutzte¨ Methoden zur Vorhersage von Pha-¨ notypen, die nur aufwandig¨ mittels experimenteller Techniken gewonnen werden konnen,¨ auf drei verschiedenen Ebenen: (i) Die dreidimensionale Struktur von Mem- branproteinen durch Koevolution von Positionen in raumlichem¨ Kontakt; (ii) die In- teraktion zwischen mehreren Proteinen durch die Koevolution von wechselwirkenden Positionen in Proteinkomplexen; sowie (iii) die quantitativen, kontextabhangigen¨ Ef- fekte von Veranderungen¨ des Genotyps auf molekulare und organismische Funkti- on. Fur¨ jede dieser Methoden validierten wir unsere Phanotypvorhersagen¨ gegen die verfugbaren¨ experimentellen Daten und zeigen, dass, ausreichend Genotypinformati- on vorausgesetzt, (i) genaue Kontakte zwischen Aminosaureresten¨ sowie dreidimen- sionale Strukturvorhersagen berechnet werden konnen,¨ was die de novo-Vorhersage von experimentell ungelosten¨ Proteinen ermoglicht;¨ (ii) koevolvierende Positionen zwischen interagierenden Proteinen mit hoher Genauigkeit identifiziert werden konnen,¨ was die Erkennung von Protein-Protein-Interaktionen sowie die Rekonstruktion von Komplexstrukturen mittels molekularem Docking erlaubt; und dass (iii) die probabi- listische, kontextabhangige¨ Modellierung von Aminosauresubstitutionen¨ experimen- tellen Messungen von biochemischer Funktion und organismischem Wachstum quan- titativ entspricht. Zusammengefasst zeigen unsere Ergebnisse, dass die statistische Se- quenzmodellierung sowohl eine quantitative Abbildung von Genotyp zu Phanotyp¨ als auch biologisch relevante, dazwischenliegende molekulare Phanotypen¨ wie Protein- strukturen oder -interaktionen liefert. Wir erwarten, dass unsere Methoden die Extra- hierung nutzlicher¨ phanotypischer¨ Informationen aus der exponentiell zunehmenden Zahl an Sequenzen erlauben und zu einem tieferen Verstandnis¨ von Proteinevolution und -design sowie klinischen Anwendungen beitragen werden. vii Publications Manuscripts published during PhD studies: Hopf, T. A. et al. Sequence co-evolution gives 3D contacts and structures of protein complexes. Elife 3 (2014) Hopf, T. A. et al. Amino acid coevolution reveals three-dimensional structure and functional domains of insect odorant receptors. Nat Commun 6, 6077 (2015) Hopf, T. A. et al. Quantification of the effect of mutations using a global probability model of natural sequence variation. arXiv: 1510.04612 (2015) Marks, D. S. et al. Protein structure prediction from sequence variation. Nat. Biotechnol. 30, 1072–1080 (2012) Kajan,´ L. et al. FreeContact: fast and free software for protein contact prediction from residue co-evolution. BMC Bioinformatics 15, 85 (2014) Tang, Y. et al. Protein structure determination by combining sparse NMR data with evolutionary couplings. Nat. Methods 12, 751–754 (2015) Sheridan, R. et al. EVfold.org: Evolutionary Couplings and Protein 3D Structure Prediction. bioRxiv, 021022 (2015) Related manuscripts published prior to PhD studies: Marks, D. S. et al. Protein 3D structure computed from evolutionary sequence variation. PLoS ONE 6, e28766 (2011) Hopf, T. A. et al. Three-dimensional structures of membrane proteins from ge- nomic sequencing. Cell 149, 1607–1621 (2012) Other manuscripts (not related to the contents of this thesis): Hopf, T. & Kramer, S. in Discovery Science (eds Pfahringer, B. et al.) Lecture Notes in Computer Science 6332, 311–325 (Springer Berlin Heidelberg, 2010) Radivojac, P. et al. A large-scale evaluation of computational protein function pre- diction. Nat. Methods 10, 221–227 (2013) Hamp, T. et al. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 14 Suppl 3, S7 (2013) ix Acknowledgements The past three years of research and writing this dissertation would not have been possible without the contributions of many other people. For this reason, I want to express my gratitude to: Burkhard Rost, Chris Sander, and in particular Debora Marks, for making this transat- lantic endeavor possible, exciting projects to work on, their mentorship, hospitality, inspiring discussions, and support in many other ways. Gudrun Klinker, for chairing the examining committee, and Yitzhak Pilpel, for review- ing the dissertation. All members of the Marks, Rost and Sander labs for an enriching scientific and per- sonal environment; in particular Charlotta Scharfe¨ for a great collaboration and proof- reading of this dissertation, Frank Poelwijk and Agnes´ Toth-Petr´ oczy´ for exciting dis- cussions on protein evolution, Tobias Hamp for his input on protein complexes, John Ingraham for his contributions to the mutation effect project, Richard Stein for sharing his knowledge on maximum entropy models, and Li Yang Smith for her work on the transmembrane proteins. My co-authors, for their valuable scientific contributions and insights. The administrative and technical staff at TUM (Marlena Drabik, Inga Weise, Mar- tina Hilla, Manuela Fischer and Tim Karl) and HMS (Maria Ferreira, Mason Miranda, Pamela Needham and Jake DeSilva, Research Computing), for their constant help and keeping things running smoothly. The scientific Python community for providing such a fantastic set of tools, in partic- ular the developers of the IPython/Jupyter notebook. All my friends at home, in Munich and in Boston for a great time and offering distrac- tion at times when things did not work as they should. My family, for their never-ending and unconditional support throughout all the years. Thank you all! xi Contents List of Figures xv List of Tables xvii Abbreviations xix 1. Introduction 1 1.1. The relationship of genotype and phenotype . 1 1.1.1. Biological relevance . 1 1.1.2. Experimental studies of the genotype-phenotype map . 2 1.1.3. Context-dependence of genetic variants . 3 1.2. Computational
Recommended publications
  • 2018 Program Book.Indd
    32nd Annual Symposium BOSTON July 9 - 12, 2018 Program www.proteinsociety.org Table of Contents #PS32 2 Welcome Mission 3 Program Planning Committee 4 Committees The Protein Society is a not-for-profi t scholarly society 7 Corporate Support with a mission to advance state-of-the-art science through international forums that promote commu- 8 Registration nication, cooperation, and collaboration among scientists involved in the study of proteins. 9 Hotel Floor Plan For 32 years, The Protein Society has served as the in- 11 General Information tellectual home of investigators across all disciplines - and from around the world - involved in the study 16 2018 Protein Society Award Winners of protein structure, function, and design. The Soci- ety provides forums for scientifi c collaboration and 22 Travel Awards communication and supports professional growth of young investigators through workshops, networking 24 At-A-Glance opportunities, and by encouraging junior research- ers to participate fully in the Annual Symposium. In 26 Program addition to our Symposium, the Society’s prestigious journal, Protein Science, serves as an ideal platform 40 Exhibitor Director to further the science of proteins in the broadest sense possible. 50 Poster Presentation Schedule 64 Abstracts: TPS Award Winners & Invited Speakers 86 Posters #PS32 1986 - 2018 1 Welcome Program Planning Committee #PS32 Welcome to Boston and to the 2018 32nd Boston | July 9 - 12, 2018 Annual Symposium of the Protein Society! We are excited to bring you this year’s Annual Symposium comprising twelve exceptional scien- tifi c sessions that cover a wide range of scientifi c achievement in the fi eld of protein science.
    [Show full text]
  • Mutational Scanning Symposium Virtual Event
    MUTATIONAL SCANNING SYMPOSIUM VIRTUAL EVENT APRIL 5-7, 2021 1 TABLE OF CONTENTS Welcome 3 Agenda Day 1; April 5 4 Agenda Day 2; April 6 5 Agenda Day 3; April 7 6 Keynote Speakers 7 Artist Alex Cagan 8 Speaker Bios 6 Grace R. Anderson Atina Cote Erika DeBenedictis John Doench Maitreya Dunham Doug Fowler Meghan Garrett Matt Hurles Martin Kampmann Rachel Karchin Ben Lehner (Keynote Speaker) Prashant Mali Deborah Marks Kenneth Matreyek Beatriz Adriana Osuna Maria-Jesus Martin Erik Procko Kimberly A. Reynolds (Keynote Speaker) Frederick (Fritz) Roth Alan Rubin Avtar Singh Lea Starita Tyler N. Starr Clare Turnbull Organizing Committee 17 Virtual Meeting Information 18 Sponsors 19 2 Multiplex Assays of Variant Effects Multiplex Assays of Variant Effects (MAVEs) are key to variant interpretation and are transforming our understanding of the human genome. This annual symposium is sponsored by the Center for the Multiplex Assessment of Phenotype , Illumina, Octant, Wellcome Sanger, and the Brotman Baty Institute. Experts in the field of mutational scanning come from around the world meet to present their work and provide insights on the future of this science for this three-day event which will be held virtually, April 5th-7th 2021. Register to attend HERE 3 Mutational Scanning Symposium April 5-7 2021 All times April 5, 2021 Agenda Pacific Standard Time Lea Starita, PhD | UW 8:00 -08:10 am Welcome, Opening Remarks Kim Reynolds, PhD | UTSW, Dallas (KEYNOTE/featured speaker) 8:10 - 8:55 am Mapping sequence constraints in an essential metabolic enzyme Maitreya Dunham, PhD | UW 8:55 - 9:20 am Deep mutational scanning of pharmacogenes using yeast activity assays.
    [Show full text]
  • Mutational Scanning Symposium
    MUTATIONAL SCANNING SYMPOSIUM VIRTUAL EVENT APRIL 5-7, 2021 1 TABLE OF CONTENTS Welcome 3 Agenda Day 1; April 5 4 Agenda Day 2; April 6 5 Agenda Day 3; April 7 6 Keynote Speakers 7 Artist Alex Cagan 8 Speaker Bios 6 Grace R. Anderson Atina Cote Erika DeBenedictis John Doench Maitreya Dunham Doug Fowler Meghan Garrett Matt Hurles Martin Kampmann Rachel Karchin Ben Lehner (Keynote Speaker) Prashant Mali Deborah Marks Kenneth Matreyek Beatriz Adriana Osuna Maria-Jesus Martin Erik Procko Kimberly A. Reynolds (Keynote Speaker) Frederick (Fritz) Roth Alan Rubin Avtar Singh Lea Starita Tyler N. Starr Clare Turnbull Organizing Committee 17 Virtual Meeting Information 18 Sponsors 19 2 Multiplex Assays of Variant Effects Multiplex Assays of Variant Effects (MAVEs) are key to variant interpretation and are transforming our understanding of the human genome. This annual symposium is sponsored by the Center for the Multiplex Assessment of Phenotype , Illumina, Octant, Wellcome Sanger, and the Brotman Baty Institute. Experts in the field of mutational scanning come from around the world meet to present their work and provide insights on the future of this science for this three-day event which will be held virtually, April 5th-7th 2021. Register to attend HERE 3 Mutational Scanning Symposium April 5-7 2021 All times April 5, 2021 Agenda Pacific Standard Time Lea Starita, PhD | UW 8:00 -08:10 am Welcome, Opening Remarks Kim Reynolds, PhD | UTSW, Dallas (KEYNOTE/featured speaker) 8:10 - 8:55 am Mapping sequence constraints in an essential metabolic enzyme Maitreya Dunham, PhD | UW 8:55 - 9:20 am Deep mutational scanning of pharmacogenes using yeast Activity assays.
    [Show full text]
  • 2016 ISCB Overton Prize Awarded to Debora Marks[Version 1; Peer
    F1000Research 2016, 5(ISCB Comm J):1575 Last updated: 09 APR 2020 EDITORIAL 2016 ISCB Overton Prize awarded to Debora Marks [version 1; peer review: not peer reviewed] Christiana N. Fogg1, Diane E. Kovats 2 1Freelance Science Writer, Kensington, MD, USA 2International Society for Computational Biology, Bethesda, MD, USA First published: 05 Jul 2016, 5(ISCB Comm J):1575 ( Not Peer Reviewed v1 https://doi.org/10.12688/f1000research.9158.1) Latest published: 05 Jul 2016, 5(ISCB Comm J):1575 ( This article is an Editorial and has not been subject https://doi.org/10.12688/f1000research.9158.1) to external peer review. Abstract Any comments on the article can be found at the The International Society for Computational Biology (ISCB) recognizes the achievements of an early- to mid-career scientist with the Overton Prize end of the article. each year. The Overton Prize was established to honor the untimely loss of Dr. G. Christian Overton, a respected computational biologist and founding ISCB Board member. Winners of the Overton Prize are independent investigators in the early to middle phases of their careers who are selected because of their significant contributions to computational biology through research, teaching, and service. 2016 will mark the fifteenth bestowment of the ISCB Overton Prize. ISCB is pleased to confer this award the to Debora Marks, Assistant Professor of Systems Biology and director of the Raymond and Beverly Sackler Laboratory for Computational Biology at Harvard Medical School. Keywords ISCB , Overton Prize , Debora Marks , Dr. G. Christian Overton , systems biology This article is included in the International Society for Computational Biology Community Journal gateway.
    [Show full text]
  • Hits Symp 2017
    HiTS Symposium December 8, 2017 Novars Instutes of BioMedical Research 250 Massachuses Avenue Cambridge, MA Contents Schedule .................................................................................................................. 3 Speaker Bios and Talk Abstracts ............................................................................. 5 Attendee List ......................................................................................................... 17 Poster Index .......................................................................................................... 23 Poster Abstracts .................................................................................................... 27 2 HiTS Fall Symposium 2017 Schedule 8:30 – 9:00am Registration and Light Breakfast 9:00 – 9:30am Welcome Remarks: Peter Sorger, Head of Harvard Program in Therapeutic Science, Harvard Medical School 9:30 – 10:10am Keynote Talk: Mark Borowsky, Executive Director of Scientific Data Analysis, Novartis Institutes for BioMedical Research Machine Learning and Drug Discovery 10:10 – 10:20am Coffee Break 10:20 – 12:15pm Session 1: Biology and Pharmacology at the Single Cell Level Chair: Michael Baym Bree Aldridge, Assistant Professor, Tufts University Mycobacterial cell size control and antibiotic susceptibility Ben Izar, Instructor, Harvard Medical School Cancer cell‐autonomous mechanisms of resistance to immune checkpoint inhibitors revealed by single‐cell RNA‐sequencing Adam Palmer, Postdoctoral Fellow, Harvard Medical School Identifying
    [Show full text]
  • Proceedings of the Eighteenth International Conference on Machine Learning., 282 – 289
    Abstracts of papers, posters and talks presented at the 2008 Joint RECOMB Satellite Conference on REGULATORYREGULATORY GENOMICS GENOMICS - SYSTEMS BIOLOGY - DREAM3 Oct 29-Nov 2, 2008 MIT / Broad Institute / CSAIL BMP follicle cells signaling EGFR signaling floor cells roof cells Organized by Manolis Kellis, MIT Andrea Califano, Columbia Gustavo Stolovitzky, IBM Abstracts of papers, posters and talks presented at the 2008 Joint RECOMB Satellite Conference on REGULATORYREGULATORY GENOMICS GENOMICS - SYSTEMS BIOLOGY - DREAM3 Oct 29-Nov 2, 2008 MIT / Broad Institute / CSAIL Organized by Manolis Kellis, MIT Andrea Califano, Columbia Gustavo Stolovitzky, IBM Conference Chairs: Manolis Kellis .................................................................................. Associate Professor, MIT Andrea Califano ..................................................................... Professor, Columbia University Gustavo Stolovitzky....................................................................Systems Biology Group, IBM In partnership with: Genome Research ..............................................................................editor: Hillary Sussman Nature Molecular Systems Biology ............................................... editor: Thomas Lemberger Journal of Computational Biology ...............................................................editor: Sorin Istrail Organizing committee: Eleazar Eskin Trey Ideker Eran Segal Nir Friedman Douglas Lauffenburger Ron Shamir Leroy Hood Satoru Miyano Program Committee: Regulatory Genomics:
    [Show full text]
  • Building Maps from Genetic Sequences to Biological Function
    Building maps from genetic sequences to biological function The Harvard community has made this article openly available. Please share how this access benefits you. Your story matters Citation Riesselman, Adam Joseph. 2019. Building maps from genetic sequences to biological function. Doctoral dissertation, Harvard University, Graduate School of Arts & Sciences. Citable link http://nrs.harvard.edu/urn-3:HUL.InstRepos:41121291 Terms of Use This article was downloaded from Harvard University’s DASH repository, and is made available under the terms and conditions applicable to Other Posted Material, as set forth at http:// nrs.harvard.edu/urn-3:HUL.InstRepos:dash.current.terms-of- use#LAA Building maps from genetic sequences to biological function A dissertation presented by Adam Joseph Riesselman to The Division of Medical Sciences in partial fulfillment of the requirements for the degree of Doctor of Philosophy in the subject of Biomedical Informatics Harvard University Cambridge, Massachusetts December 2018 ©2018 Adam Joseph Riesselman All rights reserved. Dissertation Advisor: Debora S. Marks Adam Joseph Riesselman Building maps from genetic sequences to biological function Abstract Predicting how changes to the genetic code will alter the characteristics of an organism is a fundamental question in biology and genetics. Typically, measurements of the true functional landscape relating genotype to phenotype are noisy and costly to obtain. Though high-throughput DNA sequencing and synthesis can shed light on biological constraints in organisms, inferring relationships from these high-dimensional, multi-scale data to make predictions about new biological sequences is a formidable task. Here, I aim to build algorithms that map genetic sequences to biological function.
    [Show full text]
  • SUNDAY, August 19 MONDAY, August 20
    13th International Conference on Systems Biology August 19 – 23, 2012 Schedule of Events SUNDAY, August 19 12:00 noon - 5:00 pm Registration Concert Hall Foyer 3:00 pm - 3:15 pm Opening Remarks and Welcome Concert Hall Charlie Boone, University of Toronto 3:15 pm - 4:15 pm Plenary Session 1 Concert Hall Chair: Brenda Andrews, University of Toronto Keynote Lecture Exploring the sequence specificty of eukarytic DNA and RNA binding proteins. Timothy Hughes. Keynote Lecture Unraveling principles of gene regulation using thousands of designed promoter sequences. Eran Segal 4:15 pm - 5:00 pm Special Lectures Concert Hall Chair: Charlie Boone, University of Toronto Visualisations of DNA with Hollywood visual effects. Drew Berry. Systems-Biology Applications of Imaging Flow Cytometry. Timothy Galitski. 5:00 pm - 5:30 pm Champagne Reception Concert Hall Foyer 5:30 pm - 6:00 pm Performance byTanya Tagaq Inuit throat singer Concert Hall 6:15 pm - 8:00 pm Opening Reception at the Donnelly Centre Donnelly Centre MONDAY, August 20 9:00 am - 10:00 am Plenary Session II Concert Hall Chair: Fritz Roth, University of Toronto Keynote Lecture Reading and Writing Omes. George Church. Lecture Towards unification of genetic and hierarchy models of tumor heterogeneity. John Dick. 10:00 am - 10:30 am Special Lecture Chair: Corey Nislow, University of Toronto Carl Zimmer, Science Journalist, Birds and Bioterrorists: The (Un)natural History of Influenza 13th International Conference on Systems Biology August 19 – 23, 2012 Schedule of Events 10:30 am - 11:00 am
    [Show full text]
  • Transformer Protein Language Models Are Unsupervised Structure Learners
    bioRxiv preprint doi: https://doi.org/10.1101/2020.12.15.422761; this version posted December 15, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. TRANSFORMER PROTEIN LANGUAGE MODELS ARE UNSUPERVISED STRUCTURE LEARNERS Roshan Rao∗ Joshua Meier Tom Sercu UC Berkeley Facebook AI Research Facebook AI Research [email protected] [email protected] [email protected] Sergey Ovchinnikov Alexander Rives Harvard University Facebook AI Research & New York University [email protected] [email protected] ABSTRACT Unsupervised contact prediction is central to uncovering physical, structural, and functional constraints for protein structure determination and design. For decades, the predominant approach has been to infer evolutionary constraints from a set of related sequences. In the past year, protein language models have emerged as a po- tential alternative, but performance has fallen short of state-of-the-art approaches in bioinformatics. In this paper we demonstrate that Transformer attention maps learn contacts from the unsupervised language modeling objective. We find the highest capacity models that have been trained to date already outperform a state- of-the-art unsupervised contact prediction pipeline, suggesting these pipelines can be replaced with a single forward pass of an end-to-end model.1 1 INTRODUCTION Unsupervised modeling of protein contacts has an important role in computational protein de- sign (Russ et al., 2020; Tian et al., 2018; Blazejewski et al., 2019) and is a central element of all current state-of-the-art structure prediction methods (Wang et al., 2017; Senior et al., 2020; Yang et al., 2019).
    [Show full text]
  • Arxiv:2010.08162V2 [Q-Bio.BM] 15 Nov 2020 1.1 Protein Structure Data Availability and Information Leakage
    SidechainNet: An All-Atom Protein Structure Dataset for Machine Learning Jonathan E. King∗ David Ryan Koes Comp. & Systems Biology Comp. & Systems Biology University of Pittsburgh University of Pittsburgh Pittsburgh, PA 15213 Pittsburgh, PA 15213 [email protected] [email protected] Abstract Despite recent advancements in deep learning methods for protein structure pre- diction and representation, little focus has been directed at the simultaneous inclusion and prediction of protein backbone and sidechain structure informa- tion. We present SidechainNet, a new dataset that directly extends the Protein- Net dataset. SidechainNet includes angle and atomic coordinate information capable of describing all heavy atoms of each protein structure. In this paper, we provide background information on the availability of protein structure data and the significance of ProteinNet. Thereafter, we argue for the potentially ben- eficial inclusion of sidechain information through SidechainNet, describe the process by which we organize SidechainNet, and provide a software package (https://github.com/jonathanking/sidechainnet) for data manipulation and training with machine learning models. 1 Background The deep learning subfield of protein structure prediction and protein science has made considerable progress in the last several years. In addition to the dominance of the AlphaFold deep learning method in the 2018 Critical Assessment of protein Structure Prediction (CASP) competition [1, 2, 3], many novel deep learning-based methods have been developed for protein representation [4, 5], property prediction [6, 7], and structure prediction [8, 9, 10, 11]. Such methods are demonstrably effective and extremely promising for future research. They have the potential to make complex inferences about proteins much faster than competing computational methods and at a cost several orders of magnitude lower than experimental methods.
    [Show full text]
  • Generative Models for Graph-Based Protein Design
    Generative models for graph-based protein design John Ingraham, Vikas K. Garg, Regina Barzilay, Tommi Jaakkola Computer Science and Artificial Intelligence Lab, MIT {ingraham, vgarg, regina, tommi}@csail.mit.edu Abstract Engineered proteins offer the potential to solve many problems in biomedicine, energy, and materials science, but creating designs that succeed is difficult in practice. A significant aspect of this challenge is the complex coupling between protein sequence and 3D structure, with the task of finding a viable design often referred to as the inverse protein folding problem. In this work, we introduce a conditional generative model for protein sequences given 3D structures based on graph representations. Our approach efficiently captures the complex dependencies in proteins by focusing on those that are long-range in sequence but local in 3D space. This graph-based approach improves in both speed and reliability over conventional and other neural network-based approaches, and takes a step toward rapid and targeted biomolecular design with the aid of deep generative models. 1 Introduction A central goal for computational protein design is to automate the invention of protein molecules with defined structural and functional properties. This field has seen tremendous progess in the past two decades [1], including the design of novel 3D folds [2], enzymes [3], and complexes [4]. Despite these successes, current approaches are often unreliable, requiring multiple rounds of trial-and-error in which initial designs often fail [5, 6]. Moreover, diagnosing the origin of this unreliability is difficult, as contemporary bottom-up approaches depend both on the accuracy of complex, composite energy functions for protein physics and also on the efficiency of sampling algorithms for jointly exploring the protein sequence and structure space.
    [Show full text]
  • Mutational Scanning Symposium
    MUTATIONAL SCANNING SYMPOSIUM VIRTUAL EVENT APRIL 5-7, 2021 1 TABLE OF CONTENTS Welcome 3 Agenda Day 1; April 5 4 Agenda Day 2; April 6 5 Agenda Day 3; April 7 6 Keynote Speakers 7 Artist Alex Cagan 8 Speaker Bios 9 Grace R. Anderson Atina Cote Josh Cuperus Erika DeBenedictis John Doench Maitreya Dunham Doug Fowler Meghan Garrett Matt Hurles Martin Kampmann Rachel Karchin Ben Lehner (Keynote Speaker) Prashant Mali Deborah Marks Kenneth Matreyek Beatriz Adriana Osuna Maria-Jesus Martin Erik Procko Kimberly A. Reynolds (Keynote Speaker) Frederick (Fritz) Roth Alan Rubin Avtar Singh Lea Starita Tyler N. Starr Clare Turnbull Organizing Committee/Moderators 17 Virtual Meeting Information 18 Sponsors 19 2 Multiplex Assays of Variant Effects Multiplex Assays of Variant Effects (MAVEs) are key to variant interpretation and are transforming our understanding of the human genome. This annual symposium is sponsored by the Center for the Multiplex Assessment of Phenotype , Illumina, Octant, Wellcome Sanger, and the Brotman Baty Institute. Experts in the field of mutational scanning come from around the world meet to present their work and provide insights on the future of this science for this three-day event which will be held virtually, April 5th-7th 2021. Register to attend HERE 3 Mutational Scanning Symposium April 5-7 2021 All times April 5, 2021 Agenda Pacific Standard Time Lea Starita, PhD | UW 8:00 -08:10 am Welcome, Opening Remarks Kim Reynolds, PhD | UTSW, Dallas (KEYNOTE/featured speaker) 8:10 - 8:55 am Mapping sequence constraints in an essential metabolic enzyme Maitreya Dunham, PhD | UW 8:55 - 9:20 am Deep mutational scanning of pharmacogenes using yeast activity assays.
    [Show full text]