Wellcome Sanger Institute and Wellcome Genome Campus Landscape Review for More Information on This Publication, Visit

Total Page:16

File Type:pdf, Size:1020Kb

Wellcome Sanger Institute and Wellcome Genome Campus Landscape Review for More Information on This Publication, Visit EUROPE SARAH PARKINSON, JOE FRANCOMBE, CAGLA STEVENSON, HAMISH EVANS, ADVAIT DESHPANDE, SUSAN GUTHRIE Wellcome Sanger Institute and Wellcome Genome Campus Landscape Review For more information on this publication, visit www.rand.org/t/RRA215-1 Published by the RAND Corporation, Santa Monica, Calif., and Cambridge, UK © Copyright 2021 RAND Corporation R® is a registered trademark. RAND Europe is a not-for-profit organisation whose mission is to help improve policy and decision making through research and analysis. RAND’s publications do not necessarily reflect the opinions of its research clients and sponsors. Limited Print and Electronic Distribution Rights This document and trademark(s) contained herein are protected by law. This representation of RAND intellectual property is provided for noncommercial use only. Unauthorized posting of this publication online is prohibited. Permission is given to duplicate this document for personal use only, as long as it is unaltered and complete. Permission is required from RAND to reproduce, or reuse in another form, any of its research documents for commercial use. For information on reprint and linking permissions, please visit www.rand.org/pubs/permissions. Support RAND Make a tax-deductible charitable contribution at www.rand.org/giving/contribute www.rand.org www.rand.org/randeurope Table of contents Abbreviations ........................................................................................................................................... 7 Figures ..................................................................................................................................................... 9 Tables .................................................................................................................................................... 11 Boxes ..................................................................................................................................................... 12 Executive summary ................................................................................................................................ 13 Key contributions of the Sanger Institute and comparator organisations............................................ 13 Reflections and future outlook ......................................................................................................... xxi 1. Introduction ....................................................................................................................... 22 1.1. Background and context ....................................................................................................... 22 1.2. Scope of the work ................................................................................................................. 23 1.3. Research questions ................................................................................................................ 23 1.4. Approach .............................................................................................................................. 24 1.5. Limitations of the study ........................................................................................................ 26 1.6. Structure of this report .......................................................................................................... 27 2. Review of Wellcome Sanger Institute and Wellcome Genome Campus ................................. 28 2.1. Operation and key features ................................................................................................... 28 2.2. Outputs and contributions ................................................................................................... 32 2.3. Summary of role in field and characteristics .......................................................................... 41 3. Comparator institutions ...................................................................................................... 46 3.1. Broad Institute ...................................................................................................................... 46 3.2. Wellcome Centre for Human Genetics (WHG) ................................................................... 54 3.3. Janelia Research Campus ...................................................................................................... 60 3.4. The NHGRI ........................................................................................................................ 68 3.5. Reflections ............................................................................................................................ 74 4. Case studies ........................................................................................................................ 85 4.1. Open Targets ........................................................................................................................ 86 4.2. Tree of Life ........................................................................................................................... 91 4.3. DDD .................................................................................................................................... 98 5 4.4. Malaria research and MalariaGEN ...................................................................................... 105 4.5. ICGC ................................................................................................................................. 108 5. Reflections ....................................................................................................................... 114 5.1. Reflections on the role of the Sanger Institute and the Genome Campus in the field and key characteristics ...................................................................................................................... 114 5.2. Looking forward: the future for genetics and genomics research and the role of the Sanger Institute and Wellcome Genome Campus .......................................................................... 116 Bibliography ............................................................................................................................ 118 Annex A. Methodology ......................................................................................................... 124 A.1. Task 1: Desk research ......................................................................................................... 125 A.2. Task 2: Interviews ............................................................................................................... 128 A.3. Task 3: Bibliometric analysis ............................................................................................... 130 A.4. Task 4: Case studies ............................................................................................................ 131 A.5. Task 5: Synthesis and reporting .......................................................................................... 131 A.6. Caveats and limitations of the analysis ................................................................................ 132 Annex B. Interview protocols ................................................................................................ 134 B.1. Protocol for external interviews ........................................................................................... 134 B.2. Protocol for internal interviews ........................................................................................... 138 Annex C. Case study structure ............................................................................................... 143 6 Abbreviations ARGO Accelerate Research in Genomic Oncology BIC BioData Innovation Centre CAP Complementary Analysis Programme BRAIN Brain Research Through Advancing Innovative Neurotechnologies COSMIC Catalogue of Somatic Mutations in Cancer CRISPR Clustered Regularly Interspaced Short Palindromic Repeats CWTS Leiden University’s Centre for Science and Technology Studies DDD Deciphering Developmental Disorders EMBL-EBI European Molecular Biology Laboratory - European Bioinformatics Institute FISH Fluorescence in situ Hybridisation GEL Genomics England Ltd. GRL Genome Research Limited GA4GH Global Alliance for Genomics and Health GATK Genome Analysis Toolkit GSK GlaxoSmithKline GWAS Genome-Wide Association Studies HCA Human Cell Atlas HCP Highly cited publications HHMI Howard Hughes Medical Institute ICGC International Cancer Genome Consortium IP Intellectual Property iPSC induced Pluripotent Stem Cells LMIC Low- to Middle-Income Country MIT Massachusetts Institute of Technology MNCS Mean Normalised Citation Score MNJS Mean Normalised Journal Score 7 MRC Medical Research Council NHGRI National Human Genome Research Institute NHS National Health Service OGT Oxford Gene Technology OICR Ontario Institute for Cancer Research PCAWG Pan-Cancer Analysis of Whole Genomes PDNA Plasmodium Diversity Network Africa PI Principle investigator QA Quality Assurance QTL Quantitative Trait Locus REA Rapid Evidence Assessment SNP Single Nucleotide Polymorphism Stakeholder Perspectives on the Ethical Challenges in the use of Artificial Intelligence for SPACE Cognitive Evaluation SWOT Strength, Weakness, Opportunities, Threats TCGA The Cancer Genome Atlas WHG Wellcome Centre for Human Genetics WoS Web of Science 8 Figures Figure 1: Summary of key achievements of the Sanger Institute .............................................................
Recommended publications
  • Analysis of the Impact of Sequencing Errors on Blast Using Fault Injection
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Illinois Digital Environment for Access to Learning and Scholarship Repository ANALYSIS OF THE IMPACT OF SEQUENCING ERRORS ON BLAST USING FAULT INJECTION BY SO YOUN LEE THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2013 Urbana, Illinois Adviser: Professor Ravishankar K. Iyer ABSTRACT This thesis investigates the impact of sequencing errors in post-sequence computational analyses, including local alignment search and multiple sequence alignment. While the error rates of sequencing technology are commonly reported, the significance of these numbers cannot be fully grasped without putting them in the perspective of their impact on the downstream analyses that are used for biological research, forensics, diagnosis of diseases, etc. I approached the quantification of the impact using fault injection. Faults were injected in the input sequence data, and the analyses were run. Change in the output of the analyses was interpreted as the impact of faults, or errors. Three commonly used algorithms were used: BLAST, SSEARCH, and ProbCons. The main contributions of this work are the application of fault injection to the reliability analysis in bioinformatics and the quantitative demonstration that a small error rate in the sequence data can alter the output of the analysis in a significant way. BLAST and SSEARCH are both local alignment search tools, but BLAST is a heuristic implementation, while SSEARCH is based on the optimal Smith-Waterman algorithm.
    [Show full text]
  • Ensembl Genomes: Extending Ensembl Across the Taxonomic Space P
    Published online 1 November 2009 Nucleic Acids Research, 2010, Vol. 38, Database issue D563–D569 doi:10.1093/nar/gkp871 Ensembl Genomes: Extending Ensembl across the taxonomic space P. J. Kersey*, D. Lawson, E. Birney, P. S. Derwent, M. Haimel, J. Herrero, S. Keenan, A. Kerhornou, G. Koscielny, A. Ka¨ ha¨ ri, R. J. Kinsella, E. Kulesha, U. Maheswari, K. Megy, M. Nuhn, G. Proctor, D. Staines, F. Valentin, A. J. Vilella and A. Yates EMBL-European Bioinformatics Institute, Wellcome Trust Genome Campus, Cambridge CB10 1SD, UK Received August 14, 2009; Revised September 28, 2009; Accepted September 29, 2009 ABSTRACT nucleotide archives; numerous other genomes exist in states of partial assembly and annotation; thousands of Ensembl Genomes (http://www.ensemblgenomes viral genomes sequences have also been generated. .org) is a new portal offering integrated access to Moreover, the increasing use of high-throughput genome-scale data from non-vertebrate species sequencing technologies is rapidly reducing the cost of of scientific interest, developed using the Ensembl genome sequencing, leading to an accelerating rate of genome annotation and visualisation platform. data production. This not only makes it likely that in Ensembl Genomes consists of five sub-portals (for the near future, the genomes of all species of scientific bacteria, protists, fungi, plants and invertebrate interest will be sequenced; but also the genomes of many metazoa) designed to complement the availability individuals, with the possibility of providing accurate and of vertebrate genomes in Ensembl. Many of the sophisticated annotation through the similarly low-cost databases supporting the portal have been built in application of functional assays.
    [Show full text]
  • Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J
    Perspective Cite This: J. Am. Chem. Soc. 2019, 141, 14463−14479 pubs.acs.org/JACS Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J. Gray, Lukasz G. Migas, Perdita E. Barran, Kevin Pagel, Peter H. Seeberger, ∥ ⊥ # ⊗ ∇ Claire E. Eyers, Geert-Jan Boons, Nicola L. B. Pohl, Isabelle Compagnon, , × † Göran Widmalm, and Sabine L. Flitsch*, † School of Chemistry & Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, U.K. ‡ Institute for Chemistry and Biochemistry, Freie Universitaẗ Berlin, Takustraße 3, 14195 Berlin, Germany § Biomolecular Systems Department, Max Planck Institute for Colloids and Interfaces, Am Muehlenberg 1, 14476 Potsdam, Germany ∥ Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, U.K. ⊥ Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia 30602, United States # Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States ⊗ Institut Lumierè Matiere,̀ UMR5306 UniversitéLyon 1-CNRS, Universitéde Lyon, 69622 Villeurbanne Cedex, France ∇ Institut Universitaire de France IUF, 103 Blvd St Michel, 75005 Paris, France × Department of Organic Chemistry, Arrhenius Laboratory, Stockholm University, S-106 91 Stockholm, Sweden *S Supporting Information established connection between their structure and their ABSTRACT: Carbohydrates possess a variety of distinct function, full characterization of unknown carbohydrates features with
    [Show full text]
  • 100000 Genomes and Genomics England
    100,000 Genomes & Genomics England Tim Hubbard Genomics England King’s College London, King’s Health Partners Wellcome Trust Sanger Institute Global Leaders in Genomic Medicine Washington 8-9th January 2014 UK Health System 101 • Four separate health services – NHS England – NHS Wales – NHS Scotland – Health & Social Care in Northern Ireland (HSC) • NHS (England) – ~1.4 million employees – ~£110 billion annual budget • Structure in England changed 1st April 2013 https://www.gov.uk/government/organisations/department-of-health Linking Health data to Research Clinical Data Healthcare Professional World Genotype Electronic Health Record Whole Genome Sequencing Phenotype Electronic Genomic Biology Health Data World Records Reference Genotype and genome sequence Phenotype ~3 gigabytes relationship capture EBI: repositories (petabytes of genome sequence data) Human sequence data Sanger: sequencing repositories (1000 genomes, uk10K) Steps in UK towards E-Health Research, Genomic Medicine • Health data to Research – 2006 Creation of OSCHR • Increase coordination between funders: MRC and NIHR – 2007 OSCHR E-health board • Enable research access to UK EHR data • Build capacity for research on EHR data • Genomics to Health – 2009 House of Lords report on Genomic Medicine – 2010 Creation of Human Genomic Strategy Group (HGSG) 2011: UK Life Sciences Strategy No10: http://www.number10.gov.uk/news/uk-life-sciences-get-government-cash-boost/ BIS/DH: http://www.dh.gov.uk/health/2011/12/nhs-adopting-innovation/ Linking Health data to Research Clinical Data
    [Show full text]
  • Functional Effects Detailed Research Plan
    GeCIP Detailed Research Plan Form Background The Genomics England Clinical Interpretation Partnership (GeCIP) brings together researchers, clinicians and trainees from both academia and the NHS to analyse, refine and make new discoveries from the data from the 100,000 Genomes Project. The aims of the partnerships are: 1. To optimise: • clinical data and sample collection • clinical reporting • data validation and interpretation. 2. To improve understanding of the implications of genomic findings and improve the accuracy and reliability of information fed back to patients. To add to knowledge of the genetic basis of disease. 3. To provide a sustainable thriving training environment. The initial wave of GeCIP domains was announced in June 2015 following a first round of applications in January 2015. On the 18th June 2015 we invited the inaugurated GeCIP domains to develop more detailed research plans working closely with Genomics England. These will be used to ensure that the plans are complimentary and add real value across the GeCIP portfolio and address the aims and objectives of the 100,000 Genomes Project. They will be shared with the MRC, Wellcome Trust, NIHR and Cancer Research UK as existing members of the GeCIP Board to give advance warning and manage funding requests to maximise the funds available to each domain. However, formal applications will then be required to be submitted to individual funders. They will allow Genomics England to plan shared core analyses and the required research and computing infrastructure to support the proposed research. They will also form the basis of assessment by the Project’s Access Review Committee, to permit access to data.
    [Show full text]
  • Rare Variant Contribution to Human Disease in 281,104 UK Biobank Exomes W ­ 1,19 1,19 2,19 2 2 Quanli Wang , Ryan S
    https://doi.org/10.1038/s41586-021-03855-y Accelerated Article Preview Rare variant contribution to human disease W in 281,104 UK Biobank exomes E VI Received: 3 November 2020 Quanli Wang, Ryan S. Dhindsa, Keren Carss, Andrew R. Harper, Abhishek N ag­­, I oa nn a Tachmazidou, Dimitrios Vitsios, Sri V. V. Deevi, Alex Mackay, EDaniel Muthas, Accepted: 28 July 2021 Michael Hühn, Sue Monkley, Henric O ls so n , S eb astian Wasilewski, Katherine R. Smith, Accelerated Article Preview Published Ruth March, Adam Platt, Carolina Haefliger & Slavé PetrovskiR online 10 August 2021 P Cite this article as: Wang, Q. et al. Rare variant This is a PDF fle of a peer-reviewed paper that has been accepted for publication. contribution to human disease in 281,104 UK Biobank exomes. Nature https:// Although unedited, the content has been subjectedE to preliminary formatting. Nature doi.org/10.1038/s41586-021-03855-y (2021). is providing this early version of the typeset paper as a service to our authors and Open access readers. The text and fgures will undergoL copyediting and a proof review before the paper is published in its fnal form. Please note that during the production process errors may be discovered which Ccould afect the content, and all legal disclaimers apply. TI R A D E T A R E L E C C A Nature | www.nature.com Article Rare variant contribution to human disease in 281,104 UK Biobank exomes W 1,19 1,19 2,19 2 2 https://doi.org/10.1038/s41586-021-03855-y Quanli Wang , Ryan S.
    [Show full text]
  • The ELIXIR Core Data Resources: ​Fundamental Infrastructure for The
    Supplementary Data: The ELIXIR Core Data Resources: fundamental infrastructure ​ for the life sciences The “Supporting Material” referred to within this Supplementary Data can be found in the Supporting.Material.CDR.infrastructure file, DOI: 10.5281/zenodo.2625247 (https://zenodo.org/record/2625247). ​ ​ Figure 1. Scale of the Core Data Resources Table S1. Data from which Figure 1 is derived: Year 2013 2014 2015 2016 2017 Data entries 765881651 997794559 1726529931 1853429002 2715599247 Monthly user/IP addresses 1700660 2109586 2413724 2502617 2867265 FTEs 270 292.65 295.65 289.7 311.2 Figure 1 includes data from the following Core Data Resources: ArrayExpress, BRENDA, CATH, ChEBI, ChEMBL, EGA, ENA, Ensembl, Ensembl Genomes, EuropePMC, HPA, IntAct /MINT , InterPro, PDBe, PRIDE, SILVA, STRING, UniProt ● Note that Ensembl’s compute infrastructure physically relocated in 2016, so “Users/IP address” data are not available for that year. In this case, the 2015 numbers were rolled forward to 2016. ● Note that STRING makes only minor releases in 2014 and 2016, in that the interactions are re-computed, but the number of “Data entries” remains unchanged. The major releases that change the number of “Data entries” happened in 2013 and 2015. So, for “Data entries” , the number for 2013 was rolled forward to 2014, and the number for 2015 was rolled forward to 2016. The ELIXIR Core Data Resources: fundamental infrastructure for the life sciences ​ 1 Figure 2: Usage of Core Data Resources in research The following steps were taken: 1. API calls were run on open access full text articles in Europe PMC to identify articles that ​ ​ mention Core Data Resource by name or include specific data record accession numbers.
    [Show full text]
  • Insights Into the Genetic Basis of the Renal Cell Carcinomas from the Cancer Genome Atlas Scott M
    Published OnlineFirst June 21, 2016; DOI: 10.1158/1541-7786.MCR-16-0115 Review Molecular Cancer Research Insights into the Genetic Basis of the Renal Cell Carcinomas from The Cancer Genome Atlas Scott M. Haake, Jamie D. Weyandt, and W. Kimryn Rathmell Abstract The renal cell carcinomas (RCC), clear cell, papillary, and mutations in chromatin modifier genes. Importantly, the pap- chromophobe, have recently undergone an unmatched geno- illary RCC classification encompasses a heterogeneous group mic characterization by The Cancer Genome Atlas. This anal- of diseases, each with highly distinct genetic and molecular ysis has revealed new insights into each of these malignancies features. In conclusion, this review summarizes RCCs that and underscores the unique biology of clear cell, papillary, and represent a diverse set of malignancies, each with novel bio- chromophobe RCC. Themes that have emerged include dis- logic programs that define new paradigms for cancer biology. tinct mechanisms of metabolic dysregulation and common Mol Cancer Res; 14(7); 589–98. Ó2016 AACR. Introduction ullary carcinoma and translocation carcinoma. It is anticipated that a revised World Health Organization (WHO) pathologic Cancers of the kidney have long fascinated physicians with classification recognizing even more rare and largely low-risk their highly divergent phenotypes and patterns of behavior (1). entities, such as clear cell tubulopapillary RCC and multi- Kidney tumors are conventionally grouped based upon their locular cystic RCC, will be published in 2016
    [Show full text]
  • PLK-1 Promotes the Merger of the Parental Genome Into A
    RESEARCH ARTICLE PLK-1 promotes the merger of the parental genome into a single nucleus by triggering lamina disassembly Griselda Velez-Aguilera1, Sylvia Nkombo Nkoula1, Batool Ossareh-Nazari1, Jana Link2, Dimitra Paouneskou2, Lucie Van Hove1, Nicolas Joly1, Nicolas Tavernier1, Jean-Marc Verbavatz3, Verena Jantsch2, Lionel Pintard1* 1Programme Equipe Labe´llise´e Ligue Contre le Cancer - Team Cell Cycle & Development - Universite´ de Paris, CNRS, Institut Jacques Monod, Paris, France; 2Department of Chromosome Biology, Max Perutz Laboratories, University of Vienna, Vienna Biocenter, Vienna, Austria; 3Universite´ de Paris, CNRS, Institut Jacques Monod, Paris, France Abstract Life of sexually reproducing organisms starts with the fusion of the haploid egg and sperm gametes to form the genome of a new diploid organism. Using the newly fertilized Caenorhabditis elegans zygote, we show that the mitotic Polo-like kinase PLK-1 phosphorylates the lamin LMN-1 to promote timely lamina disassembly and subsequent merging of the parental genomes into a single nucleus after mitosis. Expression of non-phosphorylatable versions of LMN- 1, which affect lamina depolymerization during mitosis, is sufficient to prevent the mixing of the parental chromosomes into a single nucleus in daughter cells. Finally, we recapitulate lamina depolymerization by PLK-1 in vitro demonstrating that LMN-1 is a direct PLK-1 target. Our findings indicate that the timely removal of lamin is essential for the merging of parental chromosomes at the beginning of life in C. elegans and possibly also in humans, where a defect in this process might be fatal for embryo development. *For correspondence: [email protected] Introduction Competing interests: The After fertilization, the haploid gametes of the egg and sperm have to come together to form the authors declare that no genome of a new diploid organism.
    [Show full text]
  • C. Elegans Whole Genome Sequencing Reveals Mutational Signatures Related to Carcinogens and DNA Repair Deficiency
    Downloaded from genome.cshlp.org on September 28, 2021 - Published by Cold Spring Harbor Laboratory Press C. elegans whole genome sequencing reveals mutational signatures related to carcinogens and DNA repair deficiency Authors: Bettina Meier * (1); Susanna L Cooke * (2); Joerg Weiss (1); Aymeric P Bailly (1,3); Ludmil B Alexandrov (2); John Marshall (2); Keiran Raine (2); Mark Maddison (2); Elizabeth Anderson (2); Michael R Stratton (2); Anton Gartner * (1); Peter J Campbell * (2,4,5). * These authors contributed equally to this project. Institutions: (1) Centre for Gene Regulation and Expression, University of Dundee, Dundee, UK. (2) Cancer Genome Project, Wellcome Trust Sanger Institute, Hinxton, UK. (3) CRBM/CNRS UMR5237, University of Montpellier, Montpellier, France. (4) Department of Haematology, University of Cambridge, Cambridge, UK. (5) Department of Haematology, Addenbrooke’s Hospital, Cambridge, UK. Address for correspondence: Dr Peter J Campbell, Dr Anton Gartner, Cancer Genome Project, Centre for Gene Regulation and Expression, Wellcome Trust Sanger Institute, The University of Dundee, Hinxton CB10 1SA, Dow Street, Cambridgeshire, Dundee DD1 5EH UK. UK. Tel: +44 (0) 1223 494745 Phone: +44 (0) 1382 385809 Fax: +44 (0) 1223 494809 E-mail: [email protected] E-mail: [email protected] Running title: mutation profiling in C. elegans Keywords: mutation pattern, genetic and environmental factors, C. elegans, cisplatin, aflatoxin B1, whole-genome sequencing. Downloaded from genome.cshlp.org on September 28, 2021 - Published by Cold Spring Harbor Laboratory Press ABSTRACT Mutation is associated with developmental and hereditary disorders, ageing and cancer. While we understand some mutational processes operative in human disease, most remain mysterious.
    [Show full text]
  • Abstracts In
    ECCB 2014 Accepted Posters with Abstracts G: Bioinformatics of health and disease G01: Emile Rugamika Chimusa, Jacquiline Wangui Mugo and Nicola Mulder. Leveraging ancestry along the genome of admixed individuals to resolve missing heritability in disease scoring statistics Abstract: Human genetics has been haunted by the mystery of “missing heritability” of common traits. Although studies have discovered several variants associated with common diseases and traits, these variants typically appear to explain only a minority of the heritability. Resolving missing heritability, the difference between phenotypic variance explained by associated SNPs and estimates of narrow-sense heritability (h2), will inform strategies for disease mapping and prediction of complex traits. Among biased estimates of h2 due to epistatic interactions and rare variants not captured by genotyping arrays have been cited to be the most can be the most explanations for missing heritability. Here, we present an approach for estimating heritability of traits based on sharing local ancestry segments between pairs of unrelated individuals in an admixed population. From simulation data and real data, we demonstrated that our approach outperformed current approaches for estimating heritability of traits and holds values in admixture mapping for deconvoluting genes underlying ethnic differences in complex diseases risk. G02: Sylvain Mareschal, Pierre-Julien Viailly, Philippe Bertrand, Fabienne Desmots-Loyer, Elodie Bohers, Catherine Maingonnat, Karen Leroy, Thierry Fest and Fabrice Jardin. Next- Generation Sequencing applied to tailor targeted therapies in lymphoma: the RELYSE project Abstract: Non-Hodgkin Lymphomas (NHL) are lymphoid cell malignancies accounting for about 4% of all cancers, with an incidence rate of 12 cases per 100,000 and per year in Europe.
    [Show full text]
  • Learning Protein Constitutive Motifs from Sequence Data Je´ Roˆ Me Tubiana, Simona Cocco, Re´ Mi Monasson*
    TOOLS AND RESOURCES Learning protein constitutive motifs from sequence data Je´ roˆ me Tubiana, Simona Cocco, Re´ mi Monasson* Laboratory of Physics of the Ecole Normale Supe´rieure, CNRS UMR 8023 & PSL Research, Paris, France Abstract Statistical analysis of evolutionary-related protein sequences provides information about their structure, function, and history. We show that Restricted Boltzmann Machines (RBM), designed to learn complex high-dimensional data and their statistical features, can efficiently model protein families from sequence information. We here apply RBM to 20 protein families, and present detailed results for two short protein domains (Kunitz and WW), one long chaperone protein (Hsp70), and synthetic lattice proteins for benchmarking. The features inferred by the RBM are biologically interpretable: they are related to structure (residue-residue tertiary contacts, extended secondary motifs (a-helixes and b-sheets) and intrinsically disordered regions), to function (activity and ligand specificity), or to phylogenetic identity. In addition, we use RBM to design new protein sequences with putative properties by composing and ’turning up’ or ’turning down’ the different modes at will. Our work therefore shows that RBM are versatile and practical tools that can be used to unveil and exploit the genotype–phenotype relationship for protein families. DOI: https://doi.org/10.7554/eLife.39397.001 Introduction In recent years, the sequencing of many organisms’ genomes has led to the collection of a huge number of protein sequences, which are catalogued in databases such as UniProt or PFAM Finn et al., 2014). Sequences that share a common ancestral origin, defining a family (Figure 1A), *For correspondence: are likely to code for proteins with similar functions and structures, providing a unique window into [email protected] the relationship between genotype (sequence content) and phenotype (biological features).
    [Show full text]