The Arabidopsis Thaliana Cdna Sequencing Projects 1This Paper

Total Page:16

File Type:pdf, Size:1020Kb

The Arabidopsis Thaliana Cdna Sequencing Projects 1This Paper FEBS 18220 FEBS Letters 403 (1997) 221-224 Minireview The Arabidopsis thaliana cDNA sequencing projects Michel Delseny*, Richard Cooke, Monique Raynal, Francoise Grellet University of Perpignan, Laboratoire de Physiologie et Biologie Moleculaire des Plantes, UMR 5545 CNRS, Avenue de Villeneuve 66860, Perpignan cédex, France Received 14 January 1997 scribed for human genes [3]. The second strategy consisted in Abstract Nearly 30 000 Arabidopsis thaliana EST (Expressed Sequence Tags) have been produced by a French-American sequencing the complete genome. Although the two strategies consortium. Despite redundancy, these sequences tag about half are complementary and part of a world-wide integrated proj- of the expected Arabidopsis genes. Approximately 40% of the ect [2], this minireview will be restricted to the EST projects non-redundant EST can be assigned a putative function by simple and their impact on plant science. homology search. This programme allowed the identification of a large number of genes which would have been very difficult to 2. EST strategy and production isolate by other classical techniques. It considerably stimulated many areas of plant biology by the rapid discovery a large Most Arabidopsis EST sequencing has been carried out by number of genes, by revealing multigene families and by allowing two consortia: one in France, started by the end of 1991 with the analysis of differential expression of the different members. Finally this programme facilitated construction of physical maps national support and continued after 1993 with European of the chromosomes and opened the way for complete sequencing Union funding as part of the ESSA (European Scientists Se- of the Arabidopsis genome and comparative mapping of the quencing Arabidopsis) programme, and one in the US initiated major plant crops. a few months later at Michigan State University. The initial © 1997 Federation of European Biochemical Societies French project was to obtain the largest number of EST, to identify as many of them as possible by homology search, and Key words: Expressed Sequence Tag; cDNA; Gene function; to map them on the genetic and physical maps. In contrast to Genome sequencing; Crucifer their American colleagues, who prepared a single library from a mixture of mRNA from different tissues, the French scien- tists worked with specialized libraries corresponding to specif- 1. Introduction ic tissues, organs or developmental stages. This strategy ini- tially helped to avoid overlaps and redundancy between Arabidopsis thaliana, a small crucifer weed, was chosen by a groups as well as maintaining a strong biological interest for number of plant biologists a few years ago as a model to the participants and it allows one to compare the different analyse plant genome structure and gene expression as well libraries. A computer network was organised so as to auto- as plant development. The major advantages of this plant are matically identify redundancies and submit sequence informa- its small genome, short life cycle, genetics and relative easiness tion to the EMBL database. All sequences were first edited of transformation [1]. With a genome size of 120 Mbp, Ara- and analyzed by homology search in each individual labora- bidopsis compares well with animal models such as Drosophila tory. After validation of the sequence and annotation, non- or Caenorhabditis and is one of the smallest plant genomes. redundant EST were then transferred to the EMBL database More than 1000 mutants have been characterized and half of from where they were automatically integrated into dbEST them have been mapped to one of its five chromosomes [2], [4]. Another difference between the two projects was that making it a suitable model to dissect metabolism, develop- American EST were sequenced only from the 5' end and re- ment and transduction pathways using molecular genetics. dundancy was not eliminated at the submission stage whereas Due to specific difficulties, as well as a much smaller scientific in the French project non-redundant clones were sequenced community, gene isolation and characterization in higher from both ends. plants has been lagging behind animals. Thus, at the end of Altogether, the two programmes have produced close to 1991, no more than 500 different plant protein sequences were 30000 EST, placing Arabidopsis genome in fourth position represented in databases. In order to increase rapidly our after human and mouse EST, immediately behind Caenorhab- knowledge of this genome and, hopefully, that of other crop ditis elegans and ahead of rice [5]. The French groups pro- plant genomes, two major strategies have been developed. The duced approximately 12000 EST, but only 6500, which were first consisted in partially sequencing a large number of not redundant with previously submitted EST, were deposited cDNA in order to make a catalogue of Arabidopsis expressed in dbEST. Most of the corresponding clones are available genes (EST, or Expressed Sequence Tags) as initially de- from the Ohio State University Arabidopsis Biological Re- source Centre (ABRC) in Columbus, Ohio. It is difficult to determine precisely how many different genes these EST represent because EST sequences are not *Corresponding author. Fax: (33) 4-68-66-84-99. 100% reliable because the collection is a mixture of 5' and E-mail: delseny @univ-perp.fr 3' ends and because cDNA are not always full length. An This paper was presented at the 24th FEBS Meeting in Barcelona. additional problem is that many cDNA correspond to mem- 0014-5793/97/S17.00 © 1997 Federation of European Biochemical Societies. All rights reserved. P//S0014-5793(97)00075-6 222 M. Delseny et al.lFEBS Letters 403 (1997) 221-224 bers of multigene families which are not easily distinguishable. the sequenced region might correspond to a variable region in A first estimation consists in assembling overlapping EST into a conserved gene. contigs: the 30000 EST can be organized into about 15 000 A major surprise was to discover that many genes belonged contigs [6] but this is certainly an overestimate of gene number to multigene families: for example, we found that there were and a more realistic speculation is that 10000 distinct genes at least five distinct thioredoxin h genes which are differen- have been tagged by at least one EST. This figure has to be tially expressed whereas there are three genes in yeast and a compared with the gene number which is now estimated to be single one in human [14]. Similarly we reconstructed 106 dif- around 20000 and with the 50 or 60 genes which have been ferent cDNA coding for 50 distinct types of ribosomal pro- isolated using T-DNA or transposon tagging or map-based teins. The majority are coded by two, three and even four cloning [7]. Thus, just by random sequencing of cDNA, genes which differ essentially in their non-coding regions [15]. roughly half of the genes have been tagged; they probably correspond to the most abundantly expressed genes. Sequenc- 4. Impact of the Arabidopsis EST programmes ing work is almost completed in both consortia because re- dundancy is now becoming too high for the programmes to be One of the biggest impacts of the Arabidopsis EST pro- cost effective. Tagging genes with rare transcripts cannot be grammes is on plant biology studies. A crude measure of achieved using the present strategies and will require further this impact is the thousands of clone requests which were developments. received by the ABRC. Availability of the EST sequences EST are usually ca. 300 bp long and represent only part of accelerated the isolation of many new genes for which a par- a gene. However, about 150 full-length cDNA have now been tial protein sequence was available. It also stimulated and completely sequenced, starting from an EST. In addition, if a facilitated the obtention of recombinant proteins and the gene is abundantly expressed, overlapping EST can be as- analysis of differential expression of members of multigene sembled in order to reconstruct a full-length cDNA in silico. families by allowing one to prepare gene-specific probes [14,16-19]. Having available a large number of EST allows 3. What genes have been identified by EST sequencing? analysis of overall gene expression. cDNA clones can be gridded on high-density filters and hybridized with various A preliminary identification of a gene relies on the presence complex probes corresponding to different development of an already identified homologous sequence in databases. stages, different stimuli or different genetic backgrounds [20]. Each sequence was translated in all six possible reading Alternatively oligonucleotides derived from the EST can be phases. Using the BLAST and FASTA algorithms [8,9], spotted on DNA chips [21]. When several cDNA libraries they were aligned with the non-redundant protein sequence have been used, the frequency of occurrence of the various database. In this way, more than 40% of the non-redundant clones can also be analysed and compared. Another type of EST were found to have significant homology, at least at the physiological study consists in attempting to classify cDNA amino acid sequence level, with an already known gene. and genes according to their control by a given regulatory Lists of putative functions for EST can be found in the gene. This strategy was developed for a number of genes ex- papers reporting major progress in the programmes [10-12]. pressed during seed maturation to determine whether their Such an analysis simply gives a hint of what the function expression was dependent upon regulatory gene ABI 3 [22]. might be and, in many cases, extensive biochemical and bio- EST programmes have an important impact on mapping logical work is necessary to unambiguously identify a gene and sequencing the Arabidopsis genome. From the very begin- and its function. This task is clearly impossible for the se- ning, mapping the EST on the chromosomes was a major goal quencing labs, therefore, relevant clones were distributed to of the French consortium because it was anticipated that it colleagues who have an interest in a specific gene.
Recommended publications
  • Analysis of the Impact of Sequencing Errors on Blast Using Fault Injection
    View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Illinois Digital Environment for Access to Learning and Scholarship Repository ANALYSIS OF THE IMPACT OF SEQUENCING ERRORS ON BLAST USING FAULT INJECTION BY SO YOUN LEE THESIS Submitted in partial fulfillment of the requirements for the degree of Master of Science in Electrical and Computer Engineering in the Graduate College of the University of Illinois at Urbana-Champaign, 2013 Urbana, Illinois Adviser: Professor Ravishankar K. Iyer ABSTRACT This thesis investigates the impact of sequencing errors in post-sequence computational analyses, including local alignment search and multiple sequence alignment. While the error rates of sequencing technology are commonly reported, the significance of these numbers cannot be fully grasped without putting them in the perspective of their impact on the downstream analyses that are used for biological research, forensics, diagnosis of diseases, etc. I approached the quantification of the impact using fault injection. Faults were injected in the input sequence data, and the analyses were run. Change in the output of the analyses was interpreted as the impact of faults, or errors. Three commonly used algorithms were used: BLAST, SSEARCH, and ProbCons. The main contributions of this work are the application of fault injection to the reliability analysis in bioinformatics and the quantitative demonstration that a small error rate in the sequence data can alter the output of the analysis in a significant way. BLAST and SSEARCH are both local alignment search tools, but BLAST is a heuristic implementation, while SSEARCH is based on the optimal Smith-Waterman algorithm.
    [Show full text]
  • Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J
    Perspective Cite This: J. Am. Chem. Soc. 2019, 141, 14463−14479 pubs.acs.org/JACS Advancing Solutions to the Carbohydrate Sequencing Challenge † † † ‡ § Christopher J. Gray, Lukasz G. Migas, Perdita E. Barran, Kevin Pagel, Peter H. Seeberger, ∥ ⊥ # ⊗ ∇ Claire E. Eyers, Geert-Jan Boons, Nicola L. B. Pohl, Isabelle Compagnon, , × † Göran Widmalm, and Sabine L. Flitsch*, † School of Chemistry & Manchester Institute of Biotechnology, The University of Manchester, 131 Princess Street, Manchester M1 7DN, U.K. ‡ Institute for Chemistry and Biochemistry, Freie Universitaẗ Berlin, Takustraße 3, 14195 Berlin, Germany § Biomolecular Systems Department, Max Planck Institute for Colloids and Interfaces, Am Muehlenberg 1, 14476 Potsdam, Germany ∥ Department of Biochemistry, Institute of Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, U.K. ⊥ Complex Carbohydrate Research Center, University of Georgia, Athens, Georgia 30602, United States # Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States ⊗ Institut Lumierè Matiere,̀ UMR5306 UniversitéLyon 1-CNRS, Universitéde Lyon, 69622 Villeurbanne Cedex, France ∇ Institut Universitaire de France IUF, 103 Blvd St Michel, 75005 Paris, France × Department of Organic Chemistry, Arrhenius Laboratory, Stockholm University, S-106 91 Stockholm, Sweden *S Supporting Information established connection between their structure and their ABSTRACT: Carbohydrates possess a variety of distinct function, full characterization of unknown carbohydrates features with
    [Show full text]
  • Le Monde Végétal S'ouvre Aux Biotechnologies
    LLLeee mmmooonnndddeee vvvééégggééétttaaalll sss’’’ooouuuvvvrrreee aaauuuxxx bbbiiioottteeeccchhhnnnooolllooogggiiieeesss NNNeeewww tttrrreeennndddsss iiinnn ppplllaaannnttt bbbiiiooolllooogggyyy aaannnddd bbbiiiooottteeeccchhhnnnooolllooogggyyy Colloque international de l’Académie des sciences Institut de France, 23 quai de Conti – 75006 Paris 15-16 septembre 2008 Grande salle des Résumés / Abstracts Comité scientifique Jean-François Bach Michel Caboche Henri Décamps Roland Douce Michel Delseny Christian Dumas Jules Hoffmann Jean-Dominique Lebreton Nicole Le Douarin Georges Pelletier Alain Prochiantz de l’Académie des sciences Marc Van Montagu Université de Gand Organisation Service des Colloques, Académie des sciences, Fabienne Bonfils Site Internet : www.academie-sciences.fr Société française de biologie végétale LLLeee mmmooonnndddeee vvvééégggééétttaaalll sss’’’ooouuuvvvrrreee aaauuuxxx bbbiiiooottteeccchhhnnnooolllooogggiiieeesss NNNeeewww tttrrreeennndddsss iiinnn ppplllaaannnttt bbbiiiooolllooogggyyy aaannnddd bbbiiiooottteeeccchhhnnnooolllooogggyyy PPRROOGGRRAAMMMMEE ___________________________________________________________________ _________________________________________________________________________________________ Colllloque iinternatiionall de ll’’Académiie des sciiences, Pariis 15-16 septembre 2008 Le monde végétal s’ouvre aux biotechnologies / New trends in plant biology and biotechnology Lundi 15 septembre / Monday, September 15th Session 1 Les plantes dévoilent leurs secrets / Plants Release their Secrets 8h45- 9h00
    [Show full text]
  • P.-O. 2 Catalan
    P.-O. VALLESPIR Primaire de la droite et 12 chiens maltraités du centre : comment voter et 10 cadavres découverts PAGE 3 PAGE 2 CATALAN Jeudi 17 novembre 2016 • N°321 • Espagne 1,50€ • France 1,10€ lindependant.fr FIONA CANDIDAT À LA PRÉSIDENTIELLE Le témoignage tourne court Macron dans le grand bain PAGE ACTU ET ÉDITO Train jaune à la casse Le témoin-surprise appelé hier devant les assises du Puy-de- Dôme, où sont jugés la mère de la petite Fiona et son ancien compagnon, s’est révélé être une médium qui a dû être évacuée après avoir fait un malaise. PAGE SOCIÉTÉ EUROPE Trump divise les « 28 » Les dirigeants européens ne sont pas unanimes sur la conduite à tenir face au nouveau président des Etats-Unis dont l’isolationnisme inquiète l’Union. PAGE INTERNATIONAL CATALOGNE Dali avait-il une fille cachée ? Pilar Abel, âgée de 60 ans, affirme être la fille cachée du peintre surréaliste. Elle a saisi la justice espagnole pour obtenir la reconnaissance de sa paternité. PAGE EURORÉGION ◗ Les défenseurs du Canari jaune sont vent debout après avoir découvert que deux anciens wagons étraves, du matériel roulant déclassé, avaient été vendus à un ferrailleur. Ils souhaitent la création d’un musée du Train jaune pour préserver ce patrimoine. PAGE 6 Photo Fred Berlic DANS LES VILLAGES Argelès-sur-Mer Des experts au chevet du lycée Bourquin PAGE 14 ä Prades £]£ä Un médecin urgentiste £££Ç se penche sur le malaise hospitalier äÓ{{ PAGE 19 9 & ,&F 8 372 2, B O U L E V A R D D E S P Y R É N É E S, C S 4 0 0 6 6, 6 6 0 0 7 P E R P I G N A N C E D E X - T É L.
    [Show full text]
  • "Phylogenetic Analysis of Protein Sequence Data Using The
    Phylogenetic Analysis of Protein Sequence UNIT 19.11 Data Using the Randomized Axelerated Maximum Likelihood (RAXML) Program Antonis Rokas1 1Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee ABSTRACT Phylogenetic analysis is the study of evolutionary relationships among molecules, phenotypes, and organisms. In the context of protein sequence data, phylogenetic analysis is one of the cornerstones of comparative sequence analysis and has many applications in the study of protein evolution and function. This unit provides a brief review of the principles of phylogenetic analysis and describes several different standard phylogenetic analyses of protein sequence data using the RAXML (Randomized Axelerated Maximum Likelihood) Program. Curr. Protoc. Mol. Biol. 96:19.11.1-19.11.14. C 2011 by John Wiley & Sons, Inc. Keywords: molecular evolution r bootstrap r multiple sequence alignment r amino acid substitution matrix r evolutionary relationship r systematics INTRODUCTION the baboon-colobus monkey lineage almost Phylogenetic analysis is a standard and es- 25 million years ago, whereas baboons and sential tool in any molecular biologist’s bioin- colobus monkeys diverged less than 15 mil- formatics toolkit that, in the context of pro- lion years ago (Sterner et al., 2006). Clearly, tein sequence analysis, enables us to study degree of sequence similarity does not equate the evolutionary history and change of pro- with degree of evolutionary relationship. teins and their function. Such analysis is es- A typical phylogenetic analysis of protein sential to understanding major evolutionary sequence data involves five distinct steps: (a) questions, such as the origins and history of data collection, (b) inference of homology, (c) macromolecules, developmental mechanisms, sequence alignment, (d) alignment trimming, phenotypes, and life itself.
    [Show full text]
  • Mise En Page 1
    Le public scientifique Trajectoires de la génétique 150 ans après Mendel 11, 12 et 13 septembre 2016 Grande salle des séances de l’Institut de France Les travaux pionniers de Gregor Mendel, publiés il y a 150 ans, ouvraient une voie intellectuelle originale dans l’étude des processus de l’hérédité qui, quelques dizaines d’années plus tard, devait donner naissance à une nouvelle discipline scientifique, la génétique, dont les découvertes fondamentales n’allaient cesser de bou- leverser notre compréhension du monde vivant et de nous-même jusqu’à aujourd’hui. Si l’élucidation de la structure de l’ADN, au milieu du siècle dernier, ouvrait la voie de la connaissance moléculaire des gènes et offrait les outils de leur étude, bien d’autres travaux devaient révéler ensuite la complexité intrinsèque de la nature, de l’origine et du fonctionnement des génomes. Ceux-ci nous apportent un éclairage nouveau sur l’évolution des organismes vivants et leurs fonctionne- ments normaux ou pathologiques. Ce colloque, en choisissant quelques thèmes parmi le foisonnement des recherches actuelles illustrées par leurs acteurs, tentera d’anticiper les « trajectoires de la génétique » dans les années futures, avec les questionnements qu’elles suscitent et les espoirs qu’elles portent. Les organisateurs du colloque Bernard Dujon Membre de l’Académie des sciences, Université Pierre et Marie Curie, Institut Pasteur, Paris Bernard Dujon est membre de l’Académie des sciences, professeur émérite à l’université Pierre et Marie Curie et à l’Institut Pasteur. Ses principaux travaux de recherche portent sur les génomes et les mécanismes moléculaires de leur dyna- mique et de leur évolution.
    [Show full text]
  • EMBL-EBI Powerpoint Presentation
    Processing data from high-throughput sequencing experiments Simon Anders Use-cases for HTS · de-novo sequencing and assembly of small genomes · transcriptome analysis (RNA-Seq, sRNA-Seq, ...) • identifying transcripted regions • expression profiling · Resequencing to find genetic polymorphisms: • SNPs, micro-indels • CNVs · ChIP-Seq, nucleosome positions, etc. · DNA methylation studies (after bisulfite treatment) · environmental sampling (metagenomics) · reading bar codes Use cases for HTS: Bioinformatics challenges Established procedures may not be suitable. New algorithms are required for · assembly · alignment · statistical tests (counting statistics) · visualization · segmentation · ... Where does Bioconductor come in? Several steps: · Processing of the images and determining of the read sequencest • typically done by core facility with software from the manufacturer of the sequencing machine · Aligning the reads to a reference genome (or assembling the reads into a new genome) • Done with community-developed stand-alone tools. · Downstream statistical analyis. • Write your own scripts with the help of Bioconductor infrastructure. Solexa standard workflow SolexaPipeline · "Firecrest": Identifying clusters ⇨ typically 15..20 mio good clusters per lane · "Bustard": Base calling ⇨ sequence for each cluster, with Phred-like scores · "Eland": Aligning to reference Firecrest output Large tab-separated text files with one row per identified cluster, specifying · lane index and tile index · x and y coordinates of cluster on tile · for each
    [Show full text]
  • A SARS-Cov-2 Sequence Submission Tool for the European Nucleotide
    Databases and ontologies Downloaded from https://academic.oup.com/bioinformatics/advance-article/doi/10.1093/bioinformatics/btab421/6294398 by guest on 25 June 2021 A SARS-CoV-2 sequence submission tool for the European Nucleotide Archive Miguel Roncoroni 1,2,∗, Bert Droesbeke 1,2, Ignacio Eguinoa 1,2, Kim De Ruyck 1,2, Flora D’Anna 1,2, Dilmurat Yusuf 3, Björn Grüning 3, Rolf Backofen 3 and Frederik Coppens 1,2 1Department of Plant Biotechnology and Bioinformatics, Ghent University, 9052 Ghent, Belgium, 1VIB Center for Plant Systems Biology, 9052 Ghent, Belgium and 2University of Freiburg, Department of Computer Science, Freiburg im Breisgau, Baden-Württemberg, Germany ∗To whom correspondence should be addressed. Associate Editor: XXXXXXX Received on XXXXX; revised on XXXXX; accepted on XXXXX Abstract Summary: Many aspects of the global response to the COVID-19 pandemic are enabled by the fast and open publication of SARS-CoV-2 genetic sequence data. The European Nucleotide Archive (ENA) is the European recommended open repository for genetic sequences. In this work, we present a tool for submitting raw sequencing reads of SARS-CoV-2 to ENA. The tool features a single-step submission process, a graphical user interface, tabular-formatted metadata and the possibility to remove human reads prior to submission. A Galaxy wrap of the tool allows users with little or no bioinformatic knowledge to do bulk sequencing read submissions. The tool is also packed in a Docker container to ease deployment. Availability: CLI ENA upload tool is available at github.com/usegalaxy- eu/ena-upload-cli (DOI 10.5281/zenodo.4537621); Galaxy ENA upload tool at toolshed.g2.bx.psu.edu/view/iuc/ena_upload/382518f24d6d and https://github.com/galaxyproject/tools- iuc/tree/master/tools/ena_upload (development) and; ENA upload Galaxy container at github.com/ELIXIR- Belgium/ena-upload-container (DOI 10.5281/zenodo.4730785) Contact: [email protected] 1 Introduction Nucleotide Archive (ENA).
    [Show full text]
  • Genomic Sequencing of SARS-Cov-2: a Guide to Implementation for Maximum Impact on Public Health
    Genomic sequencing of SARS-CoV-2 A guide to implementation for maximum impact on public health 8 January 2021 Genomic sequencing of SARS-CoV-2 A guide to implementation for maximum impact on public health 8 January 2021 Genomic sequencing of SARS-CoV-2: a guide to implementation for maximum impact on public health ISBN 978-92-4-001844-0 (electronic version) ISBN 978-92-4-001845-7 (print version) © World Health Organization 2021 Some rights reserved. This work is available under the Creative Commons Attribution-NonCommercial-ShareAlike 3.0 IGO licence (CC BY-NC-SA 3.0 IGO; https://creativecommons.org/licenses/by-nc-sa/3.0/igo). Under the terms of this licence, you may copy, redistribute and adapt the work for non-commercial purposes, provided the work is appropriately cited, as indicated below. In any use of this work, there should be no suggestion that WHO endorses any specific organization, products or services. The use of the WHO logo is not permitted. If you adapt the work, then you must license your work under the same or equivalent Creative Commons licence. If you create a translation of this work, you should add the following disclaimer along with the suggested citation: “This translation was not created by the World Health Organization (WHO). WHO is not responsible for the content or accuracy of this translation. The original English edition shall be the binding and authentic edition”. Any mediation relating to disputes arising under the licence shall be conducted in accordance with the mediation rules of the World Intellectual Property Organization (http://www.wipo.int/amc/en/mediation/rules/).
    [Show full text]
  • The Biogrid Interaction Database
    D470–D478 Nucleic Acids Research, 2015, Vol. 43, Database issue Published online 26 November 2014 doi: 10.1093/nar/gku1204 The BioGRID interaction database: 2015 update Andrew Chatr-aryamontri1, Bobby-Joe Breitkreutz2, Rose Oughtred3, Lorrie Boucher2, Sven Heinicke3, Daici Chen1, Chris Stark2, Ashton Breitkreutz2, Nadine Kolas2, Lara O’Donnell2, Teresa Reguly2, Julie Nixon4, Lindsay Ramage4, Andrew Winter4, Adnane Sellam5, Christie Chang3, Jodi Hirschman3, Chandra Theesfeld3, Jennifer Rust3, Michael S. Livstone3, Kara Dolinski3 and Mike Tyers1,2,4,* 1Institute for Research in Immunology and Cancer, Universite´ de Montreal,´ Montreal,´ Quebec H3C 3J7, Canada, 2The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada, 3Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA, 4School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK and 5Centre Hospitalier de l’UniversiteLaval´ (CHUL), Quebec,´ Quebec´ G1V 4G2, Canada Received September 26, 2014; Revised November 4, 2014; Accepted November 5, 2014 ABSTRACT semi-automated text-mining approaches, and to en- hance curation quality control. The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access database that houses genetic and protein in- INTRODUCTION teractions curated from the primary biomedical lit- Massive increases in high-throughput DNA sequencing erature for all major model organism species and technologies (1) have enabled an unprecedented level of humans. As of September 2014, the BioGRID con- genome annotation for many hundreds of species (2–6), tains 749 912 interactions as drawn from 43 149 pub- which has led to tremendous progress in the understand- lications that represent 30 model organisms.
    [Show full text]
  • The Interpro Database, an Integrated Documentation Resource for Protein
    The InterPro database, an integrated documentation resource for protein families, domains and functional sites R Apweiler, T K Attwood, A Bairoch, A Bateman, E Birney, M Biswas, P Bucher, L Cerutti, F Corpet, M D Croning, et al. To cite this version: R Apweiler, T K Attwood, A Bairoch, A Bateman, E Birney, et al.. The InterPro database, an integrated documentation resource for protein families, domains and functional sites. Nucleic Acids Research, Oxford University Press, 2001, 29 (1), pp.37-40. 10.1093/nar/29.1.37. hal-01213150 HAL Id: hal-01213150 https://hal.archives-ouvertes.fr/hal-01213150 Submitted on 7 Oct 2015 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. © 2001 Oxford University Press Nucleic Acids Research, 2001, Vol. 29, No. 1 37–40 The InterPro database, an integrated documentation resource for protein families, domains and functional sites R. Apweiler1,*, T. K. Attwood2,A.Bairoch3, A. Bateman4,E.Birney1, M. Biswas1, P. Bucher5, L. Cerutti4,F.Corpet6, M. D. R. Croning1,2, R. Durbin4,L.Falquet5,W.Fleischmann1, J. Gouzy6,H.Hermjakob1,N.Hulo3, I. Jonassen7,D.Kahn6,A.Kanapin1, Y. Karavidopoulou1, R.
    [Show full text]
  • Enabling Interpretation of Protein Variation Effects with Uniprot
    Andrew Nightingale1, Jie luo, Michele Magrane1, Peter McGarvey2, Sandra Orchard1, Maria Martin1, UniProt Consortium1,2,3 1EMBL-European Bioinformatics Institute, Cambridge, UK 2SIB Swiss Institute of Bioinformatics, Geneva, Switzerland 3Protein Information Resource, Georgetown University, Washington DC & University of Delaware, USA Enabling interpretation of protein variation effects with UniProt Introduction Understanding the effect of genetic variants on protein function is crucial to thoroughly understand the role of proteins in disease biology. UniProt aims to support the scientific community, computational biologists and clinical researchers, by providing a comprehensive, high-quality and freely accessible resource of protein sequence and functional information. This includes a comprehensive catalogue of protein altering variation data coupled with information about how these variants affect protein function. UniProt variant data sources Variant data from literature Large-scale variant data Variants are captured from the scientific 1. Imported variant literature and manually reviewed for data is dependent addition to UniProtKB/Swiss-Prot upon exact mapping between the reference proteome Description of disease associated with and genome. genetic variations in a protein TCGA 2. Variant data is imported from a variety of resources Variant data including effects of the variant Database Total Imported Variant Total Unique to complement the on the protein and links to variant 1000Genomes 859,757 81,216 resources ClinVar 183,655 76,218 set of variants COSMIC 184,237 18,863 Category Number ESP 939,238 68,803 captured from the ExAC 4,333,620 2,776,617 Total reviewed variants 79,284 TCGA 1,202,700 920,549 literature Disease-associated variants 30,471 UniProt 80,224 49,971 Total 7,781,431 3,992,437 Number of proteins with variants 12,886 *Represents the number of UniProt variants with a dbSNP identifier Interpretation protein variant effect with UniProt 1.
    [Show full text]