The Jackson Laboratory, The Mouseion at the JAXlibrary
Faculty Research 2018, 1-4-2018

Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation.

Recommended Citation
Pujar, Shashikant; O'Leary, Nuala A; Farrell, Catherine M; Loveland, Jane E; Mudge, Jonathan M; Wallin, Craig; Girón, Carlos G; Diekhans, Mark; Barnes, If; Bennett, Ruth; Berry, Andrew E; Cox, Eric; Davidson, Claire; Goldfarb, Tamara; Gonzalez, Jose M; Hunt, Toby; Jackson, John; Joardar, Vinita; Kay, Mike P; Kodali, Vamsi K; Martin, Fergal J; McAndrews, Monica; McGarvey, Kelly M; Murphy, Michael; Rajput, Bhanu; Rangwala, Sanjida H; Riddick, Lillian D; Seal, Ruth L; Suner, Marie-Marthe; Webb, David; Zhu, Sophia; Aken, Bronwen L; Bruford, Elspeth A; Bult, Carol J; Frankish, Adam; Murphy, Terence; and Pruitt, Kim D, "Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation." (2018). Faculty Research 2018. 49. https://mouseion.jax.org/stfb2018/49

This Article is brought to you for free and open access by the Faculty Research at The Mouseion at the JAXlibrary. It has been accepted for inclusion in Faculty Research 2018 by an authorized administrator of The Mouseion at the JAXlibrary.
Published online 6 November 2017
Nucleic Acids Research, 2018, Vol. 46, Database issue D221–D228
doi: 10.1093/nar/gkx1031

Consensus coding sequence (CCDS) database: a standardized set of human and mouse protein-coding regions supported by expert curation

Shashikant Pujar1, Nuala A. O’Leary1, Catherine M. Farrell1, Jane E. Loveland2, Jonathan M. Mudge2, Craig Wallin1, Carlos G. Girón2, Mark Diekhans3, If Barnes2, Ruth Bennett2, Andrew E. Berry2, Eric Cox1, Claire Davidson2, Tamara Goldfarb1, Jose M. Gonzalez2, Toby Hunt2, John Jackson1, Vinita Joardar1, Mike P. Kay2, Vamsi K. Kodali1, Fergal J. Martin2, Monica McAndrews4, Kelly M. McGarvey1, Michael Murphy1, Bhanu Rajput1, Sanjida H. Rangwala1, Lillian D. Riddick1, Ruth L. Seal5, Marie-Marthe Suner2, David Webb1, Sophia Zhu4, Bronwen L. Aken2, Elspeth A. Bruford5, Carol J. Bult4, Adam Frankish2, Terence Murphy1,* and Kim D. Pruitt1

1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA, 2European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK, 3University of California Santa Cruz Genomics Institute, Santa Cruz, CA 95064, USA, 4Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA and 5HUGO Gene Nomenclature Committee, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received September 20, 2017; Revised October 13, 2017; Editorial Decision October 16, 2017; Accepted October 20, 2017

*To whom correspondence should be addressed. Tel: +1 301 402 0990; Fax: +1 301 594 5166

Published by Oxford University Press on behalf of Nucleic Acids Research 2017. This work is written by (a) US Government employee(s) and is in the public domain in the US.

ABSTRACT

The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.

INTRODUCTION

Accurate and comprehensive whole genome annotation of the human and mouse reference genomes is essential to support many areas of scientific inquiry, including medical research. RefSeq (1) annotation from NCBI (National Center for Biotechnology Information) and Ensembl (2) annotation from EMBL-EBI (The European Molecular Biology Laboratory’s European Bioinformatics Institute) provided for these species are the primary reference resources through which biological data are interpreted and reported by the scientific community. The analytical workflows used by these separate projects are conceptually similar in that they both produce gene annotations based on a combination of computational pipelines and manual curation, largely based on the interpretation of transcriptomics and other experimental data. Ultimately, however, Ensembl and NCBI have developed different rules and guidelines for automated and manual annotation or curation, and this has led to differences between the sets of genes, transcripts and proteins currently found in these datasets. Such inconsistencies can present a challenge to the scientific community in their efforts to interpret biological data; for example, when a disease-associated variant is found to occur in a protein-coding transcript in one dataset, but a non-coding model in another.
Furthermore, the regular emergence of new data types and methodologies with which to identify novel transcripts and to gain insights into their functionality means that these datasets have the potential to include additional divergence with each new release.

The Consensus Coding Sequence (CCDS) collaboration was formed in 2005 to address the issue of discrepancies between Ensembl and NCBI genome annotations by producing a consensus dataset of protein-coding regions with identical coding sequence (CDS) coordinates on the human and mouse reference genomes in both annotations. Consensus protein-coding regions, identified by stable and tracked identifiers (CCDS IDs), and related metadata, are accessible through a public search page (www.ncbi.nlm.nih.gov/CCDS). In addition, data are available for bulk download from an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). A detailed description of methods to access CCDS data, the CCDS workflow, curation processes and quality assurance (QA) tests involved in creating the dataset was published previously (3–5).

The core of the collaboration relies on consensus building across members, including NCBI who provide the RefSeq annotation, the Ensembl Genebuild and Ensembl HAVANA (Human and Vertebrate Analysis and Annotation) groups at EMBL-EBI who produce the GENCODE (6) gene set, University of California Santa Cruz (UCSC), and the two nomenclature authorities, HUGO Gene Nomenclature Committee (HGNC) (7) and Mouse Genome Informatics (MGI) (8) who provide standardized human and mouse gene symbols and names, respectively. To maintain high-quality annotation, ... made to the CCDS resource since our last report, and we ...

[Figure 1: panel A, x-axis "Human CCDS Release (Year)", 2005 to 2016; panel B, x-axis "Mouse CCDS Release (Year)", 2006 to 2016; y-axis "Total number"; series "CCDS IDs" and "Gene IDs".]

Figure 1. Number of CCDS IDs and genes represented in the human (A) and mouse (B) CCDS releases. The X-axis indicates the year in which a CCDS dataset was made public. Details about CCDS releases are available on the CCDS Releases and Statistics web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi?REQUEST=SHOW_STATISTICS).
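The central idea of the introduction, a coding region that has identical CDS coordinates in both the NCBI and the Ensembl annotation, can be illustrated with a small sketch. This is only a toy model under stated assumptions, not the CCDS pipeline itself: the `Cds` type, the `consensus_cds` function, the coordinates and the transcript names (`NM_A`, `ENST_A`, and so on) are all hypothetical, and the quality assurance checks, curator review and CCDS ID tracking described in the text are left out.

```python
from typing import NamedTuple


class Cds(NamedTuple):
    """A coding region: chromosome, strand and CDS exon intervals (hypothetical model)."""
    chrom: str
    strand: str
    exons: tuple  # tuple of (start, end) pairs, sorted by start


def consensus_cds(refseq: dict, ensembl: dict) -> dict:
    """Return the CDSs annotated with identical coordinates in both sets.

    Each result maps a Cds to a pair of lists: the names carrying that CDS
    in the first annotation set and the names carrying it in the second.
    """
    by_coords = {}
    for name, cds in refseq.items():
        by_coords.setdefault(cds, ([], []))[0].append(name)
    for name, cds in ensembl.items():
        by_coords.setdefault(cds, ([], []))[1].append(name)
    # Keep only coordinates that appear in both annotation sets.
    return {cds: ids for cds, ids in by_coords.items() if ids[0] and ids[1]}


# Made-up toy annotations; names and coordinates are not real records.
refseq = {
    "NM_A": Cds("chr1", "+", ((100, 200), (300, 400))),
    "NM_B": Cds("chr1", "+", ((100, 200), (300, 450))),  # differs in last exon end
}
ensembl = {
    "ENST_A": Cds("chr1", "+", ((100, 200), (300, 400))),
    "ENST_C": Cds("chr2", "-", ((50, 90),)),
}

shared = consensus_cds(refseq, ensembl)
for cds, (ncbi_names, ensembl_names) in shared.items():
    print(cds.chrom, cds.strand, cds.exons, ncbi_names, ensembl_names)
```

Keying on the full coordinate tuple means that a single differing exon boundary, as in `NM_B`, is enough to exclude a coding region from the consensus set, which mirrors the "identical coordinates" requirement stated above.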