Published online 11 November 2013
Nucleic Acids Research, 2014, Vol. 42, Database issue D865–D872
doi:10.1093/nar/gkt1059

Current status and new features of the Consensus Coding Sequence database

Catherine M. Farrell1, Nuala A. O’Leary1, Rachel A. Harte2, Jane E. Loveland3, Laurens G. Wilming3, Craig Wallin1, Mark Diekhans2, Daniel Barrell3, Stephen M. J. Searle3, Bronwen Aken3, Susan M. Hiatt1, Adam Frankish3, Marie-Marthe Suner3, Bhanu Rajput1, Charles A. Steward3, Garth R. Brown1, Ruth Bennett3, Michael Murphy1, Wendy Wu1, Mike P. Kay3, Jennifer Hart1, Jeena Rajan3, Janet Weber1, Catherine Snow3, Lillian D. Riddick1, Toby Hunt3, David Webb1, Mark Thomas3, Pamela Tamez1, Sanjida H. Rangwala1, Kelly M. McGarvey1, Shashikant Pujar1, Andrei Shkeda1, Jonathan M. Mudge3, Jose M. Gonzalez3, James G. R. Gilbert3, Stephen J. Trevanion3, Robert Baertsch2, Jennifer L. Harrow3, Tim Hubbard3, James M. Ostell1, David Haussler2,4 and Kim D. Pruitt1,*

1National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Building 38A, 8600 Rockville Pike, Bethesda, MD 20894, USA
2Center for Biomolecular Science and Engineering, University of California Santa Cruz (UCSC), Santa Cruz, CA 95064, USA
3Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SA, UK
4Howard Hughes Medical Institute, University of California Santa Cruz, Santa Cruz, CA 95064, USA

Received September 12, 2013; Revised September 30, 2013; Accepted October 12, 2013

*To whom correspondence should be addressed. Tel: +1 301 435 5898; Fax: +1 301 480 2918; Email: [email protected]
Published by Oxford University Press 2013. This work is written by US Government employees and is in the public domain in the US.

ABSTRACT

The Consensus Coding Sequence (CCDS) project (http://www.ncbi.nlm.nih.gov/CCDS/) is a collaborative effort to maintain a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assemblies by the National Center for Biotechnology Information (NCBI) and Ensembl genome annotation pipelines. Identical annotations that pass quality assurance tests are tracked with a stable identifier (CCDS ID). Members of the collaboration, who are from NCBI, the Wellcome Trust Sanger Institute and the University of California Santa Cruz, provide coordinated and continuous review of the dataset to ensure high-quality CCDS representations. We describe here the current status and recent growth in the CCDS dataset, as well as recent changes to the CCDS web and FTP sites. These changes include more explicit reporting about the NCBI and Ensembl annotation releases being compared, new search and display options, the addition of biologically descriptive information and our approach to representing genes for which support evidence is incomplete. We also present a summary of recent and future curation targets.

INTRODUCTION

Biological and biomedical research has come to rely on accurate and consistent annotation of genes and their products on genome assemblies. High-quality genome assemblies, such as the human and mouse reference genome assemblies that are maintained by the Genome Reference Consortium (GRC) (1), are particularly amenable to high-definition gene annotation. Reference annotation of these genomes is available from various sources, including the National Center for Biotechnology Information (NCBI) (2), Ensembl (3), the Vertebrate Genome Annotation database (Vega) (4) and GENCODEgenes.org. Each annotation group has independent goals and policies, which result in some annotation variation. Nevertheless, much of the annotation provided by these individual groups is identical, thus providing users with a higher degree of confidence in the accuracy of these annotations.
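As a rough illustration of what "identically annotated" could mean computationally, the following minimal sketch compares two sets of coding-region annotations and reports the pairs whose coordinates agree exactly. It is not the CCDS pipeline: the `CDS` record, the `cds_key` and `find_matches` helpers, the identifiers and the coordinates are all hypothetical, and the sketch assumes that two annotations are identical when they share the same chromosome, strand and ordered list of coding-exon intervals (which implies the same CDS start, stop and splice sites).

```python
from typing import NamedTuple


class CDS(NamedTuple):
    """Hypothetical record for one annotated coding region."""
    chrom: str
    strand: str
    exons: tuple  # coding-exon (start, end) intervals


def cds_key(cds):
    """Build a hashable key; equal keys mean the same chromosome, strand
    and exactly the same coding-exon boundaries."""
    return (cds.chrom, cds.strand, tuple(sorted(cds.exons)))


def find_matches(annot_a, annot_b):
    """Return sorted (id_a, id_b) pairs whose coding regions are identical."""
    # Index the second annotation set by its coordinate key.
    index_b = {}
    for id_b, cds in annot_b.items():
        index_b.setdefault(cds_key(cds), []).append(id_b)
    # Look up every annotation of the first set in that index.
    matches = []
    for id_a, cds in annot_a.items():
        for id_b in index_b.get(cds_key(cds), []):
            matches.append((id_a, id_b))
    return sorted(matches)


# Made-up example data; identifiers and coordinates are illustrative only.
set_a = {
    "a1": CDS("chr1", "+", ((100, 200), (300, 400))),
    "a2": CDS("chr1", "+", ((100, 200), (310, 400))),  # one boundary differs
}
set_b = {
    "b1": CDS("chr1", "+", ((100, 200), (300, 400))),
    "b2": CDS("chr1", "-", ((100, 200), (300, 400))),  # opposite strand
}

matches = find_matches(set_a, set_b)
print(matches)
```

In this toy input only `a1` and `b1` agree at every boundary, so they form the single reported pair; a difference at one exon boundary or on the strand is enough to exclude an annotation.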
The Consensus Coding Sequence (CCDS) project (5) has been established to identify a gold standard set of protein-coding gene annotations that are identically annotated on the human and mouse reference genome assemblies by the participating annotation groups. The participating members are from NCBI, the European Bioinformatics Institute (EBI), the Wellcome Trust Sanger Institute (WTSI) and the University of California Santa Cruz (UCSC). Manual annotation is provided by the Reference Sequence (RefSeq) group at NCBI (2) and by the Human and Vertebrate Analysis and Annotation (HAVANA) group at WTSI (4). A combination of manual and automated genome annotations provided by NCBI and Ensembl (which incorporates manual HAVANA annotations) are compared to identify annotations with matching genomic coordinates; NCBI and Ensembl-coding region annotations must be identical at the CDS start and stop codons and at every splice site to be considered for the CCDS database. Each matching annotation is assigned a unique identifier known as a CCDS ID, which is tracked and reported in the database. Matching CCDS sequences and related metadata are available to users via a public FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/) or a web-based interface (www.ncbi.nlm.nih.gov/CCDS/) that includes individual report pages for each CCDS ID. Information on how to access CCDS data was described previously (5), including details on CCDS report page features and how to access CCDS data tracks on various genome browsers. In addition, the README file (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/README) on the FTP site provides full descriptions of the various download files that are available.

CCDS matching annotations indicate concordance between different annotation groups with diverse policies, and they thus represent high-quality annotations that can be used as standards. The integrity of the CCDS dataset is maintained through stringent quality assurance (QA) testing and on-going manual curation (6). QA testing includes checks for possible conflicts within the coding sequence and its annotated structure, protein-coding potential, possible problems with the genome sequence, and assessing the quality of matched NCBI and Ensembl annotation. Curatorial updates to existing CCDS IDs require unanimous agreement by all collaborators. A process flow has been established to facilitate review of CCDS IDs that have been flagged by any member of the collaboration for update or withdrawal, with the voting members for curation updates being RefSeq, HAVANA and UCSC. The CCDS dataset is an integral part of the GENCODE gene annotation project (7) and it is used as a standard for high-quality coding exon definition in various research fields, including clinical studies (8,9), large-scale epigenomic studies (10), exome projects (11) and exon array design (12). Due to the consensus annotation of CCDS exons by the independent annotation groups, exome projects in particular have regarded CCDS coding exons as reliable targets for downstream studies

GROWTH OF THE CCDS DATASET

The CCDS datasets for human and mouse are periodically reanalyzed and publicly distributed as CCDS releases (Table 1). Historically, CCDS releases occurred following coordinated whole-genome annotation runs by both NCBI and Ensembl. This policy has been changed to allow CCDS updates on a more regular basis following annotation updates by either NCBI or Ensembl. Human updates are expected to occur roughly every 6 months and mouse releases will be yearly. Additional CCDS updates will occur for both species following assembly updates at which time we still require both groups to have generated updated annotation on the new assembly. Reducing the time interval between releases allows for the CCDS dataset to represent more updated annotations. However, because genomic annotation is updated by NCBI and Ensembl at different times, CCDS content may not yet reflect the most recent manual annotation curation efforts. Figure 1A indicates that the number of CCDS IDs for both human and mouse continues to increase with each new CCDS analysis. Since 2011, the human and mouse CCDS dataset sizes have increased by 1279 and 906 CCDS IDs, respectively.

Most of the recent growth in the CCDS dataset comes from an increase in the number of genes with more than one splice variant obtaining a CCDS ID. Since 2011, the number of human genes in the CCDS database increased by 135, whereas the number of human genes having more than one splice variant with a CCDS ID increased by 479. A similar pattern was observed in mouse (Figure 1B), bringing the total number of genes with more than one CCDS ID to 7752, of which over 100 have more than six CCDS IDs. Therefore, the CCDS dataset is increasingly representing more alternative splicing events with each new release. The CCDS dataset is expected to continue to grow in the next few years as a result of targeted curation initiatives (see below), which will allow for an increase in the representation of protein-coding genes and more protein isoforms per gene. Also, as both curation groups begin to integrate RNAseq data available from the Illumina Human Body Map 2.0 project (NCBI GEO accession GSE30611), ENCODE (13) and other sources such as GTEx data (14) into their pipelines, the number of alternatively spliced transcripts will increase.

NEW FEATURES

The CCDS database has incorporated several new features to improve user browsing, to add useful biological information that was either previously internal or not categorized, and to increase gene representation in the dataset.

Updated reporting of CCDS and annotation release information

The CCDS web resource and FTP site now specifies which