European Nucleotide Archive in 2016

Nucleic Acids Research, 2017, Vol. 45, Database issue, D32–D36
Published online 28 November 2016
doi: 10.1093/nar/gkw1106

Ana Luisa Toribio*, Blaise Alako, Clara Amid, Ana Cerdeño-Tárraga, Laura Clarke, Iain Cleland, Susan Fairley, Richard Gibson, Neil Goodgame, Petra ten Hoopen, Suran Jayathilaka, Simon Kay, Rasko Leinonen, Xin Liu, Josué Martínez-Villacorta, Nima Pakseresht, Jeena Rajan, Kethi Reddy, Marc Rosello, Nicole Silvester, Dmitriy Smirnov, Daniel Vaughan, Vadim Zalunin and Guy Cochrane

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received October 22, 2016; Revised October 25, 2016; Editorial Decision October 26, 2016; Accepted October 31, 2016

*To whom correspondence should be addressed. Tel: +44 1223 494680; Fax: +44 1223 494468; Email: [email protected]

© The Author(s) 2016. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work is properly cited. For commercial re-use, please contact [email protected]

ABSTRACT

The European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) offers a rich platform for data sharing, publishing and archiving and a globally comprehensive data set for onward use by the scientific community. With a broad scope spanning raw sequencing reads, genome assemblies and functional annotation, the resource provides extensive data submission, search and download facilities across web and programmatic interfaces. Here, we outline ENA content and major access modalities, highlight major developments in 2016 and outline a number of examples of data reuse from ENA.

INTRODUCTION

For the last third of a century, the European Nucleotide Archive (ENA; http://www.ebi.ac.uk/ena) has served as a cornerstone of the world’s bioinformatics infrastructure. In 2016, the resource continues as a broad and heavily used platform for the sharing, publication, safeguarding and reuse of globally comprehensive public nucleotide sequence data and associated information.

ENA content spans a spectrum of data types from raw reads to asserted annotation and offers a broad range of services to the scientific community. Submissions services cater for our broad range of data providers, from the small-scale submitting research laboratory to the major sequencing centre. Data access services, such as our search and retrieval interfaces, support users, from those browsing casually to those integrating ENA content through programmatic access into their own analyses and software applications. Our helpdesk provides responsive support to several thousand active data submitters and many times this number of data consumers.

As a member of the International Nucleotide Sequence Database Collaboration (INSDC; http://www.insdc.org/) (1), ENA partners with the National Institute of Genetics’ DNA DataBank of Japan (2) and the National Center for Biotechnology Information’s GenBank and Sequence Read Archive (NCBI GenBank and SRA) (3) to provide globally comprehensive coverage through routine data exchange, to build the scientific standards for data exchange and to promote the timely sharing of well-structured sequence data.

In this paper, we outline ENA content, introduce the core services on offer from the resource and provide entry points for users into these services. We then outline a number of developments that have been made in the last year. Finally, we highlight examples of use of ENA data.

CONTENT

We continue to see significant growth in ENA content, with doubling times currently at 32 months for raw sequence data. At the time of writing, ENA comprised 3 × 10^15 base pairs, over 1.5 million taxa, 192 thousand studies, 770 million sequence records and references to almost 200 thousand literature publications.

ENA content is organized into a number of data types (see Figure 1). Core data types, such as raw sequence data and derived data, including sequences, assemblies and functional annotation, are supplemented with accessory data, such as studies and samples, to provide experimental context. Primary data (such as reads) and several derived data types (sequence, assembly and analysis) are submitted, while remaining derived data types (coding, non-coding, marker and environmental) are derived from submitted content as part of processing and indexing within ENA. Further information is available from http://www.ebi.ac.uk/ena/submit/data-formats.

Figure 1. Organization of principal data types in ENA.

Primary data
Read
- ‘second generation’ sequencing raw data
Trace
- capillary sequencing raw data

Derived data
Sequence
- multi-pass, assembled sequence
- can carry functional annotation
Assembly
- genome- or transcriptome-level construct comprising reads and sequences
Analysis
- identification data (e.g. OTUs)
- reference read alignments
- processed reads (e.g. trimmed and filtered reads)
- genomic variations (VCF)
- future data types
Coding
- protein-coding regions of annotated sequences
Non-coding
- non-protein-coding regions of annotated sequences
Marker
- subset of coding and non-coding providing phylogenetic or other markers
Environmental
- subset of samples relating to environmental applications

Accessory data
Sample
- information relating to sampling, including context and processing
Study
- information relating to unit of science (e.g. publication)
- serves to group other classes, such as sequence, read and sample
Taxon
- information relating to sequenced organism

Data are highly integrated to provide discoverability, reusability and cross-data set interoperability. This integration is achieved at three levels: between records of the same class (such as through consistent standards of annotation in Feature Tables associated with sequence records), between records of different classes (such as a sequence record making reference to a sample and a taxon) and between ENA records and records in external resources (such as a coding record carrying a cross-reference to a record in the UniProt KnowledgeBase) (4). Further discoverability and reusability are afforded through collaborations with external expert communities, in which we work on the development and implementation in ENA of reporting standards initiatives such as the Minimal Information about a Sequencing Experiment (5) and the Global Microbial Identifier (http://www.globalmicrobialidentifier.org/).

Given the aggressive technological advance that we continue to observe in sequencing technology, we expect ongoing change in the nature of incoming data (such as new read formats) and continued broadening of the application base served by sequencing (and hence new derived data structures). As such, we consider our role as a ‘pioneer’ database to be important and we strive to be agile and responsive to new requirements from the scientific community. We support raw data from all sequencing platforms and have in place a technical system that allows rapid extension as new data emerge. For new applications, our ‘analysis’ data type has been put in place to allow us rapidly to respond to new derived data types; indeed, reference read alignments and genomic variation data are examples of extended data types that have been supported using this approach.

SUBMISSION

ENA offers a comprehensive range of submission options through the Webin system (http://www.ebi.ac.uk/ena/submit). This system provides both an interactive web application that offers, inter alia, spreadsheet upload support and a powerful RESTful programmatic submission interface. The former is recommended for infrequent submitters and first-time users (including those setting up programmatic submission systems). The latter is recommended for those with informatics skills who wish to establish ongoing regular data flow for a project or submitting centre. (We note that data volume per se is not a useful guide in choosing which tool to use, since both web and programmatic interfaces work in conjunction with file upload over FTP (supported within the web application and through external clients), with the web and RESTful interfaces providing the user with transactional control over the submission.) For depositions of multi-omics data, submitters should first contact our helpdesk with a brief outline of the data set in question and we will work with colleagues within the European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI) BioStudies (https://www.ebi.ac.uk/biostudies/) and BioSamples (http://www.ebi.ac.uk/biosamples) (6) data resources to provide instructions on how to proceed.

ACCESS

We offer a range of web, programmatic and FTP services that support user access to ENA data, covering search, browse and download functions.

Three search services are supported: A simple text search box, available in the header of all pages on the ENA website

SELECTED DEVELOPMENTS IN 2016

Genome assembly data, especially from bacterial isolates, continue to grow relative to other data types. In order to serve these data better, we have supplemented the existing HTML view of assemblies with an XML view, available from the browser and through
