Published online 25 November 2017 Nucleic Acids Research, 2018, Vol. 46, Database issue D21–D29 doi: 10.1093/nar/gkx1154

The European Bioinformatics Institute in 2017: data coordination and integration

Charles E. Cook, Mary T. Bergman*, Guy Cochrane, Rolf Apweiler and Ewan Birney

European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK

Received October 13, 2017; Revised October 30, 2017; Editorial Decision October 31, 2017; Accepted November 20, 2017

*To whom correspondence should be addressed. Tel: +44 1223 494 665; Fax: +44 1223 494 468; Email: [email protected]

© The Author(s) 2017. Published by Oxford University Press on behalf of Nucleic Acids Research. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.

ABSTRACT

The European Bioinformatics Institute (EMBL-EBI) supports life-science research throughout the world by providing open data, open-source software and analytical tools, and technical infrastructure (https://www.ebi.ac.uk). We accommodate an increasingly diverse range of data types and integrate them, so that biologists in all disciplines can explore life in ever-increasing detail. We maintain over 40 data resources, many of which are run collaboratively with partners in 16 countries (https://www.ebi.ac.uk/services). Submissions continue to increase exponentially: our data storage has doubled in less than two years to 120 petabytes. Recent advances in cellular imaging and single-cell sequencing techniques are generating a vast amount of high-dimensional data, bringing to light new cell types and new perspectives on anatomy. Accordingly, one of our main focus areas is integrating high-quality information from bioimaging, biobanking and other types of molecular data. This is reflected in our deep involvement in Open Targets, stewarding of plant phenotyping standards (MIAPPE) and partnership in the Human Cell Atlas data coordination platform, as well as the 2017 launch of the Omics Discovery Index. This update gives a birds-eye view of EMBL-EBI’s approach to data integration and service development as genomics begins to enter the clinic.

INTRODUCTION

EMBL’s European Bioinformatics Institute (EMBL-EBI) is a publicly funded scientific organization that provides open data, open-source software tools and ‘big data’ infrastructure to the global life-science research community, free of charge. The institute’s approach to managing, developing and extending these public data resources is influenced by rapid technological advances in the life sciences, such as single-cell sequencing and bioimaging.

Data standards and ontology development are basic components of our integration activities, extending well beyond molecular biology to encompass imaging, biobanking and plant phenotyping. As data generation intensifies across the biomedical and life sciences, analytical pipelines designed for lower data volumes are suffering from bottlenecks. The adoption of data standards can relieve some of this pressure, allowing information to flow into analysis pipelines across sectors.

The speed of mass data production and deposition demands creative solutions for data storage and computational infrastructure. The results of large-scale metagenomics, proteomics and metabolomics experiments are beginning to enter the public archives at scale, and bioimaging is gaining a firm foothold on the data-growth ladder. Analyzing such data collectively puts a strain on local compute infrastructure, and is greatly facilitated by cloud-based collaboration platforms such as those being built by EMBL-EBI.

This update gives an overview of data growth, diversification, integration and analysis in the life sciences at a time of great transformation in the use of data in healthcare, agriculture and biotechnology.

EMBL-EBI DATA RESOURCES IN 2017

The public data resources at EMBL-EBI have become an essential component of any computational biology research project (www.beagrie.com/static/resource/EBI-impact-report.pdf). Updates on several of these resources are featured in this issue of Nucleic Acids Research, as shown in Figure 1.

Our scientific data resources (https://www.ebi.ac.uk/services) include archives, value-added knowledgebases and integrating services for molecular data analysis. With few exceptions, these resources are freely available without restriction. Fifteen of our resources are run as collaborations with partners in 16 countries, in which data or metadata are shared so that users can search for, and find, resource-specific data relevant to their research through any of the partners in the collaboration (Figure 2).

Figure 1. Selection of core data resources at EMBL-EBI. Updates in this issue: BioModels (15), EBI Metagenomics (39), ENA (6), Ensembl (7), Ensembl Genomes (8), Europe PMC (9), Expression Atlas (21), Mechanism and Catalytic Site Atlas (M-CSA) (45), MEROPS (46), Protein Data Bank in Europe (PDBe) (12), Reactome (47), Rfam (48) and WormBase (49). For a complete list of EMBL-EBI resources see https://www.ebi.ac.uk/services.

EMBL-EBI is an active partner in the ELIXIR infrastructure (https://www.elixir-europe.org/about-us), which aims to coordinate life-science resources throughout Europe so that they form a single infrastructure. In July 2017, ELIXIR announced the selection of the first set of ELIXIR Core Data Resources (https://www.elixir-europe.org/platforms/data/core-data-resources). These are life science data resources that are of fundamental importance to the worldwide life-science community and to the long-term preservation of biological data (1). EMBL-EBI participated actively in the development of the criteria for selecting ELIXIR Core Resources, and 12 EMBL-EBI resources were included in the initial set of ELIXIR resources: ArrayExpress (2), ChEBI (3), ChEMBL (4), the European Genome-phenome Archive (EGA) (5), the European Nucleotide Archive (ENA) (6), Ensembl (7), Ensembl Genomes (8), Europe PMC (9), IntAct (as part of the IMEx Consortium) (10), InterPro (11), the Protein Data Bank in Europe (PDBe) (12), PRIDE (part of the ProteomeXchange consortium) (13) and UniProt (14).

ELIXIR has also compiled a list of archival resources that are recommended for deposition of experimental data. Archival resources are vital to the scientific community for ensuring that experimental data are available for re-use by the world-wide community. Eight EMBL-EBI resources were named as ELIXIR deposition databases: ArrayExpress (2), BioModels (15), EGA (5), ENA (6), IntAct (10), MetaboLights (16), PDBe (12) and PRIDE (13); with four additional databases to be included in the near future: BioSamples (17), BioStudies (18), EVA (19) and Electron Microscopy Data Bank (EMDB) (20).

Archival resources, for example the ENA, BioSamples, PRIDE and ArrayExpress, are the first port of call for data sharing, and store raw experimental data submitted by researchers. They provide a foundation for the knowledgebases such as Ensembl, UniProt and the Expression Atlas (21), which enrich and update annotation and develop tools for access and analysis.

Europe PMC, the data resource for life-science research articles, has developed the SciLite application for displaying text-mined annotations from the community on full text articles (9). These highlights of key biological entities and relationships both aid skim-reading (a particular requirement for database curators) and provide deep anchor points within full-text articles for precise linking between knowledgebase databases and research papers. One of the text-mined entity types is database accession numbers. Over 20 accession number types are mined on a routine basis, providing a further layer of integration, in particular for deposition databases. To improve access to the ‘data behind a paper’, Europe PMC now generates a record in BioStudies (18) for all papers that mention or cite data in community archives in the text, or have supplementary data files or both. This approach allows for connectivity across all the data resources that support a study.

To facilitate the use of publicly available molecular data in R&D pipelines across sectors, we offer programmatic access to almost all of our data resources. Since our last update (19) we have launched several new tools for access, including application program interfaces (APIs) for nucleotide (https://www.ebi.ac.uk/about/news/service-news/new-ena-discovery-api), expression (https://www.ebi.ac.uk/about/news/service-news/restful-rna-seq-analysis-api-v1.2) and protein sequence data resources (22).

We also offer programmatic and web access to about 120 computational tools (https://www.ebi.ac.uk/services/all), and help users bring their own data and store their results for future reference.

Figure 2. Major database collaborations. Many EMBL-EBI data resources are collaborations with other organizations around the world. These collaborations allow us to share the effort of collecting and serving data to our users and to provide collectively more and better analytical tools. This map visually demonstrates the shared global network that supports open data for