Published online 25 November 2017
Nucleic Acids Research, 2018, Vol. 46, Database issue D21–D29
doi: 10.1093/nar/gkx1154

The European Bioinformatics Institute in 2017: data coordination and integration

Charles E. Cook, Mary T. Bergman*, Guy Cochrane, Rolf Apweiler and Ewan Birney
European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridge CB10 1SD, UK
Received October 13, 2017; Revised October 30, 2017; Editorial Decision October 31, 2017; Accepted November 20, 2017
ABSTRACT

The European Bioinformatics Institute (EMBL-EBI) supports life-science research throughout the world by providing open data, open-source software and analytical tools, and technical infrastructure (https://www.ebi.ac.uk). We accommodate an increasingly diverse range of data types and integrate them, so that biologists in all disciplines can explore life in ever-increasing detail. We maintain over 40 data resources, many of which are run collaboratively with partners in 16 countries (https://www.ebi.ac.uk/services). Submissions continue to increase exponentially: our data storage has doubled in less than two years to 120 petabytes. Recent advances in cellular imaging and single-cell sequencing techniques are generating a vast amount of high-dimensional data, bringing to light new cell types and new perspectives on anatomy. Accordingly, one of our main focus areas is integrating high-quality information from bioimaging, biobanking and other types of molecular data. This is reflected in our deep involvement in Open Targets, stewarding of plant phenotyping standards (MIAPPE) and partnership in the Human Cell Atlas data coordination platform, as well as the 2017 launch of the Omics Discovery Index. This update gives a bird's-eye view of EMBL-EBI's approach to data integration and service development as genomics begins to enter the clinic.

INTRODUCTION

EMBL's European Bioinformatics Institute (EMBL-EBI) is a publicly funded scientific organization that provides open data, open-source software tools and 'big data' infrastructure to the global life-science research community, free of charge. The institute's approach to managing, developing and extending these public data resources is influenced by rapid technological advances in the life sciences, such as single-cell sequencing and bioimaging.

Data standards and ontology development are basic components of our integration activities, extending well beyond molecular biology to encompass imaging, biobanking and plant phenotyping. As data generation intensifies across the biomedical and life sciences, analytical pipelines designed for lower data volumes are suffering from bottlenecks. The adoption of data standards can relieve some of this pressure, allowing information to flow into analysis pipelines across sectors.

The speed of mass data production and deposition demands creative solutions for data storage and computational infrastructure. The results of large-scale metagenomics, proteomics and metabolomics experiments are beginning to enter the public archives at scale, and bioimaging is gaining a firm foothold on the data-growth ladder. Analyzing such data collectively puts a strain on local compute infrastructure, and is greatly facilitated by cloud-based collaboration platforms such as those being built by EMBL-EBI.

This update gives an overview of data growth, diversification, integration and analysis in the life sciences at a time of great transformation in the use of data in healthcare, agriculture and biotechnology.

EMBL-EBI DATA RESOURCES IN 2017

The public data resources at EMBL-EBI have become an essential component of any computational biology research project (www.beagrie.com/static/resource/EBI-impact-report.pdf). Updates on several of these resources are featured in this issue of Nucleic Acids Research, as shown in Figure 1.

Our scientific data resources (https://www.ebi.ac.uk/services) include archives, value-added knowledgebases and integrating services for molecular data analysis. With few exceptions, these resources are freely available without restriction. Fifteen of our resources are run as collaborations with partners in 16 countries, in which data or metadata are shared so that users can search for, and find, resource-specific data relevant to their research through any of the partners in the collaboration (Figure 2).
*To whom correspondence should be addressed. Tel: +44 1223 494 665; Fax: +44 1223 494 468; Email: [email protected]