Discovering and Linking Public Omics Data Sets Using the Omics Discovery Index
Total Page:16
File Type:pdf, Size:1020Kb
View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by Providence St. Joseph Health Digital Commons Providence St. Joseph Health Providence St. Joseph Health Digital Commons Articles, Abstracts, and Reports 5-9-2017 Discovering and linking public omics data sets using the Omics Discovery Index. Yasset Perez-Riverol Mingze Bai Felipe da Veiga Leprevost Silvano Squizzato Young Mi Park See next page for additional authors Follow this and additional works at: https://digitalcommons.psjhealth.org/publications Part of the Genetics and Genomics Commons Recommended Citation Perez-Riverol, Yasset; Bai, Mingze; da Veiga Leprevost, Felipe; Squizzato, Silvano; Park, Young Mi; Haug, Kenneth; Carroll, Adam J; Spalding, Dylan; Paschall, Justin; Wang, Mingxun; Del-Toro, Noemi; Ternent, Tobias; Zhang, Peng; Buso, Nicola; Bandeira, Nuno; Deutsch, Eric W; Campbell, David S; Beavis, Ronald C; Salek, Reza M; Sarkans, Ugis; Petryszak, Robert; Keays, Maria; Fahy, Eoin; Sud, Manish; Subramaniam, Shankar; Barbera, Ariana; Jiménez, Rafael C; Nesvizhskii, Alexey I; Sansone, Susanna-Assunta; Steinbeck, Christoph; Lopez, Rodrigo; Vizcaíno, Juan A; Ping, Peipei; and Hermjakob, Henning, "Discovering and linking public omics data sets using the Omics Discovery Index." (2017). Articles, Abstracts, and Reports. 1511. https://digitalcommons.psjhealth.org/publications/1511 This Article is brought to you for free and open access by Providence St. Joseph Health Digital Commons. It has been accepted for inclusion in Articles, Abstracts, and Reports by an authorized administrator of Providence St. Joseph Health Digital Commons. For more information, please contact [email protected]. Authors Yasset Perez-Riverol, Mingze Bai, Felipe da Veiga Leprevost, Silvano Squizzato, Young Mi Park, Kenneth Haug, Adam J Carroll, Dylan Spalding, Justin Paschall, Mingxun Wang, Noemi Del-Toro, Tobias Ternent, Peng Zhang, Nicola Buso, Nuno Bandeira, Eric W Deutsch, David S Campbell, Ronald C Beavis, Reza M Salek, Ugis Sarkans, Robert Petryszak, Maria Keays, Eoin Fahy, Manish Sud, Shankar Subramaniam, Ariana Barbera, Rafael C Jiménez, Alexey I Nesvizhskii, Susanna-Assunta Sansone, Christoph Steinbeck, Rodrigo Lopez, Juan A Vizcaíno, Peipei Ping, and Henning Hermjakob This article is available at Providence St. Joseph Health Digital Commons: https://digitalcommons.psjhealth.org/publications/1511 HHS Public Access Author manuscript Author ManuscriptAuthor Manuscript Author Nat Biotechnol Manuscript Author . Author Manuscript Author manuscript; available in PMC 2018 March 01. Published in final edited form as: Nat Biotechnol. 2017 May 09; 35(5): 406–409. doi:10.1038/nbt.3790. Discovering and Linking Public ‘Omics’ Datasets using the Omics Discovery Index Yasset Perez-Riverola,†,*, Mingze Baia,b,c,†, Felipe da Veiga Leprevostd, Silvano Squizzatoa, Young Mi Parka, Kenneth Hauga, Adam J. Carrolle, Dylan Spaldinga, Justin Paschalla, Mingxun Wangf, Noemi del-Toroa, Tobias Ternenta, Peng Zhangd,g, Nicola Busoa, Nuno Bandeiraf, Eric W. Deutschh, David S Campbellh, Ronald C. Beavisi, Reza M. Saleka, Ugis Sarkansa, Robert Petryszaka, Maria Keaysa, Eoin Fahyj, Manish Sudj, Shankar Subramaniamj, Ariana Barberak, Rafael C. Jiménezl, Alexey I. Nesvizhskiid, Susanna- Assunta Sansonem, Christoph Steinbecka, Rodrigo Lopeza, Juan Antonio Vizcaínoa, Peipei Pingn, and Henning Hermjakoba,c,* aEuropean Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK bSchool of Bio-information, Chongqing University of Posts and Telecommunications, 400065 Chongqing, China cBeijing Proteome Research Center, National Center for Protein Sciences Beijing, No. 38, Life Science Park Road, Changping District, 102206 Beijing dDepartment of Pathology, University of Michigan, Ann Arbor, Michigan, 48109, USA eResearch School of Biology, Australian National University, Canberra, 0200, Australia fDepartment of Computer Science and Engineering, University of California, San Diego, 9500, La Jolla, California 92093, USA gCommonwealth Scientific and Industrial Research Organization, Canberra, 0200, Australia hInstitute for Systems Biology, Seattle, Washington, USA iBiochemistry & Medical Genetics, University of Manitoba, Winnipeg, R3T 2N2, Canada jDepartment of Bioengineering, UC San Diego, La Jolla, CA 92093-0412, USA *Corresponding authors: Dr. Yasset Perez-Riverol, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. Phone: + 44 (0) 1223 492513. [email protected]. Henning Hermjakob, European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK. Phone: + 44 (0) 1223 494671. [email protected]. †These authors contributed equally to this work. Author’s contributions HH, YPR and PP developed the OmicsDI concept. YPR and MB designed and developed the web and annotation/enrichment framework. FPV, RCB and YPR developed the GPMDB reader. SS, YMP and NB developed the indexing system based on the EMBL-EBI (European Bioinformatics Institute) Search Server. YPR, EF, MS, ShS, AJC, KH, DS, JP, and MW developed the Metabolomics Workbench, Metabolome Express, MetaboLights, EGA and MassIVE readers and APIs, respectively. PZ helped to make the MetabolomeExpress schema OmicsDI-compatible. US, RP, MK contributed with the integration of ArrayExpress and Expression Atlas. NT, YPR and TT developed the PRIDE reader and contributed to the web development. EWD, DSC, RMS, NB, AIN, CS, RCJ, RL and JAV contributed to the design of the system. YPR and AB developed the ddiR package. YPR and MB, designed and implemented the biological similarity scoring system; and YPR and AB performed the data analysis. YPR, JAV and HH wrote the manuscript, with contributions from all authors. Perez-Riverol et al. Page 2 kDepartment of Medicine, University of Cambridge, Cambridge, UK Author ManuscriptAuthor Manuscript Author Manuscript Author Manuscript Author lELIXIR Hub, Wellcome Genome Campus, Hinxton, Cambridge, UK mOxford e-Research Centre, University of Oxford, 7 Keble Road, OX1 3QG, UK nDepartment of Physiology and Department of Medicine, Division of Cardiology, David Geffen School of Medicine at UCLA, 675 Charles E. Young Drive, MRL Building, Suite 1609, Los Angeles, California 90095, USA To the editor Biomedical data are being produced at an unprecedented rate owing to the falling cost of experiments and wider access to genomics, transcriptomics, proteomics and metabolomics platforms1,2. As a result, public deposition of omics data is on the increase. This presents new challenges, including finding ways to store, organize and access different types of biomedical data present on different platforms. We present the Omics Discovery Index (OmicsDI - http://www.omicsdi.org), an open source platform that enables access, discovery and dissemination of omics datasets. In 2016, a group of researchers, publishers and research funders published the first guidelines to make data Findable, Accessible, Interoperable and Re-usable (FAIR - https:// www.force11.org/group/fairgroup/fairprinciples)3. The FAIR principles put specific emphasis on enhancing the ability of machines to automatically find and use the data, in addition to supporting its reuse. Challenges facing combined analyses of different datasets include achieving a common representation for datasets and their associated metadata, and the lack of protocols and tools that enable data exchange across multiple repositories. With respect to the first principle (‘Findable’), most of the available resources for the scientific community nowadays are either field specific, that is genomics, proteomics or metabolomics experimental datasets; or organism-specific, but including datasets from different omics technologies, e.g. the “Saccharomyces Genome Database” (SGD). Finding a dataset can be frustrated by the need to search individual repositories and read numerous publications. The development of consortia integrating resources e.g. ProteomeXchange and MetabolomeXchange has helped to improve findability. Nature1 and Nature Biotechnology4 have highlighted the need for dataset integration frameworks to increase findability of data. And, in the context of the European ELIXIR (https://www.elixir-europe.org/) and USA Big Data to Knowledge (BD2K)5 trans-NIH initiative, it is clear that a dedicated platform, search engines and services enabling the aggregation of omics datasets, to resources such as PubMed6 or Europe PubMed Central (EuroPMC)7, is required. The Omics Discovery Index (OmicsDI - http://www.omicsdi.org) is an open source platform that can be used to access, discover and disseminate omics datasets. OmicsDI can integrate proteomics, genomics, metabolomics, and transcriptomics datasets (Fig. 1). To date, eleven repositories have agreed on a common metadata structure framework and exchange format, and have contributed to OmicsDI (Supplementary Notes 1–3), including proteomics databases (the PRoteomics IDEntifications (PRIDE) database, PeptideAtlas, the Mass spectrometry Interactive Virtual Environment (MassIVE) and the Global Proteome Machine Nat Biotechnol. Author manuscript; available in PMC 2018 March 01. Perez-Riverol et al. Page 3 Database (GPMDB)); metabolomics databases (MetaboLights, the Global Natural Products Author ManuscriptAuthor Manuscript Author Manuscript Author Manuscript Author Social Molecular Networking