Publishing FAIR Data: An Exemplar Methodology Utilizing PHI-Base
TECHNOLOGY REPORT
published: 12 May 2016
doi: 10.3389/fpls.2016.00641

Alejandro Rodríguez-Iglesias 1†, Alejandro Rodríguez-González 2†, Alistair G. Irvine 3, Ane Sesma 1, Martin Urban 4, Kim E. Hammond-Kosack 4 and Mark D. Wilkinson 1*

1 Center for Plant Biotechnology and Genomics, Universidad Politécnica de Madrid, Madrid, Spain
2 ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Madrid, Spain
3 Department of Computational and Systems Biology, Rothamsted Research, Harpenden, UK
4 Department of Plant Biology and Crop Science, Rothamsted Research, Harpenden, UK

Edited by: Roger Deal, Emory University, USA
Reviewed by: Toshiaki Katayama, University of Tokyo, Japan; M. Scott Marshall, Netherlands Cancer Institute, Netherlands
*Correspondence: Mark D. Wilkinson [email protected]
† These authors have contributed equally to this work.
Specialty section: This article was submitted to Technical Advances in Plant Science, a section of the journal Frontiers in Plant Science
Received: 20 January 2016
Accepted: 26 April 2016
Published: 12 May 2016
Citation: Rodríguez-Iglesias A, Rodríguez-González A, Irvine AG, Sesma A, Urban M, Hammond-Kosack KE and Wilkinson MD (2016) Publishing FAIR Data: An Exemplar Methodology Utilizing PHI-Base. Front. Plant Sci. 7:641. doi: 10.3389/fpls.2016.00641

Pathogen-Host interaction data is core to our understanding of disease processes and their molecular/genetic bases. Facile access to such core data is particularly important for the plant sciences, where individual genetic and phenotypic observations have the added complexity of being dispersed over a wide diversity of plant species vs. the relatively fewer host species of interest to biomedical researchers. Recently, an international initiative interested in scholarly data publishing proposed that all scientific data should be “FAIR”: Findable, Accessible, Interoperable, and Reusable. In this work, we describe the process of migrating a database of notable relevance to the plant sciences, the Pathogen-Host Interaction Database (PHI-base), to a form that conforms to each of the FAIR Principles. We discuss the technical and architectural decisions, and the migration pathway, including observations of the difficulty and/or fidelity of each step. We examine how multiple FAIR principles can be addressed simultaneously through careful design decisions, including making data FAIR for both humans and machines with minimal duplication of effort. We note how FAIR data publishing involves more than data reformatting, requiring features beyond those exhibited by most life science Semantic Web or Linked Data resources. We explore the value-added by completing this FAIR data transformation, and then test the result through integrative questions that could not easily be asked over traditional Web-based data resources. Finally, we demonstrate the utility of providing explicit and reliable access to provenance information, which we argue enhances citation rates by encouraging and facilitating transparent scholarly reuse of these valuable data holdings.

Keywords: FAIR data, Linked Data, Pathogen-Host Interactions, PHI-base, Semantic Web, Semantic PHI-base, SPARQL, data integration

INTRODUCTION

Traditional Scholarly Data Publishing

As with all life science domains, research into plant diversity, biochemistry, development, and disease is becoming increasingly data-centric. There are more than 1800 entries in the latest compilation of life science databases undertaken by the Nucleic Acids Research journal (Galperin et al., 2015). Many of these are domain-specific, or special-purpose resources that are usually curated, and contain specific types of data (e.g., RNACentral for noncoding RNA sequences; Bateman et al., 2011) or data relevant to a specific species (e.g., AraPort’s collection of information about Arabidopsis thaliana; Krishnakumar et al., 2015). However, in addition to these special-purpose repositories, there is also a proliferation of general-purpose repositories such as Dataverse (Crosas, 2011), FigShare¹, Dryad², Mendeley Data³, Zenodo⁴, DataHub⁵, DANS⁶, and EUDat (Lecarpentier et al., 2013) where researchers deposit a wide range of often specialist data and datatypes, at scales ranging from single figures or images, to large high-throughput datasets.

For most plant biologists, manually navigating these data resources to find, integrate, and analyse the data they require is exceedingly difficult. Not only is the data dispersed among many special- and general-purpose repositories, making it harder to locate, it is also often provided in formats that are difficult to integrate, both with the researcher’s local datasets, and with data from other Web repositories and databases. Once located, the data may be difficult to interpret and/or be of questionable quality; in fact, a recent study into these problems in the area of ecology and evolution found that 64% of public data archives were “unusable” (Roche et al., 2015). All of this results in a great deal of error-prone, manual copy-and-paste by biologists, and increasingly requires the researcher to employ a bioinformatics specialist to write bespoke data retrieval, cleansing, and integration software.

While the problem of discoverability and reusability of publicly-archived data in general-purpose repositories will require both technical and social/behavioral changes (which will likely be driven by journals and/or funding agencies⁷), advancing the state of special-purpose repositories is a more tangible near-term objective due to their more centralized and focused nature. Moreover, by and large, the same decisions must be made for both types of data resources, thus the knowledge gleaned from one activity can inform the other.

Locating and accessing data within special-purpose repositories should, in principle, be straightforward. These repositories generally have a curatorial staff concerned about quality-control, have organized the data into sensible structures, and have usually enriched the data with cross-references both internally between elements of their own data, and externally to remote datasets. Nevertheless, in practice it remains difficult to query, extract, and integrate the data, even from these highly curated repositories, particularly at the scale demanded by contemporary high-throughput research. Why is this so? What are the practices of data repositories, and the properties of their data holdings, that affect (positively or negatively) the usability of their valuable contents?

Contemporary Data Publishing Guidelines

The recently-published FAIR Data Principles⁸ (Wilkinson et al., 2016) are an attempt to answer these questions, and provide data producers and repository curators with a set of guideposts that lead to ever-increasing degrees of data discoverability and (re-)usability. FAIR is an acronym of Findable, Accessible, Interoperable, and Reusable. Briefly, the FAIR Principles suggest that every data element should have a globally-unique identifier, and that this identifier should be associated with contextual, searchable metadata (“Findable”); these identifiers should all resolve to data or metadata using an open, standard protocol (“Accessible”); the data and metadata should use a formal, broadly applicable representation language, and utilize open and widely-accepted domain-relevant vocabularies and ontologies (“Interoperable”); and finally, the data should be richly described with an abundance of cross-references, and with a clearly-defined mechanism for accessing provenance and license information (“Reusable”). Beyond the core principles, “FAIR-ness” also emphasizes that each of the principles should be equally applicable to both humans and machines. This is in recognition of the ever-increasing need for automation within large-scale data discovery, retrieval, integration, and analysis pipelines. Within that proposal were details regarding the kinds of qualities FAIR data would exhibit, and the integrative behaviors that would result. What was not defined by the FAIR Principles, however, was how this might be implemented, particularly over important existing data repositories.

While FAIR-ness speaks to more than just data representation and formats, and does not suggest any specific implementation or technology, in practice most FAIR transformations to date have utilized the principles of Linked Data (Bizer et al., 2009). Linked Data is both a design-philosophy, and a structured format. Linked Data was specifically conceived to work within the Web, and builds on the core technologies of the HTTP communications protocol, Uniform Resource Identifiers (URIs) for naming entities, concepts and relations, and Resource Description Framework (RDF) as a generic information model. The Linked Data principles require that all entities must be identified using URIs, that these URIs must resolve to information about that entity, and that information should include links to other entities also named by URIs. Thus, the Linked Data principles are more specific than, and span a subset of, the FAIR Data principles, where the latter additionally addresses “social” issues such as provenance, licensing, commitment-to-availability, and adherence to community-standards.

From Traditional to Contemporary

Recently a series of BYOD (“bring your own data”) Workshops (Roos and Lopes, 2014), targeting specific