Playing Well on the Data FAIRground: Initiatives and Infrastructure in Research Data Management
RESEARCH PAPER

Danielle Descoteaux1†, Chiara Farinelli2, Marina Soares e Silva2 & Anita de Waard1

1 Elsevier, Inc, 50 Hampshire St, Cambridge, MA 02139, USA
2 Elsevier, B.V., Radarweg 29, 2043NX Amsterdam, The Netherlands

Downloaded from http://direct.mit.edu/dint/article-pdf/1/4/350/683840/dint_a_00020.pdf by guest on 24 September 2021

Keywords: Open data; Data sharing; Data citation; Open research

Citation: D. Descoteaux, C. Farinelli, M.S.e Silva & A. de Waard. Playing well on the data FAIRground: Initiatives and infrastructure in research data management. Data Intelligence 1(2019), 350-367. doi: 10.1162/dint_a_00020

Received: April 23, 2019; Revised: June 10, 2019; Accepted: June 28, 2019

ABSTRACT

Over the past five years, Elsevier has focused on implementing FAIR and best practices in data management, from data preservation through reuse. In this paper we describe a series of efforts undertaken in this time to support proper data management practices. In particular, we discuss our journal data policies and their implementation, the current status and future goals for the research data management platform Mendeley Data, and clear and persistent linkages to individual data sets stored on external data repositories from corresponding published papers through partnership with Scholix. Early analysis of our data policies implementation confirms significant disparities at the subject level regarding data sharing practices, with most uptake within disciplines of Physical Sciences. Future directions at Elsevier include implementing better discoverability of linked data within an article and incorporating research data usage metrics.

1. BACKGROUND AND MOTIVATION

The FAIR (findable, accessible, interoperable and reusable) Data Principles argue that standardized data management is “the key conduit leading to knowledge discovery and innovation” [1].
The storage, preservation, accessibility and citation of research data are essential aspects of creating rigorous and reusable scholarly output. This means that a previously ancillary artefact – raw research data – is now becoming a usable and analyzable scholarly output in its own right. Data repositories, publishers, funders, institutions and scholars have been working through many venues, including organizations such as the Research Data Alliance (RDA), to develop standards, goals for interoperability and requirements for metadata and data permanence that allow storage of and access to this growing body of publicly available research data.

† Corresponding author: Danielle Descoteaux (E-mail: [email protected]).
© 2019 Chinese Academy of Sciences. Published under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

Defining, meeting, and raising the standards for open science, including best practices for research data management, is generally a community effort with global stakeholders. At the 2016 G20 Summit in Hangzhou, the G20 leaders declared their support for implementing the FAIR Data Principles to promote open science and to enable appropriate access to publicly funded research results [2]. Similarly, stakeholder groups such as CODATA and the European Open Science Cloud are actively engaged in enabling the FAIR Data Principles throughout the scholarly workflow [3]. In specific domains, there are tailored efforts to focus the research data management (RDM) practices of an entire community around these standards.
For instance, in the Earth and Space Sciences, a coalition of groups representing the international science community was convened by the American Geophysical Union (AGU) to develop standards that connect researchers, publishers and data repositories in these disciplines and so enable FAIR data [4].

Despite these ambitious goals, research data management practices are still heterogeneous, both geographically and across different areas of research. While most researchers agree that reusing data from others would benefit their research, data sharing is not widespread and researchers report having little experience with it. According to the most recent Open Data Report [5], 73% of academics surveyed said that having access to published research data would benefit their own research, while only 64% are willing to allow others to access their research data. One of the reasons for this disconnect is that, despite the growth of information on the importance of data sharing, most scholarly research is still aimed at publishing papers in reputable journals. Sharing and publishing data is not perceived by authors as a priority of their institutions [5, 6].

For this reason, we see a natural opportunity for scholarly publishers to take an active role. Manuscript submission, which prompts authors to provide information about their research, is a natural moment to bring research data together with an article: to require and enable data sharing, allow data annotation and connect RDM tools and standards to the publishing workflow. Creating these pathways to open data enables the raw data and the paper to be linked together without imposing new, extraneous workflows on researchers.
We therefore also actively support and enable proper Data Citation Practices, as outlined by the Force11 Data Citation Guidelines [7], and have helped lead a convergence of science publishers on modes and systems of data citation [8]. Proper data citation practices can support citation counts, downloads and views of data sets, which can act as important metrics to establish review and reuse of data and serve to motivate the scholarly community to share and publish their data.

NB: For the purposes of this article, data sharing will largely be defined as how data are saved, shared, cited and trusted, with each of these components incorporating several layers. Moreover, we will use “research data” broadly to encompass raw data, code, software and other research objects. We recognize that different communities will focus on the sharing and creation of different research objects, and it is not our intention to impose a definition of those digital research output objects. There has been widespread agreement on standards that come from such discussions with the Research Data Alliance, Force11, and FAIRsharing, with nuanced understanding about different kinds of data and the domain-specific repositories that might host them.

Below, we discuss a series of initiatives taken largely over the last few years to facilitate proper practices for data deposition, curation and discovery. This paper is organized as follows: first, we discuss the overall principles behind our RDM practices and tools (2.1); then, we discuss a series of efforts that we have engaged in, together with the community of stakeholders, over the past five years, and the practical outcomes we have seen from these efforts (2.2 – 2.6). Lastly, we discuss the implications of these efforts, and some thoughts on moving forward with this important challenge.

2. PROMOTING RESEARCH DATA MANAGEMENT AT ELSEVIER

2.1 Overall Vision on Research Data Management

Over the past five years, we have developed multiple initiatives aimed at promoting data management and sharing, discussed in the rest of this section. Throughout these efforts, we have been driven by an overarching idea of a “data Maslow hierarchy”, as depicted in Figure 1 below from [9]. The idea behind this figure is that all components of data sharing support the “highest” goal (that of data reuse), but this goal cannot be attained unless the “lower-level components” are in place: data must be stored before it can be accessed, and it must be accessible before it can be reused. In our educational outreach (see e.g. Researcher Academy [10]), we consistently emphasize that good data management starts in the research planning phase, and that an important role is played by a fruitful interaction with data librarians, data stewards, curators and others at the researchers’ home institution or in their specific community of practice.

In the remainder of this section, we will discuss a series of efforts which we have undertaken to support this vision:

2.2. Data citation and deposition guidelines for our journals – supporting storage, preservation, access, discovery and citation;
2.3. Data Linking – supporting Data Discovery;
2.4. Research Data Management Infrastructure – supporting storage, preservation, access, discovery and citation;
2.5. Data Journals – supporting the evaluation of Data Quality and Data Reproducibility.

Figure 1. The “data Maslow hierarchy” visualizing the components of data sharing [11].
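The dependency described in Section 2.1, where a higher-level goal such as reuse is only attainable once every lower-level component is in place, can be illustrated with a minimal sketch. The level names and their ordering in `LEVELS` are assumptions paraphrased from terms used in this section (storage, preservation, access, discovery, citation, quality review, reproducibility, reuse); they are not a transcription of Figure 1, and the function name is hypothetical.

```python
from typing import Iterable, Optional

# Assumed ordering, lowest level first. The names paraphrase terms used in
# the text and are illustrative only; Figure 1 may label or order them differently.
LEVELS = [
    "stored",
    "preserved",
    "accessible",
    "discoverable",
    "citable",
    "reviewed",
    "reproducible",
    "reusable",
]


def highest_level_reached(satisfied: Iterable[str]) -> Optional[str]:
    """Return the highest level for which all lower levels are also satisfied.

    Walks the hierarchy bottom-up and stops at the first missing level, so a
    gap low in the pyramid blocks everything above it.
    """
    satisfied_set = set(satisfied)
    reached = None
    for level in LEVELS:
        if level not in satisfied_set:
            break
        reached = level
    return reached


if __name__ == "__main__":
    # A data set that is stored and accessible but not preserved only
    # counts as "stored" under this model.
    print(highest_level_reached(["stored", "accessible"]))
```

In this model a data set cannot count as reusable by skipping a lower level, which mirrors the argument that data must be stored before it can be accessed and accessible before it can be reused.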
2.2 Research Data Deposition and Citation: The TOP Guidelines

As a first step to meet the growing demand for guidance and tools that answer calls for transparency, openness and reproducibility of research, we implemented a series