Assessing metadata and curation quality: a case study from the development of a third-party curation service at Springer Nature
Rebecca Grant, Graham Smith and Iain Hrynaszkiewicz

Correspondence should be addressed to Rebecca Grant, Springer Nature, The Campus, Trematon Walk, Wharfdale Road, London N1 9FN. Email: [email protected]

Abstract

Since 2017, the publisher Springer Nature has provided an optional Research Data Support service to help researchers deposit and curate data that support their peer-reviewed publications. This service builds on a Research Data Helpdesk, which since 2016 has provided support to authors and editors who need advice on the options available for sharing their research data. In this paper we describe a short project which aimed to facilitate an objective assessment of metadata quality, undertaken during the development of a third-party curation service for researchers (Research Data Support). We provide details of the single-blind user testing which was undertaken and the results gathered during this experiment. We also briefly describe the curation services which have been developed and introduced following an initial period of testing and piloting. This paper will be presented at the International Digital Curation Conference 2019 and has been submitted to the International Journal of Digital Curation.

Introduction

In 2016 the publisher Springer Nature introduced standard research data policies for its journals (Hrynaszkiewicz et al., 2017), with the aim of encouraging each journal to select the policy which is most appropriate for its discipline and its community. To date, more than 1,500 Springer Nature journals have selected a standard policy. Four policies are offered, which include consistent features such as data citation, data availability statements, data deposition in repositories and data peer review. The introduction of standard data policies by publishers is in response to growing demand from research funding agencies for data sharing. There is also evidence that the research data policies of journals and publishers have historically lacked standards and were in need of harmonisation (Naughton & Kernohan, 2016). Since the introduction of standard data policies by Springer Nature, other academic publishers have begun to introduce similar policies, including Taylor & Francis, Elsevier, Wiley and BMJ. A global Research Data Alliance Interest Group (Data Policy Standardisation and Implementation: https://www.rd-alliance.org/groups/data-policy-standardisation-and-implementation) has also developed a master policy framework with the aim of harmonising data policies across all publishers, and this was published in draft form in 2018. With the ongoing introduction of data policies by publishers, as well as funding agencies and institutions, researchers are increasingly compelled to share the data which underpin their published articles.
In 2016, as the standard data policies began rolling out across journals, Springer Nature introduced a Research Data Helpdesk to provide support to researchers, authors and academic editors who need to comply with or implement journal data policies. The Helpdesk provides email-based support to researchers and editors on many aspects of data sharing and publication. Although the Helpdesk covers all aspects of research data, including data preservation, licensing and publishing, an analysis of the Helpdesk queries in 2017 found that the majority related to data policies and policy compliance. Researchers also requested information on depositing data in repositories and on how data availability statements should be drafted, both of which reflect the requirements of their journals' data policies (Astell, Hrynaszkiewicz, Grant, Smith & Salter, 2017).

In 2017 Springer Nature undertook a large-scale survey of researchers, which received nearly 8,000 responses and aimed to assess the aspects of data sharing that researchers find most challenging (Stuart et al., 2017). In this survey, 63% of respondents reported that they shared data supporting their peer-reviewed publications. However, researchers stated that their greatest barrier to data sharing was organising data "in a presentable and useful way". A lack of time, concerns about copyright and licensing, and a lack of research funding for data sharing were also identified as practical challenges to increased data sharing. Surveys from the publishers Wiley and Elsevier and from the publishing technology company Digital Science have found similar results regarding the proportion of researchers who share data (around two-thirds) and the ways in which they share it; the most commonly reported methods of sharing tend to be suboptimal. All four surveys found a relatively low rate of repository usage by researchers, at 25% (Market Research, Wiley, 2017; Berghmans et al., 2017; Digital Science et al., 2017).

Although publisher policies, and those of other stakeholders such as funders, generally encourage data sharing using repositories, it is apparent that researchers do not feel adequately equipped to organise and describe their research data. One of the conclusions of the Springer Nature survey was that researchers should have faster and easier routes to data deposit which do not require them to become experts in data curation. It also suggested that close collaboration between stakeholders, including researchers, research infrastructure providers, institutions and publishers, will be necessary to effect change and to develop solutions which simplify workflows for data deposit.

Development of a third-party curation service

To help address the lack of time, skills or expertise needed to organise and share data, as reported by researchers in the survey, the Springer Nature data publishing team began developing a curation service intended to provide researchers with a means to deposit their data in a suitable repository without requiring them to learn the skills necessary to create high-quality metadata records. Such a service has a number of requirements, including a portal where data can be uploaded and a platform (a repository) where the data can be assigned a DOI and published.
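To make this more concrete, the sketch below shows the kind of metadata record a curation service of this type might assemble on an author's behalf before a repository mints a DOI, together with a basic completeness check. It is a hypothetical illustration only: the field names are loosely modelled on the DataCite metadata schema, the DOIs and values are placeholders, and the missing_required_fields helper is an assumption for this sketch, not part of any Springer Nature system.

# A minimal, hypothetical sketch of a curated dataset metadata record.
# Field names are loosely modelled on the DataCite metadata schema; all
# values, including the DOIs, are placeholders.

dataset_record = {
    "identifier": {"identifierType": "DOI", "identifier": "10.1234/example-dataset"},
    "creators": [{"name": "Researcher, Example", "affiliation": "Example University"}],
    "titles": [{"title": "Anonymised survey responses on data-sharing practices"}],
    "publisher": "Example Repository",
    "publicationYear": 2019,
    "resourceType": {"resourceTypeGeneral": "Dataset"},
    # A rich free-text description of context, quality and collection conditions
    # is what distinguishes curated metadata from a bare file listing.
    "descriptions": [{
        "descriptionType": "Abstract",
        "description": "CSV files of anonymised survey responses, with a codebook "
                       "describing each variable and the collection period.",
    }],
    "subjects": [{"subject": "research data management"}],
    "rightsList": [{"rights": "CC BY 4.0",
                    "rightsURI": "https://creativecommons.org/licenses/by/4.0/"}],
    # Linking the dataset to the article it supports helps make it findable.
    "relatedIdentifiers": [{
        "relationType": "IsSupplementTo",
        "relatedIdentifierType": "DOI",
        "relatedIdentifier": "10.1234/example-article",
    }],
}

def missing_required_fields(record):
    """Return any fields, from a minimal illustrative set, that are absent or empty."""
    required = ("identifier", "creators", "titles", "publisher",
                "publicationYear", "descriptions")
    return [field for field in required if not record.get(field)]

if __name__ == "__main__":
    # An empty list indicates the sketch record passes this basic completeness check.
    print(missing_required_fields(dataset_record))

A presence check of this kind is straightforward to automate; judging whether the descriptive content is sufficiently rich and accurate is not, and that is the question of metadata quality assessment this paper goes on to address.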
The service also needed to include appropriate editorial checks, to ensure that data types which should be deposited in discipline-specific repositories, such as DNA/RNA sequence data, are directed to those community repositories rather than to the general repository on which the Research Data Support service is based. Additionally, the service needed to ensure that published data are based on scholarly research, and that sensitive data and data derived from human participants are suitably anonymised.

While the technical infrastructure and the initial screening of submissions were important concerns, a key focus of the service development was on drafting complete, accurate and appropriate metadata on behalf of the researcher. The intention was also to align the metadata being created with the FAIR Data Principles (Wilkinson et al., 2016), ensuring that any researcher who published data through the service could be assured that their data would be, at a minimum, Findable and Accessible. The metadata records created needed to be of high quality, and to add value to the data in a way that a researcher using the service could not have achieved themselves without specialist data curation skills.

Many repositories, and the documentation relating to standard metadata schemas, provide guidance on the content and completeness that is expected when publishing data. The FAIR Data Principles also provide some high-level guidance on what can be expected of Findable and Accessible datasets, for example that their metadata include a persistent identifier and a rich description. More recently the GO FAIR website has expanded on what the term rich metadata implies, noting that it should be "generous and extensive, including descriptive information about the context, quality and condition, or characteristics of the data."

In spite of existing guidance on how high-quality metadata should be created, there is a lack of documentation on how metadata quality can be assessed or benchmarked after the metadata have been created. Furthermore, the focus of the Springer Nature Research Data Support service is on curating datasets that support peer-reviewed publications, taking into account the needs of journal editors, peer reviewers and readers, as well as authors (data creators). As we began to develop standard