Data Repository Selection: Criteria That Matter

Susanna-Assunta Sansone1* (0000-0001-5306-5690), Peter McQuilton1* (0000-0003-2687-1982), Helena Cousijn2* (0000-0001-6660-6214), Wei Mun Chan3 (0000-0002-9971-813X), Ilaria Carnevale4 (?), Imogen Cranston5 (0000-0002-7134-499X), Scott Edmunds6 (0000-0001-6444-1436), Nicholas Everitt7 (0000-0001-8343-8910), Emma Ganley8,9 (0000-0002-2557-6204), Chris Graf10 (0000-0002-4699-4333), Iain Hrynaszkiewicz8 (0000-0002-9673-5559), Varsha K. Khodiyar11 (0000-0002-2743-6918), Thomas Lemberger12 (0000-0002-2499-4025), Kiera McNeice13 (0000-0003-2839-4067), Philippe Rocca-Serra1 (0000-0001-9853-5668), Kathryn Sharples10 (0000-0003-2809-6828), Marina Soares E Silva4 (0000-0001-9530-627X), Jonathan Threlfall5 (0000-0001-8599-4320).

1. Oxford e-Research Centre, Department of Engineering Science, University of Oxford, Oxford, OX1 3QG, UK.
2. DataCite, Welfengarten 1b, 30167 Hannover, Germany.
3. eLife Sciences Publications, Ltd, Westbrook Centre, Milton Road, Cambridge, CB4 1YG, UK.
4. Elsevier, Radarweg 29, 1043NX, Amsterdam, The Netherlands.
5. F1000, Middlesex House, 34-42 Cleveland St, Fitzrovia, London, W1T 4LB, UK.
6. GigaScience, BGI Hong Kong Tech Ltd., 26F A Kings Wing Plaza, 1 On Kwan St, Shek Mun, N.T., Hong Kong.
7. Taylor & Francis, Park Square, Milton Park, Abingdon, OX14 4RN, UK.
8. PLOS (Public Library of Science), Carlyle House, Carlyle Road, Cambridge, CB4 3DN, UK.
9. (current position) Independent expert.
10. Wiley, 9600 Garsington Road, Oxford, OX4 2DQ, UK.
11. Springer Nature, 4 Crinan Street, London, N1 9XW, UK.
12. EMBO Press, Meyerhofstrasse 1, 69117 Heidelberg, Germany.
13. Cambridge University Press, Shaftesbury Rd, Cambridge, CB2 8BS, UK.

*corresponding authors: susanna-assunta.sansone [at] oerc.ox.ac.uk, peter.mcquilton [at] oerc.ox.ac.uk, helena.cousijn [at] datacite.org


Summary

This article proposes a set of criteria that journals and publishers believe are important for the identification and selection of data repositories, which can be recommended to researchers when they are preparing to publish the data underlying their findings. This work intends to (i) reduce complexity for researchers when preparing their submissions to journals, (ii) increase efficiency for data repositories that currently have to work with each individual publisher, and (iii) simplify the process of recommending data repositories for publishers. This makes the implementation of research data policies more efficient and consistent, which may help to improve approaches to data sharing by promoting the use of reliable data repositories.

This initiative stems from a discussion between the Force11 Data Citation Implementation Pilot (DCIP) group [1,2] and the joint Force11 and Research Data Alliance (RDA) FAIRsharing WG [3] on the need to develop a shared list of recommended data deposition repositories. These activities have matured as part of a collaboration between FAIRsharing [4], DataCite [5] and a group of publisher representatives (the authors of this work) who are actively implementing data policies and recommending data repositories to researchers.

Objectives

These proposed criteria are intended to:

● guide journals and publishers in providing authors with consistent recommendations and guidance on data deposition, and improve authors' data sharing practices;
● reduce the potential for confusion among researchers and support staff, and reduce duplication of effort by different publishers and data repositories;
● inform data repository developers and managers of the features believed to be important by journals and publishers;
● apprise certification and other evaluation initiatives, serving as a reference and perspective from journals and publishers;
● drive the curation of the descriptions of data repositories in FAIRsharing, which will enable display, filtering and search based on these criteria.

1. https://www.force11.org/group/dcip
2. http://dx.doi.org/10.1038/s41597-019-0031-8
3. https://rd-alliance.org/group/fairsharing-registry-connecting-data-policies-standards-databases.html
4. https://fairsharing.org/
5. https://datacite.org/

Primary Target Audience

Although we recognize that researchers and other stakeholders play a role in the research data life cycle, in this first instance the target audiences for this work are other journals and publishers, repository developers and maintainers, certification and other evaluation initiatives, and other policy makers.

Background

Several thousand data repositories currently exist, some generalist, some specific to a discipline or data type. The majority of these repositories are for datasets, while a number exist for other digital products, such as software, pre-prints and algorithms. These repositories play a key role in ensuring greater transparency and preservation of the information that underpins research findings, as expected by governments and funders. To support the Findability, Accessibility, Interoperability and Reusability of data, many repositories are beginning to incorporate the FAIR Principles [6] into their policies and to implement the necessary technical enhancements. Concomitantly, publishers and journals are developing data policies to ensure that datasets, as well as other digital products associated with articles, are deposited and made accessible via appropriate repositories. With thousands of options available, however, the lists of deposition repositories recommended by publishers often differ [7,8], and consequently the guidance provided to authors may vary from journal to journal. This is due to a lack of common criteria for selecting data repositories, but also to the fact that there is still no consensus on what constitutes a good data repository.

6. https://doi.org/10.1038/sdata.2016.18
7. https://datascience.codata.org/article/10.5334/dsj-2017-042
8. https://fairsharing.org/recommendations/


Several organisations have proposed criteria to recommend and evaluate the quality of data repositories [9]; examples include the domain-agnostic CoreTrustSeal (CTS) [10], the nascent TRUST principles [11], the domain-specific ELIXIR Core Data Resources [12], and the funder-specific guidance from the US National Institutes of Health (NIH) [13]. These different efforts attempt to define criteria, some of which are similar despite often being labeled differently; other proposed criteria have no correspondence between them or are complementary. This stems from these organizations' different approaches to defining what constitutes the quality of a repository. Although these initiatives might be suitable for publisher data repository selection in the future, they have not yet reached critical mass, and therefore there is a need for publishers to agree on criteria that will enable them to provide more consistent guidance to authors.

Approach

Mindful of existing initiatives, the chairs of the Force11 DCIP group and of the joint Force11 and RDA FAIRsharing WG initiated a discussion on the need to identify the features of data repositories that journals and publishers believe are important for the identification and selection of appropriate deposition repositories. The work kickstarted in January 2018, at a meeting held in Oxford, with representatives of journals and publishers engaged in the DCIP group and a few others that recommend databases and/or are already working as part of the FAIRsharing Community [14]. The first phase of the work focused on defining our objectives and relations with other efforts, and on teasing out an initial list of criteria from the needs and views of the represented journals and publishers. The criteria were presented at sessions during the 12th and 13th RDA Plenaries and shared internally within the participating organizations.

Meanwhile, an article [15] by the FAIRsharing team and its Community delivered an analysis of the data policies of major data-focussed journals and publishers, and of the repositories and standards (for data citation and reporting) they recommend. The results confirm the discrepancy in resource recommendations across data policies, the result of a cautious approach also due to the wealth of existing resources. The initial criteria were subsequently reviewed and refined iteratively, also driven by the collaboration [16] between FAIRsharing and DataCite. The focus is on those data repositories that accept submissions of research data (here also referred to as deposition repositories), thereby excluding professionally curated knowledge bases, which, whilst important as data sources, do not allow direct deposition of user-generated data and are outside the scope of this paper. In addition to these criteria, FAIRsharing will also provide a means by which data repository certification and any community-specific evaluation (where the deposition repositories and knowledge resources are identified by a recognized community in a given domain) can be used as filters for discovery. FAIRsharing also interrelates repositories with the standards (for citation, data and metadata reporting) they implement and the data policies that recommend them, guiding consumers to discover, select and use these resources with confidence, and producers to make their resources more discoverable, more widely adopted and cited.

9. https://doi.org/10.2218/ijdc.v9i1.309
10. https://www.coretrustseal.org
11. https://docs.google.com/document/d/1UCsdnz0wk9TeMj1Dqxi8wuZ2Lu_TNVkpJ2TX48yKsec/edit?usp=sharing
12. http://dx.doi.org/10.12688/f1000research.9656.2
13. https://www.nlm.nih.gov/NIHbmic/nih_data_sharing_repositories.html
14. https://fairsharing.org/communities
15. https://doi.org/10.1038/s41587-019-0080-8

Proposed Criteria

The table below summarizes the proposed criteria: their definitions, their importance, the value(s) FAIRsharing will curate and display, and the value(s) to be used for filtering. Although all of these criteria are important to enable the FAIRness of data, we also know that FAIRness is 'aspirational'. Moreover, it is not yet practical to filter on some criteria because they are not widely used by data repositories. We have therefore distinguished between two types of criteria:

Essential criteria are, as the name suggests, essential for filtering to arrive at a list of repositories that publishers would want to recommend for data deposition.

Desirable criteria are also important and, in an ideal future, would also be considered minimum best practices. However, adoption of these practices is not yet widespread enough for it to be practical to use them for filtering.

16. https://doi.org/10.5438/z32p-wj46


Essential criteria: essential for filtering to arrive at a list of repositories that publishers would want to recommend for data deposition.

Repository Status (Essential)
- Definition: the life cycle status of the repository: is it still being developed, or is it in production and accepting data submissions?
- Value(s) FAIRsharing will curate and display: 'Ready' (production-level); 'In development' (actively being developed but not yet ready for use); 'Deprecated' (no longer maintained or superseded); 'Uncertain' (when the life cycle position is unclear).
- Value(s) to be used for filtering: status 'Ready': only production-level repositories will be selected.

Data Access Condition (Essential)
- Definition: data access terms at repository level: how does one get to the data? For example, is the data freely available or subject to a request and approval process?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of condition, e.g. 'open', 'controlled'; URL to the condition information.
- Value(s) to be used for filtering: presence (yes): only repositories that make the access condition clear will be selected.

Data Reuse Condition (Essential)
- Definition: licence or terms of use for reuse of existing data in the repository: what are the conditions, if any, under which one can reuse the data?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of condition; URL to the condition information.
- Value(s) to be used for filtering: presence (yes): only repositories that make the data reuse condition clear will be selected.

Data Deposition Condition (Essential)
- Definition: deposition of data: are there any restrictions (e.g. by location, country, organization) or can anyone from anywhere deposit data?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of condition; URL to the condition information.
- Value(s) to be used for filtering: presence (yes): only repositories that make the data deposition condition clear will be selected.

Data Preservation Policy (Essential)
- Definition: policy that details how the preservation of the data is ensured: does the repository have a page or document describing this?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; URL to the policy.
- Value(s) to be used for filtering: presence (yes): only repositories with a data preservation policy will be selected.

Persistent Identifiers for Data (Essential)
- Definition: globally unique and Persistent IDentifiers (PIDs): does the repository assign them to the deposited data? Which type of identifier schema is used?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of PID schema.
- Value(s) to be used for filtering: presence (yes): only repositories assigning PIDs will be selected.

User Support (Essential)
- Definition: support to users during or after submission: does the repository have a contact point, team or helpdesk to assist users?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'.
- Value(s) to be used for filtering: presence (yes): only repositories with user support will be selected.

Desirable criteria: also important and, in an ideal future, would also be considered minimum best practices; however, adoption of these practices is not yet widespread enough for it to be practical to use them for filtering.

Data and Metadata Community Standards (Desirable)
- Definition: the community-defined standards the repository implements to (i) enable data citation [17] and (ii) support the annotation and representation of data/metadata (e.g. models, formats, schemas, vocabularies, ontologies). The standards can be implemented via the data submission tool, or by the curation team.
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of standards; relations between types of standards, and between standards and repositories (to track their use and adoption).
- Value(s) to be used for filtering: future - presence (yes): only repositories implementing citation and reporting standards will be selected.

Data Curation (Desirable)
- Definition: review and annotation of the data performed by the repository (e.g. via the data submission/curation tool, or by its curation team): are there minimum curation steps that the repository performs on the submitted data? Is there a webpage or document that describes the type of curation done?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; URL to the curation information.
- Value(s) to be used for filtering: future - presence (yes): only repositories curating data will be selected.

Data Versioning (Desirable)
- Definition: ability and process to make and track edits to a dataset after deposition: does the repository enable modification of the submitted dataset (e.g. so that it can be corrected or appended with additional information)? Is there a process to distinguish and link the two versions of the data?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'.
- Value(s) to be used for filtering: future - presence (yes): only repositories with a data versioning policy will be selected.

Data Citation (Desirable)
- Definition: a mechanism for linking data to articles: does the repository enable this at submission stage or post submission?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of mechanism.
- Value(s) to be used for filtering: future - presence (yes): only repositories enabling data-article links will be selected.

Dataset Usage Information (Desirable)
- Definition: to assess the use of the data in the repository and allow researchers insight into data reuse: does the repository collect and share this information (e.g. number of views, downloads)?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; URL to the information.
- Value(s) to be used for filtering: future - presence (yes): only repositories collecting and sharing this information will be selected.

Data Access for Pre-Publication Review (Desirable)
- Definition: a mechanism and process for sharing deposited data via a link, anonymously (or otherwise, depending on journal policy regarding open/closed review): does the repository allow confidential access to the data for peer review? Does the repository also have the capability for double-blind peer review?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'.
- Value(s) to be used for filtering: future - presence (yes): only repositories enabling this access will be selected.

Resource Sustainability (Desirable)
- Definition: plan that gives information about the sustainability of the repository: does the repository have a webpage or document that describes this?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'.
- Value(s) to be used for filtering: future - presence (yes): only repositories with sustainability plans will be selected.

Funding (Desirable)
- Definition: the type of funding (e.g. grants, donations) and the organisation(s) that fund the repository.
- Value(s) FAIRsharing will curate and display: types of funding; name of the funding organization (using FundRef).
- Value(s) to be used for filtering: at this stage, this criterion is for information only and is related to the sustainability criterion.

Data Contact Information (Desirable)
- Definition: contact information (for the person or organization: depositor, producer or owner) of the data: does the repository keep and show this information?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; contact details (incl. ORCID).
- Value(s) to be used for filtering: future - presence (yes): only repositories with contact information will be selected.

Certification and Community Badge (Desirable)
- Definition: certification schemes and/or community badges that assess certain aspects of the repository (e.g. its fitness, trustworthiness, adoption): does the repository have any?
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; type of certification or badge; URL to the certification or badge.
- Value(s) to be used for filtering: at this stage, this criterion is for information only, because different certifications may be more appropriate to different situations.

FAIRsharing Record Maintainer (Desirable)
- Definition: contact (person or organization) for the record in FAIRsharing that describes the repository: has the owner or maintainer of the repository claimed the record and vetted its description? Although records in FAIRsharing are curated by its in-house team, the participation of the owner or maintainer of the repository helps verify the information and track the evolution of the resource.
- Value(s) FAIRsharing will curate and display: presence 'yes', or absence 'no'; contact details (incl. ORCID).
- Value(s) to be used for filtering: future - presence (yes): only repositories with a maintained record will be selected.

17. https://www.biorxiv.org/content/early/2016/12/28/097196
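To make the intended filtering concrete, the sketch below shows one possible way a curated repository record could be represented and screened against the essential criteria. It is a minimal illustration only: the field names, the RepositoryRecord structure and the example repositories are ours, and do not reflect FAIRsharing's actual data model or API.

```python
from dataclasses import dataclass

# Illustrative only: simplified stand-ins for the criteria above, not FAIRsharing's schema.
@dataclass
class RepositoryRecord:
    name: str
    status: str                      # 'ready', 'in development', 'deprecated', 'uncertain'
    data_access_condition: bool      # access terms are clearly stated
    data_reuse_condition: bool       # licence/terms of reuse are clearly stated
    data_deposition_condition: bool  # deposition conditions are clearly stated
    preservation_policy: bool        # a preservation policy is published
    assigns_pids: bool               # PIDs are assigned to deposited data
    user_support: bool               # a contact point, team or helpdesk exists

def meets_essential_criteria(repo: RepositoryRecord) -> bool:
    """Apply the essential filters: production status plus presence of each condition."""
    return (
        repo.status == "ready"
        and repo.data_access_condition
        and repo.data_reuse_condition
        and repo.data_deposition_condition
        and repo.preservation_policy
        and repo.assigns_pids
        and repo.user_support
    )

# Example: narrow a candidate list down to repositories a journal could recommend.
candidates = [
    RepositoryRecord("Example Repo A", "ready", True, True, True, True, True, True),
    RepositoryRecord("Example Repo B", "in development", True, False, True, False, True, True),
]
recommended = [r.name for r in candidates if meets_essential_criteria(r)]
print(recommended)  # ['Example Repo A']
```

Desirable criteria could be carried in the same record and reported alongside the result without being used to exclude repositories, mirroring the 'for information only' treatment above.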


Driving Changes

Identifying the criteria that journals and publishers believe are important to select and recommend appropriate repositories will not only help us meet the above-mentioned objectives, but will also drive changes by:

● defining a common language across publishers;
● helping publishers to maintain this information in a more automated way (see the sketch after this list);
● making the process for selection of recommended repositories more transparent to all stakeholders.
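As one way to picture the more automated maintenance mentioned above, the hedged sketch below rebuilds a journal's recommendation list by querying a registry for repositories that already pass the essential filters. The endpoint URL, query parameters and response shape are hypothetical placeholders for illustration, not the real FAIRsharing or DataCite API.

```python
import requests

# Hypothetical registry endpoint and filter parameters, for illustration only;
# they do not correspond to an actual FAIRsharing or DataCite API.
REGISTRY_URL = "https://registry.example.org/api/repositories"
ESSENTIAL_FILTERS = {
    "status": "ready",
    "data_access_condition": "yes",
    "data_reuse_condition": "yes",
    "data_deposition_condition": "yes",
    "preservation_policy": "yes",
    "assigns_pids": "yes",
    "user_support": "yes",
}

def fetch_recommended_repositories() -> list[str]:
    """Return the names of repositories that satisfy the essential criteria."""
    response = requests.get(REGISTRY_URL, params=ESSENTIAL_FILTERS, timeout=30)
    response.raise_for_status()
    # Assumes the (hypothetical) endpoint returns a JSON list of records with a 'name' field.
    return [record["name"] for record in response.json()]

if __name__ == "__main__":
    # A journal's author guidance page could be regenerated from this list on a schedule.
    for name in fetch_recommended_repositories():
        print(name)
```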

Furthermore, the criteria described here could form part of the key descriptors needed for the FAIRness assessment of data repositories, and feed into the activities working to identify FAIR maturity indicators (e.g. [18]) and evaluation tools (e.g. [19]), as well as many FAIR-enabling projects (e.g. [20]).

Challenges and Future Work

Our discussion has also highlighted a number of challenges that must be addressed, but that are part of future or other activities:

● Licence equivalence: many repositories do not use Creative Commons (CC) licences [21] but state that they use an 'equivalent' one. However, this equivalence is difficult to measure (e.g. [22]) and is something we need to be able to represent in order to allow comparison between the different terms and conditions.
● File size and cost associated with deposition: this is key information for journals and publishers, as well as for researchers, but it is hard to collect, as this information is often not publicly available or may depend on the organisation concerned (e.g. academic versus commercial rates and special deals).

18. https://www.rd-alliance.org/groups/fair-data-maturity-model-wg
19. https://doi.org/10.1038/s41597-019-0184-5
20. https://www.fairsfair.eu
21. https://creativecommons.org
22. https://discuss.okfn.org/c/projects/OpenDefinition


● Institutional repositories: at the time of writing, FAIRsharing does not systematically include information about institutional repositories, although many of these are also incorporating the FAIR Principles into their policies and infrastructure. Publishers and journals could use the proposed list to specify which criteria these repositories should meet.

● Data/dataset citation: whether and how a deposition repository follows the data citation principles [23] and provides sufficient metadata on each dataset's landing page are other ideal criteria for selection, but currently this information is not easily collected.

The outcomes of this collaboration will enable publishers, as well as individual journals, to provide lists of recommended repositories based on clear criteria and in line with the guidance authors encounter elsewhere. This will make it easier for authors to find the right data repository for their data and thereby play a key role in facilitating data sharing across different scientific domains.

23. https://doi.org/10.1038/s41597-019-0031-8
