FAIR Linked Data
Total Page:16
File Type:pdf, Size:1020Kb
FAIR Linked Data - Towards a Linked Data backbone for users and machines Johannes Frey Sebastian Hellmann [email protected] Knowledge Integration and Linked Data Technologies (KILT/AKSW) & DBpedia Association InfAI, Leipzig University Leipzig, Germany ABSTRACT scientific and open data. While the principles became widely ac- Although many FAIR principles could be fulfilled by 5-star Linked cepted and popular, they lack technical details and clear, measurable Open Data, the successful realization of FAIR poses a multitude criteria. of challenges. FAIR publishing and retrieval of Linked Data is still Three years ago a dedicated European Commission Expert Group rather a FAIRytale than reality, for users and machines. In this released an action plan for "turning FAIR into reality" [1] empha- paper, we give an overview on four major approaches that tackle sizing the need for defining "FAIR for implementation". individual challenges of FAIR data and present our vision of a FAIR In this paper, we argue that Linked Data as a technology in Linked Data backbone. We propose 1) DBpedia Databus - a flex- combination with the Linked Open Data movement and ontologies ible, heavily automatizable dataset management and publishing can be seen as one of the most promising approaches for managing platform based on DataID metadata; that is extended by 2) the scholarly metadata as well as representing and connecting research novel Databus Mods architecture which allows for flexible, uni- results (RDF knowledge graphs) across the web. Although many fied, community-specific metadata extensions and (search) overlay FAIR principles could be fulfilled by 5-star Linked Open Data [5], systems; 3) DBpedia Archivo an archiving solution for unified han- the successful realization of FAIR principles for the Web of Data in dling and improvement of FAIRness for ontologies on publisher and its current state is a FAIRytale. consumer side; as well as 4) the DBpedia Global ID management We give an overview on the implementation of four major ap- and lookup services to cluster and discover equivalent entities and proaches that tackle individual challenges of FAIR data and present properties. our technical vision how they can be extended and interact together to get one step closer towards a backbone for FAIR Linked Data. CCS CONCEPTS 2 VISION • Information systems ! Web searching and information discovery; Service buses; Semantic web description languages; Our vision in a nutshell is direct and non-compromising: we en- • Applied computing ! Annotation. vision an information space of metadata that is openly licensed (CC-0), openly accessible in its full extent, fully automated in terms KEYWORDS of access and licenses, machine-readable, globally interoperable, FAIR, Linked Data, Dataset discovery, Data management, Data but decentral in nature, coping well with changes on the web and assessment, Ontologies last-but-not-least has automatically verifiable integrity checks to guarantee all of the aforementioned criteria. Our vision is crucial, ACM Reference Format: because it aims vagueness and "grey area" and also defends against Johannes Frey and Sebastian Hellmann. 2021. FAIR Linked Data - Towards "white lies" and "label fraud", i.e. systems claiming to be "open" a Linked Data backbone for users and machines. In Sci-K 2021: International Workshop on Scientific Knowledge: Representation, Discovery, and Assess- and "FAIR", but have serious limitations hidden in the fine print ment. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/nnnnnnn. or the technical implementation. We particular see the danger in nnnnnnn these systems as they exploit the good intentions of FAIR without delivering and without any checks or repercussions. 1 INTRODUCTION The FAIR guiding principles [9] were published in 2016 to ideologi- 3 FAIR LINKED DATA CHALLENGES cally drive the adoption of several guiding principles for publishing The FAIR guideline [9] lists 15 principles along the dimensions F (findability), A (accessibility), I (interoperability) and R (re-usability). Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed As a first high-level point of criticism, the guideline leaves many for profit or commercial advantage and that copies bear this notice and the full citation open questions w.r.t. the realization – Challenge C1 and is barely on the first page. Copyrights for components of this work owned by others than ACM actionable for data publishers. Secondly, the principles are not au- must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a tomatically verifiable – Challenge, C2 the overall compliance fee. Request permissions from [email protected]. is not measurable and is therefore hardly awardable and claimable Sci-K 2021, April 19–23, 2018, Ljubljana, Slovenia without the ability to check for proof or fulfillment. During our © 2021 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00 endeavor to renovate and advance the creation and release of DB- https://doi.org/10.1145/nnnnnnn.nnnnnnn pedia knowledge graphs in the last 3 years [3, 4, 6, 7], we identified Sci-K 2021, April 19–23, 2018, Ljubljana, Slovenia Frey, et al. several challenges of publishing Linked Data in a FAIR way. We I3 M/D include qualified references to other M/D . denote the challenges in relation to the respective FAIR principle, I3C Biunique, persistent entity/concept identifiers are needed followed by the letter 퐶 for quick reference in the remainder of the to connect and relate knowledge in an efficient and reliable way. paper; (meta)data is abbreviated as M/D : IC Vocabulary mappings, entity interlinking, and value normal- F - Findability ization are needed to allow actual (semantic) interoperability. F1 M/D are assigned a globally unique and persistent identifier. R - Reusability F1C Linked Data as a technology uses globally unique, HTTP- R1 M/D are richly described with a plurality of accurate and relevant accessible identifier schemes and is therefore in its core well-suited. attributes. These schemes, however, are independently managed and heteroge- R1C Defining "accurate and relevant" M/D can be highly com- neous and thus needs to cope with identifier deduplication, change munity specific and lead to many heterogeneous specialized FAIR adaption (e.g. link rot) and cost of persistence. repositories that are incompatible to each other. A variety of "rich" F2 Data are described with rich metadata (defined by R1 below). metadata vocabularies to attract all kinds of communities and pre- F2C Deciding on the richness and granularity of metadata in- vent isolated solutions needs to be supported. volves multiple trade-offs between being homogeneous but still R1.1 M/D are released with a clear and accessible data usage license. flexible enough to accommodate a variety of use cases; between R1.2 M/D are associated with detailed provenance. being simple to understand and lightweight (especially in terms of R1.2C For reliable and meaningful provenance the used in- scalability for F4) but being verbose and accurate. Finding a consent put/referenced data also needs to be FAIR and the implementation for a publishing platform that scales and serves data from various needs to be compatible. domains seems tough to achieve in the long tail. Moreover, granu- R1.3 M/D meet domain-relevant community standards. larity can be dependent on the search context / information demand. R1.3C General purpose data can be used interdisciplinary and There is a need to be able to find/query via multiple granularity cross-domain. A Mechanism to associate multiple metadata de- layers of metadata. scriptions (without registering duplicates in the same or across F3 Metadata clearly and explicitly include the identifier of the data repositories) needs to be supported. they describe. F4 M/D are registered or indexed in a searchable resource. 4 APPROACHES F4C We argue that "searchable" is not enough in terms of A,I,R. We present four major approaches that address individual chal- The resource index must be FAIR itself to prevent M/D silos. Fur- lenges of FAIR data publishing, try to foster adherence to FAIR or thermore, indexing M/D relevant for all communities in one place aim at actively improving FAIRness. The chapter is summarized in will be hard or impossible to scale. If M/D is split across multiple Table 1. "FAIR data repositories/indexes" that non-FAIR itself, questions like a) "How to query multiple metadata (catalogs or repositories) in one 4.1 Databus, DataID & Collections attempt/unified way to find relevant information?", and b) "How to The development of the DBpedia Databus1 was driven by the need access, integrate and reuse these?" arise. for a flexible, heavily automatizable dataset management and pub- A - Accessibility lishing platform for a new and more agile DBpedia release cycle A1 M/D are retrievable by their identifier using a standardised com- [7]. It is inspired by the idea of transferring paradigms and tech- munications protocol. niques from software (release) management/deployment to data A1C Linked Data standards need to be enforced with validation. management, and the Maven Central Repository. Even with PURL(s) link rot occurs. The community is dependent The Databus uses the Apache Maven concept hierarchy group, on the registrar to take action (change redirection target). artifact, version and ports them to a Linked Data platform, in or- A1.1 The protocol is open, free, and universally implementable. der to manage data pipelines and enable automated publishing A1.1C Invalid deployments of Linked Data (invalid serialization, and consumption of data. Artifacts form the abstract identity of incorrect content-negotiation) limit or prevent access. a dataset with a stable dataset ID and can be used as entry point A1.2 The protocol allows for an authentication and authorisation to discover all versions.A version usually contains the same set of procedure, where necessary.