A Shape Expression Approach for Assessing the Quality of Linked Open Data in Libraries

Semantic Web 0 (0) 1 1 IOS Press 1 1 2 2 3 3 4 A Shape Expression approach for assessing 4 5 5 6 the quality of Linked Open Data in Libraries 6 7 7 8 Gustavo Candela a,*, Pilar Escobar a, María Dolores Sáez a and Manuel Marco-Such a 8 9 a Department of Software and Computing systems, University of Alicante, Spain 9 10 E-mail: [email protected] 10 11 11 Victor de Boer, Vrije Universiteit Amsterdam, the Netherlands. Enrico Daga, The Open University, 12 12 13 United Kingdom. Marieke van Erp, KNAW Humanities Cluster, the Netherlands . Eero Hyvönen, 13 14 University of Helsinki, Aalto University, Finland. Albert Meroño Peñuela, Vrije Universiteit Amsterdam, 14 15 the Netherlands. Harald Sack, FIZ Karlsruhe - Leibniz Institute for Information Infrastructure, Germany. 15 16 16 17 Editors: Mehwish Alam, FIZ Karlsruhe, Germany; Victor de Boer, Vrije Universiteit Amsterdam, Netherlands; Eero Hyvönen, University of 17 18 Helsinki, Aalto University, Finland; Albert Meroño Peñuela, Vrije Universiteit Amsterdam, Netherlands; Harald Sack, FIZ Karlsruhe, Germany 18 Solicited reviews: Jouni Tuominen, Aalto University, Finland; Katherine Thornton, Yale University, USA; Marilena Daquino, University of 19 19 Bologna, Italy 20 20 21 21 22 22 23 23 24 24 Abstract. Cultural heritage institutions are exploring Semantic Web technologies to publish and enrich their catalogues. Several 25 25 initiatives, such as Labs, are based on the creative and innovative reuse of the materials published by cultural heritage institu- 26 tions. In this way, quality has become a crucial aspect to identify and reuse a dataset for research. In this article, we propose a 26 27 methodology to create Shape Expressions definitions in order to validate LOD datasets published by libraries. The methodology 27 28 was then applied to four use cases based on datasets published by relevant institutions. It intends to encourage institutions to use 28 29 ShEx to validate LOD datasets as well as to promote the reuse of LOD, made openly available by libraries. 29 30 30 31 Keywords: Linked Open Data, Data Quality, Libraries, Cultural Heritage 31 32 32 33 33 34 34 35 1. Introduction creative ways of reusing them [4]. GLAM institutions 35 36 are engaging users and researchers to conduct research 36 37 Galleries, Libraries, Archives and Museums (GLAM on the digital collections. 37 38 institutions) have traditionally provided access to dig- The Semantic Web was presented by Tim Berners- 38 39 ital collections. The wide range of material formats Lee in 2001 as an extension of the current Web in 39 40 include text, image, video, audio or maps. which information is structured in a way that is read- 40 41 As technologies have evolved over the years, GLAM able by computers [5]. The Semantic Web is based on 41 42 organisations have adapted to the new environments in a range of technologies that enable the connection of 42 43 terms of new skills, service design or digital research resources, known as Linked Open Data (LOD). 43 44 [1]. Institutions have started to make their collections Applying the LOD concepts to the digital collec- 44 45 accessible for computational uses such as data science, tions provided by libraries has become highly popu- 45 46 Machine Learning, and Artificial Intelligence [2, 3]. lar in the research community. Many institutions have 46 47 Recently, the Lab concept has emerged as a means adopted Resource Description Framework (RDF) to 47 48 to publish digital collections as datasets amenable to describe their content. In addition, collaborative edit- 48 49 computational use as well as to identify innovative and ing approaches have been proposed using Wikidata 49 50 and Wikibase to highlight research opportunities in 50 51 *Corresponding author. E-mail: [email protected]. community-based collections, as well as community- 51 1570-0844/0-1900/$35.00 © 0 – IOS Press and the authors. All rights reserved 2 G. Candela et al. / A ShEx approach for assessing the quality of LOD in Libraries. 1 owned infrastructure, to facilitate open scholarship tained after the quality assessment; and (c) the ShEx 1 2 practices [6, 7]. The use of LOD enhances the dis- definitions to assess LOD published by libraries. 2 3 coverability and impact of digital collections by trans- The paper is organised as follows: after a brief de- 3 4 forming isolated repositories (data silos) into valuable scription of the state of the art in Section 2, Section 3 4 5 datasets that are connected to external repositories. describes the methodology employed to evaluate LOD 5 6 However, the use of Semantic Web technologies in libraries using ShEx. Section 4 shows the results 6 7 requires complex technical skills and professional of the methodology’s application. The paper concludes 7 8 knowledge in different fields, hindering their adop- with an outline of the adopted methodology, and gen- 8 9 tion. Many aspects must be taken into account such as eral guidelines on how to use the results and future 9 10 the vocabulary to describe the resources, the identifi- work. 10 11 cation of external repositories to create the links and 11 12 the system to store the final dataset. As in the case of 12 13 other types of structured data, LOD suffers from qual- 2. Related work 13 14 ity problems such as inaccuracy, inconsistency, and in- 14 2.1. Background 15 completeness, impeding its full potential of reuse and 15 16 exploitation. 16 The Semantic Web is a web of data that is machine- 17 Bibliographic records published as LOD by libraries 17 readable and includes a collection of technologies to 18 have gained value in contexts outside of the library do- 18 describe and query the data, as well as to define stan- 19 main in order to connect and reuse resources [8, 9]. It 19 dard vocabularies. Linked Data was introduced by Tim 20 is crucial to provide libraries with higher-quality and 20 21 Berners-Lee [17] as a essential component of the Se- 21 richer metadata for reuse in a linked data environment, 22 mantic Web to create relationships between datasets. 22 not only to create contextual information, but also to 23 Thus, the Resource Description Framework (RDF) 23 facilitate the work of the library staff [10]. 24 [18] lies at the heart of the Semantic Web as it provides 24 Recent studies have focused on the next genera- 25 a standard model for data interchange on the Web and 25 tion of metadata in libraries to develop quality assur- 26 extends the Web’s linking structure by means of URIs. 26 ance practices [11]. Some approaches have assessed 27 In addition, SPARQL provides a standardised query 27 the quality of LOD using several methods and tech- 28 language for data represented as RDF in which a query 28 niques [12, 13]. A preliminary query-based approach 29 can include a list of triple patterns, conjunctions, dis- 29 30 assesses the quality of the LOD published by four rel- junctions, and optional patterns [19]. 30 31 evant libraries [14]. Shape Expressions (ShEx) have Libraries have traditionally provided the descriptive 31 32 emerged as a concise, formal, modelling and valida- metadata of bibliographic records using standards such 32 1 33 tion language for RDF structures, addressing the Se- as MARC. While MARC is the most common for- 33 34 mantic Web community’s need to ensure data qual- mats used by libraries to publish bibliographic infor- 34 35 ity for RDF graphs [15, 16]. However, to the best mation, it presents limitations regarding its use as RDF, 35 36 of our knowledge, none of these previous approaches since MARC was not defined for a Web environment 36 37 provide a user-friendly syntax, systematic and repro- [20]. 37 38 ducible method to assess the quality of LOD published In this sense, several initiatives provide a more ex- 38 39 by libraries based on ShEx as a main component. pressive and modern framework for bibliographic in- 39 40 The objective of the present study was to intro- formation based on Semantic Web technologies. Some 40 41 duce a systematic and reproducible approach to anal- examples include: Functional Requirements for Bib- 41 42 yse the data quality of LOD published by libraries. liographic Records (FRBR), the family of concep- 42 43 The methodology was then applied to four LOD repos- tual models [21], and Resource Description and Ac- 43 44 itories issued by relevant institutions. The collection cess (RDA) specification [22], the IFLA Library Ref- 44 45 of ShEx schemas provided as a result of this study is erence Model (LRM) [23], the Bibliographic Ontology 45 46 publicly available and can be used to reproduce the (BIBO) [24], the Bibliographic Framework Initiative 46 47 results and extend the examples provided using new (BIBFRAME) [25] and FRBRoo [26]. However, trans- 47 48 rules based on additional properties and vocabularies. lating the old records into the new format is not an easy 48 49 The main contributions of this paper are as follow: task [27], since libraries usually host large catalogues, 49 50 (a) a methodology to assess the quality of the LOD 50 51 published by libraries using ShEx; (b) the results ob- 1https://www.loc.gov/marc/ 51 G. Candela et al. / A ShEx approach for assessing the quality of LOD in Libraries. 3 1 including many types of resources that often requirea ries of 55 intrinsic, representational, contextual and ac- 1 2 manual revision to transform the data with accuracy. cessibility quality metrics [34]. Stardog Integrity Con- 2 3 Several major libraries (e.g., the OCLC, the British straint Validation (ICV) allows to write constraints 3 4 Library, the National Library of France), publishers, that are translated to SPARQL in order to assess RDF 4 5 and library catalogue vendors have applied LOD to triples in a repository [29].

A Shape Expression Approach for Assessing the Quality of Linked Open Data in Libraries

Validating RDF Data Using Shapes

V a Lida T in G R D F Da

The Opencitations Data Model

Using Shape Expressions (Shex) to Share RDF Data Models and to Guide Curation with Rigorous Validation B Katherine Thornton1( ), Harold Solbrig2, Gregory S

Shape Designer for Shex and SHACL Constraints Iovka Boneva, Jérémie Dusart, Daniel Fernández Alvarez, Jose Emilio Labra Gayo

Validating RDF with Shape Expressions

Validating Shacl Constraints Over a Sparql Endpoint

Semi Automatic Construction of Shex and SHACL Schemas Iovka Boneva, Jérémie Dusart, Daniel Fernández Alvarez, Jose Emilio Labra Gayo

DINGO: an Ontology for Projects and Grants Linked Data

Reading an XML Text Like a Human with Semantic Web Technologies 1

Document (1407

Multi-‐Entity Models of Resource Description in the Semantic