Making a Common Graphical Language for the Validation of Linked Data


Degree project in Architecture, second cycle, 30 credits. Stockholm, Sweden 2017.
DANIEL ECHEGARAY
Master in Computer Science
Date: July 7, 2017
Supervisor: Cyrille Artho
Examiner: Tino Weinkauf
Swedish title: Skapandet av ett generiskt grafiskt språk för validering av länkad data.
KTH School of Computer Science and Communication

Abstract

A variety of embedded systems is used within the design and construction of trucks at Scania. Because of their heterogeneity and complexity, such systems require many software tools to support embedded-systems development. These tools need to form a well-integrated and effective development environment in order to ensure that product data is consistent and correct across the developing organisation. A prototype is under development that adopts a linked data approach to data integration; more specifically, it adopts the Open Services for Lifecycle Collaboration (OSLC) specification for data integration. The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. Users may further apply constraints on the data conforming to the OSLC validation language Resource Shapes (ReSh). The problem is twofold: the prototype conforms only to the language of Resource Shapes, whose constraints are often too coarse-grained for Scania's needs, and no standardised language exists for the validation of linked data. To frame this study, two research questions were formulated: (1) How can a common graphical language be created to support all validation technologies for RDF data? (2) How can this graphical language support the automatic generation of RDF graphs?
A case study is conducted in which the specific case consists of a software tool named SESAMM-tool at Scania. The case study included a constraint-language comparison and a prototype extension. Furthermore, a design science research strategy is followed, in which an effective artefact is sought to answer the stated research questions. Design science promotes an iterative process of implementation and evaluation. Data were collected empirically in an iterative development process and evaluated using the methods of informed argument and controlled experiment, respectively, for the constraint-language comparison and the extension of the prototype. Two constraint languages were investigated: Shapes Constraint Language (SHACL) and Shape Expressions (ShEx). The comparison concluded that SHACL is the constraint language with the larger domain of constraints, with finer-grained constraints and the possibility of defining new constraints. This was based on SHACL constraints being measured to cover 89.5% of ShEx constraints, and 67.8% for the converse. SHACL and ShEx coverage of ReSh property constraints was measured at 75% and 50%, respectively. SHACL was recommended and chosen for extending the prototype. In extending the prototype, abstract superclasses were introduced into the underlying data model, with constraint-language classes, including SHACL, stated as subclasses. This design increased code reuse within the prototype but gave rise to issues relating to the plug-in technologies on which the prototype is based. The current solution still has the issue that properties of one constraint language may be added to classes of another constraint language.

Sammanfattning

A variety of embedded systems is used within the design and construction of trucks at Scania.
Because of their heterogeneity and complexity, such systems require the use of many software tools to support embedded-systems development. These tools must form a well-integrated and effective development environment in order to ensure that product data is consistent and correct across the development organisation. A prototype is under development that adopts a linked data approach to data integration; more specifically, it adopts a data-integration specification developed by the Open Services for Lifecycle Collaboration (OSLC). The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. Users may further apply constraints on the data conforming to the OSLC validation language Resource Shapes. The problem is that the prototype conforms only to Resource Shapes, whose constraints are often too coarse-grained for Scania's needs, and that no standardised language exists for the validation of linked data. To frame this study, two research questions were formulated: (1) How can a common graphical language be created to support all validation technologies for RDF data? (2) How can this graphical language support the automatic generation of RDF graphs? A case study is conducted in which the specific case consists of a software tool named SESAMM-tool at Scania. The case study included a constraint-language comparison and further development of the prototype. Furthermore, design science is followed as the research strategy, in which an effective artefact is sought to answer the stated research questions. Design science promotes an iterative process including implementation and evaluation. Data have been collected empirically in an iterative fashion and evaluated using the evaluation methods informed argument and controlled experiment for the constraint-language comparison and the further development of the prototype, respectively.
Two constraint languages were investigated: Shapes Constraint Language (SHACL) and Shape Expressions (ShEx). The comparison concluded that SHACL is the constraint language with a larger domain of constraints, finer-grained constraints, and the possibility of defining new constraints. This was based on SHACL constraints being measured to cover 89.5% of ShEx constraints, and 67.8% for the converse. SHACL and ShEx coverage of Resource Shapes property constraints was measured at 75% and 50%, respectively. SHACL was recommended and chosen for the further development of the prototype. In extending the prototype, abstract superclasses were introduced into the underlying data model. The superclasses essentially took over the role of the previous constraint-language classes, which instead became subclasses; SHACL is stated as one such subclass. This design offered high code reuse within the prototype but also gave rise to problems related to the plug-in technologies on which the prototype is built. The current solution still has the issue that properties of one constraint language may be added to classes of another constraint language.

Contents

List of Figures
List of Tables
1 Introduction
  1.1 Problem and Research Question
  1.2 Purpose
  1.3 Ethics and Sustainability
  1.4 Scope
  1.5 Limitations
  1.6 Disposition
2 Background
  2.1 Linked data
  2.2 Open Services for Lifecycle Collaboration
  2.3 Resource Description Framework
  2.4 OSLC Tool-chain
  2.5 RDF Constraint languages
  2.6 Summary
3 Related Work
  3.1 Shapes Constraint Language
  3.2 Shapes Expression
  3.3 OSLC Resource Shape
  3.4 SPARQL Inferencing Notation
  3.5 Web Ontology Language
  3.6 Description Set Profiles
  3.7 Summary
4 Lyo toolchain modeling and code generation prototype
  4.1 Functionality
  4.2 Extensions
  4.3 Technologies
    4.3.1 Eclipse Modeling Framework Core
    4.3.2 Sirius
    4.3.3 Acceleo
  4.4 Summary
5 Research Method
  5.1 Research Phases
    5.1.1 Case study
  5.2 Design Science
    5.2.1 Design as an Artifact
    5.2.2 Problem Relevance
    5.2.3 Design Evaluation
    5.2.4 Research Contribution
    5.2.5 Research Rigor
    5.2.6 Design as a Search Process
    5.2.7 Communication of Research
  5.3 Research Strategy Motivation
  5.4 Summary
6 Constraint Language Comparison
  6.1 Features
  6.2 Constraint coverage
  6.3 Summary
7 Implementation
  7.1 Evaluation
    7.1.1 Task
    7.1.2 Evaluation Criteria
  7.2 Iterative Process
    7.2.1 First iteration: Learn by doing
    7.2.2 Second iteration: Inheritance for code reuse
    7.2.3 Third iteration: Abstract super class for cohesion
    7.2.4 Fourth iteration: reference attributes and backwards compatibility
    7.2.5 Fifth iteration: Breaking name conventions and code clean up
  7.3 Summary
8 Discussion and Conclusion
  8.1 Comparison between constraint languages
  8.2 Implementation
  8.3 Research findings
9 Future Work
Bibliography
A Lyo prototype meta-model
B SHACL on ShEx coverage
C ShEx on SHACL coverage
D SHACL on ReSh coverage
E ShEx on ReSh coverage

List of Figures

2.1 An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.
4.1 A simple high-level model of how three tools are connected through their data. The letter 'P' stands for producing data and 'C' for consuming data.
4.2 A simple conceptual model of how the prototype currently works and how it should be extended.
5.1 An overview of the research phases.
5.2 An overview of how design science research was applied for the implementation in this thesis.
6.1 Top left: SHACL and ReSh. Top right: ShEx and ReSh. Bottom left: SHACL and ShEx. Bottom right: SHACL, ShEx and ReSh.
6.2 To the left, the number of ShEx constraints covered by SHACL. To the right, the number of SHACL constraints covered by ShEx.
6.3 To the left, the number of ReSh constraints covered by SHACL. To the right, the number of ReSh constraints covered by ShEx.
7.1 A modelled figure replicating a subset of the SESAMM-tool database, with classes and properties obfuscated.
7.2 The meta-model extension in the first iteration.
7.3 A model designed in the first iteration. All elements on the left side conform to pre-existing ReSh constraints.
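The coverage percentages reported in the abstract can be reproduced with simple set bookkeeping: for each constraint type of one language, record whether the other language can express it, then divide. Below is a minimal sketch of that calculation; the constraint-type names are invented placeholders, not the thesis's actual comparison tables (Appendices B–E).

```python
# Sketch of the pairwise coverage measurement described in the abstract.
# The constraint inventories below are illustrative placeholders only.

def coverage(expressible_in_other: set, total: set) -> float:
    """Percentage of `total` constraint types expressible in the other language."""
    return 100 * len(expressible_in_other & total) / len(total)

shex_constraints = {"datatype", "cardinality", "value_set", "closed_shape", "or"}
# Subset of the ShEx constraint types that SHACL can express (illustrative):
expressible_in_shacl = {"datatype", "cardinality", "value_set", "or"}

print(f"SHACL covers {coverage(expressible_in_shacl, shex_constraints):.1f}% of ShEx")
```

The same function applied in both directions yields asymmetric figures, which is exactly the shape of the thesis's 89.5% versus 67.8% result.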
Recommended publications
  • Validating RDF Data Using Shapes
    Validating RDF Data Using Shapes. Jose Emilio Labra Gayo, University of Oviedo, Spain.
    Abstract: RDF forms the keystone of the Semantic Web, as it enables a simple and powerful graph-based knowledge-representation data model that can also facilitate integration between heterogeneous sources of information. RDF-based applications are usually accompanied by SPARQL stores, which enable RDF data to be managed and queried efficiently. In spite of its well-known benefits, the adoption of RDF in practice by web programmers is still lacking, and SPARQL stores are usually deployed without proper documentation and quality assurance. The producers of RDF data usually know the implicit schema of the data they are generating, but traditionally they have not declared it explicitly. In recent years, two technologies have been developed to describe and validate RDF content using the term "shape": Shape Expressions (ShEx) and Shapes Constraint Language (SHACL). We will present a motivation for their appearance and compare them, as well as some applications and tools that have been developed.
    Keywords: RDF, ShEx, SHACL, Validating, Data quality, Semantic web
    1. Introduction. RDF is a flexible knowledge-representation language based on graphs which has been successfully adopted in Semantic Web applications. In this tutorial we will describe two languages that have recently been proposed for RDF validation: Shape Expressions (ShEx) and Shapes Constraint Language (SHACL). ShEx was proposed as a concise and intuitive language for describing RDF data in 2014 [1]. We will present an overview of both and describe some challenges and future work [4].
    2. Acknowledgements. This work has been partially funded by the Spanish Ministry of Economy, Industry and Competitiveness, project TIN2017-88877-R.
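To make the comparison the tutorial describes concrete, here is one constraint — "a user has exactly one string-valued foaf:name" — written in both languages. The prefixes and shape names are invented for illustration; the snippets are held as strings so the syntaxes can be shown side by side.

```python
# The same constraint in SHACL (Turtle syntax) and in ShEx.
# Prefixes (ex:, sh:, foaf:, xsd:) and shape names are illustrative.

SHACL_SHAPE = """
ex:UserShape a sh:NodeShape ;
    sh:targetClass ex:User ;
    sh:property [
        sh:path foaf:name ;
        sh:datatype xsd:string ;
        sh:minCount 1 ;
        sh:maxCount 1 ;
    ] .
"""

SHEX_SHAPE = """
ex:UserShape {
    foaf:name xsd:string    # ShEx default cardinality is exactly one
}
"""

print(SHACL_SHAPE)
print(SHEX_SHAPE)
```

Note the difference in flavour the tutorial's comparison turns on: SHACL states constraints as RDF triples about a shape, while ShEx reads like a schema grammar with cardinality defaults.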
  • Validating RDF Data
    Validating RDF Data. Jose Emilio Labra Gayo, University of Oviedo; Eric Prud'hommeaux, W3C/MIT and Micelio; Iovka Boneva, University of Lille; Dimitris Kontokostas, University of Leipzig. Series ISSN: 2160-4711. Series Editors: Ying Ding, Indiana University; Paul Groth, Elsevier Labs.
    This book describes two technologies for RDF validation: Shape Expressions (ShEx) and Shapes Constraint Language (SHACL), the rationales for their designs, a comparison of the two, and some example applications. RDF and Linked Data have broad applicability across many fields, from aircraft manufacturing to zoology. Requirements for detecting bad data differ across communities, fields, and tasks, but nearly all involve some form of data validation. This book introduces data validation and describes its practical use in day-to-day data exchange. The Semantic Web offers a bold, new take on how to organize, distribute, index, and share data. Using Web addresses (URIs) as identifiers for data elements enables the construction of distributed databases on a global scale. Like the Web, the Semantic Web is heralded as an information revolution, and also like the Web, it is encumbered by data quality issues. The quality of Semantic Web data is compromised by the lack of resources for data curation, for maintenance, and for developing globally applicable data models. At the enterprise scale, these problems have conventional solutions. Master data management provides an enterprise-wide vocabulary, while constraint languages capture and enforce data structures. Filling a need long recognized by Semantic Web users, shapes languages provide models and vocabularies for expressing such structural constraints.
  • The OpenCitations Data Model
    The OpenCitations Data Model. Marilena Daquino (1,2), Silvio Peroni (1,2), David Shotton (2,3), Giovanni Colavizza (4), Behnam Ghavimi (5), Anne Lauscher (6), Philipp Mayr (5), Matteo Romanello (7), and Philipp Zumstein (8).
    Affiliations: 1 Digital Humanities Advanced Research Centre (/DH.arc), Department of Classical Philology and Italian Studies, University of Bologna; 2 Research Centre for Open Scholarly Metadata, Department of Classical Philology and Italian Studies, University of Bologna; 3 Oxford e-Research Centre, University of Oxford; 4 Institute for Logic, Language and Computation (ILLC), University of Amsterdam; 5 Department of Knowledge Technologies for the Social Sciences, GESIS - Leibniz-Institute for the Social Sciences; 6 Data and Web Science Group, University of Mannheim; 7 École Polytechnique Fédérale de Lausanne; 8 Mannheim University Library, University of Mannheim.
    Abstract. A variety of schemas and ontologies are currently used for the machine-readable description of bibliographic entities and citations. This diversity, and the reuse of the same ontology terms with different nuances, generates inconsistencies in data. Adoption of a single data model would facilitate data integration tasks regardless of the data supplier or context application. In this paper we present the OpenCitations Data Model (OCDM), a generic data model for describing bibliographic entities and citations, developed using Semantic Web technologies.
  • Using Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with Rigorous Validation
    Using Shape Expressions (ShEx) to Share RDF Data Models and to Guide Curation with Rigorous Validation. Katherine Thornton (Yale University, New Haven, CT, USA), Harold Solbrig (Johns Hopkins University, Baltimore, MD, USA), Gregory S. Stupp (The Scripps Research Institute, San Diego, CA, USA), Jose Emilio Labra Gayo (University of Oviedo, Oviedo, Spain), Daniel Mietchen (Data Science Institute, University of Virginia, Charlottesville, VA, USA), Eric Prud'hommeaux (World Wide Web Consortium (W3C), MIT, Cambridge, MA, USA), and Andra Waagmeester (Micelio, Antwerpen, Belgium).
    Abstract. We discuss Shape Expressions (ShEx), a concise, formal modeling and validation language for RDF structures. For instance, a Shape Expression could prescribe that subjects in a given RDF graph that fall into the shape "Paper" are expected to have a section called "Abstract", and any ShEx implementation can confirm whether that is indeed the case for all such subjects within a given graph or subgraph. There are currently five actively maintained ShEx implementations. We discuss how we use the JavaScript, Scala and Python implementations in RDF data validation workflows in distinct, applied contexts. We present examples of how ShEx can be used to model and validate data from two different sources, the domain-specific Fast Healthcare Interoperability Resources (FHIR) and the domain-generic Wikidata knowledge base, which is the linked database built and maintained by the Wikimedia Foundation as a sister project to Wikipedia.
  • Shape Designer for ShEx and SHACL Constraints
    Shape Designer for ShEx and SHACL Constraints. Iovka Boneva (Univ. Lille, CNRS, Centrale Lille, Inria, UMR 9189 - CRIStAL - Centre de Recherche en Informatique Signal et Automatique de Lille, F-59000 Lille, France), Jérémie Dusart (Inria, France), Daniel Fernández Álvarez (University of Oviedo, Spain), and Jose Emilio Labra Gayo (University of Oviedo, Spain). ISWC 2019 - 18th International Semantic Web Conference, Oct 2019, Auckland, New Zealand. HAL Id: hal-02268667, https://hal.archives-ouvertes.fr/hal-02268667, submitted on 30 Sep 2019.
    Abstract. We present Shape Designer, a graphical tool for building SHACL or ShEx constraints for an existing RDF dataset. Shape Designer can automatically extract complex constraints that are satisfied by the data. Its integrated shape editor and validator allow expert users to combine and modify these constraints in order to build an arbitrarily complex ShEx or SHACL schema.
  • Validating RDF with Shape Expressions
    Validating RDF with Shape Expressions. Iovka Boneva (LINKS, INRIA & CNRS, University of Lille, France), Jose Emilio Labra Gayo (University of Oviedo, Spain), Samuel Hym (2XS, University of Lille, France), Eric G. Prud'hommeaux (W3C, Stata Center, MIT), Harold Solbrig (Mayo Clinic College of Medicine, Rochester, MN, USA), and Sławek Staworko (LINKS, INRIA & CNRS, University of Lille, France).
    Abstract. We propose shape expression schema (ShEx), a novel schema formalism for describing the topology of an RDF graph that uses regular bag expressions (RBEs) to define constraints on the admissible neighborhood for the nodes of a given type. We provide two alternative semantics, multi- and single-type, depending on whether or not a node may have more than one type. We study the expressive power of ShEx and the complexity of the validation problem. We show that the single-type semantics is strictly more expressive than the multi-type semantics, that single-type validation is generally intractable, and that multi-type validation is feasible for a small class of RBEs. To further curb the high computational complexity of validation, we propose a natural notion of determinism and show that multi-type validation for the class of deterministic schemas using single-occurrence regular bag expressions (SORBEs) is tractable. Finally, we consider the problem of validating only a fragment of a graph with preassigned types for some of its nodes, and argue that for deterministic ShEx using SORBEs, multi-type validation can be performed efficiently and single-type validation can be performed with a single pass over the graph.
    1 Introduction. Schemas have a number of important functions in databases.
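The regular bag expressions at the core of ShEx constrain the multiset of a node's outgoing predicates. A toy interval-based checker conveys the idea; this is a simplified sketch of the formalism, since real RBEs also allow disjunction, grouping, and repetition over groups, which are omitted here.

```python
from collections import Counter

# Toy shape checking with interval constraints on predicate occurrences,
# in the spirit of ShEx's regular bag expressions (RBEs). A shape maps
# predicate -> (min_occurs, max_occurs); predicates not in the shape are
# forbidden (a "closed" neighborhood).

def conforms(triples, node, shape):
    bag = Counter(p for s, p, o in triples if s == node)
    if any(p not in shape for p in bag):
        return False  # an unexpected predicate appears in the neighborhood
    return all(lo <= bag[p] <= hi for p, (lo, hi) in shape.items())

triples = [
    ("paper1", "title", "ShEx"),
    ("paper1", "author", "Boneva"),
    ("paper1", "author", "Staworko"),
    ("paper2", "title", "Untitled"),
]
paper_shape = {"title": (1, 1), "author": (1, 3)}
print(conforms(triples, "paper1", paper_shape))  # paper1: one title, two authors
print(conforms(triples, "paper2", paper_shape))  # paper2 lacks a required author
```

The intractability results in the abstract arise precisely when such neighborhood checks interact with type assignments across the whole graph, which this single-node sketch sidesteps.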
  • Validating SHACL Constraints over a SPARQL Endpoint
    Validating SHACL constraints over a SPARQL endpoint. Julien Corman (Free University of Bozen-Bolzano, Bolzano, Italy), Fernando Florenzano (PUC Chile and IMFD, Chile), Juan L. Reutter (PUC Chile and IMFD, Chile), and Ognjen Savković (Free University of Bozen-Bolzano, Bolzano, Italy).
    Abstract. SHACL (Shapes Constraint Language) is a specification for describing and validating RDF graphs that has recently become a W3C recommendation. While the language is gaining traction in the industry, algorithms for SHACL constraint validation are still at an early stage. A first challenge comes from the fact that RDF graphs are often exposed as SPARQL endpoints, and therefore only accessible via queries. Another difficulty is the absence of guidelines about the way recursive constraints should be handled. In this paper, we provide algorithms for validating a graph against a SHACL schema, which can be executed over a SPARQL endpoint. We first investigate the possibility of validating a graph through a single query for non-recursive constraints. Then, for the recursive case, since the problem has been shown to be NP-hard, we propose a strategy that consists in evaluating a small number of SPARQL queries over the endpoint, and using the answers to build a set of propositional formulas that are passed to a SAT solver. Finally, we show that the process can be optimized when dealing with recursive but tractable fragments of SHACL, without the need for an external solver. We also present a proof-of-concept evaluation of this last approach.
    1 Introduction. SHACL (for SHApes Constraint Language) is an expressive constraint language for RDF graphs, which became a W3C recommendation in 2017.
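The paper's single-query idea for non-recursive constraints can be illustrated by compiling one SHACL-style cardinality constraint into a SPARQL query that selects the violating focus nodes. This is a hedged sketch of the general approach, not the paper's actual rewriting; the class and property IRIs are invented.

```python
# Sketch: compile a SHACL-style sh:minCount constraint into a single SPARQL
# query returning violating focus nodes, in the spirit of the paper's
# non-recursive case. IRIs are illustrative placeholders.

def min_count_violations_query(target_class: str, path: str, min_count: int) -> str:
    return f"""
SELECT ?focus WHERE {{
  ?focus a <{target_class}> .
  OPTIONAL {{ ?focus <{path}> ?v }}
}}
GROUP BY ?focus
HAVING (COUNT(?v) < {min_count})
""".strip()

q = min_count_violations_query("http://ex.org/Person", "http://ex.org/name", 1)
print(q)
```

Sending `q` to the endpoint returns exactly the Person instances with fewer than one name; an empty result means the constraint holds, which is what makes one round trip suffice in the non-recursive case.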
  • Semi Automatic Construction of ShEx and SHACL Schemas
    Semi Automatic Construction of ShEx and SHACL Schemas. Iovka Boneva (University of Lille, France), Jérémie Dusart (Inria, France), Daniel Fernández Álvarez (University of Oviedo, Spain), and Jose Emilio Labra Gayo (University of Oviedo, Spain). HAL Id: hal-02193275, https://hal.archives-ouvertes.fr/hal-02193275, preprint submitted on 24 Jul 2019.
    Abstract. We present a method for the construction of SHACL or ShEx constraints for an existing RDF dataset. It has two components that are used conjointly: an algorithm for automatic schema construction, and an interactive workflow for editing the schema. The schema construction algorithm takes as input sets of sample nodes and constructs a shape constraint for every sample set. It can be parametrized by a schema pattern that defines structural requirements for the schema to be constructed. Schema patterns are also used to feed the algorithm with relevant information about the dataset coming from a domain expert or from some ontology.
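The automatic schema-construction component can be approximated by a much simpler procedure: inspect a set of sample nodes and record, for each predicate, the minimum and maximum occurrence counts observed. The code below is a naive stand-in for the paper's algorithm, which is additionally driven by schema patterns and domain knowledge.

```python
# Naive shape extraction: infer per-predicate cardinality bounds from a set
# of sample nodes in a toy triple store. Illustrative only; the paper's
# algorithm is parametrized by schema patterns, which this sketch omits.

def extract_shape(triples, sample_nodes):
    preds = {p for s, p, _ in triples if s in sample_nodes}
    shape = {}
    for p in preds:
        counts = [sum(1 for s, q, _ in triples if s == n and q == p)
                  for n in sample_nodes]
        shape[p] = (min(counts), max(counts))  # (min_occurs, max_occurs)
    return shape

triples = [
    ("n1", "name", "Ada"),
    ("n1", "knows", "n2"),
    ("n1", "knows", "n3"),
    ("n2", "name", "Grace"),
]
print(extract_shape(triples, {"n1", "n2"}))
```

Both sample nodes have exactly one `name`, so the inferred bound is (1, 1); `knows` varies between zero and two occurrences, so it becomes optional and repeatable. Translating such bounds into `sh:minCount`/`sh:maxCount` or ShEx cardinalities is then mechanical.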
  • DINGO: an Ontology for Projects and Grants Linked Data
    DINGO: an ontology for projects and grants linked data. Diego Chialva and Alexis-Michel Mugabushaka (ERCEA, Place Charles Rogier 16, 1210 Brussels, Belgium).
    Abstract. We present DINGO (Data INtegration for Grants Ontology), an ontology that provides a machine-readable, extensible framework to model data for semantically-enabled applications relating to projects, funding, actors, and, notably, funding policies in the research landscape. DINGO is designed to yield high modeling power and elasticity to cope with the huge variety in funding, research and policy practices, which makes it applicable also to areas beyond research where funding is an important aspect. We discuss its main features, the principles followed for its development, its community uptake, and its maintenance and evolution.
    Keywords: ontology, linked data, research funding, research projects, research policies
    1 Introduction, Motivation, Goals and Idea. Services and resources built around the Semantic Web, semantically-enabled applications and linked (open) data technologies have increasingly impacted research and research-related activities in recent years. Development has been intense along several directions, for instance in "semantic publishing" [36], but also in aspects directed toward the reproducibility and attribution of research and scholarly outputs, leading to interest in having Open Science Graphs interconnected at the global level [21]. All this has become more and more essential to research practices, also in light of the so-called reproducibility crisis affecting a number of research fields (see, for instance, the list of recent studies at https://reproduciblescience.org/2019). In fact, the demand for easily and automatically parsable, interoperable and processable data goes beyond the purely academic sphere.
  • Reading an XML Text Like a Human with Semantic Web Technologies
    Reading an XML Text Like a Human with Semantic Web Technologies: Learning from the Basel City Accounts as Digital Edition. Georg Vogeler, 26.05.2021.
    Historical texts carry information. Though this assertion will hardly surprise historians, they less often consider how much this information is influenced by the material, visual, and organisational context of the text. Which archive considered the text worth preserving? How is the text organised on the page? Who wrote the text? What material was used to create the text? Digital editing allows us to create representations of texts which take into account all these aspects.
    The digital scholarly edition of the Basel city accounts 1535–1611, created in collaboration with Susanna Burghartz and teams from Basel and Graz, and published in 2015, demonstrates both the feasibility and effectiveness of this digital approach to editing historical accounting records. It has become a major reference for digital scholarly editions of historical accounts.
    Starting from the experience of working with this historical object, this contribution reflects on technical solutions which can help to realise editions similar to the Basel city accounts. The edition uses a combination of two well-established technologies in the Digital Humanities: the Text Encoding Initiative (TEI) XML markup for transcriptions, and the Resource Description Framework (RDF), defined by the World Wide Web Consortium (W3C), to publish the content of the accounts in the «Web of Data» or the «Semantic Web». RDF is based on globally unique identifiers in the form of web addresses (URIs) and a formal structure similar to simple textual statements: «subject, predicate, object».
  • Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup
    Plenary Debates of the Parliament of Finland as Linked Open Data and in Parla-CLARIN Markup. Laura Sinikallio, Senka Drobac, Minna Tamper, Rafael Leal, Mikko Koho, Jouni Tuominen, Matti La Mela, and Eero Hyvönen (HELDIG Centre for Digital Humanities and SeCo Research Group, University of Helsinki, Finland; Department of Computer Science, SeCo Research Group, Aalto University, Finland).
    Abstract. This paper presents a knowledge graph created by transforming the plenary debates of the Parliament of Finland (1907–) into Linked Open Data (LOD). The data, totaling over 900 000 speeches, with automatically created semantic annotations and rich ontology-based metadata, are published in a Linked Open Data service and are used via a SPARQL API and as data dumps. The speech data is part of the larger LOD publication FinnParla, which also includes prosopographical data about the politicians. The data is being used for studying parliamentary language and culture in Digital Humanities at several universities. To serve a wider variety of users, the entirety of this data was also produced using Parla-CLARIN markup. We present the first publication of all Finnish parliamentary debates as data.
Technical novelties in our approach include the use of both Parla-CLARIN and an RDF schema developed for representing the speeches, integration of the data into a new Parliament of Finland Ontology for deeper data analyses, and enrichment of the data with a variety of external national and international data sources.
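A SPARQL API of the kind described is queried by sending a query string to an endpoint. The sketch below only builds such a request; the endpoint URL and the property IRIs are invented placeholders, since the real FinnParla service defines its own schema.

```python
# Building (not sending) a SPARQL GET request for a speech knowledge graph.
# Endpoint URL and ex: property IRIs are illustrative placeholders.
import urllib.parse

ENDPOINT = "https://example.org/finnparla/sparql"  # hypothetical endpoint

query = """
PREFIX ex: <http://example.org/schema/>
SELECT ?speech ?speaker ?date WHERE {
  ?speech a ex:Speech ;
          ex:speaker ?speaker ;
          ex:date ?date .
}
ORDER BY ?date
LIMIT 10
""".strip()

# A SPARQL GET request encodes the query as a URL parameter:
url = ENDPOINT + "?" + urllib.parse.urlencode({"query": query})
print(url[:80])
```

Against the real service one would substitute the documented endpoint and ontology terms; the request shape itself is standard SPARQL-over-HTTP.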
  • Multi-Entity Models of Resource Description in the Semantic Web
    Multi-Entity Models of Resource Description in the Semantic Web: A comparison of FRBR, RDA, and BIBFRAME. Thomas Baker (Department of Library and Information Science, Sungkyunkwan University, Seoul, Korea), Karen Coyle (consultant, Berkeley, California, USA), and Sean Petiya (School of Library and Information Science, Kent State University, Kent, Ohio, USA). (Preprint)
    Abstract. Bibliographic description in emerging library standards is embracing a multi-entity model that describes varying levels of abstraction, from the conceptual work to the physical item. Three of these multi-entity models have been published as vocabularies using the Semantic Web standard Resource Description Framework (RDF): FRBR, RDA, and BIBFRAME. The authors test RDF data based on the three vocabularies using common Semantic Web-enabled software. The analysis demonstrates that the intended data structure of the models is not supported by the RDF vocabularies. In some cases this results in undesirable incompatibilities between the vocabularies, which will be a hindrance to interoperability in the open data environment of the Web.
    Data Files: The data files supporting this study are available at http://lod-lam.slis.kent.edu/wemi-rdf/
    Introduction. Most bibliographic metadata on the Web, such as data describing a book, article, or image, follows the implicit model of a single entity (a "resource") with attributes (properties). This model is reflected, for example, in the widely used Dublin-Core-based XML format of the Open Archives Initiative Protocol for Metadata Harvesting (OAI-PMH). Over the past two decades, however, the library world has developed more differentiated models of bibliographic resources. These models do not see a book as just a book, but as a set of entities variously reflecting the meaning, expression, and physicality of a resource.