Making a Common Graphical Language for the Validation of Linked Data

Degree project in Architecture, second cycle, 30 credits
Stockholm, Sweden 2017

Daniel Echegaray
KTH School of Computer Science and Communication
Master in Computer Science
Date: July 7, 2017
Supervisor: Cyrille Artho
Examiner: Tino Weinkauf
Swedish title: Skapandet av ett generiskt grafiskt språk för validering av länkad data.

Abstract

A variety of embedded systems is used in the design and construction of trucks at Scania. Because of their heterogeneity and complexity, such systems require many software tools to support embedded systems development. These tools need to form a well-integrated and effective development environment in order to ensure that product data is consistent and correct across the developing organisation. A prototype is under development which adapts a linked data approach for data integration; more specifically, this prototype adapts the Open Services for Lifecycle Collaboration (OSLC) specification for data integration. The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. The user is further allowed to apply constraints on the data, conforming to the OSLC validation language Resource Shapes (ReSh). The problem is that the prototype conforms only to Resource Shapes, whose constraints are often too coarse-grained for Scania's needs, and that no standardised language exists for the validation of linked data. Thus, to frame this study, two research questions were formulated: (1) How can a common graphical language be created for supporting all validation technologies of RDF data? and (2) How can this graphical language support the automatic generation of RDF graphs?

A case study is conducted in which the specific case is a software tool at Scania named SESAMM-tool. The case study included a constraint language comparison and a prototype extension. Furthermore, a design science research strategy is followed, in which an effective artefact was sought to answer the stated research questions. Design science promotes an iterative process that includes implementation and evaluation. Data was collected empirically in an iterative development process and evaluated using informed argument for the constraint language comparison and a controlled experiment for the extension of the prototype.

Two constraint languages were investigated: Shapes Constraint Language (SHACL) and Shape Expressions (ShEx). The comparison concluded that SHACL is the constraint language with the larger domain of constraints, with finer-grained constraints, and with the possibility of defining new constraints. This conclusion rests on the measurement that SHACL constraints cover 89.5% of ShEx constraints, whereas ShEx constraints cover 67.8% of SHACL constraints. SHACL and ShEx were measured to cover 75% and 50% of ReSh property constraints, respectively. SHACL was recommended and chosen for extending the prototype. In extending the prototype, abstract superclasses were introduced into the underlying data model. Constraint language classes were defined as subclasses of these, and SHACL was added as one such subclass. This design increased code reuse within the prototype but gave rise to issues relating to the plug-in technologies on which the prototype is based. The current solution still has the issue that properties of one constraint language may be added to classes of another constraint language.

Sammanfattning (Summary)

A variety of embedded systems are used in the design and construction of trucks at Scania.
Because of their heterogeneity and complexity, such systems require many software tools to support embedded systems development. These tools must form a well-integrated and effective development environment to ensure that product data is consistent and correct across the development organisation. A prototype is under development that adapts a linked data approach to data integration; more specifically, it adapts a data integration specification developed by Open Services for Lifecycle Collaboration (OSLC). The prototype lets users design OSLC interfaces between product management tools and OSLC links between their data. Users may further apply constraints on the data that conform to the OSLC validation language Resource Shapes. The problem is that the prototype conforms only to Resource Shapes, whose constraints are often too coarse-grained for Scania's needs, and that no standardised language exists for the validation of linked data. Thus, to frame this study, two research questions were formulated: (1) How can a common graphical language be created to support all validation technologies for RDF data? and (2) How can this graphical language support the automatic generation of RDF graphs?

A case study is conducted in which the specific case is a software tool at Scania named SESAMM-tool. The case study comprised a comparison of validation languages and further development of the prototype. Design science is followed as the research strategy, in which an effective artefact was sought to answer the stated research questions. Design science promotes an iterative process that includes implementation and evaluation. Data was collected empirically in an iterative manner and evaluated using the evaluation methods informed argument and controlled experiment, for the validation language comparison and the further development of the prototype respectively.

Two validation languages were investigated: Shapes Constraint Language (SHACL) and Shape Expressions (ShEx). The comparison concluded that SHACL is the validation language with the larger domain of constraints, with finer-grained constraints, and with the possibility of defining new constraints. This was based on SHACL constraints being measured to cover 89.5% of ShEx constraints, and 67.8% for the converse. SHACL and ShEx coverage of Resource Shapes property constraints was measured at 75% and 50%, respectively. SHACL was recommended and chosen for the further development of the prototype. During that development, abstract superclasses were introduced into the underlying data model. The superclasses essentially took over the role of the earlier validation language classes, which instead became subclasses. SHACL is defined as one such subclass. This design offered high code reuse within the prototype but also gave rise to problems relating to the plug-in technologies on which the prototype is built. The current solution still has the problem that properties of one validation language can be added to classes of another validation language.

Contents

Contents
List of Figures
List of Tables
1 Introduction
1.1 Problem and Research Question
1.2 Purpose
1.3 Ethics and Sustainability
1.4 Scope
1.5 Limitations
1.6 Disposition
2 Background
2.1 Linked data
2.2 Open Services for Lifecycle Collaboration
2.3 Resource Description Framework
2.4 OSLC Tool-chain
2.5 RDF Constraint languages
2.6 Summary
3 Related Work
3.1 Shapes Constraint Language
3.2 Shapes Expression
3.3 OSLC Resource Shape
3.4 SPARQL Inferencing Notation
3.5 Web Ontology Language
3.6 Description Set Profiles
3.7 Summary
4 Lyo toolchain modeling and code generation prototype
4.1 Functionality
4.2 Extensions
4.3 Technologies
4.3.1 Eclipse Modeling Framework Core
4.3.2 Sirius
4.3.3 Acceleo
4.4 Summary
5 Research Method
5.1 Research Phases
5.1.1 Case study
5.2 Design Science
5.2.1 Design as an Artifact
5.2.2 Problem Relevance
5.2.3 Design Evaluation
5.2.4 Research Contribution
5.2.5 Research Rigor
5.2.6 Design as a Search Process
5.2.7 Communication of Research
5.3 Research Strategy Motivation
5.4 Summary
6 Constraint Language Comparison
6.1 Features
6.2 Constraint coverage
6.3 Summary
7 Implementation
7.1 Evaluation
7.1.1 Task
7.1.2 Evaluation Criteria
7.2 Iterative Process
7.2.1 First iteration: Learn by doing
7.2.2 Second iteration: Inheritance for code reuse
7.2.3 Third iteration: Abstract super class for cohesion
7.2.4 Fourth iteration: reference attributes and backwards compatibility
7.2.5 Fifth iteration: Breaking name conventions and code clean up
7.3 Summary
8 Discussion and Conclusion
8.1 Comparison between constraint languages
8.2 Implementation
8.3 Research findings
9 Future Work
Bibliography
A Lyo prototype meta-model
B SHACL on ShEx coverage
C ShEx on SHACL coverage
D SHACL on ReSh coverage
E ShEx on ReSh coverage

List of Figures

2.1 An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.
4.1 A simple high-level model of how three tools are connected through their data. The letter 'P' stands for producing data and 'C' for consuming data.
4.2 A simple conceptual model of how the prototype currently works and how it should be extended.
5.1 An overview of the research phases.
5.2 An overview of how design science research was applied for the implementation in this thesis.
6.1 Top left: SHACL and ReSh. Top right: ShEx and ReSh. Bottom left: SHACL and ShEx. Bottom right: SHACL, ShEx and ReSh.
6.2 To the left, amount of ShEx constraints covered by SHACL. To the right, amount of SHACL constraints covered by ShEx.
6.3 To the left, amount of ReSh constraints covered by SHACL. To the right, amount of ReSh constraints covered by ShEx.
7.1 A modelled figure replicating a subset of the SESAMM-tool database with classes and properties obfuscated.
7.2 The meta-model extension in the first iteration.
7.3 A model designed in the first iteration. All elements on the left side conform to pre-existing ReSh constraints.
