DEGREE PROJECT IN ARCHITECTURE, SECOND CYCLE, 30 CREDITS
STOCKHOLM, SWEDEN 2017

Making a common graphical language for the validation of linked data.

DANIEL ECHEGARAY

KTH SCHOOL OF COMPUTER SCIENCE AND COMMUNICATION

Making a common graphical language for the validation of linked data.

DANIEL ECHEGARAY

Master in Computer Science
Date: July 7, 2017
Supervisor: Cyrille Artho
Examiner: Tino Weinkauf
Swedish title: Skapandet av ett generiskt grafiskt språk för validering av länkad data.
School of Computer Science and Communication

Abstract

A variety of embedded systems is used in the design and construction of trucks at Scania. Because of their heterogeneity and complexity, such systems require many software tools to support embedded systems development. These tools need to form a well-integrated and effective development environment in order to ensure that product data is consistent and correct across the developing organisation. A prototype is under development that adopts a linked data approach to data integration; more specifically, it adopts the Open Services for Lifecycle Collaboration (OSLC) specification. The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. The user can further apply constraints on the data, conforming to the OSLC validation language Resource Shapes (ReSh).

The problem is that the prototype conforms only to Resource Shapes, whose constraints are often too coarse-grained for Scania’s needs, and that there exists no standardised language for the validation of linked data. To frame this study, two research questions were formulated: (1) How can a common graphical language be created for supporting all validation technologies of RDF-data? and (2) How can this graphical language support the automatic generation of RDF-graphs?

A case study is conducted in which the specific case is a software tool at Scania named SESAMM-tool. The case study included a constraint language comparison and a prototype extension. Furthermore, a design science research strategy is followed, in which an effective artefact was sought for answering the stated research questions. Design science promotes an iterative process including implementation and evaluation. Data has been collected empirically in an iterative development process and evaluated using the method of informed argument for the constraint language comparison and the method of controlled experiment for the extension of the prototype.

Two constraint languages were investigated: Shapes Constraint Language (SHACL) and Shapes Expression (ShEx). The comparison concluded that SHACL is the constraint language with the larger domain of constraints, with finer-grained constraints, and with the possibility of defining new constraints. This was based on the measurement that SHACL constraints cover 89.5% of ShEx constraints, while ShEx constraints cover 67.8% of SHACL constraints. The SHACL and ShEx coverage of ReSh property constraints was measured at 75% and 50%, respectively. SHACL was recommended and chosen for extending the prototype. In extending the prototype, abstract super classes were introduced into the underlying data model. Constraint language classes were defined as subclasses, and SHACL was added as one such subclass. This design increased code reuse within the prototype but gave rise to issues relating to the plug-in technologies that the prototype is based upon. The current solution still has the issue that properties of one constraint language may be added to classes of another constraint language.

Summary (Sammanfattning)

A variety of embedded systems is used in the design and construction of trucks at Scania. Because of their heterogeneity and complexity, such systems require the use of many software tools to support embedded systems development. These tools must form a well-integrated and effective development environment to ensure that product data is consistent and correct across the development organisation. A prototype is being developed that adopts a linked data approach to data integration; more specifically, it adopts a data integration specification developed by Open Services for Lifecycle Collaboration (OSLC). The prototype allows users to design OSLC interfaces between product management tools and OSLC links between their data. The user may further apply constraints on the data that conform to the OSLC validation language Resource Shapes.

The problem lies in the prototype conforming only to Resource Shapes, whose constraints are often too coarse for Scania’s needs, and in the absence of a standardised language for the validation of linked data. To frame this study, two research questions were therefore formulated: (1) How can a common graphical language be created for supporting all validation technologies of RDF-data? and (2) How can this graphical language support the automatic generation of RDF-graphs?

A case study is conducted in which the specific case is a software tool at Scania called SESAMM-tool. The case study comprised a comparison of validation languages and further development of the prototype. Furthermore, design science is followed as the research strategy, in which an effective artefact was sought to answer the stated research questions. Design science promotes an iterative process including implementation and evaluation. Data has been collected empirically in an iterative manner and evaluated using the evaluation methods informed argument and controlled experiment, for the validation language comparison and the further development of the prototype respectively.

Two validation languages were investigated: Shapes Constraint Language (SHACL) and Shapes Expression (ShEx). The validation language comparison concluded that SHACL is the validation language with a larger domain of constraints, finer-grained constraints and the possibility of defining new constraints. This was based on SHACL constraints being measured to cover 89.5% of ShEx constraints, and 67.8% for the converse. The SHACL and ShEx coverage of Resource Shapes property constraints was measured at 75% and 50% respectively. SHACL was recommended and chosen for further developing the prototype. During the further development of the prototype, abstract super classes were introduced into the underlying data model. The super classes essentially took the role of the earlier validation language classes, which instead became subclasses. SHACL is stated as one such subclass. This design offered high code reuse within the prototype but also gave rise to problems related to the plug-in technologies that the prototype builds upon. The current solution still has the problem that properties of one validation language can be added to classes of another validation language.

Contents

List of Figures
List of Tables

1 Introduction
    1.1 Problem and Research Question
    1.2 Purpose
    1.3 Ethics and Sustainability
    1.4 Scope
    1.5 Limitations
    1.6 Disposition

2 Background
    2.1 Linked data
    2.2 Open Services for Lifecycle Collaboration
    2.3 Resource Description Framework
    2.4 OSLC Tool-chain
    2.5 RDF Constraint languages
    2.6 Summary

3 Related Work
    3.1 Shapes Constraint Language
    3.2 Shapes Expression
    3.3 OSLC Resource Shape
    3.4 SPARQL Inferencing Notation
    3.5 Web Ontology Language
    3.6 Description Set Profiles
    3.7 Summary

4 Lyo toolchain modeling and code generation prototype
    4.1 Functionality
    4.2 Extensions
    4.3 Technologies
        4.3.1 Eclipse Modeling Framework Core
        4.3.2 Sirius
        4.3.3 Acceleo
    4.4 Summary

5 Research Method
    5.1 Research Phases
        5.1.1 Case study
    5.2 Design Science
        5.2.1 Design as an Artifact
        5.2.2 Problem Relevance
        5.2.3 Design Evaluation
        5.2.4 Research Contribution
        5.2.5 Research Rigor
        5.2.6 Design as a Search Process
        5.2.7 Communication of Research
    5.3 Research Strategy Motivation
    5.4 Summary

6 Constraint Language Comparison
    6.1 Features
    6.2 Constraint coverage
    6.3 Summary

7 Implementation
    7.1 Evaluation
        7.1.1 Task
        7.1.2 Evaluation Criteria
    7.2 Iterative Process
        7.2.1 First iteration: Learn by doing
        7.2.2 Second iteration: Inheritance for code reuse
        7.2.3 Third iteration: Abstract super class for cohesion
        7.2.4 Fourth iteration: reference attributes and backwards compatibility
        7.2.5 Fifth iteration: Breaking name conventions and code clean up
    7.3 Summary

8 Discussion and Conclusion
    8.1 Comparison between constraint languages
    8.2 Implementation
    8.3 Research findings

9 Future Work

Bibliography

A Lyo prototype meta-model
B SHACL on ShEx coverage
C ShEx on SHACL coverage
D SHACL on ReSh coverage
E ShEx on ReSh coverage

List of Figures

2.1 An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.

4.1 A simple high-level model of how three tools are connected through their data. The letter ’P’ stands for producing data and ’C’ for consuming data.
4.2 A simple conceptual model of how the prototype currently works and how it should be extended.

5.1 An overview of the research phases.
5.2 An overview of how design science research was applied for the implementation in this thesis.

6.1 Top left: SHACL and ReSh. Top right: ShEx and ReSh. Bottom left: SHACL and ShEx. Bottom right: SHACL, ShEx and ReSh.
6.2 To the left, the number of ShEx constraints covered by SHACL. To the right, the number of SHACL constraints covered by ShEx.
6.3 To the left, the number of ReSh constraints covered by SHACL. To the right, the number of ReSh constraints covered by ShEx.

7.1 A modelled figure replicating a subset of SESAMM-tool with classes and properties obfuscated.
7.2 The meta-model extension in the first iteration.
7.3 A model designed in the first iteration. All elements on the left side conform to pre-existing ReSh constraints. The elements on the right conform to the extended constraint language, SHACL. The language elements are spread over two domains with two associated namespaces depicted as ’nsp’, allowing any element to be reached by following its unique URL.
7.4 The meta-model extension in the second iteration. Inheriting from the original classes.
7.5 The meta-model extension in the third iteration. An abstract resource and property.
7.6 The meta-model extension in the fourth iteration. An abstract Shape and Property are applied as adaptors in an adaptor pattern with ShaclShape and ShaclProperty as adaptee classes. The abstract elements are referenced similarly to Resource and ResourceProperty, excluding the keyword ’resource’.
7.7 The meta-model extension in the fifth iteration. An abstract Shape and Property as adaptors, with the additional adaptee classes Resource and ResourceProperty.

A.1 Meta-model of the prototype

List of Tables

6.1 Table comparing features in the languages of SHACL and ShEx

B.1 Table of how SHACL Core constraint components are covered by constraints of the ShEx language

C.1 Table of how ShEx Node Constraint Semantics are covered by constraints of the SHACL language

D.1 Table of how ReSh Property Constraints are covered by constraints of the SHACL language

E.1 Table of how ReSh Property Constraints are covered by constraints of the ShEx language

Chapter 1

Introduction

“Begin at the beginning,” the King said, gravely, “and go on till you come to an end; then stop”.

Lewis Carroll, Alice in Wonderland

This thesis is part of the ESPRESSO project, a collaboration between Scania and the Royal Institute of Technology (KTH). The overall objective of the project is to develop and adapt model-based techniques that improve quality and reduce costs for embedded systems in trucks, focusing on safety-critical systems[1].

A variety of embedded systems are used within Scania. Because of their heterogeneity and complexity, such systems require the use of many software tools to support embedded systems development. These tools need to form a well-integrated and effective Development Environment (DE) in order to ensure that product data is consistent and correct across the developing organisation.

Currently, a modelling tool is under development, which will be referred to as the prototype. The prototype follows a linked data approach for data integration. This architecture is the opposite of a centralised integration approach, in which all data is stored in one place. The prototype allows users to create models of how tools share their data by following a set of constraint rules. The tools are integrated at the data level, forming a tool-chain. While a distributed linked data approach seems promising, it makes it challenging to understand and manage the overall information structure that is now handled across the many tools. In particular, it is necessary to investigate how such a distributed approach to data management can be reconciled with the need to have control over the overall information model in the organisation.

Model-based development reduces errors and misunderstandings between different sections within companies [2]. Thus, a tool like this can be valuable. The question is how to extend this graphical language to support several constraint languages and how the code that the prototype generates could be supported by external validation software modules.
A case study is conducted in which the specific case is a software tool called SESAMM-tool within Scania. The aim is to find which constraint language should be used and how the prototype can be extended to use it. SESAMM-tool has been developed at Scania in the department RESA. It is a tool for modelling functionalities within Scania’s vehicle system.

At the time of writing, the prototype supports one constraint language, OSLC Resource Shapes. The problem is two-fold. First, the property constraints defined by the OSLC Resource Shapes vocabulary are broad and general. Being very coarse-grained, they do not allow validation rules of a more specific and complex nature to be constructed. This makes the tool impractical at a large company such as Scania. The second problem lies within the prototype and concerns how it can be made compatible with an external automatic validation module. Scania plans to append modules to the prototype for performing automatic validation. These external validation modules make use of the code generated by the prototype. In this thesis, consideration has been given to how the generated code can support such a module for automatic validation. The validation module itself is outside the scope of this thesis.

1.1 Problem and Research Question

The purpose of this thesis is to construct a general graphical language for a tool-chain modelling system that conforms to a given validation language. At the time of writing, there exists a prototype that conforms to one validation language, and this thesis is set in the context of that prototype.

The question arises of how further validation languages can be integrated into the prototype, making it a common graphical language that supports more than one validation language for linked data. Possible approaches are to combine the validation languages in the graphical language by either conjunction or disjunction. Furthermore, while constructing the graphical language, consideration has been given to how it can support RDF-graph generation. Hence, the following two research questions were formulated.

1. How can a common graphical language be created for supporting all validation technologies of RDF-data?

2. How can this graphical language support the automatic generation of RDF-graphs?

1.2 Purpose

At the time of writing, the modelling prototype conforms to a constraint language for linked data named OSLC Resource Shapes (ReSh). The prototype automatically generates Java classes that represent validating ReSh resources. As mentioned, the problem with a modelling prototype restricted to ReSh in an industrial company such as Scania is that the property constraints defined in ReSh are coarse-grained. There is a need for more fine-grained constraints. A further reason for extending the prototype is that there exist software modules for automatic validation that support other large constraint languages, such as Shapes Constraint Language (SHACL) and Shapes Expression (ShEx). The extended purpose is to analyse different constraint languages from the perspective of a software tool named SESAMM-tool, and to investigate how the existing prototype can be extended to apply constraints on SESAMM-tool data and to support automatic validation.

1.3 Ethics and Sustainability

Ethics has been considered with respect to the thesis employer, Scania, by not exposing data that may harm the company and by having the thesis approved for publication by Scania ambassadors. The prototype extended in this thesis simplifies the process of achieving tool interoperability. It allows software tools to exchange information and, in this project, contributes to the quality and cost of embedded systems development, primarily at Scania, although the results may be generalised to other cases.

1.4 Scope

The study for this thesis was primarily carried out at Scania Tekniskt Centrum from January 2017 until June 2017. The main focus has been to study and investigate constraint languages and to further develop an existing prototype of a modelling tool for software tool-chains.

1.5 Limitations

The thesis has two obvious limitations. The first is that the work is case study research, conducted for the specific case of SESAMM-tool at the truck manufacturer Scania. Research should therefore be conducted for other scenarios to further generalise the results. The second limitation concerns the research question "How can a common graphical language be created for supporting all validation technologies of RDF-data?". Owing to the limited time available, it has not been possible to investigate all linked data validation technologies. Therefore, a subset of constraint languages has been chosen and analysed.

1.6 Disposition

The remaining chapters of this thesis are structured as follows. Chapter 2 covers the theoretical background, followed by related work in Chapter 3. Chapter 4 describes the prototype, its purpose and the intended extension. Chapter 5 describes the research method used in this study. Chapter 6 presents the results of the constraint language comparison, followed by the prototype implementation results in Chapter 7. Chapter 8 holds the discussion and conclusions, followed by future work in Chapter 9.

Chapter 2

Background

2.1 Linked data

Linked data is structured around three components that are linked together. The linked components are called a ’triple’, which consists of a subject, a predicate and an object. It should be possible to dereference each component to obtain its value and possibly other relations. Linked data is built upon web technologies such as Hypertext Transfer Protocol (HTTP), Resource Description Framework (RDF) and Uniform Resource Identifiers (URI). Linked data is data linked through web technologies. Instead of serving web pages, these technologies point to data readable by a computer.

The following example gives an intuitive feel for linked data: Alice -> From -> Sweden. Given this triple, a user is able to dereference any part of it. Dereferencing the subject ‘Alice’ should give further information about Alice and show other triples for which Alice is the subject. Dereferencing the predicate ‘From’ could give back the definition of the predicate, for instance: ’Has as subject a Person. The object should be the country of origin of the subject Person’. Dereferencing the object ‘Sweden’ should return a descriptive text about Sweden and possibly other triples linked to Sweden. This procedure continues for every component and can accumulate into large graphs.

The semantic web[3] also builds on these technologies and is described as a shared ’web of linked data’, where data is shared across application and enterprise boundaries [4]. In the case of Scania, the desire is to use linked data technology within the company boundaries.
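The dereferencing idea can be sketched in a few lines of Python. This is a minimal illustration only: the in-memory list stands in for data published on the web, and every triple other than Alice, From, Sweden is a hypothetical addition made up for the example.

```python
from collections import namedtuple

# A triple links a subject to an object through a predicate.
Triple = namedtuple("Triple", ["subject", "predicate", "object"])

# Hypothetical in-memory data set standing in for data published on the web.
# Only the first triple comes from the text; the others are invented.
GRAPH = [
    Triple("Alice", "From", "Sweden"),
    Triple("Alice", "Knows", "Bob"),
    Triple("Sweden", "PartOf", "Europe"),
]


def dereference(name, graph=GRAPH):
    """Return every triple in which `name` appears as the subject."""
    return [t for t in graph if t.subject == name]


def reachable(start, graph=GRAPH):
    """Follow links from `start` and collect every node that can be reached."""
    seen, todo = set(), [start]
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        # Each object of a triple can itself be dereferenced in the next round.
        todo.extend(t.object for t in dereference(node, graph))
    return seen
```

Calling `reachable("Alice")` walks from one dereferenced node to the next, which is how a handful of triples accumulates into a larger graph.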

2.2 Open Services for Lifecycle Collaboration

Open Services for Lifecycle Collaboration (OSLC) is an open community creating specifications for the integration of software tools. The specifications are based upon linked data and standards such as Representational State Transfer (REST) and Uniform Resource Locator (URL). The integration is done by integrating the data of tools and workflows in support of end-to-end processes. The community is separated into workgroups that work in different ‘domains’, each covering one topic. Examples are Change Management (CM), Requirement Management (RM) and Quality Management (QM). Workgroups investigate several integration scenarios within the domains and specify the common vocabularies needed to support the scenarios [5].

OSLC is mainly composed of the OSLC Core specification and the OSLC domain specifications. OSLC Core specifies a general interface between different tools. The idea is a minimalistic approach, in which the core vocabulary contains the bare minimum needed to act as a general specification for integrating tools [6].

The OSLC domain specifications specialise in different lifecycle tools. For example, the domain for change management has a vocabulary that supports data integration for change management tools. These domain specifications are meant to be further specified by users of the OSLC specification, so that rules can be made more specific for their own tools. The OSLC integration protocol consists of a conjunction of the OSLC Core specification and one or more OSLC domain specifications [7].

2.3 Resource Description Framework

Resource Description Framework (RDF) is a technique for implementing linked data, defined by the World Wide Web Consortium (W3C). An RDF document consists of three types of element: resources, properties and statements. A resource is addressed by a URI and may be part of a web page, an entire web page or a real-life object. A property is an attribute describing a resource and defines its permitted values and relations[8]. A resource together with a property and the value of that property forms a statement. Allowed property values are literals, resources or other statements [9, 8]. In this context a statement may be considered a triple, and a set of linked statements may be referred to as an RDF-graph.

2.4 OSLC Tool-chain

Perhaps the simplest example of a tool-chain is a compiler and linker, libraries and a debugger, where one tool gives input to the next. A variety of methods can be used for system integration when constructing a tool-chain. For the project in this study, it was achieved by linked data integration, meaning that each tool has its own data and exchanges information by having its data linked, mainly through RDF technology. An OSLC tool-chain follows the linked data approach and publishes its data by adopting an OSLC specification. On implementation, an adapter is created for each tool in the tool-chain to handle the communication in the tool-chain[10].

Figure 2.1 illustrates a simple OSLC tool-chain. The dotted lines represent data-level integration, linking data from different tools using RDF technology. An example is the predicate called testedBy in the figure. Let the subject data have the name changeRequest, residing in the change management tool, and the object have the name testCase, residing in the test management tool. An arbitrary adapter may then request the changeRequest data from the change management tool and simultaneously get the URIs of its predicate and its object. Subsequently, a request can be made to retrieve the testCase data from the test management tool. Adapters are the tool-chain’s means of communication: each essentially acts as a REST interface for its tool, and the tool-chain can scale by adding further tools that conform to the REST framework.

Figure 2.1: An illustration of lifecycle management tools integrated with a linked data approach and forming an OSLC toolchain.
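The testedBy example can be mimicked with two in-memory adapters. The sketch below is illustrative only: the `Adapter` class, the `example.org` URIs and the `status` property are assumptions, and a real OSLC adapter would answer HTTP requests rather than dictionary lookups.

```python
class Adapter:
    """Hypothetical stand-in for an OSLC adapter exposing one tool's data by URI."""

    def __init__(self, base_uri, resources):
        self.base_uri = base_uri
        # Maps a local resource name to that resource's properties.
        self.resources = resources

    def get(self, uri):
        """Mimic an HTTP GET: return the properties of the resource at `uri`."""
        if not uri.startswith(self.base_uri):
            raise KeyError(uri)
        return self.resources[uri[len(self.base_uri):]]


# Assumed example URIs and data; none of them come from the thesis.
CHANGE_MANAGEMENT = Adapter("http://cm.example.org/", {
    "changeRequest": {"testedBy": "http://tm.example.org/testCase"},
})
TEST_MANAGEMENT = Adapter("http://tm.example.org/", {
    "testCase": {"status": "passed"},
})
ADAPTERS = [CHANGE_MANAGEMENT, TEST_MANAGEMENT]


def fetch(uri):
    """Route a request to the adapter that owns the URI."""
    for adapter in ADAPTERS:
        if uri.startswith(adapter.base_uri):
            return adapter.get(uri)
    raise KeyError(uri)


def follow(subject_uri, predicate):
    """Fetch the subject, read the link stored under `predicate`, fetch the object."""
    subject = fetch(subject_uri)
    return fetch(subject[predicate])
```

Adding a further tool in this sketch amounts to appending one more adapter to `ADAPTERS`, which mirrors how the tool-chain is meant to scale.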

2.5 RDF Constraint languages

This section defines what is meant by a constraint language in the context of this thesis and briefly describes the existing variety in constraint languages for RDF data. Some constraint language technologies are not primarily intended for validating data but are in practice still used for it. Constraint languages differ in expressiveness, for example in how constraints may be expressed or whether the language supports inference. An example of inference can be stated with the two RDF triples ‘Eva-likes-Hubert’ and ‘Hubert-ofType-Dog’. By inference, the triple ‘Eva-likes-Dog’ may be derived. Depending on the context, inference can be an undesirable feature, as in the case that Eva likes Hubert but hates other dogs.

Unlike XML, whose standard constraint language is XML Schema, and SQL, whose standard constraint language is DDL[11], RDF data has no standard constraint language; only proposals exist. Some of the languages were still being worked on at the time of writing this thesis. RDF constraint languages include ShEx, SHACL and ReSh. These languages and a few additional ones are explained in the following chapter.
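The inference example can be written as a single hand-coded rule. The sketch below implements only the rule used in the text (from x likes y and y ofType T, derive x likes T); it is an illustrative assumption and not the entailment regime of any particular constraint language.

```python
def infer_likes_type(triples):
    """Derive (x, 'likes', T) from (x, 'likes', y) and (y, 'ofType', T)."""
    # Collect (instance, type) pairs from the ofType triples.
    types = {(s, o) for s, p, o in triples if p == "ofType"}
    derived = set()
    for s, p, o in triples:
        if p != "likes":
            continue
        for instance, type_ in types:
            if instance == o:
                derived.add((s, "likes", type_))
    # Report only triples that were not already present.
    return derived - set(triples)


DATA = {
    ("Eva", "likes", "Hubert"),
    ("Hubert", "ofType", "Dog"),
}
```

The rule has no way of knowing that Eva dislikes other dogs, which is exactly why such an inference can be undesirable during validation.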

2.6 Summary

Linked data consists of the linked components subject, predicate and object, each of which should ideally be dereferenceable to obtain information and additional links for that component. RDF is a common technique for implementing linked data and consists of a resource, a property defining constraints and the value of that property. These three components form a statement that may be considered a triple, and interlinked triples are called RDF-graphs. At the time of writing this thesis there exists no standard validation language for RDF data, only proposals. OSLC is a community creating specifications for tool interoperability and provides a standard that facilitates tool-chain integration. An OSLC tool-chain is a collection of tools that are integrated by adopting OSLC specification standards and, consequently, by a linked data approach.

Chapter 3

Related Work

This chapter describes constraint languages used for validating RDF data. The described languages are SHACL, ShEx, OSLC Resource Shapes, SPIN, OWL2 and DSP. For each language, an example is provided of a schema that validates an RDF data node.

3.1 Shapes Constraint Language

Shapes Constraint Language (SHACL) is a constraint language used for validating and describing the shape of RDF data[12]. The language is currently under development by the W3C Data Shapes Working Group[5]. SHACL makes use of a schema construct called shapes. In general, a shape is a collection of predicates with associated constraints that describes the shape of RDF data[13, 5]. A SHACL shape can be considered a collection of scopes and constraints, where scopes specify which data nodes should be validated and constraints determine how they should be validated. SHACL supports validation of graph-based and object-oriented data, unlike XML Schema, which is constrained to tree structures[14]. SHACL is based on RDF and provides a vocabulary for classes, properties and integrity constraints for instances.

    :User a sh:Shape ;
        sh:property [
            sh:predicate foaf:name ;
            sh:minCount 1 ;
            sh:maxCount 1 ;
            sh:datatype xsd:string ;
        ] ;
        sh:property [
            sh:predicate foaf:familyName ;
            sh:minCount 1 ;
            sh:maxCount 1 ;
            sh:datatype xsd:string ;
        ] ;
        sh:property [
            sh:predicate foaf:mbox ;
            sh:minCount 1 ;
            sh:maxCount 1 ;
            sh:nodeKind sh:IRI ;
        ] .
    :User sh:scopeNode :Daniel .

Code snippet 3.1: Example of a SHACL shape describing RDF-data of a User.


    :Daniel foaf:name "Daniel" ;
        a foaf:User ;
        foaf:familyName "Echegaray" ;
        foaf:mbox <mailto:darr@kth.se> .

Code snippet 3.2: RDF data, representing an instance of the class user.

Code snippet 3.1 shows a simple example of a User shape. The User consists of three properties, i.e. the predicates name, familyName and mbox. The cardinality of each property is exactly one, defined by the constraints "sh:minCount" and "sh:maxCount". The bottom line declares a scope for the User shape, allowing users to declare the nodes that the shape should target. An example of a node that validates correctly is shown in code snippet 3.2.

A common syntax in RDF data is to use prefix bindings for namespaces. A namespace defines a vocabulary of properties and classes. A user may follow the URL to get information about the defined node or property. For instance, the URL http://xmlns.com/foaf/spec/#term_givenName is equivalent to stating foaf:givenName, and the URL may be followed to get information about the property, similar to using a dictionary. Public vocabularies such as "foaf" allow users to describe their data in a non-proprietary way, for example the characteristics of people or other specialisations stated in the vocabulary.
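The cardinality part of the User shape can be checked with a small Python sketch. It is a simplified stand-in for a SHACL processor, not one: the dictionaries are a hand transcription of snippets 3.1 and 3.2 (with the mailbox written as a mailto IRI, which is an assumption), and only "sh:minCount" and "sh:maxCount" are modelled.

```python
# Data node from snippet 3.2, as predicate -> list of values (hand transcription).
DANIEL = {
    "foaf:name": ["Daniel"],
    "foaf:familyName": ["Echegaray"],
    "foaf:mbox": ["mailto:darr@kth.se"],
}

# Cardinality part of the User shape: predicate -> (minCount, maxCount).
USER_SHAPE = {
    "foaf:name": (1, 1),
    "foaf:familyName": (1, 1),
    "foaf:mbox": (1, 1),
}


def validate(node, shape):
    """Return a list of violation messages; an empty list means the node conforms."""
    violations = []
    for predicate, (min_count, max_count) in shape.items():
        count = len(node.get(predicate, []))
        if count < min_count:
            violations.append(
                f"{predicate}: {count} value(s), expected at least {min_count}")
        elif count > max_count:
            violations.append(
                f"{predicate}: {count} value(s), expected at most {max_count}")
    return violations
```

`validate(DANIEL, USER_SHAPE)` returns an empty list, while a node without a mailbox, or one with two names, yields one violation for the offending predicate.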

3.2 Shapes Expression

Shapes Expression (ShEx) is a language with notions of regular expressions. A ShEx schema is a collection of labelled shapes and node constraints[15, 5]. The reader may think of a shape as a lesser schema used for validating an RDF-triple, and of a node constraint as a description of an RDF-node. More precisely, a ShEx schema consists of ’shapes expressions’, described as:

"A collection of shapes and of node constraints, possibly combined with AND, OR, and NOT expressions."[16]

A shape expression is composed of four kinds of object. A node constraint (1) defines the allowed values for a set of nodes. A shape constraint (2) applies constraints to the allowed neighbourhood of a node, where a neighbourhood is defined as the triples that contain the node as a subject or an object. An external shape (3) is used as an extension mechanism for ShEx. A shape reference (4) is used for identifying other shapes in a schema. The four objects may be combined with the operators AND, OR and NOT [15].
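How AND, OR and NOT compose simpler checks can be illustrated with plain predicate functions. This is a loose analogy rather than ShEx semantics: the helper names are hypothetical, and the IRI test is a crude assumption that only recognises a few URI schemes.

```python
def is_string(value):
    """Node-constraint-like check: the value is a string."""
    return isinstance(value, str)


def is_iri(value):
    """Crude stand-in for an IRI test (assumed schemes only)."""
    return isinstance(value, str) and value.startswith(
        ("http://", "https://", "mailto:"))


def all_of(*checks):
    """AND: every check must accept the value."""
    return lambda value: all(check(value) for check in checks)


def any_of(*checks):
    """OR: at least one check must accept the value."""
    return lambda value: any(check(value) for check in checks)


def negate(check):
    """NOT: accept exactly the values the check rejects."""
    return lambda value: not check(value)


# A plain string literal: a string that is not an IRI.
plain_string = all_of(is_string, negate(is_iri))
# Either a plain string or an IRI.
string_or_iri = any_of(plain_string, is_iri)
```

Each combinator returns a new check, so combined expressions can themselves be combined, in the same spirit as nested shape expressions.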

    :User {
        foaf:name xsd:string {1} ,
        foaf:familyName xsd:string ,
        foaf:mbox shex:IRI
    }

Code snippet 3.3: The User example, expressed as a ShEx shape

Code snippet 3.3 is a ShEx shape corresponding to the SHACL shape defined in snippet 3.1. The first property in the shape, ’name’, has a cardinality of exactly one, expressed by "{1}". This definition is superfluous, as the default cardinality in ShEx is 1, but is stated in the example for illustrative purposes. The language uses the regular expression symbols (?, +, *) to express the cardinalities zero-or-one, one-or-more and zero-or-more, respectively. Precise lower and upper bounds are expressed as {lowerBound, upperBound}.
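The cardinality notation can be made concrete by translating each marker into a pair of bounds. The function below is a sketch covering only the forms mentioned in the text; the function names and the use of `None` for "no upper bound" are assumptions made for the example.

```python
import re

UNBOUNDED = None  # marker for "no upper bound"


def parse_cardinality(marker=""):
    """Translate a ShEx-style cardinality marker into (lower, upper) bounds."""
    if marker == "":
        return (1, 1)  # the default cardinality is exactly one
    if marker == "?":
        return (0, 1)
    if marker == "+":
        return (1, UNBOUNDED)
    if marker == "*":
        return (0, UNBOUNDED)
    # {n} or {lowerBound, upperBound}
    match = re.fullmatch(r"\{\s*(\d+)\s*(?:,\s*(\d+)\s*)?\}", marker)
    if match is None:
        raise ValueError(f"unsupported cardinality marker: {marker!r}")
    lower = int(match.group(1))
    upper = int(match.group(2)) if match.group(2) is not None else lower
    return (lower, upper)


def within(count, bounds):
    """Check whether a number of occurrences satisfies the bounds."""
    lower, upper = bounds
    return lower <= count and (upper is UNBOUNDED or count <= upper)
```

Under this reading, "{1}" and the empty marker both give the bounds (1, 1), which is why the explicit "{1}" in snippet 3.3 is redundant.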

3.3 OSLC Resource Shape

OSLC Resource Shape (ReSh) is a high-level RDF vocabulary for validating and describing the shape of RDF-data [17]. A resource shape may list properties, each of which has to specify its occurrence, property definition and name, and may specify further attributes[18, 19].

    a oslc:ResourceShape ;
    oslc:describes ;
    oslc:property [
        a oslc:Property ;
        oslc:name "name" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:name ;
        oslc:valueType xsd:string ;
    ] ;
    oslc:property [
        a oslc:Property ;
        oslc:name "familyName" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:familyName ;
        oslc:valueType xsd:string ;
    ] ;
    oslc:property [
        a oslc:Property ;
        oslc:name "mbox" ;
        oslc:occurs oslc:Exactly-one ;
        oslc:propertyDefinition foaf:mbox ;
        oslc:valueType oslc:Resource ;
    ] ;

Code snippet 3.4: User example as an OSLC resource shape

An OSLC resource shape corresponding to the shape in snippet 3.1 is defined in snippet 3.4. The shape in snippet 3.4 validates the data in code snippet 3.2 as correct. Similar to a SHACL shape, an OSLC shape lists the properties it contains. The property constraints "oslc:name", "oslc:occurs" and "oslc:propertyDefinition" are required; they respectively define the name of the property, the cardinality of the property and the URI of the property.

3.4 SPARQL Inferencing Notation

SPARQL is an RDF query language. The SPARQL Inferencing Notation (SPIN)[20] is based on SPARQL and essentially provides an abstract vocabulary that represents SPARQL queries in RDF notation[5]. Thus, a user does not have to know SPARQL, although SPARQL can be used as an extension mechanism for constructing custom rules.

    ss:User
        spin:constraint [
            rdf:type spl:Attribute ;
            spl:maxCount 1 ;
            spl:minCount 1 ;
            spl:predicate foaf:name ;
            spl:valueType xsd:string ;
        ] ;
        spin:constraint [
            rdf:type spl:Attribute ;
            spl:maxCount 1 ;
            spl:minCount 1 ;
            spl:predicate foaf:givenName ;
            spl:valueType xsd:string ;
        ] ;
        spin:constraint [
            rdf:type spl:Attribute ;
            spl:maxCount 1 ;
            spl:minCount 1 ;
            spl:predicate foaf:mbox ;
            spl:valueType spl:optional ;
        ] .

Code snippet 3.5: A SPIN example, providing validation rules for a user.

The SPIN language provides functionality for inference, allowing node values to be calculated from other nodes. An example use case is the property area, which may be calculated from a width and a height. Example 3.5 does not contain any specific SPIN logic; this keeps it consistent with the other examples in this chapter and gives the reader a simple, comparable overview of the described languages.
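The idea of inferring one value from others can be sketched as follows. The sketch is written in plain Python rather than SPIN or SPARQL, so it only mirrors the concept, not the notation. The predicates ex:width, ex:height and ex:area, the function infer_area and the sample data are all hypothetical.

```python
from typing import Dict, List, Tuple

# A triple with a numeric object: (subject, predicate, value).
Triple = Tuple[str, str, float]


def infer_area(triples: List[Triple]) -> List[Triple]:
    """Derive an ex:area triple for every subject that has a width and a height."""
    widths: Dict[str, float] = {}
    heights: Dict[str, float] = {}
    for subject, predicate, value in triples:
        if predicate == "ex:width":
            widths[subject] = value
        elif predicate == "ex:height":
            heights[subject] = value
    # Only subjects for which both inputs are known yield an inferred triple.
    return [
        (subject, "ex:area", widths[subject] * heights[subject])
        for subject in widths
        if subject in heights
    ]


data: List[Triple] = [
    ("ex:rect1", "ex:width", 2.0),
    ("ex:rect1", "ex:height", 3.0),
    ("ex:rect2", "ex:width", 4.0),  # no height given, so no area is inferred
]
inferred = infer_area(data)
```

The inferred triples are new statements added next to the original data, which is the difference between inference and pure validation.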

3.5 Web Ontology Language

By definition, ontology is a branch of metaphysics that concerns the nature of entities and their relations¹.

The Web Ontology Language (OWL), more specifically OWL 2, builds on three notions: axioms, entities and expressions. An axiom is a statement expressed in an OWL ontology, e.g. "It is raining" or "All ravens are black". Entities are essentially atoms, meaning any real-world object, including relations. Expressions are entities combined with constructors. For example, the entities "Male" and "User" combined with a conjunctive constructor form an expression[21]. The language is declarative, meaning that a state of affairs is described in a logical way, and correct answers (validation) are determined by formal semantics (mathematical models of the relations between expressions)[22, 21].

OWL uses a reasoner instead of a validator. Inferencing is used when validating ontologies, and OWL does not make a ’Unique Name Assumption’, meaning that two different URIs are not assumed to denote two different resources. An informal example is that two nodes, Daniel and Yash, of the same class User may be inferred to be the same resource [18].

¹ https://www.merriam-webster.com/dictionary/ontology

OWL makes an ’Open World Assumption’, whereas SQL makes a ’Closed World Assumption’. This means that if an SQL query asks ’is Daniel a user?’ and the SQL database does not include a user named Daniel, the reply is false, whereas a reasoner in OWL 2 would reply ’possibly true’[21].
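The difference between the two assumptions can be made concrete with a small Python sketch. It is a deliberate simplification and does not model an SQL engine or an OWL 2 reasoner: the set known_users and both functions are hypothetical, and None is an assumption used here to stand for "unknown, possibly true".

```python
from typing import Optional, Set

# Hypothetical knowledge base: the only statement is that Yash is a user.
known_users: Set[str] = {"Yash"}


def is_user_closed_world(name: str) -> bool:
    """Closed world: whatever is not stated is taken to be false."""
    return name in known_users


def is_user_open_world(name: str) -> Optional[bool]:
    """Open world: a missing statement means unknown (None), not false."""
    return True if name in known_users else None
```

For the name "Daniel", the closed-world function answers False, while the open-world function answers None, corresponding to the reply ’possibly true’.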

    ex: a owl:Ontology
    foaf:User a owl:class

    foaf:name a owl:ObjectProperty ,
        owl:minCardinality 1 ;
        owl:maxCardinality 1 ;
        owl:datatype xsd:string .
    foaf:familyName a owl:ObjectProperty ,
        owl:minCardinality 1 ;
        owl:maxCardinality 1 ;
        owl:datatype xsd:string .
    foaf:mbox a owl:ObjectProperty ,
        owl:minCardinality 1 ;
        owl:maxCardinality 1 ;
        owl:datatype xsd:anyURI .

Code snippet 3.6: An OWL ontology of the resource User

The listed ontology 3.6 is in Turtle (ttl) format. The ontology describes a user and validates the RDF node in 3.2 as correct.

3.6 Description Set Profiles

Description Set Profile (DSP) is, as the name implies, a description of constraints for a description set. A description set is a set of one or more descriptions, each describing a resource. A description consists of one or more statements regarding only one resource[23]. The intended usage of a DSP is to evaluate whether a meta-data record conforms to the DSP.

DSP uses the notions of templates and constraints. There are two levels of templates: description templates and statement templates. A description template applies to a single description and contains the constraints of the resource. A statement template applies to one statement and contains constraints for properties[24]. The templates may be seen as containers for constraints and identifiers, applied to resources or properties.

    foaf:User
        foaf:name
        foaf:givenName
        foaf:mbox

Code snippet 3.7: An example of how DSP may be used, to apply constraints on a User

Code snippet 3.7 shows an example of how constraints are applied to the user data in 3.2. The DSP has a tree-like structure with three ’StatementTemplates’, each containing a property of the user.

3.7 Summary

In general, a shape can be described as a collection of predicates with associated constraints, describing the shape of an RDF-graph. The shape schema construct is used in ReSh, SHACL and ShEx. SPIN is based on SPARQL and consists of an abstract vocabulary representing SPARQL queries. OWL 2 is a declarative language that uses a reasoner for validating data and applies an ’open world assumption’ to data. DSP is a set of descriptions for statements; the main focus of the language is to examine whether a meta-data record conforms to it.

Chapter 4

Lyo toolchain modeling and code generation prototype

The purpose of this chapter is to give the reader an overall understanding of the prototype on which the implementation work has been conducted. The prototype builds on the Lyo project. The Lyo project is an educational project that aids the Eclipse community by providing a software development kit that helps it adopt the OSLC specification and subsequently build OSLC-compliant tools. During this thesis, other extensions were also applied to the prototype, in the form of an automatic validation module. These other extensions are outside the scope of this thesis, but they have been taken into consideration. The differences are explained and clarified in this chapter.

Section 4.1, Functionality, explains the purpose and functionality of the prototype and its intended function within the industry. Section 4.2, Extensions, covers the extensions implemented during this thesis and distinguishes them from extensions made by others. The last section, 4.3, Technologies, lists and explains the technologies that the prototype is based on and describes how they have been applied within the prototype.

4.1 Functionality

The prototype consists of three views: domain specification, adapter interface and tool-chain view. The three views allow the modelling user, referred to as the ’architect’, to model an OSLC tool-chain[2, 25]. The purpose of the three views is to reduce complexity for the architect, which is achieved by having the views represent different levels of abstraction of an OSLC tool-chain.

Through model-based development, the prototype supports a linked data approach to software tool interoperability conforming to OSLC standards[2]. The prototype allows an architect to design high-level models of how different tools are related and of what data they produce and consume, as illustrated in figure 4.1. The figure shows how management tools both produce and consume each other’s data, depicted by the letters ’P’ and ’C’. The prototype further allows an architect to design a lower-level model of how data is related and how constraints conforming to ReSh are applied to the data.

After both a high-level and a low-level model have been designed, the prototype can generate runnable code. The generated code is the code of an OSLC tool-chain, with adaptors for the specified management tools and ReSh resources for validating data. The adaptors should be integrated with a management tool and its database, and communicate using the HTTP protocol in a REST style.

Figure 4.1: A simple high-level model of how three tools are connected through their data. The letter ’P’ stands for producing data and ’C’ for consuming data.

To give the reader an intuitive understanding of how the prototype and the generated code work, a simple example scenario is described. The scenario involves two management tools, one called Bugzilla and another called Requirement Management, abbreviated as RM. Bugzilla produces data called ChangeRequest. RM produces data called Requirement. A company that uses these management tools wishes to link their data by a linked data approach, e.g. ’Requirement -> originateFrom -> ChangeRequest’. A solution is to use the prototype, which allows an architect to model how these management tools should be integrated. The architect may further apply constraints, for example that a ChangeRequest has to have an author and that the author is of type Employee. When the architect is done modelling, the prototype may be used to generate runnable code. The code allows the two tools to communicate by means of the OSLC specification and to use the generated ReSh shapes for validating their data.
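The scenario can be sketched in a few lines of Python. The sketch does not show the Java code that the prototype generates; it only illustrates the link between the two kinds of data and the two example constraints. All identifiers with the ex: prefix, the sample graph and the function names are hypothetical.

```python
from typing import List, Set, Tuple

Triple = Tuple[str, str, str]

# Hypothetical linked data produced by the two tools.
graph: Set[Triple] = {
    ("ex:req1", "rdf:type", "ex:Requirement"),
    ("ex:req1", "ex:originateFrom", "ex:cr1"),  # link from RM data to Bugzilla data
    ("ex:cr1", "rdf:type", "ex:ChangeRequest"),
    ("ex:cr1", "ex:author", "ex:alice"),
    ("ex:alice", "rdf:type", "ex:Employee"),
    ("ex:cr2", "rdf:type", "ex:ChangeRequest"),  # has no author
}


def objects(g: Set[Triple], subject: str, predicate: str) -> List[str]:
    """Return all objects of the given subject and predicate, sorted."""
    return sorted(o for s, p, o in g if s == subject and p == predicate)


def check_change_requests(g: Set[Triple]) -> List[str]:
    """Report every ChangeRequest without an author or with a non-Employee author."""
    problems: List[str] = []
    change_requests = sorted(
        s for s, p, o in g if p == "rdf:type" and o == "ex:ChangeRequest"
    )
    for change_request in change_requests:
        authors = objects(g, change_request, "ex:author")
        if not authors:
            problems.append(f"{change_request} has no author")
        for author in authors:
            if "ex:Employee" not in objects(g, author, "rdf:type"):
                problems.append(f"{author} is not an Employee")
    return problems
```

In the sample graph, only ex:cr2 is reported, because it is a ChangeRequest without an author.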

4.2 Extensions

The modelling prototype conforms to the OSLC specification and generates Java code representing ReSh resources [26]. An objective of this thesis is to investigate how the prototype can be extended to conform to other constraint languages and to generate Java resources representing the extended language. A graphic example of which extensions are included in the scope of this thesis, and which are not, is given in figure 4.2. The dashed arrows represent how the prototype is to be extended: the elements ’Modelling’ and ’Java class representing shape’ already conform to the element ’OSLC’ and should be extended to the element ’Constraint language’. An element named ’Automatic validation’ is also present in the figure; this extension is outside the scope of this thesis but is applied to the prototype in parallel. When extending ’Java class representing shape’, consideration has been taken to support the automatic validation, which is therefore positioned halfway outside of ’Prototype’. Besides automatic validation, a purpose of the module is to use the generated Java resources and convert them to RDF-graphs that can either be populated with data or used as a validating schema.

Figure 4.2: A simple conceptual model of how the prototype currently works and how it should be extended.

4.3 Technologies

4.3.1 Eclipse Modeling Framework Core

EMF is a set of plug-ins that can be used to model data and generate code. EMF distinguishes between its meta-model and an actual model. The meta-model describes the overall structure of the model, whereas the model itself is a concrete instance of the meta-model. The user defines a domain model, which can be used to generate Java code. A domain model consists of the data. The EMF tools allow the data model to be modelled with UML diagrams[27]. EMF is used for constructing a meta-model in the prototype. The meta-model and underlying data model that the prototype is based on are illustrated in Appendix A.1.

4.3.2 Sirius

Sirius is a framework for creating graphical modelling workbenches. Sirius is based on EMF structured data; in other words, Sirius visualises an EMF data model and allows functions to be expressed in it.

Sirius has a model and editors. The model defines the complete structure of the modelling workbench. The editors consist of diagrams, tables and trees.

Sirius is made up of two parts, the Specification Environment (SE) and the Runtime Environment (RE). The SE is for the specifier/developer who creates the functionality of the modelling tool. The RE is for the architect who uses the modelling tool. The specification is executed in the RE and is viewpoint based, meaning that different levels of representation are organised in different viewpoints with the purpose of reducing complexity for the intended user. An example of this, which corresponds to the prototype, is that there exist three different views, i.e. domain specification, adapter interface and tool-chain view. These three views exhibit different levels of detail within the tool-chain, which also makes each view more specialised for its intended user[28]. Besides the purpose of creating a modelling tool, Sirius supports plug-in extensions for code generation, document generation and validation[29].

4.3.3 Acceleo

Acceleo[30] is a code generation module that allows its users to generate code and provides tools for doing so. Acceleo implements the language ’Model To Text Transformation’¹ (MTL), which in turn uses EMF data models. The code generator module in the prototype is based on Acceleo. The work in this module has, to a large extent, been synchronised with the work of another thesis student on an automatic validation module, meaning that the output of the code generator module has been modified to conform with the validation module.

4.4 Summary

The purpose of the prototype is to allow users to create models. The models may be used on a day-to-day basis, for discussion, and to give an overview of data and tool relations.

The prototype allows a user to model relations between data and tools, and to generate code corresponding to the model. The generated code consists of interfaces in the form of OSLC adaptors, which may be applied to management tools for constructing an OSLC tool-chain. The prototype further allows users to apply ReSh constraints to data and generate corresponding Java resources. An additional extension under development is the ability to use the generated Java resources to automatically create RDF-triples and store them in a triple store. A triple store can be seen as the RDF equivalent of an SQL database.

The prototype is based on three different technologies. EMF is the technology used to define the underlying data model of the prototype. Sirius is the technology used for constructing the user interface of the prototype and is based upon the EMF data model. Acceleo is the technology that the code generator module is based on. During code generation, it takes the model that the user designed in Sirius as input.

¹ http://www.omg.org/spec/MOFM2T/1.0/

Chapter 5

Research Method

This chapter discusses the methods used in the conducted study. Section 5.1, Research Phases, gives an overview of the research phases. Section 5.2, Design Science, describes a research approach named Design Science and how it was applied in this study. Section 5.3, Research Strategy Motivation, argues for and motivates the methods used.

5.1 Research Phases

This section gives the reader an overview of the research phases of the study. Figure 5.1 illustrates them graphically and shows that there have been three phases. The first phase was a prestudy, which was further partitioned into two sub-phases, where ’Prototype study’ refers to the practical knowledge acquired to be able to work with and modify the prototype.

Figure 5.1: An overview of the research phases.

5.1.1 Case study

The research has taken the form of a case study. A case study means that the conducted research is seen through the lens of a specific case. In this study, the case consisted of a software tool called SESAMM-tool, at a truck manufacturing company named Scania. The stated research questions were considered the studied objects.

The case study consisted of two phases. The first had a theoretical nature, where different validation technologies for RDF-data were analysed in the context of supporting the case, SESAMM-tool. A proposal was then made to stakeholders, who decided what technology to advance with. The second phase had a practical nature, extending the existing prototype to support the chosen technology.

Constraint Language Comparison Phase

This phase largely consisted of an in-depth study of the two constraint languages SHACL and ShEx. The comparison was initiated in collaboration with another thesis student. The language comparison was structured as a high-level feature comparison followed by a low-level, constraint-by-constraint comparison. The method of document review was used for gathering data about the languages. The languages were further evaluated by the method of informed argument and practically analysed by a point-by-point comparison of language features and the main constraint components of the languages.

The expressiveness of the languages was measured by how well the languages could cover each other’s constraints. A constraint of one language was considered covered if a second language could replicate it using one or more constraints. The key strengths and weaknesses found were then brought forward, and a narrative analysis and discussion were held for the final choice of the constraint language to be used for extending the prototype.
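The coverage measure described above can be sketched in Python as a simple ratio. The sketch is an assumption about how such a measure could be computed and does not reproduce results from the comparison. The mapping below is illustrative only: the entry for oslc:occurs follows the cardinality example given in this thesis, while the ex: entries are hypothetical placeholders.

```python
from typing import Dict, List


def coverage(mapping: Dict[str, List[str]]) -> float:
    """Return the fraction of source constraints that the target can replicate.

    The mapping takes each constraint of the source language to the (possibly
    empty) list of target-language constraints that together replicate it.
    """
    if not mapping:
        return 0.0
    covered = sum(1 for replacements in mapping.values() if replacements)
    return covered / len(mapping)


# Illustrative mapping only, not data from the comparison.
source_to_target: Dict[str, List[str]] = {
    "oslc:occurs": ["sh:minCount", "sh:maxCount"],  # covered by two constraints
    "ex:constraintA": ["ex:targetConstraintA"],  # covered by one constraint
    "ex:constraintB": [],  # not covered
}
ratio = coverage(source_to_target)
```

A constraint counts as covered whether it takes one or several target constraints to replicate it, which matches the definition of coverage used in the comparison.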

Implementation Phase

This phase was the most time-consuming and comprised the extension work on the graphical language. The implementation was done in iterations and included the development of a design suggestion for the EMF meta-model. The suggestion was discussed with stakeholders of the prototype, and thereafter implemented and tested. The stakeholders in the implementation phase were from Scania, with an interest in SESAMM-tool, and from the OSLC tool-chain interoperability community. The main modules of the prototype are the EMF meta-model, the Sirius graphical interface and the code generator module.

Two important concepts within design science are Relevance and Rigour. In this study, rigour was achieved through an iterative development process in which empirical knowledge gained from preceding iterations was used as a base for further development in subsequent iterations. Relevance was achieved by proposing design suggestions to stakeholders and discussing them, combined with the experience gained from previous iterations. Figure 5.2 graphically describes how the two concepts of Design Science, i.e. Relevance and Rigour, were applied in the iterative process. The iterative process contained proposed design suggestions and development of the prototype, followed by an evaluation.

Controlled experiment was used as the evaluation method. A controlled experiment is an experiment in a controlled environment in which a selected variable is modified, so that the experiment yields results affected by that variable.

The evaluation initially took place in the modelling module of the tool-chain prototype, where a subset of data from the database of SESAMM-tool was modelled. The code generator module then used the model for generating code. The generated code was subsequently assessed and evaluated for conformance with the automatic validation module, a module implemented in parallel by another thesis student.

Figure 5.2: An overview of how design science research was applied for the implementation in this thesis.

5.2 Design Science

Design Science has been used as the research approach. Design science closely relates to behavioural science, framing business needs with research activities. The philosophy is that theory and practice go hand in hand. Knowledge and understanding of a problem domain and its solution are acquired in the building and application of a designed artefact[31, 32]. Design science is based upon seven guidelines, which are meant to aid in understanding, executing and evaluating the research and its results. It is not demanded that all guidelines are rigorously followed, although it is recommended that all guidelines be addressed in some manner.

In Design Science, two important concepts are Relevance and Rigour[33]. Rigour is ensured by applying suitable methodologies, and relevance by feedback from application in the appropriate environment[33]. Relevance in information systems research means taking business needs into account when an artefact is built and evaluated. Rigour in Design Science is achieved by reusing knowledge gained from prior research. This may be achieved by using models and instantiations (prototypes) from previous iterations and by redefining or expanding the evaluation methods used, based upon empirical data[34].

The following sections describe each guideline and how it was followed in this study.

5.2.1 Design as an Artifact

"Design-science research must produce a viable artifact in the form of a construct, a model, a method, or an instantiation."[34].

The study investigates a viable construct, that is, a vocabulary that defines a viable solution for SESAMM-tool at Scania. This means that different applications of constraint languages are investigated from the perspective of SESAMM-tool. The study also investigates how the constraint languages should be applied in the design of a graphical language, resulting in an instantiation, a prototype system. The outcome will be the following artefacts.

1. A graphical language conforming to a constraint language having more fine-grained constraints than ReSh.

2. A method for generating code conforming to an automatic validation module.

3. An instantiation combining the methods and constructs into a working system.

5.2.2 Problem Relevance

"The objective of design-science research is to develop technology-based solutions to important and relevant business problems."[34].

This guideline, for the design-science research conducted in this study, can be seen at different levels of abstraction. Viewing the larger picture, the developed artefact may aid the development process for tool-chain interoperability by adopting a model-based technique that contributes to high quality for embedded systems in trucks and safety-critical systems.

A more explicit aspect of the problem, which is also stated in the purpose section, is that the constraints of the ReSh vocabulary are defined to be general and are too coarse-grained. In an industry case, there can be situations where constraints of a more complex and specific nature are needed. A simple example of this is the cardinality constraints of ReSh and SHACL. ReSh uses oslc:occurs for defining cardinality, which can take one of the values [zero-or-one, exactly-one, zero-or-many, one-or-many]. SHACL, on the other hand, defines cardinality using the constraints sh:minCount and sh:maxCount, which take integers as values, allowing the definition of precise cardinality constraints.
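The cardinality example can be made concrete with a short Python sketch. It maps the four oslc:occurs values to integer bounds of the kind expressed with sh:minCount and sh:maxCount, and shows that the reverse direction is not always possible. The table name, the function name and the use of None for "no upper bound" are assumptions made for this illustration.

```python
from typing import Optional, Tuple

# (minimum, maximum); a maximum of None means "no upper bound" in this sketch.
Bounds = Tuple[int, Optional[int]]

OCCURS_TO_BOUNDS = {
    "oslc:Zero-or-one": (0, 1),
    "oslc:Exactly-one": (1, 1),
    "oslc:Zero-or-many": (0, None),
    "oslc:One-or-many": (1, None),
}


def occurs_for(bounds: Bounds) -> Optional[str]:
    """Return the oslc:occurs value that expresses the bounds exactly, if any."""
    for occurs, candidate in OCCURS_TO_BOUNDS.items():
        if candidate == bounds:
            return occurs
    return None  # the bounds are too precise for the four coarse values
```

Every oslc:occurs value has matching integer bounds, but bounds such as a minimum of 2 and a maximum of 4 have no oslc:occurs counterpart, which illustrates the coarseness described above.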

5.2.3 Design Evaluation

"The utility, quality, and efficacy of a design artifact must be rigorously demonstrated via well-executed evaluation methods."[34].

Evaluation of designed artefacts normally uses methods from the knowledge base. These methods are listed and described in Hevner et al. [34]. Evaluation methods should be matched appropriately to the designed artefacts. Requirements for the evaluation should be set up in accordance with the business environment.

The evaluation methods used in this study were controlled experiments and informed argument. Controlled experiments were carried out by following a prototyping evaluation pattern introduced in Sonnenberg et al. [35]. The pattern comprises a sequence of evaluation steps: (1) a modification of the EMF meta-model was suggested; (2) the design suggestion was implemented, and a test was set up for modelling a subset of data from the database of SESAMM-tool and subsequently generating code from that model; (3) users were selected for conducting the task; (4) it was assessed whether the task could be solved using the prototype.

For the constraint language comparison, "Informed Argument" was used as the evaluation method, where information from the literature was used to argue the applicability of the constraint language to be used for extending the graphical language. This was structured as a point-by-point comparison of topics in the compared languages.

5.2.4 Research Contribution

"Effective design-science research must provide clear and verifiable contributions in the areas of the design artifact, design foundations, and/or design methodologies."[34].

There exists a graphical language that conforms to the ReSh constraint language. This has caused a problem, as ReSh is a language constructed to be high-level and general. Thus, it is not applicable for practical day-to-day use at Scania. There is a need to be able to apply more precise constraints to data. The contribution of the work in this study was therefore to extend the mentioned graphical language to conform to a lower-level constraint language, making the tool applicable for day-to-day use at Scania. Furthermore, the contribution consists of analysing and comparing the languages to find out which language is better suited for Scania data.

5.2.5 Research Rigor

"Design-science research relies upon the application of rigorous methods in both the construction and evaluation of the design artifact."[34].

A case study was conducted where the case was a tool used at Scania called SESAMM-tool. The study consisted of a theoretical investigation followed by a practical investigation. The theoretical approach consisted of comparing constraint languages and evaluating them for use in SESAMM-tool. The practical approach had an iterative form, consisting of an iterative implementation of an artefact in which knowledge was gathered as empirical data in each iteration and evaluated. The knowledge was used to derive design suggestions in subsequent iterations. Each iteration combined analytical and practical work, proposing design suggestions that were analysed by stakeholders and then practically implemented and evaluated for use, in compliance with SESAMM-tool.

5.2.6 Design as a Search Process

"The search for an effective artifact requires utilizing available means to reach desired ends while satisfying laws in the problem environment."[34].

The goal of the conducted study, how to make a common graphical language for validation technologies, is pursued in the context of an existing prototype that already conforms to the OSLC Resource Shapes constraint language. At the starting stage, an extension of the prototype’s meta-model was designed, proposed and discussed with stakeholders of the system. The example meta-model was then realised in implementation by extending the initial meta-model and extending both the modelling and code generation components of the prototype. Implementation iterations were conducted in this manner, taking feedback from previous iterations to subsequently create an updated extension example of the meta-model.

5.2.7 Communication of Research

"Design-science research must be presented effectively both to technology-oriented as well as management-oriented audiences"[34].

The intended readers of this thesis are computer science students, although an effort has been made to make the thesis understandable for managerial audiences. This has been done by combining technical information with intuitive figures and explanations. Additionally, when presenting the conducted work, emphasis has been put on explaining the purpose and usability of the extended graphical language, without dwelling too much on technical information.

5.3 Research Strategy Motivation

This section discusses the research methods used in the conducted study and argues for them in relation to how the research questions were addressed.

Design science was chosen as the research strategy. It is suited for, and directly concerns, the creation of an artefact. It considers the organisational environment that the artefact should be utilised within. The goal in design science is to create an effective artefact and consequently high utility for that artefact.

The evaluation method adopted in the constraint language comparison was informed argument. It was introduced in Hevner et al. [34] and is described with the following citation: "Informed Argument: Use information from the knowledge base (e.g., relevant research) to build a convincing argument for the artifacts utility". In the context of informed argument, the artefact consists of the results and the conclusion from the constraint language comparison, with the comparison building a convincing argument for the utility of the compared languages.

A point-by-point analysis method was used for comparing the two languages. A block comparison method was also considered. A block comparison first describes points in one subject, and then describes the same points for a second subject while additionally relating the points to the first subject. This method works better when the comparison is small and easily overviewed, as in an essay. The comparison in this study was considered more comprehensive. Hence, the point-by-point method was considered more suitable.

For evaluating the implementation phase, the method of controlled experiment was used. The development in each iteration was based on an initial modification of the prototype’s EMF meta-model. The modification subsequently affected the development in all modules of the prototype. The modified variables of the controlled experiment in this study were the proposed design suggestions (modifications of the meta-model) in each iteration. Thus, the method generated results on how each design suggestion affected the overall goal, namely developing an artefact that answers the research questions.

5.4 Summary

A case study was conducted for the case of the software tool named SESAMM-tool at Scania. A major reason for extending the prototype is that it only conforms to ReSh. ReSh is defined to have general and broad constraints and was considered too coarse-grained for an industrial company such as Scania. A constraint language comparison was conducted in the context of SESAMM-tool. The comparison was intended to aid the choice of constraint language for extending the prototype. A design science research approach was followed. Important concepts in design science are relevance and rigour, which should be enforced by taking into account both business needs and the reuse of knowledge gained from an iterative process. The method of controlled experiment was used for evaluating the prototype in each iteration. The modified variable in the experiments consisted of a newly suggested meta-model design for each iteration. The evaluation consisted of selecting users to model a subset of data from the database of SESAMM-tool using the prototype.

Chapter 6

Constraint Language Comparison

This chapter presents the results of the constraint language comparison. The languages were set against each other and compared at different levels of abstraction. The comparison was initially conducted in collaboration with a second thesis student but was at a later stage carried out independently. To give the reader an idea of the relative sizes of the constraint sets of the three constraint languages ReSh, SHACL and ShEx, and of how they overlap, figure 6.1 has been added. In the figure, the languages are abbreviated as R, S and Sh, respectively. It shows that $|\mathrm{SHACL}| > |\mathrm{ShEx}|$ and that $|\mathrm{SHACL} \cap \mathrm{ReSh}| > |\mathrm{ShEx} \cap \mathrm{ReSh}|$. In section 6.1 Features, the features of SHACL and ShEx are explained independently and set against each other in a compiled table. The following sections contain a comparison of constraints at a more fine-grained level. The constraints are set against each other and are further compared with the property constraints of ReSh. A comparative point-by-point analysis was used for comparing the languages by quantitatively measuring the constraint coverage between them.

Figure 6.1: Top left, SHACL and ReSh. Top right, ShEx and ReSh. Bottom left, SHACL and ShEx. Bottom right, SHACL, ShEx and ReSh.


6.1 Features

A high-level feature comparison was conducted for the two constraint languages SHACL and ShEx. The comparison focused on special abilities of the languages, how they validate RDF data, and how they may be expressed.

The comparison was compiled into table 6.1. The row Recursion shows that SHACL does not support recursion and that cyclic data models are not allowed in SHACL, although this is under discussion [36]. The row Nodes selection concisely describes SHACL as offering a greater variety of ways to select nodes. In addition to targeting a specific node, targets may be the class of a node along with its subclasses. Targets for validation may also be the objects or subjects of a defined predicate. For the row Validation focus, SHACL allows for validation reports that include the validation results of successful constraint checks or the accumulated results for non-conforming data [12], whereas ShEx flags any part of the data that does not conform. For the row Extension mechanism, it can be added that SHACL is based upon SPARQL and offers an extension mechanism based on SPARQL [36]. The row Inclusion of other shapes relates to how a shape can make use of other shapes in order to avoid duplicating them; both languages are capable of this. The row Incoming edges refers to how the languages can apply constraints to the subject of a triple instead of the object.

| Features | SHACL | ShEx |
|---|---|---|
| Definition | Describes and validates RDF-graphs | Describes RDF nodes and graph structures and provides a structural schema for RDF data |
| Recursion | No support for recursion | Supports recursion and self referencing |
| Nodes selection | Node selection achieved by Scope and constraints | May select particular nodes to be validated by focus node or neighbourhood |
| Syntax | Based on RDF vocabulary | Grammar oriented syntax |
| Validation focus | Errors in validation. Validation reports that support optional arguments, making it possible to limit the number of returned results | Focuses on validation results |
| Extension mechanism | SPARQL based extension mechanism. May be used to define constraint components | Restricted and predefined extension capabilities may be used |
| Inclusion of other shapes | May be achieved by using logical constraint components such as 'sh:and' | Shapes may reuse other shapes by including their declaration |
| Incoming edges | SHACL provides a concept of inverse property path | ShEx uses ^ as an inverse flag to target a subject in a triple |

Table 6.1: Table comparing features in the languages of SHACL and ShEx
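The Recursion row can be made more concrete with a small sketch. The Python below is neither SHACL nor ShEx; it is a toy validator written under the assumption that a graph is a dictionary of nodes and that a shape maps each predicate either to a Python type or to the name of another shape. All names (`conforms`, `Person`, `knows`) are hypothetical. It illustrates why self-referencing shapes need explicit handling: on cyclic data a plain recursive check would never terminate, so the sketch treats a (node, shape) pair that is already being checked as conforming.

```python
def conforms(node, shape, graph, shapes, in_progress=None):
    """Return True if `node` satisfies `shape`, allowing recursive shapes."""
    in_progress = set() if in_progress is None else in_progress
    key = (node, shape)
    if key in in_progress:
        # The pair is already being checked further up the call stack;
        # assume it holds so that cyclic data does not recurse forever.
        return True
    in_progress.add(key)
    try:
        for predicate, expected in shapes[shape].items():
            for value in graph.get(node, {}).get(predicate, []):
                if isinstance(expected, str):
                    # `expected` names another shape (possibly the same one).
                    if not conforms(value, expected, graph, shapes, in_progress):
                        return False
                elif not isinstance(value, expected):
                    # `expected` is a plain Python type used as a datatype.
                    return False
        return True
    finally:
        in_progress.discard(key)


# Hypothetical shape: a Person has string names and knows other Persons.
shapes = {"Person": {"name": str, "knows": "Person"}}

# Cyclic data: alice knows bob, and bob knows alice.
graph = {
    "alice": {"name": ["Alice"], "knows": ["bob"]},
    "bob": {"name": ["Bob"], "knows": ["alice"]},
}
cyclic_ok = conforms("alice", "Person", graph, shapes)

# The same data with a malformed name for bob.
bad_graph = {
    "alice": {"name": ["Alice"], "knows": ["bob"]},
    "bob": {"name": [42], "knows": ["alice"]},
}
cyclic_bad = conforms("alice", "Person", bad_graph, shapes)
```

Here `cyclic_ok` is `True` and `cyclic_bad` is `False`; real ShEx processors define recursion far more carefully than this sketch does.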

6.2 Constraint coverage

This section contains the results from measuring the constraint coverage between the two languages SHACL and ShEx, and further contains a measurement of how well SHACL and ShEx constraints cover ReSh constraints. A constraint is considered covered if it can be replicated by one or more constraints from the covering language. The coverage measurements have been compiled into tables.

The table in Appendix B.1 shows how the SHACL Core Constraint Components are covered by ShEx constraints: 21 of a total of 31 SHACL constraints were covered. The table in Appendix C.1 shows how the ShEx Node Constraint Semantics are covered by SHACL constraints: 17 of a total of 19 ShEx constraints were covered. The results of the measurements are illustrated in figure 6.2, which shows a higher coverage by SHACL, covering 89.5 % of the ShEx constraints, compared with ShEx covering 67.8 % of the SHACL constraints.

Figure 6.2: To the left, amount of ShEx constraints covered by SHACL. To the right, amount of SHACL constraints covered by ShEx.
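The coverage definition used in this section can be sketched in a few lines of Python. This is a minimal illustration, assuming that a coverage table is represented as a mapping from each constraint of the covered language to the list of constraints that replicate it. The function name `coverage` and the four example rows are assumptions made for the illustration; the rows are not copied from the appendix tables.

```python
def coverage(mapping):
    """Return (covered, total, percent) for a constraint-to-replicas mapping.

    A constraint counts as covered when at least one constraint (or
    combination of constraints) of the covering language replicates it.
    """
    total = len(mapping)
    covered = sum(1 for replicas in mapping.values() if replicas)
    return covered, total, round(100.0 * covered / total, 1)


# Illustrative excerpt only: these rows are assumptions chosen for the
# example and do not reproduce the tables in the appendices.
shacl_covered_by_shex = {
    "sh:datatype": ["datatype node constraint"],
    "sh:minCount": ["cardinality"],
    "sh:maxCount": ["cardinality"],
    "sh:lessThan": [],  # assumed to have no counterpart in this excerpt
}

result = coverage(shacl_covered_by_shex)
```

For this four-row excerpt, `result` is `(3, 4, 75.0)`.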

Similar to the SHACL-ShEx coverage results, the ReSh coverage measurement was compiled into tables. The table in Appendix D.1 shows how the ReSh property constraints are covered by SHACL constraints: 12 of a total of 16 ReSh property constraints were covered. The table in Appendix E.1 shows how the ReSh property constraints are covered by ShEx constraints: 8 of a total of 16 were covered. The results are illustrated in figure 6.3, which shows SHACL as having the higher coverage of ReSh, covering 75 % of the ReSh property constraints, compared with 50 % for ShEx.

Figure 6.3: To the left, amount of ReSh constraints covered by SHACL. To the right, amount of ReSh constraints covered by ShEx.
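The percentages in this section follow directly from the reported counts, as the short calculation below shows. The dictionary layout and the name `percent` are assumptions made for the illustration; the counts themselves are those stated in the text.

```python
# (covering language, covered language) -> (covered constraints, total)
reported_counts = {
    ("SHACL", "ShEx"): (17, 19),
    ("ShEx", "SHACL"): (21, 31),
    ("SHACL", "ReSh"): (12, 16),
    ("ShEx", "ReSh"): (8, 16),
}


def percent(covered, total):
    """Coverage as a percentage rounded to one decimal."""
    return round(100.0 * covered / total, 1)


percentages = {
    pair: percent(*counts) for pair, counts in reported_counts.items()
}

for (covering, covered_language), value in percentages.items():
    print(f"{covering} covers {value} % of {covered_language}")
```

Rounded to one decimal, the calculation gives 89.5, 67.7, 75.0 and 50.0. The value for ShEx covering SHACL (21 of 31) comes out marginally below the 67.8 % reported in the text, which is presumably a rounding difference.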

6.3 Summary

The constraint language comparison began with a high-level feature comparison of the languages. Some strong features were identified for SHACL, such as the possibility of tailoring a validation report and a wider selection of ways to target nodes in RDF graphs. A strong case for ShEx was the ability to validate recursive shapes; at the time of writing this thesis, recursive validation is not supported by SHACL. SHACL had a higher constraint coverage of ShEx than ShEx had of SHACL. SHACL also had a higher coverage of ReSh constraints.

Chapter 7

Implementation

This chapter describes how the implementation phase was carried out following the design science approach described in section 5.2. Section 7.1 Evaluation describes the evaluation task that was conducted. The subsequent sections describe the problems and issues that came up during the development of the prototype and the assessment of the conducted evaluation tasks.

The first research question, "How can a common graphical language be created for supporting all validation technologies of RDF-data?", has framed the focus of the implementation: how to extend the prototype to one other constraint language while at the same time facilitating the addition of further constraint languages. The second research question, "How can this graphical language support the automatic generation of RDF-graphs?", has also guided the implementation. Effort was put into adapting the code generation module to make it compatible with an automatic validation module used for the generation of RDF graphs.

7.1 Evaluation

An evaluation task was structured to simulate a real-life scenario at Scania. Users were chosen and supervised. After the task had been conducted, discussions were held to assess whether it could be considered effectively accomplished. The concept of relevance, as proposed in design science, was built into the prototype by taking organisational business needs into account during development; in this study, business needs were accumulated iteratively. The business needs, together with the research questions framing this study, were used as evaluation criteria for assessing the evaluation task.

Section 7.1.1 explains the evaluation task that was issued for evaluating the prototype, explains its purpose and places it in a wider perspective. Section 7.1.2 explains the accumulated business needs, why they were added, and how they were used for finding issues throughout the iterations.

7.1.1 Task

The task was constructed to be business specific for SESAMM-tool at Scania by extracting a schema model from a subset of the data in its database. The task was to use the schema model and attempt to replicate it within the prototype's modelling interface, applying SHACL constraints and subsequently generating the corresponding Java resources of the model using the prototype.

The SESAMM-tool schema and the replicating SHACL models contain business-sensitive information. To give the reader an illustrative model of how the data schema was replicated, figure 7.1 has been added. Due to the sensitive data in SESAMM-tool, the figure has been obfuscated, i.e. names have been changed, moved, replaced or removed.

Users were selected for the task according to time and availability. Not all selected users were available for every evaluation throughout the iterative process. In total, there were three users: the first was a Scania employee with an interest in the prototype, the second was the thesis student who worked on the automatic validation module, and the third was another Scania employee.

Figure 7.1: A modelled figure replicating a subset of SESAMM-tool database with classes and properties obfuscated.

The purpose of the evaluation task relates to replicating the SQL databases of management tools to an equivalent RDF triple store. At the time of writing this thesis, this task is done manually within Scania. During an interview, a developer performing the manual task described it as tedious and partially repetitive work.

The final purpose of the prototype is to allow a user who has a schema to model and replicate that schema in the prototype's modelling module using SHACL constraints, and subsequently to generate the corresponding Java resources, which can in turn be used for the automatic generation of RDF graphs. The graphs should then be automatically validated and stored in a triple store.

The scope of the evaluation covered by this thesis extends up to the generation of Java resources, while still taking the automatic graph generation and automatic validation into consideration, as illustrated in figure 4.2 in section 4.2.

7.1.2 Evaluation Criteria

The evaluation task in the implementation phase was iterative in nature. Thus, business needs were accumulated throughout the iterations. The business needs were used as guidelines for the assessment of the conducted evaluation tasks. This section explains the business needs, why they were added and how they were assessed. The accumulated business needs, in chronological order, were (1) SHACL conformance, (2) ReSh & SHACL conformance, (3) compatibility with the automatic validation module and (4) backward compatibility.

SHACL conformance

The prototype initially complied with ReSh and had been tested to do so. As stated in the problem formulation, the constraints of ReSh are too coarse-grained for an industrial company such as Scania. Hence, there is a need for an extension to a more expressive constraint language. The purpose of the constructed evaluation task was mainly to test this criterion.

This need was assessed by supervising the users and noting the issues that arose while the task was being carried out. The issues were further discussed and assessed with the selected users.

ReSh & SHACL conformance

The contribution of this study was intended not only for industry but also for the OSLC community. Thus, it was important that the prototype conforms to SHACL and ReSh simultaneously.

This criterion was assessed by having the users design smaller ReSh models after creating the main SHACL model under supervision. The issues raised were assessed and discussed.

Automatic validation module

A future functionality of the prototype is to handle the whole process of replicating an SQL database and storing it in an RDF triple store, as explained in section 7.1.1. It was therefore important that the generated Java resources and the automatic validation module were compatible.

The criterion was enforced by selecting the developer of the automatic validation module as a user for the task. When assessing this criterion, the focus was on the generated Java resources, which were assessed and discussed for compatibility.

Backward compatibility

The business need of backward compatibility came up late in the prototyping iterations but was still considered important. The prototype had already been used to some extent within Scania, and it was therefore important that the previously constructed models remained compatible.

This need was assessed by analysing the design suggestions and testing the prototypes with old models for compatibility, observing any issues raised.

7.2 Iterative Process

This section explains the issues raised in each iteration and how they were handled. The issues are described explicitly and related to the corresponding design suggestions. The design suggestions are initially based upon the meta-model illustrated in Appendix A.1. The two elements in the meta-model named Resource and ResourceProperty are the elements initially used for conformance to ReSh. The reader is recommended to have the meta-model at hand when reading the following subsections.

The iterative process corresponds to the design science guidelines described in 5.2.5, Research Rigor, and 5.2.6, Design as a Search Process. Rigour was enforced by the iterative process and by reusing the knowledge and experience obtained. An effective artefact was searched for by investigating an extended meta-model design that can effectively allow the prototype to support more than one constraint language.

7.2.1 First iteration: Learn by doing

In the first iteration, the effort was on learning and understanding the prototype. The design science approach holds that knowledge is gained by combining theory with practical implementation [34]. To acquire knowledge about the prototype, the first extension of the meta-model was designed in the same way as the initial meta-model, meaning that the model was extended with elements that mirrored the elements Resource and ResourceProperty. These were called ResourceShacl and ResourcePropertyShacl, and are illustrated in figure 7.2, which represents the initial design suggestion. The new elements had exactly the same connections and attributes, only suffixed with the word 'Shacl'.

Figure 7.2: The meta-model extension in the first iteration.

For the extension of the Sirius modelling tool, the domain view was extended to work with both ReSh and SHACL. The shapes of the two different languages could not reference each other. A model designed with the prototype can be seen in figure 7.3, where the elements on the right are of the newly added classes ResourceShacl and ResourcePropertyShacl and the elements on the left are of the original classes Resource and ResourceProperty.

Figure 7.3: A model designed in the first iteration. All elements on the left side, conform to pre-existing ReSh constraints. The elements on the right conform to the extended constraint language, SHACL. The language elements are spread over two domains with two associ- ated namespaces depicted as ’nsp’, allowing any element to be reached by following their unique URL.

Evaluation assessment

Shortly into the work of extending the code generator plug-in, it became apparent that much of the code in the plug-in had to be duplicated to generate shapes for the newly added constraint language. This issue was not in line with the first research question of this study and led to an excessive code base for the code generator module.

7.2.2 Second iteration: Inheritance for code reuse

The problems that arose in the first iteration were that there was little reuse of the already tested and functional Acceleo code and that the meta-model did not support extension with more than one constraint language. Hence, a new meta-model was proposed, seen in figure 7.4. This iteration had two new elements named ShaclShape and ShaclProperty, inheriting from the elements Resource and ResourceProperty, respectively.

Figure 7.4: The meta-model extension in the second iteration. Inheriting from the original classes

Evaluation assessment

The modelling tool allowed users to add SHACL-specific properties to ReSh resources and ReSh properties to SHACL shapes. This issue originated from the introduced inheritance, with ShaclShape inheriting from Resource. This behaviour can be compared with how the languages are encapsulated in figure 7.3, where the language elements live in the same domains but are not allowed to reference each other.

The problem originated specifically in the reference attributes 'extends', 'range' and 'resourceProperties', which reside in the parent class and have the form List. Efforts were made to override the attributes in the source code of the subclasses ShaclShape and ShaclProperty; however, this was not supported by the EMF plug-in.

An additional issue raised in the modelling user interface concerned the elements ShaclShape and ShaclProperty. All attributes residing in the superclass were visible in the SHACL subclass, cluttering the graphical user interface with unnecessary non-SHACL-specific constraints. This issue originated in the prototype technology Sirius, which bases the graphical user interface on the underlying EMF meta-model.

The Acceleo code generator module could essentially be reused in this iteration, as inheritance was applied in the meta-model. The extension in this iteration consisted of adding code that checked for instances of the subclass ShaclShape and subsequently generated SHACL-specific constraints as annotations to support the automatic validation module.
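The mixing issue of the second iteration can be illustrated outside EMF with plain Python classes. This is a sketch under stated assumptions, not the prototype's code: the class names follow the meta-model, while `add_property`, the instance names and the property names are hypothetical. Because the reference attribute lives in the parent class and is only constrained to the parent property type, every subclass combination is accepted.

```python
class ResourceProperty:
    """ReSh property element of the original meta-model."""

    def __init__(self, name):
        self.name = name


class Resource:
    """ReSh shape element; owns the reference attribute."""

    def __init__(self, name):
        self.name = name
        self.resourceProperties = []

    def add_property(self, prop):
        # The only check available at this level is against the parent type.
        if not isinstance(prop, ResourceProperty):
            raise TypeError("expected a ResourceProperty")
        self.resourceProperties.append(prop)


class ShaclProperty(ResourceProperty):
    """SHACL property element, inheriting as in the second iteration."""


class ShaclShape(Resource):
    """SHACL shape element, inheriting as in the second iteration."""


resh_resource = Resource("Vehicle")
shacl_shape = ShaclShape("VehicleShape")

# Both cross-language additions pass the parent-type check.
resh_resource.add_property(ShaclProperty("minLength"))
shacl_shape.add_property(ResourceProperty("occurs"))
```

Both calls succeed, which mirrors the unwanted behaviour: a subclass cannot narrow the element type of a reference attribute that it inherits.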

7.2.3 Third iteration: Abstract super class for cohesion

It is desirable to reuse existing code and to avoid having any language element reference elements that are not from the same language, as that would result in low cohesion. To address both concerns, a meta-model with an abstract super shape and an abstract super property was proposed. Both abstract elements contained attributes that were necessary for the automatic validation module. A stripped-down illustration of the design suggestion can be seen in figure 7.5. The reference attributes resourceProperties, extends and range from the original classes were not kept in the abstract classes. Corresponding reference attributes were instead included in the language subclasses for ReSh and SHACL.

Figure 7.5: The meta-model extension in the third iteration. An abstract resource and prop- erty

Evaluation assessment

The code reuse that was aimed for was not achieved by this design suggestion. A significant amount of the code in the Acceleo code generation plug-in needed to be modified, as it depended on the attributes resourceProperties, range and extends. The equivalents of these attributes were instead defined in the subclasses. This did not support extension with additional constraint languages, as a considerable quantity of code had to be duplicated for every new constraint language added. The question arose whether code reuse could be achieved in a more efficient way.
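The duplication described above can be sketched as follows. The Python is an illustration only: the real generator is written in Acceleo against the EMF meta-model, `generate` is a hypothetical stand-in for it, and the attribute name `shaclProperties` is an assumption. With the reference attributes moved into the language subclasses, the generator needs one branch per language.

```python
from abc import ABC


class AbstractProperty(ABC):
    def __init__(self, name):
        self.name = name


class AbstractResource(ABC):
    # Shared attributes only; no reference attributes live here.
    def __init__(self, name):
        self.name = name


class ResourceProperty(AbstractProperty):
    pass


class ShaclProperty(AbstractProperty):
    pass


class Resource(AbstractResource):
    def __init__(self, name):
        super().__init__(name)
        self.resourceProperties = []  # ReSh-only reference attribute


class ShaclShape(AbstractResource):
    def __init__(self, name):
        super().__init__(name)
        self.shaclProperties = []  # hypothetical SHACL-only counterpart


def generate(shape):
    """Stand-in for the code generator: list the properties of a shape."""
    # One branch per language, because the super class exposes no
    # common way of reaching the properties.
    if isinstance(shape, Resource):
        props = shape.resourceProperties
    elif isinstance(shape, ShaclShape):
        props = shape.shaclProperties
    else:
        raise NotImplementedError(type(shape).__name__)
    return [f"{shape.name}.{p.name}" for p in props]


resource = Resource("Vehicle")
resource.resourceProperties.append(ResourceProperty("title"))
shape = ShaclShape("VehicleShape")
shape.shaclProperties.append(ShaclProperty("minCount"))
generated = generate(resource) + generate(shape)
```

The languages are cleanly separated, but every additional constraint language would add another branch to `generate`.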

7.2.4 Fourth iteration: reference attributes and backwards compatibility

Inheritance was proposed in the previous meta-models to allow reuse of plug-in code and thereby facilitate extending the prototype with arbitrarily many constraint languages. Further, it was discovered that combining inheritance with keeping some reference attributes, resourceProperties and extends, was needed for facilitating the extension of constraint languages in the code generation module. Another demand that had to be considered was to keep the prototype compatible with previously designed architect models. With this in mind, and with the experience gathered from previous iterations, a new design suggestion was proposed, illustrated in figure 7.6. It introduced two new super classes, Shape and Property, that only contained attributes necessary for the automatic code generation, similar to the classes AbstractResource and AbstractProperty from the third iteration. The reference attributes range and properties were kept in the super classes to facilitate code generation. The added constraint language classes, ShaclShape and ShaclProperty, only contain language-specific attributes. The original classes Resource and ResourceProperty remain untouched due to the backwards compatibility demand.

Figure 7.6: The meta-model extension in the fourth iteration. An abstract Shape and Property are applied as adaptors in an adaptor pattern with ShaclShape and ShaclProperty as adaptee classes, the abstract elements are referenced similar to Resource and ResourceProperty exclud- ing the keyword ’resource’.

Evaluation task

An initial effort was made to modify the code generator module so that it supports the super classes Shape and Property. Thus, when appending additional constraint languages, as ShaclShape and ShaclProperty were appended in figure 7.6, a reduced amount of Acceleo code needs to be added for each subsequent constraint language. The issue remains that properties can be added to any shape that inherits from the superclass Shape. Hence, appending additional constraint languages would lead to an increase in complexity in the modelling graphical user interface for the architect when designing models.

7.2.5 Fifth iteration: Breaking name conventions and code clean up

The purpose of the design proposed in the fifth iteration was to promote code reuse and thus reduce the prototype's code base. This was attempted by focusing on the elements from the original meta-model that referenced Resource or ResourceProperty, portrayed as the green classes in figure 7.7. The referencing attributes of these elements were redirected to reference only the super classes Shape and Property. This can be compared with the design suggestion proposed in the fourth iteration, figure 7.6, where the green elements reference both the ReSh classes Resource and ResourceProperty and the super classes Shape and Property. The naming convention for the referencing attributes of the green elements was broken, as illustrated in figure 7.7: instead of the keyword 'shape', the names of the referencing attributes contain the keyword 'resource', which is more closely related to ReSh. To maintain backwards compatibility, the names had to be kept so that previously created ReSh models remain compatible with the prototype. The references were chosen to target the super classes Shape and Property in an effort to reduce the code base.

Figure 7.7: The meta-model extension in the fifth iteration. An abstract Shape and Property as adaptors, with the additional adaptee classes Resource and ResourceProperty

Evaluation task

The issues from the fourth iteration still remained. The language elements of both ReSh and SHACL are now subclasses of the super classes. This leads to an increase in complexity for the user-architect, in exchange for a reduced code base and hence a more maintainable prototype.
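The trade-off of the fifth iteration can be sketched in the same style. The class names follow the meta-model; `generate` is a hypothetical stand-in for the Acceleo generator, and the instance and property names are invented for the example. With the reference attribute kept in the super class, one code path serves every language, but nothing prevents a property of one language from being attached to a shape of another.

```python
class Property:
    """Super class for the property elements of all constraint languages."""

    def __init__(self, name):
        self.name = name


class Shape:
    """Super class holding the shared reference attribute."""

    def __init__(self, name):
        self.name = name
        # Keeps the ReSh-flavoured name for backwards compatibility.
        self.resourceProperties = []


class ResourceProperty(Property):
    pass


class Resource(Shape):
    pass


class ShaclProperty(Property):
    pass


class ShaclShape(Shape):
    pass


def generate(shape):
    """Single code path for every language that subclasses Shape."""
    return [f"{shape.name}.{p.name}" for p in shape.resourceProperties]


resource = Resource("Vehicle")
resource.resourceProperties.append(ResourceProperty("title"))

shape = ShaclShape("VehicleShape")
shape.resourceProperties.append(ShaclProperty("minCount"))
# The remaining issue: a ReSh property on a SHACL shape is also accepted.
shape.resourceProperties.append(ResourceProperty("occurs"))

generated = generate(resource) + generate(shape)
```

A further language would only need two new subclasses here, which is the reduced code base described above, at the cost of the mixing shown in the last `append`.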

7.3 Summary

An evaluation task was constructed to simulate a real-life task at Scania, in which a schema was extracted from SESAMM-tool. Users were selected to attempt to replicate the schema in RDF and apply SHACL constraints using the prototype. A total of five iterations were carried out, each proposing a new design suggestion for extending the original meta-model. Each design suggestion was based on experience from and assessment of the preceding iterations. The design suggestion of the fifth iteration introduced an adaptor pattern with the elements Shape and Property acting as adaptor super classes. The constraint language elements, ShaclShape & ShaclProperty and Resource & ResourceProperty, are adaptee classes in the adaptor pattern, meaning that they inherit from the super classes. The issue remains that properties of one constraint language can be added to classes of another language. This was related to low cohesion and has led to an increase in complexity in the graphical user interface.

Chapter 8

Discussion and Conclusion

"Technology is nothing. What's important is that you have a faith in people, that they're basically good and smart, and if you give them tools, they'll do wonderful things with them."
Steve Jobs

8.1 Comparison between constraint languages

The constraint language comparison was carried out against the background of Scania's SESAMM-tool and with the main focus on extending the graphical language. We compared two languages, SHACL and ShEx, against one another and against ReSh. The features of SHACL and ShEx have been compared and discussed, followed by a measurement of the constraint coverage of the languages. Both languages had features not present in the other. The SHACL feature of validation reports with optional arguments for limiting the results returned on validation was considered valuable, as it may aid developers searching for errors and thus yield a positive economic outcome by reducing the time spent on error search. Among the investigated constraint languages, ShEx seems to have the more compact syntax. The resulting ease of getting an overview of the shapes and their defined concepts can be weighed against the fact that the ShEx syntax includes regular expressions, which incur higher complexity and make ShEx shapes harder to read. The published documentation of ShEx was not as extensive as SHACL's and included few intuitive examples. Hence, the initial learning cost could be considered higher with ShEx. A feature supported in ShEx but not in SHACL is the ability to validate recursive shapes, although it is possible to extend a SHACL processor implementation to support recursive validation.

The constraint coverage comparison of the two languages was in favour of SHACL, which covered 89.5% of the ShEx constraints, compared with ShEx covering 67.8% of the SHACL constraints. Coverage of ReSh was 75% for SHACL and 50% for ShEx, implying a higher expressiveness for SHACL. Some of the ReSh property constraints are stated to be informative and functional, so that the data can be used more easily on the web. An example of this is the ReSh constraint dcterms:description, whose definition states that the content should be suitable inside a given markup element. SHACL seemingly had better coverage of informative constraints than ShEx. Apart from the validation of recursive shapes, SHACL outperformed ShEx in every topic throughout the comparison; it was therefore concluded to be the most fitting language and was recommended for extending the prototype.

8.2 Implementation

Further assessment of the constraint languages was done by conducting evaluation tasks with the goal of supporting multiple constraint languages in the prototype. The evalu- ations were carried out in an iterative process. To reach the objective, different design suggestions were proposed, and these gave rise to different issues.

There were five proposed design suggestions, one for each of the five iterations. The first design suggestion (1) consisted of a naive mirroring. It did not support the inclusion of additional constraint languages, which motivated the second design suggestion (2), in which inheritance was introduced. This design suggestion promoted code reuse and supported the inclusion of additional constraint languages. The modifications in the code generator module consisted of adding code for generating language-specific constraints as annotations in the generated Java resources. The appended language inherited directly from the original ReSh elements in the meta-model. This raised issues of low cohesion and led to a cluttered view in the graphical user interface: a user setting SHACL property constraints in the graphical user interface could see all constraints, both ReSh and SHACL. An additional issue relating to low cohesion was that the prototype allowed users to add any language element to any other language element, meaning that ReSh properties could be added to ShaclShape elements, which was unwanted behaviour. The issues were found to originate from the low cohesion and from the plug-in technologies that the prototype was based on, which did not support overriding of reference attributes.

To address the issues of low cohesion, the third design suggestion (3) focused on increasing cohesion. Abstract classes were introduced into the design suggestion, isolating the language subclasses from one another and thus achieving increased cohesion and an uncluttered view. The inclusion of additional constraint languages was, however, not supported. This was because the code generator module had logic that depended on how shapes reach their corresponding properties, for example how a ShaclShape object reaches the ShaclProperty objects belonging to it through the reference attribute properties. Hence, it was concluded that isolating the reference attributes in the language subclasses meant duplicated code for each additional language.

The fourth design suggestion (4) placed the reference attributes whose isolation had led to low code reuse in (3), i.e. properties and extends, in the super classes. This resulted in increased code reuse and a meta-model seemingly supporting the inclusion of additional constraint languages. With these reference attributes in the super classes, the issue raised in (2), of adding properties of one constraint language to elements of another language, would be raised again upon appending additional languages to the super classes. This was not an active issue, because the initial ReSh elements did not subclass the introduced super classes. Using the prototype only for ReSh and SHACL with design suggestion (4) should therefore not raise the user issue that elements of one constraint language can be added to an element of another constraint language, as in (2).

The design suggestion in (5) was similar to (4), with the difference that the original ReSh elements inherit from the two super classes Shape and Property. The introduced inheritance led to increased code reuse and thus a reduced code base in the code generation module. The design reactivated the issue presented in (2), that one language element could be added to another language element. On further investigation of the Sirius plug-in, capabilities were found for constraining, in the view, what type of properties may be added to the attribute resourceProperties. This did not completely fix the issue, but it mitigated it, allowing for a reduced code base and a prototype that better supports the inclusion of additional constraint languages.

The final design suggestion has been concluded to be the most appropriate approach for achieving a common graphical language that supports a given constraint language for RDF data. The prototype conforms to both ReSh and SHACL, languages that make use of the schema construct called shape. Thus, it can be stated with confidence that the prototype at least supports extension with languages using similar shape constructs, i.e. ReSh, SHACL and ShEx.

The final prototype should support the automatic generation of RDF graphs. This was reinforced by including the responsible developer of the automatic validation module in the evaluation task of the implementation. The automatic validation module would use the code generated by the prototype to generate RDF graphs. The code generation module in the prototype and the automatic generation module were therefore required to be compatible. By further reducing the code base in the code generation module, as described from (4) to (5), the maintainability between the modules should increase.

8.3 Research findings

The study in this thesis was concerned with the creation of a unified platform for multiple validation technologies of RDF data, and code generation for RDF graphs. It has addressed two research questions:

1. How can a common graphical language be created for supporting all validation technologies of RDF-data?

2. How can this graphical language support the automatic generation of RDF-graphs?

The first research question can be seen as partially answered within the scope of the case, by having extended an existing graphical language that allows users to model ReSh constraints so that it also conforms to SHACL. To fully answer the research question, the language should be extended with additional RDF validation technologies, and the research should be generalised to other cases. To answer the second research question, the graphical language has been made compatible with a graph generation module by having it generate Java resources that can be used by that module.

Chapter 9

Future Work

The constraint language comparison can be extended further by investigating the need for validating recursive shapes at Scania, and by investigating the cost of implementing and maintaining a SHACL processor that supports recursive validation. The aim of the investigation should be to understand the demand for recursive validation of Scania data and the economic weight of maintaining a SHACL processor. Given new data from the investigation, a re-evaluation should be conducted to find the most fitting constraint language for the case of the SESAMM-tool.

The issue remains in the implementation that all properties can be added to all classes inheriting from Shape. This is due to the reference attribute resourceProperties being maintained in the super class, and to the EMF plug-in technology not supporting the overriding of reference attributes. This led to increased complexity for the architect in the graphical modelling user interface. The issue is open and should be investigated further. One point of improvement is the code expressions in the Sirius plug-in, which allow code logic to run. For example, when a user adds a property to the resourceProperties attribute, logic may be added to select specific subclasses of Shape. Efforts were made in the code expressions to mitigate the problem, but further investigation can be done. Another point of improvement is to implement logic in the Acceleo plug-in that reactively mitigates the issue. For example, flags may be put inside the code generation module that raise exceptions when a property has been added incorrectly.
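The reactive check suggested for the code generation module could look roughly like the following Java sketch. All types and names here (ConstraintLanguage, ModelProperty, ModelShape, GenerationGuard) are hypothetical and do not come from the prototype or from the Acceleo API. The sketch only shows the idea of raising an exception when a property of one constraint language has been attached to a shape of another.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical model types for illustration; not the prototype's EMF classes.
enum ConstraintLanguage { RESH, SHACL }

final class ModelProperty {
    final String name;
    final ConstraintLanguage language;

    ModelProperty(String name, ConstraintLanguage language) {
        this.name = name;
        this.language = language;
    }
}

final class ModelShape {
    final String name;
    final ConstraintLanguage language;
    final List<ModelProperty> resourceProperties;

    ModelShape(String name, ConstraintLanguage language,
               List<ModelProperty> resourceProperties) {
        this.name = name;
        this.language = language;
        this.resourceProperties = resourceProperties;
    }
}

class GenerationGuard {
    // Throws if any property attached to the shape belongs to another language.
    // A generator could call this before emitting code for the shape.
    static void check(ModelShape shape) {
        for (ModelProperty p : shape.resourceProperties) {
            if (p.language != shape.language) {
                throw new IllegalStateException(
                    "Property '" + p.name + "' (" + p.language
                    + ") was added to " + shape.language
                    + " shape '" + shape.name + "'");
            }
        }
    }

    public static void main(String[] args) {
        ModelShape mixed = new ModelShape("Requirement", ConstraintLanguage.RESH,
            Arrays.asList(new ModelProperty("minCount", ConstraintLanguage.SHACL)));
        try {
            check(mixed);
        } catch (IllegalStateException e) {
            System.out.println("Generation aborted: " + e.getMessage());
        }
    }
}
```

A check of this kind detects the mistake only at generation time. Filtering the candidate properties in the Sirius view prevents it earlier, so the two mitigations are complementary.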

Testing and measuring the genericity of the prototype remains to be done. This could be done by extending the prototype to support additional constraint languages. The ShEx constraint language, like SHACL and ReSh, makes use of the schema construct called shapes. Hence, the prototype should be tested for genericity by including ShEx. Furthermore, it would be of interest to examine how well the prototype accommodates constraint languages that do not use shapes.

To improve efficiency and expressiveness in the prototype, a few extra functions should be added. User efficiency can be increased by predefining commonly used public vocabularies, such as "dcterms" and "rdfs". This will reduce the effort users spend on redefining these properties for every new project, and will reduce the risk of having these properties defined incorrectly. Expressiveness may be increased by adding the possibility to state SHACL SPARQL-based constraints, allowing users with SPARQL knowledge to define their own property constraints. The prototype should then also be extended with a SPARQL engine for running the SPARQL-based constraints.
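As a rough illustration of what predefining public vocabularies could mean, the following Java sketch keeps a fixed table of prefixes and expands prefixed names into full IRIs. The class and method names are assumptions made for this example. The two namespace IRIs are the published ones for DCMI Metadata Terms and RDF Schema.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper; not part of the prototype.
class PredefinedVocabularies {
    private static final Map<String, String> PREFIXES;

    static {
        Map<String, String> m = new LinkedHashMap<>();
        m.put("dcterms", "http://purl.org/dc/terms/");
        m.put("rdfs", "http://www.w3.org/2000/01/rdf-schema#");
        PREFIXES = Collections.unmodifiableMap(m);
    }

    // Expands a prefixed name such as "dcterms:title" into a full IRI.
    static String expand(String prefixedName) {
        int colon = prefixedName.indexOf(':');
        if (colon < 0) {
            throw new IllegalArgumentException("Not a prefixed name: " + prefixedName);
        }
        String namespace = PREFIXES.get(prefixedName.substring(0, colon));
        if (namespace == null) {
            throw new IllegalArgumentException("Unknown prefix in: " + prefixedName);
        }
        return namespace + prefixedName.substring(colon + 1);
    }

    public static void main(String[] args) {
        System.out.println(expand("dcterms:title")); // http://purl.org/dc/terms/title
        System.out.println(expand("rdfs:label"));    // http://www.w3.org/2000/01/rdf-schema#label
    }
}
```

With such a table shipped in the tool, users would pick a property by its prefixed name instead of retyping the namespace for every project, which is where the reduced risk of incorrect definitions would come from.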

The constraint language comparison has been limited to a few selected constraint languages. The comparison should be complemented with additional validation technologies for linked data.

Statement of independent work

This paragraph was added to state a clear distinction between the scope of work in this thesis and the work of others. The work in the constraint language comparison was partly done in collaboration with another thesis student: the feature comparison was done in collaboration, whereas the constraint coverage measurement was done independently. The implementation work in the prototype was based on prior work on an already existing prototype conforming to ReSh. The implementation work done in this study consisted of making the prototype conform to the introduced generic elements Shape and Property and, furthermore, making the prototype conform to SHACL.

44 Bibliography

[1] Efficient safety for products (espresso). URL https://www.kth.se/ itm/inst/mmk/forskning/forskningsenheter/mekatronik/ modellbaserad-metodik/efficient-safety-for-products-espresso-1. 437117.

[2] Jad El-khoury, Didem Gürdür, and Mattias Nyberg. A model-driven engineering approach to software tool interoperability based on linked data. International Journal On Advances in Software, 9(3-4):248–259, 2016.

[3] w3c. Semantic web, . URL http://www.w3.org/standards/semanticweb.

[4] w3c. W3c semantic web activity, . URL http://www.w3.org/2001/sw.

[5] Thomas Bosch and Kai Eckert. Guidance, please! towards a framework for RDF- based constraint languages. In Proceedings of the 2015 International Conference on and Applications, DCMI’15, pages 95–111. Dublin Core Metadata Ini- tiative. URL http://dl.acm.org.focus.lib.kth.se/citation.cfm?id= 2907896.2907906.

[6] Open Services for Lifecycle Collaboration. Open services for lifecycle collaboration, . URL http://open-services.net/.

[7] Open Services for Lifecycle Collaboration. What is OSLC?, . URL http:// open-services.net/resources/tutorials/oslc-primer/what-is-oslc/.

[8] Ralph R. Swick Ora Lassila. Resource description framework (rdf) model and syntax specification. URL https://www.w3.org/TR/1999/ REC-rdf-syntax-19990222.

[9] B. Berendt. Web Mining: From Web to Semantic Web: First European Web Mining Forum, EWMF 2003, Cavtat-Dubrovnik, Croatia, September 22, 2003, Revised Selected and Invited Papers. Number v. 1 in Hot topics. Springer. ISBN 978-3-540-23258-2. URL https: //books.google.se/books?id=evd1vgAACAAJ.

[10] Implementing an oslc provider. URL http://open-services. net/resources/tutorials/integrating-products-with-oslc/ implementing-an-oslc-provider/.

[11] Shape expressions primer, . URL http://www.w3.org/2013/ShEx/Primer.

[12] Holger Knublauch and Dimitris Kontokostas. Shapes constraint language (SHACL). URL http://www.w3.org/TR/shacl/.

45 46 BIBLIOGRAPHY

[13] José Emilio Labra Gayo, Eric Prud’hommeaux, Harold R. Solbrig, and Iovka Boneva. Validating and describing linked data portals using shapes. abs/1701.08924. URL http://arxiv.org/abs/1701.08924.

[14] SHACL tutorial: Getting started, . URL http://www.topquadrant.com/ technology/shacl/tutorial/.

[15] Shape Expression Community Group. Shape expression language 2.0. URL https: //shexspec.github.io/spec/.

[16] Eric Prud’hommeaux W3C/MIT Thomas Baker, Dublin Core Metadata Initiative. Shape expressions (shex) primer. URL https://shexspec.github.io/primer/ index.html.

[17] IBM Corporation Arthur Ryman. Resource shape 2.0. URL http://www.w3.org/ Submission/shapes/.

[18] Arthur G. Ryman, Arnaud Le Hors, and Steve Speicher. Oslc resource shape: A lan- guage for defining constraints on linked data. In Christian Bizer, Tom Heath, Tim Berners-Lee, Michael Hausenblas, and Sören Auer, editors, LDOW, volume 996 of CEUR Workshop Proceedings. CEUR-WS.org, 2013. URL http://dblp.uni-trier. de/db/conf/www/ldow2013.html#RymanHS13.

[19] Dave Johnson. Open services for lifecycle collaboration core specification version 2.0 appendix a: Common properties. URL http://open-services.net/bin/view/ Main/OSLCCoreSpecAppendixA.

[20] Spin. URL http://spinrdf.org/.

[21] Bijan Parsia Peter F. Patel-Schneider Sebastian Rudolph Pascal Hitzler, Markus Krötzsch. Owl 2 web ontology language primer (second edition). URL http://www.w3.org/TR/2012/REC-owl2-primer-20121211/.

[22] Janie Rees-Miller Mark Aronoff. An Introduction to Formal semantics, chapter 15. Wiley-Blackwell, 2003. ISBN 9781405102520.

[23] Ambjörn Naeve Pete Johnston Thomas Baker Andy Powell, Mikael Nilsson. Dcmi abstract model. URL http://dublincore.org/documents/abstract-model/.

[24] Mikael Nilsson. Description set profiles: A constraint language for dublin core appli- cation profiles. URL http://dublincore.org//documents/dc-dsp/#sect-7.

[25] Jad El-Khoury, Didem Gürdür, Frédéric Loiret, Martin Törngren, Da Zhang, and Mattias Nyberg. Modelling support for a linked data approach to tool interoperabil- ity. In The Second International Conference on Big Data, Small Data, Linked Data and Open Data, ALLDATA, Lisbon, February 21 - 25, 2016. :, pages 42–47, 2016. QC 20160405.

[26] Jad El-khoury. Lyo code generator: A model-based code generator for the devel- opment of oslc-compliant tool interfaces. SoftwareX, 5:190 – 194, 2016. ISSN 2352- 7110. doi: http://dx.doi.org/10.1016/j.softx.2016.08.004. URL http://www. sciencedirect.com/science/article/pii/S2352711016300267. BIBLIOGRAPHY 47

[27] The Eclipse Foundation. Eclipse modeling framework (emf). URL https:// eclipse.org/modeling/emf/.

[28] Sirius tutorials startertutorial - eclipsepedia. URL https://wiki.eclipse.org/ Sirius/Tutorials/StarterTutorial.

[29] Cédric Brun. Graphical modeling from 2016 to 2017: Better, faster, stronger - CTO @ obeo. URL http://cedric.brun.io/eclipse/ modeling-2016-2017-better-faster-stronger/.

[30] Mariot Chauvin Laurent Goubet Jonathan Musset Aurelien Pupier Obeo, Cedric Brun. Acceleo. URL http://wiki.eclipse.org/Acceleo.

[31] N. Prat, I. Comyn-Wattiau, and J. Akoka. Artifact Evaluation in Information Systems Design Science Research – A Holistic View. In PACIS 2014 Proceedings - Pacific Asia Conference on Information Systems, page Paper 23, June 2014.

[32] Herbert A. Simon. The Sciences of the Artificial (3rd Ed.). MIT Press, Cambridge, MA, USA, 1996. ISBN 0-262-69191-4.

[33] Philipp Offermann, Olga Levina, Marten Schönherr, and Udo Bub. Outline of a de- sign science research process. In Proceedings of the 4th International Conference on Design Science Research in Information Systems and Technology, DESRIST ’09, pages 7:1–7:11, New York, NY, USA, 2009. ACM. ISBN 978-1-60558-408-9. doi: 10.1145/1555619. 1555629. URL http://doi.acm.org/10.1145/1555619.1555629.

[34] Alan R. Hevner, Salvatore T. March, Jinsoo Park, and Sudha Ram. Design science in information systems research. MIS Q., 28(1):75–105, March 2004. ISSN 0276-7783. URL http://dl.acm.org/citation.cfm?id=2017212.2017217.

[35] Christian Sonnenberg and Jan vom Brocke. Evaluation Patterns for Design Science Research Artefacts, pages 71–83. Springer Berlin Heidelberg, Berlin, Heidelberg, 2012. ISBN 978-3-642-33681-2. doi: 10.1007/978-3-642-33681-2_7. URL http: //dx.doi.org/10.1007/978-3-642-33681-2_7.

[36] Jose Emilio Labra Gayo. Shaping data. URL http://shapingdata.blogspot. se/2016/06/rdf-validation-tutorial.html. 48 APPENDIX A. LYO PROTOTYPE META-MODEL 49

Appendix A

Lyo prototype meta-model

Figure A.1: Meta-model of the prototype

Appendix B

SHACL on ShEx coverage


SHACL core constraints against ShEx semantics

| SHACL core constraint components | SHACL | ShEx |
|---|---|---|
| Value Type Constraint | sh:class | - |
| | sh:datatype | ShEx Datatype Constraint |
| | sh:nodeKind | - |
| Cardinality Constraint | sh:minCount | {m,} |
| | sh:maxCount | {n} |
| Value Range Constraint | sh:minExclusive | minExclusive |
| | sh:minInclusive | minInclusive |
| | sh:maxExclusive | maxExclusive |
| | sh:maxInclusive | maxInclusive |
| String Based Constraint | sh:minLength | minLength |
| | sh:maxLength | maxLength |
| | sh:pattern | pattern |
| | sh:languageIn | LanguageStem and LanguageStemRange |
| | sh:uniqueLang | LanguageStem and LanguageStemRange |
| Property Pair Constraint | sh:equals | - |
| | sh:disjoint | - |
| | sh:lessThan | - |
| | sh:lessThanOrEquals | - |
| Logical Constraint | sh:not | NOT |
| | sh:and | AND |
| | sh:or | OR |
| | sh:xone | Can be achieved by combining constraints |
| Shape Based Constraint | sh:node | - |
| | sh:property | - |
| | sh:qualifiedValueShape, sh:qualifiedMinCount, sh:qualifiedMaxCount | - |
| Other Constraint | sh:closed | closed |
| | sh:ignoredProperties | - |
| | sh:hasValue | objectValue |
| | sh:in | objectValue |

Table B.1: Table of how SHACL Core constraint components are covered by constraints of the ShEx language.

Appendix C

ShEx on SHACL coverage

ShEx semantics against SHACL core constraints

| ShEx semantic name | ShEx | SHACL |
|---|---|---|
| Node Kind Constraints | bnode | sh:nodeKind sh:BlankNode |
| | iri | sh:nodeKind sh:IRI |
| | literal | sh:nodeKind sh:Literal |
| | nonLiteral | sh:nodeKind sh:BlankNodeOrIRI |
| Data Type Constraints | SPARQL data type operands, may be extended | SPARQL data type operands, allows IRIs and rdf:langString as values |
| String Facet Constraints | length | sh:minLength + sh:maxLength |
| | minLength | sh:minLength |
| | maxLength | sh:maxLength |
| | pattern | sh:pattern |
| Numeric Facet Constraints | minInclusive | sh:minInclusive |
| | maxInclusive | sh:maxInclusive |
| | minExclusive | sh:minExclusive |
| | maxExclusive | sh:maxExclusive |
| | totalDigits | - |
| | fractionDigits | - |
| Values Constraint | objectValue | sh:in |
| | Stem | sh:in + sh:pattern |
| | StemRange, allows exclusion set | sh:in + sh:pattern |
| | Wildcard, allows exclusion set | leave out sh:datatype |

Table C.1: Table of how ShEx Node Constraint Semantics are covered by constraints of SHACL language
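Two rows of Table C.1 can be read as simple translation rules, sketched below in Java. The class and method names are hypothetical, the output is a plain string fragment rather than a real RDF model, and the SHACL terms are spelled as in the W3C SHACL recommendation. The sketch covers only the node kind rows and the length facet, which has no single SHACL counterpart and is expressed by setting sh:minLength and sh:maxLength to the same value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical translation helper based on Table C.1; not part of the prototype.
class ShExToShaclSketch {
    // Node kind rows: each ShEx node kind maps to one sh:nodeKind value.
    static String nodeKind(String shexNodeKind) {
        switch (shexNodeKind) {
            case "bnode":
                return "sh:nodeKind sh:BlankNode";
            case "iri":
                return "sh:nodeKind sh:IRI";
            case "literal":
                return "sh:nodeKind sh:Literal";
            case "nonLiteral":
                return "sh:nodeKind sh:BlankNodeOrIRI";
            default:
                throw new IllegalArgumentException(
                    "Unknown ShEx node kind: " + shexNodeKind);
        }
    }

    // ShEx "length n" becomes a pair of SHACL constraints with equal bounds.
    static List<String> length(int n) {
        List<String> constraints = new ArrayList<>();
        constraints.add("sh:minLength " + n);
        constraints.add("sh:maxLength " + n);
        return constraints;
    }

    public static void main(String[] args) {
        System.out.println(nodeKind("nonLiteral")); // sh:nodeKind sh:BlankNodeOrIRI
        System.out.println(length(8));              // [sh:minLength 8, sh:maxLength 8]
    }
}
```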

Appendix D

SHACL on ReSh coverage


| ReSh | SHACL | Summary |
|---|---|---|
| dcterms:description | sh:description | Non-validating constraint. Provides a description for the property in both languages. |
| dcterms:title | sh:name | dcterms:title is not required. At least one sh:name constraint needs to be defined. |
| oslc:allowedValues | sh:in | oslc:allowedValues may have one resource. sh:in may have many resources and IRIs. |
| oslc:allowedValue | sh:in | oslc:allowedValue may have many IRIs and resources, equivalent to sh:in. |
| oslc:defaultValue | sh:defaultValue | defaultValue has equivalent functionality in both languages. |
| oslc:hidden | - | - |
| oslc:isMemberProperty | - | - |
| oslc:name | sh:name | oslc:name has a cardinality of 1. sh:name has cardinality 1..*. |
| oslc:maxSize | sh:maxLength | To be used for value nodes of type xsd:string in both languages. Constrains the string length. |
| oslc:occurs | sh:minCount & sh:maxCount | oslc:occurs may take values in the range [0 or 1; 1 or many]. SHACL allows for a more precise cardinality definition, and many is defined by excluding sh:maxCount. |
| oslc:propertyDefinition | sh:path | Respectively defines a URI and an IRI for a property, with respective cardinalities of [1] and [0; 1]. |
| oslc:range | sh:class | Specifies the resource types allowed in the value node. Respective cardinalities are [0; many] and [0; 1]. |
| oslc:readOnly | - | - |
| oslc:representation | - | - |
| oslc:valueType | sh:datatype | oslc:valueType allows resources, sh:datatype uses IRIs for pointing to resources. Respective cardinalities are [0; many] and [0; 1]. |
| oslc:valueShape | sh:node | oslc:valueShape is similar to oslc:range but specifies shapes. sh:node specifies node shapes. |

Table D.1: Table of how ReSh Property Constraints are covered by constraints of the SHACL language.

Appendix E

ShEx on ReSh coverage


| ReSh | ShEx | Summary |
|---|---|---|
| dcterms:description | - | Non-validating constraint. Provides a description for the property. |
| dcterms:title | - | dcterms:title is not required. |
| oslc:allowedValues | objectValue | oslc:allowedValues may have one resource. |
| oslc:allowedValue | objectValue | oslc:allowedValue may have many IRIs and resources. |
| oslc:defaultValue | - | Default value of a property. |
| oslc:hidden | - | - |
| oslc:isMemberProperty | - | - |
| oslc:name | id | oslc:name has a cardinality of 1. |
| oslc:maxSize | maxLength | To be used for value nodes of type xsd:string in both languages. Constrains the string length. |
| oslc:occurs | ?, +, *, {m, n} | oslc:occurs may take values in the range [0 or 1; 1 or many]. ShEx allows for a more precise cardinality definition, and many is defined by *. |
| oslc:propertyDefinition | TripleConstraint predicate:IRI | Respectively defines a URI and an IRI for a property, with respective cardinalities of [1] and [1]. |
| oslc:range | - | Specifies the resource types allowed in the value node. Cardinality is [0; many]. |
| oslc:readOnly | - | Specifies whether a property should be read-only and not writeable. |
| oslc:representation | - | Specifies how a resource should be represented. |
| oslc:valueType | Datatype constraints | oslc:valueType allows resources, ShEx datatype uses IRIs for pointing to resources. Respective cardinalities are [0; many] and [0; 1]. |
| oslc:valueShape | value shape @ | oslc:valueShape is similar to oslc:range but specifies shapes. @ is used to specify that an object must conform to the given shape. |

Table E.1: Table of how ReSh Property Constraints are covered by constraints of the ShEx language.
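The oslc:occurs rows of Tables D.1 and E.1 can be made concrete with a small Java sketch. The enum and its method are hypothetical names chosen for this example. The four constants correspond to the four oslc:occurs values defined by OSLC (Exactly-one, One-or-many, Zero-or-many, Zero-or-one), and writing exactly one as {1} in ShEx is one possible choice, since ShEx also treats an omitted cardinality as exactly one.

```java
// Hypothetical mapping of the four oslc:occurs values onto SHACL counts and
// ShEx cardinality markers; a sketch, not the prototype's implementation.
enum OslcOccurs {
    EXACTLY_ONE(1, 1, "{1}"),
    ONE_OR_MANY(1, null, "+"),
    ZERO_OR_MANY(0, null, "*"),
    ZERO_OR_ONE(0, 1, "?");

    final int minCount;
    final Integer maxCount;       // null means "many": sh:maxCount is left out
    final String shexCardinality;

    OslcOccurs(int minCount, Integer maxCount, String shexCardinality) {
        this.minCount = minCount;
        this.maxCount = maxCount;
        this.shexCardinality = shexCardinality;
    }

    // Renders the SHACL side as a Turtle-like fragment.
    String toShacl() {
        String min = "sh:minCount " + minCount;
        return maxCount == null ? min : min + " ; sh:maxCount " + maxCount;
    }
}

class OccursMappingSketch {
    public static void main(String[] args) {
        for (OslcOccurs occurs : OslcOccurs.values()) {
            System.out.println(occurs + " -> SHACL: " + occurs.toShacl()
                + " | ShEx: " + occurs.shexCardinality);
        }
    }
}
```

The sketch also shows why both target languages are described as more precise: arbitrary bounds such as a minimum of 2 and a maximum of 5 can be written with sh:minCount and sh:maxCount or with {2,5}, but have no oslc:occurs equivalent.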