Ontology Driven RDF Data Creation for the Universal Information Adapter

Master’s Thesis

Bastiaan Bijl

Ontology Driven RDF Data Creation for the Universal Information Adapter

THESIS

submitted in partial fulfillment of the requirements for the degree of

MASTER OF SCIENCE

in

COMPUTER SCIENCE

by

Bastiaan Bijl born in Rotterdam

Web Information Systems
Department of Software Technology
Faculty EEMCS, Delft University of Technology
Delft, the Netherlands
http://wis.ewi.tudelft.nl

Croon Elektrotechniek B.V.
Schiemond 20-22, 3024 EE
Rotterdam, the Netherlands
http://www.croon.nl

© 2012 Bastiaan Bijl. Ontology Driven RDF Data Creation for the Universal Information Adapter

Author: Bastiaan Bijl
Student id: 1312405
Email: [email protected]

Abstract

In this thesis a closed RDF data production platform is presented that is highly configurable. The use of Named Graphs supports a modularized ontology and detailed provenance of the multi-user knowledge data. It is designed and configured for the Systems Engineering context of the shipbuilding industry partnership Integraal Samenwerken. The information model originates from the ISO 15926 standard and the design of the platform functions as a prototype for the formulation of ISO 15926 Part 11. The platform consists of a running data-store and a workspace user interface, both available over the Web.

Thesis Committee:

Chair: Prof. dr. ir. G.J. Houben, Faculty EEMCS, TU Delft
University supervisor: Dr. ir. A.J.H. Hidders, Faculty EEMCS, TU Delft
Company supervisor: Ing. L.C. van Ruijven MSc, Croon Elektrotechniek B.V.
Committee Member: Dr. E. Visser, Faculty EEMCS, TU Delft

Preface

Can you believe this world’s just exactly as we built it, running out of control?
— The Nerve // MUTEMATH

The challenges and appeals conveyed in my work at Croon are not easily summarized. I had to face both a rather obscure starting point and a complicated stakeholder organization, but the urgency and ambitious intent of the assignment compensated this by firing my enthusiasm. The opportunity to contribute to the formulation of an ISO norm, the air of applying novel techniques to a problematic industrial practice and the possibility of thus setting an example that might be noticed stimulated me.

From the moment my assignment took its definite form, it was my wish to present in this thesis both a platform with sound design which would be smoothly running at the time of its presentation and a description placing the platform in the center of the current academic discussion. A judgment of how closely I reached that goal is left to the reader, but the biggest effort was spent on implementing, testing and redesigning the platform. The first five months I worked alone, but from then on I had the privilege to experience a different working style for an extensive period when I got the full-time help of a Java programmer. I wrote this report in parallel, but when it was time to finish the work that could be done within the scope of a thesis project, I transferred all further development to him and worked to give this report its final body. A new level of complexity appeared when the work was prepared to be presented to the ISO workgroup. The triple split of bringing together the academic technical view, the practical client view and the political standardization view in this report resulted in an extra appendix for ISO readers.

I want to thank the people from our department that formed such a friendly basis to work from. I highly appreciate the freedom Leo van Ruijven and Jan Hidders gave me and the confidence they had in me, although at some points I felt it was too much. The biggest help was Mohamad Alamili, to whom I owe both an endless source of reflection and a great bunch of fruit.

Bastiaan Bijl Delft, the Netherlands January 15, 2014


Contents

Preface iii

Contents v

1 Introduction  1
  1.1 Background  1
  1.2 Subject matter  2
  1.3 Research objectives  2
  1.4 Approach  5
  1.5 Outline  6

2 Linked Data and Linked Data Applications  7
  2.1 What is Linked Data?  7
  2.2 Linked Data Applications  14
  2.3 Development libraries  21

3 Semantic Information Integration Platforms  23
  3.1 Information Systems for CAE  23
  3.2 ISO 15926  27

4 Context of Integraal Samenwerken  31
  4.1 Integraal Samenwerken  31
  4.2 Project 8  31
  4.3 Requirements  32
  4.4 User analysis  35
  4.5 Pre-ODRAC implementation  36
  4.6 Distributed context  37
  4.7 Example model  38

5  41
  5.1 Individuals  41
  5.2 Relationships  45
  5.3 Individual templates  47
  5.4 Meta-data  49
  5.5 Graph replace-chain  50
  5.6 Transaction model  51
  5.7 Translation labels  51
  5.8 Libraries  52
  5.9 Relation to OWL  57
  5.10 Conclusion  58

6 System Architecture  59
  6.1 Data-store service  59
  6.2 Workspace service  62
  6.3 Workflow  66
  6.4 Project configuration  69
  6.5 Mappers  69
  6.6 Conclusion  71

7 User Interface  73
  7.1 The WUI within the platform  73
  7.2 Workspace  74
  7.3 Browser  75
  7.4 Valuebox  77
  7.5 Feedback on submitted values  78
  7.6 Navigator  79
  7.7 Importing and exporting TriX-files  80
  7.8 Manual querying  80
  7.9 Conclusion  80

8 Implementation  81
  8.1 Java package structure  81
  8.2 Java design patterns  83
  8.3 The use of TDB  87
  8.4 JavaScript application  88
  8.5 Conclusion  91

9 Evaluation  93
  9.1 Comparison to 1.x version  93
  9.2 User satisfaction  94
  9.3 Reception by Integraal Samenwerken  95
  9.4 Performance  97
  9.5 Reliability  99
  9.6 Configurability  100
  9.7 Extensibility  101

10 Conclusions and Future Work  103
  10.1 Conclusion  103
  10.2 Contributions  104
  10.3 Discussion  104
  10.4 Future work  105

11 Reflection  107
  11.1 Phases  107
  11.2 Methodology  109
  11.3 Stakeholder communication  109
  11.4 Literature referencing  110

Bibliography 113

A Ontology review  119
  A.1 Management  119
  A.2 Structure  121
  A.3 Conclusion  124

B Quality aspects  125
  B.1 ISO 25010 quality in use  125
  B.2 ISO 25010 product quality  126
  B.3 Semantic Content Authoring  128

C Graphs  131
  C.1 Granularity  131
  C.2 Stable states  132
  C.3 Construction states  138

D Primer vocabulary 139

E Perspective of ISO 15926  145
  E.1 ODRAC within the ISO's narrative  145
  E.2 Mapping narrative to actual design  147


Chapter 1

Introduction

In this work a Linked Data production platform is presented. It is built as a prototype application following the data modelling approach from the ISO 15926 Part 11 methodology. This standard from the Systems Engineering community prescribes an RDF structure and contains some architectural directions for knowledge base platforms that can fully describe industrial systems. The methodology is described in Section 3.2. The guided RDF data editor presented here could be applied to other knowledge domains as well. The use of a restrictive vocabulary layer on top of RDF and RDF Schema, allowing a highly configurable multi-level ontology to structure the data production process, is the core principle; throughout this thesis we therefore refer to it as the ODRAC platform: Ontology Driven RDF Data Creation. The ODRAC system can be called both a production platform and an application, although the first is more accurate because the proposed design consists of multiple software nodes and includes a data model.

1.1 Background

The work was done at Croon Elektrotechniek BV, at the department of Technology Development located in Delft. Croon is one of the member companies of the Integraal Samenwerken partnership, and in this context it is responsible for the development of the Universal Information Adapter (UIA). The Adapter is a functional solution for information integration within the Dutch shipbuilding industry and it plays a key role in a number of projects of the partnership. The main purpose of the Adapter is to integrate different information sources and to facilitate information negotiation between companies contracted to design and build a ship. A general information integration platform could make the engineering work on a ship more efficient and reduce miscommunication and related failure costs. The partnership is the main stakeholder in the design of the ODRAC platform and the requirements will mainly be gathered from this context. The reception of the design and implementation by Integraal Samenwerken plays an important role in the evaluation of the platform.

The close relation Croon has with the ISO 15926 community influenced the choice to apply the Systems Engineering legacy of this standard to the UIA case. Croon also has a hand in the formulation of Part 11, which is intended as a minimal methodological basis for how to use the ISO 15926 theory in a practical way.


The work of this thesis functions as a reconnaissance of an efficient interpretation of this Part 11 approach. Readers from the ISO perspective are referred to Appendix E for an explanatory bridge between the ISO narrative and the approach of this thesis.

1.2 Subject matter

The choice for the ISO 15926 Part 11 approach brings with it the use of RDF and the need to formalize the syntax of communication data, for example using the TriX XML format. Linked Data is a very open technology used in varying contexts, resulting in different types of Linked Data editors. Linked Data has promising capabilities for overcoming information integration problems, but applying these capabilities is not trivial. The expressive freedom of RDF can result in data production that is excessive in both amount and complexity, and the research on the application of Linked Data is far from settled. At the same time, the UIA has complex expectations like full traceability and high evolvability. On the other hand, the ISO data model (Part 2) itself contains a modeling power in need of simplification [74, 69] and a structure in need of revision if it is to be expressed in RDF. The work presented in this thesis proposes such a revision by presenting a technical data model that can be configured to support the ISO Systems Engineering approach.

Before the thesis work started, an earlier version of the UIA platform existed, together with an RDF data model. This platform will be referred to as version 1.x. Because of its low level of maintainability (modularity, reusability and modifiability) and fundamental problems with operability and portability (due to the off-line installation of the User Interface application), a fundamentally new design was needed. One of the first steps was to evaluate the data model of the 1.x version; a report of this evaluation is included as Appendix A as a clarification of the starting point. The new platform had to overcome the obscurity of the code base and the data model, it needed to be on-line and it needed to be able to support more than one data-store within one project. Yet another important characteristic of RDF data-stores that demanded attention is their lack of scalability.

A final point of attention for the new platform came from the political position of the application as a prototype within Integraal Samenwerken. The 1.x version had the function of trying out and demonstrating a new way of working. The new version of the platform had to fulfill its function as flagship of the legacy of Integraal Samenwerken as well as possible. This practically resulted in a focus on user interface aesthetics, performance and a clear conceptual resemblance of elements from the data model in the user interface.

1.3 Research objectives

The main question of this engineering thesis brings together the ISO standard, RDF and the Integraal Samenwerken context in a version 2.x platform implementation, combining them with the described focus points:


What is a proper way to build a highly configurable, extendable, user friendly, well performing and reliable RDF data production platform suited for Integraal Samenwerken Systems Engineering projects, that follows the approach of the ISO 15926 Part 11 reference data methodology?

As described above the ISO 15926 Part 11 approach forms the starting point of this thesis. The methodology aims for a continuation of the modeling power of ISO 15926 within the technical framework of RDF. A subquestion this thesis needs to answer is how this can be done.

(a) How can Systems Engineering knowledge data based on the ISO 15926 Part 2 Data model and containing the Part 4 Initial reference data be expressed in RDF?

The answer to this subquestion is already part of the platform design. A set of requirements and design decisions is influenced by the ISO's way of thinking, but the main source of the requirements is the first stakeholder, as addressed in the next subquestion.

(b) Which requirements spring from the Integraal Samenwerken context?

The version 2.x platform that will be presented as an answer to the main question should meet the requirements, but special attention should also be paid to the way of evaluation and the argumentation behind it. The following subquestion addresses that.

(c) What do we consider a proper platform, given the focus points?

In order to evaluate the quality of a piece of software, a qualification measure has to be established. From [40] and [45] important criteria for the ODRAC design are selected; in Appendix B a full response is given. The selection of focus points was largely based on the state of the Integraal Samenwerken project, and is reflected back into the main question. What follows is a short description of the points and their relation to the selected quality aspects. Some characteristics were important, but did not fit in the scope of the thesis project; these are also described below. Thus, the remainder of this section is meant as a first answer to subquestion (c). In the evaluation in Chapter 9 this answer is completed.

1.3.1 High configurability

The platform can be configured to support any information model constructed as a directed graph with labeled nodes and edges. It should be able to deal with changes at all levels of the information model. The multi-layer ontology should allow ODRAC projects to configure fully which Classes and Properties are used. This requires a high level of flexibility. But the platform is also expected to be developed further beyond the scope of the thesis, so the evolvability of all parts of its design is important. In [45] the ability to adapt to different situations or use cases is called generalizability.

1.3.2 Extendability

As discussed, the maintainability should be improved considerably compared to the 1.x version. Two subaspects that need extra attention are reusability and modifiability of the code.


Because of the prototype stage, the Workspace User Interface is expected to be enriched in the future with new graphical presentations of entities and patterns. This requires a clear application design and implementation, also at points like graphical user interaction where the abstraction and understandability of the information model is not the major structuring concept.

From the Model Driven Architecture community we borrow the insight that the use of a model in software production helps to increase abstraction, understandability, accuracy, predictiveness and inexpensiveness [26]. Strictly speaking the ODRAC design is not built following MDA, because there is no automated connection between a model and generated code, but because the domain specific configuration of the application is outsourced to a multi-level ontology, part of the application can be redesigned without the need for a recompilation of the code. This does contribute to the simplicity of the code. An extra reason for a clear code design is that it improves the testability of the code. The configurability of the system also makes it more flexible to test. In Section 10.3 the extent to which the code was really tested is discussed.

1.3.3 User friendliness

Because of the political function of the platform to introduce test users and decision makers to a data integration process guided by a software platform like ODRAC, learnability and user interface aesthetics are more important than average. In [38] the human interaction difficulties of the visualization of ontologies are recognized. Although the Workspace User Interface is not primarily intended for browsing through the ontology layer of a data-set, and will not be operated by people with elaborate knowledge of ontology design or knowledge base structures, in the WUI design some attention is paid to representing core entities in an intuitive fashion. In [42] the choice to reflect the data structure in the user interface is criticized, but in ODRAC this path is taken deliberately.

1.3.4 Performance

The use of ODRAC is not time critical, but because of the possible size of project data, it is not trivial to prevent scaling problems. Both the capacity of the whole knowledge base and the resource utilization of the data storage and the Web browser-server communication can pose problems for large amounts of data. The system should be usable without unacceptable delays.

1.3.5 Reliability

The reliability aspect that is of high importance is user error protection: preventing illegitimate user actions to guard the quality of the data. This is related to the required high level of trust the system is supposed to inspire. Important security aspects are non-repudiation, accountability and authenticity, which together should be capable of proving that the claims that end up in the knowledge base were undeniably produced by the recorded user.


1.3.6 Systems Engineering projects

In Systems Engineering projects like those in the Integraal Samenwerken context of shipbuilding, a number of special requirements arise. Three kinds of freedom from risk (economic, health/safety and environmental) come about because the decisions made during the production of SE data can lead to risk consequences when the engineered product is built. The platform should therefore support the creation and negotiation of those data entries as well as possible. This is closely related to the reliability requirements stated above. Another important quality of the system is that it should be able to contain any SE fact that falls into the knowledge scope of a project. Thus context completeness is important, and this is related to the configurability. A third set of requirements has to do with the open nature a data communication platform is required to have in an SE project. The Workspace User Interface should be replaceable by any other application that complies with the communication protocol, and should be able to co-exist with such a data viewing and production application. Thus the platform is required to support interoperability. Also inherent to Systems Engineering is the requirement that the platform supports collaboration, where users interact to negotiate data values.

1.3.7 Quality characteristics outside scope

Other important security aspects are the confidentiality of the data of a project and preserving integrity by preventing access by unauthorized users, but the selection of which data should be visible to whom is largely left out of the design. The same is true for recoverability of the data after a server crash or an error. Some thought was spent on both issues, but the scope of the thesis project was too narrow to design and test up to these requirements.

1.4 Approach

An engineering thesis generally consists of two products. The RDF data production platform is the central deliverable of this thesis. A live demonstration version can be found at http://www.uia15926-11.com (login demo/demo). This report is the second product, and it contains a presentation and defense of the design.

The thesis work was done in a number of phases, discussed in more detail in Section 11.1. In the first phase the thesis goal was fine-tuned on the basis of research on the 1.x version. In the second phase the requirements, design and implementation were alternately worked out for a new 2.0 version. Later on a freelance Java programmer joined, starting the third phase; he did part of the implementation and in doing so stimulated the work on the design. The fourth phase was started when a big design change was implemented, resulting in version 2.1. A number of different tests were done, and some parts of the design and implementation were altered or fine-tuned. At a certain point the change iterations were stopped and the scope of the thesis project was closed. In the fifth phase the requirements and design were collected and described in the present document. Nevertheless the work on the platform continues.


This means the running demo implementation might deviate from the design presented in this thesis; future work mentioned in this report could already be implemented there.

1.5 Outline

The report is structured as follows. The upcoming three chapters sketch the context and background of the platform, and from these descriptions the requirements are derived. In Chapter 2 the Linked Data technique is introduced, followed by a brief overview of different kinds of Linked Data editors. Related work in other Semantic Information Integration Platforms is described in Chapter 3. In Chapter 4 subquestion (b) is treated by describing the context and content of Project 8 of Integraal Samenwerken together with the approach of the ISO 15926 Part 11 methodology. Requirements for the ODRAC platform are established and an example is introduced of a Systems Engineering model that should be expressible in ODRAC.

The subsequent four chapters present the design. First the use of RDF to encode the knowledge data is explained in Chapter 5, thus answering subquestion (a). The service architecture, the description of the software nodes, is given in Chapter 6. The Workspace User Interface is presented in Chapter 7 and some relevant implementation details are shared in Chapter 8. A conclusion section at the end of each of these four chapters relates the design to the requirements from Chapter 4. The design is evaluated in Chapter 9 and conclusions and future work are given in Chapter 10. The final Chapter 11 contains a reflection on the thesis process.

Chapter 2

Linked Data and Linked Data Applications

This thesis adds a new platform to the line of Linked Data applications. In the present chapter we investigate the typical characteristics of a Linked Data application by sketching a technological perspective on the concept. First the relevant technical building blocks are summarized. The second section gives an overview of different types of Linked Data applications. Finally some words are spent on popular development libraries.

2.1 What is Linked Data?

Linked Data can be described as a set of best practices for publishing and connecting structured data on the Web [5]. It is the driving mechanism behind the transition from the human readable World Wide Web to a machine interpretable Web of Data, and it is a way of powering the Semantic Web idea that has been around since the introduction of the Web [16]. Because the ambitious goal is to connect all knowledge and map it to one global data space, the main ideas are kept very basic. The four Linked Data principles for publishing to the Web of Data are: 1) use URIs as names for things; 2) use HTTP URIs so people can look up those names; 3) provide useful information using standards (RDF, SPARQL) when someone looks up a URI; 4) include other URIs in the information so users can discover more things.

As can be seen in these principles, Linked Data relies on a set of technology standards (URI, HTTP, RDF, SPARQL), each of which is standardized by the World Wide Web Consortium (W3C). In the following subsections the standards are explored as far as relevant for the topic of this thesis. The HyperText Transfer Protocol (HTTP) will be left for what it is. The main sources of reference are the six-document RDF specification from February 2004 and the eleven-document SPARQL 1.1 proposed recommendation from November 2012.

2.1.1 Resource

The Resource Description Framework (RDF) provides a way for anyone to make statements about any resource [47].


This resource has to be formulated as a Uniform Resource Identifier (URI), which means it has to follow the RFC 2396 Generic Syntax that prescribes the basic structure and the characters to use:

<scheme>://<userinfo>@<host>:<port>/<path>?<query>#<fragment>

A resource is a globally unique name for a thing. It can represent any concept from human knowledge; anything that has identity. Although the Linked Data principles prescribe an HTTP URI, the framework does not assume a relation between the resource and a document retrievable using HTTP or any other protocol. Resources are treated as logical constants and possible meaning carried in the specific URI is ignored [37]. Although a URI is globally unique and represents one thing only, the one-to-one uniqueness of a name is not assumed: different resources may turn out to describe the same concept.

Although no relation is assumed between the URI and a retrievable document, in [60] the W3C stresses in a practical guide that all URIs used in a data-set should be on the Web. It prescribes that the URI should either return human readable HTML or RDF data, determined by HTTP content negotiation. At the same time it states that URIs should not be ambiguous in whether they represent an entity or a document describing that entity. As a consequence they advise two rather complicated ways to build unambiguous URIs. In this thesis there is no pressing reason to follow one of the two. Even the simpler example of the Semantic MediaWiki to let http://example.org/wiki/Delft refer to an HTML document and http://example.org/wiki/_Delft to the city is out of scope. In the ODRAC platform there is no apparent need for HTML representation of data by the Data-store service, because the offered Workspace User Interface (Chapter 7) is the promoted point of access for human users.

The guide proceeds by giving three qualities of good URIs. Simplicity makes them easy to remember; stability makes them available for twenty years and hides implementation specific information; manageability makes it easy to publish new versions without violating the first two aspects.
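To make the generic syntax concrete, the small Java sketch below (Java being the implementation language of the platform, see Chapter 8; the example URI is made up for illustration) decomposes an HTTP URI into the components listed above, using the standard java.net.URI class, which implements the RFC 2396 syntax:

    import java.net.URI;

    public class UriParts {
        public static void main(String[] args) {
            // A fictional URI exercising every component of the generic syntax.
            URI uri = URI.create("http://user@example.org:8080/wiki/_Delft?lang=nl#history");
            System.out.println("scheme:   " + uri.getScheme());    // http
            System.out.println("userinfo: " + uri.getUserInfo());  // user
            System.out.println("host:     " + uri.getHost());      // example.org
            System.out.println("port:     " + uri.getPort());      // 8080
            System.out.println("path:     " + uri.getPath());      // /wiki/_Delft
            System.out.println("query:    " + uri.getQuery());     // lang=nl
            System.out.println("fragment: " + uri.getFragment());  // history
        }
    }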

2.1.2 Triples and Quads

In the RDF framework data is modeled as a directed graph. Nodes can be either a Resource (URI) or a Literal (String). Edges between those nodes are defined in statements. A statement always consists of three parts: subject, predicate and object, hence it is called a triple. A predicate is also called a property and represents an edge from the subject node to the object node. A predicate is always a Resource, and is generally defined in a vocabulary (see the following subsection). Subject and object generally are a Resource too, but RDF allows the use of a blank node: a node without a URI or Literal value which functions as a link between triples. Finally, an object can also be a Literal which carries a text value (a string, number, date, etc.). Those could be identified using URIs too, for example in number-lists, but for the sake of simplicity it is possible to specify them ad hoc, without reference to a list [47]. Triples can be seen as statements describing Resources, and so they are the building blocks of the Resource Description Framework. There is no other way to define nodes than to mention them in a statement. A set of statements together forms a directed graph.

Such a graph can be stored in a file, but as prescribed in principle 3) it should be traversable over the Web too. A Web service offering such access is called an RDF data-store. One data-store can contain more than one directed graph. It is up to the implementation of the Web service to manage the graphs it contains, and this is especially relevant when it comes to SPARQL query evaluation, where patterns can be matched to different graphs [33]. When the term model is used to refer to a concrete selection of statements instead of the abstract framework, it generally refers to a specific RDF graph inside some RDF data-store. We will also use the term data-set for a meaningful collection of statements served by a data-store.

A problematic feature of the triple structure is the difficulty of referring to a triple using a resource in order to make statements about statements. Meta-statements like this are mainly needed for provenance and trust information. RDF contains a way of describing triples using the type rdf:Statement and the properties rdf:subject, rdf:predicate and rdf:object. This is called reification and it is a way to model a triple using triples, thus describing a copy of a statement in order to refer to it. It introduces a big increase in the number of needed triples and it does not work well in practice [14].

A triple can be extended with a fourth URI element, turning the statement into a quad. This fourth URI can be used as a name for the triple, or for a set of triples if more than one gets the same name. The resulting single statement or set of statements is called a Named Graph. The graph name URI can thus be used as a representative of the set of statements. It is possible to refer to the set using its name directly, superseding the need to model the statements with reification. In this way statements about the named triple can be made simply by using the name URI as the subject or object in another triple. This technique was formally introduced in [11, 12].

A Named Graph can be used in two slightly different ways. Originally a graph was intended to represent real-life collections of things like books or people, and querying multiple graphs meant combining different repositories, say collections of different city libraries or personal contact lists. In that sense, graphs were large-scale collections. For example, all the books of a certain collection at a library would get the same graph name URI in the quads, connecting the triple part of the statements to that collection. The use of Named Graphs to define subsets of statements within a full repository graph is relatively new. In [72] Named Graphs are used in practice to record version control in the WikiWikiWeb, and in [73] Named Graphs are collections of data elements that belong to a certain context. In [62] Named Graphs are employed as structural entities with a certain role, like containing either Schema or Instance-base data. In their work a more extensive set of predicates is defined to describe the relation between Named Graphs; for example nrl:imports allows importing the content of one graph into another. Although few reports on the specific use of Named Graphs have been published, all but one of the seventeen RDF stores listed in [35] support Named Graphs fully, two of which do not use the quad format to save the graph names.

In the ODRAC platform Named Graphs are used both as structural entities and as collections labeled with provenance data. An important part of the vocabulary in Appendix D deals with the graph structure and can be compared with the nrl vocabulary of [62].
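As a brief illustration of this double use, the following sketch builds a quad-style data-set with the Apache Jena library (the library the implementation builds on, see Chapter 8; the graph names and the creator value are invented for the example). The Delft triple is stored in a Named Graph, and a provenance statement about that graph is then made simply by using the graph name as a subject, without any reification:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.vocabulary.DCTerms;
    import org.apache.jena.vocabulary.RDF;

    public class NamedGraphSketch {
        public static void main(String[] args) {
            Dataset dataset = DatasetFactory.create();

            // A Named Graph: all triples added to this model share one graph name.
            String graphName = "http://example.org/graph/delft-facts";
            Model facts = dataset.getNamedModel(graphName);
            facts.add(facts.createResource("http://example.org/wiki/_Delft"),
                      RDF.type,
                      facts.createResource("http://repo.com/vocabulary/City"));

            // A statement about the graph itself: the graph name URI acts as an
            // ordinary subject resource, superseding reification of the triple.
            Model provenance = dataset.getNamedModel("http://example.org/graph/meta");
            provenance.add(provenance.createResource(graphName),
                           DCTerms.creator, "demo-user");
        }
    }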


2.1.3 Vocabularies and Ontologies

Continuing the description of Delft, the RDF description of the city retrieved by requesting http://example.org/wiki/_Delft certainly contains a statement like this:

wiki:_Delft rdf:type voc:City

In natural language this statement would read as “Delft is a city”. It is a statement describing an instance of a certain concept, and it is a triple consisting of three resources, each of which starts with a different namespace prefix. Both in SPARQL queries (2.1.4) and in XML representations of RDF (2.1.5) these prefixes can be used to abbreviate the first part of lengthy URIs, called the namespace. The first resource clearly abbreviates the domain name earlier referred to as http://example.org/wiki/. The second one, rdf, refers to the RDF vocabulary. A vocabulary is a set of URIs sharing one namespace, defined for some specific purpose; the purpose of this specific RDF set is defined as RDF for its own use [48]. The third prefix is fictional and abbreviates some vocabulary which, assumed from the context, contains some resources representing geographical concepts.

The example is meant to illustrate the principal difference between the given statement that Delft is a City and the statement inside the fictional voc: namespace that defines the City concept itself. This geographical vocabulary contains a definition of this concept:

voc:City rdf:type rdfs:Class

The mathematical theory of Description Logic makes a split between assertions on individuals (ABox) and assertions on concepts (TBox). The first category contains instance assertions like our first example; the second contains the definitions of general categories and properties like the second example [18]. RDF does not provide means to build a typing system, but with the use of the RDF Vocabulary Description Language RDF Schema (conventionally associated with the prefix rdfs) a set of classes and properties can be described, forming a TBox vocabulary [48]. The definition of the concept City in a vocabulary makes it possible to make instances of the City type using an ABox assertion like the one in the first example. ABox assertions (Statements) are not publishable inside a vocabulary, but constitute a data-set or knowledge base [38].

The development of the TBox possibilities over Linked Data is ongoing. The W3C has published a vocabulary layer on top of RDF Schema called the Web Ontology Language (OWL). The first version, published in 2004, was accompanied by the description that OWL adds more vocabulary for describing properties and classes: relations between classes (e.g. disjointness), cardinality, equality, richer typing of properties, characteristics of properties (e.g. symmetry), and enumerated classes. The OWL statements about classes result in Class Expressions which can contain all kinds of details. Matching these patterns against real instances opens a rich possibility of inferring complex class membership. When the expressiveness of OWL is used to make a TBox description of concepts in some context, the resulting set is no longer a vocabulary but called an ontology. In later versions of OWL an ontology was simply described as a formalized vocabulary of terms [30].
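The following Jena sketch (namespaces taken from the running example) shows the two levels side by side: a TBox assertion defining the City class and an ABox assertion making Delft an instance of it:

    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;
    import org.apache.jena.vocabulary.RDFS;

    public class TBoxABoxSketch {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();

            // TBox: the vocabulary defines the concept City as an RDF Schema class.
            Resource city = model.createResource("http://repo.com/vocabulary/City");
            model.add(city, RDF.type, RDFS.Class);

            // ABox: a data-set asserts that the resource Delft is an instance of City.
            Resource delft = model.createResource("http://example.org/wiki/_Delft");
            model.add(delft, RDF.type, city);

            model.write(System.out, "TURTLE");
        }
    }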


In [38] as many as seven different meanings of the word ontology are given, originating from the field of philosophy called the theory of being, which studies the categories in which reality presents itself. When we take the perspective of ontology representation languages for computer systems, it becomes clear the W3C is not the only player. In [26] a list is given starting with the early languages from the beginning of the nineties, followed by the Web based languages from the last ten years. But Semantic Web technology is more and more the default to turn to when building automated knowledge representations, and the stack of W3C standards has become the default ground to build on. In the ODRAC platform the W3C is followed as the authority, and the RDF and RDFS vocabularies are the foundation for a native ODRAC vocabulary layer. OWL is not used, mostly because it is not in line with our use of Named Graphs and the expressiveness it offers, at the cost of complex inferencing, is not needed. In Section 5.9 the difference between the ODRAC data model and OWL is explained in more detail.

Roughly two approaches can be taken to develop an ontology [2]. When starting bottom-up, the most specific concepts in the knowledge domain are described by more abstract categories, which are needed to explicate the differences and similarities between the bottom concepts. Typically this results in ontologies that are difficult to modify and integrate with other ontologies. The other way to go is to start with a widely accepted set of high-level concepts and connect them to the bottom line instances. This is the top-down approach. A danger in this approach is an overabundance of categories that do not have any application. A way to combine both approaches is to formulate an upper ontology as a top-down design with clear borders, and use a bottom-up process to relate the domain specific entities to this upper ontology [2]. In the ODRAC system such an approach is supported and encouraged, and it forms a main principle in how the system is applied to the ISO Systems Engineering domain. In that context we will refer to the upper ontology simply as ontology, but in other cases the term ontology refers to any TBox RDF set. Which of the two meanings is meant should be clear from the context.

In the Linked Data practice both RDF and OWL data-sets are not considered complete. This open world assumption means no conclusions may be drawn from the absence of certain data. Some explanations even include the possibility of conflicting statements [11]. With the use of Named Graphs it is easier to evaluate and harmonize statements, because the name of the graph is required to be a unique identifier of the contained statements, which are expected to be internally consistent [11]. It is up to specific applications to deal with open world implications. In [62] for example, part of the system scales up to a closed world environment. In the ODRAC system an even more rigid control structure is used. See Section 2.2.7 for a discussion of different types of closed platforms.

2.1.4 SPARQL

One of the great benefits of a Linked Data representation of knowledge is the possibility to use the SPARQL query language to interact with the data. The name is a recursive acronym meaning SPARQL Protocol and RDF Query Language, yet in time the term was extended to refer to a set of specifications that provide languages and protocols to query and manipulate RDF graph content on the Web or in an RDF store [71]. Three recommendation documents work out the query [33] and update language [27].


In [55] special attention is given to the SERVICE keyword, which makes it possible to call different SPARQL endpoints from within the query. Three other recommendations describe ways to encode query results. The remaining SPARQL specifications contain a protocol for sending and receiving query and update statements over HTTP [24]; a protocol for managing a Graph Store over HTTP [53]; an RDF vocabulary to describe a SPARQL service [75] and a description of graph patterns to use for entailment [28].

The query and update language borrow some elements from traditional SQL languages, like the use of the keywords SELECT and FROM and the overall structure. A typical SPARQL query is given below (the URIs are those of the running example):

    PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX voc: <http://repo.com/vocabulary/>
    SELECT ?city
    FROM <http://example.org/wiki>
    WHERE {
      ?city rdf:type voc:City .
      ?city voc:has_road_to <http://example.org/wiki/_Delft> .
      FILTER (?city != <http://example.org/wiki/_Delft>)
    }

The query starts with a number of prefixes that are used in the remainder of the query to increase readability. The SELECT clause enumerates the variables that are expected to appear in the result. The FROM clause is optional and refers to the Web resource (URI) of the data-store. The FROM NAMED clause can be used instead to pinpoint one specific graph inside the graph set. Finally the WHERE clause defines a pattern that will be matched to the RDF Graph. This pattern can consist of any number of triples separated by dots. Subpatterns can also be formulated using curly brackets, and these can be combined with UNION or subtracted with MINUS.

In the example above two triples are given. The subject of both triples is the same binding variable ?city. This means that for all bindings to this variable both triples should exist. The predicate and object in both triples are real resources that will be looked for in the data-set. At the final part of the pattern a FILTER rule is specified. It states that a binding to the Delft identifier should not be added to the result set. Thus the query result is a one column table containing all cities that have a road directly to Delft, and that are not Delft itself. An alternative to the FROM NAMED specification is the GRAPH keyword that can be used inside the WHERE pattern to define a pattern over multiple graphs. An example of such a SPARQL pattern is given in Section 8.2.3.

A powerful feature in SPARQL 1.1 is the recursive use of predicates. The query above returned a list of Cities that could be reached from Delft by using one road. The following small modification, a + sign appended to the predicate, would return the list of cities that can be reached by using any number of roads from Delft:

    ?city voc:has_road_to+ <http://example.org/wiki/_Delft>

A current shortcoming of SPARQL is the impossibility of defining a recursive graph pattern. In Section 5.8.2 we will come back to this topic. Apart from the extensive use of SPARQL queries and some update statements, two other SPARQL specifications are used in the ODRAC platform.

The ODRAC Data-store service adheres to the query and update protocol [24] (see Section 6.1.1), and query results from the Data-store server are encoded in the XML format to send them to the WUI client [36].

As mentioned above, SPARQL can be used for more than querying. It can also be used to insert data into a data-set or to construct data based on a select query. In this way new data can be derived from already existing data. The available RDF software libraries also support this deductive reasoning. When, for example, a certain class has a subclass that has an instance, which can be described using two triples, a third triple can be derived stating that this instance of the subclass is also an instance of the class. These deductions are called inferences and they follow the logic of inference rules. Reasoning with inference rules can be used to find entailment of new facts. Entailments are formal conclusions that can be drawn from data using inference rules.
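A minimal sketch of such an inference with the Jena library (the Place class and the subclass relation are invented for the illustration): from the two asserted triples, the built-in RDFS reasoner entails the third one.

    import org.apache.jena.rdf.model.InfModel;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.rdf.model.Resource;
    import org.apache.jena.vocabulary.RDF;
    import org.apache.jena.vocabulary.RDFS;

    public class EntailmentSketch {
        public static void main(String[] args) {
            Model base = ModelFactory.createDefaultModel();
            Resource place = base.createResource("http://repo.com/vocabulary/Place");
            Resource city  = base.createResource("http://repo.com/vocabulary/City");
            Resource delft = base.createResource("http://example.org/wiki/_Delft");

            // Two asserted triples: City is a subclass of Place, Delft is a City.
            base.add(city, RDFS.subClassOf, place);
            base.add(delft, RDF.type, city);

            // The RDFS reasoner derives the entailed triple: Delft is also a Place.
            InfModel inf = ModelFactory.createRDFSModel(base);
            System.out.println(inf.contains(delft, RDF.type, place)); // true
        }
    }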

2.1.5 RDF/XML and TriX

Query result tables can be encoded in different formats (e.g. XML). To encode RDF data itself, a file format should also be selected. As described, the request of a certain resource should result in a response describing the resource with statements, thus linking the resource to other resources. This RDF response can be encoded using the RDF/XML standard described in one of the six RDF W3C recommendations [3].
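An example is given here; it is a minimal sketch expressing the "Delft is a City" statement from the previous subsection, with the namespaces of the running example:

    <?xml version="1.0"?>
    <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
      <rdf:Description rdf:about="http://example.org/wiki/_Delft">
        <rdf:type rdf:resource="http://repo.com/vocabulary/City"/>
      </rdf:Description>
    </rdf:RDF>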

As long as the content in the response resides in one RDF graph this format suffices. But if the content of multiple Named Graphs should be serialized in one file, a problem occurs because the format only allows the mention of one graph name [14]. An alternative to the RDF/XML notation is the TriX format, which combines high graph expressiveness and simplicity with full compliance with XML. A TriX version of the "Delft is a City" statement could look as follows:

    <TriX xmlns="http://www.w3.org/2004/03/trix/trix-1/">
      <graph>
        <uri>http://example.org/wiki</uri>
        <triple>
          <uri>http://example.org/wiki/_Delft</uri>
          <uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri>
          <uri>http://repo.com/vocabulary/City</uri>
        </triple>
      </graph>
    </TriX>

It is clear from this example that a little more text is needed in the TriX format, but it is easier to read for people who expect to see the underlying triple structure. In contrast to RDF/XML, no namespace prefixes can be used within the TriX format. An XSLT extension is supported to allow this, but it introduces an extra parsing step while reading. This extension will not be used by the ODRAC platform.
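Such TriX files do not have to be read or written by hand. A sketch with the Apache Jena library (recent Jena versions support TriX through the riot module; the file name is hypothetical):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.riot.Lang;
    import org.apache.jena.riot.RDFDataMgr;

    public class TrixRoundTrip {
        public static void main(String[] args) {
            // Parse a TriX file into a data-set of Named Graphs...
            Dataset dataset = RDFDataMgr.loadDataset("delft.trix", Lang.TRIX);
            // ...and serialize it back, graph names included.
            RDFDataMgr.write(System.out, dataset, Lang.TRIX);
        }
    }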


2.2 Linked Data Applications

In this second section of the Linked Data chapter we try to give an impression of the wide variety of software applications that deal with Linked Data. The purpose of this overview is to briefly cover all categories that are distinguished in the literature in order to position the ODRAC platform among them. When the discussion hits a topic that is relevant for the design of ODRAC we will zoom in to open the academic perspective, but still very briefly. Many names are given to help the reader interested in more information to find his way quickly online.

In [5] an overview is given of RDF publishing tools and Linked Data applications, the former describing eight endpoints, the latter split into browsers, search engines and domain specific applications. But even considering four years of extra development, the picture was not quite complete. The use of Linked Data techniques results in a Web of Data, and it is clear that in this new Web endpoints (2.2.1) serve data on which other applications operate, but the perspective of an Internet-like whole of data does not do justice to the full range of Linked Data usages. Initiatives like Haystack [39] and NEPOMUK [62] aim to semantically enrich the personal data in a desktop environment, resulting in an integrated personal knowledge base on which agents can operate, but that is not of big interest to the global Web. Similar domain specific small-scale Webs can be built for any knowledge base system [26]. In this section these applications will be considered closed platforms (2.2.7). Both open and closed platforms have their endpoints serving data, but before there is any data, the information model formulation and the production of the data within the platform can use some application support. The ontology development environments described by [26] and the more visualization oriented tools like Conzilla of [44] are discussed in 2.2.2. When an ontology is more or less fixed, a next step in the production of Linked Data can be aided by the Semantic Content Authoring tools (2.2.3) that help users as much as possible to formulate data from a certain non-RDF context.

Some Linked Data repositories describe live data and are maintained by the user. Others can be considered static (like GeoNames), or managed by an authority (like BBC's broadcast data). In [5] the Linking Open Data project is described, in which more than a hundred of those public repositories are interlinked. A big connector in this web is the DBpedia repository, a semantically enriched Wikipedia. Browsing (2.2.4) and searching (2.2.5) through this data is complicated, and applications are being developed to deal with this aspect of Linked Data. For more specialized data needs, aggregation platforms (2.2.6) like DERI Pipes or RDF Gears can be used to define detailed data integration procedures. In the following subsections examples are given of applications in each category in order to sketch the state of the art of related Linked Data applications. More complete categorizations of Semantic Web tools can be found online at http://www.w3.org/2001/sw/wiki/index.php?title=Property:Tool_category.

2.2.1 Endpoints

In [5] eight publishing tools are listed that serve RDF descriptions over the Internet on HTTP requests. Some of them support SPARQL queries, some partly and others not at all.


The SPARQL protocol states that SPARQL query endpoints may use content negotiation to deliver human-readable or machine-processable information, and may choose not to implement SPARQL update or to require HTTP authentication [24].

Another division often made in RDF publishing tools is between native and non-native RDF data storage. In the Berlin SPARQL Benchmark [7] the performance of four native RDF stores (Sesame, Virtuoso, Jena TDB, and Jena SDB) is compared to two non-native SPARQL-to-SQL re-writers and to two relational database management systems. The list of existing RDF storage approaches is further extended in [17] by describing 13 multi-indexing frameworks and 11 systems with a simpler storage scheme. This overview lists almost any possible native storage approach. Non-native RDF endpoints can map virtually any data source to an RDF representation when requested. A different kind of endpoint is the alignment repository [64] that can be used to store and access alignments between different data sources. Anything with an API could be considered an endpoint, but when it comes to repository discovery there is an overlap with the category of search engines.

2.2.2 Ontology engineering editors

In [64] the field of Ontology Engineering is described as an engineering discipline that has reached a certain level of maturity, compared to its beginning in the early 90's. In their work the term ontology refers to machine readable formal representations in general and not the specific type of vocabulary within RDF. In this view all Linked Data applications are ontology tools. Five categories of those are given, together with an enormous amount of example applications. The actual category of ontology engineering is split up into editors, browsers (see a later subsection), learners and versioners. Three editing platforms among the 13 that are discussed get special attention: the NeOn toolkit, Protégé and TopBraid Composer.

According to [26] Protégé has been the leading ontology development editor. It is a stand-alone Java application that allows the definition of classes, properties, restrictions and instances. It is mainly a tool to open, visualize, edit and save RDF and OWL files. Reasoners can also be executed, it can be configured to create a knowledge acquisition tool and it can connect to other knowledge bases [64]. The possibility to build plug-ins for other file formats and visual representations makes it a multi-tool environment. Despite its popularity a number of shortcomings of the environment have been expressed. First of all, a higher level of abstraction of ontology language constructs would be desirable to allow more intuitive and powerful knowledge modeling, as would a friendlier visual or spatial navigation among concept trees, graphs and linking relations. Next to this, more support for reasoning facilities, for aligning ontologies and for integrating them with other data resources would mean an enhancement. Finally, support for processing natural language and for collaborative data production is suggested [26]. Later in its development process Protégé [52, 66] published a release candidate of WebProtégé [67], a Web-based platform that is lightweight and supports collaboration, thus satisfying at least one of the requests from [26]. In [41] WebProtégé is used in a mash-up framework for distributed authoring of large-scale biomedical terminologies.

The NeOn toolkit is the result of a European Commission funded project to support operations in networks of ontologies, some of them constantly evolving [34].


This dynamic interpretation of the playground for next generation applications underpins their approach, which opposes the current expectation to produce a single, globally consistent semantic model that serves the needs of application developers and fully integrates a number of pre-existing ontologies. The toolkit is based on the Eclipse platform (popular for Java development). It supports the complete life cycle of large-scale ontology networks. Next to elaborate editing capabilities the platform offers project and version management with collaboration support, visualization, ontology evaluation, ontology matching, reasoning and inference and knowledge acquisition.

TopBraid Composer is implemented as an Eclipse plug-in, very similar to the NeOn toolkit. It has a publishable API in order to build semantic client or server solutions to integrate other applications and data sources [64].

Next to full ontology editing platforms, dedicated tools have been produced to visualize ontologies. Conzilla is an example of such a visualization centered application, which offers a versatile interface for editing and styling RDF (see http://www.conzilla.org/wiki/Overview/Main). According to [44] all visual languages for RDF representation are inherently domain specific, and they start with a fresh language called DLG2 (directed, labeled graph). In its basics it is similar to the way the RDF graph is visualized in W3C documents [48]. IsaViz is another graphical RDF authoring tool centered around graph navigation. In [43] this approach of data model representation is contrasted with the frame-based navigation used in Protégé.

Learners are applications like OntoGen and OntoLearn that can generate an ontology through a semi-automated process. The reason to mention browsers and versioners in this category is that they support the development and management of ontologies. The different category of ontology processing [64] contains three types of applications that fit in this subsection: ontology matchers that detect and output alignments, ontology localizers and profilers that can transform existing ontologies, and ontology evaluators that can check the formal model or the instances. In the next subsection the semi-automated generation of instances around digital documents is discussed, but the manual creation and modification of standalone instances can usually be done with the same editors that are used to edit ontologies.

2.2.3 Semantic Content Authoring applications

According to [46] the currently least developed aspect of the semantic content life-cycle is the user-friendly manual and semi-automatic creation of rich semantic content. The manual production of Linked Data instances, in their work called Semantic Content Authoring (SCA), is broadened to the production of RDF data on the one hand and non-semantic sources enriched with semantic representations on the other. This second form is mostly applied to HTML, following the W3C standard of RDFa, where RDF statements can be included in HTML-like files as attributes. Their SCA user interface is a smart Web browser and HTML editor that uses semantic data to find documents more precisely, presents them more flexibly, integrates and personalizes them to whatever the user would like, and helps to relate them to existing ontologies. It delivers a what-you-see-is-what-you-get view of produced HTML with a


what-you-see-is-what-you-mean layer on top of this to visualize which items from the Web site are semantically enriched, a triple view showing the produced RDF triples and a source code view to see the RDFa result: HTML with RDF as attributes. In the WYSIWYM SCA Content Editor the starting point is (HTML) data without any link to ontology elements, and after the semantic enrichment process still many elements are without semantic meaning.

A similar attempt to semantically enrich Office documents and Wiki pages, called Gafee, was described as a meta-data editor [8]. The approach was to automatically generate forms based on a domain ontology and a GUI ontology. The form is presented as a plug-in inside the text processing application or as an extension called OWiki inside a MediaWiki system [20]. The user finally supplies the meta-data by filling in the form, and the result is sent to an OWL database. In the case of the OWiki platform, this database runs as an external Java web service. The special quality of the system is the automated influence of the target ontology on the user interface: templates are automatically generated from ontological data (definitions of classes and properties), forms are automatically generated from templates, and eventually ontological data (individuals and property values) are populated through these forms [20].

In [64] other examples are given of the manual or semi-automatic annotation of documents, automatic annotation components, and ontology populators that automatically generate instances based on a data source. TopBraid, among other ontology engineering editors, can be used to do ontology population. This approach is also called top-down semantic authoring. The use of RDFa or any other way to add meta-data to existing documents is a bottom-up approach [45].

2.2.4 Browsers

Just like the World Wide Web, the Semantic Web consists of links that can be used to browse from one data source to the other. Yet the advantage of Linked Data as machine interpretable data means at the same time that the plain data is more difficult to process for humans. Disco is a browser, accessible as a Web page, that follows links just like hypertext navigation. Triples that have the URI as subject are collected from different repositories and listed in a table, and the source RDF graph is specified. A next level of browsing is offered by Tabulator, which enables the user to look for certain patterns [5]. It is also capable of presenting results over a timeline or on a map, in order to make the bulk data more insightful. The browser, again accessible as a Web page, merges data from different sources into one result and tracks the provenance. Marbles is a similar browser that indicates the source repository of a certain statement with colored bullets.

A last type of browser presents the Semantic Web as a graph of nodes and edges. Frodo, Rdf-gravity, FOAFNaut and Fenfire are some of them. The graph visualization, even with a semi three dimensional view in Fenfire, is hard to comprehend. In [42] the decision to let the computer's internal representation influence the presentation instead of the user's needs is criticized as a pathetic fallacy, but it is certainly a first step, and browsing through the big fat graph might inspire users as to what they need next. In the ODRAC platform a User Interface is offered that reflects the underlying RDF graph structure to help the user understand the data.

A different approach to browser development is not to work from the technology possibilities of RDF at all, but to design a graphical environment that helps users


browse through big amounts of data on the Web, be it HTML, graphical, RDF or any other format. The Pivot application of Microsoft Live Labs is a promising development. An exploration of its possibilities like [9] is reminiscent of the attempts of some Linked Data browsers. Virtuoso has reported to be working on binding Pivot to their RDF Quad Store3. Some browsers from the RDF community, like IsaViz and RDFAuthor, combine browsing the Semantic Web as a graph with editing capabilities. Both can export the data that the user created, based on his exploration of items from the Semantic Web, in the RDF/XML format. These applications form an overlap with the category of ontology engineering tools described above.

2.2.5 Search engines

The category of Linked Data applications least related to the work in ODRAC is that of search engines. In [5] a distinction is made between human-oriented search engines and application-oriented indexes. Examples of the first category are Falcons and SWSE, and both are accessible on-line like the popular search engines on the World Wide Web. Falcons provides search for objects (instances like people or locations), concepts (classes, properties), ontologies (displayed as small graphical graph snippets) or documents (RDF/XML files). Both engines present the results with more detailed information than only links, but the search request is entered as search terms without further querying logic. Human-oriented search engines provide a starting point for users to browse the Web of data. Swoogle, Sindice and Watson are examples of application-oriented indexes that can be used by other Linked Data applications to discover Linked Data over an API [5]. In [16] Watson is introduced as a Semantic Web gateway to expedite next generation Web applications. Different services are available through an API to find Semantic Web documents, retrieve their content and meta-data, inspect their ontological descriptions and query them using SPARQL. In [64] Swoogle, Sindice and Watson are listed together as ontology discovery tools. A special category of Linked Data application, here listed as search engine, is the ontology reasoner [64]. Instead of searching through the whole Semantic Web, these tools are designed to search through one or a number of ontologies and ontology instances to derive conclusions on inferred data or trace inconsistencies. Many of them, like CEL, HermiT and TrOWL, were written to operate on OWL ontologies.
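As an illustration of how such an application-oriented index could be consumed, the following sketch queries a remote SPARQL endpoint with Jena's ARQ module. It is a minimal sketch only: the endpoint URL and the query are placeholders, not the actual service addresses of Watson, Swoogle or Sindice.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    public class EndpointQuery {
        public static void main(String[] args) {
            // Placeholder endpoint; a real index publishes its own service URL.
            String endpoint = "http://example.org/sparql";
            String query = "SELECT ?doc WHERE { ?doc ?p ?o } LIMIT 10";
            QueryExecution qe = QueryExecutionFactory.sparqlService(endpoint, query);
            try {
                ResultSet results = qe.execSelect();
                while (results.hasNext()) {
                    QuerySolution row = results.nextSolution();
                    System.out.println(row.get("doc"));
                }
            } finally {
                qe.close(); // release the underlying HTTP connection
            }
        }
    }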

2.2.6 Aggregation platforms

A step more complex than the application-oriented search indexes are aggregation platforms like DERI Pipes [5] or RDF Gears [25]. Both platforms present a web interface to define complex workflows from data inputs, via diverse operations, to data outputs. A possible input element is a SPARQL endpoint in combination with a certain query. A simple operation might be the union of those results with the results from another endpoint. In order to define its operations, RDF Gears uses a formal language that mixes Semantic Web technologies with Nested Relational Algebra. The output of a workflow is always in the form of an RDF/XML file. DERI Pipes works in the same fashion and

3http://boards.openlinksw.com

supports sophisticated operations like identifier consolidation, schema mapping and RDFS or OWL reasoning. Instead of using a special language, its operations are defined using SPARQL CONSTRUCT operations and XSLT templates.
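To make the CONSTRUCT-based style of operation definition concrete, the sketch below implements one pipe-like step with Jena: it derives rdfs:label statements from foaf:name statements. The mapping is invented for illustration; in DERI Pipes such operations are configured declaratively rather than coded by hand.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class ConstructStep {
        /** One aggregation step: rewrite foaf:name triples into rdfs:label triples. */
        public static Model relabel(Model input) {
            String construct =
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "CONSTRUCT { ?s rdfs:label ?name } WHERE { ?s foaf:name ?name }";
            QueryExecution qe = QueryExecutionFactory.create(construct, input);
            try {
                return qe.execConstruct(); // the result is again an RDF model
            } finally {
                qe.close();
            }
        }

        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            m.createResource("http://example.org/p1")
             .addProperty(m.createProperty("http://xmlns.com/foaf/0.1/name"), "Alice");
            System.out.println(relabel(m).size()); // prints 1
        }
    }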

2.2.7 Closed platforms

When describing the state of the art of the Semantic Web in 2008, the authors of [16] write that most available applications tend to produce and consume their own data instead of approaching the Semantic Web as one large repository. These first generation Webs were built in companies like Renault, Boeing and British Telecom to annotate company data, but while the systems did what the companies needed, the promising potential of one connected Semantic Web was not pursued. In process industry, short-term virtual companies are often formed out of a number of complementary companies working together temporarily and sharing a project knowledge base. Although these virtual companies share some knowledge with the outer world, much of it needs to be protected by a closed platform with high requirements for trusted and secure information exchange [29]. Also in an application like Haystack [39] an SCA application is combined in a local environment to integrate personal data sources into a personalized information repository. In [42] Haystack is presented as an example of an application with a traditional user interface and no need for graph-like presentation, although the underlying technology is RDF. The annotation of personal heterogeneous data with meta-data, based on a personalized ontology, results in a component framework on which a truly uniform user interface can be built. A special meta-data manipulation language is also presented to automate operations on the Linked Data. These agents could be compared to workflow configurations in RDF Gears. The work of the NEPOMUK project [62] aims to integrate personal data in two stages. First it wants to integrate the isolated islands of data of traditional desktop applications into a Personal Semantic Desktop. The second phase is to transform it into a Social Semantic Desktop by presenting the desktop as an end-point of the Semantic Web. Especially the usage of Named Graphs as manageable sets of data makes their approach interestingly related to our work in ODRAC. Also in the field of Software Engineering the modelling capabilities of Linked Data can be applied to automate or model parts of the production process [32]. An ontology could be used to formally describe software requirements, and possibly open up model-driven approaches in design and implementation. The usual software models, like UML, could be integrated using Linked Data too. Other ideas include semantically enriching APIs to give automated coding support and automated documentation. Existing software components could also be described and retrieved using Semantic annotation and querying. In the KOntoR [31] platform this is done, with the used ontology providing information integration and background knowledge on software components. The interface is available through an open Web-client, so it is not a closed platform. Yet semantic representations of requirements, models and APIs will likely not be published on the Semantic Web. In [16] a boost is given to the production of next-generation semantic web applications by introducing the Watson gateway (see above), but the question is what is wrong with closed platforms. In some cases it is not in the interest of the client to


connect their data to the Semantic Web, while the knowledge representation and reasoning capabilities delivered by a Semantic Web are. It could even be that data should be shared with some parties, but protected from others. The ODRAC platform, among other Semantic Systems Engineering tools discussed in the next chapter, supports such a configuration. One Systems Engineering application used in a number of construction companies in the Netherlands is Relatics4, but no academic work discusses it yet. It is a Web-based platform to build a knowledge base for engineering projects. The company refers to the product as Semantic Sheet Software, in order to promote the transition of business-critical data from Excel-sheets to a semantic platform. It models knowledge data in a semantic structure, but the platform was designed before RDF was published or software libraries facilitating Linked Data were at hand. Three tables are used in a SQL database, representing the Element, Relation and Property concepts from their own data model, and navigating through the user interface and exploring data involves many SQL queries over these tables [57]. The approach of the data model aims to open up an expressiveness and reasoning capability similar to the RDF model, but in practice the data is hard to aggregate and navigate because of the fine-grained structure of the data model. The main elements in the graphical user interface are HTML-tables and JavaScript trees. All user tasks can be related to the user rights management, so it supports usage in collaborative environments. To shield end users from the complexity, it is possible to define custom forms to input data. When ODRAC is compared to Relatics, its purpose, interface and application domain overlap at many points. The use of RDF and its open data model are the main advantages of the current design. A weakness of the user interface of Relatics, as it is graphically configured at Croon currently, is the steep learning curve and the little sense of orientation it offers. The user is very quickly confronted with a lot of tables, and important buttons are hidden or accessible via right-click menus that are not easily found. The main window consists of a tree on the left-hand side, and when a node in the tree is clicked it is opened in the right-hand panel of the window. Because only one item can be opened and is docked into the whole right-hand panel, the user can get the impression that the whole Web browser is redirected to that item. The tree on the left remains visible, but when an item in the right-hand panel is clicked the panel navigates to that item without locating it in the displayed tree, adding to the sense of disorientation.

2.2.8 Conclusion

Because of the Systems Engineering context of Relatics and the semantic approach of its data model, we end this section by marking this application as the most relevant to study for the design of ODRAC. On a technological level the NEPOMUK approach of treating RDF data packets as modules is of interest, and for the user interface design the discussion of browsers adds to our considerations. What is left in this chapter is to look for software libraries that can help with the development of Linked Data applications.

4https://www.relatics.com


2.3 Development libraries

Many existing applications and research mock-ups make use of software libraries supporting Linked Data models. On community sites5 popular libraries are listed for Java, PHP, C/C++, Python and .Net. The Apache Jena library is the first listed, and because of the availability of many extensions, especially for SPARQL querying and Named Graphs, ODRAC builds on this library. Jena is published by Hewlett-Packard laboratories, and a number of publications have been devoted to Jena [49, 50] and Jena 2 [13]. The introduction of Named Graphs was influenced by the same HP Labs [11], and soon a plug-in for Jena called NG4J was published [4]. In combination with the Tomcat Java servlet framework, Jena and NG4J offered unique support for the Linked Data operations that ODRAC would be performing. The documentation on Jena and NG4J is not elaborate, but the support on community pages fills the gap. As for support of the TriX format, NG4J is as yet the only available option in Java.
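A minimal sketch of this combination is given below. It assumes the NamedGraphSetImpl and Quad classes as published with NG4J and the classic Jena Node factories; the graph and resource URIs are invented for the example.

    import java.io.OutputStreamWriter;
    import java.io.Writer;

    import com.hp.hpl.jena.graph.Node;
    import de.fuberlin.wiwiss.ng4j.NamedGraphSet;
    import de.fuberlin.wiwiss.ng4j.Quad;
    import de.fuberlin.wiwiss.ng4j.impl.NamedGraphSetImpl;

    public class TrixExport {
        public static void main(String[] args) throws Exception {
            NamedGraphSet set = new NamedGraphSetImpl();
            Node graph = Node.createURI("http://example.org/dataset#claim-1");
            // One statement wrapped in a Named Graph, the atomic unit of data in ODRAC.
            set.addQuad(new Quad(graph,
                    Node.createURI("http://example.org/dataset#pump-7"),
                    Node.createURI("http://www.w3.org/2000/01/rdf-schema#label"),
                    Node.createLiteral("cooling water pump")));
            // Serialize the whole graph set in the TriX syntax.
            Writer out = new OutputStreamWriter(System.out, "UTF-8");
            set.write(out, "TRIX", null);
            out.flush();
        }
    }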

5http://answers.semanticweb.com/questions/75


Chapter 3

Semantic Information Integration Platforms

In this chapter we sketch the introduction from a new perspective. In the previous chapter the focus was on the used technology of Linked Data. Here we zoom in on an industrial tradition, and its academic reflection, of platforms that are used for Information Integration in (Systems) Engineering contexts. First a general overview is given of the use of computer aided engineering tools in process industry (3.1). In Subsection 3.1.1 we introduce the work of [74] on the Comprehensive Information Base. This deals with the design of an ontology-based approach for information integration in chemical process engineering, and contains a prototypical implementation and a series of observations our work can be related to. The work of [29] follows the same main pattern, but still calls CIB domain specific. Their approach, though still very abstract, is very similar to our work. We discuss it in Subsection 3.1.2. In Section 3.2 the ISO standard 15926 is discussed, the use of which is an important parameter in this research. This is followed by a short description of two applications closely equivalent to the present design, called iRING and dot15926. These closely follow the ISO standard, but unfortunately no publications discuss them yet, apart from a single remark in [63]. These data production applications are direct counterparts of the present application design. In the final subsection the approach of Part 11 is put forward as the start of a third effort to implement the ISO norm.

3.1 Information Systems for CAE

In a study of the design of industrial pulp and paper production processes [63], the heterogeneous use of software support and the related fragmented data is illustrated by an engineering enterprise that uses over 50 different engineering tools. These Computer Aided Engineering (CAE) tools are diverse, but have become more data-centric after a period of document-centric design. Also during our work on the design of ODRAC in the ship building industry, the presence of many different tools was not only assessed but problematized. The heterogeneity of data is a main cause of inefficiency in the engineering processes, and it is the central problem of the field of Information Integration. As an answer to the integration problems, large software suppliers are acquiring


and integrating tools in product families. COMOS, for example, is an information platform that supports the design, building and maintenance phases of complex process installations. It has been around for 20 years and was bought by Siemens eight years ago [58]. Four characteristics are that (1) it is object oriented (not document-centered), (2) it is open to other systems, (3) it can communicate worldwide over a closed intranet and (4) it is paperless. It also complies with the ISO 15926 data integration norm for data communication. These characteristics make it similar to the ODRAC platform, yet its commercial presentation, hidden mechanics and tight integration with Siemens software and hardware are different qualities. When compared to Relatics (discussed in Section 2.2.7) it also overlaps with the four characteristics, but COMOS is limited to process industry and to a selection of life cycle stages, whereas Relatics can be applied in a broader domain [58]. In [29] (see 3.1.2) CAE systems like COMOS are seen as still too rigid and potentially cumbersome for virtual companies. Both the prototype system of [29] and [74] approach COMOS as an engineering database over its open XML-format API. It is a typical characteristic of modern process engineering support systems that they comprise an integrated plant information model to cover the many aspects of plant engineering, enabling advanced change management and advanced personalization for different users. The use of a standardized data model is an important prerequisite for efficient integration, and compliance with ISO 15926 is common in many tools in process industry. It is used both as an exchange format, in iRING and XMpLant, and as a native data model, in Bentley's OpenPlant [63]. A lot of research and development effort has been invested in the use of a general data model. Many CAE frameworks promote their compliance to the ISO 15926, but this norm is a constant object of further development. The POSC Caesar association1 is responsible for the organization of these developments. In [2] a first bridge was built between the data model and the technology of Linked Data: the ISO 15926 model was transformed into an OWL upper ontology. However, the complexity of the data model has been recognized [69] and criticized [74]. In [74] the whole approach of a single global plant information standard is seen as unrealistic for the near future. Instead a design is given of a semantic integration framework called CIB. Before we describe their solution, an analysis of the approaches to deal with heterogeneity is of value here. The use of a populated architecture of tools in Systems Engineering is not as troublesome as the heterogeneity of the data contained in the applications. This heterogeneity can be split into syntactic, structural and semantic heterogeneity. Solutions like the use of XML to overcome the first two are widely accepted, but the semantic aspect of information integration, where the precise meaning of the data is evaluated, is still a problem [74]. Traditional approaches to achieve semantic interoperability can be split into brute-force mapping, the use of a global data standard and the use of an interchange standard. Most used is a combination of brute-force mapping and an interchange model, together with the presence of a central data warehouse. In this scheme the ISO 15926 is an information exchange norm, but due to its extreme complexity and narrow scope it has not found broad acceptance outside the process industry.
A different exchange standard was STEP, which aimed at industries like automotive, aerospace and chemical plants; the ISO 15926 started as a spin-off from STEP (see Section 3.2.1). Despite

1https://www.posccaesar.org/

the tremendous efforts spent on these initiatives, no universal data model has gained wide acceptance. Reasons include incomplete coverage, lack of consistency and the difficulty to agree on a life-cycle data model. The only available option for the near future is seen by [74] as a mix of proprietary standards, partly accepted interchange standards and in-house standards.

3.1.1 Comprehensive Information Base

Instead of choosing from the traditional approaches, [74] introduces an ontology driven integration framework. The so-called hybrid ontology approach builds on one global ontology and several source ontologies. As the heterogeneous mix of data source semantics is accepted, each source ontology contains a mapping from one data model to the global ontology. See Figure 3.1. A bidirectional converter functions as a translator that is able to map the XML-output of a data source to the global ontology. The same converter can map content based on this global ontology back to the needed XML-format. For both directions it builds on the source ontology of the specific data source. Thus the CIB forms a mediation layer over all different tools and their data formats. Two important entities in the layer are a knowledge base and an inference engine. The inference engine is capable of presenting one data-set based on the global ontology, where in fact the data is aggregated from the different data sources. The knowledge base consists of all used ontologies and the instances that are projected on it from the separate data sources. A third element is a graphical user interface helping users to formulate queries and visualize the results. Advantages of this architecture are the need for only one converter per data source (not one for every combination), a low need for global agreement on used data models, an allowed diversity of data source formats, a modular structure of the global ontology and a comprehensive view on scattered data forming a neutral point of access. One of the reported disadvantages is the high computational load during reasoning. This reasoning is needed for every data transformation step.
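The converter role can be summarized in an interface sketch. The names below are hypothetical, since [74] publishes an architecture rather than code, but they capture the two mapping directions, each guided by the source ontology of the specific data source.

    import org.w3c.dom.Document;
    import com.hp.hpl.jena.rdf.model.Model;

    /** Hypothetical signature of a CIB-style bidirectional converter. */
    public interface BidirectionalConverter {

        /** Lift the XML output of one data source to instances of the
            global ontology, guided by this source's mapping ontology. */
        Model toGlobalOntology(Document sourceXml);

        /** Lower content expressed in the global ontology back to the
            XML format the data source expects. */
        Document toSourceFormat(Model globalContent);
    }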

3.1.2 Universal endpoints for generic data views

A comparable analysis of traditional solutions, and the approach to use Semantic Web technologies to build a highly integrated framework to access industrial engineering data through an interlinked semantic network, is presented in [29]. Their solution is introduced in the context of virtual companies, consisting of different businesses of different trades that work together on a certain design. This worsens the heterogeneity of the data sources and the difficulties to reach agreement on tool use and data formats. After dismissing both the brute-force mapper based approaches and the unified world models (STEP, ISO 15926 and MFO), they also find none of the current Linked Data applications (though CIB is not mentioned) tailor-made for virtual companies. The primary goal is a single access point to industrial data. This data is contained in one Linked Data cloud to which all data sources can be mapped with acceptable effort. Different interactions are supported by generic access concepts that interact with a graph manipulator on the single data cloud. These data entrances are called universal endpoints, and the first suggested include a SPARQL endpoint.


Figure 3.1: The main architecture used in [74].

Figure 3.2: The main architecture used in [29].

The existing data sources can be either migrated to the data cloud or connected with an on-the-fly transformation. See Figure 3.2. The use of one ontology to structure the Linked Data cloud is more centralized than in [74]. But in [29] the used ontology and the adhering Linked Data instances get a modular structure, and the use of Named Graphs is hinted at. This structure leads to a pre-processed network where no costly computations are needed for interactions with the cloud. This is a big advantage over the approach of CIB. Yet a missing element in the platform is a general ontology for industrial usage. Our ODRAC platform is very similar to the whole design and does offer a dynamic ontology structure.


3.2 ISO 15926

The initiator and current developer of the ISO 15926 is the POSC Caesar Association (PCA). It was founded to support the development of open standards for data integration and has members in the US, Europe and Japan. In an introduction guide to the ISO 15926 from 2011 [65], the standard is explained as a middle layer or interface. Its function is explained as a Babel fish, able to translate source data back and forth to an intermediate standard description. Its practical applicability in information exchange is attributed to the convergence of four academic achievements: the construction of ontologies, Semantic Web technology, ways of data encoding and the evolution of product information. The history of its development is summarized as an application of the STEP standard to long-life process plant descriptions, while STEP emerged from the need to supply product information related to CAD drawings.

3.2.1 STEP

STEP stands for STandard for Exchange of Product model data and is also known as ISO 10303 [51]. It is one of the biggest ISO standards, and a set of applications called STEP Tools is available. Part of the norm was the formulation of the EXPRESS data modeling language, but later this was translated into XML [74]. In [2, 65] a number of problems are described that emerged from the use of STEP. First of all, information exchange involves the use of up to hundreds of different Application Protocols, each consisting of a definition of object classes, their taxonomy and their relations for a specific application domain. The maintenance of data over time is also cumbersome, due to the need to update the data model; the models capture object information as a snapshot in time. Furthermore the configuration of an Application Protocol was impractical because of the high complexity of the procedure. These deficiencies were an important motivation for the initiative of ISO 15926.

3.2.2 Gellish

One Application Protocol led to the creation of a structured subset of natural language, meant to express knowledge in a form both human and machine readable. The ontological language was called Gellish [68]. It was also influenced by the data model of ISO 15926, and for some time it was expected that the new Part 11 would contain a Gellish implementation example. The language is also known as STEPlib, confirming the strong connection to STEP. The high expressiveness, the uniformity and the possibility to define queries within the same language are reported as strong points [51].

3.2.3 Existing ISO parts

In [19] the practical applicability of the current state of the ISO norm is evaluated. A list is given of the current twelve parts.

1. Overview and fundamental principles

2. Data model

3. Ontology for geometry and topology


4. Initial reference data

5. (withdrawn)

6. Scope and methodology for reference data

7. Template methodology

8. OWL Representation

9. Implementation methods for the integration of distributed systems – Facade implementation

10. Conformance testing

11. Methodology for simplified industrial usage of reference data

12. ISO 15926 as OWL 2 – Implementation with named graphs

The evaluation focuses on the use of Parts 2, 4, 7, 8 and 9. Together they specify two practical applications. Information can be encoded in lifted form, which means the full philosophical precision of the Part 2 data model has to be met by the information models. The lowered form approach is to base information on templates. The lifted form has not been successfully implemented because it involves template confusion, resulting from the absence of explicit template references. Part 8 [2] mentions the use of meta-data to resolve this, but no structure has been found to relate meta-data to Part 2 building blocks. The work on Part 11 takes a first step to resolve this using RDF Named Graphs. Apparently a next step is already planned to combine the use of Named Graphs from Part 11 with the OWL representation from Part 8 in a new Part 12. The main critique from [19] on the current stage of ISO 15926 pilot implementations is the lack of conformance testing. The application of ISO 15926 to deliver business interoperability to stakeholders should focus on pragmatic and bottom-line issues. A conformance testing methodology of an ISO 15926 application to a certain context should improve this focus. Secondly, a number of information modeling improvements are desirable. There is a conceptual overlap between the Part 2 data model and the use of RDF; simply put, both contain a slightly different notion of a class. In our work this overlap is drastically resolved by applying RDF only loosely based on the data model. Finally a road map is said to be needed to coordinate further developments.

3.2.4 iRING

A number of platforms have been developed in direct contact with the ISO workgroups. The most elaborate is the iRING Tools platform2 developed by Fiatech. Its full name is ISO 15926 Realtime Interoperability Network Grid, and as depicted in Figure 3.3 the grid builds on the template methodology (Part 7), the use of OWL (Part 8) and Facades (Part 9). It has been applied in a collaboration project of the IOHN between

2http://www.iringug.org/


the ICT, defense and oil and gas industries [70]. It started in 2008, but no results on its usability have been published yet.

Figure 3.3: The iRING information flow from [65].

Four purposes were stated [65]: (1) to prove that information exchange using the full specification of ISO 15926 is possible, (2) to develop software interface tools using the full specification of ISO 15926 and make the toolkits available to anyone under an open-source license, (3) to develop best practices and make them available to those who use the tools and (4) to encourage software vendors to collaborate and support iRING interfaces within their product offerings.

3.2.5 dot15926 platform

Many companies and institutes involved in the ISO 15926 development build their own tools to do research. As an example we mention here the dot15926 Platform3 from the Russian TechInvestLab. It is an architecture and a set of specific interfaces and libraries developed to facilitate the creation of semantic applications that work with ISO 15926 data. To demonstrate the capabilities of the platform, a dot15926 Editor is built for three specific purposes: to explore existing sources of reference data in as many formats as possible, to verify reference data and to engineer new reference data, including automated reference data creation through mapping from external sources. The Editor is intended to become for ISO 15926 data what Protégé became for OWL, but based on the current amount of external references this might be a little ambitious. A free version is available, but for this and other similar applications the use is difficult to grasp, even with some background knowledge. Apparently no Named Graph structure was used, and no transaction mechanism was proposed to do data negotiation.

3.2.6 The Part 11 approach

Wiesner et al. discuss the ISO 15926 interchange standard and blame the extreme complexity and the narrow scope for the lack of broad acceptance [74]. These problems were recognized by the ISO community as well, and to make the standard applicable to Project 8 an ISO 15926 Part 11 is being prepared with the title Methodology for

3http://techinvestlab.ru/dot15926Editor


simplified industrial usage of reference data. Its target audience description gives a clear statement of the purpose of the methodology [23].

This part of ISO 15926 focuses on a simplified implementation of the aforementioned data model [i.e. ISO 15926] in the area of Systems Engineering and is intended for developers of configuration and/or information management processes and systems in general. In particular this part can be utilized to define explicit information issues in the area of systems engineering and product knowledge management in the area of (process) industry, buildings, infra-structure and shipbuilding.

There was a close interaction between the initial goal setting of Project 8, our work and the content of Part 11, so the present implementation should clearly follow that standard.

Chapter 4

Context of Integraal Samenwerken

In the current chapter we gather all the requirements for ODRAC. We begin by giving background information on the partnership (4.1) and the project responsible for the Universal Information Adapter (4.2). Then the answer is formulated to subquestion (b): Which requirements spring from the Integraal Samenwerken context? Next we give a quick analysis of the users of the UIA (4.4). The following two sections discuss the version 1.x implementation of the UIA (4.5) and the distributed context of data-stores in the platform (4.6). We finish this chapter with an example information model whose implementation will be discussed later in this thesis.

4.1 Integraal Samenwerken

The primary stakeholder of the data production platform presented in this report is Integraal Samenwerken1 (lit. Integrated Collaboration). This collaboration partnership aims to improve the competitiveness of the Dutch shipbuilding sector by developing improved collaboration models and instruments. Targets include (1) reducing failing costs to fifty percent, (2) measurably increasing employee fervor in the sector by twenty percent, (3) increasing the knowledge retained from resigning employees by one hundred percent, (4) increasing production volume per employee by twenty percent and (5) reducing production times of ships by ten percent. The partnership is partly funded by the Dutch Ministry of Economic Affairs in the Maritime Innovation Program. In its final year, before its finish in September 2013, it has seventeen member companies throughout the shipbuilding sector. These include two shipyards, a number of electrical engineering companies, and builders of piping systems, diesel engines, software, etc.

4.2 Project 8

The activities of the partnership are structured in projects. Project 8 is the production of the Universal Information Adapter. This is a system to make information exchange between collaborating companies possible, using some form of information integration. The project started in December 2008 and is one of eleven, yet because many other projects assume its presence it is considered to be a basis for most of the Integraal Samenwerken program [1].

1http://www.integraalsamenwerken.nl


The Universal Information Adapter is a functional entity, and prior to our work a first implementation was built (Section 4.5). This thesis delivers a second implementation using the ODRAC platform. The Universal Information Adapter is primarily needed to eliminate misconceptions in the communication during the design processes in a project, but it is expected to store data on all the life-cycle stages of an engineering object. The necessity of the adapter is usually promoted in terms of possible failing costs reduction. The source of current misconceptions is located at a number of levels. Resources are named differently by companies and even within Enterprise Resource Planning (ERP) systems of the same company. Aggregation levels of components also differ, as does the way data is encoded and stored. The naming, aggregation and encoding are aspects of the semantic heterogeneity the UIA is required to solve. In an early stage of the project a number of decisions were made on the direction in which the UIA should tackle the Information Integration problem. Although in the work on CIB (Section 3.1.1) the use of a data exchange format is considered unrealistic, and instead a semantically profound but computationally heavy translation mechanism is proposed, in the UIA the approach is to formulate a data exchange standard. This brings with it the benefits identified in [74] of leaving the existing data sources as they are and of introducing only an agreeable number of mappers, but it also means the identified problems need to be dealt with. The most significant is the problem of coming to an agreed-upon exchange standard. Instead of giving up on the approach of ISO 15926, we try to find a transformed application of its standardization power. How this would look was anticipated by the Project 8 workgroup and formulated in a set of requirements. From the perspective of Integraal Samenwerken the work of this thesis is meant as an evaluation and elaboration of this outset.

4.3 Requirements

The following criteria form the underpinning initial design decisions of the UIA. They were established before the current thesis project started, so they form the Information Adapter requirements. The ODRAC platform should meet these in a sound implementation. The more detailed design decisions in the architecture, data model and implementation form the contribution of this work. In the following four chapters this design will be presented, and in the conclusion sections of those chapters a reflection will be given on which Criteria are met by the design aspects described.

Criterion 1. Heterogeneous external data sources should be mapped to a homogenized data-set using a neutral language and thus communicated.

The formulation of the neutral language is not part of our work or inherent to ODRAC. It is the power behind the semantic integration, solving the naming problem and giving normative aggregation level blueprints of ontological components. In order to suffice, the language should have full domain coverage (see also Criterion 4). But the platform should explicitly be domain independent. A related goal is to find a representation of data that can be extended on the fly and whose configuration involves only information models and no extra technical difficulties.


The encoding of communicated data is also part of the communication standard, solving structural heterogeneity. To achieve this the following file format is prescribed.

Criterion 2. The TriX format should be supported for receiving communication data and be the default for sending it.

The use of the XML TriX format is motivated by its human readability and XML conformance. It is one of the options to communicate RDF data with a Named Graph structure.
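For illustration, a minimal TriX document with one graph containing one triple could look as follows. The graph and resource URIs are invented; the namespace is that of the TriX specification.

    <TriX xmlns="http://www.w3.org/2004/03/trix/trix-1/">
      <graph>
        <uri>http://example.org/dataset#claim-1</uri>
        <triple>
          <uri>http://example.org/dataset#pump-7</uri>
          <uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri>
          <uri>http://example.org/part4#physical_object</uri>
        </triple>
      </graph>
    </TriX>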

Criterion 3. Company specific translations of neutral names should be expressible in the communicated messages.

To help different users and applications make sense of standardized messages, extra labels should be applicable inside the data with company or application specific names.

Criterion 4. The triple structure from RDF should be used to model data.

In order to achieve semantic integration, RDF is used to model data. Unique URI names for data entities are useful for addressing items from different sources in a global space, and the powerful querying and reasoning capabilities [74] offer rich interaction possibilities with the data. The ODRAC data model connects the simple RDF classification system to a set of classes and properties from the ISO Part 4 initial reference data. These are the primitive building blocks of the neutral language, the upper ontology. Revisions of Part 4 have been proposed, and in our work we assume it to contain the models published in [69]. An example set of these initial reference data items is given in Section 4.7.
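A sketch of such a typed instance in Jena is given below; the Part 4 namespace and class URI are placeholders, not the official reference data identifiers.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.rdf.model.Resource;
    import com.hp.hpl.jena.vocabulary.RDF;

    public class TypedInstance {
        public static void main(String[] args) {
            Model m = ModelFactory.createDefaultModel();
            // Placeholder URI for a Part 4 initial reference data class.
            Resource physicalObject =
                m.createResource("http://example.org/part4#physical_object");
            // An instance, classified with the plain RDF typing mechanism.
            Resource pump = m.createResource("http://example.org/dataset#pump-7");
            pump.addProperty(RDF.type, physicalObject);
            m.write(System.out, "N-TRIPLE");
        }
    }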

Criterion 5. All triples should be encapsulated in Named Graphs in order to provide them with meta-data.

This is a fundamental decision that makes the approach different from platforms like iRING. As discussed in the previous two chapters the use of Named Graphs is quite new, but appears to be the best approach to allow the recording of data about data.
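Because a Named Graph has a URI of its own, meta-data can be stated about a graph just like about any other resource. The sketch below uses NG4J quads and Dublin Core terms as example predicates; the graph URIs and the choice of a separate meta-data graph are illustrative, not the ODRAC vocabulary.

    import com.hp.hpl.jena.graph.Node;
    import de.fuberlin.wiwiss.ng4j.NamedGraphSet;
    import de.fuberlin.wiwiss.ng4j.Quad;
    import de.fuberlin.wiwiss.ng4j.impl.NamedGraphSetImpl;

    public class ClaimMetaData {
        public static void main(String[] args) {
            NamedGraphSet set = new NamedGraphSetImpl();
            Node claim = Node.createURI("http://example.org/dataset#claim-1");
            Node meta  = Node.createURI("http://example.org/dataset#meta");

            // Provenance statements about the claim graph itself.
            set.addQuad(new Quad(meta, claim,
                    Node.createURI("http://purl.org/dc/elements/1.1/creator"),
                    Node.createLiteral("engineer-42")));
            set.addQuad(new Quad(meta, claim,
                    Node.createURI("http://purl.org/dc/terms/created"),
                    Node.createLiteral("2012-06-01")));
        }
    }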

Criterion 6. For each external data source that is to be connected to a homogenized data-set a mapper is needed converting data to and from the neutral language.

Each external data application (like an ERP system) that is to be connected to the UIA should be extended with a converter module, or a separate mapper application should be made to convert application data to neutral data.

Criterion 7. All homogenized neutral data from a distributed data-set should be accessible through a single channel.

In [74] the accessibility of a single point of access is presented as a major requirement of a computer support system for industrial design processes. The ISO 15926 sees itself as a uniformization layer over heterogeneous data too. In the UIA this layer


is presented as a channel instead of a single point. This means that no single service necessarily contains all data, but a managed set of services operates on a channel that collectively serves as a single point of access. The data from a data-set can be distributed over multiple stores.

Criterion 8. Inconsistency reconciliation is done before data is accepted and published on the channel.

A second major requirement identified in [74] is inconsistency resolution. Before data is merged with a data-store, it is verified for its structural form, ontology conformance and timely order. Only if new data is consistent with the existing data is it merged into the data-set. The managed set of data-stores within a project channel functions under the closed world assumption. The consistency of the whole data-set is ensured and all resources can be expected to be available. From the absence of resources conclusions can be drawn, for example that some Named Graph, when it is not replaced, contains the most recent data.
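The three checks can be pictured as a gate in front of the merge operation. The sketch below is purely illustrative: the method names are hypothetical and the bodies are placeholders, only the order of the checks reflects the text.

    import de.fuberlin.wiwiss.ng4j.NamedGraphSet;

    /** Hypothetical consistency gate guarding the merge into a data-set. */
    public class ConsistencyGate {

        public boolean accept(NamedGraphSet incoming, NamedGraphSet store) {
            return isWellFormed(incoming)            // structural form
                && conformsToOntology(incoming)      // ontology conformance
                && isTimelyOrdered(incoming, store); // timely order
        }

        private boolean isWellFormed(NamedGraphSet in) {
            return in.countGraphs() > 0; // placeholder check
        }

        private boolean conformsToOntology(NamedGraphSet in) {
            return true; // placeholder: validate typing against the configured ontology
        }

        private boolean isTimelyOrdered(NamedGraphSet in, NamedGraphSet store) {
            return true; // placeholder: verify replace chains extend existing ones
        }
    }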

Criterion 9. Data about an object is owned by the initiator and located at a company dedicated data-store.

To encourage companies to share data with a project data-set, their data should be published to a data service that is technically managed by the company itself. This service will serve its part of the data to the collective channel. Full control means that no copies of the data are stored in any place other than the company's own server. No alterations may be made to the data-store mechanism itself.

Criterion 10. No instance data should ever be deleted, new instance data can only be marked to replace old data.

Because of the developing nature of the data stored in a CAE tool, there has to be structural change support. In STEP this change management was a big problem. Alterations in data might have juridical implications, so the consolidation should be secure and a full history track needs to be kept. The radical approach taken to this is never to change or delete instance data, only to add replacing entities. The replace chain that results is saved on the data-store that started the chain (Criterion 9). There is an important difference between instance data, which can be created by end-users, and classification data, which is strictly managed but can be changed by authorized users without leaving a record.
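A replacement can then be expressed as one more claim: a statement relating two graph names. The predicate URI below is invented for illustration; the actual replacement vocabulary of ODRAC is introduced in Section 5.5.

    import com.hp.hpl.jena.graph.Node;
    import de.fuberlin.wiwiss.ng4j.Quad;

    public class ReplaceMarker {
        /** Build a quad stating that a new claim graph replaces an old one. */
        public static Quad replaces(String metaGraph, String newClaim, String oldClaim) {
            // "replaces" is a placeholder predicate, not the real vocabulary term.
            return new Quad(
                    Node.createURI(metaGraph),
                    Node.createURI(newClaim),
                    Node.createURI("http://example.org/uia#replaces"),
                    Node.createURI(oldClaim));
        }
    }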

Criterion 11. The governance of data should be securely saved with provenance data.

From all instance data it should be clear which user made which claim at which moment. The meta-data from Criterion 5 should be used for this.

Criterion 12. Data should be negotiable between users based on a transaction model.


Figure 4.1: The transaction states in Integraal Samenwerken.

Data that is not yet complete or needs to be approved by some user will be marked with meta-data indicating its transaction phase. The UIA should support such a negotiation mechanism. A model used for Integraal Samenwerken and partly implemented is the model from Figure 4.1, adopted from the VISI norm2.

Criterion 13. There should be a User Interface where all users can perform their tasks.

For human users a single point of access is offered by a User Interface that runs inside a Web browser. A user should be able to select whether neutral names should be displayed or some translation list should be used to localize the ontology.

Criterion 14. Third parties should be able to operate on the channel without using the UIA User Interface.

The offered User Interface is optional. The data-stores that operate on a channel can be connected by any mapper or application. Finally a rule of thumb is formulated to help take further design decisions.

Criterion 15. If a design decision is not evident and might result in restrictions, the less restrictive decision should be picked.

In order to form a sustainable and widely applicable platform, it has to be made as open as possible. Some decisions, like the strict verification from Criterion 8, give a clear direction, but otherwise the design should be as open as possible for future developments. The main interest of the platform is a sustainable collection of data.

4.4 User analysis

The construction of a ship, a project, typically involves the following stakeholders. The list is simplified to cover only the involved parties relevant to the ODRAC application design. In a typical project, engineers working at the contractor company, subcontractors and possibly suppliers will be allowed access to the knowledge base. The Universal Information Adapter is built to support the work of those end-users.

2www.crow.nl/nl/VISI


The project manager and library managers have a role in operating the system. In the literature review of SCA systems [45] three types of users are discerned: end-users, domain-experts and developers. The library manager is the domain-expert and the project manager has a related role in operating the system. The developer is added as a role that could be employed at any company to build extensions of the Workspace User Interface or Data-store.

client the legal person that orders a ship;

contractor the shipyard responsible for the design and production of the ship, bound to the client by a contract;

engineer end-user of the WUI or a self-made User Interface;

higher management responsible for paying for the used information infrastructure;

IT-management responsible for operating the used information infrastructure;

subcontractor company that is contracted by the contractor to do part of the design and production;

engineer idem;

higher management idem;

IT-management idem;

supplier company that supplies material to the contractor and subcontractors, but is not bound by project scope contracts;

sales employee end-user of the WUI or a self-made User Interface;

higher management idem;

IT-management idem;

project manager person not responsible for the result but for the organization of the engineering processes;

library manager domain-expert responsible for managing the domain specific (not project specific) content of the Information Adapter;

developer the person responsible for maintenance and further development of the ODRAC platform.

4.5 Pre-ODRAC implementation

The first implementation of the Universal Information Adapter was a stand-alone Java application in combination with a data-store called Converter. A screenshot of the former is given in Figure 4.2. This version 1.x implementation of a UIA platform lived up to most of the requirements, although Criteria 6, 9 and 12 had not received much attention yet. The API of the Converter was relatively similar to the API of data-stores in ODRAC, yet the data model supported by the 2.x version is different.


Figure 4.2: Stand-alone Java Client

The application from Figure 4.2 is run locally from a jar-file, which means it has to load an in-memory RDF store with the content of a 10 MB file each time at start-up. Individuals can be downloaded from a URI that points to a running Converter, but new Individuals can also be composed and sent to such a Converter. Individuals are displayed on tabs inside the main window. The content of an Individual is displayed inside the main window, and all data and meta-data is visible in one panel. Data can be added, changed or removed by clicking the [+], [E] and [-] buttons. The data model of the 1.x version is discussed in Appendix A. The source code based on this data model contained ad hoc assumptions, and some detailed exceptions were also programmed in. In Section 9.1 the new UIA design is compared to this pre-ODRAC version.

4.6 Distributed context

According to Criterion 7, a UIA project usually consists of multiple data-stores that share their information over a channel. Criterion 9 hinted that each company within a project would want a dedicated data-store. This introduces some complexity into the design of ODRAC which does not exist in the platforms discussed in Sections 3.1.1 and 3.1.2. In iRING (3.2.4) it does exist. There a pull-based data exchange mechanism is used [15] by the endpoints (Facades in Figure 3.3). This means an endpoint can pull data from other endpoints when it receives a SPARQL query (or any other request) for which it needs to process resources that are not present at that endpoint.


In ODRAC a very similar situation exists. The endpoints in a channel can be asked SPARQL queries that should be matched to the total of data and not only to the part residing at that endpoint. Either a pull mechanism could be used to temporarily collect the needed data to query, or the query itself could be split into parts, sent to the appropriate stores, and the results recombined. Criterion 9 is a principle that anticipates a mechanism to restrict pulling complexity, and the design of the ODRAC data model offers more structuring capabilities, so a first step is taken in that direction. Yet the design and evaluation during the thesis work was performed using one data-store, thus evading complexity on this front. This means that Criterion 9 is currently the least supported requirement. We will come back to this in the Conclusion chapter in Section 10.3. A second need for support of the distributed context of data-stores is the global view the user interface is required to give on a total project. It was built to deal with different data-stores, but this was not tested. For these reasons the distributed setting of the data-stores is left out of the scope of this thesis. External heterogeneous data sources can still be connected to the current ODRAC design, but the result of those mappers will for the time being be saved in one ODRAC data-store.
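A naive pull mechanism can be sketched with Jena: collect the triples from every store on the channel into one temporary model and evaluate the query there. This is the crudest variant (a realistic mechanism would pull only the resources the query needs), and the store endpoints are placeholders.

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSet;
    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;

    public class PullAndQuery {
        /** Evaluate a SELECT query against the union of several data-stores. */
        public static ResultSet queryChannel(String sparql, String... storeEndpoints) {
            Model union = ModelFactory.createDefaultModel();
            for (String endpoint : storeEndpoints) {
                // Pull all triples from each store (naive, for illustration only).
                QueryExecution pull = QueryExecutionFactory.sparqlService(endpoint,
                        "CONSTRUCT { ?s ?p ?o } WHERE { ?s ?p ?o }");
                union.add(pull.execConstruct());
                pull.close();
            }
            return QueryExecutionFactory.create(sparql, union).execSelect();
        }
    }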

4.7 Example model

One way to see the ODRAC platform is as a knowledge base that can be configured to contain instances of a diagram like Figure 4.3. Any other diagram consisting of named blocks and arrows would do. The procedure of turning an information model like this into an ODRAC configuration is called priming. In Section 5.8.1 this procedure is discussed in detail. This diagram is a combination of three Systems Engineering models from [69] and it will function as a running example in our work. It is a set of types and relationships that will be proposed as a new set of initial reference data for the ISO 15926 Part 4. A system life-cycle could be the design phase and the project a certain ship. During a life-cycle of a certain project, work packages will have to be performed. These activities require documents (deliverable items) when they involve an interaction with a system or physical object. Examples of those might be the heating system of a ship and a certain water pump needed in this system. Such a physical object can have properties, either literal like a model number, qualifiable like a color or quantifiable like a size. In the next chapter the data model is explained, showing how ODRAC structures the typing and instantiation of this model. In Chapter 7 some figures are given showing how instances of this model look in the User Interface.


Figure 4.3: Systems engineering information model.


Chapter 5

Data model

In the present chapter, as the first of four design chapters, the data structure of ODRAC is laid out. The information viewpoint is used [59]. Following this description of the core knowledge entities, in Chapter 6 the locations of these entities will be clarified by explaining which services (functional viewpoint) will run where (deployment viewpoint). Chapter 7 will build on the abstract workflow description in that chapter to introduce the graphical Workspace User Interface (WUI). The fourth and final design chapter (8) explains the most important implementation details. The content of this chapter is our answer to the first subquestion raised in Chapter 1: How can Systems Engineering knowledge data based on the ISO 15926 Part 2 Data model and containing the Part 4 Initial reference data be expressed in RDF? As described in Chapter 3, the ISO Part 8 already poses an answer to this question using OWL. The approach of Part 11 as worked out in Chapter 4 takes a different path, using Named Graphs as knowledge entities. At the end of this chapter (5.9) the relation of OWL to our approach is clarified. We begin the description with Figure 5.1. First the notion of an Individual is explained using the diagram. In Section 5.2 the description of Relationships follows. In Section 5.3 the missing part of the figure is covered, the middle layer containing Templates. This covers the most fundamental knowledge entities, and in Section 5.1.2 a technical overlay of graphs (depicted in blue in Figure 5.1) is presented. The following sections discuss a detailed view on meta-data (5.4), replace chains (5.5) and the transaction model required by Criterion 12 (5.6). Section 5.8 describes how the elements in the type layer (see Figure 5.1) are structured in a Library and how an information model can be primed to form the upper ontology of a library. The data model is presented without detailed references to the ISO. The modeling power is evaluated in 9.6 and differences between ODRAC and the ISO are further discussed in Appendix E.

5.1 Individuals

The notion of an Individual in the ISO 15926 is defined as a thing that exists (or could possibly exist) in space and time [22]. In our work we take a more technical approach and define an Individual as an instance graph construction. In the UIA the ODRAC


Figure 5.1: Basic data structure in the type layer (1-6) and individual layer (20-31). The position of graphs 10 and 11 is explained later.

platform is configured to model elements like work package and work package activity from Figure 4.3. It is up to the person implementing those information models what the meaning of those concepts is; ODRAC only facilitates instantiating them. When two separate instances are made representing the exact same thing, from the perspective of ODRAC this results in two Individuals. The unique name assumption reasons that two objects with a different name must describe a different Individual. To solve this, a merge procedure might be needed, resulting in the termination of all but one of the unique names. Currently this process is not automated. In Figure 5.1 the work package and work package activity types could be represented by the nodes in graphs 2 and 3. Graphs 21 and 24 contain instances of those types, each representing an Individual. Graph 20 contains the claim that Individual 21 is of the type contained in graph 2. An unbounded subclassification hierarchy is supported between graphs 1 and 2, but there is only one instance level. Each Individual ultimately is of one type. As for Relationships, instantiated in graphs 30 and 31, a type definition must be given too. Graph 31 could depict an instantiation of the consists of relationship from Individual 21 to Individual 24. For Relationships like 31, the type relation always points to a relationship template graph (10, 11), which prescribes the usage of a relationship from the Type layer (4-6). In this case, graph 6 contains the definition of the consists of relationship and graph 11 describes that this relationship may exist between a work package and a work package activity. More details are given in Section 5.2. Graph 30 contains a Relationship instance and a node. In Section 5.2.3 this construction is described in detail, but it can be understood as an Individual's property, say its name. We can now define an Individual as an instance construction consisting of an identity graph (24), its type definition graph (22), all its property graphs (30) and all outgoing

relationships (like 31 of Individual 21).

5.1.1 Identifiers

The global identifier of an Individual is a URI. It is the name of the graph containing the Individual. The first part of the URI consists of the name of the data-store the Individual resides at and the project data-set selector; the second part, separated by a hash (#), is a UUID. The ODRAC user interface client (WUI) generates these URIs using UUID version 4, creating them fully at random. This can be seen from the number 4 that is the starting character of the third block:

http://datastore.uia15926-11.com/project/15/dataset#af4ab6c1-82ca-4e5d-b3f0-2609dd407c93

This approach to produce UUIDs is efficient in time, space and communication, and a UUID cannot be traced back to who produced it. Yet the most important requirement for ODRAC, uniqueness, is not guaranteed [61]. Although the probability of a collision in a set of 2³⁶ UUIDs is only 4 · 10⁻¹⁶, the data-store checks all incoming new UUIDs to secure uniqueness (see Section 6.3.3). Because the first part of the Individual's URI describes the Data-store uniquely and the UUID is guaranteed unique within this Data-store (due to the check), the full URI is unique too. Next to being unique, an identifier of an Individual cannot be deleted and lives forever within the lifespan of the project. This is true for all the information stored in a project, but not all knowledge nodes need to be uniquely identifiable. As long as an element is contained in a Named Graph (21, 24, 31), that element is uniquely identifiable with the graph's URI. Some instances do not need a unique identifier. For example, a literal property like the name of a Physical object is not something other users will make claims about. For that reason it might be saved in a graph together with the relationship from the Physical object to the name property (like graph 30). Such an instance is called a blank Individual.
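Minting such an identifier is a one-liner in Java: java.util.UUID.randomUUID() produces exactly the fully random version 4 UUIDs described above. The base URI below repeats the example from the text; the method name is illustrative.

    import java.util.UUID;

    public class IndividualUri {
        /** Mint a new Individual URI for a given data-store and project data-set. */
        public static String mint(String dataSetBase) {
            return dataSetBase + "#" + UUID.randomUUID(); // version 4: fully random
        }

        public static void main(String[] args) {
            System.out.println(
                mint("http://datastore.uia15926-11.com/project/15/dataset"));
        }
    }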

5.1.2 Claims

In ODRAC knowledge is represented with RDF statements. Those triples are the most atomic notion of knowledge. Groups of triples can be contained in a Named Graph, and the granularity of how to divide the triples over graphs (the blue blocks over the directed labeled graph in Figure 5.1) is based on two considerations (see Appendix C.1 for more details). First of all, the Named Graph entity represents a claim. Users of the ODRAC platform can inject packages consisting of claims, and each claim is required to contain meta-data, as demanded by Criterion 5. ODRAC can be configured to supply any type of meta-data, but by default the creator and the creation date of a claim are logged. The more precisely data needs to be labeled with meta-data, the more detailed the granularity should be. As the figure shows, the granularity is almost one-to-one: almost every node and edge gets its own graph. A second consideration is which set of statements should be replaceable individually. Because of Criterion 10 no claim is ever deleted, but some claims can be said to replace others. The content of a graph can be said to be replaced by the content of a newer graph.


Figure 5.2: All Graph types used by the ODRAC platform.

Because a graph can only be replaced as a whole, the type definition of an Individual is kept in a separate graph: in order to change the type of an Individual the type definition needs to be replaced, but the Individual itself is not allowed to be replaced; its URI should live forever. The more data is stored in one graph, the less precisely it can be replaced, because a large new set has to be made to replace the old graph as a whole. In Figure 5.2 an overview is given of all the different types of graphs that exist in the ODRAC platform. All graphs that have subtypes are introduced as generalizations that have no real instances.

5.1.3 Typing

All individuals, either blank or not, should point to exactly one Individual type using the rdf:type relation. For blank individuals this type definition is implicit (see Section 5.2.3), but for Individuals the type definition is carried in graphs like 20 or 22. Such a type definition can be replaced by a definition pointing to a subtype of the type it previously pointed to.

All Individual types inherit from one of two predefined root types. For Individual types specified in the domain specific ontology this root is the uia:OntologyElement. This specification is given by the domain specialist who configures ODRAC for a specific use. A set of predefined types is grouped in a tree with the uia:PrimerElement as root. This set can be extended by the development team of ODRAC. In both trees, any number of subtypes can be introduced, and they relate to each other as graphs 1 and 2 from Figure 5.1. A type definition can be used to contain any type information, like the set from [23] of URI (equal to graph name), unique number, unique name, definition description, notes, superclass and related 15926-1 and -2 entities (see the first figure in Appendix E). ODRAC builds on the URI, name and classification only. See Appendix C for a full overview of the content of all types of graphs.
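As a sketch, a type definition graph like 20 needs little more than a single typing triple; the instance name and the type below are hypothetical stand-ins, following the style of the graph 10 example in Section 5.2.2:

dataset:individual_21 rdf:type p4:WorkPackage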


5.2 Relationships

A predicate in RDF can either point to a Resource or a Literal. In ODRAC a Relationship can likewise point to an Individual or a typed Literal. We already discussed the consists of relationship between Individual 21 and 24. Also the difference between an Individual and a blank Individual, and the consequences for the use of a Relationship (like in graph 30), have been discussed. We will zoom in further on such a complex graph after we discuss the typing structure of Relationships and the central role of the relationship template.

5.2.1 Typing

Just like a small extendable set of primer classes is given next to the ontological classes, a double hierarchy is also used for relationships. Relationships ultimately inherit from uia:OntologyRelationship or uia:PrimerRelationship and can have any level of subtypes. Graphs 5 and 6 contain two Relationship types. As in the example, graph 5 contains the p4:consists_of relationship (no spaces are allowed in the URI); in fact, this is the name of the graph. Inside the graph a preferred label and a reversed label are saved, which are used in the WUI to display human readable representations of the relationship. Just like the Individual type graphs, any other library information can be included, like the set from [23] of relationship name, reverse name, related 15926-2 type, related 15926-4 class and definition description; in the future an abbreviated noun-description might be added. ODRAC builds on the name, reverse name and subPropertyOf classification only.
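A sketch of the content of such a RelationshipTypeGraph could read as follows in triple format. The label predicates are assumptions (the text only states that a preferred and a reversed label are saved); rdfs:subPropertyOf is the classification relation named above:

p4:consists_of rdfs:label "consists of"
p4:consists_of uia:reverseLabel "is part of"
p4:consists_of rdfs:subPropertyOf uia:OntologyRelationship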

5.2.2 Relationship templates

Just like an Individual has a type definition, a Relationship should always refer to exactly one graph that specifies the use of that relationship. For example a graph containing a p4:consists_of relationship would point to a graph defining the use of this relationship between a certain domain and range (see directly below). Relationships are contained in graphs (30, 31) that point to a RelationshipTemplateGraph (RTG) using the uia:derivedFrom relation. In Figure 5.3 the content of such a graph (10, 11) is explained. The abbreviated notation at the left hand side is in fact structured as depicted in the right hand graph. Graph 10 might read as follows in triple format:

dataset:graph_10 uia:domain p4:WorkPackage
dataset:graph_10 uia:predicate p4:consists_of
dataset:graph_10 uia:range p4:WorkPackageActivity

The domain and range point to the Individual types that are allowed as starting point and end point of the predicate respectively. As shown in Figure 5.3, more information can be specified about the relationship. A fourth relation is the uia:rangeDatatype; Table 5.1 lists all the supported data types. If this specification is omitted the value xsd:anyURI is assumed, which will be the URI of any Individual of the range type. If any other value is specified the uia:range does not have to be supplied, because ODRAC will construct a Literal value. Either for Literal values or for Individuals of the uia:range type, a default value can be supplied with the uia:rangeDefaultValue. In Appendix C a full overview is given of the content of the RelationshipTemplateGraph.


Figure 5.3: The abbreviated RTG (left) is in reality structured like the graph on the right.

URI           example format
xsd:anyURI    http://example.com/path#fragment
xsd:string    any string of XML Unicode characters
xsd:double    19.3
xsd:dateTime  2013-01-11T12:01:42
xsd:integer   -23

Table 5.1: Supported data types for relationship template graphs.

Two important relations are the uia:domainContainer and uia:rangeContainer. These are used to specify in what type of graph the domain and range Individual should reside. If the range is a Literal, the rangeContainer should always be a ComplexDataGraph (see the next subsection).

The relationship template is used in two ways. From the perspective of an Individual, a selection can be made of all the relationship templates that mention the type of this Individual as domain. This selection describes the Relationships that can be made starting from this Individual. Other information contained in the relationship template refines this blueprint: the minimal and maximal cardinality is specified, which prescribes how many of this Relationship the Individual should have; it is specified whether the Relationship bridges the line from data to meta-data; and it is specified whether the Relationship should be contained in a normal graph (thus pointing to a normal Individual) or in a complex graph (thus pointing to a blank Individual). From the perspective of an existing Relationship, the relationship type definition can be found by following the uia:derivedFrom relation; this definition contains the label and reversed label of the Relationship.
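Extending the graph 10 example from above, a fuller RTG could read roughly as follows. This is a sketch: the container values are plausible but not spelled out in the text, and the cardinality predicate names are pure assumptions, since the cardinality relation introduced instead of owl:cardinality (Section 5.9) is not named in this chapter:

dataset:graph_10 uia:domain p4:WorkPackage
dataset:graph_10 uia:predicate p4:consists_of
dataset:graph_10 uia:range p4:WorkPackageActivity
dataset:graph_10 uia:domainContainer uia:IndividualGraph
dataset:graph_10 uia:rangeContainer uia:IndividualGraph
dataset:graph_10 uia:minCardinality "1"
dataset:graph_10 uia:maxCardinality "1"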

5.2.3 Complex data

We will illustrate the function of RTGs in the construction of a blank Individual. As discussed above, a blank Individual is an instance that does not get a dedicated graph. Instead, it shares a graph with the incoming Relationships. In Figure 5.4 the structure is illustrated of a Physical object 41, for example of type Electrical Actuator (graph 7). Graph 42 is a ComplexDataGraph (CDG) containing a quantifying property (the orange node), for example the Control voltage. This type is defined in graph 8. Graphs 7 and 8 both extend the uia:OntologyElement, although this is not visualized in the diagram.

Figure 5.4: Detailed complex data structure.

There is a uia:derivedFrom relation pointing from graph 42 to the relationship template graph 10. The only way to find the type of the blank Individual inside graph 42 is to follow this relation and select the uia:range of this template. When the active data-set is queried for RelationshipTemplateGraphs with domain type Control voltage and as domainContainer a CDG (the kind of graph we are currently in), two new RTGs are found. Graph 15 describes a has_value predicate pointing to a Literal (green) with the xsd:double data type. Graph 16 describes an is_quantified_in relationship to an instance (purple) of a volt collection type (45). For ComplexDataGraphs all RTGs are obligatory. This means CDG 42 is only valid if the relations prescribed by graphs 15 and 16 are contained in it. There is no explicit reference to these graphs, but they can be found at any time by executing the described query. Graph 43 is the only Individual of the volt collection type and has the name volt. Graph 42 contains an is_quantified_in relation to this Individual and a has_value relation to a literal like "230.0"^^xsd:double.

5.3 Individual templates

An Individual can be transformed into an individual template by pointing to it from other Individuals with the uia:derivedFrom relation. This second Individual then inherits all the Relationships of the first, both individual and complex, without any possibility to change those. This relation between two Individuals can only be established during the creation of the second Individual, and it cannot be changed for the full lifetime, because Individual graphs are not replaceable. Allowing living Individuals to change their relation to a template is too complex, because conflicts easily emerge between already existing Relationships of the Individual and those of its (intended) template.


Figure 5.5: Basic data structure in type layer (1-6), template layer (10-12, 22, 23, 33, etc.) and individual layer (20, 21, etc.).

In Figure 5.5 a number of possibilities for individual templates is given. The Template layer, consisting of individual templates and relationship templates, functions as a middle layer between the type layer and the Individuals, but individual templates are just as much instances as normal Individuals. They can be used to define the common characteristics of a set of Individuals. For example a cooling system could consist of pumps that all share the same voltage. All those pump Individuals could be related to one individual template of type pump with a voltage property. For those Individuals no type definition graph (like 20) is needed; instead a uia:derivedFrom relation is used to point to the individual template that has a type definition.

A second use of individual templates is as a specialized hook for RTGs. Normally an RTG is part of the uia:DefaultTemplate, which makes it applicable to any Individual. If the uia:template relation inside an RTG points to some individual template, then the relationship template is only active for Individual instances based on this individual template. Graph 12 is an example of such an RTG, fulfilled by graph 33. It should be noted that the use of individual templates is not implemented in the UIA available at the time of this thesis.

In the data model design it is possible to make subclassifications of individual templates (like graph 28) that inherit all the active Relationships and relationship templates from their parents. This can lead to the peculiar situation that an Individual is based on an individual template that has a Relationship to a certain other Individual (which it inherits), while this other Individual has a similar relationship to the first Individual.


Figure 5.6: Detailed meta-data structure.

This situation sounds like a loop, but even without an inheritance relation between Individuals such loops are possible, provided a Relationship can be constructed between a domain and range of the same type. These loops are allowed and the WUI is built to tolerate them.

5.4 Meta-data

In Figures 5.1 and 5.5 no special structure was presented for meta-data, and the forms of meta-data introduced thus far only include the creator and created information contained in a PublishableGraph. The whole knowledge network consists of data about data, but when a relation reaches from within the knowledge domain (e.g. a Work package) to outside the domain (e.g. an Attachment), the object of the relation is called meta-data. An RTG describing a relationship to a meta-data entity contains a uia:level relation pointing to uia:MetaLevel. Any other RTG is of uia:DataLevel. In Figure 5.6 the Relationships a, d, e and f are of MetaLevel; only b and c are of DataLevel.

The creator and created relations are so fundamental that their production is hard-coded in ODRAC, so no RelationshipTemplates exist prescribing their existence, but any other form of meta-data can be built in. Defining new types of meta-data is not a task for the domain expert, but for the ODRAC developers or the platform manager to decide. Currently an Individual can be supplied with Remark and Attachment meta-data, and about a Relationship also Intention and Certainty claims can be recorded.

Figure 5.6 shows the structure of both an Individual and a Relationship being the subject of a meta-data claim. The same structure is used for saving data claims. A meta-data relation is always contained in a ComplexDataGraph (like graphs 30 and 31), and might point to another Individual (like 32). Such an Individual must be an instance of a uia:PrimerElement, and it cannot be the subject of other meta-data. Meta-data can only be applied to data, so once the line from data to meta-data (dotted in Figure 5.6) is crossed, a network chain cannot be longer than the example of 19 - 20 - 31 - 32. Horizontally the chain cannot grow wider, but because meta-data relations prescribed by RelationshipTemplates with level MetaLevel can be replaced, vertically a chain could grow without bound (see next section).
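As a sketch, an RTG enabling Attachment meta-data on any element (the example used in Section 5.8.1) could read as follows; the graph name is hypothetical, the other resources occur in the text:

dataset:graph_70 uia:domain uia:OntologyElement
dataset:graph_70 uia:predicate uia:attachment
dataset:graph_70 uia:range uia:Attachment
dataset:graph_70 uia:level uia:MetaLevel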


Figure 5.7: Detailed complex data structure.

If the definition of meta-data were extended from non-domain related claims about domain-related data to all RDF relations outside the ontology namespace, structural relations like uia:replaces and uia:derivedFrom, but also rdf:type, would be part of meta-data too. In this report this second definition is not used; instead the total set of relations occurring in a data-set consists of predefined ODRAC relations (from the uia namespace), of which some function as meta-data (see Section 5.5).

5.5 Graph replace-chain

If a ReplaceableGraph (like graph 35 from Figure 5.7) replaces another graph (30), it contains a uia:replaces relation pointing to the replaced graph. Starting from the most recent (active) graph, a chain can be followed backward in time. To simplify the retrieval of all graphs from one chain, each new graph also contains a uia:origin relation pointing to the first graph in the chain. If a query for all graphs that contain a uia:origin relation to graph 30 only returns graph 30 itself, it has never been replaced and it is still active. If more than one graph is returned, the only graph that is not replaced by another graph from the result set is the most recent and active graph. This would be graph 37 in the example figure. A Relationship, independent of how many times it was replaced, can be removed from the active data-set by terminating the whole chain. Graph 38 in Figure 5.7 is an example of such a TerminationGraph.

Although an Individual cannot be replaced, it can be terminated; graph 39 gives an example of this. Terminating an Individual should always be combined with terminating all its Relationships, complex (as in the figure) and non-complex.
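The retrieval of the active graph of a chain can be sketched as a SPARQL 1.1 query. This assumes, as a simplification, that the uia:origin and uia:replaces triples are stored inside the graphs that carry them:

SELECT ?g WHERE {
  GRAPH ?g { ?g uia:origin dataset:graph_30 }
  FILTER NOT EXISTS {
    GRAPH ?h { ?h uia:replaces ?g }
  }
}

If graph 30 was never replaced, only graph 30 itself is returned; otherwise the single unreplaced graph of the chain remains (graph 37 in the figure).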


5.6 Transaction model

In this section an extra control layer is introduced. It reuses the replace chain structure, and because the Intention meta-data element is used to represent the transaction states, it also clarifies the use of meta-data. The action required mechanism itself (graphs 51, 52 and 53) is also applied as meta-data. The combination of the mechanism with the Intention meta-data element results in an implementation of the workflow schema described in Figure 4.1.

The procedure is illustrated in Figure 5.8. Graphs 24 and 30 again represent an Individual and a ComplexDataGraph. Instead of fulfilling all the Relationships this ComplexDataGraph should contain according to its RelationshipTemplateGraphs (omitted in the figure), the graph still contains a Placeholder. When graphs 51, 52 and 53 are added, the CDG is allowed to be synchronized to the Data-store.

Graph 51 points with the uia:actionRequired relation to an RTG. The Relationship described in this template is required to be instantiated on the specific Individual in graph 30. The action is requested from a person by the uia:actionRequiredFrom relationship in graph 53; in the figure the action is required from person A. It is clear that person B is doing the request, because the three action graphs point to him as uia:creator. Graph 52 is optional and specifies an end-date.

Graphs 24, 13 and 57-62 are assumed to exist in the Data-store. When graph 30 is synchronized together with graphs 51-53, an action is pending for user A. As soon as he logs in to the Workspace service he will be notified. The RTG 13 has a uia:rangeDefaultValue pointing to Intention instance 57, which means a proposal. In graph 54 this relation is made by person A, representing the claim that he did a proposal. This proposal is contained in graph 35, which is notated to replace graph 30. The publication of graphs 35 and 54 by person A fulfills the action he was required to do. In the figure the procedure continues with the application of graph 55 by person B, who did the request, and an accept claim by person C. There is no control mechanism yet that manages which person can make a Relationship to which Intention instance. Currently the use of the uia:actionRequired mechanism implicitly means the person from whom the action is required has to make a new subject graph (like graph 35 replacing graph 30). In future work this might be made explicit.
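In triple format the three action graphs could carry content like the following. This is a sketch: the subjects and the end-date predicate are assumptions, only uia:actionRequired, uia:actionRequiredFrom and uia:creator occur verbatim in this section.

dataset:graph_30 uia:actionRequired dataset:graph_13               (in graph 51)
dataset:graph_30 uia:actionRequiredBefore "2012-07-01T00:00:00"^^xsd:dateTime   (in graph 52)
dataset:graph_30 uia:actionRequiredFrom dataset:person_A           (in graph 53)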

5.7 Translation labels

In order to meet Criterion 3, an ODRAC data-set should be able to attach company specific labels to all classes from the neutral information model. A special graph called TranslationGraph is used to contain these company specific names. In Figure 5.9 a Company individual (graph 69) is displayed with two TranslationGraphs. The uia:forCompany relation is used to relate the translation label to a company's view. The uia:companyName relation contains the actual step from an Individual type URI (subject) to the label. The graphs currently only support localized names for Individual types, not for Relationship types, but from the Integraal Samenwerken case there is no need to support the latter.
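As a sketch, one of the two TranslationGraphs could contain the following triples; the graph name, the company Individual and the Dutch label are hypothetical examples, and it is assumed that the uia:forCompany triple has the TranslationGraph itself as subject:

dataset:graph_71 uia:forCompany dataset:individual_69
p4:WorkPackage uia:companyName "Werkpakket"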


Figure 5.8: Possible workflow steps using meta-data.


Figure 5.9: Translation graph structure.

The company specific names can be acquired by including them in a SPARQL query pattern while interacting with a data-store, or by selecting a company specific view while logging in to the Workspace User Interface (Section 7.1).
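A query pattern retrieving the company specific label of a type could then be sketched as follows, reusing the hypothetical names from the example above:

SELECT ?label WHERE {
  GRAPH ?tg {
    ?tg uia:forCompany dataset:individual_69 .
    p4:WorkPackage uia:companyName ?label
  }
}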

5.8 Libraries

The main principle in the ODRAC platform to prevent data inconsistency (Criterion 8) is to validate new instance data against type and template information before accepting it into a data-store. This information is contained in a data-set layer called the library.


The relation of this layer to the instance base layer and the used RDF vocabularies is illustrated in Figure 5.10. The use of the RDF, RDFS and RDFG layers has been discussed sufficiently. The primer vocabulary (UIA) contains the specific types of Named Graphs used by the ODRAC platform, together with the structural relations, the two elementary classes and the two elementary relations. It is called primer because it carries all the resources that are needed to prime an information model (Section 5.8.1). The content of this vocabulary is included in Appendix D. The namespace is chosen to be

http://www.uia15926-11.com/2012/03/uia#

for compliance with the Integraal Samenwerken context and the use of the ISO Part 11. The default prefix for resources from this vocabulary is uia:, as already used in this chapter.

The library is the collection of all IndividualTypeGraphs and RelationshipTypeGraphs. From a technological perspective it does not matter how many subsets of library graphs are given. Graphs will be loaded into the project data-set and only the URI of a graph will indicate to what subset it belongs. The DictionaryHeaderGraph can be used to represent the presence of the content of a subset in the data-set. Such a graph has as its name the URI up to the hash, for example

http://www.uia15926-11.com/rdl/part11/rel/0.1#

and it contains a graph type definition. An important requirement is that subsets can only be added if the sets they contain subclassifications of are added too.

From a modeling perspective a difference is made between an (upper) ontology, containing the representation of the elements of the general information model (Figure 4.3), and the modular specialization of those ontological concepts in (possibly project specific) subsets. These can also be called taxonomies or dictionaries. In Figure 5.10 a special dictionary is used to contain the Relationship types. The ontology and relation subsets contain elements from the upper information model appended with meta-data concepts. For example the Certainty class is a meta-data type that extends the PrimerElement. This class is in less need of subclassification than for example the Physical object class. In Figure 4.3 two subsets are depicted extending OntologyElement classes from the ontology set. In the running Information Adapter implementation only one subset was used, containing a full taxonomy of concepts involved in shipbuilding. This data was imported from a different information modeling tool and transformed into ODRAC graphs using a special mapper.

Next to the ontology, relation and other subsets, which contain IndividualTypeGraphs and RelationshipTypeGraphs, a library can contain any number of translation sets. These sets contain any number of TranslationGraphs (see the previous section) related to a certain company specific view. As soon as a data-set contains such a translation set, the project data can be supplied with translation labels from these graphs at will.

The instance base of a project data-set is depicted with a thick solid box. The content of one library layer can be used for more than one project instance base, just like more than one library can be built using the primer vocabulary. The instance base is split in two sections. The project configuration contains instances of PrimerElements


Figure 5.10: The RDF layer structure of the ODRAC platform.

like the Certainty class or the different Intention options 57, 58 and 59 in Figure 5.8. These instances influence the configuration of the ODRAC platform, and they are not allowed to be instantiated by end-users; the project manager controls these. Project Individuals and individual templates can be produced by end-users. If the data-set of a certain project is new and no end-user has made claims yet about the existence of Individuals or any Relationship between them, it should contain one Project Individual and any number of Person Individuals. A user can only start to produce data if he or she can be linked to a Person Individual. When the project is configured, the active type collection together with the set of templates allows the user to start defining new Individuals and Relationships between them. There is no configurable control mechanism yet for which user may make instances of what type.

The individual templates will be part of the instance base. The default set of relationship templates is part of the upper ontology, but for uses of an RTG like graph 13 in Figure 5.8 or graph 12 in Figure 5.5 they might at some point be added by an end-user. It should be emphasized at this point that for a project all library and instance base graphs reside in one RDF data-set. Only the graph name URI indicates to what subset a graph belongs. The layers in Figure 5.10 are meant as a conceptual explanation of what can be found in one project data-set.

5.8.1 Priming procedure

Now that the structure of the library has been discussed and the function of all the graphs from Figure 5.2 has been explained, a description can be given of how to transform an information model like Figure 4.3 into an ODRAC ontology. This procedure is called

priming, because some extra attributes need to be attached to the model in order for ODRAC to use it as an instance blueprint.

The first step is to make IndividualTypeGraphs for all named nodes and RelationshipTypeGraphs for all named edges. The consists of relationship, used in five different situations in the diagram, only needs to be created once, and because the graph containing it should have a unique name, the naming paradigm of subset:consists_of only allows one definition. Inside the RelationshipTypeGraph a label and reversed label should be specified. The nodes directly extend the uia:OntologyElement and the links directly extend the uia:OntologyRelationship. As described in Sections 5.1.3 and 5.2.1, the type graphs can be supplied with any version information or textual definition (see Appendix E for an example of this).

The next step is describing how the defined parts fit together. The RelationshipTemplateGraphs contain much of the priming information. First of all they describe which relations may be instantiated between instances of which domain and which range. Because an instance of a class can live either in an IndividualGraph or in a ComplexDataGraph, the RTG also describes in which kind of container the domain Individual resides (the domainContainer) and in which kind of container the range Individual should be put (the rangeContainer). The decision which node from the information model to instantiate as an Individual and which as a blank Individual is saved in RTGs. It is even possible to do both: for example a System life-cycle could be instantiated as an Individual for having relations with a System life-cycle stage, but as a blank Individual for some Physical object subtype. In that case it could exist with a String value attached as a kind of Literal property. This would create two very different interpretations of a System life-cycle, but these can coexist without any interference.

Also the minimal and maximal cardinality of the Relationship can be specified inside an RTG. When the minimal cardinality is higher than 0, an Individual cannot be synchronized to the data-set before enough relations of this type are defined. The instantiation process can be guided even further by specifying a default value or an acceptance range for numerical values.

The application of meta-data to the future instance base is guided by RTGs too. When for example a uia:attachment relationship is defined for domain uia:OntologyElement and range uia:Attachment, it facilitates the definition of an attachment to any element from the information model. The uia:certainty meta-data relation is only applicable to the Quantifiable property, so the RTG describing that relation has a somewhat less general domain.

At this stage the library is ready for ODRAC to accept instances. Yet the most likely next step is to formulate subtype hierarchies of the upper ontology concepts just defined. The screenshot in Figure 7.1 gives an impression of this, although the model slightly differs from the information model of Figure 4.3. A final optional step is to formulate translation sets. For the Information Adapter such a translation list was made for each collaborating company.

5.8.2 Inferencing

One structural detail of the subclassification of Individual and Relationship types needs some extra attention. In a number of situations, for example when finding all the own and inherited RTGs that apply to a specific Individual, the inheritance hierarchy of


a type should be easily traversed. In Section 2.1.4 we have seen that SPARQL can define recursive patterns that could be used to calculate the transitive closure of e.g. the subClassOf relations. Yet, as we briefly noted there, SPARQL does not support recursive patterns over graphs. For example the following query does not return all super types of the class ?itg of Individual ?i.

GRAPH ?tdg { ?i rdf:type ?itg }
GRAPH ?itg { ?itg rdfs:subClassOf+ ?super_itg }

At some point during the development a query pattern was considered with the following nested structure.

{
  GRAPH ?s_8 { ?s_8 rdfs:subClassOf ?s_7 }
  GRAPH ?tg { ?g rdf:type ?s_8 }
} UNION {
  GRAPH ?s_8 { ?s_8 rdfs:subClassOf ?s_7 }
  OPTIONAL {
    {
      GRAPH ?s_9 { ?s_9 rdfs:subClassOf ?s_8 }
      GRAPH ?tg { ?g rdf:type ?s_9 }
    } UNION {
      GRAPH ?s_9 { ?s_9 rdfs:subClassOf ?s_8 }
      OPTIONAL {
        GRAPH ?s_10 { ?s_10 rdfs:subClassOf ?s_9 }
        GRAPH ?tg { ?g rdf:type ?s_10 }
      }
    }
  }
}

But this is complex and only reaches a predefined number of levels deep. A simpler solution to overcome the recursive graph pattern limitation is to supply an inference relation from each library subclass to all its predecessors. As discussed in Section 2.1.2, inferences can be automatically added to RDF data-sets on the fly. This process of inferencing can be used to query a data-set for entailed facts. Yet for consistency reasons (Criterion 8) and for the sake of simplicity, no temporal inferences are calculated in ODRAC. Instead the needed inheritance shortcuts are added to the data model.

In Section 5.5 we already introduced the uia:origin relation that points from each graph to the first graph of a replace chain. This relation can be seen as an inferred link based on the full graph chain. To make all super types for Individuals and Relationships directly available, the uia:inheritsFrom relation is used. As a result, each IndividualTypeGraph points to the uia:OntologyElement or the uia:PrimerElement with the uia:inheritsFrom relation. This is analogous to the use of the uia:origin relation in the instance base, but here the inheritsFrom relation also points to all the other super classes. Again this inference relation is added as part of the initial graph.
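With these shortcuts in place, the failing recursive query from above can be replaced by a flat pattern. A sketch, assuming the uia:inheritsFrom triples reside in the type graphs themselves:

GRAPH ?tdg { ?i rdf:type ?itg }
GRAPH ?itg { ?itg uia:inheritsFrom ?super_itg }

Because every IndividualTypeGraph points to all of its predecessors directly, no property path over graph boundaries is needed.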


5.9 Relation to OWL

We finish this chapter with a short explanation of why OWL is not used within the ODRAC data model (it is missing in Figure 5.10). In ISO Part 8 the ISO data model was transformed into OWL [56, 2]. Examples of classifications were given in Manchester Syntax, like the following Class Expression.

Class: rdl:CentrifugalPump
  SubClassOf:
    rdl:DynamicPump
    oim:Assembly some rdl:PumpImpeller
      some rdl:PumpSuction
      some rdl:PumpDischarge
    oim:DirectConnection some rdl:PumpDriver
    oim:MaximumDesignPressure xsd:float

Such a class description is more expressive than the class definition used in ODRAC. Usually in OWL the membership of a class can be inferred by matching the Class Expression against the instance base; multiple membership is also possible. Yet the expressiveness OWL offers at the cost of inference calculations was not needed for the ODRAC data model. We do not use any evaluation mechanism; all Individuals are explicitly related to exactly one class. In ODRAC the closed world assumption and the unique name assumption make it possible to omit inferencing.

A very typical deficit of Part 8 was the temporary removal of the Meta-data Annex. It turns out that the formulation of meta claims without the use of Named Graphs is very difficult. In ODRAC the use of Named Graphs is the main approach. OWL does not inherently conflict with Named Graphs, but the combination of the two is not trivial. Also in the work on NEPOMUK the use of Named Graphs in the graph meta-data vocabulary is said to be inspired by OWL (thus forming an alternative) and does not reuse much of OWL [62, 10].

The class expression given above has some similarities with the IndividualTypeGraph. Both are a pattern, a construct of triples, together forming the definition of a classification entity. OWL is very open and expressive; the IndividualTypeGraph is very rigid. The design choice in ODRAC was to keep this separation in approaches clear by using very little from the OWL vocabulary. The use of cardinality descriptions is a good example. In OWL it is possible to add cardinality selectors to class expressions, restricting the conclusion that a certain instance is of this type to the number of relations it has. The same owl:cardinality relation could have been used in ODRAC, but because of the ontology driven paradigm we want the use of cardinality restraints in RelationshipTemplateGraphs to be prescriptive. For this reason, and because the owl:cardinality relation has owl:Restriction as domain, a new cardinality relation is introduced in ODRAC. Yet this does not mean that no combination of OWL and Named Graphs could be given. In our work it was not feasible, but in ISO Part 12 the combination will be investigated.


5.10 Conclusion

Criterion 4, which prescribed the use of RDF triples, and Criterion 5, demanding Named Graphs, are the core requirements fulfilled in the data model presented here. The way meta-data is recorded (Section 5.4), as demanded by the same Criterion 5, offers an easy way to make a graph cut between data and meta-data. Criteria 3 and 12, which require the integration of company specific labels and a transaction procedure, have been fulfilled by the designs offered in Sections 5.7 and 5.6. Inherent to the data model is that deletion of data is not possible, as was required by Criterion 10. Also the minimal set of governance data (Criterion 11), recording which user made which claim at what moment, is incorporated in the model.

Chapter 6

System Architecture

The data stored in the ODRAC platform should be accessible via a single channel (Criterion 7). It should not be a surprise that this channel is the Web, or some protected Web subnetwork. This means that all of ODRAC's operations are available as Web services, accessible over the HTTP protocol. In Sections 6.1 and 6.2 the two types of Web services are introduced by describing their purpose (functional viewpoint) and location (deployment viewpoint). The workflow of producing data and the propagation of data through the platform is explained in Section 6.3 (operational viewpoint); there it is also clarified how parts of this workflow are designed to work at the same time (concurrency viewpoint). The next chapter will build on this abstract workflow description to introduce the Workspace User Interface (WUI). In Section 6.4 the configuration of the Workspace and Data-store services is discussed: how ODRAC is configured to contain a project. Once a project is set up, it will contain a data-set that remains structured as described in the previous chapter. A further section describes the position of the Mapper in the ODRAC platform. The chapter finishes, like the other design chapters, with a conclusion section.

6.1 Data-store service

In Figure 6.1 three data-stores are displayed, involved in two projects. For example the p14 arrows can be understood as communication lines for project 14, together forming one distributed data-set channel. In order to give companies A, B and C full control over their own data (as required by Criterion 9), their data is saved to a data-store the company owns. Company B uses a third party application to operate on the channel, but the WUI can also be used to interact with the full data-set. The number of data-stores in one channel is not bounded, and a new one can be added at any moment. All editors should be configured to use the new data-store, and all data-stores should be available before the channel is in an operative state. As described earlier, the projects running during the time of this thesis consist of only one data-store, to postpone the development of a federation mechanism (Section 4.6).

A different matter is how many projects can be stored on one data-store. No theoretical limit is set on this either, but because each data-store loads every project it is involved in into a separate part of its memory, this is bounded in practice. To give an indication, Table 9.1 shows the measured memory usage of a data-set.


Figure 6.1: Architecture overview with two projects (14 and 15) in three companies.

Although multiple projects can be stored on one data-store, and the request URI for a data graph from a project may contain an indication of which project is queried (see the URI examples below), it is not obligatory to supply a project identifier. When a data-store listens on the domain http://datastore.uia15926-11.com, all requests that reach the service are matched against all projects. Because the UUID in individual graphs is unique for the whole data-store, this cannot lead to exceptions; see Section 5.1.1 on the generation of UUIDs. Library graphs are required to have unique names if their content differs. For that reason it is good practice to include a version number of a library in the URI.

Criterion 4 states that all data in the data-set should be modeled using RDF, and Criterion 2 demands that the communication of data in the TriX format is supported. Applied to the data storing Web service, it becomes clear that the function of the service is that of a data-store or quad store. Its three functions are retrieving knowledge data, storing it and executing SPARQL queries. The data-store is also a SPARQL endpoint.

6.1.1 API

The data-store service follows the third Linked Data principle of supplying RDF data when a resource is looked up. Instead of the RDF/XML file format, TriX files are produced. Also the SPARQL protocol (Section 2.1.4) for querying Semantic Web endpoints is followed. The SPARQL update language is not supported, because putting data to the data-store could conflict with the strict validation (Section 6.3.3). As discussed in [54], Linked Data endpoints conflict in some subtle ways with the RESTful design of Web services; operations of a data-store, for example, relate to the data model instead of implying application state changes. Yet all interactions with the data-store are stateless, and GET and POST are used for retrieving and submitting data respectively. A number of operations is defined on top of the TriX retrieval and SPARQL query interface, which makes it a real API. Below the three modes of interaction are described.


Retrieving data

Assume that at http://datastore.uia15926-11.com a data-store is running with an Individual in store representing the person John, identified by the following URI.

http://datastore.uia15926-11.com/project/15/dataset#ccc8d823-cd4b-4e7b-9807-0adb4688d2f9

Simply requesting this URI (an HTTP GET request) returns a TriX-file consisting of data describing this resource. One field can be provided: action, which can be either relatedGraphs (the default) or history.

A TriX-file contains one or more named graphs. The result of an Individual look-up can be twofold. If the action is omitted or set to relatedGraphs, the response contains the following selection. First the two graphs involved in an Individual (the IndividualGraph and the TypeDefinitionGraph) are added. Also all RelationshipGraphs (complex or not, incoming or outgoing) are added, plus any meta-data RelationshipGraph containing a claim about those graphs.

If the action field is set to history, a different selection of graphs is made. The supplied URI might be part of a replace chain; if this is the case, all graphs of the replace chain are added to the TriX-file. This history-request is clearly not meant for Individuals, because an IndividualGraph cannot be replaced; requesting its history would only result in a TriX-file with one graph.

For data retrieval requests no content negotiation is used. The requester should expect the application/trix format. If no graph is found with the specified URI as name, a human readable error message is given in the text/html format.

Submitting data

When a user, either by using the WUI or any other application producing ISO 15926-11 adhering TriX-files, wants to submit new knowledge data to a project, an HTTP POST request is used with multipart/form-data. The following three attributes need to be provided.

project_id the integer identifier of the project running on the data-store that the submission should be merged with;

user_id for means of primitive authentication;

upload the full TriX-file containing all knowledge data that is part of this submission.

As set out in the Introduction chapter (see Section 1.3.7), little attention is paid to preventing data leaks (confidentiality) and unauthorized access (integrity). Currently no validation is needed when retrieving data. For submitting data, only the id of an existing Person should be supplied. Person Individuals from a data-set are also registered in a database register (Section 6.4) for user management. If the submitted id points to an active User from the database, the TriX-file is processed. Possible future security enhancements include real authentication and HTTPS encryption (see Section 10.4).

The submitted TriX-file contains instance base graphs; library graphs are not allowed to be submitted in this fashion. In Subsection 6.3.3 the procedure is described


that is used to validate the content of the TriX-file. When the content is completely valid, the graphs are added to the project data-set. If the project or user was not specified correctly, a text/html response is given. If something went wrong with the TriX content validation, a JSON collection of detailed error messages is returned; the WUI displays this report as a second phase of feedback (see Section 7.5). On a successful submit an empty JSON message is returned.

Executing queries

As described, the SPARQL 1.1 protocol is followed for communicating queries over HTTP [24]. An HTTP GET request is used in the following format, assuming the same data-store name as above:

http://datastore.uia15926-11.com/project/15/?query=&project_id=15

Two attributes are provided. As can be seen, the project_id is superfluous here, because the domain path also contains a pointer to the project, but this need not be the case.

query the URL encoded SPARQL query;

project_id the integer identifier of the project running on the data-store the query should be matched to.

The first attribute is the percent encoded SPARQL query; the second attribute is not part of the SPARQL 1.1 protocol, but is needed to specify in which project space the query should be executed. After executing the query the result is sent back to the requester. No content negotiation is used; the result is always in the application/sparql-results+xml format [36].
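As an illustration, a request selecting all graph names could look as follows. The query value, SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o } }, is a hypothetical example and is shown percent encoded:

http://datastore.uia15926-11.com/project/15/?query=SELECT%20DISTINCT%20%3Fg%20WHERE%20%7B%20GRAPH%20%3Fg%20%7B%20%3Fs%20%3Fp%20%3Fo%20%7D%20%7D&project_id=15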

6.2 Workspace service

According to Criterion 13 there should be a User Interface where all users can perform their tasks. In Section 4.4 a split was made between end-users, who need ODRAC to produce knowledge data, and other users, who play certain roles in operating the ODRAC environment. For the first group of users a Workspace User Interface (WUI) is presented. It runs as a Web service, assumed to be available on http://workspace.uia15926-11.com, and it can be accessed with a Web browser like Chrome or Firefox. In Figure 6.1 the service is depicted outside the domain of the three companies, but there is no reason why the service could not run on the same machine as one of the data-stores. In the next chapter the graphical design is presented, and Chapter 8 explains the implementation details. From a functional viewpoint the following can be said about the WUI.

The WUI gives the end-user a workspace which runs in sessions. A session starts by logging in to a certain project, and ends either by logging out or by a time-out. Within a session a user can navigate through all knowledge data available in a project, and can produce data based on the library. Both for viewing and for composing knowledge data, different tools are offered inside the workspace.


Most of the Workspace operations are executed inside the JavaScript environment of the Web browser. The Workspace Web service serves an HTML page with JavaScript and CSS resources after a successful login, but for some operations the JavaScript environment communicates with the Java server model.

6.2.1 API

Just like for the data-store service, an open API is used for these communications, but in this case the API is not expected to be used by third parties. In the next subsection the design idea behind this API is explained. The interface is structured with the RESTful JSON approach of backbone.js1. The following interactions are supported.

Operations on instances

GET http://workspace.uia15926-11.com/api/individuals

This returns a JSON formatted list of all Individuals. In Section 9.4 this full-list approach is evaluated to be impractical for large data-store collections. Future work will be to make downloading the JSON content of Individuals modular.

GET,POST,DELETE http://workspace.uia15926-11.com/api/individual

This can be used to POST a JSON formatted Individual, or to GET or DELETE the Individual identified with the specified URI.

uri only required for GET and DELETE.

GET,POST,PUT,DELETE http://workspace.uia15926-11.com/api/relationship

Sending a JSON message with a POST or PUT flag will create or update a Relationship model respectively with new triples. The GET returns a JSON object with the full content of a RelationshipGraph and the DELETE destroys such a graph. This is possible because the graphs in the workspace memory are not yet synchronized with a data-store and are thus not published yet.

uri required for all operations except the POST for creating a new Relationship; in all other cases the Relationship already exists and its URI should be specified;

subject (deprecated) the URI the Relationship says something about.

1http://backbonejs.org/


Operations on bookmarks

GET http://workspace.uia15926-11.com/api/bookmarks

Returns a JSON formatted list of all Individuals currently bookmarked by the active user.

POST,DELETE http://workspace.uia15926-11.com/api/bookmark

Can be used to POST a JSON formatted list of Individual URIs that need to be bookmarked, or to remove single Individuals from the bookmarks list.

id only required when doing a DELETE; the parameter should point to a bookmark-id from the database (see Section 6.4).

Operations on attachments

POST http://workspace.uia15926-11.com/api/upload

When a multipart/form-data form consisting of a file name and its content is posted to this URI, the file is saved on the Workspace server and registered in the project database using a generated UUID. This UUID is returned as plain text as the result of the process.

GET http://workspace.uia15926-11.com/api/download

In order to download an attachment that was stored previously, this link can be used to open it as an application/octet-stream.

uuid the identifier of the file, not to be confused with the UUIDs contained in the URIs of Individual names.

Operations for client/server communication

GET,POST http://workspace.uia15926-11.com/api/sync

A GET request on this address makes the Workspace synchronize all Individuals in the waiting list (see Section 6.2.2). When a POST request is done with a submitted TriX-file, only the content of this file is sent to be synchronized.

trix only needed for a POST; it contains a URI formatted TriX-file that will be synchronized to the appropriate data-store.

GET http://workspace.uia15926-11.com/api/trix

Downloading an Individual as TriX results in an application/octet-stream content type stream retrieved from the appropriate data-store.

uri the URI identifying the Individual.

POST http://workspace.uia15926-11.com/api/sparql

The query submitted in the URI is redirected to the data-store. The result it gets from the data-store is formatted into an HTML table.

query a URI formatted SPARQL query, see also Subsection 6.1.1.

For the users operating ODRAC no clean interface is offered yet. For example the process described in Section 5.8.1 involves manually writing XML-files.

6.2.2 Client/server communication

The API between the JavaScript environment and the Workspace service suggests a close interaction between the two. Not only the operations involved with attachments and data-store communication, which depend on remote resources, are communicated over the API; also the instantiation of Individuals and Relationships and the request to synchronize data to the data-store involve API calls. This would not necessarily have to be the case: if the Workspace application would run solely as a JavaScript application, it could produce instances in the Web browser's memory only and transform them into TriX before sending them directly to a data-store.

In the design of ODRAC this step of building JavaScript TriX support and bypassing the use of Jena and NG4J is not taken. Instead the JavaScript environment is equipped with a synchronization mechanism (not to be confused with the synchronization of TriX-files to a data-store) between JavaScript and the Java server environment. In practice this means that all RDF data produced in the Workspace is represented both in the Web browser's memory and in the Java server session. When the Web browser is refreshed while keeping the HTTP session active, no data is lost. It also means that the Jena and NG4J extensions of the Graphset and GraphsetMap can be used at full power as the original containers of the RDF data constructed using the HTML interface. Building native JavaScript support would discard all the possibilities offered by these Java libraries.

The presence of Jena related Graphsets makes it easy to execute SPARQL queries, so for operations like retrieving meta-data from an Individual, a choice can be made to process the RDF graphs involved in an Individual inside the JavaScript memory (operating on the JSON representation of an Individual) or in the Jena environment


using SPARQL queries. The choice made here is to incorporate all detailed operations on the general models inside JavaScript backbone views. The three supported backbone models are relationship, individual and bookmark. Their content is synchronized between client and server, and both environments organize their own operations. Displaying meta-data is the job of a dedicated backbone view on the client side. Producing a TriX-file from a set of Graphsets and sending it to a data-store is a server job.

6.2.3 Workspace library usage

The position of collecting all instance RDF data during a Workspace session is reserved for the server memory, but the server is stripped from other burdens as much as possible. The most noteworthy is the absence of library data in the Workspace server memory. The use of SPARQL queries on the Graphsets is therefore limited: RelationshipTemplateGraphs needed to understand type definitions inside a ComplexDataGraph, or RelationshipTypeGraphs containing label names, are not present. As a result the Workspace server is not capable of validating a TriX-file of an in-memory Graphset.

Still, the Workspace leans heavily on the steering power of the library. This is accomplished by a JavaScript version of the library that is loaded as a compressed JSON file. With a size of less than 400 kB in the Integraal Samenwerken application there is no need for zip-like compression, but the size was minimized by omitting often used values. In Section 8.4.2 the structure of this cached version of the library is explained. To query its content no SPARQL can be used; instead a number of JavaScript functions is used to search through the cached version.

6.3 Workflow

Now that the data format and the services involved have been described, the process of data creation can be explained in time. In this section the data flow is followed from creation in the Workspace service, to saving it in a Data-store, and to requesting it back again in the Workspace service.

6.3.1 The ontology drive

Data generation in the ODRAC platform functions as a heavy form of top-down ontology population (see Section 2.2.3). Individuals and Relationships can only exist if there is a type definition and a relationship template, so typically the library has to be complete before data instantiation can begin. In later phases a library item might be updated or added. Updating fields inside Individual or Relationship type definitions has only limited consequences, because the graph name has to remain identical and the inheritance is unchangeable as long as any instance of the type or of any of its children exists. The only generally acceptable change visible inside the WUI would be changing the Relationship's preferred and reversed labels.

Updating fields inside an RTG is also restricted if Relationships are derived from this template. The domain and range may be changed to one of their predecessors in the hierarchy. Furthermore the cardinality may be widened and default values may be

changed. If no Relationship instances are based on an RTG, it can be changed or removed completely.

Changing the library is not a process that can be done as dynamically as adding instances. All editors need to be closed down with no running sessions, and after a library update the editors need to get the new version of the library before new operation sessions may be started again.

6.3.2 Workspace

During a workspace session, or a session in any other editor application, instance data is constructed in steps. An Individual can be instantiated that has a number of required properties. Once the WUI is used to do so, the required properties are automatically generated with Placeholder elements. For example a Name property instance inside a ComplexDataGraph is automatically generated together with any Individual instance. This blank Name instance has a rel:has_value relationship to uia:Placeholder. This RDF element functions as an indicator that this value has to be supplied before the graph containing it is ready to be sent to a Data-store. Obligatory Relationships are automatically generated pointing to a Placeholder. Generally speaking the Workspace is a workbench where incomplete data can exist before publishing it.

In some situations there might exist a mutual dependency between two types of Individuals that require a Relationship to each other. Instances of both can be made in any order, but they can only be published to a Data-store together. Next to making sets of new Individuals that share new relations, sets can be constructed of new relations between existing Individuals, or of an updated version of an existing ComplexDataGraph. All active content of the data-stores involved in a project is available inside the workspace, and new data depending on this old data may be published as well.

When the workspace is asked to publish its newly created data, it checks whether all Placeholders have been replaced by real data. Also meta-data is added describing the creator and creation date of the set of graphs. After this stage the Graphsets can be encoded in a TriX-file that is correct as far as the Workspace is concerned.

6.3.3 Synchronization

On arrival at a Data-store the content of a TriX-file is first verified, and if no problems are found the content is added to the data-set. Unsupported triples in graphs are allowed, as long as they do not cross the ODRAC mechanisms. One restriction on the content of TriX-files is that graphs may not replace graphs inside the same TriX-file; there is no reason to do that, because only the most recent graph would suffice.

The names of the graphs, containing UUIDs, were randomly generated by the editor and are most likely unique. Yet as part of the verification process the uniqueness of the graph names is verified against all data-sets on the specific data-store. Next the following steps are taken to verify the TriX content. Each graph should have creator and created meta-data, and the creator should point to an existing Person. Other requirements depend on the type of graph. If a graph is added that is none of the supported types, but does adhere to the previously stated requirements, it is added. Third party


extensions might require this, but it is important that these graphs do not interfere with ODRAC principles in any way.

If a graph is an IndividualGraph its content is verified: no content apart from the obligatory meta-data is allowed. Secondly the type definition is tested. Thirdly all obligatory relationships are checked to be present. If a graph is a RelationshipGraph (either complex or not), a more elaborate check is needed. Apart from the presence of all Individuals referred to from the RG, the relation should be compared against the relationship template it is based on. For ComplexDataGraphs all relationships inside the graph need to be validated separately. A special case is when an action is required from another user (Section 5.6), because in that case some unset values are allowed.

Adding claims poses no consistency problems as long as the library remains consistent. Replacing claims, on the other hand, is more complicated. A submitted graph that is labeled to replace an existing graph should not have been replaced by another user in the mean time. That would mean that the first user is proposing a new value for a value he has not seen yet; his request should be rejected, and he should first check whether his update is still needed given the value the other user proposed before him. This approach is a key design choice in dealing with concurrency.

6.3.4 Concurrency A data-store has a TDB data-set in memory for each project. For incoming requests the Tomcat framework initializes multiple threads, but in order to safeguard the consistency of the data-sets only one request can be handled at a time. If for example two incoming TriX-files contain new values for identical graphs, the process of adding these should not run in parallel. This would result in random acceptance of some of the new values and rejection of others. The content of one TriX-file most likely contains internal dependencies as well, which makes partial acceptance of its content unacceptable. Also for implementation reasons a data-set is not allowed to change while some other process is iterating over it. For these reasons, for each of the three operations available over the data-store API a lock is granted on the appropriate data-set for the duration of the full operation. If the request is the synchronization of a TriX-file, the content is first fully evaluated, and only if each replacement is based on still active values is it merged into the data-set. If at least one of the assumed latest versions is outdated the whole TriX-file is denied with a detailed error message.
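The combination of the lock and the full evaluation can be sketched as follows. The method would live on the per-project server object; the helper methods and the exception are assumptions, only the two-pass all-or-nothing behavior follows the text.

// Synchronized, so only one request can mutate the data-set at a time.
public synchronized void synchronize(NamedGraphSet submission) throws StaleUpdateException {
    // Pass 1: every graph that replaces another must be based on the version
    // that is still active in the data-set.
    for (Iterator it = submission.listGraphs(); it.hasNext(); ) {
        NamedGraph graph = (NamedGraph) it.next();
        Node replaced = replacedGraphName(graph); // null if the graph replaces nothing
        if (replaced != null && !isActiveVersion(replaced)) {
            throw new StaleUpdateException("replacement based on outdated graph: " + replaced);
        }
    }
    // Pass 2: only now is the complete TriX content merged into the data-set.
    for (Iterator it = submission.listGraphs(); it.hasNext(); ) {
        mergeGraph((NamedGraph) it.next());
    }
}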

6.3.5 TriX-file population If no concurrency problem occurred and no library inconsistency between an editor and the data-store revealed itself, a successful synchronization results in an updated data-set as soon as the lock of the submission operation is released. From this moment the data-store can be asked to reveal new information about updated Individuals. As described in the data-store API section (6.1.1), the two mechanisms to get information out of the data-store are SPARQL queries and requesting TriX-files describing Individuals or Relationship history. Because requesting a TriX-file involves giving a graph name, the only way the Workspace can know of the existence of certain


Individuals is by building a list of them all. This is done in the category tree structure also displayed in the objects navigator (see Section 7.6). If at some point more information is needed in the Workspace about an Individual, a TriX-file is requested. Such a description of an Individual contains the IndividualGraph itself, all the RelationshipGraphs that talk about it, and all the graphs containing meta-data on one of the contained graphs. If relationship graphs or meta-data graphs refer to other Individuals, these are not included in the TriX-file. When needed they can be requested in their own right. This description closes the circle of instance data creation and retrieval. As described earlier it is not possible to communicate library graphs in this mechanism. Because of the special way the Workspace caches the library there has been no need to communicate library graphs in TriX-files, but for future development this could be supported.

6.4 Project configuration

Both the Data-store service and the Workspace service require a project configuration. A MySQL database is used to store the minimal set of data needed to operate a project. A database diagram is given in Figure 6.2. A service can host any number of projects, and all those projects need to be listed in the projects table. Each project builds on a set of dictionaries, library sub-sets, that together form the project specific library. In the current implementation these dictionaries are contained in TriX source files. Removing a file from the source file folder triggers a rebuild of the project data-set when a data-store is rebooted. Such a rebuild is needed at the start of a project and consists of loading all the data-set source files that are coupled to a project in the database into a TDB store. Apart from library management the database contains users and their authorization data. The database also keeps track of which user is logged in and which users have administrator rights. This special category of users is in principle allowed to do project configuration and change domain expert data in the library. Which data-store to use for which user on which project is stored in a junction table. Attachments and bookmarks are also saved in the database. Attachments are represented in the RDF data-set too, but some information is kept in the database to enable more low-level file management. Bookmarks are considered an application specific overlay that has no existence inside the knowledge base. More application specific properties, like the path to save TDB stores under, are saved in a key-value table; see the sketch below.
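As an illustration of this key-value mechanism, reading one application specific property could look as follows. The table and column names are guesses based on the description above, not the actual schema from Figure 6.2.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Read one entry from the key-value table (schema names hypothetical).
static String readProperty(String dbUrl, String user, String pass, String key)
        throws java.sql.SQLException {
    Connection con = DriverManager.getConnection(dbUrl, user, pass);
    try {
        PreparedStatement st = con.prepareStatement(
                "SELECT value FROM properties WHERE name = ?");
        st.setString(1, key);
        ResultSet rs = st.executeQuery();
        return rs.next() ? rs.getString(1) : null;
    } finally {
        con.close();
    }
}

A call like readProperty(url, user, pass, "tdb.store.path") would then yield the path under which the TDB stores are saved.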

6.5 Mappers

A Mapper can be any application that transforms data to the right format in order to send it to a Data-store. In Figure 6.1 Company C uses an ERP (Enterprise Resource Planning) system that contains data that should be injected into the UIA project. This flow from a private system to the shared Data-store and back again represents the flow envisioned by the ISO (Section 3.2) and described in the work of [74] (Section 3.1.1).


Figure 6.2: Database diagram containing project configuration information.

Importing data from a Data-store back into a private system would require a mapper just as much, but the need for this direction is not expected in the near future. During the development of the ODRAC platform work was done on two mappers. One was meant as a general configurable mapper that could be applied to any SQL database or file format representing tables (i.e. some uses of XML). It was done as work on Project 6 of Integraal Samenwerken. This mapper produces Individuals per row, based on table column names. A set of transform operations can be specified, for example to merge the values of some columns before producing an Individual. The creation of duplicate Individuals is prevented, and the result can either be one big TriX-file or an automated synchronization with a Data-store. A second mapper was needed from the start of the UIA application. It was a special kind of mapper, because it would not create Individuals, but was needed to build the library. In earlier phases of Project 8 a full library with neutral and company specific concepts was built in Relatics (see Section 2.2.7 for a short description of this application). The content of this repository could be exported to 41 XML-reports, most of which contained valuable information to build the UIA library. A lot of work went into building a DOM parser that interpreted these reports in the right way. Valuable lessons learned from this process involve a road map for importers in general and some experience in choosing URIs for library graphs. Graphs involved in Individuals get randomly generated UUIDs as part of their URI, but for some library items, like RelationshipTemplateGraphs or IndividualTypeGraphs, it is important that they retain the same name on each new import. Another issue is how to deal with changes to the source library. The import is split into a number of separate collections, but if something changes in one set the whole import is repeated. For some graphs a static

identifier from the source repository is used, but for RTGs a hash number is built from some fields inside the graph. If these change, the URI of the graph changes, but conversely a graph with identical fields gets the same name. This is used deliberately to prevent double versions of graphs considered identical (see the sketch below). The steps involved in making a mapper are the following. First it should be clear which information should be produced by the mapper, and methods to create the required graph types should be in place. Secondly the structure of the source repository should be investigated. Problems that might occur are that the order in which items are listed is counterproductive with respect to dependencies, that complicated aggregation of information fragments is needed, or that too much memory or processor time is required. A problematic phase is the creation of inheritance relations to all the super types of the IndividualTypeGraphs (see Section 5.8.2), because the whole subClassOf chain should be complete before this procedure is successful. In practice this is not guaranteed, so the phase of creating all ITGs should be split from the phase of creating inheritance relations. The third step is choosing a right way to build the URI of each graph, and the final step is running the mapper and checking the result. It is good practice to print the results of assumption checks to a run log. An example is the creation of temporary coupling identifiers that are removed when they are picked up by a dependent import. Some remained after an import, and manual verification proved those loose ends were indeed loose ends in the source repository.
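The stable naming scheme for RTGs can be illustrated with a small sketch. The identifying fields and the base URI are assumptions; the point is only that identical fields always yield the same graph URI, while any change in the fields yields a new one.

import java.security.MessageDigest;

// Build a deterministic URI for a RelationshipTemplateGraph (sketch).
static String templateGraphUri(String... fields) throws Exception {
    MessageDigest md = MessageDigest.getInstance("MD5");
    for (String field : fields) {
        md.update(field.getBytes("UTF-8"));
        md.update((byte) '|'); // separator keeps field boundaries unambiguous
    }
    StringBuilder uri = new StringBuilder("http://example.org/lib/rtg/"); // base hypothetical
    for (byte b : md.digest()) {
        uri.append(String.format("%02x", b));
    }
    return uri.toString();
}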

6.6 Conclusion

The architecture presented in this chapter is capable of mapping data from any source application to the neutral language described in the previous chapter, provided that instances have single class membership. The design thus fulfills design Criterion 1. The role of a Mapper, envisioned by Criterion 6, was further described in Section 6.5. The support of TriX-files over the API satisfies Criterion 2. The composition of services that operate the project channel described in Criterion 7 is described in this chapter, as well as the way the required inconsistency reconcilement (Criterion 8) is performed. The requirement to allow multiple data-stores on one channel (Criterion 9) is part of the design, though this is not part of the implementation. In the following chapter the User Interface required by Criterion 13 will be explained further, but the possibility to interact with the project channel with a company specific application (required by Criterion 14) was described extensively.


Chapter 7

User Interface

In this chapter the User Interface of the Workspace service (WUI) is presented. First the function of the User Interface is described. In Sections 7.2 to 7.6 the main graphical elements are explained, and the two Sections 7.7 and 7.8 describe some special functionalities. The chapter finishes with a conclusion section.

7.1 The WUI within the platform

The Workspace service is explicitly designed as an optional editor (Criterion 14) that can be used to interact with an ODRAC project channel. Users of the platform are free to design any other application to view, create or communicate data for an ODRAC project. See Figure 6.1. The WUI gives a graphical view on the data-set. The view is offered within a session between a log-in and a log-out. On the log-in page a project should be selected, and next to the user credentials a company specific dictionary can be selected. By default the dictionary is set to neutral, but if another option is picked the company specific class names from Criterion 3 are displayed and the neutral terms are printed between square brackets. The environment that is opened is called the Workspace (Section 7.2). Within one session data can be viewed, edited and newly created, and at any time the proceedings can be published to the data-set. This is called synchronization. If a session is ended without synchronizing, the data created after the most recent synchronization is lost. The WUI only supports the creation of consistent data, so it cannot be used to create invalid TriX-files. An important requirement for the WUI is easy extendability. The visualization possibilities within a Web browser are comprehensive and the presented version of the WUI contains only basic graphical representations, which could easily be extended. Future ideas include drag-and-drop declaration of Relationships between Individuals, visualization of sets of Individuals on a time scale or more sophisticated zooming functionality.


Figure 7.1: The Workspace window.

7.2 Workspace

The Workspace represents a session in which a user can view, edit and create data elements. To get access to the Workspace the user needs a JavaScript enabled Web browser. For the figures and benchmark presented in this thesis Google's Chrome is used. When the URL that points to the Workspace service is entered, the user is asked to supply credentials and to choose a project and a company dictionary. When according to the project configuration (Section 6.4) the user has access rights to the project, an HTTP session is started and the Workspace screen is displayed.

In Figure 7.1 the main screen layout is shown. In the header the name of the current user is printed, together with the selected project. The workspace menu is opened in the top right corner, containing workspace actions like log-out. The whole left window consists of two docking areas for navigators. The width of the docking area can be adjusted, as well as the vertical position of the divider. On the bottom of the Workspace a footer bar is placed containing the synchronization button. Clicking this initiates a synchronization of all session work up to that moment. In Figure 7.1 an informative pop-up is displayed. All graphical entities apart from the header, footer and navigators are contained in a pop-up and can be dragged around or closed. Dragging a pop-up outside the window to the left or downward will expand the workspace. The zooming functionality of the Web browser (usually Ctrl + scroll wheel) can be used to get an overview if many pop-ups are opened.


7.2.1 Flow The main workflow of (1) logging into a session, (2) interacting with the data-set and finally (3) synchronizing the work is represented in a graphical flow from top (where the session is represented by the header) to bottom (where the sync button is positioned). Steps 2 and 3 might be repeated, each time switching attention from the center of the screen to the bottom. During step 2 the flow of navigating through the data-set, interacting with Individuals and seeing the result reflected back in the navigators works from left to right and back again. The design is based on a number of other applications that might be familiar to the users. The header with the workspace menu is more or less similar to those in Gmail, Facebook and Relatics, and the same is true for the navigational column on the left side of the screen. Compared to Relatics the ODRAC design aims to overcome the steep learning curve and weak sense of orientation discussed in Section 2.2.7. The Workspace is never directed to a certain location and always remains a static frame. Items are always opened in a dedicated container and activating such a container can be linked to an item in a navigator. Two navigators can be opened at the same time.

7.3 Browser

The most comprehensive pop-up is the Browser (not to be confused with the Web browser hosting the Workspace). This is a graphical unit representing an Individual. Browsing through the knowledge base network is visualized with this element, which inspired its name. Figure 7.2 demonstrates the content of a Browser. The title bar contains the name of the Individual and can be used to drag the Browser around. The body of the pop-up consists of three panels. The m-button and the h-button in the title bar can be used to switch the meta-data panel (middle) and the history panel (right) on and off. The minus-button will close the Browser altogether. It can be opened again without loss of data.

Main panel The left panel consists of rows that can be activated by clicking them. The rows are grouped in three sets, starting with the type definition row. The second is the set of Properties, the set of complex data claims (in CDGs from Section 5.2.3), at least containing the name definition. Third comes the set of Relations the Individual has to other Individuals. Based on the type of the Individual, new properties and relationships can be defined using the two add buttons at the bottom of each section. Both actions lead to new edges and nodes in the knowledge base graph. For relationships from this Individual to others an outgoing relation icon is used, and in the same list incoming relationships are displayed by a similar incoming icon. Following this icon the name of the relation (predicate) is given and the type of the Individual it points to (object). Properties are always an endpoint in the total knowledge graph, so they get an icon representing a contained relationship.


Figure 7.2: The Browser pop-up with closed meta-data panel and history panel.

Figure 7.3: Opened meta-data panel and history panel for name property.


Meta-data panel Meta-data claims can be made about the Individual directly or about its rows separately. If a row in the main panel is selected, the meta-data panel (if opened) will display rows of meta-data claims about the selected row in the main panel. By default the graph id, creator and creation date are displayed, because they are automatically generated for all (publishable) graphs. If the top row, containing the Individual's type definition, is clicked, the meta-data panel will reveal meta-data of the Individual itself. There is no structural difference between data and meta-data; both are composed of the same graph types. Rows in the meta-data panel are similar to the rows in the main panel. Because the first three meta-data rows are always set, their labels are hard-coded. Other types of meta-data can be added by clicking the add-button and are displayed with a relationship name and a box containing the object. Where properties were separated from relationships in the main panel, in the meta-data panel no separation is made between the two structural forms.

History panel Inside the meta-data panel a new panel can be opened containing the change history of the selected row. Changes to the meta-data of this data element are also summarized in this log.

7.3.1 Flow A Browser is opened by clicking an Individual in a Navigator. When an Individual is right-clicked a context menu opens (see Figure 7.4.d for the similar pop-up). Existing Individuals can be cloned by clicking Duplicate. As a result the new duplicated Individual is opened in a Browser pop-up. The approach to create empty Individuals is to right-click an Individual type and choose New individual. Properties and Relations are added by clicking the add property or add relation row. If any relationship template from the library describes the possibility of a relationship, a selection box is displayed. For properties the user has to select the type of property he would like to instantiate. For Relations to other Individuals the selection box is filled with the combination of predicates and the type of the Individual, for example creates Document Types in Figure 7.2. In both cases the user should confirm the selection, and as a result a row is added. For Properties the predicate name is hidden and the type of the Property is given (e.g. Name), followed by a valuebox. For Relations again the predicate and type combination is given, also followed by a Valuebox.

7.4 Valuebox

A Valuebox is a small representation of an Individual, just like a Browser is a big representation. Both are shaped as a rectangle with rounded corners, with an icon and a name. In Figure 7.4 four valueboxes are displayed. The first two (a and b) represent a Literal value inside a property. As described in Section 5.2.2 a Literal value is


Figure 7.4: The valuebox for a Literal value (a) and (b) or an Individual (c) and (d).

always instantiated in a ComplexDataGraph (a Property). The second two valueboxes (c and d) represent an Individual. When making a new Property or Relation, one edit icon is displayed on the right-hand side of the valuebox. Clicking it activates the edit mode, displayed in (a) and (c). For Literal values (a) a text or date can be entered. For Individuals a selection box is given (c) listing all the active instances of the type in case. Once a value is confirmed a second icon is displayed inside the valuebox (b). Pressing it reverts the new value and shows the value as it was before the Workspace session started. As soon as the new value is synchronized to the data store, the revert button is removed. This is because synchronized data can never be deleted. The edit button remains active. The valuebox representing an Individual can be right-clicked. The resulting context menu is the same as is displayed when an Individual is clicked inside a Navigator. The representations of Individuals in a Navigator and inside a Browser are very similar, but inside a Browser editing options are offered. When the View option is picked from the context menu, the Individual is opened inside a new Browser. The function of the other options is explained in Section 7.6.

7.5 Feedback on submitted values

There are two moments of feedback on the submission of data during the synchronization. The first feedback comes from the WUI, which evaluates all data types and required fields. When something is wrong or missing a black label is displayed, as in Figure 7.5. When the data is correct as far as the WUI can check, the TriX-file is sent to the data-store. There it is validated again, and if something is wrong in this second phase a detailed error message is printed in a JavaScript pop-up. This message comes directly from the data-store in a JSON format and is only converted to a readable log.


Figure 7.5: Feedback on a wrong value for a complex property.

7.6 Navigator

In Figure 7.2 two Navigators are activated, objects and object relations. The headers of two others are visible. A Navigator can be activated by clicking its header. The header can also be dragged from one of the two containers to the other in order to organize the workspace to best fit the user's actions. This mechanism, together with the pop-up presentation of browsers, is presented as a better navigable interface than the Relatics interface discussed in Section 2.2.7. The set of Navigators can be extended rather easily. In Section 8.4.5 this development procedure is explained. Adding Navigators is part of the domain configuration of the ODRAC platform. For example a location Navigator can be added following all located_at relations, starting from some predefined highest level location. One of the choices to make in forming such a Navigator is which icon to pick. The four default navigators have abstract icons, representing an object as a Lego block. The bookmark icon is very common, and because the request Navigator contains Individuals that require an action of the active user, its icon is one commonly used for actions that need reminding. Each Navigator consists of one root element, represented by the icon in the Navigator header. From this a tree structure springs, consisting of any type of folders and items. The objects Navigator consists of a hierarchy of all Individual types that may be instantiated by the active user. If the types contain subtypes or have Individual instances in the project, these are shown as folders containing them. Right-clicking a type gives the option to make a new Instance. Right-clicking an Individual opens the same context menu as depicted in Figure 7.4.d. Three of those options involve navigator actions. The bookmarks Navigator contains a flat list of Individuals marked with a bookmark. To put an Individual in this list or remove it from it, the Bookmark and Remove bookmark(s) options can simply be used. Section 6.4 discusses how this list is maintained. The content is retained after a log-out. When the View in object relations option is picked, the Individual is represented as the first item in the object relations Navigator. As child items all the Individuals are shown to which it has Relationships. Note that the is part of relation in the screenshot is an incoming relation, so that one is not included in the Navigator. Because the type of the Individuals in the object relations Navigator is not clear from the context (as it is in the objects Navigator), the type is given between brackets in the Individual label. All relations each second level item has are displayed again as children of those Individuals. Thus the full hierarchy is displayed. When a cycle is detected, i.e. when an Individual is about to be added while it is already contained somewhere in the tree, it is discarded.


7.7 Importing and exporting TriX-files

As depicted in Figure 7.4, the right-click menu that is available from clicking a valuebox in a Navigator or Browser gives the option of downloading a TriX-file. As explained in Section 6.1.1 the Data-store service produces TriX-files describing Individuals. The WUI can be used as a navigator through a project to select an Individual and then download its TriX description. Clicking the Import TRIX as text button from the workspace menu (Figure 7.1) gives the reverse option. A pop-up is given with a text box in which a TriX-file can be uploaded. Such a TriX-file could be the result of a mapper, though a more likely route for a mapper is to send it directly to the Data-store. The Workspace directly sends the file to the Data-store using the same mechanism with which it sends a TriX-file when it performs a synchronization. If the upload was successful the Workspace is updated and displays the new Individuals or Relationships that were encoded in the TriX-file.

7.8 Manual querying

In the workspace menu two options are given to enter a SPARQL query. The query and construct options can be used to interact directly with the data-set. Clicking one of the two options opens a pop-up where plain text can be entered and sent. The result of a query is presented in another pop-up containing a table. The function is executed on the single RDF Data-store related to the project. An extra option offered in the live version of the platform is executing some predefined queries and downloading the result as an Excel-file.

7.9 Conclusion

The requirement worked out in this chapter is Criterion 13, which demanded a User Interface for end-users to perform all their operations. As we have tried to demonstrate, all steps described in the workflow of the previous chapter (6.3) are supported in the described interface. In Section 1.3.3 the approach was announced to build a User Interface that would visually represent the RDF data structure in a way that revealed its structure. In Section 2.2.4 the only argument to do so found in literature was that these structure-representing User Interfaces fill an important niche. At the stage the integration solution is in, it is wise to provide the interface for technically well-educated pioneering users. For that reason the Browser inside the user interface is required to represent the triple structure to such an extent that a user who looks at a produced TriX-file should see the analogy with the Browser representation. For example the rows in the Browser correspond to RelationshipGraphs. Figures E.1 and E.2 do not represent an Individual, but their visualization approach of a Named Graph with its contained triples gives some impression of how Browsers could have been designed. Because Individuals involve more than one graph they are one step more abstract, but still there is some analogy between Figures 5.4 and 7.5 and between Figure 5.6 and the composition of the Browser window (Figures 7.2 and 7.3).

Chapter 8

Implementation

In this chapter some implementation details are given. The main argument built here is that it is easy to understand the code base with the background knowledge from this report. The clear class structure improves the maintainability and extendability of the platform. First the package structure is explained, and in the subsequent section the most important design patterns are described. In Section 8.3 the choice for the storage back-end is explained. The next section describes the use of JavaScript in the WUI, including some more information about the interaction between the client and server of the Workspace. In the final section the main conclusions of this chapter are drawn.

8.1 Java package structure

Both the Data-store and the Workspace run as a Tomcat Web service. Tomcat deals with HTTP sessions and accepting Web requests. A Java class structure is used for executing these operations. For the Data-store and the Workspace a dedicated Java package is made, and a third Util package contains the code that is used by both. This package is compiled to a jar-file that is present in the Tomcat lib-folder of both the Data-store and the Workspace. All three packages contain a Settings class with configuration data to connect to the database from Section 6.4. See Figure 8.1; in the following subsections the content of these three packages is described.

8.1.1 Data-store

The Data-store runs a thread containing one ServerThread instance, which loads a ProjectService instance for each project hosted on the server (see Section 8.2.1). JSP-files direct each HTTP request to the correct ProjectService. Because the Data-store is the only place where the active library and data-set are fully available as RDF data, the jsexporter that generates a JavaScript cache version of the library runs here. Also the xmlimporter that was discussed in Section 6.5 runs on the Data-store, because it generates library TriX-files that are needed to build or rebuild project data-sets. The DatasetFactory is used to initialize new TDB data-sets or reopen existing project stores. The TrixVerifier is used to validate the content of a TriX-file in the context of a data-set containing a set of library graphs. The randomfill package contains a class to create stub Individual instances in order to perform load tests (see Section 9.4).


Figure 8.1: Package structure of data-store, workspace and util and their classes.

An important design decision was to run the Data-store as a Tomcat service rather than as a stand-alone Java application accepting incoming Web requests. The open-source Fuseki RDF data server could have been used as an example of such a design. The performance difference has been estimated as negligible, given the almost trivial difference between accepting HTTP requests through Tomcat or through a copied Fuseki HTTP interface. Both approaches use the same TDB storing mechanism and both would execute the same processing operations. Because of the possibility to easily serve HTML pages for debugging operations, the choice was made to run the Data-store with the same mechanism as the Workspace. Both Tomcat services are run as Windows services on a test server and can thus be managed using the same interface.

8.1.2 Workspace

The Workspace service uses the Tomcat session registry to store a Workspace instance dedicated to a user's view on a project. See Section 8.2.1 for more details. This Workspace instance ultimately communicates with one or more Data-stores, and for each of them a DatastoreWrapper instance is used to take care of this communication. The IndividualNode and IndividualNodeFactory belong to the same group of objects organized in the util.model package (next subsection), but their behavior only has meaning in the Workspace context. The class with its factory represents the JSON object sent in a list to the JavaScript cache to indicate that an Individual with this id, name and parent exists. It is a stub object for the util.model.json.Individual. The AuthorizationManager is a class that interacts with the project configuration database to verify login credentials and with the transaction model (Section 5.6) configuration mechanism to verify which actions a user is allowed to perform.


The InformIndividualNodeFactory and RequestedFromIndividualNodeFactory are used to retrieve a special selection of IndividualNode (stub) instances. These are used for display in Navigators, but also for the generation of Excel-reports. The classes in the requests package are involved in this extra feature.

8.1.3 Util

The util package contains a number of subpackages used throughout the ODRAC platform. For example the ng4j package contains an extension of the NG4J library (which is a Jena extension) with the Graphset element and an extended version of the NamedGraphImpl and NamedGraphSetImpl (see Section 8.2.4). The vocabulary package consists of class-versions of used RDF vocabularies that were not already defined in Jena. util also contains a number of helper classes for frequently used operations: util.jsp, util.mysql and util.xml. The package util.sparql is more than a set of helper classes. It is a layer on top of Jena to ease the construction of SPARQL operations in a number of ways (see Section 8.2.2), and the interpretation of the results. Query results are transformed to a Table instance which supports a number of extra operations. The package also contains a set of predefined SPARQL queries in Mustache template format (see Section 8.2.3). Finally the util.model package consists of objects that are stored in the MySQL database (Attachment, Project and User) and a set of objects used in the communication between the Tomcat session Workspace instance and the JavaScript environment. These are contained in the util.model.json package. The reason those are not put in the workspace package (like IndividualNode and IndividualNodeFactory) is that the NG4J extension depends on them (see Section 8.2.4).

8.2 Java design patterns

8.2.1 Tomcat thread use In Section 6.3.4 some details were given about the lock that is applied on a TDB store during a Data-store operation. In Figure 8.2 the full mechanism is visualized. Each Data-store runs as a ServerThread that is configured using a ServerContextListener to live as long as the Tomcat service itself. Upon start-up the ServerThread instantiates a ProjectServer for each project that is registered to run on that Data-store. Each ProjectServer has two TDB stores. Both contain the full library, but only the TDB store contains all Individuals. The TDB sandbox is used to unpack TriX-files, validate them and save them in the TDB store. The ServerThread is instantiated as a singleton, so within one Java Virtual Machine only one instance of this class is allowed. When a data retrieval, submit or query request is received (see Section 6.1.1) an HTTP request thread is activated by Tomcat. This thread will access the singleton and ask for a connection to one of its data-stores. If a request is received soon after a service start-up, the TDB store might not be initialized yet by the DatasetFactory. During the data-store initialization the HTTP request thread is paused using the wait() function. When the initialization is done the waiting threads are sent a message by the notify() call.


Figure 8.2: The use of threads within the Data-store service. The left route represents data retrieval, the right route is how the content of a submitted TriX-file travels first.

Figure 8.3: The use of threads within the Workspace service.

The operations on a ProjectServer that require a lock are synchronized methods, so the JVM automatically manages the threads requesting a lock.
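A minimal sketch of this start-up guard is given below. Only the class names ServerThread and ProjectServer come from the design; the fields and method names are assumptions, and a while-loop with notifyAll() is used as the safe variant of the wait()/notify() pair described above.

import java.util.HashMap;
import java.util.Map;

public class ServerThread {
    private static final ServerThread INSTANCE = new ServerThread();
    private final Map<String, ProjectServer> projects = new HashMap<String, ProjectServer>();
    private boolean initialized = false;

    public static ServerThread getInstance() { return INSTANCE; }

    // Called by HTTP request threads; blocks until start-up has finished.
    public synchronized ProjectServer getProjectServer(String project)
            throws InterruptedException {
        while (!initialized) {
            wait(); // request arrived before the DatasetFactory finished
        }
        return projects.get(project);
    }

    // Called once by the initialization code after all TDB stores are opened.
    public synchronized void initializationDone() {
        initialized = true;
        notifyAll(); // wake all paused request threads
    }
}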

A similar description can be given of the Tomcat Workspace service. As the first interaction with the server after login, the user downloads an HTML and JavaScript environment to his Web browser. Tomcat starts an HTTP session for this transaction and after login a Workspace instance is registered inside the session registry. After initialization of the JavaScript environment the local Web browser starts to interact with this Workspace instance using AJAX requests. Similar to normal Web requests, an HTTP request thread is dedicated to such an operation. Using the session variable available inside JSP-files, the appropriate Workspace instance is retrieved from the session registry. The operation has some similarities with the singleton pattern.

Workspace workspace = Workspace.getInstance(session);


8.2.2 The Viewer and Modifier paradigm

Within the ODRAC code base there are roughly two approaches to interact with the RDF data. The first one developed can be called the Viewer and Modifier paradigm. It employs the SPARQL query language as a bridge between RDF models and the programming environment. The Viewer class can be instantiated with a SPARQL query and a data-set, either a TDB store or an in-memory store, and as a result a Table instance is returned. During the pioneering phase this approach was useful because the assumptions on the RDF patterns were separated from the code base inside SPARQL queries (see also the next subsection). A Viewer can be instantiated at any location in the code where input from an RDF data-set is needed, as long as a pointer to such a set is available. As an alternative to specifying a SPARQL query, a single quad pattern can be defined in a special constructor using Jena RDFNodes. If for such an ad-hoc pattern a resource from the primer ontology is needed, the static *.vocabulary.UIA class can be used, which contains all the primer RDF resources. When data in an RDF data-set has to be added or modified, the SPARQL update language can be used. To represent such an operation the Modifier class is used. Four different types are available (ClearModifier, DeleteModifier, InsertModifier and UpdateModifier) and for performance reasons a BatchInsertModifier can be used for large operations in Library construction procedures or in Mappers. The Viewers and Modifiers are very expressive and can be used to bridge the gap between the Java class structure and the RDF data-set. There is for example nowhere a PhysicalObject representation in the code. As far as the use of Viewers and Modifiers is concerned, there is not even a class representing an Individual. The second approach to interact with RDF data does use a class representation of abstract data entities, but still on a very low level. It is the Graphset extension (see Subsection 8.2.4). When the design of the graphs had crystallized, this approach delivered a more integrated Java control over RDF data, but for more exclusive data interactions the Viewers and Modifiers are still best suited. For example the InformIndividualNodeFactory and RequestedFromIndividualNodeFactory from the workspace package use Viewers.
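A short usage sketch of this paradigm is given below. Viewer, Table and InsertModifier are the class names from the text, but the constructor and method signatures shown here are assumptions made for illustration.

// Read: wrap a SPARQL query and a data-set, get a Table back (signatures assumed).
Viewer viewer = new Viewer(
        "SELECT ?name WHERE { GRAPH ?g { ?ind uia:hasName ?name } }", dataset);
Table names = viewer.getTable();

// Write: express the change as a SPARQL update and run it through a Modifier.
InsertModifier insert = new InsertModifier(
        "INSERT DATA { GRAPH <urn:uuid:...> { <urn:uuid:...> uia:hasName \"Pump 7\" } }");
insert.execute(dataset);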

8.2.3 Mustache templates for SPARQL queries In order to separate SPARQL queries from the source code they are saved in *.spq files inside the util.sparql.queries package. To set values inside such a predefined query, the template engine Mustache is used to process the template files. Below a fragment is given from the related graphs query. At two places the URI is inserted using triple curly brackets. The Mustache engine was used for its simplicity and broad availability in different programming languages. A JavaScript version of the same engine is used for HTML fragments in the User Interface (see also Subsection 8.4.4).

PREFIX rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX uia:  <...>

SELECT ?rel_graph ?rel_meta_rel_graph
WHERE {
  {
    GRAPH ?rel_graph {
      ?subj ?rel {{{uri}}} .
      FILTER(?rel != uia:creator) .
      FILTER(?rel != uia:forCompany) .
      FILTER(?rel != uia:actionRequiredFrom) .
    }
  }
  UNION
  {
    GRAPH ?rel_graph {
      {{{uri}}} ?rel ?obj .
    }
  }
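One way these *.spq templates could be populated is shown below, assuming the mustache.java library; the thesis does not name the specific Java Mustache implementation it uses.

import com.github.mustachejava.DefaultMustacheFactory;
import com.github.mustachejava.Mustache;
import com.github.mustachejava.MustacheFactory;
import java.io.IOException;
import java.io.StringWriter;
import java.util.HashMap;
import java.util.Map;

// Fill the {{{uri}}} slots of a predefined query template (sketch).
static String buildQuery(String templateName, String uri) throws IOException {
    MustacheFactory factory = new DefaultMustacheFactory();
    Mustache template = factory.compile("util/sparql/queries/" + templateName + ".spq");
    Map<String, Object> scope = new HashMap<String, Object>();
    scope.put("uri", uri); // triple curly brackets insert the value unescaped
    StringWriter out = new StringWriter();
    template.execute(out, scope).flush();
    return out.toString();
}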

Figure 8.4: The bottom four classes from the util package extend the top four classes from the NG4J library in the depicted manner.

8.2.4 NG4J Graphset extension

As discussed, the NG4J Graphset extension is a layer on top of Jena that enables the ODRAC platform to interact with RDF data. It is an approach complementing the use of Viewers and Modifiers that use SPARQL to interact with data-sets. The two core classes are the Graph and Graphset (see Figure 8.4). The Graph class can be understood as a Java representation of the PublishableGraph from Figure 5.2. The Graphset is used as a container of all graphs involved in describing the state of one Individual at one moment in time. This is identical to the content of a TriX-file representation of an Individual from the Data-store. The Graph and Graphset are a Java extension of a NamedGraphImpl and NamedGraphSetImpl that in their turn extend equally named classes from the NG4J library. This ODRAC extension is done in two steps. The first-order extension adds only functionality in line with the original NG4J classes, like findObject(Node, Node) : Node which is in line with the find(Node, Node, Node) : ExtendedIterator. The second-order extension of Graph and Graphset contains the real ODRAC functionality. In order to give an impression of this functionality, Figure 8.5 gives an overview of the methods of both classes.
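The first-order extension can be illustrated with the findObject method mentioned above; a sketch of how it could wrap Jena's find inside the Graph class follows.

// Inside the first-order Graph extension (sketch): return the object of the
// first triple matching (subject, predicate, any), or null if there is none.
public Node findObject(Node subject, Node predicate) {
    ExtendedIterator it = find(subject, predicate, Node.ANY);
    return it.hasNext() ? ((Triple) it.next()).getObject() : null;
}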


Figure 8.5: The methods of the Graph and Graphset Java class.

8.2.5 The Factory pattern For some Java classes the instantiation is so complex that one constructor method is not sufficient. A separate factory class can be defined to provide the instantiation. For six classes this Java pattern is used. The three factories from the workspace package were discussed in Section 8.2.2. The IndividualFactory from util.model.json is comparable, although it does not use Viewers but depends on the Graphset class. It can be used for creating both Individual and Relationship instances. The GraphsetFactory can be used to create a Graphset from an Individual instance (which is a method-less model received from a JSON object) or an Individual or Relationship Graph instance. These final two options put actual triples in the Graphs after instantiating them. This RDF data is based on the method-less objects provided when the factory is called. The DatasetFactory is used for initializing a TDB data-set. While the Workspace service uses small in-memory Graphsets, the Data-store uses a large TDB store. As described in Section 8.2.1 this instantiation requires more than setting some fields in the newly constructed instance, so a factory class is used to instantiate it.

8.3 The use of TDB

The choice to use TDB as the storage provider for the Data-store service was motivated by a number of reasons. Its presence in the performance tests discussed in Section 2.2.1 indicated its popularity, and the test results did not indicate a suboptimal performance. Just like the SDB storage system, TDB is directly compatible with Jena, which means it implements the com.hp.hpl.jena.query.Dataset interface. An important


difference with the SDB system is that TDB is still being maintained. Because of the NG4J support of Named Graphs the use of Jena is highly preferred. In the Jena framework two storage mechanisms are offered by default: an in-memory one and a MySQL backed store. Because a persistent store is needed, the in-memory store is no option. The specially optimized TDB approach and the non-transactional caching mechanism make the TDB store a better option than the MySQL backed store. The persistent nature of the TDB storage provider means a folder on the hard disk is used to save a number of files that store the full RDF data-set. TDB is very quick to initialize an existing store, but it is important to flush and close the store before shutting down the service. Not doing this before a shutdown would result in a corrupt data-set. Also, iterating over graphs inside a data-store may not be performed in parallel with a read or write operation. This is one of the reasons for the locking mechanism described in Section 6.3.4.
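Given this corruption risk, the services have to flush and close their stores on shutdown. Below is a sketch using the Jena 2.x package names referred to above; the listener wiring and the accessor on ServerThread are assumptions.

import javax.servlet.ServletContextEvent;
import javax.servlet.ServletContextListener;
import com.hp.hpl.jena.query.Dataset;
import com.hp.hpl.jena.tdb.TDB;

public class StoreShutdownListener implements ServletContextListener {

    public void contextInitialized(ServletContextEvent event) { }

    public void contextDestroyed(ServletContextEvent event) {
        for (Dataset dataset : ServerThread.getInstance().openDatasets()) {
            TDB.sync(dataset); // flush pending writes to disk
            dataset.close();   // release the store before the JVM goes down
        }
    }
}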

8.4 JavaScript application

A fundamental design difference between version 2.0 and 2.1 of the Information Adapter was the workload distribution over the Web client and server. In version 2.0 each workspace operation was communicated with the Tomcat server using AJAX requests, but this approach appeared to be too slow to be feasible. In version 2.1 the use of a powerful JavaScript data-set representation was introduced. Before this design change JavaScript was merely used for graphical effects and the mainstay of data creation was a session TDB store in the Workspace instance. In version 2.1 this mainstay was moved to the JavaScript environment, under the assurance of a synchronization mechanism that keeps a Graphset with the same purpose as the session TDB store up to date. In this way small steps of data production do not depend on server calls, but still the complete state of a session is stored both in the Web browser and in the Tomcat service. As a consequence, the JavaScript has become more complicated in version 2.1. In the following subsections important parts of this JavaScript side of the Workspace service are described.

8.4.1 Used libraries For general purposes the jQuery extension is used. It enhances the interaction with DOM elements, the processing of AJAX calls, visual effects like dragging and dropping, and basic array operations. For the composition of more complex CSS style sheets the LESS dynamic style-sheet language is used. For icons a sprite image is used containing all the icons at once, and the creation of the CSS style sheet with all the coordinates is greatly helped by using numerical variables in LESS. Because of the parsing time lag, LESS style-sheets are parsed once and then saved as normal CSS. Yet another library used is jqTree. This builds upon jQuery and delivers the basic operations needed for the tree inside the Navigators. The source code had to be adapted to support icons and the context menu. For the context menu another plug-in for jQuery is used. As mentioned above a JavaScript version of Mustache is used to handle HTML templates. A Browser object is completely built out of Mustache templates.


For the interaction mechanism between client and server the Backbone library is used. This depends on the Underscore library, consisting of more elaborate operators like forEach and filter. Backbone supports a model-view structure that can send updates in both directions. Furthermore the model can be synchronized over an API using JSON objects. The objects in the util.model.json package contain Java models that are made available at the server side, representing the Backbone models described in Section 8.4.3.

8.4.2 JavaScript DOM To give an impression of the JavaScript environment, Table 8.1 lists the DOM elements that contain all the functions and data objects needed for the interface to function. Table 8.2 gives a detailed view on the $.lib object containing the library cache. Each function initiated by a user action can iterate over the library cache. The set of JavaScript files is divided over the cache, control, external, model, util and view folders, but the references to the $.lib object stem from the model, view and control folders. The URIs used in all the $.lib lists are abbreviated by substituting the reused first fragment of the URI with a number. A pair of functions, short() and full(), is available to switch back and forth between full URIs and the abbreviated URIs used in the library cache. To do the transformation the $.lib.prefix array is used. The $.lib.comp_labels is loaded if the user selected a non-neutral language during login. It is filled with the content of the predefined JS-file containing a company specific word list.

variable      description
$.collection  list of all loaded Individuals
$.control     general workspace operations
$.factory     set of methods to create models (like Individual)
$.lib         contains library in compressed form (see Table 8.2)
$.model       definition of backbone models
$.tree        list of all active navigator trees and their methods
$.txt         register of user interface messages
$.user        user that is logged in
$.view        definition of backbone views
$.window      methods supporting graphical user interface

Table 8.1: DOM elements.

8.4.3 Model synchronization The Backbone mechanism is used on three models: bookmark, individual and relationship. They extend the Backbone.RelationalModel. For the bookmark and individual a Backbone.Collection is also made. Both model and collection objects are supplied with a URI that the Backbone mechanism will use to synchronize with server objects. In Section 6.2.1 this API was described. Figure 8.1 contains the Java class versions of the models. For example the Bookmark class has the fields id and uri, just like the JavaScript Backbone model.


variable           description
$.lib.comp_labels  (optional) maps graph URI to company specific label
$.lib.labels       maps graph URI to label
$.lib.prefix       maps prefix id to URI fragment
$.lib.rel          maps id to {graph uri, label, reversed label}
$.lib.tree         recursive set of {graph uri, label, [children]} of OntologyElement
$.lib.primer_tree  recursive set of {graph uri, label, [children]} of PrimerElement
$.lib.tmpl         maps template uri to set describing template content
$.lib.tmpls        maps domain uri to all templates it is part of

Table 8.2: Compressed library indices.

The model synchronization initiative comes from the JavaScript environment. When an instance is created from a Bookmark model, a JSON object is sent to the Java server with a URI field, and the server returns an object with an extra ID field. For the other models too, no pushing mechanism is used from the server to the client. The Backbone views interact with the models. Each view has a render function that is automatically called if an object changes. If for example the name of an Individual is changed, all Browsers that have a relationship with this Individual and all Navigators that have the Individual in their tree change along automatically. In this way Backbone powers the client-server communication and the model-view synchronization inside the client. As a result the maintenance of both streams of information is easier.

8.4.4 Browsers A Browser pop-up as depicted in Figures 7.2 and 7.3 consists of a Backbone browser_view, metadata_view and history_view, that in their turn use the complex_relation_view and invidual_relation_view. For rendering each view a Mustache HTML template is used. This means that the exterior of a Browser can easily be changed by transforming these templates. New functionality can be added to the views. This Backbone object structure is not inherent to JavaScript, but it is used to make the JavaScript source code modular and easy to maintain.

8.4.5 Navigators Compared to the Browser element the Navigator is required to be more configurable, because many different tree structures can be thought fit to navigate over a knowledge base graph. For example a Navigator might be added in the ship building context to traverse p4:located_at relationships to give a geographical composition structure of a ship. When a project is loaded inside the Web browser the whole library is available in the cached format described in Table 8.2. A Navigator can be configured to follow any Relationships from the $.lib.rel elements. Adding a new Navigator is done by making a new *_tree.js file in the /js/model/tree/ folder and including it in the script files to load from index.jsp. A <div> element is added to the DOM and to the $.tree list and registered as Navigator container using the following command.

$.window.registerNavigator(name, direction, icon, title);

As name the string identifier of the <div> element is used; the direction is top or bottom to indicate the location of the Navigator. The icon and title are used for the draggable Navigator header. The next step is to call the .tree() function on the div-object, available since the inclusion of the tree.jqtree.js library. Adding and removing nodes can be done using the jqTree methods, and listeners can be attached to predefined events. The calls of those operations can be integrated in the whole environment.

8.5 Conclusion

No criteria from the requirement chapter contained implementation level decisions, so the view on ODRAC presented in this chapter merely supports the design presented in the former three chapters. Criterion 11 about the secured storage of governance data hints most strongly at the required reliability of data contained in the UIA. In this chapter the use of a TDB store was motivated, and this choice has an important effect on the persistence quality. We will come back to this in Section 9.5. In this chapter the interaction between the Data-store and Workspace was worked out in more detail. Also the structure of the code base of both services was explained. As stated in the introduction of this chapter, the explanation of the code base is meant to prove its clarity and modularity. The package structure, use of threads and interaction mechanisms between the Java and RDF environment illustrate the code division. Also a brief description was given of how to extend the WUI with a new Navigator and how to change the graphical representation of a Browser.


Chapter 9

Evaluation

In this chapter we return to answering subquestion (c) What do we consider a proper platform, given the focus points? The quality measure framework from Appendix B was summarized in focus points in the Introduction chapter. After the ODRAC platform is compared to the pre-ODRAC implementation in Section 9.1, the platform is evaluated in the light of those focus points in the subsequent sections. In Section 9.2 the user friendliness (1.3.3) is discussed, followed by a discussion of the reception by Integraal Samenwerken that influenced the evaluation of the Systems Engineering requirements (1.3.6). In order to evaluate the performance (1.3.4) a small empirical test is executed in Section 9.4. In the next section the reliability (1.3.5) is evaluated. The configurability of the platform (1.3.1) is discussed in a separate section too. For developers that wish to contribute to the platform the final section describing the extendability (1.3.2) is important. Only in the next chapter do we draw conclusions from the observations in this chapter.

9.1 Comparison to 1.x version

In Section 4.5 the pre-ODRAC implementation was described as the 1.x version of the UIA. The first major improvement in the 2.x ODRAC approach is the user interface. It runs in a Web browser and is available online without installation or user configuration. Also the rich visual and user interaction capabilities of the HTML 5 generation of Web browsers make it an attractive platform. The screen layout of the old stand-alone client (Figure 4.2) was taken as a starting point, and has been transformed into a box container called Browser as presented in Chapter 7. Secondly the minimization of data model assumptions is a big step forward. Compared to the data model discussed in Appendix A the ODRAC platform offers a structural improvement. Now a solid definition of different types of graphs is given, describing their purpose and structure (published in the vocabulary in Appendix D). Furthermore the modular structure of the data model means only a very basic data structure is assumed by the software implementation. Based on an explicit configuration (using an ontology with modular extensions) the platform supports the ontology driven creation of data. The 1.x version also used an RDL and a Meta-RDL file as a configuration, but these were not structured well and the code base contained ad-hoc assumptions on both. The version 2.x platform can be applied to other knowledge domains without the


need of reviewing the code. A third major improvement is that no software needs to be loaded locally when an end-user operates the Workspace UI. The 1.x version contained a locally loaded complete library, which meant a loading lag of a few seconds but also distribution problems if a library needed to be updated. Less fundamental differences involve an easier to navigate User Interface and a more modular code base. Furthermore the platform needed revision to support multiple data stores. A small step has been taken in this direction (see Section 4.6). Other requirements, like the transaction interface of Criterion 12, have found an implementation. Also new are the visual feedback on wrong or missing values, uploading attachments, user rights management and the possibility to duplicate items. Missing are the template methodology that was supported in the 1.x version and a way to configure which types of properties and relationships to show and hide. The first shortcoming is slightly compensated by the option to duplicate Individuals that can be constructed as an example. Instead of hiding possibilities, the UI is built to be able to contain all properties and relationships in a Browser window in a structured way. The conclusion of this comparison is that the new version presented in this thesis is an improvement in a number of respects, with a more or less equal set of features.

9.2 User satisfaction

In this section the response of the end-users described in Section 4.4 is evaluated. Those are the engineers and sales employees the Workspace interface is built for. It is difficult to find people with the right background to test the interface, because the workflow within engineering projects does not yet include the use of a similar interface. Engineers and sales employees have their own systems to interact with, and they communicate by interchanging computer files and by using e-mail and telephone. For the evaluation one engineer was interviewed who was involved in the project and had experience with managing engineering data using Relatics. More incidental feedback from comparable users was also included. The response of end-user representatives of the companies invited to the demonstration sessions held in the first two weeks of July 2013 forms another source of feedback.

As far as the learnability is concerned, the main interface turns out to be intuitive and easy to operate. The draggable Navigators and Browsers are also an added value, for example compared to the more static configuration of the Relatics interface. The negative side that some operations, like sending a request, involve a lot of clicks is accepted as an inevitable cost of creating order and not displaying too much information at one time. No aspects of the previous 1.x implementation were appreciated better than the new interface, apart from some operations, like constructing templates and using colored search filters, that were not yet released in the new interface. The interface is seen as an improvement as far as UI aesthetics are concerned, although the choice of colors and the operation of the valuebox were criticized.

At some steps the user has to wait for the system to respond. The introduction of the JavaScript synchronization mechanism between client and server resulted in a smooth workflow. Opening existing Individuals did not cause any delay either. The two steps in the whole workflow that require some patience are initializing the Workspace and the synchronization step needed to send data.


This final step took a number of seconds at the worst, which was completely within the acceptance range of the users. Slightly problematic is the load time when the instance base starts to grow. We will come back to this in Section 9.4.

When the User Interface was presented in the demonstration sessions to engineers not familiar with the UIA project, their reaction was indifferent. In the first demonstration the interface was used on a scenario of two companies interacting with each other. After this practical use case the architecture and argumentation of the UIA were described. Although the user interface helped to illustrate the practical application of the workflow, the engineers showed no ambition to use it. The engineers shared the described reaction on aesthetics and performance, but they could not imagine the gain it would bring them in their daily practice. In a second demonstration of the UIA the presentation order was reversed, by first explaining the general purpose of the whole platform and the function of the neutral language before the User Interface was discussed. The response of the engineers was comparable, but in this presentation the interface was presented as one of the possible applications of the UIA platform.

9.3 Reception by Integraal Samenwerken

Integraal Samenwerken is the initiator of the functional concept of the Universal Information Adapter and the main stakeholder in the development of the ODRAC platform, so its reception of the result is important. As one of the concluding activities, the project group managing Project 8 formulated, in April 2013, an elevator pitch describing the main achievements of the work. The pitch can be translated as follows.

It is common knowledge that miscommunication leads to rework and extra costs, especially if those mistakes happen in an early phase of the process. The Information Adapter can reduce the number of those semantic communication faults to zero and increase the pace of the process by clear communication. The solution is uniform, so it does not need to be reinvented each time. It can be used for digital communication between companies, departments and applications. The communicated information is traceable and can be reused. The Information Adapter supports an interactive way of working. The solution complies with international standards (ISO).

To explain the functionality further, they formulated the following ten properties of the UIA.

(1) Each connected information source (company, department, application) communicates in its own language. The UIA offers an interface between them using a neutral language.
(2) Only one interface needs to be maintained per source: the mapper to the neutral language.
(3) Data from information sources can be exported to the UIA and imported from it. The user decides what to import and export.
(4) It is possible to provide the data with an intention (request, propose, no objection, etc.).
(5) Also the certainty of the data can be provided (estimate, reliable, etc.).
(6) Attachments can be supplied with the data.
(7) History is automatically recorded and fully traceable.
(8) The date when requested data should be received can be supplied. The sender and receiver can get a quick overview of these dates.
(9) The library is flexible; it can be quickly extended with missing terms.
(10) Quick, efficient and consistent specification of required data with the use of templates. Company or project specific default object compositions can be supplied with values.

From these descriptions the client's view on the result can be derived. Although the description is biased because it was formulated for promotional purposes, the client believes the statements to be true.

In the demonstration sessions mentioned in the previous section a reaction was also given by other user categories. The higher management and IT-management of companies involved in Integraal Samenwerken were more enthusiastic than the end-users. From their reaction a recognition of the problem and an affirmation of the direction in which the UIA tackles it can be concluded. The higher management appreciated the promise of failure cost reduction and the use of a high-tech solution.

During the demonstration no deep level of detail was reached, and some IT-managers mentioned that they were occupied with implementing a similar system from a large software vendor to homogenize the internal communication within a shipyard company. This can be understood as a skeptical position on the political feasibility of introducing a collaboration standard like the UIA. They did not see a clear difference with any other sort of middleware. The project group doing the presentation responded that middleware does not retain the data, whereas the UIA platform does contain data-stores.

Also, the current development team, consisting of only two people, did not demonstrate a high development capacity. Altogether the UIA project, or any of the other Integraal Samenwerken initiatives, will not likely find a successor project. This is contrary to the enthusiasm of the manager at Croon supervising the design and the chairman of the execution board of Integraal Samenwerken. Within Croon a different future of the platform is envisioned, where the data model and the Workspace interface will be reused and extended (see Section 10.1).

When a more detailed analysis is made of the quality aspects that relate to the Systems Engineering practice (listed in Section 1.3.6), the design comes out acceptably well. The freedom from risk and the reliability of the data contained in an information integration platform are important aspects, and they are covered by the history logging (7) and meta-data possibilities like (5). The context completeness is supported by the neutral language (1) and the flexibility of the library (9). The collaborative nature of engineering projects is supported by the freedom of companies to decide which information they share and import (3) and by the transaction procedure (4).

The required replaceability of the Workspace interface is inherent to the architecture, and the related co-existence and interoperability are further affirmed by the decision to use the ODRAC platform approach in the described real data integration project, where even the data-store will be replaced by commercial stores.


9.4 Performance

In order to evaluate the scalability and capacity of the platform a small empirical test was executed. Both the Data-store and the Workspace service were included in the test. To evaluate the test results we appeal to the informed common sense of an average Web user who understands the UIA context.

Data-store content                                           memory usage (Mb)
Library (18k graphs) rebuilt from files                                    310
Library (18k graphs) after 1 upload and 1 download                          82
Library (18k graphs) + 10k Individuals after 1 download                     65
Library (18k graphs) + 100k Individuals after 1 download                    65

Table 9.1: Data-store memory use per project.

We start by giving an indication of the resource utilization of the Data-store. In Table 9.1 the memory usage of a Data-store needed to support one project data-set is given. Where the number of instances increases, the memory usage decreases. This is counterintuitive, but it can be explained with two important observations. First of all, the number of instances contained in the data-set does not influence the size of the memory needed to operate the Data-store and retrieve Individuals from it. We assume that the increase from 65 to 82 Mb when uploaded Individuals are processed is independent of the data-set size too; for practical reasons this was not trivial to test. Secondly, when a project data-set is newly created a procedure is executed that fills the data-set with library graphs. This requires more memory than running an existing data-set, which explains the 310 Mb for a library rebuilt from files.
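
As an illustration of how such a reading can be approximated, the standard java.lang.Runtime API reports the heap in use on the JVM that runs the Data-store. This is a minimal sketch, not the instrumentation actually used for Table 9.1.

// Minimal sketch: approximate the heap in use by the running JVM.
public final class MemorySnapshot {

    public static long usedMb() {
        Runtime rt = Runtime.getRuntime();
        rt.gc(); // request a collection so the reading reflects live objects
        long usedBytes = rt.totalMemory() - rt.freeMemory();
        return usedBytes / (1024 * 1024);
    }

    public static void main(String[] args) {
        System.out.println("Heap in use: " + usedMb() + " Mb");
    }
}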

Stub instances    data-set size (Mb)
0                                 51
100                               51
500                               53
1,000                             55
5,000                             73
10,000                            95
50,000                           271
100,000                          492

Table 9.2: Data-store hard disk use per project.

A more detailed measurement of how much disk space is used by TDB is given in Table 9.2. Eight instance base sizes are listed, representing eight tests executed with a varying number of randomly created stub instances. When zero stub instances were added, the exact number of Individuals in the data-set was around 20, consisting of meta-data Individuals and a predefined set of Persons in a running UIA project configuration.


The disk space needed for a growing instance base is listed in the table. This disk space usage is the sum of the space required for the data-set store and the sandbox store involved in one project, but because the sandbox only contains the library and no instances, the biggest influence is caused by the data-set store.

As an indication of how many Individuals might exist in a project, a comparison can be made to a recent engineering project for a 1 km traffic tunnel. This project is currently the biggest project Croon is involved in, and it consists of about 7,000 Individuals. A large number of properties and interconnections could be involved, which would mean a multitude of graphs is involved in such a data-set: the test only uses three graphs per Individual, and the change history of all the graphs multiplies the number further. If for example 60 graphs are involved in one Individual, compared to the 3 from the test, this would mean a 20-fold increase of the needed data, so 7,000 such Individuals already correspond to 140,000 stub Individuals, more than the biggest test of 100,000. Yet if the result from the table is interpolated, finding that 4.4 Mb is needed per 1,000 stub Individuals (3 graphs each), the needed disk space can be calculated. A ten- or hundredfold would still be acceptable with current hard disk specifications.

The growing instance base has a different effect on the response times of the Data-store and the Workspace. For each stub instance size from Table 9.2 a test measuring three different response times was repeated five times. In Figures 9.1, 9.2 and 9.3 the results are presented with the instance base size on the x-axis and the response time in seconds on the y-axis. Both axes have a logarithmic scale, preserving linear effects but wrapping the results in a more compact display. In order to record the response times in the Web browser environment, the Developer tool (Timeline on the Network resource tab) in Chrome was used. It should be noted that this tool, when activated, slows down the overall performance. We assume this performance drop to be linear in the number of instances, so the measured effect of different numbers of instances will still display the same trend as without the measuring tool. It only affects the measurement of the DOM load time.

First of all, the response time of the Data-store is measured when one Individual is requested. From Figure 9.1 we see that there is a big spread in the results. An important note is that the first Individual that is requested from a freshly booted Data-store takes considerably more time; the TDB store performs some form of initialization. When these measurements are ignored, doubling the instance base from 50,000 to 100,000 Individuals does not seem to have any effect on the response times. This is reassuring, especially because those load times of less than 0.5 seconds are acceptable. If the whole trend is observed, there seems to be a jump between the range of 100 to 10,000 instances, with an average around 0.05 seconds, and the range of 50,000 to 100,000 instances, with an average of 0.2 seconds. The quality of the test is not sufficient to predict the trend, but some confidence can be derived about the scalability.

The second measured response time, in Figure 9.2, is the time it takes for the Workspace to return a JSON list of all Individuals to the Web client. It talks to the Data-store using a SPARQL query, but the result is very large, consisting of all Individuals.
As described in Section 9.2 it becomes unacceptable if the user has to wait too long for the initialization step. The Workspace locks until it has retrieved and processed all the Individuals. It takes approximately 1 second to load 10,000 Individuals, but above that the response time becomes unacceptable. Especially the DOM processing times needed for the received Individuals, as presented in Figure 9.3, become problematic.


Figure 9.1: Data-store response time (s) when one Individual is requested, per instance base size.

Figure 9.2: Response time (s) of full list of Individuals from Data-store, per instance base size.

Figure 9.3: Load time (s) of the DOM per instance base size.

This final measurement includes the time between the moment the user logs in and the moment the interface, with all the Individuals loaded, is ready for the user to interact with. Already for 500 Individuals this takes nearly 10 seconds. In the current design of the Workspace this is the main focus for future optimization. In the Future work section (10.4) a new loading mechanism is proposed.
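
To put these numbers in context, a response time of this kind can also be recorded outside the browser with a plain HTTP client, avoiding the slowdown of the Developer tool. The sketch below illustrates the idea only; the endpoint URL is hypothetical, and the reported measurements were all taken with the Chrome tooling described above.

import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URL;

// Sketch: time one request to the Data-store (hypothetical endpoint URL).
public final class ResponseTimer {

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/datastore/individual?id=example");
        long start = System.nanoTime();
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        try (InputStream in = conn.getInputStream()) {
            while (in.read() != -1) {
                // drain the body so the full response is included in the timing
            }
        }
        long elapsedMs = (System.nanoTime() - start) / 1_000_000;
        System.out.println("Response time: " + elapsedMs + " ms");
    }
}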

9.5 Reliability

The reliability aspect in the Introduction chapter (1.3.5) is entirely devoted to structuring user interaction with the system. A more technical view, discussing robustness and recoverability, is left out of scope (1.3.7).


Some experience was gained in how to deal with TDB stores, and the main focus during development was to prevent the store from becoming corrupt. More work has to be done to investigate the recoverability of a TDB store after a server crash or a software error. As far as the availability of the services on a project channel and the available computational resources are concerned, the responsibility is left to IT-management.

The freedom a user has to cause errors is very limited. A user of the Workspace service has to be a registered person to get access. After logging in he is only given the allowed possibilities and he receives detailed feedback on the values he enters. But even if a user sends a TriX-file to a Data-store without using the Workspace service, the content of this TriX-file is fully checked for syntax and for conformance with the library. The only weak spot is the lack of authentication on submitting knowledge data. Inside such a TriX-file valid users have to be related to all statements, so if the content of a TriX-file is assumed to come from the authentic person, the claim is inextricably bound to the identified person. Non-repudiation is maximal and users interacting with the system are held accountable at precisely the intended level. Their claims can have juridical implications.

An envisioned way to deal with the security problem is to configure the Data-store to only accept TriX-files from the Workspace, but this conflicts with one of the requirements. As described in the section on important aspects that are left out of the scope (1.3.7), this is an important topic for future work.
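
To give an impression of the syntactic half of this check: parsing an uploaded file with the NG4J API already rejects input that is not well-formed TriX. The sketch below assumes the publicly documented NG4J classes and an example file name; the conformance check against the library is a separate, UIA-specific pass that is not shown.

import de.fuberlin.wiwiss.ng4j.NamedGraphSet;
import de.fuberlin.wiwiss.ng4j.impl.NamedGraphSetImpl;

// Sketch: the syntactic half of a TriX check, using NG4J's TriX reader.
public final class TrixSyntaxCheck {

    public static void main(String[] args) {
        NamedGraphSet set = new NamedGraphSetImpl();
        try {
            set.read("file:upload.trix", "TRIX"); // throws on malformed TriX
            System.out.println("Well-formed TriX with " + set.countGraphs() + " graphs");
        } catch (Exception e) {
            System.out.println("Rejected: " + e.getMessage());
        }
    }
}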

9.6 Configurability

As demonstrated in the data model (Chapter 5), the domain information model and the meta-data functionality can be fully configured. This is done by defining subclasses of the OntologyElement and the PrimerElement, and relationships equivalently. It is possible to introduce any number of layers in the library structure, but the most logical is the split between an upper ontology, representing completely different objects, and more detailed taxonomies defining subclasses of those objects. If a data-set is locked, library graphs can be exchanged for new ones as long as they are not in use. Adding graphs is always possible. This is the maximal flexibility that could be achieved.

No interface is offered yet to perform operations on the library configuration. This means the work of a library manager involves handwork of a technical degree, which in practice is unacceptable. The same is true for the configuration of which user is allowed access to which project, which is the job of the project manager. The configuration of how the transaction model is applied in a project involves the management of meta-data (e.g. Intention instances), which requires exactly the same operations needed from the library manager. For new projects the basic data model can be configured down to its most elementary part, so the generalizability of the information model is maximal.

As far as the operations that can be performed on the RDF data are concerned, the platform is very rigid on the low level. Individuals and Relationships can be added, retrieved and replaced, but according to strict rules. No evolution is anticipated on this level. The evolvability of the higher level operations inside the Workspace is discussed in the following section, because it involves extending its code base. The Workspace can be extended with new Navigators, but it is also possible to introduce a completely new operator object in the line of the Navigator and Browser.

For example, a Time-line object that is able to plot Individuals with a certain time Property on a time line could be added, reusing many operations from the JavaScript environment. This would no longer be configuration, but it could be seen as making use of the generalization of high level operations with the use of low level operations.
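
To make the configuration idea concrete, the sketch below shows how a check on such a library configuration could look with the Jena ARQ API of the time. The uia: namespace and the class name Pump are illustrative assumptions, not the platform's actual vocabulary.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

// Sketch: verify that a configured library class descends from OntologyElement.
public final class LibraryConfigCheck {

    public static void main(String[] args) {
        Model library = ModelFactory.createDefaultModel();
        library.read("file:library.rdf"); // illustrative dump of the library graphs
        String ask =
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
            "PREFIX uia: <http://example.org/uia#> " + // assumed namespace
            "ASK { uia:Pump rdfs:subClassOf+ uia:OntologyElement }";
        QueryExecution qe = QueryExecutionFactory.create(ask, library);
        System.out.println("Pump is a library class: " + qe.execAsk());
        qe.close();
    }
}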

9.7 Extensibility

When the developer user from Section 4.4 is taken into account, the focus point of extensibility can be understood as measuring developer satisfaction. During the thesis work there was interaction with only one developer who would have been able to respond to the code base without being preoccupied with its design. Yet he only interacted with the WUI to upload TriX-files that he generated. The detailed visualization and feedback mechanism of the Workspace, and the extra validation information from the Data-store, were built for him and helped in getting to understand the structure of a TriX-file. Unfortunately for the evaluation, he did not use parts of the code base.

As discussed in the Introduction chapter, the code qualities we are interested in, related to the extensibility focus point, are maintainability, reusability and modifiability. We will evaluate them here based on the internal characteristics identified in Chapter 8. In the Introduction chapter the ODRAC platform was compared with a Model Driven Architecture frame. Code generation on a small scale is performed by the use of a template engine to parse SPARQL queries and build HTML. Although there is no real code generation as envisioned in MDA, there is a number of (formal and mental) models that add to the understandability of the code. The data model from Chapter 5 is one of the most structuring elements of the design. The graph structure is backed up by the Graphset NG4J extension and can be used throughout the code. The Tomcat thread use represents the core application model, and the package structure that organizes related classes and separates their functionality over three packages brings order too.

Generally speaking, the modifiability and reusability of the code are based on its modular structure and the ideas behind it. These were explained in Chapter 8, and we conclude they reached an acceptable level. Further details were given about the modifiability of the client environment of the Workspace. The mental model of a Navigator and Browser that house operations is incorporated in a model-view structure that can be extended easily. Such an extension can be connected to an extension on the server side by using the JSON synchronization objects.
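
As a small illustration of the template-based generation mentioned above, the following sketch builds a SPARQL query from a parameterized template. The template text and the placeholder are invented for this example; the platform's own templates are described in Chapter 8.

// Sketch: the flavor of template-driven SPARQL generation described above.
public final class QueryTemplates {

    // Illustrative template; ?g ranges over the Named Graphs holding the data.
    private static final String INDIVIDUAL_BY_URI =
        "SELECT ?g ?p ?o WHERE { GRAPH ?g { <%s> ?p ?o } }";

    public static String individualByUri(String uri) {
        return String.format(INDIVIDUAL_BY_URI, uri);
    }

    public static void main(String[] args) {
        System.out.println(individualByUri("http://example.org/data#Individual-42"));
    }
}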


Chapter 10

Conclusions and Future Work

This chapter starts by formulating the answer to the main research question and the related subquestions raised in this thesis. In Section 10.2 the significant contributions of the work are discussed, followed by a discussion of the weak points in Section 10.3. The final section discusses the future work.

10.1 Conclusion

The structure of this report has been as follows. First an evaluation frame was set up in Chapters 1 and 9 (and Appendix B) that was informed by the design assignment. Within this frame the assignment was introduced by describing the needed techniques (Chapter 2), related approaches (Chapter 3) and the immediate problem domain (Chapter 4). In that chapter the design requirements were formulated. The four subsequent chapters contained the design, each concluding with a description of which requirements it fulfilled.

The main question for a proper solution was supported by three subquestions. The first one asked for a method to express ISO 15926 reference data in RDF. This method was given in Chapter 5 and supported by Appendices C and D. A reflection on it from the ISO's perspective was given in Appendix E. The second subquestion asked for the requirement analysis offered in Chapter 4, and the third subquestion was partly answered by introducing the evaluation frame. In this chapter the answer to what we consider a proper platform is completed.

The ODRAC platform is a proper solution to the general data integration problem discussed in Chapter 3 and the specific UIA case from Chapter 4, because it manages to reduce translation steps to an acceptable degree as soon as a domain covering neutral information model can be specified. This can be done using the modular data model based on Named Graphs of a specific type hierarchy. An important prerequisite for the data that should be mapped to the neutral language is that instances have single class membership. The platform makes a very basic yet expressive use of Named Graphs, and it opens up promising SPARQL reasoning and profound logging capabilities with an apparently good and scalable performance.

The platform is composed of two extensively described services and a briefly described mapper. Their internal working is made for basic operations and anticipates extension and reuse as far as the timescale allowed the generalization of the Java classes and the JavaScript environment. The large list of future work testifies to this ambition.


Some of it covers important aspects that had to be left out of the thesis scope because of the time limit. Others involve more optional extensions.

The acceptance by the involved stakeholders is very diverse. The main supervising manager put the architecture and the data model to use in an important information integration project for a large infrastructure project, which means a large acceptance of the design. The Integraal Samenwerken board was positive but not completely convinced. The work was also presented to the ISO community, but their feedback could not be included in this report.

A lot of work went into building the Workspace User Interface, which forms a logical follow-up to the pre-existing Java Client discussed in Section 4.5. As presented in Chapters 7 and 8 and evaluated in Chapter 9, it consists of a clear and well thought through interface and a well structured implementation.

10.2 Contributions

During the thesis work the hope was felt to contribute at least as much to the Integraal Samenwerken aim of improving the collaboration in the Dutch ship building industry as to the ISO proceedings. In spite of the involvement of the IS Project 8 board, the platform implementation did not gain enough momentum to find an immediate future. During a concluding conference of the IS initiative, the main reason for this hesitation appeared to be the immense investment needed to implement a platform on a scale that would activate the uniformization potential. When asked, a number of large participating software companies said they were prepared to take up the further development and implementation of the platform.

As stated above, the contribution to the ISO is harder to measure, but the intention to produce a Part 12 that involves the use of Named Graphs sounds promising. In both contexts the work can at least function as a pioneering exercise in how an implementation might look. Moreover, the actual content of Part 11 has grown considerably in maturity during the development process.

The ODRAC platform will have the most concrete effects in the application of the ISO standard in large infrastructure projects. As said, the platform is put to use in a follow-up project hosted by Croon Elektrotechniek and performed by Sysunite B.V. The ODRAC architecture, the Workspace UI and the data model are being reused and developed further, and many of the design decisions made during this thesis still stand in the new project.

An interesting field for recognition would be the academic work in the overlap of computer science and systems engineering. The close similarity between ODRAC and the work of [74] and [29], and the deeper technological detail of this thesis, might have some inspirational effect. Especially the apparently unique usage of Named Graphs is hoped to have some academic relevance.

10.3 Discussion

A number of shortcomings of the ODRAC system directly stem from important aspects, like security, that were partly left out of the scope (Section 1.3.7). Moreover, in the requirements chapter an explanatory section (4.6) was dedicated to the limitation of the implementation with respect to the required federated configuration of data-stores.

Together with those clear weak points, a number of possibilities offered in the data model (like terminating the existence of Relationships) is not implemented yet. These topics are taken up in the list of future work.

Next to an evaluation of the number of features implemented, the approach taken during development is important. The use of well supported libraries and a well-known IDE is good common practice, as is the use of a modular class structure and known design patterns. Yet the use of automated tests has been undervalued in this project. As mentioned in the Introduction chapter, an understandable code base improves the testability. This means that errors fired from lines of code that perform small and clear steps are easily repaired, but it also helps to build automated tests that evaluate each step separately. During the design hardly any of those automated tests were used. This can be used as an argument against the quality of the system. It is true that the better a system is tested, the higher the quality, but the generality of the solution means that the clarity of the main design directly works through to the clarity, and thus the relatively error proof nature, of the code. Because all operations performed inside the Workspace and asked from the Data-store boil down to the execution of the same procedures, the manual testing of procedures already means a high code coverage and a minimization of possible execution paths. On top of that, some rigorous design choices limit complexity, for example the locking of the access procedure to a Data-store or the rules that apply to a synchronization procedure of TriX-files with the Data-store. Furthermore, a number of data validity checks are built inside the application, the TriX-verifier being the most extensive example. During design it was attempted to make the system stabilize on exceptions.

A second major contribution to the extent to which the system was tested was the manual evaluation of the interface by the IS workgroup management and the extensive use of the TriX import function by the developer on a parallel IS project. Many possible procedures were covered by those user tests.

Still, the system has been under constant construction, with many releases building in new functions, sometimes under high time pressure because of upcoming demonstrations. This is the reason that at some places shortcuts are programmed in or some code results from former approaches. In order to keep the code clean a number of large cleaning operations were done, radically removing unused code. Manual testing by the developers and the other parties helped to retain the general functionality.
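
For illustration, an automated test of the kind argued for above could look as follows. TrixVerifier is the component named in the text, but the constructor and the verify method shown here are assumptions made for the sake of the example.

import static org.junit.Assert.assertFalse;

import org.junit.Test;

// Sketch of a unit test for the TriX-verifier; the verifier's interface is assumed.
public class TrixVerifierTest {

    @Test
    public void rejectsFileUsingTermMissingFromLibrary() {
        TrixVerifier verifier = new TrixVerifier(); // assumed constructor
        boolean valid = verifier.verify("src/test/resources/unknown-term.trix"); // assumed method
        assertFalse("a TriX-file with a term missing from the library must be rejected", valid);
    }
}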

10.4 Future work

The list of future work hinted at in earlier chapters can roughly be divided into two groups. First there are the extensions that should definitely be added in order to fully meet the requirements and produce a balanced application. First of all the security aspects, like the authentication needed before access to the Data-store API is granted. Also the federation mechanism could finally be worked out in the new architecture. A third improvement, already mentioned in the report, is the need for a cached retrieval of library types by the JavaScript client. A large time performance increase is expected to be possible by reviewing the procedure of loading all active library types, which causes the large performance delay measured in Section 9.4.


Research should be done into the recoverability of a TDB store if an error occurs or the system freezes. A better understanding of the effect of operations on a TDB store and its state would help to make the interaction of the Data-store service with the TDB store more robust. Support for a number of elements from the data model is still completely missing from the Workspace, like the production of Individual templates or the termination of Relationships or Individuals. These are not essential operations, but they will be needed to maintain a project. A related function would be an automated merge of two Individuals that actually represent one thing but were accidentally created twice. The possibility of this was discussed in Section 5.1.

The second group of future work consists of interesting possibilities that might mean an extension or a better evaluation of the presented work. For example, the rdfg:subGraphOf relation from the Named Graphs vocabulary gives the possibility to define an inheritance structure within Named Graphs. Currently ODRAC only uses one level of graphs, which means each triple of a data-set is only contained in one graph. A more composite configuration, for example also containing all triples involved in a ComplexDataGraph in separate subgraphs, might open new uniformization possibilities. The interaction of subgraphs with SPARQL queries should be investigated, as well as the support by Jena or NG4J; a sketch of such a query follows below.

Related to this is the interesting influence that the use of graphs, and possibly subgraphs, has on the performance of a data-store. In this work the educated assumption was made that the performance does not suffer from the graph structure, but an empirical study of the real effects, possibly even a performance gain, would help support and maybe revise design decisions.

A different track of research might be an evaluation of whether the RDFG vocabulary could be extended with domain and range like relations describing which triples are allowed inside a graph, thus supplying means to describe graph characteristics and to facilitate the automated evaluation of graph type membership, like OWL makes it possible to calculate complex forms of class membership based on ClassExpressions.

Further data model improvements that could be thought of to extend the expressiveness of the data model include the introduction of cardinality restraints on incoming relations, a more generalized required actions mechanism and a role based authorization mechanism. More research could also be done on the differences between the ODRAC approach and OWL. Important topics include the possibility of introducing multiple class membership in ODRAC, the use of OWL relations in the ODRAC data model and the extent to which the open world assumption of OWL conflicts with the closed world assumption in our work.

More practical extensions are the construction of an interface to create and edit library graphs, API support to interchange library graphs and an updated merge function that can handle all graph types.
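
The sketch referred to above shows what a query over a graph and its declared subgraphs could look like with Jena ARQ. The SPARQL endpoint URL and the graph URIs are illustrative, and it is assumed that the rdfg:subGraphOf statements are visible in the default graph.

import com.hp.hpl.jena.query.QueryExecution;
import com.hp.hpl.jena.query.QueryExecutionFactory;
import com.hp.hpl.jena.query.ResultSetFormatter;

// Sketch: select triples from a graph together with any declared subgraphs.
public final class SubGraphQuery {

    public static void main(String[] args) {
        String query =
            "PREFIX rdfg: <http://www.w3.org/2004/03/trix/rdfg-1/>\n" +
            "SELECT ?s ?p ?o WHERE {\n" +
            "  { GRAPH <http://example.org/graphs#cdg1> { ?s ?p ?o } }\n" +
            "  UNION\n" +
            "  { ?sub rdfg:subGraphOf <http://example.org/graphs#cdg1> .\n" +
            "    GRAPH ?sub { ?s ?p ?o } }\n" +
            "}";
        QueryExecution qe = QueryExecutionFactory.sparqlService(
                "http://localhost:8080/datastore/sparql", query); // illustrative endpoint
        ResultSetFormatter.out(qe.execSelect());
        qe.close();
    }
}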

Chapter 11

Reflection

In this final chapter the process of the thesis project is evaluated. Nothing in this chapter is directly related to the ODRAC platform. First the phases of the project are summarized. The next section describes the methodology followed throughout the phases. The communication with the different stakeholders is evaluated in the subsequent section. The final section discusses the approach taken towards the literature.

11.1 Phases

The first contact with the work Croon did on the UIA was made in the context of a six-student project in a master's course. One third of the assignment was to build a new Web interface, for which the design was on the author's account. The mock-up application delivered was not connected to the real RDF management layer. In the final phase the company offered a position for a master's thesis on a more elaborate aspect of the UIA. This thesis work started (March 2012) as an answer to that offer.

The first phase involved picking a subject and formulating a research question. The least researched aspect of the UIA at the time was the distributed configuration of data-stores. The interaction between the data-store and the interface consisted of SPARQL queries, which could take up to three hours to process, and the federation process of breaking these queries into separate parts for different data-stores and recombining their results seemed a most interesting topic. A wiki-system was set up as a platform to log and contain all used literature, log all activities and form a structured repository of all documentation and thesis fragments. At this phase an investigation was needed of the architecture, which was fairly simple, and of the data model, to see on what data the queries had to be executed. This was a difficult task, partly because of the loose and ad-hoc structure of the data model. The work needed to get the data model straight and to upgrade the implementation to a level where a federation mechanism could be injected were the clear next steps to take.

The second phase thus consisted of working out the data model and reimplementing the platform. The new implementation effort was used to switch to a Web based client, and the Web interface from the preceding project was taken as a starting point. The evaluation of the data model followed the redesign, and it only gradually became clear that beneath its lack of documentation there was a need for refinement. This data model evaluation had been one of the proposed graduation topics, and at this stage it was included in the final research goal.


When the full scale of the needed implementation work became clear, the ambition to spend time on the federation mechanism was dropped. Also some security aspects were pushed out of the scope to bring a prototype implementation within reach. A new phase started when a budget was made available for an assistant developer (September 2012).

During the third phase the production, documentation and planning environments had to be opened up for collaboration. Also the growing involvement of the Integraal Samenwerken workgroup required a ticketing system to organize feature requests. The wiki-system was maintained for the collection of the growing body of literature, but for feature management a Trac1 ticketing system was installed. Later this was replaced by a Relatics site introduced by the IS workgroup. The code was shared using an online Subversion (SVN) repository.

When the design of the UIA 2.0 version got mature enough to test its performance (January 2013), the conclusion was drawn that a comprehensive design change was needed, involving a JavaScript caching mechanism. Up to that point all operations in the WUI were directly communicated to Java operations on the server. Although conceptually clean, this meant an unacceptable performance lag. The new design involved the backbone mechanism, which was the start of phase four and UIA version 2.1 (February 2013). At that time another project within Integraal Samenwerken became active that needed an import and export mechanism inside the WUI, in combination with detailed feedback on all possible errors in a TriX-file. The maintenance of the required features, even after the implementation was ready (April 2013), and the need to communicate library changes remained a slight cause for delay. Another considerable delay was caused by the need to build a Library importer mapper and by another Mapper project the assistant programmer was assigned to. A feature request temporarily brought up in priority was the possibility to use and create Individual templates. Part of this design was taken up in the data model design in this thesis, but support in the User Interface, although finished graphically, was postponed before it was fully implemented. This work also meant a delay with no apparent result in this thesis.

Parallel to the third and fourth phase the writing of the thesis report was started. Before that, a number of summaries, reports and explanation diagrams had been collected on the wiki-site, but only after the LaTeX setup was ready (December 2012) did an outline start to be filled in. Still most time went into implementing, and for this reason, and a tendency to underestimate the time needed to write, a number of measures were needed to speed up the fifth, writing, phase. At the time of a first beta publication of the WUI and Data-store the further development was transferred to the development assistant (March 2013). At some points my input was needed, and because of the finishing phase IS was in with its projects I also had to help meet deadlines at some points. A number of demonstrations and evaluation meetings were held by Integraal Samenwerken and the future of the project was also discussed within Croon. To help me concentrate on the writing I left the office for four days a week (starting the last week of May 2013). The final literature study and design evaluation were performed during this phase.

1 http://trac.edgewall.org/


11.2 Methodology

Because of the unavailability, at the starting point, of people familiar with the data model and implementation, or of documentation with full coverage, the project started as a solo enterprise reinventing the wheel. As hinted at in the description of the phases, much use was made in this process of logging decisions and drawing diagrams. Especially for working through RDF structures of the data model, diagrams were essential. Some time was used to reflect on a formal methodology to draw RDF diagrams, but eventually an intuitive method was used, influenced by diagrams from the initial Part 11 description and some W3C documents.

From the first implementation work on, the IDE Eclipse was used for the code design and the management of the test Tomcat services. To switch between data model design, code design and stakeholder communication, the most logical causal order was followed. At the beginning of the project the system compartments and their composition were worked out intuitively, which at that project size and code base size worked efficiently. As soon as the development work was shared between two people, the work division and design decisions had to be discussed. On this scale no formal methodology was needed, but at an early stage the progress was made visible on a public server that was updated frequently. This server was configured to automatically deploy .war-files that were uploaded by the developers. This mechanism to communicate progress to stakeholders can be compared with the frequent delivery principle from the Agile methodology or the sprints from Scrum, but these methods were not followed at their full scale. Stand-up meetings for two people obviously are not meaningful when they are in constant contact already.

In the next section the communication with stakeholders is discussed in more detail, but a well-known problem that revealed itself in two forms was the difficulty of controlling how much time parts of the design take. The first form resulted in an unspoken frustration on the side of the stakeholders and an inability to clearly communicate about the needed time on the side of the development. The second form mainly emerged during the process of writing the report. Croon wanted the author back to do commercially exploitable work and the university would be critical of the time needed to deliver the result. The described measures of setting scope boundaries, transferring some development tasks and leaving the office were good, but the level of control gained over the processes felt suboptimal. In a next project the eagerness to implement some of the envisioned elements should be more confined, and the work could be divided over smaller sprints. Especially in the writing process I could have used a more honest planning, not working with unreasonable deadlines.

11.3 Stakeholder communication

The communication with the stakeholders turned out to be organized via the daily Croon manager, and only gradually did the organization of IS Project 8 and their participation in the project become clear. In the beginning the two information sources were the manager and the former programmer who made the 1.x implementation of the UIA, but communication was hard. In a later phase a colleague pointed me to the Integraal Samenwerken documentation repository, which shed some light on the history and goal of IS.


It took even more time to understand something about the ISO organization. The preceding developer had documented parts of the design that were of minor interest, and he was not able to clarify the main design or requirement analysis, or to defend the maturity of the application of RDF. The manager apparently had an overview of the design process and appeared to be ahead a number of times, but at the same time he kept me in the dark about much background information and let me reinvent the system. The requirements were hard to get clear, up to the point that I started to implement first and then ask if that was what was needed. A form of critique I got more than once was that I was not able to explain my design proposals or to introduce the design questions that I had, but the available time in those situations was a matter of minutes where much more time was needed to discuss the precise use of RDF. Also the process of getting the requirements formulated in 4.3 was difficult to control.

The position of the development assistant changed the communication channels, because for him the manager was not used as an intermediary. When the implementation was made available on a public server, some meetings were held with the developers and the Project 8 managers. Word documents from their side, with bug reports and indented responses, were used, until Relatics was introduced as the ticketing system. The communication revolved around implementation sprints and the subsequent requests to the Project 8 people to test the implemented features. The first requirements lists used for feedback contained a response to the system as the preceding developer had left the 1.x version. Although the platform was now completely different, it had to continue implementing the same features. For example, uploading attachments had become much simpler now that there was a client-server Web connection during the run of the Workspace.

The most difficult and puzzling aspect of my thesis work was which work was needed from my part to establish the professional relationships needed for my project. On the one hand the work seemed important to the stakeholders, and my interest in making a defensible thesis out of it was recognized, but this mainly resulted in leaving me alone. An overall analysis of my attitude is that I might have been too compliant, but my lack of insight and experience in organizational tact, the difficulty of getting clear organizational information at the start-up and the responses tempering my attempts did not stimulate me in this skill.

11.4 Literature referencing

All literature has been found and cited with the help of Google Scholar, with the exception of the W3C documents. The main approach has been to look for similar projects [74, 29] or literature reviews, for example on User Interfaces of SCA tools [45] or ontologies [38]. From these papers the citations were followed; for example, the related work sections of all relevant papers theoretically span the whole academic discussion. Also papers about the used Named Graphs technique [11] were entered in Google Scholar to find recent applications of the technique by searching for papers that cited them. Following influential writers like Chris Bizer and Jeremy Carroll also helped in getting an overview of the existing initiatives.


A first approach, looking for papers that would discuss in general terms which Software Engineering principles are best applied in building Semantic Web applications, resulted in a confusing response discussing how Semantic Web techniques could be used to support Software Engineering. Especially a book published by Springer-Verlag with the title Model Driven Architecture and Ontology Development [26] made me enthusiastic at first, but it gradually turned out to discuss mainly irrelevant topics and appeared not to be well informed about some definitions of ontology usage in the RDF world.

A small indication of the quality of a paper can be derived from the names and university of the authors (e.g. German universities usually score well on the subject), but while reading the papers, the most promising ones turn out to be those that hit important trends by citing many good papers, those that are able to supply well dosed details (as [29] fails to do, raising suspicion about their actual achievements) and those that appear to be aware of the generally accepted definitions of terms. This final criterion was most difficult to judge for a reader that is not already familiar with the domain.

More specific topics, like the probability background of UUIDs, were found through more targeted searches. These technical papers and the documents from W3C played an important role in design decisions. The papers discussing related work were for the biggest part processed during the fifth phase. This resulted in a somewhat reversed approach of writing the literature study afterwards. This worked well because the discussion could be focused on the design described in the design chapters. For example, the discussion of RDF User Interfaces in Section 2.2.4 is meant as a prelude to the presentation of the User Interface in Chapter 7. A large part of the relevant literature had already been selected and organized in the described wiki system. Its content had already globally influenced the design, but the detailed elaboration was done in a later phase of the writing process.


Bibliography

[1] Integraal samenwerken, 2009. http://www.integraalsamenwerken.nl/.

[2] R. Batres, M. West, D. Leal, D. Price, K. Masaki, Y. Shimada, T. Fuchino, and Y. Naka. An upper ontology based on ISO 15926. Computers & Chemical Engineering, 31(5):519–534, 2007.

[3] Dave Beckett. RDF/XML syntax specification (revised). W3C recommendation, W3C, February 2004. http://www.w3.org/TR/2004/REC-rdf-syntax-grammar- 20040210/.

[4] C. Bizer, R. Cyganiak, and E.R. Watkins. Ng4j-named graphs api for jena. In Proceeding of the Second European Semantic Web Conference (Poster). Citeseer, 2005.

[5] C. Bizer, T. Heath, and T. Berners-Lee. Linked data-the story so far. International Journal on Semantic Web and Information Systems (IJSWIS), 5(3):1–22, 2009.

[6] C. Bizer and A. Schultz. Benchmarking the performance of storage systems that expose sparql endpoints. In Proc. 4 th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), 2008.

[6] C. Bizer and A. Schultz. Benchmarking the performance of storage systems that expose sparql endpoints. In Proc. 4th International Workshop on Scalable Semantic Web Knowledge Base Systems (SSWS), 2008.

[8] V. Bolognini, A. Di Iorio, S. Duca, A. Musetti, S. Peroni, F. Vitali, and B. White. Exploiting ontologies to deploy user-friendly and customized metadata editors. In Proceedings of the IADIS Internet/WWW 2009 conference. Rome, Italy, 2009.

[9] MN Kamel Boulos, Teeradache Viangteeravat, Matthew N Anyanwu, Venkateswara Ra Nagisetty, Emin Kuscu, et al. Web gis in practice ix: a demonstration of geospatial visual analytics using microsoft live labs pivot technology and who mortality data. International journal of health geographics, 10(1):19, 2011.

[10] Milena C Caires, Simon Scerri, Siegfried Handschuh, Michael Sintek, and Ludger van Elst. A protégé plug-in development to support the nepomuk representational language. 2007.


[11] J.J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs. Web Semantics: Science, Services and Agents on the World Wide Web, 3(4):247–267, 2005.

[12] J.J. Carroll, C. Bizer, P. Hayes, and P. Stickler. Named graphs, provenance and trust. In Proceedings of the 14th international conference on World Wide Web, pages 613–622. ACM, 2005.

[13] J.J. Carroll, I. Dickinson, C. Dollin, D. Reynolds, A. Seaborne, and K. Wilkin- son. Jena: implementing the semantic web recommendations. In Proceedings of the 13th international World Wide Web conference on Alternate track papers & posters, pages 74–83. ACM, 2004.

[14] J.J. Carroll and P. Stickler. Rdf triples in xml. In Proceedings of the 13th inter- national World Wide Web conference on Alternate track papers & posters, pages 412–413. ACM, 2004.

[15] L. Colson. iring tools sdk guide, 2011.

[16] Mathieu d’Aquin, Enrico Motta, Marta Sabou, Sofia Angeletou, Laurian Gridinoc, Vanessa Lopez, and Davide Guidi. Toward a new generation of semantic web applications. Intelligent Systems, IEEE, 23(3):20–28, 2008.

[17] D.C. Faye, O. Curé, and G. Blin. A survey of rdf storage approaches.

[18] G. De Giacomo and M. Lenzerini. Tbox and abox reasoning in expressive description logics. In Principles of Knowledge Representation and Reasoning: International Conference, pages 316–327. Morgan Kaufmann Publishers, 1996.

[19] Peter Denno and Mark Palmer. Modeling and conformance testing for the engi- neering information integration standard iso 15926. 2013.

[20] A. Di Iorio, A. Musetti, S. Peroni, and F. Vitali. Owiki: enabling an ontology-led creation of semantic data. Human–Computer Systems Interaction: Backgrounds and Applications 2, pages 359–374, 2012.

[21] L. Etcheverry and A.A. Vaisman. Views over rdf datasets: A state-of-the-art and open challenges. arXiv preprint arXiv:1211.0224, 2012.

[22] P. van Exel and L. van Ruijven. Industrial automation systems and integration—integration of life-cycle data for process plants including oil and gas production facilities—part 2: Data model. ISO 15926-2:2003(E), ISO, Geneva, Switzerland, 2003.

[23] P. van Exel and L. van Ruijven. Industrial automation systems and integration—integration of life-cycle data for process plants including oil and gas production facilities—part 11: Methodology for simplified industrial usage of reference data. ISO CD-TS 15926-11, ISO, Geneva, Switzerland, 2012.

[24] Lee Feigenbaum, Gregory Todd Williams, Kendall Grant Clark, and Elias Torres. SPARQL 1.1 protocol. World Wide Web Consortium, Candidate Recommendation CR-sparql11-protocol-20121108, November 2012.


[25] E. Feliksik. Rdf gears, a data integration framework for the semantic web. Mas- ter’s thesis, Delft University of Technology, 2011.

[26] D. Gasevic, D. Djuric, and V. Devedzic. Model driven architecture and ontology development. Springer-Verlag, 2006.

[27] Paul Gearon, Alexandre Passant, and Axel Polleres. SPARQL 1.1 update. World Wide Web Consortium, Proposed Recommendation PR-sparql11-update-20121108, November 2012.

[28] Birte Glimm and Chimezie Ogbuji. SPARQL 1.1 entailment regimes. World Wide Web Consortium, Candidate Recommendation CR-sparql11-entailment-20121108, November 2012.

[29] Markus Graube, Johannes Pfeffer, Jens Ziegler, and Leon Urbas. Linked data as integrating technology for industrial data. In Network-Based Information Systems (NBiS), 2011 14th International Conference on, pages 162–167. IEEE, 2011.

[30] W3C OWL Working Group. OWL 2 document overview (second edition). Technical report, W3C, December 2012. http://www.w3.org/TR/owl2-overview/.

[31] Hans-Jörg Happel, Axel Korthaus, Stefan Seedorf, and Peter Tomczyk. Kontor: An ontology-enabled approach to software reuse. In Proc. of the 18th Int. Conf. on Software Engineering and Knowledge Engineering, 2006.

[32] H.J. Happel and S. Seedorf. Applications of ontologies in software engineering. In Proc. of the Workshop on Semantic Web Enabled Software Engineering (SWESE) at the ISWC, pages 5–9. Citeseer, 2006.

[33] Steve Harris and Andy Seaborne. SPARQL 1.1 query language. World Wide Web Consortium, Proposed Recommendation PR-sparql11-query-20121108, Novem- ber 2012.

[34] A. Harth. Query answering with distributed lightweight ontologies, 2010.

[35] B. Haslhofer, E. Momeni Roochi, B. Schandl, and S. Zander. Europeana rdf store report. 2011.

[36] Sandro Hawke. SPARQL query results XML format. World Wide Web Consortium, Proposed Edited Recommendation PER-rdf-sparql-XMLres-20121108, November 2012.

[37] Patrick Hayes. RDF semantics. W3C recommendation, W3C, February 2004. http://www.w3.org/TR/2004/REC-rdf-mt-20040210/.

[38] Martin Hepp. Ontologies: State of the art, business potential, and grand chal- lenges. In Ontology Management, pages 3–22. Springer, 2008.

[39] D. Huynh, D. Karger, D. Quan, et al. Haystack: A platform for creating, orga- nizing and visualizing information using rdf. In Semantic Web Workshop, 2002.


[40] ISO. Systems and software engineering—systems and software quality re- quirements and evaluation (square)—system and software quality models. ISO ISO/IEC FDIS 25010, ISO, Geneva, Switzerland, 2010.

[41] G. Jiang, H. Solbrig, and C.G. Chute. Mash-up of lexwiki and web-protégé for distributed authoring of large-scale biomedical terminologies. In Proceedings of the 18th Annual International Conference on Intelligent Systems for Molecular Biology (ISMB 2010) - Bio-Ontologies 2010: Semantic Applications in Life Sciences, pages 132–135, 2010.

[42] David Karger et al. The pathetic fallacy of rdf. 2006.

[43] D.R. Karger and D. Quan. Haystack: a user interface for creating, browsing, and organizing arbitrary semistructured information. In CHI’04 extended abstracts on Human factors in computing systems, pages 777–778. ACM, 2004.

[44] A. Katifori, C. Halatsis, G. Lepouras, C. Vassilakis, and E. Giannopoulou. On- tology visualization methods - a survey. ACM Computing Surveys (CSUR), 39(4):10, 2007.

[45] A. Khalili and S. Auer. User interfaces for semantic content authoring: A sys- tematic literature review. 2012.

[46] A. Khalili, S. Auer, and D. Hladky. The RDFa content editor - from WYSIWYG to WYSIWYM. In IEEE Signature Conference on Computers, Software, and Applications (COMPSAC), volume 2012, 2012.

[47] Graham Klyne and Jeremy J. Carroll. Resource description framework (RDF): Concepts and abstract syntax. W3C recommendation, W3C, February 2004. http://www.w3.org/TR/2004/REC-rdf-concepts-20040210/.

[48] Frank Manola and Eric Miller. RDF primer. World Wide Web Consortium, Recommendation REC-rdf-primer-20040210, February 2004.

[49] B. McBride. Jena: Implementing the RDF model and syntax specification. 2001.

[50] B. McBride. Jena: A semantic web toolkit. Internet Computing, IEEE, 6(6):55– 59, 2002.

[51] R.H. Michael. A conceptual framework for constructing distributed object libraries using Gellish. Master's thesis, Delft University of Technology, June 2009.

[52] N.F. Noy, M. Sintek, S. Decker, M. Crubézy, R.W. Fergerson, and M.A. Musen. Creating Semantic Web contents with Protégé-2000. Intelligent Systems, IEEE, 16(2):60–71, 2001.

[53] Chimezie Ogbuji. SPARQL 1.1 graph store HTTP protocol. World Wide Web Consortium, Candidate Recommendation CR-sparql11-http-rdf-update-20121108, November 2012.


[54] Kevin R Page, David C De Roure, and Kirk Martinez. REST and linked data: a match made for domain driven development? In Proceedings of the Second International Workshop on RESTful Design, pages 22–25. ACM, 2011.

[55] Eric Prud'hommeaux and Carlos Buil-Aranda. SPARQL 1.1 federated query. World Wide Web Consortium, Proposed Recommendation PR-sparql11-federated-query-20121108, November 2012.

[56] P.P. Pruijn. Industrial automation systems and integration—integration of life-cycle data for process plants including oil and gas production facilities—part 8: Implementation methods for the integration of distributed systems—OWL implementation. ISO 15926-8:2009(E), ISO, Geneva, Switzerland, 2009.

[57] C. Roerig. Een nieuw jasje - een nieuwe userinterface voor het bestaande kwaliteitsmanagementsysteem [A new look - a new user interface for the existing quality management system], 2012.

[58] C. Roerig. Comos van Siemens: Een kijkje in een andere keuken [COMOS from Siemens: a look into another kitchen], 2013.

[59] N. Rozanski and E. Woods. Software Systems Architecture: Working With Stakeholders Using Viewpoints and Perspectives. Addison-Wesley Professional, 2005.

[60] L. Sauermann, R. Cyganiak, and M. Völkel. Cool URIs for the Semantic Web. 2006.

[61] M. Schaffer, P. Schartner, and S. Rass. Universally unique identifiers: How to ensure uniqueness while protecting the issuers' privacy. Proceedings of Security and Management, pages 198–204, 2007.

[62] Michael Sintek, Ludger Van Elst, Gunnar Grimnes, Simon Scerri, et al. Knowledge representation for the distributed, social Semantic Web: named graphs, graph roles and views in NRL. 2007.

[63] M. Strömman, I. Seilonen, J. Peltola, and K. Koskinen. Integration of optimization to the design of pulp and paper production processes. Simulation and Modeling Methodologies, Technologies and Applications, pages 239–254, 2012.

[64] M.C. Suárez-Figueroa, R. García-Castro, B. Villazón-Terrazas, and A. Gómez-Pérez. Essentials in ontology engineering: Methodologies, languages, and tools. 2011.

[65] E. Topping. An Introduction to ISO 15926.

[66] T. Tudorache, N. Noy, S. Tu, and M. Musen. Supporting collaborative ontology development in Protégé. The Semantic Web - ISWC 2008, pages 17–32, 2008.

[67] T. Tudorache, J. Vendetti, and N.F. Noy. Web-Protégé: A lightweight OWL ontology editor for the web. 2008.

[68] Andries Simon Hendrik Paul Van Renssen. Gellish: a generic extensible ontological language - design and application of a universal data structure. 2005.


[69] Leo van Ruijven. Ontology for systems engineering: Model-based systems engineering. In Computer Modeling and Simulation (EMS), 2012 Sixth UKSim/AMSS European Symposium on, pages 371–376. IEEE, 2012.

[70] F. Verhelst, F. Myren, P. Rylandsholm, I. Svensson, A. Waaler, T. Skramstad, J.I. Ornas, B. Tvedt, and J. Høydal. Digital platform for the next generation IO: A prerequisite for the high north. In SPE Intelligent Energy Conference and Exhibition, 2010.

[71] W3C SPARQL Working Group. SPARQL 1.1 overview. World Wide Web Consortium, Proposed Recommendation PR-sparql11-overview-20121108, November 2012.

[72] E. Watkins and D. Nicole. Named graphs as a mechanism for reasoning about provenance. Frontiers of WWW Research and Development-APWeb 2006, pages 943–948, 2006.

[73] K. Wenzel. Ontology-driven application architectures with KOMMA. In Proceedings of the 7th International Workshop on Semantic Web Enabled Software Engineering (SWESE), 2011.

[74] A. Wiesner, J. Morbach, and W. Marquardt. Information integration in chemical process engineering based on semantic technologies. Computers & Chemical Engineering, 35(4):692–708, 2011.

[75] Gregory Todd Williams. SPARQL 1.1 service description. World Wide Web Consortium, Proposed Recommendation PR-sparql11-service-description-20121108, November 2012.

Appendix A

Ontology review

Ontology is the study of what exists, and in the context of Semantic Web technology it is a formal definition of classes and relations (properties) grouped in a namespace. The ontology used in the Universal Information Adapter (UIA) in Project 8 of Integraal Samenwerken is referenced by the name Meta Resource Description Library (MRDL) and is identified by the namespace http://is.croonprojects.com/mrdl. This document is an evaluation of the management, structure and content of the ontology as it exists prior to development of the Web client (June 2012). First the function played by the ontology in the context of the UIA and its change procedures (now and in the future) are evaluated. In the second section the structure of the ontology, e.g. which concepts it aims to map, how it relates to other ontologies, etc., is discussed.

A.1 Management

A.1.1 Current change management

Currently the ontology is a work in progress. The two documents that specify the full MRDL are:

• MRDL4.rdf - an XML-formatted list of all classes and properties, with label and description.

• MRDL one page. - a diagram with all entities and most subclass, domain and range relations.

The model is managed by one Java programmer who keeps the two files up to date and introduces concepts with the functionality of the data store server (called Converter) in mind. New concepts are introduced and others evaluated in consultation with the project manager, who guards the ISO compliance and makes the design decisions. For example, a change was issued in May 2012 to add the possibility to model Locations and three-dimensional Shapes. In accordance with this demand, the programmer changed the two reference files listed above and changed the converter and client implementation.

[Figure A.1 is a decision chart showing all MRDL classes and properties with their rdf:type, rdfs:subClassOf, rdfs:domain and rdfs:range relations, together with the namespace abbreviations used: dct/dcterms (http://purl.org/dc/terms/), foaf (http://xmlns.com/foaf/0.1/), rdl (http://is.croonprojects.com/rdl/), mrdl (http://is.croonprojects.com/mrdl/), rdfg (http://www.w3.org/2004/03/trix/rdfg-1/), swp (http://www.w3.org/2004/03/trix/swp-2/), xsd, rdfs and rdf.]

Figure A.1: MRDL one page.pdf


A.1.2 Problems with the current change management

A disadvantage of the ad-hoc way the ontology is built is its lack of a thought-through construction and some conceptual confusion (see Section A.2). Also, choices to include certain concepts (like mrdl:Stream) are not documented, and from the minimalistic description of these concepts it can be concluded that both a definition of their meaning and confidence in their appropriateness are absent.

A.1.3 The standardization function

The ontology is in its proof-of-concept phase and functions as a subset of the enormous ontology defined in the ISO 15926 Part 4 Reference Data definition, which in its turn is a subset of the Gellish taxonomy. The ontology will be published in ISO 15926 Part 11, together with an API description of how the UIA communicates with triple stores. To reach the level of a publishable ontology, the current state is tested in a sequence of test setups.

A.1.4 Possible future change management

A number of things could be improved in order to raise the quality of the ontology towards its intended mature state. The W3C has working groups and Principles of Good Practice on publishing and maintaining ontologies that guide things like choosing good names and versioning, and some of these recommendations could be of use for the present ontology. Furthermore, its construction should be defended in some form of documentation, and the issues identified in the rest of this document (picked at first sight, with no claim to full coverage) might be considered.

A.2 Structure

The ontology aims to map a certain view on reality. The concepts it is interested in stem from the engineering world. The Gellish language is a Generic Engineering Language, and a subset of this language is enough to apply it to the ship building industry. The present ontology is expected to describe this subset, but it contains more than a clean set of concepts mapping the ship building world. It mixes reality concepts with implementation concepts (A.2.1), sometimes there is a difference between the ontology and the implementation (A.2.2), it contains an overstated set of labels (A.2.3), it has a too complex reference to the Gellish language (A.2.4), which obscures the overview, and at some places other namespaces are used in an incorrect way (A.2.5).

A.2.1 Graph

The facts that are modeled with the ontology need to be addressable. In that way, meta statements can be made about each stated fact. This introduces a difficulty into the ontology, because it somehow needs to work on two levels (the network of facts versus the statements about each fact individually). For this reason, a concept like stated fact needs to be introduced.

In the current ontology the concept of a swp:Graph is borrowed from the TriX namespace to do the trick, because it is able to group statements in a graph that can be named. From the developer's viewpoint this is understandable, because each stated fact has to be put in a Named Graph to be able to make statements about it, but from a conceptual viewpoint this is confusing. A stated fact is not the same thing as a Graph (each stated fact is stored in a Graph, but not each Graph instance carries a stated fact).

Because of its name (and origin in the TriX namespace) the swp:Graph does not seem to represent a concept of the Gellish-like modeled construction world, but an entity that is needed for the technical implementation. This means we have two different ontology worlds (differing in what they model) that are mixed in one (thus confusing) model.

If we look at the way data is structured in the triple store, we see that not only stated facts are modeled as a swp:Graph; a mrdl:PhysicalObject as a collection of properties is also grouped by a swp:Graph (which in that case means something a little different from a stated fact). A practical implication is that the identifier of a mrdl:PhysicalObject is the name of the swp:Graph that specifies it, so the URI points at the same time to the swp:Graph instance and the mrdl:PhysicalObject instance.

Connected to this swp:Graph is the swp:WarrantGraph, containing some possibly useful information, which in its turn is not included in the ontology. Either the swp:WarrantGraph should be included in the ontology (which confuses the two worlds) or both swp:Graph and swp:WarrantGraph should be left out; from a conceptual viewpoint both should not be there. A better way to build the ontology is to separate the meaningful concepts from the technical implementation. On the other hand there should be no difference between those two, because the whole aim of Semantic Web technology is to merge the concepts and their technical implementation.

One way to solve the confusion is to introduce a mrdl:StatedFact that is the domain of mrdl:Metadata, and let it have swp:Graph as its superclass. If we also make mrdl:GellishObject a subclass of swp:Graph we can keep the way the data is stored in the triple store, but we define it more meaningfully in the ontology. In this solution we still have the mrdl:GeneratedMetadata that should still have swp:Graph (or swp:WarrantGraph) as its domain. But it would still be clearer.
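To make the proposal concrete, a minimal sketch is given below as a SPARQL update; mrdl:StatedFact is the hypothetical new class suggested above, and the prefixes follow the namespace table of Figure A.1.

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX swp:  <http://www.w3.org/2004/03/trix/swp-2/>
PREFIX mrdl: <http://is.croonprojects.com/mrdl/>

# Proposed restructuring: keep the Named Graph storage layout, but name
# the conceptual level explicitly. mrdl:StatedFact is hypothetical.
INSERT DATA {
  mrdl:StatedFact    rdfs:subClassOf swp:Graph .
  mrdl:GellishObject rdfs:subClassOf swp:Graph .
}

With these two axioms the data in the triple store stays exactly as it is, while the ontology distinguishes the stated fact from its technical container.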

A.2.2 Implementation differences

The swp:WarrantGraph contains the relation swp:authority that points to the owner of the fact, which according to the ontology should be stored by letting dct:publisher point from a swp:Graph to a mrdl:Organization. This is probably formally correct (because swp:WarrantGraph will have swp:Graph as superclass), but the ontology does not relate directly to the technical implementation, and the more closely it would, the better.

A.2.3 Use of labels

Throughout the list of classes and relations (properties) many concepts get a label. When an instance of a class is made (e.g. a mrdl:PhysicalObject) the rdfs:label gives the name of the instance, but in the ontology labels are also added to definitions (as opposed to instances), which are used by the Java client (instead of the right hand part of the URI or the rdfs:comment) as human readable and nicely formatted names. It could be questioned whether the ontology is a good place to store this information, but it is valid and informative. Except for some confusing URI-label combinations:

• dct:publisher is called owner, which is not automatically the same thing (this is also wrong for a different reason, see the last paragraph of this section on namespaces);

• mrdl:classStatus gets label Status, when there is also a mrdl:ClassStatus/Status with label Class status class, which in its turn is a subtype of mrdl:Status (with label Status) and which should not be confused with mrdl:status (without capital) with label Intention;

• mrdl:Qualification/GraphStatus/Deleted gets label Aborted, which is not precisely the same thing;

• mrdl:Qualification/Certainty gets label Graph certainty class whereas mrdl:QlProperty/Certainty gets label Certainty property (this shows that Graph and Property are confused in the label names, see the previous paragraph, but it also shows the abbreviation QlProperty, which is untidy if not dangerous).

Furthermore the introduction of mrdl:reverseLabel only adds overhead. Any human reader who wants to read the data at RDF level is able to reverse labels himself. For automated reasoning it has no function (provided the Java client does not use it somehow).

A.2.4 Verbose relating to complex Gellish categories

In the ontology most relations are brought into a complex class-subclass relation with GellishRelation, which is extended by DefiningRelation, GellishConceptualRelation, GellishActualRelation, isRealizedBy, Metadata and reverseLabel, which are again further subclassed. This does not seem to serve a constructive purpose. The choice where to use a subclass relation and where to instantiate is not very clear. For example, mrdl:isRealizedBy has as rdf:type the mrdl:GellishRelation, where mrdl:GellishRelation has mrdl:DefiningRelation as rdfs:subClassOf. The needless complexity and unclear use of the class/instance distinction is furthermore illustrated by the fact that mrdl:isMaterializedBy takes a mrdl:PhysicalObject both as range and as domain, which is conceptually difficult to understand (how can a PhysicalObject be in need of materialization?). As another illustration of needless complexity we find that the mrdl:isMaterializedBy relation is a mrdl:isRealizedBy of the mrdl:canBeRoleFor, which is of type mrdl:CanRelation, which is a rdfs:subClassOf mrdl:GellishConceptualRelation. There is no apparent reason to take this structure up into the ontology.

A.2.5 Namespaces

The ontology borrows from the following namespaces:


Namespace      Description
dct / dcterms  The Dublin Core terms namespace. This is used for the meta-data.
rdf            The standard namespace for RDF relations.
rdfs           The RDF schema namespace.
rdfg           The namespace for named graphs.
swp            The SWP namespace, used for authentication of named graphs.
xsd            Used for data types.

Table A.1: Used namespaces.

It is good practice to relate concepts from a new ontology to other ontologies, but it is wrong to override the domain and range definitions of those concepts. A number of dcterms relations are said to be instances of mrdl:GeneratedMetadata and have rdfg:Graph as range. The rdfs:isDefinedBy is used in the right way. But for both the wrongly and rightly used concepts there is no need to take them up in the new ontology, because they can be used anyway. If the concepts need to be further specified, new resources should be defined in a self-made namespace.

A.3 Conclusion

Currently the ontology is a mix of a clean set of reality-describing concepts, some references that are needed for the implementation-level concepts (the Graph and the Labels) and a complex reference to the Gellish framework. The latter two might not be needed, and removing them would clear up the ontology considerably. The use of concepts from other namespaces is wrong at some points, and the naming in general might be reviewed so that it becomes concise (capitalization, abbreviation) and the meanings adhere to one context only.

Appendix B

Quality aspects

In order to structure the formulation of quality aspects that are important in the context of the Universal Information Adapter, two listings are used: first the ISO 25010 System and software quality models [40] as a generally applicable quality model, and secondly a systematic literature review on quality aspects of Semantic Content Authoring systems and their User Interfaces [45]. Although the field of SCA focuses on semantically enriching documents and differs in that respect from the knowledge base centered UIA, the quality aspects are worth evaluating because of the overlap in semantic data creation using a user interface. All terms of both listings are given below with an indication of how important they are for the UIA and a short defense. The most important aspects are also described in Chapter 1.3.

B.1 ISO 25010 quality in use

effectiveness (average) Users generally have the time and tasks are not pre-programmable, but the result should always be exactly the same.

efficiency (average) Tasks should not require excessive resources.

satisfaction

usefulness (average) The user experience cannot be predicted fully because of the new workflow the platform will cause.

trust (high) Because of the political function of the UIA prototypes within Integraal Samenwerken.

pleasure (low) Users will have professional motives to use it, not personal.

comfort (average) The platform is not in the first place meant to ease work, but to structure it, yet comfort is important.

freedom from risk

economic risk mitigation (high) As an engineering knowledge base platform, an application on a financially challenging project influences the economic risks.


health and safety risk mitigation (high) As an engineering knowledge base platform, an application on a safety challenging project influences the safety risks.

environmental risk mitigation (high) As an engineering knowledge base platform, an application on an environmentally challenging project influences the environmental risks.

context coverage

context completeness (high) The platform must give the user at all times control over what consequences the data he enters has, and secure the values with control mechanisms.

flexibility (high) The platform should be easily configurable to support data that was not supported before.

B.2 ISO 25010 product quality

functional suitability

functional completeness (low) Because of the prototype state there is no good way to test if all needed functionalities are there.

functional correctness (average) Important, but the supported functionality is not complex.

functional appropriateness (average) Important, but the UI follows the data structure very closely, so the mapping from user actions to results is not complex.

performance efficiency

time behavior (average) There is no special time dependence in any part of the system, apart from user convenience.

resource utilization (high) The system should be able to run multiple projects and support many users collaborating, without significant delays.

capacity (high) The system should be able to contain enormous amounts of data. Especially in the field of Semantic stores this is a point of active research.

compatibility

co-existence (high) The use of the Workspace User Interface is only one of the possible access points to the ODRAC platform. The data-stores that are part of an ODRAC project should be accessible by other applications as well.

interoperability (high) The main purpose of the UIA is open data exchange, so the interoperability of data is of high importance.

usability

appropriateness recognisability (low) End users will not decide to take the UIA into use, but will encounter it professionally.

learnability (average) The UIA should help the user as much as possible, but if some instruction is required this is no problem.

operability (average) Because of the prototype status most attention is paid to supporting normal user actions; control of the system may require manual operations by a developer.

user error protection (high) The main purpose of the WUI is to guide users to produce data that is always formally correct and carries precisely the intended message.

user interface aesthetics (average) An aesthetic interface will help users find their way more quickly and with less effort, but their choice to use it does not depend on it.

accessibility (low) Only professional users with specialized engineering skills will be asked to use the UIA.

reliability

maturity (average) Stable performance is normally important. Failures are acceptable as long as the system performs acceptably overall.

availability (average) No critical processes depend on constant availability, but an interrupt does mean the stagnation of the engineering process within a project.

fault tolerance (average) The data produced should never be in an inconsistent form, but possible down time due to hardware or software malfunctioning is not of high concern.

recoverability (high) More important than constant availability is the robustness of the data contained in the UIA. If the system crashes, it is important that within a reasonable time frame the system becomes active again without loss of data, or at least with as little loss as possible.

security

confidentiality (high) For some projects the information stored in the UIA is sensitive, for example construction data of navy ships.

integrity (high) Idem.

non-repudiation (high) Users will be legally bound to the data they enter in the system, so the system needs to be faultless in tracking provenance data.

accountability (high) Idem.

authenticity (high) Similarly, the system needs to be faultless in identifying its users.


maintainability

modularity (average) The code should be readable and extendable, but parts are not expected to be generally applicable enough to publish as a library. Only ODRAC services should be able to run the code.

reusability (average) Idem.

analysability (low) Because the code base is small and the project is in a prototype phase, the analysis during development and at run-time is done ad hoc.

modifiability (high) The UIA is expected to undergo many changes, and ODRAC is presented as an extendable platform, so it should be easily extended.

testability (average) In order to uphold data quality the result of a user action should always be a correct data-set. This should be testable, but the system does not have high complexity on this point.

portability

adaptability (average) Because it is expected to work at different companies, a weak dependency on the underlying hardware and software is desirable.

installability (low) It is acceptable if installation takes some time, because the services are accessible over a network and only need to be installed once.

replaceability (high) The WUI should be interchangeable with another application as long as it adheres to the ODRAC data format and Data-store API.

B.3 Semantic Content Authoring

usability (sufficiently covered by ISO)

customizability (average) Different users should be allowed to align the tools in the UI in a way that makes their work easy.

generalizability (high) Yet the biggest customization power should not be at the interface side, but in the project specific configuration and ontology.

collaboration (high) The platform should support efficient communication and collaboration in the workflow process.

portability (average) The User Interface of the platform should be usable without complex installation.

accessibility (sufficiently covered by ISO)

proactivity (average) The platform should help the user as much as possible with predefined values and limited options based on the context, yet it will not go as far as to guess.

automation (low) This prototype version will not do interpretation or automatic data translation. Mappers might incorporate this functionality.

evolvability (high) The data-set evolves constantly, and the platform should support this evolution fully.

interoperability (sufficiently covered by ISO)

scalability (sufficiently covered by ISO)


Appendix C

Graphs

In the ODRAC graphs design the notion that a Named Graph represents a publication unit is closely followed. Every fact from the knowledge base that should be changeable independently is put in a separate graph so it can be replaced. Each first or replacing graph can be traced back to its creator, which makes a graph the representation of a knowledge statement. In the UIA context such a claim might even have legal implications. In Section C.1 the considerations are shared on what parts of the knowledge network to put in separate graphs. In Figure C.1 the overview of graph types is copied from Section 5.1.2. In Section C.2 the content of all these graph types is described as an explanation of the primer vocabulary listed in Appendix D. The RDF properties described in this section are used inside the Data-store. In Section C.3 the use of some extra RDF data is listed that is not meant for storage in a data-store, but is used to temporarily encode construction states.

C.1 Granularity

The way graphs are used may need some explanation. It is not the question why graphs are there, but how the level of granularity was chosen. Both in the RDF semantics and in quad store frameworks all triples are in one default graph if no further specification is given of the graph each triple lives in. At the other end of the spectrum a different graph could be made for each triple. A number of considerations play a role in choosing the appropriate granularity: performance, adherence to design principles, the expressiveness of SPARQL queries over the structure and the role of graphs in the information model.

The question of performance needs to be directed to the low level implementation of the used quad store, which is TDB. The smaller the number of triples carried in a graph, the more graphs there are, and this might have effects on how quickly queries can be answered. Benchmarking different RDF stores on the use of Named Graphs and the scalability effects has not yet been done extensively [7] and remains future work [6], but expert comments suggest that the number of Named Graphs in a TDB store does not influence the performance, because all triples are stored in quads anyway.1 It is also pointed out that the two alternative options of using reification (see Section 2.1.2) or incorporating the collection element in the information model both decrease the performance.

In the introduction paper of the concept of Named Graphs a difference is made between Named Graphs and the RDF graph that the Named Graphs represent [11]. The original RDF graph relates to a document (e.g. a vocabulary) consisting of statements, which is also referenced by a URI, but the new concept makes it possible to specify within a document for each triple to what virtual Named Graph it belongs. Although in the Jena framework the two concepts overlap, because each RDF graph is a Model with a name, it separates the meaning of a Named Graph as a container from the part of an RDF graph it contains. There is considerable practice of using Named Graphs as containers to facilitate meta-data notation like provenance or access control, giving credit to this new meaning [12, 35, 21]. In [11] the GRAPH keyword in SPARQL and the definition of an RDF data-set as a set of a default graph and zero or more Named Graphs are pointed out as the W3C adoption of the Named Graph concept. One SPARQL query can define a pattern over different graphs. What the graphs represent is open to the information model design, but the use of graphs as containers makes it easier to query concepts that can contain data. The data versus meta-data split can be connected to this, resulting in better readable queries.

If no performance or design principle dictates the granularity, it is left up to the design of the data model and the way SPARQL queries can be built around it. In [11] the concept of the Named Graph is compared to two algorithmic ways to break a full RDF graph into subgraphs. Apart from these algorithmic approaches very little guidance is given on how to decide on the granularity, apart from explaining the purpose that Named Graphs allow collections of triples to be published as independent units, and to retain the integrity of the publication unit. This permits metadata to be added about the publication unit, such as metadata about the publication process etc.

The first algorithmic approach is the Minimum Self-contained Graph (MSG). It gives each triple its own graph, except if it has a blank node as subject or object. In such a case all other references to that blank node are also contained in that graph. The other approach is the Concise Bounded Description (CBD), which is built around a URI. All triples containing the URI as subject or object are collected into the graph, and all triples referencing blank nodes that are contained in this initial set of triples are also part of it. In ODRAC an approach is used that is very similar to the MSG, especially for the ComplexDataGraph.

1 http://answers.semanticweb.com/questions/3961/jena-tdb-and-quads
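As an illustration of the CBD idea, the SPARQL sketch below collects a one-level bounded description of a resource; ex:resource is a placeholder URI, and a full CBD would additionally recurse into blank-node objects, which a single plain SPARQL 1.1 query cannot express.

PREFIX ex: <http://example.org/>

# One-level approximation of a Concise Bounded Description: collect all
# triples, from any Named Graph, that have the given URI as subject.
CONSTRUCT { ex:resource ?p ?o }
WHERE {
  GRAPH ?g { ex:resource ?p ?o }
}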

C.2 Stable states

Figure C.1 gives an exhaustive overview of all the graph types that can be found in ODRAC. Each graph type like DictionaryHeaderGraph is formally described in the primer vocabulary as a Resource of type rdfg:Graph, but few graph instances are given an explicit type definition. Graphs are defined implicitly by SPARQL patterns: if a graph matches such a query, it is assumed to be of the related type. This is a weak definition, and theoretically it is possible that one graph matches more than one pattern. All graphs should have a unique name. PublishableGraphs form the instance base and most of them can be replaced, which requires a unique name. The PublishableGraphs that are not replaceable represent the identity of an Individual, so they should

also be unique. Library graphs are allowed to live in different data-stores at the same time, but the same name should mean complete similarity, also in the fields that are not used by ODRAC.

[Figure C.1 is a diagram of the graph types PublishableGraph, DictionaryHeaderGraph, IndividualTypeGraph, ReplaceableGraph, RelationshipTypeGraph, IndividualGraph, RelationshipGraph, RelationshipTemplateGraph, IndividualTemplateGraph, TypeDefinitionGraph, ComplexDataGraph, TranslationGraph, IndividualRelationshipGraph and TerminationGraph.]

Figure C.1: All Graph types used by the ODRAC platform.

Below the purpose and content of each graph type is described. The first graph is displayed in TriX format as an example of a serialized version; the rest are described using a list of contained predicates. When a certain relationship has a (default) value, omitting the relation altogether means that this default value is assumed.

C.2.1 DictionaryHeaderGraph

One of the locations a graph can reside in is the library, which can be filled with the content of selections of dictionaries. The ideas behind this are explained in Section 5.8, but here it is important to note that it is difficult to reconstruct afterwards which dictionaries were loaded in a library. In order to solve this, each dictionary collection contains exactly one graph representing the dictionary, containing all relevant information about the dictionary. Technically it is not the first element of the dictionary, but because of its identification function it is called a DictionaryHeader.

In the current stage of the design not much information is needed about a dictionary. Therefore a typical DictionaryHeader does not contain more than a name and a type definition. Below a TriX fragment is shown. If for example all library data of a project is contained in one dictionary, the URI could be http://www.uia15926-11.com/rdl/part11/0.1#.

<TriX xmlns="http://www.w3.org/2004/03/trix/trix-1/">
  <graph>
    <uri>http://www.uia15926-11.com/rdl/part11/0.1#</uri>
    <triple>
      <uri>http://www.uia15926-11.com/rdl/part11/0.1#</uri>
      <uri>http://www.w3.org/1999/02/22-rdf-syntax-ns#type</uri>
      <uri>http://www.uia15926-11.com/2012/03/uia#DictionaryHeaderGraph</uri>
    </triple>
  </graph>
</TriX>


C.2.2 PublishableGraph

All graphs that are publishable, meaning that an end-user can create them, should contain two triples. Conversely, any graph that contains those two relations is a PublishableGraph. The PublishableGraph is depicted in Figure C.1 to show that this graph type is a generalization over some graph types; no graphs are supposed to apply only to this type definition. A query sketch of the implicit type test follows the two relations below.

uia:creator points to an Individual representing a user of the system.

uia:created a time stamp of the moment the graph was created.
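As announced above, a sketch of the implicit type test, under the assumption that the two triples are stored inside the graph with the graph URI as subject (the uia: prefix abbreviates the vocabulary namespace from the TriX fragment above):

PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>

# Any graph carrying both triples about itself counts, by the implicit
# SPARQL-pattern definition, as a PublishableGraph.
SELECT ?g ?creator ?created
WHERE {
  GRAPH ?g {
    ?g uia:creator ?creator ;
       uia:created ?created .
  }
}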

C.2.3 IndividualTypeGraph

An IndividualTypeGraph carries an ontology class or subclass definition. In the UIA application of ODRAC these graphs represent ISO 15926-4 classes. To be complete, apart from the uia:uniqueName (equivalent to p11:has_unique_name) and rdfs:subClassOf, which are obligatory, the relations p11:has_unique_number, p11:is_defined_as and rdfs:comment might be supplied in the graph to follow the ISO 15926-11 format. All IndividualTypes eventually inherit from uia:OntologyElement. In order to iterate easily over all super-classes of an IndividualType, a special link is included that points directly to all super-classes a type inherits from. Adding this uia:inheritsFrom relation is a form of inferencing done at ontology creation, in order to overcome the graph recursion deficiency of SPARQL (see Section 2.1.4); a query sketch follows the list below.

uia:uniqueName a string meant to uniquely identify the class for human readers.

uia:inheritsFrom a relation of this type is made to all its predecessors, up to the highest parent uia:OntologyElement. See Section 5.8.2.

rdfs:subClassOf either another IndividualTypeGraph or uia:OntologyElement.
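As a sketch of the iteration this enables, the query below looks up all ancestors of a class in one flat pattern; ex:SomeClass is a placeholder URI, and the assumption is that the uia:inheritsFrom triples reside inside the IndividualTypeGraph itself.

PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>
PREFIX ex:  <http://example.org/rdl/>

# Flat lookup of every ancestor of a class. Without the materialized
# uia:inheritsFrom links this would need a recursive rdfs:subClassOf
# traversal crossing graph boundaries.
SELECT ?super
WHERE {
  GRAPH ex:SomeClass { ex:SomeClass uia:inheritsFrom ?super }
}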

C.2.4 IndividualGraph

The creation of an Individual cannot be undone, but all the relationships it has, for example to a graph containing its name, can be replaced. Even the type definition of an Individual can be changed to some extent (see C.2.8). An IndividualGraph cannot be replaced, because its graph name (URI) has to remain active during the full life-time of the project. This is the reason the graph is almost empty. The only thing contained in it is what every user-creatable graph should contain: the creator and created meta-data fields.

C.2.5 IndividualTemplateGraph

An IndividualGraph can become an IndividualTemplateGraph through two causes or a combination of them: as soon as any RelationshipTemplateGraph points to the IndividualGraph using the uia:template relation, or as soon as any other IndividualGraph points to the graph using uia:derivedFrom. The content of the graph does not change. For example the type definition can still be changed to a lower type. The graph is now a special Individual, depicted as a subclass in Figure C.1. See Section 5.3 on the inheritance mechanism the IndividualTemplateGraph is part of.

C.2.6 ReplaceableGraph

Graph A can replace graph B if both are of the same graph type and (for RelationshipGraphs) satisfy the same RelationshipTemplateGraph or (for RelationshipTemplateGraphs) have the same uia:template, uia:fulfilledBy, uia:relationshipLevel, uia:domain and uia:domainContainer. The graphs that were uia:derivedFrom the replaced RelationshipTemplateGraph still depend on that graph, so it remains active, but it cannot be used for new items. Replaced RelationshipGraphs are no longer active. See Section 5.5 for more details on the use of the uia:replaces relation; a query sketch follows the two relations below.

uia:replaces (optional) a graph of similar type and comparable content which is now older and should not be used for new Individuals or Relationships.

uia:origin points to the graph itself if it does not replace another graph, or, if it does, it points to the beginning of the graph chain.
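As announced above, a sketch of how the active head of a replace chain can be found with these two relations, assuming both are stored inside the graph with the graph URI as subject:

PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>

# For every replace chain, select the graph that is not itself replaced
# by a newer graph; ?origin identifies the chain.
SELECT ?origin ?active
WHERE {
  GRAPH ?active { ?active uia:origin ?origin }
  FILTER NOT EXISTS {
    GRAPH ?newer { ?newer uia:replaces ?active }
  }
}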

C.2.7 RelationshipGraph

This graph type describes what the TypeDefinitionGraph, ComplexDataGraph and IndividualRelationshipGraph have in common. All three carry exactly one triple that has an Individual URI as subject. This is the Individual that is the starting point of the relationship. The object of this triple can either be an IndividualTypeGraph (which makes it a TypeDefinitionGraph), an IndividualGraph (which makes it an IndividualRelationshipGraph) or a blank node (a ComplexDataGraph). The second thing they have in common is the triple which relates the graph itself (subject) to a RelationshipTemplateGraph (object) using uia:derivedFrom (predicate).

uia:derivedFrom points to the RelationshipTemplateGraph it is based on.

?relation the relation prescribed in the RelationshipTemplateGraph. The object depends on the type of the RelationshipGraph (see directly below).

C.2.8 TypeDefinitionGraph

The most often used Relationship, relating Individuals to Individual types (classes), is the rdf:type definition. For each Individual a graph containing this definition should exist; an example update is sketched after the relation below.

rdf:type points to the IndividualTypeGraph the subject is an instance of.
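As announced above, a sketch of a complete TypeDefinitionGraph as a SPARQL update; all ex: names are hypothetical and the meta-data triples follow the PublishableGraph convention described earlier.

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>
PREFIX ex:  <http://example.org/project/>

# Classifies the Individual ex:pump-17 as an instance of ex:Centrifugal.
INSERT DATA {
  GRAPH ex:typedef-42 {
    ex:pump-17    rdf:type        ex:Centrifugal .
    ex:typedef-42 uia:derivedFrom ex:rtg-classification ;
                  uia:creator     ex:user-1 ;
                  uia:created     "2012-06-01T09:00:00"^^xsd:dateTime .
  }
}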

C.2.9 IndividualRelationshipGraph

The IndividualRelationshipGraph is the simplest form of the RelationshipGraph, because it only consists of the two triples that all RelationshipGraphs share. The subject and the object of the ?relation are both an Individual.


C.2.10 ComplexDataGraph

This is the most extensive graph, because it contains one Relationship based on a RelationshipTemplateGraph (mentioned in the uia:derivedFrom triple), a blank individual and any number of relationships from this individual to external Individuals or included typed Literals. Just like all RelationshipGraphs the graph contains a ?relation with an IndividualGraph as subject. This time the object is a blank node. A type definition of this blank node instance is omitted, because it can be directly derived from the RelationshipTemplateGraph this CDG is an implementation of. The CDG also contains any number of second order relations pointing from this blank node to typed Literals or other Individuals; an example is sketched after the item below.

?relation_second_order any number of relations that the blank node subject should have.
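As announced above, a sketch of such a graph; the ex: names and the two second-order relations are hypothetical, and the blank individual deliberately carries no type definition.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>
PREFIX ex:  <http://example.org/project/>

# One Relationship from a named Individual to a blank individual, plus
# second order relations from the blank node to a typed Literal and to
# an external (library) Individual.
INSERT DATA {
  GRAPH ex:cdg-7 {
    ex:pump-17 ex:hasProperty  _:mass .
    _:mass     ex:hasValue     "350.0"^^xsd:double ;
               ex:onScale      ex:kilogram .
    ex:cdg-7   uia:derivedFrom ex:rtg-mass ;
               uia:creator     ex:user-1 ;
               uia:created     "2012-06-01T09:05:00"^^xsd:dateTime .
  }
}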

C.2.11 RelationshipTypeGraph

The RelationshipTypeGraph can be compared to the IndividualTypeGraph. It defines the use of Relationships by relating them to uia:OntologyRelationship using rdfs:subPropertyOf paths. In the current design of ODRAC there is no reason to specify groups of Relationships, so typically all relations are a rdfs:subPropertyOf uia:OntologyRelationship. The two other important triples in a RelationshipTypeGraph are the uia:preferredLabel and the uia:reverseLabel. Both are untyped Literals and contain a label that should be used to describe the meaning of the relationship and the reversed meaning respectively. The uia:nounLabel is allowed (as is any other non-interfering addition), but is not used by the WUI.

rdf:type (redundant) rdf:Property.

rdfs:subPropertyOf some other RelationshipTypeGraph, uia:PrimerRelationship or uia:OntologyRelationship (default).

uia:preferredLabel a plain Literal describing the meaning of the relation.

uia:reverseLabel a plain Literal describing the meaning of the relation in the opposite direction.

uia:nounLabel (optional) a plain Literal with a one-word description of the relationship.

C.2.12 RelationshipTemplateGraph

The allowed use of the Relationships defined in RelationshipTypeGraphs is put in RelationshipTemplateGraphs, which thus act as templates for relations between Individuals. The following properties need to be specified; a filled-in example is sketched after the list.

uia:template either uia:DefaultTemplate (default) or an IndividualTemplateGraph.

uia:predicate the URI of the RelationshipTypeGraph.

uia:relationshipLevel either uia:MetaLevel or uia:DataLevel.

136 Graphs C.2 Stable states uia:modality (redundant) if the presence of exactly one of this Relationship type is obligatory for the Individual in uia:domain, the uia:modality of this relation- ship template is uia:ShallBe, if zero or more relations of this type are allowed this property points to uia:CanBe. This can also be set by using the cardinality parameters. uia:minCardinality any unsigned integer prescribing the minimal number of this relation the range element should have with a domain element. uia:maxCardinality any unsigned integer prescribing the maximum number of this relation the range element is allowed to have with a domain element. uia:domain the URI of the IndividualTypeGraph which (or any of its children) is the starting point (subject) of this template. uia:domainContainer the type of Individual this RTG applies to, either an Indi- vidualGraph or ComplexDataGraph (see uia:rangeContainer). uia:range the URI of the IndividualTypeGraph which (or any of its children) is the end point (object) of this template, or a data-type from a predefined list (see below). uia:rangeContainer describes in which container the object of the relationship is stored. There are two options, IndividualGraph or ComplexDataGraph. The type of RelationshipGraph needed for the relationship follows immediately from this choice. If the object Individual needs to be globally identifiable it should be contained in an IndividualGraph, if the object only has meaning in relation to the subject the object can be stored in a ComplexDataGraph as a blank individual. If the matching (blank) Individual from the domain is already in a Complex- DataGraph one extra level or relationships is allowed, but with two restrictions. First the range should be a data-type from Table 5.1 or a class of (library) In- dividuals and the modality should be ShallBe. There is no way to refer to a Relationship instance of this type, because the relationship will be contained in a ComplexDataGraph that also contains one Relationship starting from a named Individual, so there will be no uia:derivedFrom reference for this Relationship. uia:rangeDatatype a value from Table 5.1, if the relation is omitted the assumed value is xsd:anyURI (default). uia:rangeDefaultValue (optional) depending on the data type a default value can be supplied. uia:rangeMinValue (optional) for date and time data types a lower and upper bound can be specified for the value. uia:rangeMaxValue (optional) idem.

The WUI is used to construct Individuals and to link them to each other, but at some point Literal values have to be supplied. The RelationshipTemplates are also used to define those points. The WUI can be programmed to provide input elements and sanity checks per data type. Currently the set from Table 5.1 is supported, but this can be extended. When the WUI finds one of those values in the range it will make an rdfs:Literal.

C.2.13 TranslationGraph

The TranslationGraph contains the following two triples.

uia:companyName has an IndividualTypeGraph as subject and a Literal as object containing the company specific name of the Individual type.

uia:forCompany points from the TranslationGraph to a Company instance, indicating which company it is a term for.

C.2.14 TerminationGraph

Instead of removing graphs from the data-set when the data they contain is no longer part of the repository, a PublishableGraph can be terminated by adding a TerminationGraph to the data-set, pointing to the node that should be terminated. When the element that will be terminated is not an Individual, it is contained in a ReplaceableGraph, and the origin should also be supplied in the termination graph; an example is sketched after the two relations below.

uia:terminates points to the Individual or ReplaceableGraph that is terminated.

uia:origin (optional) points to the origin of the replace chain in the case of a ReplaceableGraph.
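As announced above, a sketch of a TerminationGraph retiring the (hypothetical) replaceable graph ex:cdg-7; the meta-data triples again follow the PublishableGraph convention.

PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX uia: <http://www.uia15926-11.com/2012/03/uia#>
PREFIX ex:  <http://example.org/project/>

# Terminates a ReplaceableGraph and names the origin of its chain.
INSERT DATA {
  GRAPH ex:term-3 {
    ex:term-3 uia:terminates ex:cdg-7 ;
              uia:origin     ex:cdg-7-origin ;
              uia:creator    ex:user-1 ;
              uia:created    "2012-07-15T14:30:00"^^xsd:dateTime .
  }
}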

C.3 Construction states

When an Individual or Relationship is being constructed in the Workspace, some flags are temporarily added before the data is sent to the Data-store in its stable form. Any PublishableGraph can have a uia:state relation. Without states, or with StateNew and StateCompulsary, the creation of a RelationshipGraph cannot be reverted in the WUI. As a preparation for the sync the states are removed. Each IndividualRelationshipGraph or ComplexDataGraph gets a uia:Placeholder as object by default. For ComplexDataGraphs this happens for the objects of the relations of the blank individual.

uia:state points to uia:StateNew. If the graph is a ShallBe it also gets the state uia:StateCompulsary, so it cannot be reverted in the WUI.

?relation pointing to uia:Placeholder.

Appendix D

Primer vocabulary

In this appendix the content of the primer vocabulary is printed fully.


Appendix E

Perspective of ISO 15926

In order to do justice to the ISO 15926 community, some effort should be spent on relating the work in this thesis to the work of the ISO 15926 workgroup. The data model explained in this report builds a foundation for how different parts of the knowledge base relate to each other and how they can be recorded and communicated. But in relation to both the ISO 15926 norm and Gellish (see Section 3.2), the concepts used in the ODRAC data model represent a minimal set of ideas from those elaborate theoretical systems. This selection is thought to be enough to encode knowledge in a corresponding fashion. For example a relationship template is very similar to the class of relationship from ISO 15926-2, but the concept of an Individual in ODRAC is simpler and at some detailed points in conflict with the complex notion of individuality from the ISO. This thesis influenced the content of Part 11, and the ODRAC platform functions as a possible implementation of the Part 11 methodology. The use of RDF with Named Graphs to encode reference data is our base principle. There is an overlap between the concepts developed within the ISO's own data model, which radiate down to the technical structure, and the basic modeling assumptions inside the RDF technology, which radiate their way up. ODRAC clearly builds from a technical basis and aims to reach an ISO-equivalent expressiveness through a route in line with RDF. In this appendix we build an explanation of our work within the narrative of the ISO workgroup. Following the description for ISO readers, a mapping is given between this description and the rest of the thesis. The second section also discusses the differences between the ISO parts and the ODRAC approach.

E.1 ODRAC within the ISO’s narrative

In the 2013 Paris meeting of the ISO workgroup TC184/SC4/WG3 a presentation was given by Leo van Ruijven on the progress of the 15926-11 methodology. This Part 11 consists of three contributions. First it proposes an application of the template methodology from Part 7 and the OWL implementation from Part 8 into an RDF Named Graphs structure. Secondly it introduces a set of initial relations in line with the Part 2 data model and the Part 3 and 4 set of reference data. Thirdly an informative business domain usage guideline is given on how the infrastructure around Named Graphs can be organized.

Figure E.1: An IndividualTypeGraph as presented to the ISO workgroup.

Figure E.2: A RelationshipTemplateGraph as presented to the ISO workgroup.

The new reference data relations (the second contribution of the proposed Part 11) are to be included in a new release of Part 4. A number of them represent Systems Engineering concepts like has_property, but some are needed to structure the use of Named Graphs. In Figures E.1 and E.2 two Named Graphs are visualized. In the first figure an ISO class definition of a Centrifugal is given as a Named Graph. It is based on a formal description in a table row, which is also displayed. The Named Graph is an entity with a unique URI as name. The chosen URI consists of an HTTP-domain appended with 130.000.001, the number identifier of the Centrifugal class. Inside the Named Graph RDF relations reside, depicted as red arrows. Three relations are used to supply the real content of the class definition graph: its parent class, its unique name and a string definition of the class. Four other relations represent meta-data statements about the Named Graph. They describe the Named Graph as a record that was created and possibly modified. Each relation arrow reflects an RDF triple: the source node of the arrow is the subject, the label of the arrow is the predicate and the target node is the object. A naming principle is that triples should read as a sentence. When the domain-part of a


URI is abbreviated, one could read for example the unique name definition as follows.

(ica:)130.000.001 (part4:)has_unique_name "Centrifugal"

The first word in this sentence, (ica:)130.000.001, has a double meaning. It identifies the class itself, so anywhere within a data-store statements can be made about this class by using this identifier. The second meaning is that it is also the name of the unique technical container that carries the statements about the class (data) and about the creation and modification of the class (meta-data). The second part of the sentence reads as three words: the class referenced by the first word should be uniquely named by the word contained in the sentence as the third element. Because the graph of this example functions as a type definition for an Individual, it is called an IndividualTypeGraph.

Figure E.2 contains an example of another Named Graph. The structure is identical to any Named Graph: it has a unique URI as name and it contains data and meta-data. This time it prescribes how a certain type of relationship may be used. It functions as an elementary template, and is called a RelationshipTemplateGraph. In RDF, relationships are called predicates, and just like classes can be defined in a Named Graph (example of Figure E.1), those relationships can also be described in a Named Graph. In this example the predicate has_property is defined in a graph with the URI (part4:)200.000.051 (not depicted). The Named Graph of the example describes that the has_property predicate can exist between an Individual of class (ica:)130.000.001 and an Individual of class (ica:)140.000.032. Apart from the three triples involved in specifying this data, the same type of meta-data statements are included as in Figure E.1. Now, one extra meta-data triple is added specifying how stringent the template is. The Named Graph contains this sentence:

(ica:)000.000.001 (part4:)has_modality (part4:)530.001

which turns out to mean that the relationship contained in the RelationshipTemplateGraph is compulsory for all Centrifugal Individuals. The part4:530.001 is a predefined value representing the Shall-option for modality.

This IndividualTypeGraph and RelationshipTemplateGraph are just two of the seven different graph types. For classification reference data three already described graphs are needed: a type definition for classes, a type definition for relationships and a usage template for relationships. Instantiation of data is done in Named Graphs with an equivalent structure. Four types of those graphs can be discerned: graphs representing an instance of a class, graphs representing relationships between two instances, graphs containing a relationship between an instance and a class, and graphs containing a relationship between an instance and a textual value.
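For readers who prefer a serialized form, a sketch of the class-definition graph of Figure E.1 is given below as a SPARQL update; the HTTP-domains behind ica: and part4: are placeholders, the rdfs:subClassOf predicate and the parent class URI are assumptions, and the definition and meta-data triples of the figure are omitted.

PREFIX rdfs:  <http://www.w3.org/2000/01/rdf-schema#>
PREFIX ica:   <http://example.org/ica/>
PREFIX part4: <http://example.org/part4/>

# The graph name doubles as the class identifier.
INSERT DATA {
  GRAPH ica:130.000.001 {
    ica:130.000.001 rdfs:subClassOf       ica:130.000.000 ;
                    part4:has_unique_name "Centrifugal" .
  }
}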

E.2 Mapping narrative to actual design

The ODRAC platform was built as a technical framework that can also be configured for other knowledge domains. For this reason the RDF data needed for the Named Graph structure was published in an independent RDF vocabulary. The namespace indicator uia: was chosen, instead of the part4: that the presentation in Paris suggested. The naming principle behind labels like part4:is_created_by was also not followed in the ODRAC vocabulary, where the relation is simply called uia:creator.
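As a small illustration of this renaming, the creation meta-data of the Centrifugal graph would read roughly as follows in the ODRAC vocabulary; the uia: namespace URI and the creator value are assumptions, only the property name uia:creator is taken from the text.

    @prefix uia: <http://example.org/uia/> .   # namespace URI assumed
    @prefix ica: <http://example.org/ica/> .   # domain assumed

    ica:130.000.001 {
        # The Paris label part4:is_created_by becomes the plainer uia:creator.
        ica:130.000.001 uia:creator "some-user" .   # creator value assumed
    }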


Figure E.3: Figure 47 from [22].

As a result of the separation between technical structure and domain-specific configuration, all the relationships needed for the Part 11 methodology can be defined inside the ODRAC framework. In this process, the relationships have to be related to one of two core entities: relationships that are part of the knowledge domain extend uia:OntologyRelationship, and relationships that extend the native meta-data possibilities extend uia:PrimerRelationship. The name Ontology can be understood to refer to a domain-specific upper ontology; the Primer is an entity containing all RDF concepts needed to prime the domain information model that is stored in the library. The same principle applies to classes: all ontology types need to be related to either uia:OntologyElement or uia:PrimerElement. In doing so, the ODRAC platform has started to conflict with the ISO data model. A specification like that of Figure E.3 can still effortlessly be modeled inside the platform, but models like those of Figures E.4 and E.5 inevitably interfere with RDF's notion of a class. In our design the thoughts behind the original ISO data model were followed as far as was meaningful, but the classification principles of RDF were given priority. The alternative, strictly reproducing the original data model, did not seem feasible.
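One plausible way to encode these extension requirements is with RDFS subsumption, sketched below. Whether ODRAC actually uses rdfs:subPropertyOf and rdfs:subClassOf for this purpose is an assumption of this sketch, as are the uia:, ica: and part4: namespace URIs.

    @prefix uia:   <http://example.org/uia/> .    # namespace URIs assumed
    @prefix ica:   <http://example.org/ica/> .
    @prefix part4: <http://example.org/part4/> .
    @prefix rdfs:  <http://www.w3.org/2000/01/rdf-schema#> .

    # A knowledge-domain relationship extends the ontology core entity.
    part4:has_property rdfs:subPropertyOf uia:OntologyRelationship .

    # A native meta-data relationship extends the primer core entity.
    uia:creator rdfs:subPropertyOf uia:PrimerRelationship .

    # An ontology type is related to the ontology element core entity.
    ica:130.000.001 rdfs:subClassOf uia:OntologyElement .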

Little of the OWL approach of Part 8 was reused: the structural basis of the Named Graph offered such possibilities that little OWL was needed. In Section 5.9 we describe the incompatibility between Named Graphs and OWL in more detail. The same is more or less true for the template methodology of Part 7; in ODRAC we incorporated the template methodology completely in the graph structure. Unfortunately no further comparison between Part 7 and our approach is available. Although a fully configured ODRAC project has not yet been rigorously compared to the theoretical possibilities of Parts 2, 3, 4 and 7, or to the results of the OWL representation in Part 8, we hope that the approach of Part 11 will turn out to be a successful venture.


Figure E.4: Figure 69 from [22].

Figure E.5: Figure 70 from [22].

Another important difference between the data model in this thesis and the one presented in Paris is the ComplexDataGraph. This is a construction that allows a set of relations inside one graph. It already existed before the ODRAC work started, but in the Paris presentation the concept was omitted and replaced by a more verbose structure in which each relation gets its own graph. At the time this thesis is presented, it is believed that this is the way Named Graphs will be applied to knowledge data in the future. As a result, Literal values might in the future be the object of a triple originating from an Individual (as in Figure E.6). This concept was presented in the Paris meeting, but it does not exist in this thesis work; here such a Literal value is always contained in a ComplexDataGraph. Although there is a future for the ValueGraph, it will not exist in the form presented in the figure. It will probably be split into three graphs: one presenting a Name instance, one carrying the type definition of that instance and one containing the value of the name using the has_value relationship. In this way the type of the Literal value does not depend on the relationship used (has_name in this example).
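The anticipated three-graph split could look roughly like the sketch below. All graph and instance URIs, the literal value and the use of rdf:type as the classifying relation are invented for illustration; only the has_name and has_value relationships are named in the text.

    @prefix ica:   <http://example.org/ica/> .
    @prefix part4: <http://example.org/part4/> .
    @prefix rdf:   <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

    # Graph 1: presents the Name instance, attached to its Individual.
    ica:500.000.010 {
        ica:900.000.001 part4:has_name ica:500.000.010 .   # Individual URI assumed
    }

    # Graph 2: carries the type definition of that instance.
    ica:500.000.011 {
        ica:500.000.010 rdf:type ica:140.000.032 .          # classifying relation assumed
    }

    # Graph 3: contains the value of the name, via has_value.
    ica:500.000.012 {
        ica:500.000.010 part4:has_value "P-101" .           # literal value assumed
    }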

These are the major differences between the data model implementation of the UIA and the model as it was presented in Paris. In the presentation the data model (as anticipated in the Part 11 work) was contrasted with the Part 2 data model, just as this thesis has emphasized the differences. Future developments in the data model, especially should work commence on a new Part 12 project, will be followed with curiosity.


Figure E.6: A RelationshipTemplateGraph as presented to the ISO workgroup.
