
Design and Development of a Service for Software Interrelationships

Diplomarbeit

zur Erlangung des akademischen Grades

Diplom-Ingenieur

im Rahmen des Studiums

Software Engineering & Internet Computing

eingereicht von

Nikola Ilo
Matrikelnummer 0925955

an der Fakultät für Informatik der Technischen Universität Wien

Betreuung: Thomas Grechenig
Mitwirkung: Mario Bernhart

Wien, 6. Oktober 2014 (Unterschrift Verfasser/In) (Unterschrift Betreuung)

Technische Universität Wien · A-1040 Wien, Karlsplatz 13 · Tel. +43-1-58801-0 · www.tuwien.ac.at

Design and Development of a Service for Software Interrelationships

Master’s Thesis

submitted in partial fulfillment of the requirements for the degree of

Diplom-Ingenieur

in

Software Engineering & Internet Computing

by

Nikola Ilo
Registration Number 0925955

to the Faculty of Informatics at the Vienna University of Technology

Advisor: Thomas Grechenig
Assistance: Mario Bernhart

Vienna, October 6, 2014 (Signature of Author) (Signature of Advisor)

Technische Universität Wien · A-1040 Wien, Karlsplatz 13 · Tel. +43-1-58801-0 · www.tuwien.ac.at

Statement by Author

Nikola Ilo
Pfalzauerstraße 60, 3021 Pressbaum

Hiermit erkläre ich, dass ich diese Arbeit selbständig verfasst habe, dass ich die verwendeten Quellen und Hilfsmittel vollständig angegeben habe und dass ich die Stellen der Arbeit – einschließlich Tabellen, Karten und Abbildungen –, die anderen Werken oder dem Internet im Wortlaut oder dem Sinn nach entnommen sind, auf jeden Fall unter Angabe der Quelle als Entlehnung kenntlich gemacht habe.

I hereby declare that I am the sole author of this thesis, that I have completely indicated all sources and help used, and that all parts of this work – including tables, maps and figures – if taken from other works or from the internet, whether copied literally or by sense, have been labelled including a citation of the source.

(Place, Date) (Signature of Author)


This thesis is dedicated to the memory of my beloved father.

Dipl.-Ing. Dr. Sotiraq Ilo
1958–2014


Acknowledgements

First of all I want to thank my advisers Prof. Dr. Thomas Grechenig and Mario Bernhart for their support. They allowed me to explore my topic freely and thereby walk the path to which the research led me, from Software Engineering into the interesting field of the Semantic Web. Such open-mindedness is not to be taken for granted. I would also like to thank Brigitte Brem, who helped with the formalities of handing in the thesis.

I want to thank the two employers I had during my master's studies, Christoph Leithner and Manfred Gronalt. They made it possible to finish my degree and work at the same time.

Special thanks are due to my friends and supporters: Michael Geyer, who helped to convert my hand-drawn sketches into beautiful graphics; Wilfried Mayer, who was always there to talk to when I encountered a problem; and Florian Hassanen, who significantly helped me to refine the nomenclature of the concepts created as part of this thesis.

Further I want to thank my mother, Albana Ilo. Besides caring like all mothers do, she proofread my thesis and provided most valuable feedback, making a great effort to read into a new subject.

Most of all I want to thank my significant other, Anna Mach, for her steady support and for enduring the countless days when the keyboard noise did not stop until late at night. She cheered me up whenever I hit a wall and managed everything I could not while I was writing.

Last but not least I want to thank all my family, friends, and colleagues who supported me throughout the difficult last two years.


Kurzfassung

Inter-Software-Beziehungen, wie z.B. Software-Abhängigkeiten, haben Auswirkung auf die Qualität und Entwicklung von Software und Software-Projekten und sind daher von essentieller Bedeutung für die Software-Entwicklung und -Wartung. Aus diesem Grund gibt es bereits ausgeklügelte Systeme, um Software-Beziehungen zu deklarieren, zu verwalten und nutzbringend für Softwarebetriebs- und Entwicklungsprozesse einzusetzen. Nennenswerte Beispiele hierfür sind Paket-Management-Systeme von Linux-Distributionen und Build-Management-Systeme wie Apache Maven. Die Software-Netzwerke, auf denen diese Systeme agieren, bilden in sich interoperable, aber jeweils abgeschlossene Software-Ökosysteme, die sich in Syntax und Semantik voneinander unterscheiden, obwohl es Überlappungen in der Menge der enthaltenen Software gibt. Derzeit gibt es kein anwendbares System, welches Software-Ökosystem-übergreifende Abfragen und Auswertungen zulässt. Diese Arbeit greift die Problemstellung auf, die semantischen und syntaktischen Grenzen von Software-Ökosystemen zu überwinden und dadurch die praktische Nutzung von Informationen über Inter-Software-Beziehungen für die Software-Entwicklung und -Wartung zu ermöglichen. Im Rahmen dieser Arbeit wurde ein Software-Prototyp entwickelt, der es ermöglicht, verschiedene Software-Ökosysteme zu integrieren und dadurch systemübergreifende Abfragen durchzuführen. Ein besonderes Augenmerk wurde auf Erweiterbarkeit und Skalierbarkeit gelegt, damit möglichst einfach neue, aber auch zahlreiche Software-Ökosysteme integriert werden können. Während der Entwicklung zeigte sich, dass Semantic-Web-Technologien einen guten Rahmen für die Bearbeitung der Problemstellung bieten. Mehrere Software-Ökosysteme wurden für die Evaluierung der Datenintegration eingebunden, z.B. aus den Debian/Ubuntu-Quellen oder den Common Vulnerabilities and Exposures (CVE)- und Common Platform Enumeration (CPE)-Verzeichnissen des National Institute of Standards and Technology (NIST). Weiters wurden Applikationen, wie ein Sicherheitslücken-Benachrichtigungssystem oder ein Lizenz-Einhaltungs-Überprüfungsprogramm, beispielhaft implementiert, um das Potential von Software-Ökosystem-übergreifenden Abfragen aufzuzeigen und das Ergebnis zu evaluieren. Die wissenschaftlichen Beiträge dieser Arbeit gliedern sich wie folgt: eine verteilte Architektur für das Abgreifen, Parsen, Umlegen, Nachbearbeiten und Abrufen von generischen Datenquellen in ein semantisches RDF-Datenmodell; eine abstrakte OWL-Ontologie für die semantische Modellierung von Inter-Software-Beziehungen; sowie ein System für die Verarbeitung von temporalen Resource Description Framework (RDF)-Aussagen mit SPARQL Protocol and RDF Query Language (SPARQL). Hierbei werden die Anfragen unter Beachtung der zeitlichen Gültigkeit, jedoch ohne vorherige zeitliche Normalisierung von Beobachtungszeitpunkten in Gültigkeitszeiträume, evaluiert.

Schlüsselwörter

Software-Beziehungen, Software-Abhängigkeiten, Semantic Web, Ontologie, Temporales SPARQL, metaservice, Mining Software Repositories


Abstract

Software interrelationships, like software dependencies, have an impact on the quality and evolution of software projects and are therefore important to software development and maintenance. Sophisticated systems have been created in the past to define, manage, and utilize such relationships in software processes. Notable examples are the package management systems of Linux distributions and build systems like Apache Maven. These systems are clustered into software ecosystems, which are mostly syntactically and semantically incompatible with each other, although the software they describe can overlap. Currently there is no viable system for querying information across different ecosystems. This thesis addresses how to overcome the semantic and syntactic borders of software ecosystems and thereby enable the practical usage of information about software interrelationships in software development and maintenance. An iterative approach was used to develop a prototype that enables the integration of, and therefore queries across, different software ecosystems. Particular emphasis was placed on extendibility and scalability, i.e., on the ability to easily integrate both new and numerous ecosystems. During development, Semantic Web technologies proved to provide a suitable framework for this task. Several ecosystems, like the Debian/Ubuntu repositories and the CVEs and CPEs defined by the NIST, were integrated to evaluate the data integration. Additionally, small applications, like a vulnerability notification system and a license violation detector, were used to show the usefulness of aggregated cross-ecosystem interrelationships. The contributions of this thesis consist of: a distributed architecture for the retrieval, parsing, mapping, post-processing, and querying of generic data into a semantic RDF data model; an abstract OWL ontology for the semantic modeling of inter-software relationships; and a model for processing temporally scoped RDF statements using SPARQL without prior normalization of observation times to time periods.

Keywords

Software Relationships, Software Dependencies, Semantic Web, Ontology, Temporal SPARQL, metaservice, Mining Software Repositories


Contents

1 Introduction
  1.1 Problem Statement
  1.2 Motivation
  1.3 Methodology
  1.4 Contributions
  1.5 Related Work
    1.5.1 Software Interrelationships
    1.5.2 Mining Software Repositories
    1.5.3 Semantic Web
  1.6 Thesis Structure

2 Fundamentals
  2.1 Software Repositories and Ecosystems
    2.1.1 Debian Ecosystem
    2.1.2 Apache Maven Repositories
    2.1.3 Joinup
    2.1.4 National Vulnerability Database
    2.1.5 Universal Description, Discovery and Integration
  2.2 Semantic Web Technologies
    2.2.1 Resource Description Framework
    2.2.2 Vocabulary, Ontologies and Reasoning
    2.2.3 Linked Data
  2.3 Software Ontologies
    2.3.1 Description of a Project
    2.3.2 Software Package Data Exchange
    2.3.3 Asset Description Metadata Schema for Software
    2.3.4 Software Ontology
    2.3.5 Other Software Ontologies
  2.4 Graph Databases
    2.4.1 Property Graph Databases
    2.4.2 RDF Databases
  2.5 Temporal Databases
    2.5.1 Temporal RDF
  2.6 Provenance in RDF
    2.6.1 Reification
    2.6.2 Named Graphs
    2.6.3 Statement Identifier
  2.7 Web Data Extraction

3 Requirements
  3.1 Stakeholders
    3.1.1 Web Users
    3.1.2 API Users
    3.1.3 Module Developers
    3.1.4 Software Repository Operators
    3.1.5 System Administrators
  3.2 Use cases
    3.2.1 Integration of Software Repositories
    3.2.2 Manual Browsing through Software Metadata and Software Interrelationships
    3.2.3 Security Alerts based on Security Report Propagation
    3.2.4 Discovery of potential Licensing Conflicts
  3.3 Functional Requirements
  3.4 Non-Functional Requirements
    3.4.1 Scalability
    3.4.2 Extendibility
    3.4.3 Security
    3.4.4 Usability and Documentation
    3.4.5 Performance
    3.4.6 Maintainability

4 Design
  4.1 Initial Design Considerations
  4.2 System Architecture
    4.2.1 Components
    4.2.2 Modules
    4.2.3 Distributed Execution
  4.3 Processing Model
    4.3.1 Observations
    4.3.2 Processing Pipelines
    4.3.3 Data Retrieval and First Processing
    4.3.4 Postprocessing
  4.4 Data Model
    4.4.1 Definitions
    4.4.2 Observation Semantics and Provenance Information
    4.4.3 Provider Data Model
    4.4.4 Postprocessor Data Model
  4.5 Temporal Queries
    4.5.1 Introduction of SPARQL@T
    4.5.2 Writing Temporal SPARQL Queries
    4.5.3 Automatic Translation of a subset of SPARQL
  4.6 Interface Design
  4.7 Ontologies
    4.7.1 Metaservice Observation Ontology
    4.7.2 Software Relationship Ontology

5 Implementation
  5.1 Platform
  5.2 Component and Module System
  5.3 Manager
  5.4 Data Retrieval and Archival
  5.5 RDF Database
  5.6 Messaging
    5.6.1 Message Service and ActiveMQ Messaging
    5.6.2 Custom Messaging
  5.7 Frontend
    5.7.1 Semantic Web Frontend
    5.7.2 User Interface
  5.8 Temporal SPARQL Queries
    5.8.1 Query Building and Translation
    5.8.2 Quad Support for SPARQL CONSTRUCT Queries
    5.8.3 Query Optimization Techniques

6 Evaluation
  6.1 Integration of the Debian Ecosystem
  6.2 Integration of the Maven Ecosystem
  6.3 Implementation of License Conflict Discovery
    6.3.1 Detection of Software Licenses
    6.3.2 Integration of Linked Data Software Descriptions
    6.3.3 Copyleft Conflict Detection Query
  6.4 Implementation of Security Report Alerts
    6.4.1 Integration of CPE and CVE
    6.4.2 Integration of WordPress
    6.4.3 Query Execution and Alert
  6.5 Runtime Environment
  6.6 Discussion

7 Conclusion
  7.1 Summary
  7.2 Limitations
  7.3 Implications
  7.4 Outlook

Bibliography
  References
  Online References

Appendix A Linking Open Data Cloud

Appendix B Selected Complete Listings
  B.1 RDF Serializations
  B.2 SPARQL Queries
  B.3 Metaservice Module Descriptor

Appendix C Screenshots

List of Figures

2.1 Example RDF Graph
2.2 Statement without Provenance Information
2.3 Provenance Information using Reification
2.4 Provenance Information using Named Graphs
2.5 Provenance Information using Statement Identifiers

3.1 Stakeholder Diagram

4.1 Component Diagram
4.2 Provider Pipeline
4.3 Postprocessor Pipeline
4.4 SWREL Relationship Property Hierarchy
4.5 SWREL Dependency Property Hierarchy

5.1 Messaging System based on ActiveMQ Virtual Topic
5.2 Custom Messaging System
5.3 Dense RDF Graph

A.1 Linking Open Data Cloud May 2007 by Cyganiak and Jentzsch [17, 91]
A.2 Linking Open Data Cloud March 2009 by Cyganiak and Jentzsch [17, 91]
A.3 Linking Open Data Cloud September 2011 by Cyganiak and Jentzsch [17, 91]
A.4 Linking Open Data Cloud April 2014 by Schmachtenberg, Paulheim, and Bizer [63]

C.1 Debian Package Template - Web-Frontend
C.2 Generic Package Template - Web-Frontend
C.3 Management Shell
C.4 WordPress Security Alert on Android


List of Tables

2.1 Ubuntu Repository Scheme
2.2 Selection of Maven Stages in Execution Order
2.3 RDF Namespaces and Prefixes
2.4 OWL Profiles
2.5 Class descriptions from DOAP [95]
2.6 Class Descriptions from SPDX [140]
2.7 Class Descriptions from ADMS.SW [30]
2.8 Selection of Software Ontology (SWO) [139] Classes

4.1 Metaservice Observation Ontology Classes
4.2 Metaservice Observation Ontology Properties
4.3 SWREL Classes
4.4 SWREL Relationship Properties
4.5 SWREL Dependency Properties
4.6 SWREL Range Restriction Properties

5.1 Metaservice Maven Artifacts

6.1 Mapping of Debian Control Fields to Metaservice Debian Ontology and to Metaservice Ontologies
6.2 Mapping of Maven Concepts to Metaservice Concepts


List of Listings

2.1 RDFS Description of rdfs:subClassOf using Turtle
2.2 Time Point Query τ-SPARQL
2.3 Validity Time Selection Query τ-SPARQL
2.4 Validity Time Intersection SPARQL-ST

4.1 Temporal Constraint on a Statement Pattern SPARQL@T Query
4.2 maxProviderTime Subquery - Translation of Listing 4.1 to SPARQL on the Provider Data Model
4.3 Main Query - Translation of Listing 4.1 to SPARQL on the Provider Data Model
4.4 maxPostprocessorTime Subquery - Translation of Listing 4.1 to SPARQL on the Postprocessor Data Model
4.5 Main Query - Translation of Listing 4.1 to SPARQL on the Postprocessor Data Model
4.6 Main Query - Translation of Listing 4.1 to SPARQL on the Metaservice Data Model

5.1 Query Building in Java
5.2 Output SPARQL Query of Listing 5.1
5.3 SPARQL Quad CONSTRUCT Query
5.4 SPARQL Query on Dense RDF Graph
5.5 Optimized SPARQL Query on Dense RDF Graph

6.1 Copyleft License Conflict Discovery SPARQL Query
6.2 Security Alert SPARQL Query

B.1 RDF/XML Serialization of RDF Graph in Figure 2.1
B.2 Turtle Serialization of RDF Graph in Figure 2.1
B.3 JSON-LD Serialization of RDF Graph in Figure 2.1
B.4 RDFa Serialization of RDF Graph in Figure 2.1
B.5 Temporal Constraint on a Statement Group SPARQL@T Query
B.6 Translation of Listing 4.1 to SPARQL on a Valid-From and -To Data Model
B.7 Translation of Listing B.5 to SPARQL on a Valid-From and -To Data Model
B.8 Optimized SPARQL Query for Resource Lookup
B.9 Full Translation of Temporal SPARQL Query on the Metaservice Data Model
B.10 Metaservice Module Descriptor


List of Abbreviations

ACID Atomicity, Consistency, Isolation, Durability.

ADMS Asset Description Metadata Schema.

ADMS.SW Asset Description Metadata Schema for Software.

API Application Programming Interface.

APT Advanced Packaging Tool.

BOM Bug Ontology Model.

CPAN Comprehensive Perl Archive Network.

CPE Common Platform Enumeration.

CPU Central Processing Unit.

CSS Cascading Style Sheets.

CVE Common Vulnerabilities and Exposures.

DC Dublin Core.

DOAP Description of a Project.

DOM Document Object Model.

DPKG Debian Packaging Tool.

FLOSS Free/Libre/Open Source Software.

FOAF Friend of a Friend.

FOSS Free/Open Source Software.

GPL General Public License.

GUI Graphical User Interface.

HDD Hard Disk Drive.

HT Hyper Threading.

HTML Hypertext Markup Language.

HTTP Hypertext Transfer Protocol.

httpd HTTP daemon.

IDE Integrated Development Environment.

IO Input/Output.

IRC Internet Relay Chat.

IRI Internationalized Resource Identifier.

Java EE Java Platform, Enterprise Edition.

JMS Java Message Service.

JRE Java Runtime Environment.

JSON JavaScript Object Notation.

JVM Java Virtual Machine.

LOD Linking Open Data.

MSR Mining Software Repositories.

NEPOMUK Networked Environment for Personalized, Ontology-based Management of Unified Knowledge.

NFO NEPOMUK File Ontology.

NIST National Institute of Standards and Technology.

NMA Notify My Android.

NPM Node Packaged Modules.

NVD National Vulnerability Database.

OWL Web Ontology Language.

POM project object model.

PPA Personal Package Archive.

PyPI Python Package Index.

RADion Repository Asset Distribution.

RAM Random Access Memory.

RDF Resource Description Framework.

RDFa RDF in attributes.

RDFS RDF Schema.

REST Representational State Transfer.

SEON Software Evolution Ontologies.

SID Statement Identifier.

SOA Service-oriented Architecture.

SOM Software Ontology Model.

SPARQL SPARQL Protocol and RDF Query Language.

SPDX Software Package Data Exchange.

SQL Structured Query Language.

SSD Solid State Disk.

SWO Software Ontology.

SWREL Software Relationship Ontology.

tmpfs Temporary File System.

UDDI Universal Description, Discovery and Integration.

URI Uniform Resource Identifier.

VCS Version Control System.

VOM Version Ontology Model.

W3C World Wide Web Consortium.

WWW World Wide Web.

XML Extensible Markup Language.



1 Introduction

1.1 Problem Statement

Software, software artifacts, and software projects are in relationships with each other. The reasons for the need for software relationships are numerous and include code reuse, abstraction of complexity, or the need for integration. Because of the impact of inter-software-project relationships on the reliability [60], security [45], compatibility [77], and development of software projects [15], they are important to software project management [66] and software maintenance [1]. Probably the most prominent example of software interrelationships is software dependencies.

Managing software relationships has been such a big concern that sophisticated management systems and repositories were created and are in wide use [53]. One example is the build management tool Apache Maven [85], which enables Java developers to define dependencies on existing software modules, which are automatically fetched from a central repository. Another example are the package management systems of Linux distributions, which handle connections to upstream software and transform it into an easily deployable package that resolves all its dependencies on installation. All these systems have in common that they concentrate on a very specific usage area or topic and share the same semantics and the same toolset for interaction. In this thesis we call the set of software and its interrelationships in an interoperable system a software ecosystem. A more precise definition, as well as a comparison to the traditional meaning of the term "software ecosystem", is given in Section 2.1.

Data in these ecosystems is usually of very high quality, as the semantics and structure are often explicitly defined in a formalized way to enable automated processing by software tools. Due to the high quality and easy usability of the data through ready-to-use tools, research on relationships inside single ecosystems has been an attractive topic [59, 23]. This is not the case for studies across different ecosystems [76], because there are usually no interfaces to query between different software ecosystems. Therefore, querying over several of them becomes a tedious task, and crossing ecosystem borders has been avoided. Doing so involves solving the following problems:

• Software ecosystems provide different Application Programming Interfaces (APIs) and data formats to access and present information. For each system, crawlers and parsers need to be written to aggregate relationship information.

• The semantics of the relationship types often differ between individual software ecosystems. A semantic mapping between these types needs to be established.

So far there is no viable solution that enables software developers, researchers, or other software stakeholders to query information across software ecosystem borders.


1.2 Motivation

The reason why cross-software-repository querying is desirable is that a specific piece of software may be an element of different software ecosystems. Usually no single ecosystem contains all information; rather, the metadata is distributed among them. This leads to cumbersome utilization of information, for example in software evolution research, where studies are therefore often based on limited datasets and consequently often contradict each other. According to Herraiz, Robles, and Gonzalez-Barahona [38], software evolution research would greatly profit from structured access to software projects, such that empirically more significant studies could be undertaken more easily. The following practical examples illustrate the potential of easy queries across software ecosystem borders:

Security Report Propagation OpenSSL is a security-critical encryption library, which is widely used in Free/Open Source Software (FOSS) but also in commercial software products. In 2014 a critical bug in the heartbeat (keep-alive) extension of OpenSSL was found. Without going into further detail, the consequence of the bug was that internet-enabled systems worldwide needed to be checked and updated in case they used a vulnerable release of OpenSSL. From an ecosystem viewpoint, this case involves the National Institute of Standards and Technology (NIST) ecosystem, which provides security reports and links them to software products in its Common Vulnerabilities and Exposures (CVE) and Common Platform Enumeration (CPE) databases. For the OpenSSL bug there was an entry with the id CVE-2014-0160. For users of the Debian ecosystem, a dedicated security team crawls the CVE database regularly and manually checks each report's applicability to the Debian distribution.¹ The ability to follow the existing software relationships, from the CVE CVE-2014-0160 to the CPE cpe:/a:openssl:openssl:1.0.1e to the upstream source code openssl-1.0.1e.tar.gz to the Debian package 1.0.1e-2+deb7u4, could support the security team in finding the affected packages. The Debian OpenSSL package could in turn be required by another package, which would extend this chain by one element. This can be continued repeatedly and can also include software not supported by Debian and outside of the Debian repositories. Questions like "Are our software services vulnerable?" could be answered easily if the ecosystems were interoperable. Automated traversal of relationships across ecosystems makes it possible to write generic security report alert services for any software in an integrated ecosystem.

¹ Because of the severity of Heartbleed, popular Linux distributions like Debian were actually informed before the public release of the CVE. This is an exception from the described standard process.
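As a sketch of what such a traversal could look like as a query, consider the following SPARQL snippet. The CVE IRI and the properties swrel:affects and swrel:dependsOn are hypothetical stand-ins, not the final metaservice vocabulary; the actual relationship ontology is developed in Chapter 4.

    PREFIX swrel: <http://metaservice.org/ns/swrel#>

    # Find every package that transitively depends on a software
    # release affected by a given security report. All names below
    # the prefix declaration are illustrative.
    SELECT DISTINCT ?package WHERE {
        <http://example.org/cve/CVE-2014-0160> swrel:affects ?software .
        ?package swrel:dependsOn+ ?software .
    }

The SPARQL 1.1 property path operator + follows the dependency relationship over an arbitrary number of hops, which is exactly the chain-extension described above.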

Licensing Conflict Discovery To be allowed to use a third-party library in one's own software, one has to agree to its licensing terms. Violating the licensing terms can lead to legal issues. As an example, the General Public License (GPL) has a copyleft clause, meaning that software which is a derivation of, or links to, a GPL-licensed product must also be distributed under the GPL. This is a serious topic: in the past, several court cases about infringement of the GPL had legal consequences. The problem is known as the license mismatch problem [26]. To solve it, software repositories need to be reviewed. Software repositories often already contain explicit licensing information. With the ability to cross software repository and ecosystem boundaries, tools could automatically discover probable licensing violations in the relationship graph.
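A hypothetical query for such a tool might look like the following sketch. Here ex:license and ex:copyleft are placeholder properties and swrel:linksTo stands in for the linking relationship; the query actually developed for this use case appears later as Listing 6.1.

    PREFIX swrel: <http://metaservice.org/ns/swrel#>
    PREFIX ex:    <http://example.org/>

    # Report products that link (directly or transitively) to a
    # library under a copyleft license but are distributed under a
    # different license themselves. All property names are illustrative.
    SELECT ?product ?library WHERE {
        ?product swrel:linksTo+ ?library .
        ?library ex:license ?libraryLicense .
        ?libraryLicense ex:copyleft true .
        ?product ex:license ?productLicense .
        FILTER (?productLicense != ?libraryLicense)
    }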


1.3 Methodology

In this thesis a prototype is developed to explore the road to a cross-software-ecosystem software interrelationship service. An iterative approach is used to develop metaservice. Parts of this thesis are the requirements engineering, the software design, the implementation, and an evaluation based on previously chosen use cases. The semantic web and related technologies enable working with distributed and semantically heterogeneous data. They provide a potential platform to implement a universal software interrelationship service. Therefore metaservice is built upon a system architecture that is based on semantic web technologies and enables the integration of several software ecosystems through the creation of an abstract common layer.

1.4 Contributions

The contributions of this thesis are a distributed architecture for data retrieval, parsing, post-processing, and querying; an abstract OWL ontology for the semantic modeling of inter-software-project relationships; and a model for processing temporally scoped Resource Description Framework (RDF) statements without prior normalization of observation times to time periods.

1.5 Related Work

1.5.1 Software Interrelationships

Software dependencies are the most frequently analyzed relationships between software projects in the research community. The importance of dependencies for software engineering decisions [62, 1] and their impact on software quality [15] have been discussed. Several studies analyze code dependencies within the scope of a single project [15]. Lungu, Robbes, and Lanza [51] extend the analysis from single-project dependencies to inter-project dependencies and demonstrate the relevance of the latter. They also give reasons why existing research on single-project dependency analysis may not be directly applicable to a multi-project scope. Bauer and Heinemann [3] demonstrate the need for information on inter-project dependencies for effective software maintenance. They also provide a solution for dependency detection in Java ecosystems. In [19] inter-project relationships are used to test the impact of software changes on related software projects: when automated tests are run, current versions of dependent projects are also run with the code changes to detect breaking changes early. This is only one way to use project relationship data. German, Gonzalez-Barahona, and Robles [25] propose a model for software project inter-dependencies; in particular, a classification scheme for these dependencies is presented.


1.5.2 Mining Software Repositories

Flossmole [97] is a collaborative repository for common data in research on Free/Libre/Open Source Software (FLOSS) repositories. In [40] and [16] it is reported that researchers in the Mining Software Repositories (MSR) field struggle with retrieving accurate information. Flossmole structures data from different data sources into collections, provides the data in a relational format, and offers a single location for information retrieval. It is not limited to software metadata but also collects other software-project-related metadata, e.g., chat logs, mailing archives, and issue trackers. BOA is a programming language and infrastructure for executing queries on ultra-large-scale software repositories [21, 88]. BOA uses an abstraction layer over software projects across different software repositories and uses a Hadoop [82] cluster to execute these queries in parallel. It shows that it is critically important to keep scalability in mind when dealing with large repositories. The drawback of BOA is that it is a proprietary system and infrastructure; it is not possible to build external applications or to extend BOA.

1.5.3 Semantic Web

Kiefer, Bernstein, and Tappolet have shown that semantic web technologies are suitable for mining software repositories [46]. EvoOnt [96], a collection of ontologies, and iSPARQL, an extension of SPARQL which supports similarity-based matching, are used to calculate different metrics for software. The work did not elaborate on software interrelationships, but goes into detail on the interconnection of different data sources for the same software. Howison acknowledges the missing semantic mapping between different repositories in Flossmole and shows an approach to link data across repositories using RDF and the Web Ontology Language (OWL) [39]. However, he was faced with performance issues of semantic web technologies and therefore could not provide a practical solution for the whole dataset. The ontology used has recently been superseded by others and did not contain any specifics on software interrelationships. Damljanovic and Bontcheva present a prototype which uses semantic web technologies to improve the accessibility of information about software artifacts and thereby mitigate the significant learning curve software engineers face on complex software systems [18]. Berger et al. show existing approaches and tools which facilitate software project interrelationships for an improved development workflow [6]. One example is Mylyn, which provides an abstract, unified view of tasks across different issue and bug trackers, integrated into an integrated development environment. They then present a roadmap on how semantic web technologies and a semantic web across software repositories can be used to improve processes in FOSS development and maintenance. Berger and Bac implemented public RDF descriptions of the Debian [92] source packages using the Asset Description Metadata Schema for Software (ADMS.SW) [5]. They also describe how a globally interconnected net of linked data about software projects can be used to improve software processes. The published data covers only the Debian repositories, and inter-package interrelationships are not modeled.


1.6 Thesis Structure

In Chapter 2 the fundamental concepts and technologies used throughout the thesis are introduced and discussed. Chapter 3 gives an overview of the requirements for a possible solution. Based on these requirements, a solution design for the metaservice prototype is presented in Chapter 4. Chapter 5 discusses the implementation-specific challenges that arose during the development of the prototype and their solutions. In Chapter 6 the prototype is evaluated by implementing use cases. A summary of the thesis, its contributions and limitations, and an outlook are given in Chapter 7. The appendix provides additional information which was too long or not relevant enough to be included in the main part of the thesis. It includes figures, source code listings, and screenshots.



2 Fundamentals

The Fundamentals chapter introduces the foundations on which the following chapters build. It gives definitions and examples for concepts and describes existing technologies which are used throughout the thesis. In Section 2.1 the terms software repository and software ecosystem are discussed, and examples thereof are presented. Section 2.2 introduces concepts and standards of semantic web technologies. Section 2.3 then discusses existing semantic web ontologies for software and software relationships. Sections 2.4 and 2.5 give insight into graph databases and temporal databases, respectively. Section 2.6 describes different approaches for representing provenance information of semantic web data. Section 2.7 introduces what web data extraction is and how it works.

2.1 Software Repositories and Ecosystems

Software repositories can be defined as structured collections of software metadata, which optionally store the software itself. Their purpose is to help find, filter, and retrieve software, where software can be (optionally packaged) source code or compiled binaries. Repositories which do not provide the actual software may also be called registries. Version Control System (VCS) repositories can be considered software repositories too, although they are not in the focus of this thesis. A broader definition of software repositories also includes unstructured metadata [44]. The MSR scientific field also analyzes and cross-links Internet Relay Chat (IRC) logs and mailing lists about software; essentially, any data collected during the software development process is a subject of MSR. In the scope of this thesis only structured repositories are considered. Different repositories specialize in different software. Popular software repositories exist for Linux distributions, e.g., the Debian [92] and Red Hat [135] repositories, and for programming languages, e.g., the Comprehensive Perl Archive Network (CPAN) [90] for Perl, the Python Package Index (PyPI) [133] for Python, RubyGems [136] for Ruby, and Maven [85] for Java. A piece of software does not necessarily exist in only one repository, but can be part of many different repositories.

Traditionally, software ecosystems are defined by Jansen, Finkelstein, and Brinkkemper [43] as "a set of businesses functioning as a unit and interacting with a shared market for software and services, together with the relationships among them". In the scope of this thesis, software ecosystems are defined as the set of software, connected through and including its semantically defined and structured, technical and socio-technical interrelationships in an interoperable context. A software ecosystem therefore consists both of the software itself and of its relationships. These relationships need to be defined in a way that makes interoperability, and hence processing, possible; i.e., two related software products are not necessarily in a common ecosystem unless these relations are materialized. The size of an ecosystem is directly connected to the requirement of interoperability; e.g., all Debian-based Linux distributions can be seen as one ecosystem when the scope of interoperability is the handling of software packages. At the core, both definitions can lead to similar sets of software and interconnections. The main reason for a new definition is that it leaves out the economic aspect.

The new definition is therefore more useful in the scope of this thesis, where the focus is on software and not on businesses. Software ecosystems and software repositories are often connected. Since repositories provide collections of software of a common kind, they often also form software ecosystems. Multiple repositories which are compatible with each other typically share a common software ecosystem. In the following, we take a closer look at different repositories and ecosystems.

2.1.1 Debian Ecosystem

GNU/Linux software distributions typically have well-integrated and automated software repositories. They provide a collection of compatible software, which is usually fetched from repositories as packages in a standardized format like .deb or .rpm. Those packages are enriched with additional metadata and sorted into categories. Linux distribution users have a package management application on their system. This application accesses the repository metadata and lets users browse and install software packages on their local system. To do this, the package manager resolves and fetches all dependencies of the software package and installs them in order. Quality assurance by the distribution's maintainers usually ensures that packages within a repository are able to interoperate with each other. One remarkable GNU/Linux distribution is Debian [92]. Debian uses .deb packages and the Debian package manager dpkg to set up and manage software. This system works so well that many other distributions build upon the Debian package management. Nonetheless, Debian and Debian-based distributions do not necessarily share how they are organizationally run or how packages are selected. For example, Debian is a community project, but Ubuntu [143], probably the most popular Debian-based distribution today, is backed by a corporation. GNU/Linux distributions are developed either as rolling releases or as dedicated distribution releases. In rolling releases, packages are continuously updated into the system, while dedicated distribution releases provide a stable collection of software to which only bug and security fixes are supplied. Debian uses the distribution release model. A distribution release in Ubuntu (e.g., Precise Pangolin or Raring Ringtail) is split into multiple software repositories, depending on the support status and license of the software. Additionally, special security update repositories are provided. Ubuntu also introduced the Personal Package Archive (PPA) concept, i.e., repositories provided by third parties, which is not part of the original Debian system. Table 2.1 gives an overview of the different repository names and their meanings in an Ubuntu release.
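To make the kind of metadata concrete, the following is a minimal sketch of a package entry as it might appear in a repository's package index, in the Debian control format; the package name, version, and dependency versions are made up for illustration:

    Package: example-tool
    Version: 1.0-1
    Architecture: amd64
    Depends: libc6 (>= 2.13), libssl1.0.0 (>= 1.0.1e)
    Section: utils
    Description: example command line tool
     This fictitious entry illustrates how a package declares its
     dependencies, which the package manager resolves on installation.

The Depends field is what the package manager evaluates when it resolves, fetches, and installs dependencies in order.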

2.1.2 Apache Maven Repositories

Apache Maven [85] is an extendible build management and self-proclaimed software project management tool for Java. Its core concepts are the project object model (POM) and the Maven application lifecycle. The POM is the central point of configuration for a Maven project, where project metadata and deviations from the standard lifecycle are declared. The standard lifecycle consists of several ordered stages; Table 2.2 shows a selection of them. Maven artifacts are distributed through Maven repositories. These repositories contain the artifact itself and the corresponding POM metadata. Specific artifacts are identified by the combination of a group id, an artifact id, the version, and an optional identifier.


Name         Supported  Free License  Comment
main         Yes        Yes
restricted   Yes        No
universe     No         Yes
multiverse   No         No
.*-security  Yes        Depends       Security Updates
ppa:.*/.*    No         Depends       Personal Repositories
partner      No         No            Third-Party Software

Table 2.1: Ubuntu Repository Scheme

Stage             Description
compile           Compile the source code
test              Run unit tests on the compiled code
package           Combine the compiled code and the resources into an artifact
integration-test  Run integration tests on the artifact
deploy            Copy the artifact and metadata to a Maven repository

Table 2.2: Selection of Maven Stages in Execution Order

Software projects may declare dependencies on existing artifacts. A declaration may be optional or limited to specific times, like compile time, test time, or runtime. Maven automatically fetches artifacts from repositories when needed. The Maven Central Repository [118] is the standard repository from which Maven loads its dependencies. Although the most important packages can be found in the Maven Central Repository, several other open repositories exist, and many businesses operate private repositories. Several other Java Virtual Machine (JVM)-based build management and dependency resolution tools like Gradle [105], [83], or Apache Buildr [80] have adopted or are compatible with Maven's repository system and use the Maven Central Repository as a default dependency resolution source.
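As a minimal sketch of such a dependency declaration, a POM might contain the following; all coordinates are illustrative, not taken from the thesis:

    <project>
      <modelVersion>4.0.0</modelVersion>
      <!-- Coordinates identifying this artifact -->
      <groupId>org.example</groupId>
      <artifactId>example-app</artifactId>
      <version>1.0.0</version>
      <dependencies>
        <!-- Fetched automatically from a Maven repository -->
        <dependency>
          <groupId>junit</groupId>
          <artifactId>junit</artifactId>
          <version>4.11</version>
          <scope>test</scope> <!-- limits the dependency to test time -->
        </dependency>
      </dependencies>
    </project>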

2.1.3 Joinup

Joinup [112] is an online collaboration platform for e-government professionals, created by the European Commission as part of the Interoperability Solutions for Public Administrations program. Besides offering online community services, it acts as a federated repository for software and other assets related to the interoperability of e-government solutions. E-government solutions are usually specific to each country. As part of the harmonization of the solutions of the different members of the European Union, reuse and interoperability of these systems across national borders is striven for. According to Loutas et al. [50], the problems in achieving this are legal, organizational, semantic, and technical. While legal and organizational problems cannot be solved through technology, both semantic and technical problems can. Joinup uses semantic web technologies to overcome the semantic and technical barriers between the different countries.


Through the semantic web technologies used in Joinup, the members of the European Union can keep their existing systems and still interoperate by implementing semantic mappings. Details on semantic web technologies and the ontologies used on Joinup are given in Section 2.2 and Section 2.3.3.

2.1.4 National Vulnerability Database

The National Vulnerability Database (NVD) [122] is a software security repository of the U.S. government and is managed by the NIST. The content of the NVD is not limited to U.S. software or U.S. corporations. The two major datasets provided are the CVE and the CPE. CVE ids serve as common identifiers to reference software vulnerabilities and are widely used in the software security industry. CVE reports usually contain a textual description of the vulnerability, the affected products, an impact analysis, and dates like the creation date and fix date. CPE entries are used in CVEs to reference specific software. They are special identifiers that name software, but do not link to the actual software outside of the NVD. Besides CVE and CPE, the NVD contains several other security-related enumerations.

2.1.5 Universal Description, Discovery and Integration

An interesting aspect of software repositories is Universal Description, Discovery and Integration (UDDI). UDDI was intended to be a registry for web services and was a key concept of Service-oriented Architecture (SOA). It was meant to provide automated discovery and description of public web services. Public UDDI never gained any relevance, but some companies still use it in-house. Typical automated tasks it facilitates include, but are not limited to, the building, setup, and orchestration of software based on loosely coupled services. Although UDDI was not a big success, it demonstrates that software services can be important dependencies of software products.

2.2 Semantic Web Technologies

The semantic web was originally envisioned by Tim Berners-Lee as the future of the World Wide Web (WWW) [8]. Unlike the traditional web, which is designed for humans, the semantic web is also designed for computers. Computers should be able to easily process and reason about the comprehensive knowledge in the web and thereby help people. The semantic web is about making data semantically well-defined and therefore easily processable. The following sections present several technologies and concepts that were created to fulfill this vision.

2.2.1 Resource Description Framework

RDF is an abstract data model and a web standard by the World Wide Web Consortium (W3C) [48]. It was originally designed as a metadata data model, but has since evolved into a more general knowledge representation model. RDF triples, which are also called statements, are the working principle of RDF. Arbitrary information can be expressed using triples. They are structured like simple sentences, i.e., they consist of a subject, a predicate, and an object.


Prefix    Namespace                                       Description
admssw:   http://purl.org/adms/sw/                        ADMS.SW Ontology
bds:      http://www.bigdata.com/rdf/search#              Bigdata Fulltext-Search
bom:      http://www.ifi.uzh.ch/ddis/evoont/2008/11/bom#  EvoOnt Bug Ontology
cc:       http://creativecommons.org/ns#                  Creative Commons Licensing
dc:       http://purl.org/dc/elements/1.1/                Dublin Core Elements Vocabulary
dcterms:  http://purl.org/dc/terms/                       Dublin Core Terms Vocabulary
deb:      http://metaservice.org/ns/deb#                  Metaservice Debian Ontology
doap:     http://usefulinc.com/ns/doap#                   DOAP Ontology
ex:       http://example.org/                             Namespace used for Examples
foaf:     http://xmlns.com/foaf/0.1/                      FOAF Ontology
ms:       http://metaservice.org/ns/metaservice#          Metaservice Observation Ontology
owl:      http://www.w3.org/2002/07/owl#                  OWL
rad:      http://www.w3.org/ns/radion#                    Repository Asset Distribution Ontology
rdf:      http://www.w3.org/1999/02/22-rdf-syntax-ns#     RDF
rdfs:     http://www.w3.org/2000/01/rdf-schema#           RDF Schema
sf:       http://sourceforge.net/api/sfelements.rdf#      Sourceforge Ontology
skos:     http://www.w3.org/2004/02/skos/core#            SKOS Ontology
som:      http://www.ifi.uzh.ch/ddis/evoont/2008/11/som#  EvoOnt Software Ontology
spdx:     http://spdx.org/rdf/terms#                      SPDX Ontology
swo:      http://www.ebi.ac.uk/swo/                       SWO Ontology
swrel:    http://metaservice.org/ns/swrel#                Metaservice SWREL Ontology
vom:      http://www.ifi.uzh.ch/ddis/evoont/2008/11/vom#  EvoOnt Version Ontology
xhv:      http://www.w3.org/1999/xhtml/vocab#             XHTML Vocabulary
xsd:      http://www.w3.org/2001/XMLSchema#               XML Schema

Table 2.3: RDF Namespaces and Prefixes

The semantics of statements resemble natural language: the subject is the described entity, the object is the related entity, and the predicate defines the kind of relationship between the subject and the object. A set of triples is a graph, i.e., subjects and objects are the graph's nodes and the predicates are the types of its directed edges. Resources, properties, and literals are the basic types of RDF. Resources are the nodes in an RDF graph and can be represented as Internationalized Resource Identifiers (IRIs), by convention with a capitalized local name. There are also blank nodes, where the identity of the node is not known, but the relationships are. Properties are the edge types and are also represented by IRIs, by convention with a lower-case local name. Unlike resources, properties may not be blank nodes. Literals are either plain text or Extensible Markup Language (XML) Schema typed data; they are also considered resources but can only be used as objects. IRIs [20] are the internationalized version of Uniform Resource Identifiers (URIs) [9] and allow the usage of almost all Unicode characters. They are probably the most frequently occurring element and therefore take up significant space in RDF documents. Namespaces are used to shorten IRIs by defining short sequences for common prefixes. These short prefixes are defined once per document and subsequently used to replace the prefix in the document. The usage of namespaces in RDF resembles namespaces in XML; in RDF/XML, XML namespaces are used to implement RDF namespaces. Table 2.3 shows a list of the namespaces and their abbreviations used throughout this thesis.


Figure 2.1: Example RDF Graph (resources ex:Anna, ex:Hans, and ex:Peter of type ex:Human, connected to ex:Flowers via the properties ex:likes and ex:doesntLike)

Figure 2.1 shows an example RDF graph. In this graph, Anna, Hans, and Peter are humans. Both Anna and Hans like flowers, and Peter likes Anna. Additionally, Peter explicitly does not like flowers. There exist several concrete, standardized, and widely used serialization formats for writing RDF, e.g., RDF/XML, a serialization using XML [24]; Turtle, a compact format focused on human readability [13]; RDF in attributes (RDFa), which allows the embedding of RDF in Hypertext Markup Language (HTML) documents [10]; and, more recently, JSON-LD, a format based on JavaScript Object Notation (JSON) [49]. Listings B.1 to B.4 show different serializations of the RDF graph displayed in Figure 2.1.
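For a first impression of such a serialization, the graph of Figure 2.1 can be written in Turtle roughly as follows; this is a sketch consistent with the figure, while the canonical serializations are given in Listings B.1 to B.4:

    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix ex:  <http://example.org/> .

    # Three resources of type ex:Human ...
    ex:Anna  rdf:type ex:Human ;
             ex:likes ex:Flowers .
    ex:Hans  rdf:type ex:Human ;
             ex:likes ex:Flowers .
    # ... and one that likes Anna but explicitly not flowers.
    ex:Peter rdf:type ex:Human ;
             ex:likes ex:Anna ;
             ex:doesntLike ex:Flowers .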

2.2.2 Vocabulary, Ontologies and Reasoning

The usage of ontologies, taxonomies, and schemes provides vocabulary for usage in RDF and leads to interoperability in the semantic web. These guidelines create the semantic connection between the content and the real world through descriptions of the different terms and concepts and mappings between them. There is no central authority that defines vocabularies; everybody may define and publish them. Hence, multiple concurrent vocabularies may exist for the same topic. On the one hand this is desired, to allow different abstraction layers, but on the other hand it hampers interoperability, because tools need to handle each of them. Ontologies allow the creation of semantic interconnections between vocabularies, such that concrete ones may be automatically translated to more abstract terms. Tools can then handle only the abstract vocabularies and let automated reasoning handle the translation. In addition to the abstraction relationships between vocabularies, ontologies allow defining constraints or implicit knowledge on resources or statements based on the resource's or property's type. In the case of resources this is done by describing classes, which are linked to resources by the rdf:type property. Properties are annotated directly, because they are not instantiated. Constraints and implicit knowledge, e.g., sub-class and sub-property relationships, transitive properties, functional properties, and subject or object restrictions on properties, can be expressed in this way.


An ontology with many constraints is called a heavyweight ontology, and an ontology with few constraints a lightweight ontology [29]. Heavyweight ontologies are usually crafted for a very specific use case and specify strict semantic descriptions. Lightweight ontologies are used for abstract ontologies, because constraints hamper the integration of ontologies. Reasoners like HermiT [108] and Pellet [131] provide inference of implicit statements in knowledge bases. They usually implement more features than the built-in reasoners of RDF databases. Tools like Protégé [132] provide a Graphical User Interface (GUI) and integrated reasoners for ontology creation.

Pure RDF Semantics

The most basic inference rules and vocabulary are provided by RDF itself [37]. This includes, for example, the inference of blank node statements from concrete statements, implicit type inference of predicates, and statement reification. Vocabulary is provided for basic types (rdf:Property), type definitions (rdf:type), collections and lists (rdf:Bag, rdf:Seq, rdf:List, rdf:Alt, rdf:first, rdf:rest, rdf:nil), and the reification of RDF statements (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). RDF semantics alone cannot be used for ontology creation, because it does not define any kind of inheritance mechanism.

RDF Schema

RDF Schema (RDFS) was developed as a data modeling extension to RDF, which allows the definition of custom classes and property types [32]. Its expressivity allows the creation of hierarchical ontologies; it is even used for the description of the RDF namespace itself. In RDFS, classes (rdfs:Class) can be used as types for resources. A resource with the type of a class is considered a member of the class. The subclass relationship (rdfs:subClassOf) can be used to make all members of the subclass automatically members of the superclass, i.e., when class A is a subclass of B and the type of the resource X is A, then X is automatically also a member of class B. Similar hierarchies for properties can be built using the subproperty relationship (rdfs:subPropertyOf). If there are properties C and D, with C being a subproperty of D, and there is a statement containing C as a predicate, one can infer the statement which substitutes D for C. To be able to link classes with the properties intended to be used with their members, RDFS provides the domain (rdfs:domain) and range (rdfs:range) properties for property description. The domain of a property is the class of resources which must be used in the subject position of a statement containing the property as a predicate. The usage of a property with a defined domain therefore allows reasoning on the type of the subject. The range of a property is defined analogously on the object of a statement. Additional properties for the semantic annotation of classes and property types are rdfs:comment for a textual description, rdfs:seeAlso for semantic references, and rdfs:label for a human-readable representation. Listing 2.1 shows the usage of RDFS vocabulary in the definition of rdfs:subClassOf.


    rdfs:subClassOf
        a rdf:Property ;
        rdfs:isDefinedBy <http://www.w3.org/2000/01/rdf-schema#> ;
        rdfs:label "subClassOf" ;
        rdfs:comment "The subject is a subclass of a class." ;
        rdfs:range rdfs:Class ;
        rdfs:domain rdfs:Class .

Listing 2.1: RDFS Description of rdfs:subClassOf using Turtle
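As a small illustration of the inference mechanisms described above, consider the following sketch; the class ex:Mammal and the schema statements are invented for this example:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix ex:   <http://example.org/> .

    # Schema: every human is a mammal; only humans appear as
    # subjects of ex:likes.
    ex:Human rdfs:subClassOf ex:Mammal .
    ex:likes rdfs:domain ex:Human .

    # Data: one explicit statement.
    ex:Anna ex:likes ex:Flowers .

    # An RDFS reasoner infers:
    #   ex:Anna rdf:type ex:Human .   (from rdfs:domain)
    #   ex:Anna rdf:type ex:Mammal .  (from rdfs:subClassOf)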

OWL Profile  Description
OWL 1 DL     Profile based on the original OWL standard. Largest subset known to be implementable using Description Logic.
OWL 1 Lite   Subset of OWL 1 DL, which restricts the language to an easier processable part. It is therefore easier to implement by tool developers.
OWL 1 Full   Not really a profile, but the whole OWL 1 language and semantics. Reasoning is not guaranteed to be decidable.
OWL 2 DL     Profile based on the OWL 2 standard. Largest subset known to be implementable using Description Logic. Compared to OWL 1 DL it has many new features, like punning.
OWL 2 EL     Subset of OWL 2 which allows extensive and expressive class hierarchies and guarantees efficient reasoning.
OWL 2 QL     Profile based on OWL 2, geared towards an efficient reasoning implementation on relational databases via ontology translation to SQL statements.
OWL 2 RL     Subset of OWL 2 DL which allows reasoning to be implemented by a rule-based system.
OWL 2 Full   Not really a profile, but the whole OWL 2 language and semantics. Reasoning is not guaranteed to be decidable.

Table 2.4: OWL Profiles

Web Ontology Language

Web Ontology Language (OWL) 1 and 2 are W3C recommendations which allow extensive reasoning and which build upon and extend RDFS [35, 73]. They are so expressive that reasoning using either complete language is undecidable. Therefore, different profiles were created which are limited to decidable parts; a description of these subsets can be seen in Table 2.4. The usage of OWL usually leads to heavyweight ontologies; hence, abstract ontologies use only small parts of OWL. The following is an excerpt of important OWL vocabulary.

owl:Ontology is the class of ontologies. It is usually used to describe the ontology itself.

owl:imports allows stating that another ontology should be loaded as a part of an ontology. This should not be confused with the simple referencing of elements from other ontologies.

owl:sameAs expresses equality between two resources, meaning that they refer to the same thing. All statements containing one of the resources are therefore also valid for the other resource.

owl:differentFrom links unequal resources, meaning that they may not refer to the same thing. If owl:sameAs is stated for resources that are different from each other, the knowledge base is considered inconsistent.

owl:equivalentClass expresses equality between two classes, meaning that they contain the same members. All resources which are a member of one class are therefore also members of the other class.

owl:disjointWith expresses disjointness of two classes, meaning that neither may contain members of the other. Any resource being a member of one class can therefore not be a member of the other class. If there exists a member of both classes, the knowledge base is considered inconsistent.

owl:equivalentProperty links two equivalent properties, meaning that they can be used interchangeably. If there is a statement containing one property as a predicate, then one can infer the same statement with the other property.

owl:inverseOf can be used to express that one property is the inverse of the other. A statement consisting of the subject A, the object B, and the property X can be used to infer a statement with the subject B, the object A, and the predicate Y, which is the inverse property of X.

owl:TransitiveProperty is the class of transitive properties. Considering a transitive property T, the statement ATC follows from the two statements ATB and BTC.

owl:SymmetricProperty is the class of symmetric properties. Considering the symmetric property S, the statement BSA follows from the statement ASB.

owl:FunctionalProperty is the class of functional properties. Each resource may only have one object linked by a functional property. If there are two different objects linked from the same resource using the same functional property, those objects are equal in the sense of owl:sameAs.

owl:InverseFunctionalProperty is the class of inverse functional properties. Each value may only be linked from one subject by an inverse functional property. If there are two resources with the same value of an inverse functional property, those resources are equal in the sense of owl:sameAs.

One of the main benefits of OWL is that it allows stating equality between different things. When descriptions are provided by different parties using different ontologies, processing all of them requires not only a mapping of classes and properties, but also of resources, which can be done using owl:sameAs. The combination of blank nodes and inverse functional properties allows indirect reference to resources by property values. The explicit modeling of the ontology itself using owl:Ontology allows attaching information to the ontology, such as its release date, author, and related ontologies.
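How inverse functional properties and blank nodes enable such indirect references can be sketched in Turtle. FOAF declares foaf:mbox to be inverse functional; the resource ex:Anna and the mail address are hypothetical:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/> .

    foaf:mbox a owl:InverseFunctionalProperty .

    ex:Anna foaf:mbox <mailto:[email protected]> .

    # A blank node referring to "whoever has this mailbox":
    _:someone foaf:mbox <mailto:[email protected]> ;
              foaf:name "Anna" .

    # A reasoner infers: _:someone owl:sameAs ex:Anna .

The blank node never names the person directly; the inverse functional property value alone suffices to identify her.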

2.2.3 Linked Data

Linked Data is a set of best practices that ensure interoperability of semantic web datasets.


In the original semantic web vision, semantic web data was interlinked to form a global network, just like the WWW is interlinked by hyperlinks. However, the original specifications were not explicit on how to realize this. Hence different mechanisms were created to connect the separated datasets, and supporting interoperability became harder [7]. Berners-Lee provided four simple rules, also known as the Linked Data principles, for publishing Linked Data [7]:

1. Use URIs as names for things
2. Use HTTP URIs so that people can look up those names
3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL Protocol and RDF Query Language (SPARQL))
4. Include links to other URIs, so that they can discover more things

The first principle states that when using URIs in statements, one does not refer to the document which may be dereferenceable by the URI, but uses it as a name to reference anything, e.g., people, cars, documents, concepts. This is the essential concept that upgrades RDF from a document metadata format to a generic metadata format. Based on this concept Bizer, Heath, and Berners-Lee describe the linked data semantic web, or web of data, as a "web of things in the world, described by data on the Web" [11].

The second principle ensures that a common standard is used to dereference URIs. One major reason for selecting the Hypertext Transfer Protocol (HTTP) is that it is an already established and widely adopted standard. Particularly useful features of HTTP for Linked Data are content negotiation mechanisms to provide different representations of data and forwarding mechanisms to delegate authority over a URI.

The third principle is that URIs used as names should actually be dereferenceable and therefore provide traversable links between different datasets. It also states that looking up a URI should provide useful information, that is, information about the thing named by the URI. The last principle implies that in order to create interconnections, datasets should link to resources of other datasets, instead of only to their own resources.

These rules have been well accepted and adopted by the community. An example is the Linking Open Data (LOD) community, whose goal is publishing and interconnecting existing open data repositories using semantic web technologies and Linked Data principles. In recent years the web of data created by this community has grown very fast and contains information from a broad range of areas. The fast growth of the LOD web can be seen in Appendix A. One could argue that this fast growth resembles the early days of the traditional WWW.
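The fourth principle, linking into other datasets, can be sketched in Turtle; the ex: dataset is hypothetical, while the DBpedia URI stands for an external dataset a publisher might link to:

    @prefix owl:  <http://www.w3.org/2002/07/owl#> .
    @prefix doap: <http://usefulinc.com/ns/doap#> .
    @prefix ex:   <http://example.org/projects/> .

    # A local resource linked to the same thing in an external dataset
    ex:httpd a doap:Project ;
        owl:sameAs <http://dbpedia.org/resource/Apache_HTTP_Server> .

Dereferencing either URI over HTTP should then yield useful RDF about the project, allowing clients to traverse from one dataset into the other.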

2.3 Software Ontologies

There are many different software ontologies, which can be distinguished by their abstraction level and their different views on software and its processes. In the following, only the most important existing software ontologies are analyzed in detail. The descriptions of the classes are taken from the ontology files and only adjusted slightly.


2.3.1 Description of a Project

DOAP [95] is currently the most established ontology for software projects. Like many other successful and widely adopted ontologies it is kept very simple and relatively lightweight, and it provides an abstract view on software projects. DOAP builds upon the FOAF [100] ontology, which describes people, their relationships among themselves, and things they do. This relationship to FOAF is expressed through properties which allow linking people to roles in a software project. The roles covered are developers, translators, testers, maintainers, documenters, and helpers. Each of them expects foaf:Person as the property range. Additionally, organizations can be linked as vendors, and the general type of audience can be described textually.

Another place where DOAP gets very specific is VCS repositories. As one can see in Table 2.5, eight of twelve classes are used to differentiate between the different repository types. This focus is only materialized in the class definitions; appropriate properties like branches, revisions, or similar are missing, with the exception of modules for addressing in some VCS repositories. It is also worth noting that all referenced VCSs are open source, with the exception of BitKeeper. BitKeeper was probably only added because the Linux kernel, one of the biggest open source software projects, used it in the past. Other popular commercial systems are missing. DOAP allows describing the source repository's location, web browser access path, and specialties like module names and anonymous access points for some repository types.

Besides repository links, DOAP also provides links to other project and development resources. Both current and historical homepages can be linked, but historical homepages are the only place where past information can be represented. Additionally, specific links for blogs, wikis, and direct links to screenshots and downloads as well as download mirrors can be provided. Mailing lists and bug databases are also properties of software projects described by DOAP. A notable link type leads from a software project to an endpoint which provides its software service.

Special vocabulary is included to describe technical aspects of software projects. Information about used programming languages, project categories, used software licenses, implementation platforms, and operating systems can be attached to them. For some of these properties DOAP requires the usage of literals for description. Besides the specific vocabulary, basic descriptive vocabulary like name, creation date, description, and short description is provided for general annotation. Of these only doap:name is a subproperty of rdfs:label; all others are not mapped by the ontology.

DOAP does not only specify software projects and repositories, but also software versions and specifications. Versions are linked from software projects by the doap:release property and are identified by a revision number or code. A direct file download link can be provided for each version of a project. Projects can also be described as implementing a specification, but DOAP does not provide any vocabulary for describing specifications.

Development of DOAP has stagnated over the last years, although it is publicly developed on GitHub. It can be considered the standard abstract software ontology, since many later created software ontologies provide mappings to it, or even include it.
Deficiencies may be seen in the overly detailed modeling of VCSs and the missing detail in other areas like specifications and software artifacts. DOAP is also almost completely missing software relationships from its vocabulary.
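A typical DOAP description might look like the following Turtle sketch; the project, its URLs, and the person are hypothetical:

    @prefix doap: <http://usefulinc.com/ns/doap#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/> .

    ex:mytool a doap:Project ;
        doap:name     "MyTool" ;
        doap:homepage <http://example.org/mytool> ;
        doap:programming-language "Java" ;
        doap:repository [
            a doap:GitRepository ;
            doap:location <git://example.org/mytool.git>
        ] ;
        doap:release [
            a doap:Version ;
            doap:revision "1.0.2" ;
            doap:file-release <http://example.org/mytool-1.0.2.tar.gz>
        ] ;
        doap:maintainer [ a foaf:Person ; foaf:name "Jane Doe" ] .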


URI                   Subclass of       Description
doap:Project          foaf:Project      A Software Project
doap:Version                            Version Information of a project release
doap:Specification                      A specification of a system's aspects, technical or otherwise.
doap:Repository                         Source code repository
doap:ArchRepository   doap:Repository   GNU Arch Repository
doap:BazaarBranch     doap:Repository   Bazaar Branch
doap:BKRepository     doap:Repository   BitKeeper Repository
doap:CVSRepository    doap:Repository   CVS Repository
doap:DarcsRepository  doap:Repository   darcs Repository
doap:GitRepository    doap:Repository   Git Repository
doap:HgRepository     doap:Repository   Mercurial Repository
doap:SVNRepository    doap:Repository   Subversion Repository

Table 2.5: Class descriptions from DOAP [95]

To summarize, DOAP is a stable, widely adopted, and therefore important ontology with limited expressiveness.

2.3.2 Software Package Data Exchange

SPDX [67, 140] is a software ontology which focuses on a standardized description of software artifacts and their licensing terms. It was developed to help people across the software supply chain communicate licensing information in a consistent and understandable way, so that software components can easily be used in compliance with licenses and policies. One can easily observe in Table 2.6 that the ontology contains very specific classes to support concrete data representation and processing. Besides the semantics, it also provides precise modeling of usage constraints in OWL. Of all software ontologies, SPDX is currently the one with the most industry backing.

The ontology builds upon and extends DOAP. A fundamental addition is the extension of the software model with packages. Software packages are functional components which are an actual representation of software, e.g., software installers or distribution packages. For every software release there may be one or more distribution packages. An example is the Debian project, where for each release packages are built for all supported architectures. Packages need not be compiled, but can also contain the source files of a software release. SPDX provides vocabulary for linking releases to packages and packages to their contained files.

Both packages and files can be annotated with checksums, licensing, and copyright information. The algorithms specified by SPDX are not generic algorithms, but only checksum algorithms. Both the checksum and the exact algorithm used for its calculation are provided, such that automated integrity checking is possible. SPDX allows representation of licensing information in the form of extracted licensing text and as a license itself. To cover cases where software is distributed under different licenses or combinations of licenses, the ontology contains the possibility to express conjunctive and disjunctive license sets.

Besides the vocabulary for licensing information, the SPDX project also provides a source for unique names and licensing texts of popular open source software licenses [141]. The ontology also allows recording manual licensing reviews as well as package creators and suppliers, such that the history of a package becomes traceable. Although the ontology has been used to improve the handling of connected software, software interrelationships themselves are not modeled in it. SPDX 1.2 is the current version, and significant changes concerning scope and modularity are expected in SPDX 2.0, which is currently in draft status. The roadmap for the new release promises better support for software besides packages and support for software interrelationship modeling. SPDX provides useful and high quality extensions to DOAP, and the preview of forthcoming releases promises an extension of the currently limited scope.
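A package with a checksum and a declared license set might be described as in the following Turtle sketch, assuming SPDX 1.2 vocabulary; the package ex:pkg and all of its values are hypothetical:

    @prefix spdx: <http://spdx.org/rdf/terms#> .
    @prefix ex:   <http://example.org/> .

    ex:pkg a spdx:Package ;
        spdx:name "mytool" ;
        spdx:checksum [
            a spdx:Checksum ;
            spdx:algorithm spdx:checksumAlgorithm_sha1 ;
            spdx:checksumValue "2fd4e1c67a2d28fced849ee1bb76e7391b93eb12"
        ] ;
        spdx:licenseDeclared [
            a spdx:ConjunctiveLicenseSet ;
            spdx:member <http://spdx.org/licenses/Apache-2.0> ,
                        <http://spdx.org/licenses/MIT>
        ] .

The explicit checksum algorithm allows a consumer to verify that a package at hand is identical to the one the description was produced from.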

2.3.3 Asset Description Metadata Schema for Software

ADMS.SW [30] is an ontology for the description of software and software repositories, which was created during a European research project on interoperability of software for public administrations [110]. Unlike other ontologies, ADMS.SW does not provide its own vocabulary for every defined term, but reuses the vocabulary of existing and well established ontologies. This can be observed in Table 2.7, where only very few classes are defined. Among the reused vocabularies are DOAP, SPDX, Repository Asset Distribution (RADion), Dublin Core (DC), and the Asset Description Metadata Schema (ADMS). RADion is an ontology for the description of abstract asset repositories. It was designed for cross repository information exchange, independent of the actual asset type provided in these repositories. ADMS extends RADion for usage on repositories of metadata.

The specification of the ontology encourages the reuse of specific collections of resources, so-called controlled vocabularies. These controlled vocabularies are especially useful for the interoperability of datasets in the sense of linked open data. Controlled vocabularies are proposed for programming languages, operating systems, topics, software licenses, file formats, and many other concepts.

ADMS.SW supports some software interrelationships in addition to its originating ontologies, such as stating that a project is a fork of another or that one software is included in another. Many other additional properties are specific to software repository metadata, e.g., download counts and categorizations like programming languages and operating systems. Other properties are provided to complete the vocabulary.

Berger and Bac [5] implemented ADMS.SW metadata descriptions for the Debian software repository. This was probably the first large scale usage of the ontology. Another implementation of ADMS.SW is the Joinup [112] repository. One of the key benefits of ADMS.SW is that it provides a collection of existing interoperable standards, instead of reinventing the wheel.


URI: spdx:AnyLicenseInfo
Description: The AnyLicenseInfo class includes all resources that represent licensing information.

URI: spdx:PackageVerificationCode
Description: A manifest based verification code (the algorithm is defined in section 4.7 of the full specification) of the package. This allows consumers of this data and/or database to determine if a package they have in hand is identical to the package from which the data was produced. This algorithm works even if the SPDX document is included in the package.

URI: spdx:ConjunctiveLicenseSet
Subclass of: rdfs:Container
Description: A ConjunctiveLicenseSet represents a set of licensing information all of which apply.

URI: spdx:Package
Description: A Package represents a collection of software files that are delivered as a single functional component.

URI: spdx:Review
Description: A Review represents an audit and signoff by an individual, organization or tool on the information in a SpdxDocument.

URI: spdx:DisjunctiveLicenseSet
Subclass of: rdfs:Container
Description: A DisjunctiveLicenseSet represents a set of licensing information where only one license applies at a time. This class implies that the recipient gets to choose one of these licenses they would prefer to use.

URI: spdx:ExtractedLicensingInfo
Description: An ExtractedLicensingInfo represents a license or licensing notice that was found in the package. Any license text that is recognized as a license may be represented as a License rather than an ExtractedLicensingInfo.

URI: spdx:Checksum
Description: A Checksum is a value that allows the contents of a file to be authenticated. Even small changes to the content of the file will change its checksum. This class allows the results of a variety of checksum and cryptographic message digest algorithms to be represented.

URI: spdx:CreationInfo
Description: A CreationInfo provides information about the individuals, organizations and tools involved in the creation of a SpdxDocument.

URI: spdx:SpdxDocument
Description: A SpdxDocument is a summary of the contents, provenance, ownership and licensing analysis of a specific software package. This is, effectively, the top level of SPDX information.

URI: spdx:File
Description: A File represents a named sequence of information that is contained in a software package.

URI: spdx:License
Subclass of: http://opensource.org/
Description: A License represents a copyright license. The SPDX license list website is annotated with these properties (using RDFa) to allow license data published there to be easily processed.

Table 2.6: Class Descriptions from SPDX [140]


URI: admssw:SoftwareRepository
Subclass of: radion:Repository
Description: A Software Repository is a system or service that provides facilities for storage and maintenance of descriptions of Software Projects, Software Releases and Software Distributions, and functionality that allows users to search and access these descriptions. A Software Repository will typically contain descriptions of several Software Projects, Software Releases and related Software Distributions.

URI: admssw:SoftwareProject
Subclass of: doap:Project
Description: A Software Project is a time-delimited undertaking with the objective to produce one or more software releases, materialized as software packages. Some projects are long-running undertakings, and do not have a clear time-delimited nature or project organization. In this case, the term 'software project' can be interpreted as the result of the work: a collection of related software releases that serve a common purpose.

URI: admssw:SoftwareRelease
Subclass of: doap:Version, rad:Asset
Description: A Software Release is an abstract entity that reflects the intellectual content of the software and represents those characteristics of the software that are independent of its physical embodiment. This abstract entity corresponds to the FRBR entity expression (the intellectual or artistic realization of a work). An example of a Software Release is the Apache HTTP Server 2.22.22 (httpd) release.

URI: admssw:SoftwarePackage
Subclass of: spdx:Package, rad:Distribution, schema:SoftwareApplication
Description: A Software Package represents a particular physical embodiment of a Software Release, which is an example of the FRBR entity manifestation (the physical embodiment of an expression of a work). A Software Package is typically a downloadable computer file (but in principle it could also be a paper document) that implements the intellectual content of a Software Release. A particular Software Package is associated with one and only one Software Release, while all Packages of a Release share the same intellectual content in different physical formats. An example of a Software Package is httpd-2.2.22.tar.gz, which represents the Unix Source of the Apache HTTP Server 2.22.22 (httpd) software release.

Table 2.7: Class Descriptions from ADMS.SW [30]

2.3.4 Software Ontology

SWO [52, 139] is an ontology which originates from the bioinformatics research area. Its development was motivated by the increasing need for reproducibility of computational investigations. To achieve this goal, SWO consists of an extensive model containing over 3,700 classes and over 50 property types. It covers both a broad range of abstract software concepts and very specific vocabulary. In contrast to common ontologies, SWO does not use human readable URI names, but numeric identifiers. The reason for this is that SWO was never designed to be written by humans, but to be automatically recorded by tools. This kind of identifier provides some benefits compared to conventional ones: there is no preference for a base language, and misapprehension is prevented because readers are required to consult the definition and description. Another peculiarity of SWO is the usage of classes for more or less concrete things, i.e., Adobe Illustrator 10 is not modeled as an individual, but as a class. Thereby the authors encourage the usage of classes as individuals, which is known as punning.


Name/URI: Software/swo:SWO_0000001
Subclass of: Information Content Entity/swo:IAO_0000030
Description: Computer software, or generally just software, is any set of machine-readable instructions (most often in the form of a computer program) that conform to a given syntax (sometimes referred to as a language) that is interpretable by a given processor and that directs a computer's processor to perform specific operations.

Name/URI: Software Interface/swo:SWO_9000050
Subclass of: Information Content Entity/swo:IAO_0000030
Description: The mode of interaction with a piece of software.

Name/URI: Algorithm/swo:SWO_0000064
Subclass of: Information Content Entity/swo:IAO_0000030
Description: An algorithm is a set of instructions for performing a particular calculation.

Name/URI: Software License/swo:SWO_0000002
Subclass of: Information Content Entity/swo:IAO_0000030
Description: A software license is a legal instrument (usually by way of contract law, with or without printed material) governing the use or redistribution of software.

Name/URI: License Clause/swo:SWO_9000005
Subclass of: Information Content Entity/swo:IAO_0000030
Description: A license clause is a component of a license which defines some aspect of a restriction or conversely permission in how something corresponding to a license may be legally redistributed, partially redistributed, extended, modified or otherwise used in some way.

Name/URI: Data/swo:data_0006
Subclass of: Information Content Entity/swo:IAO_0000030
Description: Information, represented in an information artifact (data record) that is 'understandable' by dedicated computational tools that can use the data as input or produce it as output.

Name/URI: Topic/swo:topic_0003
Subclass of: Information Content Entity/swo:IAO_0000030
Description: A category denoting a rather broad domain or field of interest, of study, application, work, data, or technology. Topics have no clearly defined borders between each other.

Name/URI: Data Format/swo:format_1915
Subclass of: Information Content Entity/swo:IAO_0000030
Description: A defined way or layout of representing and structuring data in a computer file, blob, string, message, or elsewhere.

Name/URI: Organization/swo:format_1915
Subclass of: Material Entity/swo:OBI_0000245
Description: An organization is a continuant entity which can play roles, has members, and has a set of organization rules. Members of organizations are either organizations themselves or individual people. Members can play specific organization member roles that are determined in the organization rules. The organization rules also determine how decisions are made on behalf of the organization by the organization members.

Table 2.8: Selection of SWO [139] Classes

The ontology does not provide mappings to other software ontologies. The top level structure of the classes is based on information content entities, material entities, processes, and roles. Most classes are defined as information content entities, which are informational concepts. Material concepts exist in the real world, processes are series of actions over time, and roles are defined as abstract placeholders. Table 2.8 shows selected abstract classes of SWO.

The software class is very abstract, and the ontology does not provide classes for a breakdown into releases or distribution packages, although there are properties which may be used to link a version. Through the explicit algorithm class it is possible to describe the functionality of software in more detail. For both, an extensive, but naturally not complete, list of concrete subclasses is provided. Software has properties to link to its supported input and output formats, software licenses, implemented algorithms, provided interfaces, and its maturity status. Additionally, publishers and developers can be linked to it. Relationships between software can be modeled by a part-of relationship and a uses-software relationship. Unlike any other analyzed software ontology, SWO also contains vocabulary for the description of software license clauses. It is possible to state specific restrictions like derivative, distribution, or installation count clauses for individual software licenses.

The current version of SWO is of varying quality, and its huge size surely makes it difficult to maintain. Detailed modeling of software and its interrelationships is currently not possible. Although the concepts were modeled explicitly for bioinformatics research, some of them are applicable to generic software modeling. A split of the different topic areas and an extraction of the very concrete classes would probably allow a broader usage of the ontology.

2.3.5 Other Software Ontologies

The NEPOMUK File Ontology (NFO) [54] was created as part of the Networked Environment for Personalized, Ontology-based Management of Unified Knowledge (NEPOMUK) semantic desktop project [31]. NEPOMUK is a project to extend the computer desktop into a collaborative integrated working environment. NFO is a core ontology of NEPOMUK, which describes different types of files, file systems, and services. Software is incorporated in the form of source code, archives, applications, operating systems, and software services. Relationships of software are limited to conflicts and inclusion, because they are not the focus of the ontology.

In his thesis [68], Tappolet introduces the software evolution ontology EvoOnt [96]. EvoOnt consists of the Bug Ontology Model (BOM), an ontology to describe software bug reports and trackers; the Software Ontology Model (SOM), an ontology to describe object oriented code; and the Version Ontology Model (VOM), an ontology to describe software versions and releases in source code repositories. Software relationships are only described in the VOM. Most of the classes and properties in the VOM are also part of ADMS.SW.

Software Evolution Ontologies (SEON) [137] is a pyramid of ontologies for software evolution and its applications [74]. In this pyramid, concepts are described at different abstraction layers: general concepts, domain spanning concepts, domain specific concepts, and system specific concepts. The domains currently handled by the ontology group are source code, bug control, and history. Therefore SEON and EvoOnt currently share very similar concepts.

Currently, there is only one software relationship specific property in SEON, which is the general concept dependsOn. Although there are sub-properties of dependsOn, they are only used in the domain specific concepts for source code and do not describe relationships of software, but relationships within software source code. The modular design of SEON would allow easy integration of software interrelationship ontologies as a domain specific concept; however, this has not been done yet.

2.4 Graph Databases

Graph databases treat relationships between things as a first class concept. Unlike traditional relational databases, where tuples are the main concept and relationships are only used to join tuples, in graph databases the relationship comes first. Very different graph database models have been developed, each using different graph concepts to model and manipulate data. Graphs may be directed or undirected; data may be stored in the vertices, in the edges, or in both; one or more edges may be allowed between two vertices; and edges may connect exactly two vertices or, in the case of hypergraphs, more than two. Rodriguez and Neubauer give an overview of different graph models [61]. Angles and Gutierrez provide an overview of the variety of different models [2]. Because of the very different models, there is no definitive query interface comparable to what SQL is for relational database systems. Like any database management system, graph databases can be categorized by several characteristics, i.e., performance, reliability, transaction management, scalability, clustering, high availability, and software license. Commercial and open source graph databases have matured in the last years and are widely used in production systems.

2.4.1 Property Graph Databases

Directed, labeled, and attributed multi-graphs are called property graphs and are the most common graph model used in graph databases [61]. Neo4j [123] is a popular graph database based on the property graph model. It supports Atomicity, Consistency, Isolation, Durability (ACID) transactions and clustering for high availability. Neo4j is partially open source, with extended features for commercial users. OrientDB [129] is a hybrid of a document store and a graph database. Unlike a plain property graph database, it allows full documents, and not only simple attributes, on vertices. The documents are fully queryable and indexable. OrientDB is distributed under an Apache 2 license, while some external features, like a query profiler, are sold commercially. Both Neo4j and OrientDB provide proprietary query languages and APIs for data access. Gremlin [106] is a graph traversal query language for property graphs, with support for many different graph databases. With it one can select nodes and edges by their properties and traverse the graph using different algorithms and patterns. Any graph database which implements the Blueprints API [87] can be queried with Gremlin. Both Neo4j and OrientDB implement the Blueprints API and are therefore queryable with Gremlin. Both Gremlin and Blueprints are part of the Tinkerpop framework [142], which contains several tools for graph databases.


Software relationship graphs can be modeled using property graphs as follows: software is modeled as nodes, and relationships are directed edges from one node to another. Each edge can be attributed with the type of relationship, and there may be multiple different relationships between the same two software nodes.

2.4.2 RDF Databases

RDF databases are special graph databases that build upon the RDF graph model. They are also called triple stores, because RDF statements are triples. Besides the actual storage of the statements, triple stores almost always implement additional features like reasoning or federated querying. There are different approaches to reasoning, namely materialization and dynamic query evaluation, which result in different performance characteristics for the different implementations. Yamaguchi et al. [75] and Serral et al. [65] provide recent benchmarks on triple stores. Results are mixed, but show that usage is feasible with billions of statements.

SPARQL

SPARQL [64] is a standardized query language for RDF databases. There are several query forms. SELECT answers with relations; this is especially useful if one needs to calculate data or retrieve information for a table-like representation. CONSTRUCT returns an RDF document which contains the triples described by pattern matching. ASK is used to ask basic yes/no questions, which originates in the idea of knowledge based systems; in practice it is not used as often as the other two. All three have in common that the WHERE clause describes the requested statements via pattern matching. It is also possible in SPARQL to filter by expressions, using the FILTER clause. Since SPARQL 1.1 [57], the language does not only include queries to retrieve information from a store, but also operations to insert new statements and to delete or update existing ones.
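A minimal SELECT query with a FILTER expression might look as follows; the data it matches, DOAP project descriptions, is assumed to be present in the store:

    PREFIX doap: <http://usefulinc.com/ns/doap#>

    SELECT ?project ?name WHERE {
        ?project a doap:Project ;
                 doap:name ?name .
        FILTER (STRSTARTS(?name, "Apache"))
    }

The pattern in the WHERE clause matches all projects and their names, while the FILTER clause (here using the SPARQL 1.1 function STRSTARTS) restricts the result table to names beginning with "Apache".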

2.5 Temporal Databases

A temporal database is a database which can handle different scopes of time, not only as a data type of entries, but also in its query semantics. The term temporal database does not imply a particular type of database system, such as relational or graph databases. Two often used temporal aspects in a database are transaction time and valid time. More sophisticated temporal databases allow custom temporal aspects and even combinations of more than one aspect.

A temporal database which considers transaction time allows access to the time at which information was stored in the database. Consider the following example of an attendance system, which keeps track of the start and stop times of a worker. An implementation on a non-temporal database would require storing the start and end of work for each worker and shift explicitly. A transaction time temporal database only needs to update the attendance status of the worker, because the database keeps track of the times a worker was present or absent. Another way to view a transaction time temporal database is that it retains all historic states of the database and makes them queryable.

A valid time database allows storing information together with the times it was valid in the real world. Any query on a valid time database is accompanied by an observation time, for which the validity of the information is calculated.


SELECT ?president FROM SNAPSHOT 2001 WHERE {
    ?president a <President> .
}

Listing 2.2: Time Point Query τ-SPARQL

Considering the attendance system example, in a valid time database, just like in a non-temporal database, it would be required to store the start and end of work for each worker and shift. The difference is that in a non-temporal database with temporal data types the query for validity needs to be built from comparison operators, while a query in a valid time temporal database directly includes the time which is queried. Time periods are usually allowed to be bounded or unbounded; unbounded time periods lack either a start or an end.

In addition to the type of time handled, one can distinguish between point and interval based temporal data models. The difference lies mainly in the handling of overlapping time slices. While in point based temporal semantics there is no difference between entries as long as the coalesced view is identical, interval based temporal semantics distinguishes how the intervals were added. Bohlen, Busatto, and Jensen [12] give a precise definition of the semantics and the differences between these two types of models.

There are several forms of queries on temporal databases. The simplest form is the time point query: every part of the queried information is evaluated with respect to a single point in time. Some temporal database systems allow queries that calculate the time during which a certain pattern of information was valid. Queries based on time intervals may be evaluated using different semantics, e.g., there must be at least one valid snapshot within the interval, or the pattern must be valid throughout the whole interval. Multi-temporal queries allow the usage of different temporal clauses on different parts of the information in the database. All these variations of temporal databases and semantics require specific types of indices and query evaluation systems, which are more complicated than those of non-temporal systems.

2.5.1 Temporal RDF

Several approaches have been made to extend RDF from a current-state model to a temporal data model. OWL-Time [55] is a time ontology for OWL. It contains vocabulary for advanced temporal concepts like intervals or time zones, but does not make RDF a temporal data model. Gutierrez, Hurtado, and Vaisman [33] introduced temporal semantics into RDF and provided a formal foundation for it, but did not implement actual extensions of SPARQL for querying. This was done by Tappolet and Bernstein [69], who presented an actual implementation of these concepts. They introduced τ-SPARQL, an extension of SPARQL which allows temporal queries and can be mapped to conventional queries. Examples of queries written in τ-SPARQL are shown in Listings 2.2 and 2.3.


SELECT ?president ?fromTime ?toTime WHERE {
    [?fromTime, ?toTime] ?president a <President> .
}

Listing 2.3: Validity Time Selection Query τ-SPARQL

SELECT ?president1, intersect(#interval1, #interval2) WHERE {
    ?president1 a <President> #interval1 .
    ?president2 a <President> #interval2 .
}

Listing 2.4: Validity Time Intersection SPARQL-ST

ex:Anna foaf:mbox [email protected]

Figure 2.2: Statement without Provenance Information

Further extensions of temporal concepts in RDF were introduced by Perry, Jain, and Sheth [56]. They did not only provide additional cutting point semantics for temporal data, but also extended the model and the corresponding query language SPARQL-ST to support spatial information. Compared to τ-SPARQL, SPARQL-ST supports multi-temporal queries and the intersection of time intervals; an example of this can be seen in Listing 2.4. A peculiarity is that SPARQL-ST uses the number sign not only for comments, like regular SPARQL, but also to denote temporal variables. Querying temporal RDF requires different access patterns and therefore other indices than regular RDF. Pugliese, Udrea, and Subrahmanian [58] provide an indexing scheme for temporal RDF to improve query performance.

2.6 Provenance in RDF

When working with RDF it is often necessary to store information about the origin, the time frame of validity, or other provenance information of the data. RDF does not provide first class features to store information about statements. Several approaches have been developed to allow the storage of provenance information; they are presented in the following sections. Examples are based on the statement in Figure 2.2, which should be stated to originate from ex:Internet and to be valid between 11/02/2011 and 12/03/2012.


[Figure: RDF graph of the statement reified as ex:Statement1, annotated with ex:origin ex:Internet, ex:validFrom, and ex:validTo]

Figure 2.3: Provenance Information using Reification

2.6.1 Reification

Reification is a built-in mechanism of RDF which allows the abstraction of statements. In a reified statement, the information is not encoded directly into a single triple, but into multiple statements that describe the original statement. Figure 2.3 shows the RDF graph using reification. It is important to note that a reification of a triple does not entail the triple itself, and vice versa. This is important because otherwise two reified statements which contradict each other could not coexist. One problem of reification is that the information is not directly encoded anymore, and it produces at least three additional statements per reified statement. Querying becomes very unnatural, because pattern matching needs to be done over many triples. This also impacts query performance, unless there is a specialized index for reified statements, which is usually not the case. Another problem with reification is that its reasoning semantics is not defined. Using reification for provenance information therefore leads to a major loss of RDF functionality.
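In Turtle, the reified example statement of Figure 2.3 might be written as follows; ex:origin, ex:validFrom, ex:validTo, and the mail address are the hypothetical example vocabulary used throughout this section:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/> .

    # Four triples describing the original statement ...
    ex:Statement1 a rdf:Statement ;
        rdf:subject   ex:Anna ;
        rdf:predicate foaf:mbox ;
        rdf:object    <mailto:[email protected]> ;
        # ... plus the provenance annotations
        ex:origin     ex:Internet ;
        ex:validFrom  "11/02/2011" ;
        ex:validTo    "12/03/2012" .

Note that the original triple itself is not entailed by this description and would have to be stated separately.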

2.6.2 Named Graphs

The insufficiency of reification for provenance information storage led to the development of named graphs [14]. Statements, usually represented by triples, are extended to quadruples by the addition of a supplementary resource. The additional resource is the name of the graph and can therefore be used to identify the graph a statement belongs to. Statements without an explicit named graph are implicitly put into the default graph. RDF is thereby extended in a way which allows representing not only one, but multiple graphs. The named graph can be used to attach provenance information to one or more statements. Figure 2.4 shows the example statement enclosed in a named graph, which in turn is linked to the provenance information. In comparison to the reified solution, a single conventional matching pattern can be used to directly match the statement. Indices for efficient pattern matching in quad stores require significantly more storage than in triple stores: while triple stores require only three indices to provide full coverage, quad stores need six.
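In TriG, a Turtle extension with syntax for named graphs, the example might be written as follows, again using the hypothetical ex: vocabulary:

    @prefix foaf: <http://xmlns.com/foaf/0.1/> .
    @prefix ex:   <http://example.org/> .

    # The statement, placed inside the named graph ex:graph1
    ex:graph1 {
        ex:Anna foaf:mbox <mailto:[email protected]> .
    }

    # Provenance attached to the graph name in the default graph
    ex:graph1 ex:origin    ex:Internet ;
              ex:validFrom "11/02/2011" ;
              ex:validTo   "12/03/2012" .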


[Figure: RDF graph of the statement enclosed in the named graph ex:graph1, which is annotated with ex:origin ex:Internet, ex:validFrom, and ex:validTo]

Figure 2.4: Provenance Information using Named Graphs

Named graphs therefore provide better querying performance for provenance data handling, but require more index changes on update operations. Named graphs have been standardized, but they are not included in all semantic web standards. Reasoning inside a named graph can be done with the existing semantics, but reasoning across multiple named graphs has no defined semantics; for example, it is not clear into which named graph an inferred statement should be placed.

[Figure: RDF graph of the statement attached to the statement identifier ex:StatementId1, which is annotated with ex:origin ex:Internet, ex:validFrom, and ex:validTo]

Figure 2.5: Provenance Information using Statement Identifiers

2.6.3 Statement Identifier

Like named graphs, statement identifiers are effectively quads, with the fourth statement element being a unique identifier. Hartig and Thompson present an alternative approach to reification, also known as Reification Done Right, based on Statement Identifiers (SIDs) [36, 71]. They propose an extension of RDF to allow statement attributes. Statements are used as first class RDF citizens in the role of subjects and objects.

Existing indexing strategies for property graphs can be used for indexing and cause less overhead than the full quad indexing which is needed for named graph support. Hartig and Thompson also provide proposals for extensions of SPARQL and RDF for syntactic support of SIDs. Still, the approach is not standardized, and current implementations are therefore very limited. In Figure 2.5 one can see that the graph representation looks similar to the named graph solution. The major difference is that named graphs may contain multiple statements for which the provenance information is valid, while provenance information is only valid for a single statement when using statement identifiers. Depending on how many statements share the same provenance information, this difference in semantics leads to better performance of either named graphs or SIDs.

2.7 Web Data Extraction

Web data extraction is about discovering, retrieving, extracting, and storing specific information from pages on the WWW. Information on the WWW is distributed across web pages, which store content, presentation, and hyperlinks to other web pages using HTML. This information is usually provided without semantic markup, such as semantic web technologies would provide, and is designed only for human interpretation. Web browsers translate the markup language into a visual representation, which is typically interacted with by mouse. Although automated processing is therefore very difficult, extraction of information is regularly done by web crawlers. Many important internet services, like search engines, rely on automated processing of web data. Ferrara et al. describe the following phases, which are needed for web data extraction systems [22]:

• Interaction with Web Pages

• Generation of a Wrapper

  – Automation and Extraction
  – Data Transformation
  – Use of extracted Data

The first step of web data extraction is navigating to the data source, and therefore interacting with the web pages. Modern web pages require the usage of many technologies to navigate to the data, e.g., JavaScript, forms, and cookies. Navigation is also necessary for the discovery of data sources when they are not published at a known location, but only interlinked from a directory page. The next step is to generate a wrapper, which is a computer program that automates the interaction with web pages and the data extraction from them. The previously examined navigation process needs to be replicated by the program. Data extraction often requires processing the web page like a web browser would, including the handling of malformed HTML markup. Some web pages require building the Document Object Model (DOM) and executing JavaScript to access the content. Then different techniques, like text processing, tree traversal, or optical recognition, may be used to extract the actual information from the unstructured data. This information is usually not in the required format and therefore needs transformation and postprocessing. When aggregating information from different sources, duplicates, spelling mistakes, and incorrect data need to be considered to retrieve useful data. Last but not least, the retrieved data can be stored or further processed.


3 Requirements

In this chapter the requirements for a software interrelationship service are explored. These requirements are needed to design and evaluate such a service. First, the stakeholders of the project are presented and analyzed in Section 3.1. Then, use cases are presented in Section 3.2; these use cases will be used to evaluate the results of this thesis. Finally, functional and non-functional requirements are stated in Sections 3.3 and 3.4. The list of requirements is designed for an actual working service. The designed and presented prototype only acts as a proof of concept and therefore does not implement all requirements listed here.

3.1 Stakeholders

According to ISO/IEC 15288, stakeholders are an “individual or organization having a right, share, claim, or interest in a system or in its possession of characteristics that meet their needs and expectation” [41]. Stakeholders’ needs and expectations are essential for software requirements engineering. Figure 3.1 shows the different stakeholders of an extensible software interrelationship service. The identified stakeholders are web users, API users, module developers, software repository operators, and system administrators. They and their expectations are described in the following sections.

[Figure: Web Users use the service directly; API Users use client applications which implement the service's API; Module Developers develop Extension Modules, which integrate Software Repositories; Software Repository Operators operate the repositories; the System Administrator manages the service]

Figure 3.1: Stakeholder Diagram


3.1.1 Web Users

Web users of the service are typically interested in software and its interrelationships. Their reason to use the service is to gain information about software for a better comprehension of software systems, without the need to analyze source code or unstructured documentation. They expect information to be available through a web browser, with service response times that allow interactive usage. Users want informative and appropriate views for each accessed information type, for an easy comprehension of the content. Additionally, they want to interactively find and explore information by search and by navigation across software interrelationships. Typical web users are software developers and system architects.

Expectations of Web Users

• Easy use of the service

• Low response times

• High quality information

• Compatible with multiple browsers

• Ability to search and browse

3.1.2 API Users

API users share the same interests as web users. The difference between them is that API users do not use the web interface of the service manually. Instead, they implement client programs which communicate with the API of the service to retrieve information. Client programs can integrate the information provided by the service into software development tools, offer a different visualization, or aggregate data for scientific analysis. These programs rely on a stable API and profit from a lean, unified one. For the usage of the service to be preferred over direct repository access, the effort needed to build and maintain the programs must be lower than that of conventional implementations for each repository. Typical API users are software tool developers and researchers.

Expectations of API Users

• Consistent semantics

• Good documentation

• Ready to use components for service access

• Quality of service


3.1.3 Module Developers

Module developers write modules for the integration of different repositories into the service. Module developers are therefore very important for the quality and range of information provided by the service. For the implementation of modules, developers rely on the API and its documentation provided by the service. Additionally, supporting tools for implementation and testing are very useful for development, as are ready-to-use code examples. Developers have preferred programming languages, which they also want to use for the module implementation. To encourage contributions by module developers, the complexity and bureaucracy of the contribution process is best kept low. Module developers may also be software repository providers who want to integrate their own repository.

Expectations of Module Developers

• Stable and easy to use and understand module API

• Good documentation and examples of the service and API

• Access to service source code

• Development and test utilities

• Easy contribution process

3.1.4 Software Repository Operators

Software repositories are the primary source of information for the service. Their operators can lock out the service if they do not agree with its usage; therefore their consent is very important. Operators need to take care of the performance of their repository service. Aggressive retrieval of information from repositories can lead to a degradation of that performance; therefore repository providers need to be able to impose constraints on crawling speed and frequency. Information in repositories is usually provided under a specific license. The terms of these licenses need to be respected, such that no rights of software repository operators are infringed. Commercial and noncommercial operators can be distinguished. Because they have no financial interest in the repository data, noncommercial operators are more likely to cooperate.

Expectations of Software Repository Operators

• Consideration of licensing constraints on the repository data

• Consideration of crawling constraints provided by the repositories

• Proper attribution of information sources


3.1.5 System Administrators

System administrators are responsible for the operation of the software interrelationship service. They need to take care of its security, backup, reliability, and performance and therefore rely on statistics and reports about the service. Based on these they need to be able to control and make changes to different aspects of the service. Additionally, they also decide which modules should be included and afterwards take care of module lifecycles, i.e., installation, monitoring, upgrade and removal.

Expectations of System Administrators

• Powerful and easy to use administration interface

• Access to detailed system statistics

• Support for automation of maintenance tasks

3.2 Use cases

Use cases describe the intended use of a software system. They can be used for documentation of requirements and can serve as a basis for the evaluation of implemented solutions. The following list of use cases is not exhaustive, but reduced to the most significant ways of interaction with the system. These include the integration of data sources and utilization of the service in different aspects.

3.2.1 Integration of Software Repositories

Integration of software repositories into the service is the major use case for module developers. Software repositories are the source of the information provided by the service. Different software repositories provide data in different formats and with different meanings. The service, on the other hand, needs to provide a unified way to access software and software interrelationship information. Therefore, module developers need to provide a syntactic and semantic mapping between the repository and the service. The content of software repositories can change over time; consequently, the service should be able to handle these changes. Additionally, module developers need to be able to define how and where repository data can be retrieved. It is important that the integration of different repositories is facilitated, to encourage the implementation of integrations.

3.2.2 Manual Browsing through Software Metadata and Software Interrelationships

Manual browsing is the main use case for web users. Browsing can be split into information discovery through navigation and through search. Navigation is the traversal between related pieces of information, meaning that cross links can be followed.

Search lets users find information based on specific criteria. One of the most common forms of search is full text search, where a text pattern is looked for in the whole dataset. Another form is structured search, where the matching of results is based on structure, e.g., information newer than a specific date or from a specific source.

When users have found the desired information, they need to interpret it and evaluate its usefulness. The same information can be presented in different views; one view may provide a better comprehension of the information through the selection and arrangement of information. Different types of information require different types of views. It is also important to show the source of the displayed information.

Web users are nowadays used to responsive web applications. If a result is not provided within a short period, they leave the page and look elsewhere. It is therefore important to meet the response time requirements typical for web applications. A key reason for users to use the service is that it aggregates information from different sources, so that the user does not have to query each of them separately. Therefore the usability of the service increases with the number of integrated information sources.

3.2.3 Security Alerts based on Security Report Propagation

Security alerts based on security report propagation are a use case for the automated usage of the software interrelationships provided by the service API. Component based software development is state of the art; software development from scratch, without the reuse of libraries, is very rare. Through the transitivity of software dependencies, even trivial applications may require an incomprehensibly large network of software components. Security relevant bugs of software are usually reported in public databases like CVE. Vulnerabilities of one application may propagate to another when software components are reused. Following the path of relationships between a report and all affected software systems is not a trivial task. Also, the number of CVEs is so large that manual review would not be worthwhile even if path traversal were easy. Therefore alerts are necessary to save resources. Open source project maintainers, developers, and system administrators are interested in vulnerability information concerning their projects or systems. They should be able to register for alerts for their directly used components, and receive notifications if any of them could be affected, directly or through interrelationships. The type of alert is not specified; it could be an email, a mobile text message, a bug report, or similar. The better the coverage of software components in the dependency model, the better this application performs.
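With the dependency graph exposed through a SPARQL endpoint, the propagation step could be sketched as a SPARQL 1.1 property path query; the ex: vocabulary (ex:VulnerabilityReport, ex:affects, ex:dependsOn) is hypothetical:

    PREFIX ex: <http://example.org/ontology#>

    SELECT DISTINCT ?software WHERE {
        ?report a ex:VulnerabilityReport ;
                ex:affects ?component .
        # Follow dependency edges transitively (zero or more steps)
        ?software ex:dependsOn* ?component .
    }

Every software whose dependency chain reaches an affected component would be returned, which is exactly the set for which alerts should be raised.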

3.2.4 Discovery of potential Licensing Conflicts

Discovery of potential licensing conflicts is a use case for automated usage of the service API. The right to use and integrate software libraries is not automatically granted, but controlled by copyright holders, who grant licenses for usage. Infringement of intellectual property rights by unauthorized usage of software components is not a trivial offense and may lead to prosecution. The complexity of software dependency networks and the incomprehensibility of licensing terms for software engineers often lead to unintended violations of licenses.


An example of a software license is the GPL. Among other clauses it contains a so-called copyleft clause, which often makes it incompatible with other licenses. Copyleft clauses require that derivative works be distributed under the same licensing terms as the originating software. In the case of the GPL, not only classical derivation but also dynamic or static linking to the software creates a derivative work. Manual review of the compatibility of software licenses is a time and therefore cost intensive task. Consequently, FOSS projects have consolidated on a few licenses. The compatibility of these licenses can be evaluated statically and then used, together with the dependency graph, to automatically find violations through a computer program. The quality of the dependency graph model improves with the share of actually used software components that are represented. Crossing ecosystem borders improves the completeness of the model. For automatic evaluation of licensing violations it is required that software dependencies in the dependency graph are subdivided by the type of dependency, such that specific licensing constraints like linking can be handled.
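A minimal sketch of such an automated check, under the assumption that the statically evaluated compatibility matrix is itself stored as triples: the query joins every dependency edge with the licenses of both sides. doap:license is an existing DOAP property and swrel:depends comes from SWREL (Section 4.7.2), while the lic: namespace and lic:incompatibleWith are hypothetical.

PREFIX swrel: <http://metaservice.org/ns/swrel#>
PREFIX doap:  <http://usefulinc.com/ns/doap#>
PREFIX lic:   <http://example.org/ns/licensing#>

SELECT ?project ?library ?projectLicense ?libraryLicense
WHERE {
    ?project swrel:depends ?library ;
             doap:license  ?projectLicense .
    ?library doap:license  ?libraryLicense .
    # hypothetical, statically evaluated license compatibility matrix
    ?projectLicense lic:incompatibleWith ?libraryLicense .
}

With dependencies subdivided by type as required above, swrel:depends could be narrowed to a linking-specific property, so that only the licensing constraints which actually apply are reported.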

3.3 Functional Requirements

The following list summarizes the functional requirements, which are based on the problem statement, the stakeholder analysis, and the use cases.

1. Web Interface

a) Full Text Search
b) Structured Search
c) Navigation between related information
d) Different views depending on information
e) Show sources of displayed information

2. Client API

a) Unified data format
b) Defined semantics

3. Module API

a) Information retrieval
b) Syntactic and semantic mappings
c) Handle changes of sources
d) Interconnect information from different sources

4. Management

a) Management of stored Information
b) Install, Upgrade, and Remove Modules


c) Display service status and statistics
d) Scheduling of tasks

3.4 Non-Functional Requirements

Non-functional requirements describe the required qualities of the implementation of functions, i.e., how a specific feature is implemented. Unlike functional requirements, a simple enumeration is not enough to engineer non-functional requirements. Therefore they are listed and discussed in the following.

3.4.1 Scalability

Ubuntu Raring Ringtail alone contains over 30,000 packages for a single distribution release. Each release has its own sets of repositories for each architecture, and there have been 20 Ubuntu releases so far. DistroWatch [94], an online directory of Linux distributions, lists over 100 Debian derivatives which can be considered part of the Debian ecosystem. Although Raring Ringtail is one of the bigger releases, these numbers show that the number of existing packages, and therefore relationships, in the Debian ecosystem is huge. Module Counts [119] is a tracker for software module counts. It currently contains statistics for over 15 ecosystem specific repositories. Maven Central and Node Packaged Modules (NPM), the package manager of Node.js, each contain over 80,000 unique modules, for which several releases and packages exist. GitHub [101] claims on its press information page to host over 15.3 million software source repositories, and every day new software is released. Bavota et al. [4] have concluded that the Apache Java ecosystem grows exponentially. Therefore a system which wants to cover a large part of existing software needs to be able to scale to large amounts of data. Depending on the actual usage, the supported client count also needs to scale appropriately. Scaling vertically is not an option, because the computing resources of a single machine are limited; hence only horizontal scaling is a viable alternative.

3.4.2 Extendibility

The utility of the service rises with the number of integrated ecosystems. The module system makes the service extendable, but for extendibility it is also important that the service can be extended easily. Since modules solve the same kinds of tasks, utilities can be written to remove redundant work. These utilities should be as abstract as possible, such that module developers need not bother with implementation details. A module framework should take the heavy lifting off the individual extensions, such that the overall effort is minimal. Another aspect is that modules should be able to work with less abstraction if necessary. For example, the majority of modules probably will not want to deal with the temporal complexity of the data, but some special modules may need the ability to control temporal information. The API should therefore also allow lower-level access when needed.


The service should encourage many-to-one semantic and syntactic integration instead of many-to-many integration, by providing a common standard. The effort for many different modules is minimized by removing the need for cross-integration.

3.4.3 Security

Security needs to be considered for every online service to prevent manipulation and misuse. Since the service is extendible and module contributions may come from untrusted sources, their execution needs to be controlled. This is true for the generators of the content as well as for the views which are used on the web interface. Possible attack vectors for the service are as follows:

Tampered Data An attacker could inject wrong or biased data into the system and therefore compromise the data integrity. As the service is designed to be a data source for external applications, those can be indirectly attacked by targeting the service or its sources. The credibility of the used sources needs to be verified and service users need to be informed of the credibility status of the different sources. When tampered data is detected, it should be possible to isolate the identified statements from the rest of the data set.

Malicious Code Execution Modules could be programmed to execute malicious tasks in the name of the service operator. Through the separation of module developers and operators, operators are required to trust the executables from the developers. A process is needed to create and maintain the trust relationship between both. Trust can be established through a combination of code reviews, signed executables, transfer of liability, and continuous cooperation. It is necessary to be able to detect malicious behavior of one's own infrastructure, e.g., through fingerprinting of legitimate network traffic patterns. A special case of malicious code execution is the injection of JavaScript into the HTML frontend. Injections may originate not only from the modules, but also from repository sources or even unfiltered URI parameters.

Denial of Service The service is overloaded with requests, such that the system is not responsive anymore. Since the service provides a programmatic API, care must be taken that requests cannot lock up significant resources of the whole service. Special care is needed if the service executes complicated user defined queries, as those may be specially crafted to take up resources.

3.4.4 Usability and Documentation

The service needs to provide an easily usable interface to its users. This includes both web interface and API users. Human users are accustomed to specific interface patterns for web sites. These usability patterns should be applied, such that users are able to use the service intuitively. Different classes of information may have different appropriate representation types. Another aspect of usability for human users is accessibility and therefore compatibility with different types of browsers.


For API users both API and concept documentation are essential. Programming examples are useful for learning purposes and as a quick start for development. Developers need to be able to set up their development environment and test systems in a way compatible with the service. This includes the ability to easily access, replicate, or simulate the infrastructure.

3.4.5 Performance

Overall system performance is required to fulfill many requirements. First of all, calculations need to be executed in a reasonable time. This is important for cost reasons when hosting the service and for development reasons to enable short feedback cycles. Bad performance also makes denial of service attacks easier, since fewer requests can utilize all of the resources. Performance is also important for web users, who expect fast response times for interactive browsing. In combination with the extendibility of the service, performance needs to be guaranteed also for large data sets. Performance of background tasks is not as important as on the frontend.

3.4.6 Maintainability

The maintainability of an interrelationship service is very important, even more so when it is hosted as a single instance. Downtime should be reduced by making maintenance tasks as easy and automated as possible. It should be possible to perform software upgrades seamlessly, without service interruption. In combination with the extendibility requirement it is essential to easily detect malfunctions in single extensions without interrupting the whole service. Maintainability of the source code itself is needed for further development. This includes splitting the system into loosely coupled components, which can be modified without impact on the whole system. Existing coding standards should be followed, depending on the chosen implementation platform.



4 Design

In this chapter the requirements and the problem description are transferred into a design concept. Until this point, the thesis has been very generic: everything stated was applicable to any interrelationship service. From this point onward, the focus of this thesis is metaservice, the proposed implementation of an interrelationship service. Section 4.1 narratively describes the path which led to metaservice's design. Section 4.2 gives an overview of its different components and their interactions. Section 4.3 shows the path data takes through the system and how the different components interact to process the knowledge. Section 4.4 describes how data is stored and which semantics are applied. Section 4.5 shows how temporal queries based on the data model work. Section 4.6 outlines the interface design. Section 4.7 describes the metaservice semantic web ontologies.

4.1 Initial Design Considerations

When developing a solution from scratch, one can choose from different design options. Metaservice was not designed as a semantic web application from the beginning and did not handle temporal data until late. Actually, semantic web technologies were not considered at all at first, because they seemed to be an esoteric approach and overly complicated. This section gives brief insight into design concepts which were considered but rejected or further refined throughout the iterative design cycles. At first sight, programming a software interrelationship service seemed just like programming any other web application. Therefore relationships and metadata were thought of as tuples, which are stored in a common relational database. Soon it became apparent that there are many different repositories to integrate and that they do not share the same data schema. Modeling the database schema was not possible unless one restricted oneself to a certain subset of repository types. Schema-less databases seemed to offer a solution to exactly this problem. In the last years document databases like MongoDB [120] had grown in popularity and matured to production usage. They provided good performance and flexibility, which was needed for an interrelationship database. The compromise to be made for document stores was the lack of SQL. On further investigation the data's graph structure became more and more visible. The lack of proper join operations to follow relationships seemed to make document stores unusable for this task. Therefore graph databases were investigated, to evaluate whether they provide a better suited model. During this investigation OrientDB [129] came into scope and seemed just the right combination of both worlds. The combination of document store and graph database made it the favorite choice for an implementation for a long time. Only when the mappings between the different data sources were considered did it become apparent that semantic technology would make their unification easier. Ontologies seemed the best tool to model the hierarchical mapping, and RDF stores provided graph structure and schema freedom, which had already shown to be important storage concepts. Additionally, in contrast to document stores and graph databases, there was SPARQL, a fully-fledged and standardized query language with similar expressiveness as SQL has for relational databases. Performance and scalability of RDF databases also seemed to have matured over the last years. Therefore semantic web technologies were used to create a first proof of concept. This proof of concept worked fine at first, until data in the used repositories began to change. Through deletion and modification in the data sources, inconsistencies in the data appeared. Because of this, it also became clearer that historic information in software repositories is lost when it changes, although the deleted information may still be important to people working with older software. Hence repositories needed to be archived, such that historical information could be preserved. With the decision to store temporal data, RDF, which models only current data, consequently showed its problems with temporal validity. Many approaches were taken to flatten the data to fit into regular RDF, but none was successful. Therefore a full temporal model for RDF is required. This model is one of the contributions of this thesis and is described in the following sections.

4.2 System Architecture

The architecture of metaservice is designed under consideration of the scalability and extendibility requirements. Therefore a distributed and component based architecture is chosen, which is extendible through modules. Figure 4.1 shows the different components and their interactions. Components which are depicted twice may exist multiple times as different implementations. The oval around the parser and provider means that they are executed as a single component. Following are descriptions of the different components, the structure of modules, and the distributed deployment and execution.

Figure 4.1: Component Diagram (components: Manager, Frontend, Cache, RDF Database, Messaging Service, Archive, Crawler, Parser, Provider, Postprocessor)

4.2.1 Components

The core components of metaservice are:


Manager The manager is the foundation of the service. It keeps track of system status and allows the installation, upgrade, and removal of modules. Further it controls the execution of the different components and allows insights into service status and statistics.

RDF Database The database stores all information about software and its interrelationships. Since only a single database is needed for the service, the chosen implementation needs to be able to scale to large amounts of data. It provides a SPARQL endpoint, which is used to communicate with the database. The RDF database does not provide any reasoning, but is only used for storage and query execution.

Messaging Service The messaging service makes asynchronous communication across the different service components available. It is responsible for queuing and distribution of processing tasks to the different worker instances. Like the RDF database, the messaging service is executed as a single instance.

Frontend The frontend is responsible for both the human and computer interface for users and software clients. It transforms incoming requests and either fetches results from cache or delegates them to the RDF store. It is stateless and therefore may be distributed for load balancing.

Cache A cache can be optionally used to improve the frontend performance. Any fast key value store is sufficient.

Archives Archives handle the historical storage and access to the raw data of source repositories. For each software repository there is one archive.

4.2.2 Modules

Modules are used to extend the service. They can be installed, removed, and upgraded at runtime. Each module may provide different types of components. A module is not limited to one component per type; each type may be included several times.

Module Components

The components which can be declared as part of modules are as follows.

Crawlers Crawlers are used to retrieve data from the source repositories. The raw data is only fetched and stored in an archive. The messaging service is then used to asynchronously launch further processing.

Parsers Parsers are used to convert raw data to processable objects and can be reused for different repositories. They are executed in the same process as providers, to avoid serialization of the objects.


Providers Providers take the parsed repository data from archives and convert it into RDF Statements, according to specific ontologies. The resulting statements are automatically enriched with provenance information and then stored in the RDF database. Changes of the statements in the RDF database are then propagated to postprocessors by means of the messaging service.

Postprocessors Postprocessors are notified as soon as there is a change in the database. They have access to the database and therefore can infer additional knowledge. Typical jobs a postprocessor may do are interlinking of existing resources from different repositories, further analysis of the underlying artifacts or notification of outside services. Data inferred by postprocessors is stored as RDF Statements, enriched with provenance information and stored in the RDF database. Any change created by postprocessors is fed back to other postprocessors through messages to the messaging service.
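To illustrate the interlinking job, a postprocessor might emit owl:sameAs links between resources from different repositories that plausibly describe the same project. The following SPARQL sketch matches naively on an identical DOAP homepage; the heuristic is purely illustrative and not metaservice's actual linking logic.

PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX doap: <http://usefulinc.com/ns/doap#>

# link resources that share the same homepage
CONSTRUCT { ?a owl:sameAs ?b }
WHERE {
    ?a doap:homepage ?homepage .
    ?b doap:homepage ?homepage .
    FILTER ( ?a != ?b )
}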

Static Data

In addition to executable components, modules can also provide one or more of the following.

Ontologies Ontologies are used to semantically define the schema of data. These ontologies typically provide a very specific model, which is mapped to more abstract variants provided by metaservice.

Views Views can be defined for specific RDF classes. Users can choose between different available views when viewing resources through the web interface.

Software Repositories Locations of software repositories and parameterization for processing can be provided. Additionally the type of crawler and parser which needs to be used for it is linked.

4.2.3 Distributed Execution

The manager is responsible for the execution of the different components. The manager itself, the RDF database, and the messaging service can only be executed once, although clustering is possible. All other components can be executed on different nodes and therefore the service can scale horizontally. The only shared resources are the database, the messaging service, and the archives. Therefore metaservice components can be considered loosely coupled. Since the messaging service provides load balancing, multiple workers of the same component type may be deployed on different nodes. Of the components described, only parsers and providers are executed in the same process and are therefore tightly coupled, to avoid serialization of large amounts of parsed data.

4.3 Processing Model

This section describes the dynamic structure of metaservice and its calculation model.


4.3.1 Observations

Since information from many data sources may contradict each other, reasoning across the whole knowledge base does not work. Instead metaservice applies the concept of observations, which encapsulate a set of knowledge from a specific source and from a specific perspective. Different generators, meaning providers and postprocessors, are therefore allowed to generate contradictory observations that do not interfere with each other. The selection of trusted data is thereby deferred to query time. A single run of a generator produces exactly one observation. Running the generator again with the same parameters may give a changed observation which overwrites the previous one. Generators are expected to be deterministic, meaning that they should always calculate the same result. However, changes can still happen when the outside context or the generator code changes.

4.3.2 Processing Pipelines

Metaservice uses processing pipelines to modularize the execution and thereby improve resource utilization. Workers are structured as pipelines, where each pipeline segment executes a specific task. When messages are put into the pipeline, they are pushed from segment to segment, always taking preliminary results with them. Thereby the context of the calculation is split from the pipeline segments and bound to the processed message. To increase the throughput of the pipeline, segments which require longer calculation time or do long running or blocking Input/Output (IO) can be instantiated multiple times. Through concurrent execution, idle times can be reduced and multiple cores of modern Central Processing Unit (CPU) architectures can be utilized. Unlike simple concurrent execution of the whole pipeline, segment based concurrency allows more fine-grained control of resource utilization. Too many parallel executions of the same segment, each requiring much memory, may lead to out of memory errors. This can be prevented by segment based concurrency control.

4.3.3 Data Retrieval and First Processing

The metaservice process starts with a crawler retrieving new information from a software repository. The retrieved data is then saved in the corresponding archive. As soon as the crawling and storage is completed, a message is sent to the create messaging topic. As soon as computing resources are available, the create message is consumed by the provider pipeline. Figure 4.2 shows a schematic diagram of the pipeline. The dotted duplicates stand for theoretical concurrency of pipeline segments.

Figure 4.2: Provider Pipeline (segments: Parse, Map Statements, Calculate Metadata, Check Changes, Store Statements, Send Events)

First the message is analyzed and the correct parser component is chosen for the repository data. Then the parser reads and parses the data from the archive and forwards it to the provider in an object format. The provider then creates statements that describe the contents of the parsed data. These statements are stored in an in-memory RDF store, where implicit information is materialized through reasoning based on specified ontologies. The reason for statement inference outside of the database is that it can happen decentralized and therefore can scale better. This way also only a limited dataset, namely the observation of the provider, is considered. Next all statements are extended to quads, such that they contain a provenance identifier. This identifier is then used to describe, for example, how, when, and by which generator the statements were generated. The annotated statements are then transferred into the RDF database as a single observation, from where they are available to postprocessors and the frontend. After successful storage, the resources that were created by the provider need to be considered as changed and therefore further processed by postprocessors. For each of these resources a message is sent to the postprocess topic.

4.3.4 Postprocessing

The postprocessing pipeline, which can be seen in Figure 4.3, takes care of further data processing that goes beyond simple mapping. It has access to the database and is executed whenever relevant statements in the database change.

Figure 4.3: Postprocessor Pipeline (segments: Prefilter, Postprocess, Calculate Metadata, Check Changes, Store Statements, Send Events)

As soon as there is a message on the postprocess topic and processing resources are available, it is passed to the postprocessing pipeline workers. Before doing actual resource intensive work, postprocessors detect whether the given resource is suitable for processing. Additional filtering is done to check whether the request is too old or whether the message originates directly from a change by the same postprocessor. Only if all checks pass does the actual postprocessing take place. For the actual postprocessing, postprocessors have access to the RDF database and can also access external resources. It is important that postprocessing results are consistent; therefore only reliable external resources may be considered. Additionally postprocessors are required to be deterministic and functional, i.e., if no external factors change, a calculation for a resource always returns the same result set. Although postprocessors are always triggered by single resources, they are not limited to producing results exclusively for those resources. Instead the result can contain a whole set of resources. If any of the resources in the result set is suitable for processing by the same postprocessor, it is required to trigger the calculation of the same result set. These resources are called authoritative subjects of the result set and are described in more detail in Section 4.4.4. Authoritative subjects are essential to identify postprocessor observations. In the end postprocessors encode the inferred knowledge into triples, which are subsequently stored in an in-memory triple store. Just like in the provider pipeline, implicit statements are materialized by reasoning with regard to specified ontologies.


This knowledge is enriched with provenance metadata, like date and time of postprocessing and postprocessor identifier and version, and then committed to the RDF database. If there previously was an observation by the same postprocessor about any of the authoritative subjects in the database, it is replaced. After the statements are committed to the database, the postprocess topic is triggered for all resources in the resource set. The reason for this is that other postprocessors are able to build upon the information inferred during the execution. Since sending messages on change may lead to endless loops, the reason and origin of the message is always attached. This makes it possible to check whether a postprocessor has already processed one of the authoritative subjects of the message previously. If it has, further recursive postprocessing is consequently prohibited. Theoretically the recursion checking rule can be changed and alternatives to prevent infinite loops may be deployed. These alternatives could be limiting the maximum recursion level or the repetitive execution count.

4.4 Data Model

The pure RDF data model is not suitable as a real temporal data model and does not comprise the previously introduced observation concept. This section defines an enhanced data model which is capable of both. It is actually composed of two different data models, because of differences in temporal scope and semantics between observations of providers and postprocessors.

4.4.1 Definitions

To specify a data model and explain its semantics it is necessary to distinguish between different concepts of time and validity.

Temporal Concepts

An observation contains information which originates from a retrieval of data from a source repository. In this thesis the point in time at which the information was retrieved is called the data time. As repositories usually describe the currently valid state, the data time is often also the valid time of an observation. The data time and valid time may differ if the source repository provides an explicit valid time for the stored information. It is also possible to interpret the valid time as the interval during which an observation is valid. Another temporal aspect of an observation is when it was calculated and stored into the database, which is its creation time. Observations can be verified to still contain the same statements at a later point in time. The last point in time when the correctness of the content was checked is the last checked time. While the creation time of an observation never changes, the last checked time may change repeatedly. The query time is the time for which a query on the database expects valid results, i.e., it is not the time when a query is executed, but the time at which the data in the database needs to be valid to be considered by the query.


Validity Concepts

The validity of a statement determines whether it needs to be considered or not. Validity always depends on the context of a statement. In the following, validity in different contexts is defined for metaservice. A statement is valid in an RDF database if and only if it is stored as a triple in it. Further, a statement is valid in an observation if and only if an observation which contains the triple is stored in the database. Additionally there is validity at a point in time, which is given when an observation's valid time contains the point in time in question.

4.4.2 Observation Semantics and Provenance Information

Observations are used to allow different opinions on reality. Each observation is unique in who generated it, what the described thing is, and for which data time it was created, and can be identified thereby. Unless there is an encoding error, there should be no possibility for contradictory information within an observation. There may be conflicts between statements of two observations, but not between statements of the same observation. Observations are constructed to defer the aggregation of information from different sources to the latest time possible, which is query time. When writing queries on a set of observations, a selection of believed observations must be made. This allows not only contradictory sources, but also experimental data in the database, which is ignored by default. It is also possible to believe all observations. Observations are collections of statements and can be interpreted as their provenance information. As already stated in Section 2.6, primitive RDF reification would need significantly more statements and would additionally break conventional queries. This leaves SIDs and named graphs. SIDs provide statement level provenance information, which is not required in this case. Another argument against SIDs is that they are not standardized. Therefore provenance information is stored using named graphs, with the observation being identified by the graph name. The information stored for each observation is as follows.

Creation Date creation time of the observation

Last Checked Date last checked time of the observation

Generator generator which produced the observation

Action type of observation

Source repository and path of the statements' origin, only used by providers

Subject authoritative subjects, only used by postprocessors

These items are also reflected in the metaservice observation ontology, which is described in Section 4.7.1.


4.4.3 Provider Data Model

Software repositories may be updated often and can grow very large. For example, the Debian Wheezy main repository contains over 30,000 packages and is updated regularly. Usually only a small part of the repository contents changes after an update. Storing differential observations therefore greatly reduces the number of required statements compared to storing full observations after every update. Hence the provider data model builds upon differential observation storage. Two different types of observations are required to store differential changes. These observation actions are add and remove. To create differential observations, providers need to consider two versions of source data: the version prior to the analyzed version and the analyzed version itself. Every execution of a provider may create two graphs, one which is assigned the action add and one which is assigned the action remove. When a statement can be calculated from the previous, but not from the current, source data, it is added to the remove observation. If a statement can be provided by the current, but not by the previous, source data, it is added to the add graph. If the statement exists in both versions, or in neither, it is not stored. Both observations inherit their data time from the source data and are identified by the source repository and the generator. An important property of the statements is that they need not be removed as part of the same set of statements as they were added, i.e., statements A and B may be added as part of the same observation but removed as part of different observations. This is trivially also the case the other way around. Hence the temporal validity of statements using the provider data model, with respect to a given generator and query time, is defined as follows. A statement is valid at a query time with respect to a generator, if the newest observation in the database which was produced by the given generator and has a data time prior to the query time contains the statement and is an add observation. Informally this means that a statement is valid between the data times of the add observation and, if it exists, the remove observation. The restriction to the generator is necessary because each generator has its own timeline of observations and statements may appear in the timelines of different generators. If one did not regard the different generators, a statement could be added in the timeline of one and removed in the timeline of another, which is not desired.
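The following SPARQL Update sketch shows this storage scheme, using the ontology terms from Section 4.7.1; the namespace IRIs, resource IRIs, and the example dependency statement are illustrative. Note that the observation metadata is written into the observation's own named graph, alongside the observed statements.

PREFIX ms:    <http://metaservice.org/ns/metaservice#>
PREFIX swrel: <http://metaservice.org/ns/swrel#>
PREFIX xsd:   <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:    <http://example.org/>

INSERT DATA {
    # the dependency appeared in the crawl of 2014-01-01 ...
    GRAPH ex:observation1 {
        ex:observation1 a ms:AddObservation ;
            ms:generator ex:debianProvider ;
            ms:dataTime  "2014-01-01T00:00:00Z"^^xsd:dateTime .
        ex:packageA swrel:depends ex:packageB .
    }
    # ... and disappeared again in the crawl of 2014-02-01
    GRAPH ex:observation2 {
        ex:observation2 a ms:RemoveObservation ;
            ms:generator ex:debianProvider ;
            ms:dataTime  "2014-02-01T00:00:00Z"^^xsd:dateTime .
        ex:packageA swrel:depends ex:packageB .
    }
}

Under the validity definition above, the dependency statement is then valid exactly between the two data times.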

4.4.4 Postprocessor Data Model

Postprocessors calculate statements given a specific time. This time is used to execute queries on the RDF database and consequently is also the data time of the observation. There are cases where more than one subject leads to the same result. One of these cases is the calculation of the previous and next relationships of a set of resources based on implicitly encoded attributes of the resources, like a revision identifier. In this case redundancies in calculation and/or storage would be created, as for each element of the set all elements of the set need to be considered to determine the direct successor or predecessor element. To remove those redundancies, the postprocessor data model defines observations to contain multiple authoritative subjects. The following rule is imposed on postprocessors to determine authoritative subjects: every resource of a statement in an observation which is potentially processable by the postprocessor needs to generate the same observation when actually processed by the postprocessor. Determining authoritative subjects can then be done by checking, for every resource which is part of the observation, whether it is processable by the postprocessor. Authoritative subjects and the generator are used to identify the observation. Nonetheless the identity of an observation cannot be determined as trivially as in the provider data model, because authoritative subjects are sets and not single resources. To define the temporal validity of statements in the postprocessor data model, it is not possible to execute the postprocessor for every possible query time. Therefore the temporal validity needs to be defined by other semantics. Like in the provider data model, a timeline based approach is used to define validity. Compared to the provider data model, postprocessor observations do not have an add or remove operation. Therefore only validity based on the principle that statements are valid until they are observed to not be valid is possible. Hence when a postprocessor does not calculate any statements, an empty observation still needs to be stored because of the validity semantics, such that previous observations can be invalidated completely. Another problem is that timelines for postprocessor observations are not linear, because every authoritative subject has its own timeline. For a statement to be valid at a point in time, there may not be newer observations in any of the authoritative subjects' timelines which do not contain the statement. This can also be restated as: a statement is valid if the newest observation containing the statement that is still prior to the query time has no successor observation from the same generator on any authoritative subject timeline which does not contain the statement.
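An empty observation required by these semantics could look as follows, again as an illustrative SPARQL Update sketch with assumed IRIs; ms:dummy marks the observation as containing no statements (see Table 4.2), although its exact value range in the real ontology is not prescribed here.

PREFIX ms:  <http://metaservice.org/ns/metaservice#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ex:  <http://example.org/>

INSERT DATA {
    GRAPH ex:observation3 {
        ex:observation3 a ms:ContinuousObservation ;
            ms:generator ex:versionPostprocessor ;
            ms:authoritativeSubject ex:package-1.0, ex:package-1.1 ;
            ms:dataTime "2014-03-01T00:00:00Z"^^xsd:dateTime ;
            # hypothetical marker value; the graph holds no data statements
            ms:dummy true .
    }
}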

4.5 Temporal Queries

Although there has been work on temporal queries on RDF databases, like the approaches discussed in Section 2.5.1, to our knowledge none of them is standardized or implemented in a usable product. Therefore metaservice may only build upon standard query mechanisms for the implementation of temporal queries. The temporal data model described in Section 4.4 suits the way the data is generated. Temporal querying is not trivial, because the model does not match the usual valid-from and valid-until semantics for temporal storage. Actually there is not only one, but two different time-validity semantics, one for each of the sub models. It is not feasible to expect each extending developer to fully understand the data model semantics, which would be necessary to write queries by hand. Therefore it is necessary to provide an abstraction layer which allows easy query writing. Most queries needed for data examination and processing typically consider single points in time and are called time point queries. Therefore it is a priority to provide a mechanism which allows writing standard time-agnostic queries and executing them for a given query time. First SPARQL@T, an extension of SPARQL which allows temporal constraints on each statement pattern, is introduced to formalize simple temporal queries. Then SPARQL@T is used to demonstrate how queries can be translated to regular SPARQL which operates on the metaservice data model. Finally, an automatic translation of a subset of SPARQL@T to regular SPARQL on the metaservice data model is described.


4.5.1 Introduction of SPARQL@T

SPARQL@T is an extension of regular SPARQL which allows optional constraints on the temporal validity of statement patterns and pattern groups. It is an easy to read and write language for temporal querying of RDF stores. Both statement patterns and pattern groups can be followed by an @ and a point in time, the query time, to denote that they must be valid at the query time. In Listing 4.1 the statement pattern only matches if a statement exists that is valid on 01/01/2014 00:00:00. In Listing B.5, in contrast, the temporal constraint is attached to the group pattern and therefore matches only if all statements in the group pattern are valid at the given point in time. As the constraint is given exclusively on the outermost group pattern, the query is a typical time point query.

SELECT *
WHERE {
    ex:subject ex:predicate ?o @"2014-01-01T00:00:00Z"^^xsd:dateTime .
}

Listing 4.1: Temporal Constraint on a Statement Pattern SPARQL@T Query

In comparison to other temporal SPARQL variants, like τ-SPARQL [69] or SPARQL-ST [56], SPARQL@T is quite limited. It does not allow variables in place of the query time; hence it is not possible to query the temporal validity of a statement. The simplicity of SPARQL@T is the reason why it was chosen for the explanation of the temporal mapping. The other existing languages are more powerful, but would also complicate the explanation. Given a concrete data model and temporal semantics, it is sometimes possible to translate a SPARQL@T query to a regular SPARQL query. Listings B.6 and B.7 show translations of the SPARQL@T queries in Listings 4.1 and B.5 to common SPARQL.
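For illustration of the group form (Listing B.5 itself is part of the appendix and not reproduced here), a group-level constraint might be written in the following shape; the predicates and the omitted prefix declarations are placeholders:

SELECT *
WHERE {
    {
        ?package swrel:depends ?dependency .
        ?dependency doap:name ?name .
    } @"2014-01-01T00:00:00Z"^^xsd:dateTime
}

Here both statement patterns must be valid at the same query time for a solution to match.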

4.5.2 Writing Temporal SPARQL Queries

Since the two data models have different semantics for determining the temporal validity of statements, both need to be considered to calculate the overall temporal validity of a statement. Following are translations of the SPARQL@T query in Listing 4.1. The translation uses several subqueries. To improve readability and to save space, these subqueries are written as named subqueries [70], such that the query can be split into multiple parts.

Temporal Queries for the Provider Data Model

The provider data model records the first observation containing a statement and the first observation missing it again. One may assume, and we define, that the time between consecutive add and remove observations of a statement is its period of validity. In the absence of a remove observation, the duration has no upper bound. Contrary to the classic valid-from and -to semantics, which store the period of validity in one place, the provider data model stores this information in different observations. Therefore the classic solution is not applicable, although the formats are semantically similar.


First of all, a statement is invalid if it is not an element of any observation. If it is an element of one or more observations, the observation with the maximal data time which is also before the query time determines its validity. If that observation is an add observation, the statement is valid; otherwise it is invalid. The named query %maxProviderTime in Listing 4.2 selects the variables of the statement, which in this case are only ?o, and the maximum data time ?time which is lower than the query time. The results are grouped by the generator, the source, and the statement variables, such that the maximal time is calculated for each observation timeline and each variable binding. In Line 15 the observations are limited to those with a data time prior or equal to the query time, and in Lines 10 to 14 the action or type of observation is limited to add or remove.

1  WITH {
2    SELECT DISTINCT
3      ?o
4      (MAX(?time) as ?time)
5    WHERE {
6      GRAPH ?c { ex:subject ex:predicate ?o } .
7      ?c ms:dataTime ?time ;
8         ms:source ?sourcePath ;
9         ms:generator ?generator .
10     {
11       ?c a ms:AddObservation .
12     } UNION {
13       ?c a ms:RemoveObservation .
14     }
15     FILTER ( ?time <= "2014-01-01T00:00:00Z"^^xsd:dateTime )
16   }
17   GROUP BY
18     ?o
19     ?generator
20     ?sourcePath
21 } AS %maxProviderTime

Listing 4.2: maxProviderTime Subquery - Translation of Listing 4.1 to SPARQL on the Provider Data Model

The query in Listing 4.3 includes the %maxProviderTime subquery from Listing 4.2. It then selects all observations that contain the statement, using the precalculated variables and maximal data time. The last pattern matches only add observations.

SELECT *
WITH {
    ...
} AS %maxProviderTime
WHERE {
    INCLUDE %maxProviderTime
    GRAPH ?c { ex:subject ex:predicate ?o } .
    ?c ms:dataTime ?time .
    ?c a ms:AddObservation .
}

Listing 4.3: Main Query - Translation of Listing 4.1 to SPARQL on the Provider Data Model

The translated query answers the original query correctly, although there are some differences in the result set. The star selector fetches additional variables, such that the returned tuples have a different arity. Additionally there may be more returned tuples, because the statement could be true in different observations.

Temporal Queries for the Postprocessor Data Model

The postprocessor model only has valid-from semantics; there is no time which states until when a statement is valid. All statements of an observation are valid until there is another observation of the same generator regarding the same authoritative subjects. Any statement is invalid if it is not an element of any observation. If it is an element of one or more observations, the generator and authoritative subjects of these observations need to be determined. Then the observations which have a data time prior to the query time are looked up, grouped by their generator and authoritative subjects, and the latest observation in each group is selected. If any of the selected observations contains the statement, the statement is valid. The named query %maxPostprocessorTime in Listing 4.4 selects the maximal ?time for each timeline which is prior or equal to the query time. To do this, first observations containing the statement are matched in Line 7. Then in Lines 8 to 10 those observations are required to be of the postprocessor type, and the generator and the authoritative subjects are matched. Both the generator and the authoritative subjects are used in Lines 11 to 14 to determine corresponding observations prior to the query time. Last, the observations are grouped by their timeline and the maximal time is selected.

1  WITH {
2    SELECT DISTINCT
3      ?subject
4      ?generator
5      (MAX(?time) as ?time)
6    WHERE {
7      GRAPH ?c1 { ex:subject ex:predicate ?o } .
8      ?c1 a ms:ContinuousObservation ;
9          ms:generator ?generator ;
10         ms:authoritativeSubject ?subject .
11     ?c2 ms:authoritativeSubject ?subject ;
12         ms:generator ?generator ;
13         ms:dataTime ?time .
14     FILTER ( ?time <= "2014-01-01T00:00:00Z"^^xsd:dateTime )
15   }
16   GROUP BY
17     ?subject
18     ?generator
19 } AS %maxPostprocessorTime

Listing 4.4: maxPostprocessorTime Subquery - Translation of Listing 4.1 to SPARQL on the Postprocessor Data Model


Listing 4.5 shows the query which is the translation of the query in Listing 4.1. First the result of %maxPostprocessorTime, and therefore ?time, ?generator, and ?subject, is included. Then all observations that match the statement pattern are looked up and subsequently joined to the known variables. If the join succeeds, there exists an observation which is the latest in the timeline prior to the query time and contains a statement matching the statement pattern, which hence is valid at the query time. The same restrictions as for the provider data model apply: the arity of the result tuples and the tuple count may differ from the original query.

SELECT *
WITH {
    ...
} AS %maxPostprocessorTime
WHERE {
    INCLUDE %maxPostprocessorTime
    GRAPH ?c { ex:subject ex:predicate ?o } .
    ?c ms:authoritativeSubject ?subject ;
       ms:generator ?generator ;
       ms:dataTime ?time .
}

Listing 4.5: Main Query - Translation of Listing 4.1 to SPARQL on the Postprocessor Data Model

Temporal Queries for the Metaservice Data model

A statement is valid if it is valid in any of the data models. The query for the metaservice data model is therefore the union of the provider data model query and the postprocessor data model query. The combined query can be seen in Listing 4.6 and is the translation of the original SPARQL@T query in Listing 4.1. As for the other data models, the restrictions apply that there may be duplicate tuples in the result and more variables in each tuple. The same translation rules can analogously be applied to CONSTRUCT and ASK queries.

SELECT *
WITH {
    ...
} AS %maxProviderTime
WITH {
    ...
} AS %maxPostprocessorTime
WHERE {
    GRAPH ?c { ex:subject ex:predicate ?o } .
    ?c ms:dataTime ?time .
    {
        INCLUDE %maxProviderTime
        ?c a ms:AddObservation .
    } UNION {
        INCLUDE %maxPostprocessorTime
        ?c ms:authoritativeSubject ?subject ;
           ms:generator ?generator .
    }
}

Listing 4.6: Main Query - Translation of Listing 4.1 to SPARQL on the Metaservice Data Model

4.5.3 Automatic Translation of a subset of SPARQL

Based on the translation concept shown in Section 4.5.2, it is possible to automatically translate a major subset of SPARQL. Each triple statement pattern needs to be calculated in its own subquery. Quad statement patterns are processed together, grouped by their context or context variable. Filters are best applied directly in the subquery where the variables used in the filter are available. Group by and aggregates can be translated without modification. Distinct operators require previous hiding of the variables which are generated by the translation. Unions, minus, and subqueries can be translated as if they were actual queries. Depending on the structure of the query, this algorithm may generate duplicate subqueries with the same calculation. When using named subqueries, these redundant queries can be eliminated. Naming of the newly generated variables and named queries needs to be done carefully, such that no naming conflicts arise. These are particularly dangerous, since they do not cause an exception, but silently calculate wrong results. Property paths cannot be translated, because there is no way to simulate the + or * operator with regular SPARQL. Decomposing these operators would be needed to be able to check the temporal validity of each intermediate statement. SPARQL service extensions cannot be translated automatically, because their semantics differ between implementations. Although the automated translation has its limitations, it is still powerful and therefore useful for querying temporal data.
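As an example of the property path limitation, consider a SPARQL@T query of the following shape (prefix declarations omitted, predicate illustrative): the + operator hides the intermediate statements whose temporal validity would each have to be checked, so no equivalent regular SPARQL query on the metaservice data model can be generated.

SELECT ?affected
WHERE {
    ?affected swrel:depends+ ?vulnerable @"2014-01-01T00:00:00Z"^^xsd:dateTime .
}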

4.6 Interface Design

The design of the programmable interface is based on semantic web technologies and linked data principles. Therefore every software has its own URI under which information about it can be retrieved. This information is provided via common web technologies. Documentation of the API is delegated to the used ontologies and web standards. The browsable web interface is then built on top of the semantic web frontend. Information on how a specific class should be displayed can be embedded in the data, and subsequently a template can be fetched to render the structured information of the graph.
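Following linked data principles, a client can, for example, simply ask the SPARQL endpoint to describe a software resource by its URI; the resource URI below is a made-up example and not an actual metaservice identifier:

DESCRIBE <http://example.org/software/debian/wheezy/bash>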

4.7 Ontologies

Two ontologies are required for metaservice to work as a software interrelationship service. The metaservice observation ontology vocabulary is needed for the realization of the metaservice data model.


SWREL is required as a common ground abstraction layer for all integrated software ecosystems. Both ontologies are implemented using OWL and are hosted in the metaservice namespace. They include or link vocabulary of existing ontologies where applicable, to further refine their semantic specification. Additionally, ADMS.SW and its related vocabularies are used for the description of software metadata beyond software interrelationships.

4.7.1 Metaservice Observation Ontology

The metaservice observation ontology is used to describe the metaservice data model. It handles the description of observations and their provenance data. It is not designed to be reusable for other services than metaservice.

Classes

Classes are provided for observations, modules, and their elements. Table 4.1 gives the descriptions of the classes. The semantics of the classes is in accordance with Section 4.4. Since the ontology is very specific, additional constraints are encoded through OWL. Examples of this are that observations need to reference exactly one generator and that continuous observations are required to reference a postprocessor as their generator.

Properties

The properties of the metaservice observation ontology can be seen in Table 4.2. Most properties are used to link the resources of the different classes. ms:latest can be used to buffer the calculation of validity for a continuous observation and a point in time. ms:dummy has the special role of denoting empty observations.

4.7.2 Software Relationship Ontology

SWREL focuses on the connections of software. A new ontology is necessary because the existing ones do not cover software relationships systematically. The ontology is used in metaservice to provide a common abstraction layer for software relationships. The most common software relationships can be categorized into relationships to software, to people and organizations, and to hardware. Software and generic relationships are the focus of this thesis and ontology. Relationships to people and organizations are already covered well by DOAP, which is the main reason for its import into the ontology. Hardware relationships were out of the scope of this work, but could be added in a later revision of the ontology. Where possible, properties were not modeled exclusively for software, but as generically as possible. Hence there are often no constraints on a property's domain or range.


URI: ms:Observation
Description: Observations are used to state collections of statements with the same origin, generator, and therefore observation provenance information. Named graphs are used to link observations to their contained statements. The description of the observation is also contained in itself.

URI: ms:AddObservation
Subclass of: ms:Observation
Description: AddObservations are used to describe the start of validity, i.e., the appearance, of statements in a knowledge base.

URI: ms:RemoveObservation
Subclass of: ms:Observation
Description: RemoveObservations are used to describe the end of validity, i.e., the disappearance, of statements in a knowledge base.

URI: ms:ContinuousObservation
Subclass of: ms:Observation
Description: ContinuousObservations are used to describe statements for a given point in time. Statements stay valid until they are not part of a later ContinuousObservation in the same timeline.

URI: ms:Module
Description: Modules are metaservice plugins. They provide different components, like Postprocessors or Providers, and static data, like Views and SourceRepositories.

URI: ms:Generator
Description: Everything that produces Observations is a Generator.

URI: ms:Postprocessor
Subclass of: ms:Generator
Description: Postprocessors are used for connecting and processing data in observations. They produce ContinuousObservations.

URI: ms:Provider
Subclass of: ms:Generator
Description: Providers are used to convert data from SourceRepositories into statements in AddObservations and RemoveObservations.

URI: ms:Source
Description: A source is the place of origin of information.

URI: ms:RepositoryPath
Subclass of: ms:Observation
Description: A RepositoryPath is a concrete element from a SourceRepository.

URI: ms:SourceRepository
Subclass of: admssw:SoftwareRepository
Description: A SourceRepository is a SoftwareRepository which is used as a source in metaservice.

URI: ms:View
Description: A View provides visualization for a class of resources in metaservice.

Table 4.1: Metaservice Observation Ontology Classes

ms:generator: Links an Observation to its Generator.
ms:module: Links a Module element to its Module.
ms:authoritativeSubject: Resources which are processed together and are important for the time validity scope of the observation.
ms:source: Links an Observation to the source its statements were retrieved from.
ms:creationTime: The time an Observation was created.
ms:dataTime: The validity time of an Observation.
ms:lastCheckedTime: The last time an Observation was checked to be true.
ms:latest: Can be used to link a ContinuousObservation to a time at which it is implicitly valid.
ms:repository: Links a RepositoryPath to a SoftwareRepository.
ms:id: Data property for the identifier of Components.
ms:path: Stores the local path of a RepositoryPath in a Repository.
ms:view: Used to annotate classes with matching Views for presentation.
ms:dummy: Used to mark ContinuousObservations which do not contain any statements.

Table 4.2: Metaservice Observation Ontology Properties

URI: swrel:Software
Description: A very abstract software concept. Instances of this class may refer to one or more software resources. These software resources may be addressed explicitly or implicitly.

URI: swrel:AnyOneOfSoftware
Subclass of: swrel:Software, rdf:Alt
Description: An enumeration of alternative software resources. Selection of the resources is disjunctive.

URI: swrel:AllOfSoftware
Subclass of: swrel:Software, rdf:Bag
Description: An enumeration of software resources which are all part of this resource. The resources need to be used conjunctively.

URI: swrel:SoftwareRange
Subclass of: swrel:Software
Description: Used to describe a range of software with specific properties. Software is only referenced implicitly, not explicitly.

Table 4.3: SWREL Classes

Classes

Since SWREL focuses on relationships rather than classes of software, it defines only a few classes. Those it does define are very abstract concepts of software and software collections; their descriptions can be seen in Table 4.3. A recurring problem with software-to-software relationships is that they are often not specified directly on concrete artifacts, releases, or projects, but only through certain criteria. Examples of this are dependencies of Debian packages, which are often specified only by package name.


[Figure 4.4: SWREL Relationship Property Hierarchy. related subsumes relatedSoftware, depends, antiDepends, and implements; relatedSoftware subsumes downstream, upstream, source, binary, hasPort, portOf, and forkOf; implements subsumes implementsAPI, implementsAlgorithm, and implementsSpecification.]

The dependency is stated on the package level and may be fulfilled by any package of that name. Further complications are introduced by dependencies which are specified as combinations of conjunctions and disjunctions. It becomes clear that such relationships cannot always be described directly, but only on classes of things. swrel:AllOfSoftware and swrel:AnyOneOfSoftware can be used to model conjunctions and disjunctions. More specific restrictions can be modeled through swrel:SoftwareRange, which can be narrowed by constraints. Because of the different nature of the swrel:Software class, it is neither reused from an existing ontology nor mapped to an existing concept.

Reification of dependencies, and therefore modeling of relationships as classes, was considered, but found to overly complicate the model and therefore its usage. Providing both classes and properties for the same concepts would double the size of the ontology and would require specific conversion rules. Conversion would only be possible from the reified statement to the non-reified one, because the identity information of relationships is not contained in properties. The most significant limitation of not modeling relationships as classes is that it is impossible to distinguish two different relationships between the same software and object. A workaround for this limitation is to use the swrel:AllOfSoftware or swrel:AnyOneOfSoftware classes to create an intermediate resource which only links to the object. This workaround allows much simpler modeling without losing expressiveness of the vocabulary.
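To illustrate, the following sketch builds a disjunctive dependency on one of two packages through an intermediate swrel:AnyOneOfSoftware resource, using the OpenRDF Sesame API on which the prototype is based. The example.org namespace and resource URIs are illustrative assumptions, not the actual SWREL or metaservice identifiers.

import org.openrdf.model.Model;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.LinkedHashModel;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDF;

public class DisjunctiveDependencyExample {
    // Hypothetical namespaces; the real SWREL and resource URIs may differ.
    private static final String SWREL = "http://example.org/swrel#";
    private static final String EX = "http://example.org/software/";

    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        Model model = new LinkedHashModel();

        URI app = vf.createURI(EX, "my-app");
        URI alternatives = vf.createURI(EX, "my-app-db-alternatives");

        // The intermediate resource models "libdb-a OR libdb-b".
        model.add(alternatives, RDF.TYPE, vf.createURI(SWREL, "AnyOneOfSoftware"));
        model.add(alternatives, vf.createURI(RDF.NAMESPACE, "li"), vf.createURI(EX, "libdb-a"));
        model.add(alternatives, vf.createURI(RDF.NAMESPACE, "li"), vf.createURI(EX, "libdb-b"));

        // The application depends on the disjunction, not on a concrete package.
        model.add(app, vf.createURI(SWREL, "dependsRuntime"), alternatives);
        model.add(app, vf.createURI(SWREL, "requires"), alternatives);
    }
}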

Properties

Properties of the ontology are structured in three groups: relationships, dependencies, and range restrictions. Inverse properties are only provided where they were considered useful, because some relationships have strongly directed semantics.

Relationship Properties

Table 4.4 describes the relationship properties of the ontology, but does not include the software dependency properties. The hierarchy of the properties can be seen in Figure 4.4. swrel:related is the top-level property for relationship properties and therefore also for dependency properties. It is used for generic relations. A specialized version which links to software is swrel:relatedSoftware. Software relationships are provided to link forked, ported, and compiled software. Additionally, swrel:implements is provided to link to implemented things, like APIs, algorithms, or specifications. Like the relation property, the implementation property does not imply usage on software.


swrel:related: Abstract concept of relation.
swrel:relatedSoftware: Abstract concept of relation to software.
swrel:binary: Software usually exists either in source or binary format. In case there exists a binary for a source, it may be linked through this property.
swrel:source: Software usually exists either in source or binary format. In case there exists a source for a binary, it may be linked through this property.
swrel:hasPort: Links to a portation of this software. A portation or port is an adaptation of a software to a different platform.
swrel:portOf: Links a portation to its original implementation. A portation or port is an adaptation of a software to a different platform.
swrel:forkOf: Connects assets with a common development ancestor.
swrel:implements: Abstract connection between the subject, which is the implementation, and something which is implementable.
swrel:implementsAPI: The subject implements an API. An API is a formalized contract of a programming interface.
swrel:implementsAlgorithm: The subject is or contains an implementation of an algorithm.
swrel:implementsSpecification: The subject implements a specification. Specifications can be standards, norms, or a requirements specification.

Table 4.4: SWREL Relationship Properties

Dependency Properties

Since the hierarchy of relationship and dependency properties is quite complex, it is displayed as a graph in Figure 4.5. Table 4.5 contains the descriptions of the properties. The ontology is structured through abstract dependency properties. OWL does not define abstract properties; they are therefore implemented as regular properties and only documented as abstract. The difference between an abstract property and a normal property is that abstract properties are only used to structure the ontology. They do not carry any information themselves, except that a specific type of information is given in a subproperty. The design of software dependencies in the ontology was greatly inspired by the work of German, Gonzalez-Barahona, and Robles [25]. The four proposed classifications of software dependencies are replicated in the ontology as abstract properties. The expected way to use the SWREL dependency vocabulary is to map existing properties to the provided vocabulary through the subproperty relationship; a sketch of such a mapping is given below, after Table 4.6. The vocabulary thereby serves as a modular platform for expressing the semantic meaning of dependency properties through a common vocabulary. An example of this system is swrel:linksDynamically, which is a subproperty of swrel:links and swrel:dependsCompile. One of the most special properties is swrel:antiDepends. All dependency types, with the exception of swrel:antiDepends, are subproperties of swrel:depends. Its special definition is due to the fact that it does the opposite of a normal dependency: it expects the described object to be absent.


swrel:depends: An abstract dependency on something. This does not specify the dependency's optionality, time scope, or any other type.
swrel:antiDepends: The inversion of a dependency; states incompatibility. It is an add-on to conventional dependencies to state the negation.
swrel:dependsSoftware: An abstract dependency on software.
swrel:abstractTypeDependencyProperty: Subproperties state how a dependency is defined.
swrel:abstractStageDependencyProperty: Subproperties state in which stage of a software life-cycle the dependency is valid.
swrel:abstractUsageDependencyProperty: Subproperties state how a dependency is used.
swrel:abstractImportanceDependencyProperty: Subproperties state the importance of a dependency.
swrel:optional: Used to denote optional dependencies.
swrel:requires: Used to denote must-have dependencies.
swrel:links: A software links another software. Linking is the process of combining two pieces of software into a single executable.
swrel:linksStatically: A software links another software. Static linking is the combination of two pieces of software into a single executable at build time.
swrel:linksDynamically: A software links another software. Dynamic linking is the combination of two pieces of software into a single executable at runtime.
swrel:dependsStandalone: The software runs on its own. If communication happens between the two applications, then only through weakly typed communication channels, like sockets or pipes.
swrel:includesDependency: Used to denote the inclusion of a dependency as a part of the subject.
swrel:dependsExternal: Used to denote that the dependency is not part of the subject.
swrel:dependsBuild: The dependency is required to be satisfied during build time of the software.
swrel:dependsTest: The dependency is required to be satisfied during the execution of tests.
swrel:dependsInstallation: The dependency is required to be satisfied during the installation of the software.
swrel:dependsRuntime: The dependency is required to be satisfied during runtime of the software.
swrel:dependsService: The software connects to and uses a service.
swrel:dependsMiddleware: The subject uses a middleware.
swrel:pluginOf: The subject is used as a tightly coupled plugin of another software.
swrel:dependsCompiler: The subject depends on the object as a compiler to build.
swrel:dependsInterpreter: The subject depends on the object as an interpreter to run.

Table 4.5: SWREL Dependency Properties

[Figure 4.5: SWREL Dependency Property Hierarchy. related subsumes antiDepends and depends; depends subsumes dependsSoftware, which in turn subsumes the abstract type, usage, importance, and stage dependency properties that group the concrete properties listed in Table 4.5.]

swrel:distributionConstraint: Used to limit the range to a specific software distribution.
swrel:projectConstraint: Used to limit the range to a specific software project.
swrel:releaseConstraint: Used to limit the range to a specific software release.

Table 4.6: SWREL Range Restriction Properties

To be able to reuse all other dependency properties for the description, the negation can be added as a decorator to any dependency. Another important aspect of swrel:depends is that it does not state whether a dependency is really required. To state the importance of a dependency, the properties swrel:optional and swrel:requires are provided.
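The subproperty mapping mentioned above can be stated as plain RDF. The following sketch, again using the Sesame API and assumed example.org namespace URIs, maps a hypothetical distribution-specific dependency property onto the SWREL vocabulary:

import org.openrdf.model.Model;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.LinkedHashModel;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDFS;

public class DependencyMappingExample {
    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        Model model = new LinkedHashModel();

        // Hypothetical URIs; the real ontologies use their own namespaces.
        URI debDepends = vf.createURI("http://example.org/deb#depends");
        URI dependsRuntime = vf.createURI("http://example.org/swrel#dependsRuntime");
        URI requires = vf.createURI("http://example.org/swrel#requires");

        // A distribution-specific property inherits SWREL semantics
        // by being declared a subproperty of the matching SWREL terms.
        model.add(debDepends, RDFS.SUBPROPERTYOF, dependsRuntime);
        model.add(debDepends, RDFS.SUBPROPERTYOF, requires);
    }
}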

Range Restriction Properties

Range restriction properties are used to specify software ranges. Constraints of different types are interpreted conjunctively, and constraints of the same type disjunctively. The constraints can either be linked directly to resources or contain instructions describing for which resources they are valid. Such instruction values are not automatically processable unless semantics are attached to them; this should be done in subproperties of these constraints.
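As an illustrative sketch under the same assumed namespaces, a software range restricted to one project and two acceptable releases could be stated as follows; the project constraint always applies (conjunctive across types), while either release satisfies the range (disjunctive within a type):

import org.openrdf.model.Model;
import org.openrdf.model.URI;
import org.openrdf.model.ValueFactory;
import org.openrdf.model.impl.LinkedHashModel;
import org.openrdf.model.impl.ValueFactoryImpl;
import org.openrdf.model.vocabulary.RDF;

public class SoftwareRangeExample {
    // Hypothetical namespaces and resource names.
    private static final String SWREL = "http://example.org/swrel#";
    private static final String EX = "http://example.org/software/";

    public static void main(String[] args) {
        ValueFactory vf = ValueFactoryImpl.getInstance();
        Model model = new LinkedHashModel();

        URI range = vf.createURI(EX, "libfoo-range");
        model.add(range, RDF.TYPE, vf.createURI(SWREL, "SoftwareRange"));

        // Conjunctive across constraint types: the project constraint always applies.
        model.add(range, vf.createURI(SWREL, "projectConstraint"), vf.createURI(EX, "libfoo"));

        // Disjunctive within one constraint type: either release satisfies the range.
        model.add(range, vf.createURI(SWREL, "releaseConstraint"), vf.createURI(EX, "libfoo-1.0"));
        model.add(range, vf.createURI(SWREL, "releaseConstraint"), vf.createURI(EX, "libfoo-1.1"));
    }
}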


5 Implementation

This chapter describes the details of the metaservice prototype implementation. Section 5.1 introduces the chosen software platform and the corresponding libraries. The component and module system and the manager of metaservice are presented in Sections 5.2 and 5.3. Retrieval and storage of data from source repositories are examined in Section 5.4. Section 5.5 gives insight into the chosen RDF database and its configuration, and the implementation of the messaging service is discussed in Section 5.6. Section 5.7 covers the frontend, including the user interface and the linked data API. Optimizations needed for the execution of the automatically translated queries are presented in Section 5.8.

5.1 Platform

Metaservice uses the JVM platform and Java as the main programming language. The JVM provides support for multiple programming languages and is portable. The Java ecosystem contains a variety of libraries, which cover almost all requirements. Metaservice uses both Apache Commons [81] and Google Guava [103] for utility functions, and Google Guice [104] for dependency injection. Two major libraries for RDF handling exist, Apache Jena [84] and OpenRDF Sesame [128]; Sesame is used in the prototype. The same logging mechanism is implemented across all components. Since different libraries use different logging mechanisms, SLF4J [138] is used to consolidate the different systems. Logback [117] was chosen as the logging backend because of its advanced mechanisms, like log rolling, log compression, and configuration reloading at runtime. Further libraries are used in the different components and are introduced in the corresponding sections. Although the JVM supports multiple operating systems, metaservice was developed for and tested solely on Debian GNU/Linux. Debian provides stability and a large repository of easy-to-install software utilities. Among the packages used from Debian is the Apache HTTP daemon (httpd). Porting metaservice to another platform should be possible with few changes.

5.2 Component and Module System

Apache Maven is used as the build system and therefore also provides the artifact naming convention for metaservice. Table 5.1 shows the core libraries of metaservice, which are all part of the org.metaservice group. Metaservice modules are also required to be built, or at least deployed, with Maven. All interfaces and classes required for the development of a module are part of metaservice-api. Besides the Maven POM, external modules also require a descriptor, in which the provided configuration and components are declared. The descriptor format is based on XML; the file must be called metaservice.xml and placed in the root directory of the archive. An example of a metaservice descriptor is shown in Listing B.10.

metaservice: a Semantic Web based Approach 63 5.3. Manager

metaservice-manager: Console application that runs a metaservice instance.
metaservice-api: Public API for writing metaservice modules. Modules should use this as a dependency.
metaservice-core: Private implementation classes. Modules should use this as a runtime dependency.
metaservice-frontend: Static content of the web frontend.
metaservice-frontend-rest: REST service that provides the dynamic parts of the web frontend.
metaservice-api-messaging: API to implement a messaging service. The messaging API is not needed to write a metaservice module.
metaservice-messaging-activemq: Messaging service implemented with ActiveMQ.
metaservice-messaging-mongokryo: Custom messaging service implemented with MongoDB and Kryo. This is the default messaging service used for metaservice.

Table 5.1: Metaservice Maven Artifacts

The module descriptors provide metadata of the modules to the metaservice manager.

5.3 Manager

The manager is able to deploy, start, and stop the frontend, crawlers, providers, postprocessors, and the messaging service. It takes care of module retrieval, installation, removal, and upgrade. It provides centralized access to important system metrics, like running processes, generated triple counts, and messaging queue lengths. Statistics can also be drilled down per semantic data type or per module. The manager also provides means to manipulate data and send messages for testing purposes. The user interface is shell based and implemented using Æsh [78]. Æsh allows autocompletion of custom queries and the execution of shell scripts, so that working with the manager is as convenient as working with any other Unix shell. Job scheduling uses the Quartz Scheduler [134]. A screenshot of the manager can be seen in Appendix C, Figure C.3.

5.4 Data Retrieval and Archival

Data retrieval is done by crawlers, which follow declarative rules in the module descriptor. These rules can be reused by different repositories; each repository may define a different starting URI for the crawler. Rules allow stating Cascading Style Sheets (CSS)-like selectors on web pages to follow links and fetch resources. The implementation uses jsoup [114], a Java-based HTML parsing and extraction engine.
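A minimal sketch of what such a crawling rule amounts to, assuming a hypothetical starting URI and selectors (the actual rules are declared in metaservice.xml, not hard-coded):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CrawlRuleSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical starting URI; real rules come from the module descriptor.
        String startUri = "http://example.org/releases/";
        Document page = Jsoup.connect(startUri).get();

        // Follow all links matched by a CSS-like selector ...
        for (Element link : page.select("table.archive a[href$=.html]")) {
            Document releasePage = Jsoup.connect(link.absUrl("href")).get();

            // ... and collect the resources to fetch from each followed page.
            for (Element resource : releasePage.select("a[href$=.tar.gz]")) {
                System.out.println("fetch: " + resource.absUrl("href"));
            }
        }
    }
}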


As soon as data is retrieved, it is stored in an archive. Because snapshots of a repository often differ only slightly, differential storage and compression for the archive are implemented with git. Git provides a great compression ratio on incrementally changing text-based files, which is most often the case for metadata. Special implementations of archives can improve parsing performance by reducing the amount of data that needs to be parsed. For some source data formats it is possible to calculate the differences between two versions without completely parsing them. This is the case for the Debian package index format, which allows the calculation of changed packages without parsing the whole file in detail; a sketch of this idea follows below.
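The following sketch illustrates the idea under the assumption that a Debian Packages index is a sequence of stanzas separated by blank lines: comparing whole stanzas identifies added package entries between two snapshots without interpreting any fields.

import java.util.HashSet;
import java.util.Set;

public class PackageIndexDiffSketch {
    /** Splits a Packages index into its blank-line separated stanzas. */
    static Set<String> stanzas(String index) {
        Set<String> result = new HashSet<String>();
        for (String stanza : index.split("\n\n")) {
            if (!stanza.trim().isEmpty()) {
                result.add(stanza);
            }
        }
        return result;
    }

    /** Returns the stanzas present in newIndex but not in oldIndex. */
    static Set<String> addedStanzas(String oldIndex, String newIndex) {
        Set<String> added = new HashSet<String>(stanzas(newIndex));
        added.removeAll(stanzas(oldIndex));
        return added; // only these need full field-level parsing
    }
}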

5.5 RDF Database

The RDF database selected for metaservice is bigdata [71, 86]. Other stores, like OWLIM [47, 130] or Virtuoso [127], were also considered. The choice fell on bigdata because of its implementation in Java and its open license, which allows source code access and modification. It is actively maintained and its development is steadily ongoing, with four releases happening during the development of metaservice. Bigdata is started as a dedicated application using the NanoSparqlServer interface. The NanoSparqlServer provides a SPARQL endpoint which allows concurrent reads and writes. The different components access bigdata through the endpoint using the OpenRDF SPARQLRepository class, as sketched below. Quad mode is enabled, and truth maintenance and reasoning are disabled, because materialized triples are stored when needed. Branching factors for the indices and the size of the write-retention queue were set to the vendor's recommendations, but not verified to provide optimal performance on the dataset.

Bigdata provides scale-up and scale-out architectures for clustering. The scale-up architecture allows several connected instances of bigdata, where each holds an entire copy of the data set. It is therefore also used for high-availability scenarios, because it tolerates node outages through a quorum-based system. Since no communication between the instances is needed for query evaluation, query throughput scales linearly, while update operations have an overhead of about 50% for short transactions. In the scale-out architecture, data and indices are dynamically distributed between the cluster nodes. Queries are also evaluated in a distributed fashion and therefore near the data. This leads to improved performance, but also to higher latency, for queries on big data sets. Clustering was not yet needed at the size of the demonstration data sets, but would be necessary for productive usage.
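For illustration, components access the endpoint roughly like the following sketch, using the OpenRDF SPARQLRepository class mentioned above; the endpoint URL and the query are assumptions.

import org.openrdf.query.BindingSet;
import org.openrdf.query.QueryLanguage;
import org.openrdf.query.TupleQueryResult;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.sparql.SPARQLRepository;

public class EndpointAccessSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical NanoSparqlServer endpoint URL.
        SPARQLRepository repository = new SPARQLRepository("http://localhost:9999/bigdata/sparql");
        repository.initialize();

        RepositoryConnection connection = repository.getConnection();
        try {
            TupleQueryResult result = connection
                    .prepareTupleQuery(QueryLanguage.SPARQL, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
                    .evaluate();
            while (result.hasNext()) {
                BindingSet bindings = result.next();
                System.out.println(bindings.getValue("s"));
            }
            result.close();
        } finally {
            connection.close();
        }
    }
}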

5.6 Messaging

The messaging service handles all the events of all workers. If messages are processed faster than they are delivered to the workers, computing resources are not utilized efficiently. Therefore the performance of the messaging service has a great impact on the overall service performance. Because of the clearly defined interfaces of the messaging service, it is possible to swap the component easily. The metaservice-api-messaging component contains all interfaces needed to implement a messaging service. Two different messaging service implementations are discussed in the following sections.



Figure 5.1: Messaging System based on ActiveMQ Virtual Topic

5.6.1 Java Message Service and ActiveMQ Messaging

Java Message Service (JMS) [34] is an established, standardized messaging API for Java, which is also part of the Java Platform, Enterprise Edition (Java EE). Apache ActiveMQ [79] is a clusterable, enterprise-grade, open source JMS implementation. As such, at first sight, ActiveMQ is an excellent candidate to implement the messaging of metaservice.

JMS provides two elementary messaging concepts: topics and queues. Topics provide a message distribution mechanism based on the publish/subscribe model. Message producers can publish messages on a topic, and any message consumer subscribed to the topic will receive the message. Topics are not durable by default, meaning that messages are not kept for later consumption, i.e., only currently subscribed consumers will receive the message. Queues provide message queue semantics: any message sent to the queue by a producer is consumed by exactly one consumer. Multiple consumers can register to a queue, in which case the queue acts like a load balancer. A message is retained in a queue until it is processed or expires. In the default configuration, queues process messages using the first-in first-out principle.

The metaservice messaging system can be modeled as follows. There is a queue for each generator type, where the corresponding workers are registered to fetch their messages/tasks. A topic receives all messages per message type. For each generator type there is a listener, which receives the messages from the topic and puts them on the generator type's queue. Virtual topics are an ActiveMQ-specific feature that dynamically routes messages from a topic to queues based on name pattern matching. They mimic exactly the pattern that is needed for metaservice messaging, without the need for an external listener that routes the messages; a sketch of this setup follows below. Figure 5.1 shows the message flow using the ActiveMQ implementation with virtual topics.

This setup worked, but soon showed itself to be a bottleneck of the system. ActiveMQ does not handle very large queues in combination with fast producers and slow consumers well. Both the KahaDB and the better performing LevelDB backends created extensive IO load on journals several gigabytes in size. The extensive IO load was partially caused by the redundant storage of each message on the individual queues. Independent of the persistence backend, some queues stopped working under high load, which is a typical symptom of ActiveMQ flow control.
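A minimal sketch of the virtual topic pattern, assuming ActiveMQ's default virtual topic naming convention and a broker on localhost; the topic and consumer queue names are illustrative, not the ones used by metaservice:

import javax.jms.Connection;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;

import org.apache.activemq.ActiveMQConnectionFactory;

public class VirtualTopicSketch {
    public static void main(String[] args) throws Exception {
        Connection connection =
                new ActiveMQConnectionFactory("tcp://localhost:61616").createConnection();
        connection.start();
        Session session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);

        // Each generator type consumes from its own derived queue; ActiveMQ
        // copies every message published on the virtual topic into all
        // matching consumer queues.
        MessageConsumer providerWorker = session.createConsumer(
                session.createQueue("Consumer.Provider.VirtualTopic.Refresh"));

        // Producers publish to the virtual topic only.
        MessageProducer producer = session.createProducer(
                session.createTopic("VirtualTopic.Refresh"));
        producer.send(session.createTextMessage("resource changed"));

        TextMessage task = (TextMessage) providerWorker.receive(1000);
        System.out.println(task == null ? "no message" : task.getText());
        connection.close();
    }
}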


[Figure 5.2: Custom Messaging System. All generator types share a single queue; one load balancer per generator type distributes the messages to the individual workers.]

Disabling flow control did not stop the symptom from appearing, and neither did configuration changes or switching releases. After the performance problems with ActiveMQ, other JMS messaging servers were considered. None of them promised significant improvements over ActiveMQ, as all implementations use similar approaches.

5.6.2 Custom Messaging

Although clearly not the first choice, a custom messaging service was crafted specifically for the metaservice messaging requirements. Figure 5.2 shows the message flow in the custom service. The main improvement is that messages are no longer duplicated: all generator types share a single queue, on which each type's current position is maintained by a cursor. A load balancer then distributes the load of each cursor onto the individual workers. The different generator types can therefore still consume messages at varying speeds. As soon as the last cursor has processed a message, it may be removed from the queue. This is currently only done on system stop, so that newly added modules can process existing messages without the need to generate them again. Because of the known performance problems, the actual implementation was also designed to be as lightweight as possible. The queue is implemented as a MongoDB [120] collection, which stores messages in JSON format. MongoJack [121], and thereby Jackson [111], are used to transform the messages between Java objects and MongoDB. Between the messaging service and the workers, the messages are serialized using Kryo [115], a fast serialization library for Java objects, and sent through KryoNet [116], a networking library built upon Kryo. According to their project descriptions, both are designed for speed and efficiency. After some initial problems related to buffer sizes, the custom messaging service delivers higher throughput using fewer resources, at the cost of some of the reliability features of the ActiveMQ solution.
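The cursor concept can be sketched in plain Java as follows; the class and method names are hypothetical, and the actual implementation persists both the queue and the cursors in MongoDB:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SharedQueueSketch {
    private final List<String> queue = new ArrayList<String>();                   // one shared message queue
    private final Map<String, Integer> cursors = new HashMap<String, Integer>();  // position per generator type

    public synchronized void publish(String message) {
        queue.add(message);
    }

    /** Returns the next message for a generator type, or null if it is caught up. */
    public synchronized String poll(String generatorType) {
        int position = cursors.containsKey(generatorType) ? cursors.get(generatorType) : 0;
        if (position >= queue.size()) {
            return null;
        }
        cursors.put(generatorType, position + 1);
        return queue.get(position); // a load balancer would hand this to one worker
    }
}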


5.7 Frontend

The frontend of metaservice is responsible for the semantic web frontend and the web user interface. It is the only component that is accessible from outside the service.

5.7.1 Semantic Web Frontend

Every URI hosted on metaservice is dereferenceable. Apache httpd redirects are used to rewrite and proxy requests to a Java Representational State Transfer (REST) application. The REST frontend itself only transforms the requests into SPARQL. The data is provided flattened at the latest possible validity time, and statements are output up to a distance of three resources, beginning from the requested resource. The corresponding query is shown in Appendix B, Listing B.8; this version uses the latest caching optimization to speed up requests. Depending on the MIME type in the HTTP Accept header, RDF/XML, Turtle, or JSON-LD is output by the REST frontend. For advanced clients, unflattened data containing observations and provenance information is available on an alternative URI. Currently there is no result caching implemented; it should definitely be added to lower the database load. The cache could easily be populated and invalidated whenever observations used in the cached content change.
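Content negotiation of this kind could look roughly like the following JAX-RS sketch; the path, method names, and the delegation to the SPARQL endpoint are assumptions, not the actual metaservice code.

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;

@Path("/resource/{id}")
public class ResourceEndpointSketch {

    // JAX-RS picks the representation matching the HTTP Accept header;
    // a real implementation would run the flattening SPARQL query and
    // serialize the result in the negotiated RDF format.
    @GET
    @Produces({ "application/rdf+xml", "text/turtle", "application/ld+json" })
    public String get(@PathParam("id") String id) {
        return queryAndSerialize(id); // hypothetical helper
    }

    private String queryAndSerialize(String id) {
        return ""; // placeholder for the SPARQL-backed rendering
    }
}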

5.7.2 User Interface

The user interface builds upon the unflattened semantic web frontend data. Handlebars [107] template views, which are provided by the modules, are dynamically loaded according to the type of the viewed resource. The frontend is a single-page HTML application, which is returned for every available resource. It then starts a request to fetch the full resource data. The JSON-LD.js library [113] and custom code are used to flatten the JSON-LD response and transform it into a simple JavaScript graph. The flattening process does not allow blank nodes, because blank node matching is hard [72]. Special care has to be taken with loops in the graph. The resulting tree/graph is then given to the Handlebars processor, which can thereby access all properties. Helper functions were created to handle RDF-specific data storage, like language tags and the handling of plain text versus xsd:string text. As Handlebars templates can also be rendered on the server side, an optional HTTP cache with ready-to-display pages could speed up the frontend. Bootstrap [89] is used as a styling and layout library to achieve a consistent look and feel. It provides ready-to-use, modern HTML classes and a responsive layout, and supports all major mobile and desktop browsers.

5.8 Temporal SPARQL Queries

Several steps are required to implement the automatic translation of the temporal SPARQL query concept from Section 4.5. First, the modules require an API to construct and translate the queries. Furthermore, the SPARQL standard does not support CONSTRUCT queries that create quad results, which is required for


final Variable a = new Variable("a"),
        title = new Variable("title"),
        alt = new Variable("alt");
new DefaultSparqlQuery() {
    public String build() {
        return
            select(DISTINCT, var(a), var(title))
            .where(
                triplePattern(a, RDF.TYPE, ADMSSW.SOFTWARE_PROJECT),
                triplePattern(a, SKOS.ALT_LABEL, alt),
                union(
                    graphPattern(
                        triplePattern(a, DC.TITLE, title)
                    ),
                    graphPattern(
                        triplePattern(a, RDFS.LABEL, title)
                    ),
                    graphPattern(
                        triplePattern(a, SKOS.PREF_LABEL, title)
                    )
                ),
                filter(unequal(val(alt), val(title)))
            )
            .build();
    }
};

Listing 5.1: Query Building in Java

proper support of the metaservice data model. Therefore bigdata needs to be patched to allow quad CONSTRUCT queries. Finally, the resulting translated queries are so complex that they need to be optimized to meet the performance requirements of the service.

5.8.1 Query Building and Translation

The API contains a builder class to create SPARQL queries in Java. It is statically typed and therefore effectively prevents typo-based failures. It is not a traditional builder, because those have limitations with complex structures like trees. The same result could have been achieved with static imports, with the implication that the methods would be accessible in the whole class. Vocabularies and ontologies are automatically translated into Java classes, which provide static access to the corresponding URIs; thereby typing errors in common vocabulary are also mitigated. The builder stores queries directly as an internal representation, which in turn can be used directly for translation. The code in Listing 5.1 generates the SPARQL query in Listing 5.2. A complete translation of the query to the metaservice data model, including the optimizations described in Section 5.8, is listed in Appendix B, Listing B.9.


SELECT DISTINCT
  ?a
  ?title
WHERE {
  ?a a <admssw:SoftwareProject> .
  ?a <skos:altLabel> ?alt .
  {
    ?a <dc:title> ?title .
  } UNION {
    ?a <rdfs:label> ?title .
  } UNION {
    ?a <skos:prefLabel> ?title .
  }
  FILTER( ?alt != ?title ) .
}

Listing 5.2: Output SPARQL Query of Listing 5.1

5.8.2 Quad Support for SPARQL CONSTRUCT Queries

SPARQL 1.1 introduced quad support, also known as named graph support. Through it, one can easily query, update, and insert graphs into knowledgebases [64, 57]. However, it is not possible to use CONSTRUCT to export quads or multiple graphs. The only way to achieve this using standard SPARQL is to use a SELECT query and reassemble the graphs afterwards, outside of the query. Bigdata does not provide an alternative way to output multiple RDF graphs either. Since bigdata is open source, quad support for CONSTRUCT queries can be patched in. The changes required to implement this feature in bigdata are as follows:

• Modify the SPARQL language grammar to allow a quad construct mode.

• Modify the query and optimization engine to correctly handle the new grammar.

• Modify the "distinct" semantics, which is applied prior to the assembling of the result, such that statements from different named graphs are not discarded because of equality of the subject, object, and predicate.

• Add a serializer for an RDF format that supports quads, e.g., JSON-LD.

To minimize interference with existing queries, the newly created quad construct mode needs to be explicitly enabled through the _QUADMODE_ keyword, as can be seen in Listing 5.3. In this query, the statements ?s ?p ?o are output into the corresponding named graph ?c, and the statements ?s1 ?p1 ?o1 are output into the default graph.


CONSTRUCT {
  _QUADMODE_
  GRAPH ?c { ?s ?p ?o } .
  ?s1 ?p1 ?o1 .
}
WHERE {
  ...
}

Listing 5.3: SPARQL Quad CONSTRUCT Query

5.8.3 Query Optimization Techniques

Automatically translated temporal queries are substantially longer and more complex than the original ones. Additionally, the amount of data to be processed is drastically higher, because not only one but many temporal states are stored and need to be considered. Therefore the needed processing power, and hence the computation time, increases. In first experiments, queries that in their untranslated form could be executed in less than 10 ms did not terminate at all, because of connection timeouts after hours. The optimizer could not create a performant execution plan based on the complex query. Therefore several measures were taken to improve the performance.

A Heuristic for Temporal Queries

One major reason for the bad performance of the translations is that each statement pattern is evaluated on its own, without taking its context into account. Consider a query with the patterns <...> <...> ?s . and ?s <dc:title> ?o ., and assume that these patterns match only one result. Usually the optimizer is able to resolve the lookup order such that ?s is bound by the first pattern, and the title is looked up only afterwards, when only one match remains. With a properly indexed database this needs only two index lookups and is therefore very fast. However, when the patterns are split up, as they are in the temporal translation to determine the validity, this context information is not available any more. The second pattern, without its context, matches every title statement in the complete database. The temporal validity is then calculated for every match, which needs considerable resources. Only when this whole calculation is done are all valid matching statements joined with each other, such that only one result is left. One can easily see that although the translated query is correct, naive evaluation is not practicable.

What is essentially needed to reduce the number of possible bindings for the variables in the patterns is to evaluate the context information prior to the calculation of the temporal validity. It is possible to first execute the query without temporal evaluation, to retrieve possible matching bindings regardless of time. This is the untranslated version of the query, which can be easily optimized and computed very fast. The result of the time-agnostic query is then used to provide limited bindings to the patterns when the temporal validity is tested. This greatly reduces the number of statements which have to be looked up.


[Figure 5.3: Dense RDF Graph. All edges are ex:next: ex:start links to ex:a1, ex:a2, and ex:a3; every a-node links to every b-node (ex:b1 to ex:b3); every b-node links to every c-node (ex:c1 to ex:c3); and every c-node links to ex:end.]

SELECT
  ?result
WHERE {
  <ex:start> <ex:next> ?a .
  ?a <ex:next> ?b .
  ?b <ex:next> ?c .
  ?c <ex:next> ?result .
}

Listing 5.4: SPARQL Query on Dense RDF Graph

Special care has to be taken for optional variables in the heuristic, as they may be unbound. Patterns using these unbound variables should not be considered at all, because they were not considered in the heuristic either. If unbound variables are nevertheless used to calculate validity, all possible matches in the database are unnecessarily considered again, which is a waste of time. Therefore unbound variables need to be filtered out, using the bound() function, prior to the calculation of the temporal validity. The heuristic optimization is usable for both the provider and the postprocessor data model. Queries for which there is no solution terminate as fast as the nontemporal version, because the selection of variable bindings is only done after checking that a solution exists for the binding.
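A rough sketch of this two-phase evaluation, under assumed helper names: the bindings found by the fast time-agnostic query are injected into the translated temporal query as a SPARQL VALUES block, so that the temporal validity is only computed for candidate bindings. The actual implementation works on the builder's internal query representation rather than on query strings.

import java.util.List;

public class TwoPhaseEvaluationSketch {

    /** Renders a block like: VALUES (?a) { (<http://...>) (<http://...>) } */
    static String valuesBlock(String variable, List<String> candidateUris) {
        StringBuilder block = new StringBuilder("VALUES (?").append(variable).append(") {");
        for (String uri : candidateUris) {
            block.append(" (<").append(uri).append(">)");
        }
        return block.append(" }").toString();
    }

    /**
     * Restricts the translated temporal query to the bindings retrieved by
     * first running the untranslated, time-agnostic query.
     */
    static String restrict(String translatedQuery, String variable, List<String> candidateUris) {
        // Hypothetical injection point: directly after the outermost WHERE {.
        return translatedQuery.replaceFirst("WHERE \\{",
                "WHERE { " + valuesBlock(variable, candidateUris));
    }
}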

Pruning of Duplicates Before Joins

Consider the query in Listing 5.4 and the RDF graph in Figure 5.3. The result set of the query on this graph is clear: ?result is bound to ex:end. What is interesting is the number of tuples returned, which is 27. The reason for this result is the evaluation semantics of SPARQL: although only ?result is selected, all possible paths from ex:start to ex:end are expanded. The tuple count does not have any significant size in this case, but it does in translated temporal metaservice queries. Especially in the postprocessor data model, observations with many authoritative subjects may easily lead to billions of distinct paths. A good database may optimize the index lookups, but still has to store all possible paths, which consumes memory and processing resources. If only the last node of the different paths is important, but not the way there, the tuple count can be reduced drastically through intermediate DISTINCT operations. In the context of the example, this


SELECT DISTINCT
  ?result
WHERE {
  { SELECT DISTINCT ?c WHERE {
    { SELECT DISTINCT ?b WHERE {
      <ex:start> <ex:next> ?a .
      ?a <ex:next> ?b .
    }}
    ?b <ex:next> ?c .
  }}
  ?c <ex:next> ?result .
}

Listing 5.5: Optimized SPARQL Query on Dense RDF Graph

means that once the paths from the a-nodes are expanded to the b-nodes, all distinct b-nodes are selected and the a-nodes are discarded. If this principle is applied to the whole query, exactly one result is returned and intermediate results contain at most three nodes. The optimized version of the query from Listing 5.4 can be seen in Listing 5.5.

Join Order Optimization by Usage of Known Cardinalities

Join order has a great impact on query performance. Optimizers are used to determine the best join order, such that the number of tuples that need to be calculated is minimized. This usually works very well. The statistics needed for the determination of the join order can be deduced from the indices. For a simple example, consider the pattern ?a <ex:next> ?b . together with a second pattern ?b <...> ?c . that uses a different predicate. There are 10 k statements with <ex:next> and 5 k statements with the other predicate. Additionally, we know that there are thirty possible different bindings for ?a. An optimizer chooses to evaluate the second pattern first, because it has fewer possible matches, to reduce the overall number of joins. What the optimizer does not know is that there may be at most one element next to another. This information can be part of an OWL ontology through cardinality constraints, but at least bigdata does not process it. With this information, the optimizer could evaluate <ex:next> first and get to a result significantly faster.

Unknown predicate-specific cardinalities are not the only blind spot of the bigdata optimizer. Subqueries and unions cannot be estimated accurately enough: their upper bound is the cross join of the contained patterns, which is absurdly high, and therefore they are usually put last in queries. Bigdata has runtime query optimization, where it starts the query evaluation and repeatedly restarts it, optimized according to the additionally gathered join counts. This query optimizer works quite well, but struggles when there are too many possible permutations of the join order, as in translated temporal queries. Bigdata also allows turning the optimizer off, in which case the order in which the query is written is used as the join order. This is a viable option for the automatically translated temporal queries, because their cardinalities are known very well. All the different optimization techniques proposed here are quite useless when the join order is incorrect.



6 Evaluation

The three contributions of the thesis are evaluated by demonstrating the functionality of the metaservice prototype. Different use cases, which require a working system to execute correctly, are implemented. First, the applicability of semantic web technologies and the SWREL ontology is shown through the integration of two ecosystems in Sections 6.1 and 6.2. Then the implementation of two use cases, which were defined in Section 3.2, is described in Sections 6.3 and 6.4 to demonstrate the functionality of the architecture and the temporal query models. Section 6.5 documents the runtime environment used to perform the evaluation. The results of the evaluation are discussed in Section 6.6.

6.1 Integration of the Debian Ecosystem

The structure and semantics of Debian packages are described in the Debian Policy Manual [42]. Relevant information for understanding Debian package metadata can be found in Chapter 5, "Control files and their fields", and Chapter 7, "Declaring relationships between packages". These definitions are valid for all Debian-based distributions, including Ubuntu. Metadata in Debian packages is stored in so-called control files. Besides control files, there are also other metadata files, like changelogs or licensing information; those are not analyzed in this thesis. There is a distinction between source packages and binary packages. Source packages are effectively upstream source code with packaging information, like control files, and distribution-specific patches. Binary packages are automatically built from source packages. Debian repositories provide index files, which contain the concatenated metadata of the packages. These index files are downloaded by Debian Packaging Tool (DPKG) frontends, like the Advanced Packaging Tool (APT), which support remote package repositories. Package managers can resolve the dependencies of the packages based on different criteria using the metadata of the index, and then install packages on demand.

The integration of Debian repositories is implemented in the metaservice-core-deb module. A crawler is provided which can automatically detect the repositories of the different distribution releases and architectures. It was tested on both Debian and Ubuntu repositories. A very useful source of data is the Debian Snapshot Archive [93]. It contains daily snapshots of the Debian distribution repositories since March 2005. These snapshots have been imported into a metaservice archive and form the biggest dataset used yet; the over 8,000 compressed and deduplicated snapshots of the repository indices total over 14 GB.

After the repository indices are parsed by the metaservice module, they are translated as literally as possible to the Metaservice Debian Ontology, which contains the same semantics as the source. The ontology contains subproperty relations to ADMS.SW and SWREL. A part of the mapping can be seen in Table 6.1. The subproperty relationship swrel:dependsSoftware was omitted from the listed mappings, to save space in the table.

Debian Field: Metaservice Debian Ontology URI; Mapping in Metaservice Ontologies

General Metadata:
Package: deb:packageName; doap:name
Source: deb:source; doap:name or swrel:source
Description: deb:description; doap:description, dc:description
Size: deb:size; schema:fileSize
Architecture: deb:architecture; (no equivalent)
Version: deb:version; doap:revision
Homepage: deb:homepage; doap:homepage
MD5sum: deb:md5sum; spdx:checksum
SHA1: deb:sha1sum; spdx:checksum
SHA256: deb:sha256sum; spdx:checksum
Maintainer: deb:maintainer; doap:maintainer
Uploaders: deb:uploader; doap:helper

Relationships:
Provides: deb:provides; swrel:implements
Replaces: deb:replaces; swrel:relatedSoftware
Built-Using: deb:builtUsing; swrel:relatedSoftware

Dependencies:
Depends: deb:depends; swrel:dependsRuntime, swrel:requires
Pre-Depends: deb:preDepends; swrel:dependsRuntime, swrel:requires
Recommends: deb:recommends; swrel:dependsRuntime, swrel:optional
Suggests: deb:suggests; swrel:dependsRuntime, swrel:optional
Breaks: deb:breaks; swrel:antiDepends, swrel:dependsRuntime, swrel:requires
Conflicts: deb:conflicts; swrel:antiDepends, swrel:dependsRuntime, swrel:requires
Build-Depends: deb:buildDepends; swrel:dependsBuild, swrel:requires
Build-Depends-Indep: deb:buildDependsIndep; swrel:dependsBuild, swrel:requires
Build-Conflicts: deb:buildConflicts; swrel:antiDepends, swrel:dependsBuild, swrel:requires
Build-Conflicts-Indep: deb:buildConflictsIndep; swrel:antiDepends, swrel:dependsBuild, swrel:requires

Table 6.1: Mapping of Debian Control Fields to the Metaservice Debian Ontology and to Metaservice Ontologies

Not all attributes from the control files could be mapped to the abstract ontologies; e.g., Architecture has no equivalent in ADMS.SW or SWREL. An interesting aspect of Debian packages is the quite complicated ordering scheme of revision names. A revision name may contain three parts: the epoch, the upstream version, and the Debian revision. The elements are compared one after another. Digits are interpreted and sorted as numbers, letters alphabetically, and symbols by special conventions; a sketch of such a comparison follows below. A postprocessor for metaservice was implemented which aggregates the releases and packages of a project and applies xhv:next and xhv:prev ordering to them. This cannot be done in the provider, because providers always process only one package at a time. The temporal semantics of metaservice lead to the calculation of different total orderings of packages for different points in time. Special view templates are provided for the web interface rendering of the Debian classes. They use familiar Debian package terminology to make the metadata easier to comprehend.
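A simplified sketch of this ordering, covering only the alternation of non-digit and digit parts within the upstream version and deliberately ignoring the epoch, the Debian revision, and the special "~" convention:

public class DebianVersionSketch {
    /** Compares two upstream version strings by alternating non-digit and digit parts. */
    static int compareUpstream(String a, String b) {
        int i = 0, j = 0;
        while (i < a.length() || j < b.length()) {
            // Compare the non-digit prefixes character by character.
            while (i < a.length() || j < b.length()) {
                char ca = i < a.length() ? a.charAt(i) : 0;
                char cb = j < b.length() ? b.charAt(j) : 0;
                if (Character.isDigit(ca) && Character.isDigit(cb)) break;
                if (ca != cb) return ca - cb;
                if (ca == 0) return 0; // both strings exhausted
                i++; j++;
            }
            // Compare the following digit runs as whole numbers.
            long na = 0, nb = 0;
            while (i < a.length() && Character.isDigit(a.charAt(i))) na = na * 10 + (a.charAt(i++) - '0');
            while (j < b.length() && Character.isDigit(b.charAt(j))) nb = nb * 10 + (b.charAt(j++) - '0');
            if (na != nb) return na < nb ? -1 : 1;
        }
        return 0;
    }
}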

6.2 Integration of the Maven Ecosystem

Maven is integrated through the metaservice-core-maven module. The first step of the integration is the retrieval of the raw data from source repositories. Although several different Maven repositories exist, Maven Central [118] is the single most important one. However, Maven Central does not allow mirroring or crawling, which is a problem for data retrieval. Indices, which contain all packages and could therefore be used for the detection of repository changes, are not updated regularly. Clients access the repository by artifact name; there is no need for an up-to-date index except for the autocompletion of dependencies in Integrated Development Environments (IDEs). Luckily ibiblio [109], a public digital library and archive, is one of the few authorized mirrors and can be used for crawling.

The parsing of the POM should be straightforward because of its XML format, but actually it is not. Maven calculates an effective POM and does not use the application POM directly. Besides the POM of the artifact itself, the whole hierarchy of parent POMs needs to be loaded to calculate the effective POM. Since fetching and iterative merging is not a trivial task, the parser of the Maven project itself is reused. Although that parser is rather undocumented and tightly coupled with the rest of Maven, it is still the safest way to interpret POM files like Maven would.

Once the artifact metadata is parsed, the content needs to be mapped to RDF statements. As expected, an ontology needs to be created for Maven. The general attributes, like project name, description, and license, are not problematic, but dependencies are encoded in a complicated way. Unlike Debian, where only specific types of dependencies are possible, Maven uses configurable dependencies; hence a direct translation of properties is not possible. Therefore the module provides only a few frequently used combinations of dependency options as concrete properties in the ontology. All other options are handled by the provider and are expressed directly using the SWREL vocabulary. Because of the case-dependent mapping in the module, Table 6.2 cannot give explicit mappings, but instead only shows related concepts of Maven and metaservice. The only thing that could not be mapped are the transitivity features of the different types of dependencies. Transitivity of dependencies means that when resolving the dependencies, the dependencies of the dependencies need to be resolved recursively. Maven allows different ways to handle the transitivity of


name: doap:name
description: doap:description, dc:description
parent: swrel:relatedSoftware
modules: swrel:relatedSoftware
url: doap:homepage
license: doap:license
organization: doap:vendor, foaf:Organization
developer: doap:developer, foaf:Person
contributor: doap:helper, foaf:Person
issueManagement: doap:bug-database
ciManagement: (no equivalent)
mailingLists: doap:mailing-list
scm: doap:Repository
dependency: swrel:dependsSoftware
dependency/scope[compile]: swrel:dependsBuild, swrel:dependsRuntime
dependency/scope[provided,system]: swrel:dependsBuild
dependency/scope[runtime]: swrel:dependsRuntime
dependency/scope[test]: swrel:dependsTest
dependency/optional: swrel:optional, swrel:requires
repository: admssw:SoftwareRepository
distributionManagement: admssw:SoftwareRepository
plugin: swrel:depends

Table 6.2: Mapping of Maven Concepts to Metaservice Concepts

dependencies, as in the provided and system scopes of dependencies. SWREL does not contain vocabulary to express this. Transitivity exclusions on dependencies, which allow selectively choosing which transitive dependencies are resolved, consequently also cannot be handled correctly in the metaservice module. Maven in general has several properties which make a static and reproducible extraction of dependency information impossible. Things like system variables can influence the interpretation of the metadata drastically; for example, they can be used to enable or disable profiles, which may lead to a completely different effective POM. Therefore metaservice cannot interpret the impact of profiles on the project structure. Good attributes of Maven are its imposed rules for namespaces, the required full qualification of dependencies, and the immutability of packages after deployment. All this together makes it possible to identify packages by name very reliably, which is used in metaservice to implement direct, instead of indirect, referencing of dependencies.


6.3 Implementation of License Conflict Discovery

The use case of Section 3.2.4 is about the detection of conflicts between software licenses. For the evaluation, the detection of copyleft violations is implemented and presented using the example of the GPL [102].

6.3.1 Detection of Software Licenses

Detection of software licenses in source code has already been discussed several times. One of the software solutions that supports license detection is FOSSology [98] with its license detection tool Nomos. Nomos uses license signatures to match licenses to source files; FOSSology is used to initiate the scan and subsequently stores the result of Nomos. Gobeille [28] gives a more detailed introduction to how FOSSology is structured and how it works. Prior to Nomos, license detection was mainly based on regular expressions, as implemented in Black Duck Software's ohcount [126]. Ninka [124] was introduced by German, Manabe, and Inoue [27] as an alternative to Nomos. It uses more advanced parsing, filtering, and matching techniques to improve the detection rate compared to the other alternatives. Fossology-Ninka [99] is a tool which wraps both Nomos and Ninka to generate RDF descriptions of software packages based on the SPDX ontology. Tests in the scope of this thesis did provide the expected results: software licenses were successfully detected, but not reliably for all test cases. Further advances in source code mining are needed to improve the detection rates of the tools, but this is outside the scope of this thesis. To evaluate metaservice, manually chosen licenses were therefore used instead of automated detection.

6.3.2 Integration of Linked Data Software Descriptions

The examples used for the demonstration were written directly in RDF using ADMS.SW. Although metaservice itself uses RDF to represent its data, it does not integrate external RDF resources out of the box. Therefore the metaservice-core-ld module was implemented to integrate external linked data repositories. It contains parsers for multiple RDF formats and stores information in observations, just like every other module. One aspect of importing existing linked data is that resources arrive with existing URIs which are not in the metaservice namespace. Metaservice needs locally dereferenceable URIs, so that clients can access the resources. Therefore a new URI in the metaservice namespace is generated in the same way as for all other integrations, and the original URI is linked through owl:sameAs. Unlike traditional semantic web data models, the metaservice data model does not infer the implications of owl:sameAs automatically for all statements, but only for the observations in which the statements are included. Hence SPARQL queries need to handle equality of URIs explicitly.


6.3.3 Copyleft Conflict Detection Query

The metaservice-demo-licensecheck module contains a postprocessor which can detect license violations of copyleft licenses. Different licenses and copyleft propagation dependencies can be configured. The query in Listing 6.1 is already configured to match the GPLv2 and detects violations based on the swrel:links dependency. First, the dependencies of the resource in question are selected. Direct dependencies as well as conjunctive and disjunctive dependency statements are supported. To keep the example simple, nesting of the terms is not supported by the query; software ranges are not supported for the same reason, because they would also expand the code significantly. As soon as the dependencies are known, they are matched against the configured license. The second union is used to support owl:sameAs for dependencies. The result set then contains all dependencies that are distributed under the GPLv2. The last statement filters the cases where the license of the resource in question and the license of a dependency are incompatible.

SELECT DISTINCT
  ?resourceLicense
  ?dependency
  ?dependencyLicense
WHERE {
  BIND( <http://spdx.org/licenses/GPL-2.0> AS ?dependencyLicense )
  BIND( <swrel:links> AS ?dependencyProperty )

  ?resource <doap:license> ?resourceLicense .
  {
    ?resource ?dependencyProperty ?dependency .
  } UNION {
    ?resource ?dependencyProperty ?collection .
    ?collection a <swrel:Software> .
    ?collection <rdf:li> ?dependency .
  }
  {
    ?dependency <doap:license> ?dependencyLicense .
  } UNION {
    ?dependency <owl:sameAs> ?dependencySame .
    ?dependencySame <doap:license> ?dependencyLicense .
  }
  FILTER( ?resourceLicense != ?dependencyLicense )
}

Listing 6.1: Copyleft License Conflict Discovery SPARQL Query

The query currently analyzes DOAP licenses, but can easily be extended to support SPDX's declared and detected licenses. The filtering is currently implemented by testing equality of the license resources, but could instead use explicit relationships between the licenses; legal expertise would be required to define such relations. When incompatibilities are found by the postprocessor, the resources are annotated with a warning. This warning is stated as RDF and can therefore be used by the API and the web interface.


6.4 Implementation of Security Report Alerts

The prototype of the security report alert use case, which was introduced in Section 3.2.3, is implemented as a postprocessor. It uses the CPE and CVE repositories of the NIST to report security alerts for the WordPress project.

6.4.1 Integration of CPE and CVE

The NVD repository provides structured information in XML. A collection of XML schemata is provided for the description of the different enumerations of the repository. The integration of CPEs and CVEs is implemented in the metaservice-core-nvd module. The module contains ontologies for CVEs and CPEs, which are based on the NVD XML schemata. The enumerations are regularly exported as XML files by the NVD. These XML files are fetched by the module's crawler, parsed according to the schemata, and then mapped to RDF statements in the provider. Since security reports are not the main topic of metaservice, they were not mapped to an abstract report ontology, but just converted. Special views are provided for the visualization of the resources.

The NVD module also contains postprocessors, which are used to link the CPE resources to their software counterparts. In this case, related CPEs and software releases were joined based only on package names, but other joining methods are also possible. One of the limitations of postprocessors became apparent during the implementation of the linking and required a workaround. A postprocessor which links CPEs and releases needs to be executed when either of them is created, removed, or changed, to be able to update the connection. The created observation of the postprocessor naturally contains the link between the two matching resources; hence both the CPE and the release are authoritative subjects of the observation. Now, a postprocessor must calculate at most one observation for a given resource and must calculate the same result for all authoritative subjects. CPEs and releases need not be in a one-to-one relationship, but may also be in a many-to-many relationship. Observations may therefore become extremely large, and thereby inefficient, because all interconnected relationships end up in the same observation. The workaround is to create two postprocessors, one for each side that may change. This reduces the number of authoritative subjects to one per observation and significantly lowers the impact of changes.

The extremely high density of relationships between CVEs and CPEs acts as a benchmark for the service's query performance. Security bugs are often found in multiple software releases, and every release may contain many security bugs. Consequently, the only three layers deep traversal of the RDF graph from a single security report often contains all available security reports of the affected software project. Although this creates significant strain on the service, metaservice can even handle traversals of projects like Microsoft Internet Explorer, for which many security reports exist.

6.4.2 Integration of WordPress

WordPress [144] is a popular blogging software written in PHP. Security advisories have frequently been reported for WordPress in the past. The CVE repository currently contains 145 reports, which are linked to a:wordpress:wordpress:-:*:*:*:*:*:*:*, the WordPress CPE entry matching all releases. Some of the reports are not exclusively about WordPress, but about combinations of WordPress and its plugins. WordPress has been a popular target because of its widespread usage, its extendibility through plugins and themes, and its often poorly maintained deployments.

Integration of WordPress into metaservice is implemented in the metaservice-demo-wordpress module. The WordPress homepage does not provide any structured, let alone semantic, data on releases. Therefore the module implements a crawler and a parser, which scrape and parse the pages of archived releases on the project's homepage. The data is then translated to RDF statements, which are directly stated using the ADMS.SW vocabulary, without the intermediate step of a dedicated WordPress ontology.
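For illustration, the statements emitted for a single archived release might look roughly as follows, written here as a SPARQL INSERT DATA block even though the provider actually emits the statements programmatically; the resource IRIs, the ADMS.SW namespace, and the exact choice of terms are assumptions of this sketch:

# Hedged sketch: IRIs, the admssw namespace, and term choices are illustrative.
PREFIX admssw: <http://purl.org/adms/sw/>
PREFIX doap:   <http://usefulinc.com/ns/doap#>

INSERT DATA {
  <http://example.org/wordpress/project>
      a admssw:SoftwareProject ;        # the WordPress project itself
      doap:name "WordPress" .

  <http://example.org/wordpress/release/3.9.2>
      a admssw:SoftwareRelease ;        # one archived release
      doap:revision "3.9.2" .
}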

6.4.3 Query Execution and Alert

The actual detection of new security reports and the consequent notification of the user is implemented in the metaservice-demo-securityalert module. The module contains two classes: first, a postprocessor that checks whether the received change message was created by a new security vulnerability; second, a backend class, which stores the user subscription database and is responsible for sending alert notifications and remembering those already sent. The postprocessor executes the query shown in Listing 6.2 and notifies the backend if it returns a result. To find connections between CVEs and software projects, the query first matches related CPEs and then software releases. Only the project and the project name are selected for the notification, since the CVE is bound prior to query execution.

SELECT DISTINCT
  ?project
  ?projectName
WHERE {
  ?cve     <…> ?cpe .
  ?cpe     <…> ?release .
  ?release <…> ?project .
  ?project <…> ?projectName .
}

Listing 6.2: Security Alert SPARQL Query

In the current implementation the module backend sends an alert whenever the query returns a new result which has not been reported before. It does not check whether the report itself is actually new or old, which meant there was no need to wait for an actual new CVE for testing purposes. The users are then alerted by a notification of Notify My Android (NMA) [125]. NMA allows sending notifications to Android smartphones through a REST API. It was chosen because it was easy to integrate, but any other notification technology, such as e-mail, could be used instead. In Appendix C, Figure C.4 shows a screenshot of a successful alert notification. Writing a query that is able to recursively follow dependencies would require SPARQL property paths, which are not available in the subset of SPARQL supported by metaservice. Recursive propagation of security reports can instead be implemented through intermediate marking of affected software as vulnerable, as sketched below. Every marking triggers another postprocessing run, such that longer chains can be processed.
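A sketch of a single propagation step of this marking approach follows; the marker class sec:Vulnerable and the abstract dependency property swrel:dependsOn are assumed, illustrative terms:

# Hedged sketch of one propagation step; all terms are illustrative assumptions.
PREFIX sec:   <http://example.org/security#>
PREFIX swrel: <http://example.org/swrel#>

SELECT DISTINCT ?dependent
WHERE {
  ?dependent swrel:dependsOn ?software .              # direct dependency only
  ?software  a sec:Vulnerable .                       # marker set by a previous run
  FILTER NOT EXISTS { ?dependent a sec:Vulnerable . } # stop at the fixpoint
}

Marking each returned ?dependent as vulnerable triggers the next postprocessing run, so dependency chains of arbitrary length are covered without property paths.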


6.5 Runtime Environment

All tests and part of the development were done on a headless server with a 3.4 GHz Intel® Core™ i7-4770 quad-core CPU with Hyper-Threading (HT), 32 GB of Random Access Memory (RAM), and a 256 GB Solid State Disk (SSD) running Debian Wheezy. Swap space was available, but the Linux kernel's swappiness was set to 0, such that swapping only occurred to save the system from killing processes. Although the components can be distributed across different servers, all processes were started on a single server.

Initially a Hard Disk Drive (HDD) was used instead of an SSD, but bigdata was significantly bound by the slow I/O. Therefore an in-memory Temporary File System (tmpfs) on a RAM disk and SSDs were considered as alternatives. The performance difference between running bigdata on a tmpfs and on the SSD was marginal, but both were significant improvements over the HDD. Therefore the SSD was chosen. With the SSD the whole system changed from I/O bound to CPU and network latency bound. CPU limits are reached during SPARQL queries and during parsing of big files. Network latency is a bottleneck for the messaging system and can be mitigated by sending and processing more messages concurrently.

The OpenJDK Java Runtime Environment (JRE) was used. Experiments showed good performance of the G1 garbage collector for bigdata and of the parallel garbage collector for providers and postprocessors. While bigdata delivers good query performance with 4 GB of heap space, some data management queries require at least 6 GB. Heap sizes for the other processes were harder to determine. Postprocessors require less heap space, even when parallelized. Providers only worked sequentially, because they used up to 6 GB of heap space.

6.6 Discussion

Both the Debian package management system and Maven are widespread, sophisticated systems that are deployed to solve real-world problems. Although the two required very different integration approaches, in the end both could be integrated successfully. Most other software ecosystems apply concepts similar to the two evaluated ones, and most systems have a lower complexity. The SWREL ontology can therefore, with the exception of the definition of transitive relationship properties, be seen as very suitable for the abstraction of software relationships.

The implementation of these two integration modules showed that, besides syntactic and semantic difficulties, other factors indeed play a big role in the integration of software ecosystems. In the case of the Debian ecosystem, the large size of the snapshot repository confirmed that scalability is really necessary for implementations, because there is a lot of data to process and it grows every day. Maven initially posed organizational problems, because crawling of the main repository is prohibited. Although the metadata is provided freely, access was not as easy as it could be. Another limitation for metaservice and similar services is that, although the Maven POM format is very well defined, it needs dynamic interpretation. This improves the flexibility of Maven, but at the same time limits the usefulness of information from static services. This last problem could probably be solved only through static description of the dynamic behavior directly by the developers. Automatic static analysis is a great tool to make use of a lot of data, but it cannot reach the quality of manual description.

The implementation of the licensing conflict detector and the security report alerter showed two things. First, the metaservice design, including the architecture, the temporal data model, and automatic query translation, works and therefore meets the functionality requirements. Second, the assumption holds that abstract information about software interrelationships originating from different software ecosystems enables new ways of processing software relationships. The implementations were demonstrations, which served well as a proof of concept for the service.


7 Conclusion

The conclusion starts with a condensed summary of the thesis in Section 7.1. Then limitations and implications of the work are provided in Sections 7.2 and 7.3, respectively. Last, an outlook for interrelationship services is given in Section 7.4.

7.1 Summary

The goal of this thesis was to design and implement metaservice, a prototype of a software service which provides unified access to generic software interrelationship information. The service should enable usage of structured data about software ecosystems, which is currently only available in distributed and partially incompatible software repositories. An iterative prototype development process was chosen to allow explorative research on possible solution designs.

The semantic differences between relationships in different software technologies and metadata formats showed that simple conversion of the data is not sufficient. Additional analysis of the requirements showed that a scalable and extendible system is needed to cope with the large amount and the different types of available software metadata. The need for semantic mapping of the different metadata formats and the natural graph structure of relationship networks led to the consideration of semantic web technologies. Existing ontologies for software and software interrelationship description were evaluated and found to be unsuitable to express the different kinds of relationships. Therefore SWREL, a dedicated software relationship ontology, was developed. It serves as an abstract common layer to which existing software relationship concepts can be mapped. Through the semantic mapping of the concepts from the different software ecosystems, writing queries across software ecosystem borders becomes possible. SWREL was successfully evaluated through the mapping of the Debian and Maven ecosystems.

For metaservice to be able to answer queries across ecosystem borders, it needs to have actual access to the data. A distributed and extendable architecture was proposed and implemented that is able to gather and transform data from software repositories. It consists of a knowledgebase for statement storage, a frontend for information access, and a sophisticated processing pipeline for data retrieval, mapping, and postprocessing. The knowledgebase is provided by bigdata, an open source RDF database, which was extended for better support of provenance data. In the processing pipeline, work distribution is provided by an event-based messaging system, which notifies the potentially distributed workers to create or update statements in the knowledgebase. A management tool was provided to manage installed modules, acquired knowledge, and the execution of the independent workers.

Early evaluations of the service showed that time and the temporal change of software repositories are critical factors which were overlooked in the initial designs. If information about software in software repositories is removed, added, or changed, references to it may lead to dead ends. Simple flattening of all data while leaving out temporal information leads to inconsistencies. An extension of the service to respect temporal concepts was therefore required. Currently, semantic web technologies have difficulties handling multiple versions of temporal data, as we have shown in Section 2.5.1. The same is true for reasoning on knowledgebases with inconsistencies, which is needed when using data from independent data sources. To solve these problems, the metaservice data model was based on an observation concept, which allows inconsistencies and temporal validity. Automated query translation, which takes care of the different forms of temporal information, was designed and implemented to hide the complexity of the new model. It allows transparent usage of RDF and SPARQL without exposing temporal complexities to developers and users.

The temporally enriched data model leads to long and complicated queries, for which current SPARQL query optimizers, neither in bigdata nor, to the best of our knowledge, in other databases, are suited. Even the simplest queries did not terminate within hours under default query optimization. Different optimization approaches were explored and implemented to encode implicit knowledge about the data structure into the transformed SPARQL queries, such that response times were reduced to milliseconds, or at most seconds. These performance improvements led to a practically usable system.

Based on the service, two simple applications were implemented to evaluate its practical usability. The first application was a detector for potential licensing problems caused by combinations of incompatible licenses. The second application was an alerter, which sent notifications when security reports were published for software. Both applications ran as expected and therefore provide evidence that the service enables queries on software regardless of the source format or origin.

7.2 Limitations

The metaservice prototype and all three contributions of this thesis, the ontology, the architecture, and the translation of temporal queries, are prototypes. They are not ready-to-use products which can be deployed in production systems, because several non-functional requirements were not implemented properly. For metaservice to fulfill the vision of a platform for software and software interrelationships, modules must be written for many software ecosystems. Developers need further tools, like support for automated tests and dedicated testing environments, in addition to proper development documentation. Processes and standards for the acceptance and quality assurance of external modules are not defined yet; neither are security considerations for the service. Although the architecture was designed, and the components were chosen, for distributed execution, actual distribution across more than two computers was neither required by the workload nor possible due to time and resource constraints. This needs to be verified separately. The temporal data model and the automatic translations have been shown to work, but have not been based on a proper formal model, which would be necessary to prove their correctness. The software interrelationship ontology was shown to be able to correctly map two rather complex relationship ontologies. One known limitation is the lack of functionality to describe transitive relationship types. Other missing features may only become apparent when additional ecosystems are integrated.


7.3 Implications

Metaservice solves the two problems of a software interrelationship service: the semantic and the syntactic differences of information in software ecosystems. It provides a common interface to process software interrelationships across ecosystem borders. This interface can be used to help people comprehend software faster and to create tools that support software engineering and maintenance tasks. Additionally, this thesis showed that software interrelationship services need to be able to handle inconsistencies between, and changes of, software repositories. The architecture of metaservice does not only solve these problems theoretically; the proof-of-concept implementation showed that it works in practice. The software relationship ontology is able to correctly represent relationships and dependencies of software, whereas existing public software ontologies cannot express the multitude of existing relationships. Its usage is not limited to metaservice; it can also be used outside of it. The metaservice observation data model provides a novel approach for the storage and querying of temporally scoped RDF statements. In contrast to existing temporal data models, it does not require normalization of observation times to time periods and therefore allows storing information in the way it is perceived. Automatic query translation makes it possible to use this data model on existing RDF databases.

7.4 Outlook

The semantic web is on the rise, as recent applications like the Google Knowledge Graph or projects like LOD have shown. It is essential for the realization of the vision of globally interconnected software engineering tools that software and its interconnections can be processed regardless of their source. Platforms like metaservice, which can be part of the LOD cloud, will probably be necessary to kickstart the adoption of semantic data in the software engineering industry. They can provide a critical mass of information that can be used to demonstrate the value of semantic metadata for software engineering and maintenance processes. Once the semantic web in the software field, and its ontologies, reach maturity, the next step would be to move away from data mapping towards first-class semantic data and therefore a decentralized approach. This thesis pointed out many existing problems, and some possible solutions, on the road to a borderless global software ecosystem. Hopefully metaservice will reach maturity for public usage, such that further experience on this topic can be gathered. Next steps on this path are the setup of a secure and distributed instance of metaservice, the refinement of the module development tools, the initiation of a process for public contributions, and the development of further ecosystem integrations.



Bibliography

References

[1] Joey van Angeren, Vincent Blijleven, and Slinger Jansen. "Relationship intimacy in software ecosystems: a survey of the Dutch software industry". In: Proceedings of the International Conference on Management of Emergent Digital EcoSystems. Ed. by Massimiliano Di Penta and Jonathan I. Maletic. MEDES '11. New York, NY, USA: ACM, 2011, pp. 68–75. ISBN: 978-1-4503-1047-5. DOI: 10.1145/2077489.2077502.
[2] Renzo Angles and Claudio Gutierrez. "Survey of Graph Database Models". In: Computing Surveys 40.1 (Feb. 2008), 1:1–1:39. ISSN: 0360-0300. DOI: 10.1145/1322432.1322433.
[3] Veronika Bauer and Lars Heinemann. "Understanding API Usage to Support Informed Decision Making in Software Maintenance". In: Software Maintenance and Reengineering (CSMR), 2012 16th European Conference on. Ed. by Tom Mens, Anthony Cleve, and Rudolf Ferenc. Mar. 2012, pp. 435–440. DOI: 10.1109/CSMR.2012.55.
[4] Gabriele Bavota et al. "The Evolution of Project Inter-dependencies in a Software Ecosystem: The Case of Apache". In: 2013 IEEE International Conference on Software Maintenance, Eindhoven, The Netherlands. IEEE, Sept. 2013. ISBN: 978-0-7695-4981-1. DOI: 10.1109/icsm.2013.39.
[5] Olivier Berger and Christian Bac. "Authoritative Linked Data Descriptions of Debian Source Packages Using ADMS.SW". In: Open Source Software: Quality Verification. Ed. by Etiel Petrinja et al. Springer, 2013, pp. 168–181. DOI: 10.1007/978-3-642-38928-3_12.
[6] Olivier Berger et al. "Weaving a Semantic Web Across OSS Repositories: Unleashing a New Potential for Academia and Practice". In: International Journal of Open Source Software and Processes 2.2 (2010), pp. 29–40. DOI: 10.4018/jossp.2010040103.
[7] Tim Berners-Lee. Linked Data - Design Issues. 2006. URL: http://www.w3.org/DesignIssues/LinkedData.html (visited on 10/06/2014).
[8] Tim Berners-Lee. What the Semantic Web can represent. 1998. URL: http://www.w3.org/DesignIssues/RDFnot.html (visited on 10/06/2014).
[9] Tim Berners-Lee, Roy Fielding, and Larry Masinter. Uniform Resource Identifier (URI): Generic Syntax. RFC 3986. Internet Engineering Task Force, Jan. 2005. URL: http://www.ietf.org/rfc/rfc3986.txt (visited on 10/06/2014).
[10] Mark Birbeck et al. RDFa Core 1.1 - Second Edition. W3C Recommendation. W3C, Aug. 2013. URL: http://www.w3.org/TR/2013/REC-rdfa-core-20130822/ (visited on 10/06/2014).
[11] Christian Bizer, Tom Heath, and Tim Berners-Lee. "Linked data - the story so far". In: International journal on semantic web and information systems 5 (3 2009), pp. 1–22. DOI: 10.4018/jswis.2009081901.

[12] Michael H. Bohlen, Renato Busatto, and Christian S. Jensen. "Point- versus interval-based temporal data models". In: Data Engineering, 1998. Proceedings., 14th International Conference on. Ed. by Susan Darling Urban and Elisa Bertino. Feb. 1998, pp. 192–200. DOI: 10.1109/ICDE.1998.655777.
[13] Gavin Carothers and Eric Prud'hommeaux. RDF 1.1 Turtle. W3C Recommendation. W3C, Feb. 2014. URL: http://www.w3.org/TR/2014/REC-turtle-20140225/ (visited on 10/06/2014).
[14] Jeremy J. Carroll et al. "Named graphs". In: Web Semantics: Science, Services and Agents on the World Wide Web 3.4 (Dec. 2005), pp. 247–267. DOI: 10.1016/j.websem.2005.09.001.
[15] Marcelo Cataldo et al. "Software Dependencies, Work Dependencies, and Their Impact on Failures". In: IEEE Trans. Software Eng. 35.6 (Nov.–Dec. 2009), pp. 864–878. DOI: 10.1109/TSE.2009.42.
[16] Megan Conklin, James Howison, and Kevin Crowston. "Collaboration Using OSSmole: A repository of FLOSS data and analyses". In: ACM SIGSOFT Software Engineering Notes 30.4 (2005), pp. 1–5. DOI: 10.1145/1082983.1083164.
[17] Richard Cyganiak and Anja Jentzsch. Linking Open Data cloud diagram. 2007–2011. URL: http://lod-cloud.net/ (visited on 10/06/2014).
[18] Danica Damljanovic and Kalina Bontcheva. "Enhanced semantic access to software artefacts". In: Workshop on Semantic Web Enabled Software Engineering (SWESE), Karlsruhe, Germany. 2008.
[19] Stefan Dösinger, Richard Mordinyi, and Stefan Biffl. "Communicating continuous integration servers for increasing effectiveness of automated testing". In: Proceedings of the 27th IEEE/ACM International Conference on Automated Software Engineering. Ed. by Michael Goedicke, Tim Menzies, and Motoshi Saeki. ASE 2012. New York, NY, USA: ACM, 2012, pp. 374–377. ISBN: 978-1-4503-1204-2. DOI: 10.1145/2351676.2351751.
[20] Martin Duerst and Michel Suignard. Internationalized Resource Identifiers (IRIs). RFC 3987. Internet Engineering Task Force, Jan. 2005. URL: http://www.ietf.org/rfc/rfc3987.txt (visited on 10/06/2014).
[21] Robert Dyer et al. "Boa: a language and infrastructure for analyzing ultra-large-scale software repositories". In: Proceedings of the 2013 International Conference on Software Engineering. Ed. by David Notkin, Betty H.C. Cheng, and Klaus Pohl. ICSE '13. IEEE Press, 2013, pp. 422–431. DOI: 10.1109/ICSE.2013.6606588.
[22] Emilio Ferrara et al. "Web Data Extraction, Applications and Techniques: A Survey". In: Computing Research Repository (2012). arXiv: 1207.0246v4 [cs.IR].
[23] José A. Galindo, David Benavides, and Sergio Segura. "Debian Packages Repositories as Software Product Line Models. Towards Automated Analysis". In: Proceedings of the 1st International Workshop on Automated Configuration and Tailoring of Applications (ACoTA). Ed. by Deepak Dhungana et al. CEUR Workshop Proceedings. CEUR-WS.org, Sept. 2010, pp. 29–34.
[24] Fabien Gandon and Guus Schreiber. RDF 1.1 XML Syntax. W3C Recommendation. W3C, Feb. 2014. URL: http://www.w3.org/TR/2014/REC-rdf-syntax-grammar-20140225/ (visited on 10/06/2014).


[25] Daniel M. German, Jesus M. Gonzalez-Barahona, and Gregorio Robles. "A model to understand the building and running inter-dependencies of software". In: Reverse Engineering, 2007. WCRE 2007. 14th Working Conference on. Ed. by Massimiliano Di Penta and Jonathan I. Maletic. IEEE. 2007, pp. 140–149. DOI: 10.1109/WCRE.2007.5.
[26] Daniel M. German and Ahmed E. Hassan. "License integration patterns: Addressing license mismatches in component-based development". In: Proceedings of the 31st International Conference on Software Engineering. IEEE, 2009. ISBN: 978-1-4244-3453-4. DOI: 10.1109/icse.2009.5070520.
[27] Daniel M. German, Yuki Manabe, and Katsuro Inoue. "A Sentence-matching Method for Automatic License Identification of Source Code Files". In: Proceedings of the 25th IEEE/ACM International Conference on Automated Software Engineering. Ed. by Charles Pecheur, Jamie Andrews, and Elisabetta Di Nitto. ASE '10. Antwerp, Belgium: ACM, 2010, pp. 437–446. ISBN: 978-1-4503-0116-9. DOI: 10.1145/1858996.1859088.
[28] Robert Gobeille. "The FOSSology project". In: Proceedings of the 2008 international working conference on Mining software repositories. ACM. 2008, pp. 47–50. DOI: 10.1145/1370750.1370763.
[29] Asunción Gómez-Pérez, Mariano Fernandez-Lopez, and Oscar Corcho. Ontological engineering with examples from the areas of knowledge management, e-commerce and the Semantic Web. London, New York: Springer, 2004. ISBN: 978-1-85233-551-9.
[30] ADMS.SW working group. Asset Description Metadata Schema for Software. ISA specification. Version 1.00. ISA programme of the European Commission, June 2012.
[31] Tudor Groza et al. "The NEPOMUK Project - On the way to the Social Semantic Desktop". In: Proceedings of International Conferences on new Media technology (I-MEDIA-2007) and Semantic Systems (I-SEMANTICS-07), Graz, Austria, September 5-7. Ed. by Klaus Tochtermann. 2007, pp. 201–210. URL: http://hdl.handle.net/10419/44447.
[32] Ramanathan Guha and Dan Brickley. RDF Schema 1.1. W3C Recommendation. W3C, Feb. 2014. URL: http://www.w3.org/TR/2014/REC-rdf-schema-20140225/ (visited on 10/06/2014).
[33] Claudio Gutierrez, Carlos A. Hurtado, and Alejandro Vaisman. "Introducing Time into RDF". In: Knowledge and Data Engineering, IEEE Transactions on 19.2 (Feb. 2007), pp. 207–218. DOI: 10.1109/TKDE.2007.34.
[34] Mark Hapner et al. Java Message Service. Java Specification Request. Version 1.1. Sun Microsystems, Apr. 12, 2002. URL: https://jcp.org/aboutJava/communityprocess/final/jsr914/index.html (visited on 10/06/2014).
[35] Frank van Harmelen and Deborah McGuinness. OWL Web Ontology Language Overview. W3C Recommendation. W3C, Feb. 2004. URL: http://www.w3.org/TR/2004/REC-owl-features-20040210/ (visited on 10/06/2014).
[36] Olaf Hartig and Bryan Thompson. "Foundations of an Alternative Approach to Reification in RDF". In: Computing Research Repository (2014). arXiv: 1406.3399v1 [cs.DB].
[37] Patrick Hayes and Peter Patel-Schneider. RDF 1.1 Semantics. W3C Recommendation. W3C, Feb. 2014. URL: http://www.w3.org/TR/2014/REC-rdf11-mt-20140225/ (visited on 10/06/2014).

[38] Israel Herraiz, Gregorio Robles, and Jesus M. Gonzalez-Barahona. "Research friendly software repositories". In: Proceedings of the joint international and annual ERCIM workshops on Principles of software evolution (IWPSE) and software evolution (Evol) workshops. Ed. by Hans-Gerhard Gross, Marco Lormans, and Jan Tretmans. ACM. 2009, pp. 19–24. DOI: 10.1145/1595808.1595814.
[39] James Howison. "Cross-repository data linking with RDF and OWL - Towards common ontologies for representing FLOSS data". In: Proceedings of the Workshop on Public Data about Software Development. OSS 2008 Workshops Proceedings. Sept. 2008. ISBN: 978-88-903120-1-4.
[40] James Howison, Megan Conklin, and Kevin Crowston. "FLOSSmole: A collaborative repository for FLOSS research data and analyses". In: International Journal of Information Technology and Web Engineering 1.3 (2006), pp. 17–26. DOI: 10.4018/jitwe.2006070102.
[41] ISO/IEC. "Systems and software engineering - System life cycle processes". In: ISO/IEC 15288:2008(E) IEEE Std 15288-2008 (Revision of IEEE Std 15288-2004) (Jan. 2008).
[42] Ian Jackson, Christian Schwarz, et al. Debian Policy Manual. Version 3.9.5.0. Oct. 28, 2013. URL: https://www.debian.org/doc/debian-policy (visited on 10/06/2014).
[43] Slinger Jansen, Anthony Finkelstein, and Sjaak Brinkkemper. "A sense of community: A research agenda for software ecosystems". In: ICSE Companion. May 2009, pp. 187–190. DOI: 10.1109/ICSE-COMPANION.2009.5070978.
[44] Huzefa Kagdi, Michael L. Collard, and Jonathan I. Maletic. "A survey and taxonomy of approaches for mining software repositories in the context of software evolution". In: Journal of Software Maintenance and Evolution: Research and Practice 19.2 (2007), pp. 77–131. ISSN: 1532-0618. DOI: 10.1002/smr.344.
[45] Khaled M. Khan and Jun Han. "Composing security-aware software". In: Software, IEEE 19.1 (Jan. 2002), pp. 34–41. ISSN: 0740-7459. DOI: 10.1109/52.976939.
[46] Christoph Kiefer, Abraham Bernstein, and Jonas Tappolet. "Mining software repositories with iSPARQL and a software evolution ontology". In: Proceedings of the 4th International Workshop on Mining Software Repositories. IEEE Computer Society. 2007, p. 10. DOI: 10.1109/MSR.2007.21.
[47] Atanas Kiryakov, Damyan Ognyanov, and Dimitar Manov. "OWLIM – A Pragmatic Semantic Repository for OWL". In: WISE 2005 Workshops. Ed. by Mike Dean et al. Vol. 3807. LNCS. Springer, 2005, pp. 182–192. DOI: 10.1007/11581116_19.
[48] Markus Lanthaler, Richard Cyganiak, and David Wood. RDF 1.1 Concepts and Abstract Syntax. W3C Recommendation. W3C, Feb. 2014. URL: http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/ (visited on 10/06/2014).
[49] Markus Lanthaler, Manu Sporny, and Gregg Kellogg. JSON-LD 1.0. W3C Recommendation. W3C, Jan. 2014. URL: http://www.w3.org/TR/2014/REC-json-ld-20140116/ (visited on 10/06/2014).
[50] Nikolaos Loutas et al. "Building Cross-Border Public Services in Europe Through Sharing and Reuse of Interoperability Solutions". In: The 14th European Conference on e-Government: ECEG2014. Ed. by Alexandru Ionas. 2014, pp. 170–179.


[51] Mircea Lungu, Romain Robbes, and Michele Lanza. "Recovering inter-project dependencies in software ecosystems". In: Proceedings of the 25th IEEE/ACM international conference on Automated software engineering. Ed. by Charles Pecheur, Jamie Andrews, and Elisabetta Di Nitto. ASE '10. New York, NY, USA: ACM, 2010, pp. 309–312. ISBN: 978-1-4503-0116-9. DOI: 10.1145/1858996.1859058.
[52] James Malone et al. "The Software Ontology (SWO): a resource for reproducibility in biomedical data analysis, curation and digital preservation". In: Journal of Biomedical Semantics 5 (1 June 2, 2014), p. 25. ISSN: 2041-1480. DOI: 10.1186/2041-1480-5-25.
[53] Fabio Mancinelli et al. "Managing the Complexity of Large Free and Open Source Package-Based Software Distributions". In: Automated Software Engineering, 2006. ASE '06. 21st IEEE/ACM International Conference on. Ed. by Sebastian Uchitel and Steve Easterbrook. Sept. 2006, pp. 199–208. DOI: 10.1109/ASE.2006.49.
[54] Antoni Mylka et al. NEPOMUK File Ontology. OSCAF Recommendation. Open Semantic Collaboration Architecture Foundation, Aug. 28, 2013. URL: http://www.semanticdesktop.org/ontologies/2007/03/22/nfo/ (visited on 10/06/2014).
[55] Feng Pan and Jerry Hobbs. Time Ontology in OWL. W3C Working Draft. W3C, Sept. 2006. URL: http://www.w3.org/TR/2006/WD-owl-time-20060927/ (visited on 10/06/2014).
[56] Matthew Perry, Prateek Jain, and Amit P. Sheth. "SPARQL-ST: Extending SPARQL to support spatiotemporal queries". In: Geospatial semantics and the semantic web: Foundations, Algorithms, and Applications. Ed. by Naveen Ashish and Amit P. Sheth. Springer, 2011, pp. 61–86. ISBN: 978-1-4419-9446-2. DOI: 10.1007/978-1-4419-9446-2_3.
[57] Axel Polleres, Paul Gearon, and Alexandre Passant. SPARQL 1.1 Update. W3C Recommendation. W3C, Mar. 2013. URL: http://www.w3.org/TR/2013/REC-sparql11-update-20130321/ (visited on 10/06/2014).
[58] Andrea Pugliese, Octavian Udrea, and V. S. Subrahmanian. "Scaling RDF with time". In: WWW. Ed. by Jinpeng Huai et al. ACM, Apr. 21–25, 2008, pp. 605–614. ISBN: 978-1-60558-085-2. DOI: 10.1145/1367497.1367579.
[59] Steven Raemaekers, Arie van Deursen, and Joost Visser. "The Maven Repository Dataset of Metrics, Changes, and Dependencies". In: Proceedings of the 10th Working Conference on Mining Software Repositories. Ed. by Thomas Zimmermann, Massimiliano Di Penta, and Sunghun Kim. MSR '13. Piscataway, NJ, USA: IEEE Press, 2013, pp. 221–224. ISBN: 978-1-4673-2936-1. DOI: 10.1109/MSR.2013.6624031.
[60] Ralf H. Reussner, Heinz W. Schmidt, and Iman H. Poernomo. "Reliability prediction for component-based software architectures". In: Journal of Systems and Software 66.3 (2003). Software architecture – Engineering quality attributes, pp. 241–252. ISSN: 0164-1212. DOI: 10.1016/S0164-1212(02)00080-8.
[61] Marko A. Rodriguez and Peter Neubauer. "Constructions from dots and lines". In: Bulletin of the American Society for Information Science and Technology 36.6 (Aug.–Sept. 2010), pp. 35–41. ISSN: 1550-8366. DOI: 10.1002/bult.2010.1720360610.
[62] Neeraj Sangal et al. "Using dependency models to manage complex software architecture". In: OOPSLA. Ed. by Ralph E. Johnson and Richard P. Gabriel. 2005, pp. 167–176. DOI: 10.1145/1103845.1094824.

[63] Max Schmachtenberg, Heiko Paulheim, and Christian Bizer. "Adoption of Linked Data Best Practices in Different Topical Domains". In: The Semantic Web – ISWC 2014 - 13th International Semantic Web Conference, Trentino, Italy, October 19-23, 2014, Proceedings. Oct. 2014.
[64] Andy Seaborne and Steven Harris. SPARQL 1.1 Query Language. W3C Recommendation. W3C, Mar. 2013. URL: http://www.w3.org/TR/2013/REC-sparql11-query-20130321/ (visited on 10/06/2014).
[65] Estefanía Serral et al. "Evaluation of semantic data storages for integrating heterogeneous disciplines in automation systems engineering". In: Industrial Electronics Society, IECON 2013 - 39th Annual Conference of the IEEE. 2013, pp. 6858–6865. DOI: 10.1109/IECON.2013.6700268.
[66] Cleidson R.B. de Souza and David F. Redmiles. "An empirical study of software developers' management of dependencies and changes". In: Software Engineering, 2008. ICSE '08. ACM/IEEE 30th International Conference on. May 2008, pp. 241–250. DOI: 10.1145/1368088.1368122.
[67] Kate Stewart, Phil Odence, and Esteban Rockett. "Software Package Data Exchange (SPDX™) Specification". In: International Free and Open Source Software Law Review 2.2 (2011). ISSN: 1877-6922. DOI: 10.5033/ifosslr.v2i2.45.
[68] Jonas Tappolet. "Mining Software Repositories - A Semantic Web Approach". Diploma thesis. University of Zurich, Mar. 13, 2007.
[69] Jonas Tappolet and Abraham Bernstein. "Applied Temporal RDF: Efficient Temporal Querying of RDF Data with SPARQL". In: The Semantic Web: Research and Applications. Ed. by Lora Aroyo et al. Vol. 5554. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2009, pp. 308–322. ISBN: 978-3-642-02120-6. DOI: 10.1007/978-3-642-02121-3_25.
[70] Bryan Thompson. SPARQL Named Subquery Extension. Mar. 2012. URL: http://wiki.bigdata.com/wiki/index.php/NamedSubquery (visited on 10/06/2014).
[71] Bryan Thompson, Mike Personick, and Martyn Cutcher. "The Bigdata® RDF Graph Database". In: Linked Data Management, Emerging Directions in Database Systems and Applications. Ed. by Andreas Harth, Katja Hose, and Ralf Schenkel. Chapman and Hall/CRC, 2014, pp. 193–237. ISBN: 978-1466582408. DOI: 10.1201/b16859-12.
[72] Yannis Tzitzikas, Christina Lantzaki, and Dimitris Zeginis. "Blank Node Matching and RDF/S Comparison Functions". In: The Semantic Web – ISWC 2012. Ed. by Philippe Cudré-Mauroux et al. Vol. 7649. Lecture Notes in Computer Science. Springer Berlin Heidelberg, 2012, pp. 591–607. ISBN: 978-3-642-35175-4. DOI: 10.1007/978-3-642-35176-1_37.
[73] W3C OWL Working Group. OWL 2 Web Ontology Language Document Overview. Tech. rep. W3C, Oct. 2009. URL: http://www.w3.org/TR/owl2-overview/ (visited on 10/06/2014).
[74] Michael Würsch et al. "SEON: a pyramid of ontologies for software evolution and its applications". In: Computing 94 (11 2012), pp. 857–885. ISSN: 0010-485X. DOI: 10.1007/s00607-012-0204-1.
[75] Atsuko Yamaguchi et al. "BioBenchmark Toyama 2012: an evaluation of the performance of triple stores on biological data". In: Journal of Biomedical Semantics 5.32 (July 10, 2014). ISSN: 2041-1480. DOI: 10.1186/2041-1480-5-32.
[76] Amir Reza Yazdanshenas and Leon Moonen. "Crossing the boundaries while analyzing heterogeneous component-based software systems". In: 27th IEEE International Conference on Software Maintenance (ICSM 2011). IEEE, Sept. 2011. DOI: 10.1109/icsm.2011.6080786.


[77] Il-Chul Yoon et al. “Direct-dependency-based Software Compatibility Testing”. In: Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering. Ed. by R. E. Kurt Stirewalt, Alexander Egyed, and Bernd Fischer. ASE ’07. Atlanta, Georgia, USA: ACM, 2007, pp. 409–412. ISBN: 978-1-59593-882-4. DOI: 10.1145/1321631.1321696.



Online References

[78] Æsh. URL: http://aeshell.github.io (visited on 10/06/2014).
[79] Apache ActiveMQ. URL: http://activemq.apache.org (visited on 10/06/2014).
[80] Apache Buildr. URL: http://buildr.apache.org (visited on 10/06/2014).
[81] Apache Commons. URL: http://commons.apache.org (visited on 10/06/2014).
[82] Apache Hadoop. URL: http://hadoop.apache.org (visited on 10/06/2014).
[83] Apache Ivy. URL: http://ant.apache.org/ivy (visited on 10/06/2014).
[84] Apache Jena. URL: https://jena.apache.org (visited on 10/06/2014).
[85] Apache Maven. URL: http://maven.apache.org (visited on 10/06/2014).
[86] bigdata. URL: http://bigdata.com (visited on 10/06/2014).
[87] Blueprints. URL: http://blueprints.tinkerpop.com (visited on 10/06/2014).
[88] Boa. URL: http://boa.cs.iastate.edu (visited on 10/06/2014).
[89] Bootstrap. URL: http://getbootstrap.com (visited on 10/06/2014).
[90] Comprehensive Perl Archive Network. URL: http://www.cpan.org (visited on 10/06/2014).
[91] Creative Commons – Attribution-ShareAlike 3.0 Unported (CC BY-SA 3.0). URL: http://creativecommons.org/licenses/by-sa/3.0 (visited on 10/06/2014).
[92] Debian. URL: https://www.debian.org (visited on 10/06/2014).
[93] Debian Snapshot Archive. URL: http://snapshot.debian.org/ (visited on 10/06/2014).
[94] DistroWatch. URL: http://distrowatch.com (visited on 10/06/2014).
[95] DOAP Project. URL: https://github.com/edumbill/doap/wiki (visited on 10/06/2014).
[96] EvoOnt - A Software Evolution Ontology. URL: www.ifi.uzh.ch/ddis/research/evoont (visited on 10/06/2014).
[97] Flossmole - Collaborative collection and analysis of free/libre/open source project data. URL: http://flossmole.org (visited on 10/06/2014).
[98] FOSSology. URL: http://www.fossology.org (visited on 10/06/2014).
[99] Fossology-Ninka. URL: https://github.com/TheFinks/Fossology-Ninka (visited on 10/06/2014).
[100] Friend of a Friend Ontology. URL: http://www.foaf-project.org (visited on 10/06/2014).
[101] GitHub. URL: http://github.com (visited on 10/06/2014).
[102] GNU General Public License. URL: http://www.gnu.org/copyleft/gpl.html (visited on 10/06/2014).
[103] Google Guava. URL: https://code.google.com/p/guava-libraries (visited on 10/06/2014).
[104] Google Guice. URL: https://code.google.com/p/google-guice (visited on 10/06/2014).
[105] Gradle. URL: http://www.gradle.org (visited on 10/06/2014).
[106] Gremlin. URL: http://gremlin.tinkerpop.com (visited on 10/06/2014).
[107] Handlebars.js. URL: http://handlebarsjs.com (visited on 10/06/2014).

[108] HermiT. URL: http://www.hermit-reasoner.com (visited on 10/06/2014).
[109] ibiblio. URL: http://www.ibiblio.org (visited on 10/06/2014).
[110] Interoperability Solutions for European Public Administrations. URL: http://ec.europa.eu/isa (visited on 10/06/2014).
[111] Jackson JSON Processor. URL: http://jackson.codehaus.org (visited on 10/06/2014).
[112] Joinup. URL: https://joinup.ec.europa.eu (visited on 10/06/2014).
[113] jsonld.js. URL: https://github.com/digitalbazaar/jsonld.js (visited on 10/06/2014).
[114] jsoup Java HTML Parser. URL: http://jsoup.org (visited on 10/06/2014).
[115] Kryo. URL: https://github.com/EsotericSoftware/kryo (visited on 10/06/2014).
[116] KryoNet. URL: https://github.com/EsotericSoftware/kryonet (visited on 10/06/2014).
[117] Logback. URL: http://logback.qos.ch (visited on 10/06/2014).
[118] Maven Central Repository. URL: http://repo.maven.apache.org/maven2 (visited on 10/06/2014).
[119] Modulecounts. URL: http://www.modulecounts.com (visited on 10/06/2014).
[120] MongoDB. URL: http://www.mongodb.org (visited on 10/06/2014).
[121] mongoJack - MongoDB Jackson Mapper. URL: http://mongojack.org (visited on 10/06/2014).
[122] National Vulnerability Database. URL: http://nvd.nist.gov (visited on 10/06/2014).
[123] neo4j. URL: http://www.neo4j.org (visited on 10/06/2014).
[124] Ninka. URL: http://ninka.turingmachine.org (visited on 10/06/2014).
[125] Notify My Android. URL: http://www.notifymyandroid.com (visited on 10/06/2014).
[126] ohcount. URL: https://github.com/blackducksw/ohcount (visited on 10/06/2014).
[127] OpenLink Virtuoso. URL: http://virtuoso.openlinksw.com (visited on 10/06/2014).
[128] OpenRDF. URL: http://www.openrdf.org (visited on 10/06/2014).
[129] OrientDB. URL: http://www.orientechnologies.com/orientdb (visited on 10/06/2014).
[130] OWLIM. URL: http://www.ontotext.com/owlim (visited on 10/06/2014).
[131] Pellet. URL: http://clarkparsia.com/pellet (visited on 10/06/2014).
[132] protégé. URL: http://protege.stanford.edu (visited on 10/06/2014).
[133] PyPI - Python Package Index. URL: https://pypi.python.org/pypi (visited on 10/06/2014).
[134] Quartz Scheduler. URL: http://quartz-scheduler.org (visited on 10/06/2014).
[135] RedHat. URL: www.redhat.com (visited on 10/06/2014).
[136] RubyGems. URL: https://rubygems.org (visited on 10/06/2014).
[137] SEON - Software Evolution Ontologies. URL: http://www.se-on.org (visited on 10/06/2014).
[138] Simple Logging Facade for Java (SLF4J). URL: http://www.slf4j.org (visited on 10/06/2014).
[139] Software Ontology. URL: http://purl.bioontology.org/ontology/SWO (visited on 10/06/2014).
[140] Software Package Data Exchange®. URL: http://spdx.org (visited on 10/06/2014).
[141] SPDX License List. URL: http://spdx.org/licenses (visited on 10/06/2014).


[142] TinkerPop. URL: http://www.tinkerpop.com (visited on 10/06/2014).
[143] Ubuntu. URL: http://www.ubuntu.com (visited on 10/06/2014).
[144] WordPress. URL: http://wordpress.org (visited on 10/06/2014).



Appendix A Linking Open Data Cloud

The following Figures A.1 to A.4 illustrate the growth of linked open data between May 2007 and April 2014.

Figure A.1: Linking Open Data Cloud May 2007 by Cyganiak and Jentzsch [17, 91]

Figure A.2: Linking Open Data Cloud March 2009 by Cyganiak and Jentzsch [17, 91]

Figure A.3: Linking Open Data Cloud September 2011 by Cyganiak and Jentzsch [17, 91]

Figure A.4: Linking Open Data Cloud April 2014 by Schmachtenberg, Paulheim, and Bizer [63]



Appendix B Selected Complete Listings

The following listings are provided unshortened, but partially modified for better readability.

B.1 RDF Serializations

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE rdf:RDF [
  <!ENTITY ex "http://example.org/">
]>
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:ex="&ex;">
  <ex:Human rdf:about="&ex;Peter">
    <ex:likes rdf:resource="&ex;Flowers"/>
  </ex:Human>
  <ex:Human rdf:about="&ex;Anna">
    <ex:likes rdf:resource="&ex;Flowers"/>
  </ex:Human>
  <ex:Human rdf:about="&ex;Hans">
    <ex:likes rdf:resource="&ex;Anna"/>
    <ex:doesntLike rdf:resource="&ex;Flowers"/>
  </ex:Human>
</rdf:RDF>

Listing B.1: RDF/XML Serialization of RDF Graph in Figure 2.1

@prefix ex: <http://example.org/> .

ex:Peter a ex:Human ;
    ex:likes ex:Flowers .
ex:Anna a ex:Human ;
    ex:likes ex:Flowers .
ex:Hans a ex:Human ;
    ex:likes ex:Anna ;
    ex:doesntLike ex:Flowers .

Listing B.2: Turtle Serialization of RDF Graph in Figure 2.1

{
  "@context": {
    "ex": "http://example.org/"
  },
  "@graph": [
    {
      "@id": "ex:Anna",
      "@type": "ex:Human",
      "ex:likes": {
        "@id": "ex:Flowers"
      }
    },
    {
      "@id": "ex:Hans",
      "@type": "ex:Human",
      "ex:doesntLike": {
        "@id": "ex:Flowers"
      },
      "ex:likes": {
        "@id": "ex:Anna"
      }
    },
    {
      "@id": "ex:Peter",
      "@type": "ex:Human",
      "ex:likes": {
        "@id": "ex:Flowers"
      }
    }
  ]
}

Listing B.3: JSON-LD Serialization of RDF Graph in Figure 2.1

<div xmlns:ex="http://example.org/">
  <div about="[ex:Peter]" typeof="ex:Human">
    <div rel="ex:likes" resource="[ex:Flowers]"></div>
  </div>
  <div about="[ex:Anna]" typeof="ex:Human">
    <div rel="ex:likes" resource="[ex:Flowers]"></div>
  </div>
  <div about="[ex:Hans]" typeof="ex:Human">
    <div rel="ex:likes" resource="[ex:Anna]"></div>
    <div rel="ex:doesntLike" resource="[ex:Flowers]"></div>
  </div>
</div>

Listing B.4: RDFa Serialization of RDF Graph in Figure 2.1

B.2 SPARQL Queries

SELECT
  *
WHERE {
  <…> <…> ?o .
  ?o ?p ?o2 .
}@"2014-01-01T00:00:00Z"^^xsd:dateTime

Listing B.5: Temporal Constraint on a Statement Group SPARQL@T Query

SELECT
  *
WHERE {
  GRAPH ?c { <…> <…> ?o } .
  ?c <…> ?validFrom ;
     <…> ?validTo .
  FILTER (
    ?validFrom < "2014-01-01T00:00:00Z"^^xsd:dateTime &&
    ?validTo > "2014-01-01T00:00:00Z"^^xsd:dateTime
  )
}

Listing B.6: Translation of Listing 4.1 to SPARQL on a valid-from and -to data model

SELECT
  *
WHERE {
  GRAPH ?c {
    <…> <…> ?o .
    ?o ?p ?o2 .
  }
  ?c <…> ?validFrom ;
     <…> ?validTo .
  FILTER (
    ?validFrom < "2014-01-01T00:00:00Z"^^xsd:dateTime &&
    ?validTo > "2014-01-01T00:00:00Z"^^xsd:dateTime
  )
}

Listing B.7: Translation of Listing B.5 to SPARQL on a valid-from and -to data model

@PREFIX ms: <…> .

CONSTRUCT { _QUADMODE_
  GRAPH ?c { ?s ?p ?o }.
}

WITH {
  SELECT DISTINCT ?c ?s ?p ?o {
    BIND( $path AS ?s ).
    GRAPH ?c { ?s ?p ?o }.
    FILTER( ?p != <…> ).
  }
} AS %level1

WITH {
  SELECT (?c2 as ?c) (?x as ?s) (?p2 as ?p) (?o2 as ?o) {
    SELECT DISTINCT ?c2 ?p2 (?o as ?x) ?o2 {
      INCLUDE %level1filtered .
      GRAPH ?c2 { ?o ?p2 ?o2. }.
      FILTER( ?p2 != <…> ).
    }
  }
} AS %level2

WITH {
  SELECT (?c2 as ?c) (?x as ?s) (?p2 as ?p) (?o2 as ?o) {
    SELECT DISTINCT ?c2 ?p2 (?o as ?x) ?o2 {
      INCLUDE %level2filtered .
      GRAPH ?c2 { ?o ?p2 ?o2. }.
      FILTER( ?p2 != <…> ).
    }
  }
} AS %level3

WITH {
  SELECT (max(?time) as ?maxTime) ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level1 .
    ?c a ?type .
    ?c <…> ?time .
    ?c <…> ?source .
    FILTER( ?time < $selectedTime )
    FILTER( ?type != <…> )
  } GROUP BY ?s ?p ?o ?source
} AS %level1max

WITH {
  SELECT (max(?time) as ?maxTime) ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level2 .
    ?c a ?type .
    ?c <…> ?time .
    ?c <…> ?source .
    FILTER( ?time < $selectedTime )
    FILTER( ?type != <…> )
  } GROUP BY ?s ?p ?o ?source
} AS %level2max

WITH {
  SELECT (max(?time) as ?maxTime) ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level3 .
    ?c a ?type .
    ?c <…> ?time .
    ?c <…> ?source .
    FILTER( ?time < $selectedTime )
    FILTER( ?type != <…> )
  } GROUP BY ?s ?p ?o ?source
} AS %level3max

WITH {
  SELECT ?c ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level1 .
    ?c a <…> .
    ?c <…> $selectedTime .
  }
} AS %level1maxContinuous

WITH {
  SELECT ?c ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level2 .
    ?c a <…> .
    ?c <…> $selectedTime .
  }
} AS %level2maxContinuous

WITH {
  SELECT ?c ?s ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %level3 .
    ?c a <…> .
    ?c <…> $selectedTime .
  }
} AS %level3maxContinuous

WITH {
  SELECT DISTINCT ?c ?s ?p ?o {
    {
      hint:Query hint:optimizer "None" .
      INCLUDE %level1 .
      ?c a <…> .
      ?c <…> ?maxTime .
      INCLUDE %level1max .
    } UNION {
      INCLUDE %level1maxContinuous .
    }
  }
} AS %level1filtered

WITH {
  SELECT DISTINCT ?c ?s ?p ?o {
    {
      hint:Query hint:optimizer "None" .
      INCLUDE %level2 .
      ?c a <…> .
      ?c <…> ?maxTime .
      INCLUDE %level2max .
    } UNION {
      INCLUDE %level2maxContinuous .
    }
  }
} AS %level2filtered

WITH {
  SELECT DISTINCT ?c ?s ?p ?o {
    {
      hint:Query hint:optimizer "None" .
      INCLUDE %level3 .
      ?c a <…> .
      ?c <…> ?maxTime .
      INCLUDE %level3max .
    } UNION {
      INCLUDE %level3maxContinuous .
    }
  }
} AS %level3filtered

WITH {
  SELECT DISTINCT ?c ?s ?p ?o {
    {
      INCLUDE %level1filtered
    } UNION {
      INCLUDE %level2filtered
    } UNION {
      INCLUDE %level3filtered
    }
  }
} AS %statements

WITH {
  SELECT DISTINCT ?c {
    INCLUDE %statements
  }
} AS %contexts

WITH {
  SELECT ?c (?c as ?s) ?p ?o {
    hint:Query hint:optimizer "None" .
    INCLUDE %contexts
    GRAPH ?c { ?c ?p ?o }.
  }
} AS %contextstatements

WHERE {
  {
    INCLUDE %statements
  } UNION {
    INCLUDE %contextstatements
  }
}

Listing B.8: Optimized SPARQL Query for Resource Lookup

@PREFIX admssw: <...> .
@PREFIX rdf: <...> .
@PREFIX rdfs: <...> .
@PREFIX skos: <...> .
@PREFIX ms: <...> .

SELECT DISTINCT
  ?a
  ?title

# %heuristic holds the statement patterns of the source query; the %mmmN
# named subqueries add, pattern by pattern, the temporal filtering on the
# metaservice data model (cf. Listing B.8).
WITH {
  SELECT
    *
  WHERE {
    ?a <...> <...> .
    ?a <...> ?alt .
    {
      ?a <...> ?title .
    } UNION {
      ?a <...> ?title .
    } UNION {
      ?a <...> ?title .
    }
    FILTER ((?alt != ?title))
  }
} as %heuristic

WITH {
  SELECT
    *
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %heuristic
    FILTER (bound(?title))
    FILTER (bound(?a))
  }
} as %mmm2Common

WITH {
  SELECT DISTINCT
    ?a
    (MAX(?timemmm2) as ?timemmm2)
    ?title
  WHERE {
    INCLUDE %mmm2Common
    GRAPH ?mmm2 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?mmm2type != <...>))
    FILTER ((?timemmm2 <= ?selectedDate))
    ?mmm2 <...> ?timemmm2 .
    ?mmm2 a ?mmm2type .
    ?mmm2 <...> ?mmm2source .
  }
  GROUP BY
    ?title
    ?a
    ?mmm2source
} as %mmm2

WITH {
  SELECT DISTINCT
    ?subjectmmm2
    ?mmm2generator
  WHERE {
    INCLUDE %mmm2Common
    GRAPH ?mmm2 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    ?mmm2 a <...> .
    ?mmm2 <...> ?mmm2generator .
    ?mmm2 <...> ?subjectmmm2 .
  }
} as %mmm2Continuous0

WITH {
  SELECT DISTINCT
    ?subjectmmm2
    ?mmm2generator
    (MAX(?timemmm2) as ?timemmm2)
  WHERE {
    INCLUDE %mmm2Continuous0
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?timemmm2 <= ?selectedDate))
    ?mmmmmcontext2 <...> ?subjectmmm2 .
    ?mmmmmcontext2 <...> ?mmm2generator .
    ?mmmmmcontext2 <...> ?timemmm2 .
  }
  GROUP BY
    ?subjectmmm2
    ?mmm2generator
} as %mmm2Continuous1

WITH {
  SELECT DISTINCT
    ?timemmm2
    ?a
    ?mmm2generator
    ?mmm2
    ?title
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %mmm2Continuous1
    ?mmm2 <...> ?subjectmmm2 .
    ?mmm2 <...> ?mmm2generator .
    ?mmm2 <...> ?timemmm2 .
    GRAPH ?mmm2 { ?a <...> ?title } .
  }
} as %mmm2Continuous

WITH {
  SELECT
    *
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %heuristic
    FILTER (bound(?title))
    FILTER (bound(?a))
  }
} as %mmm3Common

WITH {
  SELECT DISTINCT
    ?a
    ?title
    (MAX(?timemmm3) as ?timemmm3)
  WHERE {
    INCLUDE %mmm3Common
    GRAPH ?mmm3 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?mmm3type != <...>))
    FILTER ((?timemmm3 <= ?selectedDate))
    ?mmm3 <...> ?timemmm3 .
    ?mmm3 a ?mmm3type .
    ?mmm3 <...> ?mmm3source .
  }
  GROUP BY
    ?title
    ?a
    ?mmm3source
} as %mmm3

WITH {
  SELECT DISTINCT
    ?subjectmmm3
    ?mmm3generator
  WHERE {
    INCLUDE %mmm3Common
    GRAPH ?mmm3 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    ?mmm3 a <...> .
    ?mmm3 <...> ?mmm3generator .
    ?mmm3 <...> ?subjectmmm3 .
  }
} as %mmm3Continuous0

WITH {
  SELECT DISTINCT
    ?subjectmmm3
    ?mmm3generator
    (MAX(?timemmm3) as ?timemmm3)
  WHERE {
    INCLUDE %mmm3Continuous0
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?timemmm3 <= ?selectedDate))
    ?mmmmmcontext2 <...> ?subjectmmm3 .
    ?mmmmmcontext2 <...> ?mmm3generator .
    ?mmmmmcontext2 <...> ?timemmm3 .
  }
  GROUP BY
    ?subjectmmm3
    ?mmm3generator
} as %mmm3Continuous1

WITH {
  SELECT DISTINCT
    ?timemmm3
    ?mmm3
    ?mmm3generator
    ?a
    ?title
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %mmm3Continuous1
    ?mmm3 <...> ?subjectmmm3 .
    ?mmm3 <...> ?mmm3generator .
    ?mmm3 <...> ?timemmm3 .
    GRAPH ?mmm3 { ?a <...> ?title } .
  }
} as %mmm3Continuous

WITH {
  SELECT
    *
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %heuristic
    FILTER (bound(?title))
    FILTER (bound(?a))
  }
} as %mmm4Common

WITH {
  SELECT DISTINCT
    ?a
    ?title
    (MAX(?timemmm4) as ?timemmm4)
  WHERE {
    INCLUDE %mmm4Common
    GRAPH ?mmm4 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?mmm4type != <...>))
    FILTER ((?timemmm4 <= ?selectedDate))
    ?mmm4 <...> ?timemmm4 .
    ?mmm4 a ?mmm4type .
    ?mmm4 <...> ?mmm4source .
  }
  GROUP BY
    ?title
    ?a
    ?mmm4source
} as %mmm4

WITH {
  SELECT DISTINCT
    ?subjectmmm4
    ?mmm4generator
  WHERE {
    INCLUDE %mmm4Common
    GRAPH ?mmm4 { ?a <...> ?title } .
    hint:SubQuery hint:optimizer "None" .
    ?mmm4 a <...> .
    ?mmm4 <...> ?mmm4generator .
    ?mmm4 <...> ?subjectmmm4 .
  }
} as %mmm4Continuous0

WITH {
  SELECT DISTINCT
    ?subjectmmm4
    ?mmm4generator
    (MAX(?timemmm4) as ?timemmm4)
  WHERE {
    INCLUDE %mmm4Continuous0
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?timemmm4 <= ?selectedDate))
    ?mmmmmcontext2 <...> ?subjectmmm4 .
    ?mmmmmcontext2 <...> ?mmm4generator .
    ?mmmmmcontext2 <...> ?timemmm4 .
  }
  GROUP BY
    ?subjectmmm4
    ?mmm4generator
} as %mmm4Continuous1

WITH {
  SELECT DISTINCT
    ?mmm4generator
    ?a
    ?timemmm4
    ?title
    ?mmm4
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %mmm4Continuous1
    ?mmm4 <...> ?subjectmmm4 .
    ?mmm4 <...> ?mmm4generator .
    ?mmm4 <...> ?timemmm4 .
    GRAPH ?mmm4 { ?a <...> ?title } .
  }
} as %mmm4Continuous

WITH {
  SELECT
    *
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %heuristic
    FILTER (bound(?a))
    FILTER (bound(?alt))
  }
} as %mmm1Common

WITH {
  SELECT DISTINCT
    ?alt
    (MAX(?timemmm1) as ?timemmm1)
    ?a
  WHERE {
    INCLUDE %mmm1Common
    GRAPH ?mmm1 { ?a <...> ?alt } .
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?mmm1type != <...>))
    FILTER ((?timemmm1 <= ?selectedDate))
    ?mmm1 <...> ?timemmm1 .
    ?mmm1 a ?mmm1type .
    ?mmm1 <...> ?mmm1source .
  }
  GROUP BY
    ?a
    ?alt
    ?mmm1source
} as %mmm1

WITH {
  SELECT DISTINCT
    ?subjectmmm1
    ?mmm1generator
  WHERE {
    INCLUDE %mmm1Common
    GRAPH ?mmm1 { ?a <...> ?alt } .
    hint:SubQuery hint:optimizer "None" .
    ?mmm1 a <...> .
    ?mmm1 <...> ?mmm1generator .
    ?mmm1 <...> ?subjectmmm1 .
  }
} as %mmm1Continuous0

WITH {
  SELECT DISTINCT
    ?subjectmmm1
    ?mmm1generator
    (MAX(?timemmm1) as ?timemmm1)
  WHERE {
    INCLUDE %mmm1Continuous0
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?timemmm1 <= ?selectedDate))
    ?mmmmmcontext2 <...> ?subjectmmm1 .
    ?mmmmmcontext2 <...> ?mmm1generator .
    ?mmmmmcontext2 <...> ?timemmm1 .
  }
  GROUP BY
    ?subjectmmm1
    ?mmm1generator
} as %mmm1Continuous1

WITH {
  SELECT DISTINCT
    ?alt
    ?mmm1
    ?a
    ?mmm1generator
    ?timemmm1
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %mmm1Continuous1
    ?mmm1 <...> ?subjectmmm1 .
    ?mmm1 <...> ?mmm1generator .
    ?mmm1 <...> ?timemmm1 .
    GRAPH ?mmm1 { ?a <...> ?alt } .
  }
} as %mmm1Continuous

WITH {
  SELECT
    *
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %heuristic
    FILTER (bound(?a))
  }
} as %mmm0Common

WITH {
  SELECT DISTINCT
    ?a
    (MAX(?timemmm0) as ?timemmm0)
  WHERE {
    INCLUDE %mmm0Common
    GRAPH ?mmm0 { ?a <...> <...> } .
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?mmm0type != <...>))
    FILTER ((?timemmm0 <= ?selectedDate))
    ?mmm0 <...> ?timemmm0 .
    ?mmm0 a ?mmm0type .
    ?mmm0 <...> ?mmm0source .
  }
  GROUP BY
    ?a
    ?mmm0source
} as %mmm0

WITH {
  SELECT DISTINCT
    ?subjectmmm0
    ?mmm0generator
  WHERE {
    INCLUDE %mmm0Common
    GRAPH ?mmm0 { ?a <...> <...> } .
    hint:SubQuery hint:optimizer "None" .
    ?mmm0 a <...> .
    ?mmm0 <...> ?mmm0generator .
    ?mmm0 <...> ?subjectmmm0 .
  }
} as %mmm0Continuous0

WITH {
  SELECT DISTINCT
    ?subjectmmm0
    ?mmm0generator
    (MAX(?timemmm0) as ?timemmm0)
  WHERE {
    INCLUDE %mmm0Continuous0
    hint:SubQuery hint:optimizer "None" .
    FILTER ((?timemmm0 <= ?selectedDate))
    ?mmmmmcontext2 <...> ?subjectmmm0 .
    ?mmmmmcontext2 <...> ?mmm0generator .
    ?mmmmmcontext2 <...> ?timemmm0 .
  }
  GROUP BY
    ?subjectmmm0
    ?mmm0generator
} as %mmm0Continuous1

WITH {
  SELECT DISTINCT
    ?mmm0generator
    ?mmm0
    ?a
    ?timemmm0
  WHERE {
    hint:SubQuery hint:optimizer "None" .
    INCLUDE %mmm0Continuous1
    ?mmm0 <...> ?subjectmmm0 .
    ?mmm0 <...> ?mmm0generator .
    ?mmm0 <...> ?timemmm0 .
    GRAPH ?mmm0 { ?a <...> <...> } .
  }
} as %mmm0Continuous

WHERE {
  hint:SubQuery hint:optimizer "None" .
  INCLUDE %heuristic
  {
    {
      INCLUDE %mmm2
      GRAPH ?mmm2 { ?a <...> ?title } .
    } UNION {
      INCLUDE %mmm2Continuous
    }
    ?mmm2 <...> ?timemmm2 .
    FILTER ((?mmm2type != <...>))
    ?mmm2 a ?mmm2type .
  } UNION {
    {
      INCLUDE %mmm3
      GRAPH ?mmm3 { ?a <...> ?title } .
    } UNION {
      INCLUDE %mmm3Continuous
    }
    ?mmm3 <...> ?timemmm3 .
    FILTER ((?mmm3type != <...>))
    ?mmm3 a ?mmm3type .
  } UNION {
    {
      INCLUDE %mmm4
      GRAPH ?mmm4 { ?a <...> ?title } .
    } UNION {
      INCLUDE %mmm4Continuous
    }
    ?mmm4 <...> ?timemmm4 .
    FILTER ((?mmm4type != <...>))
    ?mmm4 a ?mmm4type .
  }
  {
    INCLUDE %mmm1
    GRAPH ?mmm1 { ?a <...> ?alt } .
  } UNION {
    INCLUDE %mmm1Continuous
  }
  ?mmm1 <...> ?timemmm1 .
  FILTER ((?mmm1type != <...>))
  ?mmm1 a ?mmm1type .
  {
    INCLUDE %mmm0
    GRAPH ?mmm0 { ?a <...> <...> } .
  } UNION {
    INCLUDE %mmm0Continuous
  }
  ?mmm0 <...> ?timemmm0 .
  FILTER ((?mmm0type != <...>))
  ?mmm0 a ?mmm0type .
}
Listing B.9: Full Translation of a Temporal SPARQL Query on the Metaservice Data Model


B.3 Metaservice Module Descriptor

<!-- ... -->
  <!-- Module name and version are replaced through Maven filtering -->
  <groupId>${pom.groupId}</groupId>
  <artifactId>${pom.artifactId}</artifactId>
  <version>${pom.version}</version>

  <provider
      id="debianPackageProvider"
      model="org.metaservice.core.deb.parser.ast.Package"
      class="org.metaservice.core.deb.DebianPackageProvider"
      >
    <!-- ... -->

  <parser
      id="parboiledParser"
      type="deb"
      model="org.metaservice.core.deb.parser.ast.Package"
      class="org.metaservice.core.deb.ParboiledDebParser"/>

  <!-- ... -->

  <!-- ... -->
      startUri="http://ftp.debian.org/debian/dists/"
      crawler="debianarchive"
      archiveClass="org.metaservice.core.deb.util.DebianGitArchive"
      active="false"
      />

  <!-- ... -->
  ubuntu
  ^([^/]+)/.*$
  <!-- ... -->
  </follow>
  <!-- ... -->
  </follow>
  </follow>
  </follow>
  </follow>
  </crawler>

  <template
      name="debian-release.hbs"
      appliesTo="http://metaservice.org/ns/deb#Release"/>
  <template
      name="debian-package.hbs"
      appliesTo="http://metaservice.org/ns/deb#Package"/>
  <template
      name="debian-project.hbs"
      appliesTo="http://metaservice.org/ns/deb#Project"/>

  <ontology
      name="deb.rdf" apply="true" distribute="true"/>
<!-- ... -->
Listing B.10: Metaservice Module Descriptor
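The ${pom.groupId}, ${pom.artifactId}, and ${pom.version} placeholders in Listing B.10 are, as the comment in the descriptor notes, substituted at build time by Maven resource filtering. A minimal sketch of a build configuration that enables such substitution, assuming the descriptor is kept under src/main/resources (the actual build setup of the project may differ):

<!-- pom.xml fragment: filter resources so that ${...} placeholders
     are replaced with values from the project model at build time -->
<build>
  <resources>
    <resource>
      <directory>src/main/resources</directory>
      <filtering>true</filtering>
    </resource>
  </resources>
</build>

With filtering enabled, Maven rewrites the descriptor during the process-resources phase, so the packaged module always carries the coordinates of the artifact it was built from.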


Appendix C Screenshots

Figure C.1: Debian Package Template - Web-Frontend

Figure C.2: Generic Package Template - Web-Frontend

Figure C.3: Management Shell


Figure C.4: WordPress Security Alert on Android
