
Managing and Using Provenance in the Semantic Web

Renata Queiroz Dividino

Institute for Web Science and Technologies University of Koblenz-Landau

June 2017

Vom Promotionsausschuss des Fachbereichs 4: Informatik der Universität Koblenz-Landau zur Verleihung des akademischen Grades

Doktor der Naturwissenschaften (Dr. rer. nat.)

genehmigte Dissertation.

Datum der wissenschaftlichen Aussprache: 14. Dezember 2016
Vorsitz des Promotionsausschusses: Prof. Dr. Hannes Frey
Gutachter: Prof. Dr. Steffen Staab
Gutachter: Prof. Dr. York Sure-Vetter
Gutachter: Prof. Dr. Luc Moreau

Abstract

The Web contains some extremely valuable information; however, poor quality, inaccurate, irrelevant, or fraudulent information can often be found as well. With the increasing amount of data available, it is becoming more and more difficult to distinguish truth from speculation on the Web. One of the most important criteria used to evaluate data credibility, if not the most important, is the information source, i.e., the data origin. Trust in the information source is a valuable currency users have to evaluate such data. Data popularity, recency (or the time of validity), reliability, or vagueness ascribed to the data may also help users to judge the validity and appropriateness of information sources. We call this knowledge derived from the data the provenance of the data.

Provenance is an important aspect of the Web. It is essential in identifying the suitability, veracity, and reliability of information, and in deciding whether information is to be trusted, reused, or even integrated with other information sources. Therefore, models and frameworks for representing, managing, and using provenance in the realm of Semantic Web technologies and applications are critically required. This thesis highlights the benefits of the use of provenance in different Web applications and scenarios. In particular, it presents management frameworks for querying and reasoning in the Semantic Web with provenance, and presents a collection of Semantic Web tools that explore provenance information when ranking and updating caches of Web data.

To begin, this thesis discusses a highly flexible and generic approach to the treatment of provenance when querying RDF datasets. The approach re-uses existing RDF modeling possibilities in order to represent provenance. It extends SPARQL query processing in such a way that, given a SPARQL query for data, one may request provenance without modifying the query itself. The use of provenance within SPARQL queries helps users to understand how RDF facts are derived, i.e., it describes the data and the operations used to produce the derived facts.

Turning to more expressive Semantic Web data models, an optimized algorithm for reasoning over and debugging OWL ontologies with provenance is presented. Typical reasoning tasks over an expressive Description Logic (e.g., using tableau methods to perform consistency checking, instance checking, satisfiability checking, and so on) are in the worst case doubly exponential, and in practice are often likewise very expensive. With the algorithm described in this thesis, however, one can efficiently reason in OWL ontologies with provenance, i.e., provenance is efficiently combined and propagated within the reasoning process. Users can use the derived provenance information to judge the reliability of inferences and to find errors in the ontology.

Next, this thesis tackles the problem of providing Web users with the right content at the right time. The challenge is to efficiently rank a stream of messages based on user preferences. Provenance is used to represent preferences, i.e., the user defines his preferences over the messages' popularity, recency, etc. This information is then aggregated to obtain a joint ranking. The aggregation problem is related to the problem of preference aggregation in Social Choice Theory. The traditional problem formulation of preference aggregation assumes a fixed set of preference orders and a fixed set of domain elements (e.g., messages). This work, however, investigates how an aggregated preference order has to be updated when the domain is dynamic, i.e., the aggregation approach ranks messages 'on the fly' as they pass through the system. Consequently, this thesis presents computational approaches for online preference aggregation that handle the dynamic setting more efficiently than standard ones.

Lastly, this thesis addresses the scenario of caching data from the Linked Open Data (LOD) cloud. Data on the LOD cloud changes frequently, and applications relying on that data (by pre-fetching data from the Web and storing local copies of it in a cache) need to continually update their caches. In order to make the best use of the resources available (e.g., network bandwidth for fetching data, and computation time), it is vital to choose a good strategy to know when to fetch data from which data source. One strategy to cope with data changes is to check provenance. Provenance information delivered by LOD sources can denote when a resource on the Web was last changed. Linked Data applications can benefit from this piece of information, since simply checking it may help users decide which sources need to be updated. For this purpose, this work describes an investigation of the availability and reliability of provenance information in Linked Data sources. Another strategy for capturing data changes is to exploit provenance in a time-dependent function. Such a function should measure the frequency of the changes of LOD sources. This work therefore describes an approach to the analysis of data dynamics, i.e., the analysis of the change behavior of Linked Data sources over time, followed by an investigation of different update scheduling strategies to keep local LOD caches up-to-date.

This thesis aims to prove the importance and benefits of the use of provenance in different Web applications and scenarios. The flexibility of the approaches presented, combined with their high scalability, makes this thesis a possible building block for the proof layer of the Semantic Web layer cake, the layer of provenance knowledge.

Zusammenfassung

Das Internet enthält äußerst nutzbringende Informationen, allerdings stößt der Nutzer häufig auch auf unwichtige Daten oder solche, die ungenau, von geringer Qualität oder sogar mit betrügerischer Absicht verfälscht worden sind. Mit der stetig wachsenden Menge an Daten, die im Internet verfügbar ist, wird es zunehmend schwieriger, zwischen Wahrheit und Spekulation zu unterscheiden. Eines der wichtigsten, wenn nicht das wichtigste Kriterium zur Analyse der Glaubwürdigkeit der Daten ist die Informationsquelle oder, in anderen Worten, die Herkunft der Daten. Das Vertrauen in die Informationsquelle ist für den Nutzer entscheidend zur Beurteilung der Information. Darüber hinaus können ihm die Popularität, Aktualität und die Zuverlässigkeit der Daten als Anhaltspunkte für die Eignung und Aussagekraft der Informationsquelle dienen. Die oben genannten Eigenschaften einer Datenquelle werden als “Provenance” bezeichnet.

Die Betrachtung der “Provenance” der Information gewinnt zunehmend an Bedeutung im Semantic Web. “Provenance” ist entscheidend für die Bestimmung der Eignung, des Wahrheitsgehaltes und der Zuverlässigkeit der Information und damit für die Beurteilung, ob der Information vertraut werden, sie weiterverbreitet oder sogar mit anderen Informationsquellen verknüpft werden kann. Demzufolge ist die Entwicklung von Modellen und Strukturen, die computergestützte Repräsentation, Verwaltung und Nutzung der “Provenance” ermöglichen, sehr wichtig.

Diese Doktorarbeit untersucht und hebt die Vorteile der Nutzung der “Provenance” der Daten in verschiedenen Webanwendungen und Szenarien hervor. Insbesondere werden Rahmenkonzepte zum Abfragen und zum Schlussfolgern mit “Provenance” im Semantic Web und eine Reihe von Semantic-Web-Tools präsentiert, die die “Provenance” von Daten während der Klassifizierung und der Aktualisierung des Webdatencaches untersuchen.

Zunächst wird ein flexibler Ansatz zur Verarbeitung von “Provenance” bei dem Abfragen von RDF Datensätzen diskutiert. Dieser Ansatz nutzt existierende RDF Modellierungsmöglichkeiten, um “Provenance” darzustellen. Er erweitert die SPARQL Abfrageverarbeitung derart, dass es möglich ist, “Provenance” zu erfragen, ohne die Abfrage zu verändern. Die Nutzung von “Provenance” in SPARQL Abfragen hilft dem Nutzer zu verstehen, wie RDF Fakten hergeleitet wurden, d.h. sie beschreibt die Daten und Verarbeitungsschritte, die benutzt wurden, um die Fakten herzuleiten.

Im weiteren Verlauf beschäftigt sich diese Arbeit mit einer ausdrucksstärkeren Modellierungssprache und präsentiert einen optimierten Algorithmus für das Schlussfolgern und Fehlersuchen mit Hilfe von “Provenance” in OWL Ontologien. Typische Schlussfolgerungsaufgaben mit einer ausdrucksstarken Beschreibungslogik (z.B. Tableau Methoden nutzend, um Konsistenzprüfungen durchzuführen, Instanzenprüfung, Erfüllbarkeitsprüfung, etc.) sind im ungünstigsten Fall doppelt exponentiell und in der Praxis oft außerordentlich teuer. Der in dieser Dissertation erarbeitete und beschriebene Algorithmus erlaubt effiziente Schlussfolgerungsunterstützung mit “Provenance” in OWL Ontologien, d.h. “Provenance” wird im Schlussfolgerungsprozess effizient kombiniert und verbreitet. Der Nutzer kann die hergeleiteten “Provenance” Informationen verwenden, um die Zuverlässigkeit der Inferenzen zu beurteilen und Fehler in der Ontologie zu finden.

In einem weiteren Schritt wird das Problem, den Internetnutzer mit dem richtigen Inhalt zum richtigen Zeitpunkt zu versorgen, untersucht. Die Herausforderung ist hierbei, den Nutzer laufend und effizient mit den wichtigsten Nachrichten im Einklang mit dessen Vorlieben zu versorgen. “Provenance” wird genutzt, um Vorlieben darzustellen, d.h. der Nutzer definiert seine Vorlieben über die Popularität und Neuigkeit von Nachrichten. Diese Daten werden dann aggregiert und klassifiziert. Diese Aggregationsproblemstellung basiert auf Präferenzaggregation in der “Social Choice Theory”. Infolgedessen präsentiert diese Dissertation Ansätze für Online-Präferenzaggregation, die das dynamische Setting effizienter als die Standardansätze handhaben.

Schließlich beschäftigt sich diese Arbeit mit dem Szenario des Cachens von Daten aus der Linked Open Data (LOD) Cloud. Die Daten in der LOD Cloud ändern sich häufig und Anwendungen, die von diesen Daten abhängen (beim vorherigen Abrufen von Daten aus dem Internet und dem lokalen Speichern von Kopien im Cache), müssen laufend ihre Caches aktualisieren. Eine gute Strategie zu wählen, die bestimmt, wann Daten von welchen Datenquellen vorher abgerufen werden, ist wesentlich, um die Ressourcen (z.B. Netzwerkbandbreite und Berechnungskapazitäten) bestmöglich zu nutzen. Eine Strategie ist die Überprüfung der “Provenance” Information, die LOD Quellen zur Verfügung stellen. “Provenance” kann anzeigen, wann die Ressource im Internet zuletzt geändert worden ist. Linked Data Anwendungen können von dieser Information profitieren, da eine einfache Überprüfung der “Provenance” die Entscheidungsfindung des Nutzers, welche Datenquellen aktualisiert werden müssen, unterstützen kann. Zu diesem Zweck enthält diese Dissertation eine Untersuchung der Verfügbarkeit und Zuverlässigkeit von “Provenance” Informationen in den Linked Data Quellen. Eine weitere erarbeitete Strategie ist, die Datenveränderungen mit einem zeitabhängigen Maß, das die Frequenz der Änderung der LOD Quellen erfasst, zu messen. Deshalb beschreibt diese Arbeit einen Ansatz der Analyse von Datendynamik, d.h. der Analyse des Änderungsverhaltens von Linked Data Quellen über die Zeit. Auf dieser Analyse baut die Untersuchung von unterschiedlichen Strategien zur Aktualisierungsplanung auf, um den lokalen LOD Cache auf einem aktuellen Stand zu halten.

Diese Dissertation hat das Ziel, die Wichtigkeit und den Vorteil der Nutzung von “Provenance” in unterschiedlichen Webanwendungen und Szenarien zu belegen. Die Flexibilität der erarbeiteten Ansätze in Kombination mit ihrer hohen Skalierbarkeit macht diese Dissertation zu einem möglichen Baustein für eine Semantic Web Nachweisebene – die Ebene des “Provenance”-Wissens.

Acknowledgments

I would like to express my deepest gratitude to my supervisor, Professor Steffen Staab, whose love of science has been an inspiration to me. Professor Staab's expertise, understanding, academic guidance, and support made it possible for me to work on a topic that is of great interest to me. I would like to thank York Sure-Vetter and Daniel Oberle, who awakened my interest in science even before I had started my PhD program.

To my co-authors in various scientific publications, Sergej Sizov, Simon Schenk, Gerd Gröner, Matthias Thimm, Stefan Scheglmann, Thomas Gottron, Ansgar Scherp and Steffen Staab: you have provided me with great insights into the research topics covered in this dissertation and I am grateful for all the long, productive and friendly discussions we had. Without your cooperation and valuable feedback, this dissertation would not have been possible.

To Thomas Gottron, Thomas Franz, Christoph Ringelstein, Julia Perl, and Gerd Gröner, who encouraged me throughout my work: thanks for your support in the different stages of my research. I thank Olaf Görlitz, Fernando Silva Parreiras, Antje Schultz, Cristina Sarasua, Tina Walter, Jerome Küniges, Christoph Kling, Kerstin Falkowski, Sabine Hülstrunk and Silke Werger. You are incredible colleagues, and I am grateful for the great moments we have spent together. My gratitude goes to all those who talked and listened to me over the years at conferences, workshops and seminars.

Lastly, I'd like to thank my parents, my husband, and my children for supporting and encouraging me over many years. This dissertation is dedicated to you.

Contents

I. Fundamentals

1. Introduction
   1.1. Motivation
   1.2. Research Questions
   1.3. Thesis Outline
   1.4. Publications and Exploitation

2. Foundations
   2.1. Provenance Foundations
      2.1.1. Workflow and Data Provenance
      2.1.2. Why, Where, and How Provenance
      2.1.3. Provenance Semirings
   2.2. Semantic Web Foundations
      2.2.1. Resource Description Framework
      2.2.2. Querying RDF: The SPARQL Query Language
      2.2.3. The Web Ontology Language
      2.2.4. Linked Data: Principles and Best Practices
   2.3. The Use of Provenance in Semantic Web Applications
      2.3.1. Recent Work on Provenance in the Semantic Web
      2.3.2. An Introduction to the PROV Standard

II. Provenance Management in the Semantic Web

3. Querying RDF Datasets with Provenance
   3.1. Introduction
   3.2. Motivation Scenario
   3.3. Syntax and Semantics for RDF with Provenance
      3.3.1. RDF Design Choices
      3.3.2. Syntax for RDF with Provenance
      3.3.3. Semantics for RDF with Provenance
      3.3.4. Mapping between RDF and RDF+
   3.4. Syntax and Semantics for SPARQL for Querying RDF Datasets with Provenance
      3.4.1. SPARQL Syntax Revisited
      3.4.2. SPARQL Semantics Revisited
      3.4.3. SPARQL Query Forms
      3.4.4. Advanced Query Forms
   3.5. Equivalences for SPARQL Expressions with Provenance
   3.6. Tasks and Benefits
   3.7. Complexity
   3.8. Evaluation
      3.8.1. Methodology
      3.8.2. Data
      3.8.3. Discussion and Results
   3.9. Related Work
   3.10. Findings and Research Contribution

4. Reasoning and Debugging Evolving OWL Ontologies with Provenance
   4.1. Introduction
   4.2. Motivation Scenario
   4.3. Foundations: Pinpointing
   4.4. Syntax and Semantics for OWL with Provenance
      4.4.1. Syntax for OWL with Provenance
      4.4.2. Semantics of Provenance
   4.5. Using Provenance to Debug Changing Ontologies
      4.5.1. Naïve Approach
      4.5.2. Debugging with Provenance - An Optimized Algorithm
   4.6. Complexity
   4.7. Evaluation
      4.7.1. Data
      4.7.2. Methodology
      4.7.3. Discussion and Results
   4.8. Related Work
   4.9. Findings and Research Contribution

III. Using Provenance in Semantic Web Applications

5. An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation
   5.1. Introduction
   5.2. Motivation Scenario
   5.3. Foundations: Preference Aggregation
   5.4. Preference Aggregation in a Dynamic Setting
      5.4.1. Online Preference Aggregator
      5.4.2. State-based Online Preference Aggregator
   5.5. Aggregators Property Analysis
      5.5.1. Independence of Irrelevant Alternatives
      5.5.2. Monotonicity
   5.6. Online Ranking Algorithms
      5.6.1. Naïve Algorithm
      5.6.2. Optimized Algorithm
   5.7. Complexity
   5.8. Evaluation
      5.8.1. Data
      5.8.2. Methodology
      5.8.3. Discussion and Results
   5.9. Related Work
   5.10. Findings and Research Contribution

6. Managing Data Changes in Linked Open Data Sources
   6.1. Introduction
   6.2. Motivation Scenario
   6.3. An Investigation of Provenance Information for Detecting Changes of Linked Open Data Sources
      6.3.1. Motivation Scenario
      6.3.2. Foundations: The HTTP Header
      6.3.3. Evaluation of the Conformance of the Last-Modified HTTP Header Field
      6.3.4. Related Work
      6.3.5. Findings and Research Contributions
   6.4. Capturing the Dynamics of Linked Open Data Sources
      6.4.1. Motivation Scenario
      6.4.2. Foundations: Change Metrics
      6.4.3. Dynamics Analysis of Linked Open Data Sources
      6.4.4. Efficiently Keeping Local Linked Open Data Caches Up-To-Date
      6.4.5. Evaluation
      6.4.6. Related Work
      6.4.7. Findings and Research Contribution

7. Conclusion and Further Directions
   7.1. Findings
   7.2. Future Directions

Bibliography

Curriculum Vitae: Renata Dividino

List of Figures

1.1. When querying the Web of data we are confronted with a large number of diverse data sources of varying quality. Provenance helps in identifying, selecting, and reasoning with high quality data.
1.2. This thesis shows the benefits of using provenance information in different Semantic Web applications and scenarios. This figure illustrates the Semantic Web tools presented and the chapters where they are discussed.

2.1. Diagram of the RDF graph presented in Example 2.2.1 as a directed labeled graph.
2.2. Diagram of the RDF dataset presented in Table 2.8 as a directed labeled graph.
2.3. Provenance and the Semantic Web Layer Cake. Taken from the presentation "Provenance Analysis and RDF Query Processing: W3C PROV for Data Quality and Trust" by Satya Sahoo and Praveen Rao, October 2015.

4.1. Provenance order of the knowledge sources.
4.2. Provenance order of the knowledge sources.
4.3. Selection of Axioms by the Optimized Algorithm.
4.4. Justifications with incomparable provenance.
4.5. Algorithm with optimization.
4.6. Algorithm without optimization.
4.7. Relations between the set of axioms of the ontology (O), the axioms in all pinpoints (P), and the axioms retrieved by the algorithm (R).
4.8. Average time to compute provenance for an inconsistency of an ontology (on a logarithmic scale). The computation of provenance for ontologies 5, 10, and 13 scales orders of magnitude better than a naïve implementation and for that reason they are not displayed on the graph.

5.1. Commuting diagram for online preference aggregation.
5.2. Borda and Plurality Aggregation Methods: Execution time comparison between the standard and the online approaches.
5.3. The experiment confirms that the execution time of Borda aggregation decreases from polynomial to linearithmic time, and the execution time of plurality aggregation is linear.

6.1. Ratio of the Linked Data resources (in percent) that provide the field Last-Modified in their header (in green).
6.2. Ratio of the resources (in percent) providing a Last-Modified value that is correct (in green), or incorrect or invalid (in red).
6.3. The dynamics of a dataset is obtained by integration of its change rate over time.
6.4. Dataset dynamics defined as an aggregation of absolute, infinitesimal ∆-metric changes.
6.5. Dynamics function with decay to strengthen the recent changes.
6.6. Quality Outcomes for the Single Step Setup.
6.7. Quality Outcomes for the Single Step Setup at Low Bandwidth Level (5%).
6.8. Quality Outcomes for the Iterative Progression Setup at Low Bandwidth Level (5%).
6.9. Quality Outcomes for the Iterative Progression Setup at Mid-Level Bandwidth (15%).
6.10. Quality Outcomes for the Iterative Progression Setup at High-Level Bandwidth (40%).

List of Tables

2.1. Our example (annotated) database M: Movie Theaters and Movies.
2.2. Our example (annotated) database G: Movies and Movie Genres.
2.3. Example of Why-Provenance.
2.4. Example of Where-Provenance.
2.5. Provenance Polynomials.
2.6. Bag semantics example: we evaluate the provenance polynomials given in Table 2.5 in (N, 0, 1, +, ∗), with b1, . . . , b7 = 1 and c1, . . . , c4 = 1, which represent the multiplicities of the tuples in the multiset.
2.7. The following example presents some of the RDF data about movies from our toy scenario (# denotes a comment line in Turtle).
2.8. Example of an RDF Named Graph containing some of the RDF data about movies from our toy scenario.

3.1. Random query sequence experiments: average processing time (ms).
3.2. Processing time of query evaluation with and without provenance for individual queries and the MK29 dataset. Processing times are average values of 10 runs each.
3.3. Processing time of query evaluation with and without provenance for individual queries and the MK400 dataset. Processing times are average values of 10 runs each.

4.1. Ontology axioms describing our scenario and their provenance.
4.2. Example of multiple provenance assignments to the same axiom.
4.3. Correspondences between source and degrees of trustworthiness.
4.4. Ontologies used in the experiments.
4.5. Average time to compute provenance for an inconsistency of an ontology (ms).
4.6. Ontologies used in the experiments.
4.7. Absolute times needed for the experiments (ms).
4.8. Average number of relevant axioms for deriving provenance vs. the average module size.

5.1. List of messages about movies posted in a social media application by friends, workmates, and family members of our sample user. This list is used in our scenario presented in Section 5.2.
5.2. Extended list of messages about movies that have been posted by our user's friends in a social media application.

6.1. Scenario: Example dataset at time t1.
6.2. Scenario: Example dataset at time t2.
6.3. Scenario: Example dataset at time t3.

Part I.

Fundamentals


1. Introduction

1.1. Motivation

The Web is a global information space and has become in recent years one of the most important communication platforms of our daily life. Today's digital economy is developing rapidly, and the information being spread on the Web has a great impact on the way we live and do business. Indeed, the Web contains some extremely valuable information; however, poor quality, inaccurate, irrelevant, or fraudulent information can often be found as well. This is due to its liberal nature: anyone is allowed, without any control, to generate and spread information.

In particular, popular online social sites such as Twitter and Facebook have become important news and marketing channels as people and entities are willing to take part in opinion-building processes. In such channels, misinformation is rapidly spread, intentionally or unintentionally, which can lead to undesirable effects such as discrediting and conflicting information or supporting false conclusions. In 2014, Farida Vis expressed her concerns about misinformation being spread on the Web on theguardian.com¹ and described two real cases of it: (1) in the days following the 2013 Boston Marathon bombings, information posted on Reddit led to the New York Post printing images of two suspects on its front page who ultimately had nothing to do with the bombings; (2) just following the disappearance of Malaysia Airlines flight MH370 in March 2014, NBC News highlighted various false reports having been spread on social media alleging that the plane had made a safe landing. Due to these and many other cases where inaccurate information has been spread over the Web, information seekers are experiencing difficulties in filtering correct information from contradictory and/or questionable sources.

With the increasing amount of data available on the Web, it is becoming more and more difficult to distinguish truth from speculation. One of the most valuable tools in verifying data authenticity and credibility is knowledge of its information source, that is, the data origin. Trust in the information source is a valuable currency users have to assess the data [Adler and de Alfaro, 2007a, Flöck and Acosta, 2014]. Furthermore, data popularity, the recency (or the time of validity), the credibility, or the vagueness ascribed to the data may also help users to judge the validity and appropriateness of information sources. We call this knowledge derived from the data the provenance of the data.

Provenance is an important piece of information used to identify the validity of information, and to decide whether information is to be trusted, reused, or even integrated with other sources. Provenance information has been used for many years in database and scientific workflow management systems [Buneman et al., 2006, Simmhan et al., 2005], and more recently to assess the value of the data on the Web of Data [Cappiello, 2015, Moreau, 2010].

¹ Article: To tackle the spread of misinformation online we must first understand it, http://www.theguardian.com/commentisfree/2014/apr/24/tackle-spread-misinformation-online


Figure 1.1.: When querying the Web of data we are confronted with a large number of diverse data sources of varying quality. Provenance helps in identifying, selecting, and reasoning with high quality data.

The Web of Data, as an interlinked data space, provides an environment to exploit the Web as a platform for data and information integration as well as for searching, querying, and reasoning. Semantic Web technologies promote common data formats and exchange protocols on the Web such as the Resource Description Framework (RDF), the SPARQL Query Language (SPARQL), and the Web Ontology Language (OWL). Typically, many data sources on the Web provide points of access for RDF data through data portals (i.e., SPARQL query endpoints) that allow for selecting and re-using data (see Figure 1.1). RDF plays the role of a common model, as data can easily be integrated and combined with other data published on the Web, which is basically done by establishing links between Web resources. In such an open and interlinked data space, where querying goes beyond simple keyword searches and reasoning beyond data mining, provenance seems to be not only an important but an essential piece of information to establish trust between data providers and data consumers.

Provenance is an important aspect of the Web, and therefore models and frameworks for representing, managing, and using provenance in the realm of Semantic Web technologies and applications are critically required. This thesis will examine the benefits of using provenance in different Web applications and scenarios. In particular, it presents management frameworks for querying and reasoning with provenance in the Semantic Web, and presents a collection of Semantic Web tools that explore provenance information when ranking and updating caches of Web data.

Provenance Management in the Semantic Web: When querying and reasoning with data on the Web, we are faced with highly variable quality of information. With the increasing amount of data available on the Web and more sophisticated processing through query and reasoning engines, users encounter challenging questions about the provenance of the data, such as:


• Where is this data from?
• Who provided the data?
• When was this data provided?
• Was the provider certain about the truth of this data?
• Was the data believed by others?

Therefore, when querying and reasoning in the Semantic Web, techniques to extract the relevant information from Web data should include ways to investigate the value of information. Provenance provides knowledge that can be used to quantify this value. For instance, the use of provenance within SPARQL queries may help users to understand how an RDF fact has been derived, that is, provenance describes the data and the operations used to produce the derived fact, as illustrated in Figure 1.1. Analogously, the use of provenance within OWL reasoning may help users to understand how an OWL axiom is entailed. Furthermore, provenance may be used to tackle the problem of debugging OWL ontologies, as it allows users to exploit the derived provenance information not only to judge the reliability of inferences but also to find errors in the ontology. Querying for and reasoning with data and provenance on the Semantic Web requires a highly flexible and generic approach to the treatment of provenance that is able to adapt to its many dimensions and adjust to new dimensions when the need arises. These approaches will be addressed in this thesis.
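To make the idea of querying data together with its provenance concrete, the following is a minimal sketch in Python using the rdflib library. It is only an illustration under assumed names (the example graphs, properties, and sources are invented), and it uses plain named graphs plus a standard SPARQL GRAPH pattern rather than the RDF+ model and the extended SPARQL processing developed in Chapter 3; rdflib itself is not a tool used in this thesis.

# Attach provenance (here: the publishing source) to data via named graphs and
# query data and provenance together with standard SPARQL. Illustration only.
from rdflib import ConjunctiveGraph, Namespace, URIRef, Literal

EX = Namespace("http://example.org/")            # hypothetical namespace
ds = ConjunctiveGraph()

# Two conflicting statements about the same movie, kept in two named graphs.
imdb = ds.get_context(URIRef("http://example.org/graph/imdb"))
imdb.add((EX.Matrix, EX.rating, Literal(8.7)))
blog = ds.get_context(URIRef("http://example.org/graph/blog"))
blog.add((EX.Matrix, EX.rating, Literal(6.0)))

# Provenance about the graphs themselves: who published them.
meta = ds.get_context(URIRef("http://example.org/graph/meta"))
meta.add((URIRef("http://example.org/graph/imdb"), EX.publishedBy, EX.IMDb))
meta.add((URIRef("http://example.org/graph/blog"), EX.publishedBy, EX.SomeBlogger))

# Ask for each rating together with the source that provided it.
query = """
SELECT ?rating ?source WHERE {
  GRAPH ?g { <http://example.org/Matrix> <http://example.org/rating> ?rating }
  GRAPH <http://example.org/graph/meta> { ?g <http://example.org/publishedBy> ?source }
}
"""
for rating, source in ds.query(query):
    print(source, rating)       # e.g. IMDb 8.7, SomeBlogger 6.0

With the provenance in hand, an application can decide, for example, to prefer the rating that comes from the more trusted source.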

Using Provenance in Semantic Web Applications: Nowadays, the Web is an almost continuous flow of information. Data is constantly being produced, shared, and consumed by a diversity of stakeholders. With so much data on the Web, users want to be assured that they do not miss anything of interest. Methods and algorithms dealing with Web data require an understanding of the data's dynamicity. For instance, social media sites rely on news feed ranking algorithms. The goal of such algorithms is to deliver the right content to the user at the right time, i.e., they determine which messages are important for the user (based on their preferences), and from those, they decide which one should be shown first. Given the assumption that messages come 'on the fly', the challenge is to efficiently process the user's preferences to provide the most relevant messages first. Such a 'dynamic' preference aggregation algorithm is addressed in this work, where users define their preferences over provenance, i.e., over the messages' popularity, their recency, etc.

Further, quite often, Web applications pre-fetch data from the Web and store local copies of it in a cache for faster access at runtime. As data evolves over time, local copies of such data need to be updated from time to time to ensure the quality (or freshness) of such copies. In order to capture the changes of the Web sources and make best use of the resources available, it is vital to choose a good scheduling strategy to know when to fetch data from a given data source. So far, there is no work addressing strategies to efficiently keep local copies of Web sources up-to-date. Nevertheless, the effectiveness of well-known update scheduling strategies for maintaining indices of web documents, such as the PageRank


algorithm [Page et al., 1999], could be evaluated for updating local copies of Web sources. Provenance may be used within such strategies to cope with cache updates, since a simple check on it may support the decision process of determining which sources need to be updated. A novel and different strategy for performing cache updates is change behavior analysis. A metric that captures the change behavior of a Web source over time, i.e., the intensity with which the data evolved in a given period, is then necessary and is therefore addressed in this work.
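As a small, concrete illustration of the kind of provenance-based preference aggregation sketched above, the following Python snippet aggregates three invented preference orders over messages (by recency, popularity, and source trust) with the Borda count, a standard method from Social Choice Theory. It is a static toy version only; Chapter 5 develops and analyzes the online variant of such aggregation, and the Borda count is just one of the aggregators considered there.

# Aggregate several provenance-based preference orders into one ranking using
# the Borda count. Static illustration; all message lists are made up.
from collections import defaultdict

def borda_aggregate(preference_orders):
    """Each order lists messages from best to worst; a message in position i
    of an order over n messages scores n - i points."""
    scores = defaultdict(int)
    for order in preference_orders:
        n = len(order)
        for i, message in enumerate(order):
            scores[message] += n - i
    return sorted(scores, key=scores.get, reverse=True)   # best first

by_recency    = ["m3", "m2", "m1"]   # newest message first
by_popularity = ["m2", "m3", "m1"]   # most shared message first
by_trust      = ["m1", "m2", "m3"]   # most trusted source first

print(borda_aggregate([by_recency, by_popularity, by_trust]))
# ['m2', 'm3', 'm1'] -- total scores: m2 = 7, m3 = 6, m1 = 5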

1.2. Research Questions

In order to show the benefits of the use of provenance information for different Semantic Web applications and scenarios, the following research questions are to be addressed. These research questions are individually tackled in the chapters of this thesis as follows.

PART II: Provenance Management in the Semantic Web
Provenance of data can be represented in existing standard Semantic Web data modelling and query languages such as RDF, OWL, and SPARQL. However, we must distinguish the use of such languages with only an implicit notion of provenance, and no semantic consequences specifically due to this provenance, from a formally extended model of RDF, OWL, and SPARQL with an explicit notion of provenance. Such a provenance model should ideally retain upward compatibility with existing usage of the language and corresponding tools and methods. In the following, we tackle the research challenges we faced in the development of a framework for querying for data and provenance, as well as for reasoning with data and provenance in the Semantic Web.

Chapter 3: Querying RDF Datasets with Provenance
Querying for RDF data and provenance with SPARQL requires a more sophisticated query processing that adapts to the treatment of provenance. RDF plays the role of a common model, as data can easily be integrated and combined with other data published on the Web, which is basically done by establishing links between Web resources. Provenance should then be embedded in RDF in such a way that it retains upward compatibility with existing language usage and corresponding tools and methods, which is a major concern for Semantic Web approaches. These requirements lead us to the first research question:

RQ 1.I Can RDF be used to represent provenance in its different dimensions, e.g., source, certainty, and timestamp?

Further, the SPARQL query language enables querying in an interlinked data space that goes beyond simple keyword searches. SPARQL mechanisms, however, do not include techniques to single out the relevant information in Web data. Therefore, query engines require mechanisms to track provenance in its many dimensions when exploring data, guiding us to our next research question:

RQ 1.II Does the treatment of provenance within the SPARQL query language require changing the existing SPARQL semantics (which would leave it unsupported by existing query engines)?


Moreover, in general, provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information rich. Hence, the manner in which the provenance is propagated along the computation is crucial to its scalability, which leads to the following research question:

RQ 1.III Does the exploitation of provenance lead to computation overhead?

Chapter 4: Reasoning and Debugging Evolving OWL Ontologies with Provenance
Provenance can be used within reasoning to help users judge the reliability of inferences and to find errors in ontologies. OWL is, however, heavily based on Description Logic (DL), i.e., its model-theoretic semantics is compatible with the semantics of Description Logics. Typical reasoning tasks over an expressive DL (e.g., using tableau methods to perform consistency checking, instance checking, satisfiability checking, etc. [Baader et al., 2003, Rudolph, 2011]) are in the worst case doubly exponential, and in practice often likewise very expensive. In [Schenk et al., 2011], Schenk et al. propose a generic formalization of provenance (in its multiple dimensions) and a debugging framework for provenance in OWL based on pinpointing. Pinpointing means identifying relevant axioms, where an axiom is defined as relevant if an incoherent ontology becomes coherent once this axiom is removed, or if a previously unsatisfiable concept turns satisfiable. Pinpointing, as well as standard algorithms for debugging ontologies, poorly supports the user in answering provenance questions and also requires expensive reasoning. For these reasons, even though the exploitation of provenance helps users to find undesired inferences and inconsistencies in evolving ontologies, such algorithms are not applicable for expressive and large-scale real-world ontologies. These challenges lead to the following research question:

RQ 1.IV Does the exploitation of provenance when reasoning and debugging OWL ontologies lead to computation overhead?
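To see why naive pinpointing is expensive, consider the following brute-force sketch in Python: it enumerates subsets of the ontology's axioms and keeps the minimal inconsistent ones. The consistency check is only a placeholder (a real implementation would call a DL reasoner), and the axioms are invented strings; the point is merely that the search space grows exponentially with the number of axioms, which is what motivates the optimized, provenance-aware algorithm presented in Chapter 4.

# Brute-force pinpointing sketch: find all minimal inconsistent subsets of a
# set of axioms. Placeholder consistency check; exponential search space.
from itertools import combinations

def naive_pinpoints(axioms, is_consistent):
    pinpoints = []
    for size in range(1, len(axioms) + 1):
        for combo in combinations(axioms, size):
            subset = set(combo)
            if is_consistent(subset):
                continue
            if any(p <= subset for p in pinpoints):
                continue                   # a smaller culprit is already known
            pinpoints.append(subset)
    return pinpoints

# Toy "ontology": the placeholder check declares any set that contains both
# "Student(bob)" and "not Student(bob)" inconsistent.
axioms = ["Student(bob)", "not Student(bob)", "Person(bob)"]
consistent = lambda s: not {"Student(bob)", "not Student(bob)"} <= s
print(naive_pinpoints(axioms, consistent))   # the minimal conflicting pair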

PART III: Using Provenance in Semantic Web Applications
Dealing with data dynamicity on the Web is crucial, as it has a great impact on the way people live and do business. Methods and algorithms dealing with Web data require an understanding of how the data changes over time. Provenance information and models as studied and proposed in this thesis may be used within such algorithms, methods, and strategies to support the assessment of data quality and change behavior of dynamic information sources. In the following, we discuss the research challenges we experienced when using provenance information to filter relevant information from the Web for Web consumers, and to verify the change behavior and metadata conformance of such dynamic information sources.

Chapter 5: An Efficient Provenance-Aware News Feed Ranking Algorithm via Pref- erence Aggregation Due to the continuous flow of information produced and available on the Web, the challenge is to efficiently process the user’s preferences to provide the most relevant and correct information first.


The traditional problem of preference aggregation assumes a fixed set of preference orders and a fixed set of domain elements (static setting). Using the standard aggregation algorithms, whenever a new element is added to the domain, a new aggregation has to be rebuilt. This operation requires expensive computation and cannot be adopted in real use case scenarios. Therefore, it is essential to investigate how preference aggregation methods can be modified in order to assure efficiency in a dynamic setting, as pointed out in the next research question:

RQ 2.I Can the preference aggregation problem be efficiently solved in a dynamic setting?

Chapter 6: Managing Data Changes in Linked Open Data Sources
As Web data evolves over time, that is, as RDF graphs change when information is added and removed, local copies of Web sources need to be updated from time to time to ensure the quality of such copies. Mainly, the naïve approach for detecting changes of a resource in a LOD source consists of downloading two (arbitrarily-sized) RDF descriptions of that resource and further comparing them [Käfer et al., 2013, Völkel and Groza, 2006, Zeginis et al., 2011] (a small sketch of this baseline is given at the end of this section). As these descriptions can be of significant size, Linked Data applications would benefit if Linked Data servers provided provenance information. A simple check on this provenance could support their decision process of determining which sources need to be updated. However, it is crucial that Linked Data servers provide correct and valid provenance values; otherwise, they are of no use in any practical application:

RQ 2.II Can the provenance delivered by Linked Data servers support applications for detecting changes of the resources in a LOD source?

Further, when considering the change analysis of a dataset over a period of time, state-of-the-art metrics can be used. State-of-the-art metrics for change detection of RDF datasets [Ding and Finin, 2006, Dividino et al., 2013, Käfer et al., 2013] mainly quantify changes between any two datasets. Nevertheless, such metrics do not include a mechanism to exploit the dynamics of a dataset, i.e., to consider a time interval described by more than two points in time. A time-dependent function could capture the frequency, degree, and regularity of the changes of the data, which leads to the following research question:

RQ 2.III Does the consideration of changes within a time interval improve change analysis?

Often applications relying on that dynamic data (by pre-fetching data from the Web and storing local copies of it in a cache) need to continually update their caches. Instead of continually visiting all of the LOD sources at brief intervals, a good scheduling strategy is essential to know when to fetch data from which data source in order to capture most of the changes. To the best of our knowledge, there is no work addressing strategies to efficiently keep local copies of LOD sources up-to-date.


Figure 1.2.: This thesis shows the benefits of using provenance information in different Semantic Web applications and scenarios. This figure illustrates the Semantic Web tools presented and the chapters where they are discussed.

In the literature, however, many update scheduling strategies for maintaining indices of web documents, such as the PageRank algorithm [Page et al., 1999], and metrics for quantifying the changes of LOD sources have been proposed. To determine their adequacy and effectiveness in the context of updating local copies of LOD sources, as pointed out in the next research question, an extensive evaluation is required.

RQ 2.IV Which is the most adequate update scheduling strategy to manage caching copies of the LOD sources?
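As promised above, here is a brief sketch of the naive change-detection baseline referred to by RQ 2.II: dereference a resource in two crawls and compare the two RDF descriptions. The sketch uses Python's rdflib (not a tool of this thesis) and invented Turtle snippets; its cost is exactly what a correct Last-Modified value would allow an application to avoid.

# Naive baseline: detect whether a resource description changed by comparing
# two downloaded RDF snapshots. Illustration with made-up data.
from rdflib import Graph
from rdflib.compare import to_isomorphic, graph_diff

def has_changed(old_snapshot_ttl, new_snapshot_ttl):
    old_graph = Graph().parse(data=old_snapshot_ttl, format="turtle")
    new_graph = Graph().parse(data=new_snapshot_ttl, format="turtle")
    _, only_old, only_new = graph_diff(to_isomorphic(old_graph),
                                       to_isomorphic(new_graph))
    return len(only_old) > 0 or len(only_new) > 0

old = '@prefix ex: <http://example.org/> . ex:Matrix ex:rating "8.7" .'
new = '@prefix ex: <http://example.org/> . ex:Matrix ex:rating "8.6" .'
print(has_changed(old, new))   # True -- but both snapshots had to be fetched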


1.3. Thesis Outline

The thesis consists of three parts. The first part describes the motivation and foundations of data provenance and the Semantic Web. The second presents a framework for provenance management in the Semantic Web. The last part presents different Semantic Web tools and scenarios that benefit from the use of provenance information. An outline of this thesis is described in Figure 1.2.

PART I: Foundations

Chapter 1: Introduction
This chapter describes the motivation for using provenance information when querying, filtering, debugging or repairing, and updating caches of Web data. It presents the thesis' structure and lists the research publications.

Chapter 2: Foundations
This chapter presents the fundamental concepts of provenance and the Semantic Web technologies and analyzes state-of-the-art approaches.

PART II: Provenance Management in the Semantic Web

Chapter 3: Querying RDF Datasets with Provenance
Querying for data and provenance in the Semantic Web requires a highly flexible and generic approach to the treatment of provenance that is able to adapt to its many dimensions and is open to accommodate new dimensions when the need arises. Such a principled, original framework is worked out in this chapter. This approach re-uses existing RDF modeling possibilities in order to represent provenance. Based on the modeling proposed by Schüler et al. [Schueler et al., 2008], provenance is modeled in existing RDF structures by embedding a slightly more expressive language, which is called RDF+, into RDF. This embedding implies that the different provenance dimensions, e.g., source, certainty, and timestamp, may be defined using RDF snippets in their literal sense, together with their intended semantics, in order to facilitate query processing with complex expressions and pattern combinations (see RQ 1.I: Can RDF be used to represent provenance in its different dimensions, e.g., source, certainty, and timestamp?).

Further, SPARQL query processing is extended in such a way that, given a SPARQL query for data, one may request provenance without modifying the query itself. This extension allows the user to expand a given conventional SPARQL query by a keyword for provenance, triggering the construction of provenance by the query processor. The evaluation of SPARQL with provenance provides the user additional access to valuable provenance that can be used for relevance ranking, conflict resolution, or other applications in connection with retrieved knowledge. Nevertheless, standard evaluation of SPARQL queries is still fully supported. This is an important advantage for compatibility with existing applications and interfaces (see RQ 1.II: Does the treatment of provenance within the SPARQL query language require changing the existing SPARQL semantics, which would leave it unsupported by existing query engines?).

This thesis outlines how the construction of annotations influences the complexity of the decision problem related to SPARQL. Hence, it shows the time complexity


and space complexity analysis of evaluating SPARQL with provenance (see RQ 1.III: Does the exploitation of provenance lead to computation overhead?).

Chapter 4: Reasoning and Debugging Evolving OWL Ontologies with Provenance
This chapter presents an algorithm for reasoning and for tracking undesired inferences and inconsistencies using provenance when answering queries upon evolving ontologies in the Semantic Web. Debugging inconsistent ontologies both diagnoses and repairs them. When provenance information is attached to the diagnosis, it can help users to understand why inconsistencies exist, emphasizing who created them, from what location, and when they were created. Provenance information can thus guide repairs. With this algorithm one can efficiently reason in OWL ontologies with provenance. This debugging approach with provenance supports users in coping with the complexity and dynamics of evolving ontologies and in understanding why inconsistencies exist (or have been created) as well as how to fix them.

The work presented in this chapter is the result of the author's joint work with Simon Schenk [Schenk et al., 2011]. Simon Schenk's main contribution consists of the formalization of provenance. His formalization allows for the computation of provenance of inferred knowledge in description logics and includes reasoning with conflicting and incomplete provenance. My main contribution consists of an optimized algorithm for computing provenance and its evaluation. Correspondingly, Schenk proposed a debugging framework for provenance based on pinpointing. Pinpointing means identifying relevant axioms, where an axiom is defined as relevant if an incoherent ontology becomes coherent once this axiom is removed, or if a previously unsatisfiable concept turns satisfiable. As pinpointing summarizes explanations for axioms in a single Boolean formula, it can then be evaluated using a provenance algebra as described in Chapter 3. The algorithm presented in Chapter 4 computes explanations of the answer and uses pinpointing to compute provenance in an algebraic way.

However, the computation of all pinpoints may become very expensive and inapplicable if users need to interact with dynamically changing knowledge in real time. The optimized algorithm proposed in this work enables the use of provenance for debugging in real time even for very large and expressive ontologies. Lastly, this chapter presents a discussion of the algorithm's complexity (see RQ 1.IV: Does the exploitation of provenance when reasoning and debugging OWL ontologies lead to computation overhead?).

PART III: Using Provenance in Semantic Web Applications

Chapter 5: An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation
This chapter tackles the problem of providing the most relevant messages according to the user's preferences. Users state their preferences on different aspects of the data (e.g., on information source, recency, reliability, location, etc.), and this information is then aggregated to obtain a joint ranking. The aggregation problem is related to the problem of preference aggregation in Social Choice Theory [Kelly, 1988]. The


traditional problem formulation of preference aggregation assumes a fixed set of preference orders and a fixed set of domain elements. This thesis investigates how an aggregated preference order has to be updated when the domain is dynamic, i.e., the aggregation approach ranks messages 'on the fly' as they pass through the system. In order to efficiently handle the dynamic setting, Chapter 5 presents a computational approach for online preference aggregation (see RQ 2.I: Can the preference aggregation problem be efficiently solved in a dynamic setting?).

Chapter 6: Managing Data Changes in Linked Open Data Sources
This chapter presents two techniques for dealing with data dynamics in LOD sources using provenance information. The first one describes an investigation of the availability and conformance of provenance information for detecting changes of LOD sources. More precisely, it describes an investigation of the availability, conformance, and reliability of the HTTP header field Last-Modified, which indicates the date and time at which the origin server believes the resource was last modified. The HTTP header is a textual part of the HTTP response message which is sent by any Linked Data server when dereferencing a resource.

Naïve approaches for detecting changes in a resource in a LOD source mainly download two arbitrarily-sized descriptions of an RDF resource and further compare them. The main idea here is to use available provenance information to speed up change detection and avoid such comparisons (see RQ 2.II: Can provenance delivered by Linked Data servers support applications for detecting changes of the resources in a LOD source?). Linked Data applications benefit from this information, since simply checking it may help users to decide which caches or sources need to be updated. However, it is crucial that the Last-Modified field provides correct values; otherwise it is of no use in any practical application. To this end, this chapter presents an analysis of a large-scale dataset obtained from the LOD cloud by weekly crawls over almost two years. In these weekly snapshots, the author checked whether the HTTP header field Last-Modified was available and whether the date provided for the last modification aligned with the observed changes in the data itself.

The second part of this chapter formalizes and evaluates a strategy for capturing data changes based on a time-dependent measure that captures the frequency, degree, and regularity of the changes of the LOD sources. While state-of-the-art metrics [Ding and Finin, 2006, Dividino et al., 2013, Käfer et al., 2013] quantify changes between two sets, e.g., two versions of the same dataset captured at two different points in time, the dynamics function proposed in this work involves the analysis of the dataset's development over a period of time, i.e., a time interval beginning at an initial point in time up to a final one (see RQ 2.III: Does the consideration of changes within a time interval instead of between two points in time improve change analysis?). The proposed dynamics function is defined as an aggregation of changes over time, is built on top of contemporary change metrics, and may be extended to incorporate the use of different decay functions for stressing or weakening periods within a time interval. This notion intends to establish a method for capturing the behavior and evolution of the LOD sources over a certain period


of time. Lastly, this chapter discusses the effectiveness of the dynamics function for the purpose of cache maintenance. In order to identify the most adequate update scheduling strategy (or strategies) to manage caching copies of the LOD sources (see RQ 2.IV: Which is the most adequate update scheduling strategy to manage caching copies of the LOD sources?), an evaluation of different scheduling strategies is conducted. The evaluation is done on a large-scale LOD dataset that is obtained from the LOD cloud by weekly crawls over the course of three years. The evaluation is divided into two different setups: (i) in the single step setup, the quality of update strategies for a single and isolated update of a local data cache is evaluated, while (ii) in the iterative progression setup, the quality of the local data cache is measured when considering iterative updates over a longer period of time. The evaluation indicates the effectiveness of each strategy for updating local copies of LOD sources, i.e., it demonstrates, for given limitations of bandwidth, the strategies' performance in terms of data accuracy and freshness.
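To make the two strategies concrete, the following hedged sketch in Python shows (a) how an application could read the Last-Modified header of a Linked Data resource with a single HEAD request (using the requests library; the URL is hypothetical), and (b) the general shape of a decay-weighted dynamics score built from per-interval change ratios. Both are illustrations only; the actual conformance analysis and the formal dynamics function are developed in Chapter 6.

# (a) Read provenance from the HTTP header instead of downloading the data.
import math
import requests

def last_modified(url):
    response = requests.head(url, allow_redirects=True, timeout=10)
    return response.headers.get("Last-Modified")   # None if not provided

print(last_modified("http://example.org/resource/Matrix"))

# (b) Aggregate per-interval change ratios (oldest first) into a dynamics
#     score, weighting recent intervals more heavily via exponential decay.
def dynamics(change_ratios, half_life=4.0):
    score = 0.0
    for age, ratio in enumerate(reversed(change_ratios)):
        score += ratio * math.exp(-math.log(2) * age / half_life)
    return score

# A source that changed recently scores higher than one whose only change
# lies far in the past, even if the total amount of change is the same.
print(dynamics([0.0, 0.0, 0.0, 0.5]))   # ~0.50
print(dynamics([0.5, 0.0, 0.0, 0.0]))   # ~0.30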

1.4. Publications and Exploitation

The research discussed in this thesis has been presented in conference papers, journal papers, conference posters, and workshop papers. The following is a list of scientific publications where portions of chapters may be found.

Chapter 3: Querying RDF Datasets with Provenance

[Dividino et al., 2009b] Renata Queiroz Dividino, Sergej Sizov, Steffen Staab, Bernhard Schueler: Querying for provenance, trust, uncertainty and other meta knowledge in RDF. J. Web Sem. 7(3): 204-219 (2009)
This journal paper has been written in collaboration with Bernhard Schüler, Sergej Sizov, and Steffen Staab on a framework for querying with provenance in RDF. It is a completely revised and extended version of the work proposed by Schüler et al. [Schueler et al., 2008]. My contributions include proofs showing that the provenance evaluation of SPARQL queries is equivalent to the standard SPARQL evaluation, a discussion of soundness and completeness, as well as an extended empirical evaluation section.

[Dividino et al., 2009a] Renata Queiroz Dividino, Simon Schenk, Sergej Sizov, Steffen Staab: Provenance, Trust, Explanations – and all that other Meta Knowledge. KI 23(2): 24-30 (2009)
This journal paper summarizes different approaches for querying for data and provenance in the Semantic Web. My contribution was to summarize the work done in [Dividino et al., 2009b].

Chapter 4: Reasoning and Debugging Evolving OWL Ontologies with Provenance

[Schenk et al., 2009] Simon Schenk, Renata Queiroz Dividino, Steffen Staab: Reasoning With Provenance, Trust and all that other Meta Knowledge in OWL. SWPM 2009
This workshop paper was written in collaboration with Simon Schenk and Steffen Staab. It describes our initial work towards reasoning with provenance in description logics.


My contributions include providing a detailed grammar for provenance annotations in OWL 2 and comparisons to other approaches.

[Schenk et al., 2011] Simon Schenk, Renata Queiroz Dividino, Steffen Staab: Using provenance to debug changing ontologies. J. Web Sem. 9(3): 284-298 (2011)
This journal paper has been written in collaboration with Simon Schenk and Steffen Staab. Simon Schenk's main contribution consists of the formalization of provenance, which allows for the computation of provenance of inferred knowledge in description logics (and includes reasoning with conflicting and incomplete provenance). My contribution consists of an optimized algorithm for computing provenance and a comprehensive evaluation.

Chapter 5: An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation

[Dividino et al., 2012] Renata Queiroz Dividino, Gerd Gröner, Stefan Scheglmann, Matthias Thimm: Ranking RDF with Provenance via Preference Aggregation. EKAW 2012: 154-163
This conference paper has been written in collaboration with Gerd Gröner, Stefan Scheglmann, and Matthias Thimm. My main contribution was to provide the algorithms and complexity analysis for the Borda and sequential pairwise aggregators, as well as a comprehensive evaluation for all aggregators proposed.

Chapter 6: Managing Data Changes in Linked Open Data Sources

[Dividino et al., 2014b] Renata Queiroz Dividino, Andre Kramer, Thomas Gottron: An Investigation of HTTP Header Information for Detecting Changes of Linked Open Data Sources. ESWC (Satellite Events) 2014: 199-203
This workshop paper has been written in collaboration with Andre Kramer and Thomas Gottron. My contributions include the evaluation and analysis of the HTTP header data provided by [Käfer et al., 2013].

[Dividino et al., 2013] Renata Queiroz Dividino, Ansgar Scherp, Gerd Gröner, Thomas Gottron: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? COLD 2013
This workshop paper has been written in collaboration with Ansgar Scherp, Gerd Gröner, and Thomas Gottron. This work was inspired by Ansgar Scherp's and Thomas Gottron's previous work [Konrath et al., 2012] on an approach and tool for stream-based indexing and schema extraction of Linked Open Data (LOD) at web scale. Using their formalization of abstract schemas, i.e., the combinations of sets of properties and sets of types used to describe the resources in a specific domain, my contributions include the evaluation of the use of vocabularies in the LOD cloud and the analysis of how the use of vocabularies on the LOD cloud changes over time.

[Dividino et al., 2014b] Renata Queiroz Dividino, Thomas Gottron, Ansgar Scherp, Gerd Gröner: From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014
This workshop paper has been written in collaboration with Ansgar Scherp, Gerd Gröner, and Thomas Gottron. Together with Thomas Gottron, I have proposed the dynamics

14 1.4. Publications and Exploitation function which can compute the dynamics or evolution of any RDF dataset over a period of time. Additionally, I applied our dynamics function to real world LOD data sources in order to illustrate how the proposed function works and related its results to temporal change patterns. [Dividino et al., 2015] Renata Queiroz Dividino, Thomas Gottron, Ansgar Scherp: Strate- gies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2) 2015: 356-373 This conference paper has been written in collaboration with Ansgar Scherp, and Thomas Gottron. My contributions include the research and implementation of update strategies proposed in the literature, and the analysis of their effectiveness for updating local copies of the LOD sources, i.e. which strategy performs better in terms of data accuracy and freshness for given restrictions of bandwidth.


2. Foundations

2.1. Provenance Foundations

According to the Oxford English Dictionary, provenance means:

1. the fact of coming from some particular source or quarter; origin, derivation

2. the history or pedigree of a work of art, manuscript, rare book, etc.; concretely, a record of the ultimate derivation and passage of an item through its various owners.

In the past, databases and data repositories were mainly under centralized control to serve all of an organization's needs. Such databases and repositories were assumed to be reliable and trustworthy sources of information. Today, data is often stored in different repositories (such as on the Web) with no centralized control. In an open and collaborative environment, external agents are involved in the process of publishing and sharing data – data is created, copied, moved around, and combined indiscriminately. With the increasing amount of data available on the Web, data quality assessment becomes more and more important, as it allows the user to ascertain the veracity, credibility and validity of the information. Information about provenance is essential as it provides context which can help end users judge the validity of the knowledge derived from such information sources, such as:

● Where the data is from,

● Who has provided the data,

● When the data was provided,

● Proof from the provider about the truth of this data,

● External validation of the credibility of the data.

The provenance of the information can help data consumers make judgments and seek validation, as it describes the data's origins and its derivation over its life cycle. Provenance information (also called lineage) is a record that describes the recentness, reputation, and degree of confidence in a piece of data, as well as the people, institutions, entities, and activities involved in the data transformation process. Provenance has been used in many data management tasks, for example for checking the integrity of data in databases or repositories.


2.1.1. Workflow and Data Provenance

Provenance has been extensively studied both in database [Buneman et al., 2012, Cheney et al., 2009, Kostylev and Buneman, 2012, Boulakia and Tan, 2009, Buneman et al., 2006, Cheney et al., 2014, Buneman, 2013] and workflow [Bose and Frew, 2005, Simmhan et al., 2005, Davidson and Freire, 2008] management. Database provenance, also called fine-grained provenance [Tan, 2007], has been designed for relational data and is modeled as annotations on data. Provenance models define mechanisms for evaluating queries over annotated relations, with a focus on understanding the transport of annotations from the input relations to the output data. In particular, provenance in databases has been successfully used in a wide range of applications such as confidence computation [Archer et al., 2013, Agrawal et al., 2006], view maintenance and update [Buneman et al., 2002], annotation and propagation [Buneman et al., 2008, Bhagwat et al., 2005, Wang and Madnick, 1990], and visualization [Hoekstra and Groth, 2014, Deutch et al., 2015a]. Workflow provenance – or coarse-grained provenance – aims to capture a complete history of the workflow execution that resulted in a data item. Workflow provenance captures a notion of a causal graph, explaining how the output data was derived in an execution. This may involve tracking the interaction of programs and external devices, as well as human interaction with the process, when such components of the system are treated as black boxes. Workflow provenance is crucial to verification and validation in many scientific applications such as workflow management [Bao et al., 2012, Huang et al., 2015, Murta et al., 2014], upgrades and evolution [Koop et al., 2010, Koop and Freire, 2014, Simmhan et al., 2008], and privacy [Davidson et al., 2011b, Davidson et al., 2011c, Davidson et al., 2011a], and a number of tools for capturing provenance have been developed, such as myGrid/Taverna [Oinn et al., 2004], Kepler [Bowers and Ludäscher, 2005] and VisTrails [Callahan et al., 2006]. Newer approaches work towards marrying database-style and workflow-style provenance [Deutch et al., 2015c, Chirigati and Freire, 2012, Amann et al., 2013, Amsterdamer et al., 2011a]. They enable different levels of granularity in provenance querying, i.e., they support tracking and querying fine-grained workflow provenance. This dissertation focuses on provenance within databases (i.e., data provenance); therefore, in the next sections we introduce the most common notions of provenance for database queries.

2.1.2. Why, Where, and How Provenance

Provenance was initially studied as an extension of relational databases (i.e., data management systems based on relational algebra) and probabilistic databases [Fuhr and Rölleke, 1997], and was later adopted for RDF knowledge bases (i.e., for the semantics of the SPARQL query language). In the area of database systems, provenance is often represented using an extension of the relational data model coined annotated relations. Its purpose is primarily the description of data origins (provenance) and the process by which data arrived as a query answer [Cui and Widom, 2000, Buneman et al., 2000, Buneman et al., 2001, Ding et al., 2005]. Basically, these approaches define custom (possibly different) interpretations of the algebraic operations on Boolean formulas (built on tuple identifiers as Boolean variables) for particular dimensions of provenance (e.g., agent, timestamp, source, certainty) in order to obtain the m-dimensional record attached to a query result. Indeed, a Boolean expression over tuple identifiers attached to the result set does not only carry the information about which tuples have contributed to a variable assignment (why-provenance) but


Movie Theater          Movie                     Provenance
Cinemark Palace I      Star Wars: Episode VII    b1
Cinemark Palace I      The Revenant              b2
Cinemark Palace I      Man on a Ledge            b3
Prado Cafe & Cinema    Star Wars: Episode VII    b4
Prado Cafe & Cinema    Mistress America          b5
Cineplex               Star Wars: Episode VII    b6
Cineplex               The Revenant              b7

Table 2.1.: Our example (annotated) database M: Movie Theaters and Movies

Movie                     Genre     Provenance
Star Wars: Episode VII    Action    c1
Mistress America          Comedy    c2
Man on a Ledge            Action    c3
The Revenant              Action    c4

Table 2.2.: Our example (annotated) database G: Movies and Movie Genres

also how they contributed (how-provenance). This draws a distinction between the three main notions of provenance:

● Where-provenance [Wang and Madnick, 1990, Buneman et al., 2001]: where (in which source tuples and attributes) the given pieces of data are physically located,

● Why-provenance [Cui et al., 2000, Buneman et al., 2001]: which subsets of database tuples contributed to the result, and

● How-provenance [Green et al., 2007]: how particular tuples were used for constructing the result.

Here we provide some examples of these three notions. For more details, please refer to Cheney et al. [Cheney et al., 2009], who provide a high-level overview of why, how, and where provenance and describe the relationships among these three main notions of database provenance.

Example 2.1.1 As an example, Table 2.1 and Table 2.2 show two relations with annotated tuples. The first relation describes movie theaters and the movies they show. The second relation describes the movie genres.

The notion of why-provenance was formally introduced by Buneman et al. [Buneman et al., 2001]. Suppose an annotated relation V = Q(D) is constructed by a query Q applied to source annotated relations D; the why-provenance of an output tuple t in V is defined as the witness basis of t, i.e., all the combinations of source tuples which justify the existence of t. Buneman et al. [Buneman et al., 2001] also introduce the notion of where-provenance. While why-provenance describes the source tuples that had some influence on the existence of


Result of Q1           Why-Provenance
Cinemark Palace I      {{b1, c1}, {b2, c4}, {b3, c3}}
Prado Cafe & Cinema    {{b4, c1}}
Cineplex               {{b6, c1}, {b7, c4}}

Table 2.3.: Example of Why-Provenance

Result of Q1           Where-Provenance
Cinemark Palace I      b1, b2, b3
Prado Cafe & Cinema    b4
Cineplex               b6, b7

Table 2.4.: Example of Where-Provenance

the output tuples, where-provenance describes the locations in the source databases from which the data was extracted. In the relational setting, a location is a column of a tuple in a relation.

Example 2.1.2 We construct an SQL query to retrieve all movie theaters that show action movies.

SELECT m.MovieTheater
FROM   M m, G g
WHERE  m.Movie = g.Movie AND g.Genre = 'Action'

And in relational algebra notation:

π_{MovieTheater}(M ⋈ σ_{Genre = 'Action'}(G))
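As a quick sanity check, the query of Example 2.1.2 can be executed over the two example relations with Python's built-in sqlite3 module. The following is a minimal sketch (not part of the thesis); the table and column names (M, G, MovieTheater, Movie, Genre) are adapted here to valid SQL identifiers.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE M (MovieTheater TEXT, Movie TEXT);
    CREATE TABLE G (Movie TEXT, Genre TEXT);
""")
conn.executemany("INSERT INTO M VALUES (?, ?)", [
    ("Cinemark Palace I", "Star Wars: Episode VII"),
    ("Cinemark Palace I", "The Revenant"),
    ("Cinemark Palace I", "Man on a Ledge"),
    ("Prado Cafe & Cinema", "Star Wars: Episode VII"),
    ("Prado Cafe & Cinema", "Mistress America"),
    ("Cineplex", "Star Wars: Episode VII"),
    ("Cineplex", "The Revenant"),
])
conn.executemany("INSERT INTO G VALUES (?, ?)", [
    ("Star Wars: Episode VII", "Action"),
    ("Mistress America", "Comedy"),
    ("Man on a Ledge", "Action"),
    ("The Revenant", "Action"),
])

rows = conn.execute("""
    SELECT m.MovieTheater
    FROM M m, G g
    WHERE m.Movie = g.Movie AND g.Genre = 'Action'
""").fetchall()
# Under SQL's bag semantics, "Cinemark Palace I" appears three times (cf. Table 2.6 below).
print(rows)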

Why- and where-provenance leave out some information about how an output tuple is derived according to the query. How-provenance fills this gap by describing the source data and the operations used to produce the derived data. In the next section, we formally describe the notion of how-provenance, which has been introduced by Green et al. [Green et al., 2007] for relational algebra in terms of appropriate provenance semirings.

Example 2.1.3 Consider the relations presented in Table 2.1 and Table 2.2, in which tuples are tagged with their own ids. Applying the query from Example 2.1.2 to them, the resulting where- and why-provenance is shown in Table 2.4 and Table 2.3, respectively. The why-provenance of “Cinemark Palace I” is the set {{b1, c1}, {b2, c4}, {b3, c3}}. This tells us that the output tuple is witnessed by the source tuples in three different ways: the first uses the tuples b1 and c1, the second b2 and c4, and the last b3 and c3. The where-provenance of “Cinemark Palace I” consists of the locations (b1, Movie Theater), (b2, Movie Theater), and (b3, Movie Theater), since “Cinemark Palace I” was copied from the Movie Theater attribute of the tuples b1, b2, and b3 in the relation M, according to the query.
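The witness sets of Table 2.3 can also be computed mechanically. The following minimal Python sketch (not from the thesis; the data structures and names are made up for illustration) joins the two annotated relations and collects, for each output value, the sets of source tuple ids that justify it.

# Annotated relations from Tables 2.1 and 2.2: each tuple carries a provenance id.
M = [("b1", "Cinemark Palace I", "Star Wars: Episode VII"),
     ("b2", "Cinemark Palace I", "The Revenant"),
     ("b3", "Cinemark Palace I", "Man on a Ledge"),
     ("b4", "Prado Cafe & Cinema", "Star Wars: Episode VII"),
     ("b5", "Prado Cafe & Cinema", "Mistress America"),
     ("b6", "Cineplex", "Star Wars: Episode VII"),
     ("b7", "Cineplex", "The Revenant")]
G = [("c1", "Star Wars: Episode VII", "Action"),
     ("c2", "Mistress America", "Comedy"),
     ("c3", "Man on a Ledge", "Action"),
     ("c4", "The Revenant", "Action")]

why = {}   # output value -> set of witnesses, each witness a set of source tuple ids
for bid, theater, movie in M:
    for cid, movie2, genre in G:
        if movie == movie2 and genre == "Action":
            why.setdefault(theater, set()).add(frozenset({bid, cid}))

for theater, witnesses in why.items():
    print(theater, sorted(sorted(w) for w in witnesses))
# Cinemark Palace I [['b1', 'c1'], ['b2', 'c4'], ['b3', 'c3']]   (cf. Table 2.3)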


2.1.3. Provenance Semirings

The notion of how-provenance for the positive relational algebra (RA+) was introduced by Green et al. [Green et al., 2007]. Suppose an annotated relation V = Q(D) is constructed by a query Q applied to source annotated relations D; how-provenance models how each output tuple in V was actually derived, as a result of applying the relational query Q, from source tuples in D. Green et al. show that many of the mechanisms for evaluating queries over such annotated relations can be unified in a general framework based on K-relations, which are relations whose tuples are annotated with elements from a commutative semiring K. They adopt the perspective [Fuhr and Rölleke, 1997] of the relational model in which tuples are functions t : U → D, with U a finite set of attributes and D a domain of values, and consider relations in which the tuples are annotated (tagged) with tuple ids. Formally, let K be a set containing a distinguished element 0. A K-relation models a relation as a function R on all possible tuples, where R maps tuples in the relation to nonzero elements of K, and tuples that are not in the relation to the special element 0. Here, U-Tuples denotes the set of all tuples with attributes U. A K-relation R : U-Tuples → K corresponds to a finite relation whose elements are tagged with elements of K.

Definition 2.1.1 Let K be a set containing a distinguished element 0. A K-relation over a finite set of attributes U is a function R : U-Tuples → K such that its support, defined by supp(R) = {t | R(t) ≠ 0}, is finite.

Basically, the semiring framework distinguishes two basic operations that may be applied to the source tuples in D by Q to derive the output tuples in V: source tuples can either be joined together as the effect of a join, or merged together via union or projection. The basic relational algebra operators are mapped onto operations of an algebraic structure (K, 0, 1, ∨, ∧) as follows.

Definition 2.1.2 Let (K, 0, 1, ∨, ∧) be an algebraic structure with two binary operations ∨ and ∧ and two distinguished elements 0 and 1. The operations of the positive K-relational algebra are defined as follows.

● Empty relation. For any set of attributes U, there exists ∅ : U-Tuples → K such that ∅(t) = 0.

● Selection. Let R : U-Tuples → K and let P be a selection predicate that maps each U-Tuple to either 0 or 1. Then σ_P(R) : U-Tuples → K is defined by

(σ_P(R))(t) = R(t) ∧ P(t).

That is, (σ_P(R))(t) is R(t) if P holds on t, and 0 otherwise.

● Projection. Let R : U-Tuples → K and J ⊆ U. Then π_J(R) : J-Tuples → K is defined by:

(π_J(R))(t) = ⋁_{t = t′[J] and R(t′) ≠ 0} R(t′).


Result of Q1
Cinemark Palace I      (b1 ∧ c1) ∨ (b2 ∧ c4) ∨ (b3 ∧ c3)
Prado Cafe & Cinema    (b4 ∧ c1)
Cineplex               (b6 ∧ c1) ∨ (b7 ∧ c4)

Table 2.5.: Provenance Polynomials

● Union. Let R1, R2 : U-Tuples → K. Then R1 ∪ R2 : U-Tuples → K is defined by:

(R1 ∪ R2)(t) = R1(t) ∨ R2(t).

● Natural Join. Let R1 : U1-Tuples → K and R2 : U2-Tuples → K. Then R1 ⋈ R2 : (U1 ∪ U2)-Tuples → K is defined by:

(R1 ⋈ R2)(t) = R1(t1) ∧ R2(t2), where t1 = t[U1] and t2 = t[U2].

If we assume that the K-relational semantics satisfies the same equivalence laws as the RA+ operators over bags (i.e., union (∨) is associative, commutative and has identity ∅; join (∧) is associative, commutative and distributive over union; and projection and selection commute with each other, as well as with union and join), Green et al. conclude that (K, 0, 1, ∨, ∧) must be a commutative semiring. An algebraic structure (K, 0, 1, ∨, ∧) is a commutative semiring if (K, 0, ∨) and (K, 1, ∧) are commutative monoids, ∧ distributes over ∨, and 0 ∧ a = a ∧ 0 = 0 for all a ∈ K.

Example 2.1.4 Consider again the relations in Table 2.1 and Table 2.2, in which tuples are tagged with their own ids. Applying the query from Example 2.1.2 to them and evaluating it in the provenance semiring, we obtain the relation shown in Table 2.5. The provenance of “Cinemark Palace I” is (b1 ∧ c1) ∨ (b2 ∧ c4) ∨ (b3 ∧ c3), which can be understood as follows: “Cinemark Palace I” is derived by joining the input tuples b1 and c1, or b2 and c4, or b3 and c3.

Green et al. propose the polynomial semiring (N[X], 0, 1, +, ∗), the most general commutative semiring, as a suitable provenance model for RA+. If each source tuple in a database D is annotated with a distinct tuple id, this semiring gives us the how-provenance of each output tuple in the form of a polynomial with coefficients from the set N of natural numbers and with indeterminates (or variables) from the set X of source tuple ids. The classical semiring algebraic structure is used in devising a general framework for uniformly treating various extensions of relational algebra, such as handling bag semantics or incomplete and probabilistic databases. Examples of commutative semirings are:

● (B, false, true, ∨, ∧), the Boolean semiring,

● (P(Ω), ∅, Ω, ∪, ∩), used for event tables, which is also an example of a distributive lattice,

● (N, 0, 1, +, ∗), the natural numbers,


Result of Q1
Cinemark Palace I      (1 ∗ 1) + (1 ∗ 1) + (1 ∗ 1) = 3
Prado Cafe & Cinema    (1 ∗ 1) = 1
Cineplex               (1 ∗ 1) + (1 ∗ 1) = 2

Table 2.6.: Bag semantics example: We evaluate the provenance polynomials given in Table 2.5 in (N, 0, 1, +, ∗), with b1, . . . , b7 = 1 and c1, . . . , c4 = 1, which represent the multiplicities of the tuples in the multiset.

● (N^∞, ∞, 0, min, +), the tropical semiring [Kuich, 1997],

● ([0, 1], 0, 1, max, min), related to fuzzy sets [Zadeh, 1965] and called the fuzzy semiring,

● the semiring of confidentiality policies [Foster et al., 2008] (C, P, 0, min, max), where the total order C = P < C < S < T < 0 describes levels of security clearance: P public, C confidential, S secret, and T top-secret.

Example 2.1.5 To illustrate instances of commutative semirings, consider the provenance polynomials in Table 2.5. Suppose the annotations are natural numbers which represent the multiplicities of the tuples in the multiset. (A tuple not listed in the table has multiplicity 0.) Evaluating the provenance polynomial from Example 2.1.4 of “Cinemark Palace I” in (N, 0, 1, +, ∗) for b1, . . . , b7 = 1 and c1, . . . , c4 = 1, we get 3, which is indeed the multiplicity of “Cinemark Palace I” in Table 2.6.
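The same polynomial can be re-evaluated in different semirings simply by swapping the interpretation of ∨ and ∧ and the valuation of the tuple ids. The following minimal Python sketch is not part of the thesis; the small Semiring helper and all variable names are assumptions made for this illustration only.

from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Semiring:
    zero: Any
    one: Any
    plus: Callable[[Any, Any], Any]    # interpretation of ∨ (union / alternative derivations)
    times: Callable[[Any, Any], Any]   # interpretation of ∧ (join)

# How-provenance of "Cinemark Palace I": (b1 ∧ c1) ∨ (b2 ∧ c4) ∨ (b3 ∧ c3),
# written as a sum of products over tuple ids.
polynomial = [["b1", "c1"], ["b2", "c4"], ["b3", "c3"]]

def evaluate(poly, K, valuation):
    total = K.zero
    for monomial in poly:
        product = K.one
        for var in monomial:
            product = K.times(product, valuation[var])
        total = K.plus(total, product)
    return total

counting = Semiring(0, 1, lambda a, b: a + b, lambda a, b: a * b)          # (N, 0, 1, +, *)
boolean  = Semiring(False, True, lambda a, b: a or b, lambda a, b: a and b)
fuzzy    = Semiring(0.0, 1.0, max, min)                                     # ([0,1], 0, 1, max, min)

ones = {v: 1 for v in "b1 c1 b2 c4 b3 c3".split()}
print(evaluate(polynomial, counting, ones))                    # 3, the multiplicity (Table 2.6)
print(evaluate(polynomial, boolean, {v: True for v in ones}))  # True: the output tuple exists
print(evaluate(polynomial, fuzzy, {"b1": 0.9, "c1": 0.5, "b2": 0.4,
                                   "c4": 0.8, "b3": 0.7, "c3": 0.6}))      # 0.6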

The concept of provenance semirings has been extended [Kostylev and Buneman, 2012, Amsterdamer et al., 2012, Amsterdamer et al., 2011c, Amsterdamer et al., 2011b] to deal with updates, aggregation, and negation, and it has been successfully used in a wide range of database applications such as confidence computation, view maintenance and update, debugging, annotation propagation, and summarization. In Chapter 3 we propose a framework for querying RDF with provenance. Our methodology follows the same idea and adopts the notion of provenance semirings introduced here, and it also allows the same distinction between where-, why-, and how-provenance.

2.2. Semantic Web Foundations

The traditional Web is a global information space comprising linked Web documents. The Semantic Web can be seen as an extension of the traditional Web which goes from linked documents to linked data. The Semantic Web provides a common framework that makes Web data available in a well-structured and machine-readable format such that it can be published, reused, combined, and automatically integrated with other machine-readable data. Essentially, it is a paradigm that facilitates discovery and interoperability of structured Web data spread across many different Web sources. There is no doubt that one of the key benefits of Semantic Web technology is the better support of decentralized, self-organizing knowledge exchange between users. The remaining sections outline the core Semantic Web standards, starting with the Resource Description Framework (RDF) data model and related standards such as the SPARQL query

language for querying RDF, and representing ontologies with the Web Ontology Language (OWL). Lastly, we briefly introduce Linked Data, which aims to provide a set of principles by which the Semantic Web standards can be effectively deployed on the Web.

2.2.1. Resource Description Framework

We introduce the necessary notions about RDF. For more information, please refer to [Manola et al., 2014, Klyne et al., 2014, Gutierrez et al., 2011]. The RDF standard provides the basis for a core data model for data interchange on the Semantic Web. RDF is a graph-based knowledge representation language that provides a flexible way to describe resources (i.e., entities in the world) and the relationships between them. The nodes in a graph are represented by Internationalized Resource Identifiers (IRIs) [Durst and Suignard, 2005], blank nodes (a kind of existentially quantified variable) or literals. IRIs serve as global identifiers that can be used to identify any resource. A literal is used to identify values such as numbers and dates by means of a lexical representation. Arcs between the nodes, labeled with IRIs, represent their relationships. In the following definitions, we simplify the RDF graph model in order to come up with a more concise formal characterization, following [Arenas et al., 2009].

Definition 2.2.1 (RDF Terms, Triple and Variables) Let I be the set of IRIs, L the set of RDF literals and B the set of blank nodes as defined in [Patrick J. Hayes and Peter F. Patel-Schneider, 2014]. I, L and B are pairwise disjoint. A statement is an RDF triple in (I ∪ B) × I × (I ∪ L ∪ B). If S = (s, p, o) is a statement, s is called the subject, p the predicate and o the object of S. We denote the union I ∪ L ∪ B by T (RDF terms). Assume additionally the existence of an infinite set V of variables disjoint from the sets above.

Example 2.2.1 Table 2.7 presents some of the RDF data about movies from our toy scenario. In this example there are five prefix declarations and four triples (# denotes a comment line in Turtle). Each triple is comprised of a subject, a predicate, and an object. The subject describes a resource of our scenario, identified by an IRI or a blank node (e.g., dbr:Star_Wars:_The_Force_Awakens and dbr:Man_on_a_Ledge). The predicate describes a relation or attribute of the resource (e.g., foaf:name, rdf:type) using an IRI. Lastly, the object represents the value for that relation (e.g., 6120, dbo:Film) and is represented by an IRI or a literal (and potentially a blank node).

Definition 2.2.2 (RDF Graph) An RDF graph G is a set of statements. If G is an RDF graph, then term(G) is the set of elements of T appearing in the triples of G, and blank(G) is the set of blank nodes appearing in G (blank(G) = term(G) ∩ B). We assume that for every two RDF graphs G1 and G2 the sets of blank nodes used in G1 and in G2 are disjoint, i.e., blank(G1) ∩ blank(G2) = ∅.

Figure 2.1 renders a diagram of the RDF graph presented in Table 2.7 as a directed labeled graph. Named graphs [Carroll et al., 2005, Carroll and Stickler, 2004] offer a means to group a set of statements in a graph and to refer to this graph using an IRI. This way, information about the graph can be expressed in RDF using its name as subject or object:


# PREFIX DECLARATIONS
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix dbp:  <http://dbpedia.org/property/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
# RDF TRIPLES
dbr:Star_Wars:_The_Force_Awakens rdf:type dbo:Film .
dbr:Star_Wars:_The_Force_Awakens foaf:name "Star Wars: The Force Awakens"@en .
dbr:Man_on_a_Ledge rdf:type dbo:Film .
dbr:Man_on_a_Ledge dbp:runtime 6120 .

Table 2.7.: Some of the RDF data about movies from our toy scenario (# denotes a comment line in Turtle).
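For readers who want to experiment with such data programmatically, the following is a minimal sketch using the rdflib Python library (an assumption; the thesis does not prescribe any particular toolkit). It builds the four triples of Table 2.7 directly as RDF terms and iterates over them.

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF, FOAF

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")
DBP = Namespace("http://dbpedia.org/property/")

g = Graph()
# The four triples of Table 2.7, built programmatically instead of parsed from Turtle.
g.add((DBR["Star_Wars:_The_Force_Awakens"], RDF.type, DBO.Film))
g.add((DBR["Star_Wars:_The_Force_Awakens"], FOAF.name,
       Literal("Star Wars: The Force Awakens", lang="en")))
g.add((DBR.Man_on_a_Ledge, RDF.type, DBO.Film))
g.add((DBR.Man_on_a_Ledge, DBP.runtime, Literal(6120)))

for s, p, o in g:
    print(s, p, o)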

Figure 2.1.: Diagram of the RDF graph presented in Example 2.2.1 as a directed labeled graph. (Namespace abbreviations: dbo = http://dbpedia.org/ontology/, rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#, foaf = http://xmlns.com/foaf/0.1/.)

Definition 2.2.3 (RDF Named Graph) A named graph is a pair (i, G) of an IRI i, called the name, and an RDF graph G, called the extension.

Example 2.2.2 Following up on our toy scenario, we extend the RDF data from Table 2.7 to RDF named graphs by adding names in the form of IRI references (e.g., ex:FilmsISaw); the result is shown in Table 2.8. The example is serialized in TriG [Bizer and Cyganiak, 2014], which extends Turtle by introducing curly brackets to group triples into multiple graphs and by preceding each group with the name of that graph.

Finally, RDF datasets can be defined as follows:

Definition 2.2.4 (RDF Dataset) An RDF dataset D is a set of RDF graphs in which every graph is identified by an IRI, except for a distinguished graph in the set called the default


# PREFIX DECLARATIONS
@prefix dbo:  <http://dbpedia.org/ontology/> .
@prefix dbr:  <http://dbpedia.org/resource/> .
@prefix dbp:  <http://dbpedia.org/property/> .
@prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex:   <http://example.org/> .
# RDF TRIPLES
ex:FilmsISaw {
    dbr:Star_Wars:_The_Force_Awakens rdf:type dbo:Film .
    dbr:Star_Wars:_The_Force_Awakens foaf:name "Star Wars: The Force Awakens"@en .
}
ex:FilmsIWantToSee {
    dbr:Man_on_a_Ledge rdf:type dbo:Film .
    dbr:Man_on_a_Ledge dbp:runtime 6120 .
}

Table 2.8.: Example of RDF named graphs (serialized in TriG) containing some of the RDF data about movies from our toy scenario.

graph. Formally, D = {G0, (u1, G1), ..., (un, Gn)}, where G0, ..., Gn are RDF graphs and u1, ..., un are distinct IRIs, n ≥ 0. G0 is called the default graph of D, and each pair (ui, Gi) is a named graph, with ui being the name of Gi; we define gr(ui) = Gi if (ui, Gi) ∈ D, and gr(ui) = ∅ otherwise. Additionally, names(D) stands for the set of IRIs that are names of graphs in D, and term(D) and blank(D) stand for the sets of terms and blank nodes appearing in the graphs of D, respectively. For every two RDF graphs G1 and G2 in D the sets of blank nodes used in G1 and in G2 are disjoint, i.e., blank(G1) ∩ blank(G2) = ∅.
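The following is a minimal sketch, again assuming rdflib, of how the two named graphs of Table 2.8 could be represented as an RDF dataset using rdflib's Dataset class; the ex: namespace IRI is a made-up example value.

from rdflib import Dataset, Literal, Namespace
from rdflib.namespace import RDF

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")
DBP = Namespace("http://dbpedia.org/property/")
EX  = Namespace("http://example.org/")   # hypothetical namespace for the graph names

ds = Dataset()

saw = ds.graph(EX.FilmsISaw)              # named graph ex:FilmsISaw
saw.add((DBR["Star_Wars:_The_Force_Awakens"], RDF.type, DBO.Film))

want = ds.graph(EX.FilmsIWantToSee)       # named graph ex:FilmsIWantToSee
want.add((DBR.Man_on_a_Ledge, RDF.type, DBO.Film))
want.add((DBR.Man_on_a_Ledge, DBP.runtime, Literal(6120)))

# Iterate over quads: (subject, predicate, object, graph)
for s, p, o, g in ds.quads((None, None, None, None)):
    print(g, s, p, o)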

Figure 2.2 renders a diagram of the RDF dataset presented in Table 2.8 as a directed labeled graph. The semantics of RDF is defined as a model theory [Patrick J. Hayes and Peter F. Patel-Schneider, 2014]. Model theory assumes that the language refers to a world, and describes the minimal conditions that a world must satisfy in order to assign an appropriate meaning to every expression in the language. Worlds serve as interpretations of RDF data. Basically, the idea is to provide an abstract and formal foundation for stating what the IRIs used in the RDF data identify in the world, which things in the world are related through which properties, and so forth.

Definition 2.2.5 Let V be a vocabulary containing all names (IRIs and literals) occurring in RDF triples. An RDF interpretation I of V consists of:

● IR: a non-empty set of resources, the domain (or universe) of I,

● IP: a set of properties of I,

● IS : V → (IR ∪ IP): a function mapping names to resources (each symbol from V is associated with a resource),


Figure 2.2.: Diagram of the RDF dataset presented in Table 2.8 as a directed labeled graph, showing the named graphs ex:FilmsISaw and ex:FilmsIWantToSee. (Namespace abbreviations: dbo = http://dbpedia.org/ontology/, rdf = http://www.w3.org/1999/02/22-rdf-syntax-ns#, foaf = http://xmlns.com/foaf/0.1/.)

● IEXT : IP → 2^{IR × IR}: a function mapping properties to binary relations over IR (each property is interpreted as a binary relation),

● IL: a partial mapping from literals into IR.

Then:

● A literal l is true in I if and only if l ∈ V and IL(l) ∈ IR.

● An IRI i is true in I if and only if i ∈ V and IS(i) ∈ IR.

● An RDF triple ⟨s, p, o⟩ is true in I if and only if s, p, o ∈ V, IS(p) ∈ IP and (IS(s), IS(o)) ∈ IEXT(IS(p)).

● An RDF triple is false in I if it is not true in I.

● An RDF graph is true in I if and only if every triple of it is true in I. In particular, it is false in I if some triple of it is not true in I.


The RDF Schema (RDFS) provides a data-modeling vocabulary for RDF data. RDF Schema is an extension of the basic RDF vocabulary and allows for attaching semantics to user-defined classes and properties. For details, please refer to [Dan Brickley and R.V. Guha, 2014].

2.2.2. Querying RDF: The SPARQL Query Language

SPARQL is a query language for RDF based on graph pattern matching. Its syntax and semantics are specified in [Prud'hommeaux and Seaborne, 2008]. Essentially, SPARQL is a graph-matching query language. Query results in SPARQL are given by partial substitutions (mappings) of the query variables by RDF terms, i.e., given an RDF dataset D, a query consists of a pattern which may contain a variable instead of an RDF term in the subject, predicate or object positions. The query is matched against D, and the RDF terms obtained from this matching are processed to give the results.

Syntax of SPARQL

On a high level, a SPARQL query can consist of up to five main parts:

● Prefix Declarations: define URI prefixes (in a similar fashion to Turtle).

● Dataset Clause: specifies the dataset over which the query should be executed. A SPARQL query is evaluated against a dataset consisting of a set of named graphs (declared using FROM NAMED clauses) and a default graph, which is the union of one or more graphs (declared using FROM clauses).

● Query Forms: specify what type of SPARQL query is being executed. In this work we are only interested in the SPARQL SELECT and CONSTRUCT query forms, which allow us to specify how resulting variable bindings or RDF graphs, respectively, are formed based on the solutions from graph pattern matching [Prud'hommeaux and Seaborne, 2008].

● Query Clause: specifies the query patterns that are matched against the data and used to generate variable bindings.

● Solution Modifiers: used for projecting, ordering, slicing and paginating the results.

In the following definitions, we lay out the core features of a query clause and describe the existing foundations of SPARQL introduced in [Pérez et al., 2009, Arenas et al., 2009]. A SPARQL graph pattern expression is defined recursively as follows:

1. A tuple from (I ∪ L ∪ V ) × (I ∪ V ) × (I ∪ L ∪ V ) is a graph pattern (a triple pattern).

2. If P1 and P2 are graph patterns, then the expressions (P1 AND P2), (P1 OPT P2), and (P1 UNION P2) are graph patterns.

3. If P is a graph pattern and R is a SPARQL built-in condition, then the expression (P FILTER R) is a graph pattern.


A SPARQL built-in condition is constructed using elements of the set V ∪ T and constants, logical connectives (¬, ∧, ∨), inequality symbols (<, ≤, ≥, >), the equality symbol (=), unary predicates like bound, isBlank, and isIRI, plus other features (refer to [Prud'hommeaux and Seaborne, 2008] for a complete list). In the rest of this chapter, we use var(P) to denote the set of variables occurring in the graph pattern P. In particular, if P is a basic graph pattern, then var(P) denotes the set of variables occurring in the triple patterns that form P. Additionally, for a built-in condition R, we use var(R) to denote the set of variables occurring in R, and the condition var(R) ⊆ var(P) must hold. We conclude the definition of the algebraic framework by describing the formal syntax of the SELECT query result form. A SELECT SPARQL query is simply a tuple (W, P), where P is a SPARQL graph pattern expression and W is a set of variables such that W ⊆ var(P).

Example 2.2.3 Here is a brief example of a SPARQL query. Comment lines are prefixed with #. The query first defines prefixes that can be re-used later, in a similar fashion to that allowed by Turtle. The # DATASET CLAUSE selects the dataset over which the query should be run. The # RESULT CLAUSE states what kind of results should be returned for the query: in this case, a unique (i.e., DISTINCT) set of pairs of RDF terms matching the ?film and ?duration variables, respectively. Next, the # QUERY CLAUSE states the graph patterns that the query should match against: we are looking for all entities of rdf:type dbo:Film and their runtimes. Finally, the # SOLUTION MODIFIER section allows for putting a limit on the number of results returned, to order results, or to paginate results: in this case, a maximum (i.e., LIMIT) of two results is requested from the query.

# PREFIX DECLARATIONS
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbp: <http://dbpedia.org/property/>
# RESULT CLAUSE
SELECT DISTINCT ?film ?duration
# DATASET CLAUSE
FROM
# QUERY CLAUSE
WHERE {
  ?film rdf:type dbo:Film ;
        dbp:runtime ?duration .
}
# SOLUTION MODIFIER
LIMIT 2

If we assume the triples presented in Table 2.8, we expect a result like:

?film                  ?duration
dbr:Man_on_a_Ledge     6120
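The following is a minimal sketch, assuming rdflib, that loads a few of the example triples in memory and evaluates essentially the same SELECT query (without the dataset clause).

from rdflib import Graph, Literal, Namespace
from rdflib.namespace import RDF

DBO = Namespace("http://dbpedia.org/ontology/")
DBR = Namespace("http://dbpedia.org/resource/")
DBP = Namespace("http://dbpedia.org/property/")

g = Graph()
g.add((DBR.Man_on_a_Ledge, RDF.type, DBO.Film))
g.add((DBR.Man_on_a_Ledge, DBP.runtime, Literal(6120)))
g.add((DBR["Star_Wars:_The_Force_Awakens"], RDF.type, DBO.Film))

query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX dbp: <http://dbpedia.org/property/>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?film ?duration
WHERE { ?film rdf:type dbo:Film ;
              dbp:runtime ?duration . }
LIMIT 2
"""
for film, duration in g.query(query):
    print(film, duration)   # expected: dbr:Man_on_a_Ledge 6120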

SPARQL Semantics

Definition 2.2.6 (Triple and Basic Graph Patterns) A tuple t in (I ∪ L ∪ V) × (I ∪ V) × (I ∪ L ∪ V) is a triple pattern. A Basic Graph Pattern (BGP) is a finite set of triple patterns. Given a triple pattern t, var(t) is the set of variables occurring in t. Similarly, given a basic graph pattern P, var(P) = ⋃_{t∈P} var(t), i.e., var(P) is the set of variables occurring in P.


In triple patterns, blank nodes are considered as existential variables, i.e., as query variables whose bindings cannot be used outside of the query clause (and thus cannot be returned in results). Therefore, we do not consider blank nodes in SPARQL query patterns, since they can be treated analogously to non-distinguished query variables: variables that cannot be used elsewhere outside of the query-clause scope. A variable assignment is thus defined by a mapping:

Definition 2.2.7 (Mapping) A mapping µ from V to T is a partial function µ : V → T. The domain of µ, dom(µ), is the subset of V on which µ is defined. The empty mapping µ∅ is the mapping with dom(µ∅) = ∅ (i.e., µ∅ = ∅). Two mappings µ1 and µ2 are compatible when for all x ∈ dom(µ1) ∩ dom(µ2) it is the case that µ1(x) = µ2(x); in that case µ1 ∪ µ2 is also a mapping. Let Ω1 and Ω2 be sets of mappings; the join, union, difference, and left outer-join between Ω1 and Ω2 are defined as:

● Ω1 ⋈ Ω2 = {µ1 ∪ µ2 | µ1 ∈ Ω1, µ2 ∈ Ω2 are compatible mappings},

● Ω1 ∪ Ω2 = {µ | µ ∈ Ω1 or µ ∈ Ω2},

● Ω1 ∖ Ω2 = {µ ∈ Ω1 | for all µ′ ∈ Ω2, µ and µ′ are not compatible},

● Ω1 ⟕ Ω2 = (Ω1 ⋈ Ω2) ∪ (Ω1 ∖ Ω2).

The operation Ω1 ⋈ Ω2 is the set of mappings that result from extending mappings in Ω1 with their compatible mappings in Ω2. Ω1 ∪ Ω2 is the usual set-theoretical union. Ω1 ∖ Ω2 is the set of mappings in Ω1 that cannot be extended with any mapping in Ω2. A mapping µ is in Ω1 ⟕ Ω2 if it is the extension of a mapping of Ω1 with a compatible mapping of Ω2, or if it belongs to Ω1 and cannot be extended with any mapping of Ω2. These operations resemble the relational algebra operations, but over sets of mappings (partial functions). A set of mappings can be represented by a relation Ω over the domain T^{|V|}, where the variables in V are the attributes and the assignments are the tuples of this relation.
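These four operations are straightforward to implement over sets of mappings. The following minimal Python sketch (not from the thesis) represents solution mappings as dictionaries and follows Definition 2.2.7 directly; the example data echoes the movie scenario.

# A solution mapping is a dict from variable names to RDF terms (here, plain strings).

def compatible(m1, m2):
    """Two mappings are compatible if they agree on all shared variables."""
    return all(m1[v] == m2[v] for v in m1.keys() & m2.keys())

def join(o1, o2):
    return [{**m1, **m2} for m1 in o1 for m2 in o2 if compatible(m1, m2)]

def union(o1, o2):
    return o1 + o2

def diff(o1, o2):
    return [m1 for m1 in o1 if not any(compatible(m1, m2) for m2 in o2)]

def left_outer_join(o1, o2):
    return join(o1, o2) + diff(o1, o2)

# Mappings produced by two triple patterns of our movie scenario.
o1 = [{"film": "dbr:Man_on_a_Ledge"}, {"film": "dbr:Star_Wars:_The_Force_Awakens"}]
o2 = [{"film": "dbr:Man_on_a_Ledge", "duration": "6120"}]

print(join(o1, o2))             # [{'film': 'dbr:Man_on_a_Ledge', 'duration': '6120'}]
print(left_outer_join(o1, o2))  # additionally keeps the unmatched Star Wars mapping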

Definition 2.2.8 (Basic Graph Pattern and Mappings) Given a basic graph pattern P and a mapping µ such that var(P) ⊆ dom(µ), we have that µ(P) = ⋃_{t∈P} {µ(t)}, i.e., µ(P) is the set of triples obtained by replacing the variables in the triples of P according to µ.

We can now define the semantics of basic graph patterns as a function [[P]]^D_G that, given a basic graph pattern P, returns a set of mappings.

Definition 2.2.9 (Basic Graph Pattern Matching) Let G be an RDF graph of an RDF dataset D, and P a basic graph pattern. The evaluation of P over G, denoted by [[P]]^D_G, is defined as the set of mappings:

[[P]]^D_G = {µ : V → T | dom(µ) = var(P) and µ(P) ⊆ G}     (2.1)

If µ ∈ [[P]]^D_G, we say that µ is a solution for P in G.

The evaluation of complex graph patterns is defined recursively:


Definition 2.2.10 (Complex Graph Pattern Matching) Let G be an RDF graph of an RDF dataset D, P a basic graph pattern, and P1 and P2 graph patterns. The evaluation over G is defined recursively as follows:

● [[P]]^D_G is given by Definition 2.2.9,

● [[P1 AND P2]]^D_G = [[P1]]^D_G ⋈ [[P2]]^D_G,

● [[P1 OPT P2]]^D_G = [[P1]]^D_G ⟕ [[P2]]^D_G,

● [[P1 UNION P2]]^D_G = [[P1]]^D_G ∪ [[P2]]^D_G,

● [[P1 FILTER C]]^D_G = {µ ∈ [[P1]]^D_G | µ ⊧ C},

● [[u GRAPH P]]^D_G = [[P]]^D_{gr(u)},

● [[?X GRAPH P]]^D_G = ⋃_{v ∈ names(D)} ([[P]]^D_{gr(v)} ⟕ {µ_{?X→v}}), where µ_{?X→v} is the mapping such that dom(µ_{?X→v}) = {?X} and µ_{?X→v}(?X) = v.

Finally, [Arenas et al., 2009] defines the notion of equivalence for graph patterns.

Definition 2.2.11 (Equivalence of Graph Patterns) Two graph pattern expressions P1 and P2 are equivalent, denoted by P1 ≡ P2, if [[P1]]^D_G = [[P2]]^D_G for every graph G and RDF dataset D.

Proposition 2.2.1 Let P1, P2 and P3 be graph pattern expressions and R a built-in condition. Then:

● AND and UNION are associative and commutative.
● (P1 AND (P2 UNION P3)) ≡ ((P1 AND P2) UNION (P1 AND P3)).
● (P1 OPT (P2 UNION P3)) ≡ ((P1 OPT P2) UNION (P1 OPT P3)).
● ((P1 UNION P2) OPT P3) ≡ ((P1 OPT P3) UNION (P2 OPT P3)).
● ((P1 UNION P2) FILTER R) ≡ ((P1 FILTER R) UNION (P2 FILTER R)).

2.2.3. The Web Ontology Language

The Web Ontology Language (OWL) is a W3C Recommendation ontology language for the Semantic Web with more expressive semantics than RDFS [Dan Brickley and R.V. Guha, 2014] and a formally defined meaning [Deborah L. McGuinness and Frank van Harmelen, 2004, Christine Golbreich and Evan K. Wallace and Peter F. Patel-Schneider, 2012]. OWL can also be serialized as RDF triples. In fact, OWL re-uses the core RDFS vocabulary, but adds new vocabulary for defining classes, properties, individuals, and data values. The next table summarizes a small subset of the DL class descriptions that can be formed with OWL 2. The first column gives the Description Logic syntax, while the second column gives the equivalent OWL abstract syntax. Note that the letters A and R represent, respectively, names for classes (concepts) and object properties (abstract roles); C, possibly subscripted, represents an arbitrary class description.


DL Syntax          Abstract Syntax
A                  A
⊤                  owl:Thing
⊥                  owl:Nothing
C1 ⊓ ... ⊓ Cn      intersectionOf(C1 ... Cn)
C1 ⊔ ... ⊔ Cn      unionOf(C1 ... Cn)
¬C                 complementOf(C)
∀R.C               restriction(R allValuesFrom(C))
∃R.C               restriction(R someValuesFrom(C))
≥ n R              restriction(R minCardinality(n))
≤ n R              restriction(R maxCardinality(n))

Later in this thesis, we present an approach for reasoning with provenance information (see Chapter 4). Even though it is generic enough to be applicable beyond OWL, we focus on the expressive description logic SRIQ(D) underlying OWL Lite and OWL DL. Hence, we briefly revisit the definition of SRIQ(D) and the fundamental reasoning problems related to DLs. For a complete definition of SRIQ, refer to Horrocks et al. [Horrocks et al., 2005, Horrocks et al., 2006], and for Description Logics refer to [Baader et al., 2003].

Definition 2.2.12 (Vocabulary) A vocabulary V = (NC ,NR,NI ) is a triple where

● NC is a set of URIs used to denote classes,

● NR is a set of URIs used to denote roles, and

● NI is a set of URIs used to denote individuals.

NC ,NR,NI need not be disjoint.

An interpretation grounds the vocabulary in objects from an object domain.

Definition 2.2.13 (Interpretation) Given a vocabulary V, an interpretation I = (∆I, ·IC, ·IR, ·II) is a quadruple where

● ∆I is a nonempty set called the object domain;

● ·IC is the class interpretation function, which assigns to each class a subset of the object domain, ·IC : NC → 2^{∆I};

● ·IR is the role interpretation function, which assigns to each role a set of pairs over the object domain, ·IR : NR → 2^{∆I × ∆I};

● ·II is the individual interpretation function, which assigns to each individual a ∈ NI an element a^II from ∆I.

Let C, D ∈ NC, let R, Ri, S ∈ NR and a, ai, b ∈ NI. We extend the role interpretation function ·IR to role expressions:

(R⁻)^IR = {⟨x, y⟩ | ⟨y, x⟩ ∈ R^IR}

We extend the class interpretation function ⋅IC to class descriptions:


top                    ⊤^IC ≡ ∆I
bottom                 ⊥^IC ≡ ∅
conjunction            (C ⊓ D)^IC ≡ C^IC ∩ D^IC
disjunction            (C ⊔ D)^IC ≡ C^IC ∪ D^IC
negation               (¬C)^IC ≡ ∆I ∖ C^IC
value restriction      (∀R.C)^IC ≡ {x ∈ ∆I | ∀y : ⟨x, y⟩ ∈ R^IR → y ∈ C^IC}
exists restriction     (∃R.C)^IC ≡ {x ∈ ∆I | ∃y ∈ ∆I : ⟨x, y⟩ ∈ R^IR ∧ y ∈ C^IC}
self restriction       (∃R.Self)^IC ≡ {x ∈ ∆I | ⟨x, x⟩ ∈ R^IR}
at-least restriction   (≥ n R)^IC ≡ {x ∈ ∆I | ∃ pairwise distinct y1, ..., ym ∈ ∆I : ⟨x, y1⟩, ..., ⟨x, ym⟩ ∈ R^IR ∧ m ≥ n}
at-most restriction    (≤ n R)^IC ≡ {x ∈ ∆I | ∄ pairwise distinct y1, ..., ym ∈ ∆I : ⟨x, y1⟩, ..., ⟨x, ym⟩ ∈ R^IR ∧ m > n}

Class expressions are used in axioms.

Definition 2.2.14 (Axiom) An axiom is one of the following:

● A general concept inclusion of the form C ⊑ D for concepts C and D;

● An individual assertion of one of the forms a : C, (a, b) : R, (a, b) : ¬R, a = b or a ≠ b for individuals a, b and a role R;

● A role assertion of one of the forms R ⊑ S, R1 ○ ... ○ Rn ⊑ S, Asy(R), Ref(R), Irr(R), Dis(R, S) for roles R, Ri, S.

The following table shows a small subset of the features provided by OWL 2 together with the equivalent Description Logic syntax.

DL Syntax            Abstract Syntax
R1 ⊑ R2              SubPropertyOf(R1 R2)
C1 ⊑ C2              SubClassOf(C1 C2)
o ∈ C                ClassAssertion(C o)
o1 ≡ ... ≡ on        SameIndividual(o1 ... on)

Example 2.2.4 Continuing with our running example about movies, we present a simple illustration of some features of OWL. The first statement defines that the entity named :Star_Wars:_The_Force_Awakens is a film. The second one states that the entity :Harrison_Ford is an actor. With the following statements, we specify relationships between the entities by indicating that :Harrison_Ford has acted in the :Star_Wars:_The_Force_Awakens movie.

ClassAssertion(:Film :Star_Wars:_The_Force_Awakens)
ClassAssertion(:Actor :Harrison_Ford)
ObjectPropertyDomain(:act :Film)
ObjectPropertyRange(:act :Actor)
ObjectPropertyAssertion(:act :Star_Wars:_The_Force_Awakens :Harrison_Ford)

We now define satisfaction of axioms.


Definition 2.2.15 (Satisfaction of Axioms) Satisfaction of axioms in an interpretation I is defined as follows. With ○ we denote the composition of binary relations.

(R ⊑ S)^I              ≡ ⟨x, y⟩ ∈ R^I → ⟨x, y⟩ ∈ S^I
(R1 ○ ... ○ Rn ⊑ S)^I  ≡ ∀⟨x, y1⟩ ∈ R1^I, ⟨y1, y2⟩ ∈ R2^I, ..., ⟨yn−1, z⟩ ∈ Rn^I : ⟨x, z⟩ ∈ S^I
(Asy(R))^I             ≡ ∀⟨x, y⟩ ∈ R^I : ⟨y, x⟩ ∉ R^I
(Ref(R))^I             ≡ ∀x ∈ ∆I : ⟨x, x⟩ ∈ R^I
(Dis(R, S))^I          ≡ R^I ∩ S^I = ∅
(Irr(R))^I             ≡ ∀x ∈ ∆I : ⟨x, x⟩ ∉ R^I
(C ⊑ D)^I              ≡ x ∈ C^I → x ∈ D^I
(a : C)^I              ≡ a^I ∈ C^I
((a, b) : R)^I         ≡ ⟨a^I, b^I⟩ ∈ R^I
((a, b) : ¬R)^I        ≡ ⟨a^I, b^I⟩ ∉ R^I
(a = b)^I              ≡ a^I = b^I
(a ≠ b)^I              ≡ a^I ≠ b^I

An ontology comprises a set of axioms.

Definition 2.2.16 (Ontology) A SRIQ(D) ontology O is a set of axioms as defined in definition 2.2.14.

In this work we make use of two basic reasoning mechanisms for SRIQ(D), i.e. consistency checking and axiom entailment.

Definition 2.2.17 (Reasoning with SRIQ(D)) We say an interpretation I of an ontology O is a model of O (I ⊧ O) if all axioms in O are satisfied in I. We say an ontology O is consistent if there exists a model of O. We say an axiom A is entailed by an ontology O (O ⊧ A) if A is satisfied in all models of O.

Obviously, a subontology is again an ontology. Moreover, if O′ ⊆ O and O′ ⊧ A, then also O ⊧ A. Reasoning is important for using and working with ontologies. The reasoning task of computing the individuals that belong to a given (set of) class(es) is called instance retrieval; instance checking is used when one wants to find out whether one particular individual belongs to a given class. Computing all subclass relationships between a set of classes is called classification, and checking whether one particular class is subsumed by another is called subsumption checking. Another important reasoning task is consistency checking, which is used to check whether an ontology is logically consistent or contradictory. An inconsistent ontology entails every axiom and is therefore of no practical use. In the following, we show an inconsistency detection example.


Example 2.2.5 Suppose we start out with the ontology from Example 2.2.4. We extend this initial example with the axiom

DisjointClasses(:Actor :FilmProducer)

i.e., film producers and actors are disjoint. This ontology is logically consistent. Now we add the axiom

ClassAssertion(:FilmProducer :Harrison_Ford)

stating that the entity :Harrison_Ford is a film producer. Obviously, this results in an inconsistent ontology, since we have stated that actors and film producers are disjoint.
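For intuition, the clash in this particular example can be detected by a very simple check: an individual asserted to belong to two classes that are declared disjoint. The following is a minimal, purely illustrative Python sketch of that check (not the thesis's method); real OWL reasoners such as HermiT or Pellet implement full tableau-based procedures rather than this ad hoc test.

# Toy check for the inconsistency pattern of Example 2.2.5:
# an individual asserted to belong to two classes declared disjoint.
class_assertions = {
    ("Film", "Star_Wars:_The_Force_Awakens"),
    ("Actor", "Harrison_Ford"),
    ("FilmProducer", "Harrison_Ford"),      # the newly added axiom
}
disjoint_classes = {frozenset({"Actor", "FilmProducer"})}

def find_clashes(assertions, disjointness):
    clashes = []
    for c1, ind1 in assertions:
        for c2, ind2 in assertions:
            if ind1 == ind2 and c1 < c2 and frozenset({c1, c2}) in disjointness:
                clashes.append((ind1, c1, c2))
    return clashes

print(find_clashes(class_assertions, disjoint_classes))
# [('Harrison_Ford', 'Actor', 'FilmProducer')] -> the ontology is inconsistent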

2.2.4. Linked Data: Principles and Best Practices

Linked Data, the publication of data on the Web, refers to a set of principles and best practices by which the Semantic Web standards (RDF, OWL, SPARQL, etc.) can be effectively deployed on the Web. Linked Data aims to facilitate access to structured data on the Web and the easy reuse of this data. We will briefly introduce some concepts related to Linked Data; for more information please refer to [Hogan, 2014, Bizer et al., 2009]. Linked Data relies on two technologies that are fundamental to the Web: Uniform Resource Identifiers (URIs) [T. Berners-Lee, 2005] and the Hypertext Transfer Protocol (HTTP) [R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, T. Berners-Lee, 1999]. The HTTP protocol is used to access a resource identified by a Uniform Resource Identifier (URI), simply by dereferencing the URI over the HTTP protocol, and to retrieve structured data (RDF) about this resource. The structured data about that resource can provide links to other resources, enabling further discovery and thus forming the Web of Data.

Example 2.2.6 The data used in our toy example so far has been taken from DBpedia:

# http://dbpedia.org/data/Man_on_a_Ledge.xml
dbr:Man_on_a_Ledge rdfs:label "Man_on_a_Ledge"@en ;
                   dbp:runtime 6120 ;
                   rdf:type dbo:Film ;
                   ...

The first document, obtained from the Linked Data source DBpedia by looking up the URI http://dbpedia.org/data/Man_on_a_Ledge.xml using dereferencing, is about the movie dbr:Man_on_a_Ledge. According to this document, the resource dbr:Man_on_a_Ledge is of type dbo:Film. Information about the resource dbo:Film is available in a different document, displayed below. Thus, the rdf:type triple describes a link between these documents. By looking up the URI of dbo:Film using dereferencing, we can retrieve this document about dbo:Film.

# http://dbpedia.org/data/Film.xml
dbo:Film rdfs:label "Film"@en ;
         owl:sameAs de_dbo:Filme ;
         ...
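The following is a minimal sketch of how such a lookup could be performed programmatically, assuming the requests and rdflib libraries; the Accept header asks for a Turtle serialization via HTTP content negotiation, and DBpedia redirects the resource URI to the corresponding data document.

import requests
from rdflib import Graph

# Dereference a DBpedia resource URI, asking for an RDF serialization.
uri = "http://dbpedia.org/resource/Man_on_a_Ledge"
response = requests.get(uri, headers={"Accept": "text/turtle"})

g = Graph()
g.parse(data=response.text, format="turtle")

# Print all retrieved triples; the linked IRIs (e.g., dbo:Film) can be
# dereferenced in the same way to discover further data.
for s, p, o in g:
    print(s, p, o)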


Berners-Lee introduced the Linked Data principles in the initial W3C Design Issues document [Berners-Lee, 2006]. The four Linked Data principles are as follows:

1. Use URIs as names for things.

2. Use HTTP URIs, so that people can look up those names.

3. When someone looks up a URI, provide useful information, using the standards (RDF, SPARQL).

4. Include links to other URIs, so that they can discover more things.

In summary, the Linked Data best practices lay the foundations for extending the Web with a global data space based on the same architectural principles as the classic document Web, connecting data from diverse domains such as people, companies, books, scientific publications, films, music, television and radio programmes; genes, proteins, drugs and clinical trials; and online communities, statistical and scientific data.

2.3. The Use of Provenance in Semantic Web Applications

On the Semantic Web, a large and quite diverse amount of work on provenance has already been carried out. In a 2007 publication, Sizov [Sizov, 2007] pointed out the importance of provenance information in the Semantic Web as a tool to assess the veracity of knowledge. He argued that provenance information contributes to the Semantic Web layer cake's proof layer, as it delivers explanations of where the presented knowledge comes from and how the resulting conclusions have been constructed (see Figure 2.3). Many surveys on provenance have been published since then, highlighting the foundations, challenges and opportunities of managing provenance in the Semantic Web [Moreau, 2010, Freire, 2009, Lakshmanan et al., 2011, Bienvenu et al., 2012]. In the next section, we present a literature review of the work on provenance in the Semantic Web.

2.3.1. Recent Work on Provenance in the Semantic Web

Representation and Querying: Recent work on provenance in the Semantic Web domain includes representational mechanisms for provenance and ways of using it. For example, in [Carroll et al., 2005] the authors propose an application of so-called named RDF graphs for publishing information on the Semantic Web. Information providers publish information together with provenance about its intended assertional status. An information consumer may accept some of these claims and reject others. In [Ding et al., 2005] the authors describe the use of provenance and trust information in the context of homeland security analysis. Their work concentrates on the description of an inference framework addressing the use of provenance to evaluate the trustworthiness of derived facts and to prune search on the Semantic Web. Provenance mechanisms for RDF datasets have been investigated by Udrea et al. [Udrea et al., 2010], where the authors provide a semantics for RDF augmented with a partially ordered set and algorithms for query processing and view maintenance. The literature describes several approaches to extract data provenance/annotated information from


Figure 2.3.: Provenance and the Semantic Web Layer Cake. Taken from the presentation “Provenance Analysis and RDF Query Processing: W3C PROV for Data Quality and Trust” by Satya Sahoo and Praveen Rao, October 2015.

RDF(S) data [Straccia et al., 2010, Zimmermann et al., 2012] supporting RDFS entailment. Lopes et al. [Lopes et al., 2010] go beyond supporting RDFS entailment and provide a query language extending many of the SPARQL features in order to deal with annotated data, exposing annotations at query level via annotation variables, and including aggregates and subqueries. Kwasnikowska et al. [Kwasnikowska et al., 2015] specify a formal account of the Open Provenance Model (OPM) [Moreau et al., 2011]. The OPM is a data model for provenance that is designed to facilitate the meaningful interchange of provenance information between systems. OPM graphs are directed graphs where nodes represent data products and processes involved in past computations and edges represent dependencies between them. Kwasnikowska et al. propose a formal semantics for the OPM by viewing OPM graphs as temporal theories on the temporal events represented in the graph. Hartig [Hartig, 2009a] proposes a specialization of the OPM, referred to as the provenance vocabulary, to describe the provenance of Linked Data on the Web. His model accounts for the creation and access of RDF data. Orlandi and Passant [Orlandi and Passant, 2011] use the OPM to represent provenance (in RDF) about DBpedia statements. Aldeco-Pérez and Moreau [Aldeco-Pérez and Moreau, 2010] propose a provenance-based compliance framework based on the OPM. The framework provides a processing view (represented as a provenance graph for a specific execution time) and a usage policy definition (UPD), and validates the UPD against the processing view for compliance. The framework lacks integration with commercial applications and policy standards such as XACML. The PROV data model (PROV-DM) has recently become a recommendation of the World Wide Web Consortium to represent provenance information. The PROV data model will be covered in more detail later in this chapter.

Relational Provenance Model Extensions for SPARQL Queries: In [Theoharis et al., 2011,


Geerts et al., 2013, Karvounarakis et al., 2013], Theoharis et al. and Karvounarakis et al. discuss the need for extending relational provenance models so that they can be leveraged for SPARQL queries over RDF. In particular, they argue that the use of semiring models for SPARQL has been shown to be inadequate to handle the OPTIONAL construct, and they advocate the need for a new abstract provenance model capturing the full expressiveness of SPARQL. In addition, Geerts et al. [Geerts et al., 2016] identify SPARQL fragments for which provenance models for positive relational queries can be leveraged, despite the subtle differences between the semantics of SPARQL and relational algebra operators. In [Amsterdamer et al., 2011c], the authors show particular semirings for which an extension supporting difference is impossible. The consideration of extended provenance models for capturing the semantics of both explicit and scoped weak negation is addressed in [Analyti et al., 2014]. In [Damásio et al., 2012], Damásio et al. propose a provenance model for a significant fragment of SPARQL 1.1, based on the relational annotated provenance models and including non-monotonic constructs under multiset semantics. Amsterdamer et al. [Amsterdamer et al., 2012, Amsterdamer et al., 2011b] propose the notion of core provenance, which is defined in a very general way (using the semiring provenance model) and whose representation is minimal and compact. They study algorithms that, given a query, compute an equivalent query that realizes the core provenance for all tuples in its result. In [Deutch et al., 2015b], Deutch et al. describe an approach for web-scale provenance tracking. Their approach allows selective tracking of how-provenance, where the selection criteria are partly based on the metadata itself, thereby significantly reducing provenance size.

Provenance Dimensions: Hartig [Hartig, 2009b] presents a trust-aware extension to SPARQL, called tSPARQL. Hartig first proposes a trust model that associates RDF statements with trust values and then extends the SPARQL semantics to access these trust values in tSPARQL. In [Bonatti et al., 2011], Bonatti et al. formalize a logical framework for determining trust in order to perform robust reasoning. In [Schenk, 2008], Schenk models multiple levels of trust on Web data to resolve inconsistencies arising from connecting multiple data sources. The version-control aspect of provenance, such as temporal RDF [Gutiérrez et al., 2005] and querying time in RDF and OWL [Motik, 2012], has been tackled by attaching temporal annotations to RDF facts and OWL axioms.

Summarization: Moreau [Moreau, 2015] presents a summarization technique for provenance graphs. Provenance summaries can be considered as “compact user views” and have good potential to detect anomalies and outliers. Summaries can be displayed using a minimally adapted version of the PROV Working Group diagrams [PRO, 2012]. Richardson and Moreau [Richardson and Moreau, 2016] have shown how more sophisticated architectures for natural language generation can be applied to the task of explaining provenance graphs.

Evolving RDF datasets: Some work, such as that of Flouris et al. [Flouris et al., 2009], Avgoustaki et al. [Avgoustaki et al., 2016], and Halpin and Cheney [Halpin and Cheney, 2014], considers evolving RDF datasets. In particular, Flouris et al. [Flouris et al., 2009] consider the problem of how to maintain provenance information for RDFS inferences when tuples


are inserted or deleted, using coherence semantics. Avgoustaki et al. [Avgoustaki et al., 2016] introduce a provenance model that borrows from both where- and how-provenance models, is suitable for encoding both triple- and attribute-level provenance of quadruples obtained via INSERT updates, and allows the reconstructability of such updates from their provenance. Halpin and Cheney [Halpin and Cheney, 2014] consider a provenance model for SPARQL queries and updates to data stores involving named graphs, whose purpose is to provide a record of how the raw data in a dataset has changed over time, and then tackle the problem of recording and providing (queryable) access to the detailed change history of an RDF graph that is updated over time via SPARQL updates.

Crowdsourcing: In order to handle Linked Data quality problems, different approaches have been proposed in the literature; see [Dragan et al., 2015, Keshavarz et al., 2014, Zhao and Hartig, 2012, Huynh et al., 2013]. Basically, they make use of metadata which are captured by provenance frameworks (using specific provenance vocabularies) or by crowdsourcing. Such metadata are often provided by agents (applications and/or users) which manually analyze the data on the basis of given criteria.

Storage: In [Wylot et al., 2014, Wylot et al., 2015a], Wylot et al. present TripleProv, a native RDF store that allows tracking and querying provenance over Web data. In later work [Wylot et al., 2015b], Wylot et al. use the TripleProv store to investigate the effectiveness of different query execution strategies for provenance-enabled queries.

Provenance in workflow systems: Eckert et al. [Eckert et al., 2014] present a framework which integrates all phases of a typical workflow lifecycle, including the specification of services, their composition into workflows, and the execution of the workflows. All resources in this lifecycle are described with the specified ontology; the provenance of all resources can be tracked and exposed using the W3C PROV-DM [Belhajjame et al., 2013a]. The RDFProv system [Chebotko et al., 2010] focuses on managing and enabling querying over provenance that results from scientific workflows. Biton et al. [Biton et al., 2007] showed how user views can be used to reduce the amount of information returned by provenance queries in a workflow system. In [Hoekstra and Groth, 2015], Hoekstra and Groth propose Sankey diagrams to represent provenance in a process-centric way.

2.3.2. An Introduction to the PROV Standard

PROV, the provenance standard, is a family of specifications which has recently been released by the Provenance Working Group. PROV aims to define a generic data model for provenance that can be extended in a principled way to suit many application areas. Here we provide a short overview of PROV. For more details, please refer to [Moreau and Groth, 2013, Moreau et al., 2015]. The PROV family of documents is a set of specifications allowing provenance to be modeled, serialized, exchanged, accessed, merged, translated, and reasoned over, i.e., they define a data model and an OWL ontology, along with a number of serializations for representing aspects of provenance. The term provenance, as understood in these specifications, refers to information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness (PROV-Overview [Groth and Moreau, 2013]).


The Recommendation documents include (i) the main PROV data model specification (PROV-DM [Belhajjame et al., 2013a]), with an associated set of constraints and inference rules (PROV-CONSTRAINTS [Nies, 2013]); (ii) an OWL ontology that allows a mapping of the data model to RDF (PROV-O [Belhajjame et al., 2013b]); and (iii) a notation for PROV with a relational-like syntax, aimed at human consumption (PROV-N [Cheney and Soiland-Reyes, 2013]). All other documents are Notes. These include PROV-XML, which defines an XSD schema for XML serialization [Moreau, 2013]; PROV-AQ, the Provenance Access and Query document [Moreau et al., 2013], which defines a Web-compliant mechanism to associate a dataset with its provenance; PROV-DICTIONARY [Missier et al., 2013], for expressing the provenance of data collections defined as sets of key-entity pairs; and PROV-DC [Miles et al., 2013], which provides a mapping between PROV-O and Dublin Core Terms.

PROV-DM provides an operational definition of provenance for the community to use and is organized into six components, dealing respectively with: (1) entities and activities, and the time at which they were created, used, or ended; (2) derivations of entities from entities; (3) agents bearing responsibility for entities that were generated and activities that happened; (4) a notion of bundle, a mechanism to support provenance of provenance; (5) properties to link entities that refer to the same thing; and (6) collections forming a logical structure for their members.

The purpose of standardizing PROV was to specify a comprehensive yet minimal set of constructs that could easily address many areas of application. PROV is designed to be extensible; therefore, it is expected that with more implementations of, and applications using, the PROV standards, more concepts (such as dictionary, mention, representation and mappings) may appear. For instance, Huynh et al. [Huynh et al., 2016] propose a new representation for PROV based on JSON-LD. Given its compatibility with both JSON and the RDF data model, in addition to being amenable to stream processing, PROV-JSONLD has a vast range of useful applications and enables the adoption of PROV as the provenance standard of choice in Linked Data and Web applications alike. Furthermore, in previous work, Huynh et al. [Huynh and Moreau, 2014] presented ProvStore, an online public provenance repository supporting the PROV standards. It allows users and applications to store and (optionally) publish the provenance of their data on the Web.

In this thesis we do not use the PROV data model and ontology for representing and exchanging provenance. This work focuses on providing a semantics for querying, reasoning over, and efficiently handling provenance data within RDF database systems as well as dynamic ontologies. We do provide representational mechanisms for provenance; nevertheless, our work could incorporate (parts of) the PROV standards.
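To make the PROV-DM core concepts more tangible, the following is a minimal, illustrative sketch using the third-party Python prov package, a community implementation of PROV-DM and its serializations. The ex: namespace, the identifiers (ex:report, ex:analysis, ex:alice) and the chosen relations are invented for illustration; as noted above, this thesis itself does not build on PROV.

# Minimal PROV-DM sketch using the third-party Python "prov" package
# (pip install prov); the namespace and identifiers are invented for
# illustration and are not examples from this thesis.
from prov.model import ProvDocument

doc = ProvDocument()
doc.add_namespace("ex", "http://example.org/")

# Component (1): entities and activities
report = doc.entity("ex:report")
analysis = doc.activity("ex:analysis")

# Component (3): an agent bearing responsibility
alice = doc.agent("ex:alice")

# Core relations: generation, usage, attribution
doc.wasGeneratedBy(report, analysis)
doc.used(analysis, doc.entity("ex:dataset"))
doc.wasAttributedTo(report, alice)

print(doc.get_provn())  # serialize in the human-readable PROV-N notation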

Part II.

Provenance Management in the Semantic Web


3. Querying RDF Datasets with Provenance

Overview

The Semantic Web is based on accessing and reusing RDF data from many different sources, to which one may assign different levels of authority and credibility. Existing Semantic Web query languages, like SPARQL, have targeted the retrieval, combination, and reuse of facts, but have so far ignored all aspects of provenance, such as origins, authorship, recentness or certainty of data. In this chapter, we present an original, generic, formalized and implemented approach for managing many dimensions of provenance, like source, authorship, and certainty, among others. The approach re-uses existing RDF modeling possibilities in order to represent provenance. Then, it extends SPARQL query processing in such a way that given a SPARQL query for data, one may request provenance without modifying the query proper. Thus, our approach achieves highly flexible and automatically coordinated querying for data and provenance, while completely separating the two areas of concern.

Structure

The structure of this chapter is as follows: Section 3.2 presents a motivation scenario that emphasizes the need for provenance within query processing. In Section 3.3, we begin to explain our approach with a discussion of important design choices. We model provenance in existing RDF structures by embedding a slightly more expressive language, which we call RDF+, into RDF. We then define the abstract syntax of RDF+, its semantics, and its embedding in RDF. In Section 3.4, we present our approach to the treatment of provenance when querying Semantic Web data by extending the SPARQL syntax and semantics to work on the data and provenance of RDF+. The extension allows the user to extend a given conventional SPARQL query by a keyword for provenance, triggering the construction of provenance by the query processor. We demonstrate in Section 3.5 the equivalences holding between the standard evaluation of SPARQL queries and the evaluation with provenance. Section 3.6 summarizes the previously discussed steps of provenance representation and utilization for the sample scenario introduced in Section 3.2. We report on initial promising results for provenance processing from a theoretical point of view in Section 3.7 and provide pointers to the prototype implementation of the system and initial experimental testing in Section 3.8. Section 3.9 reviews existing research on querying and ranking RDF data and the use of provenance in the Semantic Web. Section 3.10 concludes this chapter.


3.1. Introduction

Integrating and re-using Semantic Web data is proving to be a more and more fruitful method of answering questions and delivering results. Typically, engines like Glimmer1, Swoogle2, and Datahub3 provide points of access for RDF data, and query languages like SPARQL with their corresponding query engines allow for selecting and re-using data in the appropriate format. When querying the Web, however, we are faced with a highly variable quality of information. With the increasing amount of data available on the Web and more sophisticated processing through query and reasoning engines, one now encounters challenging questions linked to the provenance of the data, such as:

● Where is this data from?
● Who provided the data?
● When was this data provided?
● Was the provider certain about the truth of this data?
● Has the data been verified by others?

Provenance provides knowledge that can be used to quantify the value of information, and may come in different forms, e.g., trustworthiness of a source, time of validity, certainty and absence of vagueness. Therefore, techniques to find the relevant information in Web data should include ways to investigate the value of that information. For instance, when querying the Semantic Web with the help of SPARQL for the occupation of a person named 'James Cameron', one finds (at least) two answers: 'Film Producer' and 'Truck Driver'. Without further indication as to where, by whom, when, etc. such information was given, it is impossible to decide which of the two occupations is still valid. The problem might be remedied in several ways. First, an idiosyncratic solution by the search engine, such as returning the corresponding RDF files or links to sources of knowledge extraction (say http://dbpedia.org/data/James_Cameron.xml and http://www.myspace.com/deprecated.doc), might help in this special case. However, an idiosyncratic solution may not be appropriate in a second case in which the when is more relevant than the where, or even in a third case where such a piece of information has to be aggregated from several resources and multiple where, who, and when values have been retrieved. Second, the person or system requesting the provenance information might manually extend the SPARQL query formalizing the request for the occupation in order to return the where, the who and the when. Such a modification will, however, be very tedious, as it will include a number of additional optional statements, and expressing it manually will be error prone. Also, it will not help in delivering provenance that arises from joining several statements, e.g. provenance that was extracted from several statements with different provenance values. Therefore, querying Semantic Web data requires a principled, generic approach to the treatment of provenance that is able to adapt to its many dimensions and that is open to accommodate itself to new dimensions when the need arises.

1 YAHOO! Glimmer: http://glimmer.research.yahoo.com/
2 Swoogle Semantic Web Search: http://swoogle.umbc.edu/
3 Datahub: https://datahub.io/dataset

Such a principled, original framework is given in this chapter. The approach re-uses existing RDF modeling possibilities in order to represent provenance. Then, it extends SPARQL query processing in such a way that given a SPARQL query for data, one may request provenance without modifying the query proper. Thus, our approach achieves highly flexible and automatically coordinated querying for data and provenance, while completely separating the two areas of concern.

Research Questions

This chapter addresses the following research questions:

RQ 1.I Can RDF be used to represent provenance in its different dimensions, e.g., source, certainty, and timestamp? RDF plays the role of a common model, as data can easily be integrated and combined with other data published on the Web, which is basically done by establishing links between Web resources. Provenance should then be embedded in RDF in such a way that it retains upward compatibility with existing usage of the language and corresponding tools and methods, which is a major concern for Semantic Web approaches.

RQ 1.II Does the treatment of provenance within the SPARQL query language require changing the existing SPARQL semantics (thus leaving it unsupported by existing query engines)? The SPARQL query language enables querying in an interlinked data space (RDF graphs) that goes beyond simple keyword searches. SPARQL mechanisms, however, do not include techniques to find the relevant information in Web data. Therefore, query engines need mechanisms to track provenance in its many dimensions when exploring data.

RQ 1.III Does the exploitation of provenance lead to computational overhead? In general, provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information rich. So the manner in which provenance is propagated along the computation is crucial to its scalability.

3.2. Motivation Scenario

In our scenario, the sample user aims to explore knowledge and provenance information by querying facts extracted from the Web. Here, we assume that he aims to find movies and movie producers. Example 3.2.1 shows the relevant facts that may have been obtained from different websites.

Example 3.2.1 Relevant facts obtained from different websites

Site A: JamesCameron's occupation is Film Producer. JamesCameron produces Avatar.
Site B: JamesCameron's occupation is Truck Driver. JamesCameron produces Battle Beyond the Stars. Martin Scorsese's occupation is Film Producer. Martin Scorsese produces Taxi Driver.


In addition, the user wants to exploit provenance information for obtaining results with the highest certainty and for analyzing contradictory answers (e.g. different occupations for the same person 'James Cameron' in Example 3.2.1). An example of provenance associated with the extracted facts is presented in Example 3.2.2 and shows that the facts have been extracted from different data sources, at different timepoints, and with different degrees of extraction confidence.

Example 3.2.2 Associated provenance of facts presented in Example 3.2.1

Site A: source = dbpedia.org/data/JamesCameron.xml, certainty degree = 0.9, timestamp = 5/5/2014
Site B: source = myspace.com/deprecated.doc, certainty degree = 0.6, timestamp = 6/6/1980

3.3. Syntax and Semantics for RDF with Provenance

In order to adapt our sample application scenario to our framework, we formalize it using the notion of Named RDF Graphs [Carroll et al., 2005, Carroll and Stickler, 2004]. We assume that the user utilizes knowledge which has been initially extracted from different Web pages and stored locally in the form of RDF triples. Example 3.3.1 shows the relevant facts already presented in Example 3.2.1 in the RDF triple language TriG [Bizer and Cyganiak, 2014] with Named Graphs, in a simplified form that abstracts from (default) namespaces.

Example 3.3.1 Facts of Example 3.2.1 represented in TriG with Named Graphs

G1 { JamesCameron occupation FilmProducer .
     JamesCameron produces Avatar }

G2 { JamesCameron occupation TruckDriver .
     JamesCameron produces BattleBeyondtheStars .
     MartinScorsese occupation FilmProducer .
     MartinScorsese produces TaxiDriver }

Likewise, the provenance associated with the extracted knowledge presented in Example 3.2.2 is stored in the same RDF repository using the notion of Named RDF Graphs.

Example 3.3.2 Provenance of Example 3.2.2 represented in TriG with Named Graphs

G3 { G1 prov:source <dbpedia.org/data/JamesCameron.xml> .
     G1 prov:certainty '0.9' .
     G1 prov:timestamp "5/5/2014" }

G4 { G2 prov:source <myspace.com/deprecated.doc> .
     G2 prov:certainty '0.6' .
     G2 prov:timestamp "6/6/1980" }


In the examples above we have used existing RDF modeling possibilities in order to represent provenance. However, we must distinguish the notation of RDF with only an implicit notation of provenance, and no semantic consequences specifically due to this provenance, from a formally extended model of RDF with an explicit notation of provenance. In the course of representing and reasoning with provenance we embed a language with provenance reasoning, i.e., RDF+, into a language without such specific facilities, i.e., into RDF. This embedding implies that we may consider an RDF snippet in its literal sense and we may possibly interpret it as making a provenance statement. Embedding provenance in RDF is not the most expressive means to deal with all the needs of provenance processing, but it retains upward compatibility with existing usage of the language and corresponding tools and methods, which is a major concern for Semantic Web approaches. The following definition of RDF+ helps us to draw this line very clearly and concisely. The abstract syntax for this embedded language, RDF+, is given in Section 3.3.2 and its semantics in Section 3.3.3. Then we show how to embed RDF+ in RDF with named graphs.

3.3.1. RDF Design Choices

This section summarizes and briefly motivates the design choices for our provenance framework.

Reification

Establishing relationships between knowledge and provenance requires appropriate reification mechanisms for supporting statements about statements. Our general objective is to execute queries on the original data (i.e. without provenance) directly, without complex transformations. For compliance with existing applications that access the repository in a common way (e.g. using SPARQL queries), we do not modify existing user data. This requirement does not allow us to use mechanisms like RDF reification, which decompose existing triples and fully change the representational model. In our framework, described in Section 3.3, we adopt the notion of Named RDF Graphs for provenance representation [Carroll et al., 2005, Carroll and Stickler, 2004].

Storage Mechanisms

Following the overall philosophy of RDF, we do not separate provenance from user knowledge in the repository. Under this paradigm, a user or developer has unlimited access to all contents of the triple store and can manipulate provenance directly. In other words, the user can directly access provenance (e.g. using suitable SPARQL queries). Beyond explicitly designed queries for provenance access, in Section 3.4 we describe the extension of SPARQL that allows us to access provenance about the result set automatically, without user intervention.

Provenance Dimensions

An important point for the application design is the definition of relevant provenance dimensions and their suitable interpretation for arbitrarily complex query patterns. In general, these dimensions are application dependent and must be carefully chosen by the system administrator. In our scenario (Section 3.2 and Section 3.6) we discuss common and widely used dimensions, such as timestamp, source, and (un)certainty, and show ways of defining and utilizing them in our framework.

Syntax Extensions

Seamlessly integrated access to provenance requires corresponding extensions of existing querying mechanisms. These can be realized at different levels, for instance at the level of query languages (e.g. SPARQL) or at the level of application-specific interfaces (e.g. the Sesame API). In Section 3.4 we describe our SPARQL extension for constructing query results with associated provenance. It is system-independent and not tied to a particular implementation of the RDF repository. Furthermore, it fully supports the existing SPARQL syntax and semantics. Compliance with existing established standards makes the integration with existing applications and interfaces substantially easier.

3.3.2. Syntax for RDF with Provenance

The abstract syntax of RDF+ is based on the same building blocks as RDF:

● U is the set of Uniform Resource Identifiers (URIs).
● L is the set of all RDF literals.
● G ⊆ U is the set of graph names.
● P ⊆ U is the set of properties.

In addition, we must be able to refer to statements directly without use of reification. For this purpose, we exploit the notion of (internal) unique statement identifiers:

● Θ is a set of statement identifiers, which is disjoint from U and L.

Now, we may define RDF+ literal statements that are placed in named graphs and have, in addition to RDF, a globally unique statement identity.

Definition 3.3.1 (RDF+ Literal Statements) The set of all RDF+ literal statements, S, is defined as quintuples by: S ∶= {(g, s, p, o, θ) | g ∈ G, s ∈ U, p ∈ P, o ∈ U ∪ L, θ ∈ Θ}. Thereby, θ and (g, s, p, o) are keys such that there exists a bijection f1 with f1(g, s, p, o) = θ ∧ f1⁻¹(θ) = (g, s, p, o). Moreover, we define the overloaded function f5 to return the complete quintuple given either θ or (g, s, p, o), i.e. f5(θ) ∶= (g, s, p, o, θ) =∶ f5(g, s, p, o), when f1(g, s, p, o) = θ.
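The statement-identifier machinery of Definition 3.3.1 can be realized in a few lines. The following Python sketch is only one possible implementation choice; the dictionary-based bookkeeping and the synthetic "thetaN" identifier format are assumptions of the sketch, since the thesis leaves the form of identifiers in Θ to the implementation.

# A minimal sketch of the bijection f1 between (g, s, p, o) keys and internal
# statement identifiers θ (Definition 3.3.1); the dictionary-based scheme and
# the "theta<N>" identifier format are implementation choices, not prescribed
# by the thesis.
class StatementIds:
    def __init__(self):
        self._to_id = {}      # f1 : (g, s, p, o) -> θ
        self._from_id = {}    # f1 inverse : θ -> (g, s, p, o)

    def f1(self, g, s, p, o):
        key = (g, s, p, o)
        if key not in self._to_id:
            theta = f"theta{len(self._to_id) + 1}"
            self._to_id[key] = theta
            self._from_id[theta] = key
        return self._to_id[key]

    def f1_inv(self, theta):
        return self._from_id[theta]

    def f5(self, *args):
        # overloaded: accepts either θ or the key (g, s, p, o)
        if len(args) == 1:
            g, s, p, o = self._from_id[args[0]]
        else:
            g, s, p, o = args
        return (g, s, p, o, self.f1(g, s, p, o))

ids = StatementIds()
t1 = ids.f1("G1", "JamesCameron", "occupation", "FilmProducer")
assert ids.f5(t1) == ("G1", "JamesCameron", "occupation", "FilmProducer", t1)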

The reader may note that we assume that f1 is fixed and given before any statement is defined. Furthermore, this definition of literal statements and the rest of this work abstracts from RDF blank nodes in order to keep the formalization more concise. However, there is no conceptual problem in extending our treatment to blank nodes, too. The two statements of graph G1 of Example 3.3.1 may now be represented in RDF+ in the following way:

Example 3.3.3 Knowledge statements in RDF+


S ⊇ K ⊇ { (G1, JamesCameron, occupation, FilmProducer, θ1), (G1, JamesCameron, produces, Avatar, θ2)}

Thereby, the exact form of statement identifiers in Θ is up to the implementation, as they are only used for internal processing. Having represented the literal interpretation of RDF statements in RDF+, we may now address the representation of selected RDF statements as RDF+ provenance. This is done using a structure of RDF+ provenance statements, M, that is separate from the set of RDF+ literal statements:

Definition 3.3.2 (RDF+ Provenance Statements) Let Γ ⊆ P be the set of provenance dimensions (with designated semantics that is defined later). Let Ωγ, with γ ∈ Γ, be sets providing possible value ranges for the provenance dimension γ. Then, the set of all RDF+ provenance statements, M, is defined by: M ∶= {(θ, γ, ω) | θ ∈ Θ, γ ∈ Γ, ω ∈ Ωγ}.

The following example illustrates the target representation of the first two provenance statements of graph G3 from Example 3.3.2.

Example 3.3.4 Provenance statements in RDF+

M ⊇ M ⊇ { (θ1, prov:source, <dbpedia.org/data/JamesCameron.xml>),
          (θ1, prov:certainty, '0.9') }

Together, we may now define an RDF+ dataset.

Definition 3.3.3 (RDF+ Dataset) An RDF+ dataset D+ of literal statements and associated provenance statements is a pair D+ = (K,M) referring to a set of literal statements K ⊆ S and a set of provenance statements M ⊆ M.

A (partial) example of such a dataset, given by the pair (K,M) with definitions for K ⊆ S and M ⊆ M, has been given in Example 3.3.3 and Example 3.3.4, respectively. This completes our definition of an RDF+ dataset.

3.3.3. Semantics for RDF with Provenance

We now have an abstract syntax for representing RDF triples like "JamesCameron occupation FilmProducer" as part of G1, and provenance statements like "the source of the statement that James Cameron's occupation is Film Producer is found in the document dbpedia.org/data/James_Cameron.xml". However, such an abstract syntax remains ambiguous if it cannot be linked to a formal semantics. Assume two provenance statements:

(θ1, prov:source, <dbpedia.org/data/James_Cameron.xml>)

(θ1, prov:source, <http://www.imdb.com/name/nm0000116.xml>)


For the same literal statement, identified by θ1, the question arises whether this means a disjunction, i.e. one of the two documents has provided the fact; a conjunction, i.e. both documents have provided the fact; a collective reading, i.e. the two documents together gave rise to the fact; or whether this situation constitutes invalid provenance. In order to prevent such ambiguities we introduce a generic semantic framework for provenance in RDF+. However, the framework must also be able to reproduce the literal interpretations found in RDF. For the latter purpose, we first define a 'standard' model for an RDF+ dataset.

Definition 3.3.4 (Standard Interpretation and Model) A standard interpretation Is ∶ S → {⊺, ⊥} for a dataset D+ = (K,M) assigns truth values to all statements4 in K. A standard interpretation for K is a standard model for K if and only if it makes all statements in K become true. This is denoted by Is ⊧s (K,M).

For instance, any standard model Is for (K,M) in Example 3.3.3 would include (G1, JamesCameron, occupation, FilmProducer, θ1) in its set of literal statements evaluating to ⊺. In order to address the level of provenance we foresee an additional model layer that provides a different interpretation to each provenance dimension.

Definition 3.3.5 (Γ-Interpretation and Model) A Γ-interpretation Iγ ∶ S ⇀ Ωγ for a dimension γ ∈ Γ is a partial function mapping statements into the allowed value range of γ. A Γ-interpretation Iγ is a Γ-model for (K,M) if and only if for all provenance statements (θ, γ, ω) ∈ M with f1⁻¹(θ) = (g, s, p, o) the value of the interpretation equals ω, i.e. Iγ((g, s, p, o, θ)) = ω. This is denoted by Iγ ⊧γ (K,M).

As an example, consider the literal statement (G1, JamesCameron, occupation, FilmProducer, θ1) from Example 3.3.3, and the provenance statement (θ1, certainty, 0.9) from Example 3.3.4. A Γ-interpretation Icertainty that is a Γ-model would map the literal statement onto 0.9. The 'standard' interpretation and the Γ-interpretations may now be combined to define what an overall, unambiguous model is:

Definition 3.3.6 (Provenance Interpretation) A provenance interpretation I is a set including a standard interpretation Is and the Γ-interpretations Iγ for all provenance dimensions γ ∈ Γ.

Definition 3.3.7 (Provenance Model) A provenance interpretation I is a model for an RDF+ dataset D+ = (K,M) if and only if each of its interpretations I ∈ I is a standard model or a Γ-model for (K,M). This is denoted by I ⊧ (K,M).
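A minimal sketch of the Γ-model condition for a single certainty dimension may help to fix intuitions; the plain-tuple encoding of K and M and the constant interpretation used below are assumptions of the sketch, not part of the formal framework.

# Sketch of the Γ-model condition (Definition 3.3.5): a Γ-interpretation is a
# Γ-model of (K, M) iff it returns exactly the value ω recorded for every
# provenance statement (θ, γ, ω) in M.  Plain tuples stand in for the formal
# structures.
K = {("G1", "JamesCameron", "occupation", "FilmProducer", "theta1")}
M = {("theta1", "prov:certainty", 0.9)}

def is_gamma_model(interpretation, gamma, K, M):
    quint_by_id = {q[4]: q for q in K}
    return all(
        interpretation(quint_by_id[theta]) == omega
        for (theta, g, omega) in M if g == gamma
    )

# A candidate certainty interpretation: maps every quintuple to 0.9.
I_certainty = lambda quintuple: 0.9
print(is_gamma_model(I_certainty, "prov:certainty", K, M))  # True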

3.3.4. Mapping between RDF and RDF+

The mapping between RDF and RDF+ needs to be defined in two directions. First, one must be able to map from RDF, as given in the examples from Section 3.2, to RDF+. Second, one must be able to map from RDF+ to RDF. Because RDF+ is more fine-grained than RDF, the first direction is easy. For the second, a compromise on the granularity of the representation has to be made.

4 Note that because f1 is fixed there are no two tuples (g, s, p, o, θ1), (g, s, p, o, θ2) with θ1 ≠ θ2. This implies that the standard interpretation is independent of the identifiers θ1, θ2.


From RDF to RDF+

The examples of Section 3.2 reify groups of statements, i.e. the ones found in G1 and G2, in order to associate provenance, such as that given in G3 and G4. In order to allow for an interpretation of the provenance as defined in the preceding section, we map RDF into RDF+. For all RDF statements, including the statements in graphs G1 and G2 of Example 3.3.1, the mapping performed is close to an identity mapping; one only needs to add statement identifiers. The result for G1 in RDF+ is:

Example 3.3.5 Representing graph G1 in RDF+

K ⊇ { (G1, JamesCameron, occupation, FilmProducer, θ1),
      (G1, JamesCameron, produces, Avatar, θ2) }

with θ1 ∶= f1(G1, JamesCameron, occupation, FilmProducer) and θ2 ∶= f1(G1, JamesCameron, produces, Avatar)

The same mapping – close to the identity mapping – is performed for provenance statements like statements of graph G3, resulting in their representation as literal statements:

Example 3.3.6 Representing graph G3 in RDF+

K ⊇ { (G3, G1, prov:source, <dbpedia.org/data/JamesCameron.xml>, θ3),
      (G3, G1, prov:certainty, '0.9', θ4), ... }

Note that this step is necessary in order to achieve upward and limited downward compatibility between RDF+ and RDF. The interpretation of statements like the ones found in G3 also requires an interpretation as provenance. This is achieved by mapping RDF statements with designated dimensions from Γ, like prov:source and prov:certainty, to the additional provenance layer:

Example 3.3.7 Provenance representation of statements in G3

M ⊇ { (θ1, prov:certainty, '0.9'),
      (θ1, prov:source, <dbpedia.org/data/JamesCameron.xml>), ... }

The mapping of the predicates of these provenance statements from RDF to RDF+ is obvious: they are mapped to themselves. Objects are mapped to the corresponding elements of the value ranges Ωγ. For the subjects, however, modeling choices arise. For instance, if prov:certainty were interpreted using probability theory, one might assign a distributive or a collective reading. In the distributive reading, each fact in G1 receives the probability value of 0.9 and, eventually, the distributive reading will assign a joint probability close to 0 for a large number n of stochastically independent facts, i.e. the joint probability 0.9^n. In the collective reading, the collection of facts in G1 as a whole will receive the probability value 0.9. Therefore, the collective reading will assign an individual certainty close to 1 for each individual fact when the number of facts is high and each fact is independent from the others, i.e. the individual probability would be the n-th root of 0.9, namely 0.9^(1/n). A priori, neither of the two (or further) modeling choices is better than the other, but they constitute different modeling targets. The mapping from RDF to RDF+ for the distributive reading of a provenance dimension γ is easy to achieve.

Definition 3.3.8 (Distributive Embedding) Given an RDF statement "G {s p o}" and an RDF provenance statement "H {G γ ω}", a distributive embedding of RDF+ in RDF adds the provenance statements {(θ, γ, ω) | θ = f1(G, s, p, o) ∧ f5(θ) ∈ K} to M.

This means that such a provenance statement is applied individually to all statements in the graph to which it refers in RDF, as indicated in the example above. For certain γ there might be several RDF provenance statements H {G γ ωi} which attach different values ωi to a graph G via a single provenance dimension γ. In that case set-valued ranges have to be elements of Ωγ in order to be consistent with Definition 3.3.5.
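A minimal sketch of the distributive embedding: a provenance triple stated about a named graph G is copied down to every RDF+ literal statement of G. The plain-tuple encoding of K and of the graph-level provenance is an assumption of the sketch.

# Sketch of the distributive embedding (Definition 3.3.8): a provenance triple
# "G gamma omega" stated about a named graph G is attached individually to
# every RDF+ literal statement (G, s, p, o, θ) in K.
def distribute(K, graph_provenance):
    """graph_provenance: iterable of (graph_name, gamma, omega) triples."""
    M = set()
    for (graph, gamma, omega) in graph_provenance:
        for (g, s, p, o, theta) in K:
            if g == graph:
                M.add((theta, gamma, omega))
    return M

K = {
    ("G1", "JamesCameron", "occupation", "FilmProducer", "theta1"),
    ("G1", "JamesCameron", "produces", "Avatar", "theta2"),
}
print(distribute(K, [("G1", "prov:certainty", 0.9)]))
# {('theta1', 'prov:certainty', 0.9), ('theta2', 'prov:certainty', 0.9)}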

From RDF+ to RDF

The serialization of RDF+ data in an RDF knowledge base is straightforward. Each quintuple (g, s, p, o, θ) is realized as a corresponding triple in a named graph and the tuple identifier θ is discarded.

Example 3.3.8 Serialization of some RDF+ statements in G5 in RDF

(G5, JamesCameron, occupation, FilmProducer, θ)

-is mapped to-

G5 {JamesCameron occupation FilmProducer}

For provenance statements the situation is more challenging, because several literal statements with different statement identifiers may belong to the same named graph. Their corresponding provenance statements may differ, but the realization of the provenance statements in RDF does not allow for retaining these fine-grained distinctions – unless one chooses to change the modeling approach drastically, e.g. by assigning each literal statement to a named graph of its own, which is undesirable (cf. the discussion in Section 3.3.1). We have preferred to pursue a more conventional modeling strategy for RDF with named graphs. Therefore, we weaken the association between provenance statements and their corresponding literal statements when mapping to RDF, that is, we group sets of provenance dimension values into one complex value.

Definition 3.3.9 (Grouped Provenance) Given an RDF+ dataset (K,M), RDF provenance is generated by grouping RDF+ provenance statements as follows:


Add the triple (g γ ω′) to the RDF graph g′ ∶= hashGraph(g) for each ω′ ∶= ω1 ∨γ ... ∨γ ωn, where (θ, γ, ωi) ∈ M ∧ (g, s, p, o, θ) ∈ K. Further, hashGraph is a function mapping existing graph names onto graph names suitable for associating provenance, and ∨γ is an operation defined on Ωγ.

If ω′ is set-valued then a set of triples is added to g′ in order to represent ω′. The suitability of hashGraph may be application specific. A general strategy may map graph names g to graph names prefixed by http://provenance.semanticweb.org in a deterministic manner. Operations on provenance dimensions are discussed in Section 3.4.2. In the following example the grouping of provenance values is illustrated.

Example 3.3.9 Grouping of provenance values

K ∶= { (G5, JamesCameron, occupation, FilmProducer, θ1),
      (G5, JamesCameron, produces, Avatar, θ2) }
M ∶= { (θ1, prov:source, <http://www.imdb.com/name/nm0000116.xml>),
      (θ2, prov:source, <dbpedia.org/data/James_Cameron.xml>) }

-is mapped to-

G5 { JamesCameron occupation FilmProducer .
     JamesCameron produces Avatar }
G6 { G5 prov:source <http://www.imdb.com/name/nm0000116.xml>, <dbpedia.org/data/James_Cameron.xml> }

In Example 3.3.9, the resulting grouped value is the set consisting of the two documents http://www.imdb.com/name/nm0000116.xml and dbpedia.org/data/James_Cameron.xml, which is represented by two triples. For specific provenance dimensions, an additional function may be necessary to provide a mechanism for representing grouped values in an appropriate RDF data structure.
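The grouping of Definition 3.3.9 can be sketched as follows: all values of a dimension γ attached to statements of a graph g are combined with that dimension's ∨γ operation and re-attached to a provenance graph hashGraph(g). The hashGraph naming scheme and the set-union chosen for prov:source below are illustrative assumptions.

# Sketch of grouped provenance (Definition 3.3.9): all values of a dimension
# γ attached to statements of a graph g are combined with that dimension's
# ∨_γ operation and re-attached to a provenance graph hashGraph(g).
from collections import defaultdict

def hash_graph(g):
    # illustrative deterministic naming scheme
    return "http://provenance.semanticweb.org/" + g

def group_provenance(K, M, join_ops):
    grouped = defaultdict(list)                       # (g, γ) -> [ω1, ω2, ...]
    graph_of = {theta: g for (g, s, p, o, theta) in K}
    for (theta, gamma, omega) in M:
        grouped[(graph_of[theta], gamma)].append(omega)
    triples = set()
    for (g, gamma), omegas in grouped.items():
        combined = omegas[0]
        for omega in omegas[1:]:
            combined = join_ops[gamma](combined, omega)   # ω1 ∨_γ ... ∨_γ ωn
        triples.add((hash_graph(g), g, gamma, combined))
    return triples

K = {("G5", "JamesCameron", "occupation", "FilmProducer", "t1"),
     ("G5", "JamesCameron", "produces", "Avatar", "t2")}
M = {("t1", "prov:source", frozenset({"imdb.com/name/nm0000116.xml"})),
     ("t2", "prov:source", frozenset({"dbpedia.org/data/James_Cameron.xml"}))}
print(group_provenance(K, M, {"prov:source": lambda a, b: a | b}))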

3.4. Syntax and Semantics for SPARQL for Querying RDF Datasets with Provenance

For querying RDF repositories with provenance awareness, we exploit the capabilities of the SPARQL query language. In this section we first introduce a small extension to the standard SPARQL syntax [Prud’hommeaux and Seaborne, 2008]5 and then define how SPARQL can be applied to an RDF+ knowledge base. The objective of our considerations is the derivation of provenance about query results.

5In this thesis, we work with the version of SPARQL query language defined in [Prud’hommeaux and Seaborne, 2008]. The SPARQL Working Group has, however, produced a W3C Recommendation in 2013 for a new version of SPARQL which adds features to this 2008 version.


3.4.1. SPARQL Syntax Revisited

In our scenario, we assume that the user aims to explore the knowledge and provenance using the RDF query language SPARQL. The SPARQL query shown in Example 3.4.1 enables the user to find out about movies and movie producers on the RDF dataset presented in Example 3.3.1.

Example 3.4.1 SPARQL query to be evaluated on the RDF knowledge base in Example 3.3.1

CONSTRUCT {?x occupation ?z}
FROM NAMED G1
FROM NAMED G2
WHERE { GRAPH ?g {?x produces ?z . ?x occupation 'FilmProducer'} }

When using SPARQL to query RDF+ we introduce one additional expression, "WITH META MetaList". This expression includes the named graphs specified in MetaList for treatment as provenance. This statement is optional. When it is present, the SPARQL processor may digest the RDF+ provenance statements derivable from the RDF named graphs appearing in the MetaList. The SPARQL processor will then use this meta knowledge to compute and output all the provenance statements derivable by successful matches of RDF+ literal statements with the WHERE pattern.

According to SPARQL semantics, FROM g expressions replicate all RDF triples of g into the default triple space of the query. When the same triple appears in multiple source graphs (say (s, p, o) appears in both G1 and G2, and the query contains a clause FROM NAMED G1 FROM NAMED G2), its provenance may become ambiguous. As an intermediate step of our query processing, this provenance is aggregated using the RDF+ interpretations introduced in Section 3.3.3. We note that this intermediate step is not necessary for queries with WITH NAMED g clauses, which are also fully compatible with our framework. Thus, SPARQL queries on RDF+ have one of the two following overall forms:

Definition 3.4.1 (SPARQL SELECT Query) The structure of a SPARQL SELECT query has the following form:

SELECT SelectExpression (WITH META MetaList)? (FROM GraphName)+ WHERE P

Definition 3.4.2 (SPARQL CONSTRUCT Query) The structure of a SPARQL CONSTRUCT query has the following form:

CONSTRUCT ConstructExpression (WITH META MetaList)? (FROM GraphName)+ WHERE P

In these definitions, P refers to a graph pattern that explains how RDF+ literal statements from named graphs specified using FROM statements are matched. Matches bind variables that are used for providing results according to the SelectExpression or the ConstructExpression.


3.4.2. SPARQL Semantics Revisited

In this subsection we define the semantics of SPARQL queries evaluated on an RDF+ theory. For our definitions we use two building blocks: the algebraic semantics of SPARQL [Pérez et al., 2009, Arenas et al., 2009] and the how-provenance calculated via annotated relations (cf. [Green et al., 2007]). The algebraic semantics of SPARQL queries is given based on set-theoretic operations for sets of variable assignments. Such a set of assignments may be assigned information about the so-called how-provenance [Green et al., 2007], i.e., the assignments may be annotated with formulae describing the individual derivation tree used to assign the variables. The how-provenance annotation may be represented by a function Φ ∶ (U ∪ L)^|V| → F, where (U ∪ L)^|V| is the set of all tuples of length |V| over the domain U ∪ L and F is the set of formulae annotating variable assignments. The set of formulae F is given by all Boolean formulas constructed over the set of literal statements S, including a bottom element ⊥ and a top element ⊺. The formulae constitute an algebra (F, ∧, ∨, ⊥, ⊺). The special element ⊥ is used as an annotation of variable assignments which are not in the result set. The special element ⊺ may be omitted, but it allows for the simplification of complex formulas. The following definition shows how a set of variable bindings is generalized to SPARQL queries of arbitrary complexity by a recursive definition of simultaneous query evaluation and computation of the annotations. The first step in evaluating a graph pattern is to find matches for the triple pattern contained in the query. Because the RDF+ dataset D+ consists of quintuples, we need to adapt the SPARQL evaluation procedures. The statement identifiers do not need to be matched, as they depend functionally on graph name, subject, predicate and object. Therefore, we consider that the matching of triple patterns P = (α, β, δ), given a graph name λ ∈ U, is always defined by the following SPARQL grammar (GRAPH λ P). As a simplification of our formalization we assume that the keyword GRAPH together with a URI or a graph variable is used in any given SPARQL query. If it is not used, we may expand a given SPARQL query to include it.

Definition 3.4.3 (Basic Graph Pattern Matching) Given a basic graph pattern P = (α β δ), a graph name λ ∈ U and a mapping µ such that var(P) ⊆ dom(µ). Let D+ be an RDF+ dataset, G = gr(λ) an RDF graph in D+, and µ(P) the set of triples obtained by replacing the variables in the triples of P according to µ. The (meta) evaluation of P over G, denoted by [[[P]]]G^{D+}, is defined as the set of annotated mappings Φ, µ ⊆ dom(Φ):

[[[P]]]G^{D+} = {µ ○ (Φ(µ)) | µ ∶ V → U ∪ L ∧ Φ ∶ (U ∪ L)^|V| → F ∧ dom(µ) = var(P) ∧ ∃θ (name(G) ○ µ(P) ○ (θ) ∈ KG)}

where KG ∶= σ#1=name(G)(K), i.e. the selection of statements in K belonging to graph G, and ○ denotes vector concatenation. If µ ○ (Φ(µ)) ∈ [[[P]]]G^{D+}, we say that µ ○ (Φ(µ)) is a solution for P in G.

Φ(µ) = θ, if µ(P) = (g, s, p, o) ∧ (g, s, p, o, θ) ∈ KG ∧ f1(g, s, p, o) = θ
Φ(µ) = ⊥, otherwise
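A minimal sketch of this annotated basic pattern matching is given below: a pattern (GRAPH λ (α β δ)) is matched against the quintuples in K, and each solution mapping µ is annotated with the identifier θ of the matched statement. The string-based variable convention ("?x") and the data encoding are assumptions of the sketch.

# Sketch of annotated basic pattern matching (Definition 3.4.3): a pattern
# (GRAPH λ (α β δ)) is matched against the quintuples in K and every solution
# mapping µ is annotated with the identifier θ of the matched statement.
# Variables are written as strings starting with "?".
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def match_quad_pattern(pattern, K):
    lam, alpha, beta, delta = pattern
    solutions = []                                    # list of (µ, annotation)
    for (g, s, p, o, theta) in K:
        mu = {}
        ok = True
        for term, value in zip((lam, alpha, beta, delta), (g, s, p, o)):
            if is_var(term):
                if mu.get(term, value) != value:      # incompatible rebinding
                    ok = False
                    break
                mu[term] = value
            elif term != value:
                ok = False
                break
        if ok:
            solutions.append((mu, theta))             # Φ(µ) = θ
    return solutions

K = {("G1", "JamesCameron", "occupation", "FilmProducer", "theta1"),
     ("G2", "JamesCameron", "occupation", "TruckDriver", "theta3")}
print(match_quad_pattern(("?g", "?x", "occupation", "?y"), K))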

Assume, for the next SPARQL query examples, the following RDF+ knowledge base D+:

Example 3.4.2 RDF+ dataset D+ = (K,M)


K = { (G1, JamesCameron, occupation, FilmProducer, θ1),
      (G1, JamesCameron, produces, Avatar, θ2),
      (G2, JamesCameron, occupation, TruckDriver, θ3),
      (G2, JamesCameron, produces, BattleBeyondtheStars, θ4),
      (G2, MartinScorsese, occupation, FilmProducer, θ5),
      (G2, MartinScorsese, produces, TaxiDriver, θ6) }

For the corresponding M, see Example 3.4.14.

Let the SPARQL query in Example 3.4.3 be evaluated on the dataset D+:

Example 3.4.3 SPARQL query to be evaluated on the knowledge base D+

SELECT ?g ?x ?y
WITH META G3, G4
FROM NAMED G1
FROM NAMED G2
WHERE { GRAPH ?g {?x occupation ?y} }

For the query of Example 3.4.3, we may find the variable assignments shown in Example 3.4.4 using standard SPARQL processing, and we may indicate which atomic formulae, i.e. RDF+ quintuples in this simple example, led to these variable assignments. This indication is given by the statement identifiers representing their statements.

Example 3.4.4 Variable assignments for the query of Example 3.4.3

Φ =
  ?g   ?x              ?y            F
  G1   JamesCameron    FilmProducer  θ1
  G2   JamesCameron    TruckDriver   θ3
  G2   MartinScorsese  FilmProducer  θ5

Basic pattern matching is not directly applicable if an expression 'GRAPH λ' appears outside a complex triple pattern. In such a case, we first need to distribute the expression 'GRAPH λ' appropriately to the atomic triple patterns in order to obtain atomic SPARQL expressions accessible by basic pattern matching. Because named graphs cannot be nested, this distribution is always possible and unambiguous. In the following we use the function quads(P) to denote the query resulting from this transformation. In Example 3.4.5 this transformation is demonstrated on a conjunction of two triple patterns.

Example 3.4.5 Distributing the expression ’GRAPH λ’ in a complex triple pattern with the function quads(P)

P1 = GRAPH ?src { { ?x occupation ?y .} { ?x produces ?z .}}


quads(P1) = GRAPH ?src { ?x occupation ?y .}
            GRAPH ?src { ?x produces ?z .}

Now we define the evaluation of complex graph patterns by operations on sets of variable assignments, similar to [Pérez et al., 2009, Arenas et al., 2009]. Note that we restrict ourselves to SPARQL fragments corresponding to positive relational queries, for which relational provenance models can be used. For more details about the limitations of relational provenance models for capturing the semantics of the SPARQL OPTIONAL operator, see [Geerts et al., 2013].

Definition 3.4.4 (Complex Graph Pattern Matching) Given the basic triple graph patterns P, P1, P2 and a graph name λ ∈ U. Let D+ be an RDF+ dataset and G = gr(λ) an RDF graph in D+. The meta evaluation of P, P1, and P2 over G, denoted by [[[.]]]G^{D+}, is defined recursively as follows:

● [[[P]]]G^{D+} is given by Definition 3.4.3,
● [[[P1 AND P2]]]G^{D+} = [[[P1]]]G^{D+} & [[[P2]]]G^{D+},
● [[[P1 UNION P2]]]G^{D+} = [[[P1]]]G^{D+} ∪ [[[P2]]]G^{D+},
● [[[P1 FILTER C]]]G^{D+} = σc([[[P1]]]G^{D+}).

The definition uses the operation AND. In standard SPARQL the operation AND is denoted by the absence of an operator; like [Pérez et al., 2009, Arenas et al., 2009], we still use the explicit term AND in order to facilitate referencing this operator. The recursion in the SPARQL query evaluation defined in Definition 3.4.3 is indeed identical to [Pérez et al., 2009, Arenas et al., 2009] (the work of [Pérez et al., 2009, Arenas et al., 2009] on the SPARQL semantics is briefly summarized in Chapter 3.4). Only the basic pattern matching has been changed slightly: basic pattern matching now considers the basic triple pattern together with named graph matching, and it annotates variable assignments Φ(µ) from basic matches with atomic statements from S and variable assignments from complex matches with Boolean formulae F ∈ F over these statements. The following definition specifies how complex formulae from F, which are used to construct formulas representing the how-provenance and serve as annotations for the results of matching complex graph patterns, are derived using algebraic operations on sets of annotated variable bindings.

Definition 3.4.5 (Algebra of Annotated Relations) Let Φ, Φ1 and Φ2 be sets of annotated variable assignments. We define &, ∪, σ, and ψ via operations on the annotations of the assignments (the two binary operations ∧ and ∨) as follows:

● (Φ1 & Φ2)(µ) = Φ1(µ1) ∧ Φ2(µ2), where µ1 and µ2 are compatible6, i.e. ∀x ∈ dom(µ1) ∩ dom(µ2) ∶ µ1(x) = µ2(x), and µ = µ1 ∪ µ2,

6 Two mappings µ1 and µ2 are compatible when for all x ∈ dom(µ1) ∩ dom(µ2) it is the case that µ1(x) = µ2(x); in that case µ1 ∪ µ2 is also a mapping.


● (Φ1 ∪ Φ2)(µ) = Φ1(µ) ∨ Φ2(µ), µ ∈ Φ1 or µ ∈ Φ2

● (σc(Φ(µ))) = Φ(µ) ∧ fc(µ), where fc(µ) denotes a function mapping µ to either ⊺ or ⊥ according to the condition c.

● (ψX(Φ(µ))) = ⋁_{µ⊆ν, dom(µ)=X, Φ(ν)≠⊥} Φ(ν).
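A minimal sketch of the join (&) and union operations on annotated relations may help to make the definition concrete. A relation is represented here as a list of (µ, formula) pairs, and formulae as nested tuples such as ("and", θ1, θ2); this encoding is an illustrative assumption, not part of the formalization.

# Sketch of the join (&) and union operations on annotated relations
# (Definition 3.4.5).  A relation is a list of (µ, formula) pairs; formulae
# are nested tuples such as ("and", "theta1", "theta2"), an encoding chosen
# here for illustration only.
def compatible(mu1, mu2):
    return all(mu1[x] == mu2[x] for x in mu1.keys() & mu2.keys())

def join(phi1, phi2):
    out = []
    for mu1, f1 in phi1:
        for mu2, f2 in phi2:
            if compatible(mu1, mu2):
                out.append(({**mu1, **mu2}, ("and", f1, f2)))
    return out

def union(phi1, phi2):
    # identical bindings coming from both sides are merged with a disjunction
    merged = {}
    for mu, f in phi1 + phi2:
        key = tuple(sorted(mu.items()))
        merged[key] = ("or", merged[key], f) if key in merged else f
    return [(dict(key), f) for key, f in merged.items()]

phi1 = [({"?x": "JamesCameron", "?y": "Avatar"}, "theta2")]
phi2 = [({"?x": "JamesCameron"}, "theta1")]
print(join(phi1, phi2))
# [({'?x': 'JamesCameron', '?y': 'Avatar'}, ('and', 'theta2', 'theta1'))]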

Furthermore, following the algebraic definition of how-provenance formula, we present the set of laws for annotated relations.

Definition 3.4.6 (Laws of Annotated Relations) Let Φ1, Φ2 and Φ3 be sets of annotated variable assignments. We define the laws for annotated relations via operations on the annotations of the assignments as follows:

● Idempotency: Φ1(µ1) ∧ Φ1(µ1) = Φ1(µ1) Φ1(µ1) ∨ Φ1(µ1) = Φ1(µ1)

● Commutativity: Φ1(µ1) ∧ Φ2(µ2) = Φ2(µ2) ∧ Φ1(µ1) Φ1(µ1) ∨ Φ2(µ2) = Φ2(µ2) ∨ Φ1(µ1)

● Associativity: Φ1(µ1) ∧ (Φ2(µ2) ∧ Φ3(µ3)) = (Φ1(µ1) ∧ Φ2(µ2)) ∧ Φ3(µ3) Φ1(µ1) ∨ (Φ2(µ2) ∨ Φ3(µ3)) = (Φ1(µ1) ∨ Φ2(µ2)) ∨ Φ3(µ3)

● Distributivity: Φ1(µ1) ∧ (Φ2(µ2) ∨ Φ3(µ3)) = (Φ1(µ1) ∧ Φ2(µ2)) ∨ (Φ1(µ1) ∧ Φ3(µ3)) Φ1(µ1) ∨ (Φ2(µ2) ∧ Φ3(µ3)) = (Φ1(µ1) ∨ Φ2(µ2)) ∧ (Φ1(µ1) ∨ Φ3(µ3))

● Absorption: Φ1(µ1) ∧ (Φ1(µ1) ∨ Φ2(µ2)) = Φ1(µ1) Φ1(µ1) ∨ (Φ1(µ1) ∧ Φ2(µ2)) = Φ1(µ1)

In order to show the evaluation of a SPARQL query, we consider the query from Example 3.4.6 to be evaluated on the knowledge base D+.

Example 3.4.6 SPARQL query to be evaluated on the knowledge base D+

SELECT ?h1 ?h2 ?x ?y
WITH META G3, G4
FROM NAMED G1
FROM NAMED G2
WHERE { {GRAPH ?h1 {?x produces ?y}} AND
        {GRAPH ?h2 {?x occupation 'FilmProducer'}}
        FILTER {?x='JamesCameron'} }


Let P be the graph pattern contained in the WHERE clause of the query. Then the evaluation of P is defined by an algebraic expression:

[[[P]]]G^{D+} = [[[{P1 AND P2} FILTER {?x = JamesCameron}]]]G^{D+}
            = σ?x=JamesCameron([[[P1 AND P2]]]G^{D+})
            = σ?x=JamesCameron([[[P1]]]G^{D+} & [[[P2]]]G^{D+})
            = σ?x=JamesCameron(Φ1 & Φ2)

where Φ1 and Φ2 are relations representing variable assignments and their annotations. In this example and in the preceding definition we have used algebraic operations on sets of annotated bindings. In order to evaluate the expression σ?x=JamesCameron(Φ1 & Φ2) we need to determine Φ1 and Φ2 using Definition 3.4.3. The intermediate result is shown in Example 3.4.7.

Example 3.4.7 Determining Φ1 and Φ2 - intermediate result of the expression σ?x=JamesCameron(Φ1& Φ2)

Φ1 =
  ?h1  ?x              ?y                    S1
  G1   JamesCameron    Avatar                θ2
  G2   JamesCameron    BattleBeyondtheStars  θ4
  G2   MartinScorsese  TaxiDriver            θ6

Φ2 =
  ?h2  ?x              S2
  G1   JamesCameron    θ1
  G2   MartinScorsese  θ5

To evaluate the conjunction of two patterns, the operation & is applied, and the result is shown in Example 3.4.8. The annotation θ1 ∧ θ2 of the first row represents that this assignment has been derived from the conjunction of the two literal statements θ1 and θ2 (see Example 3.4.2).

Example 3.4.8 Determining Φ1&Φ2 - intermediate result of the expression σ?x=JamesCameron(Φ1& Φ2)

Φ1 & Φ2 =
  ?h1  ?h2  ?x              ?y                    S3
  G1   G1   JamesCameron    Avatar                θ1 ∧ θ2
  G1   G2   JamesCameron    BattleBeyondtheStars  θ1 ∧ θ4
  G2   G2   MartinScorsese  TaxiDriver            θ5 ∧ θ6

Application of the σ-operation to the intermediate results gives the annotated relation shown in Example 3.4.9.

Example 3.4.9 Result of σ?x=JamesCameron(Φ1 & Φ2)

σ?x=JamesCameron(Φ1 & Φ2) =


  ?h1  ?h2  ?x              ?y                    S4
  G1   G1   JamesCameron    Avatar                (θ1 ∧ θ2) ∧ ⊺
  G1   G2   JamesCameron    BattleBeyondtheStars  (θ1 ∧ θ4) ∧ ⊺
  G2   G2   MartinScorsese  TaxiDriver            ⊥

The annotations Φ(µ) can now be used to assign truth values for each tuple binding µ. Is (cf. Definition 3.3.4) assigns truth values to all atomic statements si ∈ K ⊆ S. We extend the interpretation Is to capture all the Boolean formulae over statements S.

Definition 3.4.7 (Standard Interpretation of Formulae) Let F, F1, F2 ∈ F be Boolean formulae over S, and let Fa ∈ S be an atomic formula. We define the standard interpretation of formulae Is^f as follows:

● Is^f(Fa) ∶= Is(Fa);
● Is^f(F1 ∧ F2) is ⊺ if Is^f(F1) = Is^f(F2) = ⊺, otherwise ⊥;
● Is^f(F1 ∨ F2) is ⊺ if Is^f(F1) = ⊺ or Is^f(F2) = ⊺, otherwise ⊥.

For instance, Is^f returns ⊺ for the assignment shown in the first row of Φ1 & Φ2 from Example 3.4.8, because the statements θ1 and θ2 are in the knowledge base. Basically, the standard interpretation of formulae Is^f assigns truth values to all variable bindings that are in the knowledge base, and captures all the Boolean formulae over these statement bindings. For producing the result set, we dismiss all variable bindings with the value ⊥ assigned.

Analogously to Is^f, we can extend a Γ-interpretation Iγ over RDF+ statements to a Γ-interpretation Iγ^f over formulae. Similar to the definition of the standard interpretation of formulae, the Γ-interpretation assigns values which represent provenance dimensions for all variable bindings which are not annotated with ⊥; thus we do not consider provenance interpretations of such bindings over formulae. Remember that provenance interpretations allow for only one ω per θ ∈ Θ and γ ∈ Γ (cf. Definition 3.3.5). In order to make use of the how-provenance represented by the annotations, we require that for each provenance dimension γ an algebra (Ωγ, ∧γ, ∨γ, ⊺γ, ⊥γ) with two operations ∧γ and ∨γ, and two special elements ⊺γ, ⊥γ ∈ Ωγ, is defined. The definition of the algebras can be supplied by a modeler according to the intended semantics of the different provenance dimensions.

Definition 3.4.8 (Γ-Interpretation of Formulae) Let F, F1, F2 ∈ F be Boolean formulae over S, and let Fa ∈ S be an atomic formula. We define the interpretation Iγ^f as follows:

● Iγ^f(Fa) ∶= Iγ(Fa);

● Iγ^f(F1 ∧ F2) is Iγ^f(F1) ∧γ Iγ^f(F2);

● Iγ^f(F1 ∨ F2) is Iγ^f(F1) ∨γ Iγ^f(F2).

The definition of the provenance dimensions can be supplied by the administrator of the knowledge base, according to the intended semantics of the particular provenance dimensions. For illustration, we consider in Example 3.4.10 the definition of fuzzy logic operations that calculate a possibility measure on variable assignments, operations defined on timestamps which calculate the time of the last modification, and set operations defined for source documents that construct the combined 'how-provenance'.

Example 3.4.10 Fuzzy logic operations for provenance dimensions

Ωcertainty = [0, 1]
Icertainty^f(x1 ∧ x2) = min(Icertainty^f(x1), Icertainty^f(x2))
Icertainty^f(x1 ∨ x2) = max(Icertainty^f(x1), Icertainty^f(x2))

Ωtime = [0, ∞)
Itime^f(x1 ∧ x2) = max(Itime^f(x1), Itime^f(x2))
Itime^f(x1 ∨ x2) = min(Itime^f(x1), Itime^f(x2))

Ωsource = 2^D, D the set of document URIs
Isource^f(x1 ∧ x2) = Isource^f(x1) ∪ Isource^f(x2)
Isource^f(x1 ∨ x2) = Isource^f(x1) ∪ Isource^f(x2)
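To illustrate, the following sketch evaluates the how-provenance formula (θ1 ∧ θ2) ∨ (θ1 ∧ θ4) under the certainty algebra above (∧ interpreted as min, ∨ as max). The nested-tuple formula encoding and the certainty table are assumptions of this sketch; the atomic values correspond to the provenance statements of Example 3.4.14.

# Sketch of a Γ-interpretation of formulae (Definition 3.4.8) for the
# certainty dimension of Example 3.4.10: ∧ is interpreted as min and ∨ as
# max.  Formulae are nested ("and"/"or", ...) tuples; the atomic values come
# from the provenance statements M of the running example.
certainty = {"theta1": 0.9, "theta2": 0.9, "theta4": 0.6}

def interpret_certainty(formula):
    if isinstance(formula, str):                     # atomic statement θ
        return certainty[formula]
    op, left, right = formula
    l, r = interpret_certainty(left), interpret_certainty(right)
    return min(l, r) if op == "and" else max(l, r)

# how-provenance of the JamesCameron binding in Example 3.4.12
f = ("or", ("and", "theta1", "theta2"), ("and", "theta1", "theta4"))
print(interpret_certainty(f))   # 0.9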

As discussed earlier, the algorithm proposed here for evaluating graph pattern queries with provenance follows a compositional semantics, i.e. for each sub-pattern P′ of P the result of evaluating P′ can be used to evaluate P. In order to solve this problem algorithmically, we have implemented, analogously to [Pérez et al., 2009, Arenas et al., 2009], a depth-first strategy algorithm for evaluating SPARQL query patterns with provenance. This algorithm, denoted by Eval+, has been implemented so that it emulates exactly the recursive definition of [[[.]]] for well-designed patterns as defined in [Polleres, 2007]. Formally, given an RDF+ dataset D+, we define the evaluation of a pattern P with the set of mappings Ω, denoted by Eval+D+(P, Ω), as follows:

Algorithm 1 Evaluation Algorithm: Eval+D+(P, Ω)
 1: if Ω = ∅ then
 2:   return ∅
 3: end if
 4: if P is a triple pattern t then
 5:   return Ω & [[[t]]]D+
 6: end if
 7: if P = (P1 AND P2) then
 8:   return Eval+D+(P2, Eval+D+(P1, Ω))
 9: end if
10: if P = (P1 UNION P2) then
11:   return Eval+D+(P1, Ω) ∪ Eval+D+(P2, Ω)
12: end if
13: if P = (P1 FILTER R) then
14:   return { Φ(µ) ∈ Eval+D+(P1, Ω) ∧ fR(µ) }, where fR(µ) denotes a function mapping µ to either ⊺ or ⊥ according to the condition R
15: end if

Then, the evaluation of P against a dataset D+, which we denote simply by Eval+D+(P), is Eval+D+(P, Φ(µ)∅), where Φ(µ)∅ is the mapping with empty domain.
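To connect the algebra with the recursive evaluation, the following compact Python sketch mirrors the structure of Algorithm 1 for the AND/UNION/FILTER fragment. It reuses the match_quad_pattern, join, and union helpers from the earlier sketches, and the tuple-based pattern encoding ("quad", "and", "union", "filter") is an assumption made here, not the thesis's implementation.

# A compact recursive evaluator mirroring Algorithm 1 (Eval+) for the
# AND / UNION / FILTER fragment, built on the match_quad_pattern, join and
# union sketches above.  The pattern AST ("quad", ...), ("and", P1, P2), ...
# is our own encoding.
def eval_plus(pattern, K):
    kind = pattern[0]
    if kind == "quad":
        return match_quad_pattern(pattern[1], K)
    if kind == "and":
        return join(eval_plus(pattern[1], K), eval_plus(pattern[2], K))
    if kind == "union":
        return union(eval_plus(pattern[1], K), eval_plus(pattern[2], K))
    if kind == "filter":
        condition = pattern[2]                    # a Python predicate on µ
        return [(mu, f) for mu, f in eval_plus(pattern[1], K) if condition(mu)]
    raise ValueError(f"unsupported pattern: {kind}")

query = ("filter",
         ("and",
          ("quad", ("?h1", "?x", "produces", "?y")),
          ("quad", ("?h2", "?x", "occupation", "FilmProducer"))),
         lambda mu: mu["?x"] == "JamesCameron")
# eval_plus(query, K), with K encoded as in the earlier sketches, reproduces
# the annotated result of Example 3.4.9.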


3.4.3. SPARQL Query Forms

As mentioned before, in this work we are only interested in the standard SPARQL SELECT and CONSTRUCT query forms. The evaluation of SPARQL queries on RDF+ data differs from the standard evaluation in that provenance is attached to the results. The evaluation of SELECT queries on an RDF+ dataset is based on ψX([[[P]]]G^{D+}) (see Definition 3.4.5), where X denotes the set of variables specified in the SelectExpression (see Definition 3.4.1). If X forms a proper subset of the variables used in the graph pattern, then the annotations of all bindings ν are grouped. This grouping is analogous to the generation of grouped provenance described in Definition 3.3.9. As an example consider the query shown in Example 3.4.11, which is a slight modification of the query from Example 3.4.6, applied to the data shown in Example 3.4.2. For the result see Example 3.4.12. In contrast to Example 3.4.8, there is only one row for JamesCameron.

Example 3.4.11 SPARQL query to be evaluated on knowledge base D+

SELECT ?x
WITH META G3, G4
FROM NAMED G1
FROM NAMED G2
WHERE { {GRAPH ?h1 {?x produces ?y}} AND
        {GRAPH ?h2 {?x occupation 'FilmProducer'}} }

The result of a SELECT query is a set of extended bindings. Such an extended binding contains values for the specified variables and values for each provenance dimension γ ∈ Γ which can be regarded as additional variables.

Example 3.4.12 Determining ψ{?x}(Φ1 & Φ2) - intermediate result of the expression defined in Example 3.4.11

ψ{?x}(Φ1 & Φ2) =
  ?x              S5
  JamesCameron    (θ1 ∧ θ2) ∨ (θ1 ∧ θ4)
  MartinScorsese  θ5 ∧ θ6

For each binding µi of the query from Example 3.4.11, the variables γ are bound to Iγ^f(ψX([[[P]]]G^{D+})(µi)) (see Example 3.4.13). For this result the provenance from Example 3.4.14 has been used. For instance, Icertainty^f((θ1 ∧ θ2) ∨ (θ1 ∧ θ4)) = 0.9. If no provenance statement (θ, γ, ω) exists for a particular RDF+ literal statement f5(θ) and a particular provenance dimension γ, then ⊥γ serves as the default value. For the result of a SELECT query, all bindings from ψX([[[P]]]G^{D+}) are extended in this way.

Example 3.4.13 Variable bindings for Iγ^f(ψX([[[P]]]G^{D+})(µi)) - results of the expression defined in Example 3.4.11

Iγ^f(ψX([[[P]]]G^{D+})(µi)) =
  ?x              certainty  time
  JamesCameron    '0.9'      "5/5/2014"
  MartinScorsese  '0.6'      "6/6/1980"


Example 3.4.14 Provenance statements, D+ = (K,M)

M = { (θ1, prov:certainty, '0.9'), (θ1, prov:time, "5/5/2014"),
      (θ2, prov:certainty, '0.9'), (θ2, prov:time, "5/5/2014"),
      (θ3, prov:certainty, '0.6'), (θ3, prov:time, "6/6/1980"),
      (θ4, prov:certainty, '0.6'), (θ4, prov:time, "6/6/1980"),
      (θ5, prov:certainty, '0.6'), (θ5, prov:time, "6/6/1980"),
      (θ6, prov:certainty, '0.6'), (θ6, prov:time, "6/6/1980") }

Analogously to the standard evaluation, the evaluation of a CONSTRUCT query on an RDF+ dataset results in a single RDF+ graph which is built using the graph template specified in the ConstructExpression (see Definition 3.4.2). This is in line with the fact that the graph template consists of a conjunction of triple patterns and a named graph; thus quadruple patterns cannot be stated.7

Similar to the evaluation of SELECT queries, the evaluation of CONSTRUCT queries is based on ψX([[[P]]]G^{D+}), where X denotes the set of variables specified in the ConstructExpression. The RDF+ graph is constructed as described in the following: Let tj be the triple pattern j specified in the ConstructExpression, P denote the graph pattern specified in the WHERE clause, (si,j, pi,j, oi,j) denote the triple obtained by replacing the variables in tj according to a mapping µi, and ĝ denote a new graph name. Then, for each binding µi ∈ ψX([[[P]]]G^{D+}) and for each tj the quintuple (ĝ, si,j, pi,j, oi,j, θi,j) is added to S, where θi,j is the statement identifier f1(ĝ, si,j, pi,j, oi,j). Further, (θi,j, γ, ωi,j) is added to M, where ωi,j = Iγ^f(ψX([[[P]]]G^{D+})(µi)).

Each new quintuple inherits the provenance dimensions γ associated with the binding which has been used to create that quintuple. The value of ωi,j is determined by applying Iγ^f to the formula which annotates the binding. Note that since ψX([[[P]]]G^{D+}) and the interpretations Iγ^f are functions, and further the graph template in the ConstructExpression is a set of triples, the provenance dimensions (θi,j, γ, ωi,j) are unique for a given θi,j.

As an example for a CONSTRUCT statement consider Example 3.4.15. Provenance for some of the RDF+ statements presented in Example 3.4.2 is specified in Example 3.4.14. For the graph pattern P contained in this query the result of ψX([[[P]]]G^{D+}) is identical to the annotated relation shown in Example 3.4.8 except for the first two columns. Based on the single triple pattern (?x occupation ?y) contained in the graph template and the bindings contained in ψX([[[P]]]G^{D+}), the corresponding quintuples are constructed and added to the RDF+ literal statements Kres, as shown in Example 3.4.16. Mres contains the corresponding provenance statements resulting from Iγ^f(ψX([[[P]]]G^{D+})(µi)).

7 Standard SPARQL does not allow for giving this graph a name. In order to associate provenance, multiple named graphs as outputs are convenient. In order to remain standard compliant, the SPARQL engine may however also return data and provenance in two different batches distinguished by some implementation-specific mechanism.

Example 3.4.15 SPARQL query CONSTRUCT


CONSTRUCT {?x occupation ?y}
WITH META G3, G4
FROM NAMED G1
FROM NAMED G2
WHERE { {GRAPH ?h1 {?x produces ?y}} AND
        {GRAPH ?h2 {?x occupation 'FilmProducer'}} }

Example 3.4.16 Variable bindings for the query of Example 3.4.15

Kres = { (Gnew, JamesCameron, occupation, FilmProducer, θnew1)
         (Gnew, JamesCameron, occupation, TruckDriver, θnew2)
         (Gnew, MartinScorsese, occupation, FilmProducer, θnew3) }

Mres = { (θnew1, prov:certainty, '0.9')   (θnew1, prov:time, "5/5/2014")
         (θnew2, prov:certainty, '0.6')   (θnew2, prov:time, "6/6/1980")
         (θnew3, prov:certainty, '0.6')   (θnew3, prov:time, "6/6/1980") }
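To make the construction described above concrete, the following Java sketch instantiates a graph template over a set of bindings and attaches the inherited provenance. It is a minimal illustration rather than the metaK implementation: the representation of bindings as maps, the classes Quintuple and ProvStatement, and the abstraction of I^f_γ as a function from a binding to a dimension value are assumptions made for the sake of the example.

import java.util.*;
import java.util.function.Function;

final class ConstructWithProvenance {

    record Quintuple(String g, String s, String p, String o, String id) {}
    record ProvStatement(String id, String dimension, String value) {}

    // Instantiate each template triple t_j for each binding µ_i and attach provenance.
    static void construct(List<Map<String, String>> bindings,              // ψ_X([[[P]]]^{D+}_G)
                          List<String[]> templateTriples,                  // triple patterns t_j of the template
                          Map<String, Function<Map<String, String>, String>> dimInterpretations, // I^f_γ per dimension γ
                          String newGraphName,                             // the fresh graph name ĝ
                          List<Quintuple> kRes,
                          List<ProvStatement> mRes) {
        for (Map<String, String> mu : bindings) {
            for (String[] t : templateTriples) {
                String s = resolve(t[0], mu), p = resolve(t[1], mu), o = resolve(t[2], mu);
                // statement identifier f_1(ĝ, s, p, o); here simply a deterministic hash
                String theta = "theta_" + Objects.hash(newGraphName, s, p, o);
                kRes.add(new Quintuple(newGraphName, s, p, o, theta));
                // the new quintuple inherits the provenance of the binding that produced it
                for (Map.Entry<String, Function<Map<String, String>, String>> dim : dimInterpretations.entrySet()) {
                    String omega = dim.getValue().apply(mu);               // ω_{i,j} for dimension γ
                    mRes.add(new ProvStatement(theta, dim.getKey(), omega));
                }
            }
        }
    }

    private static String resolve(String term, Map<String, String> mu) {
        return term.startsWith("?") ? mu.get(term.substring(1)) : term;
    }
}

Applied to the bindings of Example 3.4.13 and the single template triple (?x occupation ?y), such a procedure produces quintuples and provenance statements analogous to Kres and Mres above.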

3.4.4. Advanced Query Forms

The SPARQL query forms presented so far can be seen as a straightforward adaptation of querying capabilities to a knowledge base with associated provenance. Beyond this, the presented framework can potentially be extended to more complex query forms with provenance awareness.

Queries with conditions on provenance

A possible interesting extension of our framework is the support for additional selection and/or ranking conditions on associated provenance attributes. From the conceptual point of view, this functionality can be seen as an extension of the WHERE clause. In our example introduced in Section 3.2, the query about affiliations of researchers (Example 3.4.1) may also require that only certain facts (say, with associated certainty values greater than 0.8) shall be used. This restriction would result in the exclusion of triples from graph G2 (with insufficient certainty of 0.6) from evaluation. Consequently, the result set would only contain the statement which is completely constructed using triples from the remaining graph G1, with its high certainty of 0.9. This idea is reflected in [Carroll et al., 2005], which gives examples of querying RDF through SPARQL queries with provenance constraints, using the notion of Named Graphs. Such functionality can be realized in our framework in a straightforward manner, using nested queries (or querying with provenance from views). We note that the integration of nested queries with our approach would result in a substantially more flexible and powerful solution than the approach from [Carroll et al., 2005]. In particular, our framework potentially allows one to express filtering and/or ranking conditions on the provenance of query results. Conceptually, this functionality can be seen as an extension of the HAVING clause (in the common SQL-like sense). For instance, for our sample query from Example 3.4.1 we may require the minimum overall certainty of each result to be greater than 0.8. The resulting filter condition is then applied to the query results from Example 3.4.16, which contain complex interpretations of meta knowledge for possible variable bindings. Beyond this, the exploitation of SPARQL on RDF+ would allow for specifying conditions on aggregated provenance of the query result in a natural way. Conceptually, this functionality can be established in a quite straightforward manner using nested queries (or querying with provenance from views). The necessary formalization of built-in functions and aggregates by means of external predicates can be found in [Polleres et al., 2007].
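Such a HAVING-like restriction could, for instance, be realized as a post-processing step over the annotated result set. The following Java sketch is purely illustrative (the record AnnotatedBinding and the key prov:certainty are assumptions, not part of the presented framework): it drops all results whose aggregated certainty does not exceed a given threshold.

import java.util.List;
import java.util.Map;

final class ProvenanceFilter {

    // A query result together with its aggregated provenance values per dimension.
    record AnnotatedBinding(Map<String, String> binding, Map<String, Object> provenance) {}

    /** Keep only results whose aggregated certainty is greater than the threshold. */
    static List<AnnotatedBinding> minCertainty(List<AnnotatedBinding> results, double threshold) {
        return results.stream()
                .filter(r -> ((Number) r.provenance().get("prov:certainty")).doubleValue() > threshold)
                .toList();
    }
}

Applied with a threshold of 0.8 to a result set like the one of Example 3.4.13 (certainties 0.9 and 0.6), only the first binding would be retained.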

Nested provenance

Nested provenance (i.e., provenance about provenance) can potentially be expressed through RDF capabilities. Following our example from Section 3.2, in the nested setting we would be able to specify conditions regarding the recency or reliability of associated provenance. For instance, we may require that only query results with high certainty of their extraction sources shall be returned. These kinds of queries have not yet been a concern in our framework as presented so far. However, the required adaptation is not too difficult. As discussed in Section 3.3.1, there is no conceptual separation between knowledge and provenance in our repository. For this reason, complex queries may treat provenance as knowledge and obtain its meta annotation (i.e., nested provenance) in a quite straightforward manner. Likewise, advanced querying mechanisms with nested provenance support can easily be supported. As a consequence of the preceding discussion, we note that advanced conditions with nested provenance can also be specified both for repository contents (i.e., a kind of advanced WHERE clause) and for results of complex queries (i.e., a kind of advanced HAVING clause).

3.5. Equivalences for SPARQL expressions with Provenance

The main goal of this section is to show the equivalence holding between the standard evaluation of SPARQL queries, denoted by [[.]] (see Section 3.4), and the evaluation with provenance presented in this work, denoted by [[[.]]]. For this purpose, we first define the provenance projection operator Ψ as follows:

Definition 3.5.1 (Provenance Projection) Let Φ be an annotated relation representing the set of tuples of variable assignments and their annotations, then

Ψ(Φ) = Π_{var(P)}(σ_{I_s → ⊤}(Φ))

where var(P) is the set of variables occurring in P, and I_s → ⊤ denotes the standard interpretation I_s of an annotation formula evaluating to ⊤.

Basically, the provenance projection operator selects all the statements corresponding to the variable bindings of the evaluation and ignores the provenance dimensions assigned to the statements in the knowledge base. As discussed earlier, the standard interpretation I_s of an annotation formula evaluates to ⊤ if and only if the statements in the knowledge base are elements of the variable assignments obtained via evaluation, i.e., the corresponding variable assignment is in the set of variable assignments obtained via standard SPARQL evaluation extended with identifiers (see Definition 3.4.3). Based on this, the next proposition defines the equivalence holding between the evaluation with provenance and the standard SPARQL evaluation.
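Before turning to the proposition, the following Java sketch illustrates the effect of Ψ on the tagged-tuple representation; the data structures and the Boolean test standing in for the standard interpretation I_s are assumptions made purely for this illustration.

import java.util.List;
import java.util.Map;
import java.util.function.Predicate;
import java.util.stream.Collectors;

final class ProvenanceProjection {

    // One tuple of the annotated relation Φ: a variable assignment plus its annotation formula.
    record AnnotatedTuple(Map<String, String> assignment, Object annotation) {}

    /**
     * Ψ(Φ): keep the tuples whose annotation evaluates to ⊤ under the standard
     * interpretation I_s, project to var(P), and drop the annotations.
     */
    static List<Map<String, String>> psi(List<AnnotatedTuple> phi,
                                         Predicate<Object> standardInterpretation,  // does I_s evaluate to ⊤?
                                         List<String> varsOfP) {
        return phi.stream()
                .filter(t -> standardInterpretation.test(t.annotation()))
                .map(t -> varsOfP.stream()
                        .filter(t.assignment()::containsKey)
                        .collect(Collectors.toMap(v -> v, t.assignment()::get)))
                .distinct()
                .collect(Collectors.toList());
    }
}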

Proposition 3.5.1 Given a graph pattern expression P, the provenance projection of the result of the evaluation with provenance of P over an RDF+ dataset D+ is equivalent to the standard evaluation of P over the RDF dataset D, denoted by Ψ([[[P]]]^{D+}) = [[P]]^D.

Proof: (sketch) For sets of variable assignments obtained via a basic graph pattern, this follows immediately from Definition 3.4.3 and the application of the operator Ψ over Φ, so that each substitution Φ(µ) obtained by the evaluation over a dataset D+, [[[P]]]^{D+}, can be reduced to a substitution µ obtained from the evaluation of P over an RDF dataset D, [[P]]^D, defined in [Pérez et al., 2009, Arenas et al., 2009], by selecting all µ where I_s of an annotation formula evaluates to ⊤ and dropping all remaining substitutions of θ. For complex graph patterns, the evaluation defined in Definition 3.4.3 is indeed identical to [Pérez et al., 2009, Arenas et al., 2009]. ∎

For relations annotated with Boolean formulae and standard relational algebra, a proof of Proposition 3.5.1 can be found in [Fuhr and Rölleke, 1997].

Now we can define the equivalence holding between equivalent graph pattern expressions with respect to the evaluation with provenance, i.e., show that the results of equivalent queries are annotated with equivalent provenance dimensions. By equivalence we mean that the interpretations of the formulae used to annotate the bindings contained in the results of the equivalent queries are equivalent. As the algebra of annotation formulae viewed as syntactic objects we use (F, ∧, ∨, ⊥, ⊤) (see Section 3.4). The elements of F are Boolean formulae built over the set S of all RDF+ literal statements. Instead of the statements themselves we use their identifiers. In the tagged tuple model which we adopt here, a relation is represented by a function mapping all tuples of a domain to an annotation. Since we use F only as a placeholder, we could just as well directly annotate variable bindings with elements of different provenance algebras. Therefore, we first need to define restrictions on the algebra for provenance interpretation for each provenance dimension γ, (Ω_γ, ∧_γ, ∨_γ, ⊤_γ, ⊥_γ) as defined in Section 3.4, such that the equivalence holds.

As a motivation, before presenting the restrictions on the provenance algebra, the next example shows one case where the equivalence holds and another where it does not hold when the provenance dimensions are defined using the fuzzy logic operations presented in Example 3.4.10. Let P = (P1 AND P2) and P′ = (P2 AND P1) be equivalent complex graph pattern expressions. The annotation results for P and P′ consist of (F1 ∧ F2) and (F2 ∧ F1), respectively, where F1 and F2 denote annotation Boolean formulae. We evaluate P and P′ with the provenance dimension time. The Γ-interpretation I^f_time of the formula (F1 ∧ F2) is I^f_time(F1 ∧ F2) = I^f_time(F1) ∧_time I^f_time(F2) = max(I^f_time(F1), I^f_time(F2)). Likewise, the Γ-interpretation I^f_time of the formula (F2 ∧ F1) is I^f_time(F2 ∧ F1) = max(I^f_time(F2), I^f_time(F1)). As shown in this example, the provenance evaluation for P and P′ leads to the same result; thus it is equivalent. In the following we summarize some of the restrictions on the provenance algebra. Let P1, P2 and P3 be graph pattern expressions and R a built-in condition:


1. [[[P1 AND P2]]]^{D+}_G = [[[P2 AND P1]]]^{D+}_G, and [[[P1 UNION P2]]]^{D+}_G = [[[P2 UNION P1]]]^{D+}_G

Since annotations for results of graph patterns of the form P1 AND P2 are of the form F1 ∧ F2, where F1 and F2 denote annotation formulae, and annotations for results of graph patterns of the form P1 UNION P2 are of the form F1 ∨ F2, we require that

∧ and ∨ are associative and commutative

for annotation formulae in order to guarantee that results of equivalent queries are associated with equivalent provenance.

2. [[[P1 AND (P2 UNION P3)]]]^{D+}_G = [[[(P1 AND P2) UNION (P1 AND P3)]]]^{D+}_G

In order to satisfy this equivalence, the annotation formulae and the algebras used to interpret these formulae need to satisfy the equivalence:

I^f_γ(F1 ∧ (F2 ∨ F3)) ≡ I^f_γ((F1 ∧ F2) ∨ (F1 ∧ F3))

3. [[[(P1 UNION P2) FILTER R]]]^{D+}_G = [[[(P1 FILTER R) UNION (P2 FILTER R)]]]^{D+}_G, and
   [[[(P1 FILTER R1) FILTER R2]]]^{D+}_G = [[[P1 FILTER (R1 ∧ R2)]]]^{D+}_G.

Now we look at equivalences in the context of FILTER expressions. Equivalence requires:

I^f_γ((F1 ∨ F2) ∧ r) ≡ I^f_γ(F1 ∧ r) ∨ I^f_γ(F2 ∧ r)

which follows from (1) and (2) above. Analogously:

I^f_γ((F1 ∧ r1) ∧ r2) ≡ I^f_γ(F1 ∧ r3)

where r3 = r1 ∧ r2.

4. [[[P1 AND P1]]]^{D+}_G = [[[P1]]]^{D+}_G

Equivalence requires idempotency:

I^f_γ(F ∧ F) ≡ I^f_γ(F)

Based on the restrictions presented above, we see evidence for requiring the provenance algebra to form a commutative semiring [Green et al., 2007], since the laws of commutative semirings are forced by the expected identities in the provenance algebra. For this purpose, the next theorem states that the provenance algebra has to take the form of a commutative semiring in order for the equivalences presented above to hold for provenance interpretations. With the commutative semiring approach, the extensibility of our approach increases due to the low level of restriction on the provenance algebra required for the equivalences.

Theorem 3.5.1 Let P1 and P2 be equivalent graph pattern expressions using the AND, UNION, and FILTER operators and D+ an RDF+ dataset. We have that [[[P1]]]^{D+} = [[[P2]]]^{D+} if and only if all provenance dimensions form a commutative semiring.


Proof: (sketch) This proof is an immediate consequence of the laws of commutative semirings defined in [Green et al., 2007]. Suppose that the provenance algebra (Ω_γ, ∧_γ, ∨_γ, ⊤_γ, ⊥_γ) is not a commutative semiring, yet the following identities hold (see Definition 3.4.5 and Definition 3.4.6): union is associative and commutative and has an identity; join is associative, commutative and distributive over union; projections and selections commute with each other as well as with unions and joins. However, according to [Green et al., 2007] these identities hold for a provenance algebra if and only if (Ω_γ, ∧_γ, ∨_γ, ⊤_γ, ⊥_γ) is a commutative semiring, which contradicts our assumption. Hence the provenance algebra is a commutative semiring, i.e., an algebraic structure (Ω_γ, ∧_γ, ∨_γ, ⊤_γ, ⊥_γ) such that (Ω_γ, ∧_γ, ⊤_γ) and (Ω_γ, ∨_γ, ⊥_γ) are commutative monoids, ∧_γ is distributive over ∨_γ, and ∀a ∈ Ω_γ: a ∧_γ ⊥_γ = ⊥_γ ∧_γ a = ⊥_γ. ∎

To illustrate an instance of this theorem, consider the provenance formulae F1 ∧ F2 and F2 ∧ F1 for the graph pattern expressions P = (P1 AND P2) and P′ = (P2 AND P1). Evaluating them in any of the provenance dimensions, we indeed get the same value from the interpretation.
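To make these algebraic requirements concrete, the following Java sketch models a provenance dimension as a commutative semiring and instantiates it with the fuzzy certainty operations suggested by Example 3.4.10 (∧ as min, ∨ as max). The record and the concrete values are illustrative assumptions; the assertions merely spot-check, for this instance, the identities behind equivalences 1, 2 and 4 above and the annihilation law used in the proof of Theorem 3.5.1.

import java.util.function.BinaryOperator;

/** A provenance dimension modeled as a commutative semiring (Ω_γ, ∨_γ, ∧_γ, ⊥_γ, ⊤_γ). */
record ProvenanceSemiring<T>(BinaryOperator<T> or, BinaryOperator<T> and, T bottom, T top) {

    /** Fuzzy certainty dimension: ∨ = max, ∧ = min, ⊥ = 0.0, ⊤ = 1.0. */
    static ProvenanceSemiring<Double> certainty() {
        return new ProvenanceSemiring<>(Math::max, Math::min, 0.0, 1.0);
    }

    public static void main(String[] args) {
        ProvenanceSemiring<Double> c = certainty();
        double f1 = 0.9, f2 = 0.6, f3 = 0.7;
        // (1) commutativity: F1 ∧ F2 ≡ F2 ∧ F1
        assert c.and().apply(f1, f2).equals(c.and().apply(f2, f1));
        // (2) distributivity: F1 ∧ (F2 ∨ F3) ≡ (F1 ∧ F2) ∨ (F1 ∧ F3)
        assert c.and().apply(f1, c.or().apply(f2, f3))
                .equals(c.or().apply(c.and().apply(f1, f2), c.and().apply(f1, f3)));
        // (4) idempotency: F ∧ F ≡ F
        assert c.and().apply(f1, f1).equals(f1);
        // annihilation: a ∧ ⊥_γ = ⊥_γ
        assert c.and().apply(f1, c.bottom()).equals(c.bottom());
    }
}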

3.6. Tasks and Benefits

This section summarizes the discussed steps of provenance representation and utilization for the sample scenario that was introduced in Section 3.2.

Tasks for Administrators

In order to represent and utilize provenance, the system administrator has to make some design choices. In particular, the application-specific provenance dimensions must be defined. In our sample scenario, we consider three provenance dimensions: source, certainty, and timestamp. In the next step, the administrator defines the intended semantics of these dimensions in order to facilitate query processing with complex expressions and pattern combinations. For evaluating graph pattern expressions, we use the definition from Section 3.4.2, and we assume that the corresponding interpretations for the provenance dimensions are defined, e.g., the fuzzy logic operations presented in Example 3.4.10. Finally, data and available associated provenance are represented in RDF using named graphs [Carroll et al., 2005, Carroll and Stickler, 2004], and are imported into our RDF+-based repository.

Processing Performed by the System

We assume that the system imports the small sample dataset introduced in Section 3.2. The knowledge base is transformed into the RDF+ quintuples shown in Example 3.4.2, as discussed in Section 3.3. Associated provenance is transformed into further RDF+ literal statements and RDF+ provenance statements. For the dimensions prov:time and prov:certainty an example is presented in Example 3.4.14. Following our sample scenario, the query from Example 3.4.1 can be reformulated as the query from Example 3.4.15, which retrieves names of movie experts together with their occupations and the associated provenance. Internally, the query processor evaluates this query using graph patterns as discussed in Section 3.4.2. If P denotes the graph pattern from this query, then all matches for all variables in P are given by the evaluation [[[P]]]. The resulting set of annotated variable assignments is shown in Example 3.4.8. It contains possible variable assignments, and the how-provenance (S3) that explains how these source statements have been used. By combining this information with definitions for provenance dimensions and available provenance statements, the query processor constructs the result shown in Example 3.4.16. This result is then serialized in RDF.

Benefits for Users and Developers

The user or application developer can access the knowledge stored in the RDF+-based repository in different ways. On the one hand, the repository does not change the existing SPARQL semantics and thus fully supports common SPARQL queries. This is an important advantage for compatibility with existing applications and interfaces. On the other hand, the repository supports SPARQL with provenance (Section 3.4.2). Thus, the user obtains additional access to valuable provenance that can be used for relevance ranking, conflict resolution, or other applications in connection with the retrieved knowledge. In our application scenario, the user may realize that the query answer is potentially contradictory (James Cameron is a film producer and a truck driver). By inspecting the associated provenance, he would realize that the second fact was generated by mistake. In fact, it is based on outdated information (knowledge from the document myspace.com/deprecated.doc with timestamp 6/6/1980) that was wrongly combined with knowledge from a more recent source (namely the document dbpedia.org/data/JamesCameron.xml with timestamp 5/5/2014). It turns out that the occupation of James Cameron has actually changed from truck driver to film producer, and the erroneous tuple can be safely excluded from further processing.

3.7. Complexity

In this section we analyze how the construction of the annotations influences the complexity of the decision problem related to SPARQL. The decision problem associated with the standard evaluation of a SPARQL query can be stated as follows [Pérez et al., 2009]: Given an RDF dataset D, a graph pattern P and a mapping µ, determine whether µ is in the result of P applied to D. For this decision problem, which we denote by Eval_D(P), an analysis of the complexity is presented in [Pérez et al., 2009, Arenas et al., 2009]. In the context of RDF+ datasets and annotated variable assignments we have a slightly different decision problem: Given an RDF+ dataset D+, an RDF+ graph pattern P, a variable assignment µ and an annotation α, determine whether α is the correct annotation of µ. We denote this problem by Eval+_{D+}(P). An annotation is correct if the formula is equivalent (in the logical sense) to the formula obtained by evaluation as defined in Section 3.4. With the following theorems we show the time complexity and space complexity of Eval+. Note that the RDF+ dataset D+ is built over a set of RDF literal statements and provenance statements of the RDF dataset D (see Section 3.3.4). For D+ = (K, M), we assume that the size |D+| = |K| ⋅ |M| ≤ 1/2 |D|², and consequently the time complexity of building D+ is in the order of O(|D|²). The RDF counterparts of both theorems have been established by [Pérez et al., 2009, Arenas et al., 2009]. Likewise, we restrict ourselves to graph patterns which do not contain blank nodes. In the first theorem we consider graph patterns which use only AND and FILTER operations.


Bindings obtained by such patterns are annotated with formulae which do not contain any other operator besides ∧ according to Definition 3.4.4.

Theorem 3.7.1 Eval+ can be solved in time O(|P| ⋅ |D+|) for graph pattern expressions constructed by using only AND and FILTER operators and for annotation formulas using only the operation ∧.

Proof: The proof consists of two parts: in the first part we construct a correct annotation α̂ for µ, and in the second we check whether α and α̂ are equivalent. In the following, let p_i denote pattern i from P and µ(p_i) denote the triple obtained by replacing the variables in the pattern p_i according to µ.

In order to construct α̂ we start by evaluating [[[p_i]]]^{D+} for all patterns using Definition 3.4.3. In order to achieve this, D+ needs to be searched for all µ(p_i). This can be performed in O(|P| ⋅ |D+|). Then, we construct the algebraic expression Φ by evaluating [[[P]]]^{D+} using Definition 3.4.4. This can be performed by traversing P. The correct annotation α̂ of µ in the result of evaluating P on D+ is defined as Φ(µ) (see Definition 3.4.5). We can construct α̂ = Φ(µ) in a depth-first traversal of Φ as follows: Let Φ1, Φ2 be algebraic expressions which are part of Φ. For each join operation in Φ, the annotation of (Φ1 ⋈ Φ2)(µ) is given by Φ1(µ) ∧ Φ2(µ). For each select operation, σ_c(Φ)(µ) is given by Φ(µ) if condition c is fulfilled, otherwise it is given by ⊥. Since traversing P and Φ each has time complexity O(|P|), the evaluation of α̂ = Φ(µ) remains in O(|P| ⋅ |D+|).

Now we determine whether, for a given annotation α, α ≡ α̂ holds. We transform both formulas into a normal form in O(|P| ⋅ log |P|) using associativity, commutativity and idempotency of ∧. First, we remove all brackets, then establish an order among atomic formulas (identifiers and ⊤, ⊥), and finally remove duplicates. For α and α̂ normalized this way, syntactic equality coincides with logical equivalence. Assuming |P| < |D+|, the overall complexity remains in O(|P| ⋅ |D+|). ∎
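As an illustration of the equivalence check used in this proof, the following Java sketch (hypothetical helper code, not taken from the implementation evaluated in Section 3.8) normalizes purely conjunctive annotation formulas, represented as lists of statement identifiers, and compares them syntactically; under associativity, commutativity and idempotency of ∧ this suffices for logical equivalence.

import java.util.List;
import java.util.TreeSet;

final class ConjunctiveAnnotation {

    /** Normal form of a pure ∧-formula: sort the identifiers and drop duplicates. */
    static List<String> normalize(List<String> identifiers) {
        return List.copyOf(new TreeSet<>(identifiers));     // TreeSet sorts and de-duplicates
    }

    /** α ≡ α̂ for conjunctive formulas iff their normal forms are syntactically equal. */
    static boolean equivalent(List<String> alpha, List<String> alphaHat) {
        return normalize(alpha).equals(normalize(alphaHat));
    }

    public static void main(String[] args) {
        // (θ1 ∧ θ2) ∧ θ1  ≡  θ2 ∧ θ1
        System.out.println(equivalent(List.of("theta1", "theta2", "theta1"),
                                      List.of("theta2", "theta1")));   // prints: true
    }
}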

Theorem 3.7.2 Eval+ is NP-complete for graph pattern expressions constructed by using only AND, FILTER and UNION operators.

Proof: The proof consists of two parts: in the first part we show that Eval+ is contained in NP, and in the second we establish NP-hardness of Eval+. As above, let p_i denote triple pattern i from P and µ(p_i) denote the triple obtained by replacing the variables in the pattern p_i according to µ.

The first part consists of two steps as well: first we construct a correct annotation α̂ for µ and then we check whether α and α̂ are equivalent. In order to construct α̂ we start by evaluating [[[p_i]]]^{D+} for all patterns using Definition 3.4.3. In order to achieve this, D+ needs to be searched for all µ(p_i). Then, we construct the algebraic expression Φ by evaluating [[[P]]]^{D+} using Definition 3.4.4. This can be performed by traversing P. The correct annotation α̂ of µ in the result of evaluating P on D+ is defined as Φ(µ) (see Definition 3.4.5). We can construct α̂ = Φ(µ) in a depth-first traversal of Φ as follows: Let Φ1, Φ2 be algebraic expressions which are part of Φ. For each join operation, (Φ1 ⋈ Φ2)(µ) is given by Φ1(µ) ∧ Φ2(µ). For each union operation, (Φ1 ∪ Φ2)(µ) is given by Φ1(µ) ∨ Φ2(µ). For each select operation, σ_c(Φ)(µ) is given by Φ(µ) if condition c is fulfilled, otherwise it is given by ⊥. Time complexity for evaluating Φ(µ) is O(|P| ⋅ |D+|).

The evaluation of α ≡ α̂ is more difficult if UNION operations are contained in P. But it is subsumed by checking equivalence of Boolean formulae, which is an NP-complete problem. Thus, the decision problem Eval+ is contained in NP. We can deduce that Eval+ is an NP-complete problem if Eval, which has been shown to be NP-complete [Pérez et al., 2009, Arenas et al., 2009], can be reduced to it. Eval can be reduced to Eval+ if the evaluation defined in Section 3.4 results in an annotation α ≡ ⊥ exactly for those bindings which are not in the result of standard SPARQL evaluation. ∎

3.8. Evaluation

The framework metaK described in this work has been implemented in the Java language, using libraries from the Sesame Project8. From a practical point of view, the framework can easily be customized for new provenance aspects. Dimension-specific interpretations of provenance dimensions are implemented separately as small Java classes. For defining a new provenance aspect, the corresponding interpretation function must be implemented with respect to the framework interfaces (a sketch of such a class is given below). In order to evaluate the overhead produced by the evaluation of provenance dimensions for the results of SPARQL queries, we carried out two experiments based on the well-known LUBM benchmark [Guo et al., 2005]. Our main aim was to find out whether the evaluation of SPARQL queries remained feasible if provenance was provided within the query results, i.e., to conduct a family of experiments to validate the theoretical results of Section 3.7. A key question is how to separate the additional effort for the evaluation of provenance from standard SPARQL processing. Triples describing provenance receive an additional provenance interpretation according to Section 3.3. At the same time they are also treated as ordinary RDF triples (and thus can be queried using standard SPARQL). Thus, adding provenance to a knowledge base increases its overall size, which also increases the workload for standard query processing. In order to account for this, we compare query evaluations performed on the same knowledge base which includes provenance. Our implementation is built on top of Sesame9 2.0 (beta 6) using query rewriting. The triple store is used to store both knowledge and provenance. We expect that a native implementation of the provenance framework could achieve increased performance.
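To give an impression of what such a dimension-specific class could look like, the following sketch shows a hypothetical plug-in interface together with a certainty interpretation based on the fuzzy operations of Example 3.4.10 (∧ as min, ∨ as max). The interface and class names are illustrative assumptions and do not reproduce the actual metaK interfaces.

import java.util.Map;

sealed interface Formula permits Formula.Id, Formula.And, Formula.Or {
    record Id(String statementId) implements Formula {}
    record And(Formula left, Formula right) implements Formula {}
    record Or(Formula left, Formula right) implements Formula {}
}

/** Hypothetical framework interface: the interpretation I^f_γ of one provenance dimension. */
interface DimensionInterpretation<V> {
    V interpret(Formula f, Map<String, V> valueOfStatement);
}

/** Certainty dimension interpreted with fuzzy-logic operations: ∧ = min, ∨ = max. */
final class CertaintyInterpretation implements DimensionInterpretation<Double> {
    public Double interpret(Formula f, Map<String, Double> v) {
        return switch (f) {
            case Formula.Id id   -> v.get(id.statementId());   // look up the prov:certainty of θ
            case Formula.And and -> Math.min(interpret(and.left(), v), interpret(and.right(), v));
            case Formula.Or or   -> Math.max(interpret(or.left(), v), interpret(or.right(), v));
        };
    }
}

Interpreting, for instance, the formula θ1 ∧ θ3 with the certainty values of Example 3.4.14 yields min(0.9, 0.6) = 0.6.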

3.8.1. Methodology

For the evaluation of SPARQL on RDF+ we defined the results of SELECT queries to be set-valued (see Section 3.4.2). For standard SPARQL [Prud'hommeaux and Seaborne, 2008], however, a SELECT query returns a solution sequence which may contain duplicate elements. All 14 queries of the LUBM benchmark are SELECT queries. To allow for a consistent comparison of the extended evaluation on one side and the standard evaluation on the other, we added the keyword DISTINCT to the queries for standard evaluation. This tells the query processor to eliminate duplicates. Since the evaluation of individual provenance dimensions can be arbitrarily complex, we compare the following three kinds of query evaluation:

● Standard evaluation with additional duplicate elimination (SD), as performed by Sesame,

8 Sesame Project:
9 Sesame: open source framework for storage, inferencing and querying of RDF data (www.openrdf.org)


● Evaluation of provenance formulae for each query result (PF) and

● Evaluation of four basic provenance dimensions (M4), namely agent, confidence, creation time and source, see Example 3.4.10 (we evaluate agent analogously to source).

Only the last type of query processing actually makes use of the additional triples.

3.8.2. Data

We added artificial provenance to the LUBM data. Amount and granularity of the additional provenance are key dimensions of the resulting dataset. We created two datasets containing a different percentage of provenance triples. The first dataset (MK29) was created based on LUBM OWL data for ten universities. We added the provenance by putting groups of ten consecutive original triples into one named graph and associating random values of the provenance dimensions agent, confidence, creation time and source with this graph name, see Example 3.8.1. As a consequence, the dataset contains 29 percent provenance triples, which might be reasonable in a real-world scenario. The resulting dataset consists of 1.8 million triples, of which 1.3 million triples were created by the LUBM generator and to which we added 0.5 million provenance triples. It contains 0.3 million (additional) graph URIs.

Example 3.8.1 Original triples of dataset MK29 in one named graph associated with provenance dimensions

http://www.x-media-project.org/ontologies/someGraph#0-030
    http://www.Department0.University0.edu/FullProfessor1
    http://www.lehigh.edu/%7Ezhp2/2004/0401/univ-bench.owl#doctoralDegreeFrom
    http://www.University882.edu

    http://www.University882.edu
    http://www.w3.org/1999/02/22-rdf-syntax-ns#type
    http://www.lehigh.edu/%7Ezhp2/2004/0401/univ-bench.owl#University
    ...

http://www.provenance.semanticweb.org0-030
    http://www.x-media-project.org/ontologies/someGraph#0-030
    http://www.x-media.org/ontologies/metaknow#source
    http://www.x-media-project.com/ex#source30

    http://www.x-media-project.org/ontologies/someGraph#0-030
    http://www.x-media.org/ontologies/metaknow#confidencedegree
    0.7
    ...
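The following Java sketch illustrates how such artificial provenance could be generated. It is hypothetical code written for this description (the namespaces, property names and value ranges are placeholders, not those of the actual generator): consecutive LUBM triples are grouped into named graphs of ten, and each graph name receives random values for the four provenance dimensions.

import java.util.*;

final class ProvenanceGenerator {

    record Triple(String s, String p, String o) {}
    record Quad(String g, String s, String p, String o) {}

    /** Group 'groupSize' consecutive triples into one named graph and annotate the graph name. */
    static List<Quad> annotate(List<Triple> lubmTriples, int groupSize, Random rnd) {
        List<Quad> out = new ArrayList<>();
        for (int i = 0; i < lubmTriples.size(); i++) {
            String graph = "http://example.org/someGraph#" + (i / groupSize);   // placeholder namespace
            Triple t = lubmTriples.get(i);
            out.add(new Quad(graph, t.s(), t.p(), t.o()));
            if (i % groupSize == 0) {                 // one set of provenance statements per graph
                String prov = "http://example.org/provenanceGraph";
                out.add(new Quad(prov, graph, "meta:agent", "agent" + rnd.nextInt(100)));
                out.add(new Quad(prov, graph, "meta:confidencedegree", String.valueOf(rnd.nextDouble())));
                out.add(new Quad(prov, graph, "meta:creationTime", "2009-0" + (1 + rnd.nextInt(9)) + "-01"));
                out.add(new Quad(prov, graph, "meta:source", "source" + rnd.nextInt(100)));
            }
        }
        return out;
    }
}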

The dataset MK400 was created based on LUBM OWL data for three universities; here we assigned each triple to a different named graph and associated the four provenance dimensions with it. This way the knowledge base contains four times as many provenance triples as knowledge triples. The resulting dataset consists of 1.7 million triples, of which 0.35 million are original LUBM data and 1.4 million are additional metadata. Note that the overall sizes of the two datasets are comparable. As indicated above, the overall size of the knowledge base influences the workload for query processing. In fact, main memory consumption appeared to be a key factor. By choosing similar sizes for the two datasets we reduce the influence of factors related to the size of the knowledge base and concentrate on how the different evaluations influence the processing time.

          SD    PF    M4
MK29      313   894   1382
MK400     196   104   425

Table 3.1.: Random query sequence experiments: average processing time (ms)

3.8.3. Discussion and Results

Random Query Sequence.

We conducted two experiments with each of the two datasets. One experiment aims at simulating the behavior of a query engine in a real-life scenario: first a dataset is loaded and then a sequence of queries is submitted to the query engine. We measured the average processing time of these queries. The query sequence was created based on 8 of the 14 queries from the LUBM benchmark. Query 2 was not included since we were not able to obtain results for this query with this version of Sesame, this dataset and an average machine. Five more queries were discarded since they require OWL inferencing or hierarchy information to obtain complete results and Sesame 2 was not able to obtain bindings using plain SPARQL processing.


Since no provenance needs to be calculated if the result set is empty, using these queries would bias the evaluation in favor of the provenance processing. The authors of the benchmark identified three main characteristics of queries with respect to plain SPARQL processing: input size, selectivity and complexity. The remaining 8 queries still cover different settings for these features. Experiments with single queries will be presented below.

The query sequence consisted of a random shuffle of 20 copies of each of the remaining queries. For each query in the sequence we measured the time which elapsed during issuing the query, obtaining the result and traversing the result sequentially. This measure is similar to the query response time defined in [Guo et al., 2005]. The only difference is that in [Guo et al., 2005] each query was performed ten times after the knowledge base had been loaded to measure the caching performance of the query engine. For each run we determined the average of the execution times of the queries in the sequence in question. The results are summarized in Table 3.1. Evaluation of provenance formulae (PF) given dataset MK29 almost tripled the average query execution time. On average, the evaluation took about half a second longer than standard evaluation (SD). The evaluation of four provenance dimensions (M4) adds again half a second to the evaluation of provenance. The overall overhead to obtain provenance was about a second, given a non-trivial dataset. We consider this to be an indication for the feasibility of our approach. Since these numbers are average values, the question remains whether reasonable processing times may be achieved for all individual queries as well. This question will be analyzed below.

For the dataset MK400 the computations were faster for all three kinds of evaluations. This can be explained by the fact that this dataset contains a smaller number of knowledge triples and therefore the result sets for some of the queries contain fewer bindings, as we will see below. Surprisingly, evaluation of provenance formulas needed less time than standard evaluation. A possible explanation is optimization performed by the query processor of Sesame 2. Since our implementation uses query rewriting, different queries are evaluated for the different kinds of evaluations. Optimization techniques might be easier to apply to some of them. Why does the evaluation of MK29 not show similar characteristics? Here a possible explanation is the larger number of bindings involved in the evaluation. Main memory consumption was quite large in both cases. Possibly there was no space left for caching of (intermediary) results given dataset MK29. The calculation of the four provenance dimensions did result in an increase in processing time, as expected.

Single LUBM Queries.

We also measured processing times for single queries. As stated above, we measured the time which elapsed during issuing the query, obtaining the result and traversing the result sequentially. In contrast to the previous experiment and to the definition of the query response time from [Guo et al., 2005], the application was restarted before each single query execution. That way each query was evaluated against a newly loaded knowledge base, since we wanted to measure the effort it takes to evaluate the queries and not the caching strategy of the query engine. This procedure was repeated ten times for each query and each method of query evaluation. The average values from these runs are summarized in Tables 3.2 and 3.3. The standard deviations estimated from these ten runs are less than or equal to 10 percent for all queries and methods evaluated on the two datasets.


Query                     Q1    Q4    Q5    Q6     Q7    Q8    Q10   Q14     Av.
# bindings                5     10    411   24019  3     1874  4     75547
processing     SD         127   163   211   357    135   1619  127   977     465
time (ms)      PF         324   347   386   995    337   691   325   2320    716
               M4         340   375   426   2187   346   878   326   10501   1922

Table 3.2.: Processing time of query evaluation with and without provenance for individual queries and the MK29 dataset. Processing times are average values of 10 runs each.

Query                     Q1    Q4    Q5    Q6     Q7    Q8    Q10   Q14     Av.
# bindings                5     10    411   6390   3     1874  4     19868
processing     SD         137   189   204   217    131   1596  124   379     372
time (ms)      PF         367   350   367   517    316   676   300   769     458
               M4         346   359   440   859    321   1083  306   1953    708

Table 3.3.: Processing time of query evaluation with and without provenance for individual queries and the MK400 dataset. Processing times are average values of 10 runs each.

On average the calculation of provenance formulas (PF) increases processing time by a factor of 1.5. In absolute numbers the average increase is about 0.2 seconds. The largest increase (1.3 seconds for the MK29 dataset) can be observed for query 14, which also gives the largest result set. We attribute this to the main memory consumption of our implementation. For query 8 there is even a decrease in processing time; as for the random query sequence evaluated on MK400, this may possibly be explained by optimizations of the query processor. The additional querying for provenance might guide the optimization of the query execution. Calculation of the four basic provenance dimensions (M4) causes an average increase by a factor of 4.1 (1.5 seconds) for the MK29 dataset and by a factor of 1.9 (0.3 seconds) for the MK400 dataset compared to standard evaluation. We ascribe the larger increase for the runs based on dataset MK29 to the larger number of results for queries 6 and 14 and the non-optimized memory consumption of our implementation. These results are in line with the results from the experiments using a random query sequence shown in Table 3.1. For query 8 the processing time decreases for query evaluation including provenance, which might be explained by query optimization of the underlying triple store, as stated above.

Discussion.

First we consider the experiments with individual queries. From the estimated standard deviations of the 10 runs for each query, kind of evaluation, and dataset, we conclude that the measurements are reliable enough to draw two general conclusions: on the one hand we can observe a noticeable increase of processing time using our implementation, and on the other hand the amount of this increase can be described by a small linear factor. The results of the experiments using a random sequence of queries indicate that similar results also hold for real-world scenarios. Here caching can be applied by a query processor.


Several evaluations of the same query were repeated directly one after another in the sequence. The resulting average processing times were smaller but still comparable to the average values for the evaluations of single queries. A key insight is that the overall processing times remain feasible for evaluations on a dataset of up to 1.8 million triples and results of up to 75,000 bindings. If we compare the processing times obtained for the two different datasets we might expect that for dataset MK400 processing times would increase by a larger factor if provenance were involved. MK400 contains a larger number of provenance triples which need to be processed in order to evaluate provenance – especially compared to the number of knowledge triples. However, at least with our implementation, the dominant characteristic of query evaluation appears to be the size of the result set.

3.9. Related Work

Better understanding the ways by which results came about is fundamental to many Semantic Web applications and scenarios. The specification of the Semantic Web proof layer was discussed in [Murdock et al., 2006, da Silva et al., 2006, McGuinness and da Silva, 2004].

Provenance in Data Management Systems based on Relational Algebra

Provenance has been initially studied as an extension of relational databases (i.e., data management systems based on relational algebra) and probabilistic databases [Fuhr and Rölleke, 1997], and was later adopted for RDF knowledge bases (i.e., for the semantics of the SPARQL query language). In the area of database systems, provenance is often represented using an extension of the relational data model, coined annotated relations. Their purpose is primarily the description of data origins (provenance) and the process by which data arrived as a query answer [Cui and Widom, 2000, Buneman et al., 2000, Buneman et al., 2001, Ding et al., 2005]. The authors define custom (possibly different) interpretations for algebraic operations of Boolean formulas (built on tuple identifiers as Boolean variables) for particular dimensions of provenance (e.g., agent, timestamp, source, certainty) to obtain the m-dimensional record (i.e., query result). Indeed, a Boolean expression of the result set built from tuple identifiers not only tells us which triples have contributed to a variable assignment (why-provenance) but also how they contributed (how-provenance). This allows a distinction to be drawn between where-provenance (where the given pieces of data are physically serialized in database tuples), why-provenance (which subset of database tuples contributed to the result), and how-provenance (how particular tuples were used for constructing the result). Basically, our methodology follows the same idea and adopts the notion of provenance semirings, which was introduced as a generalization framework in [Green et al., 2007] and also allows the same distinction between where-, how-, and why-provenance. In contrast to other approaches discussed in [Cui and Widom, 2000, Buneman et al., 2000, Buneman et al., 2001, Ding et al., 2005] (based on the notion of annotated tuples in a relational schema), our solution is customized for RDF graphs (i.e., annotated RDF quadruples). The same holds for the corresponding query language (basically, SPARQL vs. SQL) and its semantics. An important conceptual difference to the relational model is the natural ability of RDF/SPARQL repositories for result serialization and thus for the seamless exchange and utilization of knowledge and provenance from our framework across multiple Semantic Web nodes without additional schema integration efforts.

Provenance and the Semantic Web

In the Semantic Web field, provenance has recently been considered in applications for assessing the trustworthiness of information [Carroll et al., 2005, Ding et al., 2005]. Hartig [Hartig, 2009b] presents a trust-aware extension to SPARQL, tSPARQL. Hartig first proposes a trust model that associates RDF statements with trust values and then extends the SPARQL semantics to access these trust values in tSPARQL. In [Bonatti et al., 2011], Bonatti et al. formalise a logical framework for determining trust to perform robust reasoning. In [Schenk, 2008], Schenk models multiple levels of trust on Web data to resolve inconsistencies arising from connecting multiple data sources. The version control aspect of provenance, such as temporal RDF [Gutiérrez et al., 2005] and querying time in RDF [Tappolet and Bernstein, 2009, Gutierrez et al., 2007] and OWL [Motik, 2012], has been tackled by attaching temporal annotations to RDF facts and OWL axioms. Our approach is focused on an RDF language model and provides fine-grained provenance management for retrieval queries with SPARQL that is not directly comparable with proof traces for OWL reasoning.

The use of Named Graphs for provenance management and querying RDF with provenance constraints is introduced in [Carroll et al., 2005]. As discussed in Section 3.4.4, this functionality can be realized in our framework in a straightforward manner by means of nested queries (or querying from views), which are expected features of the upcoming SPARQL2 query language. Moreover, our framework potentially allows us to express filtering and/or ranking conditions on the provenance of query results and thus is more expressive and flexible than the solution presented in [Carroll et al., 2005].

Provenance mechanisms for RDF datasets have been investigated by Udrea et al. [Udrea et al., 2010], who provide a semantics for RDF augmented with a partially ordered set and algorithms for query processing and view maintenance. As our approach does not consider the OPTIONAL SPARQL constructor, follow-up approaches to ours [Theoharis et al., 2011, Karvounarakis and Green, 2012, Geerts et al., 2013, Karvounarakis et al., 2013] discuss the need for extending the relational provenance models to be leveraged for SPARQL queries over RDF. In particular, they state that the use of semiring models for SPARQL has been shown to be inadequate to handle the OPTIONAL construct, and advocate the need for a new abstract provenance model capturing the full expressiveness of SPARQL. Geerts et al. [Geerts et al., 2016] identify SPARQL fragments for which provenance models for positive relational queries can be leveraged, despite the subtle differences between the semantics of SPARQL and relational algebra operators. In [Amsterdamer et al., 2011c], Amsterdamer et al. show particular semirings for which an extension for supporting difference is impossible. The consideration of extended provenance models for capturing the semantics of both explicit and scoped weak negation is also considered in [Analyti et al., 2014]. In [Damásio et al., 2012], Damásio et al. propose a provenance model for a significant fragment of SPARQL 1.1, based on relational annotated provenance models, including non-monotonic constructs under multiset semantics.


3.10. Findings and Research Contribution

In this chapter, we have presented an original, generic, formalized and implemented approach for the management of many dimensions of provenance, such as source, authorship, certainty, and others, for RDF repositories. Our method re-uses existing RDF modeling possibilities in order to represent provenance. It then extends SPARQL query processing in such a way that, given a SPARQL query for data, one may request provenance without modifying the query proper. We achieve highly flexible and automatically coordinated querying for data and provenance, while completely separating the two areas of concern. Our approach remains compatible with existing standards and query languages and can easily be integrated with existing applications and interfaces.

4. Reasoning and Debugging Evolving OWL Ontologies with Provenance

Overview

For many tasks, such as the integration of knowledge bases in the Semantic Web, one must not only handle the knowledge itself, but also characterizations of this knowledge, e.g.: (i) where did a knowledge item come from? (ii) what level of trust can be assigned to a knowledge item? or (iii) what degree of certainty can be associated with it? We refer to all such kinds of characterizations as provenance of the information. Approaches exist for providing provenance for query answers in relational databases and RDF repositories, based on algebraic operations. As query answering in description logics in general does not boil down to the algebraic evaluation of tree-shaped query models, these formalizations do not easily carry over. In this chapter, based on a formalization of provenance proposed in [Schenk et al., 2011], which is still algebraic but allows for the computation of provenance of inferred knowledge in description logics (and includes reasoning with conflicting and incomplete provenance), we present an optimized algorithm for computing provenance. Our black-box algorithm for reasoning with provenance, which uses pinpointing to come up with provenance formulas for description logics, enables the use of provenance for debugging in real time even for very large and expressive ontologies, such as those used in biomedical portals.

Structure

The remainder of this chapter is structured as follows: In Section 4.2 we formalize a use-case scenario to motivate the use of provenance for tracking changes in ontologies. Section 4.3 introduces the foundations for provenance computation: first we briefly introduce an extension of the description logic DL, called SRIQ(D), underlying OWL Lite and OWL DL, followed by the formalization of pinpointing and existing algorithms for pinpoint computation, and the formalization of provenance orders. In Section 4.4 we define the semantics of provenance, including the composition of provenance dimensions (to model complex provenance) and the merging of conflicting provenance. In Section 4.5 we define the computation of provenance for a query answer based on pinpointing and propose an optimized algorithm for debugging with provenance. We discuss the issue of complexity in Section 4.6. Section 4.7 presents the evaluation results. Our evaluation shows that this algorithm performs orders of magnitude better than a naïve implementation. A review of related work and a comparison of this work with our approach are presented in Section 4.8.


4.1. Introduction

Ontologies often evolve in open, multi-user ontology editing environments. Examples of open, evolving ontologies are the Open Biomedical Ontologies repository of the US National Cancer Institute's Center for Bioinformatics [Smith et al., 2007] or the Ontology for Biomedical Investigations (http://obi-ontology.org), which is an integrated ontology for the description of life-science and clinical investigations. Many similar projects exist in varying domains. They are characterized by an open community with contributions at various points in time and from various sources, which might not be equally reliable. In such open settings conflicting changes can occur that may be contributed by various users and knowledge sources at different points in time, and it is desirable to track two unwanted situations:

1. Undesired inferences and

2. Inconsistencies

In order to judge the reliability of inferences and to find errors in the ontology when debugging ontologies, it is hence necessary to ask questions such as “When has this inconsistency been introduced and who is responsible for this change?” as well as “Can I trust this inference?”. Provenance information is used to answer these questions. Provenance can be tracked in many dimensions: knowledge source, most recent modification dates, degrees of trustworthiness, or the experience level of the editor. In all these cases, we have provenance labels, which are attached to axioms (e.g. timestamps), and orders over these labels, such as the ascending or descending order of timestamps or a partial order of trust among sources. This provenance information must be combined and propagated to a conclusion in order to answer the above-mentioned questions. Standard algorithms for debugging ontologies poorly support the user in answering provenance questions and require expensive reasoning. For these reasons they are not applicable to expressive and large-scale real-world ontologies. With the approach presented in this chapter we will show how to represent provenance and efficiently reason in OWL with provenance. Our approach supports the user in coping with the complexity and dynamics of evolving ontologies. Various approaches to the problem of debugging with provenance have been proposed. They can be grouped into three categories: (a) Extensions of given logical formalisms that deal with a particular type of provenance. Examples include extensions for debugging with uncertainty, such as fuzzy and probabilistic [Lukasiewicz and Straccia, 2008] or possibilistic [Qi et al., 2011] description logics. [Schenk, 2008] generalizes such works to consider partial trust orderings. (b) Flexible extensions for systems allowing for algebraic query evaluation (e.g., relational databases and SPARQL engines), as discussed in [Dividino et al., 2009b], [Buneman et al., 2001] and [Lopes et al., 2010], allow for many kinds of provenance, but are limited to the lower expressiveness of the underlying logical formalism. (c) [Tran et al., 2008] provides a two-step evaluation for provenance, which is very expressive, but which does not assign a uniform semantics to the definition and composition of provenance in the knowledge base. Expressive descriptions of provenance combined with less expressive base languages (such as SPARQL and SQL) [Dividino et al., 2009b, Buneman et al., 2001, Lopes et al., 2010] make use of the fact that the base languages can be evaluated bottom-up using relatively simple

algebraic expressions. However, debugging frameworks frequently have non-tree-based derivations used for consistency checking and querying. In order to be able to debug with algebraic provenance on top of such expressive base languages, Schenk et al. [Schenk et al., 2011] propose a debugging framework for provenance based on pinpointing. Pinpointing summarizes the explanations for an axiom in a single Boolean formula, which can then be evaluated using a provenance algebra. Consequently, provenance for query answers can be computed by computing the explanations of the answer and using the pinpointing formula to compute provenance in an algebraic way. Unfortunately, the computation of pinpoints may become very expensive and inapplicable if users need to interact with dynamically changing knowledge in real time. Therefore, we provide an optimized black-box algorithm, which does not need to compute all pinpoints.1 Our evaluation shows that our algorithm performs significantly better than the naïve algorithm, based on both real-world and synthetic datasets. We restrict ourselves to a fragment of description logics and of OWL-2, called SRIQ(D), and use OWL-2 as a base language and consistency checks as the debugging task. Finally, we will show an evaluation of our approach using real-world data from evolving ontologies. For the sake of example, we present our framework restricted to ontology diagnosis scenarios. This work, however, introduces an approach for provenance querying under a variety of scenarios, such as restrictions of access rights [Baader et al., 2009], knowledge validity when the truth of knowledge changes with time [Motik, 2012], and inferring trust values [Hartig, 2009b].

Research Questions

This chapter addresses the following research questions.

RQ 1.IV Does the exploitation of provenance lead to computational overhead?

OWL is heavily based on Description Logic (DL), i.e., its model-theoretic semantics is compatible with the semantics of Description Logics. Typical reasoning tasks over an expressive DL (e.g., using tableau methods to perform consistency checking, instance checking, satisfiability checking, etc. [Baader et al., 2003, Rudolph, 2011]) are in the worst case doubly-exponential, and in practice are often likewise very expensive. Standard algorithms for debugging ontologies poorly support the user in answering provenance questions and also require expensive reasoning. For these reasons, even though the exploitation of provenance helps users to find undesired inferences and inconsistencies in evolving ontologies, such algorithms are not applicable to expressive and large-scale real-world ontologies.

4.2. Motivation Scenario

We use as a toy example a scenario based on a serious faux pas which actually happened to the German Press Agency (DPA).

1Initial work towards the approach presented in this chapter has been published in [Schenk et al., 2009, Schenk et al., 2011]. We extend this work with an optimized algorithm for computing provenance and a comprehensive evaluation.


On September 10th 2009, a person called the German Press Agency (DPA) and notified them that a terrorist attack had just taken place in Bluewater, CA. In order to check the accuracy of this information, DPA did a short Internet research and found a website for Bluewater's local TV station (VPKTV) and for the town itself. Additionally, they found Wikipedia entries for both the town and the local TV station. DPA announced the attack as breaking news. Unfortunately for DPA, the information was fake. No terrorist attack had happened in Bluewater; in fact, the town Bluewater, CA, does not even exist. The breaking news had been spread and the fake background information had been set up by the marketing agency Neverest to support guerrilla marketing for the new movie Shortcut to Hollywood. Even though DPA relied on the information found on web pages and Wikipedia entries to make the decision on announcing as breaking news a supposed terrorist attack, DPA did not take into account the dynamics and the provenance of the information retrieved when checking the plausibility of the news. The new Wikipedia article had been created by the same author and modified multiple times shortly before the call. Moreover, all related websites as well as the Wikipedia articles had been set up by the same marketing agency at roughly the same time.

This situation is common, as Wikipedia is used nowadays as an information source by many people every day. However, since Wikipedia entries can be edited by anyone, people have begun to question their quality. The validity and quality of Wikipedia entries depend on the community knowledge, and unfortunately, Wikipedia cannot assure the reliability of its sources. When exploiting knowledge found in external sources, such as the Semantic Web, the user must be able to judge the appropriateness of such knowledge, i.e., he must discriminate knowledge and its validity based on where it comes from, how much he trusts the knowledge sources, and the degree of (un)certainty in such knowledge. We refer to all such kinds of characterizations as provenance.

To illustrate our method in a concise fashion, Table 4.1 shows the instantiation of our scenario. We suppose that the webpages of Bluewater and VPKTV are described by axioms of ontologies from a single source, that data from Wikipedia corresponds to an open, Wiki-like ontology editing system such as Freebase2, and that the provenance consists of modification dates and respective degrees of trustworthiness based on the information source. The axioms of Table 4.1 describe that a RealCity is a City with at least one Broadcaster (#1) and that a Broadcaster has its headquarters in a City (#2). Additionally they define that if a Broadcaster has its headquarters in a City, then this City hasCompany the Broadcaster (#3), that bluewater is a City (#4), that vpktv is a Broadcaster (#5), and that vpktv has its headquarters in the city bluewater (#6). Furthermore, the columns source and date of Table 4.1 show the provenance associated with each axiom. For instance, DPA, the German Press Agency, created axiom #1, Neverest, the marketing agency, created axiom #4, and NeverestWikiP, their Wikipedia user, created axiom #5. The rightmost column indicates when axioms have been asserted; for instance, axiom #1 was last asserted on September 10th 2009. DPA needs to verify that Bluewater indeed is a city in CA. Hence, they need to answer the query “bluewater : RealCity?” and to obtain provenance for the answer.

2http://freebase.org


ID   axiom                                          source          date
#1   RealCity ⊑ City ⊓ ∃hasCompany.Broadcaster      DPA             2009-09-10
#2   Broadcaster ⊑ ∃hqIn.City                       DPA             2009-09-09
#3   inverseProperty(hasCompany, hqIn)              DPA             2009-09-10
#4   bluewater ∶ City                               Neverest        2009-09-09
#5   vpktv ∶ Broadcaster                            NeverestWikiP   2009-09-10
#6   hqIn(vpktv, bluewater)                         NeverestWikiP   2009-09-09

Table 4.1.: Ontology axioms describing our scenario and their provenance.

In the rest of this chapter we will explain how this is done. To come up with a flexible mechanism which at the same time supports expressive logics and multiple kinds of provenance, a suitable formalization of provenance in a semantically precise manner is needed. Moreover, such a mechanism must be supported by a suitable operationalization. The DPA example above illustrates that varying provenance such as authorship, timestamp, and trust in individual sources must be exploited and combined to arrive at an accurate assessment of information value. In the following sections we introduce such a mechanism.

4.3. Foundations: Pinpointing

The term “pinpointing” has been coined for the process of finding explanations for concluded axioms or for a discovered inconsistency. Such an explanation is called a pinpoint: a minimal subset of an ontology which makes the concluded axiom true or the theory inconsistent, respectively. While there may be multiple ways to establish the truth or falsity of an axiom, a pinpoint describes exactly one such way.

Definition 4.3.1 (Pinpoint) A pinpoint P for an axiom A wrt. an ontology O is a set of axioms, such that P ⊆ O, P ⊧ A, and ∀B ∈ P ∶ P ∖ {B} ⊭ A.

Analogously, we can define a justification for a refuted axiom (O ⊧ ¬A) as P ⊆ O, P ⊧ ¬A, and ∀B ∈ P ∶ P ∖ {B} ⊭ ¬A. Hence, finding pinpoints for a refuted axiom corresponds to finding the Minimum Unsatisfiable Subontologies (MUPS) for this axiom [Kalyanpur et al., 2006]. In this work we will focus on entailed axioms (O ⊧ A); however, all definitions and algorithms can be adapted to justifications as well. Pinpointing is the computation of all pinpoints for a given axiom A and ontology O. The pinpointing formula [Baader and Peñaloza, 2010] describes which axioms need to be present for O to entail A.

Definition 4.3.2 (Pinpointing Formula) Let A be an axiom, O an ontology, and P1, ..., Pn with Pi = {Ai,1, ..., Ai,mi} the pinpoints of A wrt. O. Let id be a function assigning a unique identifier to an axiom. Then ⋁_{i=1}^{n} ⋀_{j=1}^{mi} id(Ai,j) is a pinpointing formula of A wrt. O.

A pinpointing formula of an axiom A describes which (combinations of) axioms need to be true in order to make A true (or inconsistent, respectively).


Algorithms for finding pinpoints can be grouped into three groups:

Finding one Pinpoint. Algorithms for finding one pinpoint can either derive a pinpoint by tracking the reasoning process of a tableaux reasoner, or use an existing reasoner as a black box. In the latter case, a pinpoint is searched for by subsequently growing (or shrinking) a subontology until it starts (or stops) entailing the axiom under consideration. Based on the derived smaller ontology the process is refined, until a pinpoint has been found. The advantage of black-box algorithms is that they can support any DL for which a reasoner is available [Kalyanpur et al., 2006].

Finding all Pinpoints using a Tableaux Reasoner. Baader and Peñaloza have shown how tableaux reasoners for DLs such as OWL can be extended to find pinpointing formulas [Baader and Peñaloza, 2010]. In their approach a tableaux reasoner is extended to find not only one, but all pinpoints. Special care needs to be taken in order to ensure termination of the tableaux algorithm. As an advantage, the overhead for pinpointing is lower compared to a black-box algorithm. Moreover, this approach can derive a compact representation of the pinpointing formula, which, however, might still have worst-case exponential size in normal form.

Finding all Pinpoints using Blackbox Algorithms. The most performant black-box algorithms extract a relevant module from the overall ontology, ensuring that this module yields the same inferences with respect to the axiom of interest. Then, starting from a single pinpoint, Reiter's Hitting Set Tree algorithm [Reiter, 1987] is used to compute all pinpoints by iteratively removing one axiom from the pinpoint at hand and growing it to a full pinpoint again [Kalyanpur et al., 2007, Ji et al., 2009]. For both tableaux-based and black-box algorithms, the worst-case complexity of finding all pinpoints is rather high, as there can be exponentially many pinpoints for any given ontology.

Our approach is based on this third group. Basically, we need to find all pinpoints to derive the provenance for a (refuted) piece of knowledge. However, since finding all pinpoints is a very expensive operation, we present an optimized algorithm for computing the pinpointing formula and deriving provenance.

4.4. Syntax and Semantics for OWL with Provenance

The following section presents the formalization of provenance that allows for the computation of provenance of inferred knowledge in description logics (and includes reasoning with conflicting and incomplete provenance). The formalization of provenance has mainly been developed by Simon Schenk. To aid understanding, the formalization of provenance is repeated in this dissertation. The results of our joint work have been published in [Schenk et al., 2011].

4.4.1. Syntax for OWL with Provenance

Provenance in OWL-2 can be expressed as annotations on axioms. Annotations are of importance since they may be used to support analysis during the collaborative engineering of ontologies. Basically, an axiom annotation assigns an annotation object to an axiom. For instance, in our scenario the axiom “RealCity ⊑ City ⊓ ∃hasCompany.Broadcaster” is assigned the information source annotation object DPA.


A provenance annotation consists of an annotation URI and a provenance object specifying the value of the annotation. In our case, the provenance object is a constant value representing who asserted the axiom, when the axiom was last asserted, the degree of trustworthiness of the axiom, or a combination thereof. We provide a detailed grammar for provenance annotations in [Schenk et al., 2009]. The grammar for provenance annotations as an extension of OWL-2 annotations3 is as follows. For the sake of clarity we use the prefix Provenance for extensions to the OWL-2 grammar, which uses the prefix OWL.

OWLAnnotation := ProvenanceAxiomAnnotation
ProvenanceAxiomAnnotation := 'ProvenanceAxiomAnnotation' '(' IRI ProvenanceAnnotation+ ')'
ProvenanceAnnotation := ProvenanceCertaintyAnnotation | ProvenanceDateAnnotation | ProvenanceSourceAnnotation
ProvenanceCertaintyAnnotation := 'ProvenanceCertaintyAnnotation' '(' Value ')'
ProvenanceSourceAnnotation := 'ProvenanceSourceAnnotation' '(' Value ')'
ProvenanceDateAnnotation := 'ProvenanceDateAnnotation' '(' Value ')'

An example of how provenance is represented and associated with OWL axioms is presented below. We consider axioms #4 and #6 from our scenario:

OWLAxiomAnnotation(ClassAssertion(bluewater City) ProvenanceAxiomAnnotation(annot1 ProvenanceSourceAnnotation(Nev)))

OWLAxiomAnnotation(ObjectPropertyAssertion(hqIn vpktv bluewater) ProvenanceAxiomAnnotation(annot2 ProvenanceDateAnnotation(09.09.2009)))
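Programmatically, such an annotated axiom could be created with the OWL API (the library used for the prototype in Section 4.7). The following minimal sketch is our own illustration and only an approximation of the grammar above: instead of the dedicated ProvenanceSourceAnnotation construct it uses a plain annotation property whose IRI (http://example.org/provenanceSource) is a made-up placeholder.

import java.util.Collections;

import org.semanticweb.owlapi.apibinding.OWLManager;
import org.semanticweb.owlapi.model.*;

public class ProvenanceAnnotationExample {
    public static void main(String[] args) throws OWLOntologyCreationException {
        OWLOntologyManager manager = OWLManager.createOWLOntologyManager();
        OWLDataFactory df = manager.getOWLDataFactory();
        OWLOntology ontology = manager.createOntology(IRI.create("http://example.org/bluewater"));

        // Axiom #4 of Table 4.1: bluewater : City
        OWLClass city = df.getOWLClass(IRI.create("http://example.org/City"));
        OWLNamedIndividual bluewater = df.getOWLNamedIndividual(IRI.create("http://example.org/bluewater"));
        OWLAxiom axiom4 = df.getOWLClassAssertionAxiom(city, bluewater);

        // The provenance source annotation; the annotation property IRI is illustrative only.
        OWLAnnotationProperty source =
                df.getOWLAnnotationProperty(IRI.create("http://example.org/provenanceSource"));
        OWLAnnotation nev = df.getOWLAnnotation(source, df.getOWLLiteral("Nev"));

        // Attach the annotation to the axiom and add the annotated axiom to the ontology.
        OWLAxiom annotatedAxiom4 = axiom4.getAnnotatedAxiom(Collections.singleton(nev));
        manager.addAxiom(ontology, annotatedAxiom4);
    }
}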

4.4.2. Semantics of Provenance

Provenance assignments are syntactically expressed in OWL-2 using axiom annotations. Annotations, however, have no semantic meaning in OWL-2. All annotations are ignored by the reasoner, and they may not themselves be structured by further axioms. Furthermore, such an abstract syntax may remain remarkably ambiguous if it cannot be linked to a formal semantics. Assume that the following provenance axioms are part of our scenario:

ID   axiom                                        trust
#1   RealCity ⊑ City ⊓ ∃hasCompany.Broadcaster    ∞
#1   RealCity ⊑ City ⊓ ∃hasCompany.Broadcaster    Nev

Table 4.2.: Example of multiple provenance assignments to the same axiom.

3OWL 2 Web Ontology Language: Spec. and Func.-Style Syntax: http://www.w3.org/TR/2008/WD-owl2-syntax-20081202


For the same axiom, identified by #1 and presented in Table 4.2, the question may arise whether this means a disjunction, i.e. one of the two sources has provided the fact, a conjunction, i.e. both sources have provided the fact, a collective reading, i.e. the two sources together gave rise to the fact, or whether this situation constitutes invalid provenance. In order to prevent such ambiguities, we introduce a generic semantic framework for provenance.

Provenance Order

Provenance may be established through various complex dimensions such as knowledge source, editors, modification dates, and degrees of trustworthiness. Provenance dimensions must be exploited and combined to arrive at an accurate assessment of information value [Halpin, 2009]. Provenance dimensions are defined in detail in the next section. First we take a closer look at the two specific dimensions of provenance used in the examples in this work: time and source.

Provenance is expressed using provenance labels. Examples of provenance labels are a timestamp or a source name. The label alone is not sufficient for the tracking of provenance when debugging an ontology; we also need a provenance order on the provenance labels. For example, suppose that a user has introduced an inconsistency during his last change. We might be interested in when the oldest axiom leading to the inconsistency was added, or when the newest one was added: the oldest axiom tells us when this particular topic was first addressed and the newest one tells us when the inconsistency was introduced. Here we use an ascending or descending order of timestamps.

Not all types of provenance labels have a natural order. There are, for example, many ways in which degrees of trustworthiness can be computed. Often simplifications are used, such as assuming trust to be measured on a scale from 1 to 10. Such simplifications usually are not easily justifiable [Halpin, 2009] when trust should be established using provenance labels such as the knowledge source. In particular, trust (and provenance in general) cannot always be measured on a total order; there may be agents which are incomparable. Please note that even though we use trust in our running example, the same principle applies, e.g., when modeling access rights, roles in an editing workflow, or when comparing world views of users participating in an ontology editing platform. Access right systems are a good analogy: they usually introduce some kind of ordering of users and groups, and this ordering is always artificial and usually also partial (not for every pair of users or groups is one more powerful than the other) [Baader et al., 2009]. We provide a generic formalization of provenance orders here, which subsumes others such as [Golbeck and Hendler, 2004] and enables us to use any kind of order over provenance labels in the following. The following formalization has been used in similar form in [Schenk, 2008].

Definition 4.4.1 (Provenance Order) A provenance order T is a lattice over a finite set of provenance labels. ∞ is the label used for the maximal element of the lattice.

If two provenance labels a and b are not comparable, we introduce virtual provenance labels infa,b and supa,b, such that:

• infa,b < a < supa,b and infa,b < b < supa,b;

• ∀c < a, c < b ∶ c < infa,b;

• ∀d > a, d > b ∶ d > supa,b.

source       trust
DPA          ∞
Neverest     Nev
NeverestWP   NevWP

Table 4.3.: Correspondences between source and degrees of trustworthiness.

To understand the importance of the last two conditions, assume that c > a > d, c > b > d, and a, b are incomparable. Then the provenance label of a ∧ b would be the provenance label of c, as c is the supremum in the provenance order. Obviously this escaping to a higher provenance label is not desirable: in our trust example, it would mean escaping to a higher trust level, and considering roles in an editing workflow, we might end up with the wrong role. Instead, the virtual provenance labels represent that we need to pick at least one of a or b. For convenience, we will write {a, b} for infa,b in the following. Note that in practice these values need not be pre-computed, so exponential blowup is not an issue; if they are needed, they can be encoded as lists of atomic labels.

Provenance orders subsume strict orders (such as [0..1], xsd:datetime and numeric trust degrees computed for Wikipedia [Adler and de Alfaro, 2007b]). A provenance order also allows for incomparable provenance labels, which are common on the Web due to its sheer size and usually incomplete knowledge.

We assign degrees of trustworthiness to axioms based on the expertise of the user by whom they have been modified. Hence, the knowledge source of a piece of information is used to establish trust. In our running example, we use the correspondences between source and degrees of trustworthiness presented in Table 4.3. Figure 4.1 illustrates a trust order for the sources shown in Table 4.1 from the perspective of DPA: ∞ is assigned to the most trustworthy source, which is DPA, and Nev and NevWP are incomparable.

[Figure: a Hasse diagram with supNev,NevWP above the incomparable labels Nev and NevWP, infNev,NevWP below them, and – at the bottom.]

Figure 4.1.: Provenance order of the knowledge sources.
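To make the role of the virtual labels tangible, the following sketch (our own illustration, not part of the formal framework) hard-codes the trust order of Figure 4.1 as a small lattice and computes meets and joins over it. Combining the incomparable labels Nev and NevWP then yields the virtual labels infNev,NevWP and supNev,NevWP instead of escaping to – or ∞.

import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** A small lattice over trust labels, following Figure 4.1 / Table 4.3. */
public class TrustOrder {

    // Labels, including the virtual ones for the incomparable pair Nev / NevWP.
    // "BOTTOM" stands for – and "INFINITY" for ∞ in the text.
    static final List<String> LABELS =
            Arrays.asList("BOTTOM", "infNevNevWP", "Nev", "NevWP", "supNevNevWP", "INFINITY");

    // Direct edges of the Hasse diagram; leq computes the reflexive-transitive closure.
    static final Map<String, Set<String>> ABOVE = new HashMap<>();
    static {
        ABOVE.put("BOTTOM", new HashSet<>(Arrays.asList("infNevNevWP")));
        ABOVE.put("infNevNevWP", new HashSet<>(Arrays.asList("Nev", "NevWP")));
        ABOVE.put("Nev", new HashSet<>(Arrays.asList("supNevNevWP")));
        ABOVE.put("NevWP", new HashSet<>(Arrays.asList("supNevNevWP")));
        ABOVE.put("supNevNevWP", new HashSet<>(Arrays.asList("INFINITY")));
        ABOVE.put("INFINITY", new HashSet<>());
    }

    /** a <= b in the trust order. */
    static boolean leq(String a, String b) {
        if (a.equals(b)) return true;
        for (String c : ABOVE.get(a)) {
            if (leq(c, b)) return true;
        }
        return false;
    }

    /** Join = least upper bound. */
    static String join(String a, String b) {
        String best = null;
        for (String c : LABELS) {
            if (leq(a, c) && leq(b, c) && (best == null || leq(c, best))) best = c;
        }
        return best;
    }

    /** Meet = greatest lower bound. */
    static String meet(String a, String b) {
        String best = null;
        for (String c : LABELS) {
            if (leq(c, a) && leq(c, b) && (best == null || leq(best, c))) best = c;
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(meet("Nev", "NevWP"));    // infNevNevWP
        System.out.println(join("Nev", "NevWP"));    // supNevNevWP
        System.out.println(meet("INFINITY", "Nev")); // Nev
    }
}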


Provenance Dimension

Provenance can have atomic and complex dimensions such as knowledge source, modification date, or a combination thereof. We assume that these (and possible further) dimensions are independent of each other. In the next section, we generalize from this assumption.

Definition 4.4.2 (Provenance Dimension) A provenance dimension D is an algebraic structure (BD, ∨˙ D, ∧˙ D), such that (BD, ∨˙ D) and (BD, ∧˙ D) are complete semilattices. We denote the minimal element of D by –D.

BD represents the labels the provenance can take, e.g. all valid timestamps for the modification date. As (BD, ∨˙ D) and (BD, ∧˙ D) are complete semilattices, they are, in fact, also lattices. Hence, a finite set of provenance labels always has a join (supremum, least upper bound) and a meet (infimum, greatest lower bound) with respect to the corresponding order. In contrast to the provenance orders defined in Section 4.4.2, the join and meet operators in a provenance dimension need not be dual, as they can come from two different lattices which share the same values but have different orders. An example of a provenance dimension where this may become important is where-provenance [Buneman et al., 2001], which only tracks who contributed in any way to a certain inference. In this case, join and meet would coincide and would both be set unions of information sources. In all dimensions discussed in this chapter, join and meet are dual. In this case, provenance dimension and provenance order directly correlate.

Example 4.4.1 To illustrate the meaning of ∧˙ and ∨˙, let I be the provenance interpretation, a partial function mapping axioms into the label range of trust, and let A and B be axioms of an ontology such that A ≠ B. When combining two provenance labels from D which are assigned to A and B, the intuitive meaning of ∨˙ is “I need to trust one of A and B” (corresponding to a logical “or”). The intuitive meaning of ∧˙ is “I need to trust both A and B” (corresponding to a logical “and”). Hence, trust can be modeled as:

• Itrust(A ∨˙ B) = sup(Itrust(A), Itrust(B))

• Itrust(A ∧˙ B) = inf(Itrust(A), Itrust(B))

Likewise, the modification date as well as the creation date can be modeled as:

• Idate(A ∨˙ B) = min(Idate(A), Idate(B))

• Idate(A ∧˙ B) = max(Idate(A), Idate(B))
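As a concrete counterpart to these equations, the small sketch below (an illustration only, not the prototype's code) models the modification date dimension over java.time.LocalDate, with ∨˙ returning the earlier and ∧˙ the later of two dates, exactly as in Example 4.4.1.

import java.time.LocalDate;

/** The modification date dimension of Example 4.4.1: join = earlier date, meet = later date. */
public class DateDimension {

    /** ∨˙ : "one of A and B suffices", so we keep the earlier modification date. */
    static LocalDate join(LocalDate a, LocalDate b) {
        return a.isBefore(b) ? a : b;
    }

    /** ∧˙ : "both A and B are needed", so the result is as recent as the newest axiom involved. */
    static LocalDate meet(LocalDate a, LocalDate b) {
        return a.isAfter(b) ? a : b;
    }

    public static void main(String[] args) {
        LocalDate d4 = LocalDate.of(2009, 9, 9);   // axiom #4
        LocalDate d3 = LocalDate.of(2009, 9, 10);  // axiom #3
        System.out.println(meet(d3, d4)); // 2009-09-10
        System.out.println(join(d3, d4)); // 2009-09-09
    }
}

For the trust dimension of the running example, sup and inf over the lattice of Figure 4.1 would take the place of the date comparisons.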

As we have seen in Example 4.4.1, provenance is assigned to ontology axioms. Within a single assignment, the provenance must be uniquely defined.

Definition 4.4.3 (Provenance Assignment) A provenance assignment M is a set {(D1, d1 ∈ D1), ..., (Dn, dn ∈ Dn)} of pairs of a provenance dimension and a corresponding provenance label, such that Di = Dj ⇒ di = dj. As default label for a provenance assignment in dimension Dj we choose the minimal element –Dj. By ProvAss(A) we denote the set of provenance assignments for an axiom A.


Example 4.4.2 As an example, for the axiom “bluewater: City” of our running example, we have the following provenance assignment:

ProvAss(bluewater: City) = {(trust, Nev), (date, 2009-09-09)}

Without loss of generality we assume a fixed number of provenance dimensions. Next, we formalize how provenance assignments are composed. To obtain a logical formula which expresses how provenance assignments are composed, called a provenance formula, we make use of pinpointing formulas (discussed in Section 4.3) and of the how-provenance strategy. How-provenance [Green et al., 2007] is a strategy which describes how an axiom A can be inferred from a set of axioms {A1, ..., An}, i.e. it is a Boolean formula connecting the Ai. As pinpointing summarizes the explanations for an axiom in a single Boolean formula, and thus provides how-provenance, we use it to come up with a provenance formula.

Example 4.4.3 Consider the query “for each city, find all companies located in that city”:

x:City ∧ hasCompany(x,y). The result of this query and the corresponding pinpointing formula, based on the example data from Table 4.1, are:

x           y       pinpointing formula
bluewater   vpktv   #3 ∧ #4 ∧ #6

The associated provenance formula for this query result is:

ProvAss(#3) ∧˙ ProvAss(#4) ∧˙ ProvAss(#6)

To evaluate the corresponding provenance formula, we need to define the operators for provenance dimensions.

Definition 4.4.4 (Provenance Operations) Let A, B be axioms. Let ProvAss(A) = {(D1, x1), ..., (Dn, xn)} and ProvAss(B) = {(D1, y1), ..., (Dn, yn)} be provenance assignments. Then the provenance operations ∨˙ and ∧˙ are defined as follows:

ProvAss(A) ∨˙ ProvAss(B) = {(D, x ∨˙ D y) | (D, x) ∈ ProvAss(A) and (D, y) ∈ ProvAss(B)}.
ProvAss(A) ∧˙ ProvAss(B) = {(D, x ∧˙ D y) | (D, x) ∈ ProvAss(A) and (D, y) ∈ ProvAss(B)}.

Example 4.4.4 To illustrate how provenance is derived: the provenance assignment for the axiom “bluewater: City” is {(trust, Nev), (date, 2009-09-09)}, and the provenance assignment for the axiom “hqIn(vpktv, bluewater)” is {(trust, NevWP), (date, 2009-09-09)}. We assume the provenance order described in Table 4.3. The provenance for bluewater: City ∨ hqIn(vpktv, bluewater) is determined as follows:


ProvAss(bluewater ∶ City) ∨˙ ProvAss(hqIn(vpktv, bluewater)) = {(trust, Nev), (date, 2009-09-09)} ∨˙ {(trust, NevWP), (date, 2009-09-09)}.

Note that, due to the defaults introduced in Definition 4.4.3, the operations on provenance assignments are defined even in the presence of incomplete provenance from a domain. In our framework, just as ∨˙ and ∧˙ correspond to a logical “or” and “and”, the default corresponds to a default truth value of “unknown” in a default logic. With a default assignment, we provide a uniform treatment of axioms in the absence of provenance, and thus we are able to combine arbitrary provenance assignments. In any case, as illustrated in Example 4.4.1, the interpretation of provenance and default assignments as well as the interpretation of the provenance operations are dependent on the application domain. Such considerations are orthogonal to our framework. While axioms in the underlying description logic may contain negation, these negations are not visible and not needed at the level of provenance. Hence, the provenance algebra does not need to contain a negation operator. Finally, we define how to retrieve the provenance assigned to an axiom A within a provenance dimension: it is obtained by evaluating the corresponding provenance formula in the dimension under consideration.

Definition 4.4.5 (Provenance Evaluation) Let prov(A) be a function mapping from an axiom A to a provenance assignment in dimension D. The provenance of an axiom A wrt. O in D is obtained by computing a pinpointing formula φ of A wrt. O and obtaining ψ by replacing each axiom in φ with its provenance assignment in D and the logical operators ∨˙ and ∧˙ with their corresponding operators in D. Then prov(A) is computed by evaluating ψ.
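The following sketch illustrates Definition 4.4.5 under the simplifying assumption of a single, totally ordered dimension, with trust encoded as plain integers. The pinpointing formula is represented as a small expression tree whose leaves are axiom identifiers; evaluation replaces each leaf with its provenance label and the connectives with the dimension's meet and join. All class and method names here are our own and purely illustrative.

import java.util.Map;

/** Evaluating a pinpointing formula in a single provenance dimension (cf. Definition 4.4.5). */
public class ProvenanceEvaluation {

    interface Formula { int eval(Map<String, Integer> prov); }

    /** A leaf: the identifier of an ontology axiom. */
    static Formula axiom(String id) {
        return prov -> prov.get(id);
    }

    /** Conjunction: "all of these axioms are needed" -> meet (here: minimum trust). */
    static Formula and(Formula l, Formula r) {
        return prov -> Math.min(l.eval(prov), r.eval(prov));
    }

    /** Disjunction: "one of these pinpoints suffices" -> join (here: maximum trust). */
    static Formula or(Formula l, Formula r) {
        return prov -> Math.max(l.eval(prov), r.eval(prov));
    }

    public static void main(String[] args) {
        // Trust labels as integers (illustrative; 3 stands for the most trusted source).
        Map<String, Integer> trust = Map.of("#3", 3, "#4", 2, "#6", 1);

        // The pinpointing formula of Example 4.4.3: #3 ∧ #4 ∧ #6.
        Formula f = and(axiom("#3"), and(axiom("#4"), axiom("#6")));

        System.out.println(f.eval(trust)); // 1: the least trusted axiom dominates
    }
}

Swapping min and max turns the same evaluator into one for the modification date dimension of Example 4.4.1.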

Example 4.4.5 In our running example, for the query “for each city find all companies”, x:City ∧ hasCompany(x,y), we have the result bluewater, vpktv and the following provenance formula:

ProvAss(#3) ∧˙ ProvAss(#4) ∧˙ ProvAss(#6)

The corresponding provenance evaluation for the dimension trust is:

(trust, ∞) ∧˙ (trust, Nev) ∧˙ (trust, NevWP) = (trust, ∞ ∧˙ Nev ∧˙ NevWP) = (trust, infNev,NevWP).

Complex Provenance Dimensions

In the previous section we have described how provenance can be computed in a single dimension. This focus on a single dimension is useful if independence of dimensions can be assumed. Sometimes, however, this is not the case. For example, when a group of users collaboratively edits an ontology, the time an axiom was asserted and the user responsible for the modification will often correlate. In this case, two provenance dimensions can be composed into a complex provenance dimension:


Definition 4.4.6 (Composition of Dimensions) Let D1 = (BD1, ∨˙ D1, ∧˙ D1) and D2 = (BD2, ∨˙ D2, ∧˙ D2) be provenance dimensions. Then D = (BD, ∨˙ D, ∧˙ D) is a composed provenance dimension, such that

• BD = BD1 × BD2,

• (BD, ∨˙ D) = {((x, y), (v, w)) | x, v ∈ BD1 and y, w ∈ BD2 and x ≤∨D1 v and y ≤∨D2 w}, and

• (BD, ∧˙ D) = {((x, y), (v, w)) | x, v ∈ BD1 and y, w ∈ BD2 and x ≤∧D1 v and y ≤∧D2 w}.

We will refer to the elements of BD as complex provenance labels and to the elements of BD1 and BD2 as atomic provenance labels.
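Operationally, a composed dimension is just the component-wise application of the underlying dimensions. The sketch below (our own illustration, reusing integer trust degrees and java.time.LocalDate) composes trust and modification date in the sense of Definition 4.4.6.

import java.time.LocalDate;

/** Composition of the trust and modification date dimensions (cf. Definition 4.4.6). */
public class ComposedDimension {

    /** A complex provenance label: a pair of atomic labels. */
    record Label(int trust, LocalDate date) {}

    /** ∧˙ on the composed dimension works component-wise on the atomic dimensions. */
    static Label meet(Label a, Label b) {
        int trust = Math.min(a.trust(), b.trust());                        // meet in the trust dimension
        LocalDate date = a.date().isAfter(b.date()) ? a.date() : b.date(); // meet in the date dimension
        return new Label(trust, date);
    }

    public static void main(String[] args) {
        // Axioms #4 and #6 of Example 4.4.6 (trust degrees are illustrative integers).
        Label ax4 = new Label(2, LocalDate.of(2009, 9, 9));
        Label ax6 = new Label(1, LocalDate.of(2009, 9, 10));
        System.out.println(meet(ax4, ax6)); // Label[trust=1, date=2009-09-10]
    }
}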

Example 4.4.6 As an example of composition of dimensions, we compose the dimensions trust and modification date and compute the provenance for x:City ∧ hqIn(y,x). We have the provenance formula:

ProvAss(#4) ∧˙ ProvAss(#6)

If we treat the trust and modification date dimensions separately, the results are

{(trust, Nev)} ∧˙ {(trust, NevWP)} = {(trust, infNev,NevWP)}

{(date, 2009-09-09)}∧˙ {(date, 2009-09-10)} = {(date, 2009-09-10)}

If we combine them into one dimension as shown in Figure 4.2, however, the result is

{(trust, Nev), (date, 2009-09-09)}∧˙ {(trust, NevWP), (date, 2009-09-10)} =

{(trust, infNev,NevWP), (date, 2009-09-10)}

Having composed the interdependent dimensions into one, one may use the composed provenance dimension just as an atomic one. Definition 4.4.5 applies exactly as for simple provenance dimensions.

Semantics for Conflicting Provenance

In the following section we extend our model to support conflicting provenance, which can arise from conflicting changes or provenance assignments by multiple users at different times. A new operator is needed for this merging, as the merge operator need not coincide with one of ∨˙ and ∧˙.

Definition 4.4.7 (Conflict Tolerant Dimension) A conflict tolerant provenance dimension D is an algebraic structure (BD, ∨˙ D, ∧˙ D, ⊕D), such that (BD, ∨˙ D), (BD, ∧˙ D) and (BD, ⊕D) are complete semilattices. The minimum of (BD, ⊕D) is –D.


[Figure: the composition (×) of the date order 2009-09-09 < 2009-09-10 with the trust order of Figure 4.1, yielding complex labels ranging from (2009-09-09, infNev,NevWP) at the bottom to (2009-09-10, supNev,NevWP) at the top.]

Figure 4.2.: Provenance order of the knowledge sources.

Example 4.4.7 To illustrate how conflicting provenance given by multiple provenance assignments is supported, let I be a provenance interpretation and let A be an axiom of an ontology for which there exist multiple provenance assignments. We use A′ and A′′ to represent the axiom A with different provenance assignments. Let us compare the already known provenance dimension modification date with the creation date. For the modification date, ⊕ needs to be the maximum (we are interested in the last assertion), while for the creation date, ⊕ needs to be the minimum (we are interested in the first assertion). In contrast to Example 4.4.1, the operators for creation date and modification date do not coincide. The modification date could be modeled as:

Idate(A′ ⊕ A′′) = max(Idate(A′), Idate(A′′))

The creation date could be modeled as:

Icre(A′ ⊕ A′′) = min(Icre(A′), Icre(A′′))

Likewise, the trust could be modeled as:

Itrust(A′ ⊕ A′′) = sup(Itrust(A′), Itrust(A′′))

To show how this support for conflicting provenance by multiple provenance assignments can be applied to our scenario, we slightly extend our running example in Example 4.4.8:

Example 4.4.8 We assume that axiom (#1), RealCity ⊑ City ⊓ ∃hasCompany.Broadcaster, has been modified by two sources at different times:


ID   trust   date
#1   ∞       2009-09-10
#1   Nev     2009-09-09

Then the provenance assignment for axiom #1 is

{(trust, ∞), (date, 2009-09-10)} ⊕ {(trust, Nev), (date, 2009-09-09)} = {(trust, ∞ ⊕ Nev), (date, 2009-09-09 ⊕ 2009-09-10)} = {(trust, ∞), (date, 2009-09-10)}.

In contrast, if first DPA and then Neverest assert the same axiom, we would have

{(trust, ∞), (date, 2009-09-09)} ⊕ {(trust, Nev), (date, 2009-09-10)}

As ∞ >trust Nev, but 2009-09-09 <date 2009-09-10, the merged assignment is {(trust, ∞), (date, 2009-09-10)}: it combines the trust label of DPA's earlier assertion with the modification date of Neverest's later one.
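Operationally, ⊕ is applied per dimension to all assignments recorded for an axiom before the ∨˙/∧˙ evaluation takes place. The sketch below (our own illustration, with integer trust degrees standing in for the labels of Table 4.3) merges the two conflicting assignments to axiom #1 from Example 4.4.8, using the supremum for trust and the maximum for the modification date.

import java.time.LocalDate;
import java.util.List;

/** Merging conflicting provenance assignments with ⊕ (cf. Definition 4.4.7 / Example 4.4.8). */
public class ConflictMerge {

    record Assignment(int trust, LocalDate modified) {}

    /** ⊕ for a single pair: supremum of trust, maximum (latest) modification date. */
    static Assignment merge(Assignment a, Assignment b) {
        int trust = Math.max(a.trust(), b.trust());
        LocalDate modified = a.modified().isAfter(b.modified()) ? a.modified() : b.modified();
        return new Assignment(trust, modified);
    }

    /** ⊕ over all assignments recorded for one axiom. */
    static Assignment mergeAll(List<Assignment> all) {
        return all.stream().reduce(ConflictMerge::merge).orElseThrow();
    }

    public static void main(String[] args) {
        // The two assertions of axiom #1 (trust 3 stands in for ∞, trust 2 for Nev).
        Assignment byDpa = new Assignment(3, LocalDate.of(2009, 9, 10));
        Assignment byNev = new Assignment(2, LocalDate.of(2009, 9, 9));
        System.out.println(mergeAll(List.of(byDpa, byNev))); // Assignment[trust=3, modified=2009-09-10]
    }
}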

In order to accommodate such potentially conflicting provenance assignments about ontology axioms, we extend the semantics of provenance introduced in Section 4.4.2. For this purpose, we redefine the prov function of Definition 4.4.5, such that it uses ⊕ to merge provenance assignments in a preprocessing step. Afterwards, we have a unique provenance assignment again and apply Definition 4.4.5 as before.

Definition 4.4.8 (Provenance Extended) Let allprov: axioms → 2^BD be a function mapping from an axiom to all provenance assignments to that axiom in a provenance dimension D. Let prov be a function mapping from an axiom to a provenance assignment in dimension D. The provenance of an axiom A wrt. O in D is obtained by computing a pinpointing formula φ of A wrt. O and obtaining ψ by replacing each axiom A in φ with ⊕(allprov(A)) and the logical operators ∨˙ and ∧˙ with their corresponding operators in D. Then prov(A) is computed by evaluating ψ.

Example 4.4.9 Consider the provenance assignments of axiom #1 with conflicting provenance from Example 4.4.8 and the provenance formula:

ProvAss(#1) ∧˙ ProvAss(#2)

The derived provenance is:

({(trust, ∞), (date, 2009-09-10)} ⊕ {(trust, Nev), (date, 2009-09-09)}) ∧˙ {(trust, ∞), (date, 2009-09-09)} = {(trust, ∞ ⊕ Nev) ∧˙ (trust, ∞), (date, 2009-09-10 ⊕ 2009-09-09) ∧˙ (date, 2009-09-09)} = {(trust, ∞ ∧˙ ∞), (date, 2009-09-10 ∧˙ 2009-09-09)} = {(trust, ∞), (date, 2009-09-10)}.

This definition of prov not only allows us to aggregate provenance from multiple sources, but also to gracefully handle unknown provenance, i.e. situations where a knowledge source does not provide a label for some provenance dimension, in which case –D is assumed as a default, as introduced in Definition 4.4.3. Example 4.4.10 shows how our approach can be applied to our use case scenario.

Example 4.4.10 Back to our scenario: DPA needed to verify that Bluewater indeed is a city in CA. Hence, they needed to answer the query “bluewater: RealCity?” and to obtain provenance for the answer. The query results in the following provenance formula:

ProvAss(#3) ∧˙ ProvAss(#4) ∧˙ ProvAss(#6)

The resulting provenance label is:

{(trust, ∞), (date, 2009-09-10)} ∧˙ {(trust, Nev), (date, 2009-09-09)} ∧˙ {(trust, NevWP), (date, 2009-09-09)} = {(trust, infNev,NevWP), (date, 2009-09-10)}.

Note that in Example 4.4.10 the labels of the modification date dimension are comparable, while the labels of the trust dimension are not. Hence, the resulting provenance label is a tuple of the infimum of Nev and NevWP in the trust dimension and the maximum of the dates in the assertion date component.

4.5. Using Provenance to Debug Changing Ontologies

We now use our definitions of provenance in order to track down what recent addition to the knowledge base led to a desired or undesired effect. In the case of our collaborative on- tology editing scenario, we may want to identify who was the least authoritative source that contributed most recently to an inconsistency. The algorithms discussed in this section fulfill this purpose, i.e. given a query (an axiom), an ontology and a provenance dimension, they extract a subontology as explanation and compute the provenance label of the query. As we can see above, Definition 4.4.5 relies on a pinpointing formula for the computation of provenance. Hence, we need to find all pinpoints to compute a pinpointing formula. Then we can immediately derive the provenance label.

4.5.1. Naïve Approach

A naïve evaluation of provenance for an axiom might compute all pinpoints and then evaluate the pinpointing formula. This strategy is illustrated in Algorithm ProvNaive. ProvNaive takes as parameters an ontology O, an axiom A and a provenance dimension D. It returns the provenance label of A wrt. O and a subontology containing all pinpoints for O ⊧ A. However, finding all pinpoints is a very expensive operation and this approach is not appropriate in real use cases. Finding a pinpoint using a black-box approach may need an exponential number of consistency checks in the underlying DL. Moreover, there can be exponentially many pinpoints.


Algorithm 2 Evaluation Algorithm: ProvNaive(O, A, D)
1: pinpoints ← GetAllPinpoints(A, O)
2: labels ← {l | ∃p ∈ pinpoints ∶ l = ⋀˙ D prov(p)}
3: lab ← ⋁˙ D{l | l ∈ labels}
4: M ← ⋃p∈pinpoints p
5: return M, lab

[Figure: (a) Dependencies of Axioms — the query (?) C ⊑ D and the axioms (4) C ⊑ E, (3) C ⊑ F, (3) F ⊑ D, (1) G ⊑ D, (5) E ⊑ G, (2) F ⊑ G, (3) H ⊑ I, (4) I ⊑ J, connected by syntactic relevance arrows; (b) Step 1 — the module contains (?) C ⊑ D and (4) C ⊑ E; (c) Step 2 — the module additionally contains (3) C ⊑ F, (3) F ⊑ D and (5) E ⊑ G.]

Figure 4.3.: Selection of Axioms by Optimized Algorithm.

4.5.2. Debugging with Provenance - An Optimized Algorithm

We present an optimized algorithm for deriving provenance which performs significantly better in the average case than naïve implementations. It does not need to compute all pinpoints, nor even one precise pinpoint. Instead, we compute an approximation which is sufficient for deriving provenance.

The optimization is based on the assumption that the provenance dimension is a lattice (BD, ∨˙ D, ∧˙ D) such that a ∨˙ D b = sup(a, b) and a ∧˙ D b = inf(a, b) for a, b ∈ BD (or vice versa). Note that in the general case the interpretation of ∨˙ D and ∧˙ D is independent, as shown in Definition 4.4.2, i.e., they do not need to be dual. For optimization purposes, however, we assume that ∨˙ D and ∧˙ D are no longer independent. This assumption is true for all provenance dimensions discussed above, such as modification date and degree of trustworthiness, and for all total orders. In this case, the pinpointing formula has the structure of a supremum of infima when expressed in disjunctive normal form.

Under this assumption, we make the evaluation more efficient since we can exploit monotonicity properties of our provenance dimensions where applicable. For instance, for a provenance dimension with a total order, once we find the pinpoint with the highest provenance label, we have also found the overall maximal provenance label. Thus, in many cases, we do not need to compute all pinpoints, and we may restrict the computation of the pinpointing formula to those parts of an ontology that are relevant for the provenance computation given its particular lattice structure.

If the provenance dimension is a partial order, several pinpoints with incomparable provenance labels may be found. Thus, we need to find all pinpoints with maximal (or minimal) provenance labels and merge the corresponding provenance labels to determine the resulting provenance label. Without loss of generality, we only consider for our optimized algorithm the case that the corresponding pinpointing formulas have the structure of a supremum of infima. In this case the provenance label of a single pinpoint is the infimum of the labels of the axioms it contains, and the overall provenance label is the supremum of the labels of all such pinpoints. As a result, we do not need to take into account any axiom which has a provenance label less than the infimum of the labels of the greatest pinpoints. For the case of an infimum of suprema, we simply need to invert all comparisons and replace min by max in Algorithms 3 and 4. Likewise, we only consider consistent ontologies. The proposed algorithms handle inconsistent ontologies equally well if the entailment check in line 10 of Algorithm 3 is negated.

The algorithm starts with a query of the form “O ⊧ A?”, as we focus on entailment checking here. In order to find those axioms that are relevant for the provenance computation given the query, the algorithm iteratively grows a subontology around A based on the syntactic relevance selection function [Ji et al., 2009] (see Definition 4.5.1). Intuitively, an axiom is syntactically relevant for another axiom if it contributes to the definition of one of the concepts or properties in the other axiom. In Figure 4.3, a small example ontology is shown; the arrows represent syntactic relevance relationships between axioms. Assume the query is C ⊑ D?. Then the two axioms at the bottom cannot be relevant to the answer, because they are neither directly nor indirectly relevant to the query. The rest of the ontology contains three justifications for C ⊑ D, namely {C ⊑ E, E ⊑ G, G ⊑ D}, {C ⊑ F, F ⊑ G, G ⊑ D} and {C ⊑ F, F ⊑ D}.

Definition 4.5.1 (Syntactic Relevance [Ji et al., 2009]) An axiom B is directly syntactically relevant for an axiom A if their clusters overlap, i.e. if they share a concept, role or individual. B is syntactically relevant for A if it is directly syntactically relevant for A, or if B is syntactically relevant for C, which is syntactically relevant for A. We define a convenience function σ: Given A and an ontology O, σ(A, O) = {B ∈ O | B is directly syntactically relevant for A}. The definition carries over to sets of axioms M: σ(M, O) = {B ∈ O | ∃A ∈ M ∶ B is directly syntactically relevant for A}.
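With the OWL API, direct syntactic relevance can be approximated by checking whether two axioms share an entity in their signatures. The helper below is a minimal sketch of the selection function σ under that reading; it is our own illustration, and the notion of clusters used in [Ji et al., 2009] may be finer-grained.

import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

import org.semanticweb.owlapi.model.OWLAxiom;

/** Sketch of the selection function σ based on shared signature entities. */
public class SyntacticRelevance {

    /** True if the two axioms share an entity (class, property or individual) in their signatures. */
    static boolean directlyRelevant(OWLAxiom a, OWLAxiom b) {
        return !Collections.disjoint(a.getSignature(), b.getSignature());
    }

    /** σ(M, O): all axioms of the ontology that are directly relevant to some axiom of the module M. */
    static Set<OWLAxiom> sigma(Set<OWLAxiom> module, Set<OWLAxiom> ontologyAxioms) {
        Set<OWLAxiom> result = new HashSet<>();
        for (OWLAxiom candidate : ontologyAxioms) {
            for (OWLAxiom m : module) {
                if (directlyRelevant(candidate, m)) {
                    result.add(candidate);
                    break;
                }
            }
        }
        return result;
    }
}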

Using the syntactic relevance selection function, the algorithm Prov(O, A, D) computes the provenance label of O ⊧ A in dimension D, if O ⊧ A. We start by determining the set of syntactically relevant axioms for A = C ⊑ D in line 3 (C ⊑ E, C ⊑ F, F ⊑ D and G ⊑ D). Then we add the ones with the greatest provenance label to the module in line 4 (C ⊑ E in step 1; C ⊑ F in step 2). In the inner loop we recursively add all syntactically relevant axioms for the new module which do not decrease the provenance degree of the overall solution (E ⊑ G in step 2). Note that this is just an optimization step to avoid unnecessary entailment checks; it can be omitted to compute a more precise approximation of the pinpoint. The trade-off is a possibly higher number of iterations and entailment checks.


In each iteration we add axioms until we have found a module which contains a pinpoint in line 10 (this is the case after step 2 of the example). The provenance label for A is the smallest provenance degree of the axioms in the module (hence, 3). We work with a set of labels to account for the fact that in the presence of partial orders, minimum and maximum are not unique. Hence, the min function used in line 11 is the set version, which returns the set of smallest elements of a set. For total orders, labels is always a singleton. Note that our optimization strategy is strongly related to two issues: (a) the syntactic relationships among the pinpointing axioms, and (b) the size of pinpoints. The more closely the pinpointing axioms are syntactically correlated and the smaller the pinpoints are, the faster the module can be built and the provenance label derived. By growing only a small module of the ontology and avoiding the use of the instantiated ontology as a whole for approximating pinpoints, we save many entailment checks, which are very expensive.

Algorithm 3 Evaluation Algorithm: Prov(O, A, D)
1: M ← {A}
2: repeat
3: syn ← σ(M, O)
4: add ← {B1 ∈ syn | ∄B2 ∈ syn ∶ prov(B1) <D prov(B2)}
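The overall structure of the module-growing loop can be sketched as follows. This is our own reconstruction from the textual description above, for a totally ordered dimension with integer labels, and not the original implementation; the reasoner call is abstracted as entails, syntactic relevance as directlyRelevant, and the label lookup as label.

import java.util.HashSet;
import java.util.Set;
import java.util.function.BiPredicate;
import java.util.function.Function;

/** Sketch of the module-growing loop of Prov (illustrative reconstruction only). */
public class ProvSketch {

    static <A> Set<A> prov(Set<A> ontology,
                           A query,
                           Function<A, Integer> label,          // provenance label of an axiom
                           BiPredicate<A, A> directlyRelevant,  // direct syntactic relevance
                           BiPredicate<Set<A>, A> entails) {    // reasoner call: module |= query?
        Set<A> module = new HashSet<>();
        module.add(query);

        while (!entails.test(withoutQuery(module, query), query)) {
            // Syntactically relevant axioms not yet contained in the module.
            Set<A> syn = new HashSet<>();
            for (A b : ontology) {
                if (!module.contains(b) && module.stream().anyMatch(m -> directlyRelevant.test(b, m))) {
                    syn.add(b);
                }
            }
            if (syn.isEmpty()) {
                break; // nothing left to add: the ontology does not entail the query
            }

            // Add the relevant axioms carrying the greatest provenance label ...
            int best = syn.stream().mapToInt(label::apply).max().getAsInt();
            for (A b : syn) {
                if (label.apply(b) == best) {
                    module.add(b);
                }
            }

            // ... and close the module under relevance without dropping below that label.
            boolean changed = true;
            while (changed) {
                changed = false;
                for (A b : ontology) {
                    if (!module.contains(b) && label.apply(b) >= best
                            && module.stream().anyMatch(m -> directlyRelevant.test(b, m))) {
                        module.add(b);
                        changed = true;
                    }
                }
            }
        }
        // The provenance label of the query is the minimum label occurring in the module.
        return module;
    }

    private static <A> Set<A> withoutQuery(Set<A> module, A query) {
        Set<A> copy = new HashSet<>(module);
        copy.remove(query);
        return copy;
    }
}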

Theorem 4.5.1 Let O be an ontology, A an axiom such that O ⊧ A and D a provenance dimension, which is a total order. Prov(O, A, D) computes the provenance degree of O ⊧ A in dimension D.

Proof: First we show that Prov terminates. The inner loop terminates when M ∖ {A} ⊧ A. At each iteration, add is assigned a subset of O which is syntactically relevant to the module and has the greatest provenance degree. Consequently, the module is a subset of O. As O is finite and O ⊧ A, the loop must eventually terminate. In this case M ∖ {A} indeed entails A and labels = minD({prov(B) | B ∈ M}).

It remains to show that if Prov terminates, it returns the correct provenance degree of O ⊧ A. Remember that we are looking for the pinpoint with the highest provenance label and that the provenance label of a pinpoint is the infimum of the provenance labels of its axioms. M contains only axioms which are syntactically relevant for A. When the outer loop terminates, M contains a (superset of a) greatest pinpoint of A wrt. O. Assume there is a pinpoint P which has not been found and has a greater provenance label. Then it must already be part of the module: all axioms in a pinpoint of A must be syntactically relevant for A. Moreover, ∀B ∈ P ∶ prov(B) >D labels. M contains all axioms which are directly syntactically relevant for A, or which are indirectly syntactically relevant via axioms with provenance labels ≥D labels. Hence, M must already contain P. Refutation.


[Figure: the query (?) C ⊑ D with axioms (4) C ⊑ E, (b) C ⊑ F, (a) F ⊑ D, (1) G ⊑ D, (2) E ⊑ G and (b) F ⊑ G, whose provenance labels mix numbers and letters.]

Figure 4.4.: Justifications with incomparable provenance.

For partial orders the resulting label may contain multiple elements which need to be merged. Figure 4.4 illustrates a case where we have a complex dimension composed of numbers and letters with their natural orders, and multiple relevant pinpoints which together result in a provenance label of inf1,a. ProvPartial computes accurate labels for partial orders. It uses Prov to compute an approximation first. Although there might be multiple pinpoints with incomparable provenance labels, Prov stops after the first pinpoint is contained in the approximation. Therefore, we need to further extend the approximation such that all possibly relevant axioms are included. If the label returned by Prov is a singleton, there is a unique maximal pinpoint (lines 2 and 3). Otherwise, we compute the minimum of all elements of labels (line 4), which is the lower bound we need to consider. For example, as numbers and letters are incomparable, the lower bound for a and 2 is inf1,a. This means we could ignore any axiom which does not have a provenance label (and hence defaults to –), but need to consider G ⊑ D. Every pinpoint with a provenance label equal to or less than min is also less than the label of the pinpoint we have already found in Prov, and hence we can ignore it. We now extend our approximation with all axioms which are syntactically relevant to A and, if they are only indirectly relevant, are connected to A only through axioms which have a provenance label greater than min. The loop in lines 5 to 9 works analogously to lines 5 to 9 in Prov. Finally, we compute the correct label using ProvNaive.

Algorithm 4 Evaluation Algorithm: ProvPartial(O, A, D)
1: (M, labels) ← Prov(O, A, D)
2: if (|labels| = 1) then return (M, labels)
3: end if
4: min ← ⋀˙ lab∈labels lab
5: repeat
6: N ∶= M
7: syn ∶= σ(M, O)
8: M ∶= M ∪ {B ∈ syn | prov(B) >D min}
9: until M = N
10: return ProvNaive(M, A, D)

Theorem 4.5.2 Let O be an ontology, A an axiom such that O ⊧ A and D a provenance dimension, which is a partial order. ProvPartial(O, A, D) computes the provenance degree of O ⊧ A in dimension D.


In line 10 we use the naïve algorithm. Therefore we only need to show two things: (a) if Prov returns a singleton, it is indeed correct also for partial orders, and (b) lines 4 to 9 indeed compute a module which contains all axioms relevant for computing the correct solution using the naïve algorithm.

(a) In lines 3 and 4 of Prov exactly those axioms are added to M which are minimal in the current syntactic relevance set syn. Hence, if incomparable provenance degrees occur, all of them are selected. They are also preserved in line 11, as there is no unique minimum in this case. Thus, if a unique minimum is returned, then it must in fact be equal to or greater than the provenance labels of all other pinpoints.

(b) Assume there is an axiom B which is contained in some pinpoint P that is relevant for labels and missing in M. We consider two cases:

(b1) ⋀˙ p∈P prov(p) ≤ min. That means some prov(p) is less than or equal to min. Then P is not relevant. Refutation.

(b2) Hence, prov(p) must be greater than min for all p ∈ P. Then all p, and hence also B, have been selected in lines 5 to 9. Refutation.

Figure 4.7 presents a diagram showing the relations between the set of axioms of the ontology (set O), the set of axioms of the relevant pinpoints (set P), and the set of axioms of the module retrieved by our algorithm (set R). Note that the algorithm retrieves some of the pinpoint axioms. Axioms that are part of the pinpointing formula and are retrieved by the algorithm (the overlap of the sets P and R) correspond to those that are relevant and necessary for deriving provenance.

[Figures: Venn-style diagrams of the sets O (axioms of the ontology), P (axioms of the pinpoints) and R (retrieved axioms), with and without the optimization; the recall is always = 1.]

Figure 4.5.: Algorithm with optimization.

Figure 4.6.: Algorithm without optimization.

Figure 4.7.: Relations between the set of axioms of the ontology (O), the axioms in all pinpoints (P) and the axioms retrieved by the algorithm (R).


4.6. Complexity

The worst-case complexity of both the naïve and the optimized approach for computing provenance is equivalent to the computation of all pinpoints in the underlying logic, as the pinpointing formula can be evaluated in polynomial time. If it is expressed in normal form, however, the size of the formula can blow up exponentially. Approaches for computing pinpoints like [Baader and Peñaloza, 2010], which derive a compact representation of the pinpointing formula rather than representing it in a normal form, benefit the computation of provenance since they avoid this exponential blow-up. While in the worst case the complexity of computing provenance is the same as the complexity of finding all pinpoints and hence quite high, in the average case we can do much better using the optimized algorithm, as shown in Section 4.7.

4.7. Evaluation

The prototype of the optimized provenance computation algorithm (Prov) has been imple- mented in Java 1.6 using Pellet 2.0.04 and OWL API trunk revision 13105.

4.7.1. Data

We have performed our experiments using two groups of ontologies. The first experiment is composed of standard ontologies for debugging that have already been used for testing the computation time of laconic justifications in [Horridge et al., 2008]. These ontologies have been used in [Horridge et al., 2008] to demonstrate the efficiency of the traditional computation of pinpoints. The main goal of this experiment is to show that our optimized algorithm speeds up provenance computations compared to the standard computation of pinpoints. The second experiment uses real-world living ontologies from Bioportals containing real change logs with timestamps and editors. The main goal of this experiment is to test the applicability and scalability of our approach for computing provenance in real time. For both experiments we use a virtual machine with 2GB RAM and a single Intel(R) Xeon(TM) CPU core at 3.60GHz.

4.7.2. Methodology

1. We compare our optimized algorithm to the naïve approach (baseline) based on full axiom pinpointing, using the ontologies shown in Table 4.4. The ontologies used have already been taken in [Horridge et al., 2008] to demonstrate the efficiency of the traditional computation of pinpoints, and offer a range of varying complexities and pinpoint sizes.

2. We evaluate if our approach is fast enough to support user assessments of the reliability of inferences in real time also for real-world, large-scale data. For this experiment, we have used the two real-world ontologies shown in Table 4.6.

4Pellet: OWL 2 Reasoner for Java (http://clarkparsia.com/pellet)
5The OWL API (http://owlapi.sourceforge.net/)


ID     Ontology     Expressiveness   Total Axioms   No. Unsat. Classes   Av. Pinpoints per Unsat. Class
1,7    People       ALCHOIN          372            1                    1
2,8    MiniTambis   ALCN             400            30                   1
3,9    University   SOIN             92             9                    1
4,10   Economy      ALCH(S)          2.330          51                   1
5,11   Chemical     ALCHF            192            37                   11
6,12   Transport    ALCH             2.178          62                   2

Table 4.4.: Ontologies used in the experiments.

3. We compute the precision and recall of our approximation using all ontologies described above. Precision is defined here as the number of retrieved axioms that are relevant for deriving provenance divided by the total number of retrieved axioms. Recall is defined as the number of retrieved axioms that are relevant for deriving provenance divided by all relevant axioms in the ontology.
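In symbols, with R denoting the set of retrieved axioms and P the set of axioms occurring in the relevant pinpoints (cf. Figure 4.7):

precision = |R ∩ P| / |R|        recall = |R ∩ P| / |P|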

4.7.3. Discussion and Results

Experiment I - Debugging Ontologies

In this experiment, we compare our optimized algorithm to the naïve approach (baseline) based on full axiom pinpointing. For this experiment, we have used the ontologies from Table 4.4. Since for these ontologies no provenance information is available, we have generated artificial provenance in two ways:

random - ontologies 1-6: We have augmented the original ontologies by randomly assigning timestamp values and degrees of trustworthiness to the axioms.

cluster - ontologies 7-12: We have augmented the original ontologies by assigning similar timestamp values and degrees of trustworthiness to clusters of axioms which are syntactically relevant to each other. This reflects the fact that a user usually does not make random modifications, but changes a part of the ontology focused around a certain class or property.

As ontologies 1 to 12 are inconsistent, our implementation focuses on pinpoints for inconsistencies instead of subsumptions. For this reason, the termination condition in line 8 of Prov has been changed to a consistency check, i.e. M − A ⊧ ¬A. For each ontology, we have measured the time needed to compute the provenance for an inconsistency, averaged over all inconsistent classes of the ontology. In order to investigate the influence of composing dimensions, we have performed the experiment for each ontology with a single and with two provenance dimensions (see Section 4.4.2). As a baseline we use the naïve approach based on full axiom pinpointing, employing the black-box algorithm implementation in the OWL API for computing all pinpoints. The results of the evaluation are presented in Figure 4.8 and Table 4.5. In Figure 4.8, we have normalized the results to the processing time of the naïve approach and used a logarithmic scale.


ID   Ontology        Baseline   Av. Size Pinpoints (Uns. Class)   Date Dimension   Av. Size Module (Uns. Class)   Date and Trust Dimensions   Av. Size Module (Uns. Class)
1    People R.       312        4                                 141              15                             141                         15
2    MiniTambis R.   141        7                                 18               22                             22                          22
3    University R.   53         4                                 8                18                             8                           18
4    Economy R.      395        3                                 23               57                             27                          58
5    Chemical R.     3239       7                                 32               126                            38                          126
6    Transport R.    1165       5                                 23               70                             29                          71
7    People C.       141        4                                 16               15                             16                          15
8    MiniTambis C.   127        7                                 11               22                             11                          22
9    University C.   52         4                                 7                18                             7                           18
10   Economy C.      386        3                                 4                19                             5                           19
11   Chemical C.     3236       7                                 32               126                            37                          126
12   Transport C.    1121       5                                 28               99                             34                          102

Table 4.5.: Average time to compute provenance for an inconsistency of an ontology (ms).

The absolute numbers ranged over three orders of magnitude. Our optimized algorithm performs significantly better in all cases and scales very well. Note that in some ontologies we can find some very large pinpoints and some entailments with high numbers of pinpoints, while other ontologies contain relatively small and few pinpoints. As shown in Figure 4.8, ontology 5, which is an annotated version of the Chemical ontology, required the most computation time in the naïve approach, as it contains some very large pinpoints and some entailments with high numbers of pinpoints. Especially for such cases, our approach has shown its potential. As expected, the optimized cluster ontology group (ontologies 7-12) has performed better than the optimized random ontology group (ontologies 1-6). The reason is that for the cluster group we have assigned similar provenance labels to syntactically related axioms. Since in our approach axioms that are syntactically relevant to the query are selected first, the module could be built faster for the cluster group than for the random group. This increases the performance of the provenance computation, since axioms belonging to a common pinpoint are in general syntactically related to each other. For instance, ontology 10 has shown better performance than ontology 4, since the relevant axioms could be selected first and thus a smaller module could be built. Furthermore, we observe that debugging with many independent dimensions can have a negative performance impact due to many incomparable labels in the composed dimension, but in most cases it still performs significantly better than repeating the computation for two atomic dimensions.

Experiment II - Living Ontologies

In our second experiment, we evaluate whether our approach is fast enough to support user assessments of the reliability of inferences in real time also for real-world, large-scale data. For this experiment, we have used the two real-world ontologies shown in Table 4.6. The BIO ontology is a mapping ontology for the Open Biomedical Ontologies (BIO) repository of the US National Cancer Institute's Center for Bioinformatics [Smith et al., 2007] (ontology 13).


[Figure: bar chart of computation time as a percentage of the baseline (logarithmic scale) per ontology ID 1-14, for the baseline, a single dimension, and combined dimensions.]

Figure 4.8.: Average time to compute provenance for an inconsistency of an ontology (in logarithmic scale). The computation of provenance for ontologies 5, 10 and 13 scales orders of magnitude better than a naïve implementation algorithm and for that reason, they are not displayed on the graph.

ID   Ontology   Expressivity   Total Axioms   No. Unsat. Classes   Av. Pinpoints per Unsat. Class
13   BIO        SHOIN(D)       660.915        1                    1
14   OBI        SHOIN(D)       16.415         1                    1

Table 4.6.: Ontologies used in the experiments.

The BIO ontology comes with real change logs containing timestamps and editor information for each axiom, which we use to evaluate our approach. The OBI ontology is the Ontology for Biomedical Investigations6 (ontology 14). This is an integrated ontology for the description of life-science and clinical investigations. It supports the consistent annotation of biomedical investigations, regardless of the particular field of study. The OBI ontology has been developed in collaboration with groups representing different biological and technological domains involved in biomedical investigations. In the OBI ontology, changes are tracked in change logs with timestamps and provenance information. For both ontologies, we mapped information about editors to degrees of trustworthiness as defined in Section 4.4.2, and extracted the timestamps for the modification dates. We compared our approach to the naïve approach for computing all pinpoints. For each ontology, we have evaluated the time needed to compute the provenance for an inconsistency, averaged over all inconsistent classes of the ontology. Since ontologies 13 and 14 are consistent, we have generated random queries as follows: for randomly selected disjoint classes A and B we introduced a new class C such that A ⊑ C and B ⊑ C, and used these new axioms as queries. For this reason, we could use exactly the same implementation as for Experiment I. We have performed the experiment for each ontology with a single and with two dimensions to investigate the influence of composing dimensions.

6OBI: The Ontology for Biomedical Investigations (http://obi-ontology.org/)



ID   Ontology   Baseline    Av. Size Pinpoints (Uns. Class)   Date Dimension   Av. Size Module (Uns. Class)   Date and Trust Dimensions   Av. Size Module (Uns. Class)
13   BIO        1.273.281   3                                 7.514            33                             7.306                       33
14   OBI        24.474      3                                 660              150                            717                         150

Table 4.7.: Absolute times needed for the experiments (ms).

The results of the evaluation are presented in Figure 4.8 and Table 4.7. For ontologies 13 and 14 our algorithm is 97-99% faster than the baseline. For the BIO ontology, which contains 660.915 axioms, provenance can be computed in under 8 seconds with our algorithm, which is fast enough for interactive applications. In contrast, the baseline approach takes almost 22 minutes. The missing bars for ontology 13 in Figure 4.8 are due to the extreme differences in runtime. For the relatively small ontologies used for experiments 1 to 12, existing pinpointing algorithms perform quite well. However, for the extremely large ontologies 13 and 14, our optimizations show their full potential. Using our approach, the time needed to compute provenance mainly scales with the size of the pinpoints, rather than with the size of the ontology. Compared to the size especially of ontology 13, our machine had a rather small main memory (most of it was already necessary just to classify the ontology using Pellet). We expect even better performance when memory management is less of an issue.

Debugging with many independent dimensions can have a negative performance impact due to the many incomparable labels in the composed dimension. Results for the real-world data from the BIO and the OBI ontologies show that in reality this is less of a problem, as modification dates and editors are not independent there. In fact, the lack of independence of the provenance dimensions is an advantage in practice. The results for ontology 13 are even better for the complex dimension than for the atomic dimensions. This shows that the assumption underlying our clustered random data is indeed justified.

Precision and Recall

In the final part of our experiment, we have analyzed the precision and recall of our approximation. Precision is defined here as the number of retrieved axioms that are relevant for deriving provenance divided by the total number of retrieved axioms. Recall is defined as the number of retrieved axioms that are relevant for deriving provenance divided by all relevant axioms in the ontology. The recall clearly always is 1.0, as assured by the inner loop in lines 5 to 9 in Prov and ProvPartial: we retrieve all axioms of the pinpoint(s) that are relevant and necessary for computing the provenance label (see Figure 4.7). However, we have to tolerate a certain percentage of false positives (low precision), since not all retrieved axioms are relevant for deriving provenance. Table 4.8 shows that for ontologies 1-3 and 7-9 our approach leads to high precision. However, for ontologies 4-6 and 10-12, as well as for both real-world ontologies 13 and 14, our approach achieves low precision. The reason for this discrepancy is that the latter ontologies are highly axiomatized ontologies with many syntactic dependencies among axioms; thus, using the syntactic relevance function, our algorithm also retrieves many irrelevant axioms.

As we have mentioned before, we can trade the runtime optimization in the inner loop of the algorithm for higher precision of the module. In other words, we have a trade-off between (low) precision and (high) performance. Table 4.8 shows the relation between the size of the module retrieved by the algorithm Prov presented in Section 4.5.2 and the total number of relevant axioms for deriving provenance. We measure the average module size for computing the provenance and the average number of axioms in all relevant pinpoints for an inconsistency. The numbers are averaged over all inconsistent classes of an ontology.

ID   Ontology             Average Axioms   Average Module Size   Precision
1    People Random        4                15                    0.3
2    MiniTambis Random    7                22                    0.3
3    University Random    4                18                    0.2
4    Economy Random       3                57                    0.05
5    Chemical Random      7                126                   0.06
6    Transport Random     5                70                    0.07
7    People Cluster       4                15                    0.3
8    MiniTambis Cluster   7                22                    0.3
9    University Cluster   4                18                    0.2
10   Economy Cluster      3                19                    0.16
11   Chemical Cluster     7                126                   0.06
12   Transport Cluster    5                102                   0.05
13   BIO                  3                33                    0.09
14   OBI                  3                150                   0.02

Table 4.8.: Average of the relevant axioms for deriving provenance vs the average of the module size.

4.8. Related Work

In [Baader et al., 2009] the authors propose an approach for computing boundaries for reasoning in sub-ontologies to enforce access restrictions. Provenance is used as a criterion for pre-computing sub-ontologies during the development phase, such that the consequences still follow from pre-determined axioms. Two optimizations for reasoning are proposed: an extension of the hitting set tree (HST) algorithm to general lattices, which is close to state-of-the-art pinpointing algorithms, and binary search for total orders. For large ontologies, [Baader et al., 2009] uses a modularization step before applying the actual reasoning. The evaluations of [Baader et al., 2009] and our approach are not directly comparable, as [Baader et al., 2009] focuses on either large ontologies with low expressivity (EL+) or small ontologies with high expressivity. Moreover, the modularization step is not included in the evaluation. [Knechtel and Peñaloza, 2010] extends the approach towards explanations and debugging of access rights.


In contrast, our approach is built on modularization at its core, without a need for pre-processing, and generates explanations while computing provenance. We have shown that our approach scales very well even for large and very expressive (SHOIN(D)) ontologies. Besides a single access rights dimension, we discuss the management of the various provenance dimensions relevant for the tracking of ontology dynamics. The two approaches might be joined in future work by using the HST algorithm from [Baader et al., 2009] to replace the naïve algorithm used in ProvPartial. Such a combined algorithm would have desirable features of both approaches. [Sensoy et al., 2013] propose an approach to resolve conflicts before performing reasoning. They combine DL-Lite with the Dempster-Shafer theory of evidence (DST) to allow scalable reasoning over uncertain semantic knowledge bases. [Dong et al., 2009] propose to resolve conflicts in information from multiple sources by a voting mechanism. Our approach does not require inconsistencies or uncertain information to be resolved before the reasoning process. More similar to our approach, in [Golbeck and Halaschek-Wiener, 2009] Golbeck et al. present a belief revision algorithm for ontologies which is based on trust degrees of information sources to remove conflicting statements from a knowledge base. The authors use axiom pinpointing for detecting conflicts. There are two main differences between our approach and Golbeck’s: (1) our approach allows for arbitrary dimensions of provenance, and (2) we present an optimized algorithm for computing provenance which uses pinpointing to come up with provenance formulas for description logics. In [Baader et al., 2009], Baader et al. propose a method for access control to inferences from OWL ontologies. Access rights are modeled using a lattice, and pinpointing is used to compute access rights in a way similar to the approach proposed here. Our approach allows for arbitrary dimensions of provenance and hence subsumes the special case of access rights. [McGuinness and da Silva, 2004] propose an infrastructure for trust on the Semantic Web, such as portable explanations and service registries. While [McGuinness and da Silva, 2004] and our approach can augment each other, ours is significantly different, because we provide a flexible framework for arbitrary provenance dimensions. Furthermore, we do not need to generate accurate proofs to track provenance. [Fokoue et al., 2010] introduce a trust framework based on Bayesian Description Logics that allows us to compute a degree of inconsistency over a probabilistic knowledge base. They consider pinpoints as possible worlds for an axiom and derive for each possible world a probability measure. The degree of inconsistency of a knowledge base is then computed as the sum of the probabilities associated with possible worlds that are inconsistent. For scalability reasons, the proposed trust computation model operates on a random sample of justifications. Our optimized algorithm does not deal with the sum of all justifications, and thus we do not need to compute all justifications to determine the degree of inconsistency for a query. We are interested in deriving the provenance labels of the changes which led to the inconsistency. In [Shchekotykhin et al., 2011], the authors propose a framework which allows us to select the most probable diagnosis (pinpoint) to repair an inconsistency.
Basically, it creates a sequence of queries which is used to reduce the set of diagnoses until it identifies the target one. The target diagnosis is the one which contains all entailments of the target ontology. To select the best queries to be posed next, the algorithm predicts the information gain of a query result, i.e., the result which minimizes the entropy. Our approach has a different goal when using provenance information for reasoning. Since we do not aim to compute the most probable possible world over all possible worlds but rather want to select specific axioms

based on their provenance, we do not need to compute all pinpoints to track provenance. Further related work can be grouped into the following categories: (i) extensions of description logics with a particular provenance dimension, especially uncertainty; (ii) general provenance for query answering with algebraic query languages; (iii) extensions of description logics with general provenance; and (iv) provenance for other logical formalisms. ad (i) Several multi-valued extensions of description logics have been proposed: [Lukasiewicz and Straccia, 2008] propose fuzzy and probabilistic extensions of the DLs underlying the web ontology language OWL. [Qi et al., 2011] describe an extension towards a possibilistic logic. [Jung and Lutz, 2012] show an approach for querying probabilistic databases in the presence of an OWL2 QL ontology. Each assertion is assumed to be stored in a database and associated with probabilistic events. Answer probabilities to conjunctive queries can be computed. [Riguzzi et al., 2015b] compute the probability of queries from uncertain DL knowledge bases following the DISPONTE semantics presented in [Riguzzi et al., 2015a]. DISPONTE is a prototype of an expressive DL which minimally extends the language to which it is applied and allows both assertional and terminological probabilistic knowledge. Another extension towards multi-valued logic is presented by [Schenk, 2008]. They aim at trust and paraconsistency instead of uncertainty. OWL-2 is extended to reasoning over logical bilattices. Bilattices, which reflect the desired trust orders, are then used for reasoning. The approach allows for paraconsistent reasoning with OWL taking into account trust levels. [Ma et al., 2007] provide an extension to reasoning in OWL with paraconsistency by reducing it to classical DL reasoning. On the level of RDF, [Flouris et al., 2009] “color” triples to track information sources. All of these approaches have in common that they modify the character of models in the underlying description logic, e.g. to fuzzy or probabilistic models. In our approach, in contrast, we reason on a meta level: while the underlying model remains unchanged, we compute consequences of annotations on axioms. This meta-level reasoning is not possible in the approaches proposed above. Unlike general provenance, these approaches are tailored to a specific need and most of them do not respect the provenance dimensions needed for tracking ontology dynamics. [Lutz et al., 2008, Artale et al., 2007, Artale et al., 2009, Artale et al., 2013, Gutiérrez-Basulto and Klarman, 2012] extend classical DLs by temporal operators, which then occur within the knowledge base. However, most of these logics yield high reasoning complexities, even if the underlying atemporal DL has tractable reasoning problems. In [Artale et al., 2007], the authors propose a temporal extension of description logics which is a combination of the epistemic modal logic S5 with the standard DL ALCQI. The resulting logic S5_ALCQI can represent rigid concepts and roles and allows one to state that concept and role memberships change in time (but without discriminating between changes in the past and future). This approach weakens the temporal dimension to the much simpler S5, but can nevertheless show that adding change (i.e., timestamping on entities, relationships and attributes) pushes the complexity of ALCQI from ExpTime-complete to 2-ExpTime-hard.
Likewise, in [Artale et al., 2009] the authors propose a temporal extension of description logic, T_◇DL-Lite_bool, that can additionally capture some form of evolution constraints. Both works aim at analyzing the satisfiability problem (decidable or undecidable) for temporal extensions of DL. [Baader et al., 2013]’s temporal query language is based on conjunctive queries. Since we do not work with any temporal extension of DL, our problem is restricted to the underlying logic. In [Gutierrez et al., 2007], temporal query answering over temporalized RDF triples and OWL axioms [Motik, 2012, Batsakis et al., 2015, Tappolet and Bernstein, 2009] using an extension of the SPARQL query language is considered.


ad (ii) Provenance for algebraic base languages has been proposed by various authors, for example for the Semantic Web Query Language SPARQL [Geerts et al., 2016, Theoharis et al., 2011, Karvounarakis and Green, 2012, Geerts et al., 2013, Karvounarakis et al., 2013, Dividino et al., 2009b, Hartig, 2009b, Lopes et al., 2010] and for relational databases [Cui and Widom, 2000, Buneman et al., 2000, Buneman et al., 2001, Ding et al., 2005]. While the actual provenance formalisms are comparable to ours, the underlying languages are of lower expressivity, typically Datalog. Provenance in these languages can be evaluated directly using the tree-shaped algebraic representations of a query, which is not possible in description logics. ad (iii) [Tran et al., 2008] propose a provenance extension of OWL, which is also based on annotation properties. Even though provenance can be expressed in ways comparable to ours, it has a rather ad-hoc semantics, which may differ from query to query. In our approach, provenance and classical reasoning take place in parallel. Hence, we can answer queries such as “Give me all results with a confidence degree of ≥ x”. In contrast, reasoning on the ontology and meta level is separated in [Tran et al., 2008]. As a result, [Tran et al., 2008] allows for queries such as “Give me all results which are based on axioms with a confidence degree of ≥ x”. Although this difference might seem quite subtle, depending on the provenance dimension these queries may have dramatically different results. ad (iv) [Bistarelli et al., 2008] propose an extension of Datalog with weights, which are based on c-semirings and can be redefined to reflect various notions of trust and uncertainty. Our provenance dimensions are similar to c-semirings, but additionally allow handling conflicting provenance using a third operator. C-semirings have been investigated in great detail and have some desirable properties, such as the fact that the Cartesian product of two c-semirings again is a c-semiring. Although based on a less strictly defined algebraic structure, our composition of provenance dimensions described in Section 4.4.2 follows a similar idea.

4.9. Findings and Research Contribution

When reasoning with knowledge from different sources on the Semantic Web, applications need to track the provenance of axioms in living ontologies. Users need tool support for judging the usability and trustworthiness of ontological data, as well as engineering tools supporting the collaborative development of living ontologies. In dynamic ontologies, ever-changing criteria may be necessary to establish trust or locate bugs. We propose a black-box algorithm for optimized reasoning with provenance, which, as our evaluation with real-world ontologies shows, performs significantly better than existing general pinpointing algorithms, scales well, and is applicable in interactive applications. The flexibility of our approach combined with its high scalability makes it a possible building block for a Semantic Web proof and trust layer. In this chapter we present our framework restricted to ontology diagnosis scenarios. Our work, however, introduces an approach for provenance querying under a variety of scenarios that consider many of the dimensions of provenance, such as restrictions of access rights, knowledge validity when the truth of knowledge changes with time, and inferring trust values. We have shown that provenance information does not only provide value to the end user; it can further be used to significantly speed up debugging processes by rapidly approximating a solution. In future work, we will apply our approach to provenance to other logical formalisms beyond the DL SRIQ(D).

Part III.

Using Provenance in Semantic Web Applications


5. An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation

Overview

Popular online social media sites such as Twitter and Facebook have also become important news and marketing channels. These sites rely on news feed ranking algorithms to deliver the most relevant messages to their users. In this chapter, we tackle the problem of providing the most relevant messages according to the user’s preferences. Users state their preferences on different aspects of the data (e.g., on information source, recentness, reliability, location, etc.), and this information is then aggregated to obtain a joint ranking. We relate our aggregation problem to the problem of preference aggregation in social choice theory. The traditional problem formulation of preference aggregation assumes a fixed set of preference orders and a fixed set of domain elements. In this work, however, we investigate how an aggregated preference order has to be updated when the domain is dynamic, i.e., our aggregation approach ranks messages ’on the fly’ as the messages pass through the system. We establish a framework for online preference aggregation. Our analysis shows that, for all aggregation methods proposed, our approach can handle the dynamic setting more efficiently than the standard ones.

Structure

This chapter is organized as follows. Following the introduction, in Section 5.2, we provide a motivation for the use of the different dimensions of provenance information when ranking highly dynamic data (e.g., a news feed ranking scenario). Section 5.3 introduces the foundations for aggregation computation: we briefly introduce the basic concepts of social choice theory and present three different preference aggregation methods, namely, the plurality, Borda count, and sequential pairwise aggregation methods. Section 5.4 describes the formalization of the online aggregators. Section 5.5 presents some general relationships between the original aggregated preference order and the updated aggregated preference order. Section 5.6 shows a framework for computational approaches to online preference aggregation. Concrete online aggregation algorithms and complexity analyses are presented in Section 5.7. We compare our work with related ones in Section 5.9 and conclude the work in Section 5.10.


5.1. Introduction

Nowadays, most information on the Web comes as an almost continuous flow. Data is constantly being produced, shared, and consumed by a diversity of stakeholders. With so much data on the Web, users want to be sure they won’t miss anything they want to see. The goal of a news feed algorithm is to deliver to the user the right content at the right time, i.e., it determines which stories are important to the user, and from those, it decides which one should appear first. One common way to identify the most relevant stories is to score all of them based on some scoring function, for instance, by showing stories in chronological order. However, even though the newest stories will be displayed to the user, this procedure does not ensure that the stories displayed are also the ones the user is interested in. It is essential that such an algorithm considers a set of features where the score of each element is given according to each feature, and the total score for each element is defined as a combination of its partial scores. Our problem deals with the task of selecting the set of most relevant messages, given the assumption that messages come ’on the fly’, i.e., deriving their rankings as they become available upon their arrival. The user’s preferences are used to specify which messages are the most relevant to the user. Users state their preferences on different aspects of provenance information (information source, recentness, reliability, location, etc.), e.g., “I am primarily interested in reading the most recent messages, the ones which are written by friends, and those which describe events happening near to where I live”. There are many different ways in which preferences can be combined. Preferences have been a long-studied subject across many fields including philosophy [Hansson, 2002], artificial intelligence [Brafman and Domshlak, 2009], and databases [Chomicki, 2011]. In the realm of the Semantic Web, querying the Semantic Web with preferences is considered in [Gueroussova et al., 2013, Siberski et al., 2006]. Both articles have proposed extensions to SPARQL that support the expression of preferred query results. They illustrated the realization of aggregation with respect to skyline and conditional preference queries. In [Pahlevan et al., 2015], Pahlevan et al. investigate the dynamics of web services in a recommendation scenario. They use well-established database techniques, such as scoring functions to determine an ordering over the results and skylines returning a set of undominated tuples. We relate the problem of ranking stemming from an aggregation of different provenance dimensions to the problem of preference aggregation (or judgement aggregation) in social choice theory [Kelly, 1988]. Social choice theory studies the problem of how to reach collective consent within a group of people. Judgment aggregation investigates how to aggregate individual judgments on logically related propositions to a group judgment on those propositions. We translate the aggregation problem to the problem of aggregating provenance rankings. By adopting methods from preference aggregation, we formulate a general framework for provenance dimension aggregation that is based on solid formal grounds. We have established a framework for computational approaches to online preference aggregation.
Concretely, we propose three algorithms that iteratively update the aggregated ranking when new results are available, namely for the plurality, the Borda count, and the sequential pairwise voting methods. Finally, we discuss the time complexity of the proposed online preference aggregation methods in comparison to the standard methods and conduct a feasibility study of our algorithms.


Message ID   Movie                    Friendship     Date       Popularity
o1           Star Wars: Episode VII   Family         02.01.16   Low
o2           The Revenant             Workmate       03.01.16   Medium
o3           Mistress America         Close friend   01.01.16   High
...

Table 5.1.: List of messages about movies posted in a social media application by friends, workmates and family members of our sample user. This list is used in our scenario presented in Section 5.2.

Research Questions

This chapter addresses the following research question.

RQ 2.I Can the preference aggregation problem be efficiently solved in a dynamic setting?

The traditional problem of preference aggregation assumes a fixed set of preference orders and a fixed set of domain elements. Using the standard aggregation algorithms, whenever a new element is added to the domain, a new aggregation has to be re-built. This operation requires expensive computation and cannot be adopted in real use case scenarios. Therefore, it is essential to investigate how preference aggregation methods can be modified in order to assure efficiency in a dynamic setting.

5.2. Motivation Scenario

In our scenario, our sample user, Peter, wants to read messages from his friends and other subscribed sources using a social media application. An example of such messages is shown in Table 5.1. Our user feels like watching a movie. Therefore, he checks all messages about movies he has received lately in his social media application. He prefers to read the messages sent by his closest friends first, since their interests are more closely related to his. Recentness is an important consideration to him since Peter has not been in a movie theater for a long time. Still, he would rather read messages that have been recently posted than the most popular ones (e.g., popularity is given by how often a message has been liked and/or shared by other users). So he sets up his preferences: he prefers messages from friends which should not be too old, but recentness is more important than the message’s degree of popularity. Usually, ranking algorithms select the most relevant query results according to each ranking criterion separately, e.g., they return the most recent ones, the most popular ones, and the ones provided only by very close friends, displaying all of them. Since the ranking criteria, taken separately, may lead to different orderings of the answers, ranking in terms of multiple ranking criteria often delivers an even larger amount of information instead of filtering it. For example, in Table 5.1, a ranking algorithm may return to Peter the messages o1, o2, o3 in the order o2, o1, o3 if it only considers recentness, in the order o3, o1, o2 if it considers the friendship degree, and in the order o3, o2, o1 if it considers the popularity degree. Our user, however, wishes to have just one ranking of the answers with respect to an aggregation of his preferences (a personalized aggregation). Additionally, as such messages are

permanently being posted, a ranking mechanism should compute the most relevant messages ’on the fly’ as the messages pass through the system. In order to provide Peter with an optimal user experience, we need a ranking algorithm that scales to real data and ensures that Peter constantly receives updates with the most relevant messages with respect to his preferences.

5.3. Foundations: Preference Aggregation

In social choice theory [Kelly, 1988], there are various investigations on the aggregation of multiple preference relations in order to come up with a single preference order. The traditional application for these methods lies in the field of voting theory. In the following, we describe the foundations of aggregating preferences. Let X = {x1, . . . , xn} be a set of elements. A preference order ⪯⊆ X × X is a total preorder, i. e., ⪯ is a relation that satisfies transitivity

(x ⪯ y and y ⪯ z imply x ⪯ z for all x, y, z ∈ X ) and is total

(x ⪯ y or y ⪯ x for all x, y ∈ X ).

If x ⪯ y then we say that y is at least as preferred as x, for x, y ∈ X. Let R_X be the set of all possible preference orders over elements of X. A vector of preference orders ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ ∈ R_X^m is called a preference profile. A preference aggregator Θ is a function Θ : R_X^m → R_X, which maps a preference profile ⪯⃗ to its aggregated preference order ⪯. Specific approaches for preference aggregators are, for instance, plurality, Copeland, single transferable vote, and the Borda preference aggregator [Walsh, 2007].

Example 5.3.1 Let X = {o1, o2, o3} be the messages presented in Table 5.1 and ⪯⃗ = ⟨⪯1, ⪯2, ⪯3⟩ be the preference profile, where ⪯1 defines the friendship order (workmate ⪯1 family ⪯1 friend ⪯1 close friend), ⪯2 the recentness order, and ⪯3 the popularity order. Then:

o2 ⪯1 o1 ⪯1 o3

o3 ⪯2 o1 ⪯2 o2

o1 ⪯3 o2 ⪯3 o3

Preference aggregators can be classified according to the properties they satisfy. Let ⪯⃗ ∈ R_X^m be a preference profile and ⪯ = Θ(⪯⃗) an aggregated preference order. Some properties that are usually considered as desirable for a preference aggregator Θ are as follows:

Freeness (FR) Θ is surjective.

Non-dictatorship (ND) There is no i ∈ {1, . . . , m} such that ⪯i = ⪯ for every profile.

Independence of irrelevant alternatives (IIA) If for every two profiles ⟨⪯1, ..., ⪯m⟩ and ⟨⪯′1, ..., ⪯′m⟩ and every i = 1, ..., m it holds that o ⪯i o′ implies o ⪯′i o′, then o ⪯ o′ implies o ⪯′ o′ (with ⪯′ = Θ(⪯′1, ..., ⪯′m)).


Monotonicity (MON) If for every two profiles ⟨⪯1, ..., ⪯m⟩ and ⟨⪯′1, ..., ⪯′m⟩ and every i = 1, ..., m we have that ∀o, o′ ∈ O: o ⪯i o′ implies o ⪯′i o′, under the condition that the relative order of the other elements remains the same, then o ⪯ o′ implies o ⪯′ o′ (with ⪯ = Θ(⪯1, ..., ⪯m) and ⪯′ = Θ(⪯′1, ..., ⪯′m)).

The main intuitions behind these properties are as follows. Freeness (FR) says that an aggregator can produce any relation order. Non-dictatorship (ND) says that there is no single preference relation ⪯i that alone can determine the outcome of the aggregated preference relation ⪯. Independence of irrelevant alternatives (IIA) states that the preference relation between any pair of elements x and x′ depends only on their preferences in ⪯1, ..., ⪯m (the preference relations of any other pairs of elements y, y′ are irrelevant). Monotonicity (MON) requires that if an element is the most preferred under a profile ⪯⃗ and if under a new profile ⪯⃗′ every preference order ranks that candidate at least as highly as under ⪯⃗, without changing the other elements, then the candidate should still be the most preferred one. According to Arrow’s famous impossibility result [Arrow, 1950], there is no preference aggregator Θ that satisfies all four properties at once. In this chapter, we focus on the plurality aggregator, the Borda count aggregator, and sequential pairwise voting (see [Walsh, 2007]). Definitions for these aggregators are given for total orders as well as for total preorders. We adapt the definitions for total preorders (see [Pini et al., 2005]). The plurality preference aggregator Θp is defined as follows.

Definition 5.3.1 (Plurality aggregator) Let ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ ∈ R_X^m be a preference profile. For each x ∈ X define

δ(x) = |{i | ∀x′ ∈ X : x′ ⪯i x}|

Then the plurality preference aggregator Θp is defined via Θp(⪯1,..., ⪯m) =⪯ with

x ⪯ x′ iff δ(x) ≤ δ(x′)

The plurality preference aggregator defines the set of most preferred elements to be the set of elements with the most first place votes.

Example 5.3.2 Let ⪯⃗ be the preference profile defined in Example 5.3.1 over X = {o1, o2, o3}, the messages presented in Table 5.1. Using the plurality aggregator, we obtain that δ(o1) = 0, δ(o2) = 1, and δ(o3) = 2.

X    ⪯1   ⪯2   ⪯3   δ
o1   0    0    0    0
o2   0    1    0    1
o3   1    0    1    2

For ⪯= Θp(⪯⃗) it follows

o1 ⪯ o2 ⪯ o3

The plurality aggregator ranks the messages as follows: the most relevant message is the message o3 which, even though it is the oldest of them, has been delivered by a close friend and has the highest popularity degree. In second place, it ranks the message o2, which is the most recent one but has been provided by a workmate and has a medium popularity degree. Last, it ranks the message o1, which has been recently posted by a family member but has only a low popularity degree.
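To make the plurality computation concrete, the following Python sketch (not from the thesis; the rank-based encoding of total preorders and all names are illustrative choices) recomputes the δ-scores of Example 5.3.2:

    # Each total preorder is encoded as a dict mapping an element to a rank,
    # where a higher rank means "more preferred"; tied elements share a rank.
    profile = [
        {"o1": 1, "o2": 0, "o3": 2},   # friendship  (close friend ranked highest)
        {"o1": 1, "o2": 2, "o3": 0},   # recentness  (newest ranked highest)
        {"o1": 0, "o2": 1, "o3": 2},   # popularity  (most popular ranked highest)
    ]

    def plurality_delta(profile, elements):
        """delta(x) = number of preference orders in which x is (one of) the most preferred."""
        delta = {x: 0 for x in elements}
        for order in profile:
            top = max(order[x] for x in elements)
            for x in elements:
                if order[x] == top:
                    delta[x] += 1
        return delta

    print(plurality_delta(profile, ["o1", "o2", "o3"]))
    # {'o1': 0, 'o2': 1, 'o3': 2}  ->  o1 ⪯ o2 ⪯ o3, as in Example 5.3.2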

The Borda preference aggregator Θb is defined as follows.

Definition 5.3.2 (Borda aggregator) Let ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ ∈ R_X^m be a preference profile. For each x ∈ X we define δ(x) via¹

δ(x) = ∑_{i=1}^{m} ∑_{x′ ∈ X∖{x}} [x′ ⪯i x]

Then the Borda preference aggregator Θb is defined via Θb(⪯1,..., ⪯m) =⪯ with

x ⪯ x′ iff δ(x) ≤ δ(x′)

The Borda preference aggregator defines the aggregated preference relation by ordering outcomes by scores. The score of an element is obtained by summing up the positions of the element in each preference relation. More precisely, an element x is at least as preferred as an outcome x′ (x′ ⪯ x), if the sum of all positions over all considered preference orders for x′ is smaller or equal to the sum for x.

Example 5.3.3 Let ⪯⃗ be the preference profile defined in Example 5.3.1. The positions of all outcomes in the preference relations ⪯1,..., ⪯3 and the values δ(o1), δ(o2), and δ(o3) are given as

X    ⪯1   ⪯2   ⪯3   δ
o1   1    1    0    2
o2   0    2    1    3
o3   2    0    2    4

For ⪯= Θb(⪯⃗) it follows

o1 ⪯ o2 ⪯ o3

Just like plurality, the Borda aggregator ranks o3 at first, followed by o2 and then o1.
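Analogously, a small Python sketch (again illustrative, reusing the rank-based encoding from the plurality sketch above) reproduces the δ-Borda scores of Example 5.3.3:

    profile = [
        {"o1": 1, "o2": 0, "o3": 2},   # friendship
        {"o1": 1, "o2": 2, "o3": 0},   # recentness
        {"o1": 0, "o2": 1, "o3": 2},   # popularity
    ]

    def borda_delta(profile, elements):
        """delta(x) = over all orders, the number of other elements x' with x' ⪯_i x."""
        delta = {x: 0 for x in elements}
        for order in profile:
            for x in elements:
                delta[x] += sum(1 for y in elements if y != x and order[y] <= order[x])
        return delta

    print(borda_delta(profile, ["o1", "o2", "o3"]))
    # {'o1': 2, 'o2': 3, 'o3': 4}  ->  o1 ⪯ o2 ⪯ o3, as in Example 5.3.3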

As a remark, note that both the plurality and the Borda preference aggregator are instances of scoring voting rules [Walsh, 2007]. Both the plurality and the Borda preference aggregator satisfy FR, ND, and MON, but fail IIA. Lastly, we present sequential pairwise voting. A nice property of sequential pairwise voting is that it satisfies IIA.

Definition 5.3.3 (Sequential pairwise voting) Let ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ ∈ R_X^m be a preference profile. For each x, x′ ∈ X define

δ(x, x′) = ∑_{i=1}^{m} [x′ ⪯i x],

i.e., δ(x, x′) counts the preference orders in which x is at least as preferred as x′.

1 Note that [A] is the Iverson bracket defined via [A] = 1 if A is true and [A] = 0 otherwise.


Then the sequential pairwise voting aggregator Θs is defined via Θs(⪯1, ..., ⪯m) = ⪯ with

Θs(x1, x2, x3, ..., xn) =
    {x1 ⪯ x2} ∪ Θs(x2, x3, ..., xn)   if δ(x1, x2) ≤ δ(x2, x1)
    {x2 ⪯ x1} ∪ Θs(x1, x3, ..., xn)   if δ(x2, x1) ≤ δ(x1, x2)

Sequential pairwise voting is similar to single-elimination procedures. First, it picks the first two elements of the agenda. An agenda is a pre-defined fixed order on the elements for proceeding with the comparisons. If one element is more preferred than the other, it wins the current comparison. This element is taken to face the next element on the preordered agenda. The winner of the last pairwise comparison is the winner of the competition.

Example 5.3.4 Let ⪯⃗ be the preference profile defined in Example 5.3.1, and an agenda o1, o2, o3. We start with o1 vs. o2. o2 wins since it is preferred twice, and o1 just once. Then o2 vs o3 and o3 wins. For ⪯= Θs(⪯⃗) it follows

o1 ⪯ o2 ⪯ o3
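The following Python sketch (illustrative only, with the same rank-based encoding as before; ties are resolved in favour of the challenger, one of the two orderings the definition admits) replays Example 5.3.4:

    profile = [
        {"o1": 1, "o2": 0, "o3": 2},   # friendship
        {"o1": 1, "o2": 2, "o3": 0},   # recentness
        {"o1": 0, "o2": 1, "o3": 2},   # popularity
    ]

    def sequential_pairwise(profile, agenda):
        """Single elimination along the agenda: in each duel, the element that at least
        as many preference orders rank at least as high as its rival survives."""
        ranking = []                     # built from least to most preferred
        current = agenda[0]
        for challenger in agenda[1:]:
            wins_current = sum(1 for o in profile if o[current] >= o[challenger])
            wins_challenger = sum(1 for o in profile if o[challenger] >= o[current])
            if wins_challenger >= wins_current:   # challenger wins the duel
                ranking.append(current)
                current = challenger
            else:
                ranking.append(challenger)
        ranking.append(current)          # the overall winner comes last (most preferred)
        return ranking

    print(sequential_pairwise(profile, ["o1", "o2", "o3"]))
    # ['o1', 'o2', 'o3']  ->  o1 ⪯ o2 ⪯ o3, as in Example 5.3.4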

5.4. Preference Aggregation in a Dynamic Setting

5.4.1. Online Preference Aggregator

The standard usage of preference aggregation, as introduced in Section 5.3, assumes a static setting, i.e., the elements of the set X (the domain) and the preference profile ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ are known when the aggregated relation ⪯ = Θ(⪯⃗) is constructed. In contrast, in a dynamic setting the set X is not available at once; rather, we receive the messages continuously. Whenever a new message comes in, the preference profile ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ has to be extended to consider this element, and that may affect the aggregated relation ⪯ = Θ(⪯⃗). In the following discussion, we move towards preference aggregation with domain changes. We start with the formal introduction of the dynamic preference aggregation problem under domain changes, followed by a formal specification of the required updates of the aggregated preference relation ⪯. We assume that X is the universal set of outcomes and we restrict our attention to a subset O = {o1, ..., on} ⊆ X. Let R_O be the set of all preference orders on O, i.e., of all total preorders on O × O. For ⪯ ∈ R_X, we denote with ⪯^O the projection of ⪯ on O, i.e., ⪯^O = ⪯ ∩ (O × O), and the preference profile on O with ⪯⃗^O = ⟨⪯1^O, ..., ⪯m^O⟩. Let e ∈ X ∖ O be a new element in the domain of the preference orders and let O ∪ {e} be the set of all elements we consider after adding e to the setting.

Example 5.4.1 Let ⪯⃗^O be a preference profile on the elements in O, as defined in Example 5.3.1. Assume now that a new element e ∈ X ∖ O is added to our setting, as illustrated in Table 5.2. The preference orders ⪯1, ⪯2, ⪯3 are updated to incorporate e within the orders as follows:

o2 ⪯1 o1 ⪯1 o3 ⪯1 e

o3 ⪯2 o1 ⪯2 o2 ⪯2 e

o1 ⪯3 o2 ⪯3 e ⪯3 o3


Message ID   Movie                    Friendship     Date       Popularity
o1           Star Wars: Episode VII   Family         02.01.16   Low
o2           The Revenant             Workmate       03.01.16   Medium
o3           Mistress America         Friend         01.01.16   High
e            Man on a Ledge           Close friend   04.01.16   Medium
...

Table 5.2.: Extended list of messages about movies that have been posted by our user’s friends in a social media application

[Figure 5.1 depicts the commuting diagram: the profile ⪯⃗^O over O is aggregated by Θ to ⪯ = Θ(⪯⃗^O); adding e ∈ X ∖ O and aggregating the extended profile ⪯⃗^{O∪{e}} with Θ yields ⪯′ = Θ(⪯⃗^{O∪{e}}), which the online aggregator Λ is meant to obtain directly from ⪯ and e.]

Figure 5.1.: Commuting diagram for online preference aggregation

In this work, we aim to investigate the relationship between ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}) for a given preference aggregator Θ. More precisely, we want to investigate whether there exists a mapping Λ from ⪯ to ⪯′ as depicted in Fig. 5.1.

Definition 5.4.1 (Online Preference Aggregator) An online preference aggregator Λ is a function Λ : R_X × X → R_X.

In other words, we want to find a function Λ that extends an aggregated preference order ⪯ ∈ R_O with an additional element e directly to a new aggregated preference order ⪯′ ∈ R_{O∪{e}} such that Λ mimics the behavior of the underlying preference aggregator. The idea behind the online preference aggregator Λ is as follows: Λ works directly on the new element e and ⪯, and it delivers the same result as first updating the preference orders ⪯1, ..., ⪯m and then aggregating again to ⪯′. This is formalized in the following definition.

Definition 5.4.2 (Θ-compliant representation) Let Θ be a preference aggregator and Λ be an online preference aggregator. We say that Λ is a Θ-compliant representation if and only if it holds that

Θ(⪯1^{O∪{e}}, ..., ⪯m^{O∪{e}}) = Λ(Θ(⪯1^O, ..., ⪯m^O), e) for all O ⊆ X and e ∈ X ∖ O.

Figure 5.1 illustrates the relationship between a preference aggregator Θ and its Θ-compliant representation Λ. The definition of a Θ-compliant representation is very restrictive, as it requires a functional dependency between the old (⪯) and the new aggregated relation (⪯′), i.e., it requires that ⪯′ is functionally dependent on (and determined by) ⪯. In general, this demand is not always satisfiable.

Example 5.4.2 Let O = {o1, o2, o3} and ⪯⃗ be a preference profile as defined in Example 5.3.1. Using plurality aggregation (Θp):

X    ⪯1   ⪯2   ⪯3   δ
o1   0    0    0    0
o2   0    1    0    1
o3   1    0    1    2

We obtain the same result when we aggregate Θp(⪯1, ⪯2) = ⪯1,2 with (o1 ⪯1,2 {o2, o3}) and Θp(⪯2, ⪯3) = ⪯2,3 with (o1 ⪯2,3 {o2, o3}). Now consider the new set of elements {o1, o2, o3, e} and extend ⪯1, ⪯2, ⪯3 as in Example 5.4.1. Using plurality aggregation (Θp), we obtain:

X    ⪯1   ⪯2   ⪯3   δ
o1   0    0    0    0
o2   0    0    0    0
o3   0    0    1    1
e    1    1    0    2

Θp(⪯′1, ⪯′2) = ⪯′1,2 is defined via ({o1, o2, o3} ⪯′1,2 e), and Θp(⪯′2, ⪯′3) = ⪯′2,3 is defined via ({o1, o2} ⪯′2,3 {o3, e}). In summary, we get ⪯1,2 = ⪯2,3 but ⪯′1,2 ≠ ⪯′2,3. This shows that ⪯′ cannot be uniquely determined by ⪯. However, when we consider the sequential pairwise aggregator Θs, we obtain as results Θs(⪯1, ⪯2) = ⪯1,2 with ({o1, o2, o3} ⪯1,2 ∅) and Θs(⪯2, ⪯3) = ⪯2,3 with (o1 ⪯2,3 {o2, o3}); considering the new set of elements {o1, o2, o3, e} and the extended ⪯1, ⪯2, ⪯3 as in Example 5.4.1, we obtain ({o1, o2, o3} ⪯′1,2 e) and Θs(⪯′2, ⪯′3) = ⪯′2,3 with ({o1} ⪯′2,3 {o2, o3} ⪯′2,3 e). In summary, the inclusion of the new element e does not change the preferences of the elements o1, o2, and o3 in ⪯′1,2 and ⪯′2,3. This shows that in some cases ⪯′ could be determined by ⪯.

The above example shows that there is not always a direct functional dependency between ⪯ and ⪯′. Therefore, in Section 5.5 we analyze the relationships of the properties described in Section 5.3 with the online preference aggregation setting.

5.4.2. State-based Online Preference Aggregator

In order to update the aggregated preference order, we need some additional information to be carried over from one update (the addition of one element) to the next one. Consequently, we explore state descriptions within our online aggregator (see Definition 5.4.3). A state describes the current aggregated relation.

Definition 5.4.3 Let S be a set of states. A state-based online preference aggregator ∆ is a pair ∆ = ⟨ι, Λ̃⟩ such that

1. ι is a function ι : R_X^m → S × R_X, and


2. Λ̃ is a function Λ̃ : S × R_X × X → S × R_X.

The function ι is the initialization function that delivers ι(⪯⃗) = (s, ⪯), where ⪯ is the aggregated preference order of ⪯⃗ and s ∈ S is some initial state. The set of states might be defined differently depending on the actual preference aggregator (in Section 5.6 we propose concrete state-based online preference aggregators). Given a state s ∈ S, an element e ∈ X ∖ O, and Λ̃(s, ⪯, e) = (s′, ⪯′), the update from state s to s′ is used for computing the update from the aggregated preference order ⪯ to ⪯′ when adding the element e to the set of elements O.

Definition 5.4.4 Let Θ be a preference aggregator and ∆ = ⟨ι, Λ̃⟩ be a state-based online preference aggregator. We say that ∆ is a state-based Θ-compliant representation if and only if for all profiles ⪯⃗ = ⟨⪯1, ..., ⪯m⟩ ∈ R_X^m and all elements {e1, ..., ek−1} ⊆ X ∖ O, there are states s1, ..., sk ∈ S such that

ι(⪯⃗^O) = ⟨s1, Θ(⪯⃗^O)⟩

Λ̃(sj, Θ(⪯⃗^{O∪{e1,...,ej−1}}), ej) = ⟨sj+1, Θ(⪯⃗^{O∪{e1,...,ej}})⟩ for j = 1, ..., k−1.

Example 5.4.3 In Example 5.3.2, ⪯ = Θp(⪯⃗^O) is defined via o1 ⪯ o2 ⪯ o3. A new element e ∈ X ∖ O is added to the set of elements and the preference orders ⪯1, ⪯2, ⪯3 are updated to consider e within the orders as in Example 5.4.1. The updated aggregated relation ⪯′ = Θp(⪯⃗^{O∪{e}}) is defined via {o1, o2} ⪯′ o3 ⪯′ e. The previous and updated aggregated relations are defined based on the δ function. The values of δ for each element describe, for example, its state in the current aggregated relation. The online aggregator for plurality Λ̃p explores the states of the elements to update Θp(⪯⃗^O) to Θp(⪯⃗^{O∪{e}}). The state of each element in O for the previous aggregated relation ⪯ is defined by δ(o1) = 0, δ(o2) = 1, and δ(o3) = 2. The next state of the elements in O ∪ {e} for the updated preference profile is defined by δ(o1) = 0, δ(o2) = 0, δ(o3) = 1 and δ(e) = 2.
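For illustration, the state transition of Example 5.4.3 can be recomputed directly from the rank-encoded profiles (a sketch under the same illustrative encoding as in Section 5.3; the helper below is not part of the thesis):

    def plurality_state(profile, elements):
        """State: delta(x) = number of orders in which x is (one of) the most preferred."""
        return {x: sum(1 for o in profile if o[x] == max(o[y] for y in elements))
                for x in elements}

    before = [                                  # profile over O = {o1, o2, o3}
        {"o1": 1, "o2": 0, "o3": 2},
        {"o1": 1, "o2": 2, "o3": 0},
        {"o1": 0, "o2": 1, "o3": 2},
    ]
    after = [                                   # extended profile over O ∪ {e} (Example 5.4.1)
        {"o1": 1, "o2": 0, "o3": 2, "e": 3},
        {"o1": 1, "o2": 2, "o3": 0, "e": 3},
        {"o1": 0, "o2": 1, "o3": 3, "e": 2},
    ]
    print(plurality_state(before, ["o1", "o2", "o3"]))        # {'o1': 0, 'o2': 1, 'o3': 2}
    print(plurality_state(after, ["o1", "o2", "o3", "e"]))    # {'o1': 0, 'o2': 0, 'o3': 1, 'e': 2}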

5.5. Aggregators Property Analysis

5.5.1. Independence of Irrelevant Alternatives

First, we consider aggregators that satisfy the independence of irrelevant alternatives (IIA) property. Basically, IIA specifies that if x is preferred to x′ out of the set of elements O = {x, x′}, then adding a new element e to the set of elements {x, x′, e} must not make x′ preferable to x. In other words, the preference for x or x′ should not be changed by the insertion of e, i.e., e is irrelevant to the choice between x and x′. If Θ satisfies IIA, any element that is not x or x′ is irrelevant to the preference order of any pair x and x′.

Proposition 5.5.1 Let O ⊆ X, e ∈ X ∖ O, ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}). If Θ satisfies IIA then

∀x, x′ ∈ O, x ⪯ x′ ⇔ x ⪯′ x′


Corollary 5.5.1 Let O ⊆ X, e ∈ X ∖ O, ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}). If Θ satisfies IIA then:

⪯ = ⪯′ ∩ (O × O).

The next example shows, for sequential pairwise voting, an update from ⪯ to ⪯′.

Example 5.5.1 In Example 5.3.4, ⪯ = Θs(⪯⃗^O) is defined via o1 ⪯ o2 ⪯ o3. A new element e ∈ X ∖ O is added to the set of elements, and the preference orders ⪯1, ⪯2, ⪯3 are updated to consider e within the orders as in Example 5.4.1, with agenda ⟨o1, o2, o3, e⟩. We start with o1 vs. o2, and o2 wins. Then o2 vs. o3, and o3 wins, and finally o3 vs. e, where e wins. The updated aggregated relation ⪯′ = Θs(⪯⃗^{O∪{e}}) is defined via o1 ⪯′ o2 ⪯′ o3 ⪯′ e. Note that ⪯ does not change, and thus the relative order of the elements in O is maintained.

Furthermore, since all previous relations between every two elements of ⪯ are maintained in ⪯′, the number of changes from ⪯ to ⪯′ is limited to the number of preference relations that have been created due to the insertion of the element e.

Corollary 5.5.2 Let O ⊆ X, e ∈ X ∖ O, ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}). If Θ satisfies IIA then²

|⪯′ ∆ ⪯| ≤ 2|O ∪ {e}|

Example 5.5.2 In Examples 5.3.4 and 5.5.1, we have computed the aggregated preference relations wrt. sequential pairwise voting for the sets {o1, o2, o3} and {o1, o2, o3, e}. Thus, |⪯′ ∆ ⪯| = |{(e, e), (o1, e), (o2, e), (o3, e)}| = 4 = |{o1, o2, o3, e}|.

5.5.2. Monotonicity

We turn to aggregators that satisfy monotonicity (MON). Briefly, if an element x gets more preferred without changing the order of other elements, then the position of x cannot decrease. Thus, a monotone aggregation leaves the ordering of the elements unchanged. Given a monotone aggregation, we can distinguish between the following cases after the insertion of a new element e.

1. If e is positioned arbitrarily in every preference relation ⪯′i, then a monotone aggregation leaves the ordering of those elements of O that are more preferred than e in every ⪯′i unchanged.

2. If e is positioned arbitrarily in every preference relation ⪯′i, then for those elements of O that are less preferred than e in every ⪯′i, a monotone aggregation leaves the updated ordering in ⪯′ at least as good as in ⪯.

Following these observations we state the propositions:

Proposition 5.5.2 Let Θ satisfy MON and let O ⊆ X, ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}). Then

(∀e ∈ X ∖ O : (∀i : (∃T ⊆ O : (∀x, x′ ∈ T : (e ⪯′i x ∧ e ⪯′i x′)))) ⟹ (x ⪯ x′ → x ⪯′ x′ ∧ e ⪯′ x ∧ e ⪯′ x′))

²The symmetric set difference ∆ is defined as follows: ⪯′ ∆ ⪯ = {(x, x′) | (x, x′) ∈ ⪯′ ∧ (x, x′) ∉ ⪯ ∨ (x, x′) ∉ ⪯′ ∧ (x, x′) ∈ ⪯}.


Proposition 5.5.3 Let Θ satisfy MON and let O ⊆ X, ⪯ = Θ(⪯⃗^O) and ⪯′ = Θ(⪯⃗^{O∪{e}}). Then

(∀e ∈ X ∖ O : (∀i : (∃B ⊆ O : (∀x, x′ ∈ B : (x ⪯′i e ∧ x′ ⪯′i e)))) ⟹ (x ⪯ x′ → x ⪯′ x′ ∧ x ⪯′ e ∧ x′ ⪯′ e))

Note that both plurality and Borda aggregators satisfy monotonicity. The next example shows an update from ⪯ to ⪯′ for the plurality aggregator.

Example 5.5.3 In Example 5.3.2, ⪯ = Θp(⪯⃗^O) is defined via o1 ⪯ o2 ⪯ o3. A new element e ∈ X ∖ O is added to the set of elements, and the preference orders ⪯1, ⪯2, ⪯3 are updated to consider e within the orders as in Example 5.4.1. The updated aggregated relation ⪯′ = Θp(⪯⃗^{O∪{e}}) is defined via {o1, o2} ⪯′ o3 ⪯′ e. Note that Proposition 5.5.3 holds for the elements o1 and o2.

These observations state that there are cases in which the aggregated preference order only changes minimally when adding a new element to the domain. Thus, it would seem beneficial to investigate the computational issue of online preference aggregation.

5.6. Online Ranking Algorithms

In the following discussion, we consider how to effectively update the aggregated preference order in light of a newly introduced element.

5.6.1. Naïve Algorithm

The standard approach for ranking over streams starts with an empty top-k message set. As messages arrive, they are added to the list until it is full (k messages). We suppose that the k messages are sorted with respect to the user’s preferences. As a new message arrives, a simple (naïve) algorithm, the naïve online preference aggregator, just applies the original preference aggregation on every change.

Definition 5.6.1 Let O ⊆ X, e ∈ X ∖ O, Θ be a preference aggregator and S_naive = R_X^m. The naive online preference aggregator ∆_Θ^naive for Θ is a pair ∆_Θ^naive = ⟨ι_Θ^naive, Λ̃_Θ^naive⟩ with ι_Θ^naive : R_X^m → S_naive × R_X and Λ̃_Θ^naive : S_naive × R_X × X → S_naive × R_X defined via

ι_Θ^naive(⪯⃗^O) = ⟨⪯⃗^O, Θ(⪯⃗^O)⟩

Λ̃_Θ^naive(⪯⃗^O, Θ(⪯⃗^O), e) = ⟨⪯⃗^{O∪{e}}, Θ(⪯⃗^{O∪{e}})⟩

Example 5.6.1 Let ⪯⃗^O be a preference profile on O as defined in Example 5.3.1. For the plurality aggregator Θp we obtain

ι_Θp^naive(⪯⃗^O) = ⟨{⟨o2 ⪯1 o1 ⪯1 o3⟩, ⟨o3 ⪯2 o1 ⪯2 o2⟩, ⟨o1 ⪯3 o2 ⪯3 o3⟩}, o1 ⪯ {o2, o3}⟩

Assume the preference orders ⪯1, ⪯2, ⪯3 are updated to consider a new element e ∈ X ∖ O as described in Algorithm 5. Then Λ̃_Θ^naive just applies the original preference aggregation Θ on the profile ⪯⃗^{O∪{e}}:

Λ̃_Θ^naive({⟨o2 ⪯1 o1 ⪯1 o3⟩, ⟨o3 ⪯2 o1 ⪯2 o2⟩, ⟨o1 ⪯3 o2 ⪯3 o3⟩}, o1 ⪯ {o2, o3}, e) =
⟨{⟨o2 ⪯1 o1 ⪯1 o3 ⪯1 e⟩, ⟨o3 ⪯2 o1 ⪯2 o2 ⪯2 e⟩, ⟨o1 ⪯3 o2 ⪯3 e ⪯3 o3⟩}, {o1, o2} ⪯′ o3 ⪯′ e⟩


Algorithm 5 Plurality Online Aggregator - Naive Algorithm (see Definition 5.3.1)

function PluralityAggregationNaive(elements[], preferences[], newElement)
    for j = 0 to m do
        newPref[j] = computePreference(elements, newElement, preferences[j]);
        winners[] = getElement(newPref[j], 0);        // all elements in first place of newPref[j]
        for i = 0 to winners.size do
            w = winners[i];
            score[w] = score[w] + 1;
            elementRank.add(w, score[w]);
        end for
    end for
    return elementRank;
end function

Algorithm 6 Borda Online Aggregator - Naive Algorithm (see Definition 5.3.2)

function BordaAggregationNaive(elements[], preferences[], newElement)
    for j = 0 to m do
        newPref[j] = computePreference(elements, newElement, preferences[j]);
    end for
    for j = 0 to m do
        for i = 0 to n do
            element[] = getElement(newPref[j], i);    // all elements at position i of newPref[j]
            for k = 0 to element.size do
                score = elementRank.get(element[k])
                        + (elementRank.size() - newPref[j].getPosition(element[k]));
                elementRank.add(element[k], score);
            end for
        end for
    end for
    return elementRank;
end function

Algorithms 5, 6, and 7 present the naïve versions of the plurality, Borda count, and sequential pairwise voting aggregators, respectively, based on Definitions 5.3.1, 5.3.2, and 5.3.3. The following proposition and theorem follow by construction of the naïve online preference aggregator.

Proposition 5.6.1 Let Θ be a preference aggregator. Then ∆_Θ^naive is a state-based Θ-compliant representation.

Theorem 5.6.1 Let Θ be a preference aggregator of time complexity O(n ∗ m). Then ι_Θ^naive has time complexity O(n ∗ m) and Λ̃_Θ^naive has time complexity O(n + (n ∗ m)). The space complexity for storing a state in S_naive is O(n ∗ m).

Our goal is to investigate if the time complexity of O(n ∗ m) can be improved.

5.6.2. Optimized Algorithm

In the following, we are going to define the online aggregator for two popular voting aggregators, namely plurality (Θp) and Borda (Θb). For these algorithms we exploit the monotonicity property: if an element x gets more preferred without changing the order of other elements, then the position of x cannot decrease. Thus, a monotone aggregation leaves the ordering of the elements unchanged (see Section 5.5).


Algorithm 7 Sequential Pairwise Online Aggregator - Naive Algorithm (see Definition 5.3.3)

function SequentialAggregationNaive(elements[], preferences[], newElement, agenda[])
    for j = 1 to m do
        newPref[j] = computePreference(elements, newElement, preferences[j]);
    end for
    for k = 0 to agenda.size() - 1 do
        element1 = agenda[k];
        element2 = agenda[k+1];
        for j = 1 to m do
            score1 = score1 + newPref[j].getPosition(element1);
            score2 = score2 + newPref[j].getPosition(element2);
        end for
        if score1 < score2 then
            elementRank.add(element1, k);
            elementRank.add(element2, k+1);
        end if
        if score1 > score2 then
            elementRank.add(element2, k);
            elementRank.add(element1, k+1);
        end if
        if score1 = score2 then
            elementRank.add(element1, k);
            elementRank.add(element2, k);
        end if
    end for
    return elementRank;
end function

The Online Plurality Aggregator

For the plurality preference aggregator, the state of an aggregated relation is defined by the function pos that maps each element x ∈ X to an m-dimensional vector. For each dimension i ∈ {1, ..., m}, the value is equal to 1 if x is a winner of the i-th preference relation, and 0 otherwise; e.g., pos(o3) is {1, 0, 1} (see Example 5.3.2). The δ-plurality score of an element is given by summing up all values in its vector. Accordingly, we define the online plurality preference aggregator as follows:

Definition 5.6.2 Let O ⊆ X, e ∈ X ∖ O, Θp be the plurality preference aggregator and S = {0, 1}^{m×|X|} be the set of states, where a state maps each element to an m-dimensional vector over {0, 1}. The online plurality preference aggregator ∆_Θp^online for Θp is a pair ∆_Θp^online = ⟨ι_Θp^online, Λ̃_Θp^online⟩ with ι_Θp^online : R_X^m → S × R_X and Λ̃_Θp^online : S × R_X × X → S × R_X defined via

ι_Θp^online(⪯⃗^O) = ⟨s, Θp(⪯⃗^O)⟩

Λ̃_Θp^online(s, Θp(⪯⃗^O), e) = ⟨s′, Θp(⪯⃗^{O∪{e}})⟩


Algorithm 8 Online Plurality Aggregator Θp - Optimized Algorithm

function PluralityAggregationOpt(element, preference[])
    for j = 1 to m do
        winners[] = getElement(preference[j], 0);              // current winners of preference[j]
        comp = compare(preference[j], winners[0], element);
        if comp = 0 then                                       // element ties with the current winners
            score[element] = score[element] + 1;
            elementRank.add(element, score[element]);
        else if comp = 1 then                                  // element beats the current winners
            score[element] = score[element] + 1;
            elementRank.add(element, score[element]);
            for i = 0 to winners.size do
                w = winners[i];
                score[w] = score[w] - 1;
                elementRank.add(w, score[w]);
            end for
        end if
    end for
    return elementRank;
end function

For the plurality aggregation, the elements with the most first places win. Therefore, it is essential to check how often the new element e gets the first place in the preference relations ⪯1, ..., ⪯m. Whenever e gets the first place in ⪯i, i ∈ {1, ..., m}, we have to update ⪯i by adding e as the new winner, i.e., we set the i-th value in its m-dimensional vector to 1. It is not necessary to update the states of any element where e is not the winner, since (1) the previous winner remains the winner in the updated relation and (2) we assume that the state of e is initially set to a vector of 0s. In case the element e gets the most first places, it will be the new winner in the updated aggregated relation. Following monotonicity, if e gets to be the winner, it does not change the order of the other elements in the aggregated relation, i.e., it leaves the ordering of the less preferred elements unchanged; therefore, to update the new aggregated relation, we only have to add e in the first place.

In the optimized algorithm, instead of updating each preference relation ⪯i (comparing all elements with each other), the online plurality algorithm compares only the new element e with the winner in ⪯i. The winner in each ⪯i is given by the elements' states. Note that a preference relation can have multiple winners, but as we assume that the preference relations are total preorders (the preorder is called total if any two elements x, y ∈ X can be compared, i.e., either x ⪯ y, or y ⪯ x, or both; see Section 5.3), any two winners must be equivalent in ⪯i. Therefore, we only have to check the value of one of the winners. In case the element e is the only winner of the updated preference relation, we add 1 to its δ-plurality score and decrease the δ-plurality score of the previous winners by 1. In case the element e is one of several winners, we only add 1 to its δ-plurality score. The optimized online plurality algorithm is presented in Algorithm 8.
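A minimal Python sketch of this update step is given below. It assumes (purely for illustration; this is not the thesis implementation) that each incoming message carries one comparable key per preference dimension, e.g. a friendship level, a timestamp, and a popularity value, so that comparing e with the current winners amounts to comparing keys:

    def online_plurality_update(state, winners, new_elem, new_keys):
        """state:   element -> delta-plurality score
           winners: per dimension i, a pair (winning key, list of current winners)
           new_elem, new_keys: the incoming element and its key in every dimension."""
        state[new_elem] = 0
        for i, key in enumerate(new_keys):
            best_key, best_elems = winners[i]
            if key > best_key:                 # e alone takes the first place in dimension i
                state[new_elem] += 1
                for w in best_elems:           # previous winners lose this first place
                    state[w] -= 1
                winners[i] = (key, [new_elem])
            elif key == best_key:              # e ties with the current winners
                state[new_elem] += 1
                best_elems.append(new_elem)
            # key < best_key: nothing changes for this dimension
        return state, winners

Ranking the elements by their updated scores then yields the new aggregated relation, as in Example 5.6.2.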

Example 5.6.2 Let ⪯⃗^O be a preference profile on O as defined in Example 5.3.1. For the online plurality aggregator we obtain:

ι_Θp^online(⪯⃗^O) = ⟨⟨{0, 0, 0}, {0, 1, 0}, {1, 0, 1}⟩, ⪯⟩

and ∀x, x′ ∈ O, x ⪯ x′ if ∑_{i=1}^{m} pos(x)[i] ≤ ∑_{i=1}^{m} pos(x′)[i]; therefore ⪯ is defined via o1 ⪯ o2 ⪯ o3. Assume we know the winner of each preference order (by checking the state function). We then compare the winner of each of the preference orders ⪯1, ⪯2, ⪯3 with the element e ∈ X ∖ O. If e is the winner, then we update its plurality score, i.e., its state, as described in Algorithm 8. Then we have

Λ̃_Θp^online(⟨{0, 0, 0}, {0, 1, 0}, {1, 0, 1}⟩, ⪯, e) = ⟨⟨{0, 0, 0}, {0, 0, 0}, {0, 0, 1}, {1, 1, 0}⟩, ⪯′⟩

i.e., we update the states of those elements that were winners in ⪯ but for which e is now the winner in ⪯′; therefore ⪯′ is defined via {o1, o2} ⪯′ o3 ⪯′ e.

Proposition 5.6.2 ∆p is a state-based Θp-compliant representation.

The Online Borda Count Aggregator

Now we turn our discussion to the online Borda count preference aggregator. Similar to the plurality aggregator, we describe a state by the function pos that maps each element x ∈ X to an m-dimensional vector. For each dimension i ∈ {1, ..., m}, the value is the Borda score of element x in the i-th preference relation, e.g., pos(o3) is {2, 0, 2} (see Example 5.3.3). The δ-Borda score of an element is given by summing up all values in its vector.

Definition 5.6.3 Let O ⊆ X, e ∈ X ∖ O, Θb be the Borda preference aggregator and S = {0, ..., |X| − 1}^{m×|X|} be the set of states, where a state maps each element to an m-dimensional vector over {0, ..., |X| − 1}. The online Borda preference aggregator ∆_Θb^online for Θb is a pair ∆_Θb^online = ⟨ι_Θb^online, Λ̃_Θb^online⟩ with ι_Θb^online : R_X^m → S × R_X and Λ̃_Θb^online : S × R_X × X → S × R_X defined via

ι_Θb^online(⪯⃗^O) = ⟨s, Θb(⪯⃗^O)⟩

Λ̃_Θb^online(s, Θb(⪯⃗^O), e) = ⟨s′, Θb(⪯⃗^{O∪{e}})⟩

For the Borda count aggregation, only the elements whose δ-Borda score is the same as or bigger than the current δ-Borda score of e have to be updated (increased by 1). Therefore, we have to check, for each preference relation ⪯i, i ∈ {1, ..., m}, the position of the new element e. For each ⪯i, whenever e gets the first place, we only need to update e's own m-dimensional vector, i.e., the i-th value changes from 0 to its Borda score in that relation. The vectors of the other elements remain the same. Whenever e gets the last place, we need to update the δ-Borda score of all other elements by adding 1 to their previous scores. Note that the relation between every pair of existing elements also remains the same in the updated relation ⪯i. Additionally, it is not necessary to update the state of e when it is the least preferred element, since we assume that the state of e is initially set to a vector of 0s. In case the element e gets the highest δ-Borda score in all preference relations, it turns out to be the new winner in the updated aggregated relation. Again, following monotonicity,


if e gets to be the winner, it leaves the ordering of the less preferred elements unchanged; therefore, to update the new aggregated relation, we only have to add e in the first place. We want to avoid making unnecessary comparisons as the naïve algorithm does, i.e., updating the preference relations by comparing all elements with each other. However, in order to know the position of the new element e in each of the relations, we need to do more comparisons than we did for plurality (where we only compared with the winners), since e can be placed anywhere. The online Borda count algorithm uses the divide-and-conquer method to recursively break down the search space. In the first step, it checks in ⪯i whether e is more or less preferred than the element placed in the middle, m. If e is less preferred than m (e ⪯i m), we already know that we need to update the δ-Borda score of all the elements equally or more preferred than m by adding 1 to their previous score. The algorithm continues the search considering now only the elements less preferred than m in ⪯i. If e is more preferred than m (m ⪯i e), then none of the states have to be updated and the algorithm continues to search for the position of e by considering only the elements more preferred than m in ⪯i. Whenever we find the position of e, we update its δ-Borda score and those of the remaining elements that are more preferred than e. Note that any preference relation ⪯i can have multiple elements placed in one position, but as we assume that all preference relations are total preorders (see Section 5.3), such elements must be equivalent in ⪯i. Therefore, we only have to check the value of one of the elements in each position. The optimized online Borda count algorithm is presented in Algorithm 9 and the search method is presented in Algorithm 10.

Algorithm 9 Borda Online Aggregator Θb - Optimized Algorithm

function BordaAggregationOpt(element, preference[])
    for j = 1 to m do
        positionOf(element, preference[j], 0, preference[j].size);
    end for
    return elementRank;
end function

Example 5.6.3 Let ⪯⃗^O be a preference profile on O as defined in Example 5.3.1. For the online Borda count aggregator we obtain:

ι^online_Θb(⪯⃗O) = ⟨⟨{1, 1, 0}, {0, 2, 1}, {2, 0, 2}⟩, ⪯⟩

and ∀x, x′ ∈ O, x ⪯ x′ if ∑_{i=1}^{m} pos(x)[i] ≤ ∑_{i=1}^{m} pos(x′)[i]; therefore ⪯ is defined via o1 ⪯ o2 ⪯ o3. Assume now that the new position of the element e ∈ X ∖ O is found in each of ⪯1, ⪯2, ⪯3, and the preference orders ⪯1, ⪯2, ⪯3 are updated to consider the element e as described in Algorithm 9. Then we have,

Λ̃^online_Θb(⟨⟨{1, 1, 0}, {0, 2, 1}, {2, 0, 2}⟩, ⪯⟩, e) = ⟨⟨{1, 1, 0}, {0, 2, 1}, {2, 0, 3}, {3, 3, 2}⟩, ⪯′⟩
where we update the states of those elements whose δ-Borda scores in ⪯ are equal to or greater than the current δ-Borda score of e for the same dimension; therefore ⪯′ is defined via o1 ⪯′ o2 ⪯′ o3 ⪯′ e.
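Spelling out the sums over the updated state above gives o1: 1 + 1 + 0 = 2, o2: 0 + 2 + 1 = 3, o3: 2 + 0 + 3 = 5, and e: 3 + 3 + 2 = 8. Since x ⪯′ x′ whenever the sum for x is at most the sum for x′, this indeed yields o1 ⪯′ o2 ⪯′ o3 ⪯′ e.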


Algorithm 10 Search for the Position of the New Element - Optimized Algorithm
function positionOf(element, preference, start, end)
    if preference.size = 0 then
        score = elementRank.get(element) + (elementRank.size() - start);
        elementRank.add(element, score);
        return
    end if
    index = start + (end - start) / 2;
    elements[] = getElement(preference, index);
    comp = compare(element, elements[0]);
    if comp = 0 then
        for i = start to (index - 1) do
            e = preference.getElement(i);
            score = elementRank.get(e) + 1;
            elementRank.add(e, score);
        end for
        score = elementRank.get(element) + (elementRank.size() - index);
        elementRank.add(element, score);
        return;
    end if
    if comp < 0 then
        for i = start to index do
            e = preference.getElement(i);
            score = elementRank.get(e) + 1;
            elementRank.add(e, score);
        end for
        return positionOf(element, preference, index + 1, end);
    else
        return positionOf(element, preference, start, index - 1);
    end if
end function
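For illustration, the following is a minimal Java sketch of the same divide-and-conquer idea; the class, method, and field names are illustrative assumptions and not the thesis prototype. It assumes that each provenance dimension is given as a Comparator, that each preference relation is kept as a list ordered from least to most preferred, and that ties are ignored for brevity; the binary search needs O(log n) comparisons per dimension before the affected δ-Borda scores are increased.

import java.util.*;

// Minimal sketch of the online Borda update; illustrative names, ties ignored.
public class OnlineBordaSketch<E> {
    private final List<Comparator<E>> dims;                     // the m provenance dimensions
    private final List<List<E>> relations = new ArrayList<>();  // per dimension: least -> most preferred
    private final Map<E, int[]> score = new HashMap<>();        // delta-Borda score vector per element

    public OnlineBordaSketch(List<Comparator<E>> dims) {
        this.dims = dims;
        for (int i = 0; i < dims.size(); i++) relations.add(new ArrayList<E>());
    }

    // Insert a new element e using O(log n) comparisons per dimension.
    public void add(E e) {
        int m = dims.size();
        score.put(e, new int[m]);                          // the state of e starts as a vector of 0s
        for (int i = 0; i < m; i++) {
            List<E> rel = relations.get(i);
            Comparator<E> cmp = dims.get(i);
            int lo = 0, hi = rel.size();                   // binary search for the position of e
            while (lo < hi) {
                int mid = (lo + hi) / 2;
                if (cmp.compare(e, rel.get(mid)) < 0) hi = mid; else lo = mid + 1;
            }
            score.get(e)[i] = lo;                          // e is preferred over the lo elements below it
            for (int j = lo; j < rel.size(); j++)          // every element at least as preferred as e
                score.get(rel.get(j))[i] += 1;             //   gains one defeated element
            rel.add(lo, e);
        }
    }

    // Aggregated relation: ascending sum of the per-dimension scores (least preferred first).
    public List<E> ranking() {
        List<E> all = new ArrayList<>(score.keySet());
        all.sort(Comparator.comparingInt((E x) -> Arrays.stream(score.get(x)).sum()));
        return all;
    }
}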

Proposition 5.6.3 ∆^online_Θb is a state-based Θb-compliant representation.

The Online Sequential Pairwise Aggregator

Finally, we propose an optimized algorithm for the online sequential pairwise aggregator (Θs). This aggregator satisfies the IIA property (see Section 5.5), which means that the preferences between the elements do not change by the insertion of e, i.e., e is irrelevant to the choice between any pair of elements. Since the insertion of a new element into the element set does not change the preference order between any two other elements, it follows that ⪯ is a projection of ⪯′ (see Corollary 5.5.1). Consequently, we can avoid re-computing the preference relations (as the naïve algorithm does), since it is given that the relation between the pairs of elements cannot change, and thus we can copy them. Nevertheless, we need to compute the preference relation between the new element e and all other elements. For the sequential pairwise aggregator, we describe a state by the function δ that maps each pair of elements x, x′ ∈ X to a value which describes the preference between x and x′. If x is more preferred than x′, then it wins the current comparison and δ(x, x′) = 1, otherwise δ(x, x′) = 0 (for an example see Example 5.3.4).


Algorithm 11 Online Sequential Pairwise Aggregator Θs - Optimized Algorithm
function SequentialPairwiseAggregationOpt(element, preference[], agenda)
    for j = 1 to m do
        lastElement = getElement(agenda, agenda.size);
        position = preference[j].getPosition(lastElement);
        comp = compare(element, lastElement);
        if comp = 0 then
            elementRank.add(element, position);
        else if comp > 0 then
            elementRank.add(element, position + 1);
        else
            elementRank.add(element, position - 1);
        end if
    end for
    return elementRank;
end function


Definition 5.6.4 Let O ⊆ X , e ∈ X ∖ O, Θs be the sequential pairwise preference aggregator, and S_X be the set of states, where a state maps each pair of elements to a real number. The online sequential pairwise preference aggregator ∆^online_Θs for Θs is a pair ∆^online_Θs = ⟨ι^online_Θs, Λ̃^online_Θs⟩ with ι^online_Θs ∶ R_X^m → S_X × R_X and Λ̃^online_Θs ∶ S_X × R_X × X → S_X × R_X defined via

ι^online_Θs(⪯⃗O) = ⟨s, Θs(⪯⃗O)⟩
Λ̃^online_Θs(⟨s, Θs(⪯⃗O)⟩, e) = ⟨s′, Θs(⪯⃗O∪{e})⟩

An important property of the sequential pairwise aggregator is that it uses a fixed agenda. An agenda is a pre-defined fixed order on the elements according to which the comparisons proceed. If an element is more preferred than the other, then it wins the current comparison. This element is taken to face the next element on the preordered agenda. The winner of the last pairwise comparison is the winner of the competition. As the aggregation results may depend on the order in which the elements are compared, for two different agendas we may get two different aggregated relations. Consequently, in order to preserve the aggregation ordering (IIA), when adding a new element e to the system, the order between the elements in the agenda cannot be changed. Therefore, we are only allowed to append e at the end of the agenda. For instance, in Example 5.3.4 the agenda is defined as {o1, o2, o3}; in order to ensure that o1 will be compared with o2 and the winner of that comparison with o3, the new element e cannot be placed first, because if it wins the comparison with o1, then the comparison o1 vs. o2 can no longer take place. The same phenomenon occurs when e is placed in between any two elements. This implies that, in order to update the aggregated relation, the optimized algorithm only needs to compare e with the last element in the agenda and update their δ-sequential pairwise scores. The optimized online sequential pairwise algorithm is presented in Algorithm 11.
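As a rough illustration of this update step, the following Java sketch appends the new element at the end of the agenda, decides the single duel against the last agenda element by a majority over the m preference dimensions, and places the new element directly above or below that element in the aggregated order. The duel rule and all names are assumptions, not the thesis prototype.

import java.util.*;

// Rough sketch of the online sequential pairwise update; the duel rule (majority
// over the m dimensions) and all names are assumptions, not the thesis prototype.
public class OnlineSequentialPairwiseSketch<E> {
    private final List<Comparator<E>> dims;                // the m preference dimensions
    private final List<E> agenda = new ArrayList<>();      // fixed comparison order
    private final List<E> aggregated = new ArrayList<>();  // least -> most preferred

    public OnlineSequentialPairwiseSketch(List<Comparator<E>> dims) { this.dims = dims; }

    public void add(E e) {
        if (agenda.isEmpty()) { agenda.add(e); aggregated.add(e); return; }
        E last = agenda.get(agenda.size() - 1);            // only the last agenda element is faced
        int wins = 0;
        for (Comparator<E> cmp : dims)                     // one comparison per dimension: O(m)
            if (cmp.compare(e, last) > 0) wins++;
        agenda.add(e);                                     // e may only be appended at the end
        int pos = aggregated.indexOf(last);
        if (wins * 2 > dims.size()) aggregated.add(pos + 1, e);  // e wins the duel
        else aggregated.add(pos, e);                             // e loses (or ties)
    }

    public List<E> ranking() { return Collections.unmodifiableList(aggregated); }
}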

Example 5.6.4 Let ⪯⃗O be the preference profile on O as defined in Example 5.3.1. For the online sequential pairwise aggregator we obtain:

ι^online_Θs(⪯⃗O) = ⟨⟨{0, 1}⟩, ⪯⟩
and ∀x1, x2 ∈ O, x1 ⪯ x2 if δ(x1, x2) < δ(x2, x1) ∪ Θs(x2, . . . , xn); therefore ⪯ is defined via o1 ⪯ o2 ⪯ o3. Assume now the preference orders ⪯1, ⪯2, ⪯3 are updated to consider the element e ∈ X ∖ O as described in Algorithm 11. Then we have,

Λ̃^online_Θs(⟨⟨{1, 0}⟩, ⪯⟩, e) = ⟨⟨{1, 0, 0}⟩, ⪯′⟩
where we update the state of the last element of the agenda and the state of e; therefore ⪯′ is defined via o1 ⪯′ o2 ⪯′ o3 ⪯′ e.

5.7. Complexity

Our main goal is to find a procedure that updates the current state to the next one, and thus updates ⪯ to ⪯′, more efficiently than recomputing the scores (δ) of the elements in O. As stated already in Theorem 5.6.1, the time complexity of the naïve algorithm ι^naive_Θ is O(n ∗ m) and Λ̃^naive_Θ has time complexity O(n + (n ∗ m)). The space complexity for storing a state in S^naive is O(n ∗ m). We first investigate the time complexity of the online plurality aggregator following the algorithm presented in Algorithm 8.

Theorem 5.7.1 ι^online_Θp has time complexity O(m). The space complexity for storing a state in S_p is O(m ∗ n).

Proof: (sketch) We assume that the position (winner or loser) of each element in each (not yet updated) preference relation is obtained in one step (O(1)), since this value is given in pos(x) (x ∈ X ). Given that the preference relations are total preorders (see Section 5.3), we only have to check the value of one of the winners, and therefore we avoid conducting multiple comparisons. For each preference relation ⪯i, i = 1, . . . , m, we compare e with the current winner x, i.e., we check whether x ⪯i e; this can be done in one step (O(1)). At each comparison, we update the state values of e if it becomes the winner. Since this comparison is done m times (once for each preference order), this operation has time complexity O(m). Once all m comparisons are done, the plurality scores of all elements are also updated, and the new aggregated relation is built. Consequently, the time complexity of the online plurality algorithm is O(m). ∎
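A minimal Java sketch of this idea (illustrative names, not the thesis's Algorithm 8) keeps, per dimension, only the current winner and a 0/1 score entry per element; inserting a new element then costs exactly one comparison per dimension, assuming ties are ignored:

import java.util.*;

// Minimal sketch of the online plurality update; illustrative names, ties ignored.
public class OnlinePluralitySketch<E> {
    private final List<Comparator<E>> dims;                // the m preference dimensions
    private final List<E> winner = new ArrayList<>();      // current winner per dimension
    private final Map<E, int[]> score = new HashMap<>();   // 0/1 plurality entry per dimension

    public OnlinePluralitySketch(List<Comparator<E>> dims) {
        this.dims = dims;
        for (int i = 0; i < dims.size(); i++) winner.add(null);
    }

    // Insert a new element e with exactly one comparison per dimension, i.e., O(m).
    public void add(E e) {
        score.put(e, new int[dims.size()]);                // the state of e starts as a vector of 0s
        for (int i = 0; i < dims.size(); i++) {
            E w = winner.get(i);
            if (w == null || dims.get(i).compare(e, w) > 0) {
                winner.set(i, e);                          // e becomes the new winner of dimension i
                score.get(e)[i] = 1;                       //   and takes its plurality point
                if (w != null) score.get(w)[i] = 0;        //   which the old winner loses
            }
        }
    }
}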

We now turn our investigation to the online Borda count aggregator.

Theorem 5.7.2 ι^online_Θb has time complexity O(m ∗ log n). The space complexity for storing a state in S_p is O(m ∗ n).

Proof: (sketch) We need to insert the new element e in the right position of each preference relation ⪯i, i = 1, . . . , m.


We assume that the position of each element in each (not yet updated) preference relation is obtained in one step (O(1)), since this value is given in pos(x) (x ∈ X ). Again, since the preference relations are total preorders (see Section 5.3), whenever we compare the element e with elements placed in the same position in ⪯i, we only have to check the value of one of the elements in this position, and therefore we can avoid conducting multiple comparisons. In order to find the position of the new element e in a preference relation ⪯i, e is first compared with the element x placed in the middle of ⪯i. If e is less preferred than x (e ⪯i x), it is then compared with the element in the middle of those that are less preferred than x in ⪯i. If e is more preferred than x, it is compared with the element in the middle of those that are more preferred than x in ⪯i. This process continues until we find the position where the new element should be inserted. The search algorithm can be seen as a recurrence that halves n with each comparison. The time complexity of this operation is O(log n). Since we have to find the position of the new element e in each preference relation ⪯i, i = 1, . . . , m, the time complexity of the online Borda algorithm is O(m ∗ log n). ∎

Finally, we investigate the time complexity of the online sequential pairwise aggregator.

Theorem 5.7.3 ι^online_Θs has time complexity O(m). The space complexity for storing a state in S_p is O(m ∗ n).

Proof: (sketch) Following Algorithm 11, the new element e is compared only with the last element of the agenda, x, in each preference relation ⪯i, i = 1, . . . , m. We assume that the position of each element in each (not yet updated) preference relation is obtained in one step (O(1)), since this value is given in pos(x) (x ∈ X ). The comparisons are done for all preference orders. After the m comparisons, the new aggregated relation is built. Therefore, the time complexity of the online sequential pairwise algorithm is O(m). ∎

To finish our analysis, and based on the results from Section 5.5, we can state the following propositions:

Proposition 5.7.1 If Θ satisfies IIA, then Λ̃^online_Θ has time complexity O(m) (see Proposition 5.5.1).

Proposition 5.7.2 If Θ satisfies MON and Proposition 5.5.2 holds, then Λ̃^online_Θ has time complexity O(m). If Θ satisfies MON and Proposition 5.5.3 holds, then Λ̃^online_Θ has time complexity O(m).

These propositions state that the online preference aggregator only has to compute the state of the new element e, and thus runs with time complexity O(m), in the following cases:

1. if Θ satisfies IIA, then the ordering of the elements in ⪯ remains unchanged in ⪯′ after the insertion of e,

2. if Θ satisfies MON and e is placed in the updated profile such that all elements are more preferred than e in every preference relation, then the ordering of the elements in ⪯ remains unchanged in ⪯′,

3. if Θ satisfies MON and e is placed in the updated profile such that all elements are less preferred than e in every preference relation, then the ordering of the elements is at least as good in ⪯′ as it was in ⪯.


Figure 5.2.: Borda and Plurality Aggregation Methods: execution time comparison between the standard and the online approaches. (a) Borda Count Aggregation; (b) Plurality Aggregation.

5.8. Evaluation

In this section, we evaluate the plurality and Borda online voting aggregation methods. We conducted a feasibility study to measure the execution time of the proposed algorithms and to compare it with the execution time of the standard approaches (baselines). The prototype of the optimized online aggregation algorithms (Online Voting Systems) has been implemented in Java 1.7.

5.8.1. Data

We have performed our experiments using four datasets (of different sizes) which have been created by randomly generating messages and their provenance values. We randomly assigned timestamp values (ranging from any time between Sept. 1st 2016 and Sept. 3rd 2016), degrees of trustworthiness (ranging from 1 to 10), and degrees of popularity (ranging from 1 to 10) to the messages. For this experiment we use a machine with 4GB RAM and a single Intel(R) Core(TM) CPU core at 1.70GHz.
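As an illustration of how such a synthetic dataset can be produced, the following Java sketch draws a random timestamp within the two-day window and uniform trustworthiness and popularity degrees between 1 and 10 for each message; the class and field names are assumptions, not the actual prototype.

import java.util.*;

// Illustrative generator for the synthetic messages; class and field names are
// assumptions, not the thesis prototype (Online Voting Systems).
public class MessageGenerator {
    static class Message {
        final long timestamp;        // seconds within Sept. 1st - Sept. 3rd 2016
        final int trust;             // 1..10
        final int popularity;        // 1..10
        Message(long timestamp, int trust, int popularity) {
            this.timestamp = timestamp; this.trust = trust; this.popularity = popularity;
        }
    }

    static List<Message> generate(int n, long seed) {
        Random rnd = new Random(seed);
        long start = 1472688000L;                 // 2016-09-01T00:00:00Z in Unix seconds
        long window = 2L * 24 * 60 * 60;          // a two-day window
        List<Message> messages = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            long ts = start + (long) (rnd.nextDouble() * window);
            messages.add(new Message(ts, 1 + rnd.nextInt(10), 1 + rnd.nextInt(10)));
        }
        return messages;
    }
}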

5.8.2. Methodology

We compare our optimized algorithms to the naïve (standard) approaches. We evaluate how efficiently they behave when preference orders are updated in the light of a newly introduced element. We want to demonstrate that our approach is fast enough to support real-time applications.

5.8.3. Discussion and Results

We consider four different datasets containing 1, 10, 100, and 1,000 messages, respectively. For each setting, we have measured the time needed to re-compute the message rankings based on the provenance dimensions after adding a new message to the system. We conducted 10 runs and report only the average runtime. Figure 5.2(a) and Figure 5.2(b) show the execution time of the Borda (standard vs. online) and plurality (standard vs. online) approaches.


The x-axis represents the number of messages considered in the system before updating the preference orders, and the algorithm runtime in nanoseconds is placed on the y-axis. As expected, the execution time of the standard approaches increases as the number of messages in the system increases. As stated already in Proposition 5.6.1, the expected time complexity of the naïve algorithm is O(n ∗ m), while the online plurality approach is O(m) and the online Borda approach is O(m ∗ log n). Note that for the online plurality, the execution time barely changes across all settings. The incontestably lower execution time of the online plurality therefore makes it feasible for real use case (streaming) scenarios. Again, as expected, the experiment confirms that the execution time of Borda aggregation decreases from polynomial to linearithmic time. Figure 5.3(a) and Figure 5.3(b) show how the execution time grows for the Borda (standard vs. online) and plurality (standard vs. online) approaches, respectively. Figure 5.3(a) shows that when re-computing the ranking for only 11 messages, the execution time of the standard algorithm is about 4 times longer, and with 111 messages, more than 5 times longer.

Figure 5.3.: The experiment confirms that the execution time of Borda aggregation decreases from polynomial to linearithmic time, and the execution time of plurality aggregation is linear. (a) Borda Count Aggregation; (b) Plurality Aggregation.

5.9. Related Work

Preference reasoning and preference aggregation are very active areas within artificial intelligence research, economics, and other fields. Changes to the set of elements, i.e., the successive elimination of candidates, are often used to make manipulation intractable to compute. The work [Davies et al., 2012] studies the relation between eliminating candidates and the computational complexity of manipulation. For many voting rules, such as the elimination versions of Borda and Plurality, adding elimination rounds can increase the computational complexity. Although manipulation with successive elimination of candidates also deals with dynamics on the domain, this setting considers flexible preference relations (the preference relation defined by the manipulator), which is not the case for the online setting. Some work deals with changes of the preference orderings instead of domain changes. In [Thimm, 2013], Thimm considers updates in settings for preference aggregation under preference changes. Thimm investigates how an aggregated preference order has to be updated when the input orders are dynamic, and shows that even for some simple aggregation rules, i.e., for the plurality and the Borda rule, the dynamic setting can be handled more efficiently than recomputing the aggregated preference order from scratch.

The work of [Maudet et al., 2012] deals with the influence of one agent's preferences on other agents' preferences. Preference relations might be changed through influence. They are mainly interested in the computational properties of this framework. The work [Can and Storcken, 2013] explicitly considers updates to input preference orders and their influence on aggregation. The authors introduce the property of update monotonicity for (static) preference aggregators as a novel property to assess the quality of an aggregator. An aggregator satisfies update monotonicity if, under changes of one input preference order towards the aggregated order, the aggregation does not change. However, [Can and Storcken, 2013] do not consider general updates and computational approaches.

Furthermore, there are approaches in belief revision that deal with dynamics of epistemic states given in the form of preorders. For example, the work [Booth et al., 2006] considers iterated belief revision based on enriched preference states. There, a preference state is basically a preference order on possible worlds that is revised upon newly received evidence. The work [Chomicki and Song, 2005] deals with revising a given preference relation with another (partial) one such that the former is modified in a minimal way to incorporate the latter. Although these works also deal with issues related to the temporal evolution of preference orders, they do not address the expansion of the set of elements in the setting.

In the realm of the Semantic Web, we compare our work to annotated RDF data in general, and to aggregation principles in particular. Based on semirings [Green et al., 2007], Buneman and Kostylev [Buneman and Kostylev, 2010] and Straccia et al. [Straccia et al., 2010] present an algebra for RDF provenances. Their approaches are for provenances in general, but they do not consider multiple dimensions simultaneously. Zimmermann et al. [Zimmermann et al., 2012] present a combination of multiple provenance dimensions. They combine two dimensions by a composition into one dimension, modeled as a compounded provenance dimension. An aggregation function maps provenance dimensions into a set of pairs of provenances. Kostylev et al. [Kostylev and Buneman, 2012] also consider the problem of combining various forms of provenance that, analogous to the previous discussion, map provenance dimensions into a set of vectors. They introduce restrictions to semirings in order to define containment and equivalence of combined provenance relations. The latter works differ from ours since we aggregate provenance dimensions using aggregation functions, which do not rely on the structure of the provenance dimensions and can be generalized to the aggregation of every ordered set.

Querying the Semantic Web with preferences is also considered in [Gueroussova et al., 2013, Siberski et al., 2006]. In [Gueroussova et al., 2013], Gueroussova et al. propose an extension to SPARQL 1.1 that supports the expression of preferred query results. Their language builds on established work on SQL preferences and on an earlier effort by Siberski et al. [Siberski et al., 2006]. Gueroussova et al. show that preference queries can be directly expressed in both SPARQL 1.0 and SPARQL 1.1 using OPTIONAL queries or novel features of SPARQL 1.1, such as NOT EXISTS. The authors have not implemented any aggregation, but they illustrate such a realization with respect to skyline and conditional preference queries. More related to the online setting are approaches for preference querying over data streams. Continuous top-k querying [Mouratidis et al., 2006] uses a preference function to determine the best k answers to a (relational) query. Continuous skyline querying approaches [Sarkas et al., 2008] consider the dominance relationship between tuples, where tuples are represented as points in a multi-dimensional space. These works do not focus on the aggregation, but are mainly interested in exploring different data structures to improve performance and decrease memory consumption of preference querying. Furthermore, ranking in streaming data is an emerging problem when querying large data collections such as the Web, and it is therefore considered for RDF querying and reasoning. For instance, SPARQL extensions allow ranking query results on streams [Barbieri et al., 2010a, Bolles et al., 2008], and general reasoning frameworks incorporate the management of streaming data [Barbieri et al., 2010b]. However, ranking with aggregation of multiple provenance dimensions has not been studied for streaming RDF data so far. From the ranking perspective, several approaches also consider the problem of combining several dimensions of ranking criteria [Bruno et al., 2002, Kießling, 2002, Xin et al., 2006, Hristidis et al., 2001]. Preferences are specified in terms of partial orders on attributes, and they can be accumulated and aggregated to build complex preference orders. In [Bruno et al., 2002], the general idea is to rank query results when there is no exact match but some results are similar to the query. They compute the distance of the attribute values of the relation with respect to the query attributes. In [Hristidis et al., 2001], linear sums of attributes are used to rank preferences (assigned to attributes). Likewise, Li et al. [Li et al., 2005] present top-k ranking over different dimensions for relational databases. Compared to our work, none of them considers the ranking of semi-structured data like RDF, and their focus is not on ranking with respect to the provenances of the data.

5.10. Findings and Research Contribution

In this chapter, we have proposed three algorithms that provide the most relevant news feeds according to the user's preferences. Our algorithms rank messages 'on the fly' as they pass through the system. We have introduced the concept of online preference aggregators and investigated the consequences of adding a new element to the preference orders. We have studied the relationships between the original aggregated preference order and the updated aggregated preference order, and established a framework for computational approaches to online preference aggregation. Concrete online aggregation algorithms and a complexity analysis have been presented for the plurality, Borda count, and sequential pairwise voting methods. Our complexity analysis has shown that the online aggregator performs better than the original aggregators after the domain changes. The computation of the original aggregators has time complexity O(n ∗ m), and we have shown that the online plurality aggregator has time complexity O(m), the online Borda count aggregator O(m ∗ log n), and the sequential pairwise aggregator O(m). We conducted a feasibility study to measure the execution time of the proposed algorithms. Our experiments show that our approach runs considerably faster than the standard ones and is therefore more suitable for use in real-world applications.


6. Managing Data Changes in Linked Open Data Sources

Overview

The Linked Open Data (LOD) cloud changes frequently. Quite often, LOD applications pre-fetch data from the Web and store local copies of it in a cache for faster access at runtime. This chapter presents two techniques for dealing with data dynamics in LOD sources using provenance information. The first one describes an investigation of the availability and conformance of provenance information for detecting changes of LOD sources. Given the HTTP basis recommended in the Linked Data guidelines, the natural way of detecting changes would be to use HTTP header information, such as the Last-Modified field, as it describes when the resource was last changed. For LOD applications that operate on local caches of Linked Data, a simple check on this provenance could support the decision process of determining which sources need to be updated. The second part of this chapter formalizes a strategy for capturing data changes based on a time-dependence measure that captures the frequency, degree, and regularity of the changes of the LOD sources.

Structure

The structure of this chapter is as follows: Section 6.3.1 presents a motivation scenario that emphasizes the importance of knowledge about data changes, and especially about the change behavior of a LOD data source over time, for the purpose of data caching. Then, this chapter describes two different techniques to capture data changes of the LOD sources: (1) Section 6.3 describes an investigation of the availability and conformance of provenance information for detecting changes of LOD sources, and (2) Section 6.4 formalizes and evaluates a strategy for capturing data changes based on a time-dependence measure that captures the frequency, degree, and regularity of the changes of the LOD sources. The structure of Section 6.3 and Section 6.4 is described in the subsequent sections.


6.1. Introduction

The Linked Open Data (LOD) cloud is a global information space which structurally represents and connects data items. The LOD principles provide a flexible publishing paradigm to integrate and interlink any kind of data from arbitrary datasets published by various data providers (or data sources) on the Web. Since the Linked Open Data (LOD) principles were created, the LOD cloud has grown significantly. Recent investigations [Schmachtenberg et al., 2014, Auer et al., 2012, Hausenblas et al., 2009, Alexander and Hausenblas, 2009, Dividino et al., 2014a, Dividino et al., 2013, Hartig, 2011, Umbrich et al., 2012, Dehghanzadeh et al., 2014] have shown that data published and interlinked on the LOD cloud is subject to frequent changes. As the data in the cloud changes, caches or indices built over this data no longer reflect its current state and need to be updated. Käfer et al. [Käfer et al., 2013] observed a subset of the LOD cloud over a period of 29 weeks and concluded (among other things) that the data of 49.1% of the LOD sources changes. Likewise, Gottron et al. [Gottron and Gottron, 2014] observed LOD data over a period of 77 weeks and described that the accuracy of indices built over the LOD sources drops by 50% after only 10 weeks. These outcomes indicate that almost half of the LOD sources are not appropriate for long-term caching or indexing. Unquestionably, data on the LOD cloud changes, and knowledge about these changes, i.e., about the change behavior of a dataset over time, is important as it affects various LOD applications, e.g., data caching [Umbrich et al., 2012], indexing of distributed data sources [Konrath et al., 2012], searching in large graph databases [Gottron et al., 2013b], optimizing the execution of queries [Neumann and Moerkotte, 2011], and recommending appropriate vocabularies to Linked Data engineers [Schaible et al., 2013]. In particular, the distributed, Web-based nature of the data motivates many applications to keep local copies of the data. Mainly, data is fetched live from the Web only in those cases where the data is missing or known to be highly dynamic [Umbrich et al., 2010, Umbrich et al., 2012]. However, local copies of LOD data sources still need to be updated from time to time. Thus, the question is when to perform such an update. LOD applications that operate on local caches of Linked Data need to be aware of these changes. In this way they can update their cache to ensure operating on the most recent version of the data. However, due to limitations of the available computational resources (e.g., network bandwidth for fetching data, and computation time), LOD applications may not be able to permanently visit all of the LOD sources at brief intervals in order to check for changes. With the increasing amount of data available on the cloud, LOD applications encounter challenging questions when conducting updates of local caches of LOD sources. In this chapter, we exploit two different techniques to capture data changes of the LOD sources:

Provenance Information for Detecting Changes of Linked Open Data Sources

The first part of this chapter discusses the use of provenance within LOD resources. Provenance information about the owner, creation date, and last modification can be used to verify (to a certain degree) the quality and the level of trustworthiness of such information. Additionally, the modification date denotes when the resource was last changed. A simple check on this provenance can support the decision process of determining which sources need to be updated.


Given the HTTP basis recommended in the Linked Data guidelines, the most intuitive way of detecting changes would be to exploit the HTTP header information provided on the Web. According to the Linked Data guidelines, resources should be modeled using dereferenceable HTTP Uniform Resource Identifiers (URIs). Whenever a client application invokes an HTTP request to a server (i.e., a SPARQL endpoint) for a particular URI on the LOD cloud, the server should respond by providing useful information about the resource represented by this URI. Naturally, this response will make use of the HTTP protocol itself. The HTTP header of this response can contain provenance information about the resource (e.g., owner, creation date, etc.) [Fielding et al., 1999]. Provenance information about the owner, creation date, and modification date can be used to verify (to a certain degree) the quality and the level of trustworthiness of such information. Additionally, there is a field which can denote when the resource behind this URI was last changed. In combination with an HTTP header response, this Last-Modified field is intended for probing a resource for whether or not it has been changed since its inclusion in the cache of a Web or Linked Data application.

Nevertheless, even with the existing W3C specifications which define rules and conditions to be followed by LOD servers, the information contained in the HTTP headers may in practice be inaccurate or wrong [Rula et al., 2012]. Therefore, applications relying on such information are susceptible to drawing wrong conclusions. In this work, we empirically evaluate the conformance of time-related HTTP header metadata information on the LOD cloud. In particular, we check for the conformance of the Last-Modified field. Knowledge about the reliability of this field is important for applications which intend to make use of it. To this end, in this chapter we analyze a large-scale dataset that is obtained from the LOD cloud by weekly crawls in the period between May 2012 and January 2014. The dataset contains 84 snapshots. For each pair of subsequent snapshots, we check for changes in the data and compare the observations to the information provided by the Last-Modified HTTP header field. Using the results of our experiments, we discuss the benefits of the availability and conformance of the HTTP header fields in real-world scenarios.

Capturing the Dynamics of Linked Open Data Sources

LOD applications relying on data from the LOD cloud need to cope with constant data updates to be able to guarantee a certain level of quality of service. In an ideal setting, a cache or an index is kept up-to-date by continuously visiting all data sources, fetching the most recent version of the data and synchronizing the local copies with it. However, in real-world scenarios, LOD applications must deal with limitations of the available computational resources (e.g., network bandwidth, computation time) when fetching data from the LOD cloud. These limitations imply the necessity to prioritize which data sources should be considered first for retrieving their data. In order to make best use of the resources available, it is vital to choose a good scheduling strategy for updating local copies of the LOD data sources. While there exists research on the freshness analysis of cached data for answering SPARQL queries in a hybrid approach, such as that of Umbrich et al. [Umbrich et al., 2012], to the best of our knowledge there is no work addressing strategies to efficiently keep local copies of LOD sources up-to-date.


Intuitively, a strategy dedicated to updating data caches built out of data from the LOD cloud would make use of the HTTP protocol. The Last-Modified HTTP header field denotes when the LOD source behind a URI was last changed. However, as we show in Section 6.3.3, only very few LOD sources (on average only 8%) provide correct update values. Consequently, applications relying on such information are susceptible to drawing wrong conclusions. Thus, this method is inappropriate for probing a LOD source for whether or not it has changed since the last retrieval of its data. The only alternative is to actually retrieve the data from the sources and check it for changes. Scheduling strategies aim at deriving an order over data sources and suggesting when they should be visited. Consequently, the application updates its local copy by fetching data from the data sources following this order. The simplest strategy is to visit the data sources in an arbitrary but fixed order, which guarantees that the local copy of every data source is updated after a constant interval of time. Alternative strategies explore the different features provided by the data sources, e.g., their size, to assign an importance score to each data source, and thus derive an order. An established scheduling strategy for the Web leverages the PageRank algorithm [Page et al., 1999], where a score of importance is given to each data source regarding its centrality in the link network with other data sources. When considering a set of LOD sources, certainly some of them change more or less often than others [Käfer et al., 2013]. For example, it is not likely that every LOD source changes within a short time interval. Thus, many sources may provide the same information during this entire interval. Therefore, it is not necessary to fetch data from such sources. However, whenever the data of a source changes, an update is required. Accordingly, some sources should be fetched at shorter or longer time intervals. This implies that each LOD source could be given a different update importance, which is based on its change behavior. The change behaviour or dynamics of a dataset involves a notion of how fluid a dataset is, i.e., how it behaves and evolves over a certain period of time. In the context of this work, we understand a period of time to be a continuous time interval beginning at an initial point in time up to a final one. Due to this time-dependence, a measure for dynamics should capture the frequency, degree, and regularity of the changes of the data. In this chapter, we present a formal notion of dynamics for LOD datasets. We define dynamics as an aggregation of changes, built on top of contemporary change metrics. We further extend this notion to incorporate the use of different decay functions for stressing or weakening periods within a time interval. We evaluate the effectiveness of the dynamics function for conducting updates of local copies of LOD sources, and of different scheduling strategies for maintaining indices of web documents, on a large-scale LOD dataset from the Dynamic Linked Data Observatory (DyLDO) [Käfer et al., 2013] that is obtained via 149 weekly crawls in the period from May 2012 until March 2015.
We investigate two different setups: (i) in the single-step setup we evaluate the quality of update strategies for a single and isolated update of a local data cache, while (ii) the iterative progression setup involves measuring the quality of the local data cache when considering iterative updates over a longer period of time. Quality is measured in terms of precision and recall with respect to the gold standard, i.e., we check the correctness of the data of the (updated) local copy with respect to the data actually contained in the LOD cloud.


We assume that only a certain bandwidth for fetching data from the cloud is available, and we investigate the effectiveness of each strategy for different bandwidths. Therefore, in the first setup, we can observe the relation between strategies and restrictions of bandwidth (i.e., whether the strategies show comparatively uniform performance over all restrictions or whether better or worse performance depends on a given restriction), and use such findings as parameters for the second setup. The second setup evaluates the behavior of the strategies in a realistic scenario (e.g., a LOD search engine updating its caches). Our evaluation indicates the most effective strategies for updating local copies of LOD sources, i.e., we demonstrate, for given restrictions of bandwidth, which strategy performs better in terms of data accuracy and freshness.
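To make the intuition behind a decay-weighted dynamics measure concrete, the following Java sketch aggregates per-interval change values of a source, weighting each interval by a decay function of its age. It is an illustrative simplification under assumed names, not the formal definition given in Section 6.4.

import java.util.function.DoubleUnaryOperator;

// Illustrative simplification of a decay-weighted dynamics measure; the formal
// definition used in this thesis is given in Section 6.4.
public class DynamicsSketch {
    // changes[k] is the change value of a source between snapshots k and k+1
    // (e.g., the normalized size of the set difference of their triples).
    static double dynamics(double[] changes, DoubleUnaryOperator decay) {
        double sum = 0.0;
        int last = changes.length - 1;
        for (int k = 0; k <= last; k++) {
            double age = last - k;                // 0 for the most recent interval
            sum += decay.applyAsDouble(age) * changes[k];
        }
        return sum;
    }

    public static void main(String[] args) {
        double[] changes = {0.1, 0.0, 0.4, 0.2};
        double lambda = 0.5;                      // assumed decay rate
        System.out.println(dynamics(changes, age -> Math.exp(-lambda * age)));
    }
}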

6.2. Motivation Scenario

As an example, suppose we are maintaining local copies of data which comes from three different data sources: dbpedia.org, bbc.co.uk, and musicbrainz.com. Suppose our local copy was last updated on May 8th, 2015, and we want to update it again on June 10th, 2015. Usually, we would update our cache by visiting all three data sources, fetching the most recent version of the data and synchronizing the local copies with it. However, in real-world scenarios LOD applications must deal with limitations of the available computational resources (e.g., network bandwidth, and computation time). With such limitations in mind, suppose our network bandwidth enables only K = 12,000 triples to be fetched per time slice. Under such constraints, suppose we cannot fetch data from all the data sources. These limitations imply the necessity to prioritize which data sources should be considered first for retrieving their data. Ideally, we want to update first the data sources whose data has definitively changed since the last update. Therefore, we need a strategy which enables us to identify these sources, and thus to conduct such updates efficiently. In the next sections we introduce two different techniques for update strategies, coupled with follow-up events of our motivation scenario.
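A simple way to picture such a prioritization under a bandwidth budget is sketched below in Java; the importance scores and all names are hypothetical, and the actual update strategies are developed in the remainder of this chapter. Sources are visited in descending order of an importance score until the budget of K triples is exhausted.

import java.util.*;

// Sketch of budget-constrained cache updating; the importance scores and all
// names are hypothetical, the actual update strategies are discussed later.
public class UpdateScheduler {
    static class Source {
        final String name; final long size; final double score;
        Source(String name, long size, double score) {
            this.name = name; this.size = size; this.score = score;
        }
    }

    // Visit sources in descending score order until the budget of K triples is used up.
    static List<Source> selectForUpdate(List<Source> sources, long budgetK) {
        List<Source> ranked = new ArrayList<>(sources);
        ranked.sort((a, b) -> Double.compare(b.score, a.score));
        List<Source> selected = new ArrayList<>();
        long used = 0;
        for (Source s : ranked) {
            if (used + s.size > budgetK) continue;   // skip sources that do not fit the budget
            selected.add(s);
            used += s.size;
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Source> sources = Arrays.asList(
                new Source("dbpedia.org", 9000, 0.8),
                new Source("bbc.co.uk", 5000, 0.6),
                new Source("musicbrainz.com", 4000, 0.9));
        for (Source s : selectForUpdate(sources, 12000)) System.out.println(s.name);
    }
}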


6.3. An Investigation of Provenance Information for Detecting Changes of Linked Open Data Sources

Overview

Data on the Linked Open Data (LOD) cloud changes frequently. LOD applications that operate on local caches of Linked Data need to be aware of these changes. In this way they can update their cache to ensure operating on the most recent version of the data. Given the HTTP basis recommended in the Linked Data guidelines, the natural way of detecting changes would be to use HTTP header information, such as the Last-Modified field. The Last-Modified field is provenance-like information which is intended for probing a resource to determine whether or not it has been changed since its inclusion in the cache of a Web or Linked Data application. However, it is uncertain to which degree this field is currently supported on the LOD cloud and how reliable the provided information is. In this chapter, we check for the availability and conforming use of the time information in the HTTP header of Linked Data sources. We analyze a large-scale dataset obtained from the LOD cloud by weekly crawls over almost two years. In these weekly snapshots, we check if the HTTP header field Last-Modified is available and if the date provided for the last modification aligns with the observed changes in the data itself. Finally, we discuss the benefits of the availability and conformance of the HTTP header fields in a real-world scenario.

Structure

The structure of Section 6.3 is as follows: Section 6.3.1 presents a motivation scenario that emphasizes the need for and use of provenance for the purpose of data caching. Section 6.3.2 introduces the provenance information included in the HTTP header of an HTTP response typically obtained when dereferencing or retrieving resources from the LOD cloud. We evaluate the conformance of LOD data sources in providing a valid and correct Last-Modified HTTP header field, which indicates the date and time at which the resource was last modified, in Section 6.3.3. Section 6.3.4 reviews existing works for providing and using provenance in the LOD cloud. Section 6.3.5 then concludes and summarizes the most indicative experiment results.

Research Questions

This chapter addresses the following research question:

RQ 2.II Can the provenance delivered by Linked Data servers support applications for detecting changes of the resources in a LOD source?

As Web data evolves over time, that is, RDF graphs change as information is added and removed, local copies of Web sources need to be updated from time to time to ensure the quality or freshness of such copies. Mainly, the naïve approach for detecting changes of a resource in a LOD source consists of downloading two arbitrary-size RDF descriptions of that resource and further comparing them [Käfer et al., 2013, Völkel and Groza, 2006, Zeginis et al., 2011]. As these descriptions can be of significant size, Linked Data applications would benefit if Linked Data servers could provide provenance information. A simple check on this provenance could support our decision process of determining which sources need to be updated.


However, it is crucial that Linked Data servers provide correct and valid provenance values; otherwise, such information is of no use in any practical application.

6.3.1. Motivation Scenario

Following up on our motivation scenario described in Section 6.2, as a matter of example, we illustrate a request made by a browser (or the user behind the browser) to the DBpedia server¹ for the RDF description representing the state of California. As presented in Section 2.2.4, one of the fundamental LOD principles is that every resource (representing an entity in the real world) is uniquely identified and referenced by a Uniform Resource Identifier (URI). A Semantic Web application or a browser uses mechanisms of the HTTP protocol to look up information about those URIs (called dereferencing). First, the user sends an HTTP request for http://dbpedia.org/resource/California to the server. The HTTP request includes a textual header that specifies the format (here, an RDF notation) in which the user wants to receive the description of the resource, and should look like this:

Example 6.3.1 (Excerpt of an HTTP request for http://dbpedia.org/resource/California)

GET /resource/California HTTP/1.1
Host: dbpedia.org
Accept: text/html;q=0.5, application/rdf+xml
...

The 'Accept' header indicates that the user accepts either HTML or RDF, but prefers RDF. This preference is indicated by the quality value q=0.5 for HTML. This process is called content negotiation. As is generally held to be good practice when publishing Linked Data, DBpedia uses different URIs to distinguish between the identity of the resource itself and its descriptions (which can come in different formats). Therefore, in response to the GET request shown in Example 6.3.1 we get a redirect response which tells the user that the description of the requested resource, in the requested format, can be found at another URI, http://dbpedia.org/data/California.rdf.

Example 6.3.2 (Excerpt of an HTTP redirect response to http://dbpedia.org/data/California)

HTTP/1.1 303 See Other
Location: http://dbpedia.org/data/California.rdf
Vary: Accept
...

Next, the user has to make a new GET request to look up the new URI delivered by the server. The server then selects the best match from its file system or generates the desired content on demand and sends it back to the client. The answer to this GET request could be, for example:

¹ dbpedia.org is a community effort to extract structured information from periodic Wikipedia dumps and to make this information available on the Web. It is served to the public via a live instance of OpenLink Virtuoso, and also offers Faceted Browsing and Exploration.


Example 6.3.3 (Excerpt of an HTTP response to http://dbpedia.org/data/California)

HTTP/1.1 200 OK
Content-Type: application/rdf+xml
Content-Language: en
Content-Location: http://dbpedia.org/data/California.rdf
...

This response header tells the user that the response contains the representation of the information resource in the requested RDF/XML format; the rest of the message carries this representation in RDF/XML notation. This description can be of significant size; in this particular case it weighs about two megabytes (1,708 KB). The HTTP response header is composed of a set of fields, not all of which are illustrated in this example. Without a doubt, the HTTP response header delivers valuable provenance information about the availability, format, and location of the requested resources. In the following, we explore the header field Last-Modified, which indicates when the requested resources were last modified, for the purpose of performing updates of local caches of LOD sources.
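For illustration, the following Java sketch dereferences the DBpedia URI from the examples above with content negotiation and prints the provenance-related header fields; it uses the standard HttpURLConnection API, omits error handling, and is not part of the evaluation tooling described later.

import java.net.HttpURLConnection;
import java.net.URL;
import java.util.Date;

// Minimal sketch of dereferencing a LOD resource with content negotiation and
// reading provenance-related header fields; illustrative only, no error handling.
public class DereferenceSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://dbpedia.org/resource/California");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestMethod("GET");
        con.setRequestProperty("Accept", "text/html;q=0.5, application/rdf+xml"); // prefer RDF
        con.setInstanceFollowRedirects(true);       // follow the 303 See Other redirect

        System.out.println("Status: " + con.getResponseCode());
        System.out.println("Content-Type: " + con.getHeaderField("Content-Type"));
        long lastModified = con.getLastModified();  // 0 if the Last-Modified field is absent
        System.out.println("Last-Modified: "
                + (lastModified == 0 ? "not provided" : new Date(lastModified)));
        con.disconnect();
    }
}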

6.3.2. Foundations: The HTTP Header

The LOD cloud is composed of various data servers which enable data access via the HTTP protocol. A client application invokes an HTTP request to a server by, for instance, sending an HTTP GET message for a particular URI. The Linked Data server responds via an HTTP response message. Mainly, both HTTP request and response messages consist of headers (zero or more header fields) and possibly a message-body. A message-body is used to carry the entity-body, i.e., the description of the represented resource, ideally using RDF as the data format. The first line of an HTTP response message is the Status-Line, consisting of the protocol version followed by a numeric status code and its associated textual phrase. The Status-Code element is a 3-digit integer result code of the attempt to understand and satisfy the request. The first digit of the Status-Code defines the class of the response:

1xx Informational - Request received, continuing process

2xx Success - The action was successfully received, understood, and accepted

3xx Redirection - Further action must be taken in order to complete the request

4xx Client Error - The request contains bad syntax or cannot be fulfilled

5xx Server Error - The server failed to fulfill an apparently valid request

Examples of the numeric status codes defined for HTTP/1.1 and their associated textual phrases are: "202" - Accepted, "301" - Moved Permanently, "400" - Bad Request, and "505" - HTTP Version not supported. The header fields allow the server to pass additional information about the response which cannot be placed in the Status-Line. These header fields give information about the server, provide further access to the resource identified by the Request-URI, and define metainformation about the entity-body. Some of the header fields are:

144 6.3. An Investigation of Provenance Information for Detecting Changes of Linked Open Data Sources

1. Content-Language describes the natural language(s) of the intended audience for the enclosed entity,

2. Content-Length indicates the size of the entity-body, in decimal number of OCTETs,

3. Content-Type indicates the media type of the resource sent,

4. Content-MD5 provides an end-to-end message integrity check of the entity-body,

5. Date represents the date and time at which the message was originated,

6. Expires gives the date/time after which the response is considered stale,

7. Last-Modified indicates the date and time at which the origin server believes the variant was last modified, and

8. Server contains information about the software used by the origin server to handle the request.

For more detailed information about HTTP messages, please refer to the Hypertext Transfer Protocol – HTTP/1.1 Specification [R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, T. Berners-Lee, 1999].

In this work, we focus on the analysis of the header field Last-Modified. This field is intended to carry the date when the requested objects were modified last. In the context of Linked Data, this should correspond to the most recent date at which some part of the resource's RDF description has changed. Following the HTTP/1.1 specification [Fielding et al., 1999], the Last-Modified value must not be later than the time of the server's response message. In such cases, where the resource's last modification would indicate some time in the future, it is to be considered invalid. Furthermore, the server should obtain the Last-Modified value as close as possible to the time that it generates the Date value of its response. This allows a recipient to make an accurate assessment of the entity's modification time, especially if the entity changes near the time that the response is generated. Finally, HTTP/1.1 servers should send a Last-Modified value whenever feasible.
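As a small illustration of the validity rule just described, the following Java sketch (illustrative only, not the evaluation code) treats a Last-Modified value as invalid if it lies after the server's Date header, falling back to the local clock when no Date is provided.

import java.net.HttpURLConnection;
import java.net.URL;

// Illustrative check of the validity rule for Last-Modified (not the evaluation code):
// a modification time after the server's Date header is treated as invalid.
public class LastModifiedValidator {
    static boolean hasValidLastModified(HttpURLConnection con) {
        long lastModified = con.getLastModified();   // 0 if the field is absent
        if (lastModified == 0) return false;         // absent: nothing to validate
        long date = con.getDate();                   // server's Date header, 0 if absent
        long reference = (date != 0) ? date : System.currentTimeMillis();
        return lastModified <= reference;            // must not lie in the future
    }

    public static void main(String[] args) throws Exception {
        URL url = new URL("http://dbpedia.org/data/California.rdf");
        HttpURLConnection con = (HttpURLConnection) url.openConnection();
        con.setRequestProperty("Accept", "application/rdf+xml");
        System.out.println("valid Last-Modified: " + hasValidLastModified(con));
        con.disconnect();
    }
}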

6.3.3. Evaluation of the Conformance of the Last-Modified HTTP Header Field

In this section, we evaluate the conformance of LOD data sources in providing a valid and correct Last-Modified HTTP header field, which indicates the date and time at which the resource was last modified.

Data

We work with data from the Dynamic Linked Data Observatory (DyLDO) [Käfer et al., 2013]. The DyLDO dataset has been created to monitor a fixed set of Linked Data documents (and their neighborhood) on a weekly basis. For the sake of consistency, we use only the kernel seed documents of DyLDO. Our test dataset is composed of 84 snapshots corresponding to a period of almost two years (from May 2012 until January 2014). Furthermore, the DyLDO dataset contains (parts of) various well-known and large LOD sources, e.g., dbpedia.org, musicbrainz.com, and bbc.co.uk. For more detailed information about the DyLDO dataset, we refer the reader to [Käfer et al., 2013].


Figure 6.1.: Ratio of the Linked Data resources (in percentage) that provide the field Last-Modified in their header (in green).

Methodology

The main goal of our experiments is to measure how often the Last-Modified field in the HTTP header of LOD resources is available and how often it is used correctly. We consider the use to be correct if this field returns a date and time which does not contradict our observations of when the data of a resource last changed; only then can the Last-Modified field of the HTTP header be used by client applications to verify whether any change event has occurred for this resource, or even in the LOD cloud.

Discussion and Results

Each version of the DyLDO dataset consists of a set of RDF triples retrieved from different LOD sources. Furthermore, the data also provides information about the HTTP headers received when retrieving the data. From the 84 snapshots available in the dataset, we took each pair of subsequent snapshots from the same data source and computed the set difference over their sets of triples. If we observe a difference, we consider the data to have changed; otherwise we treat it as unchanged. This means that if two snapshots have the same set of triples, they are equivalent and no changes occurred. If not, i.e., a deletion or addition of triples has occurred, we consider the data changed. A change should be reflected by a Last-Modified date which lies in the time range between the two snapshots of the data. Figure 6.1 illustrates that on average only 15% of the resources actually provide some value for the Last-Modified field in the HTTP header. Towards the end of the time period covered by the dataset, we observe that the number of resources providing this value grows slightly. Subsequently, we checked, for those resources which provide a value for the Last-Modified field, how many of them return a correct or an incorrect value (see Figure 6.2).


Figure 6.2.: Ratio of the resources (in percentage) providing a Last-Modified that is correct (in green), or incorrect or invalid (in red).

As mentioned above, correct values are the ones where the last-modified date aligns with actual changes in the RDF data. Incorrect values include (1) values that indicate changes although no change has been observed, (2) values that indicate no changes although changes have been observed, and (3) invalid values. Invalid values are the ones which indicate a time in the future relative to the time of the HTTP response, or which indicate a time of last modification which actually precedes the time at which the resource was created. On average, we observe that only 52% of the resources which provide a value for the Last-Modified field also provide a correct value for it. The slight growth of both ratios in Figure 6.2 towards the end of the time period covered by the dataset is an artifact caused by data sources going offline, i.e., not responding or providing a 400 or 500 status code as response. It seems that more data sources providing no or wrong Last-Modified results went offline during the covered time span. Our experiment shows that overall and on average only 8% of the resources in the datasets provide correct values for this field. This number is far too low to be of use for any practical application. It is, however, not clear why LOD sources do not provide valid information. We conjecture that some default configuration of LOD servers leads to this misbehavior.
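The conformance check described above can be summarized by the following Java sketch; the types and names are illustrative, and the creation-date part of the invalidity rule is omitted for brevity. Two snapshots of a source are compared by set difference over their triples, and the Last-Modified value is then classified as correct, incorrect, or invalid.

import java.util.Date;
import java.util.Set;

// Sketch of the conformance check; types and names are illustrative, and the
// "precedes the creation date" part of the invalidity rule is omitted.
public class ConformanceCheck {
    enum Verdict { CORRECT, INCORRECT, INVALID, NOT_PROVIDED }

    static Verdict classify(Set<String> oldTriples, Date oldCrawl,
                            Set<String> newTriples, Date newCrawl,
                            Date lastModified, Date responseDate) {
        if (lastModified == null) return Verdict.NOT_PROVIDED;
        if (responseDate != null && lastModified.after(responseDate))
            return Verdict.INVALID;                            // lies in the future
        boolean changed = !oldTriples.equals(newTriples);      // any addition or deletion
        boolean reportedChange = lastModified.after(oldCrawl)  // reported change falls
                && !lastModified.after(newCrawl);              //   between the two crawls
        return (changed == reportedChange) ? Verdict.CORRECT : Verdict.INCORRECT;
    }
}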

6.3.4. Related Work

Similar to our study, Kjernsmo [Kjernsmo, 2015] investigates many of the HTTP caching implementations (i.e., LOD servers) and examines the availability and conformance of the HTTP headers that may allow caching. Even though based on different datasets than our study, the authors conclude that the headers do not reflect the standards-compliant freshness lifetimes advertised by servers, and therefore they are not helpful for applications relying on caching. Besides HTTP headers, there exist several vocabularies, methods, and tools that can be used to describe or summarize LOD datasets on different levels.

147 6. Managing Data Changes in Linked Open Data Sources used to describe or summarize LOD datasets on different levels. Basically, these levels are characterized by (1) information about the dataset (, updates, access points, licensing), (2) statistical information about the data types and patterns in the dataset, and (3) information about the topic or domain described in the dataset. Alexander et al. [Alexander and Hausenblas, 2009, Keith Alexander and Richard Cyganiak and Michael Hausenblas and Jun Zhao, 2011] designed the Vocabulary of Interlinked Datasets (VoID). Along with other vocabularies such as Dublin Core, FOAF, etc.), VoID contains a number of useful and recommended properties for providing basic metadata about RDF datasets. VoID is used to formally describe linked RDF datasets, and it allows us to discover datasets over which queries are distributed over. Mainly, the VoID intends to help users judge the appropriateness of the dataset for their purposes. Similar to the HTTP Response header Last-modified field, VoID provides the Dublin Core term dcterms:modified which specifies the date on which the dataset was changed. There are two reasons why we choose to investigate the conformance of the Last-modified field instead of the dcterms:modified : (1) there are numerous sources in today’s LOD cloud that do not provide any VoID description and, when VoID description is available, it is often incomplete or represents only a part of the entire contents of a dataset [Jentzsch et al., 2011], (2) it is recommended that the dcterms:modified date should correspond to the HTTP 1.1 Last-modified date [Croome, 2000]. Another vocabulary for describing a dataset is the Data Catalog Vocabulary (DCAT) [Maali and Erickson, 2014]. DCAT is an RDF schema vocabulary designed to facilitate interoperability between datasets published on the Web. In [B¨ohmet al., 2011] the authors propose metainformation generation algorithms which automatically generate VoID description for RDF datasets deduced by the resources in the dataset. The authors also introduced new ideas extending the current scope of VoID. make- void2 is a different tool for generating VoID statistics for RDF files. RDFStats [Langegger and W¨oß,2009] is a generator for statistics of RDF sources that combines several analyses and provides statistical information for datasets behind SPARQL endpoint and RDF documents. These statistics include different types of histograms, the num- ber of anonymous subjects, the total number of instances for a given class, or a set of classes and methods to obtain the URIs of instances. LODStats [Auer et al., 2012] provide several statistics about the usage of vocabularies, types and properties of LOD data sets. These statistics include number of triples, triples with blank nodes, labeled subjects, number of owl:sameAs links, class and property usage, class hierarchy depth, and cardinalities, among other things. These statistics are then represented using VoID and Data Cube Vocabulary [Cyganiak and Reynolds, 2014]. Data Cube vocabulary is focused purely on the publication of multi-dimensional data on the web. It is an RDF vocabulary for describing statistical datasets. Roomba [Assaf et al., 2015b] automatically validats and generats descriptive metadata for RDF datasets. The extracted metadata are grouped into four categories: general (e.g., title, ID), access, ownership, and provenance (temporal and historical information). 
Loupe [Mihindukulasooriya et al., 2015] extracts more detailed characteristics of classes and properties used in a dataset as well as common triple patterns. ABSTAT [Palmonari et al., 2015] provides a summary of the most commonly used abstract knowledge patterns. ExpLOD [Khatchadourian and Consens, 2010] adopts bisimulation and contraction to

2make-void: https://github.com/cygri/make-void


find common patterns in data. The tool is based on a mechanism that combines text labels assigned to the RDF graphs and bisimulation contractions. The bisimulation contractions are applied to subgraphs defined via queries, providing for the summarization of arbitrarily large or small graph neighborhoods. Also, ExpLOD can generate SPARQL queries from a summary. ProLOD++ [Abedjan et al., 2014] is a web-based tool which includes statistics on frequencies and distributions of distinct subjects, properties, and objects, as well as statistics on data types and value pattern distributions of particular properties. It discovers positive and negative association rules and calculates the keyness measure for each property along the ontology class hierarchy. GraphLOD [Jentzsch et al., 2015], which is a set of new features of ProLOD++, allows exploring the graphical structures of Linked Datasets by visualizing the connected components and the graph patterns mined from them. It identifies graph patterns such as paths, cycles, stars, siamese stars, antennas, caterpillars, and lobsters. SchemEX [Konrath et al., 2012] extracts schema information for large data sets, by considering the co-occurrence of types and properties, to construct a schema-based index structure. This index allows for answering schema-oriented queries. The SchemEX index structure is further used for a theoretical analysis of the interdependencies between explicit and implicit schema information of RDF data [Gottron et al., 2013a]. Beck et al. integrate in LinDA the computation of a VoID description [Keith Alexander and Richard Cyganiak and Michael Hausenblas and Jun Zhao, 2011, Görlitz and Staab, 2011], the SchemEX index [Konrath et al., 2012], the analysis of schema interdependencies [Gottron et al., 2013a], and the construction of a formal concept lattice of the data [Wille, 2009] in order to derive descriptive metadata. Dataset descriptions and statistics are important to make a dataset findable and usable; however, for many LOD datasets no good, complete, and descriptive metadata is available [Jentzsch et al., 2011, Assaf et al., 2015a].

6.3.5. Findings and Research Contributions

We evaluated the conformance of LOD data sources in providing a valid and correct Last-Modified HTTP header field, which indicates the date and time at which the resource was last modified. Our experiment shows that overall and on average only 8% of the resources in the datasets provide correct values for this field. This number is far too low to be of use for any practical application. It is, however, not clear why LOD sources do not provide valid information. We conjecture that some default configuration of LOD servers leads to this misbehavior. Our analysis is restricted to the Last-Modified field; however, it could easily be extended to check the conformance of other HTTP header fields. The reliable provision of provenance information in the context of the established HTTP protocol would be beneficial to the entire Web of Data. Many base technologies such as Linked Data caches and indexes may benefit from this information, since simply checking this field may help them to decide which sources need to be updated. We believe that publishing correct HTTP header information is a step towards quality-oriented data usage in the LOD cloud. Therefore, with this work we point out the dimensions of the problem of erroneous and missing information in the HTTP headers of Linked Data. Thereby, we motivate LOD sources to publish correct and valid values to support application needs. For now, since we cannot rely on HTTP header information for change detection, we need ways to recognize and quantify changes in the LOD sources. In the next section we discuss metrics to capture such changes.
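As an illustration of the kind of conformance check performed here, the following minimal Python sketch retrieves the Last-Modified header of a resource and compares it against a locally recorded change date; the URL and the recorded date are merely illustrative placeholders, and the sketch is not the evaluation code used in our experiments.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

import requests  # third-party HTTP library


def last_modified(url, timeout=10):
    """Return the Last-Modified date of a resource, or None if absent or invalid."""
    response = requests.head(url, allow_redirects=True, timeout=timeout)
    value = response.headers.get("Last-Modified")
    if value is None:
        return None
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        return None  # header present, but not a valid HTTP date


# Hypothetical check: is the advertised date consistent with the change we observed locally?
observed_change = datetime(2015, 5, 8, tzinfo=timezone.utc)
header_date = last_modified("http://dbpedia.org/resource/California")
if header_date is None:
    print("No usable Last-Modified header")
else:
    print("Header consistent with observed change:", header_date >= observed_change)
```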


6.4. Capturing the Dynamics of Linked Open Data Sources

Overview Quite often, Linked Open Data (LOD) applications pre-fetch data from the Web and store local copies of it in a cache for faster access at runtime. Yet, recent investigations have shown that data published and interlinked on the LOD cloud is subject to frequent changes. As the data in the cloud changes, local copies of the data need to be updated. However, due to limitations of the available computational resources (e.g., network bandwidth for fetching data, and computation time), LOD applications may not be able to permanently visit all of the LOD sources at brief intervals in order to check for changes. These limitations suggest the need to prioritize which data sources should be considered first for retrieving their data and synchronizing the local copy with the original data. In order to make best use of the resources available, it is vital to choose a good scheduling strategy to know when to fetch data from which data source. In this chapter, we first present a general framework to analyze the dynamics of linked datasets within a given time interval. We propose a function to measure the dynamics of a LOD dataset, which is defined as the aggregation of absolute, infinitesimal changes, i.e., it captures the intensity of how the data evolved in this period. Later, we investigate the effectiveness of the dynamics function for conducting updates of LOD sources against different strategies proposed in the literature, and evaluate them on a large-scale LOD dataset that is obtained from the LOD cloud by weekly crawls over the course of three years. We investigate two different setups: (i) in the single-step setup, we evaluate the quality of update strategies for a single and isolated update of a local data cache, while (ii) the iterative progression setup involves measuring the quality of the local data cache when considering iterative updates over a longer period of time. Our evaluation indicates the effectiveness of each strategy for updating local copies of LOD sources, i.e., we demonstrate, for given bandwidth limitations, the strategies' performance in terms of data accuracy and freshness. The evaluation shows that the measures capturing the change behavior of LOD sources over time are most suitable for conducting updates.

Structure The structure of Section 6.4 is as follows: We motivate capturing the change behavior of a fictional LOD source with a hypothetical scenario in Section 6.4.1. Section 6.4.2 describes the foundations of change metrics upon which our notion of dynamics is based. Section 6.4.3 presents a formal notion of dynamics for LOD datasets and its extension using different decay functions. Section 6.4.4 formalizes the need for scheduling strategies to efficiently keep LOD caches up-to-date. Section 6.4.5 discusses the effectiveness of the dynamics function and of a set of existing scheduling strategies for data updates. We review some existing works on index/cache updates in Section 6.4.6, and conclude this chapter in Section 6.4.7.

Research Questions This chapter addresses the following research question:

RQ 2.III Does the consideration of changes within a time interval instead of between two points in time improve change analysis?


State-of-the-art metrics for change detection of RDF datasets [Ding and Finin, 2006, Dividino et al., 2013, Käfer et al., 2013] mainly quantify changes between two datasets. When considering the change analysis of a dataset over a period of time, such state-of-the-art metrics can be used to compare any two versions of the same dataset, each version being fetched at a different point in time. Nevertheless, such metrics do not include a mechanism to exploit the dynamics of a dataset, i.e., to consider a time interval described by more than two points in time.

RQ 2.IV Which is the most adequate update scheduling strategy to manage cached copies of LOD sources?

Data on the Linked Open Data (LOD) cloud changes frequently, and applications relying on that data (by pre-fetching data from the Web and storing local copies of it in a cache) need to continually update their caches. Instead of permanently visiting all of the LOD sources at brief intervals, a good scheduling strategy could help us to know when to fetch data from which data source in order to grab (most of) the changes. So far, there is no work addressing strategies to efficiently keep local copies of LOD sources up-to-date. Nevertheless, the effectiveness of well-known update scheduling strategies for maintaining indices of web documents, such as the PageRank algorithm [Page et al., 1999], and of metrics for quantifying the changes of LOD sources could be evaluated in the context of updating local copies of LOD sources.

6.4.1. Motivation Scenario

Following up on our motivation scenario described in Section 6.2, we illustrate the differences between change and dynamics analysis on LOD datasets. In this example, we describe three snapshots of a dataset captured at three distinct points in time: t1, t2, and t3. We use the DBpedia and FOAF vocabularies for describing headquarters and their broadcasters in cities located in the state of California. For instance, there are entities like dbr:Los Angeles and dbr:California that are connected via a dbo:isPartOf property. Table 6.1 summarizes the RDF triples published in the first snapshot of the dataset at time t1. At time t2, the same data is visited again. Table 6.2 shows the RDF triples of this new snapshot. We can directly observe changes in the triples between these two snapshots. In the second snapshot, we observe four new RDF triples (the last four triples listed in Table 6.2). Table 6.3 summarizes the RDF triples of the third and last snapshot at time t3. This snapshot contains the same set of triples as the first one. Existing metrics proposed in the literature are able to quantify changes for every pair of snapshots of a dataset. For the sake of simplicity, we apply a very simple metric in this example, which only counts the additions, deletions, and changes between the sets of triples of the first and the second snapshot. In this case, there are four new triples over the total of all triples. The same number of changes is observed when comparing the second and third snapshot. However, since the first and the third snapshot contain the same set of triples, we cannot observe any changes under the considered metric. The direct comparison of the first and third snapshot of the dataset ignores the changes in the second snapshot. In this work, we argue that the consideration of the changes in the second snapshot is of great importance to the analysis of the dataset dynamics in the time period ranging from t1 to t3. Otherwise, when ignoring this evolution, the true dynamic character of the dataset is


Table 6.1.: Scenario: Example dataset at time t1.

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dbr:California rdf:type dbo:Location.
dbr:California foaf:homepage "http://www.ca.gov/".
dbr:Los Angeles dbo:isPartOf dbr:California.
dbr:Los Angeles dbp:headquarters dbr:Fox Sports 1.
dbr:Fox Sports 1 dbo:broadcastNetwork dbr:Fox Sports (United States).

Table 6.2.: Scenario: Example dataset at time t2.

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dbr:California rdf:type dbo:Location.
dbr:California foaf:homepage "http://www.ca.gov/".
dbr:Los Angeles dbo:isPartOf dbr:California.
dbr:Los Angeles dbp:headquarters dbr:Fox Sports 1.
dbr:Fox Sports 1 dbo:broadcastNetwork dbr:Fox Sports (United States).
dbr:Bluewarter dbo:isPartOf dbr:California.
dbr:Bluewarter dbp:headquarters dbr:vpktv.
dbr:vpktv dbo:broadcastNetwork dbr:VPK(United States).
dbr:VPK foaf:homepage "http://www.vpktv.com/".

Table 6.3.: Scenario: Example dataset at time t3.

@prefix dbo: <http://dbpedia.org/ontology/> .
@prefix dbp: <http://dbpedia.org/property/> .
@prefix dbr: <http://dbpedia.org/resource/> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix foaf: <http://xmlns.com/foaf/0.1/> .
dbr:California rdf:type dbo:Location.
dbr:California foaf:homepage "http://www.ca.gov/".
dbr:Los Angeles dbo:isPartOf dbr:California.
dbr:Los Angeles dbp:headquarters dbr:Fox Sports 1.
dbr:Fox Sports 1 dbo:broadcastNetwork dbr:Fox Sports (United States).

neglected. In the following sections, we systematically introduce metrics of changes and present a formalization of how to incorporate them into our notion of dataset dynamics.

6.4.2. Foundations: Change Metrics

In this section, we introduce a formal specification of change metrics, upon which our notion of dynamics for LOD datasets is built. In the literature, many change metrics have been proposed for analyzing RDF data of the LOD cloud [Käfer et al., 2013, Dividino et al., 2013, Ding and Finin, 2006, Gottron and Gottron, 2014]. These metrics essentially quantify the changes that occurred in a dataset by comparing two snapshots of this dataset. Our goal is to re-use such metrics and to incorporate them as a parameter in our framework for measuring the dynamics of a LOD dataset. We will denote a change metric as a function ∆. Basically, such a ∆-function is a metric that quantifies the changes between two datasets, i.e., it is a function that determines the difference (or distance) between two datasets. Without loss of generality, in this chapter, we restrict ∆-metrics to determine the difference between two RDF datasets. For instance, changes between two datasets can be measured by the number of differences between the sets of triples of these datasets (such as additions and deletions of RDF triples). Please note that our framework for measuring dataset dynamics can be parametrized to make use of any existing change metric that satisfies the formal requirements listed below.

Definition 6.4.1 Let S be the set of all possible RDF datasets. A change metric is a function ∆ ∶ S × S → R that maps two RDF datasets to a real number and satisfies the following conditions (for X1, X2 and X3 being RDF datasets).

1. positivity: ∆(X1,X2) ≥ 0

2. symmetry: ∆(X1,X2) = ∆(X2,X1)

3. identity of indiscernibles: ∆(X1,X1) = 0 and

4. triangle inequality: ∆(X1,X3) ≤ ∆(X1,X2) + ∆(X2,X3)

Example 6.4.1 (Jaccard distance as change metric) The Jaccard distance ∆Jaccard between two RDF datasets satisfies the requirements of Definition 6.4.1. Let X1 and X2 be the two RDF datasets presented in Table 6.1 and Table 6.2; then the Jaccard distance between the sets of triples of X1 and X2 evaluates to:

\Delta_{Jaccard}(X_1, X_2) = 1 - \frac{|X_1 \cap X_2|}{|X_1 \cup X_2|} = 1 - 5/9 = 0.44    (6.1)
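To make the role of such a ∆-metric concrete, the following minimal Python sketch computes the Jaccard distance over two snapshots given as sets of (subject, predicate, object) tuples; the triples in the usage example are an abbreviated, illustrative subset of Tables 6.1 and 6.2, not the full datasets.

```python
def jaccard_distance(x1, x2):
    """Jaccard distance between two RDF snapshots given as sets of triples."""
    x1, x2 = set(x1), set(x2)
    if not x1 and not x2:
        return 0.0  # two empty datasets are treated as identical
    return 1.0 - len(x1 & x2) / len(x1 | x2)


# For the full datasets of Table 6.1 (5 triples) and Table 6.2 (9 triples)
# this evaluates to 1 - 5/9 = 0.44 (Eq. 6.1); the reduced example below yields 1 - 2/3.
snapshot_t1 = {("dbr:California", "rdf:type", "dbo:Location"),
               ("dbr:Los Angeles", "dbo:isPartOf", "dbr:California")}
snapshot_t2 = snapshot_t1 | {("dbr:Bluewarter", "dbo:isPartOf", "dbr:California")}
print(jaccard_distance(snapshot_t1, snapshot_t2))  # 0.333...
```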

6.4.3. Dynamics Analysis of Linked Open Data Sources

Dynamics Function The dynamics function aims at quantifying the evolution of a dataset over a specific period of time and takes into consideration the changes occurring in this period.


Figure 6.3.: The dynamics of a dataset is obtained by integration over its change rate over time.

For the sake of simplicity, we model time as a real value. We are looking for a function Θ ∶ S ×R → R, which assigns to a dynamic RDF dataset X consisting of a concrete set of triples X ∈ S at any point in time t a value which models the quantity of evolution it has undergone3. Θ is a monotone, non-negative function. This implies that there cannot be negative evolution. To measure the dynamics as the amount of evolution a dataset exhibited in a given time interval [t1, t2], it is sufficient to compute Θ(X, t2)−Θ(X, t1) ≥ 0. For ease of notation, we will in the following abbreviate Xt for the dataset X at time t and Θ(X, t) by Θ(Xt). While it is difficult to define the function Θ directly to provide meaningful values, we will define it indirectly. To this end, we assume that the change rate of a dataset Xt at time t is given by a function c(Xt). Then, we define the difference for two values of the function Θ to be obtained by accumulating the dataset change rate function over a time interval. This means, we integrate the change rate function of a dataset over a given period of time. More formally, the dynamics of a dataset is given by:

\Theta(X_{t_2}) - \Theta(X_{t_1}) = \int_{t_1}^{t_2} c(X_t)\,dt    (6.2)

The idea of integrating over a change rate function is depicted in Fig. 6.3, where the area under the curve c(X_t) corresponds to a quantification of the dataset evolution and thus to the dynamics of the dataset. However, the function c is also not explicitly known and cannot be used for the computation, i.e., it is not possible to determine the change rate of a dataset for a given point in time. Thus, our idea is to use an approximation for c(X_t) based on discrete points in time and the changes between the datasets at these times.

So, we can effectively assume X = (Xt1, Xt2, ..., Xtn) to be a set of snapshots of the RDF dataset X at points in time ti, for i = 1, . . . , n. For any two snapshots of a dataset, we can measure changes using a ∆-metric, such as the ones presented in Section 6.4.2. Our assumption is that for small time intervals (ideally intervals tending towards zero) the change between

3This quantity of evolution is an abstract value but can serve for relative comparisons of datasets.


Figure 6.4.: Dataset dynamics defined as aggregation of absolute, infinitesimal ∆-metrics changes.

datasets is a good enough approximation of the change rate. This corresponds to the idea that:

\frac{\Delta(X_{t_i}, X_{t_{i-1}})}{t_i - t_{i-1}} \;\xrightarrow{\,t_{i-1} \to t_i\,}\; c(X_{t_i}) = \frac{d}{dt}\,\Theta(X_{t_i})    (6.3)

Therefore, instead of using the change rate function c, we can use its approximation, namely the ∆-metric of (ideally) small time intervals of each pair Xti and Xti−1 ∈ X. This corresponds to approximating the change rate function using a step function as depicted in Fig. 6.4. Computing the integral given in Eq. 6.2 over this approximated function corresponds to:

\Theta(\mathcal{X})_{t_1}^{t_n} \approx \sum_{i=2}^{n} \Delta(X_{t_i}, X_{t_{i-1}})    (6.4)

In the following example, we illustrate the computation of the dynamics of the dataset presented in our motivation scenario.

Example 6.4.2 (Computing dynamics based on a Jaccard distance change metric)

Let X = (Xt1, Xt2, Xt3) be a set of snapshots of an RDF dataset X at three distinct points in time t1, t2, and t3, presented in Table 6.1, Table 6.2, and Table 6.3, respectively. Then the

Jaccard distances between the sets of triples from Xt1 and Xt2, and from Xt2 and Xt3, are given as follows:

∆Jaccard(Xt2 ,Xt1 ) = 0.44, ∆Jaccard(Xt3 ,Xt2 ) = 0.44 (6.5)


Figure 6.5.: Dynamics function with decay to strengthen the recent changes.

Then the dynamics of X is:

\Theta(\mathcal{X})_{t_1}^{t_n} = \sum_{i=2}^{n} \Delta_{Jaccard}(X_{t_i}, X_{t_{i-1}}) = \Delta_{Jaccard}(X_{t_2}, X_{t_1}) + \Delta_{Jaccard}(X_{t_3}, X_{t_2}) = 0.88    (6.6)
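A minimal sketch of this computation, reusing the jaccard_distance function sketched in Section 6.4.2 (or any other ∆-metric passed as a parameter), could look as follows; the closing comment reproduces the values of Example 6.4.2.

```python
def dynamics(snapshots, delta):
    """Dynamics of a dataset: aggregation of pairwise changes (Eq. 6.4).

    snapshots -- list of triple sets X_{t_1}, ..., X_{t_n}, ordered by time
    delta     -- change metric, e.g. jaccard_distance
    """
    return sum(delta(snapshots[i - 1], snapshots[i]) for i in range(1, len(snapshots)))


# Example 6.4.2: the snapshots at t1 and t3 are identical, t2 adds four triples, so
# dynamics([X_t1, X_t2, X_t3], jaccard_distance) evaluates to 0.44 + 0.44 = 0.88,
# whereas a direct comparison of X_t1 and X_t3 alone would report no change at all.
```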

Decay Function So far, we have proposed a general framework in which we can compute the dynamics, or evolution, of any RDF dataset over a period of time and which incorporates any change metric ∆ that follows the requirements mentioned above. Applications such as caching and index maintenance benefit from the analysis of the change history of datasets, since update strategies can incorporate the evolution of a dataset in their computation, instead of only the quantification of changes with respect to the last two snapshots. However, for such applications changes tend to vary in importance as time passes; e.g., if a dataset used to change much but does not anymore, its index update strategy may need to be adapted (e.g., it should be less aggressive than it used to be). Therefore, it may be important that changes that took place a longer time ago are weakened and that recent ones are strengthened, or the other way around. For this purpose, the dynamics function should be flexible enough to incorporate such requirements. Accordingly, we extend the dynamics function with a decay function f(t). Fig. 6.5 illustrates the influence of the decay function when combined with the dynamics function. On the upper left of the figure, the dynamics function is shown. On the lower left, the decay function (in this example, the exponential decay function) is presented. On the right, the combination of both is depicted. Please note that in this example we want to weaken older changes and strengthen the recent ones. Based on these considerations, we introduce the following modifications to the definition


of dynamics: Let X be a dynamic RDF dataset and c(Xt) be a function which measures the change rate of a dataset at time t, and f(t) be a decay function. Then the decayed dynamics function of X is defined as:

\Theta_{decay}(X_{t_2}) - \Theta_{decay}(X_{t_1}) = \int_{t_1}^{t_2} f(t) \cdot c(X_t)\,dt    (6.7)

Consequently, the discretization is given by:

\Theta_{decay}(\mathcal{X})_{t_1}^{t_n} \approx \sum_{i=2}^{n} f(t_i)\,\Delta(X_{t_i}, X_{t_{i-1}})    (6.8)

Example 6.4.3 (Dynamics involving a decay function) We continue our Example 6.4.2, where we computed the Jaccard distances ∆Jaccard(Xt2, Xt1) and ∆Jaccard(Xt3, Xt2). We now want to compute the dynamics of X using the dynamics function with a decay function. In this example, we chose the exponential decay function f(t_i) = e^{-\lambda (t_n - t_i)} as our example decay function. For the sake of simplicity, we set the parameter λ to 1. Then Θdecay(X) is:

\Theta_{decay}(\mathcal{X}) = \sum_{i=2}^{n} f(t_i)\,\Delta_{Jaccard}(X_{t_i}, X_{t_{i-1}})
= e^{-\lambda(t_3 - t_2)} \cdot \Delta_{Jaccard}(X_{t_2}, X_{t_1}) + e^{-\lambda(t_3 - t_3)} \cdot \Delta_{Jaccard}(X_{t_3}, X_{t_2})    (6.9)
= 0.38 \cdot 0.44 + 1 \cdot 0.44
= 0.607
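The decayed variant can be sketched analogously; the snippet below assumes that the snapshot timestamps are given as plain numbers (e.g., unit-spaced as in Example 6.4.3 with t1, t2, t3 = 0, 1, 2) and that a ∆-metric such as the jaccard_distance sketch is supplied.

```python
import math


def decayed_dynamics(snapshots, times, delta, lam=1.0):
    """Decayed dynamics (Eq. 6.8): recent changes weigh more than old ones.

    snapshots -- triple sets X_{t_1}, ..., X_{t_n}, ordered by time
    times     -- the corresponding time points t_1, ..., t_n (as numbers)
    delta     -- change metric, e.g. jaccard_distance
    lam       -- decay constant of f(t_i) = exp(-lam * (t_n - t_i))
    """
    t_n = times[-1]
    return sum(
        math.exp(-lam * (t_n - times[i])) * delta(snapshots[i - 1], snapshots[i])
        for i in range(1, len(snapshots))
    )


# Example 6.4.3 with times 0, 1, 2 and lam = 1:
# exp(-1) * 0.44 + exp(0) * 0.44 ≈ 0.37 * 0.44 + 0.44 ≈ 0.60
```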

6.4.4. Efficiently Keeping Local Linked Open Data Caches Up-To-Date

Update Scheduling Strategies

Linked Data that is crawled from the cloud can be represented in the form of N-Quads4. Technically, a quad (s, p, o, c) consists of an RDF triple, where s, p, and o correspond to the subject, predicate, and object, and of the context c, i.e., the data source on the Web where this RDF triple was retrieved. We assume that data from the various LOD sources is retrieved at some fixed point in time t. We consider that an application visits and fetches data of LOD sources at a regular interval (say, once a week). Consequently, a LOD data source is defined by a context c and the data it provides at points in time t, i.e., the set of RDF quads Xc,t. Furthermore, we denote the size of a data set with |Xc,t| to indicate the number of triples contained in the data set at context c at the point in time t. We rely on local copies of the LOD sources. Such a copy typically covers several data sources. Given a set C = {c1, c2, . . . , cm} of contexts of interest, we can define the overall dataset as:

Definition 6.4.2 Dataset

X_t = \bigcup_{c \in C} X_{c,t}    (6.10)

4W3C Recommendation http://www.w3.org/TR/n-quads/
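As a small illustration of this data model, the following sketch groups already-parsed quads by their context, yielding the per-source sets X_{c,t}; the example quads are purely illustrative and not taken from an actual crawl.

```python
from collections import defaultdict


def group_by_context(quads):
    """Build the per-source datasets X_{c,t} from a collection of (s, p, o, c) quads."""
    dataset = defaultdict(set)  # context c -> set of (s, p, o) triples
    for s, p, o, c in quads:
        dataset[c].add((s, p, o))
    return dataset


quads = [
    ("dbr:California", "rdf:type", "dbo:Location", "http://dbpedia.org/"),
    ("dbr:Los Angeles", "dbo:isPartOf", "dbr:California", "http://dbpedia.org/"),
    ("ex:Artist", "foaf:name", '"Some Artist"', "http://musicbrainz.com/"),
]
snapshot = group_by_context(quads)
print({c: len(triples) for c, triples in snapshot.items()})  # sizes |X_{c,t}| per source
```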


Example 6.4.4 As a matter of example, we consider that data comes from three different data sources: dbpedia.org, bbc.co.uk, and musicbrainz.com, and data is retrieved at two different points in time, on May 8th, 2015 and on June 10th, 2015.

X2015-05-08 = {Xdbpedia.org,2015-05-08,Xmusicbrainz.com,2015-05-08,Xbbc.co.uk,2015-05-08}, X2015-06-10 = {Xdbpedia.org,2015-06-10,Xmusicbrainz.com,2015-06-10, Xbbc.co.uk,2015-06-10}

Finally, for distinct points t1, t2, . . . , tn in time, we define a series of datasets over time:

Definition 6.4.3 Series of Datasets

X = (Xt1 ,Xt2 ,...,Xtn ) (6.11)

Example 6.4.5 Our example dataset is composed of data retrieved at two different points in time (see Example 6.4.4), such that X = (X2015-05-08, X2015-06-10).

Due to limitations such as bandwidth restrictions and the frequent data changes in the LOD cloud, LOD applications relying on data from the LOD cloud need to prioritize which data sources should be considered first in order to achieve an optimal accuracy of their local copies under the given constraints. Therefore, applications make use of a scheduling strategy for data updates. A scheduling strategy aims to derive an order of importance for data sources based on a set of data features. In the ideal case, a strategy would derive an order such that the application would visit only the subset of LOD sources which have actually been changed. In this section, we introduce a formal specification of update functions and a set of data features used by update strategies for deriving such an order. Whenever an application needs to update a local copy covering the data sources c ∈ C at time ti+1, it would technically be sufficient to fetch the complete dataset Xti+1. However, this would require us to visit all data sources c, retrieve their most recent version of the data

Xc,ti+1 and integrate it into one dataset Xti+1 . Due to limitations such as network bandwidth capacity for downloading data or computation time, we assume that only a certain fraction of the data can actually be retrieved fresh from the cloud and processed in a certain time interval. Thus, applications need to apply a scheduling strategy to efficiently manage the accuracy of the data. Based on features extracted from the dataset retrieved at an earlier point in time ti, a scheduling strategy indicates which data sources c should be visited (i.e., visit the URI c and fetch the latest version of the data made available at this URI) in the time slice between ti and ti+1. The update strategy can be seen simply as a relation:

Definition 6.4.4 Update Strategy

U ⊂ C × {1, . . . , n} (6.12)

A tuple (c, i) in this relation indicates a point in time ti at which data from a data source c should be updated. Furthermore, we define the constraint of the bandwidth as a restriction to download at most K triples.


Definition 6.4.5 Constraint of the Bandwidth

For a given i: \sum_{(c,i) \in U} |X_{c,t_i}| \le K    (6.13)

For any given constraint of the bandwidth, it is possible to retrieve data from the sources in their order of preference until the limit has been reached.

Example 6.4.6 Suppose our dataset has been updated most recently on May 8th, 2015 (see Example 6.4.4), and we want to update our local copy again on June 10th, 2015. However, the constraint of the bandwidth allows only K = 12,000 triples to be fetched per time slice. Under this constraint, we cannot fetch all the data, since |X_{dbpedia.org,2015-05-08}| + |X_{musicbrainz.com,2015-05-08}| + |X_{bbc.co.uk,2015-05-08}| ≥ 12,000. Nevertheless, without violating the restriction, we can entirely fetch data from the first two data sources: dbpedia.org and musicbrainz.com.

We define a last update function lu to identify, for a specific data source and a given point in time, when its data was last updated:

Definition 6.4.6 Last Update Function

lu(c, i) = \arg\max_{j \le i} \{(c, j) \in U\}    (6.14)

This function can be used recursively to identify, for instance, the update prior to the last update by lu(c, lu(c, i) − 1).

Example 6.4.7 In the previous example, we updated our local copy by fetching data from dbpedia.org and musicbrainz.com; the last update times of the data sources are thus given as: lu(dbpedia.org, 2015-06-10) = lu(musicbrainz.com, 2015-06-10) = 2015-06-10, lu(bbc.co.uk, 2015-06-10) = 2015-05-08

Using the last update function lu at time ti, we can define the aggregated data set according to an update strategy, i.e., which version of which data source is part of the current local copy. This aggregated data set X'_{t_i} is defined as:

Definition 6.4.7 Aggregated Data Set

X'_{t_i} = \bigcup_{c \in C} X_{c,\,t_{lu(c,i)}}    (6.15)

Example 6.4.8 Following Example 6.4.7, our updated dataset for June 10th, 2015 is given as: X'_{2015-06-10} = {X_{dbpedia.org,2015-06-10}, X_{bbc.co.uk,2015-05-08}, X_{musicbrainz.com,2015-06-10}}.

Finally, using this notation, we can easily construct the history of a particular data source in the course of execution of an update plan over time up to time ti:

Definition 6.4.8

H(c, t_i) = \{X_{c,t_j} \mid (c, j) \in U,\; t_j \le t_i\}    (6.16)

Example 6.4.9 The history of our sample data source dbpedia.org is given as:

H(dbpedia.org, 2015-06-10) = {Xdbpedia.org,2015-06-10,Xdbpedia.org,2015-05-08}


Update Function As a large number of LOD sources are available but only a limited number of sources can be fetched per run, it is required to determine which sources should be visited first. Using the vector of features of each data source, we define an update function ρ : f → R, which assigns a preference score to a data source based on the features observed at time ti. An update strategy is defined by ranking the data sources according to their preference score in descending or ascending order, and fetching them starting from the top-ranked entry down to some lower-ranked entry. For instance, if we consider fSize to be the feature observed at time ti for all c ∈ C, ρ could be defined as the rank of the data sources in ascending order (from the smallest to the biggest ones). Furthermore, the bandwidth defines the amount of data that can be fetched per run. Consequently, at some point in time ti, data of a set of data sources is updated until the bandwidth constraint has been consumed completely. For the sake of clarity, we discard a data source when its data cannot be completely fetched, i.e., when the last started fetch operation cannot be entirely executed because the bandwidth limit would be violated while reading the data. The tuple (c, i) is considered to be in the update relation U defined above if the data from c can be entirely fetched based on the given order and the available bandwidth. Without loss of generality, we assume that the data sources are visited in a sequential order. However, it is left to the implementation to decide whether data from the different data sources should be fetched in sequential or parallel processing.
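The following sketch makes the scheduling procedure concrete: sources are ranked by a given preference score and selected greedily until the bandwidth budget K is exhausted, where a source whose (last known) size exceeds the remaining budget is discarded, which is one possible reading of the discarding rule above. The scores and sizes in the usage example are hypothetical.

```python
def schedule_updates(scores, sizes, bandwidth, descending=True):
    """Select the data sources to update at one point in time.

    scores     -- dict: context c -> preference score (e.g., the dynamics of c)
    sizes      -- dict: context c -> last known number of triples of c
    bandwidth  -- maximum number of triples K that may be fetched in this time slice
    descending -- whether higher scores are preferred (e.g., most dynamic first)
    """
    ranked = sorted(scores, key=scores.get, reverse=descending)
    selected, budget = [], bandwidth
    for c in ranked:
        if sizes[c] <= budget:      # only sources that can be fetched entirely
            selected.append(c)
            budget -= sizes[c]
        # otherwise the source is discarded for this time slice
    return selected


# Hypothetical scores and sizes for the sources of Example 6.4.6, with K = 12,000 triples:
scores = {"dbpedia.org": 0.9, "musicbrainz.com": 0.7, "bbc.co.uk": 0.2}
sizes = {"dbpedia.org": 8000, "musicbrainz.com": 3500, "bbc.co.uk": 6000}
print(schedule_updates(scores, sizes, bandwidth=12000))  # ['dbpedia.org', 'musicbrainz.com']
```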

6.4.5. Evaluation

In this section, we evaluate the dynamics function and different scheduling strategies presented in the literature and analyze their effectiveness for updating local copies of LOD sources. We evaluate these strategies on a large-scale and real-world LOD dataset. Our evaluation goal is to show which of the update strategies produce better updates of the LOD sources, i.e., we demonstrate, for given restrictions of bandwidth, which strategy performs better in terms of data accuracy and freshness.

Data Our evaluation dataset is obtained from the Dynamic Linked Data Observatory (DyLDO). The DyLDO dataset has been created to monitor a fixed set of Linked Data documents (and their neighborhood) on a weekly basis5. Our evaluation dataset is composed of 149 weekly crawls (in the following we will refer to a crawl as a snapshot) corresponding to a period over the last three years (from May 2012 to March 2015). Furthermore, the DyLDO dataset contains various well-known and large LOD sources, e.g., dbpedia.org, musicbrainz.com, and bbc.co.uk, as well as less commonly known ones, e.g., advogato.org, statistics.data.gov.uk, and uefa.status.net. For more detailed information about the DyLDO dataset, we refer the reader to [Käfer et al., 2013]. As we use weekly crawls obtained from the DyLDO dataset, we are only able to grab changes occurring between consecutive weeks (e.g., daily changes are not considered). To gain a better insight into our evaluation dataset, let us first look at the evolution of the snapshots. The number of data sources per snapshot ranges between 465 and 742. On average,

5For sake of consistency, we use only the kernel seeds of LOD documents

a snapshot is composed of 590 data sources. During the period studied, the number of data sources per snapshot slightly decreased, due to data sources going temporarily or permanently offline. Looking at consecutive snapshots, on average 1.05% of the data sources per snapshot are new and previously unseen (data source birth rate), and 1.36% of the data sources disappear each week (death rate). On average, 99.3% of the data sources remain in existence between consecutive snapshots, and 37.3% of them change on a weekly basis. Taking the first snapshot as reference, only 13.9% of the data sources remain unchanged over the entire interval studied. This overview confirms prior findings [Käfer et al., 2013] indicating that a high portion of the data on the LOD cloud changes. To provide better insights into how changes are distributed over the data sources, we sampled an arbitrary point in time (June 1st, 2014) and checked the distribution of triples over data sources. We observe that most of the data sources, 78.3%, are small (containing less than 1,000 triples) and contribute only 0.5% of all triples retrieved at this point in time. The few big data sources (0.6%) that are left (with up to 1,000,000 triples) contribute more than 49.2% of all triples. Furthermore, we observe that most of the changes (66.7%) take place in the data sources with more than 1,000,000 triples.

Evaluation Methodology

Ideally, scheduling strategies should prioritize an update of data sources which provide modified data. Note that we do not consider the task of discovering new data sources for inclusion into the data cache. Rather, we want to maintain an as fresh-as-possible local copy of a fixed predefined set of LOD sources. To be able to evaluate different scheduling strategies, we use the following scenarios:

Evaluation Strategies

Single-Step We evaluate the quality of update strategies for a single and isolated update of a local data cache, i.e., starting from a perfectly accurate data cache at time ti, our goal is to measure which quality can be achieved with different update strategies at time ti+1, for varying settings of bandwidth limitations.

Iterative Progression We evaluate the evolution of the quality of a local data cache when considering iterative updates over a longer period of time, i.e., starting from a perfect data cache at time ti, our goal is to measure how good an update strategy is at maintaining an accurate local copy at subsequent points in time ti+1, ti+2, ..., ti+n, assuming a fixed bandwidth. In our experiment, we consider four iterations.

Data Features In the following, we present features that will be used in our experiments. Please note that the set of features of a data source can always be extended.

Age provides the time span since the data source has been last visited and updated [Cho and Garcia-Molina, 2000]. It captures the age of the data provided by a data source:

f_{Age}(c, X'_{t_i}) = t_i - t_{lu(c,i)}    (6.17)


PageRank provides the PageRank of a data source in the overall dataset at the (last known) time [Page et al., 1999]:

f_{PageRank}(c, X'_{t_i}) = PR(X_{c,\,t_{lu(c,i)}})    (6.18)

Size provides the (last known) number of triples provided by a data source:

f_{Size}(c, X'_{t_i}) = |X_{c,\,t_{lu(c,i)}}|    (6.19)

ChangeRatio provides the absolute number of changes of the data in a data source between the last two (known) observation points in time [Cho and Ntoulas, 2002].

f_{Ratio}(c, X'_{t_i}) = |X_{c,\,t_{lu(c,i)}} \setminus X_{c,\,t_{lu(c,lu(c,i)-1)}}| + |X_{c,\,t_{lu(c,lu(c,i)-1)}} \setminus X_{c,\,t_{lu(c,i)}}|    (6.20)

ChangeRate provides the change rate between the observed data at the two (last known) points in time of a data source (see Section 6.4.2).

f_{Change}(c, X'_{t_i}) = \Delta(X_{c,\,t_{lu(c,i)}},\, X_{c,\,t_{lu(c,lu(c,i)-1)}})    (6.21)

In this case, the change rate ∆ is a function (metric) to measure the change rate between two data sets. We will use two ∆ functions: Jaccard distance:

\Delta(X_{c,\,t_{lu(c,lu(c,i)-1)}},\, X_{c,\,t_{lu(c,i)}}) = 1 - \frac{|X_{c,\,t_{lu(c,lu(c,i)-1)}} \cap X_{c,\,t_{lu(c,i)}}|}{|X_{c,\,t_{lu(c,lu(c,i)-1)}} \cup X_{c,\,t_{lu(c,i)}}|}

Dice Coefficient:

\Delta(X_{c,\,t_{lu(c,lu(c,i)-1)}},\, X_{c,\,t_{lu(c,i)}}) = 1 - \frac{2 \cdot |X_{c,\,t_{lu(c,lu(c,i)-1)}} \cap X_{c,\,t_{lu(c,i)}}|}{|X_{c,\,t_{lu(c,lu(c,i)-1)}}| + |X_{c,\,t_{lu(c,i)}}|}

Dynamics measures the behavior of the data source observed over several points in time (see Section 6.4.3), where the dynamics of a data source is defined as the aggregation of absolute changes, as provided by ∆-metrics.

f_{Dynamic}(c, X'_{t_i}) = \sum_{i=0}^{j} \frac{\Delta(X_{c,\,t_{lu(c,lu(c,i)-1)}},\, X_{c,\,t_{lu(c,i)}})}{t_{lu(c,i)} - t_{lu(c,lu(c,i)-1)}}, \quad j \le i.

We implemented the following update strategies (a sketch of how the feature values underlying them can be computed is given after the list):

1. Age updates from the last to the most recently updated data source.

2. Size-SmallestFirst updates from the smallest to the biggest data source.

3. Size-BiggestFirst updates from the biggest to the smallest data source.

4. PageRank updates from the highest to lowest PageRank of a data source.

162 6.4. Capturing the Dynamics of Linked Open Data Sources

5. ChangeRatio updates from the most to the least changed data source based on set difference applied to the last two retrieved versions of the data.

6. ChangeRate-J updates from the most to the least changed data source based on Jaccard distance applied to the last two retrieved versions of the data.

7. ChangeRate-D updates from the most to the least changed data source based on Dice Coefficient applied to the last two retrieved versions of the data.

8. Dynamics-J updates from the most to the least dynamic data source based on Jaccard distance and previously observed snapshots of the data.

9. Dynamics-D updates from the most to the least dynamic data source based on Dice Coefficient and previously observed snapshots of the data.
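As a sketch of how the feature values underlying these strategies can be derived from the history of snapshots retrieved for a single source, the following function computes the size, change ratio, change rate, and dynamics features; PageRank and Age are omitted, since they require the link graph and the update log, respectively, and the ∆-metric is passed in as a parameter (e.g., the jaccard_distance sketch from Section 6.4.2).

```python
def source_features(history, delta):
    """Feature vector for one data source.

    history -- list of at least two triple sets for this source, oldest first
    delta   -- change metric, e.g. jaccard_distance (or a Dice-based variant)
    """
    last, previous = history[-1], history[-2]
    return {
        "size": len(last),                                             # f_Size
        "change_ratio": len(last - previous) + len(previous - last),   # f_Ratio (set differences)
        "change_rate": delta(previous, last),                          # f_Change
        "dynamics": sum(delta(history[i - 1], history[i])              # aggregated changes over the history
                        for i in range(1, len(history))),
    }
```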

Please note that we analyze the strategy Age only for the Iterative Progression scenario. Age cannot be used in the Single Step scenario: since we build the follow-up copy from a perfect local copy, the feature Age would assign the same value to each data source. The data features used by the strategies are extracted based on the available history information. In our experiment, the history is composed of the last four updates. In the first setup, the task to be accomplished by the strategies is to compute an update order for all data sources at the point in time ti+1. For the strategies Size and PageRank, we use information about data retrieved from the last update time ti. ChangeRatio and ChangeRate are calculated over the last two updates ti−1 and ti, and Dynamics is calculated over the complete history for points in time ti−4 to ti. For the Iterative Progression setup, we start with a perfect data cache at ti. The task is to compute the updates iteratively at the next points in time ti+1 to ti+4. In the first step, the history setup is the same as in the single-step setup, and the size of the history increases along with the iterations. In order to make the results of the different setups comparable, and due to the fact that the iterative setup considers four iterative updates, the snapshots used in the single-step evaluation are the same ones which are evaluated first in the iterative evaluation setup (every fifth snapshot of the dataset). Additionally, we simulate network constraints by limiting the relative bandwidth, i.e., only a certain ratio of triples can be fetched for updating a local copy at a given point in time. In the simulation, we increase the bandwidth constraint stepwise from 0% to 5% in intervals of 1%, from 5% to 20% in intervals of 5%, and from 20% to 100% in intervals of 20% of all available triples. LOD sources are from time to time unavailable, i.e., some LOD sources cannot be reached by any application at a certain point in time, but may be reachable again at a later point in time. Nevertheless, the implemented strategies do not differentiate whether a source is unavailable for a period of time or whether it has been deleted from the cloud. Whenever a LOD source is deleted or unavailable at point in time ti, no triples are delivered and the empty set is considered for further computations.

Metrics The quality of an update strategy is measured in terms of micro-averaged recall and precision over the gold standard, i.e., the perfect up-to-date local copy:

p_{micro}(X_t, X'_t) = \frac{\sum_{c \in C} |X_{c,t} \cap X'_{c,t}|}{\sum_{c \in C} |X'_{c,t}|}    (6.22)


Figure 6.6.: Quality Outcomes for the Single Step Setup: (a) precision, (b) recall.

r_{micro}(X_t, X'_t) = \frac{\sum_{c \in C} |X_{c,t} \cap X'_{c,t}|}{\sum_{c \in C} |X_{c,t}|}    (6.23)
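A minimal sketch of these quality measures, with the gold standard and the local copy given as dictionaries from context to triple set, is shown below; returning 1.0 for an empty denominator is an assumption, as the equations leave this case undefined.

```python
def micro_precision_recall(gold, local):
    """Micro-averaged precision/recall of a local copy against the up-to-date gold standard.

    gold, local -- dicts: context c -> set of triples (X_{c,t} and X'_{c,t}, respectively)
    """
    contexts = set(gold) | set(local)
    hits = sum(len(gold.get(c, set()) & local.get(c, set())) for c in contexts)
    local_size = sum(len(local.get(c, set())) for c in contexts)
    gold_size = sum(len(gold.get(c, set())) for c in contexts)
    precision = hits / local_size if local_size else 1.0
    recall = hits / gold_size if gold_size else 1.0
    return precision, recall
```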

Discussion and Results

Single-Step Evaluation Figure 6.6(a) and Figure 6.6(b) show the average precision and recall over all snapshots in the single-step setup. The x-axis represents the different levels of constraints of relative bandwidth (in percent) and the quality in terms of precision and recall is placed on the y-axis. We observe that precision ranges from 0.862 to 1 and recall from 0.892 to 1 for bandwidth from 0% to 100% (see Figure 6.6(a) and Figure 6.6(b)). This implies that if no updates are executed (no bandwidth is available), our dataset is on average 87% correct (F-measure) after one update. This value can be interpreted as the probability of getting correct results when issuing an arbitrary query on the data. Overall, the Dynamics strategies outperform all other strategies. First, we look at the precision curve. For very low relative bandwidth (from 0% to 10%; see Figure 6.7(a)) the Dynamics strategies perform best, followed by the ChangeRate strategies. With only 3% available bandwidth, precision improves from 0.862 to 0.873 for Dynamics, and to 0.869 for ChangeRate. With 10% bandwidth the improvement rises to 0.888 for Dynamics and 0.879 for ChangeRate, while the third-best strategy, SmallestFirst, achieves 0.877 and the other strategies do not achieve scores higher than 0.862. For the higher relative bandwidths, the ChangeRate and Dynamics strategies are comparable and show only small differences in performance. Turning to recall (see Figure 6.7(b)), ChangeRate and Dynamics perform quite similarly over the entire interval and clearly outperform all other strategies under all bandwidth constraints. Even with only 15% bandwidth available, their recall values are above 0.957, while all other strategies achieve at most 0.93. LOD sources vary in their sizes. As shown, most of the big data sources change frequently and, consequently, they are among the top-ranked entries for strategies such as Dynamics and ChangeRate. Also, some of the smaller data sources have a high change frequency. Therefore,


in the ranking list provided by the Dynamics and ChangeRate strategies, a mix of big and small sources can be found among the top entries. For strategies such as BiggestFirst and ChangeRatio, only or mostly the biggest sources are top-ranked. In contrast, with the SmallestFirst strategy only the smallest sources are top-ranked. When updating the smallest data sources first, even for a very small bandwidth, a great number of data sources can be fetched and consequently data changes can also be retrieved. This can be observed in the recall curve of the SmallestFirst strategy. When only low bandwidth is available, it is not possible to fetch data from big data sources, since there is not enough bandwidth. This can be clearly seen for the BiggestFirst strategy, where updates are observed only when 20% or more bandwidth is available. The more bandwidth is available, the more changes can be retrieved. Due to the mix of data source sizes in the ranking list of the Dynamics and ChangeRate strategies, they are able to retrieve data even when only a very small bandwidth is available and, overall, are able to retrieve more modified data than the other strategies for all bandwidths. The other strategies only narrow the quality gap when more bandwidth is available. In general, the single-step experiments show that update strategies based on dynamics, followed by change rate, make best use of limited resources in terms of bandwidth. For very low relative bandwidth, the strategies based on data source dynamics tend to provide better results.

Figure 6.7.: Quality Outcomes for the Single Step Setup at Low Bandwidth Level (5%): (a) precision, (b) recall.

Iterative Progression Evaluation In this evaluation, we look at the evolution of quality when considering iterative updates. This setup simulates real use-case scenarios, such as a LOD search engine continuously updating its caches. In our experiments, we look at how precision and recall behave over the iterations. First, we fix the bandwidth constraints. We choose a low (5%), mid (15%), and high (40%) bandwidth, which provided low, average, and good outcomes based on the previous experiments (single-step evaluation). Figure 6.8(a) and Figure 6.8(b) show precision and recall for bandwidth fixed at 5%. The x-axis represents the iterations (points in time) and the y-axis the quality in terms of precision and recall. Note that quality decreases along the iterations. This is expected, since only the first iteration starts the update process from a perfect data cache. For low relative bandwidths,


Figure 6.8.: Quality Outcomes for the Iterative Progression Setup at Low Bandwidth Level (5%): (a) precision, (b) recall.

Figure 6.9.: Quality Outcomes for the Iterative Progression Setup at Mid-Level Bandwidth (15%): (a) precision, (b) recall.

the negative impact on the quality of iterative updates is quite similar for all strategies. Nevertheless, the plot confirms the previous discussion that the Dynamics strategies, followed by ChangeRate, are the more appropriate ones if we need to predict the next best steps and not only the first step. Still, the strategies show a uniform loss of quality. A similar output is observed for bandwidth fixed at 15% (see Figure 6.9(a) and Figure 6.9(b)). Here again, fetching data from the sources that change more than others ensures more accurate updates. Even though the loss of quality is comparable, the Dynamics strategies, followed by ChangeRate, maintain a higher level of quality after the four iterations. Dynamics precision and recall decrease from 0.92 to 0.846 and from 0.953 to 0.929, and ChangeRate from 0.908 to 0.841 and from 0.939 to 0.918, after the four iterations, while the quality


of the other strategies is mostly lower after only one or even no iteration. Precision and recall under a relatively high bandwidth (fixed at 40%) are shown in Figure 6.10(a) and Figure 6.10(b). The recall values of the Dynamics strategies and ChangeRate hardly change over the iterations (they stay above 0.947). The precision values decrease, with a maximum of 0.889. Over all iterations, these strategies outperform all the others even when only a one-step update is applied. Interestingly, at this bandwidth level, the strategies which fetch data from the big data sources first show good performance since, up to that bandwidth level, it is possible to load a big data source entirely. As most changes concentrate in the big data sources, they are also able to fetch most of the changes. For instance, precision and recall of the ChangeRatio strategy reach values of 0.888 and 0.923, respectively, after the iterations. Overall, the results of this experiment setup confirm the discussion from the previous one, i.e., the strategies based on the dynamics features, followed by the ones based on change rate, are the more appropriate ones if we need to predict iterative updates. Certainly, the more bandwidth is available, the more changes can be grabbed, and therefore the rate of quality loss over the iterations is lower for all strategies. Still, even after four iterations, the Dynamics strategies (followed by ChangeRate) were able to better maintain an up-to-date local copy for all different bandwidth levels. Our experiments show that, especially for low relative bandwidth, these strategies can better support applications in fetching the most changed data, and thus in avoiding fetching unchanged data, than the other strategies.

Figure 6.10.: Quality Outcomes for the Iterative Progression Setup at High-Level Bandwidth (40%): (a) precision, (b) recall.

6.4.6. Related Work

Various related works have investigated the characteristics of the LOD cloud. Their goal is to apply these characteristics for the purpose of different applications such as query recommendations and index updates. Some works conducted structural analyses of the LOD cloud, such as [Auer et al., 2012, Hausenblas et al., 2009, Alexander and Hausenblas, 2009], in order to obtain statistical insights into the characteristics of the data. In addition, there is related work on

analyzing the LOD cloud in order to verify its compliance with established guidelines and best practices on how to model and publish data as Linked Data [Hogan et al., 2012, Schmachtenberg et al., 2014]. Other works, such as Neumann et al. [Neumann and Moerkotte, 2011], analyze LOD in order to obtain statistics like its distribution in the network. The goal is to apply these statistics for the purpose of query recommendation. Although these works provide interesting insights into the characteristics of LOD, they typically do not consider the dynamics of the cloud and how it changes. Ding and Finin [Ding and Finin, 2006] crawled about 300 million triples from different so-called Semantic Web documents (SWDs) in 2006. The authors have conducted different analyses, such as extracting the age of the SWDs based on the last-modified time information contained in the HTTP response headers of the SWDs. The data exhibits an exponential distribution, which indicates that many new SWDs have been added or that many old ones are actively modified. Overall, their analysis also shows that the volume of the Semantic Web documents available on the web is growing, an observation which is consistent with and well-known from other sources like the LOD cloud web site6. However, it remains unknown at which point in time the different snapshots of the SWDs have been captured; the time span from the initial to the final snapshot is also unknown. An analysis of temporal information in LOD is presented in [Rula et al., 2012], i.e., temporal information available in document headers and in triples. The experiments on the BTC 2012 dataset show that only 10% of all triples explicitly provide temporal information. Thus, we have decided to apply our analysis not on this dataset but on the DyLDO dataset, which provides weekly snapshots of a selected set of crawled resources. Among the works dedicated to the study of Linked Data dynamics, with a focus on the update frequencies of LOD search engine indices, Umbrich et al. [Umbrich et al., 2010] compare the dynamics of Linked Data and of Linked datasets with HTML documents on the Web. Their change detection uses (i) HTTP metadata monitoring (HTTP headers including timestamps and ETags), (ii) content monitoring, and (iii) active notification of datasets. These three detection mechanisms are compared by several aspects like cost, reliability, and scalability of the mechanism. Similar to our approach, the content monitoring applies a syntactic comparison of the dataset content, i.e., a comparison of RDF triples (but ignoring inference). Change detection is a binary function which is activated whenever changes are found. In our evaluation, we consider more complex change metrics to allow fine-grained ranking. The Dynamic Linked Data Observatory is a monitoring framework to analyze the dynamics of Linked Data [Käfer et al., 2013]. Snapshots of the Web of Data are regularly collected and then compared in order to detect and categorize changes. Using these snapshots, the authors study the availability of documents, the links being added to the documents, and the schema signature of documents involving predicates and values for rdf:type to determine their change rate. Only 25% of the documents change frequently, and they contain a balance of documents with additions and deletions. Moreover, they showed that the rate of fresh links being added to the documents is very low.
Finally, regarding the types of changes occurring on the RDF triple level, the authors conclude that the schema signature of documents involving predicates and values for rdf:type changed very infrequently. Motivated by this work, Dividino et al. [Dividino et al., 2013] analyzed the changes in the usage of vocabulary terms in the DyLDO dataset. The authors show that the combination of vocabulary terms appearing in the LOD documents

6http://www.lod-cloud.net/, last accessed: 23 March, 2013

changes considerably. In this work we conducted an evaluation of different baseline and state-of-the-art crawling strategies to improve cache maintenance. Certainly, we could not cover the wide range of strategies existing in the literature. For instance, the PageRank metric, which is included in our evaluation, has been the subject of extensive research, and different variations have been proposed so far [Gyöngyi et al., 2004, Haveliwala, 2003, Jeh and Widom, 2003, Wu et al., 2006]. Nevertheless, a lot of research has been done covering the different aspects of crawling web documents. In this section we describe the work that is most relevant to ours.

Crawling the Web Wolf et al. [Wolf et al., 2002] discuss re-crawling strategies in order to maintain the freshness of the materialized data or search index. Similar to Pandey and Olston [Pandey and Olston, 2005], they explore page relevance (how much influence a page's content has on search queries) for crawling optimization. Change frequency (how often the content of a page is updated by its owner) is considered in many studies. Cho and Garcia-Molina [Cho and Garcia-Molina, 2003b] estimate change frequencies of Web pages when the complete change histories of the pages are not available. Brewington and Cybenko [Brewington and Cybenko, 2000a, Brewington and Cybenko, 2000b] estimate the distribution of change frequencies based on experimental data. Cho and Garcia-Molina [Cho and Garcia-Molina, 2003a] proposed a crawl policy which focuses only on the probability of change for pages that do not change very frequently. They claim that recrawling pages that change very frequently may dominate crawl resources and actually does not guarantee cache freshness. Olston and Pandey [Olston and Pandey, 2008] present the concept of information longevity, i.e., the lifetime of the content that appears on and disappears from the Web. They show that there is no correlation between information longevity and change frequency, and present a generative model combining the longevity and change profile. Some works [Fetterly et al., 2004, Cho and Ntoulas, 2002, Tan and Mitra, 2010] exploit the correlation between content change and the top-level domains of web pages. Cho and Ntoulas use this correlation to estimate the change probability of any page from the website. Tan and Mitra designed a crawling algorithm that clusters Web pages based on features that correlate with the change frequencies obtained by examining past history. Radinsky and Bennett [Radinsky and Bennett, 2013] propose an expert predictive framework for predicting content changes on the Web using past history and based on features such as relatedness to other pages and similarity in the types of changes. Calzarossa and Tessera [Calzarossa and Tessera, 2015] studied the temporal patterns of Web content changes as time series whose analysis provides models able to explain their dynamics. Their approach reproduces the dynamics of the empirical change patterns and provides extrapolations into the future to be used for forecasting. Time series analysis is applied in [Zhang et al., 2009] to develop models for the analysis of Web searchers' behaviors over time. The predictive models are based on the dynamic patterns of the interactions between users and search engines.

LOD Cloud Many of the works dedicated to crawling the LOD cloud, such as LDSpider [Isele et al., 2010] and LOD Laundromat [Beek et al., 2014], focus on different issues, such as


the crawler architecture, topic coverage, crawler parallelization [Lašek et al., 2012], etc. The importance of caching for efficiently querying Linked Data is analyzed by Hartig [Hartig, 2013]. Query execution is based on traversing RDF links to discover data that might be relevant for a query during the query execution itself. Data is cached and used for further queries. Caching shows some beneficial impact on improving the completeness of the results. In [Lampo et al., 2011] the authors investigate which types of SPARQL queries can benefit from caching data during query execution or from warming up the cache. They demonstrate that caching can have a positive effect on complex SPARQL queries. Umbrich et al. [Umbrich et al., 2012] propose a hybrid approach for answering SPARQL queries, i.e., deciding which parts of a query are suitable for local or remote execution. They estimate the freshness of cached data using the notion of coherence for triple patterns against the live engine. Only static data is kept locally, and they determine which parts of the query should be evaluated remotely and which locally. Dehghanzadeh et al. [Dehghanzadeh et al., 2014] extend this approach by extending the statistics of the cardinality estimation techniques that are used in the join query processing phase. Roussakis et al. [Roussakis et al., 2015] propose an approach that copes with the automatic identification of deltas between versions, which prescribes (i) the definition of custom, application-specific changes and their management (definition, storage, detection) in a manner that ensures the satisfaction of formal properties, like completeness and unambiguity, (ii) the flexibility and customization of the considered changes, via complex changes that can be defined at run-time, and (iii) the easy configuration of a scalable detection mechanism, via a generic algorithm that builds upon SPARQL queries easily generated from the change definitions.

6.4.7. Findings and Research Contribution

We proposed and evaluated scheduling strategies for updating a local data cache of a large-scale LOD dataset that was obtained from the cloud by weekly crawls over the course of three years. In a first setup, where we evaluated the quality of update strategies for a single and isolated update of a local data cache, we observed that update strategies based on dynamics or change rate made the best use of limited resources in terms of bandwidth. For very low relative bandwidth, the strategies based on data source dynamics provided better results. With only 15% available bandwidth, we observed improvements of precision and recall for dynamics from 0.862 to 0.924 and from 0.892 to 0.957, respectively. In a second evaluation setup, we evaluated the behavior of the strategies in a realistic scenario (e.g., a LOD search engine updating its caches), which involves measuring the quality of the local data cache when considering iterative updates over a longer period of time. Overall, the results of this experiment setup confirmed the discussion from the previous one: especially for low relative bandwidth, update strategies based on dynamics or change rate are more appropriate than the other strategies for supporting applications in fetching the most changed data (and thus avoiding fetching unchanged data). In future work, we plan to investigate the impact on performance when combining different update strategies as well as to consider further evaluation setups such as the cold start setup, that is, we will measure how good an update strategy is when starting from an empty cache and considering iterative updates over a longer period of time.

Furthermore, we plan to extend these strategies to consider the availability of the LOD sources over time, namely, to be able to differentiate whether a source is unavailable for a period of time or has been definitively deleted from the cloud.
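To illustrate the kind of scheduling strategy evaluated in this section, the following Python sketch greedily selects the sources with the highest precomputed scores (e.g., dynamics or change-rate values) until a bandwidth budget is exhausted. The source names, scores, and cost model are hypothetical; this is a sketch of the general idea, not the exact implementation used in the experiments.

def schedule_updates(sources, bandwidth_budget):
    """Pick sources to re-fetch within a bandwidth budget.

    sources: list of dicts with keys
             'name'  - source identifier,
             'score' - preference value, e.g. a dynamics or change-rate value,
             'cost'  - estimated fetch cost (e.g. number of triples).
    Returns the names of the sources selected for this update iteration.
    """
    selected, used = [], 0
    # Greedily prefer the sources that are expected to have changed the most.
    for src in sorted(sources, key=lambda s: s["score"], reverse=True):
        if used + src["cost"] <= bandwidth_budget:
            selected.append(src["name"])
            used += src["cost"]
    return selected

# Hypothetical example: 15% of the total fetch cost available as bandwidth.
sources = [
    {"name": "http://example.org/a", "score": 0.92, "cost": 40},
    {"name": "http://example.org/b", "score": 0.15, "cost": 25},
    {"name": "http://example.org/c", "score": 0.57, "cost": 10},
]
budget = 0.15 * sum(s["cost"] for s in sources)
print(schedule_updates(sources, budget))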


7. Conclusion and Further Directions

My work demonstrates by example the importance and benefits of using provenance in different Web applications and scenarios. In particular, this dissertation presents mechanisms to manage and use provenance in its many dimensions when querying, debugging or repairing, ranking, and updating caches of Web data. The approaches presented clearly show the value of provenance, as it enables new kinds of data analysis and quality assessment on top of the raw data. The flexibility of these approaches, combined with their high scalability, makes my work a possible building block for a Semantic Web proof and trust layer. Furthermore, this dissertation can be seen as a motivation for Web sources to publish correct and valid provenance values to support application needs. I believe that publishing correct provenance information is a step towards quality-oriented data usage on the Web. In the following, the findings and future directions of this research are listed.

7.1. Findings

Provenance Management in the Semantic Web

Querying RDF Datasets with Provenance

This chapter presented an original, generic, formalized, and implemented approach for the management of many dimensions of provenance, like source, authorship, certainty, and others, for RDF repositories. This method re-uses existing RDF modeling possibilities in order to represent provenance. Then, it extends SPARQL query processing in such a way that given a SPARQL query for data, one may request provenance without modifying the query proper. This approach remains compatible with existing standards and query languages and can be easily integrated with existing applications and interfaces. It achieves highly flexible and automatically coordinated querying for data and provenance, while completely separating the two areas of concern.
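To illustrate the idea summarized above, the following is a minimal Python sketch that uses a toy in-memory list of quads instead of a real RDF store: every triple carries the identifier of a named graph, provenance statements are ordinary statements about that identifier, and the same data query can be answered with or without the associated provenance. All identifiers and the matching logic are illustrative assumptions, not the implementation developed in this thesis.

# Quads: (subject, predicate, object, graph); provenance is ordinary RDF about the graph id.
quads = [
    ("ex:alice", "ex:worksAt", "ex:acme", "ex:g1"),
    ("ex:bob",   "ex:worksAt", "ex:acme", "ex:g2"),
    ("ex:g1", "prov:source",  "http://hr.example.org/",   "ex:meta"),
    ("ex:g1", "ex:certainty", "0.9",                      "ex:meta"),
    ("ex:g2", "prov:source",  "http://blog.example.org/", "ex:meta"),
]

def match(pattern):
    """Return bindings of a single triple pattern (None acts as a variable)."""
    s, p, o = pattern
    for qs, qp, qo, qg in quads:
        if (s is None or s == qs) and (p is None or p == qp) and (o is None or o == qo):
            yield {"s": qs, "p": qp, "o": qo, "graph": qg}

def match_with_provenance(pattern):
    """Same data query, but each answer is enriched with statements about its graph."""
    for binding in match(pattern):
        prov = {qp: qo for qs, qp, qo, _ in quads if qs == binding["graph"]}
        yield binding, prov

for binding, prov in match_with_provenance((None, "ex:worksAt", "ex:acme")):
    print(binding["s"], "->", prov)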


Reasoning and Debugging Evolving OWL Ontologies with Provenance

This chapter described a black-box algorithm for optimized reasoning with provenance. With the proposed algorithm, one can efficiently reason in OWL ontologies with provenance, i.e., provenance is efficiently combined and propagated within the reasoning process. The black-box algorithm therefore enables the use of provenance in real time for very large and expressive ontologies. The algorithm computes the explanations of an answer and uses the pinpointing formula to compute provenance in an algebraic way. However, it does not need to compute all pinpoints, as the computation of pinpoints may become very expensive and inapplicable if users need to interact with dynamically changing knowledge in real time. The optimized algorithm for deriving provenance performs significantly better in the average case than naïve implementations. It does not need to compute all pinpoints, and in fact does not even need to pinpoint precisely. Instead, it computes an approximation which is sufficient for deriving provenance. The evaluation has shown that my approach performs significantly better than existing general pinpointing algorithms, scales well, and is applicable to interactive applications. It has demonstrated that the algorithm performs orders of magnitude better than a naïve implementation. Therefore, provenance information does not only provide value to the end user; it can also be used to considerably speed up debugging processes by rapidly approximating a solution. The framework presented was restricted to ontology diagnosis scenarios. It, however, introduces an approach for provenance querying under a variety of scenarios that consider many of the dimensions of provenance, such as restrictions of access rights, knowledge validity when the truth of knowledge changes with time, and the inference of trust values.
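The algebraic combination step can be pictured with a small Python sketch: given the justifications (minimal axiom sets) that entail a statement, axiom-level provenance values are combined with an "and"-like operator within a justification and an "or"-like operator across justifications. The concrete operators (min and max over trust values), the axiom identifiers, and the trust values below are illustrative assumptions; the optimized algorithm of this thesis additionally avoids computing all pinpoints exactly.

def provenance_of_entailment(justifications, axiom_provenance,
                             combine=min, merge=max):
    """Evaluate a pinpointing-style formula over axiom-level provenance.

    justifications:   iterable of axiom-id sets, each set entailing the statement.
    axiom_provenance: maps axiom id -> provenance value (here: a trust value in [0, 1]).
    combine:          'and'-like operator applied within one justification.
    merge:            'or'-like operator applied across justifications.
    """
    per_justification = [
        combine(axiom_provenance[a] for a in just) for just in justifications
    ]
    return merge(per_justification)

# Hypothetical ontology: the entailment has two justifications, {ax1, ax2} and {ax3}.
trust = {"ax1": 0.9, "ax2": 0.4, "ax3": 0.7}
print(provenance_of_entailment([{"ax1", "ax2"}, {"ax3"}], trust))  # -> 0.7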

Using Provenance in Semantic Web Applications

An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation

In this chapter, we have proposed three algorithms that provide the most relevant news feeds according to the user's preferences. Provenance is used to represent users' preferences. Our algorithms rank messages 'on the fly' as the messages pass through the system. We have introduced the concept of online preference aggregators and investigated the consequences of adding a new element to the preference orders. We have studied the relationships between the original aggregated preference order and the updated aggregated preference order, and established a framework for computational approaches to online preference aggregation. Concrete online aggregation algorithms and a complexity analysis have been presented for the plurality, Borda count, and sequential pairwise voting methods. Our complexity analysis has shown that the online aggregators perform better than the original aggregators after the domain changes. The computation of the original aggregators has time complexity O(n · m), and we have shown that the online plurality aggregator has time complexity O(m), the online Borda count aggregator O(m · log n), and the sequential pairwise aggregator O(m).

Managing Data Changes in the Linked Open Data Sources

This chapter presented two techniques for dealing with data dynamics in LOD sources using provenance information. First, it presented an evaluation of the conformance of LOD data sources in providing a valid and correct Last-Modified HTTP header field, which indicates the date and time at which the resource was last modified. The experiment shows that overall and on average only 8% of the resources in the datasets provide correct values for this field. This number is far too low to be of use for any practical application. It is, however, not clear why LOD sources do not provide valid information. We conjecture that some default configuration of LOD servers leads to this misbehavior.


This analysis is restricted to the Last-Modified field; however, it could easily be extended to verify the conformance of other HTTP header fields. The reliable provision of provenance information in the context of the established HTTP protocol would be beneficial to the entire Web of Data. Many base technologies such as Linked Data caches and indexes may benefit from this information, since a simple check on this provenance could support their decision process of determining which sources need to be updated. We believe that publishing correct HTTP header information is a step towards quality-oriented data usage in the LOD cloud. Therefore, with this work we point out the dimension of the problem of erroneous and missing information in the HTTP headers of Linked Data sources. My work, therefore, motivates LOD sources to publish correct and valid values to support application needs.

Then, this chapter presented a general and flexible framework for analyzing data dynamics on the LOD cloud. Different from quantifying changes of datasets, the dynamics capture the evolution of a dataset over time. The dynamics function is defined as the aggregation of absolute, infinitesimal changes, where such changes may be quantified by the different change metrics existing in the literature. Furthermore, this method can be parameterized to make use of different decay functions for stressing or weakening changes as time passes.

The dynamics function is evaluated for the purpose of data caching on a large-scale LOD dataset that was obtained from the cloud by weekly crawls over the course of three years. The evaluation includes different scheduling strategies and investigates two different setups: (i) in the single step setup, the quality of update strategies for a single and isolated update of a local data cache is evaluated, while (ii) in the iterative progression setup, the evaluation involves measuring the quality of the local data cache when considering iterative updates over a longer period of time. Mainly, the evaluation shows that the measures capturing the change behavior of LOD sources over time are most suitable for conducting updates. In the first setup, it was observed that the update strategies based on dynamics or change rate make the best use of limited resources in terms of bandwidth. For very low relative bandwidth, the strategies based on data source dynamics provide better results. With only 15% available bandwidth, we observed improvements of precision and recall for dynamics from 0.862 to 0.924 and from 0.892 to 0.957, respectively. In the second evaluation setup, the overall results confirm the discussion from the previous one. The strategies based on the dynamics features, followed by the ones based on change rate, are the more appropriate ones if one needs to predict (iterative) updates. Certainly, the more bandwidth is available, the more changes can be captured; therefore the rate of quality loss over the iterations is lower for all strategies. Still, even after four iterations, the Dynamics strategies (followed by ChangeRate) were able to better maintain an up-to-date local copy for all different bandwidth levels. From the experiments and for low relative bandwidth, these strategies could definitely better support applications in fetching the most changed data (and thus avoiding fetching unchanged data) than the other strategies.
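To make the definition of the dynamics function above concrete, the following Python sketch aggregates the changes between consecutive snapshots of a source, weighted by a decay function. The set-based change metric, the decay constant, and the sample data are illustrative assumptions standing in for the change metrics from the literature.

import math

def set_change(old_triples, new_triples):
    # Stand-in change metric: fraction of triples added or removed between snapshots.
    union = old_triples | new_triples
    return len(old_triples ^ new_triples) / len(union) if union else 0.0

def dynamics(snapshots, timestamps, decay=lambda age: math.exp(-0.05 * age)):
    """Aggregate the changes of a dataset over time into one dynamics value.

    snapshots:  list of sets of triples, ordered by time.
    timestamps: crawl times (e.g. in days), same length as snapshots.
    decay:      weights each change by how long ago it happened (age in days).
    """
    now = timestamps[-1]
    value = 0.0
    for i in range(1, len(snapshots)):
        change = set_change(snapshots[i - 1], snapshots[i])
        value += decay(now - timestamps[i]) * change
    return value

# Hypothetical weekly crawls of one source.
snaps = [{"t1", "t2"}, {"t1", "t2", "t3"}, {"t1", "t3"}]
print(round(dynamics(snaps, [0, 7, 14]), 4))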


7.2. Future Directions

Querying RDF Datasets with Provenance

Management of provenance incurs costs for its collection and its storage. In general, provenance information can grow to be larger than the data it describes if the data is fine-grained and the provenance information rich. Hence, the manner in which provenance is propagated along the computation is crucial for scalability. Recent works tackle some of these issues. In [Wylot et al., 2014, Wylot et al., 2015a], the authors present TripleProv, a native RDF store that allows tracking and querying provenance over Web data. Later, in [Wylot et al., 2015b], Wylot et al. use the TripleProv store to investigate the effectiveness of different query execution strategies for provenance-enabled queries. In [Deutch et al., 2015b], Deutch et al. describe an approach for Web-scale provenance tracking. Their approach allows selective tracking of how-provenance, where the selection criteria are partly based on the meta-data itself (thus significantly reducing provenance size). Summarization techniques for provenance graphs have been presented by Moreau [Moreau, 2015]. Provenance summaries can be considered as "compact user views" and have good potential for detecting anomalies and outliers.

In order to extend our approach to fully capture the expressiveness of SPARQL (including the OPTIONAL construct), the provenance model has to be extended or changed. In [Theoharis et al., 2011, Geerts et al., 2013, Karvounarakis et al., 2013], Theoharis et al. and Karvounarakis et al. discuss the need to extend relational provenance models so that they can be leveraged for SPARQL queries over RDF. In particular, they argue that semiring models have been shown to be inadequate for handling the OPTIONAL construct, and they advocate the need for a new abstract provenance model capturing the full expressiveness of SPARQL. In addition, Geerts et al. [Geerts et al., 2016] identify SPARQL fragments for which provenance models for positive relational queries can be leveraged, despite the subtle differences between the semantics of SPARQL and relational algebra operators. In [Amsterdamer et al., 2011c], the authors exhibit particular semirings for which an extension supporting difference is impossible.

Reasoning and Debugging Evolving OWL Ontologies with Provenance

This work has introduced an optimized algorithm for tracking undesired inferences and inconsistencies using provenance when answering queries over evolving ontologies on the Semantic Web. Further optimizations of the provenance selection criteria are possible, such as an extension towards a possibilistic logic, e.g., based on [Knechtel and Peñaloza, 2010, Baader et al., 2009], and an integration with a system for tracking ontology changes. Future plans include applying this approach to provenance to other logical formalisms beyond the DL SRIQ(D).

An Efficient Provenance-Aware News Feed Ranking Algorithm via Preference Aggregation

Standard voting methods require, in the worst case, the complete re-computation of the aggregation after changes in the domain. This chapter has shown, however, that for the plurality, Borda count, and sequential pairwise voting aggregators, the dynamic setting can be handled better. Future work includes an empirical evaluation to analyze how the proposed algorithms behave in real use case scenarios.
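For illustration, the following Python sketch shows a naive (non-incremental) aggregation of provenance-based preference orders with plurality and Borda count; the online variants developed in this thesis additionally avoid recomputing the whole aggregation when a new message arrives. The message identifiers and preference orders are hypothetical.

from collections import defaultdict

def plurality(preference_orders):
    """Each preference order (best first) gives one vote to its top-ranked message."""
    votes = defaultdict(int)
    for order in preference_orders:
        votes[order[0]] += 1
    return sorted(votes, key=votes.get, reverse=True)

def borda(preference_orders):
    """A message ranked at position i in an order of length n receives n - 1 - i points."""
    scores = defaultdict(int)
    for order in preference_orders:
        n = len(order)
        for i, msg in enumerate(order):
            scores[msg] += n - 1 - i
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical provenance dimensions: recency, popularity, source reliability.
orders = [
    ["m3", "m1", "m2"],   # by recency
    ["m1", "m2", "m3"],   # by popularity
    ["m1", "m3", "m2"],   # by source reliability
]
print(plurality(orders))  # -> ['m1', 'm3']
print(borda(orders))      # -> ['m1', 'm3', 'm2']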


Managing Data Changes in the Linked Open Data Sources

This chapter first presented an analysis of the availability and conformance of the HTTP header's Last-Modified field. This analysis is restricted to the Last-Modified field; however, in future work it could be extended to verify the conformance of other HTTP header fields.

Then, it presented a function to measure the dynamics of an RDF dataset that captures the evolution of the dataset over time. This research could be further explored to approximate the change rate function based on piecewise linear functions, polynomial interpolation, and cubic splines over the observations of changes at discrete points in time (as sketched below). However, the benefit of these more sophisticated approximations needs to be evaluated in different real-world scenarios.

Finally, this chapter described an evaluation of different update scheduling strategies in two different setups. Future work includes the investigation of the impact on performance when combining different update strategies, as well as the consideration of further evaluation setups such as the cold start setup, in order to measure how good an update strategy is when starting from an empty cache and considering iterative updates over a longer period of time.
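The approximation idea mentioned above can be sketched as follows: given change observations at discrete crawl times, piecewise linear interpolation and a cubic spline both yield continuous estimates of the change behavior between crawls. The observation values are hypothetical, the sketch assumes NumPy and SciPy are available, and whether such smoother approximations pay off is precisely the open question stated above.

import numpy as np
from scipy.interpolate import CubicSpline

# Hypothetical observations: change metric values measured at weekly crawls (days).
days = np.array([0.0, 7.0, 14.0, 21.0, 28.0])
changes = np.array([0.10, 0.32, 0.05, 0.41, 0.22])

query = np.linspace(0.0, 28.0, 8)                     # points between crawls

linear_estimate = np.interp(query, days, changes)     # piecewise linear approximation
spline_estimate = CubicSpline(days, changes)(query)   # cubic spline approximation

for t, lin, spl in zip(query, linear_estimate, spline_estimate):
    print(f"day {t:5.1f}: linear {lin:.3f}  spline {spl:.3f}")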


Bibliography

[DBL, 2002] (2002). VLDB 2002, Proceedings of 28th International Conference on Very Large Data Bases, August 20-23, 2002, Hong Kong, China. Morgan Kaufmann. [PRO, 2012] (2012). Prov graph layout conventions, technical note. Technical report, PROV Working Group - World Wide Web Consortium. [Abedjan et al., 2014] Abedjan, Z., Gr¨utze,T., Jentzsch, A., and Naumann, F. (2014). Pro- filing and mining RDF data with prolod++. In Cruz, I. F., Ferrari, E., Tao, Y., Bertino, E., and Trajcevski, G., editors, IEEE 30th International Conference on Data Engineering, Chicago, ICDE 2014, IL, USA, March 31 - April 4, 2014, pages 1198–1201. IEEE. [Adler and de Alfaro, 2007a] Adler, B. T. and de Alfaro, L. (2007a). A content-driven rep- utation system for the wikipedia. In Proceedings of the 16th International Conference on World Wide Web, WWW ’07, pages 261–270, New York, NY, USA. ACM. [Adler and de Alfaro, 2007b] Adler, B. T. and de Alfaro, L. (2007b). A content-driven repu- tation system for the wikipedia. In [Williamson et al., 2007], pages 261–270. [Agrawal et al., 2006] Agrawal, P., Benjelloun, O., Sarma, A. D., Hayworth, C., Nabar, S. U., Sugihara, T., and Widom, J. (2006). Trio: A system for data, uncertainty, and lineage. In [Dayal et al., 2006], pages 1151–1154. [Aldeco-P´erezand Moreau, 2010] Aldeco-P´erez,R. and Moreau, L. (2010). A Provenance- Based Compliance Framework, pages 128–137. Springer Berlin Heidelberg, Berlin, Heidel- berg. [Alexander and Hausenblas, 2009] Alexander, K. and Hausenblas, M. (2009). Describing linked datasets - on the design and usage of void, the ’vocabulary of interlinked datasets. In In Linked Data on the Web Workshop (LDOW 09), in conjunction with 18th International World Wide Web Conference (WWW 09. [Amann et al., 2013] Amann, B., Constantin, C., Caron, C., and Giroux, P. (2013). Weblab prov: Computing fine-grained provenance links for xml artifacts. In Proceedings of the Joint EDBT/ICDT 2013 Workshops, EDBT ’13, pages 298–306, New York, NY, USA. ACM. [Amsterdamer et al., 2011a] Amsterdamer, Y., Davidson, S. B., Deutch, D., Milo, T., Stoy- anovich, J., and Tannen, V. (2011a). Putting lipstick on pig: Enabling database-style work- flow provenance. PVLDB, 5(4):346–357. [Amsterdamer et al., 2011b] Amsterdamer, Y., Deutch, D., Milo, T., and Tannen, V. (2011b). On provenance minimization. In Lenzerini, M. and Schwentick, T., editors, Proceedings of the 30th ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS 2011, June 12-16, 2011, Athens, Greece, pages 141–152. ACM. [Amsterdamer et al., 2012] Amsterdamer, Y., Deutch, D., Milo, T., and Tannen, V. (2012). On provenance minimization. ACM Trans. Database Syst., 37(4):30.


[Amsterdamer et al., 2011c] Amsterdamer, Y., Deutch, D., and Tannen, V. (2011c). On the limitations of provenance for queries with difference. In Buneman, P. and Freire, J., edi- tors, 3rd Workshop on the Theory and Practice of Provenance, TaPP’11, Heraklion, Crete, Greece, June 20-21, 2011. USENIX Association. [Analyti et al., 2014] Analyti, A., Dam´asio,C. V., Antoniou, G., and Pachoulakis, I. (2014). Why-provenance information for rdf, rules, and negation. Ann. Math. Artif. Intell., 70(3):221–277. [Archer et al., 2013] Archer, D. W., Delcambre, L. M. L., and Maier, D. (2013). User trust and judgments in a curated database with explicit provenance. In [Tannen et al., 2013], pages 89–111. [Arenas et al., 2009] Arenas, M., Gutierrez, C., and P´erez,J. (2009). On the semantics of SPARQL. In Virgilio, R. D., Giunchiglia, F., and Tanca, L., editors, Semantic Web Infor- mation Management - A Model-Based Perspective, pages 281–307. Springer. [Aroyo et al., 2010] Aroyo, L., Antoniou, G., Hyv¨onen,E., ten Teije, A., Stuckenschmidt, H., Cabral, L., and Tudorache, T., editors (2010). The Semantic Web: Research and Applica- tions, 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 - June 3, 2010, Proceedings, Part I, volume 6088 of Lecture Notes in Computer Science. Springer. [Aroyo et al., 2009] Aroyo, L., Traverso, P., Ciravegna, F., Cimiano, P., Heath, T., Hyv¨onen, E., Mizoguchi, R., Oren, E., Sabou, M., and Simperl, E. P. B., editors (2009). The Semantic Web: Research and Applications, 6th European Semantic Web Conference, ESWC 2009, Heraklion, Crete, Greece, May 31-June 4, 2009, Proceedings, volume 5554 of Lecture Notes in Computer Science. Springer. [Arrow, 1950] Arrow, K. J. (1950). A Difficulty in the Concept of Social Welfare. Journal of Political Economy, 58(4):328–346. [Artale et al., 2009] Artale, A., Kontchakov, R., Ryzhikov, V., and Zakharyaschev, M. (2009). DL-Lite with temporalised concepts, rigid axioms and roles. In Ghilardi, S. and Sebas- tiani, R., editors, Frontiers of Combining Systems, 7th International Symposium, FroCoS 2009, Trento, Italy, September 16-18, 2009. Proceedings, volume 5749 of Lecture Notes in Computer Science, pages 133–148. Springer. [Artale et al., 2013] Artale, A., Kontchakov, R., Wolter, F., and Zakharyaschev, M. (2013). Temporal description logic for ontology-based data access. In Rossi, F., editor, IJCAI 2013, Proceedings of the 23rd International Joint Conference on Artificial Intelligence, Beijing, China, August 3-9, 2013. IJCAI/AAAI. [Artale et al., 2007] Artale, A., Lutz, C., and Toman, D. (2007). A description logic of change. In Veloso, M. M., editor, IJCAI 2007, Proceedings of the 20th International Joint Conference on Artificial Intelligence, Hyderabad, India, January 6-12, 2007, pages 218–223. [Assaf et al., 2015a] Assaf, A., Senart, A., and Troncy, R. (2015a). What’s up LOD cloud? observing the state of linked open data cloud metadata. In Rula, A., Zaveri, A., Knuth, M., and Kontokostas, D., editors, Proceedings of the 2nd Workshop on Linked Data Quality co-located with 12th Extended Semantic Web Conference (ESWC 2015), Portoroˇz,Slovenia, June 1, 2015., volume 1376 of CEUR Workshop Proceedings. CEUR-WS.org.


[Assaf et al., 2015b] Assaf, A., Troncy, R., and Senart, A. (2015b). Roomba: An extensible framework to validate and build dataset profiles. In Berendt, B., Dragan, L., Hollink, L., Luczak-R¨osch, M., Demidova, E., Dietze, S., Szymanski, J., and Breslin, J. G., editors, Joint Proceedings of the 5th International Workshop on Using the Web in the Age of Data (USEWOD ’15) and the 2nd International Workshop on Dataset PROFIling and fEderated Search for Linked Data (PROFILES ’15) co-located with the 12th European Semantic Web Conference (ESWC 2015), Portoroˇz,Slovenia, May 31 - June 1, 2015., volume 1362 of CEUR Workshop Proceedings, pages 32–46. CEUR-WS.org. [Auer et al., 2012] Auer, S., Demter, J., Martin, M., and Lehmann, J. (2012). Lodstats - an extensible framework for high-performance dataset analytics. In [ten Teije et al., 2012], pages 353–362. [Avgoustaki et al., 2016] Avgoustaki, A., Flouris, G., Fundulaki, I., and Plexousakis, D. (2016). Provenance management for evolving RDF datasets. In Sack, H., Blomqvist, E., d’Aquin, M., Ghidini, C., Ponzetto, S. P., and Lange, C., editors, The Semantic Web. Lat- est Advances and New Domains - 13th International Conference, ESWC 2016, Heraklion, Crete, Greece, May 29 - June 2, 2016, Proceedings, volume 9678 of Lecture Notes in Com- puter Science, pages 575–592. Springer. [Baader et al., 2013] Baader, F., Borgwardt, S., and Lippmann, M. (2013). Temporalizing ontology-based data access. In Proceedings of the 24th International Conference on Auto- mated Deduction, CADE’13, pages 330–344, Berlin, Heidelberg. Springer-Verlag. [Baader et al., 2003] Baader, F., Calvanese, D., McGuinness, D. L., Nardi, D., and Patel- Schneider, P. F., editors (2003). The Description Logic Handbook: Theory, Implementation, and Applications. Cambridge University Press, New York, NY, USA. [Baader et al., 2009] Baader, F., Knechtel, M., and Pe˜naloza,R. (2009). A generic approach for large-scale ontological reasoning in the presence of access restrictions to the ontology’s axioms. In [Bernstein et al., 2009], pages 49–64. [Baader and Pe˜naloza,2010] Baader, F. and Pe˜naloza,R. (2010). Axiom pinpointing in gen- eral tableaux. J. Log. Comput., 20(1):5–34. [Bao et al., 2012] Bao, Z., Davidson, S. B., and Milo, T. (2012). Labeling workflow views with fine-grained dependencies. PVLDB, 5(11):1208–1219. [Barbieri et al., 2010a] Barbieri, D. F., Braga, D., Ceri, S., Valle, E. D., and Grossniklaus, M. (2010a). C-SPARQL: a continuous query language for RDF data streams. Int. J. Semantic Computing, 4(1):3–25. [Barbieri et al., 2010b] Barbieri, D. F., Braga, D., Ceri, S., Valle, E. D., and Grossniklaus, M. (2010b). Incremental reasoning on streams and rich background knowledge. In [Aroyo et al., 2010], pages 1–15. [Batsakis et al., 2015] Batsakis, S., Tachmazidis, I., and Antoniou, G. (2015). Representing time for the semantic web. In Bikakis, A. and Zheng, X., editors, Multi-disciplinary Trends in Artificial Intelligence - 9th International Workshop, MIWAI 2015, Fuzhou, China, November 13-15, 2015, Proceedings, volume 9426 of Lecture Notes in Computer Science, pages 3–15. Springer. [Beek et al., 2014] Beek, W., Rietveld, L., Bazoobandi, H. R., Wielemaker, J., and Schlobach, S. (2014). LOD laundromat: A uniform way of publishing other people’s dirty data. In [Mika et al., 2014], pages 213–228.


[Belhajjame et al., 2013a] Belhajjame, K., B’Far, R., Cheney, J., Coppens, S., Cresswell, S., Gil, Y., Groth, P., Klyne, G., Lebo, T., McCusker, J., Miles, S., James, Sahoo, S., and Tilmes, C. (2013a). Prov-dm: The prov data model. Technical report, W3C Recommendation. [Belhajjame et al., 2013b] Belhajjame, K., Cheney, J., Corsar, D., Garijo, D., Soiland-Reyes, S., Zednik, S., and Zhao, J. (2013b). Prov-o: The prov ontology. Technical report, W3C Recommendation. [Berners-Lee, 2006] Berners-Lee, T. (2006). Linked Data - W3C Design Issues . Technical report, W3C. [Bernstein et al., 2009] Bernstein, A., Karger, D. R., Heath, T., Feigenbaum, L., Maynard, D., Motta, E., and Thirunarayan, K., editors (2009). The Semantic Web - ISWC 2009, 8th International Semantic Web Conference, ISWC 2009, Chantilly, VA, USA, October 25-29, 2009. Proceedings, volume 5823 of Lecture Notes in Computer Science. Springer. [Bhagwat et al., 2005] Bhagwat, D., Chiticariu, L., Tan, W. C., and Vijayvargiya, G. (2005). An annotation management system for relational databases. VLDB J., 14(4):373–396. [Bienvenu et al., 2012] Bienvenu, M., Deutch, D., and Suchanek, F. M. (2012). Provenance for web 2.0 data. In Jonker, W. and Petkovic, M., editors, Secure Data Management - 9th VLDB Workshop, SDM 2012, Istanbul, Turkey, August 27, 2012. Proceedings, volume 7482 of Lecture Notes in Computer Science, pages 148–155. Springer. [Bistarelli et al., 2008] Bistarelli, S., Martinelli, F., and Santini, F. (2008). A semantic foun- dation for trust management languages with weights: An application to the rtfamily. In Rong, C., Jaatun, M. G., Sandnes, F. E., Yang, L. T., and Ma, J., editors, Autonomic and Trusted Computing, 5th International Conference, ATC 2008, Oslo, Norway, June 23- 25, 2008, Proceedings, volume 5060 of Lecture Notes in Computer Science, pages 481–495. Springer. [Biton et al., 2007] Biton, O., Boulakia, S. C., and Davidson, S. B. (2007). Zoom*userviews: Querying relevant provenance in workflow systems. In Koch, C., Gehrke, J., Garofalakis, M. N., Srivastava, D., Aberer, K., Deshpande, A., Florescu, D., Chan, C. Y., Ganti, V., Kanne, C., Klas, W., and Neuhold, E. J., editors, Proceedings of the 33rd International Conference on Very Large Data Bases, University of Vienna, Austria, September 23-27, 2007, pages 1366–1369. ACM. [Bizer and Cyganiak, 2014] Bizer, C. and Cyganiak, R. (2014). RDF 1.1 TriG - RDF Dataset Language. . W3c recommendation, W3C. [Bizer et al., 2009] Bizer, C., Heath, T., and Berners-Lee, T. (2009). Linked data - the story so far. Int. J. Semantic Web Inf. Syst., 5(3):1–22. [B¨ohmet al., 2011] B¨ohm,C., Lorey, J., and Naumann, F. (2011). Creating void descriptions for web-scale data. J. Web Sem., 9(3):339–345. [Bolles et al., 2008] Bolles, A., Grawunder, M., and Jacobi, J. (2008). Streaming SPARQL - extending SPARQL to process data streams. In Bechhofer, S., Hauswirth, M., Hoffmann, J., and Koubarakis, M., editors, The Semantic Web: Research and Applications, 5th European Semantic Web Conference, ESWC 2008, Tenerife, Canary Islands, Spain, June 1-5, 2008, Proceedings, volume 5021 of Lecture Notes in Computer Science, pages 448–462. Springer. [Bonatti et al., 2011] Bonatti, P. A., Hogan, A., Polleres, A., and Sauro, L. (2011). Robust


and scalable linked data reasoning incorporating provenance and trust annotations. J. Web Sem., 9(2):165–201. [Booth et al., 2006] Booth, R., Meyer, T. A., and Wong, K. (2006). A bad day surfing is better than a good day working: How to revise a total preorder. In [Doherty et al., 2006], pages 230–238. [Bose and Frew, 2005] Bose, R. and Frew, J. (2005). Lineage retrieval for scientific data pro- cessing: a survey. ACM Comput. Surv., 37(1):1–28. [Boulakia and Tan, 2009] Boulakia, S. C. and Tan, W. C. (2009). Provenance in scientific databases. In Liu, L. and Ozsu,¨ M. T., editors, Encyclopedia of Database Systems, pages 2202–2207. Springer US. [Bowers and Lud¨ascher, 2005] Bowers, S. and Lud¨ascher, B. (2005). Actor-oriented design of scientific workflows. In Delcambre, L. M. L., Kop, C., Mayr, H. C., Mylopoulos, J., and Pastor, O., editors, Conceptual Modeling - ER 2005, 24th International Conference on Conceptual Modeling, Klagenfurt, Austria, October 24-28, 2005, Proceedings, volume 3716 of Lecture Notes in Computer Science, pages 369–384. Springer. [Brafman and Domshlak, 2009] Brafman, R. I. and Domshlak, C. (2009). Preference handling - an introductory tutorial. AI Magazine, 30(1):58–86. [Brewington and Cybenko, 2000a] Brewington, B. E. and Cybenko, G. (2000a). How dynamic is the web? Computer Networks, 33(1-6):257–276. [Brewington and Cybenko, 2000b] Brewington, B. E. and Cybenko, G. (2000b). Keeping up with the changing web. IEEE Computer, 33(5):52–58. [Bruno et al., 2002] Bruno, N., Chaudhuri, S., and Gravano, L. (2002). Top-k selection queries over relational databases: Mapping strategies and performance evaluation. ACM Trans. Database Syst., 27(2):153–187. [Buneman, 2013] Buneman, P. (2013). The of provenance. In Gottlob, G., Grasso, G., Olteanu, D., and Schallhart, C., editors, Big Data - 29th British National Conference on Databases, BNCOD 2013, Oxford, UK, July 8-10, 2013. Proceedings, volume 7968 of Lecture Notes in Computer Science, pages 7–12. Springer. [Buneman et al., 2006] Buneman, P., Chapman, A., and Cheney, J. (2006). Provenance man- agement in curated databases. In [Chaudhuri et al., 2006], pages 539–550. [Buneman et al., 2012] Buneman, P., Cheney, J., and Kostylev, E. V. (2012). Hierarchical models of provenance. In Acar, U. A. and Green, T. J., editors, 4th Workshop on the The- ory and Practice of Provenance, TaPP’12, Boston, MA, USA, June 14-15, 2012. USENIX Association. [Buneman et al., 2008] Buneman, P., Cheney, J., and Vansummeren, S. (2008). On the ex- pressiveness of implicit provenance in query and update languages. ACM Trans. Database Syst., 33(4). [Buneman et al., 2000] Buneman, P., Khanna, S., and Tan, W. C. (2000). Data provenance: Some basic issues. In Kapoor, S. and Prasad, S., editors, Foundations of Software Technology and Theoretical Computer Science, 20th Conference, FST TCS 2000 New Delhi, India, December 13-15, 2000, Proceedings., volume 1974 of Lecture Notes in Computer Science, pages 87–93. Springer.


[Buneman et al., 2001] Buneman, P., Khanna, S., and Tan, W. C. (2001). Why and where: A characterization of data provenance. In den Bussche, J. V. and Vianu, V., editors, Database Theory - ICDT 2001, 8th International Conference, London, UK, January 4-6, 2001, Pro- ceedings., volume 1973 of Lecture Notes in Computer Science, pages 316–330. Springer. [Buneman et al., 2002] Buneman, P., Khanna, S., and Tan, W. C. (2002). On propagation of deletions and annotations through views. In Popa, L., Abiteboul, S., and Kolaitis, P. G., editors, Proceedings of the Twenty-first ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems, June 3-5, Madison, Wisconsin, USA, pages 150–158. ACM. [Buneman and Kostylev, 2010] Buneman, P. and Kostylev, E. V. (2010). Annotation alge- bras for RDFS. In The Second International Workshop on the role of Semantic Web in Provenance Management (SWPM-10). CEUR Workshop Proceedings. [Callahan et al., 2006] Callahan, S. P., Freire, J., Santos, E., Scheidegger, C. E., Silva, C. T., and Vo, H. T. (2006). Managing the evolution of dataflows with vistrails. In Barga, R. S. and Zhou, X., editors, Proceedings of the 22nd International Conference on Data Engineering Workshops, ICDE 2006, 3-7 April 2006, Atlanta, GA, USA, page 71. IEEE Computer Society. [Calzarossa and Tessera, 2015] Calzarossa, M. C. and Tessera, D. (2015). Modeling and pre- dicting temporal patterns of web content changes. Journal of Network and Computer Ap- plications, 56:115 – 123. [Can and Storcken, 2013] Can, B. and Storcken, T. (2013). Update monotone preference rules. Mathematical Social Sciences, 65(2):136–149. [Cappiello, 2015] Cappiello, C. (2015). On the role of data quality in improving web infor- mation value. In Proceedings of the 24th International Conference on World Wide Web, WWW ’15 Companion, pages 1433–1433, New York, NY, USA. ACM. [Carroll et al., 2005] Carroll, J. J., Bizer, C., Hayes, P. J., and Stickler, P. (2005). Named graphs. J. Web Sem., 3(4):247–267. [Carroll and Stickler, 2004] Carroll, J. J. and Stickler, P. (2004). TriX: RDF triples in XML. In Proceedings of the Extreme Markup Languages 2004, Montreal, Canada. [Chaudhuri et al., 2006] Chaudhuri, S., Hristidis, V., and Polyzotis, N., editors (2006). Pro- ceedings of the ACM SIGMOD International Conference on Management of Data, Chicago, Illinois, USA, June 27-29, 2006. ACM. [Chebotko et al., 2010] Chebotko, A., Lu, S., Fei, X., and Fotouhi, F. (2010). Rdfprov: A relational RDF store for querying and managing scientific workflow provenance. Data Knowl. Eng., 69(8):836–865. [Cheney et al., 2014] Cheney, J., Ahmed, A., and Acar, U. A. (2014). Database queries that explain their work. In Proceedings of the 16th International Symposium on Principles and Practice of Declarative Programming, PPDP ’14, pages 271–282, New York, NY, USA. ACM. [Cheney et al., 2009] Cheney, J., Chiticariu, L., and Tan, W. C. (2009). Provenance in databases: Why, how, and where. Foundations and Trends in Databases, 1(4):379–474. [Cheney and Soiland-Reyes, 2013] Cheney, J. and Soiland-Reyes, S. (2013). Prov-n: The provenance notation. Technical report, W3C Recommendation. [Chirigati and Freire, 2012] Chirigati, F. and Freire, J. (2012). Towards integrating workflow and database provenance. In Proceedings of the 4th International Conference on Provenance


and Annotation of Data and Processes, IPAW’12, pages 11–23, Berlin, Heidelberg. Springer- Verlag. [Cho and Garcia-Molina, 2000] Cho, J. and Garcia-Molina, H. (2000). Synchronizing a database to improve freshness. In Chen, W., Naughton, J. F., and Bernstein, P. A., ed- itors, Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data, May 16-18, 2000, Dallas, Texas, USA., pages 117–128. ACM. [Cho and Garcia-Molina, 2003a] Cho, J. and Garcia-Molina, H. (2003a). Effective page refresh policies for web crawlers. ACM Trans. Database Syst., 28(4):390–426. [Cho and Garcia-Molina, 2003b] Cho, J. and Garcia-Molina, H. (2003b). Estimating frequency of change. ACM Trans. Internet Techn., 3(3):256–290. [Cho and Ntoulas, 2002] Cho, J. and Ntoulas, A. (2002). Effective change detection using sampling. In [DBL, 2002], pages 514–525. [Chomicki, 2011] Chomicki, J. (2011). Logical foundations of preference queries. IEEE Data Eng. Bull., 34(2):3–10. [Chomicki and Song, 2005] Chomicki, J. and Song, J. (2005). Monotonic and nonmonotonic preference revision. CoRR, abs/cs/0503092. [Christine Golbreich and Evan K. Wallace and Peter F. Patel-Schneider, 2012] Christine Golbreich and Evan K. Wallace and Peter F. Patel-Schneider (2012). Owl 2 web on- tology language new features and rationale (second edition). Technical report, W3C Recommendation. [Cimiano et al., 2013] Cimiano, P., Corcho, O.,´ Presutti, V., Hollink, L., and Rudolph, S., editors (2013). The Semantic Web: Semantics and Big Data, 10th International Conference, ESWC 2013, Montpellier, France, May 26-30, 2013. Proceedings, volume 7882 of Lecture Notes in Computer Science. Springer. [Croome, 2000] Croome, C. (2000). Rdf site summary 1.0 modules: Qualified dublin core. [Cruz et al., 2006] Cruz, I. F., Decker, S., Allemang, D., Preist, C., Schwabe, D., Mika, P., Uschold, M., and Aroyo, L., editors (2006). The Semantic Web - ISWC 2006, 5th Inter- national Semantic Web Conference, ISWC 2006, Athens, GA, USA, November 5-9, 2006, Proceedings, volume 4273 of Lecture Notes in Computer Science. Springer. [Cudr´e-Maurouxet al., 2012] Cudr´e-Mauroux,P., Heflin, J., Sirin, E., Tudorache, T., Euzenat, J., Hauswirth, M., Parreira, J. X., Hendler, J., Schreiber, G., Bernstein, A., and Blomqvist, E., editors (2012). The Semantic Web - ISWC 2012 - 11th International Semantic Web Conference, Boston, MA, USA, November 11-15, 2012, Proceedings, Part I, volume 7649 of Lecture Notes in Computer Science. Springer. [Cui and Widom, 2000] Cui, Y. and Widom, J. (2000). Practical lineage tracing in data ware- houses. In ICDE, pages 367–378. [Cui et al., 2000] Cui, Y., Widom, J., and Wiener, J. L. (2000). Tracing the lineage of view data in a warehousing environment. ACM Trans. Database Syst., 25(2):179–227. [Cyganiak and Reynolds, 2014] Cyganiak, R. and Reynolds, D. (2014). The rdf data cube vocabulary). W3c recommendation, W3C. [da Silva et al., 2006] da Silva, P. P., McGuinness, D. L., and Fikes, R. (2006). A proof markup language for semantic web services. Inf. Syst., 31(4-5):381–395.


[Dam´asioet al., 2012] Dam´asio,C. V., Analyti, A., and Antoniou, G. (2012). Provenance for SPARQL queries. In [Cudr´e-Maurouxet al., 2012], pages 625–640. [Dan Brickley and R.V. Guha, 2014] Dan Brickley and R.V. Guha (2014). Rdf vocabulary description language 1.0: Rdf schema. Technical report, W3C Recommendation. [Davidson et al., 2011a] Davidson, S. B., Bao, Z., and Roy, S. (2011a). Hiding data and struc- ture in workflow provenance. In Kikuchi, S., Madaan, A., Sachdeva, S., and Bhalla, S., editors, Databases in Networked Information Systems - 7th International Workshop, DNIS 2011, Aizu-Wakamatsu, Japan, December 12-14, 2011. Proceedings, volume 7108 of Lecture Notes in Computer Science, pages 41–48. Springer. [Davidson and Freire, 2008] Davidson, S. B. and Freire, J. (2008). Provenance and scientific workflows: challenges and opportunities. In [Wang, 2008], pages 1345–1350. [Davidson et al., 2011b] Davidson, S. B., Khanna, S., Roy, S., Stoyanovich, J., Tannen, V., and Chen, Y. (2011b). On provenance and privacy. In Milo, T., editor, Database Theory - ICDT 2011, 14th International Conference, Uppsala, Sweden, March 21-24, 2011, Proceedings, pages 3–10. ACM. [Davidson et al., 2011c] Davidson, S. B., Khanna, S., Tannen, V., Roy, S., Chen, Y., Milo, T., and Stoyanovich, J. (2011c). Enabling privacy in provenance-aware workflow systems. In CIDR 2011, Fifth Biennial Conference on Innovative Data Systems Research, Asilomar, CA, USA, January 9-12, 2011, Online Proceedings, pages 215–218. www.cidrdb.org. [Davies et al., 2012] Davies, J., Narodytska, N., and Walsh, T. (2012). Eliminating the weakest link: Making manipulation intractable? CoRR, abs/1204.3918. [Dayal et al., 2006] Dayal, U., Whang, K., Lomet, D. B., Alonso, G., Lohman, G. M., Kersten, M. L., Cha, S. K., and Kim, Y., editors (2006). Proceedings of the 32nd International Conference on Very Large Data Bases, Seoul, Korea, September 12-15, 2006. ACM. [Deborah L. McGuinness and Frank van Harmelen, 2004] Deborah L. McGuinness and Frank van Harmelen (2004). Owl 2 web ontology language. Technical report, W3C Recommenda- tion. [Dehghanzadeh et al., 2014] Dehghanzadeh, S., Parreira, J. X., Karnstedt, M., Umbrich, J., Hauswirth, M., and Decker, S. (2014). Optimizing SPARQL query processing on dynamic and static data based on query time/freshness requirements using materialization. In Sup- nithi, T., Yamaguchi, T., Pan, J. Z., Wuwongse, V., and Buranarach, M., editors, Semantic Technology - 4th Joint International Conference, JIST 2014, Chiang Mai, Thailand, Novem- ber 9-11, 2014. Revised Selected Papers, volume 8943 of Lecture Notes in Computer Science, pages 257–270. Springer. [Deutch et al., 2015a] Deutch, D., Gilad, A., and Moskovitch, Y. (2015a). selp: Selective track- ing and presentation of data provenance. In [Gehrke et al., 2015], pages 1484–1487. [Deutch et al., 2015b] Deutch, D., Gilad, A., and Moskovitch, Y. (2015b). Towards web-scale how-provenance. In 31st IEEE International Conference on Data Engineering Workshops, ICDE Workshops 2015, Seoul, South Korea, April 13-17, 2015, pages 68–70. IEEE Computer Society. [Deutch et al., 2015c] Deutch, D., Moskovitch, Y., and Tannen, V. (2015c). Provenance-based analysis of data-centric processes. The VLDB Journal, 24(4):583–607.


[Ding and Finin, 2006] Ding, L. and Finin, T. (2006). Characterizing the semantic web on the web. In [Cruz et al., 2006], pages 242–257. [Ding et al., 2005] Ding, L., Kolari, P., Finin, T., Joshi, A., Peng, Y., and Yesha, Y. (2005). On homeland security and the semantic web: A provenance and trust aware inference framework. In AI Technologies for Homeland Security, Papers from the 2005 AAAI Spring Symposium, Technical Report SS-05-01, Stanford, California, USA, March 21-23, 2005, pages 157–160. AAAI. [Dividino et al., 2015] Dividino, R. Q., Gottron, T., and Scherp, A. (2015). Strategies for efficiently keeping local linked open data caches up-to-date. In Arenas, M., Corcho, O.,´ Simperl, E., Strohmaier, M., d’Aquin, M., Srinivas, K., Groth, P. T., Dumontier, M., Heflin, J., Thirunarayan, K., and Staab, S., editors, The Semantic Web - ISWC 2015 - 14th Interna- tional Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part II, volume 9367 of Lecture Notes in Computer Science, pages 356–373. Springer. [Dividino et al., 2014a] Dividino, R. Q., Gottron, T., Scherp, A., and Gr¨oner,G. (2014a). From changes to dynamics: Dynamics analysis of linked open data sources. In Demidova, E., Dietze, S., Szymanski, J., and Breslin, J. G., editors, Proceedings of the 1st Interna- tional Workshop on Dataset PROFIling & fEderated Search for Linked Data co-located with the 11th Extended Semantic Web Conference, PROFILES@ESWC 2014, Anissaras, Crete, Greece, May 26, 2014., volume 1151 of CEUR Workshop Proceedings. CEUR-WS.org. [Dividino et al., 2012] Dividino, R. Q., Gr¨oner,G., Scheglmann, S., and Thimm, M. (2012). Ranking RDF with provenance via preference aggregation. In [ten Teije et al., 2012], pages 154–163. [Dividino et al., 2014b] Dividino, R. Q., Kramer, A., and Gottron, T. (2014b). An investi- gation of HTTP header information for detecting changes of linked open data sources. In Presutti, V., Blomqvist, E., Troncy, R., Sack, H., Papadakis, I., and Tordai, A., editors, The Semantic Web: ESWC 2014 Satellite Events - ESWC 2014 Satellite Events, Anissaras, Crete, Greece, May 25-29, 2014, Revised Selected Papers, volume 8798 of Lecture Notes in Computer Science, pages 199–203. Springer. [Dividino et al., 2009a] Dividino, R. Q., Schenk, S., Sizov, S., and Staab, S. (2009a). Prove- nance, trust, explanations - and all that other meta knowledge. KI, 23(2):24–30. [Dividino et al., 2013] Dividino, R. Q., Scherp, A., Gr¨oner, G., and Grotton, T. (2013). Change-a-lod: Does the schema on the linked data cloud change or not? In Hartig, O., Sequeda, J., Hogan, A., and Matsutsuka, T., editors, Proceedings of the Fourth Interna- tional Workshop on Consuming Linked Data, COLD 2013, Sydney, Australia, October 22, 2013, volume 1034 of CEUR Workshop Proceedings. CEUR-WS.org. [Dividino et al., 2009b] Dividino, R. Q., Sizov, S., Staab, S., and Schueler, B. (2009b). Query- ing for provenance, trust, uncertainty and other meta knowledge in RDF. J. Web Sem., 7(3):204–219. [Doherty et al., 2006] Doherty, P., Mylopoulos, J., and Welty, C. A., editors (2006). Pro- ceedings, Tenth International Conference on Principles of Knowledge Representation and Reasoning, Lake District of the United Kingdom, June 2-5, 2006. AAAI Press. [Dong et al., 2009] Dong, X. L., Berti-Equille, L., and Srivastava, D. (2009). Integrating con- flicting data: The role of source dependence. PVLDB, 2(1):550–561.


[Dragan et al., 2015] Dragan, L., Luczak-R¨osch, M., Simperl, E., Packer, H. S., Moreau, L., and Berendt, B. (2015). A-posteriori provenance-enabled linking of publications and datasets via crowdsourcing. D-Lib Magazine, 21(1/2). [Durst and Suignard, 2005] Durst, M. and Suignard, M. (2005). Rfc 3987, internationalized resource identifiers (iris). Technical report, Network Working Group. [Eckert et al., 2014] Eckert, K., Ritze, D., Baierer, K., and Bizer, C. (2014). Restful open workflows for data provenance and reuse. In Chung, C., Broder, A. Z., Shim, K., and Suel, T., editors, 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7-11, 2014, Companion Volume, pages 259–260. ACM. [Fetterly et al., 2004] Fetterly, D., Manasse, M., Najork, M., and Wiener, J. L. (2004). A large-scale study of the evolution of web pages. Softw., Pract. Exper., 34(2):213–237. [Fielding et al., 1999] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and Berners-Lee, T. (1999). Hypertext transfer protocol–http/1.1. [Fl¨ock and Acosta, 2014] Fl¨ock, F. and Acosta, M. (2014). Wikiwho: Precise and efficient attribution of authorship of revisioned content. In Proceedings of the 23rd International Conference on World Wide Web, WWW ’14, pages 843–854, New York, NY, USA. ACM. [Flouris et al., 2009] Flouris, G., Fundulaki, I., Pediaditis, P., Theoharis, Y., and Christophides, V. (2009). Coloring RDF triples to capture provenance. In [Bernstein et al., 2009], pages 196–212. [Fokoue et al., 2010] Fokoue, A., Srivatsa, M., and Young, R. (2010). Assessing trust in un- certain information. In [Patel-Schneider et al., 2010], pages 209–224. [Foster et al., 2008] Foster, J. N., Green, T. J., and Tannen, V. (2008). Annotated xml: Queries and provenance. In Proceedings of the Twenty-seventh ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems, PODS ’08, pages 271–280, New York, NY, USA. ACM. [Freire, 2009] Freire, J. (2009). Provenance management: Challenges and opportunities. In Freytag, J. C., Ruf, T., Lehner, W., and Vossen, G., editors, Datenbanksysteme in Business, Technologie und Web (BTW 2009), 13. Fachtagung des GI-Fachbereichs ”Datenbanken und Informationssysteme” (DBIS), Proceedings, 2.-6. M¨arz2009, M¨unster,Germany, volume 144 of LNI, page 4. GI. [Fuhr and R¨olleke, 1997] Fuhr, N. and R¨olleke, T. (1997). A probabilistic relational algebra for the integration of information retrieval and database systems. ACM Trans. Inf. Syst., 15(1):32–66. [Geerts et al., 2013] Geerts, F., Karvounarakis, G., Christophides, V., and Fundulaki, I. (2013). Algebraic structures for capturing the provenance of SPARQL queries. In Tan, W., Guerrini, G., Catania, B., and Gounaris, A., editors, Joint 2013 EDBT/ICDT Confer- ences, ICDT ’13 Proceedings, Genoa, Italy, March 18-22, 2013, pages 153–164. ACM. [Geerts et al., 2016] Geerts, F., Unger, T., Karvounarakis, G., Fundulaki, I., and Christophides, V. (2016). Algebraic structures for capturing the provenance of SPARQL queries. J. ACM, 63(1):7. [Gehrke et al., 2015] Gehrke, J., Lehner, W., Shim, K., Cha, S. K., and Lohman, G. M., editors (2015). 31st IEEE International Conference on Data Engineering, ICDE 2015, Seoul, South Korea, April 13-17, 2015. IEEE Computer Society.


[Golbeck and Halaschek-Wiener, 2009] Golbeck, J. and Halaschek-Wiener, C. (2009). Trust- based revision for expressive web syndication. J. Log. Comput., 19(5):771–790. [Golbeck and Hendler, 2004] Golbeck, J. and Hendler, J. A. (2004). Accuracy of metrics for inferring trust and reputation in semantic web-based social networks. In Motta, E., Shadbolt, N., Stutt, A., and Gibbins, N., editors, Engineering Knowledge in the Age of the Semantic Web, 14th International Conference, EKAW 2004, Whittlebury Hall, UK, October 5-8, 2004, Proceedings, volume 3257 of Lecture Notes in Computer Science, pages 116–131. Springer. [G¨orlitzand Staab, 2011] G¨orlitz,O. and Staab, S. (2011). SPLENDID: SPARQL endpoint federation exploiting VOID descriptions. In Hartig, O., Harth, A., and Sequeda, J., editors, Proceedings of the Second International Workshop on Consuming Linked Data (COLD2011), Bonn, Germany, October 23, 2011, volume 782 of CEUR Workshop Proceedings. CEUR- WS.org. [Gottron and Gottron, 2014] Gottron, T. and Gottron, C. (2014). Perplexity of index models over evolving linked data. In Presutti, V., d’Amato, C., Gandon, F., d’Aquin, M., Staab, S., and Tordai, A., editors, The Semantic Web: Trends and Challenges - 11th International Conference, ESWC 2014, Anissaras, Crete, Greece, May 25-29, 2014. Proceedings, volume 8465 of Lecture Notes in Computer Science, pages 161–175. Springer. [Gottron et al., 2013a] Gottron, T., Knauf, M., Scheglmann, S., and Scherp, A. (2013a). A systematic investigation of explicit and implicit schema information on the linked open data cloud. In [Cimiano et al., 2013], pages 228–242. [Gottron et al., 2013b] Gottron, T., Scherp, A., Krayer, B., and Peters, A. (2013b). Lodatio: using a schema-level index to support users infinding relevant sources of linked data. In Benjamins, V. R., d’Aquin, M., and Gordon, A., editors, Proceedings of the 7th International Conference on Knowledge Capture, K-CAP 2013, Banff, Canada, June 23-26, 2013, pages 105–108. ACM. [Green et al., 2007] Green, T. J., Karvounarakis, G., and Tannen, V. (2007). Provenance semirings. In Libkin, L., editor, Proceedings of the Twenty-Sixth ACM SIGACT-SIGMOD- SIGART Symposium on Principles of Database Systems, June 11-13, 2007, Beijing, China, pages 31–40. ACM. [Groth and Moreau, 2013] Groth, P. and Moreau, L. (2013). Prov-overview. Technical report, W3C Recommendation. [Gueroussova et al., 2013] Gueroussova, M., Polleres, A., and McIlraith, S. A. (2013). SPARQL with qualitative and quantitative preferences. In Celino, I., Valle, E. D., Kr¨otzsch, M., and Schlobach, S., editors, Proceedings of the 2nd International Workshop on Order- ing and Reasoning, OrdRing 2013, Co-located with the 12th International Semantic Web Conference (ISWC 2013), Sydney, Australia, October 22nd, 2013, volume 1059 of CEUR Workshop Proceedings, pages 2–8. CEUR-WS.org. [Guo et al., 2005] Guo, Y., Pan, Z., and Heflin, J. (2005). LUBM: A benchmark for OWL knowledge base systems. J. Web Sem., 3(2-3):158–182. [Gutierrez et al., 2011] Gutierrez, C., Hurtado, C. A., Mendelzon, A. O., and P´erez, J. (2011). Foundations of semantic web databases. J. Comput. Syst. Sci., 77(3):520–541. [Guti´errezet al., 2005] Guti´errez,C., Hurtado, C. A., and Vaisman, A. A. (2005). Temporal RDF. In G´omez-P´erez,A. and Euzenat, J., editors, The Semantic Web: Research and


Applications, Second European Semantic Web Conference, ESWC 2005, Heraklion, Crete, Greece, May 29 - June 1, 2005, Proceedings, volume 3532 of Lecture Notes in Computer Science, pages 93–107. Springer. [Gutierrez et al., 2007] Gutierrez, C., Hurtado, C. A., and Vaisman, A. A. (2007). Introducing time into RDF. IEEE Trans. Knowl. Data Eng., 19(2):207–218. [Guti´errez-Basultoand Klarman, 2012] Guti´errez-Basulto,V. and Klarman, S. (2012). To- wards a unifying approach to representing and querying temporal data in description logics. In Proceedings of the 6th International Conference on Web Reasoning and Rule Systems, RR’12, pages 90–105, Berlin, Heidelberg. Springer-Verlag. [Gy¨ongyiet al., 2004] Gy¨ongyi,Z., Garcia-Molina, H., and Pedersen, J. O. (2004). Combating web spam with trustrank. In Nascimento, M. A., Ozsu,¨ M. T., Kossmann, D., Miller, R. J., Blakeley, J. A., and Schiefer, K. B., editors, (e)Proceedings of the Thirtieth International Conference on Very Large Data Bases, Toronto, Canada, August 31 - September 3 2004, pages 576–587. Morgan Kaufmann. [Halpin, 2009] Halpin, H. (2009). Provenance: The Missing Component of the Semantic Web. In Proceedings of the First Workshop on Trust and Privacy on the Social and Semantic Web (SPOT2009). [Halpin and Cheney, 2014] Halpin, H. and Cheney, J. (2014). Dynamic provenance for SPARQL updates. In [Mika et al., 2014], pages 425–440. [Hansson, 2002] Hansson, S. O. (2002). Preference Logic, pages 319–393. Springer Netherlands, Dordrecht. [Hartig, 2009a] Hartig, O. (2009a). Provenance information in the web of data. In Bizer, C., Heath, T., Berners-Lee, T., and Idehen, K., editors, Proceedings of the WWW2009 Workshop on Linked Data on the Web, LDOW 2009, Madrid, Spain, April 20, 2009., volume 538 of CEUR Workshop Proceedings. CEUR-WS.org. [Hartig, 2009b] Hartig, O. (2009b). Querying trust in RDF data with tsparql. In [Aroyo et al., 2009], pages 5–20. [Hartig, 2011] Hartig, O. (2011). Zero-knowledge query planning for an iterator implementa- tion of link traversal based query execution. In Antoniou, G., Grobelnik, M., Simperl, E. P. B., Parsia, B., Plexousakis, D., Leenheer, P. D., and Pan, J. Z., editors, The Semantic Web: Research and Applications - 8th Extended Semantic Web Conference, ESWC 2011, Heraklion, Crete, Greece, May 29-June 2, 2011, Proceedings, Part I, volume 6643 of Lecture Notes in Computer Science, pages 154–169. Springer. [Hartig, 2013] Hartig, O. (2013). SQUIN: a traversal based query execution system for the web of linked data. In Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, New York, NY, USA, June 22-27, 2013, pages 1081– 1084. [Hausenblas et al., 2009] Hausenblas, M., Halb, W., Raimond, Y., Feigenbaum, L., and Ayers, D. (2009). SCOVO: using statistics on the web of data. In [Aroyo et al., 2009], pages 708–722. [Haveliwala, 2003] Haveliwala, T. H. (2003). Topic-sensitive pagerank: A context-sensitive ranking algorithm for web search. IEEE Trans. Knowl. Data Eng., 15(4):784–796.


[Hoekstra and Groth, 2015] Hoekstra, R. and Groth, P. (2015). PROV-O-Viz - Understanding the Role of Activities in Provenance, pages 215–220. Springer International Publishing, Cham. [Hoekstra and Groth, 2014] Hoekstra, R. and Groth, P. T. (2014). Prov-o-viz - understanding the role of activities in provenance. In [Lud¨ascher and Plale, 2015], pages 215–220. [Hogan, 2014] Hogan, A. (2014). Linked data & the semantic web standards. In Harth, A., Hose, K., and Schenkel, R., editors, Linked Data Management., pages 3–48. Chapman and Hall/CRC. [Hogan et al., 2012] Hogan, A., Umbrich, J., Harth, A., Cyganiak, R., Polleres, A., and Decker, S. (2012). An empirical survey of linked data conformance. J. Web Sem., 14:14–44. [Horridge et al., 2008] Horridge, M., Parsia, B., and Sattler, U. (2008). Laconic and precise justifications in OWL. In [Sheth et al., 2008], pages 323–338. [Horrocks et al., 2005] Horrocks, I., Kutz, O., and Sattler, U. (2005). The irresistible SRIQ. In Proc. of the 1st Int. Workshop on OWL Experiences and Directions (OWLED 2005). [Horrocks et al., 2006] Horrocks, I., Kutz, O., and Sattler, U. (2006). The even more irresistible SROIQ. In [Doherty et al., 2006], pages 57–67. [Hristidis et al., 2001] Hristidis, V., Koudas, N., and Papakonstantinou, Y. (2001). PREFER: A system for the efficient execution of multi-parametric ranked queries. In Mehrotra, S. and Sellis, T. K., editors, Proceedings of the 2001 ACM SIGMOD international conference on Management of data, Santa Barbara, CA, USA, May 21-24, 2001, pages 259–270. ACM. [Huai et al., 2008] Huai, J., Chen, R., Hon, H., Liu, Y., Ma, W., Tomkins, A., and Zhang, X., editors (2008). Proceedings of the 17th International Conference on World Wide Web, WWW 2008, Beijing, China, April 21-25, 2008. ACM. [Huang et al., 2015] Huang, X., Bao, Z., Davidson, S. B., Milo, T., and Yuan, X. (2015). Answering regular path queries on workflow provenance. In [Gehrke et al., 2015], pages 375–386. [Huynh et al., 2013] Huynh, T. D., Ebden, M., Venanzi, M., Ramchurn, S. D., Roberts, S. J., and Moreau, L. (2013). Interpretation of crowdsourced activities using provenance network analysis. In Hartman, B. and Horvitz, E., editors, Proceedings of the First AAAI Conference on Human Computation and Crowdsourcing, HCOMP 2013, November 7-9, 2013, Palm Springs, CA, USA. AAAI. [Huynh et al., 2016] Huynh, T. D., Michaelides, D. T., and Moreau, L. (2016). PROV- JSONLD: A JSON and linked data representation for provenance. In [Mattoso and Glavic, 2016], pages 173–177. [Huynh and Moreau, 2014] Huynh, T. D. and Moreau, L. (2014). Provstore: A public prove- nance repository. In [Lud¨ascher and Plale, 2015], pages 275–277. [Isele et al., 2010] Isele, R., Umbrich, J., Bizer, C., and Harth, A. (2010). Ldspider: An open- source crawling framework for the web of linked data. In Polleres, A. and Chen, H., editors, Proceedings of the ISWC 2010 Posters & Demonstrations Track: Collected Abstracts, Shang- hai, China, November 9, 2010, volume 658 of CEUR Workshop Proceedings. CEUR-WS.org. [Jeh and Widom, 2003] Jeh, G. and Widom, J. (2003). Scaling personalized web search. In Hencsey, G., White, B., Chen, Y. R., Kov´acs, L., and Lawrence, S., editors, Proceedings of


the Twelfth International World Wide Web Conference, WWW 2003, Budapest, Hungary, May 20-24, 2003, pages 271–279. ACM. [Jentzsch et al., 2011] Jentzsch, A., Cyganiak, R., and Bizer, C. (2011). State of the LOD Cloud. http://www4.wiwiss.fu-berlin.de/lodcloud/state/. [Online; accessed 13- April-2016]. [Jentzsch et al., 2015] Jentzsch, A., Dullweber, C., Troiano, P., and Naumann, F. (2015). Ex- ploring linked data graph structures. In [Villata et al., 2015]. [Ji et al., 2009] Ji, Q., Qi, G., and Haase, P. (2009). A relevance-directed algorithm for finding justifications of DL entailments. In G´omez-P´erez,A., Yu, Y., and Ding, Y., editors, The Se- mantic Web, Fourth Asian Conference, ASWC 2009, Shanghai, China, December 6-9, 2009. Proceedings, volume 5926 of Lecture Notes in Computer Science, pages 306–320. Springer. [Jung and Lutz, 2012] Jung, J. C. and Lutz, C. (2012). Ontology-based access to probabilistic data with OWL QL. In [Cudr´e-Mauroux et al., 2012], pages 182–197. [K¨aferet al., 2013] K¨afer,T., Abdelrahman, A., Umbrich, J., O’Byrne, P., and Hogan, A. (2013). Observing linked data dynamics. In [Cimiano et al., 2013], pages 213–227. [Kalyanpur et al., 2006] Kalyanpur, A., Parsia, B., and Grau, B. C. (2006). Beyond asserted axioms: Fine-grain justifications for OWL-DL entailments. In Parsia, B., Sattler, U., and Toman, D., editors, Proceedings of the 2006 International Workshop on Description Logics (DL2006), Windermere, Lake District, UK, May 30 - June 1, 2006, volume 189 of CEUR Workshop Proceedings. CEUR-WS.org. [Kalyanpur et al., 2007] Kalyanpur, A., Parsia, B., Horridge, M., and Sirin, E. (2007). Finding all justifications of OWL DL entailments. In Aberer, K., Choi, K., Noy, N. F., Allemang, D., Lee, K., Nixon, L. J. B., Golbeck, J., Mika, P., Maynard, D., Mizoguchi, R., Schreiber, G., and Cudr´e-Mauroux,P., editors, The Semantic Web, 6th International Semantic Web Conference, 2nd Asian Semantic Web Conference, ISWC 2007 + ASWC 2007, Busan, Korea, November 11-15, 2007., volume 4825 of Lecture Notes in Computer Science, pages 267–280. Springer. [Karvounarakis et al., 2013] Karvounarakis, G., Fundulaki, I., and Christophides, V. (2013). Provenance for linked data. In [Tannen et al., 2013], pages 366–381. [Karvounarakis and Green, 2012] Karvounarakis, G. and Green, T. J. (2012). Semiring- annotated data: queries and provenance? SIGMOD Record, 41(3):5–14. [Keith Alexander and Richard Cyganiak and Michael Hausenblas and Jun Zhao, 2011] Keith Alexander and Richard Cyganiak and Michael Hausenblas and Jun Zhao (2011). Describing linked datasets with the void vocabulary. Technical report, W3C. [Kelly, 1988] Kelly, J. (1988). Social choice theory: an introduction. Springer-Verlag. [Keshavarz et al., 2014] Keshavarz, A. S., Huynh, T. D., and Moreau, L. (2014). Provenance for online decision making. In [Lud¨ascher and Plale, 2015], pages 44–55. [Khatchadourian and Consens, 2010] Khatchadourian, S. and Consens, M. P. (2010). Explod: Summary-based exploration of interlinking and RDF usage in the linked open data cloud. In Aroyo, L., Antoniou, G., Hyv¨onen,E., ten Teije, A., Stuckenschmidt, H., Cabral, L., and Tudorache, T., editors, The Semantic Web: Research and Applications, 7th Extended Semantic Web Conference, ESWC 2010, Heraklion, Crete, Greece, May 30 - June 3, 2010,


Proceedings, Part II, volume 6089 of Lecture Notes in Computer Science, pages 272–287. Springer. [Kießling, 2002] Kießling, W. (2002). Foundations of preferences in database systems. In [DBL, 2002], pages 311–322. [Kjernsmo, 2015] Kjernsmo, K. (2015). A survey of HTTP caching implementations on the open semantic web. In Gandon, F., Sabou, M., Sack, H., d’Amato, C., Cudr´e-Mauroux,P., and Zimmermann, A., editors, The Semantic Web. Latest Advances and New Domains - 12th European Semantic Web Conference, ESWC 2015, Portoroz, Slovenia, May 31 - June 4, 2015. Proceedings, volume 9088 of Lecture Notes in Computer Science, pages 286–301. Springer. [Klyne et al., 2014] Klyne, G., Carroll, J. J., and McBride, B. (2014). Rdf 1.1 concepts and abstract syntax. Technical report, W3C Recommendation. [Knechtel and Pe˜naloza,2010] Knechtel, M. and Pe˜naloza,R. (2010). A generic approach for correcting access restrictions to a consequence. In [Aroyo et al., 2010], pages 167–182. [Konrath et al., 2012] Konrath, M., Gottron, T., Staab, S., and Scherp, A. (2012). Schemex - efficient construction of a data catalogue by stream-based indexing of linked data. J. Web Sem., 16:52–58. [Koop and Freire, 2014] Koop, D. and Freire, J. (2014). Reorganizing workflow evolution provenance. In Chapman, A., Lud¨ascher, B., and Schreiber, A., editors, 6th Workshop on the Theory and Practice of Provenance, TaPP’14, Cologne, Germany, June 12-13, 2014. USENIX Association. [Koop et al., 2010] Koop, D., Scheidegger, C. E., Freire, J., and Silva, C. T. (2010). The provenance of workflow upgrades. In McGuinness, D. L., Michaelis, J., and Moreau, L., editors, Provenance and Annotation of Data and Processes - Third International Provenance and Annotation Workshop, IPAW 2010, Troy, NY, USA, June 15-16, 2010. Revised Selected Papers, volume 6378 of Lecture Notes in Computer Science, pages 2–16. Springer. [Kostylev and Buneman, 2012] Kostylev, E. V. and Buneman, P. (2012). Combining depen- dent annotations for relational algebra. In Deutsch, A., editor, 15th International Conference on Database Theory, ICDT ’12, Berlin, Germany, March 26-29, 2012, pages 196–207. ACM. [Kuich, 1997] Kuich, W. (1997). Handbook of formal languages, vol. 1. chapter Semirings and Formal Power Series: Their Relevance to Formal Languages and Automata, pages 609–677. Springer-Verlag New York, Inc., New York, NY, USA. [Kwasnikowska et al., 2015] Kwasnikowska, N., Moreau, L., and den Bussche, J. V. (2015). A formal account of the open provenance model. TWEB, 9(2):10. [Lakshmanan et al., 2011] Lakshmanan, G. T., Curbera, F., Freire, J., and Sheth, A. P. (2011). Guest editors’ introduction: Provenance in web applications. IEEE Internet Computing, 15(1):17–21. [Lampo et al., 2011] Lampo, T., Vidal, M.-E., Danilow, J., and Ruckhaus, E. (2011). To cache or not to cache: The effects of warming cache in complex sparql queries. In Proceedings of the 2011th Confederated International Conference on On the Move to Meaningful Internet Systems - Volume Part II, OTM’11, pages 716–733, Berlin, Heidelberg. Springer-Verlag. [Langegger and W¨oß,2009] Langegger, A. and W¨oß,W. (2009). Rdfstats - an extensible RDF statistics generator and library. In Tjoa, A. M. and Wagner, R., editors, Database and


Expert Systems Applications, DEXA, International Workshops, Linz, Austria, August 31- September 4, 2009, Proceedings, pages 79–83. IEEE Computer Society. [Laˇseket al., 2012] Laˇsek,I., Klimpera, O., and Vojt´aˇs,P. (2012). Scalable Framework for Se- mantic Web Crawling. In WIKT 2012: 7th Workshop on Intelligent and Knowledge Oriented Technologies, pages 165–168, Bratislava. Slovensk´atechnick´auniverzita v Bratislave. [Li et al., 2005] Li, C., Chang, K. C., Ilyas, I. F., and Song, S. (2005). Ranksql: Query algebra and optimization for relational top-k queries. In Ozcan,¨ F., editor, Proceedings of the ACM SIGMOD International Conference on Management of Data, Baltimore, Maryland, USA, June 14-16, 2005, pages 131–142. ACM. [Lopes et al., 2010] Lopes, N., Polleres, A., Straccia, U., and Zimmermann, A. (2010). Anql: Sparqling up annotated RDFS. In [Patel-Schneider et al., 2010], pages 518–533. [Lud¨ascher and Plale, 2015] Lud¨ascher, B. and Plale, B., editors (2015). Provenance and An- notation of Data and Processes - 5th International Provenance and Annotation Workshop, IPAW 2014, Cologne, Germany, June 9-13, 2014. Revised Selected Papers, volume 8628 of Lecture Notes in Computer Science. Springer. [Lukasiewicz and Straccia, 2008] Lukasiewicz, T. and Straccia, U. (2008). Managing uncer- tainty and vagueness in description logics for the semantic web. J. Web Sem., 6(4):291–308. [Lutz et al., 2008] Lutz, C., Wolter, F., and Zakharyaschev, M. (2008). Temporal description logics: A survey. In Demri, S. and Jensen, C. S., editors, 15th International Symposium on Temporal Representation and Reasoning, TIME 2008, Universit´edu Qu´ebec `aMontr´eal, Canada, 16-18 June 2008, pages 3–14. IEEE Computer Society. [Ma et al., 2007] Ma, Y., Hitzler, P., and Lin, Z. (2007). Algorithms for paraconsistent rea- soning with OWL. In Franconi, E., Kifer, M., and May, W., editors, The Semantic Web: Research and Applications, 4th European Semantic Web Conference, ESWC 2007, Inns- bruck, Austria, June 3-7, 2007, Proceedings, volume 4519 of Lecture Notes in Computer Science, pages 399–413. Springer. [Maali and Erickson, 2014] Maali, F. and Erickson, J. (2014). Data catalog vocabulary (dcat). W3c recommendation, W3C. [Manola et al., 2014] Manola, F., Miller, E., and McBride, B. (2014). Rdf 1.1 primer. Technical report, W3C Recommendation. [Mattoso and Glavic, 2016] Mattoso, M. and Glavic, B., editors (2016). Provenance and An- notation of Data and Processes - 6th International Provenance and Annotation Workshop, IPAW 2016, McLean, VA, USA, June 7-8, 2016, Proceedings, volume 9672 of Lecture Notes in Computer Science. Springer. [Maudet et al., 2012] Maudet, N., Pini, M. S., Venable, K. B., and Rossi, F. (2012). Influence and aggregation of preferences over combinatorial domains. In van der Hoek, W., Padgham, L., Conitzer, V., and Winikoff, M., editors, International Conference on Autonomous Agents and Multiagent Systems, AAMAS 2012, Valencia, Spain, June 4-8, 2012 (3 Volumes), pages 1313–1314. IFAAMAS. [McGuinness and da Silva, 2004] McGuinness, D. L. and da Silva, P. P. (2004). Explaining answers from the semantic web: the inference web approach. J. Web Sem., 1(4):397–413.


[Mihindukulasooriya et al., 2015] Mihindukulasooriya, N., Poveda-Villal´on, M., Garc´ıa- Castro, R., and G´omez-P´erez,A. (2015). Loupe - an online tool for inspecting datasets in the linked data cloud. In [Villata et al., 2015]. [Mika et al., 2014] Mika, P., Tudorache, T., Bernstein, A., Welty, C., Knoblock, C. A., Vran- decic, D., Groth, P. T., Noy, N. F., Janowicz, K., and Goble, C. A., editors (2014). The Semantic Web - ISWC 2014 - 13th International Semantic Web Conference, Riva del Garda, Italy, October 19-23, 2014. Proceedings, Part I, volume 8796 of Lecture Notes in Computer Science. Springer. [Miles et al., 2013] Miles, S., Trim, C. M., and Panzer, M. (2013). Dublin core to prov map- ping. Technical report, W3C Group Note. [Missier et al., 2013] Missier, P., Moreau, L., Cheney, J., Lebo, T., and Soiland-Reyes, S. (2013). Prov-dictionary: Modeling provenance for dictionary data structures. Technical report, W3C Group Note. [Moreau, 2010] Moreau, L. (2010). The foundations for provenance on the web. Foundations and Trends in Web Science, 2(2-3):99–241. [Moreau, 2013] Moreau, L. (2013). Prov-xml: The prov xml schema. Technical report, W3C Group Note. [Moreau, 2015] Moreau, L. (2015). Aggregation by provenance types: A technique for sum- marising provenance graphs. In Rensink, A. and Zambon, E., editors, Proceedings Graphs as Models, GaM 2015, London, UK, 11-12 April 2015., volume 181 of EPTCS, pages 129–144. [Moreau et al., 2011] Moreau, L., Clifford, B., Freire, J., Futrelle, J., Gil, Y., Groth, P. T., Kwasnikowska, N., Miles, S., Missier, P., Myers, J., Plale, B., Simmhan, Y., Stephan, E. G., and den Bussche, J. V. (2011). The open provenance model core specification (v1.1). Future Generation Comp. Syst., 27(6):743–756. [Moreau and Groth, 2013] Moreau, L. and Groth, P. T. (2013). Provenance: An Introduction to PROV. Synthesis Lectures on the Semantic Web: Theory and Technology. Morgan & Claypool Publishers. [Moreau et al., 2015] Moreau, L., Groth, P. T., Cheney, J., Lebo, T., and Miles, S. (2015). The rationale of PROV. J. Web Sem., 35:235–257. [Moreau et al., 2013] Moreau, L., Hartig, O., Simmhan, Y., Myers, J., Lebo, T., Belhajjame, K., Miles, S., and Soiland-Reyes, S. (2013). Prov-aq: Provenance access and query. Technical report, W3C Group Note. [Motik, 2012] Motik, B. (2012). Representing and querying validity time in RDF and OWL: A logic-based approach. J. Web Sem., 12:3–21. [Mouratidis et al., 2006] Mouratidis, K., Bakiras, S., and Papadias, D. (2006). Continuous monitoring of top-k queries over sliding windows. In [Chaudhuri et al., 2006], pages 635– 646. [Murdock et al., 2006] Murdock, J. W., McGuinness, D. L., da Silva, P. P., Welty, C. A., and Ferrucci, D. A. (2006). Explaining conclusions from diverse knowledge sources. In [Cruz et al., 2006], pages 861–872. [Murta et al., 2014] Murta, L., Braganholo, V., Chirigati, F., Koop, D., and Freire, J. (2014). noworkflow: Capturing and analyzing provenance of scripts. In [Lud¨ascher and Plale, 2015], pages 71–83.


[Neumann and Moerkotte, 2011] Neumann, T. and Moerkotte, G. (2011). Characteristic sets: Accurate cardinality estimation for RDF queries with multiple joins. In Abiteboul, S., B¨ohm,K., Koch, C., and Tan, K., editors, Proceedings of the 27th International Conference on Data Engineering, ICDE 2011, April 11-16, 2011, Hannover, Germany, pages 984–994. IEEE Computer Society. [Nies, 2013] Nies, T. D. (2013). Constraints of the prov data model. Technical report, W3C Recommendation. [Oinn et al., 2004] Oinn, T. M., Addis, M., Ferris, J., Marvin, D., Senger, M., Greenwood, R. M., Carver, T., Glover, K., Pocock, M. R., Wipat, A., and Li, P. (2004). Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics, 20(17):3045–3054. [Olston and Pandey, 2008] Olston, C. and Pandey, S. (2008). Recrawl scheduling based on information longevity. In [Huai et al., 2008], pages 437–446. [Orlandi and Passant, 2011] Orlandi, F. and Passant, A. (2011). Modelling provenance of dbpedia resources using wikipedia contributions. J. Web Sem., 9(2):149–164. [Page et al., 1999] Page, L., Brin, S., Motwani, R., and Winograd, T. (1999). The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab. Previous number = SIDL-WP-1999-0120. [Pahlevan et al., 2015] Pahlevan, A., Duprat, J.-L., Thomo, A., and M¨uller,H. (2015). Dy- namis: Effective context-aware web service selection using dynamic attributes. Future In- ternet, 7(2):110. [Palmonari et al., 2015] Palmonari, M., Rula, A., Porrini, R., Maurino, A., Spahiu, B., and Ferme, V. (2015). ABSTAT: linked data summaries with abstraction and statistics. In Gandon, F., Gu´eret, C., Villata, S., Breslin, J. G., Faron-Zucker, C., and Zimmermann, A., editors, The Semantic Web: ESWC 2015 Satellite Events - ESWC 2015 Satellite Events Portoroˇz,Slovenia, May 31 - June 4, 2015, Revised Selected Papers, volume 9341 of Lecture Notes in Computer Science, pages 128–132. Springer. [Pandey and Olston, 2005] Pandey, S. and Olston, C. (2005). User-centric web crawling. In Ellis, A. and Hagino, T., editors, Proceedings of the 14th international conference on World Wide Web, WWW 2005, Chiba, Japan, May 10-14, 2005, pages 401–411. ACM. [Patel-Schneider et al., 2010] Patel-Schneider, P. F., Pan, Y., Hitzler, P., Mika, P., Zhang, L., Pan, J. Z., Horrocks, I., and Glimm, B., editors (2010). The Semantic Web - ISWC 2010 - 9th International Semantic Web Conference, ISWC 2010, Shanghai, China, November 7-11, 2010, Revised Selected Papers, Part I, volume 6496 of Lecture Notes in Computer Science. Springer. [Patrick J. Hayes and Peter F. Patel-Schneider, 2014] Patrick J. Hayes and Peter F. Patel- Schneider (2014). Rdf 1.1 semantics. Technical report, W3C Recommendation. [P´erezet al., 2009] P´erez,J., Arenas, M., and Gutierrez, C. (2009). Semantics and complexity of SPARQL. ACM Trans. Database Syst., 34(3). [Pini et al., 2005] Pini, M. S., Rossi, F., Venable, K. B., and Walsh, T. (2005). Aggregating partially ordered preferences: impossibility and possibility results. In van der Meyden, R., editor, Proceedings of the 10th Conference on Theoretical Aspects of Rationality and


Knowledge (TARK-2005), Singapore, June 10-12, 2005, pages 193–206. National University of Singapore. [Polleres, 2007] Polleres, A. (2007). From SPARQL to rules (and back). In [Williamson et al., 2007], pages 787–796. [Polleres et al., 2007] Polleres, A., Scharffe, F., and Schindlauer, R. (2007). SPARQL++ for mapping between RDF vocabularies. In Meersman, R. and Tari, Z., editors, On the Move to Meaningful Internet Systems 2007: CoopIS, DOA, ODBASE, GADA, and IS, OTM Con- federated International Conferences CoopIS, DOA, ODBASE, GADA, and IS 2007, Vilam- oura, Portugal, November 25-30, 2007, Proceedings, Part I, volume 4803 of Lecture Notes in Computer Science, pages 878–896. Springer. [Prud’hommeaux and Seaborne, 2008] Prud’hommeaux, E. and Seaborne, A. (2008). Sparql query language for rdf. W3c recommendation, W3C. [Qi et al., 2011] Qi, G., Ji, Q., Pan, J. Z., and Du, J. (2011). Extending description logics with uncertainty reasoning in possibilistic logic. Int. J. Intell. Syst., 26(4):353–381. [R. Fielding, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, T. Berners-Lee, 1999] R. Field- ing, J. Gettys, J. Mogul, H. Frystyk, L. Masinter, T. Berners-Lee (1999). Hypertext transfer protocol – http/1.1. Technical report, W3C Recommendation. [Radinsky and Bennett, 2013] Radinsky, K. and Bennett, P. N. (2013). Predicting content change on the web. In Leonardi, S., Panconesi, A., Ferragina, P., and Gionis, A., editors, Sixth ACM International Conference on Web Search and Data Mining, WSDM 2013, Rome, Italy, February 4-8, 2013, pages 415–424. ACM. [Reiter, 1987] Reiter, R. (1987). A theory of diagnosis from first principles. Artif. Intell., 32(1):57–95. [Richardson and Moreau, 2016] Richardson, D. P. and Moreau, L. (2016). Towards the domain agnostic generation of natural language explanations from provenance graphs for casual users. In [Mattoso and Glavic, 2016], pages 95–106. [Riguzzi et al., 2015a] Riguzzi, F., Bellodi, E., Lamma, E., and Zese, R. (2015a). Probabilistic description logics under the distribution semantics. Semantic Web, 6(5):477–501. [Riguzzi et al., 2015b] Riguzzi, F., Bellodi, E., Lamma, E., and Zese, R. (2015b). Reason- ing with probabilistic ontologies. In Yang, Q. and Wooldridge, M., editors, Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, IJCAI 2015, Buenos Aires, Argentina, July 25-31, 2015, pages 4310–4316. AAAI Press. [Roussakis et al., 2015] Roussakis, Y., Chrysakis, I., Stefanidis, K., Flouris, G., and Stavrakas, Y. (2015). A flexible framework for understanding the dynamics of evolving RDF datasets. In Arenas, M., Corcho, O.,´ Simperl, E., Strohmaier, M., d’Aquin, M., Srinivas, K., Groth, P. T., Dumontier, M., Heflin, J., Thirunarayan, K., and Staab, S., editors, The Semantic Web - ISWC 2015 - 14th International Semantic Web Conference, Bethlehem, PA, USA, October 11-15, 2015, Proceedings, Part I, volume 9366 of Lecture Notes in Computer Science, pages 495–512. Springer. [Rudolph, 2011] Rudolph, S. (2011). Foundations of description logics. In Axel Polleres, Claudia d’Amato, M. A. S. H. P. K. S. O. P. F. P.-S., editor, Reasoning Web. Semantic Technologies for the Web of Data - 7th International Summer School 2011, volume 6848 of LNCS, pages 76–136. Springer.


[Rula et al., 2012] Rula, A., Palmonari, M., Harth, A., Stadtm¨uller,S., and Maurino, A. (2012). On the diversity and availability of temporal information in linked open data. In [Cudr´e-Maurouxet al., 2012], pages 492–507. [Sarkas et al., 2008] Sarkas, N., Das, G., Koudas, N., and Tung, A. K. H. (2008). Categorical skylines for streaming data. In [Wang, 2008], pages 239–250. [Schaible et al., 2013] Schaible, J., Gottron, T., Scheglmann, S., and Scherp, A. (2013). LOVER: support for modeling data using linked open vocabularies. In Guerrini, G., editor, Joint 2013 EDBT/ICDT Conferences, EDBT/ICDT ’13, Genoa, Italy, March 22, 2013, Workshop Proceedings, pages 89–92. ACM. [Schenk, 2008] Schenk, S. (2008). On the semantics of trust and caching in the semantic web. In [Sheth et al., 2008], pages 533–549. [Schenk et al., 2009] Schenk, S., Dividino, R. Q., and Staab, S. (2009). Reasoning with prove- nance, trust and all that other meta knowlege in OWL. In Freire, J., Missier, P., and Sahoo, S. S., editors, Proceedings of the First International Workshop on the role of Semantic Web in Provenance Management (SWPM 2009), collocated with the 8th International Semantic Web Conference (ISWC-2009), Washington DC, USA, October 25, 2009, volume 526 of CEUR Workshop Proceedings. CEUR-WS.org. [Schenk et al., 2011] Schenk, S., Dividino, R. Q., and Staab, S. (2011). Using provenance to debug changing ontologies. J. Web Sem., 9(3):284–298. [Schmachtenberg et al., 2014] Schmachtenberg, M., Bizer, C., and Paulheim, H. (2014). Adop- tion of the linked data best practices in different topical domains. In [Mika et al., 2014], pages 245–260. [Schueler et al., 2008] Schueler, B., Sizov, S., Staab, S., and Tran, D. T. (2008). Querying for meta knowledge. In [Huai et al., 2008], pages 625–634. [Sensoy et al., 2013] Sensoy, M., Fokoue, A., Pan, J. Z., Norman, T. J., Tang, Y., Oren, N., and Sycara, K. P. (2013). Reasoning about uncertain information and conflict resolution through trust revision. In Gini, M. L., Shehory, O., Ito, T., and Jonker, C. M., editors, International conference on Autonomous Agents and Multi-Agent Systems, AAMAS ’13, Saint Paul, MN, USA, May 6-10, 2013, pages 837–844. IFAAMAS. [Shchekotykhin et al., 2011] Shchekotykhin, K. M., Friedrich, G., Fleiss, P., and Rodler, P. (2011). Query strategy for sequential ontology debugging. CoRR, abs/1107.4303. [Sheth et al., 2008] Sheth, A. P., Staab, S., Dean, M., Paolucci, M., Maynard, D., Finin, T. W., and Thirunarayan, K., editors (2008). The Semantic Web - ISWC 2008, 7th International Semantic Web Conference, ISWC 2008, Karlsruhe, Germany, October 26-30, 2008. Pro- ceedings, volume 5318 of Lecture Notes in Computer Science. Springer. [Siberski et al., 2006] Siberski, W., Pan, J. Z., and Thaden, U. (2006). Querying the semantic web with preferences. In [Cruz et al., 2006], pages 612–624. [Simmhan et al., 2005] Simmhan, Y., Plale, B., and Gannon, D. (2005). A survey of data provenance in e-science. SIGMOD Record, 34(3):31–36. [Simmhan et al., 2008] Simmhan, Y. L., Plale, B., and Gannon, D. (2008). Karma2: Prove- nance management for data-driven workflows. Int. J. Web Service Res., 5(2):1–22. [Sizov, 2007] Sizov, S. (2007). What makes you think that? the semantic web’s proof layer. IEEE Intelligent Systems, 22(6):94–99.


[Smith et al., 2007] Smith, B., Ashburner, M., Rosse, C., Bard, J., Bug, W., Ceusters, W., Goldberg, L. J., Eilbeck, K., Ireland, A., Mungall, C. J., Leontis, N., Rocca-Serra, P., Rut- tenberg, A., Sansone, S.-A., Scheuermann, R. H., Shah, N., Whetzel, P. L., and Lewis, S. (2007). The obo foundry: coordinated evolution of ontologies to support biomedical data integration. Nature biotechnology, 25(11):1251–5. [Straccia et al., 2010] Straccia, U., Lopes, N., Lukacsy, G., and Polleres, A. (2010). A general framework for representing and reasoning with annotated semantic web data. In Fox, M. and Poole, D., editors, Proceedings of the Twenty-Fourth AAAI Conference on Artificial Intelligence, AAAI 2010, Atlanta, Georgia, USA, July 11-15, 2010. AAAI Press. [T. Berners-Lee, 2005] T. Berners-Lee, R. Fielding, L. M. (2005). Rfc 3986, uniform resource identifier (uri): Generic syntax. Technical report, Network Working Group. [Tan and Mitra, 2010] Tan, Q. and Mitra, P. (2010). Clustering-based incremental web crawl- ing. ACM Trans. Inf. Syst., 28(4):17. [Tan, 2007] Tan, W. C. (2007). Provenance in databases: Past, current, and future. IEEE Data Eng. Bull., 30(4):3–12. [Tannen et al., 2013] Tannen, V., Wong, L., Libkin, L., Fan, W., Tan, W., and Fourman, M. P., editors (2013). In Search of Elegance in the Theory and Practice of Computation - Essays Dedicated to Peter Buneman, volume 8000 of Lecture Notes in Computer Science. Springer. [Tappolet and Bernstein, 2009] Tappolet, J. and Bernstein, A. (2009). Applied temporal rdf: Efficient temporal querying of rdf data with sparql. In Proceedings of the 6th European Semantic Web Conference on The Semantic Web: Research and Applications, ESWC 2009 Heraklion, pages 308–322, Berlin, Heidelberg. Springer-Verlag. [ten Teije et al., 2012] ten Teije, A., V¨olker, J., Handschuh, S., Stuckenschmidt, H., d’Aquin, M., Nikolov, A., Aussenac-Gilles, N., and Hernandez, N., editors (2012). Knowledge Engi- neering and Knowledge Management - 18th International Conference, EKAW 2012, Galway City, Ireland, October 8-12, 2012. Proceedings, volume 7603 of Lecture Notes in Computer Science. Springer. [Theoharis et al., 2011] Theoharis, Y., Fundulaki, I., Karvounarakis, G., and Christophides, V. (2011). On provenance of queries on semantic web data. IEEE Internet Computing, 15(1):31–39. [Thimm, 2013] Thimm, M. (2013). Dynamic preference aggregation under preference changes. In Proceedings of the Fourth Workshop on Dynamics of Knowledge and Belief (DKB’13). [Tran et al., 2008] Tran, T., Haase, P., Motik, B., Grau, B. C., and Horrocks, I. (2008). Met- alevel information in ontology-based applications. In Fox, D. and Gomes, C. P., editors, Proceedings of the Twenty-Third AAAI Conference on Artificial Intelligence, AAAI 2008, Chicago, Illinois, USA, July 13-17, 2008, pages 1237–1242. AAAI Press. [Udrea et al., 2010] Udrea, O., Recupero, D. R., and Subrahmanian, V. S. (2010). Annotated RDF. ACM Trans. Comput. Log., 11(2). [Umbrich et al., 2010] Umbrich, J., Hausenblas, M., Hogan, A., Polleres, A., and Decker, S. (2010). Towards dataset dynamics: Change frequency of linked open data sources. In Bizer, C., Heath, T., Berners-Lee, T., and Hausenblas, M., editors, Proceedings of the WWW2010 Workshop on Linked Data on the Web, LDOW 2010, Raleigh, USA, April 27, 2010, volume 628 of CEUR Workshop Proceedings. CEUR-WS.org.


[Umbrich et al., 2012] Umbrich, J., Karnstedt, M., Hogan, A., and Parreira, J. X. (2012). Hybrid SPARQL queries: Fresh vs. fast results. In [Cudr´e-Maurouxet al., 2012], pages 608–624. [Villata et al., 2015] Villata, S., Pan, J. Z., and Dragoni, M., editors (2015). Proceedings of the ISWC 2015 Posters & Demonstrations Track co-located with the 14th International Semantic Web Conference (ISWC-2015), Bethlehem, PA, USA, October 11, 2015, volume 1486 of CEUR Workshop Proceedings. CEUR-WS.org. [V¨olkel and Groza, 2006] V¨olkel, M. and Groza, T. (2006). SemVersion: An RDF-based On- tology Versioning System. In Nunes, M. B., editor, Proceedings of IADIS International Conference on WWW/Internet (IADIS 2006), pages 195–202, Murcia, Spain. [Walsh, 2007] Walsh, T. (2007). Representing and reasoning with preferences. AI Magazine, 28(4):59–70. [Wang, 2008] Wang, J. T., editor (2008). Proceedings of the ACM SIGMOD International Conference on Management of Data, SIGMOD 2008, Vancouver, BC, Canada, June 10-12, 2008. ACM. [Wang and Madnick, 1990] Wang, Y. R. and Madnick, S. E. (1990). A polygen model for heterogeneous database systems: The source tagging perspective. In Proceedings of the 16th International Conference on Very Large Data Bases, VLDB ’90, pages 519–538, San Francisco, CA, USA. Morgan Kaufmann Publishers Inc. [Wille, 2009] Wille, R. (2009). Restructuring lattice theory: An approach based on hierarchies of concepts. In Ferr´e,S. and Rudolph, S., editors, Formal Concept Analysis, 7th International Conference, ICFCA 2009, Darmstadt, Germany, May 21-24, 2009, Proceedings, volume 5548 of Lecture Notes in Computer Science, pages 314–339. Springer. [Williamson et al., 2007] Williamson, C. L., Zurko, M. E., Patel-Schneider, P. F., and Shenoy, P. J., editors (2007). Proceedings of the 16th International Conference on World Wide Web, WWW 2007, Banff, Alberta, Canada, May 8-12, 2007. ACM. [Wolf et al., 2002] Wolf, J. L., Squillante, M. S., Yu, P. S., Sethuraman, J., and Ozsen, L. (2002). Optimal crawling strategies for web search engines. In Lassner, D., Roure, D. D., and Iyengar, A., editors, Proceedings of the Eleventh International World Wide Web Conference, WWW 2002, May 7-11, 2002, Honolulu, Hawaii, pages 136–147. ACM. [Wu et al., 2006] Wu, B., Goel, V., and Davison, B. D. (2006). Topical trustrank: using topi- cality to combat web spam. In Carr, L., Roure, D. D., Iyengar, A., Goble, C. A., and Dahlin, M., editors, Proceedings of the 15th international conference on World Wide Web, WWW 2006, Edinburgh, Scotland, UK, May 23-26, 2006, pages 63–72. ACM. [Wylot et al., 2014] Wylot, M., Cudr´e-Mauroux,P., and Groth, P. T. (2014). Tripleprov: efficient processing of lineage queries in a native RDF store. In Chung, C., Broder, A. Z., Shim, K., and Suel, T., editors, 23rd International World Wide Web Conference, WWW ’14, Seoul, Republic of Korea, April 7-11, 2014, pages 455–466. ACM. [Wylot et al., 2015a] Wylot, M., Cudr´e-Mauroux,P., and Groth, P. T. (2015a). A demonstra- tion of tripleprov: Tracking and querying provenance over web data. PVLDB, 8(12):1992– 1995. [Wylot et al., 2015b] Wylot, M., Cudr´e-Mauroux,P., and Groth, P. T. (2015b). Executing provenance-enabled queries over web data. In Gangemi, A., Leonardi, S., and Panconesi,


A., editors, Proceedings of the 24th International Conference on World Wide Web, WWW 2015, Florence, Italy, May 18-22, 2015, pages 1275–1285. ACM. [Xin et al., 2006] Xin, D., Han, J., Cheng, H., and Li, X. (2006). Answering top-k queries with multi-dimensional selections: The ranking cube approach. In [Dayal et al., 2006], pages 463–475. [Zadeh, 1965] Zadeh, L. (1965). Fuzzy sets. Information and Control, 8(3):338 – 353. [Zeginis et al., 2011] Zeginis, D., Tzitzikas, Y., and Christophides, V. (2011). On computing deltas of RDF/S knowledge bases. TWEB, 5(3):14. [Zhang et al., 2009] Zhang, Y., Jansen, B. J., and Spink, A. (2009). Time series analysis of a web search engine transaction log. Inf. Process. Manage., 45(2):230–245. [Zhao and Hartig, 2012] Zhao, J. and Hartig, O. (2012). Towards interoperable provenance publication on the linked data web. In Bizer, C., Heath, T., Berners-Lee, T., and Hausenblas, M., editors, WWW2012 Workshop on Linked Data on the Web, Lyon, France, 16 April, 2012, volume 937 of CEUR Workshop Proceedings. CEUR-WS.org. [Zimmermann et al., 2012] Zimmermann, A., Lopes, N., Polleres, A., and Straccia, U. (2012). A general framework for representing, reasoning and querying with annotated semantic web data. J. Web Sem., 11:72–95.


Curriculum Vitae

Renata Dividino

Contact

Address: 17, Halef Crt, B3NOC1, Halifax, Nova Scotia, Canada
E-mail: [email protected]

Education

2004 – 2007: Universität des Saarlandes, Germany. Master in Computer Sciences.
2001 – 2002: Ecole Centrale de Lyon, France. Diplome d'Option en Informatique.
1999 – 2004: Universidade Estadual de Campinas, Brazil. Bachelor in Computer Sciences.

Work Experience

10/2008–10/2015: Institute for Web Science and Technologies, Germany. Research Associate.
05/2007–09/2008: Fraunhofer-Institut für Graphische Datenverarbeitung IGD, Germany. Research Associate.
03/2005–03/2007: Deutsches Forschungszentrum für Künstliche Intelligenz, Germany. Student Associate.
02/2003–07/2004: Universidade Estadual de Campinas, Brazil. Student Associate.
05/2002–01/2003: Avivias (University Startup), France. Student Internship.


Teaching

SS 2014: Seminar Linked Data, Universität Koblenz-Landau
WS 2013/2014: Exercises Advanced Data Model, Universität Koblenz-Landau
WS 2010/2011: Exercises Database Systems, Universität Koblenz-Landau
WS 2010/2011: Exercises Advanced Data Model, Universität Koblenz-Landau
WS 2010/2011: Exercises Interactive Multimedia Systems, Universität Koblenz-Landau
SS 2010: Exercises Semantic Web (Summer Academy), Universität Koblenz-Landau
WS 2009/2010: Exercises Database Systems, Universität Koblenz-Landau
WS 2009/2010: Exercises Advanced Data Model, Universität Koblenz-Landau
WS 2009/2010: Exercises Interactive Multimedia Systems, Universität Koblenz-Landau
SS 2009: Seminar Control, Trust and Privacy, Universität Koblenz-Landau

Publications

• Renata Queiroz Dividino, Thomas Gottron, Ansgar Scherp: Strategies for Efficiently Keeping Local Linked Open Data Caches Up-To-Date. International Semantic Web Conference (2) 2015: 356-373

• Renata Queiroz Dividino, Thomas Gottron, Ansgar Scherp, Gerd Gröner: From Changes to Dynamics: Dynamics Analysis of Linked Open Data Sources. PROFILES@ESWC 2014

• Renata Queiroz Dividino, Andre Kramer, Thomas Gottron: An Investigation of HTTP Header Information for Detecting Changes of Linked Open Data Sources. ESWC (Satellite Events) 2014: 199-203

• Renata Queiroz Dividino, Gerd Gröner: Which of the following SPARQL Queries are Similar? Why? LD4IE@ISWC 2013

• Renata Queiroz Dividino, Ansgar Scherp, Gerd Gröner, Thomas Gottron: Change-a-LOD: Does the Schema on the Linked Data Cloud Change or Not? COLD 2013

• Kasjen Kramer, Renata Queiroz Dividino, Gerd Gröner: SPACE: SPARQL Index for Efficient Autocompletion. International Semantic Web Conference (Posters & Demos) 2013: 157-160

• Renata Queiroz Dividino, Gerd Gröner, Stefan Scheglmann, Matthias Thimm: Ranking RDF with Provenance via Preference Aggregation. EKAW 2012: 154-163

• Simon Schenk, Renata Queiroz Dividino, Steffen Staab: Using provenance to debug changing ontologies. J. Web Sem. 9(3): 284-298 (2011)

• Thomas Franz, Jörg Koch, Renata Queiroz Dividino, Steffen Staab: LENA-TR: Browsing Linked Open Data Along Knowledge-Aspects. AAAI Spring Symposium: Linked Data Meets Artificial Intelligence 2010

• Renata Queiroz Dividino, Simon Schenk, Sergej Sizov, Steffen Staab: Provenance, Trust, Explanations - and all that other Meta Knowledge. KI 23(2): 24-30 (2009)

• Renata Queiroz Dividino, Sergej Sizov, Steffen Staab, Bernhard Schueler: Querying for provenance, trust, uncertainty and other meta knowledge in RDF. J. Web Sem. 7(3): 204-219 (2009)

• Renata Queiroz Dividino, Veli Bicer, Konrad Voigt, Jorge Cardoso: Integrating business process and user interface models using a model-driven approach. ISCIS 2009: 492-497

• Simon Schenk, Renata Queiroz Dividino, Steffen Staab: Reasoning With Provenance, Trust and all that other Meta Knowlege in OWL. SWPM 2009

• Renata Queiroz Dividino, Massimo Romanelli, Daniel Sonntag: Semiotic-based Ontology Evaluation Tool (S-OntoEval). LREC 2008

• Ricardo da Silva Torres, Claudia Bauzer Medeiros, Renata Queiroz Dividino, Mauricio Augusto Figueiredo, Marcos Andre Goncalves, Edward A. Fox, Ryan Richardson: Using digital library components for biodiversity systems. JCDL 2004: 408
