Detecting Errors in Linked Data Using Learned Ontologies and Outlier

Detecting Errors in Linked Data Using Ontology Learning and Outlier Detection

Inaugural dissertation for the attainment of the academic degree of Doctor of Natural Sciences at the Universität Mannheim

Submitted by Daniel Fleischhacker from Groß-Gerau

Mannheim, 2015

Dean: Professor Dr. Heinz Jürgen Müller, Universität Mannheim
Referee: Professor Dr. Heiner Stuckenschmidt, Universität Mannheim
Co-referee: Professor Dr. Felix Naumann, Universität Potsdam
Date of the oral examination: 11 March 2016

Abstract

Linked Data is one of the most successful implementations of the Semantic Web idea. This is demonstrated by the large amount of interlinked data available in the repositories that constitute the Linked Open Data cloud. Many of these datasets are not created manually but are extracted automatically from existing datasets. Thus, extraction errors that a human would easily recognize might go unnoticed and could considerably diminish the usability of Linked Data. The large amount of data renders manual detection of such errors unrealistic and makes automatic approaches for detecting errors desirable. To address this need, this thesis focuses on error detection approaches on the logical level and on the level of numerical data. In addition, the presented methods operate solely on the Linked Data dataset without a requirement for additional external data.

The first two parts of this work deal with the detection of logical errors in Linked Data. It is argued that first formalizing the knowledge required for error detection into ontologies, and then applying it in a separate step, has several advantages over approaches that skip the formalization step. Consequently, the first part introduces inductive approaches for learning highly expressive ontologies from existing instance data as a basis for detecting logical errors.
The proposed and evaluated techniques make it possible to learn class disjointness axioms as well as several property-centric axiom types from instance data.

The second part of this thesis operates on the ontologies learned by the approaches proposed in the first part. First, their quality is improved by detecting errors possibly introduced by the automatic learning process. For this purpose, a pattern-based approach for finding the root causes of ontology errors, tailored to the specifics of the learned ontologies, is proposed and then used in the context of ontology debugging approaches. To conclude the logical error detection, the final chapter of the second part evaluates the use of learned ontologies for finding erroneous statements in Linked Data. This is done by applying a pattern-based error detection approach that employs the learned ontologies to the DBpedia dataset and then manually evaluating the results, which shows the adequacy of learned ontologies for logical error detection.

The final part of this thesis complements the logical error detection with an approach to detect data-level errors in numerical values. The presented method applies outlier detection techniques to datatype property values to find potentially erroneous ones; additional preprocessing steps improve both the results and the performance of the detection step. Furthermore, a subsequent cross-checking step is proposed to handle natural outliers, a problem inherent to outlier detection.

In summary, this work introduces a number of approaches that detect errors in Linked Data without requiring additional external data. The generated lists of potentially erroneous facts can serve as a first indication of errors, and the intermediate step of learning ontologies makes the full workflow even better suited to scenarios that include human interaction.
Summary

Linked Data is one of the most successful realizations of the ideas of the Semantic Web, which is particularly evident from the large amounts of data available as part of the Linked Open Data cloud. Many of these datasets, however, have not been created manually but have been extracted from existing datasets by automated approaches. As a result, they contain many errors that could have been recognized had the data been created manually, but which now limit the usability of the data. Since subsequent manual error detection is not practical given the large amount of data, an automated approach for detecting data errors is desirable. This thesis addresses this need by introducing methods for detecting data errors on the logical and the numerical level. A main focus lies on approaches that can be applied directly to the Linked Data dataset without additional external data sources.

The first two parts of this thesis deal with the detection of errors on the logical level. Fundamentally, the thesis argues in favor of using ontologies for an upstream formalization of the knowledge used for error detection. Therefore, the first part presents inductive approaches for learning expressive ontologies, which support ontology axioms for class disjointness as well as a number of property-specific axioms. The second part then builds on the ontologies learned in this way. To detect errors in the learned ontologies, a pattern-based method for determining the causes of ontology errors is proposed, which is tailored specifically to the axiom types used in the ontologies and to their usage. This method is then used in the context of various approaches for repairing errors in ontologies, and the results are evaluated. Finally, the detection of logical errors by means of the learned ontologies is demonstrated through experiments on the DBpedia dataset. The subsequent evaluation shows the applicability of learned ontologies for detecting logical errors.

In the third part, a method for detecting errors in numerical values, which is based on outlier detection techniques, is then extended. A preprocessing step improves the accuracy of the approach and at the same time reduces the required processing time. An additional post-processing step allows values connected in the Linked Data dataset to be incorporated in order to handle natural outliers.

In summary, this thesis presents approaches whose combination allows errors in Linked Data to be detected on the logical and the numerical level while remaining independent of external data sources. The lists of potential errors produced by these approaches can subsequently be checked manually and corrected if necessary. The intermediate step via ontologies opens up additional possibilities for interactive use.

Contents

1 Introduction
    1.1 Research Questions
    1.2 Reader’s Guide
2 Foundations
    2.1 Description Logics and Ontologies
    2.2 RDF and Linked Data
        2.2.1 Resource Description Framework
        2.2.2 Linked Data
        2.2.3 DBpedia

I Learning Expressive Schemas
3 Preliminaries
    3.1 Learning from Instance Data
    3.2 Association Rule Mining
        3.2.1 Generating Association Rules
        3.2.2 Other Algorithms
    3.3 Statistical Schema Induction
4 Related Work
    4.1 Ontology Learning
    4.2 Inductive Ontology Learning
    4.3 Learning Disjointness Axioms
    4.4 Profiling Linked Data Datasets
5 Inductive Learning of Disjointness Axioms
    5.1 Class Disjointness Gold Standard
        5.1.1 Methodology
        5.1.2 Analysis
    5.2 Approaches
        5.2.1 Correlation-Based Approach
        5.2.2 Association Rule Mining-Based Approach
        5.2.3 Negative Association Rule-based Approach
    5.3 Evaluation
    5.4 Conclusion
6 Inductive Learning of Property Axioms
    6.1 Approaches
        6.1.1 Terminology Acquisition
        6.1.2 Creation of Transaction Tables
        6.1.3 Association Rule Mining and Axiom Generation
    6.2 Evaluation
        6.2.1 Settings
        6.2.2 Expert Evaluation
        6.2.3 Crowd-Sourced Evaluation
    6.3 Conclusions and Contributions

II Logical Debugging of Linked Data
7 Generating Incoherence Explanations
    7.1 Related Work
    7.2 Approach
        7.2.1 Generation of Explanations
        7.2.2 Implementation
    7.3 Experiments
        7.3.1 Settings
        7.3.2 Results
    7.4 Conclusion
8 Repairing Incoherent Ontologies
    8.1 Related Work
    8.2 Approaches
        8.2.1 Baseline Approach
        8.2.2 Axiom Adding Approach
        8.2.3 MAP Inference-Based Approach
        8.2.4 Pure Markov Logic Approach
    8.3 Evaluation
        8.3.1 Settings
        8.3.2 Results
    8.4 Conclusion
9 Schema-Based Error Detection
    9.1 Related Work
    9.2 Approach
    9.3 Experiments
        9.3.1 Disjointness-Enriched Ontologies
        9.3.2 Property-Enriched Ontology
    9.4 Conclusion

III Detection of Numerical Errors in Linked Data
10 Preliminaries: Outlier Detection
    10.1 Statistical Outlier Detection
    10.2 Nearest-Neighbor-Based Outlier Detection
11 Detecting Numerical Errors
    11.1 Related Work
    11.2 Approach
        11.2.1 Dataset Inspection
        11.2.2 Generation of Possible Constraints
        11.2.3 Finding Subpopulations
        11.2.4 Outlier Detection and Outlier Scores
