XML to RDF Transformation
Diploma Thesis
August 3, 2006

XML to RDF Transformation

Markus Fehlmann
of Aadorf TG, Switzerland (00-912-857)

supervised by
Prof. Dr. Harald Gall
Dr. Gerald Reif

Department of Informatics
software evolution & architecture lab

Author: Markus Fehlmann, [email protected]
Project period: February 3, 2006 - August 3, 2006
Software Evolution & Architecture Lab
Department of Informatics, University of Zurich

Acknowledgements

I am grateful to Gerald Reif, whose PhD thesis was an excellent foundation for this work. Many of the features and ideas now implemented originated from fruitful discussions during the last six months. Gerald proved that the sentence used to advertise the thesis, "be best supervised by your advisers", was more than just empty words. I also express my gratitude to Professor Harald Gall for giving me the opportunity to write my diploma thesis in the field of the Semantic Web, which I believe will strongly influence the way people access and process information, the main resource of today's information society. I thank my parents for making my education possible and for all their encouragement through the years. Needless to say, I could not have done this without them. My thanks also go to my fellow students and friends who provided me with useful input and critical suggestions.

Abstract

XML continues to be the primary format for data exchange in distributed systems. However, since several serializations of domain-specific knowledge are possible, XML documents have no immanent semantics. The Semantic Web provides a common framework that allows data to be shared and reused across application, enterprise, and community boundaries. The Resource Description Framework (RDF), which is part of the Semantic Web, formalizes the meaning of information.
While many documents are encoded in XML, only a few are represented in RDF. In his PhD thesis, Reif proposed an algorithm and built a prototype implementation, called WEESA, that generates RDF graphs from arbitrary XML documents by applying processing instructions defined in a mapping. In this thesis we propose an object-oriented architecture for the mapping algorithm in order to improve its maintainability, efficiency, and extensibility. In addition, we introduce new mapping directives that simplify the mapping definition process. The result of this thesis is a new implementation of the mapping algorithm that incorporates the suggested object-oriented architecture and the additional mapping constructs. Thus, the transformation from XML data to RDF could be simplified to a reasonable extent. A prominent example that benefits from our results is the semantic annotation of Web sites.

Zusammenfassung (Summary)

XML is the principal format for exchanging data in distributed systems. However, XML documents have no immanent semantics, since XML permits different serializations of the same domain-specific knowledge. The Semantic Web offers a framework that allows data to be shared and reused across application and enterprise boundaries. The Resource Description Framework (RDF), a component of the Semantic Web, formalizes the meaning of information for this purpose. While many documents exist in XML, only a few have an RDF representation. In his dissertation, Reif proposed an algorithm that creates RDF representations from arbitrary XML documents by applying processing instructions from a mapping document to the XML document. He also implemented the algorithm as a prototype. In this thesis we present an object-oriented architecture of the mapping algorithm in order to improve its maintainability, efficiency, and extensibility. In addition, we extend the mapping vocabulary with directives that simplify the creation of mappings. The result of this work is a new implementation of the algorithm that combines the object-oriented architecture and the new mapping directives. In this way, the transformation of XML documents into the RDF format could be simplified considerably. An important application area that can benefit from our results is the semantic annotation of Web sites.

Contents

1 Introduction
1.1 Semantic Web Overview
1.1.1 Origins and Vision
1.1.2 Architecture
1.2 XML based Web Engineering
1.2.1 Apache Cocoon
1.3 Problem Statement
1.4 Structure of the Thesis, Objectives of this Work
2 Semantic Annotation of XML-based Web Applications
2.1 Tools for Manual Annotation
2.2 Embedding and Retrieving Metadata
2.2.1 GRDDL
2.2.2 RDFa
2.3 XML to Metadata Translation
2.3.1 Bridging the Gap between RDF and XML
2.3.2 Mapping XML to OWL Ontologies
2.3.3 Lifting XML Schema to OWL
2.3.4 Round-tripping between XML and RDF
2.3.5 XR
2.4 Conversion of arbitrary document types to RDF
2.4.1 Data Conversion, Extraction and Record Linkage using XML and RDF Tools in Project SIMILE
3 Introduction to WEESA
3.1 WEESA - Web Engineering for Semantic Web Applications
3.2 Semantic Web Applications with WEESA and Apache Cocoon
3.2.1 Integration of WEESA in the Apache Cocoon Framework
3.2.2 WEESA Cocoon Transformer to generate HTML+RDF
3.2.3 WEESA Cocoon Transformer to generate RDF/XML
3.3 Building the Knowledge Base of the Semantic Web Application
3.3.1 Architecture and Maintenance of the WEESA Knowledge Base
4 Design of the Object-Oriented Architecture
4.1 General Object-Oriented Design Principles
4.2 Application of Design Principles to Mapping Algorithm
4.3 Description of the WEESA Mapping Algorithm
4.3.1 Resource Dependencies
4.3.2 Circular References in Resource Definitions
4.3.3 Mapping Procedure
4.3.4 Sample Mapping Procedure
5 WEESA Mapping Features
5.1 Target Ontology and Source XML File
5.2 WEESA Mapping Structure
5.2.1 Method Parameter Attributes
5.3 Relative Paths
5.4 Variables
5.5 If Statement
5.6 Switch Statement
5.7 Dictionary
5.8 Datatype Attribute for Typed Literals
5.9 Language Section, Language Attributes
5.9.1 Language Tag Syntax
5.9.2 Definition of a Default Language
5.9.3 Language Definition in Triples Section
5.9.4 Language Precedence and Typed Literals
6 Conclusions and Future Work
6.1 Conclusions
6.2 Summary of Contributions
6.3 Future Research
A XML Schemas
A.1 Mapping Definition
A.2 WEESA Dictionary Schema
A.3 Sample Shop Schema
B XML Files
B.1 Sample Shop
B.2 Sample Dictionary
B.3 Shop Mapping
B.4 Shop Ontology
C Java Code of Methods used in Mapping Examples
C.1 weesa.util.MappingLib.addPrefix
C.2 weesa.util.MappingLib.avg
D Content of the CD

List of Figures

1.1 Semantic Web Architecture [AvH04]
1.2 A simple RDF Graph
1.3 A Blank Node that represents a Shop Item
1.4 A Cocoon Pipeline using several XML Technologies [Rei05]
2.1 Recursive Application of the GRDDL Mechanism [Haz05]
2.2 Operating Sequence of an XML to OWL Transformation as suggested in [BA05]
2.3 Operating Sequence of an XML to OWL Transformation as suggested in [FZT04]
3.1 WEESA Design and Instance Levels [Rei05]
3.2 a) Cocoon Pipeline to integrate RDF into HTML, b) Pipeline to create a separate RDF/XML File [Rei05]
4.1 Interface Hierarchy of Jena's RDF Nodes
4.2 Interface Hierarchy of Mapping Elements that generate RDF Nodes and Arcs
4.3 Triple Class that makes use of Interfaces described in Figure 4.2
4.4 Const Class Hierarchy that is used similarly for the Const, Method, and XPath Classes
4.5 ExpressionFactory Class used for Resource, Literal, and Property Creation
4.6 Integration of new Mapping Directives in Class Hierarchy
4.7 Extract of the Operator Class Hierarchy
4.8 Circular Dependencies that are not directly supported by the recursive Mapping Algorithm
4.9 Simple RDF Graph for the Description of the Mapping Algorithm
5.1 The Sample Shop Ontology
5.2 Triple expressing that TrekKing belongs to the Outdoor Sector
5.3 Triple created using Variables for ID/IDREF Relationship
6.1 Target RDF Graph that needs careful Consideration with Respect to Variable and XPath Dependencies
6.2 RDF Graph for Resources with several incoming Edges

List of Tables

6.1 Table describing Variable and Relative Path Dependencies of Figure 6.2

List of Listings

1.1 Three different XML Representations of the same Fact
1.2 RDF/XML Serialization Example
1.3 TriX Serialization Example
1.4 TriG Serialization