Semantic Enrichment of Ontology Mappings
Total Page:16
File Type:pdf, Size:1020Kb
Semantic Enrichment of Ontology Mappings Von der Fakultät für Mathematik und Informatik der Universität Leipzig angenommene DISSERTATION zur Erlangung des akademischen Grades Doctor Rerum Naturalium (Dr. rer. nat.) im Fachgebiet Informatik Vorgelegt von Patrick Arnold, M. Sc. Informatik, geboren am 26. Juli 1987 in Leipzig Die Annahme der Dissertation haben empfohlen: 1. Prof. Dr. Erhard Rahm (Universität Leipzig) 2. Prof. Dr. Sören Auer (Universität Bonn) Die Verleihung des akademischen Grades erfolgt mit Bestehen der Verteidigung am 15. Dezember 2015 mit dem Gesamtprädikat magna cum laude. Acknowledgement First, and foremost, I want to sincerely thank Prof. Rahm for his supervision and support, his encouragement and patience, as well as his professional advice. Much of the achieve- ments and outcomes of this work would not have been possible without his guidance and counsel, and I am entirely convinced that it also added a lot to my personal development as a researcher in the field of computer science. The support I obtained from my colleagues at the database department is ineffable, and I would never have believed how much I could learn from them within this time. I want to thank them for the many important pieces of advice they gave me, and for the help they offered whenever help was needed. I also want to thank them for the wonderful time we spent together; our daily lunch times, our annual summer seminars in Zingst and our lively group discussions shall always remain in my memories. In particular, I want to thank Christian Wartner, who introduced me to the database department and remained a very close colleague for many years. I also thank Sabine Maßmann, who encouraged me still in my student days to become a researcher at our database department. I also want to thank Anika Groß and Eric Peukert, who gave me much support and advice for my research topic and PhD thesis. I thank Marcel Jacob and Max Möller, whose assistance to the project was very fruitful. Eventually, I want to thank our secretary Andrea Hesse, who always takes care about the many formalities and paperwork that are constantly accumulating, and whose daily help is beyond any description. I am very grateful to my family, who enabled my education and supported me throughout the entire time. I am also very grateful to the many proof-readers who helped me on very short notice, and who gave very important suggestions for improve- ment. Additionally, I should not fail to mention that the many conferences and workshops I attended have been very supportive to me; they were very great experiences. There are only few people I could call by name, but the advice gained from them, reviewers, conference participants and fellow researchers alike, have been very crucial for my work. i At last, I want to express my deep gratitude for the European Commission and the Fed- eral State of Saxony that funded my research at our grand Alma Mater. They are the ac- tual institutions that made this research possible. I also want to thank the InfAI Leipzig e.V. that allowed me three years of participation in the Linked Design project, and Sebas- tian Fuß, who always took care about organizational and employment matters. Leipzig, July 2015 Patrick Arnold ii Abstract Schema and ontology matching play an important part in the field of data integration and semantic web. Given two heterogeneous data sources, meta data matching usually constitutes the first step in the data integration workflow, which refers to the analysis and comparison of two input resources like schemas or ontologies. The result is a list of correspondences between the two schemas or ontologies, which is often called mapping or alignment. Many tools and research approaches have been proposed to automatically determine those correspondences. However, most match tools do not provide any in- formation about the relation type that holds between matching concepts, for the simple but important reason that most common match strategies are too simple and heuristic to allow any sophisticated relation type determination. Knowing the specific type holding between two concepts, e.g., whether they are in an equality, subsumption (is-a) or part-of relation, is very important for advanced data inte- gration tasks, such as ontology merging or ontology evolution. It is also very important for mappings in the biological or biomedical domain, where is-a and part-of relations may exceed the number of equality correspondences by far. Such more expressive map- pings allow much better integration results and have scarcely been in the focus of re- search so far. In this doctoral thesis, the determination of the correspondence types in a given mapping is the focus of interest, which is referred to as semantic mapping enrichment. We intro- duce and present the mapping enrichment tool Stroma, which obtains a pre-calculated schema or ontology mapping and for each correspondence determines a semantic rela- tion type. In contrast to previous approaches, we will strongly focus on linguistic laws and linguistic insights. By and large, linguistics is the key for precise matching and for the determination of relation types. We will introduce various strategies that make use of these linguistic laws and are able to calculate the semantic type between two match- ing concepts. The observations and insights gained from this research go far beyond the field of mapping enrichment and can be also applied to schema and ontology matching in general. iii Since generic strategies have certain limits and may not be able to determine the relation type between more complex concepts, like a laptop and a personal computer, background knowledge plays an important role in this research as well. For example, a thesaurus can help to recognize that these two concepts are in an is-a relation. We will show how back- ground knowledge can be effectively used in this instance, how it is possible to draw conclusions even if a concept is not contained in it, how the relation types in complex paths can be resolved and how time complexity can be reduced by a so-called bidirec- tional search. The developed techniques go far beyond the background knowledge ex- ploitation of previous approaches, and are now part of the semantic repository SemRep, a flexible and extendable system that combines different lexicographic resources. Further on, we will show how additional lexicographic resources can be developed au- tomatically by parsing Wikipedia articles. The proposed Wikipedia relation extraction approach yields some millions of additional relations, which constitute significant addi- tional knowledge for mapping enrichment. The extracted relations were also added to SemRep, which thus became a comprehensive background knowledge resource. To aug- ment the quality of the repository, different techniques were used to discover and delete irrelevant semantic relations. We could show in several experiments that STROMA obtains very good results w.r.t. relation type detection. In a comparative evaluation, it was able to achieve consider- ably better results than related applications. This corroborates the overall usefulness and strengths of the implemented strategies, which were developed with particular emphasis on the principles and laws of linguistics. iv Contents I Introduction 1 1 Introduction 3 1.1 Motivation . 3 1.2 Scientic Contributions . 9 1.3 Outline . 10 2 Basics of Schema and Ontology Matching 13 2.1 Data Integration . 13 2.2 Database Schemas and Ontologies . 16 2.3 Matching and Mapping . 18 2.4 Ontology Matching Techniques . 21 3 Basics of Linguistics 23 3.1 Morphology . 23 3.1.1 Morphemes . 23 3.1.2 Word Formation . 25 3.1.3 Arbitrariness of Language . 28 3.1.4 Relevance for Ontology Mapping . 30 3.2 Semantics . 31 3.2.1 Concepts . 31 3.2.2 The Hierarchy of Concepts . 33 3.2.3 Meronyms and Holonyms . 33 3.2.4 Relation Types . 35 4 Related Work 37 4.1 Relation Type Determination . 38 4.1.1 Match Tools . 38 4.1.2 Comparison of Approaches . 42 4.2 Background Knowledge . 42 4.2.1 Introduction . 42 v CONTENTS 4.2.2 Classication of Resources . 44 4.2.3 Manually or Collaboratively Generated Resources . 45 4.2.4 Automatically Generated Approaches . 48 4.2.5 Web-based Approaches . 51 4.2.6 Juxtaposition of Background Knowledge Approaches . 52 4.3 Mapping Enrichment and Repair . 55 II The STROMA System 57 5 STROMA Architecture and Workflow 59 5.1 Introduction . 59 5.2 STROMA Workow . 61 5.2.1 Overview . 61 5.2.2 Type Computation . 63 5.2.3 Selection . 64 5.3 Technical Details . 66 6 Implemented Strategies 69 6.1 Compound Strategy . 69 6.1.1 Overview . 69 6.1.2 Compounds in Dierent Languages . 72 6.1.3 Relations between Compound and Modier . 72 6.2 Background Knowledge . 73 6.3 Itemization Strategy . 74 6.3.1 Problem Description . 74 6.3.2 Approach . 74 6.3.3 Two-ways Adaptation . 78 6.4 Structure Strategy . 78 6.5 Multiple Linkage . 81 6.6 Word Frequency Strategy . 82 6.6.1 Implementation . 83 6.6.2 Obstacles . 84 6.7 Type Verication . 85 6.7.1 Verifying Is-a Relations . 86 6.7.2 Verifying Equal-Relations . 87 6.7.3 Verifying Part-of relations . 87 6.8 Strategy Comparison . 88 vi CONTENTS 7 Evaluation 91 7.1 Evaluating the Match Quality . 91 7.2 Evaluating Semantic Mappings . 93 7.2.1 Problem Description . 93 7.2.2 Measures for Perfect Mappings . 94 7.2.3 Measures for Authentic Input Mappings . 95 7.3 Evaluation Data Sets . 97 7.4 Evaluations Based on Perfect Input Mappings . 99 7.4.1 Evaluating the Full STROMA System . 99 7.4.2 Evaluating Selected Strategies .