Possibly Imperfect Ontologies for Effective Information Retrieval
Possibly imperfect ontologies for effective information retrieval
by Gábor Nagypál

Dissertation, Universität Karlsruhe (TH), Fakultät für Informatik, 2007
Reviewers: Prof. em. Dr. Dr. h. c. Peter C. Lockemann, Prof. Dr. Rudi Studer
Date of the oral examination: 16.05.2007

Imprint
Universitätsverlag Karlsruhe
c/o Universitätsbibliothek
Straße am Forum 2
D-76131 Karlsruhe
www.uvka.de

This work is licensed under the following Creative Commons license: http://creativecommons.org/licenses/by-nc-nd/2.0/de/

Universitätsverlag Karlsruhe 2007
Print on Demand
ISBN: 978-3-86644-190-3

To Bea and Bendi.

Acknowledgments

I wrote this work during my time at the FZI Research Center for Information Technologies. I would like to thank all of my colleagues at FZI, especially Simone Braun, Stephan Grimm, Heiko Paoli, Andreas Schmidt, Peter Tomczyk, and Andreas Walter, for the great working atmosphere that made my years at FZI enjoyable.

My very warm thanks go to Dr. Wassilios Kazakos. He supported my work during his time at FZI and also thereafter. He read and commented on many versions of my thesis, from the earliest to the latest. We also share a common interest in history, and Wassili's historical comments definitely improved the historical quality of the examples in the thesis. Finally, Wassili also helped me to start my life after my PhD work (yes, it seems that such a thing really exists).

I am deeply indebted to Dr. Boris Motik, who helped me to polish my ideas on fuzzy time intervals that finally led to this thesis. Boris also showed me how to pursue PhD research and how to finish it successfully, just by doing it perfectly right.

I am much obliged to Prof. em. Dr. Dr. h. c. Peter C. Lockemann, my thesis supervisor. He helped start my research career in Germany by inviting me to FZI as a Marie Curie fellow. Later he supported my application as a PhD student to Karlsruhe.
He read and corrected all of my written ideas and thesis versions amazingly quickly yet thoroughly over the last years. It is clear that without him this thesis would not exist.

I would also like to thank Prof. Dr. Rudi Studer, who accepted my request to be my second supervisor despite his numerous duties. In general, the scientific works of Prof. Studer's group were always a great inspiration to me.

My warm thanks go to Dr. Sándor Gajdos, my thesis supervisor in Hungary, for his support in the early days and for encouraging me to make this huge step and continue my research in Germany.

I would like to thank my students. Without their support, work and ideas it would have been impossible to implement and evaluate such a complex system. József Vadkerti implemented the first version of the semantic search engine on the basis of Lucene. Zhigao Yu worked a lot on the user interface of the IRCON prototype. Andreas Hintermeier implemented the tool to construct fuzzy time intervals. Finally, Ümit Yoldas elaborated on and implemented my first vague ideas about metadata generation. His work served as the basis for the metadata generation part of this thesis.

Last but not least, my biggest thanks go to my family. Without the lifelong support of my parents I would not even have dared to start my PhD research. My wife, Bea, and my son, Bendi, to whom this thesis is dedicated, tolerated the many evenings, nights and weekends I had to steal from them to finish this work. Bea persuaded me to stay in Germany and finish my thesis here. Her continuous support was invaluable for successfully finishing this journey.

Contents

Acknowledgments
1 Introduction
2 Fundamentals
  2.1 The IR process
  2.2 Data retrieval vs. information retrieval
  2.3 Full-text IR and the vector space model
  2.4 Efficiency and Effectiveness
  2.5 Representing background knowledge
    2.5.1 Thesauri
    2.5.2 Ontologies
    2.5.3 Ontology modeling constructs
    2.5.4 Semantic annotations
  2.6 Summary
3 Problem analysis
  3.1 Motivating scenario
  3.2 The prediction game
  3.3 Full-text search
    3.3.1 Problems with full-text search
    3.3.2 Requirements inferred by full-text search problems
  3.4 Ontologies for information retrieval
    3.4.1 Potential uses of ontologies in the IR process
  3.5 Imperfection in background knowledge
    3.5.1 Imperfect ontologies
    3.5.2 Domain imperfection
  3.6 Time modeling in history
    3.6.1 Uncertainty
    3.6.2 Subjectivity
    3.6.3 Vagueness
    3.6.4 Combined aspects
  3.7 User Friendliness
  3.8 Scalability and Interactivity
    3.8.1 The importance of efficiency
    3.8.2 Efficiency problems of ontology reasoning
    3.8.3 Experimental demonstration of the efficiency problems
    3.8.4 Combining ontology reasoning with other technologies
  3.9 Feasibility
  3.10 Summary
4 State of the Art
  4.1 Systems focusing on ontology instance retrieval
    4.1.1 QuizRDF
    4.1.2 SEAL Portals
    4.1.3 The system of Zhang et al.
    4.1.4 The system of Rocha et al.
  4.2 Systems focusing on document retrieval
    4.2.1 KIM
    4.2.2 The system of Vallet et al.
    4.2.3 SCORE
    4.2.4 OWLIR
  4.3 Summary
5 Overview of my approach
  5.1 Lessons learned from related work
  5.2 Missing features in state-of-the-art solutions
  5.3 Fundamental modeling decisions
  5.4 The ontology-supported IR process
  5.5 High-level system architecture
  5.6 Feasibility of the approach
  5.7 Summary
6 Representing temporal imperfection
  6.1 Requirements
    6.1.1 Temporal Relations
    6.1.2 Compatibility with classical time specifications
    6.1.3 Temporal Specifications vs. General Theory of Time
  6.2 Models of imperfection
    6.2.1 Modeling uncertainty
    6.2.2 Modeling vagueness
    6.2.3 Modeling subjectivity
    6.2.4 Comparison of the theories
  6.3 Hierarchy of imperfection aspects in time specifications
  6.4 Related Work
  6.5 Temporal Model
    6.5.1 The preferred mathematical model of imperfection
    6.5.2 Intervals or Points?
    6.5.3 Continuous vs. Discrete
    6.5.4 Time Intervals as Fuzzy Sets
  6.6 Creating fuzzy time intervals
    6.6.1 Creating intervals using statistical distribution
    6.6.2 The algorithm for the construction of fuzzy intervals
    6.6.3 GUI of the fuzzy interval construction tool
  6.7 Fuzzy Temporal Relations
    6.7.1 Defining Crisp Temporal Relations using Set Operations
    6.7.2 Extending Auxiliary Interval Operators to Fuzzy Intervals
    6.7.3 Constraints using Comparison with the Empty Set
    6.7.4 Temporal Relations on Fuzzy Intervals
  6.8 Discussion of the fuzzy temporal model
    6.8.1 Representing Imprecise Information.
    6.8.2 Compatibility with Crisp Case
    6.8.3 Intuitiveness of Fuzzy Relations.
  6.9 Alternative approaches
  6.10 Generality of the approach
    6.10.1 Fuzzy Bounding Shapes
    6.10.2 Bounding Boxes vs. Bounding Shapes
  6.11 Summary
7 Ontology formalism
  7.1 Requirements
  7.2 Choosing the appropriate formalism
    7.2.1 Alternative formalisms
    7.2.2 A Datalog-based ontology formalism
  7.3 Fuzzy time in the ontology formalism
    7.3.1 General considerations
    7.3.2 Integration into the Datalog-based formalism
  7.4 Summary
8 Metadata generation
  8.1 Information model
    8.1.1 Requirements
    8.1.2 Related work
    8.1.3 The IRCON information model