Ontology-Based Information Retrieval
Henrik Bulskov Styltsvig
A dissertation presented to the Faculties of Roskilde University in partial fulfillment of the requirements for the degree of Doctor of Philosophy
Computer Science Section, Roskilde University, Denmark
May 2006

Abstract
In this thesis, we present methods for introducing ontologies in information retrieval. The main hypothesis is that the inclusion of conceptual knowledge such as ontologies in the information retrieval process can contribute to the solution of major problems currently found in information retrieval. This utilization of ontologies involves a number of challenges. Our focus is on the use of similarity measures derived from the knowledge about relations between concepts in ontologies, on the recognition of semantic information in texts and the mapping of this knowledge into the ontologies in use, and on how to fuse the ideas of ontological similarity and ontological indexing into a realistic information retrieval scenario.

To achieve the recognition of semantic knowledge in a text, shallow natural language processing is used during indexing to reveal knowledge down to the level of noun phrases. Furthermore, we briefly cover the identification of semantic relations inside and between noun phrases, and discuss which kinds of problems are caused by an increase in compoundness with respect to the structure of concepts in the evaluation of queries.

Measuring similarity between concepts based on distances in the structure of the ontology is discussed. In addition, a shared nodes measure is introduced and, based on a set of intuitive similarity properties, compared to a number of different measures. In this comparison the shared nodes measure appears to be superior, though more computationally complex. Some of the major problems of shared nodes, which relate to the way relations differ with respect to the degree to which they bring the concepts they connect closer together, are discussed. A generalized measure called weighted shared nodes is introduced to deal with these problems.

Finally, the utilization of concept similarity in query evaluation is discussed. A semantic expansion approach that incorporates concept similarity is introduced, and a generalized fuzzy set retrieval model that applies expansion during query evaluation is presented. While not commonly used in present information retrieval systems, the fuzzy set model appears to provide the flexibility needed when generalizing to an ontology-based retrieval model and, with the introduction of a hierarchical fuzzy aggregation principle, compound concepts can be handled in a straightforward and natural manner.

Resumé
The focus of this thesis is the application of ontologies in information retrieval. The overall hypothesis is that the introduction of conceptual knowledge, such as ontologies, in query evaluation can contribute to the solution of significant problems in existing methods.

This inclusion of ontologies involves a number of major challenges. We have chosen to focus on similarity measures based on knowledge about relations between concepts, on the recognition of semantic knowledge in text, and on how ontology-based similarity measures and semantic indexing can be united in a realistic approach to information retrieval.

The recognition of semantic knowledge in text is carried out by means of simple natural language processing in the indexing process, with the aim of uncovering noun phrases. Furthermore, we outline the problems involved in identifying which semantic relations simple noun phrases are built from, and discuss how an increase in the compounding of concepts influences query evaluation.

We describe how a measure of similarity can be based on distance in the structure of ontologies, and introduce a new measure, "shared nodes". This measure is compared to a number of other measures by means of a collection of intuitive properties for similarity measures. The comparison shows that "shared nodes" has advantages over the other measures, but also that it is computationally more complex. We further describe a number of significant problems with "shared nodes", related to the differences between relations with respect to the degree to which they bring the concepts they connect closer together. A more general measure, "weighted shared nodes", is introduced as a solution to these problems.

Finally, we focus on how a similarity measure that compares concepts can be incorporated into query evaluation. The solution we present introduces a semantic expansion based on similarity measures. The evaluation method applied is a generalized fuzzy set retrieval model that includes expansion of queries. Although the fuzzy set model is not commonly used in information retrieval, it turns out to have the flexibility needed for a generalization to ontology-based query evaluation, and the introduction of a hierarchical aggregation principle makes it possible to handle compound concepts in a simple and natural way.

Acknowledgements
The work presented here has been completed within the framework of the interdisciplinary research project OntoQuery and the Design and Management of Information Technology Ph.D. program at Roskilde University. First of all, I would like to thank my supervisor Troels Andreasen for constructive, concise and encouraging supervision from beginning to end. His advice has been extremely valuable throughout the course of the thesis. I would furthermore like to thank my friend and colleague Rasmus Knappe for fruitful discussions and support throughout the process. My thanks also go to the members of the OntoQuery project, Jørgen Fischer Nilsson, Per Anker Jensen, Bodil Nistrup Madsen, and Hanne Erdman Thomsen, as well as the other Ph.D. students associated with the project. Finally, I would like to thank my friends and family for putting up with me during a somewhat stressful time, and I would especially like to thank my wife Kirsten for standing by me and giving me the room to make this happen.

Contents
1 Introduction
  1.1 Research Question
  1.2 Thesis Outline
  1.3 Foundations and Contributions

2 Information Retrieval
  2.1 Search Strategies
    2.1.1 Term Weighting
    2.1.2 Boolean Model
    2.1.3 Vector Model
    2.1.4 Probabilistic Model
    2.1.5 Fuzzy Set Model
  2.2 Retrieval Evaluation
  2.3 Summary and Discussion

3 Ontology
  3.1 Ontology as a Philosophical Discipline
  3.2 Knowledge Engineering Ontologies
  3.3 Types of Ontologies
  3.4 Representation Formalisms
    3.4.1 Traditional Ontology Languages
    3.4.2 Ontology Markup Languages
    3.4.3 Ontolog
  3.5 Resources
    3.5.1 Suggested Upper Merged Ontology
    3.5.2 SENSUS
    3.5.3 WordNet
    3.5.4 Modeling Ontology Resources
  3.6 Summary and Discussion

4 Descriptions
  4.1 Indexing
  4.2 Ontology-based Indexing
    4.2.1 The OntoQuery Approach
    4.2.2 Word Sense Disambiguation
    4.2.3 Identifying Relational Connections
  4.3 Instantiated Ontologies
    4.3.1 The General Ontology
    4.3.2 The Domain-specific Ontology
  4.4 Summary and Discussion

5 Ontological Similarity
  5.1 Path Length Approaches
    5.1.1 Shortest Path Length
    5.1.2 Weighted Shortest Path
  5.2 Depth-Relative Approaches
    5.2.1 Depth-Relative Scaling
    5.2.2 Conceptual Similarity
    5.2.3 Normalized Path Length
  5.3 Corpus-Based Approaches
    5.3.1 Information Content
    5.3.2 Jiang and Conrath's Approach
    5.3.3 Lin's Universal Similarity Measure
  5.4 Multiple-Paths Approaches
    5.4.1 Medium-Strong Relations
    5.4.2 Generalized Weighted Shortest Path
    5.4.3 Shared Nodes
    5.4.4 Weighted Shared Nodes Similarity
  5.5 Similarity Evaluation
  5.6 Summary and Discussion

6 Query Evaluation
  6.1 Semantic Expansion
    6.1.1 Concept Expansion
  6.2 Query Evaluation Framework
    6.2.1 Simple Fuzzy Query Evaluation
    6.2.2 Ordered Weighted Averaging Aggregation
    6.2.3 Hierarchical Aggregation
    6.2.4 Hierarchical Query Evaluation Approaches
  6.3 Knowledge Retrieval
  6.4 Query Language
  6.5 Surveying, Navigating, and Visualizing
  6.6 Summary and Discussion

7 A Prototype
  7.1 Design Considerations
  7.2 The Prototype Ontology
  7.3 The Document Base
  7.4 The Indexing Module
  7.5 The Evaluation Module

8 Conclusion
  8.1 Ontological Indexing
  8.2 Ontological Similarity
  8.3 Query Evaluation
  8.4 Summary
  8.5 Further Work

Bibliography
List of Figures
2.1 A simple model for information retrieval
2.2 Functions in information retrieval systems
2.3 Resolving power of significant words
2.4 Conjunctive components of a query
2.5 Recall and precision

3.1 Brentano's tree of Aristotle's categories
3.2 Sowa's lattice of categories
3.3 Guarino's kinds of ontologies
3.4 Lassila and McGuinness' categorization of ontologies
3.5 A frame stereotype example
3.6 A frame example
3.7 KL-ONE example
3.8 Knowledge representation system in description logic
3.9 The basic description language AL
3.10 Description logic example
3.11 Peircian algebra and RU
3.12 Ontology markup languages
3.13 Merging of information in SENSUS
3.14 Adjective antonym cluster
3.15 Interpretations of multiwords

4.1 Descriptions in the OntoQuery prototype
4.2 The WordNet hierarchy for five different senses of "board"
4.3 Decomposition of compound concepts
4.4 Visualizing ontologies
4.5 Instantiated ontologies
4.6 Partial semantic parsing in information extraction

5.1 The property of generalization
5.2 An example ontology covering pets
5.3 An example ontology with relation isa covering pets
5.4 The ontology transformed into a directed weighted graph
5.5 Fragment of the WordNet taxonomy
5.6 Compound concept in an ontology
5.7 Ontology as directional weighted graph
5.8 The property of generalization
5.9 The depth property
5.10 Importance of shared nodes

6.1 An instantiated ontology
6.2 A simple instantiated ontology
6.3 A simple instantiated ontology

7.1 The overall architecture of the prototype system
7.2 Adjective antonym cluster
7.3 Visualizing similarity
7.4 A sentence in the SemCor SGML format
7.5 The Noun Phrase Grammar in EBNF
7.6 Different values for ρ
7.7 An instantiated ontology
List of Tables
2.1 Some T-norms and T-conorms
2.2 Distribution of weights by a function

3.1 Kant's categories
3.2 Semantic relations
3.3 Lexical matrix
3.4 WordNet statistics
3.5 Unique beginners in WordNet
3.6 Percentage of correct sense identification
3.7 Familiarity example

4.1 Senses associated with the word "bass"

5.1 Correlation between different similarity measures
5.2 Human similarity judgments replica
5.3 Correlation between the three human similarity judgments
5.4 Mean ratings for a subset of the experiments
5.5 Relations used in the different compound concept pairs
5.6 Similarity value for the ten compound concept pairs

6.1 Similarity matrix
Chapter 1
Introduction
Over the years, the volume of information available through the World Wide Web has been increasing continuously, and never has so much information been so readily available and shared among so many people. The role of searching applications has therefore changed radically, from systems designed for special purposes with a well-defined target group to general systems for almost everyone. Unfortunately, the unstructured nature and huge volume of the information accessible over networks have made it increasingly difficult for users to sift through and find relevant information. Numerous information retrieval techniques have been developed to help deal with this problem.

The information retrieval techniques commonly used are based on keywords. These techniques use keyword lists to describe the content of information, but one problem with such lists is that they do not say anything about the semantic relationships between keywords, nor do they take into account the meaning of words and phrases. It is often difficult for ordinary users to use information retrieval systems based on these keyword-based techniques. Users frequently have problems expressing their information needs and translating those needs into requests. This is partly because information needs cannot be expressed appropriately in the terms used by the system, and partly because it is not unusual for users to apply search terms that differ from the keywords information systems use.

Various methods have been proposed to help users choose search terms and formulate requests. One widely used approach is to incorporate a thesaurus-like component into the information system that represents the important concepts in a particular subject area as well as the semantic relationships connecting them. Using conceptual knowledge to help users formulate their requests is just one method of introducing conceptual knowledge into information retrieval. Another is to use conceptual knowledge as an intrinsic feature of the system in the process of retrieving the information. Obviously, such methods are not
mutually exclusive and would complement one another well.

In the late nineties, Tim Berners-Lee [1998] introduced the idea of a Semantic Web, where machine-readable semantic knowledge is attached to all information. The semantic knowledge attached to information is united by means of ontologies, i.e. the concepts attached to information are mapped into these ontologies. Machines can then determine that concepts in different pieces of information are actually the same due to the positions they hold in the ontologies. The representation of the semantic knowledge in the Semantic Web is done by use of a special markup language, as the focus is the hypertext structure of the World Wide Web. In information retrieval systems, this representation serves as the fundamental description of the information captured by the system, and thus the representation requires completely different structures. Nevertheless, the mapping of concepts in information into conceptual models, i.e. ontologies, appears to be a useful method for moving from keyword-based to concept-based information retrieval.

Ontologies can be general or domain-specific, they can be created manually or automatically, and they can differ in their forms of representation and ways of constructing relationships between the concepts, but they all serve as an explicit specification of a conceptualization. Thus, if the users of a given information retrieval system can agree upon the conceptualization in the ontologies used, then the retrieval process can benefit from using these in the evaluation and articulation of requests.

In this thesis, we present methods for introducing ontologies in information retrieval along this line. The main hypothesis is that the inclusion of conceptual knowledge in the information retrieval process can contribute to the solution of major problems in current information retrieval systems. When working with ontology-based information retrieval systems, one important aim is to utilize knowledge from a domain-specific ontology to obtain better and more exact answers on a semantic basis, i.e. to compare concepts rather than words. In this context, better answers are primarily more fine-grained and better-ranked information base objects, obtained by exploiting improved methods for computing the similarity between a query and the objects in the information base. The idea is to use measures of similarity between concepts derived from the structure of the ontology, and by doing so, replace reasoning over the ontology with numerical similarity computation.

The benefit of applying ontology-based similarity measures depends on the quality of the semantic interpretation of queries and information objects, as the aim is to relate the information found in these to the senses covered by the ontologies. In information retrieval, the interpretation process is referred to as indexing. During this process, descriptions of queries and information objects are extracted and either stored (in the analysis of the information base) or used for searching for similar information. It is therefore necessary to introduce natural
language processing in the indexing process in order to extract the intrinsic semantic knowledge of the (textual) information. Natural language processing is challenging, and completely exposing the semantics is very demanding, not to say impossible. We therefore present ways to achieve a shallow processing that recognizes noun phrases, which can then be mapped into the ontology to create the semantic foundation of the representation used.

The central topics of this thesis can thus be summarized as measures of semantic similarity and semantic information analysis, denoted "ontological similarity" and "ontological indexing", respectively, and methods for uniting these in a realistic information retrieval scenario, thereby promoting semantics in the information retrieval process. To achieve this, a generalized fuzzy set retrieval model with ontology-based query expansion is used. Finally, a number of the techniques presented are used in a prototype system intended to be the foundation for the evaluation and testing of the presented methodologies.
1.1 Research Question
The main objective of this thesis is to provide realistically efficient means to move from keyword-based to concept-based information retrieval, utilizing ontologies as the reference for conceptual definitions. More specifically, the key research question is the following: How do we recognize concepts in information objects and queries, represent these in the information retrieval system, and use the knowledge about relations between concepts captured by ontologies in the querying process? The main aspects in focus in this thesis are the following:
1. recognition and mapping of information in documents and queries into the ontologies,
2. improvement of the retrieval process by use of similarity measures derived from knowledge about relations between concepts in ontologies, and

3. how to weld the ideas of such ontological indexing and ontological similarity into a realistic information retrieval scenario.
The first part of the research question concerns the extraction of semantic knowledge from texts, the construction of (compound) concepts in a lattice-algebraic representation, and a word sense disambiguation of this knowledge such that it can be mapped into the ontology in use. The second part concerns the development of scalable similarity measures, where the idea is to compare concepts on the basis of the structure of the ontology, and the hypothesis that measures which incorporate as many aspects as possible will come closer to rendering human similarity judgments. Finally, the objective of the last part is how to bring the first two parts into play in information retrieval systems. This is accomplished by use of a query expansion technique based on concept similarity measures and a flexible retrieval model capable of capturing the paradigm shift of representations from simple collections of words to semantic descriptions.
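As a toy illustration of how the second and third parts interact, namely expanding a query concept with ontologically similar concepts before evaluation, consider the following minimal Python sketch. All concepts and similarity values here are hypothetical, and the similarity table merely stands in for the structural measures developed later in this thesis:

    # Toy ontology-based query expansion (hypothetical concepts and values).
    # The similarity table is a stand-in for structural similarity measures.
    similarity = {
        ("cat", "cat"): 1.0,
        ("cat", "kitten"): 0.8,
        ("cat", "pet"): 0.6,
        ("cat", "animal"): 0.4,
    }

    def expand(concept, threshold=0.5):
        """Return {related concept: weight} for similarities above threshold."""
        return {c2: s for (c1, c2), s in similarity.items()
                if c1 == concept and s >= threshold}

    print(expand("cat"))  # {'cat': 1.0, 'kitten': 0.8, 'pet': 0.6}

The expanded, weighted query is then what a soft (e.g. fuzzy) evaluation model matches against document descriptions, rather than the original keyword alone.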
1.2 Thesis Outline
Chapter 2 presents classical information retrieval models, including the fuzzy set retrieval model, and discusses their strengths and weaknesses. In addition, retrieval evaluation is briefly covered. Chapter 3 focuses on ontologies. A formal definition of ontology and a number of different formalisms for the representation of ontologies are presented. This chapter also presents and discusses a number of ontological resources, with a focus on WordNet, a large lexical database for English. The topics examined in Chapter 4 are all related to what is denoted ontological indexing. The discussion begins with the representation of information in general and the representation of semantic knowledge in particular. Next, the extraction of semantic knowledge, aimed at a (rather ad hoc) method that can produce semantic representations from information, is covered. Finally, the notion of instantiated ontologies is presented. Chapter 5 starts with a definition of a set of intuitive, qualitative properties used throughout the presentation of a number of different similarity measures to discuss their strengths and weaknesses. Special emphasis is put on the weighted shared nodes measure. At the end of this chapter, a simple experiment is presented in which the weighted shared nodes similarity measure is compared to a number of the other presented measures through comparison to human similarity judgments. In Chapter 6, query evaluation is covered. We introduce a technique called semantic expansion to incorporate a similarity measure in the query evaluation. Moreover, a generalized version of the fuzzy retrieval model is presented in which ordered weighted averaging and hierarchical aggregation are included. A prototype system that unites some of the introduced techniques is presented and discussed in Chapter 7. Finally, we conclude and indicate perspectives that can serve as subjects for further work.
1.3 Foundations and Contributions
This dissertation is funded in part by the OntoQuery research project (Ontology-based Querying) [Andreasen et al., 2000; Andreasen et al., 2002; OntoQuery, 2005]¹ and in part by Roskilde University. Research issues and the results presented in this dissertation are thus related to and inspired by work done in the OntoQuery project.

The main contributions covered in this thesis concern the design of measures of similarity based on the structure of ontologies, a discussion of representation and the extraction of descriptions, methods composed to support ontological query evaluation, and a discussion of the challenges in uniting it all into a prototype that can serve as the foundation for evaluation and testing. While working on this thesis, a number of papers have been submitted to different conferences and journals. These papers cover the progression in the design and application of ontological similarity measures, along with a number of closely related topics. The following is a short list of some of the primary research topics covered by these papers and included in this thesis:
• Similarity measures, with the measures Shared Nodes and Weighted Shared Nodes as the final contributions.
• Instantiated Ontologies, an ontology modeling method used to restrict a general ontology to the concepts that occur in, for instance, a document collection or a query.
• Query evaluation, with Hierarchical Ordered Weighted Averaging Aggregation as the primary contribution, simplifying query evaluation by comparing descriptions of the query with the descriptions of the documents.
The listed research topics are collaborative work, but my contribution to them is central, and they therefore play a central role in this thesis.
¹ The following institutions have participated in this project: Centre for Language Technology, The Technical University of Denmark - Informatics and Mathematical Modelling, Copenhagen Business School - Computational Linguistics, Roskilde University - Intelligent Systems Laboratory, and the University of Southern Denmark.
Chapter 2
Information Retrieval
Information retrieval deals with access to information as well as its representation, storage and organization. The overall goal of an information retrieval process is to retrieve the information relevant to a given request. The criterion for complete success is the retrieval of all the relevant information items stored in a given system, and the rejection of all the non-relevant ones. In practice, the result of a given request usually contains both a subset of all the relevant items and a subset of irrelevant items, but the aim remains, of course, to meet the ideal criteria for success.

In theory, it is possible for information of any kind to be the object of information retrieval. One information retrieval system could handle different kinds of information simultaneously, e.g. information retrieval in multi-media environments. However, in the majority of information retrieval systems, the items handled are text documents, and information retrieval is therefore often regarded as synonymous with document retrieval or text retrieval. The notion of text documents in information retrieval incorporates all types of texts, spanning from complete texts, such as articles, books, web pages, etc., to minor fragments of text, such as sections, paragraphs, sentences, etc. A given system, for example, could be designed for one specific type of document, or for heterogeneous documents, and be either monolingual or multilingual.

Information retrieval systems do not actually retrieve information, but rather documents from which the information can be obtained if they are read and understood [Van Rijsbergen, 1979; Lancaster, 1979]. To be more precise, that which is being retrieved is the system's internal description of the documents; the process of fetching the documents being represented is thus a separate process [Lancaster, 1979]. Despite this loose definition, information retrieval is the term commonly used to refer to this kind of system, and thus, whenever the term information retrieval is used, it refers to this text-document-description retrieval definition.

In Figure 2.1 [Ingwersen, 1992], which shows the basic concepts of an information retrieval system, representation is defined as the stored information, the matching function as a certain search strategy for finding the stored information, and queries as the requests to the system for certain specific information.

[Figure 2.1: A simple model for information retrieval; a query is compared to the representation by a matching function]

Representation comprises an abstract description of the documents in the system. Nowadays, more and more documents are full text documents, whereas previously, representation was usually built on references to documents rather than the documents themselves, similar to bibliographic records in library systems [Jones and Willett, 1997]. References to documents are normally semi-structured information with predefined slots for different kinds of information, e.g. title, abstract, classification, etc., while full text documents typically are unstructured, except for the syntax of the natural language.

The matching function in an information retrieval system models the system's notion of similarity between documents and queries, hence defining how to compare requests to the stored descriptions in the representation. As discussed later in this chapter, each model has its advantages and disadvantages, with no single strategy being superior to the others.

In Figure 2.2, a more detailed view of the major functions in most information retrieval systems is shown [Lancaster, 1979]. At the top of the figure, the input side of the system is defined as a set of selected documents. These documents are organized and controlled by the indexing process, which is divided into two parts, a conceptual analysis and a transformation. The conceptual analysis, or content analysis, recognizes the content of the documents, and the transformation transforms this information into the internal representation. This representation is stored and organized in a database for later (fast) retrieval. The bottom of the figure defines the output side, which is very similar to the input side. The conceptual analysis and transformation of requests recognizes the semantics and transforms this knowledge into a representation similar to the representation used in the indexing process. Then, some type of search strategy (matching function) is used to retrieve documents by comparing the description of the request with the descriptions of the documents. If the request process is iterative, as shown in the figure by broken lines, it uses either documents or document descriptions in the process of obtaining the information needed. This iterative querying process is called query reformulation.

[Figure 2.2: The major functions performed in many types of information retrieval systems; input side: population of documents, selected documents, indexing (conceptual analysis and translation), indexing records, database of document representations, system vocabulary, document store; output side: population of system users, requests, conceptual analysis and translation, preparation of search strategies, search strategies]

In traditional information retrieval systems, the typical description of documents is as a collection of words, ranging from all the words contained in the documents to a smaller collection of keywords chosen to point out the content of the documents, which are not necessarily found in the particular document. In Figure 2.2, system vocabulary refers to a collection of words valid for use as keywords in the indexing and querying process. If a system's vocabulary equals the collection of all words used in the stored documents, then no restrictions are defined; otherwise, the vocabulary is used to control the system's collection of valid keywords. More compound pieces of information, like multi-word fragments of text, natural language phrases, etc., can also be used as parts of descriptions, but would obviously require special kinds of representation and processing techniques.

In data retrieval the aim is to retrieve all the data that satisfies a given query. If the database is consistent, the result will be exactly all the data that can possibly be found in the domain covered by the database. One single erroneous object means total failure. The query language in data retrieval clearly defines the conditions for retrieving the data in the well-defined structure and semantics of the database. Knowledge of the query language and the structure and semantics of the database is presupposed for querying a database. In information retrieval, however, some of the documents retrieved might be inaccurate and erroneous without disqualifying the result in general. The descriptions of both documents and requests are a kind of "interpretation" of content. Consequently, the documents retrieved are not a strict match based on well-defined syntax and semantics, but the estimated "relevant" information. The notion of relevance is the core of information retrieval, and the retrieved documents are those found to be most relevant to the given request under the conditions available (representation, search strategy, and query).

The notion of descriptions and the indexing process will be discussed further in Chapter 4. Before going into detail on specific search strategies, a short introduction to term weighting is given. Following the section on search strategies is a discussion of retrieval evaluation, and finally, a summary of the chapter.
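The division between representation and matching function can be made concrete with a minimal sketch. The following Python fragment (hypothetical documents; the "conceptual analysis" is reduced to lowercasing and splitting) builds a keyword representation as an inverted index and uses term overlap as the matching function:

    # Minimal keyword representation and matching function (illustrative only).
    documents = {  # hypothetical document base
        "d1": "information retrieval deals with representation and storage",
        "d2": "ontologies capture relations between concepts",
    }

    # Indexing: build an inverted index, term -> set of document ids.
    index = {}
    for doc_id, text in documents.items():
        for term in set(text.lower().split()):
            index.setdefault(term, set()).add(doc_id)

    # Matching function: documents sharing at least one term with the request.
    def match(request):
        return set().union(*(index.get(t, set()) for t in request.lower().split()))

    print(match("relations between concepts"))  # {'d2'}

The models discussed in the remainder of this chapter refine either the representation step or the matching step of this skeleton.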
2.1 Search Strategies
The motivation for entering requests into an information retrieval system is an information need [Lancaster, 1979], and the success of a given retrieval system depends on the system's capability to provide the user with the information needed within a reasonable time and with a straightforward interface for posing requests and collecting the results [Austin, 2001]¹. Until the mid-nineties, operational information retrieval systems were characterized, almost without exception, by their adoption of the Boolean model² of searching. This is a surprising strategy considering the antagonism between the intrinsic restrictiveness of the Boolean model and the intrinsic uncertainty in information retrieval, but an understandable one, on the other hand, when the constraints of the file-handling technology and the presumption that users with access to the systems were professionals are considered [Jones and Willett, 1997]. The intrinsic uncertainty of information retrieval is caused by the fact that the representation of documents is "uncertain", i.e. the extraction of information from documents or queries is a highly uncertain process [Crestani et al., 1998].

¹ The out-of-context interpretation of Mooers' Law.
² Described in Section 2.1.2.

In the majority of the information systems available, the interface for posing queries is a collection of words, which is a rather limited language for expressing an information need. The first step for a user in the process of retrieving information is therefore a transformation of their representation of the information needed into a collection of words. This is by no means a trivial process and requires some training to master, even at a novice level, but since this is the kind of interface most often used, training opportunities are abundant. The drawback is that it is very difficult to introduce other interfaces, such as natural language, since (well-trained) users are accustomed to this limited interface. Nevertheless, independent of the query language, a transformation is needed for the user to express their information need, preferably with some prior knowledge about the system.

The information retrieval system can provide help during the query definition process in many ways, partly via interaction with the user and partly via refinements of the evaluation process. Interaction with the user can take place either prior to posing the query or as an iterative query reformulation process. Access to knowledge about the representation of documents, for instance, browsing the set of keywords, is an example of prior interaction, while relevance feedback is an example of iterative interaction. Relevance feedback is a technique in which the user points out relevant documents from the result of a (tentative) query. The system then uses this subset of documents in a query reformulation to refine the query [Salton and McGill, 1997]. This process can be repeated until a satisfactory result is found.

Two examples of refinements of the evaluation process are softening of the evaluation criteria and query expansion. One example of the former is the replacement of strict match with best match, a partial match in which documents only partially fulfilling the query are also accepted. The retrieved documents are then ordered based on their partial fulfillment of the queries, with the ones closest to the queries first, hence the name best match. This gives the user the possibility of adding more words to the query without the risk of getting an empty result because some words in the query are mutually exclusive, in the sense that no documents are represented by that combination of words. In query expansion, the system expands the query with additional information, for instance, synonyms. This expansion is done after the query is posed to the system, but before the evaluation process.

Two assumptions common to the retrieval models presented here are the document independence assumption and the term independence assumption. The document independence assumption [Robertson, 1997] states that the relevance of one document with respect to a given query is independent of the other documents retrieved for the same query. As a counterargument to the document independence assumption, one could argue that the usefulness of retrieved documents is dependent on the other retrieved documents, e.g. after retrieving one document (and having read it) the next, similar document would be less useful, but the relatedness would still be the same. Similarly, the term independence assumption states that terms are independent of the other terms in the same document. Since concepts in documents are often semantically related to one another, term independence is an unrealistic assumption. The reason for adopting the term independence assumption is that it leads to models that are easy to implement, extend and conceptualize.
2.1.1 Term Weighting

One of the simplest representations of documents in information retrieval is a collection of terms corresponding to all the words contained in the documents. The problem with such a representation is that not all words in a document specify the content equally, and therefore the indexing process should provide some kind of differentiation among the words present. The classical approach for doing this is called term weighting. Weights indicate relevance, and a simple and well-known weighting principle is to derive weights from the frequency with which words appear in the documents. The justification for the use of the frequency of words is that words occur in natural language unevenly, and classes of words are distinguishable by the frequency of their occurrence [Luhn, 1958]. Actually, the occurrence characteristics of words can be characterized by the constant rank-frequency law of Zipf:

    frequency \cdot rank \simeq constant    (2.1)

where frequency is the number of occurrences of a given word and rank is obtained by sorting the words by frequency in decreasing order. Thus, the frequency of a given word multiplied by the rank of the word is approximately equal to the frequency of another word multiplied by its rank.

A simple approach to term weighting is term frequency tf_{i,j}, where each term t_i is weighted according to the number of occurrences of the word associated with the term in document d_j. The selection of significant words is then defined by thresholds rejecting the most frequent and the least frequent words. The reason for rejecting the high and low frequency words is that neither of them are good content identifiers. This idea entails the notion of resolving power, that is, the degree of discrimination of a given word. The limits are illustrated in Figure 2.3 [Luhn, 1958].

[Figure 2.3: Resolving power of significant words; horizontal axis: words in decreasing frequency order; vertical axis: resolving power; nonsignificant high-frequency and low-frequency words are cut off by an upper and a lower threshold, with the presumed resolving power of significant words peaking in between]

One of the problems with this simple approach is that some words may be assigned to many of the items in a collection and yet still be more relevant in some items than in others. Such words could be rejected by a high-frequency threshold, which means that a relative frequency measure would be more appropriate. A relative frequency measure can be divided into a local and a global weighting. The local weight for a given document defines the relevance of the terms appearing in it, for instance tf_{i,j}, or in a normalized version:

    f_{i,j} = \frac{tf_{i,j}}{\max_{1 \le l \le n_j}(tf_{l,j})}

where n_j is the number of terms present in the document d_j [Baeza-Yates and Ribeiro-Neto, 1999].

One well-known global weight is the inverse document frequency, which assigns a level of discrimination to each word in a collection of items (documents). A word appearing in most items should have a lower global weight than words appearing in few items [Salton and McGill, 1986], such that:

    idf_i = \log \frac{N}{n_i}    (2.2)

where n_i is the number of items in which term t_i appears, and N is the total number of documents in the collection. One of the most important relative frequency measures is given by:

    w_{i,j} = f_{i,j} \times \log \frac{N}{n_i}    (2.3)

which assigns a weight to each word in a document depending not only on the local frequency of the word in the item, but also on the resolving power of that word in the collection of documents [Salton and McGill, 1986]. This approach is known as tf-idf (term frequency - inverted document frequency).

The approaches described here constitute the classical core ideas of term weighting. Over the last couple of decades, many alternative approaches and modifications of the ones given here have been developed. However, the common goal of all these approaches is to assign a weight to each word in each item, indicating to what degree the word describes the content of the document.
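Equations (2.2) and (2.3) translate directly into code. A minimal sketch over a hypothetical toy corpus follows (a realistic implementation would add proper tokenization, stemming and stop-word handling):

    import math

    # tf-idf weighting following equations (2.2) and (2.3) (toy corpus).
    docs = {
        "d1": ["ontology", "based", "retrieval", "ontology"],
        "d2": ["fuzzy", "retrieval", "model"],
        "d3": ["ontology", "similarity"],
    }
    N = len(docs)

    def f(term, doc_id):
        # normalized local frequency f_{i,j} = tf_{i,j} / max_l tf_{l,j}
        words = docs[doc_id]
        return words.count(term) / max(words.count(t) for t in set(words))

    def idf(term):
        # global weight idf_i = log(N / n_i); assumes the term occurs somewhere
        n_i = sum(1 for words in docs.values() if term in words)
        return math.log(N / n_i)

    def w(term, doc_id):
        # w_{i,j} = f_{i,j} * log(N / n_i), equation (2.3)
        return f(term, doc_id) * idf(term)

    print(round(w("ontology", "d1"), 3))  # 1.0 * log(3/2) ≈ 0.405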
2.1.2 Boolean Model

As mentioned above, the Boolean model was the dominating search strategy in information retrieval until the mid-nineties. The primary reason for this was, on the one hand, that the model is simple and intuitive, and on the other, that there was no clear consensus among researchers as to which of the many non-Boolean models was the best [Cooper, 1997], and thus there was no immediate alternative.

The Boolean model is based on set theory and Boolean algebra. Queries specified as Boolean expressions have precise semantics in addition to being relatively intuitive, at least with regard to simple queries. The retrieval strategy in the Boolean model is based on the binary decision criterion, which means that a document is predicted either to be relevant or not relevant to a given query; hence there are no intermediate relevance judgments. In the Boolean model documents are usually represented as binary vectors of index terms, specifying which terms are assigned to the documents. Queries are lists of keywords combined with the Boolean operators AND, OR and NOT (∧, ∨, ¬), thus forming Boolean expressions. A typical strategy is a conjunctive reading of the different aspects or facets in the queries and a disjunctive reading of the terms within the facets, e.g. synonyms, inflections, etc., for example:

    q = ("car" ∨ "auto" ∨ "automobile") ∧ ("holiday" ∨ "vacation")

for a query on "car vacations". In more complicated queries, precedence often has to be made explicit by parentheses, e.g. for the query [a ∧ b ∨ c], it is necessary to choose either [(a ∧ b) ∨ c] or [a ∧ (b ∨ c)]. This is often too difficult for ordinary users to handle, and most queries are therefore very simple. Even very simple expressions, however, have interpretations that create confusion for novice users. One example of this is the confusion between the Boolean "AND" and "OR" for English speakers, because in ordinary conversation a noun phrase of the form "A AND B" usually refers to more entities than "A" alone, while in logic, the use of "AND" refers to fewer documents than "A" alone [Cooper, 1997].

Boolean expressions can be represented as a disjunction of conjunctions, referred to as the disjunctive normal form (DNF). For instance, the query q = t_a ∧ (t_b ∨ ¬t_c) can be represented in disjunctive normal form as:

    q_{DNF} = (t_a ∧ t_b) ∨ (t_a ∧ ¬t_c)

This corresponds to the vector representation (1,1,1) ∨ (1,1,0) ∨ (1,0,0), where each of the components is a binary vector associated with the tuple (t_a, t_b, t_c), as shown in Figure 2.4 [Baeza-Yates and Ribeiro-Neto, 1999].

[Figure 2.4: The three conjunctive components of the query q = t_a ∧ (t_b ∨ ¬t_c), shown as the regions (1,0,0), (1,1,0) and (1,1,1) in a Venn diagram over the terms t_a, t_b and t_c]

A major drawback of the Boolean model is the exact match caused by the binary decision criterion. One way to remedy this is to model the likelihood of relevance as the number of index terms that the query and a document have in common. This presupposes that both the query and the documents are represented as sets of index terms. A best match evaluation can then be performed in which the retrieved documents are ordered by decreasing number of elements in common with the query, thus listing the most relevant documents first. The strategy for posing queries would then be completely different, not requiring considerations about whether to combine terms with "AND" or "OR".
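Under the set-theoretic reading, Boolean query evaluation is plain set algebra over the binary index-term assignments. A small sketch with hypothetical documents:

    # Boolean evaluation over sets of index terms (illustrative sketch).
    docs = {
        "d1": {"car", "holiday"},
        "d2": {"auto", "vacation"},
        "d3": {"car", "insurance"},
    }
    all_docs = set(docs)

    def having(term):
        # the set of documents indexed by a given term
        return {d for d, terms in docs.items() if term in terms}

    # q = ("car" OR "auto" OR "automobile") AND ("holiday" OR "vacation")
    result = ((having("car") | having("auto") | having("automobile"))
              & (having("holiday") | having("vacation")))
    print(sorted(result))  # ['d1', 'd2']

    # NOT is set complement with respect to the whole collection:
    print(sorted(all_docs - having("car")))  # ['d2']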
2.1.3 Vector Model

The binary assignment of index terms to documents and queries used in the Boolean model is often too limited. Furthermore, the term assignment strategy can be either narrow or broad, i.e. either only strongly related terms are assigned or (almost) all related terms are assigned. Whatever assignment strategy is chosen, the foundation for the evaluation model is limited to the choices made in the assignment process, i.e. the quality of the retrieval process is based on the information available in the descriptions defined by the indexing process. If this uncertainty, or graduation of the relatedness of terms to the documents, is postponed to the evaluation process, this information can be used in the evaluation, grading the result not only by the terms in common, but also by the terms' importance for query and documents. This has been shown to have a significant positive influence on the quality of the retrieval [Salton and Lesk, 1997].

In the vector model (or vector space model) [Salton and McGill, 1986], documents and queries are represented as vectors:

    \vec{d_k} = (w_{k,1}, w_{k,2}, \ldots, w_{k,t})
    \vec{q} = (w_{q,1}, w_{q,2}, \ldots, w_{q,t})

where \vec{d_k} represents document d_k and \vec{q} represents query q. The weight w_{k,i} is the term weight for the i'th term of the index vocabulary for document k, where w_{k,i} ≥ 0 and t is the number of terms in the index vocabulary. The weights in the vector model are therefore not bound to the binary scheme, but can take any positive value, and documents and queries are represented as t-dimensional vectors. For the evaluation process to calculate the degree of similarity between document d_k and query q, the correlation between \vec{d_k} and \vec{q} has to be calculated. This is done by a ranking function that, for a given query (vector), assigns to each document a weight describing its correlation with the query. Documents can then be ordered by this weight of correlation in descending order, thus showing the most relevant documents first. The value of similarity or relatedness expressed by the ranking function is often referred to as the retrieval status value (RSV). All ranking functions of the vector model family are based on the inner product [Kraaij, 2004]:

    RSV(\vec{q}, \vec{d_k}) = \sum_{i=1}^{t} w_{q,i} \cdot w_{k,i}    (2.4)

If binary weighting is used, defining only presence/absence with term weights as either 1 or 0, then the inner product (2.4) equals the best match evaluation described in the previous section. Binary vectors can be seen as simple crisp sets, and the inner product can then be seen as the cardinality of the intersection.

The assignment of weights to documents and queries should reflect the relevance of the given term (see Section 2.1.1 for more on term weights). A simple and classical example of non-binary term weights is a weight where the number of occurrences of a given term in a document is used as a measure of the relevance of the term to the given document. In this case, one could argue that the ranking function also has to take into account the size of the documents, which is not the case for the inner product ranking function (2.4). For instance, take two documents d_1 and d_2, which are one page and 20 pages long, respectively. If both match the same subset of a given query with the same RSV score in (2.4), then it could be argued that the smaller one is more focused on the query concepts, and thus should be more similar to the query. One approach that takes this into account is a normalization of the RSV score of (2.4) with the length of the documents, defined by the length of the vectors describing them. This vector length normalization, which corresponds to the cosine of the angle between the vectors [Baeza-Yates and Ribeiro-Neto, 1999], is:

    RSV(\vec{q}, \vec{d_k}) = \cos(\vec{q}, \vec{d_k}) = \frac{\vec{q} \cdot \vec{d_k}}{||\vec{q}|| \cdot ||\vec{d_k}||} = \frac{\sum_{i=1}^{t} w_{q,i} \cdot w_{k,i}}{\sqrt{\sum_{i=1}^{t} w_{q,i}^2} \sqrt{\sum_{i=1}^{t} w_{k,i}^2}}    (2.5)

This ranking function is called the cosine coefficient and is one of the preferred ranking functions in many vector models (see [Salton and McGill, 1986] for more on the variety of ranking functions).

The main advantages of the vector model are the term weighting scheme, the partial matching, and the ranking of documents according to their degree of similarity, which, in contrast to the Boolean model, are improvements. The vector model is also simple and has a resilient ranking strategy for general collections [Baeza-Yates and Ribeiro-Neto, 1999]. The main disadvantage is that the conceptual understanding of the ranking is less intuitive, and the ranked answer sets are therefore difficult to improve upon, as it is nearly impossible for users to verify how the ranking is calculated. Naturally, techniques such as relevance feedback and query expansion can be used, but the transparency of the Boolean model is lost.
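The inner product (2.4) and the cosine coefficient (2.5) in a minimal sketch, with sparse vectors represented as dictionaries over the index vocabulary (hypothetical weights):

    import math

    # Inner product (2.4) and cosine coefficient (2.5) for sparse vectors.
    def inner(q, d):
        return sum(w * d.get(t, 0.0) for t, w in q.items())

    def cosine(q, d):
        norms = (math.sqrt(sum(w * w for w in q.values()))
                 * math.sqrt(sum(w * w for w in d.values())))
        return inner(q, d) / norms if norms else 0.0

    q  = {"ontology": 1.0, "retrieval": 0.5}                # hypothetical query
    d1 = {"ontology": 0.4, "retrieval": 0.8, "fuzzy": 0.2}  # hypothetical doc

    print(round(inner(q, d1), 2))   # 0.8
    print(round(cosine(q, d1), 2))  # 0.78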
2.1.4 Probabilistic Model

The intrinsic uncertainty of information retrieval stems from the problem that it is difficult to distill the meaning from requests and documents, as well as to infer whether a document is relevant to a given request. One obvious way to deal with the uncertainty is probability theory, thus estimating the probability that a given document d_k will be relevant with respect to a specific query q, denoted P(R|q, d_k), where R is a relevance judgment. The first probabilistic model was introduced in 1960 [Maron and Kuhns, 1997], while the model discussed here was introduced in 1976 [Robertson and Jones, 1976] and later became known as the binary independence retrieval model. In the binary independence model, the judgment of a document's relevance to a given query is binary, hence a document is judged either relevant (R) or not relevant (\bar{R}).

The fundamental idea in classical probabilistic retrieval is that term distributions are different in relevant and non-relevant documents, an idea known as the cluster hypothesis [Van Rijsbergen, 1979]. Documents and queries are represented as binary vectors in the binary independence model, so instead of estimating P(R|q, d_k) for a specific document d_k and a given query q, the actual estimation is P(R|\vec{q}, \vec{d_k}), the probability that one vector of terms (the document) is relevant given another vector of terms (the query). Hence, different documents represented by the same vector of terms yield the same probability of relevance [Fuhr, 1992]. The similarity between a query q and a document d_k is defined as the ratio between the probability of d_k being relevant and the probability of d_k being non-relevant:

    RSV(q, d_k) = \frac{P(R|\vec{q}, \vec{d_k})}{P(\bar{R}|\vec{q}, \vec{d_k})}

Let P(R|\vec{d_k}) be the probability that \vec{d_k} is relevant to query q, let P(\bar{R}|\vec{d_k}) be the probability that it is non-relevant, and apply Bayes' theorem [Crestani et al., 1998]:

    RSV(q, d_k) = \frac{P(R|\vec{d_k})}{P(\bar{R}|\vec{d_k})} = \frac{P(\vec{d_k}|R) \times P(R)}{P(\vec{d_k}|\bar{R}) \times P(\bar{R})}    (2.6)

Here P(R) and P(\bar{R}) are the prior probabilities of relevance and non-relevance, and P(\vec{d_k}|R) and P(\vec{d_k}|\bar{R}) are the probabilities of observing \vec{d_k} given that relevance or non-relevance has been observed. The goal of the retrieval process is to rank documents in descending order of their probability of being relevant to a given query, the so-called probability ranking principle [Robertson, 1997]. In the probability ranking principle, it is the ranking of the documents that is important, not the actual probability values, and since P(R) and P(\bar{R}) for a given query q are the same for all documents in the collection, equation (2.6) can be reduced to [Crestani et al., 1998]:

    RSV(q, d_k) \sim \frac{P(\vec{d_k}|R)}{P(\vec{d_k}|\bar{R})}    (2.7)

The next step is to estimate P(\vec{d_k}|R) and P(\vec{d_k}|\bar{R}). In order to simplify this estimation, the components of \vec{d_k} are assumed to be stochastically independent while conditionally dependent upon R and \bar{R}. The joint probability distribution of the terms in document d_k is then given by [Crestani et al., 1998]:

    P(\vec{d_k}|R) = \prod_{i=1}^{t} P(d_{k,i}|R)

and

    P(\vec{d_k}|\bar{R}) = \prod_{i=1}^{t} P(d_{k,i}|\bar{R})

where t is the number of elements in \vec{d_k}. This binary independence assumption is the basis for the original binary independence retrieval model [Robertson and Jones, 1976]. The assumption has always been recognized as unrealistic, and the assumption that underpins the binary independence retrieval model was later redefined as the weaker linked dependence assumption [Cooper, 1991]:

    \frac{P(\vec{d_k}|R)}{P(\vec{d_k}|\bar{R})} = \prod_{i=1}^{t} \frac{P(d_{k,i}|R)}{P(d_{k,i}|\bar{R})}    (2.8)

which states that the ratio between the probabilities of \vec{d_k} occurring in relevant and non-relevant documents is equal to the product of the corresponding ratios of the single elements. The product in equation (2.8) can be split according to the occurrences of terms in document d_k [Fuhr, 1992]:

    \frac{P(\vec{d_k}|R)}{P(\vec{d_k}|\bar{R})} = \prod_{i:\, d_{k,i}=1} \frac{P(d_{k,i}=1|R)}{P(d_{k,i}=1|\bar{R})} \prod_{i:\, d_{k,i}=0} \frac{P(d_{k,i}=0|R)}{P(d_{k,i}=0|\bar{R})}    (2.9)

Recalling that P(d_{k,i}=1|R) + P(d_{k,i}=0|R) = 1, and using a logarithmic transformation to obtain a linear RSV function, the final expression for ranking computation in the probabilistic model, ignoring factors that are constant for all documents in the context of the same query, becomes [Baeza-Yates and Ribeiro-Neto, 1999]:

    RSV(q, d_k) \sim \sum_{i=1}^{t} w_{q,i} \times w_{k,i} \times \left( \log \frac{P(d_{k,i}|R)}{1 - P(d_{k,i}|R)} + \log \frac{1 - P(d_{k,i}|\bar{R})}{P(d_{k,i}|\bar{R})} \right)

An estimation of P(d_{k,i}|R) and P(d_{k,i}|\bar{R}) is necessary to apply the binary independence model. Before any documents have been retrieved, a simplifying assumption must be made, such as:

    P(d_{k,i}|R) = 0.5
    P(d_{k,i}|\bar{R}) = \frac{n_i}{N}

where P(d_{k,i}|R) is a constant for all index terms, typically 0.5, n_i is the number of documents containing the term d_{k,i}, and N is the total number of documents in the collection.

Let V be a subset of the documents retrieved and ranked by the probabilistic model, defined, for instance, as the top r ranked documents, where r is a previously defined threshold, and let V_i be the subset of V containing the term d_{k,i}. Then the estimates of P(d_{k,i}|R) and P(d_{k,i}|\bar{R}) can be recalculated as:

    P(d_{k,i}|R) = \frac{|V_i|}{|V|}
    P(d_{k,i}|\bar{R}) = \frac{n_i - |V_i|}{N - |V|}

where |X| is the cardinality of the set X. This can be repeated recursively, and the initial guesses of the probabilities P(d_{k,i}|R) and P(d_{k,i}|\bar{R}) can thereby be improved without any assistance from users. However, relevance feedback could also be used after the first guess [Baeza-Yates and Ribeiro-Neto, 1999].

On a conceptual level, the advantage of the probabilistic model is that documents are ranked in descending order of their probability of being relevant, based on the assumption that this is presumably easier for most users to understand than, for instance, a ranking by the cosine coefficient [Cooper, 1991]. The probabilistic model has not been widely used; where it has been used, however, it has achieved retrieval performance (in the sense of quality) comparable to, but not clearly superior to, non-probabilistic methods [Cooper, 1991]. The major disadvantages of the binary independence retrieval model are that documents are considered either relevant or not, and that the term weighting is binary. The relevance judgment depends on the assumptions made about the distribution of terms, e.g. how terms are distributed over the set of relevant documents and the set of non-relevant documents, which is usually defined statistically. As mentioned earlier, term weighting has been shown to be very important for the retrieval of information, and it is not supported by the binary independence retrieval model. Finally, the model must somehow estimate the initial probabilities P(d_{k,i}|R) and P(d_{k,i}|\bar{R}), either by (qualified) guessing or by manual estimation done by experts. Later research has produced probabilistic models that overcome some of these major disadvantages, e.g. [Van Rijsbergen, 1986; Robertson and Walker, 1997; Ng, 1999].
2.1.5 Fuzzy Set Model

The Boolean model is the only one of the three classical models presented above that has the option of controlling the interpretation of queries beyond the weighting of keywords. In the vector and probabilistic models, queries are assumed to be weighted vectors of keywords, whereas queries in the Boolean model are Boolean expressions with a well-defined logic. Naturally, as mentioned above, users have difficulties using Boolean expressions, a problem the vector and probabilistic models solve by not offering any operators in the query language. Query languages without operators are preferable due to their simplicity, but they are at the same time limited, and as users become more experienced, they may benefit from more expressive query languages.

One natural alternative to the Boolean model is the fuzzy set model, which generalizes the Boolean model by using fuzzy sets [Zadeh, 1993] instead of crisp sets (classical sets). The fuzzy set model can handle queries of Boolean expressions [Bookstein, 1980] as well as queries of weighted vectors of keywords without operators, like the vector and probabilistic models [Yager, 1987]. Moreover, the fuzzy set model can deal with the uncertainty of information retrieval, both with regard to term weighting in the representation of documents and with regard to the relevance between queries and documents in the evaluation process.

Information searching and retrieval has to deal with information characterized, to some extent, by a kind of imperfection, where imperfection covers imprecision, vagueness, uncertainty, and inconsistency [Motro, 1990]. Imprecision and vagueness are related to the information content of a proposition, uncertainty is related to the truth of a proposition, and inconsistency comes from the simultaneous presence of contradictory information [Bordogna and Pasi, 2001].

Queries can be seen as imperfect expressions of a user's information needs, and the representation of documents can be seen as a partial and imprecise characterization of the documents' content. A document should be considered relevant when the document meets the actual user's information needs, and relevance can therefore be seen as a matter of degree, a possibility of being relevant rather than a probability of being relevant. Hence, there is only an uncertain relationship between the request and the likelihood that the user will be satisfied with the response [Lucarella, 1990]. Fuzzy set theory provides a sound mathematical framework for dealing with the imperfection of documents and queries. The substantial difference between models based on probability and possibility is that the probabilistic view presupposes that a document is or is not relevant for a given query, whereas the possibilistic view presupposes that the document may be more or less relevant for a given query [Lucarella, 1990].
Basic Operations on Fuzzy Sets

A crisp set is defined by a function, usually called the characteristic function, which declares which elements of a given universal set U are members of the set and which are not. Set A is defined by the characteristic function χA as
follows:

$$\chi_A(u) = \begin{cases} 1 & \text{for } u \in A \\ 0 & \text{for } u \notin A \end{cases} \qquad (2.10)$$

and maps all elements of U to elements of the set {0,1}:
χA : U → {0, 1} discriminating between members and non-members of a crisp set. A fuzzy set is defined by a membership function, a generalized characteristic function, which for a fuzzy set A maps elements from a given universal set U to the unit interval [0,1]:
µA : U → [0, 1] where µA defines the membership grade of elements in A. µA(u) = 1 denotes full membership, µA(u) = 0 denotes no membership at all, and 0 < µA(u) < 1 denotes partial membership [Klir and Yuan, 1995]. Obviously, the characteristic function in (2.10) is a special case of the fuzzy set membership function, where fuzzy sets are identical to crisp sets. Finite fuzzy sets are often specified as:
A = {µA(u)/u|u ∈ U} or
$$A = \mu_A(u_1)/u_1 + \mu_A(u_2)/u_2 + \cdots + \mu_A(u_n)/u_n = \sum_{i=1}^{n} \mu_A(u_i)/u_i$$

where $+$ and $\sum$ denote union. The cardinality of a fuzzy set defined over a finite universe U, called the scalar cardinality or the sigma count, is the summation of the membership degrees of all u ∈ U, defined as:

$$|A| = \sum_{u \in U} \mu_A(u)$$

for set A [Klir and Yuan, 1995]. An important concept of fuzzy sets is the α-cut, which for a given fuzzy set A, defined on the universal set U and with α ∈ [0, 1], is the crisp set
$$^{\alpha}A = \{x \mid \mu_A(x) \geq \alpha\}$$

that contains all elements of the universal set U whose membership grades in A are greater than or equal to the specified value of α.
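As an illustration, a finite fuzzy set can be represented directly as a mapping from elements to membership grades. The sketch below (plain Python, with illustrative element names) implements the scalar cardinality and the α-cut just defined.

```python
# A finite fuzzy set as a dict from element to membership grade in [0, 1].
A = {"dog": 1.0, "cat": 0.8, "postman": 0.3}

def cardinality(A):
    """Scalar cardinality (sigma count): the sum of all membership grades."""
    return sum(A.values())

def alpha_cut(A, alpha):
    """Crisp set of elements whose membership grade is >= alpha."""
    return {u for u, mu in A.items() if mu >= alpha}

print(cardinality(A))      # 2.1
print(alpha_cut(A, 0.5))   # {'dog', 'cat'}
```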
The complement, Ā, of a fuzzy set A with respect to the universal set U is defined for all u ∈ U as:
$$\mu_{\bar{A}}(u) = 1 - \mu_A(u)$$

Given two fuzzy sets, A and B, defined in the universe U, the intersection, A∩B, and the union, A∪B, can be generalized by the use of triangular norms (T-norms) and triangular conorms (T-conorms), respectively:
$$\mu_{A\cap B}(u) = T(\mu_A(u), \mu_B(u)) \qquad\qquad \mu_{A\cup B}(u) = S(\mu_A(u), \mu_B(u))$$

where T(µA(u), µB(u)) = min(µA(u), µB(u)) is the standard T-norm and S(µA(u), µB(u)) = max(µA(u), µB(u)) the standard T-conorm. In Table 2.1, some alternative T-norms and T-conorms are shown [Chen and Chen, 2005]. Whenever we refer to intersection and union on fuzzy sets without explicating a particular T-norm or T-conorm, the standard versions are assumed. All the normal operations on crisp sets also apply to fuzzy sets, except for the laws of contradiction and excluded middle, A ∩ Ā = ∅ and A ∪ Ā = U, which for the standard operations hold only in the special case where fuzzy sets are identical to crisp sets, i.e. when the membership function equals the characteristic function [Cross, 1994].
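To make the generalized operations concrete, here is a small sketch continuing the dict representation above: complement, intersection, and union are parameterized by a T-norm or T-conorm, with the standard min/max as defaults. The alternative norms listed are those of Table 2.1.

```python
def complement(A, universe):
    """Membership of the complement of A with respect to a universe."""
    return {u: 1.0 - A.get(u, 0.0) for u in universe}

def intersection(A, B, t_norm=min):
    """Membership of A ∩ B via a T-norm (standard: min)."""
    return {u: t_norm(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

def union(A, B, t_conorm=max):
    """Membership of A ∪ B via a T-conorm (standard: max)."""
    return {u: t_conorm(A.get(u, 0.0), B.get(u, 0.0)) for u in set(A) | set(B)}

# Some alternatives from Table 2.1:
algebraic_product = lambda a, b: a * b
algebraic_sum     = lambda a, b: a + b - a * b
bounded_product   = lambda a, b: max(a + b - 1.0, 0.0)
bounded_sum       = lambda a, b: min(a + b, 1.0)
```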
à ! 1 X subsethood(A, B) = |A| − max (0, µ (u) − µ (u)) (2.11) |A| A B u∈U P where the term in this formula describes the sum of the degree to which the subset inequality µA(u) ≤ µB(u) is violated, with the |A| as a normalizing factor to obtain the range 0 ≤ subsethood(A, B) ≤ 1. This can be expressed more conveniently in terms of sets:
$$\text{subsethood}(A, B) = \frac{|A \cap B|}{|A|} \qquad (2.12)$$

as the ratio between the cardinality of the intersection A∩B and the cardinality of A [Klir and Yuan, 1995].
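Continuing the sketch above, the set-based formulation (2.12) is a one-liner; note that it agrees with (2.11) for the standard intersection, since µA(u) − max(0, µA(u) − µB(u)) = min(µA(u), µB(u)).

```python
def subsethood(A, B):
    """Degree to which A is a subset of B: |A ∩ B| / |A|  (eq. 2.12)."""
    card_A = cardinality(A)
    return cardinality(intersection(A, B)) / card_A if card_A > 0 else 1.0
```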
Fuzzy Information Retrieval

The weighted vector representation of documents in the vector model is easily transformed into fuzzy sets, and so is any other type of term-weighting scheme.
T-norms:
  Logical Product    min(µA, µB)
  Algebraic Product  µA × µB
  Hamacher Product   (µA × µB) / (µA + µB − (µA × µB))
  Drastic Product    µA if µB = 1; µB if µA = 1; 0 otherwise
  Bounded Product    max(µA + µB − 1, 0)

T-conorms:
  Logical Sum        max(µA, µB)
  Algebraic Sum      µA + µB − (µA × µB)
  Hamacher Sum       (µA + µB − 2(µA × µB)) / (1 − (µA × µB))
  Drastic Sum        µA if µB = 0; µB if µA = 0; 1 otherwise
  Bounded Sum        min(µA + µB, 1)

Table 2.1: Some T-norms and T-conorms

The relation between terms and documents can be expressed as a binary fuzzy indexing:
I = {µI (d, t)/(d, t)|d ∈ D; t ∈ T } (2.13) where D and T are finite sets of documents and terms, respectively, and µI : D × T → [0, 1], a membership function indicating for each pair (d, t) the strength of the relation between the term t and document d [Lucarella, 1990]. On the basis of the indexing relation I, it is possible to define descriptions for each document d ∈ D and each query q ∈ Q:
$$I_d = \{\mu_{I_d}(t)/t \mid t \in T;\ \mu_{I_d}(t) = \mu_I(d, t)\}$$
$$I_q = \{\mu_{I_q}(t)/t \mid t \in T;\ \mu_{I_q}(t) = \mu_I(q, t)\}$$

as fuzzy sets. If queries are defined as Boolean expressions, then some considerations have to be made on how to interpret the combination of the (possible) weighting of terms and Boolean operators, as in Bookstein [1980]. The retrieval rule can likewise be expressed in the form of a binary fuzzy relation:
R = {µR(q, d)/(q, d)|q ∈ Q; d ∈ D} (2.14) with the membership function µR : Q×D → [0, 1]. For a given query q ∈ Q, on the basis of the retrieval relation R, the retrieved fuzzy subset Rq of document set D is defined as:
$$R_q = \{\mu_{R_q}(d)/d \mid d \in D;\ \mu_{R_q}(d) = \mu_R(q, d)\} \qquad (2.15)$$

where µRq(d) represents the strength of the relationship between the document d and the query q, hence a reordering of the collection with respect to the values of µRq(d). The retrieved fuzzy set Rq can be restricted to a given threshold, α ∈ [0, 1], by the α-level cut of Rq:

$$R_q(\alpha) = \{\mu_{R_q}(d)/d \mid d \in D;\ \mu_{R_q}(d) = \mu_R(q, d);\ \mu_{R_q}(d) \geq \alpha\} \qquad (2.16)$$

which defines that the extracted elements should have a membership value greater than or equal to the fixed threshold α. The elements in Rq(α) would then be those documents that best fulfill the query q to the degree defined by α. A ranked output can be returned to the user by arranging the retrieved documents in descending order according to the retrieval status value RSV = µRq(d). The computation of the RSV for documents given a specific query has a variety of different solutions. A simple approach is to interpret the queries with a conjunctive or a disjunctive reading of the terms. In Cross [1994] two different approaches based on set-theoretic inclusion are given:

$$RSV = \mu_{R_q}(d) = \mu_R(q, d) = \frac{\sum_{t\in T}\min\bigl(\mu_{I_q}(t), \mu_{I_d}(t)\bigr)}{\sum_{t\in T}\mu_{I_q}(t)} \qquad (2.17)$$

and

$$RSV = \mu_{R_q}(d) = \mu_R(q, d) = \frac{\sum_{t\in T}\min\bigl(\mu_{I_q}(t), \mu_{I_d}(t)\bigr)}{\sum_{t\in T}\mu_{I_d}(t)} \qquad (2.18)$$

defining the subsethood(q, d) of query q in document d and the subsethood(d, q) of document d in query q, respectively. In (2.17) and (2.18), the lengths of the query and the document, respectively, are used as normalization to obtain 0 ≤ RSV ≤ 1; the normalization could also be motivated by the same reasons as for the normalization introduced by the cosine coefficient (2.5) in the vector model. A different kind of RSV approach is by means of an averaging operator, which provides an aggregation between min and max, where the simple (arithmetic) mean is defined as:

$$RSV(q, d) = \mu_{R_q}(d) = \mu_R(q, d) = \frac{1}{|I_q|}\sum_{t\in T}\frac{\mu_{I_q}(t) + \mu_{I_d}(t)}{2} \qquad (2.19)$$

the harmonic mean as

$$RSV(q, d) = \mu_{R_q}(d) = \mu_R(q, d) = \sum_{t\in T}\frac{2\,\mu_{I_q}(t)\,\mu_{I_d}(t)}{\mu_{I_q}(t) + \mu_{I_d}(t)} \qquad (2.20)$$

and the geometric mean as

$$RSV(q, d) = \mu_{R_q}(d) = \mu_R(q, d) = \sum_{t\in T}\sqrt{\mu_{I_q}(t)\,\mu_{I_d}(t)} \qquad (2.21)$$
for query q and document d. All these means provide different fixed views on the average aggregation. A natural generalization would be to introduce parameterized averaging operators that can scale the aggregation between min and max.
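In the dict representation sketched earlier, the two inclusion-based RSVs (2.17) and (2.18) are just the subsethood function applied in either direction, and the arithmetic-mean RSV (2.19) is equally short. A sketch, with illustrative function names and Iq, Id given as membership dicts:

```python
def rsv_query_in_doc(Iq, Id):
    """Eq. (2.17): subsethood of the query description in the document."""
    return subsethood(Iq, Id)

def rsv_doc_in_query(Iq, Id):
    """Eq. (2.18): subsethood of the document description in the query."""
    return subsethood(Id, Iq)

def rsv_arithmetic_mean(Iq, Id):
    """Eq. (2.19): simple mean aggregation, normalized by |Iq|."""
    terms = set(Iq) | set(Id)
    s = sum((Iq.get(t, 0.0) + Id.get(t, 0.0)) / 2.0 for t in terms)
    return s / cardinality(Iq)
```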
Ordered Weighted Averaging Operator

An important parameterized averaging operator is the Ordered Weighted Averaging operator (OWA) [Yager, 1988], which takes a weighted "ordering" vector as a parameter instead of a single value. For examples of different parameterized averaging approaches see e.g. [Waller and Kraft, 1979; Salton et al., 1983; Smith, 1990]. An ordered weighted averaging operator of dimension n is a mapping OWA : [0, 1]^n → [0, 1], which has an associated weighting vector W = (w1, w2, . . . , wn) such that:

$$\sum_{j=1}^{n} w_j = 1$$

and each wj ∈ [0, 1]. The aggregation with respect to a weighting vector, W, is defined as:
$$OWA_W(a_1, a_2, \ldots, a_n) = \sum_{j=1}^{n} w_j b_j$$

with bj being the j'th largest of the ai. Hence, the weights in W are associated with a particular position rather than a particular element. The generality of the ordered weighted averaging operator lies in the fact that the structure of W controls the aggregation, and the selection of weights in W can emphasize different arguments based upon their position in the ordering. If most of the weight is put at the beginning of the weighting vector, emphasis is put on the higher scores, while placing it near the end of W emphasizes the lower scores in the aggregation. The upper and lower bounds of the ordered weighted averaging operator are defined by the weighting vectors W^*, where w1 = 1 and wj = 0 for j ≠ 1, and W_*, where wn = 1 and wj = 0 for j ≠ n; for n dimensions these correspond to max and min, respectively. Another interesting weighting vector is W_ave, where wj = 1/n for all j, which corresponds to the simple mean (2.19) [Yager, 2000]. Obviously, the variety of weighting vectors between the lower and upper bounds is infinite, while there is only one for each of the bounds. A measure for characterizing OWA operators, called the alpha value of the weighting vector, is defined as:
$$\alpha = \frac{1}{n-1}\sum_{j=1}^{n}(n - j)\,w_j$$

where n is the dimension and wj the j'th weight in the weighting vector. It can be shown that α ∈ [0, 1], and furthermore that α = 1 if W = W^*, α = 0.5 if W = W_ave, and α = 0 if W = W_* [Yager, 2000]. The weighting vector can be obtained by a regularly increasing monotonic function, which conforms to f(0) = 0, f(1) = 1, and if r1 > r2 then f(r1) ≥ f(r2). An example of such a function, which takes α as input, is given here:
$$W_j = \left(\frac{j}{n}\right)^{\frac{1}{\alpha}-1} - \left(\frac{j-1}{n}\right)^{\frac{1}{\alpha}-1} \qquad (2.22)$$

and distributes the weights in the weighting vector according to the description above, with more emphasis at the beginning of W when α → 1, the simple mean when α = 0.5, and more emphasis at the end of W when α → 0, as shown in Table 2.2.
α    1.0   0.9   0.8   0.7   0.6   0.5   0.4   0.3   0.2   0.1   0.0
w1   1.00  0.88  0.76  0.62  0.48  0.33  0.19  0.08  0.02  0.01  0.00
w2   0.00  0.07  0.14  0.22  0.28  0.33  0.35  0.31  0.18  0.02  0.00
w3   0.00  0.05  0.10  0.16  0.24  0.33  0.46  0.61  0.80  0.97  1.00

Table 2.2: An example of how to distribute the weights by a function (here n = 3)
The function in (2.22) is an example of how to parameterize n-dimensional weighting vectors with a single value. The resulting vector will always correspond to an averaging operator. Some weighting vectors can meaningfully be associated with linguistic quantifiers, e.g. W^* corresponds to there exists and W_* to for all. Linguistic quantifiers could also be defined as regularly increasing monotonic functions:

$$W_j = Q\left(\frac{j}{n}\right) - Q\left(\frac{j-1}{n}\right) \qquad (2.23)$$

where Q is some function defining a quantifier. This would permit users to define weighting vectors from a selection of well-known linguistic quantifiers, e.g. all, most, some, few, at least one, corresponding to the proportion of elements in the queries deemed important. If a query evaluation with the quantifier few gives an enormous result set, the same query with the quantifier most could be used to restrict the query evaluation, and vice versa.
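A compact sketch of the OWA operator and the two weight constructions, (2.22) from a single α and (2.23) from a quantifier Q; with α = 0.8 and n = 3 it reproduces the corresponding column of Table 2.2. The function names are illustrative.

```python
def owa(values, weights):
    """OWA: weights are applied to the values sorted in descending order."""
    assert abs(sum(weights) - 1.0) < 1e-9 and len(values) == len(weights)
    return sum(w * b for w, b in zip(weights, sorted(values, reverse=True)))

def weights_from_alpha(n, alpha):
    """Eq. (2.22), valid for 0 < alpha < 1 (the bounds W^* and W_* are limits)."""
    e = 1.0 / alpha - 1.0
    return [(j / n) ** e - ((j - 1) / n) ** e for j in range(1, n + 1)]

def weights_from_quantifier(n, Q):
    """Eq. (2.23): W_j = Q(j/n) - Q((j-1)/n) for a monotonic quantifier Q."""
    return [Q(j / n) - Q((j - 1) / n) for j in range(1, n + 1)]

print([round(w, 2) for w in weights_from_alpha(3, 0.8)])  # [0.76, 0.14, 0.1]
print(owa([0.2, 0.9, 0.5], weights_from_alpha(3, 0.5)))   # ~0.53, the simple mean
```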
2.2 Retrieval Evaluation
The process of querying a retrieval system can be evaluated as a complete task, starting with posing the query and continuing until satisfactory information has been found, including any intermediate tasks, for instance, the interactive process of possible reformulations of the initial request. These kinds of evaluations are sometimes called real-life evaluation, in contrast to evaluation that only compares the query and the result, which is often referred to as laboratory evaluation. Over the last couple of decades, there has been more focus on real-life evaluation (see e.g. [Ingwersen, 1992]), but this kind of evaluation is more complicated and expensive, so laboratory evaluation is still the most prevalent. Laboratory evaluation is also the kind of retrieval evaluation used in this thesis. The effectiveness of an information retrieval process is commonly measured in terms of recall and precision. Recall is the ratio of the number of relevant documents retrieved to the total number of relevant documents, and precision is the ratio of the number of relevant documents retrieved to the total number of retrieved documents. Let R be the set of relevant documents associated with a given query and A the set of documents retrieved by posing the query; recall and precision can then be defined as:

$$\text{recall} = \frac{|A \cap R|}{|R|} \qquad\qquad \text{precision} = \frac{|A \cap R|}{|A|}$$

where |X| denotes the cardinality of the set X. Recall and precision, as defined above, assume that all documents in the retrieved set A have been seen. However, the user is not usually presented with all documents in A; instead A is sorted according to the degree of relevance (or the probability of being relevant), and the user thus only sees a fraction of A. For example, consider the set of documents relevant to a given query q, defined a priori by some experts, as the following set R:
R = {d3, d7, d12, d24, d29, d71, d83, d115, d141, d195}

and the following ranking A as the result of posing query q to a given retrieval system:

 1. d3 •     2. d37      3. d78 •    4. d12      5. d6
 6. d24 •    7. d117     8. d49      9. d141 •  10. d27
11. d9      12. d68     13. d155    14. d219    15. d195 •

where the relevant documents in A are marked with a bullet. Instead of considering the complete result set A, recall and precision can be computed
for any fragment of A. Let Ai be the fragment of A with the i topmost answers. Recall and precision on, for instance, A1 would then be 1/10 = 10% and 1/1 = 100%, respectively. Normally, a set of standard recall levels, 0%, 10%, 20%, ..., 100%, is used to plot a recall–precision curve. Here is the computation for the above result:
        Recall           Precision
A1      1/10 = 10%       1/1 = 100%
A3      2/10 = 20%       2/3 ≈ 66%
A6      3/10 = 30%       3/6 = 50%                 (2.24)
A9      4/10 = 40%       4/9 ≈ 44%
A15     5/10 = 50%       5/15 ≈ 33%

for recall from 10–50%. The plot for this computation is shown in Figure 2.5. In this example, the recall and precision are for a single query. Usually, however, retrieval algorithms are evaluated by running them for several distinct queries, and an average is used for the recall and precision figures.
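The computation in (2.24) is mechanical and easily scripted. A minimal sketch, taking the bulleted ranks of the example as given (note that Python's rounding prints 2/3 as 67% where the table truncates to 66%):

```python
def recall_precision(rel_flags, total_relevant):
    """rel_flags[i] is True if the (i+1)'th ranked document is relevant.
    Yields (rank, recall, precision) at each rank holding a relevant document."""
    found = 0
    for i, is_rel in enumerate(rel_flags, start=1):
        if is_rel:
            found += 1
            yield i, found / total_relevant, found / i

# The bulleted ranks in the example (1, 3, 6, 9, 15), with |R| = 10:
flags = [i in {1, 3, 6, 9, 15} for i in range(1, 16)]
for rank, r, p in recall_precision(flags, 10):
    print(f"A{rank}: recall {r:.0%}, precision {p:.0%}")
# A1: recall 10%, precision 100% ... A15: recall 50%, precision 33%
```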
[Figure 2.5: Recall and precision for the example query q in (2.24) – a recall–precision curve with Recall (0–100%) on the x-axis and Precision (0–100%) on the y-axis.]
In this way, recall and precision can be used for ranked retrieval, but only the position in A is used in the evaluation. In the case of vague ordering, where the degree of relevance is bound not only to the position but also to the actual degree associated with the documents, recall and precision may be inadequate [Baeza-Yates and Ribeiro-Neto, 1999].
The output of a fuzzy retrieval system is a list of all the documents ranked according to their relevance evaluations. The retrieval status value is the membership grade of the retrieved fuzzy set of relevant documents. A fuzzy notion of retrieval evaluation is needed that captures the vagueness in the result. One obvious solution is to generalize recall and precision to fuzzy sets instead of crisp sets. Recall and precision can then be redefined as:
$$\text{recall}_{fuzzy} = \frac{|A_f \cap R_f|}{|R_f|} \qquad\qquad \text{precision}_{fuzzy} = \frac{|A_f \cap R_f|}{|A_f|}$$

where Rf is the set of relevant documents for a given query as a fuzzy set, and Af is the vaguely ordered retrieval result as a fuzzy set. Rf is supposed to be defined a priori by experts, like R in normal recall and precision. This can also be described as recall = subsethood(R, A) and precision = subsethood(A, R) [Buell and Kraft, 1981]. Another measure sometimes used in the evaluation is the fallout, the fraction of non-relevant documents that are retrieved:
$$\text{fallout} = \frac{|A \cap \bar{R}|}{|\bar{R}|}$$

where X̄ denotes the complement of X. The fallout measure can naturally also be generalized to fuzzy sets, since fuzzy sets support the complement operation.
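With the fuzzy-set helpers sketched in Section 2.1.5 (cardinality, intersection, complement), the fuzzy generalizations are direct. A sketch, assuming Rf and Af are given as membership dicts over the same document universe:

```python
def fuzzy_recall(Af, Rf):
    """Fuzzy recall: |Af ∩ Rf| / |Rf|, i.e. subsethood(Rf, Af)."""
    return cardinality(intersection(Af, Rf)) / cardinality(Rf)

def fuzzy_precision(Af, Rf):
    """Fuzzy precision: |Af ∩ Rf| / |Af|, i.e. subsethood(Af, Rf)."""
    return cardinality(intersection(Af, Rf)) / cardinality(Af)

def fuzzy_fallout(Af, Rf, universe):
    """Fuzzy fallout: |Af ∩ ~Rf| / |~Rf|, using the fuzzy complement."""
    not_Rf = complement(Rf, universe)
    return cardinality(intersection(Af, not_Rf)) / cardinality(not_Rf)
```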
2.3 Summary and Discussion
In this chapter, three classical information retrieval models have been presented, one based on Boolean logic and two alternative models based respectively on vector space and probability. Characteristic of operational information retrieval systems until the mid-nineties was that they, almost without exception, adopted the Boolean model of searching. However, in the last couple of decades, the vector retrieval model has gained some ground, while the probabilistic retrieval model is still rarely in use. The three classical models share a rather straightforward representation paradigm, "bag of words", for representing the content of documents and queries (except for the Boolean model, where queries are Boolean expressions – a disjunction of conjunctive vectors). Each index term in these vectors can have a weight attached denoting the relevance of the term in describing the content of a document or query. The aim in this thesis is to introduce external knowledge and in so doing promote semantics in the retrieval process. This is to be achieved by introducing concept models in the form of ontologies into information retrieval.
Naturally, this requires changes in both representation and query evaluation. The recognition of semantics in documents and queries involves some kind of natural language processing, which results in a more complex representation paradigm. The classical framework seems infeasible to adapt to this kind of representation, and likewise the classical query evaluation, with search strategies based solely on the occurrence of string sequences, terms or words, or combinations of words. The fuzzy set retrieval model is thus deemed a better alternative. The fuzzy set retrieval model is a generalization of the Boolean model, and thus has a well-defined underlying logical skeleton. This framework is flexible enough to capture the paradigm shift of representations from simple collections of words to semantic descriptions. Furthermore, it underpins the notion of imprecision perfectly5. The fuzzy set model is only rarely used, but appears to be an appropriate way towards solving the problems concerning ontology-based information retrieval dealt with in this thesis. The flexibility of the aggregation, especially hierarchical aggregation, and the use of linguistic quantifiers as parameterization of the aggregation process, are shown to substantiate the selection of the fuzzy set retrieval model in Section 6.2.3.
5This is also true for the vector and probabilistic models, while the Boolean model hands the question of uncertainty over to users.
Chapter 3
Ontology
Up until the last decade of the 20th century, ontology had primarily been a discipline in philosophy dealing with the nature and organization of reality. In recent years, however, ontology has also emerged as a research area related to computer science. Two main, corresponding definitions of the word "ontology" can be found in various dictionaries; for instance, Webster's defines it as [Webster, 1993]:
1. A science or study of being: specifically, a branch of metaphysics relating to the nature and relations of being; a particular system according to which problems of the nature of being are investigated; first philosophy.

2. A theory concerning the kinds of entities and specifically the kinds of abstract entities that are to be admitted to a language system.

where the first meaning applies to the philosophical discipline, and the second to the branch of computer science known as knowledge engineering. This twofold definition of ontology is consistent with the distinction suggested by Guarino and Giaretta [1995], where "Ontology" and "ontology" refer to the philosophical and knowledge engineering definitions, respectively. While the above distinction is adopted in this thesis, only the second definition, which refers to the knowledge engineering sense of the word, is used beyond this section.
3.1 Ontology as a Philosophical Discipline
In philosophy, ontology is an important area whose origins date back more than 2000 years; the plethora of discussions and opinions that have been put forward over the centuries makes it difficult to present a brief historical overview that is not superficial. Hence, only a select number of milestones related to this thesis will be pointed out.
[Figure 3.1: Brentano's tree of Aristotle's categories – a tree with Being at the root, dividing into Substance and Accident, with intermediate nodes Property, Relation, Inherence, Directedness, Movement, Intermediacy, and Containment, and leaves Quality, Quantity, Activity, Passivity, Having, Situated, Spatial, and Temporal.]
Aristotle defined ontology as the science of being. In Categories, Aristotle presented ten basic categories for classifying things: Substance, Quality, Quantity, Relation, Activity, Passivity, Having, Situatedness, Spatiality, and Temporality. Later, the philosopher Franz Brentano organized all ten categories as leaves on a single tree, as shown in Figure 3.1, where the branches are labeled with terms taken from some of Aristotle's other works [Sowa, 2000].
Quantity     Quality      Relation     Modality
Unity        Reality      Inherence    Possibility
Plurality    Negation     Causality    Existence
Totality     Limitation   Community    Necessity

Table 3.1: Kant's categories
This classification was widely accepted until Immanuel Kant (1724-1804) challenged Aristotle's idea that the essence of things is solely determined by the things themselves. Kant's idea was that the question of essence could not be separated from whoever perceives and understands the thing concerned. Kant, for whom a key question was what structures the mind uses to capture reality, organized his categories in four classes, each of which presents a triadic pattern, as shown in Table 3.1. Kant obtained his categories from the logical classification of judgments. Thus, for instance, unity matches singular judgments, plurality matches particular judgments, and totality matches universal judgments. The sensations involved when a person perceives reality are put into order, first in space and time, and then according to categories [Gómez-Pérez et al., 2004]. Charles S. Peirce (1839-1914) found that some of Kant's triads reflected three additional basic categories: Firstness, Secondness, and Thirdness. Firstness is determined by a thing's inherent qualities, Secondness is determined in relation to something else, and Thirdness by mediation that brings multiple entities into relationship. The first can be defined by a monadic predicate P(x), the second by a dyadic relation R(x, y), and the third by a triadic relation M(x, y, z). For example, the type human can be defined by the quality inherent in the individual, the type mother is defined in relation to a child, and the type motherhood relates mother and child [Sowa, 2000].
Figure 3.2: Sowa’s lattice of categories
Alfred North Whitehead (1861-1947) developed an ontology that combined the insights of some of history's greatest philosophers. Even though he never mentioned Peirce, there was a great deal of overlap between some of Peirce's triads and some of Whitehead's eight categories [Sowa, 2000]. The first three, actual entities, prehensions, and nexūs, correspond to Peirce's Firstness, Secondness, and Thirdness. In addition to these three physical categories, he defined three categories for abstractions, eternal objects, propositions, and subjective forms, constituting a triad for abstractions. Whitehead's last two categories are principles for the construction of new categories, multiplicities and contrasts [Sowa, 2000]. The aspects of ontology described up to this point could very well be
defined as general ontology, in contrast to various special or domain-specific ontologies, which are also defined as formal and material ontologies by Husserl [Guarino and Giaretta, 1995]. In his book Knowledge Representation, John F. Sowa merges most of the above-mentioned ideas into a lattice of categories, shown in Figure 3.2, which is an example of a top ontology [Sowa, 2000].
3.2 Knowledge Engineering Ontologies
After this short introduction to the philosophical definition of ontologies, we now turn to its primary meaning here, which is the knowledge engineering definition. In their paper "Ontology and Knowledge Bases", Guarino and Giaretta [1995] list the following seven different interpretations of the term "ontology":
1. Ontology as a philosophical discipline,
2. Ontology as an informal conceptual system,
3. Ontology as a formal semantic account,
4. Ontology as a specification of a “conceptualization”,
5. Ontology as a representation of a conceptual system via a logical theory,
5.1 characterized by specific formal properties,
5.2 characterized only by its specific purposes,
6. Ontology as the vocabulary used by a logical theory,
7. Ontology as a (meta-level) specification of a logical theory.
The first interpretation of ontology is as described in the previous section and differs from the six others, which are all related to the knowledge engineering sense of ontology. Guarino and Giaretta divide the remaining six interpretations (2-7) into two groups: 1. A particular framework at the semantic level (interpretations 2-3), and
2. An ontology intended as a concrete artifact at the syntactic level (interpretations 4-7). They acknowledge the ambiguity between these two interpretations and explain that one technical term cannot cover both of them. Their suggestion is to define the first group as conceptualization and the second as ontological theory, denoting a semantic structure which reflects a particular conceptual system and a logical theory intended to express ontological knowledge, respectively. The second group, ontological theory, in fact covers designed artifacts, and this is the interpretation for which the term "ontology" is primarily used
in this thesis. This interpretation is compatible with Gruber's definition of ontology (with a lowercase "o"), the most quoted in the literature:
An ontology is an explicit specification of a conceptualization.
[Gruber, 1993]1. Actually, Gruber's definition is one of the interpretations put forward by Guarino and Giaretta in their list of different interpretations. Gómez-Pérez et al. [2004] conclude that ontologies aim to capture consensual knowledge in a generic way. They also state that despite the many different interpretations of the term "ontology", there is consensus within the community, so confusion is avoided.
3.3 Types of Ontologies
Ontologies can be classified from at least two aspects: by the type of information they capture, and by the richness of their internal structure. The former is divided into six groups in [Gómez-Pérez et al., 2004]:
Top-level ontologies or upper-level ontologies are the most general ontologies, describing the top-most level in ontologies to which all other ontologies can be connected, directly or indirectly. In theory, these ontologies are shareable as they express very basic knowledge, but this is not the case in practice, because it would require agreement on the conceptualization of being, which, as described in Section 3.1, is very difficult. Any first step from the top-most concept would be either too narrow, thereby excluding something, or too broad, thereby postponing most of the problem to the next level, which means of course that controversial decisions are required.
Domain ontologies describe a given domain, e.g. medicine, agriculture, politics, etc. They are normally attached to top-level ontologies, if needed, and thus do not include common knowledge. Different domains can be overlapping, but they are generally only reusable in a given domain. The overlap between different domains can sometimes be handled by a so-called middle-layer ontology, which is used to tie one or more domain ontologies to the top-level ontology.
Task ontologies define the top-level ontologies for generic tasks and activities.
Domain-task ontologies define domain-level ontologies on domain-specific tasks and activities.

1Guarino and Giaretta discuss the problem with the external definition of conceptualization in this definition in their paper [Guarino and Giaretta, 1995], which is not looked at here.
Method ontologies give definitions of the relevant concepts and relations applied to specify a reasoning process so as to achieve a particular task.
Application ontologies define knowledge on the application level and are primarily designed to fulfill the need for knowledge in a specific application.
This is similar to the definition given in [Guarino, 1998], shown in Figure 3.3. The only difference is that this model is less fine-tuned. Even though the different ontologies are connected in this figure, they can be used as individual ontologies, or can be connected in any number of combinations. However, the ordering is normally respected, e.g. an “application ontology” is hardly ever connected directly to a “top-level ontology”.
[Figure 3.3: Guarino's kinds of ontologies – a Top-level Ontology at the top, Domain and Task Ontologies beneath it, and an Application Ontology at the bottom.]
Another type of ontology which should be mentioned here is the so-called linguistic ontologies, whose origin is natural languages and which describe semantic constructs rather than model a specific domain. Most of the linguistic ontologies use words as the grammatical unit, and can be used for natural language processing and generation. The mapping between units in natural language and the meanings in the linguistic ontologies is obviously not one-to-one, due to the fact that the meanings of grammatical units in natural languages can be ambiguous. Linguistic ontologies are quite heterogeneous and are often combined with top-level and/or domain-specific ontologies to form more coherent resources. The other dimension along which ontologies can be categorized is the richness of their internal structure. Lassila and McGuinness [2001] define this categorization as a linear spectrum, as shown in Figure 3.4, that goes from the simplest ontologies to the most complex and powerful. Whether or not
all of these definitions would actually be accepted as ontologies depends on how one defines ontologies. Lassila and McGuinness divide the spectrum into two partitions, indicated by an inclining line, that define whether or not the definitions have a strict subclass hierarchy.
Figure 3.4: Lassila and McGuinness categorization of ontologies
Correspondingly, with respect to richness, Corcho et al. [2003] introduce a division between lightweight and heavyweight ontologies. Lightweight ontologies include concepts, concept taxonomies, relationships between concepts, and properties that describe concepts, whereas heavyweight ontologies are lightweight ontologies with axioms and constraints added.
3.4 Representation Formalisms
Ontologies are special kinds of knowledge resources, ranging from simple concept taxonomies, like the hierarchies of domains available in search engines such as Altavista and Google and the hierarchies of topics for bibliographic categorization, to complex ontologies embedded in formal systems with reasoning capabilities, allowing new knowledge to be deduced from existing knowledge and automatic classification, as well as validation of knowledge through consistency checks. In this thesis, the main purpose of introducing ontologies is to move from a query evaluation based on words to an evaluation based on concepts, thus moving from a lexical to a semantic interpretation. The goal is to use the knowledge in the ontologies to match objects and queries on a semantic basis.
For this purpose, it is necessary to be able to derive the similarity between any concepts in the ontology, e.g. how closely related are dogs and cats, and are they more closely related than dogs and postmen? For a query on dogs, which of the two, cats or postmen, would be the most relevant answer? Of course, there is no obvious answer to this question because it depends on the context. The representation of knowledge, an important research issue in the field of artificial intelligence (AI), has been researched and discussed since the beginning of AI. A lot of different approaches have evolved, and so a key question is which of these existing approaches is most suitable for a given task, which in this case is query evaluation involving conceptual similarity in the area of information retrieval. One could argue that before this selection is achievable, an attempt should be made to clarify exactly what knowledge representation is. This is precisely what Davis, Shrobe, and Szolovits do in [Davis et al., 1993], and they conclude that:
1. A knowledge representation is a surrogate. Most of the things that we want to represent cannot be stored in a computer, e.g. bicycles, birthdays, motherhood, etc., so instead, symbols are used as surrogates for the actual objects or concepts. Perfect fidelity of the surrogate is impossible2, and not necessary most of the time, as in most cases simple descriptors will do.
2. A knowledge representation is a set of ontological commitments. Representations are imperfect approximations of the world, each attending to some things and ignoring others. A selection of representation is therefore also a decision about how and what to see in the world. This selection is called the ontological commitment, i.e. the glasses that determine what we see.
3. A knowledge representation is a fragmentary theory of intelligent reasoning. To be able to reason about the things represented, the representation should also describe their behavior and intentions. While the ontological commitment defines how to see, the recommended inferences suggest how to reason.
4. A knowledge representation is a medium for efficient computation. Besides guidelines on how to view the world and how to reason, some remarks on useful ways to organize information are given.
5. A knowledge representation is a medium of human expression. The knowledge representation language should facilitate communication.
2It is not possible because anything other than the thing itself is necessarily different.
Davis et al. argue that among the different achievements to be attained by understanding and using this description, one is what they call a characterization of the spirit of a given representation. In other words, all representations can more or less be seen as some fragment of first order logic, and can, based on this point of view, be equal or interchangeable, even if this sometimes requires extensive stretching and bending, though the intentions for the representations, or the spirit, could be very different. The selection of representations should therefore be on the basis of spirit, and the representations with intentions similar to the given task should be preferred. An ontology is essentially a set of concepts connected by a set of relations, usually with subsumption/concept inclusion as the ordering relation, forming the hierarchy of the ontology. Any ontology representation should at least accomplish this to be considered as a possible solution. Two of the most dominant models are network-based and logical representations. As mentioned previously, specific models in both of these categories could, for the most part, be translated into first-order logic, but their intentions differ. While network-based representations have their offspring in cognitive science and psychology, logical models are anchored in philosophy and mathematics. Two of the most applied network-based models are semantic networks [Quillian, 1985] and frames [Minsky, 1985]. Although semantic networks and frames are significantly different, they share cognitive intuition, their features, and their network structure. Network-based representation models have a human-centered origin and match well with our intuition about how to structure the world. The semantic network model was proposed by Quillian in the late 1960s as a model to capture the way humans organize the semantics of words. In the semantic network model, objects, concepts, situations and actions are represented by nodes, and the relationships between nodes are represented by labeled arcs. The labels in the nodes gain their meaning from their connections [Brachman, 1985]. One of the advantages of this model is its simplicity, which is also its weakness. For modeling simple things, the semantics of the model is clear, but whenever the model becomes more complex, the semantics of the links between nodes in the network are unclear and need an explicit definition to be understood correctly [Woods, 1985]. Due to this vague semantics of relations, different representations in the semantic network model are therefore not necessarily comparable. The networks created by the semantic network model could also be created by the frames model, but instead of having simple nodes, the frames model is more object-centered and has nodes which are defined as a set of named slots. In psychology, a frame, or a scheme, is a mental structure that represents some aspect of the world, and is used to organize current knowledge and provide a framework for future understanding. This description is in fact
indistinguishable from one describing how knowledge in artificial intelligence should be used, and it is thus also the inspiration for the frames model, which was proposed by Minsky in 1975. Frames are pieces of knowledge, non-atomic descriptions with some complexity, that are not necessarily complete with regard to details, but that are adequate for a given purpose. If one remembers a dog from the past, the shape of the nails on the left front paw may not be significant, but the color, nose, eyes, and overall shape might be. Frames are the structures to be filled with information about the piece of knowledge to be captured. Hence, a frame is a representational object with a set of attribute-value pairs defining the object. The specification of attributes typically includes [Levesque and Brachman, 1985]:
• Values defining an exact value for an attribute, or a default, to be derived from the structure,
• Restrictions defining constraints for the attribute’s values, which could be value restrictions or number restrictions,
• Attached procedures providing procedural advice on how the attribute should be used and how to calculate the value (derive it from the structure), or triggering what to do when the value is added,

where both the ability to define default values and the ability to include procedural information about reasoning differ significantly from the semantic network model. Frames are divided into generic types, defining stereotyped situations, and instances, representing actual objects in the world. The generic types define the foundations, i.e. which slots define a specific type and perhaps the restrictions on the values of one or more slots, while instances fill out these slots with the values defining the instance. Assume that a stereotypical dog is characterized by being a specialization of the super-type animal, having a color, a size, and number-of-legs, as shown in Figure 3.5, notated as an attribute-value matrix:

dog
  supertype        animal
  color            (type color)
  size             (type size) (one of small, medium, big)
  number-of-legs   4
Figure 3.5: An example of the stereotype dog notated in an attribute-value matrix
where the values of the attributes color and size are restricted to a specific type, the attribute size, furthermore, is restricted to a set of possible values, and the attribute number-of-legs is fixed to the value four. In Figure 3.6, an instance fido of the generic frame dog is shown. The values of the attributes color and size are fixed, and thus all attributes of the instance are fixed.

fido
  super-type   dog
  color        gray
  size         big
Figure 3.6: An example of a frame denoting the instance fido of the stereotype dog
The instance, fido, inherits all attributes from its super-types; thus even though the number-of-legs attribute is not shown in Figure 3.6, it is inherited from the super-type dog. Frames were a step ahead of semantic networks, but still the semantics of relations was unclear; links could represent implementational pointers, logical relations, semantic relations, and arbitrary conceptual and linguistic relations [Brachman and Schmolze, 1985]. One attempt to bring stricter semantics into frames and semantic networks was the representation language KL-ONE [Brachman and Schmolze, 1985], where a clear distinction was made between different kinds of links, the lack of which had been the main criticism against semantic networks [McDermott, 1976; Woods, 1985]. In Figure 3.7, the example with the type dog and the instance fido is shown in KL-ONE. The ovals define concepts and the arrows different kinds of relations. The double-line arrows represent the subtype–super-type relations, and the arrows with the encircled square represent roles. The v/r at the target end of the role arrows indicates value restrictions or type constraints. In KL-ONE and other frame-based knowledge representation languages, knowledge is split into two different parts: the terminological part, denoted the TBox, and the assertional part, denoted the ABox. The TBox concerns taxonomies of structured terms and the ABox descriptive theories of domains of interest [Brachman et al., 1985]. Even though the semantics of relations is strict in KL-ONE, it can become difficult to give a precise characterization of what kind of relationship can be computed when more complex relationships are established among concepts. This problem, and the recognition that the core features of frames could be expressed in first order logic [Hayes, 1985], were among the main motivations for the development of description logic [Nardi and Brachman, 2003]. It was shown that only a fragment of first order logic was needed, and that different
[Figure 3.7: The example with the type dog and the instance fido in KL-ONE – the concepts animal, dog, color, size, and integer and the values fido, gray, big, and 4, connected by the roles has-color, has-size, and number-of-legs with v/r restrictions.]
features of the representation language lead to different fragments of first order logic. This was also true for the reasoning part, which meant that it did not necessarily require proving theorems in full first-order logic, and that the reasoning in different fragments of first-order logic would lead to computational problems of differing complexity.
[Figure 3.8: Architecture of the knowledge representation system in description logic ([Baader and Nutt, 2003]) – a knowledge base (KB) composed of a TBox and an ABox, built in a description language, with reasoning services, and application programs and rules on top.]
Description logics inherit the division of knowledge representation into a
terminological and an assertional part, as shown in Figure 3.8. As mentioned above, description logics are a set of languages that are all subsets of first-order logic. All the languages have in common elementary descriptions that are atomic concepts and atomic roles. Complex descriptions can be built from the atomic ones inductively with concept constructors, and each language is therefore distinguished by the constructors it provides. As a minimal language of practical interest, the attributive language AL was introduced in [Schmidt-Schauß and Smolka, 1991]. Figure 3.9 shows the syntax rules for the AL language, defined in an abstract notation where the letters A and B denote atomic concepts, R atomic roles, and C and D concept descriptions.
C, D → A |      (atomic concept)
       ⊤ |      (universal concept)
       ⊥ |      (bottom concept)
       ¬A |     (atomic negation)
       C ⊓ D |  (intersection)
       ∀R.C |   (value restriction)
       ∃R.⊤     (limited existential quantification)
Figure 3.9: The basic description language AL
Whenever the AL language is extended, the extension is denoted by adding letters to the language name; for instance, number restriction is denoted by the letter N, hence the ALN language is the AL language with number restrictions. In order to express the example in Figures 3.5 and 3.7 in description logic, the ALN language is needed. The result is shown in Figure 3.10, where the double horizontal line divides the example into an upper TBox and a lower ABox. Although the different knowledge representation models described up to this point are closely related, they relate to at least two different kinds of spirits, or ontological commitments, the cognitive and the logical. A third spirit, the mathematical or algebraic spirit, will now be introduced. [Brink et al., 1994] shows that the Peirce algebra and the description language U− are isomorphic, where the U language is one of the most powerful languages, having the ALC language3 as a sub-language, and the U− language is the U language without the numerical restrictions atleast and atmost. The model-theoretic semantics of a description language can be given by an interpretation I, which defines a pair (DI, ·I), where DI denotes a set called the domain or universe, and ·I is a map, called the interpretation function, which assigns to every concept description C a subset CI of DI and to every
3The ALC language or the ALUE language is AL language plus either full negation or union and full existential quantification, which, of course, defines the same language. ALC is normally used to denote it.
Dog ≡ Animal ⊓ ≤numberOfLegs(4) ⊓ ≥numberOfLegs(4) ⊓ ∀hasColor(Color) ⊓ ∀hasSize(Size)

═══════════════════════════════════════════════════════════════

Dog(fido)    hasSize(fido, big)    hasColor(fido, gray)
Color(gray)    Size(small)    Size(medium)    Size(big)

Figure 3.10: Example of dog with the instance fido in description logic
role R, a binary relation RI over DI. Figure 3.11 shows the conditions for a subset of the U-language and the mapping to the matching algebraic terms:
Terminological Expression    Interpretation                          Algebraic Term
⊤                            DI                                      1
⊥                            ∅                                       0
C ⊓ D                        CI ∩ DI                                 a · b
C ⊔ D                        CI ∪ DI                                 a + b
¬C                           DI − CI                                 a′
∃R.C                         {x | (∃y)[(x, y) ∈ RI ∧ y ∈ CI]}        r : a
∀R.C                         {x | (∀y)[(x, y) ∈ RI ⇒ y ∈ CI]}        (r : a′)′
Figure 3.11: A logical connection between Peircian algebra and the description language U

where r : a denotes Peirce's product [Brink, 1981], r being an element in a relation algebra and a, b elements in a Boolean algebra. The example in Figure 3.10 could obviously be represented in Peircian algebra, except for the number restriction, so unless cardinality is essential for a given representation task, Peircian algebra would be a possible representation model with a mathematical or algebraic spirit. The goal here has not been to give an overview of representation models, but to show the logical similarities and differences of the spirit and intentions of the models presented. Nevertheless, these models give a representative glimpse of the evolution towards a well-defined semantics of relations between
elements of knowledge. Selecting which language is the most appropriate for representing a given ontology depends, first of all, on the type and richness of the ontology, as well as on the type of reasoning needed. A "lowest common denominator" view of the notion of an ontology is indicated in [Uschold and Jasper, 1999]:
An ontology may take a variety of forms, but necessarily it will include a vocabulary of terms, and some specification of their meaning. This includes definitions and an indication of how concepts are inter-related which collectively impose a structure on the domain and constrain the possible interpretations of terms.
Thus, the representation could be rather simple – a mapping from the vocabulary to the concepts of the ontology (with specification of meaning), and a specification of one or more relations on the concepts. As a result, a representation could be fairly straightforward and simple. However, when the need for richness increases, moving from left to right in Figure 3.4, the choice of representation languages and formalisms becomes less straightforward. For simple representations of ontologies without the need for reasoning, any language with the ability to describe links between entities would be appropriate, e.g. the Unified Modeling Language (UML) [Object Management Group, ], the Entity-Relationship Model (ER) [Chen, 1976], or other software engineering languages. Whenever reasoning becomes decisive, the differences and similarities among the formalisms described in the last section should be taken into account when choosing a representation language. Gómez-Pérez et al. [2004] conclude that both the formalism used to model the ontology and the language that implements these techniques limit the kind of knowledge that can be modeled and implemented. Other aspects which should be taken into consideration concern the type of information to be modeled and the applicability. Examples of different types of knowledge are top-level knowledge, common knowledge, and domain-specific knowledge, while examples of applicability are data integration and ontology modeling tools.
3.4.1 Traditional Ontology Languages

In the beginning of the 1990s, a set of Artificial Intelligence representation languages for ontologies was created. Some were based on first order logic, primarily on the Knowledge Interchange Format (KIF) [Genesereth, 1991], while others were based on frames combined with first order logic, for instance, the language Ontolingua [Gruber, 1992]. Ontolingua, which is one of the most expressive of all the languages that have been used for representing ontologies, allows the representation of concepts, taxonomies of concepts, n-ary relations,
functions, axioms, instances, and procedures. Because its high expressiveness led to difficulties in building reasoning mechanisms for it, no reasoning support is provided with the language [Corcho et al., 2003]. Later (1995), the language FLogic [Kifer et al., 1995], which also combines frames and first order logic, was created. FLogic allows the representation of concepts, concept taxonomies, binary relations, functions, instances, axioms, and deductive rules. The FLogic language has reasoning mechanisms for constraint checking and the deduction of new information. The language LOOM [MacGregor, 1991], which was developed simultaneously with Ontolingua, is a general knowledge representation language based on description logic and production rules, and provides automatic classification of concepts. In addition, ontologies can be represented with concepts, concept taxonomies, n-ary relations, functions, axioms and production rules.
3.4.2 Ontology Markup Languages

The Internet boom has led to a number of web-based representation languages. Introduced in 1996, the first one, SHOE [Heflin et al., 1999], was an extension of HTML and introduced tags which allow the insertion of ontologies in HTML documents. Later, when XML [Paoli et al., 2004] was created and widely adopted, SHOE was modified to use XML. SHOE combines frames and rules, and allows representation of concepts, their taxonomies, n-ary relations, instances and deduction rules that can be used by its inference engine to obtain new knowledge.
Figure 3.12: Ontology markup languages
The release and use of XML gave rise to a large number of different ontology markup languages, as shown in Figure 3.12, where the languages based on the Resource Description Framework (RDF) [Lassila and Swick, 1999] are based on description logic. The RDF language is in itself very limited and primarily designed for describing web resources, but the extension RDF Schema (RDFS) [Brickley and Guha, 2004] added frame-based primitives. These did not make
RDFS very expressive, since they only allow the representation of concepts, concept taxonomies and binary relations. Inference engines have been created for this language, mainly for constraint checking. The Ontology Inference Layer or Ontology Interchange Language (OIL) [Fensel et al., 2000], developed in the framework of the European IST project On-To-Knowledge, adds frame-based knowledge representation primitives to RDF(S). Its formal semantics are based on description logic, and it supports the automatic classification of concepts. DAML+OIL (DARPA Agent Markup Language) [Connolly et al., 2001], which is a union of DAML with OIL, also adds description logic primitives to RDF(S), and allows representing concepts, taxonomies, binary relations, functions and instances. The OWL Web Ontology Language [Bechhofer et al., 2004], which is the latest branch of ontology markup languages, covers most of the features in DAML+OIL. The differences between OWL and DAML+OIL are mainly a renaming of the original DAML+OIL primitives, since these were not always easy to understand for non-experts. The traditional representation languages presented are primarily based on first order logic and frames (except for LOOM), while the ontology markup languages are primarily based on description logic. The ability to reason is a very important issue for these paradigms, and the development of fast model checkers for description logic (see e.g. [Horrocks, 1998]) is a huge improvement in the applicability of description logic in real-life applications. Still, reasoning in large ontologies is computationally complex, and its usefulness, for instance, in large-scale information retrieval systems is limited. The aim of this thesis is to introduce ontologies for large-scale information retrieval; reasoning is thus to be avoided, and instead the idea is to use measures of similarity between concepts derived from the structure of the ontology, thereby replacing reasoning over the ontology with numerical similarity computation. This is to be achieved by transforming the ontology representation into a directed graph in which well-known algorithms can be used to compute, for instance, the shortest path between nodes. The intrinsic graphical nature of the lattice-algebraic representation therefore seems an obvious choice.
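As a foretaste of that graph view, a minimal sketch: the ontology as a directed graph of concept nodes, with a breadth-first shortest-path distance mapped into a similarity in [0, 1]. The small graph, the decay factor, and the path-based measure itself are illustrative assumptions only, not the shared-nodes measures developed later in the thesis.

```python
from collections import deque

# Illustrative ontology fragment: edges point from concept to superconcept.
isa = {"dog": ["mammal"], "cat": ["mammal"], "postman": ["person"],
       "mammal": ["animal"], "person": ["animal"], "animal": []}

def path_length(graph, a, b):
    """Shortest number of edges between a and b, ignoring edge direction."""
    neighbours = {n: set(graph.get(n, [])) for n in graph}
    for n, supers in graph.items():
        for s in supers:
            neighbours.setdefault(s, set()).add(n)  # make edges bidirectional
    seen, queue = {a}, deque([(a, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for m in neighbours.get(node, ()):
            if m not in seen:
                seen.add(m)
                queue.append((m, dist + 1))
    return None  # not connected

def similarity(graph, a, b, decay=0.5):
    """Map path length to (0, 1]: identical concepts get 1, distant ones less."""
    d = path_length(graph, a, b)
    return 0.0 if d is None else decay ** d

print(similarity(isa, "dog", "cat"))      # 0.25   (two edges via mammal)
print(similarity(isa, "dog", "postman"))  # 0.0625 (four edges via animal)
```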
3.4.3 Ontolog

The ontology sources to be used in this thesis are linguistic ontologies, and the ontological framework is characterized as a generative ontology [Andreasen and Nilsson, 2004], in the generative grammar tradition of Chomsky and with ties to the generative lexicon semantics found in [Pustejovsky, 1991]. A generative ontology is an ontology defined by a set of atomic concepts and a set of relations. From these two sets, assumed given by the ontological sources, any concept combining atomic or already combined concepts by relations is also a
part of the generative ontology. This generative ontological framework is modeled by the lattice-algebraic description language called Ontolog [Nilsson, 2001]. The primary role of the ontology is to provide a specification of a shared conceptualization in an (ontological) information retrieval environment targeting only partial representation of meaning. The aim is to capture and represent various aspects of meaning from both queries and documents and use the ontological framework to measure similarity.
Semantic Relation                    Abbreviation
temporal aspect                      tmp
location, position                   loc
purpose, function                    prp
with respect to                      wrt
characteristic                       chr
with, accompanying                   cum
by means of, instrument, via         bmo
caused by                            cby
causes                               cau
comprising, has part                 cmp
part of                              pof
agent of act or process              act
patient of act or process            pnt
source of act or process             src
result of act or process             rst
destination of moving process        dst
Table 3.2: The subset of semantic relations R and their abbreviations as presented in [Nilsson, 2001]
Semantic relations are used for combining concepts, expressing feature attachment, and thereby forming compound concepts. The number and types of relations in the ontology source and in the query language should be harmonized and reflect the granularity of relations in use. There have been numerous studies regarding representative collections of semantic relations. The set of relations presented in Table 3.2 constitutes a limited selection that can represent "meaning" on a very coarse level for a subset of the occurring concepts. Terms in the Ontolog representation are well-formed concepts situated in an ontology, with concept inclusion as the key ordering relation. The basic elements in Ontolog are concepts and binary relations between concepts. The algebra introduces two closed operations on the concept expressions ϕ and ψ [Nilsson, 2001]:
• conceptual sum (ϕ + ψ), interpreted as the concept being ϕ or ψ
• conceptual product (ϕ × ψ), interpreted as the concept being ϕ and ψ
Relationships r are introduced algebraically by means of a binary operator (:), known as the Peirce product (r : ϕ), which combines a relation r with an expression ϕ, resulting in an expression, thus relating nodes in the ontology. The Peirce product is used as a factor in conceptual products, as in c × (r : c1), which can be rewritten to form the feature structure c[r : c1], where [r : c1] is an attribution of the concept c. Thus, the attribution of concepts to form compound concepts corresponds to feature attachment. Attribution of a concept a with relation r and concept b has the intuitive lattice-algebraic understanding a × r(b), conventionally written as the feature structure a[r:b]. Algebraically, the concept a[r:b] is therefore to be understood as the greatest lower bound (meet, infimum) of all a's and all concepts being r-attributed with the value b. Given atomic concepts A and relations R, the set of well-formed terms L of the Ontolog language is defined as follows:
• if x ∈ A then x ∈ L,
• if x ∈ L, rᵢ ∈ R and yᵢ ∈ L for i = 1, …, n, then x[r₁ : y₁, …, rₙ : yₙ] ∈ L.
Compound concepts can thus have multiple and nested attributions. Examples of such compound concepts are dog[chr:gray[chr:dark]] (nested attribution) and dog[chr:black, chr:fast] (multiple attribution). The attributes of a compound concept X = x[r₁ : y₁, …, rₙ : yₙ] are considered as a set, and thus X can be rewritten with any permutation of {r₁ : y₁, …, rₙ : yₙ} [Andreasen et al., 2003c].
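To make the two formation rules and the permutation property concrete, the following is a minimal sketch in Python; all names (Term, attribute, ATOMS) are our illustrative assumptions and not part of Ontolog or of this thesis.

from dataclasses import dataclass

ATOMS = {"dog", "gray", "dark", "black", "fast"}   # a toy subset of the atomic concepts A
RELATIONS = {"tmp", "loc", "prp", "wrt", "chr", "cum", "bmo", "cby",
             "cau", "cmp", "pof", "act", "pnt", "src", "rst", "dst"}  # R, cf. Table 3.2

@dataclass(frozen=True)
class Term:
    head: str                              # an atomic concept from A
    attributes: frozenset = frozenset()    # set of (relation, Term) pairs

    def __post_init__(self):
        # Enforce the two formation rules of the term language L.
        if self.head not in ATOMS:
            raise ValueError(f"unknown atomic concept: {self.head}")
        for relation, value in self.attributes:
            if relation not in RELATIONS or not isinstance(value, Term):
                raise ValueError(f"ill-formed attribution: {relation}")

    def __str__(self):
        if not self.attributes:
            return self.head
        inner = ", ".join(sorted(f"{r}:{t}" for r, t in self.attributes))
        return f"{self.head}[{inner}]"

def attribute(head: str, *attrs: tuple) -> "Term":
    """Build x[r1:y1, ..., rn:yn] from a head concept and (relation, Term) pairs."""
    return Term(head, frozenset(attrs))

# The two examples from the text:
nested = attribute("dog", ("chr", attribute("gray", ("chr", Term("dark")))))
multiple = attribute("dog", ("chr", Term("black")), ("chr", Term("fast")))
print(nested)    # dog[chr:gray[chr:dark]]
print(multiple)  # dog[chr:black, chr:fast]

# Attributions form a set, so any permutation denotes the same concept:
assert attribute("dog", ("chr", Term("fast")), ("chr", Term("black"))) == multiple

Representing the attributions as a frozen set is what makes two terms that differ only in attribution order compare equal, directly mirroring the rewriting property from [Andreasen et al., 2003c].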
3.5 Resources
In this section, a number of resources will be presented that can act as a basis for the ontological framework just presented. Normally, not just a single source is used as the foundation; rather, a number of different sources are merged to form the basis for a generative ontology framework. The resources presented here are just a small fragment of the possible ontological resources available. The large lexical database for English, WordNet [Miller et al., 1990; Miller, 1995], was chosen because it is an extensive framework used in many projects related to ontologies.
In the OntoQuery project, the small Danish ontology SIMPLE [Pedersen and Keson, 1999; Pedersen and Nimb, 2000] is used in connection with Danish-language texts. Although not a perfect resource due to certain defects (see e.g. [Bulskov and Thomsen, 2005]), it is the only ontology of a reasonable size available in Danish. Fortunately, a newly started project, DanNet, has the goal of creating a Danish version of WordNet. This, of course, substantiates the use of WordNet, since the Danish version can later be merged into the framework discussed here.

Before WordNet is presented in Section 3.5.3, two examples of using WordNet in combination with a top-ontology are presented: a top-ontology mergeable with WordNet and a complete ontology created by merging different sources (including WordNet).
3.5.1 Suggested Upper Merged Ontology

The Standard Upper Ontology (SUO) provides definitions for general-purpose terms and acts as a foundation for more specific ontologies [Pease et al., 2001]. The Suggested Upper Merged Ontology (SUMO) was created with input from SUO and by merging publicly available ontological content into a single, comprehensive and cohesive structure. SUMO is being created as part of the Institute of Electrical and Electronics Engineers' (IEEE) Standard Upper Ontology Working Group [SUMO, 2006], whose goal is to develop a standard upper ontology that will promote data interoperability, information search and retrieval, automated inference, and natural language processing. The goal of SUMO, which is represented in KIF [Genesereth, 1991], is to create an upper ontology by merging well-known upper ontologies and knowledge resources, for instance, the Ontolingua server [Gruber, 1992], Sowa's upper level ontology (see Section 3.2), Allen's temporal axioms [Allen, 1984], plan and process theories [Pease and Carrico, 1997], and various mereotopological theories [Borgo et al., 1996a; Borgo et al., 1996b].

The introduction of an upper ontology has several advantages. For example, it can serve as a basis for knowledge integration and translation between ontologies, as well as tie together domain-specific ontologies. The noun portion of WordNet, version 2.1, is currently mapped to SUMO [Niles and Pease, 2003]. This mapping of SUMO and WordNet can serve as a foundation for combining WordNet with other resources, with SUMO acting as a bridge between the lower-level ontologies. Furthermore, the mappings between WordNet and SUMO can be regarded as a natural language index to SUMO.

Sometimes the "conceptual distance" between the top-level ontology and the domain-specific ontologies is too wide, and so-called mid-level ontologies can be used to close the gap. The mid-level ontology MILO [Niles and Terry, 2004] is intended to act as such a bridge between the high-level abstractions of SUMO and the low-level detail of domain ontologies, for instance, between SUMO and WordNet.
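To illustrate the "natural language index" reading of the WordNet-SUMO mappings, below is a minimal sketch. It assumes the publicly distributed mapping files, in which each WordNet data line is suffixed with an &%Concept marker ("=" for an equivalent concept, "+" for a subsuming one, "@" for an instance); the file name, encoding and all function names are our assumptions, not part of the thesis.

import re

# Each mapped WordNet line ends with a marker such as "&%Entity+" or
# "&%Canine=" naming the SUMO concept (format assumed, see lead-in above).
MAPPING_LINE = re.compile(r"&%(?P<sumo>\w+)(?P<kind>[=+@])")

def sumo_index(path="WordNetMappings30-noun.txt"):
    """Map each SUMO concept to the WordNet noun synsets mapped onto it."""
    index = {}
    with open(path, encoding="latin-1") as f:
        for line in f:
            if not line[:8].isdigit():
                continue                      # skip header/license lines
            match = MAPPING_LINE.search(line)
            if match:
                offset = line[:8]             # the WordNet synset offset
                index.setdefault(match.group("sumo"), []).append(offset)
    return index

# e.g. index["Canine"] would list the noun synsets mapped to SUMO's Canine

Inverting the mapping this way gives, for each SUMO concept, the WordNet synsets that can serve as its natural-language entry points.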
3.5.2 SENSUS

SENSUS [Knight and Luk, 1994] is a natural language-based ontology developed by the Natural Language Group [NLG, 2006] at the Information Sciences Institute (ISI) to provide a broad conceptual structure for work in machine translation.
[Figure: Construction of the SENSUS ontology: automatically matched WordNet and LDOCE sense pairs are merged and then manually verified.]