Ontology-Based Information Retrieval

Total Page:16

File Type:pdf, Size:1020Kb

Ontology-Based Information Retrieval Ontology-based Information Retrieval Henrik Bulskov Styltsvig A dissertation Presented to the Faculties of Roskilde University in Partial Ful¯llment of the Requirement for the Degree of Doctor of Philosophy Computer Science Section Roskilde University, Denmark May 2006 ii Abstract In this thesis, we will present methods for introducing ontologies in information retrieval. The main hypothesis is that the inclusion of conceptual knowledge such as ontologies in the information retrieval process can contribute to the solution of major problems currently found in information retrieval. This utilization of ontologies has a number of challenges. Our focus is on the use of similarity measures derived from the knowledge about relations between concepts in ontologies, the recognition of semantic information in texts and the mapping of this knowledge into the ontologies in use, as well as how to fuse together the ideas of ontological similarity and ontological indexing into a realistic information retrieval scenario. To achieve the recognition of semantic knowledge in a text, shallow nat- ural language processing is used during indexing that reveals knowledge to the level of noun phrases. Furthermore, we briefly cover the identi¯cation of semantic relations inside and between noun phrases, as well as discuss which kind of problems are caused by an increase in compoundness with respect to the structure of concepts in the evaluation of queries. Measuring similarity between concepts based on distances in the structure of the ontology is discussed. In addition, a shared nodes measure is introduced and, based on a set of intuitive similarity properties, compared to a number of di®erent measures. In this comparison the shared nodes measure appears to be superior, though more computationally complex. Some of the major problems of shared nodes which relate to the way relations di®er with respect to the degree they bring the concepts they connect closer are discussed. A generalized measure called weighted shared nodes is introduced to deal with these problems. Finally, the utilization of concept similarity in query evaluation is dis- cussed. A semantic expansion approach that incorporates concept similarity is introduced and a generalized fuzzy set retrieval model that applies expansion during query evaluation is presented. While not commonly used in present in- formation retrieval systems, it appears that the fuzzy set model comprises the flexibility needed when generalizing to an ontology-based retrieval model and, with the introduction of a hierarchical fuzzy aggregation principle, compound concepts can be handled in a straightforward and natural manner. iv Resum¶e(in danish) Fokus i denne afhandling er anvendelse af ontologier i informationss¿gning (In- formation Retrieval). Den overordnede hypotese er, at indf¿ring af konceptuel viden, sºasom ontologier, i forbindelse med foresp¿rgselsevaluering kan bidrage til l¿sning af v½sentlige problemer i eksisterende metoder. Denne inddragelse af ontologier indeholder en r½kke v½sentlige udfor- dringer. Vi har valgt at fokusere pºasimilaritetsmºalder baserer sig pºaviden om relationer mellem begreber, pºagenkendelse af semantisk viden i tekst og pºa hvordan ontologibaserede similaritetsmºalog semantisk indeksering kan forenes i en realistisk tilgang til informationss¿gning. Genkendelse af semantisk viden i tekst udf¿res ved hj½lp af en simpel natursprogsbehandling i indekseringsprocessen, med det formºalat afd½kke substantivfraser. Endvidere, vil vi skitsere problemstillinger forbundet med at identi¯cere hvilke semantiske relationer simple substantivfraser er opbygget af og diskutere hvordan en for¿gelse af sammenf¿jning af begreber influerer pºa foresp¿rgselsevalueringen. Der redeg¿res for hvorledes et mºalfor similaritet kan baseres pºaafstand i ontologiers struktur, og introduceres et nyt afstandsmºal{ \shared nodes". Dette mºalsammenlignes med en r½kke andre mºalved hj½lp af en samling af in- tuitive egenskaber for similaritetsmºal.Denne sammenligning viser at \shared nodes" har fortrin frem for ¿vrige mºal,men ogsºaat det er beregningsm½ssigt mere indviklet. Der redeg¿res endvidere for en r½kke v½sentlige problemer forbundet med \shared nodes", som er relateret til den forskel der er mellem relationer med hensyn til i hvor h¿j grad de bringer de begreber de forbinder, sammen. Et mere generelt mºal,\weighted shared nodes", introduceres som l¿sning pºadisse problemer. Afslutningsvist fokuseres der pºahvorledes et similaritetsmºal,der sam- menligner begreber, kan inddrages i foresp¿rgselsevalueringen. Den l¿sning vi pr½senterer indf¿rer en semantisk ekspansion baseret pºasimilaritetsmºal. Evalueringsmetoden der anvendes er en generaliseret \fuzzy set retrieval" model, der inkluderer ekspansion af foresp¿rgsler. Selvom det ikke er al- mindeligt at anvende fuzzy set modellen i informationss¿gning, viser det sig at den har den forn¿dne fleksibilitet til en generalisering til ontologibaseret foresp¿rgselsevaluering, og at indf¿relsen af et hierarkisk aggregeringsprincip giver mulighed for at behandle sammensatte begreber pºaen simpel og naturlig mºade. vi Acknowledgements The work presented here has been completed within the framework of the interdisciplinary research project OntoQuery and the Design and Manage- ment of Information Technology Ph.D. program at Roskilde University. First of all, I would like to thank my supervisor Troels Andreasen for constructive, concise and encouraging supervision from beginning to end. His advice has been extremely valuable throughout the course of the thesis. I would fur- thermore like to thank my friend and colleague Rasmus Knappe for fruitful discussions and support throughout the process. My thanks also goes to the members of the OntoQuery project, J¿rgen Fischer Nilsson, Per Anker Jensen, Bodil Nistrup Madsen, and Hanne Erdman Thomsen, as well as the other Ph.D. students associated with the project. Finally, I would like to thank my friends and family for putting up with me during a somewhat stressful time, and I would especially like to thank my wife Kirsten for standing by me and giving me the room to make this happen. viii Contents 1 Introduction 1 1.1 Research Question ......................... 3 1.2 Thesis Outline ........................... 4 1.3 Foundations and Contributions .................. 4 2 Information Retrieval 6 2.1 Search Strategies .......................... 9 2.1.1 Term Weighting ...................... 11 2.1.2 Boolean Model ....................... 13 2.1.3 Vector Model ........................ 14 2.1.4 Probabilistic Model .................... 16 2.1.5 Fuzzy Set Model ...................... 19 2.2 Retrieval Evaluation ........................ 27 2.3 Summary and Discussion ..................... 29 3 Ontology 32 3.1 Ontology as a Philosophical Discipline .............. 32 3.2 Knowledge Engineering Ontologies ................ 35 3.3 Types of Ontologies ........................ 36 3.4 Representation Formalisms .................... 38 3.4.1 Traditional Ontology Languages ............. 46 3.4.2 Ontology Markup Languages ............... 47 3.4.3 Ontolog .......................... 48 3.5 Resources .............................. 50 3.5.1 Suggested Upper Merged Ontology ............ 51 3.5.2 SENSUS ........................... 52 3.5.3 WordNet .......................... 53 3.5.4 Modeling Ontology Resources ............... 59 3.6 Summary and Discussion ..................... 61 ix 4 Descriptions 64 4.1 Indexing ............................... 66 4.2 Ontology-based Indexing ...................... 68 4.2.1 The OntoQuery Approach ............... 70 4.2.2 Word Sense Disambiguation ................ 72 4.2.3 Identifying Relational Connections ............ 77 4.3 Instantiated Ontologies ...................... 80 4.3.1 The General Ontology ................... 80 4.3.2 The Domain-speci¯c Ontology .............. 81 4.4 Summary and Discussion ..................... 83 5 Ontological Similarity 88 5.1 Path Length Approaches ...................... 89 5.1.1 Shortest Path Length ................... 91 5.1.2 Weighted Shortest Path .................. 91 5.2 Depth-Relative Approaches .................... 94 5.2.1 Depth-Relative Scaling ................... 95 5.2.2 Conceptual Similarity ................... 95 5.2.3 Normalized Path Length .................. 96 5.3 Corpus-Based Approaches ..................... 96 5.3.1 Information Content .................... 96 5.3.2 Jiang and Conrath's Approach .............. 98 5.3.3 Lin's Universal Similarity Measure ............ 99 5.4 Multiple-Paths Approaches .................... 100 5.4.1 Medium-Strong Relations ................. 101 5.4.2 Generalized Weighted Shortest Path ........... 102 5.4.3 Shared Nodes ........................ 103 5.4.4 Weighted Shared Nodes Similarity ............ 107 5.5 Similarity Evaluation ........................ 108 5.6 Summary and Discussion ..................... 113 6 Query Evaluation 120 6.1 Semantic Expansion ........................ 121 6.1.1 Concept Expansion ..................... 122 6.2 Query Evaluation Framework ................... 126 6.2.1 Simple Fuzzy Query Evaluation .............. 126 6.2.2 Ordered Weighted Averaging Aggregation ........ 128 6.2.3 Hierarchical Aggregation .................. 129 6.2.4 Hierarchical Query Evaluation Approaches ....... 130 6.3 Knowledge Retrieval ........................ 134 6.4 Query Language .......................... 136 6.5 Surveying, Navigating, and Visualizing .............. 137 x 6.6 Summary and Discussion ..................... 139 7 A Prototype 142 7.1
Recommended publications
  • Basic Concepts in Information Retrieval
    1 Basic concepts of information retrieval systems Introduction The term ‘information retrieval’ was coined in 1952 and gained popularity in the research community from 1961 onwards.1 At that time the organizing function of information retrieval was seen as a major advance in libraries that were no longer just storehouses of books, but also places where the information they hold is catalogued and indexed.2 Subsequently, with the introduction of computers in information handling, there appeared a number of databases containing bibliographic details of documents, often married with abstracts, keywords, and so on, and consequently the concept of information retrieval came to mean the retrieval of bibliographic information from stored document databases. Information retrieval is concerned with all the activities related to the organization of, processing of, and access to, information of all forms and formats. An information retrieval system allows people to communicate with an information system or service in order to find information – text, graphic images, sound recordings or video that meet their specific needs. Thus the objective of an information retrieval system is to enable users to find relevant information from an organized collection of documents. In fact, most information retrieval systems are, truly speaking, document retrieval systems, since they are designed to retrieve information about the existence (or non-existence) of documents relevant to a user query. Lancaster3 comments that an information retrieval system does not inform (change the knowledge of) the user on the subject of their enquiry; it merely informs them of the existence (or non-existence) and whereabouts of documents relating to their request.
    [Show full text]
  • Artificial Intelligence in Health Care: the Hope, the Hype, the Promise, the Peril
    Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril Michael Matheny, Sonoo Thadaney Israni, Mahnoor Ahmed, and Danielle Whicher, Editors WASHINGTON, DC NAM.EDU PREPUBLICATION COPY - Uncorrected Proofs NATIONAL ACADEMY OF MEDICINE • 500 Fifth Street, NW • WASHINGTON, DC 20001 NOTICE: This publication has undergone peer review according to procedures established by the National Academy of Medicine (NAM). Publication by the NAM worthy of public attention, but does not constitute endorsement of conclusions and recommendationssignifies that it is the by productthe NAM. of The a carefully views presented considered in processthis publication and is a contributionare those of individual contributors and do not represent formal consensus positions of the authors’ organizations; the NAM; or the National Academies of Sciences, Engineering, and Medicine. Library of Congress Cataloging-in-Publication Data to Come Copyright 2019 by the National Academy of Sciences. All rights reserved. Printed in the United States of America. Suggested citation: Matheny, M., S. Thadaney Israni, M. Ahmed, and D. Whicher, Editors. 2019. Artificial Intelligence in Health Care: The Hope, the Hype, the Promise, the Peril. NAM Special Publication. Washington, DC: National Academy of Medicine. PREPUBLICATION COPY - Uncorrected Proofs “Knowing is not enough; we must apply. Willing is not enough; we must do.” --GOETHE PREPUBLICATION COPY - Uncorrected Proofs ABOUT THE NATIONAL ACADEMY OF MEDICINE The National Academy of Medicine is one of three Academies constituting the Nation- al Academies of Sciences, Engineering, and Medicine (the National Academies). The Na- tional Academies provide independent, objective analysis and advice to the nation and conduct other activities to solve complex problems and inform public policy decisions.
    [Show full text]
  • Automatic Knowledge Retrieval from the Web
    Automatic Knowledge Retrieval from the Web Marcin Skowron and Kenji Araki Graduate School of Information Science and Technology, Hokkaido University, Kita-ku Kita 14-jo Nishi 8-chome, 060–0814 Sapporo, Japan Abstract. This paper presents the method of automatic knowledge retrieval from the web. The aim of the system that implements it, is to automatically create entries to a knowledge database, similar to the ones that are being provided by the volunteer contributors. As only a small fraction of the statements accessible on the web can be treated as valid knowledge concepts we considered the method for their filtering and verification, based on the similarity measurements with the concepts found in the manually created knowledge database. The results demonstrate that the system can retrieve valid knowledge concepts both for topics that are described in the manually created database, as well as the ones that are not covered there. 1 Introduction Despite the years of research in the field of Artificial Intelligence, the creation of a machine with the ability to think is still far from realization. Although computer systems are capable of performing several complicated tasks that require human beings to extensively use their thinking capabilities, machines still cannot engage into really meaningful conversation or understand what people talk about. One of the main unresolved problems is the lack of machine usable knowledge. Without it, machines cannot reason about the everyday world in a similar way to human beings. In the last decade we have witnessed a few attempts to create knowledge databases using various approaches: man- ual, machine learning and mass collaboration of volunteer contributors.
    [Show full text]
  • The Seven Ages of Information Retrieval
    International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME OCCASIONAL PAPER 5 THE SEVEN AGES OF INFORMATION RETRIEVAL Michael Lesk Bellcore March, 1996 International Federation of Library Associations and Institutions UNIVERSAL DATAFLOW AND TELECOMMUNICATIONS CORE PROGRAMME The IFLA Core Programme on Universal Dataflow and Telecommunications (UDT) seeks to facilitate the international and national exchange of electronic data by providing the library community with pragmatic approaches to resource sharing. The programme monitors and promotes the use of relevant standards, promotes the use of relevant technologies and monitors relevant policy issues in an effort to overcome barriers to the electronic transfer of data in library fields. CONTACT INFORMATION Mailing Address: IFLA International Office for UDT c/o National Library of Canada 395 Wellington Street Ottawa, CANADA K1A 0N4 UDT Staff Contacts: Leigh Swain, Director Email: [email protected] Phone: (819) 994-6833 or Louise Lantaigne, Administration Officer Email: [email protected] Phone: (819) 994-6963 Fax: (819) 994-6835 Email: [email protected] URL: http://www.ifla.org/udt/ Occasional papers are available electronically at: http://www.ifla.org/udt/op/ UDT Occasional Papers # 5 Universal Dataflow and Telecommunications Core Programme International Federation of Library Associations and Institutions The Seven Ages of Information Retrieval Michael Lesk Bellcore [email protected] March, 1996 ABSTRACT analysis. This dates to a memo by Warren Weaver in 1949 [Weaver 1955] thinking about the success of Vannevar Bush's 1945 article set a goal of fast access computers in cryptography during the war, and to the contents of the world's libraries which looks suggesting that they could translate languages.
    [Show full text]
  • Information Retrieval (Text Categorization)
    Information Retrieval (Text Categorization) Fabio Aiolli http://www.math.unipd.it/~aiolli Dipartimento di Matematica Pura ed Applicata Università di Padova Anno Accademico 2008/2009 Dip. di Matematica F. Aiolli - Information Retrieval 1 Pura ed Applicata 2008/2009 Text Categorization Text categorization (TC - aka text classification) is the task of buiding text classifiers, i.e. sofware systems that classify documents from a domain D into a given, fixed set C = {c 1,…,c m} of categories (aka classes or labels) TC is an approximation task , in that we assume the existence of an ‘oracle’, a target function that specifies how docs ought to be classified. Since this oracle is unknown , the task consists in building a system that ‘approximates’ it Dip. di Matematica F. Aiolli - Information Retrieval 2 Pura ed Applicata 2008/2009 Text Categorization We will assume that categories are symbolic labels; in particular, the text constituting the label is not significant. No additional knowledge of category ‘meaning’ is available to help building the classifier The attribution of documents to categories should be realized on the basis of the content of the documents. Given that this is an inherently subjective notion, the membership of a document in a category cannot be determined with certainty Dip. di Matematica F. Aiolli - Information Retrieval 3 Pura ed Applicata 2008/2009 Single-label vs Multi-label TC TC comes in two different variants: Single-label TC (SL) when exactly one category should be assigned to a document The target function in the form f : D → C should be approximated by means of a classifier f’ : D → C Multi-label TC (ML) when any number {0,…,m} of categories can be assigned to each document The target function in the form f : D → P(C) should be approximated by means of a classifier f’ : D → P(C) We will often indicate a target function with the alternative notation f : D × C → {-1,+1} .
    [Show full text]
  • Information Retrieval System: Concept and Scope MODULE - 5B INFORMATION RETRIEVAL SYSTEM
    Information Retrieval System: Concept and Scope MODULE - 5B INFORMATION RETRIEVAL SYSTEM 15 Notes INFORMATION RETRIEVAL SYSTEM: CONCEPT AND SCOPE 15.1 INTRODUCTION Information is communicated or received knowledge concerning a particular fact or circumstance. Retrieval refers to searching through stored information to find information relevant to the task at hand. In view of this, information retrieval (IR) deals with the representation, storage, organization of/and access to information items. Here, types of information items include documents, Web pages, online catalogues, structured records, multimedia objects, etc. Chief goals of the IR are indexing text and searching for useful documents in a collection. Libraries were among the first institutions to adopt IR systems for retrieving information. In this lesson, you will be introduced to the importance, definitions and objectives of information retrieval. You will also study in detail the concept of subject approach to information, process of information retrieval, and indexing languages. 15.2 OBJECTIVES After studying this lesson, you will be able to: define information retrieval; understand the importance and need of information retrieval system; explain the concept of subject approach to information; LIBRARY AND INFORMATION SCIENCE 321 MODULE - 5B Information Retrieval System: Concept and Scope INFORMATION RETRIEVAL SYSTEM illustrate the process of information retrieval; and differentiate between natural, free and controlled indexing languages. 15.3 INFORMATION RETRIEVAL (IR) Notes The term ‘information retrieval’ was coined by Calvin Mooers in 1950. It gained popularity in the research community from 1961 onwards, when computers were introduced for information handling. The term information retrieval was then used to mean retrieval of bibliographic information from stored document databases.
    [Show full text]
  • Information Extraction Using Natural Language Processing
    INFORMATION EXTRACTION USING NATURAL LANGUAGE PROCESSING Cvetana Krstev University of Belgrade, Faculty of Philology Information Retrieval and/vs. Natural Language Processing So close yet so far Outline of the talk ◦ Views on Information Retrieval (IR) and Natural Language Processing (NLP) ◦ IR and NLP in Serbia ◦ Language Resources (LT) in the core of NLP ◦ at University of Belgrade (4 representative resources) ◦ LR and NLP for Information Retrieval and Information Extraction (IE) ◦ at University of Belgrade (4 representative applications) Wikipedia ◦ Information retrieval ◦ Information retrieval (IR) is the activity of obtaining information resources relevant to an information need from a collection of information resources. Searches can be based on full-text or other content-based indexing. ◦ Natural Language Processing ◦ Natural language processing is a field of computer science, artificial intelligence, and computational linguistics concerned with the interactions between computers and human (natural) languages. As such, NLP is related to the area of human–computer interaction. Many challenges in NLP involve: natural language understanding, enabling computers to derive meaning from human or natural language input; and others involve natural language generation. Experts ◦ Information Retrieval ◦ As an academic field of study, Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need from within large collection (usually stored on computers). ◦ C. D. Manning, P. Raghavan, H. Schutze, “Introduction to Information Retrieval”, Cambridge University Press, 2008 ◦ Natural Language Processing ◦ The term ‘Natural Language Processing’ (NLP) is normally used to describe the function of software or hardware components in computer system which analyze or synthesize spoken or written language.
    [Show full text]
  • Open Mind Common Sense: Knowledge Acquisition from the General Public
    Open Mind Common Sense: Knowledge Acquisition from the General Public Push Singh, Grace Lim, Thomas Lin, Erik T. Mueller Travell Perkins, Mark Tompkins, Wan Li Zhu MIT Media Laboratory 20 Ames Street Cambridge, MA 02139 USA {push, glim, tlin, markt, wlz}@mit.edu, [email protected], [email protected] Abstract underpinnings for commonsense reasoning (Shanahan Open Mind Common Sense is a knowledge acquisition 1997), there has been far less work on finding ways to system designed to acquire commonsense knowledge from accumulate the knowledge to do so in practice. The most the general public over the web. We describe and evaluate well-known attempt has been the Cyc project (Lenat 1995) our first fielded system, which enabled the construction of which contains 1.5 million assertions built over 15 years a 400,000 assertion commonsense knowledge base. We at the cost of several tens of millions of dollars. then discuss how our second-generation system addresses Knowledge bases this large require a tremendous effort to weaknesses discovered in the first. The new system engineer. With the exception of Cyc, this problem of scale acquires facts, descriptions, and stories by allowing has made efforts to study and build commonsense participants to construct and fill in natural language knowledge bases nearly non-existent within the artificial templates. It employs word-sense disambiguation and intelligence community. methods of clarifying entered knowledge, analogical inference to provide feedback, and allows participants to validate knowledge and in turn each other. Turning to the general public 1 In this paper we explore a possible solution to this Introduction problem of scale, based on one critical observation: Every We would like to build software agents that can engage in ordinary person has common sense of the kind we want to commonsense reasoning about ordinary human affairs.
    [Show full text]
  • A Philosophical Wish List for Research in Music Information Retrieval Cynthia M
    A Philosophical Wish List for Research in Music Information Retrieval Cynthia M. Grund Institute of Philosophy, Education and the Study of Religions - Philosophy University of Southern Denmark, Odense [email protected] Abstract what might otherwise seem utterly intractable questions. Within a framework provided by the traditional trio What insights could MIR-research provide into the consisting of metaphysics, epistemology and ethics, a first great questions of philosophy? The answer lies in the stab is made at a wish list for MIR-research from a particular facility which the tools of MIR possess for philosophical point of view. Since the tools of MIR are dealing with language as a sonic phenomenon, thus 1 equipped to study language and its use from a purely sonic providing yet another revolution in the linguistic turn. A standpoint, MIR research could result in another revealing search through the reams of literature produced regarding revolution within the linguistic turn in philosophy. the role of language in philosophical endeavor reveals that the language framework is virtually always a written one. Keywords: Philosophy and MIR, language as spoken, Little or no attention has been paid to the mechanisms at memory work in thinking, learning and communicating in a context 2 1. Introduction and Brief Setting of the Stage which is virtually of an exclusively oral, sonic. Since the overwhelming majority of time during which our thinking, Philosophy wrestles with questions regarding learning and communicative skills developed was metaphysics,
    [Show full text]
  • An Architecture for Information Retrieval Agents* Craig A
    From: AAAI Technical Report SS-94-03. Compilation copyright © 1994, AAAI (www.aaai.org). All rights reserved. An Architecture for Information Retrieval Agents* Craig A. Knoblock and Yigal Arens Information Sciences Institute University of Southern California 4676 Admiralty Way Marina del Rey, CA 90292, USA {KNOBLOCK,ARENS}QISI.ED Abstract sources that are available to it. Given an informa- tion request, an agent identifies an appropriate set of With the vast number of information resources information sources, generates a plan to retrieve and available today, a critical problem is howto lo- process the data, uses knowledge about the data to re- cate, retrieve and process information. It would formulate the plan, and then executes it. This paper be impractical to build a single unified system that describes our approach to the issues of representation, combines all of these information resources. A communication, problem solving, and learning, and de- more promising approach is to build specialized scribes how this approach supports multiple, collabo- information retrieval agents that provide access rating information retrieval agents. to a subset of the information resources and can send requests to other information retrieval agents Representing the Knowledge of an when needed. In this paper we present an archi- tecture for building such agents that addresses the Agent issues of representation, communication, problem Each information agent is specialized to a particular solving, and learning. Wealso describe how this area of expertise. This provides a modular organiza- architecture supports agents that are modular, ex- tion of the vast numberof information sources and pro- tensible, flexible and efficient, vides a clear delineation of the types of queries each agent can handle.
    [Show full text]
  • Semantic Information Retrieval Based on Wikipedia Taxonomy
    International Journal of Computer Applications Technology and Research Volume 2– Issue 1, 77-80, 2013, ISSN: 2319–8656 Semantic Information Retrieval based on Wikipedia Taxonomy May Sabai Han University of Technology Yatanarpon Cyber City Myanmar Abstract: Information retrieval is used to find a subset of relevant documents against a set of documents. Determining semantic similarity between two terms is a crucial problem in Web Mining for such applications as information retrieval systems and recommender systems. Semantic similarity refers to the sameness of two terms based on sameness of their meaning or their semantic contents. Recently many techniques have introduced measuring semantic similarity using Wikipedia, a free online encyclopedia. In this paper, a new technique of measuring semantic similarity is proposed. The proposed method uses Wikipedia as an ontology and spreading activation strategy to compute semantic similarity. The utility of the proposed system is evaluated by using the taxonomy of Wikipedia categories. Keywords: information retrieval; semantic similarity; spreading activation strategy; wikipedia taxonomy; wikipedia categories 1. INTRODUCTION fashion than a search engine and with more coverage than WordNet. And Wikipedia articles have been categorized by Information in WWW are scattered and diverse in nature. So, providing a taxonomy, categories. This feature provides the users frequently fail to describe the information desired. hierarchical structure or network. Wikipedia also provides Traditional search techniques are constrained by keyword articles link graph. So many researches has recently used based matching techniques. Hence low precision and recall is Wikipedia as an ontology to measure semantic similarity. obtained [2]. Many natural language processing applications must estimate the semantic similarity of pairs of text We propose a method to use structured knowledge extracted fragments provided as input, e.g.
    [Show full text]
  • Using a Natural Language Understanding System to Generate Semantic Web Content
    Final version to appear, International Journal on Semantic Web and Information Systems, 3(4), 2007. (http://www.igi-pub.com/) 1 Using a Natural Language Understanding System to Generate Semantic Web Content Akshay Java, Sergei Nirneburg, Marjorie McShane, Timothy Finin, Jesse English, Anupam Joshi University of Maryland, Baltimore County, Baltimore MD 21250 {aks1,sergei,marge,finin,english1,joshi}@umbc.edu Abstract We describe our research on automatically generating rich semantic annotations of text and making it available on the Semantic Web. In particular, we discuss the challenges involved in adapting the OntoSem natural language processing system for this purpose. OntoSem, an implementation of the theory of ontological semantics under continuous development for over fifteen years, uses a specially constructed NLP-oriented ontology and an ontological- semantic lexicon to translate English text into a custom ontology-motivated knowledge representation language, the language of text meaning representations (TMRs). OntoSem concentrates on a variety of ambiguity resolution tasks as well as processing unexpected input and reference. To adapt OntoSem’s representation to the Semantic Web, we developed a translation system, OntoSem2OWL, between the TMR language into the Semantic Web language OWL. We next used OntoSem and OntoSem2OWL to support SemNews, an experimental web service that monitors RSS news sources, processes the summaries of the news stories and publishes a structured representation of the meaning of the text in the news story. Index Terms semantic web, OWL, RDF, natural language processing, information extraction I. INTRODUCTION A core goal of the development of the Semantic Web is to bring progressively more meaning to the information published on the Web.
    [Show full text]