An Association Thesaurus for Information Retrieval 1 Introduction

An Asso ciation Thesaurus for Information Retrieval Yufeng Jing and W Bruce Croft Department of Computer Science University of Massachusetts at Amherst Amherst MA jingcsumassedu croftcsumassedu Abstract Although commonly used in b oth commercial and exp erimental information retrieval systems thesauri have not demonstrated consistent b enets for retrieval p erformance and it is dicult to construct a thesaurus automatically for large text databases In this pap er an approach called PhraseFinder is prop osed to construct collectiondep e ndent asso ciation thesauri automatically using large fulltext do cument collections The asso ciation thesaurus can b e accessed through natural language queries in INQUERY an information retrieval system based on the probabilistic inference network Exp eriments are conducted in IN QUERY to evaluate dierent typ es of asso ciation thesauri and thesauri constructed for a variety of collections Intro duction A thesaurus is a set of items phrases or words plus a set of relations b etween these items Although thesauri are commonly used in b oth commercial and exp erimental IR systems ex p eriments have shown inconsistent eects on retrieval eectiveness and there is a lack of viable approaches for constructing a thesaurus automatically There are three basic issues related to thesauri in IR as follows These issues should each b e addressed separately in order to improve retrieval eectiveness Construction There are two typ es of thesauri manual and automatic The fo cus of this pap er is on how to construct a thesaurus automatically Access Given a particular query the thesaurus must b e accessed and used in some way to improve or expand the query Evaluation After a thesaurus is built it is imp ortant to know how go o d it is Manual thesauri are evaluated in terms of the soundness coverage of classication and thesaurus item selection The evaluation of automatic thesauri is generally done via query expansion to see if retrieval p erformance is improved There are two typ es of manual thesauri The rst are generalpurp ose and wordbased thesauri like Rogets and WordNet Those thesauri contain sense relations like antonym and synonym but are rarely used in IR systems The second are IRoriented and phrasebased thesauri like INSPEC LCSH Library of Congress Sub ject Headings and MeSH Medical Sub ject Headings Those manual thesauri usually contain relations b etween thesaurus items such as BT Broader Term NT Narrow Term UF Used For and RT Related To and can b e either general or sp ecic dep ending on the needs of thesaurus builders This typ e of manual thesauri is widely used in commercial systems The ma jor problem with manual thesauri is that they are exp ensive to build and hard to up date in a timely manner Even though the determination of thesaurus item relations is made by human exp erts it is a dicult task For example what is the relation b etween information system and data base system Although generally built to supp ort information retrieval manual thesauri have b een evaluated by testing the soundness and coverage of the thesaurus concept classication This do es not always directly serve the purp ose of information retrieval Go o d classication and concept coverage do not guarantee eective retrieval An automatic thesaurus is usually collectiondep endent ie dep endent on the text database which is used A few small and automatically constructed thesauri have b een used in exp er imental IR systems but the eectiveness of these thesauri has not b een established for large text databases Automatic thesauri are typically built based on coo ccurrence information and relevance judgements are often used to estimate the probability that thesaurus terms are similar to query terms or a particular query Since determining term or phrase relations is hard to achieve automatically in automatic thesauri these relations are simplied to one typ e of asso ciation relation In this pap er an approach for the automatic construction of thesauri is presented along with a program PhraseFinder that utilizes this approach to construct a collectiondep endent thesaurus called an asso ciation thesaurus The asso ciation thesaurus is accessed through INQUERY an information retrieval system based on inference networks using natural lan guage queries Adding phrases retrieved in this way to the original queries pro duces signicant p erformance improvements In the following sections previous work is reviewed the approach and techniques for PhraseFinder are describ ed exp erimental results are presented and some future research issues related to asso ciation thesauri are discussed Thesauri and Information Retrieval The use of thesauri in IR involves automatic construction user interface design retrieval mech anisms and retrieval architecture Research related to automatic thesauri dates back to Spark Joness work on automatic term classication Saltons work on automatic thesaurus construction and query expansion and Van Rijsb ergens work on term coo ccurrence In those exp eriments automatic term classication without relevance judgments or feed back information did not pro duce any signicant improvements Minker Wilson and Zimmer man in after an exhaustive investigation came to the same conclusion Salton showed that using the Harris synonym thesaurus with relevance judgements pro duces signicant improvements He prop osed two approaches termdo cument and term prop erty matrices for automatic construction of thesauri based on relevance judgements In Salton Yu and Buckley further develop ed these metho ds into a formal term dep endence mo del But the serious drawback with the term dep endence mo del is that it assumes the availability of relevance judgements The o ccurrence information of terms in relevant and nonrelevant do cuments is used to estimate the probability of terms similar to queries When full relevance judgments are not fully available the retrieval p erformance under the mo del degraded badly Recently Crouch and Yang used Saltons approach to build a term vsterm thesaurus whose classes are clustered in terms of discrimination values Although such a thesaurus improved retrieval p erformance to determine the b est threshold value is dicult b ecause such a value is determined assuming the availability of relevance judgements Yus study on clustering for information retrieval showed that term dep endence information improves the eectiveness of retrieval systems In the nonbinary indep endence mo del he used hierarchical term relations Rijsb ergen and Smeaton in used a statisticallyderived MST maximum spanning tree as a term dep endence structure with relevance feedback to verify the following association hypothesis If an index term is good at discriminating relevant from nonrelevant documents then any closely associated index term is also likely to be good at this The exp eriments using MST resulted in no p ositive results In the recent work by Yonggang Qiu and HP Frei a termvsterm similarity matrix was constructed based on how the terms of the collection are indexed A probabilistic metho d is used to estimate the probability of a term similar to a given query in the vector space mo del Even when adding hundreds of terms into queries this approach showed that the similarity thesaurus can improve p erformance signicantly Adding so many terms may not however b e ecient for large information retrieval systems It should b e noted that their improved results on the NPL collection is still lower than the baselines used in this pap er In the CLARIT pro ject NLP techniques are used to identify candidate noun phrases in fulltext and map them into candidate terms in a morphologicallynormalized form empha sizing modier and head relations and the candidate terms are matched against a rstorder thesaurus of certied domainsp ecic terminology which is a semiautomatically generated con trol vo cabulary dictionary This indexing technique builds a mapping b etween terms within a candidate noun phrase and terms in the rstorder thesaurus As an indexing technique the metho d used in CLARIT was compared with human indexers on the basis of ten do cuments Ruge uses the headmodier relation to investigate the eect of term asso ciations on the p erformance of IR systems in the hyp erterm system REALIST But this typ e of asso ciation might b e to o restrictive to capture some asso ciations for example for the query insider trad ing scandal p erson and company names which were involved in and other phrases which often coo ccur with insider trading scandal can b e very imp ortant asso ciated phrases In summary some problems with previous work are Relevance judgements are used to estimate the probability of a term related to another term or query Because relevance judgements are often not available these approaches are impractical for term classication or thesaurus construction Second even if available relevance judgments are often pro duced for a set of queries which do not cover a whole collection Term selection for query expansion is based on individual terms but not on the whole query This is consistent with the asso ciation hyp othesis but it ignores the additional context that additional terms in the query provide A term may b e closely asso ciated with one of the words in the query but not with the meaning of the entire query Coo ccurrence data for terms is gathered from individual texts without limiting the range of the coo ccurrence Previous research has fo cussed on short abstractlength do cuments With fulltext databases where do

An Association Thesaurus for Information Retrieval 1 Introduction

Evaluation of Stop Word Lists in Chinese Language

PSWG: an Automatic Stop-Word List Generator for Persian Information

Natural Language Processing Security- and Defense-Related Lessons Learned

Ontology Based Natural Language Interface for Relational Databases

INFORMATION RETRIEVAL SYSTEM for SILT'e LANGUAGE Using

The Applicability of Natural Language Processing (NLP) To

FTS in Database

214310991.Pdf

Accelerating Text Mining Using Domain-Specific Stop Word Lists

Full-Text Search in Postgresql: a Gentle Introduction by Oleg Bartunov and Teodor Sigaev

Automatic Multi-Word Term Extraction and Its Application to Web-Page Summarization

Corpora Preparation and Stopword List Generation for Arabic Data in Social Network