<<

An Asso ciation Thesaurus for

Yufeng Jing and W Bruce Croft

Department of Computer Science

University of Massachusetts at Amherst

Amherst MA

jingcsumassedu croftcsumassedu

Abstract

Although commonly used in b oth commercial and exp erimental information retrieval

systems thesauri have not demonstrated consistent b enets for retrieval p erformance and

it is dicult to construct a thesaurus automatically for large text databases In this pap er

an approach called PhraseFinder is prop osed to construct collectiondep e ndent asso ciation

thesauri automatically using large fulltext do cument collections The asso ciation thesaurus

can b e accessed through natural language queries in INQUERY an information retrieval

system based on the probabilistic inference network Exp eriments are conducted in IN

QUERY to evaluate dierent typ es of asso ciation thesauri and thesauri constructed for a

variety of collections

Intro duction

A thesaurus is a set of items phrases or words plus a set of relations b etween these items

Although thesauri are commonly used in b oth commercial and exp erimental IR systems ex

p eriments have shown inconsistent eects on retrieval eectiveness and there is a lack of viable

approaches for constructing a thesaurus automatically There are three basic issues related to

thesauri in IR as follows These issues should each b e addressed separately in order to improve

retrieval eectiveness

Construction There are two typ es of thesauri manual and automatic The fo cus of

this pap er is on how to construct a thesaurus automatically

Access Given a particular query the thesaurus must b e accessed and used in some

way to improve or expand the query

Evaluation After a thesaurus is built it is imp ortant to know how go o d it is Manual

thesauri are evaluated in terms of the soundness coverage of classication and thesaurus

item selection The evaluation of automatic thesauri is generally done via

to see if retrieval p erformance is improved

There are two typ es of manual thesauri The rst are generalpurp ose and wordbased thesauri

like Rogets and WordNet Those thesauri contain sense relations like antonym and synonym

but are rarely used in IR systems The second are IRoriented and phrasebased thesauri

like INSPEC LCSH Library of Congress Sub ject Headings and MeSH Medical Sub ject

Headings Those manual thesauri usually contain relations b etween thesaurus items such as

BT Broader Term NT Narrow Term UF Used For and RT Related To and can b e either

general or sp ecic dep ending on the needs of thesaurus builders This typ e of manual thesauri

is widely used in commercial systems The ma jor problem with manual thesauri is that they are

exp ensive to build and hard to up date in a timely manner Even though the determination of

thesaurus item relations is made by human exp erts it is a dicult task For example what is

the relation b etween information system and data base system Although generally built

to supp ort information retrieval manual thesauri have b een evaluated by testing the soundness

and coverage of the thesaurus concept classication This do es not always directly serve the

purp ose of information retrieval Go o d classication and concept coverage do not guarantee

eective retrieval

An automatic thesaurus is usually collectiondep endent ie dep endent on the text database

which is used A few small and automatically constructed thesauri have b een used in exp er

imental IR systems but the eectiveness of these thesauri has not b een established for large

text databases Automatic thesauri are typically built based on coo ccurrence information

and relevance judgements are often used to estimate the probability that thesaurus terms are

similar to query terms or a particular query Since determining term or phrase relations is

hard to achieve automatically in automatic thesauri these relations are simplied to one typ e

of asso ciation relation

In this pap er an approach for the automatic construction of thesauri is presented along

with a program PhraseFinder that utilizes this approach to construct a collectiondep endent

thesaurus called an asso ciation thesaurus The asso ciation thesaurus is accessed through

INQUERY an information retrieval system based on inference networks using natural lan

guage queries Adding phrases retrieved in this way to the original queries pro duces signicant

p erformance improvements In the following sections previous work is reviewed the approach

and techniques for PhraseFinder are describ ed exp erimental results are presented and some

future research issues related to asso ciation thesauri are discussed

Thesauri and Information Retrieval

The use of thesauri in IR involves automatic construction user interface design retrieval mech

anisms and retrieval architecture Research related to automatic thesauri dates back to Spark

Joness work on automatic term classication Saltons work on automatic thesaurus

construction and query expansion and Van Rijsb ergens work on term coo ccurrence

In those exp eriments automatic term classication without relevance judgments or feed

back information did not pro duce any signicant improvements Minker Wilson and Zimmer

man in after an exhaustive investigation came to the same conclusion

Salton showed that using the Harris synonym thesaurus with relevance judgements

pro duces signicant improvements He prop osed two approaches termdo cument and term

prop erty matrices for automatic construction of thesauri based on relevance judgements In

Salton Yu and Buckley further develop ed these metho ds into a formal term dep endence

mo del But the serious drawback with the term dep endence mo del is that it assumes the

availability of relevance judgements The o ccurrence information of terms in relevant and

nonrelevant do cuments is used to estimate the probability of terms similar to queries When

full relevance judgments are not fully available the retrieval p erformance under the mo del

degraded badly Recently Crouch and Yang used Saltons approach to build a term

vsterm thesaurus whose classes are clustered in terms of discrimination values Although such

a thesaurus improved retrieval p erformance to determine the b est threshold value is dicult

b ecause such a value is determined assuming the availability of relevance judgements Yus

study on clustering for information retrieval showed that term dep endence information

improves the eectiveness of retrieval systems In the nonbinary indep endence mo del he used

hierarchical term relations

Rijsb ergen and Smeaton in used a statisticallyderived MST maximum spanning

tree as a term dep endence structure with relevance feedback to verify the following association

hypothesis

If an is good at discriminating relevant from nonrelevant documents

then any closely associated index term is also likely to be good at this

The exp eriments using MST resulted in no p ositive results

In the recent work by Yonggang Qiu and HP Frei a termvsterm similarity matrix

was constructed based on how the terms of the collection are indexed A probabilistic metho d

is used to estimate the probability of a term similar to a given query in the vector space mo del

Even when adding hundreds of terms into queries this approach showed that the similarity

thesaurus can improve p erformance signicantly Adding so many terms may not however b e

ecient for large information retrieval systems It should b e noted that their improved results

on the NPL collection is still lower than the baselines used in this pap er

In the CLARIT pro ject NLP techniques are used to identify candidate noun phrases in

fulltext and map them into candidate terms in a morphologicallynormalized form empha

sizing modier and head relations and the candidate terms are matched against a rstorder

thesaurus of certied domainsp ecic terminology which is a semiautomatically generated con

trol vo cabulary dictionary This indexing technique builds a mapping b etween terms within a

candidate noun phrase and terms in the rstorder thesaurus As an indexing technique the

metho d used in CLARIT was compared with human indexers on the basis of ten do cuments

Ruge uses the headmodier relation to investigate the eect of term asso ciations on the

p erformance of IR systems in the hyp erterm system REALIST But this typ e of asso ciation

might b e to o restrictive to capture some asso ciations for example for the query insider trad

ing scandal p erson and company names which were involved in and other phrases which often

coo ccur with insider trading scandal can b e very imp ortant asso ciated phrases

In summary some problems with previous work are

Relevance judgements are used to estimate the probability of a term related to another

term or query Because relevance judgements are often not available these approaches

are impractical for term classication or thesaurus construction Second even if available

relevance judgments are often pro duced for a set of queries which do not cover a whole

collection

Term selection for query expansion is based on individual terms but not on the whole

query This is consistent with the asso ciation hyp othesis but it ignores the additional

context that additional terms in the query provide A term may b e closely asso ciated

with one of the words in the query but not with the meaning of the entire query

Coo ccurrence data for terms is gathered from individual texts without limiting the range

of the coo ccurrence Previous research has fo cussed on short abstractlength do cuments

With fulltext databases where do cuments may b e many pages in length it is imp ortant

to consider text windows that restrict the coo ccurrences

All words verbs adjectives adverbs and nouns are treated equally ie an implicit

assumption is that that all words contribute equivalently to retrieval p erformance

The following section describ es an approach for automatic thesaurus construction that do es

not use relevance judgments or relevance feedback Instead of term clustering or classication

a retrieval system is used to measure the probability of thesaurus terms b eing asso ciated with

a particular query The topranked terms are used to automatically expand the query

PhraseFinder

PhraseFinder is a program that automatically constructs a thesaurus using text analysis and

text feature recognition It is designed to gather data ab out asso ciations b etween phrases and

terms over a large amount of text and views a text as a structured ob ject Text ob jects consist

of paragraphs which consist of a series of sentences which contains phrases and words Terms

are dened as all words except stop words Within a text terms are basic features and the

comp osite features are paragraphs sentences and phrases called characteristic features of

the text Noun phrases are used as the main characteristic features instead of terms b ecause

there is evidence that they play an imp ortant role in characterizing the content of a text

PhraseFinder considers coo ccurrences b etween phrases and terms as asso ciations As we

will see later phrases do not have to b e noun phrases They are dened by a set of rules In an

asso ciation thesaurus the asso ciation b etween phrases and terms is extracted and represented

as a text feature dep endence The following basic text features are used in PhraseFinder

Terms Any word except a is a term

Part of Sp eech Each word has a part of sp eech that was assigned by a partofsp eech

tagger So far the Church tagger has b een used to tag texts

In addition PhraseFinder recognizes phrases based on a set of simple phrase rules and the part

of sp eech of terms The following text comp osite features are used

Paragraphs A paragraph consists of a set of sentences It can b e a natural text

paragraph or a xed numb er of sentences

Sentences A sentence is a sequence of words whose end is recognized by the tagger

Phrases A phrase is a sequence of terms whose partofsp eech satisfy one of the sp ecied

phrase rules For example the simple noun phrase rule fNNN NN Ng means that a

phrase can b e triple nouns double nouns or single noun PhraseFinder do es not allow a

phrase to cross sentence b oundaries

PhraseFinder as its name implies identies the asso ciations b etween phrases and terms so that

asso ciated phrases can b e retrieved through natural language queries and used for expansion

Regular ie the Porter stemmer is used for terms Because this stemmer is quite

aggressive in terms of removing endings and do es not pro duce real word forms as output we

used a more conservative approach for stemming phrases

Asso ciation Generation and Paragraph Limits

Asso ciations represent the coo ccurrence b etween terms and phrases within texts Before the

generation of asso ciations is describ ed an imp ortant question is what range should b e used to

generate asso ciations Intuitively a natural paragraph seems appropriate b ecause it generally

fo cuses on describing an individual topic Because very long paragraphs would generate large

numb ers of asso ciations that would have less chance of b eing valid we restricted the maxi

mum numb er of sentences allowed within a paragraph known as the paragraph limit Later

exp erimental results show that sentences p er paragraph is sucient for full texts

The pro cedure for asso ciation generation is quite simple Sentence by sentence PhraseFinder

reads texts recognizes phrases and terms inserts phrases and terms into hash dictionaries

stores phrase and term identiers separately into tables along with their o ccurrence frequen

cies until the numb er of readin sentences reaches the paragraph limit or the end of a paragraph

is reached After all phrases and terms have b een obtained within a paragraph pairwise asso

ciations are generated b etween terms and phrases The asso ciation frequency which is equal

to term frequency times phrase frequency is also stored with each asso ciation An asso ciation

is a triple

ter mid phr aseid association f r eq uency

PhraseFinder go es through all texts in a collection and generates asso ciation data The asso ci

ation frequency for an asso ciation is summed over a whole collection

Statistics and Asso ciation Data Filtering

The amount of asso ciation data is very large Many asso ciations are however infrequent This

suggests that data ltering is necessary Some of the information that can b e used for ltering

includes asso ciation frequency b etween phrases and terms the numb er of asso ciated terms or

phrases for a phrase or term the total of asso ciation frequencies of asso ciated terms or phrases

for a phrase or term the do cument and collection frequencies of terms or phrases Some simple

statistics based on the asso ciation data for simple noun phrases are

Around of asso ciation data has the asso ciation frequency of over TIPSTER volume

Around of phrases o ccur just once in only one do cument

In addition a small numb er of phrases esp ecially single nouns in our exp eriments o ccur in

many do cuments These phrases are asso ciated with many terms Phrases such as people

company and state are to o general and not useful for identifying a topic

On the basis of these observations we do data ltering on asso ciation data and pro duce

an acceptable asso ciation thesaurus by discarding asso ciations with frequency and phrases

that are asso ciated with to o many terms The latter parameter is set exp erimentally for each

collection As it turns out data ltering not only reduces the amount of asso ciation data but

also enhances retrieval p erformance

Access to an Asso ciation Thesaurus

As mentioned early thesaurus access is handled separately from thesaurus construction The

saurus access is a pro cedure that measures the closeness of thesaurus items in the context of

a particular query In this pap er we implement PhraseFinder access through INQUERY an

information retrieval system based on the probabilistic inference network To do this an

asso ciation thesaurus is converted into a form suitable for INQUERY Sp ecically thesaurus

phrases are mapp ed into pseudoINQUERY do cuments where the representation of a pseudo

do cument consists of all the asso ciated terms The PhraseFinder version of INQUERY can then

accept natural language queries and output a ranked list of phrases rather than do cuments

asso ciated with the query This list is used for automatic query expansion Figure shows

an example query from the TIPSTER collection and the top phrases that are retrieved by

PhraseFinder The ranking values are not used as the weights of the asso ciated thesaurus items

Query Impact of the Immigration Law will report

specific consequence consequences of the USs Immigration Reform

and Control Act of

illegal immigration

illegals

undocumented aliens

amnesty program

immigration reform law

editorialpage article

naturalization service

civil fines

new immigration law

legal immigration

employer sanctions

simpsonmazzoli immigration reform

statutes

applicability

seeking amnesty

legal status

immigration act

undocumented workers

guest worker

sweeping immigration law

Figure Example PhraseFinder Output

Query Expansion Using An Asso ciation Thesaurus

The evaluation of a thesaurus is an op en issue There are some criteria to evaluate the quality

and category soundness of a manual thesaurus but these criteria do not predict whether the

manual thesaurus will improve retrieval eectiveness

Automatic thesauri have b een evaluated by doing query expansion and measuring retrieval

eectiveness In query expansion thesaurus items are retrieved for a particular query and are

added into the query To some extent query expansion shows whether or not a thesaurus

can provide useful phrases or terms for termbased thesauri for searchers To measure the

p erformance of PhraseFinder we conducted exp eriments using thesauri built from the NPL

and TIPSTER collections

NPL Asso ciations generated from do cuments containing titles andor abstracts

in the area of physics

TIPSAMP Asso ciations generated from a sample of the TIPSTER collection consisting of

ab out fulltext do cuments from AP ZIFF and WSJ It was pro duced

by taking every fth do cument out of TIPSTER volume It contained ab out

AP do cuments ZIFF articles WSJ articles Federal Register and DOE are

part of TIPSTER volume but were not used in this version of PhraseFinder b ecause

1

very few queries are directed at these do cuments

TIPFULL Generated from ab out full text TIPSTER do cuments containing ab out

AP news ZIFF articles and WSJ journal articles

Obviously these collections vary dramatically from text length to content All do cuments

are tagged and PhraseFinder takes tagged do cuments as input to pro duce asso ciation data It

takes two weeks to generate an asso ciation thesaurus for TIPFULL a week for TIPSAMP and

hours for NPL

In these exp eriments query expansion was done by using a natural language query or part of

the query in the case of TIPSTER to retrieve a ranked list of phrases from the PhraseFinder

version of INQUERY Given this ranked list the following metho ds were used to determine

which phrases to add to the original query

Only Duplicates

Only duplicate phrases from an asso ciation thesaurus are added into queries A phrase

is a duplicate for a given query if all terms simple stemming is used of the phrase

are subsets of the query Eg for query high frequency oscil lators using transistors

phrases transistor oscil lator and frequency oscil lator and terms transistors and

frequencies all are duplicate with the query The purp ose of adding duplicate phrases is

to see whether the evidence from the thesaurus identies the imp ortance of query terms

or phrases

Nonduplicates

Nonduplicate phrases are added into queries If any term of a phrase is not a subset of

a given query the phrase is nonduplicate with the query For the ab ove example query

phrases npn transistors and junction transistor and terms triode and anode

are nonduplicate with the query The purp ose of adding nonduplicate phrases is to see

whether a thesaurus can provide some new information ab out queries

Both Duplicates and Nonduplicates

Both duplicate and nonduplicate phrases are added into queries

Both weighting and nonweighting schemes are used for query expansion The following

conventions are used to indicate how queries are expanded

BOTH adding b oth duplicate and nonduplicate phrases

DUPONLY Only adding duplicate phrases

NODUP Adding nonduplicate phrases

The top ranked asso ciated phrases are used as candidates for query expansion In runs

that used duplicate phrases all duplicates in the top were used The numb er of nonduplicate

phrases and the weighting of duplicate and nonduplicate phrases were varied in a large numb er

of exp eriments The results rep orted here are summaries of those exp eriments and only a few

examples of these parameters are given

1

WSJ stands for Wall Street Journal articles AP for Asso ciated Press Newswire ZIFF for ZIFF

articles doe for short abstracts from Department of Energy

Exp eriments on Small Collections

The phrase rule in PhraseFinder do es not have to dene linguistic phrases In fact it denes

what typ e of concept asso ciates with terms Dierent phrase rules result in dierent typ es of

asso ciation thesauri pro duced by PhraseFinder Terms are classied into several classes each

of which are symb olized by individual letter tokens as follows in terms of part of sp eech

token meaning

N noun

J adjective

R adverb

V verb

G verbing

D verbed

I cardinal numb er

The purp ose of this classication is to investigate retrieval eectiveness on dierent typ es of

words for query expansion A phrase rule is a set of sequences of ab ove tokens PhraseFinder

pro duces dierent typ es of asso ciation thesauri for dierent phrase rules Exp eriments were

conducted to address the following issues

How do dierent typ es of words like verbs adjectives adverbs and nouns aect retrieval

p erformance This question may indirectly relate to the role of dierent typ es of words

in conveying the content of texts

What are the b est phrases for an information retrieval system like INQUERY

Comparison of a wordbased thesaurus and a nounphrasebased thesaurus

In following subsections two typ es of asso ciation thesauri wordbased and nounphrasebased

asso ciation thesauri are constructed on the NPL collection evaluated and compared

Wordbased Asso ciation Thesauri

A wordbased asso ciation thesaurus means that asso ciation data generated by PhraseFinder is

b etween dierent typ e of words and terms Most automatic thesauri are based on termvsterm

coo ccurrence called a termbased thesaurus here PhraseFinder is able to pro duce dierent

kinds of thesauri based on its phrase rule For example to pro duce a termbased asso ciation

thesaurus the phrase rule can b e dened as fN J R V G D Ig ie all terms are phrases

Words are categorized into three classes verbs adjectives and adverb and nouns Thesauri

for dierent classes of words are generated on the NPL and these thesauri are evaluated and

compared

Intuitively if all nouns are removed from a text it is generally hard to understand the text

If all verbs are removed from a text it is often p ossible to know what the text is ab out A word

based thesaurus can b e a way to test how words with dierent part of sp eech aects retrieval

p erformance and how go o d they are at representing the content of texts Obviously these

exp erimental results do not exclude the p ossibility that some words are particularly imp ortant

in conveying the content of some texts The termbased thesaurus which ignores part of sp eech

is also evaluated and compared with the others The brief summary of exp erimental results on

wordbased asso ciation thesauri is rep orted in Table in which only improvement p ercentages

over a common baseline are presented These results represent the b est p erformance over a

range of parameter settings such as weighting of added phrases and numb er of added phrases

Thesaurus Phrase Metho ds for Query Expansion

Typ e Rule DUPONLY NODUP BOTH

Verbbased fVGDg

AdjAdvbased fJ Rg

Nounbased f N g

Termbased fNJRVGDIg

Table Brief Summary of Results on WordBased Asso ciation Thesauri on NPL

When phrase rule is fV G Dg a verbbased asso ciation thesaurus is generated The

asso ciations b etween verbs and terms are generated for the verbbased thesaurus The results

show that verbs do improve retrieval p erformance a little but not signicantly Reweighting

verbs in queries do es not make a dierence Adding asso ciated verbs not already in queries

helps more than reweighting but improvement is not signicant either The result for

metho d BOTH results in no improvement

For the thesaurus based on adjectives and adverbs the phrase rule fJ Rg is used to gen

erate asso ciation data b etween adjectives or adverbs and terms The results show that this

thesaurus pro duces bigger improvements than the verbbased one That means that reweight

ing adjectives and adverbs within queries and adding new ones from this thesaurus are mutually

complementary

As exp ected the results show that the nounbased thesaurus p erforms b etter than either the

verbbased one or the one based on adjectives and adverbs over all Reweighting nouns within

queries and adding new nouns asso ciated with queries from this thesaurus are complementary

although it is interesting to see how much of the improvement is due to reweighting or simple

variations of existing query terms

Most existing automatic thesauri are termbased Comparing with the previous three typ es

of thesaurus the termbased thesaurus p erforms less eectively than either the nounbased one

or the one based on adjectives and adverbs in INQUERY

In summary all typ es of words are useful for improving retrieval p erformance but nouns

contribute the most adjectives and adverbs less and verbs the least The termbased thesaurus

which ignores part of sp eech is even less eective than the one based on adjectives and adverbs

NounPhrasebased Thesauri

In the following exp eriments three sets of phrase rules corresp onding to dierent phrase forms

are used to pro duce dierent nounphrasebased asso ciation thesauri Table presents the

exp erimental results for these thesauri

For the thesaurus based on the phrases containing only nouns whose phrase rule is fNNN

NN Ng the results in table show that and improvements are obtained

for three query expansion metho ds resp ectively These improvements are bigger than those on

other phrasebased thesauri

Adjectives can b e used to improve p erformance and many noun phrases are of form adjective

noun phrase For this reason a set of more general noun phrase rules fNNN JNN JJN

NN JN Ng would b e used to generate the asso ciation thesaurus The results in Table show

that under the general noun phrase rule adding duplicate phrases with weight pro duces

a improvement adding fteen nonduplicate phrases with weight pro duces a

improvement overall Combining two metho ds results in a improvement It is clear that

adjectives and nouns together do not work as well as nouns alone

Phrase Metho ds for Query Expansion

Rule DUPONLY NODUP BOTH

fNNNJNNJJNNNJNNg

fNNNNNNg

fNNNNNg

Table Summary of Results on PhraseBased Asso ciation Thesauri on NPL

The purp ose of removing single noun from phrase rules is to reduce the asso ciation data

b ecause many phrases are comp osed of single nouns It is exp ected that some imp ortant single

noun phrases would b e lost b ecause of this But how big would the loss b e The results

with phrase rule fNNN NNg in table in which duplicate phrases with weight and eight

nonduplicate phrases with are added show that the loss is signicant It is concluded that

single nouns play a very imp ortant role in queries and texts For those single nouns that are

very general and not useful for conveying content of texts simple data ltering techniques can

b e used to throw them away

For phrasebased thesauri the asso ciation thesaurus based on nounonly noun phrases pro

duces the b est improvement out of all three in INQUERY and the improvement is signicant

The addition of adjectives into noun phrases hurts rather than enhances retrieval p erformance

Individual nouns are very imp ortant noun phrases and help to improve retrieval p erformance

The exp eriments show that the nounphrasebased thesauri p erform much b etter than any typ e

of wordbased thesauri

Exp eriments on the TIPSTER collections

There are three TIPSTER databases built for INQUERY tip tip and tip Table illustrates

2

what is in each database

Database Name of do cs Database Content

tip wsj wsj wsj zi ap

do e fr

wsj wsj wsj zi ap

tip do e fr wsj wsj zi

ap fr

tip zia zib ap patn

sjm a sjm b

Table TIPSTER Database Contents

An original TIPSTER topic or query consists of a numb er of elds such as the descrip

tion a short naturallanguage description of the information need and the concepts a list of

words and phrases related to the query Only some of these elds are used in these exp eriments

Two sets of TIPSTER topics are used for exp eriments which are topics and

resp ectively notated as set and set Relevance judgments for the TIPSTER queries are from

the TREC conference For each set of topics baseline query sets are generated by automatic

pro cessing The query sets used were as follows

2

In this table fr stands for Federal Register patn for Patent data and sjm for the San Jose Mercury

News The other abbreviations have b een used previously

SET There are two sets of baseline queries setqry and INQ The rst is the

initial baseline query used in early TIPSTER exp eriments The second is the one that so

far pro duces the b est results in our TIPSTER exp eriments Both setqry and INQ

are constructed using all elds of the TIPSTER original queries

SET One baseline query set INQ is used INQ is structured using only the

description part of the original queries with the phrase op erators The purp ose of using

INQ is to simulate an online retrieval environment in which queries are relatively

short

All thesauri used in these exp eriments were generated using the phrase rule fNNN NN Ng

Exp eriments were conducted to investigate the following issues

What the eect of the paragraph limit is on the p erformance of an asso ciation thesaurus

and how big this eect would b e

Is it p ossible to use a representative sample of a collection to generate a asso ciation

thesaurus for a whole collection

Is it p ossible to apply an asso ciation thesaurus built for one collection to another that

overlaps in content

How well would an asso ciation thesaurus p erform in an online environment

Exp eriments on the Paragraph Limit

The way that PhraseFinder works a small paragraph limit can signicantly reduce the amount

of asso ciation data and allow to use fewer computer resources Exp eriments for the paragraph

limit are to nd an appropriate paragraph limit Since current available standard collections

like NPL and CACM are not full text and contain only titles and abstracts the paragraph

limit do es not make sense These exp eriments with dierent paragraph limits are conducted

on TIPSTER sample collection As a general observation if the paragraph limit is to o small

eg one sentence p er paragraph imp ortant asso ciations would b e lost If the paragraph limit is

to o large eg or or larger to o many asso ciations are generated A very large paragraph

limit means that asso ciations are generated within a natural paragraph

The concepts eld within the original TIPSTER topics were used to retrieve phrases from

TIPSAMP thesauri constructed with dierent paragraph limits ie and Five nondu

plicate phrases with weight and all duplicate phrases in the list of the top asso ciated

phrases with weight are added into the queries Table in which PLn means a paragraph

limit of n summarizes the results

There are some slight changes in p erformance when the paragraph limit varies from to

Using as paragraph limit is a little b etter than using and using b etter than This

may conrm the intuition that a natural paragraph would b e go o d However the dierences

are not signicant It seems that when a paragraph is limited to sentences the paragraph

limit has a little impact on the p erformance Even with the high baseline query ie INQ

a more than improvement is obtained on tip and tip and a improvement on tip The

improvement on tip was unexp ected since it is a dierent set of do cuments from a dierent

time p erio d than the collection used to build the asso ciation thesauri A related result app ears

in the next section

Baseline TIPSTER baseline query The Paragraph Limit

Queries Database precision PL PL PL

tip

setqry tip

tip

tip

INQ tip

tip

Table Summary of the Results on the Paragraph Limit

Using a Sample Collection

It is currently very inecient to generate an asso ciation thesaurus on all texts in a very large

collection The exp eriments conducted here test the dierence b etween a thesaurus built using

a representative sample of a collection and one built using the whole collection In the preceding

subsection the TIPSAMP thesaurus for the TIPSTER sample collection was evaluated Here

the TIPFULL thesaurus with paragraph limit is also used

Precision change queries

Recall setqry TIPSAMP TIPFULL

: :

: :

: :

: :

: :

: :

: :

: :

: :

: :

: :

average : :

Table Performance of the TIPSAMP and TIPFULL thesauri on tip

The results in table show that roughly the same improvements are obtained with b oth

thesauri Even though the thesaurus built using the whole collection certainly contains more

asso ciations b etween phrases and terms it is clear that in terms of retrieval p erformance the

asso ciation thesaurus for a representative sample of a collection is as go o d as the one for the

whole collection We have not yet determined however what the optimal sample size should

b e

In another exp eriment we obtained a average improvement in precision when the

queries expanded using the TIPFULL version of PhraseFinder were evaluated with the tip

database This leads to the conclusion that for two collections an asso ciation thesaurus built for

one collection can b e used for another if the content of the two collections overlaps signicantly

The Use of An Asso ciation Thesaurus in An Online Environment

In the preceding two subsections highlystructured queries are used in exp eriments These

queries use all information in the original TIPSTER topics In an online information retrieval

environment it is unrealistic to ask users to submit queries of that typ e In this section

the exp eriments are designed to simulate an online retrieval environment to test how well an

asso ciation thesaurus works Here the description elds of the TIPSTER queries are taken as

user query in an online environment These user queries are used as natural language queries

for the TIPFULL thesaurus and the retrieved asso ciated phrases are automatically added into

queries In the expanded queries the duplicate phrases are added with weight and the top

ten nonduplicates with weight Table shows the exp erimental results in which a

improvement is gained on the tip database

Precision change queries

Recall INQ DUPONLY NODUP BOTH

: : :

: : :

: : :

: : :

: : :

: : :

: : :

: : :

: : :

: : :

: : :

average : : :

Table Performance of descriptiononly queries on tip

These improvements on TIPSTER are larger than those obtained with NPL This suggests

that bigger collections may pro duce b etter asso ciation thesauri

Conclusions and Discussion

These exp eriments show that it is p ossible and feasible to construct useful collectiondep endent

asso ciation thesauri automatically without relevance judgments It seems that the larger the

collection the b etter the asso ciation thesaurus p erforms The approach to the construction is

realistic and general An asso ciation thesaurus can b e converted to b e suitable for the framework

of other retrieval systems This kind of thesaurus lo oks very promising for information retrieval

The approach emb o died in PhraseFinder is general and can b e extended to many retrieval

problems The asso ciation data generated by PhraseFinder is an information synthesis over

a given collection For example by dening the phrase rule as a company or p erson name

PhraseFinder would generate asso ciation data for company or p ersonal information systems

For a given query INQUERY would output a list of company or p erson names asso ciated with

the query

Many questions remain in using asso ciation thesauri to do query expansion In our exp eri

ments the same numb er of nonduplicate phrases are added into all queries Obviously this is

not necessarily optimal How many phrases or terms should b e added dep ends on the queries

and the asso ciated phrases For some queries many phrases can b e added whereas other may

degrade with only a few additional phrases It is still not clear how to determine this numb er

for a given query In exp eriments on the NPL collection a weighting scheme is used for query

expansion How weights should assigned to duplicate and nonduplicate phrases is exp erimen

tally determined It is not clear that it is p ossible to compute these weights automatically It

is in question whether weighting values for one set of queries can b e applied to another set of

queries It is still a puzzle why added phrases have to b e downweighted on small collections

Fortunately query expansion metho ds using an asso ciation thesaurus are straightforward on a

large collection like the TIPSTER even though they may not b e optimal

Although data ltering techniques for asso ciation data not only reduce the amount of as

so ciation data but also enhance the p erformance of thesauri these techniques are still ad ho c

and premature It would b e b etter to do data ltering using asso ciation strength based on

asso ciation frequencies and other statistical values A comprehensive data ltering technique

can b e exp ected to do b etter and to reduce asso ciation data further

Acknowledgments

This research was supp orted by the NSF Center for Intelligent Information Retrieval at the

University of Massachusetts Discussions ab out the eect of dierent words on representing the

content of texts with Bob Krovetz inspired the exp eriments on wordbased asso ciation thesauri

Dan Nachbar Steve Harding and John Broglio supp orted the PhraseFinder development and

exp eriments

References

James P Callan and W Bruce Croft An Evaluation of Query Processing Strategies Using

the TIPSTER Col lection SIGIR

Kenneth Church A Stochastic Parts Program and Noun Phrase Parser for Unrestricted

Text In Pro c of the nd Conf on Applied Natural Language Pro cessing

WBruce Croft Howard R Turtle and David D Lewis The Use of Phrases and Structured

Queries in Information Retrieval SIGIR

Carolyn J Crouch An Approach to the Automatic Construction of Global Thesauri IPM

Vol

Carolyn J Crouch and Bokyung Yang Experiments in Automatic Statistical Thesaurus

Construction SIGIR

David A Evans Kimb erly GintherWebster Mary Hart Rob ert G Leerts and Ira A

Monarch Automatic Indexing Using Selective NLP and FirstOrder Thesauri RIAO

Donna Harman Overview of the First Text REtrieval Conference SIGIR

R Krovetz Viewing Morphology as an Inference Process SIGIR

Jack Minker Gerald A Wilson and Barbara H Zimmerman An Evaluation of Query

Expansion by the Addition of Clustered Terms for a Document Retrieval System IPM

Vol

CJ Rijsb ergen DJ Harp er and MF Porter The Selection of Good Search Terms

IPM Vol

Gerda Ruge Experiments on Linguistical ly Based Term Associations RIAO

Porter M An Algorithm for Sux Stripping Program Vol

Yonggang Qiu HP Frei Concept Based Query Expansion SIGIR

G Salton Automatic Information Organization and Retrieval McGrawHill Bo ok Com

pany

G Salton Automatic Term Class Construction Using Relevance A Summary of Word in

Automatic Pseudoclassication IPM Vol

G Salton C Buckley and CT Yu An Evaluation of Term Dependence Models in Infor

mation Retrieval LNCS

Alan F Smeaton The Retrieval Eects of Query Expansion on a Feedback Document

Retrieval System TR Dept of Computer Science University College Dublin

K Spark Jones and R M Needham Automatic Term Classication and Retrieval IPM

Vol

K Spark Jones and DM Jackson The Use of Automatical lyObtained Keyword Classi

cations for Information Retrieval IPM Vol

K Spark Jones Automatic Keyword Classication for Information Retrieval Butterworth

Howard R Turtle Inference Networks for Document Retrieval PhD Thesis COINS Tech

nical Rep ort University of Massachusetts at Amherst

C T Yu C Buckley K Lam and G Salton A Generalized Term Dependence Model in

Information Retrieval Information Technology Research and Development Vol

CT Yu W Meng and S Park A Framework for Eective Retrieval ACM Tran on

Database Systems VolNo