Corpus-Specific Stemming using Word Form Co-occurrence

W. Bruce Croft and Jinxi Xu

Computer Science Department
University of Massachusetts, Amherst

Abstract

Stemming is used in many IR systems to reduce word forms to common roots. It is one of the simplest and most successful applications of natural language processing for IR. Current stemming algorithms, however, are either inflexible or difficult to adapt to the specific characteristics of a text corpus, except by the manual definition of exception lists. We propose a technique for using corpus-based word co-occurrence statistics to modify a stemmer. Experiments show that this technique is effective and is very suitable for query-based stemming.

Introduction

Stemming is a common form of language processing in most information retrieval systems. It is similar to the morphological processing used in natural language processing, but has somewhat different aims. In an information retrieval system, stemming is used to reduce different word forms to common roots and thereby improve the ability of the system to match query and document vocabulary. The variety in word forms comes from both inflectional and derivational morphology, and stemmers are usually designed to handle both, although in some systems stemming consists solely of handling simple plurals. Stemmers have also been used to group or conflate words that are synonyms, such as "children" and "childhood", rather than variant word forms, but this is not a typical function. Although stemming has been studied mainly for English, there is evidence that it is useful for a number of languages.

Stemming in English is usually done during document indexing by removing word endings or suffixes, using tables of common endings and heuristics about when it is appropriate to remove them. One of the best-known stemmers used in experimental IR systems is the Porter stemmer, which iteratively removes endings from a word until termination conditions are met. The Porter stemmer has a number of problems. It is difficult to understand and modify. It makes errors by being too aggressive in conflation (e.g., policy/police and execute/executive are conflated) and by missing others (e.g., European/Europe and matrices/matrix are not conflated). It also produces stems that are not words and are often difficult for an end user to interpret (e.g., "iteration" produces "iter" and "general" produces "gener"). Despite these problems, recall-precision evaluations of the Porter stemmer show that it gives consistent, if rather small, performance benefits across a range of collections, and that it is better than most other stemmers.

Krovetz developed a new approach to stemming based on machine-readable dictionaries and well-defined rules for inflectional and derivational morphology. This stemmer, now called KSTEM, addresses many of the problems with the Porter stemmer, but does not produce consistently better recall-precision performance. One of the reasons for this is that KSTEM is heavily dependent on the entries in the dictionary being used, and can be conservative in conflation. For example, because the words "stocks" and "bonds" are valid entries in a dictionary for general English, they are not conflated with "stock" and "bond", which are separate entries. If the database being searched is the Wall St. Journal, this can be a problem.

The work reported here is motivated by two ideas: corpus-specific stemming and query-based stemming. Corpus-specific stemming refers to the automatic modification of conflation classes (words that result in a common stem or root) to suit the characteristics of a given text corpus. This should produce more effective results and less obvious errors from the end user's point of view. The basic hypothesis is that word forms that should be conflated for a given corpus will co-occur in documents from that corpus. Based on that hypothesis, we use a co-occurrence measure similar to the expected mutual information measure (EMIM) to modify conflation classes generated by the Porter stemmer.

In query-based stemming, all decisions about word conflation are made when the query is formulated, rather than at document indexing time. This greatly increases the flexibility of stemming, and is compatible with corpus-specific stemming in that explicit conflation classes can be used to expand the query.

In the next two sections we present these ideas in more detail. We then discuss the specific corpora used in the experiments and give examples of the conflation classes that are generated and how they are modified, and finally present the results of retrieval tests that were done with the new stemming approach.

Corpus-Specific Stemming

General-purpose language tools have generally not been successful for IR. For example, using a general thesaurus for automatic query expansion does not improve the effectiveness of the system, and can indeed result in less effective retrieval. When the tool can be tuned to a given domain or text corpus, however, the results are usually much better.¹

From this point of view, stemming algorithms have been one of the more successful general techniques, in that they consistently give small effectiveness improvements. In most applications where an IR system includes a stemmer, exception lists are used to describe conflations that are of particular importance to the application but are not handled appropriately by the stemmer. For example, an exception list may be used to guarantee that "Japanese" and "Chinese" are conflated to "Japan" and "China" for an application containing export reports. Exception lists are constructed manually for each application. Using human judgement for these lists is expensive and can be inconsistent in quality.

Instead of the manual approach of exception lists, the conflations performed by the stemmer can be modified automatically using corpus-based statistics. To do this, we assume that word forms that should be conflated will occur in the same documents or, more specifically, in the same text windows. For example, articles from the Wall St. Journal that discuss stock prices will typically contain both the words "stock" and "stocks". This technique should identify words that should be conflated but are not ("stock" and "stocks" are an example for KSTEM) and words that should not be conflated but are. Examples of the latter are the word pairs policy/police and addition/additive for the Porter stemmer.

The basic measure that is used to measure the significance of word form co-occurrence is a variation of EMIM. This measure is defined for a pair of words a and b by the following formula:

    em(a, b) = n_ab / (n_a + n_b)

where n_a and n_b are the number of occurrences of a and b in the corpus, and n_ab is the number of times both a and b fall in a text window of size win in the corpus. We define n_ab as the number of elements in the set {(a_i, b_j) | dist(a_i, b_j) <= win}, where the a_i's and b_j's are distinct occurrences of a and b in the corpus, and dist(a_i, b_j) is the distance between a_i and b_j, measured using a word count within each document.
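To make this concrete, the following is a minimal sketch of how n_a, n_b, and n_ab could be collected from a tokenized corpus and combined into em scores. The tokenization, the default window size, and the interface below are illustrative assumptions rather than the implementation used in the experiments.

```python
from collections import Counter

def em_scores(documents, pairs, win=100):
    """Estimate em(a, b) = n_ab / (n_a + n_b) for the given word pairs.

    documents: list of token lists, one list per document
    pairs: iterable of (a, b) word pairs to score (e.g. words sharing a Porter stem)
    win: co-occurrence window size, measured as a word-count distance within a document
    """
    pairs = {tuple(sorted(p)) for p in pairs}
    n = Counter()        # n_a: occurrences of each word in the corpus
    n_ab = Counter()     # n_ab: within-window co-occurrences of each pair

    for tokens in documents:
        positions = {}
        for i, w in enumerate(tokens):
            positions.setdefault(w, []).append(i)
            n[w] += 1
        for a, b in pairs:
            # count occurrence pairs (a_i, b_j) whose distance is at most win
            for i in positions.get(a, []):
                for j in positions.get(b, []):
                    if abs(i - j) <= win:
                        n_ab[(a, b)] += 1

    return {(a, b): n_ab[(a, b)] / (n[a] + n[b])
            for (a, b) in pairs if n[a] + n[b] > 0}
```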

Given this measure, the question is which word pair statistics should be measured. In previous studies, the EMIM measure has been applied to all word pairs that co-occur in text windows. The aim of that type of study was to discover phrasal and thesaurus relationships. In this study we have a different aim, namely to clarify the relationship between words that have similar morphology. For this reason, the em measure is calculated only for word pairs that potentially could be conflated. The way we have chosen to do this is to use an aggressive stemmer (Porter) to identify words that may be conflated, and then use the corpus statistics to refine that conflation.

¹Jing and Croft discuss a corpus-based technique for query expansion that produces significant effectiveness improvements.

A problem with this approach is that if the aggressive stemmer is not aggressive enough, word pairs that should be conflated will be missed. There are a number of ways that this could be addressed, such as identifying word pairs with significant overlap. In our work, we have combined the Porter stemmer with KSTEM to identify possible conflations. Even though KSTEM is not as aggressive as Porter, it does conflate some words that Porter does not. For example, the Porter stemmer conflates "abdomen" and "abdomens", but not "abdominal"; KSTEM conflates all of these.

More generally, we can view stemming as constructing equivalence classes of words. For example, the Porter stemmer conflates "bonds", "bonding", and "bonded" to "bond", so these words form an equivalence class. The corpus statistics for word pairs in the equivalence classes are used to determine the final classes. For example, if "bonding" and "bonded" do not co-occur significantly in a particular corpus, one or both of them may be removed from the equivalence (or conflation) class, depending on their relationship to the other words.

More specifically, if a and b are stemmed to c, then all occurrences of a, b, and c are the same after the stemming transformation; i.e., a, b, and c form an equivalence class. If, however, a is stemmed to b and b is stemmed to c, then a, b, and c do not form an equivalence class. The Porter stemmer occasionally makes such incomplete conflations. We consider this a bug of the Porter stemmer and put a, b, and c in an equivalence class.

Suppose the collection has a vocabulary V = {w_1, w_2, ..., w_n}. We use the union-find algorithm to construct the equivalence classes as follows:

1. For each word w_i, use the Porter stemmer to stem it to r_i. Let R = {r_i}.

2. For each element in V ∪ R, form a singleton class.

3. For each pair (w_i, r_i), if w_i and r_i are not in the same equivalence class, merge the two equivalence classes into one.

4. In each equivalence class, remove those elements not in V.

The union-find algorithm runs in O(n log* n), almost linear time, because log* n is a small number even if n is very large.
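This construction can be sketched in a few lines of Python. The sketch below uses a standard union-find structure with path compression; the `stem` argument stands in for the Porter stemmer (and, where used, the KSTEM augmentation) and is an assumption about the interface rather than the original code.

```python
def build_equivalence_classes(vocabulary, stem):
    """Group vocabulary words that share a stem, using union-find.

    vocabulary: iterable of corpus words (V)
    stem: function mapping a word to its stem (standing in for the Porter stemmer)
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path compression
            x = parent[x]
        return x

    def union(x, y):
        root_x, root_y = find(x), find(y)
        if root_x != root_y:
            parent[root_y] = root_x

    vocabulary = set(vocabulary)
    # Steps 1-3: start from singleton classes over V and the stems R,
    # and merge along every (w_i, r_i) pair.
    for w in vocabulary:
        union(w, stem(w))

    # Step 4: collect the classes, keeping only elements that are actual words in V.
    classes = {}
    for w in vocabulary:
        classes.setdefault(find(w), set()).add(w)
    return [sorted(c) for c in classes.values()]
```

The root returned by find is only an internal identifier; the representative actually used for each class is chosen afterwards, as described next.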

Given the final equivalence classes, a representative (or stem) for each class must be generated. We chose simply to use the shortest word in the class. As well as being simple, this has the desirable result of producing complete words instead of the usual type of Porter stem.

The other issue is what to do with word pairs for which there are insufficient statistics. If the words in a conflation class are rare in the corpus, the em measure will be unreliable. For these pairs, we chose to use KSTEM to determine whether they should remain in an equivalence class. The threshold used for sufficient statistics is a minimum value of n_a + n_b.

To summarize, the overall process for producing corpus-specific conflation classes consists of the following steps:

1. Collect the unique words (the vocabulary V) in the corpus. This is done using a simple flex scanner. Numbers, stop words, and possible proper nouns are discarded.

2. Construct equivalence classes using the Porter stemmer, sometimes augmented by KSTEM.

3. Calculate em for every pair of words in the same equivalence class.

4. Form new equivalence classes. This is done by starting with every word in V forming a singleton equivalence class. Then, for every em pair, if em(a, b) > min and the words are not in the same class, merge the equivalence classes (a sketch of this step is given after the list). If the statistics are inadequate, use KSTEM to decide whether to merge classes.

5. Make a stem dictionary from the equivalence classes for future use in indexing and query processing. The shortest word in each class is the stem for that class.
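A rough sketch of the class-forming step (step 4) is given below, operating within each Porter class and falling back on KSTEM when the counts are too small. The threshold values, the `kstem_conflates` predicate, and the dictionary-based interfaces are illustrative assumptions, not the settings used in the experiments.

```python
from itertools import combinations

def refine_classes(porter_classes, em, counts,
                   min_em=0.01, min_count=10, kstem_conflates=None):
    """Split each Porter class into connected components induced by
    'significant' word pairs; the shortest word in each component is its stem.

    porter_classes: list of word sets produced by the Porter-based grouping
    em: dict mapping a sorted word pair (a, b) to its em score
    counts: dict mapping a word to its corpus frequency n_w
    min_em, min_count: illustrative thresholds (assumed values)
    kstem_conflates: optional predicate consulted when n_a + n_b is too small
    """
    refined = {}
    for cls in porter_classes:
        words = sorted(cls)
        neighbours = {w: set() for w in words}
        for a, b in combinations(words, 2):
            if counts.get(a, 0) + counts.get(b, 0) < min_count:
                # insufficient statistics: defer to KSTEM's decision, if available
                keep = bool(kstem_conflates and kstem_conflates(a, b))
            else:
                keep = em.get((a, b), 0.0) > min_em
            if keep:
                neighbours[a].add(b)
                neighbours[b].add(a)
        # each connected component of this small graph becomes a new class
        seen = set()
        for w in words:
            if w in seen:
                continue
            component, stack = set(), [w]
            while stack:
                x = stack.pop()
                if x in component:
                    continue
                component.add(x)
                stack.extend(neighbours[x] - component)
            seen |= component
            refined[min(component, key=len)] = sorted(component)
    return refined
```

The resulting dictionary maps each chosen stem (the shortest word) to its refined class, which can then be written out as the stem dictionary of step 5.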

Timing figures and class statistics for sample corpora are presented in a later section.

Query-Based Stemming

The corpus-based stemming approach described in the last section produces a dictionary of words with the appropriate stem. Given this dictionary, the usual process of stemming during indexing can be replaced by dictionary lookup. Alternatively, the full word form could be used for indexing, and stemming would become part of query processing. The way this would work is that when a query is entered, the equivalence class for each non-stopword would be used to generate an expanded query. For example, if the original query in the INQUERY query language was

    SUM( stock prices for IBM )

the expanded query for a particular corpus could be

    SUM( SYN( stock stocks ) SYN( price prices ) IBM )

The SYN operator is used to group synonyms. Depending on the details of how this is done in the underlying system, this query will produce the same result as a query processed in an environment where the database had been indexed by stems.
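As an illustration, query-time expansion against the stem dictionary could look roughly like the sketch below. The dictionary format, the stopword handling, and the plain-text rendering of the SUM and SYN operators are assumptions made for the example.

```python
def expand_query(query, conflation_classes, stopwords=frozenset()):
    """Expand a flat keyword query into a SUM of SYN groups.

    conflation_classes: dict mapping a word form to the list of word forms
                        in its corpus-specific conflation class
    """
    parts = []
    for word in query.split():
        if word.lower() in stopwords:
            continue
        forms = sorted(set(conflation_classes.get(word, [word])))
        if len(forms) > 1:
            parts.append("SYN( " + " ".join(forms) + " )")
        else:
            parts.append(word)
    return "SUM( " + " ".join(parts) + " )"


# Hypothetical conflation classes for a WSJ-like corpus:
classes = {
    "stock": ["stock", "stocks"], "stocks": ["stock", "stocks"],
    "price": ["price", "prices"], "prices": ["price", "prices"],
}
print(expand_query("stock prices for IBM", classes, stopwords={"for"}))
# prints: SUM( SYN( stock stocks ) SYN( price prices ) IBM )
```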

The advantages of query-based stemming are that the user of the system can be consulted as to the applicability of particular word forms, and queries can be restricted to search for a specific word form. These advantages can be significant in cases where small differences in word forms result in large differences in the relevance of the retrieved documents. For example, in looking for articles about terrorist incidents, the word "assassination" is very good at discriminating relevant from non-relevant documents, but the word "assassinations" is much less useful.

The main disadvantage of query-based stemming is that the queries become longer and will therefore take longer to process. The impact on response times will depend on the degree of query expansion. In the next section we present statistics for some corpora.

Corpora and Conflation Classes

The corpora that we use in these experiments are the West collection of law documents and the Wall St. Journal (WSJ) collection of newspaper articles. The statistics for these corpora and the associated queries and relevance judgements are shown in Table 1. Two sets of queries are used for the West collection. In the first, the queries are treated as a collection of individual words. The second uses INQUERY operators such as PHRASE to structure the combinations of words. In previous work, stemming on phrasal units has been shown to produce different results than word-only stemming. Retrieval results for both types of query are presented in the next section.

Table 1: Statistics on the text corpora (WEST and WSJ): number of queries, number of documents, mean words per query, mean words per document, mean relevant documents per query, and number of words in each collection.

As an example of the timing for generating conflation classes, the following figures are for the WSJ corpus. All timings are CPU times on a SUN workstation. Collecting the unique words in the corpus takes minutes; stemming these words with the Porter stemmer and forming the equivalence classes with the union-find algorithm each take only seconds. Generating the em values for a text window of win words is the most expensive step, taking over an hour. Applying the em threshold min splits the Porter classes into a larger number of smaller conflation classes. Using these classes as the basis for stemming produces the best retrieval results (shown in the next section) and avoids many of the conflations made by Porter, which means that query expansion is reduced in a query-based stemming environment.

The West collection was processed in the same way: the unique words were grouped into classes using the Porter stemmer, and these classes were then refined by applying the em threshold.

As an example, with these classes "bonds" is conflated to "bond", and "bonding" is conflated to "bonded", in the WSJ corpus. In the West corpus, all of these words are conflated to "bond".

Figure 1 contains examples of the conflation classes for Porter on the WSJ corpus, and Figure 2 has the classes after application of the em threshold.

abandon abandoned abandoning abandonment abandonments abandons

abate abated abatement abatements abates abating

abrasion abrasions abrasive abrasively abrasiveness abrasives

absorb absorbable absorbables absorbed absorbencies absorbency absorbent

absorbents absorber absorbers absorbing absorbs

abusable abuse abused abuser abusers abuses abusing abusive abusively

access accessed accessibility accessible accessing accession

Figure 1: Example of conflation classes on WSJ using Porter

The em value depends on the window size: the larger the window, the higher the em values that are generated. When the window size is fixed, the em threshold controls how many conflations by Porter are prevented. By experimenting with different window sizes and threshold values, we found that, as long as a reasonably sized window is used, performance depends only on the percentage of conflations that are prevented.

The dominant overhead of our method is the time to collect co-occurrence data, which is proportional to the window size. Since performance does not directly depend on window size, a moderately sized word window represents a good compromise between performance and computational overhead.

abandonment abandonments

abated abatements abatement

abrasive abrasives

absorbable absorbables

absorbencies absorbency absorbent

absorber absorbers

abuse abusing abuses abusive abusers abuser abused

accessibility accessible

Figure 2: Example of conflation classes on WSJ after co-occurrence thresholding

Retrieval Results

The following tables give standard recall-precision results for retrieval experiments carried out using the Porter, KSTEM, and em-modified Porter (NEW) stemmers. Table 2 shows the results of the simple word-based queries on the West corpus. The results show little difference between the stemmers, with perhaps a small advantage at higher recall levels for the NEW stemmer. Table 3 gives the results for the phrase-based queries for West. These results give an advantage to both the Porter and NEW stemmers.

Table 4 gives the results for the WSJ collection. We see again a clear advantage for the Porter and NEW stemmers. Overall, the NEW stemmer has very consistent performance, and may be able to combine the advantages of both the Porter and KSTEM approaches.

The final experiment, shown in Table 5, uses KSTEM to decide whether a word pair is conflated when there are insufficient statistics. Comparing this table to the previous one, we see there is little difference. Words that do not occur frequently enough to generate reliable em values are unlikely to affect retrieval on an average basis. For individual queries, however, this modification could be very important.

Conclusions

A new approach to stemming that uses corpus-based statistics was proposed. This approach can potentially avoid making conflations that are not appropriate for a given corpus, and uses an aggressive stemmer as a starting point. The result of this stemmer is an actual word, rather than an incomplete stem as is often the case with the Porter approach. It can also be implemented efficiently and is suitable for query-based stemming. The experimental results show that the new stemmer gives more consistent performance improvements than either the Porter or KSTEM approaches.

Acknowledgements

This work was supported by the NSF Center for Intelligent Information Retrieval at the University of Massachusetts. Bob Krovetz helped with the organization of the experiments.

References

K. Church and P. Hanks. Word association norms, mutual information, and lexicography. In Proceedings of the ACL Annual Meeting.

D. Harman. Overview of the first TREC conference. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval.

Y. Jing and W.B. Croft. An association thesaurus for information retrieval. In Proceedings of RIAO, to appear.

Robert Krovetz. Viewing morphology as an inference process. In Proceedings of the ACM SIGIR International Conference on Research and Development in Information Retrieval.

M. Porter. An algorithm for suffix stripping. Program.

E. Riloff and W. Lehnert. Information extraction as a basis for high-precision text classification. ACM Transactions on Information Systems.

Howard Turtle. Natural language vs. Boolean query evaluation: A comparison of retrieval performance. In Proceedings of ACM SIGIR.

C.J. van Rijsbergen. Information Retrieval. Butterworths, second edition.

E. Voorhees. Query expansion using lexical-semantic relations. In Proceedings of the ACM SIGIR Conference.

Table 2: Retrieval experiments on the West corpus (recall-precision by recall level and average, for the KSTEM, Porter, and NEW stemmers).

Table 3: Retrieval experiments using West structured queries (recall-precision by recall level and average, for the KSTEM, Porter, and NEW stemmers).

Table 4: Retrieval experiments using the WSJ corpus (recall-precision by recall level and average, for the KSTEM, Porter, and NEW stemmers).

Table 5: Retrieval experiments using KSTEM for pairs with insufficient statistics, on the WSJ collection (recall-precision by recall level and average, for the KSTEM and NEW stemmers).