Corpus-Speci C Stemming Using Word Form Co-Occurrence
Total Page:16
File Type:pdf, Size:1020Kb
CorpusSp ecic Stemming using Word Form Coo ccurrence W Bruce Croft and Jinxi Xu Computer Science Department University of Massachusetts Amherst Abstract Stemming is used in many information retrieval IR systems to reduce word forms to common ro ots It is one of the simplest and most successful applications of natural language pro cessing for IR Current stemming algorithms are however either inexible or dicult to adapt to the sp ecic characteristics of a text corpus except by the manual denition of exception lists We prop ose a technique for using corpusbased word coo ccurrence statistics to mo dify a stemmer Exp eriments show that this technique is eective and is very suitable for querybased stemming Introduction Stemming is a common form of language pro cessing in most information retrieval systems It is similar to the morphological pro cessing used in natural language pro cessing but has somewhat dierent aims In an information retrieval system stemming is used to reduce dierent word forms to common ro ots and thereby im prove the ability of the system to match query and do cument vocabulary The variety in word forms comes from b oth inectional and derivational morphology and stem mers are usually designed to handle b oth although in some systems stemming consists solely of handling simple plurals Stemmers have also b een used to group or conate words that are synonyms such as children and childhoo d rather than variant word forms but this is not a typical function Although stemming has b een studied mainly for English there is evidence that it is useful for a number of languages Stemming in English is usually done during do cument indexing by removing word endings or suxes using tables of common endings and heuristics ab out when it is appropriate to remove them One of the b estknown stemmers used in exp erimental IR systems is the Porter stemmer which iteratively removes endings from a word until termination conditions are met The Porter stemmer has a number of problems It is dicult to understand and mo dify It makes errors by b eing to o aggressive in conation eg p olicyp o li ce executeexecutive are conated and by missing others eg Europ eanEurop e matricesmatrix are not conated It also pro duces stems that are not words and are often dicult for an end user to interpret eg iteration pro duces iter and general pro duces gener Despite these problems recallprecision evaluations of the Porter stemmer show that it gives consistent if rather small p erformance b enets across a range of collections and that it is b etter than most other stemmers Krovetz developed a new approach to stemming based on machinereadable dictionaries and welldened rules for inectional and derivational morphology This stemmer now called KSTEM addresses many of the problems with the Porter stem mer but do es not pro duce consistently b etter recallprecision p erformance One of the reasons for this is that KSTEM is heavily dep endent on the entries in the dic tionary b eing used and can b e conservative in conation For example b ecause the words sto cks and b onds are valid entries in a dictionary for general English they are not conated with sto ck and b ond which are separate entries If the database b eing searched is the Wall St Journal this can b e a problem The work rep orted here is motivated by two ideas corpussp ecic stemming and querybased stemming Corpussp ecic stemming refers to automatic mo dication of conation classes words that result in a common stem or ro ot to suit the charac teristics of a given text corpus This should pro duce more eective results and less obvious errors from the end users p oint of view The basic hypothesis is that word forms that should b e conated for a given corpus will coo ccur in do cuments from that corpus Based on that hypothesis we use a coo ccurrence measure similar to the exp ected mutual information measure EMIM to mo dify conation classes generated by the Porter stemmer In querybased stemming all decisions ab out word conation are made when the query is formulated rather than at do cument indexing time This greatly increases the exibility of stemming and is compatible with corpussp ecic stemming in that explicit conation classes can b e used to expand the query In the next two sections we present these ideas in more detail In section we discuss the sp ecic corp ora we used in the exp eriments and give examples of the conation classes that are generated and how they are mo died Section gives the results of retrieval tests that were done with the new stemming approach CorpusSp ecic Stemming Generalpurp ose language to ols have generally not b een successful for IR For exam ple using a general thesaurus for automatic query expansion do es not improve the eectiveness of the system and can indeed result in less eective retrieval eg When the to ol can b e tuned to a given domain or text corpus however the results 1 are usually much b etter From this p oint of view stemming algorithms have b een one of the more successful general techniques in that they consistently give small eectiveness improvements In most applications where an IR system includes a stemmer exception lists are used to describ e conations that are of particular imp ortance to the application but are not handled appropriately by the stemmer For example an exception list may b e used to guarantee that Japanese and Chinese are conated to Japan and China for an application containing exp ort rep orts Exception lists are constructed manually for each application Using human judgement for these lists is exp ensive and can b e inconsistent in quality Instead of the manual approach of exception lists the conations p erformed by the stemmer can b e mo died automatically using corpusbased statistics To do this we assume that word forms that should b e conated will o ccur in the same do cuments or more sp ecically in the same text windows For example articles from the Wall St Journal that discuss sto ck prices will typically contain b oth the words sto ck and sto cks This technique should identify words that should b e conated but are not sto ck and sto cks are an example for KSTEM and words that should not b e conated but are Examples of the latter are the word pairs p olicyp o li ce and addition addi tive for the Porter stemmer The basic measure that is used to measure the signicance of word form co o ccurrence is a variation of EMIM This measure is dened for a pair of words a and b by the following formula n ab ema b n n a b where n n are the number of o ccurrences of a and b in the corpus and n is the a b ab number of times b oth a and b fall in a text window of size win in the corpus We dene n as the number of elements in the set f a b jdista b w ing where ab i j i j a s and b s are distinct o ccurrences of a and b in the corpus and dista b is the i j i j distance b etween a and b measured using a word count within each do cument i j Given this measure the question is which word pair statistics should b e measured In previous studies the EMIM measure has b een applied to all word pairs that co o ccur in text windows The aim of this type of study was to discover phrasal and thesuarus relationships In this study we have a dierent aim namely to clarify the relationship b etween words that have similar morphology For this reason the em measure is calculated only for word pairs that p otentially could b e conated The way we have chosen to do this is to use an aggressive stemmer Porter to identify words that may b e conated and then use the corpus statistics to rene that conation 1 Jing and Croft discuss a corpusbased technique for query expansion that pro duces signicant eectiveness improvements A problem with this approach is that if the aggressive stemmer is not aggressive enough word pairs that should b e conated will b e missed There are a number of ways that this could b e addressed such as identifying word pairs with signicant trigram overlap In our work we have combined the Porter stemmer with KSTEM to identify p ossible conations Even though KSTEM is not as aggressive as Porter it do es conate some words that Porter do es not For example the Porter stemmer conates ab domen and ab domens but not ab dominal KSTEM conates all of these More generally we can view stemming as constructing equivalence classes of words For example the Porter stemmer conates b onds b onding and b onded to b ond so these words form an equivalence class The corpus statistics for word pairs in the equivalence classes are used to determine the nal classes For example if b onding and b onded do not coo ccur signicantly in a particular corpus one or b oth of them may b e removed from the equivalence or conation class dep ending on their relationship to the other words More sp ecically if a and b are stemmed to c then all o ccurrences of a b and c are the same after the stemming transformation ie a b and c form an equivalence class If however a is stemmed to b and b is stemmed to c then a b and c do not form an equivalence class The Porter stemmer o ccasionally makes such incomplete conations We consider this a bug of the Porter stemmer and put a b and c in an equivalence class Supp ose the collection has a vocabulary V fw w w g We use the union 1 2 n nd algorithm to construct the equivalence classes as follows For each word w use the Porter stemmer to stem it to r Let R fr g i i i S For each element in V R form a singleton class For each pair w r if w and r are not in the same equivalence class i i i i merge the two equivalence classes