A Suffix Subsumption-based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources

Vlado Kešelj, Dalhousie University; Danko Šipka, Arizona State University

Abstract: We present a general suffix-based method for the construction of stemmers and lemmatizers for highly inflectional languages with only sparse resources. The process is directly implementable with the described efficient design, and it is evaluated on the construction of a stemmer for the Serbian language. The evaluation on real data has shown an accuracy of 79%.

1 Introduction

Two important tasks at the low level of Natural Language Processing (NLP) are stemming and lemmatization. Stemming is well known in the NLP, IR (Information Retrieval), and Text Mining research areas as an essential preprocessing step for tasks such as text and document retrieval, classification, information extraction, and other content-related applications. Descriptively speaking, stemming is a word transformation in which a word may be stripped of some suffixes without losing its core semantic content. Very frequent words are usually removed as stop-words in an IR system, and they are not subject to stemming. We could think of stemming as a process of normalization in which several morphological variants of a word are mapped into the same form. An elaborate discussion of stemming and its application to IR is given in [7].

Stemming brings two important benefits to an IR system: (1) better IR recall can be achieved, since query words are matched with their variants in the documents; and (2) stemming decreases the size of the overall term vocabulary, which leads to significant efficiency benefits in speed and memory requirements, due to the decreased size of the term index and dimensionality of term vectors. Namely, in the vector-space model of IR, the documents are represented as vectors of weights, where each weight corresponds to a term in a vocabulary. Removing stop-words and very rare words from the vocabulary is the first step in dimensionality reduction; on top of this, the further reduction by stemming is estimated to be about one third [6]. Significant benefits in retrieval performance are sometimes disputed ([4] §3.4), at least for English, but for highly inflectional languages stemming or some equivalent preprocessing is essential [5]. Stemming can also be used as a preprocessing step in various other tasks.

Is suffix stripping sufficient? Beside suffix removal, one could be tempted to use prefix removal as well, but prefixes usually change the meaning radically, and it is preferred that they are left intact [7]. Stemming has been a mainly suffix-based transformation since the publication of the Porter stemmer [6], and it has been successfully applied to several other languages of the Indo-European family; e.g., stemmers for 16 languages are implemented in the Snowball framework [7]. However, one should not generalize this suffix-oriented methodology to all languages; for example, Arabic relies on prefixes, suffixes, and infixes in morphological transformations, such as using prefixes to indicate the person feature in verbs. The languages of the Bantu group use prefixes to form plurals. For some languages, such as Chinese, this question is of no relevance at all. Irregular inflections are not well handled by suffix-based transformations

INFOTHECA – Journal of Informatics and Librarianship, № 1-2, vol. IX, May 2008

and they should be handled as exception word lists, one of which is the stop-word list.

The concepts of a word stem and a word root are related but distinct: a stem is a product of the stemming process, which conflates all semantically close words, while a root is an "inner" word from which the initial word derives; i.e., it has an etymological meaning [7]. Finding a root frequently requires removal of prefixes as well, while they are not removed in stemming. Another computational problem related to stemming is morphological analysis, which aims at breaking words into the smallest parts that maintain a unit meaning related to the meaning of the initial word [3].

Lemmatization. Similarly to stemming, lemmatization is a morphological transformation that changes a word into a normalized form. However, while the purpose of stemming is to conflate related morphological variations into one unifying form and to separate unrelated forms, a lemmatizer returns the corresponding lemma, which is the normalized word form as it would appear in the dictionary.

In the rest of the paper we will first discuss related work in section 2, and then formally introduce our approach and methodology in section 3. In section 4 we describe the resource that we used as the starting point. In section 5 we describe the experiments and discuss the results, and in section 6 we conclude with a summary of the results and the main contributions, and propose tasks for future work.

The resources and programs used in the paper are made publicly available and can be found at http://www.cs.dal.ca/˜vlado/nlp/2007-sr.

2 Related Work

Likely the best-known and most widely used stemmer is the Porter stemmer for English [6]. The Lovins stemmer was another known stemmer, created about the same time (a bit earlier) as the Porter stemmer. The original Porter stemmer was implemented in BCPL, a programming language that was a predecessor of C and is not used so much these days. The stemmer has been re-implemented in many different languages, but the reader should be aware that many of these re-implementations do not implement the stemmer exactly as it was specified.1 Both the Porter and the Lovins stemmers are examples of algorithmic stemmers. There are two general approaches to stemming: dictionary-based and algorithmic. We discuss them in more detail in the next section.

Since the appearance of the Porter stemmer, a number of stemmers have been implemented. For example, the Snowball framework [7] at the moment includes stemmers for 16 languages. Russian is the only Balto-Slavonic language currently implemented in the Snowball framework. Some other implementations are publicly available. A notable site is CPAN2, which hosts several stemmers, including a wrapper module for Snowball. For the majority of languages there are no publicly available stemmers, especially for languages with sparse electronic linguistic resources. To paraphrase [7], while there is a large amount of publications discussing stemming, there are only a few descriptions that can be readily implemented in popular efficient programming languages, such as C, Perl, Java, or similar; and there is a relatively small number of publications giving quantitative analyses and evaluations of stemmer performance.

The theoretical basis of our methodology is related to the finite-state methodology described in [1] and [8].

Regarding the Serbian language, our search for a wider set of stemmers for any of the Slavonic languages of former Yugoslavia produced only a few results. Two stemmers could be found: a stemmer for Slovene is described in [5] and is evaluated on an IR task, but we could not locate any available implementation. The "three new stemmers for Slovene" were mentioned at the web site of the INCO-Copernicus project3. There was a discussion on the Snowball list about including a Slovene stemmer into the framework4. There is publicly available Perl code for a Croatian stemmer5. It includes very limited documentation (several code-revision comments), and only the author's user id 'dpavlin'. It seems to be a short and well-written stemmer, but it is not clear what its coverage is. It could be a toy stemmer designed only for the 143 words included in the test data. Other related projects on morphological analysis that seem to have implemented lemmatizers, but not stemmers, are [10] for Serbian and [9] for Croatian.

Contributions. The three main contributions of this paper are: (1) developing and making publicly available an implemented stemmer for Serbian and the associated resources, (2) providing a quantitative analysis of the stemmer and of the various steps in the process of its development, and (3) proposing and testing a general approach to building stemmers and lemmatizers for highly inflectional languages with sparse resources. We find that the method that we used provides some interesting insights into the algorithms and data structures needed for an efficient implementation of such stemmers.

3 Background

Algorithmic and Dictionary Stemmers. There are two approaches to building stemmers:
1. the dictionary-based approach, and
2. the algorithmic approach.
In the dictionary approach, we rely on the extensive linguistic knowledge collected in a machine-readable dictionary, while in the algorithmic approach we use a relatively small set of rules. The algorithmic approach is generally more efficient and more compact in the sense of program size, i.e., Kolmogorov complexity. According to Occam's razor, this should lead to more generality and robustness when previously unseen words are encountered. On the other hand, the dictionary approach is more straightforward in handling exceptions, and it may be easier to modify and maintain. The boundary between the approaches is not clear-cut: a dictionary approach usually needs at least some rules. For example, in many highly inflectional languages, such as Serbian, proper names are inflected, and one cannot expect to have all proper names included in a dictionary. Similarly, an algorithmic stemmer will usually have lists of exceptions, which are small dictionaries. The approach that we explore here is algorithmic. On top of the known advantages of the algorithmic approach, it is even more advantageous in the context of having an initial resource of limited coverage with a significant number of errors, i.e., noise. Overfitting the model to the resource, which would come with the dictionary-based approach, would lead to decreased stemmer performance not only on unseen words, but also on the training data.

Stemming and Lemmatization. Under the more general term lemmatization we distinguish three different levels, each of which provides a more sophisticated analysis of a word:
1. stemming, which has been described,
2. direct lemmatization, or translation of a word form to a lemma, and
3. annotated lemmatization, or translation of a word form to a lemma annotated with the features associated with the word form.

In direct lemmatization, for any given word from a text, the lemmatizer returns a lemma, i.e., a base form of the word that could be found in a dictionary. The advantages of direct lemmatization over stemming include better distinguishing between word variations, which would lead to better applications in the IR domain. The approach could also be used in on-line dictionaries, where users frequently enter variations of a word that are not directly represented in the dictionary, but whose base form is. A disadvantage of direct lemmatization is that there could be a number of inherent ambiguities, since some word forms may correspond to different lemmas. Without knowing the word context, a lemmatizer can only return all of them and let the user or the calling application resolve the ambiguities. In the stemming process, these ambiguities are resolved by merging different lemmas and their word forms into the same class, which has a single representative stem.

Annotated lemmatization maps, similarly to direct lemmatization, a word form into one or more lemmas, with the addition of providing a set of morphological features that are associated with this word form, such as gender, case, and number. The set of features should be such that an inverted process of morphological generation could produce the exact word form based on the provided lemma and the set of features. Annotated lemmatization can be regarded as an extended direct lemmatization, since it incrementally provides more information. Annotated lemmatization may face a higher degree of ambiguity than direct lemmatization, if there is more than one set of features that generates the same word form from the same lemma.

As an example, stemming translates all words in the set {boxer, boxers, boxing, boxed, . . . } to the word 'box'; direct lemmatization makes the translations 'boxers' → 'boxer' and 'boxing' → 'box'; and annotated lemmatization produces the translation 'boxers' → 'boxer.noun.plural'.

4 Methodology

Based on the published literature, the predominant purely algorithmic approach to stemming has been suffix stripping, or, more precisely, suffix substitution. The main representative is the Porter algorithm: the algorithm groups the rules into five steps applied in succession, and at most one rule can be triggered in a group. Each rule consists of a condition and a substitution of the form s1 → s2, with the interpretation that if the condition is satisfied for a word and the word has suffix s1, the suffix s1 is replaced with the suffix s2. The conditions used in the Porter stemmer are either such that they can be represented as suffix requirements as well, or they involve a minimal length of the stem in number of syllables, or they require that the stem contains a vowel. If several rules in one group are applicable, then the longest suffix match is applied. The total number of rules is 63, but if we want to represent them in a "plain-suffix" format, e.g., instead of matching a "double consonant" we actually repeat the rule for each consonant, then the number of rules is about 120. These kinds of rules seem to be applicable to stemmers for other languages in the Indo-European family as well. This is the motivation behind the development of the special-purpose programming language Snowball [7].

It has been noted that the conditions on the stem length do not seem to be very important for Russian and Slovene.6 Based on this observation, we assume that plain suffix-substitution rules should be sufficient for building our stemmer. We use only some trivial conditions on stem length, and exploring these conditions further is part of the future work. Compared to more complex rules, the plain-suffix rules are sufficient, since the complex suffix rules can be expressed as a larger set of plain-suffix rules. Since the Porter stemmer is roughly equivalent to about 120 plain-suffix rules in English, which is a low-inflectional language, we expect that the number of plain-suffix rules for a highly inflectional language such as Serbian could be on the order of thousands.

Lexical Morphological Resource. Our base lexical resource is a list of mappings of words w into their lemmas l. This "mapping" is not a functional relation, since a word could be mapped to several lemmas. It is a general word relation: w →l l. A Simple Dictionary-based Direct Lemmatizer (SDDL) could be created by using this resource: for any given word w, the lemma l(w) is determined by the resource relation w →l l(w). Two issues are: (1) ambiguity, since one word could be associated with more than one lemma, and (2) coverage, since documents regularly include words not seen before in a dictionary (hapax legomena).

Stemmer Derivation. The process of deriving a stemmer is divided into the following steps:

4.1 creation of stem-classes, frequent suffixes will likely be useful, suffixes 4.2 generation of stems and suffixes, of low frequency, e.g., one, should be discarded 4.3 sorting suffixes by frequency, and not only to reduce the number of rules, but to 4.4 generation of suffix-rules. produce more general rules that do not overfit co- 4.1. Creation of stem-classes. If two words incidental word overlaps. w1 and w2 have the same stem, we say that they 4.4. Generation of suffix-rules. We consider conflate [7], and we write w1 ~ w2. The confla- several way of generating suffix rules and experi- tion relation is an equivalence relation and it mentally evaluate each of them. Simple suffix- partitions the set of words into the classes of removal rules are considered, i.e., the rules are of equivalence. We call these classes stem-classes. the form s → ε, where ε is an empty string. We create the stem-classes from our resource by 4.4a Frequency-based Subsumption Stem- defining the conflation relation to be reflexive, mer. In the first approach, called frequency-based symmetric, and transitive closure of the relation subsumption stemmer, we first select suffixes →l . Namely, for any three words w w and w : 1, 2 3 that occur with frequency higher than a given w1 ~ w1, l ⇒ ∧ threshold. These frequent suffixes are called val- w1 → w2 w1 ~ w2 w2 ~ w1, and w ~ w ∧w ~ w ⇒ w ~ w . id suffixes, and they are candidates for the suffix 1 2 2 3 1 3 removal algorithm. The set of all valid suffixes Transitive closure is frequently implemented is denoted by S . If a valid suffix s is a suffix using matrix, but it would likely be prohibitively v 1 of another valid suffix s , than any word ending expensive in this case due to matrix size. An effi- 2 with suffix s , also ends with suffix s , so we say cient way is to use the UNION-FIND data struc- 2 1 that suffixs subsumes suffixs , and write s ⊇ s , ture [2]. 
The result of this phase are stemclasses, 1 2 1 2 or we say that s is more specific than s . If two i.e., groups of words that should be conflated by 2 1 the stemmer. All words derived from the same valid suffixes can be removed from a word, then l one subsumes the other one, and the more spe- lemma, according to the relation → , will be con- flated, but since one word may be associated with cific one is removed. Otherwise a more specific several lemmas, these lemmas will be merged affix would never be applied. Additionally, this is into the same class as well. The quality of stem- a principle used in all Porter-style stemmers. classes needs to be verified experimentally. 4.4b Greedy Subsumption Stemmer: The 4.2. Generation of stems and suffixes. In rules for suffix removal are selected according this step we need to identify what are correct to suffix frequency in descending order,- simi stems for each word and good suffixes. An un- lar to 4.4a. The additional condition is applied supervised machine learning method is applied by measuring stemming accuracy of the newly due to sparse resources that are available. For formed group after each rule. If the accuracy is each stem-class, we find the longest common not improved by a certain threshold, the rule is prefix of all words in the class and define this to not selected. be the stem of each word in the class. After this, 4.4c Optimal Suffix Stemmer: Presence or for each word in the class, the part that remains absence of suffixes can be used in more complex after the stem is collected as a valid suffix. We ways that in simple suffix removal rules. For ex- keep the count of suffixes, i.e., frequency, with ample, some rules of the Porter stemmer a rule an expectation that high-frequency suffixes will are stated as “if a word has suffix s1 and not s2, be good candidates for suffix-removal rules. then suffix s3 is removed.” The goal of the opti- 4.3. Sorting suffixes by frequency. 
Gener- mal suffix stemmer is to explore whether a better ated valid suffixes are sorted by frequency for performance could be achieved by creating such, selection of significant suffixes. While highly more complex rules, while still using only the 28a VLADO KEŠELJ, DANKO ŠIPKA suffixes generated from step 2. Such optimiza- example, in English, one could count work/NN tion problem is not obviously tractable to com- (noun) and work/VB (verb) as two different lem- pute, but we show that it is tractable, and imple- mas, but we count them as one. This kind of am- ment an efficient algorithm to solve it. We say biguity is not very frequent in Serbian. We can also note that the number of (word that two words w1 and w2 are indistinguishable by form, lemma) pairs is larger than the number of the set of valid suffixes Sv, and write w1 ≡ sv w2, ∈ word forms, but not much larger (≈ 3%). This if for each suffixs Sv, s is or is not suffix ofw 1 means the Simple Dictionary-based Direct Lem- and w2 in the same time. If two words are indis- tinguishable, then they are either changed or un- matizer (SDDL), described in the previous sec- changed by the same suffix-removal rule in the tion could be quite accurate, since about 97% wordforms map uniquely to one lemma. It can be stemming process. Additionally, the relation ≡sv is an equivalence relation and it partitions the set observed that there are about 14 different word forms per one lemma on average. of words into |Sv| + 1 equivalence classes (or |Sv| ∈ if ε Sv). These equivalence classes are impor- tant in the context of complex suffix rules since 5.2 Simple Dictionary-based Direct two words in the same class cannot be separated Lemmatizer by matching them with valid suffixes; and if two The performance of SDDL depends on the words belong to different classes then it is pos- ambiguity level of the dictionary, i.e., the re- sible to create a boolean expression over valid- source. 
We define ambiguity level of a word w l suffix matching conditions to separate the words. as ambiguity(w) = |{l : w → l}|, i.e., the number Hence, to find the optimal achievable accuracy of lemmas associated with the word. For unam- biguous words, i.e., words with ambiguity level with a set of suffixes Sv, we need to locally opti- mize each equivalence class by finding the most 1, the lemmatizer would give a correct answer, at least according to the resource. The distribution optimal suffix to be removed from each word in of ambiguity levels is given in Table 2. the class. This can be efficiently performed. . Number of Ambiguity level Percentage lemmas 47,489 word forms word forms 675,140 6 1 0.00015 % word form → lemma pairs 696,263 5 18 0.0027 % Table 1: Lexical Resource Statistics 4 156 0.023 % 3 1566 0.23 % 5 Evaluation 2 17446 2.58 % 5.1 Lexical Morphological Resource 1 655953 97.16 % Our processing started from a basic lexical re- Table 2: Ambiguity level distribution source for Serbian language, which was manual- of the word forms in the resource ly created and enriched by applying derivational This implies that, assuming a uniform distri- rules. We went through a long process of clean- bution of words, we could expect an accuracy ing, and the resource still includes some errors. of at least 97% of the SDDL. The most ambigu- To make processing easier, the diacritic Latin let- ous word with 6 corresponding lemmas in the ters in Serbian are transcribed into the so-called resource is ‘žute’ (engl. yellow) and its lemmas ‘dual1’ encoding (e.g., č=cx, ć=cy). The resource are: žut, žuta, žuteti, žutiti, žutjeti. consists of word → lemma pairs, and the basic Corpus-based Evaluation. In the above esti- statistics is shown in Table 1. Distinct part-of- mate we assume uniform distribution of words in speech tags are not counted in this statistics. For text, which is not realistic. The words are typical- A Suffix Subsumption-based Approach... 
29a ly distributed according to the Zipf’s law [4]—a 31 2633 (6,3%) power distribution law, very different from uni- 32 1481 (3,6%) form distribution. To make a more realistic eval- 33 2872 (6,9%) uation we use a . As a representative 34 446 (1,1%) corpus of the common contemporary language, 37 632 (1,5%) we have chosen a collection of articles from the Table 3: Distribution of Stem Class Sizes, news magazine “Vreme” (engl. “Time”) from the higher than 1% period of five years 2001–5. The corpus size is Before proceeding with evaluation of our 44MB and it consists of 6.6 million words. stemmer-generating method, the lexical resource The first use of the corpus is to evaluate the is improved in the following way. The ten most coverage of our resource, i.e., the percentage of frequent word that are not covered by the resource corpus words that are included in the resource. are: ‘i’, ‘u’, ‘na’, ‘za’, ‘su’, ‘a’, ‘ne’, ‘od’, ‘sa’, After the first run we found that only 56% of the and ‘o’, which are very frequent functional words words in the corpus were found in the resource and are omitted from the resource simply because with case-sensitive matching. Besides names, those part-of-speech tags were not included (con- the words are capitalized at the beginning of a junctions: ‘i’, and ‘a’; prepositions: ‘u’, ‘na’, ‘za’, sentence and in titles so we found that case-in- ‘od’, ‘sa’, and ‘o’; auxiliary verb: ‘su’, and ad- sensitive coverage is 61%. An examination of verb ‘ne’). After manually adding 200 more word- unrecognized words reveals that about 35% of lemma pairs, the coverage increased to 85% with them are proper names. Another significant un- the 73% unambiguous words from the resource. recognized group are conjunctions and preposi- This is a usable accuracy, but the limitations are tions, which are very frequent and happened not that it requires almost 700,000 wordlemma pairs, be included in the resource. 
The proper names has no generalization capability, and likely con- are a group that is hard to predict so we can- tains some errors evident in the resource. not assume that they would be covered by a bet- ter resource. However, the names follow simi- 5.3 Stemmer Evaluation lar morphological patterns as common nouns, Step 4.1: Creation of stem-classes. After which is an additional evidence that an algorith- transitive closure, 677,868 unique words from mic approach would be advantageous, and that it the resource are distributed into 41,681 classes, would generalize better. Within these 61%, 50% giving on average 16.3 words per class. The words in the corpus (49.79% more precisely) are number of words per class varies between 1 and unambiguous in the resource (ambiguity(w)=1). 307 words per class. The classes with more than This is about 50/61 ≈ 82% of recognized words, 80 words are very sparse. For example, two larg- which is a less optimistic evaluation than the est stem classes have 307 and 283 words. After one obtained for uniform distribution. This im- examining them, we see that they are created by plies that SDDL would have accuracy of at least incorrectly merging two or more proper stem 50%, and likely not much higher than 61%, as- classes, likely due to some erroneous word-lem- suming that some simple strategy for unknown ma pairs. The most frequent stem-class size is 7, words is used. which is 29% of the classes. The distribution of 1 457 (1,1%) 8 3946 (9,5%) class sizes with more than 1% of all stem classes 4 1436 (3,4%) 9 1494 (3,6%) is shown in Table 3. 5 1703 (4,1%) 12 3962 (9,6%) Step 4.2: Generation of stems and suffixes. 6 1320 (3,2%) 13 2433 (5,8%) After producing stems and suffixes in this step, 7 11942 (28,7%) 29 547 (1,3%) any empty stems obtained are indicators of in- 30a VLADO KEŠELJ, DANKO ŠIPKA correct stemclasses. 
In a number of cases it was fix method used to generate stems created some caused by the prefix ‘naj-’, which is used in su- additional overlap among stem-classes, effec- perlative inflections of adjectives and adverbs. tively merging them: 39,289 stems are created, As we noted before, the prefixbased derivations 1,823 (4.6%) of those were ambiguous in the should not be treated in stemming. As an illustra- sense that they were associated with more than tion, if we are searching a document collection one stem-class. Only 253 had ambiguity level of for the highest mountain peak in the world, we three or more, with the stem ambiguity level de- are likely interested in ‘highest’ precisely and not creasing quickly when sorted in descending or- ‘high’ peaks or comparison ‘higher peak’, even der. The most ambiguous stems are given in the though these are conflated in English. Removal list below. of prefix ‘naj-’ would cause additional errors 43 ist 18 post 16 samo 14 ekst since it appears as a prefix in non-superlative 26 rast 18 sat 15 ust 13 zast words, such as ‘najamnik’ and ‘najahati’. The 12 ost 7 pos 7 nast issue could be resolved by removing the prefix 12 konst 7 podst 7 nas ‘naj-’ only when matched with the corresponding These highly ambiguous stems do not main- superlative suffixes ‘-ija’, ‘-iji’, and similar. We tain meaning of the word, and an improvement address this problem by separating superlative method for this step is a part of our future work. and nonsuperlative stem-classes. Step 4.3: Sorting suffixes by frequency. In Another source of empty stems are irregu- this step 18,274 suffixes were generated, and the lar inflections, such as the plural noun ‘ljudi’ of top of the sorted list of generated suffixes with ‘čovek’ or ‘čovjek’ and auxiliary verb form ‘ćeš’ frequency is given in the table below. of ‘biti’. Both of these could be handled by an 24833 -e 6495 -ој exception list, but we decide to separate them in 22874 -u 6475 -omu different stem-classes. 
We assume that an IR user would not expect a search term to be expanded in this way (e.g., for ‘ljudi’), or auxiliary verbs would be removed as stop-words anyway.

The stems of length 1 are suspected of belonging to incorrect classes, but they were not systematically removed. One example is the word ‘beže’, which is both a present-tense form of the verb ‘bežati’ (Engl. to escape) and the vocative case of the noun ‘beg’ (Engl. bey), which leads to an erroneous merge of two otherwise correct stem classes. In the first run, 650 words produced an empty stem. For all of them we manually fixed the original resource, which caused a break-up of the corresponding stem classes and the production of non-empty stems. Short stems (e.g., of length 1) are also frequently created by incorrectly merged stem classes, but we hypothesized that it may not be necessary to fix them in this experiment, since the later methods use the most frequent suffixes, which should have high reliability. The maximal common pre-

The most frequent suffixes, with their frequencies, were:

  22389  -i             6121  -oga
  22184  -a             6118  -og
  19475  -om            5929  -ti
  17756  -o             5775  -t
  16190  '' (empty)     4412  -h
   8996  -im            4399  -m
   8281  -ama           4303  -ćeš
   8101  -ih            4289  -ću
   7573  -te            4273  -le
   7472  -ima           4272  -la
   6821  -mo            4268  -li
                        4252  -će

All of these suffixes have a linguistic interpretation.

Step 4.4: Generation of Suffix Rules. A direct implementation of the evaluation of the different suffix-rule generation approaches led to very slow evaluation. An efficient implementation with a compact trie (historically also known as a Patricia trie) over reversed strings reduced the running time significantly, from 5–6 hours in the initial experiments to about 5–10 minutes.

(4.4a) Frequency-based Subsumption Stemmer. For the frequency-based subsumption stemmer, we started with an empty set of valid suffixes and incrementally added one rule at a time, in the order determined by rule frequency. After each step, the stemming accuracy according to our generated stems is measured. It started at 2.4% with Sv = ∅, gradually increased to 56.3% with 98 suffix rules, and then gradually decreased to 14.2% when all 17,839 suffixes were included.

(4.4b) Greedy Subsumption Stemmer. In the greedy approach, we add rules in the same order as in 4.4a, but before and after adding each suffix we measure the accuracies A1 and A2, expressed as the number of correct stems. The rule is accepted if A2 − A1 ≥ θ, where θ is a given parameter; i.e., a suffix is accepted only if it improves accuracy by at least the given threshold. For example, if θ = 0, then a suffix is accepted only if it does not decrease the overall accuracy; if θ = 1, then the number of correct stems must increase by at least 1, and so on. The higher the parameter θ is, the better generalization we expect, since fewer rules of higher quality are chosen, but accuracy may decrease. The results are shown in the following table.

  θ      Valid suffixes   Accuracy (%)
  0          9849           74.15
  1          8633           74.16
  2          3367           73.38
  3          1901           72.95
  4          1557           72.83
  5          1262           72.66
  6          1124           72.56
  7          1002           72.46
  8           933           72.39
  9           878           72.32
  10          831           72.26
  15          673           71.99
  20          592           71.78
  25          497           71.48
  30          453           71.30
  35          423           71.16
  40          410           71.09
  45          380           70.90
  50          360           70.76
  60          347           70.65
  70          319           70.39
  80          310           70.29
  90          298           70.14
  100         294           70.23
  150         273           69.87
  200         230           68.80
  250         218           68.43
  300         202           67.77
  350         188           67.08
  400         180           66.65
  450         179           66.59
  500         175           66.31
  600         131           62.74
  700         121           61.82
  800         114           61.03
  900          87           57.76
  1000         85           57.48

Two interesting observations can be made: the accuracy is much higher than with the previous approach, and it initially drops very slowly while the number of rules drops quickly, which is another very encouraging observation. At θ = 7 we obtain 1002 suffix rules with an accuracy only about 1.7% below the best one. This fits well with our prediction that a stemmer for the Serbian language would need about 1000 suffix rules.

(4.4c) Optimal Suffix Stemmer. The accuracy of the optimal suffix stemmer is 81.83%. This is the upper bound of what can be achieved with the obtained set of valid suffixes and the corresponding suffix-removal rules, when evaluated on the produced set of stems. We can see that the greedy approach is not much lower, especially considering the argument that our goal should not be to match the optimal accuracy, since we would overfit the initial flaws of the lexical resources and some incorrect stems produced in the previously described process.

5.4 Unbiased Evaluation

To evaluate the stemmers in an unbiased way, we use the news corpus, run the stemmers on a sample set of words from the corpus, and manually judge the produced stems. We chose to evaluate two stemmers: 4.4c (Optimal Suffix Stemmer) and the greedy stemmer (4.4b) with the parameter θ = 7 and about 1000 generated rules. An interactive program reads the words from the corpus in sequence and runs both stemmers on them. Since we are more interested in words not included in the resource, the words that exist in the resource and for which the stemmers produce the same stem are ignored. Otherwise, the stems are presented for manual evaluation with five possible decisions: only greedy correct, only optimal correct, both correct, both incorrect, and ignore. The option ‘ignore’ is used to exclude some functional words, which are obvious stop-words, and some English words appearing in the corpus. A stem is judged to be correct if the original meaning can be clearly predicted from the stem (no over-stemming) and the stem appears to cover all morphological variations of the lemma (no under-stemming). After evaluating 1000 non-ignored words from the corpus (with possible repetitions), the result was: 127 words with only the greedy stemmer correct, 90 with only the optimal stemmer correct, 663 with both correct, and 120 with neither correct. These results confirm two of our hypotheses: (1) the stemmers produced in the process seem to be usable in IR (greedy accuracy 79% and optimal accuracy 75%); and (2) the greedy approach not only produces results as good as the optimal stemmer, but generalizes even better (better accuracy) with only about 1000 rules.

6 Conclusion and Future Work

In summary, we described and evaluated a largely automatic, general approach to generating stemmers for highly inflectional languages with only a few resources. Some limitations of the process were discovered, as well as opportunities for further improvement. The final evaluation has shown 79% accuracy on real data for the greedy stemmer, which is even a bit higher than the accuracy obtained on the training data, showing a very good generalization capability. Some directions for future work are: (1) evaluation on more data, (2) inclusion of suffix-substitution rules instead of just suffix-removal rules, and (3) inclusion of a stem-length parameter. With suffix-substitution rules, the method can be directly applied to lemmatizer generation.

Notes

1. The current official web site for the Porter stemmer is http://tartarus.org/~martin/PorterStemmer/, and it is the authoritative source for implementations of the original stemmer. A quick test of the authenticity of a Porter stemmer implementation is the word ‘agreement’: it is not changed by the original Porter stemmer, while some incorrect implementations change it.
2. CPAN (Comprehensive Perl Archive Network), http://cpan.org/, is an open-source repository for Perl packages.
3. http://www.mf.uni-lj.si/ds/new-stemmers.html
4. http://snowball.tartarus.org/archives/snowball-discuss/0722.html
5. http://svn.rot13.org/index.cgi/stem-hr
6. Source: the Snowball mailing list.
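The rule-selection loops of Steps 4.4a and 4.4b can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the helper names, the toy data, and the naive linear scan in `stem` (which the actual system replaces with a trie over reversed strings for speed) are all ours.

```python
# Sketch of greedy suffix-rule selection (Step 4.4b); illustrative only.
# A word is stemmed by removing the longest valid suffix; accuracy is
# the number of words whose result matches the reference stem.

def stem(word, valid_suffixes):
    """Strip the longest valid suffix, keeping a non-empty stem.
    (Naive linear scan; the paper uses a trie over reversed strings.)"""
    best = ""
    for s in valid_suffixes:
        if s and word.endswith(s) and len(s) < len(word) and len(s) > len(best):
            best = s
    return word[:len(word) - len(best)] if best else word

def correct_count(valid_suffixes, words, ref_stems):
    """Accuracy as the number of correctly produced stems."""
    return sum(stem(w, valid_suffixes) == r for w, r in zip(words, ref_stems))

def greedy_select(candidate_suffixes, words, ref_stems, theta=0):
    """Try suffixes in the given (frequency) order; keep a suffix only if
    it improves the number of correct stems by at least theta."""
    valid = set()
    a1 = correct_count(valid, words, ref_stems)
    for s in candidate_suffixes:
        valid.add(s)
        a2 = correct_count(valid, words, ref_stems)
        if a2 - a1 >= theta:
            a1 = a2          # accept the rule
        else:
            valid.remove(s)  # reject: insufficient improvement
    return valid
```

With theta = 0 a rule is kept whenever it does not decrease accuracy; Step 4.4a corresponds to accepting every rule unconditionally.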

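The speed-up reported in Step 4.4 rests on storing suffixes reversed in a trie, so that the longest matching suffix of a word is found in a single left-to-right walk over the reversed word. A minimal sketch of this idea, using a plain (uncompressed) trie rather than a compact Patricia trie, with all names being our own illustrative choices:

```python
# Illustrative sketch: suffixes inserted reversed into a trie make
# longest-suffix lookup a single walk over the reversed word.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_rule = False  # a valid suffix ends at this node

def build_suffix_trie(suffixes):
    """Insert each suffix reversed, marking rule-final nodes."""
    root = TrieNode()
    for suf in suffixes:
        node = root
        for ch in reversed(suf):
            node = node.children.setdefault(ch, TrieNode())
        node.is_rule = True
    return root

def strip_longest_suffix(word, root):
    """Return word with its longest valid suffix removed (stem kept non-empty)."""
    node, best = root, 0
    for depth, ch in enumerate(reversed(word), start=1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.is_rule and depth < len(word):
            best = depth  # remember the deepest rule seen so far
    return word[:len(word) - best]
```

A compact (Patricia) trie additionally collapses chains of single-child nodes into one edge, which reduces memory and pointer-chasing but does not change the lookup logic sketched here.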
References

[1] K. Beesley and L. Karttunen. Finite State Morphology. CSLI, 2003.
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2002.
[3] Chris Jordan, John Healy, and Vlado Kešelj. Swordfish: An unsupervised ngram based approach to morphological analysis. In SIGIR'06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 657–658, Seattle, Washington, USA, August 2006. ACM Press.
[4] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall Series in Artificial Intelligence. Prentice Hall, 2000.
[5] Mirko Popović and Peter Willett. The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American Society for Information Science, 43(5):384–390, 1992.
[6] Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.
[7] Martin F. Porter. Snowball: A language for stemming algorithms. Published on WWW, October 2001. Last access in April 2007.
[8] S. Sheremetyeva, W. Jin, and S. Nirenburg. Rapid deployment morphology. Machine Translation, 13(4):239–268, 1998.
[9] Marko Tadić. Hrvatski lematizacijski poslužitelj (Croatian Lemmatization Server). Published on WWW, 2005. Last access in April 2007.
[10] Duško Vitas and Cvetana Krstev. Derivational morphology in an e-dictionary of Serbian. In Proceedings of the 2nd Language & Technology Conference, pages 139–143, Poznan, Poland, April 21–23, 2005.