A Suffix Subsumption-based Approach to Building Stemmers and Lemmatizers for Highly Inflectional Languages with Sparse Resources

Vlado Kešelj, Dalhousie University; Danko Šipka, Arizona State University

Abstract: We present a general suffix-based method for the construction of stemmers and lemmatizers for highly inflectional languages with only sparse resources. The process is directly implementable with the described efficient design, and it is evaluated on the construction of a stemmer for the Serbian language. The evaluation on real data has shown an accuracy of 79%.

1 Introduction

Two important tasks at the low level of Natural Language Processing (NLP) are stemming and lemmatization. Stemming is well known in the NLP, IR (Information Retrieval), and Text Mining research areas as an essential preprocessing step for tasks such as text and document retrieval, classification, information extraction, and other content-related applications. Descriptively speaking, stemming is a word transformation in which a word may be stripped of some suffixes without losing its core semantic content. Very frequent words are usually removed as stop-words in an IR system, and they are not subject to stemming. We could think of stemming as a process of normalization in which several morphological variants of a word are mapped into the same form. An elaborate discussion of stemming and its application to IR is given in [7].

Stemming brings two important benefits to an IR system: (1) better IR recall can be achieved, since query words are matched with their variants in the documents; and (2) stemming decreases the size of the overall term vocabulary, which leads to significant efficiency benefits in speed and memory requirements, due to the decreased size of the term index and dimensionality of term vectors. Namely, in the vector-space model of IR, the documents are represented as vectors of weights, where each weight corresponds to a term in a vocabulary. Removing stop-words and very rare words from the vocabulary is the first step in dimensionality reduction; on top of this, the further reduction by stemming is estimated to be about one third [6]. Significant benefits in retrieval performance are sometimes disputed ([4] §3.4), at least for English, but for highly inflectional languages stemming or some equivalent preprocessing is essential [5]. Stemming can also be used as a preprocessing step in various other tasks.

Is suffix stripping sufficient? Beside suffix removal, one could be tempted to use prefix removal as well, but prefixes usually change the meaning radically, and it is preferred that they are left intact [7]. Stemming has been a mainly suffix-based transformation since the publication of the Porter stemmer [6], and it has been successfully applied to several other languages of the Indo-European family; e.g., stemmers for 16 languages are implemented in the Snowball framework [7]. However, one should not generalize this suffix-oriented methodology to all languages; for example, Arabic relies on prefixes, suffixes, and infixes in morphological transformations, such as using prefixes to indicate the person feature in verbs. The languages of the Bantu group use prefixes to form plurals. For some languages, such as Chinese, this question is of no relevance at all. Irregular inflections are not well handled by suffix-based transformations

INFOTHECA – Journal of Informatics and Librarianship, № 1-2, vol. IX, May 2008

and they should be handled as exception word lists, one of which is the stop-word list.

The concepts of a word stem and a word root are related but distinct: a stem is a product of the stemming process, which conflates all semantically close words, while a root is an "inner" word from which the initial word derives; i.e., it has an etymological meaning [7]. Finding a root frequently requires removal of prefixes as well, while they are not removed in stemming. Another computational problem related to stemming is morphological analysis, which aims at breaking words into the smallest parts that maintain a unit meaning related to the meaning of the initial word [3].

Lemmatization. Similarly to stemming, lemmatization is a morphological transformation that changes a word into a normalized form. However, while the purpose of stemming is to conflate related morphological variations into one unifying form and to separate unrelated forms, a lemmatizer returns the corresponding lemma, which is the normalized word form as it would appear in the dictionary.

In the rest of the paper we will first discuss related work in section 2, and then formally introduce our approach and methodology in section 3. In section 4 we describe the resource that we used as the starting point. In section 5 we describe the experiments and discuss the results, and in section 6 we conclude with a summary of the results and the main contributions, and propose tasks for future work.

The resources and programs used in the paper are made publicly available and can be found at http://www.cs.dal.ca/˜vlado/nlp/2007-sr.

2 Related Work

Likely the best-known and most widely used stemmer is the Porter stemmer for English [6]. The Lovins stemmer was another known stemmer, created about the same time (a bit earlier) as the Porter stemmer. The original Porter stemmer was implemented in BCPL, a programming language that was a predecessor of C and is not used so much these days. The stemmer has been re-implemented in many different languages, but the reader should be aware that many of these re-implementations do not implement the stemmer exactly as it was specified.1 Both the Porter and the Lovins stemmers are examples of algorithmic stemmers. There are two general approaches to stemming: dictionary-based and algorithmic. We discuss them in more detail in the next section.

Since the appearance of the Porter stemmer, a number of stemmers have been implemented. For example, the Snowball framework [7] at the moment includes stemmers for 16 languages. Russian is the only Balto-Slavonic language currently implemented in the Snowball framework. Some other implementations are publicly available. A notable site is CPAN2, which hosts several stemmers, including a wrapper module for Snowball. For the majority of languages there are no publicly available stemmers, especially for languages with sparse electronic linguistic resources. To paraphrase [7], while there is a large amount of publications discussing stemming, there are only a few descriptions that can be readily implemented in popular efficient programming languages, such as C, Perl, Java, or similar; and there is a relatively small number of publications giving quantitative analyses and evaluations of stemmer performance.

The theoretical basis of our methodology is related to the finite-state methodology described in [1] and [8].

Regarding the Serbian language, our search for a wider set of stemmers for any of the Slavonic languages of former Yugoslavia produced only a few results. Two stemmers could be found: a stemmer for Slovene is described in [5] and is evaluated on an IR task, but we could not locate any available implementation. The "three new stemmers for Slovene" were mentioned at the web site of the INCO-Copernicus project3. There was a discussion on the Snowball list about including a Slovene stemmer into the framework4. There is publicly available Perl code for a Croatian stemmer5. It includes very limited documentation (several code-revision comments), and only the author's user id 'dpavlin'. It seems to be a short and well-written stemmer, but it is not clear what its coverage is. It could be a toy stemmer designed only for the 143 words included in the test data. Other related projects on morphological analysis that seem to have implemented lemmatizers, but not stemmers, are [10] for Serbian and [9] for Croatian.

Contributions. The three main contributions of this paper are: (1) developing and making publicly available an implemented stemmer for Serbian and the associated resources, (2) providing a quantitative analysis of the stemmer and of the various steps in the process of its development, and (3) proposing and testing a general approach to building stemmers and lemmatizers for highly inflectional languages with sparse resources. We find that the method that we used provides some interesting insights into the algorithms and data structures needed for an efficient implementation of such stemmers.

3 Background

Algorithmic and Dictionary Stemmers. There are two approaches to building stemmers:
1. the dictionary-based approach, and
2. the algorithmic approach.
In the dictionary approach, we rely on the extensive linguistic knowledge collected in a machine-readable dictionary, while in the algorithmic approach we use a relatively small set of rules. The algorithmic approach is generally more efficient and more compact in the sense of program size, i.e., Kolmogorov complexity. According to Occam's razor, this should lead to more generality and robustness when previously unseen words are encountered. On the other hand, the dictionary approach is more straightforward in handling exceptions, and it may be easier to modify and maintain. The boundary between the approaches is not clear-cut: a dictionary approach usually needs at least some rules. For example, in many highly inflectional languages, such as Serbian, proper names are inflected, and one cannot expect to have all proper names included in a dictionary. Similarly, an algorithmic stemmer will usually have lists of exceptions, which are small dictionaries. The approach that we explore here is algorithmic. On top of the known advantages of the algorithmic approach, it is even more advantageous in the context of having an initial resource of limited coverage with a significant number of errors, i.e., noise. Overfitting the model to the resource, which would come with the dictionary-based approach, would lead to decreased stemmer performance not only on unseen words, but also on the training data.

Stemming and Lemmatization. Under the more general term lemmatization we distinguish three different levels, each of which provides a more sophisticated analysis of a word:
1. stemming, which has been described,
2. direct lemmatization, or translation of a word form to a lemma, and
3. annotated lemmatization, or translation of a word form to a lemma annotated with the features associated with the word form.

In direct lemmatization, for any given word from a text, the lemmatizer returns a lemma, i.e., a base form of the word that could be found in a dictionary. The advantages of direct lemmatization over stemming include better distinguishing between word variations, which would lead to better applications in the IR domain. The approach could also be used in on-line dictionaries, where users frequently enter variations of a word that are not directly represented in the dictionary, but whose base form is. A disadvantage of direct lemmatization is that there could be a number of inherent ambiguities, since some word forms may correspond to different lemmas. Without knowing the word context, a lemmatizer can only return all of them and let the user or the calling application resolve the ambiguities. In the stemming process, these ambiguities are resolved by merging different lemmas and their word forms into the same class, which has a single representative stem.

Annotated lemmatization maps, similarly to direct lemmatization, a word form into one or more lemmas, with the addition of providing a set of morphological features that are associated with this word form, such as gender, case, and number. The set of features should be such that an inverted process of morphological generation could produce the exact word form based on the provided lemma and the set of features. Annotated lemmatization can be regarded as an extended direct lemmatization, since it incrementally provides more information. Annotated lemmatization may face a higher degree of ambiguity than direct lemmatization, if there is more than one set of features that generates the same word form from the same lemma.

As an example, stemming translates all words in the set {boxer, boxers, boxing, boxed, . . . } to the word 'box'; direct lemmatization makes the translations 'boxers' → 'boxer' and 'boxing' → 'box'; and annotated lemmatization produces the translation 'boxers' → 'boxer.noun.plural'.

4 Methodology

Based on the published literature, the predominant purely algorithmic approach to stemming has been suffix stripping, or, more precisely, suffix substitution. The main representative is the Porter algorithm: the algorithm groups the rules into five steps applied in succession, and at most one rule can be triggered in a group. Each rule consists of a condition and a substitution of the form s1 → s2, with the interpretation that if the condition is satisfied for a word and the word has suffix s1, the suffix s1 is replaced with the suffix s2. The conditions used in the Porter stemmer are either such that they can be represented as suffix requirements as well, or they involve a minimal length of the stem in number of syllables, or they require that the stem contains a vowel. If several rules in one group are applicable, then the longest suffix match is applied. The total number of rules is 63, but if we want to represent them in a "plain-suffix" format, e.g., instead of matching a "double consonant" we actually repeat the rule for each consonant, then the number of rules is about 120. These kinds of rules seem to be applicable to stemmers for other languages in the Indo-European family as well. This is the motivation behind the development of the special-purpose programming language Snowball [7].

It has been noted that the conditions on the stem length do not seem to be very important for Russian and Slovene.6 Based on this observation, we assume that plain suffix-substitution rules should be sufficient for building our stemmer. We use only some trivial conditions on stem length, and exploring these conditions further is part of the future work. Compared to more complex rules, the plain-suffix rules are sufficient, since the complex suffix rules can be expressed as a larger set of plain-suffix rules. Since the Porter stemmer is roughly equivalent to about 120 plain-suffix rules in English, which is a low-inflectional language, we expect that the number of plain-suffix rules for a highly inflectional language such as Serbian could be on the order of thousands.

Lexical Morphological Resource. Our base lexical resource is a list of mappings of words w into their lemmas l. This "mapping" is not a functional relation, since a word could be mapped to several lemmas. It is a general word relation: w →l l. A Simple Dictionary-based Direct Lemmatizer (SDDL) could be created by using this resource: for any given word w, the lemma l(w) is determined by the resource relation w →l l(w). Two issues are: (1) ambiguity, since one word could be associated with more than one lemma, and (2) coverage, since documents regularly include words not seen before in a dictionary (hapax legomena).

Stemmer Derivation. The process of deriving a stemmer is divided into the following steps:

4.1 creation of stem-classes, frequent suffixes will likely be useful, suffixes 4.2 generation of stems and suffixes, of low frequency, e.g., one, should be discarded 4.3 sorting suffixes by frequency, and not only to reduce the number of rules, but to 4.4 generation of suffix-rules. produce more general rules that do not overfit co- 4.1. Creation of stem-classes. If two words incidental word overlaps. w1 and w2 have the same stem, we say that they 4.4. Generation of suffix-rules. We consider conflate [7], and we write w1 ~ w2. The confla- several way of generating suffix rules and experi- tion relation is an equivalence relation and it mentally evaluate each of them. Simple suffix- partitions the set of words into the classes of removal rules are considered, i.e., the rules are of equivalence. We call these classes stem-classes. the form s → ε, where ε is an empty string. We create the stem-classes from our resource by 4.4a Frequency-based Subsumption Stem- defining the conflation relation to be reflexive, mer. In the first approach, called frequency-based symmetric, and transitive closure of the relation subsumption stemmer, we first select suffixes →l . Namely, for any three words w w and w : 1, 2 3 that occur with frequency higher than a given w1 ~ w1, l ⇒ ∧ threshold. These frequent suffixes are called val- w1 → w2 w1 ~ w2 w2 ~ w1, and w ~ w ∧w ~ w ⇒ w ~ w . id suffixes, and they are candidates for the suffix 1 2 2 3 1 3 removal algorithm. The set of all valid suffixes Transitive closure is frequently implemented is denoted by S . If a valid suffix s is a suffix using matrix, but it would likely be prohibitively v 1 of another valid suffix s , than any word ending expensive in this case due to matrix size. An effi- 2 with suffix s , also ends with suffix s , so we say cient way is to use the UNION-FIND data struc- 2 1 that suffixs subsumes suffixs , and write s ⊇ s , ture [2]. 
The result of this phase are stemclasses, 1 2 1 2 or we say that s is more specific than s . If two i.e., groups of words that should be conflated by 2 1 the stemmer. All words derived from the same valid suffixes can be removed from a word, then l one subsumes the other one, and the more spe- lemma, according to the relation → , will be con- flated, but since one word may be associated with cific one is removed. Otherwise a more specific several lemmas, these lemmas will be merged affix would never be applied. Additionally, this is into the same class as well. The quality of stem- a principle used in all Porter-style stemmers. classes needs to be verified experimentally. 4.4b Greedy Subsumption Stemmer: The 4.2. Generation of stems and suffixes. In rules for suffix removal are selected according this step we need to identify what are correct to suffix frequency in descending order,- simi stems for each word and good suffixes. An un- lar to 4.4a. The additional condition is applied supervised machine learning method is applied by measuring stemming accuracy of the newly due to sparse resources that are available. For formed group after each rule. If the accuracy is each stem-class, we find the longest common not improved by a certain threshold, the rule is prefix of all words in the class and define this to not selected. be the stem of each word in the class. After this, 4.4c Optimal Suffix Stemmer: Presence or for each word in the class, the part that remains absence of suffixes can be used in more complex after the stem is collected as a valid suffix. We ways that in simple suffix removal rules. For ex- keep the count of suffixes, i.e., frequency, with ample, some rules of the Porter stemmer a rule an expectation that high-frequency suffixes will are stated as “if a word has suffix s1 and not s2, be good candidates for suffix-removal rules. then suffix s3 is removed.” The goal of the opti- 4.3. Sorting suffixes by frequency. 
Gener- mal suffix stemmer is to explore whether a better ated valid suffixes are sorted by frequency for performance could be achieved by creating such, selection of significant suffixes. While highly more complex rules, while still using only the 28a VLADO KEŠELJ, DANKO ŠIPKA suffixes generated from step 2. Such optimiza- example, in English, one could count work/NN tion problem is not obviously tractable to com- (noun) and work/VB (verb) as two different lem- pute, but we show that it is tractable, and imple- mas, but we count them as one. This kind of am- ment an efficient algorithm to solve it. We say biguity is not very frequent in Serbian. We can also note that the number of (word that two words w1 and w2 are indistinguishable by form, lemma) pairs is larger than the number of the set of valid suffixes Sv, and write w1 ≡ sv w2, ∈ word forms, but not much larger (≈ 3%). This if for each suffixs Sv, s is or is not suffix ofw 1 means the Simple Dictionary-based Direct Lem- and w2 in the same time. If two words are indis- tinguishable, then they are either changed or un- matizer (SDDL), described in the previous sec- changed by the same suffix-removal rule in the tion could be quite accurate, since about 97% wordforms map uniquely to one lemma. It can be stemming process. Additionally, the relation ≡sv is an equivalence relation and it partitions the set observed that there are about 14 different word forms per one lemma on average. of words into |Sv| + 1 equivalence classes (or |Sv| ∈ if ε Sv). These equivalence classes are impor- tant in the context of complex suffix rules since 5.2 Simple Dictionary-based Direct two words in the same class cannot be separated Lemmatizer by matching them with valid suffixes; and if two The performance of SDDL depends on the words belong to different classes then it is pos- ambiguity level of the dictionary, i.e., the re- sible to create a boolean expression over valid- source. 
We define ambiguity level of a word w l suffix matching conditions to separate the words. as ambiguity(w) = |{l : w → l}|, i.e., the number Hence, to find the optimal achievable accuracy of lemmas associated with the word. For unam- biguous words, i.e., words with ambiguity level with a set of suffixes Sv, we need to locally opti- mize each equivalence class by finding the most 1, the lemmatizer would give a correct answer, at least according to the resource. The distribution optimal suffix to be removed from each word in of ambiguity levels is given in Table 2. the class. This can be efficiently performed. . Number of Ambiguity level Percentage lemmas 47,489 word forms word forms 675,140 6 1 0.00015 % word form → lemma pairs 696,263 5 18 0.0027 % Table 1: Lexical Resource Statistics 4 156 0.023 % 3 1566 0.23 % 5 Evaluation 2 17446 2.58 % 5.1 Lexical Morphological Resource 1 655953 97.16 % Our processing started from a basic lexical re- Table 2: Ambiguity level distribution source for Serbian language, which was manual- of the word forms in the resource ly created and enriched by applying derivational This implies that, assuming a uniform distri- rules. We went through a long process of clean- bution of words, we could expect an accuracy ing, and the resource still includes some errors. of at least 97% of the SDDL. The most ambigu- To make processing easier, the diacritic Latin let- ous word with 6 corresponding lemmas in the ters in Serbian are transcribed into the so-called resource is ‘žute’ (engl. yellow) and its lemmas ‘dual1’ encoding (e.g., č=cx, ć=cy). The resource are: žut, žuta, žuteti, žutiti, žutjeti. consists of word → lemma pairs, and the basic Corpus-based Evaluation. In the above esti- statistics is shown in Table 1. Distinct part-of- mate we assume uniform distribution of words in speech tags are not counted in this statistics. For text, which is not realistic. The words are typical- A Suffix Subsumption-based Approach... 
29a ly distributed according to the Zipf’s law [4]—a 31 2633 (6,3%) power distribution law, very different from uni- 32 1481 (3,6%) form distribution. To make a more realistic eval- 33 2872 (6,9%) uation we use a . As a representative 34 446 (1,1%) corpus of the common contemporary language, 37 632 (1,5%) we have chosen a collection of articles from the Table 3: Distribution of Stem Class Sizes, news magazine “Vreme” (engl. “Time”) from the higher than 1% period of five years 2001–5. The corpus size is Before proceeding with evaluation of our 44MB and it consists of 6.6 million words. stemmer-generating method, the lexical resource The first use of the corpus is to evaluate the is improved in the following way. The ten most coverage of our resource, i.e., the percentage of frequent word that are not covered by the resource corpus words that are included in the resource. are: ‘i’, ‘u’, ‘na’, ‘za’, ‘su’, ‘a’, ‘ne’, ‘od’, ‘sa’, After the first run we found that only 56% of the and ‘o’, which are very frequent functional words words in the corpus were found in the resource and are omitted from the resource simply because with case-sensitive matching. Besides names, those part-of-speech tags were not included (con- the words are capitalized at the beginning of a junctions: ‘i’, and ‘a’; prepositions: ‘u’, ‘na’, ‘za’, sentence and in titles so we found that case-in- ‘od’, ‘sa’, and ‘o’; auxiliary verb: ‘su’, and ad- sensitive coverage is 61%. An examination of verb ‘ne’). After manually adding 200 more word- unrecognized words reveals that about 35% of lemma pairs, the coverage increased to 85% with them are proper names. Another significant un- the 73% unambiguous words from the resource. recognized group are conjunctions and preposi- This is a usable accuracy, but the limitations are tions, which are very frequent and happened not that it requires almost 700,000 wordlemma pairs, be included in the resource. 
The proper names has no generalization capability, and likely con- are a group that is hard to predict so we can- tains some errors evident in the resource. not assume that they would be covered by a bet- ter resource. However, the names follow simi- 5.3 Stemmer Evaluation lar morphological patterns as common nouns, Step 4.1: Creation of stem-classes. After which is an additional evidence that an algorith- transitive closure, 677,868 unique words from mic approach would be advantageous, and that it the resource are distributed into 41,681 classes, would generalize better. Within these 61%, 50% giving on average 16.3 words per class. The words in the corpus (49.79% more precisely) are number of words per class varies between 1 and unambiguous in the resource (ambiguity(w)=1). 307 words per class. The classes with more than This is about 50/61 ≈ 82% of recognized words, 80 words are very sparse. For example, two larg- which is a less optimistic evaluation than the est stem classes have 307 and 283 words. After one obtained for uniform distribution. This im- examining them, we see that they are created by plies that SDDL would have accuracy of at least incorrectly merging two or more proper stem 50%, and likely not much higher than 61%, as- classes, likely due to some erroneous word-lem- suming that some simple strategy for unknown ma pairs. The most frequent stem-class size is 7, words is used. which is 29% of the classes. The distribution of 1 457 (1,1%) 8 3946 (9,5%) class sizes with more than 1% of all stem classes 4 1436 (3,4%) 9 1494 (3,6%) is shown in Table 3. 5 1703 (4,1%) 12 3962 (9,6%) Step 4.2: Generation of stems and suffixes. 6 1320 (3,2%) 13 2433 (5,8%) After producing stems and suffixes in this step, 7 11942 (28,7%) 29 547 (1,3%) any empty stems obtained are indicators of in- 30a VLADO KEŠELJ, DANKO ŠIPKA correct stemclasses. 
In a number of cases it was fix method used to generate stems created some caused by the prefix ‘naj-’, which is used in su- additional overlap among stem-classes, effec- perlative inflections of adjectives and adverbs. tively merging them: 39,289 stems are created, As we noted before, the prefixbased derivations 1,823 (4.6%) of those were ambiguous in the should not be treated in stemming. As an illustra- sense that they were associated with more than tion, if we are searching a document collection one stem-class. Only 253 had ambiguity level of for the highest mountain peak in the world, we three or more, with the stem ambiguity level de- are likely interested in ‘highest’ precisely and not creasing quickly when sorted in descending or- ‘high’ peaks or comparison ‘higher peak’, even der. The most ambiguous stems are given in the though these are conflated in English. Removal list below. of prefix ‘naj-’ would cause additional errors 43 ist 18 post 16 samo 14 ekst since it appears as a prefix in non-superlative 26 rast 18 sat 15 ust 13 zast words, such as ‘najamnik’ and ‘najahati’. The 12 ost 7 pos 7 nast issue could be resolved by removing the prefix 12 konst 7 podst 7 nas ‘naj-’ only when matched with the corresponding These highly ambiguous stems do not main- superlative suffixes ‘-ija’, ‘-iji’, and similar. We tain meaning of the word, and an improvement address this problem by separating superlative method for this step is a part of our future work. and nonsuperlative stem-classes. Step 4.3: Sorting suffixes by frequency. In Another source of empty stems are irregu- this step 18,274 suffixes were generated, and the lar inflections, such as the plural noun ‘ljudi’ of top of the sorted list of generated suffixes with ‘čovek’ or ‘čovjek’ and auxiliary verb form ‘ćeš’ frequency is given in the table below. of ‘biti’. Both of these could be handled by an 24833 -e 6495 -ој exception list, but we decide to separate them in 22874 -u 6475 -omu different stem-classes. 
We assume that an IR user would not expect a search term to be expanded in this way (e.g., for ‘ljudi’), or auxiliary verbs would be removed as stop-words anyway.

The stems of length 1 are suspected of belonging to incorrect classes, but they were not systematically removed. One example is the word ‘beže’, which is both a present-tense form of the verb ‘bežati’ (Engl. to escape) and the vocative case of the noun ‘beg’ (Engl. bey), which leads to an erroneous merge of two otherwise correct stem classes. In the first run, 650 words produced an empty stem. For all of them we manually fixed the original resource, which caused a break-up of the corresponding stem classes and the production of non-empty stems. Short stems (e.g., of length 1) are also frequently created by incorrectly merged stem classes, but we hypothesized that it may not be necessary to fix them in this experiment, since the later methods use the most frequent suffixes, which should have high reliability. The maximal common pre-

The most frequent suffixes, with their frequencies, were:

  22389  -i             6121  -oga
  22184  -a             6118  -og
  19475  -om            5929  -ti
  17756  -o             5775  -t
  16190  '' (empty)     4412  -h
   8996  -im            4399  -m
   8281  -ama           4303  -ćeš
   8101  -ih            4289  -ću
   7573  -te            4273  -le
   7472  -ima           4272  -la
   6821  -mo            4268  -li
                        4252  -će

All of these suffixes have a linguistic interpretation.

Step 4.4: Generation of Suffix Rules. A direct implementation of the evaluation of the different suffix-rule generation approaches led to very slow evaluation. An efficient implementation with a compact trie (historically also known as a Patricia trie) over reversed strings reduced the running time significantly, from 5–6 hours in the initial experiments to about 5–10 minutes.

(4.4a) Frequency-based Subsumption Stemmer. For the frequency-based subsumption stemmer, we started with an empty set of valid suffixes and incrementally added one rule at a time, in the order determined by rule frequency. After each step, the stemming accuracy according to our generated stems is measured. It started at 2.4% with Sv = ∅, gradually increased to 56.3% with 98 suffix rules, and then gradually decreased to 14.2% when all 17,839 suffixes were included.

(4.4b) Greedy Subsumption Stemmer. In the greedy approach, we add rules in the same order as in 4.4a, but before and after adding each suffix we measure the accuracies A1 and A2, expressed as the number of correct stems. The rule is accepted if A2 − A1 ≥ θ, where θ is a given parameter; i.e., a suffix is accepted only if it improves accuracy by at least the given threshold. For example, if θ = 0, then a suffix is accepted only if it does not decrease the overall accuracy; if θ = 1, then the number of correct stems must increase by at least 1, and so on. The higher the parameter θ is, the better generalization we expect, since fewer rules of higher quality are chosen, but accuracy may decrease. The results are shown in the following table.

  θ      Valid suffixes   Accuracy (%)
  0          9849           74.15
  1          8633           74.16
  2          3367           73.38
  3          1901           72.95
  4          1557           72.83
  5          1262           72.66
  6          1124           72.56
  7          1002           72.46
  8           933           72.39
  9           878           72.32
  10          831           72.26
  15          673           71.99
  20          592           71.78
  25          497           71.48
  30          453           71.30
  35          423           71.16
  40          410           71.09
  45          380           70.90
  50          360           70.76
  60          347           70.65
  70          319           70.39
  80          310           70.29
  90          298           70.14
  100         294           70.23
  150         273           69.87
  200         230           68.80
  250         218           68.43
  300         202           67.77
  350         188           67.08
  400         180           66.65
  450         179           66.59
  500         175           66.31
  600         131           62.74
  700         121           61.82
  800         114           61.03
  900          87           57.76
  1000         85           57.48

Two interesting observations can be made: the accuracy is much higher than with the previous approach, and it initially drops very slowly while the number of rules drops quickly, which is another very encouraging observation. At θ = 7 we obtain 1002 suffix rules with an accuracy only about 1.7% below the best one. This fits well with our prediction that a stemmer for the Serbian language would need about 1000 suffix rules.

(4.4c) Optimal Suffix Stemmer. The accuracy of the optimal suffix stemmer is 81.83%. This is the upper bound of what can be achieved with the obtained set of valid suffixes and the corresponding suffix-removal rules, when evaluated on the produced set of stems. We can see that the greedy approach is not much lower, especially considering the argument that our goal should not be to match the optimal accuracy, since we would overfit the initial flaws of the lexical resources and some incorrect stems produced in the previously described process.

5.4 Unbiased Evaluation

To evaluate the stemmers in an unbiased way, we use the news corpus, run the stemmers on a sample set of words from the corpus, and manually judge the produced stems. We chose to evaluate two stemmers: 4.4c (Optimal Suffix Stemmer) and the greedy stemmer (4.4b) with the parameter θ = 7 and about 1000 generated rules. An interactive program reads the words from the corpus in sequence and runs both stemmers on them. Since we are more interested in words not included in the resource, the words that exist in the resource and for which the stemmers produce the same stem are ignored. Otherwise, the stems are presented for manual evaluation with five possible decisions: only greedy correct, only optimal correct, both correct, both incorrect, and ignore. The option ‘ignore’ is used to exclude some functional words, which are obvious stop-words, and some English words appearing in the corpus. A stem is judged to be correct if the original meaning can be clearly predicted from the stem (no over-stemming) and the stem appears to cover all morphological variations of the lemma (no under-stemming). After evaluating 1000 non-ignored words from the corpus (with possible repetitions), the result was: 127 words with only the greedy stemmer correct, 90 with only the optimal stemmer correct, 663 with both correct, and 120 with neither correct. These results confirm two of our hypotheses: (1) the stemmers produced in the process seem to be usable in IR (greedy accuracy 79% and optimal accuracy 75%); and (2) the greedy approach not only produces results as good as the optimal stemmer, but generalizes even better (better accuracy) with only about 1000 rules.

6 Conclusion and Future Work

In summary, we described and evaluated a largely automatic, general approach to generating stemmers for highly inflectional languages with only a few resources. Some limitations of the process were discovered, as well as opportunities for further improvement. The final evaluation has shown 79% accuracy on real data for the greedy stemmer, which is even a bit higher than the accuracy obtained on the training data, showing a very good generalization capability. Some directions for future work are: (1) evaluation on more data, (2) inclusion of suffix-substitution rules instead of just suffix-removal rules, and (3) inclusion of a stem-length parameter. With suffix-substitution rules, the method can be directly applied to lemmatizer generation.

Notes

1. The current official web site for the Porter stemmer is http://tartarus.org/~martin/PorterStemmer/, and it is the authoritative source for implementations of the original stemmer. A quick test of the authenticity of a Porter stemmer implementation is the word ‘agreement’: it is not changed by the original Porter stemmer, while some incorrect implementations change it.
2. CPAN (Comprehensive Perl Archive Network), http://cpan.org/, is an open-source repository for Perl packages.
3. http://www.mf.uni-lj.si/ds/new-stemmers.html
4. http://snowball.tartarus.org/archives/snowball-discuss/0722.html
5. http://svn.rot13.org/index.cgi/stem-hr
6. Source: the Snowball mailing list.
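The rule-selection loops of Steps 4.4a and 4.4b can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the helper names, the toy data, and the naive linear scan in `stem` (which the actual system replaces with a trie over reversed strings for speed) are all ours.

```python
# Sketch of greedy suffix-rule selection (Step 4.4b); illustrative only.
# A word is stemmed by removing the longest valid suffix; accuracy is
# the number of words whose result matches the reference stem.

def stem(word, valid_suffixes):
    """Strip the longest valid suffix, keeping a non-empty stem.
    (Naive linear scan; the paper uses a trie over reversed strings.)"""
    best = ""
    for s in valid_suffixes:
        if s and word.endswith(s) and len(s) < len(word) and len(s) > len(best):
            best = s
    return word[:len(word) - len(best)] if best else word

def correct_count(valid_suffixes, words, ref_stems):
    """Accuracy as the number of correctly produced stems."""
    return sum(stem(w, valid_suffixes) == r for w, r in zip(words, ref_stems))

def greedy_select(candidate_suffixes, words, ref_stems, theta=0):
    """Try suffixes in the given (frequency) order; keep a suffix only if
    it improves the number of correct stems by at least theta."""
    valid = set()
    a1 = correct_count(valid, words, ref_stems)
    for s in candidate_suffixes:
        valid.add(s)
        a2 = correct_count(valid, words, ref_stems)
        if a2 - a1 >= theta:
            a1 = a2          # accept the rule
        else:
            valid.remove(s)  # reject: insufficient improvement
    return valid
```

With theta = 0 a rule is kept whenever it does not decrease accuracy; Step 4.4a corresponds to accepting every rule unconditionally.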

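The speed-up reported in Step 4.4 rests on storing suffixes reversed in a trie, so that the longest matching suffix of a word is found in a single left-to-right walk over the reversed word. A minimal sketch of this idea, using a plain (uncompressed) trie rather than a compact Patricia trie, with all names being our own illustrative choices:

```python
# Illustrative sketch: suffixes inserted reversed into a trie make
# longest-suffix lookup a single walk over the reversed word.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.is_rule = False  # a valid suffix ends at this node

def build_suffix_trie(suffixes):
    """Insert each suffix reversed, marking rule-final nodes."""
    root = TrieNode()
    for suf in suffixes:
        node = root
        for ch in reversed(suf):
            node = node.children.setdefault(ch, TrieNode())
        node.is_rule = True
    return root

def strip_longest_suffix(word, root):
    """Return word with its longest valid suffix removed (stem kept non-empty)."""
    node, best = root, 0
    for depth, ch in enumerate(reversed(word), start=1):
        node = node.children.get(ch)
        if node is None:
            break
        if node.is_rule and depth < len(word):
            best = depth  # remember the deepest rule seen so far
    return word[:len(word) - best]
```

A compact (Patricia) trie additionally collapses chains of single-child nodes into one edge, which reduces memory and pointer-chasing but does not change the lookup logic sketched here.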
References

[1] K. Beesley and L. Karttunen. Finite State Morphology. CSLI, 2003.
[2] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, 2nd edition, 2002.
[3] Chris Jordan, John Healy, and Vlado Kešelj. Swordfish: An unsupervised ngram based approach to morphological analysis. In SIGIR'06: Proceedings of the 29th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 657–658, Seattle, Washington, USA, August 2006. ACM Press.
[4] Daniel Jurafsky and James H. Martin. Speech and Language Processing. Prentice Hall Series in Artificial Intelligence. Prentice Hall, 2000.
[5] Mirko Popović and Peter Willett. The effectiveness of stemming for natural language access to Slovene textual data. Journal of the American Society for Information Science, 43(5):384–390, 1992.
[6] Martin F. Porter. An algorithm for suffix stripping. Program, 14(3):130–137, July 1980.
[7] Martin F. Porter. Snowball: A language for stemming algorithms. Published on WWW, October 2001. Last access in April 2007.
[8] S. Sheremetyeva, W. Jin, and S. Nirenburg. Rapid deployment morphology. Machine Translation, 13(4):239–268, 1998.
[9] Marko Tadić. Hrvatski lematizacijski poslužitelj (Croatian Lemmatization Server). Published on WWW, 2005. Last access in April 2007.
[10] Duško Vitas and Cvetana Krstev. Derivational morphology in an e-dictionary of Serbian. In Proceedings of the 2nd Language & Technology Conference, pages 139–143, Poznan, Poland, April 21–23, 2005.