
Corpus based evaluation of stemmers

István Endrédy

Pázmány Péter Catholic University, Faculty of Information Technology and Bionics, MTA-PPKE Language Technology Research Group, 50/a Práter Street, 1083 Budapest, Hungary, [email protected]

Abstract

There are many available stemmer modules, and the typical questions are usually: which one should be used, which will help our system, and what is its numerical benefit? This article is based on the idea that any corpus with lemmas can be used for stemmer evaluation, not only for measuring the quality of the lemmas themselves but for IR quality as well. The sentences of the corpus act as documents (hit items of the retrieval), and its words are the testing queries. This article describes this evaluation method and gives an up-to-date overview of 10 stemmers for 3 languages with different levels of inflectional morphology. An open source, language independent python tool and a Lucene based stemmer evaluator in java are also provided, which can be used for evaluating further languages and stemmers.

1. Introduction

Information retrieval (IR) systems have a common critical point. It is when the base form of an inflected word form is determined. This is called stemming. There are two types of the common base word form: lemma or stem. The lemma is the base or dictionary form of a word, which is the grammatically correct, morphological root of the word. The stem, however, is a common root form of a word, which is usually made by cutting typical endings from the end of the word. The stem need not be a full word. For instance, the word form spies has the lemma spy, and its stem might be just sp. Stemmers are thus algorithms that remove certain endings from word forms. Lemmatizers, however, are special stemmers that produce the dictionary form of a word. Lemmatizers are able to deal with irregular word forms, as they perform dictionary and morphological operations on the input. They usually use more resources than stemmers. Therefore, every lemmatizer is a high quality and more resource-intensive stemmer.

The advantage of lemmatizers is that they know the words exactly, so they have good precision on known words. But they usually cannot handle unknown words. By contrast, there are no unknown words for algorithmic stemmers. They are much simpler, smaller and usually faster than lemmatizers, and for many applications their results are good enough. For instance, the final result of an IR does not require the exact lemma; it is enough to assign the same stem to the related word forms. One of the famous algorithmic stemmers is the Porter stemmer for English (Porter, 1980). This logic was expanded, and a string processing language, Snowball, was created for defining stemming algorithms (Porter and Boulton, 2001).

Stemming has a strong impact not only on the precision and recall of searching, but it can also reduce the size of the index. In the case of agglutinative languages (such as Finnish or Hungarian), where a single word might have thousands of word forms, stemming is essential.

There are several available open source IR systems, for instance Lucene (Hatcher et al., 2004) and systems based on it: Apache Solr (Smiley et al., 2015) and Elasticsearch (Gormley and Tong, 2015). They are shipped with several stemmers. However, it is not easy to decide which stemmer to choose for a given system. Previous evaluations are older than these systems, and there was no available evaluation covering them.

This article would like to give an up-to-date overview about stemmers, and also to provide a simple method and a language-independent python and java tool (source code is available at https://github.com/endredy/stemmerEval), which can be used to compare and evaluate any stemmers for further languages.

2. Related work

Stemmers can be evaluated in several ways. First, their impact on IR performance is measurable by test collections. The earlier stemmer evaluations (Hull, 1996; Tordai and De Rijke, 2006; Halácsy and Trón, 2007) were based on pre-defined queries and their expected document sets. These sets were called experimental collections. A few hundred queries were constructed by hand and evaluated on thousands of documents for relevance with respect to these queries. This dataset had to be made very carefully: the selection had a great impact on the final results. Stemmers were evaluated against this dataset. Ranking was also part of the evaluation.

These evaluations might show the usefulness of a stemmer, but a stemmer developer does not get feedback about the typical errors of the module. Thus error metrics (Paice, 1994) were defined which might give an insight into the comparison of stemmers and can also give detailed information about errors. There are basically two types of stemming error: overstemming and understemming. The overstemming index (OI) shows how many times a stemmer wrongly cuts too many letters from a word form. The higher the OI, the more unrelated words are connected to the same stem in an IR, and precision is decreased. The understemming index (UI) counts how many times a stemmer misses removing a suffix. This type of error results in the separation of related word forms; in an IR it causes lower recall. The ideal case is when OI and UI are both zero. The ratio OI / UI means the weight or strength of a stemmer: a light stemmer strips only a few affixes from words, a heavy one eats them with greater appetite. This weight does not represent the quality of a stemmer, it only characterizes its suffix stripping mode. Another rate was defined to compare stemmers: error rate relative to truncation (ERRT), which shows the distance of the (UI, OI) series from the ideal one.
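To make the two error types concrete, the following minimal Python sketch (our own illustration, not the Paice formulation itself and not part of the tool described in this article) counts word pairs that a stemmer wrongly conflates or wrongly keeps apart, given (word form, gold lemma, stem) triples; the sample triples are invented.

    from itertools import combinations

    def conflation_errors(triples):
        """triples: (word form, gold lemma, stem). Counts pair-level
        overstemming (different lemmas, same stem) and understemming
        (same lemma, different stems) in a simplified way."""
        over = under = 0
        for (_, lem_a, stem_a), (_, lem_b, stem_b) in combinations(triples, 2):
            if lem_a != lem_b and stem_a == stem_b:
                over += 1      # unrelated words merged: hurts precision
            elif lem_a == lem_b and stem_a != stem_b:
                under += 1     # related forms kept apart: hurts recall
        return over, under

    # hypothetical example: the stem "sp" merges spies with speed (overstemming),
    # while spies and spy get different stems (understemming)
    sample = [("spies", "spy", "sp"), ("spy", "spy", "spy"), ("speed", "speed", "sp")]
    print(conflation_errors(sample))   # -> (1, 1)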
3. Metrics on the direct output of the stemmers

Stemmers are evaluated by six metrics. First, a corpus with lemmas is needed, whose lemmas will be the gold standard. Second, each word of the corpus is stemmed with various stemming modules. Their results can be compared to the gold standard. Some languages may have more than one lemma for the same word form. Most of the leading search tools are able to handle this phenomenon (Lucene, Elasticsearch, MS Indexing Service, MSSQL Full-text search, etc.), others are not (e.g. Tovek tools). A stemmer module typically has one input word without its context, so it has no chance to disambiguate its stem alternatives. It might provide only a heuristic ranking. There can be more strategies: use the first one, use the most probable one, or use all of them.

The tool counts lemma accuracy first: how many times the first lemma is correct. Secondly, another three metrics are applied which mark the false alternatives in a strong and a weak way. The latter one is called average precision at maximum recall (Lindén, 2009): for n lemmas before the correct one, we get a precision of 1/(n+1). The benefit of this metric is that it evaluates the lemma ranking of the stemmer.

Another metric assumes that every lemma will be stored, so each incorrect lemma means a false positive (fp), correct lemmas mean true positives (tp), and a missing one means a false negative (fn). The third metric concentrates on the first stem: correct means tp, incorrect is fn, and fp is the case when the correct stem exists among the alternatives (but not at the first place). The fourth metric takes into consideration all correct lemmas occurring in the corpus compared to the output of a stemmer. Each word form in the corpus gets all its lemmas, and these sets are compared to the output of a stemmer. Their intersection is tp, lemmas only in the corpus are fn, and lemmas only from the stemmer are fp.

In addition, the Paice metrics described in Section 2. are also counted: UI, OI and ERRT, using the NLTK library (Bird, 2006). It is based on a list of words with their lemmas and stems, which is available. In each metric the F-score is the harmonic mean of the recall and the average precision.

To sum it up, five metrics are applied to the output of the stemmers, which measure their lemma accuracy, their lemma ranking, their incorrect lemmas, their lemmas compared to all possible correct lemmas, and the Paice metrics as well. No doubt, the stemmers are at a disadvantage in this evaluation (they do not produce a lemma by design), but these metrics might give an overview about the stemmer and the language.
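Restating the ranking metric and the F-score described above in formulas (the symbol names are ours, not from the original text): for a word w with n_w lemma alternatives ranked before the correct one,

    \[ \mathrm{score}(w) = \frac{1}{n_w + 1}, \qquad F = \frac{2PR}{P + R} \]

where P and R are the (average) precision and the recall of the given metric.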
4. Metrics on the IR quality of the stemmers

The main idea of this article is that every corpus with lemmas can be used as an IR evaluation data set. In the tool presented here, every sentence in the corpus will be a result item of an IR (hit), and its words (in their original forms) are the queries. These word-sentence connections will be used as a gold standard, and each stemmer will be tested against this: the calculation of precision and recall is based on the sets of sentences determined by the stemmers and by the gold standard (illustrated in Figure 1).

[Figure 1: IR quality evaluation of a stemmer, based on a corpus with lemmas: sentences=documents and words=queries, sentence sets are compared to the gold standard]

On the one hand, the tool contains a Lucene evaluation java code as well: the sentences of the corpora (as separate documents) are indexed by the stemmers, and each word (as a query) gets a result set which is compared to the gold standard. This is a classical collection set based evaluation, but the collection is big (millions of documents) and it is made automatically from the corpus. This java code evaluates in two ways: it evaluates all result items, and it evaluates only the first n results. The latter option reflects the mode when a human verifies the results: only the first n items matter.

4.1. Experimental evaluation method

On the other hand, there is another approach without an IR system. Every word of the corpus has a sentence set, and the gold standard sets are defined by the hand annotated lemmas of the corpus, while the test sets come from the stemmers. The following F-score calculation makes it possible to evaluate the usefulness of a stemmer without evaluating lemma precision or even using an IR system. This means that during a test process each word of the corpus is stemmed by the tested modules, and our evaluation tool stores the original word form, its stem(s) and the sentence identifier (SID). Stopwords are skipped. At the end of the process, the SID set of every input word is compared to the gold standard. A standard metric was used for measuring performance. If a SID of an input word from a stemmer is in the gold standard, it means a true positive; if not, it means a false positive. Finally, if a SID of a word exists in the gold standard but it is missing from the SID set of the same word from a stemmer, it means a false negative result.

The F-score of this evaluation method correlated with the native IR evaluation made by Lucene (Section 4.) if all result items were evaluated (not only the first n). It was verified by Pearson correlation.

The false positive errors are usually caused by an overstemmed word: for instance, occasions gets the stem occas, so it might get hits with occasionally, occasionalism or even worse. These false positive hits reduce precision. However, a real IR system has ranking, which might hide the noisy hits of overstemming: these hits are ranked towards the end of the result list, thus these false results do not cause problems in practice. How can we simulate ranking in this test? We assume that the true positive hits are top-ranked by the IR, therefore we apply a logarithmic weighting on the false positive results. This metric reflects a ranked hit list, where the items have decreasing importance on the list. (This is similar to the case when a human uses only the first n result pages of a search. Less accurate hits are acceptable on later pages, as the searcher probably never visits them.)

This method evaluates the quality of a stemmer or lemmatizer alone, assuming that the original word form is not added to the index. It just measures how exactly a stemmer clusters different inflected word forms.
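A minimal sketch of this sentence-set comparison is given below. It is our own illustration, not the released tool; in particular, the exact form of the logarithmic weighting of false positives is an assumption, as the paper does not spell it out.

    import math

    def evaluate(gold_sids, stemmer_sids, damp_fp=True):
        """gold_sids / stemmer_sids: dict mapping a query word to the set of
        sentence IDs (SIDs) it retrieves; the first comes from the annotated
        lemmas, the second from a stemmer. Returns (precision, recall, F)."""
        tp = fp = fn = 0.0
        for word, gold in gold_sids.items():
            found = stemmer_sids.get(word, set())
            tp += len(gold & found)
            extra = len(found - gold)
            # assumed damping: false positives counted on a log scale to mimic
            # an IR ranking that pushes noisy hits towards the end of the list
            fp += math.log(1 + extra) if damp_fp else extra
            fn += len(gold - found)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_score = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        return precision, recall, f_score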
5. The evaluation tool

The previously described methods were implemented. Gold lemma extraction from the corpora and the metrics described in Sections 3. and 4.1. were implemented in Python; the Lucene based IR evaluation (the method of Section 4.) was implemented in java. These tools were used in the evaluation of 10 stemmer modules for 3 languages. First, the test inputs and their gold standards were generated from a corpus with lemmas. Each corpus has its own format, therefore the extraction script needed customization to produce uniform (word, lemma) tuples in a file, each tuple in a new line. Sentences were delimited by an empty line. Second, the stemmers were run on the input words, and their outputs were stored in separate files. These files and the gold standard were the inputs of our tool: it checked every word of the corpus and executed the methods described in Sections 3. and 4.

The native IR evaluation starts with extracting the sentences from the corpora; each sentence gets an ID, and all unique word forms get a sentence ID list: the given word occurs in those sentences. This will be the gold standard. Then each sentence was indexed as a separate document by every stemmer, and all unique words of the corpus were the queries. The result sets were compared to the gold standard. These codes are available as well.

6. Tested stemmer modules

One of the most popular open source stemmers is Hunspell, which is widely used, mainly in open source projects. At the time of writing this article, more than 30 applications use it, including LibreOffice, OpenOffice, Firefox, Thunderbird and Chrome. Basically it is used for spell checking, but it can lemmatize as well. Its lexicons were created for several languages.

The Snowball stemmer is an algorithmic one; it can handle many languages, and it is available in several applications. During this evaluation it was used from the NLTK sdk (Bird, 2006), but it is also available from the Lucene engine. Snowball was adapted to Hungarian, too (Tordai and De Rijke, 2006).

A morphological analyzer, Humor (High-speed Unification MORphology) (Prószéky and Kis, 1999), has a lemmatizer module as well, which was also evaluated. It was developed by the author. This product can be found in several applications (Microsoft products, web applications), and it is not open source. It has lexicons for more than 20 languages. Hunmorph (Trón et al., 2005) and Hunmorph-foma (Hulden, 2009) were also tested.

Apache Lucene (Hatcher et al., 2004) is a popular, high-performance text search library with several stemming modules. Some of them were tested: KStem, Porter (Porter, 1980) and EnglishMinimal for English, and Stempel and Morfologik for Polish. (Lucene also has a Hunspell java implementation, but the original, c++ version was used.)
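For illustration, this is how one of the tested modules, the Snowball stemmer, can be called through the NLTK sdk mentioned above; the word list is an arbitrary example of ours.

    from nltk.stem.snowball import SnowballStemmer

    stemmer = SnowballStemmer("english")
    for form in ["spies", "occasions", "occasionally"]:
        print(form, "->", stemmer.stem(form))
    # whether e.g. "occasions" and "occasionally" end up with the same stem
    # is exactly the kind of behaviour the metrics of Sections 3. and 4. measure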
7. Corpora used in evaluation

Evaluation was done in three languages, which represent different levels of inflectional morphology: English (low), Polish (medium) and Hungarian (high). First, the British National Corpus (Clear, 1993) was used, which is a 100 million word snapshot of British English (with lemmas). The corpus contains more than 2,000,000 sentences from the widest possible range of linguistic productions, from several domains. Evaluation was made with the 3rd XML edition of the BNC.

A 1-million word subcorpus (Degórski and Przepiórkowski, 2012) of the National Corpus of Polish is available with 64,000 sentences, which was used for the Polish tests. It is also a well balanced, large corpus with several domains: it contains classic literature, daily newspapers, specialist periodicals and journals, transcripts of conversations, and a variety of short-lived and internet texts.

Hungarian tests were run on the Szeged Treebank (Csendes et al., 2005), the biggest Hungarian manually annotated corpus, which contains about 80,000 sentences with 1,200,000 words annotated with lemmas.

domain     tokens  no stem    Kstem      Porter     En-minimal Snowball   Hunspell   Humor
                   F     oov  F     oov  F     oov  F     oov  F     oov  F     oov  F     oov
ACPROSE    15.7M   78.4  0    92.6  0    67.1  0    91.6  0    71.4  0    87.1  9.8  93.3  4.8
CONVRSN    4.2M    73.6  0    93.6  0    82.6  0    91.6  0    87.0  0    90.5  5.3  88.5  4.9
FICTION    16.1M   72.8  0    90.3  0    75.8  0    86.4  0    79.5  0    86.6  6.1  91.4  3.9
NEWS       9.4M    69.8  0    92.1  0    74.7  0    90.0  0    79.4  0    86.1  11.9 91.4  7.4
NONAC      24.1M   74.4  0    92.1  0    70.1  0    90.6  0    74.5  0    86.7  9.9  92.6  5.3
OTHERPUB   17.9M   74.2  0    92.3  0    73.4  0    91.3  0    77.6  0    87.5  9.8  92.3  6.0
OTHERSP    6.1M    78.6  0    94.3  0    77.9  0    92.8  0    82.0  0    90.6  5.1  90.6  5.6
UNPUB      4.4M    72.2  0    92.7  0    72.8  0    91.3  0    76.6  0    87.7  9.1  92.6  5.5
total      98.3M   74.4  0    92.2  0    72.7  0    90.4  0    77.0  0    87.3  9.0  92.1  5.4

Table 1: First lemma evaluation of English stemmers on gold standard lemmas of the BNC (Section 3.)

8. Results

The size of this article does not allow showing every evaluation table (6+2 per language), therefore two aspects were selected: first lemma evaluation (the 3rd metric described in Section 3.) and IR evaluation by Lucene (Section 4.).

The results of the English stemmers are interesting. The recall of Kstem, Porter, En-minimal and Snowball is high but their precision is very low. It is caused by millions of fp hits. (On the contrary, without a stemmer the precision is high but recall is low.) One reason for these results is that the stemmers above split words with a hyphen into more words (for instance low-print-run becomes low, print and run), which causes low precision. Another type of error is overstemming: they cluster together too many different words which only share the same prefix. This decreases precision drastically. Nevertheless, overstemming and splitting words by hyphens cause a larger index and lower precision (Table 1, 2). The stemmers differ significantly in method 2 of Section 4. (according to a t-test), except Porter and Kstem.

Polish results also show that stemming quality has a great impact on the F-score of the IR. A stemmer can improve the F-score of an IR system at least 2 or even 2.5 times (see Table 4); the best scores belong to the Morfologik stemmer in both testing methods (Table 3 and 4).

The results of the Hungarian stemmers can be seen in Tables 5, 6 and 7. Since this language is heavily agglutinative, stemming improves the F-score significantly, even with the weakest stemmer (more than 50% difference compared to the results without any stemmer). The best scores belong to the Humor stemmer in both methods. Algorithmic stemming is not appropriate for this language. According to a t-test, the Hunmorph and Hunmorph-foma stemming results have no significant differences, the others have.

Another important property is the speed of a stemmer: how many tokens can be stemmed in a second by the given module. It correlates with the time of reindexing, which can be critical in a production environment. These performances can be seen in Table 8. Stemmers proved to be 4-5 times faster than lemmatizers.
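A throughput measurement of this kind can be approximated with a few lines of Python; the sketch below is only illustrative, and the exact procedure behind Table 8 is not specified in this article.

    import time

    def tokens_per_second(stem, tokens):
        """stem: a callable that stems one token; tokens: list of word forms."""
        start = time.perf_counter()
        for tok in tokens:
            stem(tok)
        elapsed = time.perf_counter() - start
        return len(tokens) / elapsed if elapsed > 0 else float("inf")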

9. Conclusion

In agglutinative languages the characteristics of the two evaluation methods are similar: the better the quality of the lemmatization, the better the quality of the IR will be. In the case of English, this connection is not verified. The corpus based evaluation makes it possible to create a gold standard for the IR evaluation of stemmers automatically. An experimental IR evaluation method was defined which works without a real IR system, and its result correlated with the real IR evaluation made by Lucene (Pearson correlation). When the IR evaluation was made only on the first n results, it was designed to simulate a human verification. However, it is also able to evaluate the ranking algorithm of the IR system itself.

The defined six metrics on the direct output of the stemmers make it possible not only to compare stemmers but also to give feedback about typical stemmer errors, which can be useful when developing stemmers.

This evaluation gives an up-to-date overview about the available stemmers, and it also provides a tool which can be used for evaluating other languages and stemmers.

Acknowledgments

I am grateful to Dr Nóra Wenszky and Dr Gábor Prószéky for their help and encouragement, to Attila Novák for thinking together, and to the anonymous reviewers for their valuable suggestions.

stemmer         tokens/second
En-minimal      86 165
Kstem           85 353
Porter          84 032
Snowball        47 651
Stempel         29 173
Morfologik      19 862
Hunmorph        275
Hunmorph-foma   27 850
Hunspell        13 105
Humor           5 629

Table 8: Speed of the stemmers, stemmed tokens/second (on an 8-core 1.1GHz CPU, 74GB memory, 64 bit ubuntu server)

10. References

Bird, Steven, 2006. NLTK: The Natural Language Toolkit. In Proceedings of the COLING/ACL. ACL.
Clear, Jeremy H., 1993. The digital word, chapter The British National Corpus. Cambridge, MA, USA: MIT Press, pages 163–187.
Csendes, D., Csirik, J., Gyimóthy, T., and Kocsor, A., 2005. The Szeged Treebank. In Text, Speech and Dialogue. Springer.
Degórski, Łukasz and Adam Przepiórkowski, 2012. Ręcznie znakowany milionowy podkorpus NKJP. Wydawnictwo Naukowe PWN, pages 51–58.
Gormley, Clinton and Zachary Tong, 2015. Elasticsearch: The Definitive Guide. O'Reilly Media, Inc.
Halácsy, Péter and Viktor Trón, 2007. Benefits of resource-based stemming in Hungarian information retrieval. In Evaluation of Multilingual and Multi-modal Information Retrieval. Springer, pages 99–106.
Hatcher, Erik, Otis Gospodnetic, and Michael McCandless, 2004. Lucene in action.
Hulden, Mans, 2009. Foma: a finite-state compiler and library. In EACL. ACL.
Hull, David A., 1996. Stemming algorithms - a case study for detailed evaluation. Journal of the American Society for Information Science, 47:70–84.
Lindén, Krister, 2009. Entry generation by analogy - encoding new words for morphological lexicons. Northern European Journal of Language Technology, 1(1):1–25.
Paice, Chris D., 1994. An evaluation method for stemming algorithms. In Proceedings of the 17th ACM SIGIR conference on Research and development in information retrieval. Springer.
Porter, Martin F., 1980. An algorithm for suffix stripping. Program, 14(3):130–137.
Porter, M.F. and Richard Boulton, 2001. Snowball stemmer.
Prószéky, Gábor and Balázs Kis, 1999. A Unification-based Approach to Morpho-syntactic Parsing of Agglutinative and Other (Highly) Inflectional Languages. In R. Dale and K. Church (eds.), Proceedings of the 37th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics.
Smiley, David, Eric Pugh, Kranti Parisa, and Matt Mitchell, 2015. Apache Solr Enterprise Search Server. Packt Publishing Ltd.
Tordai, Anna and Maarten De Rijke, 2006. Four stemmers and a funeral: Stemming in Hungarian at CLEF 2005.
Trón, Viktor, András Kornai, György Gyepesi, László Németh, Péter Halácsy, and Dániel Varga, 2005. Hunmorph: open source word analysis. In Proceedings of the Workshop on Software. ACL.

domain                     no stemmer  Kstem  Porter  En-minimal  Snowball  Hunspell  Humor
academic prose             42.5        58.8   51.7    55.6        49.8      53.4      48.7
conversation               52.9        72.8   75.8    67.4        74.3      65.7      77.4
fiction and verse          46.0        68.8   68.6    60.1        64.5      62.5      67.1
newspapers                 52.3        73.3   69.4    67.4        67.4      65.9      78.4
non-academic prose         48.3        66.3   58.6    63.2        55.8      59.4      54.7
other published materials  46.9        66.5   62.0    62.3        60.2      59.9      55.9
any other spoken text      52.4        74.3   70.7    68.7        69.2      64.9      78.0
unpublished materials      48.4        68.9   64.4    62.7        63.5      61.5      60.0
total                      47.2        66.5   61.4    61.8        59.0      59.9      59.0

Table 2: IR quality evaluation of English stemmers on the sentences of the BNC by Lucene (Section 4.)

domain                       tokens     no stemmer  Stempel      Morfologik   Hunspell     Humor
                                        F      oov  F      oov   F      oov   F      oov   F      oov
literature                   54 205     53.80  0    78.36  2.7   89.34  2.7   84.45  5.6   88.71  7.8
information and informative  56 779     54.08  0    77.31  3.9   89.09  3.9   84.38  7.7   87.87  5.8
conversations                59 024     68.26  0    72.60  9.5   83.62  9.5   78.58  11.3  79.63  12.8
fiction                      169 270    55.81  0    77.48  4.1   88.98  4.1   84.38  5.9   87.84  4.2
spoken radio                 23 303     64.52  0    73.94  6.4   86.11  6.4   82.12  8.1   83.43  10.2
research and teaching        20 229     50.23  0    79.56  3.6   89.64  3.6   85.90  7.1   89.38  5.8
internet                     72 273     55.27  0    77.15  3.2   88.78  3.2   83.48  8.0   86.02  7.8
journalistic                 506 214    51.78  0    79.23  2.8   90.03  2.8   86.10  6.1   89.16  4.6
written-Parliament           66 315     51.24  0    78.77  4.3   89.85  4.3   85.48  7.4   89.86  2.7
utilities                    30 998     52.22  0    80.82  1.2   90.65  1.2   87.48  7.5   91.38  2.3
not classified               10 140     54.38  0    78.72  2.0   90.11  2.0   85.55  3.9   88.52  3.9
total                        1 028 671  54.21  0    78.20  3.6   89.25  3.6   85.02  6.7   88.08  5.4

Table 3: First lemma evaluation of Polish stemmers on gold standard lemmas of the 1-million word PNC (Section 3.)

domain                       no stemmer  Morfologik  Stempel  Humor  Hunspell
literature                   44.6        85.9        75.8     80.6   81.5
information and informative  43.7        86.3        75.2     80.5   81.9
conversations                39.0        83.2        70.6     67.7   80.2
fiction                      32.8        85.9        72.5     77.2   80.8
spoken radio                 49.0        85.3        76.9     76.7   82.0
research and teaching        53.7        90.4        82.2     86.8   87.0
internet                     40.1        81.3        70.9     72.8   77.8
journalistic                 26.3        85.4        72.3     78.6   81.2
written-Parliament           42.8        92.6        77.3     87.4   90.6
utilities                    42.8        92.1        81.5     87.9   89.5
not classified               63.0        88.8        81.8     85.2   84.4
total                        31.3        85.7        73.0     78.6   81.6

Table 4: IR quality evaluation of Polish stemmers on the sentences of the 1-million word PNC by Lucene (Section 4.)

domain      tokens     no stemmer  Hunspell    Hunmorph-foma  Hunmorph    Snowball   Humor
                       F     oov   F     oov   F     oov      F     oov   F     oov  F     oov
literature  185 436    53.1  0     90.7  6.1   83.8  11.9     85.9  1.8   65.0  0    93.5  2.5
student     278 497    50.3  0     92.3  2.6   83.5  11.3     88.8  1.2   61.5  0    93.1  0.7
newspaper   182 172    59.9  0     88.2  7.8   84.4  14.2     80.1  4.0   72.8  0    94.6  1.8
IT          175 991    59.7  0     85.1  13.5  82.9  18.1     80.3  6.5   76.4  0    93.5  4.5
juristic    220 069    66.2  0     85.4  11.1  81.5  22.2     81.4  8.6   81.0  0    91.5  3.2
business    186 030    61.7  0     83.6  11.6  84.4  17.3     77.1  7.8   73.6  0    95.1  2.1
total       1 228 195  58.6  0     87.8  8.5   83.4  15.8     82.4  5.1   71.9  0    93.5  2.4

Table 5: First lemma evaluation of Hungarian stemmers on Szeged TreeBank (Section 3.)

domain      no stemmer  Hunspell  Hunmorph-foma  Hunmorph  Snowball  Humor
literature  27.5        91.8      92.8           86.7      68.7      96.2
student     20.2        95.1      93.5           91.2      68.7      97.6
newspaper   41.4        86.4      90.4           76.0      78.1      95.4
IT          25.1        89.9      91.1           85.0      81.9      96.7
juristic    19.7        91.1      89.6           88.4      87.2      95.8
business    25.5        90.1      93.3           90.3      82.3      97.8
total       24.5        91.4      91.8           87.5      78.3      96.7

Table 6: Experimental IR quality evaluation of Hungarian stemmers on the sentences of Szeged TreeBank (Section 4.1.)

domain      no stemmer  Hu-light  Snowball  Hunspell  Humor
literature  19.9        62.4      62.6      80.3      83.0
student     14.6        62.0      62.0      80.2      83.8
newspaper   26.5        69.4      72.8      74.7      84.6
IT          21.3        76.2      78.1      79.0      87.1
juristic    16.5        80.6      82.0      77.5      87.1
business    25.5        78.4      78.4      46.4      90.8
total       19.4        71.1      72.2      72.2      85.7

Table 7: IR quality evaluation of Hungarian stemmers on the sentences of Szeged TreeBank by Lucene (Section 4.)