Journal of Universal Computer Science, vol. 16, no. 9 (2010), 1190-1214
submitted: 21/1/10, accepted: 28/4/10, appeared: 1/5/10 © J.UCS

LemmaGen: Multilingual Lemmatisation with Induced Ripple-Down Rules

Matjaž Juršič
(Jožef Stefan Institute, Ljubljana, Slovenia
[email protected])

Igor Mozetič
(Jožef Stefan Institute, Ljubljana, Slovenia
[email protected])

Tomaž Erjavec
(Jožef Stefan Institute, Ljubljana, Slovenia
[email protected])

Nada Lavrač
(Jožef Stefan Institute, Ljubljana, Slovenia
and University of Nova Gorica, Nova Gorica, Slovenia
[email protected])

Abstract: Lemmatisation is the process of finding the normalised forms of words appearing in text. It is a useful preprocessing step for a number of language engineering and text mining tasks, and especially important for languages with rich inflectional morphology. This paper presents a new lemmatisation system, LemmaGen, which was trained to generate accurate and efficient lemmatisers for twelve different languages. Its evaluation on the corresponding lexicons shows that LemmaGen outperforms the lemmatisers generated by two alternative approaches, RDR and CST, both in terms of accuracy and efficiency. To our knowledge, LemmaGen is the most efficient publicly available lemmatiser trained on large lexicons of multiple languages, whose learning engine can be retrained to effectively generate lemmatisers of other languages.

Key Words: rule induction, ripple-down rules, natural language processing, lemmatisation

Category: I.2.6, I.2.7, E.1

1 Introduction

Lemmatisation is a valuable preprocessing step for a large number of language engineering and text mining tasks. It is the process of finding the normalised forms of wordforms, i.e. inflected words as they appear in text.