An Extensible Multilingual Open Source Lemmatizer
Ahmet Aker (a, b), Johann Petrak (a) and Firas Sabbah (b)
(a) Department of Computer Science, University of Sheffield
(b) Department of Information Engineering, University of Duisburg-Essen

Proceedings of Recent Advances in Natural Language Processing, pages 40–45, Varna, Bulgaria, Sep 4–6 2017. https://doi.org/10.26615/978-954-452-049-6_006

Abstract

We present GATE DictLemmatizer, a multilingual open source lemmatizer for the GATE NLP framework that currently supports English, German, Italian, French, Dutch, and Spanish, and is easily extensible to other languages. The software is freely available under the LGPL license. The lemmatization is based on the Helsinki Finite-State Transducer Technology (HFST) and lemma dictionaries automatically created from Wiktionary. We evaluate the performance of the lemmatizers against TreeTagger, which is only freely available for research purposes. Our evaluation shows that DictLemmatizer achieves similar or even better results than TreeTagger for languages where there is support from HFST. The performance drops when there is no support from HFST and the entire lemmatization process is based on lemma dictionaries. However, the results are still satisfactory given the fact that DictLemmatizer is open-source and can be easily extended to other languages. The software for extending the lemmatizer by creating word lists from Wiktionary dictionaries is also freely available as open-source software.

1 Introduction

The process of lemmatization is an important part of many computational linguistics applications such as Information Retrieval (IR) and Natural Language Processing (NLP). In lemmatization, inflected forms of a lexeme are mapped to a canonical form that is referred to as the lemma. The task of finding the correct lemma for a word in context is often complicated by the fact that a word can be the inflected form of more than one lexeme each of which may have different lemmas. Lemmas can be used in various ways for NLP, for instance, to improve the performance of text similarity metrics. For this application, all words are mapped to their lemma before a similarity is calculated. Lemmas are also often used in information retrieval and information extraction to better identify and group terms which occur in their inflected forms.

The task of finding lemmas is different and harder than finding stems. Stemming is often used as a much cruder heuristic approach to map inflectional forms of words to some canonical form, but unlike lemmatization does not differentiate between different lexemes which could have the same inflectional form and it is possible for the stem of a word to not be a valid lexeme of the language.

The TreeTagger (Schmid, 2013) software provides lemmatization for 20 languages including English, German, Italian, French, Dutch and Spanish. However, it is not open source and it is not straightforward to use it for non-research or commercial applications. There exist a few other lemmatizers which are open for non-research purposes (Lezius et al., 1998; Perera and Witte, 2005; Bär et al., 2013; Cappelli and Moretti, 1983; https://github.com/giodegas/morphit-lemmatizer). However, these lemmatizers are mostly concerned with only one language and do not provide a broad coverage like the TreeTagger.

In this paper, we describe GATE DictLemmatizer, a plugin for the GATE NLP framework (https://gate.ac.uk) (Cunningham et al., 2011) that performs lemmatization for English, German, Italian, French, Dutch, and Spanish and is freely available under the LGPL license. The GATE NLP framework is one of the most widely used frameworks for applied natural language processing.
It is implemented in Java, freely available under the permissive LGPL license and can be extended through plugins.

Our method combines the Helsinki Finite-State Transducer Technology (HFST; http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/) (Lindén et al., 2011) and word-lemma dictionaries obtained from Wiktionary. Since we use separate dictionaries depending on the word category, the method also depends on a POS tagger for the language. The word dictionaries are obtained automatically from Wiktionary (https://www.wiktionary.org/) data dumps. The code for creating the dictionaries automatically is available as free and open-source software (https://github.com/ahmetaker/Wiktionary-Lemma-Extractor). This software can be used to easily add dictionaries for new languages to the DictLemmatizer. The plugin also contains the HFST models for the 4 languages for which models are available: English, German, French and Italian (https://sourceforge.net/projects/hfst/files/resources/morphological-transducers/).

The rest of the paper is structured as follows. First we describe our method of performing lemmatization (Section 2). Our lemmatizer uses automatically generated lemma dictionaries. The process of obtaining such dictionaries from Wiktionary is outlined in Section 3. In Section 4 we detail the release information. Next, in Section 5 we evaluate the performance of our lemmatizer. We use the TreeTagger for comparison. We conclude in Section 6.

2 Method

To obtain lemmas we combine two strategies: the Helsinki Finite-State Transducer Technology (HFST; http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/) and word-lemma dictionaries obtained from Wiktionary (https://www.wiktionary.org/). For both strategies, it is necessary to know the coarse-grained word categories such as “noun”, “verb”, “adposition” for each word.

For this purpose, the lemmatizer requires the Universal POS tags (http://universaldependencies.org/u/pos/all.html) from the Universal Dependencies project. In GATE (Cunningham et al., 2011), POS tags can be created using different methods or plugins, however for the evaluation in this paper we use the ANNIE POS-tagger (Cunningham et al., 2002) for English and the Stanford CoreNLP POS tagger (Toutanova et al., 2003) for all other languages. These language-specific POS tags are then converted to Universal Dependencies tags using mappings adapted from https://github.com/slavpetrov/universal-pos-tags (Petrov et al., 2011).

The lemmatizer first tries to look up each word form in the dictionary that matches the language and word category of the word. Currently there are lists for the following categories: adjective, adposition, adverb, conjunction, determiner, noun, particle, pronoun, verb. If the word form is found in the dictionary, the corresponding lemma is used. Pre-generated dictionaries for the six supported languages are included with the plugin.

If the word could not be found in the dictionary, an attempt is made to find the lemma by using the HFST model for the language, if it is available. The HFST model returns for each word all possible morphological variants. This makes it difficult to directly find the lemma for the word. We therefore implemented rules that use the Universal POS tag information and extract the correct lemma. E.g. for the word “computers” the HFST returns the following options:

compute[V]+ER[V/N]+N+PL
computer[N]+N+PL

Since we know from the POS tagger that “computers” is a noun we can use that information and extract from the HFST list the entry that refers to a noun ([N]) - “computer”.

The HFST models are freely available only for a few languages. For any language where there is no HFST model, our lemmatizer will rely only on the Wiktionary-based dictionaries. (In this case, it is also possible to make DictLemmatizer work without any POS tags at all by merging the original dictionaries per word type into one dictionary for unknown/unidentified POS type.)

3 Parsing dictionaries

We implemented a Java based tool that allows users to extract lemma information from the Wiktionary API. With this tool it is easy to create dictionaries for additional languages not included in the lemmatizer distribution. We refer to this tool as Wiktionary-Lemma Extractor. It fetches for a given word form its lemma from the Wiktionary page. In addition the tool expects the language information, such as English, German, etc. Once these pieces of information are provided the tool fetches through the Wiktionary API the English version of the Wiktionary page for the queried word. The English Wiktionary page is divided into different areas where each area conveys a particular information such as lemma, synonym, translation, etc. Our tool isolates the lemma area and finds the non-inflected form for the queried word. The queried word and the non-inflected form are saved into a database to be used as dictionary lookup.

4 Software Availability

4.1 GATE DictLemmatizer Plugin

Most of the tools and resources for the GATE NLP framework are created as separate plugins which can be used as needed for a processing pipeline. The approach for finding lemmas described earlier has been implemented as a GATE plugin and is freely available from https://github.com/GateNLP/gateplugin-dict-lemmatizer. This plu-

• German Tiger Corpus (DE-Tiger) (Brants et al., 2004)
• Universal Dependencies English tree bank (EN-UD) (Bies et al., 2012)
• Universal Dependencies French tree bank (FR-UD)
• Universal Dependencies German tree bank (DE-UD)
• Universal Dependencies Spanish tree bank (ES-UD)
• Universal Dependencies Spanish Ancora corpus (ES-Ancora)

For more information on the Universal Dependencies tree banks see McDonald et al. (2013).

All corpora were converted to GATE documents using format specific open-source software (footnotes 12, 13, 14). The software and setup for carrying out all evaluation is also available online (footnote 15).

Note that for this comparison, the GATE Generic Tagger Framework plugin (footnote 16) was used to wrap the original TreeTagger software.
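
As an illustration of the lookup order described in Section 2 (dictionary first, then HFST analyses filtered by the POS tag), the following is a minimal Python sketch and not the plugin's actual code, which is written in Java. The names `DICTIONARIES`, `UPOS_TO_HFST`, `lemma_from_hfst`, `lemmatize` and `fake_hfst` are hypothetical, the dictionary entries are toy data, and both the tag-matching rule and the choice to return the word unchanged when nothing matches are assumptions made for the example.

```python
import re

# Hypothetical toy dictionaries keyed by (language, Universal POS tag).
# The real plugin ships dictionaries generated from Wiktionary.
DICTIONARIES = {
    ("en", "NOUN"): {"mice": "mouse"},
    ("en", "VERB"): {"went": "go"},
}

# Assumed mapping from Universal POS tags to the bracketed tags
# that appear in the HFST analyses quoted in the text.
UPOS_TO_HFST = {"NOUN": "N", "VERB": "V"}


def lemma_from_hfst(analyses, upos):
    """Return the base form of the first analysis whose leading tag matches upos."""
    tag = UPOS_TO_HFST.get(upos)
    if tag is None:
        return None
    for analysis in analyses:
        # An analysis looks like "computer[N]+N+PL":
        # base form, bracketed category tag, then further morphological tags.
        match = re.match(r"([^\[\]+]+)\[([^\]]+)\]", analysis)
        if match and match.group(2) == tag:
            return match.group(1)
    return None


def lemmatize(word, lang, upos, hfst_analyze=None):
    """Try the dictionary first; fall back to HFST analyses if a model is given."""
    lemma = DICTIONARIES.get((lang, upos), {}).get(word)
    if lemma is not None:
        return lemma
    if hfst_analyze is not None:
        lemma = lemma_from_hfst(hfst_analyze(word), upos)
        if lemma is not None:
            return lemma
    # Assumption for this sketch: leave the word unchanged when nothing matches.
    return word


def fake_hfst(word):
    """Stand-in for a real HFST model; returns only the analyses quoted in the text."""
    known = {"computers": ["compute[V]+ER[V/N]+N+PL", "computer[N]+N+PL"]}
    return known.get(word, [])
```

With these toy definitions, `lemmatize("mice", "en", "NOUN")` is answered from the dictionary, while `lemmatize("computers", "en", "NOUN", fake_hfst)` falls through to the analyses and selects the entry tagged `[N]`. Passing no analyzer mimics a language without an HFST model, where only the dictionaries are consulted.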