
arXiv:1907.11907v1 [cs.CL] 27 Jul 2019

Nefnir: A high accuracy lemmatizer for Icelandic

Svanhvít Ingólfsdóttir, Hrafn Loftsson
Department of Computer Science
Reykjavik University
{svanhviti16, hrafn}@ru.is

Jón Daðason, Kristín Bjarnadóttir
The Árni Magnússon Institute for Icelandic Studies
University of Iceland
{jfd1, kristinb}@hi.is

Abstract

Lemmatization, finding the basic morphological form of a word in a corpus, is an important step in many natural language processing tasks when working with morphologically rich languages. We describe and evaluate Nefnir, a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that for correctly tagged text, Nefnir obtains an accuracy of 99.55%, and for text tagged with a PoS tagger, the accuracy obtained is 96.88%.

1 Introduction

In text mining and Natural Language Processing (NLP), a lemmatizer is a tool used to determine the basic form of a word (the lemma). Lemmatization differs from stemming in the way this base form is determined: while stemmers chop off word endings to reach the common stem of words, lemmatizers take the morphology of the words into account in order to produce the base form, i.e., the common morphological form of a word as found in a dictionary. This type of text normalization is an important step in pre-processing morphologically complex languages, like Icelandic, before conducting various tasks, such as machine translation, text mining and information retrieval.

To give an example from the Icelandic language, lemmatization helps find all instances of the personal pronoun ég "I" in a text corpus, taking into account all inflectional forms (ég, mig, mér, mín, við, okkur, okkar). The number of inflectional forms of each word can be up to 16 for nouns and over a hundred for adjectives and verbs. The value of being able to reduce the number of different surface forms that appear for each word is therefore evident, as it is otherwise hard or even impossible to correctly determine word frequency in a corpus, or to look up all instances of a particular term.

In this paper, we describe and evaluate Nefnir (Daðason, 2018), a new open source lemmatizer for Icelandic. Nefnir uses suffix substitution rules, derived (learned) from the Database of Modern Icelandic Inflection (DMII) (Bjarnadóttir, 2012), which contains over 5.8 million inflectional forms. This new lemmatizer was used for large-scale lemmatization of the Icelandic Gigaword Corpus (Steingrímsson et al., 2018) with promising results, but a formal evaluation had not been carried out. Our evaluation of Nefnir indicates that, compared to previously published results, it obtains the highest lemmatization accuracy of Icelandic, with 99.55% accuracy given correct part-of-speech (PoS) tags, and 96.88% accuracy given text tagged with a PoS tagger.
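As a minimal illustration of the idea in the introduction, the following toy sketch collapses the inflectional forms of ég "I" to a single lemma before counting word frequencies. The lookup table is illustrative only; Nefnir itself derives suffix rules from a morphological database rather than using a closed list.

```python
# Illustrative toy example (not Nefnir): mapping the inflectional forms of
# the first-person pronoun ég "I" to one lemma before frequency counting.
from collections import Counter

# Forms cited in the introduction; a real lemmatizer covers the full lexicon.
PRONOUN_LEMMAS = {form: "ég" for form in
                  ["ég", "mig", "mér", "mín", "við", "okkur", "okkar"]}

def lemma(token: str) -> str:
    # Fall back to the surface form for tokens outside the toy table.
    return PRONOUN_LEMMAS.get(token, token)

tokens = ["ég", "sá", "við", "okkur", "mér"]
freq = Counter(lemma(t) for t in tokens)
print(freq["ég"])  # 4: all four pronoun forms count as one word
```

Without lemmatization, the same query would have to enumerate every surface form of the paradigm.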
particular a of instances all up look or to corpus, a in frequency word determine correctly loihshv endvlpd hs methods These developed. been various have and algorithms effective, more process rule-learning ó aao,KitnBjarnadóttir Kristín Daðason, Jón hnhn-rfigterlsta r sdto used are that rules the hand-crafting When large-scale for used was lemmatizer new This nti ae,w ecieadevaluate and describe we paper, this In ahn erigmtoseegdt aethe make to emerged methods learning Machine jd,kristinb}@hi.is {jfd1, h riMgúsnInstitute Magnússon Árni The Senrmsne l,21)wt promising with 2018) al., et (Steingrímsson nvriyo Iceland of University o clni Studies Icelandic for ta. 2018). al., et c ˇ clni iaodCor- Gigaword Icelandic Nefnir rely on training data, which can be a corpus of 2007) reports around 90% accuracy on a random words and their lemmas or a large morphologi- sample of 600 words from the IFD, when the in- cal lexicon (Jongejan and Dalianis, 2009). By an- put has been PoS tagged automatically (with a tag- alyzing the training data, transformation rules are ging accuracy of 91.5%). The PoS tagger used formed, which can subsequently be used to find was IceTagger (Loftsson, 2008), which is part of lemmas in new texts, given the word forms. the IceNLP natural language processing toolkit In addition, maching learning lemmatizers (Loftsson and Rögnvaldsson, 2007). These results based on deep neural networks (DNNs) have indicate that the accuracy of this lemmatizer is recently emerged (see for example finnlem very dependent upon the tags it is given. To our (Myrberg, 2017) for Finnish and LemmaTag knowledge, the Icelandic CST Lemmatizer model (Kondratyuk et al., 2018) for German, Czech and is not openly available. Arabic). Along with the best rule-derived machine 2.2 Lemmald learning methods, these are now the state-of-the- art approaches to lemmatizers for morphologically The second tool is Lemmald (Ingason et al., complex languages. 
2008), which is part of the IceNLP toolkit. It uses The biggest problem in lemmatization is the a mixed method of data-driven machine learning issue of unknown words, i.e. words not found (using the IFD as a training corpus) and linguistic in the training corpus or the underlying lexi- rules, as well as providing the option of looking con of the lemmatizer. This has been han- up word forms in the DMII. Given correct PoS dled in various ways, such as by only look- tagging of the input, Lemmald’s accuracy mea- ing at the suffix of a word to determine the sures at 98.54%, in a 10-fold cross-validation. The lemma, thereby lemmatizing unseen words that authors note that the CST Lemmatizer performs (hopefully) share the same morphological rules better than Lemmald when trained on the same as a known word (Dalianis and Jongejan, 2006). data, without the added DMII lookup. The DMII DNN-based lemmatizers may prove useful in solv- lookup for Lemmald delivers a statistically sig- ing this issue, as they have their own inherent nificant improvement on the accuracy (99.55%), ways of handling these out-of-vocabulary (OOV) but it is not provided with the IceNLP distribu- words, such as by using character-level context tion, so this enhancement is not available for pub- (Bergmanis and Goldwater, 2018). lic use. When used for lemmatization of the Ice- Previous to Nefnir, two lemmatization tools landic Tagged Corpus (MÍM) (Helgadóttir et al., had been developed for Icelandic. We will now 2012), the lemmatization accuracy of Lemmald 1 briefly mention these lemmatizers, before describ- was roughly estimated at around 90%. ing Nefnir further. 
3 System Description 2.1 CST Lemmatizer The main difference between Nefnir and the two The CST Lemmatizer (Jongejan and Dalianis, previously described lemmatizers for Icelandic, 2009) is a rule-based lemmatizer that has been CST Lemmatizer and Lemmald, is that Nefnir de- trained for Icelandic on the Icelandic Frequency rives its rules from a morphological database, the Dictionary (IFD) corpus, consisting of about DMII, whereas the other two are trained on a cor- 590,000 tokens (Pind et al., 1991). This is a pus, the IFD. Note that the IFD only consists of language-independent lemmatizer that only looks about 590,000 tokens, while the DMII contains at the suffix of the word as a way of lemmatizing over 5.8 million inflectional forms. OOV words, and can be used on both tagged and Nefnir uses suffix substitution rules, derived untagged input. from the DMII to lemmatize tagged text. An ex- ample of such a rule is (ngar, nkfn, ar→ur), which The authors of Lemmald (see Section 2.2) can be applied to any word form with the suffix trained and evaluated the CST Lemmatizer on the ngar that has the PoS tag nkfn (a masculine plu- IFD and observed a 98.99% accuracy on correctly ral noun in the nominative case), transforming the tagged text and 93.15% accuracy on untagged text, suffix from ar to ur. This rule could, for example, in a 10-fold cross-validation, where each test set be applied to the word form kettlingar “kittens” contained about 60,000 tokens. Another evalu- ation of this lemmatizer for Icelandic (Cassata, 1See https://www.malfong.is/index.php?lang=en&pg=mim to obtain the corresponding lemma, kettlingur. DMII, where over 88% of all words are com- Words are lemmatized using the rule with the pounds (Bjarnadóttir, 2017). Any of the open longest shared suffix and the same tag. word classes can be combined to form a com- Each inflectional form in the DMII is annotated pound, and there is no theoretical limit to how with a grammatical tag and lemma. 
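The rule format and selection criterion just described can be sketched as follows. This is a minimal illustration, not Nefnir's actual implementation; the rule set contains only the example rule from the text plus one invented broader rule, and all helper names are ours.

```python
# Minimal sketch of suffix substitution lemmatization as described above.
# A rule (suffix, tag, old, new) applies to any word form ending in
# `suffix` that carries PoS tag `tag`, replacing the ending `old` by `new`.
from typing import NamedTuple, Optional

class Rule(NamedTuple):
    suffix: str
    tag: str
    old: str
    new: str

RULES = [
    Rule("ar", "nkfn", "ar", "ur"),    # invented broader rule, for illustration
    Rule("ngar", "nkfn", "ar", "ur"),  # the (ngar, nkfn, ar->ur) rule from the text
]

def lemmatize(word: str, tag: str, rules=RULES) -> str:
    """Apply the rule with the longest shared suffix and the same tag."""
    best: Optional[Rule] = None
    for r in rules:
        if r.tag == tag and word.endswith(r.suffix):
            if best is None or len(r.suffix) > len(best.suffix):
                best = r
    if best is None:
        return word  # no applicable rule: keep the word form unchanged
    return word[:len(word) - len(best.old)] + best.new

# kettlingar "kittens" (masculine plural nominative) -> kettlingur
print(lemmatize("kettlingar", "nkfn"))  # kettlingur
```

Both candidate rules match kettlingar, and the longest-suffix criterion selects the more specific (ngar, nkfn, ar→ur).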
As the DMII is limited to inflected words, the training data is supplemented with a hand-curated list of approximately 4,500 uninflected words (such as adverbs, conjunctions and prepositions). To account for subtle differences between the tagsets used in the DMII and by the Icelandic PoS taggers, Nefnir translates all tags to an intermediate tagset which is a subset of both.

Rules are successively generated and applied to the training set, with each new rule minimizing the number of remaining errors. Rules continue to be generated until the number of errors cannot be reduced. The process is as follows:

1. Initially, assume that each word form is identical to its lemma.
2. Generate a list of rules for all remaining errors.
3. Choose the rule which minimizes the number of remaining errors and apply it to the training set, or stop if no improvement can be made.
4. Repeat from step 2.

Rules are only generated if they can correctly lemmatize at least two examples in the training set. A dictionary is created for words which are incorrectly lemmatized by the rules, for example because they require a unique transformation, such as from við "we" to ég "I". Once trained, Nefnir lemmatizes words using the dictionary if they are present, or else with the most specific applicable rule.

A rule is generated for every suffix in a word form, with some restrictions. For base words, Nefnir considers all suffixes, from the empty string to the full word. For skó "shoes", an inflected form of the word skór "shoe", rules are generated for the suffixes ε, ó, kó and skó. However, Nefnir does not create rules for suffixes that are shorter than the transformation required to lemmatize the word. For example, for bækur "books", which requires the transformation ækur→ók (the lemma for bækur is bók), only the suffixes ækur and bækur are considered.

Compounding is highly productive in Icelandic and compound words comprise a very large portion of the vocabulary. This is reflected in the DMII, where over 88% of all words are compounds (Bjarnadóttir, 2017). Any of the open word classes can be combined to form a compound, and there is no theoretical limit to how many words they can consist of. Due to the abundance of compounds in the training data, and the freedom with which they can be formed, Nefnir places additional restrictions on which suffixes to consider when generating rules for them. Suffixes for the final part of a compound are generated in the same manner as for base words, growing part by part thereafter. For example, the compound word fjall+göngu+skó "hiking boots" would yield rules for the suffixes ε, ó, kó, skó, gönguskó and fjallgönguskó. Allowing suffixes to grow freely past the final part of the compound may result in overfitting as the rules adapt to incidental patterns in the training data.

4 Evaluation

We have evaluated the output of Nefnir against a reference corpus of 21,093 tokens and their correct lemmas.

Samples for the reference corpus were extracted from two larger corpora, in order to obtain a diverse vocabulary:

• The IFD corpus mostly contains literary texts (Pind et al., 1991). It was first published in book form and is now available online. This corpus has been manually PoS tagged and lemmatized.

• The Icelandic Gold Standard (GOLD) is a PoS tagged and manually corrected corpus of around 1,000,000 tokens, containing a balanced sample of contemporary texts from 13 sources, including news texts, laws and adjudications, as well as various web content such as blog texts (Loftsson et al., 2010).

Samples were extracted at random from these two corpora, roughly 10,000 tokens from each, and the lemmas manually reviewed, following the criteria laid out in the preface of the IFD (Pind et al., 1991).

The incentive when performing the evaluation was to create a diverse corpus of text samples containing foreign words, misspellings and other OOV words. Such words are likely to appear in real-world NLP tasks, and pose special problems for lemmatizers. In the proofread and literature-heavy IFD corpus, which was used for training and evaluating the previous two lemmatizers, these OOV words are less prevalent. Consequently, the test corpus used here is not directly comparable with the corpus used to evaluate Lemmald and the CST Lemmatizer for Icelandic. On the other hand, it is more diverse and offers more challenging problems for the lemmatizer.

One of the motivations of this work was to determine how well Nefnir performs when lemmatizing text which has been PoS tagged automatically, without any manual review, as such manual labour is usually not feasible in large-scale NLP tasks. For this purpose, we created two versions of the test corpus, one with the correct PoS tags, and another tagged using IceTagger (Loftsson, 2008). The accuracy of IceTagger is further enhanced using data from the DMII. Measured against the correct PoS tags, the accuracy of the PoS tags in the reference corpus is 95.47%.

Accuracy of the lemmatization was measured by comparing the reference corpus lemmas with the lemmas obtained from Nefnir. This was done for both the correctly tagged corpus (gold tags) and the automatically tagged one (IceTagger tags). As seen in Table 1, the accuracy for the test file with the correct PoS tags is 99.55%, with 94 errors in 21,093 tokens. For the text tagged automatically with IceTagger, the accuracy is 96.88%, with 658 errors.

              Gold tags              IceTagger tags
              Accuracy (%)  Errors   Accuracy (%)  Errors
              99.55         94       96.88         658

Table 1: Results of the evaluation, with the accuracy and the total number of errors found.

These results indicate that given correct PoS tags, Nefnir obtains high accuracy, with under a hundred errors in the whole corpus sample. This is comparable to the score reported for Lemmald, when DMII lookup has been added (99.55%). In fact, it can be argued that a higher score is hard to come by, as natural language always contains some unforeseen issues that are hard to accommodate for, such as OOV words, misspellings, colloquialisms, etc. When Nefnir bases its lemmas on the automatically PoS tagged text, the accuracy decreases, from 99.55% to 96.88%, resulting in six times as many errors.

The main reason for the high accuracy, in our view, lies in the richness of the DMII data. No lexicon can ever include all words of a particular language, as new words appear every day, but most often, new words in Icelandic are compounds, created from words already present in the DMII. This explains how rare or unknown words such as the adjective fuglglaður "bird-happy", which appears in the corpus data, can be correctly lemmatized using the suffix rule for glaður "happy".

We can classify the errors made by Nefnir into the following main categories:

1. Foreign words
2. Proper names
3. Two valid lemmas for a word form
4. Typos
5. Incorrect capitalization, abbreviations, hyphenation, etc.
6. Unknown Icelandic words
7. Wrong PoS tag leads to wrong lemma

The most prevalent error categories when the PoS tags are correct are foreign words and proper names, such as foreign names of people, products and companies. A special issue that often came up is the cliticized definite article in Icelandic proper names. This is quite common in organization names (Síminn, Samfylkingin), titles of works of art (Svanurinn), names of ships (Vonin), buildings (Kringlan), etc. Ultimately, it depends on the aim of the lemmatization how these should be handled, but in this evaluation we assume as a general rule that they should be lemmatized with the definite article (Síminn, and not sími or Sími). The same applies to the plural, in names such as Hjálmar "helmets" (band) and Katlar (place name).

In the automatically tagged data, tagging errors are the most common source of lemmatization errors, such as when læknum (referring to the plural dative of the masculine noun læknir "doctor") is tagged as being in the singular, which leads to it being incorrectly lemmatized as lækur "brook". This was to be expected, as the rules learned from the DMII rely on the correct tagging of the input. However, as the authors of Lemmald comment, as long as the word class is correct, the lemmatizer can usually still find the correct lemma (Ingason et al., 2008).

As mentioned above, Nefnir, the CST Lemmatizer for Icelandic, and Lemmald have not been evaluated using the same reference corpus. The accuracy of the three lemmatizers is, therefore, not directly comparable, but our results indicate that Nefnir obtains the highest accuracy.

5 Conclusion

We described and evaluated Nefnir, a new open source lemmatizer for Icelandic. It uses suffix substitution rules, derived from a large morphological database, to lemmatize tagged text. Evaluation shows that Nefnir obtains high accuracy for both correctly and automatically PoS-tagged input.

As taggers for Icelandic gradually get better, we can expect to see the lemmatization accuracy go up as well. Expanding the morphological database with more proper names may also help to achieve even higher accuracy.
References

Toms Bergmanis and Sharon Goldwater. 2018. Context sensitive neural lemmatization with Lematus. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana.

Kristín Bjarnadóttir. 2012. The Database of Modern Icelandic Inflection. In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SaLTMiL 8 – AfLaT2012), LREC 2012, Istanbul, Turkey.

Kristín Bjarnadóttir. 2017. Phrasal compounds in Modern Icelandic with reference to Icelandic word formation in general. In Carola Trips and Jaklin Kornfilt, editors, Further investigations into the nature of phrasal compounding. Language Science Press, Berlin, Germany.

Frank Cassata. 2007. Automatic thesaurus extraction for Icelandic. BSc Final Project, Department of Computer Science, Reykjavik University.

Hercules Dalianis and Bart Jongejan. 2006. Hand-crafted versus Machine-learned Inflectional Rules: The Euroling-SiteSeeker Stemmer and CST's Lemmatiser. In Proceedings of the 5th International Conference on Language Resources and Evaluation, LREC 2006, Genoa, Italy.

Jón F. Daðason. 2018. Nefnir. https://github.com/jonfd/nefnir.

Jan Hajič, Eduard Bejček, Alevtina Bémová, Eva Buráňová, Eva Hajičová, Jiří Havelka, Petr Homola, Jiří Kárník, Václava Kettnerová, Natalia Klyueva, Veronika Kolářová, Lucie Kučová, Markéta Lopatková, Marie Mikulová, Jiří Mírovský, Anna Nedoluzhko, Petr Pajas, Jarmila Panevová, Lucie Poláková, Magdaléna Rysová, Petr Sgall, Johanka Spoustová, Pavel Straňák, Pavlína Synková, Magda Ševčíková, Jan Štěpánek, Zdeňka Urešová, Barbora Vidová Hladká, Daniel Zeman, Šárka Zikánová, and Zdeněk Žabokrtský. 2018. Prague dependency treebank 3.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Sigrún Helgadóttir, Ásta Svavarsdóttir, Eiríkur Rögnvaldsson, Kristín Bjarnadóttir, and Hrafn Loftsson. 2012. The Tagged Icelandic Corpus (MÍM). In Proceedings of the Workshop on Language Technology for Normalisation of Less-Resourced Languages (SaLTMiL 8 – AfLaT2012), LREC 2012, Istanbul, Turkey.

Anton Karl Ingason, Sigrún Helgadóttir, Hrafn Loftsson, and Eiríkur Rögnvaldsson. 2008. A Mixed Method Lemmatization Algorithm Using a Hierarchy of Linguistic Identities (HOLI). In Advances in Natural Language Processing, 6th International Conference on NLP, GoTAL 2008, Proceedings, Gothenburg, Sweden.

Bart Jongejan and Hercules Dalianis. 2009. Automatic Training of Lemmatization Rules That Handle Morphological Changes in Pre-, In- and Suffixes Alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, ACL '09, Suntec, Singapore.

Daniel Kondratyuk, Tomáš Gavenčiak, Milan Straka, and Jan Hajič. 2018. LemmaTag: Jointly tagging and lemmatizing for morphologically rich languages with BRNNs. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium.

Hrafn Loftsson. 2008. Tagging Icelandic text: A linguistic rule-based approach. Nordic Journal of Linguistics, 31(1):47–72.

Hrafn Loftsson and Eiríkur Rögnvaldsson. 2007. IceNLP: A Natural Language Processing Toolkit for Icelandic. In Proceedings of InterSpeech 2007, Special session: Speech and language technology for less-resourced languages, Antwerp, Belgium.

Hrafn Loftsson, Jökull H. Yngvason, Sigrún Helgadóttir, and Eiríkur Rögnvaldsson. 2010. Developing a PoS-tagged corpus using existing tools. In Proceedings of the 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, Valetta, Malta.

Jesse Myrberg. 2017. finnlem. https://github.com/jmyrberg/finnlem.

Jörgen Pind, Friðrik Magnússon, and Stefán Briem. 1991. Íslensk orðtíðnibók [The Icelandic Frequency Dictionary]. The Institute of Lexicography, University of Iceland, Reykjavik, Iceland.

Steinþór Steingrímsson, Sigrún Helgadóttir, Eiríkur Rögnvaldsson, Starkaður Barkarson, and Jon Gudnason. 2018. Risamálheild: A Very Large Icelandic Text Corpus. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, LREC 2018, Miyazaki, Japan.