Lemmatisation as a Tagging Task

Andrea Gesmundo
Department of Computer Science, University of Geneva
[email protected]

Tanja Samardžić
Department of, University of Geneva
[email protected]

Abstract

We present a novel approach to the task of lemmatisation. We formalise lemmatisation as a category tagging task, by describing how a word-to-lemma transformation rule can be encoded in a single label and how a set of such labels can be inferred for a specific language. In this way, a lemmatisation system can be trained and tested using any supervised tagging model. In contrast to previous approaches, the proposed technique allows us to easily integrate relevant contextual information. We test our approach on eight languages, reaching a new state-of-the-art level for the lemmatisation task.

1 Introduction

Lemmatisation and part-of-speech (POS) tagging are necessary steps in the automatic processing of language corpora. This annotation is a prerequisite for developing systems for more sophisticated automatic processing, such as information retrieval, as well as for using language corpora in linguistic research and in the humanities. Lemmatisation is especially important for processing morphologically rich languages, where the number of different word forms is too large to be included in the part-of-speech tag set. The work on morphologically rich languages suggests that using comprehensive morphological dictionaries is necessary for achieving good results (Hajič, 2000; Erjavec and Džeroski, 2004). However, such dictionaries are constructed manually and they cannot be expected to be developed quickly for many languages.

In this paper, we present a new general approach to the task of lemmatisation which can be used to overcome the shortage of comprehensive dictionaries for languages for which they have not been developed. Our approach is based on redefining the task of lemmatisation as a category tagging task. Formulating lemmatisation as a tagging task allows the use of advanced tagging techniques and the efficient integration of contextual information. We show that this approach gives the highest accuracy known on eight European languages of different morphological complexity, including agglutinative (Hungarian, Estonian) and fusional (Slavic) languages.

2 Lemmatisation as a Tagging Task

Lemmatisation is the task of grouping together word forms that belong to the same inflectional morphological paradigm and assigning to each paradigm its corresponding canonical form, called the lemma. For example, the English word forms go, goes, going, went and gone constitute a single morphological paradigm which is assigned the lemma go.

Automatic lemmatisation requires defining a model that can determine the lemma for a given word form. Approaching it directly as a tagging task, by considering the lemma itself as the tag to be assigned, is clearly unfeasible: 1) the size of the tag set would be proportional to the vocabulary size, and 2) such a model would overfit the training corpus, missing important morphological generalisations required to predict the lemma of unseen words (e.g. the fact that the transformation from going to go is governed by a general rule that applies to most English verbs).
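The scale problem can be made concrete with a toy example (our own illustration in Python, not part of the paper): treating each lemma as a tag yields one tag per lemma, while encoding only the word-to-lemma transformation lets many pairs share a single rule.

```python
# Toy (word, lemma) pairs. With the lemma itself as the tag, the tag set
# grows with the vocabulary; with transformation rules, regular forms
# such as "going" and "playing" collapse onto the same rule.
pairs = [("going", "go"), ("playing", "play"), ("walked", "walk"),
         ("talked", "talk"), ("taking", "take")]

lemma_tags = {lemma for _, lemma in pairs}   # one tag per distinct lemma

def suffix_rule(word, lemma):
    # Crude rule: split off everything after the longest common prefix.
    i = 0
    while i < min(len(word), len(lemma)) and word[i] == lemma[i]:
        i += 1
    return (word[i:], lemma[i:])   # (suffix to drop, suffix to add)

rule_tags = {suffix_rule(w, l) for w, l in pairs}
print(len(lemma_tags), len(rule_tags))   # 5 lemma tags vs. 3 reusable rules
```

Even on five pairs the rule set is smaller than the lemma set, and the gap widens with vocabulary size.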

Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics, pages 368–372, Jeju, Republic of Korea, 8-14 July 2012. © 2012 Association for Computational Linguistics

Our method assigns to each word a label encoding the transformation required to obtain the lemma string from the given word string. The generic transformation from a word to a lemma is done in four steps: 1) remove a suffix of length Ns; 2) add a new lemma suffix, Ls; 3) remove a prefix of length Np; 4) add a new lemma prefix, Lp. The tuple τ ≡ ⟨Ns, Ls, Np, Lp⟩ defines the word-to-lemma transformation. Each tuple is represented with a label that lists the 4 parameters. For example, the transformation of the word going into its lemma is encoded by the label ⟨3, ∅, 0, ∅⟩. This label can be observed on a specific lemma-word pair in the training set, but it generalises well to the unseen words that are formed regularly by adding the suffix -ing. The same label applies to any other transformation which requires only removing the last 3 characters of the word string.

Suffix transformations are more frequent than prefix transformations (Jongejan and Dalianis, 2009). In some languages, such as English, it is sufficient to define only suffix transformations. In this case, all the labels will have Np set to 0 and Lp set to ∅. However, languages richer in morphology often require encoding prefix transformations too. For example, in assigning the lemma to negated verb forms in Czech, the negation prefix needs to be removed. In this case, the label ⟨1, t, 2, ∅⟩ maps the word nevěděl to the lemma vědět.
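The four-step transformation can be sketched as follows (an illustrative Python sketch under our own naming, not the authors' implementation):

```python
def apply_label(word, ns, ls, np_, lp):
    """Apply a word-to-lemma label <Ns, Ls, Np, Lp>: remove a suffix of
    length Ns, append the lemma suffix Ls, remove a prefix of length Np,
    and prepend the lemma prefix Lp."""
    stem = word[:len(word) - ns]   # 1) remove suffix of length Ns
    stem = stem + ls               # 2) add new lemma suffix Ls
    stem = stem[np_:]              # 3) remove prefix of length Np
    return lp + stem               # 4) add new lemma prefix Lp

print(apply_label("going", 3, "", 0, ""))      # label <3, ∅, 0, ∅> -> "go"
print(apply_label("nevěděl", 1, "t", 2, ""))   # label <1, t, 2, ∅> -> "vědět"
```

Note that `word[:len(word) - ns]` rather than `word[:-ns]` keeps the whole word when Ns is 0.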
The same label generalises to other (word, lemma) pairs: (nedokázal, dokázat), (neexistoval, existovat), (nepamatoval, pamatovat).¹

The set of labels for a specific language is induced from a training set of pairs (word, lemma). For each pair, we first find the Longest Common Substring (LCS) (Gusfield, 1997). Then we set the value of Np to the number of characters in the word that precede the start of the LCS, and Ns to the number of characters in the word that follow the end of the LCS. The value of Lp is the substring preceding the LCS in the lemma, and the value of Ls is the substring following the LCS in the lemma. In the case of the example pair (nevěděl, vědět), the LCS is vědě: 2 characters precede the LCS in the word and 1 follows it; there are no characters preceding the start of the LCS in the lemma and 't' follows it. The generated label is added to the set of labels.

¹ The transformation rules described in this section are well adapted for the wide range of languages which encode morphological information by means of affixes. Other encodings can be designed to handle other morphological types (such as Semitic languages).

3 Label set induction

We apply the presented technique to induce the label set from annotated running text. This approach results in a set of labels whose size converges quickly with the increase of training pairs. Figure 1 shows the growth of the label set size with the number of tokens seen in the training set for three representative languages.

[Figure 1: Growth of the label set size with the number of word-lemma training samples, shown for English, Slovene and Serbian.]

This behavior is expected on the basis of the known interaction between the frequency and the regularity of word forms that is shared by all languages: infrequent words tend to be formed according to a regular pattern, while irregular word forms tend to occur among frequent words. The described procedure leverages this fact to induce a label set that covers most of the word occurrences in a text: a specialised label is learnt for frequent irregular words, while a generic label is learnt to handle words that follow a regular pattern.

We observe that the non-complete convergence of the label set size is, to a large extent, due to the presence of noise in the corpus (annotation errors, typos or inconsistency). We test the robustness of our method by deciding not to filter out the noise-generated labels in the experimental evaluation. We also observe that encoding the prefix transformation in the label is fundamental for handling the size of the label sets in the languages that frequently use lemma prefixes. For example, the label set generated for Czech doubles in size if only the suffix transformation is encoded in the label.
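The induction of labels from (word, lemma) pairs described above can be sketched as follows (our illustrative Python, using a simple dynamic-programming LCS rather than Gusfield's suffix-tree construction):

```python
def longest_common_substring(word, lemma):
    """Return (start_in_word, start_in_lemma, length) of a longest
    common substring, found by dynamic programming."""
    best = (0, 0, 0)
    prev = [0] * (len(lemma) + 1)
    for i in range(1, len(word) + 1):
        cur = [0] * (len(lemma) + 1)
        for j in range(1, len(lemma) + 1):
            if word[i - 1] == lemma[j - 1]:
                cur[j] = prev[j - 1] + 1          # extend the diagonal run
                if cur[j] > best[2]:
                    best = (i - cur[j], j - cur[j], cur[j])
        prev = cur
    return best

def induce_label(word, lemma):
    """Derive the <Ns, Ls, Np, Lp> label from one (word, lemma) pair."""
    w_start, l_start, length = longest_common_substring(word, lemma)
    ns = len(word) - (w_start + length)   # word characters after the LCS
    np_ = w_start                         # word characters before the LCS
    ls = lemma[l_start + length:]         # lemma substring after the LCS
    lp = lemma[:l_start]                  # lemma substring before the LCS
    return (ns, ls, np_, lp)

# Collecting labels over training pairs yields the label set of this section;
# "going" and "walking" fall together under the same label <3, ∅, 0, ∅>.
label_set = {induce_label(w, l) for (w, l) in
             [("going", "go"), ("walking", "walk"), ("nevěděl", "vědět")]}
print(induce_label("nevěděl", "vědět"))   # (1, 't', 2, '')
```

When several common substrings tie for the longest, this sketch keeps the first one found; the paper does not specify a tie-breaking rule.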

Finally, we observe that the size of the set of induced labels depends on the morphological complexity of the language, as shown in Figure 1: the English set is smaller than the Slovene and Serbian sets.

4 Experimental Evaluation

The advantage of structuring the lemmatisation task as a tagging task is that it allows us to apply successful tagging techniques and to use context information in assigning transformation labels to the words in a text. For the experimental evaluations we use the Bidirectional Tagger with Guided Learning presented in Shen et al. (2007). We chose this model since it has been shown to be easily adaptable for solving a wide set of tagging and chunking tasks, obtaining state-of-the-art performance with short execution time (Gesmundo, 2011). Furthermore, this model has consistently shown good generalisation behaviour, reaching significantly higher accuracy in tagging unknown words than other systems.

Table 1: Feature sets.

  Base Line (BL)  [w0], flagChars(w0), prefixes(w0), suffixes(w0)
  + context       BL + [w1], [w-1], [lem1], [lem-1]
  + POS           BL + [pos0]
  + cont.&POS     BL + [w1], [w-1], [lem1], [lem-1], [pos0], [pos-1], [pos1]

Table 2: Accuracy of the lemmatizer in the four settings (for +cont.&POS we report overall accuracy, Acc., and unseen-word accuracy, UWA).

  Language    Base Line  + cont.  + POS  + cont.&POS Acc.  + cont.&POS UWA
  Czech       96.6       96.8     96.8   97.7              86.3
  English     98.8       99.1     99.2   99.6              94.7
  Estonian    95.8       96.2     96.5   97.4              78.5
  Hungarian   96.5       96.9     97.0   97.5              85.8
  Polish      95.3       95.6     96.0   96.8              85.8
  Romanian    96.2       97.4     97.5   98.3              86.9
  Serbian     95.0       95.3     96.2   97.2              84.9
  Slovene     96.1       96.6     97.0   98.1              87.7
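The Base Line row of Table 1 can be read as the following feature extractor (an illustrative Python sketch; the affix lengths and feature names are our assumptions, since the paper does not specify them):

```python
def baseline_features(word, max_affix=4):
    """Base Line (BL) features of the current word [w0]: the word form,
    its prefixes and suffixes, and flags for special characters
    (digits, hyphen, capitals)."""
    feats = {"w0": word}
    for k in range(1, min(max_affix, len(word)) + 1):
        feats[f"prefix{k}"] = word[:k]    # prefixes(w0)
        feats[f"suffix{k}"] = word[-k:]   # suffixes(w0)
    feats["has_digit"] = any(c.isdigit() for c in word)   # flagChars(w0)
    feats["has_hyphen"] = "-" in word
    feats["has_cap"] = any(c.isupper() for c in word)
    return feats
```

The + context and + cont.&POS rows extend this dictionary with the neighbouring word forms, predicted lemmas and predicted POS tags ([w±1], [lem±1], [pos0], [pos±1]).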
We train and test the tagger on manually annotated G. Orwell's "1984" and its translations to seven European languages (see Table 2, column 1), included in the Multext-East corpora (Erjavec, 2010). The words in the corpus are annotated with both lemmas and detailed morphosyntactic descriptions, including the POS labels. The corpus contains 6737 sentences (approximately 110k tokens) for each language. We use 90% of the sentences for training and 10% for testing.

We compare lemmatisation performance in different settings. Each setting is defined by the set of features that are used for training and prediction. Table 1 reports the four feature sets used, and Table 2 reports the accuracy scores achieved in each setting. We establish the Base Line (BL) setting and performance in the first experiment. This setting involves only features of the current word, [w0], such as the word form, suffixes and prefixes, and features that flag the presence of special characters (digits, hyphen, caps). The BL accuracy is reported in the second column of Table 2.

In the second experiment, the BL feature set is expanded with features of the surrounding words ([w-1], [w1]) and surrounding predicted lemmas ([lem-1], [lem1]). The accuracy scores obtained in the second experiment are reported in the third column of Table 2. The consistent improvements over the BL scores for all the languages, varying from the lowest relative error reduction (RER) for Czech (5.8%) to the highest for Romanian (31.6%), confirm the significance of the context information.

In the third experiment, we use a feature set in which the BL set is expanded with the predicted POS tag of the current word, [pos0].² The accuracy measured in the third experiment (Table 2, column 4) shows consistent improvement over the BL (the best RER is 34.2% for Romanian). Furthermore, we observe that the accuracy scores in the third experiment are close to those in the second experiment. This allows us to state that it is possible to design high-quality lemmatisation systems which are independent of POS tagging. Instead of using the POS information, which is currently standard practice for lemmatisation, the task can be performed in a context-wise setting, using only the information about surrounding words and lemmas.

² The POS tags that we use are extracted from the morphosyntactic descriptions provided in the corpus and learned using the same system that we use for lemmatisation.

In the fourth experiment we use a feature set consisting of contextual features of words, predicted lemmas and predicted POS tags. This setting combines the use of the context with the use of the predicted POS tags. The scores obtained in the fourth experiment are considerably higher than those in the previous experiments (Table 2, column 5). The RER computed against the BL varies between 28.1% for Hungarian and 66.7% for English. For this setting, we also report accuracy on unseen words only (UWA, column 6 in Table 2) to show the generalisation capacities of the lemmatizer. The UWA scores are 85% or higher for all the languages except Estonian (78.5%).

The results of the fourth experiment show that interesting improvements in the performance are obtained by combining the POS and context information. This option has not been explored before. Current systems typically use only the information on the POS of the target word, together with lemmatisation rules acquired separately from a dictionary, which roughly corresponds to the setting of our third experiment. The improvement in the fourth experiment compared to the third (RER varying between 12.5% for Czech and 50% for English) shows the advantage of our context-sensitive approach over the currently used techniques.

All the scores reported in Table 2 represent performance with raw text as input. It is important to stress that the results are achieved using a general tagging system trained only on a small manually annotated corpus, with no language-specific external sources of data such as independent morphological dictionaries, which have been considered necessary for efficient processing of morphologically rich languages.

5 Related Work

Juršič et al. (2010) propose a general multilingual lemmatisation tool, LemGen, which is tested on the same corpora that we used in our evaluation. LemGen learns word transformations in the form of ripple-down rules. Disambiguation between multiple possible lemmas for a word form is based on the gold-standard morphosyntactic label of the word. Our system outperforms LemGen on all the languages: we measure a Relative Error Reduction varying between 81% for Serbian and 86% for English. It is worth noting that we do not use manually constructed dictionaries for training, while Juršič et al. (2010) use additional dictionaries for the languages for which they are available.

Chrupała (2006) proposes a system which, like our system, learns the lemmatisation rules from a corpus, without external dictionaries. The mappings between word forms and lemmas are encoded by means of the shortest edit script. The sets of edit instructions are considered as class labels and are learnt using an SVM classifier and word context features. The most important limitation of this approach is that it cannot deal with both suffixes and prefixes at the same time, which is crucial for efficient processing of morphologically rich languages. Our approach enables encoding transformations on both sides of words. Furthermore, we propose a more straightforward and more compact way of encoding the lemmatisation rules.

The majority of other methods concentrate on lemmatising out-of-lexicon words. Toutanova and Cherry (2009) propose a language-independent joint model for assigning the set of possible lemmas and POS tags to out-of-lexicon words. The lemmatizer component is a discriminative character transducer that uses a set of within-word features to learn the transformations from input data consisting of a lexicon with full morphological paradigms and unlabelled texts. They show that the joint model outperforms the pipeline model where the POS tag is used as input to the lemmatisation component.

6 Conclusion

We have shown that redefining the task of lemmatisation as a category tagging task and using an efficient tagger to perform it results in performance at the state-of-the-art level. The adaptive general classification model used in our approach makes use of different sources of information that can be found in a small annotated corpus, with no need for comprehensive, manually constructed morphological dictionaries. For this reason, it can be expected to be easily portable across languages, enabling good-quality processing of languages with complex morphology and scarce resources.

7 Acknowledgements

The work described in this paper was partially funded by the Swiss National Science Foundation grants CRSI22 127510 (COMTIS) and 122643.

References

Grzegorz Chrupała. 2006. Simple data-driven context-sensitive lemmatization. In Proceedings of the Sociedad Española para el Procesamiento del Lenguaje Natural, volume 37, pages 121–131, Zaragoza, Spain.

Tomaž Erjavec and Sašo Džeroski. 2004. Machine learning of morphosyntactic structure: lemmatizing unknown Slovene words. Applied Artificial Intelligence, 18:17–41.

Tomaž Erjavec. 2010. Multext-East version 4: Multilingual morphosyntactic specifications, lexicons and corpora. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Mike Rosner, and Daniel Tapias, editors, Proceedings of the Seventh Conference on International Language Resources and Evaluation (LREC'10), pages 2544–2547, Valletta, Malta. European Language Resources Association (ELRA).

Andrea Gesmundo. 2011. Bidirectional sequence classification for tagging tasks with guided learning. In Proceedings of TALN 2011, Montpellier, France.

Dan Gusfield. 1997. Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press.

Jan Hajič. 2000. Morphological tagging: data vs. dictionaries. In Proceedings of the 1st North American Chapter of the Association for Computational Linguistics Conference, pages 94–101, Seattle, Washington. Association for Computational Linguistics.

Bart Jongejan and Hercules Dalianis. 2009. Automatic training of lemmatization rules that handle morphological changes in pre-, in- and suffixes alike. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 145–153, Suntec, Singapore. Association for Computational Linguistics.

Matjaž Juršič, Igor Mozetič, Tomaž Erjavec, and Nada Lavrač. 2010. LemmaGen: Multilingual lemmatisation with induced ripple-down rules. Journal of Universal Computer Science, 16(9):1190–1214.

Libin Shen, Giorgio Satta, and Aravind Joshi. 2007. Guided learning for bidirectional sequence classification. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 760–767, Prague, Czech Republic. Association for Computational Linguistics.

Kristina Toutanova and Colin Cherry. 2009. A global model for joint lemmatization and part-of-speech prediction. In Proceedings of the 47th Annual Meeting of the ACL and the 4th IJCNLP of the AFNLP, pages 486–494, Suntec, Singapore. Association for Computational Linguistics.
