Learning to Lemmatise Polish Noun Phrases
Total Page:16
File Type:pdf, Size:1020Kb
Learning to lemmatise Polish noun phrases Adam Radziszewski Institute of Informatics, Wrocław University of Technology Wybrzeze˙ Wyspianskiego´ 27 Wrocław, Poland [email protected] Abstract of the National Corpus of Polish (Przepiórkowski, 2009), henceforth called NCP in short. The or- We present a novel approach to noun thographic form (1a) appears in instrumental case, phrase lemmatisation where the main singular. Phrase lemma is given as (1b). Lem- phase is cast as a tagging problem. The matisation of this phrase consists in reverting case idea draws on the observation that the value of the main noun (miasto) as well as its lemmatisation of almost all Polish noun adjective modifier (główne) to nominative (nom). phrases may be decomposed into trans- Each form in the example is in singular number formation of singular words (tokens) that (sg), miasto has neuter gender (n), gmina is fem- make up each phrase. We perform eval- inine (f). uation, which shows results similar to those obtained earlier by a rule-based sys- (1) a. głównym miastem gminy tem, while our approach allows to separate main city municipality chunking from lemmatisation. inst:sg:n inst:sg:n gen:sg:f 1 Introduction b. główne miasto gminy main city municipality Lemmatisation of word forms is the task of find- nom:sg:n nom:sg:n gen:sg:f ing base forms (lemmas) for each token in running text. Typically, it is performed along POS tagging According to the lemmatisation principles ac- and is considered crucial for many NLP applica- companying the NCP tagset, adjectives are lem- tions. Similar task may be defined for whole noun matised as masculine forms (główny), hence it is phrases (Degórski, 2011). By lemmatisation of not sufficient to take word-level lemma nor the or- noun phrases (NPs) we will understand assigning thographic form to obtain phrase lemmatisation. each NP a grammatically correct NP correspond- Degórski (2011) discuses some similar cases. He ing to the same phrase that could stand as a dic- also notes that this is not an easy task and lemma tionary entry. of a whole NP is rarely a concatenation of lem- The task of NP lemmatisation is rarely con- mas of phrase components. It is worth stressing sidered, although it carries great practical value. that even the task of word-level lemmatisation is For instance, any keyword extraction system that non-trivial for inflectional languages due to a large works for a morphologically rich language must number of inflected forms and even larger num- deal with lemmatisation of NPs. This is because ber of syncretisms. According to Przepiórkowski keywords are often longer phrases (Turney, 2000), (2007), “a typical Polish adjective may have 11 while the user would be confused to see inflected textually different forms (. ) but as many as 70 forms as system output. Similar situation happens different tags (2 numbers 7 cases 5 genders)”, × × when attempting at terminology extraction from which indicates the scale of the problem. What is domain corpora: it is usually assumed that do- more, several syntactic phenomena typical for Pol- main terms are subclass of NPs (Marciniak and ish complicate NP lemmatisation further. E.g., ad- Mykowiecka, 2013). jectives may both precede and follow nouns they In (1) we give an example Polish noun phrase modify; many English prepositional phrases are (‘the main city of the municipality’). Through- realised in Polish using oblique case without any out the paper we assume the usage of the tagset proposition (e.g., there is no standard Polish coun- 701 Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, pages 701–709, Sofia, Bulgaria, August 4-9 2013. c 2013 Association for Computational Linguistics terpart for the preposition of as genitive case is type of named entities considered, those two may used for this purpose). be solved using similar or significantly different In this paper we present a novel approach to methodologies. One approach, which is especially noun phrase lemmatisation where the main phase suitable for person names, assumes that nominat- is cast as a tagging problem and tackled using a ive forms may be found in the same source as the method devised for such problems, namely Con- inflected forms. Hence, the main challenge is to ditional Random Fields (CRF). define a similarity metric between named entities (Piskorski et al., 2009; Kocon´ and Piasecki, 2012), 2 Related works which can be used to match different mentions of the same names. Other named entity types may be NP lemmatisation received very little attention. realised as arbitrary noun phrases. This calls for This situation may be attributed to prevalence of more robust lemmatisation strategies. works targeted at English, where the problem is Piskorski (2005) handles the problem of lem- next to trivial due to weak inflection in the lan- matisation of Polish named entities of various guage. types by combining specialised gazetteers with The only work that contains a complete de- lemmatisation rules added to a hand-written gram- scription and evaluation of an approach to this mar. As he notes, organisation names are often task we were able to find is the work of Degór- built of noun phrases, hence it is important to un- ski (2011). The approach consists in incorpor- derstand their internal structure. Another interest- ating phrase lemmatisation rules into a shallow ing observation is that such organisation names are grammar developed for Polish. This is implemen- often structurally ambiguous, which is exempli- ted by extending the Spejd shallow parsing frame- fied with the phrase (2a), being a string of items work (Buczynski´ and Przepiórkowski, 2009) with in genitive case (‘of the main library of the Higher a rule action that is able to generate phrase lem- School of Economics’). Such cases are easier to mas. Degórski assumes that lemma of each NP solve when having access to a collocation diction- may be obtained by concatenating each token’s ary — it may be inferred that there are two colloc- orthographic form, lemma or ‘half-lemmatised’ ations here: Biblioteka Główna and Wyzsza˙ Szkoła form (e.g. grammatical case normalised to nom- Handlowa. inative, while leaving feminine gender). The other assumption is to neglect letter case: all phrases are (2) a. Biblioteki Głównej Wyzszej˙ converted to lower case and this is not penalised library main higher during evaluation. For development and evalu- gen:sg:f gen:sg:f gen:sg:f ation, two subsets of NCP were chosen and manu- Szkoły Handlowej ally annotated with NP lemmas: development set school commercial (112 phrases) and evaluation set (224 phrases). gen:sg:f gen:sg:f Degórski notes that the selection was not entirely b. Biblioteka Główna Wyzszej˙ random: two types of NPs were deliberately omit- library main higher ted, namely foreign names and “a few groups for nom:sg:f nom:sg:f gen:sg:f which the proper lemmatisation seemed very un- Szkoły Handlowej clear”. The final evaluation was performed in two school commercial ways. First, it is shown that the output of the en- gen:sg:f gen:sg:f tire system intersects only with 58.5% of the test set. The high error rate is attributed to problems While the paper reports detailed figures on with identifying NP boundaries correctly (29.5% named entity recognition performance, the qual- of test set was not recognised correctly with re- ity of lemmatisation is assessed only for all entity spect to phrase boundaries). The other experiment types collectively: “79.6 of the detected NEs were was to limit the evaluation to those NPs whose lemmatised correctly” (Piskorski, 2005). boundaries were recognised correctly by the gram- mar (70.5%). This resulted in 82.9% success rate. 3 Phrase lemmatisation as a tagging problem The task of phrase lemmatisation bears a close resemblance to a more popular task, namely lem- The idea presented here is directly inspired by De- matisation of named entities. Depending on the górski’s observations. First, we will also assume 702 that lemma of any NP may be obtained by concat- of applicable attributes depends on the grammat- enating simple transformations of word forms that ical class. For instance, nouns are specified for make up the phrase. As we will show in Sec. 4, number, gender and case. This assumption is im- this assumption is virtually always satisfied. We portant for our approach to be able to use simple will argue that there is a small finite set of inflec- tag transformations in the form replace the value tional transformations that are sufficient to lem- of attribute A with the new value V (A=V). This is matise nearly every Polish NP. not a serious limitation, since the same assumption Consider example (1) again. Correct lemmat- holds for most tagsets developed for inflectional isation of the phrase may be obtained by apply- languages, e.g., the whole MULTEXT-East fam- ing a series of simple inflectional transformations ily (Erjavec, 2012), Czech tagset (Jakubícekˇ et al., to each of its words. The first two words need to 2011). be turned into nominative forms, the last one is Our idea is simple: by expressing phrase lem- already lemmatised. This is depicted in (3a). To matisation in terms of word-level transformations show the real setting, this time we give full NCP we can reduce the task to tagging problem and tags and word-level lemmas assigned as a result of apply well known Machine Learning techniques tagging. In the NCP tagset, the first part of each that have been devised for solving such problems tag denotes grammatical class (adj stands for ad- (e.g. CRF). An important advantage is that this al- jective, subst for noun).