Automatic Lemmatisation of Lithuanian Mwes

Automatic Lemmatisation of Lithuanian MWEs Loïc Boizou Jolanta Kovalevskaite˙ Erika Rimkute˙ Centre of Computational Linguistics Vytautas Magnus University Kaunas, Lithuania {l.boizou, j.kovalevskaite, e.rimkute}@hmf.vdu.lt Abstract In traditional Lithuanian dictionaries of idioms, there is no problem of MWE lemmatisation, be- This article presents a study of lemmatisa- cause data are collected manually and represented tion of flexible multiword expressions in following the rule that a verb of an idiom is pro- Lithuanian. An approach based on syntac- vided in infinitive form (Paulauskas (ed), 2001), tic analysis designed for multiword term e.g.: Savo viet ˛ažinoti ‘to know one‘s place’, Vie- lemmatisation was adapted for a broader tos netureti˙ ‘to have nowhere to go’. The re- range of MWEs taken from the Dictionary cent Dictionary of Lithuanian Nominal Phrases2, of Lithuanian Nominal Phrases. In the which was compiled semi-automatically from a present analysis, the main lemmatisation corpus, contains the unlemmatised list of MWEs. errors are identified and some improve- Therefore, automatic phrase lemmatisation could ments are proposed. It shows that auto- help in organizing the dictionary data and in im- matic lemmatisation can be improved by proving the user interface. taking into account the whole set of gram- This article describes the main problems oc- matical forms for each MWE. It would curring during the lemmatisation of Lithuanian allow selecting the optimal grammatical MWEs. The concept of lemmatisation is quite form for lemmatisation and identifying clear for single words, but for MWEs, it can be un- some grammatical restrictions. derstood differently as it will be discussed in part 1 Introduction 3. Although the accuracy of lemmatisation of in- dividual words is high (99% for lemmatisation and In Lithuanian, in addition to fully fixed multiword 94% for grammatical form identification (Rimkute˙ expressions (MWEs), there are many MWEs with and Daudaravicius,ˇ 2007)), the lemmatisation of one or more constituents (possibly all) which can single words included in MWEs cannot produce be inflected. Therefore, these “flexible” MWEs well-formed MWE lemmas. Indeed, base forms appear in texts in several forms1: as shown by the of Lithuanian multiword terms which should oc- corpus data, the Lithuanian verbal phraseme pak- cur in dictionaries and terminology databases are išti koj ˛a, meaning ‘to put a spoke in somebody’s not the same as the sequences of lemmas of their wheel’, has a form with the verb in definite past constitutive words (Boizou et al., 2012, p. 28). For pakišo koj ˛a, a form with the verb in past frequen- example, if we lemmatise each constitutive word tative (pakišdavo koj ˛a), and future (pakiš koj ˛a) in MWEs taupom ˛uj˛ubank ˛u, taupomuoju banku (for more examples, see Kovalevskaite˙ (2014)). If (‘savings bank’ in genitive plural and instrumen- we extract MWEs of a strongly-inflected language tal singular), the result is taupyti bankas (infini- like Lithuanian from a raw text corpus using sta- tive ‘to save’ and nominative singular ‘bank’), be- tistical association measures, we often get MWEs cause the morphological annotator of the Lithua- with their different grammatical forms (GF). How- nian language (Zinkevicius,ˇ 2000; Zinkeviciusˇ et ever, in lexical databases and terminology banks, a al., 2005) assignes infinitive as the proper lemma single lemma is usually used in order to represent for participles and other verb forms. The struc- the MWE independently from its concrete forms ture of such improperly lemmatised MWE fails to which appear in the corpus. reflect the syntactic relations (agreement and gov- 1Grammatical forms of the same MWE (or phraseme) can be labelled as phraseme-types (Kovalevskaite,˙ 2014). 2http://donelaitis.vdu.lt/lkk/pdf/dikt_fr.pdf Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 41 ernment) which ensure the grammatical cohesion • nenuleisti rank ˛u, nenuleido rank ˛u, nenulei- between the constitutive words of an MWE. džia rank ˛u (‘not to give up’, lit. ‘not to lower This work is focused on syntagmatic lemma, hands’, in various grammatical forms: infini- which is the form of the MWE where the MWE tive, definite past, present tense). syntactic head is lemmatised and the necessary adaptations are made in order to ensure the mor- As Lithuanian is a strongly inflected language, phosyntactic unity of the MWE. With a similar it is an advantage that users can see phrases in the approach, a tool for automatic Lithuanian sintag- form they are used in the corpus. Phrases were ex- matic lemmatisation called JungLe was first de- tracted by the method of Gravity Counts (Dauda- veloped and trained with multiword terms during raviciusˇ and Marcinkevicienˇ e,˙ 2004, p. 330) from the project ŠIMTAI 2 (semi-automatic extraction the Corpus of Contemporary Lithuanian Language of education and science terms). The first ex- (100 m running words; made up of periodicals, periments with the Lithuanian multiword terms fiction, non-fiction, and legal texts published in showed accuracy close to 95% (Boizou et al., 1991-2002). Gravity Counts helps to evaluate 2012), but it can be related to a relatively low vari- the combinability of two words according to in- ety of term structures. For this study, JungLe was dividual word frequencies, pair frequencies or the adapted for a broader set of types of Lithuanian number of different words in a selected 3 word- compositional and non-compositional MWEs, e.g. span. As a result, it detects collocational chains as idioms, collocations, nominal compounds, MW text fragments, not as a list of collocates for the terms, MW named entities, MW function words, previously selected node-words. proverbs, etc. After automatic extraction of collocational chains from the corpus, manual procedures were 2 Data performed: transformation of collocational chains into phrases (the procedure is described in de- This study is based on the data from the Dictionary tail in Marcinkevicienˇ e˙ (2010) and Rimkute˙ et of Lithuanian Nominal Phrases (further on, dictio- al. (2012)). According to the lexicographical ap- nary). The database of the dictionary3 consists of proach, linguistically well-formed collocational 68,602 nominal phrases. It has to be mentioned, chains have to be grammatical, meaningful, and that the term nominal phrase refers to all MWEs arbitrary. Therefore, some chains were shortened, which contain at least one noun: it can be phrases complemented, joined or deleted manually. At with a noun as a syntactic head, as well as phrases present, the dictionary database contains phrases with a verb or an adjective as a syntactic head. In without additions (1) and with additions: additions this article, the terms MWE and phrase are used can be explicitly specified (2) or not (3), e.g.: interchangeably. In the dictionary, the phrases are of different 1. ne tuo adresu ‘under a wrong address’; length: from two-word phrases (31,853 phrases) to phrases comprising 46 words (1 phrase). The 2. (gauti; suteikti) daugiau informacijos ‘(get; major part of the dictionary phrases is made up give) more information’; of two-word phrases (46.4%), whereas three-word phrases and four-word phrases form accordingly 3. atkreipiant demes˛i˛i˙ (. ) ‘paying attention to 28.7% and 10.1% of the dictionary (Rimkute˙ et (. )’. al., 2012, p. 19). The phrases are not lemmatised, but given in the form as they appear in the corpus, e.g.: Phrase Number of Number of lemmas type lemmas with 2 or more GF • mobilaus ryšio telefonas, mobilaus ryšio tele- 2-word 18,581 6,585 (35,4%) fono, mobilaus ryšio telefon ˛a, mobilaus ryšio 3-word 10,970 2,245 (20,5%) telefonus (‘mobile phone’ in various gram- 4-word 3,333 477 (14,3%) matical forms: nominative singular, genitive In total 32,884 9,307 (28,3%) singular, accusative singular and accusative plural); Table 1: Statistical information about MWEs from 3http://tekstynas.vdu.lt/page.xhtml?id=dictionary-db the type (1). Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 42 Only two-, three-, and four-word phrases from • positive form for adverbs (if they vary in de- the type (1) were filtered out for this study (see gree); Table 1). As already mentioned, these phrases make up 85% of the whole dictionary database, • infinitive for verbs (including participles). and thus can be considered as the most typical multiword units in Lithuanian (Marcinkevicienˇ e,˙ In the field of computational linguistics, it is 2010; Rimkute˙ et al., 2012). Most of them are id- common to use artificial lemmas for MWEs, be- ioms, phraseologisms and collocations, although cause they can be easily generated by automatic there are many multiword terms as well. means. There are two main kinds of artificial lemmas: It was calculated that two-word expressions have on average 1.71 different grammatical forms, a) It is possible to use a lemmatic sequence which is the sequence of lemmas of each con- three-word expressions – around 1.35 different 4 grammatical forms, and four-word expressions – stitutive word of the MWE . Using the morpho- 1.22 different grammatical forms. The maximum logical annotation tool for Lithuanian (the tool ˇ number of grammatical forms for one lemma are is described in Zinkevicius (2000) and Zinke- ˇ respectively 21 (aukšta mokykla ‘high school’), 15 vicius et al. (2005)), each grammatical form of the bendrosioms mokslo programoms (mobilaus ryšio telefonas ‘mobile phone’) and 8 multiword term, (kandidatas ˛iseimo narius ‘candidate as MP’). For ’framework programme’, is annotated morpholog- this study, only the 9,307 MWEs with two or more ically as follows: grammatical forms were selected. The remaining • <word="bendrosioms" lemma="bendras" MWEs, those for which only one form was iden- type="bdv., teig, nelygin. l., ˛ivardž., mot. g., tified automatically, are excluded, as they do not dgs., N."/>5 always need to be lemmatised and require a further study.

Automatic Lemmatisation of Lithuanian Mwes

Enhanced Thesaurus Terms Extraction for Document Indexing

An Evaluation of Machine Learning Approaches to Natural Language Processing for Legal Text Classification

Using Lexico-Syntactic Ontology Design Patterns for Ontology Creation and Population

LASLA and Collatinus

Experiments in Clustering Urban-Legend Texts

Removing Boilerplate and Duplicate Content from Web Corpora

The State of (Full) Text Search in Postgresql 12

A Diachronic Treebank of Russian Spanning More Than a Thousand Years

Download in the Conll-Format and Comprise Over ±175,000 Tokens

Lemmagen: Multilingual Lemmatisation with Induced Ripple-Down Rules

Language Technology Meets Documentary Linguistics: What We Have to Tell Each Other

A 500 Million Word POS-Tagged Icelandic Corpus