Automatic Lemmatisation of Lithuanian MWEs

Loïc Boizou Jolanta Kovalevskaite˙ Erika Rimkute˙ Centre of Computational Vytautas Magnus University Kaunas, Lithuania {l.boizou, j.kovalevskaite, e.rimkute}@hmf.vdu.lt

Abstract In traditional Lithuanian dictionaries of idioms, there is no problem of MWE lemmatisation, be- This article presents a study of lemmatisa- cause data are collected manually and represented tion of flexible multiword expressions in following the rule that a verb of an idiom is pro- Lithuanian. An approach based on syntac- vided in infinitive form (Paulauskas (ed), 2001), tic analysis designed for multiword term e.g.: Savo viet ˛ažinoti ‘to know one‘s place’, Vie- lemmatisation was adapted for a broader tos netureti˙ ‘to have nowhere to go’. The re- range of MWEs taken from the Dictionary cent Dictionary of Lithuanian Nominal Phrases2, of Lithuanian Nominal Phrases. In the which was compiled semi-automatically from a present analysis, the main lemmatisation corpus, contains the unlemmatised list of MWEs. errors are identified and some improve- Therefore, automatic phrase lemmatisation could ments are proposed. It shows that auto- help in organizing the dictionary data and in im- matic lemmatisation can be improved by proving the user interface. taking into account the whole set of gram- This article describes the main problems oc- matical forms for each MWE. It would curring during the lemmatisation of Lithuanian allow selecting the optimal grammatical MWEs. The concept of lemmatisation is quite form for lemmatisation and identifying clear for single , but for MWEs, it can be un- some grammatical restrictions. derstood differently as it will be discussed in part 1 Introduction 3. Although the accuracy of lemmatisation of in- dividual words is high (99% for lemmatisation and In Lithuanian, in addition to fully fixed multiword 94% for grammatical form identification (Rimkute˙ expressions (MWEs), there are many MWEs with and Daudaravicius,ˇ 2007)), the lemmatisation of one or more constituents (possibly all) which can single words included in MWEs cannot produce be inflected. Therefore, these “flexible” MWEs well-formed MWE lemmas. Indeed, base forms appear in texts in several forms1: as shown by the of Lithuanian multiword terms which should oc- corpus data, the Lithuanian verbal phraseme pak- cur in dictionaries and terminology databases are išti koj ˛a, meaning ‘to put a spoke in somebody’s not the same as the sequences of lemmas of their wheel’, has a form with the verb in definite past constitutive words (Boizou et al., 2012, p. 28). For pakišo koj ˛a, a form with the verb in past frequen- example, if we lemmatise each constitutive tative (pakišdavo koj ˛a), and future (pakiš koj ˛a) in MWEs taupom ˛uj˛ubank ˛u, taupomuoju banku (for more examples, see Kovalevskaite˙ (2014)). If (‘savings bank’ in genitive plural and instrumen- we extract MWEs of a strongly-inflected language tal singular), the result is taupyti bankas (infini- like Lithuanian from a raw using sta- tive ‘to save’ and nominative singular ‘bank’), be- tistical association measures, we often get MWEs cause the morphological annotator of the Lithua- with their different grammatical forms (GF). How- nian language (Zinkevicius,ˇ 2000; Zinkeviciusˇ et ever, in lexical databases and terminology banks, a al., 2005) assignes infinitive as the proper lemma single lemma is usually used in order to represent for participles and other verb forms. The struc- the MWE independently from its concrete forms ture of such improperly lemmatised MWE fails to which appear in the corpus. reflect the syntactic relations (agreement and gov- 1Grammatical forms of the same MWE (or phraseme) can be labelled as phraseme-types (Kovalevskaite,˙ 2014). 2http://donelaitis.vdu.lt/lkk/pdf/dikt_fr.pdf

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 41 ernment) which ensure the grammatical cohesion • nenuleisti rank ˛u, nenuleido rank ˛u, nenulei- between the constitutive words of an MWE. džia rank ˛u (‘not to give up’, lit. ‘not to lower This work is focused on syntagmatic lemma, hands’, in various grammatical forms: infini- which is the form of the MWE where the MWE tive, definite past, present tense). syntactic head is lemmatised and the necessary adaptations are made in order to ensure the mor- As Lithuanian is a strongly inflected language, phosyntactic unity of the MWE. With a similar it is an advantage that users can see phrases in the approach, a tool for automatic Lithuanian sintag- form they are used in the corpus. Phrases were ex- matic lemmatisation called JungLe was first de- tracted by the method of Gravity Counts (Dauda- veloped and trained with multiword terms during raviciusˇ and Marcinkevicienˇ e,˙ 2004, p. 330) from the project ŠIMTAI 2 (semi-automatic extraction the Corpus of Contemporary Lithuanian Language of education and science terms). The first ex- (100 m running words; made up of periodicals, periments with the Lithuanian multiword terms fiction, non-fiction, and legal texts published in showed accuracy close to 95% (Boizou et al., 1991-2002). Gravity Counts helps to evaluate 2012), but it can be related to a relatively low vari- the combinability of two words according to in- ety of term structures. For this study, JungLe was dividual word frequencies, pair frequencies or the adapted for a broader set of types of Lithuanian number of different words in a selected 3 word- compositional and non-compositional MWEs, e.g. span. As a result, it detects collocational chains as idioms, collocations, nominal compounds, MW text fragments, not as a list of collocates for the terms, MW named entities, MW function words, previously selected node-words. proverbs, etc. After automatic extraction of collocational chains from the corpus, manual procedures were 2 Data performed: transformation of collocational chains into phrases (the procedure is described in de- This study is based on the data from the Dictionary tail in Marcinkevicienˇ e˙ (2010) and Rimkute˙ et of Lithuanian Nominal Phrases (further on, dictio- al. (2012)). According to the lexicographical ap- nary). The database of the dictionary3 consists of proach, linguistically well-formed collocational 68,602 nominal phrases. It has to be mentioned, chains have to be grammatical, meaningful, and that the term nominal phrase refers to all MWEs arbitrary. Therefore, some chains were shortened, which contain at least one noun: it can be phrases complemented, joined or deleted manually. At with a noun as a syntactic head, as well as phrases present, the dictionary database contains phrases with a verb or an adjective as a syntactic head. In without additions (1) and with additions: additions this article, the terms MWE and phrase are used can be explicitly specified (2) or not (3), e.g.: interchangeably. In the dictionary, the phrases are of different 1. ne tuo adresu ‘under a wrong address’; length: from two-word phrases (31,853 phrases) to phrases comprising 46 words (1 phrase). The 2. (gauti; suteikti) daugiau informacijos ‘(get; major part of the dictionary phrases is made up give) more information’; of two-word phrases (46.4%), whereas three-word phrases and four-word phrases form accordingly 3. atkreipiant demes˛i˛i˙ (. . . ) ‘paying attention to 28.7% and 10.1% of the dictionary (Rimkute˙ et (. . . )’. al., 2012, p. 19). The phrases are not lemmatised, but given in the form as they appear in the corpus, e.g.: Phrase Number of Number of lemmas type lemmas with 2 or more GF • mobilaus ryšio telefonas, mobilaus ryšio tele- 2-word 18,581 6,585 (35,4%) fono, mobilaus ryšio telefon ˛a, mobilaus ryšio 3-word 10,970 2,245 (20,5%) telefonus (‘mobile phone’ in various gram- 4-word 3,333 477 (14,3%) matical forms: nominative singular, genitive In total 32,884 9,307 (28,3%) singular, accusative singular and accusative plural); Table 1: Statistical information about MWEs from 3http://tekstynas.vdu.lt/page.xhtml?id=dictionary-db the type (1).

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 42 Only two-, three-, and four-word phrases from • positive form for adverbs (if they vary in de- the type (1) were filtered out for this study (see gree); Table 1). As already mentioned, these phrases make up 85% of the whole dictionary database, • infinitive for verbs (including participles). and thus can be considered as the most typical multiword units in Lithuanian (Marcinkevicienˇ e,˙ In the field of computational linguistics, it is 2010; Rimkute˙ et al., 2012). Most of them are id- common to use artificial lemmas for MWEs, be- ioms, phraseologisms and collocations, although cause they can be easily generated by automatic there are many multiword terms as well. means. There are two main kinds of artificial lem- mas: It was calculated that two-word expressions have on average 1.71 different grammatical forms, a) It is possible to use a lemmatic sequence which is the sequence of lemmas of each con- three-word expressions – around 1.35 different 4 grammatical forms, and four-word expressions – stitutive word of the MWE . Using the morpho- 1.22 different grammatical forms. The maximum logical annotation tool for Lithuanian (the tool ˇ number of grammatical forms for one lemma are is described in Zinkevicius (2000) and Zinke- ˇ respectively 21 (aukšta mokykla ‘high school’), 15 vicius et al. (2005)), each grammatical form of the bendrosioms mokslo programoms (mobilaus ryšio telefonas ‘mobile phone’) and 8 multiword term, (kandidatas ˛iseimo narius ‘candidate as MP’). For ’framework programme’, is annotated morpholog- this study, only the 9,307 MWEs with two or more ically as follows: grammatical forms were selected. The remaining • 5 always need to be lemmatised and require a fur- ther study. • 6 applied to the process of automatic lemmatisation of MWEs. • 7 3 Approaches to Lemmatisation of MWEs The lemmatic sequence, e.g. for the previous example bendras mokslas programa, is often used In Lithuanian, a great number of MWEs can ap- in the field of automatic term recognition (e.g., pear in different grammatical forms. As such, they Loginova et al. (2012, p. 9)) to represent a term or do not differ from variable simple words. Accord- another type of MWE. Nonetheless, such a substi- ingly, a lot of Lithuanian MWEs consist of nouns, tute, which lacks grammatical cohesion between verbs and/or adjectives that are used in a particular the parts of the MWE, appears as a heap of words, grammatical form. Some of these word classes can which is unnatural for human users8. have from a few to dozens of different grammati- 4The difference between syntagmatic lemma (with mor- cal forms. Traditionally, for the set of grammati- phosyntactic relations between constitutive words) and lem- cal forms of each variable word, one basic form is matic sequence (the sequence of lemmas of constitutive assigned. The latter, a lemma, is a convenient rep- words) is relevant only for MWEs, not for single words. 5The field type contains the following grammatical fea- resentation of the whole set of grammatical word tures: adjective, positive, undefined, positive degree, femi- forms. Although in principle a lemma could be an nine, plural, dative. 6 artificial form (a stem, for example), the tradition Grammatical features: noun, masculine, singular, geni- tive. is to select as a lemma one form from the whole 7Grammatical features: noun, feminine, plural, dative. set of grammatical forms, e.g. in Lithuanian: 8In about 5% of the studied phrases, the sequences of iso- lated lemmas incidentally correspond to their natural lemma, e.g. vyras ir moteris ‘man and woman’, valstybinis simfoni- • nominative singular form for nouns (except nis orkestras ‘national symphony orchestra’. Such cases re- for plural nouns); quire the following conditions: the nominal syntactic head is masculine singular, the only syntactic relation inside the term is agreement or implies invariable words, degree and defi- • nominative singular masculine positive in- niteness must not be retained in the lemma, no participle is definite form for adjectives; implied.

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 43 b) The second frequent method is , The most difficult case concerns words congru- that is, dropping of endings. For example, the ent with the head, since they often have to be cor- forms of the previously mentioned MWE bendroji rected to remain congruent with the head once it mokslo programa, bendrosios mokslo programos is lemmatised. If the head is masculine singular, and bendrosioms mokslo programoms can be rep- the adaptation usually requires only taking lem- resented as: bendr moksl program. This option is mas for each congruent word, e.g., in the mul- even more artificial for Lithuanian, since, in addi- tiword term individualus studij ˛ugrafikas ‘indivi- tion to the loss of syntactic cohesion, this approach dual study plan/schedule’ (the adjective individ- generates shortened words without endings, which ualus ‘individual’ agrees with the noun grafikas do not exist in Lithuanian. ‘schedule’, not with the noun studij ˛u ‘study’). Other approaches attempt to provide a natural When the syntactic head is feminine, congruent lemma, i.e., by either choosing the most frequent words must also be put in their feminine form, e.g., form as a lemma, or generating a correct syntag- nuotolines˙ studijos ‘distance studies’ (instead of matic lemma from grammatical forms. Taking the *nuotolinis studijos, where the masculine singular most frequent form of the lemma avoids mistakes indefinite positive form of the adjective nuotolinis in generation, but the result is that the set of ba- is incongruent with the feminine plural head studi- sic forms is heterogeneous: some MWEs will be jos). in nominative, some will be in accusative, genitive Besides, some lexico-grammatical features, or in some other case, some will be in the plural e.g., definiteness, comparative/superlative de- form, others - in the singular. grees, are usually semantically relevant, so that Automatic lemmatisation according to the syn- they have to be kept in the syntagmatic lemma, tactic structure of each MWE ensures the con- which requires to generate the proper form, even stitency of basic forms, but it is the most com- when the head is masculine and singular, e.g. plicated process. The tool JungLe, which is de- Senasis ir Naujasis testamentas ‘The Old and scribed in Boizou et al. (2012), was specifically New Testament’ (where the adjectives Senasis designed for this task. This software analyses an and Naujasis are in the definite form, instead of MWE and attempts to distinguish three types of *Senas ir Naujas testamentas), aukštesnioji žemes˙ syntactic components (as a concrete example the ukio¯ mokykla ‘high school of agriculture’ (where multiword term individualus studij ˛ugrafikas ‘in- the adjective aukštesnioji is in the definite com- dividual study plan/schedule’ is provided): parative form, instead of *aukštas žemes˙ ukio¯ mokykla). • syntactic head (e.g., the noun grafikas Syntagmatic lemmatisation also requires to ‘plan/schedule’); lemmatise participles in a different manner than • words congruent with the head (e.g., the ad- single words. Indeed, participles are traditionally jective individualus ‘individual’); lemmatised as verbs in infinitive form. For ex- ample, the single word lemmatisation of the term • other words, that is, words governed by the perkeliamieji gebejimai˙ ‘transferable skills’ gives head and their own dependents (e.g., the noun a result perkelti gebejimai˙ , that is a sequence of in genitive case studij ˛u ‘study’). an infinitive (perkelti ‘to transfer‘) and a noun in nominative (gebejimai˙ ‘skills‘). The correct syn- The generation of the syntagmatic lemma re- tagmatic lemmatisation requires participles to be quires the syntactic head to be lemmatised (for corrected in gender, number and case only, in or- terms, the syntactic head is a noun, but there is der to remain congruent with their lemmatised more diversity with other types of MWEs). Words head, e.g. perkeliamasis gebejimas˙ . (usually in the genitive case) governed by the head All required generations are made by a light- and their own dependents remain in their gram- weighted generative module. This module uses matical form, e.g., švietimo {lygmuo} ‘education to the largest possible extent the information pro- level’, socialini ˛umoksl ˛u{sritis} ‘field of social vided by the morphological analyser, which works 9 sciences’. on a single word basis. Its generative capacities 9Here and below the syntactic head is indicated in curly are restricted to the nominative forms, since noun, brackets. adjective and participle lemmas are in the nomina-

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 44 tive form. Aiming at facilitation of the process, the MWEs. Table 2 shows that the highest accuracy generation proceeds either from a single lemma or is with two-word phrases; however, the number of a grammatical form. For example, lemmas for par- incorrectly lemmatised MWEs increases for three- ticiples are generated from the grammatical form, and four-word phrases. It shows that the syn- because it helps to avoid the problem of numer- tactic complexity increases with the length of the ous verbal paradigms in Lithuanian, while adjec- MWEs. tives are generated from the lemma. Indeed, some endings of nouns and adjectives (e.g., -(i) ˛u geni- Phrase Number of Number Correctly tive plural) hide the declension paradigm (which type lemmas of GF lemmatised GF is necessary for the selection of the correct fem- 2-word 6,585 19,822 91.56% inine ending), so that it is better to decide from 3-word 2,245 6,110 80.57% the lemma, which expresses the adjectival declen- 4-word 477 1,206 76.43% sion paradigm by its ending10, e.g., nuotolinis In total 9,307 27,138 ‘distant‘ (third adjectival declension paradigm) → nuotoline˙. The whole process is very similar to Table 2: Statistical information about the analysed Thurmair (2012, p. 257). MWEs (2 or more forms only).

4 Syntagmatic Lemmatisation of The analysis of the automatic lemmatisation Lithuanian MWEs: Evaluation and revealed three groups of errors11: a) agreement Results errors (number, gender); b) government errors; c) lexico-grammatical errors (degree, definiteness, In this part of the article, we present our results: lexical plural). what problems are solved by syntactic analysis, and what problems still remain and pose chal- 4.1 Agreement Errors lenges for automatic MWE lemmatisation. Many errors occur with numerals, e.g., *beveik Two-, three- and four-word phrases were au- du treˇcdalis ‘*nearly two third’ (it should be tomatically lemmatised with the help of Jun- beveik du treˇcdaliai, ‘nearly two thirds’, with treˇc- gLe tool and the results were evaluated manu- dalis ‘third’ in the plural form), also *aštuoni ally (see Table 2). JungLe generates a lemma menuo˙ ‘*eight month’ (while it should be aštuoni for each MWE grammatical form separately, so menesiai˙ , ‘eight months’, with menuo˙ ‘month’ in that more than one lemma can be provided for the plural form). Many of these errors can be elim- the same MWE, especially when it is difficult to inated by applying proper rules in the syntactic identify automatically to which word an attribute analysis. in genitive belongs, e.g., two lemmas, both in- During the syntactic analysis gender errors oc- accurate, were provided for the MWE bendroji cur when the composition of an MWE is more dalines˙ nuosavybes˙ teise˙ ‘general partial owner- complex, e.g., *vienas ar kita grupe˙ ‘one or the ship’, where the first adjective bendroji ‘general’ other group’(instead of viena ar kita grupe˙, with is congruent with the MWE head teise˙ ‘law’ and viena ‘one’ and kita ‘other’ in the feminine form). the second adjective dalines˙ ‘partial’ with the We can see that the coordinating link could be the noun nuosavybes˙ ‘property’ (which depends on factor determining the agreement errors12. the MWE head). In the first provided lemma *ben- droji daline˙ nuosavybes˙ teise˙, daline˙ incorrectly 11Some errors of lemmatisation occur due to errors of the previous morphological analysis, e.g., the lemma for the agrees with teise˙, and in the second one, *ben- MWE arbatinio šaukštelio (‘tea spoon’, sing. Gen.) is pro- drosios dalines˙ nuosavybes˙ teise˙, bendrosios in- vided incorrectly as *arbatinio šaukštelis (genitive singular + correctly agrees with nuosavybes˙ . nominative singular, instead of arbatinis šaukštelis, nomina- tive singular for both the adjective and the noun), because of As each grammatical form is lemmatised sepa- an improper morphological analysis: arbatinis was annotated rately, in some cases there is more than one lemma as a noun, not an adjective. 12 for the same MWE. Thus, lemmatisation accuracy Some similar errors, which must be corrected, appear with the genitive case, e.g. Afrikos ir Azija (genitive and nom- was assessed for individual grammatical forms of inative, it should be Afrika ir Azija ‘Africa and Asia’, nomina- tive and nominative), *daina ir šoki ˛uansamblis (nominative 10Ending -as for the first adjectival declension, -ias for the noun, conjunction, genitive noun, nominative noun), instead second, -is and -ys for the third and fourth, and -us for the of dain ˛uir šoki ˛u{ansamblis} (genitive noun, conjunction, fifth. genitive noun, nominative noun) ‘song and dance ensemble’.

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 45 It was observed that a large number of agree- cated lemmatisation instances, the loss of gram- ment errors take place when one of the attributes matical forms which carry the meaning of an is an apposition, i.e., a noun which has to agree in a MWE, e.g., atstov ˛uteigimas (‘representatives’ as- case (sometimes gender and number) with the ad- sertion’) could be considered as a correctly gen- jacent noun, e.g., *šalies gaveja˙ (it should be šalis erated lemma; however, after a closer investiga- gaveja˙ , ‘the recipient party’), *mergeles˙ Marija (it tion of the grammatical forms, we can see that should be mergele˙ Marija, ‘the Virgin Mary’, with in this MWE the syntactic head is always used both parts in nominative). One the other hand, in the instrumental case (teigimu), i.e., atstov ˛u such attributes are not numerous, and such errors teigimu (‘according to the representatives’), thus could be solved by looking at other cases than gen- the lemma should keep this form. Similarly, the itive. lemma of an MWE balsavimo paštas (‘voting post’) is not accurate, as the syntactic head (paš- 4.2 Government Errors tas) should be in the instrumental case, i.e., bal- savimas paštu (‘voting by mail’), while the lemma Problems mainly arise when the tool fails to cor- of a phrase visa išgale˙ (‘all possible measures’) rectly identify the syntactic head of a phrase. Such should be visomis išgalemis˙ (‘by all possible mea- a problem usually occurs when the head is not at sures’), because this phrase as an MWE is used the end of an MWE, e.g. *paskolos {studijos} (in- only in the form of instrumental plural. stead of {paskola} studijoms ‘study loan’), *atliko savo {darbas} ‘carried out their work’ (instead of {atlikti} savo darb ˛a, where the verb is the head). 4.3 Lexico-grammatical Errors Problems also occur in phrases, where the head is There are many errors made by JungLe where a half-participle or a gerund: *˛isigaliojus nauja- a lemma has to be assigned to nouns which are sis civilinis {kodeksas} ‘when the new civil code used in plural in the phrase, e.g., *žmogaus teise˙ came into effect’ (it should be {˛isigaliojus} nau- ir laisve˙ ‘human right and freedom’, instead of jajam civiliniam kodeksui, i.e. where dative is re- žmogaus teises˙ ir laisves˙ , ‘human rights and free- quired). doms’; *jungtine˙ tauta ‘united nation’, instead of It must be noticed that in some cases the rep- Jungtines˙ Tautos, ‘United Nations’; *visa Baltijos resentation of the lemma does not correspond to šalis ‘the whole Baltic country’, instead of visos a natural linguistic form. It occurs in colloca- Baltijos šalys ‘all Baltic countries’, *Vilniaus ir tions which contain a conjugated verb (pakilo) Šalˇcinink˛urajonas ‘Vilnius and Šalcininkaiˇ dis- with a (nominative) subject, e.g. pakilo tem- trict’, instead of Vilniaus ir Šalˇcinink˛urajonai peratura¯ ‘temperature rised’. In the assigned lem- ‘Vilnius and Šalcininkaiˇ districts’. As number er- mas, conjugated verbs are substituted for infinitive rors were considered the examples when a lemma forms (pakilti). Infinitives cannot have a subject in looked correct at first sight, i.e., a lemma is pro- Lithuanian, and therefore the MWE subject could vided in the same number as in the dictionary. be presented in brackets in the nominative form However, from the usage data (all forms of a (e.g. pakilti (temperatura)¯ ‘to rise (temperature)’). phrase) one can see that certain MWEs are used A further exception comes from the MWEs with a only in plural, e.g., meteorologines˙ s ˛alygos ‘me- gerund, since the logical subject of a gerund is not teorological conditions’, mineralines˙ tr ˛ašos ‘min- expressed as a nominative, but as a dative comple- eral fertilizers’, mirties aplinkybes˙ ‘death circum- ment, e.g., atsitikus nelaimei ‘a disaster occurs’. stances’. All these phrases, which are made of an In such cases, we propose to assign two differ- adjective or a genitive noun followed by a noun, ent lemmas: one lemma, which retains the gram- are incorrectly lemmatised in the singular form, matical form without change, atsitikus nelaimei e.g. *meteorologine˙ s ˛alyga, *mineraline˙ tr ˛aša, (gerund + dative complement), as used in gerund *mirties aplinkybe˙. Many of the above-mentioned grammatical form; and the second lemma, where nouns can be used in plural and singular, when the gerund is substituted for the infinitive and the they are used independently, but they can be re- dative complement is substituted for a nominative stricted to one of these numbers inside MWEs. form in bracket, e.g. atsitikti (nelaime)˙ (infinitive Traditional grammars and dictionaries do not pro- + nominative), as in the previous example. vide necessary information to solve this problem, We should also mention, among other compli- which could often be resolved if we take into ac-

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 46 count actual usage from the corpus. forms of MWEs. But beside syntactic analysis of There are two types of degree errors: a) in some MWEs, we often need to take into account the us- phrases a particular degree form is used, thus, the age data of a particular MWE and to apply addi- same form should be in the lemma (Aukšˇciausia- tional criteria. The automatic syntagmatic lemma- sis Teismas ‘supreme court’; daugiau kaip dveji tisation tool was tested on the data from the Dic- metai ‘more than two years’); b) there are phrases, tionary of Lithuanian Nominal Phrases, which are where an adverb or an adjective is used in several characterized by a high variety. For this reason, it degrees: then different phrases can contain adjec- can be stated that the essential features, as well as tives or adverbs of different degrees (cf. ˛ivairus¯ / problems, of automatic lemmatisation of Lithua- ˛ivairiausibudai¯ (‘various/ the most various ways’) nian MWEs were identified. and skirti daug/daugiau/daugiausia demesio˙ (to One of the most important lemmatisation is- pay a lot of/more/ most attention)). Analysis of sues that is difficult to solve is the problem of all forms of an MWE can help to distinguish a) an attribute which is incongruent with a noun and b) phrases. and usually expressed in genitive. Most com- Errors of definiteness often occur in phrases monly, such problems (in automatic lemmatisa- joined by coordination, when one adjective is pro- tion) are inevitable, because ambivalent syntac- vided in the definite form while the other one is in- tic relations can exist in MWEs composed of the definite, e.g., *Senas ir Naujasis testamentas (‘Old same words, e.g., the lemma for MWE gram- and New Testament’). matical forms administracines˙ teises˙ pažeidim ˛u, After the examination of errors and problem- administracines˙ teises˙ pažeidim ˛a, administracines˙ atic cases created by JungLe, we can draw a con- teises˙ pažeidimus ‘breach of administrative law’ clusion that automatic lemmatisation is aggravated (where administracinis ‘administrative’ is con- by: gruent with teise˙ ‘law’) should be adminis- tracines˙ teises˙ pažeidimas, while the lemma for 1. syntactic heads in the genitive form: when MWE grammatical forms administracin˛i teises˙ there are several nouns in the genitive in the pažeidim ˛a, administracini ˛uteises˙ pažeidim ˛u, ad- MWE, it leads to attachment ambiguities; ministracinius teises˙ pažeidimus ‘administrative 2. the length of an MWE: the longer the phrase, breach of law’ (where administracinis ‘adminis- the more complicated syntactic structure; the trative’ is congruent with pažeidimas ‘breach’) accuracy of lemmatisation decreases (see Ta- should be administracinis teises˙ pažeidimas. In ble 2); order to set the right lemma, the noun with which the adjective agrees must be correctly assigned. 3. problems of lexico-grammatical nature, when The head of a phrase in genitive can influence a grammatical form depends on a lexical adjective agreement errors, too. For instance, the meaning (here, errors of number must be em- genitive grammatical form periodinio mokslo lei- phasized). dinio, where it is unclear if periodinio ‘periodic’ is It must be emphasized that the numbers in Ta- congruent with mokslo ‘science’ or leidinio ‘pub- ble 2 show the situation after the first extension of lication’, could formally be lemmatised as *pe- JungLe. The results can still be improved signifi- riodinio mokslo leidinys ‘a publication of peri- cantly. Some errors can be corrected by improving odic science’ or periodinis mokslo leidinys ’a pe- the grammar used by JungLe for syntactic analy- riodic scientific publication’ by looking at the in- sis, some of them require adding new capacities ternal syntactic structure of the term. In order to to JungLe, other errors will be difficult to correct disambiguate syntax correctly, we need to com- without human intervention. pare other (unambiguous, i.e., cases other than the genitive) forms of the term, e.g., periodiniams 5 Discussion and Recommendations mokslo leidiniams (in dative plural), which shows The traditional morphological analyser, which that the adjective periodinis ‘periodic’ is congru- analyses every word individually, cannot produce ent with the noun leidinys ‘publication’, therefore, natural lemmas for MWEs. It is necessary to carry this MWE should properly be lemmatised as peri- out the syntactic analysis for automatic assigna- odinis mokslo leidinys. tion of natural lemmas for different grammatical The problems concerning the genitive case

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 47 would decrease, if the usage criterion was applied, only. i.e., if lemma was identified considering all forms It is possible that next to the usage criterion, of the MWE. For example, it is especially com- other criteria will have to be introduced. For ex- plicated to lemmatise an MWE with all genitive ample, in order to avoid lemmatisation errors re- cases, e.g., it is impossible to identify an accurate lated to definiteness, it would be worthwhile to in- lemma for MWEs valstybinio socialinio draudimo voke not only the usage, but, also, frequency cri- biudžeto (‘the budget of state social insurance’), terion. Indeed, according to the data, nekilnoja- fizini ˛uasmen ˛upajam ˛umokesˇci˛u (‘income taxes of mas turtas (with the indefinite form of the adjec- natural persons’). In such cases, a rule should be tive nekilnojamas) and nekilnojamasis turtas (with applied: if the same phrase is used in genitive and the definite form of the adjective nekilnojamasis), in other cases, the lemma should be identified on which both mean ‘real property’, are concurrently the basis of phrases with other cases than genitive. used. However, one can expect the standard form, The usage criterion would help to avoid the the definite one, to be more frequent, as it is a term. number errors. Quite often this criterion proved The evaluation of the research results has re- the rule that if different grammatical forms of vealed that the accuracy of the MWE lemma- an MWE are in plural, then the lemma should tisation is not only influenced by the accuracy keep the plural form too. For example, a dictio- of the syntactic analyser, but, also, by the vari- nary of nominal phrases provides two grammatical ability of MWEs. If we come across a phrase forms: laužas ir atliekos, laužo ir atliek ˛u ‘debris which has two variants, then a separate lemma and waste’, in both phrases the noun laužas is in will be assigned to each variant during the au- the singular form, while atliekos is used in plural. tomatic lemmatisation, e.g., užraš ˛uknygute˙ and Thus, when merging the two MWEs to one lemma, užraš ˛uknygele˙ (‘a notebook’, the difference lies atliekos has to remain in the plural form. During in the diminutive suffix of the nouns). However, the lemmatisation of the forms žveris˙ ir paukšˇcius, several forms of degree, different forms of defi- žveri˙ ˛uir paukšˇci˛u, žverys˙ ir paukšˇciai (‘beasts and niteness could be used in the same MWE; for this birds’, repectively accusative plural, genitive plu- reason, we have to discuss how to reflect all this ral, nominative plural), we have to assign plural in a lemma. The substituting component could be lemmas for both nouns – žverys˙ ir paukšˇciai, be- presented in angle brackets: skirti [daug/daugiau] cause all forms of these nominal phrases are in demesio˙ ‘to pay [much/more] attention’; [nekilno- the plural form. This is especially important for jamas/nekilnojamasis] turtas ‘real property’ (with names, cf. *Lietuvos geležinkelis (it should be Li- a definite or indefinite adjective). Thus, this would etuvos geležinkeliai, ‘Lithuanian Railways’), *Vil- indicate that some syntagmatic lemmas contain niaus šilumos tinklas (it should be Vilniaus šilu- substituting components. mos tinklai, ‘Vilnius Heating Network’). Based on the usage data, it would be possible References to distinguish between the MWEs where a certain Loïc Boizou, Gintare˙ Grigonyte,˙ Erika Rimkute,˙ and word is used only in one form of the degree (Aukš- Andrius Utka. 2012. Automatic Inference of Base ˇciausiasisTeismas, superlative, ‘Supreme Court’), Forms for Multiword Terms in Lithuanian. In Pro- and those where several forms of a degree are used ceedings of the Fifth International Conference Hu- (˛ivairus¯ budai¯ and ˛ivairiausi budai¯ , positive and man Language Technologies – The Baltic Perspec- tive, pages 27–35. superlative, ‘various ways’). When applying the usage criterion, it is impor- Vidas Daudaraviciusˇ and Ruta¯ Marcinkevicienˇ e.˙ 2004. Gravity Counts for the Boundaries of Colloca- tant to remember that in this case the accuracy of tions. International Journal of , the tool will be linked to the corpus data: the rarer 9(2):321–348. the phrase, the higher the risk for the tool to make Jolanta Kovalevskaite.˙ 2014. Phraseme-type and a mistake. For example, if we recognize only two Phraseme-token: a Corpus-driven Evidence for forms of a particular phrase, and they are both in Morphological Flexibility of Phrasemes. Res Hu- the plural form, the tool can come to a false con- manitariae, XVI, pages 126–143. clusion that the lemma of that phrase is also in plu- Elizaveta Loginova, Anita Gojun, Helena Blancafort, ral, although that phrase could also be used in sin- Marie Guégan, Tatiana Gornostay, and Ulrich Heid. gular. But such a risk is significant for rare MWEs 2012. Reference Lists for the Evaluation of Term

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 48 Extraction Tools. In Proceedings of the 10th Inter- national Congress on Terminology and Knowledge Engineering (TKE), pages 177–192, Madrid, Spain.

Ruta¯ Marcinkevicienˇ e.˙ 2010. Lietuvi ˛u kalbos kolokacijos. Vytauto Didžiojo universitetas, Kau- nas, Lithuania. Jonas Paulauskas (ed). 2001. Frazeologijos žodynas. Lietuvi ˛ukalbos institutas, Vilnius, Lithuania.

Erika Rimkute˙ and Vidas Daudaravicius.ˇ 2007. Mor- fologinis dabartines˙ lietuvi ˛ukalbos tekstyno ano- tavimas. Kalb ˛ustudijos, 11:30–35.

Erika Rimkute,˙ Agne˙ Bielinskiene,˙ and Jolanta Ko- valevskaite˙ (eds). 2012. Lietuvi ˛u kalbos daiktavardini ˛ufrazi ˛užodynas. Vytauto Didžiojo universitetas, Kaunas, Lithuania. Gregor Thurmair and Vera Aleksic.´ 2012. Creating Term and Lexicon Entries from Phrase Tables. In Proceedings of the 16th EAMT Conference, pages 253–260. Vytautas Zinkevicius,ˇ Vidas Daudaravicius,ˇ and Erika Rimkute.˙ 2005. The Morphologically Annotated Lithuanian Corpus. In Proceedings of The Second Baltic Conference on Human Language Technolo- gies, pages 365–370. Vytautas Zinkevicius.ˇ 2000. Lemuoklis – mor- fologinei analizei. Darbai ir Dienos, 24:245–273.

Proceedings of the 20th Nordic Conference of Computational Linguistics (NODALIDA 2015) 49