Estonian copular and existential constructions as an UD annotation problem

Kadri Muischnek, Kaili Müürisep
University of Tartu, Estonia
[email protected], [email protected]

Abstract

This article is about annotating clauses with nonverbal predication in version 2 of the Estonian UD treebank. Three possible annotation schemas are discussed, among which separating existential clauses from copular clauses would be theoretically most sound but would need too much manual labor and could possibly yield inconsistent annotation. Therefore, a solution has been adopted which separates existential clauses consisting only of subject and (copular) verb olema 'be' from all other olema-clauses.

1 Introduction

This paper discusses the annotation problems and research questions that came up during the annotation of Estonian copular sentences while developing the Estonian Universal Dependencies treebank, especially while converting it from version 1 of the UD annotation guidelines to version 2.

Copular clauses are a sentence type in which the contentful predicate is not a verb, but falls into some other category. In some languages there is no verbal element at all in these clauses; in other languages there is a verbal copula joining the subject and the non-verbal element (Mikkelsen, 2011, p. 1805).

In Estonian, there is mainly one verb, namely olema 'be', that functions as copula in copular clauses. Estonian descriptive grammar (Erelt et al., 1993) uses the term copular verb (Est. 'köide') only for describing sentences with subject complements, stating that in such sentences the verb olema has only the grammatical features of a predicate (tense, mood, person). Also, the copula olema is semantically empty if used alone, and it cannot have any dependents other than the non-verbal predicate. At the same time, the verb olema is the most frequent verb of the Estonian language; it can also function as an auxiliary in compound tenses, be part of phrasal verbs, and occur in existential, possessor or cognizer sentences.

As the descriptive grammar of Estonian (Erelt et al., 1993) lacks a more detailed treatment of copular sentences, a copula label has not been introduced into the original Estonian Dependency Treebank (EDT) (Muischnek et al., 2014). In copular clauses olema is annotated as the root of the clause and the other components of the sentence depend on it; that is also the case if the sentence contains a subject complement. As subject complements have a special label PRD (predicative) in EDT, such sentences can be easily searched.

The Estonian treebank for UD v1.3 has been generated automatically from EDT using transfer rules. The guidelines for UD v1 implied that subject complements serve as roots in copular clauses. This analysis of copula constructions, according to the UD v1 guidelines, extended to adpositional phrases and oblique nominals as long as they have a predicative function. By contrast, temporal and locative modifiers were treated as dependents of the existential verb 'be'.

Therefore, while converting EDT to UD v1, sentences with subject complements were relatively easily transferred to sentences with copular tree structure (Muischnek et al., 2016). However, oblique nominals and adpositional phrases were not annotated as instances of nonverbal predication.

Since UD v2.0 assumes a more general annotation scheme for copular sentences, we faced several conversion problems and also linguistic questions. This paper provides insights into these research questions, gives an overview of how copular clauses are annotated in some UD v2 treebanks for some other languages (Finnish, German, English) and describes the options for annotating Estonian sentences.
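The re-rooting involved in turning an EDT sentence with a subject complement into a copular tree can be illustrated with a minimal sketch. It is not the transfer-rule implementation used for the treebank: the `Token` class, the function `reroot_copular` and the `SUBJ` and `ROOT` labels are hypothetical names chosen for illustration; only the `PRD` label and the `cop` relation are taken from the text.

```python
from dataclasses import dataclass, replace


@dataclass
class Token:
    idx: int      # 1-based position in the sentence
    form: str     # word form
    lemma: str    # lemma, used to recognise forms of olema
    head: int     # index of the head token, 0 for the root
    deprel: str   # dependency label (label names other than PRD are assumed)


def reroot_copular(tokens):
    """Promote a PRD dependent of a root olema to root; olema becomes cop.

    Sentences without a root olema or without a PRD dependent are
    returned unchanged (as copies).
    """
    new = [replace(t) for t in tokens]  # copy, so the input is not modified
    root = next((t for t in new if t.head == 0 and t.lemma == "olema"), None)
    if root is None:
        return new
    prd = next((t for t in new
                if t.head == root.idx and t.deprel == "PRD"), None)
    if prd is None:
        return new
    for t in new:
        if t is prd:
            t.head, t.deprel = 0, "root"        # subject complement becomes root
        elif t is root:
            t.head, t.deprel = prd.idx, "cop"   # olema attaches to it as copula
        elif t.head == root.idx:
            t.head = prd.idx                    # other dependents move along
    return new


# 'Mari on õpetaja' ('Mari is a teacher'), with olema as root as in EDT
sentence = [
    Token(1, "Mari", "Mari", 2, "SUBJ"),
    Token(2, "on", "olema", 0, "ROOT"),
    Token(3, "õpetaja", "õpetaja", 2, "PRD"),
]
converted = reroot_copular(sentence)
```

The sketch only re-attaches tokens; mapping the remaining EDT labels to UD relations is left out on purpose. A sentence such as 'Ta on praegu kodus', which has no PRD dependent, passes through unchanged, which matches the observation that oblique nominals were not converted to nonverbal predicates in v1.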

Proceedings of the NoDaLiDa 2017 Workshop on Universal Dependencies (UDW 2017), pages 79–85, Gothenburg, Sweden, 22 May 2017.

In the remainder of this paper, in Section 2 we give a short account of the UD v2 guidelines for annotating copular clauses and show how these constructions are annotated in some UD v2 treebanks for Finnish, German and English. Section 3 is dedicated to copular constructions in the Estonian language and in Estonian UD versions 1 and 2. Some conclusions are drawn in Section 4.

2 UD annotation guidelines for nonverbal predication

According to version 1 of the UD annotation scheme, copular constructions are to be annotated differently from other clause types: the predicative element is analysed as root and, if there is an overt linking verb present, it is attached to this nonverbal predicate as copula. The copula relation is restricted to function words whose sole function is to link a non-verbal predicate to its subject and which do not add any meaning other than grammaticalised TAME categories. Such an analysis is motivated by the fact that many languages often or always lack an overt copula, so the annotation would be cross-linguistically consistent.¹

Version 2 of the UD annotation guidelines extends the set of constructions that should be annotated as instances of nonverbal predication, defining six categories of nonverbal predication, namely those of equation, attribution, location, possession, benefaction and existence.²

In order to get a better overview of the practical annotation of copular constructions cross-linguistically, we studied the v2 versions of the UD treebanks of Finnish, which is the most closely related language to Estonian present in UD, and also of German and English. As the language-independent annotation guidelines for UD version 2 were published at the very end of last year, there are no language-specific guidelines published yet. So we had to rely on treebank queries in order to gain information about annotating copular and related constructions in the aforementioned languages. We queried the UD v2 treebanks using the SETS treebank search maintained by the University of Turku.³

There are two UD treebanks for Finnish: the Finnish UD treebank, based on the Turku Dependency Treebank, and Finnish-FTB (FinnTreeBank). In the Finnish UD v2 treebank, clauses with olla 'be' are mostly regarded as instances of nonverbal predication, annotating olla as copula (1), among them also possessive clauses (2). However, if the clause contains only a subject besides some form of olla, olla is annotated as root (3).

In the Finnish FTB v2 treebank more clause types are annotated with olla as root, e.g. the possessive clause (4) and the clause containing a predicative adverbial (5). It seems that the annotation of copular constructions in Finnish FTB resembles that in version 1 of Estonian UD: only subject complements are annotated as roots in copular constructions.

In the UD v2 treebank of German, sentences with a subject complement are annotated as instances of nonverbal predication (6), and other instances of sein and werden 'be' seem to be annotated as main verbs, not copulas (7).

(1) Hyllyllä   oli  H&M   Home-tuotteita
    shelf-ADE  was  H&M   Home-product-PL.PRT
    ROOT       cop  nmod  nsubj:cop
    'There were some H&M products on the shelf'

(2) Kuvia           minulla  ei   ole
    picture-PL.PRT  I-ADE    not  be
    nsubj:cop       ROOT     aux  cop:own
    'I have no pictures'

(3) Kun   rahaa      ei   ole ...
    If    money-PRT  not  is
    mark  nsubj      aux  ROOT
    'If there is no money'

(4) Meillä    ei   ole   rahaa      tuhlata.
    we-ADE    not  is    money-PRT  waste-INF
    nmod:own  aux  ROOT  nsubj      acl
    'We have no money to waste'

(5) Talonmies  on    juovuksissa.
    Caretaker  is    drunkenness-INE
    nsubj      ROOT  advmod
    'The caretaker is drunk'

¹ http://universaldependencies.org/v2/copula.html
² http://universaldependencies.org/u/overview/simple-syntax.html#nonverbal-clauses
³ http://bionlp-www.utu.fi/dep_search/

(6) Das  Personal   ist  freundlich.
    DET  staff      is   friendly
    det  nsubj:cop  cop  ROOT

    'The staff is friendly.'

(7) Ich    war   in    dem  Dezember  bei   Küchen  Walther.
    I      was   in    DET  December  at    Küchen  Walther
    nsubj  ROOT  case  det  obl       case  obl     flat
    'I was in December at Küchen Walther.'

There are four English UD treebanks, but we queried only the largest of them, the English Web Treebank. It seems that predicative (e.g. Fig. 1) and locative (Fig. 2) constructions are analysed as instances of non-verbal predication, whereas existential clauses (Fig. 3) are not.

Figure 1: Predicative construction in the English UD treebank.

Figure 2: Locative construction in the English UD treebank.

Figure 3: Existential clause in the English UD treebank.

As the above discussion illustrates, the annotation of copular constructions varies across different languages and also across different treebanks. Having better documentation for v2 treebanks would facilitate better understanding of these differences. It would also help those teams which are still working on v2 of their treebanks to make more informed decisions.

3 Copular and existential constructions in Estonian

In Estonian, the copular verb cannot be omitted in normal writing or speech. However, there are some exceptions. First, the copula is often omitted in headlines like (8).

(8) Valitsus    otsustusvõimetu
    Government  indecisive
    'Government is indecisive.'

And due to time pressure, the copula, but also other verbs, can be omitted in online communication (9).

(9) Ma  nii  kurb
    I   so   sad
    'I am so sad.'

As a side note, although ellipsis of the verb olema is rare in Estonian, it is still more frequent than in Finnish (Kehayov, 2008).

The annotation guidelines for version 2 of Universal Dependencies define six categories of nonverbal predication that can be found cross-linguistically (with or without a copula), namely equation (aka identification), attribution, location, possession, benefaction and existence.

As for Estonian, constructions expressing equation (10a), attribution (10b), location (10c), possession (10d), benefaction (10e) or existence (10f) are all coded using the verb olema. In addition to the aforementioned clause types, there are also cognizer clauses (10g) that can be viewed as a subtype or metaphorical extension of the possessive clause type. Perhaps the quantification clause (10h) should also be mentioned as a separate type.

(10) a. Mari  on  õpetaja.
        Mari  is  teacher
        'Mari is a teacher.'

     b. Laps   on  väike.
        Child  is  small
        'Child is small.'

     c. Laps   on  koolis.
        Child  is  school-INE
        'Child is at school.'

     d. Lapsel     on  raamat.
        Child-ADE  is  book
        'Child has a book.'

     e. See   raamat  on  lapsele.
        This  book    is  child-ALL
        'This book is for child.'

     f. Oli  tore  kontsert.
        was  nice  concert
        'This was a nice concert.'

     g. Lapsel     on  igav.
        Child-ADE  is  boring.
        'Child is bored.'

     h. Lund      on  palju.
        Snow-PRT  is  lot
        'There is a lot of snow.'

Further in this section we will discuss three possible ways to annotate Estonian clauses containing the (copular) verb olema.

Among the constructions (10 a-h), existential clauses (f) are exceptional as they can also consist only of a subject and some form of the verb olema. In our opinion, such a construction cannot be annotated as an instance of nonverbal predication as there is no possible predicate except the verb olema. Since, for the sake of consistency, all existential clauses should be annotated in the same way, i.e. as instances of verbal predication, not nonverbal, we would have to distinguish existential constructions from all other olema-clauses.

In what follows in this Section, we will study whether this solution can be applied while annotating real corpus sentences. For that we have to find out if and how existential clauses can be identified. We start by investigating the linguistic features of the existential clause type against the background of the main clause types in Estonian. Subsection 3.1 presents the linguistic features of Estonian existential constructions. Subsection 3.2 gives an overview of the constructions annotated as instances of nonverbal predication in version 1 of the Estonian UD treebank. In subsection 3.3 we discuss three possibilities for annotating olema-clauses in Estonian UD v2 and conclude the section by adopting a compromise solution.

3.1 Existential clauses in Estonian

Characteristic features of Estonian existential clauses include the possibility of a partitive subject (the default case of the Estonian subject is nominative) and inverted word order (the subject comes after the verb), but existential clauses share these features with possessive clauses and also some other minor clause types. In order to understand the problem, we start with a small overview of the main clause types in Estonian and explain how olema-clauses are distributed among those clause types, paying special attention to the existential clause type.

Descriptive grammars of Estonian (Erelt et al., 1993, pp. 14–15) and (Erelt, 2003, pp. 43–46) distinguish between three main clause types, depending on whether the syntactic subject, the semantic macrorole of actor and the clause topic (theme) overlap, i.e. are coded by the same nominal. In so-called normal clauses, the same nominal functions as subject, actor and topic; in possessor-cognizer clauses, the possessed or cognized entity is both subject and topic, but not actor, which in turn denotes the possessor or cognizer. The subject noun denotes the actor in existential clauses, but is rhematic.

Possessor-cognizer and existential clauses are regarded as marked clause types, also termed inverted clauses (Erelt, 2003, pp. 93–55), as they have inverted word order in pragmatically neutral sentences: XVS instead of SVX, which is otherwise typical of "ordinary" Estonian sentences. Subjects of these marked clause types can be in the partitive case form, while nominative is the unmarked and statistically dominant subject case.

A few remarks about the possible case forms of the subject in Estonian are in place here. Estonian descriptive grammars (Erelt et al., 1993, p. 15) and (Erelt, 2013, p. 36) state that existential and possessive clauses differ from other clause types in the possibility of subject case alternation: the subject is mostly in the partitive case form in negative sentences (12), (14) and can be in the partitive case form also in affirmative sentences (11), (13) if the referent of the subject noun is quantitatively unbounded.

(11) Selles    klassis    on   targad    lapsed  /  tarku         lapsi.
     this-INE  class-INE  are  smart-PL  kid-PL  /  smart-PL.PRT  kid-PL.PRT
     'There are smart kids in this class. / There are some smart kids in this class.'

(12) Selles    klassis    ei   ole  tarku         lapsi.
     this-INE  class-INE  not  are  smart-PL.PRT  kid-PL.PRT
     'There are no smart kids in this class.'

(13) Tal        on   head     sõbrad     /  häid         sõpru.
     (S)he-ADE  are  good-PL  friend-PL  /  good-PL.PRT  friend-PL.PRT
     '(S)he has good friends. / (S)he has some good friends.'

(14) Tal        ei   ole  häid
     (S)he-ADE  not  are  good-PL.PRT

     sõpru.
     friend-PL.PRT
     '(S)he has no good friends.'

Statistically, negation is the most powerful predictor of a partitive subject (Miestamo, 2014). Of the clause types defined in the UD documentation, existential clauses share the property of a case-alternating subject with possessive clauses, but a partitive subject is much less frequent in cognizer clauses, which can otherwise be seen as a constructional extension of the possessive clause. A partitive subject is the only option in the quantification clause, regardless of its polarity, and in the negative possessive clause.

Existential clauses differ from other marked clause types as they can also consist of a subject only, besides the verb olema; in this case the clause merely states or negates the existence of an entity denoted by the subject (15). There is also a special periphrastic verb form olema olemas (be be-INE) used for stating that something exists (16).

(15) On   kontserte,      kus    loetakse   ka    luuletusi.
     Are  concert-PL.PRT  where  read-IMPS  also  poem-PL.PRT
     'There are concerts where poems are also chanted.'

(16) Nõiad    on   olemas.
     Witches  are  be-INE
     'Witches exist.'

3.2 Nonverbal predication in version 1 of Estonian UD treebank

According to the first version of the Universal Dependencies guidelines, copular clauses consisting of a noun or an adjective which takes a single argument with the subject relation were to be analysed as instances of nonverbal predication. The copula verb (if present) was attached to the predicate with the "cop" relation. This analysis of copula constructions was extended to adpositional phrases and oblique nominals as long as they had a predicative function. By contrast, temporal and locative modifiers were to be treated as dependents of the existential verb 'be'. So clauses containing some copular verb were to be divided between the categories of verbal and nonverbal predication.

In version 1 of the Estonian UD treebank, only sentences containing the verb olema and a subject complement in the nominative or partitive case form were annotated as instances of nonverbal predication (17). This was partly motivated by the fact that subject complements were already annotated using a special dependency relation (PRD) in our original treebank, the Estonian Dependency Treebank. All other copular clauses were annotated as instances of verbal predication, annotating the form of the verb olema as root (18).

(17) Mina  olen  Merlin.
     I     am    Merlin
     'I am Merlin'

(18) Ta     on  praegu  kodus.
     (s)he  is  now     home-INE
     '(S)he is at home now.'

3.3 Nonverbal predication in Estonian UD version 2: three possible solutions

As already mentioned in Section 2, version 2 of the Universal Dependencies guidelines extends the number of constructions that fall into the category of nonverbal predication.

Among the Estonian copular constructions listed in the beginning of Section 3, existential constructions that only state or negate the existence of an entity expressed by the subject and consist only of the verb olema and its subject (15) pose a problem for the UD v2 annotation scheme. We are of the opinion that they cannot be analysed as examples of nonverbal predication, as one cannot label the subject as a predicate, so in these sentences the verb form of olema 'be' has to be annotated as root, not as copula.

Thus we have three basic options for annotating copular constructions in Estonian UD v2: always annotate olema as copula; separate clauses consisting of the verb olema and its subject from all other olema-clauses; and, as the third option, try to separate existential clauses from other olema-clauses. A fourth possible solution would be to stick to the solution we had in version 1 of Estonian UD, namely to annotate only subject complements, i.e. nouns and adjectives in the nominative or partitive case form, as non-verbal predicates, but as this solution would violate the guidelines for UD version 2, we will not discuss it further.

We will take a closer look at the first three aforementioned options one by one.
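Before the options are taken up in turn, the first two can be stated as a mechanical rule, sketched below under simplifying assumptions. The function name `olema_role`, its parameters and the idea of passing in only the relations of the clause members that depend directly on olema (with punctuation and function words already filtered out) are hypothetical choices made for illustration, not part of the treebank tooling. The third option is left out because it depends on recognising existential clauses, which the sketch does not attempt to reduce to a rule.

```python
def olema_role(dependent_relations, option):
    """Return 'cop' or 'root' for a form of olema in one clause.

    dependent_relations: relations of the other clause members,
        e.g. ["nsubj"] or ["nsubj", "obl"] (assumed to be pre-filtered).
    option: 1 = always annotate olema as copula,
            2 = olema is root only when the clause has nothing but a subject.
    """
    if option == 1:
        # Every olema-clause is nonverbal predication; in a subject-only
        # clause the subject itself would have to serve as root.
        return "cop"
    if option == 2:
        non_subjects = [rel for rel in dependent_relations
                        if not rel.startswith("nsubj")]
        # Any member besides the subject can act as the nonverbal predicate.
        return "cop" if non_subjects else "root"
    raise ValueError("only options 1 and 2 are expressed as rules here")


# Subject-only clause, as in (15): olema is root under option 2
subject_only = olema_role(["nsubj"], option=2)
# Clause with a locative member, as in (18): olema is copula under option 2
with_locative = olema_role(["nsubj", "obl"], option=2)
```

The difference between the two options shows up only for subject-only clauses, which is exactly where the first option would force a subject to be labelled as predicate.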

Annotate all olema-clauses as instances of nonverbal predication

This would be the most straightforward solution from the point of view of the UD v1 to v2 conversion process. The main drawback would be having to annotate subjects as predicates in clauses consisting only of the verb olema and its subject (Fig. 4).

Figure 4: A subject as a root, annotation of the sentence (15).

Distinguishing subject-only clauses and other olema-clauses

The second possible option would be to distinguish two separate classes of olema-clauses based on a simple syntactic criterion: if the sentence consists only of some form of olema or the periphrastic verb olemas olema and its subject, then (olemas) olema is annotated as root. All other sentences are annotated as instances of nonverbal predication. The distinction would be easy to make, and the main drawback would be that existential sentences like (15) and (19) get different syntactic structures, which can be regarded as an inconsistent solution.

(19) Eile       oli  kontsert,  kus    loeti      ka    luuletusi.
     Yesterday  was  concert    where  read-IMPS  also  poem-PL.PRT
     'There was a concert yesterday where poems were also chanted.'

Separate existentials and other olema-clauses

For the sake of consistency, all existential constructions should be annotated the same way, irrespective of whether they consist only of subject and verb or have more syntactic participants. This approach, although theoretically correct in our opinion, is not easy to apply in annotation practice.

As already described in subsection 3.1, Estonian existential clauses are defined as those whose subject is in the partitive case in negative clauses and can be in the partitive case also in affirmative clauses. In a pragmatically neutral affirmative clause, word order distinguishes between locative (20) and existential clauses (21). Negative variants of these clauses show the distinction: the subject is in the nominative case form in the locative clause (22) and in the partitive case form in the existential clause (23). The example sentences are of course simplifications, especially with regard to word order; in corpus sentences the word order does not distinguish existential clauses, as it is determined mostly by information structure and depends heavily on the larger textual context.

(20) Koer  on  aias.
     Dog   is  garden-INE
     'Dog is in the garden.'

(21) Aias        on  koer.
     garden-INE  is  dog
     'There is a dog in the garden.'

(22) Koer  ei   ole  aias.
     Dog   not  is   garden-INE
     'Dog is not in the garden.'

(23) Aias        ei   ole  koera.
     garden-INE  not  is   dog-PRT
     'There is no dog in the garden.'

So, if we would like to apply this solution, it would mean that human annotators have to go over all affirmative clauses that could possibly be existential clauses and make the distinction based on their intuitions about the probable subject case in the negative counterpart of the affirmative clause under consideration, which means that there has to be more than one annotator for every clause. But Peep Nemvalts (2000), who has analysed Estonian existential sentences, comparing them with the same phenomenon in other languages, has concluded that it is impossible to distinguish Estonian existential sentences based on formal criteria.

Therefore, after considering all possible solutions for distinguishing copular and non-copular usages of olema 'be', we had to make a compromise and adopt the second possible solution, i.e. distinguish subject-only clauses and other olema-clauses. In the resulting annotated treebank, existential clauses are divided between instances of verbal and non-verbal predication, which can be regarded as a drawback. On the other hand, the resulting annotation is consistent, which is a clear advantage.

4 Conclusion

Universal Dependencies is planned to offer language-typologically relevant and cross-linguistically consistent annotation guidelines

for building dependency treebanks (Nivre et al., 2016). Its version 2, published only a few months ago, introduced a major change concerning the annotation of nonverbal predication: the repertoire of clauses that should be treated as examples of nonverbal predication was considerably broadened. Often real corpus data is a challenge even for well-premeditated theoretical constructs; even more so if this corpus data comes in more than 50 languages. So it should not be a surprise that there are still some open issues or inconsistencies.

This article tackled the problems concerning defining and annotating copular constructions in Estonian, with some brief cross-linguistic comparison.

We came forward with three possible annotation schemas, among which separating existential clauses from copular clauses would be theoretically most sound but would need too much manual labor and would possibly result in inconsistent annotation. So we will adopt the solution that, somewhat artificially, separates existential clauses consisting only of subject and (copular) verb olema from all other olema-clauses.

It seems that delimiting and annotating nonverbal predication and related phenomena is not entirely consistent cross-linguistically. In this article we had a look at a very small set of languages, but the analysis of nonverbal predication and copular constructions from a cross-linguistic (or cross-treebank) perspective deserves in-depth study. For the sake of better understanding the exact annotation of linguistic phenomena in different languages, thorough documentation of the principles and decisions underlying the annotation would be beneficial.

As for the Estonian UD treebank, the solution that at first glance seemed most correct from the linguistic point of view is (almost) impossible to achieve even by manual annotation.

Acknowledgements

This study was supported by the Estonian Ministry of Education and Research (IUT20-56), and by the European Union through the European Regional Development Fund (Centre of Excellence in Estonian Studies).

References

Mati Erelt, Reet Kasik, Helle Metslang, Henno Rajandi, Kristiina Ross, Henn Saari, Kaja Tael, and Silvi Vare. 1993. Eesti keele grammatika II. Süntaks. Eesti TA Keele ja Kirjanduse Instituut.

Mati Erelt, editor. 2003. Estonian Language. Linguistica Uralica Supplementary series, volume 1. Estonian Academy Publishers, Tallinn.

Mati Erelt. 2013. Eesti keele lauseõpetus. Sissejuhatus. Öeldis. Tartu Ülikool.

Petar Kehayov. 2008. Olema-verbi ellipsist eesti kirjakeeles. In Emakeele Seltsi aastaraamat, volume 54, pages 107–152. Eesti Teaduste Akadeemia Kirjastus.

Matti Miestamo. 2014. Partitives and negation: A cross-linguistic survey. In Partitive cases and related categories, volume 54 of Empirical Approaches to Language Typology, pages 63–86. De Gruyter Mouton.

Line Mikkelsen. 2011. Copular clauses. In Klaus von Heusinger, Claudia Maienborn, and Paul Portner, editors, Semantics: An International Handbook of Natural Language Meaning, volume 2, pages 1805–1829. Berlin: Mouton de Gruyter.

Kadri Muischnek, Kaili Müürisep, Tiina Puolakainen, Eleri Aedmaa, Riin Kirt, and Dage Särg. 2014. Estonian Dependency Treebank and its annotation scheme. In Verena Henrich et al., editors, Proceedings of the Thirteenth International Workshop on Treebanks and Linguistic Theories (TLT13), pages 285–291. University of Tübingen.

Kadri Muischnek, Kaili Müürisep, and Tiina Puolakainen. 2016. Estonian Dependency Treebank: from Constraint Grammar Tagset to Universal Dependencies. In Proc. of LREC 2016.

Peep Nemvalts. 2000. Aluse sisu ja vorm. Eesti Keele Sihtasutus.

Joakim Nivre, Marie-Catherine de Marneffe, Filip Ginter, Yoav Goldberg, Jan Hajič, Christopher D. Manning, Ryan T. McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. 2016. Universal Dependencies v1: A Multilingual Treebank Collection. In LREC. European Language Resources Association (ELRA).
