Creation of a corpus with semantic role labels for Hungarian Attila Novák1,2, László János Laki1,2, Borbála Novák1,2 Andrea Dömötör1,3, Noémi Ligeti-Nagy1,3, Ágnes Kalivoda1,3 1MTA-PPKE Technology Research Group, 2Pázmány Péter Catholic University, Faculty of Information Technology and Bionics Práter u. 50/a, 1083 Budapest, Hungary 3Pázmány Péter Catholic University, Faculty of Humanities and Social Sciences Egyetem u. 1, 2087 Piliscsaba, Hungary {surname.firstname}@itk.ppke.hu

Abstract the given text, and this ability is closely related to the ability to answer questions. Therefore, our In this article, an ongoing research is pre- aim is to create a system that is actually capable sented, the immediate goal of which is to cre- ate a corpus annotated with semantic role la- of formulating relevant questions about the text it bels for Hungarian that can be used to train processes. To do this, many distinctions need to a parser-based system capable of formulat- be made that are not present in syntactic annota- ing relevant questions about the text it pro- tion currently available for Hungarian. This article cesses. We briefly describe the objectives of presents the first phase of this work, which aims to our research, our efforts at eliminating errors create an annotated corpus where the annotation in the Hungarian Universal Dependencies cor- contains all the features needed to generate ques- pus, which we use as the base of our an- notation effort, at creating a Hungarian ver- tions concerning the text. bal argument database annotated with thematic roles, at classifying adjuncts, and at match- 2 Shortcomings of the traditional ing verbal argument frames to specific occur- analysis rences of and in the corpus. Since our goal is to create a system that can gen- 1 Introduction erate meaningful questions, we have decided that Recently, state-of-the-art performance in most when determining what distinctions need to be NLP related tasks has been achieved by end-to-end made in the annotation should be basically deter- systems based on neural deep learning networks mined by what questions can be asked concerning (see e.g. BERT (Devlin et al., 2018) or GPT- the particular grammatical construction. For ex- 2 (Radford et al., 2019)) surpassing the perfor- ample, in order to be able to formulate questions mance of previous systems employing some sort concerning noun phrases, the Who/What? dis- of grammatical analysis. This has raised doubts as tinction is indispensable, so the system must be to whether it makes sense to deal with grammat- able to clearly distinguish persons from things. At ical analysis at all. At the same time, the train- the same time, we ask who or what questions con- ing of end-to-end systems usually requires a great cerning NP’s that refer to groups or organizations amount of training material, which is not available depending on the role they play in the given sen- in most languages. Therefore, we think it may still tence. For example, a bank is referred to linguis- make sense put an effort into the implementation tically as a person when sending an invoice letter, of a grammatical analysis framework as long as but as a thing when it is liquidated. In addition, the output of the system can be directly used to a more detailed classification is required to gener- perform tasks relevant to everyday users. ate questions about nominal predicates. Concern- However, we cannot be satisfied with an anal- ing the predicate in the sentence John is a doc- ysis that relies on completely abstract categories tor, the question Who is John? is not very sophis- that cannot be clearly translated into terms that ticated. What is John’s occupation? is a ques- can be linked to what that text means in a man- tion matching the predicate in the sentence much ner that can also be understood by ordinary people. more precisely. Classifying concepts as occupa- An essential element of reading comprehension is tions, animals, tools, behaviors, etc. also makes that we are able to ask meaningful questions about for the system possible to generate more specific

220 Proceedings of the 13th Linguistic Annotation Workshop, pages 220–229 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics questions related to non-predicative occurrences 3 The corpus of noun phrases: e.g. What animal have you seen As a starting point, we chose the Hungarian sub- in the garden? vs. What did you see in the garden? corpus (Vincze et al., 2017) of the Universal This is particularly important in the case of coor- Dependencies (UD) corpus (Nivre et al., 2016) dinated phrases where one can only identify which consisting of 1800 sentences (42000 tokens) of conjunct is meant in the question if the question is mainly newswire text in order to put the annota- specific enough. tion schema we propose in a context that can be interpreted at an international level. The UD cor- To formulate questions concerning adverbials pus contains texts in many languages annotated even at the most basic level, we also need a much with morphosyntactic and dependency-based syn- more detailed system of distinctions than what is tactic analysis using unified principles and cate- provided by the syntactic annotation present in gories. Our original plan was to supplement or re- currently available tree banks. Hungarian NP’s fine the annotation in the Hungarian UD corpus headed by a word in inessive case or correspond- with the information needed to formulate ques- ing English PP’s headed by the preposition in tions. However, it turned out that the annotation can have quite a number of different grammat- in the Hungarian sub-corpus does not correspond ical functions. Thus we ask different questions to the currently valid UD specification in many re- concerning them: (1) szeptemberben ‘in Septem- spects, and contains many random annotation er- ber’: mikor? ‘when?’, (2) Londonban ‘in Lon- rors, so fixing these errors turned out to be an in- don’: hol? ‘where?’, (3) fájdalmában (felüvöltött) evitable part of our task. ‘(he screamed) in pain’: mit˝ol? ‘what made (him According to the UD 2.0 specification1, the scream)?’, (4) magában (hisz) ‘(he believes) in internal structure of multiword expressions is himself’: kiben? ‘in whom?’, (5) bajban ‘in trou- to be annotated using the flat, fixed or ble’: milyen helyzetben? ‘in what situation?’, (6) compound dependency relations. The fixed életben (marad) ‘(stay) alive’ lit. ‘(stay) in life’: relation is used exclusively to annotate fully lex- no question in general, this is part of a light icalized function-word-like structures. In many construction. languages, such as English, multiword names are generally considered to be flat exocentric struc- Generating questions concerning not only nom- tures, and the use of flat is suggested to annotate inal but also verbal predicates requires information the internal structure of these names with all words not provided by currently available annotation for of the name directly attached to the first word of Hungarian. How a question concerning a verbal the name. On the other hand, the UD 2.0 an- predicate should be formulated using specific ar- notation specification explicitly excludes the use guments as anchors depends on the thematic roles of this type of analysis in cases where the name the arguments play. What did John do to Frank? is has a regular syntactic structure (eg. in the case an adequate question if John is an and Frank of book or movie titles or a large part of names is a patient. In the same situation, What happened of organizations). Here the generic syntactic de- to Frank? and What did John do? are likewise pendency structures are to be used. Similarly, adequate questions. endocentric structures should be annotated using compound 2 Identification of thematic roles of verbal argu- the relation or one of its sub-types ment slots is also needed in order to be able to (see e.g. Kahane et al.(2017) on the contradic- distinguish oblique arguments from semantically tion of applying the flat annotation to languages compositional relations (e.g. locative and oblique where names are endocentric). Hungarian noun uses of in: believe in something vs. be some- phrases are always right-headed endocentric struc- where). We also need to distinguish parts of id- tures, so in the case of names that do not have a ioms and light verb constructions from composi- regular structure and compositional meaning, the compound tional verb-to-argument relations. It is a joke to relation is to be used. This ensures, ask a question concerning a non-compositionally 1http://universaldependencies.org/ related constituent: guidelines.html 2https://universaldependencies. org/u/overview/specific-syntax.html# What are you holding? — A meeting. multiword-expressions

221 for example, that case endings always attached to the head of the NP using the same dependency la- the head of the NP are directly accessible. E.g. the bel the whole NP was annotated with. We cor- head of an object NP is always in the accusative rected these errors and attached all predeterminers case in Hungarian. The current annotation for as det:predet to the head of the NP (Figure3). names completely obscures this fact (see e.g. az Further corrections performed automatically in- Egyesült Államokat ‘the united States[Acc]’ in cluded using nmod:poss instead of nmod:att Figure1)). Therefore, as one of the preprocessing in possessive structures, (bottom of Figure2), at- steps, all multiword names, originally erroneously taching all postpositions using the case relation, annotated in the corpus as flat structures, were and fixing clauses where the subject and the (nom- automatically converted into compound struc- inal) predicate were exchanged in the annotation tures (Figure1). For the time being, the identi- by mistake due to the annotators confusing the fo- fication and further reannotation of names with a cus construction with predication. The latter errors completely regular structure has not been done, as needed to be identified manually, the correction of this requires manual intervention. the identified structures was then performed auto- Structures like Angela Merkel német kancellár matically (Figure4). ‘German Chancellor Angela Merkel’ were often erroneously annotated as appositive structures in 4 The argument frame database the corpus.3 We converted these structures intro- ducing the compound:title_of relation be- All stems of verbs and participles occurring in tween the name and the occupation/role. the Hungarian UD corpus were collected, and The UD 2.0 specification prescribes the use of they were clustered using agglomerative cluster- the obl relation to attach NP arguments other than ing like in (Siklósi, 2016) based on their vec- subjects, objects or indirect objects even in the tor representation in a word-embedding model case of nonverbal heads. Often, some other re- constructed from a morphosyntactically annotated lation was used in the corpus even for verbal ar- corpus (Novák and Novák, 2018). This process guments. We were able to automatically correct effectively clustered verbs having similar distribu- most of these annotations in the case of arguments tional patterns (and argument frames). Each verb of verbs and participles (Figure2). in the list was supplemented with its surface argu- In Hungarian, like in German, verbal particles ment frames from a Hungarian verb-frame dictio- are detached from the verb in various syntactic nary (Sass et al., 2010). Using this initial repre- constructions, and they are moved to some dis- sentation as a source of inspiration, we have de- tinct syntactic position. Nevertheless, these par- scribed the possible argument frames of each verb ticles are considered part of the verb lemma. The manually. Our description contains the thematic verbal argument database that we created as part of role, the surface features (case-ending, postposi- our annotation effort, also contains particle verbs tion, possessive suffix, etc.), possible optionality in this form. In the Hungarian UD corpus, on the and, if applicable, lexical/semantic constraints of other hand, verb lemmas did not include the parti- each argument. Clustering helped us to streamline cle in such cases. This needed to be fixed (adding the process and simplified the task of annotators. the particle to the verb lemma) in order for the ver- Annotating verbs with similar argument frames in bal argument frames could be matched to their oc- a batch together instead of having to process them currences in the corpus. Many additional lemma- in some random order made it possible for us to tization errors were fixed, and we also needed to use an inheritance mechanism and improved con- relemmatize participles so that we can match verb sistency. argument frames against them. The main point in describing the argument In Hungarian, demonstrative predeterminers frames of verbs was to provide as much informa- agree in case and number with the head of the tion as possible to make it possible to ask the best, NP (azokat a kutyákat ‘those dogsACC’). These most accurate questions. With that in mind, our set structures were often annotated erroneously, with of thematic roles is based on widely known the- the demonstrative predeterminer being attached to matic role hierarchies. However, it differs from them in minute details, just like they differ from 3In appositive structures, like a bátyámmal, Péterrel ‘with my brother, Peter’ there is case between parts of each other. The description of verbs is intended to the phrase. This is not the case in these structures. cover every possible meaning (argument frame).

222 Figure 1: Fixing the annotation of multiword names: ‘William Ramsey representing the United States at the International Atomic Energy Agency’

Figure 4: Correction of an exchanged subject and pred- icate: ‘that he (˝o) was one of the leaders (vezet˝oje) of...’

Figure 2: Using the obl relation for arguments of verbs and participles: ‘... may decrease by 2.3 million This frame can be added to the record of the spe- barrels a day’ – ‘a recently completed [report] commis- cific verb. sioned by Péter Kovács, director of LRI’ The required and optional arguments of each verb are represented either by their thematic roles or lexically, supplemented with the required case- endings or postpositions. The identification of the thematic roles is based on the question that can be asked about the given argument or about the verb with the given argument as an anchor. For exam- ple, the question concerning the agent is what is A doing?, the question concerning the patient is what is happening to P?. Some roles also represent a kind of semantic Figure 3: Correction of erroneously annotated prede- category, such as CONT which refers to the con- terminers: ‘... was the only way he could create that tent of communication, or ACT which denotes an impression’ action (usually expressed as an infinitive xcomp). Arguments not having a specific thematic role that Since verbs with similar meanings and argument could not be used as an anchor when we want to frames were already grouped in the database, it ask a question about the predicate were marked us- was possible to specify common argument frames ing the semantically neutral theme (TH) thematic for groups of verbs. These frames are inherited au- role. tomatically by verbs belonging to the same group. The fixed components of idiomatic or semi- In addition, each verb can have its own argument compositional verbal structures are not labeled frame which does not apply to the whole group. by thematic roles but they are specified lexically.

223 These structures were supplemented with their the first line of the extract, the PATMV_(PATH) own argument frame descriptions (thus interpreted frame including an optional PATH argument that as autonomous units) where this solution seemed can in turn be expanded as any combination of to be justified. For example, the description of sor SRC, DST and VIA arguments, as well as the neu- kerül ‘to take place’ (lit. ‘turn comes’) is not part tral polarity marked with @. refer to each verb of the description of the word kerül ‘to come’, but below them. Round brackets in the descriptions we have assigned an argument frame to the whole indicate optionality, square brackets contain a list phrase as a unit. The thematic roles assigned to of examples defining a semantic category. the verbs and verbal structures are summarized in At the time of writing this paper, the argu- Table1. ment frame database contains 1604 verbs with Since verbs of movement imply the applica- 5994 different argument frames, including the bility of specific types of questions (e.g. How thematic role of each argument. Although frames did X get to Y?), in addition to the roles listed containing optional arguments (e.g. olvas AG_ in the table, a special annotation was applied to (HOW)_(PAT-t)_(REC-nAk)_(TH-rÓl)_(LOC-bAn) moving actors: for example moving agents are ‘somebody reads (somehow) (something) (to marked as AGMV. Our basic assumption was that somebody) (about something) (somewhere)’) a verb can not have more than one argument hav- appear as many seemingly different frames in ing the same thematic role. However, in some practice, we obtained these numbers by counting cases – where it is necessary – the co-actor is the frames containing optional arguments and marked with the co- prefix. For example, sétál possible thematic role variants only once. valakivel ‘to walk with someone’ is represented with AG_coAG-vAl (-vAl stands for the instru- 5 Identifying the role of adjuncts mental case-ending). An important task is to provide a fine-grained de- The argument frames described above could scription regarding the role of nominals with case- also obtain some special semantic classification ending, traditionally referred to as adjuncts. If we which may help in the further refinement of the approach the question from the case-endings, we possible questions. The categories used for this could say that the nominal having an inessive case- are as follows: ending indicates some kind of location and an- perception (e.g. to see) swers the question Where?. However, if the ques- emotion (e.g. to be glad) tion is e.g. Where did Mary graduate?, it is a joke sound (e.g. to resound) to say In her dream. The case-endings answering situation (e.g. to be pressed for time) the questions Hol? ‘Where?’, Hová? ‘Where to?’ beginning (e.g. to be established) and Honnan? ‘From where?’4 are not always used cognitive (e.g. to agree) to specify the location, the source or the destina- communication (e.g. to inform) tion. Depending on the lemma, the suffixed forms mathematical (e.g. to add) may express various temporal relations, modality, nonverbal communication (e.g. to nod) etc. For most lexical items that can refer to loca- self-propelled motion (e.g. to step) tions, only one set of the suffixes (e.g. only the financial (e.g. to transfer) inessive -bAn ‘in’, illative -bA ‘into’, and the el- destruction (of patient, e.g. to dry up) ative -bÓl ‘from inside’ can be used to express natural (e.g. to rain) location, source and destination), the rest of the transformation/change (e.g. to speed up) suffixes can only be used as markers of specific behavior (e.g. to flirt) oblique arguments of verbs. E.g. while for settle- relation (e.g. to support) ments outside Hungary, the locative relation is al- ways expressed using inessive (e.g. Londonban ‘in Finally, the argument frames also have a polar- London’), for most Hungarian settlements, the su- ity value indicating that the given event is positive, peressive is used (e.g. Budapesten ‘in Budapest’, negative or neutral for the patient or experiencer. 4In Hungarian, locative/lative/delative case-endings are as Figure5 shows the description of the verbs follows: the inessive -bAn ‘in’, the adessive -nÁl ‘at’, the su- peressive -On ‘on’; the illative -bA ‘into’, the allative -hOz sodródik ‘to drift’, hull ‘to fall’ and zuhan ‘to ‘to’, the sublative -rA ‘onto’; the elative -bÓl ‘from inside’, drop/plummet’ in the argument frame database. In the ablative -tÓl ‘from’, the delative -rÓl ‘from the top of’.

224 PATMV_(PATH) @. sodródik[IGE] +CHAR_ár-vAl ‘drift[V] +CHAR_tide~with’ hull[IGE] +AG_térd-rA_(CHAR~elott)˝ +hó +PAT~[haj|könny]-A +PAT@-pusztulás ‘fall[V] +AG~to~one’s~knees +snow +PAT’s~hair|tears (die:)+PAT@-decay’ zuhan[IGE] +EXP_á[email protected]ünet ‘drop/plummet[V] (fall asleep:)+EXP_into~dream’

Figure 5: An extract from the argument frame database.

Table 1: Thematic roles used in the description of argument frames

Annotation Name Question regarding the verb Example AG agent What is AG doing? John has climbed the tree. CHAR characterized What is characteristic of CHAR? Expertise is an advantage. ATTR attribute – Expertise is an advantage. John loves Mary. EXP experiencer How does EXP feel? What has EXP perceived? John has seen a swallow. PAT patient What happened to PAT? John kissed Mary. What happened to PATDST? PATDST patient-destination He painted the wall green. Where did PAT get to? TH theme – John relies on his intuition. John loves Mary. ST stimulus What effect has ST (on EXP)? John got frightened of his shadow. CONT information content – John presented the plan to Joe. John presented the plan to Joe. REC recipient – Mary received a letter. RES result How did RES come into being? Mary baked a cake. INS instrument What is AG using INS for? John travels to work by scooter. What did CAU cause? CAU causer John was late because of an accident. What was the consequence of CAU? MOT motivation – John is studying to be an engineer. LOC location What happened in/at/on... LOC? John kissed Mary in the cinema. John came out of the room. SRC source, starting point – Mary received a letter from John. DST destination How did AG/PAT get to DST? John went into the room. HOW mode – John deftly climbed the tree. ASPECT aspect – John is doing well financially. ACT action – John wants to work from home. lit. ‘on Budapest’). On the other hand, as oblique ‘for five months’ or a category that we could term arguments of the verbs hisz ‘believe’ and múlik as ‘adverbs of garment’ such as kabátban ‘wearing ‘depend’, all nouns take the inessive ‘in’ and the a coat’. 31 main categories have been identified, superessive ‘on’ suffixes, respectively. Lemmas some of which can be divided into several subcat- can thus be be classified concerning what func- egories. Together with the subcategories, we have tional/semantic relation is expressed by the com- divided the adjuncts having locative case-endings bination of the lemma and each case ending. We into 51 classes. To illustrate some of the subcate- identified such classes and defined templates that gories, in Table2, we present lemmas which, when describe the semantic role of each suffixed form in combined with a subset of the locative suffixes, the template. For all words (lemmas) belonging to function as adverbs of place. When combined the specific class, the template yields the seman- with other locative suffixes, they cannot function tics of each suffixed form. as heads of adjunct phrases. In these cases, they The task can also be formulated as a classifica- can only depend on some head word selecting that tion of adverbs. There are, of course, adverbs of specific suffix as the marker of a specific oblique place such as a sarkon ‘at the corner’ or bankban argument. ‘in a bank’, and adverbs of time such as télen ‘in The first two columns of the table show the main winter’, decemberben ‘in December’. However, category (in this case, loc) and its subcategories we also find adverbs of duration, e.g. 5 hónapra (all, ine, city-sup, etc.). This is followed by an

225 category example -bAn (inessive) -nÁl (adessive) -On (superessive) loc all szekrény ‘wardrobe’ where where where loc ade Microsoft in what where on what loc ine állam ‘state’ where at what on what loc sup címoldal ‘title page’ in what at what where loc ine-sup könyv ‘book’ where at what where loc city-ine Altenkirchen where where on which city loc city-sup Budapest in which city where where loc country Afganisztán ‘Afghanistan’ where where on which country

Table 2: Examples of lexical items that function as heads of locative adverbial phrases when combined with a specific subset of locative case suffixes (cells marked with where). With other suffixes, they can only function as oblique arguments of some predicate. example lemma belonging to the given subcate- token is the lemma with the main POS tag attached gory, and the best applicable questions for each of to it, while the other optional token consists of the case-endings -bAn, -nÁl and -On, respectively. possible extra morphosyntactic tags (such as tense, The questions indicate what role the suffixed word case, etc.) if present. We extracted at most 7- form plays in a sentence. token-long parallel phrases from the word-aligned corpus using the phrase extraction algorithm using 6 Automatic identification of the grow-diag-final heuristic implemented in the semi-compositional structures Moses SMT toolkit (Koehn et al., 2007). Of the When identifying idiomatic and semi- pairs of phrases extracted, we only kept pairs con- compositional verbal constructions, we focused taining exactly one verb both on the English and on the behavior of phrases with regard to the on the Hungarian side. For each Hungarian verb relevant question that can be asked about the from these phrase pairs, we collected all the nouns given phrase. In the case of döntést hoz ‘to make on the Hungarian side that were aligned with the a decision’ (lit. ‘to bring a decision’), What does verb on the English side. For example, in dön- A bring? is not an acceptable question. Similarly, tést hoz ‘to make a decision’ (lit. ‘to bring a de- Where does A bring P? is incorrect regarding the cision’), the Hungarian verb is hoz ‘to bring’. If phrase szóba hoz ‘to mention’ (lit. ‘to bring into it is aligned with the verb decide on the English word’). side, the noun döntés ‘decision’ is also aligned We have implemented an algorithm for collect- with this verb, as it does not appear as a sepa- ing such phrases from a parallel corpus. First, rate word on the English side. Note that even if we generated word alignments in the English- the English translation is make a decision, dön- Hungarian parallel subcorpus of the OpenSubti- tés on the Hungarian side is also usually aligned tles corpus consisting of 644.5 million tokens (Li- with make as well as with decision, because make son and Tiedemann, 2016) using the fast align tool only corresponds to hoz ‘bring’ only in this and (Dyer et al., 2013). To alleviate data sparseness a few other similar light verb constructions. In problems due to the rich morphology of Hungar- contrast, e.g. in the case of the compositional ian, to improve alignment quality and to facili- táskát hoz phrase ‘to bring a bag’, bring and bag tate the subsequent light verb construction and id- are both present on the English side, so these are iom identification process, we used a morphosyn- only aligned with their Hungarian equivalents. Fi- tactically annotated version of the corpus. The nally, we have normalized and sorted the list of English side was annotated using Stanford tagger nouns collected for the Hungarian verbs, based on (Toutanova et al., 2003) and the morpha lemma- their frequency and their homogeneity regarding tizer (Minnen et al., 2001), while the Hungarian the given verb. We cut off the end of the resulting side using the PurePos tagger (Orosz and Novák, list (where only phrases having a compositional 2013) and the Hungarian Humor morphological meaning were gathered). The algorithm generated analyzer (Novák et al., 2016). We further post- 6531 candidate expressions for 309 verbs. processed the output of the tagger/lemmatizer tool Originally, we planned to evaluate the algo- 5 combos, to generate an annotation in which each rithm using a list of light verb constructions word is represented by one or two tokens on the (LVC’s) (Vincze, 2011) created from the syntacti- Hungarian as well as on the English side. The first 5The list contains 1524 items.

226 cally annotated monolingual Szeged Dependency sake of readability, case suffixes are represented Treebank (Vincze et al., 2010) and the English- by their underlying phonological form in the ar- Hungarian SzegedParalell corpus. However, we gument frame database. The alignment algorithm found that it would be a mistake to consider the converts these descriptions into feature constraints original list gold standard. Only 83.6% of the ex- that match the morphosyntactic descriptions in the pressions on the Szeged list met our criteria. For UD corpus. the rest, we found that there is nothing odd about Hungarian is a ‘pro drop’ language, i.e. subject asking a question concerning the nominal part of and singular object pronouns generally have an the supposed light verb construction. Finally, we explicit phonological representation in sentences manually evaluated the union of the original list only if they are stressed (e.g. focused or in con- and the entries returned by our algorithm (7538 trastive topic position). Subjects and objects hav- items altogether). The original list had a recall of ing no surface realization are recovered in the 32.2% of the true positives on the union of lists. alignment algorithm by introducing implicit pro- The precision and recall values for our algorithm nouns and assigning the corresponding thematic turned out to be P=28.6%, and R=84.2%. As a re- role to them if the argument frame contains such sult, we managed to extend the list of Hungarian arguments and there is no explicit subject or object LVC’s and verbal idioms significantly compared in the given clause. For infinitives, and to the original Szeged list. We extended our lex- participles, verbal argument frames are matched icon of argument frames with LVC’s and idioms by implicitly binding subjects and objects depend- on the list we obtained by manually adding other ing on the type of the construction, while the arguments along with their thematic roles. rest of the arguments are matched in the regu- lar manner. Since objects (NP’s marked with ac- 7 Aligning argument frames to their cusative case) and infinitives can only occur as ar- occurrences in the corpus guments, not as adjuncts, frames that do not con- tain an object/infinitive are discarded if an explicit The first step of the algorithm aligning the argu- object/infinitive is attached to the actual verb in- ment frames to their occurrences in the UD corpus stance. is reading the source lexicon files containing the argument frame descriptions and checking them The lexically bound nominal element of some for syntactic errors. It generates the full argument of the light verb constructions is an apparently frame description for each verb by applying an in- possessive form (annotated as the head of the heritance mechanism adding the argument frames phrase like in other possessive structures), e.g. belonging to the verb group to those pertaining szomszédja nyakára küldte az adóhatóságot ‘he only to the given verb. set the tax authority to check up on his neighbour’ (lit. ‘he sent the tax authority onto his neighbour’s Explicit and implicit constraints on the form of neck’). In these constructions, the actual argu- arguments (suffixes, postpositions etc.) implied ment that is to be assigned a thematic role is not by the thematic roles in argument frame descrip- this semantically empty word, which rather func- tions are converted into constraints on features tions like an oblique case ending or postposition, and dependency relations applied to the morpho- but the possessor (i.e. the neighbor in the previ- logical and syntactic annotations in the UD cor- ous example). These structures are converted into pus, respectively. We align the argument frames a form similar to postpositional phrases. The real to the verbs and the heads of phrases attached to argument (his neighbour) becomes the head in the them in the corpus using these constraints. The modified structure, and thus the appropriate the- thematic roles location (LOC), destination (DST) matic role can be directly assigned to it. and source (SRC) cover noun phrases the head of which is marked with the suffixes and postposi- When multiple frames match the specific verb tions used to denote location, direction and source instance, the most specific frame is selected: (Where?, Where to?, From where?) and adverbs matching light verb or idiomatic constructions having such meaning. The argument frames of are ranked high, otherwise match candidates are many verbs contain the thematic role PATH, which ranked by the length of the matching argument list. can be replaced by any combination of destina- Figure6 shows how the original annotation of a tion, source and location passed by (VIA). For the sentence in the original UD corpus was corrected

227 Figure 6: The result of automatic correction and assignment of thematic roles to heads of phrases and clauses attached to verbs in a sentence in the Hungarian UD corpus: ‘The government submitted its bill for next year’s budget to the parliament at the end of September: it bodes no good for those working in the public education sector, László Varga told our newspaper’ and extended with thematic role labels by the the man Resources. We would like to thank Kinga Fe- adjunct and verbal argument frame matching algo- gyó and Ivett Bognár for their participation in cre- rithm. ating verbal argument frames and assigning the- matic roles. 8 Conclusion

Within the scope of the ongoing research pre- References sented in this article, we have created a seman- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and tically rich corpus annotation for Hungarian us- Kristina Toutanova. 2018. BERT: Pre-training of ing the Hungarian UD subcorpus as a starting deep bidirectional transformers for language under- point. Our future tasks include integrating the ar- standing. CoRR, abs/1810.04805. gument frames of nominal predicates and manual Chris Dyer, Victor Chahuneau, and Noah A. Smith. checking of the generated thematic role annota- 2013. A simple, fast, and effective reparameteriza- tion. We may also need to finetune the interac- tion of IBM Model 2. In Proceedings of the 2013 tion of lexically-driven automatic adjunct annota- Conference of the North American Chapter of the tion and verbal argument frame alignment, and the Association for Computational Linguistics: Human Language Technologies, pages 644–648. Associa- ranking of matching argument frames. Further- tion for Computational Linguistics. more, arguments annotated with thematic roles will need to be semantically further subcatego- Sylvain Kahane, Marine Courtin, and Kim Gerdes. rized to be able to generate the right questions. We 2017. Multi-word annotation in syntactic treebanks - propositions for universal dependencies. In Pro- have done this, but we have not yet integrated this ceedings of the 16th International Workshop on information with the rest of the annotation. We Treebanks and Linguistic Theories, pages 181–189, will also need to extend our corpus (converting and Prague, Czech Republic. correcting parts of the Szeged Dependency Tree- Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris bank not included in the Hungarian UD corpus) Callison-Burch, Marcello Federico, Nicola Bertoldi, to provide enough training material for a semantic Brooke Cowan, Wade Shen, Christine Moran, parser. Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open Acknowledgments source toolkit for statistical machine translation. In ACL. The Association for Computer Linguistics.

This research has been implemented with support Pierre Lison and Jörg Tiedemann. 2016. OpenSub- provided by grants FK 125217 and PD 125216 titles2016: Extracting large parallel corpora from of the National Research, Development and In- movie and TV subtitles. In Proceedings of the Tenth novation Office of Hungary financed under the International Conference on Language Resources FK 17 and PD 17 funding schemes as well as by and Evaluation (LREC 2016), Paris, France. Euro- pean Language Resources Association (ELRA). grant ÚNKP-18–3-III-PPKE-26 of the New Na- tional Excellence Program of the Ministry of Hu- Guido Minnen, John Carroll, and Darren Pearce. 2001.

228 Applied morphological processing of English. Nat. Veronika Vincze, Richárd Farkas, Zsolt Szántó, and Lang. Eng., 7(3):207–223. Katalin Ilona Simkó. 2017. Universal Dependencies and morphology for Hungarian – and on the price of Joakim Nivre, Marie-Catherine de Marneffe, Filip Gin- universality. In Proceedings of EACL 2017, pages ter, Yoav Goldberg, Jan Hajic, Christopher D. Man- 356–365. ning, Ryan McDonald, Slav Petrov, Sampo Pyysalo, Natalia Silveira, Reut Tsarfaty, and Daniel Zeman. Veronika Vincze, Dóra Szauter, Attila Almási, György 2016. Universal Dependencies v1: A multilingual Móra, Zoltán Alexin, and János Csirik. 2010. treebank collection. In Proceedings of the Tenth In- Hungarian dependency treebank. In Proceedings ternational Conference on Language Resources and of the Seventh International Conference on Lan- Evaluation (LREC 2016), Paris, France. European guage Resources and Evaluation (LREC’10), Val- Language Resources Association (ELRA). letta, Malta. European Language Resources Associ- ation (ELRA). Attila Novák and Borbála Novák. 2018. POS, ANA and LEM: Word embeddings built from annotated corpora perform better. In Computational Linguis- tics and Intelligent Text Processing: 17th Interna- tional Conference, CICLing 2018, Hanoi, Vietnam. Springer International Publishing, Cham.

Attila Novák, Borbála Siklósi, and Csaba Oravecz. 2016. A new integrated open-source morphological analyzer for Hungarian. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), Portorož, Slovenia. Eu- ropean Language Resources Association (ELRA).

György Orosz and Attila Novák. 2013. PurePos 2.0: a hybrid tool for morphological disambiguation. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2013), pages 539–545, Hissar, Bulgaria. In- coma Ltd. Shoumen, Bulgaria.

Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners.

Bálint Sass, Tamás Váradi, Júlia Pajzs, and Margit Kiss. 2010. Magyar igei szerkezetek: a leggyako- ribb vonzatok és szókapcsolatok szótára [Hungar- ian verbal constructions: a dictionary of the most frequent arguments and phrases]. A magyar nyelv kézikönyvei. Tinta Könyvkiadó.

Borbála Siklósi. 2016. Using embedding models for lexical categorization in morphologically rich lan- guages. In Computational Linguistics and Intel- ligent Text Processing: 17th International Con- ference, CICLing 2016, pages 115–126, Konya, Turkey. Springer International Publishing, Cham.

Kristina Toutanova, Dan Klein, Christopher D. Man- ning, and Yoram Singer. 2003. Feature-rich part-of- speech tagging with a cyclic dependency network. In NAACL ’03: Proceedings of the 2003 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, pages 173–180, Morristown, NJ, USA. Association for Computational Linguistics.

Veronika Vincze. 2011. Semi-Compositional Noun + Verb Constructions: Theoretical Questions and Computational Linguistic Analyses. Ph.D. thesis.

229