Creation of a Corpus with Semantic Role Labels for Hungarian
Total Page:16
File Type:pdf, Size:1020Kb
Creation of a corpus with semantic role labels for Hungarian Attila Novák1;2, László János Laki1;2, Borbála Novák1;2 Andrea Dömötör1;3, Noémi Ligeti-Nagy1;3, Ágnes Kalivoda1;3 1MTA-PPKE Hungarian Language Technology Research Group, 2Pázmány Péter Catholic University, Faculty of Information Technology and Bionics Práter u. 50/a, 1083 Budapest, Hungary 3Pázmány Péter Catholic University, Faculty of Humanities and Social Sciences Egyetem u. 1, 2087 Piliscsaba, Hungary {surname.firstname}@itk.ppke.hu Abstract the given text, and this ability is closely related to the ability to answer questions. Therefore, our In this article, an ongoing research is pre- aim is to create a system that is actually capable sented, the immediate goal of which is to cre- ate a corpus annotated with semantic role la- of formulating relevant questions about the text it bels for Hungarian that can be used to train processes. To do this, many distinctions need to a parser-based system capable of formulat- be made that are not present in syntactic annota- ing relevant questions about the text it pro- tion currently available for Hungarian. This article cesses. We briefly describe the objectives of presents the first phase of this work, which aims to our research, our efforts at eliminating errors create an annotated corpus where the annotation in the Hungarian Universal Dependencies cor- contains all the features needed to generate ques- pus, which we use as the base of our an- notation effort, at creating a Hungarian ver- tions concerning the text. bal argument database annotated with thematic roles, at classifying adjuncts, and at match- 2 Shortcomings of the traditional ing verbal argument frames to specific occur- analysis rences of verbs and participles in the corpus. Since our goal is to create a system that can gen- 1 Introduction erate meaningful questions, we have decided that Recently, state-of-the-art performance in most when determining what distinctions need to be NLP related tasks has been achieved by end-to-end made in the annotation should be basically deter- systems based on neural deep learning networks mined by what questions can be asked concerning (see e.g. BERT (Devlin et al., 2018) or GPT- the particular grammatical construction. For ex- 2 (Radford et al., 2019)) surpassing the perfor- ample, in order to be able to formulate questions mance of previous systems employing some sort concerning noun phrases, the Who/What? dis- of grammatical analysis. This has raised doubts as tinction is indispensable, so the system must be to whether it makes sense to deal with grammat- able to clearly distinguish persons from things. At ical analysis at all. At the same time, the train- the same time, we ask who or what questions con- ing of end-to-end systems usually requires a great cerning NP’s that refer to groups or organizations amount of training material, which is not available depending on the role they play in the given sen- in most languages. Therefore, we think it may still tence. For example, a bank is referred to linguis- make sense put an effort into the implementation tically as a person when sending an invoice letter, of a grammatical analysis framework as long as but as a thing when it is liquidated. In addition, the output of the system can be directly used to a more detailed classification is required to gener- perform tasks relevant to everyday users. ate questions about nominal predicates. Concern- However, we cannot be satisfied with an anal- ing the predicate in the sentence John is a doc- ysis that relies on completely abstract categories tor, the question Who is John? is not very sophis- that cannot be clearly translated into terms that ticated. What is John’s occupation? is a ques- can be linked to what that text means in a man- tion matching the predicate in the sentence much ner that can also be understood by ordinary people. more precisely. Classifying concepts as occupa- An essential element of reading comprehension is tions, animals, tools, behaviors, etc. also makes that we are able to ask meaningful questions about for the system possible to generate more specific 220 Proceedings of the 13th Linguistic Annotation Workshop, pages 220–229 Florence, Italy, August 1, 2019. c 2019 Association for Computational Linguistics questions related to non-predicative occurrences 3 The corpus of noun phrases: e.g. What animal have you seen As a starting point, we chose the Hungarian sub- in the garden? vs. What did you see in the garden? corpus (Vincze et al., 2017) of the Universal This is particularly important in the case of coor- Dependencies (UD) corpus (Nivre et al., 2016) dinated phrases where one can only identify which consisting of 1800 sentences (42000 tokens) of conjunct is meant in the question if the question is mainly newswire text in order to put the annota- specific enough. tion schema we propose in a context that can be interpreted at an international level. The UD cor- To formulate questions concerning adverbials pus contains texts in many languages annotated even at the most basic level, we also need a much with morphosyntactic and dependency-based syn- more detailed system of distinctions than what is tactic analysis using unified principles and cate- provided by the syntactic annotation present in gories. Our original plan was to supplement or re- currently available tree banks. Hungarian NP’s fine the annotation in the Hungarian UD corpus headed by a word in inessive case or correspond- with the information needed to formulate ques- ing English PP’s headed by the preposition in tions. However, it turned out that the annotation can have quite a number of different grammat- in the Hungarian sub-corpus does not correspond ical functions. Thus we ask different questions to the currently valid UD specification in many re- concerning them: (1) szeptemberben ‘in Septem- spects, and contains many random annotation er- ber’: mikor? ‘when?’, (2) Londonban ‘in Lon- rors, so fixing these errors turned out to be an in- don’: hol? ‘where?’, (3) fájdalmában (felüvöltött) evitable part of our task. ‘(he screamed) in pain’: mit˝ol? ‘what made (him According to the UD 2.0 specification1, the scream)?’, (4) magában (hisz) ‘(he believes) in internal structure of multiword expressions is himself’: kiben? ‘in whom?’, (5) bajban ‘in trou- to be annotated using the flat, fixed or ble’: milyen helyzetben? ‘in what situation?’, (6) compound dependency relations. The fixed életben (marad) ‘(stay) alive’ lit. ‘(stay) in life’: relation is used exclusively to annotate fully lex- no question in general, this is part of a light verb icalized function-word-like structures. In many construction. languages, such as English, multiword names are generally considered to be flat exocentric struc- Generating questions concerning not only nom- tures, and the use of flat is suggested to annotate inal but also verbal predicates requires information the internal structure of these names with all words not provided by currently available annotation for of the name directly attached to the first word of Hungarian. How a question concerning a verbal the name. On the other hand, the UD 2.0 an- predicate should be formulated using specific ar- notation specification explicitly excludes the use guments as anchors depends on the thematic roles of this type of analysis in cases where the name the arguments play. What did John do to Frank? is has a regular syntactic structure (eg. in the case an adequate question if John is an agent and Frank of book or movie titles or a large part of names is a patient. In the same situation, What happened of organizations). Here the generic syntactic de- to Frank? and What did John do? are likewise pendency structures are to be used. Similarly, adequate questions. endocentric structures should be annotated using compound 2 Identification of thematic roles of verbal argu- the relation or one of its sub-types ment slots is also needed in order to be able to (see e.g. Kahane et al.(2017) on the contradic- distinguish oblique arguments from semantically tion of applying the flat annotation to languages compositional relations (e.g. locative and oblique where names are endocentric). Hungarian noun uses of in: believe in something vs. be some- phrases are always right-headed endocentric struc- where). We also need to distinguish parts of id- tures, so in the case of names that do not have a ioms and light verb constructions from composi- regular structure and compositional meaning, the compound tional verb-to-argument relations. It is a joke to relation is to be used. This ensures, ask a question concerning a non-compositionally 1http://universaldependencies.org/ related constituent: guidelines.html 2https://universaldependencies. org/u/overview/specific-syntax.html# What are you holding? — A meeting. multiword-expressions 221 for example, that case endings always attached to the head of the NP using the same dependency la- the head of the NP are directly accessible. E.g. the bel the whole NP was annotated with. We cor- head of an object NP is always in the accusative rected these errors and attached all predeterminers case in Hungarian. The current annotation for as det:predet to the head of the NP (Figure3). names completely obscures this fact (see e.g. az Further corrections performed automatically in- Egyesült Államokat ‘the united States[Acc]’ in cluded using nmod:poss instead of nmod:att Figure1)). Therefore, as one of the preprocessing in possessive structures, (bottom of Figure2), at- steps, all multiword names, originally erroneously taching all postpositions using the case relation, annotated in the corpus as flat structures, were and fixing clauses where the subject and the (nom- automatically converted into compound struc- inal) predicate were exchanged in the annotation tures (Figure1).