Universal Dependencies and Morphology for Hungarian – and on the Price of Universality
Total Page:16
File Type:pdf, Size:1020Kb
Universal Dependencies and Morphology for Hungarian – and on the Price of Universality Veronika Vincze1,2, Katalin Ilona Simkó1, Zsolt Szántó1, Richárd Farkas1 1University of Szeged Institute of Informatics 2MTA-SZTE Research Group on Artificial Intelligence [email protected] {vinczev,szantozs,rfarkas}@inf.u-szeged.hu Abstract plementation of multilingual morphological and syntactic parsers from a computational linguistic In this paper, we present how the princi- point of view. Furthermore, it can enhance studies ples of universal dependencies and mor- on linguistic typology and contrastive linguistics. phology have been adapted to Hungarian. From the viewpoint of syntactic parsing, the We report the most challenging grammati- languages of the world are usually categorized ac- cal phenomena and our solutions to those. cording to their level of morphological richness On the basis of the adapted guidelines, (which is negatively correlated with configura- we have converted and manually corrected tionality). At one end, there is English, a strongly 1,800 sentences from the Szeged Tree- configurational language while there is Hungarian bank to universal dependency format. We at the other end of the spectrum with rich mor- also introduce experiments on this manu- phology and free word order (Fraser et al., 2013). ally annotated corpus for evaluating auto- In this paper, we present how UD principles were matic conversion and the added value of adapted to Hungarian, with special emphasis on language-specific, i.e. non-universal, an- Hungarian-specific phenomena. notations. Our results reveal that convert- Hungarian is one of the prototypical morpho- ing to universal dependencies is not nec- logically rich languages thus our UD principles essarily trivial, moreover, using language- can provide important best practices for the uni- specific morphological features may have versalization of other morphologically rich lan- an impact on overall performance. guages. The UD guidelines for Hungarian were motivated by both linguistic considerations and 1 Introduction data-driven observations. We developed a con- Morphological tagging and syntactic parsing are verter from the existing Szeged Dependency Tree- key components in most natural language process- bank (Vincze et al., 2010) to UD and manually ing (NLP) applications. Linguistic resources and corrected 1,800 sentences from the newspaper do- parsers for morphological and syntactic analysis main. The experiences gained during the converter have been developed for several languages, see development and during the manual correction e.g. the shared tasks on morphologically rich lan- could reinforce the linguistic guidelines. More- guages (Seddah et al., 2013; Seddah et al., 2014). over, the manually corrected gold standard corpus However, the comparison of results achieved for provides the opportunity for empirical evaluations different languages is not straightforward as most like assessing the converter and comparing depen- languages and databases apply a unique tagset, dency parsers employing the original and the uni- moreover, they were annotated following differ- versal morphological representations. Thus, we ent guidelines. In order to overcome these issues, evaluated the quality of the automatic conversion, the project Universal Dependencies and Morphol- which reveals that converting to universal depen- ogy (UD) has recently been initiated within the dencies is not necessarily trivial, at least for Hun- NLP community (Nivre, 2015). The main goal of garian. We also show that using different morpho- the UD project is to develop a “universal”, i.e. a logical tagsets may have an impact on overall pars- language-independent morphological and syntac- ing performance and utilizing language-specific, tic representation which can contribute to the im- i.e. non-universal, information has a considerable 356 Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers, pages 356–365, Valencia, Spain, April 3-7, 2017. c 2017 Association for Computational Linguistics added value at both the morphological and syntac- syntactic levels and texts are annotated on the ba- tic layers. sis of the same linguistic principles, to the widest The chief contributions of the paper are the in- extent possible. troduction of The UD tagset encodes morphological informa- tion in the form of POS tags and feature–value the universal morphology and dependency • pairs. As for syntactic information, each word is principles for Hungarian, leading to insights assigned to its parent word in the dependency tree for other morphologically rich languages, and the grammatical function of the specific word empirical experiments on the upper bound is encoded in dependency labels. Dependency la- • of the accuracy of automatic conversion and bels, POS tags and features are universal (i.e. there pre-parsing, is a fixed set of them without the possibility of in- troducing new members), but values and depen- comparative evaluations for assessing the dency labels can have language-specific additions • added value of language-specific informa- if needed. Features are divided into the categories tion at the morphological and syntactic layers lexical features and inflectional features. Lexical along with the interaction of these two. features are features that are characteristics of the lemmas rather than the word forms, whereas in- 2 Related Work flectional features are those that are characteristics Standardized tagsets for both morphological and of the word forms. Both lexical and inflectional syntactic annotations have been constantly devel- features can have layered features: some features oped in the international NLP community. For are marked more than once on the same word, instance, the MSD morphological coding system e.g. a Hungarian noun may denote its possessor’s was developed for a set of Eastern European lan- number as well as its own number. In this case, the guages (Erjavec, 2012), within the MULTEXT- Number feature has an added layer, Num[psor]. EAST project. Interset functions as an interlingua Up to now, several papers have been published for several morphological coding systems, which on the general principles behind UD (Nivre, 2015; can convert different tagsets to the same mor- Nivre et al., 2016) or on specific treebanks. For phological representation (Zeman, 2008). There instance, there are UD treebanks available for ag- have also been some attempts to define a com- glutinative languages such as Finnish (Haverinen mon set of parts-of-speech: Rambow et al. (2006) et al., 2014; Pyysalo et al., 2015), Estonian (Muis- defined a multilingual tagset for part-of-speech chnek et al., 2016) and Japanese (Tanaka et al., (POS) tagging and parsing, while McDonald and 2016), for Slavic languages (Zeman, 2015) and Nivre (2007) identified eight POS tags based on spoken Slovenian (Dobrovoljc and Nivre, 2016) data from the CoNLL-2007 Shared Task (Nivre and for Nordic languages such as Norwegian et al., 2007). Petrov et al. (2012) offered a tagset (Øvrelid and Hohle, 2016), Danish (Johannsen et of 12 POS tags and applied this tagset to 22 lan- al., 2014) and Swedish (Nivre, 2014), together guages. with several other languages (Persian (Seraji et al., Now, Universal Dependencies (UD) is the latest 2016) and Basque (Aranzabe et al., 2014), just to standardized tagset that we are aware of. UD is name a few). Recently, a further extension on the an international project that aims at developing a UD relations has been proposed: enhanced En- unified annotation scheme for dependency syntax glish dependencies are described in Schuster and and morphology in a language-independent frame- Manning (2016). work (Nivre, 2015). Hungarian was among the Our UD principles introduced in this paper first 10 languages of the project, participating also follow the central UD guidelines (Nivre, 2015) in the first official release in January 2015. In the and we did our best to align with the exist- latest release (Version 1.3, May 2016), there are ing guidelines for other morphologically rich lan- annotated datasets available for 40 languages, in- guages as well. On the other hand, there are sev- cluding English, German, French, Hungarian and eral Hungarian-specific phenomena that required 1 Irish, among others . In these datasets, the very changes and extensions of the original UD princi- same tagsets are applied at the morphological and ples. 1http://universaldependencies.org/ The only available manually annotated tree- 357 bank for Hungarian is the Szeged Corpus (Csendes the possessor on the possessed noun but use deter- et al., 2004) and Szeged Dependency Treebank miners for this purpose (cf. my car but az autóm (Vincze et al., 2010). It contains approximately (the car-1SGPOSS)). Moreover, the number of the 82,000 sentences and 1.5 million tokens, all man- possessed can be marked on the noun in elliptical ually annotated for POS-tagging and constituency constructions such as: and dependency syntax. We developed an au- tomatic tool that converts the morphological de- (3) Láttam az scriptions of the Szeged Corpus to universal mor- see-PAST-2SGPOSS-ACC the phology tags and the dependency trees of the autódat , de a Szeged Treebank to universal dependencies. car-2SG-POSS , but the szomszédét nem . 3 Universal Morphology for Hungarian neighbor-POSSD.SG-ACC not I could see your car but not that of the In this section, we present the morphological neighbor. tagset applied to Hungarian. When adapting the