Universal Morphology for Old Hungarian
Eszter Simon
Research Institute for Linguistics, Hungarian Academy of Sciences
Benczúr u. 33., H-1068 Budapest, Hungary

Veronika Vincze
MTA-SZTE Research Group for Artificial Intelligence
Tisza Lajos krt. 103., H-6720 Szeged, Hungary

Proceedings of the 10th SIGHUM Workshop on Language Technology for Cultural Heritage, Social Sciences, and Humanities (LaTeCH), pages 118–127, Berlin, Germany, August 11, 2016. © 2016 Association for Computational Linguistics

Abstract

This paper provides a description of the automatic conversion of the morphologically annotated part of the Old Hungarian Corpus. These texts are in the format of the Humor analyzer, which does not follow any international standards. Since standardization always facilitates future research, even for researchers who do not know the Old Hungarian language, we opted for mapping the Humor formalism to a widely used universal tagset, namely the Universal Dependencies framework. Using a shared tagset across languages enables interlingual comparisons from a theoretical point of view, and multilingual NLP applications can also profit from a unified annotation scheme. In this paper, we report the adaptation of the Universal Dependencies morphological annotation scheme to Old Hungarian, and we discuss the most important theoretical linguistic issues that had to be resolved during the process. We focus on the linguistic phenomena typical of Old Hungarian that required special treatment and we offer solutions to them.

1 Introduction

There is a growing interest in building and using databases of historical texts, not only in the natural language processing (NLP) community but also among theoretical and historical linguists. High-quality historical corpora enriched with linguistic information and metadata can provide fertile ground for theoretical investigations. Several databases of historical texts have recently been created for various Indo-European languages, such as the Penn-Helsinki Parsed Corpus of Middle English (Kroch and Taylor, 2000), the Tycho Brahe Parsed Corpus of Historical Portuguese (Galves and Britto, 2002), or the Welsh Prose corpus (Thomas et al., 2007), and for non-Indo-European languages as well, such as the Old Hungarian Corpus (Simon, 2014).

Historical corpora represent a rich source of data, but only if the relevant information is specified in a computationally interpretable and retrievable way. Moreover, following the current standardisation efforts allows for cross-lingual comparative studies, as well as for longitudinal investigations on language change. With the recent increase in the number of annotated corpora, it seems advisable to move towards a harmonized common framework and methodology. Standardization always facilitates future research, in this case even for researchers who do not know the Old Hungarian language.

Natural language processing activities in Hungary were not synchronized in the past, hence similar resources were developed in parallel at different locations. As a consequence, there are two morphological analyzers for Hungarian: Hunmorph (Trón et al., 2005) and Humor (Novák, 2003). The former has not been maintained recently, while the latter is not freely available. Moreover, they use different formalisms, which share only one common property: they do not follow any international standards. For the morphological annotation of Old Hungarian texts, the Humor analyzer was used, thus all of the morphologically annotated texts are in a special format, which is hard for a non-Hungarian researcher to interpret. This is why the Humor formalism needed to be mapped to a widely used universal tagset, for which we chose the Universal Dependencies (UD) framework.

The UD tagset and annotation scheme have just been adapted to Modern Hungarian (Vincze et al., 2016). In this paper, we report the adaptation of the morphological annotation scheme to Old Hungarian, and we discuss the most important theoretical linguistic issues that had to be resolved during the process. Section 2 briefly presents the international project Universal Dependencies and Morphology, then we summarize the part-of-speech (POS) tags and morphological features that are relevant for Old Hungarian. Section 3 gives a brief introduction to the Old Hungarian language and describes the morphologically annotated part of the Old Hungarian Corpus which has been converted into the UD tagset. Section 4 reports on our experiences in the conversion and discusses the specific linguistic issues concerning parts-of-speech and features. In Section 5, we contrast the annotation schemes developed for Old and Modern Hungarian. Conclusions and the planned future work end the paper in Section 6.

POS      description
ADJ      adjective
ADP      adposition
ADV      adverb
AUX      auxiliary
CONJ     coordinating conjunction
DET      determiner
INTJ     interjection
NOUN     noun
NUM      number
PART     particle
PRON     nominal pronoun
PROPN    proper noun
PUNCT    punctuation
SCONJ    subordinating conjunction
VERB     verb
X        other

Table 1: POS tags for Old Hungarian.

2 Universal Dependencies and Morphology

Universal Dependencies is an international project that aims at developing a unified annotation scheme for dependency syntax and morphology in a language-independent framework (Nivre, 2015). Currently (as of June 2016), there are annotated datasets available for 45 languages, including modern languages such as English, German, French, Hungarian and Irish, and old languages such as Ancient Greek, Coptic, Latin and Old Church Slavic, among others.¹ Datasets from all these languages apply the same tagsets at the morphological and syntactic levels and are annotated on the basis of the same linguistic principles to the widest extent possible; in some cases, however, language-specific decisions had to be made. Using a shared tagset across languages enables interlingual comparisons from a theoretical point of view, and multilingual NLP applications can also profit from a unified annotation scheme.

¹ http://universaldependencies.org

Standardized tagsets for both morphological and syntactic annotation have been constantly improved in the international NLP community. As for dependency syntax, Stanford dependencies is one of the most widely used tagsets (de Marneffe and Manning, 2008). For morphology, the MSD coding system was developed for a number of Eastern European languages including Hungarian (Erjavec, 2012). Interset functions as an interlingua for different morphological tagsets and it enables the conversion of different tagsets to the same morphological representation (Zeman, 2008). Rambow et al. (2006) defined a multilingual tagset for POS tagging and parsing, while McDonald and Nivre (2007) identified eight POS tags based on data from the CoNLL-2007 Shared Task (Nivre et al., 2007). Petrov et al. (2012) offered a tagset of 12 POS tags and applied this tagset to 22 languages.

Universal Dependencies is the latest standardized tagset that we are aware of. In its current form, morphological information is encoded in the form of POS tags and feature–value pairs. There is a fixed set of universal POS tags without the possibility of introducing new members, but features and values can have language-specific additions if needed. Features are divided into two categories: lexical features and inflectional features. Lexical features are characteristics of the lemmas rather than the word forms, whereas inflectional features are characteristics of the word forms. Both lexical and inflectional features can be layered: some features are marked more than once on the same word, e.g. a Hungarian noun may denote its possessor's number as well as its own number. In this case, the Number feature has an added layer, Number[psor].

As mentioned above, Universal Morphology annotates words with POS information and morphological features. Tables 1 and 2 summarize the POS tags and morphological features that are relevant for Old Hungarian, based on the annotation scheme created for Modern Hungarian, described at the UD website and in Vincze et al. (2016).

3 Old Hungarian

The Old Hungarian era lasted from 896 to 1526, the year of the occupation of the major part of the Hungarian Kingdom by the Ottoman Empire. The first part of this period (896–1350), documented by linguistic fragments and short coherent texts, is called the Early Old Hungarian period. The Late Old Hungarian period (1350–1526) is the period of codices.

The Old Hungarian Corpus (Simon, 2014) contains all codices from the Late Old Hungarian period and several minor texts from the Early Old Hungarian period in their original orthographic form. Because of the heterogeneity of the Old Hungarian orthographic system, the original tokens had to be transcribed into their modernized form during a normalization step (for more details, see Oravecz et al.

4 Language-specific extensions

Since the Old Hungarian period spans more than 600 years, several linguistic phenomena were in permanent change during this period. That is one of the reasons behind the heterogeneity of Old Hungarian texts. For instance, the process by which postpositions became verbal particles or adverbs dates back to the Proto-Hungarian period and continues even in the Modern Hungarian era, so deciding on their POS tag is far from trivial (discussed in more detail in Section 4.2). Such issues posed several problems during the conversion process, which are detailed in this section.

In the examples throughout the section, the relevant parts are emboldened. For the morphological description, we follow the standard Leipzig Glossing Rules. The source of the example is provided in brackets after the translation. If the example is part of the Bible, the translation is copied from the King James Bible, and its biblical locus (book, chapter, verse) is also provided.

First, we discuss general issues of the conversion, then we illustrate specific cases that are rel-