Universal Morphologies for the Caucasus Region

Christian Chiarcos, Kathrin Donandt, Maxim Ionov, Monika Rind-Pawlowski, Hasmik Sargsian, Jesse Wichers Schreur, Frank Abromeit, Christian Fäth
{chiarcos|donandt|ionov|abromeit|faeth}@informatik.uni-frankfurt.de
{sargsian|wichersschreur}@em.uni-frankfurt.de
{rind-pawlowski}@lingua.uni-frankfurt.de
Goethe University Frankfurt, Germany

Abstract

The Caucasus region is famed for its rich and diverse arrays of languages and language families, often challenging European-centered views established in traditional linguistics. In this paper, we describe ongoing efforts to improve the coverage of Universal Morphologies for languages of the Caucasus region. The Universal Morphologies (UniMorph) are a recent community project aiming to complement the Universal Dependencies, which focus on morphosyntax and syntax. We describe the development of UniMorph resources for Nakh-Daghestanian and Kartvelian languages as well as for Classical Armenian, we discuss challenges that the complex morphology of these and related languages poses to the current design of UniMorph, and we suggest possibilities to improve the applicability of UniMorph for languages of the Caucasus region in particular and for low-resource languages in general. We also criticize the UniMorph TSV format for its limited expressiveness, and suggest complementing the existing UniMorph workflow with support for additional source formats on grounds of Linked Open Data technology.

Keywords: morphology, Caucasus, UniMorph

1. Background

The Universal Morphology project (Sylak-Glassman et al., 2015b, UniMorph)[1] is a recent community effort aiming to complement the Universal Dependencies (Nivre and others, 2015, UD),[2] which focus on syntax, with coverage of morphology. We describe the development of UniMorph resources for languages of the Caucasus region, known for its rich and diverse arrays of languages and language families, and often posing challenges to European-centered views established in traditional linguistics. In particular, we focus on Nakh-Daghestanian (North-East Caucasian) and Kartvelian (South Caucasian) languages, as well as on Classical Armenian, and discuss challenges that these and related languages pose to the current design of UniMorph.

A practical challenge for linguists working with dictionary data consists of linking it with text data. Corpus-based research thus requires computational models of the morphology of the languages under consideration, i.e., lemmatization, at least. But also for low-resource languages (for which few or small amounts of corpus data exist or have to be collected), an explicit treatment of morphology is necessary for the study of language contact, especially if morphologically rich languages are involved (as in the Caucasus area): Neither inherited words nor loan words are transferred between languages or language stages in their base form only. Accordingly, the computational handling of complex morphological processes and features is important for grasping interrelations of Caucasian languages.

The over 100 languages spoken in the Caucasus are grouped into several language families, out of which three are indigenous, i.e., Caucasian in a strict sense: Kartvelian or South Caucasian, Abkhazo-Adyghean or (North-)West Caucasian, and Nakh-Daghestanian or (North-)East Caucasian. A fourth language family with roots in the Caucasus, Hurro-Urartian, is known only from epigraphic records and assumed to be extinct for more than 2000 years.

With respect to morphosyntax, certain typological traits are frequently encountered in Caucasian languages (Klimov, 1994):[3] (1) use of agglutination, with a varying degree of inflective elements, (2) verbocentric sentence structure and complex verbal morphology, often including agreement with multiple syntactic arguments, (3) features of ergativity, where the subject argument of intransitive verbs receives the same morphological case as the object of transitive verbs (absolutive case), whereas the transitive subject receives ergative case,[4] and (4) in Nakh-Daghestanian languages: rich case systems, with up to more than 40 morphological cases. In addition, all living languages in the Caucasus are low-resource (except for Georgian and Armenian, which have considerable amounts of written literature), and many exhibit traces of intense language contact with Iranian, Armenian, Georgian, Turkic, Arabic and/or Russian (reflecting shifting patterns of political dominance in the last 2,500 years).

[1] http://unimorph.github.io/
[2] http://universaldependencies.org/
[3] These characteristics do not apply to Armenian, which is an Indo-European language, albeit ‘as Caucasian as an Indo-European language could possibly become’ (Gippert, p.c., May 2017).
[4] In addition, active-inverse structures can be found in several Caucasus languages, as manifested, for example, in the Kartvelian ‘narrative’ case (which is, however, often referred to as ‘ergative’ in Western linguistics).

2. Universal and language-specific morphology

Following the success of the Universal Dependencies as a growing community project, a similar effort for the development of cross-linguistic features for inflectional morphology has been initiated: Universal Morphology. Both projects aim to develop features and categories which are cross-linguistically applicable (not necessarily universal in the sense of any notion of ‘universal grammar’). As such, the UniMorph annotation schema “allow[s] any given overt, affixal (non-root) inflectional morpheme in any language to be given a precise, language-independent definition ... [by means] of a set of features that represent semantic ‘atoms’ that are never decomposed into more finely differentiated meanings in any natural language” (Sylak-Glassman et al., 2015b, p.674).

2.1. UniMorph inventories

The UniMorph data format is a list of tab-separated values for one word per line, with columns for the word form, the lemma and morphological features; it is thus roughly comparable to the CoNLL format as previously used for, e.g., syntactically annotated corpora of Classical Armenian (Haug and Jøhndal, 2008).[5] The primary data structure of UniMorph is an unordered set of semicolon-separated, unqualified features. Figure 1 shows an example of a conventional gloss of the Megrelian word kešerxvaduk ‘I will meet you’ together with its UniMorph representation.

xvad    kešerxvaduk    AFF;V;LGSPEC4;ARGDA2S;ARGNO1S;LGSPEC6

Figure 1: Megrelian (ma si) kešerxvaduk ‘I will meet you’ as conventional interlinear glossed text (above) and in UniMorph LEMMA - FORM - FEATS representation (below)

UniMorph resources are rarely original resources, but rather extracted from existing material,[6] such as Wiktionary (Kirov et al., 2016, first-generation UniMorph inventories) and other dictionaries, bootstrapped from morpheme inventories or corpora (as described here), or generated by rule-based morphologies. However, this conversion-based approach means that the segmentation and annotation principles of the underlying resource tend to be preserved.

In general, UniMorph follows a word-based approach to morphology where inflected forms are organized in paradigms, but their internal structure is left unanalyzed. In language documentation, however, a morpheme-based approach prevails, i.e., words are segmented into morphemes which are annotated with the linguistic features that they encode. This can lead to vastly different analyses: In morpheme-based annotation, a number of language-specific features are inflectional morphemes that contribute to the indication of morphosemantic features rather than unambiguously indicating them. As such, two Megrelian morphemes in Fig. 4b conspire with other TAM markers to indicate tense, aspect and mood (resp., valency). Which morphosemantic (UniMorph) category these morphemes [...]

2.2. Caucasian languages in UniMorph

Already during the design of the UniMorph guidelines (Sylak-Glassman, 2016), Nakh-Daghestanian languages were taken into consideration for some phenomena, e.g., with respect to the ‘universal’ gender features NAKH1, ..., NAKH8 for Nakh-Daghestanian noun classes (Sylak-Glassman, 2016, p.27). Selected features of Abkhazo-Adyghean (on argument marking, p.12-13; on interrogativity, p.29) and Kartvelian (on evidentiality, p.25) are mentioned, too. Beyond this, languages from the Caucasus area are not discussed in relation to the UniMorph schema, and the UniMorph repositories comprise datasets for Modern Georgian and Modern East Armenian only. The datasets provided as a result of our efforts thus constitute a major increase in coverage of languages of the Caucasus area. We created morphologically annotated datasets in the UniMorph data format for Megrelian (Kartvelian), Khinalug (Nakh-Daghestanian) and Classical Armenian (Indo-European). Additional data on Batsbi (Nakh-Daghestanian) is in preparation.

2.3. Language-specific features

In addition to universal features, UniMorph conventions permit language-specific features to be represented by LGSPEC, followed by a numerical index. Although a separate file that defines those markers can be provided, limiting LGSPEC markers to numerical labels impedes the readability of this data, as the Megrelian example in Fig. 1 illustrates.

For languages with a greater number of language-specific features, this convention for the nomenclature of language-specific features may become problematic, as the numerical labels are likely to be confused and errors in LGSPEC assignment cannot be easily spotted.
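The readability problem with numerical LGSPEC labels can be made concrete with a minimal Python sketch. It reads one UniMorph-style line and looks up its LGSPEC markers in a definitions table, reporting any marker that has no definition. Several things here are assumptions and not part of the paper: the column order follows the LEMMA - FORM - FEATS caption of Figure 1, the spelling of the word form assumes that the stray caron belongs on the s, the helper names `parse_line` and `expand_lgspec` are hypothetical, and the gloss in `labels` is a placeholder, not the authors' definition of LGSPEC4.

```python
from typing import Dict, List, NamedTuple, Tuple


class Entry(NamedTuple):
    lemma: str
    form: str
    feats: Tuple[str, ...]


def parse_line(line: str) -> Entry:
    """Split one tab-separated line into lemma, form and a tuple of feature labels."""
    # Assumed column order: LEMMA, FORM, FEATS (as in the Figure 1 caption).
    lemma, form, feats = line.rstrip("\n").split("\t")
    return Entry(lemma, form, tuple(f for f in feats.split(";") if f))


def expand_lgspec(entry: Entry, labels: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Attach readable glosses to LGSPEC<n> features; collect those without a definition."""
    readable: List[str] = []
    undefined: List[str] = []
    for feat in entry.feats:
        if feat.startswith("LGSPEC") and feat in labels:
            readable.append(f"{feat}={labels[feat]}")
        else:
            readable.append(feat)
            if feat.startswith("LGSPEC"):
                undefined.append(feat)
    return readable, undefined


# Feature string taken from Figure 1; diacritic placement in the form is an assumption.
entry = parse_line("xvad\tkešerxvaduk\tAFF;V;LGSPEC4;ARGDA2S;ARGNO1S;LGSPEC6")

# Hypothetical content of a separate definitions file (placeholder gloss only).
labels = {"LGSPEC4": "placeholder-gloss-4"}

readable, undefined = expand_lgspec(entry, labels)
print(readable)
print(undefined)
```

With only LGSPEC4 defined, `undefined` contains LGSPEC6. A check of this kind catches missing definitions, but it cannot catch a wrongly assigned index: a swapped LGSPEC4 and LGSPEC6 would pass unnoticed, which is the weakness of purely numerical labels described above.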