On the questions in developing computational infrastructure for Komi-Permyak

Jack Rueter, University of Helsinki
Niko Partanen, University of Helsinki
Larisa Ponomareva, University of Helsinki

Proceedings of the 6th International Workshop on Computational Linguistics of Uralic Languages, pages 15–25, Wien, Austria, January 10–11, 2020. 2020 Association for Computational Linguistics. https://doi.org/10.18653/v1/FIXME

Abstract

There are two main written Komi varieties, Permyak and Zyrian. These are mutually intelligible but derive from different parts of the same Komi dialect continuum, representing the varieties prominent in the vicinity and in the cities of Kudymkar and Syktyvkar, respectively. Hence, they share a vast number of features, as well as the majority of their lexicon, yet the overlap in their dialects is very complex. This paper evaluates the degree of difference in these written varieties based on changes required for computational resources in the description of these languages when adapted from the Komi-Zyrian original. Primarily these changes include the FST architecture, but we are also looking at its application to the Universal Dependencies annotation scheme in the morphologies of the two languages.

Дженыта висьталӧм

Коми кылын кык гижан кыв: пермяцкӧй да зырянскӧӥ. Öтамӧд коласын нія вежӧртанаӧсь, но аркмисӧ нія разнӧй коми диалекттэзісь. Пермяцкӧй кыв олӧ Кудымкар лапӧлын, а зырянскӧӥ – Сыктывкар ладорын. Пермяцкӧй да зырянскӧй литературнӧй кыввезын эм уна ӧткодьыс, ӧткодьӧн лоӧ и ыджыт тор лексикаын, но ны диалектнӧй чертаэзлӧн пантасьӧмыс ӧддьӧн гардчӧм.

Эта статьяын мийӧ видзӧтам эна кык кывлісь ассямасӧ сы ладорсянь, мый ковсяс вежны лӧсьӧтӧм зырянскӧй вычислительнӧй ресурсісь, медбы керны сыись пермяцкӧйӧ. Медодз энӧ вежсьӧммесӧ колӧ керны FST-ын, но мийӧ сідзжӧ видзӧтам, кыдз FST лӧсялӧ Быдкодь Йитсьӧммезлӧн схемаӧ морфология ладорсянь.

1 Introduction

The Komi language is a member of the Permic branch of the Uralic language family. By nature, it is a pluricentric language, which, in addition to having two strong written traditions (Komi-Zyrian and Komi-Permyak), can be divided into several varieties. Although some of the other varieties do exhibit written use, no actual new written standards seem to be emerging (Цыпанов, 2009). Both Zyrian and Permyak have long and established written traditions, with continuous contemporary use. There are also numerous dialect resources currently available, e.g. Пономарева (2016) for Northern Permyak dialects.

Komi-Permyak and Komi-Zyrian are extremely agglutinative, but the two standards have different tendencies in their criteria for the definition of an orthographic word. Inflection mainly involves the use of suffixes, which, in the case of nominals, are phrase-final. Hence, contextual dropping of the head noun means that formatives shift to the next rightmost constituent of the noun phrase.

While Komi-Zyrian has a long tradition of computerized morphological analysis, as finite-state transducers have been developed for it since the mid 1990s (Rueter, 2000), the computational resources for Komi-Permyak have been the focus of less intensive work. This article is intended as a roadmap for further development of Komi-Permyak computational resources. The morphological features discussed in this paper have largely been implemented already in the FST in the Giellatekno infrastructure, and the work has been primarily carried out by Jack Rueter. Ongoing work includes intense work on paradigmatic description by Larisa Ponomareva (Rueter et al., 2019a), which has been published within the AKU infrastructure. AKU is an abbreviation for Avointa Kieliteknologiaa Uralilaisille/Uhanalaisille kielille (Open language technology for Uralic/Endangered languages). Other projects that are directly associated with this are uralicNLP (Hämäläinen, 2019), Akusanat (Hämäläinen and Rueter, 2019b) and Ver’dd (Alnajjar et al., 2019) (see also On Editing Dictionaries for Uralic Languages in an Online Environment, in this publication). Forthcoming work includes the expansion of the initial Permyak treebank found in Universal Dependencies version 2.5 (Zeman et al., 2019), i.e. further work on what is scheduled for the next UD release, hence the underlying urgency of further work with this often understudied, but central variety of Komi.

As the Komi-Zyrian finite-state transducer has already reached a very advanced state, and the languages are so similar to one another, it is necessary to ask how far we can reuse the components of a Zyrian analyzer when working with Permyak. Although it has been suggested, especially by Antonsen et al. (2010), that this kind of resource sharing becomes most useful at higher levels of grammar, in the case of very closely related languages the number of shared elements is considerable at all levels of the language. We understand this is a slippery road, and utmost attention has to be paid to full respect of Permyak features and particularities, so that we do not simply force the Zyrian conventions upon Permyak. At the same time, starting to develop a Permyak infrastructure from scratch feels like a missed opportunity to find some synchrony. In this article we attempt to describe all those particularities and pitfalls that have to be considered when one pursues further analysis of Permyak. Our approach is also in some sense comparable to that of Pirinen (2019).

As two Komi-Zyrian treebanks are also under continuous development (Partanen et al., 2018), it is particularly important to pay attention to Komi infrastructure at large. A recent survey of Uralic Universal Dependencies treebanks (Rueter and Partanen, 2019) showed that more work is needed to harmonize the annotation between languages, and working on closely related languages is certainly where similarity is most easily enforced but also most logically expected. In this context, it also has to be taken into account that several other smaller Uralic languages have had their own treebanks introduced in the past couple of years, e.g. Erzya (Rueter and Tyers, 2018), Karelian (Pirinen, 2019) and North Saami (Tyers and Sheyanova, 2017). This kind of work, which concentrates more on manually annotated corpora, complements the descriptive work on morphological analyzers extremely well. A well-functioning morphological analyzer, however, seems to be one of the best starting points for further language technology, which provides a motivation for the work grounded in this paper.

Since this paper describes only the current, rather preliminary state of investigation, we have published the list of discussed features as an accompanying database (Rueter et al., 2020). This database is available online,¹ and can be extended as needed when a larger inventory of differing lexical items and syntactic constructions becomes available. Since we hope the comparative investigations reach other dialects and variants of the Permic languages, the database has been named accordingly with an optimistic mindset.

¹ https://langdoc.github.io/comparative-permic-database

In this paper we distinguish morphological suffixes from clitics in the notation: hyphens set off morphological suffixes, and equal signs indicate clitics and other elements separated by a hyphen in the written norms. All examples in the paper, unless the source is given, have been created by Larisa Ponomareva.

2 Orthographic distinctions

During the development of the two Komi norms, a few orthographic distinctions have been made. These distinctions can be attributed to sub-dialect variation, on the one hand, and arbitrary spelling principles, on the other. The arbitrary spelling choices are simply orthographic, and do not necessarily relate to actual phonological differences in the languages.

Arbitrary choice involves the definition of word (i.e. written unit without white space) and the selection of background language form and letter combinations. It will be observed below that the Komi-Permyak paradigms are minimalistic in comparison to those of Komi-Zyrian. Komi-Permyak tends to write separate words, whereas Komi-Zyrian tends to write single concatenated words. This can be exemplified in the converb paradigms: the Komi-Permyak -ик /-ik/ can only appear alone, in the singular illative or with possessive suffixes, whereas the Komi-Zyrian -иг /-ig/ converb also takes plural marking, cases, as well as numerous other elements -ӧн /-ən/, -моз /-moz/, -тыр /-tɨr/, -тырйи /-tɨrji/, -тыръя /-tɨrja/, -кості /-kosti/, -коста /-kosta/, ... (Некрасова, 2000, p. 344–353).

Arbitrary character combinations can be illustrated best with two prominent paradigms. First, the geminate voiced palatal affricate is represented in Zyrian with ддз ddz, but in Permyak this same affricate is rendered with дзз dzz. Second, consonants followed by a palatal glide and subsequent vowel are written using hard and soft sign combinations. In Zyrian, the norm is to use a soft sign following inherently soft consonants, whereas hard signs are used in other instances. In Permyak, on the contrary, the hard sign is used with specifically hard consonants, while the soft sign is used as default for other combinations.

One orthographic convention that works similarly in Permyak and Zyrian alike is the l : v variation in stem-final position. This variation is not present in this form in any of the Komi-Permyak dialects, but as a literary convention it is shared with the Zyrian standard. Permyak dialects, it will be noted, generally display a multitude of l-related subsystems (Баталова, 1982, p. 58).

Orthographic distinctions between the two Komi norms present few problems. On the one hand, computational distinctions are only attested in the use of the few paragogic consonants in alternate Permyak morphological forms. On the other, use of the plural in both variants appears to follow the same distribution, so any computational issue might only be found at the morpho-syntactic level.

3 Phonetical differences

The morpheme-final t / d correlation between Permyak and Zyrian is a prominent source of predictable morphological and lexical differences. This morpho-phonological difference is found word-finally in the Permyak adjective сьӧкыт /ɕəkɨt/ and Zyrian сьӧкыд /ɕəkɨd/ ‘heavy; difficult’ as well as other corresponding pairs -ыт /-ɨt/ vs. -ыд /-ɨd/, Permyak and Zyrian respectively. It can also be observed in the causative derivation marker -ӧт /-ət/ vs. -ӧд /-əd/ in stems, such as велӧтны /velətnɨ/ and велӧдны /velədnɨ/ ‘to teach’, Permyak and Zyrian respectively. The same correlation is also found in the comitative case ending -кӧт /-kət/ vs. -кӧд /-kəd/ and the possessive suffixes for the second persons singular and plural: -ыт /-ɨt/ vs. -ыд /-ɨd/, and -ныт /-nɨt/ vs. -ныд /-nɨd/. The same voiceless vs. voiced correlation might also be detected in the converb -икӧ /-ikə/ vs. -игӧ /-igə/, Permyak and Zyrian respectively.

On a similar note, there is a correlation between Permyak -ӧ /-ə/ and Zyrian -ӧй /-əj/ in first person singular possessive marking. In verbal conjugation, the Permyak word-final ӧ /ə/ of the first person plural marker -мӧ /-mə/ corresponds to Zyrian endings in -м /-m/, whereas the Zyrian first person plural imperative usage might include both -мӧ /-mə/ and -мӧй /-məj/, as in мунамӧй /munaməj/ ‘let’s go’.

4 Morphological differences

Many of the morphological forms provide an illustration of where the human learner may have problems in comprehension while the computer has no problems in computation. There are, however, numerous ways in which a minor difference in one morphological form has a potential impact on ambiguities in other parts of the system.

In this section we go through the most essential differences in Komi-Zyrian and Komi-Permyak morphology.

4.1 Paragogical consonants

Both language forms have the same paragogic consonants, but their distribution is varied. In practice, the so-called paragogic consonants are present when the stem is followed by a suffix with an onset vowel, and absent when the stem is word-final or followed by a consonant (cf. Безносикова et al., p. 16).

Paragogic consonants may be present in Permyak but to a lesser extent than they are in Zyrian due to sub-dialect representation, i.e. many of the sub-dialects do not have them. Komi-Zyrian includes paragogic consonants in its nominal declension and derivation: approximately 0.07 percent of the 12,046 noun stems in the Zyrian transducer have paragogic consonants, but this is reduced to 0.024 once the diminutive/material formative тор /tor/ is removed. Komi-Permyak, in contrast, limits its use of paragogic consonants in declensions, and the number of Komi-Permyak stems with paragogic

consonants is smaller. The Komi-Permyak standard language recognizes the paragogic consonants й /j/, к /k/ and м /m/ as alternative variants. The paragogic consonant й /j/ is more common than к /k/ and м /m/; the latter two are found only in a limited set of stems, such as син /ɕin/ : синм- /ɕinm-/ ‘eye’, кос /kos/ : коск- /kosk-/ ‘lower back’, мыш /mɨʃ/ : мышк- /mɨʃk-/ ‘back’. Thus the Komi-Permyak literary language supports the use of both синмӧ пырӧ /ɕinmə pyrə/ and синӧ пырӧ /ɕinə pyrə/ ‘gets in the eye’, where both синмӧ /ɕinmə/ and синӧ /ɕinə/ are analyzed as eye-ILL. (The paragogic т /t/ in the verb локны /loknɨ/ : локт- /lokt-/ ‘to arrive’ is standard and cannot be left out of the paradigm in either of the literary languages.)

4.2 Plural formation

Phonological variation can be detected in the plural marking of heads. The normal Zyrian plural marker is realized as -яс /-jas/, whereas the normal Permyak plural marker calls for either doubling of the word-final consonant (see Figure 1) or, following a vowel, a simple -эз /-ez/ (see Figure 2). Orthographically, the word-final consonant й /j/ forms an exception to this: here the Cyrillic е is written without orthographic duplication of й /j/ (see Figure 3).

Plural character duplication, which is the primary method of plural formation in Permyak, is also partially present in Zyrian dialects. This, however, is not accepted in the Zyrian written standard, where the plural is formed with the distinct suffix -яс /-jas/ (as illustrated in Figures 1, 2 and 3).

Permyak   кыв + вез (← -йэз)          кыввез
          /kɨv/ + /vez (← -jez)/      /kɨvvez/
Zyrian    кыв + яс                    кывъяс
          /kɨv/ + /jas/               /kɨvjas/
          ‘word; language’ + PL       ‘words; languages’

Figure 1: Example plural of /kɨv/ ‘word; language’

Permyak   му + эз                     муэз
          /mu/ + /ez/                 /muez/
Zyrian    му + яс                     муяс
          /mu/ + /jas/                /mujas/
          ‘land; country’ + PL        ‘lands; countries’

Figure 2: Example plural of /mu/ ‘land; country’

Permyak   кай + ез                    кайез
          /kaj/ + /jez/               /kajjez/
Zyrian    кай + яс                    кайяс
          /kaj/ + /jas/               /kajjas/
          ‘bird’ + PL                 ‘birds’

Figure 3: Example plural of /kaj/ ‘bird’

4.3 Possessive marking

Although singular possessive marking differs from Zyrian only through the expected phonetic correspondence t / d, the plural forms display more complex assimilation. While the plural possessive forms in Zyrian are clearly segmentable, i.e. понъяс : понъясыд /ponjas/ : /pon-jas-ɨd/ dog-PL : dog-PL-2SG, the corresponding forms for the second and third person in Permyak are often fused, i.e. поннэз : поннэт /pon-nez/ : /pon-net/ dog-PL : dog-PL.2SG (Лыткин, 1962). Forms with separate elements are, however, also possible. Both form types have already been implemented in the Permyak analyzer.

4.4 Cases

While both literary norms generally describe the number of cases as being sixteen or seventeen, a reality check might be required. The most recent and extensive presentation of Komi-Zyrian, it should be noted, indicates at least 23 cases with new ones appearing all the time (Некрасова 2000:59–62). One reason for this inconsistency is the definition of case: what is a case, and in what kinds of combinations can cases be used when speaking of a single referent and a double referent (inclusive elliptic referent)? Thus we can observe organic expansion of the local cases and divergence in case enumerations.

Both language norms have regular extensions of the approximative case -лань /-laɲ/ ‘towards X’. The case marker may take additional local case combinations, e.g. approximative + elative, in Permyak -ланись /-laɲ+iɕ/ and in Zyrian -ланьысь /-laɲ+ɨɕ/ ‘from on towards X’, which is actually just a more specific combination of semantics. Additional extensions mutual to both literary norms include the inessive, illative, prolative, terminative and egressive.

Diversity between Komi-Zyrian and Komi-Permyak is apparent in both phonetic variation and complementary distribution of morphology. This can be seen in regular nominal declension with regard to the prolative and terminative. The prolative -ӧд /-əd/ and transitive -ті /-ti/, which are both part of regular declension in Komi-Zyrian, are only represented by a regular prolative -ӧт /-ət/ in Komi-Permyak. An analogous transitive -ті /-ti/ is present in Komi-Permyak in a few adpositions and adverbs, but it is not considered to be an independent case of its own.
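The orthographic plural rules of Section 4.2 (Figures 1–3) can be sketched in a few lines of Python. This is only an illustrative sketch, not the FST implementation: the function names are hypothetical, the vowel inventory and the set of consonants that take э rather than е after doubling are assumptions generalized from forms in this paper (кыввез vs. поннэз), and soft-sign stems are deliberately left out for Permyak.

```python
# Illustrative sketch of the plural rules in Section 4.2; NOT the FST itself.
# All names and character sets below are assumptions made for this example.

VOWELS = set("аеёиіоӧуыэюя")
# Assumption: after these doubled hard consonants the plural is spelled with э
# (пон -> поннэз); after other consonants it is spelled with е (кыв -> кыввез).
HARD_PAIRED = set("дзлнст")


def permyak_plural(stem: str) -> str:
    """Return the Komi-Permyak plural for the stem types shown in Figures 1-3."""
    last = stem[-1]
    if last in VOWELS:
        return stem + "эз"            # му -> муэз (Figure 2)
    if last == "й":
        return stem + "ез"            # кай -> кайез, й is not doubled (Figure 3)
    if last == "ь":
        # Soft-sign stems are outside the scope of this sketch.
        raise ValueError("soft-sign stems are not covered by this sketch")
    ending = "эз" if last in HARD_PAIRED else "ез"
    return stem + last + ending       # кыв -> кыввез (Figure 1)


def zyrian_plural(stem: str) -> str:
    """Return the Komi-Zyrian plural in -яс, with a hard sign after consonants."""
    last = stem[-1]
    if last in VOWELS or last in "йь":
        return stem + "яс"            # му -> муяс, кай -> кайяс
    return stem + "ъяс"               # кыв -> кывъяс


if __name__ == "__main__":
    for stem in ("кыв", "му", "кай", "пон"):
        print(stem, permyak_plural(stem), zyrian_plural(stem))
```

A production analyzer would state these alternations as lexc continuation classes and rewrite rules; the sketch is only meant to make the three stem types in the figures concrete.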

The terminative shows a comparable split in the other direction: the two Komi-Permyak terminative cases in -ӧдз /-ədz/ and -ви /-vi/ are represented by only one terminative -ӧдз /-ədz/ in Komi-Zyrian. As a rule of thumb, we can say that the deviant Komi-Permyak -ви /-vi/ might be replaced in most places by -ӧдз /-ədz/, but research is still required to establish where the semantics of these two forms are distinct. Initially, it may be said that -ви /-vi/ can be used when indicating motion up to a boundary, whereas -ӧдз /-ədz/ implies both motion up to and passing that boundary.

Phonetic diversity is observed in the dative and elative cases. While the Zyrian dative is marked with -лы /-lɨ/, Permyak uses -лӧ /-lə/. Similarly, elative and ablative differ in their vowels. In Zyrian, the elative is marked with -ысь /-ɨɕ/ and the ablative with -лысь /-lɨɕ/, whereas in Permyak the corresponding forms are elative -ись /-iɕ/ and ablative -лісь /-liɕ/.

When inspecting NPs where the head has been deleted because it can be derived contextually, as discussed in the WALS chapter on adjectives without nouns (Gil, 2013), it will be noticed that Komi-Permyak uses a special accusative form in -ӧ /-ə/ for the accusative adjective without a head noun, while the Komi-Zyrian solution in the same context is -ӧс /-əs/; see Examples 1 and 2 below.

(1) тэныт      гӧрдӧ      али    вежӧ          сетны?
    tenɨt      gərd-ə     aʎi    veʒ-ə         ɕet-nɨ?
    2SG.DAT    red-ACC    or     yellow-ACC    give-INF
    ‘shall [I] give you the red one or the yellow one?’ (Permyak)

(2) Тэныд      гӧрдӧс     либӧ    вежӧс        сетны?
    tenɨd      gərd-əs    ʎibə    veʒ-əs       ɕet-nɨ?
    2SG.DAT    red-ACC    or      green-ACC    give-INF
    ‘shall [I] give you the red one or the green one?’ (Zyrian)

This difference, although seemingly small, has many implications for possible morphological analysis of such adjective forms. It creates ambiguity between adjective accusative, illative and possessive forms in a way that is not at all present in Zyrian. In addition, the resulting syntactic structure will need very distinct Constraint Grammar rules (Karlsson, 1990).

4.5 Case and possessive marker ordering

Possessive suffixes and case endings in the Komi-Permyak standard may appear in varied order, as illustrated in Example 3.

(3) каньыстӧг      :  каньтӧгьяс
    kaɲɨstəg          kaɲtəgjas
    cat-PS3-CAR       cat-CAR-PS3
    ‘Without his / her cat’ (Permyak)

Similar phenomena are also attested in Komi-Zyrian but not to the same extent (cf. Некрасова 2000, pp. 54–95). Instead of changing the order of tags in the transducer according to morpheme order, an additional tag set for suffix ordering, +So/CP (case, possession) and +So/PC (possession, case), has been adopted, as in the description of the two Mari standards (mhr) and (mrj) by Jeremy Bradley, Jack Rueter and Trond Trosterud at Giellatekno. The idea of the extra tag is to retain the tag ordering used in testing and construction. In the meantime, an extra tag is made available for possible grammar research.

4.6 Verbal morphology

Both Permyak and Zyrian have dialect variation in verbal morphology, but in Permyak more variation is accepted. For example, first and second person plural finite verb forms may omit the final -ӧ /-ə/ in all tenses: both мунам /mun-am/ and мунамӧ /mun-amə/, for example, have the identical meaning ‘to_go-1PL’. Similar variation is also present in Zyrian dialects, but in the literary language it is not accepted, and the Zyrian FST returns an additional error tag.

In the second past tense third person singular, a different kind of variation is present, in which мунӧма /munəma/ and мунӧм /munəm/ are both accepted. In Zyrian, only the first variant is in the literary standard. This has some impact on the possible tags of the corresponding forms. In the second past tense second person singular, however, variation is present in the two possible forms such as мунӧмат /munəmat/ and мунӧмыт /munəmɨt/ ‘to_go-PST2.2SG’. Again, there is no difference in meaning. The latter form is directly comparable to the Zyrian form мунӧмыд /munəmɨd/ through a phonological difference already described above, see Section 3.

In the third person plural present the variation is similar, but with different elements: мунӧны /munənɨ/ and мунӧн /munən/ ‘to_go-3PL’.

Again, there is no conceivable difference in meaning. The shorter form seems to be used more in the spoken language. This variation is not present in any form in Zyrian.

There are parts of Permyak verbal morphology that have no counterparts in the Zyrian standard language. Among the most frequent differing forms are the third person plural past and future indicative verb forms. In Permyak, the paradigm in past, present and future can be illustrated with the verb мунны /munnɨ/ ‘to go’: мунісӧ : мунӧны (or мунӧн) : мунасӧ /munisə/ : /munənɨ/ (or /munən/) : /munasə/. In Zyrian the corresponding paradigm would be мунісны : мунӧны : мунасны /munisnɨ/ : /munənɨ/ : /munasnɨ/, which illustrates how forms with -sə are lacking.

Permyak past tense formation is more regular than Zyrian, which displays complex variation in possible homonymy for first and third person past tense forms of some intransitive verbs, such that муні /muni/ could be either a first or a third person singular form. In Permyak, the only verb that displays this variation is вӧвны /vəvnɨ/ ‘to be’, whereas other verbs are regularly marked: муні /mun-i/ to_go-PST.1SG, муніс /mun-is/ to_go-PST.3SG.

In the Permyak second past tense the form мунӧмась /munəmaɕ/ corresponds to Zyrian мунӧмаӧсь /munəmaəɕ/. Here the morpheme suffixation in Zyrian is more transparent. Similar forms are also possible in Zyrian dialects, but they do not occur in the written standard. From the perspective of morphological analyzer construction, these forms pose no challenge.

Permyak connegatives are formed differently from their Zyrian counterparts, so that the Permyak plural connegative is always marked with -ӧ /-ə/, e.g. оз мунӧ ‘he/she does not go’ : озӧ мунӧ ‘they do not go’ /oz munə/ : /ozə munə/. In Zyrian, the plural connegative would be formed as оз мунны /oz munnɨ/ ‘they do not go’, with the connegative form identical to the infinitive of the verb. In this detail, the Permyak connegative is less ambiguous than the Zyrian one, and some of the Constraint Grammar rules that currently disambiguate this in Zyrian would not be needed.

Another difference associated with connegatives is the second person plural negation verb forms од /od/ and одӧ /odə/ in Permyak, which are distinct from their Zyrian counterparts он /on/ and онӧ /onə/. The same stem is also present in past tense forms, and regularly matches the past tense paradigm with stem-initial э- /e-/. The variation in the final vowel behaves as already described above.

Permyak imperatives have multiple forms not found in Zyrian. Forms created with -те /-ce/, e.g. мунӧте /munəce/ ‘go-IMP.2PL’ and босьтӧте /boɕtəce/ ‘take-IMP.2PL’, do not differ in their meaning from the more common imperative forms, such as мунӧ /munə/ ‘go-IMP.2PL’ and босьтӧ /boɕtə/ ‘take-IMP.2PL’. The former forms, however, may be more colloquial (Лыткин, 1962, 249). Forms marked with -те /-ce/ are present in the first and second persons plural.

Another type of imperative is formed with -ко /-ko/. In the orthography it is written with a hyphen. It is used in the second person singular, and in the first and second person plural. This imperative has a softer meaning, more of a request than a command. We use the tag +Prec, as in precative.² This form is a direct parallel to the Russian -ка /-ka/, which also indicates a request, e.g. возьмите-ка /voʑmice-ka/ ‘do take [it]’.

² https://glossary.sil.org/term/precative-mood

Related to imperatives, the optative is formed in the Permyak written language with two particles, ась /aɕ/ and мед /med/. The former particle does not exist in Komi-Zyrian.

The converb system in Permyak displays some characteristics not found in Zyrian. One difference is the uniquely Permyak converb -тӧн /-tən/. It expresses simultaneous action of two verbs.

(4) муні          сьывтӧн
    mun-i         ɕɨv-tən
    go-PST.1SG    sing-CVB
    ‘I went singing’ (Permyak)

Besides converb forms that are not marked for person, there are also forms with possessive suffixes. These, unexpectedly, occur with palatalization and gemination of the stem-final consonant, as in:

(5) муні          сьывтӧнням
    mun-i         ɕɨv-təɲːam
    go-PST.1SG    sing-CVB.1SG
    ‘I went singing’ (Permyak)

In fact, this palatalization and concurrent gemination occurs in other possessive forms, too:

(6) увтöттяс
    uvt-əcːas
    under-PROL.PS3
    ‘(to go) under (something)’ (Permyak)

In this latter form the prolative and possessive suffix are not clearly separable, which again illustrates the more fusional morphology of Permyak when compared to Zyrian. (Looking back at the plural morpheme, it will be noted that palatalization is a distinguishing factor in the possessive forms.)

Another converb that lacks a complete correspondence in Komi is the Permyak -ик /-ik/. In Zyrian there is a cognate converb -иг /-ig/, and this form also expresses simultaneous action, like the Permyak -тӧн /-tən/ converb discussed above. There are, however, small differences between the languages. In Permyak the converb, when not used as an unmarked complement, is always used with the unambiguous […] or the ambiguous illative case with possessive suffixing, whereas in Zyrian the instrumental is used in the forms that are not marked for possessor. In both languages, however, the possessive forms are deductively in the illative (as determined by the semantic use of the illative), and they are structurally formed in an identical way, i.e. Permyak мун-ікас /mun-ikas/ go-CVB.ILL.3SG, Zyrian мун-ігас /mun-igas/ go-CVB.ILL.3SG ‘while going’.

4.7 Derivational morphology

There are individual derivational suffixes that are present in Permyak but not in Zyrian. There is -жуг /-ʒug/, which forms pejoratives, and multiple diminutives such as -ок /-ok/, -очка /-ot͡ɕka/ and -иньӧй /-iɲəj/.

In adjective formation, Permyak has several particular features. It is possible to form new adjectives from nouns with the suffix -овӧй /-ovəj/ (Лыткин, 1962, p. 14). Additionally, -ӧв /-əv/ forms excessive adjectives and adverbs, e.g. ыджыт : ыджытӧв /ɨdʒːɨt/ : /ɨdʒːɨtəv/ ‘large : too large’.

There are also numerous derivation types that are found in Zyrian but are not present in Permyak (Лыткин, 1962, p. 14): -лун /-lun/, -шой /-ʃoj/ and -ук /-uk/. As corresponding forms do not exist, the analyzer should either provide no analysis for them, or possibly mark them with a tag indicating they are non-standard.

5 Clitics

Discourse clitic marking in Komi-Zyrian is a salient source of morphological ambiguity. While both =сӧ /=sə/ and =тӧ /=tə/ can be interpreted as clitics, they also represent the accusative with third person singular and second person singular possessive marking, respectively. As these clitics do not occur in Permyak, such a homonymy is not present in the paradigm, making disambiguation of Permyak less problematic.

There are two discourse clitics commonly used in Permyak, =ту /=tu/ and =то /=to/. Both occur in the written standard, with their origin possibly in varied dialect distributions. In Zyrian dialects, a corresponding clitic in =то /=to/ is also present, but the most important factor here is that, as explained above, while these clitics take the role of Zyrian =сӧ /=sə/ and =тӧ /=tə/, they also make Permyak accusatives much less ambiguous than those in Zyrian.

With the infinitive forms of Permyak verbs, a form identical to Zyrian =тӧ /=tə/ does occur (Баталова, 1975, p. 188), but the amount of ambiguity this introduces is not as problematic as what is seen in Zyrian.

Question marking in the two Komis presents a dichotomy of =ӧ /=ə/ in Komi-Zyrian and an independent particle я /ja/ in Komi-Permyak. Anticipation of a shallow-transfer translation system raises the question of how these equivalent items might be designated for both languages regardless of orthographic conventions. (In Western tradition, the question is one of the four traditional sentence types, so there should be a way to address it in the code.)

6 Universal Dependencies

Work with the 2.5 release of the Komi-Permyak Universal Dependencies treebank (UD_Komi_Permyak-UH) has emphasized the need for consistency with the existing Zyrian treebanks. Since the Zyrian treebanks are relatively small, it is still easy to propose changes for both treebank sets, and future work with Zyrian also needs to be considered in the Permyak treebank.

As Permyak and Zyrian are very closely related languages, the development of different treebanks will certainly be mutually beneficial. There has been recent interest in using resources from related or contact languages in order to train tools such as dependency parsers (e.g. Lim et al., 2018), but, in the case of Komi-Zyrian, none of the languages

in the Universal Dependencies project have been particularly close to Komi, and the results have not been at so high a level that such models could have been applied in language documentation work. With Komi-Permyak and Komi-Zyrian, multilingual model training of this type may very well be worth the effort, as the grammatical structures and lexicon are largely shared. The benefits become particularly clear when attempts are made to process dialect materials in either language, as the distribution of features is in many ways different from those of the written standards (further discussion of which, unfortunately, is outside the scope of this paper).

The Komi-Permyak treebank, once again, underscores the need for a different approach to representative sentence selection. While large treebank projects are able to utilize large amounts of data with inherited but transferable annotation from other projects, small languages, such as Komi-Permyak and Komi-Zyrian, cannot really opt for statistical representation. Instead, it is proposed that features specific to the language be selected. Hence, part of the strategy for the initial release of the Komi-Permyak treebank was to feature numerals and their regular morpho-semantic use, e.g. both Komi standards have multiplicative-distributional numerals as well as ordinal-multiplicative numerals. Komi-Permyak, however, has an additional a-final numeral used in copula complement position to indicate the notion of a tallied sum, e.g.

(7) Деревняын      оліссес       нёля.
    ɟerevɲa-ɨn     oliɕːes       ɲoʎ-a
    village-INE    dweller.PL    four-[…]
    ‘In the village, there are four people all together’ (Permyak)

One approach could be to select example sentences from available Komi grammars, as this way it would be possible to have different grammatical phenomena fully represented. There are many features of Komi that are typologically relevant, but relatively rare, as already discussed in Partanen et al. (2018). These include, among other features, various stacked cases that occur only sporadically in all their realizations even in very large written corpora.

7 Possibilities for resource reuse

While the morphological analyzer is still being developed for Permyak, with the groundwork for it largely copied from the existing Zyrian analyzer, special attention must be paid to the particularities of Permyak and the reduction of interference from the original Zyrian. One approach that needs further work is to ensure that both Permyak and Zyrian YAML tests are comparable in their coverage, which would also allow further automatic testing of how large the number of shared forms is. This, for example, would require the writing of YAML tests for Komi-Zyrian, which has few tests on the whole.

Permyak and Zyrian also share the vast majority of their lexicon. This leaves the question open as to how exactly we should proceed with the management of the lexicographic data for these languages, i.e. while using tools such as Akusanat and Ver’dd (see e.g. Rueter and Hämäläinen, 2017; Hämäläinen and Rueter, 2019b). One also has to ask whether there are specific ways in which Permyak and Zyrian lexical resources should be connected to each other. This might be solved with cognate searching analogous to what has been used for Northern Sami and […] cognates for establishing initial etymological associations (Hämäläinen and Rueter, 2019a). Russian loanwords, although differently adapted, are largely shared. At present, this issue has been partially solved through the sharing of proper nouns mutual to nearly all languages written in Russian Cyrillics³ (49,156 words) and additional adjectives shared by both Komi transducers⁴ (~6000 words), whose content was initially introduced in FU-Lab for adjectives ending in -ӧй /-əj/. The shared kom-adjectives-russian-like.lexc file has preliminarily been selected on the pretext that the Komi letter ӧ cannot occur twice in a given Komi-Permyak stem. Further editing of this file will be required to remove Komi-Zyrian instances of -ӧй /-əj/ where the Russian equivalent would indicate a stressed vowel. When the Russian equivalent has a stressed -о, the Permyak variant is also -о.

³ gtsvn/giella-shared/urj-Cyrl/src/morphology/stems/urj-Cyrl-propernouns.lexc
⁴ gtsvn/langs/kpv/src/morphology/stems/kom-adjectives-russian-like.lexc

It must be mentioned that through our meticulous work on the Komi-Permyak analyzer, we have arrived at a situation where there are more YAML tests for Permyak than for Zyrian. It could be an interesting idea to make sure that Permyak and Zyrian tests contain the same lexemes with their matching analyses. The forms would be different, but this would allow comparing the paradigms from one more perspective. (In fact, this can be done rather easily by generating a separate full Zyrian YAML test for every lexeme addressed in the Permyak YAML tests, but it will also require native-like language knowledge for proofreading (Rueter et al., 2019b).) In addition, at least the forms that categorically do not exist in Permyak should not be getting a reading, but the situation becomes more complicated with the forms shared by various Zyrian and Permyak dialects. (Here, we will need to use the descriptive YAML tests, as there are already three categories of YAML tests in the Giella infrastructure: dict[ionary], norm[ative] and desc[riptive].) Probably, some additional distinctions will be made between the descriptive and normative analyzers, with the first being less restricted, as has been done with Zyrian earlier. (Analogical […]

[…]ture described here.

Our analysis is based on standard grammatical references to Komi-Permyak, so if there are features that need to be addressed further, they might be something that earlier literature has either neglected or failed to notice. Thus the development of a computational infrastructure becomes better anchored in the grammatical description of Komi-Permyak, and the relationship of these often remote, although closely connected, activities becomes more firmly established.

Acknowledgments

Jack Rueter has been able to participate in these developments while performing expertise work on Uralic languages for a FINCLARIN project at the University of Helsinki, Digital Humanities Department. Special thanks to the University of Helsinki for funding Rueter’s travel.
work has been done in this vein with development Niko Partanen works within the project Language of the Võro language YAML tests due to the exten- Documentation meets Language Technology: The sively descriptive nature originally depicted in the Next Step in the Description of Komi, funded by transducer to cover various dialects (Iva and Rueter, the Kone Foundation, . Special thanks to the 2020)) University of Helsinki for funding Partanen’s travel. Larisa Ponomareva is presently working as a re- 8 Conclusion search assistant at the University of Helsinki, Digi- tal Humanities with funding from the Finnish Social Based to our analysis, developing a Komi-Permyak Insurance Institution (KELA). FST based on the Komi-Zyrian FST is a worthwhile and relatively straightforward process. We also be- lieve that there are ways to use such analyzers for References better identification and quantification of the differ- K. Alnajjar, M. Hämäläinen, . Partanen, and Jack ences between these pluricentric varieties. Rueter. 2019. The open dictionary infrastructure for The approach taken in this paper, with a de- uralic languages. In Электронная письменность tailed description of the morphological differences народов Российской Федерации: Опыт, проблемы и перспективы. Материалы II Международной encountered between the two norms, is believed to научной конференции (Уфа, 11–12 декабря 2019 render a more legible work flow. Such a plan helps г.), pages 49–51. to formulate strategies for development and further work on the Komi-Permyak analyzer and treebanks. Lene Antonsen, Trond Trosterud, and Linda Wiechetek. 2010. Reusing grammatical resources for new lan- One of the upcoming tasks is to extend this work guages. In Proceedings of the Seventh International from the literary languages into various dialects, as Conference on Language Resources and Evaluation has already been done with the Zyrian analyzer. (LREC’10), Valletta, Malta. 
European Language Re- This will further complicate the relationship be- sources Association (ELRA). tween work done on Permyak and Zyrian, as the fea- David Gil. 2013. Adjectives without nouns. In ture isoglosses usually have distributions that do not Matthew S. Dryer and Martin Haspelmath, edi- follow the official language boundaries. Although tors, The World Atlas of Language Structures Online. smaller Komi varieties such as Zyuzdin and Yazva Max Planck Institute for Evolutionary Anthropology, have some resources and recent publishing activities Leipzig. (for Yazva i.e. Паршакова, 2003), it is currently un- Mika Hämäläinen. 2019. UralicNLP: An NLP library clear in which forms the existing resources on these for Uralic languages. Journal of Open Source Soft- languages should be integrated into the infrastruc- ware, 4(37):1345.

Mika Hämäläinen and Jack Rueter. 2019a. Finding Sami cognates with a Character-Based NMT Approach. In Proceedings of the Workshop on Computational Methods for Endangered Languages, volume 1.

Mika Hämäläinen and Jack Rueter. 2019b. An open online dictionary for endangered Uralic languages. Electronic lexicography in the 21st century (eLex 2019): Smart lexicography.

Sulev Iva and Jack Rueter. 2020. rueter/aku-morph-voro: Basic adjectives, nouns and verbs.

Fred Karlsson. 1990. Constraint grammar as a framework for parsing running text. In COLING 1990 Volume 3: Papers presented to the 13th International Conference on Computational Linguistics.

KyungTae Lim, Niko Partanen, and Thierry Poibeau. 2018. Multilingual dependency parsing for low-resource languages: Case studies on North Saami and Komi-Zyrian.

Niko Partanen, Rogier Blokland, KyungTae Lim, Thierry Poibeau, and Michael Rießler. 2018. The first Komi-Zyrian Universal Dependencies treebanks. In Second Workshop on Universal Dependencies (UDW 2018), November 2018, Brussels, Belgium, pages 126–132.

Tommi A Pirinen. 2019. Building minority dependency treebanks, dictionaries and computational grammars at the same time - an experiment in Karelian treebanking. In Proceedings of the Third Workshop on Universal Dependencies (UDW, SyntaxFest 2019), pages 132–136.

Jack Rueter and Mika Hämäläinen. 2017. Synchronized Mediawiki based analyzer dictionary development. In International Workshop for Computational Linguistics of Uralic Languages, pages 1–7. Association for Computational Linguistics.

Jack Rueter and Niko Partanen. 2019. Survey of Uralic Universal Dependencies development. In Workshop on Universal Dependencies, page 78. Association for Computational Linguistics.

Jack Rueter, Niko Partanen, and Larisa Ponomareva. 2019a. rueter/aku-morph-komi-permyak: Basic nouns, verbs and pronouns.

Jack Rueter, Niko Partanen, and Larisa Ponomareva. 2019b. rueter/aku-morph-komi-zyrian: Basic nouns, verbs and pronouns.

Jack Rueter, Niko Partanen, and Larisa Ponomareva. 2020. langdoc/comparative-permic-database: Comparative Permic Database.

Jack Rueter and Francis Tyers. 2018. Towards an open-source universal-dependency treebank for Erzya. In Proceedings of the 4th International Workshop on Computational Linguistics for Uralic languages, pages 106–118. ACL.

Jack M. Rueter. 2000. Хельсинкиса университетын кыв туялысь Ижкарын перымса симпозиум вылын лыддьӧмтор. In Пермистика 6 (Proceedings of Permistika 6 conference), pages 154–158.

Francis M Tyers and Mariya Sheyanova. 2017. Annotation schemes in North Sámi dependency parsing. In Proceedings of the Third Workshop on Computational Linguistics for Uralic Languages, pages 66–75.

Daniel Zeman, Joakim Nivre, Mitchell Abrams, Noëmi Aepli, Željko Agić, Lars Ahrenberg, Gabrielė Aleksandravičiūtė, Lene Antonsen, Katya Aplonova, Maria Jesus Aranzabe, Gashaw Arutie, Masayuki Asahara, Luma Ateyah, Mohammed Attia, Aitziber Atutxa, Liesbeth Augustinus, Elena Badmaeva, Miguel Ballesteros, Esha Banerjee, Sebastian Bank, Verginica Barbu Mititelu, Victoria Basmov, Colin Batchelor, John Bauer, Sandra Bellato, Kepa Bengoetxea, Yevgeni Berzak, Irshad Ahmad Bhat, Riyaz Ahmad Bhat, Erica Biagetti, Eckhard Bick, Agnė Bielinskienė, Rogier Blokland, Victoria Bobicev, Loïc Boizou, Emanuel Borges Völker, Carl Börstell, Cristina Bosco, Gosse Bouma, Sam Bowman, Adriane Boyd, Kristina Brokaitė, Aljoscha Burchardt, Marie Candito, Bernard Caron, Gauthier Caron, Tatiana Cavalcanti, Gülşen Cebiroğlu Eryiğit, Flavio Massimiliano Cecchini, Giuseppe G. A. Celano, Slavomír Čéplö, Savas Cetin, Fabricio Chalub, Jinho Choi, Yongseok Cho, Jayeol Chun, Alessandra T. Cignarella, Silvie Cinková, Aurélie Collomb, Çağrı Çöltekin, Miriam Connor, Marine Courtin, Elizabeth Davidson, Marie-Catherine de Marneffe, Valeria de Paiva, Elvis de Souza, Arantza Diaz de Ilarraza, Carly Dickerson, Bamba Dione, Peter Dirix, Kaja Dobrovoljc, Timothy Dozat, Kira Droganova, Puneet Dwivedi, Hanne Eckhoff, Marhaba Eli, Ali Elkahky, Binyam Ephrem, Olga Erina, Tomaž Erjavec, Aline Etienne, Wograine Evelyn, Richárd Farkas, Hector Fernandez Alcalde, Jennifer Foster, Cláudia Freitas, Kazunori Fujita, Katarína Gajdošová, Daniel Galbraith, Marcos Garcia, Moa Gärdenfors, Sebastian Garza, Kim Gerdes, Filip Ginter, Iakes Goenaga, Koldo Gojenola, Memduh Gökırmak, Yoav Goldberg, Xavier Gómez Guinovart, Berta González Saavedra, Bernadeta Griciūtė, Matias Grioni, Normunds Grūzītis, Bruno Guillaume, Céline Guillot-Barbance, Nizar Habash, Jan Hajič, Jan Hajič jr., Mika Hämäläinen, Linh Hà Mỹ, Na-Rae Han, Kim Harris, Dag Haug, Johannes Heinecke, Felix Hennig, Barbora Hladká, Jaroslava Hlaváčová, Florinel Hociung, Petter Hohle, Jena Hwang, Takumi Ikeda, Radu Ion, Elena Irimia, Ọlájídé Ishola, Tomáš Jelínek, Anders Johannsen, Fredrik Jørgensen, Markus Juutinen, Hüner Kaşıkara, Andre Kaasen, Nadezhda Kabaeva, Sylvain Kahane, Hiroshi Kanayama, Jenna Kanerva, Boris Katz, Tolga Kayadelen, Jessica Kenney, Václava Kettnerová, Jesse Kirchner, Elena Klementieva, Arne Köhn, Kamil Kopacewicz, Natalia Kotsyba, Jolanta Kovalevskaitė, Simon Krek, Sookyoung Kwak, Veronika Laippala, Lorenzo Lambertino, Lucia Lam, Tatiana Lando, Septina Dian Larasati, Alexei Lavrentiev, John Lee, Phương Lê Hồng, Alessandro Lenci, Saran Lertpradit, Herman Leung, Cheuk Ying Li, Josie Li, Keying Li, KyungTae Lim, Maria Liovina, Yuan Li, Nikola Ljubešić, Olga Loginova, Olga Lyashevskaya, Teresa Lynn, Vivien Macketanz, Aibek Makazhanov, Michael Mandl, Christopher Manning, Ruli Manurung, Cătălina Mărănduc, David Mareček, Katrin Marheinecke, Héctor Martínez Alonso, André Martins, Jan Mašek, Yuji Matsumoto, Ryan McDonald, Sarah McGuinness, Gustavo Mendonça, Niko Miekka, Margarita Misirpashayeva, Anna Missilä, Cătălin Mititelu, Maria Mitrofan, Yusuke Miyao, Simonetta Montemagni, Amir More, Laura Moreno Romero, Keiko Sophie Mori, Tomohiko Morioka, Shinsuke Mori, Shigeki Moro, Bjartur Mortensen, Bohdan Moskalevskyi, Kadri Muischnek, Robert Munro, Yugo Murawaki, Kaili Müürisep, Pinkey Nainwani, Juan Ignacio Navarro Horñiacek, Anna Nedoluzhko, Gunta Nešpore-Bērzkalne, Lương Nguyễn Thị, Huyền Nguyễn Thị Minh, Yoshihiro Nikaido, Vitaly Nikolaev, Rattima Nitisaroj, Hanna Nurmi, Stina Ojala, Atul Kr. Ojha, Adédayọ̀ Olúòkun, Mai Omura, Petya Osenova, Robert Östling, Lilja Øvrelid, Niko Partanen, Elena Pascual, Marco Passarotti, Agnieszka Patejuk, Guilherme Paulino-Passos, Angelika Peljak-Łapińska, Siyao Peng, Cenel-Augusto Perez, Guy Perrier, Daria Petrova, Slav Petrov, Jason Phelan, Jussi Piitulainen, Tommi A Pirinen, Emily Pitler, Barbara Plank, Thierry Poibeau, Larisa Ponomareva, Martin Popel, Lauma Pretkalniņa, Sophie Prévost, Prokopis Prokopidis, Adam Przepiórkowski, Tiina Puolakainen, Sampo Pyysalo, Peng Qi, Andriela Rääbis, Alexandre Rademaker, Loganathan Ramasamy, Taraka Rama, Carlos Ramisch, Vinit Ravishankar, Livy Real, Siva Reddy, Georg Rehm, Ivan Riabov, Michael Rießler, Erika Rimkutė, Larissa Rinaldi, Laura Rituma, Luisa Rocha, Mykhailo Romanenko, Rudolf Rosa, Davide Rovati, Valentin Roșca, Olga Rudina, Jack Rueter, Shoval Sadde, Benoît Sagot, Shadi Saleh, Alessio Salomoni, Tanja Samardžić, Stephanie Samson, Manuela Sanguinetti, Dage Särg, Baiba Saulīte, Yanin Sawanakunanon, Nathan Schneider, Sebastian Schuster, Djamé Seddah, Wolfgang Seeker, Mojgan Seraji, Mo Shen, Atsuko Shimada, Hiroyuki Shirasu, Muh Shohibussirri, Dmitry Sichinava, Aline Silveira, Natalia Silveira, Maria Simi, Radu Simionescu, Katalin Simkó, Mária Šimková, Kiril Simov, Aaron Smith, Isabela Soares-Bastos, Carolyn Spadine, Antonio Stella, Milan Straka, Jana Strnadová, Alane Suhr, Umut Sulubacak, Shingo Suzuki, Zsolt Szántó, Dima Taji, Yuta Takahashi, Fabio Tamburini, Takaaki Tanaka, Isabelle Tellier, Guillaume Thomas, Liisi Torga, Trond Trosterud, Anna Trukhina, Reut Tsarfaty, Francis Tyers, Sumire Uematsu, Zdeňka Urešová, Larraitz Uria, Hans Uszkoreit, Andrius Utka, Sowmya Vajjala, Daniel van Niekerk, Gertjan van Noord, Viktor Varga, Eric Villemonte de la Clergerie, Veronika Vincze, Lars Wallin, Abigail Walsh, Jing Xian Wang, Jonathan North Washington, Maximilan Wendt, Seyi Williams, Mats Wirén, Christian Wittern, Tsegay Woldemariam, Tak-sum Wong, Alina Wróblewska, Mary Yako, Naoki Yamazaki, Chunxiao Yan, Koichi Yasuoka, Marat M. Yavrumyan, Zhuoran Yu, Zdeněk Žabokrtský, Amir Zeldes, Manying Zhang, and Hanzhi Zhu. 2019. Universal dependencies 2.5. LINDAT/CLARIN digital library at the Institute of Formal and Applied Linguistics (ÚFAL), Faculty of Mathematics and Physics, Charles University.

Р. М. Баталова. 1975. Коми-пермяцкая диалектология. М.: Наука.

Р. М. Баталова. 1982. Ареальные исследования по восточным финно-угорским языкам: Коми языки. М.: Наука.

Л. М. Безносикова, Е. А. Айбабина, and Р. И. Коснырева. 2000. Коми-роч кывчукӧр. Сыктывкар: Коми небӧг лэдзанін.

В. И. Лыткин. 1962. Коми-пермяцкий язык: учебник для высших учебных заведений. Кудымкар: Коми пермяцкое книжное издательство.

Г. Некрасова. 2000. Эмакыв. In Г. В. Федюнёва, editor, Ӧнія коми кыв, морфология. Россияса наукаяс академия, Коми наука шӧрин, Сыктывкар.

А. Л. Паршакова. 2003. Коми-язьвинский букварь. Учебное издание. Пермь: Пермское книжное издательство.

Л. Г. Пономарева. 2016. Ойвывся коми-пермяккезлӧн сёрни. М.: Быдкодь Отирлӧн кыввез.

Йӧлгинь Цыпанов. 2009. Перым кывъяслöн талунъя серпас. Suomalais-Ugrilaisen Seuran Toimituksia = Mémoires de la Société Finno-Ougrienne, 258:191–206.
