Open-Source Morphology for Endangered Mordvinic Languages
Total Page:16
File Type:pdf, Size:1020Kb
Open-Source Morphology for Endangered Mordvinic Languages Jack Rueter Mika Hämäläinen Niko Partanen Dept. of Digital Humanities Dept. of Digital Humanities Dept. of Finnish, University of Helsinki University of Helsinki Finno-Ugrian [email protected] and Rootroo Ltd and Scandinavian Studies [email protected] University of Helsinki [email protected] Abstract languages. 2002 saw the publication of the first monolingual dictionary of Erzya (Abramov, 2002), This document describes shared development and the manuscript was proclaimed open by the of finite-state description of two closely re- author for future development. The Mordvin lan- lated but endangered minority languages, Erzya and Moksha. It touches upon mor- guages have continued to receive a fair share of pholexical unity and diversity of the two lan- linguistic research interest in the recent years (Luu- guages and how this provides a motivation for tonen, 2014; Hamari and Aasmäe, 2015; Kashkin shared open-source FST development. We de- and Nikiforova, 2015; Grünthal, 2016). scribe how we have designed the transducers After the release first finite-state transducer for so that they can benefit from existing open- the closely related Komi-Zyrian (Rueter, 2000), source infrastructures and are as reusable as possible. it was only obvious that similar work should be done for Erzya Mordvin. Fortunately, over the 1 Introduction past decade there has been an increasing number of publications on Erzya, relating to its morphol- There are over 5000 languages spoken world wide, ogy (Rueter, 2010), its OCR tools (Silfverberg and and a vast majority of them are endangered (see Rueter, 2015) and universal dependencies (Rueter Moseley 2010). The Mordvinic languages Erzya and Tyers, 2018). and Moksha are no exception. One of the first NLP This document discusses open-source morphol- solutions that are typically developed along with ogy development, which has greatly benefited from lexical resources for any low-resourced language open-source projects, most notably achievements is a morphological analyzer (cf Zueva et al. 2020; attributed to the GiellaLT infrastructure (Mosha- Tyers et al. 2019; Lovick et al. 2018). gen et al., 2014), i.e. Giellatekno & Divvun at the In this paper, we describe the development of Norwegian Arctic University in Tromsø, Norway. an open-source FST (finite-state transducer) based morphological analyzer, lemmatizer and generator It is also very important that people new to lan- for Erzya and Moksha. We highlight the impor- guage documentation be given opportunities to de- tance of certain design decisions to ensure the com- velop their understanding of languages through par- patibility of our transducers with existing systems. ticipation in projects. Here we must mention an In addition, we will describe how the transducers important span of time 1988–1997, during which can be edited in a system that creates an abstrac- the first author did word processing for the 2073- tion layer behind a graphical user interface for FST page ’Dictionary of Mordvin Dialects’ by Heikki development. Paasonen. The unity and diversity of the Mordvin liter- The transducers are available on GitHub for ary languages of today, Erzya and Moksha, has Erzya 1 and Moksha 2. The nightly builds are avail- been a subject of research for over two hundred able through a Python library called UralicNLP3 years. The first grammars were published in 1830s (Hämäläinen, 2019). – Moksha in 1838 (Ornatov, 1838) and Erzya in 1838–1839 (Gabelentz, 1839). The subsequent 1https://github.com/giellalt/lang-myv 180 years brought scholars for fieldwork, gram- 2https://github.com/giellalt/lang-mdf mars, dictionaries, and the popularization of the 3https://github.com/mikahama/uralicNLP 94 Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 94–100 Virtual Conference, November 19, 2020. c 2020 Association for Computational Linguistics input output 2 Designing for a Reusable API вирев+A+Sg+Nom+Indef analyzer вирев вирь+N+SP+Lat+Indef It is not uncommon that similar tools and methods generator вирев+A+Sg+Nom+Indef вирев are developed by different research groups in differ- Table 1: Examples of GiellaLT style input and output ent parts of the world for language documentation purposes. To name a few, there are projects de- veloping similar language documentation systems HFST based transducers, and in fact Apertium type for African languages (Jones and Muftic, 2020), transducers can be compiled to HFST form as well Indonesian languages (Nasution et al., 2018) and (Pirinen and Tyers, 2012). However, their tagset Yupik (Hunt et al., 2019), while at the same time, is different from that of GiellaLT. This is solved it has been shown that digital humanities projects by applying filters at the time the FST is compiled that are seemingly different, still face the very same to produce a separate, Apertium-compatible, trans- technical problems (Mäkelä et al., 2020). In ar- ducer automatically. guably, any work conducted with endangered lan- guages to document them and better serve the small 3 Erzya and Moksha language pair community of speakers has a lot of value. However, At the moment Erzya and Moksha infrastructure in the fact that the wheel gets reinvented over and GiellaLT can be considered to be in nearly equal over again leads to fragmentation in the resources standing, despite the fact that work on Erzya was available for smaller languages and makes their use started considerably earlier. This does not mean, more difficult as each individual tool exposes an however, that the resources for both languages are API of their own. identical in all measures, although in basic numeric Rule-based morphology can be implemented in levels their sizes are comparable. There are many many different systems. Popular tools include situations where additional consistency between XFST (Karttunen et al., 1997), Foma (Hulden, the infrastructures for these two languages would 2009) and OpenFST (Allauzen et al., 2007). How- be quite desirable and beneficial. We discuss next ever, we use HFST (Helsinki-Finite State Technol- what these instances could be, and outline some ogy) (Lindén et al., 2009) because it is the system of the major questions while working with this used in GiellaLT (Moshagen et al., 2014). language pair. GiellaLT provides our transducers with a list of The open-source machine translation infrastruc- quality attributes. Their infrastructure consists of ture Apertium uses a shallow-transfer strategy. By work done for multiple endangered languages in definition, shallow transfer makes a morphosyntac- such a way that the morphological resources can tic analysis of the source language, translates the be used in a multitude of different contexts such as lemmas from source to target language, generates disambiguation (Trosterud, 2004; Ens et al., 2019), morphology acquired from the source language in dependency parsing (Antonsen et al., 2010), on- the target language and makes adjustments to the line dictionaries (Rueter and Hämäläinen, 2019), syntax of the target language. This concept of paral- spell checkers (Wiechetek et al., 2019), online cre- lel lexica, morphology and syntax, would therefore, ative writing tools (Hämäläinen, 2018), automated seem most effective in translation between closely news generation (Alnajjar et al., 2019) and lan- related languages. In the case of the Mordvin pair, guage learning tools (Antonsen and Argese, 2018). this means the use of mutual tags for describing In order to gain the added benefit from the Giel- mutual phenomena. laLT infrastructure, we have to design our transduc- Mutual phenomena in Erzya and Moksha, how- ers so that they are compatible with HFST and that ever, can be distinguished as exact matches and they follow a certain morphological tagset and that fuzzy matches, as it were. While there is no doubt they take the input and output in a certain format. that the two languages share nearly the same cat- These requirements define the API our transducers egories of case, person, number and definiteness, need to implement. An example of this can be seen it must also be noted that the paradigms do not in Table1. share the same cellular structure (Keresztes 1999; Apertium (Forcada et al., 2011) is another open- Trosterud 2006; Rueter 2016). When the paradigm source system that uses FST transducers for rule- structure is diverse, it is suggested that a union based machine translation. They use their own of morphological tags be taken as a starting point. transducer format, but fortunately also support When one language distinguishes a category of 95 +Nom +Gen +Dat +Nom +Gen +Dat Sg+PxSg1 (my son) цёразе цёразень цёразти цёрам туртов цёрам цёране цёранень цёраненди Sg+PxSg1 (my son) цёрам ∼ цёрань туртов Sg+PxSg1 (my sons) ∼ цёрань Sg+PxSg2 (your son) цёраце цёрацень цёрацти ∼ цёранень цёратне цёратнень цёратненди цёран цёран цёран туртов Pl+PxSg2 (your sons) Pl+PxSg1 (my sons) Sg+PxSg3 (his/her son) цёрац цёранц цёранцты ∼ цёрам ∼ цёрам цёрам туртов Pl+PxSg3 (his/her son) цёранза цёранзон цёранзонды цёрать цёрать туртов SP+PxPl1 (our son/sons) цёраньке цёраньконь цёраньконди Sg+PxSg2 (your son) цёрат SP+PxPl2 (your son/sons) цёранте цёрантень цёрантенди ∼ цёрат ∼ цёратень SP+PxPl3 (their son/sons) цёрасна цёраснон цёраснонды PxSg2 Pl (your son) цёрат цёрат цёрат туртов цёранстэнь ∼ PxSg3 Sg (his/her son) цёразо цёранзо цёранзо туртов Table 2: Symmetric possessive declension of Moksha цёранстэнь ∼ PxSg3 Pl (his/her sons) цёранзо цёранзо core cases цёранзо туртов SP+PxPl1 (our son/sons) цёранок цёранок цёранк туртов SP+PxPl2 (your son/sons) цёранк цёранк цёранк туртов цёранстэнь ∼ SP+PxPl3 (their son/sons) цёраст цёраст number, for instance, and the other does not, there цёраст туртов comes a point where number must be determined. And, in order to facilitate a transition from the ab- Table 3: Asymmetric possessive declension of Erzya sence of the category of number to its presence core cases and vice versa, the rudiments of tagging this cate- Obj+Sg1 Obj+Pl1 gory must be put in place, e.g.