Open-Source Morphology for Endangered Mordvinic Languages

Jack Rueter, Dept. of Digital Humanities, University of Helsinki
Mika Hämäläinen, Dept. of Digital Humanities, University of Helsinki and Rootroo Ltd
Niko Partanen, Dept. of Finnish, Finno-Ugrian and Scandinavian Studies, University of Helsinki

Abstract

This document describes the shared development of finite-state descriptions of two closely related but endangered minority languages, Erzya and Moksha. It touches upon the morpholexical unity and diversity of the two languages and how this provides a motivation for shared open-source FST development. We describe how we have designed the transducers so that they can benefit from existing open-source infrastructures and are as reusable as possible.

1 Introduction

There are over 5000 languages spoken worldwide, and a vast majority of them are endangered (see Moseley 2010). The Mordvinic languages Erzya and Moksha are no exception. One of the first NLP solutions typically developed along with lexical resources for any low-resourced language is a morphological analyzer (cf. Zueva et al. 2020; Tyers et al. 2019; Lovick et al. 2018).

In this paper, we describe the development of an open-source FST (finite-state transducer) based morphological analyzer, lemmatizer and generator for Erzya and Moksha. We highlight the importance of certain design decisions to ensure the compatibility of our transducers with existing systems. In addition, we describe how the transducers can be edited in a system that creates an abstraction layer behind a graphical user interface for FST development.

The unity and diversity of the Mordvin literary languages of today, Erzya and Moksha, has been a subject of research for over two hundred years. The first grammars were published in the 1830s: Moksha in 1838 (Ornatov, 1838) and Erzya in 1838–1839 (Gabelentz, 1839). The subsequent 180 years brought scholars for fieldwork, grammars, dictionaries, and the popularization of the languages. 2002 saw the publication of the first monolingual dictionary of Erzya (Abramov, 2002), and the manuscript was proclaimed open by the author for future development. The Mordvin languages have continued to receive a fair share of linguistic research interest in recent years (Luutonen, 2014; Hamari and Aasmäe, 2015; Kashkin and Nikiforova, 2015; Grünthal, 2016).

After the release of the first finite-state transducer for the closely related Komi-Zyrian (Rueter, 2000), it was obvious that similar work should be done for Erzya Mordvin. Fortunately, over the past decade there has been an increasing number of publications on Erzya, relating to its morphology (Rueter, 2010), its OCR tools (Silfverberg and Rueter, 2015) and universal dependencies (Rueter and Tyers, 2018).

This document discusses open-source morphology development, which has greatly benefited from open-source projects, most notably achievements attributed to the GiellaLT infrastructure (Moshagen et al., 2014), i.e. Giellatekno & Divvun at the Norwegian Arctic University in Tromsø, Norway.

It is also very important that people new to language documentation be given opportunities to develop their understanding of languages through participation in projects. Here we must mention an important span of time, 1988–1997, during which the first author did word processing for the 2073-page 'Dictionary of Mordvin Dialects' by Heikki Paasonen.

The transducers are available on GitHub for Erzya¹ and Moksha². The nightly builds are available through a Python library called UralicNLP³ (Hämäläinen, 2019).

¹ https://github.com/giellalt/lang-myv
² https://github.com/giellalt/lang-mdf
³ https://github.com/mikahama/uralicNLP
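An analyzer and lemmatizer of the kind described above return plain strings, so code that consumes them usually has to split each analysis into a lemma and its tags. The following is a minimal Python sketch of that step, not part of the transducers or of UralicNLP. The names `Reading`, `parse_reading` and `unique_lemmas` are hypothetical, the two analyses of вирев are copied from Table 1, and the sketch assumes that `+` never occurs inside a lemma.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Reading(NamedTuple):
    """One morphological reading: a lemma plus its ordered tags."""
    lemma: str
    tags: Tuple[str, ...]


def parse_reading(analysis: str) -> Reading:
    # Assumption: the lemma is everything before the first "+",
    # and every later "+"-separated field is a tag.
    lemma, *tags = analysis.split("+")
    return Reading(lemma, tuple(tags))


def unique_lemmas(analyses: Iterable[str]) -> List[str]:
    # Lemmatization as a by-product of analysis: keep each lemma once,
    # in the order in which the analyzer returned it.
    return list(dict.fromkeys(parse_reading(a).lemma for a in analyses))


# The two analyses given for the input "вирев" in Table 1.
TABLE_1_ANALYSES = ["вирев+A+Sg+Nom+Indef", "вирь+N+SP+Lat+Indef"]

if __name__ == "__main__":
    for analysis in TABLE_1_ANALYSES:
        print(parse_reading(analysis))
    print(unique_lemmas(TABLE_1_ANALYSES))  # -> ['вирев', 'вирь']
```

In practice the strings would come from the analyzer itself rather than from a hard-coded list; the parsing step stays the same as long as the output keeps this lemma-plus-tags shape.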

Proceedings of Second Workshop for NLP Open Source Software (NLP-OSS), pages 94–100, Virtual Conference, November 19, 2020. © 2020 Association for Computational Linguistics

2 Designing for a Reusable API

It is not uncommon that similar tools and methods are developed by different research groups in different parts of the world for language documentation purposes. To name a few, there are projects developing similar language documentation systems for African languages (Jones and Muftic, 2020), Indonesian languages (Nasution et al., 2018) and Yupik (Hunt et al., 2019), while at the same time it has been shown that digital humanities projects that are seemingly different still face the very same technical problems (Mäkelä et al., 2020). Inarguably, any work conducted with endangered languages to document them and better serve the small community of speakers has a lot of value. However, the fact that the wheel gets reinvented over and over again leads to fragmentation in the resources available for smaller languages and makes their use more difficult, as each individual tool exposes an API of its own.

Rule-based morphology can be implemented in many different systems. Popular tools include XFST (Karttunen et al., 1997), Foma (Hulden, 2009) and OpenFST (Allauzen et al., 2007). However, we use HFST (Helsinki Finite-State Technology) (Lindén et al., 2009) because it is the system used in GiellaLT (Moshagen et al., 2014).

GiellaLT provides our transducers with a list of quality attributes. Its infrastructure consists of work done for multiple endangered languages in such a way that the morphological resources can be used in a multitude of different contexts such as disambiguation (Trosterud, 2004; Ens et al., 2019), dependency parsing (Antonsen et al., 2010), online dictionaries (Rueter and Hämäläinen, 2019), spell checkers (Wiechetek et al., 2019), online creative writing tools (Hämäläinen, 2018), automated news generation (Alnajjar et al., 2019) and language learning tools (Antonsen and Argese, 2018).

In order to gain the added benefit from the GiellaLT infrastructure, we have to design our transducers so that they are compatible with HFST, follow a certain morphological tagset, and take their input and produce their output in a certain format. These requirements define the API our transducers need to implement. An example of this can be seen in Table 1.

             input                     output
analyzer     вирев                     вирев+A+Sg+Nom+Indef
                                       вирь+N+SP+Lat+Indef
generator    вирев+A+Sg+Nom+Indef      вирев

Table 1: Examples of GiellaLT style input and output

Apertium (Forcada et al., 2011) is another open-source system that uses finite-state transducers for rule-based machine translation. It uses its own transducer format, but fortunately also supports HFST based transducers, and in fact Apertium type transducers can be compiled to HFST form as well (Pirinen and Tyers, 2012). However, the Apertium tagset is different from that of GiellaLT. This is solved by applying filters at the time the FST is compiled to produce a separate, Apertium-compatible transducer automatically.

3 Erzya and Moksha pair

At the moment the Erzya and Moksha infrastructures in GiellaLT can be considered to be in nearly equal standing, despite the fact that work on Erzya was started considerably earlier. This does not mean, however, that the resources for both languages are identical in all measures, although in basic numeric terms their sizes are comparable. There are many situations where additional consistency between the infrastructures for these two languages would be quite desirable and beneficial. We discuss next what these instances could be, and outline some of the major questions that arise while working with this language pair.

The open-source machine translation infrastructure Apertium uses a shallow-transfer strategy. By definition, shallow transfer makes a morphosyntactic analysis of the source language, translates the lemmas from source to target language, generates morphology acquired from the source language in the target language and makes adjustments to the syntax of the target language. This concept of parallel lexica, morphology and syntax would therefore seem most effective in translation between closely related languages. In the case of the Mordvin pair, this means the use of mutual tags for describing mutual phenomena.

Mutual phenomena in Erzya and Moksha, however, can be distinguished as exact matches and fuzzy matches, as it were. While there is no doubt that the two languages share nearly the same categories of case, person, number and definiteness, it must also be noted that the paradigms do not share the same cellular structure (Keresztes 1999; Trosterud 2006; Rueter 2016). When the paradigm structure is diverse, it is suggested that a union of morphological tags be taken as a starting point. When one language distinguishes a category of

number, for instance, and the other does not, there comes a point where number must be determined. And, in order to facilitate a transition from the absence of the category of number to its presence and vice versa, the rudiments of tagging this category must be put in place, e.g. for the category of number we use SG 'singular', PL 'plural' and SP 'singular or plural'.

Diversity in the Erzya-Moksha language pair can be found in the core ranges of the categories definiteness, person and number. Both languages have an indefinite or basic declension, a definite or determiner declension and a possessive declension. For both languages, it can be stated that the indefinite declension distinguishes the category of number in the nominative alone, whereas number is always specified in the definite declension paradigms. The possessive declension, however, exhibits a salient rift in unity. Moksha has a symmetric paradigm in the core cases, nominative, genitive and dative (see Table 2), with distinct word forms for each case-possessor combination slot and an additional distinction for number of possessa when the possessor is singular. Erzya, in contrast, makes virtually no consistent distinction for case or number in the nominative and genitive, with one exception: the specific nominative singular reading of the third person singular +N+SG+NOM+PXSG3, and optionally in the first and second person singular of modern written literature (see Table 3).

                             +Nom        +Gen          +Dat
Sg+PxSg1 (my son)            цёразе      цёразень      цёразти
Pl+PxSg1 (my sons)           цёране      цёранень      цёраненди
Sg+PxSg2 (your son)          цёраце      цёрацень      цёрацти
Pl+PxSg2 (your sons)         цёратне     цёратнень     цёратненди
Sg+PxSg3 (his/her son)       цёрац       цёранц        цёранцты
Pl+PxSg3 (his/her sons)      цёранза     цёранзон      цёранзонды
SP+PxPl1 (our son/sons)      цёраньке    цёраньконь    цёраньконди
SP+PxPl2 (your son/sons)     цёранте     цёрантень     цёрантенди
SP+PxPl3 (their son/sons)    цёрасна     цёраснон      цёраснонды

Table 2: Symmetric possessive of Moksha core cases

                             +Nom             +Gen              +Dat
Sg+PxSg1 (my son)            цёрам            цёрам ∼ цёрань    цёрам туртов ∼ цёрань туртов ∼ цёранень
Pl+PxSg1 (my sons)           цёран ∼ цёрам    цёран ∼ цёрам     цёран туртов ∼ цёрам туртов
Sg+PxSg2 (your son)          цёрат            цёрать ∼ цёрат    цёрать туртов ∼ цёратень
Pl+PxSg2 (your sons)         цёрат            цёрат             цёрат туртов
Sg+PxSg3 (his/her son)       цёразо           цёранзо           цёранстэнь ∼ цёранзо туртов
Pl+PxSg3 (his/her sons)      цёранзо          цёранзо           цёранстэнь ∼ цёранзо туртов
SP+PxPl1 (our son/sons)      цёранок          цёранок           цёранк туртов
SP+PxPl2 (your son/sons)     цёранк           цёранк            цёранк туртов
SP+PxPl3 (their son/sons)    цёраст           цёраст            цёранстэнь ∼ цёраст туртов

Table 3: Asymmetric possessive declension of Erzya core cases

Additional diversity is found in the subject-object paradigm, where portmanteau morphology enables the specification of first, second and third person subject and object provided both arguments are singular. When one or both of the arguments is not specifically singular, however, one form may serve to indicate more than one set of arguments, e.g. Erzya has a default non-past form -samiź which simply indicates a first person object with a second or third person subject when it is not true that both subject and object are specified, singular entities (see Table 4).

            Obj+Sg1    Obj+Pl1
Subj+Sg2    NA         -samiź
Subj+Pl2    -samiź     -samiź
Subj+Sg3    NA         -samiź
Subj+Pl3    -samiź     -samiź

Table 4: Erzya default first person object

Moksha, however, makes a further semantic split with regard to the category of person in the subject, namely, the form -samast’´ is used to indicate a second person subject, and -samaź is used to indicate a third or indefinite person subject (see Table 5).

            Obj+Sg1      Obj+Pl1
Subj+Sg2    NA           -samast’´
Subj+Pl2    -samast’´    -samast’´
Subj+Sg3    NA           -samaź
Subj+Pl3    -samaź       -samaź

Table 5: Moksha default first person object

The differences outlined here are not the same as, but comparable to, those found in a recent study that evaluated the morphological differences between Komi-Zyrian and Komi-Permyak within the context of FST and development (Rueter et al., 2020). With closely related languages any kind of resource reuse or transfer is rarely trivial, but through careful evaluation of linguistic features and differences we show that these goals are certainly possible and realistic.

4 Current State and Ensuring Maintainability

At the time of writing, the transducers' lemmas, stems and glosses have been acquired through several channels. Statistics on the coverage of the Erzya FST can be seen in Table 6. The same statistics for

Moksha are shown in Table 7.

word class          total     glossed    inflections    derivations
common nouns        24723     10754      450            8
proper nouns        50566     146        (450)          8
adjectives          12938     446        (450)          1
verbs               16133     4123       356            20
lemma:stem pairs    106293    15908                     36

Table 6: Statistics for Erzya transducers

word class          total     glossed    inflections    derivations
common nouns        12851     9056       426            6
proper nouns        50070     267        (426)          6
adjectives          12043     4083       (426)          1
verbs               13449     11577      337            10
lemma:stem pairs    92716     26572                     23

Table 7: Statistics for Moksha transducers

Although the largest bilingual dictionaries for Erzya and Moksha provide declension information, classification of declension for other nominals has not been dealt with. In the noun phrase, adjectives are declined only when they are promoted to NP head in ellipsis or in the predicative. Hence, adjectives, by nature, might be declined to virtually the same extent as common nouns, but they are subject to fewer derivations. Proper nouns, although most commonly encountered in the singular, may, in fact, be declined in the plural: place names declined in the plural are written in lower case and their new semantics refer to inhabitants, while person names gain a sense of associative plural.

Derivations include morphologically new forms and orthographic compounds. Most salient is the ever-present diminutive, which has a sense of endearment, diminutive and even partitive measure. Orthographic derivation can be observed in compound nouns, with collective sense, and verbs, expressing simultaneous activities. The challenge of orthographic compounds lies in the hyphenated pair where both elements are inflected for the same grammatical categories, such that copula forms of compound nouns are included in the morphology of both stems, but the additive clitic attaches only to the second.

The transducer development is conducted in an agile fashion with nightly builds available for the end user via an open-source Python library called UralicNLP (Hämäläinen, 2019). The quality of the transducers is ensured by unit testing. We use test scripts written in YAML that contain a list of inputs and accepted outputs. Any changes in the transducers that break the tests will be immediately noticed and acted upon.

The source code of an FST is not the easiest to write for an average linguist or an NLP researcher. In the context of endangered languages, however, one would hope that people working on language documentation with a limited technical background, or even community members, could make an active contribution to the morphological tools. In order to address this need, we have embraced two levels of abstraction for FST development.

The first level of abstraction is the use of an XML-based dictionary. Lexical data is stored in XML form in the GiellaLT infrastructure, and to a great extent the XML files contain similar information to the lexicon of an FST: words (lemmas and stems) and their continuation lexicons that indicate how they are inflected. For this reason, we actually compile the FST lexicon from the XML-based dictionary. The XML dictionaries are, in fact, also used as source files for online dictionaries, and therefore they are an attempt to address the matter of reuse, i.e. transducer construction, online morphological dictionaries, source material for ICALL projects as well as rule-based machine translation.

With a recent effort of a TEI (Text-Encoding Initiative) compatibility layer in the GiellaLT XML dictionaries (Rueter and Hämäläinen, 2019), it was decided that editing the FSTs in the XML dictionaries should be open for a majority of people using modern lexicographic tools. This compatibility layer is important as TEI is an ISO-standard, which means that it will most likely outlive any individual project and remain usable in the future as well.

The second level of abstraction is the open-source online user interfaces Ve'rdd (Alnajjar et al., 2020) and Akusanat (Hämäläinen and Rueter, 2018), which make it possible for anyone to edit XML based dictionaries without any technical background. This is important since direct edits to an XML file might be daunting for a language community member who has no programming background. These two different levels of abstraction ultimately produce XMLs in the GiellaLT format that can be directly used to enhance the transducers.

To provide further comparison, a very similar mechanism to store the lexicon externally from the FST has also been used successfully by Wilbur (2018) with Pite Saami. Also in this case, the same lexicon is used in the FST, in a published dictionary (Wilbur, 2016), and in an online dictionary⁴. This shows that the approach described here is practical,

⁴ https://saami.uni-freiburg.de/psdp/ordbok/

and already used, in different endangered language contexts.

5 Conclusions

In this paper we have presented our work on open-source FSTs for Erzya and Moksha. Leveraging existing open-source platforms has been the core design principle throughout the development. Without GiellaLT support, our transducers would be left out of a plethora of higher level services provided by the infrastructure, such as spell-checking, constraint grammar based syntax, language learning tools etc. This would mean that we would need to build these resources from the ground up.

Another important design principle has been accessibility of the FSTs. Continuous integration makes it possible to commit to the nightly builds of the UralicNLP library, which makes our FSTs usable through a generic multilingual Python API. Accessibility has also been considered from the point of view of development. Using an XML based dictionary to generate the FST lexicon, with compatibility with the ISO-standardized TEI XML, makes it easier for people who are not FST-savvy to contribute to the work. Furthermore, integration with a GUI such as Ve'rdd makes it possible to crowd-source the development in the future by encouraging community involvement of the native speakers.

We have had positive experiences from building on top of these other open-source solutions and would strongly recommend that other researchers working on the morphology of endangered languages investigate the existing open-source platforms before starting to build everything from scratch. The GiellaLT infrastructure could save two to three years of development. At the same time, attention has to be paid to general APIs and interfaces that allow access to these tools at various levels of abstraction.

References

Kuzma Abramov. 2002. Валонь ёвтнема валкс. Mordovskoj knizhnoj izdateljstvasj. The manuscript of this dictionary was compiled by the Erzya national writer Kuz'ma Grigorievich Abramov, 1914-2008, whose activities as an Erzya writer spanned nearly 70 years.
Cyril Allauzen, Michael Riley, Johan Schalkwyk, Wojciech Skut, and Mehryar Mohri. 2007. OpenFst: A general and efficient weighted finite-state transducer library. In Proceedings of the Ninth International Conference on Implementation and Application of Automata, (CIAA 2007), volume 4783 of Lecture Notes in Computer Science, pages 11–23. Springer.
Khalid Alnajjar, Mika Hämäläinen, and Jack Rueter. 2020. On editing dictionaries for Uralic languages in an online environment. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 26–30.
Khalid Alnajjar, Leo Leppänen, and Hannu Toivonen. 2019. No time like the present: methods for generating colourful and factual multilingual news headlines. In Proceedings of the 10th International Conference on Computational Creativity. Association for Computational Creativity.
Lene Antonsen and Chiara Argese. 2018. Using authentic texts for grammar exercises for a minority language. In Proceedings of the 7th workshop on NLP for Computer Assisted Language Learning, pages 1–9.
Lene Antonsen, Trond Trosterud, and Linda Wiechetek. 2010. Reusing grammatical resources for new languages. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC'10).
Jeff Ens, Mika Hämäläinen, Jack Rueter, and Philippe Pasquier. 2019. Morphosyntactic disambiguation in an endangered language setting. In Proceedings of the 22nd Nordic Conference on Computational Linguistics, pages 345–349.
Mikel L Forcada, Mireia Ginestí-Rosell, Jacob Nordfalk, Jim O'Regan, Sergio Ortiz-Rojas, Juan Antonio Pérez-Ortiz, Felipe Sánchez-Martínez, Gema Ramírez-Sánchez, and Francis M Tyers. 2011. Apertium: a free/open-source platform for rule-based machine translation. Machine translation, 25(2):127–144.
Herr Conon von der Gabelentz. 1839. Versuch einer mordwinischen Grammatik. In Zeitschrift für die Kunde des Morgenlandes, II. 2–3., pages 235–284, 383–419. Druck und Verlag der Dieterlichschen Buchhandlung.
Riho Grünthal. 2016. Transitivity in Erzya: Second language speakers in a grammatical focus, Uralica Helsingiensia, pages 291–318. Finno-Ugrian Society, Finland.
Mika Hämäläinen. 2018. Poem Machine - a co-creative NLG web application for poem writing. In The 11th International Conference on Natural Language Generation Proceedings of the Conference. The Association for Computational Linguistics.
Mika Hämäläinen and Jack Rueter. 2018. Advances in synchronized XML-MediaWiki dictionary development in the context of endangered Uralic languages. In Proceedings of the XVIII EURALEX International Congress: Lexicography in Global Contexts 17-21 July 2018, Ljubljana. Ljubljana University Press.

Arja Hamari and Niina Aasmäe. 2015. Negation in Erzya. Negation in Uralic languages, 108:293.
Mans Hulden. 2009. Foma: a finite-state compiler and library. In Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics, pages 29–32. Association for Computational Linguistics.
Benjamin Hunt, Emily Chen, Sylvia L.R. Schreiner, and Lane Schwartz. 2019. Community lexical access for an endangered polysynthetic language: An electronic dictionary for St. Lawrence Island Yupik. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 122–126, Minneapolis, Minnesota. Association for Computational Linguistics.
Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software, 4(37):1345.
Kerry Jones and Sanjin Muftic. 2020. Endangered African languages featured in a digital collection: The case of the ǂKhomani San, Hugh Brody Collection. In Proceedings of the first workshop on Resources for African Indigenous Languages, pages 1–8, Marseille, France. European Language Resources Association (ELRA).
Lauri Karttunen, Tamás Gaál, and André Kempe. 1997. Xerox finite-state tool. Rapport technique, Centre de recherche Xerox de Grenoble.
Egor Kashkin and Sofya Nikiforova. 2015. Verbs of sound in the Moksha language: a typological account. Nyelvtudományi Közlemények, 111:341–362.
László Keresztes. 1999. Development of Mordvin definite conjugation. Suomalais-Ugrilaisen Seuran toimituksia, 233. Suomalais-Ugrilainen Seura, Helsinki.
Krister Lindén, Miikka Silfverberg, and Tommi Pirinen. 2009. HFST tools for morphology – an efficient open-source package for construction of morphological analyzers. In International Workshop on Systems and Frameworks for Computational Morphology, pages 28–47. Springer.
Olga Lovick, Christopher Cox, Miikka Silfverberg, Antti Arppe, and Mans Hulden. 2018. A computational architecture for the morphology of Upper Tanana. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Jorma Luutonen. 2014. Kahden sukupolven ersää – kielenhuoltoa ja muutoksen merkkejä. Memoires de la Societe Finno-Ougrienne, 270:187–201.
Eetu Mäkelä, Krista Lagus, Leo Lahti, Tanja Säily, Mikko Tolonen, Mika Hämäläinen, Samuli Kaislaniemi, and Terttu Nevalainen. 2020. Wrangling with non-standard data. In Proceedings of the Digital Humanities in the Nordic Countries 5th Conference.
Christopher Moseley, editor. 2010. Atlas of the World's Languages in Danger, 3rd edition. UNESCO Publishing. Online version: http://www.unesco.org/languages-atlas/.
Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond Trosterud, and Francis M. Tyers. 2014. Open-source infrastructures for collaborative work on under-resourced languages. The LREC 2014 Workshop "CCURL 2014 - Collaboration and Computing for Under-Resourced Languages in the Linked Open Data Era".
Arbi Haza Nasution, Yohei Murakami, and Toru Ishida. 2018. Designing a collaborative process to create bilingual dictionaries of Indonesian ethnic languages. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan. European Language Resources Association (ELRA).
Pavel Ornatov. 1838. Mordovskaja grammatika / sostavlennaja na narechij mordvy mokshi Pavlom Ornatovym. V Sinodalnoj tip., Moskva.
Tommi A Pirinen and Francis M Tyers. 2012. Compiling Apertium morphological dictionaries with HFST and using them in HFST applications. Language Technology for Normalisation of Less-Resourced Languages, page 25.
Jack Rueter. 2010. Adnominal person in the morphological system of Erzya. Suomalais-ugrilaisen seuran toimituksia. Suomalais-Ugrilainen Seura, Finland.
Jack Rueter and Mika Hämäläinen. 2019. On XML-MediaWiki resources, endangered languages and TEI compatibility, multilingual dictionaries for endangered languages. Rachel Edita O. ROXAS President National University (The Philippines), page 350.
Jack Rueter, Niko Partanen, and Larisa Ponomareva. 2020. On the questions in developing computational infrastructure for Komi-Permyak. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 15–25.
Jack M. Rueter. 2000. Xeljsinkisa universitetsa kyv tujalysj izhkaryn perymsa kyvjas simpozium vylyn lyddjomtor. In V sbornike Permistika 6: Problemy sinxronii i diaxronii permskix jazykov i ix dialektov [Permistika 6: Problems in the synchrony and diachrony of the Permic languages and their dialects], volume 6 of Permistika, pages 154–158.
Jack Michael Rueter. 2016. Towards a systematic characterization of dialect variation in the Erzya-speaking world: Isoglosses and their reflexes attested in and around the Dubyonki Raion, number 10 in Uralica Helsingiensia, pages 109–148. University of Helsinki, Finland.

Jack Michael Rueter and Francis M Tyers. 2018. Towards an open-source universal-dependency treebank for Erzya. In International Workshop for Computational Linguistics of Uralic Languages.
Miikka Silfverberg and Jack Rueter. 2015. Can morphological analyzers improve the quality of optical character recognition? In First International Workshop on Computational Linguistics for Uralic Languages, volume 2 of Septentrio Conference Series, pages 45–56, Norway. Septentrio Academic Publishing.
Trond Trosterud. 2004. Porting morphological analysis and disambiguation to new languages. In SALTMIL Workshop at LREC 2004: First Steps in Language Documentation for Minority Languages, pages 90–92. Citeseer.
Trond Trosterud. 2006. Homonymy in the Uralic Two-Argument Agreement Paradigms. Suomalais-Ugrilaisen Seuran Toimituksia 251. Suomalais-Ugrilainen Seura, Helsinki.
Francis M. Tyers, Jonathan Washington, Darya Kavitskaya, Memduh Gökırmak, Nick Howell, and Remziye Berberova. 2019. A biscriptual morphological transducer for Crimean Tatar. In Proceedings of the 3rd Workshop on the Use of Computational Methods in the Study of Endangered Languages Volume 1 (Papers), pages 74–80, Honolulu. Association for Computational Linguistics.
Linda Wiechetek, Sjur Moshagen, Børre Gaup, and Thomas Omma. 2019. Many shades of grammar checking – launching a constraint grammar tool for North Sámi. In Proceedings of the NoDaLiDa 2019 Workshop on Constraint Grammar - Methods, Tools and Applications.
Joshua Wilbur. 2016. Pitesamisk ordbok: samt stavingsregler. Department of Scandinavian Studies, University of Freiburg.
Joshua Wilbur. 2018. Extracting inflectional class assignment in Pite Saami: Nouns, verbs and those pesky adjectives. In Proceedings of the Fourth International Workshop on Computational Linguistics of Uralic Languages, pages 154–168.
Anna Zueva, Anastasia Kuznetsova, and Francis Tyers. 2020. A finite-state morphological analyser for Evenki. In Proceedings of The 12th Language Resources and Evaluation Conference, pages 2581–2589, Marseille, France. European Language Resources Association.
