arXiv:2004.04803v1 [cs.CL] 9 Apr 2020

FST Morphology for the Endangered Skolt Sami Language

Jack Rueter, Mika Hämäläinen
Department of Digital Humanities
University of Helsinki
{jack.rueter, mika.hamalainen}@helsinki.fi

Abstract

We present advances in the development of an FST-based morphological analyzer and generator for Skolt Sami. Like other minority Uralic languages, Skolt Sami exhibits a rich morphology, on the one hand, and there is little gold standard material for it, on the other. This makes NLP approaches for its study difficult without a solid morphological analysis. The language is severely endangered and the work presented in this paper forms a part of a greater whole in its revitalization efforts. Furthermore, we intersperse our description with facilitation and description practices not well documented in the infrastructure. Currently, the analyzer covers over 30,000 Skolt Sami words in 148 inflectional paradigms and over 12 derivational forms.

Keywords: Skolt Sami, endangered languages, morphology

1. Introduction

Skolt Sami is a minority language belonging to the Sami branch of the Uralic language family. With its native speakers at only around 300, it is considered a severely endangered language (Moseley, 2010), which, despite its pluricentric potential, is decidedly focusing on one mutual language (Rueter and Hämäläinen, 2019). In this paper, we present our open-source FST morphology for the language, which is a part of the wider context of its on-going revitalization efforts.

The intricacies of Skolt Sami morphology include quality and quantity variation in the word stem as well as suprasegmental palatalization before subsequent affixes. Like Northern Sami and Estonian, Skolt Sami has consonant quantity and quality variation that surpasses that of Finnish, i.e. Skolt Sami has as many as three lengths in the vowel and consonant quantities in a given word.

The finite-state description of Skolt Sami involves developing strategies for reusability of open-source documentation in other minority languages. In other words, the FST description is designed in such a fashion that it can be applied to other languages as well with minimal modifications. Skolt Sami, like many other minority Uralic languages, attests to a fair degree of regular morphology, i.e., its nouns are marked for the categories of number, possession and numerous case forms with regular diminutive derivation, and its verbs are conjugated for tense, mood and person in addition to undergoing several regular derivations. Morphological descriptions have been developed in the GiellaLT (Sami Language technology) infrastructure at the Norwegian Arctic University in Tromsø, using Helsinki Finite-State Technology (HFST) (Lindén et al., 2013).

Working in the GiellaLT infrastructure, it is possible to apply ready-made solutions to multiple language learning, facilitation and empowerment tasks. Leading into the digital age, there are ongoing implementations, such as keyboards [1] for various platforms, and corpora [2], being expanded to provide developers, researchers and language community members access to language materials directly. The trick is to find new uses and reuses for data sets and technologies as well as to bring development closer to the language community. If development follows the North Sámi lead, any project can reap from the work already done.

Extensive work has already been done on data and tool development in the GiellaLT infrastructure (Moshagen et al., 2013; Moshagen et al., 2014), and previous work also exists for Skolt Sami [3] (Sammallahti and Mosnikoff, 1991; Sammallahti, 2015; Feist, 2015). There are online and click-in-text dictionaries [4] (Rueter, 2017) and spell checkers [5] (Morottaja et al., 2018), which are implemented in OpenOffice, with some of the more prominent languages also supported in MS Word, as well as rule-based language learning (Antonsen et al., 2013; Uibo et al., 2015). For languages with extensive description and documentation, there are syntax checkers (Wiechetek et al., 2019), machine translation (Antonsen et al., 2017) and speech synthesis and recognition (Hjortnaes et al., 2020), just to mention the tip of the iceberg (Rueter, 2014). From a language learner and research point of departure, the development and application of these tools point to well-organized morpho-syntactic and lexical descriptions of the language in focus.

By well-organized descriptions, we mean approaching tasks at hand with applied reusability. Reusability is illustrated in the construction of a morphological analyzer for linguists, which, because it is able to recognize and analyze regular morphological forms, can also serve as a morphological spell checker. In fact, this same analyzer can be reversed and used as a generator, which is useful in providing language learners with fixed, analogous and random tasks in morphology. The same morphological analyzer, when augmented by glosses, can immediately begin to provide online dictionary and click-in-text analyses.

The development of an optimal morphological analyzer and glossing for a language like Skolt Sami requires concise morphological and lexical work, on the one hand, and access to corpora including language learning materials, on the other. Corpora provide access to language in use, and language learning materials help to establish a received understanding of the language. To this end, the morphological analyzer for Skolt Sami has been constructed to analyze and generate a pedagogically enhanced orthography, for indication of short and long diphthongs preceding geminates as well as mid low front vowels, as might be rendered in a pronouncing dictionary. One such example might be seen in the word kuẹʹtt ‘hut’ as opposed to the literal norm kueʹtt, where the dot below the e not only indicates a slightly lowered pronunciation of the vowel but also assists in identifying the paradigm type, kuẹʹtt : kuẹʹđid ‘hut+N+Pl+Acc’ versus kueʹll : kuõʹlid ‘fish+N+Pl+Acc’.

By focusing on the construction of a pedagogically enhanced analyzer-generator, teaching resources can be developed that target randomly generated morphological tasks for the language learner, as in the North Sami learning tool Davvi [6]. In any given language reader, there are texts with words in various forms and an accompanying vocabulary. While vocabulary translation can readily be utilized as a fixed task in language learning, inflectional tasks, especially in morphologically rich languages, can be developed as random exercises. Although the contextual word forms in the reader are quite limited, it is possible to construct randomized morphological exercises where the student is expected to inflect nouns, adjectives and verbs alike in forms that have been taught but not explicitly given for the random words provided in the reader vocabulary. For nouns, for example, the student may select vocabulary from reader A chapters 1–5 with a randomized task for nouns, plural, comitative, third person singular possessive suffix: +N+Pl+Com+PxSg3. Essentially all nouns in the selected vocabulary available for this reading are inadvertently presented to the learner.

[1] http://divvun.no/keyboards/index.html/
[2] http://gtweb.uit.no/korp/
[3] http://oahpa.no/sms/useoahpa/background.eng.html/, read further in this article for subsequent developments in http://oahpa.no/nuorti/
[4] The forerunner https://sanit.oahpa.no/read/, an online dictionary here, and on analogous pages of other dictionaries (e.g., https://saan.oahpa.no/read/), can be dragged to the tool bar of Firefox and Google Chrome
[5] http://divvun.no/korrektur/korrektur.html/

2. Related Work

In the past, multiple methods have been proposed for automatically learning morphology for a given language.

[...]

The scarce quantity of textual data is one limitation, but it is even a greater one given that the language is still being standardized and the users provide a variety of forms and vocabulary when expressing themselves in their native language. This means an even greater variety in morphology that the statistical model should be able to capture from a limited dataset.

In the absence of a reasonably sized descriptive corpus of the language, annotated or not, the most accurate way to model the morphology is by using a rule-based methodology.

FSTs (Finite-State Transducers) have been shown in the past to be an effective way to model the morphology even for languages with an abundance of morphological features (cf. Beesley and Karttunen, 2003). Perhaps one of the largest-scale FSTs to model the morphology of a language is the one developed for Finnish (Pirinen et al., 2017). This tool, Omorfi, serves as the state-of-the-art morphological analyzer for Finnish.

3. The FST Model Development Pipeline

Developing a morphological description of a language presupposes a language-learning and documentary approach. Other people have learned the language and become proficient in it before you, so extract paradigms from grammars, readers and research to build the language model. If you are the first researcher to describe the language, take hints from the language learners, if there are any; they may be still developing their own understanding of the language morpho-syntax, and, at times, they may provide you with informative interpretations of the language.

Idiosyncrasies of a language can, sometimes, be captured through comparison to those of another. When a description of Skolt Sami, Finnish, Estonian, etc. introduces alien phenomena, such as word-stem quality and quantity variation as well as suprasegmental palatalization, it is a good idea to try describing them both separately and in tandem.

Word-stem quality variation affects both consonants and vowels. In consonants, an analogous English example might be illustrated with the f:v variation found in the English words life, lives and loaf, loaves. From a historical perspective, the verb to live will serve as an instance where long
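
The English f:v stem variation mentioned above can be made concrete with a small Python sketch. It is a toy lookup standing in for a finite-state transducer, not the GiellaLT or HFST implementation: the tag names (+N, +Sg, +Pl) follow the notation used in this paper, while the stem table and the function names `generate` and `analyze` are assumptions introduced only for illustration. The sketch also shows the sense in which one description can serve in both directions, with analysis obtained by inverting generation.

```python
# Toy illustration only: a hand-written lookup standing in for an FST.
# FV_STEMS, generate() and analyze() are hypothetical names for this sketch.

# Lemmas whose plural stem shows the f -> v alternation.
FV_STEMS = {"life": "live", "loaf": "loave"}


def generate(analysis):
    """Map an analysis string such as 'life+N+Pl' to a surface form."""
    lemma, *tags = analysis.split("+")
    if tags == ["N", "Sg"]:
        return lemma
    if tags == ["N", "Pl"]:
        # Use the alternating stem when one is listed, else the lemma itself.
        return FV_STEMS.get(lemma, lemma) + "s"
    raise ValueError(f"unsupported analysis: {analysis}")


def analyze(surface, lemmas):
    """Invert generate() by testing every analysis of the known lemmas."""
    candidates = [f"{lemma}+N+{num}" for lemma in lemmas for num in ("Sg", "Pl")]
    return [a for a in candidates if generate(a) == surface]


if __name__ == "__main__":
    print(generate("loaf+N+Pl"))
    print(analyze("lives", ["life", "loaf", "hut"]))
```

A real transducer encodes the alternation as rules over stem classes rather than as a word list, so the table here only marks where such a rule would apply.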

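The randomized inflection task described in the Introduction (vocabulary from selected reader chapters combined with the tag string +N+Pl+Com+PxSg3) might be assembled roughly as follows. This is a minimal sketch under stated assumptions: the vocabulary `READER_A`, its English placeholder words and the function `make_tasks` are hypothetical, since the text gives no word list, and the resulting strings would still have to be passed to a generator transducer to obtain the expected answers.

```python
import random

# Hypothetical reader vocabulary keyed by chapter number. English placeholders
# are used because no Skolt Sami word list is given in the text.
READER_A = {1: ["hut", "fish"], 2: ["house"], 3: ["boat", "lake"]}

# Tag string quoted from the text: noun, plural, comitative, 3rd person
# singular possessive suffix.
TASK_TAGS = "+N+Pl+Com+PxSg3"


def make_tasks(vocab, chapters, tags, n, seed=None):
    """Pick up to n distinct lemmas from the chapters and attach the tags."""
    rng = random.Random(seed)
    pool = [word for chapter in chapters for word in vocab.get(chapter, [])]
    picked = rng.sample(pool, min(n, len(pool)))
    return [lemma + tags for lemma in picked]


tasks = make_tasks(READER_A, range(1, 4), TASK_TAGS, n=3, seed=0)

if __name__ == "__main__":
    for task in tasks:
        print(task)
```

Sampling without replacement keeps each lemma from appearing twice in one exercise, and the optional seed makes a given exercise set reproducible.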