Tools for Supporting Language Learning for Sakha

Tools for supporting language learning for Sakha Sardana Ivanova, Anisia Katinskaia, Roman Yangarber University of Helsinki [email protected] Abstract words, has many borrowings from the surrounding This paper presents an overview of linguis- Mongolic and Tungusic languages, numerous loan- tic resources available for the Sakha lan- words from Russian, as well as words of unknown guage, and presents new tools for support- origin. Sakha makes extensive use of post-positions, ing language learning for Sakha. The essen- which indicate syntactic relations and govern the tial resources include a morphological an- grammatical case of nominals, (Forsyth, 1994). alyzer, digital dictionaries, and corpora of In the digital sphere, Sakha can be considered a Sakha texts. We extended an earlier version low-resource language. We report on our project of the morphological analyzer/transducer, to provide learning support for Sakha. Building built on the Apertium finite-state platform. on pre-existing digital resources, we aim to provide The analyzer currently has an adequate level a learning platform for students (including adults) of coverage, between 86% and 89% on two who are interested in strengthening their linguistic Sakha corpora. Based on these resources, competency in Sakha. we implement a language-learning environ- The paper is structured as follows. Section 2 de- ment for Sakha in the Revita computer- scribes distinctive properties of the Sakha language assisted language learning (CALL) plat- and motivates the need for language-learning sup- form. Revita is a freely available online lan- port by reviewing the social environment of the language learning platform for learners beyond guage. Section 3.1 presents an overview of previous the beginner level. We describe the tools for work Sakha; Section 3.2 describes the Revita plat- Sakha currently integrated into the Revita form for language learning. Section 4 describes the platform. To our knowledge, at present this instruments we integrate to support language learn- is the first large-scale project undertaken to ing for Sakha. In Section 5 we discuss initial results support intermediate-advanced learners of obtained with the tools. Sections 6 concludes with a minority Siberian language. pointers for future work. 1 Introduction 2 Sakha language The Sakha language, also known by its exonym Sakha is the national language of the Sakha people, Yakut, is the language of an ethnic community, who which, along with Russian, is one of the official lan- mainly inhabit the Republic of Sakha in the Far guages of the Republic of Sakha (Yakutia), (Yart- East of Siberia, Russian Federation. According to seva, 1990). The Sakha language differs signifi- the 2010 census, Sakha is the native language of cantly from other Turkic languages by the presence 450,140 people, and is considered vulnerable due of a layer of vocabulary of unclear (possibly Paleo- to its limited usage. Children do not use Sakha in Asiatic) origin, (Kharitonov, 1987). There are also a all aspects of their life; they speak Sakha at home large number of words of Mongolic origin related to with family, but do not use it in school and socially. ancient borrowings, as well as late borrowings from Sakha belongs to the Northern group of the the Russian language, (Tenishev, 1997). Siberian branch of the Turkic language family, and is agglutinative, as are all Turkic languages, (Ubry- 2.1 Distinctive features atova, 1982). It has complex, four-way vowel har- Vowels in Sakha follow complex vowel harmony mony, and a basic Subject-Object-Verb word or- rules. The features of the vowels within a word must der. The lexicon of Sakha consists of native Turkic agree in a strictly defined fashion. First, palatal- velar harmony of vowels is observed in Sakha • remote-past: strictly sequentially and does not admit exceptions. “үлэлээбитим” [ülel¯ebitim] If the first syllable contains a front vowel, then the “I worked (long ago)”; vowels in all subsequent syllables in the word must • past perfect: be front. Otherwise, if the first syllable contains a “үлэлээбиппин” [ülel¯ebippin] back vowel, then the vowels in all subsequent syl- “In fact, I worked”; lables must be back. Second, labial vowel harmony • episodic past: requires that the sequence of vowels agree according “үлэлээбиттээхпин” [ülel¯ebitt¯eXpin] to the degree of roundedness within adjacent sylla- “I used to work on occasion”; bles, (Sleptsov, 2018). For example: • past imperfect: • back+unrounded: “аҕа / аҕалардыын” “үлэлиирим” [ülel¯irim] [aGa / aGalard¯In] “I worked in the past for some time”; “father/with fathers” • plusquamperfect: • back+rounded: “оҕо / оҕолордуун” “үлэлээбит этим” [ülel¯ebit etim] [oGo / oGolord¯un] “I had worked prior to that”; “child / with children” • episodic plusquamperfect: • front+unrounded: “эбэ / эбэлэрдиин” “үлэлээбиттээх этим” [ülel¯ebitt¯eXetim] [ebe / ebelerd¯in] “Long ago, I used to work”. “grandmother / with grandmothers” The total number of tense forms exceeds 20. • front+rounded: “бөрө / бөрөлөрдүүн” One of the particularities of nouns is when paired [börö/ börölördün¯ ] nouns are marked with possessiveness, both compo- “wolf / with wolves” nents of the compound noun change in parallel, as the word is inflected: Thus, the vowels in the suffixes “-лар-” [lar], indicating the plural, and “-дыын” [d¯In], indicating • “баай-дуол” [b¯ajduol] “wealth”: comitative case, undergo 4-way mutation according “баайа-дуола” [b¯aja duola] to vowel harmony. (3rd person possessive, nominative case), In Sakha, the verb is the central part of “баайын-дуолун” [b¯ajIn duolun] speech, (Dyachkovsky et al., 2018). Some verbs can (3rd person singular possessive, accusative) have multiple affixes (as in most Turkic languages), • “сурук-бичик” [suruk biˇcik] “writing”: which can correspond to an entire clause or sentence “сурукта-бичиктэ” [surukta biˇcikte] in other languages, such as Russian. Sakha has no (partitive case), infinitive form for verbs, therefore a predicate that “суругу-бичиги” [surugu biˇcigi] (in other languages) would include an infinitive is (accusative case) conveyed by various indirect means, for example: 2.2 Socio-linguistic environment • “суруйан бүтэрдэ”: [surujan büterde] According to (Vasilieva et al., 2013), since 1990, “he finished writing” the percentage of ethnic Sakha has grown, reach- (literally: “he wrote, finished”); ing 45% of the total population in the Republic of • “сатаан ыллыыр”: [sat¯anIll¯Ir] Sakha. Ethnic Sakha together with other indigenous “he can sing” peoples of the North Siberia and the Far East com- (literally: “he knows how, sings”); prise over 50% of the total population. • “бобуоххун син”: [bobuoXXun sin] Vasilieva et al. (2013) has conducted surveys, “you can forbid” which show a direct dependence of the level of lin- (literally: “you can, let’s forbid”). guistic proficiency on the language of instruction at Sakha is characterized by an exceptional variety school. A fluent level of proficiency is achieved by: of verbal tenses. In particular, according to (Kork- • respondents who had schooling in the Sakha ina, 1970), 8 past forms are distinguished: language (34.5%) • proximal-past: • respondents who had studied in schools, where “үлэлээтим” [ülel¯etim] subjects were taught in Russian and partly in “I worked (recently)”; Sakha (27.4%). Only 17.9% of respondents who had studied in the language in a number of ways, and several Russian are fluent in Sakha. Respondents who projects are being undertaken to support Sakha. We speak Sakha poorly, or do not speak at all, graduated briefly mention some of them here. from Russian-speaking schools. Thus, as expected, The digital bilingual dictionary SakhaTyla.Ru1 linguistic skills and abilities in Sakha are poorer for currently offers over 20,000 items from Sakha to those who had studied in Russian. Russian, over 35,000 items from Russian to Sakha, In work life, the Russian language is dominant about 2,000 items from Sakha to English, and about for all age groups. In the two youngest age groups 1,000 items from English to Sakha. In addition to (16–25 and 26–35 years old), the use of Russian is translations, this dictionary also contains examples growing, approaching 50%. This is due to the re- of usage, including idiomatic usage, for every item, quirements of formal communication, terminolog- which constitutes a base of lexical data, and can ical dependence, ethnically mixed composition of be highly useful for language learning and teaching. professional teams. On the other hand, after the The base of examples from this dictionary is cur- completion of active professional life, the return to rently not utilized in our learning platform. an increased usage of the original ethnic language is Leontiev (2015) has compiled a newspaper cor- common, (Vasilieva et al., 2013). pus of Sakha containing over 12 million tokens. In Yakutsk—the capital and the largest city of The Sakha Wikipedia contains over 12,000 articles, the Sakha Republic—one in three Sakha children which makes up a corpus of over 2 million tokens.2 lack the opportunity to study in their native lan- A Sakha course on the educational platform guage. This is a violation of the right to study Memrise offers a vocabulary of about 3100 words.3 in one’s native language. The number of schools Audio materials: Common Voice is a platform which offer teaching in Sakha in Yakutsk in 2002– for crowdsourcing open-source audio datasets.4 At 2003 was 16, and dropped to 15 by 2003–2004. present, it offers just under 2.5 hours validated The number of schools where Sakha is studied as voice recordings in Sakha. By comparison, English a subject decreased from 22 (in 2002–2003) to 16 has almost 850 hours of audio content on the plat- (in 2003–2004). The number of Sakha language form, and Russian has 50 hours. learners decreased from 6,377 to 2,902, (Vasilieva In summary, few linguistic resources exists for et al., 2013). According to the statistical report Sakha.

Tools for Supporting Language Learning for Sakha

Forest Fires and Climate in Alaska and Sakha Forest Fires Near Yakutsk

SITTING “UNDER the MOUTH”: DECLINE and REVITALIZATION in the SAKHA EPIC TRADITION OLONKHO by ROBIN GAIL HARRIS (Under the D

The Russian Constitution and Foriegn Policy

A Manual on the Turanians and Pan-Turanianism

Declension System of the Turkic Languages: Historical Development of Case Endings Gulgaysha S

The Imposition of Translated Equivalents to Avoid T

Orthoepical Potential of Speaking in the Yakut Language

Fur Animal Hunting of the Indigenous People in the Russian Far East: History, Technology, and Economic Effects

Loanwords in Sakha (Yakut), a Turkic Language of Siberia Brigitte Pakendorf, Innokentij Novgorodov

Yakutia) “…The Republic of Sakha (Yakutia) Is the Largest Region in the Russian Federation and One of the Richest in Natural Resources

Amur Oblast TYNDINSKY 361,900 Sq

Problem of Possessive Constructions in the Yakut Language