Speech Recognition for Endangered and Extinct Samoyedic Languages
Total Page:16
File Type:pdf, Size:1020Kb
Speech Recognition for Endangered and Extinct Samoyedic languages Niko Partanen Mika Hämäläinen Tiina Klooster University of Helsinki University of Helsinki Luua Forestry School Finland and Rootroo Ltd Estonia [email protected] Finland [email protected] [email protected] Abstract are extinct. Even extinct Samoyedic languages have, however, been documented to various degrees. Our study presents a series of experiments on Documentation of Samoyedic languages has speech recognition with endangered and ex- reached a mature stage in last decades. Three lan- tinct Samoyedic languages, spoken in North- guages in this group have a recent monograph long ern and Southern Siberia. To best of our grammars in English. These are Tundra Nenets knowledge, this is the first time a functional ASR system is built for an extinct language. (Nikolaeva, 2014), Forest Enets (Siegl, 2013) and We achieve with Kamas language a Label Er- Nganasan (Wagner-Nagy, 2018). Similarly re- ror Rate of 15%, and conclude through care- sources on these languages have been steadily be- ful error analysis that this quality is already coming available for researchers, often in connec- very useful as a starting point for refined hu- tion with major language documentation projects, man transcriptions. Our results with related such as ‘The word stock of the Selkup language as Nganasan language are more modest, with best the main source of cultural and historical informa- model having the error rate of 33%. We show, however, through experiments where Kamas tion about a moribund endangered native ethnicum 1 2 training data is enlarged incrementally, that of Western Siberia’ , ‘Enets and Forest Nenets’ , 3 Nganasan results are in line with what is ex- ‘Corpus of Nganasan Folklore Texts’ , ‘Documen- pected under low-resource circumstances of tation of Enets: digitization and analysis of legacy the language. Based on this, we provide rec- field materials and fieldwork with last speakers’4, ommendations for scenarios in which further ‘Tundra Nenets Texts’5, ‘Comprehensive Documen- language documentation or archive process- tation and Analysis of Two Endangered Siberian ing activities could benefit from modern ASR Languages: Eastern Khanty and Southern Selkup’6, technology. All training data and processing scripts haven been published on Zenodo with ‘Selkup Language Corpus (SLC)’ (Budzisch et al., clear licences to ensure further work in this im- 2019), ‘Nganasan Spoken Language Corpus’ (Bryk- portant topic. ina et al., 2018), ‘INEL Selkup Corpus’ (Brykina et al., 2020), ‘INEL Kamas corpus’ (Gusev et al., 2019). Also Online Dictionary for Uralic Languages 1 Introduction (Rueter and Hämäläinen, 2017) includes Tundra Samoyedic languages are spoken in the Western Nenets lexical material. Siberia, and Tundra Nenets also extends far to the 1https://cordis.europa.eu/project/id/ European part of Northern Russia. These languages INTAS2005-1000006-8411 2 belong to the Uralic language family. All currently https://dobes.mpi.nl/projects/nenets/ 3https://iling-ran.ru/gusev/Nganasan/texts/ spoken Samoyedic languages are endangered (see 4https://elar.soas.ac.uk/Collection/MPI950079 Moseley (2010). Tundra Nenets is the largest lan- 5https://elar.soas.ac.uk/Collection/MPI120925 guage in the family, while some, such as Kamas, 6https://elar.soas.ac.uk/Collection/MPI43298 Our study here focuses on two of these languages: age (Partanen and Rießler, 2018). This responds Nganasan and Kamas. Our topic is ASR, but we an- well to OCR challenges identified earlier for these chor this work into a wider context of computational languages by Partanen (2017). The vast majority resources and language documentation. The work of these languages have virtually no language tech- has two goals: to examine feasibility of developing nology at the moment, but as there are increasingly an ASR system for an extinct language, in the case larger and larger corpora, the possibilities for future of Kamas, and to investigate the usability of such a work are many. system in a real on-going endangered language doc- One challenge in working with these endangered umentation scenario that is presented by Nganasan. languages is that very few researchers are able to These scenarios are not wide apart from one another. transcribe them confidently and accurately. In the Worldwide extinction of linguistic diversity has been past few years, however, speech recognition in en- recognized for the last 30 years (Krauss, 1992), and dangered language context has seen significant im- many languages are in a very endangered situation. provements, especially in scenarios where there is The models trained in this paper have been re- only one single speaker. Adams et. al. (2018) report leased and made openly accessible on Zenodo with a a reasonable accuracy under these circumstances al- permanent DOI7. As both corpora used in our study ready with just a few hours of transcribed data, with are distributed with restrictive non-commercial li- rapid increase in accuracy when there is more train- cense (CC-BY-NC-SA), we have also published our ing data. They also present a comparison of models training and testing materials with the same license. trained on different amounts of training data using We think open practices such as these will be gaining Na and Chatino data, which also inspired our own importance in the future, and we want to contribute comparative experiments. to this development as well by our example (Garellek We have also seen very recently large improve- et al., 2020). ments in such systems on related Uralic languages, for example Zyrian Komi (Hjortnaes et al., 2020b; 2 Context and Related Work Hjortnaes et al., 2020a). We have also seen experi- Despite the wide documentation activities, we have ments where ASR is being integrated to the language not witnessed large improvements in language tech- documentation workflows, for example, in Papuan nology and computational linguistics on these lan- context (Zahrer et al., 2020). Most widely applied guages. Thereby one of the goals in this paper speech recognition systems have been Persephone is to encourage further work on these languages, (Adams et al., 2018), Elpis (Foley et al., 2018) and also through our published new datasets. There is DeepSpeech (Hannun et al., 2014). In this paper, a rule based morphological analyser for Nganasan we present and discuss several experiments we have (Endrédy et al., 2010), which, however, appears to done using Persephone system. be available only through a web interface, and is not open access. 3 Languages and Data What it comes to other Samoyedic languages, a rule based Tundra Nenets morphological anal- Nganasan is an endangered Samoyedic language yser exists in the GiellaLT infrastructure (Moshagen spoken by Nganasans, a small ethnic group in et al., 2014)8 with availability through UralicNLP Taimyr Peninsula, Northern Siberia (Janhunen and (Hämäläinen, 2019). There are also early Selkup9 Gruzdeva, 2020). According to official statis- and Nganasan10 analysers in the same infrastruc- tics there are 470 Nganasans, from who approxi- ture. Also OCR models have been developed to tar- mately 125 speak the Nganasan language (Wagner- get early writing systems on these languages (Parta- Nagy, 2018, 3,17). Despite languages endanger- nen and Rießler, 2019b), with associated data pack- ment, plenty of documentation work has been con- ducted (Leisio, 2006; Wagner-Nagy, 2014; Ka- 7https://zenodo.org/record/4029494 8https://github.com/giellalt/lang-yrk heinen, 2020). Largest available Nganasan corpus 9https://github.com/giellalt/lang-sel was published in 2018 (Brykina et al., 2018), and it 10https://github.com/giellalt/lang-nio was used in our study. Kamas is another Samoyedic language, represent- ful and realistic scenario. ing the southern group of this branch of Uralic lan- As Nganasan corpus was significantly larger than guages. Kamas was spoken in the slopes of Sayan Kamas, also more preprocessing was needed, proba- mountains in the Central Siberia. It is believed that bly reflecting the longer time frame where it has been by the 19th century Kamas tribe consisted of only created. We excluded all segments that were shorted 130 people (Matveev, 1964). Kamas were forced than 400 milliseconds, and removed all empty an- to abandon their nomadic lifestyle in the beginning notations or annotations that contained only punctu- of 20th century, which, in connection with large so- ation characters. There were several invisible Uni- cietal changes, increased the contact with Russian code space characters that were removed. Also all speakers and led to a cultural assimilation (Klooster, annotations that contained number written as number 2015, 9). The last Kamas speaker was Klavdiya Plot- characters were excluded. We also removed from nikova, who was born in 1895 in a small village of Nganasan corpus all instances of utterances that con- Abalakovo in Central Siberia. She worked with sev- tained Cyrillic characters. eral linguists since 1960s, and this results in a sizable For both corpora the annotations that contained collection of Kamas recordings that are available in unclear words marked with HIAT conventions various archives. (Ehlich and Rehbein, 1976) were removed. When In 2019, a corpus containing transcribed versions the transcriptions contained annotations for non- of these materials was published (Gusev et al., 2019). verbal expressions, including coughing or laughter, We used Klavdiya Plotnikova’s part of the corpus we chose to remove