Speech Recognition for Endangered and Extinct Samoyedic Languages

Niko Partanen
University of Helsinki, Finland
[email protected]

Mika Hämäläinen
University of Helsinki and Rootroo Ltd
[email protected]

Tiina Klooster
Luua Forestry School
[email protected]

Abstract

Our study presents a series of experiments on speech recognition with endangered and extinct Samoyedic languages, spoken in Northern and Southern Siberia. To the best of our knowledge, this is the first time a functional ASR system has been built for an extinct language. With the Kamas language we achieve a Label Error Rate of 15%, and conclude through careful error analysis that this quality is already very useful as a starting point for refined human transcriptions. Our results with the related Nganasan language are more modest, with the best model having an error rate of 33%. We show, however, through experiments in which the Kamas training data is enlarged incrementally, that the Nganasan results are in line with what is expected under the low-resource circumstances of the language. Based on this, we provide recommendations for scenarios in which further language documentation or archive processing activities could benefit from modern ASR technology. All training data and processing scripts have been published on Zenodo with clear licences to ensure further work on this important topic.

1 Introduction

Samoyedic languages are spoken in Western Siberia, and Tundra Nenets also extends far into the European part of Northern Russia. These languages belong to the Uralic language family. All currently spoken Samoyedic languages are endangered (see Moseley, 2010). Tundra Nenets is the largest language in the family, while some, such as Kamas, are extinct. Even extinct languages have, however, been documented to various degrees.

Documentation of Samoyedic languages has reached a mature stage in the last decades. Three languages in this group have recent monograph-long grammars in English: Tundra Nenets (Nikolaeva, 2014), Forest Enets (Siegl, 2013) and Nganasan (Wagner-Nagy, 2018). Similarly, resources on these languages have been steadily becoming available for researchers, often in connection with major language documentation projects, such as 'The word stock of the Selkup language as the main source of cultural and historical information about a moribund endangered native ethnicum of Western Siberia' [1], 'Enets and Forest Nenets' [2], 'Corpus of Nganasan Folklore Texts' [3], 'Documentation of Enets: digitization and analysis of legacy field materials and fieldwork with last speakers' [4], 'Tundra Nenets Texts' [5], 'Comprehensive Documentation and Analysis of Two Endangered Siberian Languages: Eastern Khanty and Southern Selkup' [6], 'Selkup Language Corpus (SLC)' (Budzisch et al., 2019), 'Nganasan Spoken Language Corpus' (Brykina et al., 2018), 'INEL Selkup Corpus' (Brykina et al., 2020) and 'INEL Kamas corpus' (Gusev et al., 2019). Also the online dictionary of Rueter and Hämäläinen (2017) includes Tundra Nenets lexical material.

[1] https://cordis.europa.eu/project/id/INTAS2005-1000006-8411
[2] https://dobes.mpi.nl/projects/nenets/
[3] https://iling-ran.ru/gusev/Nganasan/texts/
[4] https://elar.soas.ac.uk/Collection/MPI950079
[5] https://elar.soas.ac.uk/Collection/MPI120925
[6] https://elar.soas.ac.uk/Collection/MPI43298

Our study here focuses on two of these languages: Nganasan and Kamas. Our topic is ASR, but we anchor this work in the wider context of computational resources and language documentation. The work has two goals: to examine the feasibility of developing an ASR system for an extinct language, in the case of Kamas, and to investigate the usability of such a system in a real, ongoing endangered language documentation scenario, presented by Nganasan. These scenarios are not far apart from one another. The worldwide extinction of linguistic diversity has been recognized for the last 30 years (Krauss, 1992), and many languages are in a very endangered situation.

The models trained in this paper have been released and made openly accessible on Zenodo with a permanent DOI [7]. As both corpora used in our study are distributed with a restrictive non-commercial license (CC-BY-NC-SA), we have also published our training and testing materials with the same license. We think open practices such as these will gain importance in the future, and we want to contribute to this development by our example (Garellek et al., 2020).

[7] https://zenodo.org/record/4029494

2 Context and Related Work

Despite the wide documentation activities, we have not witnessed large improvements in language technology and computational linguistics on these languages. Thereby one of the goals of this paper is to encourage further work on these languages, also through our newly published datasets.

There is a rule-based morphological analyser for Nganasan (Endrédy et al., 2010), which, however, appears to be available only through a web interface, and is not open access. When it comes to other Samoyedic languages, a rule-based Tundra Nenets morphological analyser exists in the GiellaLT infrastructure (Moshagen et al., 2014) [8], with availability through UralicNLP (Hämäläinen, 2019). There are also early Selkup [9] and Nganasan [10] analysers in the same infrastructure. OCR models have also been developed to target early writing systems of these languages (Partanen and Rießler, 2019b), with an associated data package (Partanen and Rießler, 2018). This responds well to the OCR challenges identified earlier for these languages by Partanen (2017). The vast majority of these languages have virtually no language technology at the moment, but as there are increasingly large corpora, the possibilities for future work are many.

[8] https://github.com/giellalt/lang-yrk
[9] https://github.com/giellalt/lang-sel
[10] https://github.com/giellalt/lang-nio

One challenge in working with these endangered languages is that very few researchers are able to transcribe them confidently and accurately. In the past few years, however, speech recognition in the endangered language context has seen significant improvements, especially in scenarios where there is only one single speaker. Adams et al. (2018) report reasonable accuracy under these circumstances already with just a few hours of transcribed data, with a rapid increase in accuracy when there is more training data. They also present a comparison of models trained on different amounts of training data using Na and Chatino data, which also inspired our own comparative experiments.

We have also seen very recently large improvements in such systems for related Uralic languages, for example Zyrian Komi (Hjortnaes et al., 2020b; Hjortnaes et al., 2020a). We have also seen experiments where ASR is integrated into language documentation workflows, for example in a Papuan context (Zahrer et al., 2020). The most widely applied speech recognition systems have been Persephone (Adams et al., 2018), Elpis (Foley et al., 2018) and DeepSpeech (Hannun et al., 2014). In this paper, we present and discuss several experiments we have done using the Persephone system.

3 Languages and Data

Nganasan is an endangered Samoyedic language spoken by the Nganasans, a small ethnic group on the Taimyr Peninsula, Northern Siberia (Janhunen and Gruzdeva, 2020). According to official statistics there are 470 Nganasans, of whom approximately 125 speak the Nganasan language (Wagner-Nagy, 2018, 3,17). Despite the language's endangerment, plenty of documentation work has been conducted (Leisio, 2006; Wagner-Nagy, 2014; Kaheinen, 2020). The largest available Nganasan corpus was published in 2018 (Brykina et al., 2018), and it was used in our study.

Kamas is another Samoyedic language, representing the southern group of this branch of the Uralic languages. Kamas was spoken on the slopes of the Sayan mountains in Central Siberia. It is believed that by the 19th century the Kamas tribe consisted of only 130 people (Matveev, 1964). The Kamas were forced to abandon their nomadic lifestyle in the beginning of the 20th century, which, in connection with large societal changes, increased the contact with Russian speakers and led to cultural assimilation (Klooster, 2015, 9). The last Kamas speaker was Klavdiya Plotnikova, who was born in 1895 in the small village of Abalakovo in Central Siberia. She worked with several linguists since the 1960s, and this has resulted in a sizable collection of Kamas recordings that are available in various archives. In 2019, a corpus containing transcribed versions of these materials was published (Gusev et al., 2019).

[Figure 1: Distribution of the Samoyedic languages in the beginning of the 20th century (Timo Rantanen, BEDLAN)]

We used Klavdiya Plotnikova's part of the corpus in our Kamas experiments, as she contributes the vast majority of all the Kamas material that exists. In the Nganasan experiments, we used the data from three prominent speakers in the Nganasan Spoken Language Corpus, who are also mentioned in the Nganasan grammar based largely on the same materials (Wagner-Nagy, 2018, 30).

One of the most important preprocessing steps was to exclude from training all sentences that are longer than 10 seconds. This is a condition set by the Persephone system, and a convention followed also in other investigations (Wisniewski et al., 2020, 30). Similarly, Hjortnaes et al. (2020b) filtered the Zyrian Komi corpus by this limit. This choice leaves open an obvious possibility to improve the current results: as the filtered portion of the corpus is relatively large, either finding a way to include the longest segments in the training process, or splitting them into smaller units, would easily increase the amount of training data.

Preprocessing conventions were very similar for both corpora, although the particularities of the individual datasets were inspected independently. It is customary that work with speech corpora includes an intensive preprocessing step. In the case of Kamas the work was greatly aided by having a specialist of Kamas in our team. In the case of Nganasan we worked primarily with the light shed by the project documentation, which was also a useful and realistic scenario.

As the Nganasan corpus was significantly larger than the Kamas one, more preprocessing was also needed, probably reflecting the longer time frame over which it has been created. We excluded all segments that were shorter than 400 milliseconds, and removed all empty annotations or annotations that contained only punctuation characters. There were several invisible Unicode space characters that were removed. Also all annotations that contained numbers written as number characters were excluded. We also removed from the Nganasan corpus all instances of utterances that contained Cyrillic characters.

For both corpora the annotations that contained unclear words marked with HIAT conventions (Ehlich and Rehbein, 1976) were removed. When the transcriptions contained annotations for non-verbal expressions, including coughing or laughter, we chose to remove those extra annotations, but keep the transcriptions themselves in the training data. Self-corrections were kept, but the hyphens and brackets around them were removed.

It has been asked previously to what degree the preprocessing of language documentation data can be automatized (Wisniewski et al., 2020). Based on our experience with these corpora we can say that a good deal of manual inspection and examination is necessary to understand how the raw data has to be processed to make it usable in ASR training. In our experience the actual transformation of corpus XML files into the structure expected by Persephone was relatively easy. Much more time was consumed by analysing the annotation conventions used in the corpus, and processing some of the mistakes in the transcriptions. In this vein we can strongly recommend to different projects the approach suggested by Partanen and Riessler (2019a), where a team working with endangered languages of the Barents region has integrated automatic testing and validation very deeply into their corpus workflows, thus ensuring the systematic and orderly presentation of the corpus.

4 Method and Experiment Design

Our model is a bi-directional long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997; Schuster and Paliwal, 1997), which is trained to predict character sequences from audio input. The loss function used in the training is connectionist temporal classification (CTC), which makes it possible to train the model with only a coarse alignment between the audio and the text (Graves et al., 2006). As suggested by Wisniewski et al. (2020) and Adams et al. (2018), we use 3 hidden layers with 250 hidden units. Using the same settings as other experiments have used maximises the comparative value of the current work. We train our models using the Persephone framework (Adams et al., 2018).

4.1 Nganasan Tests

Although the Nganasan corpus is fairly large, 28 hours according to the corpus description, there are a few individual speakers who are most prominent in the dataset. We selected all data from three such speakers, and trained individual models for each of them. To our knowledge, experiments with the Persephone system have not been carried out with tens of hours of data, and as earlier experiments have also focused on single-speaker settings, we decided to continue along these lines. As the Nganasan speakers represent different amounts of transcribed data, the differences in accuracy can still give important information about this particularly low-resource setting.

4.2 Kamas Tests

After our preprocessing, the Kamas corpus contains approximately 5 hours of utterances from a single speaker, the above-mentioned Klavdiya Plotnikova. Thereby the most obvious experiment to conduct is to train the model with all the speech we have from her, taking the original transcription as it is. However, the corpus also allows more varied experiments. One of the questions that are unclear in low-resource ASR is what to do with the word boundaries. In the phoneme-level recognition usually practiced with Persephone, they are often omitted. In experiments done with other tools, such as DeepSpeech, specific language models have been used to insert spaces into the correct places (Hjortnaes et al., 2020b). The second experiment with Kamas arises from this starting point: we simply leave the word boundaries in as predicted labels.

One problem with word boundaries must be their nature as a higher-level construct: in real speech they often do not appear as pauses. As forced alignment tools have already developed relatively far, we opted for the MAUS system (Kisler et al., 2012) to align our data automatically. More specifically, we used the functionality provided in the emuR R package (Winkelmann et al., 2017). The language was left unspecified, and the alignment used grapheme-to-phoneme mapping for the original transcriptions, returning phoneme-aligned SAMPA and IPA versions (Reichel, 2012; Reichel and Kisler, 2014). The mapping will be published openly with the other resources described here. It must be noted that there are minor differences between the original transcription and the IPA representation. These are primarily about how long phonemes are presented, as the original transcript was primarily split into characters, whereas the converted IPA has more phonemic units as individual labels.

This process gave us two more transcription versions: plain IPA text, and an IPA version where only those word boundaries were retained which occurred within natural pauses. This work was highly experimental, and we did not correct the segmentations manually. The same Kamas data will be included in the manually corrected DoReCo corpus (Paschen et al., 2020), which will allow better inspection of these features. Our primary goal in this experiment was simply to investigate whether essentially very minor changes in transcriptions impact the result, and to see if different representations of word boundaries bring any benefits.

4.3 Gradual Data Augmentation Test

In order to evaluate the importance of the amount of transcribed minutes and hours, we designed additional tests. In these experiments we use exactly the same Kamas corpus as in Experiment 6, but take only smaller portions of it, which are augmented gradually. As the maximum amount of data was close to 5 hours, we selected intervals that should represent a realistically increasing corpus size, and thereby show where the most important thresholds lie. These experiments are described in Table 3. While discussing the results we have also compared our error rates to those reported in other studies, in order to better understand how the variation we see connects to earlier studies with different languages.

5 Results

In this section, we present the results of the different models. These results are reported as a LER (label error rate) score. In practice, this is a measurement similar to the CER (character error rate) that is widely used in studies focusing on text normalization and OCR correction (see Tang et al., 2018; Veliz et al., 2019; Hämäläinen and Hengchen, 2019).

5.1 Nganasan Results

In the Nganasan experiments we selected utterances from the three most prominent speakers. Table 1 shows the amount of data that we used and the accuracy reached.

Experiment  Utterances  Minutes  LER
1           1152        108      0.334
2           512         57       0.930
3           704         43       0.892

Table 1: Nganasan experiments with three different speakers.

We can easily conclude that the results were not successful in all experiments. In the cases where we had less than an hour of transcriptions, the quality was extremely low. When the label error rate is this high, the model does not produce a useful result. However, there was a clear improvement with the one speaker for whom we had more training material. A brief example and discussion is provided in Section 6.

5.2 Kamas Results

Compared to the Nganasan experiments, the Kamas results are very different. Indeed, the results we achieve are very high, and on par with the best scores reported elsewhere for Persephone. We argue that the primary reason for this is the sufficient amount of training data. Table 2 shows these results in detail.

Exp.  Description                       LER
4     Original transcript, no spaces    0.226
5     Original transcript, with spaces  0.195
6     IPA transcript, no spaces         0.149
7     IPA transcript, real pauses       0.243

Table 2: Kamas experiments with 4224 training samples, 266 minutes.

In Experiment 4 we trained Persephone on the original Kamas transcriptions, without word boundaries separately marked, and with no modifications to the existing transcriptions. In Experiment 5 the space characters were left in their original places. Surprisingly, the result is significantly better with the word boundaries than without them.

Since Experiments 4 and 6 use extremely similar training data, just in a different transcription system, we would have assumed the results to be very similar. We see, however, a very large difference between the models. As we did not run the experiments multiple times, it is left open whether the difference could be caused by different random seeds. In our error analysis some possible reasons for these differences are discussed further.

Experiment 5, however, was necessary in order to evaluate with more confidence the results of Experiment 7. Between these experiments the only difference was in using detected pauses instead of the original word boundaries. The procedure was described in Section 4. As Experiment 7 produces the worst results, we must conclude that this experiment was not successful. However, since the presence of word boundaries as their own tokens has a small impact on the accuracy, and as they are useful information, this model may still be favoured in actual use.

All Kamas models are relatively good, and their accuracy is inspected more closely in Section 6. We see clearly in Figure 2 that although there are minor differences, when the model has a sufficient amount of training data the accuracy does not significantly change. We also cannot entirely exclude the possible impact of random run-time differences when the results are very close to one another.

5.3 Gradual Data Augmentation Test Result

The goal of this experiment was to investigate how the model's accuracy changes when the amount of training data is increased. In the past we have seen various tests with different corpora, often reaching very good results, as discussed in Section 2. The results of this experiment are presented in Table 3.

Experiment  Utterances  Minutes  LER
6-1         448         28       0.612
6-2         896         57       0.254
6-3         1856        117      0.224
6-4         2816        177      0.176
6-5         3776        238      0.190

Table 3: Gradual data augmentation experiments.

[Figure 2: Results of our Nganasan and Kamas experiments compared with Wisniewski et al. (2020)]

The major result we find here is that soon after reaching two hours of training data, the models show extremely modest improvements. The largest improvement takes place between half an hour and a full hour. Especially when we compare the results to those reported for different languages by Wisniewski et al. (2020), it appears that the amount of training data is the main denominator that impacts the model's accuracy. The Na language model is essentially as good as the Kamas model, which reaches its maximal accuracy after three hours of training data. There are possible exceptions; for example, the Duoxu model is relatively good compared to its small size. Even then, it fits the general curve very well.

Based on this comparison, more training data is not necessarily better, and the benefits decrease after a certain level has been reached. We have essentially repeated the results of Adams et al. (2018) on Na and Chatino. We will discuss this further in Section 7, but this clearly gives us some guidelines on how much transcription is needed at the moment to achieve the best possible accuracy. This also contextualizes the Nganasan results, and explains why one of the models was much better than the others.

6 Error Analysis

Our error analysis focuses mainly on Kamas, since with this language we achieved a very high level of accuracy. In our error analysis the numbered lines in the examples correspond to the experiments described and numbered in Section 4. We can, however, state briefly that the two Nganasan models with the worst accuracy predict mainly short character sequences, essentially repeating the same fixed string. This prediction is, naturally, not useful. With the best Nganasan model, however, the result could already be useful as a preliminary transcription stage. We see in Example (1) that the errors are primarily connected to length, and most of the words and morphemes start to be recognizable.

(1) Mənə bəbəədʼəətənɨ isʼüðəm hüətə.
    mənəbəbədʼətənɨsʼühuhətə
    'I will be all the time at the old place.'

The next examples are all from the Kamas corpus. Some of the mistakes the different models make seem to be systematic. Examples (2) and (3) show that especially long consonant sequences are challenging for the model. Both /ll/ and /tt/ come out systematically as single consonants.

(2) Ujabə ajirbi mĭnzərzittə.
    1: ujabajrbimĭnzərzitəo
    2: ujabaj irbi mĭnzərzitə
    3: ujabajirbimɪnzirzitə
    4: ujabajirbimɪnzərzitəo
    'He was reluctant to cook his meat.'

Especially in the second phoneme in Example (3) we see wide variation in the predicted vowel. This appears to be very common with reduced vowels.

(3) Mĭlleʔbi, mĭlleʔbi, ej kuʔpi.
    1: mĭleʔtimĭleʔtiejkuʔpiö
    2: müleʔpi mĭleʔtə ej kuʔpi
    3: nuleʔbəmɪleʔtəejkuʔpi
    4: mɪleʔpimɪleʔpiejkuʔpia
    'He went, he went, he did not kill.'
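The LER metric used throughout the tables above is, in essence, an edit distance between the predicted and reference label sequences, normalized by the reference length. The following is a minimal generic sketch, not Persephone's internal implementation; the `confusion_pairs` helper is our own illustrative addition for collecting substituted label pairs of the kind discussed in this section:

```python
from difflib import SequenceMatcher


def label_error_rate(reference, hypothesis):
    """Levenshtein distance between two label sequences,
    normalized by the reference length (analogous to CER)."""
    m, n = len(reference), len(hypothesis)
    # standard dynamic-programming edit distance, row by row
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)


def confusion_pairs(reference, hypothesis):
    """Collect (reference, predicted) label pairs from substituted
    spans, e.g. /e/ predicted as /ə/."""
    pairs = []
    sm = SequenceMatcher(None, reference, hypothesis)
    for op, i1, i2, j1, j2 in sm.get_opcodes():
        if op == "replace" and (i2 - i1) == (j2 - j1):
            pairs.extend(zip(reference[i1:i2], hypothesis[j1:j2]))
    return pairs


# Example (3) above, first word: one deleted /l/ and one
# substituted /b/ -> 2 edits over 8 labels = 0.25
print(label_error_rate(list("mĭlleʔbi"), list("mĭleʔti")))
```

Summed over all test utterances and normalized by the total reference length, this kind of computation yields scores comparable in spirit to the LER values reported in Tables 1–3.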

In Example (4) we see how self-corrections have been treated in the original transcription and in our various experiments. The model with the original word boundaries is able to predict the spaces correctly, whereas the model with natural pauses only captures some of them.

(4) I dĭ poʔto (kuzab-) kuzazi mobi.
    1: idĭpoʔtkuzakuzaziʔmobi
    2: i dĭ poʔto kuza kuzaziʔ mobi
    3: idɪpoʔtokuzakuzazimobi
    4: idɪ potəkuzakuzaziʔ mobia
    'This goat became man.'

We see in this example also that none of the models recognizes the /b/ at the end of the self-correction. This sound is not fully realized, which makes it acoustically very different from other instances of this phoneme. We can also notice that some of the models have a tendency to drop glottal stops, although some combinations show regularity.

Although not intended for ASR error analysis, we decided to use the OCR tool Calamari's error evaluation method (Wick et al., 2018). This gave us character-level error information. The most frequent individual errors were related to the letters /t/ and /l/. This seems to be related to them occurring as both short and long phonemes, and the models had particular problems learning this distinction. Besides this we can see that many of the most prominent errors are related to vowels that share a very similar place of articulation or other properties: /e/ : /ə/, /o/ : /u/, /ɪ/ : /i/, /ø/ : /o/, /y/ : /u/, / /a : /a/. Within the consonant system the errors between the nasals /n/ : /m/ : /ŋ/ are frequent, and one of them is often replaced with zero. We can also point out that the error /t/ : /d/ is common, but other systematic errors related to the voicing opposition cannot be found. These differences can be compared to the error analysis reported for the Yongning Na language, where the confusions between characters also appeared to be related to acoustic realities (Michaud et al., 2020).

When our error analysis was repeated with the original transcription, which generally gave much worse results than IPA, a very curious picture starts to emerge. Especially vowels containing diacritics were often misrecognized or omitted from the prediction. This hints that there is possibly something in this representation of the texts that does not pass correctly through the system and needs to be thoroughly investigated. Further research should be conducted with different character representations and refined data preprocessing. Also an in-depth investigation of how the strings are passed to the ASR system internally could reveal more information about how, for example, different combining characters are treated. As we have published the training data and trained models openly, this examination can be continued easily.

7 Conclusion

We conducted several experiments on speech recognition of endangered and extinct languages. The most significant result is that we can identify a clear threshold of a few hours after which the current models do not show a clear decrease in the label error rate. We also show that differences in the transcription system do not cause significant differences between models, although there is enough variation that the ideal representation should be investigated. We do not see large differences in whether we use IPA or a project's internal transcription convention. As all these transcriptions are phonemic at a fairly similar level, the lack of differences is not surprising. The best results were achieved without word boundaries, but the experiments with all word boundaries were also encouraging enough that we would suggest testing a model trained with those intact. We aligned transcriptions to detect only the word boundaries that correspond to natural pauses in speech, but the results of this were not better than in the other experiments. We presume that the problem here lies in the possibly weak quality of this automatic segmentation, and the experiment should be repeated with a manually corrected version of the data.

As the transcription bottleneck is a major problem in linguistic fieldwork and the documentation of endangered languages, our work sheds light on possible emerging working methods. This is also a question of resource allocation: when the language is rapidly disappearing, should we focus on recording more, or on refinement and transcription of the existing materials? The situation is especially concerning when there are only a few individual elder speakers. We hope our work offers new insight into this complicated question. Kamas has been extinct for more than 30 years, yet we are able to build a relatively good speech recognition model from just two hours of transcribed speech. This is a realistic amount, and not particularly much in a context where contemporary language documentation projects usually have budgets that cover several researchers' work for multiple years.

Automatic transcription with ASR tools is a fast-moving target. The results in a few years will certainly be entirely different from what we are seeing now. Thereby making recommendations is also a complicated matter. However, we would argue, based on our results, that once one hour has been transcribed for an individual speaker, training an ASR model to speed up the transcription work should already take place. In the same vein, this could be taken into account when working with endangered languages with very small speech communities. Recording and transcribing different speakers widely, with a substantial transcription base for each speaker, seems to be the best way to take advantage of the currently available ASR systems. We are clearly moving into a situation where manual transcription of everything is not the only option.

Future research should also focus on moving from single-speaker systems to ASR that can work with multiple speakers, including unknown speakers. Very encouraging results were reported recently with only 10 minutes of training data, with the use of pretraining on unannotated audio and a language model (Baevski et al., 2020). Since unannotated audio is available for virtually all language documentation projects, and text corpora are also becoming increasingly available and have proven useful (Hjortnaes et al., 2020a), there are certainly possibilities to experiment with these methods in the language documentation context as well.

There is also evidence that systems other than Persephone could deliver better results, which is to be expected as the field evolves. Gupta and Boulianne (2020) reached a phoneme error rate of 8.7% on 3.1 hours of Cree training data, and their comparison of different systems showed significant improvement over other currently available methods, among those Persephone. This would suggest that there are possibilities to improve on the Persephone results we have reported now as well.

The Persephone models that we have trained can be used with Cox's (2019) ELAN extension, or programmatically using Python. We have published both the models and the training datasets [11] in order to encourage further experiments on this important topic, and also to allow Nganasan researchers to benefit from our results. Although the majority of the Kamas materials are already transcribed, we believe our results are relevant and valuable for the work being done with endangered and extinct languages.

[11] https://zenodo.org/record/4029494

References

Oliver Adams, Trevor Cohn, Graham Neubig, Hilaria Cruz, Steven Bird, and Alexis Michaud. 2018. Evaluating phonemic transcription of low-resource tonal languages for language documentation. In Proceedings of LREC 2018.

Alexei Baevski, Henry Zhou, Abdelrahman Mohamed, and Michael Auli. 2020. wav2vec 2.0: A framework for self-supervised learning of speech representations.

Maria Brykina, Valentin Gusev, Sandor Szeverényi, and Beáta Wagner-Nagy. 2018. Nganasan Spoken Language Corpus (NSLC). Archived in Hamburger Zentrum für Sprachkorpora. Version 0.2. Publication date 2018-06-12. Available online at http://hdl.handle.net/11022/0000-0007-C6F2-8.

Maria Brykina, Svetlana Orlova, and Beáta Wagner-Nagy. 2020. INEL Selkup Corpus. Version 1.0. Publication date 2020-06-30. http://hdl.handle.net/11022/0000-0007-E1D5-A. In Beáta Wagner-Nagy, Alexandre Arkhipov, Anne Ferger, Daniel Jettka, and Timm Lehmberg, editors, The INEL corpora of indigenous Northern Eurasian languages.

Josefina Budzisch, Anja Harder, and Beáta Wagner-Nagy. 2019. Selkup Language Corpus (SLC). Archived in Hamburger Zentrum für Sprachkorpora. Version 1.0.0. Publication date 2019-02-08. http://hdl.handle.net/11022/0000-0007-D009-4.

Christopher Cox. 2019. Persephone-ELAN: Automatic phoneme recognition for ELAN users. Version 0.1.2.

Konrad Ehlich and Jochen Rehbein. 1976. Halbinterpretative Arbeitstranskriptionen (HIAT). Linguistische Berichte, 45(1976):21–41.

István Endrédy, László Fejes, Attila Novák, Beatrix Oszkó, Gábor Prószéky, Sándor Szeverényi, Zsuzsa Várnai, and Wágner-Nagy Beáta. 2010. Nganasan – computational resources of a language on the verge of extinction. In 7th SaLTMiL Workshop on Creation and Use of Basic Lexical Resources for Less-Resourced Languages, LREC 2010, Valetta, Malta, 23 May 2010, pages 41–44.

Ben Foley, Joshua T. Arnold, Rolando Coto-Solano, Gautier Durantin, T. Mark Ellison, Daan van Esch, Scott Heath, Frantisek Kratochvil, Zara Maxwell-Smith, David Nash, et al. 2018. Building speech recognition systems for language documentation: The CoEDL endangered language pipeline and inference system (ELPIS). In SLTU, pages 205–209.

Marc Garellek, Matthew Gordon, James Kirby, Wai-Sum Lee, Alexis Michaud, Christine Mooshammer, Oliver Niebuhr, Daniel Recasens, Timo Roettger, Adrian Simpson, et al. 2020. Toward open data policies in phonetics: What we can gain and how we can avoid pitfalls. Journal of Speech Science, 9(1).

Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. 2006. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369–376.

Vishwa Gupta and Gilles Boulianne. 2020. Speech transcription challenges for resource constrained indigenous language Cree. In Proceedings of the 1st Joint Workshop on Spoken Language Technologies for Under-resourced languages (SLTU) and Collaboration and Computing for Under-Resourced Languages (CCURL), pages 362–367.

Valentin Gusev, Tiina Klooster, and Beáta Wagner-Nagy. 2019. INEL Kamas Corpus. Version 1.0. Publication date 2019-12-15. http://hdl.handle.net/11022/0000-0007-DA6E-9. In Beáta Wagner-Nagy, Alexandre Arkhipov, Anne Ferger, Daniel Jettka, and Timm Lehmberg, editors, The INEL corpora of indigenous Northern Eurasian languages.

Mika Hämäläinen. 2019. UralicNLP: An NLP library for Uralic languages. Journal of Open Source Software, 4(37):1345.

Mika Hämäläinen and Simon Hengchen. 2019. From the paft to the fiiture: a fully automatic NMT and word embeddings method for OCR post-correction. In Proceedings of the International Conference on Recent Advances in Natural Language Processing (RANLP 2019), pages 431–436.

Awni Hannun, Carl Case, Jared Casper, Bryan Catanzaro, Greg Diamos, Erich Elsen, Ryan Prenger, Sanjeev Satheesh, Shubho Sengupta, Adam Coates, et al. 2014. Deep speech: Scaling up end-to-end speech recognition. arXiv preprint arXiv:1412.5567.

Nils Hjortnaes, Timofey Arkhangelskiy, Niko Partanen, Michael Rießler, and Francis M. Tyers. 2020a. Improving the language model for low-resource ASR with online text corpora. In Dorothee Beermann, Laurent Besacier, Sakriani Sakti, and Claudia Soria, editors, Proceedings of the 1st joint SLTU and CCURL workshop (SLTU-CCURL 2020), pages 336–341, Marseille. European Language Resources Association (ELRA).

Nils Hjortnaes, Niko Partanen, Michael Rießler, and Francis M. Tyers. 2020b. Towards a speech recognizer for Komi, an endangered and low-resource Uralic language. In Proceedings of the Sixth International Workshop on Computational Linguistics of Uralic Languages, pages 31–37.

Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural Computation, 9(8):1735–1780.

Juha Janhunen and Ekaterina Gruzdeva. 2020. Nganasan: A fresh focus on a little known language. Linguistic Typology, 24(1):181–186.

[…] problemy i perspektivy. Syktyvkar, 16–17 marta 2017 g., pages 263–273.
Ludger Paschen, François Delafontaine, Christoph Kaisla Kaheinen. 2020. Nganasanin itsekorjaus: Draxler, Susanne Fuchs, Matthew Stave, and Frank Huomioita korjaustoimintojen rakenteesta ja korjauk- Seifart. 2020. Building a time-aligned cross-linguistic sen merkityksistä vuorovaikutuksessa. MA thesis, reference corpus from language documentation University of Helsinki. data (doreco). In Proceedings of The 12th Lan- Thomas Kisler, Florian Schiel, and Han Sloetjes. 2012. guage Resources and Evaluation Conference, pages Signal processing via web services: the use case Web- 2657–2666. MAUS. In Digital Humanities Conference 2012. Uwe D Reichel and Thomas Kisler. 2014. Language- Tiina Klooster. 2015. Individual language change: a independent grapheme-phoneme conversion and word case study of Klavdiya Plotnikova’s Kamas. MA the- assignment as a web service. Studientexte sis, University of Tartu. zur Sprachkommunikation: Elektronische Sprachsig- Michael Krauss. 1992. The world’s languages in crisis. nalverarbeitung 2014, pages 42–49. Language, 68(1):4–10. Uwe D Reichel. 2012. PermA and Balloon: Tools for Larisa Leisio. 2006. Passive in Nganasan. Typological string alignment and text processing. In Proc. Inter- Studies in Language, 68:213. speech. AK Matveev. 1964. Kamassi keele jälgedel. Keel ja Jack Rueter and Mika Hämäläinen. 2017. Synchronized Kirjandus, 3:167–169. Mediawiki based analyzer dictionary development. In Alexis Michaud, Oliver Adams, Christopher Cox, Séver- Proceedings of the Third Workshop on Computational ine Guillaume, Guillaume Wisniewski, and Benjamin Linguistics for Uralic Languages, pages 1–7, St. Pe- Galliot. 2020. La transcription du linguiste au miroir tersburg, Russia, January. Association for Computa- de l’intelligence artificielle: réflexions à partir de la tional Linguistics. transcription phonémique automatique. Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional Christopher Moseley, editor. 2010. Atlas of ′ recurrent neural networks. 
IEEE transactions on Sig- the World s Languages in Danger. UN- nal Processing, 45(11):2673–2681. ESCO Publishing, 3rd edition. Online version: Florian Siegl. 2013. Materials on Forest Enets, an in- http://www.unesco.org/languages-atlas/. digenous language of Northern Siberia. Number 267 Sjur Moshagen, Jack Rueter, Tommi Pirinen, Trond in Mémoires de la Société Finno-Ougrienne. Société Trosterud, and Francis M. Tyers. 2014. Open-Source Finno-Ougrienne. Infrastructures for Collaborative Work on Under- Gongbo Tang, Fabienne Cap, Eva Pettersson, and Joakim Resourced Languages. The LREC 2014 Workshop Nivre. 2018. An evaluation of neural machine trans- “CCURL 2014 - Collaboration and Computing for lation models on historical spelling normalization. In Under-Resourced Languages in the Linked Open Data Proceedings of the 27th International Conference on Era”. Computational Linguistics, pages 1320–1331, Santa Irina Nikolaeva. 2014. A grammar of Tundra Nenets. Fe, New Mexico, USA, August. Association for Com- Walter de Gruyter GmbH & Co KG. putational Linguistics. Niko Partanen and Michael Riessler. 2019a. Au- tomatic validation and processing of ELAN cor- Claudia Matos Veliz, Orphée De Clercq, and Véronique pora for spoken language data. Presentation in: Hoste. 2019. Benefits of data augmentation for Research Data and Humanities – RDHum 2019. nmt-based text normalization of user-generated con- University of Oulu, August 14–16, 2019. URL: tent. In Proceedings of the 5th Workshop on Noisy https://www.oulu.fi/suomenkieli/node/55261. User-generated Text (W-NUT 2019), pages 275–285. Niko Partanen and Michael Rießler. 2019b. An OCR Beáta Wagner-Nagy. 2014. Possessive constructions in system for the Unified Northern Alphabet. In The fifth Nganasan. Tomsk Journal of Linguistics and Anthro- International Workshop on Computational Linguistics pology, (1):76–82. for Uralic Languages. Beáta Wagner-Nagy. 2018. A grammar of Nganasan. Niko Partanen and Michael Rießler. 2018. lang- Brill. 
doc/iwclul2019: An OCR system for the Uni- Christoph Wick, Christian Reul, and Frank Puppe. 2018. fied Northern Alphabet – data package, December. Calamari-a high-performance tensorflow-based deep https://doi.org/10.5281/zenodo.2506881. learning package for optical character recognition. Niko Partanen. 2017. Challenges in OCR today: arXiv preprint arXiv:1807.02004. Report on experiences from INEL. In Èlektron- Raphael Winkelmann, Jonathan Harrington, and Klaus naâ pis’mennost’ narodov Rossijskoj Federacii: Opyt, Jänsch. 2017. EMU-SDMS: Advanced speech database management and analysis in R. Computer Speech & Language, 45:392–410. Guillaume Wisniewski, Alexis Michaud, and Séverine Guillaume. 2020. Phonemic transcription of low- resource languages: To what extent can preprocessing be automated? In Proceedings of the 1st Joint SLTU (Spoken Language Technologies for Under-resourced languages) and CCURL (Collaboration and Comput- ing for Under-Resourced Languages) Workshop. Alexander Zahrer, Andrej Zgank, and Barbara Schup- pler. 2020. Towards building an automatic transcrip- tion system for language documentation: Experiences from muyu. In Proceedings of The 12th Language Re- sources and Evaluation Conference, pages 2893–2900.