The Relevance of the Source Language in Transfer Learning for ASR

Nils Hjortnæs, Indiana University
Niko Partanen, University of Helsinki, Helsinki, Finland
Michael Rießler, University of Eastern Finland, Joensuu, Finland
Francis M. Tyers, Department of Linguistics, Indiana University, Bloomington, IN

Abstract

This study presents new experiments on Zyrian Komi speech recognition. We use DeepSpeech to train ASR models from a language documentation corpus that contains both contemporary and archival recordings. Earlier studies have shown that transfer learning from English and using a domain-matching Komi language model both improve the CER and WER. In this study we experiment with transfer learning from a more relevant source language, Russian, and with including Russian text in the language model construction. The motivation for this is that Russian and Komi are contemporary contact languages, and Russian is regularly present in the corpus. We found that despite the close contact of Russian and Komi, the size of the English speech corpus yielded greater performance when it was used as the source language. Additionally, we can report that an update of the DeepSpeech version alone improved the CER by 3.9% compared to the earlier studies, which is an important step in the development of Komi ASR.

1 Introduction

This study describes an Automatic Speech Recognition (ASR) experiment on Zyrian Komi, an endangered, low-resource Uralic language spoken in Russia. Komi has approximately 160,000 speakers, and its writing system, which uses the Cyrillic script, is well established. Although Zyrian Komi is endangered, it is used widely in various media, and also in the education system in the Komi Republic.

This paper continues our experiments on Zyrian Komi ASR using DeepSpeech, which started in Hjortnaes et al. (2020b). The scores reported there were very low, but in a later study we found that a language model built on more data increased the performance dramatically (Hjortnaes et al., 2020a). We are not yet at a level that would be immediately useful for our goals, but we continue to explore different ways to improve our results. This study uses the same dataset, but attempts to better take into account the multilingual processes found in the corpus.

ASR has progressed greatly for high-resource languages, and several advances have been made to extend that progress to low-resource languages. There are still, however, numerous challenges in situations where the available data is limited. For many of the larger languages, the training data for speech recognition is collected expressly for the purpose of ASR. Such approaches, for example the Common Voice platform, can also be extended to endangered languages (see e.g. Berkson et al., 2019), so there is no clear-cut boundary between the resources available for different languages. While having dedicated, purpose-built data is good for the performance of ASR, it also leaves a large quantity of more challenging but usable data untapped. At the same time, materials not explicitly produced for this purpose may be more realistic with respect to the resources we intend to use the ASR for in later stages.

The data collected in language documentation work customarily originates from recorded and transcribed conversations and/or elicitations in the target language. While this data does not have the desirable features of a custom speech recognition dataset, such as a large variety of speakers and accents, and includes far fewer recorded hours, for many endangered languages the language documentation corpora are the only available source.

However, attention should also be paid to the differences between endangered language contexts globally. There is enormous variation in how extensively a language has previously been documented and in whether earlier resources exist. This also shapes the new materials collected, as some languages need a full linguistic description, while others already have established orthographies and a variety of descriptions available. For example, in the case of Komi our transcription choice is the existing orthography, which is also used in other resources (Gerstenberger et al., 2016, 32). Our spoken language corpus is connected with the entire NLP ecosystem of the Komi language, which includes a finite state transducer (Rueter, 2000), well-developed dictionaries both online[1] and in print (Rueter et al., 2020; Beznosikova et al., 2012; Alnajjar et al., 2019), several treebanks (Partanen et al., 2018), and also written language corpora (Fedina, 2019)[2]. We use this technical stack to annotate our corpus directly in the ELAN files (Gerstenberger et al., 2017), but also to create versions in which identifiable information has been restricted (Partanen et al., 2020). Our goal is thus not to describe the language from scratch, but to create a spoken language corpus that is not separate from the rest of the work and infrastructure built for this language. From this point of view we need an ASR system that produces the contemporary orthography, and not purely the phonetic or phonemic level. It can be expected that entirely undocumented languages and languages with a long tradition of documentation need very different ASR approaches, although both fall under the umbrella of endangered language documentation.

[1] https://dict.fu-lab.ru
[2] http://komicorpora.ru

In this work we expand on the use of transfer learning to improve the quality of a speech recognition system for dialectal Zyrian Komi. Our data consists of about 35 hours of transcribed speech that will be available as an independent dataset in the Language Bank of Finland (Blokland et al., forthcoming). While this collection is under preparation, the original raw multimedia is available by request in The Language Archive in Nijmegen (Blokland et al., 2021). This is a relatively large dataset for a low-resource language, but it is still nowhere near high-resource datasets such as LibriSpeech (Panayotov et al., 2015), which has about 1000 hours of English.

One of the largest challenges in our dataset is that there is significant code switching between Komi and Russian. This is a feature shared with other similar corpora (compare e.g. Shi et al., 2021). All speakers are bi- or multilingual and use several languages regularly, so there are large segments where the language spoken is Russian, although the main language is Komi. There are also very fragmentary occurrences of the Tundra Nenets, Kildin Saami and Northern Mansi languages, but these are so rare at the moment that we have not addressed them separately. In addition, none of the data is annotated for which language is being spoken, and we only have transcriptions in the contemporary Cyrillic orthographies of these languages, as explained above. We propose two possible methods to accommodate these properties of the data. First, we compare whether it is better to transfer from a high-resource language, English, or from the contact language, Russian. Second, we analyze the impact of constructing language models from different combinations of Komi and Russian sources. The goal is to make the language model more representative of the data and thereby improve performance.
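As a concrete illustration of the second method, the following sketch shows one way a combined corpus for language model training could be assembled before it is passed to a KenLM-style toolkit; the file names, the lowercasing and the subsampling ratio are illustrative assumptions rather than the preprocessing actually used in this study.

    # A minimal sketch, assuming plain-text corpora in komi.txt and russian.txt
    # (hypothetical file names). Concatenates the sources into one corpus for
    # KenLM-style language model training; the subsampling ratio is illustrative.
    import random
    from pathlib import Path

    def build_lm_corpus(sources, out_path):
        """Write a combined corpus; keep_ratio subsamples a source so that,
        for example, Russian text does not overwhelm the Komi text."""
        random.seed(0)  # reproducible subsampling
        with Path(out_path).open("w", encoding="utf-8") as out:
            for path, keep_ratio in sources:
                for line in Path(path).open(encoding="utf-8"):
                    line = line.strip().lower()
                    if line and random.random() < keep_ratio:
                        out.write(line + "\n")

    # Keep all Komi text and mix in half of the Russian text.
    build_lm_corpus([("komi.txt", 1.0), ("russian.txt", 0.5)], "komi_russian_lm.txt")

Different Komi-Russian combinations can then be produced by varying which sources are included and how much of each is kept.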
2 Prior Work

A majority of the work on speech recognition focuses on improving the performance of models for high-resource languages. Very small improvements may be made through advances such as improving the network and the information available to it, as in Han et al. (2020) or Li et al. (2019), though as performance increases, the gains from these new methods decrease. Another avenue is to try to make these systems more robust to noise through data augmentation (Braun et al., 2017; Park et al., 2019). As with improving the networks, however, these improvements become more and more marginal as performance increases.

As more models for ASR become available as open source (Hannun et al., 2014; Pratap et al., 2019), it becomes easier to develop these tools for low-resource languages and to create best practice standards for doing so. This is the fundamental goal of Common Voice (Ardila et al., 2020). Others work on individual languages, such as Fantaye et al. (2020) and Dalmia et al. (2018).

In the language documentation context we have seen a large number of experiments on endangered languages in the last few years, but often focusing on datasets with a single speaker. Under this constraint, a few hours of transcribed data have already been shown to yield relatively good accuracy, as demonstrated by Adams et al. (2018). Partanen et al. (2020) likewise report very good results on the extinct Samoyedic language Kamas, where the model was also trained on one speaker, for whom, however, a relatively large corpus exists. Under many circumstances it is realistic and important to record individual speakers in numerous recording sessions, and such collections appear to be numerous in archives containing past field recordings, so there is no doubt that single-speaker systems can be useful, although they are not ideal. Recently, Shi et al. (2021) also reported very encouraging results on Yoloxóchitl Mixtec and Puebla Nahuatl, especially as the corpus contains multiple speakers.

[...] cleaned any sections which were too long or too short, as defined by DeepSpeech. The alphabet is based on the Komi data rather than on the text used to construct the language models, as the alphabet is what determines the output of the network. We obtained our English model from DeepSpeech's publicly available release models[4].
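To illustrate this cleaning step, the following sketch filters a DeepSpeech-style CSV manifest by clip duration and derives the alphabet file from the Komi transcripts; the duration thresholds and file names are assumptions, since the excerpt does not specify them.

    # A minimal sketch, assuming a DeepSpeech-style CSV manifest with the
    # columns wav_filename, wav_filesize and transcript. The 0.5 s / 30 s
    # duration limits are placeholder values, not DeepSpeech's exact bounds.
    import csv
    import wave

    def clean_manifest(in_csv, out_csv, alphabet_path, min_s=0.5, max_s=30.0):
        chars = set()
        with open(in_csv, newline="", encoding="utf-8") as f_in, \
             open(out_csv, "w", newline="", encoding="utf-8") as f_out:
            reader = csv.DictReader(f_in)
            writer = csv.DictWriter(f_out, fieldnames=reader.fieldnames)
            writer.writeheader()
            for row in reader:
                with wave.open(row["wav_filename"]) as wav:
                    duration = wav.getnframes() / wav.getframerate()
                if min_s <= duration <= max_s:  # drop too-short / too-long clips
                    writer.writerow(row)
                    chars.update(row["transcript"])
        # Alphabet file: one character per line (the space character included),
        # derived from the Komi transcripts that the network must output.
        with open(alphabet_path, "w", encoding="utf-8") as f:
            f.write("\n".join(sorted(chars)) + "\n")

    clean_manifest("komi_all.csv", "komi_clean.csv", "alphabet.txt")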
3.3 DeepSpeech

We trained our models using Mozilla's open source DeepSpeech[5] (Hannun et al., 2014) version [...]
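As an illustration of such a training setup, the sketch below shows roughly how a DeepSpeech fine-tuning run from a pre-trained checkpoint can be launched; the --load_checkpoint_dir and --drop_source_layers flags follow DeepSpeech's documented transfer-learning interface, while all paths and hyperparameter values are placeholders, not the settings used in this study.

    # A hypothetical fine-tuning invocation; every path, the epoch count and
    # the number of dropped layers are placeholders. Swapping english_checkpoint/
    # for a Russian checkpoint changes the transfer source language.
    import subprocess

    subprocess.run(
        [
            "python3", "DeepSpeech.py",
            "--train_files", "komi_train.csv",
            "--dev_files", "komi_dev.csv",
            "--test_files", "komi_test.csv",
            "--alphabet_config_path", "alphabet.txt",
            "--load_checkpoint_dir", "english_checkpoint/",
            "--drop_source_layers", "1",  # re-initialise layers tied to the source alphabet
            "--epochs", "30",
        ],
        check=True,
    )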