NeMo Toolbox for Speech Dataset Construction

Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg

NVIDIA, Santa Clara, USA {ebakhturina, vlavrukhin, bginsburg}@nvidia.com

Preprint. Submitted to INTERSPEECH-21. arXiv:2104.04896v1 [eess.AS] 11 Apr 2021

Abstract

In this paper, we introduce a new toolbox for constructing speech datasets from long audio recordings and raw reference texts. We develop tools for each step of the speech dataset construction pipeline, including data pre-processing, audio-text alignment, data post-processing and filtering. The proposed pipeline also supports human-in-the-loop review to address text-audio mismatch issues and remove samples that do not satisfy the quality requirements. We demonstrate the toolbox's efficiency by building the Russian LibriSpeech corpus (RuLS) from LibriVox audiobooks. The toolbox is open sourced in the NeMo framework. The RuLS corpus is released at OpenSLR.

1. Introduction

Neural Network (NN) based automatic speech recognition (ASR) systems have made a huge leap [1, 2, 3, 4] in recent years. One of the key enabling factors was the release of the LibriSpeech corpus [5], which contains about 1000 hours of English speech. It became the standard benchmark for English ASR research [2, 4]. Large non-English speech datasets, namely AISHELL [6, 7], Mozilla Common Voice (MCV) [8], CSS10 [9], and the Multilingual LibriSpeech (MLS) corpus [10], have made a significant contribution towards encouraging ASR model development for non-English languages.

There are two methods traditionally used to create a speech dataset:
• Record short text utterances read by speakers.
• Given a long audio file and corresponding text, split the audio and text into short utterances.

In our work, we focus on the second option: creating short audio-text pairs from original long files. The audio segmentation can be challenging, but this method can produce large datasets from various sources and domains at minimal cost. There are many public sources of speech data that could be used for dataset creation, e.g. LibriVox audiobooks [11] and YouTube videos under a Creative Commons license.

Our contribution is two-fold:
1. We develop tools for each step of speech dataset construction, including data pre-processing, audio-text alignment, data post-processing and filtering (see Figure 1).
2. We used the toolbox to build the Russian LibriSpeech corpus (RuLS), which contains about 100 hours of speech.

Figure 1: NeMo dataset construction workflow.

The rest of the paper is organized as follows. Section 2 describes audio-text processing. Section 3 explains speech and text alignment using the enhanced Connectionist Temporal Classification (CTC)-based [12] segmentation tool. Sections 4 and 5 describe Speech Data Explorer (SDE) and how it is used for data error analysis. Section 6 explains how we iteratively improved the text-speech alignment by retraining an ASR model. Finally, Section 7 discusses the Russian LibriSpeech dataset statistics and the Russian ASR model trained on the new dataset.

2. Audio and text pre-processing

The audio recordings are downloaded directly from LibriVox.org. We convert MP3 audio files to the mono WAV format sampled at 16 kHz. We also cut the first 3 seconds of the audio files to shorten the LibriVox preamble.

Unlike the English part of the LibriVox project, which relies heavily on the Gutenberg project texts (http://gutenberg.org/) in TXT format, the Russian LibriVox section mostly uses reference text in PDF or JPEG formats. We abandoned the idea of applying a PDF-to-text converter or an Optical Character Recognition (OCR) model because 1) the text recovered from book images would contain a significant number of artifacts due to the poor quality of the scanned pages, and 2) many LibriVox Russian books contain archaic orthography, which would make OCR perform even worse. Thus, we resorted to alternative text resources to obtain TXT versions of the books, and used only about 40% of LibriVox Russian audiobooks in our experiments. The downside of this approach is that the book versions we found could differ from the text used during the audio recording. The mismatch between the audio and the reference text could not only result in poor alignments but also hurt the model that uses such data (see Section 5 for details).

Once the reference text is available, we split long text files into short utterances and prepare the text for alignment. Text pre-processing includes the following steps:
• Normalize numbers. Expanding numbers to their full spoken equivalent is a required step for the CTC-Segmentation algorithm [13] to produce valid alignments. Unfortunately, good normalization tools are rarely open sourced for languages other than English. As a naive solution, we propose a simple digit-by-digit expansion of numbers. For example, the number '19' is converted to 'one nine' or 'один девять'. Note that with this naive approach, the segments with numbers should be excluded from the final version of the dataset. The aim of this step is to make sure the alignments of the subsequent segments are not broken. A better normalization solution would be to use a library such as num2words (https://github.com/savoirfairelinux/num2words) to convert numbers to their spoken equivalent.
• Remove inaudible fragments. For LibriVox text, such fragments are automatically detected based on curly and/or square brackets.

• Substitute Russian abbreviations with their full spoken equivalent.
• Convert non-Russian letters to their Russian equivalent. Many old Russian books contain snippets of French text, so we apply a simple transliteration of non-Russian characters to their closest Russian equivalent. For instance, the letter 'i' is changed to 'и' and 'j' to 'ж'. Although the converted text hardly sounds similar to the original, this step helps to fill the gaps so that the CTC-Segmentation algorithm can find the alignments. Note that such segments are removed from the final version of the dataset.
• Split the original text into short utterances based on end-of-sentence punctuation marks. We use regular expressions for this step. In our experiments, we also tried splitting the sentences in the middle to reduce the average duration of the final samples. However, the segmentation results deteriorate with such a strategy and more partially cut words were observed (see Section 5 for details). We believe this is due to the relatively low accuracy of the original Russian ASR model.

3. Text and audio alignment

To align the text utterances constructed in Section 2 with the audio files, we used the CTC-Segmentation algorithm proposed by Kürzinger et al. [13], since we found this method to be more robust than other tools such as Gentle [14] or Aeneas [15]. We used the CTC-segmentation package (https://pypi.org/project/ctc-segmentation/) [13], and sped it up by enabling parallel segmentation of multiple files.

CTC-segmentation requires a pre-trained CTC-based ASR model to extract character log probabilities for each time step. We trained a Russian ASR model using cross-language transfer learning [16] from the English QuartzNet 15x5 model (https://ngc.nvidia.com/catalog/models/nvidia:nemospeechmodels), which was trained on 3000 hours of public English speech. The Russian model was trained on 80 hours of the Russian subset of MCV (ver. 5.1) using the NovoGrad optimizer [17] with a learning rate of 0.0015, a weight decay of 0.001, and a cosine annealing learning rate policy.

4. NeMo Speech Data Explorer

After the initial segmentation is complete, we analyze the quality of the data using Speech Data Explorer (SDE). SDE provides the following functionality:
• global dataset statistics (alphabet, vocabulary, duration-based histograms);
• navigation across the dataset using an interactive datatable that supports sorting and filtering;
• inspection of individual utterances (plotting waveforms, spectrograms, custom attributes and playing audio);
• error analysis (word error rate (WER), character error rate (CER), word match rate, word accuracy, highlighted diff display).

Some of these features are demonstrated in Figure 2. To the best of our knowledge, SDE is the first open source tool for interactive speech dataset analysis. The next section describes how SDE was used for error analysis and filtering during the construction of the RuLS dataset.

Figure 2: Speech Data Explorer.

5. Error analysis and filtering

SDE helps to find utterances which have:
• out-of-alphabet characters;
• out-of-vocabulary words;
• high or low character rate.

The test on out-of-vocabulary words is used to detect typos from the OCR process or new valid words. The test on high/low character rate can be used to find misalignment or errors in reference transcripts. Note that correctly aligned segments with fast-paced speech also have a high character rate, while segments that contain long non-speech fragments have a low character rate. The dataset designer decides whether to filter data samples automatically with thresholds on character rate or to validate outliers manually.

SDE can also be used to analyse samples with high WER or CER to determine whether the high WER / CER was caused by misalignment, errors in reference transcripts or low accuracy of the ASR model.

5.1. RuLS error analysis and filtering

Overall, CTC-Segmentation produces accurate alignments, given that the Russian ASR model was fine-tuned on only 80 hours of data and that a greedy decoding strategy without a Language Model (LM) was used to produce time step probabilities. However, we noticed that some segmented audio clips end abruptly and miss the final phonemes of the utterance or include the first phonemes of the next utterance. Figure 3 shows one such example, where the audio clip ends before the waveform fades out. To filter out such examples we apply a threshold on the mean absolute (MA) signal value at the end of the audio clip.

Figure 3: Example of a segmented audio clip waveform with the very last sound of the utterance missing (abrupt ending).

As discussed in Section 2, the reference text might deviate from the audio due to differences in book versions. This could result in misaligned audio-text segments. To mitigate the issue we use a threshold of -2 on the alignment confidence score and run the algorithm on the same utterance lists with different minimum window sizes. The minimum window size parameter determines the minimum size corresponding to a single utterance. Only the segments that are similarly aligned with all window sizes are included in the dataset. In our experiments we align text utterances with three minimum window size values, i.e. 8000 (the recommended default value), 10000, and 12000, to make sure the final alignments are valid.

6. K-fold re-alignment

The initial text-audio alignment was done with an ASR model fine-tuned on 80 hours of Russian speech data. Due to the limited amount of training data, the WER on the newly generated corpus was relatively high, about 45.8%. The accuracy of the ASR model directly affects the quality of the segmentation. In our case, the original ASR model was relatively weak and this caused misalignment.

To improve the dataset alignment quality we re-tune the ASR model in a cross-validation manner. Specifically, we create 3 data folds (no book overlap between the folds) and use two folds along with the Russian MCV training set for ASR model fine-tuning and re-segmentation of the third fold (see Figure 4). The 3-fold re-alignment prevents the ASR model from memorizing reference text and alignment errors. It reduces the average book WER from 45.8% to 18.83%. Figure 5 depicts the WER drop per book after the re-alignment.

Figure 5: WER before and after re-alignment per book.

7. Russian LibriSpeech dataset

The RuLS corpus is complementary to the MLS dataset [10]. It is constructed from 17 Russian LibriVox audiobooks and has a total of 98.2 hours of speech. The corpus has three parts: train, dev, and test, without speaker overlap (see Table 1).

Table 1: Statistics of the RuLS dataset splits.

Subset   Duration, hours   Number of Speakers
Train    92.79             5
Dev      2.81              1
Test     2.65              7
Total    98.25             13
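As a rough illustration of how per-split figures such as those in Table 1 could be derived, the sketch below sums utterance durations from a JSON-lines manifest. The manifest layout (one JSON object per line with a `duration` field in seconds), the helper name `split_hours` and the toy durations are assumptions made for this example; they are not taken from the paper.

```python
import json
from typing import Dict, Iterable


def split_hours(manifest_lines: Iterable[str]) -> Dict[str, float]:
    """Summarize one split: utterance count, total hours, longest utterance."""
    # Each non-empty line is assumed to be a JSON object with a "duration" in seconds.
    durations = [json.loads(line)["duration"] for line in manifest_lines if line.strip()]
    return {
        "utterances": len(durations),
        "hours": sum(durations) / 3600.0,
        "max_seconds": max(durations, default=0.0),
    }


# Toy manifest with made-up durations; this is not real RuLS data.
toy_manifest = [
    '{"audio_filepath": "a.wav", "duration": 12.5, "text": "пример"}',
    '{"audio_filepath": "b.wav", "duration": 19.9, "text": "пример"}',
    '{"audio_filepath": "c.wav", "duration": 3.6, "text": "пример"}',
]
stats = split_hours(toy_manifest)
```

The `max_seconds` field gives a quick check that no utterance exceeds a chosen length limit for the split.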

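The three-fold scheme from Section 6 can be sketched as follows. This is a minimal illustration under stated assumptions: the paper only says that the folds share no books, so the greedy duration balancing, the function names and the toy book durations below are hypothetical choices for the example.

```python
from typing import Dict, List, Tuple


def make_book_folds(book_hours: Dict[str, float], k: int = 3) -> List[List[str]]:
    """Assign whole books to k folds so that no book is shared between folds."""
    folds: List[List[str]] = [[] for _ in range(k)]
    totals = [0.0] * k
    # Longest books first; each goes into the currently lightest fold.
    for book, hours in sorted(book_hours.items(), key=lambda kv: (-kv[1], kv[0])):
        lightest = totals.index(min(totals))
        folds[lightest].append(book)
        totals[lightest] += hours
    return folds


def realignment_rounds(folds: List[List[str]]) -> List[Tuple[List[str], List[str]]]:
    """For each fold, pair the books to fine-tune on with the books to re-segment."""
    rounds = []
    for held_out in range(len(folds)):
        train = [b for i, fold in enumerate(folds) if i != held_out for b in fold]
        rounds.append((train, list(folds[held_out])))
    return rounds


# Hypothetical book durations in hours.
toy_books = {"book_a": 9.0, "book_b": 7.5, "book_c": 6.0,
             "book_d": 4.0, "book_e": 3.5, "book_f": 2.0}
folds = make_book_folds(toy_books)
rounds = realignment_rounds(folds)
```

In each round, the model would be fine-tuned on the two training folds (together with the MCV training set) and then used to re-segment only the held-out books, so it never re-aligns text it was trained on.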
Figure 4: Three-fold re-alignment.

The maximum length of audio utterances is 20 seconds (see Figure 6).

Figure 6: Violin plot of the audio segment lengths for the RuLS dataset splits.

To validate the quality of the RuLS dataset, we trained a QuartzNet-15x5 model on both the MCV Russian subset and the training subset of the RuLS corpus, fine-tuning from the English model. As shown in Table 2, the WER on the MCV dev set drops from 12.63% to 9.65%.

Table 2: Russian QuartzNet-15x5 model: (1) trained on MCV-Ru only and (2) trained on MCV-Ru combined with RuLS. Both models are evaluated on MCV-Ru dev. Greedy WER, %.

Train set     WER, %
MCV           12.63
MCV + RuLS    9.65

8. Conclusions

We developed a new toolbox for speech dataset construction. It includes the enhanced CTC-based segmentation tool for audio-text alignment and the Speech Data Explorer for interactive error analysis. We demonstrated the efficiency of the toolbox by building the Russian LibriSpeech dataset. Adding the new dataset to the Mozilla Common Voice Russian subset improved WER on the dev subset from 12.63% to 9.65%. The toolbox is open-sourced in NeMo [18] (https://github.com/NVIDIA/NeMo).

9. Acknowledgements

The authors would like to thank Eric Harper and Elena Rastorgueva for review and feedback. We would also like to express our gratitude to LibriVox volunteers.

10. References

[1] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in ICASSP, 2018.
[2] J. Li, V. Lavrukhin, B. Ginsburg, R. Leary, O. Kuchaiev, J. M. Cohen, H. Nguyen, and R. T. Gadde, “Jasper: An end-to-end convolutional neural acoustic model,” arXiv:1904.03288, 2019.
[3] S. Kriman, S. Beliaev, B. Ginsburg, J. Huang, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, and Y. Zhang, “QuartzNet: Deep automatic speech recognition with 1D time-channel separable convolutions,” in ICASSP, 2020.
[4] A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y. Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y. Wu et al., “Conformer: Convolution-augmented transformer for speech recognition,” arXiv:2005.08100, 2020.
[5] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015.
[6] H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AISHELL-1: An open-source mandarin speech corpus and a speech recognition baseline,” in Oriental COCOSDA 2017, 2017.
[7] J. Du, X. Na, X. Liu, and H. Bu, “AISHELL-2: Transforming mandarin ASR research into industrial scale,” arXiv:1808.10583, 2018.
[8] R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber, “Common Voice: A massively-multilingual speech corpus,” arXiv:1912.06670, 2019.
[9] K. Park and T. Mulc, “CSS10: A collection of single speaker speech datasets for 10 languages,” arXiv:1903.11269, 2019.
[10] V. Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” Interspeech, 2020.
[11] “LibriVox - Free public domain audiobooks,” https://librivox.org/.
[12] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006.
[13] L. Kürzinger, D. Winkelbauer, L. Li, T. Watzel, and G. Rigoll, “CTC-Segmentation of Large Corpora for German End-to-End Speech Recognition,” in Speech and Computer, A. Karpov and R. Potapova, Eds. Cham: Springer International Publishing, 2020, pp. 267–278.
[14] R. Ochshorn and M. Hawkins, “Gentle,” https://lowerquality.com/gentle/.
[15] “Aeneas,” https://www.readbeyond.it/aeneas/.
[16] J. Huang, O. Kuchaiev, P. O’Neill, V. Lavrukhin, J. Li, A. Flores, G. Kucsko, and B. Ginsburg, “Cross-language transfer learning, continuous learning, and domain adaptation for end-to-end automatic speech recognition,” arXiv:2005.04290, 2020.
[17] B. Ginsburg, P. Castonguay, O. Hrinchuk, O. Kuchaiev, V. Lavrukhin, R. Leary, J. Li, H. Nguyen, Y. Zhang, and J. M. Cohen, “Stochastic gradient methods with layer-wise adaptive moments for training of deep networks,” arXiv:1905.11286, 2019.
[18] O. Kuchaiev, J. Li, H. Nguyen, O. Hrinchuk, R. Leary, B. Ginsburg, S. Kriman, S. Beliaev, V. Lavrukhin, J. Cook et al., “NeMo: a toolkit for building AI applications using neural modules,” arXiv:1909.09577, 2019.
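The text pre-processing steps of Section 2 (naive digit-by-digit number expansion, removal of bracketed fragments, and sentence splitting with regular expressions) could look roughly like the sketch below. The function names, the regular expressions and the sample sentence are illustrative assumptions, not the toolbox's actual code.

```python
import re
from typing import List

# Digit names for the naive digit-by-digit expansion ('19' -> 'один девять').
RU_DIGITS = {
    "0": "ноль", "1": "один", "2": "два", "3": "три", "4": "четыре",
    "5": "пять", "6": "шесть", "7": "семь", "8": "восемь", "9": "девять",
}


def expand_digits(text: str) -> str:
    """Replace every run of ASCII digits with space-separated digit names."""
    def spell(match: "re.Match") -> str:
        return " ".join(RU_DIGITS[d] for d in match.group(0))
    return re.sub(r"[0-9]+", spell, text)


def remove_bracketed(text: str) -> str:
    """Drop fragments in square or curly brackets (assumed not to be nested)."""
    text = re.sub(r"\[[^\]]*\]|\{[^}]*\}", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def split_sentences(text: str) -> List[str]:
    """Split after sentence-final punctuation that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?…])\s+", text.strip())
    return [p for p in parts if p]


raw = "Глава 19. [примечание редактора] Он вошёл в дом! Было тихо."
utterances = [expand_digits(s) for s in split_sentences(remove_bracketed(raw))]
```

Because the digit-by-digit reading rarely matches what the narrator says, a real pipeline would also record which utterances contained digits so that they can be excluded from the final dataset, as Section 2 requires.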
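The two filters described in Section 5.1 can be sketched in a few lines. The paper does not give the threshold on the mean absolute signal value, the length of the tail window, or the tolerance used to decide that two alignments are "similar", so all numeric values and function names below are assumptions chosen only to make the example run.

```python
from typing import Dict, List, Sequence, Tuple


def tail_mean_abs(samples: Sequence[float], sample_rate: int,
                  tail_seconds: float = 0.05) -> float:
    """Mean absolute amplitude over the last `tail_seconds` of a clip."""
    if not samples:
        return 0.0
    n = max(1, int(sample_rate * tail_seconds))
    tail = samples[-n:]
    return sum(abs(x) for x in tail) / len(tail)


def ends_abruptly(samples: Sequence[float], sample_rate: int,
                  threshold: float = 0.01, tail_seconds: float = 0.05) -> bool:
    """Flag clips whose final samples are still loud (no fade-out)."""
    return tail_mean_abs(samples, sample_rate, tail_seconds) > threshold


def stable_segments(runs: List[Dict[str, Tuple[float, float]]],
                    tolerance: float = 0.1) -> List[str]:
    """Keep utterance ids whose (start, end) in every run match the first run."""
    if not runs:
        return []
    kept = []
    for utt_id, (ref_start, ref_end) in runs[0].items():
        agrees = all(
            utt_id in run
            and abs(run[utt_id][0] - ref_start) <= tolerance
            and abs(run[utt_id][1] - ref_end) <= tolerance
            for run in runs[1:]
        )
        if agrees:
            kept.append(utt_id)
    return kept


# Toy clips at a 4 Hz "sample rate" so the tail window is two samples.
faded = [0.4, 0.2, 0.0, 0.0]
cut = [0.0, 0.2, 0.4, 0.5]

# Toy alignments (seconds) from three runs with different minimum window sizes.
runs = [
    {"u1": (0.0, 2.0), "u2": (2.0, 5.0)},
    {"u1": (0.0, 2.05), "u2": (2.0, 5.0)},
    {"u1": (0.0, 2.0), "u2": (2.6, 5.0)},
]
kept = stable_segments(runs)
```

Here `u2` is dropped because its start time moves by 0.6 seconds in the third run, while `u1` stays within the tolerance in all runs.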