Arxiv:2104.04896V1 [Eess.AS] 11 Apr 2021 Struction Including Data Pre-Processing, Audio-Text Align- Sion of the Dataset
Total Page:16
File Type:pdf, Size:1020Kb
NeMo Toolbox for Speech Dataset Construction Evelina Bakhturina, Vitaly Lavrukhin, Boris Ginsburg NVIDIA, Santa Clara, USA {ebakhturina, vlavrukhin, bginsburg}@nvidia.com Abstract iteratively improved the text-speech alignment by retraining an ASR model. Finally, Section 7 discusses the Russian LS dataset In this paper, we introduce a new toolbox for constructing statistics and the Russian ASR model trained on the new dataset. speech datasets from long audio recordings and raw reference texts. We develop tools for each step of the speech dataset construction pipeline including data pre-processing, audio-text alignment, data post-processing and filtering. The proposed 2. Audio and Text processing pipeline also supports human-in-the-loop to address text-audio The audio recordings are directly downloaded from Lib- mismatch issues and remove samples that do not satisfy the riVox.org. We convert MP3 audio files to the mono WAV format quality requirements. We demonstrate the toolbox’s efficiency sampled at 16 kHz. We also cut the first 3 seconds of the audio by building the Russian LibriSpeech corpus (RuLS) from Lib- files to shorten the duration of the LibriVox preamble. riVox audiobooks. The toolbox is open sourced in NeMo frame- Unlike the English part of the LibriVox project which heav- work. The RuLS corpus is released at OpenSLR. ily relies on the Gutenburg project texts1 in TXT format, the Russian LibriVox section mostly uses reference text in PDF 1. Introduction or JPEG formats. We abandoned the idea of applying PDF to text converter or using an Optical Character Recognition (OCR) Neural Network (NN) based automatic speech recognition model because 1) the text recovered from book images would (ASR) systems have made a huge leap [1, 2, 3, 4] in recent contain a significant number of artifacts due to poor quality years. One of the key enabling factors was the release of the of the scanned pages 2) many LibriVox Russian books con- LibriSpeech corpus [5] which contains about 1000 hours of tain archaic orthography which would make OCR perform even English speech. It became the standard benchmark for the worse. Thus, we resort to alternative text resources to obtain English ASR research [2, 4]. The large non-English speech TXT versions of the books, and used only about 40% of Lib- datasets – AISHELL [6, 7], Mozilla Common Voice (MCV) [8], riVox Russian audiobooks in our experiments. The downside CSS10 [9], and the Multilingual LibriSpeech (MLS) corpus of this approach is that the book versions we found could differ [10] have made a significant contribution towards encouraging from the text used during the audio recording. The mismatch ASR model development for non-English languages. between audio and the reference text could not only result in There are two methods traditionally used to create a speech poor alignments but also hurt the model that uses such data (see dataset: Section 5 for details). • Record short text utterances read by speakers. Once we have the reference text available, we split long text • Given a long audio file and corresponding text, split the files into short utterances and prepare the text for alignment. audio and text into short utterances. Text pre-processing includes the following steps: In our work, we focus on the second option: create short audio- • Normalize numbers. Expanding numbers to their full text pairs from original long files. The audio segmentation equivalent is a required step for the CTC-Segmentation could be challenging, but this method can produce large datasets algorithm [13] to produce valid alignments. Unfortu- from various sources and domains with minimal cost. There nately, good normalization tools are rarely open sourced are many public sources of speech data that could be used for languages other then English. As a naive solution, for dataset creation, e.g. LibriVox audiobooks [11], YouTube we propose a simple conversion of numbers to their digit videos under Creative Commons license etc based expanded equivalent. For example, given the num- Our contribution is two-fold: ber ’19’ we can convert it to ’one nine’ or ’один де- вять’. Note, that with this naive approach, the segments 1. We develop tools for each step of the speech dataset con- with the numbers should be excluded from the final ver- arXiv:2104.04896v1 [eess.AS] 11 Apr 2021 struction including data pre-processing, audio-text align- sion of the dataset. The aim of this step is to make sure ment, data post-processing and filtering (see Figure 1). the alignments of the subsequent segments are not bro- 2. We used the toolbox to build the Russian LibriSpeech ken. A better normalization solution would be to use a corpus (RuLS) , which contains 100 hours of speech. library such as num2words2 to convert numbers to their spoken equivalent. The rest of the paper is organized as follows. Section 2 describes audio-text processing. Section 3 explains speech • Remove inaudible fragments. For LibriVox text, such and text alignment using the enhanced Connectionist Tempo- fragments are automatically detected based on curly ral Classification (CTC)-based [12] segmentation tool. Section and/or square brackets. 4 and 5 describe Speech Data Explorer (SDE) and how it is used for the data error analysis. Section 6 explains how we 1http://gutenberg.org/ 2https://github.com/savoirfairelinux/ Preprint. Submitted to INTERSPEECH-21 num2words Figure 1: NeMo dataset construction workflow. • Substitute Russian abbreviations with their full spoken the following functionality: equivalent. • global dataset statistics (alphabet, vocabulary, duration- • Convert non-Russian letters to Russian equivalent. Many based histograms); old Russian books contain snippets of French texts, so • navigation across the dataset using an interactive data- we apply simple transliteration of non Russian characters table that supports sorting and filtering; to their most closest Russian equivalent. For instance, the letter ’i’ is changed to ’и’ and ’j’ to ’ж’. Although, • inspection of individual utterances (plotting waveforms, the converted words hardly sound similar to the origi- spectrograms, custom attributes and playing audio); nal word, this step helps to fill up the gaps for the CTC- • error analysis (word error rate, character error rate, word Segmentation algorithm to successfully find the align- match rate, word accuracy, display highlighted diff. ments. Note, such segments are removed from the final Some of mentioned features is demonstrated on Figure 2. To version of the dataset. the best of our knowledge, SDE is the first open source tool • Split the original text into short utterances based on the for interactive speech dataset analysis. Next section describes end of sentence punctuation marks. We use regular ex- how SDE was used for error analysis and filtering during the pressions for this step. In our experiments, we also tried construction of the RuLS dataset. splitting the sentences in the middle to reduce the aver- age duration of the final samples. However, the segmen- 5. Error analysis and filtering tation results deteriorate with such a strategy and more partially cut words were observed (see Section 5 for de- SDE helps to find utterances which have: tails). We believe this is due to the relatively low accu- • out-of-alphabet characters; racy of the original Russian ASR model. • out-of-vocabulary words; 3. Text and audio alignment • high and low character count. To align text utterances constructed during step 2 with the au- Test on of-vocabulary words is used to detect typos from OCR dio files we used CTC-Segmentation algorithm proposed by process or new valid words. Test on high/low character rate can Kürzinger et al. [13] since we found this method to be robust be used to find misalignment or errors in reference transcripts. comparing to other tools such as Gentle [14] or Aeneas [15]. Note that correctly aligned segments with fast paced speech also We used the CTC-segmentation package3 [13], and speed up it have high character rate, while segments that contain long non- by enabling parallel segmentation of multiple files. speech fragments have low character rate. Dataset designer de- The CTC-segmentation requires an CTC-based ASR model cides whether to filter data samples with thresholds on character pre-trained to extract character log probabilities for each time rate automatically or to validate outliers manually. step. We train Russian ASR model using the cross-language SDE can be also used to analyse samples with high WER or transfer learning [16] from English QuartzNet 15x5 4, which character error rate (CER) after samples to determine if the high was trained on 3000 hours of public English speech. The Rus- WER / CER was caused by misalignment, errors in reference sian model was trained on 80 hours of the Russian subset of the transcripts or low accuracy of the ASR model. MCV (ver. 5.1) using NovoGrad optimizer [17] with 0.0015 learning rate, weight decay 0.001, a cosine annealing learning 5.1. RuLS error analysis and filtering rate policy. Overall, the CTC-Segmentation produces accurate alignments given that the Russian ASR model was fine-tuned on only 80 4. NeMo Speech Data Explorer hours of data and a greedy decoding strategy without Language Model (LM) was used to produce time step probabilities. How- After the initial segmentation is complete, we analyze the qual- ever, we notice that some segmented audio clips are ending ity of the data using Speech Data Explorer (SDE). SDE provides abruptly and miss ending phonemes of the utterance or include 3https://pypi.org/project/ctc-segmentation/ the first phonemes of the next utterance. Figure 3 shows one 4https://ngc.nvidia.com/catalog/models/ such example, where the audio clip ends before the fading of nvidia:nemospeechmodels the waveform occurs. To filter out such examples we apply the Figure 2: Speech Data Explorer. threshold on the mean absolute (MA) signal value at the end of To improve the dataset alignment quality we re-tune the the audio clip.