Automatic Speech Recognition for Low-Resource and Morphologically Complex Languages
Ethan Morris

April 2021

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree of
Master of Science in Computer Engineering

Department of Computer Engineering
Rochester Institute of Technology


Signature Sheet

Committee Approval:

Emily Prud'hommeaux, Advisor
Department of Computer Science, Boston College

Alexander Loui
Department of Computer Engineering

Andreas Savakis
Department of Computer Engineering


Acknowledgments

I would like to thank Dr. Emily Prud'hommeaux for her continued support, guidance, and invaluable suggestions. I am also thankful to Dr. Alexander Loui and Dr. Andreas Savakis for serving on the thesis committee. I also wish to thank Dr. Raymond Ptucha, Dr. Christopher Homan, and Robert Jimerson for their ideas, advice, and general expertise.


Abstract

The application of deep neural networks to the task of acoustic modeling for automatic speech recognition (ASR) has resulted in dramatic decreases in word error rates, allowing this technology to be used in smartphones and personal home assistants for high-resource languages. Developing ASR models of this caliber, however, requires hundreds or thousands of hours of transcribed speech recordings, which presents challenges for most of the world's languages. In this work, we investigate the applicability of three distinct architectures that have previously been used for ASR in languages with limited training resources. We tested these architectures using publicly available ASR datasets for several typologically and orthographically diverse languages, whose data was produced under a variety of conditions using different speech collection strategies, practices, and equipment. Additionally, we performed data augmentation on this audio, such that the amount of training data could increase nearly tenfold, synthetically creating a higher-resource training condition. The architectures and their individual components were modified and their parameters explored so that we could find the best-fitting combination of features and modeling schemes for a specific language's morphology. Our results point to the importance of considering language-specific and corpus-specific factors and experimenting with multiple approaches when developing ASR systems for resource-constrained languages.


Contents

Signature Sheet
Acknowledgments
Abstract
Table of Contents
List of Figures
List of Tables

1 Introduction and Motivation
  1.1 Introduction and Motivation
  1.2 Document Structure

2 Background
  2.1 Introduction
  2.2 Automatic Speech Recognition
  2.3 Feature Extraction
    2.3.1 Spectral Features
    2.3.2 Mel Filterbanks
    2.3.3 Mel-frequency Cepstral Coefficients (MFCCs)
    2.3.4 Feature Space Maximum Likelihood Linear Regression (fMLLR)
    2.3.5 I/X Vector
  2.4 Acoustic Models
    2.4.1 Hidden Markov Model
    2.4.2 Gaussian Mixture Model
    2.4.3 Deep Neural Network
    2.4.4 Recurrent Neural Networks
  2.5 Language Models
    2.5.1 N-Gram
    2.5.2 Neural
    2.5.3 Trans-dimensional Random Field (TRF)
  2.6 Data Augmentation
    2.6.1 Pitch/Speed/Noise
    2.6.2 SpecAugment
    2.6.3 Speech Synthesis
  2.7 Evaluation Metrics
    2.7.1 Character/Word Error Rate
    2.7.2 Perplexity
  2.8 Frameworks
    2.8.1 Kaldi
    2.8.2 DeepSpeech
    2.8.3 Wav2Letter/Wav2Vec
    2.8.4 WireNet

3 Datasets
  3.1 Amharic
    3.1.1 Language Description
    3.1.2 Prior ASR Work
  3.2 Bemba
    3.2.1 Language Description
    3.2.2 Prior ASR Work
  3.3 Iban
    3.3.1 Language Description
    3.3.2 Prior ASR Work
  3.4 Seneca
    3.4.1 Language Description
    3.4.2 Prior ASR Work
  3.5 Swahili
    3.5.1 Language Description
    3.5.2 Prior ASR Work
  3.6 Vietnamese
    3.6.1 Language Description
    3.6.2 Prior ASR Work
  3.7 Wolof
    3.7.1 Language Description
    3.7.2 Prior ASR Work

4 Methodology
  4.1 Language Comparison
    4.1.1 Average Utterance Length
    4.1.2 Quality Measures
  4.2 Kaldi
  4.3 WireNet
    4.3.1 Transfer Learning
    4.3.2 Data Augmentation
    4.3.3 Architecture Exploration
    4.3.4 Feature Comparison
  4.4 Speaker Re-partitioning
  4.5 Language Model Exploration

5 Results
  5.1 Kaldi Results
    5.1.1 Overall
    5.1.2 Data Augmentation
  5.2 WireNet
    5.2.1 Overall Results
    5.2.2 Architecture Exploration
    5.2.3 Transfer Learning
    5.2.4 Feature Comparison
  5.3 Language Comparison
  5.4 Language Model Exploration
  5.5 Speaker Overlap

6 Conclusions
  6.1 Conclusions
  6.2 Future Work

Bibliography


List of Figures

2.1 Generic pipeline of an ASR system.
2.2 An overview of the most basic feature extraction process, including the different stopping points for various features [1].
2.3 Augmentations applied to the base input, given at the top. From top to bottom, the figures depict the log mel spectrogram of the base input with no augmentation, time warp, frequency masking, and time masking applied [2].
2.4 Left: The overall WireNet architecture. Right: A bottleneck block consisting of 9 paths, each with bottleneck filters centered by filters of different width to capture different temporal dependencies. Each layer shows (# input channels, filter width, # output channels).
4.1 Comparison of the WADA SNR algorithm and the average estimated NIST SNR against an artificially corrupted database [3].
5.1 CER vs. training epochs of the Swahili language at specific bottleneck depths.
5.2 U-Net architecture where the convolutional size increases to a maximum of 1024, before reducing back to the input size [4].
5.3 CER vs. training epochs through the stages of training the Swahili language with transfer learning.
5.4 CER vs. training epochs through the stages of training the Swahili language without transfer learning.
5.5 Best produced WER vs. NIST SNR across all languages, training set depicted.
5.6 Best produced WER vs. WADA SNR across all languages, training set depicted.
5.7 WER vs. % of Test Set for the Bemba Language.
5.8 WER vs. % of Test Set for the Seneca Language.
5.9 WER vs. % of Test Set for the Wolof Language.


List of Tables

3.1 Amharic Dataset Information
3.2 Bemba Dataset Information
3.3 Iban Dataset Information
3.4 Seneca Dataset Information
3.5 Swahili Dataset Information
3.6 Vietnamese Dataset Information
3.7 Wolof Dataset Information
5.1 Overall Kaldi results for the two best architectures across all languages.
5.2 Kaldi results for the two best architectures across the Wolof language with data augmentation.
5.3 Overall WireNet results across all languages.
5.4 WireNet architecture exploration for Swahili, varying the width and depth of the system.