Automatic Speech Recognition Adaptation for Various Noise Levels
Automatic Speech Recognition Adaptation for Various Noise Levels

by

Azhar Sabah Abdulaziz

Bachelor of Science in Computer Engineering
College of Engineering, University of Mosul, 2002

Master of Science in Communication and Signal Processing
Electrical and Electronics Engineering
College of Agriculture, Science and Engineering, Newcastle University, 2009

A dissertation submitted to the College of Engineering, Department of Electrical and Computer Engineering, at Florida Institute of Technology in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Engineering.

Melbourne, Florida
May 2018

© Copyright 2018 Azhar Sabah Abdulaziz. All Rights Reserved.
The author grants permission to make single copies.

We the undersigned committee hereby recommend that the attached document, "Automatic Speech Recognition Adaptation for Various Noise Levels", a dissertation by Azhar Sabah Abdulaziz, be accepted as fulfilling in part the requirements for the degree of Doctor of Philosophy in Computer Engineering.

Veton Z. Këpuska, Ph.D.
Professor, Electrical and Computer Engineering
Dissertation Advisor

Samuel Kozaitis, Ph.D.
Professor and Department Head, Electrical and Computer Engineering

Josko Zec, Ph.D.
Associate Professor, Electrical and Computer Engineering

Nezamoddin Nezamoddini-Kachouie, Ph.D.
Assistant Professor, Mathematical Sciences

Abstract

TITLE: Automatic Speech Recognition Adaptation for Various Noise Levels
AUTHOR: Azhar Sabah Abdulaziz
MAJOR ADVISOR: Veton Z. Këpuska, Ph.D.

Automatic speech recognition (ASR) is a set of complicated algorithms that converts an intended spoken utterance into textual form. Acoustic features extracted from the speech signal are matched against a trained network of linguistic and acoustic models. ASR performance degrades significantly when the ambient noise differs from that of the training data.
Many approaches have been introduced to address this problem, with varying degrees of complexity and improvement. Solutions to this issue generally fall into three categories: empowering the features, training a general acoustic model, and transforming models to match the noisy features.

Acoustic noise is added to the training speech data after collection for two reasons: first, the data are usually recorded in a specific environment; second, adding noise afterward allows the environments to be controlled during the training and testing phases. The speech and noise signals are usually combined in the electrical domain by straightforward linear addition. Although this procedure is commonly used, it is investigated in depth in this research. It is shown that linear addition is no more than an approximation of the real acoustic combination, and that it is valid only if the speech and noise are non-coherent signals.

The adaptive model switching (AMS) solution is proposed: the ASR measures the noise level and then picks the model expected to produce the fewest errors. This solution is a trade-off between model generalization and model transformation, so that both the error and speed costs are kept as low as possible. The short time of silence (STS) estimator, a signal-to-noise ratio (SNR) level detector, was designed specifically for the proposed system.

The proposed AMS approach is a general recipe that could be applied to other ASR systems, although it was tested on a Gaussian Mixture Model-Hidden Markov Model (GMM-HMM) recognizer. The AMS ASR outperformed both model generalization and multiple-decoder maximum-score voting in accuracy and decoding speed. The average relative error rate reduction was around 34.11%, with a relative decoding speed improvement of about 37.79%, both compared to the baseline ASR.
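The two ideas above, mixing noise into clean speech by linear addition at a target SNR and then switching to the acoustic model trained closest to an estimated SNR, can be sketched as follows. This is a minimal illustration only: the function names, the trained-SNR grid, and the use of an externally supplied SNR estimate are hypothetical placeholders, not the dissertation's actual STS or Sphinx code, and the mixing step relies on the non-coherence assumption analyzed in Chapter 3.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Linearly add noise to speech at a target SNR in dB.

    Hypothetical sketch: assumes speech and noise are non-coherent
    1-D float arrays, per the approximation discussed in Chapter 3.
    """
    noise = noise[: len(speech)]                 # trim noise to speech length
    p_speech = np.mean(speech ** 2)              # average speech power
    p_noise = np.mean(noise ** 2)                # average noise power
    # Scale so that 10*log10(p_speech / p_scaled_noise) equals snr_db.
    scale = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

# Hypothetical AMS switching: decode with the model whose training SNR
# is nearest the estimate produced by an SNR detector such as STS.
TRAINED_SNRS_DB = [0, 5, 10, 20, 30]             # illustrative model grid

def pick_model(estimated_snr_db):
    """Return the training SNR of the model closest to the estimate."""
    return min(TRAINED_SNRS_DB, key=lambda s: abs(s - estimated_snr_db))

# Example: mix white noise at 10 dB, then pick the matching model.
rng = np.random.default_rng(0)
speech = np.sin(2 * np.pi * 440 * np.arange(16000) / 16000.0)
noisy = mix_at_snr(speech, rng.standard_normal(16000), 10.0)
model_snr = pick_model(10.0)                     # selects the 10 dB model
```

Because the noise is rescaled against the measured speech power, the achieved SNR of the mixture matches the requested value exactly under this linear-addition approximation.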
Contents

Contents v
List of Figures ix
List of Tables xii
Abbreviations xiii
Acknowledgments xvi
Dedication xvii

1 Introduction 1
  1.1 Automatic Speech Recognition 1
  1.2 ASR in Noisy Environment 3
  1.3 Outline 5

2 Review of Speech Recognition Technologies 6
  2.1 Introduction 6
  2.2 The Motivations for the ASR Design 7
  2.3 The Automatic Speech Recognition Design 9
    2.3.1 Feature Extraction 10
    2.3.2 Decoder 12
    2.3.3 Knowledge Base 15
  2.4 Knowledge Sources Integration 16
  2.5 Decoder Methodologies 18
    2.5.1 Hidden Markov Model 20
    2.5.2 Neural Networks 24
  2.6 Towards Robust ASR 27
    2.6.1 Feature Enhancement 27
    2.6.2 Transformation 31
    2.6.3 Generalization 32
    2.6.4 Multiple Decoder Combination 34

3 Acoustic Noise Simulation 35
  3.1 Introduction 35
  3.2 The SPL and the Total SPL 36
  3.3 Total Microphone Voltage for N Acoustic Sources 39
  3.4 Experimental Results 42
    3.4.1 Monotonic Acoustic Signal Test 43
    3.4.2 Real Speech Audio Test 46
  3.5 The Divergence between the Two Approaches 48
  3.6 Conclusion 53

4 Signal-to-Noise Ratio Estimation 54
  4.1 Introduction 54
  4.2 Previous SNR Estimation Approaches 55
    4.2.1 The Statistical Closed-Form Approach 56
    4.2.2 The Audio Power Histogram Approach 59
    4.2.3 The Supervised Method Approach 61
  4.3 The Proposed SNR Estimation Approach 63
    4.3.1 The Short-Time of Silence SNR Estimator 63
    4.3.2 The Noise-Speech Power Offset 67
  4.4 Experimental Results 69
  4.5 Discussion 73

5 Experimental Tools and Environments 77
  5.1 Introduction 77
  5.2 CMU Sphinx ASR Toolkit 78
  5.3 Feature Extraction 78
  5.4 GMM-HMM Acoustic Model Training 80
  5.5 Sphinxtrain Tool Procedure 84
  5.6 The Data 86

6 Acoustic Noise and ASR Performance 87
  6.1 Introduction 87
  6.2 Performance Metrics 88
  6.3 Analyzing the Noise Effect on Performance 89
    6.3.1 Noise Type Effect 90
    6.3.2 Noise Level Effect 92
  6.4 Conclusion 94

7 Adaptive Model Switching (AMS) Speech Recognition 95
  7.1 Introduction 95
  7.2 System Design 96
  7.3 Adaptive Model Switching ASR Training 97
    7.3.1 Acoustic Models Training 97
    7.3.2 STS-SNR Training 99
  7.4 AMS Decoding Algorithm 101
  7.5 Silence Samples Update 103
  7.6 Conclusion 104

8 Results and Discussion 105
  8.1 Introduction 105
  8.2 The Word Error Rate (WER) 106
    8.2.1 White Noise Experiment 107
    8.2.2 Door Slam Noise Experiment 109
    8.2.3 Babble Noise Experiment 111
  8.3 WER Relative Improvement Comparison 114
  8.4 The Recognition Speed 115
  8.5 Conclusion 118

9 Conclusions 120

Appendices 122

A Noisy TIMIT Corpus 123
  A.1 TIMIT Corpus 123
  A.2 Noisy TIMIT Data Structure 125
  A.3 Noise Types and Ranges 126

List of Figures

1.1 General schematic diagram for the LVCSR system 2
2.1 Human speech production/perception process 7
2.2 A simple speech production/perception model 8
2.3 The LVCSR general modules' functions 9
2.4 PLP and MFCC analysis comparison 11
2.5 Knowledge sources hierarchy for ASR 15
2.6 ASR bottom-up approach for knowledge integration 17
2.7 Top-down approach for knowledge integration 18
2.8 Blackboard knowledge sources integration in ASR 19
2.9 A simple two-state Markov model 21
2.10 HMM word model and sub-phoneme 23
2.11 Deep neural network (DNN) architecture 26
2.12 Conventional MFCC, RASTA-PLP and PNCC 29
2.13 PNCC vs. MFCC WER 30
2.14 Joint training framework example 33
3.1 Monotonic acoustic signal synchronization 44
3.2 Monotonic signal addition experiment 45
3.3 Synchronization pulse train 47
3.4 The divergence analysis scenario 49
3.5 Factors affecting vector-linear divergence 52
4.1 The closed-form SNR response for frame-based noise 58
4.2 The closed-form SNR response for fixed 10 dB noise 58
4.3 The NIST-STNR approach 59
4.4 Comparison of the WADA and NIST SNR algorithms 61
4.5 Regression, WADA and NIST SNR error 62
4.6 The Short-Time-Silence estimator (STS-SNR) algorithm 65
4.7 The effect of STS step 4 66
4.8 Finding the offset for STS-SNR 68
4.9 MAE results for NIST, WADA and STS using the NOIZEUS corpus 71
4.10 MAE of the NIST, WADA and STS estimators using the TIMIT corpus 72
4.11 Mean estimation of NIST, WADA and STS using the NOIZEUS corpus 73
4.12 Comparing different SNR estimators for different noise types 74
5.1 The speech features used in experiments 79
5.2 A simple 3-state HMM 82
6.1 ASR WER degradation in a noisy environment 90
6.2 Different noise types of speech on the AWGN AM 91
6.3 Baseline ASR noise level test 93
7.1 The proposed AMS speech recognition 96
7.2 STS-SNR estimator block diagram 99
8.1 Multiple decoder maximum MAP decoder 106
8.2 The AMS WER performance of TIMIT on AWGN 107
8.3 The AMS WER performance of AN4 on AWGN 108
8.4 The AMS WER performance of PDAmWSJ on AWGN 108
8.5 The AMS WER performance of TIMIT on door slam noise 109
8.6 The AMS WER performance of AN4 on door slam noise 110
8.7 The AMS WER performance of PDAmWSJ on door slam noise 110
8.8 Babble noise TIMIT test 112
8.9 Babble noise AN4 test 113
8.10 Babble noise PDAmWSJ test 113
8.11 The average decoding speed performance of TIMIT 116
8.12 The average decoding speed performance for AN4 117
8.13 The average decoding performance for PDAmWSJ 118
A.1 Noisy TIMIT Corpus directory structure 125
A.2 The spectrum of different colors of noise