Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions
Hindawi Publishing Corporation
EURASIP Journal on Advances in Signal Processing
Volume 2007, Article ID 16921, 10 pages
doi:10.1155/2007/16921

Research Article

Robust In-Car Speech Recognition Based on Nonlinear Multiple Regressions

Weifeng Li,1 Kazuya Takeda,1 and Fumitada Itakura2

1 Graduate School of Information Science, Nagoya University, Nagoya 464-8603, Japan
2 Department of Information Engineering, Faculty of Science and Technology, Meijo University, Nagoya 468-8502, Japan

Received 31 January 2006; Revised 10 August 2006; Accepted 29 October 2006

Recommended by S. Parthasarathy

We address issues for improving hands-free speech recognition performance in different car environments using a single distant microphone. In this paper, we propose a nonlinear multiple-regression-based enhancement method for in-car speech recognition. In order to develop a data-driven in-car recognition system, we develop an effective algorithm for adapting the regression parameters to different driving conditions. We also devise a model compensation scheme by synthesizing the training data using the optimal regression parameters and by selecting the optimal HMM for the test speech. Based on isolated word recognition experiments conducted in 15 real car environments, the proposed adaptive regression approach achieves average relative word error rate (WER) reductions of 52.5% and 14.8% compared to the original noisy speech and the ETSI advanced front end, respectively.

Copyright © 2007 Weifeng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

The mismatch between training and testing conditions is one of the most challenging and important problems in automatic speech recognition (ASR). This mismatch may be caused by a number of factors, such as background noise, speaker variation, changes in speaking style, channel effects, and so on. State-of-the-art ASR techniques for removing the mismatch usually fall into three categories [1]: robust features, speech enhancement, and model compensation.

The first approach seeks parameterizations that are fundamentally immune to noise. The most widely used speech recognition features are the Mel-frequency cepstral coefficients (MFCCs) [2]. The lack of robustness of MFCCs in noisy or mismatched conditions has led many researchers to investigate robust variants or novel feature extraction algorithms. Some of these are perceptually motivated, for example, PLP [3] and RASTA [4], while others are related to auditory processing, for example, the gammatone filter [5] and the EIH model [6].

The speech enhancement approach aims to perform noise reduction by transforming noisy speech (or features) into an estimate that more closely resembles clean speech (or features). Examples of this approach include spectral subtraction [7], Wiener filtering, cepstral mean normalization (CMN) [8], codeword-dependent cepstral normalization (CDCN) [9], and so on. Spectral subtraction was originally proposed for enhancing speech quality, but it can also serve as a preprocessing step for recognition; however, its performance suffers from annoying "musical tone" artifacts. CMN performs a simple linear transformation that removes the cepstral bias; although effective against convolutional distortions, it does not cope with additive noise. CDCN can be computationally intensive, since it depends on online estimation of the channel and additive noise through an iterative EM procedure.

The model compensation approach aims to adapt or transform acoustic models to match the noisy speech features of a new testing environment. Representative methods include multistyle training [8], maximum-likelihood linear regression (MLLR) [10], and Jacobian adaptation [11, 12]. Their main disadvantage is that they require retraining of the recognizer or adaptation data, which leads to much higher complexity than the speech enhancement approach.

Most speech enhancement and model compensation methods are accomplished by linear functions such as simple bias removal, affine transformation, linear regression, and so on. However, it is well known that the distortion caused even by purely additive noise is highly nonlinear in the log-spectral or cepstral domain. Therefore, a nonlinear transformation or compensation is more appropriate.
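This nonlinearity can be made concrete with a standard derivation (not reproduced from this paper, and resting on the usual assumption that speech and noise are uncorrelated and additive in the power-spectral domain):

\[
\log |Y(f)|^{2} \;=\; \log\!\left(|X(f)|^{2} + |N(f)|^{2}\right)
\;=\; \log |X(f)|^{2} \;+\; \log\!\left(1 + e^{\,\log |N(f)|^{2} - \log |X(f)|^{2}}\right),
\]

where X, N, and Y denote the clean speech, noise, and observed noisy speech spectra. The second term depends nonlinearly on both the speech and noise levels, so no single bias removal or affine correction in the log-spectral (or cepstral) domain can compensate for it exactly.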
The use of a neural network allows us to learn automatically the nonlinear mapping functions between the reference and testing environments. Such a network can handle additive noise, reverberation, channel mismatches, and combinations of these. Neural-network-based feature enhancement has been used in conjunction with a speech recognizer. For example, Sorensen used a multilayer network for noise reduction in isolated word recognition under F-16 jet noise [13]. Yuk and Flanagan employed neural networks to perform telephone speech recognition [14]. However, the feature enhancement they implemented was performed in the cepstral domain, and the clean features were estimated from the noisy features only.

In previous work, we proposed a new and effective multimicrophone speech enhancement approach based on multiple regressions of log spectra [15], which used multiple spatially distributed microphones. The idea is to approximate the log spectra of a close-talking microphone by effectively combining the log spectra of distant microphones. In this paper, we extend this idea to the single-microphone case and propose that the log spectra of clean speech be approximated through nonlinear regressions of the log spectra of the observed noisy speech and the estimated noise, using a multilayer perceptron (MLP) neural network. Our neural-network-based feature enhancement method incorporates the noise information and can be viewed as a generalized log spectral subtraction.
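As a rough illustration of this kind of regression (a minimal sketch, not the authors' implementation: the network size, learning rate, epoch count, and synthetic data are all arbitrary assumptions), the following trains a one-hidden-layer MLP to map noisy log-MFB frames, concatenated with an estimated noise log spectrum, to clean log-MFB frames:

```python
import numpy as np

# Minimal sketch of MLP-based log-spectral regression (NOT the authors'
# code): a one-hidden-layer network maps [noisy log MFB, estimated noise
# log MFB] to the clean log MFB. Data here is synthetic for illustration.

rng = np.random.default_rng(0)

D, H, N = 24, 16, 5000          # MFB channels, hidden units, training frames

clean = rng.normal(0.0, 1.0, (N, D))           # stand-in clean log spectra
noise = rng.normal(-1.0, 0.5, (N, D))          # stand-in noise log spectra
noisy = np.log(np.exp(clean) + np.exp(noise))  # additive mixing, log domain

X = np.hstack([noisy, noise])   # network input: noisy speech + noise estimate
T = clean                       # regression target: clean log MFB

W1 = rng.normal(0, 0.1, (2 * D, H)); b1 = np.zeros(H)
W2 = rng.normal(0, 0.1, (H, D));     b2 = np.zeros(D)

lr = 0.05
for epoch in range(500):
    h = np.tanh(X @ W1 + b1)    # hidden layer
    y = h @ W2 + b2             # estimated clean log MFB
    err = y - T
    # Backpropagate the mean-squared error (gradients averaged over N).
    gW2 = h.T @ err / N;  gb2 = err.mean(axis=0)
    dh = (err @ W2.T) * (1.0 - h ** 2)
    gW1 = X.T @ dh / N;   gb1 = dh.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1
    W2 -= lr * gW2; b2 -= lr * gb2

final_mse = np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - T) ** 2)
print(f"training MSE: {final_mse:.4f}")
```

Because the noise estimate is part of the input, the learned mapping can, in the limit of a linear network, reduce to a log-spectral-subtraction-like rule; the hidden layer is what generalizes it.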
In order to develop a data-driven in-car recognition system, we develop an effective algorithm for adapting the regression parameters to different driving conditions. In order to further reduce the mismatch between training and testing conditions, we synthesize the training data using the optimal regression parameters and train multiple hidden Markov models (HMMs) over the synthesized data. We also develop several HMM selection strategies. The devised system results in a universal in-car speech recognition framework that includes both speech enhancement and model compensation.

The organization of this paper is as follows. In Section 2, we describe the in-car speech corpus used in this paper. In Section 3, we present the regression-based feature enhancement algorithm, and the experimental evaluations are outlined in Section 4. In Section 5, we present the environmental adaptation and model compensation algorithms. The performance evaluation of the adaptive regression-based speech recognition framework is then reported in Section 6. Finally, Section 7 concludes this paper.

2. IN-CAR SPEECH DATA AND SPEECH ANALYSIS

A data collection vehicle (DCV) has been specially designed for developing the in-car speech corpus at the Center for Integrated Acoustic Information Research (CIAIR), Nagoya University, Nagoya, Japan [16]. The driver wears a headset with a close-talking microphone (#1 in Figure 1). Five spatially distributed microphones (#3 to #7) are placed around the driver. Among them, microphone #6, located on the visor in front of the speaker (driver), is the closest to the speaker. The speech recorded at this microphone (also called the "visor mic.") is used for speech recognition in this paper. A four-element linear microphone array (#9 to #12) with an interelement spacing of 5 cm is located at the visor position.

Figure 1: Side view (top) and top view (bottom) of the arrangement of multiple spatially distributed microphones and the linear array in the data collection vehicle.

The test data comprise Japanese 50-word sets recorded under 15 driving conditions (3 driving environments × 5 in-car states = 15 driving conditions, as listed in Table 1). Table 2 shows the average signal-to-noise ratio (SNR) for each driving condition. For each driving condition, 50 words are uttered by each of 18 speakers. A total of 7000 phonetically balanced sentences (uttered by 202 male speakers and 91 female speakers) were recorded for acoustic modeling; of these, 3600 were collected in the idling-normal condition and 3400 while driving the DCV on the streets near Nagoya University (the city-normal condition).

Speech signals are digitized into 16 bits at a sampling frequency of 16 kHz. For spectral analysis, a 24-channel Mel filter-bank (MFB) analysis is performed on 25-millisecond windowed speech with a frame shift of 10 milliseconds. Spectral components below 250 Hz are filtered out to compensate for the spectrum of the engine noise, which is concentrated in the lower-frequency region. Log MFB parameters are then estimated. The estimated log MFB vectors are transformed into 12 mean-normalized Mel-frequency cepstral coefficients (CMN-MFCC) using the discrete cosine transform (DCT) and mean normalization, after which the time derivatives (Δ CMN-MFCC) are calculated.
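A minimal sketch of this front end, using the parameter values quoted above (librosa and scipy are our tooling choices here, not the paper's; the exact windowing, liftering, and delta-computation details may differ from the authors' analysis):

```python
import numpy as np
import librosa
import scipy.fftpack

sr = 16000
y = librosa.tone(440, sr=sr, duration=1.0)   # placeholder test signal

# 24-channel Mel filter-bank analysis: 25 ms windows (400 samples),
# 10 ms shift (160 samples); fmin=250 drops components below 250 Hz,
# approximating the engine-noise high-pass described in the text.
mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=400, hop_length=160, win_length=400,
    n_mels=24, fmin=250.0, power=2.0)
log_mfb = np.log(mel + 1e-10)                # log MFB parameters

# DCT to 12 cepstral coefficients (dropping c0), then cepstral mean
# normalization (CMN) over the utterance.
mfcc = scipy.fftpack.dct(log_mfb, axis=0, type=2, norm='ortho')[1:13]
cmn_mfcc = mfcc - mfcc.mean(axis=1, keepdims=True)

# Time derivatives (delta CMN-MFCC), here via a simple first difference;
# regression-based deltas are the more common choice in HMM front ends.
delta = np.gradient(cmn_mfcc, axis=1)

features = np.vstack([cmn_mfcc, delta])      # 24-dimensional feature vectors
print(features.shape)                        # (24, n_frames)
```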