EFFICIENT CEPSTRAL NORMALIZATION FOR ROBUST SPEECH RECOGNITION

Fu-Hua Liu, Richard M. Stern, Xuedong Huang, Alejandro Acero
Department of Electrical and Computer Engineering
School of Computer Science
Carnegie Mellon University
Pittsburgh, PA 15213

ABSTRACT

In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments. We describe the MFCDCN algorithm, an environment-independent extension of the efficient SDCN and FCDCN algorithms developed previously. We compare the performance of these algorithms with the very simple RASTA and cepstral mean normalization procedures, describing the performance of these algorithms in the context of the 1992 DARPA CSR evaluation using secondary microphones, and in the DARPA stress-test evaluation.

1. INTRODUCTION

The need for speech recognition systems and spoken language systems to be robust with respect to their acoustical environment has become more widely appreciated in recent years (e.g. [1]). Results of many studies have demonstrated that even automatic speech recognition systems that are designed to be speaker independent can perform very poorly when they are tested using a different type of microphone or acoustical environment from the one with which they were trained (e.g. [2,3]), even in a relatively quiet office environment. Applications such as speech recognition over telephones, in automobiles, on a factory floor, or outdoors demand an even greater degree of environmental robustness.

Many approaches have been considered in the development of robust speech recognition systems, including techniques based on autoregressive analysis, the use of special distortion measures, the use of auditory models, and the use of microphone arrays, among many other approaches (as reviewed in [1,4]).

In this paper we describe and compare the performance of a series of cepstrum-based procedures that enable the CMU SPHINX-II speech recognition system to maintain a high level of recognition accuracy over a wide variety of acoustical environments. The most recently developed algorithm is multiple fixed codeword-dependent cepstral normalization (MFCDCN). MFCDCN is an extension of a similar algorithm, FCDCN, which provides an additive environmental compensation to cepstral vectors, but in an environment-specific fashion [5]. MFCDCN is less computationally complex than the earlier CDCN algorithm, more accurate than the related SDCN and BSDCN algorithms [6], and it does not require domain-specific training for new acoustical environments. In this paper we describe the performance of MFCDCN and related algorithms, and we compare it to the popular RASTA approach to robustness.

2. EFFICIENT CEPSTRUM-BASED COMPENSATION TECHNIQUES

In this section we describe several of the cepstral normalization techniques we have developed to compensate simultaneously for additive noise and linear filtering. Most of these algorithms are completely data-driven, as the compensation parameters are determined by comparisons between the testing environment and simultaneously-recorded speech samples using the DARPA standard closetalking Sennheiser HMD-414 microphone (referred to as the CLSTLK microphone in this paper). The remaining algorithm, codeword-dependent cepstral normalization (CDCN), is model-based, as the speech that is input to the recognition system is characterized as speech from the CLSTLK microphone that undergoes unknown linear filtering and corruption by unknown additive noise.

In addition, we discuss two other procedures, the RASTA method and cepstral mean normalization, that may be referred to as cepstral-filtering techniques. These procedures do not provide as much improvement as CDCN, MFCDCN, and related algorithms, but they can be implemented with virtually no computational cost.

2.1. Cepstral Normalization Techniques

SDCN. The simplest compensation algorithm, SNR-Dependent Cepstral Normalization (SDCN) [2,4], applies an additive correction in the cepstral domain that depends exclusively on the instantaneous SNR of the signal. This correction vector equals the average difference in cepstra between simultaneous "stereo" recordings of speech samples from both the training and testing environments at each SNR of speech in the testing environment. At high SNRs, this correction vector primarily compensates for differences in spectral tilt between the training and testing environments (in a manner similar to the blind deconvolution procedure first proposed by Stockham et al. [7]), while at low SNRs the vector provides a form of noise subtraction (in a manner similar to the spectral subtraction algorithm first proposed by Boll [8]). The SDCN algorithm is simple and effective, but it requires environment-specific training.

FCDCN. Fixed codeword-dependent cepstral normalization (FCDCN) [4,6] was developed to provide a form of compensation that provides greater recognition accuracy than SDCN, but in a more computationally efficient fashion than the CDCN algorithm, which is summarized below.

The FCDCN algorithm applies an additive correction that depends on the instantaneous SNR of the input (like SDCN), but that can also vary from codeword to codeword (like CDCN):

    \hat{x} = z + r[k, l]

where for each frame \hat{x} represents the estimated cepstral vector of the compensated speech, z is the cepstral vector of the incoming speech in the target environment, k is an index identifying the VQ codeword, l is an index identifying the SNR, and r[k, l] is the correction vector.

The selection of the appropriate codeword is done at the VQ stage, so that the label k is chosen to minimize

    \| z + r[k, l] - c[k] \|^2

where the c[k] are the VQ codewords of the speech in the training database. The new correction vectors are estimated with an EM algorithm that maximizes the likelihood of the data.

The probability density function of x is assumed to be a mixture of Gaussian densities, as in [2,4]:

    p(x) = \sum_{k=0}^{K-1} P[k] \, N_x(c[k], \Sigma)

The cepstra of the corrupted speech are modeled as Gaussian random vectors whose variance depends also on the instantaneous SNR, l, of the input:

    p(z \mid k, r, l) = \frac{C'}{\sigma[l]} \exp\left( -\frac{1}{2\sigma^2[l]} \| z + r[k, l] - c[k] \|^2 \right)

In [4] it is shown that the solution to the EM algorithm is the following iterative algorithm. In practice, convergence is reached after 2 or 3 iterations if we choose the initial values of the correction vectors to be the ones specified by the SDCN algorithm.

1. Assume initial values for r'[k, l] and \sigma^2[l].

2. Estimate f_i[k], the a posteriori probabilities of the mixture components given the correction vectors r'[k, l_i], variances \sigma^2[l], and codebook vectors c[k]:

    f_i[k] = \frac{ \exp\left( -\frac{1}{2\sigma^2[l_i]} \| z_i + r'[k, l_i] - c[k] \|^2 \right) }{ \sum_{p=0}^{K-1} \exp\left( -\frac{1}{2\sigma^2[l_i]} \| z_i + r'[p, l_i] - c[p] \|^2 \right) }

where l_i is the instantaneous SNR of the ith frame.

3. Maximize the likelihood of the complete data by obtaining new estimates for the correction vectors r'[k, l] and corresponding \sigma^2[l]:

    r'[k, l] = \frac{ \sum_{i=0}^{N-1} (x_i - z_i) \, f_i[k] \, \delta[l - l_i] }{ \sum_{i=0}^{N-1} f_i[k] \, \delta[l - l_i] }

    \sigma^2[l] = \frac{ \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} \| x_i - z_i - r'[k, l] \|^2 \, f_i[k] \, \delta[l - l_i] }{ \sum_{i=0}^{N-1} \sum_{k=0}^{K-1} f_i[k] \, \delta[l - l_i] }

4. Stop if convergence has been reached; otherwise go to Step 2.

In the current version of FCDCN, the SNR is varied over a range of 30 dB in 1-dB steps, with the lowest SNR set equal to the estimated noise level. At each SNR, compensation vectors are computed for each of 8 separate VQ clusters.

Figure 1 illustrates some typical compensation vectors obtained with the FCDCN algorithm, computed using the standard closetalking Sennheiser HMD-414 microphone and the unidirectional desktop PCC-160 microphone used as the target environment. The vectors are computed at the extreme SNRs of 0 and 29 dB, as well as at 5 dB. These curves are obtained by calculating the cosine transform of the cepstral compensation vectors, so they provide an estimate of the effective spectral profile of the compensation vectors. The horizontal axis represents frequency, warped nonlinearly according to the mel scale [9]. The maximum frequency corresponds to the Nyquist frequency, 8000 Hz. We note that the spectral profile of the compensation vector varies with SNR, and that especially for the intermediate SNRs the various VQ clusters require compensation vectors of different spectral shapes. The compensation curves for 0-dB SNR average to zero dB at low frequencies by design.

[Figure 1: Compensation Magnitude (dB) versus Normalized Warped Frequency, with panels at SNR = 29 dB and SNR = 5 dB.]

MFCDCN. In multiple fixed codeword-dependent cepstral normalization (MFCDCN), compensation vectors are precomputed for each of a set of possible target environments, using the FCDCN procedure as described above. When an utterance from an unknown environment is input to the recognition system, compensation vectors computed using each of the possible target environments are applied successively, and the environment is chosen that minimizes the average residual VQ distortion over the entire utterance,

    \| z + r[k, l, m] - c[k] \|^2

where k refers to the VQ codeword, l to the SNR, and m to the target environment used to train the ensemble of compensation vectors. This general approach is similar in spirit to that used by the BBN speech system [13], which performs a classification among six groups of secondary microphones and the CLSTLK microphone to determine which of seven sets of phonetic models should be used to process speech from unknown environments.
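As a concrete illustration, the per-frame FCDCN compensation and codeword selection described above can be sketched as follows. This is a minimal NumPy sketch; the function name and array shapes (T frames of D-dimensional cepstra, K codewords, L SNR bins) are our own assumptions, not the paper's implementation:

```python
import numpy as np

def apply_fcdcn(z, snr_idx, r, c):
    """Apply the FCDCN correction x_hat = z + r[k, l] frame by frame.

    z       : (T, D) cepstra of incoming speech in the target environment
    snr_idx : (T,) instantaneous SNR index l of each frame (0..L-1)
    r       : (K, L, D) correction vectors r[k, l]
    c       : (K, D) VQ codewords of the CLSTLK training speech
    """
    x_hat = np.empty_like(z)
    for t, (zt, l) in enumerate(zip(z, snr_idx)):
        # choose the VQ label k minimizing ||z + r[k, l] - c[k]||^2
        dist = np.sum((zt + r[:, l, :] - c) ** 2, axis=1)
        k = int(np.argmin(dist))
        x_hat[t] = zt + r[k, l]
    return x_hat
```

Note that the codeword is selected on the *compensated* vector z + r[k, l], as in the distortion measure given above, rather than on the raw input z.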
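One iteration of the EM re-estimation (Steps 2 and 3 above) might look like the following NumPy sketch. It assumes stereo frames x_i (CLSTLK) and z_i (target microphone) with precomputed SNR indices; all names and shapes are illustrative, and a constant shift is applied inside the exponential purely for numerical stability (it cancels in the posterior ratio):

```python
import numpy as np

def fcdcn_em_step(x, z, snr_idx, r, sigma2, c):
    """One EM iteration: posteriors f_i[k], then new r'[k, l] and sigma^2[l].

    x, z    : (T, D) stereo cepstra (CLSTLK and target environment)
    snr_idx : (T,) SNR index l_i of each frame
    r       : (K, L, D) current correction vectors r'[k, l]
    sigma2  : (L,) current variances sigma^2[l]
    c       : (K, D) VQ codewords
    """
    T, D = z.shape
    K, L, _ = r.shape
    # Step 2: a posteriori probabilities f_i[k] of the mixture components
    f = np.empty((T, K))
    for i in range(T):
        l = snr_idx[i]
        d = np.sum((z[i] + r[:, l, :] - c) ** 2, axis=1)
        w = np.exp(-(d - d.min()) / (2.0 * sigma2[l]))  # stability shift
        f[i] = w / w.sum()
    # Step 3: re-estimate r'[k, l] and sigma^2[l] per SNR bin
    new_r = r.copy()
    new_sigma2 = sigma2.copy()
    for l in range(L):
        sel = snr_idx == l
        if not np.any(sel):
            continue  # no frames at this SNR; keep previous estimates
        d_xz = x[sel] - z[sel]               # (T_l, D) differences x_i - z_i
        fl = f[sel]                          # (T_l, K)
        new_r[:, l, :] = (fl.T @ d_xz) / fl.sum(axis=0)[:, None]
        rl = new_r[:, l, :]
        resid = d_xz[:, None, :] - rl[None, :, :]        # (T_l, K, D)
        new_sigma2[l] = np.sum(fl * np.sum(resid ** 2, axis=2)) / fl.sum()
    return new_r, new_sigma2
```

In use, one would initialize r with the SDCN correction vectors (as the text recommends for fast convergence) and call this function until the estimates stop changing.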
The success of MFCDCN depends on the availability of training data with stereo pairs of speech recorded from the training environment and from a variety of possible target environments, and on the extent to which the environments in the training data are representative of what is actually encountered in testing.
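The MFCDCN environment-selection rule described above can be sketched as a small helper that scores each precomputed set of compensation vectors by its average residual VQ distortion over the utterance. This is a hypothetical NumPy sketch; the shapes and names are ours:

```python
import numpy as np

def select_environment(z, snr_idx, r_envs, c):
    """Pick the environment m minimizing the average residual VQ distortion.

    z       : (T, D) cepstra of the incoming utterance
    snr_idx : (T,) SNR index l of each frame
    r_envs  : (M, K, L, D) compensation vectors r[k, l, m], one set per environment
    c       : (K, D) VQ codewords
    """
    best_m, best_dist = 0, np.inf
    for m, r in enumerate(r_envs):
        total = 0.0
        for zt, l in zip(z, snr_idx):
            # residual distortion ||z + r[k, l, m] - c[k]||^2 of the best codeword
            total += np.min(np.sum((zt + r[:, l, :] - c) ** 2, axis=1))
        avg = total / len(z)
        if avg < best_dist:
            best_m, best_dist = m, avg
    return best_m
```

The chosen index m would then select which ensemble of FCDCN compensation vectors is applied to the utterance before recognition.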