Warped Discrete Cosine Transform Cepstrum: a New Feature for Speech Processing

R. Muralishankar (1), Abhijeet Sangwan (2) and Douglas O'Shaughnessy (1)
(1) INRS-EMT (Telecommunications), University of Quebec, Montreal, Canada.
(2) Department of Electrical and Computer Engineering, Concordia University, Montreal, Canada.

ABSTRACT

In this paper, we propose a new feature for speech recognition and speaker identification applications. The new feature is termed the warped discrete cosine transform cepstrum (WDCTC). The feature is obtained by replacing the discrete cosine transform (DCT) by the warped discrete cosine transform (WDCT, [4]) in the discrete cosine transform cepstrum (DCTC [2]). The WDCT is implemented as a cascade of the DCT and IIR all-pass filters. We incorporate a nonlinear frequency scale in DCTC which closely follows the Bark scale. This is accomplished by setting the all-pass filter parameter using an expression given by Smith and Abel [5]. The performance of WDCTC is compared to that of mel-frequency cepstral coefficients (MFCC) in speech recognition and speaker identification experiments. WDCTC outperforms MFCC in both noisy and noiseless conditions.

1. INTRODUCTION

For the past two decades, the feature extraction algorithm of Davis and Mermelstein [1] has been used extensively in the field of speech and audio processing. The algorithm stems from two ideas: (1) vocal tract modeling, and (2) homomorphic filtering. In the vocal tract model, speech is produced by passing an excitation through a filter whose response models the effects of the vocal tract on the excitation. In homomorphic filtering, the convolution of the excitation with the vocal tract response is transformed to addition, where linear filtering techniques are applied to remove the excitation from the filter response. Unlike homomorphic filtering, the Davis and Mermelstein algorithm generates mel-frequency cepstral coefficients (MFCC) by transforming the log-energies of the spectrum passed through a bank of band-pass filters. The MFCC filter bank is composed of triangular filters spaced on a logarithmic scale. The spacing of the filters follows the mel scale, which is inspired by the critical band measurements of the human auditory system.

MFCC was compared to the linear predictive cepstral coefficients (LPCC) and the discrete cosine transformed cepstrum (DCTC [2]) in a speaker identification task presented by Muralishankar et al. [3]. Speaker identification using DCTC was shown to perform better than LPCC but worse than MFCC [3]. It is observed that the underperformance of DCTC when compared to MFCC can be attributed to the absence of a nonlinear frequency scale. We propose to incorporate a nonlinear frequency scale into the DCTC feature, which we believe will improve its performance to match that of MFCC. This is achieved by using the warped discrete cosine transform (WDCT) instead of the DCT in obtaining DCTC. The WDCT is implemented as a cascade of the DCT and IIR all-pass filters whose parameters are used to adjust the transform according to the frequency contents of the signal block [4]. As a parallel, the advantage of using WDCT over DCT for image compression has been shown in [4].

Smith and Abel [5] derived an analytic expression for an all-pass filter parameter such that the mapping between the warped and the unwarped frequencies, for a given sampling frequency $f_s$, follows the psychoacoustic Bark scale. This expression is used to get the WDCT. Applying the WDCT to DCTC generates a new feature, the warped discrete cosine transform cepstrum (WDCTC). To evaluate the efficacy of the new feature in speech processing applications, we compare the performance of WDCTC with that of MFCC for vowel recognition and speaker identification tasks. Our experimental results show that WDCTC consistently performs better than MFCC.

This paper is organized as follows. In section 2, we introduce the WDCT and present its nonlinear frequency resolution property. In section 3, WDCT is incorporated into DCTC. In section 4, we present the comparative performances of WDCTC and MFCC for vowel recognition and speaker identification tasks. Section 5 provides the concluding remarks.

2. WARPED DISCRETE COSINE TRANSFORM

2.1 Definition

Here, we review an N-point WDCT of the input vector $[x(0), x(1), \ldots, x(N-1)]^T$ [4]. The N-point DCT, $\{X(0), X(1), \ldots, X(N-1)\}$, is defined by

$$X(k) = U(k) \sum_{n=0}^{N-1} x(n) \cos\frac{(2n+1)k\pi}{2N} \qquad (1)$$

for $k = 0, 1, \ldots, N-1$, where

$$U(k) = \begin{cases} \dfrac{1}{\sqrt{2}}, & k = 0 \\ 1, & \text{otherwise.} \end{cases} \qquad (2)$$

The $k$th row of the $N \times N$ DCT matrix can be viewed as a filter whose transfer function is given by

$$F_k(z^{-1}) = \sum_{n=0}^{N-1} U(k) \cos\frac{(2n+1)k\pi}{2N}\, z^{-n}. \qquad (3)$$

That is, the $i$th coefficient of $F_k(z^{-1})$ is the $(k, i)$th element of the DCT matrix. It can be shown that $F_k(z^{-1})$ is a band-pass filter with a center frequency at $(2k+1)/2N$, with the sampling frequency normalized to 1. Hence, the magnitude response of $F_k(z^{-1})$ for small $k$ is larger for low-frequency inputs such as voiced sounds, which enables data compression by giving more emphasis to the lower band outputs than the higher band ones [4]. Further, inputs with mostly high-frequency components such as unvoiced sounds have a higher magnitude response of $F_k(z^{-1})$ for large $k$, which enables high-frequency coefficients to have compacted energy. This is a desirable feature for noise removal purposes [6].

Note that the frequency resolution of the DCT is uniform. Therefore, incorporating a nonlinear frequency resolution closely following the psychoacoustic Bark scale will result in an enhanced representation for the speech signals. We introduce such a nonlinearity in DCTC using warping. To warp the frequency axis, we apply an all-pass transformation by replacing $z^{-1}$ with an all-pass filter $A(z)$ defined by

$$A(z) = \frac{-\beta + z^{-1}}{1 - \beta z^{-1}} \qquad (4)$$

where $\beta$ is the control parameter for warping the frequency response. $A(z)$ is known as the Laguerre filter and is widely used in various signal processing algorithms. The resulting $F_k(A(z))$ now becomes an infinite impulse response (IIR) filter given by

$$F_k(A(z)) = \sum_{n=0}^{N-1} U(k) \cos\frac{(2n+1)k\pi}{2N}\, (A(z))^{n}. \qquad (5)$$

2.2 Implementation of the WDCT

The WDCT can be implemented in several ways. The most straightforward approach is to implement the filters in a Laguerre network (considering first-order all-pass filters, $A(z)$, which are reset every N samples). In the second approach, we can implement the filtering by a matrix-vector multiplication in two steps: first we divide the all-pass IIR transfer functions into N terms, and then sample the frequency responses of the warped filter bank to obtain the WDCT matrix through an inverse discrete Fourier transform (IDFT).

We use the second approach, which is the filter bank method suggested by Cho and Mitra [4]. For an N-tap finite impulse response (FIR) filter, the result of filtering and decimation by N corresponds to the inner product of the filter coefficient vector and the input vector. From Parseval's relation, this is again equal to the inner product of the conjugate DFT of the input and the DFT of the filter coefficients, which is equal to the sampled value of $F_k(e^{j\omega})$ for $\omega = 2\pi k/N$, where $k = 0, 1, \ldots, N-1$. Similarly, we can approximate the result of the filtering with $F_k(A(e^{j\omega}))$ as the inner product of the input vector and the IDFT of the sampled sequence of $F_k(A(e^{j\omega}))$. A more detailed description of the WDCT and its implementations can be found in [4].

[Figure 1: Linear and warped frequency sampling on the unit circle (axes: Re(z), Im(z)). (a) Linear frequency sampling of the trivial all-pass filter $z^{-1}$ ($\beta = 0$). (b) Warped frequency resolution of the first-order all-pass filter $A(z)$ ($\beta = 0.5$).]

The phase function determines a frequency mapping occurring in the all-pass chain [7]. Depending on the sign of $\beta$, the low or high frequency range is expanded whereas the remaining part of the unit circle becomes compressed. This is shown in Fig. 1. For a certain value of $\beta$ the frequency transformation closely resembles the frequency mapping occurring in the human auditory system. Smith and Abel [5] derived an analytic expression for $\beta$ so that the mapping, for a given sampling frequency $f_s$, matches the psychoacoustic Bark-scale mapping. The value is given by

$$\beta \approx 1.0211 \left[ \frac{2}{\pi} \arctan(0.076 f_s) \right]^{1/2} - 0.19877. \qquad (7)$$

For a given $f_s$ (say 16 kHz), we calculated $\beta$ from eq. 7 to generate the WDCT matrix.

3. WARPED DISCRETE COSINE TRANSFORMED CEPSTRUM

Two variants of DCTC were proposed in [3], namely, DCTC-1 and DCTC-2. It was shown that DCTC-2 performs better than DCTC-1 and both outperform LPCC in a speaker identification task. Hence, we chose the DCTC-2 algorithm and replaced the DCT with the WDCT to obtain the WDCTC algorithm. The new WDCTC algorithm is outlined below.

Consider a finite duration, real sequence $x(n)$, defined for $0 \le n \le N-1$ and zero elsewhere. Taking an N-point WDCT of the above sequence, we have $X_{WDCT}(k)$ defined for $0 \le k \le N-1$. We can write $X_{WDCT}(k)$ as
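Returning to the filter-bank construction of Section 2.2, the following is a minimal sketch of how a WDCT matrix might be built; it is an illustration under stated assumptions, not the authors' implementation. The function names (`bark_warp_beta`, `dct_matrix`, `wdct_matrix`) are hypothetical. The sampling frequency in eq. 7 is assumed to be expressed in kHz, and each warped filter of eq. 5 is sampled at N uniformly spaced frequencies before an inverse DFT, so every row is only a time-aliased approximation of the IIR response.

```python
import cmath
import math


def bark_warp_beta(fs_khz):
    """Warping parameter from eq. 7 (assumption: fs_khz is given in kHz)."""
    return 1.0211 * math.sqrt((2.0 / math.pi) * math.atan(0.076 * fs_khz)) - 0.19877


def dct_matrix(N):
    """N x N DCT matrix of eqs. 1-2: entry (k, n) is U(k) cos((2n+1) k pi / 2N)."""
    rows = []
    for k in range(N):
        u = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        rows.append([u * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                     for n in range(N)])
    return rows


def wdct_matrix(N, beta):
    """Approximate WDCT matrix: sample F_k(A(e^{jw})) at w = 2 pi m / N, then IDFT."""
    C = dct_matrix(N)
    # All-pass filter of eq. 4 evaluated at N points on the unit circle.
    A = []
    for m in range(N):
        z_inv = cmath.exp(-2j * math.pi * m / N)
        A.append((-beta + z_inv) / (1.0 - beta * z_inv))
    W = []
    for k in range(N):
        # Frequency samples of the warped filter in eq. 5.
        F = [sum(C[k][n] * A[m] ** n for n in range(N)) for m in range(N)]
        # Inverse DFT; the samples are conjugate-symmetric for real beta,
        # so the imaginary part is only rounding error and is discarded.
        row = [sum(F[m] * cmath.exp(2j * math.pi * m * i / N)
                   for m in range(N)).real / N
               for i in range(N)]
        W.append(row)
    return W


# Example: 8-point WDCT matrix for a 16 kHz sampling rate (N = 8 is arbitrary).
beta_16k = bark_warp_beta(16.0)
W_16k = wdct_matrix(8, beta_16k)
```

With `beta = 0` the all-pass filter reduces to $z^{-1}$ and the sketch returns the ordinary DCT matrix, which is a convenient sanity check. The direct summations are chosen for readability; an FFT routine would be the natural replacement for larger N.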
