A BLOCK COSINE TRANSFORM AND ITS APPLICATION IN SPEECH RECOGNITION

Jingdong Chen†*, Kuldip K. Paliwal† and Satoshi Nakamura*

† School of Microelectronic Engineering, Griffith University, Brisbane, QLD 4111, Australia
* ATR Spoken Language Translation Research Labs, Kyoto, 619-0288, Japan
E-mail: {jingdong.chen, nakamura}@slt.atr.co.jp, [email protected]

ABSTRACT

Noise robust speech recognition has become an important area of research in recent years. The fact that human listeners can recognize speech in the presence of strong noise has inspired researchers to imitate some aspects of human auditory perception in automatic speech recognition. This has led to sub-band based speech recognition, in which the full-band speech is split into several sub-bands and each sub-band is processed separately. The resulting multi-band features can be combined in various ways to carry out the speech recognition task. Reported results have shown the superiority of this technique for speech recognition in strong noise conditions. In this paper, we briefly review multi-band feature extraction. We then propose a block discrete cosine transform (BDCT) whose kernel transformation matrix is derived from a decomposition of the kernel of the discrete cosine transform (DCT). We show that the BDCT approximates the DCT in preserving information while decorrelating a sequence. When the BDCT is applied to the log mel filter bank energies (FBEs) in place of the DCT to convert them to cepstral coefficients, a new kind of MFCCs is obtained. We call these new features block discrete cosine transform based MFCCs (BMFCCs) and show that a sub-band processing idea is implicit in the BMFCCs, since the BDCT automatically divides the mel frequency FBEs into two sub-bands. We report various speech recognition results using the BMFCCs, as well as comparisons with the multi-band MFCCs and full-band MFCCs, to elaborate the properties of the BMFCCs.

1. INTRODUCTION

Significant advances have been made in recent years in the area of automatic speech recognition. It is now possible to use speech recognition successfully in a controlled environment. However, the performance of a speech recognizer degrades dramatically when there is a mismatch between training and testing environments [1-3]. Many factors contribute to this mismatch; the main one is the presence of ambient background noise in the speech signal. Maintaining good recognition accuracy in noisy conditions has therefore become one of the most challenging areas of current research.

The fact that human listeners can achieve very high recognition accuracy even in conditions in which the signal to noise ratio becomes extremely low has inspired researchers to mimic some aspects of human auditory perception in automatic speech recognition. Psychoacoustic evidence shows that human beings process speech on a narrow-band basis. An intuitive way to imitate the auditory system is to split the full-band speech into several sub-bands and represent each sub-band individually. This has led to a technique called sub-band based speech recognition.

One straightforward way to achieve a sub-band representation is to divide the full-band speech signal into several sub-bands and convert each sub-band spectrum into several cepstral features. These sub-band features are then concatenated together as a single feature vector and used for speech recognition [4-7,12-14]. Results reported in the literature have shown the advantages of these multi-band features for noisy speech recognition. However, the performance for clean speech is often poorer, as the features from different sub-bands may be correlated. Furthermore, the number of sub-bands and the boundaries of each sub-band are empirical values which have to be manually adjusted to obtain good performance for a given recognition task.

An alternative is to model the sub-band features independently and to combine the likelihood scores at some segmental level [8-10]. Such a combination may offer a more flexible way to manipulate the sub-band features and permit further performance improvements. An unsolved problem with this approach is how to determine the weighting function so as to guarantee at least a sub-optimal combination of the sub-band features.

In this paper, we first review the extraction of multi-band MFCCs. We then propose a block discrete cosine transform (BDCT) whose kernel transformation matrix is derived from a decomposition of the kernel of the discrete cosine transform (DCT). We show that the proposed transform behaves similarly to the DCT in preserving information while decorrelating a sequence. When the BDCT replaces the DCT in the computation of the mel frequency cepstrum, a new type of MFCCs is obtained. We call these new MFCCs block discrete cosine transform based MFCCs (BMFCCs). It is found that the BDCT automatically divides the mel frequency FBEs into two sub-bands; hence a sub-band processing idea is implicit in the new cepstral features. Various speech recognition experiments are carried out to test the properties of the BDCT and the BMFCCs. We report some results, as well as comparisons with the multi-band and full-band MFCCs, to elaborate the properties of these new features.

2. MULTI-BAND MFCC

Let $E = [e_1, e_2, \ldots, e_N]$ denote a sequence of log filter bank energies, where $N$ is the number of filters in the filter bank. The full-band cepstrum is then computed with a DCT:

$$X_F = [C]E \qquad (1)$$

Suppose we divide the speech signal into $M$ sub-bands. To compute cepstral coefficients for the $m$th sub-band, we process each sub-band signal with a filter bank having $N_m$ filters. The filter bank energies are thus given as

$$E = [E_1, E_2, \ldots, E_M] = [\{e_{11}, e_{12}, \ldots, e_{1N_1}\}, \{e_{21}, e_{22}, \ldots, e_{2N_2}\}, \ldots, \{e_{M1}, e_{M2}, \ldots, e_{MN_M}\}].$$

The cepstral coefficients for the $m$th sub-band can be computed through a DCT as follows:

$$X_S^m = [C]E_m \qquad (2)$$

Various ways can be used to merge the sub-band cepstral vectors. If these vectors are directly concatenated into a single feature vector, i.e., $X_S = [X_S^1, X_S^2, \ldots, X_S^M]$, this merging strategy is called feature combination (FC) [5] and the resulting cepstra are called multi-band features.
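The full-band and multi-band cepstra of eqs. (1) and (2) can be sketched numerically as follows. This is a minimal illustration in Python/NumPy (function names are ours); it assumes the log filter bank energies are already available, i.e., the mel filter bank front-end itself is omitted:

```python
import numpy as np

def dct_kernel(n):
    """DCT kernel [C] of eq. (6): C[m, k] = k_m * sqrt(2/n) * cos(m*(k+0.5)*pi/n)."""
    m = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(m * (k + 0.5) * np.pi / n)
    c[0, :] /= np.sqrt(2.0)  # k_m = 1/sqrt(2) for m = 0, eq. (7)
    return c

def full_band_cepstrum(log_fbe):
    """Eq. (1): X_F = [C] E."""
    e = np.asarray(log_fbe, dtype=float)
    return dct_kernel(e.size) @ e

def multi_band_cepstrum(sub_band_fbes):
    """Eq. (2) plus feature combination: one DCT per sub-band, then concatenation."""
    return np.concatenate([full_band_cepstrum(e) for e in sub_band_fbes])

# The kernel is unitary, so the transform itself loses no information.
C = dct_kernel(24)
assert np.allclose(C @ C.T, np.eye(24))
```

With a 24-filter configuration split in two, `multi_band_cepstrum([E[:12], E[12:]])` yields two 12-point sub-band cepstra concatenated into a 24-dimensional vector, of which only the first few coefficients per band would normally be retained.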

3. BLOCK DISCRETE COSINE TRANSFORM

The DCT, which has been found to be asymptotically equivalent to the optimal Karhunen-Loeve transform (KLT) in decorrelating a signal sequence, has shown special applicability in the cepstral analysis of speech signals. For a vector $X$ containing $N$ sample data points, its DCT is the vector $X_c$ given by

$$X_c = [C]X \qquad (5)$$

where $[C]$ is the $N \times N$ kernel conversion matrix of the DCT. The $(m,n)$ element of this matrix is defined as [15]

$$C_N^{m,n} = \left(\frac{2}{N}\right)^{1/2} k_m \cos\left[\frac{m(n+0.5)\pi}{N}\right], \qquad m,n = 0,1,\ldots,N-1 \qquad (6)$$

where

$$k_m = \begin{cases} 1, & \text{if } m \neq 0\\ 1/\sqrt{2}, & \text{if } m = 0 \end{cases} \qquad (7)$$

Because of the symmetry properties of the cosine function, namely $C_N^{m,N-1-n} = (-1)^m C_N^{m,n}$, the matrix $[C]$ can be decomposed into sparse matrices. We show, in what follows, that if $N$ is an even number, $[C]$ can be decomposed as

$$[C] = [D]\,\frac{1}{\sqrt{2}}\begin{bmatrix} I_{N/2} & \tilde{I}_{N/2}\\[2pt] -\tilde{I}_{N/2} & I_{N/2} \end{bmatrix} = [D][B] \qquad (8)$$

where

$$[D] = \sqrt{2}\begin{bmatrix}
C_N^{0,0} & \cdots & C_N^{0,N/2-1} & 0 & \cdots & 0\\
0 & \cdots & 0 & C_N^{1,N/2} & \cdots & C_N^{1,N-1}\\
C_N^{2,0} & \cdots & C_N^{2,N/2-1} & 0 & \cdots & 0\\
\vdots & & \vdots & \vdots & & \vdots\\
C_N^{N-2,0} & \cdots & C_N^{N-2,N/2-1} & 0 & \cdots & 0\\
0 & \cdots & 0 & C_N^{N-1,N/2} & \cdots & C_N^{N-1,N-1}
\end{bmatrix}$$

Here $[I_{N/2}]$ is the $N/2 \times N/2$ identity matrix, $[\tilde{I}_{N/2}]$ is the $N/2 \times N/2$ opposite identity matrix (ones on the anti-diagonal), $[D]$ is an $N \times N$ conversion matrix and $[B]$ is an $N \times N$ butterfly matrix; the factors $\sqrt{2}$ and $1/\sqrt{2}$ are chosen so that both $[D]$ and $[B]$ are unitary. If $[D]$ is used as a kernel transform matrix, a new discrete transform is formed. Since $[D]$ is a block matrix, we call this new transform the block discrete cosine transform (BDCT).

4. PROPERTIES AND PERFORMANCE OF THE BDCT

A. The unitarity property

Let $d_m$ denote the $m$th row vector of the matrix $[D]$. Then the inner product of two such vectors is

$$\langle d_m, d_k \rangle = \begin{cases} 2\sum_{n=0}^{N/2-1} C_N^{m,n} C_N^{k,n}, & \text{if both } m \text{ and } k \text{ are even}\\[4pt] 2\sum_{n=N/2}^{N-1} C_N^{m,n} C_N^{k,n}, & \text{if both } m \text{ and } k \text{ are odd}\\[4pt] 0, & \text{otherwise} \end{cases} \qquad (9)$$

For even $m$ and $k$, using $C_N^{m,n} = C_{N/2}^{m/2,n}/\sqrt{2}$,

$$2\sum_{n=0}^{N/2-1} C_N^{m,n} C_N^{k,n} = \sum_{n=0}^{N/2-1} C_{N/2}^{m/2,n} C_{N/2}^{k/2,n} = \sum_{n=0}^{P-1} C_P^{m/2,n} C_P^{k/2,n} = \langle g_{m/2}^P, g_{k/2}^P \rangle = \delta_{m/2,k/2} = \delta_{m,k} \qquad (10)$$

and for odd $m$ and $k$, using $C_N^{m,N-1-n} = -C_N^{m,n}$,

$$2\sum_{n=N/2}^{N-1} C_N^{m,n} C_N^{k,n} = \sum_{n=N/2}^{N-1} C_N^{m,N-1-n} C_N^{k,N-1-n} + \sum_{n=N/2}^{N-1} C_N^{m,n} C_N^{k,n} = \sum_{n=0}^{N-1} C_N^{m,n} C_N^{k,n} = \langle g_m, g_k \rangle = \delta_{m,k} \qquad (11)$$

where $g_m$ denotes the $m$th row vector of the matrix $[C]$, $g_m^P$ the $m$th row vector of the $P$-point DCT matrix, $P = N/2$, and

$$\delta_{mk} = \begin{cases} 1, & \text{for } k = m\\ 0, & \text{for } k \neq m \end{cases} \qquad (12)$$

Combining (9), (10) and (11), we can now represent the unitarity property of the BDCT by

$$\langle d_m, d_k \rangle = \delta_{mk} \qquad (13)$$

B. Energy packing efficiency

The energy packing efficiency (EPE) is a criterion that measures the effectiveness of a transform in data compression. It is defined as the proportion of energy contained in the first $M$ of $N$ transform coefficients. Rao and Yip [16] showed that the EPE is equal to the ratio of the sum of the first $M$ diagonal elements to the sum of all $N$ diagonal elements of the auto-covariance matrix in the transform domain. Let $[A]$ be the data covariance matrix and $[T]$ be the transform. Then the covariance matrix in the transform domain, $[A^T]$, is given by

$$[A^T] = [T][A][T]^{-1} \qquad (14)$$

Thus the EPE for the transform is, by definition,

$$EPE(M) = \sum_{j=1}^{M} A_{jj}^T \Big/ \sum_{j=1}^{N} A_{jj}^T \qquad (15)$$

One standard way to compare the EPE of different transforms is to assume that the random sequence is governed by a first-order Markov process with a specified adjacent correlation coefficient $\rho$. In Figure 1, the EPEs of the BDCT and the DCT are compared for a Markov-1 signal with $\rho = 0.9$ and $N = 24$.

[Figure 1. EPEs for DCT and BDCT.]

An inspection of Figure 1 reveals that the EPE of the BDCT is extremely close to that of the DCT when $M \geq 10$, which indicates the effectiveness of the BDCT in data compression.

In cepstral analysis for speech recognition, the first cepstral coefficient ($C_0$) is often discarded, and the succeeding $M$ coefficients are used as features. For this reason, Figure 2 shows the EPE for both the BDCT and the DCT after removing the first coefficient. One can see the efficiency of the BDCT in preserving information in cepstral analysis.

[Figure 2. EPEs after removing the first coefficient.]
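The decomposition (8) and the unitarity property (13) can be checked numerically. The sketch below (Python/NumPy; function names are ours) builds $[D]$ and $[B]$ for even $N$ and verifies that $[C] = [D][B]$ and that $[D]$ is unitary:

```python
import numpy as np

def dct_kernel(n):
    """DCT kernel [C] of eq. (6)."""
    m = np.arange(n)[:, None]
    k = np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(m * (k + 0.5) * np.pi / n)
    c[0, :] /= np.sqrt(2.0)  # k_0 = 1/sqrt(2), eq. (7)
    return c

def bdct_kernel(n):
    """BDCT kernel [D] of eq. (8): even rows of [C] restricted to the first n/2
    columns, odd rows to the last n/2 columns, scaled by sqrt(2) so that [D]
    satisfies the unitarity property (13)."""
    c = dct_kernel(n)
    d = np.zeros((n, n))
    h = n // 2
    d[0::2, :h] = np.sqrt(2.0) * c[0::2, :h]
    d[1::2, h:] = np.sqrt(2.0) * c[1::2, h:]
    return d

def butterfly(n):
    """Normalized butterfly matrix [B] of eq. (8)."""
    h = n // 2
    i, j = np.eye(h), np.fliplr(np.eye(h))  # identity and opposite identity
    return np.block([[i, j], [-j, i]]) / np.sqrt(2.0)

n = 24
C, D, B = dct_kernel(n), bdct_kernel(n), butterfly(n)
assert np.allclose(D @ B, C)            # decomposition, eq. (8)
assert np.allclose(D @ D.T, np.eye(n))  # unitarity, eq. (13)
```

Note that the even-indexed BDCT coefficients depend only on the first half of the input and the odd-indexed ones only on the second half, which is the origin of the implicit two-band split exploited later for the BMFCCs.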

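The EPE comparison of Figure 1 can be reproduced in a few lines under the same assumptions (first-order Markov covariance $A_{ij} = \rho^{|i-j|}$ with $\rho = 0.9$, $N = 24$); `dct_kernel` and `bdct_kernel` are repeated here so the sketch is self-contained:

```python
import numpy as np

def dct_kernel(n):
    m, k = np.arange(n)[:, None], np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(m * (k + 0.5) * np.pi / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def bdct_kernel(n):
    c, d, h = dct_kernel(n), np.zeros((n, n)), n // 2
    d[0::2, :h] = np.sqrt(2.0) * c[0::2, :h]
    d[1::2, h:] = np.sqrt(2.0) * c[1::2, h:]
    return d

def epe(t, a, m):
    """Eqs. (14)-(15): fraction of energy in the first m transform-domain
    variances; t is unitary, so [T]^{-1} = [T]^T."""
    var = np.diag(t @ a @ t.T)
    return var[:m].sum() / var.sum()

n, rho = 24, 0.9
a = rho ** np.abs(np.subtract.outer(np.arange(n), np.arange(n)))  # Markov-1 covariance
for m in (6, 10, 16):
    print(m, epe(dct_kernel(n), a, m), epe(bdct_kernel(n), a, m))
```

The curves of Figure 2 follow the same recipe after deleting the first row of each kernel before forming the transform-domain variances.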
5. MEL FREQUENCY CEPSTRA USING THE BDCT

Mel frequency cepstral coefficients (MFCCs) are perhaps the most widely used features for speech recognition. Conventionally, the MFCCs are computed by applying a DCT to the log mel filter bank energies. In this paper, we propose to use the BDCT in place of the DCT in the estimation of the MFCCs. The resulting new MFCCs are called block discrete cosine transform based MFCCs (BMFCCs). The BMFCCs have been shown to have the following properties:

1. They may contain more information than the traditional MFCCs, as the BDCT has been shown to preserve more information than the DCT in cepstral analysis.

2. A sub-band processing strategy is implicit in the BMFCCs. To see this more clearly, the $M \times N$ conversion matrix which transforms the FBEs to cepstral coefficients can be written as follows without losing any information (we assume $M$ and $N$ are even numbers):

$$[D] = \sqrt{2}\begin{bmatrix}
C_N^{2,0} & C_N^{2,1} & \cdots & C_N^{2,N/2-1} & 0 & 0 & \cdots & 0\\
C_N^{4,0} & C_N^{4,1} & \cdots & C_N^{4,N/2-1} & 0 & 0 & \cdots & 0\\
\vdots & & & \vdots & \vdots & & & \vdots\\
C_N^{M,0} & C_N^{M,1} & \cdots & C_N^{M,N/2-1} & 0 & 0 & \cdots & 0\\
0 & 0 & \cdots & 0 & C_N^{1,N/2} & C_N^{1,N/2+1} & \cdots & C_N^{1,N-1}\\
0 & 0 & \cdots & 0 & C_N^{3,N/2} & C_N^{3,N/2+1} & \cdots & C_N^{3,N-1}\\
\vdots & & & \vdots & \vdots & & & \vdots\\
0 & 0 & \cdots & 0 & C_N^{M-1,N/2} & C_N^{M-1,N/2+1} & \cdots & C_N^{M-1,N-1}
\end{bmatrix} \qquad (16)$$

We can see that $[D]$ is a block diagonal matrix. When it is applied to the FBEs, $[D]$ automatically divides them into two sub-bands.

6. SPEECH RECOGNITION EXPERIMENTS

In this paper, we investigate the use of different features for speaker-independent isolated-word speech recognition. The speech database used for this task is the ISOLET spoken letter database from OGI [11]. The vocabulary consists of the 26 English letters (A-Z). From this database, we take 90 utterances of each word from 45 male and 45 female talkers for training, and 30 utterances of each word from 15 male and 15 female talkers (different from the training talkers) for testing.

The original speech was sampled at 16 kHz. We down-sample the speech to 8 kHz using a lowpass filter with a cutoff frequency of 3.5 kHz. The speech signal is analyzed every 12.5 ms with a frame width of 25 ms (with Hamming window and preemphasis).

To test the robustness of the different feature sets with respect to noise, we directly add noise to the speech signal in the test set. The training speech is kept clean. The noise signals used are from the NOISEX database [17]. We down-sample the noise signals from 16 kHz to 8 kHz.

The recognition system uses a multi-mixture continuous density HMM framework. We use a 6-state continuous density HMM recognizer, with each probability density function approximated by a mixture of 5 multivariate normal distributions with diagonal covariance matrices.

The feature sets investigated include:

• FB MFCC (full-band MFCC): 12 MFCCs + 12 ∆MFCCs. The mel frequency filter bank consists of 24 triangular filters.

• MB MFCC (multi-band MFCC): 12 multi-band MFCCs + 12 ∆ multi-band MFCCs. We divide the full-band spectrum into 2 sub-bands: (0-1257 Hz) and (1104-4000 Hz). Each sub-band is passed through a filter bank which contains 12 mel-scale triangular filters. The resulting 12 FBEs are converted to 6 MFCCs. We then concatenate the two sub-band MFCC vectors into a 12-dimensional sub-band MFCC vector.

• BMFCC: 12 BMFCCs + 12 ∆BMFCCs. The BMFCCs are computed by applying the BDCT to the 24 mel frequency FBEs.

The recognition results for this experiment under three different noise conditions are shown in Tables 1, 2 and 3.

Feature set   Clean speech   30dB   20dB   15dB   10dB
FB MFCC       88.0           87.7   84.9   67.1   31.3
MB MFCC       87.8           87.5   87.1   77.4   41.7
BMFCC         89.5           89.6   88.0   75.4   35.1

Table 1. Speech recognition accuracy (in %) in speech noise condition (average speech spectrum).

Feature set   Clean speech   30dB   20dB   15dB   10dB
FB MFCC       88.0           87.8   85.7   78.7   68.0
MB MFCC       87.8           87.8   86.9   80.0   69.1
BMFCC         89.5           89.4   89.4   82.6   73.5

Table 2. Speech recognition accuracy (%) in machine gun noise condition (Calibre 0.50, repeated).
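For concreteness, the BMFCC front-end evaluated in these experiments (apply the BDCT to the 24 log mel FBEs, discard the 0th coefficient, keep the next 12, cf. the matrix in (16)) can be sketched as follows. This is our own illustrative sketch, not the authors' code; the input is assumed to be a precomputed 24-point log mel FBE vector:

```python
import numpy as np

def dct_kernel(n):
    m, k = np.arange(n)[:, None], np.arange(n)[None, :]
    c = np.sqrt(2.0 / n) * np.cos(m * (k + 0.5) * np.pi / n)
    c[0, :] /= np.sqrt(2.0)
    return c

def bdct_kernel(n):
    c, d, h = dct_kernel(n), np.zeros((n, n)), n // 2
    d[0::2, :h] = np.sqrt(2.0) * c[0::2, :h]
    d[1::2, h:] = np.sqrt(2.0) * c[1::2, h:]
    return d

def bmfcc(log_fbe, num_ceps=12):
    """BMFCCs: BDCT of the log mel FBEs, C0 discarded, first num_ceps kept."""
    e = np.asarray(log_fbe, dtype=float)
    x = bdct_kernel(e.size) @ e
    return x[1:num_ceps + 1]

# Implicit two-band split: odd-indexed BDCT coefficients see only the upper
# half of the filter bank, even-indexed ones only the lower half.
e = np.concatenate([np.arange(12.0), np.zeros(12)])  # upper-band FBEs zeroed
x = bdct_kernel(24) @ e
assert np.allclose(x[1::2], 0.0)
```

Appending the usual delta coefficients to `bmfcc(...)` gives the 24-dimensional BMFCC feature vector used above.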

Feature set   Clean speech   30dB   20dB   15dB   10dB
FB MFCC       88.0           88.0   87.6   87.1   84.7
MB MFCC       87.8           87.8   87.7   87.3   84.6
BMFCC         89.5           89.5   89.5   89.4   87.8

Table 3. Speech recognition accuracy (%) in car noise condition (Car-Volvo-340, 120 km/h, 4th gear).

From these experimental results, we can make the following observations:

1. The BMFCCs yield the best performance in clean speech and high SNR conditions. This demonstrates the superiority of the BDCT over the DCT in cepstral analysis for speech recognition.

2. All sub-band based features are more robust than the full-band MFCCs to the three types of noise investigated.

3. The BMFCCs are more robust to the machine gun noise and the car noise than the multi-band MFCCs, while their robustness to speech noise is slightly poorer than that of the multi-band MFCCs. The reason for this is not clear. We will perform further experiments with various speech databases and with more kinds of real noise before drawing a conclusion.

7. CONCLUSION

In this paper, we proposed a block DCT and applied it to mel frequency cepstral analysis in speech recognition. We showed that a sub-band processing idea is implicit in the new features (BMFCCs). Experimental results based on a continuous density isolated-word speech recognizer revealed that the new MFCCs yield better recognition accuracy than full-band MFCCs in noisy as well as clean speech conditions.

The BMFCCs were also compared with the multi-band MFCCs in terms of recognition performance. The results showed that the BMFCCs yield better performance in clean speech and in various noisy speech environments; thus, the BMFCCs are more robust than the multi-band MFCCs under various noise conditions.

In this paper, experiments were only performed on small vocabulary isolated-word speech recognition tasks. Work is in progress to test the BMFCCs and various sub-band based front-ends for large vocabulary continuous speech recognition in clean as well as noisy conditions.

REFERENCES

[1] D. S. Pallett, et al., "1996 Preliminary Broadcast News Benchmark", Proceedings of the 1997 DARPA Speech Recognition Workshop, International Conference Center, Chantilly, Virginia, February 2-5, 1997.
[2] D. S. Pallett, et al., "1997 Broadcast News Benchmark Test Results: English and Non-English", Proceedings of the DARPA Broadcast News Transcription and Understanding Workshop, Lansdowne Conference Resort, Lansdowne, Virginia, February 8-11, 1998.
[3] D. S. Pallett, et al., "1998 Broadcast News Benchmark Test Results: English and Non-English Word Error Rate Performance Measures", Proceedings of the DARPA Broadcast News Workshop, Hilton at Washington Dulles Airport, Herndon, Virginia, February 28-March 3, 1999.
[4] P. McCourt, S. Vaseghi and N. Harte, "Multi-Resolution Cepstral Features for Phoneme Recognition Across Speech Sub-bands," ICASSP'98, Seattle, USA, pp. 557-560.
[5] S. Okawa, E. Bocchieri and A. Potamianos, "Multi-Band Speech Recognition in Noisy Environments," ICASSP'98, Seattle, USA, pp. 641-644.
[6] A. Chen, S. Vaseghi and P. McCourt, "Transformation of Full-Band to Sub-band HMMs for Speech Recognition in Noisy Car Environments," ASRU'99, Keystone, Colorado, December 1999, Vol. 1, pp. 31-34.
[7] B. Doherty, S. Vaseghi and P. McCourt, "Linear Transformations in Sub-band Groups for Speech Recognition," EUROSPEECH'99, Budapest, Hungary, September 1999, pp. 1359-1366.
[8] H. Bourlard and S. Dupont, "A New ASR Approach Based on Independent Processing and Recombination of Partial Frequency Bands," ICSLP'96, Philadelphia, October 1996.
[9] H. Bourlard and S. Dupont, "Subband-based Speech Recognition," ICASSP'97, pp. 1251-1254.
[10] S. Okawa, T. Nakajima and K. Shirai, "A Recombination Strategy for Multi-band Speech Recognition Based on Mutual Information Criterion," EUROSPEECH'99, pp. 603-606.
[11] K. K. Paliwal, "Decorrelated and Liftered Filter-Bank Energies for Robust Speech Recognition," EUROSPEECH'99, pp. 85-88.
[12] R. Chengalvarayan, "Hierarchical Subband Linear Prediction Cepstral (HSLPC) Features for HMM-Based Speech Recognition," ICASSP'99, pp. 409-412.
[13] K. Yoshida, K. Takagi and K. Ozeki, "Speaker Identification Using Subband HMMs," EUROSPEECH'99, pp. 1019-1022.
[14] S. Rao and W. A. Pearlman, "Analysis of Linear Prediction, Coding, and Spectral Estimation from Subbands," IEEE Trans. on Information Theory, Vol. 42, No. 4, 1996, pp. 1160-1178.
[15] A. D. Poularikas, The Transforms and Applications Handbook, IEEE Press, 1995.
[16] K. R. Rao and P. Yip, Discrete Cosine Transform: Algorithms, Advantages, Applications, Academic Press, Inc., 1990.
[17] A. Varga, et al., "The NOISEX-92 Study on the Effect of Additive Noise on Automatic Speech Recognition," DRA Speech Research Unit, St. Andrew's Rd., Malvern, Worcestershire, WR14 3PS, UK.