Speech Coding Parameters and Their Influence on Speech Recognition
JARI TURUNEN*, DAMJAN VLAJ**, PEKKA LOULA* & BOGOMIR HORVAT**
*Tampere University of Technology, Pori, P.O. Box 300, FIN-28101 Pori, FINLAND
**University of Maribor, FERI Maribor, Smetanova 17, SI-2000 Maribor, SLOVENIA
http://www.pori.tut.fi, http://dsplab.uni-mb.si

Abstract: - Speech recognition over different transmission channels places demands on parametrically encoded/decoded speech. The effects of different types of noise have been studied extensively, and the parametrization process is known to degrade decoded speech compared to the original speech. But does the encoding/decoding process modify the speech so much that it degrades the speech recognition result? If it does, what may cause the degradation? We have studied ten different speech codec configurations with noisy data. The results indicate that the vocal tract parameters play a key role in speech information transmission and that certain residual modeling methods reduce the effect of the noise.

Key-Words: - speech recognition; speech coding; quality evaluation; SNR; codec

1 Introduction
Speech recognition over the fixed telephone network has worked successfully in several automated services over the decades. Rapidly evolving technology is opening up several other speech communication possibilities, and speech interfaces for semi- and fully-automated computer services over communication channels other than the telephone network are also becoming reality [1]-[5].

There are several approaches to implementing a speech recognition service over a distant medium. Traditionally, speech is transferred as-is to a central server, where the features are extracted by custom applications. In order to save bandwidth in the transfer medium, speech is usually compressed with predefined, standardized compression methods, where the lossy compression causes some degradation [6]. Most speech compression and coding methods utilize a predefined vocal tract model to remove redundant information. It is also known that Linear Prediction (LP) filter parameters can be used directly, or with small modifications, for speech recognition purposes [7].

Due to increasing processing power and memory resources, speech recognition applications can be implemented in portable phones and handheld devices in the near future. A hybrid solution of these two alternatives is reported in [8]. In this approach the handheld device extracts the necessary features from the speech and sends them to a central server for post-processing. This is a good example of a codec primarily designed and optimized for speech recognition. In [9], speech recognition from the encoded bit stream has been studied.

Speech recognition accuracy and its degradation in noisy environments has been studied extensively. For example, the degradation sources in the telephone network are studied in [10], and speaker-dependent variations and the effects of different types of environments, transmission media and the codecs themselves are examined in [11]-[15].

These studies have mainly concentrated on the differences between codecs and environmental changes. Our goal was to find differences in the recognition results that are caused by the coding algorithm. The bit allocation for vocal tract parameters is quite constant across the codecs likely to be used for speech recognition purposes, so the residual signal packing and modeling of the different codecs will be the focus of interest.

In this paper, we examine the degradation effects of nine different codecs on noisy-environment data for isolated and concatenated word recognition. Earlier, we demonstrated the coding effects with a studio-quality speech database in [16].
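The route from a speech frame to LP coefficients and LP-derived cepstra mentioned above [7] can be sketched as follows. This is a minimal NumPy illustration, not code from the paper; the function names, frame handling, and model order are our own choices.

```python
import numpy as np

def autocorr(frame, order):
    """Autocorrelation r[0..order] of one speech frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    return np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])

def levinson_durbin(r, order):
    """Solve the LP normal equations; returns the polynomial coefficients
    a = [1, a1, ..., a_order] of A(z) and the final residual energy."""
    a = np.zeros(order + 1)
    a[0] = 1.0
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i]
        for j in range(1, i):
            acc += a[j] * r[i - j]
        k = -acc / err                      # reflection coefficient
        a_new = a.copy()
        for j in range(1, i):
            a_new[j] = a[j] + k * a[i - j]
        a_new[i] = k
        a = a_new
        err *= (1.0 - k * k)
    return a, err

def lpc_to_cepstrum(a, n_ceps):
    """Convert LP polynomial coefficients (a[0] = 1) to LP cepstra c[1..n_ceps]
    via the standard recursion c_n = -a_n - sum_{k<n} (k/n) c_k a_{n-k}."""
    order = len(a) - 1
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        acc = a[n] if n <= order else 0.0
        for k in range(1, n):
            if n - k <= order:
                acc += (k / n) * c[k] * a[n - k]
        c[n] = -acc
    return c[1:]
```

In practice the frame would be pre-emphasized and windowed (e.g. Hamming) before the autocorrelation, and the cepstra liftered before use as recognition features.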
2 Methods
2.1 Baseline Speech Recognizer
The baseline speech recognition system was built on an HP-Unix workstation. The Hidden Markov Model Toolkit (HTK) was used for the hidden Markov model construction [17]. The speech features for the HTK tool were selected to be Mel-cepstral coefficients and their variants. The detailed features are presented in Table 1.

Table 1: The selected speech recognition features
Coefficients               # of coeff.
MFCC + energy = MFCC_E     12+1 = 13
Delta of MFCC_E            13
Delta delta of MFCC_E      13
Total (MFCC_E_D_A)         39

The grammar was set for connected and isolated word (digit) recognition with this data.

2.2 Speech data
The noisy speech test experiments were processed with the Slovenian SpeechDat II fixed telephone database [18], consisting of 16-bit linearly quantized speech samples at an 8 kHz sampling frequency. The noise in the background of the speech samples originated from natural environments, for example line noise, cafe babble noise, and telephone booth background noise. The noisy database samples were selected from corpuses B1, C1-C4 and I1. The ratio of female and male speech samples was approximately 50% / 50% in both databases. The numbers of speakers, spoken words and connected digit sentences used in these experiments are presented in Table 2.

Table 2: The speech data used in the experiments
Noisy data       speakers    words    sentences (connected digits)
Training data    800         27804    3972
Testing data     200         6708     958

2.3 Codecs
We evaluated nine different codecs with the noisy database. The structure of the system is presented in Fig. 1.

[Fig. 1: The schematic structure of the system — the noisy speech data is divided into training and testing material; the training material feeds the baseline HP-Unix recognizer, and the testing material is passed to it either directly (baseline) or through each codec configuration (G.723.1 NoPF, G.723.1 PF, ..., GSM.ADPCM.GSM).]

The recognizer was first trained with the baseline training material from the database. Then the same test material from the database was used for testing the recognition engine, both as such (baseline) and after encoding/decoding with the codecs. The codecs and configurations are listed in Table 3.

Table 3: Different codec systems used in simulations
Codec                        Bitrate
G.723.1 No postfilter        5.3 kbit/s
G.723.1 Postfilter           5.3 kbit/s
G.723.1 No postfilter        6.4 kbit/s
G.727 ADPCM                  40 kbit/s
G.728 No postfilter          16 kbit/s
G.728 Postfilter             16 kbit/s
G.729                        8 kbit/s
GSM 6.10                     13 kbit/s
Tandem: GSM - ADPCM
Tandem: GSM - ADPCM - GSM

The G.723.1 is a dual-rate speech codec operating at 5.3 and 6.4 kbit/s. There is a built-in filtering mechanism in this codec on the decoder side, and we chose the three alternatives described in Table 3. The G.727 Adaptive Differential Pulse Code Modulation (ADPCM) codec is defined to operate at 16, 24, 32 and 40 kbit/s. We chose the best quality available for this codec in order to test the performance of a high-bit-rate adaptive filtering coding method. The G.728 is a low-delay codec that operates at a 16 kbit/s bit rate; it also has the possibility to post-filter the output speech. The G.729 is an 8 kbit/s codec described in . The GSM 6.10 is the coding mechanism common in GSM portable phone systems. The remaining two tandem experiments were built from two and three consecutive asynchronous encoding-decoding stages, simulating speech transmission over GSM and the fixed telephone network, and between two GSM systems over the fixed telephone network.

2.4 Evaluation of the codecs
In order to evaluate the speech recognition results, several methods for measuring encoded/decoded speech quality were used. We applied four different objective methods for evaluating the speech quality from the different codec configurations. The first is the segmental signal-to-noise ratio (SNR_SEG) [19]:

SNR_SEG = \frac{1}{M} \sum_{j=0}^{M-1} 10 \log_{10} \frac{\sum_{n=m_j-N+1}^{m_j} s^2(n)}{\sum_{n=m_j-N+1}^{m_j} [s(n) - \hat{s}(n)]^2}   (1)

where s(n) is the original speech sample at time n, \hat{s}(n) is the encoded/decoded speech sample at time n, M is the number of segments, m_j is the end of the current segment and N is the segment length.
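Eq. (1) averages per-segment log ratios rather than computing one global SNR, so short noisy segments are not masked by loud ones. A minimal sketch of the measure (assuming NumPy; the 160-sample segment length, i.e. 20 ms at 8 kHz, is our illustrative choice, not a value taken from the paper):

```python
import numpy as np

def snr_seg(original, decoded, seg_len=160):
    """Segmental SNR (dB) between original and encoded/decoded speech,
    in the style of Eq. (1): the mean of per-segment 10*log10 energy ratios."""
    original = np.asarray(original, dtype=float)
    decoded = np.asarray(decoded, dtype=float)
    n_seg = min(len(original), len(decoded)) // seg_len
    vals = []
    for j in range(n_seg):
        s = original[j * seg_len:(j + 1) * seg_len]
        e = s - decoded[j * seg_len:(j + 1) * seg_len]
        num = np.sum(s ** 2)
        den = np.sum(e ** 2)
        if num > 0 and den > 0:          # skip silent or identical segments
            vals.append(10.0 * np.log10(num / den))
    return float(np.mean(vals))
```

A practical implementation would also clamp each per-segment value to a range such as [-10, 35] dB before averaging, which is a common refinement of this measure.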
The second objective measurement is the segmental root-mean-square error (RMSE):

RMSE_SEG = \frac{1}{M} \sum_{j=0}^{M-1} \sqrt{\frac{1}{N} \sum_{n=m_j-N+1}^{m_j} [s(n) - \hat{s}(n)]^2}   (2)

For evaluating objective spectral similarity, the Itakura (ID) and Itakura-Saito (ISD) distance measures were used [19]:

ID_SEG = \frac{1}{M} \sum_{j=0}^{M-1} \log \frac{b^T(m_j) \tilde{R}_s(m_j) b(m_j)}{a^T(m_j) \tilde{R}_s(m_j) a(m_j)}   (3)

ISD_SEG = \frac{1}{M} \sum_{j=0}^{M-1} \frac{[a(m_j) - b(m_j)]^T \tilde{R}_s(m_j) [a(m_j) - b(m_j)]}{a^T(m_j) \tilde{R}_s(m_j) a(m_j)}   (4)

where b(m_j) is the linear prediction coefficient vector calculated from the encoded/decoded speech segment m_j and b^T(m_j) is its transpose, a(m_j) is the linear prediction coefficient vector calculated from the original speech segment, and \tilde{R}_s(m_j) is the autocorrelation matrix derived from the original speech segment.

3 Results
The encoded/decoded test material was fed to the recognizer. The results of the experiments are shown in Table 4. In Table 4, "NOPF" means that the codec-specific built-in postfiltering mechanism, which smooths the output speech in the decoding phase, was disabled, and "PF" means that it was enabled. The baseline row in Table 4 is the baseline system tested with the test data without encoding/decoding; the rest of the codec configurations are the same as in Table 3.

Table 4: Recognition results with noisy data. "SNR" is the signal-to-noise ratio in decibels, "RMS" is the root-mean-square error measurement, "ID" is the Itakura distance, "ISD" is the Itakura-Saito distance and "SER" is the sentence error rate, used to present the connected digit recognition error rate.

Noisy database     SNR    RMS (x10^3)   ID    ISD   SER
Baseline           -      -             -     -     40.71
G.723.1.53.NOPF    4.6    29.12         .05   .13   39.87
G.723.1.53.PF      3.7    30.08         .07   .19   43.32
G.723.1.64.NOPF    5.4    25.47         .04   .11   41.44
G.727.ADPCM        25.5   4.63          .00   .00   42.59
G.728.NOPF         6.8    41.33         .02   .06   45.62
G.728.PF           6.2    44.28         .03   .08   45.20
G.729              2.4    30.16         .09   .25   41.75
GSM.6.10           4.2    31.88         .05   .13   40.29
GSM.ADPCM          4.1    33.38         .05   .13   40.61
GSM.ADPCM.GSM      3.4    36.63         .08   .23   42.17

4 Discussion
We presented the speech recognition results
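As a supplement to Section 2.4, the per-segment cores of the Itakura and Itakura-Saito measures in Eqs. (3)-(4) can be sketched as follows. This is a minimal NumPy illustration with our own helper names and example coefficient vectors; the full SEG measures would average these values over all segments, with the LP vectors obtained from the original and the encoded/decoded speech respectively.

```python
import numpy as np

def autocorr_matrix(frame, order):
    """Toeplitz autocorrelation matrix R~ of size (order+1) x (order+1)
    computed from one original-speech frame."""
    frame = np.asarray(frame, dtype=float)
    n = len(frame)
    r = np.array([np.dot(frame[: n - k], frame[k:]) for k in range(order + 1)])
    idx = np.abs(np.subtract.outer(np.arange(order + 1), np.arange(order + 1)))
    return r[idx]

def itakura_distance(a, b, R):
    """Eq. (3) for one segment: log ratio of LP residual energies.
    a: LP coefficient vector of the original frame (a[0] = 1),
    b: LP coefficient vector of the coded frame, R: autocorrelation
    matrix of the original frame."""
    return float(np.log((b @ R @ b) / (a @ R @ a)))

def itakura_saito_distance(a, b, R):
    """Eq. (4) for one segment: quadratic form of the coefficient
    difference, normalized by the original residual energy."""
    d = a - b
    return float((d @ R @ d) / (a @ R @ a))
```

Both measures are zero when the coded segment's LP coefficients match the original's, which is consistent with the near-zero ID/ISD values reported for the waveform-preserving G.727 ADPCM configuration in Table 4.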