Parameters and Their Influence on Speech Recognition

JARI TURUNEN*, DAMJAN VLAJ**, PEKKA LOULA* & BOGOMIR HORVAT**
*Tampere University of Technology, Pori, P.O. Box 300, FIN-28101 Pori, FINLAND
**University of Maribor, FERI Maribor, Smetanova 17, SI-2000 Maribor, SLOVENIA

http://www.pori.tut.fi, http://dsplab.uni-mb.si

Abstract: - Speech recognition over different transmission channels sets demands on parametrically encoded/decoded speech. The effects of different types of noise have been studied extensively, and the parametrization process is known to degrade decoded speech compared to the original speech. But does the encoding/decoding process modify the speech so much that it degrades the speech recognition result? If it does, what may cause the degradation? We have studied ten different codec configurations with noisy data. The results indicate that the vocal tract parameters play a key role in speech information transmission, and that certain residual modeling methods reduce the effect of the noise.

Key-Words: - speech recognition; speech coding; quality evaluation; SNR; codec

1 Introduction
Speech recognition over the fixed telephone network has worked successfully in several automated services over the decades. Rapidly evolving technology gives rise to several other speech communication possibilities, and the need to develop speech interfaces for semi- and fully automated computer services over communication channels other than the telephone network is also becoming reality [1]-[5].

There are several approaches to implementing a speech recognition service over distant media. Traditionally, speech is transferred as it is to a central server, where the features are extracted in custom applications. In order to save bandwidth in the transfer media, speech is usually compressed with predefined and standardized compression methods, where the compression causes some degradation [6]. Most speech compression and coding methods utilize some predefined vocal tract model to remove redundant information. It is also known that Linear Prediction (LP) filter parameters can be used directly, or with small modifications, for speech recognition purposes [7].

Due to increased processing power and memory resources, speech recognition applications can be implemented in portable phones and handheld devices in the near future. A hybrid solution of these two alternatives is reported in [8]. In this approach the handheld device extracted the necessary features from the speech and sent them to a central server for post-processing. This is a good example of a codec primarily designed and optimized for speech recognition. In [9], speech recognition from the encoded bit stream has been studied.

Speech recognition accuracy and its degradation in noisy environments have been studied extensively. For example, the degradation sources in the telephone network are studied in [10], and speaker-dependent variations and the effects of different types of environments, transmission media and the codecs themselves are examined in [11]-[15]. These studies have mainly concentrated on the differences between codecs and on environmental changes. Our goal was to find differences in the recognition results that are caused by the coding algorithm itself. The bit allocation for vocal tract parameters is quite constant in the codecs that will be used for speech recognition purposes, so the residual signal packing and modeling in the different codecs will be the focus of interest.

In this paper, we examined the degradation effects of nine different codecs with noisy-environment data on isolated and concatenated word recognition. Earlier we demonstrated the coding effects with a studio-quality speech database in [16].

2 Methods
2.1 Baseline Speech Recognizer
The baseline speech recognition system was built on an HP-Unix workstation. The Hidden Markov Model Toolkit (HTK) was used for the hidden Markov model construction [17]. The speech features for the HTK tool were selected to be Mel-cepstral coefficients and their variants. The detailed features are presented in Table 1.

Table 1: The selected speech recognition features
Coefficients            | # of coeff.
MFCC + energy = MFCC_E  | 12+1 = 13
Delta of MFCC_E         | 13
Delta delta of MFCC_E   | 13
Total (MFCC_E_D_A)      | 39

The grammar was set for connected and isolated word (digit) recognition with this data.
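The 13 static coefficients in Table 1 are extended to 39 dimensions by appending first- and second-order time derivatives. As an illustration only (HTK computes deltas with a wider regression window; the two-point slope below is a simplification), the stacking can be sketched as:

```python
from typing import List

def deltas(frames: List[List[float]]) -> List[List[float]]:
    """First-order time derivative of a feature sequence.

    Simplified two-point slope (c[t+1] - c[t-1]) / 2 with edge
    replication; HTK uses a wider regression window, but the idea
    is the same.
    """
    out = []
    n = len(frames)
    for t in range(n):
        prev = frames[max(t - 1, 0)]
        nxt = frames[min(t + 1, n - 1)]
        out.append([(b - a) / 2.0 for a, b in zip(prev, nxt)])
    return out

def mfcc_e_d_a(static: List[List[float]]) -> List[List[float]]:
    """Stack MFCC_E (13) with its deltas and delta-deltas -> 39 dims."""
    d = deltas(static)
    a = deltas(d)
    return [s + x + y for s, x, y in zip(static, d, a)]

# Example: 5 frames of 13 static coefficients (12 MFCC + energy)
frames = [[float(t + i) for i in range(13)] for t in range(5)]
features = mfcc_e_d_a(frames)
print(len(features[0]))  # 39
```

Applying the same `deltas` function twice yields the acceleration ("delta delta") coefficients, giving the 13 + 13 + 13 = 39 features per frame of Table 1.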
2.2 Speech data
The noisy speech experiments used the Slovenian SpeechDat II fixed telephone database [18], consisting of 16-bit linearly quantized speech samples at an 8 kHz sampling frequency. The noise in the background of the speech samples originated from natural environments, for example line noise, cafe babble noise and telephone booth background noise. The noisy database samples were selected from corpuses B1, C1-C4 and I1. The ratio of female and male speech samples was approximately 50% / 50% in both databases. The numbers of speakers, spoken words and connected digit sentences used in these experiments are presented in Table 2.

Table 2: The speech data used in the experiments
Noisy data    | speakers | words | sentences (connected digits)
Training data | 800      | 27804 | 3972
Testing data  | 200      | 6708  | 958

2.3 Codecs
We evaluated nine different codecs with the noisy database. The recognizer was first trained with the baseline training material from the database. The same test material was then used both for testing the recognition engine with the baseline system and, after encoding/decoding with each codec, for the codec experiments. The codecs and configurations are listed in Table 3.

Table 3: Different codec systems used in the simulations
Codec                     | Bitrate
G.723.1 No postfilter     | 5.3 kbit/s
G.723.1 Postfilter        | 5.3 kbit/s
G.723.1 No postfilter     | 6.4 kbit/s
G.727 ADPCM               | 40 kbit/s
G.728 No postfilter       | 16 kbit/s
G.728 Postfilter          | 16 kbit/s
G.729                     | 8 kbit/s
GSM 6.10                  | 13 kbit/s
Tandem: GSM - ADPCM       |
Tandem: GSM - ADPCM - GSM |

The G.723.1 is a dual-rate speech codec operating at 5.3 and 6.4 kbit/s. There is a built-in filtering mechanism in this codec on the decoder side, and we chose the three alternatives listed in Table 3. The G.727 Adaptive Differential Pulse Code Modulation (ADPCM) codec is defined to operate at 16, 24, 32 and 40 kbit/s; we chose the best available quality in order to test the performance of the adaptive filtering coding method. The G.728 is a low-delay codec that operates at a 16 kbit/s bit-rate; it also has the possibility to post-filter the output speech. The G.729 is an 8 kbit/s codec. The GSM 6.10 is the coding mechanism common in GSM portable phone systems. The remaining two tandem experiments were built from two and three consecutive asynchronous encoding/decoding stages, simulating speech transmission over GSM and the fixed telephone network, and between two GSM phones over the fixed telephone network. The structure of the system is presented in Fig. 1.
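The tandem configurations concatenate independent encode/decode passes. The real chains use the GSM and ADPCM codecs; purely as an illustration of why tandeming hurts, the toy sketch below chains two hypothetical quantizing "codecs" (8-bit mu-law companding and a coarser uniform quantizer) and measures how the SNR drops after the three-stage chain:

```python
import math

def mulaw_codec(x, mu=255.0, levels=256):
    """Toy codec 1: 8-bit mu-law compand, quantize, expand."""
    s = max(-1.0, min(1.0, x))
    y = math.copysign(math.log1p(mu * abs(s)) / math.log1p(mu), s)
    half = levels / 2 - 1
    q = round(y * half) / half                       # quantize in companded domain
    return math.copysign(math.expm1(abs(q) * math.log1p(mu)) / mu, q)

def uniform_codec(x, levels=64):
    """Toy codec 2: coarser uniform quantizer (a different grid)."""
    s = max(-1.0, min(1.0, x))
    half = levels / 2 - 1
    return round(s * half) / half

def snr_db(ref, deg):
    """Global signal-to-noise ratio in dB."""
    num = sum(r * r for r in ref)
    den = sum((r - d) ** 2 for r, d in zip(ref, deg)) or 1e-12
    return 10.0 * math.log10(num / den)

# A short synthetic "speech" signal: 200 Hz tone at 8 kHz sampling
sig = [0.5 * math.sin(2 * math.pi * 200 * n / 8000.0) for n in range(800)]

once = [mulaw_codec(v) for v in sig]                 # single codec pass
tandem = [mulaw_codec(uniform_codec(v)) for v in once]  # 3-stage chain: codec1 -> codec2 -> codec1

print(round(snr_db(sig, once), 1), round(snr_db(sig, tandem), 1))
```

Each stage re-quantizes onto a different grid, so the quantization errors accumulate; the same effect, with the real codecs, appears in the tandem rows of Table 4.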
Fig. 1: The schematic structure of the system. [Figure: block diagram. The noisy speech data is split into training and testing material; the recognizer is trained on the baseline training data, and the test data is evaluated both directly (baseline) and after encoding/decoding with each codec configuration of Table 3 (G.723.1 NoPF, G.723.1 PF, ..., GSM.ADPCM.GSM).]

2.4 Evaluation of the codecs
In order to evaluate the speech recognition results, several methods for measuring encoded/decoded speech quality were used. We applied four different objective methods for evaluating the speech quality from the different codec configurations. The first is the segmental signal-to-noise ratio (SNRSEG) [19]:

SNR_{SEG} = \frac{1}{M} \sum_{j=0}^{M-1} 10 \log_{10} \left[ \frac{\sum_{n=m_j-N+1}^{m_j} s^2(n)}{\sum_{n=m_j-N+1}^{m_j} [s(n) - \hat{s}(n)]^2} \right]    (1)

where s(n) is the original speech sample at time n, \hat{s}(n) is the encoded/decoded speech sample at time n, M is the number of segments, m_j is the end of the current segment and N is the segment length.

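Equation (1) translates almost directly into code. A minimal sketch, assuming non-overlapping segments of N samples and ignoring any partial tail segment:

```python
import math

def snr_seg(orig, coded, seg_len=256):
    """Segmental SNR of Eq. (1): the average over segments of the
    per-segment SNR in dB. Segments are non-overlapping, seg_len
    samples each; a partial tail segment is ignored."""
    m = len(orig) // seg_len              # number of full segments M
    total = 0.0
    for j in range(m):
        s = orig[j * seg_len:(j + 1) * seg_len]
        e = coded[j * seg_len:(j + 1) * seg_len]
        num = sum(x * x for x in s)
        den = sum((x - y) ** 2 for x, y in zip(s, e))
        total += 10.0 * math.log10((num + 1e-12) / (den + 1e-12))
    return total / m

# Identical signals give a very high segmental SNR; more noise lowers it.
clean = [math.sin(0.01 * n) for n in range(1024)]
noisy = [v + 0.01 for v in clean]
print(snr_seg(clean, noisy) > snr_seg(clean, [v + 0.1 for v in clean]))  # True
```

The small 1e-12 terms are only there to keep the logarithm finite on silent or error-free segments; they are not part of Eq. (1) itself.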
The second objective measurement is the segmental root-mean-square error (RMSE):

RMSE_{SEG} = \frac{1}{M} \sum_{j=0}^{M-1} \sqrt{\frac{1}{N} \sum_{n=m_j-N+1}^{m_j} [s(n) - \hat{s}(n)]^2}    (2)

For evaluating objective spectral similarity, the Itakura distance (ID) and the Itakura-Saito distance (ISD) measures were used [19]:

ID_{SEG} = \frac{1}{M} \sum_{j=0}^{M-1} \log \frac{b^{T}(m_j) \tilde{R}_s(m_j) b(m_j)}{a^{T}(m_j) \tilde{R}_s(m_j) a(m_j)}    (3)

ISD_{SEG} = \frac{1}{M} \sum_{j=0}^{M-1} \frac{[a(m_j) - b(m_j)]^{T} \tilde{R}_s(m_j) [a(m_j) - b(m_j)]}{a^{T}(m_j) \tilde{R}_s(m_j) a(m_j)}    (4)

where b(m_j) is the linear prediction coefficient vector calculated from the encoded/decoded speech segment m_j, b^{T}(m_j) is its transpose, a(m_j) is the linear prediction coefficient vector calculated from the original speech segment and \tilde{R}_s(m_j) is the autocorrelation matrix derived from the original speech segment.

The sentence error rate (SER) was used for evaluating the recognition success. SER is the error rate of connected digit strings, for example "three-seven-six", and is calculated as follows:

SER = 100\% - \frac{H}{N} \times 100\%    (5)

where H is the number of correct labels and N is the total number of labels in the defining transcription files. Insertions and deletions were judged as wrong recognition results.

The segment size in all evaluation experiments was 256 samples. Some of the codecs add a small delay to the decoded sound files; this was compensated for by time-aligning the signals with a minimum-MSE criterion.

3 Results
The HMM recognizer was trained with the noisy database speech, and in the testing phase the encoded/decoded test speech samples were fed to the recognizer. The results of the experiments are shown in Table 4. In Table 4, "NOPF" means that the codec-specific built-in postfiltering mechanism, which smooths the output speech in the decoding phase, was disabled, and "PF" means that it was enabled. The baseline row in Table 4 is the baseline system tested with the test data without encoding/decoding; the rest of the codec configurations are the same as in Table 3.

Table 4: Recognition results with noisy data. "SNR" is the signal-to-noise ratio in decibels, "RMS" is the root-mean-square error measurement, "ID" is the Itakura distance, "ISD" is the Itakura-Saito distance and "SER" is the sentence error rate, used to present the connected digit recognition error rate.

Noisy database  | SNR  | RMS (x10^3) | ID  | ISD | SER
Baseline        |      |             |     |     | 40.71
G.723.1.53.NOPF | 4.6  | 29.12       | .05 | .13 | 39.87
G.723.1.53.PF   | 3.7  | 30.08       | .07 | .19 | 43.32
G.723.1.64.NOPF | 5.4  | 25.47       | .04 | .11 | 41.44
G.727.ADPCM     | 25.5 | 04.63       | .00 | .00 | 42.59
G.728.NOPF      | 6.8  | 41.33       | .02 | .06 | 45.62
G.728.PF        | 6.2  | 44.28       | .03 | .08 | 45.20
G.729           | 2.4  | 30.16       | .09 | .25 | 41.75
GSM.6.10        | 4.2  | 31.88       | .05 | .13 | 40.29
GSM.ADPCM       | 4.1  | 33.38       | .05 | .13 | 40.61
GSM.ADPCM.GSM   | 3.4  | 36.63       | .08 | .23 | 42.17

4 Discussion
We presented speech recognition results with nine different codec configurations. These results, obtained from isolated and concatenated word recognition experiments, cannot be extended to continuous speech recognition.

The goal of this research was not to measure subjective values of the coding/recognition process. Instead, our purpose was to find facts about the speech encoding/decoding that may affect the speech recognition process. The objective measurement methods described in the previous section were selected for their simplicity and their ability to show different types of signal properties, although better total signal evaluation methods exist, for example ITU-T P.862.
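The sentence-level scoring of Eq. (5) reduces to a few lines. A minimal sketch, assuming each connected digit string is scored as a single label that must match exactly (any substitution, insertion or deletion makes the sentence wrong):

```python
def sentence_error_rate(reference, recognized):
    """SER of Eq. (5): 100% - H/N * 100%, where N is the number of
    reference sentences and H the number recognized exactly right.
    A sentence with any substitution, insertion or deletion counts
    as wrong."""
    n = len(reference)
    h = sum(1 for ref, hyp in zip(reference, recognized) if ref == hyp)
    return 100.0 - 100.0 * h / n

# Hypothetical reference and recognizer outputs for four digit strings
refs = [["three", "seven", "six"], ["one", "two"], ["nine"], ["four", "five"]]
hyps = [["three", "seven", "six"], ["one", "one"], ["nine"], ["four"]]
print(sentence_error_rate(refs, hyps))  # 50.0
```

In the second sentence a digit is substituted and in the fourth one is deleted, so H = 2 of N = 4 sentences are correct and the SER is 50%.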
All of the codecs use a certain vocal tract model for redundancy removal from the speech. In most cases the vocal tract model is based on Linear Prediction (G.723.1 10th order, G.728 50th order, G.729 10th order and GSM 6.10 8th order filters). The linear filter parameters are compressed to a certain bitrate; for example, the G.723.1 linear filter coefficients are compressed using a Line Spectral Pair / Split Vector Quantization combination to 24 bits per speech frame. The G.729 utilizes a similar type of filter coefficient compression to 18 bits per speech frame. The GSM 6.10 vocal tract model differs slightly from the previous codecs: the reflection coefficients from the LPC analysis are compressed with Log-Area-Ratios and quantized to 36 bits.

The G.727 differs from the previous ones because it does not use an all-pole filter model but a pole-zero model. These pole-zero filter coefficients are not sent to the decoder; instead, both the encoder and decoder filter coefficients are updated using the signal estimate and the quantized residual. The G.728 is also an exception: the LP parameters are not transmitted to the decoder, but the decoder updates its estimate of the vocal tract parameters for the 50th order predictor based on the transmitted residual and amplitude parameters. From this point of view, we can analyze the importance of sending the vocal tract parameters in speech coding and recognition by measuring the results with G.727 and G.728. The rest of the codecs show different ways of coding the residual signal and reveal possible significant features that must be encoded because they may affect speech recognition.

The HMM-based speech recognizer uses Mel-cepstral coefficient based features. The cepstral coefficients are related to the linear predictive coding coefficients, which model the spectral envelope of the human vocal tract. This fact suggests that it is important that the vocal tract is modeled and transmitted correctly in low-bit-rate vocoders. In our case the recognition is not performed directly from the parameters; the speech is first decoded and then recognized.
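The link between the codecs' LP parameters and the recognizer's cepstral features can be made concrete: for an all-pole model, the cepstrum follows from the LP coefficients by a simple recursion. A sketch (note this yields LP-derived cepstra, not the mel-frequency cepstra the recognizer actually uses):

```python
def lpc_to_cepstrum(a, n_ceps):
    """Convert LPC coefficients to cepstral coefficients with the
    standard recursion  c[n] = a[n] + sum_{k=1..n-1} (k/n) c[k] a[n-k].
    `a` holds a[1..p] for the all-pole model A(z) = 1 - sum a[k] z^-k.
    This is the link between a codec's vocal tract model and
    cepstral features."""
    p = len(a)
    c = [0.0] * (n_ceps + 1)          # c[0] (the gain term) is unused here
    for n in range(1, n_ceps + 1):
        acc = a[n - 1] if n <= p else 0.0
        for k in range(1, n):
            if n - k <= p:
                acc += (k / n) * c[k] * a[n - k - 1]
        c[n] = acc
    return c[1:]

# A stable 2nd-order model: A(z) = 1 - 0.9 z^-1 + 0.2 z^-2
print([round(v, 4) for v in lpc_to_cepstrum([0.9, -0.2], 4)])
# [0.9, 0.205, 0.063, 0.022]
```

Because the cepstra are a deterministic function of the LP coefficients, any quantization error in the transmitted vocal tract parameters propagates directly into the features the recognizer sees.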
The G.723.1 6.4 kbit/s and G.729 codecs show similar recognition results in Table 4. The residual signals are modeled in both codecs with adaptive and fixed codebook combinations. The objective measurement methods suggest that the G.723.1 is a slightly better coder than the G.729, but the differences are very small.

The comparison between G.727 and G.728 is also interesting; the G.728 got the worst SER results. The advantage of the G.727 over the G.728 can be explained by the fact that its bit rate is 2.5 times higher, and thus its residual modeling accuracy is better. The G.727 pole-zero filter can also adapt better to local speech variations than the 50th order filter in the G.728. However, the G.727 did not achieve better recognition results than the other, lower bit-rate coders, G.728 excepted, although its signal-to-noise ratio is the best. This suggests that additive noise may be present in the encoded residual and is thus transferred to the decoder as it is.

The G.723.1 5.3 kbit/s and GSM 6.10 are also a very interesting pair, although their vocal tract models are not exactly the same and spectral variations may exist in the residual signals. Their objective measurements and recognition results are the same, although the GSM 6.10 uses 13 kbit/s for the encoded speech transmission. The G.723.1 5.3 kbit/s has the lowest bitrate in this study, but it survived best. The G.723.1 5.3 kbit/s uses fixed and adaptive codebooks with a pitch period search for the residual modeling, while the GSM 6.10 deals with the real residual, although reduced and differentiated, for the same purpose. The G.723.1 way of modeling the speech and residual signal gives a more accurate decoded output than GSM 6.10's, but the GSM 6.10 compensates for its modeling accuracy with a shorter frame length and a higher bitrate than the G.723.1 5.3 kbit/s. The good recognition results with both codecs may be explained by automatic noise reduction due to the coarse residual modeling.

The effect of tandeming is just as expected: several concatenated encoding phases degrade the speech quality and thus the speech recognition results.

The postfiltering worsens the G.723.1 5.3 kbit/s SER result by over 3% compared to the result without it, while the G.728 results with and without postfiltering remain almost the same. The structure of the postfilter is very complex in both cases (including a 10th order short-term filter, a long-term filter, gain scaling, etc.), and this suggests that the residual correlation analysis in the postfiltering process loses significant features from the signal in the G.723.1, whereas it enhances them slightly in the G.728 decoding process.

5 Conclusion
In this work we focused on the recognition of encoded/decoded speech. Six different codecs in ten configurations were examined with a baseline speech recognition engine in a noisy speech environment. The vocal tract information is an essential parameter in low-bit-rate codecs, especially in noisy environments, and residual modeling with structures that can compensate for the effect of the background noise also plays a significant role in noisy speech recognition.
References:
[1] Rabiner L., Applications of Speech Recognition in the Area of Telecommunications, In Proceedings of IEEE Workshop on ASR and Understanding, 1997, pp. 501-510.
[2] Isobe T., Morishima M., Yoshitani F., Koizumi N., Murakami K., Voice-activated home banking system and its field trial, In Proceedings of 4th International Conference on Spoken Language, 1996, (3), pp. 1688-1691.
[3] Carlson S., Barclay C., O'Connor J., Duckworth G., Application of Speech Recognition Technology to ITS Advanced Traveler Systems, In Proceedings of 6th Vehicle Navigation and Information Systems Conference, 1995, pp. 118-125.
[4] Gallardo-Antolin A., Peláez-Moreno C., Diaz-de-Maria F., A robust front-end for ASR over IP and GSM networks: an integrated scenario, In Proceedings of Eurospeech, 2001, pp. 1103-1107.
[5] Riskin E., Boulis C., Otterson S., Ostendorf M., Graceful Degradation of Speech Recognition Performance Over Lossy Packet Networks, In Proceedings of Eurospeech, 2001, pp. 2715-2719.
[6] Kondoz A., Digital Speech: Coding for Low Bit Rate Communications Systems, John Wiley & Sons, 1998.
[7] Rabiner L.R., Juang B.-H., Fundamentals of Speech Recognition, Prentice Hall, 1993.
[8] Reichl W., Weerackody V., Potamios A., A Codec for Speech Recognition in a Wireless System, In Proceedings of EUROCOMM 2000, Information Systems for Enhanced Public Safety and Security, 2000, pp. 34-37.
[9] Kim H., Cox R., Bitstream-Based Feature Extraction for Wireless Speech Recognition, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 2001, pp. 1607-1610.
[10] Moreno J.P., Stern R.M., Sources of Degradation of Speech Recognition in the Telephone Network, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1994, (1), pp. 109-112.
[11] Salmela P., Neural Networks in Speech Recognition, Tampere University of Technology, Publication 295, Finland, 2000.
[12] Peláez-Moreno C., Gallardo-Antolin A., Diaz-de-Maria F., Recognizing Voice Over IP: A Robust Front-End for Speech Recognition on the World Wide Web, IEEE Transactions on Multimedia, 2001, (3), pp. 209-218.
[13] Mokbel C., Mauuary L., Karray L., Jouvet D., Monné J., Simonin J., Bartkova K., Towards Improving ASR Robustness for PSN and GSM Telephone Applications, Speech Communication, 1997, (23), pp. 141-159.
[14] Lilly B.T., Paliwal K.K., Effect of speech coders on speech recognition performance, In Proceedings of 4th International Conference on Spoken Language, 1996, (4), pp. 2344-2347.
[15] Euler S., Zinke J., The Influence of Speech Coding Algorithms in Automatic Speech Recognition, In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing, 1994, (1), pp. 621-624.
[16] Turunen J., Vlaj D., A Study of Speech Coding Parameters in Speech Recognition, In Proceedings of Eurospeech, 2001, pp. 2363-2367.
[17] Young S., The HTK Book - Version 2.1, Cambridge University, UK, 1997.
[18] Kaiser J., Kacic Z., Development of Slovenian SpeechDat Database, In Workshop on Speech Database Development for Central and Eastern European Languages, 1998, paper #9, 4 p.
[19] Deller J., Proakis J., Hansen J., Discrete-Time Processing of Speech Signals, Macmillan Publishing Company, 1993.