DISTORTION-CLASS WEIGHTED ACOUSTIC MODELING FOR ROBUST SPEECH RECOGNITION UNDER GSM RPE-LTP CODING

Juan M. Huerta and Richard M. Stern Carnegie Mellon University Department of Electrical and Computer Engineering Email: {juan, rms}@speech.cs.cmu.edu

ABSTRACT

We present a method to reduce the effect of full-rate GSM RPE-LTP coding by combining two sets of acoustic models during recognition, one set trained on GSM-distorted speech and the other trained on clean uncoded speech. During recognition, the a posteriori probabilities of an utterance are calculated as a sum of the posteriors of the individual models, weighted according to the distortion class each state in the model represents. We analyze the origin of the spectral distortion introduced to the long-term residual by the RPE-LTP codec and discuss how this distortion varies according to phonetic class. For the Resource Management corpus, the proposed method reduces the degradation in frame-by-frame phonetic recognition accuracy introduced by GSM coding by more than 20 percent.

1. INTRODUCTION

Speech coding reduces the accuracy of speech recognition systems [4]. As speech recognition applications in cellular and mobile environments become more ubiquitous, robust recognition in these conditions becomes crucial. Even though the speech codec is only one of the several factors that contribute to the degradation in recognition accuracy, it is necessary to understand and compensate for this degradation in order to achieve a system's full potential. We focus on the acoustic distortion introduced by the full-rate GSM codec [1]. This distortion can be traced to the quantization of the log-area ratios (LAR) and to the quantization and downsampling performed in the RPE-LTP process. The distortion introduced in the residual signal affects recognition to a larger extent than the quantization that the LAR coefficients undergo [2].

In Section 2 of this paper, we discuss the origin of the distortion in the RPE-LTP codec. We observe that, depending on the "predictability" of the short-term residual signal x[n], the RPE-LTP will be able to minimize the error in the quantized long-term residual. This predictability is later shown to be related to phonetic characteristics of the signal. In Section 3 we show that the relative spectral distortion introduced in the quantized long-term residual tends to be concentrated around three levels, and that this amount of relative spectral distortion can be loosely related to the relative degradation introduced to the frame recognition accuracy by GSM coding. In Section 4 we separate the set of phonemes into clusters according to their relative spectral distortion. In Section 5 we present a method to weight two sets of acoustic models according to the distortion categories introduced in the preceding sections. We describe the results of recognition experiments using these techniques in Section 6.

2. THE RPE-LTP AS A SOURCE OF ACOUSTIC DEGRADATION

The full-rate GSM codec decomposes the speech signal into a set of LAR coefficients and a short-term residual signal [1]. The LAR coefficients are quantized and transmitted to the decoder, while the short-term residual is segmented frame-by-frame into subframes and coded by an RPE-LTP coder [3]. In this section we discuss the sources of degradation that exist in the RPE-LTP coder. Figures 1 and 2 show a simplified version of an ideal RPE-LTP codec and a simplified version of a real RPE-LTP codec, respectively.

Figure 1 depicts an ideal codec able to reconstruct a signal that is identical to the original signal. Even though this codec would not achieve any compression of the input sequence, it helps to illustrate the sources and nature of the distortion introduced by the actual RPE-LTP codec.

The ideal RPE-LTP codec works as follows: the input sequence x[n] (the short-term residual) enters the coder and is compared to a predicted sequence (the reconstructed short-term residual) produced by the predictor block (the LTP section). The difference between these two signals is what the predictor was unable to predict, which we will refer to as the "innovation sequence". In the absence of any other acoustic degradation, our ideal codec will generate a reconstructed time sequence by adding together the innovation sequence and the predicted sequence. This reconstructed sequence will be identical to the original time sequence x[n] by definition. In reality, the RPE-LTP coder does not transmit all the information necessary for the decoder to reconstruct the innovation sequence exactly. It sends subsampled and quantized information that results in an approximate innovation sequence (called the reconstructed long-term residual) that approximates the original innovation sequence to the extent possible. The degradation in the reconstructed signal will depend on the energy of the innovation signal, which in turn depends on how well the predictor module is able to "follow" or predict the next subframe of the time sequence based on the previous reconstructed subframes.
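To make the mechanics above concrete, the following Python sketch mimics the predict / innovate / reconstruct loop. It is not the GSM 06.10 RPE-LTP algorithm: the one-tap long-term predictor, the subframe length, the 3:1 decimation, and the uniform quantizer are simplified stand-ins chosen only to illustrate that the reconstruction error grows with the energy of the innovation sequence.

```python
import numpy as np

# Toy sketch of the predict / innovate / reconstruct loop (NOT GSM 06.10).
# All constants below are illustrative choices, not the standardized values.

SUBFRAME = 40      # samples per subframe
MAX_LAG = 120      # longest lag searched by the long-term predictor


def predict(history, target):
    """One-tap long-term predictor: pick the lag/gain of the past
    reconstructed samples that best matches the target subframe."""
    best_lag, best_gain, best_err = SUBFRAME, 0.0, np.inf
    for lag in range(SUBFRAME, MAX_LAG + 1):
        past = history[len(history) - lag:len(history) - lag + SUBFRAME]
        denom = float(np.dot(past, past))
        gain = float(np.dot(target, past)) / denom if denom > 0.0 else 0.0
        err = float(np.sum((target - gain * past) ** 2))
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    past = history[len(history) - best_lag:len(history) - best_lag + SUBFRAME]
    return best_gain * past


def code_innovation(innovation, levels=8):
    """Crude stand-in for RPE coding: keep every third sample, quantize it
    uniformly, and fill the dropped positions with zeros."""
    kept = innovation[::3]
    peak = float(np.max(np.abs(kept)))
    step = (peak if peak > 0.0 else 1.0) / (levels / 2)
    coded = np.zeros_like(innovation)
    coded[::3] = np.round(kept / step) * step
    return coded


def encode_decode(x):
    """Run the loop over a short-term residual x, subframe by subframe."""
    history = np.zeros(MAX_LAG)              # past *reconstructed* samples
    out = []
    for start in range(0, len(x) - SUBFRAME + 1, SUBFRAME):
        sub = x[start:start + SUBFRAME]
        predicted = predict(history, sub)
        innovation = sub - predicted          # what the predictor missed
        rebuilt = predicted + code_innovation(innovation)
        out.append(rebuilt)
        history = np.concatenate([history, rebuilt])[-MAX_LAG:]
    return np.concatenate(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.zeros(1600)
    x[::80] = 1.0                             # crude periodic "voiced" residual
    x += 0.05 * rng.standard_normal(x.shape)
    y = encode_decode(x)
    print("mean squared reconstruction error:", np.mean((x[:len(y)] - y) ** 2))
```

When the residual is highly predictable (strongly periodic), the innovation carries little energy and the crude coding of it does little damage; when it is not, the coded innovation dominates and the reconstruction error grows, which is the behavior the rest of the paper exploits.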


Figure 1. Simplified block diagram of an ideal RPE-LTP coder.



Figure 2. Simplified block diagram of an actual RPE-LTP coder.

3. THE RPE-LTP INDUCED SPECTRAL DISTORTION

3.1. Relative log spectral distortion introduced by the RPE-LTP coder

We use the relative log spectral distance defined below to measure the dissimilarity between the reconstructed and the original innovation sequences. E(ω) is the power spectrum of the innovation sequence and E_R(ω) is the power spectrum of the quantized innovation sequence. We integrate the absolute value of the difference between both log power spectra and normalize it by the integral of the log of the power spectrum of the original innovation signal:

L = \frac{\int_0^{\pi} \left| \log E(\omega) - \log E_R(\omega) \right| \, d\omega}{\int_0^{\pi} \log E(\omega) \, d\omega}
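As a sanity check of the definition above, here is a minimal Python sketch of the measure, approximating the integrals by sums over a discrete frequency grid. The FFT length, the spectral floor, and the absolute value taken on the normalizer are our own choices, not part of the published definition.

```python
import numpy as np


def relative_log_spectral_distortion(e, e_r, n_fft=512, floor=1e-10):
    """Relative log spectral distortion between an innovation sequence `e`
    and its reconstructed/quantized version `e_r` (Section 3.1)."""
    E = np.abs(np.fft.rfft(e, n_fft)) ** 2          # power spectrum E(w)
    E_r = np.abs(np.fft.rfft(e_r, n_fft)) ** 2      # power spectrum E_R(w)
    log_E = np.log(np.maximum(E, floor))            # floored to avoid log(0)
    log_E_r = np.log(np.maximum(E_r, floor))
    num = np.sum(np.abs(log_E - log_E_r))           # ~ integral |log E - log E_R|
    # The paper normalizes by the integral of log E(w); we take its absolute
    # value so that the normalizer stays positive for low-energy spectra.
    den = np.abs(np.sum(log_E))
    return num / den


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e = rng.standard_normal(160)                    # toy innovation subframe
    e_r = np.zeros_like(e)
    e_r[::3] = np.round(4 * e[::3]) / 4             # crude subsample + quantize
    print("relative log spectral distortion:",
          relative_log_spectral_distortion(e, e_r))
```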

3.2. Distribution of the relative log spectral distortion

We computed the relative log spectral distortion introduced by the RPE-LTP codec to the RM corpus. Figure 3 is a histogram that shows the log of the relative frequency of observing various levels of log relative distortion. The horizontal axis is the amount of relative distortion observed in a frame. We observe that the counts are roughly clustered in 3 regions: low, medium, and high distortion, separated by breakpoints near 33 and 67 percent. We also observe that most of the frames suffer only a small amount of distortion; most of the time the LTP section of the codec does a reasonably good job of predicting the residual signal.

Figure 3. Histogram of the log distribution of relative spectral distortion introduced to a large sample of speech frames.

3.3. Impact of relative log spectral distortion on phonetic recognition

In order to analyze the relation that exists between the degradation in recognition performance and the amount of log spectral distortion introduced by the RPE-LTP block, we performed two phonetic recognition experiments: one training and testing using clean (i.e., non-GSM coded) speech data and a second experiment using speech that underwent GSM coding. We computed the frame accuracy for each amount of relative spectral distortion both for when GSM coding was present and when it was absent. Figure 4 depicts how much the frame-based phonetic recognition error rate increased due to the introduction of the GSM coding as a function of the relative spectral distortion. We observe that, generally speaking, the phonemes that produced a greater amount of distortion due to GSM coding also suffered greater increases in frame error rate.

Figure 4. Relative degradation in frame-based phonetic accuracy as a function of the relative spectral distortion introduced by the RPE-LTP coder.

4. RELATING RELATIVE LOG SPECTRAL DISTORTION PATTERNS TO PHONETIC CLASSES

4.1. Clustering phonetic classes using the relative log-spectral distortion

We grouped the 52 phonetic units used by our recognition system into phonetic clusters by incrementally clustering the closest histograms of the counts of the relative log spectral distortion for the frames associated with each phonetic unit in the corpus. The distance between each pair of histograms was calculated using normalized-area histograms. The clustering yielded five classes, as shown in Table 1 below.

Class  Phonemes in class                                      Frame Acc.  Frame Acc., GSM  % Degradation
1      PD KD                                                  46.76%      44.10%           5%
2      IX B BD DD DH M N JH NG V W Y DX ZH AX SH              68.28%      58.34%           31.34%
3      D G K P T TD CH                                        70.88%      65.94%           16.96%
4      F HH S Z TH TS                                         72.39%      57.63%           53.43%
5      IY OW OY UW L R AA AE AH AXR EH AO IH ER AW AY UH EY   74.37%      67.90%           25.24%

Table 1. Phonetic classes generated by automatically clustering the phoneme distortion histograms.
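A minimal sketch of the incremental histogram clustering described in Section 4.1 follows, assuming an L1 distance between area-normalized histograms (the paper does not spell out the exact distance); the phone names and counts in the demo are hypothetical.

```python
import numpy as np


def normalize(h):
    """Scale a count histogram to unit area."""
    s = h.sum()
    return h / s if s > 0 else h


def cluster_histograms(histograms, n_clusters=5):
    """Greedy agglomerative clustering of per-phone distortion histograms.

    `histograms` maps phone -> counts over relative-distortion bins.  At each
    step the two clusters whose area-normalized histograms are closest (L1
    distance, our stand-in) are merged, until `n_clusters` remain.
    """
    clusters = [({ph}, np.asarray(h, dtype=float)) for ph, h in histograms.items()]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = float(np.abs(normalize(clusters[i][1])
                                 - normalize(clusters[j][1])).sum())
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phones = ["PD", "KD", "F", "S", "IY", "AA", "M", "N"]        # toy subset
    hists = {ph: rng.integers(1, 100, size=10) for ph in phones}  # hypothetical
    print(cluster_histograms(hists, n_clusters=3))
```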

Classes 1, 3, and 4 encompass the majority of the consonants, Class 4 being mostly fricatives. Class 2 includes the remaining consonants and a couple of vowels, while Class 5 encompasses the vowels, diphthongs, and semivowels. Silence units were omitted from this analysis. We can see that even though the classes do not exactly correspond to phonetic classes, they achieve a reasonable partition between broad classes of sounds. The pattern of distributions of relative log spectral distortion introduced by the RPE-LTP is strongly related to phonetic properties of the signal.

A plot of the normalized histograms of the five classes is shown in Figure 5. These five classes exhibit different patterns of distortion in each of the three regions defined in Section 3.2. Classes 1 and 3 have a larger percentage of counts in the high distortion region than in the middle area, while Classes 2 and 4 have more counts in the middle area. Class 5 is almost completely concentrated in the low distortion region.

Figure 5. Histograms of the relative spectral distortion for the five phonetic classes of Table 1.

5. ACOUSTIC MODEL WEIGHTING

Acoustic modeling for HMM-based speech recognition commonly makes use of mixtures of Gaussian densities, each distribution representing a set of tied states. The probability that an observed vector o_t has been emitted by a certain state j is thus expressed by

b_j(o_t) = \sum_{k=1}^{M} c[j,k] \, N(o_t; \mu_{jk}, C_{jk})

The term c[j,k] expresses the prior probability of the kth Gaussian component of the jth state of the HMM. For a given state j, the sum of c[j,k] over all k is equal to 1.

We can consider the amount of distortion that a phonetic class undergoes while evaluating the observation probability, using several models that represent different distortion regions. We can express this by introducing a function f that weights the probability of the pth model depending on d_t, the distortion of the observed frame at time t, and on j, which indicates the model representing a given phonetic unit. The function f also depends on the prior probability of the model c_p[j,k], so it can be viewed as a weighted version of c_p[j,k]. The resulting posterior probability becomes

b_{j,d_t}(o_t) = \sum_{p=1}^{2} \sum_{k=1}^{M} f(c_p[j,k], d_t, j) \, N(o_t; \mu_{jk}^{p}, C_{jk}^{p})

The function f can also be made dependent on the distortion class the model represents. This way, f will weight the clean models more heavily for states that model phonemes that suffer small average GSM distortion. Alternatively, one can make the function f depend on knowledge of the instantaneous relative distortion d_t of each frame if this information is available. In that case the function should give more weight to the distorted models when the relative frame distortion is greater.
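A sketch of the weighted observation probability above is given below, assuming f simply scales each model set's mixture weights by a per-distortion-class factor and that the Gaussians have diagonal covariances. The model parameters and class weights in the demo are illustrative, not the optimized values reported in Section 6.1.

```python
import numpy as np


def diag_gauss(o, mean, var):
    """Density of a diagonal-covariance Gaussian at observation o."""
    return np.exp(-0.5 * np.sum((o - mean) ** 2 / var)
                  - 0.5 * np.sum(np.log(2.0 * np.pi * var)))


def weighted_state_likelihood(o, state_models, class_weights, distortion_class):
    """b_{j,d}(o): sum over the clean and GSM model sets for one tied state j,
    each mixture weight c_p[j,k] scaled by a factor that depends on the
    distortion class of the state's phone (one possible form of f)."""
    total = 0.0
    for p, (c, means, vars_) in state_models.items():
        w = class_weights[distortion_class][p]      # weight for model set p
        for k in range(len(c)):
            total += w * c[k] * diag_gauss(o, means[k], vars_[k])
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, M = 13, 2                                  # 2-Gaussian mixtures, as in Section 6

    def random_set():
        return np.array([0.6, 0.4]), rng.standard_normal((M, dim)), np.ones((M, dim))

    models = {"clean": random_set(), "gsm": random_set()}
    # Hypothetical weights: lean on the clean models for low-distortion classes.
    weights = {"low": {"clean": 0.8, "gsm": 0.2},
               "high": {"clean": 0.3, "gsm": 0.7}}
    o = rng.standard_normal(dim)
    print(weighted_state_likelihood(o, models, weights, "low"))
```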

6. SPEECH RECOGNITION EXPERIMENTS

Phonetic recognition experiments were performed using the Resource Management corpus and the SPHINX-3 system. Our basic acoustic models consisted of 2500 tied states, each modeled by a mixture of two Gaussians. One set of models was trained on clean speech while the other set of models was trained with speech that underwent GSM coding. We obtained our reference transcriptions by performing automatic forced alignment on the 1600 test utterances using their pronunciations. We computed both the phone accuracy and the frame accuracy. Phone accuracy is calculated considering insertions and deletions of phones in the hypothesis. We evaluated the frame accuracy by making a frame-by-frame comparison of the output of the phonetic recognizer with the alignment. Table 2 summarizes recognition accuracy for clean and GSM speech, using clean and GSM models. We also show the results when models are trained using both GSM-coded speech and clean speech (multi-style trained models).

Testing data  Training data           Phone Accuracy  Frame Accuracy
Clean         Clean                   63.3%           69.86%
GSM           Clean                   55.7%           61.20%
GSM           GSM                     60.1%           64.50%
GSM           GSM+Clean (Multistyle)  60.1%           64.64%

Table 2. Baseline frame accuracy and phoneme recognition accuracy in Resource Management.

From Table 2 we can see that the presence of GSM codec distortion during training and recognition reduces frame accuracy by 5.35 points. Recognition accuracy can be improved somewhat by using multi-style trained models.
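For reference, the frame-level scoring described above reduces to a per-frame label comparison against the forced alignment; a minimal sketch (with hypothetical labels) follows.

```python
def frame_accuracy(hyp_frames, ref_frames):
    """Frame accuracy: fraction of frames whose hypothesized phone label
    matches the forced-alignment reference.  Both inputs are equal-length
    sequences of per-frame phone labels."""
    assert len(hyp_frames) == len(ref_frames)
    correct = sum(h == r for h, r in zip(hyp_frames, ref_frames))
    return correct / len(ref_frames)


if __name__ == "__main__":
    ref = ["sil", "ih", "ih", "t", "t", "sil"]
    hyp = ["sil", "ih", "iy", "t", "t", "sil"]
    print(f"frame accuracy: {frame_accuracy(hyp, ref):.2%}")   # 83.33%
```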

6.1. Experiments using weighted acoustic models

We performed recognition experiments using weighted models by considering different mixing factors for the two sets of acoustic models, trained on GSM-coded and on clean speech. Table 3 shows the results obtained when weighting both models equally. We also made these weights dependent on the phonetic class, using three automatically-clustered phonetic classes. While there is a modest advantage to weighting the models differently, either of these two weighting schemes provides better results than using multi-style trained models. The 3-class optimized weights reduce the degradation in frame error rate introduced by GSM coding by 27%, from 5.36% to 3.87%. The weighting improves recognition accuracy approximately equally for all three phonetic classes.

Weighted modeling           Phone Accuracy  Frame Accuracy
Equal weights               61.2%           65.77%
3-class optimized weights   61.2%           65.99%

Table 3. Frame accuracy obtained when using equally weighted models and when using phonetic-class based weighting.

7. CONCLUSIONS

In this paper we examined the distortion introduced by the RPE-LTP GSM codec to the short-term residual signal. We related this distortion to the relative spectral distortion produced by the quantized long-term residual. We observed that this distortion influences recognition performance, and that it can be related to phonetic properties of the speech signal. We introduced a method to mix two acoustic models that had been trained on clean and distorted speech by taking into consideration the amount of distortion suffered by the phonetic classes.

8. REFERENCES

[1] European Telecommunication Standards Institute, "European digital telecommunications system (Phase 2); Full rate speech processing functions (GSM 06.01)", 1994.
[2] Huerta, J. M. and Stern, R. M., "Speech Recognition from GSM Codec Parameters", Proc. ICSLP-98, 1998.
[3] Kroon, P., Deprettere, E. F., and Sluyter, R. F., "Regular-Pulse Excitation - A Novel Approach to Effective and Efficient Multi-pulse Coding of Speech", IEEE Trans. on Acoustics, Speech and Signal Processing, 34:1054-1063, 1986.
[4] Lilly, B. T. and Paliwal, K. K., "Effect of Speech Coders on Speech Recognition Performance", Proc. ICSLP-96, 1996.