DISTORTION-CLASS WEIGHTED ACOUSTIC MODELING FOR ROBUST SPEECH RECOGNITION UNDER GSM RPE-LTP CODING

Juan M. Huerta and Richard M. Stern Carnegie Mellon University Department of Electrical and Computer Engineering Email: {juan, rms}@speech.cs.cmu.edu

ABSTRACT

We present a method to reduce the effect of full-rate GSM RPE-LTP coding by combining two sets of acoustic models during recognition, one set trained on GSM-distorted speech and the other trained on clean uncoded speech. During recognition, the a posteriori probabilities of an utterance are calculated as a sum of the posteriors of the individual models, weighted according to the distortion class each state in the model represents. We analyze the origin of the spectral distortion introduced to the long-term residual by the RPE-LTP codec and discuss how this distortion varies according to phonetic class. For the Resource Management corpus, the proposed method reduces the degradation in frame-by-frame phonetic recognition accuracy introduced by GSM coding by more than 20 percent.

1. INTRODUCTION

Speech coding reduces the accuracy of speech recognition systems [4]. As speech recognition applications in cellular and mobile environments become more ubiquitous, robust recognition in these conditions becomes crucial. Even though the speech codec is only one of the several factors that contribute to the degradation in recognition accuracy, it is necessary to understand and compensate for this degradation in order to achieve a system's full potential. We focus on the acoustic distortion introduced by the full-rate GSM codec [1]. This distortion can be traced to the quantization of the log-area ratios (LAR) and to the quantization and downsampling performed in the RPE-LTP process. The distortion introduced in the residual signal affects recognition to a larger extent than the quantization that the LAR coefficients undergo [2].

In Section 2 of this paper, we discuss the origin of the distortion in the RPE-LTP codec. We observe that, depending on the "predictability" of the short-term residual signal x[n], the RPE-LTP will be able to minimize the error in the quantized long-term residual. This predictability is later shown to be related to phonetic characteristics of the signal. In Section 3 we show that the relative spectral distortion introduced in the quantized long-term residual tends to be concentrated around three levels, and that this amount of relative spectral distortion can be loosely related to the relative degradation introduced to the frame recognition accuracy by GSM coding. In Section 4 we separate the set of phonemes into clusters according to their relative spectral distortion. In Section 5 we present a method to weight two sets of acoustic models according to the distortion categories introduced in the preceding sections. We describe the results of recognition experiments using these techniques in Section 6.

2. THE RPE-LTP AS A SOURCE OF ACOUSTIC DEGRADATION

The full-rate GSM codec decomposes the speech signal into a set of LAR coefficients and a short-term residual signal [1]. The LAR coefficients are quantized and transmitted to the decoder, while the short-term residual is segmented frame-by-frame into subframes and coded by an RPE-LTP coder [3]. In this section we discuss the sources of degradation that exist in the RPE-LTP coder. Figures 1 and 2 show a simplified version of an ideal RPE-LTP codec and a simplified version of a real RPE-LTP codec, respectively.

Figure 1 depicts an ideal codec able to reconstruct a signal that is identical to the original signal. Even though this codec would not achieve any compression of the input sequence, it helps to illustrate the sources and nature of the distortion introduced by the actual RPE-LTP codec.

The ideal RPE-LTP codec works as follows: the input sequence x[n] (the short-term residual) enters the coder and is compared to a predicted sequence (the reconstructed short-term residual) produced by the predictor block (the LTP section). The difference between these two signals is what the predictor was unable to predict, which we will refer to as the "innovation sequence". In the absence of any other acoustic degradation, our ideal codec will generate a reconstructed time sequence by adding together the innovation sequence and the predicted sequence. This reconstructed sequence will be identical to the original time sequence x[n] by definition. In reality, the RPE-LTP coder does not transmit all the information necessary for the decoder to reconstruct the innovation sequence exactly. It sends subsampled and quantized information that results in an approximate innovation sequence (called the reconstructed long-term residual) that approximates the original innovation sequence to the extent possible. The degradation in the reconstructed signal will depend on the energy of the innovation signal, which in turn depends on how well the predictor module is able to "follow" or predict the next subframe of the time sequence based on the previous reconstructed subframes.
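To make the mechanics above concrete, the following Python sketch mimics the predict / innovate / reconstruct loop. It is not the GSM 06.10 RPE-LTP algorithm: the one-tap long-term predictor, the subframe length, the 3:1 decimation, and the uniform quantizer are simplified stand-ins chosen only to illustrate that the reconstruction error grows with the energy of the innovation sequence.

```python
import numpy as np

# Toy sketch of the predict / innovate / reconstruct loop (NOT GSM 06.10).
# All constants below are illustrative choices, not the standardized values.

SUBFRAME = 40      # samples per subframe
MAX_LAG = 120      # longest lag searched by the long-term predictor


def predict(history, target):
    """One-tap long-term predictor: pick the lag/gain of the past
    reconstructed samples that best matches the target subframe."""
    best_lag, best_gain, best_err = SUBFRAME, 0.0, np.inf
    for lag in range(SUBFRAME, MAX_LAG + 1):
        past = history[len(history) - lag:len(history) - lag + SUBFRAME]
        denom = float(np.dot(past, past))
        gain = float(np.dot(target, past)) / denom if denom > 0.0 else 0.0
        err = float(np.sum((target - gain * past) ** 2))
        if err < best_err:
            best_lag, best_gain, best_err = lag, gain, err
    past = history[len(history) - best_lag:len(history) - best_lag + SUBFRAME]
    return best_gain * past


def code_innovation(innovation, levels=8):
    """Crude stand-in for RPE coding: keep every third sample, quantize it
    uniformly, and fill the dropped positions with zeros."""
    kept = innovation[::3]
    peak = float(np.max(np.abs(kept)))
    step = (peak if peak > 0.0 else 1.0) / (levels / 2)
    coded = np.zeros_like(innovation)
    coded[::3] = np.round(kept / step) * step
    return coded


def encode_decode(x):
    """Run the loop over a short-term residual x, subframe by subframe."""
    history = np.zeros(MAX_LAG)              # past *reconstructed* samples
    out = []
    for start in range(0, len(x) - SUBFRAME + 1, SUBFRAME):
        sub = x[start:start + SUBFRAME]
        predicted = predict(history, sub)
        innovation = sub - predicted          # what the predictor missed
        rebuilt = predicted + code_innovation(innovation)
        out.append(rebuilt)
        history = np.concatenate([history, rebuilt])[-MAX_LAG:]
    return np.concatenate(out)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.zeros(1600)
    x[::80] = 1.0                             # crude periodic "voiced" residual
    x += 0.05 * rng.standard_normal(x.shape)
    y = encode_decode(x)
    print("mean squared reconstruction error:", np.mean((x[:len(y)] - y) ** 2))
```

When the residual is highly predictable (strongly periodic), the innovation carries little energy and the crude coding of it does little damage; when it is not, the coded innovation dominates and the reconstruction error grows, which is the behavior the rest of the paper exploits.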


Figure 1. Simplified block diagram of an ideal RPE-LTP coder.



Figure 2. Simplified block diagram of an actual RPE-LTP coder.

3. THE RPE-LTP INDUCED SPECTRAL DISTORTION

3.1. Relative log spectral distortion introduced by the RPE-LTP coder

We use the relative log spectral distance defined below to measure the dissimilarity between the reconstructed and the original innovation sequences. E(ω) is the power spectrum of the innovation sequence and E_R(ω) is the power spectrum of the quantized innovation sequence. We integrate the absolute value of the difference between both log power spectra and normalize it by the integral of the log of the power spectrum of the original innovation signal:

L = \frac{\int_0^{\pi} \left| \log E(\omega) - \log E_R(\omega) \right| \, d\omega}{\int_0^{\pi} \log E(\omega) \, d\omega}
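As a sanity check of the definition above, here is a minimal Python sketch of the measure, approximating the integrals by sums over a discrete frequency grid. The FFT length, the spectral floor, and the absolute value taken on the normalizer are our own choices, not part of the published definition.

```python
import numpy as np


def relative_log_spectral_distortion(e, e_r, n_fft=512, floor=1e-10):
    """Relative log spectral distortion between an innovation sequence `e`
    and its reconstructed/quantized version `e_r` (Section 3.1)."""
    E = np.abs(np.fft.rfft(e, n_fft)) ** 2          # power spectrum E(w)
    E_r = np.abs(np.fft.rfft(e_r, n_fft)) ** 2      # power spectrum E_R(w)
    log_E = np.log(np.maximum(E, floor))            # floored to avoid log(0)
    log_E_r = np.log(np.maximum(E_r, floor))
    num = np.sum(np.abs(log_E - log_E_r))           # ~ integral |log E - log E_R|
    # The paper normalizes by the integral of log E(w); we take its absolute
    # value so that the normalizer stays positive for low-energy spectra.
    den = np.abs(np.sum(log_E))
    return num / den


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    e = rng.standard_normal(160)                    # toy innovation subframe
    e_r = np.zeros_like(e)
    e_r[::3] = np.round(4 * e[::3]) / 4             # crude subsample + quantize
    print("relative log spectral distortion:",
          relative_log_spectral_distortion(e, e_r))
```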

3.2. Distribution of the relative log spectral distortion

We computed the relative log spectral distortion introduced by the RPE-LTP codec to the RM corpus. Figure 3 is a histogram that shows the log of the relative frequency of observing various levels of log relative distortion. The horizontal axis is the amount of relative distortion observed in a frame. We observe that the counts are roughly clustered in 3 regions: low, medium, and high distortion, separated by breakpoints near 33 and 67 percent. We also observe that most of the frames suffer only a small amount of distortion; most of the time the LTP section of the codec does a reasonably good job of predicting the residual signal.

Figure 3. Histogram of the log distribution of relative spectral distortion introduced to a large sample of speech frames.

3.3. Impact of relative log spectral distortion on phonetic recognition

In order to analyze the relation that exists between the degradation in recognition performance and the amount of log spectral distortion introduced by the RPE-LTP block, we performed two phonetic recognition experiments: one training and testing using clean (i.e., non-GSM coded) speech data and a second experiment using speech that underwent GSM coding. We computed the frame accuracy for each amount of relative spectral distortion both for when GSM coding was present and when it was absent. Figure 4 depicts how much the frame-based phonetic recognition error rate increased due to the introduction of the GSM coding as a function of the relative spectral distortion. We observe that, generally speaking, the phonemes that produced a greater amount of distortion due to GSM coding also suffered greater increases in frame error rate.

Figure 4. Relative degradation in frame-based phonetic accuracy as a function of the relative spectral distortion introduced by the RPE-LTP coder.

4. RELATING RELATIVE LOG SPECTRAL DISTORTION PATTERNS TO PHONETIC CLASSES

4.1. Clustering phonetic classes using the relative log-spectral distortion

We grouped the 52 phonetic units used by our recognition system into phonetic clusters by incrementally clustering the closest histograms of the counts of the relative log spectral distortion for the frames associated with each phonetic unit in the corpus. The distance between each pair of histograms was calculated using normalized-area histograms. The clustering yielded five classes, as shown in Table 1 below.

Class  Phonemes in class                                      Frame Acc.  Frame Acc., GSM  % Degradation
1      PD KD                                                  46.76%      44.10%           5%
2      IX B BD DD DH M N JH NG V W Y DX ZH AX SH              68.28%      58.34%           31.34%
3      D G K P T TD CH                                        70.88%      65.94%           16.96%
4      F HH S Z TH TS                                         72.39%      57.63%           53.43%
5      IY OW OY UW L R AA AE AH AXR EH AO IH ER AW AY UH EY   74.37%      67.90%           25.24%

Table 1. Phonetic classes generated by automatically clustering the phoneme distortion histograms.
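A minimal sketch of the incremental histogram clustering described in Section 4.1 follows, assuming an L1 distance between area-normalized histograms (the paper does not spell out the exact distance); the phone names and counts in the demo are hypothetical.

```python
import numpy as np


def normalize(h):
    """Scale a count histogram to unit area."""
    s = h.sum()
    return h / s if s > 0 else h


def cluster_histograms(histograms, n_clusters=5):
    """Greedy agglomerative clustering of per-phone distortion histograms.

    `histograms` maps phone -> counts over relative-distortion bins.  At each
    step the two clusters whose area-normalized histograms are closest (L1
    distance, our stand-in) are merged, until `n_clusters` remain.
    """
    clusters = [({ph}, np.asarray(h, dtype=float)) for ph, h in histograms.items()]
    while len(clusters) > n_clusters:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = float(np.abs(normalize(clusters[i][1])
                                 - normalize(clusters[j][1])).sum())
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = (clusters[i][0] | clusters[j][0], clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return [sorted(members) for members, _ in clusters]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    phones = ["PD", "KD", "F", "S", "IY", "AA", "M", "N"]        # toy subset
    hists = {ph: rng.integers(1, 100, size=10) for ph in phones}  # hypothetical
    print(cluster_histograms(hists, n_clusters=3))
```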

Classes 1, 3, and 4 encompass the majority of the consonants, Class 4 being mostly fricatives. Class 2 includes the remaining consonants and a couple of vowels, while Class 5 encompasses the vowels, diphthongs, and semivowels. Silence units were omitted from this analysis. We can see that even though the classes do not exactly correspond to phonetic classes, they achieve a reasonable partition between broad classes of sounds. The pattern of distributions of relative log spectral distortion introduced by the RPE-LTP is strongly related to phonetic properties of the signal.

A plot of the normalized histograms of the five classes is shown in Figure 5. These five classes exhibit different patterns of distortion in each of the three regions defined in Section 3.2. Classes 1 and 3 have a larger percentage of counts in the high distortion region than in the middle area, while Classes 2 and 4 have more counts in the middle area. Class 5 is almost completely concentrated in the low distortion region.

Figure 5. Histograms of the relative spectral distortion for the five phonetic classes of Table 1.

5. ACOUSTIC MODEL WEIGHTING

Acoustic modeling for HMM-based speech recognition commonly makes use of mixtures of Gaussian densities, each distribution representing a set of tied states. The probability that an observed vector o_t has been emitted by a certain state j is thus expressed by

b_j(o_t) = \sum_{k=1}^{M} c[j,k] \, N(o_t; \mu_{jk}, C_{jk})

The term c[j,k] expresses the prior probability of the kth Gaussian component of the jth state of the HMM. For a given state j, the sum of c[j,k] over all k is equal to 1.

We can consider the amount of distortion that a phonetic class undergoes while evaluating the observation probability, using several models that represent different distortion regions. We can express this by introducing a function f that weights the probability of the pth model depending on d_t, the distortion of the observed frame at time t, and on j, which indicates the model representing a given phonetic unit. The function f also depends on the prior probability of the model c_p[j,k], so it can be viewed as a weighted version of c_p[j,k]. The resulting posterior probability becomes

b_{j,d_t}(o_t) = \sum_{p=1}^{2} \sum_{k=1}^{M} f(c_p[j,k], d_t, j) \, N(o_t; \mu_{jk}^{p}, C_{jk}^{p})

The function f can also be made dependent on the distortion class the model represents. This way, f will weight the clean models more heavily for states that model phonemes that suffer small average GSM distortion. Alternatively, one can make the function f depend on knowledge of the instantaneous relative distortion d_t of each frame if this information is available. In that case the function should give more weight to the distorted models when the relative frame distortion is greater.
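A sketch of the weighted observation probability above is given below, assuming f simply scales each model set's mixture weights by a per-distortion-class factor and that the Gaussians have diagonal covariances. The model parameters and class weights in the demo are illustrative, not the optimized values reported in Section 6.1.

```python
import numpy as np


def diag_gauss(o, mean, var):
    """Density of a diagonal-covariance Gaussian at observation o."""
    return np.exp(-0.5 * np.sum((o - mean) ** 2 / var)
                  - 0.5 * np.sum(np.log(2.0 * np.pi * var)))


def weighted_state_likelihood(o, state_models, class_weights, distortion_class):
    """b_{j,d}(o): sum over the clean and GSM model sets for one tied state j,
    each mixture weight c_p[j,k] scaled by a factor that depends on the
    distortion class of the state's phone (one possible form of f)."""
    total = 0.0
    for p, (c, means, vars_) in state_models.items():
        w = class_weights[distortion_class][p]      # weight for model set p
        for k in range(len(c)):
            total += w * c[k] * diag_gauss(o, means[k], vars_[k])
    return total


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, M = 13, 2                                  # 2-Gaussian mixtures, as in Section 6

    def random_set():
        return np.array([0.6, 0.4]), rng.standard_normal((M, dim)), np.ones((M, dim))

    models = {"clean": random_set(), "gsm": random_set()}
    # Hypothetical weights: lean on the clean models for low-distortion classes.
    weights = {"low": {"clean": 0.8, "gsm": 0.2},
               "high": {"clean": 0.3, "gsm": 0.7}}
    o = rng.standard_normal(dim)
    print(weighted_state_likelihood(o, models, weights, "low"))
```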

6. SPEECH RECOGNITION EXPERIMENTS

Phonetic recognition experiments were performed using the Resource Management corpus and the SPHINX-3 system. Our basic acoustic models consisted of 2500 tied states, each modeled by a mixture of two Gaussians. One set of models was trained on clean speech while the other set of models was trained with speech that underwent GSM coding. We obtained our reference transcriptions by performing automatic forced alignment on the 1600 test utterances using their pronunciations. We computed both the phone accuracy and the frame accuracy. Phone accuracy is calculated considering insertions and deletions of phones in the hypothesis. We evaluated the frame accuracy by making a frame-by-frame comparison of the output of the phonetic recognizer with the alignment. Table 2 summarizes recognition accuracy for clean and GSM speech, using clean and GSM models. We also show the results when models are trained using both GSM-coded speech and clean speech (multi-style trained models).

Testing data  Training data           Phone Accuracy  Frame Accuracy
Clean         Clean                   63.3%           69.86%
GSM           Clean                   55.7%           61.20%
GSM           GSM                     60.1%           64.50%
GSM           GSM+Clean (Multistyle)  60.1%           64.64%

Table 2. Baseline frame accuracy and phoneme recognition accuracy in Resource Management.

From Table 2 we can see that the presence of GSM codec distortion during training and recognition reduces frame accuracy by 5.35 points. Recognition accuracy can be improved somewhat by using multi-style trained models.
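For reference, the frame-level scoring described above reduces to a per-frame label comparison against the forced alignment; a minimal sketch (with hypothetical labels) follows.

```python
def frame_accuracy(hyp_frames, ref_frames):
    """Frame accuracy: fraction of frames whose hypothesized phone label
    matches the forced-alignment reference.  Both inputs are equal-length
    sequences of per-frame phone labels."""
    assert len(hyp_frames) == len(ref_frames)
    correct = sum(h == r for h, r in zip(hyp_frames, ref_frames))
    return correct / len(ref_frames)


if __name__ == "__main__":
    ref = ["sil", "ih", "ih", "t", "t", "sil"]
    hyp = ["sil", "ih", "iy", "t", "t", "sil"]
    print(f"frame accuracy: {frame_accuracy(hyp, ref):.2%}")   # 83.33%
```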

6.1. Experiments using weighted acoustic models

We performed recognition experiments using weighted models by considering different mixing factors for the two sets of acoustic models, trained on GSM-coded and on clean speech. Table 3 shows the results obtained when weighting both models equally. We also made these weights dependent on the phonetic class, using three automatically-clustered phonetic classes. While there is a modest advantage to weighting the models differently, either of these two weighting schemes provides better results than using multi-style trained models. The 3-class optimized weights reduce the degradation in frame error rate introduced by GSM coding by 27%, from 5.36% to 3.87%. The weighting improves recognition accuracy approximately equally for all three phonetic classes.

Weighted modeling           Phone Accuracy  Frame Accuracy
Equal weights               61.2%           65.77%
3-class optimized weights   61.2%           65.99%

Table 3. Frame accuracy obtained when using equally weighted models and when using phonetic-class based weighting.

7. CONCLUSIONS

In this paper we examined the distortion introduced by the RPE-LTP GSM codec to the short-term residual signal. We related this distortion to the relative spectral distortion produced by the quantized long-term residual. We observed that this distortion influences recognition performance, and that it can be related to phonetic properties of the speech signal. We introduced a method to mix two acoustic models that had been trained on clean and distorted speech by taking into consideration the amount of distortion suffered by the phonetic classes.

8. REFERENCES

[1] European Telecommunication Standards Institute, "European digital telecommunications system (Phase 2); Full rate speech processing functions (GSM 06.01)", 1994.
[2] Huerta, J. M. and Stern, R. M., "Speech Recognition from GSM Codec Parameters", Proc. ICSLP-98, 1998.
[3] Kroon, P., Deprettere, E. F., and Sluyter, R. F., "Regular-Pulse Excitation - A Novel Approach to Effective and Efficient Multi-pulse Coding of Speech", IEEE Trans. on Acoustics, Speech and Signal Processing, 34:1054-1063, 1986.
[4] Lilly, B. T. and Paliwal, K. K., "Effect of Speech Coders on Speech Recognition Performance", Proc. ICSLP-96, 1996.