Digital Speech Processing — Lecture 17
Speech Coding Methods Based on Speech Models

Waveform Coding versus Block Processing
• Waveform coding
– sample-by-sample matching of waveforms
– coding quality measured using SNR
• Source modeling (block processing)
– block processing of signal => vector of outputs every block
– overlapped blocks
(Figure: overlapped analysis blocks labeled Block 1, Block 2, Block 3)

Model-Based Speech Coding
• we have carried waveform coding based on optimizing and maximizing SNR about as far as it can go
– achieved bit rate reductions on the order of 4:1 (i.e., from 128 kbps PCM to 32 kbps ADPCM) while maintaining toll-quality SNR for telephone-bandwidth speech
• to lower the bit rate further without reducing speech quality, we need to exploit features of the speech production model, including:
– source modeling
– spectrum modeling
– use of codebook methods for coding efficiency
• we also need a new way of comparing the performance of waveform and model-based coding methods
– an objective measure such as SNR is not appropriate for model-based coders, since they operate on blocks of speech and do not follow the waveform on a sample-by-sample basis
– new subjective measures are needed that capture user-perceived quality, intelligibility, and robustness to multiple factors

Topics Covered in this Lecture
• Enhancements for ADPCM coders
– pitch prediction
– noise shaping
• Analysis-by-synthesis speech coders
– multipulse linear prediction coder (MPLPC)
– code-excited linear prediction (CELP)
• Open-loop speech coders
– two-state excitation model
– LPC vocoder
– residual-excited linear predictive coder
– mixed excitation systems
• speech coding quality measures (MOS)
• speech coding standards

Differential Quantization
(Figure: DPCM encoder with signals $x[n]$, $d[n]$, $\hat{d}[n]$, $c[n]$, $\tilde{x}[n]$, $\hat{x}[n]$, and decoder with $c'[n]$, $\hat{d}'[n]$, $\tilde{x}'[n]$, $\hat{x}'[n]$)
P is a simple predictor of the vocal tract response:
$$P(z) = \sum_{k=1}^{p} \alpha_k z^{-k}$$

Issues with Differential Quantization
• the difference signal retains the character of the excitation signal
– it switches back and forth between quasi-periodic and noise-like segments
• the prediction duration (even when using p = 20) is on the order of 2.5 msec (for a sampling rate of 8 kHz)
– the predictor is predicting the vocal tract response, not the excitation period (for voiced sounds)
• Solution: incorporate two stages of prediction, namely a short-time predictor for the vocal tract response and a long-time predictor for the pitch period

Pitch Prediction
(Figure: transmitter with quantizer Q[·] and cascaded predictors $P_1(z)$ and $P_2(z)$ in the feedback loops producing the residual; receiver reconstructs $\hat{x}[n]$ through the combined system $1/(1 - P_c(z))$)
• first-stage pitch predictor:
$$P_1(z) = \beta\, z^{-M}$$
• second-stage linear predictor (vocal tract predictor):
$$P_2(z) = \sum_{k=1}^{p} \alpha_k z^{-k}$$

Pitch Prediction
The first-stage pitch predictor $P_1(z) = \beta\, z^{-M}$ assumes that the pitch period, M, is an integer number of samples, and $\beta$ is a gain constant allowing for variations in pitch period over time (for unvoiced or background frames, the values of M and $\beta$ are irrelevant). An alternative (somewhat more complicated) pitch predictor is of the form:
$$P_1(z) = \beta_1 z^{-(M-1)} + \beta_2 z^{-M} + \beta_3 z^{-(M+1)} = \sum_{k=-1}^{1} \beta_{k+2}\, z^{-(M+k)}$$
This more advanced form provides a way to handle a non-integer pitch period through interpolation around the nearest integer pitch period value, M.

Combined Prediction
The combined inverse system is the cascade in the decoder system:
$$H_c(z) = \left(\frac{1}{1 - P_1(z)}\right)\left(\frac{1}{1 - P_2(z)}\right) = \frac{1}{1 - P_c(z)}$$
with a two-stage prediction error filter of the form:
$$1 - P_c(z) = [1 - P_1(z)][1 - P_2(z)]$$
$$P_c(z) = [1 - P_1(z)]\,P_2(z) + P_1(z)$$
which is implemented as a parallel combination of the two predictors $[1 - P_1(z)]P_2(z)$ and $P_1(z)$. The prediction signal $\tilde{x}[n]$ can be expressed as:
$$\tilde{x}[n] = \beta\,\hat{x}[n-M] + \sum_{k=1}^{p} \alpha_k \left(\hat{x}[n-k] - \beta\,\hat{x}[n-k-M]\right)$$

Combined Prediction Error
The combined prediction error can be defined as:
$$d_c[n] = x[n] - \tilde{x}[n] = v[n] - \sum_{k=1}^{p} \alpha_k v[n-k]$$
where $v[n] = x[n] - \beta x[n-M]$ is the prediction error of the pitch predictor. The optimal values of $\beta$, M, and $\{\alpha_k\}$ are obtained, in theory, by minimizing the variance of $d_c[n]$. In practice, a sub-optimum solution is obtained by first minimizing the variance of $v[n]$ and then minimizing the variance of $d_c[n]$ subject to the chosen values of $\beta$ and M.

Solution for Combined Predictor
The mean-squared prediction error for the pitch predictor is:
$$E_1 = \left\langle (v[n])^2 \right\rangle = \left\langle \left(x[n] - \beta x[n-M]\right)^2 \right\rangle$$
where $\langle\,\cdot\,\rangle$ denotes averaging over a finite frame of speech samples. We use covariance-type averaging to eliminate windowing effects, giving the solution:
$$\beta_{opt} = \frac{\left\langle x[n]\, x[n-M] \right\rangle}{\left\langle \left(x[n-M]\right)^2 \right\rangle}$$
Using this value of $\beta$, we solve for $(E_1)_{opt}$ as:
$$(E_1)_{opt} = \left\langle (x[n])^2 \right\rangle \left( 1 - \frac{\left(\left\langle x[n]\, x[n-M] \right\rangle\right)^2}{\left\langle (x[n])^2 \right\rangle \left\langle (x[n-M])^2 \right\rangle} \right)$$
Writing this as $(E_1)_{opt} = \langle (x[n])^2 \rangle \left(1 - \rho^2[M]\right)$ shows that the error is minimized by maximizing the normalized covariance:
$$\rho[M] = \frac{\left\langle x[n]\, x[n-M] \right\rangle}{\left( \left\langle (x[n])^2 \right\rangle \left\langle (x[n-M])^2 \right\rangle \right)^{1/2}}$$

Solution for Combined Predictor
• Steps in the solution (see the sketch below):
– first search for the M that maximizes $\rho[M]$
– compute $\beta_{opt}$
• Solve for a more accurate pitch predictor by minimizing the variance of the expanded (three-tap) pitch predictor
• Solve for the optimum vocal tract predictor coefficients $\alpha_k$, k = 1, 2, ..., p
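To make the two-step procedure concrete, here is a minimal NumPy sketch of the sub-optimum solution described above: a search over candidate lags for the M that maximizes $\rho[M]$, followed by the closed-form $\beta_{opt}$. The function names, frame handling, and lag search range are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def estimate_pitch_predictor(x, m_min=20, m_max=160):
    """Two-step sub-optimum solution from the slides: search for the lag M
    that maximizes the normalized covariance rho[M], then compute the
    single-tap gain beta_opt for P1(z) = beta * z^(-M).

    x is one analysis frame of speech samples; the averages are
    covariance-type (plain sums over the frame, no window). The lag range
    20..160 samples (roughly 50-400 Hz pitch at 8 kHz) is an
    illustrative choice."""
    best_m, best_rho = m_min, -np.inf
    for m in range(m_min, m_max + 1):
        x0 = x[m:]       # x[n]
        xm = x[:-m]      # x[n - M]
        den = np.sqrt(np.dot(x0, x0) * np.dot(xm, xm))
        rho = np.dot(x0, xm) / den if den > 0 else 0.0
        if rho > best_rho:
            best_m, best_rho = m, rho
    x0, xm = x[best_m:], x[:-best_m]
    beta_opt = np.dot(x0, xm) / np.dot(xm, xm)   # closed-form gain
    return best_m, beta_opt, best_rho

def pitch_prediction_error(x, m, beta):
    """First-stage residual v[n] = x[n] - beta * x[n - M]."""
    return x[m:] - beta * x[:-m]
```

On a voiced frame at 8 kHz, best_m should land near the pitch period in samples and best_rho near 1; on unvoiced or background frames, best_rho stays small and the returned M and beta are irrelevant, matching the remark on the pitch-predictor slide.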
Pitch Prediction
(Figure: prediction residual waveforms comparing vocal tract prediction alone with combined pitch and vocal tract prediction)

Noise Shaping in DPCM Systems

Noise Shaping Fundamentals
The output of an ADPCM encoder/decoder is:
$$\hat{x}[n] = x[n] + e[n]$$
where $e[n]$ is the quantization noise. It is easy to show that $e[n]$ generally has a flat spectrum and is thus especially audible in spectral regions of low intensity, i.e., between formants. This has led to methods of shaping the quantization noise to match the speech spectrum and take advantage of spectral masking concepts.

Noise Shaping
(Figure: (a) basic ADPCM encoder and decoder; (b) equivalent ADPCM encoder and decoder; (c) noise shaping ADPCM encoder and decoder)

Noise Shaping
The equivalence of parts (b) and (a) is shown by the following:
$$\hat{x}[n] = x[n] + e[n] \;\leftrightarrow\; \hat{X}(z) = X(z) + E(z)$$
From part (a) we see that:
$$D(z) = X(z) - P(z)\hat{X}(z) = [1 - P(z)]X(z) - P(z)E(z)$$
with $E(z) = \hat{D}(z) - D(z)$, showing the equivalence of parts (b) and (a). Further, since
$$\hat{D}(z) = D(z) + E(z) = [1 - P(z)]X(z) + [1 - P(z)]E(z)$$
we have
$$\hat{X}(z) = H(z)\hat{D}(z) = \left(\frac{1}{1 - P(z)}\right)\hat{D}(z) = \left(\frac{1}{1 - P(z)}\right)\left([1 - P(z)]X(z) + [1 - P(z)]E(z)\right) = X(z) + E(z)$$
Feeding back the quantization error through the predictor $P(z)$ ensures that the reconstructed signal $\hat{x}[n]$ differs from $x[n]$ only by the quantization error $e[n]$ incurred in quantizing the difference signal $d[n]$.

Shaping the Quantization Noise
To shape the quantization noise we simply replace $P(z)$ in the error feedback path by a different system function, $F(z)$, giving the reconstructed signal as:
$$\hat{X}'(z) = H(z)\hat{D}'(z) = \left(\frac{1}{1 - P(z)}\right)\left([1 - P(z)]X(z) + [1 - F(z)]E'(z)\right) = X(z) + \left(\frac{1 - F(z)}{1 - P(z)}\right)E'(z)$$
Thus if $x[n]$ is coded by this encoder, the z-transform of the reconstructed signal at the receiver is:
$$\hat{X}'(z) = X(z) + \tilde{E}'(z), \qquad \tilde{E}'(z) = \left(\frac{1 - F(z)}{1 - P(z)}\right)E'(z) = \Gamma(z)\,E'(z)$$
where $\Gamma(z) = \dfrac{1 - F(z)}{1 - P(z)}$ is the effective noise shaping filter.

Noise Shaping Filter Options
1. $F(z) = 0$: assuming the quantization noise has a flat spectrum, the shaped noise spectrum then has the same shape as the speech spectrum.
2. $F(z) = P(z)$: the equivalent system is the standard DPCM system, where $\tilde{E}'(z) = E'(z) = E(z)$ with a flat noise spectrum.
3. $F(z) = \sum_{k=1}^{p} \alpha_k \gamma^k z^{-k}$, the bandwidth-expanded predictor $P(z/\gamma)$: we "shape" the noise spectrum to "hide" the noise beneath the spectral peaks of the speech signal; each zero of $[1 - P(z)]$ is paired with a zero of $[1 - F(z)]$ at the same angle in the z-plane, but with its radius scaled by $\gamma$ (pulled toward the origin for $0 < \gamma < 1$). A sketch of this option follows.
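Option 3 is easy to compute: the noise-shaping coefficients are just the predictor coefficients weighted by $\gamma^k$. Below is a small NumPy/SciPy sketch that builds $F(z)$ from a given set of $\alpha_k$ and evaluates the effective response $\Gamma(z) = (1 - F(z))/(1 - P(z))$; the function names and the choice $\gamma = 0.9$ are illustrative assumptions, not values prescribed by the lecture.

```python
import numpy as np
from scipy.signal import freqz

def noise_shaping_coefficients(alphas, gamma=0.9):
    """Coefficients of F(z): the k-th short-term predictor coefficient
    alpha_k weighted by gamma**k, which moves each zero of [1 - P(z)]
    radially by the factor gamma while keeping its angle."""
    k = np.arange(1, len(alphas) + 1)
    return np.asarray(alphas) * gamma**k

def noise_shaping_response(alphas, gamma=0.9, fs=8000, n_freqs=512):
    """Magnitude response (dB) of the effective noise shaping filter
    Gamma(z) = (1 - F(z)) / (1 - P(z)), which multiplies the flat
    quantization noise spectrum."""
    f = noise_shaping_coefficients(alphas, gamma)
    num = np.concatenate(([1.0], -f))                 # 1 - F(z)
    den = np.concatenate(([1.0], -np.asarray(alphas)))  # 1 - P(z)
    w, h = freqz(num, den, worN=n_freqs, fs=fs)
    return w, 20.0 * np.log10(np.abs(h) + 1e-12)
```

Note how the three options fall out of one parameter: gamma = 1 gives F(z) = P(z) and a flat shaped-noise spectrum (option 2), gamma = 0 gives F(z) = 0 so the noise follows the speech spectral envelope (option 1), and intermediate values trade between the two.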
Noise Shaping Filter
(Figure: z-plane zero locations and magnitude responses of $[1 - P(z)]$ and $[1 - F(z)]$ for the noise shaping filter)

Noise Shaping Filter
If we assume that the quantization noise has a flat spectrum with noise power $\sigma_{e'}^2$, then the power spectrum of the shaped noise is of the form:
$$P_{e'e'}(F) = \sigma_{e'}^2 \left| \frac{1 - F(e^{j2\pi F/F_S})}{1 - P(e^{j2\pi F/F_S})} \right|^2$$
(Figure: shaped noise power spectrum plotted against the speech spectrum, with annotation marking where the noise spectrum lies above the speech spectrum)

Fully Quantized Adaptive Predictive Coder
(Figure: block diagram of the fully quantized adaptive predictive coder)

Full ADPCM Coder
• input is x[n]
• $P_2(z)$ is the short-term (vocal tract) predictor
• the signal v[n] is the short-term prediction error
• the goal of the encoder is to obtain a quantized representation of this excitation signal, from which the original signal can be reconstructed

Quantized ADPCM Coder
The total bit rate for the ADPCM coder is:
$$I_{ADPCM} = B\,F_S + B_\Delta F_\Delta + B_P F_P$$
where B is the number of bits for quantizing the difference signal at sampling rate $F_S$, $B_\Delta$ is the number of bits for encoding the step size at frame rate $F_\Delta$, and $B_P$ is the total number of bits allocated to the predictor coefficients (both long- and short-term) at frame rate $F_P$. Typically $F_S = 8000$ Hz, so even with only 1-4 bits we need between 8000 and 32,000 bps for quantization of the difference signal. We typically need about 3000-4000 bps for the side information (step size and predictor coefficients). Overall we need between roughly 11,000 and 36,000 bps for a fully quantized system.

Bit Rate for LP Coding
• speech and residual sampling rate: $F_S$ = 8 kHz
• LP analysis frame rate: $F_\Delta = F_P$ = 50-100 frames/sec
• quantizer step size: 6 bits/frame
• predictor parameters:
– M (pitch period): 7 bits/frame
– pitch predictor coefficients: 13 bits/frame
– vocal tract predictor coefficients (16-20 PARCORs): 46-50 bits/frame
• prediction residual: 1-3 bits/sample
• total bit rate (worked out in the sketch below):
– BR = 72·$F_P$ + $F_S$ (minimum)

Two-Level (B = 1 bit) Quantizer
(Figure: waveforms showing the prediction residual, quantizer input, quantizer output, reconstructed vs. original pitch residual, and reconstructed vs. original speech)

Three-Level Center-Clipped Quantizer
(Figure: waveforms showing the prediction residual, quantizer input, quantizer output, reconstructed vs. original pitch residual, and reconstructed vs. original speech)
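As a quick sanity check on these numbers, here is a small Python sketch of the bit-rate bookkeeping; the function name and default values are illustrative, taken from the per-frame allocations listed above.

```python
def lp_coder_bit_rate(bits_per_sample=1, fs=8000, frame_rate=50,
                      step_bits=6, pitch_lag_bits=7,
                      pitch_coef_bits=13, parcor_bits=46):
    """Total bit rate = (side-info bits/frame) * frame rate
    + (residual bits/sample) * sampling rate, following
    I_ADPCM = B*F_S + B_delta*F_delta + B_P*F_P with F_delta = F_P."""
    side_info_bits = step_bits + pitch_lag_bits + pitch_coef_bits + parcor_bits
    return side_info_bits * frame_rate + bits_per_sample * fs

# Minimum configuration: 6 + 7 + 13 + 46 = 72 bits/frame at 50 frames/sec
# plus a 1-bit/sample residual at 8 kHz:
print(lp_coder_bit_rate())  # 72*50 + 8000 = 11600 bps

# Upper end: 3 bits/sample, 100 frames/sec, 50-bit PARCORs:
print(lp_coder_bit_rate(3, frame_rate=100, parcor_bits=50))  # 31600 bps
```

Both results confirm the minimum formula BR = 72·$F_P$ + $F_S$ and fall within the roughly 11,000-36,000 bps range quoted for the fully quantized system.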