
Speech and audio Institut coding Mines-Telecom Marco Cagnazzo, [email protected] MN910 – Advanced compression Introduction Audio perception Speech compression Music compression Outline Introduction Speech signal Music signal Audio perception Masking Speech compression Simple encoders CELP encoder Encoder 3GPP AMR-WB Music compression MP3 2/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Outline Introduction Speech signal Music signal Audio perception Speech compression Music compression 3/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal ◮ Non-stationary, but locally stationary (20 ms) ◮ Voiced sounds (vowels, some consonants) non-voiced (some consonants), other (transitions) ◮ Simple and effective prediction models: ◮ Linear filters AR of impulsions for voiced sounds ◮ Same filter on white noise for non-voiced sounds 4/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Digitalization ◮ PCM (Pulse Code Modulation) ◮ Bandwidth 200 Hz ÷ 3400 Hz ◮ Enough for understanding ◮ Sampling at 8000 Hz : Fs > 2Fmax ◮ 8 bits per sample → 64 kbps ◮ Extended bandwidth ◮ 50-200 Hz : more natural ◮ and 3.4-7 kHz : better understanding 5/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Coding standards G.711 (1972) PCM (no compression), 8 samples per ms coded on 8 bits: 64 kbps G.721 (1984) ADPCM: 32 kbps G.728 (1991) CELP (Code Excited Linear Predictive), low latency: 16 kbps G.729 (1995) CELP, without latency constraint: 8kbps G.723.1 (1995) Encoder at 6.3 kbps 6/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Normes de compression : Mobiles GSM 06.10 (1988) RPE-LTP : Regular Pulse Excitation Long Term Predictor. 13 (22.8) kbps GSM 06.20 (1994) “Half-Rate”. 5.6 (11.4) kbps GSM 06.60 (1996) “ACELP”. 12.2 (22.8) kbit/s GSM 06.90 (1999) Source/channel coding at variable bit-rate 4.75÷12.2 (11.4÷22.8)kbps (ACELP-AMR : Adaptive Multi Rate) G.722 Wideband speec coder, rates 6.6, 8.85, 12.65 kbps (AMR-WB) ; further rates 15.85 and 23.85 kbps 7/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Music signal ◮ Large dynamics: 90dB ◮ Locally stationary signal ◮ No simple model 8/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Music signal Compression CD: sampling at 44.1 kHz, quantization on 16 bits: 705 kbps (mono) MP3: Audio part of MPEG-1. Three layers of increasing complexity, at 192, 128 and 96 kbps AAC: Audio part of MPEG-2, reputed as the best audio encoder MPEG-4: Sound object representation 9/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Quality ◮ Objective criteria (MSE) not satisfying ◮ Subjective stest: ◮ Speech: understandability ◮ Music: “transparency” criterion. Double blind method with triple stimuli and hidden reference (UIT-T BS.1116) 10/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Outline Introduction Audio perception Masking Speech compression Music compression 11/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Audition threshold Power − dB 120 Sa(f ) 80 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency − Hz 12/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Frequency masking Power − dB 120 80 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency − Hz 13/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression 2 Frequency masking function Sm(f0,σ , f ) Power − dB ◮ 2 2 For a given f0 and σ , Sm(f ) has S f 2 f Sm(f2,σ2, f ) m( 1,σ1, ) a triangular shape ◮ The maximum is in f = f0 ◮ 2 2 Masking index: Sm(f ,σ , f ) − σ Frequency − Hz 14/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Time Masking Power − dB Time ◮ Pre-masking : 2÷5 ms ◮ Post-masking : 100÷200 ms 15/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Outline Introduction Audio perception Speech compression Simple encoders CELP encoder Encoder 3GPP AMR-WB Music compression 16/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Linear Prediction Coding with 10 samples ◮ Only for teaching! ◮ Window N = 160 ◮ Sample prediction using P = 10 previous samples ◮ Scheme x(n) y(n) y(n) A(z) Q a a LPC 17/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Filter P y(n)= x(n) − x (n) x (n)= h x(n − k) P P k k=1 P y(n)= a x(n − k) a = 1 a = −h k 0 k k k=0 −1 −2 −P A(z)= 1 + a1z + a2z + . + aP z Y (z)= A(z)X(z) A is computed by minimisation of the residual power in the current window 18/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Filter T c = [rX (1) rX (2) . rX (P)] rX (0) rX (1) . rX (P − 1) r (1) r (0) . r (P − 2) R = X X X ... ... ... ... rX (P − 1) rX (P − 2) . rX (0) a = −R−1c Estimation of rX : using N samples of the current window − 1 N 1 r (k)= x(n)x(n − k) k ∈ {0, 1,..., P} X N − k n=k 19/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Non-voiced sounds ◮ If X was deprived of all correlation, Y would be white noise ◮ We dont send Y : the decoder will filter WN using 1/A(z) ◮ Resulting in unaudible phase noise ◮ WN is a good model only for non-voiced suonds ◮ For voice sounds we have residual periodicity 20/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Non-voiced sounds : Filtering WN with 1/A(z) Voiced sounds : Filtering 1/A(z) of a pulse train: y(n)= α δ (n − mT + φ) 0 m∈Z α 0 T0 − 1 φ 21/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps ◮ Estimation of auto-correlation rx (k) ◮ 2 rx (0) estimates opower σY or α ◮ If rx decreases towards zero, it is a non-voiced sound ◮ If rx is periodical, it is a voiced sound with period T0 22/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Each 20 ms we send: ◮ The P filter coefficients P = 10 over 3 or 4 bits. bP = 36 bits ◮ One bit for voiced/non-voiced bv = 1 bit ◮ For voiced ◮ bα = 6 ◮ The period T0: bT0 = 7 ◮ For non-voiced: ◮ bσ2 = 6 bits 23/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Total R = b + bv + b + b /(20ms) P T0 α = (36 + 1 + 6 + 7)/0.02 = 2500bps or R = (bP + bv + bσ2 ) /(20ms) = (36 + 1 + 6)/0.02 = 2150bps R = 2.4 kbps 24/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Find a filter and the prediction error signal ◮ Filter A(z): LPC idea ◮ Error y(n): from a “GS-VQ” dictionary x(n) y(n) x(n) ǫ(n) 1 1 L Min A(z) i g a 25/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Filter coefficient coding ◮ Line Spectrum Pairs: effective mathematical representation of A(z) ◮ Perceptual weighting function (WF) ◮ Noise can be tolored where the signal is strong ◮ The WF depends thus on A(z) 26/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Perceptual weighting ◮ A(z) We use W (z)= A(z/γ) to weight the signal ◮ Noise can be higher in the formantic areas ◮ The filter 1/A(z) has peaks in the formantic areas ◮ The filter 1/A(z/γ), with γ ∈ (0, 1) has peaks at the sames frequencies, but smaller ◮ Let pi be the poles of 1/A(z) ◮ Thus γpi are the poles of 1/A(z/γ), which are thus closer to the center 27/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Perceptual weighting Blue : 1/A(z) ; Red : 1/A(z/γ) ; Black: W (z)= A(z)/A(z/γ) 28/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Excitation model.
Details
-
File Typepdf
-
Upload Time-
-
Content LanguagesEnglish
-
Upload UserAnonymous/Not logged-in
-
File Pages41 Page
-
File Size-