Speech and Audio Coding Introduction Audio Perception Speech Signal Speech Compression Music Signal Music Compression Outline

Speech and Audio Coding Introduction Audio Perception Speech Signal Speech Compression Music Signal Music Compression Outline

Speech and audio Institut coding Mines-Telecom Marco Cagnazzo, [email protected] MN910 – Advanced compression Introduction Audio perception Speech compression Music compression Outline Introduction Speech signal Music signal Audio perception Masking Speech compression Simple encoders CELP encoder Encoder 3GPP AMR-WB Music compression MP3 2/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Outline Introduction Speech signal Music signal Audio perception Speech compression Music compression 3/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal ◮ Non-stationary, but locally stationary (20 ms) ◮ Voiced sounds (vowels, some consonants) non-voiced (some consonants), other (transitions) ◮ Simple and effective prediction models: ◮ Linear filters AR of impulsions for voiced sounds ◮ Same filter on white noise for non-voiced sounds 4/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Digitalization ◮ PCM (Pulse Code Modulation) ◮ Bandwidth 200 Hz ÷ 3400 Hz ◮ Enough for understanding ◮ Sampling at 8000 Hz : Fs > 2Fmax ◮ 8 bits per sample → 64 kbps ◮ Extended bandwidth ◮ 50-200 Hz : more natural ◮ and 3.4-7 kHz : better understanding 5/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Coding standards G.711 (1972) PCM (no compression), 8 samples per ms coded on 8 bits: 64 kbps G.721 (1984) ADPCM: 32 kbps G.728 (1991) CELP (Code Excited Linear Predictive), low latency: 16 kbps G.729 (1995) CELP, without latency constraint: 8kbps G.723.1 (1995) Encoder at 6.3 kbps 6/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Speech signal Normes de compression : Mobiles GSM 06.10 (1988) RPE-LTP : Regular Pulse Excitation Long Term Predictor. 13 (22.8) kbps GSM 06.20 (1994) “Half-Rate”. 5.6 (11.4) kbps GSM 06.60 (1996) “ACELP”. 12.2 (22.8) kbit/s GSM 06.90 (1999) Source/channel coding at variable bit-rate 4.75÷12.2 (11.4÷22.8)kbps (ACELP-AMR : Adaptive Multi Rate) G.722 Wideband speec coder, rates 6.6, 8.85, 12.65 kbps (AMR-WB) ; further rates 15.85 and 23.85 kbps 7/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Music signal ◮ Large dynamics: 90dB ◮ Locally stationary signal ◮ No simple model 8/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Music signal Compression CD: sampling at 44.1 kHz, quantization on 16 bits: 705 kbps (mono) MP3: Audio part of MPEG-1. Three layers of increasing complexity, at 192, 128 and 96 kbps AAC: Audio part of MPEG-2, reputed as the best audio encoder MPEG-4: Sound object representation 9/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Speech signal Speech compression Music signal Music compression Quality ◮ Objective criteria (MSE) not satisfying ◮ Subjective stest: ◮ Speech: understandability ◮ Music: “transparency” criterion. Double blind method with triple stimuli and hidden reference (UIT-T BS.1116) 10/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Outline Introduction Audio perception Masking Speech compression Music compression 11/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Audition threshold Power − dB 120 Sa(f ) 80 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency − Hz 12/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Frequency masking Power − dB 120 80 40 0 20 50 100 200 500 1000 2000 5000 10000 Frequency − Hz 13/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression 2 Frequency masking function Sm(f0,σ , f ) Power − dB ◮ 2 2 For a given f0 and σ , Sm(f ) has S f 2 f Sm(f2,σ2, f ) m( 1,σ1, ) a triangular shape ◮ The maximum is in f = f0 ◮ 2 2 Masking index: Sm(f ,σ , f ) − σ Frequency − Hz 14/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Audio perception Masking Speech compression Music compression Time Masking Power − dB Time ◮ Pre-masking : 2÷5 ms ◮ Post-masking : 100÷200 ms 15/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Outline Introduction Audio perception Speech compression Simple encoders CELP encoder Encoder 3GPP AMR-WB Music compression 16/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Linear Prediction Coding with 10 samples ◮ Only for teaching! ◮ Window N = 160 ◮ Sample prediction using P = 10 previous samples ◮ Scheme x(n) y(n) y(n) A(z) Q a a LPC 17/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Filter P y(n)= x(n) − x (n) x (n)= h x(n − k) P P k k=1 P y(n)= a x(n − k) a = 1 a = −h k 0 k k k=0 −1 −2 −P A(z)= 1 + a1z + a2z + . + aP z Y (z)= A(z)X(z) A is computed by minimisation of the residual power in the current window 18/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Filter T c = [rX (1) rX (2) . rX (P)] rX (0) rX (1) . rX (P − 1) r (1) r (0) . r (P − 2) R = X X X ... ... ... ... rX (P − 1) rX (P − 2) . rX (0) a = −R−1c Estimation of rX : using N samples of the current window − 1 N 1 r (k)= x(n)x(n − k) k ∈ {0, 1,..., P} X N − k n=k 19/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Non-voiced sounds ◮ If X was deprived of all correlation, Y would be white noise ◮ We dont send Y : the decoder will filter WN using 1/A(z) ◮ Resulting in unaudible phase noise ◮ WN is a good model only for non-voiced suonds ◮ For voice sounds we have residual periodicity 20/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Non-voiced sounds : Filtering WN with 1/A(z) Voiced sounds : Filtering 1/A(z) of a pulse train: y(n)= α δ (n − mT + φ) 0 m∈Z α 0 T0 − 1 φ 21/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps ◮ Estimation of auto-correlation rx (k) ◮ 2 rx (0) estimates opower σY or α ◮ If rx decreases towards zero, it is a non-voiced sound ◮ If rx is periodical, it is a voiced sound with period T0 22/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Each 20 ms we send: ◮ The P filter coefficients P = 10 over 3 or 4 bits. bP = 36 bits ◮ One bit for voiced/non-voiced bv = 1 bit ◮ For voiced ◮ bα = 6 ◮ The period T0: bT0 = 7 ◮ For non-voiced: ◮ bσ2 = 6 bits 23/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression LPC10 encoder at 2.4 kbps Total R = b + bv + b + b /(20ms) P T0 α = (36 + 1 + 6 + 7)/0.02 = 2500bps or R = (bP + bv + bσ2 ) /(20ms) = (36 + 1 + 6)/0.02 = 2150bps R = 2.4 kbps 24/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Find a filter and the prediction error signal ◮ Filter A(z): LPC idea ◮ Error y(n): from a “GS-VQ” dictionary x(n) y(n) x(n) ǫ(n) 1 1 L Min A(z) i g a 25/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Filter coefficient coding ◮ Line Spectrum Pairs: effective mathematical representation of A(z) ◮ Perceptual weighting function (WF) ◮ Noise can be tolored where the signal is strong ◮ The WF depends thus on A(z) 26/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Perceptual weighting ◮ A(z) We use W (z)= A(z/γ) to weight the signal ◮ Noise can be higher in the formantic areas ◮ The filter 1/A(z) has peaks in the formantic areas ◮ The filter 1/A(z/γ), with γ ∈ (0, 1) has peaks at the sames frequencies, but smaller ◮ Let pi be the poles of 1/A(z) ◮ Thus γpi are the poles of 1/A(z/γ), which are thus closer to the center 27/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression Perceptual weighting Blue : 1/A(z) ; Red : 1/A(z/γ) ; Black: W (z)= A(z)/A(z/γ) 28/41 06.03.2019 Institut Mines-Telecom Speech and audio coding Introduction Simple encoders Audio perception CELP encoder Speech compression Encoder 3GPP AMR-WB Music compression CELP encoder ◮ Excitation model.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    41 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us