Lecture 8: Audio Compression & Coding

Lecture 8: Audio Compression & Coding

EE E6820: Speech & Audio Processing & Recognition Lecture 8: Audio Compression & Coding 1 Information, compression & quantization 2 Speech coding 3 Wide bandwidth audio coding Dan Ellis <[email protected]> http://www.ee.columbia.edu/~dpwe/e6820/ E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 1 1 Compression & Quantization • How big is audio data? What is the bitrate? - Fs frames/second (e.g. 8000 or 44100) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) → Fs·C·B bits/second (e.g. 64 Kbps or 1.4 Mbps) CD Audio 1.4 Mbps 32 bits / frame 8 Mobile 8000 44100 ≤13 Kbps frames / sec Telephony 64 Kbps • How to reduce? - lower sampling rate → less bandwidth (muffled) - lower channel count → no stereo image - lower sample size → quantization noise • Or: use data compression E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 2 Data compression: Redundancy vs. Irrelevance • Two main principles in compression: - remove redundant information - remove irrelevant information • Redundant information is implicit in remainder - e.g. signal bandlimited to 20kHz, but sample at 80kHz →can recover every other sample by interpolation: In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample time • Irrelevant information is unique, unnecessary - e.g. recording a microphone signal at 80 kHz sampling rate E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 3 Irrelevant data in audio coding • For coding of audio signals, irrelevant means perceptually insignificant - an empirical property • Compact Disc standard is adequate: - 44 kHz sampling for 20 kHz bandwidth - 16 bit linear samples for ~ 96 dB peak SNR • Reflect limits of human sensitivity: - 20 kHz bandwidth, 100 dB intensity - sinusoid phase, detail of noise structure - dynamic properties - hard to characterize • Problem: separating salient & irrelevant E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 4 Quantization • Represent waveform with discrete levels 5 4 6 3 Q[x[n]] x[n] 2 4 1 2 Q[x] 0 -1 0 -2 -3 -2 -4 error e[n] = x[n] - Q[x[n]] -5 0 5 10 15 20 25 30 35 40 -5 -4 -3 -2 -1 0 1 2 3 4 5 x • Equivalent to adding error e[n]: xn[]= Qxn[][]+ en[] • e[n] ~ uncorrelated, uniform white noise p(e[n]) 2 σ2 D - variance e = ------- -D +D 12 /2 /2 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 5 Quantization noise (Q-noise) • Uncorrelated noise has flat spectrum • With a B bit word and a quantization step D - max signal range (x) = -(2B-1)·D .. (2B-1-1)·D - quantization noise (e) = -D/2 .. D/2 → Best signal-to-noise ratio (power) SNR= E[] x2 ⁄ Ee[]2 2 = ()2B ⋅⋅≈ ⋅ .. or, in dB, 20log10 2 B 6 B dB 0 Quantized at 7 bits -20 -40 level / dB -60 -80 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 6 Redundant information • Redundancy removal is lossless • Signal correlation implies redundant information - e.g. if x[n] = x[n-1] + v[n] x[n] has a greater amplitude range → more bits than v[n] - sending v[n] = x[n] - x[n-1] can reduce amplitude, hence bitrate x[n] - x[n-1] - ‘white noise’ sequence has no redundancy • Problem: separating unique & redundant E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 7 Optimal coding • Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1 ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is ‘big news’ • Information in bits I = –log2(probability) - clearly works when all possibilities equiprobable • Opt. bitrate → token length = entropy H=E[I] - i.e. equal-length tokens are equally likely • How to achieve this? - transform signal to have uniform pdf - nonuniform quantization for equiprobable tokens - variable-length tokens → Huffman coding E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 8 Quantization for optimum bitrate • Quantization should reflect pdf of signal: p(x = x0) p(x < x0) x' 1.0 0.8 0.6 0.4 0.2 0 -0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x - cumulative pdf p(x < x0) maps to uniform x' - or: nonuniform quantization bins • Or, codeword length per Shannon –log2(p(x)): p(x) Shannon info / bits Codewords -0.02 111111111xx 111101xx -0.01 111100xx 101xx 100xx 0 0xx 110xx 0.01 1110xx 111110xx 1111110xx 0.02 111111100xx 111111101xx 111111110xx 0.03 0 2 4 6 8 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 9 Huffman coding • Variable-length bit sequence tokens → can code unequally probable events • Tree-structure for unambiguous decoding: 0 0 p = 0.5 10 p = 0.25 0 0 1100 p = 0.0625 1 0 1101 p = 0.0625 1 1 0 1110 p = 0.0625 1 1 1111 p = 0.0625 1011001101000001001100010011100001110 • Can build tables to approximate arbitrary distributions • Eliminates irrelevance .. within limits • Problem: very probable events → short tokens E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 10 Vector Quantization • Quantize mutually dependent values in joint space: 3 x1 2 1 0 -1 -2 x2 -6 -4 -2 0 2 4 • May help even if values are largely independent - larger space {x1,x2} is easier for Huffman E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 11 Compression & Representation • As always, success depends on representation • Appropriate domain may be ‘naturally’ bandlimited - e.g. vocal-tract-shape coefficients - can reduce sampling rate without data loss • In right domain, irrelevance may be easier to get at - e.g. STFT to separate magnitude and phase E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 12 Aside: Coding standards • Coding is only useful when recipient knows the code! • Standardization efforts are important • Federal Standards: Low bit-rate secure voice: - FS1015e: LPC-10 2.4 Kbps - FS1016: 4.8 Kbps CELP • ITU G.series - G.726 ADPCM - G.729 Low delay CELP • MPEG - MPEG-Audio layers 1,2,3 - MPEG 2 Advanced Audio Codec - MPEG 4 Synthetic-Natural Hybrid Codec • etc ... E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 13 Outline 1 Information, compression & Quantization 2 Speech coding - General principles - CELP & friends 3 Wide bandwidth audio coding E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 14 2 Speech coding • Standard voice channel: - analog: 4 kHz slot (~ 40 dB SNR) - digital: 64 Kbps = 8 bit µ-law x 8 kHz • How to compress? Redundant - signal assumed to be a single voice, not any possible waveform Irrelevant - need code only enough for intelligibility, speaker identification (c/w analog channel) • Specifically, source-filter decomposition - vocal tract & fund. frequency change slowly • Applications: - live communications - offline storage E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 15 Channel Vocoder (1940s-1960s) • Basic source-filter decomposition - filterbank breaks into spectral bands - transmit slowly-changing energy in each band Encoder Decoder Bandpass Smoothed Downsample E filter 1 energy & encode 1 Bandpass filter 1 Output Input Bandpass Smoothed Downsample E filter N energy & encode N Bandpass filter 1 Voicing V/UV analysis Pulse Noise Pitch generator source - 10-20 bands, perceptually spaced • Downsampling? • Excitation? - pitch / noise model - or: baseband + ‘flattening’... E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 16 LPC encoding • The classic source-filter model Encoder Filter coefficients {ai} Decoder ω |1/A(ej )| Represent Input f s[n] & encode LPC Output analysis e^[n] s^[n] Represent Excitation All-pole Residual & encode generator filter e[n] H(z) = 1 t Σ -i 1 - aiz • Compression gains: - filter parameters are ~slowly changing - excitation can be represented many ways 20 ms Filter parameters Excitation/pitch parameters 5 ms E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 17 Encoding LPC filter parameters • For ‘communications quality’: - 8 kHz sampling (4 kHz bandwidth) - ~10th order LPC (up to 5 pole pairs) - update every 20-30 ms → 300 - 500 param/s • Representation & a quantization i -{ai} - poor distribution, -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 can’t interpolate ki -reflection coefficients {ki }: guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 f - LSPs - lovely! Li • Bit allocation (filter): 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 18 Line Spectral Pairs (LSPs) • LSPs encode LPC filter by a set of frequencies • Excellent for quantization & interpolation –1p – –1 Pz()= Az()+ z ⋅ Az() • Def: zeros of –1p – –1 Qz()= Az()– z ⋅ Az() ω ω - z = ej → z-1 = e-j → |A(z)| = |A(z-1)| on u.c. - P(z), Q(z) have (interleaved) zeros when angle{A(z)} = ± angle{z-p-1A(z-1)} Π ζ -1 - reconstruct P(z), Q(z) = (1 - iz ) etc. - A(z) = [P(z) + Q(z)]/2 A(z-1) = 0 A(z) = 0 Q(z) = 0 P(z) = 0 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 19 Excitation • Excitation already better than raw signal: 5000 0 Original signal -5000 100 0 -100 LPC residual 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s - save several bits/sample, still > 32 Kbps • Crude model: U/V flag + pitch period - ~ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps Pitch period values 16 ms frame boundaries 100 50 0 -50 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s • Band-limited then re-extended (RELP) E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 20 Encoding excitation • Something between full-quality residual (32 Kbps) and pitch parameters (2.4 kbps)? •‘Analysis by synthesis’ loop: Input LPC Filter coefficients s[n] analysis A(z) Excitation code Pitch-cycle LPC filter Synthetic predictor – excitation + 1 + ^ A(z) c[n] b·z–NL x[n] control ‘Perceptual’ MSError weighting minimization A(z) ^ W(z)= x[n]*ha[n] - s[n] A(z/γ) Excitation encoding •‘Perceptual’ weighting discounts peaks: 20 A(z) A(z) W(z)= 10 A(z/γ) 0 A(z/γ) gain / dB -10 -20 0 500 1000 1500 2000 2500 3000 3500 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 21 Multi-Pulse Excitation (MPE-LPC) • Stylize excitation as N discrete pulses 15 10 original LPC 5 residual 0 -5 mi multi-pulse excitation 5 0 -5 0 20 40 60 80 100 120 ti time / samps - encode as N x (ti, mi) pairs • Greedy algorithm places one pulse at a time: Az() Xz() E = ------------------ ----------- – Sz() pcp Az()⁄ γ Az() Xz() Rz() = ------------------ – ------------------ Az()⁄ γ Az()⁄ γ - cross-correlate hγ and r*hγ , iterate E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 22 CELP • Represent excitation with codebook e.g.

View Full Text

Details

  • File Type
    pdf
  • Upload Time
    -
  • Content Languages
    English
  • Upload User
    Anonymous/Not logged-in
  • File Pages
    35 Page
  • File Size
    -

Download

Channel Download Status
Express Download Enable

Copyright

We respect the copyrights and intellectual property rights of all users. All uploaded documents are either original works of the uploader or authorized works of the rightful owners.

  • Not to be reproduced or distributed without explicit permission.
  • Not used for commercial purposes outside of approved use cases.
  • Not used to infringe on the rights of the original creators.
  • If you believe any content infringes your copyright, please contact us immediately.

Support

For help with questions, suggestions, or problems, please contact us