Lecture 8: Audio Compression & Coding

EE E6820: Speech & Audio Processing & Recognition Lecture 8: Audio Compression & Coding 1 Information, compression & quantization 2 Speech coding 3 Wide bandwidth audio coding Dan Ellis <[email protected]> http://www.ee.columbia.edu/~dpwe/e6820/ E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 1 1 Compression & Quantization • How big is audio data? What is the bitrate? - Fs frames/second (e.g. 8000 or 44100) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) → Fs·C·B bits/second (e.g. 64 Kbps or 1.4 Mbps) CD Audio 1.4 Mbps 32 bits / frame 8 Mobile 8000 44100 ≤13 Kbps frames / sec Telephony 64 Kbps • How to reduce? - lower sampling rate → less bandwidth (muffled) - lower channel count → no stereo image - lower sample size → quantization noise • Or: use data compression E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 2 Data compression: Redundancy vs. Irrelevance • Two main principles in compression: - remove redundant information - remove irrelevant information • Redundant information is implicit in remainder - e.g. signal bandlimited to 20kHz, but sample at 80kHz →can recover every other sample by interpolation: In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample time • Irrelevant information is unique, unnecessary - e.g. recording a microphone signal at 80 kHz sampling rate E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 3 Irrelevant data in audio coding • For coding of audio signals, irrelevant means perceptually insignificant - an empirical property • Compact Disc standard is adequate: - 44 kHz sampling for 20 kHz bandwidth - 16 bit linear samples for ~ 96 dB peak SNR • Reflect limits of human sensitivity: - 20 kHz bandwidth, 100 dB intensity - sinusoid phase, detail of noise structure - dynamic properties - hard to characterize • Problem: separating salient & irrelevant E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 4 Quantization • Represent waveform with discrete levels 5 4 6 3 Q[x[n]] x[n] 2 4 1 2 Q[x] 0 -1 0 -2 -3 -2 -4 error e[n] = x[n] - Q[x[n]] -5 0 5 10 15 20 25 30 35 40 -5 -4 -3 -2 -1 0 1 2 3 4 5 x • Equivalent to adding error e[n]: xn[]= Qxn[][]+ en[] • e[n] ~ uncorrelated, uniform white noise p(e[n]) 2 σ2 D - variance e = ------- -D +D 12 /2 /2 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 5 Quantization noise (Q-noise) • Uncorrelated noise has flat spectrum • With a B bit word and a quantization step D - max signal range (x) = -(2B-1)·D .. (2B-1-1)·D - quantization noise (e) = -D/2 .. D/2 → Best signal-to-noise ratio (power) SNR= E[] x2 ⁄ Ee[]2 2 = ()2B ⋅⋅≈ ⋅ .. or, in dB, 20log10 2 B 6 B dB 0 Quantized at 7 bits -20 -40 level / dB -60 -80 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 6 Redundant information • Redundancy removal is lossless • Signal correlation implies redundant information - e.g. if x[n] = x[n-1] + v[n] x[n] has a greater amplitude range → more bits than v[n] - sending v[n] = x[n] - x[n-1] can reduce amplitude, hence bitrate x[n] - x[n-1] - ‘white noise’ sequence has no redundancy • Problem: separating unique & redundant E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 7 Optimal coding • Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1 ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is ‘big news’ • Information in bits I = –log2(probability) - clearly works when all possibilities equiprobable • Opt. bitrate → token length = entropy H=E[I] - i.e. equal-length tokens are equally likely • How to achieve this? - transform signal to have uniform pdf - nonuniform quantization for equiprobable tokens - variable-length tokens → Huffman coding E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 8 Quantization for optimum bitrate • Quantization should reflect pdf of signal: p(x = x0) p(x < x0) x' 1.0 0.8 0.6 0.4 0.2 0 -0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x - cumulative pdf p(x < x0) maps to uniform x' - or: nonuniform quantization bins • Or, codeword length per Shannon –log2(p(x)): p(x) Shannon info / bits Codewords -0.02 111111111xx 111101xx -0.01 111100xx 101xx 100xx 0 0xx 110xx 0.01 1110xx 111110xx 1111110xx 0.02 111111100xx 111111101xx 111111110xx 0.03 0 2 4 6 8 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 9 Huffman coding • Variable-length bit sequence tokens → can code unequally probable events • Tree-structure for unambiguous decoding: 0 0 p = 0.5 10 p = 0.25 0 0 1100 p = 0.0625 1 0 1101 p = 0.0625 1 1 0 1110 p = 0.0625 1 1 1111 p = 0.0625 1011001101000001001100010011100001110 • Can build tables to approximate arbitrary distributions • Eliminates irrelevance .. within limits • Problem: very probable events → short tokens E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 10 Vector Quantization • Quantize mutually dependent values in joint space: 3 x1 2 1 0 -1 -2 x2 -6 -4 -2 0 2 4 • May help even if values are largely independent - larger space {x1,x2} is easier for Huffman E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 11 Compression & Representation • As always, success depends on representation • Appropriate domain may be ‘naturally’ bandlimited - e.g. vocal-tract-shape coefficients - can reduce sampling rate without data loss • In right domain, irrelevance may be easier to get at - e.g. STFT to separate magnitude and phase E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 12 Aside: Coding standards • Coding is only useful when recipient knows the code! • Standardization efforts are important • Federal Standards: Low bit-rate secure voice: - FS1015e: LPC-10 2.4 Kbps - FS1016: 4.8 Kbps CELP • ITU G.series - G.726 ADPCM - G.729 Low delay CELP • MPEG - MPEG-Audio layers 1,2,3 - MPEG 2 Advanced Audio Codec - MPEG 4 Synthetic-Natural Hybrid Codec • etc ... E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 13 Outline 1 Information, compression & Quantization 2 Speech coding - General principles - CELP & friends 3 Wide bandwidth audio coding E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 14 2 Speech coding • Standard voice channel: - analog: 4 kHz slot (~ 40 dB SNR) - digital: 64 Kbps = 8 bit µ-law x 8 kHz • How to compress? Redundant - signal assumed to be a single voice, not any possible waveform Irrelevant - need code only enough for intelligibility, speaker identification (c/w analog channel) • Specifically, source-filter decomposition - vocal tract & fund. frequency change slowly • Applications: - live communications - offline storage E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 15 Channel Vocoder (1940s-1960s) • Basic source-filter decomposition - filterbank breaks into spectral bands - transmit slowly-changing energy in each band Encoder Decoder Bandpass Smoothed Downsample E filter 1 energy & encode 1 Bandpass filter 1 Output Input Bandpass Smoothed Downsample E filter N energy & encode N Bandpass filter 1 Voicing V/UV analysis Pulse Noise Pitch generator source - 10-20 bands, perceptually spaced • Downsampling? • Excitation? - pitch / noise model - or: baseband + ‘flattening’... E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 16 LPC encoding • The classic source-filter model Encoder Filter coefficients {ai} Decoder ω |1/A(ej )| Represent Input f s[n] & encode LPC Output analysis e^[n] s^[n] Represent Excitation All-pole Residual & encode generator filter e[n] H(z) = 1 t Σ -i 1 - aiz • Compression gains: - filter parameters are ~slowly changing - excitation can be represented many ways 20 ms Filter parameters Excitation/pitch parameters 5 ms E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 17 Encoding LPC filter parameters • For ‘communications quality’: - 8 kHz sampling (4 kHz bandwidth) - ~10th order LPC (up to 5 pole pairs) - update every 20-30 ms → 300 - 500 param/s • Representation & a quantization i -{ai} - poor distribution, -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 can’t interpolate ki -reflection coefficients {ki }: guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 f - LSPs - lovely! Li • Bit allocation (filter): 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 18 Line Spectral Pairs (LSPs) • LSPs encode LPC filter by a set of frequencies • Excellent for quantization & interpolation –1p – –1 Pz()= Az()+ z ⋅ Az() • Def: zeros of –1p – –1 Qz()= Az()– z ⋅ Az() ω ω - z = ej → z-1 = e-j → |A(z)| = |A(z-1)| on u.c. - P(z), Q(z) have (interleaved) zeros when angle{A(z)} = ± angle{z-p-1A(z-1)} Π ζ -1 - reconstruct P(z), Q(z) = (1 - iz ) etc. - A(z) = [P(z) + Q(z)]/2 A(z-1) = 0 A(z) = 0 Q(z) = 0 P(z) = 0 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 19 Excitation • Excitation already better than raw signal: 5000 0 Original signal -5000 100 0 -100 LPC residual 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s - save several bits/sample, still > 32 Kbps • Crude model: U/V flag + pitch period - ~ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps Pitch period values 16 ms frame boundaries 100 50 0 -50 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s • Band-limited then re-extended (RELP) E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 20 Encoding excitation • Something between full-quality residual (32 Kbps) and pitch parameters (2.4 kbps)? •‘Analysis by synthesis’ loop: Input LPC Filter coefficients s[n] analysis A(z) Excitation code Pitch-cycle LPC filter Synthetic predictor – excitation + 1 + ^ A(z) c[n] b·z–NL x[n] control ‘Perceptual’ MSError weighting minimization A(z) ^ W(z)= x[n]*ha[n] - s[n] A(z/γ) Excitation encoding •‘Perceptual’ weighting discounts peaks: 20 A(z) A(z) W(z)= 10 A(z/γ) 0 A(z/γ) gain / dB -10 -20 0 500 1000 1500 2000 2500 3000 3500 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 21 Multi-Pulse Excitation (MPE-LPC) • Stylize excitation as N discrete pulses 15 10 original LPC 5 residual 0 -5 mi multi-pulse excitation 5 0 -5 0 20 40 60 80 100 120 ti time / samps - encode as N x (ti, mi) pairs • Greedy algorithm places one pulse at a time: Az() Xz() E = ------------------ ----------- – Sz() pcp Az()⁄ γ Az() Xz() Rz() = ------------------ – ------------------ Az()⁄ γ Az()⁄ γ - cross-correlate hγ and r*hγ , iterate E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 22 CELP • Represent excitation with codebook e.g.

Load more