EE E6820: Speech & Audio Processing & Recognition
Lecture 8: Audio Compression & Coding
1 Information, compression & quantization
3 Wide bandwidth audio coding
Dan Ellis
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 1
1 Compression & Quantization
¥ How big is audio data? What is the bitrate? - Fs frames/second (e.g. 8000 or 44100) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) → FsáCáB bits/second (e.g. 64 Kbps or 1.4 Mbps) CD Audio 1.4 Mbps
32
Mobilebits / frame 8 Telephony8000 64 Kbps 44100 ≤13 Kbps frames / sec
¥ How to reduce? - lower sampling rate → less bandwidth (muffled) - lower channel count → no stereo image - lower sample size → quantization noise ¥ Or: use data compression
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 2
Data compression: Redundancy vs. Irrelevance ¥ Two main principles in compression: - remove redundant information - remove irrelevant information ¥ Redundant information is implicit in remainder - e.g. signal bandlimited to 20kHz, but sample at 80kHz →can recover every other sample by interpolation: In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample
time
¥ Irrelevant information is unique, unnecessary - e.g. recording a microphone signal at 80 kHz sampling rate
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 3
Irrelevant data in audio coding ¥ For coding of audio signals, irrelevant means perceptually insignificant - an empirical property ¥ Compact Disc standard is adequate: - 44 kHz sampling for 20 kHz bandwidth - 16 bit linear samples for ~ 96 dB peak SNR ¥ Reflect limits of human sensitivity: - 20 kHz bandwidth, 100 dB intensity - sinusoid phase, detail of noise structure - dynamic properties - hard to characterize ¥ Problem: separating salient & irrelevant
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 4
Quantization ¥ Represent waveform with discrete levels 5 4 6 3 Q[x[n]] x[n] 2 4 1 2 Q[x] 0 -1 0 -2 -3 -2 -4 error e[n] = x[n] - Q[x[n]] -5 0 5 10 15 20 25 30 35 40 -5 -4 -3 -2 -1 0 1 2 3 4 5 x ¥ Equivalent to adding error e[n]: xn[]= Qxn[][]+ en[] ¥ e[n] ~ uncorrelated, uniform white noise p(e[n]) 2 σ2 D - variance e = ------D +D 12 /2 /2
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 5
Quantization noise (Q-noise) ¥ Uncorrelated noise has flat spectrum ¥ With a B bit word and a quantization step D - max signal range (x) = -(2B-1)áD .. (2B-1-1)áD - quantization noise (e) = -D/2 .. D/2 → Best signal-to-noise ratio (power) SNR= E[] x2 ⁄ Ee[]2 2 = ()2B ⋅⋅≈ ⋅ .. or, in dB, 20log10 2 B 6 B dB 0 Quantized at 7 bits -20
-40 level / dB
-60
-80 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 6
Redundant information ¥ Redundancy removal is lossless ¥ Signal correlation implies redundant information - e.g. if x[n] = x[n-1] + v[n] x[n] has a greater amplitude range → more bits than v[n] - sending v[n] = x[n] - x[n-1] can reduce amplitude, hence bitrate
x[n] - x[n-1]
- ‘white noise’ sequence has no redundancy ¥ Problem: separating unique & redundant
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 7
Optimal coding ¥ Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1 ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is ‘big news’
¥ Information in bits I = Ðlog2(probability) - clearly works when all possibilities equiprobable ¥ Opt. bitrate → token length = entropy H=E[I] - i.e. equal-length tokens are equally likely ¥ How to achieve this? - transform signal to have uniform pdf - nonuniform quantization for equiprobable tokens - variable-length tokens → Huffman coding
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 8 Quantization for optimum bitrate ¥ Quantization should reflect pdf of signal:
p(x = x0) p(x < x0) x' 1.0
0.8
0.6
0.4
0.2
0 -0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x - cumulative pdf p(x < x0) maps to uniform x' - or: nonuniform quantization bins
¥ Or, codeword length per Shannon Ðlog2(p(x)): p(x) Shannon info / bits Codewords -0.02 111111111xx 111101xx -0.01 111100xx 101xx 100xx 0 0xx 110xx 0.01 1110xx 111110xx 1111110xx 0.02 111111100xx 111111101xx 111111110xx 0.03 0 2 4 6 8 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 9 Huffman coding ¥ Variable-length bit sequence tokens → can code unequally probable events ¥ Tree-structure for unambiguous decoding:
0 0 p = 0.5 10 p = 0.25 0 0 1100 p = 0.0625 1 0 1101 p = 0.0625 1 1 0 1110 p = 0.0625 1 1 1111 p = 0.0625
1011001101000001001100010011100001110
¥ Can build tables to approximate arbitrary distributions ¥ Eliminates irrelevance .. within limits ¥ Problem: very probable events → short tokens
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 10 Vector Quantization ¥ Quantize mutually dependent values in joint space: 3 x1 2
1
0
-1
-2
x2
-6 -4 -2 0 2 4
¥ May help even if values are largely independent - larger space {x1,x2} is easier for Huffman
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 11 Compression & Representation ¥ As always, success depends on representation ¥ Appropriate domain may be ‘naturally’ bandlimited - e.g. vocal-tract-shape coefficients - can reduce sampling rate without data loss ¥ In right domain, irrelevance may be easier to get at - e.g. STFT to separate magnitude and phase
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 12 Aside: Coding standards ¥ Coding is only useful when recipient knows the code! ¥ Standardization efforts are important ¥ Federal Standards: Low bit-rate secure voice: - FS1015e: LPC-10 2.4 Kbps - FS1016: 4.8 Kbps CELP ¥ ITU G.series - G.726 ADPCM - G.729 Low delay CELP ¥ MPEG - MPEG-Audio layers 1,2,3 - MPEG 2 Advanced Audio Codec - MPEG 4 Synthetic-Natural Hybrid Codec ¥ etc ...
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 13 Outline
1 Information, compression & Quantization
2 Speech coding - General principles - CELP & friends
3 Wide bandwidth audio coding
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 14 2 Speech coding
¥ Standard voice channel: - analog: 4 kHz slot (~ 40 dB SNR) - digital: 64 Kbps = 8 bit µ-law x 8 kHz ¥ How to compress? Redundant - signal assumed to be a single voice, not any possible waveform Irrelevant - need code only enough for intelligibility, speaker identification (c/w analog channel) ¥ Specifically, source-filter decomposition - vocal tract & fund. frequency change slowly ¥ Applications: - live communications - offline storage
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 15 Channel Vocoder (1940s-1960s) ¥ Basic source-filter decomposition - filterbank breaks into spectral bands - transmit slowly-changing energy in each band Encoder Bandpass Smoothed Downsample E filter 1 energy & encode 1 Bandpass filter 1
Decoder Output Input
Bandpass Smoothed Downsample E filter N energy & encode N Bandpass filter 1
Voicing V/UV analysis Pulse Noise Pitch generator source - 10-20 bands, perceptually spaced ¥ Downsampling? ¥ Excitation? - pitch / noise model - or: baseband + ‘flattening’...
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 16 LPC encoding ¥ The classic source-filter model Encoder Filter coefficients { ai} ω |1/A(ej )| Represent Input f s[n] & encode LPC Decoder Output analysis e^[n] s^[n] Represent Excitation All-pole Residual & encode generator filter e[n] 1 H(z) = t Σ -i 1 - aiz
¥ Compression gains: - filter parameters are ~slowly changing - excitation can be represented many ways 20 ms
Filter parameters Excitation/pitch parameters
5 ms
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 17 Encoding LPC filter parameters ¥ For ‘communications quality’: - 8 kHz sampling (4 kHz bandwidth) - ~10th order LPC (up to 5 pole pairs) - update every 20-30 ms → 300 - 500 param/s ¥ Representation & a quantization i
-{ai} - poor distribution, -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 can’t interpolate ki -reflection coefficients {ki }: guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 f - LSPs - lovely! Li
¥ Bit allocation (filter): 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 18 Line Spectral Pairs (LSPs) ¥ LSPs encode LPC filter by a set of frequencies ¥ Excellent for quantization & interpolation
Ð1p Ð Ð1 Pz()= Az()+ z ⋅ Az() ¥ Def: zeros of Ð1p Ð Ð1 Qz()= Az()Ð z ⋅ Az() ω ω - z = ej → z-1 = e-j → |A(z)| = |A(z-1)| on u.c. - P(z), Q(z) have (interleaved) zeros when angle{A(z)} = ± angle{z-p-1A(z-1)} Π ζ -1 - reconstruct P(z), Q(z) = (1 - iz ) etc. - A(z) = [P(z) + Q(z)]/2
A(z-1) = 0
A(z) = 0 Q(z) = 0 P(z) = 0
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 19 Excitation ¥ Excitation already better than raw signal: 5000
0 Original signal -5000 100
0
-100 LPC residual 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s - save several bits/sample, still > 32 Kbps ¥ Crude model: U/V flag + pitch period - ~ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps Pitch period values 16 ms frame boundaries 100 50 0 -50 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s ¥ Band-limited then re-extended (RELP)
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 20 Encoding excitation ¥ Something between full-quality residual (32 Kbps) and pitch parameters (2.4 kbps)? •‘Analysis by synthesis’ loop: Input LPC Filter coefficients s[n] analysis A(z) Excitation code
Pitch-cycle LPC filter Synthetic predictor Ð excitation + 1 + ^ A(z) c[n] bázÐNL x[n] control
‘Perceptual’ MSError weighting minimization A(z) ^ W(z)= x[n]*ha[n] - s[n] A(z/γ) Excitation encoding
•‘Perceptual’ weighting discounts peaks: 20 A(z) A(z) W(z)= 10 A(z/γ) 0 A(z/γ)
gain / dB -10 -20 0 500 1000 1500 2000 2500 3000 3500 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 21 Multi-Pulse Excitation (MPE-LPC) ¥ Stylize excitation as N discrete pulses 15 10 original LPC 5 residual 0 -5 m i multi-pulse 5 excitation 0 -5 0 20 40 60 80 100 120 ti time / samps - encode as N x (ti, mi) pairs ¥ Greedy algorithm places one pulse at a time: Az() Xz() E = ------Ð Sz() pcp Az()⁄ γ Az() Xz() Rz() = ------Ð ------Az()⁄ γ Az()⁄ γ
- cross-correlate hγ and r*hγ , iterate
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 22 CELP ¥ Represent excitation with codebook e.g. 512 sparse excitation vectors Codebook 000 001 002 003
004 005 006 007
Search index 008 009 010 011 excitation
012 013 014 015
060time / samples - linear search for minimum weighted error? ¥ FS1016 4.8 Kbps CELP (30ms frame = 144 bits): 10 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits 138 Pitch gain 4 x 5 bits = 20 bits bits Codebk index 4 x 9 bits = 36 bits Codebk gain 4 x 5 bits = 20 bits
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 23 Aside: CELP for nonspeech? ¥ CELP is sometimes called a ‘hybrid’ coder: - originally based on source-filter voice model - CELP residual is waveform coding (no model) ¥ CELP does not break with multiple voices etc. Original (mrzebra-8k) 4000 3000
freq / Hz 2000 1000 0 4.8 Kbps CELP 4000 3000 2000 1000 0 0 1 2 3 4 5 6 7 8 time / s - just does the best it can ¥ LPC filter models vocal tract; matches auditory system? - i.e. the ‘source-filter’ separation is good for relevance as well as redundancy?
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 24 Outline
1 Information, compression & Quantization
2 Speech coding
3 Wide bandwidth audio coding - General principles - MPEG-Audio
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 25 3 Wide-Bandwidth Audio Coding
¥ Goals: - transparent coding i.e. no perceptible effect - general purpose - handles any signal ¥ Simple approaches (redundancy removal) - Adaptive Differential PCM (ADPCM)
X[n] + D[n] Adaptive C[n] + quantizer Ð C[n] = Q[D[n]]
Predictor X'[n] + Dequantizer X [n] + p X [n] = F[X'[n-i]] D'[n] = Q-1[C[n]] p + D'[n]
- as prediction gets smarter, becomes LPC e.g. “shorten” - lossless LPC encoding ¥ Larger compression gains needs irrelevance - hide quantization noise with psychoacoustic masking
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 26 Noise shaping ¥ Plain Q-noise sounds like added white noise - actually, not all that disturbing - .. but worst-case for exploiting masking
¥ Have Q-noise scale 1 mu-law with signal level quantization 0.8 - i.e. quantizer step gets larger with 0.6 mu(x)
amplitude 0.4 - minimum distortion 0.2 for some center- x - mu(x)
heavy pdf 0 0 0.2 0.4 0.6 0.8 1 ¥ Put Q-noise around peaks in signal spectrum - key to getting benefit of perceptual masking 20
0 Signal -20 level / dB
-40
-60 Linear Q-noise 0Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 27 Subband coding ¥ Idea: Quantize separately in separate bands Analysis filters Encoder Decoder Bandpass Downsample Quantize Q-1[¥] M f f M Q[¥] Input Output (M channels) Reconstruction + filters
MQ[¥] Q-1[¥] M f
- Q-noise stays within band, gets masked •‘Critical sampling’ → 1/M of spectrum per band 0 -10 alias energy -20 gain / dB -30 -40 -50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1 - some aliasing inevitable ¥ Trick is to cancel with alias of adjacent band →constraints on BPF shape
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 28 MPEG-Audio ¥ Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Scale indices
Scale Quantize normalize Input 32 band Format Bitstream polyphase & pack filterbank bitstream Data frame Scale Control & scales normalize Quantize
32 chans Psychoacoustic x 36 samples Per-band masking masking margins model 24 ms - 22 kHz ÷ 32 equal bands = 690 Hz bandwidth - 8 / 24 ms frames = 12 / 36 subband samples - fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp) - scale factors like LPC envelope?
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 29 Psychoacoustic model ¥ Based on simultaneous masking experiments ¥ Difficulties: - noise energy masks ~10 dB better than tones - masking level nonlinear in frequency & intensity - complex, dynamic sounds not well understood
¥ Procedure 60 Tonal 50 peaks 40 Signal - pick ‘tonal peaks’ in 30 spectrum Non-tonal 20 SPL / dB energy NB FFT spectrum 10 0 -10 - remaining energy 1 3 5 7 9 11 13 15 17 19 21 23 25 → ‘noisy’ peaks 60 50 40 Masking - apply nonlinear 30 20 spread SPL / dB 10 ‘spreading function’ 0 -10 - sum all masking & 1 3 5 7 9 11 13 15 17 19 21 23 25 60 threshold in power 50 40 domain 30 Resultant 20
SPL / dB masking 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25 freq / Bark
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 30 Bit allocation ¥ Result of psychoacoustic model is maximum tolerable noise per subband Masking tone B á
level Masked threshold
SNR ~ 6 Safe noise level
Quantization noise freq Subband N - safe noise level → required SNR → bits B ¥ Bit allocation procedure (fixed bit rate): - pick channel with worst noise-masker ratio - improve its quantization by one step - repeat while more bits available for this frame ¥ Bands with no signal above masking curve can be skipped entirely
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 31 MPEG Audio Layer III •‘Transform coder’ stuck on ‘subband coder’
Digital Audio Subband Line Coded Signal (PCM) Distortion 31 575 Audio Signal (768 kbit/s) Control Loop 192 kbit/s Filterbank Huffman MDCT 32 Subbands Nonuniform Encoding 32 kbit/s 0 0 Quantization Rate Control Loop Window Switching
CRC-Check Coding of
Side- Bitstream Formatting information Psycho- FFT acoustic 1024 Points Model External Control
¥ Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples - more redundancy in spectral domain - finer control e.g. of aliasing, masking - scale factors now in band-blocks ¥ Fixed Huffman tables optimized for audio data ¥ Power-law quantizer
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 32 Adaptive time window ¥ Time window relies on temporal masking - single quantization level over 8-24 ms window •‘Nightmare’ scenario:
Pre-echo distortion
- ‘backward masking’ saves in most cases ¥ Adaptive switching of time window: transition short 1 normal 0.8 0.6 0.4
window level 0.2 0 0 20 40 60 80 100 tim e / m s E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 33 MP3 & Beyond ¥ MP3 is ‘transparent’ at ~ 128 Kbps for stereo (11:1 better than 1.4 Mbps CD rate) - only decoder is standardized: better psych. models → better encoders ¥ MPEG2 AAC - rebuild of MP3 without backwards compatibility - 30% better (stereo at 96 Kbps?) - multichannel etc. ¥ MPEG4-Audio - wide range of component encodings - MPEG Audio, LSPs ¥ SAOL - ‘synthetic’ component of MPEG-4 Audio - complete DSP/computer music language! - how to encode into it?
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 34 Summary ¥ For coding, every bit counts - take care over quantization domain & effects - Shannon limits... ¥ Speech coding - LPC modeling is old but good - CELP residual modeling can go beyond speech ¥ Wide-band coding - noise shaping ‘hides’ quantization noise - detailed psychoacoustic models are key
E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 35