EE E6820: Speech & Audio Processing & Recognition
Lecture 7: Audio compression and coding
Dan Ellis
Columbia University Dept. of Electrical Engineering
http://www.ee.columbia.edu/~dpwe/e6820
March 31, 2009
E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 1 / 37 Outline
1 Information, Compression & Quantization
2 Speech coding
3 Wide-Bandwidth Audio Coding

Compression & Quantization
How big is audio data? What is the bitrate?
- Fs frames/second (e.g. 8000 or 44100)
  × C samples/frame (e.g. 1 or 2 channels)
  × B bits/sample (e.g. 8 or 16)
  → Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps)
[Figure: bitrate as bits/frame (8 to 32) against frames/sec (8000 to 44100): telephony 64 Kbps, CD audio 1.4 Mbps, mobile ≈ 13 Kbps]
How to reduce?
- lower sampling rate → less bandwidth (muffled)
- lower channel count → no stereo image
- lower sample size → quantization noise
Or: use data compression
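The Fs · C · B arithmetic above can be checked with a minimal sketch; the function name `pcm_bitrate` is illustrative and not from the lecture.

```python
def pcm_bitrate(fs, channels, bits):
    """Uncompressed PCM bitrate in bits/second: Fs * C * B."""
    return fs * channels * bits

telephony = pcm_bitrate(8000, 1, 8)     # 8 kHz, mono, 8 bits/sample
cd_audio = pcm_bitrate(44100, 2, 16)    # 44.1 kHz, stereo, 16 bits/sample

print(f"telephony: {telephony / 1e3:.0f} Kbps")
print(f"CD audio:  {cd_audio / 1e6:.2f} Mbps")
```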

Data compression: Redundancy vs. Irrelevance
Two main principles in compression:
- remove redundant information
- remove irrelevant information
Redundant information is implicit in the remainder
- e.g. signal bandlimited to 20 kHz, but sampled at 80 kHz → can recover every other sample by interpolation
[Figure: sample value vs. time; in a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples]
Irrelevant information is unique but unnecessary
- e.g. recording a microphone signal at 80 kHz sampling rate (the content above 20 kHz is real, but inaudible)

Irrelevant data in audio coding
For coding of audio signals, irrelevant means perceptually insignificant
- an empirical property
Compact Disc standard is adequate:
- 44.1 kHz sampling for 20 kHz bandwidth
- 16 bit linear samples for ∼ 96 dB peak SNR
These reflect limits of human sensitivity:
- 20 kHz bandwidth, 100 dB intensity
- sinusoid phase, detail of noise structure
- dynamic properties - hard to characterize
Problem: separating salient & irrelevant

Quantization
Represent waveform with discrete levels
[Figure: left, x[n] and Q[x[n]] vs. n, with error e[n] = x[n] − Q[x[n]]; right, staircase quantizer characteristic Q[x] vs. x]
Equivalent to adding error e[n]:
  $x[n] = Q[x[n]] + e[n]$
e[n] ∼ uncorrelated, uniform white noise
  p(e[n]) is uniform over $-D/2 \ldots +D/2$ → variance $\sigma_e^2 = D^2/12$

Quantization noise (Q-noise)
Uncorrelated noise has flat spectrum
With a B bit word and a quantization step D
- max signal range (x) = $-(2^{B-1}) \cdot D \ldots (2^{B-1} - 1) \cdot D$
- quantization noise (e) = $-D/2 \ldots D/2$
→ Best signal-to-noise ratio (power)
  $\mathrm{SNR} = E[x^2]/E[e^2] = (2^B)^2$
.. or, in dB, $20 \cdot \log_{10} 2 \cdot B \approx 6 \cdot B$ dB
[Figure: spectrum of a signal quantized at 7 bits, level / dB (0 to −80) vs. freq / Hz (0 to 7000)]
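The 6 dB per bit rule can be checked empirically. The sketch below is an illustration under stated assumptions: a full-scale, uniformly distributed test signal, a round-to-nearest quantizer, and helper names (`quantize`, `measured_snr_db`) that are ours rather than the lecture's.

```python
import math
import random

def quantize(x, bits, full_scale=1.0):
    """Uniform B-bit quantizer covering [-full_scale, full_scale)."""
    step = 2.0 * full_scale / (2 ** bits)            # quantization step D
    level = math.floor(x / step + 0.5)               # round to nearest level
    lo, hi = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return step * max(lo, min(hi, level))            # clip to the B-bit range

def measured_snr_db(bits, n=20000, seed=0):
    """Measured SNR in dB for a full-scale uniform random signal."""
    rng = random.Random(seed)
    xs = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    signal_power = sum(x * x for x in xs)
    noise_power = sum((x - quantize(x, bits)) ** 2 for x in xs)
    return 10.0 * math.log10(signal_power / noise_power)

for b in (7, 8, 12):
    print(b, "bits: measured", round(measured_snr_db(b), 1),
          "dB, rule of thumb", round(6.02 * b, 1), "dB")
```

Real audio rarely fills the full range uniformly, so practical SNR figures are lower than this best case.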

Redundant information
Redundancy removal is lossless
Signal correlation implies redundant information
- e.g. if x[n] = x[n − 1] + v[n], then x[n] has a greater amplitude range → uses more bits than v[n]
- sending v[n] = x[n] − x[n − 1] can reduce amplitude, hence bitrate
[Figure: waveform x[n] compared with its first difference x[n] − x[n−1]]
- but: ‘white noise’ sequence has no redundancy ...
Problem: separating unique & redundant
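A small sketch of the first-difference idea, assuming an integer-valued random-walk test signal (our choice, not the lecture's): the differences stay within ±3 while the signal itself wanders much further, and the running sum recovers it exactly.

```python
import random

def diff_encode(x):
    """v[n] = x[n] - x[n-1], taking x[-1] as 0."""
    prev, v = 0, []
    for s in x:
        v.append(s - prev)
        prev = s
    return v

def diff_decode(v):
    """A running sum restores x[n] exactly (integer arithmetic, so lossless)."""
    acc, x = 0, []
    for d in v:
        acc += d
        x.append(acc)
    return x

rng = random.Random(1)
x, level = [], 0
for _ in range(1000):
    level += rng.randint(-3, 3)      # slowly wandering, strongly correlated signal
    x.append(level)

v = diff_encode(x)
print("peak |x| =", max(abs(s) for s in x), " peak |v| =", max(abs(d) for d in v))
```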

Optimal coding
Shannon information: An unlikely occurrence is more ‘informative’
  p(A) = 0.5, p(B) = 0.5: ABBBBAAABBABBABBABB (A, B equiprobable)
  p(A) = 0.9, p(B) = 0.1: AAAAABBAAAAAABAAAAB (A is expected; B is ‘big news’)
Information in bits: $I = -\log_2(\text{probability})$
- clearly works when all possibilities equiprobable
Optimal bitrate → av. token length = entropy $H = E[I]$
- .. equal-length tokens are equally likely
How to achieve this?
- transform signal to have uniform pdf
- nonuniform quantization for equiprobable tokens
- variable-length tokens → Huffman coding
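For the two A/B sources above, the entropy H = E[I] can be computed directly. This is a minimal sketch; `entropy_bits` is an illustrative name.

```python
import math

def entropy_bits(probs):
    """Entropy H = E[-log2 p] in bits per symbol."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))   # equiprobable A, B
print(entropy_bits([0.9, 0.1]))   # A expected, B is 'big news'
```

The equiprobable source needs a full bit per symbol, while the skewed source carries just under half a bit per symbol on average, which is the room a variable-length code can exploit.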

Quantization for optimum bitrate
Quantization should reflect pdf of signal:
[Figure: pdf p(x = x0) and cumulative distribution p(x < x0) (0 to 1.0) vs. x (−0.02 to 0.025), mapping x to x']
- cumulative pdf p(x < x0) maps to uniform x'
Or, codeword length per Shannon $-\log_2 p(x)$:
[Figure: p(x) and Shannon info / bits (0 to 8) for x from −0.02 to 0.03, with variable-length codewords ranging from 0xx near x = 0 to 111111111xx in the tails]
- Huffman coding: tree-structured decoder
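A compact sketch of Huffman code construction, using Python's `heapq` to merge the two least probable subtrees repeatedly. The four-symbol source is an assumed example, chosen so that codeword lengths equal −log2 p exactly.

```python
import heapq
from itertools import count

def huffman_code(probs):
    """Return {symbol: bitstring} for a dict of symbol probabilities."""
    tiebreak = count()   # keeps heap entries comparable when probabilities tie
    heap = [(p, next(tiebreak), {sym: ""}) for sym, p in probs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        p0, _, code0 = heapq.heappop(heap)   # two least probable subtrees
        p1, _, code1 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in code0.items()}
        merged.update({s: "1" + w for s, w in code1.items()})
        heapq.heappush(heap, (p0 + p1, next(tiebreak), merged))
    return heap[0][2]

probs = {"A": 0.5, "B": 0.25, "C": 0.125, "D": 0.125}
code = huffman_code(probs)
average_length = sum(probs[s] * len(w) for s, w in code.items())
print(code, "average length:", average_length, "bits/symbol")
```

For this source the average codeword length matches the entropy; for probabilities that are not powers of 1/2 a Huffman code comes within one bit per symbol of it.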

Vector Quantization
Quantize mutually dependent values in joint space:
[Figure: joint distribution of x1 (−2 to 3) vs. x2 (−6 to 4)]
May help even if values are largely independent
- larger space x1,x2 is easier for Huffman
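One way to build and use a joint-space codebook is Lloyd's (k-means) iteration, sketched below. The lecture does not specify a training method, so this is an assumed choice, and the tiny data set is purely illustrative.

```python
def nearest(codebook, v):
    """Index of the codeword closest to vector v (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((a - b) ** 2 for a, b in zip(codebook[i], v)))

def lloyd(vectors, codebook, iterations=10):
    """Alternate nearest-codeword assignment and moving codewords to cell means."""
    codebook = [tuple(c) for c in codebook]
    for _ in range(iterations):
        cells = [[] for _ in codebook]
        for v in vectors:
            cells[nearest(codebook, v)].append(v)
        codebook = [
            tuple(sum(col) / len(cell) for col in zip(*cell)) if cell else old
            for cell, old in zip(cells, codebook)
        ]
    return codebook

data = [(0, 0), (0, 1), (10, 10), (10, 11)]          # (x1, x2) pairs
codebook = lloyd(data, [(0, 0), (10, 10)])
print(codebook, "index for (9, 9):", nearest(codebook, (9, 9)))
```

Only the codeword index is transmitted for each (x1, x2) pair; the decoder looks it up in the same codebook.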

Compression & Representation
As always, success depends on representation
Appropriate domain may be ‘naturally’ bandlimited
- e.g. vocal-tract-shape coefficients
- can reduce sampling rate without data loss
In right domain, irrelevance is easier to ‘get at’
- e.g. STFT to separate magnitude and phase

Aside: Coding standards
Coding only useful if recipient knows the code!
Standardization efforts are important
Federal Standards: Low bit-rate secure voice:
- FS1015e: LPC-10 2.4 Kbps
- FS1016: 4.8 Kbps CELP
ITU G.x series (also H.x for video)
- G.726 ADPCM
- G.729 Low delay CELP
MPEG
- MPEG-Audio layers 1,2,3 (mp3)
- MPEG 2 Advanced Audio Coding (AAC)
- MPEG 4 Synthetic-Natural Hybrid Coding
More recent ‘standards’
- proprietary: WMA, Skype...
- Speex ...

Speech coding
Standard voice channel:
- analog: 4 kHz slot (∼ 40 dB SNR)
- digital: 64 Kbps = 8 bit µ-law × 8 kHz
How to compress?
Redundant
- signal assumed to be a single voice, not any possible waveform
Irrelevant
- need code only enough for intelligibility, speaker identification (c/w analog channel)
Specifically, source-filter decomposition
- vocal tract & f0 change slowly
Applications:
- live communications
- offline storage

Channel Vocoder (1940s-1960s)
Basic source-filter decomposition
- filterbank breaks into spectral bands
- transmit slowly-changing energy in each band
[Figure: channel vocoder block diagram. Encoder: input → bandpass filters 1..N → smoothed energy → downsample & encode (E1..EN), plus voicing analysis (V/UV, pitch). Decoder: pulse generator / noise source → bandpass filters 1..N scaled by E1..EN → output]
- 10-20 bands, perceptually spaced
Downsampling?
Excitation?
- pitch / noise model
- or: baseband + ‘flattening’...

LPC encoding
The classic source-filter model
[Figure: LPC codec block diagram. Encoder: input s[n] → LPC analysis → filter coefficients {a_i} (represent & encode) and residual e[n] (represent & encode). Decoder: excitation generator ê[n] → all-pole filter → output ŝ[n]. Inset: spectral envelope |1/A(e^{jω})| vs. f]
  $H(z) = \frac{1}{1 - \sum_i a_i z^{-i}}$
Compression gains:
- filter parameters are ∼slowly changing
- excitation can be represented many ways
[Figure: timeline with filter parameters updated every 20 ms and excitation/pitch parameters every 5 ms]
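The LPC analysis block can be sketched with the autocorrelation method and the Levinson-Durbin recursion. The lecture does not say which analysis method is used, so this is an assumed, commonly used choice. Coefficients follow the slide's convention H(z) = 1/(1 − Σ a_i z^{-i}); sign conventions for the reflection coefficients vary between texts.

```python
def autocorr(x, max_lag):
    """Autocorrelation r[0..max_lag] of one analysis frame."""
    return [sum(x[n] * x[n - k] for n in range(k, len(x))) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a[1..order], reflection coefficients, residual energy),
    where x[n] is predicted as sum_i a[i] * x[n - i]."""
    a = [0.0] * (order + 1)          # a[0] is unused
    reflection = []
    err = r[0]
    for m in range(1, order + 1):
        k = (r[m] - sum(a[i] * r[m - i] for i in range(1, m))) / err
        new_a = a[:]
        new_a[m] = k
        for i in range(1, m):
            new_a[i] = a[i] - k * a[m - i]
        a = new_a
        err *= 1.0 - k * k           # prediction error shrinks at each order
        reflection.append(k)
    return a[1:], reflection, err

# Autocorrelation of an ideal first-order process with r[k] = 0.9**k
coeffs, refl, err = levinson_durbin([1.0, 0.9, 0.81], order=2)
print(coeffs, refl, err)
```

For this first-order example the second coefficient comes out as zero, since one pole already explains the correlation.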

Encoding LPC filter parameters
For ‘communications quality’:
- 8 kHz sampling (4 kHz bandwidth)
- ∼10th order LPC (up to 5 pole pairs)
- update every 20-30 ms → 300 - 500 param/s
Representation & quantization
- {a_i} - poor distribution, can’t interpolate
- reflection coefficients {k_i}: guaranteed stable
- LSPs - lovely!
[Figure: distributions of a_i and k_i values (−2 to 2) and of LSP frequencies f_Li (0 to 0.5)]
Bit allocation (filter):
- GSM (13 kbps): 8 LARs × 3-6 bits / 20 ms = 1.8 Kbps
- FS1016 (4.8 kbps): 10 LSPs × 3-4 bits / 30 ms = 1.1 Kbps

Line Spectral Pairs (LSPs)
LSPs encode LPC filter by a set of frequencies
Excellent for quantization & interpolation
Definition: zeros of
  $P(z) = A(z) + z^{-p-1} \cdot A(z^{-1})$
  $Q(z) = A(z) - z^{-p-1} \cdot A(z^{-1})$
- $z = e^{j\omega} \rightarrow z^{-1} = e^{-j\omega} \rightarrow |A(z)| = |A(z^{-1})|$ on the unit circle
- P(z), Q(z) have (interleaved) zeros when $\angle\{A(z)\} = \pm\angle\{z^{-p-1} A(z^{-1})\}$
- reconstruct P(z), Q(z) $= \prod_i (1 - \zeta_i z^{-1})$ etc.
- $A(z) = [P(z) + Q(z)]/2$
[Figure: z-plane showing the zeros of A(z), A(z^{-1}), P(z) and Q(z)]
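A sketch of the definition above: form P(z) and Q(z) from the coefficients of A(z), then locate their unit-circle zeros by a sign-change search. The grid-plus-bisection search and the second-order example A(z) = 1 − 1.2 z^{-1} + 0.72 z^{-2} are assumptions made for illustration; practical coders use faster root-finding.

```python
import math

def lsp_polynomials(a):
    """a = [1, c1, ..., cp]: coefficients of A(z) in powers of z^-1.
    Returns coefficient lists of P(z) and Q(z), each of degree p + 1."""
    padded = list(a) + [0.0]
    mirrored = padded[::-1]                      # coefficients of z^(-p-1) * A(1/z)
    P = [x + y for x, y in zip(padded, mirrored)]
    Q = [x - y for x, y in zip(padded, mirrored)]
    return P, Q

def unit_circle_zeros(coeffs, symmetric, grid=2000):
    """Angles in (0, pi) where the polynomial vanishes on the unit circle.
    With the linear phase removed, a symmetric polynomial becomes a real
    cosine sum and an antisymmetric one a sine sum, so zeros are sign changes."""
    m = (len(coeffs) - 1) / 2.0
    trig = math.cos if symmetric else math.sin
    def f(w):
        return sum(c * trig(w * (m - k)) for k, c in enumerate(coeffs))
    zeros = []
    ws = [math.pi * i / grid for i in range(1, grid)]
    for w0, w1 in zip(ws, ws[1:]):
        if f(w0) * f(w1) < 0:
            for _ in range(50):                  # bisection refinement
                mid = 0.5 * (w0 + w1)
                if f(w0) * f(mid) <= 0:
                    w1 = mid
                else:
                    w0 = mid
            zeros.append(0.5 * (w0 + w1))
    return zeros

A = [1.0, -1.2, 0.72]
P, Q = lsp_polynomials(A)
print("P zeros (rad):", unit_circle_zeros(P, symmetric=True))
print("Q zeros (rad):", unit_circle_zeros(Q, symmetric=False))
```

The trivial zeros at ω = 0 and ω = π are excluded by the open search interval, leaving p frequencies in total that alternate between P and Q.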

Encoding LPC excitation
Excitation already better than raw signal:
[Figure: original signal (±5000) and LPC residual (±100) vs. time / s, 1.3 to 1.7 s]
- save several bits/sample, but still > 32 Kbps
Crude model: U/V flag + pitch period
- ∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps
[Figure: pitch period values with 16 ms frame boundaries vs. time / s, 1.3 to 1.75 s]
Band-limit then re-extend (RELP)

Encoding excitation
Something between full-quality residual (32 Kbps) and pitch parameters (1.4 Kbps)?
‘Analysis by synthesis’ loop:
[Figure: analysis-by-synthesis encoder. Input s[n] → LPC analysis → filter coefficients A(z). Excitation code c[n] plus pitch-cycle predictor (b·z^{−NL}) gives excitation x[n] → LPC filter 1/A(z) → synthetic signal x[n] * h_a[n]; the error x[n] * h_a[n] − s[n] passes through the ‘perceptual’ weighting W(z) to MS error minimization, which controls the excitation encoding]
‘Perceptual’ weighting discounts peaks:
  $W(z) = \frac{A(z)}{A(z/\gamma)}$
[Figure: gain / dB (−20 to 20) vs. freq / Hz (0 to 3500) for 1/A(z), 1/A(z/γ) and W(z)]
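The weighting filter W(z) = A(z)/A(z/γ) is easy to evaluate numerically, because A(z/γ) just scales the k-th coefficient of A(z) by γ^k. The sketch below assumes a single-pole example and γ = 0.8; both values are illustrative and not taken from the lecture.

```python
import cmath
import math

def poly_response(coeffs, w):
    """Evaluate sum_k coeffs[k] * z^-k at z = exp(jw)."""
    return sum(c * cmath.exp(-1j * w * k) for k, c in enumerate(coeffs))

def bandwidth_expand(a_poly, gamma):
    """Coefficients of A(z/gamma): coefficient k is scaled by gamma**k."""
    return [c * gamma ** k for k, c in enumerate(a_poly)]

def weighting_gain_db(a_poly, gamma, w):
    """|W(e^jw)| in dB for W(z) = A(z) / A(z/gamma)."""
    num = abs(poly_response(a_poly, w))
    den = abs(poly_response(bandwidth_expand(a_poly, gamma), w))
    return 20.0 * math.log10(num / den)

a_poly = [1.0, -0.9]          # A(z) = 1 - 0.9 z^-1, so 1/A(z) peaks at w = 0
for w in (0.0, math.pi / 2, math.pi):
    print(f"w = {w:.2f} rad: W gain = {weighting_gain_db(a_poly, 0.8, w):+.2f} dB")
```

The gain is lowest where 1/A(z) peaks, so errors near the spectral peak count for less in the minimization, which is where the signal itself is most likely to mask them.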

Multi-Pulse Excitation (MPE-LPC)
Stylize excitation as N discrete pulses
[Figure: original LPC residual and multi-pulse excitation with pulses of magnitude m_i at times t_i, vs. time / samps (0 to 120)]
- encode as N × (t_i, m_i) pairs
Greedy algorithm places one pulse at a time:
  $E_{pcp} = \frac{A(z)}{A(z/\gamma)} \left[ \frac{X(z)}{A(z)} - S(z) \right] = \frac{X(z)}{A(z/\gamma)} - \frac{R(z)}{A(z/\gamma)}$
- R(z) is residual of target waveform after inverse-filtering
- cross-correlate $h_\gamma$ and $r * h_\gamma$, iterate

CELP
Represent excitation with codebook, e.g. 512 sparse excitation vectors
[Figure: codebook of excitation vectors indexed 000 to 015, each spanning 0 to 60 time / samples; a search selects the index of the excitation]
- linear search for minimum weighted error?
FS1016 4.8 Kbps CELP (30 ms frame = 144 bits):
  10 LSPs        4×4 + 6×3 bits = 34 bits
  Pitch delay    4 × 7 bits     = 28 bits
  Pitch gain     4 × 5 bits     = 20 bits
  Codebk index   4 × 9 bits     = 36 bits
  Codebk gain    4 × 5 bits     = 20 bits
  Total                           138 bits

Aside: CELP for nonspeech?
CELP is sometimes called a ‘hybrid’ coder:
- originally based on source-filter voice model
- CELP residual is waveform coding (no model)
[Figure: spectrograms, freq / kHz (0 to 4) vs. time / sec (0 to 8): original (mrZebra-8k), 5 kbps buzz-hiss, 4.8 kbps CELP]
CELP does not break with multiple voices etc.
- just does the best it can
LPC filter models vocal tract; also matches auditory system?
- i.e. the ‘source-filter’ separation is good for relevance as well as redundancy?

Wide-Bandwidth Audio Coding
Goals:
- transparent coding i.e. no perceptible effect
- general purpose - handles any signal
Simple approaches (redundancy removal)
- Adaptive Differential PCM (ADPCM)
[Figure: ADPCM encoder loop: D[n] = X[n] − Xp[n]; adaptive quantizer C[n] = Q[D[n]]; dequantizer D'[n] = Q^{-1}[C[n]]; X'[n] = Xp[n] + D'[n]; predictor Xp[n] = F[X'[n−i]]]
- as prediction gets smarter, becomes LPC, e.g. shorten - lossless LPC encoding
Larger compression gains need irrelevance
- hide quantization noise with psychoacoustic masking
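The ADPCM loop in the diagram can be sketched as follows. This is not any standard's algorithm (G.726, for instance, is considerably more elaborate). The predictor is simply the previous reconstructed sample, and the step-size rule in `_adapt` is an invented illustration. What matters is that the encoder predicts from the reconstructed signal, so the decoder can track it from the codes alone.

```python
import math

def _adapt(step, code, qmax):
    """Illustrative rule: grow the step after large codes, shrink it after small ones."""
    step *= 1.5 if abs(code) > qmax / 2 else 0.9
    return max(step, 1e-3)

def adpcm_encode(x, bits=4, step=1.0):
    qmax = 2 ** (bits - 1) - 1
    pred, codes = 0.0, []
    for s in x:
        d = s - pred                                     # D[n] = X[n] - Xp[n]
        c = max(-qmax - 1, min(qmax, round(d / step)))   # C[n] = Q[D[n]]
        codes.append(c)
        pred += c * step                                 # X'[n] = Xp[n] + D'[n], reused as next Xp
        step = _adapt(step, c, qmax)
    return codes

def adpcm_decode(codes, bits=4, step=1.0):
    qmax = 2 ** (bits - 1) - 1
    recon, out = 0.0, []
    for c in codes:
        recon += c * step                                # same update as the encoder
        out.append(recon)
        step = _adapt(step, c, qmax)
    return out

x = [100.0 * math.sin(2 * math.pi * n / 50) for n in range(400)]
codes = adpcm_encode(x)
y = adpcm_decode(codes)
noise = sum((a - b) ** 2 for a, b in zip(x, y))
print("SNR at 4 bits/sample:", round(10 * math.log10(sum(a * a for a in x) / noise), 1), "dB")
```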

Noise shaping
Plain Q-noise sounds like added white noise
- actually, not all that disturbing
- .. but worst-case for exploiting masking
Have Q-noise scale with signal level
- i.e. quantizer step gets larger with amplitude
- minimum distortion for some center-heavy pdf
[Figure: mu-law quantization: mu(x) and x − mu(x) for x from 0 to 1]
Or: put Q-noise around peaks in spectrum
- key to getting benefit of perceptual masking
[Figure: level / dB (20 to −60) vs. freq / Hz (to 3500): signal spectrum, linear Q-noise, transform Q-noise]
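A sketch of mu-law companding, using the standard curve mu(x) = sign(x) · ln(1 + µ|x|) / ln(1 + µ) with µ = 255. The uniform 8-bit quantizer applied in the companded domain is a simplification of what telephony codecs do. The printed step sizes show the quantizer step growing with amplitude.

```python
import math

MU = 255.0

def mu_compress(x, mu=MU):
    """Map x in [-1, 1] to the companded domain."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_expand(y, mu=MU):
    """Inverse of mu_compress."""
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)

def mu_quantize(x, bits=8, mu=MU):
    """Quantize uniformly in the companded domain, then expand back."""
    levels = 2 ** (bits - 1) - 1
    return mu_expand(round(mu_compress(x, mu) * levels) / levels, mu)

small_step = mu_expand(1 / 127) - mu_expand(0.0)       # step next to zero
large_step = mu_expand(1.0) - mu_expand(126 / 127)     # step next to full scale
print("step near 0:", small_step, " step near 1:", large_step)
print("0.01 ->", mu_quantize(0.01), "  0.9 ->", mu_quantize(0.9))
```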

Subband coding
Idea: Quantize separately in separate bands
[Figure: subband codec. Encoder: input → M bandpass analysis filters → downsample by M → quantize Q[•]. Decoder: Q^{-1}[•] → upsample by M → reconstruction filters → sum → output]
- Q-noise stays within band, gets masked
‘Critical sampling’ → 1/M of spectrum per band
[Figure: gain / dB (0 to −50) vs. normalized freq (0 to 1) for adjacent band filters, showing alias energy where they overlap]
- some aliasing inevitable
Trick is to cancel with alias of adjacent band → ‘quadrature-mirror’ filters
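Alias cancellation can be seen in the simplest possible case: a two-band, critically sampled filterbank built from the 2-tap Haar pair. This is an assumed toy example with very poor band separation, so each band aliases heavily after downsampling, yet the aliases cancel exactly on reconstruction.

```python
import math

R = math.sqrt(0.5)

def analysis(x):
    """Split an even-length signal into low and high bands, each downsampled by 2."""
    low = [R * (x[i] + x[i + 1]) for i in range(0, len(x) - 1, 2)]
    high = [R * (x[i] - x[i + 1]) for i in range(0, len(x) - 1, 2)]
    return low, high

def synthesis(low, high):
    """Upsample and recombine; aliasing from the two bands cancels."""
    x = []
    for l, h in zip(low, high):
        x.extend((R * (l + h), R * (l - h)))
    return x

x = [1.0, 4.0, -2.0, 0.5, 3.0, 3.0]
low, high = analysis(x)
print("samples in:", len(x), " samples across bands:", len(low) + len(high))
print("reconstruction:", synthesis(low, high))
```

In a coder, `low` and `high` would each be quantized with their own step size before synthesis; the reconstruction is then exact only up to that quantization noise.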

MPEG-Audio (layer I, II)
Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands
[Figure: encoder block diagram. Input → 32 band polyphase filterbank → per-band scale normalize → quantize → format & pack bitstream (data frame: control & scales, scale indices) → bitstream; a psychoacoustic masking model supplies per-band masking margins; frame = 32 chans × 36 samples, 24 ms]
- 22 kHz ÷ 32 equal bands = 690 Hz bandwidth
- 8 / 24 ms frames = 12 / 36 subband samples
- fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp)
- scale factors are like LPC envelope?

MPEG Psychoacoustic model
Based on simultaneous masking experiments
Difficulties:
- noise energy masks ∼10 dB better than tones
- masking level nonlinear in frequency & intensity
- complex, dynamic sounds not well understood
Procedure
- pick ‘tonal peaks’ in NB FFT spectrum
- remaining energy → ‘noisy’ peaks
- apply nonlinear ‘spreading function’
- sum all masking & threshold in power domain
[Figure: three panels of SPL / dB (−10 to 60) vs. freq / Bark (1 to 25): signal spectrum with tonal peaks and non-tonal energy; masking spread; resultant masking]

MPEG Bit allocation
Result of psychoacoustic model is maximum tolerable noise per subband
[Figure: level vs. freq within subband N: masking tone, masked threshold, safe noise level, and quantization noise at SNR ∼ 6·B below the signal]
- safe noise level → required SNR → bits B
Bit allocation procedure (fixed bit rate):
- pick channel with worst noise-masker ratio
- improve its quantization by one step
- repeat while more bits available for this frame
Bands with no signal above masking curve can be skipped
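The allocation procedure above can be sketched as a greedy loop. The assumptions here are ours: the psychoacoustic model is reduced to one signal-to-mask ratio (SMR, in dB) per band, each extra bit buys about 6.02 dB of SNR, and the budget counts bits per subband sample rather than a real frame format.

```python
def allocate_bits(smr_db, total_bits, db_per_bit=6.02, max_bits=15):
    """Greedy allocation: repeatedly give one more bit to the band whose
    noise-to-mask ratio (SMR minus achieved SNR) is currently worst."""
    bits = [0] * len(smr_db)
    for _ in range(total_bits):
        candidates = [i for i, b in enumerate(bits) if b < max_bits]
        if not candidates:
            break
        worst = max(candidates, key=lambda i: smr_db[i] - db_per_bit * bits[i])
        bits[worst] += 1
    return bits

smr = [20.0, 5.0, -3.0, 12.0]     # band 2 lies entirely below the masking curve
print(allocate_bits(smr, total_bits=6))
```

With this budget the band whose signal is already below the masking curve receives no bits, matching the last point above.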

MPEG Audio Layer III
‘Transform coder’ on top of ‘subband coder’
[Figure: Layer III encoder. Digital audio signal (PCM, 768 kbit/s) → filterbank (32 subbands, 0 to 31) → MDCT with window switching (lines 0 to 575) → distortion control loop / nonuniform quantization / rate control loop → Huffman encoding → bitstream formatting with CRC check → coded audio signal (32 to 192 kbit/s); a 1024-point FFT feeds the psychoacoustic model; side information is coded separately; external control]
Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples
- more redundancy in spectral domain
- finer control e.g. of aliasing, masking
- scale factors now in band-blocks
Fixed Huffman tables optimized for audio data
Power-law quantizer
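The 36-in, 18-out transform can be illustrated with a plain MDCT: windows of 2N = 36 samples overlap by half, each yields N = 18 coefficients, and the time-domain aliasing cancels when inverse blocks are overlap-added. The sketch assumes a sine window for both analysis and synthesis and a direct O(N²) evaluation; it is not the Layer III bitstream transform itself.

```python
import math
import random

def sine_window(length):
    return [math.sin(math.pi * (n + 0.5) / length) for n in range(length)]

def mdct(frame):
    """2N windowed samples -> N coefficients."""
    n_coef = len(frame) // 2
    w = sine_window(len(frame))
    return [sum(w[n] * frame[n] *
                math.cos(math.pi / n_coef * (n + 0.5 + n_coef / 2) * (k + 0.5))
                for n in range(len(frame)))
            for k in range(n_coef)]

def imdct(coefs):
    """N coefficients -> 2N windowed, time-aliased samples."""
    n_coef = len(coefs)
    w = sine_window(2 * n_coef)
    return [w[n] * (2.0 / n_coef) *
            sum(coefs[k] * math.cos(math.pi / n_coef * (n + 0.5 + n_coef / 2) * (k + 0.5))
                for k in range(n_coef))
            for n in range(2 * n_coef)]

def mdct_roundtrip(x, n_coef):
    """Transform half-overlapping blocks and overlap-add the inverses."""
    y = [0.0] * len(x)
    for start in range(0, len(x) - 2 * n_coef + 1, n_coef):
        block = imdct(mdct(x[start:start + 2 * n_coef]))
        for n, v in enumerate(block):
            y[start + n] += v
    return y

rng = random.Random(0)
x = [rng.uniform(-1.0, 1.0) for _ in range(90)]
y = mdct_roundtrip(x, 18)
print("max error, fully overlapped region:",
      max(abs(a - b) for a, b in zip(x[18:72], y[18:72])))
```

Only samples covered by two blocks are reconstructed; the first and last 18 samples would need extra blocks at the edges.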

Adaptive time window
Time window relies on temporal masking
- single quantization level over 8-24 ms window
‘Nightmare’ scenario:
[Figure: transient waveform showing pre-echo distortion]
- ‘backward masking’ saves in most cases
Adaptive switching of time window:
[Figure: window level (0 to 1) vs. time / ms (0 to 100): normal, transition and short windows]

The effects of MP3
Before & after:
[Figure: three spectrograms, freq / kHz (0 to 20) vs. time / sec (0 to 10): Josie - direct from CD; after MP3 encode (160 kbps) and decode; residual (after aligning 1148 sample delay)]
- chop off high frequency (above 16 kHz)
- occasional other time-frequency ‘holes’
- quantization noise under signal

MP3 & Beyond
MP3 is ‘transparent’ at ∼ 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate)
- only decoder is standardized: better psychoacoustic models → better encoders
MPEG2 AAC
- rebuild of MP3 without backwards compatibility
- 30% better (stereo at 96 Kbps?)
- multichannel etc.
MPEG4-Audio
- wide range of component encodings
- MPEG Audio, LSPs, ...
SAOL
- ‘synthetic’ component of MPEG-4 Audio
- complete DSP/computer music language!
- how to encode into it?

Summary
For coding, every bit counts
- take care over quantization domain & effects
- Shannon limits...
Speech coding
- LPC modeling is old but good
- CELP residual modeling can go beyond speech
Wide-band coding
- noise shaping ‘hides’ quantization noise
- detailed psychoacoustic models are key