EE E6820: Speech & Audio Processing & Recognition

Lecture 8: Audio Compression & Coding

1 Information, compression & quantization

2

3 Wide bandwidth audio coding

Dan Ellis http://www.ee.columbia.edu/~dpwe/e6820/

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 1

1 Compression & Quantization

¥ How big is audio data? What is the bitrate? - Fs frames/second (e.g. 8000 or 44100) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) → FsáCáB bits/second (e.g. 64 Kbps or 1.4 Mbps) CD Audio 1.4 Mbps

32

Mobilebits / frame 8 Telephony8000 64 Kbps 44100 ≤13 Kbps frames / sec

¥ How to reduce? - lower sampling rate → less bandwidth (muffled) - lower channel count → no stereo image - lower sample size → quantization noise ¥ Or: use

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 2

Data compression: Redundancy vs. Irrelevance ¥ Two main principles in compression: - remove redundant information - remove irrelevant information ¥ Redundant information is implicit in remainder - e.g. signal bandlimited to 20kHz, but sample at 80kHz →can recover every other sample by interpolation: In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample

time

¥ Irrelevant information is unique, unnecessary - e.g. recording a microphone signal at 80 kHz sampling rate

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 3

Irrelevant data in audio coding ¥ For coding of audio signals, irrelevant means perceptually insignificant - an empirical property ¥ Compact Disc standard is adequate: - 44 kHz sampling for 20 kHz bandwidth - 16 bit linear samples for ~ 96 dB peak SNR ¥ Reflect limits of human sensitivity: - 20 kHz bandwidth, 100 dB intensity - sinusoid phase, detail of noise structure - dynamic properties - hard to characterize ¥ Problem: separating salient & irrelevant

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 4

Quantization ¥ Represent waveform with discrete levels 5 4 6 3 Q[x[n]] x[n] 2 4 1 2 Q[x] 0 -1 0 -2 -3 -2 -4 error e[n] = x[n] - Q[x[n]] -5 0 5 10 15 20 25 30 35 40 -5 -4 -3 -2 -1 0 1 2 3 4 5 x ¥ Equivalent to adding error e[n]: xn[]= Qxn[][]+ en[] ¥ e[n] ~ uncorrelated, uniform white noise p(e[n]) 2 σ2 D - variance e = ------D +D 12 /2 /2

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 5

Quantization noise (Q-noise) ¥ Uncorrelated noise has flat spectrum ¥ With a B bit word and a quantization step D - max signal range (x) = -(2B-1)áD .. (2B-1-1)áD - quantization noise (e) = -D/2 .. D/2 → Best signal-to-noise ratio (power) SNR= E[] x2 ⁄ Ee[]2 2 = ()2B ⋅⋅≈ ⋅ .. or, in dB, 20log10 2 B 6 B dB 0 Quantized at 7 bits -20

-40 level / dB

-60

-80 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 6

Redundant information ¥ Redundancy removal is lossless ¥ Signal correlation implies redundant information - e.g. if x[n] = x[n-1] + v[n] x[n] has a greater amplitude range → more bits than v[n] - sending v[n] = x[n] - x[n-1] can reduce amplitude, hence bitrate

x[n] - x[n-1]

- ‘white noise’ sequence has no redundancy ¥ Problem: separating unique & redundant

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 7

Optimal coding ¥ Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1 ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is ‘big news’

¥ Information in bits I = Ðlog2(probability) - clearly works when all possibilities equiprobable ¥ Opt. bitrate → token length = entropy H=E[I] - i.e. equal-length tokens are equally likely ¥ How to achieve this? - transform signal to have uniform pdf - nonuniform quantization for equiprobable tokens - variable-length tokens →

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 8 Quantization for optimum bitrate ¥ Quantization should reflect pdf of signal:

p(x = x0) p(x < x0) x' 1.0

0.8

0.6

0.4

0.2

0 -0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x - cumulative pdf p(x < x0) maps to uniform x' - or: nonuniform quantization bins

¥ Or, codeword length per Shannon Ðlog2(p(x)): p(x) Shannon info / bits Codewords -0.02 111111111xx 111101xx -0.01 111100xx 101xx 100xx 0 0xx 110xx 0.01 1110xx 111110xx 1111110xx 0.02 111111100xx 111111101xx 111111110xx 0.03 0 2 4 6 8 E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 9 Huffman coding ¥ Variable-length bit sequence tokens → can code unequally probable events ¥ Tree-structure for unambiguous decoding:

0 0 p = 0.5 10 p = 0.25 0 0 1100 p = 0.0625 1 0 1101 p = 0.0625 1 1 0 1110 p = 0.0625 1 1 1111 p = 0.0625

1011001101000001001100010011100001110

¥ Can build tables to approximate arbitrary distributions ¥ Eliminates irrelevance .. within limits ¥ Problem: very probable events → short tokens

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 10 Vector Quantization ¥ Quantize mutually dependent values in joint space: 3 x1 2

1

0

-1

-2

x2

-6 -4 -2 0 2 4

¥ May help even if values are largely independent - larger space {x1,x2} is easier for Huffman

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 11 Compression & Representation ¥ As always, success depends on representation ¥ Appropriate domain may be ‘naturally’ bandlimited - e.g. vocal-tract-shape coefficients - can reduce sampling rate without data loss ¥ In right domain, irrelevance may be easier to get at - e.g. STFT to separate magnitude and phase

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 12 Aside: Coding standards ¥ Coding is only useful when recipient knows the code! ¥ Standardization efforts are important ¥ Federal Standards: Low bit-rate secure voice: - FS1015e: LPC-10 2.4 Kbps - FS1016: 4.8 Kbps CELP ¥ ITU G.series - G.726 ADPCM - G.729 Low delay CELP ¥ MPEG - MPEG-Audio layers 1,2,3 - MPEG 2 Advanced Audio - MPEG 4 Synthetic-Natural Hybrid Codec ¥ etc ...

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 13 Outline

1 Information, compression & Quantization

2 Speech coding - General principles - CELP & friends

3 Wide bandwidth audio coding

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 14 2 Speech coding

¥ Standard voice channel: - analog: 4 kHz slot (~ 40 dB SNR) - digital: 64 Kbps = 8 bit µ-law x 8 kHz ¥ How to compress? Redundant - signal assumed to be a single voice, not any possible waveform Irrelevant - need code only enough for intelligibility, speaker identification (c/w analog channel) ¥ Specifically, source-filter decomposition - vocal tract & fund. frequency change slowly ¥ Applications: - live communications - offline storage

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 15 Channel Vocoder (1940s-1960s) ¥ Basic source-filter decomposition - filterbank breaks into spectral bands - transmit slowly-changing energy in each band Encoder Bandpass Smoothed Downsample E filter 1 energy & encode 1 Bandpass filter 1

Decoder Output Input

Bandpass Smoothed Downsample E filter N energy & encode N Bandpass filter 1

Voicing V/UV analysis Pulse Noise Pitch generator source - 10-20 bands, perceptually spaced ¥ Downsampling? ¥ Excitation? - pitch / noise model - or: baseband + ‘flattening’...

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 16 LPC encoding ¥ The classic source-filter model Encoder Filter coefficients { ai} ω |1/A(ej )| Represent Input f s[n] & encode LPC Decoder Output analysis e^[n] s^[n] Represent Excitation All-pole Residual & encode generator filter e[n] 1 H(z) = t Σ -i 1 - aiz

¥ Compression gains: - filter parameters are ~slowly changing - excitation can be represented many ways 20 ms

Filter parameters Excitation/pitch parameters

5 ms

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 17 Encoding LPC filter parameters ¥ For ‘communications quality’: - 8 kHz sampling (4 kHz bandwidth) - ~10th order LPC (up to 5 pole pairs) - update every 20-30 ms → 300 - 500 param/s ¥ Representation & a quantization i

-{ai} - poor distribution, -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 can’t interpolate ki -reflection coefficients {ki }: guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 f - LSPs - lovely! Li

¥ Bit allocation (filter): 0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 - GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps - FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 18 Line Spectral Pairs (LSPs) ¥ LSPs encode LPC filter by a set of frequencies ¥ Excellent for quantization & interpolation

Ð1p Ð Ð1 Pz()= Az()+ z ⋅ Az() ¥ Def: zeros of Ð1p Ð Ð1 Qz()= Az()Ð z ⋅ Az() ω ω - z = ej → z-1 = e-j → |A(z)| = |A(z-1)| on u.c. - P(z), Q(z) have (interleaved) zeros when angle{A(z)} = ± angle{z-p-1A(z-1)} Π ζ -1 - reconstruct P(z), Q(z) = (1 - iz ) etc. - A(z) = [P(z) + Q(z)]/2

A(z-1) = 0

A(z) = 0 Q(z) = 0 P(z) = 0

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 19 Excitation ¥ Excitation already better than raw signal: 5000

0 Original signal -5000 100

0

-100 LPC residual 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s - save several bits/sample, still > 32 Kbps ¥ Crude model: U/V flag + pitch period - ~ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps Pitch period values 16 ms frame boundaries 100 50 0 -50 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s ¥ Band-limited then re-extended (RELP)

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 20 Encoding excitation ¥ Something between full-quality residual (32 Kbps) and pitch parameters (2.4 kbps)? •‘Analysis by synthesis’ loop: Input LPC Filter coefficients s[n] analysis A(z) Excitation code

Pitch-cycle LPC filter Synthetic predictor Ð excitation + 1 + ^ A(z) c[n] bázÐNL x[n] control

‘Perceptual’ MSError weighting minimization A(z) ^ W(z)= x[n]*ha[n] - s[n] A(z/γ) Excitation encoding

•‘Perceptual’ weighting discounts peaks: 20 A(z) A(z) W(z)= 10 A(z/γ) 0 A(z/γ)

gain / dB -10 -20 0 500 1000 1500 2000 2500 3000 3500 freq / Hz E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 21 Multi-Pulse Excitation (MPE-LPC) ¥ Stylize excitation as N discrete pulses 15 10 original LPC 5 residual 0 -5 m i multi-pulse 5 excitation 0 -5 0 20 40 60 80 100 120 ti time / samps - encode as N x (ti, mi) pairs ¥ Greedy algorithm places one pulse at a time: Az() Xz() E = ------Ð Sz() pcp Az()⁄ γ Az() Xz() Rz() = ------Ð ------Az()⁄ γ Az()⁄ γ

- cross-correlate hγ and r*hγ , iterate

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 22 CELP ¥ Represent excitation with codebook e.g. 512 sparse excitation vectors Codebook 000 001 002 003

004 005 006 007

Search index 008 009 010 011 excitation

012 013 014 015

060time / samples - linear search for minimum weighted error? ¥ FS1016 4.8 Kbps CELP (30ms frame = 144 bits): 10 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits 138 Pitch gain 4 x 5 bits = 20 bits bits Codebk index 4 x 9 bits = 36 bits Codebk gain 4 x 5 bits = 20 bits

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 23 Aside: CELP for nonspeech? ¥ CELP is sometimes called a ‘hybrid’ coder: - originally based on source-filter voice model - CELP residual is waveform coding (no model) ¥ CELP does not break with multiple voices etc. Original (mrzebra-8k) 4000 3000

freq / Hz 2000 1000 0 4.8 Kbps CELP 4000 3000 2000 1000 0 0 1 2 3 4 5 6 7 8 time / s - just does the best it can ¥ LPC filter models vocal tract; matches auditory system? - i.e. the ‘source-filter’ separation is good for relevance as well as redundancy?

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 24 Outline

1 Information, compression & Quantization

2 Speech coding

3 Wide bandwidth audio coding - General principles - MPEG-Audio

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 25 3 Wide-Bandwidth Audio Coding

¥ Goals: - transparent coding i.e. no perceptible effect - general purpose - handles any signal ¥ Simple approaches (redundancy removal) - Adaptive Differential PCM (ADPCM)

X[n] + D[n] Adaptive C[n] + quantizer Ð C[n] = Q[D[n]]

Predictor X'[n] + Dequantizer X [n] + p X [n] = F[X'[n-i]] D'[n] = Q-1[C[n]] p + D'[n]

- as prediction gets smarter, becomes LPC e.g. “” - lossless LPC encoding ¥ Larger compression gains needs irrelevance - hide quantization noise with psychoacoustic masking

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 26 Noise shaping ¥ Plain Q-noise sounds like added white noise - actually, not all that disturbing - .. but worst-case for exploiting masking

¥ Have Q-noise scale 1 mu-law with signal level quantization 0.8 - i.e. quantizer step gets larger with 0.6 mu(x)

amplitude 0.4 - minimum distortion 0.2 for some center- x - mu(x)

heavy pdf 0 0 0.2 0.4 0.6 0.8 1 ¥ Put Q-noise around peaks in signal spectrum - key to getting benefit of perceptual masking 20

0 Signal -20 level / dB

-40

-60 Linear Q-noise 0Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 27 Subband coding ¥ Idea: Quantize separately in separate bands Analysis filters Encoder Decoder Bandpass Downsample Quantize Q-1[¥] M f f M Q[¥] Input Output (M channels) Reconstruction + filters

MQ[¥] Q-1[¥] M f

- Q-noise stays within band, gets masked •‘Critical sampling’ → 1/M of spectrum per band 0 -10 alias energy -20 gain / dB -30 -40 -50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1 - some aliasing inevitable ¥ Trick is to cancel with alias of adjacent band →constraints on BPF shape

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 28 MPEG-Audio ¥ Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Scale indices

Scale Quantize normalize Input 32 band Format Bitstream polyphase & pack filterbank bitstream Data frame Scale Control & scales normalize Quantize

32 chans Psychoacoustic x 36 samples Per-band masking masking margins model 24 ms - 22 kHz ÷ 32 equal bands = 690 Hz bandwidth - 8 / 24 ms frames = 12 / 36 subband samples - fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp) - scale factors like LPC envelope?

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 29 Psychoacoustic model ¥ Based on simultaneous masking experiments ¥ Difficulties: - noise energy masks ~10 dB better than tones - masking level nonlinear in frequency & intensity - complex, dynamic sounds not well understood

¥ Procedure 60 Tonal 50 peaks 40 Signal - pick ‘tonal peaks’ in 30 spectrum Non-tonal 20 SPL / dB energy NB FFT spectrum 10 0 -10 - remaining energy 1 3 5 7 9 11 13 15 17 19 21 23 25 → ‘noisy’ peaks 60 50 40 Masking - apply nonlinear 30 20 spread SPL / dB 10 ‘spreading function’ 0 -10 - sum all masking & 1 3 5 7 9 11 13 15 17 19 21 23 25 60 threshold in power 50 40 domain 30 Resultant 20

SPL / dB masking 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25 freq / Bark

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 30 Bit allocation ¥ Result of psychoacoustic model is maximum tolerable noise per subband Masking tone B á

level Masked threshold

SNR ~ 6 Safe noise level

Quantization noise freq Subband N - safe noise level → required SNR → bits B ¥ Bit allocation procedure (fixed ): - pick channel with worst noise-masker ratio - improve its quantization by one step - repeat while more bits available for this frame ¥ Bands with no signal above masking curve can be skipped entirely

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 31 MPEG Audio Layer III •‘Transform coder’ stuck on ‘subband coder’

Digital Audio Subband Line Coded Signal (PCM) Distortion 31 575 Audio Signal (768 kbit/s) Control Loop 192 kbit/s Filterbank Huffman MDCT 32 Subbands Nonuniform Encoding 32 kbit/s 0 0 Quantization Rate Control Loop Window Switching

CRC-Check Coding of

Side- Bitstream Formatting information Psycho- FFT acoustic 1024 Points Model External Control

¥ Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples - more redundancy in spectral domain - finer control e.g. of aliasing, masking - scale factors now in band-blocks ¥ Fixed Huffman tables optimized for audio data ¥ Power-law quantizer

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 32 Adaptive time window ¥ Time window relies on temporal masking - single quantization level over 8-24 ms window •‘Nightmare’ scenario:

Pre-echo distortion

- ‘backward masking’ saves in most cases ¥ Adaptive switching of time window: transition short 1 normal 0.8 0.6 0.4

window level 0.2 0 0 20 40 60 80 100 tim e / m s E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 33 MP3 & Beyond ¥ MP3 is ‘transparent’ at ~ 128 Kbps for stereo (11:1 better than 1.4 Mbps CD rate) - only decoder is standardized: better psych. models → better encoders ¥ MPEG2 AAC - rebuild of MP3 without backwards compatibility - 30% better (stereo at 96 Kbps?) - multichannel etc. ¥ MPEG4-Audio - wide range of component encodings - MPEG Audio, LSPs ¥ SAOL - ‘synthetic’ component of MPEG-4 Audio - complete DSP/computer music language! - how to encode into it?

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 34 Summary ¥ For coding, every bit counts - take care over quantization domain & effects - Shannon limits... ¥ Speech coding - LPC modeling is old but good - CELP residual modeling can go beyond speech ¥ Wide-band coding - noise shaping ‘hides’ quantization noise - detailed psychoacoustic models are key

E6820 SAPR - Coding - Dan Ellis 2001-03-20 - 35