<<

EE E6820: Speech & Audio Processing & Recognition Lecture 7: Audio compression and coding

Dan Ellis Michael Mandel

Columbia University Dept. of http://www.ee.columbia.edu/∼dpwe/e6820

March 31, 2009

1 Information, Compression & Quantization 2 3 Wide- Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 1 / 37 Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 2 / 37 Compression & Quantization How big is audio data? What is the bitrate?

I Fs frames/second (e.g. 8000 or 44100) x C samples/frame (e.g. 1 or 2 channels) x B bits/sample (e.g. 8 or 16) → Fs · C · B bits/second (e.g. 64 Kbps or 1.4 Mbps) CD Audio 1.4 Mbps

32 bits / frame 8 Mobile 8000 44100 !13 Kbps frames / sec 64 Kbps How to reduce?

I lower sampling rate → less bandwidth (muffled)

I lower channel count → no stereo image I lower sample size → quantization noise Or: use

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 3 / 37 Data compression: Redundancy vs. Irrelevance

Two main principles in compression:

I remove redundant information I remove irrelevant information Redundant information is implicit in remainder

I e.g. signal bandlimited to 20kHz, but sample at 80kHz → can recover every other sample by interpolation:

In a bandlimited signal, the red samples can be exactly recovered by interpolating the blue samples sample

time Irrelevant info is unique but unnecessary

I e.g. recording a signal at 80 kHz sampling rate

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 4 / 37 Irrelevant data in audio coding

For coding of audio signals, irrelevant means perceptually insignificant

I an empirical property Compact Disc standard is adequate:

I 44 kHz sampling for 20 kHz bandwidth I 16 bit linear samples for ∼ 96 dB peak SNR Reflect limits of human sensitivity:

I 20 kHz bandwidth, 100 dB intensity I sinusoid phase, detail of noise structure I dynamic properties - hard to characterize Problem: separating salient & irrelevant

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 5 / 37 Quantization

Represent waveform with discrete levels

5 4 6 3 Q[x[n]] x[n] 2 4 1 2 Q[x] 0 -1 0 -2 -3 -2 -4 error e[n] = x[n] - Q[x[n]] -5 0 5 10 15 20 25 30 35 40 -5 -4 -3 -2 -1 0 1 2 3 4 5 x Equivalent to adding error e[n]: x[n] = Q [x[n]] + e[n] e[n] ∼ uncorrelated, uniform white noise p(e[n]) 2 D2 → variance σe = 12 -D +D /2 /2

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 6 / 37 Quantization noise (Q-noise) Uncorrelated noise has flat spectrum With a B bit word and a quantization step D B−1 B−1 I max signal range (x) = −(2 ) · D ... (2 − 1) · D I quantization noise( e) = −D/2 ... D/2 → Best signal-to-noise ratio (power)

SNR = E[x2]/E[e2] = (2B )2

.. or, in dB, 20 · log10 2 · B ≈ 6 · B dB

0 Quantized at 7 bits -20

-40 level / dB

-60

-80 0 1000 2000 3000 4000 5000 6000 7000 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 7 / 37 Redundant information

Redundancy removal is lossless Signal correlation implies redundant information

I e.g. if x[n] = x[n − 1] + v[n] x[n] has a greater amplitude range → uses more bits than v[n]

I sending v[n] = x[n] − x[n − 1] can reduce amplitude, hence bitrate

x[n] - x[n-1]

I but: ‘white noise’ sequence has no redundancy ... Problem: separating unique& redundant

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 8 / 37 Optimal coding

Shannon information: An unlikely occurrence is more ‘informative’ p(A) = 0.5 p(B) = 0.5 p(A) = 0.9 p(B) = 0.1 ABBBBAAABBABBABBABB AAAAABBAAAAAABAAAAB A, B equiprobable A is expected; B is ‘big news’

Information in bits I = − log2(probability) I clearly works when all possibilities equiprobable Optimal bitrate → av.token length = entropy H = E[I ]

I .. equal-length tokens are equally likely How to achieve this?

I transform signal to have uniform pdf I nonuniform quantization for equiprobable tokens I variable-length tokens → Huffman coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 9 / 37 Quantization for optimum bitrate Quantization should reflect pdf of signal:

p(x = x0) p(x < x0) x' 1.0

0.8

0.6

0.4

0.2

0 -0.02 -0.015 -0.01 -0.005 0 0.005 0.01 0.015 0.02 0.025 x

0 I cumulative pdf p(x < x0) maps to uniform x

Or, codeword length per Shannon log2(p(x)): p(x) Shannon info / bits Codewords -0.02 111111111xx 111101xx -0.01 111100xx 101xx 100xx 0 0xx 110xx 0.01 1110xx 111110xx 1111110xx 0.02 111111100xx 111111101xx 111111110xx 0.03 0 2 4 6 8

I Huffman coding: tree-structured decoder

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 10 / 37

Quantize mutually dependent values in joint space: 3 x1 2

1

0

-1

-2

x2

-6 -4 -2 0 2 4 May help even if values are largely independent

I larger space x1,x2 is easier for Huffman

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 11 / 37 Compression & Representation

As always, success depends on representation Appropriate domain may be ‘naturally’ bandlimited

I e.g. vocal-tract-shape coefficients

I can reduce sampling rate without data loss In right domain, irrelevance is easier to ‘get at’

I e.g. STFT to separate magnitude and phase

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 12 / 37 Aside: Coding standards

Coding only useful if recipient knows the code! Standardization efforts are important Federal Standards: Low bit-rate :

I FS1015e: LPC-10 2.4 Kbps I FS1016: 4.8 Kbps CELP ITU G.x series (also H.x for )

I G.726 ADPCM

I G.729 Low CELP MPEG

I MPEG-Audio layers 1,2,3 () I MPEG 2 Advanced Audio (AAC)

I MPEG 4 Synthetic-Natural Hybrid Codec More recent ‘standards’

I proprietary: WMA, Skype... I ...

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 13 / 37 Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 14 / 37 Speech coding

Standard voice channel:

I analog: 4 kHz slot (∼ 40 dB SNR) I digital: 64 Kbps = 8 bit µ-law x 8 kHz How to ? Redundant

I signal assumed to be a single voice, not any possible waveform Irrelevant

I need code only enough for intelligibility, speaker identification (c/w analog channel) Specifically, source-filter decomposition

I vocal tract & f0 change slowly Applications:

I live communications

I offline storage

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 15 / 37 Channel (1940s-1960s) Basic source-filter decomposition

I filterbank breaks into spectral bands I transmit slowly-changing energy in each band

Encoder Decoder Bandpass Smoothed Downsample E filter 1 energy & encode 1 Bandpass filter 1

Output Input

Bandpass Smoothed Downsample E filter N energy & encode N Bandpass filter 1

Voicing V/UV analysis Pulse Noise Pitch generator source

I 10-20 bands, perceptually spaced Downsampling? Excitation?

I pitch / noise model I or: baseband + ‘flattening’...

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 16 / 37 LPC encoding

The classic source-filter model Encoder Filter coefficients {ai} Decoder |1/A(ej!)| Represent Input f s[n] & encode LPC Output analysis e^[n] s^[n] Represent Excitation All-pole Residual & encode generator filter e[n] 1 t H(z) = -i 1 - "aiz Compression gains:

I filter parameters are ∼slowly changing I excitation can be represented many ways

20 ms

Filter parameters Excitation/pitch parameters

5 ms

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 17 / 37 Encoding LPC filter parameters

For ‘communications quality’:

I 8 kHz sampling (4 kHz bandwidth) I ∼10th order LPC (up to 5 pole pairs)

I update every 20-30 ms → 300 - 500 param/s Representation & quantization

ai I {ai } - poor distribution, can’t interpolate -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 ki I reflection coefficients {ki }: guaranteed stable -2 -1.5 -1 -0.5 0 0.5 1 1.5 2 fLi I LSPs - lovely!

0 0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 0.45 0.5 Bit allocation (filter):

I GSM (13 kbps): 8 LARs x 3-6 bits / 20 ms = 1.8 Kbps I FS1016 (4.8 kbps): 10 LSPs x 3-4 bits / 30 ms = 1.1 Kbps

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 18 / 37 (LSPs) LSPs encode LPC filter by a set of frequencies Excellent for quantization & interpolation Definition: zeros of P(z) = A(z) + z−p−1 · A(z−1) Q(z) = A(z) − z−p−1 · A(z−1)

jω −1 −jω −1 I z = e → z = e → |A(z)| = |A(z )| on u.circ. I P(z), Q(z) have (interleaved) zeros when −p−1 −1 ∠{A(z)} = ±∠{z A(z )} Q −1 I reconstruct P(z), Q(z) = i (1 − ζi z ) etc. I A(z) = [P(z) + Q(z)]/2 A(z-1) = 0

Q(z) = 0 A(z) = 0 P(z) = 0

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 19 / 37 Encoding LPC excitation

Excitation already better than raw signal:

5000

0 Original signal -5000 100

0

-100 LPC residual 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 time / s

I save several bits/sample, but still > 32 Kbps Crude model: U/V flag + pitch period

I ∼ 7 bits / 5 ms = 1.4 Kbps → LPC10 @ 2.4 Kbps

Pitch period values 16 ms frame boundaries 100 50 0 -50 1.3 1.35 1.4 1.45 1.5 1.55 1.6 1.65 1.7 1.75 time / s Band-limit then re-extend (RELP)

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 20 / 37 Encoding excitation Something between full-quality residual (32 Kbps) and pitch parameters (1.4 kbps)? ‘Analysis by synthesis’ loop: Input LPC Filter coefficients s[n] analysis A(z) Excitation code

Pitch-cycle LPC filter Synthetic predictor – excitation + 1 + ^ ( ) c[n] b·z–NL x[n] A z control

‘Perceptual’ MSError weighting minimization A(z) ^ W(z)= x[n]*ha[n] - s[n] A(z/!) Excitation encoding ‘Perceptual’ weighting discounts peaks: 20 A(z) 1/A(z) W(z)= 10 A(z/!) 0 1/A(z/!)

gain / dB -10 -20 0 500 1000 1500 2000 2500 3000 3500 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 21 / 37 Multi-Pulse Excitation (MPE-LPC) Stylize excitation as N discrete pulses 15 10 original LPC 5 residual 0 -5 mi multi-pulse excitation 5 0 -5 0 20 40 60 80 100 120 ti time / samps

I encode as N × (ti , mi ) pairs Greedy algorithm places one pulse at a time: A(z) X (z)  E = − S(z) pcp A(z/γ) A(z) X (z) (z) = − A(z/γ) A(z/γ)

I R(z) is residual of target waveform after inverse-filtering I cross-correlate hγ and r ∗ hγ , iterate E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 22 / 37 CELP Represent excitation with codebook e.g. 512 sparse excitation vectors Codebook 000 001 002 003

004 005 006 007

Search index 008 009 010 011 excitation

012 013 014 015

0 60 time / samples

I linear search for minimum weighted error? FS1016 4.8 Kbps CELP (30ms frame = 144 bits): 10 LSPs 4x4 + 6x3 bits = 34 bits Pitch delay 4 x 7 bits = 28 bits Pitch gain 4 x 5 bits = 20 bits 138 bits Codebk index 4 x 9 bits = 36 bits I Codebk gain 4 x 5 bits = 20 bits E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 23 / 37 Aside: CELP for nonspeech? CELP is sometimes called a ‘hybrid’ coder: I originally based on source-filter voice model I CELP residual is waveform coding (no model)

Original (mrZebra-8k) 4

3 freq / kHz

2 CELP does not 1 0 5 kbps buzz-hiss break with multiple 4 voices etc. 3 2

1 I just does the 0 4.8 kbps CELP best it can 4

3

2

1

0 0 1 2 3 4 5 6 7 8 time / sec LPC filter models vocal tract; also matches auditory system? I i.e. the ‘source-filter’ separation is good for relevance as well as redundancy?

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 24 / 37 Outline

1 Information, Compression & Quantization

2 Speech coding

3 Wide-Bandwidth Audio Coding

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 25 / 37 Wide-Bandwidth Audio Coding

Goals:

I transparent coding i.e. no perceptible effect I general purpose - handles any signal Simple approaches (redundancy removal)

I Adaptive Differential PCM (ADPCM)

X[n] + D[n] Adaptive C[n] + quantizer – C[n] = Q[D[n]]

Predictor X'[n] + Dequantizer X [n] + p X [n] = F[X'[n-i]] D'[n] = Q-1[C[n]] p + D'[n]

I as prediction gets smarter, becomes LPC e.g. - lossless LPC encoding Larger compression gains needs irrelevance

I hide quantization noise with psychoacoustic masking

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 26 / 37 Noise shaping Plain Q-noise like added white noise

I actually, not all that disturbing I .. but worst-case for exploiting masking

1 mu-law quantization Have Q-noise scale with 0.8 signal level 0.6 mu(x) I i.e. quantizer step gets larger with amplitude 0.4

I minimum distortion for 0.2 x - mu(x) some center-heavy pdf 0 0 0.2 0.4 0.6 0.8 1 Or: put Q-noise around peaks in spectrum

I key to getting benefit of perceptual masking

20 Signal 0 Linear Q-noise -20 level / dB

-40

-60 0Transform Q-noise 1500 2000 2500 3000 3500 freq / Hz

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 27 / 37 Subband coding Idea: Quantize separately in separate bands

Encoder Decoder Bandpass Downsample Quantize Q-1[•] M f f M Q[•] Input Output Analysis (M channels) Reconstruction + filters filters

M Q[•] Q-1[•] M f

I Q-noise stays within band, gets masked ‘Critical sampling’ → 1/M of spectrum per band 0

-10 alias energy -20 gain / dB -30 -40 -50 0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 normalized freq 1

I some inevitable Trick is to cancel with alias of adjacent band → ‘quadrature-mirror’ filters

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 28 / 37 MPEG-Audio (layer I, II)

Basic idea: Subband coding plus psychoacoustic masking model to choose dynamic Q-levels in subbands Scale indices Scale normalize Quantize Input 32 band Format Bitstream polyphase & filterbank bitstream Data frame Scale Control & scales normalize Quantize

32 chans Psychoacoustic x 36 samples masking Per-band masking margins model 24 ms

I 22 kHz ÷ 32 equal bands = 690 Hz bandwidth I 8 / 24 ms frames = 12 / 36 subband samples

I fixed bitrates 32 - 256 Kbps/chan (1-6 bits/samp) I scale factors are like LPC envelope?

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 29 / 37 MPEG Psychoacoustic model

Based on simultaneous masking experiments Difficulties:

I noise energy masks ∼10 dB better than tones I masking level nonlinear in frequency & intensity I complex, dynamic sounds not well understood

60 Tonal 50 peaks Non-tonal Procedure 40 energy 30 Signal 20 SPL / dB spectrum 10 I pick ‘tonal peaks’ in NB FFT 0 -10 spectrum 1 3 5 7 9 11 13 15 17 19 21 23 25 60 Masking 50 spread I remaining energy → ‘noisy’ 40 30 20 SPL / dB peaks 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25 I apply nonlinear ‘spreading function’ 60 Resultant 50 masking 40 30 20 SPL / dB I sum all masking & threshold in 10 0 -10 power domain 1 3 5 7 9 11 13 15 17 19 21 23 25 freq / Bark

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 30 / 37 MPEG Bit allocation

Result of psychoacoustic model is maximum tolerable noise per subband Masking tone

level Masked threshold

SNR ~ 6·B Safe noise level

Quantization noise freq Subband N

I safe noise level → required SNR → bits B Bit allocation procedure (fixed ):

I pick channel with worst noise-masker ratio I improve its quantization by one step I repeat while more bits available for this frame Bands with no signal above masking curve can be skipped

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 31 / 37 MPEG Audio Layer III ‘Transform coder’ on top of ‘subband coder’

Digital Audio Subband Line Coded Signal (PCM) Distortion 31 575 (768 kbit/s) Control Loop 192 kbit/s Filterbank Huffman MDCT 32 Subbands Nonuniform Encoding 32 kbit/s 0 0 Quantization Rate Control Loop Window Switching

CRC-Check Coding of

Side- Bitstream Formatting information Psycho- FFT acoustic 1024 Points Model External Control Blocks of 36 subband time-domain samples become 18 pairs of frequency-domain samples

I more redundancy in spectral domain I finer control e.g. of aliasing, masking I scale factors now in band-blocks Fixed Huffman tables optimized for audio data Power-law quantizer

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 32 / 37 Adaptive time window Time window relies on temporal masking

I single quantization level over 8-24 ms window ‘Nightmare’ scenario:

Pre- distortion

I ‘backward masking’ saves in most cases Adaptive switching of time window:

transition short 1 normal 0.8 0.6 0.4

window level 0.2 0 0 20 40 60 80 100 time / ms

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 33 / 37 The effects of MP3

Before & after:

Josie - direct from CD

20

15 freq / kHz 10

5

0 After MP3 encode (160 kbps) and decode 20

15 freq / kHz 10

5

0 Residual (after aligning 1148 sample delay) 20

15 freq / kHz 10

5

0 0 2 4 6 8 10 time / sec

I chop off (above 16 kHz) I occasional other time-frequency ‘holes’ I quantization noise under signal

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 34 / 37 MP3 & Beyond

MP3 is ‘transparent’ at ∼ 128 Kbps for stereo (11x smaller than 1.4 Mbps CD rate)

I only decoder is standardized: better psychological models → better encoders MPEG2 AAC

I rebuild of MP3 without backwards compatibility I 30% better (stereo at 96 Kbps?)

I multichannel etc. MPEG4-Audio

I wide range of component encodings I MPEG Audio, LSPs, ... SAOL

I ‘synthetic’ component of MPEG-4 Audio I complete DSP/ music language! I how to encode into it?

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 35 / 37 Summary

For coding, every bit counts

I take care over quantization domain & effects I Shannon limits... Speech coding

I LPC modeling is old but good I CELP residual modeling can go beyond speech Wide-band coding

I noise shaping ‘hides’ quantization noise

I detailed psychoacoustic models are key

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 36 / 37 References

Ben Gold and Nelson Morgan. Speech and Audio : Processing and Perception of Speech and Music. Wiley, July 1999. ISBN 0471351547. D. Pan. A tutorial on MPEG/audio compression. IEEE Multimedia, 2(2):60–74, 1995. Karlheinz Brandenburg. MP3 and AAC explained. In Proc. AES 17th International Conference on High Quality Audio Coding, pages 1–12, 1999. T. Painter and A. Spanias. Perceptual coding of . Proc. IEEE, 80(4): 451–513, Apr. 2000.

E6820 (Ellis & Mandel) L7: Audio compression and coding March 31, 2009 37 / 37