What Will We Be Talking About? Perceptual Audio Coding: An Overview • Overview of Perceptual Audio Coding • Examples of Audio Coder Designs Marina Bosi • Sound Examples

T18 Tutorial Audio Compression 121st AES Convention October 7, 2006 1 © 2004-6 Marina Bosi-All rights reserved 2

Audio Coding

• In general, an audio coder (or ) is an apparatus Some Familiar Coders whose input is an audio signal and whose output is an audio signal which is perceptually identical (or at least •Portable Devices, MP3 files, AAC: MPEG Layer III, AAC very close) to the (somewhat delayed) input signal •DVDs: (AC-3) or DTS •Digital Radio (DAB): MPEG Layer II (MUSICAM), MPEG AAC •Digital Television (HDTV, DVB): Dolby Digital (AC-3), EncodeEncode [10010101]or [10010101] DecodeDecode MPEG Layer II, HE AAC •Electronic Distribution of Music (EMD): MPEG Layer III (MP3), AAC, WMA •3rd Generation Mobile (3GPP): MPEG HE AAC

© 2004-6 Marina Bosi-All rights reserved 3 © 2004-6 Marina Bosi-All rights reserved 4 Evolution of Data Rates for Good Sound Quality for Stereo Signals Two Key Ideas • 1992 256 kb/s MPEG Layer II • 1993 192 kb/s MPEG Layer III • In perceptual audio coding, two key ideas in the audio • 1994 128-192 kb/s MPEG MP3 signals representation are: – removal of Redundancy • 1995 384-448 kb/s per 5.1 signal AC-3 • 1997 96-128 kb/s MPEG-2 AAC – removal of Irrelevancy • 2000 64-96 kb/s MPEG-4 AAC • 2001 48-64 kb/s AAC+ (HE AAC) • 2004 24-48 AAC+ PS • 2006 64 kb/s per 5.1 signal MPEG Surround

© 2004-6 Marina Bosi-All rights reserved 5 © 2004-6 Marina Bosi-All rights reserved 6

Redundancy “Redundant adj 1. Exceeding what is necessary or normal. 2. Characterized by or containing an excess: specif more words than necessary….” [Websters Dictionary] Time to Frequency Mapping Stage

– In audio coding redundant means that the same information can be represented with fewer bits • Designed to provide a compact representation of the – For example, consider a sine wave signal: audio signals • Redundant: sample the waveform 44,100 times per second and describe each sample with 16 bits • Maximize the ability to separate frequency components • Concise: Describe the amplitude, frequency, phase, and duration • Minimize audibility of blocking artifacts A/2 A/2 • Critical sampling

f -f0 0 f • Perfect reconstruction • Time delay x(t) X(f) – Notice that the concise representation of the sine wave is basically • Computational complexity equivalent to the information in its Fourier Transform. – Since music and many other audio signals are very tonal, most coders work in the frequency domain to reduce redundancy © 2004-6 Marina Bosi-All rights reserved 7 © 2004-6 Marina Bosi-All rights reserved 8 Examples of Filter-Banks in Audio Irrelevancy •“Irrelevant adj 1. Not having significant and Coding demonstrable bearing on the matter at hand.” [Websters • PQMF Dictionary] – MPEG Layers I and II: 32-band, 511 PQMF • DCT – In audio coding irrelevant data means that you can’t hear any – OCF: 512 frequency lines; 576 impulse response (early version) difference in the audio signal if those data are omitted • MDCT/MDST – Main causes of irrelevancy: – AC-2A: 256/64 frequency lines; 512/128 impulse response • Hearing Threshold • MDCT • Masking – AC-3: 256/128 frequency lines; 512/256 impulse response – Hearing Threshold – MPEG AAC, PAC: 1024/128 frequency lines; 2048/256 impulse response • We can’t hear sounds below a certain frequency-dependant level • Hybrid – MPEG Layer III : 576/192 frequency lines; 1664/896 impulse response – Masking – ATRAC : 512/64 frequency lines; 1072/304 impulse response • Loud sounds can prevent us from hearing softer sounds nearby in time or frequency • (EPAC) – Tree structure with higher frequency resolution at low frequencies and – Exploiting irrelevancy higher temporal resolution at high frequencies, utilized during transients • Don’t code signal components you can’t hear only • Only quantize audible signal components with enough bits to keep • Int MDCT (Lossless Coding) quantization noise below the level it can be heard – MPEG-4 SLS : same as AAC with 4x over sampling also enabled

© 2004-6 Marina Bosi-All rights reserved 9 © 2004-6 Marina Bosi-All rights reserved 10

Masking

Hearing Threshold Masker dB

Frequency Hearing (or Simultaneous) Threshold Masking Threshold Masking

Frequency

Masked Signals

dB Masker On

Temporal Post-Masking Masking Pre-Masking Simultaneous Masking

From [Zwicker & Fastl 90] Time ~20ms ~200ms ~150ms

© 2004-6 Marina Bosi-All rights reserved 11 © 2004-6 Marina Bosi-All rights reserved 12 Quantization

Masked Threshold – Quantization is the representation of a continuous signal amplitude (time or frequency sample) with a finite number of • The hearing threshold can be combined with the effects bits of masking from the signal to create the Masked – Quantization is a lossy process and is the main source of Threshold signal degradation in a digital audio coder • The Masked Threshold represents the level below – Each additional bit buys you about 6 dB more of signal to which noise added to the signal should be inaudible noise • uniform quantization: count down from overload level of quantizer 100 Signal M asked Threshold 80 • floating point quantization (linearized A-law): count down from

60 nearest 6 dB point above signal level

SPL 40 – If you know where the Masked Threshold is, you know how 20 many bits are needed to get quantization noise 0 0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 – Psychoacoustic-based bit allocation is the secret to Perceptual f (Hz) Audio Coders!

© 2004-6 Marina Bosi-All rights reserved 13 © 2004-6 Marina Bosi-All rights reserved 14

Bit/Noise Allocation Using Masked Demo: 13 dB Miracle Threshold • The “13 dB miracle” paradox (Johnston and Brandenburg ‘90), where the original signal was injected with noise that was either Audible – a) shaped according to psychoacoustic masking models Signal – b) white dB SMR Few Bit SNR (Audible Noise) shows that two systems with identical SNR = 13 dB have very Many Bit SNR (Inaudible Noise) different perceived audio quality • In case a) the quantization noise is shaped so that it is contained below masked thresholds

• In case b) the quantization noise is shaped so that it is uniformly Frequency distributed in frequency (in general above masked thresholds)

© 2004-6 Marina Bosi-All rights reserved 15 © 2004-6 Marina Bosi-All rights reserved 16 Rate Constraints Basic Building Blocks for a Perceptual

• The psychoacoustic model provides the SMR values that are Audio Coder

needed to achieve transparency (at least according to the Audio Encoded Bitstream PCM Time to Bitstream psychoacoustic model) – implying a corresponding bit allocation to Frequency Quantization transparently encode the signal Mapping Formatting Encode: • However, data rate constraints often limit the “allowed” of Psychoacoustic the encoded signal below that needed to achieve transparency so Model Ancillary Data methods are needed to allocate bits subject to both a data rate constraint and the calculated SMR values • Bit allocation algorithms allocate a greater number of bits to Quantized Reconstructed Subband Sub-band spectral regions with higher than average SMR values at the cost Encoded Data and Data of lower allocations to lower-than-average SMR regions Bitstream Scale Factors Frequency Frequency Decoded Bitstream Decode: Sample to Time PCM Unpacking B−1 Reconstruction Mapping Audio 1bit ⎛ 1 ⎞ Rb ≈ R + ⎜ SMRb − K ∑ Nc SMRc ⎟ 6.02 dB c=0 Ancillary ⎝ ⎠ Data

© 2004-6 Marina Bosi-All rights reserved 17 © 2004-6 Marina Bosi-All rights reserved 18

New Trends… Lossless Coding

• Pure lossless coding allows for average compression • Lossless Coding ratios of about 2:1 which are much lower than • Parametric Coding compression ratios achieved in perceptual coding (10- – Full Synthesis 30) – Spectral Band Replication • MPEG-4 lossless is based on the following technology : – Spatial Audio Coding (Stereo/Multichannel) – ADPCM and noisless coding • Scalable Coding – Int MDCT and noisless coding • Each of these are examples of ways to increase the – 1-bit lossless based on LPC and entropy coding coding efficiency of the system and/or to better encode (SACD) the signal to specific target application requirements

© 2004-6 Marina Bosi-All rights reserved 19 © 2004-6 Marina Bosi-All rights reserved 20 Spectral Band Replication (SBR) Stereo/Multichannel Coding

• Exploit correlations/spatial irrelevancies between • Only the low part of the signal spectrum is stereo/multichannel signals waveform coded Energy Signal Energy • Two common approaches: Masked Threshold – M/S coding – Intensity coding • The high frequency components of the signal Quantization Noise • M/S coding Frequency are reconstructed from the low frequency Transpose – Change basis from L,R channels to sum (M) and difference (S) components of the signal through a small channels • Intensity coding amount of side information – Approximate signal with a mono signal plus a phase angle to define how signal splits between L,R channels • Parametric stereo coding • Compression efficiency can be significantly SBR Reconstruct – The stereo signal is coded as a monaural signal plus a small amount improved by using SBR (mp3PRO, MPEG-4 of stereo parameters HE AAC) • Similar matrix basis changes can be applied to multichannel coding • Similar principles applied in Enhanced AC-3

© 2004-6 Marina Bosi-All rights reserved 21 © 2004-6 Marina Bosi-All rights reserved 22

Scalable Audio Coding Examples of Audio Coder Designs • Embed lower bandwidth bitstream in higher bandwidth bitstream • MPEG-1/2 Layers I, II, and III • Key functionality for MPEG-4 audio • MPEG-2/4 AAC • Dolby AC-3 • Main types of scalability: – Small step scalability Enhancement layers of ~ 1 k/s (BSAC) – Large step scalability Enhancement layers of 8 k/s and more – General audio coding in MPEG-4 supports scalability

© 2004-6 Marina Bosi-All rights reserved 23 © 2004-6 Marina Bosi-All rights reserved 24 Layers I and II (Single Channel Mode) Layer III (Single Channel Mode)

0 0 0 Non-Uniform . . . . Midtread Midtread Quantizer Filter Bank . . Filter Bank . . Quantizer MDCT Rate/Distortion (32 Sub-Bands) . . Coded (32 Sub-Bands) . . Coded Loop 31 Audio 31 575 Audio Bitstream Data Bitstream Data Formatting Formatting

DTF Coding of Side DTF Coding of Side Psychoacoustic Psychoacoustic 512/1024 Hann Information 2 * 1024 Hann Information Model Model Window Window

© 2004-6 Marina Bosi-All rights reserved 25 © 2004-6 Marina Bosi-All rights reserved 26

Input time signal

Bitstreams for MPEG-1 Audio Layers Pre- Processing Perceptual Model Legend Filter Data Bank Control

TNS Header CRC Bit Allocation Scale Factors LAYER I Samples Ancillary Data (32) (0,16) (128-256) (0-384) Intensity/ Coupling

Quantized Spectrum Prediction 13818-7 Header CRC Bit Allocation SCFSI Scale Factors of Bitstream Coded Audio LAYER II Samples Ancillary Data Previous Formatter Stream (32) (0,16) (26-188) (0-120) (0-1080) Frame M/S Iteration Loops

Header CRC Side Information Scale LAYER III Main Data (May start at a previous frame) Factors (32) (0,16) (130-246) AAC Encoder

Rate/Distortion Quantizer Configuration Control Process

Noiseless Coding

© 2004-6 Marina Bosi-All rights reserved 27 Block Diagram of the MPEG-4 GA Encoder Input time signal Legend: dat a control

Pre- MPEG-4 Audio Processing Perceptual Model Filter Bank

TNS

LTP

Intensity / Coupling

Prediction Bitstream Coded Audio Formatter Bitstream

PNS

M/S

BSAC

AAC

TwinVQ

Quantization and (from ISO/IEC 14496-3) Coding Choices © 2004-6 Marina Bosi-All rights reserved 29 © 2004-6 Marina Bosi-All rights reserved 30

AC-3 Encoder Flow Diagram AC-3 Bitstream Overview

Input PCM

blksw flags Transient Detect

Forward Transform

cplg strat Coupling Strategy 1536 PCM samples

Form Coupling Channel

rematflgs Rematrixing

CRC AUDIO AUDIO AUDIO AUDIO AUDIO AUDIO AUX CRC Extract Exponents SYNC SI BSI #1 BLOCK 0 BLOCK 1 BLOCK 2 BLOCK 3 BLOCK 4 BLOCK 5 DATA #2

expstrats Exponent Strategy

dithflgs Dither Strategy

Encoded Spectral Envelope Encode Exponents

Mantissas •The AC-3 bitstream is composed of independent frames Normalize Mantissas

bitalloc params baps Quantize Core Bit Allocation Mantissas •Each frame represents a fixed amount of time, equal to 1536 PCM

Pack AC-3 Frame Side Information Main Information samples: Output Frame 1 frame = 32 ms for 48 kHz sample rate © 2004-6 Marina Bosi-All rights reserved 31 © 2004-6 Marina Bosi-All rights reserved 32 Comparison of AC-3, MPEG-2 AAC, MPEG LII and LIII

Sound Examples

(from CRC AES publication) © 2004-6 Marina Bosi-All rights reserved 33 34

To Learn More:

• M. Bosi and R. E. Goldberg, “ Introduction to Digital Audio Coding and Standards ”, Kluwer /Springer 2003 • “Collected Papers on Digital Audio Bit-Rate Reduction” Neil Gilchrist and Christer Grewin, Editors, Audio Engineering Society 1996 • E. Zwicker and H. Fastl, “Psychoacoustics”, Springer-Verlag 1990 • Proceedings of the AES 17th International Conference on “High- Quality Audio Coding”, K. Brandenburg and M. Bosi Co-chairs, Florence September 1999 • AES CD-ROM On Perceptual Audio Coders 2001: "Perceptual Audio Coders: What to Listen For", AES 2001.

© 2004-6 Marina Bosi-All rights reserved 35