EECS 225D Audio Signal Processing in Humans and Machines

Lecture 16 – Perceptual Audio Coding

2012-3-14 Professor Nelson Morgan today’s lecture by John Lazzaro

www.icsi.berkeley.edu/eecs225d/spr12/ “Hero” sound Play EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB Today’s lecture: Audio Coding

Compression: Lossless and Lossy

Quantization and Noise

Psychoacoustic Masking

Time-Frequency Tradeoffs

Research Topics

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB OS X System Sound: Hero.aiff

1 second of 44.1 kHz, 16-bit, stereo audio 186, 450 byte disk file, 1.4112 Mb/s b = bit, B = 8 bit bytes

Play

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB How well does work on audio files? File size reduced by a factor of 1.56 (measured in units of 4KB disk blocks)

“Lossless” Lossless algorithms remove compression. redundant bits -- bits that are not (decompression needed to exactly reconstruct is bit-accurate). the original file. “Shorten”: Redundancy removal can be Tony Robinson, improved if the algorithm can be Cambridge, 1992. specialized for audio waveforms.

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB (after shorten, FLAC, ...)

File size reduced

by a factor of 3 . 1 (double the performance of gzip on the same file) Lossless, just like gzip. To reduce file size by larger factors, we need to go beyond removing redundancy. One approach: Remove information that is irrelevant information for a particular use case. Example: Remove audio information whose loss a human listener cannot perceive.

EECS 225D L16: Perceptual Audio Coding Lossy perceptual audio coding. UC Regents Spring 2012 © UCB MPEG 4 Advanced (AAC)

Request 128 kb/s Encoder adjusts quality to meet request File size reduced by a factor of 9.4 Lossy: Decompression does not restore original file. Today’s Lecture How it works Listening Test: Original: Play 128 kb/s (9.4X): Play 16 kb/s (23.5X): Play

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB Today’s lecture: Audio Coding

Compression: Lossless and Lossy

Quantization and Noise

Psychoacoustic Masking

Time-Frequency Tradeoffs

Research Topics

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB Quantization, noise, and compression ... To a real-valued discrete-time waveform, quantize the samples to reduce bits/sample, and then apply .

s(t) Play s(t) + e(t) Play

Quantization corrupts the signal s(t) with noise term e(t). In this example, quantizing to 1 bit is clearly objectionable. However, a 40 dB reduction in e(t) yield a better result. Quantizing with more bits s(t) + 0.01*e(t) Play acts to reduce e(t). s(t) + 0.1*e(t) Play

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB 540 CHAPTER 35 PERCEPTUAL AUDIO CODING 540 CHAPTERWhich 35 PERCEPTUAL leads AUDIO CODING to this architecture ... Encoder Bandpass Downsample Quantize Subband 1 quantized Filter bank splits Encoder f M Q[•] samples Bandpass Downsample Quantize Subband 1 Input quantized audio input into Analysisf M (M subbands)Q[•] filters samples M sub-bands. Input Analysis (M subbands) Subband M filters MQ[•]quantized f samples Quantize to Subband M quantized f MQ[•] minimize the Psychoacoustic samples model Minimum masking number of bits Psychoacoustic thresholds needed across model Minimum masking thresholds Decoder all M channels. Subband 1 quantized Q-1[•] M samplesDecoder f Subband 1 Constraint: quantized Q-1[•] M Output Dequantize Upsample Reconstructionf + Human samples filters Output imperceptibility Subband M Dequantize Upsample Reconstruction + quantized Q-1[•] M filters samples f of the encode -> Subband M quantized Q-1[•] M decode process. FIGURE 35.7 Structure of a subband coder andf corresponding decoder. The input signal samplesis dividedEECS into 225DM bandlimited L16: Perceptual Audio signals, Coding each accounting for 1/Mth of the total bandwidth,UC Regents Spring 2012 © UCB FIGUREwhich are 35.7 thenStructure downsampled of a subband by a factor coder of andM (critically corresponding sampled), decoder. quantized The input based signal on the is dividedpsychoacoustic into M bandlimited model’s estimate signals, of each the tolerable accounting quantization for 1/Mth of noise the totalat that bandwidth, moment in that whichband, are and then transmitted. downsampled At the by decoding a factor of end,M (critically the dequantized sampled), samples quantized are interpolated based on the up psychoacousticto full sampling model’s rate, again estimate bandpass-filtered of the tolerable to quantization recover the appropriate noise at that frequencies, moment in that then band,summed and transmitted. to reconstruct At the the decoding full-bandwidth end, signal.the dequantized samples are interpolated up to full sampling rate, again bandpass-filtered to recover the appropriate frequencies, then summed to reconstruct the full-bandwidth signal. full band, and the effort to approach such ideal “brick-wall” filters would result in time- domain properties that were increasingly problematic – long ringing that would compromise fullthe band, time and localization the effort we to approach wish to achieve.such ideal Practical “brick-wall” filters filters with would finite-duration result in impulse time- domainresponses properties will exhibit that were a finite increasingly transition problematic region between – long the ringing passband that would and stopband, compromise and if thethe time passbands localization are set we up wish to fully to achieve. capture all Practical the original filters signal, with finite-durationthis transition region impulse will responsesend up straying will exhibit beyond a finite the transition1/Mth of the region spectrum between associated the passband with the and particular stopband, subband, and if theas passbands illustrated are in Figureset up to 35.8. fully Decimation, capture all however, the original will signal, alias (fold) this transition any signal region components will endoutside up straying the principal beyond subband the 1/Mth backof the into spectrum that band, associated leading withto a particularly the particular nasty subband, kind of asdistortion. illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components outside theThis principal problem subband is solved back through into alias that band, cancellation leading [12]. to a Although particularly a single nasty decimated kind of distortion.subband will introduce alias terms resulting from energy just beyond the ideal edge of theThis band, problem the corresponding is solved through imperfections alias cancellation in the reconstruction [12]. Although filters a single in the decoderdecimated will subband will introduce alias terms resulting from energy just beyond the ideal edge of the band, the corresponding imperfections in the reconstruction filters in the decoder will 540 CHAPTER 35 PERCEPTUAL AUDIO CODING

Encoder Bandpass Downsample Quantize Subband 1 quantized f M Q[•] samples Input Analysis (M subbands) filters

Subband M MQ[•]quantized f samples

540 CHAPTER 35 PERCEPTUAL AUDIO CODING Psychoacoustic model Minimum masking Quantization noise in a sub-band ... thresholds 538 CHAPTER 35 PERCEPTUAL AUDIO CODING Encoder Decoder Bandpass Downsample QuantizeSubband 1 Subband 1 quantized quantizedQ-1[•] M f M samplesQ[•] samples f Input Output Analysis (M subbands) DequantizeA tone Upsample in the sub-band,Reconstruction + filters filters Masking tone scaled to fully use B Subband M Subband M MQ[•]quantized quantizedQ-1quantize[•] M bits. f samples samples f The noise floor level FIGURE 35.7 StructureMasked of a subband coder and corresponding decoder. The input signal is 6*B PsychoacousticdB is divided into M bandlimitedthreshold signals, each accounting for 1/Mth of the total bandwidth, model Minimumwhich masking are then downsampled by a factor of M (critically sampled), quantized based on the thresholds below the tone. SNR ~ 6·B psychoacoustic model’s estimate of the tolerable quantization noise at that moment in that band, and transmitted. At the decodingSafe noise end, the level dequantized samples are interpolated up Decoder to full sampling rate, again bandpass-filtered to recover the appropriate frequencies, then Subband 1 summed to reconstruct the full-bandwidth signal. quantized Q-1[•] M freq samples Quantization noise f Output Dequantize Upsample ReconstructionSubband + N filtersfull band, and the effort to approach such ideal “brick-wall” filters would result in time- FIGURE 35.6 Quantizationdomain properties noise that were within increasingly one problematic subband – long should ringing that be would kept compromise below the estimated Subband M (Approximate result.the See time localizationthe book we for wish the to fine achieve. print) Practical filters with finite-duration impulse quantized Q-1[•] M masked threshold. responsesf will exhibit a finite transition region between the passband and stopband, and if samples EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB the passbands are set up to fully capture all the original signal, this transition region will FIGURE 35.7 Structure of a subband coder andend corresponding up straying decoder. beyond the The1 input/Mth signalof the spectrum associated with the particular subband, is divided into M bandlimited signals, each accounting for 1/Mth of the total bandwidth, as illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components which are then downsampled by a factor of M (critically sampled), quantized based on the psychoacousticpane model’s estimate of Figure of the tolerable 35.4. quantizationoutside the principal noise at that subband moment back in into that that band, leading to a particularly nasty kind of band, and transmitted. AtThe the decoding MPEG end, the specification dequantizeddistortion. samples includes are interpolated a up second, more complex psychoacoustic model to full sampling rate, again bandpass-filtered to recoverThis the appropriate problem is frequencies, solved through then alias cancellation [12]. Although a single decimated summed to reconstruct(model the full-bandwidth 2) that avoids signal. subband the explicit will introduce detection alias terms resulting of masking from energy peaks just beyond or their the ideal explicit edge of classification the band, the corresponding imperfections in the reconstruction filters in the decoder will as “tonal” or “noise”: instead, each spectral bin has an associated “tonality” index between full band, and thezero effort and to approach one that such ideal reflects “brick-wall” how filters accurately would result it in matchestime- an extrapolation (in both magnitude and domain properties that were increasingly problematic – long ringing that would compromise the time localizationphase) we wish from to achieve. the preceding Practical filters two with time finite-duration windows: impulse sustained sinusoids, which are quite common responses will exhibitin music a finite transition audio, region will between tend the to passband be well and stopband, predicted and if by extrapolation. The tonality index is then the passbands are set up to fully capture all the original signal, this transition region will end up strayingused beyond the as1/ anMth interpolationof the spectrum associated between with the particular masking subband, functions fit to tone and noise data. However, as illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components outside the principalin practice subband back the into that difference band, leading between to a particularly the nasty psychoacoustic kind of models is a matter of degree and distortion. precision: their overall behavior is similar. This problem is solved through alias cancellation [12]. Although a single decimated subband will introduce alias terms resulting from energy just beyond the ideal edge of the band, the corresponding imperfections in the reconstruction filters in the decoder will 35.3 NOISE SHAPING

Given the predictions of the masking thresholds at different frequencies that are synchro- nized to the signal, we now have the possibility of maximizing the amount of quantization noise that can be added to the signal while preserving perceptual transparency – provided we can control where the noise falls. This problem, of manipulating the distribution of quantization noise in time and frequency, is called “noise shaping” and comes in many guises. For the kinds of coding considered in this chapter, the goal is to be able to indepen- dently vary the quantization noise at each point in the time-frequency plane, corresponding to the masking surface calculated by the psychoacoustic model. In practice, this is achieved by dividing the signal up on a time-frequency grid, then choosing different quantization levels within each cell of the grid according to the local capacity to tolerate quantization noise, as dictated by the local masking threshold. This idea is illustrated in Figure 35.6: The full audio spectrum is assumed divided into a number of subbands, each broad enough to encompass some variation in the masked threshold. Linear quantization of the subband time-domain samples has the effect of adding a small error offset to each sample, and this is well modeled as independent, random, additive noise with a fixed distribution. Such a white noise sequence has a flat spectrum (equal energy at all frequencies, on average) whose energy is proportional to the amplitude 540 CHAPTER 35 PERCEPTUAL AUDIO CODING If a B is too small, noise may be audible

Encoder Bandpass Downsample Quantize Subband 1 quantized f M Q[•] samples Input Analysis (M subbands) filters

Subband M MQ[•]quantized f samples

Psychoacoustic model Minimum masking thresholds

EncoderDecoder includes a model of human perception. Subband 1 quantized CandidateQ-1[•] setsM of M quantizations are tested f samples against the model to check imperceptibility. Output Dequantize Upsample Reconstruction + EECS 225D L16: Perceptual Audio Coding filters UC Regents Spring 2012 © UCB

Subband M quantized Q-1[•] M samples f FIGURE 35.7 Structure of a subband coder and corresponding decoder. The input signal is divided into M bandlimited signals, each accounting for 1/Mth of the total bandwidth, which are then downsampled by a factor of M (critically sampled), quantized based on the psychoacoustic model’s estimate of the tolerable quantization noise at that moment in that band, and transmitted. At the decoding end, the dequantized samples are interpolated up to full sampling rate, again bandpass-filtered to recover the appropriate frequencies, then summed to reconstruct the full-bandwidth signal.

full band, and the effort to approach such ideal “brick-wall” filters would result in time- domain properties that were increasingly problematic – long ringing that would compromise the time localization we wish to achieve. Practical filters with finite-duration impulse responses will exhibit a finite transition region between the passband and stopband, and if the passbands are set up to fully capture all the original signal, this transition region will end up straying beyond the 1/Mth of the spectrum associated with the particular subband, as illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components outside the principal subband back into that band, leading to a particularly nasty kind of distortion. This problem is solved through alias cancellation [12]. Although a single decimated subband will introduce alias terms resulting from energy just beyond the ideal edge of the band, the corresponding imperfections in the reconstruction filters in the decoder will Today’s lecture: Audio Coding

Compression: Lossless and Lossy

Quantization and Noise

Psychoacoustic Masking

Time-Frequency Tradeoffs

Research Topics

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB The absolute threshold of hearing ...

96 dB Assume volume is turned up loud, so that the loudest part of the file is 96 dB/SPLPERCEPTUAL. MASKING 533 Quantization whose noise falls in an inaudible region masker always meets the absolute imperceptibility constraint. threshold --- Inaudible --- masked threshold Intensity / dB / Intensity -5 dB --- Inaudible --- 100 Hz 10 kHz masked tone (log) freq We are most sensitive FIGUREto 35.1 soundsTone-on-tone around 3 kHz simultaneous. masking. A tonal (sinusoidal) component will EECS“mask” 225D L16: Perceptual the Audio perception Coding of weaker tones nearby inUC Regents frequency, Spring 2012 © UCB effectively elevating the threshold of audibility in that region of the spectrum (after [14]).

quality. Then we will look at how we can in practice control the occurrence of noise in time and frequency, i.e., “noise shaping”. Finally, we look at a number of ancilliary issues in the design of coders, and compare the details of several widely-used compression standards including MP3 and AAC. An excellent and more detailed account of perceptual audio coding is provided by Painter & Spanias [6]. A more detailed description specific to MPEG-1 Audio layer 3 (MP3) is given by Pan [7].

35.2 PERCEPTUAL MASKING

35.2.1 Psychoacoustic phenomena Given the perceptual coder’s goal of introducing distortion only in parts of the signal where it will be unnoticed by a human listener, we must start with a more detailed look at what can and cannot be perceived by the ear. The essential idea of psychoacoustic masking is illustrated schematically in Figure 35.1, which shows the threshold of detection by a human listener of a sinusoidal tone at different frequencies in the presence of a fixed “masker” sinusoid. The vertical scale shows the energy of tones at different frequencies, and the lower curve shows the “absolute threshold”, i.e., for each frequency, the minimum energy of a tone that can be detected in quiet; tones below this curve are simply not perceived by a listener. If, however, instead of quiet, the experiment is conducted in the presence of a clearly-audible masker tone, test tones with frequencies close to the masker need to be considerably more intense before they are detected – the masker tone has modified the limits of detectibility resulting in the upper, “masked threshold” curve. The range of frequencies around the masker tone that exhibit elevated thresholds is known as the “critical band”, introduced in Chapter 15; many hearing phenomena vary on this scale, and it appears to reflect a deep aspect of the ear’s mechanics and neurophyisiology. Notice that the elevated threshold is always somewhat below the level of the masker (in practice, 5–20 dB, depending on frequency and the temporal properties of the masker and masked signals), and that the masking is asymmetric in frequency, extending significantly Tonal masking ....

Quantization noise will be imperceptible if it falls in the inaudible skirt surrounding a tonal signal.

-- Inaudible --

-- Inaudible -- Critical band filter shape EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB Maskers compose using max() function PERCEPTUAL MASKING Given a short segment of wide-band audio, we can 533 identify narrow-band maskers and compute a composite masking function for the audio signal.

absolute masker threshold

-- Inaudible --

-Inaudible- masked Inaudible threshold Intensity / dB / Intensity -- Inaudible -- 100 Hz 10 kHz masked tone (log) freq Effectively, tonal maskers locally raise the absolute threshold. FIGURE 35.1 Tone-on-tone simultaneous masking. A tonal (sinusoidal) component will EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB “mask” the perception of weaker tones nearby in frequency, effectively elevating the threshold of audibility in that region of the spectrum (after [14]).

quality. Then we will look at how we can in practice control the occurrence of noise in time and frequency, i.e., “noise shaping”. Finally, we look at a number of ancilliary issues in the design of coders, and compare the details of several widely-used compression standards including MP3 and AAC. An excellent and more detailed account of perceptual audio coding is provided by Painter & Spanias [6]. A more detailed description specific to MPEG-1 Audio layer 3 (MP3) is given by Pan [7].

35.2 PERCEPTUAL MASKING

35.2.1 Psychoacoustic phenomena Given the perceptual coder’s goal of introducing distortion only in parts of the signal where it will be unnoticed by a human listener, we must start with a more detailed look at what can and cannot be perceived by the ear. The essential idea of psychoacoustic masking is illustrated schematically in Figure 35.1, which shows the threshold of detection by a human listener of a sinusoidal tone at different frequencies in the presence of a fixed “masker” sinusoid. The vertical scale shows the energy of tones at different frequencies, and the lower curve shows the “absolute threshold”, i.e., for each frequency, the minimum energy of a tone that can be detected in quiet; tones below this curve are simply not perceived by a listener. If, however, instead of quiet, the experiment is conducted in the presence of a clearly-audible masker tone, test tones with frequencies close to the masker need to be considerably more intense before they are detected – the masker tone has modified the limits of detectibility resulting in the upper, “masked threshold” curve. The range of frequencies around the masker tone that exhibit elevated thresholds is known as the “critical band”, introduced in Chapter 15; many hearing phenomena vary on this scale, and it appears to reflect a deep aspect of the ear’s mechanics and neurophyisiology. Notice that the elevated threshold is always somewhat below the level of the masker (in practice, 5–20 dB, depending on frequency and the temporal properties of the masker and masked signals), and that the masking is asymmetric in frequency, extending significantly 536 CHAPTER 35 PERCEPTUAL AUDIO CODING

Computing a mask. (analysis of a 26 ms audio “frame”) 60 60 x Tonal x 50 peaks Non-tonal SPL / dB / SPL peaks Non-tonal

SPL / dB / SPL x [1] Identify tonal (x) 40 x o o x energy 30 o x x energyo o 30 o o x xx o o o and non-tonal (o) 20 Signal 20 spectrum energy peaks. 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25

60 x Masking [2] Place a local 50 Masking

SPL / dB / SPL spread

SPL / dB / SPL 40 spread masking function 40 o 30 for each peak. 20 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25

[3] Apply max() 60 Resultant 50 Resultant over frequency dB / SPL masking SPL / dB / SPL 40 masking to compute the 30 20 composite masker. 10 0 -10 1 3 5 7 9 11 13 15 17 19 21 23 25 freq / Bark FIGURE 35.4 Development of the estimated psychoacoustic masking threshold for a single frame of audio. Crosses indicate tonal peaks, and circles are non-tonal maskers. The dashed line shows the assumed absolute hearing threshold, the lower bound on all masking effects. (Figures generated with [8]).

quality and efficiency of their psychoacoustic models, since the precise allocation of bit resources to different time and frequency locations is not a rigid part of the standard, but is left for individual implementations to decide. Figure 35.4 illustrates the results of the processing stages in a typical psychoacoustic model, in this case the low-complexity model (model 1) published as an example in the original MPEG Audio specification [5] and implemented in Matlab by Petitcolas [8]. The process starts with a short segment of the original audio file spanning the time for which the masking is to be estimated, usually 1024 samples or 23.2 ms at 44.1 kHz. This is windowed then converted to a spectrum with the Fourier transform. The solid line in the top panel of Figure 35.4 shows this spectrum, plotted on the Bark frequency axis (introduced in Chapter 15) which approximates the critical-band frequency resolution of the human ear. The dashed line shows the inferred absolute threshold on the same axis, i.e., the lower limit of energy that needs to be encoded at all1. The next step is to identify prominent

1 1This threshold will depend on the absolute level at which the signal is being played, but the curve is positioned PERCEPTUAL MASKING 537

100 Masking functions 100

80 80 Masking function 60 widens with masker 60

level, following SPL / dB 40 cochlear filter 40 20 response shapes. 20

0 5 6 7 8 9 10 11 12 13 14 15 freq / Bark FIGURE 35.5 Modeled masking thresholds as a function of masker level for a tone at 10 Bark 1.5 kHz (after [6]). Masking function ⇡ widen at higher masking components and to classify them as “tonal” or “non-tonal” so that the appropriate masking properties may be applied. Tonal components are identified by looking for peaks channels, following in the spectrum that are significantly larger (e.g., 7 dB) than their immediate neighborhood, critical bandwidth. where the neighborhood gets proportionally wider at higher frequencies to account for the broadening auditory filters. Energy not picked up in the search for tonal peaks is summed The Bark scale within each critical band and represented as a set of discrete “non-tonal” components, one for each critical band. The overall set of components is pruned to keep only the strongest warping handles tonal peaks within a 1 Bark window, and to ignore any components falling below the this effect. absolute (quiet) threshold. These individual tonal and non-tonal components are shown as points at the corresponding energies and center frequencies in the top pane. The next stage is to calculate the effective masking thresholds for each of these components, shown in the second panel. Each component results in a masking skirt that spreads to adjacent frequencies. The skirt is approximated by a piecewise linear function of the Bark-scale frequency; Figure 35.5 shows the masking curves for a tonal component at 10 Bark at a number of levels. Notice the “upward spread of masking” asymmetry, and that the upward slope becomes shallower as the masker becomes more intense. The middle pane of Figure 35.4 shows these curves for all the maskers identified in the short frame of music; you can see that the non-tonal maskers result in a higher level of masking relative to their energy when compared to the tonal components. Finally, the individual masking thresholds are combined with the threshold in quiet by summation in the power domain, giving the overall masking threshold as a function of frequency as shown in the lowest

based on the conservative assumption that the maximum amplitude possible within the soundfile encoding would correspond to 96 dB SPL – corresponding to listening to the music with the volume turned “way up”. What is “lost”?

Input spectrum a mono audio frame.

Spectrum of encoded audio (64 kb/s).

Masking profile that guided the quantization. 548 WhatCHAPTER is 35“lost”? PERCEPTUAL10 AUDIO seconds CODING of pop music content encoded using MP3 @ 128 kb/s.

20

freq / kHz / freq 15

Original 10 -21.0 dB RMS 5

0 20 40

15 20 MP3 @ 128 kbps 10 0 -21.3 dB RMS 5 -20

0 -40 20 level / dB

15

Residual 10 -39.4 dB RMS 5

0 0 2 4 6 8 10 time / s FIGURENoise 35.13 is onlyExample 10 to of psychoacoustic20 dB below coding. signal Top ... pane: but original carefully music file,placed! including drums, guitar, and bell tree. Middle pane: reconstruction after coding in MP3 at 128 kbps. Bottom: residual difference after aligning coding delay of 2257 samples.

35.5 SUMMARY

In this chapter we have seen how specific properties of hearing – most importantly the way that a strong tone will mask the perception of other energy nearby in time and frequency – can be exploited to achieve large data rate reductions for audio without perceptible degra- dation. In fact, audio coded at one or two bits per sample actually has a high level of background noise: As illustrated in Figure 35.13, the residual noise that has effectively been added through coding is only 10-20 dB below the energy of the signal. However, because it follows the time-frequency structure of the signal very closely, it remains largely imperceptible to human listeners. Achieving this kind of coding requires an accurate pre- diction of when and where noise will be masked, good techniques to efficiently control the quantization within such fine-scale patches of time-frequency without introducing any re- dundant coefficients, and solutions to a range of other problems including signals that may require particularly close time resolution to avoid pre-echos. We have seen the powerful and elegant solutions that have been employed to permit this quantization, and we finished with a brief summary of the collections of techniques used in some of today’s dominant high-quality audio compression schemes. SOME EXAMPLE CODING SCHEMES 547 The bit level: An encoded frame in a file

1 Byte (8 bits) 0 Header: defines layer, bitrate, channels, etc. (4 bytes) 8 16 Subband bit allocation indices: 24 32 subbands x 2 channels x 4 bits 32 = 32 bytes 40 Subband scale factor indices: 48 32 subbands x 2 channels x 6 bits 56 (only for subbands with nonzero bit allocation) 64 ≤ 48 bytes 72

Byte index 80 88 Quantized subband samples: 96 32 subbands x 2 channels x 12 samples 104 x 2-15 bits / sample (as per bit allocation, 112 only for subbands with nonzero bit allocation) 120 128 136 Padding to make frame an integer number of 4 byte blocks FIGURE 35.12 Bit usage layout in an example MPEG-1 Audio Layer I frame encoding 384 stereo samples in 140 bytes, for a bit rate of 128 kbps. MP3: Lossless (Huffman) encoding used on “sample” field.

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB step sizes get smaller for smaller sample values. A stage varies the number of bits used to represent each sample (or small groups of samples) in inverse proportion to their overall prevalence, meaning that common sample values consume fewer bits. Finally, a bit “reservoir” allows bits to be shifted around over a scale larger than single frames, even for a fixed-rate stream. MP3 achieves transparent coding for most material at around 128 kbps.

35.4.3 MPEG-2 Advanced Audio Codec (AAC) Advances in computational hardware, as well as theoretical developments in psychoacoustic coding, led to the specification of a new MPEG audio scheme in the mid 1990s that abandoned the hybrid filterbank and other vestiges of MPEG-1 Audio for a complete redesign based on a single MDCT stage, switching between 1024 and 128 spectral bins – similar to the switching windows of MP3, but with a greater difference between long and short windows. AAC also includes the LPC-based temporal noise shaping described in Section 35.3.2, additional backward prediction of spectral values between frames for highly stationary signals, and more flexible schemes for encoding stereo and multichannel sequences to exploit redundancies between channels. Along with differences in the coding techniques, the standard is expanded to accommodate a larger range of signals such as “5.1” surround sound (left, right, center, rear left, rear right, plus a low frequency effects channel), and to provide a broader set of ‘profiles’ to offer different trade-offs of bitrate and computational expenses. AAC at 96 kbps is near transparent for stereo signals. Technical details are provided in Bosi et al. [2]. Today’s lecture: Audio Coding

Compression: Lossless and Lossy

Quantization and Noise

Psychoacoustic Masking

Time-Frequency Tradeoffs

Research Topics

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB Time-Frequency Tradeoff

Good time resolution is required in the filter bank ...

... which implies a gentle rolloff in frequency. Time-Frequency Tradeoff

... which results in aliases appearing in sub-band outputs ...

... which fold over as we move the sub-band to baseband by downsampling. 540 CHAPTER 35 PERCEPTUAL AUDIO CODING

Encoder Bandpass Downsample Quantize Subband 1 quantized f M Q[•] samples Input Analysis (M subbands) filters

Subband M MQ[•]quantized f samples

Psychoacoustic model Minimum masking thresholds

Decoder Subband 1 quantized Q-1[•] M samples f Output Dequantize Upsample Reconstruction + filters

Subband M quantized Q-1[•] M samples f FIGURE 35.7 Structure of a subband coder and corresponding decoder. The input signal is divided into M bandlimited signals, each accounting for 1/Mth of the total bandwidth, which are then downsampled by a factor of M (critically sampled), quantized based on the psychoacoustic model’s estimate of the tolerable quantization noise at that moment in that band, and transmitted. At the decoding end, the dequantized samples are interpolated up to full sampling rate, again bandpass-filtered to recover the appropriate frequencies, then summed to reconstruct the full-bandwidth signal.

full band, and the effort to approach such ideal “brick-wall” filters would result in time- domain properties that were increasingly problematic – long ringing that would compromise the time localization we wish to achieve. Practical filters with finite-duration impulse responses will exhibit a finite transition region between the passband and stopband, and if the passbands are set up to fully capture all the original signal, this transition region will end up straying beyond the 1/Mth of the spectrum associated with the particular subband, as illustrated in Figure 35.8. Decimation, however, will alias (fold) any signal components outside the principal subband back into that band, leading to a particularly nasty kind of distortion. This problem is solved through alias cancellation [12]. Although a single decimated subband will introduce alias terms resulting from energy just beyond the ideal edge of the band, the corresponding imperfections in the reconstruction filters in the decoder will NOISE SHAPING 541 Solution: Quadrature-mirror filter banks

Input signal subband N subband N+1 alias region Analysis f subband N subband N+1 f out-of-band

Downsample Downsample fnyq/M M M alias

Upsample Upsample fnyq M M

Reconstruct Reconstruct fnyq/M f f complementary alias +

Output FIGURE 35.8 Alias cancellation in quadrature-mirror filterbanks.

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB

introduce similar alias terms from the neighboring channel. By careful construction, it is possible to have these two corresponding alias terms appear with opposite signs so that in the final summation they cancel out. This process is illustrated in Figure 35.8: a spectral component lies close to the subband boundary, such that it has substantial energy in two adjacent subband signals. After decimation and the subsequent upsampling (i.e., the stages in Figure 35.7, ignoring the quantization), both bands have both components at the true frequency and aliases at the frequency reflected in the band edge, fnyq/M (where fnyq is the Nyquist rate or highest representable frequency, and M is the number of subbands). However, by putting these signal through reconstruction filters that mirror the original analysis filters (giving this approach its name, quadrature-mirror filtering or QMF) we can guarantee that the complementary alias from the higher band has exactly the same amplitude as the original alias in the lower band. By negating the reconstruction filters in alternating bands prior to final reconstruction, all such aliases can be cancelled 2. Careful

2Alert readers may wonder why this negation – subtracting the reconstruction of subband N + 1 from the reconstruction of subband N in order to cancel the alias – does not end up flipping the polarity of the components in that subband. In fact, the way that the high-pass filter defining subband N + 1 is constructed from the low-pass filter for subband N results in a p/2 phase shift for the unaliased components (and p/2 for aliased components). Specifically, if the low-pass filter is an even-length, symmetric FIR filter, then the high-pass – obtained by multiplying the low-pass impulse response by ( 1)n – will be antisymmetric. When the filter is reapplied in reconstruction, the original component thus accumulates a total phase shift of p, which is then corrected by negating its sign. 542 CHAPTER 35 PERCEPTUAL AUDIO CODING Frame Windows Calculated mask yields imperceptible noise once the 542hit begins, but not duringCHAPTER the silence 35 PERCEPTUALbefore the click. AUDIO CODING

Original - castanets. Input: castanet hit. Original - castanets.wav

MPEG-2pre-echo (Level II) MP2 @ 128 kbps Frame Window 26 ms pre-echo MP2 @ 128 kbps Decoder output, with artifacts. MP3 @ 128 kbps

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB MP3 @ 128 kbps

AAC @ 128 kbps

AAC @ 128 kbps

0.965 0.97 0.975 0.98 0.985 0.99 0.995 time / sec FIGURE 35.9 Castanet sound example. Top pane: original waveform. Middle pane: Reconstruction0.965 0.97 from MPEG-10.975 Audio0.98 Layer0.985 II (MP2)0.99 at 1280.995 kbps. Bottomtime / sec pane: ReconstructionFIGURE 35.9 Castanet from MPEG-1 sound Audio example. Layer Top III (MP3) pane: originalat 128 kbps. waveform. Note the Middle “pre-echo” pane: noise immediatelyReconstruction preceding from MPEG-1 the attack Audio in the Layer MP2 II (MP2) version. at 128 kbps. Bottom pane: Reconstruction from MPEG-1 Audio Layer III (MP3) at 128 kbps. Note the “pre-echo” noise immediately preceding the attack in the MP2 version. design of the original band-pass filter ensures that the magnitude response of the non-alias components remains exactly or very nearly flat. designA of different the original analysis band-pass of maximally-decimated filter ensures that the filterbanks magnitude focuses response on the of the time-domain non-alias [9].components If a signal remains is broken exactly into or blocks very nearly of N samples, flat. with 50% overlap between successive windows,A different maximal analysis decimation of maximally-decimated will require each block filterbanks to be represented focuses on by the only time-domainN/2 coef- ficients,[9]. If a signalmeaning is broken that a reconstruction into blocks of ofN thatsamples, block with in isolation 50% overlap cannot between fully reproduce successive the originalwindows, block maximal but must decimation introduce will distortion require each that block may be to be considered represented time-domain by only N/ aliasing.2 coef- If,ficients, however, meaning the alias that components a reconstruction in successive of that block reconstructed in isolation blocks cannot can fully be made reproduce equal the and opposite,original block then, but as in must the introduce frequency distortion domain case that may above, be the considered aliases can time-domain be cancelled aliasing. in the finalIf, however, overlap-add the alias reconstruction. components This in successive approach reconstructed is known as time-domain blocks can be alias made cancellation equal and (TDAC),opposite, then, and by as dealing in the frequency directly with domain the case time-domain above, the form aliases of the can filters, be cancelled it permits in the the derivationfinal overlap-add of exact reconstruction. perfect-reconstruction This approach analysis-synthesis is known as time-domain structures. The alias most cancellation common instance(TDAC), of and this by uses dealing the Discrete directly Cosine with the Transform time-domain (DCT) form for the of core the filters, frequency it permits transform, the andderivation is known of exact as the perfect-reconstruction modified DCT (MDCT) analysis-synthesis [10]. structures. The most common instance of this uses the Discrete Cosine Transform (DCT) for the core frequency transform, and is known as the modified DCT (MDCT) [10]. 35.3.2 Temporal noise shaping In35.3.2 general, Temporal we want to noise divide the shaping spectrum of the signal into a relatively large number of subbands so that each band corresponds to a narrow range of frequencies, enabling us to takeIn general, full advantage we want of to the divide masking the spectrum that results of the from signal the strongest into a relatively components large in number the band. of However,subbands so due that to the each dual band nature corresponds of time and to a frequency, narrow range reducing of frequencies, the width of enabling the frequency us to take full advantage of the masking that results from the strongest components in the band. However, due to the dual nature of time and frequency, reducing the width of the frequency 542 CHAPTER 35 PERCEPTUAL AUDIO CODING

542 CHAPTER 35 PERCEPTUAL AUDIO CODING “MP3” solution: Variable-length frames ... Original - castanets.wav

Original - castanets.wav Input: castanet hit.

pre-echo MP2 @ 128 kbps

“MP3” pre-echo MP2 @ 128 kbps Frames

MP3 @ 128 kbps Decoder output, with reduced artifacts. MP3 @ 128 kbps

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB AAC @ 128 kbps

AAC @ 128 kbps

0.965 0.97 0.975 0.98 0.985 0.99 0.995 time / sec FIGURE 35.9 Castanet sound example. Top pane: original waveform. Middle pane: Reconstruction0.965 0.97 from MPEG-10.975 Audio0.98 Layer0.985 II (MP2)0.99 at 1280.995 kbps. Bottomtime / sec pane: FIGUREReconstruction 35.9 Castanet from MPEG-1 sound Audio example. Layer Top III (MP3) pane: at original 128 kbps. waveform. Note the Middle “pre-echo” pane: noise Reconstructionimmediately preceding from MPEG-1 the attack Audio in the Layer MP2 II (MP2)version. at 128 kbps. Bottom pane: Reconstruction from MPEG-1 Audio Layer III (MP3) at 128 kbps. Note the “pre-echo” noise immediately preceding the attack in the MP2 version. design of the original band-pass filter ensures that the magnitude response of the non-alias components remains exactly or very nearly flat. designA of different the original analysis band-pass of maximally-decimated filter ensures that the filterbanks magnitude focuses response on the of time-domain the non-alias components[9]. If a signal remains is broken exactly into or blocks very of nearlyN samples, flat. with 50% overlap between successive windows,A different maximal analysis decimation of maximally-decimated will require each block filterbanks to be represented focuses on by the only time-domainN/2 coef- [9].ficients, If a meaningsignal is brokenthat a reconstruction into blocks of ofN thatsamples, block in with isolation 50% overlap cannot betweenfully reproduce successive the windows,original block maximal but must decimation introduce will distortion require each that mayblock be to considered be represented time-domain by only N aliasing./2 coef- ficients,If, however, meaning the alias that components a reconstruction in successive of that block reconstructed in isolation blocks cannot can fully be made reproduce equal and the originalopposite, block then, but as in must the introduce frequency distortion domain case that above, may be the considered aliases can time-domain be cancelled aliasing. in the If,final however, overlap-add the alias reconstruction. components This in successive approach isreconstructed known as time-domain blocks can alias be made cancellation equal and opposite,(TDAC), then,and by as dealing in the frequency directly with domain the time-domain case above, the form aliases of the can filters, be cancelled it permits in the the finalderivation overlap-add of exact reconstruction. perfect-reconstruction This approach analysis-synthesis is known as time-domain structures. The alias most cancellation common (TDAC),instance of and this by uses dealing the Discrete directly Cosine with Transform the time-domain (DCT) form for the of core the frequency filters, it permits transform, the derivationand is known of exact as the perfect-reconstruction modified DCT (MDCT) analysis-synthesis [10]. structures. The most common instance of this uses the Discrete Cosine Transform (DCT) for the core frequency transform, and is known as the modified DCT (MDCT) [10]. 35.3.2 Temporal noise shaping In general, we want to divide the spectrum of the signal into a relatively large number of 35.3.2subbands so Temporal that each band noise corresponds shaping to a narrow range of frequencies, enabling us to Intake general, full advantage we want of to the divide masking the spectrum that results of from the signal the strongest into a relatively components large in number the band. of subbandsHowever, so due that to the each dual band nature corresponds of time and to frequency, a narrow range reducing of frequencies, the width of enablingthe frequency us to take full advantage of the masking that results from the strongest components in the band. However, due to the dual nature of time and frequency, reducing the width of the frequency 534 CHAPTER 35 PERCEPTUAL AUDIO CODING

masker envelope simultaneous masking ~10 dB

masked threshold Intensity / dB / Intensity time

backward masking forward masking ~5 ms ~100 ms FIGURE 35.2 Sequential masking. The masking effect of a tone takes several hundred milliseconds to fully decay after the masker ceases (forward masking). Even if the low-intensity tone starts slightly earlier than the masker, its perception may still be suppressed by the masking tone (backward masking).

further for frequencies above the masker than below (the so-called ‘upward spread of 534 masking’,CHAPTER 35related PERCEPTUAL to the structure AUDIO CODING of the cochlea in which energy at a particular frequency must pass through the parts of the cochlea responsible for detecting higher frequencies Temporal maskingbefore effects it dissipates at its... ‘best place’). Masking of this kind was first noted in the late nineteenth century, wasmasker first envelope systematically investigated by Wegel & Lane [13], and has beensimultaneous widely masking studied since. A comprehensive summary is provided by Zwicker & Fastl [14]. Masking ~10 dB Masking occurs not only when the masking tone is present, but also for a certain phenomena masked threshold amount of time afterwards and even before. Figure 35.2 gives a sketch of this effect. It can have take 100 ms or more for thresholds to return to normal after a masker is extinguished; this Intensity / dB / Intensity temporal time properties backward masking forward masking which must ~5 ms ~100 ms FIGURE 35.2 Sequential masking.Masking tone The masking effect of a tone takes several hundred be considered milliseconds to fully decay after the masker ceases (forward masking). Even if the when low-intensity tone starts slightly earlier than the masker, its perception may still be suppressed by the masking tone (backward masking). encoding 20 18 Elevated masking examples 16 threshold “skirt” level / dB / level further14 for frequencies above the masker than below (the so-called ‘upward spread of like masking’,12 related to the structure of the cochlea in which energy at a particular frequency 10 must pass through the parts of the cochlea responsible20 for detecting higher frequencies “castanets” 8 before6 it dissipates at its ‘best place’). Masking15 of this kind was first noted in the late

4 nineteenth century, was first systematically investigated10 by Wegel & Lane [13], and has 2 freq / Bark been widely studied since. A comprehensive summary5 is provided by Zwicker & Fastl [14]. 0

0 50 100 150 0 200 250 Masking occurstime / notms only when the masking tone is present, but also for a certain EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB FIGUREamount of 35.3 timeThe afterwards combination and even of before. simultaneous Figure 35.2 and gives sequential a sketch masking of this effect. result It in can a time-frequencytake 100 ms or more “skirt” for of thresholds elevated threshold to return to around normal a after strong, a masker sustained is extinguished; tone. this

Masking tone

20 18 Elevated masking 16 threshold “skirt” level / dB / level 14

12

10 20 8

6 15

4 10 2 freq / Bark 5 0

0 50 100 150 0 200 250 time / ms FIGURE 35.3 The combination of simultaneous and sequential masking result in a time-frequency “skirt” of elevated threshold around a strong, sustained tone. Today’s lecture: Audio Coding

Compression: Lossless and Lossy

Quantization and Noise

Psychoacoustic Masking

Time-Frequency Tradeoffs

Research Topics

EECS 225D L16: Perceptual Audio Coding UC Regents Spring 2012 © UCB