A Tutorial on MPEG/Audio Compression

Davis Pan
Motorola

This tutorial covers the theory behind MPEG/audio compression. While lossy, the algorithm often can provide "transparent," perceptually lossless compression even with factors of 6-to-1 or more. It exploits the perceptual properties of the human auditory system. The article also covers the basics of psychoacoustic modeling and the methods the algorithm uses to compress audio data with the least perceptible degradation.

This tutorial covers the theory behind MPEG/audio compression. It is written for people with a modest background in digital signal processing and does not assume prior experience in audio compression or psychoacoustics. The goal is to give a broad, preliminary understanding of MPEG/audio compression, so I omitted many of the details. Wherever possible, figures and illustrative examples present the intricacies of the algorithm.

The MPEG/audio compression algorithm is the first international standard for the digital compression of high-fidelity audio. Other audio compression algorithms address speech-only applications or provide only medium-fidelity audio compression performance.

The MPEG/audio standard results from more than three years of collaborative work by an international committee of high-fidelity audio compression experts within the Moving Picture Experts Group (MPEG/audio). The International Organization for Standardization and the International Electrotechnical Commission (ISO/IEC) adopted this standard at the end of 1992.

Although perfectly suitable for audio-only applications, MPEG/audio is actually one part of a three-part compression standard that also includes video and systems. The MPEG standard addresses the compression of synchronized video and audio at a total bit rate of about 1.5 megabits per second (Mbps).

The MPEG standard is rigid only where necessary to ensure interoperability. It mandates the syntax of the coded bitstream, defines the decoding process, and provides compliance tests for assessing the accuracy of the decoder. This guarantees that, regardless of origin, any fully compliant MPEG/audio decoder will be able to decode any MPEG/audio bitstream with a predictable result. Designers are free to try new and different implementations of the encoder or decoder within the bounds of the standard. The encoder especially offers good potential for diversity. Wide acceptance of this standard will permit manufacturers to produce and sell, at reasonable cost, large numbers of MPEG/audio codecs.

Features and applications

MPEG/audio is a generic audio compression standard. Unlike vocal-tract-model coders specially tuned for speech signals, the MPEG/audio coder gets its compression without making assumptions about the nature of the audio source. Instead, the coder exploits the perceptual limitations of the human auditory system. Much of the compression results from the removal of perceptually irrelevant parts of the audio signal. Since removal of such parts results in inaudible distortions, MPEG/audio can compress any signal meant to be heard by the human ear.

In keeping with its generic nature, MPEG/audio offers a diverse assortment of compression modes, as follows. In addition, the MPEG/audio bitstream makes features such as random access, audio fast-forwarding, and audio reverse possible.

Sampling rate. The audio sampling rate can be 32, 44.1, or 48 kHz.

Audio channel support. The compressed bitstream can support one or two audio channels in one of four possible modes:

1. a monophonic mode for a single audio channel,

2. a dual-monophonic mode for two independent audio channels (functionally identical to the stereo mode),

3. a stereo mode for stereo channels that share bits but do not use joint-stereo coding, and

4. a joint-stereo mode that takes advantage of either the correlations between the stereo channels or the irrelevancy of the phase difference between channels, or both.

Predefined bit rates. The compressed bitstream can have one of several predefined fixed bit rates ranging from 32 to 224 kilobits per second (Kbps) per channel. Depending on the audio sampling rate, this translates to compression factors ranging from 2.7 to 24. In addition, the standard provides a "free" bit rate mode to support fixed bit rates other than the predefined rates.

1070-986X/95/$4.00 © 1995 IEEE

Compression layers. MPEG/audio offers a choice of three independent layers of compression. This provides a wide range of trade-offs between codec complexity and compressed audio quality.

Layer I, the simplest, best suits bit rates above 128 Kbps per channel. For example, Philips' Digital Compact Cassette (DCC) uses Layer I compression at 192 Kbps per channel.

Layer II has an intermediate complexity and targets bit rates around 128 Kbps per channel. Possible applications for this layer include the coding of audio for digital audio broadcasting (DAB), the storage of synchronized video-and-audio sequences on CD-ROM, and the full-motion extension of CD-interactive, Video CD.

Layer III is the most complex but offers the best audio quality, particularly for bit rates around 64 Kbps per channel. This layer suits audio transmission over ISDN.

All three layers are simple enough to allow single-chip, real-time decoder implementations.

Error detection. The coded bitstream supports an optional cyclic redundancy check (CRC) error-detection code.

Ancillary data. MPEG/audio provides a means of including ancillary data within the bitstream.

Overview

The key to MPEG/audio compression is quantization, which is lossy. Nonetheless, this algorithm can give "transparent," perceptually lossless compression. The MPEG/audio committee conducted extensive subjective listening tests during development of the standard. The tests showed that even with a 6-to-1 compression ratio (stereo, 16 bits per sample, audio sampled at 48 kHz compressed to 256 Kbps) and under optimal listening conditions, expert listeners could not distinguish between coded and original audio clips with statistical significance. Furthermore, these clips were specially chosen as difficult to compress. Grewin and Ryden gave the details of the setup, procedures, and results of these tests.

Figure 1 shows block diagrams of the MPEG/audio encoder and decoder.

Figure 1. MPEG/audio compression and decompression. (a) MPEG/audio encoder: PCM audio input passes through a time-to-frequency filter bank, then a bit/noise-allocation, quantizer, and coding block driven by a psychoacoustic model, and finally a bitstream-formatting block; ancillary data is optional. (b) MPEG/audio decoder: bitstream unpacking, frequency sample reconstruction, and frequency-to-time mapping, with ancillary data recovered if included.

The input audio stream passes through a filter bank that divides the input into multiple subbands of frequency. The input audio stream simultaneously passes through a psychoacoustic model that determines the ratio of the signal energy to the masking threshold for each subband. The bit- or noise-allocation block uses the signal-to-mask ratios to decide how to apportion the total number of code bits available for the quantization of the subband signals to minimize the audibility of the quantization noise. Finally, the last block takes the representation of the quantized subband samples and formats this data and side information into a coded bitstream. Ancillary data not necessarily related to the audio stream can be inserted within the coded bitstream. The decoder deciphers this bitstream, restores the quantized subband values, and reconstructs the audio signal from the subband values.

First we'll look at the time-to-frequency mapping of the polyphase filter bank, then implementations of the psychoacoustic model and more detailed descriptions of the three layers of MPEG/audio compression. That gives enough background to cover a brief summary of the different bit (or noise) allocation processes used by the three layers and the joint-stereo coding methods.

The polyphase filter bank

The polyphase filter bank is common to all layers of MPEG/audio compression. This filter bank divides the audio signal into 32 equal-width frequency subbands. The filters are relatively simple and provide good time resolution with reasonable frequency resolution.

The ISO MPEG/audio standard describes a procedure for computing the analysis polyphase filter outputs that closely resembles a method described by Rothweiler. Figure 2 shows the flow diagram from the ISO MPEG/audio standard for the MPEG encoder filter bank based on Rothweiler's proposal.

Figure 2. Flow diagram of the MPEG/audio encoder filter bank: 1. Shift in 32 new samples into a 512-point FIFO buffer, X. 2. Window samples: for i = 0 to 511, Z_i = C_i × X_i. 3. Partial calculation: for i = 0 to 63, Y_i = Σ_{j=0}^{7} Z_{i+64j}. 4. Calculate 32 samples by matrixing: S_i = Σ_{k=0}^{63} M_{ik} × Y_k. 5. Output 32 subband samples.

By combining the equations and steps shown by Figure 2, we can derive the following equation for the filter bank outputs:

    s_t[i] = Σ_{k=0}^{63} M[i][k] × ( Σ_{j=0}^{7} C[k + 64j] × x[k + 64j] )    (1)

where i is the subband index and ranges from 0 to 31; s_t[i] is the filter output sample for subband i at time t, where t is an integer multiple of 32 audio sample intervals; C[n] is one of the 512 coefficients of the analysis window defined in the standard; x[n] is an audio input sample read from a 512-sample buffer; and

    M[i][k] = cos[ (2 × i + 1) × (k − 16) × π / 64 ]

are the analysis matrix coefficients.
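The flow of Figure 2 maps directly to code. Below is a minimal Python sketch of one filter-bank iteration. Note one assumption loudly: the 512 analysis-window coefficients C[n] are tabulated in the standard and are not reproduced here, so a windowed-sinc stand-in (clearly not the standard's table) is used only to make the sketch runnable.

```python
import numpy as np

def analysis_matrix():
    """M[i][k] = cos((2i + 1)(k - 16)pi/64) for i = 0..31, k = 0..63."""
    i = np.arange(32)[:, None]
    k = np.arange(64)[None, :]
    return np.cos((2 * i + 1) * (k - 16) * np.pi / 64)

def stand_in_window():
    """Stand-in for the standard's 512-coefficient analysis window C[n].

    The real table is defined in the MPEG-1 audio standard. Here, a
    windowed-sinc low-pass with every odd-numbered group of 64
    coefficients negated (as the text describes for C[n]) serves only
    to make this sketch executable.
    """
    n = np.arange(512)
    h = np.sinc((n - 255.5) / 64.0) * np.hanning(512)
    sign = np.where((n // 64) % 2 == 1, -1.0, 1.0)
    return sign * h / h.sum()

def filter_bank_step(fifo, new32, C, M):
    """One Figure-2 iteration: 32 audio samples in, 32 subband samples out."""
    # 1. Shift 32 new samples into the 512-point FIFO (newest sample first).
    fifo = np.concatenate((new32[::-1], fifo[:-32]))
    # 2. Window: Z[i] = C[i] * X[i] for i = 0..511.
    Z = C * fifo
    # 3. Partial calculation: Y[k] = sum over j of Z[k + 64j], k = 0..63.
    Y = Z.reshape(8, 64).sum(axis=0)
    # 4. Matrixing: S[i] = sum over k of M[i][k] * Y[k].
    return fifo, M @ Y
```

Each call consumes 32 input samples and emits 32 subband samples, which is the critical sampling the text describes below.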

Equation 1 is partially optimized to reduce the number of computations. Because the function within the parentheses is independent of the value of i, and M[i][k] is independent of j, the 32 filter outputs need only 512 + 32 × 64 = 2,560 multiplies and 64 × 7 + 32 × 63 = 2,464 additions, or roughly 80 multiplies and additions per output. Substantially further reductions in multiplies and adds are possible with, for example, a fast discrete cosine transform or a fast Fourier transform implementation.

Note this filter bank implementation is critically sampled: For every 32 input samples, the filter bank produces 32 output samples. In effect, each of the 32 subband filters subsamples its output by 32 to produce only one output sample for every 32 new audio samples.

The design has three notable concessions.

First, the equal widths of the subbands do not accurately reflect the human auditory system's frequency-dependent behavior. Instead, the width of a "critical band" as a function of frequency is a good indicator of this behavior. Many psychoacoustic effects are consistent with a critical-band frequency scaling. For example, both the perceived loudness of a signal and its audibility in the presence of a masking signal differ for signals within one critical band versus signals that extend over more than one critical band. At lower frequencies a single subband covers several critical bands. In this circumstance the number of quantizer bits cannot be specifically tuned for the noise masking available for individual critical bands. Instead, the critical band with the least noise masking dictates the number of quantization bits needed for the entire subband.

Second, the filter bank and its inverse are not lossless transformations. Even without quantization, the inverse transformation cannot perfectly recover the original signal. However, by design the error introduced by the filter bank is small and inaudible.

Third, adjacent filter bands have a major frequency overlap. A signal at a single frequency can affect two adjacent filter-bank outputs.

To understand these effects, we can manipulate Equation 1 into a familiar filter convolution equation:

    s_t[i] = Σ_{n=0}^{511} x[t − n] × H_i[n]    (2)

where x[t] is an audio sample at time t, and

    H_i[n] = h[n] × cos[ (2 × i + 1) × (n − 16) × π / 64 ]

with h[n] = −C[n] if the integer part of (n/64) is odd and h[n] = C[n] otherwise, for n = 0 to 511.

In this form, each subband of the filter bank has its own band-pass filter response, H_i[n]. Although this form is more convenient for analysis, it is clearly not an efficient solution: A direct implementation of this equation requires 32 × 512 = 16,384 multiplies and 32 × 511 = 16,352 additions to compute the 32 filter outputs.

The coefficients h[n] correspond to the prototype low-pass filter response for the polyphase filter bank. Figure 3 compares a plot of h[n] with C[n]. The C[n] used in the partially optimized Equation 1 has every odd-numbered group of 64 coefficients of h[n] negated to compensate for M[i][k]. The cosine term of M[i][k] only ranges from k = 0 to 63 and covers an odd number of half cycles, whereas the cosine terms of H_i[n] range from n = 0 to 511 and cover eight times the number of half cycles.

Figure 3. Comparison of h[n] with C[n].

The equation for H_i[n] clearly shows that each is a modulation of the prototype response with a cosine term to shift the low-pass response to the appropriate frequency band. Hence, these are called polyphase filters. These filters have center frequencies at odd multiples of π/(64T), where T is the audio sampling period, and each has a nominal bandwidth of π/(32T).

Figure 4. Frequency response of the prototype filter, h[n], over frequencies from 0 to π/T, with the nominal bandwidth marked.

As Figure 4 shows, the prototype filter response does not have a sharp cutoff at its nominal bandwidth. So when the filter outputs are subsampled by 32, a considerable amount of aliasing occurs. The design of the prototype filter, and the inclusion of appropriate phase shifts in the cosine terms, results in a complete alias cancellation at the output of the decoder's synthesis filter bank.

Another consequence of using a filter with a wider-than-nominal bandwidth is an overlap in the frequency coverage of adjacent polyphase filters. This effect can be detrimental to efficient audio compression because signal energy near nominal subband edges will appear in two adjacent polyphase filter outputs. Figure 5, next page, shows how a pure sinusoid tone, which has energy at only one frequency, appears at the output of two polyphase filters.

Although the polyphase filter bank is not lossless, any consequent errors are small. Assuming you do not quantize the subband samples, the composite frequency response combining the response of the encoder's analysis filter bank with that of the decoder's synthesis filter bank has a ripple of less than 0.07 dB.

Psychoacoustics

The MPEG/audio algorithm compresses the audio data in large part by removing the acoustically irrelevant parts of the audio signal. That is, it takes advantage of the human auditory system's inability to hear quantization noise under conditions of auditory masking.

This masking is a perceptual property of the human auditory system that occurs whenever the presence of a strong audio signal makes a temporal or spectral neighborhood of weaker audio signals imperceptible. A variety of psychoacoustic experiments corroborate this masking phenomenon.

Figure 5. Aliasing: A pure sinusoid input can produce nonzero output for two subbands. Input audio: 1,500-Hz sine wave sampled at 32 kHz (64 of 256 samples shown). Subband outputs: 8 × 32 samples; both subbands 3 and 4 have significant output values.

Empirical results also show that the human auditory system has a limited, frequency-dependent resolution. This dependency can be expressed in terms of critical-band widths that are less than 100 Hz for the lowest audible frequencies and more than 4 kHz at the highest. The human auditory system blurs the various signal components within a critical band, although this system's frequency selectivity is much finer than a critical band.

Because of the human auditory system's frequency-dependent resolving power, the noise-masking threshold at any given frequency depends solely on the signal energy within a limited bandwidth neighborhood of that frequency. Figure 6 illustrates this property.

Figure 6. Audio noise masking: a strong tonal signal creates a region in frequency where weaker signals are masked.

MPEG/audio works by dividing the audio signal into frequency subbands that approximate critical bands, then quantizing each subband according to the audibility of quantization noise within that band. For the most efficient compression, each band should be quantized with no more levels than necessary to make the quantization noise inaudible.

The psychoacoustic model

The psychoacoustic model analyzes the audio signal and computes the amount of noise masking available as a function of frequency. The masking ability of a given signal component depends on its frequency position and its loudness. The encoder uses this information to decide how best to represent the input audio signal with its limited number of code bits.

The MPEG/audio standard provides two example implementations of the psychoacoustic model. Psychoacoustic model 1 is less complex than psychoacoustic model 2 and makes more compromises to simplify the calculations. Either model works for any of the layers of compression. However, only model 2 includes specific modifications to accommodate Layer III.

There is considerable freedom in the implementation of the psychoacoustic model. The required accuracy of the model depends on the target compression factor and the intended application. For low levels of compression, where a generous supply of code bits exists, a complete bypass of the psychoacoustic model might be adequate for consumer use. In this case, the bit-allocation process can iteratively assign bits to the subband with the lowest signal-to-noise ratio. For archiving music, the psychoacoustic model can be made much more stringent.

Now let's look at a general outline of the basic steps involved in the psychoacoustic calculations for either model. Differences between the two models will be highlighted.

Time-align audio data. There is one psychoacoustic evaluation per frame. The audio data sent to the psychoacoustic model must be concurrent with the audio data to be coded. The psychoacoustic model must account for both the delay of the audio data through the filter bank and a data offset so that the relevant data is centered within the psychoacoustic analysis window.

For example, when using psychoacoustic model 1 for Layer I, the delay through the filter bank is 256 samples and the offset required to center the 384 samples of a Layer I frame in the 512-point analysis window is (512 − 384)/2 = 64 points. The net offset is 320 points to time-align the psychoacoustic model data with the filter bank outputs.

Convert audio to a frequency-domain representation. The psychoacoustic model should use a separate, independent time-to-frequency mapping instead of the polyphase filter bank because it needs finer frequency resolution for an accurate calculation of the masking thresholds. Both psychoacoustic models use a Fourier transform for this mapping. A standard Hann weighting, applied to the audio data before Fourier transformation, conditions the data to reduce the edge effects of the transform window.

Psychoacoustic model 1 uses a 512-sample analysis window for Layer I and a 1,024-sample window for Layers II and III. Because there are only 384 samples in a Layer I frame, a 512-sample window provides adequate coverage. Here the smaller window size reduces the computational load. Layers II and III use a 1,152-sample frame size, so the 1,024-sample window does not provide complete coverage. While ideally the analysis window should completely cover the samples to be coded, a 1,024-sample window is a reasonable compromise. Samples falling outside the analysis window generally will not have a major impact on the psychoacoustic evaluation.

Psychoacoustic model 2 uses a 1,024-sample window for all layers. For Layer I, the model centers a frame's 384 audio samples in the psychoacoustic window as previously discussed. For Layers II and III, the model computes two 1,024-point psychoacoustic calculations for each frame. The first calculation centers the first half of the 1,152 samples in the analysis window, and the second calculation centers the second half. The model combines the results of the two calculations by using the higher of the two signal-to-mask ratios for each subband. This in effect selects the lower of the two noise-masking thresholds for each subband.

Process spectral values into groupings related to critical-band widths. To simplify the psychoacoustic calculations, both models process the frequency values in perceptual quanta.

Separate spectral values into tonal and nontonal components. Both models identify and separate the tonal and noise-like components of the audio signal because the masking abilities of the two types of signal differ.

Psychoacoustic model 1 identifies tonal components based on the local peaks of the audio power spectrum. After processing the tonal components, model 1 sums the remaining spectral values into a single nontonal component per critical band. The frequency index of each concentrated nontonal component is the value closest to the geometric mean of the enclosing critical band.

Psychoacoustic model 2 never actually separates tonal and nontonal components. Instead, it computes a tonality index as a function of frequency. This index gives a measure of whether the component is more tone-like or noise-like. Model 2 uses this index to interpolate between pure tone-masking-noise and noise-masking-tone values. The tonality index is based on a measure of predictability. Model 2 uses data from the previous two analysis windows to predict, via linear extrapolation, the component values for the current window. Tonal components are more predictable and thus will have higher tonality indices. Because this process relies on more data, it probably discriminates better between tonal and nontonal components than the model 1 method.

Apply a spreading function. The masking ability of a given signal spreads across its surrounding critical band. The model determines the noise-masking thresholds by first applying an empirically determined masking function (model 1) or spreading function (model 2) to the signal components.

Set a lower bound for the threshold values. Both models include an empirically determined absolute masking threshold, the threshold in quiet. This threshold is the lower bound on the audibility of sound.

Find the masking threshold for each subband. Both psychoacoustic models calculate the masking thresholds with a higher frequency resolution than that provided by the polyphase filter bank. Both models must derive a subband threshold value from possibly a multitude of masking thresholds computed for frequencies within that subband.

Model 1 selects the minimum masking threshold covered by each subband. This approach can be inaccurate because model 1 concentrates all the nontonal energy within each critical band into a single value at a single frequency. In effect, model 1 converts nontonal components into a form of tonal component. A subband within a wide critical band but far from the concentrated nontonal component will not get an accurate nontonal masking assessment. This approach is a compromise to reduce the computational load.

Model 2 selects the minimum of the masking thresholds covered by the subband only where the band is wide relative to the critical band in that frequency region. It uses the average of the masking thresholds covered by the subband where the band is narrow relative to the critical band. Model 2 has the same accuracy for the higher frequency subbands as for lower frequency subbands because it does not concentrate the nontonal components.

Calculate the signal-to-mask ratio. The psychoacoustic model computes the signal-to-mask ratio as the ratio of the signal energy within the subband (or, for Layer III, a group of bands) to the minimum masking threshold for that subband. The model passes this value to the bit- (or noise-) allocation section of the encoder.
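As a rough illustration of this last step, the sketch below computes toy signal-to-mask ratios for one 512-sample frame. A Hann-windowed FFT stands in for the model's time-to-frequency mapping, and the per-subband masking thresholds are taken as a given input (`mask_db` is a placeholder); in a real encoder they come from the full model described above.

```python
import numpy as np

def signal_to_mask_ratios(frame, mask_db):
    """Toy SMR calculation for a 512-sample frame and 32 subbands.

    frame:   512 audio samples.
    mask_db: assumed minimum masking threshold per subband, in dB --
             a stand-in for the psychoacoustic model's real output.
    """
    windowed = frame * np.hanning(512)           # Hann weighting
    power = np.abs(np.fft.rfft(windowed)) ** 2   # bins 0..256 cover 0..fs/2
    # Sum the energy of the 8 FFT bins covering each of the 32 subbands.
    subband_energy = power[:256].reshape(32, 8).sum(axis=1)
    energy_db = 10 * np.log10(subband_energy + 1e-12)
    return energy_db - np.asarray(mask_db, dtype=float)
```

With flat thresholds, a pure tone produces its largest SMR in the subband that contains it, which is the quantity the bit-allocation block then uses to rank subbands.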

Example of psychoacoustic model analysis

Figure 7 shows a spectral plot of the example audio signal to be psychoacoustically analyzed and compressed. This signal consists of a combination of a strong, 11,250-Hz, sinusoidal tone with low-pass noise.

Figure 7. Input audio spectrum.

Because the processes used by psychoacoustic model 2 are somewhat easier to visualize, we will cover this model first. Figure 8a shows the result, according to psychoacoustic model 2, of transforming the audio signal to the perceptual domain (63 one-third-critical-band partitions) and then applying the spreading function. Figure 8b shows the tonality index for the audio signal as computed by psychoacoustic model 2. Note the shift of the sinusoid peak and the expansion of the low-pass noise distribution. The perceptual transformation expands the low-frequency region and compresses the higher frequency region. Because the spreading function is applied in a perceptual domain, the shape of the spreading function is relatively uniform as a function of partition. Figure 9 shows a plot of the spreading functions.

Figure 8. Psychoacoustic model 2 partition-domain processing. (a) Audio energy and spread energy in the perceptual domain. (b) Tonality index.

Figure 9. Spreading function for each partition (psychoacoustic model 2).

Figure 10a shows a plot of the masking threshold as computed by the model based on the spread energy and the tonality index. This figure has plots of the masking threshold both before and after the incorporation of the threshold in quiet to illustrate its impact. Note the threshold in quiet significantly increases the noise-masking threshold for the higher frequencies. The human auditory system is much less sensitive in this region. Also note how the sinusoid signal increases the masking threshold for the neighboring frequencies.

The masking threshold is computed in the uniform frequency domain instead of the perceptual domain in preparation for the final step of the psychoacoustic model, the calculation of the signal-to-mask ratios (SMR) for each subband. Figure 10b is a plot of these results, and Figure 10c is a frequency plot of a processed audio signal using these SMRs. In this example the audio compression was severe (768 to 64 Kbps), so the coder cannot necessarily mask all the quantization noise.

Figure 10. Psychoacoustic model 2 processing. (a) Original signal energy and computed masking thresholds. (b) Signal-to-mask ratios. (c) Coded audio energy (64 Kbps).

For psychoacoustic model 1, we use the same example audio signal as above. Figure 11a shows how psychoacoustic model 1 identifies the local spectral peaks as tonal and nontonal components. Figure 11b shows the remaining tonal and nontonal components after the decimation process.

Figure 11. Psychoacoustic model 1 processing. (a) Identified tonal and nontonal components. (b) Decimated tonal and nontonal components. (c) Global masking thresholds.
This decimation process both removes components that would fall below the threshold in quiet and removes the weaker tonal components within roughly half a critical-band width (0.5 Bark) of a stronger tonal component.

Psychoacoustic model 1 uses the decimated tonal and nontonal components to determine the global masking threshold in a subsampled frequency domain. This subsampled domain corresponds approximately to a perceptual domain. Figure 11c shows the global masking threshold calculated for the example audio signal.

Psychoacoustic model 1 selects the minimum global masking threshold within each subband to compute the SMRs. Figure 12a shows the resulting signal-to-mask ratios, and Figure 12b is a frequency plot of the processed audio signal using these SMRs.

Figure 12. Psychoacoustic model 1 processing results. (a) Signal-to-mask ratios. (b) Coded audio energy (64 Kbps).

Layer coding options

The MPEG/audio standard has three distinct layers for compression. Layer I forms the most basic algorithm, while Layer II and Layer III are enhancements that use some elements found in Layer I. Each successive layer improves the compression performance, at the cost of greater encoder and decoder complexity.

Every MPEG/audio bitstream contains periodically spaced frame headers to identify the bitstream. Figure 13 gives a pictorial representation of the header syntax. A 2-bit field in the MPEG header identifies the layer in use.

Figure 13. MPEG header syntax.

Layer I

The Layer I algorithm codes audio in frames of 384 audio samples. It does so by grouping 12 samples from each of the 32 subbands, as shown in Figure 14. Besides the code for audio data, each frame contains a header, an optional cyclic-redundancy-code (CRC) error-check word, and possibly ancillary data.

Figure 14. Grouping of subband samples for Layer I and Layer II. Note: Each subband filter produces 1 sample out for every 32 samples in.

Figure 15a shows the arrangement of this data in a Layer I bitstream. The numbers within parentheses give the number of bits possible to encode each field. Each group of 12 samples gets a bit allocation and, if the bit allocation is not zero, a scale factor. The bit allocation tells the decoder the number of bits used to represent each sample. For Layer I this allocation can be 0 to 15 bits per subband.

The scale factor is a multiplier that sizes the samples to fully use the range of the quantizer. Each scale factor has a 6-bit representation. The decoder multiplies the decoded quantizer output with the scale factor to recover the quantized subband value. The dynamic range of the scale factors alone exceeds 120 dB. The combination of the bit allocation and the scale factor provides the potential for representing the samples with a dynamic range well over 120 dB. Joint stereo coding slightly alters the representation of left- and right-channel audio samples and will be covered later.

Layer II

The Layer II algorithm is a straightforward enhancement of Layer I. It codes the audio data in larger groups and imposes some restrictions on the possible bit allocations for values from the middle and higher subbands. It also represents the bit allocation, the scale-factor values, and the quantized samples with more compact code. Layer II gets better audio quality by saving bits in these areas, so more code bits are available to represent the quantized subband values. The Layer II encoder forms frames of 1,152 samples per audio channel.
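The bit-allocation and scale-factor mechanism shared by Layers I and II can be sketched as follows. This is illustrative only: the standard draws scale factors from a 6-bit table and defines particular quantizer characteristics, while here the scale factor is simply the group's peak magnitude and the quantizer is a plain symmetric rounding quantizer.

```python
import numpy as np

def quantize_group(samples, nbits):
    """Quantize one group of 12 subband samples (Layer I-style sketch).

    `nbits` plays the role of the bit allocation; returns a scale factor
    and integer codes.
    """
    scf = float(np.max(np.abs(samples)))
    if scf == 0.0:
        return 1.0, np.zeros(len(samples), dtype=int)
    half = ((1 << nbits) - 1) // 2          # symmetric quantizer range
    q = np.round(np.clip(samples / scf, -1, 1) * half).astype(int)
    return scf, q

def dequantize_group(scf, q, nbits):
    """Decoder side: multiply the decoded quantizer output by the scale factor."""
    half = ((1 << nbits) - 1) // 2
    return q / half * scf
```

Because the scale factor normalizes each group to the quantizer's full range, the worst-case reconstruction error is half a quantizer step scaled by the group's peak, which is what lets a handful of bits per sample track a signal with very wide dynamic range.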
Whereas Layer I codes data in single groups of 12 samples for each subband, Layer II codes data in three groups of 12 samples for each subband. Figure 14 shows this grouping. Again discounting stereo redundancy coding, each trio of 12 samples has one bit allocation and up to three scale factors.

The encoder uses a different scale factor for each group of 12 samples only if necessary to avoid audible distortion. The encoder shares scale-factor values among two or all three groups in two other cases: one, when the values of the scale factors are sufficiently close; two, when the encoder anticipates that temporal noise masking by the human auditory system will hide any distortion caused by using only one scale factor instead of two or three. The scale-factor selection information (SCFSI) field in the Layer II bitstream informs the decoder if and how to share the scale-factor values. Figure 15b shows the arrangement of the various data fields in a Layer II bitstream.

Figure 15. Frame formats of the three MPEG/audio layers' bitstreams: (a) Layer I: header (32 bits), optional CRC (16), bit allocation (128-256), scale factors (0-384), samples, ancillary data; (b) Layer II: header (32), optional CRC (16), bit allocation, SCFSI (0-60), scale factors (0-1,080), samples, ancillary data; (c) Layer III: header (32), optional CRC (16), side information (136 or 256), main data (not necessarily linked to this frame; see Figure 18).

Another enhancement covers the occasion when the Layer II encoder allocates three, five, or nine levels for subband quantization. In these circumstances, the Layer II encoder represents three consecutive quantized values with a single, more compact code word.

Layer III
The Layer III algorithm is a much more refined approach derived from the ASPEC (audio spectral perceptual entropy coding) and OCF (optimal coding in the frequency domain) algorithms.12,13,16 Although based on the same filter bank found in Layer I and Layer II, Layer III compensates for some filter-bank deficiencies by processing the filter outputs with a modified discrete cosine transform (MDCT).17 Figure 16 shows a block diagram of this processing for the encoder.

Figure 16. MPEG/audio Layer III filter-bank processing (encoder side). Each of the 32 subband outputs (subband 0 through subband 31) passes through a windowed MDCT; long or short, long-to-short, and short-to-long window selection is controlled by the psychoacoustic model.

Unlike the polyphase filter bank, the MDCT transformation is lossless without quantization. The MDCTs further subdivide the subband outputs in frequency to provide better spectral resolution. Furthermore, once the subband components are subdivided in frequency, the Layer III encoder can partially cancel some aliasing caused by the polyphase filter bank. Of course, the Layer III decoder has to undo the alias cancellation so that the inverse MDCT can reconstruct subband samples in their original, aliased, form for the synthesis filter bank.

Layer III specifies two different MDCT block lengths: a long block of 18 samples or a short block of 6. There is a 50-percent overlap between successive transform windows, so the window size is 36 and 12, respectively. The long block length allows greater frequency resolution for audio signals with stationary characteristics, while the short block length provides better time resolution for transients.18

Note the short block length is one third that of a long block. In the short block mode, three short blocks replace a long block so that the number of MDCT samples for a frame of audio samples remains unchanged regardless of the block size selection. For a given frame of audio samples, the MDCTs can all have the same block length (long or short) or a mixed-block mode. In the mixed-block mode the MDCTs for the two lower frequency subbands have long blocks, and the MDCTs for the 30 upper subbands have short blocks. This mode provides better frequency resolution for the lower frequencies, where it is needed the most, without sacrificing time resolution for the higher frequencies.

The switch between long and short blocks is not instantaneous. A long block with a specialized long-to-short or short-to-long data window serves to transition between long and short block types. Figure 17 shows how the MDCT windows transition between long and short block modes.

Figure 17. The arrangement of overlapping MDCT windows: long (36-sample) and short (12-sample) windows, with long-to-short and short-to-long transition windows between the two block types.

Because MDCT processing of a subband signal provides better frequency resolution, it consequently has poorer time resolution. The MDCT operates on 12 or 36 polyphase filter samples, so the effective time window of audio samples involved in this processing is 12 or 36 times larger. The quantization of MDCT values will cause errors spread over this larger time window, so it is more likely that this quantization will produce audible distortions. Such distortions usually manifest themselves as pre-echo because the temporal masking of noise occurring before a given signal is weaker than the masking of noise after.

Layer III incorporates several measures to reduce pre-echo. First, the Layer III psychoacoustic model has modifications to detect the conditions for pre-echo. Second, Layer III can borrow code bits from the bit reservoir to reduce quantization noise when pre-echo conditions exist. Finally, the encoder can switch to a smaller MDCT block size to reduce the effective time window.

Besides the MDCT processing, other enhancements over the Layer I and Layer II algorithms include the following.

Alias reduction. Layer III specifies a method of processing the MDCT values to remove some artifacts caused by the overlapping bands of the polyphase filter bank.

Nonuniform quantization. The Layer III quantizer raises its input to the 3/4 power before quantization to provide a more consistent signal-to-noise ratio over the range of quantizer values. The requantizer in the MPEG/audio decoder relinearizes the values by raising its output to the 4/3 power.

Scale-factor bands. Unlike Layers I and II, where each subband can have a different scale factor, Layer III uses scale-factor bands. These bands cover several MDCT coefficients and have approximately critical-band widths. In Layer III scale factors serve to color the quantization noise to fit the varying frequency contours of the masking threshold. Values for these scale factors are adjusted as part of the noise-allocation process.

Entropy coding of data values. To get better data compression, Layer III uses variable-length Huffman codes to encode the quantized samples. After quantization, the encoder orders the 576 (32 subbands x 18 MDCT coefficients/subband) quantized MDCT coefficients in a predetermined order. The order is by increasing frequency except for the short MDCT block mode. For short blocks there are three sets of window values for a given frequency, so the ordering is by frequency, then by window, within each scale-factor band. Ordering is advantageous because large values tend to fall at the lower frequencies and long runs of zero or near-zero values tend to occupy the higher frequencies.

The encoder delimits the ordered coefficients into three distinct regions. This enables the encoder to code each region with a different set of Huffman tables specifically tuned for the statistics of that region. Starting at the highest frequency, the encoder identifies the continuous run of all-zero values as one region. This region does not have to be coded because its size can be deduced from the size of the other two regions. However, it must contain an even number of zeroes because the other regions code their values in even-numbered groupings.

The second region, the "count1" region, consists of a continuous run of values made up only of -1, 0, or 1. The Huffman table for this region codes four values at a time, so the number of values in this region must be a multiple of four.

The third region, the "big values" region, covers all remaining values. The Huffman tables for this region code the values in pairs. The "big values" region is further subdivided into three subregions, each having its own specific Huffman table.

Besides improving coding efficiency, partitioning the MDCT coefficients into regions and subregions helps control error propagation. Within the bitstream, the Huffman codes for the values are ordered from low to high frequency.

Use of a "bit reservoir." The design of the Layer III bitstream better fits the encoder's time-varying demand on code bits. As with Layer II, Layer III processes the audio data in frames of 1,152 samples. Figure 15c shows the arrangement of the various bit fields in a Layer III bitstream.
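The three-region partitioning of the ordered coefficients described above can be sketched as follows. This is a simplified model: the real encoder also splits the "big values" region into three subregions with their own Huffman tables, and the boundary rules interact with the bitstream side information.

```python
def partition_regions(coeffs):
    """Split frequency-ordered quantized MDCT coefficients into the
    rzero, count1, and big-values regions used to select Huffman tables.
    Simplified sketch of the Layer III partitioning rules."""
    n = len(coeffs)

    # Region 1 (rzero): run of zeros at the highest frequencies. It is
    # never coded; its size is implied by the sizes of the other regions.
    rzero_start = n
    while rzero_start > 0 and coeffs[rzero_start - 1] == 0:
        rzero_start -= 1
    # The coded regions group values in pairs/quads, so keep their
    # total count even by leaving one zero with them if necessary.
    if (rzero_start % 2) != 0:
        rzero_start += 1

    # Region 2 (count1): run of values in {-1, 0, 1}, coded 4 at a time,
    # so its length must be a multiple of four.
    count1_start = rzero_start
    while count1_start > 0 and abs(coeffs[count1_start - 1]) <= 1:
        count1_start -= 1
    count1_start += (rzero_start - count1_start) % 4

    # Region 3 (big values): everything below, coded in pairs.
    big_values = coeffs[:count1_start]
    count1 = coeffs[count1_start:rzero_start]
    rzero = coeffs[rzero_start:]
    return big_values, count1, rzero
```

For example, the 12 ordered values [5, 3, -2, 1, 1, 0, -1, 0, 0, 0, 0, 0] split into big values [5, 3, -2, 1], count1 [1, 0, -1, 0], and a four-zero rzero run.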
Figure 18. Layer III bitstream diagram. Five consecutive fixed-length frames are shown with their main-data-begin pointers (zero for Frame 1, when the reservoir holds no bits) and the locations of each frame's main data, which can begin within the unused bits of earlier frames.
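The reservoir mechanism pictured in Figure 18 and described in the following paragraphs amounts to simple per-frame accounting. In this sketch the reservoir cap is set to the 7,680-bit code-buffer limit discussed below; this is a simplification of the standard's exact rules, for illustration only.

```python
FRAME_BUFFER_BITS = 7680  # code-buffer limit stipulated by the standard

def update_reservoir(reservoir_bits, mean_frame_bits, bits_used):
    """One frame of bit-reservoir accounting (sketch). A frame that uses
    fewer than the mean bits donates the surplus to the reservoir; a frame
    may borrow only bits donated by past frames, never future ones."""
    available = reservoir_bits + mean_frame_bits
    if bits_used > available:
        raise ValueError("cannot borrow from future frames")
    # Surplus is carried forward, capped by the assumed buffer limit.
    return min(available - bits_used, FRAME_BUFFER_BITS)
```

A quiet frame that uses 100 bits fewer than the mean leaves 100 bits in the reservoir; the next frame may then spend up to the mean plus those 100 bits.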
Unlike Layer II, the coded data representing these samples do not necessarily fit into a fixed-length frame in the code bitstream. The encoder can donate bits to a reservoir when it needs fewer than the average number of bits to code a frame. Later, when the encoder needs more than the average number of bits to code a frame, it can borrow bits from the reservoir. The encoder can only borrow bits donated from past frames; it cannot borrow from future frames.

The Layer III bitstream includes a 9-bit pointer, "main-data-begin," with each frame's side information pointing to the location of the starting byte of the audio data for that frame. Figure 18 illustrates the implementation of the bit reservoir within a fixed-length frame structure via the main-data-begin pointer. Although main-data-begin limits the maximum variation of the audio data to 2^9 bytes (header and side information are not counted because for a given mode they are of fixed length and occur at regular intervals in the bitstream), the actual maximum allowed variation will often be much less. For practical considerations, the standard stipulates that this variation cannot exceed what would be possible for an encoder with a code buffer limited to 7,680 bits. Because compression to 320 Kbps with an audio sampling rate of 48 kHz requires an average of 1,152 (samples/frame) x 320,000 (bits/second) / 48,000 (samples/second) = 7,680 bits/frame, absolutely no variation is allowed for this coding mode.

Despite the conceptually complex enhancements to Layer III, a Layer III decoder has only a modest increase in computation requirements over a Layer II decoder. For example, even a direct matrix-multiply implementation of the inverse MDCT requires only about 19 multiplies and additions per subband value. The enhancements mainly increase the complexity of the encoder and the memory requirements of the decoder.

Looking ahead, the second phase of the standard, MPEG-2 audio (covered in the final section), extends the first MPEG standard in the following ways.

Multichannel audio support. The enhanced standard supports up to five high-fidelity audio channels, plus a low-frequency enhancement channel (also known as 5.1 channels). Thus, it will handle audio compression for high-definition television (HDTV) or digital movies.

Multilingual audio support. It supports up to seven additional commentary channels.

Lower, compressed audio bit rates. The standard supports additional lower, compressed bit rates down to 8 Kbps.

Bit allocation
The bit allocation process determines the number of code bits allocated to each subband based on information from the psychoacoustic model. For Layers I and II, this process starts by computing the mask-to-noise ratio:

MNR_dB = SNR_dB - SMR_dB

where MNR_dB is the mask-to-noise ratio, SNR_dB is the signal-to-noise ratio, and SMR_dB is the signal-to-mask ratio from the psychoacoustic model, all in decibels. The MPEG/audio standard provides tables that give estimates for the signal-to-noise ratio resulting from quantizing to a given number of quantizer levels. In addition, designers can try other methods of getting the signal-to-noise ratios.

Once the bit allocation unit has mask-to-noise ratios for all the subbands, it searches for the subband with the lowest mask-to-noise ratio and allocates code bits to that subband. When a subband gets allocated more code bits, the bit allocation unit looks up the new estimate for the signal-to-noise ratio and recomputes that subband's mask-to-noise ratio. The process repeats until no more code bits can be allocated.

The Layer III encoder uses noise allocation. The encoder iteratively varies the quantizers in an orderly way, quantizes the spectral values, counts the number of Huffman code bits required to code the audio data, and actually calculates the resulting noise. If, after quantization, some scale-factor bands still have more than the allowed distortion, the encoder amplifies the values in those scale-factor bands and effectively decreases the quantizer step size for those bands. Then the process repeats. The process stops if any of the following three conditions is true:

1. None of the scale-factor bands have more than the allowed distortion.

2. The next iteration would cause the amplification for any of the bands to exceed the maximum allowed value.

3. The next iteration would require all the scale-factor bands to be amplified.

Real-time encoders also can include a time-limit exit condition for this process.
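The Layers I and II allocation loop described above can be sketched as a greedy search over mask-to-noise ratios. The SNR table entries and the fixed per-step bit cost here are illustrative assumptions; the standard tabulates the real quantizer levels, their SNR estimates, and their bit costs.

```python
def allocate_bits(smr_db, bit_pool, snr_table):
    """Greedy Layers I/II-style bit allocation sketch. smr_db holds the
    psychoacoustic model's signal-to-mask ratio per subband (dB);
    snr_table[a] estimates the SNR (dB) achieved by allocation level a.
    Table values and the uniform per-step bit cost are assumptions."""
    n = len(smr_db)
    alloc = [0] * n
    bits_per_step = 12  # assumed cost of one allocation step (12 samples)

    while bit_pool >= bits_per_step:
        # MNR = SNR - SMR; find the subband that is currently worst off.
        mnr = [snr_table[alloc[b]] - smr_db[b] for b in range(n)]
        candidates = [b for b in range(n) if alloc[b] + 1 < len(snr_table)]
        if not candidates:
            break
        worst = min(candidates, key=lambda b: mnr[b])
        alloc[worst] += 1          # give the worst subband more resolution,
        bit_pool -= bits_per_step  # then recompute its MNR on the next pass
    return alloc
```

With two subbands whose SMRs are 20 dB and 5 dB, a 48-bit pool, and the toy SNR table [0, 7, 16, 25, 31], the loop spends three steps on the demanding subband and one on the other.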

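The middle/side transform at the heart of the MS stereo coding described in the next section is just a sum and difference of the two channels. The 0.5 normalization below is an illustrative choice; the article defines mid and side only as the sum and the difference of left and right.

```python
def ms_encode(left, right):
    """Middle/side transform: mid from the sum, side from the difference.
    The 0.5 scaling is an assumed normalization for this sketch."""
    mid = [(l + r) * 0.5 for l, r in zip(left, right)]
    side = [(l - r) * 0.5 for l, r in zip(left, right)]
    return mid, side

def ms_decode(mid, side):
    """Inverse transform: reconstructs left and right exactly."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right

# Highly correlated stereo material yields a small side channel, which
# is why the encoder can compress the side signal further.
L, R = [4, 2, 6], [4, 0, 6]
mid, side = ms_encode(L, R)
assert ms_decode(mid, side) == (L, R)
```

Note how the side channel is zero wherever the channels agree; only the differing sample contributes.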
Stereo redundancy coding
The MPEG/audio compression algorithm supports two types of stereo redundancy coding: intensity stereo coding and middle/side (MS) stereo coding. All layers support intensity stereo coding. Layer III also supports MS stereo coding.

Both forms of redundancy coding exploit another perceptual property of the human auditory system. Psychoacoustic results show that above about 2 kHz and within each critical band, the human auditory system bases its perception of stereo imaging more on the temporal envelope of the audio signal than on its temporal fine structure.

In intensity stereo mode the encoder codes some upper-frequency subband outputs with a single summed signal instead of sending independent left- and right-channel codes for each of the 32 subband outputs. The intensity stereo decoder reconstructs the left and right channels based only on the single summed signal and independent left- and right-channel scale factors. With intensity stereo coding, the spectral shape of the left and right channels is the same within each intensity-coded subband, but the magnitude differs.

The MS stereo mode encodes the left- and right-channel signals in certain frequency ranges as middle (sum of left and right) and side (difference of left and right) channels. In this mode, the encoder uses specially tuned threshold values to compress the side-channel signal further.

Future MPEG/audio standards: Phase 2
The second phase of the MPEG/audio compression standard, MPEG-2 audio, has just been completed. This new standard became an international standard in November 1994. It further extends the first standard with the multichannel, multilingual, and lower-bit-rate support listed earlier, as well as the following.

Lower audio sampling rates. Besides 32, 44.1, and 48 kHz, the new standard accommodates 16-, 22.05-, and 24-kHz sampling rates. The commentary channels can have a sampling rate that is half the high-fidelity channel sampling rate.

In many ways this new standard is compatible with the first MPEG/audio standard (MPEG-1). MPEG-2/audio decoders can decode MPEG-1/audio bitstreams. In addition, MPEG-1/audio decoders can decode two main channels of MPEG-2/audio bitstreams. This backward compatibility is achieved by combining suitably weighted versions of each of the up to 5.1 channels into a "down-mixed" left and right channel. These two channels fit into the audio data framework of an MPEG-1/audio bitstream. Information needed to recover the original left, right, and remaining channels fits into the ancillary data portion of an MPEG-1/audio bitstream or in a separate auxiliary bitstream.

Results of subjective tests conducted in 1994 indicate that, in some cases, the backward compatibility requirement compromises the audio compression performance of the multichannel coder. Consequently, the ISO MPEG group is currently working on an addendum to the MPEG-2 standard that specifies a non-backward-compatible multichannel coding mode that offers better coding performance.

Acknowledgments
I wrote most of this article while employed at Digital Equipment Corporation. I am grateful for the funding and support this company provided for my work in the MPEG standards. I also appreciate the many helpful editorial comments given by the many reviewers, especially Karlheinz Brandenburg, Bob Dyas, Jim Fiocca, Leon van de Kerkhof, and Peter Noll.

References
1. ISO/IEC Int'l Standard IS 11172-3, "Information Technology-Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbits/s-Part 3: Audio."
2. K. Brandenburg et al., "The ISO/MPEG-Audio Codec: A Generic Standard for Coding of High Quality Digital Audio," 92nd AES Convention, preprint 3336, Audio Engineering Society, New York, 1992.
3. D. Pan, "Digital Audio Compression," Digital Technical Journal, Vol. 5, No. 2, 1993.
4. ISO/IEC Int'l Standard IS 11172-4, "Information Technology-Coding of Moving Pictures and Associated Audio for Digital Storage Media at up to about 1.5 Mbits/s-Part 4: Conformance."
5. C.C. Wirtz, "Digital Compact Cassette: Audio Coding Technique," 91st AES Convention, preprint 3216, Audio Engineering Society, New York, 1991.
6. G. Plenge, "A Versatile High Quality Radio System for the Future Embracing Audio and Other Applications," 94th AES Convention, Audio Engineering Society, New York, 1994.
7. C. Grewin and T. Ryden, "Subjective Assessments on Low Bit-Rate Audio Codecs," Proc. 10th Int'l AES Conf., Audio Engineering Society, 1991, pp. 91-102.
8. J.H. Rothweiler, "Polyphase Quadrature Filters: A New Subband Coding Technique," Proc. Int'l Conf. IEEE ASSP, 27.2, IEEE Press, Piscataway, N.J., 1983, pp. 1280-1283.
9. H.J. Nussbaumer and M. Vetterli, "Computationally Efficient QMF Filter Banks," Proc. Int'l Conf. IEEE ASSP, IEEE Press, Piscataway, N.J., 1984, pp. 11.3.1-11.3.4.
10. P. Chu, "Quadrature Mirror Filter Design for an Arbitrary Number of Equal Bandwidth Channels," IEEE Trans. on ASSP, Vol. ASSP-33, No. 1, Feb. 1985, pp. 203-218.
11. B. Scharf, "Critical Bands," in Foundations of Modern Auditory Theory, J. Tobias, ed., Academic Press, New York and London, 1970, pp. 159-202.
12. J.D. Johnston, "Transform Coding of Audio Signals Using Perceptual Noise Criteria," IEEE J. on Selected Areas in Comm., Vol. 6, Feb. 1988, pp. 314-323.
13. D. Wiese and G. Stoll, "Bitrate Reduction of High Quality Audio Signals by Modeling the Ears' Masking Thresholds," 89th AES Convention, preprint 2970, Audio Engineering Society, New York, 1990.
14. K. Brandenburg and J. Herre, "Digital Audio Compression for Professional Applications," 92nd AES Convention, preprint 3330, Audio Engineering Society, New York, 1992.
15. K. Brandenburg and J.D. Johnston, "Second Generation Perceptual Audio Coding: The Hybrid Coder," 88th AES Convention, preprint 2937, Audio Engineering Society, New York, 1990.
16. K. Brandenburg et al., "ASPEC: Adaptive Spectral Perceptual Entropy Coding of High Quality Music Signals," 90th AES Convention, preprint 3011, Audio Engineering Society, New York, 1991.
17. J. Princen, A. Johnson, and A. Bradley, "Subband/Transform Coding Using Filter Bank Designs Based on Time Domain Aliasing Cancellation," Proc. Int'l Conf. IEEE ASSP, IEEE Press, Piscataway, N.J., 1987, pp. 2161-2164.
18. B. Edler, "Coding of Audio Signals with Overlapping Block Transform and Adaptive Window Functions," Frequenz, Vol. 43, 1989, pp. 252-256 (in German).

Davis Pan is a principal staff engineer in Motorola's Chicago Corporate Research Labs, where he heads a team working on audio compression and audio processing algorithms. He received both a bachelor's and a master's degree in electrical engineering from the Massachusetts Institute of Technology in 1981, and a PhD in electrical engineering from the same institute in 1986. Pan has worked in the MPEG/audio compression standards since 1990 and led the development of the ISO simulations for the MPEG/audio algorithm.

Readers may contact the author at Motorola, Inc., 1301 E. Algonquin Rd., Schaumburg, IL 60196, e-mail [email protected].