<<

PAPERS

ISO/IEC MPEG-4 High-Definition Scalable *

RALF GEIGER,1 RONGSHAN YU,2** JU¨ RGEN HERRE,1 SUSANTO RAHARDJA,2 ([email protected]) ([email protected]) ([email protected]) ([email protected])

SANG-WOOK KIM,3 XIAO LIN,2 and MARKUS SCHMIDT1 (sangwookkim@.com) ([email protected]) ([email protected])

1Fraunhofer IIS, Erlangen, Germany 2Institute for Infocomm Research, Singapore 3Samsung Electronics, Suwon, Korea

Recently the MPEG Audio standardization group has successfully concluded the - ization process on technology for lossless coding of audio signals. A summary of the scalable lossless coding (SLS) technology as one of the results of this standardization work is given. MPEG-4 scalable lossless coding provides a fine-grain scalable lossless extension of the well-known MPEG-4 AAC perceptual audio coder up to fully lossless reconstruction at word lengths and sampling rates typically used for high-resolution audio. The underlying innova- tive technology is described in detail and its performance is characterized for lossless and near lossless representation, both in conjunction with an AAC coder and as a stand-alone com- pression engine. A number of application scenarios for the new technology are discussed.

0 INTRODUCTION dia types, such as DVD-Audio [11], SACD [12], HD- DVD [11], or Blu-Ray Disc [13]. In the realm of audio, Perceptual coding of high-quality audio signals has ex- this is achieved by employing lossless formats with high perienced a tremendous evolution over the past two de- resolution (word length) and/or high sampling rate. cades, both in terms of research progress and in worldwide There exist several proprietary lossless formats, the deployment in products. Examples for successful applica- most prominent of which are MLP Lossless [14], [15], tions include portable players, Internet audio, audio DTS-HD [16], and [17]. Furthermore, sev- for (such as VCD, DVD), and digital broad- eral coding systems have gained some promi- casting. Several phases of international standardization nence through the Internet. Among them are FLAC [18], have been conducted successfully [1]–[3]. Many recent Monkey’s Audio [19], and OptimFROG [20]. research and standardization efforts focus at achieving On the other end of the bit-rate scale, approaches for good at even lower bit rates [4]–[9] to ac- scalable perceptual audio coding have been developed in commodate storage and transmission channels with lim- the recent years. Within the International Telecommuni- ited quality (such as terrestrial and satellite cation Union (ITU) scalability in wide-band audio coding and third-generation mobile ). Never- has been adopted only recently [21]. Within MPEG-4 Au- theless, for other scenarios with higher transmission band- dio [3], scalability has been developed and adopted some- width available, there is a general trend toward providing time earlier [22]–[24]. This includes the realm of speech the consumer with a media experience of extremely high coding where scalability is provided both for harmonic fidelity [10], as it is frequently associated with the terms vector excitation coding (HVXC) [25], [26] and code- “high definition” and “high resolution” and dedicated me- excited linear prediction (CELP) [27], [28]. In this context the ISO/MPEG Audio standardization group decided to start a new work item to explore tech- *Manuscript received 2006 March 23; revised 2006 October nology for lossless and near-lossless coding of audio sig- 24 and November 27. nals by issuing a call for proposals for relevant technology **Currently with , San Francisco, CA. in 2002 [29]. Three specifications emerged from this call

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 27 GEIGER ET AL. PAPERS as amendments to the MPEG-4 Audio standard. First the mation to produce losslessly/near-losslessly coded audio standard on lossless coding of 1-bit oversampled signals for high-definition applications. [30] specifies the of highly over- sampled 1-bit sigma–delta-modulated audio signals as 1.2 AAC Background they are stored on the SACD media under the name Direct Since the MPEG-4 SLS coder has been designed to Stream Digital (DSD) [31]. Second the audio lossless cod- operate as an enhancement to MPEG-4 AAC [3], [34], its ing (ALS) specification [32] describes technology for the structure is closely related to that of the underlying AAC lossless coding of PCM-coded signals at sampling rates up core coder [35]. This section sketches the architectural to 192 kHz as well as floating-point audio. features of MPEG-4 AAC as they are relevant to the SLS This paper provides an overview of the third descen- enhancement technology. dant, the scalable lossless coding (SLS) specification [33], The underlying AAC provides efficient percep- which extends traditional methods for perceptual audio tual audio coding with high quality and achieves broadcast coding toward lossless coding of high-resolution audio quality at a of about 64 kbps per channel [36]. Fig. signals in a scalable way. Specifically, it allows to scale up 3 gives a very concise view of the AAC encoder’s struc- from a perceptually coded representation (MPEG-4 ad- ture. The audio signal is processed in a blockwise spectral vanced audio coding, AAC) to a fully lossless representa- representation using the modified discrete cosine trans- tion with high definition, including a wide range of inter- form (MDCT) [37]. The resulting 1024 spectral values are mediate near-lossless representations. The paper starts quantized and coded considering the required accuracy as with an explanation of the general concept, discusses the demanded by the perceptual model. This is done to mini- underlying novel technology components, and character- mize the perceptibility of the introduced quantization dis- izes the codec in terms of compression performance and tortion by exploiting masking effects. Several neighboring complexity. Finally a number of application scenarios are spectral values are grouped into so-called scale factor briefly outlined. bands, sharing the same scale factor for quantization. Prior to the quantization/coding (Q/) tool, a number of pro- 1 CONCEPT AND TECHNOLOGY OVERVIEW cessing tools operate on the spectral coefficients in order to improve coding performance for certain situations. The This section provides an overview of the principle and most important tools are the following. the structure of the scalable lossless coding technology and its combination with the AAC perceptual audio coder. • The temporal (TNS) tool [38] carries out This combination will be referred to as high-definition predictive filtering across frequency in order to achieve advanced audio coding (HD-AAC) in the following. a temporal shaping of the quantization noise according to the signal envelope and in this way optimize temporal 1.1 General System Structure masking of the quantization distortion. Fig. 1 shows a very high-level view of the structure of • The M/S stereo coding tool [39] provides sum/ the HD-AAC encoder. The input signal is first coded by an difference coding of channel pairs, exploits interchannel AAC (“core layer”) encoder. The SLS algorithm uses the redundancy for near-monophonic signals, and avoids output to enhance the system’s performance toward loss- binaural unmasking. less coding, resulting in an enhancement layer. The two layers of information are subsequently multiplexed into 1.3 HD-AAC/SLS Enhancement one high-definition bit stream. Decoding of an HD-AAC The SLS scalable lossless enhancement works on top of bit stream is illustrated in Fig. 2. From the HD-AAC bit this AAC architecture. The structures of an HD-AAC en- stream, decoders are able either to decode the perceptually coder and decoder are shown in Figs. 4 and 5. coded AAC part only, or to use the additional SLS infor- In the encoder the audio signal is converted into a spectral representation using the integer modified discrete cosine transform (IntMDCT) [40], [41]. This transform represents an invertible integer approximation of the MDCT and is well-suited for lossless coding in the frequency domain. Other AAC coding tools, such as mid/side coding or tempo- ral noise shaping, are also considered and performed on the IntMDCT spectral coefficients in an invertible integer fash- ion, thus maintaining the similarity between the spectral val- Fig. 1. HD-AAC encoder structure. ues used in the AAC coder and in the lossless enhancement.

Fig. 2. HD-AAC decoder structure. Fig. 3. Structure of AAC encoder (simplified).

28 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC

The link between the perceptual core and the scalable as it results from the noise-shaping process of the AAC lossless enhancement layer of the coder [42] is provided perceptual coder, and thus takes advantage of the AAC by an error-mapping process. The error-mapping process perceptual model. removes the information that has already been coded in the The following sections will describe the principles un- perceptual (AAC) path from the IntMDCT spectral coef- derlying the design of SLS in greater detail by focusing on ficients such that only the resulting (IntMDCT) residuals its IntMDCT filter bank, error-mapping strategy, and en- are coded in the enhancement encoder, and in this way the tropy coding parts. coding efficiency benefits from the underlying AAC layer. The error-mapping process also preserves the probability 2 FILTER BANK distribution skew of the original IntMDCT coefficients and thus permits an efficient encoding of the residual by means 2.1 IntMDCT of two bit-plane coding processes, namely, bit-plane Golomb The IntMDCT, as introduced in [40], is an invertible coding (BPGC) [43] and context-based integer approximation of the MDCT, which is obtained by (CBAC) [44], and a so-called low-energy-mode encoder, utilizing the “lifting scheme” [45] or “ladder network” all of which will be described later in further detail. [46]. It enables efficient lossless coding of audio signals By using the bit-plane coding process the lossless en- [40] by means of entropy coding of integer spectral coef- hancement is performed in a fine-grain scalable way. A ficients. Furthermore, as the IntMDCT closely approxi- lossless reconstruction is obtained if all the bit planes of mates the behavior of the MDCT, it allows to combine the the residual are coded, transmitted, and decoded com- strategies of perceptual and lossless audio coding in the pletely. If only parts of the bit planes are decoded, a lossy frequency domain into a common framework [42]. reconstruction of the signal is obtained with a quality be- Fig. 7 illustrates the close relationship between IntMDCT tween the AAC layer’s and lossless reconstruction. In or- and MDCT for a small audio segment by displaying the der to achieve optimal perceptual quality at intermediate respective magnitude spectra of both filter banks. The dif- bit rates, the bit-plane coding is started from the most ference between MDCT and IntMDCT values is visible as significant bit (MSB) for all scale-factor bands, and pro- a small noise floor that is typically much lower than the gresses toward the least significant bit (LSB) for all bands error introduced by perceptual coding. Thus the IntMDCT (see Fig. 6). In this way the bit-plane coding process pre- allows to code efficiently the quantization error of an serves the overall spectral shape of the quantization noise, MDCT-based perceptual codec in the frequency domain.

Fig. 4. Structure of HD-AAC encoder.

Fig. 5. Structure of HD-AAC decoder.

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 29 GEIGER ET AL. PAPERS

The following section provides some detail on the deriva- with the time-domain input x(k) and the windowing func- tion of the IntMDCT from the MDCT. tion w(k) can be decomposed into two blocks, namely, windowing and time-domain aliasing (TDA) and discrete 2.2 Decomposition of MDCT cosine transform of type IV (DCT-IV). This is illustrated The MDCT, defined by in Fig. 8 for both forward MDCT and inverse MDCT. In the forward IntMDCT, the windowing/TDA block is 2N 1 2 − ͑2k + 1 + N͒͑2m + 1͒␲ calculated by 3N/2 so-called lifting steps: X͑m͒ = w͑k͒x͑k͒ cos , ͱN ͚ 4N k=0 w͑N − 1 − k͒ − 1 m = 0, . . . , N − 1 (1) x͑k͒ 1 − w͑k͒ ۋ ͩx͑N − 1 − k͒ͪ ΂01΃ w͑N − 1 − k͒ − 1 101 − × w͑k͒ ͩ−w͑k͒ 1ͪ ΂01΃ x͑k͒ N × , k = 0, . . . , − 1. (2) ͩx͑N − 1 − k͒ͪ 2

After each lifting step a rounding operation is applied to stay in the integer domain. Every lifting step can be in- verted by simply adding the subtracted value.

2.3 Integer DCT-IV For the IntMDCT, the DCT-IV is calculated in an in- vertible integer fashion, called the integer DCT-IV. The Fig. 6. Bit-plane scan process in SLS. multidimensional lifting (MDL) scheme [41], [47] is ap- plied in order to reduce the required rounding operations in the invertible integer approximation as much as possible and in this way minimize the approximation error noise floor (see Fig. 7). The following block matrix decompo- sition for an invertible matrix T and the identity matrix I shows the basic principle of the MDL scheme,

T 0 −I 0 I −T 0 I = . (3) ͩ0 T −1 ͪ ͩT −1 I ͪͩ0 I ͪͩI T −1 ͪ

The three blocks in this decomposition are the so-called MDL steps. Similar to the conventional lifting steps, they can be transferred to invertible integer mappings by round- ing the floating-point values after being processed by T or T−1, and they can be inverted by subtracting the values that Fig. 7. IntMDCT and MDCT magnitude spectra. have been added.

Fig. 8. MDCT and inverse MDCT by windowing/TDA and DCT-IV.

30 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC

When coding stereo signals, this decomposition is used consuming as few bits as possible. Rather than encoding to obtain an integrated calculation of the M/S matrix and all IntMDCT coefficients c[k] directly, in the lossless en- the integer DCT-IV for the left and right channels. The hancement layer, it is more efficient to make use of the number of required rounding operations is 3N per channel information that has already been coded by the AAC layer. pair, or 3N/2 per channel, which is the same number as for This is achieved by the error-mapping process, which the windowing/TDA stage. As a whole, this so-called ste- produces a deterministic residual signal between the reo IntMDCT requires only three rounding operations per IntMDCT spectral values and their counterparts from the sample, including M/S processing. Concerning mono sig- AAC layer. nals, the same structure can be used, but has to be ex- In order to produce a residual signal e[k] with the small- tended by some additional lifting steps to obtain the inte- est possible entropy, given the core AAC quantized value ger DCT-IV of one block; see [47]. This mono IntMDCT i[k], it is sufficient in most cases to use the minimum mean requires four rounding operations per sample. square error (MMSE) residual obtained by subtracting the IntMDCT coefficient c[k] from its MMSE reconstruction ,{[E{c[k]|i[k ס [Noise Shaping cˆ[k 2.4 The lossless coding efficiency of the IntMDCT is fur- given i[k], that is, ther improved by utilizing a noise-shaping technique, in- troduced in [48]. In the lifting steps where time-domain e͓k͔ = c͓k͔ − cˆ ͓k͔. (4) signals are processed, the rounding operations include an error feedback mechanism to provide a spectral shaping of Here E{·} denotes the expectation operation. However, in the approximation noise. This approximation noise affects SLS a somewhat different approach is adopted, where the the lossless coding efficiency mainly in the high- residual signal is given by frequency region where audio signals usually carry only a c͓k͔ − thr͓k͔, i ͓k͔ " 0 small amount of energy, especially at sampling rates of 96 e͓k͔ = (5) c k , i k 0 kHz and above. Hence the low-pass characteristics of the ͭ ͓ ͔ ͓ ͔ = approximation noise improve the lossless coding effi- where thr (i[k]) is the quantized value closer to zero ciency. A first-order noise-shaping filter is employed in with respect to i[k], and is calculated via table lookup and the three stages of lifting steps in the windowing/TDA linear interpolation to ensure the deterministic behavior processing and in the first rounding stage of the integer necessary for lossless coding. This error-mapping process DCT-IV processing. Fig. 9 compares the resulting ap- is illustrated in Fig. 10, where two different cases are proximation error between the IntMDCT values and the shown. In the first case the IntMDCT coefficients c[k] MDCT values rounded to integer, when the IntMDCT belong to a scale-factor band that has been quantized and operates both with and without noise shaping. coded at the AAC encoder (significant band). For these coefficients the residual coefficients are obtained by sub- 3 ERROR MAPPING AND tracting the quantization thresholds from their correspond- RESIDUAL CALCULATION ing c[k], resulting in a residual spectrum with reduced amplitude. In the other case c[k] belongs to a band that is The objective of the error-mapping/residual calculation not coded or has been quantized to zero in the AAC en- stage is to produce an integer enhancement signal that coder (insignificant band). In this case the residual spec- enables lossless reconstruction of the audio signal while trum is simply the IntMDCT spectrum itself. This error-mapping process offers many advantages that permit better coding efficiency in the enhancement layer. First, as illustrated in Fig. 11, we notice that if the ampli- tude of c[k] is distributed exponentially, the amplitude of

Fig. 9. Mean-squared approximation error of stereo IntMDCT (including M/S) with and without noise shaping. Fig. 10. Illustration of error-mapping process.

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 31 GEIGER ET AL. PAPERS e[k] will likewise be distributed approximately exponen- the bit-plane symbols are coded arithmetically with a tially. Thus it can be coded very efficiently by using the structural frequency assignment rule. Considering an input e[0], . . , e[N − 1]} for which N is the} ס BPGC/CBAC coding process described in the next sec- data vector e tion. Secondly, for a significant band we find the follow- dimension of e, each element e[k] in e is first represented ing property for the coefficients c[k]: in a binary format as

|thr͑i͓k͔͒| Յ |c͓k͔| Ͻ |thr͑i͓k͔͒|+⌬͑i͓k͔͒ (6) M−1 j where e͓k͔ = ͑2s͓k͔ − 1͒ ͚ b͓k, j͔ и 2 , k = 0, . . . , N − 1 j=0 ⌬͑i͓k͔͒ = thr͑|i͓k͔|+1͒ − thr͑|i͓k͔|͒ (7) (8) is the quantization step size of i[k]. Clearly, the magnitude which consists of a sign symbol of e[k] is bounded by the value of ⌬(i[k]), and its sign is 1, e͓k͔ Ն 0 also identical to that of i[k] for i[k] " 0. As a result, no ᭝ s͓k͔ = , k = 0, . . . , N − 1 (9) additional data transmission is needed for conveying the ͭ0, e͓k͔ Ͻ 0 sign information. This is referred to as “implicit signaling” k, and , . . . ,1 ס and bit-plane symbols b[k, j] {0, 1}, i in the SLS specification. ∈ M is the MSB for e that satisfies 2M−1 Յ max { e[k] }< This implicit signaling mechanism assumes that the in- | | N − 1. The bit-plane symbols are then , . . . ,0 ס2M,k put to the SLS layer and the AAC core quantization value scanned and coded from MSB to LSB over all the ele- are “well synchronized.” This may not be true in all cases, ments in e, and coded by using an arithmetic code with a given that AAC encoders have all the freedom to optimize structural frequency assignment QL given by the encoding process and thus may produce an i[k] that J does not satisfy Eq. (6). Thus SLS also includes an “ex- 1 , j Ͻ L plicit signaling” mechanism, which employs an MMSE 2 error mapping as defined in Eq. (4), where cˆ[k] is approxi- QL͓j͔ = (10) mated by the AAC inverse quantization value. In this case 1 j−L , j Ն L all the side information necessary for decoding e[k] is Ά1 + 2 2 signaled explicitly from the encoder to the decoder. where the “lazy-plane” parameter L can be selected using the adaptation rule 4 ENTROPY CODING LЈ+1 To achieve best efficiency in the compression of the en- L = min͕LЈ ∈ ޚ|2 N Ն A͖ (11) hancement layer information, several methods for entropy and A is the absolute sum of the data vector e. coding of the IntMDCT residual information are employed. Although this BPGC coding process delivers excellent compression performance for data that stem from an in- 4.1 Combined BPGC/CBAC Coding dependent and identically distributed (iid) source with The bit-plane Golomb code (BPGC) coding process Laplacian distribution [43], it lacks the capability to ex- used in SLS is basically a bit-plane coding scheme where plore the statistical dependencies between data samples that may exist in certain sources to achieve better com- pression performance. These correlations can be captured very effectively by incorporating context modeling tech- nology into the BPGC coding process, where the fre- quency assignment for arithmetic coding of bit-plane sym- bols is not only dependent on the distance of the current bit plane from the lazy-plane parameter as in the frequency assignment rule [Eq. (10)], but also on other possible el- ements that may affect the probability distribution of these bit-plane symbols. In the context of lossless coding of IntMDCT spectral data for audio, elements that possibly affect the distribu- tion of the bit-plane symbols include the frequency loca- tions of the IntMDCT spectral data, the magnitude of the adjacent spectral lines, and the status of the AAC core quantizer. In order to capture these correlations, several contexts are used in the context-based arithmetic code (CBAC) of SLS. The guide in selecting these contexts is to try to find those contexts that are “most” correlated to the distribution of the bit-plane symbols. Fig. 11. Distribution of residual signal from error-mapping In CBAC, three types of contexts are used, namely, the process. frequency band (FB) context, the distance to lazy (D2L)

32 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC context, and the significant state (SS) context. The detailed context assignments are summarized in the following. 4.3 Smart Decoding Context 1: Frequency Band (FB) It is found in [44] Due to their fine-grain scalability, HD-AAC bit streams that the probability distribution of bit-plane symbols of can be truncated at any bit rate lower than what would be IntMDCT varies for different frequency bands. Therefore needed for a fully lossless reconstruction to produce near- in CBAC the IntMDCT spectral data are classified into lossless representations of the audio signal. In the context three different FB contexts, namely, low band (0–4 kHz), of arithmetic decoding, the smart decoding method pro- midband (4–11 kHz), and high band (above 11 kHz). vides a way to optimally decode such a truncated bit Context 2: Distance to Lazy (D2L) The D2L context is stream. It decodes additional symbols in the absence of defined as the distance of the current bit plane j to the incoming bits when a decoding buffer still contains mean- BPGC lazy-plane parameter L, as defined in the following ingful information for arithmetic decoding in the CBAC/ equation: BPGC mode and/or low-energy mode. Decoding contin- ues up to the point where no ambiguity exists in deter- 3 − j + L, j − L Ն −2 mining a symbol [51]. D2L = (12) ͭ6, otherwise. This is motivated by the BPGC frequency assignment rule, 5 OTHER CODING TOOLS Eq. (10), which is based on the fact that the skew of the As counterparts to the underlying AAC perceptual au- probability distribution of bit-plane symbols from a source dio coder, SLS provides a number of integerized versions with a (near) Laplacian distribution tends to decrease as of AAC coding tools. the number of D2L decreases. To reduce the total number of the D2L context, all the bit planes with j − L < −2 are 5.1 Integer M/S grouped into one context where all the bit-plane symbols In the AAC codec the M/S tool allows to choose indi- are coded with probability 0.5. vidually between mid/side and left/right coding modes on Context 3: Significant State (SS) The SS context at- a scale-factor-band basis. As shown in Section 2.3 it can tempts to capture the factors that correlate with the distri- provide either a global left/right or a global mid/side spec- bution of the magnitude of the IntMDCT residual in one tral representation. In order to make the integer spectral place. These include the magnitude of the adjacent values fit the spectral values from the AAC core on a IntMDCT spectral lines and the quantization interval of scale-factor-band basis, an invertible integer version of the the AAC core quantizer if it has been quantized previously M/S mapping is used. It is based on a lifting decomposi- in the core encoder. Further detail on the SS context can be tion of the normalized M/S matrix, that is, a rotation found in [49]. by ␲/4,

4.2 Low-Energy-Mode Coding 1 1 −1 1 1 − ͌2 10 The BPGC/CBAC coding process described earlier = ͌2 ͩ11ͪ ͩ01 ͪͩ1 ͌2 1ͪ works well for sources with (near) Laplacian distribution, ր which usually is the case for most audio signals [50]. 1 1 − ͌2 However, it was also found that for some music items × . (14) ͩ01 ͪ there are time/frequency (T/F) regions with very low en- ergy levels where the IntMDCT spectral data are in fact 5.2 Integer TNS dominated by the rounding errors of the IntMDCT (see When the temporal noise-shaping (TNS) tool is used in Fig. 7) with a distribution that is substantially different the AAC core, the resulting MDCT spectral values deviate from Laplacian. In order to encode those low-energy re- from the IntMDCT spectral values. In order to compensate gions efficiently, the BPGC/CBAC coding process is re- for this, the same TNS filter as in the AAC core is applied placed by low-energy-mode coding, as shown in the to the integer spectral values in the lossless enhancement. following. To assure lossless operation, the TNS filter is converted to The low-energy-mode coding is invoked for scale factor a deterministic invertible integer filter. bands for which the BPGC parameter L is smaller than or equal to 0. Then the amplitude of the residual spectral data Table 1. Binarization of IntMDCT error spectrum at .low-energy mode ס e[k] is first converted into a unitary binary string b {b[0], b[1], ..., b[pos], . . .}, as illustrated in Table 1, with M being the maximum bit plane. It can be seen that Amplitude of e[k] Binary String {b[pos]} the probability distribution of these symbols is a function 00 of the position pos, and the distribution of e[k]; 1 1 0 2 1 1 0 Pr͕b͓pos͔ = 1͖ = Pr͕e͓k͔ Ͼ pos|e͓k͔ Ն pos͖ (13) ...... M M 2 −2 11...... 10 where 0 Յ pos < 2 .b[pos] is then coded arithmetically 2M −1 11...... 11 depending on its position pos and the BPGC parameter L pos 0 1 2 3 . . . with a trained frequency table.

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 33 GEIGER ET AL. PAPERS

6 BIT STREAM MULTIPLEXING AAC coder, and its ability to be used in combination with other types of AAC-based , such as AAC scalable From a mechanical point of view, the SLS coded data, or AAC/BSAC. including the core layer AAC bit stream and the enhance- ment bit stream, can be carried in multiple elementary 7.1 Oversampling Mode streams (ES) in an MPEG-4 system [52]. As shown in Fig. In the context of high-definition audio applications it is 12, the AAC bit stream is carried in the so-called base frequently desirable to achieve lossless signal reconstruc- layer ES, and the enhancement bit stream is carried in one tion at sampling rates of 96 kHz, or even 192 kHz. While or more enhancement layer ESs. Each ES is thus com- the MPEG-4 AAC coder supports these sampling rates, it posed of a sequence of access units (AUs), where one AU typically achieves best coding efficiency for high-quality contains one audio frame from the AAC bit stream or the perceptual coding at sampling rates between 32 and 48 enhancement bit stream. kHz. Thus in order to allow both efficient core layer cod- From an application point of view, such a bit stream ing at common rates and lossless reconstruction at higher structure provides great flexibility in constructing either a rates, MPEG-4 SLS includes an additional feature called large-step scalable system or a fine-grain scalable system “oversampling mode.” This refers to the possibility of let- with SLS. For example, in a scalable audio streaming ap- ting the lossless enhancement operate at a sampling rate plication, the server stores multiple SLS ESs at predefined higher than that of the AAC core codec. The ratio between bit rates, each assigned with a different stream priority. the SLS sampling rate and the AAC sampling rate is called During the streaming, the ESs are transmitted in the order “oversampling factor” and can be either 1, 2, or 4. For of their stream priority, and ESs with lower stream priority example, the lossless enhancement can operate at a rate of are dropped by either the streaming server or the network 192 kHz, whereas the AAC core operates at 48 kHz, see gateway whenever the transmission is insuffi- Table 2. cient to stream a full-rate lossless bit stream. Alternatively The mapping between the two coding layers is achieved it is also possible to implement a lightweight bit stream by using a time-aligned framing and a correspondingly truncation algorithm in the streaming server or in the mul- longer IntMDCT in the lossless enhancement. For ex- timedia gateway to truncate the enhancement stream AUs ample, an IntMDCT of the size of 4096 spectral values is directly according to the available bandwidth to achieve the fine granular bit rate scalability. Table 2. Example combinations of sampling rates for AAC 7 MODES OF OPERATION core and lossless enhancement. AAC @ AAC @ AAC @ So far the HD-AAC coder has been described as a com- 48 kHz 96 kHz 192 kHz bination of a regular AAC core layer coder (such as low complexity AAC) and an SLS enhancement layer. This SLS @ 48 kHz × section introduces further features offered by the SLS en- SLS @ 96 kHz ×× SLS @ 192 kHz × × × hancement layer that concern its combination with the

Fig. 12. Structure of MPEG-4 SLS bit stream.

34 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC used in case of oversampling by a factor of 4, and the 1024 MDCT values from the AAC core are mapped to the lower 8.1 Lossless Compression Performance 1024 IntMDCT values. In addition to allowing for opti- This section reports the performance of HD-AAC in mum AAC performance, the lossless performance is im- terms of its ability for lossless compression of various proved when using longer IntMDCT transforms. (A trans- audio material. As a figure of merit, the compression ratio form length of 2048 or 4096 provides a better lossless is defined as performance for stationary signals than a transform length original file size of 1024.) compression ratio = . (15) compressed file size 7.2 Combination with MPEG-4 Tables 3 and 4 show the lossless compression perfor- Scalable AAC mance for two major sets of test material, that is, the In order to account for varying or unknown transmis- MPEG-4 lossless audio coding test set (donated by Mat- sion capacity, the MPEG-4 AAC codec also provides sev- sushita Corporation and containing in part recordings per- eral scalable coding modes [3], [34]. MPEG-4 scalable formed by the New York Symphonic Ensemble). For both AAC allows perceptual audio coding with one or more sets the compression results are given for an HD-AAC AAC mono or stereo layers and can be combined with an configuration with an AAC core layer running at 128 SLS enhancement layer. This results in an overall coder kbit/s stereo plus SLS enhancement, and for an SLS stand- that offers fine-grain scalability in the range between loss- alone configuration. less reconstruction and the AAC representation, and scal- As could be expected from theory, it can be observed in ability in several steps within the perceptually coded AAC Table 3 that an increase in word length reduces the aver- representation. age compression ratio (due to the fact that the least sig- nificant bits of the PCM codewords are more random and 7.2 Combination with MPEG-4 AAC/BSAC thus less compressible). On the other hand, increasing the Similar to the combination with scalable AAC, the SLS sampling rate improves compression because of the in- enhancement can also be operated on top of MPEG-4 creased correlation between adjacent samples (assuming AAC/BSAC as a core layer coder. This provides a fine- sound material with typical high-frequency characteristics). grain scalable representation both between lossless and For the MPEG-4 lossless audio test set, an average com- perceptually coded audio and within the perceptually pression ratio of 2:1 can be achieved easily at a sampling coded range. The latter has a granularity of 1 kbps per rate of 48 kHz and 16-bit word length. This is competitive channel. with the best of other known lossless compression systems [53]. It can also be observed that for the AAC-based mode 7.4 Stand-Alone Operation an additional bit rate of only 30–40 kbps is required com- Finally the SLS lossless enhancement can also operate pared to the stand-alone mode for lossless representation. as a straightforward stand-alone codec, without any un- This reduces the bit rate consumption by 90–100 kbps derlying core codec. Nonetheless, this operation mode of- compared to simulcast solutions that transmit both an 128- fers both full lossless coding capability and fine-grain kbps AAC bit stream and a stand-alone SLS lossless bit scalability. stream simultaneously.

8.2 Near-Lossless Compression Performance 8 PERFORMANCE The bit-plane coding of residual spectral values (that is, This section quantifies the performance of the HD-AAC of the AAC quantization error) allows to refine the initial codec in various operation modes. While the compression AAC quantization successively as more bits from the SLS ratio can be seen as the sole measure of merit for lossless enhancement layer are decoded. With each additional de- operation scenarios, an evaluation of performance for coded bit plane the quantization error is reduced by 6 dB. near-lossless operation requires audio quality measure- Consequently an increasing safety margin with respect to ments at various data rates and operating points. audibility is added as the bit rate increases.

Table 3. Lossless compression results for MPEG-4 lossless audio test set.

SLS + AAC @ 128 kbps/Stereo (AAC @ 48 kHz sampling rate) SLS Stand-Alone Average Average Compression Bit Rate Compression Bit Rate Ratio (kbps) Ratio (kbps) 48 kHz/16 bit 2.09 735 2.20 698 48 kHz/24 bit 1.55 1490 1.58 1454 96 kHz/24 bit 2.09 2201 2.13 2160 192 kHz/24 bit 2.60 3543 2.63 3509 Overall 2.08 1992 2.12 1955

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 35 GEIGER ET AL. PAPERS

of running numerous listening tests for subjective quality 8.2.1 Evaluation of Near-Lossless Audio Quality assessment at individual operating points, the evaluation While it may seem sufficient for most purposes to pro- employed the PEAQ measurement (BS.1387-1) [57], vide perceptually transparent reproduction of audio signals which provides methods for objective measurements of by using conventional perceptual audio coders (such as perceived audio quality in scenarios that are normally as- AAC at a sufficient bit rate), there are applications that sessed by ITU-R BS.1116 testing. The most essential re- demand still higher audio quality. This is especially the sults can be seen in Figs. 13–15; see also [58]. case for professional audio production facilities, such as The graphs show the estimated subjective sound quality archiving and broadcasting, in which audio signals may expressed as objective difference grade (ODG) values, undergo many cycles of encoding/decoding (tandem cod- which were computed by a PEAQ measurement. The ing) before being delivered to the consumer. This leads to evaluation procedure consists of multiple cycles of tandem an accumulation of introduced coding distortion and may coding/decoding with up to 16 cycles. The standard set of lead to unacceptable final audio quality [54], unless sub- critical MPEG-4 audio items for perceptual audio coding stantial headroom toward audibility is provided by each evaluations was used. ODG values of 0, −1, −2, −3, −4 coding step, for example, by using coding algorithms with correspond to a subjective audio quality of “indistinguish- very high quality or bit rate. able from original,”“perceptible but not annoying,”“slightly ITU-R BS.1548-1 [55] defines the requirements for au- annoying,”“annoying,” and “very annoying,” respectively. dio coding systems for digital broadcasting, assuming a Fig. 13 shows the achieved ODG values as a function of codec chain consisting of so-called contribution, distribu- tandem cycles for a traditional AAC coder running at a bit tion, and emission codecs. According to this recommen- rate of 128 kbps/stereo. As expected, it can be observed dation, and based on ITU-R BS.1116-1 [56], audio codecs that the audio quality degrades significantly with an in- for contribution and distribution should fulfill the follow- creasing number of tandem cycles, depending on the test ing requirements: item. For this reason tandem coding is not a recommended practice for such coders. The quality of sound reproduced after a reference contribution/ Fig. 14 displays the corresponding tandem coding re- distribution cascade [. . .] should be subjectively indistinguish- sults for the HD-AAC combination running at 512 kbps/ able from the source for most types of audio programme ma- stereo (AAC at 128 kbps + SLS enhancement at 384 kbps). terial. Using the triple stimuli double blind with hidden reference It can be noted that the audio quality remains consistently test, described in Recommendation ITU-R BS.1116 [. . .], this requires mean scores generally higher than 4.5 in the impair- at a very high level, even after a total of 16 tandem cycles. ment 5-grade scale, for listeners at the reference listening po- This illustrates the high robustness of the HD-AAC rep- sition. The worst rated item should not be graded lower than 4. resentation against tandem coding. According to these measurements, the aforementioned BS.1548-1 audio qual- In accordance with these recommendations, tests were ity requirement is fulfilled with a considerable safety mar- run on signals encoded or decoded with HD-AAC. Instead gin. Furthermore, when placed in tandem with AAC (such

Table 4. Lossless compression results for commercial CD test set.

Compression Ratio SLS+AAC @ SLS CD Items (16 bit/44.1 kHz) 128 kbps/Stereo Stand-Alone ACDC—Highway to Hell ( 80206) 1.31 1.36 Avril Lavigne—Let Go (Arista 14740) 1.36 1.41 Backstreet Boys—Greatest Hits Chapter One (Jive 41779) 1.39 1.45 Brian Setzer—The Dirty Boogie (Interscope 90183) 1.43 1.49 Cowboy Junkies—Trinity Session (RCA-8568) 1.93 2.04 Grieg—Peer Gynt, von Karajan (DG 439010) 2.63 2.83 Jannifer Warnes—Famous Blue Raincoat (BMG 258418) 2.07 2.20 Marlboro Music Festival—DISC A (Bridge 9108) 2.23 2.35 Marlboro Music Festival—DISC B (Bridge 9108) 2.20 2.33 Nirvana—Nirvana (Interscope 493523) 1.50 1.56 Philip Jones—40 Famous Marches, CD1 (Decca 416241) 1.99 2.11 Philip Jones—40 Famous Marches, CD2 (Decca 416241) 2.00 2.11 Pink Floyd—Dark Side of the Moon (Capitol 46001) 1.76 1.85 Rebecca Pidgeon—The Raven (Chesky 115) 1.88 1.97 Ricky Martin (Sony 69891) 1.33 1.38 Schubert Piano Trio in E-flat (Sony 48088) 2.74 2.90 Spaniels—The Very Best Of (Collectables 7243) 2.41 2.62 Steeleye Span—Below the Salt (Shanachie 79039) 1.85 1.95 Suzanne Vega—Solitude Standing (A&M 5136) 1.74 1.83 Westminster Concert Bell Choir—Christmas Bells (Gothic Records 49055) 2.55 2.71 Overall 1.85 1.94

36 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC as for final audio distribution over a narrow-band chan- nel), the resulting audio quality is not degraded signifi- 8.2.2 Stand-Alone SLS Operation cantly by the preceding HD-AAC tandem cascade. Further The SLS codec can also operate as a stand-alone loss- details can be found in [59]. less codec when the AAC core codec is not used, some-

Fig. 13. Test results: AAC tandem coding.

Fig. 14. Test results: HD-AAC tandem coding.

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 37 GEIGER ET AL. PAPERS times also referred to as the “noncore mode”. Despite of its performance can be much lower, depending on the audio simple structure (only IntMDCT and BPGC/CBAC mod- material to be encoded. In contrast, HD-AAC is able to ules are used), this mode allows efficient lossless coding guarantee a certain compression ratio while providing [59]. Furthermore, fine-grain scalability by truncated bit- lossless or near-lossless signal representation, depending plane coding is also possible in this mode. Given that the on the input signal. stand-alone SLS codec does not include any perceptual model to estimate masking thresholds, it is interesting to 9 DECODER COMPLEXITY investigate the audio quality resulting from a truncation of the SLS bit stream. The computational complexity for SLS decoding can be Due to the behavior of bit-plane coding in this mode, a evaluated by counting the total number of standard in- constant signal-to-noise ratio is achieved in each scale- structions (multiplications, additions, bit shifts, compari- factor band. With each additional bit plane the signal-to- sons, memory transfers) required for performing the de- noise ratio improves by 6 dB. While this behavior does not coding process on a generic 32-bit fixed-point CPU. allow SLS to compete with efficient perceptual codecs at The main components contributing to the computational low bit rates (for example, AAC at 128 kbps/stereo), this complexity of SLS are: simple approach works quite well at higher bit rates in the 1) IntMDCT filter bank near-lossless range. Fig. 15 shows tandem coding results 2) Bit-plane arithmetic decoder for the stand-alone SLS codec operating at 512 kbps/ 3) AAC Huffman decoding stereo. It reaches about the same near-lossless audio qual- 4) AAC + SLS inverse error mapping ity as the AAC-based HD-AAC mode discussed in the 5) Integer M/S stereo coding previous section. 6) Unpacking of tables. At a constant bit rate of 768 kbps most test items still Items 3) to 5) are only required in the AAC-based mode, require the truncation of some coder frames. Nevertheless item 6) only if the necessary tables are not precomputed. the corresponding PEAQ measurements indicate that no degradation of subjective audio quality occurs in this tan- 9.1 Number of Instructions dem coding scenario for both the AAC-based mode and Table 5 lists the number of instructions required for de- the stand-alone mode; see [59]. This provides an interest- coding in the AAC-based mode, with the AAC core operat- ing operating point for HD-AAC modes, corresponding to ing at 64 kbps per channel. Table 6 shows the correspond- a guaranteed 2:1 compression. While other stand-alone ing for the SLS in stand-alone mode (without the lossless codecs can also provide an average compression AAC core layer). For both tables, values are provided for of 2:1 for suitable test material, their peak compression both implementations with and without table prepacking.

Fig. 15. Test results: stand-alone SLS tandem coding.

38 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC

SLS technology facilitates the possibility that lower bit- 9.2 ROM Requirements rate versions of the archive’s lossless audio items can be For an implementation of SLS in stand-alone mode, a extracted at any time to allow applications such as remote ROM size of 4 kbytes is required. For the AAC-based data browsing. mode, the ROM requirement is 45 kbytes. As can be seen Broadcast Contribution/Distribution Chain In a from Tables 5 and 6, a tradeoff between ROM requirement broadcast environment HD-AAC audio coding technology and number of instructions can be made by precomputing could be used in all stages comprising archiving, contri- the necessary table values. More details on SLS compu- bution/distribution, and emission. In the broadcast chain tational complexity can be found in [59]. The computa- one main feature of the technology can be used: In every tional complexity of AAC decoding is analyzed in [60]. stage where lower bit rates are required, the bit stream is merely truncated, and no reencoding is therefore required. 10. APPLICATIONS Consumer Disc-Based Delivery HD-AAC technology can also be used in consumer disc-based delivery of music As the primary functionality of HD-AAC audio coding content. It enables the music disc to deliver both lossless is lossless audio coding, it can be used in applications that and lossy audio on the same medium. require bit-exact reconstruction, such as studio operation, Internet Delivery of Audio In such an application sce- music disc delivery, or audio archiving. Due to its inherent nario the available transmission bandwidth can vary dra- scalability, HD-AAC audio coding technology in fact fits matically across different access network technologies and into virtually every application that requires audio com- over time. As a result, the same audio content at a variety pression. Several potential application scenarios are listed of bit rates and qualities may need to be kept ready at the here. server side. HD-AAC technology provides a “one-file” Studio Operations HD-AAC audio coding technology solution for such a requirement. is useful for the storage of audio at various points in studio Audio Streaming HD-AAC technology delivers the operations such as recording, editing, mixing, and premas- vital bit-rate scalability for streaming applications on tering as studio procedures are designed to preserve the channels with variable quality of service (QoS) conditions. highest levels of quality. The scalability of the SLS layer Examples for this kind of streaming applications include also provides a nice solution to situations in which the band- Internet audio streaming and multicast streaming applica- width is not sufficient to support fully lossless quality. tions that feed several channels of differing capacity. Archival Application Archives of sound recordings Digital Home The idea of the digital home is to create are very common in studios, record labels, libraries, and so an open and transparent home network platform that en- on. These archives are tremendously large and certainly ables consumers to easily create, use, manage, and share compression is essential. In addition, the scalability of the digital content such as audio, , or image. In a typical

Table 5. Maximum numbers of INT32 operations per sample for SLS decoding with AAC core.

Combined Cycle 1 ס Frame Length Muls Adds/Subs Ors Shifts Negs Movs All Tables Preunpacked 4096 or 512 19.50 93.46 34.50 72.60 34.26 16.54 270.86 2048 or 256 20.25 91.22 31.50 73.54 32.25 16.29 265.05 1024 or 128 18.00 88.99 28.50 68.50 30.85 15.99 250.83 Tables Unpacked in Place 4096 or 512 19.50 123.21 34.50 88.01 34.26 16.54 316.10 2048 or 256 20.25 117.47 31.50 85.54 32.25 16.29 303.30 1024 or 128 18.00 105.24 28.50 75.00 30.85 15.99 273.58

Table 6. Maximum number of INT32 operations per sample for SLS stand-alone decoding.

Combined cycle 1 ס Frame Length Muls Adds/Subs Ors Shifts Negs Movs All Tables Preunpacked 4096 or 512 18 86.63 34.50 67.10 29.93 9.54 245.70 2048 or 256 18.75 84.39 31.50 68.04 27.92 9.29 239.89 1024 or 128 16.5 82.16 28.50 63.00 26.52 8.99 225.67 Tables Unpacked in Place 4096 or 512 18.00 116.38 34.50 82.6 29.93 9.54 290.05 2048 or 256 18.75 110.64 31.50 80.04 27.92 9.29 278.14 1024 or 128 16.50 98.41 28.50 69.50 26.52 8.99 248.42

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 39 GEIGER ET AL. PAPERS setup for audio, the user can download the HD-AAC Coding,” presented at the 112th Convention of the Audio coded bit streams in lossless quality from the service pro- Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. vider and archive them on the home music server. These 50, pp. 509, 510 (2002 June), convention paper 5553. bit streams are then streamed, or downloaded to different [6] ISO/IEC 14496-3:2001/Amd.2:2004, “Coding of audio terminals at differing quality for playback. Audio-Visual Objects—Part 3: Audio, Amendment 2: Parametric Coding for High Quality Audio,” International 11. CONCLUSIONS Standards Organization, Geneva, Switzerland (2004). [7] C. den Brinker, E. Schuijers, and W. Oomen, “Para- The new ISO/MPEG specification for scalable lossless metric Coding for High-Quality Audio,” presented at the coding extends the well-known perceptual coding scheme 112th Convention of the Audio Engineering Society, J. AAC toward lossless and near-lossless operation, and in Audio Eng. Soc. (Abstracts), vol. 50, p. 510 (2002 June), this way enables its use in the context of high-definition convention paper 5554. applications. The HD-AAC scheme offers competitive [8] ISO/IEC FCD 23003-1, “MPEG-D (MPEG Audio lossless compression rates at all relevant operating points Technologies)—Part 1: MPEG Surround” International (word length and sampling rate). For distribution on band- Standards Organization, Geneva, Switzerland (2006). width-limited channels a perceptually coded compatible [9] J. Breebaart, J. Herre, C. Faller, J. Roeden, F. My- AAC bit stream can simply be extracted from the com- burg, S. Disch, H. Purnhagen, G. Hotho, M. Neusinger, K. posite HD-AAC stream. Alternatively, the SLS part can Kjoerling, and W. Oomen, “MPEG Spatial Audio Coding/ also be used as a simple and versatile stand-alone com- MPEG Surround: Overview and Current Status,” pre- pression engine. In both cases the fidelity of the signal sented at the 119th Convention of the Audio Engineering representation can be scaled with fine granularity within a Society, J. Audio Eng. Soc. (Abstracts), vol. 53, p. 1228 wide range of near lossless representations. This enables (2005 Dec.), convention paper 6599. lossless or near lossless transmission of high-definition [10] “Special Issue: High-Resolution Audio,” J. Audio audio with a guaranteed maximum rate. We anticipate that Eng. Soc., vol. 52, pp. 116–260 (2004 Mar.). this flexibility will make HD-AAC the technology of [11] “DVD Forum,” http://www.dvdforum.org/ choice for many applications that call for both very high forum.shtml (2004). audio quality and delivery over a wide range of transmis- [12] Royal Electronics, “Super Audio CD Sys- sion channels. tems,” http://www.licensing.philips.com/information/ sacd/ (2006). 12 ACKNOWLEDGMENT [13] Blu-Ray Disc Association, “Blu-Ray Disc,” http:// www.blu-raydisc.com/ (2006). The authors would like to thank all their colleagues at [14] Dolby Laboratories Inc., “MLP Lossless” http:// the MPEG audio subgroup who supported the lossless www.dolby.com/consumer/technology/mlp_lossless.html standardization activity, especially Takehiro Moriya (2006). (NTT) for inspiring these standardization activities, Til- [15] M.A. Gerzon, P.G. Craven, J.R. Stuart, M.J. man Liebchen (Technical University of Berlin) for chair- Law, and R. J. Wilson, “The MLP Lossless Compression ing the ad hoc group on lossless audio coding, and Yuriy System,” in Proc 17th AES Conf. (Florence, Italy, 1999 Reznik (Real Networks/Qualcomm) for his thorough com- Sept.), pp. 61–75. plexity evaluations. [16] DTS Inc., “DTS HD,” http://www.dtsonline.com/ consumer/dtshd.php (2006). 13 REFERENCES [17] Apple Computer Inc., “Apple Quicktime,” http:// www.apple.com/downloads/macosx/apple/quicktime651 [1] ISO/IEC 11172-3, “Coding of Moving Pictures and .html (2006). Associated Audio for Digital Storage Media at up to about [18] J. Coalson, “FLAC—Free Lossless ,” 1.5 Mbit/s—Part 3: Audio,” International Standards Orga- http://flac.sourceforge.net (2006). nization, Geneva, Switzerland (1992). [19] M. T. Ashland, “Monkey’s Audio—A Fast and [2] ISO/IEC 13818-3, “Information Technology— Powerful Lossless Audio ,” http://www Generic Coding of Moving Pictures and Associated Au- .monkeysaudio.com (2004). dio—Part 3: Audio,” International Standards Organiza- [20] F. Ghido, “OptimFROG,” http://www.losslessaudio tion, Geneva, Switzerland (1994). .org (2006). [3] ISO/IEC 14496-3:2001, “Coding of Audio-Visual [21] ITU-T G.729.1, “G.729 Based Embedded Variable Objects—Part 3: Audio,” International Standards Organi- Bit-Rate Coder: An 8–32 kbit/s Scalable Wideband Coder zation, Geneva, Switzerland (2001). Bitstream Interoperable with G.729,” International Tele- [4] ISO/IEC 14496-3:2001/Amd.1:2003, “Coding of communication Union, Geneva, Switzerland (2006). Audio-Visual Objects—Part 3: Audio, Amendment 1: [22] B. Grill, “A Bit-Rate Scalable Perceptual Coder for Bandwidth Extension,” International Standards Organiza- MPEG-4 Audio presented at the 103rd Convention of the tion, Geneva, Switzerland (2003). Audio Engineering Society, J. Audio Eng. Soc. (Ab- [5] M. Dietz, L. Liljeryd, K. Kjo¨rling, and O. Kunz, stracts), vol. 15, p. 1005 (1997 Nov.), preprint 4620. “Spectral Band Replication—A Novel Approach in Audio [23] J. Herre, E. Allamanche, K. Brandenburg, M.

40 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC

Dietz, B. Teichmann, B. Grill, A. Jin, T. Moriya, N. Time Domain Aliasing Cancellation,” in Proc IEEE Iwakami, T. Norimatsu, M. Tsushima, and T. Ishikawa, ICASSP (Dallas, TX, 1987), pp. 2161–2164. “The Integrated Filterbank-Based Scalable MPEG-4 Au- [38] J. Herre and J. D. Johnston, “Enhancing the Per- dio Coder,” presented at the 105th Convention of the Au- formance of Perceptual Audio Coders by Using Temporal dio Engineering Society, J. Audio Eng. Soc. (Abstracts), Noise Shaping (TNS),” presented at the 101st Convention vol. 46, p. 1039 (1998 Nov.), preprint 4810. of the Audio Engineering Society, J. Audio Eng. Soc. (Ab- [24] S. H. Park, Y. B. Kim, S. W. Kim, and Y. S. Seo, stracts), vol. 44, p. 1175 (1996 Dec.), preprint 4384. “Multi-Layer Bit-Sliced Bit-Rate Scalable Audio Coding,” [39] J. D. Johnston, J. Herre, M. Davis, and U. Gbur, presented at the 103rd Convention of the Audio Engineer- “MPEG-2 NBC Audio—Stereo and Multichannel Coding ing Society, J. Audio Eng. Soc. (Abstracts), vol. 45, p. Methods,” presented at the 101st Convention of the Audio 1005 (1997 Nov.), preprint 4520. Engineering Society, J. Audio Eng. Soc. (Abstracts), vol. [25] M. Nishiguchi, “MPEG-4 Speech Coding,” in 44, p. 1175 (1996 Dec.), preprint 4383. Proc. 17th AES Conf. (Florence, Italy, 1999 Sept.), pp. [40] R. Geiger, T. Sporer, J. Koller, and K. Branden- 139–146. burg, “Audio Coding Based on Integer Transforms,” pre- [26] M. Nishiguchi, A. Inoue, Y. Maeda, and J. Matsu- sented at the 111th Convention of the Audio Engineering moto, “Parametric Speech Coding—HVXC at 2.0–4.0 Society, J. Audio Eng. Soc. (Abstracts), vol. 49, p. 1230 kbps,” presented at the IEEE Workshop on Speech Cod- (2001 Dec.), convention paper 5471. ing, Porvoo, Finland (1999 June). [41] R. Geiger, Y. Yokotani, and G. Schuller, “Im- [27] M.S. Schroeder and B.S. Atal, “Code-Excited proved Integer Transforms for Lossless Audio Coding,” in Linear Prediction (CELP): High-Quality Speech at Very Proc. 37th Asilomar Conf. on Signals, Systems and Com- Low Bit Rates,” in Proc. IEEE ICASSP (Tampa, FL, 1985 puters (Pacific Grove, CA, 2003 Nov.). Mar.), pp. 937–940. [42] R. Geiger, J. Herre, J. Koller, and K. Brandenburg, [28] T. Nomura, M. Iwadare, M. Serizawa, and K. “IntMDCT—A Link between Perceptual and Lossless Au- Ozawa, “A Bitrate and Bandwidth Scalable CELP Coder,” dio Coding,” in Proc. IEEE ICASSP (Orlando, FL, 2002 in Proc. IEEE ICASSP (Seattle, WA, 1998 May), pp. May). 341–344. [43] R. Yu, C. C. Ko, S. Rahardja, and X. Lin, “Bit- [29] ISO/IEC JTC1/SC29/WG11, “Final Call for Pro- Plane Golomb Code for Sources with Laplacian Distribu- posals on MPEG-4 Lossless Audio Coding,” MPEG2002/ tions,” in Proc. IEEE ICASSP (Hong Kong, China, 2003 N5208, Shanghai, China (2002 Oct.). Apr.), pp. 277–280. [30] ISO/IEC 14496-3:2001/Amd.6:2005, “Coding of [44] R. Yu, X. Lin, S. Rahardja, C. C. Ko, and H. Audio-Visual Objects—Part 3: Audio, Amendment 6: Huang, “Improving Coding Efficiency for MPEG-4 of Oversampled Audio,” International Scalable Lossless Coding,” in Proc. IEEE ICASSP (Phila- Standards Organization, Geneva, Switzerland (2005) delphia, PA, 2005 May). [31] E. Knapen, D. Reefman, E. Janssen, and F. Bruek- [45] I. Daubechies, and W. Sweldens, “Factoring Wave- ers, “Lossless Compression of 1-Bit Audio,” J. Audio Eng. let Transforms into Lifting Steps,” Tech. Rep. Bell Labo- Soc., vol. 52, pp. 190–199 (2004 Mar.). ratories, Lucent Technologies (1996). [32] ISO/IEC 14496-3:2005/Amd.2:2006, “Coding of [46] F. Bruekers and A. Enden, “New Networks for Audio-Visual Objects—Part 3: Audio, Amendment 2: Au- Perfect Inversion and Perfect Reconstruction,” IEEE J. dio Lossless Coding (ALS), New Audio Profiles and Selected Areas Comm., vol. 10, pp. 130–137 (1992 Jan). BSAC Extensions,” International Standards Organization, [47] R. Geiger, Y. Yokotani, G. Schuller, and J. Herre, Geneva, Switzerland (2006). “Improved Integer Transforms Using Multi-Dimensional [33] ISO/IEC 14496-3:2005/Amd.3:2006, “Coding of Lifting,” in Proc. IEEE ICASSP (Montreal, Canada, 2004 Audio-Visual Objects—Part 3: Audio, Amendment 3: May). Scalable Lossless Coding (SLS),” International Standards [48] Y. Yokotani, R. Geiger, G. Schuller, S. Oraintara, Organization, Geneva, Switzerland (2006). and K. R. Rao, “Improved Lossless Audio Coding Using [34] J. Herre and H. Purnhagen, “General Audio Cod- the Noise-Shaped IntMDCT,” presented at the IEEE 11th ing,” in The MPEG-4 Book, IMSC Multimedia Ser., F. DSP Workshop, Taos Ski Valley, NM (2004 Aug). Pereira and T. Ebrahimi, Eds. (Prentice-Hall, Englewood [49] R. Yu, X. Lin, S. Rahardja, and H. Haibin, “Pro- Cliffs, NJ, 2002). posed Core Experiment for Improving Coding Efficiency [35] M. Bosi, K. Brandenburg, S. Quackenbush, L. in MPEG-4 Audio Scalable Coding (SLS),” ISO/IEC Fielder, K. Akagiri, H. Fuchs, M. Dietz, J. Herre, G. Dav- JTC1/SC29/WG11, M10683, Munich, Germany (2004 idson, and Y. Oikawa, “ISO/IEC MPEG-2 Advanced Au- Mar.) dio Coding,” J. Audio Eng. Soc., vol. 45, pp. 789–814 [50] R. Yu, X. Lin, S. Rahardja, and C. C. Ko, “A (1997 Oct.). Statistics Study of the MDCT Coefficient Distribution for [36] ISO/IEC JTC1/SC29/WG11, “Report on the Audio,” in Proc. ICME (Taipei, Taiwan, 2004 June). MPEG-2 AAC Stereo Verification Tests,” MPEG1998/ [51] K. H. Choo, E. Oh, J. H. Kim, and C. Y. Son, N2006, San Jose, CA (1998 Feb.). “Enhanced Performance in the Functionality of Fine Grain [37] J. Princen, A. Johnson, and A. Bradley, “Subband/ Scalability,” presented at the 119th Convention of the Au- Using Filter Bank Designs Based on dio Engineering Society, J. Audio Eng. Soc. (Abstracts),

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 41 GEIGER ET AL. PAPERS vol. 53, pp. 1227, 1228 (2005 Dec.), convention paper cluding Multichannel Sound Systems,” International Tele- 6597. communication Union, Geneva, Switzerland (1994). [52] ISO/IEC 14496–1:2004, “Coding of Audio-Visual [57] ITU-R BS.1387-1, “Method for Objective Mea- Objects—Part 1: Systems,” International Standards Orga- surements of Perceived Audio Quality,” International nization, Geneva, Switzerland (2004). Telecommunication Union, Geneva, Switzerland (1998). [53] M. Hans and R. W. Schafer, “Lossless Compres- [58] R. Geiger, M. Schmidt, J. Herre, and R. Yu, sion of ,” IEEE Signal Process Mag., vol. “MPEG-4 SLS—Lossless and Near-Lossless Audio Cod- 18, pp. 21–32 (2001 July). ing Based on MPEG-4 AAC,” presented at the Interna- [54] AES Technical Committee of Coding of Audio tional Symposium on Communications, Control and Sig- Signals, “Perceptual Audio Coders: What to Listen For,” nal Processing, Marrakech, Morocco (2006 Mar.) CD-ROM with tutorial information and audio examples, [59] ISO/IEC JTC1/SC29/WG11, “Verification Report Audio Engineering Society, New York (2001). on MPEG-4 SLS,” MPEG2005/N7687, Nice, France [55] ITU-R BS.1548-1, “User Requirements for Audio (2005 Oct.). Coding Systems for Digital Broadcasting,” International [60] ISO/IEC JTC1/SC29/WG11, “Revised Report on Telecommunication Union, Geneva, Switzerland Complexity of MPEG-2 AAC Tools,” MPEG1999/N2957 (2001–2002). (Melbourne, Australia, 1999 Oct.), http://www.chiariglione [56] ITU-R BS.1116-1, “Methods for the Subjective .org/mpeg/working_documents/mpeg-02/audio/AAC_ Assessment of Small Impairments in Audio Systems In- tool_complexity(rev).zip

THE AUTHORS

R. Geiger R. Yu J. Herre S. Rahardja

S.-W. Kim X. Lin M. Schmidt

Ralf Geiger received a diploma degree in mathematics schemes. He is a coeditor of the ISO/IEC standard from the University of Regensburg, Regensburg, Ger- “MPEG-4 Scalable Lossless Coding (SLS).” many, in 1997. G In 1998 he joined the Audio/Multimedia Department at the Fraunhofer Institute for Integrated Circuits (IIS), Er- Rongshan Yu received a B.Eng. degree from Shanghai langen, Germany. From 2000 to 2004 he was with the Jiaotong University, Shanghai, P. R. China, in 1995, and Fraunhofer Institute for Digital Media Technology M. Eng. and Ph.D. degrees from the National University (IDMT), Ilmenau, Germany. In 2005 he returned to Fraun- of Singapore in 2000 and 2005, respectively. hofer IIS. He was with the Centre for Signal Processing, School of Mr. Geiger is working on the development and stan- Electrical and Electronics, Nanyang Technological Uni- dardization of perceptual and lossless audio coding versity, Singapore, from 1999 to 2001, and with the Insti-

42 J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February PAPERS ISO/IEC MPEG-4 SCALABLE AAC

tute for Infocomm Research (I2R), A*STAR, Singapore, communications, signal processing, and logic synthesis. from 2001 to 2005. He is currently with Dolby Laborato- He was the recipient of the IEE Hartree Premium Award ries, San Francisco, CA, USA. His research interests in- for best journal paper published in the IEE Proceedings in clude audio coding, , and digital signal 2001 and the Tan Kah Kee Young Inventors’ Gold Award processing. (Open Category) in 2003. He has served in various IEEE Dr. Yu received the Tan Kah Kee Young Inventors’ and SPIE-related professional activities in the area of mul- Gold Award (Open Category) in 2003. He has been work- timedia and is actively involved in technical committees ing on the development and standardization of MPEG of the IEEE Circuits and Systems and Signal Processing lossless audio coding schemes and coedited the ISO/IEC societies. He is associate editor of the Journal of Visual standard “MPEG-4 Scalable Lossless Coding (SLS).” Communication and Image Representation. Since 2002 he has been an active participant of the MPEG and since then G has been involved in the development of MPEG-4 Audio, Ju¨rgen Herre joined the Fraunhofer Institute for Inte- where he is one of the contributors to the MPEG-4 SLS grated Circuits (IIS) in Erlangen, Germany, in 1989. Since and ALS. then he has been involved in the development of percep- G tual coding algorithms for high-quality audio, including the well-known ISO/MPEG-Audio Layer III coder (aka Sang-Wook Kim received a B.A. degree in electrical MP3). In 1995 he joined Bell Laboratories for a postdoc- engineering from the Yonsei University, Korea, in 1989. toral term, working on the development of MPEG-2 ad- He completed his master’s degree in 1991 at the Korea vanced audio coding (AAC). Since the end of 1996 he has Advanced Institute of Science and Technology while re- been back at Fraunhofer, working on the development of searching digital signal processing. advanced multimedia technology, including MPEG-4, G MPEG-7, and secure delivery of audiovisual content, cur- rently as the chief scientist for the audio/multimedia ac- Xiao Lin received a Ph.D. degree from the Electronics tivities at Fraunhofer IIS, Erlangen. and Computer Science Department of the University of Dr. Herre is a fellow of the Audio Engineering Society, Southampton, Southampton, UK, in 1993. cochair of the AES Technical Committee on Coding of He worked at the Centre for Signal Processing, Audio Signals, and vice chair of the AES Technical Coun- Nanyang Technological University, Singapore, as a re- cil. He also is an IEEE senior member, served as an as- search fellow and senior research fellow for about five sociate editor of the IEEE Transactions on Speech and years. Subsequently he joined DeSOC Technology as a Audio Processing, and is an active member of the MPEG technical director and then the Institute for Infocomm Re- audio subgroup. Outside his professional life he is a ded- search in 2002. There he was member of technical staff, icated amateur musician. lead scientist, and principal scientist and managed the Me- dia Processing Department until 2006. He is now with G Fortemedia Inc. as a senior director. He actively partici- Susanto Rahardja received a Ph.D. degree in electrical pated in international standards such as MPEG-4, and electronic engineering from the Nanyang Technologi- JPEG2000, and JVT. cal University, Singapore. Dr. Lin is a senior member of the IEEE. He joined the Centre for Signal Processing, Nanyang G Technological University, in 1996 and he has been a fac- ulty member at its School of Electrical and Electronic Markus Schmidt received a Dipl.-Ing. degree in media Engineering since 2001. In 2002 he joined the Institute for technology from the Technical University of Ilmenau, Infocomm Research (I2R) and is currently the director of Germany, in 2004. During his studies he spent a year at its Media Division. He is overseeing research areas on the University of Strathclyde in Glasgow, Scotland. signal processing (audio coding, video/image processing), After completing an intership at the Fraunhofer Institute media analysis (text/speech, image, video), media security for Digital Media Technology (IDMT) in Ilmenau, he (biometrics, computer vision, and surveillance), and sen- joined the Fraunhofer Institute for Integrated Circuits sor network. (IIS), Erlangen, Germany, in 2005. There his research in- Dr. Rahardja has published in more than 150 interna- terests include low-delay and lossless audio coding schemes tional journals and at conferences in the areas of digital and their implementation in real-time environments.

J. Audio Eng. Soc., Vol. 55, No. 1/2, 2007 January/February 43