A new model-based algorithm for optimizing the MPEG-AAC in MS-stereo Olivier Derrien, Gael Richard

To cite this version:

Olivier Derrien, Gael Richard. A new model-based algorithm for optimizing the MPEG-AAC in MS- stereo. IEEE Transactions on Audio, Speech and Language Processing, Institute of Electrical and Electronics Engineers, 2008, 16 (8), pp.1373-1382. ￿10.1109/TASL.2008.2002068￿. ￿hal-00467510￿

HAL Id: hal-00467510 https://hal.archives-ouvertes.fr/hal-00467510 Submitted on 26 Mar 2010

HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. 1 A new model-based algorithm for optimizing the MPEG-AAC in MS-stereo Olivier Derrien and Ga¨el Richard, Senior Member, IEEE

Abstract— In this paper, a new model-based algorithm Middle (M) and Sides (S) channels. M and S are coded for optimizing the MPEG-Advanced Audio Coder (AAC) instead of L and R and the reverse transformation is per- in MS-stereo mode is presented. This algorithm is an ex- tension to stereo signals of prior work on a statistical model formed at the decoder side. In MPEG-AAC, LR and MS of quantization noise. Traditionally, MS-stereo coding ap- mode can be used alternatively for each frequency subband proaches replace the Left (L) and Right (R) channels by the and each frame. The moderate coding gain is compen- Middle (M) and Sides (S) channels, each channel being in- sated by a small amount of side-information (one bit per dependently processed, almost like a monophonic signal. In contrast, our method proposes a global approach for coding subband). Other linear transformations have been pro- both channels in the same process. A model for the quan- posed in the literature: Inter-channel decorrelation with tization error allows us to tune the quantizers on channels a Karhunen-Loeve transform [5], inter-channel prediction M and S with respect to a distortion constraint on the re- constructed channels L and R as they will appear in the [6], [7], and more recently a time-aligned version of the MS decoder. This approach leads to a more efficient percep- transformation [8]. With all these techniques, the coding tual noise-shaping and avoids using complex psychoacoustic gain is increased for some signals, but at the expense of models built on the M and S channels. Furthermore, it pro- vides a straightforward scheme to choose between LR and additional side information. MS modes in each subband for each frame. Subjective lis- Parametric stereo coding is another popular scheme for tening tests prove that the coding efficiency at a medium increasing the coding efficiency. A core monophonic coder bitrate (96 kbits/s for both channels) is significantly bet- is used in combination with additional parameters that de- ter with our algorithm than with the algorithm, without increase of complexity. scribe the stereo information. The resulting auxiliary bit- Keywords— Perceptual audio coding, MPEG-AAC, MS- stream usually requires very few coding bits, but the orig- stereo, statistical model, quantization, scalefactor, bitrate inal signals in channels L and R can not be totally recov- constraint, distortion constraint, optimization algorithm. ered, even for very high bitrates. Thus, parametric stereo schemes are suitable for low bitrate applications. Orig- I. Introduction inally, a simple parametric stereo mode, called Intensity Stereo (IS), was specified in the MPEG-AAC standard. The MPEG-4 Advanced Audio Coder (AAC) is the lat- It consists of coding only the M channel and an inter- est international standard for high-quality lossy audio cod- channel intensity difference parameter for each subband. ing [1], [2]. Its application field is still expanding, includ- Since, many studies have been carried out on parametric ing consumer audio equipment and digital broad- stereo (see for instance [9], [10]) and in the latest exten- casting. This has been derived in several profiles sion of MPEG-AAC, called HE-AAC v2 [11], the paramet- i.e. variations, for different applications: Low Complexity ric stereo mode can be considered as an improved version (LC-AAC), Low Delay (LD-AAC), High Efficiency (HE- of the original IS mode: more parameters can be used AAC/AACPlus) etc. The MPEG-AAC is a frame-based to describe the stereo image (intensity difference, cross- transform-coder. Its apparent complexity is due to a large correlation, phase/time difference). However, the typical variety of coding parameters, which make the optimization bitrate for the HE-AAC v2 is quite low: 24kbps for both process difficult to engineer and recent publications show channels. that AAC optimization is still a current issue [3]. In this paper, we consider high-quality/high-bitrate ap- The MPEG-AAC is a multichannel codec, designed for plications and focus on the MS-stereo mode for the MPEG- stereo and surround audio applications. An AAC audio AAC, especially on the implementation of the optimization stream can include single channels and channel pairs. A algorithm which is strongly related to the coding efficiency. single channel corresponds to a monophonic audio scene, a In a previous paper [12], we proposed a new algorithm for channel pair to a stereophonic scene (Left and Right chan- the single channel case, based on a statistical model of the nels). With the basic coding scheme for a channel pair, quantization noise. In the informative annex of the MPEG- called LR-stereo in MPEG-AAC, each channel is processed AAC standard [1], an implementation of the coding algo- as a monophonic signal. However, when a stereo signal ex- rithm is described. In this paper, it will be referred to as hibits significant inter-channel redundancy, the LR mode is the standard algorithm. Compared to this algorithm, our quite ineffective. Improving the coding efficiency by remov- method exhibits a lower complexity and a better sound ing the redundancy is possible with stereo coding modes. quality for the same bitrate. In this paper, we extend this A popular method for inter-channel decorrelation is the model to the MS-stereo case, and propose a new efficient sum-difference transformation [4]. This technique, also re- algorithm for coding a channel pair. ferred to as MS joint channel coding, consists of a linear This article is divided in three parts. First, we briefly combination of the Left (L) and Right (R) channels to get describe the MPEG-AAC codec and the MS-stereo mode. 2

CODER Optimizationloop Psychoacoustic Sub-band Scalefactor model Masking optimization Threshold Scalefactors Time- Huffman Bit- domain MDCT Quantized MDCT Sub-band coding stream signal coefficients quantization coefficients

audiosignal DECODER control Scalefactors Bit- Huffman Time- decoding Quantized Reverse MDCT stream iMDCT domain coefficients quantization coefficients signal

Fig. 1. Synopsis of a MPEG AAC codec.

Then, after recalling the main results of the monophonic consists of finding the scaling parameters A(s) which max- model, we describe our stereophonic model and the new op- imize the audio quality under a bitrate constraint. This timization algorithm. Finally, we compare our algorithm to is generally implemented with an iterative algorithm in- the standard MPEG-AAC, both in terms of audio quality cluding quantization and Huffman coding modules, and a and computational complexity. psycho-acoustic model. Both psychoacoustic model and optimization algorithm are not specified in the standard, II. MPEG-AAC MS-stereo mode in order to allow for future advances in technology that will A. Quantization and coding improve the coding efficiency.

Figure 1 presents the general scheme of a MPEG-AAC B. The MS-stereo mode codec. The audio signal is segmented in variable-length analysis windows (256 or 2048 samples) with 50% over- When the audio signal is a channel pair, we denote XL(k) lap. Over each window, the signal is transformed in a fre- and XR(k) the MDCT coefficients corresponding respec- quency domain with a Modified Discrete Cosine Transform tively to channels Left and Right. The MS transformation (MDCT) [13]. In this paper, we denote X(k) the MDCT is defined by: coefficients corresponding to a single channel, over the cur- X k 1 X k X k rent analysis window. k is a frequency index. Variable M ( ) = 2 [ L( )+ R( )] (3) length frequency subbands are defined as non-overlapping  1 XS(k) = [XL(k) XR(k)] subsets of frequency indexes: k kmin(s) kmax(s)  2 − where s is a subband index. Subband∈ { width K· · ·increases} It can be used independently for each subband. A one- along the frequency scale. bit flag per subband indicates whether the MS transforma- The MDCT coefficients are quantized subband by sub- tion is used. In MS mode, X and X are quantized and band according to: M S coded instead of XL and XR. On the decoder side, the re- 3 ˆ ˆ X(k) 4 constructed MDCT coefficients XM and XS are obtained i(k)= (1) after the reverse quantization process. Finally the reverse R A(s)   ! MS transformation is performed: where A(s) is a scaling parameter, is a rounding func- Xˆ (k) = Xˆ (k)+ Xˆ (k) tion and i(k) are the quantization indexes.R A(s) follows a L M S (4) logarithmic scale:  1 φ s XˆR(k) = XˆM (k) XˆS(k) A(s)=2 4 ( ) (2)  − where φ is an integer parameter called scalefactor. The Compared to the single channel case, the MS-stereo rounding function is not set by the MPEG standard. The mode raises three additional problems: 1) the LR/MS de- function which minimizes the quantization error is defined cision, 2) the hearing model and 3) the inter-channel bit- in [12]. A sub-optimal function is proposed in the informa- allocation. The classical approach was originally proposed tive annex of the MPEG document [1]. by J.D. Johnston et al. [4], [14]: The MS mode is enabled Both quantization index i(k) and scalefactor φ(s) are when the energy difference between channels M and S ex- coded with a noiseless Huffman coding module. Coded au- ceeds a given threshold. Masking thresholds for M and S dio data are segmented in frames. One frame corresponds are computed by extending the monophonic psychoacous- either to a single 2048-samples window or to a sequence tic model. The inter-channel bit-allocation is performed of 8 256-samples window. Optimizing the coding process according to a Perceptual Entropy criterion (see also [15]). 3

Then, a single channel noise-shaping algorithm is applied is defined and a fast method is used to solve a variable- twice to M and S. bitrate problem, i.e. to determine the set of scaling pa- In the standard algorithm, some improvements have rameters which minimizes the bitrate under a distortion been made concerning problems 1) and 3): LR and MS constraint: channels are optimized in an iterative process which uses Eε(s) T (s) (7) two nested-loops and a local decoder inside the outer-loop. ≤ The outer-loop (distortion loop) changes the scalefactor At the first step of the iterative process, T (s) is initialized values independently for each channel (L,R,M,S) accord- to Tm(s). If the resulting bitrate matches the bitrate con- ing to a Noise-To-Mask criterion. The inner-loop (bitrate straint, the coding problem is solved. Else, the distortion loop) performs a global translation of the scalefactors for levels T (s) are raised until the bitrate constraint is verified. all channels in parallel, in order to meet the global bitrate Finding the exact solution to the variable-bitrate prob- constraint (solves problem 3). On each iteration, quan- lem is practically too much time-consuming and therefore, tization and Huffman coding are performed for channels it is often preferred to find a near-optimal solution with LR and MS, and the mode which minimizes the number of a fast method. For that purpose, a new error model was coding bits in each subband is selected (solves problem 1). developed. This method is statistically optimal, and thus An improvement to this framework has been proposed practically near-optimal. by C.M. Liu et al. [16], applied to the MPEG-1 Layer III We assume that the variable-bitrate problem can be codec, which is very close to the MPEG-AAC. The main solved independently in each subband, which is almost advances are: a new method for computing the masking true. In the remaining of this paper, we omit the sub- threshold for M and S (solves problem 2), an inter-channel band index s, but subband dependant variables are noted bit-allocation based on a new criterion called Allocation with a bold font. Three different quantization modes have Entropy (solves problem 3), and a new intra-channel noise- to be considered: shaping process. 1) High resolution: When the scaling parameter is small Our method is radically different on three major issues: enough for the quantizer to work in high resolution mode First, it relies on a specific MS distortion model which al- (which means that the energy of the quantization error is lows us to tune the quantization for both channels at the small compared to the energy of the input signal), all quan- same time. Second, with our method, the inter-channel bit- tization indexes i(k) are greater than zero. We consider the allocation problem (problem 3) is solved jointly with the error coefficients ε(k) as random variables. The energy Eε noise-shaping process. Third, we consider only the distor- is also a random variable, and the strict distortion con- tion constraint on channels L and R, even in the MS-mode. straint (7) is replaced by a statistical version: Thus, problem 2) is no more an issue, as Prob Eε T α (8) only involve channels L and R. { ≤ }≥ III. Description of the new coding algorithm where α [0.5, 1] is a confidence parameter. It reflects the confidence∈ we have on the masking threshold. α close to A. Single channel error model 1 means that the threshold is judged reliable, α close to In this section, we briefly recall the main results of our 0.5 means that the threshold is not reliable. We showed in prior work on the single-channel case (see [12] for more [12] that α = 1 results in a high bitrate, whereas α = 0.9 details). The quantization error in the transform domain significantly reduces the bitrate for approximately the same is defined by: error level. ε(k)= Xˆ(k) X(k) (5) The probability density function (pdf) of Eε must be − known to solve (8). Its exact expression would be far too and the error energy in subband s is: complex, so we chose a simple model. Equation (6) shows that, if ε(k) are independent and equally distributed, and kmax(s) if K (number of MDCT coefficients) is large enough, E 2 ε Eε(s)= ε (k) (6) will follow a Gaussian law according to the central-limit k=kXmin(s) theorem. Its mean and variance are:

The usual criterion for evaluating the perceived distortion µ K E ε2 (9) in one subband is the distance between Eε(s) and a so- ' 2 called masking threshold T (s), computed by the psycho- σ2 K E ε4 E ε2 (10) m ' − acoustic model. If Eε(s) Tm(s), the masking constraint ≤       is verified, and no distortion will be perceived in this fre- We have also considered a nonasymptotic model using a quency subband. As the MPEG-AAC usually operates in Gamma-law. With this finer model, there is no assump- a fixed-bitrate mode, the available bitrate is not sufficient tion made on K. Both models are equivalent on large sub- for the masking constraint to be verified in each subband. bands. We expected similar performances on large sub- The fixed-bitrate problem can be efficiently tackled by bands and an improvement on narrow subbands. However, solving successive variable-bitrate problems: At each step as we observed no significant improvement, we finally chose of an iterative process, a distortion level per subband T (s) the simple Gaussian model. 4

In high-resolution, approximations for the second- and and when A A , the dead zone expression (17) is valid. ≥ 0 fourth-order moments of the quantization error can be ob- A0 is chosen at the junction of both modes: 2 tained, assuming that the rounding error is a white and 3 uniformly distributed random variable: EX A0 = (18) 3 2 2 Ka m 1 β K a m a m 1  E 2 1 2 2 + 2 ( 4 1 2 ) ε a2m A (11) 2 − 2 ' 2 E 4 3  q  ε  a4m1A (12) Finally, the solution of the variable-bitrate problem is: ' a) If T < EX , the nearly-optimal scaling parameter   a2 and a4 are multiplying factors which depend on the value Aopt is given by equation (16). rounding function , mp is defined in the current subband b) If T EX , we can choose any scaling parameter value R ≥ as: greater than A0, given by equation (18). 1 p mp = X(k) (13) K | | 90 k X 85 and A is the scaling parameter. For the sub-optimal round- E X ing function proposed in the MPEG document, the analytic 80 expression for the multiplying factors is: 75 T p 70 4 p+1 p+1 ap = (1 Nm) + Nm (p + 1)3p − 65 high dead resolution zone   60 where Nm =0.4054, referred to as the magic number in [1].

Energy(dB) For the optimal rounding function, the analytic expression 55 Mean (m) for the multiplying factors is: 50 Model(m+bs ) p E 2 45 e a = p (p + 1)3p 40 A A opt 0 E As ε is modelled with a Gaussian law, the distortion A (log) constraint (8) is equivalent to: Fig. 2. Example of the distortion function and quantization model. µ + βσ T (14) ≤ This model is illustrated on Figure 2, where the exact where β is a secondary parameter depending on α: values of the error energy for a 16-coefficients subband and − the model estimation as functions of the scaling parameter β = √2 Erf 1(2α 1) (15) − are drawn. We chose α = 0.9. The model curve corre- sponds to µ + βσ when A < A and to EX when A A . Erf is the standard error function [17]. We assume that 0 ≥ 0 the bitrate is a decreasing function of A. Then, the near- The typical SNR for a fixed-bitrate coder is 10 dB. We can see on figure 2 that this corresponds to the transi- optimal value of the scaling parameter Aopt is obtained when (14) is an equality. We combine equations (9) and tion mode. But one can also see that extending the high- (10) with equations (11) and (12). We get: resolution and dead-zone models is accurate.

2 B. Extending the error model to MS stereo 3 T In this section, we present the new MS-stereo model for Aopt (16) '  1 2 2  the quantization noise. When MS-stereo is enabled for a Ka2m + β 2K(a4m1 a2m 1 ) 2 − 2 particular subband, channels M and S are quantized in-   q stead of L and R. The variable-bitrate coding problem con- 2) Dead zone: When the scaling parameter is large sists of minimizing the total bitrate for both channels M enough for the quantization indexes i(k) to be all zero, and S, under a distortion constraint. The distortion con- the quantizer is in the dead zone. The bitrate is zero for straint is evaluated after the reverse MS-transformation: this subband, the output of the quantizer is also zero. The quantization error is the input signal itself, and the error EεL TL ≤ (19) energy is given by:   EεR TR 2 ≤ Eε = EX = X(k) (17) The quantization error samples in the transform domain k  X are εM (k) and εS(k). After the reverse MS-transformation, 3) Transition mode: Between the high-resolution mode the quantization error samples are: and the dead-zone, we do not propose a specific model, εL(k) = εM (k)+ εS(k) we simply extend the other two models. We consider that (20)  when A A0, the high-resolution expression (16) is valid, εR(k) = εM (k) εS(k) ≤  −  5 and the error energy in each channel is: Combining this equation with equations (26) and (27) leads to a new equation which cannot be easily simplified. To E ε2 k εL = k L( ) circumvent this difficulty, we propose an approximation of (21)  P 2 σL and σR with Taylor series. We denote:  EεR = k εR(k) 3 2 We now have to consider four different situations: AS  P ξ = 1 (30) 1) Full high-resolution: When the quantizers on channels A −  M  M and S are in high-resolution mode, EεL and EεR are con- sidered as random variables, and the statistical distortion From equation (27), we get: constraints are: 1 3 2 2 2δS +4δMS δS 2 Prob EεL TL α σL σR √K∆ A 1+ ξ + ξ { ≤ }≥ ' ' M ∆ ∆ (22)    (31) Prob EεR TR α  { ≤ }≥ where: Using the same hypothesis as with the single channel ∆ = δM + δS +4δMS (32) model, we assume that E (resp. E ) follows a Gaus- εL εR When the audio signal has a strong stereo effect, A and sian law. Thus, (22) can be written as : M AS should have close values, which would lead to ξ 0. ' µL + β σL TL It appears that this is true even with a weak stereo effect. ≤ (23) Using first order Taylor series, we get:  µ β σ T  R + R R 1 ≤ 2 2δS +4δMS δS 2 δS +2δMS µ σ2 µ σ2 1+ ξ + ξ 1+ ξ (33) The parametersL and L (resp. R and R), given by ∆ ∆ ' ∆ E 2 equations (9) and (10), depend on the moments εL and   E 4 E 2 E 4 εL (resp. εR and εR ). But the moments of the and: 3 3   2 2 quantization error as functions of the scaling parameter, σL σR γM AM + γSAS (34) given  by equations (11) and (12), involve channels M and S. ' ' 2 2 where parameters γM and γS are: Thus, µL, µR, σL and σR must be re-written as functions of the error moments on M and S. Here, we need a new K hypothesis on the correlation between the error samples γM = ∆ (δM +2δMS) εM (k) and εS(k). When a quantizer is in high-resolution q (35) mode, the quantization error is approximated to be sta- K γS = ∆ (δS +2δMS ) tistically independent from the input signal [18]. Then, q assuming that εM (k) and εS(k) are independent variables On audio excerpt #8, which has the weakest stereo ef- is reasonable. From equation (20), we get: fect in our selection, coded at 48 kbits/s, we measured E 2 E 2 E 2 E 2 ξ = 0.074 0.002 (95 % confidence interval), and the εL εR εM + εS (24) ' ' maximum− error± measured for the linear approximation of E 4 E 4 E 4 E 4 E 2 E 2 εL εR  εM + εS +6 εM  εS (25) σL and σR is 0.5%. ' ' Finally, the distortion constraint (29) is equivalent to: Combining  these  equations  with (9) and (10) applied  to channels L-R and with (11) and (12) applied to channels 3 3 2 2 Ka2m 1 + βγ A + Ka2m 1 + βγ A T M-S, we get: 2 M M M 2 S S S ≤ 3 3     (36) 2 2 µL µR Ka2 m 1 M A + m 1 SA (26) with T = min (T , T ). ' ' 2 M 2 S L R  3 3 Solving the variable-bitrate problem in full high- 2 2 3 3 2 2 σ σ K δM A + δSA +4δMSA A (27) resolution mode requires a model for the bitrate function, L ' R ' M S M S h i i.e. the amount of coding bits for a single channel, in one the parameters δM , δS and δMS are : particular subband, as a function of the scaling parameter. 2 2 As the coding module uses 11 Huffman codebooks, code- δM = a4m1M a2m 1 − 2 M words from codebook #11 being interleaved with escape se- 2 2 quences, building an analytical model for the bitrate func- δS = a4m1S a2m 1 (28) − 2 S tion seems very difficult. So, we chose an empirical ap- proach: We measured the bitrate function for each value of 2 δMS = a m 1 m 1 2 2 M 2 S the subband width K, on a database of 8 audio excerpts, actually excerpts #1, 2, 3, 5, 6 in table I, and 3 other ones As one can see, EεL and EεR have approximately the same mean and variance. Thus, the distortion constraints (23) that were not retained for the listening tests. The con- are equivalent to a single equation : clusion is that the number of coding bits per subband can be reasonably modelled by a decreasing linear function of µ + β σL µ + β σR min (TL, TR) (29) log (A). On Figure 3, we plot the mean bitrate function L ' R ≤ 6

80 80

70 70

60 60

Energy(dB) 50 Energy(dB) 50

S dead-zone S dead-zone 40 40 M dead-zone M dead-zone A A 0M A 0M A A 0S A 0S M (log) A M (log) A S (log) S (log)

Fig. 4. Example of the distortion function (on the left) and model (on the right) for one output channel (Left).

90 random variable and EεS is a constant, equal to EXS . The 80 Meanvalue distortion constraints are: Minandmax 70 EεL = EεM + EXS TL ≤ (39) 60  EεR = EεM + EXS TR  ≤ 50 This coding problem is similar to a single-channel optimiza- 40 tion on channel M, with the following distortion constraint: 30 EεM min (TL, TR) EXS (40) ≤ − 20 numberofcodingbits According to the single-channel model, the nearly-optimal 10 solution is:

2 0 3 AA log( /0 ) T EXS AMopt − '  1 2 2  Fig. 3. Mean bitrate function and extreme values. Subband width: Ka2m M + β 2K(a4m1M a2m 1 ) 2 − 2 M 16 coefficients. Audio excerpt #8.  q (41) 3) Channel S high-resolution: When the quantizer on and the extreme values for K = 16, for audio excerpt #8. channel S is in high-resolution mode, and the quantizer One can see that the linear model is a reasonable approx- on channel M is in the dead-zone, the coding problem is imation. However, the high variability justifies the use of exactly similar to the previous one, and the nearly-optimal an iterative process where the model is only used for opti- solution is: 2 mizing the inter-channel bit-allocation. 3

Assuming a log-linear model for the bitrate as a func- T EXM ASopt − tion of the scaling parameter, it appears that minimizing 2 2 ' Ka m 1 + β 2K(a m a m 1 ) 2 2 S 4 1S 2 S the total bitrate is equivalent to maximizing log (AM AS) − 2 under the distortion constraint (36). The solution can be  q (42) easily obtained with a Lagrange-multiplier maximization 4) Full dead-zone: When both quantizers are in the dead- technique: zone, any scaling parameters AM AM0 and AS AS0 are optimal. ≥ ≥ 2 3 This model is illustrated on Figure 4. On a 16-coefficients T AMopt (37) subband, for output channel L and for one long analysis '  Ka m 1 βγ  window from audio excerpt #8, we draw the exact values 2 2 2 M + M   of the error energy and the model estimation as functions   2 3 of scaling parameters on channels M and S. One can see T that the model is an accurate approximation of the actual A (38) Sopt distortion function. '  Ka m 1 βγ  2 2 2 S + S   2) Channel M high-resolution : When the quantizer on C. Stereophonic optimization process for fixed bitrates channel M is in high-resolution mode, and the quantizer In the previous section, we have described an error model on channel S is in the dead-zone, EεM is considered as a for the MS-stereo mode. Given distortion levels on chan- 7 nels L and R, it allows us to compute the values of the scal- Initialization ing parameters A (s) and A (s) which solve the variable- 0 M S TL( s )= T mL ( s ) 0 bitrate problem, i.e. minimize the number of coding bits TR( s )= T mR ( s ) under a distortion constraint. In this section, we revisit the fixed-bitrate problem: i.e. minimizing the perceived distor- i i TLR( s ), T ( s ) tion under a bitrate constraint. As for the single-channel algorithm, this method is merely a single-loop process. The block-diagram of the optimization algorithm is presented on Figure 5. The notations are : L/R M/S errormodel • TmC(s): Masking threshold for channel C L, R and errormodel subband s, computed by the psychoacoustic model.∈{ } • i TC(s): Distortion level for channel C L, R and sub- Ai s A i s i i LR( ), ( ) AMS( s ), A ( s ) band s at iteration i. ∈{ } • Ai (s): Scaling parameter (related to the scalefactor) for C M/S channel C L,R,M,S and subband s at iteration i. L/R quantization& quantization& • i ∈{ } bC(s): Required number of coding bits for channel C Huffmancoding Huffmancoding L,R,M,S and subband s at iteration i, after quantiza-∈ { } bi s b i s bi s b i s tion and Huffman coding. LR( ), ( ) MS( ), ( ) • Bmax: Maximum number of coding bits per frame, de- Yes i i No pending on the output bitrate. bLR( s )+ b ( s )< i+1 bi s b i s The computation of the distortion levels TC (s), for MS( )+ ( ) i channels C L, R , from previous values TC (s), uses a process similar∈ { to that} of the single-channel algorithm: For high SNR (1st phase), it is a water-filling technique subband L/R-M/S [19] with a protection factor. For low SNR (2nd phase), switch a constant SNR degradation is performed. During the 1st phase, the water-filling technique retrieves bits from the i i i i i i b( s )= bLR( s )+ b ( s ) b( s )= bMS( s )+ b ( s ) sub-bands with the lowest signal energy in order to min- imize the distortion on high-energy sub-bands. The pro- tection factor is used to avoid large distortion levels at low Yes No frequencies. During the second phase, a uniform bit re- i Ss b( s )< B trieval along subbands is performed in the case of very low max bitrate constraint, but some noticeable distortions will then be clearly perceived. The complete description of this process, applied both Ti s Ti +1 s STOP LL()()® to channels L and R, is as follows, where τ(s) is the protec- i i +1 TRR()() s® T s tion factor for each subband (see [12] for numerical values i® i +1 and implementation details). The protection threshold is defined by: EX (s) G(s)= Fig. 5. Block-diagram of the optimization algorithm. τ(s) The protection threshold can be interpreted as the maxi- mum error energy required in each subband to preserve a standard [1], referred to as the standard coder in this paper. perceptually-acceptable level of distortion. We used the Low Complexity profile. The standard coder • 1st phase, until T i(s) < G(s) for at least one sub-band was chosen preferably to an embedded AAC coder, because it is a public implementation which allows a fair compari- T i = min T i(s) son: The only difference between both coders under test is min s the optimization algorithm. All other components are the i+1 i  i T (s) = min max T (s), r1 Tmin , G(s) same. • 2nd phase   A. Subjective evaluation T i+1(s)= r T i(s) 2 The signal quality can be assessed using objective qual- step-constants r1 and r2 have been set respectively to 1 dB ity tests, but as mentioned in [22], the ultimate quality test and 0.25 dB. of any audio compression technique is the human listener. In this work, we refer, to a large extent, to the ITU rec- IV. Performance evaluation ommendation BS.1534-1 [21] (often referred as MUSHRA In this section, our coding algorithm is compared to the test) which is especially designed for the subjective assess- algorithm described in the informative annex of the MPEG ment of intermediate quality audio coding systems. The 8

100 i.e. the average of Left and Right channels. The score given 60 to this mono anchor is informative on the level of stereo of Ex1 20 the test signal. 100 A total of 17 selected subjects participated to the lis- 60 tening test and scored the different signals according to Ex2 20 their quality from score 0 (extremely poor quality) to 100 (transparent). Even if some of them were familiar with 100 audio coding evaluation, all subjects underwent a training 60 phase which allowed them to better identify the typical Ex3 20 coding artefacts of the tested coders1. The participants 100 were post-screened according to the score they have at- 60 tributed to the reference: the data for listeners who gave Ex4 20 a score under 80 (which correspond to the lowest mark for the category excellent quality) were discarded. As a con- 100 sequence, 14 listeners were judged reliable and therefore 60 kept for the results. Note that none of the authors have Ex5 20 participated to the test. 100 The results of the subjective test are summarized in Fig- 60 ure 6. The results are given as mean absolute scores for Ex6 20 each signal with 95% confidence intervals. The main result is that the proposed coder provides a 100 significantly better quality than the standard one for all 60

Ex7 test items. It is also interesting to notice that both 20 were judged better than the low-pass anchor, except for the 100 standard codec on items #3 and 8. To our opinion, this 60 illustrates the main weaknesses of the standard algorithm: Ex8 20 On the one hand, when the input signal has a wide spec- 100 tral content, for example the harpsichord accompaniment 60 on item #3, the power spectral density (PSD) of the error Ex9 20 can strongly vary from one frame to another, which creates birdies-like degradations. Since our method is temporally 100 more stable, this inherent weakness is greatly reduced. On 60 the other hand, when the input signal possesses a globally

Ex10 20 contrasted PSD (for example Suzanne Vega’s voice on item 100 #8), the standard algorithm produces an error with a glob- 60 ally smooth PSD, which creates gaps in the spectrogram All 20 of the coded signal. By contrast, our methods adapts the PSD of the error to the PSD of the input signal, and avoids Anchor New Standard Anchor such degradations. lp3.5kHz codec codec mono The comparison with the mono anchor is more difficult Fig. 6. MUSHRA test results with 95% confidence intervals for each to interpret as many listeners mentioned the difficulty to audio excerpt. assess the reduction of stereophonic effect compared to artefact-like degradations. However, it seems that the score given to the mono anchor is actually related to the level of subjective evaluation was carried out at a of 96 stereo: For item #8, which is nearly a monophonic sig- kbit/s since near transparent quality is obtained for both nal, the mono anchor obtains almost 100%, and for items codecs at 128 kbit/s or higher bitrates. #1, 3 and 6, which provide a strong stereo effect, the score Table I gives the list of the selected test material. The se- given to the mono anchor is the lowest. Furthermore, one lection of audio test items was done by choosing a subset of can notice that the proposed coder is always judged bet- excerpts where audio impairments of both coding schemes ter than the mono anchor, except for item #8 which is a were the most audible and by favouring the widest variety very special case. This proves that our codec does not sig- of musical content. All excerpts are stereophonic and were nificantly degrade the stereophonic rendering in order to played at a sampling rate of 48 kHz. reduce traditional coding artefacts.

Our coder is subjectively evaluated and compared to the 1In practice, the training phase is done in two phases. First, the standard AAC coder and two anchor signals. The first one, listeners learn typical quality degradations due to bitrate reduction required by the MUSHRA protocol, is a 3.5 kHz low-pass on typical signals (Low pass filtering, birdies and pre-echoes). Note that no specific artefacts associated with stereophonic degradations version of the reference signal. We chose to add a second were included. Second, the subjects listen to all items included in the anchor signal: a monophonic version of the reference signal, test, in which case stereophonic degradations are presented. 9

Id Author Identification Style Duration 1 The Beatles Drive My Car Pop-Rock 8.5 s 2 J.J. Cale Cocaine Pop-Rock 9.8 s 3 A. Vivaldi Gloria Choir 11.0 s 4 M. Marais Le Labyrinthe Viola da gamba 8.6 s 5 Anonymous Saltarello Medieval 7.6 s 6 Simon & Garfunkel Sound of Silence Singing voice 9.3 s 7 Supertramp Goodbye Stranger Pop-Rock 8.3 s 8 S. Vega Toms dinner Singing voice 9.5 s 9 H. Texier Tzigane Jazz 10.0 s 10 R. Galliano Viaggio Jazz 10.5 s

TABLE I Audio material for subjective evaluation.

Finally, it can be observed that relatively small confi- dence intervals are obtained and that they are in general smaller for the proposed coder than for the standard AAC coder. This may be explained by the fact that when an artefact is clearly audible (which is more often the case with the standard ACC coder than with our coder), the listeners often have a different perception on its acceptability and therefore use significantly different scores, leading to an increase of the confidence intervals for the standard AAC coder. The relatively small confidence intervals obtained is to our opinion the consequence of the specific training phase conducted beforehand by all listeners which in fact leads to a better agreement of the listeners during the test phase.

B. Complexity In a previous paper [12], we showed that the optimiza- Fig. 7. Mean execution time and 95% confidence intervals for each audio excerpt. tion algorithm takes about 50% of the whole computation time and that our model-based algorithm for monophonic signals requires about 40% less computation time than the standard algorithm at 48 kbits/s. In MS stereo, we possibly expect different results, be- cause the quantization process is more complex than twice common to both implementations, is approximately 0.5s. the monophonic case: The standard algorithm relies on three nested-loops (distortion loop for channel M, distor- One can see that the optimization algorithm takes about tion loop for channel S, bitrate loop for both channels), 70% of the whole computation time, which is more than in with simple calculations inside each loop. Our model-based the monophonic case. For the optimization part, our al- algorithm has only one bitrate loop, but the calculations gorithm performs better on excerpts #2, 4, 6, 8, 9, the inside the loop are more complex. standard algorithm on excerpts #1, 3, 5, 7, 10. In average, To evaluate the complexity, we measured the mean our algorithm is 10% faster. One can also notice that con- CPU time required for coding one analysis window, for fidence intervals are larger with our algorithm. This point the excerpts listed in Table I, and for a bitrate of 96 can be explained by the small value for the step constant kbits/s.2 On Figure 7, we plot the mean execution time r2: In the second phase of the optimization, reached when and 95% confidence interval for each audio excerpt, and for psychoacoustics require a much larger amount of coding all excerpts. The remaining computation time (window- bits than available, we choose to slowly raise the distortion switching, MDCT and psycho-acoustic model), which is level in order to get a very progressive quality degradation, which increases the execution time. In contrast, in the 2Note that the implementation was made on a MATLAB 6 plat- standard algorithm, the variation applied to the scalefac- form, and that we did not use a fast scheme (FFT based) for the tors is raised at each iteration. This ensures a fast conver- implementation of the time-frequency transform (MDCT). Thus, the results might slightly differ with a compiled coder, and the total com- gence, but results in a poor subjective quality when psy- putation time would be lower with a fast MDCT scheme. choacoustics require a larger amount of coding bits. 10

V. Conclusion [12] O. Derrien, P. Duhamel, M. Charbit and G. Richard, “A New Quantization Optimization Algorithm for the MPEG Advanced In this paper, we have described a new coding algorithm Audio Coder Using a Statistical Subband Model of the Quantiza- for the MPEG in MS-stereo mode, tion Noise,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, pp. 1328–1339, July 2006. based on a subband model for the quantization noise. Our [13] J.P. Princen and A.B. Bradley, “Analysis/synthesis filter bank approach is radically different from the MPEG standard design based on time domain aliasing cancellation,” IEEE algorithm: we propose a global approach for coding both Transactions on Acoustics, Speech and Signal Processing, vol. ASSP-34, pp. 1153–1161, 1986. channels in the same process. First, a quantization er- [14] J.D. Johnston and A.J. Ferreira, “Sum-Difference Stereo Trans- ror model allows us to tune the quantizers on channels M form Coding, ” proc. IEEE ICASSP, pp. 569-572, 1992. and S with respect to a distortion constraint on the recon- [15] J.D. Johnston, “ of audio signals using perce- tual noise criteria,” IEEE Journal on Selected Areas in Com- structed channels L and R as they will appear in the de- munications, vol. 6, no. 2, pp. 314–323, February 1988. coder. This approach leads to a more efficient perceptual [16] J.C. Liu, W.C. Lee and Y.H. Hsiao, “ M/S Coding Based on noise-shaping and to avoid the use of complex psychoa- Allocation Entropy, ” proc. DAFX’03, London, UK, September 2003. coustic models built on the MS channels. Furthermore, it [17] M. Abramowitz and A. Stegun, Handbook of Mathematical provides a straightforward scheme to choose between LR Functions, Dover Publications Inc., New York, 1970. and MS modes in each subband for each frame. [18] S.P. Lipshitz, R.A. Wannamaker, and J. Vanderkooy, “Quanti- zation and dither : a theoretical survey,” Journal of the Audio Subjective listening tests performed with trained sub- Engineering Society, vol. 40, no. 5, pp. 355–374, May 1992. jects prove that the coding efficiency at a medium bitrate [19] M. Bosi and E. Goldberg, Introduction to Digital Audio Coding and Standard, Kluwer Academic Publishers, 2002. (96 kbits/s for both channels) is significantly better with [20] ITU-R, ITU-R Recommendation BS.1116. Methods for the sub- our algorithm, with no increase of complexity. jective assessment of small impairments in audio systems in- Our method is compatible with almost any psychoacous- cluding multichannel sound systems, 1997. [21] ITU, ITU-R BS.1534-1: Method for the subjective assessment tic model. For our experimentations, we used the binaural of intermediate quality levels of coding systems, 2003. extension of the model proposed in the MPEG standard, [22] T.Grusec, L.Thibault, and G.Soulodre, “Subjective evaluation but further studies should focus on improving the psychoa- of high quality audio coding systems: Methods and results in the two-channel case,” Proceedings of the 99th International Con- coustic modelling of binaural effects, because this aspect is ference of the Audio Engineering Society, New York, October strongly related to the coding efficiency. 1995, preprint #4065. [23] D. Kirby and K. Watanabe, “Formal subjective testing of the VI. Acknowledgements MPEG-2 NBC multichannel coding algorithm,” Proceedings of the 102th International Conference of the Audio Engineering The authors wish to thank all listeners who have ac- Society, Munich, March 1997, preprint #4418. [24] F. P. Myburg, Design of a scalable parametric audio coder, cepted to participate to the listening tests. The authors Ph.D. thesis, Technische Universiteit Eindhoven, 2004. are also grateful to the anonymous reviewers who greatly helped to improve the original manuscript. References [1] International Organization for Standardization, ISO/IEC 13818-7 (MPEG-2 Advanced Audio Coding, AAC), 1997. [2] International Organization for Standardization, ISO/IEC 14496-3 (Information technology - Very low bitrate audio-visual coding - Part3: Audio), 1998. [3] C. Bauer, M. Feller and G. Davidson, “Multidimensional Op- timization of MPEG-4 AAC Encoding,” proc. IEEE ICASSP, Toulouse, France, May 2006. [4] J.D. Johnston, “Perceptual Transform Coding of Wideband Stereo Signals,” proc. IEEE ICASSP, 1990. [5] D. Yang, H. Ai, C. Kyriakakis, C. Faller and C.C.J. Kuo, “High-Fidelity Multichannel Audio Coding With Karhunen- Lo`eve Transform” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 4, pp. 365-380, July 2003. [6] H. Fuchs, “Improving Joint Stereo Audio Coding By Adaptive Inter-Channel Prediction,” proc. IEEE WASPAA, Mohonk, NY, October 1993. [7] H. Fuchs, “Improving MPEG Audio Coding by Backward Adap- tive Linear Stereo Prediction” proc. 99th AES Convention, 1995. [8] J. Lindbom, J.H. Plasberg and R. Vafin, “Flexible Sum- Difference Stereo Coding Based on Time-Aligned Signal Com- ponents” IEEE Workshop on Application of Signal Processing to Audio and Acoustics, New Paltz, New-York, October 2005. [9] F. Baumgarte and C. Faller, “Binaural Cue Coding - Part I: Psychoacoustics Fundamentals and Design Principles” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 509-519, November 2003. [10] F. Baumgarte and C. Faller, “Binaural Cue Coding - Part II: Schemes and Applications” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 520-531, November 2003. [11] International Organization for Standardization, ISO/IEC 14496-3:2005 (Information technology - Coding of audio-visual objects – Part 3: Audio), 2005.