A MULTIDIMENSIONAL COMPANDING SCHEME FOR SOURCE CODING WITH A PERCEPTUALLY RELEVANT DISTORTION MEASURE

J. Crespo, P. Aguiar
Department of Electrical and Computer Engineering, Instituto Superior Técnico, Lisboa, Portugal

R. Heusdens
Department of Mediamatics, Technische Universiteit Delft, Delft, The Netherlands

ABSTRACT

In this paper, we develop a multidimensional companding scheme for the perceptual distortion measure by S. van de Par [1]. The scheme is asymptotically optimal in the sense that it has a vanishing rate-loss with increasing vector dimension. The compressor operates in the frequency domain: in its simplest form, it pointwise multiplies the Discrete Fourier Transform (DFT) of the windowed input signal by the square-root of the inverse of the masking threshold, and then goes back into the time domain with the inverse DFT. The expander is based on numerical methods: we do one iteration in a fixed-point equation, and then we fine-tune the result using Broyden's method. Additionally, we will show simulations which corroborate the approximations and results of the theoretical derivations.

Index Terms— Multidimensional companding, asymptotic optimality, perceptual distortion measure, sinusoidal audio coding

1. INTRODUCTION

In the last decade we have observed an explosive increase in the usage of audio coding schemes and coded audio content, which enable the reduction of the information throughput (bitrate) of an audio signal on the order of 7 to 15 times with respect to the original Pulse Code Modulation (PCM) coded signal, with very reduced penalty in perceptual quality [2]. These schemes make countless applications possible, such as handheld audio decoders with reduced memory capacity which nevertheless can carry hours of audio content, streaming through bandwidth-constrained channels such as the Internet with low bandwidth usage and high experienced quality, interactive delivery of audio content through the World Wide Web (WWW), digital radio, digital television, recorded digital audio, and many more.

This decrease in bitrate with surprisingly high transparency is achieved through the exploitation of the perceptual irrelevance and statistical redundancy present in the audio signal [3]. Indeed, contemporary audio coders use lossy source coding mechanisms based on quantization, where less perceptually important information is quantized more roughly and vice-versa, and lossless coding of the symbols emitted by the quantizer to reduce statistical redundancy. In the quantization process of the (en)coder, it is desirable to quantize the information extracted from the signal in a rate-distortion optimal sense, i.e., in a way that minimizes the perceptual distortion experienced by the user subject to the constraint of a certain available bitrate.

Due to its mathematical tractability, the Mean Square Error (MSE) is the usual choice for the distortion measure in many coding applications. In audio coding, however, the usage of more complex perceptual distortion measures different from the MSE, exploiting the frequency selectivity and masking phenomena of the human auditory system, achieves a much higher performance, i.e., a higher perceptual quality for the same bitrate. A way to code with these measures is to multiply the input signal by certain perceptual weights before quantizing at the encoder and to do the inverse (divide) at the decoder, so that the perceptual distortion measure gets mapped into an MSE in the normalized domain. These weights are usually related to the masking threshold, a time-frequency function delivering the maximum distortion power insertable in a time-frequency bin so that the distortion is still inaudible. The ease of use of the MSE distortion measure makes this solution attractive, and it has been employed in several quantization schemes [4], [5], [6], [7], [8].

Nevertheless, such an approach has the inconvenience that the weights have to be transmitted as side information through the channel so that the receiver can do the inverse normalization, thereby introducing an overhead in the transmission process. In the case of a multiple description coding scenario [9], this overhead may be intolerably high. In such a scenario we have to transmit information through n > 1 channels, and at the receiver m ≤ n channels are received successfully. The channels that were not transmitted successfully do not arrive at all (they are "erased"). As we do not know at the sender which channels will arrive, we have to transmit the perceptual weights in all channels. With an increasing number of channels n, for a constant total bit budget, the useful information for the audio signal thus becomes smaller and smaller, making a multiple description audio coding scheme useless for high n.

To solve this problem, we will make use of multidimensional companding [10]. As shown in figure 1, in a multidimensional companding quantization scheme the source x ∈ R^N is first pre-processed by applying a non-linear vector function F to it, which we call the compressor; the result is then vector quantized, transmitted through the channel in an efficient way, and finally at the receiver the inverse function F^{-1}, the expander, is applied as a post-processing step, the reproduction y being obtained in this fashion. The pair of functions, the compressor and the expander, is called the compander.

[Figure 1: x ∈ R^N → F(·) → Q(·) → channel → F^{-1}(·) → y]
Fig. 1. Source coding with multidimensional companding

In [10], it was shown that general non-difference measures (which satisfy some natural conditions) are equivalent to the MSE measure at high resolution (at low distortion levels) in the compressed domain (between the compressed source signal and the not-yet-expanded received signal) if an "optimal" companding scheme (in the sense described in [10]) is applied, thus making all schemes optimal for the MSE measure also optimal for this type of non-MSE measures upon application of the optimal companding scheme. The advantage of using an optimal companding scheme in audio source coding is then that coding with a perceptually relevant (non-MSE) distortion measure can be done with any MSE-based scheme without the need to do pre-normalization of the input signal with perceptual weights. The transmission of this side information is thereby removed, and multiple description coding can be done with an arbitrarily large number of descriptions n without performance loss.

Although the usage of multidimensional companding seems promising, an optimal companding scheme does in general not exist [10]. To work around that problem, a suboptimal companding scheme must be built, having an additional rate (entropy), the so-called rate loss, at the output of the quantizer in relation to the rate that would be theoretically achievable with the (non-existent) optimal companding scheme. In this paper we will sum up the work done in [11], which had as objective the construction and simulation of such a sub-optimal companding scheme, with vanishing rate-loss with increasing vector dimension, for the perceptual distortion measure developed by S. van de Par [1], a distortion measure designed for sinusoidal audio coding [12] applications.

This paper is organized as follows. In section 2 we review multidimensional companding and the perceptual distortion measure, in section 3 we construct the suboptimal companding scheme, in section 4 we present simulations, and in section 5 we draw conclusions.

2. PRELIMINARIES

We will start by overviewing the theme of multidimensional companding, and in particular the condition for optimality that we will need to derive the companding scheme, as well as the additional rate at the output of the quantizer when using a sub-optimal scheme with respect to the optimal scheme, the so-called rate-loss. For an extensive treatment of the subject we refer the reader to [10]. Subsequently, we will present the distortion measure which will be used in this paper, described in [1].

2.1. Multidimensional Companding

Linder et al. [10] treated the theme of multidimensional companding (figure 1) assuming that the distortion measure which we are coding with is locally quadratic. This type of distortion measure is characterized by being of class C³ with respect to y (d is 3 times continuously differentiable), by being strictly positive except in its absolute minimum y = x, where it should be 0,

$$ d(x, y) \ge 0 \quad \text{with equality iff } y = x, \qquad (1) $$

and by having a Taylor expansion with respect to y around x given by

$$ d(x, y) = (y - x)^T M(x)\,(y - x) + O(\|y - x\|^3), \qquad (2) $$

where ‖·‖ denotes the l₂ norm and where the matrix M(x), dubbed in [10] the sensitivity matrix, is defined as

$$ [M(x)]_{m,l} = \frac{1}{2} \left.\frac{\partial^2 d(x, y)}{\partial y_m\, \partial y_l}\right|_{y=x}, \quad m, l = 0, 1, \dots, N-1, \qquad (3) $$

i.e., it is half of the Hessian matrix of d with respect to y calculated at y = x. The absence of the 0th and 1st order terms in (2) comes directly from condition (1). Due to the same condition, M(x) is positive definite.

It was defined in [10] that a multidimensional companding scheme is optimal when the Jacobian matrix

$$ [F'(x)]_{m,l} = \frac{\partial F_m}{\partial x_l}(x), \quad m, l = 0, 1, \dots, N-1 \qquad (4) $$

of the compressor satisfies¹

$$ M(X) = F'(X)^T F'(X) \qquad (5) $$

almost everywhere, where M(X) is the sensitivity matrix of equation (3), where T denotes transposition and where we denote random variables by uppercase letters and their realizations by the corresponding lowercase letters. Due to the positive definiteness of (3), there exists a matrix F' satisfying (5) [13]. We call any such matrix a square-root of M, denoted here by √M. An equivalent condition to (5) is then

$$ F'(x) = \sqrt{M(x)}, \qquad (6) $$

for some square-root of M. When fixing the matrix √M, the solution of (6) is not unique: any solution of the form U(x)√(M(x)) with U(x) orthogonal is also a solution of (5), and all solutions of (5) are related by left-multiplication by an orthogonal matrix. In this paper, we will restrict ourselves to the case U(x) = I, where I is the identity matrix.

¹In the original paper there was a multiplying constant c, which is set to 1 here for simplicity. No loss of generality occurs.

Additionally, the rate-loss of a sub-optimal companding scheme at high resolution (low distortion) was derived in [10] as being

$$ H(Q_{D,\tilde F}) - H(Q_{D,F}) \approx \frac{1}{2N}\, E\!\left[\log_2 \frac{\det \tilde M(X)}{\det M(X)}\right] + \frac{1}{2} \log_2 E\!\left[\frac{\operatorname{tr}\!\big(\tilde M(X)^{-1} M(X)\big)}{N}\right], \qquad (7) $$

where

$$ \tilde M(X) = \tilde F'(X)^T \tilde F'(X). \qquad (8) $$
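As an illustration of (7), the short Python sketch below estimates the rate-loss by replacing the expectations with averages over realizations of X, given routines that return M(x) and M̃(x) for a realization x. It is not code from [10] or [11]; the function and argument names are assumptions made here for the example.

```python
import numpy as np

def rate_loss_estimate(M_fun, M_tilde_fun, draw_x, n_realizations=100):
    """Monte Carlo estimate of the rate-loss of equation (7).

    M_fun(x)       -> sensitivity matrix M(x) (N x N, positive definite)
    M_tilde_fun(x) -> M~(x) = F~'(x)^T F~'(x) (N x N)
    draw_x()       -> one realization of the source X (length-N vector)
    """
    first_terms, normalized_traces = [], []
    for _ in range(n_realizations):
        x = draw_x()
        M, Mt = M_fun(x), M_tilde_fun(x)
        N = M.shape[0]
        # log2(det M~(X) / det M(X)) / (2N), via slogdet for numerical stability
        _, logdet_Mt = np.linalg.slogdet(Mt)
        _, logdet_M = np.linalg.slogdet(M)
        first_terms.append((logdet_Mt - logdet_M) / np.log(2) / (2 * N))
        # tr(M~(X)^{-1} M(X)) / N, solving a linear system instead of forming the inverse
        normalized_traces.append(np.trace(np.linalg.solve(Mt, M)) / N)
    # expectation of the first term, plus (1/2) log2 of the expectation of the second
    return np.mean(first_terms) + 0.5 * np.log2(np.mean(normalized_traces))
```

For an optimal compander, M̃ = M and the estimate is 0; any mismatch makes both terms non-negative on average, which is the quantity reported as distortion-loss in section 4.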

2.2. Perceptual Distortion Measure

In [1], S. van de Par defined an auditory, perceptually relevant distortion measure to be used in the context of sinusoidal audio coding, which delivers the distortion detectability of an N-dimensional signal x and its reproduction y as a weighted mean square error of the windowed signals in the frequency domain. More precisely, the distortion measure between x and y is defined as

$$ d(x, y) = \sum_{f=0}^{L-1} \hat a^2(x, f)\, \big|\widehat{y_w}(f) - \widehat{x_w}(f)\big|^2, \qquad (9) $$

where the juxtaposition of two vectors denotes the pointwise product between them, the ˆ operator denotes the L-point unnormalized discrete Fourier transform (DFT) in which the input signal is zero-padded up to size L, i.e. for any N-dimensional signal u

$$ \hat u(f) = \sum_{n=0}^{N-1} u(n)\, e^{-j\frac{2\pi}{L} f n}, \quad f = 0, 1, \dots, L-1, \qquad (10) $$

and where w is an N-size frequency-selective window (e.g. the Hann one) with w(n) > 0 ∀n, and â²(x, f) is selected to be the (signal dependent) inverse of the masking threshold at frequency f/L · f_s (f_s denotes the working sample rate), given by

$$ \hat a^2(x, f) = N c_1 \sum_{i=1}^{P} \frac{|\hat h_i(f)|^2}{\sum_{f'=0}^{L-1} |\hat h_i(f')|^2\, |\widehat{x_w}(f')|^2 + c_2}, \quad f = 0, 1, \dots, L-1. \qquad (11) $$

In this last equation, c₁ > 0 and c₂ > 0 are calibration constants, with c₁ independent of N, and h_i(n) is the (L-point) impulse response of the cascade of the filter simulating the behavior of the outer and middle ear with the i-th gammatone filter of the filter-bank of size P, simulating the band-pass characteristic of the basilar membrane of the cochlea [1]. These filters h_i are assumed to be absolutely summable.

In accordance with what was explained in the introduction (section 1), we would like to point out that this distortion measure can be transformed into a mean-square error (MSE) by normalization of the input signal. Indeed, looking at the form (9) of the distortion measure, we can define

$$ \hat x'(f) = \hat a(f)\, \widehat{x_w}(f), \qquad \hat y'(f) = \hat a(f)\, \widehat{y_w}(f) \qquad (12) $$

and work in the normalized domain x', y', i.e. quantize and transmit x' and recover y from ŷ'. Nevertheless, if we use the normalization directly, the square-root of the inverse of the masking threshold has to be transmitted to the receiver to perform the inverse normalization

$$ \widehat{y_w}(f) = \frac{\hat y'(f)}{\hat a(f)}, \qquad (13) $$

with the consequence of an intolerable overhead in a multiple description coding scenario with a high number of channels.
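To make the definitions (9)–(11) concrete, the following Python sketch computes the weights â²(x, f) and the distortion d(x, y) from a bank of filter impulse responses. It only illustrates the formulas: the filters h_i, the window and the calibration constants c₁, c₂ are placeholders supplied by the caller, not the calibrated values of [1].

```python
import numpy as np

def perceptual_weights(x, h, w, L, c1=1.0, c2=1.0):
    """Inverse masking threshold a^2(x, f) of equation (11).

    x : input signal, length N
    h : array (P, N_h) of filter impulse responses h_i (outer/middle ear + gammatone)
    w : analysis window, length N (w(n) > 0)
    """
    N = len(x)
    xw_hat = np.fft.fft(w * x, n=L)               # L-point DFT of the windowed signal, eq. (10)
    H2 = np.abs(np.fft.fft(h, n=L, axis=1)) ** 2  # |h_i(f)|^2 for each filter
    denom = H2 @ np.abs(xw_hat) ** 2 + c2         # sum_f' |h_i(f')|^2 |x_w(f')|^2 + c2, per filter
    return N * c1 * np.sum(H2 / denom[:, None], axis=0)   # eq. (11)

def distortion(x, y, h, w, L, c1=1.0, c2=1.0):
    """Perceptual distortion d(x, y) of equation (9)."""
    a2 = perceptual_weights(x, h, w, L, c1, c2)
    xw_hat = np.fft.fft(w * x, n=L)
    yw_hat = np.fft.fft(w * y, n=L)
    return np.sum(a2 * np.abs(yw_hat - xw_hat) ** 2)
```

Note that the weights depend only on the original signal x, which is why transmitting them (or, as developed below, embedding them in the compander) is needed at the decoder.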

3. A SUBOPTIMAL COMPANDING SCHEME

In section 2, we have shown the basic elements necessary to construct a companding scheme for the distortion measure in question. Based on those elements, we will construct a sub-optimal companding scheme in this section.

3.1. Sensitivity Matrix

We will first work out the optimality condition (5), check whether optimality can be achieved, and then, having arrived at a negative result, construct a sub-optimal compressor based on the same condition. As we can see from the condition equation, to express it explicitly we have to calculate the sensitivity matrix (3). Those calculations were done in [11], where the cases L = N and L ≠ N were treated separately. For L = N, the result is

$$ M(x) = \Lambda_w\, M_c(x)\, \Lambda_w, \qquad (14) $$

with

$$ M_c(x) = D_N^H\, \Lambda_{N \hat a^2(x)}\, D_N, \qquad (15) $$

where Λ_u denotes a diagonal matrix with the elements of the vector u on the diagonal, H denotes Hermitian transposition and D_N denotes the normalized DFT matrix [14]. M_c is a circulant matrix due to being diagonalized by the DFT matrix [15]. In turn, for L ≠ N, the result is

$$ M(x) = \Lambda_w\, M_t(x)\, \Lambda_w, \qquad (16) $$

with

$$ [M_t(x)]_{ml} = [M_c(x)]_{ml}, \quad m, l = 0, 1, \dots, N-1, \qquad (17) $$

where in this case M_c is a larger L-by-L matrix, given by

$$ M_c(x) = D_L^H\, \Lambda_{L \hat a^2(x)}\, D_L. \qquad (18) $$

The matrix M_t is a cropped version of the circulant M_c, and is consequently a Toeplitz matrix. Due to the difficulty in dealing with Toeplitz matrices, namely in calculating their eigenvalues, determinant, inverse and other functions, M_t was approximated by a circulant matrix M_{t,a} on the basis of the asymptotic equivalence results of [15]. The approximation M_t ≈ M_{t,a} therefore delivers exact results as N → ∞ and good approximations for the sizes typically dealt with in audio coding, N ∼ 1024. After some calculations, the approximation resulted in

$$ M_{t,a}(x) = D_N^H\, \Lambda_{N \tilde{\hat a}^2}\, D_N, \qquad (19) $$

where

$$ \tilde{\hat a}^2(x, f) = \frac{L}{N}\, \hat a^2\!\left(x, \frac{L}{N} f\right), \quad f = 0, 1, \dots, N-1, \qquad (20) $$

is a decimated, energy-corrected version of the masking threshold. It was also shown, for N not too small, that we can decimate the band-pass filters instead of the masking threshold, i.e. define

$$ \hat{\tilde h}_i(f) \stackrel{\text{def}}{=} \sqrt{\frac{L}{N}}\, \hat h_i\!\left(\frac{L}{N} f\right), \quad f = 0, 1, \dots, N-1, \qquad (21) $$

and use

$$ \tilde{\hat a}^2(x, f) = N c_1 \sum_{i=1}^{P} \frac{|\hat{\tilde h}_i(f)|^2}{\sum_{f'=0}^{N-1} |\hat{\tilde h}_i(f')|^2\, |\widehat{x_w}_N(f')|^2 + c_2}, \quad f = 0, 1, \dots, N-1, \qquad (22) $$

where û_N stands for the N-point DFT of u.
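A direct transcription of (14)–(15) for the case L = N is sketched below; it reuses the hypothetical perceptual_weights helper from the previous sketch and builds the dense sensitivity matrix explicitly, which is only feasible for moderate N.

```python
import numpy as np

def sensitivity_matrix(x, h, w, c1=1.0, c2=1.0):
    """Sensitivity matrix M(x) = Lambda_w M_c(x) Lambda_w for L = N, eqs. (14)-(15)."""
    N = len(x)
    a2 = perceptual_weights(x, h, w, L=N, c1=c1, c2=c2)   # a^2(x, f) of eq. (11)
    D = np.fft.fft(np.eye(N)) / np.sqrt(N)                # normalized DFT matrix D_N
    Mc = D.conj().T @ np.diag(N * a2) @ D                 # circulant M_c(x), eq. (15)
    # for real x the weights are symmetric in f, so M_c is real up to rounding noise
    return np.diag(w) @ Mc.real @ np.diag(w)              # eq. (14)
```

For L ≠ N one would build the larger L-by-L matrix of (18), crop it to N-by-N as in (17), or go directly to the circulant approximation (19) with the decimated weights (22).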

3.2. Compressor

Having derived the form of the sensitivity matrix in section 3.1, we will now work out in more detail the condition the optimal compressor should satisfy and develop a compressor using the derived optimality condition.

From equations (5), (6) and (16) with the approximation (19) for L ≠ N, or (14) and (15) for L = N, we can see that a possible square-root is

$$ F'(x) = \sqrt{M(x)} = D_N^H\, \Lambda_{\sqrt{N}\, \tilde{\hat a}(x)}\, D_N\, \Lambda_w, \qquad (23) $$

where for L = N, â should be used instead of ã̂ (in this case â = ã̂ anyway, but the origin of the equations is different). In [11], it was proven that no optimal companding scheme exists for the specific square-root given by (23), nor for any of the square-roots obtained by left-multiplying (23) by a signal-independent orthogonal matrix U. The signal-dependent case was left for future work.

A suboptimal companding scheme was thus developed, and U(x) was set to I. The proposed scheme windows the input signal, goes into the frequency domain using the DFT, applies a non-linear function G and goes back into the time domain with the inverse DFT. In a block diagram, the scheme can be synthesized as depicted in figure 2. In mathematical terms, the compressor is given by

$$ F(x) = \frac{D_N^H}{\sqrt{N}}\, G\big(\sqrt{N}\, D_N \Lambda_w x\big). \qquad (24) $$

It is easy to see through the application of the chain rule [16] that the optimality condition (6), with the square-root (23), degenerates into

$$ G'(z) = \sqrt{N}\, \Lambda_{\tilde{\hat a}(x(z))} \equiv \sqrt{N}\, \Lambda_{\tilde{\hat a}(z)}, \qquad (25) $$

where x(z) = Λ_w^{-1} D_N^H z / √N and we use a slight abuse of notation, saying that now ã̂ is a function of z directly.

There are N² equations and N unknowns in equation (25), meaning that the equation system is overdetermined. To build the compressor, we discard the N² − N equations outside the diagonal, and choose to satisfy only the N equations on the diagonal. Neglecting these equations results in a sub-optimal compressor, since with the resulting compressor the off-diagonal elements of its Jacobian matrix will not be 0. Calculations show that the resulting compressor is of the form

$$ \tilde G_m(z) = \sqrt{N}\, \Gamma_m(z)\, z(m), \quad m = 0, 1, \dots, N-1, \qquad (26) $$

where the tilde stands for sub-optimality and where the signal-dependent gain Γ has component m given by

$$ \Gamma_m(z) = \int_0^1 \gamma_m(z, t)\, dt \qquad (27) $$

with

$$ \gamma_m(z, t) = \sqrt{N c_1 \sum_{i=1}^{P} \frac{|\hat{\tilde h}_i(m)|^2}{\|\Lambda_{\hat{\tilde h}_i} z\|^2 + 2\,|\hat{\tilde h}_i(m)|^2\, |z(m)|^2\, (t-1) + c_2}} \qquad (28) $$

for m ≠ 0 and m ≠ N/2, and

$$ \gamma_m(z, t) = \sqrt{N c_1 \sum_{i=1}^{P} \frac{|\hat{\tilde h}_i(m)|^2}{\|\Lambda_{\hat{\tilde h}_i} z\|^2 + |\hat{\tilde h}_i(m)|^2\, z^2(m)\, (t-1) + c_2}} \qquad (29) $$

for m = 0 or m = N/2.

[Figure 2: x → windowing (·w) → DFT → G(·) → iDFT → F(x)]
Fig. 2. Block diagram of the compressor F(x)
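As an illustration of (26)–(29), the sketch below evaluates the gain Γ(z) by numerically integrating (27) over t with a simple trapezoidal rule. This quadrature is purely an implementation choice for the example; in [11] the integral is instead handled by the Taylor expansion introduced in the next subsection. The array H2 of decimated filter spectra |h̃_i(m)|² is an assumed input.

```python
import numpy as np

def gain(z, H2, c1=1.0, c2=1.0, n_t=64):
    """Signal-dependent gain Gamma(z) of eqs. (27)-(29).

    z  : length-N complex vector (sqrt(N) * D_N * Lambda_w * x)
    H2 : array (P, N) with |h~_i(m)|^2 for each decimated filter
    """
    N = len(z)
    abs_z2 = np.abs(z) ** 2
    norm2 = H2 @ abs_z2                        # ||Lambda_{h~_i} z||^2 for each filter i
    # coefficient of the (t-1) term: 2 |h~_i(m)|^2 |z(m)|^2, except 1x at m = 0 and m = N/2
    factor = 2.0 * np.ones(N)
    factor[0] = 1.0
    if N % 2 == 0:
        factor[N // 2] = 1.0
    t = np.linspace(0.0, 1.0, n_t)             # integration grid over t in [0, 1]
    gamma = np.empty((n_t, N))
    for k, tk in enumerate(t):
        denom = norm2[:, None] + factor * H2 * abs_z2 * (tk - 1.0) + c2   # (P, N)
        gamma[k] = np.sqrt(N * c1 * np.sum(H2 / denom, axis=0))           # eqs. (28)/(29)
    # trapezoidal rule for eq. (27)
    weights = np.full(n_t, 1.0)
    weights[0] = weights[-1] = 0.5
    return (t[1] - t[0]) * (weights @ gamma)
```

For real input frames the spectrum z is conjugate symmetric, so the denominators stay positive for c₂ > 0 and the square roots are well defined.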

3.3. Taylor Expansion

Equations (28) and (29) show that the integrand of (27) is given in terms of a sum of ‖Λ_{h̃_i} z‖² + c₂ with a term proportional to the m-th component of the same squared norm, |h̃_i(m)|² |z(m)|². The proportionality constant (take m ≠ 0, N/2 as an example), 2(t − 1), is on the order of units (note that t ∈ [0, 1]), and as a consequence the dependence on t gets weaker and weaker with increasing vector dimension N: the m-th component of the squared norm becomes more and more negligible with respect to the whole squared norm. This motivates a Taylor expansion of Γ_m around, say, t = 1 with order M (for the full motivation, see [11]). The Taylor expansion results in

$$ \Gamma_{m,M}(z) = \sum_{k=0}^{M} \frac{(-1)^k}{(k+1)!}\, \frac{\partial^k \gamma_m}{\partial t^k}(z, 1), \qquad (30) $$

where the derivatives of γ_m with respect to t are obtained by a recursive equation, which computes a certain order k using all previously calculated orders of the same quantity (orders smaller than k) and an "inhomogeneous" function sequence (for each m), with the property that the first element of the sequence is proportional to γ_m² and the derivative of the k-th element is the (k + 1)-st element.

Note that if we use M = 0 we get the very simple expression

$$ \Gamma_{m,0}(z) = \gamma_m(z, 1) = \tilde{\hat a}(z, m), \qquad (31) $$

i.e., in its most simple form, the compressor multiplies the (windowed) input signal by the square-root of the inverse of the masking threshold in the frequency domain. Comparing the compressor (26) with the normalization step (12), we notice that asymptotically the compressor does exactly the thing we wanted to avoid: it normalizes the input signal by perceptual weights. Nevertheless, instead of transmitting perceptual weights through the channel, at the receiver we now only have to apply the inverse of the compressor (26) (the expander); it is not necessary to use the weights at the receiver. How to calculate the inverse of the compressor will be the subject of subsection 3.5.
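In this simplest form (M = 0), the whole compressor chain (24), (26), (31) reduces to windowing, an FFT, a pointwise multiplication by ã̂ and an inverse FFT. A minimal sketch for the L = N case (so that ã̂ = â), reusing the hypothetical perceptual_weights helper from section 2.2:

```python
import numpy as np

def compressor_M0(x, h, w, c1=1.0, c2=1.0):
    """Suboptimal compressor F~(x) for M = 0 and L = N, eqs. (24), (26), (31).

    With the normalized DFT matrix D_N, z = sqrt(N) D_N Lambda_w x is just the
    unnormalized FFT of the windowed frame, and D_N^H / sqrt(N) is the inverse FFT.
    """
    N = len(x)
    z = np.fft.fft(w * x)                                   # z = sqrt(N) D_N Lambda_w x
    a_tilde = np.sqrt(perceptual_weights(x, h, w, L=N,      # Gamma_{m,0}(z) = a~(z, m), eq. (31)
                                         c1=c1, c2=c2))
    G = np.sqrt(N) * a_tilde * z                            # G~_m(z) = sqrt(N) Gamma_m(z) z(m), eq. (26)
    return np.fft.ifft(G).real                              # F(x) = (D_N^H / sqrt(N)) G, eq. (24)
```

As the text notes, this is the per-bin normalization of (12) performed inside the compander, so the weights themselves never have to be transmitted.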

3.4. Analysis of the Compressor

In [11], the compressor was analysed in terms of the optimality condition (25). Its effective Jacobian matrix was calculated, the result again being given in terms of recursive equations. Note that the Jacobian matrix of G̃ is different from the one in (25) due to the discarding of the off-diagonal equations of that system; even the diagonal elements are not exactly the same due to the execution of the Taylor expansion with finite M. Furthermore, it was noted that the Jacobian matrix is given by a low-rank update of a matrix defined there as cross diagonal, meaning that it is of the form

$$ V(z) = \Lambda_{v_f(z)} + \Lambda_{v_b(z)}\, D_N^2, \qquad (32) $$

where, in this particular case, v_f is a real symmetric (v_f(m) = v_f(N − m)) vector and v_b is a Hermitian symmetric (v_b(m)* = v_b(N − m)) complex vector. The expression of the Jacobian matrix is then given by the update

$$ \tilde G'(z) = \sqrt{N}\,\big(V(z) + A(z)\, H(z)^H\big), \qquad (33) $$

where A(z) and H(z) are N-by-P tall matrices (P ≪ N). For more detail, the results for the Jacobian matrix can be consulted in [11].

Additionally, the asymptotic (N → ∞) optimality of the compressor (26) was established in [11]. Intuitively, as the term ‖Λ_{h̃_i} z‖² + c₂ becomes dominant with respect to the term |h̃_i(m)|² |z(m)|² in the Taylor expansion of γ_m (equations (28) and (29)), asymptotically the compressor for an arbitrary M = 0, 1, …, ∞ behaves like the compressor for M = 0. The Jacobian matrix of the compressor for M = 0 can be calculated, the result being

$$ \tilde G'(z)\big|_{M=0} = \sqrt{N}\, \Lambda_{\tilde{\hat a}(z)}\, (E + I), \qquad (34) $$

with E having as components

$$ [E]_{ml} = \frac{-\displaystyle\sum_{i=1}^{P} \frac{|\hat{\tilde h}_i(m)|^2\, |\hat{\tilde h}_i(l)|^2\, z(m)\, z(l)^*}{\big(\|\Lambda_{\hat{\tilde h}_i} z\|^2 + c_2\big)^2}}{\displaystyle\sum_{i=1}^{P} \frac{|\hat{\tilde h}_i(m)|^2}{\|\Lambda_{\hat{\tilde h}_i} z\|^2 + c_2}}. \qquad (35) $$

As intuition indicates, the term 1/(‖Λ_{h̃_i} z‖² + c₂) in the denominator of (35) decays with N, but its square, present in the numerator, decays faster. It is thus also intuitive that E → 0 with increasing N and that G̃' fulfills the optimality condition (25) asymptotically, making the rate-loss (7) vanish asymptotically. A formal derivation can be found in [11].
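A small numerical check of (35), again assuming the decimated filter spectra are available as an array, makes the asymptotic argument tangible: the matrix E can be formed directly and its norm observed to shrink as N grows.

```python
import numpy as np

def E_matrix(z, H2, c2=1.0):
    """Matrix E of eq. (35), appearing in the M = 0 Jacobian (34).

    z  : length-N complex vector
    H2 : array (P, N) with |h~_i(m)|^2
    """
    norm2 = H2 @ (np.abs(z) ** 2) + c2        # ||Lambda_{h~_i} z||^2 + c2 for each filter i
    A = H2 / norm2[:, None]                   # |h~_i(m)|^2 / (||.||^2 + c2)
    num = -(A.T @ A) * np.outer(z, z.conj())  # numerator of (35): note the squared denominator via A^T A
    den = A.sum(axis=0)                       # denominator of (35), one value per row m
    return num / den[:, None]

# As N grows (with the filters resampled accordingly), np.linalg.norm(E_matrix(z, H2))
# is expected to shrink, which is the intuition behind the asymptotic optimality claim.
```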

3.5. Expander

After having derived a suboptimal compressor and having analysed it in terms of its asymptotic behaviour with increasing vector dimension, it is now natural to complete the whole chain of the companding scheme depicted in figure 1 by building an expander which implements the inverse function of the suboptimal compressor.

It can be proven [11] that, at least for M = 0, the compressor is invertible. We are thus sure that if we use an adequate numerical method to solve the equation

$$ \tilde G(z) = \hat\xi \qquad (36) $$

for a certain ξ̂ with a sufficiently close initial estimate z^(0), the solution z = G̃^{-1}(ξ̂) exists and the method will converge to it. For the inversion of the complete compressor function we only have to use the equivalence relation

$$ \xi = \tilde F(y) = \frac{D_N^H}{\sqrt{N}}\, \tilde G\big(\sqrt{N} D_N \Lambda_w y\big) \iff y = \tilde F^{-1}(\xi) = \Lambda_w^{-1}\, \frac{D_N^H}{\sqrt{N}}\, \tilde G^{-1}\big(\sqrt{N} D_N \xi\big), \qquad (37) $$

which is directly deducible from the definition of G (or its equivalent for the suboptimal compressor) in (24).

For the first iteration of the method, we propose to rearrange equation (26) as

$$ z(m) = \frac{\hat\xi(m)}{\sqrt{N}\, \Gamma_m(z)} \qquad (38) $$

and perform a fixed-point iteration of it, i.e., build a second estimate z^(1) with

$$ z^{(1)}(m) = \frac{\hat\xi(m)}{\sqrt{N}\, \Gamma_m(z^{(0)})}, \quad m = 0, 1, \dots, N-1. \qquad (39) $$

This first iteration can be motivated as follows. In the first place, it can be shown that Γ(z) does not depend on the phase of the components of z, but only on their magnitude. The execution of iteration (39) thus has the advantage that it wipes out phase differences between the initial estimate z^(0) and the final desired value z of (36). In other words, if z^(0) is very close to z up to phase differences, the second estimate z^(1) will be an excellent initial estimate for the next numerical method, which will be used to "fine-tune" the obtained vector z^(1). The same argument applies if z^(0) has a masking threshold ã̂^{-2} similar to the one of z. Remember that the 0-th order term of Γ_m is exactly the square root of the inverse of the masking threshold and that for high vector dimension only this term matters (cf. equation (31) and the considerations on the asymptotic compressor in subsection 3.4), so that for similar masking thresholds we get similar Γ(z).

For obtaining similar masking thresholds, we propose to use the vector z obtained from running the numerical methods described in this section on the previous audio frame as the initial estimate z^(0) for the current frame; for this choice the masking threshold will not have changed much between these two frames due to the stationarity of typical audio signals on the order of dozens of milliseconds. By using equation (39) in this way, we thereby get a good initial estimate for fine-tuning.

For the fine-tuning process we propose to use Broyden's method [17], due to its superlinear convergence [17] but lower complexity than Newton's method. The equations defining this method are

$$ z^{(n+1)} = z^{(n)} - \tilde J_n^{-1}\big(\tilde G(z^{(n)}) - \hat\xi\big), \quad n = 1, 2, \dots \qquad (40) $$

$$ \tilde J_n = \tilde J_{n-1} + \frac{\big(\Delta\tilde G^{(n)} - \tilde J_{n-1}\, \Delta z^{(n)}\big)\, \Delta z^{(n)H}}{\|\Delta z^{(n)}\|^2}, \quad n = 2, 3, \dots \qquad (41) $$

where Δz^(n) = z^(n) − z^(n−1) and ΔG̃^(n) = G̃(z^(n)) − G̃(z^(n−1)), with the initial matrix equal to the Jacobian matrix of the compressor at the initial guess,

$$ \tilde J_1 = \tilde G'(z^{(1)}). \qquad (42) $$

A memory optimization taking advantage of the form (33) of the Jacobian matrix of the compressor and of the rank-one updates of (41) (and of its inverse) was done in [11] to enable the execution of the expander for large N without calculating the N-by-N matrices J_n explicitly. The largest matrix stored in memory after the optimization is N-by-P (remember that P ≪ N).

4. SIMULATIONS

4.1. Distortion-rate performance

In this section we will simulate the companding scheme, showing simulation figures and tables that corroborate the theoretical results. After calculating the rate-distortion function for the distortion measure in question, using results of [18], and the rate-loss of equation (7), simulations of the distortion-rate performance of the compander with the parameters of table 1 were done. We used the non-approximated version of the sensitivity matrix for N ≤ 1024 (i.e. M_t instead of M_{t,a}) and the approximated version for 1024 ≤ N < 8192 (M_{t,a} instead of M_t). From N = 8192 on (inclusive) we used the matrix M_c (in this case M_t = M_{t,a} = M_c).

Parameter                                     Value
Input signal x                                Gaussian i.i.d. samples with zero mean, variance σ²
Power of the input signal σ²                  0,012
Vector dimension N                            Powers of 2 from 256 to 65536
DFT dimension L                               8192 for N ≤ 8192, N otherwise
Window w                                      Hamming, w(n) = 0,54 − 0,46 cos(2πn/N)
Sample frequency f_s                          48 kHz
Quantizer                                     Z^N (componentwise uniform scalar quantization)
Inverse of the quantization step size, 1/s    5 scales varying exponentially from 10^3 to 10^5
Order of Taylor expansion M                   3

Table 1. Values for the parameters used in the simulations

Furthermore, the expected values in the equations were replaced by statistical averages (several realizations of the signal X were generated), with a lower number of realizations for higher vector size N and vice-versa. The reason for decreasing the number of realizations with higher N is that the quantities for which we should estimate the expected value are averages themselves (normalized traces and normalized sums of logarithms of eigenvalues). Assuming that these inner quantities that we are averaging (the eigenvalues and their logarithms) are well behaved, the inner average is consistent, so that these quantities have a low variance for high N. The outer averages (the ones that replace the expectation) reduce the variance further, and thus the lower the N, the heavier that reduction needs to be (the higher the needed number of realizations) and vice-versa. In practice, we used on the order of 100 realizations for N = 256, 10 realizations for N = 1024 and 1 realization for N = 8192 or higher.

As output, graphs like the one in figure 3 were delivered. This graph shows the distortion-rate performance of the companding scheme (for N = 1024, in this particular case). The rate-loss (the horizontal distance between the blue dotted line and the green line in the graph) was extracted for several values of N, the results for the distortion-loss, the corresponding quantity on the vertical axis of the graph, being shown in table 2.

Fig. 3. Distortion-rate performance of the companding scheme for N = 1024. Distortions are in terms of the SNR in dB (10 log₁₀(σ²/D)). The blue line represents the Shannon distortion-rate function. The blue dotted line gives the best achievable Z^N lattice vector quantizer (LVQ) distortion-rate performance. The green and red lines give the performance of the companding scheme and of the identity compander F(x) = F^{-1}(x) = x. The red crosses are results obtained from quantizing the source directly (i.e. using the identity compander) and calculating the rate and distortion values directly.

N                      256   512   1024  2048  4096  8192  16384  32768  65536
Distortion Loss (dB)   14,6  14,0  12,6  11,0  8,65  6,27  4,28   2,77   1,57

Table 2. Distortion Loss of the companding scheme for varying N

As we can see from figure 3, the scheme performs well with N = 1024, and from table 2 we can confirm that asymptotic optimality is achieved. Note that N = 1024 is the vector size which corresponds to the frame size used in practice (N/f_s = 21,3 ms, where f_s is the sample frequency); (much) lower or higher values of N do not correspond to the frequency and time resolution of the human ear, respectively. Furthermore, for N/f_s ≫ 20 ms the audio signal cannot be considered stationary.

Additionally, we computed the distortion-rate figures for N = 1024 with and without approximation of M_t by M_{t,a}, and overlaid the resulting curves. The resulting graph can be seen in figure 4. As we can see, the lines which correspond to each other are practically perfectly overlaid, which indicates that for N = 1024 we can approximate M_t by M_{t,a} when L ≠ N.

Fig. 4. Distortion-rate performance of the companding scheme for N = 1024: comparison of the circulant matrix approximation with the real values. See figure 3 for the meaning of the colors (the dotted red and green here are exactly the same as the solid red and green in figure 3, respectively). The common colors between the two figures refer to the non-approximated version. The light-blue, dotted light-blue, dark green and brown lines correspond to the blue, dotted blue, green and red lines, respectively, but for the approximated version.
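Purely as an illustration of the simulation loop behind table 1 and figure 3 (and not the actual code of [11]), one distortion-rate point could be measured along the following lines, chaining the earlier sketches; the componentwise uniform quantizer and the empirical-entropy rate estimate are the assumptions made here, and the M = 0 compressor is used instead of the M = 3 of table 1.

```python
import numpy as np

def simulate_point(x, h, w, step, c1=1.0, c2=1.0):
    """One (rate, distortion) point: compress, quantize on Z^N, expand, measure.

    Rate is the empirical first-order entropy of the quantizer indices, in bits/sample.
    """
    N = len(x)
    Fx = compressor_M0(x, h, w, c1, c2)          # compressed-domain vector
    idx = np.round(Fx / step).astype(int)        # componentwise uniform scalar quantization (Z^N)
    Fq = idx * step                              # reproduction in the compressed domain
    # expander: frequency domain, invert G~, undo the window (eq. (37))
    xi_hat = np.fft.fft(Fq)
    H2 = np.abs(np.fft.fft(h, n=N, axis=1)) ** 2
    z0 = np.fft.fft(w * x)                       # initial estimate (here: the encoder's own z;
                                                 # the paper proposes the previous frame's solution)
    z = expander(xi_hat, z0, H2, c1, c2)
    y = np.fft.ifft(z).real / w
    # empirical entropy of the indices and perceptual distortion of eq. (9)
    _, counts = np.unique(idx, return_counts=True)
    p = counts / counts.sum()
    rate = -np.sum(p * np.log2(p))
    return rate, distortion(x, y, h, w, L=N, c1=c1, c2=c2)
```

Sweeping step over the five scales of table 1 and averaging over realizations of x traces out one distortion-rate curve like the green line of figure 3.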

4.2. Taylor expansion

To test how good the approximation of the original compressor of section 3.2 by the Taylor expansion of section 3.3 is, and what order M we must use to get a good approximation, we computed the contribution of the 0th to the 6th order terms of the signal-dependent gain Γ of equation (30) (remember that the compressor essentially multiplies the windowed input pointwise in the frequency domain by the gain, equation (26)), for the vector size most useful in practice, N = 1024. The contributions can be seen in figure 5, depicted in double-logarithmic scale. The figure shows that already for N = 1024 we can use M = 0. If we want finer detail we can use a higher M, but an order in the units is more than enough. Indeed, the distance between the 0th and 1st order terms is at least 15,7 dB, which means that the 1st order term has values that are attenuated at least 6,1× with respect to the values of the 0th order term. For the 2nd order term this distance is 23,9 dB (15,7×), and for the 6th order 42,8 dB (138×).

Fig. 5. Contribution of the j-th order term of the Taylor expansion of the signal-dependent gain Γ, j = 0, 1, ..., 6

4.3. Limitations

Figure 5 shows an important limitation of the developed multidimensional companding scheme: as the gain Γ decreases abruptly with increasing frequency, the inversion of the compressor, although mathematically possible (at least for M = 0; see subsection 3.5), is a numerically badly conditioned problem. Using the rough rationale based on equation (38) that the expander multiplies the incoming signal pointwise by the inverse of Γ, we can see through figure 5 that the quantization noise present at the input of the expander will be heavily amplified in the high frequencies (imagine the characteristics upside-down, since the inverse in the linear domain corresponds to the symmetric in the log domain). The same happens for the very low frequencies. In practice, this means that two very different signals in the non-compressed domain will have very close compressed signals, up to numerical error.

To deal with this limitation we can crop the gain Γ and the masking threshold ã̂ of the distortion measure, i.e., if the frequency is below a certain threshold in the low-frequency range, we use the boundary value (the value at the threshold), and if the frequency is above another threshold, in the high-frequency range, we do the same. Between the two thresholds, the original gain/masking threshold is conserved. For N = 1024, the distortion-rate performance of the compressor was simulated again cropping Γ and ã̂ in the high-frequency range at {24, 18, 14} kHz (the first case corresponds to not cropping, as f_s/2 = 24 kHz), and then cropping in the low-frequency range at {50, 100} Hz, for no high-frequency cropping and for high-frequency cropping at 18 kHz. The results are shown in figures 6 and 7, respectively.

We observe a degradation of performance for decreasing threshold of the high-frequency crop: the distortion-rate function, the Z^N best distortion-rate performance and the distortion-rate performance of the compander all sink in the graphs, i.e., for the same rate, the best achievable and the actual distortions increase (the SNR decreases). At a certain point (in the figure at 14 kHz) the performance of the compander sinks below the performance of not doing anything, i.e., of using the identity compander F(x) = x. For the low-frequency cropping case no performance loss is obtained, as the compander curves lie on top of each other. For good numerical conditioning in practice, cropping the low frequencies at 50 Hz and the high frequencies at 18 kHz suffices. Graph 7(b) is thus the one that depicts the performance of the scheme in a practical situation.

(a) 24 kHz (no cropping)    (b) 18 kHz    (c) 14 kHz

Fig. 6. Distortion-rate performances for cropping in the high-frequency range at 24, 18 and 14 kHz

(a) no cropping (b) high-frequency cropping at 18 kHz

Fig. 7. Distortion-rate performances for cropping in the low-frequency range at 50 and 100 Hz, for no high-frequency cropping and for cropping at 18 kHz
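The cropping described in section 4.3 amounts to clamping the gain (equivalently the masking-threshold weights) outside a pass-band before it is used in the compressor and expander. A minimal sketch, with hypothetical threshold arguments defaulting to the 50 Hz / 18 kHz values reported above:

```python
import numpy as np

def crop_gain(gamma, fs, f_low=50.0, f_high=18e3):
    """Hold the gain Gamma (or a~) constant below f_low and above f_high (section 4.3).

    gamma is a length-N vector over DFT bins and is assumed symmetric
    (gamma[m] == gamma[N - m]), as it is for real input frames.
    """
    N = len(gamma)
    k = np.minimum(np.arange(N), N - np.arange(N))   # "absolute" bin index of each DFT bin
    k_low = int(round(f_low * N / fs))               # bin of the low-frequency threshold
    k_high = int(round(f_high * N / fs))             # bin of the high-frequency threshold
    out = gamma.copy()
    out[k < k_low] = gamma[k_low]                    # boundary value below the threshold
    out[k > k_high] = gamma[k_high]                  # boundary value above the threshold
    return out
```

In the reported experiments this kind of clamping at 50 Hz and 18 kHz was enough for good conditioning without a visible distortion-rate penalty.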

5. CONCLUSIONS

A multidimensional companding scheme was developed for the perceptual distortion measure of [1]. Although the developed scheme is suboptimal, it was proven (in [11]) that no optimal scheme exists under a wide range of solutions of the optimality condition (5), and that the conceived scheme nevertheless reaches optimality asymptotically, i.e., with increasing vector dimension N. The scheme was developed in the frequency domain; in its most simple form, the compressor windows the input signal, applies a Discrete Fourier Transform (DFT), multiplies the result by the square-root of the inverse of its masking threshold, the square-root of equation (11), and then goes back into the time domain with the inverse DFT. This process is exactly the same as the normalization by perceptual weights of equation (12), but with the advantage that no perceptual weights have to be transmitted through the channel. At the receiver, an algorithm based on numerical methods, the expander, has to be run. The expander does not depend on the perceptual weights and it can be run using the previous audio frame, already available at the decoder, as its initial condition. The theoretical results and assumptions were corroborated with simulations.

6. REFERENCES

[1] S. van de Par, A. Kohlrausch, R. Heusdens, J. Jensen, and S. Jensen, "A perceptual model for sinusoidal audio coding based on spectral integration," EURASIP Journal on Applied Signal Processing, vol. 2005, no. 9, pp. 1292–1304, June 2005.

[2] G. A. Soulodre, T. Grusec, M. Lavoie, and L. Thibault, "Subjective evaluation of state-of-the-art 2-channel audio codecs," J. Audio Eng. Soc., vol. 46, no. 3, pp. 164–176, March 1998.

[3] T. Painter and A. Spanias, "Perceptual coding of digital audio," Proceedings of the IEEE, vol. 88, no. 4, pp. 451–515, April 2000.

[4] ISO/IEC 14496-3:2001, Coding of audio-visual objects – Part 3: Audio (MPEG-4 Audio Edition 2001), ISO/IEC International Standard, 2001.

[5] B. Edler and G. Schuller, "Audio coding using a psychoacoustic pre- and post-filter," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), June 2000, vol. 2, pp. 881–884.

[6] J. Østergaard, Multiple-Description Lattice Vector Quantization, Ph.D. thesis, Technical University of Delft, 2007.

[7] R. Vafin and W. Kleijn, "Entropy-constrained polar quantization and its application to audio coding," IEEE Trans. on Speech and Audio Processing, vol. 13, no. 2, pp. 220–232, March 2005.

[8] P. Korten, J. Jensen, and R. Heusdens, "High resolution spherical quantization of sinusoidal parameters using a perceptual distortion measure," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), March 2005, vol. 3, pp. 177–180.

[9] V. K. Goyal, "Multiple description coding: Compression meets the network," IEEE Signal Processing Magazine, vol. 18, no. 5, pp. 74–94, Sept. 2001.

[10] T. Linder, R. Zamir, and K. Zeger, "High-resolution source coding for non-difference distortion measures: Multidimensional companding," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 548–561, March 1999.

[11] J. Crespo, "A multidimensional companding scheme for source coding with a perceptually relevant distortion measure," 2009.

[12] H. Purnhagen, "Advances in parametric audio coding," in Proc. 1999 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, USA, 1999, pp. W99-1–W99-4.

[13] R. Horn and C. Johnson, Matrix Analysis, Cambridge University Press, 1990.

[14] P. C. Yip and K. Ramamohan Rao, The Transform and Data Compression Handbook, CRC, 2000.

[15] R. M. Gray, Toeplitz and Circulant Matrices: A Review, Now Publishers, Norwell, Massachusetts, 2006.

[16] T. M. Apostol, Calculus, Vol. 2: Multi-Variable Calculus and Linear Algebra with Applications, Wiley, second edition, 1969.

[17] J. Nocedal and S. J. Wright, Numerical Optimization, Springer-Verlag, 1999.

[18] T. Linder and R. Zamir, "High-resolution source coding for non-difference distortion measures: The rate-distortion function," IEEE Transactions on Information Theory, vol. 45, no. 2, pp. 533–547, March 1999.