
arXiv:0711.1766v4 [cs.IT] 18 Feb 2008

Achieving the Gaussian Rate-Distortion Function by Prediction

Ram Zamir, Yuval Kochman and Uri Erez, Dept. Electrical Engineering-Systems, Tel Aviv University

Abstract: The "water-filling" solution for the quadratic rate-distortion function of a stationary Gaussian source is given in terms of its power spectrum. This formula naturally lends itself to a frequency domain "test-channel" realization. We provide an alternative time-domain realization for the rate-distortion function, based on linear prediction. The predictive test-channel has some interesting implications, including the optimality at all distortion levels of pre/post filtered vector-quantized differential pulse code modulation (DPCM), and a duality relationship with decision-feedback equalization (DFE) for inter-symbol interference (ISI) channels.

Keywords: Test channel, water-filling, pre/post-filtering, DPCM, Shannon lower bound, ECDQ, MMSE estimation, directed-information, decision feedback, equalization.

The work of the first two authors was partially supported by the Israel Science Foundation, grant ISF 1259/07.

I. INTRODUCTION

The rate-distortion function (RDF) of a stationary source is given as a limit of normalized mutual information associated with vectors of source samples. For a real valued source {Xn}, it is written as

    R(D) = lim_{n→∞} (1/n) inf I(X1,...,Xn; Y1,...,Yn),

where the infimum is over all channels X1,...,Xn → Y1,...,Yn such that the expected mean-squared distortion level is at most D.

The water-filling solution for the quadratic rate-distortion function of a stationary Gaussian source is given in terms of the power spectrum of the source. Similarly, the capacity of a power-constrained ISI channel with Gaussian noise is given by a water-filling solution relative to the effective noise spectrum. Both these formulas correspond to limiting values of vector mutual-information quantities in the frequency domain. In contrast, linear prediction along the time domain can translate these limiting vector quantities into scalar ones. Indeed, Cioffi et al. [4] showed that the capacity of a power-constrained ISI channel with Gaussian noise is equal to the scalar mutual-information over a slicer embedded in a decision-feedback loop.

We show that a parallel result holds for the rate-distortion function: the RDF is equal to the scalar mutual-information over an additive white Gaussian noise (AWGN) channel embedded in a source prediction loop, as shown in Figure 1. This result implies that the RDF of a stationary Gaussian source can essentially be realized in a sequential manner (as will be clarified later), and it joins the observations regarding the role of minimum mean-squared error (MMSE) estimation in successive encoding and decoding of Gaussian channels and sources [6], [7], [3].

The Quadratic-Gaussian Rate-Distortion Function

When the source is zero-mean Gaussian, the RDF takes an explicit form in the frequency domain in terms of the power-spectrum

    S(e^{j2πf}) = Σ_k r[k] e^{-j2πfk},   -1/2 < f ≤ 1/2,

where r[k] = E{Xn Xn+k} is the auto-correlation function of Xn. The water-filling solution, illustrated in Figure 2, gives a parametric formula for the Gaussian RDF in terms of a parameter θ > 0 [8], [2], [5]:

    R(D) = ∫_{-1/2}^{1/2} (1/2) log[ S(e^{j2πf}) / D(e^{j2πf}) ] df,        (1)

where the distortion spectrum is

    D(e^{j2πf}) = { θ,              if S(e^{j2πf}) > θ
                  { S(e^{j2πf}),    otherwise,                               (2)

and where the water level θ is chosen so that the total distortion is D:

    D = ∫_{-1/2}^{1/2} D(e^{j2πf}) df.                                       (3)

In the special case of a memoryless (white) Gaussian source X ~ N(0, σ²), the power-spectrum is flat, the water level is θ = D, and the RDF is simplified to

    R(D) = (1/2) log( σ² / D ),   0 < D ≤ σ².                                (4)

A channel which realizes this infimum is called an optimum test-channel. The optimum test-channel can in this case be written in a backward additive-noise form, X = Y + N with N ~ N(0, D), or in a forward linear additive-noise form, Y = β(αX + N), with N ~ N(0, D) and α = β = sqrt(1 - D/σ²).

In the general stationary case, the forward channel realization of the RDF has several equivalent forms [8, Sec. 9.7], [2, Sec. 4.5]. The one which is more useful for our purpose replaces the scalar gains above by linear time-invariant filters, while keeping the noise AWGN [18]:

    Yn = h_{2,n} * ( h_{1,n} * Xn + Nn ),                                    (5)

where * denotes convolution, Nn ~ N(0, θ) is AWGN whose variance is the water level θ = θ(D), and h_{1,n} and h_{2,n} are the impulse responses of a suitable pre-filter and post-filter, respectively; see (13)-(18) in the next section.
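As a numerical illustration of the parametric solution (1)-(3), the water level θ and the rate can be evaluated on a sampled spectrum by bisection. The following Python sketch is only illustrative; the function name, grid resolution and the AR(1) example spectrum are our own choices, not taken from the references:

```python
import numpy as np

def waterfill_rdf(S, D, tol=1e-12):
    """Evaluate the Gaussian RDF (1)-(3) for a power spectrum S(e^{j2pi f})
    sampled on a uniform grid over [-1/2, 1/2). Returns (rate in bits, theta)."""
    S = np.asarray(S, dtype=float)
    assert 0 < D <= S.mean(), "need 0 < D <= source variance"

    def total_distortion(theta):
        return np.minimum(theta, S).mean()      # discretized version of (3)

    # Bisect on theta: the total distortion is increasing in the water level.
    lo, hi = 0.0, S.max()
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if total_distortion(mid) < D:
            lo = mid
        else:
            hi = mid
    theta = 0.5 * (lo + hi)
    Df = np.minimum(theta, S)                   # distortion spectrum (2)
    rate = 0.5 * np.mean(np.log2(S / Df))       # discretized version of (1)
    return rate, theta

# Example: unit-variance AR(1) spectrum S(f) = (1 - a^2)/|1 - a e^{-j2pi f}|^2
f = np.arange(-0.5, 0.5, 1e-4)
a = 0.9
S = (1 - a**2) / np.abs(1 - a * np.exp(-2j * np.pi * f))**2
print(waterfill_rdf(S, D=0.05))
```

For D at or below the minimum of the spectrum the returned water level equals D itself, which is the Shannon-lower-bound regime discussed below.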

[Block diagram of Figure 1: Source Xn → Pre-filter H1(e^{j2πf}) → Un → Σ (minus predictor output) → Zn → Σ (+ Nn) → Zqn → Σ (+ predictor output) → Vn → Post-filter H2(e^{j2πf}) → Reconstruction Yn, with predictor Ûn = g(V_{n-L}^{n-1}) in the feedback path.]

Fig. 1. Predictive Test Channel.
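The central block of Figure 1 is easy to simulate directly. The sketch below is our own illustration; the predictor coefficients and the AR(1) stand-in for the pre-filtered source are arbitrary. It runs the prediction loop of Figure 1 and confirms numerically that the channel from Un to Vn reduces to Vn = Un + Nn for any predictor, which is the "DPCM error identity" invoked in Section II:

```python
import numpy as np

rng = np.random.default_rng(0)
n, L, theta = 10_000, 2, 0.1
a_pred = np.array([0.6, -0.1])            # arbitrary linear predictor coefficients

# Pre-filter output U_n: here simply an AR(1) process standing in for h_1 * X
U = np.zeros(n)
for t in range(1, n):
    U[t] = 0.9 * U[t - 1] + np.sqrt(1 - 0.9**2) * rng.standard_normal()

N = np.sqrt(theta) * rng.standard_normal(n)    # AWGN of variance theta
V = np.zeros(n)
for t in range(n):
    past = V[max(t - L, 0):t][::-1]            # (V_{t-1}, ..., V_{t-L})
    Uhat = a_pred[:len(past)] @ past           # prediction from the noisy past
    Z = U[t] - Uhat                            # prediction error
    Zq = Z + N[t]                              # AWGN "quantization" of the error
    V[t] = Uhat + Zq                           # reconstruction before the post-filter

# DPCM error identity: V_n = U_n + N_n regardless of the predictor used
print(np.max(np.abs(V - U - N)))               # ~1e-16
```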

Fig. 2. The water filling solution. [Plot of S(e^{j2πf}) versus f with the water level θ and the distortion spectrum D(e^{j2πf}).]

If we take a discrete approximation of (1),

    (1/2) Σ_i log[ S(e^{j2πf_i}) / D(e^{j2πf_i}) ],                          (6)

then each component has the memoryless form of (4). Hence, we can think of the frequency domain formula (1) as an encoding of parallel (independent) Gaussian sources, where source i is a memoryless Gaussian source Xi ~ N(0, S(e^{j2πf_i})) encoded at distortion level D(e^{j2πf_i}); see [5]. Indeed, practical frequency domain source coding schemes such as Transform Coding and Sub-band Coding [10] get close to the RDF of a stationary Gaussian source using an "array" of parallel scalar quantizers.

Rate-Distortion and Prediction

Our main result is a predictive channel realization for the quadratic-Gaussian RDF (1), which can be viewed as the time-domain counterpart of the frequency domain formulation above. The notions of entropy-power and Shannon lower bound (SLB) provide a simple relation between the Gaussian RDF and prediction, and motivate our result. Recall that the entropy-power is the variance of a white Gaussian process having the same entropy-rate as the source [5]; for a zero-mean Gaussian source with power-spectrum S(e^{j2πf}), the entropy-power is given by

    Pe(X) = exp( ∫_{-1/2}^{1/2} log S(e^{j2πf}) df ).                        (7)

In the context of Wiener's spectral-factorization theory, the entropy-power quantifies the MMSE in one-step linear prediction of a Gaussian source from its infinite past [2]:

    Pe(X) = inf_{a_i} E( Xn - Σ_{i=1}^∞ a_i Xn-i )².                         (8)

The error process associated with the infinite-order optimum predictor,

    Zn = Xn - Σ_{i=1}^∞ a_i Xn-i,                                            (9)

is called the innovation process. The orthogonality principle of MMSE estimation implies that the innovation process has zero mean and is white; in the Gaussian case un-correlation implies independence, so

    Zn ~ N(0, Pe(X))                                                         (10)

is a memoryless process. See, e.g., [7].

From an information theoretic perspective, the entropy-power plays a role in the SLB:

    R(D) ≥ (1/2) log( Pe(X) / D ).                                           (11)

Equality in the SLB holds if the distortion level is smaller than or equal to the lowest value of the power spectrum: D ≤ Smin ≜ min_f S(e^{j2πf}), in which case D(e^{j2πf}) = θ = D [2]. It follows that for distortion levels below Smin the RDF of a Gaussian source with memory is equal to the RDF of its memoryless innovation process Zn:

    R(D) = R_Z(D) = (1/2) log( σ_Z² / D ),   D ≤ Smin,                       (12)

where σ_Z² = Pe(X).

We shall see later in Section II how identity (12) translates into a predictive test-channel, which can realize the RDF not only for small but for all distortion levels. This test channel is motivated by the sequential structure of Differential Pulse Code Modulation (DPCM) [12], [10]. The goal of DPCM is to translate the encoding of dependent source samples into a series of independent encodings. The task of removing the time dependence is achieved by (linear) prediction: at each time instant the incoming source sample is predicted from previously encoded samples, the prediction error is encoded by a scalar quantizer and added to the predicted value to form the new reconstruction. See Figure 3.
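As a quick numerical check of (7)-(12), assuming a sampled spectrum: for a unit-variance AR(1) source with coefficient a, the entropy-power, and hence the one-step prediction MMSE (8), equals 1 - a², and for any distortion below the spectral minimum the water-filling rate (1) coincides with the Shannon lower bound (11). The snippet is illustrative only:

```python
import numpy as np

f = np.arange(-0.5, 0.5, 1e-4)
a = 0.9
S = (1 - a**2) / np.abs(1 - a * np.exp(-2j * np.pi * f))**2   # unit-variance AR(1)

Pe = np.exp(np.mean(np.log(S)))      # entropy power (7); equals 1 - a**2 here
D = 0.5 * S.min()                    # any distortion below S_min
R_slb = 0.5 * np.log2(Pe / D)        # Shannon lower bound (11)

# Water-filling rate (1) with theta = D, valid since D <= S_min:
R_wf = 0.5 * np.mean(np.log2(S / np.minimum(D, S)))
print(Pe, R_slb, R_wf)               # Pe ~ 0.19; the two rates coincide, cf. (12)
```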

Fig. 3. DPCM Quantization Scheme. [Block diagram: source Xn → Σ (minus predictor output) → quantizer → Σ (plus predictor output) → reconstruction, with the predictor fed from the reconstruction.]

Fig. 4. Equivalent Channel. [Block diagram: source Xn → pre-filter H1(e^{j2πf}) → Un → Σ (+ Nn) → Vn → post-filter H2(e^{j2πf}) → Yn.]

A negative result along this direction was recently given by Kim and Berger [13]. They showed that the RDF of an auto-regressive (AR) Gaussian process cannot be achieved by directly encoding its innovation process. This can be viewed as open-loop prediction, because the innovation process is extracted from the clean source rather than from the quantized source [12], [9]. Here we give a positive result, showing that the RDF can be achieved if we embed the quantizer inside the prediction loop, i.e., by closed-loop prediction as done in DPCM. The RDF-achieving system consists of pre- and post-filters, and an AWGN channel embedded in a source prediction loop. As we show, the scalar (un-conditioned) mutual information over this inner AWGN channel is equal to the RDF.

After presenting and proving our main result in Sections II and III, respectively, we discuss its characteristics and operational implications. Section IV discusses the spectral features of the solution. Section V relates the solution to vector-quantized DPCM of parallel sources. Section VI shows an implementation by Entropy Coded Dithered Quantization (ECDQ), while extending the ECDQ rate formula [16] to the case of a system with feedback. Finally, in Section VII we relate prediction in source coding to prediction for channel equalization and to recent observations by Forney [7]. As in [7], our analysis is based on the properties of information measures; the only result we need from Wiener's estimation theory is the orthogonality principle.

II. MAIN RESULT

Consider the system in Figure 1, which consists of three basic blocks: a pre-filter H1(e^{j2πf}), a noisy channel embedded in a closed loop, and a post-filter H2(e^{j2πf}), where H(e^{j2πf}) denotes the frequency response of a filter with impulse response hn,

    H(e^{j2πf}) = Σ_n hn e^{-jn2πf},   -1/2 < f ≤ 1/2.

The pre-filter frequency response is given by

    |H1(e^{j2πf})|² = 1 - D(e^{j2πf}) / S(e^{j2πf}),                         (13)

where 0/0 is taken as 1. The pre-filter output, denoted Un, is fed to the central block which generates a process Vn according to the following recursion equations:

    Ûn  = g(Vn-1, Vn-2, ..., Vn-L)                                           (14)
    Zn  = Un - Ûn                                                            (15)
    Zqn = Zn + Nn                                                            (16)
    Vn  = Ûn + Zqn                                                           (17)

where Nn ~ N(0, θ) is a zero-mean white Gaussian noise, independent of the input process {Un}, whose variance is equal to the water level θ = θ(D); and g(·) is some prediction function for the input Un given the L past samples of the output process (Vn-1, Vn-2, ..., Vn-L). Finally, the post-filter frequency response is the complex conjugate of the frequency response of the pre-filter,

    H2(e^{j2πf}) = H1*(e^{j2πf}).                                            (18)

Equivalently, the impulse response of the post-filter is the reflection of the impulse response of the pre-filter:

    h_{2,n} = h_{1,-n}.                                                      (19)

See a comment regarding causality in the end of the section.

The block from Un to Vn is equivalent to the configuration of DPCM, [12], [10], with the DPCM quantizer replaced by the additive Gaussian noise channel Zqn = Zn + Nn. In particular, the recursion equations (14)-(17) imply that this block satisfies the well known "DPCM error identity" [12],

    Vn = Un + (Zqn - Zn) = Un + Nn.                                          (20)

That is, the output Vn is a noisy version of the input Un via the AWGN channel Vn = Un + Nn. Thus, the system of Figure 1 is equivalent to the system depicted in Figure 4, which corresponds to the forward channel realization (5) of the quadratic-Gaussian RDF.

In DPCM the prediction function g is linear:

    g(Vn-1, ..., Vn-L) = Σ_{i=1}^L a_i Vn-i,                                 (21)

where a_1, ..., a_L are chosen to minimize the mean-squared prediction error:

    σ_L² = min_{a_1,...,a_L} E( Un - Σ_{i=1}^L a_i Vn-i )²,                  (22)

2 so σL is also the MMSE in estimating Un from the vector Xn−δ. In this sense, Theorem 1 holds in general in the limit (Vn−1,...,Vn−L). Clearly, this MMSE is non-increasing with as the system delay δ goes to infinity. the prediction order L, and as L goes to infinity it converges If we insist on a system with causal reconstruction (δ = to 0), then we cannot realize the pre- and post-filters (13) and 2 2 σ∞ = lim σL, (23) (18), and some loss in performance must be paid. Nevertheless, L→∞ if the source spectrum is bounded from below by a positive the optimum infinite order prediction error in Un given the constant, then it can be seen from (13) that in the limit of small past distortion (D 0) the filters can be omitted, i.e., H1 = H2 = − ∆ → Vn = Vn−1, Vn−2,... . (24) 1 for all f. Hence, a causal system (the central block in Figure { } 1) is asymptotically optimal at “high resolution” conditions. We shall see later in Section IV that 2 . We σ∞ = Pe(V ) θ Furthermore, the redundancy of an AWGN channel above the further elaborate on the relationship with DPCM in Section− V. RDF is at most 0.5 bit per source sample for any source and at We now state our main result. any resolution; see, e.g., [16]. It thus follows from Lemma 1 below (which directly characterizes the information rate of the Theorem 1: (Predictive test channel) For any stationary central block of Figure 1), that a causal system (the system source with power spectrum S(ej2πf ) and distortion level D, of Figure 1 without the filters) loses at most 0.5 bit at any the system of Figure 1, with the pre-filter (13) and the post- resolution. filter (18), satisfies These observations shed some light on the “cost of causal- ity” in encoding stationary Gaussian sources [14]. It is an open E(Y X )2 = D. (25) question, though, whether a redundancy better than 0.5 bit can n − n − be guaranteed when using causal pre and post filters in the Furthermore, if the source Xn is Gaussian and g = g(Vn ) 2 system of Figure 1. achieves the optimum infinite order prediction error σ∞ (23), then 1 σ2 I(Z ; Z + N )= log(1 + ∞ )= R(D), (26) n n n 2 θ III. PROOF OF MAIN RESULT where the left hand side is the scalar mutual information over the channel (16). We start with Lemma 1 below, which shows an identity between the mutual information rate over the central block of Figure 1 and the scalar mutual information (26). This identity The proof is given in Section III. The result above is in holds regardless of the pre- and post-filters, and only assumes sharp contrast to the classical realization of the RDF (5), optimum infinite order prediction in the feedback loop. which involves mutual information rate over a test-channel Let with memory. In a sense, the core of the encoding process in 1 I( Un ; Vn ) = lim I(U1,...,Un; V1,...,Vn) (27) the system of Figure 1 amounts to a memoryless AWGN test- { } { } n→∞ n channel (although, as we discuss in the sequel, the channel denote mutual information-rate between jointly stationary (16) is not quite memoryless nor additive). ¿From a practical sources U and V , whenever the limit exists. perspective, this system provides a bridge between DPCM and { n} { n} rate-distortion theory for a general distortion level D> 0. Another interesting feature of the system is the relationship Lemma 1: For any stationary Gaussian process Un in between the prediction error process Zn and the original { } Figure 1, if Uˆn is the optimum infinite order predictor of Un process Xn. 
If Xn is an auto-regressive (AR) process, then − 2 from Vn (so the variance of Zn is σ∞ as defined in (23)), in the limit of small distortion (D 0), Zn is roughly its then innovation process (9). Hence, unlike→ in open-loop prediction I( Un ; Vn )= I(Zn; Zn + Nn). (28) [13], encoding the innovations in a closed-loop system is { } { } optimal in the limit of high-resolution encoding. We shall return to this point, as well as discuss the case of general n−1 Proof: For any finite order predictor g(Vn−L ) we can resolution, in Section IV. write Finally, we note that while the central block of the system I( U ; V V i−1) = I( U ,U Uˆ (L); V Uˆ (L) V i−1) is sequential and hence causal, the pre- and post-filters are { n} i| i−L { n} i − i i − i | i−L non-causal and therefore their realization in practice requires (L) (L) i−1 = I( Un ,Zi ; Zi + Ni Vi−L ) delay. Specifically, since by (19) h2,n = h1,−n, if one of the { } | = I(Z(L); Z(L) + N V i−1) (29) filters is causal then the other must be anti-causal. Often the i i i| i−L (L) (L) filter’s response is infinite, hence the required delay is infinite = I(Zi ; Zi + Ni) (30) as well. Of course, one can approximate the desired spectrum ˆ (L) i−1 (in L2 sense and hence also in rate-distortion sense) to any where Ui = g(Vi−L ) is the L-th order predictor output at (L) degree using filters of sufficiently large but finite delay δ, so time i, and Zi is the prediction error. The first equality the system distortion is actually measured between Yn and above follows since manipulating the condition does not 5 affect the conditional mutual information; the second equality The theorem now follows by combining (36), (35) and (L) follows from the definition of Zi ; (29) follows since Ni is Lemma 1. − independent of ( Un , Vi ) and therefore An alternative proof of Theorem 1, based only on spectral { } considerations, is given in the end of the next section. (Z(L) + N ) (Z(L), V i−1) U i i ←→ i i−L ←→ { n} form a Markov chain; and (30) follows from two facts: first, since Ni is independent of Ui and previous Ni’s, it is { (}L) i−1 also independent of the pair (Zi , Vi−L ) by the recursive IV. PROPERTIES OF THE PREDICTIVE TEST-CHANNEL structure of the system; second, we assume optimum (MMSE) The following observations shed light on the behavior of prediction, hence the orthogonality principle implies that the the test channel of Figure 1. (L) i−1 prediction error Zi is orthogonal to the measurements Vi−L , Prediction in the high resolution regime. If the power- so by Gaussianity they are also independent, and hence by spectrum S(ej2πf ) is everywhere positive (e.g., if X can i−1 { n} the two facts we have that Vi−L is independent of the pair be represented as an AR process), then in the limit of small (L) (Zi ,Ni). Since by (22) the variance of the L-th order distortion D 0, the pre- and post-filters (13), (18) converge (L) 2 → prediction error Zi is σL, while the variance of the noise to all-pass filters, and the power spectrum of Un becomes Ni is θ, we thus obtained from (30) the power spectrum of the source Xn. Furthermore, noisy − 2 prediction of Un (from the “noisy past” Vn , where Vn = i−1 1 σL I( Un ; Vi Vi−L )= log 1+ . (31) Un + Nn) becomes equivalent to clean prediction of Un from { } | 2 θ − its own past U . Hence, in this limit the prediction error Zn This implies in the limit as L   n → ∞ is equivalent to the innovation process of Xn (9). In particular, 2 1 σ Zn is an i.i.d. 
process whose variance is Pe(X) = the entropy- I( U ; V V −) = log 1+ ∞ (32) { n} i| i 2 θ power of the source (7). = I(Zn;Zn + Nn). (33) Prediction in the general case. Interestingly, for general distortion D> 0, the prediction error Zn is not white, as the Note that by stationarity, I( U ; V V −) is independent of i. { n} i| i noisiness of the past does not allow the predictor g to remove Thus, all the source memory. Nevertheless, the noisy version of the i−1 I( U ; V )+ I( U ; V V )+ . . . + I( U ; V V ) prediction error Zqn = Zn + Nn is white for every D > 0, { n} 1 { n} 2| 1 { n} i| 1 because it amounts to predicting Vn from its own infinite normalized by 1/i converges as i to I( U ; V V −). n i i past: since Nn has zero-mean and is white (and therefore By the definition of mutual information→ ∞ rate (27){ and} by| the independent of the past), Uˆn that minimizes the prediction chain rule for mutual information [5], this implies that the left error of Un is also the optimal predictor for Vn = Un + Nn. hand side of (28) is equal to In particular, in view of (8) and (10), we have − I( Un ; Vn )= I( Un ; Vi Vi ). (34) Zq (0, P (V )) (37) { } { } { } | n ∼ N e Combining (33) and (34) the lemma is proved. where Pe(V ) is the entropy-power of the process Vn. And since Zqn is the independent sum of Zn and Nn, we also have the relation Theorem 1 is a simple consequence of Lemma 1 above and P (V )= σ2 + θ the forward channel realization of the RDF. As discussed in e ∞ 2 the previous section, the DPCM error identity (20) implies where σ∞ is the variance of Zn (23) and θ is the variance of that the entire system of Figure 1 is equivalent to the system Nn. depicted in Figure 4, consisting of a pre-filter (13), an AWGN Sequential Additivity. The whiteness of Zqn might seem channel with noise variance θ, and a post-filter (18). This is at first a contradiction, because Zqn is the sum of a non-white also the forward channel realization (5) of the RDF [8], [2], process, Zn, and a white process Nn; nevertheless, Zn and [18]. In particular, as simple spectral analysis shows, the power N are not independent, because Z depends on past{ values} { n} n spectrum of the overall error process Yn Xn is equal to the of Nn through the feedback loop and the past of Vn. Thus, the j2πf− water filling distortion spectrum D(e ) in (2). Hence, by channel Zqn = Zn +Nn is not quite additive but “sequentially (3) the total distortion is D, and (25) follows. additive”: each new noise sample is independent of the present We turn to prove the second part of the theorem (equa- and the past but not necessarily of the future. In particular, this tion (26) ). Since the system of Figure 4 is equivalent to the channel satisfies: forward channel realization (5) of the RDF of X , we have { n} I(Z ; Z +N Z +N , ..., Z +N )= I(Z ; Z +N ) , n n n| 1 1 n−1 n−1 n n n I( Xn ; Yn )= R(D) (35) (38) { } { } so by the chain rule for mutual information where I denotes mutual information-rate (27). Since U is { n} a function of Xn , and since the post-filter H2 is invertible I¯( Zn ; Zn + Nn ) > I(Zn; Zn + Nn) . within the pass-band{ } of the pre-filter H , we also have { } { } 1 Later in Section VI we rewrite (38) in terms of directed mutual I( X ; Y )= I( U ; V ). (36) information. { n} { n} { n} { n} 6
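These relations can be cross-checked numerically. The sketch below is ours; the AR(1) spectrum and the chosen water level are arbitrary. It computes the water-filling rate (1) and, separately, the scalar mutual information of (26) from the entropy-power of Vn via (37), and the two agree for any water level θ:

```python
import numpy as np

f = np.arange(-0.5, 0.5, 1e-5)
a = 0.9
S = (1 - a**2) / np.abs(1 - a * np.exp(-2j * np.pi * f))**2   # source spectrum

theta = 0.2                                    # pick any water level
D = np.minimum(theta, S).mean()                # resulting total distortion (3)
R = 0.5 * np.mean(np.log2(S / np.minimum(theta, S)))          # RDF (1)

S_V = np.maximum(theta, S)                     # power spectrum of V_n = U_n + N_n
Pe_V = np.exp(np.mean(np.log(S_V)))            # entropy power of V_n
sigma2_inf = Pe_V - theta                      # prediction error variance, cf. (37)
I_scalar = 0.5 * np.log2(1 + sigma2_inf / theta)              # right side of (26)
print(D, R, I_scalar)                          # R and I_scalar agree (Theorem 1)
```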

(1) Z The channel when the SLB is tight. As long as D is Xn smaller than the lowest point of the source power spectrum K independent joint quantization S , we have D(ej2πf )= θ = D in (1), and the quadratic- pre-post filters min Zq Gaussian RDF coincides with the SLB (11). In this case, the (K) and (K-dim VQ) following properties hold for the predictive test channel: Xn prediction loops • The power spectra of Un and Yn are the same and are equal to S(ej2πf ) D. − • The power spectrum of Vn is equal to the power spectrum j2πf of the source S(e ). (1) (K) Yn Yn • The variance of Zqn is equal to the entropy-power of Vn by (37), which is equal to Pe(X). Fig. 5. DPCM of parallel sources. • As a consequence we have I(Z ; Z + N ) = h(Zq ) h(N ) n n n n − n sources, as happens, e.g., in video coding. Figure 5 shows = h (0, Pe(V )) h (0,D) DPCM encoding of K parallel sources. The spectral shaping N − N and prediction are done in the time domain for each source 1 Pe(X)    = log separately. Then, the resulting vector of prediction errors 2 D K is quantized jointly at each time instant by a vector quantizer. which is indeed the SLB (11).  The desired properties of additive quantization error, and rate As discussed in the Introduction, the SLB is also the RDF of which is equal to K times the mutual information I(Zn; Zn + the innovation process (12), i.e., the conditional RDF of the Nn), can be approached in the limit of large K by a suitable − source Xn given its infinite clean past Xn . choice of the quantizer. In the next section we discuss one An alternative derivation of Theorem 1 in the spectral way to do that using lattice ECDQ. domain. For a general D, we can use (37) and the equivalent What if we have only one source instead of K parallel channel of Figure 4 to re-derive the scalar mutual information - sources? If the source has decaying memory, we can still RDF identity (26). Note that for any D the power spectrum of approximate the parallel source coding approach above, at U and Y is equal to max 0,S(ej2πf ) θ , where θ = θ(D) n n { − } the cost of large delay, by using interleaving. We divide the is the water-level. Thus the power spectrum of Vn = Un +Nn (pre-filtered) source into K long blocks, which are separately is given by max θ,S(ej2πf ) . Since as discussed above the { } predicted and then interleaved and jointly quantized as if they variance of Zqn = Zn + Nn is given by the entropy power of were parallel sources. See Figure ??. This is analogous to the process Vn, we have the method used in [11] for combining coding-decoding and 1 P (max θ,S(ej2πf ) ) decision-feedback equalization (DFE). I(Z ; Z + N ) = log e { } n n n 2 θ   from pre-filter to post-filter = R(D) Σ Π VQ Π−1 Σ − + where Pe( ) as a function of the spectrum is given in (7), and the second· equality follows from (1).


−1 V. VECTOR-QUANTIZED DPCM AND D∗PCM Fig. 6. VQ-DPCM for a single source using interleaving. (Π and Π denote interleaving and de-interleaving, respectively.) As mentioned earlier, the structure of the central block of the channel of Figure 1 is of a DPCM encoder, with If we do not use any of the above, but restrict our- the scalar quantizer replaced by the AWGN channel Zqn = selves to scalar quantization (K = 1), then we have a Zn + Nn. However, if we wish to implement the additive pre/post filtered DPCM scheme. By combining Theorem 1 noise by a quantizer whose rate is the mutual information with known bounds on the performance of (memoryless) I(Zn; Zn + Nn), we must use vector quantization (VQ). In- entropy-constrained scalar quantizers (e.g., [18]), we have deed, while scalar quantization noise is approximately uniform opt 1 2πe over intervals, good high dimensional lattices generate near H(Q (Zn)) R(D)+ log (39) Gaussian quantization noise [17]. Yet, how can we combine ≤ 2 12   VQ and DPCM without violating the sequential nature of where 1/2 log(2πe/12) 0.254 bit. See Remark 3 in the ≈ the system? In particular, the quantized sample Zqn must be next section regarding scalar/lattice ECDQ. Hence, Theorem1 available to generate Vn, before the system can predict Un+1 implies that in principle, a pre/post filtered DPCM scheme is and generate Zn+1. optimal, up to the loss of the VQ gain, at all distortion levels One way we can achieve the VQ gain and still retain the and not only at the high resolution regime. sequential structure of the system is by adding a “spatial” A different approach to combine VQ and prediction is dimension, i.e., by jointly encoding a large number of parallel first to extract the innovation process and then to quantize 7

dithern An ECDQ operating on the source Zn is depicted in Figure − + 7. A dither sequence Dn, independent of the input sequence Lattice Z , is added before the quantization and subtracted after. If Z Quantizer n n Zqn Qn the quantizer has a lattice structure of dimension K 1, then P P we assume that the sequence length is ≥ code Entropy L = MK Coder s for some integer M, so the quantizer is activated M times. In this section, we use bold notation for -blocks corresponding Fig. 7. ECDQ Structure. K to a single quantizer operation. At each quantizer operation instant m, a dither vector Dm is independently and uniformly it. It is interesting to mention that this method of “open distributed over the basic lattice cell. The lattice points at the loop” prediction, which we mentioned earlier regarding the quantizer output Qm, m =1,...,M are fed into an entropy model of [13], is known in the quantization literature as coder which is allowed to jointly encode the sequence, and has D∗PCM [12]. The best pre-filter for D∗PCM under a high knowledge of the dither as well, thus for an input sequence of resolution assumption turns out to be the “half-whitening length L it achieves an average rate of: filter”: j2πf 2 j2πf , with the post filter H1(e ) = 1/ S(e ) ∆ 1 M M j2πf| | R = H(Q D ) (40) H2(e ) being its inverse. But even with this optimum filter, ECDQ L 1 | 1 ∗ p D PCM is inferior to DPCM: The optimal distortion gain of bit per source sample. The entropy coder produces a sequence D∗PCM over a non-predictive scheme is s of LRECDQ bits, from which the decoder can recover 2 ⌈ ⌉ σX Q1,... QM , and then subtract the dither to obtain the re- G ∗ = D PCM 2 construction sequence Zq = Q D , n = 1,...L. The 1/2 S (ej2πf )df n n − n −1/2 X reconstruction error sequence (strictly greater than one forR non-whitep spectra by the Cauchy- Nn = Zqn Zn, Schwartz inequality). Comparing to the optimum prediction − gain obtained by the DPCM scheme: called in the sequel the “ECDQ noise”, has K-blocks which 2 are uniformly distributed over the mirror image of the basic σX GDPCM = , lattice cell and are mutually i.i.d. [16]. It is further stated in Pe(X) L [16, Thm.1] that the input and the noise sequences, Z = Z1 L we have: and N = N1 , are statistically independent, and that the ECDQ 1/2 2 rate is equal to the mutual information over an additive noise S (ej2πf )df 2 2 G −1/2 X σ ˜ channel with the input Z and the noise N: DPCM = = U ,   ˜ GD∗PCM R pPe(X) Pe(U)! 1 RECDQ = I(Z; Zq) where ˜ is the pre-filter output in the D∗PCM scheme. This L Un 1 ratio is strictly greater than one for non-white spectra. = I(Z; Z + N) . (41) L However, the derivation of [16, Thm. 1] makes the implicit assumption that the quantizer is used without feedback, that is, the current input is conditionally independent of past outputs VI. ECDQ IN A CLOSED LOOP SYSTEM given the past inputs. (In other words, the dependence on the Subtractive dithering of a uniform/lattice quantizer is a past, if exists, is only due to memory in the source.) When common approach to make the quantization noise additive. there is feedback, this condition does not necessarily hold, As shown in [16], the conditional entropy of the dithered which implies that (even with the dither) the sequences Z lattice quantizer (given the dither) is equal to the mutual and N are possibly dependent. Specifically, since feedback is information in an additive noise channel, where the noise is causal, the input Zn can depend on past values of the ECDQ uniform over the lattice cell. 
Furthermore, for “good” high noise Nn, so their joint distribution in general has the form: dimensional lattices, the noise becomes closer to a white M Gaussian process [17]. Thus, ECDQ (entropy-coded dithered f(ZL,N L)= f(N )f(Z Nm−1) (42) 1 1 m m| 1 quantization) provides a natural way to realize the inner m=1 AWGN channel block of the predictive test-channel. Y where One difficulty, however, we observe in this section is that Z = ZmK the results developed in [16] do not apply to the case where m (m−1)K+1 the ECDQ input depends on previous EDCQ outputs and the denotes the mth K-block, and similarly for Nm. In this case, entropy coding is conditioned on the past. This situation indeed the mutual information rate of (41) over-estimates the true rate happens in predictive coding, when ECDQ is embedded within of the ECDQ. a feedback loop. As we shall see, the right measure in this case Massey shows in [15] that for DMCs with feedback, tradi- is the directed information. tional mutual information is not a suitable measure, and should 8 be replaced by directed information. The directed information (49) follows from Theorem 1. This implies (39) in the previous L between the sequences Z and Zq = Zq1 is defined as section. 4. We can embed a K-dimensional lattice ECDQ for K > 1 L ∆ in the predictive test channel of Figure 1, instead of the I(Z Zq) = I(Zn; Zq Zqn−1) (43) → 1 n| 1 additive noise channel, using the Vector-DPCM (VDPCM) n=1 XL configuration discussed in the previous section. For good = I(Z ; Zq Zqn−1) lattices, when the quantizer dimension K , the noise n n| 1 → ∞ n=1 process N in the rate expressions (41) and (46) becomes white X Gaussian, and the scheme achieves the rate-distortion function. where the second equality holds whenever the channel from Indeed, combining Theorems 1 and 2, we see that the average Z to Zq is memoryless, as in our case. In contrast, the n n rate per sample of such VDPCM with ECDQ satisfies: mutual information between Z and Zq is given by,

L RVDPCM−ECDQ = I(Zn; Zn + Nn)= R(D) . I(Z; Zq)= I(ZL; Zq Zqn−1) (44) 1 n| 1 This implies, in particular, that the entropy coder does not need n=1 X to be conditioned on the past at all, as the predictor handles which by the chain rule for mutual information is in general all the memory. However, when the quantization noise is not higher. For our purposes, we will define the K-block directed Gaussian, or the predictor is not optimal, the entropy coder can information: use the residual time-dependence after prediction to further M reduce the coding rate. The resulting rate of the ECDQ would I (Z Zq) =∆ I(Zm; Zq Zqm−1) (45) be the average directed information between the source and K → 1 m| 1 m=1 its reconstruction as stated in Theorem 2. X The following result, proven in Appendix A, extends Massey’s observation to ECDQ with feedback, and generalizes the result of [16, Thm. 1]: VII. ADUAL RELATIONSHIP WITH DECISION-FEEDBACK EQUALIZATION Theorem 2: (ECDQ Rate with Feedback) The ECDQ system with causal feedback defined by (42) satisfies: In this section we make an analogy between the predictive form of the Gaussian RDF and the “information-optimality” 1 1 of decision-feedback equalization (DFE) for colored Gaussian RECDQ = IK (Z Zq)= IK (Z Z + N) . (46) L → L → channels. As we shall see, a symmetric equivalent form of this channel coding problem, including a water-pouring trans- Remarks: mission filter, an MMSE receive filter and a noise prediction 1. When there is no feedback, the past and future input feedback loop, exhibits a striking resemblance to the pre/post- blocks (Zm−1, ZM ) are conditionally independent of the filtered predictive test-channel of Figure 1. 1 m+1 We consider the (real-valued) discrete-time time-invariant current output block Zqm given the current input block Zm, implying by the chain rule that (43) coincides with (44), and linear Gaussian channel, Theorem 2 reduces to [16, Thm. 1]. Rn = cn Xn + Zn, (50) 2. Even for scalar quantization (K = 1), the ECDQ rate ∗ (40) refers to joint entropy coding of the whole input vector. where the transmitted signal Sn is subject to a power constraint This does not contradict the sequential nature of the system E[S2] P , the channel dispersion is modeled by a linear n ≤ since entropy coding can be implemented causally. Indeed, it time-invariant filter cn, and where the channel noise Zn is follows from the chain rule for entropy that it is enough to (possibly colored) Gaussian noise. encode the instantaneous quantizer output Qm conditioned on Let Un represent the data stream which we model as an m−1 2 past quantizer outputs Q1 and on past and present dither i.i.d. zero-mean Gaussian random process with variance σU . m samples D1 , in order to achieve the joint entropy of the Further, let h1,n be a spectral shaping filter, satisfying quantizer in (40). 1/2 3. If we don’t condition the entropy coding on the past, then σ2 H (ej2πf ) 2df P (51) U | 1 | ≤ we have Z−1/2 so the channel input X = h U indeed satisfies the power R = I(Z ; Z + N (uniform)) (47) n 1,n n ECDQ n n n constraint. For the moment, we∗ make no further assumption (gauss) 1 2πe I(Zn; Zn + Nn )+ log (48) on hn. ≤ 2 12 The channel (52) has inter-symbol interference (ISI) due to 1 2πe   = R(D)+ log (49) the channel filter c , as well as colored Gaussian noise. 
Let 2 12 n   us assume that the channel frequency response is non-zero (uniform) where Nn , the scalar quantization noise, is uniformly everywhere, and pass the received signal Rn through a zero- 1 distributed over the interval ( √12D, +√12D), and where forcing (ZF) linear equalizer , resulting in Yn. We thus − C(z) 9

[Block diagram: data symbols Un → shaping filter → Xn → channel (+ Nn) → Yn → estimator → Ûn → Σ (+ D̂n from the predictor) → Vn = Un + En → slicer → (correct) decisions Un; the estimation error Dn = Un - Ûn drives the predictor.]

Fig. 8. MMSE-DFE Scheme in Predictive Form

arrive at an equivalent ISI-free channel, where Dn is statistically independent of Uˆn due to the orthogonality{ } principle and Gaussianity. { } Yn = Xn + Nn, (52) − Assuming the decoder has access to past symbols Un = (see in the sequel), the decoder knows also where the power spectrum of Nn is Un−1,Un−2,... − the past estimation errors D = Dn−1,Dn−2,... and may j2πf n j2πf SZ (e ) form an optimal linear predictor, Dˆ , of the current estimation SN (e )= . n C(ej2πf ) 2 error D , which may then be added to Uˆ to form V . The | | n n n prediction error E = D Dˆ has variance P (D), the The mutual information rate (normalized per symbol) (27) n n − n e between the input and output of the channel (52) is entropy power of Dn. It follows that 1/2 j2πf U = Uˆ + D 1 SX (e ) n n n I( Xn , Yn )= log 1+ df. (53) ˆ { } { } 2 S (ej2πf ) = Vn Dn + Dn Z−1/2  N  − = Vn + En, (55) We note that if the spectral shaping filter hn satisfies the optimum “water-filling” power allocation condition, [5], then and therefore (53) will equal the channel capacity. E U V 2 = σ2 = E D Dˆ 2 = P (D). (56) Similarly to the observations made in Section I with respect { n − n} E { n − n} e to the RDF, we note (as reflected in (53)) that capacity The channel (55), which describes the input/output relation may be achieved by parallel AWGN coding over narrow of the slicer in Figure 8, is often referred to as the backward frequency bands (as done in practice in Discrete Multitone channel. Furthermore, since Un and En are i.i.d Gaussian (DMT)/Orthogonal Frequency-Division Multiplexing (OFDM) and since by the orthogonality principle E is independent of systems). An alternative approach, based on time-domain n present and past values of Vn (but dependent of future values prediction rather than the Fourier transform, is offered by through the feedback loop), it is a “sequentially additive” the canonical MMSE - feed forward equalizer - decision AWGN channel. See Figure 10 for a geometric view of these feedback equalizer (FFE-DFE) structure used in single-carrier properties. Notice the strong resemblance with the channel transmission. It is well known that this scheme, coupled with (16), Zqn = Zn + Nn, in the predictive test-channel of the AWGN coding, can achieve the capacity of linear Gaussian RDF: in both channels the output and the noise are i.i.d. and channels. This has been shown using different approaches by Gaussian, but the input has memory and it depends on past numerous authors; see [11], [4], [1], [7] and references therein. outputs via the feedback loop. Our exposition closely follows that of Forney [7]. We now recount this result, based on linear prediction of the error Un sequence; see the system in Figure 8 and its equivalent channel in Figure 9. In the communication literature, this structure is − En referred to as “noise prediction”. It can be recast into the more Dn familiar FFE-DFE form by absorbing a part of the predictor ˆ into the estimator filter, forming the usual FFE. Dn Vn As a first step, let Uˆ be the optimal MMSE estimator {Yn} n Uˆn of U from the equivalent channel output sequence Y of n { n} (52). Since Un and Yn are jointly Gaussian and stationary this estimator{ is} linear{ } and time invariant. Note that the Fig. 10. Geometric View of the Estimation Process 1 combination of the ZF equalizer C(z) at the receiver front- end and the estimator above is equivalent to direct MMSE We have therefore derived the following. estimation of Un from the original channel output Rn (50). 
Denote the estimation error, which is composed in general of ISI and Gaussian noise, by Dn. Then

    Un = Ûn + Dn.                                                            (54)

Theorem 3: (Information Optimality of Noise Prediction) For stationary Gaussian processes Un and Nn, and if H2(e^{j2πf}) is chosen to be the optimal estimation filter of Un

[Block diagram: Un → shaping filter H1(e^{j2πf}) → Xn → Σ (+ Nn) → Yn → estimation filter H2(e^{j2πf}) → Ûn → Σ (+ D̂n) → Vn, with predictor D̂n = g(D_{n-L}^{n-1}) in the feedback path.]

Fig. 9. Noise-Prediction Equivalent Channel

from Yn and the predictor g( ) is chosen to be the optimal Slicing and Coding We assumed that the decoder has prediction filter of D (with· L ), then the mutual access to past symbols. In the simplest realization, this is n → ∞ information-rate (53) of the channel from Xn to Yn (or from achieved by a decision element (“slicer”) that works on a Un to Yn) is equal to the scalar mutual information symbol-by-symbol basis. In practice however, to approach capacity, the slicer must be replaced by a “decoder”. Here I(Vn; Vn + En) we must actually break with the assumption that Xn is a j2πf of the channel (55). Furthermore, if H1(e ) is chosen Gaussian process. We implicitly assume that Xn are symbols j2πf such that SX (e ) equals the water-filling spectrum of the of a capacity- achieving AWGN code. The slicer should be channel input, then this mutual information equals the channel viewed as a mnemonic aid where in practice an optimal capacity. decoder should be used. However, we encounter two problems with this interpre- tation. First, the common view of a slicer is as a nearest Proof: Let U − = U ,U ,... and D− = n { n−1 n−2 } n neighbor quantizer. Thus in order to function correctly, the Dn−1,Dn−2,... . Using the chain rule of mutual informa- noise En in (55) must be independent of the symbols Un and tion{ we have } not of the estimator Vn (i.e., the channel should be ”forward” I( U , Y ) = h( U ) h( U Y ) additive: Vn = Un + En). This can be achieved by dithering { n} { n} { n} − { n}|{ n} = h( U ) h(U Y ,U −) the codebook via a modulo-shift as in [6]. This is reminiscent { n} − n|{ n} n to the dithered quantization approach of Section VI. Another ˆ − = h( Un ) h(Un Un Yn ,Un ) difficulty is the conflict between the inherent decoding delay { } − − |{ −} = h( Un ) h(Dn Yn ,U ) of a good code, and the sequential nature of the noise- { } − |{ } n = h( U ) h(D Y ,D−) prediction DFE configuration. Again (as with vector-DPCM in { n} − n|{ n} n = h( U ) h(D Dˆ Y ,D−) Section V), this may in principle be solved by incorporating { n} − n − n|{ n} n an interleaver as suggested by Guess and Varanasi [11]. = h( U ) h(E Y ,D−) { n} − n|{ n} n Capacity achieving shaping filter. For any spectral shaping = h( Un ) h(En) (57) filter h , the mutual information is given by (53). The { } − 1,n = I(Vn; Vn + En), shaping filter hn which maximizes the mutual information (and yields capacity) under the power constraint (51) is given where h( ) denotes differential entropy rate, and where (57) by the parametric water-filling formula: follows from· successive application of the orthogonality prin- ciple [7], since we assumed optimum estimation and prediction σ2 H (ej2πf ) 2 = [θ S (ej2πf )]+, (59) U | 1 | − N filters, which are MMSE estimators in the Gaussian setting. where the “water level” θ is chosen so that the power constraint is met with equality, In view of (53) and (56), and since I( Un , Yn ) = { } { } 1/2 I( Xn , Yn ), Theorem 3 can be re-written as 2 2 j2πf 2 { } { } σX = σU H(e ) df 1/2 1 S (ej2πf ) 1 σ2 −1/2 | | X U (58) Z log 1+ j2πf df = log 2 1/2 −1/2 2 SN (e ) 2 σE j2πf + Z     = θ SN (e ) df = P. (60) ⌈ − ⌉ from which we obtain the following well known formula for Z−1/2 the “SNR at the slicer” for infinite order FFE-DFE, [4], [1], Using this choice, and arbitrarily setting 2 1/2 j2πf 2 σU SX (e ) σU = θ (61) 2 = exp log 1+ j2πf df . σE −1/2 SN (e ) j2πf Z   ! 
it can be verified that the shaping filter H1(e ) and the j2πf We make a few remarks and interpretations regarding the estimation filter H2(e ) satisfy the same complex conjugate capacity-achieving predictive configuration, which further en- relation as the RDF-achieving pre- and post-filters (13) and hance its duality relationship with the predictive realization of (18) j2πf ∗ j2πf the RDF. H2(e )= H1 (e ). 11
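The dual computation on the channel side can be sketched in the same way. The code below is illustrative (the noise spectrum and power constraint are arbitrary choices); it performs the water-filling (59)-(60), sets σU² = θ as in (61), and checks that the capacity equals one half the log of the slicer SNR (58), computed from the entropy-power of the estimation-error spectrum min{SN(e^{j2πf}), θ} (cf. (62) below):

```python
import numpy as np

f = np.arange(-0.5, 0.5, 1e-5)
S_N = 0.4 + 0.3 * np.cos(2 * np.pi * f)        # an example noise spectrum
P = 1.0                                        # transmit power constraint

# Water-filling (59)-(60): find theta such that the allocated power equals P.
lo, hi = S_N.min(), S_N.max() + P
for _ in range(200):
    theta = 0.5 * (lo + hi)
    if np.maximum(theta - S_N, 0).mean() < P:
        lo = theta
    else:
        hi = theta

S_X = np.maximum(theta - S_N, 0)               # transmit spectrum
C = 0.5 * np.mean(np.log2(1 + S_X / S_N))      # capacity, eq. (53) at the optimum

# Slicer error variance: entropy power of the estimation-error spectrum,
# with the symbol variance chosen as sigma_U^2 = theta, eq. (61).
Pe_D = np.exp(np.mean(np.log(np.minimum(S_N, theta))))
snr_slicer = theta / Pe_D                      # sigma_U^2 / sigma_E^2, eq. (58)
print(C, 0.5 * np.log2(snr_slicer))            # the two agree
```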

Under the same choice, we also have that: VIII. SUMMARY

j2πf j2πf SD(e ) = min SN (e ),θ . (62) We demonstrated the dual role of prediction in rate- distortion theory of Gaussian sources and capacity of ISI Shaping, estimation and prediction at high SNR. At channels. These observations shed light on the configurations high signal-to-noise ratio (SNR), the shaping filter H1 and of DPCM (for source compression) and FFE-DFE (for channel the estimation filter H2 become all-pass, and can be replaced demodulation), and show that in principle they are “informa- by scalar multipliers. If we set the symbol variance as in (61), tion lossless” for any distortion / SNR level. The theoretic 2 ˆ then we get at high SNR σU P , so Xn Un and Un Yn. bounds, RDF and capacity, can be approached in practice by It follows that the estimation≈ error D ≈N , and therefore≈ n ≈ n appropriate use of feedback and linear estimation in the time the slicer error En becomes simply the prediction error (or domain combined with coding across the “spatial” domain. the entropy power) of the channel noise . This is the well Nn A prediction-based system has in many cases a delay lower known “zero-forcing DFE” solution for optimum detection at than that of a frequency domain approach, as is well known in high SNR [1]. We shall next see that the same behavior of the practice. We slightly touched on this issue when discussing the slicer error holds even for non-asymptotic conditions. 0.5 bit loss due to avoiding the (“non-causal”) pre/post filters. The prediction process when the Shannon upper bound But the full potential of this aspect requires further study. is tight. The Shannon upper bound (SUB) on capacity states It is tempting to ask whether the predictive form of the that RDF can be extended to more general sources and distortion 1 2 measures (and similarly for capacity of more general ISI C log(2πeσY ) h(N) ≤ 2 − channels). Yet, examination of the arguments in our derivation 2 1 P + σN ∆ reveals that it is strongly tied to the quadratic-Gaussian case: log = CSUB, (63) ≤ 2 Pe(N)   • The orthogonality principle, implied by the MMSE cri- where terion, guarantees the whiteness of the noisy prediction 1/2 2 j2πf error Zqn and its orthogonality with the past. σN = SN (e )df • Gaussianity implies that orthogonality is equivalent to Z−1/2 statistical independence. is the variance of the equivalent noise, and where equality For other error criteria and/or non-Gaussian sources, prediction holds if and only if the output Yn is white. This in turn is satisfied if and only if (either linear or non-linear) is in general unable to remove the dependence on the past. Hence the scalar mutual information j2πf θ max SN (e ), over the prediction error channel would in general be greater ≥ f than the mutual information rate of the source before predic- 2 in which case θ = P + σN . tion. 2 If we choose σU according to (61), we have: • The shaping and estimation filters satisfy

j2πf j2πf 2 j2πf 2 SN (e ) H1(e ) = H2(e ) =1 . | | | | − θ ACKNOWLEDGEMENT • Un and Yn are white, with the same variance θ. We’d like to thank Robert M. Gray for pointing to us the • Xn and Uˆn have the same power spectrum, θ S (ej2πf ). origin of DPCM in a U.S. patent by C.C. Cutler in 1952. − N • The power spectrum of Dn is equal to the power spectrum j2πf of the noise Nn, SN (e ). Consequently, the variance of En which is equal to the entropy-power of Dn, is equal to Pe(N). PPENDIX • As a consequence we have A I(V ; V + E ) = h(U ) h(E ) A. PROOF OF THEOREM 2 n n n n − n = h (0,θ) h (0, Pe(N)) It will be convenient to look at K-blocks, which we denote N − N by bold letters as in Section VI. Substituting the ECDQ rate 1 P +σ2   = log N definition (40) and the K-block directed information definition 2 Pe(N) (45), the required result (46) becomes:   which is indeed the SUB (63). M An alternative derivation of Theorem 3 in the spectral H(QM DM )= I(Zm; Zq Zqm−1) . 1 | 1 m| m domain. Similarly to the alternative proof of Theorem 1, one m=1 can prove Theorem 3 using the spectra derived above. X Using the chain rule for entropies, it is enough to show that:

    H(Qm | Q_1^{m-1}, D_1^M) = I(Zq_m; Z_1^m | Zq_1^{m-1}).

To that end, we have the following sequence of equalities: [14] T. Linder and R. Zamir, Causal coding of stationary sources and individual sequences with high hesolution. IEEE Trans. Information m−1 M−1 Theory, IT-52:662–680, Feb. 2006. H(Qm Q1 , D1 ) | [15] J. Massey. Causality, Feedback and Directed Information. In Proc. IEEE (a) m−1 m Int. Symp. on , pages 303–305, 1990. = H(Qm Q1 , D1 ) | [16] R. Zamir and M. Feder. On universal quantization by randomized (b) m−1 m m−1 m m uniform / lattice quantizer. IEEE Trans. Information Theory, pages 428– = H(Qm Q , D ) H(Qm Q , Z , D ) 1 1 1 1 1 436, March 1992. | m m−1 −m | = I(Qm; Z1 Q1 , D1 ) [17] R. Zamir and M. Feder. On lattice quantization noise. IEEE Trans. | Information Theory, pages 1152–1159, July 1996. (c) m m−1 m = I(Qm Dm; Z1 Q1 , D1 ) [18] R. Zamir and M. Feder. Information rates of pre/post filtered dithered − | quantizers. IEEE Trans. Information Theory, pages 1340–1353, Septem- = I(Zq ; Zm Qm−1, Dm) m 1 | 1 1 ber 1996. = I(Zq ; Zm Qm−1 Dm−1, Dm) m 1 | 1 − 1 1 = I(Zq ; Zm Zqm−1, Dm) m 1 | 1 1 (d) = I(Zq ; Zm Zqm−1, D ) m 1 | 1 m = I(Zq , D ; Zm Zqm−1) I(D ; Zm Zqm−1) m m 1 | 1 − m 1 | 1 (e) = I(Zq ; Zm Zqm−1) I(D ; Zm Zqm−1) m 1 | 1 − m 1 | 1 (f) = I(Zq ; Zm Zqm−1) . m 1 | 1 In this sequence, equality (a) comes from the independent dither generation and causality of feedback. (b) is justified because Qm is a deterministic function of the elements on which the subtracted entropy is conditioned, thus entropy is 0. In (c) we subtract from the left hand side argument of the mutual information one of the variables upon which mutual information is conditioned. (d) and (e) hold since each dither vector Dm is a deterministic function of the corresponding quantizer output Zqm. Finally, (f) is true since m Z1 is independent of Dm (both conditioned on past quantized values and unconditioned).

REFERENCES [1] J. Barry, E. A. Lee and D. G. Messerschmitt. Digital Communication. Kluwer Academic Press, 2004 (third edition). [2] T. Berger. Rate Distortion Theory: A Mathematical Basis for Data Compression. Prentice-Hall, Englewood Cliffs, NJ, 1971. [3] J. Chen, C. Tian, T. Berger, and S. S. Hemami. Multiple Description Quantization via Gram-Schmidt Orthogonalization. IEEE Trans. Infor- mation Theory, IT-52:5197–5217, Dec. 2006. [4] J.M. Cioffi, G.P. Dudevoir, M.V. Eyuboglu, and G.D. J. Forney. MMSE Decision-Feedback Equalizers and Coding - Part I: Equalization Results. IEEE Trans. Communications, COM-43:2582–2594, Oct. 1995. [5] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley, New York, 1991. 1 SNR [6] U. Erez and R. Zamir. Achieving 2 log(1 + ) on the AWGN channel with lattice encoding and decoding. IEEE Trans. Information Theory, IT-50:2293–2314, Oct. 2004. [7] G. D. Forney, Jr. Shannon meets Wiener II: On MMSE estimation in successive decoding schemes. In 42st Annual Allerton Conference on Communication, Control, and Computing, Allerton House, Monticello, Illinois, Oct. 2004. [8] R. G. Gallager. Information Theory and Reliable Communication. Wiley, New York, N.Y., 1968. [9] A. Gersho and R. M. Gray. Vector Quantization and Signal Compression Kluwer Academic Pub., Boston, 1992. [10] G.D.Gibson, T.Berger, T.Lookabaugh, D.Lindbergh, and R.L.Baker. Digital Compression for Multimedia: Principles and Standards‘. Morgan Kaufmann Pub., San Fansisco, 1998. [11] T. Guess and M. K. Varanasi. An information-theoretic framework for deriving canonical decision-feedback receivers in gaussian channels. IEEE Trans. Information Theory, IT-51:173–187, Jan. 2005. [12] N. S. Jayant and P. Noll. Digital Coding of Waveforms. Prentice-Hall, Englewood Cliffs, NJ, 1984. [13] K. T. Kim and T. Berger. Sending a Lossy Version of the Innovations Process is Suboptimal in QG Rate-Distortion. In Proceedings of ISIT- 2005, Adelaide, Australia, pages 209–213, 2005.