arXiv:1303.6409v2 [cs.IT] 17 Apr 2013

Information Measures for Deterministic Input-Output Systems

Bernhard C. Geiger, Student Member, IEEE, and Gernot Kubin, Member, IEEE

Abstract—In this work the information loss in deterministic, memoryless systems is investigated by evaluating the conditional entropy of the input random variable given the output random variable. It is shown that for a large class of systems the information loss is finite, even if the input is continuously distributed. Based on this finiteness, the problem of perfectly reconstructing the input is addressed and Fano-type bounds between the information loss and the reconstruction error probability are derived. For systems with infinite information loss a relative measure is defined and shown to be tightly related to Rényi information dimension. Employing another Fano-type argument, the reconstruction error probability is bounded from below. The theoretical results are illustrated by a few example systems, among them a multi-channel autocorrelation receiver.

Index Terms—Data processing inequality, Fano's inequality, information loss, Rényi information dimension, system theory.

Bernhard C. Geiger and Gernot Kubin are with the Signal Processing and Speech Communication Laboratory, Graz University of Technology. Parts of this work have been presented at the 2011 IEEE Int. Symp. on Wireless Communication Systems, the 2012 Int. Zürich Seminar on Communications, and the 2012 IEEE Information Theory Workshop.

I. INTRODUCTION

When opening a textbook on linear [1] or nonlinear [2] statistical signal processing [3], or an engineering-oriented textbook on stochastic processes [4], one finds correlations, power spectral densities, and how they are affected by linear and nonlinear systems (e.g., the Bussgang theorem [4, Thm. 9-17]). Given this overwhelming prevalence of energetic measures and second-order statistics, it is no surprise that many problems in system theory and signal processing are formulated in terms of energetic cost functions, e.g., the mean-squared error. Likewise, when opening a textbook on input-output systems, the characterizations one typically finds – aside from the difference or differential equation defining the system – are almost exclusively energy-centered: transfer functions, input-output stability, passivity, losslessness, and energy/power gain are all defined using the amplitudes (or functions of amplitudes) of the signals involved and are therefore essentially energetic in nature.

What one does not find in all these books is an information-theoretic characterization of the system at hand, despite the fact that such a characterization is strongly suggested by an elementary theorem: the data processing inequality. We know that the information content of a signal (be it a random variable or a stochastic process) cannot increase by deterministic processing, just as, loosely speaking, a passive system cannot increase the energy contained in a signal¹.

Clearly, the information lost in a system not only depends on the system but also on the signal carrying this information: While for Gaussian signals there is a strong connection between energy and information (entropy and entropy rate are related to the variance of the signal and its power spectral density, respectively), for non-Gaussian signals this connection degenerates to a bound by the max-entropy property of the Gaussian distribution. Energy and information of a signal can therefore behave completely differently when fed through a system. But while we have a definition of the energy loss (namely, the inverse L2 gain), an analysis of the information loss in a system is still lacking.

It is the purpose of this work to close this gap and to propose the information loss – the conditional entropy of the input given the output of the system – as a system characteristic complementing the prevailing energy-centered descriptions. The choice of conditional entropy is partly motivated by the data processing inequality (cf. Definition 1) and justified by a recent axiomatization of information loss [6].

At present we restrict ourselves to memoryless systems² operating on (multidimensional) random variables. The reason for this is that already for this comparably simple class of systems a multitude of questions can be asked (e.g., what happens if we lose an infinite amount of information?), to some of which we intend to present adequate answers (e.g., by the introduction of a relative measure of information loss). This manuscript can thus be regarded as a first small step towards a system theory from an information-theoretic point-of-view; we hope that some of the results presented here will act as a foundation for forthcoming steps, such as extensions to stochastic processes or to systems with memory.

¹At least not by more than a finite amount, cf. [5].
²Since these systems will be described by (typically) nonlinear functions, we will use "systems" and "functions" interchangeably.

A. Related Work

To the best of the authors' knowledge, very few results about the information processing behavior of deterministic input-output systems have been published. Notable exceptions are Pippenger's analysis of the information lost in the multiplication of two integer random variables [7] and the work of Watanabe and Abraham concerning the rate of information loss caused by feeding a discrete-time, stationary stochastic process through a static, non-injective function [8]. Moreover, in [9] the authors made an effort to extend the results of Watanabe and Abraham to dynamical input-output systems with finite internal memory. All these works, however, focus only on discrete random variables and stochastic processes.

Slightly larger, but still focused on discrete random variables only, is the field concerning information-theoretic cost functions: The infomax principle [10], the information bottleneck method [11] using the Kullback-Leibler divergence as a distortion function, and system design by minimizing the error entropy (e.g., [12]) are just a few examples of this recent trend. Additionally, Lev's approach to aggregating accounting data [13], and, although not immediately evident, the work about macroscopic descriptions of multi-agent systems [14] belong to that category.

A system theory for neural information processing has been proposed by Johnson³ in [15]. The assumptions made there (information need not be stochastic, the same information can be represented by different signals, information can be seen as a parameter of a probability distribution, etc.) suggest the use of the Kullback-Leibler divergence as a central quantity. Although these assumptions are incompatible with ours, some similarities exist (e.g., the information transfer ratio of a cascade in [15] and Proposition 4).

Aside from these, the closest the literature comes to these concepts is in the field of autonomous dynamical systems or iterated maps: There, a multitude of information-theoretic characterizations are used to measure the information transfer in deterministic systems. In particular, the information flow between small and large scales, caused by the folding and stretching behavior of chaotic one-dimensional maps, was analyzed in [16]; the result presented there is remarkably similar to our Proposition 7. Another notable example is the introduction of transfer entropy in [17], [18] to capture the information transfer between states of different (sub-)systems. An alternative measure for the information exchanged between system components was introduced in [19]. Information transfer between time series and spatially distinct points in the phase space of a system is discussed, e.g., in [20], [21]. Notably, these works can be assumed to follow the spirit of Kolmogorov and Sinaï, who characterized dynamical systems exhibiting chaotic behavior with entropy [22]–[24], cf. [25].

Recently, independently of the present authors, Baez et al. [6] took the reverse approach and formulated axioms for the information loss induced by a measure-preserving function between finite sets, such as continuity and functoriality (cf. Proposition 1 in this work). They show that the difference between the entropy of the input and the entropy of the output, or, equivalently, the conditional entropy of the input given the output, is the only function satisfying the axioms, thus adding justification to one of our definitions.

³Interestingly, Johnson gave a further motivation for the present work by claiming that "Classic information theory is silent on how to use information theoretic measures (or if they can be used) to assess actual system performance".

B. Outline and Contributions

The organization of the paper is as follows: In Section II we give the definitions of absolute and relative information loss and analyze their elementary properties. Among other things, we prove a connection between relative information loss and Rényi information dimension and the additivity of absolute information loss for a cascade. We next turn to a class of systems which has finite absolute information loss for a real-valued input in Section III. A connection to differential entropy is shown, as well as numerous upper bounds on the information loss. Given this finite information loss, we present Fano-type inequalities for the probability of a reconstruction error. Section IV deals with systems exhibiting infinite absolute information loss, e.g., quantizers and systems reducing the dimensionality of the data, to which we apply our notion of relative information loss. Presenting a similar connection between relative information loss and the reconstruction error probability establishes a link to analog compression as investigated in [26]. We apply our theoretical results to two larger systems in Section V, a communications receiver and an accumulator. It is shown that indeed both the absolute and relative measures of information loss are necessary to fully characterize even the restricted class of systems we analyze in this work. Eventually, Section VI is devoted to pointing at open issues and laying out a roadmap for further research.

II. DEFINITION AND ELEMENTARY PROPERTIES OF INFORMATION LOSS AND RELATIVE INFORMATION LOSS

A. Notation

We adopt the following notation: Random variables (RVs) are represented by upper case letters (e.g., X), lower case letters (e.g., x) are reserved for (deterministic) constants or realizations of RVs. The alphabet of an RV is indicated by a calligraphic letter (e.g., 𝒳). The probability distribution of an RV X is denoted by P_X. Similarly, P_XY and P_{X|Y} denote the joint distribution of the RVs X and Y and the conditional distribution of X given Y, respectively.

If 𝒳 is a proper subset of the N-dimensional Euclidean space ℝ^N and if P_X is absolutely continuous w.r.t. the N-dimensional Lebesgue measure μ^N (in short, P_X ≪ μ^N), then P_X possesses a probability density function (PDF) w.r.t. the Lebesgue measure, which we will denote as f_X. Conversely, if the probability measure P_X is concentrated on an at most countable set of points, we will write p_X for its probability mass function, omitting the index whenever it is clear from the context.

We deal with functions of (real-valued) RVs: If, for example, (𝒳, 𝔅_𝒳, P_X) is the (standard) probability space induced by the RV X and (𝒴, 𝔅_𝒴) is another (standard) measurable space, and g: 𝒳 → 𝒴 is measurable, we can define a new RV as Y = g(X). The probability distribution of Y is

   ∀B ∈ 𝔅_𝒴:  P_Y(B) = P_X(g^{-1}[B])     (1)

where g^{-1}[B] = {x ∈ 𝒳 : g(x) ∈ B} denotes the preimage of B under g. Abusing notation, we write P_Y(y) for the probability measure of a single point instead of P_Y({y}).

In particular, if g is a quantizer which induces a partition P of 𝒳, we write X̂ for the quantized RV. The uniform quantization of X with hypercubes of side length 1/2^n is denoted by

   X̂_n = ⌊2^n X⌋ / 2^n     (2)

where the floor operation is applied element-wise if X is a multi-dimensional RV. The partition induced by this uniform quantizer will be denoted as P_n = {X̂_k^(n)}⁴. Note also that the partition P_n gets refined with increasing n (in short, P_n ≻ P_{n+1}).

Finally, H(·), h(·), H_2(·), and I(·;·) denote the entropy, the differential entropy, the binary entropy function, and the mutual information, respectively. Unless noted otherwise, the logarithm is taken to base two, so all entropies are measured in bits.

⁴E.g., for a one-dimensional RV, the k-th element of P_n is X̂_k^(n) = [k/2^n, (k+1)/2^n).

Fig. 1. Model for computing the information loss of a memoryless input-output system g. Q is a quantizer with partition P_n; the model compares I(X̂_n; X) with I(X̂_n; Y).
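As a quick illustration (not part of the original analysis), the following Python sketch implements the uniform quantizer of (2) and shows that, for a continuously distributed scalar input, H(X̂_n) grows by roughly one bit per refinement step – the behaviour exploited by the limits taken throughout this section. The choice of a standard Gaussian input is an assumption made only for the demonstration.

```python
# Minimal sketch of the uniform quantizer in (2): X_hat_n = floor(2^n X) / 2^n.
# For a continuously distributed scalar X, the plug-in entropy of X_hat_n should
# increase by roughly one bit per refinement step.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1_000_000)  # samples of X ~ N(0, 1) (assumed example input)

def quantize(x, n):
    """Uniform quantization with cells of side length 2**-n, cf. (2)."""
    return np.floor(2**n * x) / 2**n

def entropy_bits(samples):
    """Plug-in estimate of the entropy (in bits) of a discrete sample."""
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

for n in range(0, 6):
    print(f"n = {n}:  H(X_hat_n) ~ {entropy_bits(quantize(x, n)):.2f} bit")
```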

B. Information Loss

A measure of information loss in a deterministic input-output system should, roughly speaking, quantify the difference between the information available at its input and its output. While for discrete RVs this amounts to the difference of their entropies, continuous RVs require more attention. To this end, in Fig. 1, we propose a model to compute the information loss of a system which applies to all real-valued RVs.

In particular, we quantize the system input with partition P_n and compute the mutual information between the input X and its quantization X̂_n, as well as the mutual information between the system output Y and X̂_n. The first quantity is an approximation of the information available at the input, while the second approximates the information shared between input and output, i.e., the information passing through the system and thus being available at its output. By the data processing inequality (cf. [27, Cor. 7.16]), the former cannot be smaller than the latter, i.e., I(X̂_n; X) ≥ I(X̂_n; Y), with equality if the system is described by a bijective function. We compute the difference between these two mutual informations to obtain an approximation of the information lost in the system. For bijective functions, for which the two quantities are equal, the information loss will vanish, as suggested by intuition: Bijective functions describe lossless systems.

Refining the partition yields better approximations; we thus present

Definition 1 (Information Loss). Let X be an RV with alphabet 𝒳, and let Y = g(X). The information loss induced by g is

   L(X → Y) = lim_{n→∞} ( I(X̂_n; X) − I(X̂_n; Y) ) = H(X|Y).     (3)

Along the lines of [27, Lem. 7.20] one obtains

   I(X̂_n; X) − I(X̂_n; Y) = H(X̂_n) − H(X̂_n|X) − H(X̂_n) + H(X̂_n|Y)     (4)
                          = H(X̂_n|Y)     (5)

since X̂_n is a function of X. With the monotone convergence H(X̂_n|Y = y) ↗ H(X|Y = y) implied in [27, Lem. 7.18] it follows that L(X → Y) = H(X|Y). Indeed, for a discrete input RV X we obtain L(X → Y) = H(X) − H(Y).

While L(X → Y) and H(X|Y) can be used interchangeably, we will stick to the notation L(X → Y) to make clear that Y is a function of X.

For discrete input RVs X or stochastic systems (e.g., communication channels) the mutual information between the input and the output, I(X; Y), i.e., the information transfer, is an appropriate characterization. In contrast, deterministic systems with non-discrete input RVs usually exhibit infinite information transfer I(X; Y). As we will show in Section III, there exists a large class of systems for which the information loss L(X → Y) remains finite, thus allowing a meaningful description of the system.

One elementary property of information loss, which will prove useful in developing a system theory from an information-theoretic point-of-view, is found in the cascade of systems (see Fig. 2). We maintain⁵

Proposition 1 (Information Loss of a Cascade). Consider two functions g: 𝒳 → 𝒴 and h: 𝒴 → 𝒵 and a cascade of systems implementing these functions. Let Y = g(X) and Z = h(Y). The information loss induced by this cascade, or equivalently, by the system implementing the composition (h ∘ g)(·) = h(g(·)), is given by:

   L(X → Z) = L(X → Y) + L(Y → Z).     (6)

Proof: Referring to Definition 1 and [28, Ch. 3.9] we obtain

   L(X → Z) = H(X|Z)     (7)
            = H(Y|Z) + H(X|Y, Z)     (8)
            = H(Y|Z) + H(X|Y)     (9)

since Y and (Y, Z) are mutually subordinate.

⁵In [6] this property was formulated as an axiom desirable for a measure of information loss.

Fig. 2. Cascade of systems: The information loss of the cascade equals the sum of the individual information losses of the constituent systems.

A sufficient condition for the information loss to be infinite is presented in

Proposition 2 (Infinite Information Loss). Let g: 𝒳 → 𝒴, 𝒳 ⊆ ℝ^N, and let the input RV X be such that its probability measure P_X has an absolutely continuous component P_X^{ac} ≪ μ^N which is supported on 𝒳. If there exists a set B ⊆ 𝒴 of positive P_Y-measure such that the preimage g^{-1}[y] is uncountable for every y ∈ B, then

   L(X → Y) = ∞.     (10)

Proof: We write the information loss L(X → Y) as

   L(X → Y) = H(X|Y) = ∫_𝒴 H(X|Y = y) dP_Y(y)     (11)
            ≥ ∫_B H(X|Y = y) dP_Y(y)     (12)

since B ⊆ 𝒴. By assumption, the conditional probability measure P_{X|Y=y} is not concentrated on a countable set of points (the preimage of y under g is uncountable, and the probability measure P_X has an absolutely continuous component on all of 𝒳), so one obtains H(X|Y = y) = ∞ for all y ∈ B. The proof follows from P_Y(B) > 0.

It will be useful to explicitly state the following

Corollary 1. Let P_X ≪ μ^N. If there exists a point y* ∈ 𝒴 such that P_Y(y*) > 0, then

   L(X → Y) = ∞.     (13)

Proof: Since y* has positive P_Y-measure and since P_X ≪ μ^N, the preimage of y* under g needs to be uncountable.

In other words, if the input is continuously distributed and if the distribution of the output has a non-vanishing discrete component, the information loss is infinite. A particularly simple example for such a case is a quantizer:

Example 1 (Quantizer). We now look at the information loss of a scalar quantizer, i.e., of a system described by a function

   g(x) = ⌊x⌋.     (14)

With the notation introduced above we obtain Y = X̂_0 = g(X). Assuming that X has an absolutely continuous distribution (P_X ≪ μ), there will be at least one point y* for which Pr(Y = y*) = P_Y(y*) = P_X([y*, y* + 1)) > 0. The conditions of Corollary 1 are thus fulfilled and we obtain

   L(X → X̂_0) = ∞.     (15)

This simple example illustrates the information loss as the difference between the information available at the input and the output of a system: While in all practically relevant cases a quantizer will have finite information at its output (H(Y) < ∞), the information at the input is infinite as soon as P_X has a continuous component.

While for a quantizer the mutual information between the input and the output may be a more appropriate characterization because it remains finite, the following example shows that also mutual information has its limitations:

Example 2 (Center Clipper). The center clipper, used, e.g., for residual echo suppression [29], can be described by the following function (see Fig. 3):

   g(x) = { x, if |x| > c;  0, otherwise. }     (16)

Assuming again that P_X ≪ μ and that 0 < P_X([−c, c]) < 1, with Corollary 1 the information loss becomes infinite. On the other hand, there exists a subset of 𝒳 × 𝒴 (namely, a subset of the line x = y) with positive P_XY measure, on which P_X P_Y vanishes. Thus, with [28, Thm. 2.1.2],

   I(X; Y) = ∞.     (17)

Fig. 3. The center clipper – an example for a system with both infinite information loss and infinite information transfer.

C. Relative Information Loss

As the previous example shows, there are systems for which neither information transfer (i.e., the mutual information between input and output) nor information loss provides sufficient insight. For these systems, a different characterization is necessary, which leads to

Definition 2 (Relative Information Loss). The relative information loss induced by Y = g(X) is defined as

   l(X → Y) = lim_{n→∞} H(X̂_n|Y) / H(X̂_n)     (18)

provided the limit exists.

One elementary property of the relative information loss is that l(X → Y) ∈ [0, 1], due to the non-negativity of entropy and the fact that H(X̂_n|Y) ≤ H(X̂_n). The relative information loss is related to the Rényi information dimension, which we will establish after presenting

Definition 3 (Rényi Information Dimension [30]). The information dimension of an RV X is

   d(X) = lim_{n→∞} H(X̂_n) / n     (19)

provided the limit exists and is finite.

We adopted this definition from Wu and Verdú, who showed in [26, Prop. 2] that it is equivalent to the one given by Rényi in [30]. Note further that we excluded the case that the information dimension is infinite, which may occur if H(X̂_0) = ∞ [26, Prop. 1]. Conversely, if the information dimension of an RV X exists, it is guaranteed to be finite if H(X̂_0) < ∞ [30] or if E{|X|^ε} < ∞ for some ε > 0 [26]. Aside from that, the information dimension exists for discrete RVs and RVs with probability measures absolutely continuous w.r.t. the Lebesgue measure on a sufficiently smooth manifold [30], for mixtures of RVs with existing information dimension [26], [30], [31], and for self-similar distributions generated by iterated function systems [26]. Finally, the information dimension exists if the MMSE dimension exists [32, Thm. 8]. For the remainder of this work we will assume that the information dimension of all considered RVs exists and is finite.

We are now ready to state

Proposition 3 (Relative Information Loss and Information Dimension). Let X be an N-dimensional RV with positive information dimension d(X). If d(X|Y = y) exists and is finite P_Y-a.s., the relative information loss equals

   l(X → Y) = d(X|Y) / d(X)     (20)

where d(X|Y) = ∫_𝒴 d(X|Y = y) dP_Y(y).

Proof: From Definition 2 we obtain

   l(X → Y) = lim_{n→∞} ( ∫_𝒴 H(X̂_n|Y = y) dP_Y(y) ) / H(X̂_n)     (21)
            = lim_{n→∞} ( ∫_𝒴 (H(X̂_n|Y = y)/n) dP_Y(y) ) / ( H(X̂_n)/n ).     (22)

By assumption, the limit of the denominator and of the expression under the integral both exist and correspond to d(X) and d(X|Y = y), respectively. Since for an 𝒳-valued RV X the information dimension satisfies [26], [30]

   d(X|Y = y) ≤ N     (23)

one can apply Lebesgue's dominated convergence theorem (e.g., [33]) to exchange the order of the limit and the integral. The limit of the numerator thus exists and we continue with

   l(X → Y) = ( ∫_𝒴 lim_{n→∞} H(X̂_n|Y = y)/n dP_Y(y) ) / d(X)
            = ( ∫_𝒴 d(X|Y = y) dP_Y(y) ) / d(X) = d(X|Y) / d(X).     (24)

This completes the proof.

We accompany the relative information loss by its logical complement, the relative information transfer:

Definition 4 (Relative Information Transfer). The relative information transfer achieved by Y = g(X) is

   t(X → Y) = lim_{n→∞} I(X̂_n; Y) / H(X̂_n) = 1 − l(X → Y)     (25)

provided the limit exists.

While, as suggested by the data processing inequality, the focus of this work is on information loss, we introduce this definition to simplify a few proofs of the forthcoming results. In particular, we maintain

Proposition 4 (Relative Information Transfer and Information Dimension). Let X be an RV with positive information dimension d(X) and let g be a Lipschitz function. Then, the relative information transfer through this function is

   t(X → Y) = d(Y) / d(X).     (26)

Proof: See Appendix A.

This result for Lipschitz functions is the basis for several of the elementary properties of relative information loss presented in the remainder of this section. Aside from that, it suggests that, at least for this restricted class of functions,

   d(X) = d(X, Y) = d(X|Y) + d(Y)     (27)

holds. This complements the results of [34], where it was shown that the point-wise information dimension satisfies this chain rule (second equality) given that the conditional probability measure satisfies a Lipschitz property. Moreover, the first equality was shown to hold for the point-wise information dimension given that Y is a Lipschitz function of X; for non-Lipschitz functions the point-wise information dimension of (X, Y) may exceed the one of X. Whether the same holds for the information dimension (which is the expectation over the point-wise information dimension) is an interesting question for future research.

We can now present the counterpart of Proposition 1 for relative information loss:

Proposition 5 (Relative Information Loss of a Cascade). Consider two Lipschitz functions g: 𝒳 → 𝒴 and h: 𝒴 → 𝒵 and a cascade of systems implementing these functions. Let Y = g(X) and Z = h(Y). For the cascade of these systems the relative information transfer and relative information loss are given as

   t(X → Z) = t(X → Y) t(Y → Z)     (28)

and

   l(X → Z) = l(X → Y) + l(Y → Z) − l(X → Y) l(Y → Z)     (29)

respectively.

Proof: The proof follows from Proposition 4 and Definition 4.
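As a numerical illustration of Definition 3 (added here, not part of the original analysis), the following sketch estimates the information dimension from samples. Instead of H(X̂_n)/n at a single finite n it uses the increment H(X̂_{n+1}) − H(X̂_n), which approaches the same limit for the well-behaved distributions tried below and is less biased at moderate n; the three example distributions (continuous, discrete, and a mixture with absolutely continuous weight 0.3) are our own choices, cf. [26], [30].

```python
# Estimate d(X) via the entropy increment of successively refined uniform quantizations.
# Expected outputs: ~1 for a continuous RV, ~0 for a discrete RV, ~0.3 for the mixture.
import numpy as np

rng = np.random.default_rng(2)
N, n = 1_000_000, 8

def entropy_bits(samples):
    _, counts = np.unique(samples, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def quantize(x, n):
    return np.floor(2**n * x) / 2**n

def info_dim_proxy(x, n):
    """H(X_hat_{n+1}) - H(X_hat_n), a finite-n proxy for the limit in (19)."""
    return entropy_bits(quantize(x, n + 1)) - entropy_bits(quantize(x, n))

x_cont = rng.uniform(0.0, 1.0, N)                              # d(X) = 1
x_disc = rng.choice([0.0, 0.25, 0.5, 0.75], size=N)            # d(X) = 0
x_mix = np.where(rng.uniform(size=N) < 0.3, x_cont, x_disc)    # d(X) = 0.3

for name, x in [("continuous", x_cont), ("discrete", x_disc), ("mixture", x_mix)]:
    print(f"{name:10s}: d(X) ~ {info_dim_proxy(x, n):.2f}")
```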

1 D. Interplay between Information Loss and Relative Informa- In addition to that, since the preimage g− [y] is countable tion Loss for all y , it follows that d(X Y )=0. But with d(X)= ∈ Y N | We introduced the relative information loss to characterize N from PX µ we can apply Proposition 3 to obtain l(X Y )=0≪. Thus, relative information loss will not tell systems for which the absolute information loss from Defi- → nition 1 is infinite. The following result shows that, at least us much about the behavior of the system. In the following, for input RVs with infinite entropy, an infinite absolute infor- we therefore stick to Definition 1 and analyze the (absolute) mation loss is a prerequisite for positive relative information information loss in piecewise bijective functions (PBFs). loss: A. Information Loss in PBFs Proposition 6 (Positive Relative Loss leads to Infinite Abso- lute Loss). Let X be such that H(X)= and let We present ∞ l(X Y ) > 0. Then, L(X Y )= . Proposition 7 (Information Loss and Differential Entropy). → → ∞ Proof: We prove the proposition by contradiction. To this The information loss induced by a PBF is given as end, assume that L(X Y )= H(X Y )= κ< . Thus, L(X Y )= h(X) h(Y )+E log det g(X) (35) → | ∞ → − { | J |} H(Xˆn Y ) where the expectation is taken w.r.t. X. l(X Y ) = lim | (30) → n ˆ →∞ H(Xn) Proof: See Appendix B. (a) H(X Y ) Aside from being one of the main results of this work, it lim | (31) n ˆ also complements a result presented in [4, pp. 660]. There, it ≤ →∞ H(Xn) (b) was claimed that = 0 (32) h(Y ) h(X)+E log det g(X) (36) where (a) is due to data processing and (b) follows from ≤ { | J |} H(X Y ) = κ < and from H(Xˆn) H(X) = where equality holds if and only if g is bijective, i.e., a lossless (e.g.,| [27, Lem. 7.18]).∞ → ∞ system. This inequality results from

Note that the converse is not true: There exist examples fX (x) where an infinite amount of information is lost, but for which fY (g(x)) (37) ≥ det g(x) the relative information loss nevertheless vanishes, i.e., l(X | J | Y )=0 (see Example 4 further below). → with equality if and only if g is invertible at x. Proposition 7 essentially states that the difference between the right-hand side and the left-hand side of (36) is the information lost due III.INFORMATION LOSS FOR PIECEWISE BIJECTIVE to data processing. FUNCTIONS Example 3 (Square-Law Device and Gaussian Input). We In this section we analyze the information loss for a illustrate this result by assuming that is a zero-mean, unit restricted class of functions and under the practically relevant X variance Gaussian RV and that 2. We switch in this assumption that the input RV has a probability distribution Y = X N example to measuring entropy in nats, so that we can compute PX µ supported on . Let i be a partition of 1 ≪ N X {X } the differential entropy of X as h(X)= ln(2πe). The output R , i.e., the elements i are disjoint and unite to , 2 X ⊆ X X Y is a χ2-distributed RV with one degree of freedom, for and let PX ( i) > 0 for all i. We present X which the differential entropy can be computed as [35] Definition 5 (Piecewise Bijective Function). A piecewise 1 bijective function g: , , RN , is a surjective h(Y )= (1 + ln π γ) (38) 2 − function defined in a piecewiseX → Y manner:X Y ⊆ where γ is the Euler-Mascheroni constant [36, pp. 3]. The

g1(x), if x 1 Jacobian determinant degenerates to the derivative, and using ∈ X some calculus we obtain g(x)= g2(x), if x 2 (33) . ∈ X 1 . E ln g′(X) =E ln 2X = (ln 2 γ) . (39) . { | |} { | |} 2 − where each gi: i i is bijective. Furthermore, the Jacobian Applying Proposition 7 we obtain an information loss of X → Y  matrix g( ) exists on the closures of i, and its determinant, L(X Y ) = ln 2, which after changing the base of the J · X → det g( ), is non-zero PX -a.s. logarithm amounts to one bit. Indeed, the information loss J · induced by a square-law device is always one bit if the PDF A direct consequence of this definition is that also PY of the input RV has even symmetry [37]. N ≪ µ . Thus, PY possesses a PDF fY w.r.t. the Lebesgue measure µN which, using the method of transformation (e.g. [4, p. 244], We note in passing that the statement of Proposition 7 can be computed as has a tight connection to the theory of iterated function systems. In particular, [16] analyzed the information flow in fX (xi) fY (y)= . (34) one-dimensional maps, which is the difference between infor- −1 det g(xi) xi g [y] mation generation via stretching (corresponding to the term ∈X | J | 7 involving the Jacobian determinant) and information reduction fX (x) via folding (corresponding to information loss). Ruelle [38] later proved that for a restricted class of systems the folding entropy (L(X Y )) cannot fall below the information generated via stretching,→ and therefore speaks of positivity of entropy production. He also established a connection to the Kolmogorov-Sinaï entropy rate. In [39] both components 1 1 1 1 1 constituting information flow in iterated function systems are 16 8 4 2 x described as ways a dynamical system can lose information. g(x) Since the connection between the theory of iterated function 1 maps, Kolmogorov-Sinaï entropy rate, and information loss deserves undivided attention, we leave a more thorough anal- ysis thereof for a later time. We turn to an explanation of why the information loss actually occurs. Intuitively, the information loss is due to the x non-injectivity of g, i.e., employing Definition 1, due to the fact that the bijectivity of g is only piecewise. We will make Fig. 4. Piecewise bijective function and input density leading to infinite loss this precise after introducing (cf. Example 4) Definition 6 (Partition Indicator). Let W be a discrete RV which is defined as the interval (0, 1]: n n n n+1 N W = i if X i (40) g(x)=2 (x 2− ) if x (2− , 2− ], n (44) ∈ X − ∈ ∈ for all i. Assume further that the PDF of the input X is given as (see Fig. 4) Proposition 8. The information loss is identical to the un- certainty about the set i from which the input was taken, n 1 1 X fX (x)=2 , i.e., log(n + 1) − log(n + 2)  n n+1 L(X Y )= H(W Y ). (41) if x (2− , 2− ], n N. (45) → | ∈ ∈ Proof: See Appendix C. As an immediate consequence, the output RV Y is uniformly This proposition states that the information loss of a PBF distributed on (0, 1]. stems from the fact that by observing the output one has To apply Proposition 8, we need remaining uncertainty about which element of the partition Pr(W = n Y = y) = Pr(W = n) i contained the input value. Moreover, by looking at | Definition{X } 6 one can see that is obtained by quantizing 1 1 W = . (46) X with partition i . Consequently, the limit in Definition 1 log(n + 1) − log(n + 2) {X } is actually achieved at a comparably coarse partition. 
For this distribution, the entropy is known to be infinite [40], The result permits a simple, but interesting and thus Corollary 2. Y and W together determine X, i.e., L(X Y )= H(W Y )= H(W )= . (47) → | ∞ H(X Y, W )=0. (42) | Proof: Since W is obviously a function of X,

L(X Y )= H(X Y )= H(X, W Y ) B. Upper Bounds on the Information Loss → | | = H(X W, Y )+ H(W Y ) The examples we examined so far were simple in the sense | | that the information loss could be computed in closed form. = H(X W, Y )+ L(X Y ) (43) | → There are certainly cases where this is not possible, especially from which H(X Y, W )=0 follows. since the expressions involved in Proposition 7 may involve In other words,| knowing the output value, and the element a logarithm of a sum. It is therefore essential to accompany of the partition from which the input originated, perfect recon- the exact expressions by bounds which are more simple to struction is possible. We will make use of this in Section III-C. evaluate. In particular, we present a corollary to Proposition 8 Before proceeding, we present an example where an infinite which follows from the fact that conditioning reduces entropy: amount of information is lost in a PBF: Corollary 3. L(X Y ) H(W ) (48) Example 4 (Infinite Loss). → ≤ Assume that we consider the following scalar function Note that in both Example 3 and 4 this upper bound holds n n+1 g: (0, 1] (0, 1], mapping every interval (2− , 2− ] onto with equality. → 8
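The closed form of Proposition 7 is easy to check numerically for Example 3. The following sketch (an illustration added here, not part of the original text) takes h(X) and h(Y) from the closed forms quoted above and estimates the expectation term by Monte Carlo; it should reproduce L(X → Y) ≈ ln 2 nats, i.e., one bit.

```python
# Check of Proposition 7 for Example 3: X ~ N(0,1), Y = X^2.
# h(X) - h(Y) + E{ln|g'(X)|} should equal ln 2 nats (one bit).
import numpy as np

rng = np.random.default_rng(3)
x = rng.standard_normal(1_000_000)

h_x = 0.5 * np.log(2 * np.pi * np.e)               # differential entropy of N(0,1), nats
h_y = 0.5 * (1 + np.log(np.pi) - np.euler_gamma)   # chi-square, one degree of freedom, cf. (38)
e_log_jac = np.mean(np.log(np.abs(2 * x)))         # Monte Carlo estimate of E{ln|2X|}, cf. (39)

loss_nats = h_x - h_y + e_log_jac
print(f"L(X -> Y) ~ {loss_nats:.4f} nats = {loss_nats / np.log(2):.4f} bit "
      f"(ln 2 = {np.log(2):.4f})")
```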

The proof of Proposition 8 also allows us to derive the bounds presented in

Proposition 9 (Upper Bounds on Information Loss). The information loss induced by a PBF can be upper bounded by the following ordered set of inequalities:

   L(X → Y) ≤ ∫_𝒴 f_Y(y) log card(g^{-1}[y]) dy     (49)
            ≤ log ( Σ_i ∫_{𝒴_i} f_Y(y) dy )     (50)
            ≤ ess sup_{y ∈ 𝒴} log card(g^{-1}[y])     (51)
            ≤ log card({𝒳_i})     (52)

where card(B) is the cardinality of the set B. Bound (49) holds with equality if and only if

fX (xk) det g(x) PX -a.s. 1 | J | = card(g− [g(x)]). −1 det g(xk) fX (x) xk g [g(x)] | J | ∈ X (53) Fig. 5. Two different functions g1 and g2 with the same information loss, If and only if this expression is constant PX -a.s., bounds (50) but with a different mean-squared reconstruction error. The corresponding and (51) are tight. Bound (52) holds with equality if and only reconstructors are indicated by thick red lines. (It is immaterial whether the functions are left- or right-continuous.) if additionally PY ( i)=1 for all i. Y Proof: See Appendix D. further that the input RV is uniformly distributed on [ a,a]. Note that all bounds of Proposition 9 hold with equality in − Examples 3 and 4. Clearly, examples where the PDF of X and It follows that the absolute value of the Jacobian determinant are constant L(X g1(X)) = L(X g2(X))=1 (54) on render the first bound (49) tight (cf. [41, conference → → 2 version,X Sect. VI]). Two other types of scenarios, where these The mean-squared reconstruction error E (X r(Y )) , however, differs for Y = g (X) and Y = g (X−), since for bounds can hold with equality, are worth mentioning: First, for 1 2 functions g: R R equality holds if the function is related to g1 the Euclidean distance between x and r(y) is generally the cumulative→ distribution function of the input RV such that, smaller. In particular, decreasing the value of q1 even further and extending the function accordingly would allow us to for all x, g′(x) = fX (x) (see extended version of [37]). The second| case| occurs when both function and PDF are make the mean-squared reconstruction error arbitrarily small, “repetitive”, in the sense that their behavior on is copied to while the information loss remains unchanged. X1 all other i, and that, thus, fX (xi) and det g(xi) is the same X 1 | J | Clearly, a reconstructor trying to minimize the mean- for all elements of the preimage g− [y]. Example 3 represents squared reconstruction error will look totally different than such a case. a reconstructor trying to recover the input signal with high probability. Since for piecewise bijective functions a recovery C. Reconstruction and Reconstruction Error Probability of X is possible (in contrast to noisy systems, where this We now investigate connections between the information is not the case), in our opinion a thorough analysis of such lost in a system and the probability for correctly reconstructing reconstructors is in order. the system input. In particular, we present a series of Fano-type Aside from being of theoretical interest, there are practical inequalities between the information loss and the reconstruc- reasons to justify the investigation: As already mentioned in tion error probability. This connection is sensible, since the Section III-B, the information loss is a quantity which is not preimage of every output value is an at most countable set. always computable in closed form. If one can thus define a Intuitively, one would expect that the fidelity of a recon- (sub-optimal) reconstruction of the input of the output for struction of a continuous input RV is best measured by some which the probability Pe of error is easy to calculate, the distance measure “natural” to the set , such as, e.g., the Fano-type bounds would yield yet another set of upper bounds mean absolute distance or the mean squared-errorX (MSE), on the information loss. But also the reverse direction is of practical interest: Given the information loss L(X Y ) if is a subset of the Euclidean space. 
However, as the → followingX example shows, there is no connection between the of a system, the presented inequalities allow one to bound information loss and such distance measures: the reconstruction error Pe. For example, one might want to obtain performance bounds of a non-coherent communications Example 5 (Energy and Information behave differently). receiver (e.g., energy detector) in a semi-coherent broadcast Consider the two functions g1 and g2 depicted in Fig. 5, scenario (e.g, combining pulse-position modulation and phase- together with a possible reconstructor (see below). Assume shift keying, as in the IEEE 802.15.4a standard [42]). 9
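To make the point of Example 5 concrete, the following sketch (our illustration; the two maps g1(x) = x mod 1/2 and g2(x) = min(x, 1−x) on a uniform input are assumed here and are not the functions of Fig. 5) compares two two-to-one maps with identical information loss of one bit but different mean-squared error under the conditional-mean reconstructor.

```python
# Two assumed folding maps with equal information loss (1 bit) but different MSE.
# For X uniform on [0,1): g1(x) = x mod 1/2 (preimages y and y + 1/2) and
# g2(x) = min(x, 1-x) (preimages y and 1-y); conditional-mean reconstructors are
# r1(y) = y + 1/4 and r2(y) = 1/2.
import numpy as np
from collections import Counter

rng = np.random.default_rng(4)
x = rng.uniform(0.0, 1.0, 500_000)

y1 = np.mod(x, 0.5)
y2 = np.minimum(x, 1.0 - x)

def cond_entropy_bits(a, b):
    """Plug-in estimate of H(A | B) in bits for discrete samples."""
    n = len(a)
    h_ab = -sum(c / n * np.log2(c / n) for c in Counter(zip(a, b)).values())
    h_b = -sum(c / n * np.log2(c / n) for c in Counter(b).values())
    return h_ab - h_b

n = 6  # quantizer resolution; for these maps each output cell meets exactly two input cells
for name, y, r in [("g1", y1, y1 + 0.25), ("g2", y2, np.full_like(y2, 0.5))]:
    loss = cond_entropy_bits(np.floor(2**n * x) / 2**n, np.floor(2**n * y) / 2**n)
    mse = np.mean((x - r) ** 2)
    print(f"{name}: estimated loss ~ {loss:.2f} bit, MSE of conditional mean ~ {mse:.4f}")
```

The printed losses should both be close to one bit, while the mean-squared errors differ (1/16 versus 1/12), illustrating that energetic and information-theoretic measures behave differently.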

We therefore present Fano-type inequalities to bound the trivially holds. We further note that in the equation above 1 reconstruction error probability via information loss. Due to one can exchange card( i ) by ess supy card(g− [y]) the peculiarities of entropy pointed out in [43], however, we to improve the bound. In{X what} follows, we∈Y aim at further restrict ourselves to finite partitions i , guaranteeing a finite improvements. preimage for every output value.{X We} note in passing that Definition 8 (Bijective Part). Let b be the maximal set such the results derived in this subsection not only apply to the X that g restricted to this set is injective, and let b be the image mitigation of non-injective effects of deterministic systems, Y of this set. Thus, g: b b bijectively, where but to any reconstruction scenario where the cardinality of the X → Y 1 input alphabet depends on the actual output value. b = x : card(g− [g(x)]) = 1 . (63) We start with introducing X { ∈ X } Then Pb = PX ( b)= PY ( b) denotes the bijectively mapped Definition 7 (Reconstructor & Reconstruction Error). Let probability mass.X Y r: be a reconstructor. Let E denote the event of a reconstructionY → X error, i.e., Proposition 11 (Fano-Type Bound). For the MAP reconstruc- 1 tor – or any reconstructor for which r(y) g− [y] – the 1, if r(Y ) = X information loss L(X Y ) is upper bounded∈ by E = 6 . (55) → (0, if r(Y )= X L(X Y ) min 1 Pb,H2(Pe) Pe log Pe The probability of a reconstruction error is given by → ≤ { − }−1 + Pe log E card(g− [Y ]) 1 . (64) − Pe = Pr(E =1)= Pe(y)dPY (y) (56) Proof: See Appendix E.   ZY If we compare this result with Fano’s original bound (62), where Pe(y) = Pr(E =1 Y = y). | we see that the cardinality of the partition is replaced by the In the following, we will investigate two different types of expected cardinality of the preimage. Due to the additional reconstructors: the maximum a-posteriori (MAP) reconstruc- term Pe log Pe this improvement is only potential, since there tor and a sub-optimal reconstructor. The MAP reconstructor exist cases where Fano’s original bound is better. An example chooses the reconstruction such that its conditional probability is the square-law device of Example 3, for which Fano’s given the output is maximized, i.e., inequality is tight, but for which Proposition 11 would yield L(X Y ) 2. rMAP(y)=arg max Pr(X = xk Y = y). (57) → ≤ −1 xk g [y] | For completeness, we want to mention that for the MAP ∈ In other words, with Definition 7 the MAP reconstructor reconstructor also a lower bound on the information loss can be given. We restate minimizes Pe(y). Interestingly, this reconstructor has a simple description for the problem at hand: Proposition 12 (Feder & Merhav, [45]). The information loss L(X Y ) is lower bounded by the error probability Pe of Proposition 10 (MAP Reconstructor). The MAP estimator for → a PBF is a MAP reconstructor by 1 rMAP(y)= gk− (y) (58) φ(Pe) L(X Y ) (65) ≤ → where where φ(x) is a piecewise linear function defined as 1 f (g− (y)) k =arg max X i . (59) i 1 1 − 1 i:g 1 (y)= det g(g− (y)) φ(x)= x − (i + 1)i log 1+ + log i (66) i 6 ∅  i  − i i | J |     Proof: The proof follows from Corollary 2, which states i 1 i for − x . that, given Y = y is known, reconstructing the input essen- i ≤ ≤ i+1 tially amounts to reconstructing the partition from which it At the time of submission we were not able to improve was chosen. 
The MAP reconstructor thus can be rewritten as this bound for the present context since the cardinality of the preimage has no influence on φ. rMAP(y) = argmax p(i y). (60) i | We illustrate the utility of these bounds, together with those where we used the notation from the proof of Proposition 8. presented in Proposition 9, in From there, Example 6 (Third-order Polynomial). Consider the function −1 fX (gi (y)) 1 −1 , if g− (y) = depicted in Fig. 6, which is defined as det g (g (y)) fY (y) i p(i y)= | J i | 6 ∅ (61)  1 3 | 0, if g− (y)= g(x)= x 100x. (67)  i ∅ − This completes the proof.  The input to this function is a zero-mean Gaussian RV X with We derive Fano-type bounds for the MAP reconstructor, or variance σ2. A closed-form evaluation of the information loss 1 any reconstructor for which r(y) g− [y]. Under the assump- is not possible, since the integral involves the logarithm of a ∈ tion of a finite partition i , note that Fano’s inequality [44, sum. However, we note that pp. 39], where H (p)={Xp}log p (1 p) log(1 p), 2 − − − − 20 20 b = , , (68) L(X Y ) H2(Pe)+ Pe log (card( i ) 1) (62) X −∞ −√3 ∪ √3 ∞ → ≤ {X } −     10

For bounding the information loss of a system (rather than reconstructing the input), it is therefore desirable to introduce a simpler, sub-optimal reconstructor:

Proposition 13 (Suboptimal Reconstruction). Consider the following sub-optimal reconstructor

   r_sub(y) = { g^{-1}(y), if y ∈ 𝒴_b;  g_k^{-1}(y), if y ∈ 𝒴_k \ 𝒴_b;  x: x ∈ 𝒳_k, else }     (71)

where

   k = argmax_i P_X(𝒳_i ∪ 𝒳_b)     (72)

and where 𝒴_k = g(𝒳_k).

Fig. 6. Third-order polynomial of Example 6 and its MAP reconstructor indicated with a thick red line.

Letting K = ess sup_{y ∈ 𝒴} card(g^{-1}[y]) and with the error probability

Pˆe =1 PX ( k b) (73) L(X → Y ) − X ∪ X Numeric Result 2.0 Upper Bound (Prop. 9) of this reconstructor, the information loss is upper bounded by Fano’s Bound the following, Fano-type inequality: Fano-type Bound (Prop. 11) Lower Bound (Prop. 12) L(X Y ) 1 Pb + Pˆe log K 1 (74) 1.5 → ≤ − − Proof: See Appendix F  This reconstructor is simple in the sense that the reconstruc- 1.0 tion is always chosen from the element k containing most of the probability mass, after considering theX set on which the function is bijective. This allows for a simple evaluation of 0.5 the reconstruction error probability Pˆe, which is independent of the Jacobian determinant of g. It is interesting to see that the Fano-type bound derived 0 σ 0 10 20 30 40 here permits a similar expression as derived in Proposition 11, despite the fact that the sub-optimal reconstructor not necessar- Fig. 7. Information loss for Example 6 as a function of input variance 2. 1 σ ily satisfies rsub(y) g− [y]. For this type of reconstructors, (card( ) 1) typically∈ has to be replaced by card( ). We thus note that· − also the following bounds hold: · and thus P = 2Q 20 , where Q denotes the Q- b √3σ function [36, 26.2.3]. With a little algebra we thus obtain the 1   L(X Y ) H2(Pˆe)+ Pˆe log ess sup card(g− [y]) (75) bounds from Proposition 9 as → ≤ y  ∈Y  H2(Pˆe)+ Pˆe log (card( i )) (76) L(X Y ) (1 Pb) log 3 log (3 2Pb) log 3 (69) → ≤ − ≤ − ≤ ≤ {X } L(X Y ) min 1 Pb,H2(Pe) Pe log Pe where ess sup card(g 1[y]) = card( )=3. → ≤ { − }− y − i 1 (77) As it can be∈Y shown6, the MAP reconstructor{X } assumes the + Pe log E card(g− [Y ]) properties depicted in Fig. 6, from which an error probability Before proceeding, we want  to briefly reconsider  of 10 20 Pe =2Q 2Q (70) Example 4 (Infinite Loss (revisited)). For the PDF and the √3σ − √3σ     function depicted in Fig. 4 it was shown that the information can be computed. We display Fano’s bound together with the loss was infinite. By recognizing that the probability mass 1 bounds from Propositions 9, 11, and 12 in Fig. 7 contained in 1 = ( 2 , 1] exceeds the mass contained in all other subsets,X we obtain an error probability for reconstruction What becomes apparent from this example is that the equal to bounds from Propositions 9 and 11 cannot form an ordered set; 1 Pˆe = 0.63. (78) the same holds for Fano’s inequality, which can be better or log 3 ≈ worse than our Fano-type bound, depending on the scenario. ˆ While in this example the MAP reconstructor was relatively In this particular case we even have Pe = Pe, since the MAP simple to find, this might not always be the case. For bounding reconstructor coincides with the suboptimal reconstructor. 1 the information loss of a system (rather than reconstructing Since in this case card(g− [y]) = for all y , all upper bounds derived from Fano-type bounds∞ evaluate∈ to Y infinity. 6The authors thank Stefan Wakolbinger for pointing us to this fact. 11
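For reference, the quantities behind Fig. 7 can be evaluated directly for Example 6. The following sketch (an illustration added here; it assumes SciPy's norm.sf as the Q-function) computes the bijectively mapped mass P_b, the MAP error probability (70), the first two bounds of (69), and Fano's inequality (62) for a few values of σ.

```python
# Numerical evaluation of Example 6 (g(x) = x^3 - 100x, Gaussian input with std sigma).
import numpy as np
from scipy.stats import norm

Q = norm.sf  # Gaussian tail function

def h2(p):
    """Binary entropy in bits (0 log 0 := 0)."""
    p = np.clip(p, 1e-15, 1 - 1e-15)
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

for sigma in (5.0, 10.0, 20.0, 40.0):
    pb = 2 * Q(20 / (np.sqrt(3) * sigma))                                    # bijective mass
    pe = 2 * Q(10 / (np.sqrt(3) * sigma)) - 2 * Q(20 / (np.sqrt(3) * sigma)) # MAP error, (70)
    bound_49 = (1 - pb) * np.log2(3)          # first bound in (69)
    bound_50 = np.log2(3 - 2 * pb)            # second bound in (69)
    fano = h2(pe) + pe * np.log2(3 - 1)       # Fano's inequality (62) with card({X_i}) = 3
    print(f"sigma = {sigma:5.1f}: Pb = {pb:.3f}, Pe = {pe:.3f}, "
          f"(1-Pb)log3 = {bound_49:.3f}, log(3-2Pb) = {bound_50:.3f}, Fano = {fano:.3f}")
```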

IV. INFORMATION LOSS FOR FUNCTIONS WHICH REDUCE We present two Corollaries to Proposition 14 concerning DIMENSIONALITY projections onto a subset of coordinates and functions which are constant on some subset with positive P -measure. We now analyze systems for which the absolute information X loss L(X Y ) is infinite. Aside from practically irrelevant Corollary 4. Let g be any projection of X onto M of its cases as in→ Example 4, this subsumes cases where the di- coordinates. Then, the relative information loss is mensionality of the input signal is reduced, e.g., by dropping N M coordinates or by keeping the function constant on a subset of l(X Y )= − . (81) → N its domain. Corollary 5. Let g be constant on a set A with positive Throughout this section we assume that the input RV has P -measure. Let furthermore g be such that⊆card( X g 1[y]) < positive information dimension, i.e., d(X) > 0 and infinite X − for all y / g(A). Then, the relative information loss is entropy H(X) = . We further assume that the function g ∞ ∈ ∞ describing the system is such that the relative information loss l(X Y )= PX (A). (82) l(X Y ) is positive (from which L(X Y )= follows; → cf. Proposition→ 6). → ∞ The first of these two corollaries has been applied to According to Proposition 4 the relative information loss in a principle component analysis in [47], while the second allows Lipschitz function is positive whenever the information dimen- us to take up Example 2 (center clipper) again: There, we sion of the input RV X is reduced. Interestingly, a reduction showed that both the information loss and the information of the dimension of the support does not necessarily lead to transfer are infinite. For the relative information loss we can a positive relative information loss,X nor does its preservation now show that it corresponds to the probability mass contained guarantee vanishing relative information loss. in the clipping region, i.e., l(X Y )= PX ([ c,c]). The somewhat surprising consequence→ of these− results is that the shape of the PDF has no influence on the relative A. Relative Information Loss for Continuous Input RVs information loss; whether the PDF is peaky in the clipping N N region or flat, or whether the omitted coordinates are highly We again assume that R and PX µ , thus correlated to the preserved ones does neither increase nor d(X) = N. We alreadyX found ⊆ in Proposition≪ 4 that for Lipschitz functions the relative information loss is given as decrease the relative information loss. In particular, in [47] we showed that dimensionality reduc- d(Y ) tion after performing a principle component analysis leads to l(X Y )=1 (79) → − N the same relative information loss as directly dropping N M − where Y may be a mixture of RVs with different information coordinates of the input vector X. dimensions (for which d(Y ) can be computed; cf. [26], [31]). Such a mixture may result, e.g., from a function g mapping B. Bounds on the Relative Information Loss different subsets of to sets of different covering dimension. Complementing the results from Section III-B we now X We intend to make this statement precise in what follows. present bounds on the relative information loss for some First, let us drop the requirement of Lipschitz continuity; particular cases. We note in passing that from the trivial generally, we now cannot expect Proposition 4 to hold. We bounds on the information dimension (d(X) [0,N] if assume that g is piecewise defined, as in Definition 5. 
Here, is a subset of the N-dimensional Euclidean∈ space or a X however, we do not require gi: i i to be bijective, but sufficiently smooth N-dimensional manifold) simple bounds X → Y to be a submersion, i.e., a smooth function between smooth on the relative information loss can be computed. manifolds whose pushforward is surjective everywhere (see, Here we present bounds on the relative information transfer e.g., [46]). A projection onto any M

Proposition 16 (Upper Bound on the Relative Information Loss). Let X be an N-dimensional RV with a probability measure P_X ≪ μ^N and let Y be N-dimensional. Then,

   l(X → Y) ≤ (1/N) Σ_{i=1}^{N} l(X^{(i)} → Y) ≤ (1/N) Σ_{i=1}^{N} l(X^{(i)} → Y^{(i)})     (84)

where X^{(i)} and Y^{(i)} are the i-th coordinates of X and Y, respectively.

Proof: See Appendix I.
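As a numerical illustration of Corollary 4 (added here; the two-dimensional uniform input is an assumed example), the following sketch estimates the relative information loss of a coordinate projection as H(X̂_n|Ŷ_n)/H(X̂_n). For this map, conditioning on the quantized output determines the retained coordinate's cell exactly, so the estimate should come out near (N − M)/N = 1/2.

```python
# Relative information loss of dropping one of two coordinates, estimated from samples.
import numpy as np
from collections import Counter

rng = np.random.default_rng(5)
N = 500_000
x = rng.uniform(0.0, 1.0, (N, 2))   # two-dimensional input with P_X << mu^2
y = x[:, 0]                          # projection onto the first coordinate

def entropy_bits(counter, n):
    return -sum(c / n * np.log2(c / n) for c in counter.values())

for n in (2, 4, 6):
    xq = np.floor(2**n * x) / 2**n
    yq = np.floor(2**n * y) / 2**n
    cells_x = list(map(tuple, xq))                  # joint quantized input cells
    h_x = entropy_bits(Counter(cells_x), N)
    h_xy = entropy_bits(Counter((cx, cy) for cx, cy in zip(cells_x, yq)), N)
    h_y = entropy_bits(Counter(yq), N)
    rel_loss = (h_xy - h_y) / h_x                   # ~ H(X_hat_n | Y_hat_n) / H(X_hat_n)
    print(f"n = {n}: relative loss ~ {rel_loss:.2f}")
```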

Example 7 (Projection). Let X be an N-dimensional RV with N (i) probability measure PX µ and let X denote the i-th ≪ coordinate of X. Let g be a projection onto the first M 0. → available at its input. This naturally holds for all n, so a finer Conversely, for a simple projection one will have Pe = 1 partition n cannot decrease the relative information loss. while l(X Y ) < 1. Finally, that l(X Y )=1 need not P → → Conversely, the mean-squared reconstruction error decreases imply Pe =1 can be shown by revisiting the center clipper: with increasing n. Example 2 (Center Clipper (revisited)). Assume that the input We therefore turn to find connections between relative probability measure PX is mixed with an absolutely contin- information loss and the reconstruction error probability after uous component supported on [ c,c] (0 < PX ([ c,c]) < 1) introducing and a point mass at an arbitrary− point x / [ c,c]−. According 0 ∈ − to [26], [30], we have d(X) = PX ([ c,c]). The output Definition 9 (Minkowski Dimension). The Minkowski- or − box-counting dimension of a compact set RN is probability measure PY has two point masses at 0 and x0 with X ⊂ 7 N log card( n) Always, d(X) ≤ dB (X ) ≤ N if X ⊂ R , e.g., by [50, Thm. 1 and dB ( ) = lim P (87) n Lem. 4]. d(X) = N leads to the desired result. X →∞ n 13

PY (0) = PX ([ c,c]) and PY (x0)=1 PX ([ c,c]), respec- D. Special Case: 1D-maps and mixed RVs tively. Clearly,−d(X Y = 0)= 1 while− d(X −Y = x )=0. | | 0 We briefly analyze the relative information loss for scenarios Consequently, similar to the one of Example 2: We consider the case where d(X Y ) , R, but we drop the restriction that PX µ. Instead, l(X Y )= | =1. (89) X Y ⊂ ≪ → d(X) we limit ourselves to mixtures of continuous and discrete probability measures, i.e., we assume that PX has no singular In comparison to that, we have Pe PX ([ c,c]), since one ≤ − continuous component. Thus, [33, pp. 121] can always use the reconstructor r(y)= x0 for all y. P = P ac + P d . (96) This is a further example where Proposition 4 holds, despite X X X that neither the center clipper is Lipschitz, nor that the require- According to [26], [30] we obtain d(X) = P ac( ). We X X ment of a continuously distributed input RV in Proposition 14 present is met. Proposition 18 (Relative Information Loss for Mixed RVs). It is worth mentioning that Proposition 17 allows us to ac Let X be a mixed RV with a probability measure PX = PX + prove a converse to lossless analog compression, as it was d ac P , 0 < P ( ) 1. Let i be a finite partition of R investigated in [26], [51]. To this end, and borrowing the X X X ≤ {X } X ⊆ into compact sets. Let g be a bounded function such that g i terminology and notation from [26], we encode a length-n X is either injective or constant. The relative information loss| is block Xn of independent realizations of a real-valued input RV given as X with information dimension 0 < d(X) 1 via a Lipschitz ac PX (A) mapping to the Euclidean space of dimension≤ R . Let l(X Y )= (97) n n → P ac( ) R(ǫ) be the infimum of R such that there exists⌊ a⌋≤ Lipschitz X X n Rn where A is the union of sets on which g is constant. g : R R⌊ ⌋ and an arbitrary (measurable) reconstructor i → X r such that Pe ǫ. ≤ Proof: See Appendix K. Corollary 6 (Converse for Lipschitz Encoders; connection If we compare this result with Corollary 5, we see that the ac to [26], [51, eq. (26)]). For a memoryless source with com- former implies the latter for PX ( )=1. Moreover, as we saw in Example 2, the relative informationX loss induced by pactly supported marginal distribution PX and information dimension 0 < d(X) 1, and a Lipschitz encoder function g, a function g can increase if the probability measure is not ≤ absolutely continuous: In this case, from PX ([ c,c]) (where R(ǫ) d(X) ǫ. (90) − PX µ) to 1. As we will show next, the relative information ≥ − ≪ Proof: Since Xn is the collection of n real-valued, inde- loss can also decrease: n n n pendent RVs it follows that = R and thus dB( )= n. X X Example 2 (Center Clipper (revisited)). Assume that With Proposition 17 we thus obtain ac ac PX ( )=0.6 and PX ([ c,c]) = 0.3. The remaining proba- n n n X − d nPe d(X )l(X Y ) (91) bility mass is a point mass at zero, i.e., P (0) = P (0) = 0.4. ≥ → X X (a) n n It follows that d(X)=0.6 and, from Proposition 18, l(X = d(X ) d(Y ) (92) → − Y )=0.5. We choose a fixed reconstructor r(0) = 0 with (b) = nd(X) d(Yn) (93) a reconstruction error probability P = P ac([ c,c]) = 0.3. − e X Using Proposition 17 we obtain − where (a) is due to Proposition 4 and (b) is due to the fact that the information dimension of a set of independent RVs is the dB ( ) 1 0.5= l(X Y ) X Pe = 0.3=0.5 (98) sum of the individual information dimensions (see, e.g., [34] → ≤ d(X) 0.6 Yn R Rn Yn or [50, Lem. 3]). 
Since is an ⌊ ⌋-valued RV, d( ) which shows that in this case the bound holds with equality. Rn . Thus, ≤ ⌊ ⌋ Consider now the case that the point mass at 0 is split d nPe nd(X) Rn nd(X) Rn. (94) into two point masses at a,b [ c,c], where PX (a)=0.3 ≥ −⌊ ⌋≥ − and P d (b)=0.1. Using r(0)∈ =−a the reconstruction error Dividing by the block length n and rearranging the terms X increases to P = P ([ c,c]) P d (a)=0.4. Proposition 17 yields e X X now is a strict inequality.− − R d(X) Pe. (95) ≥ − This completes the proof. While this result – compared with those presented in [26], V. IMPLICATIONS FOR A SYSTEM THEORY [51] – is rather weak, it suggests that our theory has rela- In the previous sections we have developed a series of tionships with different topics in information theory, such as results about the information loss – absolute or relative – compressed sensing. Note further that we need not restrict caused by deterministic systems. In fact, quite many of the the reconstructor, since we only consider the case where basic building blocks of static, i.e., memoryless, systems have already the encoder – the function g – loses information. The been dealt with: Quantizers, the bridge between continuous- restriction of Lipschitz continuity cannot be dropped, however, and discrete-amplitude systems, have been dealt with in Ex- since only this class of functions guarantees that Proposition 4 ample 1. Cascades of systems allow a simplified analysis holds. In general, as stated in [26], there are non-Lipschitz employing Propositions 1 and 5. Dimensionality reductions of bijections from Rn to R. all kinds – functions which are constant somewhere, omitting 14 coordinates of RVs, etc. – where a major constituent part of L(X →·) = 0 L(·→ Yk) = 0 Section IV. Finally, the third-order polynomial (Example 6) is significant in view of Weierstrass’ approximation theorem k1 (polynomial functions are dense in the space of continu- ˜ Y˜k X b ∗ 1 ( ) ous functions supported on bounded intervals). Connecting X log( ) e Yk1 · · subsystems in parallel and adding the outputs – a case of 1 dimensionality reduction – is so common that it deserves l(·→·)= 2 separate attention:

Example 8 (Adding Two RVs). We consider two $N$-dimensional input RVs $X_1$ and $X_2$, and assume that the output of the system under consideration is given as
\[ Y = X_1 + X_2 \tag{99} \]
i.e., as the sum of these two RVs.

We start by assuming that $X_1$ and $X_2$ have a joint probability measure $P_{X_1,X_2} \ll \mu^{2N}$. As can be shown rather easily by transforming $X_1, X_2$ invertibly to $X_1+X_2, X_1$ and dropping the second coordinate, it follows that in this case
\[ l(X_1, X_2 \to Y) = \frac{1}{2}. \tag{100} \]

Things may look totally different if the joint probability measure $P_{X_1,X_2}$ is supported on some lower-dimensional submanifold of $\mathbb{R}^{2N}$. Consider, e.g., the case where $X_2 = -X_1$, thus $Y \equiv 0$, and $l(X_1, X_2 \to Y) = 1$. In contrast to this, assume that both input variables are one-dimensional, and that $X_2 = -0.01 X_1^3$. Then, as it turns out,
\[ Y = X_1 - 0.01 X_1^3 = -0.01 (X_1^3 - 100 X_1) \tag{101} \]
which is a piecewise bijective function. As the analysis of Example 6 shows, $l(X_1, X_2 \to Y) = 0$ in this case.

We will next apply these results to systems which are larger than the toy examples presented so far. In particular, we focus on an autocorrelation receiver as an example for a system which loses an infinite amount of information, and on an accumulator which will be shown to lose only a finite amount. Yet another example – the analysis of principal component analysis employing the sample covariance matrix – can be found in the extended version of [47]. We will not only analyze the information loss in these systems, but also investigate how information propagates on the signal flow graph of the corresponding system. This will eventually mark a first step towards a system theory from an information-theoretic point-of-view.

A. Multi-Channel Autocorrelation Receiver

The multi-channel autocorrelation receiver (MC-AcR) was introduced in [52] and analyzed in [53], [54] as a non-coherent receiver architecture for ultrawide band communications. In this receiver the decision metric is formed by evaluating the autocorrelation function of the input signal for multiple time lags (see Fig. 9).

To simplify the analysis, we assume that the input signal is a discrete-time, complex-valued $N$-periodic signal superimposed with independent and identically distributed complex-valued noise. The complete analysis can be based on $N$ consecutive values of the input, which we will denote with $X^{(1)}$ through $X^{(N)}$ ($X$ is the collection of these RVs). The real and imaginary parts of $X^{(i)}$ will be denoted as $\Re X^{(i)}$ and $\Im X^{(i)}$, respectively. We may assume that $P_X \ll \mu^{2N}$, where $P_X$ is compactly supported. We consider only three time lags $k_1, k_2, k_3 \in \{1, \dots, N-1\}$.

By assuming periodicity of the input signal, we can replace the linear autocorrelation by the circular autocorrelation (e.g., [1, pp. 655])
\[ R^{(k)} = \sum_{n=0}^{N-1} X^{(n)} X^{*(n+k)} = \sum_{n=0}^{N-1} Y_k^{(n)} \tag{102} \]
where $*$ denotes complex conjugation, and where $k$ assumes one value of the set $\{k_1, k_2, k_3\}$. Note further that the circular autocorrelation here is implicit, since $X^{(n)} = X^{(n+N)}$.

We start by noting that for $k = 0$, $l(X \to Y_0) = \frac{1}{2}$, since then
\[ Y_0^{(n)} = X^{(n)} X^{*(n)} = |X^{(n)}|^2 \tag{103} \]
which shows that $Y_0$ is real and thus $P_{Y_0} \ll \mu^N$. Since $R^{(0)}$ is the sum of the components of $Y_0$, it is again real and we get
\[ t(X \to R^{(0)}) = \frac{1}{2N}. \tag{104} \]

For non-zero $k$ note that (see Fig. 10)
\[ X^{(n)} X^{*(n+k)} = e^{\log X^{(n)} + \log X^{*(n+k)}} = e^{\log X^{(n)} + (\log X^{(n+k)})^*} = e^{\tilde{X}^{(n)} + \tilde{X}^{*(n+k)}} = e^{\tilde{Y}_k^{(n)}} = Y_k^{(n)}. \tag{105} \]
In other words, we can write the multiplication as an addition (logarithm and exponential function are invertible and, thus, information lossless).

Fig. 10. Equivalent model for multiplying the two branches in Fig. 9.

Letting $\tilde{X}_k$ denote the vector of elements indexed by $\tilde{X}^{(n+k)}$, we get
\[ \tilde{X}_k = C_k \tilde{X} \tag{106} \]
where $C_k$ is a circulant permutation matrix. Thus, $\tilde{Y}_k = \tilde{X} + \tilde{X}_k^*$ and
\begin{align}
\Re\tilde{Y}_k &= (I + C_k)\,\Re\tilde{X} \tag{107}\\
\Im\tilde{Y}_k &= (I - C_k)\,\Im\tilde{X} \tag{108}
\end{align}
where $I$ is the $N\times N$ identity matrix. Since $I + C_k$ is invertible, we have $P_{\Re\tilde{Y}_k} \ll \mu^N$.
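Equations (106)–(108) are easy to verify numerically. The short sketch below is our own check, not part of the original text; the values $N = 5$ and $k = 2$ are an arbitrary choice with $\gcd(k, N) = 1$. It builds the circulant shift matrix $C_k$, confirms the real/imaginary decomposition of $\tilde{Y}_k$, and reports the ranks of $I + C_k$ and $I - C_k$.

```python
import numpy as np

N, k = 5, 2                       # assumed example values; gcd(k, N) = 1
I = np.eye(N)
C = np.zeros((N, N))
C[np.arange(N), (np.arange(N) + k) % N] = 1.0   # (C x)[n] = x[(n + k) mod N]

rng = np.random.default_rng(1)
x_t = rng.standard_normal(N) + 1j * rng.standard_normal(N)   # plays the role of X~ = log X
y_t = x_t + np.conj(C @ x_t)                                 # Y~_k as in (105)

print(np.allclose(y_t.real, (I + C) @ x_t.real))             # True, cf. (107)
print(np.allclose(y_t.imag, (I - C) @ x_t.imag))             # True, cf. (108)
print(np.linalg.matrix_rank(I + C), np.linalg.matrix_rank(I - C))   # 5 and 4 = N - 1 here
```

For these parameters $I + C_k$ has full rank while $I - C_k$ is rank-deficient by one, which is exactly the dimensionality loss exploited in the argument that follows.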



Fig. 9. Discrete-time model of the multi-channel autocorrelation receiver: The elementary mathematical operations depicted are the complex conjugation ($*$), the summation of vector elements ($\Sigma$), and the circular shift (blocks with $k_i$). The information flow is illustrated using red arrows, labeled according to the relative information transfer.

In contrast to that, the rank of $I - C_k$ is $N - 1$ and thus $d(\Im\tilde{Y}_k) = N - 1$. It follows that

\[ t(\Re\tilde{X}, \Im\tilde{X} \to \Re\tilde{Y}_k, \Im\tilde{Y}_k) = t(\tilde{X} \to \tilde{Y}_k) = t(X \to Y_k) = \frac{2N-1}{2N} \tag{109} \]
and $d(Y_k) = 2N - 1$.

Clearly, for $k \neq 0$ the autocorrelation will be a complex number a.s., thus $d(R^{(k)}) = 2$. Since the summation is a Lipschitz function, we obtain

\[ t(Y_k \to R^{(k)}) = \frac{2}{2N-1} \tag{110} \]
and, by the result about cascades,
\[ t(X \to R^{(k)}) = \frac{1}{N}. \tag{111} \]
Finally, applying Proposition 15,
\[ t(X \to R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le \frac{3}{N}. \tag{112} \]

Note that this analysis would imply that if all values of the autocorrelation function were evaluated, the relative information transfer would increase to
\[ t(X \to R) \le \frac{2N-1}{2N}. \tag{113} \]
However, knowing that the autocorrelation function of a complex, periodic sequence is Hermitian and periodic with the same period, it follows that $P_R \ll \mu^N$, and thus $t(X \to R) = \frac{1}{2}$. The bound is thus obviously not tight in this case. Note further that applying the same bound to the relative information transfer from $X$ to $Y_{k_1}, Y_{k_2}, Y_{k_3}$ would yield a number greater than one. This is simply due to the fact that the three output vectors have a lot of information in common, prohibiting simply adding their information dimensions.

A slightly different picture is revealed if we look at an equivalent signal model, where the circular autocorrelation is computed via the discrete Fourier transform (DFT, cf. Fig. 11): Letting $W$ denote the DFT matrix, we obtain the DFT of $X$ as $F_X = WX$. Doing a little algebra, the DFT of the autocorrelation function is obtained as
\[ F_R = |F_X|^2. \tag{114} \]

Fig. 11. Equivalent model of the MC-AcR of Fig. 9, using the DFT to compute the circular autocorrelation. IDFT denotes the inverse DFT and $\Pi$ a projection onto a subset of coordinates. The information flow is indicated by red arrows labeled according to the relative information transfer.

Since the DFT is an invertible transform ($W$ is a unitary matrix), one has $d(F_X) = d(X) = 2N$. The squaring of the magnitude can be written as a cascade of a coordinate transform (from Cartesian to polar coordinates; invertible), a dimensionality reduction (the phase information is dropped), and a squaring function (invertible, since the magnitude is a non-negative quantity). It follows that
\[ l(X \to R) = l(F_X \to F_R) = \frac{1}{2} \tag{115} \]
since the inverse DFT is again invertible.

Since $F_R$ is a real RV and thus $P_{F_R} \ll \mu^N$, it clearly follows that also $P_R \ll \mu^N$, despite the fact that $R$ is a vector of $N$ complex numbers.
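The relation (114) between the circular autocorrelation and the DFT is a standard identity and can be checked directly; the snippet below is an illustrative check of ours, not taken from the paper. It compares the definition (102) with the inverse DFT of $|F_X|^2$ for a random complex vector; the conjugation in the last line only accounts for the lag convention used in (102).

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Circular autocorrelation as in (102): R[k] = sum_n X[n] * conj(X[(n+k) mod N])
R_direct = np.array([np.sum(x * np.conj(np.roll(x, -k))) for k in range(N)])

# DFT route as in (114): F_R = |F_X|^2, transformed back via the inverse DFT
R_dft = np.conj(np.fft.ifft(np.abs(np.fft.fft(x)) ** 2))

print(np.allclose(R_direct, R_dft))   # True
```

The output also makes the Hermitian symmetry $R^{(k)} = (R^{(N-k)})^*$ visible, which is the structural property exploited in the argument above.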

The probability measure of this RV is concentrated on an $N$-dimensional submanifold of $\mathbb{R}^{2N}$ defined by the periodicity and Hermitian symmetry of $R$. Choosing three coordinates of $R$ as the final output of the system amounts to upper bounding the information dimension of the output by
\[ d(R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le 6. \tag{116} \]
Thus,
\[ t(X \to R^{(k_1)}, R^{(k_2)}, R^{(k_3)}) \le \frac{3}{N} \tag{117} \]
where equality is achieved if, e.g., all time lags are distinct and smaller than $\frac{N}{2}$. The information flow for this example – computed from the relative information loss and information transfer – is also depicted in Figs. 9 and 11.

B. Accumulator

As a further example, consider the system depicted in Fig. 12. We are interested in how much information we lose about the probability measure of an RV $X$ if we observe the probability measure of a sum of iid copies of this RV. For simplicity, we assume that the probability measure of $X$ is supported on the finite set $\mathcal{X} = \{0, \dots, N-1\}$, where we assume $N$ is even.

Fig. 12. Accumulating a sequence of independent, identically distributed RVs $X^{(i)}$.

The input to the system is thus the probability mass function (PMF) $p_X$. Since its elements must sum to unity, the vector $p_X$ can be chosen from the $(N-1)$-simplex in $\mathbb{R}^N$; we assume $P_{p_X} \ll \mu^{N-1}$. The output of the system at a certain discrete time index $i$ is the PMF $p_{S^{(i)}}$ of the sum $S^{(i)}$ of $i$ iid copies of $X$:
\[ S^{(i)} = \bigoplus_{k=1}^{i} X^{(k)} \tag{118} \]
where $\oplus$ denotes modulo-addition.

Given the PMF $p_{S^{(i)}}$ of the output $S^{(i)}$ at some time $i$ (e.g., by computing the histogram of multiple realizations of this system), how much information do we lose about the PMF of the input $X$? Mathematically, we are interested in the following quantity$^8$:
\[ L(p_X \to p_{S^{(i)}}) \tag{119} \]

From the theory of Markov chains we know that $p_{S^{(i)}}$ converges to a uniform distribution on $\mathcal{X}$. To see this, note that the transition from $S^{(i)}$ to $S^{(i+1)}$ can be modeled as a cyclic random walk; the transition matrix of the corresponding Markov chain is a positive circulant matrix built from $p_X$$^9$. As a consequence of the Perron-Frobenius theorem (e.g., [4, Thm. 15-8]) there exists a unique stationary distribution which, for a doubly stochastic matrix as in this case, equals the uniform distribution on $\mathcal{X}$ (cf. [55, Thm. 4.1.7]).

To attack the problem, we note that the PMF of the sum of independent RVs is given as the convolution of the PMFs of the summands. In the case of the modulo sum, the circular convolution needs to be applied instead, as can be shown by computing one Markov step. Using the DFT again, we can write the circular convolution as a multiplication; in particular (see, e.g., [1, Sec. 8.6.5])
\[ (p_X * p_{S^{(i)}}) \;\longleftrightarrow\; F_{p_X} F_{p_{S^{(i)}}} \tag{120} \]
where $F_{p_X} = W p_X$. Iterating the system a few time steps and repeating this analysis yields
\[ p_{S^{(i)}} = W^{-1} F_{p_X}^i \tag{121} \]
where the $i$-th power is taken element-wise. Since neither DFT nor inverse DFT lose information, we only have to consider the information lost in taking $F_{p_X}$ to the $i$-th power.

We now employ a few properties of the DFT (see, e.g., [1, Sec. 8.6.4]): Since $p_X$ is a real vector, $F_{p_X}$ will be Hermitian (circularly) symmetric; moreover, if we use indices from 0 to $N-1$, we have $F_{p_X}^{(0)} = 1$ and $F_{p_X}^{(N/2)} \in \mathbb{R}$.

Taking the $i$-th power of a real number $\beta$ loses at most one bit of information for even $i$ and nothing for odd $i$; thus
\[ L(\beta \to \beta^i) \le \cos^2\left(\frac{i\pi}{2}\right). \tag{122} \]
Taking the power of a complex number $\alpha$ corresponds to taking the power of its magnitude (which can be inverted by taking the corresponding root) and multiplying its phase. Only the latter is a non-injective operation, since the $i$-th root of a complex number yields $i$ different solutions for the phase. Thus, invoking Proposition 9,
\[ L(\alpha \to \alpha^i) \le \log i. \tag{123} \]
Applying (122) to the index $\frac{N}{2}$ and (123) to the indices $\{1, \dots, \frac{N}{2}-1\}$ reveals that
\[ L(p_X \to p_{S^{(i)}}) = L(F_{p_X} \to F_{p_{S^{(i)}}}) \le \left(\frac{N}{2} - 1\right) \log i + \cos^2\left(\frac{i\pi}{2}\right). \tag{124} \]

Thus, the information loss increases linearly with the number of components, but sublinearly with time. Moreover, since for $i \to \infty$, $p_{S^{(i)}}$ converges to a uniform distribution, it is plausible that $L(p_X \to p_{S^{(i)}}) \to \infty$ (at least for $N > 2$). Intuitively, the earlier one observes the output of such a system, the more information about the unknown PMF $p_X$ can be retrieved.
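Equation (121) is straightforward to check numerically: the PMF of the modulo sum obtained by repeated circular convolution coincides with the element-wise $i$-th power of the DFT of $p_X$, transformed back. The sketch below is our own illustration; the values of $N$, $i$, and the random PMF are arbitrary assumptions made for this check.

```python
import numpy as np

rng = np.random.default_rng(3)
N, i = 6, 4
p_x = rng.random(N)
p_x /= p_x.sum()                      # a random PMF on {0, ..., N-1}

def circ_conv(p, q):
    """Circular convolution of two PMFs on Z_N (PMF of the modulo sum)."""
    n = len(p)
    return np.array([sum(p[a] * q[(s - a) % n] for a in range(n)) for s in range(n)])

# PMF of S^(i) by i-1 repeated circular convolutions
p_s = p_x.copy()
for _ in range(i - 1):
    p_s = circ_conv(p_s, p_x)

# Same PMF via (121): inverse DFT of the element-wise i-th power of F_{p_X}
p_s_dft = np.real(np.fft.ifft(np.fft.fft(p_x) ** i))

print(np.allclose(p_s, p_s_dft))      # True
print(np.abs(np.fft.fft(p_x)[0]))     # 1.0, i.e., F_{p_X}^{(0)} = 1
```

Increasing $i$ in this sketch also illustrates the convergence of $p_{S^{(i)}}$ towards the uniform distribution mentioned above.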

$^8$ Note that $L(p_X \to p_{S^{(i)}})$ is not to be confused with $H(S^{(1)}|S^{(i)})$, a quantity measuring the information loss about the initial state of a Markov chain.
$^9$ In this particular case it can be shown that the first row of the transition matrix consists of the elements of $p_X$, while all other rows are obtained by circularly shifting the first one.

TABLE I
COMPARISON OF DIFFERENT INFORMATION FLOW MEASURES. THE INPUT IS ASSUMED TO HAVE POSITIVE INFORMATION DIMENSION. THE LETTER $c$ INDICATES THAT THE MEASURE EVALUATES TO A FINITE CONSTANT.

System           I(X;Y)   L(X → Y)   t(X → Y)   l(X → Y)
Quantizer          c         ∞          0          1
Rectifier          ∞         c          1          0
Center clipper     ∞         ∞          c         1−c
Example 4          ∞         ∞          1          0
MC-AcR             ∞         ∞          c         1−c
Accumulator        ∞         c          1          0

C. Metrics of Information Flow – Do We Need Them All?

So far, for our system theory we have introduced absolute and relative information loss, as well as relative information transfer. Together with mutual information this makes four different metrics which can be used to characterize a deterministic input-output system. While relative information loss and relative information transfer are clearly equivalent, the previous examples showed that one cannot simply omit the other measures of information flow without losing flexibility in describing systems. Assuming that only finite measures are meaningful, a quantizer, e.g., requires mutual information, whereas a rectifier would need information loss. Center clippers or other systems for which both absolute measures are infinite benefit from relative measures only (be it either relative information transfer or relative information loss). We summarize notable examples from this work together with their adequate information measures in Table I.

VI. OPEN ISSUES AND OUTLOOK

While this work may mark a step towards a system theory from an information-theoretic point-of-view, it is but a small step: In energetic terms, it would "just" tell us the difference between the signal variances at the input and the output of the system – a very simple, memoryless system. We do not yet know anything about the information-theoretic analog of power spectral densities (e.g., entropy rates), about systems with memory, or about the analog of more specific energy measures like the mean-squared reconstruction error assuming a signal model with noise (information loss "relevant" in view of a signal model). Moreover, we assumed that the input signal is sufficiently well-behaved in some probabilistic sense. Future work should mainly deal with extending the scope of our system theory.

The first issue which will be addressed is the fact that at present we are just measuring information "as is"; every bit of input information is weighted equally, and losing a sign bit amounts to the same information loss as losing the least significant bit in a binary expansion of a (discrete) RV. This fact leads to the apparent counter-intuitiveness of some of our results: To give an example from [47], the principal component analysis (PCA) applied prior to dimensionality reduction does not decrease the relative information loss$^{10}$; this loss is always fully determined by the information dimension of the input and the information dimension of the output (cf. Corollary 4). Contrary to that, the literature employs information theory to prove the optimality of PCA in certain cases [10], [57], [58] – but see also [59] for a recent work presenting conditions for the PCA depending on the spectrum of eigenvalues for a certain signal-noise model. To build a bridge between our theory of information loss and the results in the literature, the notion of relevance has to be brought into the game, allowing us to place unequal weights on different portions of the information available at the input. In energetic terms: Instead of just comparing variances – which is necessary sometimes! – we are now interested, e.g., in a mean-squared reconstruction error w.r.t. some relevant portion of the input signal. We actually proposed the corresponding notion of relevant information loss in [60], where we showed its applicability in signal processing and machine learning and, among other things, re-established the optimality of PCA given a specific signal model. We furthermore showed that this notion of relevant information loss is fully compatible with what we present in this paper.

Going from variances to power spectral densities, or, from information loss to information loss rates, will represent the next step: If the input to our memoryless system is not a sequence of independent RVs but a discrete-time stationary stochastic process, how much information do we lose per unit time? Following [8], the information loss rate should be upper bounded by the information loss (assuming the marginal distribution of the process as the distribution of the input RV). Aside from this, little is known about this scenario, and we hope to bring some light into this issue in the future. Of particular interest would be the reconstruction of nonlinearly distorted sequences, extending Sections III-C and IV-C.

The next, bigger step is from memoryless to dynamical input-output systems: A particularly simple subclass of these are linear filters, which were already analyzed by Shannon$^{11}$. In the discrete-time version taken from [4, pp. 663] one gets the differential entropy rate of the output process $Y$ by adding a system-dependent term to the differential entropy rate of the input process $X$:
\[ h(Y) = h(X) + \frac{1}{2\pi} \int_{-\pi}^{\pi} \log |H(e^{\jmath\theta})|\, d\theta \tag{125} \]
where $H(e^{\jmath\theta})$ is the frequency response of the filter. The fact that the latter term is independent of the process statistics shows that it is only related to the change of variables, and not to information loss. In that sense, linear filters do not perform information processing, for which the change of information measures should obviously depend on the input signal statistics.

$^{10}$ This opposes the intuition that preserving the subspace with the largest variance should also preserve most of the information: For example, the Wikipedia article [56] states, among other things, that "[...] PCA can supply the user with a lower-dimensional picture, a 'shadow' of this object when viewed from its (in some sense) most informative viewpoint" and that the variances of the dropped coordinates "tend to be small and may be dropped with minimal loss of information".
$^{11}$ To be specific, in his work [61] he analyzed the entropy loss in linear filters.
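The gain term in (125) is easy to evaluate numerically. The sketch below is our own check with an arbitrarily chosen first-order FIR filter $H(z) = 1 - a z^{-1}$ (an assumption made only for this illustration); it approximates the integral on a uniform frequency grid. For this filter and $|a| < 1$ the term vanishes, while for $|a| > 1$ it contributes $\log|a|$, in both cases independently of the input statistics.

```python
import numpy as np

def entropy_gain(a, num=100_000):
    """(1/2pi) * integral of log|H(e^{j theta})| for H(z) = 1 - a z^{-1} (natural log)."""
    theta = np.linspace(-np.pi, np.pi, num, endpoint=False)
    H = 1.0 - a * np.exp(-1j * theta)
    return np.mean(np.log(np.abs(H)))

print(entropy_gain(0.5))                 # ~0.0
print(entropy_gain(2.0), np.log(2.0))    # both ~0.693
```

This illustrates the point made after (125): the additive term depends only on the filter, not on the process that is fed through it.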

Finally, the class of nonlinear dynamical systems is significantly more difficult. We were able to present some results for discrete alphabets in [9]. For more general process alphabets we can only hope to obtain results for special subclasses, e.g., Volterra systems or affine input systems. For example, Wiener and Hammerstein systems, which are cascades of linear filters and static nonlinear functions, can completely be dealt with by generalizing our present work to stochastic processes.

Finally, many other aspects are worth investigating: The connection between information loss and entropy production in iterated function systems [38] and to heat dissipation (Landauer's principle [62]) could be of interest. Yet another interesting point is the connection between energetic and information-theoretic measures, as it exists for the Gaussian distribution.

VII. CONCLUSION

We presented an information-theoretic way to characterize the behavior of deterministic, memoryless input-output systems. In particular, we defined an absolute and a relative measure for the information loss occurring in the system due to its potential non-injectivity. Since the absolute loss can be finite for a subclass of systems despite a continuous-valued input, we were able to derive Fano-type inequalities between the information loss and the probability of a reconstruction error.

The relative measure of information loss, introduced in this work to capture systems in which an infinite amount of information is lost (and, possibly, preserved), was shown to be related to Rényi's information dimension and to present a lower bound on the reconstruction error probability. With the help of an example we showed that this bound can be tight and that, even in cases where an infinite amount of information is lost, the probability of a reconstruction error need not be unity.

While our theoretical results were developed mainly in view of a system theory, we believe that some of them may be of relevance also for analog compression, the reconstruction of nonlinearly distorted signals, chaotic iterated function systems, and the theory of Perron-Frobenius operators.

APPENDIX A
PROOF OF PROPOSITION 4

For the proof we need the following Lemma:

Lemma 1. Let $X$ and $Y$ be the input and output of a Lipschitz function $g\colon \mathcal{X} \to \mathcal{Y}$, $\mathcal{X} \subseteq \mathbb{R}^N$, $\mathcal{Y} \subseteq \mathbb{R}^M$. Then,
\[ \lim_{n\to\infty} \frac{H(\hat{Y}_n|\hat{X}_n)}{n} = 0. \tag{126} \]

Proof: We provide the proof by showing that $H(\hat{Y}_n|\hat{X}_n = \hat{x}_k)$ is finite for all $n$ and for all $\hat{x}_k$. From this it then immediately follows that
\[ \lim_{n\to\infty} \frac{H(\hat{Y}_n|\hat{X}_n)}{n} = \lim_{n\to\infty} \frac{1}{n} \sum_k H(\hat{Y}_n|\hat{X}_n = \hat{x}_k)\, P_X(\hat{\mathcal{X}}_k^{(n)}) = 0. \tag{127} \]
To this end, note that the conditional probability measure $P_{X|\hat{X}_n=\hat{x}_k}$ is supported on $\hat{\mathcal{X}}_k^{(n)}$, and that, thus, $P_{Y|\hat{X}_n=\hat{x}_k}$ is supported on $g(\hat{\mathcal{X}}_k^{(n)})$. Since $g$ is Lipschitz, there exists a constant $\lambda$ such that for all $a, b \in \hat{\mathcal{X}}_k^{(n)}$
\[ |g(a) - g(b)| \le \lambda |a - b|. \tag{128} \]
Choose $a$ and $b$ such that the term on the left is maximized, i.e.,
\[ \sup_{a,b} |g(a) - g(b)| = |g(a^\circ) - g(b^\circ)| \le \lambda |a^\circ - b^\circ| \le \lambda \sup_{a,b} |a - b| \tag{129} \]
or, in other words,
\[ \mathrm{diam}\bigl(g(\hat{\mathcal{X}}_k^{(n)})\bigr) \le \lambda\, \mathrm{diam}\bigl(\hat{\mathcal{X}}_k^{(n)}\bigr) \le \frac{\lambda\sqrt{N}}{2^n}. \tag{130} \]
The latter inequality follows since $\hat{\mathcal{X}}_k^{(n)}$ is inside an $N$-dimensional hypercube of side length $\frac{1}{2^n}$ (one may have equality in the last statement if the hull of $\hat{\mathcal{X}}_k^{(n)}$ and the boundary of $\mathcal{X}$ are disjoint).

Now note that the support of $P_{Y|\hat{X}_n=\hat{x}_k}$ can be covered by an $M$-dimensional hypercube of side length $\mathrm{diam}(g(\hat{\mathcal{X}}_k^{(n)}))$, which can again be covered by
\[ \left\lceil 2^n \mathrm{diam}\bigl(g(\hat{\mathcal{X}}_k^{(n)})\bigr) + 1 \right\rceil^M \tag{131} \]
$M$-dimensional hypercubes of side length $\frac{1}{2^n}$. By the maximum entropy property of the uniform distribution we get
\[ H(\hat{Y}_n|\hat{X}_n = \hat{x}_k) \le M \log\left\lceil 2^n \mathrm{diam}\bigl(g(\hat{\mathcal{X}}_k^{(n)})\bigr) + 1 \right\rceil \le M \log\left\lceil 2^n \frac{\lambda\sqrt{N}}{2^n} + 1 \right\rceil = M \log\left\lceil \lambda\sqrt{N} + 1 \right\rceil < \infty. \tag{132} \]
This completes the proof.

Proof of Proposition 4: By noticing that $\hat{Y}_n$ is a function of $Y$ and using the chain rule of mutual information we expand the term in Definition 4 as
\begin{align}
t(X\to Y) &= \lim_{n\to\infty} \frac{I(\hat{X}_n; \hat{Y}_n) + I(\hat{X}_n; Y|\hat{Y}_n)}{H(\hat{X}_n)} \tag{133}\\
&= \lim_{n\to\infty} \frac{I(\hat{X}_n; \hat{Y}_n)/n + I(\hat{X}_n; Y|\hat{Y}_n)/n}{H(\hat{X}_n)/n} \tag{134}\\
&= \lim_{n\to\infty} \frac{H(\hat{Y}_n)/n + I(\hat{X}_n; Y|\hat{Y}_n)/n}{H(\hat{X}_n)/n} \tag{135}
\end{align}
due to Lemma 1.

We now note that, by Kolmogorov's formula,
\[ I(\hat{X}_n; Y|\hat{Y}_n) = I(\hat{X}_n, \hat{Y}_n; Y) - I(\hat{Y}_n; Y) \tag{136} \]
and
\[ I(X; Y|Y) = I(X, Y; Y) - I(Y; Y) \overset{(a)}{=} 0 \tag{137} \]
where (a) is due to the fact that $X - Y - Y$ is a Markov chain (see, e.g., [28, pp. 43]). With [27, Lem. 7.22] and the way $\hat{X}_n$ is constructed from $X$ (cf. [26, Sec. III.D])

we find that $\lim_{n\to\infty} I(\hat{X}_n, \hat{Y}_n; Y) = I(X, Y; Y)$ and $\lim_{n\to\infty} I(\hat{Y}_n; Y) = I(Y; Y)$, and that, thus$^{12}$,
\[ \lim_{n\to\infty} I(\hat{X}_n; Y|\hat{Y}_n) = 0. \tag{138} \]
By employing Definition 3 the proof is completed.

APPENDIX B
PROOF OF PROPOSITION 7

Recall that Definition 1 states
\[ L(X\to Y) = \lim_{n\to\infty} \left( I(\hat{X}_n; X) - I(\hat{X}_n; Y) \right). \tag{139} \]
Let $\hat{X}_n = \hat{x}_k$ if $x \in \hat{\mathcal{X}}_k^{(n)}$. The conditional probability measure $P_{X|\hat{X}_n=\hat{x}_k} \ll \mu^N$ and thus possesses a density
\[ f_{X|\hat{X}_n}(x, \hat{x}_k) = \begin{cases} \dfrac{f_X(x)}{p(\hat{x}_k)}, & \text{if } x \in \hat{\mathcal{X}}_k^{(n)}\\ 0, & \text{else} \end{cases} \tag{140} \]
where $p(\hat{x}_k) = P_X(\hat{\mathcal{X}}_k^{(n)}) > 0$. By the same arguments as in the beginning of Section III, also $P_{Y|\hat{X}_n=\hat{x}_k} \ll \mu^N$, and its PDF is given by the method of transformation.

With [44, Ch. 8.5],
\[ L(X\to Y) = \lim_{n\to\infty} \left( h(X) - h(X|\hat{X}_n) - h(Y) + h(Y|\hat{X}_n) \right) = h(X) - h(Y) + \lim_{n\to\infty} \left( h(Y|\hat{X}_n) - h(X|\hat{X}_n) \right). \tag{141} \]
The latter difference can be written as
\[ h(Y|\hat{X}_n) - h(X|\hat{X}_n) = \sum_k p(\hat{x}_k) \left( h(Y|\hat{X}_n = \hat{x}_k) - h(X|\hat{X}_n = \hat{x}_k) \right). \tag{142} \]
Since (see [4, Thm. 5-1])
\[ h(Y|\hat{X}_n = \hat{x}_k) = -\int_{\mathcal{Y}} f_{Y|\hat{X}_n}(y, \hat{x}_k) \log f_{Y|\hat{X}_n}(y, \hat{x}_k)\, dy = -\int_{\mathcal{X}} f_{X|\hat{X}_n}(x, \hat{x}_k) \log f_{Y|\hat{X}_n}(g(x), \hat{x}_k)\, dx = -\frac{1}{p(\hat{x}_k)} \int_{\hat{\mathcal{X}}_k^{(n)}} f_X(x) \log f_{Y|\hat{X}_n}(g(x), \hat{x}_k)\, dx \tag{143} \]
we obtain
\[ h(Y|\hat{X}_n) - h(X|\hat{X}_n) = \sum_k \int_{\hat{\mathcal{X}}_k^{(n)}} f_X(x) \log \frac{f_{X|\hat{X}_n}(x, \hat{x}_k)}{f_{Y|\hat{X}_n}(g(x), \hat{x}_k)}\, dx. \tag{144} \]
Finally, using the method of transformation,
\[ f_{Y|\hat{X}_n}(g(x), \hat{x}_k) = \sum_{x_i \in g^{-1}[g(x)]} \frac{f_{X|\hat{X}_n}(x_i, \hat{x}_k)}{|\det\mathcal{J}_g(x_i)|}. \tag{145} \]
Since the preimage of $g(x)$ is a set separated by neighborhoods$^{13}$, we can find an $n_0$ such that
\[ \forall n \ge n_0\colon\quad g^{-1}[g(x)] \cap \hat{\mathcal{X}}_k^{(n)} = \{x\} \tag{146} \]
i.e., such that from this index on, the element of the partition under consideration, $\hat{\mathcal{X}}_k^{(n)}$, contains just a single element of the preimage, $x$. Since $f_{X|\hat{X}_n}$ is non-zero only for arguments in $\hat{\mathcal{X}}_k^{(n)}$, in this case (145) degenerates to
\[ f_{Y|\hat{X}_n}(g(x), \hat{x}_k) = \frac{f_{X|\hat{X}_n}(x, \hat{x}_k)}{|\det\mathcal{J}_g(x)|}. \tag{147} \]
Consequently, the ratio
\[ \frac{f_{X|\hat{X}_n}(x, \hat{x}_k)}{f_{Y|\hat{X}_n}(g(x), \hat{x}_k)} \nearrow |\det\mathcal{J}_g(x)| \tag{148} \]
monotonically (the number of positive terms in the sum in the denominator reduces with $n$). We can thus apply the monotone convergence theorem, e.g., [33, pp. 21], and obtain
\[ \lim_{n\to\infty} \left( h(Y|\hat{X}_n) - h(X|\hat{X}_n) \right) = \int_{\mathcal{X}} f_X(x) \log |\det\mathcal{J}_g(x)|\, dx = \mathrm{E}\{\log |\det\mathcal{J}_g(X)|\}. \tag{149} \]
This completes the proof.

APPENDIX C
PROOF OF PROPOSITION 8

We start by writing
\[ H(W|Y) = \int_{\mathcal{Y}} H(W|Y=y)\, dP_Y(y) = -\int_{\mathcal{Y}} \sum_i p(i|y) \log p(i|y)\, f_Y(y)\, dy \tag{150} \]
where
\[ p(i|y) = \Pr(W = i|Y = y) = P_{X|Y=y}(\mathcal{X}_i). \tag{151} \]
We now, for the sake of simplicity, permit the Dirac delta distribution $\delta$ as a PDF for discrete (atomic) probability measures. In particular, and following [63], we write for the conditional PDF of $Y$ given $X = x$,
\[ f_{Y|X}(x, y) = \delta(y - g(x)) = \sum_{x_i \in g^{-1}[y]} \frac{\delta(x - x_i)}{|\det\mathcal{J}_g(x_i)|}. \tag{152} \]
Applying Bayes' theorem for densities we get
\begin{align}
p(i|y) &= \int_{\mathcal{X}_i} f_{X|Y}(x, y)\, dx \tag{153}\\
&= \int_{\mathcal{X}_i} \frac{f_{Y|X}(x, y) f_X(x)}{f_Y(y)}\, dx \tag{154}\\
&= \int_{\mathcal{X}_i} \frac{1}{f_Y(y)} \sum_{x_k \in g^{-1}[y]} \frac{\delta(x - x_k)}{|\det\mathcal{J}_g(x_k)|} f_X(x)\, dx \tag{155}\\
&= \begin{cases} \dfrac{f_X(g_i^{-1}(y))}{|\det\mathcal{J}_g(g_i^{-1}(y))|\, f_Y(y)}, & \text{if } y \in \mathcal{Y}_i\\ 0, & \text{if } y \notin \mathcal{Y}_i \end{cases} \tag{156}
\end{align}
by the properties of the delta distribution (e.g., [64]) and since, by Definition 5, at most one element of the preimage of $y$ lies in $\mathcal{X}_i$.

$^{12}$ The authors thank Siu-Wai Ho for suggesting this proof method.
$^{13}$ The space $\mathbb{R}^N$ is Hausdorff, so any two distinct points are separated by neighborhoods.

We rewrite (150) as
\[ H(W|Y) = -\sum_i \int_{\mathcal{Y}_i} p(i|y) \log p(i|y)\, f_Y(y)\, dy \tag{157} \]
after exchanging the order of summation and integration with the help of Tonelli's theorem [65, Thm. 2.37] and by noticing that $p(i|y) = 0$ if $y \notin \mathcal{Y}_i$. Inserting the expression for $p(i|y)$ and changing the integration variables by substituting $x = g_i^{-1}(y)$ in each integral yields
\begin{align}
H(W|Y) &= -\sum_i \int_{\mathcal{Y}_i} \frac{f_X(g_i^{-1}(y))}{|\det\mathcal{J}_g(g_i^{-1}(y))|} \log \frac{f_X(g_i^{-1}(y))}{|\det\mathcal{J}_g(g_i^{-1}(y))|\, f_Y(y)}\, dy \tag{158}\\
&= -\sum_i \int_{\mathcal{X}_i} f_X(x) \log \frac{f_X(x)}{|\det\mathcal{J}_g(x)|\, f_Y(g(x))}\, dx \tag{159}\\
&= -\int_{\mathcal{X}} f_X(x) \log \frac{f_X(x)}{|\det\mathcal{J}_g(x)|\, f_Y(g(x))}\, dx \tag{160}\\
&\overset{(a)}{=} h(X) - h(Y) + \mathrm{E}\{\log |\det\mathcal{J}_g(X)|\} \tag{161}\\
&= L(X\to Y) \tag{162}
\end{align}
where (a) is due to splitting the logarithm and applying [4, Thm. 5-1].

APPENDIX D
PROOF OF PROPOSITION 9

The proof depends in parts on the proof of Proposition 8, where we showed that
\[ L(X\to Y) = \int_{\mathcal{Y}} H(W|Y=y)\, f_Y(y)\, dy. \tag{163} \]
The first inequality (49) is due to the maximum entropy property of the uniform distribution, i.e., $H(W|Y=y) \le \log \mathrm{card}(g^{-1}[y])$ with equality if and only if $p(i|y) = 1/\mathrm{card}(g^{-1}[y])$ for all $i$ for which $g_i^{-1}(y) \neq \emptyset$. But this translates to
\[ \mathrm{card}(g^{-1}[y]) = \frac{|\det\mathcal{J}_g(g_i^{-1}(y))|\, f_Y(y)}{f_X(g_i^{-1}(y))}. \tag{164} \]
Inserting the expression for $f_Y$ and substituting $x$ for $g_i^{-1}(y)$ (it is immaterial which $i$ is chosen, as long as the preimage of $y$ is not the empty set) we obtain
\[ \mathrm{card}(g^{-1}[g(x)]) = \frac{|\det\mathcal{J}_g(x)|}{f_X(x)} \sum_{x_k \in g^{-1}[g(x)]} \frac{f_X(x_k)}{|\det\mathcal{J}_g(x_k)|}. \tag{165} \]
The second inequality (50) is due to Jensen's inequality [44, 2.6.2], where we wrote
\[ \mathrm{E}\{\log \mathrm{card}(g^{-1}[Y])\} \le \log \mathrm{E}\{\mathrm{card}(g^{-1}[Y])\} = \log \int_{\mathcal{Y}} \mathrm{card}(g^{-1}[y])\, dP_Y(y) = \log \sum_i \int_{\mathcal{Y}} \mathrm{card}(g_i^{-1}(y))\, dP_Y(y) = \log \sum_i \int_{\mathcal{Y}_i} dP_Y(y) \tag{166} \]
since $\mathrm{card}(g_i^{-1}(y)) = 1$ if $y \in \mathcal{Y}_i$ and zero otherwise. Equality is achieved if and only if $\mathrm{card}(g^{-1}[y])$ is constant $P_Y$-a.s. In this case also the third inequality (51) is tight, which is obtained by replacing the expected value of the cardinality of the preimage by its essential supremum.

Finally, the cardinality of the preimage cannot be larger than the cardinality of the partition used in Definition 5, which yields the last inequality (52). For equality, consider that, assuming that all previous requirements for equality in the other bounds are fulfilled, (50) yields $\mathrm{card}(\{\mathcal{X}_i\})$ if and only if $P_Y(\mathcal{Y}_i) = 1$ for all $i$. This completes the proof.

APPENDIX E
PROOF OF PROPOSITION 11

The proof follows closely the proof of Fano's inequality [44, pp. 38], where one starts with noticing that
\[ H(X|Y) = H(E|Y) + H(X|E, Y). \tag{167} \]
The first term, $H(E|Y)$, can of course be upper bounded by $H(E) = H_2(P_e)$, as in Fano's inequality. However, also
\[ H(E|Y) = \int_{\mathcal{Y}} H_2(P_e(y))\, dP_Y(y) = \int_{\mathcal{Y}\setminus\mathcal{Y}_b} H_2(P_e(y))\, dP_Y(y) \le \int_{\mathcal{Y}\setminus\mathcal{Y}_b} dP_Y(y) = 1 - P_b \tag{168} \]
since $H_2(P_e(y)) = P_e(y) = 0$ if $y \in \mathcal{Y}_b$ and since $H_2(P_e(y)) \le 1$ otherwise. Thus,
\[ H(E|Y) \le \min\{H_2(P_e), 1 - P_b\}. \tag{169} \]
For the second part note that $H(X|E=0, Y=y) = 0$, so we obtain
\[ H(X|E, Y) = \int_{\mathcal{Y}} H(X|E=1, Y=y)\, P_e(y)\, dP_Y(y). \tag{170} \]
Upper bounding the entropy by $\log\left(\mathrm{card}(g^{-1}[y]) - 1\right)$ we get
\begin{align}
H(X|E, Y) &\le P_e \int_{\mathcal{Y}} \log\left(\mathrm{card}(g^{-1}[y]) - 1\right) \frac{P_e(y)}{P_e}\, dP_Y(y) \tag{171}\\
&\overset{(a)}{\le} P_e \log \int_{\mathcal{Y}} \left(\mathrm{card}(g^{-1}[y]) - 1\right) \frac{P_e(y)}{P_e}\, dP_Y(y) \tag{172}\\
&\overset{(b)}{\le} P_e \log \int_{\mathcal{Y}} \left(\mathrm{card}(g^{-1}[y]) - 1\right) dP_Y(y) + P_e \log\frac{1}{P_e} \tag{173}
\end{align}
where (a) is Jensen's inequality ($P_e(y)/P_e$ acts as a PDF) and (b) holds since $P_e(y) \le 1$ and due to splitting the logarithm. This completes the proof.

APPENDIX F
PROOF OF PROPOSITION 13

By construction, $r_{\mathrm{sub}}(y) = x$ whenever $x \in \mathcal{X}_k \cup \mathcal{X}_b$, and conversely, $r_{\mathrm{sub}}(y) \neq x$ whenever $x \notin \mathcal{X}_k \cup \mathcal{X}_b$. This yields $\hat{P}_e = 1 - P_X(\mathcal{X}_k \cup \mathcal{X}_b)$.

For the Fano-type bound, we again notice that
\[ H(X|Y) = H(E|Y) + H(X|E, Y). \tag{174} \]
The first term can be written as
\[ H(E|Y) = \int_{\mathcal{Y}} H_2(\hat{P}_e(y))\, dP_Y(y) = \int_{\mathcal{Y}_k\setminus\mathcal{Y}_b} H_2(\hat{P}_e(y))\, dP_Y(y) \le P_Y(\mathcal{Y}_k\setminus\mathcal{Y}_b) \tag{175} \]
since $\hat{P}_e(y) = 0$ for $y \in \mathcal{Y}_b$ and $\hat{P}_e(y) = 1$ for $y \in \mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b)$.

For the second term we can write
\begin{align}
H(X|E, Y) &= \int_{\mathcal{Y}} H(X|Y=y, E=1)\, \hat{P}_e(y)\, dP_Y(y) \tag{176}\\
&\le \int_{\mathcal{Y}_k\setminus\mathcal{Y}_b} \log(K-1)\, \hat{P}_e(y)\, dP_Y(y) + \int_{\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b)} \log K\, dP_Y(y). \tag{177}
\end{align}
Now we note that
\[ \hat{P}_e = P_Y(\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b)) + \int_{\mathcal{Y}_k\setminus\mathcal{Y}_b} \hat{P}_e(y)\, dP_Y(y) \tag{178} \]
which we can use above to get
\[ H(X|E, Y) \le \left(\hat{P}_e - P_Y(\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b))\right)\log(K-1) + P_Y(\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b))\log K. \tag{179} \]
Rearranging and using
\[ P_b + P_Y(\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b)) + P_Y(\mathcal{Y}_k\setminus\mathcal{Y}_b) = 1 \tag{180} \]
yields
\[ H(X|Y) \le 1 - P_b + \hat{P}_e \log(K-1) + P_Y(\mathcal{Y}\setminus(\mathcal{Y}_k\cup\mathcal{Y}_b)) \log\frac{K}{K-1}. \tag{181} \]
The fact $0 \le \log\frac{K}{K-1} \le 1$ completes the proof.

APPENDIX G
PROOF OF PROPOSITION 14

We start by noting that by the submersion theorem [46, Cor. 5.25] for any point $y \in \mathcal{Y} = \bigcup_{i=1}^{K} \mathcal{Y}_i$ the preimage under $g|_{\mathcal{X}_i}$ is either the empty set (if $y \notin \mathcal{Y}_i$) or an $(N-M_i)$-dimensional embedded submanifold of $\mathcal{X}_i$. We now write with [26], [31]
\[ d(X|Y=y) = \sum_{i=1}^{K} d(X|Y=y, X\in\mathcal{X}_i)\, P_{X|Y=y}(\mathcal{X}_i) \le \sum_{i=1}^{K} (N-M_i)\, P_{X|Y=y}(\mathcal{X}_i) \tag{182} \]
since the information dimension of an RV cannot exceed the covering dimension of its support. Taking the expectation w.r.t. $Y$ yields
\[ d(X|Y) \le \sum_{i=1}^{K} (N-M_i) \int_{\mathcal{Y}} P_{X|Y=y}(\mathcal{X}_i)\, dP_Y(y) = \sum_{i=1}^{K} (N-M_i)\, P_X(\mathcal{X}_i) \tag{183} \]
and thus
\[ l(X\to Y) \le \sum_{i=1}^{K} \frac{N-M_i}{N}\, P_X(\mathcal{X}_i) \tag{184} \]
by Proposition 3. It remains to show the reverse inequality. To this end, note that without the Lipschitz condition Proposition 4 would read
\[ t(X\to Y) \le \frac{d(Y)}{d(X)}. \tag{185} \]
But with [26], [31] we can write for the information dimension of $Y$
\[ d(Y) = \sum_{i=1}^{K} d(Y|X\in\mathcal{X}_i)\, P_X(\mathcal{X}_i). \tag{186} \]
By the fact that the $g_i$ are submersions, preimages of $\mu^{M_i}$-null sets are $\mu^N$-null sets [66] – were there a $\mu^{M_i}$-null set $B \subseteq \mathcal{Y}_i$ such that the conditional probability measure $P_{Y|X\in\mathcal{X}_i}(B) > 0$, there would be some $\mu^N$-null set $g_i^{-1}(B) \subseteq \mathcal{X}_i$ such that $P_{X|X\in\mathcal{X}_i}(g_i^{-1}(B)) > 0$, contradicting $P_X \ll \mu^N$. Thus, $P_{Y|X\in\mathcal{X}_i} \ll \mu^{M_i}$ and we get
\[ l(X\to Y) \ge 1 - \sum_{i=1}^{K} \frac{M_i}{N}\, P_X(\mathcal{X}_i) = \sum_{i=1}^{K} \frac{N-M_i}{N}\, P_X(\mathcal{X}_i). \tag{187} \]
This proves the reverse inequality and completes the proof.

APPENDIX H
PROOF OF PROPOSITION 15

The proof follows from Proposition 4, Definition 3, and the fact that conditioning reduces entropy. We write
\begin{align}
t(X\to Y) &= t(X\to Y^{(1)}, Y^{(2)}, \dots, Y^{(K)}) \tag{188}\\
&= \frac{d(Y^{(1)}, Y^{(2)}, \dots, Y^{(K)})}{d(X)} \tag{189}\\
&= \lim_{n\to\infty} \frac{H(\hat{Y}_n^{(1)}, \dots, \hat{Y}_n^{(K)})}{H(\hat{X}_n)} \tag{190}\\
&= \lim_{n\to\infty} \frac{\sum_{i=1}^{K} H(\hat{Y}_n^{(i)}|\hat{Y}_n^{(1)}, \dots, \hat{Y}_n^{(i-1)})}{H(\hat{X}_n)} \tag{191}\\
&\le \lim_{n\to\infty} \frac{\sum_{i=1}^{K} H(\hat{Y}_n^{(i)})}{H(\hat{X}_n)} = \frac{\sum_{i=1}^{K} d(Y^{(i)})}{d(X)} \tag{192}\\
&= \sum_{i=1}^{K} t(X\to Y^{(i)}) \tag{193}
\end{align}
where we exchanged limit and summation by the same reason as in the proof of Proposition 3.

APPENDIX I
PROOF OF PROPOSITION 16

The proof follows from the fact that $d(X) = N$ and, for all $i$, $d(X^{(i)}) = 1$. We obtain from the definition of relative information loss
\[ l(X\to Y) = \lim_{n\to\infty} \frac{H(\hat{X}_n^{(1)}, \dots, \hat{X}_n^{(N)}|Y)}{H(\hat{X}_n^{(1)}, \dots, \hat{X}_n^{(N)})} = \lim_{n\to\infty} \frac{\sum_{i=1}^{N} H(\hat{X}_n^{(i)}|\hat{X}_n^{(1)}, \dots, \hat{X}_n^{(i-1)}, Y)}{H(\hat{X}_n^{(1)}, \dots, \hat{X}_n^{(N)})} \le \lim_{n\to\infty} \frac{\sum_{i=1}^{N} H(\hat{X}_n^{(i)}|Y)}{H(\hat{X}_n^{(1)}, \dots, \hat{X}_n^{(N)})} \tag{194} \]
Exchanging again limit and summation we get
\[ l(X\to Y) \le \frac{\sum_{i=1}^{N} d(X^{(i)}|Y)}{d(X)} = \frac{\sum_{i=1}^{N} d(X^{(i)}|Y)}{N}. \tag{195} \]
But since $N = N d(X^{(i)})$ for all $i$, we can write
\[ l(X\to Y) \le \frac{1}{N}\sum_{i=1}^{N} \frac{d(X^{(i)}|Y)}{d(X^{(i)})} = \frac{1}{N}\sum_{i=1}^{N} l(X^{(i)}\to Y). \tag{196} \]
This proves the first inequality. The second is obtained by removing conditioning again in (194), since $Y = \{Y^{(1)}, \dots, Y^{(N)}\}$.

APPENDIX J
PROOF OF PROPOSITION 17

Note that by the compactness of $\mathcal{X}$ the quantized input $\hat{X}_n$ has a finite alphabet, which allows us to employ Fano's inequality
\[ H(\hat{X}_n|Y) \le H_2(P_{e,n}) + P_{e,n} \log \mathrm{card}(\mathcal{P}_n) \tag{197} \]
where
\[ P_{e,n} = \Pr(r(Y) \neq \hat{X}_n). \tag{198} \]
Since Fano's inequality holds for arbitrary reconstructors, we let $r(\cdot)$ be the composition of the MAP reconstructor $r_{\mathrm{MAP}}(\cdot)$ and the quantizer introduced in Section II-A. Consequently, $P_{e,n}$ is the probability that $r_{\mathrm{MAP}}(Y)$ and $X$ do not lie in the same quantization bin. Since the bin volumes reduce with increasing $n$, $P_{e,n}$ increases monotonically to $P_e$. We thus obtain with $H_2(p) \le 1$ for $0 \le p \le 1$
\[ H(\hat{X}_n|Y) \le 1 + P_e \log \mathrm{card}(\mathcal{P}_n). \tag{199} \]
With the introduced definitions,
\begin{align}
l(X\to Y) &= \lim_{n\to\infty} \frac{H(\hat{X}_n|Y)}{H(\hat{X}_n)} \tag{200}\\
&\le \lim_{n\to\infty} \frac{1 + P_e \log \mathrm{card}(\mathcal{P}_n)}{H(\hat{X}_n)} \tag{201}\\
&\overset{(a)}{=} P_e \frac{d_B(\mathcal{X})}{d(X)} \tag{202}
\end{align}
where (a) is obtained by dividing both numerator and denominator by $n$ and evaluating the limit. This completes the proof.

APPENDIX K
PROOF OF PROPOSITION 18

Since the partition is finite, we can write with [26, Thm. 2]
\[ d(X|Y=y) = \sum_i d(X|Y=y, X\in\mathcal{X}_i)\, P_{X|Y=y}(\mathcal{X}_i). \tag{203} \]
If $g$ is injective on $\mathcal{X}_i$, the intersection $g^{-1}[y]\cap\mathcal{X}_i$ is a single point$^{14}$, thus $d(X|Y=y, X\in\mathcal{X}_i) = 0$. Conversely, if $g$ is constant on $\mathcal{X}_i$, the preimage is $\mathcal{X}_i$ itself, so one obtains
\[ d(X|Y=y, X\in\mathcal{X}_i) = d(X|X\in\mathcal{X}_i) = \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)}. \tag{204} \]
We thus write
\begin{align}
d(X|Y) &= \int_{\mathcal{Y}} \sum_i d(X|Y=y, X\in\mathcal{X}_i)\, P_{X|Y=y}(\mathcal{X}_i)\, dP_Y(y) \tag{205}\\
&= \int_{\mathcal{Y}} \sum_{i\colon \mathcal{X}_i\subseteq A} \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)}\, P_{X|Y=y}(\mathcal{X}_i)\, dP_Y(y) \tag{206}\\
&\overset{(a)}{=} \sum_{i\colon \mathcal{X}_i\subseteq A} \frac{P_X^{ac}(\mathcal{X}_i)}{P_X(\mathcal{X}_i)} \int_{\mathcal{Y}} P_{X|Y=y}(\mathcal{X}_i)\, dP_Y(y) \tag{207}\\
&\overset{(b)}{=} P_X^{ac}(A) \tag{208}
\end{align}
where in (a) we exchanged summation and integration with the help of Fubini's theorem and (b) is due to the fact that the sum runs over exactly the union of sets on which $g$ is constant, $A$. Proposition 3 completes the proof.

$^{14}$ We do not consider the case here that the preimage of $y$ does not intersect $\mathcal{X}_i$, since in this case $P_{X|Y=y}(\mathcal{X}_i) = 0$.

ACKNOWLEDGMENTS

The authors thank Yihong Wu, Wharton School, University of Pennsylvania, Siu-Wai Ho, Institute for Telecommunications Research, University of South Australia, and Sebastian Tschiatschek, Signal Processing and Speech Communication Laboratory, Graz University of Technology, for fruitful discussions and suggesting material. In particular, the authors wish to thank Christian Feldbauer, formerly Signal Processing and Speech Communication Laboratory, Graz University of Technology, for his valuable input during the work on Section III.

REFERENCES

[1] A. V. Oppenheim and R. W. Schafer, Discrete-Time Signal Processing, 3rd ed. Upper Saddle River, NJ: Pearson Higher Ed., 2010.
[2] H. K. Khalil, Nonlinear Systems, 3rd ed. Upper Saddle River, NJ: Pearson Education, 2000.
[3] D. G. Manolakis, V. K. Ingle, and S. M. Kogon, Statistical and Adaptive Signal Processing. Boston, London: Artech House, 2005.
[4] A. Papoulis and U. S. Pillai, Probability, Random Variables and Stochastic Processes, 4th ed. New York, NY: McGraw Hill, 2002.
[5] J. L. Wyatt, L. O. Chua, J. W. Gannett, I. C. Göknar, and D. N. Green, "Energy concepts in the state-space theory of nonlinear n-ports: Part I–passivity," IEEE Trans. Circuits Syst., vol. 28, no. 1, pp. 48–61, Jan. 1981.
[6] J. C. Baez, T. Fritz, and T. Leinster, "A characterization of entropy in terms of information loss," Entropy, vol. 13, no. 11, pp. 1945–1957, Nov. 2011. [Online]. Available: http://arxiv.org/abs/1106.1791
[7] N. Pippenger, "The average amount of information lost in multiplication," IEEE Trans. Inf. Theory, vol. 51, no. 2, pp. 684–687, Feb. 2005.

[8] S. Watanabe and C. T. Abraham, "Loss and recovery of information by coarse observation of stochastic chain," Information and Control, vol. 3, no. 3, pp. 248–278, Sep. 1960.
[9] B. C. Geiger and G. Kubin, "Some results on the information loss in dynamical systems," in Proc. IEEE Int. Sym. Wireless Communication Systems (ISWSC), Aachen, Nov. 2011, pp. 794–798, extended version available: arXiv:1106.2404 [cs.IT].
[10] R. Linsker, "Self-organization in a perceptual network," IEEE Computer, vol. 21, no. 3, pp. 105–117, Mar. 1988.
[11] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," in Proc. Allerton Conf. on Communication, Control, and Computing, Sep. 1999, pp. 368–377.
[12] J. C. Principe, Information Theoretic Learning: Renyi's Entropy and Kernel Perspectives, ser. Information Science and Statistics. New York, NY: Springer, 2010.
[13] B. Lev, "The aggregation problem in financial statements: An informational approach," Journal of Accounting Research, vol. 6, no. 2, pp. 247–261, 1968. [Online]. Available: http://www.jstor.org/stable/2490239
[14] R. Lamarche-Perrin, Y. Demazeau, and J.-M. Vincent, "How to build the best macroscopic description of your multi-agent system?" Laboratoire d'Informatique de Grenoble, Tech. Rep., Jan. 2013. [Online]. Available: http://rr.liglab.fr/research_report/RR-LIG-035_orig.pdf
[15] D. H. Johnson, "Information theory and neural information processing," IEEE Trans. Inf. Theory, vol. 56, no. 2, pp. 653–666, Feb. 2010.
[16] W. Wiegerinck and H. Tennekes, "On the information flow for one-dimensional maps," Physics Letters A, vol. 144, no. 3, pp. 145–152, Feb. 1990.
[17] T. Schreiber, "Measuring information transfer," Physical Review Letters, vol. 85, no. 2, pp. 461–464, July 2000.
[18] A. Kaiser and T. Schreiber, "Information transfer in continuous processes," Physica D, vol. 166, pp. 43–62, 2002.
[19] X. S. Liang and R. Kleeman, "Information transfer between dynamical system components," Physical Review Letters, vol. 95, pp. 244101-1–244101-4, Dec. 2005.
[20] J. A. Vastano and H. L. Swinney, "Information transport in spatiotemporal systems," Physical Review Letters, vol. 60, no. 18, pp. 1773–1776, May 1988.
[21] K. Matsumoto and I. Tsuda, "Calculation of information flow rate from mutual information," J. Phys. A, vol. 21, pp. 1405–1414, 1988.
[22] A. N. Kolmogorov, "A new metric invariant of transient dynamical systems and automorphisms in Lebesgue spaces," Dokl. Akad. Nauk. SSSR, vol. 119, pp. 861–864, 1958.
[23] ——, "Entropy per unit time as a metric invariant of automorphisms," Dokl. Akad. Nauk. SSSR, vol. 124, pp. 754–755, 1959.
[24] Y. Sinai, "On the concept of entropy for a dynamic system," Dokl. Akad. Nauk. SSSR, vol. 124, pp. 768–771, 1959.
[25] T. Downarowicz, Entropy in Dynamical Systems. Cambridge: Cambridge University Press, 2011.
[26] Y. Wu and S. Verdú, "Rényi information dimension: Fundamental limits of almost lossless analog compression," IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3721–3748, Aug. 2010.
[27] R. M. Gray, Entropy and Information Theory. New York, NY: Springer, 1990.
[28] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco, CA: Holden Day, 1964.
[29] P. Vary and R. Martin, Digital Speech Transmission: Enhancement, Coding and Error Concealment. Chichester: John Wiley & Sons, 2006.
[30] A. Rényi, "On the dimension and entropy of probability distributions," Acta Mathematica Hungarica, vol. 10, no. 1-2, pp. 193–215, Mar. 1959.
[31] M. Śmieja and J. Tabor, "Entropy of the mixture of sources and entropy dimension," IEEE Trans. Inf. Theory, vol. 58, no. 5, pp. 2719–2728, May 2012.
[32] Y. Wu and S. Verdú, "MMSE dimension," IEEE Trans. Inf. Theory, vol. 57, no. 8, pp. 4857–4879, Aug. 2011.
[33] W. Rudin, Real and Complex Analysis, 3rd ed. New York, NY: McGraw-Hill, 1987.
[34] C. D. Cutler, "Computing pointwise fractal dimension by conditioning in multivariate distributions and time series," Bernoulli, vol. 6, no. 3, pp. 381–399, Jun. 2000.
[35] A. C. Verdugo Lazo and P. N. Rathie, "On the entropy of continuous probability distributions," IEEE Trans. Inf. Theory, vol. IT-24, pp. 120–122, 1978.
[36] M. Abramowitz and I. A. Stegun, Eds., Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, 9th ed. New York, NY: Dover Publications, 1972.
[37] B. C. Geiger, C. Feldbauer, and G. Kubin, "Information loss in static nonlinearities," in Proc. IEEE Int. Sym. Wireless Communication Systems (ISWSC), Aachen, Nov. 2011, pp. 799–803, extended version available: arXiv:1102.4794 [cs.IT].
[38] D. Ruelle, "Positivity of entropy production in nonequilibrium statistical mechanics," J. Stat. Phys., vol. 85, pp. 1–23, 1996.
[39] J. Jost, Dynamical Systems: Examples of Complex Behavior. New York, NY: Springer, 2005.
[40] M. Baer, "A simple countable infinite-entropy distribution," 2008. [Online]. Available: https://hkn.eecs.berkeley.edu/~calbear/research/Hinf.pdf
[41] B. C. Geiger and G. Kubin, "On the information loss in memoryless systems: The multivariate case," in Proc. Int. Zurich Seminar on Communications (IZS), Zurich, Feb. 2012, pp. 32–35, extended version available: arXiv:1109.4856 [cs.IT].
[42] E. Karapistoli, F.-N. Pavlidou, I. Gragopoulos, and I. Tsetsinas, "An overview of the IEEE 802.15.4a standard," IEEE Commun. Mag., vol. 48, no. 1, pp. 47–53, Jan. 2010.
[43] S.-W. Ho and R. Yeung, "On the discontinuity of the Shannon information measures," IEEE Trans. Inf. Theory, vol. 55, no. 12, pp. 5362–5374, Dec. 2009.
[44] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. Hoboken, NJ: Wiley Interscience, 2006.
[45] M. Feder and N. Merhav, "Relations between entropy and error probability," IEEE Trans. Inf. Theory, vol. 40, no. 1, pp. 259–266, Jan. 1994.
[46] J. M. Lee, Introduction to Smooth Manifolds, ser. Graduate Texts in Mathematics. New York, NY: Springer, 2003.
[47] B. C. Geiger and G. Kubin, "Relative information loss in the PCA," in Proc. IEEE Information Theory Workshop (ITW), Lausanne, Sep. 2012, pp. 562–566, extended version available: arXiv:1204.0429 [cs.IT].
[48] J. D. Farmer, E. Ott, and J. A. Yorke, "The dimension of chaotic attractors," Physica D, vol. 7, pp. 153–180, May 1983.
[49] P. Grassberger, "Generalized dimensions of strange attractors," Physics Letters, vol. 97A, no. 6, pp. 227–230, Sep. 1983.
[50] Y. Wu, "Shannon theory for compressed sensing," Ph.D. dissertation, Princeton University, 2011.
[51] Y. Wu and S. Verdú, "Optimal phase transitions in compressed sensing," IEEE Trans. Inf. Theory, vol. 58, no. 10, pp. 6241–6263, Oct. 2012.
[52] K. Witrisal, "Noncoherent autocorrelation detection of orthogonal multicarrier UWB signals," in Proc. IEEE Int. Conf. on Ultra-Wideband (ICUWB), Hannover, Sep. 2008, pp. 161–164.
[53] P. Meissner and K. Witrisal, "Analysis of a noncoherent UWB receiver for multichannel signals," in Proc. IEEE Vehicular Technology Conf. (VTC-Spring), Taipei, May 2010, pp. 1–5.
[54] A. Pedross and K. Witrisal, "Analysis of nonideal multipliers for multichannel autocorrelation UWB receivers," in Proc. IEEE Int. Conf. on Ultra-Wideband (ICUWB), Sep. 2012, pp. 140–144.
[55] J. G. Kemeny and J. L. Snell, Finite Markov Chains, 2nd ed. Springer, 1976.
[56] (2012, Oct.) Principle component analysis. [Online]. Available: http://en.wikipedia.org/wiki/Principle_components_analysis
[57] M. Plumbley, "Information theory and unsupervised neural networks," Cambridge University Engineering Department, Tech. Rep. CUED/F-INFENG/TR. 78, 1991.
[58] G. Deco and D. Obradovic, An Information-Theoretic Approach to Neural Computing. New York, NY: Springer, 1996.
[59] R. N. Rao, "When are the most informative components for inference also the principal components?" arXiv:1302.1231 [math.ST], Feb. 2013.
[60] B. C. Geiger and G. Kubin, "Signal enhancement as minimization of relevant information loss," May 2012, arXiv:1205.6935 [cs.IT].
[61] C. E. Shannon, "A mathematical theory of communication," Bell Systems Technical Journal, vol. 27, pp. 379–423, 623–656, Oct. 1948.
[62] R. Landauer, "Irreversibility and heat generation in the computing process," IBM Journal of Research and Development, vol. 5, pp. 183–191, 1961.
[63] A. Chi and T. Judy, "Transforming variables using the Dirac generalized function," The American Statistician, vol. 53, no. 3, pp. 270–272, Aug. 1999.
[64] A. Papoulis, The Fourier Integral and its Applications. McGraw Hill, 1962.
[65] G. B. Folland, Real Analysis. Modern Techniques and Their Applications, 2nd ed. New York, NY: Wiley Interscience, 1999.
[66] S. P. Ponomarev, "Submersions and preimages of sets of measure zero," Siberian Mathematical Journal, vol. 28, no. 1, pp. 153–163, Jan. 1987.