arXiv:1505.03337v3 [cs.IT] 3 Jan 2017

Entropy and Source Coding for Integer-Dimensional Singular Random Variables

Günther Koliander, Georg Pichler, Student Member, IEEE, Erwin Riegler, and Franz Hlawatsch, Fellow, IEEE

Abstract—Entropy and differential entropy are important quantities in information theory. A tractable extension to singular random variables—which are neither discrete nor continuous—has not been available so far. Here, we present such an extension for the practically relevant class of integer-dimensional singular random variables. The proposed entropy definition contains the entropy of discrete random variables and the differential entropy of continuous random variables as special cases. We show that it transforms in a natural manner under Lipschitz functions, and that it is invariant under unitary transformations. We define joint and conditional entropy for singular random variables, and we show that the proposed entropy conveys useful expressions of the mutual information. As first applications of our entropy definition, we present a result on the minimal expected codeword length of quantized integer-dimensional singular sources and a Shannon lower bound for integer-dimensional singular sources.

Index Terms—Information entropy, rate-distortion theory, Shannon lower bound, singular random variables, source coding.

G. Koliander, G. Pichler, and F. Hlawatsch are with the Institute of Telecommunications, TU Wien, 1040 Vienna, Austria (e-mail: {gkoliand, gpichler, fhlawats}@nt.tuwien.ac.at). E. Riegler is with the Department of Information Technology and Electrical Engineering, ETH Zurich, CH-8092 Zurich, Switzerland (e-mail: [email protected]). This work was supported by the WWTF under grants ICT10-066 (NOWIRE) and ICT12-054 (TINCOIN), and by the FWF under grant P27370-N30. This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, July 2014. Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

I. INTRODUCTION

A. Background and Motivation

Entropy is one of the fundamental concepts in information theory. The classical definition of entropy for discrete random variables and its interpretation as Shannon information content go back to [1] and were thoroughly analyzed from axiomatic [2] and operational [1] viewpoints. A similar definition for continuous random variables, differential entropy, was also introduced by Shannon [1], but its interpretation as information content is controversial [3]. Nonetheless, information-theoretic derivations involving continuous random variables are undisputed for quantities like the Kullback-Leibler divergence or mutual information. Furthermore, in rate-distortion theory, a lower bound on the rate-distortion function can often be simplified using differential entropy; the resulting bound is known as the Shannon lower bound [4, Sec. 4.6]. Finally, differential entropy arises in the asymptotic expansion of the entropy of ever finer quantizations of a continuous random variable [3, Sec. IV]. Hence, although the interpretation of differential entropy is disputed, its operational relevance renders it a useful quantity.

The concepts of entropy and differential entropy thus simplify the understanding and information-theoretic treatment of discrete and continuous random variables. However, these two kinds of random variables do not cover all information-theoretic problems. In fact, a number of interesting information-theoretic problems involving random variables which are neither discrete nor continuous have been described recently:

• For the vector interference channel, a singular input distribution has to be used to fully utilize the available degrees of freedom [5].
• In a probabilistic formulation of analog compression, the underlying source distribution is singular [6].
• In block-fading channel models, two different kinds of singular distributions arise: the optimal input distribution is singular in some settings [7, Ch. 6], and the noiseless output distribution is singular except for special cases [8].

Thus, a suitable generalization of (differential) entropy to singular random variables has the potential to simplify theoretical work in these areas and to provide valuable insights.

Another field where singular random variables appear is source coding. In many high-dimensional problems, deterministic dependencies reduce the intrinsic dimension of the source. Thus, the random variable describing the source is often not continuous but cannot be a discrete random variable either. A basic example is x = (x1 x2)^T ∈ R^2 supported on the unit circle, i.e., exhibiting the deterministic dependence x1^2 + x2^2 = 1, where x1 and x2 are continuous random variables. Although x is defined on R^2, it is intrinsically only one-dimensional. The differential entropy of this random variable is not defined and, in fact, classical information theory does not provide a rigorous definition of the entropy of x. Another, less trivial example of a singular random variable is a rank-one random matrix of the form X = zz^T, where z is a continuous random vector.

The case of arbitrary probability distributions is very hard to handle, and due to its generality, even defining a meaningful (differential) entropy for arbitrary distributions seems impossible. Two existing approaches to a more general entropy definition are based on quantizations of the random variable in question. Usually, the entropy of these quantizations converges to infinity and, thus, a normalization has to be employed to obtain a useful result. In [9], this approach is adopted for very specific quantizations of a random variable. Unfortunately, this does not always result in a well-defined entropy and sometimes even fails for continuous random variables of finite differential entropy [9, pp. 197f]. Moreover, the quantization process seems difficult to deal with analytically, and no theory was built based on this definition of entropy.¹ A similar approach is to consider arbitrary quantizations that are constrained by some measure of fineness to enable a limit operation. In [3] and [10], ε-entropy is introduced as the minimal entropy of all quantizations using sets of diameter less than ε. However, to specify a diameter, a distortion function has to be defined. Since the basic information-theoretic quantities (e.g., mutual information or Kullback-Leibler divergence) do not depend on a specific distortion function, it is hardly possible to embed ε-entropy into a general information-theoretic framework. Furthermore, once again the quantization process seems difficult to handle analytically.

Since the aforementioned approaches do not provide a satisfactory generalization of (differential) entropy, we follow a different approach, which is also motivated by ever finer quantizations of the random variable. However, in our approach, the order of the two steps "taking the limit of quantizations" and "calculating the entropy as the expectation of the logarithm of a probability (mass) function" is changed. More precisely, we first consider the probability mass functions of quantizations and take a normalized limit. (In the special case of a continuous random variable, this results in the probability density function due to Lebesgue's differentiation theorem.) Then we take the expectation of the logarithm of the resulting density function. Due to fundamental results in geometric measure theory, this approach can result in a well-defined entropy only for integer-dimensional distributions, since otherwise the density function does not exist [11, Th. 3.1]. In fact, the existence of the density function implies that the random variable is distributed according to a rectifiable measure [11, Th. 1.1]. Thus, the distributions considered in the present paper are rectifiable distributions on Euclidean space. Although this is still far from the generality of arbitrary probability distributions, it covers numerous interesting cases—including all the examples mentioned above—and gives valuable insights.

The density function of rectifiable measures can also be defined as a certain Radon-Nikodym derivative. A generalized (differential) entropy based on a Radon-Nikodym derivative with respect to a "measure of the observer's interest" was considered in [12]. Our entropy is consistent with this approach, and at a certain point we will use a result on quantization problems established in [12]. However, because in our setting a concrete measure is considered, the results we obtain go beyond the basic properties derived in [12] for general measures.

¹This entropy should not be confused with the information dimension defined in the same paper [9], which is indeed a very useful and widely used tool.

B. Contributions

We provide a generalization of the classical concepts of entropy and differential entropy to integer-dimensional random variables. Our entropy satisfies several well-known properties of differential entropy: it is invariant under unitary transformations, transforms as expected under Lipschitz mappings, and can be extended to joint and conditional entropy. We show that the entropy of discrete random variables and the differential entropy of continuous random variables are special cases of our entropy definition. For joint entropy, we prove a chain rule which takes the geometry of the support set into account. Furthermore, we discuss why in certain cases our entropy definition may violate the classical result that conditioning does not increase (differential) entropy. We provide expressions of the mutual information between integer-dimensional random variables in terms of our entropy. We also show that an asymptotic equipartition property analogous to [13, Sec. 8.2] holds for our entropy, but with the Lebesgue measure replaced by the Hausdorff measure of appropriate dimension.

In our proofs, we exercise care to detail all assumptions and to obtain mathematically rigorous statements. Thus, although many of our results might seem obvious to the cursory reader because of their similarity to well-known results for (differential) entropy, we emphasize that they are not simply replicas or straightforward adaptations of known results. This becomes evident, e.g., for the chain rule (see Theorem 41 in Section VI-C), which might be expected to have the same form as the chain rule for differential entropy. However, already a simple example will show that the geometry of the support set may lead to an additional term, which is not present in the special case of continuous random variables.

As a first application of the proposed entropy, we derive a result on the minimal expected binary codeword length of quantized integer-dimensional singular sources. More specifically, we show that our entropy characterizes the rate at which an arbitrarily fine quantization of an integer-dimensional singular source can be compressed. Another application is a lower bound on the rate-distortion function of an integer-dimensional singular source that resembles the Shannon lower bound for discrete [4, Sec. 4.3] and continuous [4, Sec. 4.6] random variables. For the specific case of a singular source that is uniformly distributed on the unit circle, we demonstrate that our bound is within 0.2 nat of the true rate-distortion function.
C. Notation

Sets are denoted by calligraphic letters (e.g., A). The complement of a set A is denoted A^c. Sets of sets are denoted by fraktur letters (e.g., M). The set of natural numbers {1, 2, ...} is denoted as N. The open ball with center x ∈ R^M and radius r > 0 is denoted by B_r(x), i.e., B_r(x) ≜ {y ∈ R^M : ‖y − x‖ < r}. The symbol ω(M) denotes the volume of the M-dimensional unit ball, i.e., ω(M) = π^{M/2}/Γ(1 + M/2), where Γ is the Gamma function. Boldface uppercase and lowercase letters denote matrices and vectors, respectively. The m × m identity matrix is denoted by I_m. Sans serif letters denote random quantities, e.g., x is a random vector and x is a random scalar. The superscript T stands for transposition. For x ∈ R, ⌊x⌋ ≜ max{m ∈ Z : m ≤ x}, and for x ∈ R^M, ⌊x⌋ ≜ (⌊x_1⌋ ··· ⌊x_M⌋)^T. Similarly, ⌈x⌉ ≜ min{m ∈ Z : m ≥ x}. We write E_x[·] for the expectation operator with respect to the random variable x. Pr{x ∈ A} denotes the probability that x ∈ A. For random variables x ∈ R^{M_1} and y ∈ R^{M_2}, we denote by p_x : R^{M_1+M_2} → R^{M_1}, p_x(x, y) = x, the projection of R^{M_1+M_2} onto the first M_1 components. Similarly, p_y : R^{M_1+M_2} → R^{M_2}, p_y(x, y) = y, denotes the projection of R^{M_1+M_2} onto the last M_2 components. The generalized Jacobian determinant of a Lipschitz function φ is written as Jφ.² For a function φ with domain D and a subset D' ⊆ D, we denote by φ|_{D'} the restriction of φ to the domain D'. H^m denotes the m-dimensional Hausdorff measure.³ L^M denotes the M-dimensional Lebesgue measure, and B_M denotes the Borel σ-algebra on R^M. For a measure µ and a µ-measurable function f, the induced measure is given as µf^{-1}(A) ≜ µ(f^{-1}(A)). For two measures µ and ν on the same measurable space, we indicate by µ ≪ ν that µ is absolutely continuous with respect to ν (i.e., for any measurable set A, ν(A) = 0 implies µ(A) = 0). For a measure µ and a measurable set E, the measure µ|_E is the restriction of µ to E, i.e., µ|_E(A) = µ(A ∩ E). The logarithm to the base e is denoted log and the logarithm to the base 2 is denoted ld. In certain equations, we reference an equation number on top of the equality sign in order to indicate that the equality holds due to some previous equation: e.g., =^{(42)} indicates that the equality holds due to eq. (42).

²By Rademacher's theorem [14, Th. 2.14], a Lipschitz function is differentiable almost everywhere and, thus, the Jacobian determinant is well defined almost everywhere.
³Readers unfamiliar with this concept may think of it as a measure of an m-dimensional area in a higher-dimensional space (e.g., surfaces in R^3). An introduction and definition can be found in [14, Sec. 2.8].

D. Organization of the Paper

The rest of this paper is organized as follows. In Section II, we review the established definitions of entropy and describe the intuitive idea behind our entropy definition. Rectifiable sets, measures, and random variables are introduced in Section III as the basic setting for integer-dimensional distributions. In Section IV, we develop the theory of "lower-dimensional entropy": we define entropy for integer-dimensional random variables, prove a transformation property and invariance under unitary transformations, demonstrate connections to classical entropy and differential entropy, and provide examples by calculating the entropy of random variables supported on the unit circle in R^2 and of positive semidefinite rank-one random matrices. In Sections V and VI, we introduce and discuss joint entropy and conditional entropy, respectively. Relations of our entropy to the mutual information between integer-dimensional random variables are demonstrated in Section VII. In Section VIII, we prove an asymptotic equipartition property for our entropy. In Section IX, we present a result on the minimal expected binary codeword length of quantized integer-dimensional sources. In Section X, we derive a Shannon lower bound for integer-dimensional singular sources and evaluate it for a source that is uniformly distributed on the unit circle.

II. PREVIOUS WORK AND MOTIVATION

We first recall the definitions of entropy for discrete random variables [13, Ch. 2] and differential entropy for continuous random variables [13, Ch. 8]. Let x be a discrete random variable with probability mass function p_x(x_i) = Pr{x = x_i}, i ∈ I, where I is the finite or countably infinite set indexing all possible realizations x_i of x. The entropy of x is

  H(x) ≜ −E_x[log p_x(x)] = −Σ_{i∈I} p_x(x_i) log p_x(x_i) .   (1)

For a continuous random variable x on R^M with probability density function f_x, the differential entropy is

  h(x) ≜ −E_x[log f_x(x)] = −∫_{R^M} f_x(x) log f_x(x) dL^M(x) .   (2)

We note that h(x) may be ±∞ or undefined.

A. Entropy of Dimension d(x) and ε-Entropy

There exist two previously proposed generalizations of (differential) entropy to a larger set of probability distributions. The first generalization is based on quantizations of the random variable to ever finer cubes [9]. More specifically, for a (possibly singular) random variable x ∈ R^M, the Rényi information dimension of x is

  d(x) ≜ lim_{n→∞} H(⌊nx⌋)/log n   (3)

and the entropy of dimension d(x) of x is defined as

  h^R_{d(x)}(x) ≜ lim_{n→∞} ( H(⌊nx⌋) − d(x) log n )   (4)

provided the limits in (3) and (4) exist.

This definition of entropy of dimension d(x) corresponds to the following procedure:
1) Quantize x using the cubes ∏_{i=1}^{M} [k_i/n, (k_i+1)/n) with k_i ∈ Z, i.e., consider the discrete random variable with probabilities p_k = Pr{x ∈ ∏_{i=1}^{M} [k_i/n, (k_i+1)/n)}.
2) Calculate the entropy of the quantized random variable, i.e., the negative expectation of the logarithm of the probability mass function p_k.
3) Subtract the correction term d(x) log n to account for the dimension of the random variable x.
4) Take the limit n → ∞.
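The four steps above can be run numerically for the circle-supported variable of Section I-A. The slope of H(⌊nx⌋) over log n estimates the information dimension (3), which equals 1 for the circle. This sketch is our own illustration (NumPy assumed; all names ours), using a plug-in entropy estimate at two quantization levels.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantized_entropy(x, n):
    """Plug-in entropy (in nats) of the cube quantization floor(n*x)."""
    cells = np.floor(n * x).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

# Singular source: uniform distribution on the unit circle in R^2.
phi = rng.uniform(0.0, 2.0 * np.pi, size=500_000)
x = np.stack([np.cos(phi), np.sin(phi)], axis=1)

# H(floor(n*x)) ~ d(x)*log(n) + const for large n, so the slope over
# log(n) estimates the information dimension d(x); here d(x) = 1.
n1, n2 = 64, 512
h1, h2 = quantized_entropy(x, n1), quantized_entropy(x, n2)
d_est = (h2 - h1) / (np.log(n2) - np.log(n1))
print(d_est)  # close to 1
```

The same two-point slope applied to a continuous variable on R^2 would give a value close to 2, and applied to a discrete variable a value close to 0, matching the three regimes of (3).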
Although this approach seems reasonable, there are several issues. First, the definition of h^R_{d(x)}(x) seems to be difficult to handle analytically, and connections to major information-theoretic concepts such as mutual information are not available. Furthermore, the quantization used is just one of many possible—we might, e.g., consider a shifted version of the set of cubes ∏_{i=1}^{M} [k_i/n, (k_i+1)/n), which, for singular distributions, may result in a different value of the resulting entropy.

An approach that overcomes the latter issue is the concept of ε-entropy [3], [10]. The definition of ε-entropy does not use a specific quantization but takes the infimum of the entropy over all possible (countable) quantizations under a constraint on the diameter of the quantization sets. This is motivated by data compression: the quantization should be such that an error of maximally ε is made (thus, the quantization sets have maximal diameter ε), and at the same time the minimal possible number of bits should be used to encode the data (thus, the entropy is minimized over all possible quantizations). More specifically, for a random variable x ∈ R^M, let P_ε denote the set of all countable partitions of R^M into mutually disjoint, measurable sets of diameter at most ε. Furthermore, for a partition Q = {A_i : i ∈ N} ∈ P_ε, the quantization [x]_Q ∈ N is the discrete random variable defined by p_i = Pr{[x]_Q = i} = Pr{x ∈ A_i} for i ∈ N. Then the ε-entropy of x is defined as

  H_ε(x) ≜ inf_{Q∈P_ε} H([x]_Q) .   (5)

Here, a problem is that H_ε(x) is only defined for a fixed ε > 0, and the limit ε → 0 converges to ∞ for nondiscrete distributions. However, as in the case of Rényi information dimension, a correction term can be obtained using the following seemingly new definition of information dimension:

  d*(x) ≜ lim_{ε→0} H_ε(x)/log(1/ε) .

By [15, Prop. 3.3], the definitions of information dimension using Rényi's approach and the ε-entropy approach coincide, i.e., d*(x) = d(x). This suggests the following new definition of a d(x)-dimensional entropy.

Definition 1: Let x ∈ R^M be a random variable with existing information dimension d(x). Then the asymptotic ε-entropy of dimension d(x) is defined as

  h*_{d(x)}(x) ≜ lim_{ε→0} ( H_ε(x) + d(x) log ε ) .

This definition corresponds to the following procedure:
1) Quantize x using an entropy-minimizing quantization⁴ Q given a diameter constraint ε, i.e., consider the discrete random variable [x]_Q with probabilities p_i = Pr{[x]_Q = i} = Pr{x ∈ A_i} for i ∈ N, where the diameter of each A_i is upper bounded by ε.
2) Calculate the entropy of the quantized random variable [x]_Q, i.e., the negative expectation of the logarithm of the probability mass function p_i.
3) Add the correction term d(x) log ε to account for the dimension of the random variable x.
4) Take the limit ε → 0.

Although this entropy is more general than the entropy of dimension d(x) in (4), the fundamental problems persist: we are still restricted to the choice of sets of small diameter (this is of course useful if we consider maximal distance as a measure of distortion, but it can yield unnecessarily many quantization points for areas of almost zero probability), and the definition still seems to be difficult to handle analytically and lacks connections to established information-theoretic quantities such as mutual information.

⁴We assume for simplicity that an entropy-minimizing quantization exists, although in general the infimum in (5) may not be attained.

B. An Alternative Approach

Here, we propose a different approach, which is motivated by the definition of differential entropy. The basic idea is to circumvent the quantization step and perform the entropy calculation at the end. Assuming x ∈ R^M, this results in the following procedure:
1) For some x ∈ R^M, divide the probability Pr{x ∈ B_ε(x)} by the correction factor ω(d(x)) ε^{d(x)}.⁵ (Recall that ω(d(x)) denotes the volume of the d(x)-dimensional unit ball.)
2) Take the limit ε → 0.
3) Calculate the entropy as the negative expectation of the logarithm of the resulting density function.

More specifically, steps 1–2 yield the density function⁶

  θ_x(x) ≜ lim_{ε→0} Pr{x ∈ B_ε(x)} / (ω(d(x)) ε^{d(x)})   (6)

and the entropy in step 3 is thus given by

  h^{d(x)}(x) ≜ −E_x[log θ_x(x)] .   (7)

We will show that this definition of entropy leads to definitions of joint and conditional entropy, various useful relations, connections to mutual information, an asymptotic equipartition property, and bounds relevant to source coding. However, our definition does have one limitation: as pointed out in [6, Sec. VII-A], the existence of the limit in (6) for almost every x ∈ R^M is a much stronger assumption than the existence of the Rényi information dimension (3). Loosely speaking, the existence of the limit in (6) requires that the random variable x is d(x)-dimensional almost everywhere, whereas the existence of the Rényi information dimension merely requires that the random variable is d(x)-dimensional "on average." By Preiss' theorem [16, Th. 5.6], convergence in (6) even implies that the probability measure induced by the random variable x is rectifiable (see Definition 6 in Section III-B), which means that our definition does not apply to, e.g., self-similar fractal distributions. However, we are not aware of any application or calculation of the d(x)-dimensional entropy in (4) (or the asymptotic version of ε-entropy) for fractal distributions, and it does not seem clear whether the d(x)-dimensional entropy is well defined in that case (although the information dimension (3) exists).

⁵The constant factor ω(d(x)) is included to obtain equality with differential entropy in the special case d(x) = M. A different factor would result in an additive constant in the entropy definition.
⁶A mathematically rigorous definition will be provided in Section III-B.

The rectifiability also implies that the density function θ_x(x) is equal to a certain Radon-Nikodym derivative. Based on this equality, the entropy h^{d(x)}(x) defined in (7) and (6) can be interpreted as a generalized entropy as defined in [12, eq. (1.5)] by

  H_λ(µ) ≜ −∫_{R^M} log (dµ/dλ)(x) dµ(x)   if µ ≪ λ,
  H_λ(µ) ≜ ∞   else.   (8)

Here, λ is a σ-finite measure on R^M and µ is a probability measure on R^M. While µ can be chosen as the measure of a given random variable, the generalized entropy (8) provides no intuition on how to choose the measure λ. It is more similar to a divergence between measures and, in particular, reduces to the Kullback-Leibler divergence [17] for a probability measure λ. We will see (cf. Remark 19) that our entropy definition coincides with (8) for the choice λ = H^m|_E, where m and E depend on the given random variable. This interpretation will allow us to use basic results from [12] for our entropy definition.
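For the uniform distribution on the unit circle, the limit in (6) can be checked numerically: with d(x) = 1 and ω(1) = 2, the shrinking-ball ratio settles at θ_x(x) = 1/(2π) at every point of the circle, and (7) then gives h^1(x) = log(2π). The following Monte Carlo sketch is our own illustration (NumPy assumed; all names and the chosen test point are ours).

```python
import numpy as np

rng = np.random.default_rng(0)

# x uniform on the unit circle in R^2: d(x) = 1, omega(1) = 2.
phi = rng.uniform(0.0, 2.0 * np.pi, size=2_000_000)
x = np.stack([np.cos(phi), np.sin(phi)], axis=1)

x0 = np.array([1.0, 0.0])   # a point on the circle
eps = 0.02                  # ball radius (small, but not too small for MC)

# Empirical version of (6): Pr{x in B_eps(x0)} / (omega(1) * eps^1).
p_ball = np.mean(np.linalg.norm(x - x0, axis=1) < eps)
theta_est = p_ball / (2.0 * eps)
print(theta_est)            # ~ 1/(2*pi) ~ 0.1592

# Since theta_x is constant on the circle, (7) evaluates in closed form:
# h^1(x) = -log(1/(2*pi)) = log(2*pi).
print(np.log(2.0 * np.pi))  # ~ 1.8379
```

Repeating the estimate at other points of the circle gives the same value, reflecting that the uniform distribution has a constant one-dimensional Hausdorff density.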
M depend on the given random variable. This interpretation is m-rectifiable if and only if there exist k R such E H m T ⊆ will allow us to use basic results from [12] for our entropy that 0 k∈N k, where ( 0) = 0 and each E ⊆ T ∪ T 1 T M definition. k is an m-dimensional, embedded C submanifold of R Motivated by the entropy expression in (7), a formal defini- T[19, Lem. 5.4.2].S Another characterization, based on [18, tion of the entropy of an integer-dimensional random variable Cor. 3.2.4], is that RM is m-rectifiable if and only if will be given in Section IV-A, based on the mathematical E ⊆ theory of rectifiable measures discussed next. 0 fk( k) (9) E⊆E ∪ N A k[∈ ECTIFIABLE ANDOM ARIABLES m III. R R V where H ( 0) = 0, k are bounded Borel sets, and Rm REM A As mentioned in Section II-B, the existence of a d(x)- fk : are Lipschitz functions that are one-to-one → dimensional density implies that the random variable x is on k. Due to [20, Th. 15.1], this implies that fk( k) are A A rectifiable. In this section, we recall the definitions of rec- also Borel sets. tifiable sets and measures and introduce rectifiable random B. Rectifiable Measures variables as a straightforward extension. Furthermore, we present some basic properties that will be used in subsequent Loosely speaking, rectifiable measures are measures that sections. For the convenience of readers who prefer to skip the are concentrated on a rectifiable set. The most convenient mathematical details, we summarize the most important facts way to define “concentrated on” mathematically is in terms in Corollary 12. of absolute continuity with respect to a specific Hausdorff measure. A. Rectifiable Sets Definition 6 ([14, Def. 2.59]): A Borel measure µ on RM is Our basic geometric objects of interest are rectifiable sets called m-rectifiable if there exists an m-rectifiable set RM [18, Sec. 3.2.14]. As the definition of rectifiable sets is not such that µ H m . 
E ⊆ ≪ |E consistent in the literature, we provide the definition most For an m-rectifiable measure µ, i.e., µ H m for an m- H m E convenient for our purpose. We recall that denotes the rectifiable set RM , we have by Property≪ 2 in| Lemma 4 m-dimensional Hausdorff measure. m E ⊆ that H E is σ-finite. Thus, by the Radon-Nikodym theorem Definition 2 ([14, Def. 2.57]): For m N, an H m-mea- [14, Th. 1.28],| there exists the Radon-Nikodym derivative RM ∈ 7 surable set (with M m) is called m-rectifiable dµ if there existEL ⊆ m-measurable,≥ bounded sets Rm and θm(x) , (x) (10) k µ H m M A8 ⊆ d E Lipschitz functions fk : k R , both for k N, such | m A → M∈ m m m that H f ( ) = 0. A set R is called satisfying dµ = θ dH E . We will refer to θ (x) as the k∈N k k µ | µ 0-rectifiableE\ if it is finiteA or countably infinite.E ⊆ m-dimensional Hausdorff density of µ. S Remark 3: Hereafter, we will often consider the setting of Remark 7: If µ is an m-rectifiable probability measure, it m-rectifiable sets in RM and tacitly assume m 0,...,M . cannot be n-rectifiable for n = m. Indeed, suppose that µ is ∈{ } 6 Rectifiable sets satisfy the following well-known basic prop- both m-rectifiable and n-rectifiable where, without loss of gen- erties. erality, n>m. Then there exists an m-rectifiable set such that µ H m , which implies µ( c)=0. There alsoE exists RM E Lemma 4: Let be an m-rectifiable subset of . ≪ | EH n E an n-rectifiable set such that µ F . By Property 4 in 1) Any H m-measurable subset is also m-rectifiable. F ≪ |H n D⊆E Lemma 4, the m-rectifiable set satisfies ( )=0 and, in 2) The measure H m is σ-finite. n E n E E particular, H F ( )=0. Because µ H F , this implies RM RN | 3) Let φ: with N m be a Lipschitz function. µ( )=0. Hence,| Eµ(RM ) = µ( c)+≪µ( )=0| , which is a H→m ≥ If φ( ) is -measurable, then it is m-rectifiable. contradictionE to the assumption thatE µ is a probabilityE measure. 4) For n>mE , we have H n( )=0. 
N E To avoid the nuisance of separately considering the case 5) Let i for i be m-rectifiable sets. Then i∈N i is dµ E ∈ E m = 0 in many proofs and to reduce the class of m- m-rectifiable. dH |E rectifiable sets of interest, we define the following notion ofa 6) For m =0, Rm is m-rectifiable. S 6 support of an m-rectifiable measure. Intuitively, rectifiable sets are lower-dimensional subsets of Definition 8: For an m-rectifiable measure µ on RM , an Euclidean space. Examples include affine subspaces, algebraic m-rectifiable set RM is called a support of µ if varieties, differentiable manifolds, and graphs of Lipschitz H m , dEµ ⊆ H m -almost everywhere, and µ E dH m|E > 0 E functions. As countable unions of rectifiable sets are again ≪ | |N = k∈N fk( k) where, for k , k is a bounded Borel rectifiable, further examples are countable unions of any of E mA M ∈ A set and fk : R R is a Lipschitz function that is one-to- the aforementioned sets. S → one on k. Remark 5: There are various characterizations of m-rec- A Lemma 9: Let be an -rectifiable measure, i.e., tifiable sets that provide connections to other mathematical µ m µ H m for an m-rectifiable set RM . Then there exists≪ a disciplines. For example, an H m-measurable set RM |E E ⊆ E ⊆ support . Furthermore, the support is unique up to sets m E⊆E 7In [14, Def. 2.57], these sets are called countably H m-rectifiable. of H -measure zero. 8 This definition also encompasses finite index sets k ∈ {1,...,K}; it Proof:e See Appendix A. suffices to set Ak = ∅ for k>K. 6
Remark 10: For m-rectifiable measures, it is possible to interpret the Hausdorff density θ^m_μ(x) as a measure of "local probability per area." Indeed, for an m-rectifiable measure μ, i.e., μ ≪ H^m|_E for an m-rectifiable set E, we can write θ^m_μ(x) in (10) as

  θ^m_μ(x) = lim_{r→0} μ(B_r(x)) / (ω(m) r^m)   (11)

H^m|_E-almost everywhere (for a proof see [14, Th. 2.83 and eq. (2.42)]). Furthermore, the right-hand side in (11) vanishes for H^m-almost all points not in E. Note the similarity of (11) with the ad-hoc construction in Section II-B. Indeed, (11) is the mathematically rigorous formulation of (6). This formulation also provides details regarding the probability measures for which it results in a well-defined quantity.

C. Rectifiable Random Variables

As we are only interested in probability measures and because information theory is often formulated for random variables, we define m-rectifiable random variables. In what follows, we consider a random variable x : (Ω, S) → (ℝ^M, B_M) on a probability space (Ω, S, μ), i.e., Ω is a set, S is a σ-algebra on Ω, and μ is a probability measure on (Ω, S). The probability measure induced by the random variable x is denoted by μx^{-1}. For A ∈ B_M, μx^{-1}(A) equals the probability that x ∈ A, i.e.,

  μx^{-1}(A) = μ(x^{-1}(A)) = Pr{x ∈ A}.   (12)

Definition 11: A random variable x : Ω → ℝ^M on a probability space (Ω, S, μ) is called m-rectifiable if the induced probability measure μx^{-1} on ℝ^M is m-rectifiable, i.e., there exists an m-rectifiable set E ⊆ ℝ^M such that μx^{-1} ≪ H^m|_E. The m-dimensional Hausdorff density of an m-rectifiable random variable x is defined as (cf. (10))

  θ^m_x(x) := θ^m_{μx^{-1}}(x) = (dμx^{-1}/dH^m|_E)(x).   (13)

Furthermore, a support of the measure μx^{-1} is called a support of x, i.e., E is a support of x if μx^{-1} ≪ H^m|_E, (dμx^{-1}/dH^m|_E)(x) > 0 H^m|_E-almost everywhere, and E = ⋃_{k∈ℕ} f_k(A_k) where, for k ∈ ℕ, A_k ⊆ ℝ^m is a bounded Borel set and f_k : ℝ^m → ℝ^M is a Lipschitz function that is one-to-one on A_k.

Note that due to Remark 7, an m-rectifiable random variable cannot be n-rectifiable for n ≠ m.

Basic properties of m-rectifiable random variables are summarized in the next corollary. Note, however, that although everything seems to be similar to the continuous case, Hausdorff measures lack substantial properties of the Lebesgue measure, e.g., the product measure is not always again a Hausdorff measure.

Corollary 12: Let x be an m-rectifiable random variable on ℝ^M, i.e., μx^{-1} ≪ H^m|_E for an m-rectifiable set E ⊆ ℝ^M. Then there exists the m-dimensional Hausdorff density θ^m_x, and the following properties hold:
1) The probability Pr{x ∈ A} for a measurable set A ⊆ ℝ^M can be calculated as the integral of θ^m_x over A with respect to the m-dimensional Hausdorff measure restricted to E, i.e.,

  Pr{x ∈ A} = μx^{-1}(A) = ∫_A θ^m_x(x) dH^m|_E(x).   (14)

2) The expectation of a measurable function f : ℝ^M → ℝ with respect to the random variable x can be expressed as

  E_x[f(x)] = ∫_{ℝ^M} f(x) θ^m_x(x) dH^m|_E(x).   (15)

3) The random variable x is in E with probability one, i.e.,

  Pr{x ∈ E} = μx^{-1}(E) = ∫_E θ^m_x(x) dH^m|_E(x) = 1.   (16)

4) There exists a support Ẽ ⊆ E of x.

The special cases m = 0 and m = M reduce to well-known concepts.

Theorem 13: Let x be a random variable on ℝ^M. Then:
1) x is 0-rectifiable if and only if it is a discrete random variable, i.e., there exists a probability mass function p_x(x_i) = Pr{x = x_i} > 0, i ∈ I, where I is a finite or countably infinite index set indicating all possible realizations x_i of x. In this case, θ^0_x = p_x and E = {x_i : i ∈ I} is a support of x.
2) x is M-rectifiable if and only if it is a continuous random variable, i.e., there exists a probability density function f_x such that Pr{x ∈ A} = ∫_A f_x(x) dL^M(x). In this case, θ^M_x = f_x L^M-almost everywhere.

Proof: See Appendix B.

The following theorem introduces a nontrivial class of m-rectifiable random variables. In the nontrivial case m < M, the resulting random variables are singular.

Theorem 14: Let x be a continuous random variable on ℝ^m. Furthermore, let φ : ℝ^m → ℝ^M with M ≥ m be a locally Lipschitz mapping whose generalized Jacobian determinant satisfies Jφ > 0 L^m-almost everywhere. Then y := φ(x) is an m-rectifiable random variable on ℝ^M.
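The "local probability per area" interpretation (11) can be illustrated numerically (a sketch added for illustration; the closed-form ball probability below is elementary circle geometry, not taken from the paper). For x uniformly distributed on the unit circle, the probability of the Euclidean ball B_r(x_0) around a point x_0 on the circle is 2 arcsin(r/2)/π, and the ratio μ(B_r(x_0))/(ω(1) r) with ω(1) = 2 approaches the constant Hausdorff density 1/(2π) as r → 0.

```python
import math

def ball_prob(r):
    # exact probability that a uniform point on the unit circle lies in B_r(x0):
    # chord distance <= r corresponds to angular distance <= 2*asin(r/2)
    return 2.0 * math.asin(r / 2.0) / math.pi

target = 1.0 / (2.0 * math.pi)           # Hausdorff density of the uniform circle law
for r in [1.0, 0.1, 0.01, 0.001]:
    ratio = ball_prob(r) / (2.0 * r)     # omega(1) * r^1 = 2r
    print(r, ratio)

assert abs(ball_prob(1e-6) / (2e-6) - target) < 1e-9
```

The printed ratios converge monotonically to 1/(2π) ≈ 0.159, matching the limit in (11).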
Proof: Because φ is locally Lipschitz, its restriction to each ball B_r(0), r ∈ ℕ, is Lipschitz. Hence, by Property 5 in Lemma 4, the set E := φ(ℝ^m) = ⋃_{r∈ℕ} φ(B_r(0)) is m-rectifiable. Thus, it suffices to show that μy^{-1} ≪ H^m|_{φ(ℝ^m)}, i.e., that for any H^m-measurable set A ⊆ ℝ^M, H^m|_{φ(ℝ^m)}(A) = 0 implies μy^{-1}(A) = 0. To this end, assume first that H^m|_{φ(ℝ^m)}(A) = 0 for a bounded H^m-measurable set A ⊆ ℝ^M. Let f denote the probability density function of x. By the generalized change of variables formula [14, eq. (2.47)], we have

  ∫_{φ^{-1}(A)} f(x) Jφ(x) dL^m(x) = ∫_{φ(φ^{-1}(A))} Σ_{x ∈ φ^{-1}(A) ∩ φ^{-1}({y})} f(x) dH^m(y)
   = ∫_{A ∩ φ(ℝ^m)} Σ_{x ∈ φ^{-1}(A) ∩ φ^{-1}({y})} f(x) dH^m(y) (a)= 0   (17)

where (a) holds because H^m(A ∩ φ(ℝ^m)) = 0. Because Jφ(x) > 0 L^m-almost everywhere, (17) implies f(x) = 0 L^m-almost everywhere on φ^{-1}(A), and hence ∫_{φ^{-1}(A)} f(x) dL^m(x) = 0. Thus, we have

  μy^{-1}(A) = μx^{-1}(φ^{-1}(A)) = ∫_{φ^{-1}(A)} f(x) dL^m(x) = 0.

For an unbounded H^m-measurable set A ⊆ ℝ^M satisfying H^m|_{φ(ℝ^m)}(A) = 0, following the arguments above, we obtain μy^{-1}(A ∩ B_r(0)) = 0 for the bounded sets A ∩ B_r(0), r ∈ ℕ. This implies μy^{-1}(A) ≤ Σ_{r∈ℕ} μy^{-1}(A ∩ B_r(0)) = 0.

D. Example: Distributions on the Unit Circle

As a basic example of 1-rectifiable singular random variables, we consider distributions on the unit circle in ℝ², i.e., on S_1 := {x ∈ ℝ² : ‖x‖ = 1}.

Corollary 15: Let z be a continuous random variable on ℝ. Then x = (x_1 x_2)^T := (cos z  sin z)^T is a 1-rectifiable random variable.

Proof: The mapping φ : z ↦ (cos z  sin z)^T is Lipschitz and its Jacobian determinant is identically one. Thus, we can directly apply Theorem 14.

This toy example is intuitive and illustrates the concept of m-rectifiable singular random variables in a very simple setup. In a similar way, one can analyze the rectifiability of distributions on various other geometric structures.

E. Example: Positive Semidefinite Rank-One Random Matrices

A less obvious example of an m-rectifiable singular random variable are positive semidefinite rank-one random matrices, i.e., matrices of the form X = zz^T ∈ ℝ^{m×m}, where z is a continuous random variable on ℝ^m.

Corollary 16: Let z be a continuous random variable on ℝ^m. Then the random matrix X := zz^T is m-rectifiable on ℝ^{m²}.

Proof: The mapping φ : z ↦ zz^T is locally Lipschitz. Thus, in order to apply Theorem 14, it remains to show that Jφ(z) > 0 L^m-almost everywhere. To calculate the Jacobian matrix Dφ(z), we stack the columns of the matrix zz^T and differentiate the resulting vector with respect to each element z_i. It is easily seen that the resulting Jacobian matrix is given by

  Dφ(z) = [ z e_1^T + z_1 I_m
            z e_2^T + z_2 I_m
            ⋮
            z e_m^T + z_m I_m ]   (18)

where e_i denotes the ith unit vector. As long as at least one element z_i is nonzero, Dφ(z) has full rank. Thus, Jφ(z) > 0 L^m-almost everywhere.

Remark 17: For the case of positive definite random matrices, i.e., X_m = Σ_{i=1}^m z_i z_i^T with independent continuous z_i, it is easy to see that the measures induced by these random matrices are absolutely continuous with respect to the m(m+1)/2-dimensional Lebesgue measure on the space of all symmetric matrices. The intermediate case of positive semidefinite rank-deficient random matrices X_n = Σ_{i=1}^n z_i z_i^T for n ∈ {2,...,m−1}, where the z_i ∈ ℝ^m, i ∈ {1,...,n} are independent continuous random variables, is considerably more involved because the mapping (z_1,...,z_n) ↦ Σ_{i=1}^n z_i z_i^T has a vanishing Jacobian determinant almost everywhere. We conjecture that X_n is (mn − n(n−1)/2)-rectifiable, conforming to the dimension of the manifold of all positive semidefinite rank-n matrices with n distinct eigenvalues.

IV. ENTROPY OF RECTIFIABLE RANDOM VARIABLES

A. Definition

The m-rectifiable random variables introduced in Definition 11 will be the objects considered in our entropy definition. Due to the existence of the m-dimensional Hausdorff density θ^m_x for these random variables (see (11) and (13)), the heuristic approach described in Section II-B (see (6) and (7)) can be made rigorous.

Definition 18: Let x be an m-rectifiable random variable on ℝ^M. The m-dimensional entropy of x is defined as

  h^m(x) := −E_x[log θ^m_x(x)] = −∫_{ℝ^M} log θ^m_x(x) dμx^{-1}(x)   (19)

provided the integral on the right-hand side exists in ℝ ∪ {±∞}. By (15), we obtain

  h^m(x) = −∫_{ℝ^M} θ^m_x(x) log θ^m_x(x) dH^m|_E(x)   (20)
   = −∫_E θ^m_x(x) log θ^m_x(x) dH^m(x)   (21)

where E ⊆ ℝ^M is an arbitrary m-rectifiable set satisfying μx^{-1} ≪ H^m|_E (in particular, E may be a support of x).

Remark 19: For a fixed m-rectifiable measure μ, our entropy definition (19) can be interpreted as a generalized entropy (8) with λ = H^m|_E. This will allow us to use basic results from [12] for our entropy definition. However, our definition changes the measure λ based on the choice of μ and thus is not simply a special case of (8).
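As a numerical sanity check of the entropy definition (19)–(21) (an illustrative sketch; the constant density value θ¹ = 1/(2π) for the uniform distribution on the unit circle appears later in Example 32), evaluating the integral in (21) over S_1 by arc-length discretization reproduces the value h¹(x) = log 2π obtained later in (37).

```python
import math

two_pi = 2.0 * math.pi
theta = 1.0 / two_pi   # constant Hausdorff density of the uniform distribution on S_1

# h^1(x) = -∫_{S_1} θ log θ dH^1, evaluated by arc-length discretization of the circle
N = 10000
ds = two_pi / N                      # arc-length element
h1 = -sum(theta * math.log(theta) * ds for _ in range(N))

assert abs(h1 - math.log(two_pi)) < 1e-9
```

Because the density is constant here, the discretized line integral is exact up to rounding; the point of the sketch is that (21) is an integral with respect to arc length (the 1-dimensional Hausdorff measure), not with respect to a Lebesgue density on ℝ².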
B. Transformation Property

One important property of differential entropy is its invariance under unitary transformations. A similar result holds for m-dimensional entropy. We can even give a more general result for arbitrary one-to-one Lipschitz mappings.

Theorem 20: Let x be an m-rectifiable random variable on ℝ^N with 1 ≤ m ≤ N, finite m-dimensional entropy h^m(x), support E, and m-dimensional Hausdorff density θ^m_x. Furthermore, let φ : ℝ^N → ℝ^M with M ≥ m be a Lipschitz mapping such that J^E_φ > 0 H^m|_E-almost everywhere,¹⁰ φ(E) is H^m-measurable, and E_x[log J^E_φ(x)] exists and is finite. If the restriction of φ to E is one-to-one, then y := φ(x) is an m-rectifiable random variable with m-dimensional Hausdorff density

  θ^m_y(y) = θ^m_x(φ^{-1}(y)) / J^E_φ(φ^{-1}(y))

H^m|_{φ(E)}-almost everywhere, and its m-dimensional entropy is

  h^m(y) = h^m(x) + E_x[log J^E_φ(x)].

Proof: See Appendix C.

¹⁰Here J^E_φ denotes the Jacobian determinant of the tangential differential of φ on E. For details see [18, Sec. 3.2.16].

Remark 21: Theorem 20 shows that for the special case of a unitary transformation φ (e.g., a translation), h^m(φ(x)) = h^m(x) because J^E_φ(x) is identically one in that case.

Remark 22: In general, no result resembling Theorem 20 holds for Lipschitz functions φ : ℝ^N → ℝ^M that are not one-to-one on E. We can argue as in the proof of Theorem 20 and obtain that y = φ(x) is m-rectifiable and that the m-dimensional Hausdorff density is

  θ^m_y(y) = Σ_{x ∈ φ^{-1}({y})} θ^m_x(x) / J^E_φ(x)

H^m|_{φ(E)}-almost everywhere. We then obtain for the m-dimensional entropy

  h^m(y) = −∫_{φ(E)} ( Σ_{x ∈ φ^{-1}({y})} θ^m_x(x)/J^E_φ(x) ) log( Σ_{x ∈ φ^{-1}({y})} θ^m_x(x)/J^E_φ(x) ) dH^m(y)
   (a)= −∫_E θ^m_x(x) log( Σ_{x' ∈ φ^{-1}({φ(x)})} θ^m_x(x')/J^E_φ(x') ) dH^m(x)

where (a) holds because of the generalized area formula [14, Th. 2.91]. In most cases, this cannot be easily expressed in terms of a differential entropy due to the sum in the logarithm. However, in the special case of a Jacobian determinant J^E_φ and a Hausdorff density θ^m_x that are symmetric in the sense that θ^m_x(x') and J^E_φ(x') are constant on φ^{-1}({φ(x)}) for all x ∈ E, the summation reduces to a multiplication by the cardinality of φ^{-1}({φ(x)}).

C. Relation to Entropy and Differential Entropy

In the special cases m = 0 and m = M, our entropy definition reduces to classical entropy (1) and differential entropy (2), respectively.

Theorem 23: Let x be a random variable on ℝ^M. If x is a 0-rectifiable (i.e., discrete) random variable, then the 0-dimensional entropy of x coincides with the classical entropy, i.e., h^0(x) = H(x). If x is an M-rectifiable (i.e., continuous) random variable, then the M-dimensional entropy of x coincides with the differential entropy, i.e., h^M(x) = h(x).

Proof: Let x be a 0-rectifiable random variable. By Theorem 13, x is a discrete random variable with possible realizations x_i, i ∈ I, the 0-dimensional Hausdorff density θ^0_x is the probability mass function of x, and a support is given by E = {x_i : i ∈ I}. Thus, (21) yields

  h^0(x) = −∫_E θ^0_x(x) log θ^0_x(x) dH^0(x) (a)= −Σ_{i∈I} Pr{x = x_i} log Pr{x = x_i} = H(x) (by (1))

where (a) holds because H^0 is the counting measure.

Let x be an M-rectifiable random variable. By Theorem 13, x is a continuous random variable and the M-dimensional Hausdorff density θ^M_x is equal to the probability density function f_x. Thus, (19) yields

  h^M(x) = −E_x[log θ^M_x(x)] = −E_x[log f_x(x)] = h(x) (by (2)).

To get an idea of the m-dimensional entropy of random variables in between the discrete and continuous cases, we can use Theorem 14 to construct m-rectifiable random variables. More specifically, we consider a continuous random variable x on ℝ^m and a one-to-one Lipschitz mapping φ : ℝ^m → ℝ^M (M ≥ m) whose generalized Jacobian determinant satisfies Jφ > 0 L^m-almost everywhere. Intuitively, we should see a connection between the differential entropy of x and the m-dimensional entropy of y := φ(x). By Theorem 14, the random variable y is m-rectifiable and, because φ is one-to-one, we can indeed calculate the m-dimensional entropy.

Corollary 24: Let x be a continuous random variable on ℝ^m with finite differential entropy h(x) and probability density function f_x. Furthermore, let φ : ℝ^m → ℝ^M (M ≥ m) be a one-to-one Lipschitz mapping such that Jφ > 0 L^m-almost everywhere and E_x[log Jφ(x)] exists and is finite. Then the m-dimensional Hausdorff density of the m-rectifiable random variable y := φ(x) is

  θ^m_y(y) = f_x(φ^{-1}(y)) / Jφ(φ^{-1}(y))

H^m|_{φ(ℝ^m)}-almost everywhere, and the m-dimensional entropy of y is

  h^m(y) = h(x) + E_x[log Jφ(x)].

For the special case of the embedding φ : ℝ^m → ℝ^M, φ(x_1,...,x_m) = (x_1 ⋯ x_m 0 ⋯ 0)^T, this results in

  h^m(x_1,...,x_m, 0,...,0) = h(x).   (22)

Proof: The first part is the special case N = m and E = ℝ^m of Theorem 20. The result (22) then follows from the fact that, for the considered embedding, Jφ(x) is identically 1.

D. Example: Entropy of Distributions on the Unit Circle

It is now easy to calculate the entropy of the 1-rectifiable singular random variables on the unit circle previously considered in Section III-D. Let z be a continuous random variable on ℝ with probability density function f_z supported on [0, 2π), i.e., f_z(z) = 0 for z ∉ [0, 2π). By Corollary 24, the 1-dimensional Hausdorff density of the random variable x = φ(z) = (cos z  sin z)^T is given by (recall that the Jacobian determinant is identically one)

  θ^1_x(x) = f_z(φ^{-1}(x))   (23)

H^1|_{S_1}-almost everywhere, and the entropy of x is given by

  h^1(x) = h(z).   (24)

Of course, this result for h^1(x) may have been conjectured by heuristic reasoning. Next, we consider a case where heuristic reasoning does not help.

E. Example: Entropy of Positive Semidefinite Rank-One Random Matrices

As a more challenging example, we calculate the entropy of a specific type of m-rectifiable singular random variables, namely, the positive semidefinite rank-one random matrices previously considered in Section III-E.

Theorem 25: Let z be a continuous random variable on ℝ^m with probability density function f_z, and let z̄ denote the random variable with probability density function f_z̄(z) = (f_z(z) + f_z(−z))/2. Then the m-dimensional entropy of the random matrix X = zz^T is given by

  h^m(X) = h(z̄) + ((m−1)/2) log 2 + (m/2) E_z[log ‖z‖²].   (25)

Proof: We first calculate the Jacobian determinant of the mapping φ : z ↦ zz^T, which is given by Jφ(z) = sqrt(det(Dφ^T(z) Dφ(z))). By (18) and some simple algebraic manipulations, one obtains Jφ(z) = sqrt(det(2‖z‖² I_m + 2zz^T)), and further

  Jφ(z) = sqrt( 2^m ‖z‖^{2m} det( I_m + zz^T/‖z‖² ) )
   (a)= sqrt( 2^m ‖z‖^{2m} ( 1 + z^T z/‖z‖² ) )
   = sqrt( 2^{m+1} ‖z‖^{2m} )
   = 2^{(m+1)/2} ‖z‖^m   (26)

where (a) holds due to [21, Example 1.3.24]. Because the mapping φ : z ↦ zz^T is not one-to-one, we cannot directly use Corollary 24. However, along the lines of Remark 22, we obtain

  h^m(X) = −∫_{ℝ^m} f_z(z) log( Σ_{z' ∈ φ^{-1}({φ(z)})} f_z(z')/Jφ(z') ) dL^m(z).   (27)

Because the z' ∈ φ^{-1}({φ(z)}) are given by ±z, and because f_z(z) + f_z(−z) = 2 f_z̄(z) and Jφ(z) = Jφ(−z) (see (26)), eq. (27) implies

  h^m(X) = −∫_{ℝ^m} f_z(z) log( 2 f_z̄(z)/Jφ(z) ) dL^m(z)
   = −∫_{ℝ^m} f_z(z) ( log 2 + log f_z̄(z) − log Jφ(z) ) dL^m(z)
   = −log 2 − ∫_{ℝ^m} f_z(z) log f_z̄(z) dL^m(z) + E_z[log Jφ(z)]
   (a)= −log 2 − (1/2) ∫_{ℝ^m} f_z(z) log f_z̄(z) dL^m(z) − (1/2) ∫_{ℝ^m} f_z(−z) log f_z̄(z) dL^m(z) + E_z[log Jφ(z)]
   = −log 2 − ∫_{ℝ^m} f_z̄(z) log f_z̄(z) dL^m(z) + E_z[log Jφ(z)]
   = −log 2 + h(z̄) + E_z[log Jφ(z)]   (28)

where (a) holds because f_z̄(−z) = f_z̄(z). Inserting (26) into (28) gives (25).

A practically interesting special case of symmetric random matrices is constituted by the class of Wishart matrices [22]. A rank-n Wishart matrix is given by W_{n,Σ} := Σ_{i=1}^n z_i z_i^T ∈ ℝ^{m×m}, where the z_i, i ∈ {1,...,n} are independent and identically distributed (i.i.d.) Gaussian random variables on ℝ^m with mean 0 and some nonsingular covariance matrix Σ. The differential entropy of a full-rank Wishart matrix (i.e., n ≥ m), considered as a random variable in the m(m+1)/2-dimensional space of symmetric matrices, is given by [23, eq. (B.82)]

  h(W_{n,Σ}) = log( 2^{mn/2} Γ_m(n/2) (det Σ)^{n/2} ) + mn/2 + ((m − n + 1)/2) E_z[log det(W_{n,Σ})]   (29)

where Γ_m(·) denotes the multivariate gamma function. In our setting, full-rank Wishart matrices can be interpreted as m(m+1)/2-rectifiable random variables in the m²-dimensional space of all matrices by considering the embedding of symmetric m × m matrices into the space of all matrices and using Theorem 14. Using this interpretation, we can use Corollary 24 and obtain h(W_{n,Σ}) = h^{m(m+1)/2}(W_{n,Σ}).

The case of rank-deficient Wishart matrices, i.e., n ∈ {1,...,m−1}, has not been analyzed information-theoretically so far. For simplicity, we will consider the case of rank-one Wishart matrices, i.e., W_{1,Σ} = zz^T ∈ ℝ^{m×m}. The m-dimensional entropy of W_{1,Σ} is given by (25) in Theorem 25. Because z is Gaussian with mean 0, we have z̄ = z in Theorem 25, so that (25) simplifies to

  h^m(W_{1,Σ}) = h(z) + ((m−1)/2) log 2 + (m/2) E_z[log ‖z‖²].

Again using the Gaussianity of z, we obtain further

  h^m(W_{1,Σ}) = log( (2πe)^{m/2} (det Σ)^{1/2} ) + ((m−1)/2) log 2 + (m/2) E_z[log ‖z‖²]
   = log( 2^{m−1/2} π^{m/2} (det Σ)^{1/2} ) + m/2 + (m/2) E_z[log ‖z‖²].   (30)

If z contains independent standard normal entries, then ‖z‖² is χ²_m distributed and E_z[log ‖z‖²] = ψ(m/2) + log 2, where ψ(·) denotes the digamma function [23, eq. (B.81)]. It is interesting to compare (30) with the differential entropy of the full-rank Wishart matrix as given by (29). Although there is a formal similarity, we emphasize that the differential entropy in (29) cannot be trivially extended to the setting n < m.

V. JOINT ENTROPY

For an m-rectifiable random variable (x, y) on ℝ^{M_1+M_2}, we define the joint entropy of x and y as the m-dimensional entropy of the pair, i.e.,

  h^m(x, y) := −E_{(x,y)}[log θ^m_{(x,y)}(x, y)].   (31)

Two questions arise naturally:
• Suppose we have an m_1-rectifiable random variable x and an m_2-rectifiable random variable y on the same probability space. Which additional assumptions ensure that (x, y) is (m_1 + m_2)-rectifiable?
• Conversely, suppose we have an m-rectifiable random variable (x, y). Which additional assumptions ensure that x and y are rectifiable?
In what follows, we will provide answers to these questions under appropriate conditions on the involved random variables.

One important shortcoming of Hausdorff measures (in contrast to, e.g., the Lebesgue measure) is that the product of two Hausdorff measures is in general not again a Hausdorff measure. However, our definition of the support of a rectifiable measure in Definition 8 guarantees that the product of two Hausdorff measures restricted to the respective supports is again a Hausdorff measure.
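The closed-form Jacobian determinant (26), Jφ(z) = 2^{(m+1)/2} ‖z‖^m for φ : z ↦ zz^T, can be checked numerically against the Jacobian matrix construction (18) (a verification sketch in pure Python; the helper names are ours): build D(z) from ∂(z_i z_k)/∂z_j = δ_{ij} z_k + δ_{kj} z_i and compare sqrt(det(D^T(z) D(z))) with the closed form.

```python
import math

def jacobian_matrix(z):
    # one row per entry of vec(zz^T) (columns stacked), one column per z_j:
    # d(z_i z_k)/dz_j = delta_{ij} z_k + delta_{kj} z_i   -- cf. eq. (18)
    m = len(z)
    D = []
    for k in range(m):          # column index of zz^T
        for i in range(m):      # row index of zz^T
            D.append([(z[k] if i == j else 0.0) + (z[i] if k == j else 0.0)
                      for j in range(m)])
    return D

def det(A):
    # Laplace expansion (fine for the small matrices used here)
    n = len(A)
    if n == 1:
        return A[0][0]
    return sum((-1) ** j * A[0][j] *
               det([row[:j] + row[j + 1:] for row in A[1:]]) for j in range(n))

def gram_det_sqrt(z):
    D = jacobian_matrix(z)
    m = len(z)
    G = [[sum(D[r][a] * D[r][b] for r in range(len(D))) for b in range(m)]
         for a in range(m)]
    return math.sqrt(det(G))      # J_phi(z) = sqrt(det(D^T D))

z = [1.0, 2.0, -1.0]
norm = math.sqrt(sum(t * t for t in z))
closed_form = 2 ** ((len(z) + 1) / 2) * norm ** len(z)   # eq. (26)
assert abs(gram_det_sqrt(z) - closed_form) < 1e-9
```

The Gram matrix D^T D equals 2‖z‖² I_m + 2zz^T, as used in the proof of Theorem 25, so the assertion checks both intermediate steps of (26) at once.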
Then 1 2 is (m1+m2)- similarity, we emphasize that the differential entropy in (29) rectifiable and E E ×E cannot be trivially extended to the setting n Corollary 29: Let x1:n , (x1,..., xn) be a finite se- The reason for this seemingly unintuitive behavior of Mi quence of independent random variables, where xi R , our entropy are the geometric properties of the projection ∈ M1+M2 M2 i 1,...,n is mi-rectifiable with support i and mi- py : R R , py(x, y) = y, i.e., the projection of ∈ { } mi E RM1+M2 → dimensional Hausdorff density θx . Then x1:n is an m- to the last M2 components. Although py is linear and i n RM RM1+M2 rectifiable random variable on , where m = i=1 mi has a Jacobian determinant Jpy of 1 everywhere on , n , and M = i=1 Mi, and the set 1 n is m- things get more involved once we consider py as a mapping −1 E E m×···×EP rectifiable and satisfies µ(x1:n) H E . Moreover, the between rectifiable sets and want to calculate the Jacobian P ≪ | E m-dimensional Hausdorff density of x1:n is given by determinant Jpy of the tangential differential of py which M1+M2 n maps an m-rectifiable set R to an m2-rectifiable m mi RM2 E ⊆ E θ (x1:n)= θ (xi) . set 2 [18, Sec. 3.2.16]. In this setting, J is not x1:n xi E ⊆ py i=1 necessarily constant and may also become zero. Thus, the Y mi marginalization of an m-dimensional Hausdorff density is not Finally, if h (xi) is finite for i 1,...,n , then ∈{ } as easy as the marginalization of a probability density function. n The following theorem shows how to marginalize Hausdorff hm(x )= hmi (x ) . (35) 1:n i densities and describes the implications for m-dimensional i=1 X entropy. Proof: The corollary follows by inductively applying Theorem 31: Let (x, y) RM1+M2 be an m-rectifiable Theorem 28 to the two random variables (x1,..., xi−1) and ∈ x . random variable (m M1 + M2) with m-dimensional Haus- i m ≤ , dorff density θ(x,y) and support . Furthermore, let 2 M E E B. 
Dependent Random Variables py( ) R 2 be m -rectifiable (m m, m M ), E ⊆ 2 2 ≤ 2 ≤ 2 H m2 ( ) < , and J E > 0 H m -almost everywhere.e The case of dependent random variables is more involved. E2 ∞ py |E The rectifiability of x and y does not necessarily imply the Then the following properties hold: rectifiability of (x, y) (which is expected, since the marginal 1) Thee random variable y is m2-rectifiable. distributions carry only a small part of the information carried 2) There exists a support 2 2 of y. E ⊆ E by the joint distribution). In general, even for continuous 3) The m2-dimensional Hausdorff density of y is given by random variables x and y, we cannot calculate the joint m e θ(x,y)(x, y) m2 y H m−m2 x (39) differential entropy h(x, y) from the mere knowledge of the θy ( )= E d ( ) (y) J (x, y) differential entropies h(x) and h(y). However, it is always ZE py possible to bound the differential entropy according to [13, H m2 -almost everywhere, where (y) , x RM1 : eq. (8.63)] (x, y) . E { ∈ h(x, y) h(x)+ h(y) . (36) ∈ E} y ≤ 4) An expression of the m2-dimensional entropy of is In general, no bound resembling (36) holds for our entropy given by definition. The following simple setting provides a counterex- hm2 (y)= θm (x, y) log θm2 (y) dH m(x, y) ample. − (x,y) y ZE Example 30: We continue our example of a random variable (40) on the unit circle (see Section IV-D) for the special case of a provided the integral on the right-hand side exists and is uniform distribution of z on [0, 2π). From (24), we obtain finite. Under the assumptions that 1 , px( ) is m1-rectifiable 1 E E h (x)= h(z) = log(2π) . (37) (m m, m M ), H m1 ( ) < , and J E > 0 H m - 1 ≤ 1 ≤ 1 E1 ∞ px |E We can now analyze the components11 x and y of the random almost everywhere, analogouse results hold for x. variable x = (x y)T = (cos z sin z)T. One can easily see that Proof: See Appendix E. 
e x is a continuous random variable and its probability density We will illustrate the main findings of Theorem 31 in the 2 function is given by fx(x)=1/(π√1 x ). By symmetry, the setting of Example 30. − 2 same holds for y, i.e., fy(y)=1/(π 1 y ). Basic calculus Example 32: As in Example 30, we consider (x, y) − then yields for the differential entropy of x and y R2 ∈ p uniformly distributed on the unit circle 1. By (23), 1 H 1 S π θ(x,y)(x, y) = 1/(2π) -almost everywhere on 1. In h(x)= h(y) = log . (38) 1 S 2 Example 30, we already obtained h (y) = log(π/2) (there, we used the fact that y is a continuous random variable and that, Since x and y are continuous random variables, it follows from by Theorem 23, h1(y) = h(y)). Let us now calculate h1(y) Theorem 23 that h1(x)= h(x) and h1(y)= h(y). Thus, using Theorem 31. Note first that py( 1) = [ 1, 1], which H 1 S − 1 1 π is 1-rectifiable and satisfies ([ 1, 1]) = 2 < . Next, h (x)+ h (y) = 2 log < log(2π) . − S1 ∞ 2 we calculate the Jacobian determinant Jpy (x, y). Consider an arbitrary point on the unit circle, which can always be Comparing with (37), we see that h1 x y h1 x h1 y . ( , ) > ( )+ ( ) expressed as 1 y2, y with y [0, 1]. At that point, ± − ± ∈ 11To conform with the notation (x, y) used in our treatment of joint entropy, the projection py restricted to the tangent space of 1 can be T T p S 2 we change the component notation from (x1 x2) to (x y) . shown to amount to a multiplication by the factor 1 y . − p 12 S1 2 2 Thus, Jpy 1 y , y = 1 y . Hence, we obtain Theorem 31, x is m1-rectifiable and y is m2-rectifiable. Thus, from (40) ± − ± − x and y are product-compatible. p p The setting of product-compatible random variables will be h1 y ( ) especially important for our discussion of mutual information 1 in Section VII. However, already for joint entropy, we obtain = θ(x,y)(x, y) − S1 some useful results. 
Z 1 θ(x,y)(˜x, y) log dH 1−1(˜x) dH 1(x, y) Theorem 34: Let x be an m1-rectifiable random variable on (y) S1 M × S Jp (˜x,y) R 1 with support , and let y be an -rectifiable random Z 1 y 1 m2 M2 E 1 variable on R with support 2. Furthermore, let x and y 1 2π 0 1 H H E m1+m2 = log d (˜x) d (x, y) be product-compatible. Denote by θ the (m1 + m2)- − S 2π S(y) 1 y2 (x,y) Z 1 Z 1 −1 dimensional Hausdorff density of (x, y) and by 1 2 (a) 1 E ⊆E ×E = log p 2π dH 1(x, y) a support of (x, y). Then the following properties hold: 2 −2π S1 1 y Z x˜∈S(y) X1 − 1) The m2-dimensional Hausdorff density of y is given by 1 (b) 1 p = log 2 2π dH 1(x, y) 2 m2 m1+m2 m1 −2π S 1 y y x y H x Z 1 θy ( )= θ(x,y) ( , ) d ( ) − E1 1 2π p 1 Z = log dφ −2π π cos(φ) H m2 -almost everywhere. Z0 | | π 2) An expression of the m2-dimensional entropy of y is = log (41) 2 given by where holds because H 0 is the counting measure and (a) m2 m1+m2 (y) h (y)= θ(x,y) (x, y) (b) holds because = x R : (x, y) 1 = − S1 { ∈ ∈ S } ZE 1 y2, 1 y2 contains two points for all y m2 H m1+m2 log θy (y) d (x, y) ( 1, 1)−. Note− that− our above result for h1(y) coincides with∈ × −p p the result previously obtained in Example 30. provided the integral on the right-hand side exists and is finite. C. Product-Compatible Random Variables m1 Due to symmetry, analogous properties hold for θx and There are special settings in which m-dimensional entropy hm1 (x). more closely matches the behavior we know from (differential) entropy. In these cases, the three random variables x, y, and Proof: The proof follows along the lines of the proof (x, y) are rectifiable with “matching” dimensions, and we will of Theorem 31 in Appendix E. However, due to the product- see that an inequality similar to (36) holds. compatibility of x and y, one can use Fubini’s theorem in place of (110). 
Definition 33: Let x be an m1-rectifiable random variable M1 For product-compatible random variables, also the inequal- on R with support 1, and let y be an m2-rectifiable random E ity hm1+m2 (x, y) hm1 (x) + hm2 (y) holds. However, the variable on RM2 with support . The random variables x and 2 proof of this inequality≤ will be much easier once we considered y are called product-compatibleE if (x, y) is an (m + m )- 1 2 the mutual information between rectifiable random variables. rectifiable random variable on RM1+M2 . Thus, we postpone a formal presentation of the inequality to It is easy to see that for product-compatible random vari- Corollary 47 in Section VII. −1 H m1+m2 ables x and y, µ(x, y) E1×E2 . Thus, by Property 4 in Corollary 12, there≪ exists a support| . E⊆E1 ×E2 The most important part of Definition 33 is that the di- VI. CONDITIONAL ENTROPY mensions of x and y add up to the joint dimension of (x, y). Note that this was not the case in Example 32, where x In contrast to joint entropy, conditional entropy is a nontriv- and y “shared” the dimension m = 1 of (x, y). A simple ial extension of entropy. We would like to define the entropy example of product-compatible random variables is the case for a random variable x on RM1 under the condition that a of an m1-rectifiable random variable x and an independent m2- dependent random variable y on RM2 is known. For discrete rectifiable random variable y. Indeed, by Theorem 28, (x, y) and—under appropriate assumptions—for continuous random is (m1 + m2)-rectifiable. variables, the distribution of (x y = y) is well defined and Another example of product-compatible random variables so is the associated entropy H| (x y = y) or differential | can be deduced from Theorem 31. Let (x, y) be (m1 + m2)- entropy h(x y = y). Averaging over all y then results in , RM2 | rectifiable. 
Assume that 2 py( ) is m2-rectifiable, the well-known definitions of conditional entropy H(x y), H m2 E E H mE ⊆ | ( 2) < , and Jpy > 0 E -almost everywhere. Fur- involving only the probability mass functions and , E ∞ | p(x,y) py M1 thermore, assume that 1e , px( ) R is m1-rectifiable, or of conditional differential entropy h(x y), involving only m EE Em ⊆ | H 1 ( e ) < , and J > 0 H -almost everywhere. By the probability density functions f and fy. Indeed, if x and E1 ∞ px |E (x,y) e e 13 H m2 E H m y are discrete random variables, we have ( 2) < , and Jpy > 0 E -almost everywhere. Then theE following∞ properties hold: | H(x y)= py(yj ) H(x y = yj) 1) Thee measure Pr x y = y is (m m2)-rectifiable | N | j∈ H m2 { ∈ · | }RM2 − X for E2 -almost every y , where 2 2 is a p(x,y)(xi, yj ) support12 |of y. ∈ E ⊆ E = p(x,y)(xi, yj ) log − py(yj ) 2) The (m m )-dimensional Hausdorff density of the i,j∈N 2 e X measure − x y y is given by p (x, y) Pr = E (x,y) { ∈ · | } = (x,y) log (42) θm (x, y) y m−m (x,y) − py( ) θ 2 (x)= (45) Pr{x∈·| y=y} E m2 J (x, y) θy (y) and, if x and y are continuous random variables, we have py H m−m2 H m2 E(y) -almost everywhere, for E2 -almost | M (y) | M h(x y)= fy(y) h(x y = y) dy every y R 2 . Here, as before, , x R 1 : | RM2 | ∈ E { ∈ Z (x, y) . f(x,y)(x, y) ∈ E} = f(x,y)(x, y) log d(x, y) Proof: See Appendix F. − RM1+M2 f (y) Z y As for joint entropy, the case of product-compatible random f(x,y)(x, y) variables (see Definition 33) is of special interest and results = E log . (43) − (x,y) f (y) in a more intuitive characterization of the Hausdorff density y of Pr x y = y . A straightforward generalization to rectifiable measures would { ∈ · | } be to mimic the right-hand sides of (42) and (43) using Theorem 36: Let x be an m1-rectifiable random variable on RM1 with support , and let y be an m -rectifiable random Hausdorff densities. However, it will turn out that this naive E1 2 variable on RM2 with support . 
Furthermore, let x and y be approach is only partly correct: due to the geometric subtleties E2 of the projection discussed in Section V-B, we may have to product-compatible. Then the following properties hold: 1) The measure Pr x y = y is m -rectifiable for include a correction term that reflects the geometry of the { ∈ · | } 1 H m2 -almost every y RM2 . conditioning process. |E2 ∈ 2) The m1-dimensional Hausdorff density of Pr x y = y is given by { ∈ · | A. Conditional Probability } m1+m2 x y For general random variables x and y, we recall the concept θ(x,y) ( , ) m1 x (46) θPr{x∈·| y=y}( )= m2 of conditional probabilities, which can be summarized as θy (y) follows (a detailed account can be found in [24, Ch. 5]): For H m1 H m2 E1 -almost everywhere, for E2 -almost every RM1+M2 | | a pair of random variables (x, y) on , there exists a y RM2 . regular conditional probability Pr x y = y , i.e., for ∈ { ∈ A | } Proof: The proof follows along the lines of the proof each measurable set RM1 , the function y Pr x A ⊆ 7→ { ∈ of Theorem 35 in Appendix F. However, due to the product- y = y is measurable and Pr x y = y defines compatibility of x and y, one can use Fubini’s theorem in place A | } { R∈M2 · | } a probability measure for each y . Furthermore, the of (110). regular conditional probability Pr x∈ y = y satisfies { ∈ A | } Note that Theorems 35 and 36 hold for any version of the x y y −1 regular conditional probability Pr = . However, Pr (x, y) 1 2 = Pr x 1 y = y dµy (y) . { ∈H A |m2 } { ∈ A ×A } { ∈ A | } for different versions, the statement “for E2 -almost every A2 | Z (44) y RM2 ” may refer to different sets of H m2 -measure ∈ |E2 The regular conditional probability Pr x y = y zero; e.g., (45) may hold for different y RM2 . Thus, ∈ involved in (44) is not unique. 
Nevertheless,{ ∈ we A can | still us}e results that are independent of the version of the regular (44) in a definition of conditional entropy because any version conditional probability can only be obtained if we can avoid of the regular conditional probability satisfies (44). For the these “almost everywhere”-statements. To this end, we will remainder of this section, we consider a fixed version of the define conditional entropy as an expectation over y. regular conditional probability Pr x y = y . Definition 37: Let (x, y) be an m-rectifiable random variable { ∈ A | } M1+M2 on R such that y is m2-rectifiable with m2-dimensional m2 B. Definition of Conditional Entropy Hausdorff density θy and support 2. The conditional entropy 13 E In order to define a conditional entropy hm−m2 (x y), we of x given y is defined as | first show that Pr x y = y is a rectifiable measure. hm−m2 (x y) The next theorem{ establishes∈ · | sufficient} conditions such that | , m2 m−m2 Pr x y = y is rectifiable for almost every y. As before, θy (y) θPr{x∈·| y=y}(x) (y) { ∈ · | } M +M M M +M − E2 E we denote by py : R 1 2 R 2 the projection of R 1 2 Z Z → log θm−m2 (x) dH m−m2 (x) dH m2 (y) (47) to the last M2 components, i.e., py(x, y)= y. × Pr{x∈·| y=y} 12 Theorem 35: Let (x, y) be an m-rectifiable random variable By Theorem 31, the random variable y is m2-rectifiable with Hausdorff m RM1+M2 m density θ 2 (given by (39)) and some support E ⊆ E . on with m-dimensional Hausdorff density θ(x,y) y 2 2 13The inner integral in (47) can be intuitively interpretede as an entropy , RM2 and support . Furthermore, let py( ) be m−m2 E E2 E ⊆ h (x | y = y). However, such an entropy is not well defined in general m -rectifiable (m m, m M , m m M ), and depends on the choice of the conditional probability. 
Remark 38: For independent random variables x and y, inserting (34) into (46) implies that θ^{m1}_{Pr{x∈·| y=y}}(x) = θ^{m1}_x(x). Thus, (47) reduces to h^{m1}(x | y) = h^{m1}(x).

The following theorem gives a characterization of conditional entropy and sufficient conditions for (47) to be well-defined in the sense that the right-hand side of (47) coincides for all versions of the regular conditional probability Pr{x ∈ A | y = y}.

Theorem 39: Let (x, y) be an m-rectifiable random variable on R^{M1+M2} with m-dimensional Hausdorff density θ^m_{(x,y)} and support E. Furthermore, let E2 ≜ py(E) be m2-rectifiable, H^{m2}(E2) < ∞, and J^E_{py} > 0 H^m|_E-almost everywhere. Then

    h^{m−m2}(x | y) = − E_{(x,y)}[ log( θ^m_{(x,y)}(x, y) / θ^{m2}_y(y) ) ] + E_{(x,y)}[ log J^E_{py}(x, y) ]   (48)

provided the right-hand side of (48) exists and is finite.

Proof: See Appendix G.
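The Jacobian J^E_{py} appearing in (48) has a concrete geometric meaning: it measures how the projection py contracts arc length along the support. A minimal finite-difference check (an illustration, not part of the paper) for the unit circle, where Example 32 gives J^{S1}_{py}(cos φ, sin φ) = |cos φ|:

```python
import math

# Parametrize S^1 by arc length phi -> (cos phi, sin phi). The projection
# p_y(x, y) = y maps phi -> sin phi, so the tangential Jacobian at angle phi
# is |d(sin phi)/d(arc length)| = |cos phi|. We verify this by comparing the
# finite-difference ratio |delta y| / |delta arc| against |cos phi|.
eps = 1e-6
max_err = 0.0
for j in range(1, 200):
    phi = 2.0 * math.pi * j / 200.0 + 0.123   # sample points on the circle
    dy = math.sin(phi + eps) - math.sin(phi)  # increment of the projection
    jac_fd = abs(dy) / eps                    # arc-length increment is eps
    max_err = max(max_err, abs(jac_fd - abs(math.cos(phi))))

print(max_err)   # maximum deviation is on the order of eps
```

The same computation on a product support E = E1 × E2 would give a Jacobian identically equal to 1, which is why the correction term vanishes for product-compatible random variables.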
Note the difference between (48) and the expressions (42) and (43) of H(x | y) and h(x | y), respectively: in the case of rectifiable random variables, we generally have to include the geometric correction term E_{(x,y)}[log J^E_{py}(x, y)]. However, we will show next that, in the special case of product-compatible rectifiable random variables, this correction term does not appear.

Theorem 40: Let the m1-rectifiable random variable x on R^{M1} and the m2-rectifiable random variable y on R^{M2} be product-compatible. Then

    h^{m1}(x | y) = − E_{(x,y)}[ log( θ^{m1+m2}_{(x,y)}(x, y) / θ^{m2}_y(y) ) ]   (49)

provided the right-hand side of (49) exists and is finite.

Proof: The proof follows along the lines of the proof of Theorem 39 in Appendix G. However, due to the product-compatibility of x and y, one can use Fubini's theorem in place of (110).

C. Chain Rule for Rectifiable Random Variables

As in the case of entropy and differential entropy, we can give a chain rule for m-dimensional entropy.
Theorem 41: Let (x, y) be an m-rectifiable random variable on R^{M1+M2} with m-dimensional Hausdorff density θ^m_{(x,y)} and support E. Furthermore, let E2 ≜ py(E) be m2-rectifiable, H^{m2}(E2) < ∞, and J^E_{py} > 0 H^m|_E-almost everywhere. Then

    h^m(x, y) = h^{m2}(y) + h^{m−m2}(x | y) − E_{(x,y)}[ log J^E_{py}(x, y) ]   (50)

provided the corresponding integrals exist and are finite.

Proof: By the definition of h^m(x, y) in (31) and the definition of h^{m2}(y) in (19), we have

    h^m(x, y) − h^{m2}(y) + E_{(x,y)}[ log J^E_{py}(x, y) ]
        = − E_{(x,y)}[ log θ^m_{(x,y)}(x, y) ] + E_y[ log θ^{m2}_y(y) ] + E_{(x,y)}[ log J^E_{py}(x, y) ]
        = − E_{(x,y)}[ log( θ^m_{(x,y)}(x, y) / θ^{m2}_y(y) ) ] + E_{(x,y)}[ log J^E_{py}(x, y) ] .   (51)

Because we assumed in the theorem that the integrals corresponding to the terms on the left-hand side of (51) are finite, the right-hand side of (51) is also finite. By (48), the right-hand side of (51) equals h^{m−m2}(x | y). Thus, (50) holds.

Next, we continue Examples 30 and 32 from Section V-B. We will see that the geometric correction term in the chain rule, E_{(x,y)}[log J^E_{py}(x, y)], is indeed necessary.

Example 42: As in Examples 30 and 32, we consider (x, y) ∈ R^2 uniformly distributed on the unit circle S^1, i.e., θ^1_{(x,y)}(x, y) = 1/(2π) H^1-almost everywhere on S^1. According to (41),

    h^1(y) = log(π/2)   (52)

and according to (37),

    h^1(x, y) = log(2π) .   (53)

To calculate the conditional entropy h^0(x | y) (note that m − m2 = 1 − 1 = 0), we consider the regular conditional probability Pr{x ∈ A | y = y}. It is easy to see that one possible version of Pr{x ∈ A | y = y} is the following: for y ∈ (−1, 1), Pr{x = x | y = y} = 1/2 for x = ±√(1 − y^2), and Pr{x ∈ A | y = y} = 0 if A contains neither of the points ±√(1 − y^2). The probabilities for |y| ≥ 1 are irrelevant because Pr{y ∉ (−1, 1)} = 0. Hence, by (47), we obtain

    h^0(x | y) = − ∫_{(−1,1)} θ^1_y(y) ∫_{{±√(1−y^2)}} (1/2) log(1/2) dH^0(x) dH^1(y)
               = − ∫_{(−1,1)} θ^1_y(y) log(1/2) dH^1(y)
               = log 2 .   (54)

This differs from h^1(x, y) − h^1(y) = log(2π) − log(π/2), and therefore a chain rule without a correction term cannot hold. To calculate the correction term, which according to (50) is given by E_{(x,y)}[log J^{S1}_{py}(x, y)], we recall from Example 32 that J^{S1}_{py}(±√(1 − y^2), y) = √(1 − y^2) or, more conveniently, J^{S1}_{py}(cos φ, sin φ) = |cos φ|. Thus, we obtain

    E_{(x,y)}[ log J^{S1}_{py}(x, y) ] = (1/(2π)) ∫_{S1} log J^{S1}_{py}(x, y) dH^1(x, y)
                                      = (1/(2π)) ∫_0^{2π} log|cos φ| dφ
                                      = − log 2 .   (55)

We finally verify that (55) is consistent with the chain rule (50). Starting from (53), we obtain

    h^1(x, y) = log(2π)
              = log(π/2) + log 2 − (− log 2)
              = h^1(y) + h^0(x | y) − E_{(x,y)}[ log J^{S1}_{py}(x, y) ]

where the final expansion is obtained by using (52), (54), and (55).
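The closed-form values (52)-(55) and the chain rule (50) for the unit-circle example can be reproduced by numerical quadrature; the short script below (a sanity check under the stated densities, not part of the paper) uses the marginal Hausdorff density θ^1_y(y) = 1/(π√(1 − y²)) on (−1, 1).

```python
import math

def midpoint(f, a, b, k=200_000):
    # Composite midpoint rule; midpoints avoid the integrable endpoint
    # singularities of the integrands below.
    h = (b - a) / k
    return sum(f(a + (j + 0.5) * h) for j in range(k)) * h

theta_y = lambda y: 1.0 / (math.pi * math.sqrt(1.0 - y * y))

# (52): h^1(y) = -int theta_y log theta_y = log(pi/2)
h1_y = -midpoint(lambda y: theta_y(y) * math.log(theta_y(y)), -1.0, 1.0)

# (54): h^0(x|y) = -int theta_y(y) log(1/2) dy = log 2  (two atoms of mass 1/2)
h0_xy = -midpoint(lambda y: theta_y(y) * math.log(0.5), -1.0, 1.0)

# (55): correction term (1/2pi) int_0^{2pi} log|cos phi| dphi = -log 2
corr = midpoint(lambda p: math.log(abs(math.cos(p))) / (2.0 * math.pi),
                0.0, 2.0 * math.pi)

# (50): h^1(x,y) = h^1(y) + h^0(x|y) - correction = log(2 pi)
print(h1_y, h0_xy, corr, h1_y + h0_xy - corr, math.log(2.0 * math.pi))
```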
Example 42 also provides a counterexample to the rule "conditioning does not increase entropy," which holds for the entropy of discrete random variables and the differential entropy of continuous random variables. Indeed, comparing (38) and (54), we see that for the components of a uniform distribution on the unit circle, we have h^1(x) < h^0(x | y). However, as we will see in Corollary 47 in Section VII, this is only due to a "reduction of dimensions": if x and y are product-compatible, which implies that h^{m1}(x) and h^{m−m2}(x | y) are of the same dimension m1 = m − m2, conditioning will indeed not increase entropy, i.e., h^{m1}(x | y) ≤ h^{m1}(x). Also, the chain rule (50) reduces to its traditional form, as stated next.

Theorem 43: Let the m1-rectifiable random variable x on R^{M1} and the m2-rectifiable random variable y on R^{M2} be product-compatible. Then

    h^{m1+m2}(x, y) = h^{m2}(y) + h^{m1}(x | y)   (56)

provided the entropies h^{m1+m2}(x, y) and h^{m2}(y) exist and are finite.

Proof: By the definition of h^{m1+m2}(x, y) in (31) and the definition of h^{m2}(y) in (19), we have

    h^{m1+m2}(x, y) − h^{m2}(y) = − E_{(x,y)}[ log θ^{m1+m2}_{(x,y)}(x, y) ] + E_y[ log θ^{m2}_y(y) ]
                                = − E_{(x,y)}[ log( θ^{m1+m2}_{(x,y)}(x, y) / θ^{m2}_y(y) ) ] .   (57)

By (49), the right-hand side of (57) equals h^{m1}(x | y). Thus, (56) holds.

Using an induction argument, we can extend the chain rule (56) to a sequence of random variables.

Corollary 44: Let x_{1:n} ≜ (x_1, ..., x_n) be a sequence of random variables where each x_i ∈ R^{M_i} is m_i-rectifiable. Assume that x_{1:i−1} and x_i are product-compatible for i ∈ {2, ..., n}. Then

    h^m(x_{1:n}) = h^{m1}(x_1) + Σ_{i=2}^n h^{m_i}(x_i | x_{1:i−1})   (58)

with m = Σ_{i=1}^n m_i, provided the corresponding integrals exist and are finite.

We note that, consistently with Remark 38, (35) is a special case of (58).
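For jointly continuous random variables (which are product-compatible with m1 = m2 = 1 and J ≡ 1), (56) is the classical chain rule of differential entropy. A Monte Carlo illustration for a bivariate Gaussian pair (a standard example chosen for its closed-form entropies; it is not taken from the paper) also exhibits h(x | y) ≤ h(x), in line with the discussion above:

```python
import math
import random

random.seed(1)
rho = 0.8          # correlation of the bivariate standard Gaussian pair
n = 200_000

def joint_density(x, y):
    det = 1.0 - rho * rho
    q = (x * x - 2.0 * rho * x * y + y * y) / det
    return math.exp(-0.5 * q) / (2.0 * math.pi * math.sqrt(det))

def marginal(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

# Monte Carlo estimates of h(x,y), h(y), and h(x|y) = -E[log(f(x,y)/f_y(y))].
h_joint = h_y = h_cond = 0.0
for _ in range(n):
    y = random.gauss(0.0, 1.0)
    x = rho * y + math.sqrt(1.0 - rho * rho) * random.gauss(0.0, 1.0)
    h_joint -= math.log(joint_density(x, y)) / n
    h_y -= math.log(marginal(y)) / n
    h_cond -= math.log(joint_density(x, y) / marginal(y)) / n

# Chain rule (56) and the closed form h(x|y) = 0.5*log(2*pi*e*(1-rho^2)).
print(h_joint, h_y + h_cond, h_cond)
```

Here h(x | y) ≈ 0.5 log(2πe(1 − ρ²)) < 0.5 log(2πe) = h(x), so conditioning does not increase entropy once both entropies have the same dimension.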
VII. MUTUAL INFORMATION

The basic definition of mutual information is for discrete random variables x and y with probability mass functions p_x(x_i) and p_y(y_j) and joint probability mass function p_{(x,y)}(x_i, y_j). The mutual information between x and y is given by [13, eq. (2.28)]

    I(x; y) ≜ Σ_{i,j} p_{(x,y)}(x_i, y_j) log( p_{(x,y)}(x_i, y_j) / (p_x(x_i) p_y(y_j)) ) .   (59)

However, mutual information is also defined between arbitrary random variables x and y on a common probability space. This definition is based on (59) and quantizations [x]_Q and [y]_R [13, eq. (8.54)]. We recall from Section II-A that for a measurable, finite partition Q = {A_1, ..., A_N} of R^{M1} (i.e., R^{M1} = ⋃_{i=1}^N A_i with the A_i mutually disjoint and measurable), the quantization [x]_Q ∈ {1, ..., N} is defined as the discrete random variable with probability mass function p_{[x]_Q}(i) = Pr{[x]_Q = i} = Pr{x ∈ A_i} for i ∈ {1, ..., N}.

Definition 45 ([13, eq. (8.54)]): Let x: Ω → R^{M1} and y: Ω → R^{M2} be random variables on a common probability space (Ω, S, µ). The mutual information between x and y is defined as

    I(x; y) ≜ sup_{Q,R} I([x]_Q; [y]_R)

where the supremum is taken over all measurable, finite partitions Q of R^{M1} and R of R^{M2}.

The Gelfand-Yaglom-Perez theorem [25, Lem. 5.2.3] provides an expression of mutual information in terms of Radon-Nikodym derivatives: for random variables x: Ω → R^{M1} and y: Ω → R^{M2} on a common probability space (Ω, S, µ),

    I(x; y) = ∫_{R^{M1+M2}} log( ( dµ(x, y)^{-1} / d(µx^{-1} × µy^{-1}) )(x, y) ) dµ(x, y)^{-1}(x, y)   (60)

if µ(x, y)^{-1} ≪ µx^{-1} × µy^{-1}, and

    I(x; y) = ∞   (61)

if µ(x, y)^{-1} is not absolutely continuous with respect to µx^{-1} × µy^{-1}.

For the special cases of discrete and continuous random variables, there exist expressions of mutual information in terms of entropy and differential entropy, respectively. We will extend these expressions to the case of rectifiable random variables. The resulting generalization will involve the entropies h^{m1}(x), h^{m2}(y), and h^m(x, y).

Theorem 46: Let x be an m1-rectifiable random variable with support E1 ⊆ R^{M1}, let y be an m2-rectifiable random variable with support E2 ⊆ R^{M2}, and let (x, y) be m-rectifiable with support E ⊆ E1 × E2. The mutual information I(x; y) satisfies:
1) If x and y are product-compatible (i.e., m = m1 + m2), then

    I(x; y) = ∫_E θ^m_{(x,y)}(x, y) log( θ^m_{(x,y)}(x, y) / (θ^{m1}_x(x) θ^{m2}_y(y)) ) dH^m(x, y) .   (62)

Furthermore,

    I(x; y) = h^{m1}(x) + h^{m2}(y) − h^m(x, y)   (63)

and

    I(x; y) = h^{m1}(x) − h^{m1}(x | y) = h^{m2}(y) − h^{m2}(y | x)   (64)

provided the entropies h^{m1}(x), h^{m2}(y), and h^m(x, y) exist and are finite.
2) If m < m1 + m2, then I(x; y) = ∞.
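Part 2) of the theorem above can be made tangible through Definition 45: when the pair is not product-compatible, the quantized mutual information I([x]_Q; [y]_R) grows without bound as the partitions are refined. The script below (an illustration under the unit-circle setting, not part of the paper) quantizes both components of the uniform circle distribution, for which m = 1 < m1 + m2 = 2, into k uniform bins on [−1, 1]:

```python
import math
from collections import Counter

def quantized_mi(k, n_phi=200_000):
    # Quantized mutual information I([x]_Q; [y]_R) for (x, y) uniform on S^1,
    # with Q and R both uniform partitions of [-1, 1] into k bins. The joint
    # pmf is computed from a fine deterministic discretization of the angle.
    joint = Counter()
    for j in range(n_phi):
        phi = (j + 0.5) * 2.0 * math.pi / n_phi
        ix = min(int((math.cos(phi) + 1.0) / 2.0 * k), k - 1)
        iy = min(int((math.sin(phi) + 1.0) / 2.0 * k), k - 1)
        joint[(ix, iy)] += 1.0 / n_phi
    px, py = Counter(), Counter()
    for (ix, iy), p in joint.items():
        px[ix] += p
        py[iy] += p
    return sum(p * math.log(p / (px[ix] * py[iy]))
               for (ix, iy), p in joint.items())

mis = [quantized_mi(k) for k in (4, 8, 16, 32)]
print(mis)   # increasing, roughly logarithmically in k
```

Each partition refines the previous one, so the sequence is nondecreasing by construction; its unbounded growth reflects I(x; y) = ∞ for this singular pair.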
Lemma 48: Let x_{1:n} = (x_1, ..., x_n) be a sequence of i.i.d. m-rectifiable random variables x_i on R^M, where each x_i has m-dimensional Hausdorff density θ^m_x and m-dimensional entropy h^m(x). The random variable −(1/n) Σ_{i=1}^n log θ^m_x(x_i) converges to h^m(x) in probability, i.e., for any ε > 0,

    lim_{n→∞} Pr{ | −(1/n) Σ_{i=1}^n log θ^m_x(x_i) − h^m(x) | > ε } = 0 .

Proof: By (19), we have h^m(x) = − E_x[log θ^m_x(x)], and by the weak law of large numbers, the sample mean −(1/n) Σ_{i=1}^n log θ^m_x(x_i) converges in probability to the mean − E_x[log θ^m_x(x)] = h^m(x).

In the following, ℓ(s) denotes the length of a binary sequence s ∈ {0, 1}*. The minimal expected binary codeword length L*(x) is defined as the minimum of the expected codeword length L_f(x) over the set of all possible instantaneous codes f. By [13, Th. 5.4.1], L*(x) satisfies

    H(x) ld e ≤ L*(x) < H(x) ld e + 1

where ld denotes the binary logarithm, so that the factor ld e converts the entropy from nats to bits. This accounts for the quantization: if the quantized random variable [x]_Q is with high probability (corresponding to µx^{-1}(A) being large) in a large quantization set A, then this is penalized.

Proof: By Corollary 29, the random variable x_{1:n} is nm-rectifiable with µ(x_{1:n})^{-1} ≪ H^{nm}|_{E^n} and nm-dimensional entropy h^{nm}(x_{1:n}) = n h^m(x). Thus, by Theorem 53, there exists δ̂_ε > 0 such that the following holds: For all δ̂ ∈ (0, δ̂_ε), there exists a partition Q* ∈ P_{nm,δ̂}(E^n) such that

    n h^m(x) ld e − ld δ̂ ≤ L*([x_{1:n}]_{Q*})

where the second maximization is with respect to all functions α_s: R^M → (0, ∞) satisfying

    E_x[ α_s(x) e^{−s d(x,y)} ] ≤ 1   (75)

for each y ∈ R^M.

A. Shannon Lower Bound

The most common form of the traditional Shannon lower bound