
arXiv:1505.03337v3 [cs.IT] 3 Jan 2017

Entropy and Source Coding for Integer-Dimensional Singular Random Variables

Günther Koliander, Georg Pichler, Student Member, IEEE, Erwin Riegler, and Franz Hlawatsch, Fellow, IEEE

Abstract—Entropy and differential entropy are important quantities in information theory. A tractable extension to singular random variables—which are neither discrete nor continuous—has not been available so far. Here, we present such an extension for the practically relevant class of integer-dimensional singular random variables. The proposed entropy definition contains the entropy of discrete random variables and the differential entropy of continuous random variables as special cases. We show that it is invariant under unitary transformations and that it transforms in a natural manner under Lipschitz functions. We define joint and conditional entropy of integer-dimensional singular random variables, and we show that our entropy conveys useful expressions of the mutual information. As first applications of our entropy definition, we present a result on the minimal expected codeword length of quantized integer-dimensional singular sources and a Shannon lower bound for integer-dimensional singular sources.

Index Terms—Information entropy, rate-distortion theory, Shannon lower bound, singular random variables, source coding.

Copyright (c) 2016 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

This paper was presented in part at the IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, July 2014. This work was supported by the WWTF under grants ICT10-066 (NOWIRE) and ICT12-054 (TINCOIN), and by the FWF under grant P27370-N30.

G. Koliander, G. Pichler, and F. Hlawatsch are with the Institute of Telecommunications, TU Wien, 1040 Vienna, Austria (e-mail: {gkoliand, gpichler, fhlawats}@nt.tuwien.ac.at). E. Riegler is with the Department of Information Technology and Electrical Engineering, ETH Zurich, CH-8092 Zurich, Switzerland (e-mail: [email protected]).

I. INTRODUCTION

A. Background and Motivation

Entropy is one of the fundamental concepts in information theory. The classical definition of entropy for discrete random variables and its interpretation as information content go back to Shannon [1] and were analyzed thoroughly from operational [1] and axiomatic [2] viewpoints. A similar definition for continuous random variables, differential entropy, was also introduced by Shannon [1], but its interpretation as information content is controversial [3]. Nonetheless, information-theoretic derivations involving undisputed quantities like Kullback-Leibler divergence or mutual information between continuous random variables can often be simplified by using differential entropy. Furthermore, in rate-distortion theory, a lower bound on the rate-distortion function known as the Shannon lower bound can be calculated using differential entropy [4, Sec. 4.6]. Finally, differential entropy arises in the asymptotic expansion of the entropy of ever finer quantizations of a continuous random variable [3, Sec. IV]. Hence, although the interpretation of differential entropy is disputed, its operational relevance renders it a useful quantity.

The concepts of entropy and differential entropy thus simplify the understanding and information-theoretic treatment of discrete and continuous random variables. However, these two kinds of random variables do not cover all interesting information-theoretic problems. In fact, a number of information-theoretic problems involving singular random variables, which are neither discrete nor continuous, have been described recently:

• For the vector interference channel, a singular input distribution has to be used to fully utilize the available degrees of freedom [5].
• In a probabilistic formulation of analog compression, the underlying source distribution is singular [6].
• In block-fading channel models, two different kinds of singular distributions arise: the optimal input distribution is singular in some settings [7, Ch. 6], and the noiseless output distribution is singular except for special cases [8].

Another field where singular random variables appear is source coding. In many high-dimensional problems, deterministic dependencies reduce the intrinsic dimension of the source. Thus, the random variable describing the source is often not continuous but not discrete either. A basic example is a random variable x = (x1 x2)^T ∈ R^2 supported on the unit circle, i.e., exhibiting the deterministic dependence x1^2 + x2^2 = 1. Although x is defined on R^2 and both components x1, x2 are continuous random variables, x itself is intrinsically only one-dimensional. The differential entropy of this random variable is not defined and, in fact, classical information theory does not provide a rigorous definition of entropy for this random variable. Another, less trivial example of a singular random variable is a rank-one random matrix of the form X = zz^T, where z is a continuous random vector. Thus, a suitable generalization of (differential) entropy to singular random variables has the potential to simplify theoretical work in these areas and to provide valuable insights.

The case of arbitrary probability distributions is very hard to handle, and due to its generality even the mere definition of a meaningful (differential) entropy seems impossible. Two existing approaches to defining entropy for more general distributions are based on quantizations of the random variable in question. Usually, the entropy of these discretizations converges to infinity and, thus, a normalization has to be employed to obtain a useful result. In [9], this approach is adopted for very specific quantizations of a random variable. Unfortunately, this does not always result in a well-defined entropy and sometimes even fails for continuous random variables of finite differential entropy [9, pp. 197f]. Moreover, the quantization process seems difficult to deal with analytically and no theory was built based on this definition of entropy.¹ A similar approach is to consider arbitrary quantizations that are constrained by some measure of fineness to enable a limit operation. In [3] and [10], ε-entropy is introduced as the minimal entropy of all quantizations using sets of diameter less than ε. However, to specify a diameter, a distortion function has to be defined. Since all basic information-theoretic quantities (e.g., mutual information or Kullback-Leibler divergence) do not depend on a specific distortion function, it is hardly possible to embed ε-entropy into a general information-theoretic framework. Furthermore, once again the quantization process seems difficult to handle analytically.

Since the aforementioned approaches do not provide a satisfactory generalization of (differential) entropy, we follow a different approach, which is also motivated by ever finer quantizations of the random variable. However, in our approach, the order of the two steps "taking the limit of quantizations" and "calculating the entropy as the expectation of the logarithm of a probability (mass) function" is changed. More precisely, we first consider the probability mass functions of quantizations and take a normalized limit. (In the special case of a continuous random variable, this results in the probability density function due to Lebesgue's differentiation theorem.) Then we take the expectation of the logarithm of the resulting density function. Due to fundamental results in geometric measure theory, this approach can result in a well-defined entropy only for integer-dimensional distributions, since otherwise the density function does not exist [11, Th. 3.1]. In fact, the existence of the density function implies that the random variable is distributed according to a rectifiable measure [11, Th. 1.1]. Thus, the distributions considered in the present paper are rectifiable distributions on Euclidean space. Although this is still far from the generality of arbitrary probability distributions, it covers numerous interesting cases—including all the examples mentioned above—and gives valuable insights.

The density function of rectifiable measures can also be defined as a certain Radon-Nikodym derivative. A generalized (differential) entropy based on a Radon-Nikodym derivative with respect to a "measure of the observer's interest" was considered in [12]. Our entropy is consistent with this approach, and at a certain point we will use a result on quantization problems established in [12]. However, because in our setting a concrete measure is considered, the results we obtain go beyond the basic properties derived in [12] for general measures.

¹This entropy should not be confused with the information dimension defined in the same paper [9], which is indeed a very useful and widely used tool.

B. Contributions

We provide a generalization of the classical concepts of entropy and differential entropy to integer-dimensional random variables. Our entropy satisfies several well-known properties of (differential) entropy: it is invariant under unitary transformations, transforms as expected under Lipschitz mappings, and can be extended to joint and conditional entropy. We show that the entropy of discrete random variables and the differential entropy of continuous random variables are special cases of our entropy definition. For joint entropy, we prove a chain rule which takes the geometry of the support set into account. Furthermore, we discuss why in certain cases our entropy definition may violate the classical result that conditioning does not increase (differential) entropy. We provide expressions of the mutual information between integer-dimensional random variables in terms of our entropy. We also show that an asymptotic equipartition property analogous to [13, Sec. 8.2] holds for our entropy, but with the Lebesgue measure replaced by the Hausdorff measure of appropriate dimension.

In our proofs, we exercise care to detail all assumptions and to obtain mathematically rigorous statements. Thus, although many of our results might seem obvious to the cursory reader because of their similarity to well-known results for (differential) entropy, we emphasize that they are not simply replicas or straightforward adaptations of known results. This becomes evident, e.g., for the chain rule (see Theorem 41 in Section VI-C), which might be expected to have the same form as the chain rule for differential entropy. However, already a simple example will show that the geometry of the support set may lead to an additional term, which is not present in the special case of continuous random variables.

As a first application of the proposed entropy, we derive a result on the minimal expected binary codeword length of quantized integer-dimensional singular sources. More specifically, we show that our entropy characterizes the rate at which an arbitrarily fine quantization of an integer-dimensional singular source can be compressed. Another application is a lower bound on the rate-distortion function of an integer-dimensional singular source that resembles the Shannon lower bound for discrete [4, Sec. 4.3] and continuous [4, Sec. 4.6] random variables. For the specific case of a singular source that is uniformly distributed on the unit circle, we demonstrate that our bound is within 0.2 nat of the true rate-distortion function.

C. Notation

Sets are denoted by calligraphic letters (e.g., A). The complement of a set A is denoted A^c. Sets of sets are denoted by fraktur letters (e.g., M). The set of natural numbers {1, 2, ...} is denoted as N. The open ball with center x ∈ R^M and radius r > 0 is denoted by B_r(x), i.e., B_r(x) ≜ {y ∈ R^M : ||y − x|| < r}. The symbol ω(M) denotes the volume of the M-dimensional unit ball, i.e., ω(M) = π^{M/2}/Γ(1 + M/2), where Γ is the Gamma function. Boldface uppercase and lowercase letters denote matrices and vectors, respectively. The m × m identity matrix is denoted by I_m. Sans serif letters denote random quantities, e.g., x is a random vector and x is a random scalar. The superscript T stands for transposition. For x ∈ R, ⌊x⌋ ≜ max{m ∈ Z : m ≤ x}, and for x ∈ R^M, ⌊x⌋ ≜ (⌊x_1⌋ ··· ⌊x_M⌋)^T. Similarly, ⌈x⌉ ≜ min{m ∈ Z : m ≥ x}. We write E_x[·] for the expectation operator with respect to the random variable
2 H(x) , Ex[log px(x)] = px(x ) log px(x ) . (1) The generalized Jacobian determinant of a Lipschitz function − − i i i∈I φ is written as Jφ. For a function φ with domain and a X D M subset , we denote by φ e the restriction of φ to For a continuous random variable x on R with probability D ⊆ D D the domain . H m denotes the -dimensional Hausdorff density function fx, the differential entropy is m measure.e3 LDM denotes the M-dimensional Lebesgue measure M x , E x x x L M x and BM denotese the Borel σ-algebra on R . For a measure h( ) x[log fx( )] = fx( ) log fx( ) d ( ) . − − RM µ and a µ-measurable function f, the induced measure is Z (2) given as µf −1( ) , µ(f −1( )). For two measures µ and We note that h(x) may be or undefined. ν on the sameA measurable space,A we indicate by µ ν ±∞ that µ is absolutely continuous with respect to ν (i.e.,≪ for A. Entropy of Dimension d(x) and ε-Entropy any measurable set , ν( )=0 implies µ( )=0). For A A A There exist two previously proposed generalizations of a measure µ and a measurable set , the measure µ E is the (differential) entropy to a larger set of probability distributions. restriction of µ to , i.e., µ ( )=Eµ( ). The logarithm| E |E A A∩E The first generalization is based on quantizations of the to the base e is denoted log and the logarithm to the base 2 random variable to ever finer cubes [9]. More specifically, is denoted ld. In certain equations, we reference an equation for a (possibly singular) random variable x RM , the Renyi´ number on top of the equality sign in order to indicate that information dimension of x is ∈ (42) the equality holds due to some previous equation: e.g., = H ⌊nx⌋ indicates that the equality holds due to eq. (42). d(x) , lim n (3) n→∞ log n  D. Organization of the Paper and the entropy of dimension d(x) of x is defined as The rest of this paper is organized as follows. In Sec- tion II, we review the established definitions of entropy R , nx h x (x) lim H ⌊ ⌋ d(x) log n (4) d( ) n→∞ n − and describe the intuitive idea behind our entropy defini-     tion. Rectifiable sets, measures, and random variables are provided the limits in (3) and (4) exist. introduced in Section III as the basic setting for integer- This definition of entropy of dimension d(x) corresponds to dimensional distributions. In Section IV, we develop the the following procedure: theory of “lower-dimensional entropy”: we define entropy for M ki ki+1 Z 1) Quantize x using the cubes i=1 n , n , with ki , integer-dimensional random variables, prove a transformation i.e., consider the discrete random variable with probabil-∈ property and invariance under unitary transformations, demon- M kQi ki+1  ities pk = Pr x i=1 n , n . strate connections to classical entropy and differential entropy, 2) Calculate the entropy∈ of the quantized random variable, and provide examples by calculating the entropy of random i.e., the negative expectationQ  of the logarithm of the variables supported on the unit circle in R2 and of positive probability mass function pk. semidefinite rank-one random matrices. In Sections V and 3) Subtract the correction term d(x) log n to account for the VI, we introduce and discuss joint entropy and conditional dimension of the random variable x. entropy, respectively. Relations of our entropy to the mutual 4) Take the limit n . information between integer-dimensional random variables are → ∞ Although this approach seems reasonable, there are several demonstrated in Section VII. In Section VIII, we prove an issues. 
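As a rough numerical illustration of the quantization-based definition of the information dimension in (3) (this sketch is not part of the original derivation; the unit-circle source, sample size, and quantization levels are choices made here for concreteness), one can estimate H(⌊nx⌋)/log n from samples of a singular source on the unit circle and observe that the ratio slowly approaches d(x) = 1 as n grows:

    import numpy as np

    rng = np.random.default_rng(0)
    z = rng.uniform(0.0, 2.0 * np.pi, size=1_000_000)
    x = np.stack([np.cos(z), np.sin(z)], axis=1)      # singular source supported on the unit circle

    def quantized_entropy(x, n):
        # empirical entropy (in nats) of the quantized random variable floor(n * x)
        _, counts = np.unique(np.floor(n * x), axis=0, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log(p))

    for n in [2 ** k for k in range(4, 10)]:
        print(n, quantized_entropy(x, n) / np.log(n))  # ratio tends (slowly) toward d(x) = 1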
First, the definition of hR (x) seems to be difficult asymptotic equipartition property for our entropy. In Sec- d(x) to handle analytically, and connections to major information- tion IX, we present a result on the minimal expected binary theoretic concepts such as mutual information are not avail- codeword length of quantized integer-dimensional sources. In able. Furthermore, the quantization used is just one of many Section X, we derive a Shannon lower bound for integer- possible—we might, e.g., consider a shifted version of the set dimensional singular sources and evaluate it for a source that of cubes M ki , ki+1 , which, for singular distributions, is uniformly distributed on the unit circle. i=1 n n may result in a different value of the resulting entropy. Q   II. PREVIOUS WORK AND MOTIVATION An approach that overcomes the latter issue is the concept of ε-entropy [3], [10]. The definition of ε-entropy does not use a We first recall the definitions of entropy for discrete random specific quantization but takes the infimum of the entropy over variables [13, Ch. 2] and differential entropy for continuous all possible (countable) quantizations under a constraint on the 2By Rademacher’s theorem [14, Th. 2.14], a Lipschitz function is differen- diameter of the quantization sets. This is motivated by data tiable almost everywhere and, thus, the Jacobian determinant is well defined compression: the quantization should be such that an error of almost everywhere. maximally ε is made (thus, the quantization sets have maximal 3Readers unfamiliar with this concept may think of it as a measure of an m-dimensional area in a higher-dimensional space (e.g., surfaces in R3). An diameter ε) and at the same time the minimal possible number introduction and definition can be found in [14, Sec. 2.8]. of bits should be used to encode the data (thus, the entropy is 4 minimized over all possible quantizations). More specifically, 1) For some x RM , divide the probability Pr x M ∈ 5 d(x) { ∈ for a random variable x R , let Pε denote the set of all ε(x) by the correction factor ω(d(x)) ε . (Recall countable partitions of RM∈ into mutually disjoint, measurable Bthat ω(}d(x)) denotes the volume of the d(x)-dimensional sets of diameter at most ε. Furthermore, for a partition Q = unit ball.) : i N P , the quantization [x]Q N is the discrete 2) Take the limit ε 0. {Ai ∈ } ∈ ε ∈ → random variable defined by pi = Pr [x]Q = i = Pr x i 3) Calculate the entropy as the negative expectation of the for i N. Then the ε-entropy of x {is defined} as { ∈ A } logarithm of the resulting density function. ∈ More specifically, steps 1–2 yield the density function6 Hε(x) , inf H([x]Q) . (5) Q∈Pε Pr x ε(x) θx(x) , lim { ∈B } (6) Here, a problem is that Hε(x) is only defined for a fixed ε→0 ω(d(x)) εd(x) ε > 0 and the limit ε 0 converges to for nondiscrete distributions. However, as→ in the case of R´enyi∞ information di- and the entropy in step 3 is thus given by mension, a correction term can be obtained using the following d(x) h (x) , Ex[log θx(x)] . (7) seemingly new definition of information dimension: − H (x) We will show that this definition of entropy will lead to d∗(x) , lim ε . definitions of joint and conditional entropy, various useful ε→0 log 1 ε relations, connections to mutual information, an asymptotic By [15, Prop. 3.3], the definitions of information dimension equipartition property, and bounds relevant to source coding. 
using R´enyi’s approach and the ε-entropy approach coincide, However, our definition does have one limitation: as pointed i.e., d∗(x)= d(x). This suggests the following new definition out in [6, Sec. VII-A], the existence of the limit in (6) for of a d(x)-dimensional entropy. almost every x RM is a much stronger assumption than ∈ Definition 1: Let x RM be a random variable with the existence of the R´enyi information dimension (3). Loosely existing information dimension∈ d(x). Then the asymptotic ε- speaking, the existence of the limit in (6) requires that the entropy of dimension d(x) is defined as random variable x is d(x)-dimensional almost everywhere whereas the existence of the R´enyi information dimension ∗ , h x (x) lim Hε(x)+ d(x) log ε . merely requires that the random variable is x -dimensional d( ) ε→0 d( ) “on average.” By Preiss’ Theorem [16, Th. 5.6], convergence This definition corresponds to the following procedure: in (6) even implies that the probability measure induced 4 1) Quantize x using an entropy-minimizing quantization Q by the random variable x is rectifiable (see Definition 6 given a diameter constraint ε, i.e., consider the discrete in Section III-B), which means that our definition does not random variable [x]Q with probabilities p = Pr [x]Q = i { apply to, e.g., self-similar fractal distributions. However, we i = Pr x i for i Q, where the diameter of are not aware of any application or calculation of the d(x)- each} {is upper∈ A } boundedA by∈ ε. Ai dimensional entropy in (4) (or the asymptotic version of ε- 2) Calculate the entropy of the quantized random variable entropy) for fractal distributions, and it does not seem clear [x]Q, i.e., the negative expectation of the logarithm of the whether the d(x)-dimensional entropy is well defined in that probability mass function pi. case (although the information dimension (3) exists). 3) Add the correction term d(x) log ε to account for the The rectifiability also implies that the density function θx(x) dimension of the random variable x. is equal to a certain Radon-Nikodym derivative. Based on this 4) Take the limit ε 0. d(x) → equality, the entropy h (x) defined in (7) and (6) can be Although this entropy is more general than the entropy of interpreted as a generalized entropy as defined in [12, eq. (1.5)] dimension d(x) in (4), the fundamental problems persist: we by are still restricted to the choice of sets of small diameter dµ (this is of course useful if we consider maximal distance log (x) dµ(x) if µ λ H (µ) , − RM dλ ≪ (8) as a measure of distortion but can yield unnecessarily many λ  Z   else. quantization points for areas of almost zero probability), and ∞ the definition still seems to be difficult to handle analytically Here, λ is a σ-finite measure on RM and µ is a probability and lacks connections to established information-theoretic measure on RM . While µ can be chosen as the measure of a quantities such as mutual information. given random variable, the generalized entropy (8) provides no B. An Alternative Approach intuition on how to choose the measure λ. It is more similar to a divergence between measures and, in particular, reduces to Here, we propose a different approach, which is motivated the Kullback-Leibler divergence [17] for a probability measure by the definition of differential entropy. The basic idea is λ. We will see (cf. 
Remark 19) that our entropy definition to circumvent the quantization step and perform the entropy m coincides with (8) for the choice λ = H E , where m and calculation at the end. Assuming x RM , this results in the | ∈ following procedure: 5The constant factor ω(d(x)) is included to obtain equality with differential entropy in the special case d(x) = M. A different factor would result in an 4We assume for simplicity that an entropy-minimizing quantization exists additive constant in the entropy definition. although in general the infimum in (5) may not be attained. 6A mathematically rigorous definition will be provided in Section III-B. 5
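To illustrate the limit in (6) and the resulting entropy (7), the following sketch (not from the paper; the uniform unit-circle source, the test point, and the radii are illustrative assumptions) estimates Pr{x ∈ B_ε(x0)}/(ω(1)ε) for a point x0 on the circle. The estimate approaches θ_x(x0) = 1/(2π), so that (7) yields h^{d(x)}(x) = log(2π):

    import numpy as np

    rng = np.random.default_rng(1)
    z = rng.uniform(0.0, 2.0 * np.pi, size=2_000_000)
    x = np.stack([np.cos(z), np.sin(z)], axis=1)       # uniform on the unit circle, d(x) = 1

    x0 = np.array([np.cos(0.7), np.sin(0.7)])          # an arbitrary point on the circle
    for eps in [0.1, 0.03, 0.01]:
        p_ball = np.mean(np.linalg.norm(x - x0, axis=1) < eps)
        print(eps, p_ball / (2.0 * eps))               # omega(1) * eps = 2 * eps; values approach 1/(2*pi)
    print(1.0 / (2.0 * np.pi), np.log(2.0 * np.pi))    # limiting density and the entropy (7) in nats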

M depend on the given random variable. This interpretation is m-rectifiable if and only if there exist k R such E H m T ⊆ will allow us to use basic results from [12] for our entropy that 0 k∈N k, where ( 0) = 0 and each E ⊆ T ∪ T 1 T M definition. k is an m-dimensional, embedded C submanifold of R Motivated by the entropy expression in (7), a formal defini- T[19, Lem. 5.4.2].S Another characterization, based on [18, tion of the entropy of an integer-dimensional random variable Cor. 3.2.4], is that RM is m-rectifiable if and only if will be given in Section IV-A, based on the mathematical E ⊆ theory of rectifiable measures discussed next. 0 fk( k) (9) E⊆E ∪ N A k[∈ ECTIFIABLE ANDOM ARIABLES m III. R R V where H ( 0) = 0, k are bounded Borel sets, and Rm REM A As mentioned in Section II-B, the existence of a d(x)- fk : are Lipschitz functions that are one-to-one → dimensional density implies that the random variable x is on k. Due to [20, Th. 15.1], this implies that fk( k) are A A rectifiable. In this section, we recall the definitions of rec- also Borel sets. tifiable sets and measures and introduce rectifiable random B. Rectifiable Measures variables as a straightforward extension. Furthermore, we present some basic properties that will be used in subsequent Loosely speaking, rectifiable measures are measures that sections. For the convenience of readers who prefer to skip the are concentrated on a rectifiable set. The most convenient mathematical details, we summarize the most important facts way to define “concentrated on” mathematically is in terms in Corollary 12. of absolute continuity with respect to a specific Hausdorff measure. A. Rectifiable Sets Definition 6 ([14, Def. 2.59]): A Borel measure µ on RM is Our basic geometric objects of interest are rectifiable sets called m-rectifiable if there exists an m-rectifiable set RM [18, Sec. 3.2.14]. As the definition of rectifiable sets is not such that µ H m . E ⊆ ≪ |E consistent in the literature, we provide the definition most For an m-rectifiable measure µ, i.e., µ H m for an m- H m E convenient for our purpose. We recall that denotes the rectifiable set RM , we have by Property≪ 2 in| Lemma 4 m-dimensional Hausdorff measure. m E ⊆ that H E is σ-finite. Thus, by the Radon-Nikodym theorem Definition 2 ([14, Def. 2.57]): For m N, an H m-mea- [14, Th. 1.28],| there exists the Radon-Nikodym derivative RM ∈ 7 surable set (with M m) is called m-rectifiable dµ if there existEL ⊆ m-measurable,≥ bounded sets Rm and θm(x) , (x) (10) k µ H m M A8 ⊆ d E Lipschitz functions fk : k R , both for k N, such | m A → M∈ m m m that H f ( ) = 0. A set R is called satisfying dµ = θ dH E . We will refer to θ (x) as the k∈N k k µ | µ 0-rectifiableE\ if it is finiteA or countably infinite.E ⊆ m-dimensional Hausdorff density of µ. S  Remark 3: Hereafter, we will often consider the setting of Remark 7: If µ is an m-rectifiable probability measure, it m-rectifiable sets in RM and tacitly assume m 0,...,M . cannot be n-rectifiable for n = m. Indeed, suppose that µ is ∈{ } 6 Rectifiable sets satisfy the following well-known basic prop- both m-rectifiable and n-rectifiable where, without loss of gen- erties. erality, n>m. Then there exists an m-rectifiable set such that µ H m , which implies µ( c)=0. There alsoE exists RM E Lemma 4: Let be an m-rectifiable subset of . ≪ | EH n E an n-rectifiable set such that µ F . By Property 4 in 1) Any H m-measurable subset is also m-rectifiable. 
F ≪ |H n D⊆E Lemma 4, the m-rectifiable set satisfies ( )=0 and, in 2) The measure H m is σ-finite. n E n E E particular, H F ( )=0. Because µ H F , this implies RM RN | 3) Let φ: with N m be a Lipschitz function. µ( )=0. Hence,| Eµ(RM ) = µ( c)+≪µ( )=0| , which is a H→m ≥ If φ( ) is -measurable, then it is m-rectifiable. contradictionE to the assumption thatE µ is a probabilityE measure. 4) For n>mE , we have H n( )=0. N E To avoid the nuisance of separately considering the case 5) Let i for i be m-rectifiable sets. Then i∈N i is dµ E ∈ E m = 0 in many proofs and to reduce the class of m- m-rectifiable. dH |E rectifiable sets of interest, we define the following notion ofa 6) For m =0, Rm is m-rectifiable. S 6 support of an m-rectifiable measure. Intuitively, rectifiable sets are lower-dimensional subsets of Definition 8: For an m-rectifiable measure µ on RM , an Euclidean space. Examples include affine subspaces, algebraic m-rectifiable set RM is called a support of µ if varieties, differentiable manifolds, and graphs of Lipschitz H m , dEµ ⊆ H m -almost everywhere, and µ E dH m|E > 0 E functions. As countable unions of rectifiable sets are again ≪ | |N = k∈N fk( k) where, for k , k is a bounded Borel rectifiable, further examples are countable unions of any of E mA M ∈ A set and fk : R R is a Lipschitz function that is one-to- the aforementioned sets. S → one on k. Remark 5: There are various characterizations of m-rec- A Lemma 9: Let be an -rectifiable measure, i.e., tifiable sets that provide connections to other mathematical µ m µ H m for an m-rectifiable set RM . Then there exists≪ a disciplines. For example, an H m-measurable set RM |E E ⊆ E ⊆ support . Furthermore, the support is unique up to sets m E⊆E 7In [14, Def. 2.57], these sets are called countably H m-rectifiable. of H -measure zero. 8 This definition also encompasses finite index sets k ∈ {1,...,K}; it Proof:e See Appendix A. suffices to set Ak = ∅ for k>K. 6
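As a minimal worked instance of Definition 2 (spelled out here for illustration; the paper does not state it in this form), the unit circle S_1 = {x ∈ R^2 : ||x|| = 1} from the introductory example is 1-rectifiable: it is the image of a single bounded Borel set under a Lipschitz map,

S_1 = f(A), with A = [0, 2π] ⊂ R, f(t) = (cos t  sin t)^T, and ||f(t) − f(s)|| ≤ |t − s|,

so the covering required in Definition 2 holds with a single function and an exceptional set of H^1-measure zero (here even empty). Accordingly, the uniform distribution on S_1 is a 1-rectifiable measure in the sense of Definition 6, and its 1-dimensional Hausdorff density (10) is constant and equal to 1/(2π) on S_1, matching the heuristic limit in (6).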

Remark 10: For m-rectifiable measures, it is possible to in the next corollary. Note, however, that although everything m interpret the Hausdorff density θµ (x) as a measure of “local seems to be similar to the continuous case, Hausdorff measures probability per area.” Indeed, for an m-rectifiable measure µ, lack substantial properties of the Lebesgue measure, e.g., the m i.e., µ H E for an m-rectifiable set , we can write product measure is not always again a Hausdorff measure. m ≪ | E θµ (x) in (10) as Corollary 12: Let x be an m-rectifiable random variable on RM , i.e., µx−1 H m for an m-rectifiable set RM . µ( r(x)) E m ≪ | E ⊆ m θµ (x) = lim B (11) Then there exists the m-dimensional Hausdorff density θ , r→0 ω(m)rm x and the following properties hold: m H E -almost everywhere (for a proof see [14, Th. 2.83 and RM | 1) The probability Pr x for a measurable set eq. (2.42)]). Furthermore, the right-hand side in (11) vanishes can be calculated{ as the∈ A} integral of θm over A⊆with re- for H m-almost all points not in . Note the similarity of (11) x A E spect to the m-dimensional Hausdorff measure restricted with the ad-hoc construction in Section II-B. Indeed, (11) is the to , i.e., mathematically rigorous formulation of (6). This formulation E also provides details regarding the probability measures for Pr x = µx−1( )= θm(x) dH m (x) . (14) { ∈ A} A x |E which it results in a well-defined quantity. ZA 2) The expectation of a measurable function f : RM R C. Rectifiable Random Variables with respect to the random variable x can be expressed→ As we are only interested in probability measures and be- as cause is often formulated for random vari- E m H m ables, we define m-rectifiable random variables. In what fol- x[f(x)] = f(x) θx (x) d E (x) . (15) RM | lows, we consider a random variable x: (Ω, S) (RM , B ) Z → M 3) The random variable x is in with probability one, i.e., on a probability space (Ω, S,µ), i.e., Ω is a set, S is a E σ-algebra on Ω, and µ is a probability measure on (Ω, S). −1 m H m Pr x = µx ( )= θx (x) d E (x)=1 . The probability measure induced by the random variable x { ∈ E} E E | −1 −1 Z is denoted by µx . For BM , µx ( ) equals the (16) probability that x , i.e., A ∈ A 4) There exists a support of x. ∈ A E⊆E The special cases m =0 and m = M reduce to well-known µx−1( )= µ(x−1( )) = Pr x . (12) A A { ∈ A} concepts. e Definition 11: A random variable x: Ω RM on a prob- Theorem 13: Let x be a random variable on RM . Then: → ability space (Ω, S,µ) is called m-rectifiable if the induced 1) x is 0-rectifiable if and only if it is a discrete random −1 RM probability measure µx on is m-rectifiable, i.e., there variable, i.e., there exists a probability mass function exists an m-rectifiable set RM such that µx−1 H m . E ⊆ ≪ |E px(xi) = Pr x = xi > 0, i , where is The m-dimensional Hausdorff density of an m-rectifiable a finite or countably{ infinite} index∈ set I indicatingI all random variable x is defined as (cf. (10)) 0 possible realizations xi of x. In this case, θx = px and −1 x is a support of x. m m dµx = i : i x , −1 x x (13) E { ∈ I} θx ( ) θµx ( )= m ( ) . 2) x is M-rectifiable if and only if it is a continuous random dH E | variable, i.e., there exists a probability density function Furthermore, a support of the measure µx−1 is called a support M − fx such that Pr x = fx(x) dL (x). In this −1 m dµx 1 { ∈ A} A of x, i.e., is a support of x if µx H , m (x) > M L M E dH |E case, θx = fx -almost everywhere. 
H m E ≪ | R 0 E -almost everywhere, and = k∈N fk( k) where, Proof: See Appendix B. |N E RmA RM for k , k is a bounded Borel set and fk : is The following theorem introduces a nontrivial class of m- a Lipschitz∈ A function that is one-to-one onS . → Ak rectifiable random variables. Note that due to Remark 7, an m-rectifiable random variable Theorem 14: Let x be a continuous random variable on cannot be n-rectifiable for n = m. m m M 6 R . Furthermore, let φ: R R with M m be In the nontrivial case m 0 L -almost everywhere, and classical sense and is nonzero only on an m-dimensional set . m m , E assume that φ(R ) is H -measurable. Then y φ(x) is an Indeed, the random variable x will vanish everywhere except m-rectifiable random variable on RM . on a set of Lebesgue measure zero, and thus a probability density function cannot exist. However, the m-dimensional Proof: According to Definition 11, we have to show that µy−1 H m for an m-rectifiable set RM . By Hausdorff measure of the support set does not vanish, and ≪ |E E ⊆ m Properties 1, 3, and 6 in Lemma 4, the set , φ( r(0)) one can think of θx as an m-dimensional probability density E B M is m-rectifiable (φ is Lipschitz on (0) for all r > 0). function of the random variable x on R . Br Based on our discussion of rectifiable measures in Sec- 9 The m-dimensional Jacobian determinant of φ is defined as Jφ(x) = tion III-B, we can find a characterization of m-rectifiable DT x D x D x RM×m qdet( φ ( ) φ( )), where φ( ) ∈ denotes the Jacobian random variables that resembles well-known properties of matrix of φ, which is guaranteed to exist almost everywhere. Note in particular continuous random variables. This characterization is stated that Jφ(x) is nonnegative. 7

m m Hence, by Property 5 in Lemma 4, the set , φ(R ) = Jφ(z) > 0 L -almost everywhere. To calculate the Jacobian 0 E T r∈N φ( r( )) is m-rectifiable. Thus, it suffices to show that matrix Dφ(z), we stack the columns of the matrix zz and −1 B m m µy H φ(Rm), i.e., that for any H -measurable set differentiate the resulting vector with respect to each element S ≪M |m −1 R , H φ(Rm)( )=0 implies µy ( )=0. To zi. It is easily seen that the resulting Jacobian matrix is given A ⊆ | A m A this end, assume first that H φ(Rm)( )=0 for a bounded by m M | A T I H -measurable set R . Let f denote the probability ze1 + z1 m density function of x.A By ⊆ the generalized change of variables zeT + z I  2 2 m  formula [14, eq. (2.47)], we have Dφ(z)= . (18) .  T  m ze + zmIm f(x)Jφ(x) dL (x)  m  −1   Zφ (A) where ei denotes the ith unit vector. As long as at least one x H m y element zi is nonzero, Dφ(z) has full rank. Thus, Jφ(z) > 0 = f( ) d ( ) m −1 L -almost everywhere. Zφ(φ (A)) x∈φ−1(A)∩φ−1({y}) X Remark 17: For the case of positive definite random ma- H m m T = f(x) d (y) trices, i.e., Xm = i=1 zizi with independent continuous m A∩φ(R ) − − Z x∈φ 1(A)∩φ 1({y}) zi, it is easy to see that the measures induced by these X P (a) random matrices are absolutely continuous with respect to = 0 (17) the m(m + 1)/2-dimensional Lebesgue measure on the space where (a) holds because H m( φ(Rm)) = 0. Be- of all symmetric matrices. The intermediate case of positive m A ∩ n T cause Jφ(x) > 0 L -almost everywhere, (17) implies semidefinite rank-deficient random matrices Xn = i=1 zizi m −1 Rm f(x) =0 L -almost everywhere on φ ( ), and hence for n 2,...,m 1 , where the zi , i 1,...,n ∈ { − } ∈ ∈ { } L m A are independent continuous random variables, isP consider- φ−1(A) f(x) d (x)=0. Thus, we have ably more involved because the mapping (z1,..., zn) n T 7→ R −1 −1 −1 m z z has a vanishing Jacobian determinant almost ev- µy ( )= µx (φ ( )) = f(x) dL (x)=0 . i=1 i i A A −1 erywhere. We conjecture that X is (mn n(n 1)/2)- Zφ (A) n rectifiable,P conforming to the dimension of− the− manifold For an unbounded H m-measurable set RM satisfying m A ⊆ of all positive semidefinite rank-n matrices with n distinct H φ(Rm)( )=0, following the arguments above, we obtain −1| A eigenvalues. µy ( r(0))=0 for the bounded sets r(0), r N. A∩B −1 −1 A∩B ∈ This implies µy ( ) N µy ( (0))=0. IV. ENTROPY OF RECTIFIABLE RANDOM VARIABLES A ≤ r∈ A∩Br A. Definition D. Example: DistributionsP on the Unit Circle The m-rectifiable random variables introduced in Defini- As a basic example of 1-rectifiable singular random vari- tion 11 will be the objects considered in our entropy definition. ables, we consider distributions on the unit circle in R2, i.e., Due to the existence of the m-dimensional Hausdorff density on , x R2 : x =1 . θm for these random variables (see (11) and (13)), the heuristic S1 { ∈ k k } x Corollary 15: Let z be a continuous random variable on R. approach described in Section II-B (see (6) and (7)) can be T T made rigorous. Then x = (x1 x2) , (cos z sin z) is a 1-rectifiable random variable. Definition 18: Let x be an m-rectifiable random variable on RM . The m-dimensional entropy of x is defined as Proof: The mapping φ: z (cos z sin z)T is Lipschitz 7→ and its Jacobian determinant is identically one. Thus, we can m , E m m −1 h (x) x log θx (x) = log θx (x) dµx (x) directly apply Theorem 14. 
− − RM Z (19) This toy example is intuitive and illustrates the concept   provided the integral on the right-hand side exists in R of m-rectifiable singular random variables in a very simple . ∪ setup. In a similar way, one can analyze the rectifiability of {±∞} By (15), we obtain distributions on various other geometric structures. m m m H m h (x)= θx (x) log θx (x) d E (x) (20) E. Example: Positive Semidefinite Rank-One Random Matri- − RM | ces Z = θm(x) log θm(x) dH m(x) (21) A less obvious example of an m-rectifiable singular random − x x ZE variable are positive semidefinite rank-one random matrices, where RM is an arbitrary m-rectifiable set satisfying T Rm×m i.e., matrices of the form X = zz , where z is a µx−1 EH ⊆ m (in particular, may be a support of x). Rm ∈ ≪ |E E continuous random variable on . Remark 19: For a fixed m-rectifiable measure µ, our entropy Corollary 16: Let z be a continuous random variable on definition (19) can be interpreted as a generalized entropy (8) Rm. Then the random matrix X , zzT is m-rectifiable on with λ = H m . This will allow us to use basic results 2 E Rm . from [12] for our| entropy definition. However, our definition Proof: The mapping φ: z zzT is locally Lipschitz. changes the measure λ based on the choice of µ and thus is Thus, in order to apply Theorem7→ 14, it remains to show that not simply a special case of (8). 8

B. Transformation Property , the summation reduces to a multiplication by the cardinality Eof φ−1( φ(x) ). One important property of differential entropy is its invari- { } ance under unitary transformations. A similar result holds for C. Relation to Entropy and Differential Entropy m-dimensional entropy. We can even give a more general result for arbitrary one-to-one Lipschitz mappings. In the special cases m = 0 and m = M, our entropy definition reduces to classical entropy (1) and differential x Theorem 20: Let be an m-rectifiable random variable entropy (2), respectively. on RN with 1 m N, finite m-dimensional entropy Theorem 23: Let x be a random variable on RM . If x is hm(x), support ≤, and m≤-dimensional Hausdorff density θm. x a -rectifiable (i.e., discrete) random variable, then the - Furthermore, letEφ: RN RM with M m be a Lipschitz 0 0 10 E → m ≥ dimensional entropy of x coincides with the classical entropy, mapping such that J > 0 H E -almost everywhere, φ( ) φ | E i.e., h0(x)= H(x). If x is an M-rectifiable (i.e., continuous) is H m-measurable, and E [log J E (x)] exists and is finite. If x φ random variable, then the M-dimensional entropy of x coin- the restriction of φ to is one-to-one, then y , φ(x) is an cides with the differential entropy, i.e., hM (x)= h(x). m-rectifiable random variableE with m-dimensional Hausdorff density Proof: Let x be a 0-rectifiable random variable. By m −1 Theorem 13, x is a discrete random variable with possible θx (φ (y)) θm(y)= 0 y J E (φ−1(y)) realizations xi, i , the 0-dimensional Hausdorff density θx φ is the probability∈ mass I function of x, and a support is given H m φ(E)-almost everywhere, and its m-dimensional entropy by = xi : i . Thus, (21) yields is | E { ∈ I} hm(y)= hm(x)+ E [log J E (x)] . h0(x)= θ0(x) log θ0(x) dH 0(x) x φ − x x ZE Proof: See Appendix C. (a) = Pr x = xi log Pr x = xi Remark 21: Theorem 20 shows that for the special case of − { } { } Xi∈I a unitary transformation φ (e.g., a translation), (1) = H(x) hm(φ(x)) = hm(x) where (a) holds because H 0 is the counting measure. E because Jφ (x) is identically one in that case. Let x be an M-rectifiable random variable. By Theorem 13, Remark 22: In general, no result resembling Theorem 20 x is a continuous random variable and the M-dimensional N M M holds for Lipschitz functions φ: R R that are not one- Hausdorff density θx is equal to the probability density → to-one on . We can argue as in the proof of Theorem 20 function fx. Thus, (19) yields E and obtain that y = φ(x) is m-rectifiable and that the m- M M (2) h (x)= Ex log θ (x) = Ex[log fx(x)] = h(x) . dimensional Hausdorff density is − x − m   m θx (x) θy (y)= To get an idea of the m-dimensional entropy of random E x −1 Jφ ( ) x∈φX({y}) variables in between the discrete and continuous cases, we can m use Theorem 14 to construct m-rectifiable random variables. H φ(E)-almost everywhere. We then obtain for the m- dimensional| entropy More specifically, we consider a continuous random variable x on Rm and a one-to-one Lipschitz mapping φ: Rm RM m → m θx (x) (M m) whose generalized Jacobian determinant satisfies h (y)= E ≥ m − J (x) Jφ > 0 L -almost everywhere. Intuitively, we should see a φ(E) x −1 y φ ! Z ∈φX({ }) connection between the differential entropy of x and the m- θm(x) dimensional entropy of y , φ(x). By Theorem 14, the random log x dH m(y) × J E (x) variable y is m-rectifiable and, because φ is one-to-one, we x∈φ−1({y}) φ ! X can indeed calculate the m-dimensional entropy. 
(a) = θm(x) Corollary 24: Let x be a continuous random variable on Rm − x ZE with finite differential entropy h(x) and probability density θm(x′) Rm RM x H m x function fx. Furthermore, let φ: (M m) be a log E ′ d ( ) → ≥m × J (x ) one-to-one Lipschitz mapping such that Jφ > 0 L -almost x′∈φ−1({φ(x)}) φ ! X everywhere and Ex[log Jφ(x)] exists and is finite. Then the where (a) holds because of the generalized area formula [14, m-dimensional Hausdorff density of the m-rectifiable random Th. 2.91]. In most cases, this cannot be easily expressed in variable y , φ(x) is terms of a differential entropy due to the sum in the logarithm. f (φ−1(y)) However, in the special case of a Jacobian determinant J E and m y x φ θy ( )= −1 m Jφ(φ (y)) a Hausdorff density θx that are symmetric in the sense that m ′ E ′ −1 m θ (x ) and J (x ) are constant on φ ( φ(x) ) for all x H Rm -almost everywhere, and the m-dimensional en- x φ { } ∈ φ( ) tropy| of y is 10Here JE denotes the Jacobian determinant of the tangential differential φ m of φ in E. For details see [18, Sec. 3.2.16]. h (y)= h(x)+ Ex[log Jφ(x)] . 9
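As a usage sketch of the formula h^m(y) = h(x) + E_x[log J_φ(x)] in Corollary 24 (this numerical example is not part of the paper; the map φ(x) = (x, sin x)^T and the Gaussian input are illustrative assumptions), the right-hand side can be evaluated by Monte Carlo for a simple one-to-one Lipschitz curve in R^2:

    import numpy as np

    # x ~ N(0,1); phi(x) = (x, sin x)^T is one-to-one, Lipschitz, with J_phi(x) = sqrt(1 + cos(x)^2) > 0
    rng = np.random.default_rng(2)
    x = rng.standard_normal(1_000_000)
    h_x = 0.5 * np.log(2.0 * np.pi * np.e)             # differential entropy of N(0,1) in nats
    E_log_J = np.mean(0.5 * np.log(1.0 + np.cos(x) ** 2))
    print(h_x + E_log_J)                               # 1-dimensional entropy h^1(y) of y = phi(x)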

For the special case of the embedding φ: Rm RM , use Corollary 24. However, along the lines of Remark 22, we φ(x ,...,x ) = (x x 0 0)T, this results in→ obtain 1 m 1 ··· m ··· m h (x1,..., xm, 0,..., 0) = h(x) . (22) hm(X) Proof: The first part is the special case and ′ N = m fz(z ) m m z L z = R of Theorem 20. The result (22) then follows from the = fz( ) log ′ d ( ) . − Rm Jφ(z ) E Z z′∈φ−1({φ(z)}) ! fact that, for the considered embedding, Jφ(x) is identically X (27) 1. Because the z′ φ−1( φ(z) ) are given by z, and because D. Example: Entropy of Distributions on the Unit Circle ∈ { } ± fz(z)+ fz( z)=2f¯z(z) and J (z) = J ( z) (see (26)), − φ φ − It is now easy to calculate the entropy of the 1-rectifiable eq. (27) implies singular random variables on the unit circle previously con- m sidered in Section III-D. Let z be a continuous random h (X) variable on R with probability density function fz supported f (z) = f (z) log 2 ¯z dL m(z) on [0, 2π), i.e., fz(z)=0 for z / [0, 2π). By Corollary 24, z − Rm Jφ(z) ! the 1-dimensional Hausdorff density∈ of the random variable Z T m x = φ(z) = (cos z sin z) is given by (recall that the Jacobian = fz(z) log 2 + log f¯z(z) log Jφ(z) dL (z) − Rm − determinant is identically one) Z  1 −1 = log 2 f (z) log f (z) dL m(z)+ E [log J (z)] θ (x)= fz(φ (x)) (23) z ¯z z φ x − − Rm Z H 1 (a) 1 m S1 -almost everywhere, and the entropy of x is given by L | = log 2 fz(z) log f¯z(z) d (z) − − 2 Rm h1(x)= h(z) . (24) Z 1 m fz( z) log f (z) dL (z)+ Ez[log J (z)] 1 ¯z φ Of course, this result for h (x) may have been conjectured by − 2 Rm − Z heuristic reasoning. Next, we consider a case where heuristic m = log 2 f¯z(z) log f¯z(z) dL (z)+ Ez[log Jφ(z)] reasoning does not help. − − Rm Z = log 2 + h(¯z)+ Ez[log J (z)] (28) E. Example: Entropy of Positive Semidefinite Rank-One Ran- − φ dom Matrices where (a) holds because f¯z( z)= f¯z(z). Inserting (26) into − As a more challenging example, we calculate the entropy (28) gives (25). of a specific type of m-rectifiable singular random variables, A practically interesting special case of symmetric random namely, the positive semidefinite rank-one random matrices matrices is constituted by the class of Wishart matrices [22]. , n T previously considered in Section III-E. A rank-n Wishart matrix is given by Wn,Σ i=1 zizi m×m ∈ R , where the zi, i 1,...,n are independent and Theorem 25: Let z be a continuous random variable on ∈ { } P m identically distributed (i.i.d.) Gaussian random variables on R with probability density function fz, and let ¯z denote the Rm with mean 0 and some nonsingular covariance matrix Σ. random variable with probability density function f¯z(z) = The differential entropy of a full-rank Wishart matrix (i.e., (fz(z)+ fz( z))/2. Then the m-dimensional entropy of the random matrix− X = zzT is given by n m), considered as a random variable in the m(m + 1)/2- dimensional≥ space of symmetric matrices, is given by [23, m m 1 m 2 h (X)= h(¯z)+ − log 2 + Ez[log z ] . (25) eq. (B.82)] 2 2 k k Proof: We first calculate the Jacobian determinant of mn/2 n n/2 h(Wn,Σ) = log 2 Γm (det Σ) z zzT z 2 the mapping φ: , which is given by Jφ( ) =     T 7→ mn m n +1 det(Dφ (z)Dφ(z)). By (18) and some simple algebraic ma- + + − E [log det(W Σ)] (29) 2 2 z n, qnipulations, one obtains J (z) = det(2 z 2I +2zzT), φ k k m and further where Γm( ) denotes the multivariate gamma function. 
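The closed form J_φ(z) = 2^{(m+1)/2} ||z||^m in (26) can be checked numerically (this check is not in the paper; the dimension and test points are arbitrary choices) by assembling the Jacobian matrix of z ↦ vec(zz^T) and evaluating sqrt(det(D_φ^T(z) D_φ(z))):

    import numpy as np

    def jacobian_vec_zzT(z):
        # Jacobian matrix (shape m^2 x m) of the map z -> vec(z z^T)
        m = z.size
        D = np.zeros((m * m, m))
        for k in range(m):
            E = np.zeros((m, m))
            E[k, :] += z                  # contribution e_k z^T of the k-th partial derivative
            E[:, k] += z                  # contribution z e_k^T of the k-th partial derivative
            D[:, k] = E.reshape(-1)
        return D

    rng = np.random.default_rng(3)
    m = 4
    for _ in range(3):
        z = rng.standard_normal(m)
        D = jacobian_vec_zzT(z)
        J_numeric = np.sqrt(np.linalg.det(D.T @ D))
        J_closed_form = 2.0 ** ((m + 1) / 2) * np.linalg.norm(z) ** m
        print(J_numeric, J_closed_form)   # the two values agree, confirming (26)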
In our p setting, full-rank· Wishart matrices can be interpreted as m(m+ 2 m 2m 1 T 1)/2-rectifiable random variables in the m -dimensional space Jφ(z)= 2 z det Im + zz k k z 2 of all matrices by considering the embedding of s  k k  m m symmetric× matrices into the space of all matrices and using (a) zTz m z 2m Theorem 14. Using this interpretation, we can use Corollary 24 = 2 1+ 2 s k k z m(m+1)/2  k k  and obtain h(Wn,Σ)= h (Wn,Σ). = 2m+1 z 2m The case of rank-deficient Wishart matrices, i.e., n k k ∈ m+1 m 1,...,m 1 , has not been analyzed information-theoret- =2p 2 z (26) { − } k k ically so far. For simplicity, we will consider the case of rank- T m×m where (a) holds due to [21, Example 1.3.24]. Because the one Wishart matrices, i.e., W Σ = zz R . The m- 1, ∈ mapping φ: z zzT is not one-to-one, we cannot directly dimensional entropy of W Σ is given by (25) in Theorem 25. 7→ 1, 10

Because z is Gaussian with mean 0, we have ¯z = z in • Suppose we have an m1-rectifiable random variable x Theorem 25, so that (25) simplifies to and an m2-rectifiable random variable y on the same probability space. Which additional assumptions ensure m m 1 m 2 h (W1,Σ)= h(z)+ − log 2 + Ez[log z ] . that (x, y) is (m + m )-rectifiable? 2 2 k k 1 2 • Conversely, suppose we have an m-rectifiable random Again using the Gaussianity of z, we obtain further variable (x, y). Which additional assumptions ensure that m m/2 1/2 x and y are rectifiable? h (W1,Σ) = log (2πe) (det Σ) In what follows, we will provide answers to these questions m 1 m 2 + − log 2 + Ez[log z ] under appropriate conditions on the involved random variables. 2 2 k k One important shortcoming of Hausdorff measures (in con- m−1/2 m/2 Σ 1/2 = log 2 π (det ) trast to, e.g., the Lebesgue measure) is that the product of m m 2 + + Ez[log z ] . (30) two Hausdorff measures is in general not again a Hausdorff 2 2 k k measure. However, our definition of the support of a rectifiable If z contains independent standard normal entries, then z 2 is measure in Definition 8 guarantees that the product of two 2 E 2 k k χm distributed and z[log z ]= ψ(m/2)+log 2, where ψ( ) Hausdorff measures restricted to the respective supports is denotes the digamma functionk k [23, eq. (B.81)]. It is interesting· again a Hausdorff measure. to compare (30) with the differential entropy of the full-rank Lemma 27: Let x be m -rectifiable with support , and let 1 E1 Wishart matrix as given by (29). Although there is a formal y be m2-rectifiable with support 2. Then 1 2 is (m1+m2)- similarity, we emphasize that the differential entropy in (29) rectifiable and E E ×E cannot be trivially extended to the setting n
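For z with independent standard normal entries, the m-dimensional entropy of the rank-one Wishart matrix follows from (25) together with E_z[log ||z||^2] = ψ(m/2) + log 2. A short numerical sketch (not from the paper; the dimension and sample size are illustrative) evaluates this expression and confirms the expectation by Monte Carlo:

    import numpy as np
    from scipy.special import digamma

    m = 5
    # E[log ||z||^2] for z ~ N(0, I_m): psi(m/2) + log 2 (||z||^2 is chi-square with m degrees of freedom)
    E_log_norm_sq = digamma(m / 2) + np.log(2.0)
    h_z = 0.5 * m * np.log(2.0 * np.pi * np.e)                        # differential entropy of N(0, I_m)
    h_W = h_z + (m - 1) / 2 * np.log(2.0) + m / 2 * E_log_norm_sq     # eq. (25) with z_bar = z
    print(h_W)

    # Monte Carlo check of E[log ||z||^2]
    rng = np.random.default_rng(4)
    z = rng.standard_normal((200_000, m))
    print(np.mean(np.log(np.sum(z ** 2, axis=1))), E_log_norm_sq)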

Corollary 29: Let x1:n , (x1,..., xn) be a finite se- The reason for this seemingly unintuitive behavior of Mi quence of independent random variables, where xi R , our entropy are the geometric properties of the projection ∈ M1+M2 M2 i 1,...,n is mi-rectifiable with support i and mi- py : R R , py(x, y) = y, i.e., the projection of ∈ { } mi E RM1+M2 → dimensional Hausdorff density θx . Then x1:n is an m- to the last M2 components. Although py is linear and i n RM RM1+M2 rectifiable random variable on , where m = i=1 mi has a Jacobian determinant Jpy of 1 everywhere on , n , and M = i=1 Mi, and the set 1 n is m- things get more involved once we consider py as a mapping −1 E E m×···×EP rectifiable and satisfies µ(x1:n) H E . Moreover, the between rectifiable sets and want to calculate the Jacobian P ≪ | E m-dimensional Hausdorff density of x1:n is given by determinant Jpy of the tangential differential of py which M1+M2 n maps an m-rectifiable set R to an m2-rectifiable m mi RM2 E ⊆ E θ (x1:n)= θ (xi) . set 2 [18, Sec. 3.2.16]. In this setting, J is not x1:n xi E ⊆ py i=1 necessarily constant and may also become zero. Thus, the Y mi marginalization of an m-dimensional Hausdorff density is not Finally, if h (xi) is finite for i 1,...,n , then ∈{ } as easy as the marginalization of a probability density function. n The following theorem shows how to marginalize Hausdorff hm(x )= hmi (x ) . (35) 1:n i densities and describes the implications for m-dimensional i=1 X entropy. Proof: The corollary follows by inductively applying Theorem 31: Let (x, y) RM1+M2 be an m-rectifiable Theorem 28 to the two random variables (x1,..., xi−1) and ∈ x . random variable (m M1 + M2) with m-dimensional Haus- i m ≤ , dorff density θ(x,y) and support . Furthermore, let 2 M E E B. Dependent Random Variables py( ) R 2 be m -rectifiable (m m, m M ), E ⊆ 2 2 ≤ 2 ≤ 2 H m2 ( ) < , and J E > 0 H m -almost everywhere.e The case of dependent random variables is more involved. E2 ∞ py |E The rectifiability of x and y does not necessarily imply the Then the following properties hold: rectifiability of (x, y) (which is expected, since the marginal 1) Thee random variable y is m2-rectifiable. distributions carry only a small part of the information carried 2) There exists a support 2 2 of y. E ⊆ E by the joint distribution). In general, even for continuous 3) The m2-dimensional Hausdorff density of y is given by random variables x and y, we cannot calculate the joint m e θ(x,y)(x, y) m2 y H m−m2 x (39) differential entropy h(x, y) from the mere knowledge of the θy ( )= E d ( ) (y) J (x, y) differential entropies h(x) and h(y). However, it is always ZE py possible to bound the differential entropy according to [13, H m2 -almost everywhere, where (y) , x RM1 : eq. (8.63)] (x, y) . E { ∈ h(x, y) h(x)+ h(y) . (36) ∈ E} y ≤ 4) An expression of the m2-dimensional entropy of is In general, no bound resembling (36) holds for our entropy given by definition. The following simple setting provides a counterex- hm2 (y)= θm (x, y) log θm2 (y) dH m(x, y) ample. − (x,y) y ZE Example 30: We continue our example of a random variable (40) on the unit circle (see Section IV-D) for the special case of a provided the integral on the right-hand side exists and is uniform distribution of z on [0, 2π). From (24), we obtain finite. Under the assumptions that 1 , px( ) is m1-rectifiable 1 E E h (x)= h(z) = log(2π) . 
(37) (m m, m M ), H m1 ( ) < , and J E > 0 H m - 1 ≤ 1 ≤ 1 E1 ∞ px |E We can now analyze the components11 x and y of the random almost everywhere, analogouse results hold for x. variable x = (x y)T = (cos z sin z)T. One can easily see that Proof: See Appendix E. e x is a continuous random variable and its probability density We will illustrate the main findings of Theorem 31 in the 2 function is given by fx(x)=1/(π√1 x ). By symmetry, the setting of Example 30. − 2 same holds for y, i.e., fy(y)=1/(π 1 y ). Basic calculus Example 32: As in Example 30, we consider (x, y) − then yields for the differential entropy of x and y R2 ∈ p uniformly distributed on the unit circle 1. By (23), 1 H 1 S π θ(x,y)(x, y) = 1/(2π) -almost everywhere on 1. In h(x)= h(y) = log . (38) 1 S 2 Example 30, we already obtained h (y) = log(π/2) (there, we   used the fact that y is a continuous random variable and that, Since x and y are continuous random variables, it follows from by Theorem 23, h1(y) = h(y)). Let us now calculate h1(y) Theorem 23 that h1(x)= h(x) and h1(y)= h(y). Thus, using Theorem 31. Note first that py( 1) = [ 1, 1], which H 1 S − 1 1 π is 1-rectifiable and satisfies ([ 1, 1]) = 2 < . Next, h (x)+ h (y) = 2 log < log(2π) . − S1 ∞ 2 we calculate the Jacobian determinant Jpy (x, y). Consider   an arbitrary point on the unit circle, which can always be Comparing with (37), we see that h1 x y h1 x h1 y . ( , ) > ( )+ ( ) expressed as 1 y2, y with y [0, 1]. At that point, ± − ± ∈ 11To conform with the notation (x, y) used in our treatment of joint entropy, the projection py restricted to the tangent space of 1 can be T T p  S 2 we change the component notation from (x1 x2) to (x y) . shown to amount to a multiplication by the factor 1 y . − p 12

S1 2 2 Thus, Jpy 1 y , y = 1 y . Hence, we obtain Theorem 31, x is m1-rectifiable and y is m2-rectifiable. Thus, from (40) ± − ± − x and y are product-compatible. p  p The setting of product-compatible random variables will be h1 y ( ) especially important for our discussion of mutual information 1 in Section VII. However, already for joint entropy, we obtain = θ(x,y)(x, y) − S1 some useful results. Z 1 θ(x,y)(˜x, y) log dH 1−1(˜x) dH 1(x, y) Theorem 34: Let x be an m1-rectifiable random variable on (y) S1 M × S Jp (˜x,y) R 1 with support , and let y be an -rectifiable random  Z 1 y  1 m2 M2 E 1 variable on R with support 2. Furthermore, let x and y 1 2π 0 1 H H E m1+m2 = log d (˜x) d (x, y) be product-compatible. Denote by θ the (m1 + m2)- − S 2π S(y) 1 y2 (x,y) Z 1  Z 1  −1 dimensional Hausdorff density of (x, y) and by 1 2 (a) 1 E ⊆E ×E = log p 2π dH 1(x, y) a support of (x, y). Then the following properties hold: 2 −2π S1 1 y Z  x˜∈S(y)  X1 − 1) The m2-dimensional Hausdorff density of y is given by 1 (b) 1 p = log 2 2π dH 1(x, y) 2 m2 m1+m2 m1 −2π S 1 y y x y H x Z 1   θy ( )= θ(x,y) ( , ) d ( ) − E1 1 2π p 1 Z = log dφ −2π π cos(φ) H m2 -almost everywhere. Z0  | |  π 2) An expression of the m2-dimensional entropy of y is = log (41) 2 given by   where holds because H 0 is the counting measure and (a) m2 m1+m2 (y) h (y)= θ(x,y) (x, y) (b) holds because = x R : (x, y) 1 = − S1 { ∈ ∈ S } ZE 1 y2, 1 y2 contains two points for all y m2 H m1+m2 log θy (y) d (x, y) ( 1, 1)−. Note− that− our above result for h1(y) coincides with∈ × −p p the result previously obtained in Example 30. provided the integral on the right-hand side exists and is finite. C. Product-Compatible Random Variables m1 Due to symmetry, analogous properties hold for θx and There are special settings in which m-dimensional entropy hm1 (x). more closely matches the behavior we know from (differential) entropy. In these cases, the three random variables x, y, and Proof: The proof follows along the lines of the proof (x, y) are rectifiable with “matching” dimensions, and we will of Theorem 31 in Appendix E. However, due to the product- see that an inequality similar to (36) holds. compatibility of x and y, one can use Fubini’s theorem in place of (110). Definition 33: Let x be an m1-rectifiable random variable M1 For product-compatible random variables, also the inequal- on R with support 1, and let y be an m2-rectifiable random E ity hm1+m2 (x, y) hm1 (x) + hm2 (y) holds. However, the variable on RM2 with support . The random variables x and 2 proof of this inequality≤ will be much easier once we considered y are called product-compatibleE if (x, y) is an (m + m )- 1 2 the mutual information between rectifiable random variables. rectifiable random variable on RM1+M2 . Thus, we postpone a formal presentation of the inequality to It is easy to see that for product-compatible random vari- Corollary 47 in Section VII. −1 H m1+m2 ables x and y, µ(x, y) E1×E2 . Thus, by Property 4 in Corollary 12, there≪ exists a support| . E⊆E1 ×E2 The most important part of Definition 33 is that the di- VI. CONDITIONAL ENTROPY mensions of x and y add up to the joint dimension of (x, y). Note that this was not the case in Example 32, where x In contrast to joint entropy, conditional entropy is a nontriv- and y “shared” the dimension m = 1 of (x, y). A simple ial extension of entropy. 
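The value h^1(y) = log(π/2) obtained in (38) and again in (41) can be reproduced numerically (this sketch is not part of the paper; the integration endpoints are truncated slightly to avoid the integrable endpoint singularities). Integrating −f_y log f_y for the arcsine density f_y(y) = 1/(π sqrt(1 − y^2)) gives log(π/2), and comparing with h^1(x, y) = log(2π) exhibits the inequality h^1(x, y) > h^1(x) + h^1(y) of Example 30:

    import numpy as np
    from scipy.integrate import quad

    f_y = lambda y: 1.0 / (np.pi * np.sqrt(1.0 - y ** 2))   # density of y = sin(z), z uniform on [0, 2*pi)
    h_y, _ = quad(lambda y: -f_y(y) * np.log(f_y(y)), -1 + 1e-9, 1 - 1e-9)
    print(h_y, np.log(np.pi / 2.0))                         # both approximately 0.4516 nat
    print(np.log(2.0 * np.pi), 2.0 * h_y)                   # h^1(x,y) = log(2*pi) exceeds h^1(x) + h^1(y)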
We would like to define the entropy example of product-compatible random variables is the case for a random variable x on RM1 under the condition that a of an m1-rectifiable random variable x and an independent m2- dependent random variable y on RM2 is known. For discrete rectifiable random variable y. Indeed, by Theorem 28, (x, y) and—under appropriate assumptions—for continuous random is (m1 + m2)-rectifiable. variables, the distribution of (x y = y) is well defined and Another example of product-compatible random variables so is the associated entropy H| (x y = y) or differential | can be deduced from Theorem 31. Let (x, y) be (m1 + m2)- entropy h(x y = y). Averaging over all y then results in , RM2 | rectifiable. Assume that 2 py( ) is m2-rectifiable, the well-known definitions of conditional entropy H(x y), H m2 E E H mE ⊆ | ( 2) < , and Jpy > 0 E -almost everywhere. Fur- involving only the probability mass functions and , E ∞ | p(x,y) py M1 thermore, assume that 1e , px( ) R is m1-rectifiable, or of conditional differential entropy h(x y), involving only m EE Em ⊆ | H 1 ( e ) < , and J > 0 H -almost everywhere. By the probability density functions f and fy. Indeed, if x and E1 ∞ px |E (x,y) e e 13

H m2 E H m y are discrete random variables, we have ( 2) < , and Jpy > 0 E -almost everywhere. Then theE following∞ properties hold: | H(x y)= py(yj ) H(x y = yj) 1) Thee measure Pr x y = y is (m m2)-rectifiable | N | j∈ H m2 { ∈ · | }RM2 − X for E2 -almost every y , where 2 2 is a p(x,y)(xi, yj ) support12 |of y. ∈ E ⊆ E = p(x,y)(xi, yj ) log − py(yj ) 2) The (m m )-dimensional Hausdorff density of the i,j∈N   2 e X measure − x y y is given by p (x, y) Pr = E (x,y) { ∈ · | } = (x,y) log (42) θm (x, y) y m−m (x,y) − py( ) θ 2 (x)= (45)    Pr{x∈·| y=y} E m2 J (x, y) θy (y) and, if x and y are continuous random variables, we have py H m−m2 H m2 E(y) -almost everywhere, for E2 -almost | M (y) | M h(x y)= fy(y) h(x y = y) dy every y R 2 . Here, as before, , x R 1 : | RM2 | ∈ E { ∈ Z (x, y) . f(x,y)(x, y) ∈ E} = f(x,y)(x, y) log d(x, y) Proof: See Appendix F. − RM1+M2 f (y) Z  y  As for joint entropy, the case of product-compatible random f(x,y)(x, y) variables (see Definition 33) is of special interest and results = E log . (43) − (x,y) f (y) in a more intuitive characterization of the Hausdorff density   y  of Pr x y = y . A straightforward generalization to rectifiable measures would { ∈ · | } be to mimic the right-hand sides of (42) and (43) using Theorem 36: Let x be an m1-rectifiable random variable on RM1 with support , and let y be an m -rectifiable random Hausdorff densities. However, it will turn out that this naive E1 2 variable on RM2 with support . Furthermore, let x and y be approach is only partly correct: due to the geometric subtleties E2 of the projection discussed in Section V-B, we may have to product-compatible. Then the following properties hold: 1) The measure Pr x y = y is m -rectifiable for include a correction term that reflects the geometry of the { ∈ · | } 1 H m2 -almost every y RM2 . conditioning process. |E2 ∈ 2) The m1-dimensional Hausdorff density of Pr x y = y is given by { ∈ · | A. Conditional Probability } m1+m2 x y For general random variables x and y, we recall the concept θ(x,y) ( , ) m1 x (46) θPr{x∈·| y=y}( )= m2 of conditional probabilities, which can be summarized as θy (y) follows (a detailed account can be found in [24, Ch. 5]): For H m1 H m2 E1 -almost everywhere, for E2 -almost every RM1+M2 | | a pair of random variables (x, y) on , there exists a y RM2 . regular conditional probability Pr x y = y , i.e., for ∈ { ∈ A | } Proof: The proof follows along the lines of the proof each measurable set RM1 , the function y Pr x A ⊆ 7→ { ∈ of Theorem 35 in Appendix F. However, due to the product- y = y is measurable and Pr x y = y defines compatibility of x and y, one can use Fubini’s theorem in place A | } { R∈M2 · | } a probability measure for each y . Furthermore, the of (110). regular conditional probability Pr x∈ y = y satisfies { ∈ A | } Note that Theorems 35 and 36 hold for any version of the x y y −1 regular conditional probability Pr = . However, Pr (x, y) 1 2 = Pr x 1 y = y dµy (y) . { ∈H A |m2 } { ∈ A ×A } { ∈ A | } for different versions, the statement “for E2 -almost every A2 | Z (44) y RM2 ” may refer to different sets of H m2 -measure ∈ |E2 The regular conditional probability Pr x y = y zero; e.g., (45) may hold for different y RM2 . Thus, ∈ involved in (44) is not unique. 
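Relation (42) is the familiar conditional entropy of discrete random variables. As a small self-contained illustration (the joint pmf below is an arbitrary toy choice, not taken from the paper; Python with NumPy is assumed), it can be evaluated either by averaging the entropies of the conditional pmfs or, equivalently, as H(x, y) − H(y):

```python
import numpy as np

# Sketch of (42) for a toy discrete pair: H(x|y) = sum_j p_y(y_j) H(x | y = y_j),
# which coincides with H(x, y) - H(y).
p_xy = np.array([[0.30, 0.10],
                 [0.05, 0.25],
                 [0.10, 0.20]])          # rows: values of x, columns: values of y
p_y = p_xy.sum(axis=0)

H_joint = -np.sum(p_xy * np.log(p_xy))
H_y = -np.sum(p_y * np.log(p_y))

# Average of the entropies of the conditional pmfs p_{x|y=y_j}.
H_cond = 0.0
for j, py_j in enumerate(p_y):
    p_x_given_y = p_xy[:, j] / py_j
    H_cond += py_j * (-np.sum(p_x_given_y * np.log(p_x_given_y)))

print(H_cond, H_joint - H_y)             # identical up to rounding
```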
Nevertheless,{ ∈ we A can | still us}e results that are independent of the version of the regular (44) in a definition of conditional entropy because any version conditional probability can only be obtained if we can avoid of the regular conditional probability satisfies (44). For the these “almost everywhere”-statements. To this end, we will remainder of this section, we consider a fixed version of the define conditional entropy as an expectation over y. regular conditional probability Pr x y = y . Definition 37: Let (x, y) be an m-rectifiable random variable { ∈ A | } M1+M2 on R such that y is m2-rectifiable with m2-dimensional m2 B. Definition of Conditional Entropy Hausdorff density θy and support 2. The conditional entropy 13 E In order to define a conditional entropy hm−m2 (x y), we of x given y is defined as | first show that Pr x y = y is a rectifiable measure. hm−m2 (x y) The next theorem{ establishes∈ · | sufficient} conditions such that | , m2 m−m2 Pr x y = y is rectifiable for almost every y. As before, θy (y) θPr{x∈·| y=y}(x) (y) { ∈ · | } M +M M M +M − E2 E we denote by py : R 1 2 R 2 the projection of R 1 2 Z Z → log θm−m2 (x) dH m−m2 (x) dH m2 (y) (47) to the last M2 components, i.e., py(x, y)= y. × Pr{x∈·| y=y} 12 Theorem 35: Let (x, y) be an m-rectifiable random variable By Theorem 31, the random variable y is m2-rectifiable with Hausdorff m RM1+M2 m density θ 2 (given by (39)) and some support E ⊆ E . on with m-dimensional Hausdorff density θ(x,y) y 2 2 13The inner integral in (47) can be intuitively interpretede as an entropy , RM2 and support . Furthermore, let py( ) be m−m2 E E2 E ⊆ h (x | y = y). However, such an entropy is not well defined in general m -rectifiable (m m, m M , m m M ), and depends on the choice of the conditional probability. 2 2 ≤ 2 ≤ 2 − 2 ≤ 1 e 14 provided the right-hand side of (47) exists and coincides for all Proof: By the definition of hm(x, y) in (31) and the versions of the regular conditional probability Pr x y = definition of hm2 (y) in (19), we have y . { ∈ A | m m2 E } h (x, y) h (y)+ E(x,y) log J (x, y) Remark 38: For independent random variables x and y, in- − py E m E m2 m1 m1 = (x,y) log θ(x,y)(x,y) + y logθy (y) serting (34) into (46) implies that θPr{x∈·| y=y}(x)= θx (x). − E m1 m1 E x y Thus, (47) reduces to h (x y)= h (x).   + (x,y) log Jpy ( , ) | m The following theorem gives a characterization of condi- θ x y (x, y) = E log ( , ) + E  log J E (x, y) . tional entropy and sufficient conditions for (47) to be well- − (x,y) θm2 (y) (x,y) py   y  defined in the sense that the right-hand side of (47) coin-  (51) cides for all versions of the regular conditional probability Because we assumed in the theorem that the integrals corre- Pr x y = y . { ∈ A | } sponding to the terms on the left-hand side of (51) are finite, Theorem 39: Let (x, y) be an m-rectifiable random variable the right-hand side of (51) is also finite. By (48), the right-hand RM1+M2 m m−m on with m-dimensional Hausdorff density θ(x,y) and side of (51) equals h 2 (x y). Thus, (50) holds. | support . Furthermore, let 2 , py( ) be m2-rectifiable, Next, we continue Examples 30 and 32 from Section V-B. H m2 E E E H m E ( 2) < , and Jpy > 0 E -almost everywhere. We will see that the geometric correction term in the chain Then E ∞ | E E rule, (x,y) log Jpy (x, y) , is indeed necessary. 
θm (x, y) Example 42: As in Examples 30 and 32, we consider m−m (x,y)   h 2 (x y)= E log 2 (x,y) m2 (x, y) R uniformly distributed on the unit circle , | − θy (y) 1    1∈ H 1 S i.e., θ x y (x, y) = 1/(2π) -almost everywhere on 1. E E x y (48) ( , ) S + (x,y) log Jpy ( , ) According to (41), provided the right-hand side of (48) exists and is finite. π h1(y) = log (52) 2 Proof: See Appendix G.   Note the difference between (48) and the expressions (42) and according to (37), and (43) of H(x y) and h(x y), respectively: in the case of 1 rectifiable random| variables, we| generally have to include the h (x, y) = log(2π) . (53) geometric correction term E log J E (x, y) . However, we (x,y) py To calculate the conditional entropy h0(x y) (note that m will show next that, in the special case of product-compatible m = 1 1 = 0), we consider the| regular conditional− rectifiable random variables, this correction term does not 2 probability−Pr x y = y . It is easy to see that one appear. possible version{ of∈Pr A |x }y = y is the following: for { ∈ A | } Theorem 40: Let the m1-rectifiable random variable x on y ( 1, 1), Pr x = x y = y =1/2 for x = 1 y2 and RM1 RM2 ∈ − { | } ± − and the m2-rectifiable random variable y on be Pr x y = y = 0 if 1 y2 / . The probabilities p product-compatible. Then for{ y∈ A |1 are} irrelevant± because− Pr∈ Ay / ( 1, 1) = 0. p Hence,| | by ≥ (47), we obtain { ∈ − } θm1+m2 (x, y) m (x,y) h 1 (x y)= E log (49) 0 | − (x,y) θm2 (y) h (x y)   y  | 1 1 1 H 0 H 1 provided the right-hand side of (49) exists and is finite. = θy (y) log d (x) d (y) − (−1,1) {±√1−y2} 2 2 Proof: The proof follows along the lines of the proof of Z Z 1 Theorem 39 in Appendix G. However, due to the product- = θ1(y) log dH 1(y) − y 2 compatibility of x and y, one can use Fubini’s theorem in Z(−1,1) place of (110). = log 2 . (54) This differs from h1(x, y) h1(y) = log(2π) log(π/2), and C. Chain Rule for Rectifiable Random Variables therefore the conjecture that− there holds a chain− rule without As in the case of entropy and differential entropy, we can a correction term is wrong. To calculate the correction term, E S1 give a chain rule for m-dimensional entropy. which according to (50) is given by (x,y) log Jpy (x, y) , we recall from Example 32 that J S1 1 y2, y = 1 y2 Theorem 41: Let (x, y) be an m-rectifiable random variable py   S1 ± − ± − RM1+M2 m or, more conveniently, Jp (cos φ, sin φ) = cos φ . Thus, we on with m-dimensional Hausdorff density θ(x,y) and y p  p obtain | | support . Furthermore, let 2 , py( ) be m2-rectifiable, H m2 E E E H m E 1 ( 2) < , and Jp > 0 E -almost everywhere. S1 S1 1 y E x y log J (x, y) = log J (x, y) dH (x, y) Then E ∞ | ( , ) py 2π py ZS1   2π 1 hm(x, y)= hm2 (y)+ hm−m2 (x y) E log J E (x, y) = log cos φ dφ | − (x,y) py 2π | | (50) Z0   = log 2 . (55) provided the corresponding integrals exist and are finite. − 15
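Before the analytic verification that follows, the ingredients of the chain rule (50) for this example can be checked numerically. The sketch below (a minimal illustration assuming Python with NumPy; the grid size is an arbitrary choice) evaluates the correction term (55) by quadrature and confirms that the values in (52)–(55) balance the chain rule:

```python
import numpy as np

# Example 42 sketch: the geometric correction term E[log J^{S_1}_{p_y}(x, y)] equals
# (1/(2*pi)) * integral_0^{2*pi} log|cos(phi)| d(phi) = -log(2), cf. (55), and the
# chain rule (50) then balances: log(2*pi) = log(pi/2) + log(2) - (-log(2)).
phi = np.linspace(0.0, 2 * np.pi, 4_000_000, endpoint=False) + 1e-7  # avoid cos = 0
E_logJ = np.mean(np.log(np.abs(np.cos(phi))))
print(E_logJ, -np.log(2))                          # both ~ -0.6931

h_xy   = np.log(2 * np.pi)   # (53)
h_y    = np.log(np.pi / 2)   # (52)
h_cond = np.log(2)           # (54)
print(h_xy, h_y + h_cond - E_logJ)                 # both ~ 1.8379
```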

We finally verify that (55) is consistent with the chain rule (50). Starting from (53), we obtain

h^1(x, y) = log(2π)
          = log(π/2) + log 2 − (−log 2)
          = h^1(y) + h^0(x | y) − E_{(x,y)}[log J^{S_1}_{p_y}(x, y)]

where the final expansion is obtained by using (52), (54), and (55).

Example 42 also provides a counterexample to the rule "conditioning does not increase entropy," which holds for the entropy of discrete random variables and the differential entropy of continuous random variables. Indeed, comparing (38) and (54), we see that for the components of a uniform distribution on the unit circle, we have h^1(x) < h^0(x | y). However, as we will see in Corollary 47 in Section VII, this is only due to a "reduction of dimensions": if x and y are product-compatible, which implies that h^{m_1}(x) and h^{m−m_2}(x | y) are of the same dimension m_1 = m − m_2, conditioning will indeed not increase entropy, i.e., h^{m_1}(x | y) ≤ h^{m_1}(x). Also the chain rule (50) reduces to its traditional form, as stated next.

Theorem 43: Let the m_1-rectifiable random variable x on R^{M_1} and the m_2-rectifiable random variable y on R^{M_2} be product-compatible. Then

VII. MUTUAL INFORMATION

The basic definition of mutual information is for discrete random variables x and y with probability mass functions p_x(x_i) and p_y(y_j) and joint probability mass function p_{(x,y)}(x_i, y_j). The mutual information between x and y is given by [13, eq. (2.28)]

I(x; y) ≜ Σ_{i,j} p_{(x,y)}(x_i, y_j) log( p_{(x,y)}(x_i, y_j) / (p_x(x_i) p_y(y_j)) ) .   (59)

However, mutual information is also defined between arbitrary random variables x and y on a common probability space. This definition is based on (59) and quantizations [x]_Q and [y]_R [13, eq. (8.54)]. We recall from Section II-A that for a measurable, finite partition Q = {A_1, ..., A_N} of R^{M_1} (i.e., R^{M_1} = ∪_{i=1}^N A_i with A_i ∈ Q mutually disjoint and measurable), the quantization [x]_Q ∈ {1, ..., N} is defined as the discrete random variable with probability mass function p_{[x]_Q}(i) = Pr{[x]_Q = i} = Pr{x ∈ A_i} for i ∈ {1, ..., N}.

Definition 45 ([13, eq. (8.54)]): Let x: Ω → R^{M_1} and y: Ω → R^{M_2} be random variables on a common probability space (Ω, S, µ). The mutual information between x and y is defined as

I(x; y) ≜ sup_{Q,R} I([x]_Q; [y]_R)

where the supremum is taken over all measurable, finite partitions Q of R^{M_1} and R of R^{M_2}.
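To make Definition 45 concrete for the running unit-circle example, the sketch below evaluates I([x]_Q; [y]_R) for increasingly fine quantizations of the two coordinates. The uniform binning of [−1, 1] and the φ-grid used to approximate the bin probabilities are arbitrary choices (Python with NumPy assumed). The quantized mutual information keeps growing instead of saturating, reflecting that for this singular pair the supremum in Definition 45 is not finite:

```python
import numpy as np

# For (x, y) = (cos z, sin z) with z uniform on [0, 2*pi), quantize x and y into
# n_bins uniform cells of [-1, 1] each and compute the mutual information of the
# quantized pair from (approximately exact) cell probabilities.
phi = np.linspace(0.0, 2 * np.pi, 2_000_000, endpoint=False)
x, y = np.cos(phi), np.sin(phi)

def quantized_mi(n_bins):
    edges = np.linspace(-1.0, 1.0, n_bins + 1)
    ix = np.clip(np.digitize(x, edges) - 1, 0, n_bins - 1)
    iy = np.clip(np.digitize(y, edges) - 1, 0, n_bins - 1)
    joint = np.zeros((n_bins, n_bins))
    np.add.at(joint, (ix, iy), 1.0)
    joint /= joint.sum()
    px, py = joint.sum(axis=1), joint.sum(axis=0)
    nz = joint > 0
    return np.sum(joint[nz] * np.log(joint[nz] / (px[:, None] * py[None, :])[nz]))

for n in (4, 16, 64, 256):
    print(n, quantized_mi(n))   # grows roughly like log(n); no finite limit
```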

hm1+m2 (x, y)= hm2 (y)+ hm1 (x y) (56) The Gelfand-Yaglom-Perez theorem [25, Lem. 5.2.3] pro- | vides an expression of mutual information in terms of Radon- M1 m1+m2 m2 Nikodym derivatives: for random variables x: Ω R and provided the entropies h (x, y) and h (y) exist and are → y: Ω RM2 on a common probability space (Ω, S,µ), finite. → −1 m +m dµ(x, y) Proof: By the definition of h 1 2 (x, y) in (31) and the I(x; y)= log (x, y) −1 −1 m2 RM1+M2 d µx µy definition of h (y) in (19), we have Z  × −1 dµ(x, y) (x, y) (60) hm1+m2 (x, y) hm2 (y) × − if µ(x, y)−1 µx−1 µy−1, and m1+m2 m2 = E log θ (x, y) + Ey log θ (y) ≪ × − (x,y) (x,y) y m1+m2 I(x; y)= (61)  θ(x,y) (x,y)   ∞ = E log . (57) −1 −1 −1 (x,y) m2 if x y x y . − θy (y) µ( , ) µ µ    For the special6≪ cases× of discrete and continuous random By (49), the right-hand side of (57) equals hm1 (x y). Thus, variables, there exist expressions of mutual information in (56) holds. | terms of entropy and differential entropy, respectively. We will Using an induction argument, we can extend the chain extend these expressions to the case of rectifiable random vari- rule (56) to a sequence of random variables. ables. The resulting generalization will involve the entropies hm1 (x), hm2 (y), and hm(x, y). Corollary 44: Let x , (x ,..., x ) be a sequence of ran- 1:n 1 n Theorem 46: Let x be an m -rectifiable random variable dom variables where each x RMi is m -rectifiable. Assume 1 i i M1 ∈ with support 1 R , let y be an m2-rectifiable random that x1:i−1 and xi are product-compatible for i 2,...,n . E ⊆ ∈{ } variable with support RM2 , and let (x, y) be m-rectifiable Then E2 ⊆ with support 1 2. The mutual information I(x; y) n satisfies: E ⊆E ×E m m1 mi h (x1:n)= h (x1)+ h (xi x1:i−1) (58) | 1) If x and y are product-compatible (i.e., m = m1 + m2), i=2 X then n with m = mi, provided the corresponding integrals exist m i=1 I(x; y)= θ(x,y)(x, y) and are finite. E P Z m θ x y (x, y) We note that, consistently with Remark 38, (35) is a special log ( , ) dH m(x, y) . (62) × θm1 (x)θm2 (y) case of (58).  x y  16

m Furthermore, has m-dimensional Hausdorff density θx and m-dimensional entropy hm(x). The random variable (1/n) n log θm(x ) I(x; y)= hm1 (x)+ hm2 (y) hm(x, y) (63) − i=1 x i − converges to hm(x) in probability, i.e., for any ε> 0 P and n 1 m m m1 m1 m2 m2 lim Pr log θx (xi) h (x) >ε =0 . I(x; y)= h (x) h (x y)= h (y) h (y x) n→∞ − n − − | − |  i=1  (64) X m m m m m provided the entropies h 1 (x), h 2 (y), and h (x, y) Proof: By (19), we have h (x) = Ex log θ (x) , − x exist and are finite. and by the weak law of large numbers, the sample mean n m   2) If m 0 and n N, the ε-typical set ε R is defined to reconstruct an at least one-dimensional component of y as ∈ A ⊆ from x). Thus, an infinite amount of information could be n (n) n 1 m m transmitted over a channel x y (or y x). This is , x1:n : log θ (xi) h (x) ε . −→ −→ Aε ∈E − n x − ≤ consistent with our result that I(x; y)= .  i=1  ∞ X A corollary of Theorem 46 states that for product-compati- The AEP for sequences of m-rectifiable random variables ble random variables, we can upper-bound the joint entropy by is expressed by the following central result. the sum of the individual entropies and prove that conditioning Theorem 50: Let x1:n = (x1,..., xn) be a sequence of does not increase entropy. M i.i.d. m-rectifiable random variables xi on R , where each xi Corollary 47: Let the m1-rectifiable random variable x on has m-dimensional Hausdorff density θm, support , and m- M M x R 1 and the m -rectifiable random variable y on R 2 be m (nE) 2 dimensional entropy h (x). Then the typical set ε satisfies product-compatible. Then the following properties. A hm1+m2 (x, y) hm1 (x)+ hm2 (y) (65) 1) For δ > 0 and n sufficiently large, ≤ and Pr x (n) > 1 δ . { 1:n ∈ Aε } − hm1 (x y) hm1 (x) (66) | ≤ 2) For all n N, m m m +m ∈ provided the entropies h 1(x), h 2(y), and h 1 2(x, y) exist m H nm( (n)) en(h (x)+ε) . and are finite. Aε ≤ Proof: The inequality (65) follows from (63) and the 3) For δ > 0 and n sufficiently large, nonnegativity of mutual information. Similarly, (66) follows m H nm( (n)) > (1 δ)en(h (x)−ε) . from (64) and the nonnegativity of mutual information. Aε − Proof: The proof is similar to that in the continuous case VIII. ASYMPTOTIC EQUIPARTITION PROPERTY [13, Th. 8.2.2], however with the Lebesgue measure replaced Similar to classical entropy and differential entropy, the m- by the Hausdorff measure. dimensional entropy hm(x) satisfies an asymptotic equipar- NTROPY OUNDS ON XPECTED ODEWORD tition property (AEP). Let us consider a sequence x1:n , IX. E B E C ENGTH (x1,..., xn) of i.i.d. random variables xi. Our main findings L are similar to the discrete and continuous cases: based on A well-known result for discrete random variables is a m (n) h (x), we define sets ε of typical sequences x1:n and show connection between the minimal expected codeword length A that, for sufficiently large n, a random sequence x1:n belongs of an instantaneous source code and the entropy of the (n) to ε with probability arbitrarily close to one. Furthermore, random variable [13, Th. 5.4.1]. More specifically, let x be A (n) M we obtain upper and lower bounds on the size of ε given a discrete random variable on R with possible realizations n(hm(x)+ε) n(hm(x)−ε) A by e and (1 δ)e , respectively. In the case xi : i . 
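The convergence statement of Lemma 48 can be visualized by simulation. For the uniform density of Example 30, the sample average −(1/n) Σ_i log θ^1_x(x_i) equals log(2π) for every n, so the sketch below uses a non-uniform Hausdorff density on the unit circle instead; this density, the rejection sampler, and the sample sizes are illustrative choices (Python with NumPy assumed), not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1-rectifiable variable x = (cos z, sin z) with Hausdorff density along the circle
# theta(x(phi)) = (1 + cos(phi))/(2*pi).  Lemma 48: -(1/n) sum log theta(x_i)
# concentrates around h^1(x) = -integral theta log theta dH^1.
def theta(phi):
    return (1.0 + np.cos(phi)) / (2.0 * np.pi)

# Reference value h^1(x) by midpoint quadrature over the angle (arc length = d phi).
grid = (np.arange(1_000_000) + 0.5) * (2.0 * np.pi / 1_000_000)
vals = theta(grid)
h_ref = -(vals * np.log(vals)).mean() * 2.0 * np.pi

def sample_phi(n):
    # Rejection sampling from the angle density (1 + cos(phi))/(2*pi),
    # with a uniform proposal and acceptance probability (1 + cos(phi))/2.
    out = np.empty(0)
    while out.size < n:
        cand = rng.uniform(0.0, 2.0 * np.pi, size=2 * n)
        keep = rng.uniform(size=cand.size) < (1.0 + np.cos(cand)) / 2.0
        out = np.concatenate([out, cand[keep]])
    return out[:n]

for n in (100, 10_000, 1_000_000):
    phi = sample_phi(n)
    print(n, -np.mean(np.log(theta(phi))), h_ref)   # sample mean approaches h_ref
```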
In the case of classical entropy and differential entropy, these properties are useful in the proof of various coding theorems because they allow us to consider only typical sequences.

Our analysis follows the steps in [13, Sec. 8.2]. However, whereas in the discrete case the size of a set of sequences x_{1:n} is measured by its cardinality and in the continuous case by its Lebesgue measure, in the present case of m-rectifiable random variables x_i, we resort to the Hausdorff measure.

In variable-length source coding, a one-to-one function f : {x_i : i ∈ I} → {0, 1}*, where {0, 1}* denotes the set of all finite-length binary sequences, is used to represent each realization x_i by a finite-length binary sequence s_i = f(x_i). This code is instantaneous (or prefix free) if no f(x_i) coincides with the first bits of another f(x_j). The expected binary codeword length is defined as

L_f(x) ≜ E_x[ℓ(f(x))]

Lemma 48: Let x1:n = (x1,..., xn) be a sequence of i.i.d. where ℓ(s) denotes the length of a binary sequence s m-rectifiable random variables x on RM , where each x 0, 1 ∗. The minimal expected binary codeword length L∗(x∈) i i { } 17

is defined as the minimum of Lf (x) over the set of all possible the quantization: if the quantized random variable [x]Q is with instantaneous codes f. By [13, Th. 5.4.1], L∗(x) satisfies14 high probability—corresponding to µx−1( ) being large—in A ∗ a large quantization set , then this is penalized by the term H(x) ld e L (x) 0, there exists δε > 0 such that the partition of , i.e., all sets i are mutually disjoint and H - E N A following holds: for each δ (0,δε), there exists a partition measurable, and i=1 i = . The partition Q is said to be an (E) ∈ AH mE Qδ Pm,δ such that (m,δ)-partition of if ( i) δ for all i 1,...,N . ∈ S E A ≤ ∈{(E) } ∗ m The set of all (m,δ)-partitions of is denoted P . L ([x]Q ) < h (x) ld e ld δ +1+ ε . (71) E m,δ δ − Note that the definition of an (m,δ)-partition of an m- Proof: See Appendix J. We note that the proof is based rectifiable set does not involve a distortion function. On the on (67) and the expression of hm(x) given in (69). E one hand, this is convenient because we do not have to argue The lower bound (70) shows the following: if we want about a good distortion measure. On the other hand, the points a quantization Q of x with good resolution (in the sense (E) m in a set i of a partition Q Pm,δ are not necessarily “close” that H ( ) δ for all Q), then we have to use at A ∈ m A ≤ A ∈ to each other; in fact, i is not even necessarily connected. least h (x) ld e ld δ bits to represent this quantized random A (E) − Thus, although the partitions in Pm,δ consist of measure- variable using an instantaneous code. However, by the upper theoretically small sets, these sets might be considered large bound (71), we know that for a sufficiently fine resolution in terms of specific distortion measures. (i.e., δ<δε), that resolution δ can be achieved by using at In what follows, we will consider the quantized random most 1 + ε additional bits (in addition to the lower bound variable [x] for Q P(E) . We recall that [x] is the discrete hm(x) ld e ld δ). Q m,δ Q − random variable such∈ that Pr [x] = i = Pr x Q i B. Expected Codeword Length of Sequences of Integer-Dimen- for i 1,...,N . Due to the{ interpretation} of{hm∈(x) Aas} sional Random Variables a generalized∈ { entropy} (cf. Remark 19), we can use [12, eq. (1.8)] to obtain the following result. We will now apply Theorem 53 to sequences of i.i.d. random variables. To this end, we consider quantizations of Lemma 52: Let x be an m-rectifiable random variable, i.e., 15 −1 m M an entire sequence, [x1:n]Q = [(x1,..., xn)]Q with Q µx H E for an m-rectifiable set R , with m 1 (En) ∈ ≪m | (E) E ⊆ ≥ Pnm,δn . We denote by and H ( ) < . Let Pm,∞ denote the set of all finite, H m E ∞ ∗ -measurable partitions of . Then ∗ L ([x1:n]Q) E L ([x1:n]Q) , (72) n n hm(x) the minimal expected binary codeword length per source µx−1( ) symbol. = inf µx−1( ) log A (68) (E) H m ∞ − A E ( ) Corollary 54: Let x = (x ,..., x ) be a sequence Q∈Pm, A∈Q  | A ! 1:n 1 n X of i.i.d. m-rectifiable random variables (m 1) on RM −1 m m ≥ = inf H([x]Q)+ µx ( ) log H E ( ) . with m-dimensional entropy h (x) and support satisfying (E) Q∈Pm,∞ A | A m E  A∈Q  H ( ) < . Then, for each ε> 0, there exists δε > 0 such X (69) E ∞ that the following holds: for each δ (0,δε), there exists a (En) ∈ Proof: See Appendix I. partition Q P n such that the minimal expected binary ∈ nm,δ The terms in (69) give an interesting interpretation of m- codeword length per source symbol satisfies dimensional entropy. 
Looking for a quantization that min- m ∗ m 1+ ε h (x) ld e ld δ Ln([x1:n]Q) h (x) ld e ld δ + . imizes the first term, H([x]Q), corresponds to minimizing − ≤ ≤ − n the amount of data required to represent this quantization. (73) Of course, the minimum is simply obtained for the partition 15We choose partitions Q of resolution δn, i.e., the sets A ∈ Q satisfy Q = , which gives H([x]Q)=0. But in (69), we also H nm(A) ≤ δn. This choice is made for consistency with the case of {E} n (E) have an additional term that penalizes a bad “resolution” of partitions Q of E that are constructed as products of sets Ai in Q1 ∈ Pm,δ. More specifically, for A = A1 ×···×An with Ai ∈ Q1, we have m nm n n 14 H (Ai) ≤ δ and H (A) ≤ δ and the sets A cover E , i.e., The factor ld e appears because we defined entropy using the natural n , (E ) logarithm. Q {A = A1 ×···×An : Ai ∈ Q1} ∈ Pnm,δn . 18
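The bounds (70) and (71) can be made tangible for the uniform source on the unit circle: partition S_1 into arcs of H^1-measure at most δ (a (1, δ)-partition), quantize, and compare the expected length of a Huffman code for [x]_Q with h^1(x) ld e − ld δ = log_2(2π) − log_2 δ. The particular arc partition and the use of a Huffman code are illustrative choices and not the construction used in the proof of Theorem 53; Python with NumPy is assumed.

```python
import heapq
import numpy as np

def huffman_expected_length(probs):
    # Expected Huffman codeword length; each heap entry carries (probability,
    # expected length accumulated inside that subtree).
    heap = [(p, 0.0) for p in probs]
    heapq.heapify(heap)
    while len(heap) > 1:
        p1, l1 = heapq.heappop(heap)
        p2, l2 = heapq.heappop(heap)
        heapq.heappush(heap, (p1 + p2, l1 + l2 + p1 + p2))
    return heap[0][1]

h1_x = np.log(2 * np.pi)                 # (37), in nats
for delta in (0.5, 0.1, 0.01):
    N = int(np.ceil(2 * np.pi / delta))
    # N-1 arcs of length delta plus one shorter remainder arc (all <= delta).
    lengths = np.full(N, delta); lengths[-1] = 2 * np.pi - (N - 1) * delta
    probs = lengths / (2 * np.pi)        # uniform source: prob = H^1(arc)/(2*pi)
    L_star = huffman_expected_length(probs)
    lower = h1_x / np.log(2) - np.log2(delta)      # (70)
    print(delta, lower, L_star, lower + 1)          # lower <= L* < lower + 1 + eps
```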

Proof: By Corollary 29, the random variable x1:n is nm- where the second maximization is with respect to all func- −1 m 16 M rectifiable with µ(x1:n) H En and nm-dimensional tions αs : R (0, ) satisfying entropy hnm(x ) = nhm≪(x). Thus,| by Theorem 53, there → ∞ 1:n E −sd(x,y) ˆ x αs(x)e 1 (75) exists δε > 0 such that the following holds: ≤ n ( ) For all δˆ (0, δˆ ), there exists a partition Q P(E ) for each y RM .   ∗ ∈ ε ∈ nm,δˆ ∈ such that A. Shannon Lower Bound m ˆ ∗ nh (x) ld e ld δ L ([x1:n]Q) The most common form of the traditional Shannon lower − ≤ D0, y∈ ZE

R(D) = max_{s≥0} max_{α_s(·)} ( −sD + E_x[log α_s(x)] )   (74)

¹⁶Although in [26, Th. 2.3] α_s(x) ≥ 1 is assumed, (74) also holds for α_s(x) > 0 because of [26, eq. (1.23)].

Proof: We start by noting that (80) implies B. Maximizing the Shannon Lower Bound The optimal choice of s in (79) depends on D and is hard e−sd(x,y) dH m(x) γ(s) (81) to find in general. At least, the following lemma states that ≤ ZE the optimal (i.e., largest) lower bound in (79), RM ∗ , for all y . Let s 0 be fixed. By (74), RSLB(D) sup RSLB(D,s) ∈ ≥ s≥0

R(D) sD + Ex[log α (x)] (82) is achieved for a finite s. We recall that D is the infimum of ≥− s 0 all D 0 such that R(D) is finite. ≥ for every function αs satisfying (75). We have (cf. (15)) Lemma 56: Let x be an m-rectifiable random variable with support and finite m-dimensional entropy hm(x). Then for 1 x y E E e−sd( , ) D>D0 the lower bound RSLB(D,s) in (79) satisfies x θm(x)γ(s)  x  1 lim RSLB(D,s)= . = e−sd(x,y)θm(x) dH m(x) s→∞ −∞ θm(x)γ(s) x ZE x Proof: See Appendix K. 1 If R (D,s) is a continuous function of s, Lemma 56 = e−sd(x,y) dH m(x) SLB γ(s) E implies that for a fixed D >D0, the global maximum of Z R (D,s) with respect to s exists and is either a local (81) γ(s) SLB maximum or the boundary point s = 0, i.e., R∗ (D) = ≤ γ(s) SLB RSLB(D,s) for some finite s 0. Moreover, if γ(s) in (80) = 1 is differentiable, we can characterize≥ the local maxima of RSLB(D,s) as follows. RM , 1 for all y . Therefore, the choice αs(x) m θx (x)γ(s) Theorem 57: Let x be an m-rectifiable random variable with ∈ 1 satisfies (75). Inserting αs(x)= θm(x)γ(s) into (82), we obtain support , and let γ(s) be differentiable. Then for D>D , x E 0 the lower bound RSLB(D,s) in (79) is maximized either for E 1 s =0 or for some s> 0 satisfying D(s)= D, where R(D) sD + x log m ≥− θx (x)γ(s) ′   , γ (s) m D(s) e . = Ex[log θ (x)] sD Ex[log γ(s)] − γ(s) − x − − = hm(x) sD log γ(s) . That is, the largest lowere bound is given by − − ∗ RSLB(D) = max RSLB(D, 0), sup RSLB(D,s) . e  s>0:D(s)=D  For a continuous random variable x with positive probability (84) density function almost everywhere (i.e., M-rectifiable with Proof: We recall from (79) that R (D,s) = hm(x) support RM ), the definitions of γ(s) in (78) and γ(s) in SLB sD log γ(s). Thus, because γ(s) is differentiable, a necessary− (80) coincide. Indeed, because d(x, y) = d(x y, 0) and condition− for a local maximum of with respect to a translation of the integrand by y does not change− the value RSLB(D,s) e is obtained by setting to zero the derivative of of the integral over RM , the right-hand side of (78) can be s RSLB(D,s) with respect to s. Solving the resulting equation for D yields written as (recall that H M = L M ) D(s)= D. Thus, for a given D>D0, RSLB(D,s) can only have a local maximum at s (0, ) satisfying D(s)= D. By −sd(x,0) L M −sd(x,y) H M e d (x)= e d (x) (83) Lemma 56, the global maximum∈ ∞ either is a local maximum RM RM e Z Z or is achieved for s =0, which concludes the proof.e for any y RM . Because the left-hand side of (83) does If γ(s) is differentiable, Theorem 57 provides a “parame- ∈ M ∗ not depend on y, taking the supremum over y R in (83) trization” of the graph of the largest bound RSLB(D), i.e., we results in ∈ can characterize the set , ∗ R2 (85) 0 D, RSLB(D) : D>D0 . e−sd(x, )dL M (x) = sup e−sd(x,y)dH M (x) G ∈ RM y RM RM As a basis for this characterization, we define the sets Z ∈ Z   , D(s), RSLB(D(s),s) : s> 0 which is (80). Thus, for a continuous random variable x with F1 positive probability density function almost everywhere, the m m 2 , D, h (x) log H ( ) : D>D 0 (86) Shannon lower bounds (77) and (79) coincide. However, for F e − e E x which are illustrated in Fig. 1. Note that 1 is not necessarily a continuous random variable whose support is a proper F RM E the graph of a function, whereas constitutes a horizontal subset of we have γ(s) γ(s), and thus the Shannon F2 lower bound (79) is tighter (i.e.,≤ larger) than (77). This is due line in the (D, R) plane. 
to the fact that (79) incorporatese the additional information Corollary 58: Let x be an m-rectifiable random variable that the random variable is restricted to . with support , and let γ(s) be differentiable. Define , E E F 20

smoothness conditions, the supremum in (80) is in fact a F1 5 maximum, i.e., F2 F¯ −sd(x,y) m 4 γ(s)= max e dH (x) (93) y∈RM ZE and D(s) can be rewritten as R 3 1 D(se)= D∗(s) , d(x, y˜(s))e−sd(x,y˜(s)) dH m(x) 2 γ(s) ZE wheree y˜(s) is the maximizing value in the definition of γ(s) 1 (cf. (80)):

0 y˜(s) , argmax e−sd(x,y) dH m(x) . 0123456 RM y∈ ZE D (Thus, γ(s) = e−sd(x,y˜(s)) dH m(x).) The following Fig. 1. Illustration of the sets F , F , and F¯ (assuming D = 1). E 1 2 0 corollary shows that even if we do not know whether γ(s) is differentiable, weR can construct a set of lower bounds on F , . Then = ¯, where ¯ is the upper envelope of the RD function. To this end, we define 1 2, where F1 ∪F2 G F F F F F ∪F given by ∗ ∗ e , D (s), RSLB(D (s),s) : s> 0 F1 e e ′ and was defined in (86).  ¯ , (D, R) : R = max R . (87) 2 e F ∈F (D,R′)∈F F   Corollary 59: Let x be an m-rectifiable random variable Proof: All elements (D, R) can be written as with support . Then is a set of lower bounds on the RD ∈ F E F (D, R) = D, RSLB(D,s) for some s 0. Indeed, for function, i.e., for each (D, R) , we have R(D) R. ≥ ∈ F ≥ (D, R) 1 this is obvious, and for (D, R) 2 we have Proof: Let (D, R)e . ∈F  ∈F ∈ F e Case (D, R) 1: In this case, we have (D, R) = m m (a) m (79) R = h (x) log H ( ) = h (x) log γ(0) = RSLB(D, 0) ∗ ∗∈ F − E − D (s), RSLB(D (s),s) efor some s > 0. Thus, R = (88) ∗ RSLB(D (s),s)= RSLBe (D,s) and, by (79), R R(D). (80) m m  ≤ where (a) holds because γ(0) = 1 dH (x) = H ( ). Case (D, R) : In this case, as in (88), we have R = E E 2 Hence, for all (D, R) , we obtain R (D, 0). By∈ (79), F we have R (D, 0) R(D), which ∈F R SLB SLB implies R R(D). ≤ ∗ (89) ≤ R sup RSLB(D,s)= RSLB(D) . In either case R R(D), which concludes the proof. ≤ s≥0 ≤ By Corollary 59, we can use the sets 1 and 2 to construct Because ¯ , (89) also holds for (D, R) ¯. lower bounds on the RD function.17 MoreF specifically,F these F⊆F ¯ ∈ F Consider now the pair (D, R) for a fixed D>D0. bounds are obtained via the following program:e ′ ∈ F ′ By (87), for a pair (D, R ) we obtain R R . ∗ ∈ F ≥ (P1) Calculate D (s) for s (0, ). In particular, for s > 0 satisfying D(s) = D, the pair ∈ ∞ ∗ ∗ (P2) Plot the s-parametrized curve D (s), RSLB(D (s),s) D, RSLB(D,s) belongs to 1 , and thus for s (0, ). F ⊆F  e (P3) Plot∈ the horizontal∞ line D, hm(x) log H m( ) for  R RSLB(D,s) . (90) − E ≥ D (D , ). ∈ 0 ∞  Similarly, D, hm(x) log H m( ) , and thus (P4) Take the upper envelope of these two curves. − E ∈F2 ⊆F In the subsequent Section X-C, we will apply the program m m (88) R h (x) log H ( ) = RSLB(D, 0) . (91) (P1)–(P4) to a specific example. ≥ − E Combining (90) for all s> 0 satisfying D(s)= D and (91), C. Shannon Lower Bound on the Unit Circle we obtain e To demonstrate the practical relevance of Theorem 55, we (84) ∗ apply it to the simple example given by = 1, i.e., the R max RSLB(D, 0), sup RSLB(D,s) = RSLB(D) . ≥ e R2 E S  s>0:D(s)=D  unit circle in , and squared error distortion, i.e., d(x, y)= (92) x y 2. In order to calculate γ(s), we first show that it can Combining (89) and (92) for an arbitrary (D, R) ¯ implies kbe expressed− k as in (93), i.e., ∗ ∈ F that R = RSLB(D). By (85), this yields (D, R) and ∈ G x y 2 thus ¯ . Because both sets and ¯ contain exactly one γ(s) = max e−sk − k dH 1(x) y∈R2 F ⊆ G G F ¯ S1 element (D, R) for each D>D0, we obtain = . Z In certain cases, it may not be possibleF to differentiateG 17If F = F , we obtain by Corollary 58 that these bounds will be the γ(s), and thus the direct calculation of D(s) = γ′(s)/γ(s) 1 1 − best Shannone lower bounds. However, explicit smoothness conditions that is not possible. However, one can show that, under certain guarantee F1 = F1 are difficult to find. e e 21 for all s 0. Let s 0 be arbitrary but fixed. 
Note that we can restrict to y = (y_1  0)^T, with y_1 ≥ 0, because the problem is invariant under rotations. Thus,

x y 2 2 2 e−sk − k dH 1(x)= e−s((x1−y1) +x2) dH 1(x) 100 ZS1 ZS1 and therefore we have to maximize the function γ(s) 2 2 −s((x1−y1) +x ) 1 fs(y1) , e 2 dH (x) S1 Z 10−1 ′ on [0, ). To this end, we consider the derivative fs and change∞ the order of differentiation and integration (according H 1 to [27, Cor. 5.9], this is justified because S1 is a finite 2 2 −s((x1−y1) +x2) | T measure and 0 < e 1 for (x1 x2) 1). −2 −1 0 1 2 3 This results in the expression ≤ ∈ S 10 10 10 10 10 10 s 2 2 f ′(y )= 2s(x y ) e−s((x1−y1) +x2) dH 1(x) . (94) Fig. 2. Graph of γ(s) for s ∈ [0.01, 5000]. s 1 1 − 1 ZS1 ′ 3 Because x1 1 for x 1, we have fs(y1) < 0 for y > 1, i.e.,≤ f is monotonically∈ S decreasing on (1, ). 1 s ∞ Thus, the function fs can only attain its maximum in the compact interval [0, 1]. Because fs is a continuous function, −skx−yk2 1 2 we conclude that γ(s) = maxy R2 e dH (x) ∈ S1 exists for each s 0. ≥ R −2 To characterize γ(s) in more detail, we consider the equa- RSLB(10 ,s) ′ tion fs(y1)=0 to find local maxima. By (94) and because x2 + x2 =1 for x , f ′ (y )=0 is equivalent to 1 1 2 ∈ S1 s 1 2 2se−s(1+y1) (x y ) e2sx1y1 dH 1(x)=0 . (95) 1 − 1 ZS1 −s(1+y2) 0 Furthermore, because 2se 1 > 0 and using the trans- 20 40 60 80 formation x1 = cos φ, x2 = sin φ, we obtain that (95) is equivalent to s Fig. 3. Shannon lower bound R (10−2,s) for s ∈ [1, 94]. 2π SLB (cos φ y ) e2sy1 cos φ dφ =0 . (96) − 1 Z0 ′ Theorem 60: Let the random variable x on R2 be uniformly Because we know that the function fs can only have zeros on [0, 1], we can solve (96) numerically for any fixed s 0 and distributed on the unit circle, and consider squared error ≥ distortion, i.e., d(x, y)= x y 2. For any n N, compare the values fs(y1) at the different solutions y1 and at k − k ∈ the boundary points y =0 and y =1 to find γ(s). In Fig. 2, 1 1 R(D¯ ) log n (97) the values of γ(s) are depicted for s [0.01, 5000]. n ≤ ∈ We now have all the ingredients to calculate the parametric where 2 lower bound RSLB(D,s) in (79) for any given distortion D n π D¯ n =1 sin . (98) and an arbitrary source x on 1. In particular, let us consider − π n a uniform distribution of x onS , where h1(x) = log(2π)   S1 (see (37)). In Fig. 3, we show the lower bound RSLB(D,s) for Proof: See Appendix L. s [1, 94] and distortion D = 10−2. It can be seen that the The upper bound depicted in Fig. 4 was obtained by linearly ∈ −2 maximal lower bound RSLB(10 ,s) is obtained for s 50. interpolating the upper bounds (97) corresponding to different ≈ ¯ To plot Fig. 3, we had to calculate γ(s) for many different values of n (and, hence, of Dn). This is justified by the values of s. We also used “trial and error” to find the region convexity of the RD function [13, Lem. 10.4.1]. Note that −2 of s where the maximal lower bound RSLB(10 ,s) arises. To the lower and upper bounds shown in Fig. 4 are quite close, avoid this tedious optimization procedure, which would have and thus they provide a rather accurate characterization of the to be carried out for each value of D under consideration, we RD function of x. can use the program (P1)–(P4) formulated in Section X-B. In Fig. 4, we show the lower bounds on R(D) resulting from XI. CONCLUSION this program for 5 , which corresponds to s [1, 10 ] D [5 We presented a generalization of entropy to singular ran- −5 . We also show∈ in Fig. 4 an upper bound on ∈ · 10 , 1] R(D) dom variables supported on integer-dimensional subsets of using the following result. Euclidean space. More specifically, we considered random 22
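The optimization just described can be carried out with a few lines of code. The sketch below (Python with NumPy assumed; all grid resolutions and the search range for s are ad-hoc choices) evaluates γ(s) by restricting y to (y_1 0)^T with y_1 ∈ [0, 1], forms R_SLB(D, s) = h^1(x) − sD − log γ(s) with h^1(x) = log(2π), and maximizes over s for D = 10^{-2}. The maximizer lands near s ≈ 50, consistent with the observation made for Fig. 3.

```python
import numpy as np

# Numerical Shannon lower bound for the uniform source on the unit circle with
# squared error distortion: gamma(s) = max_{y1 in [0,1]} int_{S_1} exp(-s*||x-y||^2) dH^1.
phi = np.linspace(0.0, 2 * np.pi, 4000, endpoint=False)
dphi_weight = 2 * np.pi / phi.size

def gamma(s):
    y1 = np.linspace(0.0, 1.0, 401)[:, None]
    integrand = np.exp(-s * ((np.cos(phi)[None, :] - y1) ** 2 + np.sin(phi)[None, :] ** 2))
    return (integrand.sum(axis=1) * dphi_weight).max()

def r_slb(D, s):
    return np.log(2 * np.pi) - s * D - np.log(gamma(s))   # h^1(x) = log(2*pi) by (37)

D = 1e-2
s_grid = np.linspace(1.0, 200.0, 400)
vals = np.array([r_slb(D, s) for s in s_grid])
print(s_grid[vals.argmax()], vals.max())   # maximum near s ~ 50
```

Evaluating the same loop on a grid of D values yields the curves needed for the program (P1)–(P4) of Section X-B.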

6 we provided a characterization of the minimal codeword length upper bound on R(D) for quantizations of integer-dimensional sources. Furthermore, lower bound on R(D) we presented a result in rate-distortion theory that generalizes the Shannon lower bound for discrete and continuous random 4 variables to the larger class of rectifiable random variables. The usefulness of this bound was demonstrated by the example of a R uniform source on the unit circle. The resulting bound appears to be the first rigorous lower bound on the rate-distortion 2 function for that distribution. Possible directions for future work include the extension of our entropy definition to distributions mixing different dimensions (e.g., discrete-continuous mixtures). The extension to noninteger-dimensional singular distributions seems to be 0 10−4 10−3 10−2 10−1 100 possible only in terms of upper and lower entropies, which D could be defined based on the upper and lower Hausdorff 18 Fig. 4. Shannon lower bound constructed by (P1)–(P4) and upper bound densities [14, Def. 2.55]. Furthermore, our entropy can be (97) for a source x on R2 uniformly distributed on the unit circle and squared extended to infinite-length sequences of rectifiable random error distortion. variables, which leads to the definition of an generalizing the (differential) entropy rate of a sequence of discrete or continuous random variables. Finally, applications variables distributed according to a rectifiable measure. Sim- of our entropy to source coding and channel coding problems ilar to continuous random variables, these rectifiable random involving integer-dimensional singular random variables are variables can be described by a density. However, in contrast largely unexplored. to continuous random variables, the density is nonzero only on a lower-dimensional subset and has to be integrated with APPENDIX A respect to a Hausdorff measure to calculate probabilities. PROOF OF LEMMA 9 Our entropy definition is based on this Hausdorff density To prove the existence of a support , we have to but otherwise resembles the usual definition of differential construct a set that satisfies (cf. DefinitionE ⊆E 8) entropy. However, this formal similarity has to be interpreted E N Rm (i) = k∈N fk( k) where, for k e , k is a with caution because Hausdorff measures and projections of E C m ∈ MC ⊆ bounded Borele set and fk : R R is a Lipschitz rectifiable sets do not always conform to intuition. We thus S → functione that is one-to-one on k; emphasized mathematical rigor and carefully stated all the (ii) ; C E⊆E assumptions underlying our results. H m e (iii) µ E ; ≪dµ | We showed that for the special cases of rectifiable random (iv) e H m e-almost everywhere. H m e > 0 E variables given by discrete and continuous random variables, d |E | To prove (i), we note that, by (9), the m-rectifiable set our entropy definition reduces to classical entropy and dif- satisfies 0 k∈N fk( k) with bounded Borel sets ferential entropy, respectively. Furthermore, we established a E m E ⊆E ∪ A m M k R , Lipschitz functions fk : R R that are one- connection between our entropy and differential entropy fora to-oneA ⊆ on , and HSm( )=0. Because→ µ H m , we rectifiable random variable that is obtained from a continuous k 0 E A H m ∗ E ∗ , ≪ | obtain µ E where k∈N fk( k). Thus, the random variable through a one-to-one transformation. For joint ≪ | Edµ A dµ Radon-Nikodym derivative H m exists. 
Note that H m and conditional entropy, our analysis showed that the geometry d |E∗ d |E∗ is in fact an equivalence class of measurableS functions and of the support sets of the random variables plays an important m only defined up to a set of H ∗ -measure zero. Because role. This role is evidenced by the facts that the chain rule E µ( c)=0 and µ(( ∗)c)=0, we| can choose a function g may contain a geometric correction term and conditioning may E E dµ in the equivalence class of H m satisfying g(x)=0 on increase entropy. d |E∗ ( ∗)c. Since g is a measurable function, the set g−1( 0 ) Random variables that are neither discrete nor continuous isE∩EH m-measurable. Furthermore, because ∗ is m-rectifiable,{ } are not only of theoretical interest. Continuity of a ran- Property 1 in Lemma 4 implies that the subsetE g−1( 0 ) ∗ dom variable cannot be assumed if there are deterministic is again m-rectifiable. By [28, Lem. 15.5(4)], there{ exists} ∩E a dependencies reducing the intrinsic dimension of the ran- Borel set satisfying dom variable, which is especially likely to occur in high- B0 −1 ∗ dimensional problems. As two basic examples, we considered 0 g ( 0 ) (99) R2 B ⊇ { } ∩E a random variable x supported on the unit circle, m −1 ∗ ∈ and H ( 0 (g ( 0 ) ))=0. The absolute continuity which is intrinsically only one-dimensional, and the class of mB \ m{ } ∩E µ H ∗ H then implies positive semidefinite rank-one random matrices. In both cases, ≪ |E ≪ −1 ∗ the differential entropy is not defined and, in fact, classical µ( 0 (g ( 0 ) ))=0 . (100) information theory does not provide a rigorous definition of B \ { } ∩E 18The upper and lower Hausdorff densities exist for arbitrary distributions, entropy for these random variables. whereas, by Preiss’ Theorem [16, Th. 5.6], the existence of the Hausdorff As an application of our entropy definition to source coding, density implies that the measure is rectifiable. 23

We further have have for an arbitrary measurable set RM A⊆ m µ( )= g(x) dH E∗ (x) µ( ) µ( (g−1( 0 ) ∗)) + µ(g−1( 0 ) ∗)) A | 0 0 ZA B ≤ B \ −1 { } ∩E∗ −1 { } ∩E µ( 0 (g ( 0 ) )) + µ(g ( 0 )) m m ≤ B \ { } ∩E { } = g(x) dH E∗ (x)+ g(x) dH E∗(x) (a) e | ec | =0 (101) ZA∩E ZA∩E (a) m m = g(x) dH e(x)+ g(x) dH E∗(x) E e A | A∩Ec | where (a) holds by (100) and because µ(g−1( 0 )) = Z Z m { } m c − g(x) dH ∗ (x)=0. Let us define H e g 1({0}) E = g(x) d E (x)+ µ( ) | A | A ∩ E R Z (b) m e −1 = g(x) dH e(x) , f ( f ( )) (102) E E k Ak \ k B0 A | k∈N Z [ where (a) holds because ∗ (see (103)) and (b) holds e E ⊆E because µ( c)=0 (indeed H m e( c)= H m( where f −1( ) are bounded Borel sets (this is because A ∩ E |E A ∩ E E ∩ Ak \ k B0 c)=0 implies µ( e c)=0 by (iii)). By (103), we k are bounded Borel sets, fk are continuous functions, and A ∩ E A ∩ E A have ∗, whiche implies e e 0 is a Borel set). As fk are Lipschitz functions that are one- E⊆E B −1 me ∗ c m e∗ c m ∗ ∗ c to-one on k, and thus also on k f ( 0), this shows H e(( ) )= H ( ( ) ) H ( ( ) )=0 . A A \ k B eE (i). | E E ∩ E ≤ E ∩ E (104) −1 By (99), we have c −1 c ∗ c. Hence, for x Next, we prove (ii). We have y fk( k fk ( 0)) if and 0 eg ( 0 ) ( ) −1∈ A \ B c B ⊆ −{1 } ∪c E ∈ only if there exists x k fk ( 0) such that fk(x) = y, 0 we have either x g ( 0 ) —which is equivalent ∈ A \ B ′ B ∈ ∗ c { } which in turn holds if and only if there exists x k such to g(x) > 0—or x ( ) . By (104), we therefore have −∈1 A ∈ E  ′ H m e c that fk(x ) = y and y / 0. Hence, fk( k fk ( 0)) = for E -almost all x 0 that g(x) > 0. In particular, ∈ B A \ B | ∗∈ B c f ( ) . We can thus rewrite in (102) as because, by (103), = 0 , we obtain g(x) > 0 for k Ak \B0 E E E \B ⊆B0 H m e-almost all x . This proves (iv). |E ∈ E ∗ e∗ −1 ∗ Finally, we showe that the support is unique up to sets of = fk( k) 0 = 0 (g ( 0 ) ) (103) H m E A \B E \B ⊆E \ { } ∩E -measure zero. Lete 1 and 2 be two support sets of k∈N E E [ an m-rectifiable measure µ, and denote the Radon-Nikodym e derivative dµ by . Then dH m|E g2 where the final inclusion holds by (99). Because we chose 2 x on ∗ c c ∗ c, we obtain c −1 . g( )=0 ( ) = ( ) g ( 0 ) g (x) dH m (x)= µ( )=0 (105) Inserting thisE∩E into (103) yieldsE ∪ E E ⊆ { } 2 |E2 E2 \E1 ZE2\E1 c where the latter equality holds because µ( 1 )=0 (indeed, ∗ c ∗ ∗ ∗ c ∗ H m c c E H m ( )= ( ( ) )= E1 ( 1 )=0 implies µ( 1 )=0 due to µ E1 ). E⊆E \ E ∩E E ∩ E ∪ E E ∩E⊆E | E E H m ≪ | Since by Definition 8 g2 > 0 on 2 E2 -almost every- m E | e where, (105) implies H ( 2 1)=0. By an analogous which is (ii). m E \ E argument, we obtain H ( 1 2)=0. This shows that 1 To prove (iii), we start with an arbitrary H m-measurable E \E m E and 2 coincide up to a set of H -measure zero. set RM with H m e( )=0. We have E A⊆ |E A APPENDIX B m m ∗ PROOF OF THEOREM 13 H ∗ ( )= H ( ( )) |E A\B0 E ∩ A\B0 = H m( ∗ c) Proof of Part 1: Let x be 0-rectifiable with support , i.e., 0 −1 0 E m E ∗∩A∩B µx H E for a 0-rectifiable set . Recall that a 0- = H (( 0) ) ≪ | E E \B ∩ A rectifiable set is by definition countable, i.e., = xi : (a) E E { = H m( ) i for a countable index set . By (16), Pr x =1, ∈ I} I { ∈ E} m E ∩ A which implies that x is a discrete random variable. Finally, H e = E ( ) | eA =0 px(x ) = Pr x = x i { i} (12) = µx−1( x ) { i} where holds because ∗ by (103). Because −1 (a) = 0 µ dµx 0 m E E \B ≪ = (x) dH (x) H ∗ , this implies µ( )=0 and, since µ( )=0 0 E E 0 0 dH E | | A\B H m B Z{xi} | by (101), we obtain µ( )=0e . 
Thus, e( )=0 implies −1 A |E A (a) dµx µ( )=0, which proves (iii). = (x ) A dH 0 i To prove (iv), we first show that is also in the equivalence E g (13) | dµ 0 x class of the Radon-Nikodym derivative H m . Indeed, we = θx ( i) d |Ee 24

0 m M where (a) holds because H is the counting measure. for H φ(E)-almost every y R . We conclude that Conversely, let x be a discrete random variable taking on | ∈ m (21) m m m the values xi, i . We set , xi : i , which h (y) = θ (y) log θ (y) dH (y) ∈ I E { ∈ I} − y y is countable and, thus, 0-rectifiable. Because includes all Zφ(E) c E −1 c m −1 possible values of x, we have Pr x = µx ( )=0. (107) θx (φ (y)) { ∈ E } E = For RM , the measure H 0 ( ) counts the number of − J E (φ−1(y)) A ⊆ |E A Zφ(E) φ points in that also belong to . Thus, for any set such θm(φ−1(y)) that H 0 A( )=0, we obtainE that = andA hence log x dH m(y) E × J E (φ−1(y)) c.| ThisA implies µx−1( ) µxA∩E−1( c)=0∅ . Thus, we  φ  m m A⊆E −1 A ≤ E H 0 (a) θ (x) θ (x) showed that µx ( )=0 for any set with E ( )=0, = x log x J E (x) dH m(x) −1 0 A A | A E E φ i.e., µx H . Hence, x is 0-rectifiable. − E J (x) J (x) ≪ |E Z φ  φ  Proof of Part 2: Let x be M-rectifiable on RM , i.e., m m H m µx−1 H M for an M-rectifiable set . Because H M = θx (x) log θx (x) d (x) E − E is equal≪ to the| Lebesgue measure L M [18,E Th. 2.10.35], Z −1 M M m x E x H m x we obtain µx L E L . Thus, by the Radon- + θx ( ) log Jφ ( ) d ( ) ≪ | ≪ E Nikodym theorem, there exists the Radon-Nikodym derivative (15) Z x−1 m E E dµ L M = h (x)+ x[log Jφ (x)] fx = dL M satisfying Pr x = A fx(x) d (x) for RM{ ∈ A} any measurable , i.e., x is a continuous random where (a) holds because of the generalized area formula [14, AM ⊆ L M R variable. By (13), θx = fx -almost everywhere. Th. 2.91]. Conversely, let x be a continuous random variable on RM with probability density function fx. For a measurable set APPENDIX D RM L M −1 satisfying ( )=0, we obtain µx ( ) = PROOF OF THEOREM 28 A ⊆ AL M A Pr x = A fx(x) d (x) = 0. Thus, we have {−1 ∈ A}M M M M Proof of Properties 1 and 3: We first show that for any x L . Because L H H RM , this is µ = = −1 M +M ≪ −1R H M | µ(x, y) -measurable set R 1 2 equivalent to µx RM . Furthermore, by Property 6 A⊆ RM ≪ | in Lemma 4, is M-rectifiable. It then follows from µ(x, y)−1( )=Pr (x, y) Definition 11 that x is an M-rectifiable random variable. A { ∈ A} m1 m2 H m1+m2 = θx (x)θy (y) d E1×E2 (x, y) . APPENDIX C A | Z (108) PROOF OF THEOREM 20 To this end, we first consider the rectangles with We first note that the set φ( ) is m-rectifiable because is A1 × A2 E E RM1 H m1 -measurable and RM2 H m2 - m-rectifiable and because of Property 3 in Lemma 4. To prove A1 ⊆ A2 ⊆ −1 m measurable. We have that y is m-rectifiable, we will show that µy H φ(E). RM ≪ | For a measurable set , we have Pr (x, y) A⊆ { ∈ A1 ×A2} µy−1( )=Pr φ(x) (a) A { ∈ A} = Pr x 1 Pr y 2 −1 { ∈ A } { ∈ A } = Pr x φ ( ) (14) { ∈ A } = θm1 (x) dH m1 (x) θm2 (y) dH m2 (y) (14) x E1 y E2 m H m A | A | = θx (x) d E (x) Z 1 Z 2 φ−1(A) | (b) Z = θm1 (x)θm2 (y) d H m1 H m2 (x, y) θm(x) x y E1 E2 x E H m A1×A2 | × | = E Jφ (x) d (x) Z φ−1(A)∩E J (x) (c)  Z φ = θm1 (x)θm2 (y) dH m1+m2 (x, y) (109) m −1 x y E1×E2 (a) θ (φ (y)) A ×A | = x dH m(y) Z 1 2 J E (φ−1(y)) ZA∩φ(E) φ where (a) holds because x and y are independent, (b) θm(φ−1(y)) holds by Fubini’s theorem, and (c) holds by Lemma 27. = x dH m (y) . (106) E −1 φ(E) Because the rectangles generate the µ(x, y)−1-measurable A Jφ (φ (y)) | Z sets, (109) implies (108). For a µ(x, y)−1-measurable set Here, holds because of the generalized area formula [14, M1+M2 m1+m2 (a) R satisfying H E ×E ( ) = 0, (108) −1 1 2 Th. 
2.91], and φ : φ( ) is well defined because φ is A ⊆ −1 |−1 HA m1+m2 implies µ(x, y) ( )=0, i.e., µ(x, y) E1×E2 one-to-one on . For aE measurable→ E set RM satisfying A ≪ | (note that this is Property 3). Furthermore, since x is m1- H m E , (106) implies y−1 A ⊆ , i.e., y−1 φ(E)( )=0 µ ( )=0 µ rectifiable and y is m2-rectifiable, it follows from Lemma 27 H m| .A Thus, y is an -rectifiable randomA variable. ≪ φ(E) m that 1 2 is (m1 + m2)-rectifiable. Hence, according to | m −1 θx (φ (y)) E ×E By (106), E −1 equals the Radon-Nikodym derivative Definition 11, (x, y) is an (m1 + m2)-rectifiable random Jφ (φ (y)) dµy−1 variable, thus proving Property 1. H m (y), and thus we obtain d |φ(E) Proof of Property 2: Again due to (108), m −1 −1 θ (φ (y)) dµy (13) −1 x m m m dµ(x, y) (13) m +m = (y) = θy (y) (107) 1 2 1 2 E −1 H m θx θy = m +m = θ(x,y) . Jφ (φ (y)) d φ(E) dH 1 2 | |E1×E2 25

m M1 Proof of Property 4: We have disjoint sets i satisfying H ( i) < . For 1 R m1 F M2F m2 ∞ A ⊆ H -measurable and 2 R H -measurable, we have hm1+m2 (x, y) A ⊆ (32) H m = θm1+m2 (x, y) log θm1+m2 (x, y) g(x, y) d (x, y) (x,y) (x,y) (A ×A )∩E − RM1+M2 1 2 Z Z m1+m2 dH E ×E (x, y) H m × | 1 2 = g(x, y) d (x, y) (34) N (A1×A2)∩Fi m1 m2 m1 m2 i∈ Z = θx (x)θy (y) log θx (x)θy (y) X ′ − RM1+M2 (a) g(x, y ) Z = dH m−m2 (x, y′) H m1+m2  E ′ d E1×E2 (x, y) Jp (x, y ) × | i∈N Z e Z y XA2∩E2 (A1×A2)∩ (a) −1 H m2 m1 m2 m1 m2 p ({y})∩F d (y) = θx (x)θy (y) log θx (x)θy (y) y i × − RM1+M2 (111) Z m1 m2 d H E H E (x, y) × | 1 × | 2 where (a) holds by the classical version of the general coarea (b) m1 m2 m1 = θx (x)θy (y) log θx (x)  formula [18, Th. 3.2.22] (note that 2 and i have finite RM RM E F − 2 1 Hausdorff measure) and because J E > 0 H m -almost Z Z  py E m2 H m1 H m2 | + log θy (y) d E1 (x) d E2 (y) everywhere. By either (i) or (ii), we cane apply Fubini’s theorem | | in (111) and change the order of integration and summation. (c)  m1 m1 H m1 = θx (x) log θx (x) d E1 (x) We thus obtain − RM1 | Z m m2 m2 H m2 g(x, y) dH (x, y) θy (y) log θy (y) d E2 (y) − RM2 | (A ×A )∩E Z 1 2 (19) Z = hm1 (x)+ hm2 (y) . g(x, y′) H m−m2 x y′ = E ′ d ( , ) Jp (x, y ) Here, (a) holds by Lemma 27, (b) holds by Fubini’s theorem, Z e i∈N Z y ! A2∩E2 X (A1×A2)∩ −1 and (c) holds because, by (16), py ({y})∩Fi dH m2 (y) × m m m m ′ θ 1 (x) dH 1 (x)= θ 2 (y) dH 2 (y)=1. g(x, y ) m−m ′ m x E1 y E2 = dH 2 (x, y ) dH 2 (y) RM1 | RM2 | E ′ Z Z Jp (x, y ) Z e Z y A2∩E2 (A1×A2)∩ APPENDIX E −1 py ({y})∩E PROOF OF THEOREM 31 (a) g(x, y) = dH m−m2 (x, y′) dH m2 (y) We will use the generalized coarea formula [18, Th. 3.2.22] E Jp (x, y) Z e Z y several times in our proofs. Because the classical version of A2∩E2 (A1×A2)∩ −1 y the generalized coarea formula only holds for sets of finite py ({ })∩E Hausdorff measure, we first present an adaptation that is suited (b) g(x, y) H m−m2 x H m2 y to our setting. = E d ( ) d ( ) e (y) J (x, y) ZA2∩E2 ZA1∩E py Theorem 61: Let RM1+M2 be an m-rectifiable set. Fur- ′ ′ −1 , E ⊆ RM2 where (a) holds because y = y for all (x, y ) p ( y ), thermore, let 2 py( ) be m2-rectifiable (m2 M2, ∈ y { } E HEm2⊆ E H≤m and (b) holds because the Hausdorff measure does not depend m m2 M1), ( 2) < , and Jpy > 0 E - almost− everywhere.≤ e Finally,E assume∞ that g : R is H |m- on the ambient space [14, Remark 2.48], i.e., integration E → with respect to H m−m2 on the affine subspace p−1( y ) measurable and satisfies eithere of the following properties: y { } ⊆ RM1+M2 and on RM1 is identical. Thus, we have shown (110). (i) g(x, y) 0 H m-almost everywhere; ≥ H m We now prove the second part of Theorem 61. By [18, (ii) E g(x, y) d (x, y) < . −1 | | ∞ Th. 3.2.22], the sets py ( y ) i are (m m2)-rectifiable for H m−m2 RM1 H m2 { } ∩F − ThenR for all -measurable sets 1 and - H m2 -almost every y RM2 . By Property 5 in Lemma 4, the RM2 A ⊆ measurable sets 2 , same holds for their union∈ p−1( y ) = p−1( y ) A ⊆ i∈N y { } ∩Fi y { } ∩ . The Lipschitz mapping p : RM1+M2 RM1 , p (x, y) = H m x x g(x, y) d (x, y) Ex satisfies S → Z(A1×A2)∩E g(x, y) m−m m −1 H 2 x H 2 y px py ( y ) = E d ( ) d ( ) e (y) J (x, y) { } ∩E ZA2∩E2 ZA1∩E py = x RM1 : y′ RM2 with (x, y′) p−1( y ) (110) { ∈  ∃ ∈ ∈ y { } ∩E} = x RM1 : (x, y) (y) M1 { ∈ ∈ E} where , x R : (x, y) . Furthermore, the (y) E { ∈ ∈ E} = . set (y) is (m m )-rectifiable for H m2 -almost every E A1 ∩E − 2 y RM2 . 
(y) ∈ Thus, is obtained via a Lipschitz mapping from the set m −1 E m2 Proof: By Property 2 in Lemma 4, H is σ-finite. p ( y ) , which is (m m2)-rectifiable for H -almost |E y { } ∩E − Thus, we can partition as = with pairwise every y RM2 . Therefore, by Property 3 in Lemma 4, (y) E E i∈N Fi ∈ E S 26 is again (m m )-rectifiable19 for H m2 -almost every y and some support . Let RM1 and RM2 be − 2 ∈ E2 ⊆ E2 A1 ⊆ A2 ⊆ RM2 . Finally, by Property 1 in Lemma 4, the same is true for H m1 -measurable and H m2 -measurable, respectively. Then (y). e A1 ∩E We now proceed to the proof of Theorem 31. Pr (x, y) 1 2 Proof of Property 1: We have for any H m2 -measurable set { ∈ A ×A } M (44) −1 R 2 = Pr x 1 y = y dµy (y) A⊆ { ∈ A | } ZA2 −1 µy ( ) (13) m2 H m2 A = Pr x 1 y = y θy (y) d E2 (y) = Pr y A { ∈ A | } | { ∈ A} Z 2 RM1 = Pr (x, y) (a) m m 2 H 2 e { ∈ ×A} = Pr x 1 y = y θy (y) d E (y) (113) (14) { ∈ A | } | 2 m H m A2 = θ(x,y)(x, y) d (x, y) Z (RM1 ×A)∩E Z m2 m x y where (a) holds because we can choose θy (y)=0 for y (a) θ(x,y)( , ) ∈ H m−m2 x H m2 y c. On the other hand, we have = E d ( ) d ( ) 2 e (y) J (x, y) E ZA∩E2 ZE py θm (x, y) (x,y) m−m m Pr (x, y) = dH 2 (x) dH 2 e (y) (112) 1 2 E E2 { ∈ A ×A } (y) A E Jpy (x, y) | Z Z (14) m H m = θ(x,y)(x, y) d (x, y) m (A ×A )∩E where in (a) we used (110) for g(x, y) = θ(x,y)(x, y) 0. Z 1 2 m2 m2 ≥ m For an H -measurable set satisfying H e ( )=0, (a) θ(x,y)(x, y) E2 H m−m2 x H m2 y −1 A −1 H m| 2 A = E d ( ) d ( ) (112) implies µy ( )=0, i.e., µy e . Thus, e (y) J (x, y) A ≪ |E2 ZA2∩E2 ZA1∩E py according to Definition 11, y is m2-rectifiable. m −1 m2 θ(x,y)(x, y) Proof of Property 2: Because µy H e , it follows m−m2 m2 E2 = dH (x) dH e (y) ≪ | E E2 from Property 4 in Corollary 12 that there exists a support (y) J (x, y) | ZA2 ZA1∩E py of the random variable y. (114) E2 ⊆ E2 Proof of Property 3: From (112), we see that m e where in (a) we used (110) for g(x, y) = θ x y (x, y) m −1 ( , ) θ(x,y)(x, y) dµy H m2 ≥ H m−m2 m2 0. Combining (113) and (114), we obtain that for e - d (x)= (y)= θy (y) E2 E H m2 m1 M|1 E(y) Jp (x, y) d e almost every y and every H -measurable set R Z y |E2 A1 ⊆ −1 H m2 e where the latter equation holds because µy E . m2 ≪ | 2 Pr x 1 y = y θy (y) This implies (39). { ∈ A | } m Proof of Property 4: Using (39) in (21) and proceeding θ(x,y)(x, y) H m−m2 x (115) = E d ( ) . similarly to (112), we obtain (y) J (x, y) ZA1∩E py m2 y h ( ) m Because (115) holds for H 2 e -almost every y and , m E2 2 2 θ x y (x, y) m2 | E ⊆ E ( , ) m−m2 (115) also holds for H -almost every y. Furthermore, = dH (x) E2 e E | m2 m2 − E E(y) Jp (x, y) because is a support of y, we have θ (y) > 0 H - Z 2  Z y  2 y Ee2 m E H m2 | θ (x˜, y) almost everywhere. Thus, we obtain for E2 -almost every (x,y) m−m m H 2 x H 2 y H m1 RM1| log E d (˜) d ( ) y and every -measurable set 1 × (y) J (x˜, y)  ZE py  A ⊆ m = θ(x,y)(x, y) Pr x 1 y = y − E { ∈ A | } m Z m θ x y (x, y) θ (x˜, y) ( , ) m−m2 (x,y) m−m m = dH (x) H 2 x H x y E m2 log E d (˜) d ( , ) . A ∩E(y) Jp (x, y) θy (y) × (y) J (x˜, y) Z 1 y  ZE py  θm (x, y) (x,y) m−m2 Thus, (40) holds. = dH (y) (x) . (116) E m2 E J (x, y) θy (y) | ZA1 py PPENDIX A F m−m2 Therefore, Pr x y = y H (y) . By Theo- PROOF OF THEOREM 35 { ∈ · | } ≪ |E rem 61, the set (y) is (m m )-rectifiable for H m2 -almost E − 2 Proof of Property 1: By Theorem 31, the random variable y every y. Hence, according to Definition 6, Pr x y = y m2 { ∈ · | } m2 is (m m2)-rectifiable for H E -almost every y. 
is m2-rectifiable with Hausdorff density θy (given by (39)) 2 − | dPr{x∈·| y=y} Proof of Property 2: By (116), we have H m−m2 (x)= d |E(y) m 19 (y) m−m −1 θ (x,y) Note that E is H 2 -measurable because py ({y}) ∩ E is (x,y) H m2 − E m2 for E -almost every y. Thus, (10) im- H m m2 J (x,y) θ (y) 2 -measurable and the Hausdorff measure does not depend on the py y | ambient space [14, Remark 2.48]. plies (45). 27

APPENDIX G
PROOF OF THEOREM 39

Starting from (47), we have

    h^m(x | y) = −∫_{E_2} θ^{m_2}_y(y) ∫_{E(y)} θ^{m−m_2}_{Pr{x∈·|y=y}}(x) log θ^{m−m_2}_{Pr{x∈·|y=y}}(x) dH^{m−m_2}(x) dH^{m_2}(y)
           (45)= −∫_{E_2} θ^{m_2}_y(y) ∫_{E(y)} [θ^m_{(x,y)}(x, y)/(J^E_{p_y}(x, y) θ^{m_2}_y(y))]
                         × log[θ^m_{(x,y)}(x, y)/(J^E_{p_y}(x, y) θ^{m_2}_y(y))] dH^{m−m_2}(x) dH^{m_2}(y)
            (a)= −∫_E θ^m_{(x,y)}(x, y) log[θ^m_{(x,y)}(x, y)/(J^E_{p_y}(x, y) θ^{m_2}_y(y))] dH^m(x, y)
           (15)= −E_{(x,y)}[log(θ^m_{(x,y)}(x, y)/θ^{m_2}_y(y))] + E_{(x,y)}[log J^E_{p_y}(x, y)]

where in (a) we used (110) with A_1 = R^{M_1}, A_2 = R^{M_2}, and g(x, y) = θ^m_{(x,y)}(x, y) log[θ^m_{(x,y)}(x, y)/(J^E_{p_y}(x, y) θ^{m_2}_y(y))]. (Here, g(x, y) is H^m-integrable by our assumption in Theorem 39 that the right-hand side of (48) exists and is finite, i.e., Condition (ii) in Theorem 61 is satisfied.) Thus, (48) holds.
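As a sanity check on the final expression for h^m(x | y) (our own computation, assuming the form of (48) reconstructed above; the specific densities below do not appear in the paper), take again x = cos φ and y = sin φ with φ uniform on [0, 2π), and condition on y. Then θ^1_{(x,y)}(x, y) = 1/(2π) on S_1, θ^1_y(y) = 1/(π√(1 − y²)) = 1/(π|x|), and the tangential Jacobian of the projection p_y along the circle is J^E_{p_y}(x, y) = |x|. Hence

    −E_{(x,y)}[log(θ^1_{(x,y)}(x, y)/θ^1_y(y))] + E_{(x,y)}[log J^E_{p_y}(x, y)] = −E[log(|x|/2)] + E[log |x|] = log 2

which equals the entropy of the conditional distribution (uniform on the two points of E(y)), as one expects since m − m_2 = 0 makes the conditional entropy an ordinary discrete entropy.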

APPENDIX H
PROOF OF THEOREM 46

We first note that the product measure μx^{-1} × μy^{-1} can be interpreted as the joint measure induced by the independent random variables x̃ and ỹ, where x̃ has the same distribution as x and ỹ has the same distribution as y. Because x is m_1-rectifiable and y is m_2-rectifiable, the same holds for x̃ and ỹ, respectively. Furthermore, the Hausdorff densities satisfy θ^{m_1}_{x̃}(x) = θ^{m_1}_x(x) and θ^{m_2}_{ỹ}(y) = θ^{m_2}_y(y). By Properties 1–3 in Theorem 28, the joint random variable (x̃, ỹ) is (m_1 + m_2)-rectifiable with (m_1 + m_2)-dimensional Hausdorff density

    θ^{m_1+m_2}_{(x̃,ỹ)}(x, y) = θ^{m_1}_{x̃}(x) θ^{m_2}_{ỹ}(y) = θ^{m_1}_x(x) θ^{m_2}_y(y).                      (117)

In particular,

    μx^{-1} × μy^{-1} ≪ H^{m_1+m_2}|_{E_1×E_2}.                                                                  (118)

Proof of Part 1 (case m = m_1 + m_2): For any H^m-measurable set A ⊆ R^{M_1+M_2}, we have

    μ(x, y)^{-1}(A)
           (14)= ∫_{A∩E} θ^m_{(x,y)}(x, y) dH^m(x, y)
            (a)= ∫_A θ^m_{(x,y)}(x, y) dH^m|_{E_1×E_2}(x, y)
            (b)= ∫_A [θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y))] θ^{m_1}_x(x) θ^{m_2}_y(y) dH^m|_{E_1×E_2}(x, y)
          (117)= ∫_A [θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y))] θ^{m_1+m_2}_{(x̃,ỹ)}(x, y) dH^m|_{E_1×E_2}(x, y)
            (c)= ∫_A [θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y))] d(μx^{-1} × μy^{-1})(x, y).                  (119)

Here, (a) holds because E ⊆ E_1 × E_2 and because we can choose θ^m_{(x,y)}(x, y) = 0 on E^c, (b) holds because θ^{m_1}_x(x) θ^{m_2}_y(y) > 0 H^m-almost everywhere on E_1 × E_2, and (c) holds because, by (13), θ^{m_1+m_2}_{(x̃,ỹ)} = d(μx^{-1} × μy^{-1})/dH^m|_{E_1×E_2} H^m-almost everywhere. By (119), we obtain that μ(x, y)^{-1} ≪ μx^{-1} × μy^{-1} with Radon–Nikodym derivative

    dμ(x, y)^{-1}/d(μx^{-1} × μy^{-1})(x, y) = θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y)).                     (120)

Inserting (120) into (60) yields

    I(x; y) = ∫_{R^{M_1+M_2}} log[θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y))] dμ(x, y)^{-1}(x, y)
         (13)= ∫_E θ^m_{(x,y)}(x, y) log[θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y))] dH^m(x, y)                 (121)

which is (62). Furthermore, we can rewrite (121) as

    I(x; y) (15)= E_{(x,y)}[log(θ^m_{(x,y)}(x, y)/(θ^{m_1}_x(x) θ^{m_2}_y(y)))]
               = E_{(x,y)}[log θ^m_{(x,y)}(x, y)] − E_{(x,y)}[log θ^{m_1}_x(x)] − E_{(x,y)}[log θ^{m_2}_y(y)]
           (31)= −h^m(x, y) − E_x[log θ^{m_1}_x(x)] − E_y[log θ^{m_2}_y(y)]
           (19)= −h^m(x, y) + h^{m_1}(x) + h^{m_2}(y)                                                             (122)

which is (63). Finally, we obtain the first expression in (64) by inserting (56) into (122). The second expression in (64) is obtained by symmetry.

Proof of Part 2 (case m < m_1 + m_2): Because E is m-rectifiable and m_1 + m_2 > m, we obtain H^{m_1+m_2}(E) = 0. This implies H^{m_1+m_2}|_{E_1×E_2}(E) = 0. On the other hand, by (16), μ(x, y)^{-1}(E) = 1. Thus, we have a contradiction to μ(x, y)^{-1} ≪ H^{m_1+m_2}|_{E_1×E_2}. Hence, by (118), μ(x, y)^{-1} is not absolutely continuous with respect to μx^{-1} × μy^{-1} and, by (61), I(x; y) = ∞.
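The case distinction in this proof can be illustrated with the unit-circle source (our own example): for x = cos φ and y = sin φ with φ uniform on [0, 2π), both marginals are 1-rectifiable, so m_1 + m_2 = 2, whereas the pair (x, y) is supported on the 1-rectifiable set S_1, so m = 1 < m_1 + m_2. Part 2 therefore applies and yields I(x; y) = ∞, which is consistent with the deterministic dependence x² + y² = 1: the joint distribution lives on a Lebesgue-null subset of R² and thus cannot be absolutely continuous with respect to the product of its (absolutely continuous) marginals.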

APPENDIX I
PROOF OF LEMMA 52

Let P denote the set of all finite, measurable partitions of R^M, i.e., for Q = {A_1, ..., A_N} ∈ P the sets A_i are mutually disjoint, measurable, and satisfy ⋃_{i=1}^N A_i = R^M. Using the interpretation of h^m(x) as a generalized entropy with respect to the Hausdorff measure H^m|_E (cf. Remark 19), we obtain by [12, eq. (1.8)]

    h^m(x) = −sup_{Q∈P} Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)].                                             (123)

Because μx^{-1}(E^c) = 0 and H^m|_E(E^c) = 0, we have for all Q ∈ P

    Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)]
        = Σ_{A∈Q} μx^{-1}(A ∩ E) log[μx^{-1}(A ∩ E)/H^m|_E(A ∩ E)]
        = Σ_{A'∈Q̃} μx^{-1}(A') log[μx^{-1}(A')/H^m|_E(A')]                                                        (124)

where Q̃ ≜ {A ∩ E : A ∈ Q} ∈ P^{(E)}_{m,∞}. Hence, for every Q ∈ P there exists a Q̃ ∈ P^{(E)}_{m,∞} such that (124) holds. Thus, the supremum in (123) does not change if we replace P by P^{(E)}_{m,∞}, i.e., we obtain further

    h^m(x) = −sup_{Q∈P^{(E)}_{m,∞}} Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)]
           = inf_{Q∈P^{(E)}_{m,∞}} ( −Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] )                               (125)
           = inf_{Q∈P^{(E)}_{m,∞}} ( −Σ_{A∈Q} μx^{-1}(A) log μx^{-1}(A) + Σ_{A∈Q} μx^{-1}(A) log H^m|_E(A) )
           = inf_{Q∈P^{(E)}_{m,∞}} ( H([x]_Q) + Σ_{A∈Q} μx^{-1}(A) log H^m|_E(A) ).                                (126)

Here, (125) is (68) and (126) is (69).

APPENDIX J
PROOF OF THEOREM 53

J.1 Proof of Lower Bound (70)

Let Q ∈ P^{(E)}_{m,δ} be an (m, δ)-partition of E according to Definition 51, i.e., Q = {A_1, ..., A_N} where ⋃_{i=1}^N A_i = E, A_i ∩ A_j = ∅ for i ≠ j, and H^m|_E(A_i) ≤ δ for all i ∈ {1, ..., N}. Note that Q also belongs to P^{(E)}_{m,∞}. Then, starting from (69), we obtain

    h^m(x) = inf_{Q'∈P^{(E)}_{m,∞}} ( H([x]_{Q'}) + Σ_{A∈Q'} μx^{-1}(A) log H^m|_E(A) )
           ≤ H([x]_Q) + Σ_{i=1}^N μx^{-1}(A_i) log H^m|_E(A_i)
        (a)≤ H([x]_Q) + Σ_{i=1}^N μx^{-1}(A_i) log δ
        (b)= H([x]_Q) + log δ

where (a) holds because H^m|_E(A_i) ≤ δ and (b) holds because Σ_{i=1}^N μx^{-1}(A_i) = μx^{-1}(E) = 1. Multiplying by ld e, we have equivalently

    (h^m(x) − log δ) ld e ≤ H([x]_Q) ld e.                                                                         (127)

By (67), we have

    H([x]_Q) ld e ≤ L*([x]_Q).                                                                                     (128)

Combining (127) and (128), we obtain

    (h^m(x) − log δ) ld e ≤ L*([x]_Q)

which implies (70).

J.2 Proof of Upper Bound (71)

We first state a preliminary result.

Lemma 62: Let x be an m-rectifiable random variable, i.e., μx^{-1} ≪ H^m|_E for an m-rectifiable set E ⊆ R^M, with m ∈ {1, ..., M} and H^m(E) < ∞. Furthermore, let Q = {A_1, ..., A_N} ∈ P^{(E)}_{m,∞}, where each A_i is constructed as the union of disjoint sets A_{i,1}, ..., A_{i,k_i}, i.e., A_i = ⋃_{j=1}^{k_i} A_{i,j} with A_{i,j_1} ∩ A_{i,j_2} = ∅ for j_1 ≠ j_2. Finally, let Q̃ ≜ {A_{1,1}, ..., A_{1,k_1}, ..., A_{N,1}, ..., A_{N,k_N}}. Then

    −Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] ≥ −Σ_{A∈Q̃} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)].                (129)

Proof: The inequality (129) can be written as

    −Σ_{i=1}^N μx^{-1}(A_i) log[μx^{-1}(A_i)/H^m|_E(A_i)] ≥ −Σ_{i=1}^N Σ_{j=1}^{k_i} μx^{-1}(A_{i,j}) log[μx^{-1}(A_{i,j})/H^m|_E(A_{i,j})].

Therefore, it suffices to show that

    μx^{-1}(A_i) log[μx^{-1}(A_i)/H^m|_E(A_i)] ≤ Σ_{j=1}^{k_i} μx^{-1}(A_{i,j}) log[μx^{-1}(A_{i,j})/H^m|_E(A_{i,j})]

for i ∈ {1, ..., N}. This latter inequality follows from the log sum inequality [13, Th. 2.7.1].

We now proceed to the proof of (71). By (68), for each ε' > 0, there exists a partition Q = {A_1, ..., A_N} ∈ P^{(E)}_{m,∞} such that

    h^m(x) > −Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] − ε'.                                                   (130)

Let us choose, in particular, ε' ≜ ε/(2 ld e). We define

    δ_ε ≜ (1 − e^{−ε'}) min_{A_i∈Q: H^m|_E(A_i)≠0} H^m|_E(A_i) > 0.                                               (131)

Choosing some δ ∈ (0, δ_ε), we furthermore define

    J_{i,δ} ≜ H^m|_E(A_i)/δ

and

    M_{i,δ} ≜ ⌈J_{i,δ}⌉ if H^m|_E(A_i) ≠ 0,  and  M_{i,δ} ≜ 1 if H^m|_E(A_i) = 0.

Let us partition each set A_i ∈ Q into M_{i,δ} disjoint subsets A_{i,j} of equal Hausdorff measure, i.e.,

    H^m|_E(A_{i,j}) = H^m|_E(A_i)/M_{i,δ}

and

    A_i = ⋃_{j=1}^{M_{i,δ}} A_{i,j}.

(Because H^m|_E is a nonatomic measure, we can always find subsets of arbitrary but smaller measure; see [29, Sec. 2.5].) For A_i ∈ Q such that H^m|_E(A_i) = 0, we have M_{i,δ} = 1, and thus this partition degenerates to A_{i,1} = A_i, which implies H^m|_E(A_{i,1}) = 0. For A_i ∈ Q such that H^m|_E(A_i) ≠ 0, we have M_{i,δ} = ⌈J_{i,δ}⌉, and thus

    H^m|_E(A_{i,j}) = H^m|_E(A_i)/⌈J_{i,δ}⌉ = (J_{i,δ}/⌈J_{i,δ}⌉) δ.                                               (132)

In either case we have H^m|_E(A_{i,j}) ≤ δ.

Let us denote by Q_δ the partition of E containing all the sets A_{i,j}. Then H^m|_E(A_{i,j}) ≤ δ implies Q_δ ∈ P^{(E)}_{m,δ}. Furthermore, for A_{i,j} ∈ Q_δ satisfying H^m|_E(A_{i,j}) ≠ 0,

    H^m|_E(A_{i,j}) (132)= (J_{i,δ}/⌈J_{i,δ}⌉) δ
                         = (1 − (⌈J_{i,δ}⌉ − J_{i,δ})/⌈J_{i,δ}⌉) δ
                      (a)> (1 − 1/⌈J_{i,δ}⌉) δ                                                                     (133)

where (a) holds because ⌈J_{i,δ}⌉ − J_{i,δ} < 1. Furthermore, we can bound ⌈J_{i,δ}⌉ as (note that H^m|_E(A_{i,j}) ≠ 0 implies H^m|_E(A_i) ≠ 0)

    ⌈J_{i,δ}⌉ ≥ J_{i,δ} = H^m|_E(A_i)/δ
                        > H^m|_E(A_i)/δ_ε
                    (131)= H^m|_E(A_i) / [(1 − e^{−ε'}) min_{A_{i'}∈Q: H^m|_E(A_{i'})≠0} H^m|_E(A_{i'})]
                        ≥ 1/(1 − e^{−ε'}).                                                                         (134)

Inserting (134) into (133), we obtain for all sets A_{i,j} ∈ Q_δ satisfying H^m|_E(A_{i,j}) ≠ 0

    H^m|_E(A_{i,j}) > (1 − (1 − e^{−ε'})) δ = e^{−ε'} δ.                                                           (135)

Combining our results yields

    h^m(x) (130)> −Σ_{A∈Q} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] − ε'
           (129)≥ −Σ_{A∈Q_δ} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] − ε'
             (a)≥ −Σ_{A∈Q_δ: H^m|_E(A)≠0} μx^{-1}(A) log[μx^{-1}(A)/H^m|_E(A)] − ε'
           (135)> −Σ_{A∈Q_δ: H^m|_E(A)≠0} μx^{-1}(A) log[μx^{-1}(A)/(e^{−ε'} δ)] − ε'
             (b)= −Σ_{A∈Q_δ} μx^{-1}(A) log[μx^{-1}(A)/(e^{−ε'} δ)] − ε'
             (c)= −Σ_{A∈Q_δ} μx^{-1}(A) log μx^{-1}(A) + log δ − 2ε'
                = H([x]_{Q_δ}) + log δ − 2ε'
             (d)> (L*([x]_{Q_δ}) − 1)/ld e + log δ − 2ε'                                                           (136)

where (a) and (b) hold because, by μx^{-1} ≪ H^m|_E, H^m|_E(A) = 0 implies μx^{-1}(A) = 0 and thus the additional restriction H^m|_E(A) ≠ 0 removes only summands that are zero, (c) holds because Q_δ is a partition of E and thus Σ_{A∈Q_δ} μx^{-1}(A) = μx^{-1}(E) = 1, and (d) holds by the second inequality in (67). Finally, rewriting (136) gives (recall ε' = ε/(2 ld e))

    L*([x]_{Q_δ}) < h^m(x) ld e − log δ ld e + 1 + ε

which is (71).
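The lower bound (70) can be checked numerically for the uniform unit-circle source, for which h^1(x) = log(2π) nats. The sketch below is our own illustration (the partition construction, the variable names, and the parameter values are ours, not the paper's): it builds a partition of the circle into arcs of H^1-measure at most δ and verifies that the entropy of the quantized source is at least h^1(x) − log δ, which by (67) also lower-bounds L*([x]_Q)/ld e.

```python
import numpy as np

# Numerical sanity check of the lower bound (70) for the uniform unit-circle
# source (our own illustration): x is 1-rectifiable, H^1 is arc length, and the
# Hausdorff density is 1/(2*pi), so h^1(x) = log(2*pi) nats. Any partition of
# the circle into arcs A_i with H^1(A_i) <= delta must satisfy
# H([x]_Q) >= h^1(x) - log(delta).

rng = np.random.default_rng(0)
h = np.log(2 * np.pi)        # h^1(x) for the uniform circle source, in nats
delta = 0.05                 # maximal arc length of a partition cell (assumption)

# Build a random partition of the circle into arcs of length <= delta.
lengths = []
remaining = 2 * np.pi
while remaining > 1e-12:
    ell = min(rng.uniform(0.2 * delta, delta), remaining)
    lengths.append(ell)
    remaining -= ell
lengths = np.array(lengths)

probs = lengths / (2 * np.pi)                 # uniform source: P(A_i) = H^1(A_i)/(2*pi)
H_Q = -np.sum(probs * np.log(probs))          # entropy of the quantized source [nats]
lower = h - np.log(delta)                     # h^1(x) - log(delta)

print(f"H([x]_Q)            = {H_Q:.4f} nats")
print(f"h^1(x) - log(delta) = {lower:.4f} nats")
print("bound satisfied:", H_Q >= lower)       # expected: True
```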

APPENDIX K
PROOF OF LEMMA 56

Because R_SLB(D, s) = h^m(x) − (sD + log γ(s)), where h^m(x) is finite and does not depend on s, it is sufficient to show lim_{s→∞} (sD + log γ(s)) = ∞. For y ∈ R^M, we define the set of all x whose distortion relative to y is less than D/2,

    C(y) ≜ { x ∈ R^M : d(x, y) < D/2 }.

We obtain

    sD + log γ(s)
       (80)= sD + log sup_{y∈R^M} ∫_E e^{−s d(x,y)} dH^m(x)
        (a)= sup_{y∈R^M} ( sD + log ∫_E e^{−s d(x,y)} dH^m(x) )
           = sup_{y∈R^M} log( e^{sD} ∫_E e^{−s d(x,y)} dH^m(x) )
           = sup_{y∈R^M} log ∫_E e^{s(D−d(x,y))} dH^m(x)
        (b)≥ sup_{y∈R^M} log ∫_{E∩C(y)} e^{s(D−d(x,y))} dH^m(x)
        (c)≥ sup_{y∈R^M} log ∫_{E∩C(y)} e^{sD/2} dH^m(x)
           = sup_{y∈R^M} log( e^{sD/2} H^m(E ∩ C(y)) )
           = sD/2 + sup_{y∈R^M} log H^m(E ∩ C(y))                                                                  (137)

where (a) holds because log is a monotonically increasing function, (b) holds because e^{s(D−d(x,y))} > 0, and (c) holds because d(x, y) < D/2 for all x ∈ C(y). Because μx^{-1}(E) = 1 (see (16)), the absolute continuity μx^{-1} ≪ H^m|_E implies H^m(E) > 0. Thus, there exists a ȳ ∈ R^M such that δ ≜ H^m(E ∩ C(ȳ)) > 0. Clearly, this implies sup_{y∈R^M} log H^m(E ∩ C(y)) ≥ log δ, and hence, by (137),

    sD + log γ(s) ≥ sD/2 + log δ.                                                                                  (138)

For fixed but arbitrary K > 0 and all s ≥ 2(K − log δ)/D, we have sD/2 + log δ ≥ K, and thus (138) implies

    sD + log γ(s) ≥ K.

Since K can be chosen arbitrarily large, this shows that lim_{s→∞} (sD + log γ(s)) = ∞.
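The divergence established here can also be observed numerically. The sketch below is our own illustration for the unit-circle source with squared-error distortion (the choice of D, of the evaluation point ȳ, and of the discretization are ours): evaluating the integral in (80) at the single point ȳ = (1, 0)^T gives a lower bound on γ(s), and already this lower bound makes sD + log γ(s) grow without bound, in line with (138).

```python
import numpy as np

# Numerical illustration of Appendix K (our own sketch, not from the paper):
# for the unit-circle source with d(x, y) = ||x - y||^2, gamma(s) is the supremum
# over y of \int_E exp(-s d(x, y)) dH^1(x). Evaluating the integral at the fixed
# point y_bar = (1, 0) lower-bounds gamma(s); even this lower bound shows that
# s*D + log gamma(s) diverges as s -> infinity.

D = 0.5                                   # an arbitrary distortion level (assumption)
phi = np.linspace(0.0, 2.0 * np.pi, 200001)
x = np.stack([np.cos(phi), np.sin(phi)], axis=1)
y_bar = np.array([1.0, 0.0])
dist = np.sum((x - y_bar) ** 2, axis=1)   # d(x, y_bar) along the circle

for s in [1.0, 10.0, 100.0, 1000.0]:
    integral = np.trapz(np.exp(-s * dist), phi)   # \int e^{-s d} dH^1 (arc length = dphi)
    print(f"s = {s:7.1f}   s*D + log gamma_lb(s) = {s * D + np.log(integral):9.3f}")
# The printed values grow without bound (roughly like s*D/2 plus lower-order
# terms), consistent with (138).
```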

APPENDIX L
PROOF OF THEOREM 60

Consider the source x on R^2 as specified in Theorem 60. The main idea of the proof is to construct a specific source code and calculate its rate and expected distortion. We can then use the source coding theorem [25, Th. 11.4.1] to conclude that the calculated rate is an upper bound on the RD function.

To this end, recall that a (k, n) source code for a sequence x_{1:k} ∈ (R^2)^k of k independent realizations of x consists of an encoding function f : (R^2)^k → {1, ..., n} and a decoding function g : {1, ..., n} → (R^2)^k. The rate of this code is defined as R_{f,g} ≜ (log n)/k and the expected distortion is given by

    D_{f,g} = E_{x_{1:k}}[ (1/k) ‖x_{1:k} − g(f(x_{1:k}))‖² ].

By the source coding theorem [25, Th. 11.4.1], every (k, n) code with expected distortion D_{f,g} must have a rate greater than or equal to R(D_{f,g}). In particular, this has to hold for the special case k = 1. The rate of these (1, n) codes reduces to R_{f,g} = log n, and the expected distortion is given by

    D_{f,g} = E_x[ ‖x − g(f(x))‖² ].                                                                               (139)

Thus, the implication of the source coding theorem is that for a (1, n) code with expected distortion D_{f,g}, we have

    log n ≥ R(D_{f,g}).                                                                                            (140)

We directly design the composed function q ≜ g ∘ f. Because x has probability zero outside S_1, we only have to define q on the unit circle. Furthermore, because f maps x to one of at most n distinct values, q = g ∘ f can also attain at most n distinct values. We define q to map each circle segment defined by an angle interval [i 2π/n, (i + 1) 2π/n), i ∈ {0, ..., n − 1}, onto one associated "center" point, which is not constrained to lie on the unit circle. To this end, we only have to consider the circle segment defined by {x = (cos φ  sin φ)^T : φ ∈ [−π/n, π/n)} since the problem is invariant under rotations. Because of symmetry, we choose the "center" associated with this segment to be some point (x_1  0)^T, i.e., q(x) = (x_1  0)^T for all x = (cos φ  sin φ)^T with φ ∈ [−π/n, π/n). According to (139), the expected distortion is then obtained as

    D_q = E_x[ ‖x − q(x)‖² ]
        = ∫_0^{2π} (1/(2π)) ‖(cos φ  sin φ)^T − q((cos φ  sin φ)^T)‖² dφ
        = (n/(2π)) ∫_{−π/n}^{π/n} ‖(cos φ  sin φ)^T − (x_1  0)^T‖² dφ
        = (n/(2π)) ∫_{−π/n}^{π/n} ( (cos φ − x_1)² + sin² φ ) dφ
        = (n/(2π)) ∫_{−π/n}^{π/n} ( 1 + x_1² − 2 x_1 cos φ ) dφ
        = 1 + x_1² − (2 n x_1/π) sin(π/n).                                                                         (141)

Minimizing the expected distortion with respect to x_1 gives the optimum value of x_1 as

    x_1* = (n/π) sin(π/n).                                                                                         (142)

The corresponding quantization function will be denoted by q*. Inserting (142) into (141) yields D̄_n in (98):

    D_{q*} = E_x[ ‖x − q*(x)‖² ]
           = 1 + (n/π)² sin²(π/n) − 2 (n/π)² sin²(π/n)
           = 1 − (n/π)² sin²(π/n)
           = D̄_n.

Thus, we found a (1, n) code with expected distortion D_{q*} = D̄_n. Hence, by (140), we have log n ≥ R(D_{q*}) = R(D̄_n), which is (97).
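The closed-form distortion (141), the optimal center (142), and the resulting D̄_n can be verified by a short Monte Carlo simulation; the sketch below is our own check (the number of cells n and the sample size are arbitrary choices).

```python
import numpy as np

# Monte Carlo check of (141), (142), and D_bar_n (our own verification sketch):
# draw x uniformly on the unit circle, quantize the angle into n sectors, map
# each sector to its optimal "center" at radius x1* = (n/pi)*sin(pi/n), and
# compare the empirical distortion with D_bar_n = 1 - (n/pi)^2 * sin^2(pi/n).

rng = np.random.default_rng(1)
n = 8                                    # number of quantization cells (assumption)
num_samples = 1_000_000

phi = rng.uniform(0.0, 2.0 * np.pi, num_samples)
x = np.stack([np.cos(phi), np.sin(phi)], axis=1)

x1_star = (n / np.pi) * np.sin(np.pi / n)          # optimal center radius, (142)
sector = np.floor(phi / (2.0 * np.pi / n))         # index of the circle segment
center_angle = (sector + 0.5) * (2.0 * np.pi / n)  # mid-angle of the segment
q = x1_star * np.stack([np.cos(center_angle), np.sin(center_angle)], axis=1)

empirical = np.mean(np.sum((x - q) ** 2, axis=1))
D_bar_n = 1.0 - (n / np.pi) ** 2 * np.sin(np.pi / n) ** 2

print(f"empirical distortion  = {empirical:.6f}")
print(f"D_bar_n (closed form) = {D_bar_n:.6f}")    # the two values should agree closely
```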

ACKNOWLEDGMENT
The authors wish to thank the anonymous reviewers for numerous insightful comments and, in particular, for suggesting reference [12].

REFERENCES
[1] C. E. Shannon, "A mathematical theory of communication," Bell Syst. Tech. J., vol. 27, no. 3, pp. 379–423, 623–656, July, Oct. 1948.
[2] I. Csiszár, "Axiomatic characterizations of information measures," Entropy, vol. 10, pp. 261–273, 2008.
[3] A. Kolmogorov, "On the Shannon theory of information transmission in the case of continuous signals," IRE Trans. Inf. Theory, vol. 2, no. 4, pp. 102–108, Dec. 1956.
[4] R. M. Gray, Source Coding Theory. Boston, MA: Kluwer, 1990.
[5] D. Stotz and H. Bölcskei, "Degrees of freedom in vector interference channels," IEEE Trans. Inf. Theory, vol. 62, no. 7, pp. 4172–4197, July 2016.
[6] Y. Wu and S. Verdú, "Rényi information dimension: Fundamental limits of almost lossless analog compression," IEEE Trans. Inf. Theory, vol. 56, no. 8, pp. 3721–3748, Aug. 2010.
[7] R. Palanki, "Iterative decoding for wireless networks," Ph.D. dissertation, California Inst. of Technol., Pasadena, CA, 2004.
[8] G. Koliander, E. Riegler, G. Durisi, and F. Hlawatsch, "Degrees of freedom of generic block-fading MIMO channels without a priori channel state information," IEEE Trans. Inf. Theory, vol. 60, no. 12, pp. 7760–7781, Dec. 2014.
[9] A. Rényi, "On the dimension and entropy of probability distributions," Acta Math. Hung., vol. 10, no. 1, pp. 193–215, 1959.
[10] E. C. Posner and E. R. Rodemich, "Epsilon entropy and data compression," Ann. Math. Statist., vol. 42, no. 6, pp. 2079–2125, 1971.
[11] C. De Lellis, Rectifiable Sets, Densities and Tangent Measures, ser. Zurich Lectures in Advanced Mathematics. Zurich, Switzerland: EMS Publishing House, 2008.
[12] I. Csiszár, "Generalized entropy and quantization problems," in Proc. Sixth Prague Conference on Information Theory, Statistical Decision Functions and Random Processes, Prague, Czechoslovakia, Sep. 1971, pp. 159–174.
[13] T. M. Cover and J. A. Thomas, Elements of Information Theory, 2nd ed. New York, NY: Wiley, 2006.
[14] L. Ambrosio, N. Fusco, and D. Pallara, Functions of Bounded Variation and Free Discontinuity Problems. New York, NY: Oxford Univ. Press, 2000.
[15] T. Kawabata and A. Dembo, "The rate-distortion dimension of sets and measures," IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
[16] D. Preiss, "Geometry of measures in R^n: Distribution, rectifiability, and densities," Ann. Math., vol. 125, no. 3, pp. 537–643, May 1987.
[17] S. Kullback and R. A. Leibler, "On information and sufficiency," Ann. Math. Statist., vol. 22, no. 1, pp. 79–86, 1951.
[18] H. Federer, Geometric Measure Theory. New York, NY: Springer, 1969.
[19] S. G. Krantz and H. R. Parks, Geometric Integration Theory. Basel, Switzerland: Birkhäuser, 2009.
[20] A. S. Kechris, Classical Descriptive Set Theory. New York, NY: Springer, 1995.
[21] R. A. Horn and C. R. Johnson, Matrix Analysis, 2nd ed. Cambridge, UK: Cambridge Univ. Press, 2013.
[22] H. Uhlig, "On singular Wishart and singular multivariate beta distributions," Ann. Statist., vol. 22, no. 1, pp. 395–405, 1994.
[23] C. M. Bishop, Pattern Recognition and Machine Learning. New York, NY: Springer, 2006.
[24] R. M. Gray, Probability, Random Processes, and Ergodic Properties. New York, NY: Springer, 2010.
[25] R. M. Gray, Entropy and Information Theory. New York, NY: Springer, 1990.
[26] I. Csiszár, "On an extremum problem of information theory," Stud. Sci. Math. Hung., vol. 9, no. 1, pp. 57–71, 1974.
[27] R. G. Bartle, The Elements of Integration and Lebesgue Measure. New York, NY: Wiley, 1995.
[28] P. Mattila, Geometry of Sets and Measures in Euclidean Spaces. Cambridge, UK: Cambridge Univ. Press, 1995.
[29] A. Fryszkowski, Fixed Point Theory for Decomposable Sets, ser. Topological Fixed Point Theory and Its Applications. Dordrecht, The Netherlands: Kluwer, 2004, vol. 2.