IEEE TRANSACTIONS ON , VOL. 56, NO. 8, AUGUST 2010 3721 Rényi Information Dimension: Fundamental Limits of Almost Lossless Analog Compression Yihong Wu, Student Member, IEEE, and Sergio Verdú, Fellow, IEEE

Abstract—In Shannon theory, lossless source coding deals with • The central problem is to determine how many compressed the optimal compression of discrete sources. Compressed sensing measurements are sufficient/necessary for recovery with is a lossless coding strategy for analog sources by means of mul- vanishing block error probability as blocklength tends to tiplication by real-valued matrices. In this paper we study almost lossless analog compression for analog memoryless sources in an infinity [2]–[4]. information-theoretic framework, in which the compressor or de- • Random coding is employed to show the existence of compressor is constrained by various regularity conditions, in par- “good” linear encoders. In particular, when the random ticular linearity of the compressor and Lipschitz continuity of the projection matrices follow certain distribution (e.g., stan- decompressor. The fundamental limit is shown to be the informa- dard Gaussian), the restricted isometry property (RIP) is tion dimension proposed by Rényi in 1959. satisfied with overwhelming probability and guarantees Index Terms—Analog compression, compressed sensing, infor- exact recovery. mation measures, Rényi information dimension, Shannon theory, On the other hand, there are also significantly different ingre- source coding. dients in compressed sensing in comparison with information theoretic setups. I. INTRODUCTION • Sources are not modeled probabilistically, and the funda- mental limits are on a worst case basis rather than on av- A. Motivations From Compressed Sensing erage. Moreover, block error probability is with respect to the distribution of the encoding random matrices. HE “Bit” is the universal currency in lossless source • Real-valued sparse vectors are encoded by real numbers T coding theory [1], where Shannon entropy is the fun- instead of bits. damental limit of compression rate for discrete memoryless • The encoder is confined to be linear while generally in sources (DMS). Sources are modeled by stochastic processes information-theoretical problems such as lossless source and redundancy is exploited as probability is concentrated coding we have the freedom to choose the best possible on a set of exponentially small cardinality as blocklength coding scheme. grows. Therefore, by encoding this subset, data compression Departing from the compressed sensing literature, we study fun- is achieved if we tolerate a positive, though arbitrarily small, damental limits of lossless source coding for real-valued mem- block error probability. oryless sources within an information theoretic setup. Compressed sensing [2], [3] has recently emerged as an ap- • Sources are modeled by random processes. This method proach to lossless encoding of analog sources by real numbers is more flexible to describe source redundancy which en- rather than bits. It deals with efficient recovery of an unknown compasses, but is not limited to, sparsity. For example, a real vector from the information provided by linear measure- mixed discrete-continuous distribution is suitable for char- ments. The formulation of the problem is reminiscent of the tra- acterizing linearly sparse vectors [5], [6], i.e., those with a ditional lossless data compression in the following sense. number of nonzero components proportional to the block- • Sources are sparse in the sense that each vector is sup- length with high probability and whose nonzero compo- ported on a set much smaller than the blocklength. 
This nents are drawn from a given continuous distribution. kind of redundancy in terms of sparsity is exploited to • Block error probability is evaluated by averaging with re- achieve effective compression by taking fewer number of spect to the source. linear measurements. • While linear compression plays an important role in our de- • In contrast to lossy data compression, block error proba- velopment, our treatment encompasses weaker regularity bility, instead of distortion, is the performance benchmark. conditions. Methodologically, the relationship between our approach and Manuscript received March 02, 2009; revised April 30, 2010. Date of current compressed sensing is analogous to the relationship between version July 14, 2010. This work was supported in part by the National Science modern coding theory and classical coding theory: classical Foundation under Grants CCF-0635154 and CCF-0728445. The material in this coding theory adopts a worst case (Hamming) approach whose paper was presented in part at the IEEE International Symposium on Informa- tion Theory, Seoul, Korea, July 2009 [55]. goal is to obtain codes with a certain minimum distance, while The authors are with the Department of Electrical Engineering, Princeton modern coding theory adopts a statistical (Shannon) approach University, Princeton, NJ 08544 USA (e-mail: [email protected]; whose goal is to obtain codes with small probability of failure. [email protected]). Communicated by H. Yamamoto, Associate Editor for Shannon Theory. Likewise compressed sensing adopts a worst case model in Digital Object Identifier 10.1109/TIT.2010.2050803 which compressors work provided that the number of nonzero

0018-9448/$26.00 © 2010 IEEE

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3722 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 components in the source does not exceed a certain threshold, version of the random vector. It characterizes the rate of growth while we adopt a statistical model in which compressors work of the information given by successively finer discretizations for most source realizations. In this sense, almost lossless of the space. Although a fundamental information measure, the analog compression can be viewed as an information theoretic Rényi dimension is far less well known than either the Shannon framework for compressed sensing. Probabilistic modeling entropy or the Rényi entropy. Rényi showed that under certain provides elegant results in terms of fundamental limit, as well conditions for an absolutely continuous -dimensional random as sheds light on constructive schemes on individual sequences. vector the information dimension is . Hence, he remarked in For example, not only is random coding a proof technique in [11] that “the geometrical (or topological) and information-the- Shannon theory, but also a guiding principle in modern coding oretical concepts of dimension coincide for absolutely contin- theory as well as in compressed sensing. uous probability distributions.” However, the operational role Recently there have been considerably new developments of Rényi information dimension has not been addressed before in using statistical signal models (e.g., mixed distributions) except in the work of Kawabata and Dembo [12], which relates in compressed sensing (e.g., [5]–[8]), where reconstruction it to the rate-distortion function. It is shown in [12] that when performance is evaluated by computing the asymptotic error the single-letter distortion function satisfies certain conditions, probability in the large-blocklength limit. As discussed in the rate-distortion function of a real-valued source scales Section IV-B, the performance of those practical algorithms proportionally to as , with the proportionality con- still lies far from the fundamental limit. stant being the information dimension of the source. This result serves to drop the assumption of continuity in the asymptotic B. Lossless Source Coding for Analog Sources tightness of Shannon’s lower bound in the low distortion regime. Discrete sources have been the sole object in lossless data In this paper we give an operational characterization of Rényi compression theory. The reason is at least twofold. First, information dimension as the fundamental limit of almost loss- nondiscrete sources have infinite entropy, which implies that less data compression for analog sources under various regu- representation with arbitrarily small block error probability larity constraints of the encoder/decoder. Moreover, we con- requires arbitrarily large rate. On the other hand, even if we sider the problem of lossless Minkowski dimension compres- consider encoding analog sources by real numbers, the result is sion, where the Minkowski dimension of a set measures its de- still trivial, as and have the same cardinality. Therefore, gree of fractality. In this setup we study the minimum upper a single real number is capable of representing a real vector Minkowski dimension of high-probability events of source re- losslessly, yielding a universal compression scheme for any alizations. 
This can be seen as a counterpart of lossless source analog source with zero rate and zero error probability. coding, which seeks the smallest cardinality of high-probability However, it is worth pointing out that the compression events. Rényi information dimension turns out to be the funda- method proposed above is not robust because the bijection be- mental limit for lossless Minkowski dimension compression. tween and is highly irregular. In fact, neither the encoder nor the decoder can be continuous [9, Exercise 6(c), p. 385]. D. Organization of the Paper Therefore, such a compression scheme is useless in the pres- Notations frequently used throughout the paper are sum- ence of any observation noise regardless of the signal-to-noise marized in Section II. Section III gives an overview of Rényi ratio (SNR). This disadvantage motivates us to study how to information dimension, a new interpretation in terms of entropy compress not only losslessly but also gracefully in the real rate and discusses connections with rate-distortion theory. field. In fact some authors have also noticed the importance of Section IV states the main definitions and results, as well as regularity in data compression. In [10] Montanari and Mossel their connections with compressed sensing. Section V contains observed that the optimal data compression scheme often ex- definitions and coding theorems of lossless Minkowski dimen- hibits the following inconvenience: codewords tend to depend sion compression, which are important intermediate results for chaotically on the data; hence, changing a single source symbol Sections VI and VII. New type of concentration-of-measure leads to a radical change in the codeword. In [10], a source type of results are proved for memoryless sources, where it is code is said to be smooth (resp., robust) if the encoder (resp., shown that overwhelmingly large probability is concentrated decoder) is Lipschitz (see Definition 6) with respect to the on subsets of low (Minkowski) dimension. Section VI tackles Hamming distance. The fundamental limits of smooth lossless the case of lossless linear compression, where achievability compression are analyzed in [10] for binary sources via sparse results are given as well as a converse for mixed discrete-con- graph codes. In this paper, we focus on sources in the real field tinuous sources. Section VII is devoted to lossless Lipschitz with general distributions. Introducing a topological structure decompression, where we establish a general converse in terms makes the nature of the problem quite different from traditional of upper information dimension, and its tightness for mixed formulations in the discrete world, and calls for machinery discrete-continuous and self-similar sources. Some technical from dimension theory and geometric measure theory. lemmas are proved in Appendixes I–X. C. Operational Characterization of Rényi Information Dimension II. NOTATIONS In 1959, Alfréd Rényi proposed an information measure for The major notations adopted in this paper are summarized as random vectors in Euclidean space named information dimen- follows. sion [11], through the normalized entropy of a finely quantized • , for .

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3723

• denotes a random vector. A. Definitions denotes a realization of . Definition 1 (Information Dimension [11]): Let be an arbi- • denotes the quantization operator, which can be ap- trary real-valued . Denote for a positive integer plied to real numbers, vectors or subsets of as follows:

(1) (11) (2) Define (3) (12) • . and • For

(4) (13)

where and are called lower and upper information (5) dimensions of , respectively. If , the common value is called the information dimension of , denoted by Then, is a partition of with called , i.e., mesh cubes of size . • denotes the th bit in the binary expansion of (14) , that is Rényi also defined the “entropy of dimension ” as (6) (15) Then provided the limit exists. (7) Definition 1 can be readily extended to random vectors, where the floor function is taken componentwise. Since only (8) depends on the distribution of , we also denote . Similar convention also applies to entropy and other in- formation measures. Similarly, is defined componentwise. Apart from discretization, information dimension can be de- • Let be a metric space. Denote the closed ball of fined from a more general viewpoint: the mesh cubes of size radius centered at by in are the sets for . In particular, in , denote the . For any , the collection par- ball of radius centered at by titions . Hence, for any probability measure on , this (9) partition generates a discrete probability measure on by assigning . Then, the information dimension where the -norm on is defined as of can be expressed as

(10) (16)

• Define , which is not a norm It should be noted that there exist alternative definitions of since for or . However, information dimension in the literature. For example, in [13], is a valid metric on . the lower and upper information dimensions are defined by • Let . For a matrix , denote by replacing with the -entropy with respect to the the submatrix formed by those columns of whose distance. This definition essentially allows unequal partition indices are in . of the whole space and lowers the value of information dimen- • All logarithms in this paper are with respect to base . sion, since . However, the resulting definition is equivalent (see Theorem 23). As an another example, the III. RÉNYI INFORMATION DIMENSION following definition is adopted in [14, Def. 4.2]: In this section, we give an overview of Rényi information di- mension and its properties. Moreover, we give a novel interpre- (17) tation in terms of the entropy rate of the dyadic expansion of the random variable. We also discuss the connection between infor- where denotes the distribution of and is the -ball mation dimension and rate-distortion theory established in [12]. of radius centered at . This definition is equivalent to Defini-

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3724 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 tion 1, as shown in Appendix I. Note that (17) can be generalized probability measure singular with respect to Lebesgue measure to random variables on an arbitrary metric space. but with no atoms (singular part). As shown in [11], the information dimension for the mixture B. Characterizations and Properties of discrete and absolutely continuous distribution can be deter- The lower and upper information dimension of a random vari- mined as follows. able might not always be finite, because can be in- Theorem 1 [11]: Let be a random variable such that finity for all . However, as pointed out in [11], if the mild con- is finite. Assume the distribution of can be repre- dition is satisfied, we have sented as (18) (26) The necessity of this condition is shown in Proposition 1. One sufficient condition for finite information dimension is where is a discrete measure, is an absolutely continuous . Consequently, if for some measure, and . Then , then . (27) Proposition 1: Furthermore, given the finiteness of and (19) admits a simple formula (20) (28) (21) where is the Shannon entropy of is the differ- If , then ential entropy of , and is (22) the binary entropy function. Proof: See [11, Th. 1 and 3] or [16, Th. 1, pp. 588–592]. Proof: See Appendix II. For -valued , (20) can be generalized to Some consequences of Theorem 1 are as follows. As long as . : To calculate the information dimension in (12) and (13), it is 1) is discrete: , and coincides with the sufficient to restrict to the exponential subsequence , as Shannon entropy of . a result of the following proposition. 2) is continuous: , and is equal to the differential entropy of . Proposition 2: 3) is discrete-continuous-mixed: , and is the weighted sum of the entropy of discrete and continuous (23) parts plus a term of . For mixtures of countably many distributions, we have the (24) following theorem.

Proof: See Appendix II. Theorem 2: Let be a discrete random variable with . If exists for all , then exists and Similarly to the approach in Proposition 1, we have the is given by . More generally following. Proposition 3: and are unchanged if rounding or (29) ceiling functions are used in Definition 1. Proof: See Appendix II. (30) C. Evaluation of Information Dimension By the Lebesgue decomposition theorem [15], a probability Proof: For any , the conditional distribution of distribution can be uniquely represented as the mixture given is the same as . Then

(25) (31) where ; is a purely atomic proba- where bility measure (discrete part); is a probability measure abso- lutely continuous with respect to Lebesgue measure, i.e., having (32) a probability density function (continuous part1); and is a

1In measure theory, sometimes a measure is called continuous if it does not have any atoms, and a singular measure is called singularly continuous. Here Since , dividing both sides of (31) by and we say a measure is continuous if and only if it is absolutely continuous. sending yields (29) and (30).

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3725

To summarize, when has a discrete-continuous-mixed dis- E. Self-Similar Distribution tribution, the information dimension of is given by the weight An iterated function system (IFS) is a family of contractions of the continuous part. When the distribution of has a singular on , where , and component, its information dimension does not admit a simple satisfies for all formula in general. For instance, it is possible that with . By [17, Theorem 2.6], given an IFS, there is a [11]. However, for the important class of self-similar sin- unique nonempty compact set , called the invariant set of the gular distributions, the information dimension can be explicitly IFS, such that . We say that the IFS satisfies determined. See Section III-E. the strong separation condition, if are disjoint. The corresponding invariant set is called self-similar, if D. Interpretation of Rényi Dimension as Entropy Rate the IFS consists of similarity transformations, that is, Let a.s. Observe that , since the with an orthogonal matrix and , in which range of contains at most values. Then, case . The dyadic expansion of can be written as (39)

(33) is called the similarity ratio of . Self-similar sets are usually fractal. For example, consider the IFS on with (40) where each is a binary random variable. Therefore, there is a one-to-one correspondence between and the binary random The resulting invariant set is the middle-third Cantor set. process . Note that the partial sum in (33) is Now we define measures supported on a self-similar set associated with the IFS . A continuous mapping (34) from the space (equipped with the product topology) onto is defined as follows: and and are in one-to-one correspon- (41) dence, therefore

(35) where the right-hand side is a singleton [12]. Therefore, every measure on induces a measure on as the By Proposition 2, we have image measure of under , that is, . If is stationary and ergodic, is called a self-similar mea- (36) sure. In the special case when corresponds to a memoryless process with common distribution satis- (37) fies [17, Th. 2.8] (42) Thus, its information dimension is the entropy rate of its dyadic expansion, or the entropy rate of any -ary expansion of , and for each . The usual Cantor distribution divided by . [15] can be defined through the IFS in (40) and . This interpretation of information dimension enables us to The next result gives the information dimension of a self- gain more intuition about the result in Theorem 1. When has a similar measure with IFS satisfying the open set condition2 discrete distribution, its dyadic expansion has zero entropy rate. [18, p. 129], that is, there exists a nonempty bounded open set When is uniform on , its dyadic expansion is indepen- , such that and for dent identically distributed (i.i.d.) equiprobable, and therefore it . has unit entropy rate in bits. If is continuous, but nonuniform, its dyadic expansion still has unit entropy rate. Moreover, from Theorem 3 [17], [12]: Let the distribution of be a self- (36), (37), and Theorem 1, we have similar measure generated from the stationary ergodic measure on and the IFS with similarity (38) ratios and invariant set . Then (43) where denotes the relative entropy and the differential entropy is since a.s. The information When is the distribution of a memoryless process with dimension of a discrete-continuous mixture is also easily un- common distribution , (43) is reduced to derstood from this point of view, because the entropy rate of a mixed process is the weighted sum of entropy rates of each com- (44) ponent. Moreover, random variables whose lower and upper in- formation dimensions differ can be easily constructed from pro- 2The open set condition is weaker than the previous strong separation cesses with different lower and upper entropy rates. condition.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3726 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010

Note that the open set condition implies that As a consequence of the monotonicity of Rényi entropy, in- . formation dimensions of different orders satisfy the following Since , it follows that result. Lemma 1: For and both decrease with . (45) Define (51) In view of (44), we have Then (46) (52)

Proof: where is a subprobability measure. Since All inequalities follow from the fact that for a fixed , we have , which agrees with random variable decreases with in . Proposition 1. For dimension of order , we highlight the following result from [21]. F. Connections With Rate-Distortion Theory Theorem 4 [21, Th. 3]: Let be a random variable whose The asymptotic behavior of the rate-distortion function, in distribution has Lebesgue decomposition as in (25). Then, we particular, the asymptotic tightness of the Shannon lower bound have the following. in the high-rate regime, has been addressed in [19] and [20] for 1) : if , that is, has a discrete component, we continuous sources. In [12], Kawabata and Dembo generalized have . it to real-valued sources that do not necessarily possess a den- 2) : if , that is, has a continuous component, sity, and showed that the information dimension plays a cen- and , we have . tral role. For completeness, we summarize the main results from The differential Rényi entropy is defined using its [12] in Appendix III. density as

G. Rényi Dimension of Order (53) With Shannon entropy replaced by Rényi entropy in (12)–(13), the generalized notion of dimension of order is In general, is discontinuous in . For discrete-contin- defined similarly. uous-mixed distributions, for all , Definition 2 (Information Dimension of Order ): Let while equals to the weight of the continuous part. How- . Define ever, for Cantor distribution, for all . (47) IV. DEFINITIONS AND MAIN RESULTS and This section presents a unified framework for lossless data compression and our main results in the form of coding theo- (48) rems under various regularity conditions. Proofs are relegated to Sections V–VII. where denotes the Rényi entropy of order of a discrete random variable with probability mass function A. Lossless Data Compression , defined as Let the source be a stochastic process on , with denoting the source alphabet and a -al- gebra over . Let be a measurable space, where is (49) called the code alphabet. The main objective of lossless data compression is to find efficient representations for source real- izations by . and are called lower and upper dimensions of Definition 3: A -code for over the code of order , respectively. If , the common value space is a pair of mappings: is called the information dimension of of order , denoted by 1) encoder: that is measurable relative to . Rényi also defined in [16] “the entropy of of order and ; and dimension ” as 2) decoder: that is measurable relative to (50) and . The block error probability is . provided the limit exists. The fundamental limit in lossless source coding is as follows.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3727

Definition 4 (Lossless Data Compression): Let TABLE I be a stochastic process on . Define to be REGULARITY CONDITIONS OF ENCODER/DECODERS AND CORRESPONDING MINIMUM -ACHIEVABLE RATES the infimum of such that there exists a sequence of -codes over the code space , such that

(54) for all sufficiently large . According to the classical discrete almost-lossless source coding theorem, if is countable and is finite, the minimum Definition 5: Let be a stochastic process achievable rate for any i.i.d. process with distribution is on . Define the minimum -achievable rate to be the infimum of such that there exists a sequence of -codes , such that (55) (57)

Using codes over an infinite alphabet, any discrete source can for all sufficiently large , and the encoder and decoder be compressed with zero rate and zero block error probability. are constrained according to Table I. Except for linear encoding In other words, if both and are countably infinite, then for where , it is assumed that . all In Definition 5, we have used the following definitions. (56) Definition 6 (Hölder and Lipschitz Continuity): Let and be metric spaces. A function is called for any random process. -Hölder continuous if there exists such that for any B. Lossless Analog Compression With Regularity Conditions In this section, we consider the problem of encoding analog (58) sources with analog symbols, that is, and or if bounded encoders are is called -Lipschitz if is -Hölder continuous. is required, where denotes the Borel -algebra. As in the simply called Lipschitz (resp., -Hölder continuous) if is countably infinite case, zero rate is achievable even for zero -Lipschitz (resp., -Hölder continuous) for some . block error probability, because the cardinality of is the same We proceed to give results for each of the minimum -achiev- for any [22]. This conclusion holds even if we require the en- able rates introduced in Definition 5. Motivated by compressed coder/decoder to be Borel measurable, because according to Ku- sensing theory, it is interesting to consider the case where the ratowski’s theorem [23, Remark (i), p. 451] every uncountable encoder is restricted to be linear. standard Borel space is isomorphic3 to . Therefore, a single real number has the capability of encoding a real vector, Theorem 5 (Linear Encoding: General Achievability): Sup- or even a real sequence, with a coding scheme that is both uni- pose that the source is memoryless. Then versal and deterministic. However, the rich structure of equipped with a metric (59) topology (e.g., that induced by Euclidean distance) enables us to probe the problem further. If we seek the fundamental limits for all , where is defined in (51). Moreover, we of not only lossless coding but “graceful” lossless coding, have the following. the result is not trivial anymore. In this spirit, our various 1) For all linear encoders (except possibly those in a set of definitions share the basic information-theoretic setup where a zero Lebesgue measure on the space of real matrices), random vector is encoded with a function block error probability is achievable. and decoded with with such that 2) The decoder can be chosen to be -Hölder continuous for and satisfy certain regularity conditions and the probability all , where is the compression of incorrect reproduction vanishes as . rate. Regularity in encoder and decoder is imposed for the sake Proof: See Section VI-C. of both less complexity and more robustness. For example, al- Theorem 6 (Linear Encoding: Discrete-Continuous Mixture): though a surjection from to is capable of lossless Suppose that the source is memoryless with a discrete-contin- encoding, its irregularity requires specifying uncountably many uous mixed distribution. Then real numbers to determine this mapping. Moreover, regularity in encoder/decoder is crucial to guarantee noise resilience of the (60) coding scheme.

3Two measurable spaces are isomorphic if there exists a measurable bijection for all . whose inverse is also measurable. Proof: See Section VI-C.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3728 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010

Theorem 7 (Linear Encoding: Achievability for Self-Similar For sources with a singular distribution, in general there is no Sources): Suppose that the source is memoryless with a self- simple answer due to their fractal nature. For an important class similar distribution that satisfies the open set condition. Then of singular measures, namely self-similar measures generated from i.i.d. digits (e.g., generalized Cantor distribution), the in- (61) formation dimension turns out to be the fundamental limit for for all . lossless compression with Lipschitz decoder. Proof: See Section VI-C. Theorem 11 (Lipschitz Decoding: Achievability for Self-Sim- In Theorems 5 and 6, it has been shown that block error prob- ilar Measures): Suppose that the source is memoryless and ability is achievable for Lebesgue-a.e. linear encoder. There- bounded, and its -ary expansion consists of independent iden- fore, choosing any random matrix with i.i.d. entries distributed tically distributed digits. Then according to some absolutely continuous distribution on (e.g., a Gaussian random matrix) satisfies block error probability al- (65) most surely. Now, we drop the restriction that the encoder is linear, al- for all . Moreover, if the distribution of each bit is lowing very general encoding rules. Let us first consider the case equiprobable on its support, then (65) holds for where both the encoder and decoder are constrained to be con- Proof: See Section VII-D. tinuous. It turns out that zero rate is achievable in this case. Example 1: As an example, we consider the setup in Theorem Theorem 8 (Continuous Encoder and Decoder): For general 11 with and , where . The sources associated invariant set is the middle third Cantor set [18] and is supported on . The distribution of , denoted by , (62) is called the generalized Cantor distribution [25]. In the ternary for all . expansion of , each digit is independent and takes value and Proof: Since and are Borel iso- with probability and respectively. Then, by Theorem 11, morphic, there exist Borel measurable and for any . Furthermore, when , such that . By Lusin’s theorem [24, coincides with the “uniform” distribution on , i.e., the Th. 7.10], there exists a compact set such that re- standard Cantor distribution. Hence, we have a stronger result stricted on is continuous and . Since is that , i.e., exact lossless compression can compact and is injective on is a homeomorphism from be achieved with a Lipschitz continuous decompressor at the to . Hence, is continuous. Since both rate of the information dimension. and are closed, by Tietze extension theorem [9], and can be extended to continuous and Let . Then, is the inverse , respectively. Using and as the new encoder and de- of on . Due to the -Lipschitz continuity of is an coder, the error probability satisfies expansive mapping, that is . (66) Employing similar arguments as in the proof of Theorem 8, we see that imposing additional continuity constraints on the encoder (resp., decoder) has almost no impact on the funda- Note that (66) implies the injectivity of , a necessary con- mental limit (resp., ). This is because a continuous dition for decodability. Moreover, not only does assign encoder (resp., decoder) can be obtained at the price of an arbi- different codewords to different source symbols, but also it trarily small increase of error probability, which can be chosen keeps them sufficiently separated proportionally to their dis- to vanish as grows. tance. 
Therefore, the encoder respects the metric structure Theorems 9–11 deal with Lipschitz decoding in Euclidean of the source alphabet. spaces. We conclude this section by introducing stable decoding,a weaker condition than Lipschitz continuity. Theorem 9 (Lipschitz Decoding: General Converse): Sup- pose that the source is memoryless. If , then Definition 7 ( -Stable): Let and be metric spaces and . is called -stable (63) on if for all for all . Proof: See Section VII-B. (67) Theorem 10 (Lipschitz Decoding: Achievability for Discrete/ We say is -stable if is -stable. Continuous Mixture): Suppose that the source is memoryless A function is -Lipschitz if and only if it is -stable with a discrete-continuous mixed distribution. Then for every . We denote by the minimum -achiev- (64) able rate such that there exists a sequence of Borel encoders and -stable decoders that achieve block error probability . The for all . fundamental limit of stable decoding is given by the following Proof: See Section VII-C. tight result, whose proof is omitted for conciseness.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3729

Theorem 12 ( -Stable Decoding): Let the underlying metric for all with and for all in sup- be the distance. Suppose that the source is memoryless. ported on . Then Then, for all (72) (68) By Theorem 13, using (70) as the decoder, the -norm of the that is, the minimum -achievable rate such that for all suffi- decoding error is upper bounded proportionally to the -norm ciently small there exists a -stable coding strategy is given of the noise. by . In our framework, a stable or Lipschitz continuous coding scheme also implies robustness with respect to noise added at C. Connections With Compressed Sensing the input of the decompressor, which could result from quan- As an application of Theorem 6, we consider the following tization, finite wordlength or other inaccuracies. For example, source distribution: suppose that the encoder output is quantized by a -bit uniform quantizer, resulting in . With a -stable (69) coding strategy , we can use the following decoder. De- note the following nonempty set: where is the Dirac measure with atom at and is an absolutely continuous distribution. This is the model for (73) linearly sparse signals used in [5] and [6], where a universal4 iterative thresholding decoding algorithm is proposed. Under where . Pick any certain assumptions on , the asymptotic error probability turns in and output . Then, by the stability of out to exhibit a “phase transition” [5], [6]: there is a sparsity-de- , we have pendent threshold on the measurement rate above which the error probability vanishes and below which the error prob- (74) ability goes to one. This behavior is predicted by Theorem 6, which shows the optimal threshold is , irrespective of the prior i.e., each component in the decoder output will suffer at most . Moreover, the decoding algorithm in the achievability proof twice the inaccuracy of the decoder input. Similarly, an -Lips- of Section VI-C is universal and robust (Hölder continuous), al- chitz coding scheme with respect to distance incurs an error though it has exponential complexity. The threshold is not no more than . given in closed form (in [5, eq. (5) and Fig. 1]), but its numerical evaluation shows that it lies far from the optimal threshold ex- V. L OSSLESS MINKOWSKI-DIMENSION COMPRESSION cept in the nonsparse regime ( close to 1). Moreover, it can be shown that as . The performance of several As a counterpart to lossless data compression, in this section, other suboptimal factor-graph-based reconstruction algorithms we investigate the problem of lossless Minkowski dimension5 is analyzed in [7]. Practical robust algorithms that approach the compression for general sources, where the minimum -achiev- fundamental limit of compressed sensing given by Theorem 6 able rate is defined as . This is an important intermediate are not yet known. tool for studying fundamental limits of lossless linear encoding Robust reconstruction is of great importance in the theory and Lipschitz decoding. Bridging the three compression frame- of compressed sensing [26]–[28], since noise resilience is an works, in Sections VI-C and VII-B, we prove the following indispensable property for decompressing sparse signals from inequality: real-valued measurements. For example, consider the following robustness result. (75) Theorem 13 [26]: Suppose we wish to recover a vector Hence, studying provides an achievability bound for loss- from noisy compressed linear measurements , less linear encoding and a converse bound for Lipschitz de- where and . Let be a solution coding. 
We present bounds for for general sources, as of the following -regularization problem: well as tight results for discrete-continuous mixed and self-sim- ilar sources. (70) A. Minkowski Dimension of Sets in Metric Spaces Let satisfy , where is the In fractal geometry, the Minkowski dimension is a way of -restricted isometry constant of matrix , defined as the determining the fractality of a subset in metric spaces. smallest positive number such that Definition 8 (Covering Number): Let be a nonempty (71) bounded subset of the metric space . For , define

4The decoding algorithm is universal if it requires no knowledge of the prior 5Also known as Minkowski–Bouligand dimension, or box- distribution of nonzero entries. counting dimension.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3730 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010

, the -covering number of , to be the smallest number it intersects, hence justifying the name of box-counting dimen- of -balls needed to cover , that is sion. Denote by the smallest number of mesh cubes of size that covers , that is

(76) (81)

(82) Definition 9 (Minkowski Dimensions): Let be a nonempty (83) bounded subset of metric space . Define the lower and upper Minkowski dimensions of as Lemma 3: Let be a bounded subset in . The Minkowski dimensions satisfy (77) (84) (78) (85) respectively. If , the common value is called Proof: See Appendix V. the Minkowski dimension of , denoted by . B. Definitions and Coding Theorems It should be pointed out that the Minkowski dimension depends on the underlying metric. Nevertheless, equivalent Consider a source in equipped with an -norm. We metrics result in the same dimension. A few examples are as define the minimum -achievable rate for Minkowski-dimen- follows. sion compression as follows. • for any finite set . Definition 10 (Minkowski-Dimension Compression Rate): • for any bounded set of nonempty interior Let be a stochastic process on . Define in Euclidean space . • Let be the middle-third Cantor set in the unit interval. Then, [18, Example 3.3]. • [18, Example 3.5]. From this example, wee see that Minkowski dimension lacks certain (86) stability properties one would expect of a dimension, since it is often desirable that adding a countable set would have Note that the conventional minimum source coding rate no effect on dimension. This property fails for Minkowski in Definition 4 is defined like in (86) replacing by dimension. On the contrary, we observe that Rényi infor- . mation dimension exhibits stability with respect to adding In general for any . This is because a discrete component as long as the entropy is finite. How- for any , there exists a compact subset , such that ever, mixing any distribution with a discrete measure with , and by definition. Several unbounded support and infinite entropy will necessarily re- coding theorems for are given as follows. sult in infinite information dimension. The upper Minkowski dimension satisfies the following prop- Theorem 14: Suppose that the source is memoryless with dis- erties (see [18, p. 48 (iii) and p. 102 (7.9)]), which will be used tribution such that . Then in the proof of Theorem 14. (87) Lemma 2: For bounded sets and

(79) (88) for . (80) Proof: See Section V-C. For the special cases of discrete-continuous-mixed and self- The following lemma shows that in Euclidean spaces, without similar sources, we have the following tight results. loss of generality we can restrict attention to covering with Theorem 15: Suppose that the source is memoryless with a mesh cubes defined in (5). Since all the mesh cubes partition discrete-continuous mixed distribution. Then the whole space, to calculate lower or upper Minkowski dimen- sion of a set, it is sufficient to count the number of mesh cubes (89)

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3731 for all , where is the weight of the continuous part block length. This has been shown for rates above entropy in of the distribution. If is finite, then [29, Exercise 1.2.7, pp. 41–42] via a combinatorial method. A unified proof can be given through the method of Rényi entropy, (90) which we omit for conciseness. The idea of using Rényi entropy to study lossless source coding error exponents was previously Theorem 16: Suppose that the source is memoryless with a introduced by [30]–[32]. self-similar distribution that satisfies the strong separation con- Lemma 5 deals with universal lower bounds on the source dition. Then coding error exponents, in the sense that these bounds are inde- (91) pendent of the source distribution. A better bound on has been shown in [33]: for for all . (100) Theorem 17: Suppose that the source is memoryless and bounded, and its -ary expansion consists of independent However, the proof of (100) was based on the dual expression digits. Then (96) and a similar lower bound of random channel coding error (92) exponent due to Gallager [34, Exercise 5.23], which cannot be applied to . Here we give a common lower bound on for all . both exponents, which is a consequence of Pinsker’s inequality [35] combined with the lower bound on entropy difference by When can take any value in for all variational distance [29]. . Such a source can be constructed using Theorem 15 as follows: Let the distribution of be a mixture of a contin- Proof of Theorem 14: (Converse) Let and uous and a discrete distribution with weights and respec- abbreviate as . Suppose for some . tively, where the discrete part is supported on and has infinite Then, for sufficiently large there exists , such that entropy. Then, by Proposition 1 and Theorem 1, but and . by Theorem 15. First we assume that the source has bounded support, that is, a.s. for some . By Proposition 2, choose C. Proofs such that for all Before showing the converse part of Theorem 14, we state two lemmas which are of independent interest in conventional (101) lossless source coding theory. By Lemma 3, for all , there exists , such that for all Lemma 4: Assume that is a discrete memo- ryless source with common distribution on the alphabet . . Let . Denote by the block (102) error probability of the optimal -code. (103) Then, for any Then, by (101), we have (93) (104) (94) Note that is a memoryless source with alphabet size at where the exponents are given by most . By Lemmas 4 and 5, for all , for all such that , we have (95) (105) (96) (106) (97) Choose and so large that the right-hand side of (106) is less (98) than . In the special case of , (106) contradicts (103) in view of (104). Lemma 5: For Next we drop the restriction that is almost surely bounded. Denote by the distribution of . Let be so large that

(107) (99) Denote the normalized restriction of on by , that is Proof: See Appendix VII. Lemma 4 shows that the error exponents for lossless source (108) coding are not only asymptotically tight, but also apply to every

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3732 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010 and the normalized restriction of on by . Then Suppose for now that we can construct a sequence of subsets , such that for any the following holds: (109) (122) By Theorem 2 (123)

(110) (111) Denote where because of the following: since is finite, we (124) have in view of Proposition 1. Conse- quently, , hence . (125) The distribution of is given by (126) (112) where denotes the Dirac measure with atom at . Then Then, . Now we claim that for each and . First observe that for each (113) , therefore covers (114) . Hence, , by (122). Therefore, by Lemma 3 where: • (113): by (112) and (31), since Theorem 1 implies (127) ; • (114): by (111) and (107). For , let . Define By the Borel–Cantelli lemma, (123) implies that . Let where is so large (115) that . By the finite subad- ditivity6 of upper Minkowski dimension in Lemma 2, . By the Then, for all arbitrariness of , the -achievability of rate is proved. Now let us proceed to the construction of the required . (116) To that end, denote by the probability mass function of . Let and (128)

(117) (129)

(118) Then, immediately (122) follows from . (119) Also, for (120) where (120), (117), and (118) follow from (114), (79), and (80), (130) respectively. But now (120) and (116) contradict the converse part of Theorem 14 for the bounded source , which we have already proved. (131) (Achievability) Recall that (132) (121)

(133) We show that for all , for all , there exists such that for all , there exists with (134) and . Therefore, 6Countable subadditivity fails for upper Minkowski dimension. Had it been readily follows. satisfied, we could have picked to achieve .

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3733 where (134) follows from the fact that By (142), for sufficiently large . are i.i.d. and the joint Rényi entropy is Decompose as the sum of individual Rényi entropies. Hence (145)

where (135) Choose such that , which is guaranteed (146) to exist in view of (121) and the fact that is nonincreasing in according to Lemma 1. Then Note that the collection of all is countable, and thus we may relabel them as . Then, . Hence, there exists and , such that , where (136) (147) (137) (138) Then Hence, decays at least exponentially with . Accordingly, (123) holds, and the proof of is (148) complete. Next we prove for the special case of discrete- (149) continuous mixed sources. The following lemma is needed in the converse proof. where (148) is by Lemma 2, and (149) follows from for each . This proves the -achiev- Lemma 6 [36, Th. 4.16]: Any Borel set whose upper ability of . By the arbitrariness of . Minkowski dimension is strictly less than has zero Lebesgue (Converse) Since , we can assume . Let measure. be such that . Define Proof of Theorem 15: (Achievability) Let the distribution of be (150)

(139) By (142), for sufficiently large . Let , then . Write as the where is a probability measure on abso- disjoint union lutely continuous with respect to Lebesgue measure and is a discrete probability measure. Let be the collection of all the (151) atoms of , which is, by definition, a countable subset of . Let . Then, is a sequence of i.i.d. binary random variables with expectation where

(140) (152)

By the weak law of large numbers (WLLN) Also let . Since , there exists and , such that . Note that (141)

(142) (153) where the generalized support of vector is defined as

(143) If , which implies has positive Lebesgue measure. By Lemma 6, Fix an arbitrary and let , hence

(144) (154)

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. 3734 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 56, NO. 8, AUGUST 2010

If implies that has positive Lebesgue By the same arguments that lead to (127), the sets defined in measure. Thus, . By the arbitrariness of , the (125) satisfy proof of is complete. (164) Next we prove for self-similar sources under the same assumption of Theorem 3. This result is due to the stationarity and ergodicity of the underlying discrete process (165) that generates the analog source distribution. Proof of Theorem 16: By Theorem 3, is finite. There- (166) fore, the converse follows from Theorem 14. To show achiev- Since , this shows the -achiev- ability, we invoke the following definition. Define the local di- ability of . mension of a Borel measure on as the function (if the limit Next we proceed to construct the required . Applying exists) [17, p. 169] Lemma 4 to the DMS and blocklength yields

(155) (167)

Denote the distribution of by the product measure , By the assumption that is bounded, without loss of generality, which is also self-similar and satisfies the strong separation the- we shall assume a.s. Therefore, the alphabet size of orem. By [17, Lemma 6.4(b) and Prop. 10.6] is at most . Simply applying (100) to yields

(156) (168) holds for -almost every . Define the sequence of random which does not grow with and cannot suffice for our purpose variables of constructing . Exploiting the structure that consists of independent bits, in Appendix VIII, we show a much better (157) bound

Then, (156) implies that as . There- (169) fore, for all and , there exists , such that Then, by (167) and (169), there exists , such that (163) holds and (158) (170) Let which implies (123). This concludes the proof of . (159)

VI. LOSSLESS LINEAR COMPRESSION (160) In this section, we analyze lossless compression with linear encoders, which are the basic elements in compressed sensing. Then, in view of (158), and Capitalizing on the approach of Minkowski-dimension com- pression developed in Section V, we obtain achievability re- sults for linear compression. For memoryless sources with a (161) discrete-continuous mixed distribution, we also establish a con- (162) verse in Theorem 6 which shows that the information dimension is the fundamental limit of lossless linear encoding. where (162) follows from (160) and . Hence, is proved. A. Minkowski-Dimension Compression and Linear Compression Finally, we prove for memoryless sources whose -ary expansion consisting of independent digits. The following theorem establishes the relationship be- Proof of Theorem 17: Without loss of generality, assume tween Minkowski-dimension compression and linear data , that is, the binary expansion of consists of inde- compression. pendent bits. We follow the same steps as in the achievability Theorem 18 (Linear Encoding: General Achievability): For proof of Theorem 14. Suppose for now that we can construct a general sources sequence of subsets , such that (123) holds and their car- dinality does not exceed (171)

(163) Moreover, we have the following.

Authorized licensed use limited to: Princeton University. Downloaded on July 20,2010 at 19:49:18 UTC from IEEE Xplore. Restrictions apply. WU AND VERDÚ: RÉNYI INFORMATION DIMENSION: FUNDAMENTAL LIMITS OF ALMOST LOSSLESS ANALOG COMPRESSION 3735

1) For all linear encoders (except possibly those in a set of the decoder. This is because no matrix acts injectively on zero Lebesgue measure on the space of real matrices), . To see this, introduce the notation block error probability is achievable. 2) For all and (174)

where . Then, for any matrix (172) (175) where is the compression rate, there exists a -Hölder continuous decoder that achieves block error Hence, there exist two -sparse vectors that have the same image probability . under . Consequently, in view of Theorem 18, the results on On the other hand, is sufficient to linearly compress all for memoryless sources in Theorems 14–16 yield the achiev- -sparse vectors, because (175) holds for Lebesgue-a.e. ability results in Theorems 5–7, respectively. Hölder exponents matrix . To see this, choose to be a random matrix with of the decoder can be found by replacing in (172) by its i.i.d. entries according to some continuous distribution (e.g., respective upper bound. Gaussian). Then, (190) holds if and only if all subma- For discrete-continuous sources, the achievability in Theorem trices formed by columns of are invertible. This is an al- 6 can be shown directly without invoking the general result in most sure event, because the determinant of each of the Theorem 18. See Remark 3. From the converse proof of The- orem 6, we see that effective compression can be achieved with submatrices is an absolutely continuous random variable. The linear encoders, i.e., , only if the source distribution sufficiency of is a bit stronger than the result in Re- is not absolutely continuous with respect to Lebesgue measure. mark 1, which gives . For an explicit construction of such a matrix, we can choose to be the matrix Remark 1: Linear embedding of low-dimensional subsets in (see Appendix IV). Banach spaces was previously studied in [37]–[39], etc., in a nonprobabilistic setting. For example, [38, Th 1.1] showed that: B. Auxiliary Results for a subset of a Banach space with , there Let . Denote by the Grassmannian man- exists a bounded linear function that embeds into . Here in ifold [25] consisting of all -dimensional subspaces of . For a probabilistic setup, the embedding dimension can be improved , the orthogonal projection from to defines by a factor of two. a linear mapping of rank . The technique Following the idea in the proof of Theorem 18, we obtain a we use in the achievability proof of linear analog compression is nonasymptotic result of lossless linear compression for -sparse to use the random orthogonal projection as the encoder, vectors, which is relevant to compressed sensing. where is distributed according to the invariant probability measure on , denoted by [25]. The relationship be- Corollary 1: Denote the collection of all -sparse vectors in tween and the Lebesgue measure on is shown in the by following lemma. Lemma 7 [25, Exercise 3.6]: Denote the rows of a (173) matrix by , the row span of by , and the volume of the unit -ball in by . Set Let be a -finite Borel measure on . Then, given any , for Lebesgue-a.e. real matrix , there exists a (176) Borel function , such that for -a.e. . Moreover, when is finite, for any and Then, for measurable, i.e., a collection of -di- , there exists a matrix and , mensional subspaces of such that and is -Hölder continuous. (177) Remark 2: The assumption that the measure is -finite is essential, because the validity of Corollary 1 hinges upon Fu- The following result states that a random projection of a given bini’s theorem, where -finiteness is an indispensable require- vector is not too small with high probability. It plays a central ment. Consequently, if is the distribution of a -sparse random role in estimating the probability of “bad” linear encoders. 
vector with uniformly chosen support and Gaussian distributed Lemma 8 [25, Lemma 3.11]: For any nonzero entries, we conclude that all -sparse vectors can be linearly compressed except for a subset of zero measure under (178) . On the other hand, if is the counting measure on , Corol- lary 1 no longer applies because that is not -finite. In fact, if , no linear encoder from to works for every To show the converse part of Theorem 6, we will invoke the -sparse vector, even if no regularity constraint is imposed on Steinhaus theorem as an auxiliary result.


Lemma 9 (Steinhaus [40]): For any measurable set with positive Lebesgue measure, there exists an open ball centered at contained in .

Lastly, with the notation in (174), we give a characterization of the fundamental limit of lossless linear encoding as follows.

Lemma 10: is the infimum of such that for sufficiently large , there exists a Borel set and a linear subspace of dimension at least , such that (179). The proof is omitted for conciseness.

C. Proofs

Proof of Theorem 18: We first show (171). Fix arbitrarily. Let . We show that there exists a matrix of rank and a Borel measurable such that (180) for sufficiently large . Pick . By definition of , there exists such that for all , there exists a compact such that and . Given an encoding matrix , define the decoder as (181), where the is taken componentwise7. Since is closed and is compact, is compact. Hence, is well defined. Next consider a random orthogonal projection matrix independent of , where is a random -dimensional subspace distributed according to the invariant measure on . We show that (182), which implies that there exists at least one realization of that satisfies (180). To that end, we define (183) and use the union bound (184), where the first term . Next we show that the second term is zero. Let . Then (185). Thus (186), (187).

7Alternatively, we can use any other tie-breaking strategy as long as Borel measurability is satisfied.

We show that for all . To this end, let (188). Define (189), (190). Then (191). Observe that implies that (192). Therefore, for all but a finite number of ’s if and only if and (193) for some . Next we show that for all but a finite number of ’s with probability one. Cover with -balls. The centers of those balls that intersect are denoted by . Then, , hence cover . Suppose ; then for any , (194), (195), (196), where (195) follows because is an orthogonal projection. Thus, for all implies that . Therefore, by the union bound, (197). By Lemma 8, (198), (199), (200), where (200) is due to , because . Since , there is a constant such that . Since is a translation of , it follows that (201), (202), (203).

where:
• (202): by substituting (200) and (201) into (197);
• (203): by (188).
Therefore, by the Borel–Cantelli lemma, for all but a finite number of ’s with probability one. Hence (204), which implies that for any , (205). In view of (187), (206), whence (182) follows. This shows the -achievability of . By the arbitrariness of , (171) is proved.

Now we show that (207) holds for all except possibly on a set of zero Lebesgue measure, where is the corresponding decoder for defined in (181). Note that (208), (209), where:
• (208): by (184);
• (209): by (187) and .
Define (210). Recalling defined in (176), we have (211), (212), (213), where:
• (211): by (209) and since holds Lebesgue-a.e.;
• (212): by Lemma 7;
• (213): by (206).
Observe that (213) implies that for any , (214). Since , in view of (209) and (214), we conclude that (207) holds Lebesgue-a.e.

Finally, we show that for any , there exists a sequence of matrices and -Hölder continuous decoders that achieves compression rate and block error probability . Since for all but a finite number of ’s a.s., there exists a (independent of ), such that (215). Thus, by (192), for any , (216). Integrating (216) with respect to on and by Fubini’s theorem, we have (217), (218). Hence, there exists and an orthogonal projection matrix of rank , such that and, for all , (219)8. Therefore is -Hölder continuous. By the extension theorem of Hölder continuous mappings [41], can be extended to that is -Hölder continuous. Then (220). Recall from (188) that . By the arbitrariness of , (172) holds.

Remark 3: Without recourse to the general result in Theorem 18, the achievability for discrete-continuous sources in Theorem 6 can be proved directly as follows. In (184), choose (221) and consider whose entries are i.i.d. standard Gaussian (or any other absolutely continuous distribution on ). Using linear algebra, it is straightforward to show that the second term in (184) is zero. Thus, the block error probability vanishes since .

Finally, we complete the proof of Theorem 6 by proving the converse.

8 denotes the restriction of on the subset .

Converse Proof of Theorem 6: Let the distribution of be defined as in (139). We show that for any . Since , assume . Fix an arbitrary . Suppose is -achievable. Let and . By Lemma 10, for sufficiently large , there exist a Borel set and a linear subspace , such that and , where . Hence, .

If is absolutely continuous with respect to Lebesgue measure, then has positive Lebesgue measure. By Lemma 9, contains an open ball in . Hence, cannot hold for any subspace with positive dimension. This proves .

Next we assume that . Let (222) and . By (142), for sufficiently large , (223). Next we decompose according to the generalized support of , where we have denoted the disjoint subsets (224). Then (225), (226). So there exists such that and .

Next we decompose each according to , which can only take countably many values. Let . For , let (227), (228). Then, can be written as a disjoint union of and . Define (229).

Since , there exists such that . Note that (230), (231), (232). Therefore, . Since is absolutely continuous with respect to Lebesgue measure on , has positive Lebesgue measure. By Lemma 9, contains an open ball (233), which contains linearly independent vectors, denoted by . Let be a basis for , where by assumption. Since , we conclude that are linearly dependent. Therefore (234), where and for some and . If we choose those nonzero coefficients sufficiently small, then and , since is a linear subspace. This contradicts . Thus, , and , hence . follows from the arbitrariness of .

VII. LOSSLESS LIPSCHITZ DECOMPRESSION

In this section, we study the fundamental limit of lossless compression with Lipschitz decoders. To facilitate the discussion, we first introduce several important concepts from geometric measure theory. Then, we proceed to give proofs of Theorems 9–11.

A. Geometric Measure Theory

Geometric measure theory [42], [25] is an area of analysis studying the geometric properties of sets (typically in Euclidean spaces) through measure theoretic methods. One of the core concepts in this theory is rectifiability, a notion of smoothness or regularity of sets and measures. Basically, a set is rectifiable if it is the image of a subset of a Euclidean space under some Lipschitz function. Rectifiable sets admit a smooth analog coding strategy. Therefore, lossless compression with Lipschitz decoders boils down to finding a subset of source realizations that is rectifiable and has high probability. In contrast, the goal of conventional almost-lossless data compression is to show concentration of probability on sets of small cardinality. This characterization enables us to use results from geometric measure theory to study Lipschitz coding schemes.

Definition 11 (Hausdorff Measure and Dimension): Let and . Define (235), where . Define the -dimensional Hausdorff measure on by (236). The Hausdorff dimension of is defined by (237).

Hausdorff measure generalizes both the counting measure and Lebesgue measure and provides a nontrivial way to measure low-dimensional sets in a high-dimensional space. When , is just a rescaled version of the usual -dimensional Lebesgue measure [25, 4.3]; when , reduces to the counting measure.

For , gives a nontrivial measure for sets of Hausdorff dimension in , because if ; if . As an example, consider and . Let be the middle-third Cantor set in the unit interval, which has zero Lebesgue measure. Then, and [18, 2.3].
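A quick numerical counterpart to the Cantor-set example (an illustration, not part of the original text): a box-counting estimate of the dimension of the middle-third Cantor set, which should match log 2 / log 3 ≈ 0.6309. The depth and the scales below are arbitrary choices.

```python
import numpy as np

# Box-counting dimension estimate for the middle-third Cantor set.
# Points are left endpoints of the surviving depth-level intervals:
# ternary expansions with digits in {0, 2}.
depth = 12
codes = np.arange(2 ** depth)
digits = (codes[:, None] >> np.arange(depth)[None, :]) & 1          # bits
points = (2 * digits * 3.0 ** -(np.arange(depth) + 1)).sum(axis=1)  # in [0,1)

for j in (4, 6, 8, 10):
    eps = 3.0 ** -j
    n_boxes = np.unique(np.floor(points / eps)).size
    print(f"eps = 3^-{j}:  log N(eps) / log(1/eps) = "
          f"{np.log(n_boxes) / np.log(1 / eps):.4f}")
print(f"log 2 / log 3 = {np.log(2) / np.log(3):.4f}")
```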

Definition 12 (Rectifiable Sets [42, 3.2.14]): is called -rectifiable if there exists a Lipschitz mapping from some bounded set in onto .

Definition 13 (Rectifiable Measures [25, Definition 16.6]): Let be a measure on . is called -rectifiable if and there exists a -a.s. set that is -rectifiable.

Several useful facts about rectifiability are presented as follows.

Lemma 11 [42]:
1) An -rectifiable set is also -rectifiable for .
2) The Cartesian product of an -rectifiable set and an -rectifiable set is -rectifiable.
3) The finite union of -rectifiable sets is -rectifiable.
4) Countable sets are -rectifiable.

Using the notion of rectifiability, we give a sufficient condition for the -achievability of Lipschitz decompression by the following lemma.

Lemma 12: if there exists a sequence of -rectifiable sets with (238) for all sufficiently large .
Proof: See Appendix V.

Definition 14 ( -Dimensional Density [25, Def. 6.8]): Let be a measure on . The -dimensional upper and lower densities of at are defined as (239), (240). If , the common value is called the -dimensional density of at , denoted by .

The following important result in geometric measure theory gives a density characterization of rectifiability for Borel measures.

Theorem 19 (Preiss Theorem [43, Th. 5.6]): A -finite Borel measure on is -rectifiable if and only if the density exists and is positive and finite for -a.e. .

Recalling the expression for information dimension in (17), we see that for the information dimension of a measure to be equal to , it requires that the exponent of the average measure of -balls equals , whereas -rectifiability of a measure requires that the measure of almost every -ball scales as — a much stronger condition than the existence of information dimension. Obviously, if a probability measure is -rectifiable, then .

B. Converse

In view of the lossless Minkowski dimension compression results developed in Section V, the general converse in Theorem 9 is rather straightforward. We need the following lemma to complete the proof.

Lemma 13: Let be -rectifiable. Then (241).
Proof: See Appendix IX.

Proof of Theorem 9: Lemma 13 implies the following general inequality: (242). If the source is memoryless and , then it follows from Theorem 14 that .

C. Achievability for Finite Mixture

We first prove a general achievability result for finite mixtures, a corollary of which applies to discrete-continuous mixed distributions in Theorem 10.

Theorem 20 (Achievability of Finite Mixtures): Let the distribution of be a mixture of finitely many Borel probability measures on , i.e., (243), where is a probability mass function. If is -achievable with Lipschitz decoders for , then is -achievable for with Lipschitz decoders, where (244), (245).

Proof: By induction, it is sufficient to show the result for . Denote . Let be a sequence of i.i.d. binary random variables with . Let be an i.i.d. sequence of real-valued random variables, such that the distribution of each conditioned on the events and is and , respectively. Then, is a memoryless process with common distribution . Since the claim of the theorem depends only on the probability law of the source, we base our calculation of block error probability on this specific construction.


Fix . Since and are achievable for and , respectively, by Lemma 12, there exists such that for all , there exists with , where is -rectifiable and is -rectifiable. Let (246). By the WLLN, . Hence, for any , there exists , such that for all , (247). Let (248) and define (249).

Next we show that is -rectifiable. For all , it follows from (248) that (250), (251), and (252), (253), (254), where according to (245). Observe that is a finite union of subsets, each of which is a Cartesian product of a -rectifiable set in and a -rectifiable set in . Recalling Lemma 11, is -rectifiable, in view of (254).

Now we calculate the measure of under : (255), (256), (257), (258), (259), (260), (261), where:
• (257): by construction of in (249);
• (259): by (250) and (251);
• (260): according to (244);
• (261): by (247).
In view of Lemma 12, is -achievable for . By the arbitrariness of and , is -achievable for .

Proof of Theorem 10: Let the distribution of be as defined in (139), where is discrete and is absolutely continuous. By Lemma 11, countable sets are -rectifiable. For any , there exists such that . By definition, is -rectifiable. Therefore, by Lemma 12 and Theorem 20, is an -achievable rate for . The converse follows from (242) and Theorem 15.
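The weak-law step above is easy to visualize numerically. The sketch below (an illustration under assumed parameters) draws blocks from a discrete-continuous mixture in which each coordinate is analog with probability rho, and shows that the fraction of analog coordinates per block rarely exceeds rho + gamma as the blocklength grows — which is why a rate of rho plus an arbitrarily small slack suffices.

```python
import numpy as np

# Concentration of the fraction of analog (continuous) symbols in an
# i.i.d. block drawn from a mixture with continuous weight rho.
# (rho, gamma, trials are arbitrary illustrative choices.)
rng = np.random.default_rng(2)
rho, gamma, trials = 0.3, 0.05, 10_000

for n in (100, 1000, 10000):
    analog_counts = rng.binomial(n, rho, size=trials)
    p_exceed = np.mean(analog_counts > (rho + gamma) * n)
    print(f"n = {n:>5}:  P(analog fraction > rho + gamma) = {p_exceed:.4f}")
```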

D. Achievability for Singular Distributions

In this section, we prove Theorem 11 for memoryless sources, using isomorphism results in ergodic theory. The proof outline is as follows: a classical result in ergodic theory states that Bernoulli shifts are isomorphic if they have the same entropy. Moreover, the homomorphism can be chosen to be finitary, that is, each coordinate only depends on finitely many coordinates. This finitary homomorphism naturally induces a Lipschitz decoder in our setup; however, the caveat is that the Lipschitz continuity is with respect to an ultrametric (Definition 15) that is not equivalent to the usual Euclidean distance. Nonetheless, by an arbitrarily small increase in the compression rate, the decoder can be modified to be Lipschitz with respect to the Euclidean distance. Before proceeding to the proof, we first present some necessary results on ultrametric spaces and finitary coding in ergodic theory.

Definition 15: Let be a metric space. is called an ultrametric if (262) for all .

A canonical class of ultrametric spaces is the ultrametric Cantor space [44]: let denote the set of all one-sided -ary sequences . To endow with an ultrametric, define (263). Then, for every , is an ultrametric on . In a similar fashion, we define an ultrametric on by considering the -ary expansion of real vectors. Similar to the binary expansion defined in (6), for and , define (264); then (265). Denote for brevity (266).
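The following sketch (an illustration with assumed helper names, not the paper's notation) implements the digit-based ultrametric of (263) on m-ary expansions and checks the strong triangle inequality numerically; the induced metric on the reals is discussed next.

```python
import numpy as np

def mary_digits(x, m, depth):
    """First `depth` digits of the m-ary expansion of x in [0, 1)."""
    out = []
    for _ in range(depth):
        x *= m
        d = int(x)
        out.append(d)
        x -= d
    return out

def ultrametric(x, y, m=3, depth=30):
    # Distance m**(-k), where k is the first index of disagreement.
    dx, dy = mary_digits(x, m, depth), mary_digits(y, m, depth)
    for k, (a, b) in enumerate(zip(dx, dy), start=1):
        if a != b:
            return float(m) ** -k
    return 0.0

rng = np.random.default_rng(3)
x, y, z = rng.random(3)
# Strong triangle inequality: d(x,z) <= max(d(x,y), d(y,z)).
assert ultrametric(x, z) <= max(ultrametric(x, y), ultrametric(y, z)) + 1e-12
# |x - y| is controlled by the ultrametric (up to a factor of m), but not
# conversely: expansions can disagree early while |x - y| is tiny.
print(abs(x - y), ultrametric(x, y))
```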


With this notation, (263) induces an ultrametric on and : (267). It is important to note that is not equivalent to the distance (or any distance), since we only have (268). To see the impossibility of the other direction of (268), consider (269): as , but remains . Therefore, a Lipschitz function with respect to is not necessarily Lipschitz under . However, the following lemma bridges the gap if the dimension of the domain and the Lipschitz constant are allowed to increase.

Lemma 14: Let be the ultrametric on defined in (267). Suppose and is Lipschitz. Then, there exists such that and is Lipschitz.
Proof: See Appendix X.

Next we recall several results on finitary coding of Bernoulli shifts. Kolmogorov–Ornstein theory studies whether two processes with the same entropy rate are isomorphic. Keane and Smorodinsky [45] showed that two double-sided Bernoulli shifts of the same entropy are finitarily isomorphic. For the single-sided case, Del Junco [46] showed that there is a finitary homomorphism between two single-sided Bernoulli shifts of the same entropy, which is a finitary improvement of Sinai’s theorem [47], [55]. We will see how a finitary homomorphism of the digits is related to a real-valued Lipschitz function, and how to apply Del Junco’s ergodic-theoretic result to our problem.

Definition 16 (Finitary Homomorphisms): Let and be finite sets. Let and denote the left shift operators on the product spaces and , respectively. Let and be measures on and (with product -algebras). A homomorphism is a measure preserving mapping that commutes with the shift operator, i.e., and -a.e. is said to be finitary if there exist sets of zero measure and such that is continuous (with respect to the product topology). Informally, finitariness means that for almost every , is determined by finitely many coordinates in . The following lemma characterizes this intuition in precise terms.

Lemma 15 [48, Conditions 5.1, p. 281]: Let and . Let be a homomorphism. Then, the following statements are equivalent.
1) is finitary.
2) For -a.e. , there exists , such that implies that .
3) For each , the inverse image of each time- cylinder set in is, up to a set of measure , a countable union of cylinder sets in .

Theorem 21 [46, Th. 1]: Let and be probability distributions on finite sets and . Let and . If and each have at least three nonzero components and , then there is a finitary homomorphism .

We now use Lemmas 14–15 and Theorem 21 to prove Theorem 11.

Proof of Theorem 11: Without loss of generality, assume that the random variable satisfies . Denote by the common distribution of the -ary digits of . By Proposition 2, and .

Fix . Let and . Let be a probability measure on such that . Such a always exists because . Let and denote the product measures on and , respectively. Since and have the same entropy rate, by Theorem 21, there exists a finitary homomorphism . By the characterization of finitariness in Lemma 15, for any , there exists such that is determined only by . Denote the closed ultrametric ball (270), where is defined in (267). Then, for any , . Note that forms a countable cover of . This is because is just a cylinder set in with base , and the total number of cylinders is countable. Furthermore, since intersecting ultrametric balls are contained in each other [49], there exists a sequence in , such that partitions . Therefore, for all , there exists , such that , where (271).

For , recall the -ary expansion of defined in (266), denoted by . Let (272), (273), (274). Since is measure preserving, ; therefore (275).

Next we use to construct a real-valued Lipschitz mapping . Define by (276). Since commutes with the shift operator, for all . Also, for , . Therefore (277).


Next we proceed to show that is Lipschitz. In view of (268) and (263), it is sufficient to show that is Lipschitz. Let (278). First observe that is -Lipschitz on each ultrametric ball . To see this, consider distinct points . Let . Then, . Since , and coincide on their first digits. Therefore (279), (280), (281). Since every closed ultrametric ball is also open [49, Prop. 18.4], are disjoint; therefore is -Lipschitz on for some . Then, for any , (282), (283), (284), (285), where:
• (282): by (268);
• (283) and (285): by (267);
• (284): by (281).
Hence, is Lipschitz.

By Lemma 14, there exists a subset and a Lipschitz mapping such that . By Kirszbraun’s theorem [42, 2.10.43], we extend to a Lipschitz function . Then, . Since is continuous and is compact, by Lemma 18, there exists a Borel function , such that on .

To summarize, we have obtained a Borel function and a Lipschitz function , where , such that . Therefore, we conclude that . The converse follows from Theorem 9.

Last, we show that (286) for the special case when is equiprobable on the support, where . Recalling the construction of self-similar measures in Section III-C, we first note that the distribution of is a self-similar measure that is generated by the IFS , where (287). This IFS satisfies the open set condition, since and the union is disjoint. Denote by the invariant set of the reduced IFS . By [12, Corollary 4.1], the distribution of , denoted by , is in fact the normalized -dimensional Hausdorff measure on , i.e., . Therefore, by [18, Exercise 9.11], there exists a constant , such that for all , (288); that is, has positive lower -density everywhere. By [50, Th. 4.1(1)], for any , there exists such that and is -rectifiable. Let . By Lemma 12, the rate is -achievable.
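The self-similar measure invoked above can be simulated directly. Below is a sketch (an illustration with arbitrary sample sizes, not part of the original proof) that samples the middle-third Cantor measure via its IFS and estimates the entropy exponent of the quantized variable, which should approach the dimension log 2 / log 3.

```python
import numpy as np

# Chaos-game sampling from the self-similar measure of the IFS
# {x/3, (x+2)/3} with equal weights (the Cantor measure), then an
# estimate of H(<X>_{3^k}) / (k log 3).
rng = np.random.default_rng(4)
n_samples, n_iter = 200_000, 40

x = rng.random(n_samples)
for _ in range(n_iter):
    branch = rng.integers(0, 2, size=n_samples)
    x = (x + 2 * branch) / 3.0   # apply a random branch of the IFS

for k in (4, 6, 8):
    _, counts = np.unique(np.floor(x * 3.0 ** k), return_counts=True)
    p = counts / n_samples
    H = -(p * np.log(p)).sum()
    print(f"k = {k}:  H/(k log 3) = {H / (k * np.log(3)):.4f}")
print(f"log 2 / log 3 = {np.log(2) / np.log(3):.4f}")
```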
VIII. CONCLUDING REMARKS

Compressed sensing, as an analog compression paradigm, imposes two basic requirements: the linearity of the encoder and the robustness of the decoder; the rationale is that low complexity of encoding operations and noise resilience of decoding operations are indispensable in dealing with analog sources. To better understand the fundamental limits imposed by the requirements of low complexity and noise resilience, it is pedagogically sound to study them separately and in a more general paradigm than compressed sensing. Motivated by this observation, in this paper we have proposed an information theoretic framework for lossless analog compression of analog sources under regularity conditions of the coding schemes. Abstractly, the approach boils down to probabilistic dimension reduction with smooth embedding. In this framework, obtaining fundamental limits requires tools quite different from those used in traditional information theory, calling for machineries from dimension theory and geometric measure theory in addition to ergodic theory.

Within this general framework, we analyzed the fundamental limits under different regularity constraints imposed on compressor and decompressor. Perhaps the most surprising result is, as shown in (75), (289), which holds for any real-valued source. This conclusion implies that a Lipschitz constraint at the decompressor results in less efficient compression than a linearity constraint at the compressor. For memoryless sources, we have also obtained bounds or exact expressions for various -achievable rates. As seen in Theorems 5–12, Rényi’s information dimension plays an important role in the associated coding theorems. These results provide new operational characterizations for Rényi’s information dimension in a lossless compression framework.

In the important case of discrete-continuous mixed sources, which is a probabilistic generalization of the linearly sparse source model used in compressed sensing (a fixed fraction of observations are zero), we have shown that the fundamental limit is the Rényi information dimension, which coincides with the weight on the continuous part in the source distribution. In the memoryless case, this corresponds to the fraction of analog symbols in the source realization. This might suggest that the mixed discrete-continuous nature of the source is of fundamental importance in the analog compression framework; sparsity is just one manifestation of a mixed distribution.

It should be remarked that random linear coding is not only an important achievability proof technique in Shannon theory, but also an inspiration to obtain efficient schemes in modern coding

theory, compressed sensing, as well as our analog compression framework. Moreover, it also provides information about how close practical encoders are to the fundamental limit. For instance, in lossless linear compression, the achievability bound on in Theorem 18 can be achieved with Lebesgue-a.e. linear encoders. This implies that generating random linear encoders using any continuous distribution achieves the desired error probability almost surely.

As far as future research directions are concerned, there are regularity conditions beyond those in Table I that are worth studying. For example, it is interesting to investigate the fundamental limit of bi-Lipschitz coding schemes, i.e., the encoder and decoder both being Lipschitz continuous. This is a probabilistic version of bi-Lipschitz embedding in Euclidean spaces, e.g., Dvoretzky’s theorem [51] and the Johnson–Lindenstrauss lemma [52]. As a more restricted case, the fundamental limit of linear compression with Lipschitz decompression is the most desirable result.

APPENDIX I
PROOF OF EQUIVALENCE OF (17) AND (14)

Proposition 4: The information dimension of a random variable on can be calculated as follows: (290), where is the distribution of and is the -ball of radius centered at . The lower (upper) information dimension can be obtained by replacing by and .

Proof: Let be a random vector in and denote its distribution by . Due to the equivalence of -norms, it is sufficient to show (290) for . Recall the notation in (5) and note that (291). For any , there exists , such that . Then (292), (293). As a result of (292), we have (294). On the other hand, note that is a disjoint union of mesh cubes. By (293), we have (295), (296). Combining (294) and (296) yields (297). By Proposition 2, sending and yields (290).
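The quantization-entropy characterization of information dimension underlying Proposition 4 can be estimated empirically. The sketch below (an illustration under assumed parameters) uses a mixture with mass 1 − rho at the point 0 and mass rho on Uniform(0, 1); the ratio H(⟨X⟩)/log m decreases slowly toward the information dimension rho as the quantization becomes finer.

```python
import numpy as np

# Empirical H(<X>_m)/log m for a discrete-continuous mixture with
# continuous weight rho; the limit in m is rho. (rho, sample size,
# and quantization levels are arbitrary illustrative choices.)
rng = np.random.default_rng(5)
rho, n_samples = 0.5, 500_000
continuous = rng.random(n_samples) < rho
x = np.where(continuous, rng.random(n_samples), 0.0)

for m in (2 ** 6, 2 ** 10, 2 ** 14):
    _, counts = np.unique(np.floor(x * m), return_counts=True)
    p = counts / n_samples
    H = -(p * np.log(p)).sum()
    print(f"m = 2^{int(np.log2(m)):>2}:  H(<X>_m)/log m = {H / np.log(m):.4f}")
```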

APPENDIX II
PROOFS OF PROPOSITIONS 1–3

Lemma 16: For all , (298).
Proof: (299), (300). Note that for any , (301), (302), (303). Given , the range of is upper bounded by . Therefore, for all , (304). Hence, admits the same upper bound and (298) holds.

Lemma 17 [53, p. 2102]: Let be an -valued random variable. Then, if .

Proof of Proposition 1: Using Lemma 16 with and , we have (305).
(21) ⇒ (20): When is finite, dividing both sides of (305) by and letting results in (20).
(20) ⇒ (21): Suppose . By (305), for every , and (20) fails. This also proves (22).
(19) ⇒ (21): For any , there exists , such that (306). Then (307), (308), (309), (310), where:
• (308): by Lemma 17;
• (310): by (311), (312).

Proof of Proposition 2: Fix any and , such that . By Lemma 16, we have (313), (314).


Therefore (315), and hence (23) and (24) follow.

Proof of Proposition 3: Note that (316), (317), (318), (319). The same bound works for rounding.

APPENDIX III
INFORMATION DIMENSION AND RATE-DISTORTION THEORY

The asymptotic tightness of the Shannon lower bound in the high-rate regime is shown by the following result.

Theorem 22 [20]: Let be a random variable on the normed space with a density such that . Let the distortion function be with , and the single-letter rate-distortion function is given by (320). Suppose that there exists an such that . Then (321), where the Shannon lower bound takes the following form: (322), where denotes the volume of the unit ball .

Special cases of Theorem 22 include the following.
• MSE distortion and scalar source: and . Then, the Shannon lower bound takes the familiar form (323).
• Absolute distortion and scalar source: and . Then (324).
• -norm and vector source: and . Then (325).

For general sources, Kawabata and Dembo introduced the concept of rate-distortion dimension in [12]. The rate-distortion dimension of a measure (or a random variable with the distribution ) on the metric space is defined as follows: (326), (327), where is the single-letter rate-distortion function of with distortion function . Then, under the equivalence of the metric and the -norm as in (328), the rate-distortion dimension coincides with the information dimension of .

Theorem 23 [12, Prop. 3.3]: Consider the metric space . If there exists , such that for all , (328), then (329), (330). Moreover, (329) and (330) hold even if the -entropy is used instead of the rate-distortion function in the definition of and .
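A numeric illustration of the high-rate scaling just described (an aside, using the standard closed-form Gaussian rate-distortion function rather than anything from this appendix): for a continuous scalar source under MSE, R(D) grows like (1/2) log(1/D), so the ratio tends to 1, the rate-distortion dimension of an absolutely continuous distribution. The variance and distortion levels are arbitrary choices.

```python
import numpy as np

# R(D) = (1/2) log(sigma^2 / D) for a Gaussian source under MSE;
# the ratio R(D) / ((1/2) log(1/D)) tends to 1 as D -> 0.
sigma2 = 2.0
for D in (1e-2, 1e-4, 1e-8):
    R = 0.5 * np.log(sigma2 / D)            # nats
    ratio = R / (0.5 * np.log(1.0 / D))
    print(f"D = {D:.0e}:  R(D) = {R:.3f} nats, ratio = {ratio:.4f}")
```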

In particular, consider the special case of a scalar source and MSE distortion . Then, whenever exists and is finite, (331). Therefore, is the scaling factor of with respect to in the high-rate regime, which gives an operational characterization of information dimension in Shannon theory. Note that in the most familiar cases we can sharpen (331) to show the following.
• is discrete and : .
• is continuous and : .

APPENDIX IV
INJECTIVITY OF THE COSINE MATRIX

We show that the cosine matrix defined in Remark 2 is injective on . We consider a more general case. Let and . Let be an matrix where


. We show that each submatrix formed by columns of is nonsingular if and only if are distinct. Let . Then, , where and denotes the th order Chebyshev polynomial of the first kind [54]. Note that is a polynomial in of degree . Also, if for some . Therefore, . Therefore, it is sufficient to restrict to , and the coefficient of the highest order term is the contribution from the main diagonal . Since the leading coefficient of is , we have . Therefore (332).

APPENDIX V
PROOF OF LEMMA 12

Lemma 18: Let be compact, and let be continuous. Then, there exists a Borel measurable function such that for all .
Proof: For all , is nonempty and compact since is continuous. For each , let the th component of be (333), where is the th coordinate of . This defines , which satisfies for all . Now we claim that each is lower semicontinuous, which implies that is Borel measurable. To this end, we show that for any , is open. Assume the opposite; then there exists a sequence in that converges to , such that . Due to the compactness of , there exists a subsequence that converges to some point in . Therefore, . But by the continuity of , we have . Hence, by definition of and , we have , which is a contradiction. Therefore, is lower semicontinuous.

Proof of Lemma 12: Let be -rectifiable and assume that (238) holds for all . Then, by definition there exists a bounded subset and a Lipschitz function , such that . By continuity, can be extended to the closure , and . Since is compact, by Lemma 18, there exists a Borel function , such that for all . By Kirszbraun’s theorem [42, 2.10.43], can be extended to a Lipschitz function with the same Lipschitz constant. Then (334) for all , which proves the -achievability of .

APPENDIX VI
PROOF OF LEMMA 3

Proof: Since -norms are equivalent, it is sufficient to consider only . Observe that is nonincreasing. Hence, for any , we have (335). The constant is given by in (77) and (78). To see the equivalence of covering by mesh cubes, first note that . On the other hand, any -ball of radius is contained in the union of mesh cubes of size (by choosing a cube containing some point in the set together with its neighboring cubes). Thus, . Hence, the limits in (77) and (78) coincide with those in (84) and (85).

APPENDIX VII
PROOF OF LEMMA 5

Proof: By Pinsker’s inequality, (336), where is the variational distance between and , and . In this case, where is countable, (337). By [29, Lemma 2.7, p. 33], when , (338), (339). When , by (336), ; when , by (339), (340). Using (336) again, (341). Since holds in the minimization of (95) and (97), (99) is proved.
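A quick numerical spot-check of the Pinsker step used above (an illustration only; the alphabet size and trial count are arbitrary): with the KL divergence in nats, the L1 distance between two distributions never exceeds the square root of twice the divergence.

```python
import numpy as np

# Spot-check of Pinsker's inequality: ||P - Q||_1 <= sqrt(2 D(P||Q)).
rng = np.random.default_rng(6)
for _ in range(5):
    p = rng.dirichlet(np.ones(8))   # random strictly positive pmfs
    q = rng.dirichlet(np.ones(8))
    l1 = np.abs(p - q).sum()
    kl = np.sum(p * np.log(p / q))  # in nats
    print(f"||P-Q||_1 = {l1:.4f}  <=  sqrt(2 D(P||Q)) = {np.sqrt(2*kl):.4f}")
```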


APPENDIX VIII
PROOF OF (169)

Proof: By (100), for all , (342), where (343), which is a nonnegative nondecreasing convex function. Since a.s., is in one-to-one correspondence with . Denote the distribution of and by and , respectively. By assumption, are independent; hence (344). By (95), (345), where is a distribution on . Denote the marginals of by . Combining (344) with properties of entropy and relative entropy, we have (346) and (347), (348). Therefore (349), (350), (351), (352), (353), (354), (355), (356), where:
• (350): by (346);
• (351): by (348), we have (357);
• (353): by (342);
• (354): let ; then , by (347);
• (355): due to the convexity of .

APPENDIX IX
PROOF OF LEMMA 13

Proof: By the -rectifiability of , there exists a bounded subset and an -Lipschitz mapping such that . Note that (358). By definition of , there exists , such that is covered by the union of . Then (359), which implies that (360). Therefore (361), (362), (363), where the last inequality follows from (358).

APPENDIX X
PROOF OF LEMMA 14

Proof: Suppose we can construct a mapping such that (364) holds for all . By (364), is injective. Let and . Then, by (364) and the -Lipschitz continuity of , (365), (366), (367) holds for all . Hence, is Lipschitz with respect to the distance, and it satisfies . To complete the proof of the lemma, we proceed to construct the required . The essential idea is to puncture the -ary expansion of such that any component has at most consecutive nonzero digits.
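A toy version of this puncturing idea (a sketch under assumed parameters, not the paper's construction verbatim): inserting a forced zero digit after every block of ell digits caps runs of nonzero digits, so two points whose expansions disagree early can no longer be nearly equal as reals after the map is applied.

```python
def puncture(digits, ell):
    out = []
    for i, d in enumerate(digits):
        out.append(d)
        if (i + 1) % ell == 0:
            out.append(0)          # forced zero: breaks runs of nonzeros
    return out

def to_real(digits, m=3):
    return sum(d * m ** -(i + 1) for i, d in enumerate(digits))

m, depth, ell = 3, 18, 4
# Adversarial pair: nearly equal as reals (0.1222... vs 0.2000...), yet
# their m-ary expansions disagree already at the first digit.
a = [1] + [m - 1] * (depth - 1)
b = [2] + [0] * (depth - 1)
print("before puncturing:", abs(to_real(a) - to_real(b)))   # ~ m**-depth
print("after puncturing: ", abs(to_real(puncture(a, ell)) -
                                to_real(puncture(b, ell))))  # bounded below
```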


Fig. 1. Schematic illustration of in terms of -ary expansions.

For notational convenience, define and for as follows: (368), (369). Define (370). A schematic illustration of in terms of the -ary expansion is given in Fig. 1.

Next we show that satisfies the expansiveness condition in (364). For any , let . Then, by definition, for some and . Without loss of generality, assume that and . Then, by construction of , there are no more than consecutive nonzero digits in or . Since the worst case is that and are followed by ’s and ’s, respectively, we have (371), (372), which completes the proof of (364).

ACKNOWLEDGMENT

The authors would like to thank M. Chiang for stimulating discussions and J. Luukkainen of the University of Helsinki for suggesting [50]. They are also grateful for suggestions by an anonymous reviewer.
REFERENCES

[1] C. E. Shannon, “A mathematical theory of communication,” Bell Syst. Tech. J., vol. 27, pp. 379–423, 623–656, 1948.
[2] D. Donoho, “Compressed sensing,” IEEE Trans. Inf. Theory, vol. 52, no. 4, pp. 1289–1306, Apr. 2006.
[3] E. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, vol. 52, no. 2, pp. 489–509, Feb. 2006.
[4] M. Wainwright, “Information-theoretic bounds on sparsity recovery in the high-dimensional and noisy setting,” in Proc. IEEE Int. Symp. Inf. Theory, Nice, France, Jun. 2007.
[5] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Nat. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, Nov. 2009.
[6] D. L. Donoho, A. Maleki, and A. Montanari, “Construction of message passing algorithms for compressed sensing,” to be submitted to IEEE Trans. Inf. Theory.
[7] Y. Eftekhari, A. H. Banihashemi, and I. Lambadaris, “An efficient approach toward the asymptotic analysis of node-based recovery algorithms in compressed sensing,” 2010 [Online]. Available: http://arxiv.org/abs/1001.2284
[8] P. Schniter, “Turbo reconstruction of structured sparse signals,” in Proc. Conf. Inf. Sci. Syst., Princeton, NJ, Mar. 2010.
[9] J. R. Munkres, Topology, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, 2000.
[10] A. Montanari and E. Mossel, “Smooth compression, Gallager bound and nonlinear sparse graph codes,” in Proc. IEEE Int. Symp. Inf. Theory, Toronto, ON, Canada, Jul. 2008.
[11] A. Rényi, “On the dimension and entropy of probability distributions,” Acta Mathematica Hungarica, vol. 10, no. 1–2, Mar. 1959.
[12] T. Kawabata and A. Dembo, “The rate-distortion dimension of sets and measures,” IEEE Trans. Inf. Theory, vol. 40, no. 5, pp. 1564–1572, Sep. 1994.
[13] Y. B. Pesin, Dimension Theory in Dynamical Systems: Contemporary Views and Applications. Chicago, IL: Univ. Chicago Press, 1997.
[14] B. R. Hunt and V. Y. Kaloshin, “How projections affect the dimension spectrum of fractal measures,” Nonlinearity, vol. 10, pp. 1031–1046, 1997.
[15] E. Çinlar, Probability and Stochastics. New York: Springer-Verlag, 2010.
[16] A. Rényi, Probability Theory. Amsterdam, The Netherlands: North-Holland, 1970.
[17] K. Falconer, Techniques in Fractal Geometry. New York: Wiley, 1997.
[18] K. Falconer, Fractal Geometry: Mathematical Foundations and Applications, 2nd ed. New York: Wiley, 2003.
[19] A. György, T. Linder, and K. Zeger, “On the rate-distortion function of random vectors and stationary sources with mixed distributions,” IEEE Trans. Inf. Theory, vol. 45, pp. 2110–2115, 1999.
[20] T. Linder and R. Zamir, “On the asymptotic tightness of the Shannon lower bound,” IEEE Trans. Inf. Theory, vol. 40, no. 6, pp. 2026–2031, Nov. 1994.
[21] I. Csiszár, “On the dimension and entropy of order of the mixture of probability distributions,” Acta Mathematica Hungarica, vol. 13, no. 3–4, pp. 245–255, Sep. 1962.
[22] P. Halmos, Naive Set Theory. Princeton, NJ: D. Van Nostrand, 1960.
[23] K. Kuratowski, Topology. New York: Academic, 1966, vol. I.
[24] G. Folland, Real Analysis: Modern Techniques and Their Applications, 2nd ed. New York: Wiley-Interscience, 1999.
[25] P. Mattila, Geometry of Sets and Measures in Euclidean Spaces: Fractals and Rectifiability. Cambridge, U.K.: Cambridge Univ. Press, 1999.
[26] E. Candès, J. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Commun. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[27] R. Calderbank, S. Howard, and S. Jafarpour, “Construction of a large class of deterministic matrices that satisfy a statistical isometry property,” IEEE J. Sel. Topics Signal Process., vol. 29, no. 4, 2009.
[28] W. Xu and B. Hassibi, “Compressed sensing over the Grassmann manifold: A unified analytical framework,” in Proc. 46th Annu. Allerton Conf. Commun. Control Comput., 2008, pp. 562–567.
[29] I. Csiszár and J. G. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems. New York: Academic, 1982.
[30] I. Csiszár, “Generalized cutoff rates and Rényi’s information measures,” IEEE Trans. Inf. Theory, vol. 41, no. 1, pp. 26–34, Jan. 1995.
[31] P. N. Chen and F. Alajaji, “Csiszár’s cutoff rates for arbitrary discrete sources,” IEEE Trans. Inf. Theory, vol. 47, no. 1, pp. 330–338, Jan. 2001.
[32] H. Shimokawa, “Rényi’s entropy and error exponent of source coding with countably infinite alphabet,” in Proc. IEEE Int. Symp. Inf. Theory, Seattle, WA, Jul. 2006.
[33] C. Chang and A. Sahai, “Universal quadratic lower bounds on source coding error exponents,” in Proc. Conf. Inf. Sci. Syst., 2007.
[34] R. G. Gallager, Information Theory and Reliable Communication. New York: Wiley, 1968.


[35] M. S. Pinsker, Information and Information Stability of Random Variables and Processes. San Francisco, CA: Holden-Day, 1964.
[36] K. Alligood, T. Sauer, and J. A. Yorke, Chaos: An Introduction to Dynamical Systems. New York: Springer-Verlag, 1996.
[37] R. Mañé, “On the dimension of the compact invariant sets of certain non-linear maps,” in Dynamical Systems and Turbulence, Warwick 1980, ser. Lecture Notes in Mathematics. Berlin, Germany: Springer-Verlag, 1981, vol. 898, pp. 230–242.
[38] B. R. Hunt and V. Y. Kaloshin, “Regularity of embeddings of infinite-dimensional fractal sets into finite-dimensional spaces,” Nonlinearity, vol. 12, no. 5, pp. 1263–1275, 1999.
[39] A. Ben-Artzi, A. Eden, C. Foias, and B. Nicolaenko, “Hölder continuity for the inverse of Mañé’s projection,” J. Math. Anal. Appl., vol. 178, pp. 22–29, 1993.
[40] H. Steinhaus, “Sur les distances des points des ensembles de mesure positive,” Fundamenta Mathematicae, vol. 1, pp. 93–104, 1920.
[41] G. J. Minty, “On the extension of Lipschitz, Lipschitz-Hölder continuous, and monotone functions,” Bull. Amer. Math. Soc., vol. 76, no. 2, pp. 334–339, 1970.
[42] H. Federer, Geometric Measure Theory. New York: Springer-Verlag, 1969.
[43] D. Preiss, “Geometry of measures in : Distribution, rectifiability, and densities,” Ann. Math., vol. 125, no. 3, pp. 537–643, 1987.
[44] G. A. Edgar, Integral, Probability, and Fractal Measures. New York: Springer-Verlag, 1997.
[45] M. Keane and M. Smorodinsky, “Bernoulli schemes of the same entropy are finitarily isomorphic,” Ann. Math., vol. 109, no. 2, pp. 397–406, 1979.
[46] A. Del Junco, “Finitary codes between one-sided Bernoulli shifts,” Ergodic Theory Dyn. Syst., vol. 1, pp. 285–301, 1981.
[47] J. G. Sinai, “A weak isomorphism of transformations with an invariant measure,” Dokl. Akad. Nauk SSSR, vol. 147, pp. 797–800, 1962.
[48] K. E. Petersen, Ergodic Theory. Cambridge, U.K.: Cambridge Univ. Press, 1990.
[49] W. H. Schikhof, Ultrametric Calculus: An Introduction to p-Adic Analysis. New York: Cambridge Univ. Press, 2006.
[50] M. A. Martin and P. Mattila, “ -dimensional regularity classifications for -fractals,” Trans. Amer. Math. Soc., vol. 305, no. 1, pp. 293–315, 1988.
[51] V. Milman, “Dvoretzky’s theorem: Thirty years later,” Geom. Funct. Anal., vol. 2, no. 4, pp. 455–479, Dec. 1992.
[52] W. Johnson and J. Lindenstrauss, “Extensions of Lipschitz maps into a Hilbert space,” Contemp. Math., vol. 26, pp. 189–206, 1984.
[53] E. C. Posner and E. R. Rodemich, “Epsilon entropy and data compression,” Ann. Math. Stat., vol. 42, no. 6, pp. 2079–2125, Dec. 1971.
[54] G. Szegö, Orthogonal Polynomials. Providence, RI: AMS, 1975.
[55] Y. Wu and S. Verdú, “Fundamental limits of almost lossless analog compression,” in Proc. IEEE Int. Symp. Inf. Theory, Seoul, Korea, Jun. 2009.

Yihong Wu (S’10) received the B.E. degree in electrical engineering from Tsinghua University, Beijing, China, in 2006 and the M.A. degree in electrical engineering from Princeton University, Princeton, NJ, in 2008, where he is currently working towards the Ph.D. degree at the Department of Electrical Engineering.
He is a recipient of the Princeton University Wallace Memorial honorific fellowship in 2010. His research interests are in information theory, signal processing, mathematical statistics, optimization, and distributed algorithms.

Sergio Verdú (S’80–M’84–SM’88–F’93) received the Telecommunications Engineering degree from the Universitat Politècnica de Barcelona, Barcelona, Spain, in 1980 and the Ph.D. degree in electrical engineering from the University of Illinois at Urbana-Champaign, Urbana, in 1984.
Since 1984, he has been a member of the faculty of Princeton University, Princeton, NJ, where he is the Eugene Higgins Professor of Electrical Engineering.
Dr. Verdú is the recipient of the 2007 Claude E. Shannon Award and the 2008 IEEE Richard W. Hamming Medal. He is a member of the National Academy of Engineering and was awarded a Doctorate Honoris Causa from the Universitat Politècnica de Catalunya in 2005. He is a recipient of several paper awards from the IEEE: the 1992 Donald Fink Paper Award, the 1998 Information Theory Outstanding Paper Award, an Information Theory Golden Jubilee Paper Award, the 2002 Leonard Abraham Prize Award, the 2006 Joint Communications/Information Theory Paper Award, and the 2009 Stephen O. Rice Prize from the IEEE Communications Society. He has also received paper awards from the Japanese Telecommunications Advancement Foundation and from Eurasip. He received the 2000 Frederick E. Terman Award from the American Society for Engineering Education for his book Multiuser Detection (Cambridge, U.K.: Cambridge Univ. Press, 1998). He served as President of the IEEE Information Theory Society in 1997. He is currently Editor-in-Chief of Foundations and Trends in Communications and Information Theory.
