Minimum Rates of Approximate Sufficient Statistics

Masahito Hayashi,† Fellow, IEEE, Vincent Y. F. Tan,‡ Senior Member, IEEE

Abstract—Given a sufficient statistic for a parametric family of distributions, one can estimate the parameter without access to the data. However, the memory or code size for storing the sufficient statistic may nonetheless still be prohibitive. Indeed, for n independent samples drawn from a k-nomial distribution with d = k − 1 degrees of freedom, the length of the code scales as d log n + O(1). In many applications, we may not have a useful notion of sufficient statistics (e.g., when the parametric family is not an exponential family) and we also may not need to reconstruct the generating distribution exactly. By adopting a Shannon-theoretic approach in which we allow a small error in estimating the generating distribution, we construct various approximate sufficient statistics and show that the code length can be reduced to (d/2) log n + O(1). We consider errors measured according to the relative entropy and variational distance criteria. For the code constructions, we leverage Rissanen's minimum description length principle, which yields a non-vanishing error measured according to the relative entropy. For the converse parts, we use Clarke and Barron's formula for the relative entropy of a parametrized distribution and the corresponding mixture distribution. However, this method only yields a weak converse for the variational distance. We develop new techniques to achieve vanishing errors and we also prove strong converses. The latter means that even if the code is allowed to have a non-vanishing error, its length must still be at least (d/2) log n.

Index Terms—Approximate sufficient statistics, Minimum rates, Memory size reduction, Minimum description length, Exponential families, Pythagorean theorem, Strong converse

†Masahito Hayashi is with the Graduate School of Mathematics, Nagoya University, and the Center for Quantum Technologies (CQT), National University of Singapore (Email: [email protected]).
‡Vincent Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore (Email: [email protected]).
This paper was presented in part [1] at the 2017 International Symposium on Information Theory (ISIT) in Aachen, Germany.
MH is partially supported by a MEXT Grant-in-Aid for Scientific Research (B) No. 16KT0017. MH is also partially supported by the Okawa Research Grant and Kayamori Foundation of Informational Science Advancement. The Centre for Quantum Technologies is funded by the Singapore Ministry of Education and the National Research Foundation as part of the Research Centres of Excellence programme.
VYFT is partially supported by an NUS Young Investigator Award (R-263-000-B37-133) and a Singapore Ministry of Education Tier 2 grant "Network Communication with Synchronization Errors: Fundamental Limits and Codes" (R-263-000-B61-112).

I. INTRODUCTION

The notion of sufficient statistics is a fundamental and ubiquitous concept in statistics and information theory [2], [3]. Consider a random variable X ∈ X whose distribution PX|Z=z depends on an unknown parameter z ∈ Z. Typically in detection and estimation problems, we are interested in learning the unknown parameter z. In this case, it is often unnecessary to use the full dataset X. Rather a function of the data Y = f(X) ∈ Y usually suffices. If there is no loss in the performance of learning Z given Y relative to the case when one is given X, then Y is called a sufficient statistic relative to the family {PX|Z=z}z∈Z. We may then write

PX|Z=z(x) = Σ_{y∈Y} PX|Y(x|y) PY|Z=z(y),  ∀ (x, z) ∈ X × Z,   (1)

or more simply that X − Y − Z forms a Markov chain in this order. Because Y is a function of X, it is also true that I(Z; X) = I(Z; Y). This intuitively means that the sufficient statistic Y provides as much information about the parameter Z as the original data X does.

For concreteness in our discussions, we often (but not always) regard the family {PX|Z=z}z∈Z as an exponential family [4], i.e., PX|Z=z ∝ exp(Σ_i z_i Y_i(x)). This class of distributions is parametrized by a set of natural parameters z = {z_i} and a set of natural statistics Y(x) = {Y_i(x)}, which is a function of the data. The natural statistics or maximum likelihood estimator (MLE) are known to be sufficient statistics of the exponential family. In many applications, large datasets are prevalent. The one-shot model described above will then be replaced by an n-shot one in which the dataset consists of n independent and identically distributed (i.i.d.) random variables X^n = (X_1, X_2, ..., X_n), each distributed according to PX|Z=z where the exact z is unknown. If the support of X is finite, the distribution is a k-nomial distribution (a discrete distribution taking on at most k values) and the empirical distribution or type [5] of X^n is a sufficient statistic for learning z. However, the number of types with denominator n on an alphabet with k values is (n+k−1 choose k−1) = Θ(n^{k−1}) [5]. We are interested in this paper in the "memory size" to store the various types. We imagine that each type is allocated a single storage location in the memory stack and the memory size is the number of storage locations. Thus, the memory size required to estimate the parameter z in a maximum likelihood manner (or the distribution PX|Z=z) is at least Θ(n^{k−1}) if the (index of the) type is stored. The exponent d = k − 1 here is the number of degrees of freedom in the distribution family, i.e., the dimensionality of the space Z that z belongs to. Can we do better than a memory size of Θ(n^d)? The answer to this question depends on the strictness of the recoverability condition of PX|Z=z. If PX|Z=z is to be recovered exactly, then the Markov condition in (1) is necessary and no reduction of the memory size is possible. However, if PX|Z=z is to be recovered only approximately, we can indeed reduce the memory size. This is one motivation of the current work.

In addition, going beyond exponential families, for general distribution families, we do not necessarily have a useful and universal notion of sufficient statistics. Thus, we often focus on local asymptotic sufficient statistics by relaxing the condition for sufficient statistics. For example, under suitable regularity conditions [6]–[8], the MLE forms a set of local asymptotic sufficient statistics. However, there is no prior work that discusses the required memory size if we allow the sufficient statistics to be approximate in some appropriate sense. To address this issue, we introduce the notion of the minimum coding length of certain asymptotic sufficient statistics and show that it is (d/2) log n + O(1), where d is the dimension of the parameter of the family of distributions. Hence, the minimum coding rate is the pre-log coefficient d/2, improving over the original d when exact sufficient statistics are used. Here, we also notice that the locality condition can be dropped. That is, our asymptotic sufficient statistics work globally. This is another motivation for the current paper.

A. Related Work

Our problem is different from lossy and lossless conventional source coding [9], [10] because we do not seek to reconstruct the data X^n but rather a distribution on X^n. Hence, we need to generalize standard data compression schemes. Such a generalization has been discussed in the context of quantum data compression by Schumacher [11]. Here, the source that generates the state cannot be directly observed. Schumacher's encoding process involves compressing the original dataset into a memory stack with a smaller size. The decoding process involves recovering certain statistics of the data to within a prescribed error bound δ ≥ 0.

Reconstruction of distributions has also been studied in the context of the information bottleneck (IB) method [12]–[14]. In the IB method, given a joint distribution P_{X,Y}, one seeks to find the best tradeoff between accuracy and complexity when summarizing a random variable X and an observed and correlated variable Y. More precisely, one finds a conditional distribution P_{X̃|X} that minimizes I(X; X̃) − βI(X̃; Y), where β > 0 can be regarded as a Lagrange multiplier. The random variable X̃ is then regarded as a summarized version of X. Although such a formalism is a generalization of the notion of sufficient statistics from exponential families to arbitrary distributions, it differs from the present work because our work is concerned with finding minimum rates in an asymptotic and information-theoretic framework.

The purpose of Rissanen's compression system [21], [22] is to obtain a compression system for data generated under a mixture of i.i.d. distributions. He showed that when the dimensionality of the data is d, the optimal redundancy over the Shannon entropy is (d/2) log n + O(1). Merhav and Feder [23] extended Rissanen's analysis to both the minimax and Bayesian (maximin) senses. Clarke and Barron [24], [25] refined Rissanen's analysis and determined the constant (in asymptotic expansions) under various regularity assumptions. While we make heavy use of some of Rissanen's coding ideas and Clarke and Barron's asymptotic expansions for the relative entropy between a parametrized distribution and a mixture, the problem setting we study is different. Indeed, the main ideas in Rissanen's work [21], [22] are only helpful for us to establish the achievability parts of our theorems with non-zero asymptotic error for the relative entropy criterion (see Lemma 1). Similarly, the main results of Clarke and Barron's work [24], [25] can only lead to a weak converse under the variational distance criterion (see Lemma 4). Hence, we need to develop new coding techniques and converse ideas to satisfy the more stringent constraints on the code sequences.

B. Main Contributions and Techniques

We provide a precise Shannon-theoretic problem formulation for compression for the model parameter z with an allowable asymptotic error δ ≥ 0 on the reconstructed distribution. This error is measured under the relative entropy and variational distance criteria. We use some of Rissanen's ideas for encoding in [21], [22] to show that the memory size can be reduced to approximately Θ(n^{d/2}), resulting in a coding length of (d/2) log n + O(1). Note that Rissanen [21], [22] did not explicitly provide the decoders for the problem he considered; we explicitly specify various decoders. Moreover, assuming that the parametric family of distributions is an exponential family [4], we also improve on the evaluations that are inspired by Rissanen (see Lemma 2). In particular, for exponential families, we propose codes whose asymptotic errors measured according to the relative entropy criterion are equal to zero.
Furthermore, we consider two separate settings information-theoretic framework. known as the blind and visible settings. In the former, the Recently, Yang, Chiribella and Hayashi [15] extended Schu- encoder can directly observe the dataset Xn; in the latter macher’s [11] compression system to a special quantum model. the encoder directly observes the parameter of interest z. The In particular, the authors considered a notion of approximate differences between these two settings are discussed in more sufficient statistics in the quantum setting [16], [17] when detail in [18, Ch. 10]. The visible setting may appear to be the data is generated in an i.i.d. manner. They considered less natural but such a generalized setting is useful for the only the so-called blind setting [18, Ch. 10] and also only proofs of the converse parts. Yang, Chiribella and Hayashi [15] showed a weak converse. We note that there have been recent only considered the special case of the qubit model. They developments of the notion of approximate sufficient statistics also only considered the blind setting. We consider both blind and approximate Markov chains in the quantum information and visible settings and show, somewhat surprisingly, that the literature [19], [20] but the problem studied here and the coding length is essentially unchanged. objectives are different from the existing works. Another significant contribution of our work is in the Another related line of works in the classical information strengthening of the converse in [15]. In our strong converse theory literature are the seminal ones by Rissanen on universal proof for the relative entropy error criterion, we employ the variable-length source coding and [21], [22]. Pythagorean theorem for relative entropy, a fundamental con- Under the minimum description length (MDL) framework, he cept in information geometry [26]. Furthermore, we use Clarke introduced a two-step encoding process to obtain a prefix- and Barron’s formula [24], [25] to provide a weak converse free source code for n data samples generated from a mixture under the variational distance error criterion. This clarifies 3 the relation between our problem and Clarke and Barron’s on X n instead of length-n strings in X n. We often consider formula [24], [25]. We significantly strengthen this method a more relaxed condition for the encoder as follows. In the to obtain a strong converse (see Lemma 6); in contrast [15] visible setting, the encoder does not only have access to the only proves a weak converse. That is, we show that if the random vector Xn but also to the parameter z ∈ Z. error is allowed to be non-vanishing (even if it is arbitrarily Definition 2 (Visible code). A size M visible code C := large for the relative entropy criterion and arbitrarily close to n v,n (f , ϕ ) consists of 2 for the variational distance criterion), we would still require v,n n d( 1 −η) a memory size of at least n 2 for any η > 0 for all • A stochastic encoder (transition kernel) fv,n : Z → sufficiently large n. Yn := {1,...,Mn}; n This paper is organized as follows. In Section II, we formu- • A decoder ϕn : Yn → P(X ). late the problem precisely. We state the main result Theorem 1 We note that any blind encoder fb,n can be regarded as in Section III. The results are discussed in the context of exact a special case of a visible encoder fv,n because the visible sufficient statistics and exponential families in Section IV. 
We encoder fv,n can be written in terms of the blind encoder prove the direct parts of Theorem 1 in Section V, leveraging n fb,n and the distribution P as follows ideas from Rissanen’s seminal works [21], [22] on universal X|Z=z data compression. We prove the converse parts of Theorem 1 X n n n fv (z) := fb (x )P (x ), ∀ z ∈ Z. (4) in Section VI by leveraging the Pythagorean theorem in infor- ,n ,n X|Z=z xn∈X n mation geometry [26] and Clarke and Barron’s formula [24], [25]. We conclude our discussion in Section VII. B. Error Criteria II.PROBLEM FORMULATION The performance of any code is characterized by two quan- Let X be a set and let P(X ) denote the set of distributions tities. First, we desire the coding length log Mn = log |Yn| to (e.g., probability mass functions) on X . We consider a family be as short as possible. Next we desire a small error. To define of distributions {PX|Z=z}z∈Z ⊂ P(X ) parametrized by a an error criterion precisely, we notice that the reconstructed d n vector parameter z ∈ Z ⊂ R . We assume that n inde- distribution on X (in the visible case) is ϕn · fv,n(z) which pendent and identically distributed (i.i.d.) random variables is defined as Xn = (X ,...,X ) each taking values in X and drawn 1 n  n X  n ϕn · fv,n(z) (x )= Pr {fv,n(z)=y} ϕn(y) (x ). (5) from PX|Z=z. The underlying parameter Z, which is random, follows a distribution µ(dz), which is absolutely continuous y∈Yn d with respect to the Lebesgue measure on R . We will often use Hence the code has a smaller error when the distribution the following notations: Given conditional distributions PX|Y n ϕn · fv,n(z) is closer to the original distribution PX|Z=z for and PY |Z , respectively let the joint and marginal probabilities each z ∈ Z. To evaluate the difference between the two conditioned on Z = z be distributions, we consider an error or fidelity function F whose inputs are distributions on the same probability space. The (PX|Y × PY |Z )(x, y|z) := PX|Y (x|y)PY |Z (y|z), and (2) X average error is defined as (P · P )(x|z) := (P × P )(x, y|z). (3) X|Y Y |Z X|Y Y |Z Z y n εv(fv,n, ϕn) := F (ϕn · fv,n(z),PX|Z=z) µ(dz). (6) Standard asymptotic notation such as o(·), O(·), Ω(·) and Θ Z will be used throughout; fn = o(gn) iff limn→∞ |fn/gn| = 0, For a blind code, in view of (4), we similarly define fn = O(gn) iff limn→∞ |fn/gn| < ∞, fn = Ω(gn) iff gn = n O(fn) and fn = Θ(gn) iff fn = O(gn) and fn = Ω(gn). εb(fb,n, ϕn) = εv(fb,n · PX|Z=z, ϕn) Standard information-theoretic notation such as entropy H(·) Z · · n n (7) and I(·; ·) [3] will also be used. Finally, := F (ϕn fv,n PX|Z=z,PX|Z=z) µ(dz). Z k·k and k·k1 denote the `2 and `1 norms of finite-dimensional vectors respectively. In this paper, we consider two distance measures, namely P the relative entropy D(P kQ) := x P (x)(log P (x) − log Q(x)) and the variational distance1 (also known as the A. Definitions of Codes P total variation distance) kP − Qk1 := x |P (x) − Q(x)|. We consider two classes of codes [18, Ch. 10] for the Generalizations of these “distances” to continuous-alphabet problem of interest: distributions are performed in the usual manner. We denote

the errors in the blind and visible cases [18, Ch. 10] when we use the variational distance as ε_b^{(1)} and ε_v^{(1)} respectively. Similarly, we denote the errors in the blind and visible cases when we use the relative entropy as ε_b^{(2)} and ε_v^{(2)} respectively. The size of a code C_{b,n} is denoted as |C_{b,n}| = |Y_n|.

Definition 1 (Blind code). A size M_n blind code C_{b,n} := (f_{b,n}, ϕ_n) consists of
• A stochastic encoder (transition kernel) f_{b,n} : X^n → Y_n := {1, ..., M_n};
• A decoder ϕ_n : Y_n → P(X^n).

Observe that this definition of a code is similar to that for source coding except that the decoder outputs distributions on X^n instead of length-n strings.

1 Unlike some papers, we define the variational distance without the coefficient of 1/2 multiplying ‖P − Q‖_1, so 0 ≤ ‖P − Q‖_1 ≤ 2.
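As a small illustration (not part of the original paper), the following Python sketch evaluates the two error criteria used throughout: the relative entropy and the variational distance, the latter in the convention of the footnote above (no 1/2 factor). The function names are ours; the final line checks Pinsker's inequality in this normalization, D(P‖Q) ≥ ‖P − Q‖_1²/2 in nats.

```python
# Minimal sketch (ours) of the two error criteria for discrete distributions.
import numpy as np

def relative_entropy(p: np.ndarray, q: np.ndarray) -> float:
    """D(P || Q) in nats; assumes supp(P) is contained in supp(Q)."""
    mask = p > 0
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

def variational_distance(p: np.ndarray, q: np.ndarray) -> float:
    """||P - Q||_1 without the 1/2 factor, so it lies in [0, 2]."""
    return float(np.abs(p - q).sum())

if __name__ == "__main__":
    p = np.array([0.5, 0.3, 0.2])
    q = np.array([0.4, 0.4, 0.2])
    d, v = relative_entropy(p, q), variational_distance(p, q)
    print(d, v, d >= v**2 / 2)   # Pinsker's inequality in this convention
```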

C. Definitions of Minimum Compression Rates and Properties holds [26]–[28]. We also assume compact convergence Definition 3 (Minimum Compression Rate). Let δ ≥ 0. We (i.e., uniform convergence on compact sets) for (14). define the minimum compression rate for blind codes for a (iii) (Asymptotic Efficiency) There exists a sequence of esti- mators zˆ =z ˆ (Xn) for the parameter z such that given parametric family {PX|Z=z}z∈Z as n n   d  1  (i) log |Cb,n| (i)   Ez D(PX|Z=ˆzn kPX|Z=z) = + o . (15) Rb (δ) := inf lim lim εb (Cb,n) ≤ δ {C } n→∞ n→∞ 2n n b,n n∈N log n (8) In other words, the estimator zˆn asymptotically achieves the Cramer-Rao´ lower bound [2], [29] (i.e., [(ˆz − where denotes whether the error function is the Ez n i = 1, 2 z)(ˆz − z)T ] → J −1), so the expectation of (14) with variational distance or relative entropy respectively. In a n z z = z and z0 = z yields (15). similar manner, we define the minimum compression rate for n (iv) (Local Asymptotic Normality) Fix a point z ∈ Z visible codes for a given parametric family {PX|Z=z}z∈Z as n n and let zˆML(X ) be the MLE of z given X . De-   n √ 1/2 n (i) log |Cv,n| (i) fine the function hz(X ) = nJz (ˆzML(X ) − z) Rv (δ) := inf lim lim εv (Cv,n) ≤ δ . (d) −d/2 2 {C } n→∞ n→∞ and let φ (x) := (2π) exp(−kxk /2) be the d- v,n n∈N log n (9) dimensional standard Gaussian probability density func-

(i) tion. The local asymptotic normality condition [6]–[8] To understand this definition, we note that if Rb (δ) = reads c > 0, then for every  > 0, there exists a sequence of codes (d)  n −1  φ − P z0 · h 0 → 0 (16) {Cb,n}n∈ with asymptotic error no larger than δ and memory | = + √ √z N X Z z n z+ 1 c+ n or coding length upper bounded as |Cb,n| ≤ n for n large 0 enough. Moreover, there is no sequence of codes with with for any vector z ∈ Rd. c− (Local Asymptotic Sufficiency) asymptotic error no larger than δ and with |Cb,n| ≤ n . (v) Let Z be the random 0 The definition of the minimum compression rate differs variable corresponding to the parameter z and let Y be n significantly from traditional source coding in Shannon the- the corresponding MLE zˆML(X ). The local asymptotic ory [3] where the normalization of the coding length log |Cb,n| sufficiency condition [6]–[8] reads is n and not log n. Here, we find that the normalization that   n P n 0 · P 0 z0 − P 0 → 0 X |Y ,Z=z Y |Z=z+ √ | = + √z yields meaningful results is log n as the memory size scales n X Z z n 1 polynomially (and not exponentially) with the blocklength, (17) c for any vector z0 ∈ d. i.e., |Yn| ≈ n for some c > 0. R From the above definitions, it is clear that for any 0 ≤ δ ≤ δ0 Conditions (i)–(v) are satisfied as long as the parametrized and i = 1, 2, we have family satisfies some weak smoothness condition [6]–[8]. (i) 0 (i) Assuming (i), (ii), (iv) and (v), the minimum com- Ra (δ ) ≤ Ra (δ), a = b, v, (10) Theorem 1. pression rate for visible codes under the variational distance R(1)(0) ≤ R(2)(0), a = b, v, (11) a a error criterion R(i)(δ) ≤ R(i)(δ). (12) v b (1) d Rv (δ1) = , ∀ δ1 ∈ [0, 2). (18) Note that (11) follows from Pinsker’s inequality (i.e., 2 log e 2 D(P kQ) ≥ 2 kP − Qk1) because a vanishing relative Assuming (i), (ii), the minimum compression rate for visible entropy implies the same for the variational distance. codes under the relative entropy error criterion

(2) d III.ASSUMPTIONSAND MAIN RESULTS R (δ2) = , ∀ δ2 ∈ [0, ∞). (19) v 2 Let Jz be the matrix of the parametric Assuming (i), (ii), (iv), and (v), the minimum compression rate family {PX|Z=z}z∈Z . This matrix has elements for blind codes under the variational distance error criterion ∂ log P (X)∂ log P (X) X|Z=z X|Z=z (1) 0 d 0 [Jz]i,j =Ez , Rb (δ1) = , ∀ δ1 ∈ [0, 2). (20) ∂zi ∂zj 2 (13) Assuming (i), (ii), and (iii) the minimum compression rate for where Ez means that we take expectation with respect to X ∼ blind codes under the relative entropy error criterion PX|Z=z. Before we state the main results of this paper, we consider the following assumptions: (2) 0 d 0 hd  Rb (δ2) = , ∀ δ2 ∈ , ∞ . (21) d 2 2 (i) (Boundedness of Parameter Space) The set Z ⊂ R is d bounded and has positive Lebesgue measure in R ; Furthermore, if (i) holds and {PX|Z=z}z∈Z is an exponential (ii) (Euclidean Approximation of Relative Entropy) As z0 → family [4], (21) can be strengthened to z, the relation d R(2)(δ0 ) = , ∀ δ0 ∈ [0, ∞). (22) b 2 2 2 D(PX|Z=zkPX|Z=z0 ) The direct and converse parts of this theorem are proved in 1 X 0 0 0 2 = [Jz]i,j(zi − zi)(zj − zj) + o(kz − z k ) (14) Sections V and VI respectively. Remarks on and implications 2 i,j of the theorem are detailed in the following section. 5
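To make the Fisher information definition in (13) concrete, here is a hedged numerical sketch (ours, not the authors') for a one-parameter Bernoulli family with mean parameter z, where the closed form J_z = 1/(z(1 − z)) is available for comparison; the score is approximated by central finite differences.

```python
# Sketch (ours): Fisher information of Bernoulli(z) via the definition in (13),
# i.e., E_z[(d/dz log P_{X|Z=z}(X))^2], with finite-difference derivatives.
import numpy as np

def log_pmf(x: int, z: float) -> float:
    return np.log(z) if x == 1 else np.log(1.0 - z)

def fisher_information(z: float, eps: float = 1e-5) -> float:
    total = 0.0
    for x, prob in ((1, z), (0, 1.0 - z)):
        score = (log_pmf(x, z + eps) - log_pmf(x, z - eps)) / (2 * eps)
        total += prob * score ** 2
    return total

if __name__ == "__main__":
    for z in (0.2, 0.5, 0.8):
        print(z, fisher_information(z), 1.0 / (z * (1 - z)))  # matches closed form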

IV. CONNECTIONTO SUFFICIENT STATISTICS AND Then for δ ≥ 0, we have EXPONENTIAL FAMILIES εb(fb,n, ϕn) Z In this section, we discuss the implications of Theorem 1 in n n = F (ϕn · fb,n · PX|Z=z,PX|Z=z) µ(dz) (28) greater detail by relating them to the notion of (exact) sufficient Z statistics [3, Sec. 2.9]. We first review the fundamentals of Z  X n sufficient statistics, then motivate the notion of approximate = F (fb,n · PX|Z=z)(y)ϕn(y), Z sufficient statistics, provide some background on exponential y  families, and finally show that if one stores the exact sufficient X n n (fb,n · PX|Z=z)(y)PX|Z=z,Y =y µ(dz) (29) statistics in the memory Yn, the memory size would be larger y than that prescribed by Theorem 1. Z X n = (fb,n · PX|Z=z)(y) Z y n  A. Review of Sufficient Statistics × F ϕn(y),PX|Z=z,Y =y µ(dz) (30) ≤ δ, (31) Suppose, for the , that the blind encoder fb,n is a n deterministic function. When Y = fb,n(X ) is a sufficient where (30) uses Condition (*) of F in (27), and (31) uses the statistic relative to the family {PX|Z=z}z∈Z [3, Sec. 2.9], the fact that the error is bounded by δ according to (8) and (9). If n n conditional distribution PX|Z=z,Y =y(x ) does not depend on δ = 0, the equality in (23) holds, and we revert to the usual z, i.e., Z (−− Y (−− X forms a Markov chain in this order. notion of exact sufficient statistics discussed in Section IV-A. n In this case, we can choose the decoder ϕn : Yn → P(X ) Hence, the codes that we consider allowing for errors can be as follows regarded as a generalization of sufficient statistics. n ϕn(y) := PX|Z=z,Y =y. (23) C. Background on Exponential Families n n n Now, noting that PX|Z=z({x : fb,n(x ) = y}) = fb,n · n To put our results in Theorem 1 into context, we regard PX|Z=z(y) for every y in the memory Yn, we have {PX|Z=z}z∈Z as an exponential family [4] with parameter d X space Z ⊂ R . Recall that a parametric family of distributions · · n (24) ϕn fb,n = fb,n PX|Z=z(y)ϕn(y) {PX|Z=z}z∈Z is called an exponential family if it takes the y form X n n n n = PX|Z=z({x : fb,n(x ) = y})PX|Z=z,Y =y (25) " m # y X PX|Z=z(x) = PX (x) exp ziYi(x) − A(z) , (32) n = PX|Z=z. (26) i=1 where A(z), the cumulant generating function of the random Observe that, in this case, regardless of which error metric we vector (Y1(X),...,Ym(X)), is defined as choose, we will attain zero error between the reconstructed n " m # distribution ϕn · fb,n and the original one PX|Z=z. However, X X A(z) := log P (x) exp z Y (x) . (33) as we will see, if PX|Z=z is an exponential family the memory X i i size exceeds that prescribed by the various statements in x i=1

Theorem 1. Thus it is natural to relax the stringent condition The functions Yi(x) are known as the sufficient statistics of in (23) to some approximate versions of it. the exponential family. Another fact that we exploit in the sequel is that for any exponential family, there is an alternative parametrization known as the moment parametrization [4]. B. Exact vs. Approximate Sufficient Statistics There is a one-to-one correspondence between the natural parameter z and the expectation parameter To consider an approximate version of (23), we will make an assumption on the error function F ∂A(z) X ηi(z) := = Ez[Yi] = PX|Z=z(x)Yi(x). (34) ∂z 0 i x (*) Consider distributions Pi and Pi such that they re- spectively have disjoint supports from Pj, j =6 i and Hence, in the following to estimate the natural parameter z, we P 0, j =6 i. Then we assume that j can first estimate the moments η(z) = (η1(z), . . . , ηm(z)) ∈ H { ∈ Z} and then use the one-to-one correspon-   := η(z): z X X 0 X 0 dence to obtain z. F piPi, piPi = piF (Pi,Pi ) (27) i i i D. An Example: k-nomial Distributions where {pi} forms a probability mass function. This is clearly satisfied for the variational distance error Now as a concrete example, we consider a k-nomial dis- function. tribution, i.e., the family of discrete distributions that take on 6 k ∈ N values. The set of k-nomial distributions forms an Lemma 1. Assuming (i), (ii), we have exponential family with sufficient statistics  1 x = i (2) d  Rv (0) ≤ . (38) Yi(x) = −1 x = i + 1 , i = 1, . . . , k − 1. (35) 2  0 else Note that there are other parametrizations. It is known that the In addition, assuming (i), (ii), and (iii), vector Y (x) := (Y1(x),...,Yk−1(x)) allows us to recover information about the unknown parameter z [26], [30], i.e., (2)d d R ≤ . (39) Y (x) is a sufficient statistic for the k-nomial distribution. b 2 2 n Given n i.i.d. data samples from PX|Z=z, the exponential family can be written as Note that (38) proves that the direct part of (19) holds " m # (2) d because it implies that Rv (δ ) ≤ for all δ ∈ [0, ∞). n n n n X (n) n 2 2 2 PX|Z=z(x )=PX (x ) exp Yi (x )zi −nA(z) , (36) Similarly, (39) proves that (21) holds because it implies i=1 (2) 0 d 0 d that Rb (δ2) ≤ 2 for all δ2 ∈ [ 2 , ∞). These statements n n (n) n follow immediately from the bound in (10) concerning the where x = (x1, . . . , xn) ∈ X and Yi (x ) := Pn monotonicity of 7→ (2) and 7→ (2) . j=1 Yi(xj). Thus the vector of sufficient statistics is δ Rv (δ) δ Rb (δ) (n) n (n) n (n) n Y (x ) := (Y1 (x ),...,Ym (x )). In the k-nomial case, We also note that (39), which follows from Rissanen’s the dimension of the exponential family m = k − 1. It is easy ideas [21], [22], is rather weak because the asymptotic error is (n) n d to see that the total number of possibilities of Y (x ), i.e., bounded above by 2 instead of 0. We improve on this severe (n) n n n n+k−1 limitation in the subsequent subsections. the size of the set {Y (x ): x ∈ X } is k−1 . This is also the total number of n-types [5] on an alphabet of size k. Proof of Lemma 1: We first prove (39). Then we describe In this case, the required memory size is how to modify the argument slightly to show (38). Fix a lattice   Z √t d ∩ Z ⊂ Z n + k − 1 span t > 0 and consider the subset n,t := n Z . log |Y | = log = (k − 1) log n + O(1). (37) n n k − 1 Given the MLE zˆn :=z ˆML(X ), we consider the closest point Note that k − 1 = d, the dimension of the parameter space X 0 0 Z. 
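The count in (37) can be checked directly. The short sketch below (ours) compares the logarithm of the number of types with (k − 1) log n for a few values of n, illustrating the pre-log coefficient d = k − 1 when the exact sufficient statistic (the type) is stored.

```python
# Sketch (ours): memory size for storing the type of x^n, as in (37).
from math import comb, log2

def type_count(n: int, k: int) -> int:
    """Number of types with denominator n on an alphabet of size k."""
    return comb(n + k - 1, k - 1)

if __name__ == "__main__":
    k = 4                      # d = k - 1 = 3 degrees of freedom
    for n in (10**2, 10**3, 10**4, 10**5):
        print(n, log2(type_count(n, k)), (k - 1) * log2(n))
```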
Thus, the pre-log coefficient is d, which is twice as zn,t(ˆzn) := arg min [Jz]i,j(ˆzn,i − zi)(ˆzn,j − zj). (40) 0∈Z large as what the results of Theorem 1 prescribe if we use z n,t i,j approximate sufficient statistics in the sense of (31) instead of exact sufficient statistics discussed in Section IV-A. This That is, for this blind encoder, the memory Yn is taken to be motivates us to study the fundamental limits of approximate n n Zn,t and the encoder is fb,n(X ) := zn,t(ˆzML(X )), i.e., we sufficient statistics in the large n limit to reduce the memory first compute the MLE then we approximate it with a point in a d+o(1) d +o(1) size |Y | from n to n 2 . n finite subset Zn,t using the formula in (40). The decoder is the map from the parameter z to the distribution P n . The n,t X|Z=zn,t V. PROOFSOF DIRECT PARTS OF THEOREM 1 d coding length is log |Zn,t| = 2 log n+O(1), where the depen- In this section, the direct parts (upper bounds) of Theo- dence on t is in the O(1) term. Note that Rissanen [21], [22] rem 1 will be proved. For logical reasons, the statements in essentially proposed the same encoder but he was considering Theorem 1 will not be proved sequentially. Rather we will a different problem of universal source coding. Also Rissanen present the simplest proofs before proceeding to the proofs for did not explicitly specify the decoder. We also mention that more general statements. First, in Section V-A, we will prove Merhav and Feder [23] extended Rissanen’s analysis to both semi-direct part for the relative entropy criterion in (21). This the minimax and Bayesian (maximin) formulations. immediately leads the proof of the direct part for (19). Next, Now, for any r > 0 and any norm k · k, we have the in Section V-B, we strengthen the direct part for exponential 2 2 1 2 inequality ka−bk ≤ (1+r)kak +(1+ r )kbk , a consequence families under the relative entropy criterion in (22). Finally, √ q 2 of the basic fact that − 1 ≥ . Applying this in Section V-C, we prove the direct part in the blind setting ra r b 0 1 under the variational distance criterion as in (20). inequality to the norm 2 k · kJz (Jz is positive definite) with a ≡ zˆn − z and b ≡ zˆn − zn,t, we obtain A. Semi-Direct Part Based On Rissanen’s Minimum Descrip- tion Length (MDL) Encoder 1 X [J ] (z − z )(z − z ) 2 z i,j n,t,i i n,t,j j Here we prove the direct parts for (19) and (21) where i,j the error criterion used is the relative entropy. We present a 1 + r X ≤ [J ] (ˆz − z )(ˆz − z ) complete achievability proof in the visible setting, i.e., (19). 2 z i,j n,i i n,j j Notice that (19) implies (18). Under the same error criterion, i,j we show a semi-achievability in the blind setting (i.e., (21)) in 1 + 1 X + r [J ] (z − zˆ )(z − zˆ ). (41) which the error does not vanish even in the limit of large n. 2 z i,j n,t,i n,i n,t,j n,j i,j 7
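A minimal sketch (ours) of the quantization step in (40): the MLE is mapped to a nearby point of the lattice Z_{n,t} = (t/√n)Z^d intersected with the parameter set. For simplicity the J_z-weighted quadratic form in (40) is replaced by plain coordinate-wise rounding; the t/√n spacing is what drives the (d/2) log n coding length.

```python
# Sketch (ours) of the lattice encoder in the proof of Lemma 1 (simplified norm).
import numpy as np

def quantize_mle(z_hat: np.ndarray, n: int, t: float,
                 lower: float = 0.0, upper: float = 1.0) -> np.ndarray:
    """Round each coordinate to the nearest multiple of t/sqrt(n), then clip to
    an assumed box-shaped parameter set [lower, upper]^d."""
    step = t / np.sqrt(n)
    return np.clip(np.round(z_hat / step) * step, lower, upper)

def lattice_coding_length_bits(n: int, d: int, t: float, side: float = 1.0) -> float:
    points_per_axis = int(np.floor(side * np.sqrt(n) / t)) + 1
    return d * np.log2(points_per_axis)        # approx. (d/2) log2 n + O(1)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d, t = 10_000, 3, 1.0
    z_hat = rng.uniform(0, 1, size=d)          # stand-in for the MLE
    print(quantize_mle(z_hat, n, t))
    print(lattice_coding_length_bits(n, d, t), 0.5 * d * np.log2(n))
```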

We now estimate the error as follows: B. Direct Part For Exponential Families In the blind setting, the MDL encoder discussed in Sec- (2) εb (fb,n, ϕn) tion V-A has a non-vanishing error even in the asymptotic Z   limit. To overcome this problem, we devise a novel method := D [P n ] P n µ(dz) (42) Ez X|Z=zn,t X|Z=z attaining zero error in the asymptotic limit. Since the method Z Z h  i is more complicated in the general setting, and requires more ≤ n n (43) Ez D PX|Z=zn,t PX|Z=z µ(dz) assumptions (see Section V-C), we first assume that the Z distribution family forms an exponential family, and prove the Z 1 X = nEz [Jz]i,j(zn,t,i − zi)(zn,t,j − zj) direct part under the relative entropy criterion. 2 Z i,j  Lemma 2. When {PX|Z=z}z∈Z is an exponential family and 2 + o(kzn,t − zk ) µ(dz) (44) Assumption (i) holds, we have Z  (2) d n(1 + r) X Rb (0) ≤ . (50) ≤ Ez [Jz]i,j(ˆzn,i − zi)(ˆzn,j − zj) 2 2 Z i,j Note that this indeed implies the direct part of (22) since 1 (2) 0 d 0 n(1 + r ) X Rb (δ2) ≤ 2 for all δ2 ∈ [0, ∞). This significantly improves + [Jz]i,j(zn,t,i − zˆn,i)(zn,t,j − zˆn,j) 2 over the case where we do not assume that {PX|Z=z}z∈Z is i,j  an exponential family in (39) of Lemma 1 since we could only 2 (2) 0 d 0 d + o(kzn,t − zk ) µ(dz) (45) prove that Rb (δ2) ≤ 2 for all δ2 ∈ [ 2 , ∞), which is much weaker. In other words, the blind code presented below for 1 + r Z = d µ(dz) + o(1) exponential families can realize the same error performance 2 Z (which is asymptotically zero) as the visible code presented 1 Z  at the end of the proof of Lemma 1. We also observe that n(1 + r ) X + Ez [Jz]i,j(zn,t,i − zˆn,i)(zn,t,j − zˆn,j) Assumption (i) holds for the k-nomial example discussed in 2 Z i,j Section IV-D as the moment parameters [T ] belong to  Ez i 2 [−1, 1], which is bounded. + o kzn,t − zk µ(dz) (46) Proof of Lemma 2: First, to describe the encoder, we 1 + r Z 1 + 1 X extract the sufficient statistics from the data, i.e., we calculate ≤ d µ(dz) + o(1) + r |[J ] |t2 z i,j (n) n 2 Z 2 Y (xn) 1 X i≥j ηˆ := i = Y (x ), ∀ i = 1, . . . , d. (51) Z i n n i j  2 j=1 + Ez o nkzn,t − zk µ(dz) (47) Z 1 Next we fix a lattice span t > 0 and consider the subset of (1 + r)d 1 + X t d = + r |[J ] |t2 + o(1). (48) quantized moment parameters Hn,t := √ Z ∩H ⊂ H where 2 2 z i,j n i≥j recall that H = {η(z): z ∈ Z} is the set of feasible moment parameters (cf. Section IV-C). Given the observed value of We now justify some of the steps above. In (42), the expecta- ηˆ = (ˆη1,..., ηˆn), we choose the closest point in the lattice to n n n it, i.e., we choose tion is over the random zn,t(ˆzML(X )) where X ∼ PX|Z=z; in (43) we used Jensen’s inequality and the convexity of the 0 βt(η) := arg min kη − ηk. (52) 0 relative entropy; in (44) we used the Euclidean approximation η ∈Hn,t of the relative entropy in (14) (Assumption (ii)); in (45) we The encoder f is the map xn 7→ β (ˆη). For this encoder, the used the inequality in (41); in (46) we used (15) (Assumption b,n t memory size is |Hn,t|. Thus the coding length is log |Hn,t| = (iii)); and in (47) and (48) we used the definition of the lattice d t 2 log n+O(1), where the dependence in t is in the O(1) term. Zn,t resulting in the bound |zn,t,i −zˆn,i| ≤ √ for all i and n. n Now, we describe the decoder. Let Y (n) = (Y (n),...,Y (n)) Now since n ∈ N is arbitrary, 1 d and η = (η1, . . . , ηd). The decoder ϕn is the map 1 1 X (2) 1 + r 1 + r X 2 η¯ 7→ −1 PXn|Y (n)=nη. (53) lim εb (fb,n, ϕn) ≤ d + |[Jz]i,j|t (49) |β (¯η)| n→∞ 2 2 t ∈ −1(¯) i≥j η βt η
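The blind encoder of (51)–(52) can be sketched as follows (our illustration, using the k-nomial statistics in (35)); the helper names are ours and the lattice is not intersected with the feasible set H, so this is only a schematic of the construction.

```python
# Sketch (ours): empirical natural statistics (51) and lattice rounding (52)
# for the k-nomial exponential family with the statistics Y_i of (35).
import numpy as np

def knomial_statistics(x: np.ndarray, k: int) -> np.ndarray:
    """Row j holds (Y_1(x_j),...,Y_{k-1}(x_j)) with Y_i(x)=1 if x=i, -1 if x=i+1."""
    Y = np.zeros((len(x), k - 1))
    for i in range(1, k):
        Y[:, i - 1] = (x == i).astype(float) - (x == i + 1).astype(float)
    return Y

def encode(x: np.ndarray, k: int, t: float) -> np.ndarray:
    n = len(x)
    eta_hat = knomial_statistics(x, k).mean(axis=0)   # (51): empirical moments
    step = t / np.sqrt(n)
    return np.round(eta_hat / step) * step            # (52): nearest lattice point

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    k, n = 4, 100_000
    x = rng.choice(np.arange(1, k + 1), size=n, p=[0.1, 0.2, 0.3, 0.4])
    print(encode(x, k, t=1.0))
```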

In other words, given the estimate βt(η) ∈ Hn,t, we consider Since is arbitrary, we may take → so the second term t > 0 t 0 a uniform mixture of all the distributions PXn|Y (n)=nη where vanishes. Next, since r > 0 is arbitrary, we may take r → 0 η runs over all points in the lattice that map to βt(η) under so the first term converges to the asymptotic error bound of the encoding map in (52). d 2 . This proves the upper bound in (39). In the following calculation of the error, we first consider We now consider visible case, which is simpler. In this case, the scalar case in which d = 1 for simplicity. At the end of we can replace the MLE zˆn by z, since the encoder has direct the proof, we show how to extend the ideas to the case where access to the parameter z. Hence, the first terms in (45)–(49) d > 1. A few additional notational conventions are needed. are equal 0 and we obtain (38) as desired. For each element η¯, denote the uniform distribution on the 8

subset β_t^{-1}(η̄) as U_{β_t^{-1}(η̄)}. Next, denote the transition kernel (channel) that maps the parameter η̄ to the uniform distribution on the set β_t^{-1}(η̄) (i.e., U_{β_t^{-1}(η̄)}) as U_{β_t^{-1}(Y)|Y}. Denote the distribution of the random variable Y_1^{(n)}(X^n) when X^n ∼ P_{X|Z=z}^n as P_{Y_1^{(n)}|Z=z}. Let the variance of Y_1^{(n)} under the distribution P_{X|Z=z} be V_z. The normalizing linear transformation y ↦ √n (y − E_z[Y_1^{(n)}])/√V_z is denoted as g_z. Since Y_1^{(n)} is a sufficient statistic relative to the exponential family {P_{X|Z=z}}_{z∈Z}, we know that for any error criterion (in particular the relative entropy criterion),

D(ϕ_n · f_{b,n} · P_{X|Z=z}^n ‖ P_{X|Z=z}^n) = D(U_{β_t^{-1}(Y)|Y} · P_{Y_1^{(n)}|Z=z} ‖ P_{Y_1^{(n)}|Z=z})   (54)
= D((U_{β_t^{-1}(Y)|Y} · P_{Y_1^{(n)}|Z=z}) · g_z^{-1} ‖ P_{Y_1^{(n)}|Z=z} · g_z^{-1}),   (55)

where the last equality follows from the fact that the function g_z is one-to-one. By the central limit theorem, P_{Y_1^{(n)}|Z=z} · g_z^{-1} converges to the standard normal distribution φ(u) ∝ exp(−u²/2). On the other hand, by the definition of U_{β_t^{-1}(Y)|Y} (which results from the construction of the encoder in (52)), the distribution (U_{β_t^{-1}(Y)|Y} · P_{Y_1^{(n)}|Z=z}) · g_z^{-1} converges to a quantization of the standard normal distribution with span t, namely,

φ_{t,α}(u) ∝ φ(α + (j + 1/2)t),  ∀ u ∈ (α + jt, α + (j + 1)t],   (56)

where α ∈ [0, t] and the constant of proportionality in (56) is chosen so that ∫_R φ_{t,α}(u) du = 1. See Fig. 1 for an illustration of the probability density function in (56).

[Fig. 1. Plot of φ(u) and φ_{t,α}(u): illustration of the density φ_{t,α} in (56) with t = 0.4 and α = 0.3 (solid red line). The standard normal density φ is also shown (broken blue line). The two curves become increasingly close as t → 0.]

Thus, we have the upper bound

lim_{n→∞} D((U_{β_t^{-1}(Y)|Y} · P_{Y_1^{(n)}|Z=z}) · g_z^{-1} ‖ P_{Y_1^{(n)}|Z=z} · g_z^{-1}) ≤ sup_{α∈[0,t]} D(φ_{t,α} ‖ φ).   (57)

Since the convergence in (57) is uniform on compact sets (compact convergence), we have

lim_{n→∞} ε_b^{(2)}(f_{b,n}, ϕ_n) = lim_{n→∞} ∫_Z D(ϕ_n · f_{b,n} · P_{X|Z=z}^n ‖ P_{X|Z=z}^n) µ(dz)   (58)
≤ ∫_Z lim_{n→∞} D(ϕ_n · f_{b,n} · P_{X|Z=z}^n ‖ P_{X|Z=z}^n) µ(dz)   (59)
≤ sup_{α∈[0,t]} D(φ_{t,α} ‖ φ).   (60)

Now since the above holds for all t > 0, we can let t tend to 0 (so the size of the quantization regions decreases to 0). Consequently, the right-hand-side of (60) also tends to 0 and hence, lim_{n→∞} ε_b^{(2)}(f_{b,n}, ϕ_n) = 0.

For general dimension d > 1, we can show the desired statement as follows. Let φ^{(d)} be the d-dimensional standard normal distribution. Given α ∈ [0, t]^d, let φ_{t,α}^{(d)} be the corresponding quantization of the d-dimensional standard normal distribution with cutting point α and span t (cf. (56) for the one-dimensional distribution). Then in the same way,

lim_{n→∞} ε_b^{(2)}(f_{b,n}, ϕ_n) ≤ sup_{α∈[0,t]^d} D(φ_{t,α}^{(d)} ‖ φ^{(d)}).   (61)

Similarly, we can take t to tend to zero and the error criterion vanishes as n → ∞. The logarithm of the memory size (coding length) is thus (d/2) log n + O(1). This proves Lemma 2.

C. Direct Part For The General Case

In this section we treat the general case (not necessarily an exponential family). We prove the following lemma which establishes the direct part (upper bound) in the blind setting under the variational distance criterion as in (20).

Lemma 3. Assuming (i), (ii), (iv), and (v), we have

R_b^{(1)}(0) ≤ d/2.   (62)

Note that this implies the upper bound to (20) because (62) implies that R_b^{(1)}(δ'_1) ≤ d/2 for all δ'_1 ∈ [0, 2).

Proof of Lemma 3: Fix a lattice span t > 0 and choose the memory Y_n to be the quantized parameter space (lattice) Z_{n,t} := (t/√n) Z^d ∩ Z ⊂ Z. Given the observed MLE ẑ_ML(X^n) = z, we choose the encoder output to be the closest point in this lattice, i.e.,

β_t(z) := arg min_{z' ∈ Z_{n,t}} ‖z' − z‖.   (63)

The encoder f_{b,n} is the map x^n ↦ β_t(ẑ_ML(x^n)). Thus the code has memory Y_n = Z_{n,t} in which the coding length is log |Y_n| = log |Z_{n,t}| = (d/2) log n + O(1).

Now, to describe the decoder and the subsequent analysis, we use some simplified notation. Let Y and Y′ denote the random variables β_t(ẑ_ML(X^n)) and ẑ_ML(X^n) respectively. These can be thought of as the quantized MLE and the MLE respectively. As usual Z ∈ Z is the original parameter. The decoder ϕ_n is then the following map from elements in the memory to distributions in P(X^n):

z̄ ↦ (1/|β_t^{-1}(z̄)|) Σ_{z ∈ β_t^{-1}(z̄)} P_{X^n | Y′=z, Y=z̄}.   (64)
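The quantity sup_α D(φ_{t,α} ‖ φ) appearing in (57)–(61) can be evaluated numerically. The sketch below (ours, with truncation to [−8, 8] and a uniform grid as simplifying assumptions) illustrates that it vanishes as the span t → 0, which is the driving fact behind the vanishing error of this code.

```python
# Sketch (ours): D(phi_{t,alpha} || phi) for the quantized standard normal (56).
import numpy as np

def quantized_normal_density(u: np.ndarray, t: float, alpha: float) -> np.ndarray:
    """Piecewise-constant density proportional to phi(alpha + (j + 1/2) t) on each
    cell (alpha + j t, alpha + (j+1) t]; normalized numerically on the grid."""
    phi = lambda v: np.exp(-v**2 / 2) / np.sqrt(2 * np.pi)
    j = np.floor((u - alpha) / t)
    unnormalized = phi(alpha + (j + 0.5) * t)
    du = u[1] - u[0]
    return unnormalized / (unnormalized.sum() * du)

def kl_on_grid(p: np.ndarray, q: np.ndarray, du: float) -> float:
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])) * du)

if __name__ == "__main__":
    u = np.linspace(-8, 8, 200_001)
    du = u[1] - u[0]
    phi = np.exp(-u**2 / 2) / np.sqrt(2 * np.pi)
    for t in (0.8, 0.4, 0.1, 0.01):
        worst = max(kl_on_grid(quantized_normal_density(u, t, a), phi, du)
                    for a in np.linspace(0, t, 5))
        print(f"t={t:5.2f}: approx sup_alpha D(phi_t,alpha || phi) = {worst:.6f}")
```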

0 0 n 1 Essentially, the decoder takes the quantized MLE PY =ˆzML(X )|Z=z(y) {y = βt(y )}. Hence, the distribution n n 0 n βt(ˆzML(X )) and outputs a uniform mixture over all PY =βt(ˆzML(X )),Y =ˆzML(X )|Z=z converges to the standard “compatible” conditional distributions (i.e., all conditional normal distribution φ again by the local asymptotic normality distributions PXn|Y 0=z,Y =¯z whose parameter z lies in the assumption as stated in (16) (Assumption (iv)). Applying the −1 quantization cell βt (¯z)). triangle inequality to the right-hand-side of (70), we have We now estimate the error. We first consider the case lim An d = 1 for simplicity. For each element z¯, denote the uniform n→∞ −1 n distribution on the subset β (¯z) as U 0 . We denote the t Y |Y =¯z ≤ 0 × n − lim UY |Y PY =βt(ˆzML(X ))|Z=z φt,α 1 transition kernel corresponding to the map z¯ 7→ U 0 as n→∞ Y |Y =¯z PXn|Y 0,Y . Then the decoder ϕn can alternatively be written as + φt,α − φ 1 o the map z 7→ PXn|Y 0,Y =z · UY 0|Y =z. Now the error measured + φ − P = (ˆ ( n)) 0=ˆ ( n)| = (71) according to the variational distance can be written as Y βt zML X ,Y zML X Z z 1

(1) ≤ sup φt,α − φ 1. (72) εb (fb,n, ϕn) α∈[0,t] Z Now we analyze the right-hand-side of (69). First, note that = ϕn · fb,n · P n| = − P n| = µ(dz) (65) X Z z X Z z 1 0 n 1 Z Y − z =z ˆ (X ) − z behaves as Θ( √ ) with probability Z ML n tending to one by the central limit theorem (local asymptotic n 0 0 n = (PX |Y ,Y · UY |Y ) × PY =βt(ˆzML(X ))|Z=z Z normality). Since the quantization level is also of the order 1 n n Θ( √ ), the difference Y −z = β (ˆzML(X ))−z also behaves − PX |Z=z 1 µ(dz) (66) n t as Θ( √1 ) with probability tending to one. Hence, by regarding where (66) follows from the definitions of the encoder n 1{ } ∈ d n Y as the random variable Z =z ˜ for some z˜ R that PY =βt(ˆzML(X ))|Z=z (given the parameter is Z = z) differs from z ∈ Z ⊂ d by Θ( √1 ), we may write and decoder PXn|Y 0,Y =z · UY 0|Y =z. For clarity, we write R n PY =β (ˆz (Xn))|Z=z for the distribution of the quantized MLE t ML ≤ n 0 · 0 n − n n n Bn PX |Y ,Z=˜z PY =ˆzML(X )|Z=z PX |Z=z 1. (73) βt(ˆzML(X )) given that the samples X are independently generated from the distribution parametrized by z ∈ Z, i.e., At this point, we may apply the local asymptotic sufficiency n PX|Z=z. Let the integrand in (66) for fixed z be denoted as assumption as stated in (17) (Assumption (v)) to (73), yielding (1) . By the triangle inequality, εb (fb,n, ϕn; z) lim Bn = 0. (74) n→∞ (1) εb (fb,n, ϕn; z) ≤ An + Bn, (67) By (67), and similar compact convergence arguments as those leading from (58) to (60), we find that where the sequences An and Bn are defined as (1) n 0 · 0 × n lim ε (fb,n, ϕn) ≤ sup φt,α − φ . (75) An := (PX |Y ,Y UY |Y ) PY =βt(ˆzML(X ))|Z=z n→∞ b 1 α∈[0,t] − PXn|Y 0,Y · PY =β (ˆz (Xn)),Y 0=ˆz (Xn)|Z=z , t ML ML 1 Since this statement holds for all t > 0, taking the limit t → 0, (68) (1) we see that the asymptotic error limn→∞ εb (fb,n, ϕn) = 0. and So in the general case for d = 1, we can achieve a memory 1 n 0 n 0 n length (log of memory size or coding length) of log n+O(1). Bn := PX |Y ,Y · PY =βt(ˆzML(X )),Y =ˆzML(X )|Z=z 2 The case in which d > 1 can be analyzed in a completely − P n . (69) X |Z=z 1 analogous manner and we can conclude that a memory length d n 0 n Note that PY =βt(ˆzML(X )),Y =ˆzML(X )|Z=z is the joint distri- of 2 log n + O(1) can be achieved. n bution of the quantized MLE βt(ˆzML(X )) and the true MLE n n zˆML(X ) given that the samples X are independently gen- VI.PROOFSOF CONVERSE PARTS OF THEOREM 1 erated from the distribution parametrized by z ∈ Z. Now, by In this section, we prove the converse parts (lower bounds) the data processing inequality for the variational distance (i.e., to Theorem 1. We will only focus on the visible cases in (18) kPX|Y ·PY |Z=z −PX|Y ·QY |Z=zk1 ≤ kPY |Z=z −QY |Z=zk1), and (19) because according to (12), a converse for the visible the term An in (68) can be bounded as case implies the same for the blind case. Essentially, by (7),

0 n a visible code cannot be outperformed by a blind code. An ≤ UY |Y × PY =βt(ˆzML(X ))|Z=z Since our problem is closely related to Clarke and Barron’s − P = (ˆ ( n)) 0=ˆ ( n)| = (70) Y βt zML X ,Y zML X Z z 1 formula for the relative entropy between a parametrized dis- We now analyze the right-hand-sides of (70) and (69) in turn. tribution and a mixture distribution [24], [25], we clarify the By a similar reasoning as in the proof of Lemma 2, relation between our problem and this formula. To clarify this

0 n UY |Y × PY =βt(ˆzML(X ))|Z=z converges to the quanti- relation, in Section VI-A, we prove a weak converse, namely, d zation of the standard normal distribution φt,α (de- the impossibility of further compression from a rate of 2 when fined in (56)) by the local asymptotic normality as- the variational distance error criterion is asymptotically zero. sumption as stated in (16) (Assumption (iv)). Further- This can be shown by a simple combination of Clarke and more, since Y is a deterministic function of Y 0, we Barron’s formula and the uniform continuity of mutual infor- 0 n 0 n have the relation PY =βt(ˆzML(X )),Y =ˆzML(X )|Z=z(y, y ) = mation [31] (also called Fannes inequality [32] in quantum 10 information). Since the variational distance goes to zero when satisfying the condition that the error measured according to the relative entropy goes to zero, the weak converse under the variational distance vanishes, i.e., the variational distance criterion implies the weak converse under the relative entropy criterion. Hence, the arguments in (1) Section VI-A demonstrate the weak converse under both error δn := εv (Cv,n) → 0, as n → ∞. (78) criteria. These arguments clarify the relation between Clarke and Barron’s formula and our problem. However, to the best of n our knowledge, the strong converse parts cannot be shown via Now let the code distribution be PCv,n (x , z) := (ϕn · n Clarke and Barron’s formula, i.e., they require novel methods. fv,n(z))(x )µ(z) where ϕn · fv,n(z) is defined in (5) and n n Furthermore, there is no similar relation between the strong (ϕn · fv,n(z))(x ) is the evaluation of ϕn · fv,n(z) at x . converse parts under the variational distance and the relative Then, the definition of the error in the visible case in (6) entropy. This is because there is no relation between code rates and (78) implies that the variational distance between the code when the relative entropy is arbitrarily large and when the distribution and the generating distribution satisfies variational distance is arbitrarily close to 2, i.e., its maximum value. So, we need to prove two types of strong converse parts P − P n ≤ δ , (79) for each of the two error criteria. In Section VI-B, we prove Cv,n X ,Z 1 n a strong converse for the relative entropy error criterion using the Pythagorean theorem for the relative entropy, thus demon- for some sequence δ = o(1). Since |X | is finite, we can use strating (19). In Section VI-C, we prove a strong converse n the method of types to find a set of sufficient statistics for the for the variational distance error criterion by a different, and data (cf. Section IV-D). Indeed, we can form a set of sufficient novel, method, thus demonstrating (18). statistics relative to the family {PX|Z=z}z∈Z . Let us call the n |X | sufficient statistics Gn : X → R . The output cardinality n n n n |X |−1 A. Weak Converse Under Both Criteria Based On Clarke And of Gn is |Gn(X )| = |{Gn(x ): x ∈ X }| ≤ (n + 1) Barron’s Formula because we can take the type of xn to be the sufficient statistic, i.e., G (xn) = type(xn). By the data processing inequality In this section, we prove the following weak converse. n for the variational distance, we have Lemma 4. The following lower bound holds

−1 −1 (1) d · − n · ≤ R (0) ≥ . (76) PCv,n Gn PX ,Z Gn 1 δn. (80) v 2 This is, in fact, only a weak converse since asymptotically the error measured according to the variational distance must In the following we use a subscript to denote the distribu- tend to zero. It is insufficient to show (18) but we present tion of the random variables in the arguments of the mu- tual information functional, so for example IP (A; B) = the proof to demonstrate the connection between Clarke and P AB Barron’s result in (77) to follow and the problem we study. a PA(a)D(PB|A(·|a)kPB). Now, we notice that Here, we are only concerned with the variational distance criterion because a weak converse for this criterion implies n n I −1 (Gn(X ); Z) = IP n (X ; Z), and (81) the same for the relative entropy criterion. PXn,Z ·Gn X ,Z n n −1 Proof of Lemma 4: We first assume that X is a finite set. I (Gn(X ); Z) ≤ IPC (X ; Z) (82) PCv,n ·Gn v,n At the end, we show how to relax this condition. We recall that Clarke and Barron [24], [25] showed for a parametric family {PX|Z=z}z∈Z that where (81) follows from the definition of sufficient statistics and (82) follows from the data processing inequality for Z  Z  n n 0 mutual information. In addition, by the uniform continuity of D PX|Z=z PX|Z=z0 ν(dz ) µ(dz) Z Z mutual information [31], [32] and (80), we have d n = log + D(µkν) − D(µkµJ) + log CJ + o(1), (77) 2 2πe n n I −1 (G (X ); Z) − I −1 (G (X ); Z) PXn,Z ·Gn n PC ·Gn n where Jz is the Fisher information matrix defined in (13), v,n 1 √ |X |−1 µJ(dz) := det Jz dz is the so-called Jeffrey’s prior [25] ≤ δn log((n + 1) ) + ξ(δn), (83) CRJ √ and CJ := Z det Jz dz is the normalization factor. When ν = µ, the left-hand-side of (77) is precisely the mutual infor- mation I(Xn; Z) where the pair of random variables (Xn,Z) where ξ(x) := −x log x. We define the upper bound between n n n 0 is distributed according to P n (x , z) := P (x )µ(z). the two mutual information quantities in (83) as δn := X ,Z X|Z=z |X |−1 0 See [33] for an overview of approximations similar to (77) in δn log((n + 1) ) + ξ(δn) and note that δn = o(log n). the context of universal source coding and model selection. Define the joint distribution of the encoder and the parame-

For the purpose of proving the weak converse, we assume ter as Pfv,n (y, z) := Pfv,n(z)(y)µ(z) (recall the code is visible that we are given a sequence of codes {Cv,n := (fv,n, ϕn)}n∈N so fv,n has access to z). Consider the mutual information 11 between the parameter Z and the memory index Y , Pythagorean formula for relative entropy [26] states that     X X X X log |Yn| piD(PikQ)=D piPi Q + piD Pi pjPj . ≥ I (Y ; Z) (84) i∈I i∈I i∈I j∈I Pfv,n (95) Z  Z  0 0 In the following, we show that if the memory size |Yn| is (85) d = D fv,n(z) fv,n(z ) µ(dz ) µ(dz) (1−) Z Z too small, say n 2 for some fixed  > 0, then the error Z  Z  ε(2)(C ) tends to infinity as n grows. 0 0 v v,n ≥ D ϕn · fv,n(z) ϕn · fv,n(z )µ(dz ) µ(dz) (86) Consider any code C with memory size Z Z v,n = (fv,n, ϕn) d (1− ) n 2  |Yn| = n . Let Pfv,n(z)(y) = Pr{fv,n(z) = y} be the = IPCv,n (X ; Z) (87) n probability that the index in the memory Y ∈ Yn takes on ≥ I −1 (Gn(X ); Z) (88) PCv,n ·Gn the value y when the parameter is z ∈ Z under the random n 0 ≥ I −1 (G (X ); Z) − δ (89) encoder mapping fv,n. Then an application of the Pythagorean PXn,Z ·Gn n n n 0 theorem in (95) yields = I n (X ; Z) − δ (90) PX ,Z n   Z  Z  X n n n 0 0 D Pfv,n(z)(y)ϕn(y) P 0 − X|Z=z = D PX|Z=z PX|Z=z µ(dz ) µ(dz) δn (91) Z Z y∈Yn d n X n 0 = Pf (z)(y)D(ϕn(y)kP ) = log − D(µkµJ) + log CJ + o(1) − δn (92) v,n X|Z=z 2 2πe y∈Yn d   = log n + o(log n). (93) X X 0 0 2 − Pfv,n(z)(y)D ϕn(y) Pfv,n(z)(y )ϕn(y ) . 0 y∈Yn y ∈Yn In the above chain, (85), (87), and (91) follow from the defi- (96) nition of mutual information, (86) follows from the data pro- Hence, by integrating (96) over all ∈ Z, we obtain cessing inequality for the relative entropy, (88) follows from z (2) the data processing inequality for mutual information, (89) εv (fv,n, ϕn) follows from the uniform continuity of mutual information Z n  as stated in (83), (90) follows from the notion of sufficient = D ϕn · fv,n(z) PX|Z=z µ(dz) (97) Z statistics as seen in (81), and (92) follows from Clarke and Z   X n Barron’s formula [24] with ν = µ in (77). We conclude that = D Pfv,n(z)(y)ϕn(y) PX|Z=z µ(dz) (98) Z if a sequence of codes is such that the variational distance y∈Yn d +o(1) " vanishes as in (78), the memory size |Yn| ≥ n 2 . Z X n  Now, when X is not a finite set, we can choose a finite = Pfv,n(z)(y) D ϕn(y) PX|Z=z Z y∈Y disjoint partition {Sw}w∈W of X satisfying the following n  # conditions: (i) |W| is finite and (ii) ∪w∈W Sw = X . Now, X 0 0 − D ϕn(y) Pf (z)(y )ϕn(y ) µ(dz). (99) we define the parametric family PW |Z=z(w) := PX|Z=z(Sw). v,n 0 Clearly, we can go through the above proof with the finite- y ∈Yn support random variable W in place of X. Now, when the We analyze both terms in (99) in turn. n For the first term, we use an argument similar to that for the code reconstructs the original family {PX|Z=z}z∈Z , clearly n it also reconstructs the quantized family {P }w∈W . In proof of Theorem 1(a) in Rissanen [22]. Note that since the X|W =w d (1−) essence, reconstructing the latter is “easier” than the former. size of |Yn| is n 2 , the set Sn, of all possible distributions d (1−) n output by the decoder ϕ cannot exceed n 2 , i.e., |S | = Since (76) holds for the family {PX|W =w}w∈W it must also n n, d (1−) n |{ϕ (y): y ∈ Y }| ≤ |Y | = n 2 . For any given z ∈ Z, hold for {P | = }z∈Z . This completes the proof of (76). n n n X Z z 0 let the closest distribution in Sn, have parameter z ∈ Z. Since Z ∈ Rd is bounded, we can estimate the (order of the) 0 0 B. 
Strong Converse Under The Relative Entropy Criterion `2 distance between z and z , i.e., ∆ := kz − z k. If z is a point in general position in Z, then ∆ is of the same order as d (1−) In this section, we prove the following strong converse result r, where r is the largest radius of the n 2 disjoint spheres using the Pythagorean theorem for the relative entropy. d contained in Z. Since the volume spheres of radius r in R d Lemma 5. The following lower bound holds is proportional to r , we have that d d (1−) Kd · r · n 2 ≥ vol(Z), (100) (2) d Rv (δ2) ≥ , ∀ δ2 ∈ [0, ∞). (94) 2 where Kd is a constant that depends only on the dimension d. Since vol(Z) does not depend on n (it also only depends This proves the lower bound to (19). The proof hinges on the on d) and ∆ = Θ(r),2 Pythagorean formula for the relative entropy and a geometric − 1 (1−) argument also contained in Rissanen’s work [22]. ∆ = Ω(n 2 ). (101) Proof of Lemma 5: Given probability measures {Pi}i∈I 2The implied constants in the Ω( · ) notations used in (101), (102), and (104) and Q, and a probability mass function {pi}i∈I , the are all assumed to be positive. 12
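The decomposition (95) invoked in this proof is an identity for the relative entropy, and it is easy to verify numerically. The following sketch (ours) checks it for randomly drawn discrete distributions; all names and the Dirichlet sampling are illustrative choices, not part of the paper.

```python
# Sketch (ours): numerical check of the Pythagorean-type identity (95),
#   sum_i p_i D(P_i || Q) = D(P_bar || Q) + sum_i p_i D(P_i || P_bar),
# where P_bar = sum_i p_i P_i.
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    m, alphabet = 5, 8                              # components and support size
    P = rng.dirichlet(np.ones(alphabet), size=m)    # component distributions P_i
    q = rng.dirichlet(np.ones(alphabet))            # reference distribution Q
    w = rng.dirichlet(np.ones(m))                   # mixture weights p_i
    p_bar = w @ P
    lhs = sum(w[i] * kl(P[i], q) for i in range(m))
    rhs = kl(p_bar, q) + sum(w[i] * kl(P[i], p_bar) for i in range(m))
    print(lhs, rhs, abs(lhs - rhs) < 1e-10)
```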

5 At the same time, by the Euclidean approximation of relative (see Section II). Thus, we may choose α Mn points {zi : i = n n 0 2 5 entropy in (14), D(PX|Z=z0 kPX|Z=z) = Ω(nkz − z k ) = 1,..., α Mn} ⊂ Z satisfying the following two conditions:  Ω(n ). Thus the first term in (99) scales as α ϕ · f (z ) − P n ≤ 2 − , and (110) n v,n i X|Z=zi 1 Z X 2 n   −1 Pfv,n(z)(y)D ϕn(y) PX|Z=z µ(dz) = Ω(n ). (102)  5  Z ∀ i =6 j, |zi − zj| > λ(S) Mn , (111) y∈Yn α −1 1 On the other hand the second term in (99) is a conditional 5  − 2 +η Since λ(S) α Mn = Ω(n ), the distributions in the mutual information. In particular, it can be upper bounded as set {P n : i = 1,..., 5 M } ⊂ P(X n) are distinguish- X|Z=zi α n Z   able. That is, for any  > 0 we may choose an N ∈ N X X 0 0 Pfv,n(z)(y)D ϕn(y) Pfv,n(z)(y )ϕn(y ) µ(dz) satisfying the following. For any n ≥ N, there exists disjoint Z 0 n y∈Yn y ∈Yn subsets Di ⊂ X such that d = I(Xn; Y |Z) ≤ H(Y ) ≤ (1 − ) log n. (103) P n (D ) ≥ 1 −  (112) 2 X|Z=zi i 5 Note that the random variables in the information quantities for any i = 1,..., α Mn. For example, we may take above are computed with respect to the distribution induced n  1 X λ(S)  5 −1 by the visible code Cv,n = (fv,n, ϕn). D := xn ∈X n : x −z ≤ · M (113) i n j i 3 α n Combining (99), (102), and (103), we obtain j=1 5 (2)  d and it is then easy to verify that {Di : i = 1,..., Mn} ε (fv , ϕ ) ≥ Ω(n ) − (1 − ) log n → ∞. (104) α v ,n n 2 are disjoint and, by Chebyshev’s inequality, that (112) holds d (1−) for n large enough. Now recall that for any two probability Hence, with a memory size of n 2 , the error computed measures P,Q on a common probability space (Ω, ), half according to the relative entropy criterion for any visible code F the variational distance can be expressed as 1 kP − Qk = diverges. This completes the proof of (94). 2 1 sup{P (A)−Q(A): A ∈ F }. Thus, the combination of (110) and (112) shows that

C. Strong Converse Under The Variational Distance Criterion

In this section, we prove the following strong converse statement with respect to the variational distance error criterion.

Lemma 6. The following lower bound holds:

R_v^{(1)}(\delta_1) \ge \frac{d}{2}, \qquad \forall\, \delta_1\in[0,2).                                   (105)

Lemma 6 significantly strengthens Lemma 4 because the asymptotic error δ_1 is not restricted to be 0; rather, it can take any value in [0, 2). It demonstrates the lower bound to (18).

Proof of Lemma 6: We first consider the case in which d = 1. We proceed by contradiction. We assume, without loss of generality, that the parameter space Z = [0, 1]. Fix η ∈ (0, 1/2) and assume that the memory size M_n = |Y_n| is O(n^{1/2−η}) and

\varepsilon_v^{(1)}(f_{v,n},\phi_n) := \mathbb{E}_{z\sim\mu}\big[ \big\| \phi_n\cdot f_{v,n}(z) - P^n_{X|Z=z} \big\|_1 \big] \le 2-\alpha     (106)

for some α ∈ (0, 2) and n large enough. Let

S := \big\{ z\in Z : \big\| \phi_n\cdot f_{v,n}(z) - P^n_{X|Z=z} \big\|_1 \le 2-\tfrac{\alpha}{2} \big\}.     (107)

Markov's inequality implies that

\mu(S) \ge 1 - \frac{\mathbb{E}_{z\sim\mu}\big[ \| \phi_n\cdot f_{v,n}(z) - P^n_{X|Z=z} \|_1 \big]}{2-\frac{\alpha}{2}}     (108)
 \ge 1 - \frac{2-\alpha}{2-\frac{\alpha}{2}} = \frac{\alpha}{4-\alpha} > 0.                               (109)

Let λ be the Lebesgue measure on [0, 1]. From (109), we know that λ(S) > 0 by absolute continuity of µ with respect to λ (see Section II). Thus, we may choose (5/α)M_n points {z_i : i = 1, ..., (5/α)M_n} ⊂ Z satisfying the following two conditions:

\big\| \phi_n\cdot f_{v,n}(z_i) - P^n_{X|Z=z_i} \big\|_1 \le 2-\frac{\alpha}{2}, \quad\text{and}          (110)
\forall\, i\ne j, \quad |z_i - z_j| > \lambda(S)\Big( \frac{5}{\alpha}M_n \Big)^{-1}.                     (111)

Since λ(S)(α/5)M_n^{−1} = Ω(n^{−1/2+η}), the distributions in the set {P^n_{X|Z=z_i} : i = 1, ..., (5/α)M_n} ⊂ P(X^n) are distinguishable. That is, for any ε > 0 we may choose an N ∈ ℕ satisfying the following. For any n ≥ N, there exist disjoint subsets D_i ⊂ X^n such that

P^n_{X|Z=z_i}(D_i) \ge 1-\epsilon                                                                        (112)

for any i = 1, ..., (5/α)M_n. For example, we may take

D_i := \Big\{ x^n\in X^n : \Big| \frac{1}{n}\sum_{j=1}^n x_j - z_i \Big| \le \frac{\lambda(S)}{3}\Big( \frac{5}{\alpha}M_n \Big)^{-1} \Big\},     (113)

and it is then easy to verify that {D_i : i = 1, ..., (5/α)M_n} are disjoint and, by Chebyshev's inequality, that (112) holds for n large enough. Now recall that for any two probability measures P, Q on a common probability space (Ω, F), half the variational distance can be expressed as (1/2)‖P − Q‖_1 = sup{P(A) − Q(A) : A ∈ F}. Thus, the combination of (110) and (112) shows that

1-\frac{\alpha}{4} \ge \big( \phi_n\cdot f_{v,n}(z_i) \big)(D_i^c) - P^n_{X|Z=z_i}(D_i^c)                 (114)
 \ge \big( \phi_n\cdot f_{v,n}(z_i) \big)(D_i^c) - \epsilon.                                              (115)

In other words,

\big( \phi_n\cdot f_{v,n}(z_i) \big)(D_i) \ge \frac{\alpha}{4} - \epsilon.                                (116)

We denote the elements of Y_n by {1, ..., M_n}. The distribution at the output of the decoder ϕ_n · f_{v,n}(z) is a convex combination of {ϕ_n(1), ..., ϕ_n(M_n)}. Thus,

\sum_{j=1}^{M_n} \big( \phi_n(j) \big)(D_i) \ge \big( \phi_n\cdot f_{v,n}(z_i) \big)(D_i).                (117)

Hence,

M_n \ge \sum_{j=1}^{M_n} \big( \phi_n(j) \big)\Big( \bigcup_{i=1}^{\frac{5}{\alpha}M_n} D_i \Big)          (118)
 = \sum_{i=1}^{\frac{5}{\alpha}M_n} \sum_{j=1}^{M_n} \big( \phi_n(j) \big)(D_i)                           (119)
 \ge \sum_{i=1}^{\frac{5}{\alpha}M_n} \big( \phi_n\cdot f_{v,n}(z_i) \big)(D_i)                           (120)
 \ge \sum_{i=1}^{\frac{5}{\alpha}M_n} \Big( \frac{\alpha}{4} - \epsilon \Big) = \frac{5}{\alpha}M_n\Big( \frac{\alpha}{4} - \epsilon \Big),     (121)

where (120) and the inequality in (121) are applications of the bounds in (117) and (116), respectively. So, we obtain

1 \ge \frac{5}{\alpha}\Big( \frac{\alpha}{4} - \epsilon \Big),                                            (122)

which is a contradiction (if ε > 0 is chosen to be smaller than α/20). Hence, a memory size of |Y_n| = O(n^{1/2−η}) is insufficient to ensure that lim_{n→∞} ε_v(f_{v,n}, ϕ_n) is strictly smaller than 2.

In the general case, in which we assume for the sake of contradiction that the memory size is |Y_n| = O(n^{d(1/2−η)}) (for fixed η > 0), the memory size per dimension is of the order O(n^{1/2−η}). We can then treat each dimension separately and apply the above argument to yield the same contradiction when the memory size is too small.
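To make the construction (112)–(113) concrete, the following sketch checks it for the Bernoulli family with our own (hypothetical) choices of n and of the separated parameters z_i; the exact binomial tail is compared against the Chebyshev bound invoked in the proof.

```python
# Illustrative check (our own example, not from the paper) of the sets in (112)-(113):
# well-separated Bernoulli parameters z_i give pairwise disjoint sets
# D_i = {x^n : |mean(x^n) - z_i| <= sep/3}, each carrying probability at least
# 1 - (Chebyshev bound) under P^n_{X|Z=z_i}.
import numpy as np
from scipy.stats import binom

n = 10_000
num_points = 20
z = np.linspace(0.05, 0.95, num_points)   # separated parameters in Z = [0, 1]
sep = z[1] - z[0]                         # pairwise separation |z_i - z_j| >= sep
radius = sep / 3                          # each D_i uses a third of the separation

for zi in z:
    lo, hi = n * (zi - radius), n * (zi + radius)
    # Exact probability that the sample mean lands within radius of z_i.
    prob = binom(n, zi).cdf(np.floor(hi)) - binom(n, zi).cdf(np.ceil(lo) - 1)
    # Chebyshev bound on the complementary event, as used in the proof.
    chebyshev = zi * (1 - zi) / (n * radius**2)
    assert prob >= 1 - chebyshev - 1e-12
print("each D_i carries probability at least 1 minus its Chebyshev bound")
```

Disjointness of the D_i is immediate because each set only uses a third of the pairwise separation of the z_i, mirroring the factor 1/3 in (113).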
VII. DISCUSSION AND FUTURE RESEARCH DIRECTIONS

In this paper, we have considered the approximate reconstruction of a generating distribution P^n_{X|Z=z} from a compressed version of a source X^n (the blind setting) or from the parameter z of the generating distribution itself (the visible setting). We have shown, using various notions of approximate sufficient statistics, that when n i.i.d. observations X^n are available, the length of the optimal code in most cases and under suitable regularity conditions is (d/2) log n + o(log n). In the process of deriving our results, we have strengthened the achievability part based on Rissanen's MDL principle [21], [22]. We have also proved strong converses, thus strengthening the utility of Clarke and Barron's formula [24], [25], which, by itself, only leads to a weak converse under the variational distance error criterion.

There are three promising avenues for future research:
1) It is natural to question whether the assumption of {P_{X|Z=z}}_{z∈Z} being an exponential family is necessary to achieve R_b^{(2)}(δ'_2) = d/2 for all δ'_2 ∈ [0, ∞) in (22) (i.e., asymptotically zero error). We would like to remove this somewhat restrictive assumption if possible.
2) It is also natural to wonder about the scaling and form of the second-order term in the optimal code length log|Y_n|. It is known from Theorem 1 that, in most cases, the first-order term scales as (d/2) log n. Our achievability results based on Lemmas 1–3 suggest that the second-order term is of the constant order O(1). Establishing that this term is indeed a constant, and determining how this constant depends on δ ≥ 0, the bound on the error, would be of significant theoretical and practical interest.
3) Since there are many distance "metrics" that may be used for measuring distances between two distributions (e.g., Csiszár's f-divergences [34]), it may also be fruitful to study the asymptotics of the optimal code length log|Y_n| when other distance measures beyond the relative entropy and variational distance are used to quantify the discrepancy between P^n_{X|Z=z} and ϕ_n · f_{v,n}(P^n_{X|Z=z}) (in the blind setting).

Acknowledgements

The authors thank the Associate Editor, Dr. Peter Harremoës, and the two reviewers for their insightful comments, which helped to improve the paper significantly.
REFERENCES

[1] M. Hayashi and V. Y. F. Tan. Minimum rates of approximate sufficient statistics. In Proc. of IEEE Intl. Symp. on Inform. Th., pages 3035–3039, Aachen, Germany, 2017.
[2] E. L. Lehmann. Theory of Point Estimation. Springer, 2nd edition, 1998.
[3] T. M. Cover and J. A. Thomas. Elements of Information Theory. Wiley-Interscience, 2nd edition, 2006.
[4] M. J. Wainwright and M. I. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
[5] I. Csiszár and J. Körner. Information Theory: Coding Theorems for Discrete Memoryless Systems. Cambridge University Press, 2011.
[6] A. W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
[7] L. Le Cam. Asymptotic Methods in Statistical Decision Theory. Springer-Verlag, New York, 1986.
[8] L. Le Cam. Locally asymptotically normal families of distributions. University of California Publications in Statistics, 3:37–98, 1960.
[9] C. E. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27:379–423, 1948.
[10] C. E. Shannon. Coding theorems for a discrete source with a fidelity criterion. IRE Nat. Conv. Rec., pages 142–163, 1959.
[11] B. Schumacher. Quantum coding. Phys. Rev. A, 51(4):2738–2747, Apr 1995.
[12] N. Tishby, F. C. Pereira, and W. Bialek. The information bottleneck method. In Proc. of Allerton Conference, pages 368–377, Monticello, IL, 1999.
[13] G. Chechik, A. Globerson, N. Tishby, and Y. Weiss. Information bottleneck for Gaussian variables. Journal of Machine Learning Research, 6:165–188, May 2005.
[14] P. Harremoës and N. Tishby. The information bottleneck revisited or how to choose a good distortion measure. In Proc. of IEEE Intl. Symp. on Inform. Th., pages 566–570, Nice, France, 2007.
[15] Y. Yang, G. Chiribella, and M. Hayashi. Optimal compression for identically prepared qubit states. Phys. Rev. Lett., 117(9):090502, Aug 2016.
[16] D. Petz. Sufficient subalgebras and the relative entropy of states of a von Neumann algebra. Comm. Math. Phys., 105(1):123–131, 1986.
[17] M. Koashi and N. Imoto. Compressibility of quantum mixed-state signals. Phys. Rev. Lett., 87(1):017902, 2001.
[18] M. Hayashi. Quantum Information Theory: A Mathematical Foundation. Graduate Texts in Physics, Springer, 2nd edition, 2017.
[19] D. Sutter, O. Fawzi, and R. Renner. Universal recovery map for approximate Markov chains. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 472:2186, 2016.
[20] O. Fawzi and R. Renner. Quantum conditional mutual information and approximate Markov chains. Comm. Math. Phys., 340(2):575–611, 2015.
[21] J. Rissanen. A universal prior for integers and estimation by minimum description length. Annals of Statistics, 11(2):416–431, 1983.
[22] J. Rissanen. Universal coding, information, prediction, and estimation. IEEE Trans. on Inform. Theory, 30(4):629–636, 1984.
[23] N. Merhav and M. Feder. A strong version of the redundancy-capacity theorem of universal coding. IEEE Trans. on Inform. Theory, 41(3):714–722, May 1995.
[24] B. Clarke and A. Barron. Information-theoretic asymptotics of Bayes methods. IEEE Trans. on Inform. Theory, 36(3):453–471, 1990.
[25] B. Clarke and A. Barron. Jeffreys' prior is asymptotically least favorable under entropy risk. J. Statist. Plann. Inference, 41:37–60, 1994.
[26] S.-I. Amari and H. Nagaoka. Methods of Information Geometry. American Mathematical Society, 2000.
[27] S. Borade and L. Zheng. Euclidean information theory. In IEEE International Zurich Seminar on Communications, pages 14–17, 2008.
[28] E. Abbe and L. Zheng. Linear universal decoding for compound channels. IEEE Trans. on Inform. Theory, 56(12):5999–6013, Dec 2010.
[29] I. A. Ibragimov and R. Z. Hasminskii. Statistical Estimation: Asymptotic Theory. Springer-Verlag, New York, 1981.
[30] S.-I. Amari. Information geometry on hierarchy of probability distributions. IEEE Trans. on Inform. Theory, 47(5):1701–1711, 2001.
[31] Z. Zhang. Estimating mutual information via Kolmogorov distance. IEEE Trans. on Inform. Theory, 53(9):3280–3282, 2007.
[32] M. Fannes. A continuity property of the entropy density for spin lattice systems. Comm. Math. Phys., 31:291–297, 1973.
[33] A. Barron, J. Rissanen, and B. Yu. The minimum description length principle in coding and modeling. IEEE Trans. on Inform. Theory, 44(6):2743–2760, 1998.
[34] F. Liese and I. Vajda. On divergences and informations in statistics and information theory. IEEE Trans. on Inform. Theory, 52(10):4394–4412, Oct 2006.

Masahito Hayashi (M'06–SM'13–F'17) was born in Japan in 1971. He received the B.S. degree from the Faculty of Sciences, Kyoto University, Japan, in 1994, and the M.S. and Ph.D. degrees in Mathematics from Kyoto University, Japan, in 1996 and 1999, respectively. He worked at Kyoto University as a Research Fellow of the Japan Society for the Promotion of Science (JSPS) from 1998 to 2000, in the Laboratory for Mathematical Neuroscience, Brain Science Institute, RIKEN from 2000 to 2003, and in the ERATO Quantum Computation and Information Project, Japan Science and Technology Agency (JST), as the Research Head from 2000 to 2006. He also worked in the Superrobust Computation Project, Information Science and Technology Strategic Core (21st Century COE by MEXT), Graduate School of Information Science and Technology, The University of Tokyo, as Adjunct Associate Professor from 2004 to 2007, and in the Graduate School of Information Sciences, Tohoku University, as Associate Professor from 2007 to 2012. In 2012, he joined the Graduate School of Mathematics, Nagoya University, as Professor. He also worked at the Centre for Quantum Technologies, National University of Singapore, as Visiting Research Associate Professor from 2009 to 2012 and as Visiting Research Professor from 2012 to the present, and at the Center for Advanced Intelligence Project, RIKEN, as a Visiting Scientist since 2017. In 2011, he received the IEEE Information Theory Society Paper Award for "Information-Spectrum Approach to Second-Order Coding Rate in Channel Coding". In 2016, he received the Japan Academy Medal from the Japan Academy and the JSPS Prize from the Japan Society for the Promotion of Science. In 2006, he published the book "Quantum Information: An Introduction" with Springer, whose revised version was published as "Quantum Information Theory: Mathematical Foundation" in Graduate Texts in Physics, Springer, in 2016. In 2016, he published two other books, "Group Representation for Quantum Theory" and "A Group Theoretic Approach to Quantum Information", with Springer. He is on the Editorial Boards of the International Journal of Quantum Information and the International Journal On Advances in Security. His research interests include classical and quantum information theory and classical and quantum statistical inference.

Vincent Y. F. Tan (S'07–M'11–SM'15) was born in Singapore in 1981. He is currently an Assistant Professor in the Department of Electrical and Computer Engineering (ECE) and the Department of Mathematics at the National University of Singapore (NUS). He received the B.A. and M.Eng. degrees in Electrical and Information Sciences from Cambridge University in 2005, and the Ph.D. degree in Electrical Engineering and Computer Science (EECS) from the Massachusetts Institute of Technology in 2011. He was a postdoctoral researcher in the Department of ECE at the University of Wisconsin-Madison and a research scientist at the Institute for Infocomm Research (I2R), A*STAR, Singapore. His research interests include network information theory, machine learning, and statistical signal processing. Dr. Tan received the MIT EECS Jin-Au Kong outstanding doctoral thesis prize in 2011, the NUS Young Investigator Award in 2014, the Engineering Young Researcher Award of the Faculty of Engineering, NUS, in 2018, and the Singapore National Research Foundation (NRF) Fellowship (Class of 2018). He has authored a research monograph, "Asymptotic Estimates in Information Theory with Non-Vanishing Error Probabilities", in the Foundations and Trends in Communications and Information Theory Series (NOW Publishers). He is currently an Editor of the IEEE Transactions on Communications and the IEEE Transactions on Green Communications and Networking.