Minimum Rates of Approximate Sufficient Statistics
Masahito Hayashi,† Fellow, IEEE, and Vincent Y. F. Tan,‡ Senior Member, IEEE

† Masahito Hayashi is with the Graduate School of Mathematics, Nagoya University, and the Center for Quantum Technologies (CQT), National University of Singapore. ‡ Vincent Y. F. Tan is with the Department of Electrical and Computer Engineering and the Department of Mathematics, National University of Singapore. This paper was presented in part [1] at the 2017 International Symposium on Information Theory (ISIT) in Aachen, Germany. MH is partially supported by a MEXT Grant-in-Aid for Scientific Research (B) No. 16KT0017, and also by the Okawa Research Grant and the Kayamori Foundation of Informational Science Advancement. The Centre for Quantum Technologies is funded by the Singapore Ministry of Education and the National Research Foundation as part of the Research Centres of Excellence programme. VYFT is partially supported by an NUS Young Investigator Award (R-263-000-B37-133) and a Singapore Ministry of Education Tier 2 grant "Network Communication with Synchronization Errors: Fundamental Limits and Codes" (R-263-000-B61-112).

Abstract—Given a sufficient statistic for a parametric family of distributions, one can estimate the parameter without access to the data. However, the memory or code size for storing the sufficient statistic may nonetheless still be prohibitive. Indeed, for $n$ independent samples drawn from a $k$-nomial distribution with $d = k-1$ degrees of freedom, the length of the code scales as $d \log n + O(1)$. In many applications, we may not have a useful notion of sufficient statistics (e.g., when the parametric family is not an exponential family) and we also may not need to reconstruct the generating distribution exactly. By adopting a Shannon-theoretic approach in which we allow a small error in estimating the generating distribution, we construct various approximate sufficient statistics and show that the code length can be reduced to $\frac{d}{2}\log n + O(1)$. We consider errors measured according to the relative entropy and variational distance criteria. For the code constructions, we leverage Rissanen's minimum description length principle, which yields a non-vanishing error measured according to the relative entropy. For the converse parts, we use Clarke and Barron's formula for the relative entropy between a parametrized distribution and the corresponding mixture distribution. However, this method only yields a weak converse for the variational distance. We develop new techniques to achieve vanishing errors and we also prove strong converses. The latter means that even if the code is allowed to have a non-vanishing error, its length must still be at least $\frac{d}{2}\log n$.

Index Terms—Approximate sufficient statistics, Minimum rates, Memory size reduction, Minimum description length, Exponential families, Pythagorean theorem, Strong converse

I. INTRODUCTION

The notion of sufficient statistics is a fundamental and ubiquitous concept in statistics and information theory [2], [3]. Consider a random variable $X \in \mathcal{X}$ whose distribution $P_{X|Z=z}$ depends on an unknown parameter $z \in \mathcal{Z}$. Typically in detection and estimation problems, we are interested in learning the unknown parameter $z$. In this case, it is often unnecessary to use the full dataset $X$; rather, a function of the data $Y = f(X) \in \mathcal{Y}$ usually suffices. If there is no loss in the performance of learning $Z$ given $Y$ relative to the case when one is given $X$, then $Y$ is called a sufficient statistic relative to the family $\{P_{X|Z=z}\}_{z \in \mathcal{Z}}$. We may then write

$$P_{X|Z=z}(x) = \sum_{y \in \mathcal{Y}} P_{X|Y}(x \mid y)\, P_{Y|Z=z}(y), \quad \forall\,(x,z) \in \mathcal{X} \times \mathcal{Z}, \qquad (1)$$

or more simply that $X - Y - Z$ forms a Markov chain in this order. Because $Y$ is a function of $X$, it is also true that $I(Z;X) = I(Z;Y)$. This intuitively means that the sufficient statistic $Y$ provides as much information about the parameter $Z$ as the original data $X$ does.

For concreteness in our discussions, we often (but not always) regard the family $\{P_{X|Z=z}\}_{z \in \mathcal{Z}}$ as an exponential family [4], i.e., $P_{X|Z=z}(x) \propto \exp\big(\sum_i z_i Y_i(x)\big)$. This class of distributions is parametrized by a set of natural parameters $z = \{z_i\}$ and a set of natural statistics $Y(x) = \{Y_i(x)\}$, which is a function of the data. The natural statistics or the maximum likelihood estimator (MLE) are known to be sufficient statistics of the exponential family. In many applications, large datasets are prevalent. The one-shot model described above is then replaced by an $n$-shot one in which the dataset consists of $n$ independent and identically distributed (i.i.d.) random variables $X^n = (X_1, X_2, \ldots, X_n)$, each distributed according to $P_{X|Z=z}$ where the exact $z$ is unknown. If the support of $X$ is finite, the distribution is a $k$-nomial distribution (a discrete distribution taking on at most $k$ values) and the empirical distribution or type [5] of $X^n$ is a sufficient statistic for learning $z$. However, the number of types with denominator $n$ on an alphabet with $k$ values is $\binom{n+k-1}{k-1} = \Theta(n^{k-1})$ [5].
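To make this count concrete, the following short Python check (our own illustration, not from the paper) verifies numerically that the number of types grows as $\Theta(n^{k-1})$, which is why storing the exact type index costs about $(k-1)\log n$ bits.

```python
from math import comb

def num_types(n: int, k: int) -> int:
    """Number of types (empirical distributions) with denominator n on a k-letter alphabet."""
    return comb(n + k - 1, k - 1)

k = 3  # ternary alphabet, so d = k - 1 = 2 degrees of freedom
for n in (10, 100, 1000, 10000):
    print(n, num_types(n, k), round(num_types(n, k) / n ** (k - 1), 3))
# The ratio in the last column tends to 1/(k-1)! = 0.5, i.e., the count is Theta(n^{k-1});
# indexing the type therefore requires about (k-1)*log(n) bits of memory.
```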
We are interested in this paper in the "memory size" needed to store the various types. We imagine that each type is allocated a single storage location in the memory stack, and the memory size is the number of storage locations. Thus, the memory size required to estimate the parameter $z$ in a maximum likelihood manner (or the distribution $P_{X|Z=z}$) is at least $\Theta(n^{k-1})$ if the (index of the) type is stored. The exponent $d = k-1$ here is the number of degrees of freedom in the distribution family, i.e., the dimensionality of the space $\mathcal{Z}$ to which $z$ belongs. Can we do better than a memory size of $\Theta(n^{d})$? The answer to this question depends on the strictness of the recoverability condition on $P_{X|Z=z}$. If $P_{X|Z=z}$ is to be recovered exactly, then the Markov condition in (1) is necessary and no reduction of the memory size is possible. However, if $P_{X|Z=z}$ is to be recovered only approximately, we can indeed reduce the memory size. This is one motivation of the current work.

In addition, going beyond exponential families, for general distribution families we do not necessarily have a useful and universal notion of sufficient statistics. Thus, we often focus on local asymptotic sufficient statistics by relaxing the condition for sufficient statistics. For example, under suitable regularity conditions [6]–[8], the MLE forms a set of local asymptotic sufficient statistics. However, there is no prior work that discusses the required memory size if we allow the sufficient statistics to be approximate in some appropriate sense. To address this issue, we introduce the notion of the minimum coding length of certain asymptotic sufficient statistics and show that it is $\frac{d}{2}\log n + O(1)$, where $d$ is the dimension of the parameter of the family of distributions. Hence, the minimum coding rate is the pre-log coefficient $\frac{d}{2}$, improving over the original $d$ when exact sufficient statistics are used. Here, we also note that the locality condition can be dropped; that is, our asymptotic sufficient statistics work globally. This is another motivation for the current paper.
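The following Python sketch is our own minimal illustration of the kind of $1/\sqrt{n}$ quantization of the MLE (in the spirit of Rissanen's two-stage MDL coding) that makes a memory size of $\Theta(n^{d/2})$ plausible; it is not the paper's actual code construction, and the grid constant `c` is an arbitrary illustrative parameter.

```python
import numpy as np

def quantized_mle_index(samples: np.ndarray, k: int, c: float = 1.0) -> tuple:
    """Map i.i.d. samples from {0, ..., k-1} to the coarse grid cell containing the MLE."""
    n = len(samples)
    counts = np.bincount(samples, minlength=k)
    mle = counts / n                      # empirical distribution = MLE of the k-nomial
    step = c / np.sqrt(n)                 # grid spacing of order 1/sqrt(n)
    # Only d = k - 1 coordinates are free; the last one is fixed by normalization.
    return tuple(np.floor(mle[:-1] / step).astype(int))

rng = np.random.default_rng(0)
n, k = 10_000, 3
z = np.array([0.2, 0.3, 0.5])             # "unknown" generating distribution
x = rng.choice(k, size=n, p=z)
print(quantized_mle_index(x, k))
# Roughly (sqrt(n)/c)^(k-1) = Theta(n^{(k-1)/2}) cells cover the simplex, so the
# memory size drops from Theta(n^d) (exact type) to Theta(n^{d/2}).
```

The statistical intuition is that the MLE fluctuates around $z$ at scale $1/\sqrt{n}$, so storing it on a finer grid buys essentially no additional accuracy in the reconstructed distribution.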
A. Related Work

Our problem is different from lossy and lossless conventional source coding [9], [10] because we do not seek to reconstruct the data $X^n$ but rather a distribution on $\mathcal{X}^n$. Hence, we need to generalize standard data compression schemes. Such a generalization has been discussed in the context of quantum data compression by Schumacher [11]. Here, the source that generates the state cannot be directly observed. Schumacher's encoding process involves compressing the original dataset into a memory stack with a smaller size, and the decoding process involves recovering certain statistics of the data to within a prescribed error bound $\delta \geq 0$.

Reconstruction of distributions has also been studied in the context of the information bottleneck (IB) method [12]–[14]. In the IB method, given a joint distribution $P_{X,Y}$, one seeks to find the best tradeoff between accuracy and complexity when summarizing a random variable $X$ and an observed, correlated variable $Y$.

The purpose of Rissanen's compression system [21], [22] for families of i.i.d. distributions is to obtain a compression system for data generated under a mixture distribution. He showed that, when the dimensionality of the data is $d$, the optimal redundancy over the Shannon entropy is $\frac{d}{2}\log n + O(1)$. Merhav and Feder [23] extended Rissanen's analysis to both the minimax and Bayesian (maximin) senses. Clarke and Barron [24], [25] refined Rissanen's analysis and determined the constant (in the asymptotic expansions) under various regularity assumptions. While we make heavy use of some of Rissanen's coding ideas and of Clarke and Barron's asymptotic expansions for the relative entropy between a parametrized distribution and a mixture, the problem setting we study is different. Indeed, the main ideas in Rissanen's work [21], [22] are only helpful for us to establish the achievability parts of our theorems with non-zero asymptotic error for the relative entropy criterion (see Lemma 1). Similarly, the main results of Clarke and Barron's work [24], [25] can only lead to a weak converse under the variational distance criterion (see Lemma 4). Hence, we need to develop new coding techniques and converse ideas to satisfy the more stringent constraints on the code sequences.

B. Main Contributions and Techniques

We provide a precise Shannon-theoretic problem formulation for compression of the model parameter $z$ with an allowable asymptotic error $\delta \geq 0$ on the reconstructed distribution. This error is measured under the relative entropy and variational distance criteria. We use some of Rissanen's ideas for encoding in [21], [22] to show that the memory size can be reduced to approximately $\Theta(n^{d/2})$, resulting in a coding length of $\frac{d}{2}\log n + O(1)$.
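As a quick sanity check on the relative entropy criterion (again our own illustration, not a result from the paper): if the reconstructed parameter $\hat z$ is off by $O(1/\sqrt{n})$, the resolution of the grid sketched above, then the $n$-fold relative entropy $D(P_{X|Z=z}^{\times n} \| P_{X|Z=\hat z}^{\times n}) = n\, D(P_{X|Z=z} \| P_{X|Z=\hat z})$ remains bounded but does not vanish, consistent with the non-vanishing error under the relative entropy criterion mentioned in the abstract.

```python
import numpy as np

def kl(p: np.ndarray, q: np.ndarray) -> float:
    """Relative entropy D(p || q) in nats for strictly positive distributions."""
    return float(np.sum(p * np.log(p / q)))

z = np.array([0.2, 0.3, 0.5])
offset = np.array([1.0, -0.5, -0.5])       # perturbation direction summing to zero
for n in (100, 10_000, 1_000_000):
    z_hat = z + offset / np.sqrt(n)        # reconstruction off by O(1/sqrt(n))
    print(n, n * kl(z, z_hat))             # n-fold relative entropy stays O(1) as n grows
```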