Arxiv:1811.04251V4 [Cs.IT] 20 May 2020

Formal Limitations on the Measurement of Mutual Information David McAllester Karl Stratos Toyota Technological Institute at Chicago Rutgers University Abstract larger than O(ln N) where N is the number of samples. Thus a meaningful high-confidence lower bound is infeasi- Measuring mutual information from finite data is ble when underlying mutual information is large (e.g., hun- difficult. Recent work has considered variational dreds of bits). methods maximizing a lower bound. In this pa- Our result corrects, generalizes, and unifies past work on per, we prove that serious statistical limitations measuring mutual information. Unlike previous intractabil- are inherent to any method of measuring mutual ity results that are tied to specific estimators (Gao et al., information. More specifically, we show that any 2015; Oord et al., 2018), our result is universal to all es- distribution-free high-confidence lower bound on timators. Hence we show that without making strong as- mutual information estimated from N samples sumptions on the population distribution, such as the small cannot be larger than O(ln N). support assumption in Valiant and Valiant (2011) and min- imax bounds (Jiao et al., 2015), it is generally impossible to guarantee an accurate estimate of mutual information. 1 INTRODUCTION Our results contradict a theorem in Belghazi et al. (2018) claiming a polynomial sample size convergence rate for a Mutual information has important applications to unsuper- mutual information estimator and we point out an error in vised learning. It underlies classical representation learn- their proof. ing methods such as Brown clustering (Brown et al., 1992), While it is infeasible to give meaningful high-confidence INFOMAX (Bell and Sejnowski, 1995), and the informa- lower bound guarantees for large mutual information, es- tion bottleneck method (Tishby et al., 1999). Maximiz- timators lacking formal guarantees might still be useful ing mutual information is also central to recent works on in practice. To this end, we propose expressing mutual unsupervised representation learning with neural networks information as a difference of entropies and estimating (McAllester, 2018; Belghazi et al., 2018; Oord et al., 2018; the entropy terms by cross-entropy upper bounds. This Stratos, 2019; Hjelm et al., 2019). difference-of-entropies (DoE) estimator gives neither an Unfortunately, measuring mutual information from finite upper bound nor a lower bound guarantee on mutual in- data is a notoriously difficult estimation problem. This dif- formation. Nevertheless, we give theoretical and empirical ficulty has motivated researchers to consider more compu- evidence that DoE can meaningfully estimate large mutual tationally amenable measurement methods that maximize information from feasible samples. a parameterized lower bound on mutual information. The arXiv:1811.04251v4 [cs.IT] 20 May 2020 hope is that the optimized value of the bound is close to 1.1 Overview the true value of mutual information and can serve as a good enough approximation. For instance, this is the ap- We state our main result below. We write (X; Y ) to denote proach advocated by Mutual Information Neural Estimator a pair of random variables with ranges (X ; Y). Unless oth- (MINE) (Belghazi et al., 2018) and contrastive predictive erwise specified, they can be discrete or continuous. For coding (CPC) (Oord et al., 2018). simplicity, all distributions are assumed to have full support. Here we prove that serious statistical limitations are inherent to all methods of measuring mutual information. Theorem 1.1. Let B be any mapping from N samples of More specifically, we show that any distribution-free high- (X; Y ) to a real number with the following property: for confidence lower bound on mutual information cannot be any distribution pXY over (X; Y ), given N iid samples (x1; y1) ::: (xN ; yN ) ∼ pXY , with probability at least 0.99 ≥ Proceedings of the 23rdInternational Conference on Artificial In- I(X; Y ; pXY ) B((x1; y1) ::: (xN ; yN )) telligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: where I(X; Y ; pXY ) is the mutual information between Volume 108. Copyright 2020 by the author(s). (X; Y ) under pXY . Now pick any distribution qXY Formal Limitations on the Measurement of Mutual Information over (X; Y ). Then given N ≥ 50 iid samples 0 0 0 0 (x1; y1) ::: (xN ; yN ) ∼ qXY , with probability at least 0.96 N N ! N 1 1 0 X X f(xi) DVdf (pX jjqX ) := f(xi) − ln e (3) 0 0 0 0 N N B((x1; y1) ::: (xN ; yN )) ≤ 2 ln N + 5 i=1 i=1 An application of the DV bound is measuring KL diver- In the rest of the paper, we build toward this result by prov- gence. Specifically, we estimate KL divergence by esti- ing statistical limitations on measuring lower bounds on mating the associated DV bound from samples: Kullback-Leibler (KL) divergence and entropy. We derive jj ≥ jj several intermediate results, specifically: DKL(pX qX ) DVf (pX qX ) N & DVdf (pX jjqX ) • We first consider the Donsker-Varadhan lower bound on KL divergence which underlies the approach in MINE. The first inequality holds for all choices of f by Theo- We give an intuitive explanation on why estimating the rem 2.1. We require that the second inequality holds with bound from samples is problematic. We show that the high probability (with respect to random sampling). We polynomial sample complexity result given by Belghazi now show that this approach to measuring KL divergence et al. (2018) is incorrect. is problematic. • We formally prove that any lower bound on KL divergence cannot be larger than O(ln N). This subsumes the 2.1 Statistical Limitations on Measuring the DV Donsker-Varadhan bound as a special case. Bound • We formally prove that any lower bound on entropy can- A crucial observation is that the second term in the DV not be larger than O(ln N). Theorem 1.1 is a special bound (2) involves an expectation of the exponential case of this result. h f(x)i • We motivate measuring mutual information as a differ- E e ence of entropies, each of which measured by minimiz- x∼qX ing cross entropy. We empirically show that it outper- This expression has the same form as the moment gener- forms existing variational lower bounds in synthetic ex- ating function used in analyzing large deviation probabil- periments and produce realistic estimates of mutual in- ities. The utility of expectations of exponentials in large formation in real-world datasets. deviation theory is that such expressions can be dominated by extremely rare events (large deviations). The rare events 2 ISSUES WITH THE dominating the expectation will never be observed by sam- DONSKER-VARADHAN LOWER pling from qX . BOUND To quantitatively analyze the risk of unseen outlier events, we will make use of the following simple lemma where we The Donsker-Varadhan (DV) lower bound on KL diver- write pX (Φ[x]) for the probability over drawing x from pX gence is stated below. For completeness, a simple proof that the statement Φ[x] holds. is given in the supplementary material. Lemma 2.2 (Outlier risk lemma). Given N ≥ 2 samples from and a property such that ≤ , Theorem 2.1 (Donsker and Varadhan, 1983). Let pX and pX Φ[x] pX (Φ[x]) 1=N the probability that no sample satisfies is at least qX be distributions over X with finite KL divergence. Then x Φ[x] 1/4. for all bounded functions f : X! R DKL(pX jjqX ) ≥ DVf (pX jjqX ) (1) Proof. The probability that Φ[x] is unseen in the sample is at least (1 − 1=N)N which is at least 1=4 for N ≥ 2 and where N where limN!1(1 − 1=N) = 1=e. h f(x)i DVf (pX jjqX ) := E [f(x)] − ln E e (2) x∼pX x∼qX We can use the outlier risk lemma to perform a quantitative Moreover, (1) holds with equality for some f with range risk analysis of the DV bound. Assume without loss of X [0;F ]. generality that [0;Fmax] is the range of f taken on . Let max us consider the best case scenario in which the empirical The DV bound (2) computes the difference between the estimate (3) is the largest. It is easy to see that the largest expected value of f(x) under pX and the log of the ex- value is Fmax and attained when pected value of the exponential of f(x) under q . It can be X f(x ) = F 8i = 1 :::N ∼ i max easily estimated by sampling: given x1 : : : xN pX and 0 0 0 f(xi) = 0 8i = 1 :::N x1 : : : xN ∼ qX , we can compute the empirical estimate David McAllester, Karl Stratos 0 0 where x1 : : : xN are samples from pX and x1 : : : xN are lower bound on KL divergence (e.g., DKL(pX jjqX ) = samples from qX . But by the outlier risk lemma, there is supf>0 E [ln f(x)] − E [f(x)] + 1 in Nguyen et al. x∼pX x∼qX still at least a 1/4 probability that (2010) which does not involve an expectation of the exponential) that overcomes the limitation. h f(x)i 1 F E e ≥ e max x∼qX N We now strengthen the result by formally proving that no Since we require that (3) is a high-confidence lower bound lower bound on KL divergence estimated from samples can on the DV bound, it must account for the unseen outlier be large. We consider a more challenging setting in which risk.1 In particular, we must have we have perfect knowledge of pX (i.e., we can compute probabilities under the distribution) and only sample from F N e max qX to estimate a lower bound on DKL(pX jjqX ).

Arxiv:1811.04251V4 [Cs.IT] 20 May 2020

On Measures of Entropy and Information

Information Theory 1 Entropy 2 Mutual Information

Information Theory and Maximum Entropy 8.1 Fundamentals of Information Theory

A Characterization of Guesswork on Swiftly Tilting Curves Ahmad Beirami, Robert Calderbank, Mark Christiansen, Ken Duffy, and Muriel Medard´

Language Models

Neural Networks and Backpropagation

Entropy and Decisions

A Gentle Tutorial on Information Theory and Learning Roni Rosenfeld Carnegie Mellon University Outline • First Part Based Very

Entropy, Relative Entropy, Cross Entropy Entropy

Reduced Perplexity: a Simplified Perspective on Assessing Probabilistic Forecasts Kenric P

K-Nearest Neighbor Based Consistent Entropy Estimation for Hyperspherical Distributions

Entropy Methods for Joint Distributions in Decision Analysis Ali E