Formal Limitations on the Measurement of Mutual Information

David McAllester (Toyota Technological Institute at Chicago), Karl Stratos (Rutgers University)

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s). arXiv:1811.04251v4 [cs.IT] 20 May 2020

Abstract

Measuring mutual information from finite data is difficult. Recent work has considered variational methods maximizing a lower bound. In this paper, we prove that serious statistical limitations are inherent to any method of measuring mutual information. More specifically, we show that any distribution-free high-confidence lower bound on mutual information estimated from N samples cannot be larger than O(ln N).

1 INTRODUCTION

Mutual information has important applications to unsupervised learning. It underlies classical representation learning methods such as Brown clustering (Brown et al., 1992), INFOMAX (Bell and Sejnowski, 1995), and the information bottleneck method (Tishby et al., 1999). Maximizing mutual information is also central to recent works on unsupervised representation learning with neural networks (McAllester, 2018; Belghazi et al., 2018; Oord et al., 2018; Stratos, 2019; Hjelm et al., 2019).

Unfortunately, measuring mutual information from finite data is a notoriously difficult estimation problem. This difficulty has motivated researchers to consider more computationally amenable measurement methods that maximize a parameterized lower bound on mutual information. The hope is that the optimized value of the bound is close to the true value of mutual information and can serve as a good enough approximation. For instance, this is the approach advocated by Mutual Information Neural Estimator (MINE) (Belghazi et al., 2018) and contrastive predictive coding (CPC) (Oord et al., 2018).

Here we prove that serious statistical limitations are inherent to all methods of measuring mutual information. More specifically, we show that any distribution-free high-confidence lower bound on mutual information cannot be larger than O(ln N) where N is the number of samples. Thus a meaningful high-confidence lower bound is infeasible when underlying mutual information is large (e.g., hundreds of bits).

Our result corrects, generalizes, and unifies past work on measuring mutual information. Unlike previous intractability results that are tied to specific estimators (Gao et al., 2015; Oord et al., 2018), our result is universal to all estimators. Hence we show that without making strong assumptions on the population distribution, such as the small support assumption in Valiant and Valiant (2011) and minimax bounds (Jiao et al., 2015), it is generally impossible to guarantee an accurate estimate of mutual information. Our results contradict a theorem in Belghazi et al. (2018) claiming a polynomial sample size convergence rate for a mutual information estimator and we point out an error in their proof.

While it is infeasible to give meaningful high-confidence lower bound guarantees for large mutual information, estimators lacking formal guarantees might still be useful in practice. To this end, we propose expressing mutual information as a difference of entropies and estimating the entropy terms by cross-entropy upper bounds. This difference-of-entropies (DoE) estimator gives neither an upper bound nor a lower bound guarantee on mutual information. Nevertheless, we give theoretical and empirical evidence that DoE can meaningfully estimate large mutual information from feasible samples.

1.1 Overview

We state our main result below. We write $(X, Y)$ to denote a pair of random variables with ranges $(\mathcal{X}, \mathcal{Y})$. Unless otherwise specified, they can be discrete or continuous. For simplicity, all distributions are assumed to have full support.

Theorem 1.1. Let $B$ be any mapping from $N$ samples of $(X, Y)$ to a real number with the following property: for any distribution $p_{XY}$ over $(X, Y)$, given $N$ iid samples $(x_1, y_1) \ldots (x_N, y_N) \sim p_{XY}$, with probability at least 0.99

$$I(X, Y; p_{XY}) \ge B((x_1, y_1) \ldots (x_N, y_N))$$

where $I(X, Y; p_{XY})$ is the mutual information between $(X, Y)$ under $p_{XY}$. Now pick any distribution $q_{XY}$ over $(X, Y)$. Then given $N \ge 50$ iid samples $(x'_1, y'_1) \ldots (x'_N, y'_N) \sim q_{XY}$, with probability at least 0.96

$$B((x'_1, y'_1) \ldots (x'_N, y'_N)) \le 2 \ln N + 5$$

In the rest of the paper, we build toward this result by proving statistical limitations on measuring lower bounds on Kullback-Leibler (KL) divergence and entropy. We derive several intermediate results, specifically:

• We first consider the Donsker-Varadhan lower bound on KL divergence which underlies the approach in MINE. We give an intuitive explanation of why estimating the bound from samples is problematic. We show that the polynomial sample complexity result given by Belghazi et al. (2018) is incorrect.
• We formally prove that any lower bound on KL divergence cannot be larger than O(ln N). This subsumes the Donsker-Varadhan bound as a special case.
• We formally prove that any lower bound on entropy cannot be larger than O(ln N). Theorem 1.1 is a special case of this result.
• We motivate measuring mutual information as a difference of entropies, each of which is measured by minimizing cross entropy. We empirically show that it outperforms existing variational lower bounds in synthetic experiments and produces realistic estimates of mutual information in real-world datasets.

2 ISSUES WITH THE DONSKER-VARADHAN LOWER BOUND

The Donsker-Varadhan (DV) lower bound on KL divergence is stated below. For completeness, a simple proof is given in the supplementary material.

Theorem 2.1 (Donsker and Varadhan, 1983). Let $p_X$ and $q_X$ be distributions over $\mathcal{X}$ with finite KL divergence. Then for all bounded functions $f : \mathcal{X} \to \mathbb{R}$

$$D_{KL}(p_X \| q_X) \ge \mathrm{DV}_f(p_X \| q_X) \tag{1}$$

where

$$\mathrm{DV}_f(p_X \| q_X) := \mathop{E}_{x \sim p_X}[f(x)] - \ln \mathop{E}_{x \sim q_X}\left[e^{f(x)}\right] \tag{2}$$

Moreover, (1) holds with equality for some $f$ with range $[0, F_{\max}]$.

The DV bound (2) computes the difference between the expected value of $f(x)$ under $p_X$ and the log of the expected value of the exponential of $f(x)$ under $q_X$. It can be easily estimated by sampling: given $x_1 \ldots x_N \sim p_X$ and $x'_1 \ldots x'_N \sim q_X$, we can compute the empirical estimate

$$\widehat{\mathrm{DV}}{}^N_f(p_X \| q_X) := \frac{1}{N} \sum_{i=1}^N f(x_i) - \ln \frac{1}{N} \sum_{i=1}^N e^{f(x'_i)} \tag{3}$$

An application of the DV bound is measuring KL divergence. Specifically, we estimate KL divergence by estimating the associated DV bound from samples:

$$D_{KL}(p_X \| q_X) \ge \mathrm{DV}_f(p_X \| q_X) \gtrsim \widehat{\mathrm{DV}}{}^N_f(p_X \| q_X)$$

The first inequality holds for all choices of $f$ by Theorem 2.1. We require that the second inequality holds with high probability (with respect to random sampling). We now show that this approach to measuring KL divergence is problematic.

2.1 Statistical Limitations on Measuring the DV Bound

A crucial observation is that the second term in the DV bound (2) involves an expectation of the exponential

$$\mathop{E}_{x \sim q_X}\left[e^{f(x)}\right]$$

This expression has the same form as the moment generating function used in analyzing large deviation probabilities. The utility of expectations of exponentials in large deviation theory is that such expressions can be dominated by extremely rare events (large deviations). The rare events dominating the expectation will never be observed by sampling from $q_X$.

To quantitatively analyze the risk of unseen outlier events, we will make use of the following simple lemma where we write $p_X(\Phi[x])$ for the probability over drawing $x$ from $p_X$ that the statement $\Phi[x]$ holds.

Lemma 2.2 (Outlier risk lemma). Given $N \ge 2$ samples from $p_X$ and a property $\Phi[x]$ such that $p_X(\Phi[x]) \le 1/N$, the probability that no sample $x$ satisfies $\Phi[x]$ is at least 1/4.

Proof. The probability that $\Phi[x]$ is unseen in the sample is at least $(1 - 1/N)^N$, which is at least 1/4 for $N \ge 2$ and where $\lim_{N \to \infty} (1 - 1/N)^N = 1/e$.

We can use the outlier risk lemma to perform a quantitative risk analysis of the DV bound. Assume without loss of generality that $[0, F_{\max}]$ is the range of $f$ taken on $\mathcal{X}$. Let us consider the best case scenario in which the empirical estimate (3) is the largest. It is easy to see that the largest value is $F_{\max}$ and attained when

$$f(x_i) = F_{\max} \quad \forall i = 1 \ldots N$$
$$f(x'_i) = 0 \quad \forall i = 1 \ldots N$$

where $x_1 \ldots x_N$ are samples from $p_X$ and $x'_1 \ldots x'_N$ are samples from $q_X$. But by the outlier risk lemma, there is still at least a 1/4 probability that

$$\mathop{E}_{x \sim q_X}\left[e^{f(x)}\right] \ge \frac{1}{N} e^{F_{\max}}$$

Since we require that (3) is a high-confidence lower bound on the DV bound, it must account for the unseen outlier risk.^1 In particular, we must have

$$\widehat{\mathrm{DV}}{}^N_f(p_X \| q_X) \le F_{\max} - \ln \frac{e^{F_{\max}}}{N} = \ln N$$

1 We intentionally keep this argument informal to give intuition since the result on the DV bound will be subsumed by the general result in Theorem 3.1.

2.2 Discussion of MINE

Our investigation of measuring the DV bound from samples is motivated by Belghazi et al. (2018) who propose measuring and maximizing mutual information by formulating it as KL divergence and considering the DV lower bound. More specifically, they introduce a function $f : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$ parameterized by a neural network and perform gradient descent to optimize

$$\sup_f \; \frac{1}{N} \sum_{i=1}^N f(x_i, y_i) - \ln \frac{1}{N} \sum_{i=1}^N e^{f(x'_i, y'_i)} \tag{4}$$

where $(x_1, y_1) \ldots (x_N, y_N)$ are drawn from $p_{XY}$ and $(x'_1, y'_1) \ldots (x'_N, y'_N)$ are drawn from $p_X \times p_Y$. This is an empirical estimate of the DV bound on

$$D_{KL}(p_{XY} \| p_X \times p_Y) = I(X, Y; p_{XY}) \tag{5}$$

They claim that (4) yields a high-confidence accurate measurement of underlying mutual information with polynomial sample complexity under mild assumptions (Theorem 3 in their paper), which apparently contradicts our previous observation.

Upon inspection, we have found that their claim is wrong. In the appendix of the arXiv version v4, they make an incorrect application of the Hoeffding inequality in equation (46) from which they incorrectly derive equation (49). The Hoeffding inequality depends on the bounded range of the random variable, but (49) bounds the exponential of the variable, whose range is exponentially larger. Thus their bound stated in Theorem 3 has an exponential dependence on the variable M.

3 STATISTICAL LIMITATIONS ON MEASURING LOWER BOUNDS ON KL DIVERGENCE

The analysis given in Section 2.1 is specific to the DV bound. One might hope that there exists a different lower bound on KL divergence (e.g., $D_{KL}(p_X \| q_X) = \sup_{f > 0} \mathop{E}_{x \sim p_X}[\ln f(x)] - \mathop{E}_{x \sim q_X}[f(x)] + 1$ in Nguyen et al. (2010), which does not involve an expectation of the exponential) that overcomes the limitation.

We now strengthen the result by formally proving that no lower bound on KL divergence estimated from samples can be large. We consider a more challenging setting in which we have perfect knowledge of $p_X$ (i.e., we can compute probabilities under the distribution) and only sample from $q_X$ to estimate a lower bound on $D_{KL}(p_X \| q_X)$. Even in this setting, we have the following negative result.

Theorem 3.1. Let $B$ be any distribution-free high-confidence lower bound on $D_{KL}(p_X \| q_X)$ computed with complete knowledge of $p_X$ but only samples from $q_X$.

More specifically, let $B(p_X, S, \delta)$ be any real-valued function of a distribution $p_X$, a multiset $S$, and a confidence parameter $\delta$ such that, for any $p_X$, $q_X$ and $\delta$, with probability at least $1 - \delta$ over a draw of $S$ from $q_X^N$ we have

$$D_{KL}(p_X \| q_X) \ge B(p_X, S, \delta)$$

For any such bound, and for $N \ge 2$, for any $q_X$, with probability at least $1 - 4\delta$ over the draw of $S$ from $q_X^N$ we have

$$B(p_X, S, \delta) \le \ln N$$

Proof. Consider distributions $p_X$ and $q_X$ and $N \ge 2$. Define $\tilde{q}_X$ by

$$\tilde{q}_X(x) = \left(1 - \frac{1}{N}\right) q_X(x) + \frac{1}{N} p_X(x)$$

We now have $D_{KL}(p_X \| \tilde{q}_X) \le \ln N$. We will prove that from samples $S \sim q_X^N$ we cannot reliably distinguish between $q_X$ and $\tilde{q}_X$.

The premise is that $B$ yields a high-confidence lower bound for any distribution. Thus we have

$$\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Small}(S)) \ge 1 - \delta \tag{6}$$

where $\mathrm{Small}(S)$ represents the event that $B(p_X, S, \delta) \le \ln N$. The distribution $\tilde{q}_X$ equals the marginal on $x$ of a distribution on pairs $(s, x)$ where $s$ is the value of a Bernoulli variable with bias $1/N$ such that if $s = 1$ then $x$ is drawn from $p_X$ and otherwise $x$ is drawn from $q_X$. By Lemma 2.2 the probability that all coins are zero is at least 1/4. Conditioned on all coins being zero the distributions $\tilde{q}_X^N$ and $q_X^N$ are the same. Let $\mathrm{Pure}(S)$ represent the event that all coins are 0.
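Before the derivation continues, the two facts the proof has used so far can be checked numerically: the mixture $\tilde{q}_X$ has $D_{KL}(p_X \| \tilde{q}_X) \le \ln N$ no matter how large $D_{KL}(p_X \| q_X)$ is, and the probability that all $N$ coins are zero is at least 1/4. The sketch below is only an illustration under assumed inputs; the three-outcome distributions `p` and `q` and the helper names are hypothetical and not from the paper.

```python
import math

def kl(p, q):
    """KL divergence D(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def mix(p, q, n):
    """Mixture q~ = (1 - 1/n) q + (1/n) p, as defined in the proof."""
    return [(1 - 1 / n) * qi + pi / n for pi, qi in zip(p, q)]

def prob_all_coins_zero(n):
    """Probability that n independent Bernoulli(1/n) coins are all zero."""
    return (1 - 1 / n) ** n

# Hypothetical example: p and q put almost all mass on different outcomes,
# so D(p || q) is large while D(p || q~) stays below ln n.
p = [0.999, 0.0005, 0.0005]
q = [0.0005, 0.0005, 0.999]
for n in (2, 10, 1000):
    q_tilde = mix(p, q, n)
    print(n, kl(p, q), kl(p, q_tilde), math.log(n), prob_all_coins_zero(n))
```

The cap holds because $\tilde{q}_X(x) \ge p_X(x)/N$ pointwise, so the log ratio inside the divergence never exceeds $\ln N$.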
We now have

$$\begin{aligned}
\Pr_{S \sim q_X^N}(\mathrm{Small}(S))
&= \Pr_{S \sim \tilde{q}_X^N}(\mathrm{Small}(S) \mid \mathrm{Pure}(S)) \\
&= \frac{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S) \wedge \mathrm{Small}(S))}{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S))} \\
&\ge \frac{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S)) - \Pr_{S \sim \tilde{q}_X^N}(\neg \mathrm{Small}(S))}{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S))} \\
&\ge \frac{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S)) - \delta}{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S))} \\
&= 1 - \frac{\delta}{\Pr_{S \sim \tilde{q}_X^N}(\mathrm{Pure}(S))} \\
&\ge 1 - 4\delta
\end{aligned}$$

Note the importance of the fact that the lower bound $B$ is distribution-free (the DV bound is distribution-free): this allows us to construct an adversarial distribution $\tilde{q}_X$ with small KL divergence and turn the premise (6) on its head to upper bound $B$. It may be possible to construct distribution-specific lower bounds that do not suffer the same limitations. Theorem 3.1 proves that without making such additional assumptions, it is not possible to guarantee a large lower bound on KL divergence.

This result has immediate implications on measuring mutual information, which is a special case of KL divergence (5). A direct application of Theorem 3.1 gives the following: in the setting in which we have perfect knowledge of $p_{XY}$ but can only sample from $p_X$ and $p_Y$ (i.e., we cannot compute marginals) and measure a lower bound on $I(X, Y; p_{XY})$ from $N$ samples, we cannot guarantee that the bound is larger than $\ln N$. However, this setting is arguably awkward. In the next section, we make a more natural argument by considering entropy.

4 STATISTICAL LIMITATIONS ON MEASURING LOWER BOUNDS ON ENTROPY

Recall that mutual information can be formulated as a difference of entropies

$$I(X, Y; p_{XY}) = H(X; p_X) - H(X|Y; p_{XY}) \tag{7}$$

where $H(X; p_X)$ is the entropy of $X$ under $p_X$ and $H(X|Y; p_{XY})$ is the conditional entropy of $X$ given $Y$ under $p_{XY}$. Entropy is nonnegative for discrete variables: in this case we have

$$I(X, Y; p_{XY}) \le H(X; p_X)$$

It states that the mutual information between $X$ and $Y$ cannot be larger than the information content of $X$ alone. Thus a lower bound on mutual information implies a lower bound on entropy. We will show that any distribution-free high-confidence lower bound on entropy requires a sample size exponential in the size of the bound.

The above argument seems problematic for the case of continuous densities as differential entropy can be negative. However, for the continuous case we have

$$I(X, Y; p_{XY}) = \sup_{C, C'} I(C(X), C'(Y); p_{XY}) \tag{8}$$

where $C$ and $C'$ range over all maps from the underlying continuous space to discrete sets (all binnings of the continuous space). A proof is given in the supplementary material. Hence an O(ln N) upper bound on the measurement of mutual information for the discrete case applies to the continuous case as well. We assume the discrete case in this section without loss of generality.

We use the following definition.

Definition 4.1. The type of a multiset $S$, denoted $\mathcal{T}(S)$, is a function on positive integers such that

$$\mathcal{T}(S)(i) := \left| \left\{ x \in S : \sum_{x' \in S:\, x' = x} 1 = i \right\} \right|$$

That is, $\mathcal{T}(S)(i)$ is the number of elements of $S$ that occur $i$ times in $S$.

The type $\mathcal{T}(S)$ contains all information relevant to estimating the actual probability of the items of a given count and of estimating the entropy of the underlying distribution. The problem of estimating distributions and entropies from sample types has been investigated by various authors (McAllester and Schapire, 2000; Orlitsky et al., 2003; Orlitsky and Suresh, 2015; Arora et al., 2018). Here we give the following negative result on lower bounding the entropy of a distribution by sampling.

Theorem 4.1. Let $B$ be any distribution-free high-confidence lower bound on $H(X; p_X)$ computed from a type $\mathcal{T}(S)$ with $S \sim p_X^N$.

More specifically, let $B(\mathcal{T}, \delta)$ be any real-valued function of a type $\mathcal{T}$ and a confidence parameter $\delta$ such that for any $p_X$, with probability at least $1 - \delta$ over a draw of $S$ from $p_X^N$, we have

$$H(X; p_X) \ge B(\mathcal{T}(S), \delta)$$

For any such bound, and for $N \ge 50$ and $k \ge 2$, for any $p_X$, with probability at least $1 - \delta - 1.01/k$ over the draw of $S$ from $p_X^N$ we have

$$B(\mathcal{T}(S), \delta) \le \ln 2kN^2$$
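To make the objects in Definition 4.1 and Theorem 4.1 concrete, the following sketch computes the type of a sample and evaluates the cap $\ln 2kN^2$ together with its confidence level $1 - \delta - 1.01/k$. It reads the type as counting distinct elements, which is an interpretation rather than something the text spells out. The function names are hypothetical, and the choice $\delta = 0.01$, $k = 50$ is only one assumed setting under which the constants of Theorem 1.1 (confidence 0.96 and bound $2 \ln N + 5$) are met; the paper does not state which values it uses.

```python
import math
from collections import Counter

def sample_type(sample):
    """Return {i: number of distinct elements that occur exactly i times}."""
    occurrences = Counter(sample)                # element -> occurrence count
    return dict(Counter(occurrences.values()))   # count -> number of elements

def entropy_bound_cap(n, k):
    """Cap ln(2 k N^2) from Theorem 4.1, in nats."""
    return math.log(2 * k * n * n)

def cap_confidence(delta, k):
    """Probability 1 - delta - 1.01/k with which the cap holds."""
    return 1 - delta - 1.01 / k

# "a" occurs three times, "b" twice, "c" once.
print(sample_type(["a", "b", "a", "c", "a", "b"]))

# Assumed setting: delta = 0.01 matches the 0.99 premise of Theorem 1.1,
# and k = 50 is a hypothetical choice that keeps ln(2k) below 5.
for n in (50, 128, 10**6):
    print(n, entropy_bound_cap(n, 50), 2 * math.log(n) + 5, cap_confidence(0.01, 50))
```

With these values $\ln 2kN^2 = 2 \ln N + \ln 100 \approx 2 \ln N + 4.61$, and the confidence is about 0.97.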

[Figure 1: plot of $p_X$ and $\tilde{p}_X$ over the sorted support, with positions $kN^2$ and $2kN^2$ marked.]

Figure 1: Construction of an adversarial distribution $\tilde{p}_X$ from $p_X$ with entropy $H(X; \tilde{p}_X) \le \ln 2kN^2$.

Proof. Consider a distribution $p_X$ and $N \ge 50$. If the support of $p_X$ has fewer than $2kN^2$ elements then $H(X; p_X) < \ln 2kN^2$ and by the premise of the theorem we have that, with probability at least $1 - \delta$ over the draw of $S$, $B(\mathcal{T}(S), \delta) \le H(X; p_X)$ and the theorem follows. If the support of $p_X$ has at least $2kN^2$ elements then we sort the support of $p_X$ into a (possibly infinite) sequence $x_1, x_2, \ldots$ so that $p_X(x_i) \ge p_X(x_{i+1})$. We then define a distribution $\tilde{p}_X$ on the elements $x_1 \ldots x_{2kN^2}$ by

$$\tilde{p}_X(x_i) = \begin{cases} p_X(x_i) & \text{for } i \le kN^2 \\ \dfrac{\mu}{kN^2} & \text{for } kN^2 < i \le 2kN^2 \end{cases}$$

where $\mu := \sum_{j > kN^2} p_X(x_j)$. See Figure 1 for illustration. We will let $\mathrm{Small}(S)$ denote the event that $B(\mathcal{T}(S), \delta) \le \ln 2kN^2$ and let $\mathrm{Pure}(S)$ abbreviate the event that no element $x_i$ for $i > kN^2$ occurs twice in the sample. Since $\tilde{p}_X$ has a support of size $2kN^2$ we have $H(X; \tilde{p}_X) \le \ln 2kN^2$. Applying the premise of the theorem to $\tilde{p}_X$ gives

$$\Pr_{S \sim \tilde{p}_X^N}(\mathrm{Small}(S)) \ge 1 - \delta \tag{9}$$

For a type $\mathcal{T}$ let $\Pr_{S \sim P^N}(\mathcal{T})$ denote the probability over drawing $S \sim P^N$ that $\mathcal{T}(S) = \mathcal{T}$. We now have

$$\Pr_{S \sim p_X^N}(\mathcal{T}(S) \mid \mathrm{Pure}(S)) = \Pr_{S \sim \tilde{p}_X^N}(\mathcal{T}(S) \mid \mathrm{Pure}(S))$$

This gives the following.

$$\begin{aligned}
\Pr_{S \sim p_X^N}(\mathrm{Small}(S))
&\ge \Pr_{S \sim p_X^N}(\mathrm{Pure}(S) \wedge \mathrm{Small}(S)) \\
&= \Pr_{S \sim p_X^N}(\mathrm{Pure}(S)) \Pr_{S \sim p_X^N}(\mathrm{Small}(S) \mid \mathrm{Pure}(S)) \\
&= \Pr_{S \sim p_X^N}(\mathrm{Pure}(S)) \Pr_{S \sim \tilde{p}_X^N}(\mathrm{Small}(S) \mid \mathrm{Pure}(S)) \\
&\ge \Pr_{S \sim p_X^N}(\mathrm{Pure}(S)) \Pr_{S \sim \tilde{p}_X^N}(\mathrm{Pure}(S) \wedge \mathrm{Small}(S))
\end{aligned} \tag{10}$$

For $i > kN^2$ we have $\tilde{p}_X(x_i) \le 1/(kN^2)$ which gives

$$\Pr_{S \sim \tilde{p}_X^N}(\mathrm{Pure}(S)) \ge \prod_{j=1}^{N-1} \left(1 - \frac{j}{kN^2}\right)$$

Using $1 - z \ge e^{-1.01z}$ for $z \le 1/100$ (here $z = j/(kN^2) < 1/(kN) \le 1/100$ because $N \ge 50$ and $k \ge 2$) we have the following birthday paradox calculation.

$$\begin{aligned}
\ln \Pr_{S \sim \tilde{p}_X^N}(\mathrm{Pure}(S))
&\ge -\frac{1.01}{kN^2} \sum_{j=1}^{N-1} j \\
&= -\frac{1.01}{kN^2} \cdot \frac{(N-1)N}{2} \\
&\ge -\frac{.505}{k}
\end{aligned}$$

Therefore

$$\Pr_{S \sim \tilde{p}_X^N}(\mathrm{Pure}(S)) \ge e^{-.505/k} \ge 1 - \frac{.505}{k} \tag{11}$$

Applying the union bound to (9) and (11) gives

$$\Pr_{S \sim \tilde{p}_X^N}(\mathrm{Pure}(S) \wedge \mathrm{Small}(S)) \ge 1 - \delta - \frac{.505}{k} \tag{12}$$

By a derivation similar to that of (11) we get

$$\Pr_{S \sim p_X^N}(\mathrm{Pure}(S)) \ge 1 - \frac{.505}{k} \tag{13}$$

Combining (10), (12) and (13) gives

$$\Pr_{S \sim p_X^N}(\mathrm{Small}(S)) \ge 1 - \delta - \frac{1.01}{k}$$

Again, we note the importance of the fact that the entropy lower bound $B$ is distribution-free: this allows us to construct an adversarial distribution $\tilde{p}_X$ with small entropy and apply the premise (9) to upper bound $B$. Theorem 1.1 is derived from Theorem 4.1 by choosing appropriate hyperparameter values $\delta$, $k$ and using the fact that a lower bound on mutual information implies a lower bound on entropy.

5 TOWARD ACCURATE MEASUREMENT OF MUTUAL INFORMATION

Our main contribution is a set of fundamental statistical limitations on measuring a lower bound on mutual information implied by the difficulty of measuring a lower bound on KL divergence (Theorem 3.1) or entropy (Theorem 4.1). But it is natural to ask: then how can we achieve an accurate measurement of mutual information? As a complementary contribution, in this section we explore a new estimator for mutual information that, while lacking formal guarantees, does not suffer from the same limitations.
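As a rough preview of the difference-of-entropies idea, the sketch below fits a marginal model for $X$ and a conditional model for $X$ given $Y$ by minimizing cross-entropy, then subtracts the two losses. It is a toy stand-in and not the authors' implementation: it assumes one-dimensional correlated standard normals, uses closed-form Gaussian maximum likelihood in place of gradient training, and evaluates the losses on the same samples used for fitting. All function names are hypothetical. For this setting the true mutual information is $-\frac{1}{2} \ln(1 - \rho^2)$, the $d = 1$ case of the formula used in the synthetic experiments.

```python
import math
import random

def sample_pairs(n, rho, seed=0):
    """Draw n pairs (x, y) of standard normals with correlation rho."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        y = rng.gauss(0.0, 1.0)
        x = rho * y + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        pairs.append((x, y))
    return pairs

def gaussian_nll(values, mean, var):
    """Empirical cross-entropy (nats) of values under the model N(mean, var)."""
    total = 0.0
    for v in values:
        total += 0.5 * math.log(2.0 * math.pi * var) + (v - mean) ** 2 / (2.0 * var)
    return total / len(values)

def doe_estimate(pairs):
    """Cross-entropy of a fitted q_X minus cross-entropy of a fitted q_{X|Y}."""
    n = len(pairs)
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / n
    var_y = sum((y - mean_y) ** 2 for y in ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n
    # Marginal model q_X: Gaussian with maximum-likelihood mean and variance.
    h_x = gaussian_nll(xs, mean_x, var_x)
    # Conditional model q_{X|Y}: x ~ N(a*y + b, s2), fitted by least squares.
    a = cov / var_y
    b = mean_x - a * mean_y
    residuals = [x - (a * y + b) for x, y in pairs]
    s2 = sum(r * r for r in residuals) / n
    h_x_given_y = gaussian_nll(residuals, 0.0, s2)
    return h_x - h_x_given_y

rho = 0.9
estimate = doe_estimate(sample_pairs(20000, rho))
truth = -0.5 * math.log(1.0 - rho ** 2)
print(estimate, truth)
```

Because each term is an ordinary average of log losses, neither involves an expectation of an exponential, which is the quantity that made the DV estimate fragile in Section 2.1.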

5.1 Mutual Information as a Difference of Entropies

Since mutual information can be expressed as a difference of entropies (7), the problem of measuring mutual information can be reduced to the problem of measuring entropies. More specifically, we write mutual information as

$$I(X, Y; p_{XY}) = \inf_{q_X} H(p_X, q_X) - \inf_{q_{X|Y}} H(p_{X|Y}, q_{X|Y}) \tag{14}$$

where we use the fact that the entropy $H(X; p_X)$ is upper bounded by the cross entropy between $p_X$ and $q_X$, denoted $H(p_X, q_X)$, for all distributions $q_X$ (similarly for the conditional entropy $H(X|Y; p_{XY})$). They are equal iff $q_X = p_X$ and we have

$$H(X; p_X) = \inf_{q_X} H(p_X, q_X) \tag{15}$$

Note that the upper-bound guarantee for the cross-entropy estimator (15) yields neither an upper bound nor a lower bound guarantee for a difference of entropies (14). However, we give theoretical evidence below that, unlike lower bound estimators, upper bound cross-entropy estimators can meaningfully estimate large entropies from feasible samples.

Cross entropy estimation. The statistical limitations on distribution-free high-confidence lower bounds on entropy do not arise for cross-entropy upper bounds. For upper bounds we can show that naive sample estimates of the cross-entropy loss produce meaningful (large entropy) results. The empirical cross-entropy loss computed on samples $x_1 \ldots x_N$ from a population distribution $p_X$ is

$$\hat{H}^N(p_X, q_X) = -\frac{1}{N} \sum_{i=1}^N \ln q_X(x_i) \tag{16}$$

where $q_X$ is viewed as a model of $p_X$. We can bound the true loss of $q_X$ by ensuring a minimum probability $e^{-F_{\max}}$ where $F_{\max}$ is then the maximum possible log loss in the cross-entropy objective.^2 Given a loss bound of $F_{\max}$, $\hat{H}^N(p_X, q_X)$ is just the standard sample mean estimator of an expectation of a bounded variable. In this case we have the following standard confidence interval derived from the Chernoff bound.

Theorem 5.1. For any population distribution $p_X$, and model distribution $q_X$ with $-\ln q_X(x)$ bounded to the interval $[0, F_{\max}]$, with probability at least $1 - \delta$ over the draw of $x_1 \ldots x_N \sim p_X$ we have

$$H(p_X, q_X) \in \hat{H}^N(p_X, q_X) \pm F_{\max} \sqrt{\frac{\ln \frac{2}{\delta}}{2N}}$$

Thus unlike high-confidence distribution-free lower bounds, high-confidence distribution-free upper bounds on entropy can approach the true cross entropy at the modest sample rate of $1/\sqrt{N}$ even when the true cross entropy is large.^3

2 In language modeling a loss bound exists for any model that ultimately backs off to a uniform distribution on characters.

3 It is also possible to give PAC-Bayesian bounds on $H(p_X, q_X^\theta)$ as functions of the parameters $\theta$ of $q_X^\theta$. See the supplementary material for details.

5.2 Experiments

We present experiments with the proposed difference-of-entropies (DoE) estimator to gain a better understanding of its empirical behavior. First, we compare DoE with existing lower-bound estimators in a standard synthetic setting based on correlated Gaussians. Next, we apply DoE to the task of measuring mutual information between related articles and translation pairs and show evidence of large mutual information.

In the following, DoE computes an empirical estimate of (14) by taking iid samples $(x_1, y_1) \ldots (x_N, y_N)$ from $p_{XY}$ and computing

$$\hat{H}(p_X, q_X) = \inf_{q_X} -\frac{1}{N} \sum_{i=1}^N \log q_X(x_i)$$
$$\hat{H}(p_{X|Y}, q_{X|Y}) = \inf_{q_{X|Y}} -\frac{1}{N} \sum_{i=1}^N \log q_{X|Y}(x_i | y_i)$$
$$\hat{I}(X, Y; p_{XY}) = \hat{H}(p_X, q_X) - \hat{H}(p_{X|Y}, q_{X|Y}) \tag{17}$$

where each minimization corresponds to fitting a probabilistic model on the samples with the cross-entropy loss.

5.2.1 Synthetic Experiments

Following Poole et al. (2019), we define random variables $X, Y \in \mathbb{R}^d$ where $(X_i, Y_i)$ are standard normal with correlation $\rho \in [-1, 1]$. It can be checked that $I(X, Y) = -(d/2) \ln(1 - \rho^2)$. We use $d = 128$ and vary $\rho$ to experiment with different values of mutual information. We compare with the following lower-bound estimators: the DV bound (2), MINE (i.e., DV with an "improved gradient estimator") (Belghazi et al., 2018), NWJ (Nguyen et al., 2010), NWJ estimated by optimizing Jensen-Shannon divergence (NWJ (JS)), CPC (Oord et al., 2018), and a nonlinear interpolation between NWJ and CPC (NWJ+CPC). We refer the reader to Poole et al. (2019) for a detailed exposition of these estimators. We parameterize the distributions in DoE (i.e., $q_X$ and $q_{X|Y}$ in (17)) by isotropic Gaussian (correct) or logistic (misspecified).

Table 1 shows mutual information estimates given by these estimators. All estimators are trained for 3,000 steps where at every step they use $N = 128$ samples drawn from $p_{XY}$ to update their weights. We tune the hyperparameters of each estimator (e.g., learning rate, number of hidden units, initialization strategies, estimator-specific hyperparameters such as the mixing weights in NWJ+CPC and MINE) to minimize $|I(X, Y) - \hat{m}|$ where $\hat{m}$ is the final estimate. Thus the results assume an oracle that gives an optimal configuration. All details can be found in the code: https://github.com/karlstratos/doe.

Table 1: Estimates of mutual information under different estimators. Each estimator is trained for 3,000 steps where at every step it receives N = 128 samples of (X, Y) and optimizes its weights; it is fully tuned over hyperparameter choices with respect to its final estimate. In each row, we mark with * the estimate closest to the ground-truth mutual information.

DV      MINE    NWJ     NWJ (JS)   CPC    CPC+NWJ   DoE (Gaussian)   DoE (Logistic)   I(X,Y)   ln N
2.72    2.57    1.99    1.50       2.73   2.77      4.19             4.13*            4.13     4.85
10.27   9.38    9.25    5.55       4.82   8.18      18.38            18.42*           18.41    4.85
61.96   34.56   50.46   13.41      4.85   10.45     104.18*          104.16           106.29   4.85

[Figure 2 plot: x-axis training step (500 to 3000), y-axis estimate (0 to 30); curves for DoE (Gaussian), DoE (Logistic), NWJ, NWJ (JS), DV, MINE, CPC, CPC+NWJ, I(X,Y), and ln N.]

Figure 2: A plot of mutual information estimates during training. At every training step (x-axis), each estimator receives a (shared) minibatch of N = 128 samples, computes the current estimate (y-axis), and updates weights. We show the best-case scenario for each estimator by fully tuning its hyperparameters with respect to the final estimate.

It is clear that (1) DoE obtains the most accurate estimates whether $I(X, Y) > \ln N$ or $I(X, Y) \le \ln N$, (2) DoE is the only estimator that achieves accurate estimates when underlying mutual information is large, and (3) either Gaussian or logistic parameterization of DoE yields accurate estimates. Furthermore, Table 1 does not show the fact that DV, MINE, and NWJ are highly unstable (especially in the large mutual information setting). In particular, while it seems that they can achieve estimates much larger than $\ln N$, this is not representative of their general performance, which is fraught with numerical overflow/underflow problems leading to completely off estimates. Poole et al. (2019) discuss the large variance of these estimators. In contrast, DoE is based on the standard cross-entropy loss and is very stable.

Figure 2 shows a plot of the best training session for each estimator. DoE is a clear outlier as the only accurate estimator of mutual information. Unlike lower-bound estimators, DoE can approach $I(X, Y)$ from above or below.

5.2.2 Mutual Information Between Articles and Translations

We consider two datasets for the choice of (X, Y):

1. Related news article pairs extracted from the Who-Did-What dataset (Onishi et al., 2016)
2. English-German translation pairs extracted from the IWSLT 2014 dataset

We expect that mutual information is large in either setting: a random article (sentence) has large entropy, but given a related article (translation) the uncertainty is drastically reduced. We use a standard LSTM language model for $q_X$ and a standard attention-based translation model for $q_{X|Y}$. More details of the experiments can be found in the supplementary material.

Table 2 shows the estimates of mutual information on the test portion of data. We use log base 2 to accommodate the bit interpretation of entropies (rather than nats). We see that mutual information is estimated to be over 120 bits on related article pairs and 54 bits on translation pairs. Mutual information is estimated to be close to zero for shuffled pairs: this shows that the estimator can also handle small mutual information.

Table 2: Estimates of mutual information (in bits) on article pairs and translation pairs based on the difference of entropies (17).

distribution p_XY             (17)
related article pairs         120.34
shuffled article pairs        −2.38
translation pairs             54.72
shuffled translation pairs    −2.64

6 RELATED WORK

We make a few additional remarks on related work to better contextualize our work. In the continuous setting, a classical approach to estimating mutual information is based on computing the average log of the distance to the k-th nearest neighbor in samples (Kraskov et al., 2004). Gao et al. (2015) show that this estimator suffers exponential sample complexity and propose more refined nearest-neighbor methods. In contrast, we establish that serious statistical limitations are inherent to the measurement of mutual information no matter what estimator is used.

There is a line of work that develops efficient estimators for entropy by assuming distributions with very small support, that is, distributions with support smaller than the sample size. Valiant and Valiant (2011) show that it is possible to achieve an optimal sample rate of O(n/ ln n) where n is the support size. Past work on analyzing minimax bounds likewise assumes small support (Jiao et al., 2015; Han et al., 2015; Kandasamy et al., 2015). In this case, the entropy of the distribution cannot be larger than the log of the number of samples. This is in agreement with, but does not imply, our results. We are interested in the large entropy setting, such as a distribution over all possible images or articles. We cannot have the number of samples equal to the number of possible images or articles.

As discussed in depth in Section 2, we are motivated by the approach in MINE which measures the DV bound to estimate and maximize mutual information (Belghazi et al., 2018). CPC is another notable example that illustrates the statistical limitations of measuring mutual information (Oord et al., 2018). CPC maximizes a lower bound on mutual information through noise contrastive estimation: it is shown that this lower bound cannot be larger than ln k where k is the number of negative samples used in the contrastive choice. Complementary to our work, a recent work by Poole et al. (2019) investigates tradeoffs between bias and variance in estimating variational bounds on mutual information.

There is a class of representation learning methods such as Brown clustering (Brown et al., 1992) and the information bottleneck method (Tishby et al., 1999) that maximize a lower bound on mutual information given by the data processing inequality (DPI). In these methods, we learn "coding" functions $(C, C')$ by optimizing the objective

$$\max_{C, C'} I(C(X), C'(Y); p_{XY}) \le I(X, Y; p_{XY})$$

where the inequality is by the DPI. Information theoretic co-training (McAllester, 2018) considers a similar lower bound and has been shown to be useful for label induction in speech and text (Stratos, 2019). Measuring these lower bounds is subject to the same limitations presented in this paper.

7 CONCLUSIONS

Maximizing mutual information is well motivated as a method of unsupervised pretraining of representations that maintain semantic signal while dropping uninformative noise. However, measuring and maximizing mutual information from finite data is a difficult training objective. In this paper, we have shown serious statistical limitations inherent to measuring lower bounds on various information theoretic measures including KL divergence, entropy, and mutual information. We have also given theoretical arguments that representing mutual information as a difference of entropies, and estimating those entropies by minimizing cross-entropy loss, is a more statistically justified approach than maximizing a lower bound on mutual information.

Unfortunately, cross-entropy upper bounds on entropy fail to provide either upper or lower bounds on mutual information, because mutual information is a difference of entropies. We cannot rule out the possible existence of superintelligent models, models beyond current expressive power, that dramatically reduce cross-entropy loss. Lower bounds on entropy can be viewed as proofs of the non-existence of superintelligence. We should not be surprised that such proofs are infeasible.

References

Arora, S., Risteski, A., and Zhang, Y. (2018). Do GANs learn the distribution? Some theory and empirics. In International Conference on Learning Representations.

Belghazi, M. I., Baratin, A., Rajeshwar, S., Ozair, S., Bengio, Y., Courville, A., and Hjelm, D. (2018). Mutual information neural estimation. In J. Dy and A. Krause, editors, Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pages 531–540, Stockholmsmässan, Stockholm Sweden. PMLR.

Bell, A. J. and Sejnowski, T. J. (1995). An information-maximization approach to blind separation and blind deconvolution. Neural computation, 7(6), 1129–1159.

Brown, P. F., Desouza, P. V., Mercer, R. L., Pietra, V. J. D., and Lai, J. C. (1992). Class-based n-gram models of natural language. Computational Linguistics, 18(4), 467–479.

Donsker, M. and Varadhan, S. (1983). Asymptotic evaluation of certain markov process expectations for large time, iv. Communications on Pure and Applied Mathematics, 36(2), 183–212.

Dziugaite, G. K. and Roy, D. M. (2017). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, UAI 2017, Sydney, Australia, August 11-15, 2017.

Gao, S., Ver Steeg, G., and Galstyan, A. (2015). Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286.

Han, Y., Jiao, J., and Weissman, T. (2015). Minimax estimation of discrete distributions under l1 loss. IEEE Transactions on Information Theory, 61(11), 6343–6354.

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. (2019). Learning deep representations by mutual information estimation and maximization. In International Conference on Learning Representations.

Jiao, J., Venkat, K., Han, Y., and Weissman, T. (2015). Minimax estimation of functionals of discrete distributions. IEEE Transactions on Information Theory, 61(5), 2835–2885.

Kandasamy, K., Krishnamurthy, A., Poczos, B., Wasserman, L., et al. (2015). Nonparametric von mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405.

Kraskov, A., Stögbauer, H., and Grassberger, P. (2004). Estimating mutual information. Physical review E, 69(6), 066138.

McAllester, D. (2013). A pac-bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118.

McAllester, D. (2018). Information theoretic co-training. arXiv preprint arXiv:1802.07572.

McAllester, D. A. and Schapire, R. E. (2000). On the convergence rate of good-turing estimators. In COLT, pages 1–6.

Nguyen, X., Wainwright, M. J., and Jordan, M. I. (2010). Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11), 5847–5861.

Onishi, T., Wang, H., Bansal, M., Gimpel, K., and McAllester, D. (2016). Who did what: A large-scale person-centered cloze dataset. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2230–2235.

Oord, A. v. d., Li, Y., and Vinyals, O. (2018). Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748.

Orlitsky, A. and Suresh, A. T. (2015). Competitive distribution estimation: Why is good-turing good. In Advances in Neural Information Processing Systems, pages 2143–2151.

Orlitsky, A., Santhanam, N., and Zhang, J. (2003). Always good turing: Asymptotically optimal probability estimation. Science, 302(5644).

Poole, B., Ozair, S., Van Den Oord, A., Alemi, A., and Tucker, G. (2019). On variational bounds of mutual information. In International Conference on Machine Learning, pages 5171–5180.

Stratos, K. (2019). Mutual information maximization for simple and accurate part-of-speech induction. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics.

Tishby, N., Pereira, F. C., and Bialek, W. (1999). The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control
and Computing, pages 368–377.

Luong, T., Pham, H., and Manning, C. D. (2015). Effec- Valiant, G. and Valiant, P. (2011). Estimating the un- tive approaches to attention-based neural machine trans- seen: an n/log (n)-sample estimator for entropy and sup- lation. In Proceedings of the 2015 Conference on Em- port size, shown optimal via new clts. In Proceedings of pirical Methods in Natural Language Processing, pages the forty-third annual ACM symposium on Theory of com- 1412–1421. puting, pages 685–694. ACM. Formal Limitations on the Measurement of Mutual Information

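A PROOF OF THEOREM 2.1

For any distribution $r_X$ over $\mathcal{X}$, we can write

$$
D_{KL}(p_X \| q_X) = \mathbb{E}_{x \sim p_X}\left[\ln \frac{r_X(x)}{q_X(x)}\right] + D_{KL}(p_X \| r_X) \geq \mathbb{E}_{x \sim p_X}\left[\ln \frac{r_X(x)}{q_X(x)}\right] \tag{18}
$$

where the inequality holds because $D_{KL}(p_X \| r_X) \geq 0$. Let $f : \mathcal{X} \to \mathbb{R}$ be a bounded function and define

$$
r_X(x) = \frac{q_X(x)\, e^{f(x)}}{\mathbb{E}_{x \sim q_X}\left[e^{f(x)}\right]} \qquad \forall x \in \mathcal{X}
$$

which is a valid distribution over $\mathcal{X}$. Plugging this into the lower bound in (18), we have

$$
\mathbb{E}_{x \sim p_X}\left[\ln \frac{r_X(x)}{q_X(x)}\right] = \mathbb{E}_{x \sim p_X}\left[f(x)\right] - \ln \mathbb{E}_{x \sim q_X}\left[e^{f(x)}\right] \tag{19}
$$

By (18), the supremum of (19) over the choice of $f$ is precisely the KL divergence between $p_X$ and $q_X$. It can be easily verified that an optimal $f$ is given by

$$
f(x) = \ln \frac{p_X(x)}{q_X(x)} \qquad \forall x \in \mathcal{X}
$$

Since (19) is invariant to translation of $f$, without loss of generality we can assume that the range of $f$ is bounded in $[0, F_{\max}]$ for some constant $F_{\max}$.

B MUTUAL INFORMATION AS THE SUPREMUM OVER BINNINGS

We now show that the mutual information $I(X, Y; p_{XY})$ for $X$ and $Y$ continuous can be expressed as the supremum of $I(C(X), C'(Y); p_{XY})$ over discrete binnings of the continuous space. We first consider the case where $X, Y \in \mathbb{R}$ and where the mutual information can be written as a Riemann integral over densities.

$$
\begin{aligned}
I(X, Y; p_{XY}) &= \int p_{XY}(x, y) \ln \frac{p_{XY}(x, y)}{p_X(x)\, p_Y(y)} \, dx \, dy \\
&= \lim_{\epsilon \to 0} \sum_{i, j \in \mathbb{Z}} \epsilon^2\, p_{XY}(i\epsilon, j\epsilon) \ln \frac{p_{XY}(i\epsilon, j\epsilon)}{p_X(i\epsilon)\, p_Y(j\epsilon)}
\end{aligned}
$$

where $\mathbb{Z}$ is the set of all integers. For each $i \in \mathbb{Z}$, define the half-open interval $C_{i,\epsilon} := [i\epsilon, (i+1)\epsilon)$. The probability of the interval is approximately $\epsilon\, p_X(i\epsilon)$ under $p_X$ (similarly for $p_Y$ and $p_{XY}$). Therefore we can write the last expression as

$$
\begin{aligned}
&\lim_{\epsilon \to 0} \sum_{i, j \in \mathbb{Z}} p_{XY}(C_{i,\epsilon} \times C_{j,\epsilon}) \ln \frac{p_{XY}(C_{i,\epsilon} \times C_{j,\epsilon})}{p_X(C_{i,\epsilon})\, p_Y(C_{j,\epsilon})} \\
&\quad = \lim_{\epsilon \to 0} \sum_{i, j \in \mathbb{Z}} p_{IJ}(i, j) \ln \frac{p_{IJ}(i, j)}{p_I(i)\, p_J(j)} \\
&\quad = \lim_{\epsilon \to 0} I(I, J; p_{IJ})
\end{aligned}
$$

where $(I, J)$ denote the indices $(i, j)$ such that $x \in C_{i,\epsilon}$ and $y \in C_{j,\epsilon}$ for $(x, y) \sim p_{XY}$.

This proof immediately generalizes to higher dimensions where the mutual information can be expressed as a Riemann integral. We believe that this statement remains true for arbitrary measures on product spaces where the mutual information is finite. However, the proof for this extremely general case appears to be nontrivial.

C PAC-BAYESIAN BOUNDS

The PAC-Bayesian bounds apply to "broad basin" losses and loss estimates such as the following:

$$
\begin{aligned}
H_\sigma(p_X, q_X^\theta) &= \mathbb{E}_{x \sim p_X}\, \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma I)}\left[-\ln q_X^{\theta + \epsilon}(x)\right] \\
\hat{H}_\sigma(S, q_X^\theta) &= \frac{1}{|S|} \sum_{x \in S} \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma I)}\left[-\ln q_X^{\theta + \epsilon}(x)\right]
\end{aligned}
$$

Under mild smoothness conditions on $q_X^\theta(x)$ as a function of $\theta$ we have

$$
\begin{aligned}
\lim_{\sigma \to 0} H_\sigma(p_X, q_X^\theta) &= H(p_X, q_X^\theta) \\
\lim_{\sigma \to 0} \hat{H}_\sigma(S, q_X^\theta) &= \hat{H}(S, q_X^\theta)
\end{aligned}
$$

An L2 PAC-Bayesian generalization bound (McAllester, 2013) gives that for any parameterized class of models and any bounded notion of loss, and any $\lambda > 1/2$ and $\sigma > 0$, with probability at least $1 - \delta$ over the draw of $S$ from $p_X^N$ we have the following simultaneously for all parameter vectors $\theta$.

$$
H_\sigma(p_X, q_X^\theta) \leq \frac{1}{1 - \frac{1}{2\lambda}} \left( \hat{H}_\sigma(S, q_X^\theta) + \frac{\lambda F_{\max}}{N} \left( \frac{\|\theta\|^2}{2\sigma^2} + \ln \frac{1}{\delta} \right) \right)
$$

It is instructive to set $\lambda = 5$, in which case the bound becomes

$$
H_\sigma(p_X, q_X^\theta) \leq \frac{10}{9} \left( \hat{H}_\sigma(S, q_X^\theta) + \frac{5 F_{\max}}{N} \left( \frac{\|\theta\|^2}{2\sigma^2} + \ln \frac{1}{\delta} \right) \right)
$$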
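As a rough illustration of how the terms of the L2 PAC-Bayesian bound in this appendix interact, the following sketch evaluates its right-hand side, $\frac{1}{1 - 1/(2\lambda)}\big(\hat{H}_\sigma + \frac{\lambda F_{\max}}{N}(\frac{\|\theta\|^2}{2\sigma^2} + \ln\frac{1}{\delta})\big)$. The function name `pac_bayes_bound` and every numeric input are hypothetical, chosen only for illustration; none of them are taken from the experiments.

```python
import math


def pac_bayes_bound(emp_loss, f_max, n, theta_sq_norm, sigma, delta, lam=5.0):
    """Right-hand side of the bound:
    (1 / (1 - 1/(2*lam))) * (emp_loss + (lam*f_max/n) * (theta_sq_norm/(2*sigma**2) + ln(1/delta)))
    """
    if lam <= 0.5:
        raise ValueError("the bound is stated only for lam > 1/2")
    if not (0.0 < delta <= 1.0):
        raise ValueError("delta must lie in (0, 1]")
    prefactor = 1.0 / (1.0 - 1.0 / (2.0 * lam))  # equals 10/9 when lam = 5
    complexity = theta_sq_norm / (2.0 * sigma ** 2) + math.log(1.0 / delta)
    return prefactor * (emp_loss + (lam * f_max / n) * complexity)


# Hypothetical inputs, used only to show how the value changes with N.
for n in (10**4, 10**6, 10**8):
    value = pac_bayes_bound(emp_loss=100.0, f_max=400.0, n=n,
                            theta_sq_norm=1.0e4, sigma=1.0, delta=0.05)
    print(f"N = {n:>9d}  bound = {value:.3f}")
```

With $\lambda$ held at 5, the complexity term shrinks as $1/N$, while the prefactor $10/9$ on the empirical loss does not depend on $N$.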

While this bound is linear in $1/N$, and tighter in practice than square root bounds, note that there is a small residual gap when holding $\lambda$ fixed at 5 while taking $N \to \infty$. In practice the regularization parameter $\lambda$ can be tuned on holdout data. One point worth noting is the form of the dependence of the regularization coefficient on $F_{\max}$, $N$ and the basin parameter $\sigma$.

It is also worth noting that the bound can be given in terms of "distance traveled" in parameter space from an initial (random) parameter setting $\theta_0$.

$$
H_\sigma(p_X, q_X^\theta) \leq \frac{10}{9} \left( \hat{H}_\sigma(S, q_X^\theta) + \frac{5 F_{\max}}{N} \left( \frac{\|\theta - \theta_0\|^2}{2\sigma^2} + \ln \frac{1}{\delta} \right) \right)
$$

Evidence is presented in Dziugaite and Roy (2017) that the distance traveled bounds are tighter in practice than traditional L2 generalization bounds.

D EXPERIMENT DETAILS

Article pairs. We take pairs from the Who-Did-What dataset (Onishi et al., 2016). The pairs in this dataset were constructed by drawing articles from the LDC Gigaword newswire corpus. A first article is drawn at random and then a list of candidate second articles is drawn using the first sentence of the first article as an information retrieval query. A second article is selected from the candidates using criteria described in Onishi et al. (2016), the most significant of which is that the second article must have occurred within a two-week time interval of the first. The training statistics of this dataset after preprocessing are given in Table 3.

              train (tgt)   train (src)
# articles          68348         68348
vocab size         100001         87941
# words          20271664      19072167
avg length            296           279
max length            400           400
min length             10            12

Table 3: Training statistics of the article pairs

Translation pairs. Our translation pairs consist of English-German sentence pairs extracted from the IWSLT 2014 dataset. The training statistics of this dataset after preprocessing are given in Table 4.

              train (tgt)   train (src)
# sentences        160239        160239
vocab size          24726         35445
# words           3275729       3100720
avg length             20            19
max length            175           172
min length              2             2

Table 4: Training statistics of the translation pairs

Model. We train an LSTM encoder-decoder model where the decoder doubles as both the decoder of a translation model and a language model. The decoder is a left-to-right 2-layer LSTM in which a single word embedding matrix is used for both input embeddings and the softmax predictions. When this model is trained as a language model on PTB using standard hyperparameter values it achieves test perplexity of 72.26. The encoder is a separate left-to-right 2-layer LSTM using the same word embeddings as the decoder. We use the input-feeding attention architecture of Luong et al. (2015).

The model is trained using SGD and batch size 10 with no BPTT-style truncation. The dimension of the input/hidden states is 900 (thus 1800 for the input-feeding decoder). We use step-wise dropout with rate 0.65 on word embeddings and hidden states. The model is trained for 40 epochs and the model that achieves the best validation perplexity is selected. The sequence-level cross entropy is estimated as $\mathrm{SQXENT} = \frac{1}{M}\mathrm{NLL}$ where NLL is the negative log likelihood of the corpus and $M$ is the total number of sequences in the corpus.

Mutual information is estimated by taking the difference in SQXENT between the language model and the translation model (17). For article pairs, we obtain

$$
\hat{I}(X, Y; p_{XY}) = 1131.74 - 1048.33 = 83.41
$$

in nats, which translates to 120.34 bits. For translation pairs, we obtain

$$
\hat{I}(X, Y; p_{XY}) = 81.73 - 43.80 = 37.93
$$

in nats, which translates to 54.72 bits.
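The arithmetic behind the estimates reported in this appendix can be sketched as follows: SQXENT is the corpus negative log likelihood divided by the number of sequences, the mutual information estimate is the language-model SQXENT minus the translation-model SQXENT, and nats are converted to bits by dividing by ln 2. The helper names below are hypothetical and are not from the authors' code; the four SQXENT values are the ones quoted in this appendix.

```python
import math


def sqxent(total_nll_nats, num_sequences):
    """Sequence-level cross entropy: corpus NLL (in nats) divided by the number of sequences M."""
    return total_nll_nats / num_sequences


def mi_estimate_nats(lm_sqxent, tm_sqxent):
    """Difference-of-entropies estimate: language-model SQXENT minus translation-model SQXENT."""
    return lm_sqxent - tm_sqxent


def nats_to_bits(nats):
    """Convert nats to bits by dividing by ln 2."""
    return nats / math.log(2.0)


# SQXENT values quoted in this appendix (language model, translation model).
article_nats = mi_estimate_nats(1131.74, 1048.33)
translation_nats = mi_estimate_nats(81.73, 43.80)

print(f"article pairs:     {article_nats:.2f} nats = {nats_to_bits(article_nats):.2f} bits")
print(f"translation pairs: {translation_nats:.2f} nats = {nats_to_bits(translation_nats):.2f} bits")
```

The `sqxent` helper is included only to make the definition concrete; the appendix reports the SQXENT values directly, not the underlying NLL totals.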