
Bounds on mutual information of mixture data for classification tasks

Yijun Ding
James C. Wyant College of Optical Sciences
University of Arizona
Tucson, AZ, USA
[email protected]

Amit Ashok
James C. Wyant College of Optical Sciences and Department of Electrical Engineering
University of Arizona
Tucson, AZ, USA
[email protected]

Abstract—The data for many classification problems, such as pattern and speech recognition, follow mixture distributions. To quantify the optimum performance for classification tasks, the Shannon mutual information is a natural information-theoretic metric, as it is directly related to the probability of error. The mutual information between mixture data and the class label does not have an analytical expression, nor an efficient computational algorithm. We introduce a variational upper bound, a lower bound, and three estimators, all employing pair-wise divergences between mixture components. We compare the new bounds and estimators with Monte Carlo stochastic sampling and with bounds derived from entropy bounds. To conclude, we evaluate the performance of the bounds and estimators through numerical simulations.

Index Terms—Mixture distribution, classification, Shannon mutual information, bounds, estimation, mixed-pair

I. INTRODUCTION

A. Motivation

We study the performance of classification tasks, where the goal is to infer the class label C from sample data x. The Shannon mutual information I(x; C) characterizes the reduction in the uncertainty of the class label C given knowledge of the data x, and it provides a way to quantify the relevance of the data x with respect to the class label C. As the mutual information is related to the probability of classification error (Pe) through Fano's inequality and other bounds [1]–[3], it has been widely used for feature selection [4], [5], learning [6], and quantifying task-specific information [7] for classification.

Statistical mixture distributions such as Poisson, Wishart or Gaussian mixtures are frequently used in the fields of speech recognition [8], image retrieval [9], system evaluation [10], compressive sensing [11], distributed state estimation [12], hierarchical clustering [13], etc. As a practical example, consider a scenario in which the data x is measured with a noisy system, e.g. Poisson noise in a photon-starved imaging system or Gaussian noise in a thermometer reading. If the actual scene (or temperature) has a class label, e.g. target present or not (the temperature is below freezing or not), then the mutual information I(x; C) describes with what confidence one can assign a class label C to the noisy measurement data x.

The goal of this paper is to develop efficient methods to quantify the optimum performance of classification tasks when the distribution of the data x for each given class label, pr(x|C), follows a known mixture distribution. As the mutual information I(x; C), which is commonly used to quantify task-specific information, does not admit an analytical expression for mixture data, we provide analytical expressions for bounds and estimators of I(x; C).

B. Problem Statement and Contributions

We consider the data as a continuous random variable x and the class label as a discrete random variable C, where C can be any integer in [1, Π] and Π is the number of classes. The bold symbol x emphasizes that x is a vector, which can be high-dimensional. We assume that, when restricted to any of the classes, the conditional differential entropy of x is well-defined, or in other words, (x, C) is a good mixed-pair vector [14]. The mutual information between the data x and the class label C can be defined as [15]

I(x; C) = KL( pr(x, C) || Pr(C) · pr(x) ) = \sum_C \int dx\, pr(x, C) \ln \frac{pr(x, C)}{Pr(C) \cdot pr(x)}.    (1)

When pr(x) is a mixture distribution with N components,

pr(x) = \sum_{i=1}^{N} w_i\, pr_i(x),    (2)

where w_i is the weight of component i (w_i ≥ 0 and \sum_i w_i = 1), and pr_i is the probability density of component i. The conditional distribution of the data, when the class label is c, also follows a mixture distribution.

In this work, we propose new bounds and estimators of the Shannon mutual information between a mixture distribution x and its class label C. We provide a lower bound, a variational upper bound and three estimators of I(x; C), all based on pair-wise distances. We present closed-form expressions for the bounds and estimators. Furthermore, we use numerical simulations to compare the bounds and estimators to Monte Carlo (MC) simulations and to a set of bounds derived from entropy bounds.

C. Related works

Although estimation of conditional entropy and mutual information has been extensively studied [16]–[19], research has focused on purely discrete or purely continuous data. Nair et al. [14] extended the definition of the joint entropy to mixed-pairs, which consist of one discrete variable and one continuous variable. Ross [20], Moon et al. [21] and Beknazaryan et al. [15] provided methods for estimating mutual information from samples of mixed-pairs based on nearest-neighbor or kernel estimators. Gao et al. [22] extended the definition of mutual information to the case where each random variable can have both discrete and continuous components through the Radon-Nikodym derivative. Here our goal is to study mutual information for mixed-pairs, where the data x is continuous and the class label C is discrete.

When the underlying distribution of the data is unknown, the mutual information can be approximated from samples with a number of density or likelihood-ratio estimators based on binning [23], [24], kernel methods [25]–[27], k-nearest-neighbor (kNN) distances [28], [29], or approximated Gaussianity (Edgeworth expansion [30]). To accommodate high-dimensional data (such as images and text) or large datasets, Gao et al. [31] improved the kNN estimator with a local non-uniformity correction term; Jiao et al. [32] proposed a minimax estimator of entropy that achieves the optimal sample complexity; Belghazi et al. [33] presented a general-purpose neural-network estimator; Poole et al. [34] provided a thorough review and several new bounds on mutual information that are capable of trading off bias for variance.

However, when the underlying data distribution is known, the exact computation of mutual information is tractable only for a limited family of distributions [35], [36]. The mutual information for mixture distributions has no known closed-form expression [37]–[39]; hence MC sampling and numerical integration are often employed as unbiased estimators. MC sampling of sufficient accuracy is computationally intensive [40]. Numerical integration is limited to low-dimensional problems [41]. To reduce the computational requirements, deterministic approximations have been developed using merged Gaussians [8], [42], component-wise Taylor-series expansion [43], the unscented transform [44] and pair-wise KL divergence between matched components [9]. The merged-Gaussian and unscented-transform estimators are biased, while the Taylor-expansion method provides a trade-off between computational demands and accuracy.

Two papers that have deeply inspired our work are [8] and [45]. Hershey et al. [8] proposed a variational upper bound and an estimator of the KL divergence between two Gaussian mixtures based on pair-wise KL divergences, and showed empirically that the variational upper bound and estimator perform better than other deterministic approximations, such as merged Gaussians, the unscented transform and matched components. Kolchinsky et al. [45] bounded the entropy of mixture distributions with pair-wise KL and Chernoff-α (Cα) divergences and demonstrated through numerical simulations that these bounds are tighter than other well-known existing bounds, such as the kernel-density estimator [41], [46] and the expected-likelihood-kernel estimator [47]–[49]. Our results do not follow directly from either paper, as the calculation of I(x; C) involves a summation of multiple entropies or KL divergences. Instead of providing bounds for each term (entropy) in the summation, we directly bound and estimate the mutual information.

II. MAIN RESULTS

In this section, we provide three estimators of I(x; C) and a pair of lower and upper bounds. All bounds and estimators are based on pair-wise KL and Cα divergences. Furthermore, we provide proofs of the lower and upper bounds. Before presenting our main results, we start with a few definitions. The marginal distribution of the class label C is

Pr(C = c) = P_c = \sum_{i \in \{c\}} w_i.    (3)

Note that \sum_{c=1}^{\Pi} P_c = 1 and {c} is the set of components that have class label C = c. The conditional distribution of the data, when the class label is c, is given by

pr(x|c) = \sum_{i \in \{c\}} \frac{w_i}{P_c} pr_i(x).    (4)

Expressing the marginal distribution in terms of the conditional distribution, we have

pr(x) = \sum_c P_c \cdot pr(x|c).    (5)

The joint distribution of the data and class label is

pr(x, c) = P_c \cdot pr(x|c) = \sum_{i \in \{c\}} w_i\, pr_i(x).    (6)

A. Pair-wise distances

The Kullback-Leibler (KL) divergence is defined as

KL(pr_i || pr_j) = \int dx\, pr_i(x) \ln \frac{pr_i(x)}{pr_j(x)}.    (7)

The Cα divergence [50] between the two distributions pr_i(x) and pr_j(x) is defined as

C_\alpha(pr_i || pr_j) = -\ln \int dx\, pr_i^{\alpha}(x)\, pr_j^{1-\alpha}(x),    (8)

for real-valued α ∈ [0, 1]. More specifically, when α = 1/2, the Chernoff divergence reduces to the Bhattacharyya distance.

B. Bounds and estimates of the mutual information

We adopt the convention that ln 0 = 0 and ln(0/0) = 0. An exact expression of the mutual information is

I(x; C) = H(C) - \sum_{i=1}^{N} w_i\, E_{pr_i}\!\left[ \ln \frac{\sum_{j=1}^{N} w_j\, pr_j}{\sum_{k \in \{C_i\}} w_k\, pr_k} \right],    (9)

where {C_i} is the set of component indices that are in the same class as component i, and E_{pr_i}[f] = \int dx\, pr_i(x) f(x) is the expectation of f with respect to the probability density function pr_i.
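Equation (9) is what the Monte Carlo benchmark used later in Section III evaluates. As a concrete illustration (ours, not from the paper), the following minimal Python sketch estimates Eq. (9) by sampling from each component; it assumes Gaussian components with a shared covariance, and the function name and arguments are illustrative.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import multivariate_normal

def mc_mutual_information(mus, cov, w, labels, n_samples=100_000, seed=0):
    """Monte Carlo estimate of Eq. (9) for Gaussian components with a shared covariance.

    mus:    (N, d) component means
    cov:    (d, d) shared covariance matrix
    w:      (N,) component weights, summing to one
    labels: (N,) integer class label of each component
    """
    rng = np.random.default_rng(seed)
    mus, w, labels = np.asarray(mus, float), np.asarray(w, float), np.asarray(labels)
    N = len(w)

    # H(C) = -sum_c P_c ln P_c, where P_c is the total weight of class c (Eq. (3)).
    P_c = np.array([w[labels == c].sum() for c in np.unique(labels)])
    H_C = -np.sum(P_c * np.log(P_c))

    expect = 0.0
    for i in range(N):
        x = rng.multivariate_normal(mus[i], cov, size=n_samples)      # samples from pr_i
        log_pdf = np.column_stack(
            [multivariate_normal.logpdf(x, mean=mus[j], cov=cov) for j in range(N)]
        )
        same = labels == labels[i]
        log_mix = logsumexp(log_pdf, axis=1, b=w)                     # ln sum_j w_j pr_j
        log_cls = logsumexp(log_pdf[:, same], axis=1, b=w[same])      # ln sum_{k in C_i} w_k pr_k
        expect += w[i] * np.mean(log_mix - log_cls)
    return H_C - expect
```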
Two approximations of I(x; C) are

\hat{I}_{C\alpha}(x; C) = H(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j e^{-C_\alpha(pr_i||pr_j)}}{\sum_{k \in \{C_i\}} w_k e^{-C_\alpha(pr_i||pr_k)}},
\hat{I}_{KL}(x; C) = H(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j e^{-KL(pr_i||pr_j)}}{\sum_{k \in \{C_i\}} w_k e^{-KL(pr_i||pr_k)}}.    (10)

Another approximation of I(x; C) is

\hat{I}_{KL\&C\alpha}(x; C) = H(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j e^{-D_{ij}}}{\sum_{k \in \{C_i\}} w_k e^{-D_{ik}}},
where \frac{1}{D_{ij}} = \frac{1}{2}\left( \frac{1}{KL(pr_i||pr_j)} + \frac{1}{C_\alpha(pr_i||pr_j)} \right).    (11)

As D is a function of both the KL and Cα divergences, we denote this estimator with the subscript 'KL&Cα'.

A lower bound on I(x; C) based on pair-wise Cα divergences is

I_{lb\,C\alpha} = -\sum_{c=1}^{\Pi} P_c \ln \left[ \sum_{c'=1}^{\Pi} P_{c'} \cdot \min(1, Q_{cc'}) \right],
where Q_{cc'} = \sum_{i \in \{c\}} \sum_{j \in \{c'\}} \left( \frac{w_i}{P_c} \right)^{\alpha_c} \left( \frac{w_j}{P_{c'}} \right)^{1-\alpha_c} e^{-C_{\alpha_c}(pr_i||pr_j)},    (12)

and min(·) is the minimum value function.

A variational upper bound on I(x; C) based on pair-wise KL divergences is

I_{ub\,KL} = H(C) - \sum_{c=1}^{\Pi} \sum_m \phi_{m,c} \ln \frac{\sum_{c'=1}^{\Pi} \phi_{m,c'} e^{-KL(pr_{m,c}, pr_{m,c'})}}{\phi_{m,c}},    (13)

where φ_{m,c} are the variational parameters; a detailed explanation of the notation is given in Section II-D.
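For illustration, the following sketch (ours, not part of the paper) evaluates the three estimators of Eqs. (10)–(11) from precomputed pair-wise divergence matrices; the argument names `KL` and `Ca`, holding KL(pr_i||pr_j) and C_α(pr_i||pr_j), are assumptions of this sketch.

```python
import numpy as np

def pairwise_estimates(w, labels, KL, Ca):
    """Evaluate the estimators of Eqs. (10)-(11) from pair-wise divergence matrices.

    w:      (N,) component weights, summing to one
    labels: (N,) class label of each component
    KL, Ca: (N, N) matrices with KL(pr_i||pr_j) and C_alpha(pr_i||pr_j); the
            diagonals are expected to be zero.
    """
    w, labels = np.asarray(w, float), np.asarray(labels)
    P_c = np.array([w[labels == c].sum() for c in np.unique(labels)])
    H_C = -np.sum(P_c * np.log(P_c))

    # Harmonic-mean combination D_ij of Eq. (11); D_ij = 0 where both divergences vanish.
    D = np.zeros_like(KL)
    mask = (KL > 0) & (Ca > 0)
    D[mask] = 2.0 / (1.0 / KL[mask] + 1.0 / Ca[mask])

    same = labels[None, :] == labels[:, None]              # k in {C_i}

    def estimate(div):
        num = np.exp(-div) @ w                             # sum_j w_j e^{-div(i,j)}
        den = (np.exp(-div) * same) @ w                    # same-class components only
        return H_C - np.sum(w * np.log(num / den))

    return {"I_KL": estimate(KL), "I_Ca": estimate(Ca), "I_KL&Ca": estimate(D)}
```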

= min(1,Qcc0 ). and min(·) is the minimum value function. (15) A variational upper bound on I(x,C) based on pair-wise Therefore, Equation 12 is a lower bound on I(x; C). KL is The best possible lower bound can be obtained by finding the parameters αc that maximize Ilb C , which is equivalent P α Π Π to minimize c0 Pc0 min(1,Qcc0 ). In a special case when all P −KL(prm,c,prm,c0 ) X X c0=1 φm,c0 e components are symmetric and identical except the center Iub KL = H(C)− φm,c ln , φ −Cα(pri||prj ) m c=1 m,c location, eg. homoscedastic Gaussian mixture, e (13) achieves minimum value at α = 1/2 [45]. where φm,c are the variational parameters and detailed expla- nation of the notations is given in Section II-D. D. Proof of the variational upper bound Here we propose a direct upper bound on the mutual infor- mation I(x; C) using a variational approach. The underlying idea is to match components from different classes. To pick C. Proof of the lower bound one component from each class, there are N1 × N2... × NΠ combinations, where Nc is the number of components in QΠ We propose a lower bound on I(x; C) based on pair- class c. Denote an integer M = c=1 Nc. A component wise Cα divergences. For ease of notation, we denote the i in class c can be split into M/Nc components with each conditional distribution pr(x|C = c) as prc. We first make component corresponding to a component-combination in the use of a derivation from [45] and [51] to bound I(x; C) with other Π − 1 classes. Mathematically speaking, we introduce the variational parameters φij ≥ 0 satisfying the constraints We can further bound the batch-conditional entropy with PM/Nc j=1 φij = wi. Using the variational parameters, we can pair-wise KL divergence as write the joint distribution as N ˆ X M/Nc I(x; C) ≤ H(C) + HKL(x|m) − H(C|m) − wiHi(x) X X X i=1 pr(x, c) = wipri = φijpri. (16) P −KL(prm,c,prm,c0 ) X X 0 φm,c0 e i∈{c} i∈{c} j=1 = H(C) − φ ln c m,c φ m c m,c Note that the set {c} has N components. By rearranging c := I indices (i, j) into a vector m of length M, we can simplify ub KL (21) the joint distribution to

D. Proof of the variational upper bound

Here we propose a direct upper bound on the mutual information I(x; C) using a variational approach. The underlying idea is to match components from different classes. To pick one component from each class, there are N_1 × N_2 × ... × N_Π combinations, where N_c is the number of components in class c. Denote an integer M = \prod_{c=1}^{\Pi} N_c. A component i in class c can be split into M/N_c components, with each component corresponding to a component-combination in the other Π − 1 classes. Mathematically speaking, we introduce the variational parameters φ_{ij} ≥ 0 satisfying the constraints \sum_{j=1}^{M/N_c} \phi_{ij} = w_i. Using the variational parameters, we can write the joint distribution as

pr(x, c) = \sum_{i \in \{c\}} w_i\, pr_i = \sum_{i \in \{c\}} \sum_{j=1}^{M/N_c} \phi_{ij}\, pr_i.    (16)

Note that the set {c} has N_c components. By rearranging the indices (i, j) into a vector m of length M, we can simplify the joint distribution to

pr(x, c) = \sum_{m=1}^{M} \phi_{m,c}\, pr_{m,c}(x),    (17)

where the subscript c emphasizes that each class has a unique mapping from (i, j) to m and pr_{m,c}(x) equals the corresponding pr_i(x).

With this notation, the marginal distribution of the data x is

pr(x) = \sum_{c=1}^{\Pi} pr(x, c) = \sum_{c=1}^{\Pi} \sum_{m=1}^{M} \phi_{m,c}\, pr_{m,c}(x).    (18)

We further define a mini-batch m as

b_m(x) = \sum_{c=1}^{\Pi} \phi_{m,c}\, pr_{m,c}(x).    (19)

Each mini-batch contains Π components, with one component from each class. With this definition, the marginal distribution of the data can be written as pr(x) = \sum_{m=1}^{M} b_m(x). The probability of a component being in the m-th batch is P_m = \sum_c \phi_{m,c}. The probability density function of the m-th batch is pr(x|m) = b_m(x)/P_m.

Now we use Jensen's inequality, or more specifically the log-sum inequality [52], to bound I(x; C) by the batch-conditional entropy:

I(x; C) = H(C) + \sum_c \int dx\, pr(x, c) \cdot \ln \frac{pr(x, c)}{pr(x)}
= H(C) + \sum_c \int dx \left( \sum_m \phi_{m,c}\, pr_{m,c} \right) \ln \frac{\sum_m \phi_{m,c}\, pr_{m,c}}{\sum_m b_m}
\leq H(C) + \sum_c \sum_m \int dx\, \phi_{m,c}\, pr_{m,c} \ln \frac{\phi_{m,c}\, pr_{m,c}}{b_m}
= H(C) + H(x|m) - H(C|m) - \sum_{i=1}^{N} w_i H_i(x),    (20)

where H(C) = -\sum_c P_c \ln P_c is the entropy of the class label; H(x|m) = \sum_m P_m H(pr(x|m)) is the batch-conditional entropy of the data; H(C|m) = \sum_m P_m H_m(C) is the batch-conditional entropy of the label, where H_m(C) is the entropy of the class label for batch m; and H_i(x) = H(pr_i(x)) is the entropy of the i-th component.

We can further bound the batch-conditional entropy with pair-wise KL divergences as

I(x; C) \leq H(C) + \hat{H}_{KL}(x|m) - H(C|m) - \sum_{i=1}^{N} w_i H_i(x)
= H(C) - \sum_m \sum_c \phi_{m,c} \ln \frac{\sum_{c'} \phi_{m,c'} e^{-KL(pr_{m,c}, pr_{m,c'})}}{\phi_{m,c}} := I_{ub\,KL},    (21)

where \hat{H}_{KL}(x|m) is an upper bound on the batch-conditional entropy, and the inequality has been proved in [45].

The tightest upper bound attainable through this method can be found by varying the parameters φ_{m,c} to minimize I_{ub KL}. The minimization problem has been proved to be convex (see Appendix A). The upper bound I_{ub KL} can be minimized iteratively by fixing the parameters φ_{m,c'} (where c' ≠ c) and optimizing the parameters φ_{m,c} under linear constraints. At each iteration step I_{ub KL} is lowered, and the converged value is the tightest variational upper bound on the mutual information.

Non-optimum variational parameters still provide upper bounds on I(x; C). There are M × Π variational parameters, M × Π inequality constraints and N equality constraints. When the number of classes or components is large, the minimization problem becomes computationally intensive. A non-optimum solution that is similar to the matched bound [8], [53] can be obtained by dividing all components into max(N_c) mini-batches by matching each component i to one component in each class. Mathematically speaking, φ_{ij} = w_i for one pair of matched (i, j) and φ_{ij} = 0 otherwise. To find the mini-batches, the Hungarian method [54], [55] for assignment problems can be applied, as illustrated in the sketch below.
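Below is a sketch (ours, not the paper's code) of the matched-bound construction for two classes with equally many components, using SciPy's linear_sum_assignment implementation of the Hungarian algorithm; the returned φ can be used directly in Eq. (21) or as a starting point for the iterative minimization.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def matched_phi(w1, w2, KL12):
    """Matched-bound variational parameters for two classes (Section II-D).

    w1, w2: weights of the components in class 1 and class 2 (equal length assumed)
    KL12:   KL12[i, j] = KL(pr_i || pr_j) between component i of class 1 and
            component j of class 2
    Returns phi with shape (M, 2), one mini-batch per matched pair, and the pairs.
    """
    # Hungarian assignment minimising the total pair-wise KL divergence.
    rows, cols = linear_sum_assignment(KL12)
    M = len(rows)
    phi = np.zeros((M, 2))
    phi[:, 0] = np.asarray(w1, float)[rows]   # phi_{m,1} = w_i of the matched component
    phi[:, 1] = np.asarray(w2, float)[cols]   # phi_{m,2} = w_j of its partner
    return phi, list(zip(rows, cols))
```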
III. NUMERICAL SIMULATIONS

In this section, we run numerical simulations and compare estimators of the mutual information between mixture data and class labels. We consider a simple example of binary classification of mixture data, where the mixture components are two-dimensional homoscedastic Gaussians. The component centers are close to the class boundary and uniformly distributed along the boundary. The locations of the component centers are plotted in Figure 1(a), where the component centers are represented by a red star (class 1) or a yellow circle (class 2). Each class consists of 100 two-dimensional Gaussian components with equal weights. The components have the same covariance matrix σ²I, where I is the identity matrix and σ represents the size of the Gaussian components. The conditional distribution pr(x|c = 2) is plotted in the insert of Figure 1(a) for σ = 0.5. When σ is larger, the components of the mixture distribution are more connected; when σ is smaller, the components are more isolated. Estimates of I(x; C) are calculated for varying σ.

Fig. 1: (a) The locations of the centers of the components and the mixture distribution pr(x|c = 2) when σ = 0.5 (insert). (b) Estimates of I(x; C).
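For reference, here is a sketch (ours) of one way to set up such a two-class homoscedastic Gaussian scenario; the exact center layout and the `offset` parameter are illustrative guesses, not the paper's configuration. With equal covariances σ²I, the pair-wise divergences reduce to the simple forms of Eq. (29) in Appendix B, so only the squared Mahalanobis distances λ_ij are needed.

```python
import numpy as np

def make_scenario(n_per_class=100, sigma=0.5, offset=0.1):
    """Two classes of homoscedastic Gaussian components near the boundary x1 = 0."""
    y = np.linspace(-5.0, 5.0, n_per_class)
    mus = np.vstack([np.column_stack([-offset * np.ones(n_per_class), y]),   # class 1
                     np.column_stack([+offset * np.ones(n_per_class), y])])  # class 2
    labels = np.repeat([1, 2], n_per_class)
    w = np.full(2 * n_per_class, 1.0 / (2 * n_per_class))    # equal weights

    # Equal covariances sigma^2 I: Eq. (29) gives KL_ij = lam_ij/2 and, at alpha = 1/2,
    # C_alpha,ij = lam_ij/8, with lam_ij = ||mu_i - mu_j||^2 / sigma^2.
    diff = mus[:, None, :] - mus[None, :, :]
    lam = np.sum(diff ** 2, axis=-1) / sigma ** 2
    return w, labels, lam / 2.0, lam / 8.0
```

The returned weights, labels and divergence matrices can be fed to the estimator and bound sketches given in Section II.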

A pair of obvious bounds on I(x; C) is [0, H(C)], where H(C) is the entropy of the class label Pr(C). Another pair of upper and lower bounds on I(x; C) can be derived from bounds on the mixture entropy as

I_{lb\,2H} = H_{lb}(x) - H_{ub}(x|C),
I_{ub\,2H} = H_{ub}(x) - H_{lb}(x|C),    (22)

where the upper and lower bounds on entropy based on pair-wise KL and Cα divergences have been provided by [45]. These bounds on I(x; C) are based on two entropy bounds, hence the subscript '2H'. We evaluate the following estimates of I(x; C):
1) The new variational upper bound and the new lower bound, I_{ub KL} and I_{lb Cα}, are plotted as dark red and blue solid lines, respectively.
2) The estimates based on the pair-wise KL, Cα or D divergences (D is a function of both the KL and Cα divergences) are plotted as yellow, light blue and black dashed lines, respectively.
3) The true mutual information, I(x; C), as estimated by MC sampling (grey solid line).
4) The upper and lower bounds I_{lb 2H} and I_{ub 2H} are plotted as orange and green dot-dashed lines, respectively.
The obvious bounds on I(x; C), which are [0, H(C)], are also presented as an area in grey. The Monte Carlo simulation results, which serve as the benchmark, are calculated with 10^6 samples. We use α = 1/2 in the calculation of Cα divergences, as it provides the optimum bounds for our example. We also present details of our implementation and results for two other scenarios in Appendix B.

Our new upper bound and lower bound appear to be tighter than the bounds derived from entropy bounds over the range of σ considered in our simulation. In Figure 1(b), where the estimates of I(x; C) are plotted, the results show that the blue and dark red solid lines are almost always within the area covered by the green and orange dot-dashed lines. The three estimates, Î_KL, Î_Cα and Î_KL&Cα, all follow the trend of I(x; C). More specifically, Î_KL (yellow dashed line) follows the new variational upper bound I_{ub KL} (dark red solid line) closely; Î_Cα (blue dashed line) is a good estimator of I(x; C) (grey solid line); Î_KL&Cα (black dashed line) is another good estimator of I(x; C), as the black dashed line tracks the grey solid line closely. The differences between the three estimates and the I(x; C) calculated from MC simulation are plotted in Appendix B.

IV. CONCLUSION

We provide closed-form bounds and approximations of the mutual information between mixture data and class labels. The closed-form expressions are based on pair-wise distances, which are feasible to compute even for high-dimensional data. Based on numerical results, the new bounds we propose are tighter than the bounds derived from bounds on entropy, and the approximations serve as good surrogates for the true mutual information.

APPENDIX A
THE MINIMIZATION OF I_{ub KL}

The minimization problem of I_{ub KL} by varying φ_{m,c} is convex, which we prove in this section. The convexity of the minimization problem can be checked through the first and second-order derivatives. For ease of notation, we define S_{m,c} = \sum_{c'} \phi_{m,c'} e^{-KL(pr_{m,c}, pr_{m,c'})} and E_{m,cc'} = \exp(-KL(pr_{m,c}, pr_{m,c'})). The first derivative of I_{ub KL} is:

\frac{\partial I_{ub\,KL}}{\partial \phi_{m,c}} = -\ln \frac{S_{m,c}}{\phi_{m,c}} - \frac{\phi_{m,c}}{S_{m,c}} - \sum_{c' \neq c} \frac{\phi_{m,c'} E_{m,c'c}}{S_{m,c'}} + 1.    (23)

The second derivatives are:

H_{cc} = \frac{\partial^2 I_{ub\,KL}}{(\partial \phi_{m,c})^2} = \frac{(S_{m,c} - \phi_{m,c})^2}{(S_{m,c})^2 \phi_{m,c}} + \sum_{c' \neq c} \frac{\phi_{m,c'} (E_{m,c'c})^2}{(S_{m,c'})^2}    (24)

for the diagonal terms and

H_{cc'} = \frac{\partial^2 I_{ub\,KL}}{\partial \phi_{m,c}\, \partial \phi_{m,c'}} = \frac{\phi_{m,c} - S_{m,c}}{(S_{m,c})^2} E_{m,cc'} + \frac{\phi_{m,c'} - S_{m,c'}}{(S_{m,c'})^2} E_{m,c'c},    (25)

for c' ≠ c. For any given vector θ of length Π,

\theta^T H \theta = \sum_c \left[ \theta_c^2 H_{cc} + \sum_{c' \neq c} \theta_c \theta_{c'} H_{cc'} \right]
= \sum_c \left[ \theta_c^2 \frac{(S_{m,c} - \phi_{m,c})^2}{(S_{m,c})^2 \phi_{m,c}} + \sum_{c' \neq c} \theta_{c'}^2 \frac{\phi_{m,c} (E_{m,cc'})^2}{(S_{m,c})^2} + \sum_{c' \neq c} 2 \theta_c \theta_{c'} E_{m,cc'} \frac{\phi_{m,c} - S_{m,c}}{(S_{m,c})^2} \right]
= \sum_c \left[ \frac{(S_{m,c} - \phi_{m,c}) \theta_c}{S_{m,c} \sqrt{\phi_{m,c}}} - \sum_{c' \neq c} \frac{\sqrt{\phi_{m,c}}\, E_{m,cc'}\, \theta_{c'}}{S_{m,c}} \right]^2 \geq 0.    (26)

Therefore, I_{ub KL} is convex when the φ_{m,c} are considered as the variables.
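As an illustration of Eqs. (21) and (23), the following sketch (ours) evaluates I_ub KL and its gradient for given variational parameters; `phi[m, c]` and `K[m, c, c']` = KL(pr_{m,c}||pr_{m,c'}) are assumed inputs, and the gradient could be handed to any generic constrained optimizer to carry out the iterative minimization described in Section II-D.

```python
import numpy as np

def I_ub_KL(phi, K):
    """Variational upper bound of Eq. (21).

    phi: (M, Pi) variational parameters, assumed strictly positive here
         (the paper's convention ln 0 = 0 covers zero entries)
    K:   (M, Pi, Pi) with K[m, c, cp] = KL(pr_{m,c} || pr_{m,c'}), K[m, c, c] = 0
    """
    P_c = phi.sum(axis=0)                             # class priors P_c = sum_m phi_{m,c}
    H_C = -np.sum(P_c * np.log(P_c))
    # S[m, c] = sum_{c'} phi[m, c'] exp(-K[m, c, c'])
    S = np.einsum("mcd,md->mc", np.exp(-K), phi)
    return H_C - np.sum(phi * np.log(S / phi))

def grad_I_ub_KL(phi, K):
    """First derivative of Eq. (23) with respect to phi[m, c]."""
    E = np.exp(-K)                                    # E[m, c, cp] = exp(-KL(pr_{m,c}, pr_{m,c'}))
    S = np.einsum("mcd,md->mc", E, phi)
    # cross[m, c] = sum_{c' != c} phi[m, c'] E[m, c', c] / S[m, c'];
    # subtract the c' = c term, for which E[m, c, c] = 1 (assuming K[m, c, c] = 0).
    cross = np.einsum("md,mdc->mc", phi / S, E) - phi / S
    return -np.log(S / phi) - phi / S - cross + 1.0
```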

APPENDIX B
ON THE NUMERICAL SIMULATIONS

This appendix provides more simulation results and the detailed expressions used in the numerical simulations. The additional results consider two different distributions of the component-center locations. We further present the difference between the estimated and the true mutual information. Last but not least, the closed-form expressions include the KL and Cα divergences between Gaussian components, the bounds on I(x; C) derived from entropy bounds, and the relation between the Shannon mutual information and bounds on the binary classification error (Pe).

A. Numerical simulation results

We consider three scenarios: (1) the component centers are uniformly distributed along the class boundary, (2) the component centers are bunched into one group, and (3) the component centers are bunched into several groups. Results for the first scenario have been presented in Section III. In this section, we report on Scenarios 2 and 3. Illustrations of the two scenarios are shown in Figures 2(a) and 3(a), respectively. The results for these two scenarios are similar to those of Scenario 1. To further demonstrate that our estimators are good surrogates for the true mutual information, we plot the difference between the three estimates and the true mutual information calculated from MC sampling in Figure 4.

Fig. 2: Scenario 2, where the centers of the components are bunched into one group: illustration (a), the mixture distribution pr(x|c = 2) when σ = 0.5 (insert), and estimates of I(x; C) (b).

B. Closed form expressions for Gaussian mixtures

Gaussian functions are often used as components in mixture distributions and have closed-form expressions for the pair-wise KL and Cα divergences. Denoting the difference in the means of two components as μ_ij = μ_i − μ_j and Σ_{α,ij} = (1−α)Σ_i + αΣ_j, the Cα divergence between two Gaussian components is

C_\alpha(pr_i || pr_j) = \frac{\alpha(1-\alpha)}{2} \mu_{ij}^T \Sigma_{\alpha,ij}^{-1} \mu_{ij} + \frac{1}{2} \ln \frac{|\Sigma_{\alpha,ij}|}{|\Sigma_i|^{1-\alpha} |\Sigma_j|^{\alpha}},    (27)

where |·| is the determinant. The KL divergence between the same two Gaussian components is

KL(pr_i || pr_j) = \frac{1}{2} \mu_{ij}^T \Sigma_j^{-1} \mu_{ij} + \frac{1}{2} \ln \frac{|\Sigma_j|}{|\Sigma_i|} + \frac{tr(\Sigma_j^{-1} \Sigma_i) - d}{2},    (28)

where tr(·) is the trace of the matrix in the parentheses and d is the dimension of the data.

When all mixture components have equal covariance matrices Σ_i = Σ_j = Σ, we can denote λ_ij = μ_ij^T Σ^{-1} μ_ij and have

C_\alpha(pr_i || pr_j) = \alpha(1-\alpha)\lambda_{ij}/2,
KL(pr_i || pr_j) = \lambda_{ij}/2.    (29)

With these expressions, the bounds and estimates of I(x; C) have simple forms.

C. Expressions of I_{lb 2H} and I_{ub 2H}

The detailed expressions for the mutual information bounds derived from entropy bounds are:

I_{lb\,2H} = H(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j e^{-C_\alpha(pr_i||pr_j)}}{\sum_{k \in \{C_i\}} w_k e^{-KL(pr_i||pr_k)}},
I_{ub\,2H} = H(C) - \sum_{i=1}^{N} w_i \ln \frac{\sum_{j=1}^{N} w_j e^{-KL(pr_i||pr_j)}}{\sum_{k \in \{C_i\}} w_k e^{-C_\alpha(pr_i||pr_k)}}.    (30)

D. Bounds on Pe for binary classification

Bounds on I(x; C) can be used to calculate bounds on Pe. Fano's inequality [1] provides a lower bound on Pe for binary classification, as follows:

P_e \geq h_b^{-1}[H(C) - I(x; C)],    (31)

where h_b(x) = -x \log_2(x) - (1-x)\log_2(1-x) is the binary entropy function and h_b^{-1}(·) is its inverse. More specifically, one can calculate this bound by setting the binary entropy function equal to the value H(C) − I(x; C) and solving for x.
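A minimal sketch (ours) of evaluating the Fano lower bound of Eq. (31) by numerically inverting the binary entropy function; it assumes H(C) and I(x; C) are supplied in bits, matching the log2 convention of Eq. (31).

```python
import numpy as np
from scipy.optimize import brentq

def h_b(x):
    """Binary entropy in bits."""
    return -x * np.log2(x) - (1.0 - x) * np.log2(1.0 - x)

def fano_pe_lower_bound(H_C, I):
    """Lower bound on P_e from Eq. (31): the root of h_b(x) = H(C) - I(x;C) on (0, 1/2]."""
    target = H_C - I          # in bits; for binary classes H(C) <= 1 bit, so target <= 1
    if target <= 0.0:
        return 0.0
    # h_b is increasing on (0, 1/2], so the root is bracketed by (~0, 1/2).
    return brentq(lambda x: h_b(x) - target, 1e-12, 0.5)
```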

Fig. 3: Scenario 3, where the centers of the components are bunched into multiple groups: illustration (a), the mixture distribution pr(x|c = 2) when σ = 0.5 (insert), and estimators of I(x; C) (b).


Fig. 4: Î(x; C) − I(x; C) for (a) Scenario 1, (b) Scenario 2 and (c) Scenario 3. The I(x; C) is calculated from MC simulations.

A tight upper bound on the binary classification error Pe has been reported recently [3]:

P_e \leq \min\{ P_{min},\, f^{-1}[H(C) - I(x; C)] \} := \hat{P}_{e\,ub},    (32)

where P_{min} = \min\{P_1, P_2\} and f(x) is a function defined by

f(x) = -P_{min} \log_2 \frac{P_{min}}{x + P_{min}} - x \log_2 \frac{x}{x + P_{min}},    (33)

and f^{-1}(·) is the inverse function of f(·).

When P_e \ll 1, we have -P_e(\log P_e - \log P_{min}) \lesssim H(C) - I(x; C) \lesssim -P_e \log P_e. Therefore, Pe is of the same order of magnitude as H(C) − I(x; C) when Pe ≪ 1.

E. Pe estimates

When σ is small, all estimates of I(x; C) converge to 1. To compare the estimates for 0.1 > σ > 0.01, we calculate and present an estimate of Pe in this section. This estimate of Pe is the upper bound on Pe presented in the previous section. When a lower bound on I(x; C) is used in the calculation (blue solid line and green dashed line), the estimates of Pe are upper bounds on Pe. The other lines in the Pe plots are neither upper nor lower bounds. The black lines in the plots, which are the estimates of Pe calculated from I_MC, are also not the true Pe but can serve as a benchmark for an upper bound on Pe. In Figure 5, the blue and dark red solid lines are significantly closer to the grey line than the green and orange dashed lines. The results demonstrate that our new bounds are tighter than the bounds derived from entropy bounds. Furthermore, the three estimators (blue, black and yellow dashed lines) are good surrogates for the true mutual information.
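For completeness, here is a sketch (ours) of the Pe upper bound of Eqs. (32)–(33), obtained by numerically inverting f; as above, H(C) and I(x; C) are assumed to be given in bits.

```python
import numpy as np
from scipy.optimize import brentq

def pe_upper_bound(H_C, I, P_min):
    """Upper bound of Eq. (32): min(P_min, f^{-1}[H(C) - I(x;C)]) with f from Eq. (33)."""
    def f(x):
        return (-P_min * np.log2(P_min / (x + P_min))
                - x * np.log2(x / (x + P_min)))

    target = H_C - I                    # in bits, matching the log2 of Eq. (33)
    if target <= 0.0:
        return 0.0
    # f is increasing on (0, P_min]; if the target exceeds f(P_min), the min() picks P_min.
    if target >= f(P_min):
        return P_min
    return brentq(lambda x: f(x) - target, 1e-12, P_min)
```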
Fig. 5: Estimates of Pe calculated from a number of estimators of I(x; C).

REFERENCES

[1] R. M. Fano and D. Hawkins, "Transmission of information: A statistical theory of communications," American Journal of Physics, vol. 29, pp. 793–794, 1961.
[2] V. Kovalevskij, "The problem of character recognition from the point of view of mathematical statistics," 1967.
[3] B.-G. Hu and H.-J. Xing, "An optimization approach of deriving bounds between entropy and error from joint distribution: Case study for binary classifications," Entropy, vol. 18, no. 2, p. 59, 2016.
[4] J. R. Vergara and P. A. Estévez, "A review of feature selection methods based on mutual information," Neural Computing and Applications, vol. 24, no. 1, pp. 175–186, 2014.
[5] R. Battiti, "Using mutual information for selecting features in supervised neural net learning," IEEE Transactions on Neural Networks, vol. 5, no. 4, pp. 537–550, 1994.
[6] N. Tishby, F. C. Pereira, and W. Bialek, "The information bottleneck method," arXiv preprint physics/0004057, 2000.
[7] M. A. Neifeld, A. Ashok, and P. K. Baheti, "Task-specific information for imaging system analysis," JOSA A, vol. 24, no. 12, pp. B25–B41, 2007.
[8] J. R. Hershey and P. A. Olsen, "Approximating the Kullback Leibler divergence between Gaussian mixture models," in 2007 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP'07), vol. 4. IEEE, 2007, pp. IV-317.
[9] J. Goldberger, S. Gordon, and H. Greenspan, "An efficient image similarity measure based on approximations of KL-divergence between two Gaussian mixtures," in Proceedings Ninth IEEE International Conference on Computer Vision. IEEE, 2003, p. 487.
[10] Y. Ding and A. Ashok, "X-ray measurement model incorporating energy-correlated material variability and its application in information-theoretic system analysis," arXiv preprint arXiv:2002.11046, 2020.
[11] J. M. Duarte-Carvajalino, G. Yu, L. Carin, and G. Sapiro, "Task-driven adaptive statistical compressive sensing of Gaussian mixture models," IEEE Transactions on Signal Processing, vol. 61, no. 3, pp. 585–600, 2012.
[12] B. Noack, M. Reinhardt, and U. D. Hanebeck, "On nonlinear track-to-track fusion with Gaussian mixtures," in 17th International Conference on Information Fusion (FUSION). IEEE, 2014, pp. 1–8.
[13] J. Goldberger and S. T. Roweis, "Hierarchical clustering of a mixture model," in Advances in Neural Information Processing Systems, 2005, pp. 505–512.
[14] C. Nair, B. Prabhakar, and D. Shah, "On entropy for mixtures of discrete and continuous variables," arXiv preprint cs/0607075, 2006.
[15] A. Beknazaryan, X. Dang, and H. Sang, "On mutual information estimation for mixed-pair random variables," Statistics & Probability Letters, vol. 148, pp. 9–16, 2019.
[16] L. Kozachenko and N. N. Leonenko, "Sample estimate of the entropy of a random vector," Problemy Peredachi Informatsii, vol. 23, no. 2, pp. 9–16, 1987.
[17] I. Ahmad and P.-E. Lin, "A nonparametric estimation of the entropy for absolutely continuous distributions (corresp.)," IEEE Transactions on Information Theory, vol. 22, no. 3, pp. 372–375, 1976.
[18] B. Laurent et al., "Efficient estimation of integral functionals of a density," The Annals of Statistics, vol. 24, no. 2, pp. 659–681, 1996.
[19] G. P. Basharin, "On a statistical estimate for the entropy of a sequence of independent random variables," Theory of Probability & Its Applications, vol. 4, no. 3, pp. 333–336, 1959.
[20] B. C. Ross, "Mutual information between discrete and continuous data sets," PLoS ONE, vol. 9, no. 2, p. e87357, 2014.
[21] K. R. Moon, K. Sricharan, and A. O. Hero, "Ensemble estimation of mutual information," in 2017 IEEE International Symposium on Information Theory (ISIT). IEEE, 2017, pp. 3030–3034.
[22] W. Gao, S. Kannan, S. Oh, and P. Viswanath, "Estimating mutual information for discrete-continuous mixtures," in Advances in Neural Information Processing Systems, 2017, pp. 5986–5997.
[23] G. A. Darbellay and I. Vajda, "Estimation of the information by an adaptive partitioning of the observation space," IEEE Transactions on Information Theory, vol. 45, no. 4, pp. 1315–1321, 1999.
[24] R. Moddemeijer, "On estimation of entropy and mutual information of continuous distributions," Signal Processing, vol. 16, no. 3, pp. 233–248, 1989.
[25] A. M. Fraser and H. L. Swinney, "Independent coordinates for strange attractors from mutual information," Physical Review A, vol. 33, no. 2, p. 1134, 1986.
[26] Y.-I. Moon, B. Rajagopalan, and U. Lall, "Estimation of mutual information using kernel density estimators," Physical Review E, vol. 52, no. 3, p. 2318, 1995.
[27] K. Kandasamy, A. Krishnamurthy, B. Poczos, L. Wasserman et al., "Nonparametric von Mises estimators for entropies, divergences and mutual informations," Advances in Neural Information Processing Systems, vol. 28, pp. 397–405, 2015.
[28] A. Kraskov, H. Stögbauer, and P. Grassberger, "Estimating mutual information," Physical Review E, vol. 69, no. 6, p. 066138, 2004.
[29] S. Singh and B. Póczos, "Finite-sample analysis of fixed-k nearest neighbor density functional estimators," Advances in Neural Information Processing Systems, vol. 29, pp. 1217–1225, 2016.
[30] M. M. V. Hulle, "Edgeworth approximation of multivariate differential entropy," Neural Computation, vol. 17, no. 9, pp. 1903–1910, 2005.
[31] S. Gao, G. Ver Steeg, and A. Galstyan, "Efficient estimation of mutual information for strongly dependent variables," in Artificial Intelligence and Statistics, 2015, pp. 277–286.
[32] J. Jiao, K. Venkat, Y. Han, and T. Weissman, "Minimax estimation of functionals of discrete distributions," IEEE Transactions on Information Theory, vol. 61, no. 5, pp. 2835–2885, 2015.
[33] M. I. Belghazi, A. Baratin, S. Rajeshwar, S. Ozair, Y. Bengio, A. Courville, and D. Hjelm, "Mutual information neural estimation," in International Conference on Machine Learning, 2018, pp. 531–540.
[34] B. Poole, S. Ozair, A. v. d. Oord, A. A. Alemi, and G. Tucker, "On variational bounds of mutual information," arXiv preprint arXiv:1905.06922, 2019.
[35] J. V. Michalowicz, J. M. Nichols, and F. Bucholtz, Handbook of Differential Entropy. CRC Press, 2013.
[36] F. Nielsen and R. Nock, "Entropies and cross-entropies of exponential families," in 2010 IEEE International Conference on Image Processing. IEEE, 2010, pp. 3621–3624.
[37] M. A. Carreira-Perpinan, "Mode-finding for mixtures of Gaussian distributions," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 22, no. 11, pp. 1318–1323, 2000.
[38] J. V. Michalowicz, J. M. Nichols, and F. Bucholtz, "Calculation of differential entropy for a mixed Gaussian distribution," Entropy, vol. 10, no. 3, pp. 200–206, 2008.
[39] O. Zobay et al., "Variational Bayesian inference with Gaussian-mixture approximations," Electronic Journal of Statistics, vol. 8, no. 1, pp. 355–389, 2014.
[40] J.-Y. Chen, J. R. Hershey, P. A. Olsen, and E. Yashchin, "Accelerated Monte Carlo for Kullback-Leibler divergence between Gaussian mixture models," in 2008 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2008, pp. 4553–4556.
[41] H. Joe, "Estimation of entropy and other functionals of a multivariate density," Annals of the Institute of Statistical Mathematics, vol. 41, no. 4, pp. 683–697, 1989.
[42] F. Nielsen and R. Nock, "MaxEnt upper bounds for the differential entropy of univariate continuous distributions," IEEE Signal Processing Letters, vol. 24, no. 4, pp. 402–406, 2017.
[43] M. F. Huber, T. Bailey, H. Durrant-Whyte, and U. D. Hanebeck, "On entropy approximation for Gaussian mixture random vectors," in 2008 IEEE International Conference on Multisensor Fusion and Integration for Intelligent Systems. IEEE, 2008, pp. 181–188.
[44] S. J. Julier and J. K. Uhlmann, "A general method for approximating nonlinear transformations of probability distributions," Technical report, Robotics Research Group, Department of Engineering Science, 1996.
[45] A. Kolchinsky and B. D. Tracey, "Estimating mixture entropy with pairwise distances," Entropy, vol. 19, no. 7, p. 361, 2017.
[46] P. Hall and S. C. Morton, "On the estimation of entropy," Annals of the Institute of Statistical Mathematics, vol. 45, no. 1, pp. 69–88, 1993.
[47] T. Jebara and R. Kondor, "Bhattacharyya and expected likelihood kernels," in Learning Theory and Kernel Machines. Springer, 2003, pp. 57–71.
[48] T. Jebara, R. Kondor, and A. Howard, "Probability product kernels," Journal of Machine Learning Research, vol. 5, no. Jul, pp. 819–844, 2004.
[49] J. E. Contreras-Reyes and D. D. Cortés, "Bounds on Rényi and Shannon entropies for finite mixtures of multivariate skew-normal distributions: Application to swordfish (Xiphias gladius Linnaeus)," Entropy, vol. 18, no. 11, p. 382, 2016.
[50] H. Chernoff et al., "A measure of asymptotic efficiency for tests of a hypothesis based on the sum of observations," The Annals of Mathematical Statistics, vol. 23, no. 4, pp. 493–507, 1952.
[51] D. Haussler, M. Opper et al., "Mutual information, metric entropy and cumulative relative entropy risk," The Annals of Statistics, vol. 25, no. 6, pp. 2451–2492, 1997.
[52] T. M. Cover and J. A. Thomas, Elements of Information Theory. John Wiley & Sons, 2012.
[53] M. N. Do, "Fast approximation of Kullback-Leibler distance for dependence trees and hidden Markov models," IEEE Signal Processing Letters, vol. 10, no. 4, pp. 115–118, 2003.
[54] J. Munkres, "Algorithms for the assignment and transportation problems," Journal of the Society for Industrial and Applied Mathematics, vol. 5, no. 1, pp. 32–38, 1957.
[55] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, vol. 2, no. 1-2, pp. 83–97, 1955.