CCMI : Classifier based Conditional Mutual Information Estimation

Sudipto Mukherjee, Himanshu Asnani, Sreeram Kannan
Electrical and Computer Engineering, University of Washington, Seattle, WA
{sudipm, asnani, ksreeram}@uw.edu

Abstract

Conditional Mutual Information (CMI) is a measure of conditional dependence between random variables X and Y, given another random variable Z. It can be used to quantify conditional dependence among variables in many data-driven inference problems such as graphical models, causal learning, feature selection and time-series analysis. While k-nearest neighbor (kNN) based estimators as well as kernel-based methods have been widely used for CMI estimation, they suffer severely from the curse of dimensionality. In this paper, we leverage advances in classifiers and generative models to design methods for CMI estimation. Specifically, we introduce an estimator for KL-Divergence based on the likelihood ratio, obtained by training a classifier to distinguish the observed joint distribution from the product distribution. We then show how to construct several CMI estimators using this basic divergence estimator by drawing ideas from conditional generative models. We demonstrate that the estimates from our proposed approaches do not degrade in performance with increasing dimension and obtain significant improvement over the widely used KSG estimator. Finally, as an application of accurate CMI estimation, we use our best estimator for conditional independence testing and achieve superior performance than the state-of-the-art tester on both simulated and real data-sets.

1 INTRODUCTION

Conditional mutual information (CMI) is a fundamental information theoretic quantity that extends the nice properties of mutual information (MI) to conditional settings. For three continuous random variables X, Y and Z, the conditional mutual information is defined as

$$I(X;Y|Z) = \iiint p(x,y,z) \log \frac{p(x,y,z)}{p(x,z)\,p(y|z)} \, dx\, dy\, dz,$$

assuming that the distributions admit the respective densities p(·). One of the striking features of MI and CMI is that they can capture non-linear dependencies between the variables. In scenarios where the Pearson correlation is zero even though the two random variables are dependent, mutual information can recover the truth. Likewise, in the sense of conditional independence for the case of three random variables X, Y and Z, conditional mutual information provides a strong guarantee: X ⊥ Y | Z ⟺ I(X;Y|Z) = 0.

The conditional setting is even more interesting, as the dependence between X and Y can change based on how they are connected to the conditioning variable. For instance, consider a simple Markov chain X → Z → Y. Here, X ⊥ Y | Z. But the slightly different relation X → Z ← Y has X ⊥̸ Y | Z, even though X and Y may be independent as a pair. It is a well known fact in Bayesian networks that a node is independent of its non-descendants given its parents. CMI goes beyond stating whether the pair (X, Y) is conditionally dependent or not; it also provides a quantitative strength of the dependence.
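As a quick numerical illustration of the defining property above (not part of the paper), the following minimal NumPy sketch computes the plug-in CMI of a small discrete Markov chain X → Z → Y and of the collider X → Z ← Y; the helper name `cmi_discrete` and the noise level are our own choices.

```python
import numpy as np

def cmi_discrete(p):
    """Plug-in I(X;Y|Z) for a discrete joint table p[x, y, z] (entries sum to 1)."""
    p_z = p.sum(axis=(0, 1))
    p_xz = p.sum(axis=1)
    p_yz = p.sum(axis=0)
    cmi = 0.0
    for x, y, z in np.ndindex(p.shape):
        if p[x, y, z] > 0:
            cmi += p[x, y, z] * np.log(p[x, y, z] * p_z[z] / (p_xz[x, z] * p_yz[y, z]))
    return cmi

eps = 0.1
chain = np.zeros((2, 2, 2))      # X -> Z -> Y : Z is a noisy copy of X, Y of Z
collider = np.zeros((2, 2, 2))   # X -> Z <- Y : Z is a noisy XOR of X and Y
for x, y, z in np.ndindex(2, 2, 2):
    chain[x, y, z] = 0.5 * ((1 - eps) if z == x else eps) * ((1 - eps) if y == z else eps)
    collider[x, y, z] = 0.25 * ((1 - eps) if z == (x ^ y) else eps)

print(cmi_discrete(chain))      # ~0.0 : X is independent of Y given Z
print(cmi_discrete(collider))   # > 0  : conditioning on the collider creates dependence
```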
1.1 PRIOR ART

The literature is replete with works aimed at applying CMI for data-driven knowledge discovery. Fleuret (2004) used CMI for fast binary feature selection to improve classification accuracy. Loeckx et al. (2010) improved non-rigid image registration by using CMI as a similarity measure instead of global mutual information. CMI has been used to infer gene-regulatory networks (Liang and Wang 2008) or protein modulation (Giorgi et al. 2014) from gene expression data. Causal discovery (Li et al. 2011; Hlinka et al. 2013; Vejmelka and Paluš 2008) is yet another application area of CMI estimation.

Despite its wide-spread use, estimation of conditional mutual information remains a challenge. One naive method is to estimate the joint and conditional densities from data and plug them into the expression for CMI. But density estimation is not sample efficient and is often more difficult than estimating the quantity of interest directly. The most widely used technique expresses CMI in terms of an appropriate arithmetic of differential entropy estimators (referred to here as the ΣH estimator):

$$I(X;Y|Z) = h(X,Z) + h(Y,Z) - h(Z) - h(X,Y,Z),$$

where $h(X) = -\int_{\mathcal{X}} p(x)\log p(x)\,dx$ is the differential entropy.

The differential entropy estimation problem has been studied extensively by Beirlant et al. (1997); Nemenman et al. (2002); Miller (2003); Lee (2010); Leśniewicz (2014); Sricharan et al. (2012); Singh and Póczos (2014), and the differential entropy can be estimated either from kernel-density estimates (Kandasamy et al. 2015; Gao et al. 2016) or from k-nearest-neighbor estimates (Sricharan et al. 2013; Jiao et al. 2018; Pál et al. 2010; Kozachenko and Leonenko 1987; Singh et al. 2003; Singh and Póczos 2016). Building on top of k-nearest-neighbor estimates and breaking the paradigm of ΣH estimation, a coupled estimator (which we address henceforth as KSG) was proposed by Kraskov et al. (2004). It generalizes to mutual information, conditional mutual information, as well as to other multivariate information measures, including estimation in scenarios where the distribution can be mixed (Runge 2018; Frenzel and Pompe 2007; Gao et al. 2017, 2018; Vejmelka and Paluš 2008; Rahimzamani et al. 2018).

The kNN approach has the advantage that it can naturally adapt to the data density and does not require extensive tuning of kernel band-widths. However, all these approaches suffer from the curse of dimensionality and are unable to scale well with dimension. Moreover, Gao et al. (2015) showed that exponentially many samples are required (as MI grows) for accurate estimation using kNN based estimators. This brings us to the central motivation of this work: can we propose estimators for conditional mutual information that estimate well even in high dimensions?

1.2 OUR CONTRIBUTIONS

In this paper, we explore various ways of estimating CMI by leveraging tools from classifiers and generative models. To the best of our knowledge, this is the first work that deviates from the framework of kNN and kernel based CMI estimation and introduces neural networks to solve this problem. The main contributions of the paper can be summarized as follows.

Classifier Based MI Estimation: We propose a novel KL-divergence estimator based on the classifier two-sample approach that is more stable than, and performs superior to, recent neural methods (Belghazi et al. 2018).

Divergence Based CMI Estimation: We express CMI as the KL-divergence between the two distributions p_xyz = p(z)p(x|z)p(y|x,z) and q_xyz = p(z)p(x|z)p(y|z), and explore candidate generators for obtaining samples from q(·). The CMI estimate is then obtained from the divergence estimator.

Difference Based CMI Estimation: Using the improved MI estimates and the difference relation I(X;Y|Z) = I(X;Y,Z) − I(X;Z), we show that estimating CMI as a difference of two MI estimates performs best among the methods proposed in this paper, such as divergence based CMI estimation, as well as KSG.

Improved Performance in High Dimensions: On both linear and non-linear data-sets, all our estimators perform significantly better than KSG. Surprisingly, our estimators perform well even for dimensions as high as 100, while KSG fails to obtain reasonable estimates even beyond 5 dimensions.

Improved Performance in Conditional Independence Testing: As an application of CMI estimation, we use our best estimator for conditional independence testing (CIT) and obtain improved performance compared to the state-of-the-art CIT tester on both synthetic and real data-sets.

2 ESTIMATION OF CONDITIONAL MUTUAL INFORMATION

The CMI estimation problem from finite samples can be stated as follows. Consider three random variables X, Y, Z ∼ p(x, y, z), where p(x, y, z) is the joint distribution. Let the dimensions of the random variables be d_x, d_y and d_z respectively. We are given n samples {(x_i, y_i, z_i)}_{i=1}^{n} drawn i.i.d. from p(x, y, z), so x_i ∈ R^{d_x}, y_i ∈ R^{d_y} and z_i ∈ R^{d_z}. The goal is to estimate I(X;Y|Z) from these n samples.

2.1 DIVERGENCE BASED CMI ESTIMATION

Definition 1. The Kullback-Leibler (KL) divergence between two distributions p(·) and q(·) is given as

$$D_{KL}(p\,\|\,q) = \int p(x) \log \frac{p(x)}{q(x)}\, dx.$$

Definition 2. Conditional Mutual Information (CMI) can be expressed as a KL-divergence between the two distributions p(x, y, z) and q(x, y, z) = p(x, z)p(y|z), i.e.,

$$I(X;Y|Z) = D_{KL}\big(p(x,y,z)\,\|\,p(x,z)p(y|z)\big).$$

The definition of CMI as a KL-divergence naturally leads to the question: can we estimate CMI using an estimator for divergence? The problem is still non-trivial, since we are only given samples from p(x, y, z), while the divergence estimator would also require samples from p(x, z)p(y|z). This further boils down to whether we can learn the distribution p(y|z).

2.1.1 Generative Models

We now explore various techniques to learn the conditional distribution p(y|z) given samples from p(x, y, z). This problem is fundamentally different from drawing independent samples from the marginals p(x) and p(y) given the joint p(x, y). In that simpler setting, we can simply permute the data to obtain {x_i, y_{π(i)}}_{i=1}^{n} (π denotes a permutation with π(i) ≠ i), which emulates samples drawn from q(x, y) = p(x)p(y). But such a permutation scheme does not work for p(x, y, z), since it would destroy the dependence between X and Z. The problem is solved using recent advances in generative models, which aim to learn an unknown underlying distribution from samples.

Conditional Generative Adversarial Network (CGAN): There exist extensions of the basic GAN framework (Goodfellow et al. 2014) to conditional settings, such as CGAN (Mirza and Osindero 2014). Once trained, the CGAN can generate samples from the generator network as y = G(s, z), with s ∼ p(s), z ∼ p(z).

Conditional Variational Autoencoder (CVAE): Similar to CGAN in the conditional setting, CVAE (Kingma and Welling 2014; Sohn et al. 2015) aims to maximize the conditional log-likelihood. The input to the decoder network is the value of z and a latent vector s sampled from a standard Gaussian. The decoder Q gives the conditional mean and conditional variance (parametric functions of s and z), from which y is then sampled.

kNN based permutation: A simpler algorithm for generating samples from the conditional p(y|z) is to permute data values for which z_i ≈ z_j. Such methods are popular in the conditional independence testing literature (Sen et al. 2017; Doran et al. 2014). For a given point {x_i, y_i, z_i}, we find the k-nearest neighbor of z_i, say z_j, with corresponding data point {x_j, y_j, z_j}. Then {x_i, y_j, z_i} is a sample from q(x, y, z).
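The kNN-based permutation scheme is simple enough to sketch directly. The following is our own minimal illustration (not the authors' code), assuming scikit-learn's NearestNeighbors; the function name `knn_permute` and the default k = 1 are ours.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_permute(x, y, z, k=1, seed=0):
    """Emulate samples from q(x, y, z) = p(x, z) p(y|z): each point keeps its own
    (x_i, z_i) but takes the y-value of a point whose z is among the k nearest to z_i."""
    rng = np.random.default_rng(seed)
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(z)     # z has shape (n, d_z)
    _, idx = nbrs.kneighbors(z)                           # idx[:, 0] is the point itself
    pick = rng.integers(1, k + 1, size=len(z))            # choose one of the k neighbors
    j = idx[np.arange(len(z)), pick]
    return x, y[j], z                                     # {x_i, y_j, z_i}
```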
Now that we have outlined multiple techniques for estimating p(y|z), we next proceed to the problem of estimating the KL-divergence.

2.1.2 Divergence Estimation

Recently, Belghazi et al. (2018) proposed a neural network based estimator of mutual information (MINE) by utilizing lower bounds on the KL-divergence. Since MI is a special case of KL-divergence, their neural estimator can be extended to divergence estimation as well. The estimator can be trained using back-propagation and was shown to out-perform traditional methods for MI estimation. The core idea of MINE is cradled in a dual representation of the KL-divergence. The two main lower bounds used by MINE are stated below.

Definition 3. The Donsker-Varadhan representation expresses the KL-divergence as a supremum over functions,

$$D_{KL}(p\,\|\,q) = \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x\sim p}[f(x)] - \log\big(\mathbb{E}_{x\sim q}[\exp(f(x))]\big), \tag{1}$$

where the function class $\mathcal{F}$ includes those functions that lead to finite values of the expectations.

Definition 4. The f-divergence bound gives a lower bound on the KL-divergence:

$$D_{KL}(p\,\|\,q) \ge \sup_{f \in \mathcal{F}} \; \mathbb{E}_{x\sim p}[f(x)] - \mathbb{E}_{x\sim q}[\exp(f(x)-1)]. \tag{2}$$

MINE uses a neural network f_θ to represent the function class $\mathcal{F}$ and uses gradient descent to maximize the right-hand side of the above bounds.

Even though this framework is flexible and straightforward to apply, it presents several practical limitations. The estimation is very sensitive to the choice of hyper-parameters (hidden units/layers) and training steps (batch size, learning rate). We found the optimization process to be unstable and to diverge in high dimensions (Section 4, Experimental Results). Our findings resonate with those of Poole et al. (2019), in which the authors found such networks difficult to tune even in toy problems.

2.2 DIFFERENCE BASED CMI ESTIMATION

Another seemingly simple approach to estimating CMI is to express it as a difference of two mutual information terms by invoking the chain rule: I(X;Y|Z) = I(X;Y,Z) − I(X;Z). As stated before, since mutual information is a special case of KL-divergence, viz. I(X;Y) = D_KL(p(x,y)||p(x)p(y)), this again calls for a stable, scalable, sample-efficient KL-divergence estimator, which we present in the next Section.

3 CLASSIFIER BASED MI ESTIMATION

In their seminal work on independence testing, Lopez-Paz and Oquab (2017) introduced the classifier two-sample test to distinguish between samples coming from two unknown distributions p and q. The idea was also adopted for conditional independence testing by Sen et al. (2017). The basic principle is to train a binary classifier by labeling samples x ∼ p as 1 and those coming from x ∼ q as 0, and to test the null hypothesis H_0 : p = q. Under the null, the accuracy of the binary classifier will be close to 0.5; it will be away from 0.5 under the alternative. The accuracy of the binary classifier can then be carefully used to define P-values for the test.

We propose to use the classifier two-sample principle for estimating the likelihood ratio p(x,y)/(p(x)p(y)). While the existing literature has instances of using the likelihood ratio for MI estimation, the algorithms to estimate the likelihood ratio are quite different from ours. Both Suzuki et al. (2008) and Nguyen et al. (2008) formulate likelihood ratio estimation as a convex relaxation by leveraging Legendre-Fenchel duality. But the performance of these methods depends on the choice of suitable kernels, and they would suffer from the same disadvantages as mentioned in the Introduction.

3.1 PROBLEM FORMULATION

Given n i.i.d. samples {x_i^p}_{i=1}^{n}, x_i^p ∼ p(x), and m i.i.d. samples {x_j^q}_{j=1}^{m}, x_j^q ∼ q(x), we want to estimate D_KL(p||q). We label the points drawn from p(·) as y = 1 and those from q(·) as y = 0. A binary classifier is then trained on this supervised classification task. Let the prediction of the classifier for a point l be γ_l, where γ_l = Pr(y = 1 | x_l) (Pr denotes probability). Then the point-wise likelihood ratio for data point l is given by L(x_l) = γ_l / (1 − γ_l).

The following Proposition is elementary and has already been observed in Belghazi et al. (2018) (proof of Theorem 4). We restate it here for completeness and quick reference.

Proposition 1. The optimal function in the Donsker-Varadhan representation (1) is the one that computes the point-wise log-likelihood ratio, i.e., f*(x) = log(p(x)/q(x)) ∀ x (assuming p(x) = 0 wherever q(x) = 0).

Based on Proposition 1, the next step is to substitute the estimates of the point-wise likelihood ratio into (1) to obtain an estimate of the KL-divergence:

$$\hat{D}_{KL}(p\,\|\,q) = \frac{1}{n}\sum_{i=1}^{n} \log L(x_i^p) \;-\; \log\left(\frac{1}{m}\sum_{j=1}^{m} L(x_j^q)\right). \tag{3}$$

We obtain an estimate of mutual information from (3) as Î_n(X;Y) = D̂_KL(p(x,y)||p(x)p(y)). This classifier-based estimator for MI (Classifier-MI) has the following theoretical properties under Assumptions (A1)-(A4) (stated in the Supplementary).

Theorem 1. Under Assumptions (A1)-(A4), Classifier-MI is consistent, i.e., given ε, δ > 0, there exists n ∈ ℕ such that with probability at least 1 − δ we have

$$|\hat{I}_n(X;Y) - I(X;Y)| \le \epsilon.$$

Proof. Here we provide a sketch of the proof. The classifier is trained to minimize the binary cross entropy (BCE) loss on the train set and obtains the minimizer θ̂. From the generalization bound of the classifier, the loss value on the test set from θ̂ is close to the loss obtained by the best optimizer in the classifier family, which itself is close to the global minimizer γ* of the BCE (as a function of γ) by the Universal Function Approximation Theorem of neural networks. The BCE loss is strongly convex in γ, and γ links the BCE to I(· ; ·), i.e., |BCE_n(γ̂_θ) − BCE(γ*)| ≤ ε' ⟹ ‖γ̂_θ − γ*‖_1 ≤ η ⟹ |Î_n(X;Y) − I(X;Y)| ≤ ε.

While consistency provides a characterization of the estimator in the large-sample regime, it is not clear what guarantees we obtain for finite samples. The following Theorem shows that even for a small number of samples, the produced MI estimate is a true lower bound on the mutual information value with high probability.

Theorem 2. Under Assumptions (A1)-(A4), the finite sample estimate from Classifier-MI is a lower bound on the true MI value with high probability, i.e., given n test samples, we have for ε > 0

$$\Pr\big(I(X;Y) + \epsilon \ge \hat{I}_n(X;Y)\big) \ge 1 - 2\exp(-Cn),$$

where C is a constant independent of n and of the dimension of the data.
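To make the Classifier-MI recipe concrete, here is a minimal sketch (not the authors' released implementation), assuming scikit-learn; the MLP size, the 50/50 train/evaluation split, and the clipping constant tau are illustrative placeholders rather than the paper's tuned settings.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

def classifier_dkl(samples_p, samples_q, tau=1e-3, seed=0):
    """Plug-in Donsker-Varadhan estimate of D_KL(p || q), eq. (3): train a classifier
    to tell p-samples (label 1) from q-samples (label 0), then turn its predicted
    probabilities into point-wise likelihood ratios on held-out points."""
    X = np.vstack([samples_p, samples_q])
    y = np.concatenate([np.ones(len(samples_p)), np.zeros(len(samples_q))])
    Xtr, Xev, ytr, yev = train_test_split(X, y, test_size=0.5, stratify=y, random_state=seed)
    clf = MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=500, random_state=seed)
    clf.fit(Xtr, ytr)
    gamma = np.clip(clf.predict_proba(Xev)[:, 1], tau, 1 - tau)   # Pr(label = 1 | x)
    ratio = gamma / (1 - gamma)                                   # L(x) = gamma / (1 - gamma)
    return np.mean(np.log(ratio[yev == 1])) - np.log(np.mean(ratio[yev == 0]))

def classifier_mi(x, y, seed=0):
    """Classifier-MI: I(X;Y) = D_KL(p(x,y) || p(x)p(y)); the product distribution
    is emulated by permuting the y-column."""
    rng = np.random.default_rng(seed)
    joint = np.hstack([x, y])
    prod = np.hstack([x, y[rng.permutation(len(y))]])
    return classifier_dkl(joint, prod, seed=seed)

# Sanity check on correlated Gaussians: the true MI is -0.5 * log(1 - rho**2).
rng = np.random.default_rng(1)
rho, n = 0.8, 5000
x = rng.normal(size=(n, 1))
y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=(n, 1))
print(classifier_mi(x, y), -0.5 * np.log(1 - rho**2))
```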

3.2 PROBABILITY CALIBRATION

The estimation of the likelihood ratio from the classifier predictions Pr(y = 1|x) hinges on the classifier being well-calibrated. As a rule of thumb, classifiers trained directly on the cross entropy loss are well-calibrated, whereas boosted decision trees introduce distortions in the likelihood-ratio estimates. There is an extensive literature devoted to obtaining better calibrated classifiers that can be used to improve the estimation further (Lakshminarayanan et al. 2017; Niculescu-Mizil and Caruana 2005; Guo et al. 2017). We experimented with Gradient Boosted Decision Trees and a multi-layer perceptron trained on the log-loss in our algorithms. The multi-layer perceptron gave better estimates and so is used in all the experiments. Supplementary Figures show that the neural networks used in our estimators are well-calibrated.

Even though logistic regression is well-calibrated and might seem an attractive candidate for classification in sparse sample regimes, we show that linear classifiers cannot be used to estimate D_KL by the two-sample approach. For this, we consider the simple setting of estimating the mutual information of two correlated Gaussian random variables as a counter-example.

Lemma 1. A linear classifier with marginal features fails the classifier two-sample MI estimation.

Proof. Consider two correlated Gaussians in 2 dimensions, $(X_1, X_2) \sim \mathcal{N}\!\left(0,\, M = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}\right)$, where ρ is the Pearson correlation. The marginals are standard Gaussians, X_i ∼ N(0, 1). Suppose we are trying to estimate the mutual information D_KL(p(x_1, x_2)||p(x_1)p(x_2)). The classifier decision boundary would seek to find Pr(y = 1 | x_1, x_2) > Pr(y = 0 | x_1, x_2), thus

$$p(x_1, x_2) > p(x_1)p(x_2) \;\Rightarrow\; x_1 x_2 > \tfrac{1}{2\rho}\log(1-\rho^2).$$

The decision boundary is a rectangular hyperbola. Here a linear classifier would return 0.5 as the prediction for either class (leading to D̂_KL = 0), even when X_1 and X_2 are highly correlated and the mutual information is high.

We use the classifier two-sample estimator to first compute the mutual information of two correlated Gaussians (Belghazi et al. 2018) for n = 5,000 samples. This setting also provides us a way to choose reasonable hyper-parameters that are used throughout all the synthetic experiments. We also plot the estimates of f-MINE and KSG to ensure we are able to make them work in simple settings. In the toy setting d_x = 1, all estimators accurately estimate I(X;Y), as shown in Figure 1.

Figure 1: Mutual Information Estimation of Correlated Gaussians. In this setting, X and Y have independent coordinates, with (X_i, Y_i) ∀ i being correlated Gaussians with correlation coefficient ρ; I*(X;Y) = −(1/2) d_x log(1 − ρ²). Panels: (a) d_x = d_y = 1; (b) d_x = d_y = 10.
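A simple way to check the calibration claim above in practice is to inspect a reliability curve on the held-out predictions; a minimal sketch with scikit-learn (our own illustration, not the paper's Supplementary code) follows.

```python
import numpy as np
from sklearn.calibration import calibration_curve

def max_calibration_gap(y_eval, prob_eval, n_bins=10):
    """Largest deviation of the reliability curve from the diagonal.
    y_eval: 0/1 labels of held-out points; prob_eval: classifier's Pr(y = 1 | x)."""
    frac_positive, mean_predicted = calibration_curve(y_eval, prob_eval, n_bins=n_bins)
    return float(np.max(np.abs(frac_positive - mean_predicted)))
```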

3.3 MODULAR APPROACH TO CMI ESTIMATION

Our classifier based divergence estimator does not encounter an optimization problem involving exponentials: MINE optimizing (1) has biased gradients, while the version based on (2) is a weaker lower bound (Belghazi et al. 2018). On the contrary, our classifier is trained on the cross-entropy loss, which has unbiased gradients. Furthermore, we plug the likelihood ratio estimates into the tighter Donsker-Varadhan bound, thereby achieving the best of both worlds. Equipped with a KL-divergence estimator, we can now couple it with the generators of Section 2.1.1, or use the expression of CMI as a difference of two MIs (which we address from now on as MI-Diff.). Algorithm 1 describes CMI estimation by tying together the generator and classifier blocks. For MI-Diff., the function block Classifier_DKL in Algorithm 1 has to be used twice: once for estimating I(X;Y,Z) and once for I(X;Z). For mutual information, D_q in Classifier_DKL is obtained by permuting the samples of p(·).

For the classifier coupled with a generator, the generated distribution g(y|z) may deviate from the target distribution p(y|z), introducing a different kind of bias. The following Lemma suggests how such a bias can be corrected, by subtracting the KL divergence of the sub-tuple (Y,Z) from the divergence of the entire triple (X,Y,Z). We note that such a clean relationship does not hold for general divergence measures, and indeed requires more sophisticated conditions for the total-variation metric (Sen et al. 2018).

Lemma 2 (Bias Cancellation). The estimation error due to an incorrect generated distribution g(y|z) can be accounted for using the following relation:

$$D_{KL}\big(p(x,y,z)\,\|\,p(x,z)p(y|z)\big) = D_{KL}\big(p(x,y,z)\,\|\,p(x,z)g(y|z)\big) - D_{KL}\big(p(y,z)\,\|\,p(z)g(y|z)\big).$$

Algorithm 1: GENERATOR + CLASSIFIER

Input: Dataset D = {x_i, y_i, z_i}_{i=1}^{n}, number of outer bootstrap iterations B, inner iterations T, clipping constant τ.
Output: CMI estimate Î(X;Y|Z).
for b ∈ {1, 2, ..., B} do
    Permute the points in dataset D to obtain D^π.
    Split D^π equally into two parts D_class,joint = {x_i, y_i, z_i}_{i=1}^{n/2} and D_gen = {x_i, y_i, z_i}_{i=n/2}^{n}.
    Train the generator G(·) on D_gen.
    Generate the marginal data-set using the points y'_i = G(z_i) ∀ z_i ∈ D_class,joint(:, Z): D_class,marg = {x_i, y'_i, z_i}_{i=1}^{n/2}.
    Î_b(X;Y|Z) = Classifier_DKL(D_class,joint, D_class,marg, T, τ)
end
return (1/B) Σ_b Î_b(X;Y|Z)

Function Classifier_DKL(D_p, D_q, T, τ):
    Label points u ∈ D_p as l = 1 and v ∈ D_q as l = 0.
    for t ∈ {1, 2, ..., T} do
        D_p^train, D_p^eval ← SPLIT_TEST_TRAIN(D_p);  D_q^train, D_q^eval ← SPLIT_TEST_TRAIN(D_q)
        Train classifier C on {D_p^train, 1}, {D_q^train, 0}
        Obtain classifier predictions Pr(l = 1 | w) ∀ w ∈ D_p^eval ∪ D_q^eval, and clip them to [τ, 1 − τ].
        D̂_KL^t(p||q) ← (1/|D_p^eval|) Σ_{u ∈ D_p^eval} log[ Pr(l=1|u) / (1 − Pr(l=1|u)) ] − log( (1/|D_q^eval|) Σ_{v ∈ D_q^eval} Pr(l=1|v) / (1 − Pr(l=1|v)) )
    end
    return D̂_KL(p||q) = (1/T) Σ_t D̂_KL^t(p||q)
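For the MI-Diff. variant, the generator block is not needed at all; a compact sketch (our own illustration, reusing the hypothetical `classifier_mi` from the Section 3 sketch, with `n_boot` playing the role of the outer loop B in Algorithm 1) is:

```python
import numpy as np

def mi_diff_cmi(x, y, z, classifier_mi, n_boot=10):
    """MI-Diff.: I(X;Y|Z) = I(X;(Y,Z)) - I(X;Z), each term estimated with a
    classifier-based MI estimator; the outer loop averages over repetitions."""
    yz = np.hstack([y, z])
    estimates = [classifier_mi(x, yz, seed=b) - classifier_mi(x, z, seed=b)
                 for b in range(n_boot)]
    return float(np.mean(estimates))
```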

4 EXPERIMENTS

In this Section, we compare the performance of the various estimators on the CMI estimation task. We used the classifier based divergence estimator (Section 3) and MINE in our experiments. Belghazi et al. (2018) proposed two MINE variants, namely Donsker-Varadhan (DV) MINE and f-MINE. The f-MINE has unbiased gradients and we found it to have similar performance as DV-MINE, albeit with lower variance, so we used f-MINE in all our experiments.

The "generator" + "divergence estimator" notation is used to denote the various estimators. For instance, if we use CVAE for the generation and couple it with f-MINE, we denote the estimator as CVAE+f-MINE. When coupled with the classifier based divergence block, it is denoted as CVAE+Classifier. For MI-Diff. we represent it similarly as MI-Diff.+"divergence estimator".

We compare our estimators with the widely used KSG estimator (the implementation of the CMI estimator in the Non-parametric Entropy Estimation Toolbox, https://github.com/gregversteeg/NPEET, is used). For f-MINE, we used the code provided to us by the author (Belghazi et al. 2018). The same hyper-parameter setting is used in all our synthetic data-sets for all estimators (including generators and divergence blocks); the Supplementary contains the details about the hyper-parameter values. For KSG, we vary k ∈ {3, 5, 10} and report the results for the best k for each data-set.

As is common in the literature on causal discovery and independence testing (Sen et al. 2017; Doran et al. 2014), the dimension of X and Y is kept as 1, while d_z can scale. Our estimators are general enough to accommodate multi-dimensional X and Y, where we consider the concatenated vectors X = (X_1, X_2, ..., X_{d_x}) and Y = (Y_1, Y_2, ..., Y_{d_y}). This has applications in learning interactions between modules in Bayesian networks (Segal et al. 2005) or dependence between group variables (Entner and Hoyer 2012; Parviainen and Kaski 2016), such as distinct functional groups of proteins/genes instead of individual entities.

4.1 LINEAR RELATIONS

We start with the simple setting where the three random variables X, Y, Z are related in a linear fashion. We consider the two linear models given in Table 1.

Table 1: Linear Models

Model I:  X ∼ N(0, 1);  Z ∼ U(−0.5, 0.5)^{d_z};  ε ∼ N(Z_1, σ_ε²);  Y ∼ X + ε.
Model II: X ∼ N(0, 1);  Z ∼ N(0, 1)^{d_z};  U = w^T Z, ‖w‖ = 1;  ε ∼ N(U, σ_ε²);  Y ∼ X + ε.

Here U(−0.5, 0.5)^{d_z} means that each coordinate of Z is drawn i.i.d. from a uniform distribution between −0.5 and 0.5; similar notation is used for the Gaussian N(0, 1)^{d_z}. Z_1 is the first dimension of Z. We used σ_ε = 0.1 and obtained the constant unit-norm random vector w from N(0, I_{d_z}); w is kept constant for all points during data-set preparation.

Both linear models are representative of problems encountered in the graphical models and independence testing literature. In Model I, the conditioning set can keep growing with independent variables {Z_k}_{k=2}^{d_z}, while Y depends only on Z_1. In Model II, the variables in the conditioning set combine linearly to produce Y. It is also easy to obtain the ground truth CMI value in such models by numerical integration.

For both models, we generate data-sets with varying number of samples n and varying dimension d_z to study their effect on estimator performance. The sample size is varied as n ∈ {5000, 10000, 20000, 50000} keeping d_z fixed at 20. We also vary d_z ∈ {1, 10, 20, 50, 100}, keeping the sample size fixed at n = 20000.
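A minimal NumPy sketch of the two generative processes in Table 1 (our own illustration; the function names and seed handling are ours) is:

```python
import numpy as np

def model_I(n, dz, sigma=0.1, seed=0):
    """Model I: Y = X + eps with eps ~ N(Z_1, sigma^2); only Z_1 matters."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(n, 1))
    z = rng.uniform(-0.5, 0.5, size=(n, dz))
    eps = rng.normal(loc=z[:, :1], scale=sigma)
    return x, x + eps, z

def model_II(n, dz, sigma=0.1, seed=0):
    """Model II: Y = X + eps with eps ~ N(U, sigma^2), U = w^T Z, ||w|| = 1."""
    rng = np.random.default_rng(seed)
    w = rng.normal(size=(dz, 1)); w /= np.linalg.norm(w)   # fixed across all samples
    x = rng.normal(size=(n, 1))
    z = rng.normal(size=(n, dz))
    eps = rng.normal(loc=z @ w, scale=sigma)
    return x, x + eps, z
```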

Figure 2: CMI Estimation in Linear Models. We study the behavior of the various estimators as either the number of samples n or the dimension d_z is varied: (a) Model I, variation with n, d_z = 20; (b) Model I, variation with d_z, n = 20,000; (c) Model II, variation with n, d_z = 20; (d) Model II, variation with d_z, n = 20,000. MI-Diff.+Classifier performs the best among our estimators, while all our proposed estimators improve the estimation significantly over KSG. The average of 10 runs is plotted; error bars depict 1 standard deviation from the mean. (Best viewed in color.)

Several observations stand out from these experiments. (1) KSG estimates are accurate at very low dimension but fall drastically with increasing d_z, even when the conditioning variables are completely independent and do not influence X and Y (Model I). (2) Increasing the sample size does not improve KSG estimates once the dimension is even moderately large (20, say); the dimension issue is more acute than sample scarcity. (3) The estimates from f-MINE show greater deviation from the truth at low sample sizes; at high dimensions, the instability is clearly portrayed when the estimate suddenly goes negative (truncated to 0.0 to maintain the scale of the plot). (4) All our estimators using the classifier are able to obtain reasonable estimates even at dimensions as high as 100, with MI-Diff.+Classifier performing the best.

4.2 NON-LINEAR RELATIONS

Here, we study models where the underlying relations between X, Y and Z are non-linear. Let Z ∼ N(1, I_{d_z}), X = f_1(η_1), Y = f_2(A_{zy}Z + A_{xy}X + η_2). Here f_1 and f_2 are non-linear bounded functions drawn uniformly at random from {cos(·), tanh(·), exp(−|·|)} for each data-set. A_{zy} is a random vector whose entries are drawn from N(0, 1) and normalized to have unit norm; the vector, once generated, is kept fixed for a particular data-set. We use the setting d_x = d_y = 1 with d_z scaling, so A_{xy} is a constant; we used A_{xy} = 2 in our simulations. The noise variables η_1, η_2 are drawn i.i.d. from N(0, σ_ε²) with σ_ε² = 0.1.

We vary n ∈ {5000, 10000, 20000, 50000} for each dimension d_z. The dimension d_z itself is varied over {10, 20, 50, 100, 200}, giving rise to 20 data-sets: data-index 1 has n = 5000, d_z = 10, data-index 2 has n = 10000, d_z = 10, and so on until data-index 20 with n = 50000, d_z = 200.

Obtaining the ground truth I*(X;Y|Z): Since it is not possible to obtain the ground truth CMI value in such complicated settings using a closed form expression, we resort to using the relation I(X;Y|Z) = I(X;Y|U), where U = A_{zy}Z. The dependence of Y on Z is completely captured once U is given. But U has dimension 1, so I(X;Y|U) can be estimated accurately using KSG. We generate 50000 samples separately for each data-set to estimate I(X;Y|U) and use it as the ground truth.

We observed behavior similar to the linear models for our estimators in the non-linear setting. (1) KSG continues to produce low estimates, even though in this setup the true CMI values are themselves low (< 1.0). (2) Up to d_z = 20, we find all our estimators closely tracking I*(X;Y|Z), but in higher dimensions they fail to estimate accurately. (3) MI-Diff.+Classifier is again the best estimator, giving CMI estimates away from 0 even at 200 dimensions.

Figure 3: On the non-linear data-sets, a similar trend is observed: KSG under-estimates I*(X;Y|Z), while our estimators track it closely. Panels: (a) number of samples increases with data-index, d_z = 10 (fixed); (b) number of samples increases with data-index, d_z = 20 (fixed); (c) all 20 data-sets. The average over 10 runs is plotted. (Best viewed in color.)

From the above experiments, we found MI-Diff.+Classifier to be the most accurate and stable CMI estimator. We use this combination for our downstream application and henceforth refer to it as CCMI.
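The non-linear benchmark of Section 4.2 and its ground-truth trick can be sketched as follows (our own illustration; the 1-D estimate of I(X;Y|U) would then be handed to a KSG implementation such as NPEET, which we only reference here).

```python
import random
import numpy as np

def nonlinear_dataset(n, dz, a_xy=2.0, var_noise=0.1, seed=0):
    """Z ~ N(1, I), X = f1(eta1), Y = f2(A_zy Z + A_xy X + eta2)."""
    rng = np.random.default_rng(seed)
    f1, f2 = random.Random(seed).choices(
        [np.cos, np.tanh, lambda v: np.exp(-np.abs(v))], k=2)
    a_zy = rng.normal(size=(dz, 1)); a_zy /= np.linalg.norm(a_zy)   # fixed per data-set
    z = rng.normal(loc=1.0, size=(n, dz))
    x = f1(rng.normal(scale=np.sqrt(var_noise), size=(n, 1)))
    y = f2(z @ a_zy + a_xy * x + rng.normal(scale=np.sqrt(var_noise), size=(n, 1)))
    u = z @ a_zy            # 1-D summary: I(X;Y|Z) = I(X;Y|U), estimable with KSG
    return x, y, z, u
```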

5 APPLICATION TO CONDITIONAL INDEPENDENCE TESTING

As a testimony to accurate CMI estimation, we apply CCMI to the problem of conditional independence testing (CIT). Here, we are given samples from one of two distributions, p(x, y, z) or q(x, y, z) = p(x, z)p(y|z), and the hypothesis test is to distinguish the null H_0 : X ⊥ Y | Z from the alternative H_1 : X ⊥̸ Y | Z.

We seek to design a CIT tester using CMI estimation, based on the fact that I(X;Y|Z) = 0 ⟺ X ⊥ Y | Z. A simple approach is to reject the null if I(X;Y|Z) > 0 and accept it otherwise; the CMI estimates can thus serve as a proxy for the P-value. CIT testing based on CMI estimation has been studied by Runge (2018), where the author uses KSG for CMI estimation and a k-NN based permutation scheme to generate a P-value, computed as the fraction of permuted data-sets whose CMI estimate is ≥ that of the original data-set. The same approach can be adopted for CCMI to obtain a P-value. But since we report the AuROC (Area under the Receiver Operating Characteristic curve), CMI estimates suffice.

5.1 POST NON-LINEAR NOISE : SYNTHETIC

In this experiment, we generate data based on the post non-linear noise model, similar to Sen et al. (2017). As before, d_x = d_y = 1 and d_z can scale in dimension. The data is generated using the following model:

$$Z \sim \mathcal{N}(1, I_{d_z}), \quad X = \cos(a_x Z + \eta_1), \quad Y = \begin{cases} \cos(b_y Z + \eta_2) & \text{if } X \perp Y \mid Z \\ \cos(cX + b_y Z + \eta_2) & \text{if } X \not\perp Y \mid Z \end{cases}$$

The entries of the random vectors (matrices if d_x, d_y > 1) a_x and b_y are drawn from U(0, 1), and the vectors are normalized to have unit norm, i.e., ‖a_x‖_2 = 1, ‖b_y‖_2 = 1. We draw c ∼ U[0, 2] and η_i ∼ N(0, σ_e²) with σ_e = 0.5. This differs from the implementation in Sen et al. (2017), where the constant is c = 2 in all data-sets; by varying c, we obtain a tougher problem in which the true CMI value can be quite low for a dependent data-set and the tester is required to separate it correctly from 0. a_x, b_y and c are kept constant when generating the points of a single data-set and are varied across data-sets.
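The evaluation protocol reduces to scoring each data-set with a CMI estimate and computing AuROC over the CI/non-CI labels; a minimal sketch (ours, assuming scikit-learn and a CMI estimator such as the MI-Diff. sketch of Section 3.3) is:

```python
from sklearn.metrics import roc_auc_score

def cit_auroc(cmi_estimates, is_dependent):
    """cmi_estimates: one CCMI estimate per data-set (used as the test score);
    is_dependent: 1 for non-CI data-sets (X not independent of Y given Z), 0 for CI."""
    return roc_auc_score(is_dependent, cmi_estimates)

def cit_decision(cmi_estimate, threshold=0.0):
    """Hard decision used for the precision/recall numbers: reject H0 above the threshold."""
    return "reject H0" if cmi_estimate > threshold else "accept H0"
```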

Figure 4: Conditional independence testing on the post non-linear synthetic data-sets. (a) CCIT performance degrades with increasing d_z, while CCMI retains a high AuROC score even at d_z = 100. (b) CMI estimates for CI data-sets are ≤ 0 and those for non-CI data-sets are > 0 at d_z = 100; thresholding the CMI estimates at 0 yields Precision = 0.84, Recall = 0.86.

We vary d_z ∈ {1, 5, 20, 50, 70, 100} and simulate 100 data-sets for each dimension. The number of samples is n = 5000 in each data-set. Our algorithm is compared with the state-of-the-art CIT tester of Sen et al. (2017), known as CCIT. We used the implementation provided by the authors (https://github.com/rajatsen91/CCIT) and ran CCIT with B = 50 bootstraps. For each data-set, an AuROC value is obtained. Figure 4 shows the mean AuROC values from 5 runs for both testers as d_z varies. While both algorithms perform accurately up to d_z = 20, the performance of CCIT starts to degrade beyond 20 dimensions; beyond 50 dimensions it performs close to random guessing. CCMI retains its superior performance even at d_z = 100, obtaining a mean AuROC value of 0.91.

Since the AuROC metric finds the best performance by varying thresholds, it is not clear what precision and recall are obtained from CCMI when we threshold the CCMI estimate at 0 (and reject or accept the null based on it). So, for d_z = 100, we plotted the histograms of CMI estimates separately for the CI and non-CI data-sets. Figure 4b shows that there is a clear demarcation of CMI estimates between the two data-set categories, and choosing the threshold as 0.0 gave a precision of 0.84 and a recall of 0.86.

5.2 FLOW-CYTOMETRY : REAL DATA

To extend our estimator beyond simulated settings, we use CMI estimation to test for conditional independence in the protein network data used in Sen et al. (2017). The consensus graph in Sachs et al. (2005) is used as the ground truth. We obtained 50 CI and 50 non-CI relations from the Bayesian network. The basic philosophy used is that a protein X is independent of all other proteins Y in the network given its parents, children and parents of children. Moreover, in the case of non-CI, we note that a direct edge between X and Y would never render them conditionally independent, so the conditioning set Z can be chosen at random from the other proteins. These two settings are used to obtain the CI and non-CI data-sets. The number of samples in each data-set is only 853 and the dimension of Z varies from 5 to 7. CCMI is compared with CCIT, and the mean AuROC curves from 5 runs are plotted in Figure 5. The superior performance of CCMI over CCIT is retained in this sparse data regime.

Figure 5: AuROC curves on the Flow-Cytometry data-set. CCIT obtains a mean AuROC score of 0.6665, while CCMI out-performs it with a mean of 0.7569.

6 CONCLUSION

In this work we explored various CMI estimators by drawing on recent advances in generative models and classifiers. We proposed a new divergence estimator based on classifier two-sample estimation, and built several conditional mutual information estimators using this primitive. We demonstrated their efficacy in a variety of practical settings. Future work will aim to approximate the null distribution for CCMI, so that we can compute P-values for the conditional independence testing problem efficiently.

Acknowledgments

This work was partially supported by NSF grants 1651236 and 1703403 and NIH grant 1R01HG008164.
References

Jan Beirlant, Edward J. Dudewicz, László Györfi, and Edward C. van der Meulen. Nonparametric entropy estimation: An overview. International Journal of Mathematical and Statistical Sciences, 6(1):17–39, 1997.

Mohamed Ishmael Belghazi, Aristide Baratin, Sai Rajeshwar, Sherjil Ozair, Yoshua Bengio, Aaron Courville, and Devon Hjelm. Mutual information neural estimation. In Proceedings of the 35th International Conference on Machine Learning, 2018.

G. Doran, K. Muandet, K. Zhang, and B. Schölkopf. A permutation-based kernel conditional independence test. In 30th Conference on Uncertainty in Artificial Intelligence (UAI 2014).

Doris Entner and Patrik O. Hoyer. Estimating a causal order among groups of variables in linear models. In International Conference on Artificial Neural Networks, pages 84–91. Springer, 2012.

François Fleuret. Fast binary feature selection with conditional mutual information. Journal of Machine Learning Research, 5(Nov):1531–1555, 2004.

Stefan Frenzel and Bernd Pompe. Partial mutual information for coupling analysis of multivariate time series. Physical Review Letters, 99(20):204101, 2007.

Shuyang Gao, Greg Ver Steeg, and Aram Galstyan. Efficient estimation of mutual information for strongly dependent variables. In Artificial Intelligence and Statistics, pages 277–286, 2015.

Weihao Gao, Sewoong Oh, and Pramod Viswanath. Breaking the bandwidth barrier: Geometrical adaptive entropy estimation. In Advances in Neural Information Processing Systems, pages 2460–2468, 2016.

Weihao Gao, Sreeram Kannan, Sewoong Oh, and Pramod Viswanath. Estimating mutual information for discrete-continuous mixtures. In Advances in Neural Information Processing Systems, pages 5988–5999, 2017.

Weihao Gao, Sewoong Oh, and Pramod Viswanath. Demystifying fixed k-nearest neighbor information estimators. IEEE Transactions on Information Theory, 64(8):5629–5661, 2018.

Federico M. Giorgi, Gonzalo Lopez, Jung H. Woo, Brygida Bisikirska, Andrea Califano, and Mukesh Bansal. Inferring protein modulation from gene expression data using conditional mutual information. PLoS ONE, 9(10):e109569, 2014.

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, pages 1321–1330. JMLR.org, 2017.

Jaroslav Hlinka, David Hartman, Martin Vejmelka, Jakob Runge, Norbert Marwan, Jürgen Kurths, and Milan Paluš. Reliability of inference of directed climate networks using conditional mutual information. Entropy, 15(6):2023–2045, 2013.

Kurt Hornik, Maxwell Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2(5):359–366, 1989.

Jiantao Jiao, Weihao Gao, and Yanjun Han. The nearest neighbor information estimator is adaptively near minimax rate-optimal. In Advances in Neural Information Processing Systems, 2018.

Kirthevasan Kandasamy, Akshay Krishnamurthy, Barnabas Poczos, Larry Wasserman, et al. Nonparametric von Mises estimators for entropies, divergences and mutual informations. In Advances in Neural Information Processing Systems, pages 397–405, 2015.

Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. In 2nd International Conference on Learning Representations (ICLR), 2014.

L. F. Kozachenko and Nikolai N. Leonenko. Sample estimate of the entropy of a random vector. Problemy Peredachi Informatsii, 23(2):9–16, 1987.

Alexander Kraskov, Harald Stögbauer, and Peter Grassberger. Estimating mutual information. Physical Review E, 69(6):066138, 2004.

Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, pages 6402–6413, 2017.

Intae Lee. Sample-spacings-based density and entropy estimators for spherically invariant multidimensional data. Neural Computation, 22(8):2208–2227, 2010.

Marek Leśniewicz. Expected entropy as a measure and criterion of randomness of binary sequences. Przegląd Elektrotechniczny, 90(1):42–46, 2014.

Zhaohui Li, Gaoxiang Ouyang, Duan Li, and Xiaoli Li. Characterization of the causality between spike trains with permutation conditional mutual information. Physical Review E, 84(2):021929, 2011.

Kuo-Ching Liang and Xiaodong Wang. Gene regulatory network reconstruction using conditional mutual information. EURASIP Journal on Bioinformatics and Systems Biology, 2008(1):253894, 2008.

Dirk Loeckx, Pieter Slagmolen, Frederik Maes, Dirk Vandermeulen, and Paul Suetens. Nonrigid image registration using conditional mutual information. IEEE Transactions on Medical Imaging, 29(1):19–29, 2010.

David Lopez-Paz and Maxime Oquab. Revisiting classifier two-sample tests. In 5th International Conference on Learning Representations (ICLR), 2017.

Erik G. Miller. A new class of entropy estimators for multi-dimensional densities. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2003), volume 3, pages III–297. IEEE, 2003.

Mehdi Mirza and Simon Osindero. Conditional generative adversarial nets. arXiv preprint arXiv:1411.1784, 2014.

Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. 2018.

Ilya Nemenman, Fariel Shafee, and William Bialek. Entropy and inference, revisited. In Advances in Neural Information Processing Systems, pages 471–478, 2002.

XuanLong Nguyen, Martin J. Wainwright, and Michael I. Jordan. Estimating divergence functionals and the likelihood ratio by penalized convex risk minimization. In Advances in Neural Information Processing Systems, pages 1089–1096, 2008.

Alexandru Niculescu-Mizil and Rich Caruana. Obtaining calibrated probabilities from boosting. In UAI, 2005.

Dávid Pál, Barnabás Póczos, and Csaba Szepesvári. Estimation of Rényi entropy and mutual information based on generalized nearest-neighbor graphs. In Advances in Neural Information Processing Systems, pages 1849–1857, 2010.

Pekka Parviainen and Samuel Kaski. Bayesian networks for variable groups. In Conference on Probabilistic Graphical Models, pages 380–391, 2016.

Ben Poole, Sherjil Ozair, Aaron van den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In International Conference on Machine Learning, 2019.

Arman Rahimzamani, Himanshu Asnani, Pramod Viswanath, and Sreeram Kannan. Estimators for multivariate information measures in general probability spaces. In Advances in Neural Information Processing Systems 31. Curran Associates, Inc., 2018.

Jakob Runge. Conditional independence testing based on a nearest-neighbor estimator of conditional mutual information. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, 2018.

Karen Sachs, Omar Perez, Dana Pe'er, Douglas A. Lauffenburger, and Garry P. Nolan. Causal protein-signaling networks derived from multiparameter single-cell data. Science, 308(5721):523–529, 2005.

Eran Segal, Dana Pe'er, Aviv Regev, Daphne Koller, and Nir Friedman. Learning module networks. Journal of Machine Learning Research, 6(Apr):557–588, 2005.

Rajat Sen, Ananda Theertha Suresh, Karthikeyan Shanmugam, Alexandros G. Dimakis, and Sanjay Shakkottai. Model-powered conditional independence test. In Advances in Neural Information Processing Systems, pages 2951–2961, 2017.

Rajat Sen, Karthikeyan Shanmugam, Himanshu Asnani, Arman Rahimzamani, and Sreeram Kannan. Mimic and classify: A meta-algorithm for conditional independence testing. arXiv preprint arXiv:1806.09708, 2018.

Harshinder Singh, Neeraj Misra, Vladimir Hnizdo, Adam Fedorowicz, and Eugene Demchuk. Nearest neighbor estimates of entropy. American Journal of Mathematical and Management Sciences, 23(3-4):301–321, 2003.

Shashank Singh and Barnabás Póczos. Exponential concentration of a density functional estimator. In Advances in Neural Information Processing Systems, pages 3032–3040, 2014.

Shashank Singh and Barnabás Póczos. Finite-sample analysis of fixed-k nearest neighbor density functional estimators. In Advances in Neural Information Processing Systems, pages 1217–1225, 2016.

Kihyuk Sohn, Honglak Lee, and Xinchen Yan. Learning structured output representation using deep conditional generative models. In Advances in Neural Information Processing Systems 28, 2015.

Kumar Sricharan, Raviv Raich, and Alfred O. Hero. Estimation of nonlinear functionals of densities with confidence. IEEE Transactions on Information Theory, 58(7):4135–4159, 2012.

Kumar Sricharan, Dennis Wei, and Alfred O. Hero. Ensemble estimators for multivariate entropy estimation. IEEE Transactions on Information Theory, 59(7):4374–4388, 2013.

Taiji Suzuki, Masashi Sugiyama, Jun Sese, and Takafumi Kanamori. Approximating mutual information by maximum likelihood density ratio estimation. In New Challenges for Feature Selection in Data Mining and Knowledge Discovery, pages 5–20, 2008.

Martin Vejmelka and Milan Paluš. Inferring the directionality of coupling with conditional mutual information. Physical Review E, 77(2):026214, 2008.