A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Xinjie Lan, Kenneth E. Barner

Department of Electrical and Computer Engineering, University of Delaware, Newark, DE, USA, 19711

ARTICLEINFO ABSTRACT

Keywords: In this paper, we propose a probabilistic representation of MultiLayer Perceptrons (MLPs) to improve deep neural networks the information theoretic interpretability. Above all, we demonstrate that the activations being i.i.d. is information bottleneck not valid for all the hidden layers of MLPs, thus the existing mutual information estimators based on probabilistic modeling non-parametric inference methods, e.g., empirical distributions and Kernel Density Estimate (KDE), non-parametric inference are invalid for measuring the information flow in MLPs. Moreover, we introduce explicit probabilistic explanations for MLPs: (i) we define the (ΩF ,  ,PF ) for a fully connected layer f and demonstrate the great effect of an activation function of f on the probability PF ; (ii) we prove the entire architecture of MLPs as a Gibbs distribution P ; and (iii) the back-propagation aims to optimize the sample space ΩF of all the fully connected layers of MLPs for learning an optimal Gibbs distribution P ∗ to express the statistical connection between the input and the label. Based on the probabilistic explanations for MLPs, we improve the information theoretic interpretability of MLPs in three aspects: (i) the of f is discrete and the corresponding is finite; (ii) the information bottleneck theory cannot correctly explain the information flow in MLPs if we take into account the back-propagation; and (iii) we propose novel information theoretic explanations for the generalization of MLPs. Finally, we demonstrate the proposed probabilistic representation and information theoretic explanations for MLPs in a synthetic dataset and benchmark datasets.

1. Introduction Notably, the non-parametric statistical models lack solid theoretical basis in the context of DNNs. As two classical Improving the interpretability of Deep Neural Networks non-parametric inference algorithms (Wasserman, 2006), the (DNNs) is a fundamental issue of deep learning. Recently, empirical distribution and KDE approach the true distribu- numerous efforts have been devoted to explaining DNNs from tion only if the samples are independently and identically the view point of information theory. In the seminal work, distributed (i.i.d.). Specifically, the prerequisite of applying Shwartz-Ziv and Tishby(2017) initially use the Information the non-parametric in DNNs is that the activations Bottleneck (IB) theory to clarify the internal logic of DNNs. of a hidden layer are i.i.d. samples of the true distribution Specifically, they claim that DNNs optimize an IB tradeoff of the layer. However, none of previous works explicitly between compression and prediction, and the generalization demonstrates the prerequisite. performance of DNNs is causally related to the compression. Moreover, the unclear definition for the random variable However, the IB explanation causes serious controversies, of a hidden layer results in an information theoretic issue especially Saxe et al.(2018) question the validity of the IB (Chelombiev et al., 2019). Specifically, a random variable explanations by some counter-examples, and Goldfeld et al. is a measurable function F ∶ Ω → E mapping the sample (2019) doubt the causality between the compression and the space Ω to the measurable space E. All the previous works generalization performance of DNNs. simply assume the activations of a hidden layer as E but not Basically, the above controversies stem from different specify Ω, which indciates F as a continuous random vari- probabilistic models for the hidden layer of DNNs. Due to able because the activations are continuous. As a result, the arXiv:2010.14054v1 [cs.LG] 27 Oct 2020 the complicated architecture of DNNs, it is extremely hard to conditional distribution P (F X) would be a delta function establish an explicit probabilistic model for the hidden layer ð under the assumption that DNNs are deterministic models, of DNNs. As a result, all the previous works have to adopt thereby the mutual information I(X,F ) = ∞, where X is non-parametric statistics to estimate the mutual information. the random variable of the input. However, that contradicts Shwartz-Ziv and Tishby(2017) model the distribution of a experimental results I(X,F ) < ∞. hidden layer as the empirical distribution (a.k.a. the binning To resolve the above information theoretic controversies method) of the activations of the layer, whereas Saxe et al. and further improve the interpretability for DNNs, this paper (2018) model the distribution as Kernel Density Estimation proposes a probabilistic representation for feedforward fully (KDE), and Goldfeld et al.(2019) model the distribution connected DNNs, i.e., the MultiLayer Perceptrons (MLPs), as the convolution between the empirical distribution and in three aspects: (i) we thoroughly study the i.i.d. property additive Gaussian noise. Inevitably, different probabilistic of the activations of a fully connected layer, (ii) we define models derive different information theoretic explanations the probability space for a fully connected layer, and (iii) we for DNNs, thereby leading to controversies. explicitly propose probabilistic explanations for MLPs and [email protected] (X. 
Lan) the back-propagation training algorithm. ORCID(s): 0000-0001-7600-106 (X. Lan)

Xinjie Lan et al.: Preprint submitted to Elsevier Page 1 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

First, we demonstrate that the correlation of activations () � � with the same label becomes larger as the layer containing � () the activations is closer to the output. Therefore, activations � () () � � � being i.i.d. is not valid for all the hidden layers of MLPs. ⋮ () () � � In other words, the existing mutual information estimators � () � based on non-parametric statistics are not valid for all the � hidden layers of MLPs as the activations of hidden layers � ⋮ � () cannot satisfy the prerequisite. � ⋮ () Second, we define the probability space (Ω , ,P ) for � � F  F � ⋮ () a fully connected layer f with N neurons given the input � ⋮ � x. Let the experiment be f extracting a single feature of � � x, (ΩF ,  ,PF ) is defined as follows: the sample space ΩF consists of N possible outcomes (i.e., features), and each � � � � outcome is defined by the weights of each neuron; the event � � � space  is the -algebra; and the PF Figure 1: The input layer x has M nodes, and f1 has N N ∑M (1) is a Gibbs measure for quantifying the probability of each neurons {f1n = 1[g1n(x)]}n=1, where g1n(x) = m=1 !mn ⋅ xm + (1) outcome occurring the experiment. Notably, the activation b1n is the nth linear function with !mn being the weight of the edge between x and f , and b being the bias.  (⋅) function of f has a great effect on PF , because an activation m 1n 1n 1 is a non-linear activation function, e.g., the ReLU function. equals the negative energy function of PF . Similarly, f = {f =  [g (f )]}K has K neurons, where Third, we propose probabilistic explanations for MLPs 2 2k 2 2k 1 k=1 g f ∑N !(2) f b . In addition, f is the softmax, and the back-propagation training: (i) we prove the entire 2k( 1) = n=1 nk ⋅ 1n + 2k Y 1 ∑K (3) thus fyl = exp(gyl) where gyl = k !kl ⋅ f2k + byl and architecture of MLPs as a Gibbs distribution based on the ZY =1 Gibbs distribution P for each layer; and (ii) we show that ∑L F ZY = l=1 exp(gyl) is the partition function. the back-propagation training aims to optimize the sample space of all the layers of MLPs for modeling the statistical connection between the input x and the label y, because the Finally, we generate a synthetic dataset to demonstrate weights of each layer define sample space. the theoretical explanations for MLPs. Since the dataset only In summary, the three probabilistic explanations for fully has four simple features, we can validate the probabilistic connected layers and MLPs establish a solid probabilistic explanations for MLPs by visualizing the weights of MLPs. foundation for explaining MLPs in an information theoretic In addition, the four features has equal probability, thus the way. Based on the probabilistic foundation, we propose three dataset has fixed entropy. As a result, we can demonstrate novel information theoretic explanations for MLPs. the information theoretic explanations for MLPs. Above all, we demonstrate that the entropy of F is finite, The rest of the paper is organized as follows. Section2 i.e., H(F ) < ∞. Based on (ΩF ,  ,PF ), we can explicitly briefly discusses the related works. Section3 and4 propose define the random variable of f as F ∶ ΩF → EF , where the probabilistic and information theoretic explanations for EF denotes discrete measurable space, thus F is a discrete MLPs, respectively. Section5 specifies the mutual informa- random variable and H(F ) < ∞. As a result, we resolve the tion estimators based on (ΩF ,  ,PF ) for a fully connected controversy regarding F being continuous. layer. 
Section6 validates the probabilistic and information Furthermore, we demonstrate that the information flow theoretic explanations for MLPs on the synthetic dataset and of X and Y in MLPs cannot satisfy IB if taking into account benchmark dataset MNIST and Fashion-MNIST. Section7 the back-propagation training. Specifically, the probabilistic concludes the paper and discusses future work. explanation for the back-propagation training indicates that Preliminaries. P (X,Y ) = P (Y ðX)P (X) is an unknown ΩF depends on both x and y, thus F depends on both X joint distribution between two random variables X and Y .A and Y , where Y is the random variable of y. However, IB j j j M j J dataset  = {(x , y ) x ∈ ℝ , y ∈ ℝ} consists of J requires that F is independent on Y given X, ð j=1 i.i.d. samples generated from P (X,Y ) with finite L classes, In addition, we further confirm none causal relationship i.e., yj ∈ {1, ⋯ ,L}. A neural network with I hidden lay- between the compression and the generalization of MLPs. ers is denoted as DNN = {x; f ; ...; f ; f } and trained Alternatively, we demonstrate that the performance of a MLP 1 I Y by , where (xj, yj) ∈ are the input of the DNN and depends on the mutual information between the MLP and X,   the label, respectively, thus x ∼ P (X) and the DNN aims i.e., I(X, MLP). More specifically, we demonstrate all the to learn P (Y X) with the one-hot format, i.e., if l = yj, information of Y coming from X, i.e., H(Y ) = I(X,Y ) ð P (l xj) = 1; otherwise P (l xj) = 0. We use the (the relation is visualized by the Venn diagram in Figure4), Y ðX ð Y ðX ð MLP = {x; f ; f ; f } in Figure1 for most theoretical thus I(X, MLP) can be divided into two parts I(Y, MLP) 1 2 Y derivations unless otherwise specified. In addition, H(X) and I(X,̄ MLP), where X̄ = Y c ∩ X denotes the relative is the entropy of X, I(X,F ) and I(Y,F ) are the mutual complement Y in X. We demonstrate that the performance i i information between X, Y and F , where F is the random of the MLP on training dataset depends on I(Y, MLP), and i i variable of the hidden layer f . the generalization of the MLP depends on I(X,̄ MLP). i

Xinjie Lan et al.: Preprint submitted to Elsevier Page 2 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

2. Related work Except the classical non-parametric inference methods, recent works propose some new mutual information estima- 2.1. Information theoretic explanations for DNNs tors. For instance, Yu et al.(2019) propose the matrix-based IB aims to optimize a random variable F as a compressed Rényi -entropy to estimate I(X,Fi) without probabilistic representation of X such that it can minimize the informa- modeling f i, in which Shannon entropy is a special case tion of X while preserve the information of Y (Slonim, 2002). of Rényi -entropy when → 1 (Yu et al., 2019, 2020). Since F is a compressed representation of X, it is entirely Gabrié et al.(2018) propose the heuristic replica method to determined by X, i.e, P (F X,Y ) = P (F X), thus the joint ð ð estimate I(X,Fi) in statistical feedforward neural networks distribution P (X,Y,F ) can be formulated as (Kabashima, 2008; Manoel et al., 2017). P (X,Y,F ) = P (X,Y )P (F X) ð (1) 2.2. Probabilistic explanations for DNNs = P (Y )P (XðY )P (F ðX). Probabilistic modeling the hidden layer of DNNs is a fundamental question of deep learning theory. Numerous As a result, the corresponding can be described probabilistic models have been proposed to explain DNNs, as Y ↔ X ↔ F and IB can be formulated as e.g., (Lee et al., 2018; Matthews et al., ∗ 2018; Novak et al., 2018), mixture model (Patel et al., 2016; P (F ðX) = arg min I(X,F ) − I(Y,F ), (2) P (F ðX) Tang et al., 2015; Oord and Schrauwen, 2014), and Gibbs distribution (Mehta and Schwab, 2014; Yaida, 2019). where is the Lagrange multiplier controlling the tradeoff As a fundamental probabilistic graphic model, the Gibbs between I(X,F ) and I(Y,F ). distribution (a.k.a., , the energy based The key to validating the IB explanation for MLPs is to model, or the renormalization group) formulates the depen- precisely measure I(X,Fi) and I(Y,Fi). Ideally, we should dence within X by associating an energy E(x; ) to each specify F ∶ Ω → E before deriving I(F , X) and I(F ,Y ). i Fi Fi i i dependence structure (Geman and Geman, 1984). However, the complicated architecture of MLPs makes it 1 hard to specify Fi. Alternatively, most previous works use P (X; ) = exp[−E(x; )], (5) non-parametric inference to estimate I(X,Fi) and I(Y,Fi). Z() Based on a classical non-parametric inference method, E x;   namely the empirical distribution of the activations of f , where ( ) is the energy function, are the parameters, i Z  ∑ E x;  1 Shwartz-Ziv and Tishby(2017) experimentally show that and the partition function is ( ) = x exp[− ( )] . The Gibbs distribution has three appealing properties. the MLP = {x; f1; f2; fY } shown in Figure1 satisfies IB and corresponds to a Markov chain First, it can be easily reformulated as various probabilistic models by redefining E(x; ), which allows us to clarify the complicated architecture of a hidden layer. For example, if Y ↔X↔F1↔F2 ↔ FY . (3) the energy function is defined as the summation of multiple ∑ As a result, the information flow in the MLP should satisfies functions, namely E(x; ) = − k fk(x; k), the Gibbs dis- the two Data Processing Inequalities (DPIs) tribution would be the Product of Experts (PoE) model, i.e., P (x; ) = 1 ∏ F , where F = exp[−f (x;  )] and Z() k k k k k H(X) ≥ I(X,F1) ≥ I(X,F2) ≥ I(X,FY ), ∏ (4) Z() = k Z(k) (Hinton, 2002). Second, since Z() only I(Y, X) ≥ I(Y,F1) ≥ I(Y,F2) ≥ I(Y,FY ). depends on , the deterministic function E(x; ) is a suffi- cient statistics of P (X; ). 
The property allows us to explain Furthermore, they claim that most of training epochs focus a deterministic hidden layer in a probabilistic way. Third, the on learning a compressed representation of input for fitting energy minimization is a typical optimization for , namely the labels, and the generalization performance of DNNs is ∗ = arg min E(x; ) (LeCun et al., 2006), which allows us causally related to the compression phase.  to explain the back-propagation training, because the energy Meanwhile, other non-parametric inference methods are minimization can be implemented by the gradient descent also used to estimate the mutual information. For instance, algorithm as long as E(x; ) is differentiable. Saxe et al.(2018) use Gaussian KDE to estimate I(X,F ) i To the best of our knowledge, Mehta and Schwab(2014) and I(Y,F ), Goldfeld et al.(2019) choose the convolution i initially explain the distribution of hidden layers as a Gibbs of the empirical distribution and additive Gaussian noise to distribution in the Restricted Boltzmann Machine (RBM). estimate I(X,F ), and Chelombiev et al.(2019) propose some i Lin et al.(2017) clarify certain advantages of DNNs based adaptive techniques for optimizing the mutual information on the Gibbs distribution. Notably, Yaida(2019) indirectly estimators based on empirical distributions and KDE. demonstrates the distribution of a fully connected layer as a However, the information theoretic explanations for DNNs Gibbs distribution. However, there is few work to extend the based on non-parametric inference have several limitations. Gibbs explanation to complicated hidden layers, e.g., fully First, it is invalid for non-saturating activation functions, e.g., connected layers and convolutional layers. the widely used ReLU. Second, the causal relation between generalization and compression cannot be validated by KDE 1We only consider the discrete case in the paper. and other recent works (Gabrié et al., 2018).

Xinjie Lan et al.: Preprint submitted to Elsevier Page 3 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

j j j j j j Rxj, xj0 Rf , f 0 Rf , f 0 Rf , f 0 0 1.0 0 1 1 1.0 0 2 2 1.0 0 Y Y 1.0

1K 0.8 1K 0.8 1K 0.8 1K 0.8

2K 0.6 2K 0.6 2K 0.6 2K 0.6

3K 0.4 3K 0.4 3K 0.4 3K 0.4

4K 0.2 4K 0.2 4K 0.2 4K 0.2

5K 5K 5K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K (A) (B) (C) (D)

1.0 1.0 1.0 1.0 j j0 0.9 rdiff(x , x ) 0.9 0.9 0.9 j j 0.8 j j0 0.8 0 0.8 0.8 rsame(x , x ) rdiff(f1, f1) 0.7 0.7 0.7 j j 0.7 j j j j 0 0 train_acc 0 rdiff(f2, f2) rdiff(fY, fY) 0.6 0.6 rsame(f1, f1) 0.6 0.6 0.5 0.5 0.5 j j0 0.5 j j0 train_acc rsame(f2, f2) rsame(fY, fY) 0.4 0.4 0.4 0.4 0.3 0.3 0.3 train_acc 0.3 train_acc 0.2 0.2 0.2 0.2 100 101 102 100 101 102 100 101 102 100 101 102 training epochs training epochs training epochs training epochs (E) (F) (G) (H)

j 5000 ¨ Figure 2: (A) visualizes the sample correlation matrix Rxj ,xj given the 5000 testing dataset {x }j=1 . (B)-(D) visualize the three j 5000 ¨ sample correlation matrix R j j for the three layers given {x }j=1 , respectively. (E) visualizes the average sample correlation of f i ,f i j 5000 j 5000 {x }j=1 with the same labels and with different labels. (F)-(H) visualize the average sample correlation of {f i }j=1 for the three layers with the same labels and with different labels.

3. Novel probabilistic explanations Since the necessary condition for samples being i.i.d. is uncorrelation, we can use the sample correlation to examine In this section, we present three theoretical results: (i) we ¨ if activations being i.i.d.. More specifically, if F j and F j demonstrate that activations being i.i.d. is not valid for all the i i are i.i.d., the sample correlation r j j¨ must be zero, layers of MLPs, thus non-parametric inference cannot model f i ,f i the distributions of all the fully connected layers of MLPs; (ii) we define the probability space (Ω , ,P ) for a fully ∑N j ̄j j¨ ̄j¨ F  F n=1(fin − f i )(fin − f i ) r ¨ = , (6) connected layer, and propose a probabilistic explanation for f j ,f j u i i 2 the entire architecture of MLPs based on the Gibbs measure ∑N j ̄j 2 ∑N j¨ ̄j¨ n=1(fin − f i ) n=1 (fin − f i ) PF ; and (iii) we introduce a probabilistic explanation for the back-propagation training based on the sample space ΩF . j j¨ j j¨ where f i and f i are two activation samples of F i and F i ¨ ̄j N j 3.1. Activations are not i.i.d. given two sample inputs xj and xj , f = 1 ∑ f , and j M i N n=1 in Given an input x ∈ ℝ , we define the corresponding N is the number neurons of f i. Xj Xj, ,Xj Xj multivariate random variable = [ 1 ⋯ M ], where m We specify the MLP to classify the benchmark MNIST j is the scalar-valued random variable of xm. In the context dataset (LeCun et al., 1998). Since the dimension of each of frequentist probability, all the parameters of MLPs are image is 28 × 28, the number of the input nodes is M = j N K viewed as constants, thus the random variable of g1n(x ) = 784. In addition, f1, f2, and fY have = 128, = 64, ∑M !(1) xj b Gj ∑M !(1)Xj b and L = 10 neurons/nodes, respectively. All the activation m=1 mn ⋅ m + 1n is defined as 1n = m=1 mn m + 1n j j functions are sigmoid. After 200 training epochs, we derive and the random variable of the activation f = 1(g ) j 5000 1n 1n r ¨ on the 5000 testing images {x } and define the j j f j ,f j j=1 F  G i i is defined as n = 1( n). Therefore, the multivariate 1 1 matrix R ¨ to contain all the r ¨ . j j j j , j ð j , j ð random variable of f = [f , ⋯ , f ] can be defined as f i f i f i f i 1 11 1N j j¨ F j F j , ,F j As a result, we can examine if F and F being i.i.d. 1 = [ 11 ⋯ 1N ]. Similarly, we define the multivariate i i j j j j by checking if most elements in R ¨ are close to zero. In f F F , ,F f j ,f j random variable of 2 as 2 = [ 21 ⋯ 2K ] and the mul- i i j f j F j F j , ,F j addition, we rearrange the order of {x }5000 such that images tivariate random variable of Y as Y = [ y1 ⋯ yL]. j=1 Samples being i.i.d. is the prerequisite of non-parametric with the same label have consecutive index, i.e., images with inference methods, e.g., the empirical distribution and KDE. label l has the index [l×500, (l+1)×500], thus we can easily In the context of MLPs, all the previous works regard the check the correlation of activations with the same label. activations of a layer as the samples of the random variable We demonstrate that the correlation of activations with of the layer. As a result, activations being i.i.d. should be the the same label becomes larger as the layer is closer to the prerequisite of applying non-parametric inference methods output. In other words, activations being i.i.d. is not valid to estimate the true distribution of the layer. for all the layers of the MLP, thus non-parametric inference cannot correctly model the true distribution of all the layers.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 4 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

More specifically, Figure2(A) shows that the correlation 3.2.1. The probability space (ΩF ,  ,PF ) for a layer j 5000 x r ¨ , ,P of each pair of testing images { }j=1 , i.e., xj ,xj , is close In this section, we define the probability space (ΩF  F ) j j¨ j j¨ for a fully connected layer, and prove the probability space to zero. Figure2(E) shows ̄rdiff(x , x ) and ̄rsame(x , x ), which denote the average sample correlation of {xj}5000 with being valid for all the fully connected layers of MLPs. j=1 Definition. Given a fully connected layer f consisting of different labels and with the same label, respectively. N f  g x N  neurons { n = [ n( )]}n=1, where (⋅) is an activation M M L−1 function, e.g, the sigmoid function, x = {xm} ∈ ℝ is ¨ 1 É É m=1 j j M ̄rdiff(x , x ) = rxj ,xj¨ (7) ∑ N the input of f, gn(x) = m=1 !mn ⋅ xm + bn is the nth linear diff l=0 yj yj¨ ≠ filter with !mn being the weight and bn being the bias, let f extracting a single feature of x be an experiment, we define , ,P f L−1 the probability space (ΩF  F ) for as follows. j j¨ 1 É É First, the sample space ΩF includes N possible outcomes ̄rsame(x , x ) = rxj ,xj¨ (8) N M N N {!n} = {{!mn} } defined by the weights of the N same l=0 yj yj¨ l n=1 m=1 n=1 = = neurons. Since a scalar value cannot describe the feature of ¨ x, we do not take into account b for defining Ω . In terms where N and N are the total number of pairs (xj, xj ) n F diff same of , ! defines a possible feature of x. In with different labels and the same label, respectively. We n ¨ ¨ particular, the definition of the experiment guarantees that observe that r (xj, xj ) is around 0.29 and r (xj, xj ) diff same the possible outcomes are mutually exclusive (i.e., only one is around 0.43 in Figure2(E). In summary, the correlation outcome will occur on each trial of the experiment). coefficients of {xj}5000 are low, thus i.i.d. can be viewed as j=1 Second, we define the event space  as the -algebra. a valid assumption for {xj}5000. j=1 For example, if f has N = 2 neurons and Ω = {!1, !2}, In terms of the correlation of activations with the same  = {ç, {!1}, {!2}, {{!1, !2}} means that neither of the label in different layers, Figure2(B)-(D) show an ascend- outcomes, one of the outcomes, or both of the outcomes ing trend as the layer is closer to the output. For instance, could happen, respectively. the pixels at the top-left corner of R ¨ becomes lighter f j ,f j Third, the probability measure PF is the Gibbs measure i i ! N x as the layer is closer to the output, i.e., the correlation of to quantify the probability of { n}n=1 occurring in . the activations with the label 0 becomes larger. In addition, Figure2(F)-(J) also demonstrate the ascending trend, i.e., 1 1 PF (!n) = exp(fn) = exp[(gn(x))] j j¨ j j¨ Z Z ̄r f , f ̄r f , f F F same( 1 1 ) converges to 0.55, same( 2 2 ) converges to (9) j j¨ 1 0.79, and ̄r (f , f ) converges to 0.84. = exp[( !n, x + bn)] same Y Y Z ⟨ ⟩ As a comparison, Figure2(B)-(D) show the correlation F of activations with different labels being relatively stable in where ⋅, ⋅ denotes the inner product and Z = ∑N exp(f ) different layers, which is further validated by Figure2(F)-(J) ⟨ ⟩ F n=1 n ¨ ¨ ¨ is the partition function. ̄r f j , f j ̄r f j , f j ̄r f j , f j showing that diff( 1 1 ), diff( 2 2 ), and diff( Y Y ) Proof. We use the mathematical induction to prove the converge to 0.29, 0.27, and 0.33, respectively. 
probability space for all the fully connected layers of the In summary, the correlation of activations with the same MLP in the backward direction. Given three probability space label becomes larger as the layer is closer to the output, thus (ΩF ,  ,PF ), (ΩF ,  ,PF ), and (ΩF ,  ,PF ) for the three activations being i.i.d. is not valid for all the layers of the 1 1 2 2 Y Y layers f1, f2, and fY , respectively, we first prove PF as MLP. In addition, we derive the same result based on more Y a Gibbs distribution, and then we prove PF and PF being complicated MLPs on the benchmark Fashion-MINST dataset 2 1 Gibbs distributions based on PF and PF , respectively. in AppendixB. As a result, non-parametric inference, e.g., Y 2 Since the output layer fY is the softmax, each output the empirical distribution and KDE, cannot correctly model node fyl can be formulated as the true distribution of all the layers, thus they are invalid for estimating the mutual information between each layer and K 1 1 É the input/labels. Notably, this section further confirms the f = exp(g ) = exp[ !(3) ⋅ f + b ] yl Z yl Z kl 2k yl necessity for establishing a slid probabilistic foundation for FY FY k=1 (10) deriving information theoretic explanations for DNNs. 1 (3) = exp[⟨!l , f2⟩ + byl], ZF 3.2. Probabilistic explanations for MLPs Y This section proposes three probabilistic explanations for ∑L where ZF = l=1 exp(gyl) is the partition function and MLPs: (i) we define the probability space (Ω , ,P ) for a Y F  F !(3) = {!(3)}K . Comparing Equation9 and 10, we can fully connected layer, (ii) we prove the entire architecture of l kl k=1 derive that f forms a Gibbs distribution P (!(3)) = f MLPs as a Gibbs distribution P , and (iii) we demonstrate Y FY l yl that the back-propagation aims to optimize the sample space (3) to measure the probability of !l occurring in f2, which is of each layer to learn an optimal Gibbs distribution P ∗ for consistent with the definition of (Ω , ,P ). FY  FY describing the statistical connection between X and Y .

Xinjie Lan et al.: Preprint submitted to Elsevier Page 5 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Based on the properties of exponential functions, i.e., Therefore, P still can be modeled as a PoE model F2 exp(a + b) = exp(a) ⋅ exp(b) and exp(a ⋅ b) = [exp(b)]a, we (3) N can reformulate PF (! ) as 1 Ç 1 !(2) K Y l P = { [ exp(f )] nk } , (16) F2 ¨¨ 1n k=1 Z ZF K F2 n=1 1 (3) 1 Ç !(3) P (! ) = [exp(f )] kl , (11) (2) FY l ¨ 2k ! Z ¨¨ ∏N nk FY k=1 where Z = ZF ∕[exp(b2k) ⋅ Z ] and the partition F2 2 n=1 F1 ∑N (2) ¨ K function is Z = exp(f ). Similar to P (! ), we where Z = Z ∕exp(b ). Since {exp(f )} are scalar, F1 n=1 1n F2 k F FY yl 2k k=1 Y can derive the probability measure of f1 as we can introduce a new partition function Z = ∑K exp(f ) F2 k=1 2k 1 K (1) 1 such that { exp(f2k)} becomes a probability measure, PF (!n ) = exp(f1n) ZF k=1 1 Z 2 F1 (3) thus we can reformulate P (! ) as a PoE model M FY l 1 É (1) = exp[1( !mn ⋅ xm + b1n)] (17) K Z F1 m=1 (3) 1 Ç 1 !(3) P (! ) = [ exp(f )] kl , (12) FY l ¨¨ 2k 1 (1) Z ZF = exp[ ( ! , x + b )], FY k=1 2 Z 1 ⟨ n ⟩ 1n F1 K !(3) ¨¨ ∏ kl ∑N where Z = ZF ∕[exp(byl) ⋅ k [ZF ] ], especially each where Z = exp(f ) is the partition function and FY Y =1 2 F1 n=1 1n 1 (1) (1) M expert is defined as exp(f2k). !n = {!mn} . We can conclude that f corresponds to ZF m=1 1 2 (1) It is noteworthy that all the experts { 1 exp(f )}K a Gibbs distribution PF (!n ) to measure the probability of Z 2k k=1 1 F2 !(1) occurring in x, which is consistent with (Ω , ,P ). form a probability measure and establish an exact one-to-one n F1  F1 Overall, we prove the proposed probability space for all the correspondence to all the neurons in f2, thus the distribution of f can be expressed as fully connected layers in the MLP. Notably, we can easily 2 extend the probability space to an arbitrary fully connected 1 K layer through properly changing the script. PF = { exp(f2k)}k=1. (13) 2 Z Based on (ΩF , ,PF ) for a fully connected layer f, we F2  can specify the corresponding random variable F ∶ ΩF → ∑N (2) K E N Since {f =  ( ! ⋅ f + b )} , P can be F . More specifically, since ΩF = {!n}n includes finite 2k 2 n=1 nk 1n 2k k=1 F2 =1 extended as N possible outcomes, EF is a discrete measurable space and F is a discrete random variable, e.g., P (F = n) denotes the N 1 É probability of !n occurring in the experiment. P (!(2)) = exp[ ( !(2) ⋅ f + b )] F2 k Z 2 nk 1n 2k F2 n=1 (14) 3.2.2. The probabilistic explanation for the entire 1 = exp[ ( !(2), f + b )]. architecture of the MLP Z 2 ⟨ k 1⟩ 2k F2 Since ΩF is defined by !mn, it is fixed if not considering parameters updating. Therefore, Fi+1 is entirely determined where Z = ∑K exp(f ) is the partition function and by F in the MLP = {x; f ; f ; f } without considering the F2 k=1 2k i 1 2 Y (2) (2) N back-propagation training, and the MLP forms the Markov ! = {! } . We can conclude that f2 corresponds to k nk n=1 chain X ↔ F ↔ F ↔F , thus the entire architecture of a Gibbs distribution P (!(2)) to measure the probability of 1 2 Y F2 k the MLP corresponds to a joint distribution (2) !k occurring in f1, which is consistent with (ΩF ,  ,PF ). 2 2 P (F ,F ,F X) = P (F F )P (F F )P (F X). (18) Due to the non-linearity of the activation function 2(⋅), Y 2 1ð Y ð 2 2ð 1 1ð we cannot derive P (!(2)) being a PoE model only based F2 k Subsequently, we can derive the marginal distribution on the properties of exponential functions. 
Alternatively, P (F X) still being a Gibbs distribution the equivalence between the gradient descent algorithm and Y ð the first order approximation (Battiti, 1992) indicates that K N N (2) É É ∑ PF X(l x) = P (FY = l, F2k, F1 = n X = x) 2( n=1 !nk ⋅ f1n + b2k)] can be approximated as Y ð ð ð k=1 n=1 (19) N N 1 É (2) É (2) = exp[fyl(f2(f1(x)))], 2( !nk ⋅f1n+b2k)] ≈ C21⋅[ !nk ⋅f1n+b2k]+C22, (15) ZMLP(x) n=1 n=1 where Z (x) = ∑L ∑K ∑N P (l, k, n x) is MLP l=1 k=1 n=1 FY ,F ,F X ð C C f N 2 1ð where 21 and 22 only depend on the activations { 1n}n=1 the partition function. Notably, the energy function Eyl(x) = in the previous training iteration, thus they can be regarded −fyl(f2(f1(x))) indicates that P (FY X) is determined by (2) ð as constants and absorbed by !nk and b2k. The proof for the the entire architecture of the MLP. The detailed derivation approximation is included in AppendixC. of P (FY ðX) is presented in AppendixD.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 6 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Ω Ω Ω

(�) (�) (�) (�) � … � (�) … (�) … � � � �� �� �� � ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

(�) (�) (�) … (�) (�) (�) � � �� … �� � … � � � ��

ℓ �

� �(�|�) �(�|�) �(�|�)

Figure 3: The probabilistic explanation for the MLP = {x; f1; f2; fY } and the training algorithm. l denotes the loss function. P (FiðFi−1) is the distribution of the layer fi given its previous layer. The oval above P (Fi) represents the corresponding sample (1) N (1) M space Ωi, which consists of possible outcomes defined by the weights of neurons. For example, Ω1 = {!n }n=1 and !n = {!mn}m=1.

3.2.3. Probabilistic explanations for training we can reformulate )l as )!(2) The back-propagation training (Rumelhart et al., 1986) nk updates the parameters of a hidden layer in the back-forward L ¨ )l É )l  (g2k) direction. In the MLP = {x; f1; f2; fY }, the weights of = !(3) 2 f . (22) (2) (3) kl 1n each layer are updated as f2k )!nk l=1 )!kl

)l )l )l !(T + 1) = !(T ) − (20) Equation 22 indicates is a function of , thus ΩF is )!(2) )!(3) 2 )!(T ) nk kl a function of ΩY . Similarly, we can derive where !(T ) denotes the learned weights in the T th training iteration, l is the cross entropy loss function, and is the K ¨ )l É )l  (g1n) = !(2) 1 x , (23) learning rate. Specifically, the gradient of l with respect to (1) (2) nk m f1n the weight of each layer in the MLP are formulated as )!mn k=1 )!nk which indicates that Ω is a function of Ω . )l F1 F2 = [f − P (l x)]f , In addition, we can derive Ω , Ω , and Ω depending (3) yl Y ðX ð 2k F1 F2 FY )!kl on the input x. Equation (21) shows that the gradient of l L with respect to the weight of a layer is a function of the input )l É f P l x !(3)¨ g f , )l )l = [ yl − Y X( ð )] ( 2k) 1n of the layer, i.e., (1) is a function of xm, (2) is a function (2) ð kl 2 )! )! )! l=1 mn nk nk )l K L of f n, and is a function of f k. f n =  [g n(x)] and 1 )!(3) 2 1 1 1 )l É É (3) ¨ (2) ¨ kl = [f − P (l x)]!  (g )!  (g )x . )l )l )l (1) yl Y ðX ð kl 2 2k nk 1 1n m f k =  [g k(f1)] imply that (1) , (2) , and (3) depend )!mn k=1 l=1 2 2 2 )! )! )! mn nk kl (21) on the input x. As a result, Ω , Ω , and Ω depend on F1 F2 FY the input x, because Ω is determined by )l . where P (l x) is the of the label )!(t) Y ðX ð y given the input x, i.e., P (l x) = 1 if l = y, otherwise In summary, the back-propagation training establishes Y ðX ð P (l x) = 0. The derivation is presented in AppendixE. the relation between the sample space of each layer and the Y ðX ð input/labels in two aspects. First, Ω , Ω , and Ω depend Since weights are randomly initialized before training, F1 F2 FY on y, i.e., Ω is a function y and Ω is a function of Ω . i.e., !(0) are random values, !(T +1) are entirely determined FY Fi Fi+1 )l T Second, Ω , Ω , and Ω depend on x. by all the gradients before T + 1, i.e., { } . Therefore, F1 F2 FY )!(t) t=1 Finally, we visualize the probabilistic explanation for the we conclude that Ω is determined by { )l }T because F )!(t) t=1 MLP in Figure3. The blue arrows indicate that the three ΩF is defined by !. layers f 1, f 2, and f Y form three conditional distributions, As a result, we can derive that Ω is a function y and i.e., P (F X), P (F F ), and P (F F ), respectively. The FY 1ð 2ð 1 Y ð 2 )l green arrows indicate that the back-propagation optimize the ΩF is a function of ΩF . First, since is a function of i i+1 )!(3) kl three sample space Ω1, Ω2, and ΩY by updating the weights P (l x), Ω can be viewed as a function of y based on Y ðX ð FY of each layer in the backward direction. In addition, the black the definition of P (l x). Second, based on Equation 21, Y ðX ð arrows indicate the mutual effect between the sample space and the corresponding Gibbs distributions.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 7 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

� F F F �

H(�) Figure 5: The information flow of X and Y in the MLP. H(�) I(�, �) Section 3.2.3 demonstrates that Ω depends on both the I(�, �) |�) Fi (� H input x and the label y, thus I(Fi, X) ≠ 0 and I(Fi,Y ) ≠ 0. Since all the information of Y stems from X, I(Fi,Y ) is a subset of I(F , X), which is shown in Figure (4). H � = + + + I �, � = i I �, � = H � = + + + 4.3. The limitation of IB H � = + I �, � = � �, � − � �, � = IB assumes that F does not contain any information about Figure 4: The Venn diagram shows the relationship between Y except the information given by X, i.e., P (F ðX,Y ) = the information of X, Y , and Fi. Since P (X,Y ) is unknown, P (F ðX). Supposing MLPs satisfy the probabilistic premise, the information of P (X,Y ) is denoted by the largest oval with Shwartz-Ziv and Tishby(2017) propose the Markov chain dashed boundary. To facilitate subsequent discussions, we (Equation3) and two DPIs (Equation4) for the MLP. still use H(X) to denote the information of the i.i.d. samples However, we demonstrate Fi ∶ ΩF → EF depending on {xj }J , because the information of the samples converges to i i j=1 both X and Y , because Section 3.2.3 show Ω depending on H as long as the number of samples is large enough. Fi (X) both x and y. As a result, MLPs not satisfy the probabilistic premise for IB if taking into account the back-propagation. Notably, the information that Y transfers to Fi during 4. Novel information theoretic explanations training will retain in Ω after training, because Ω is fixed Fi Fi Based on the probabilistic representations for MLPs, we after training. In other words, MLPs still cannot satisfy the propose five information theoretic explanations for MLPs. probabilistic premise for IB even after training. Therefore, First of all, the entropy of a fully connected layer is finite. the information flow of X and Y in MLPs cannot satisfy the Second, we specify the information theoretic relationship DPIs (Equation4) derived from IB after training. between X, Y and Fi. Third, IB cannot correctly explain MLPs because MLPs not satisfy the probabilistic premise 4.4. The information flow of Y and X Section 3.2.3 shows that Ω is a function y, Ω is a for IB. Forth, we specify the information flow of X and Y FY F2 function of Ω , and Ω is a function of Ω in the MLP. in MLPs. Fifth, we propose a novel information theoretic FY F1 F2 Based on the definition of F ∶ Ω → E , we can derive explanation for the generalization of MLPs. Fi Fi Fi that FY is a function Y , F2 is a function of FY , and F1 is 4.1. The entropy of a layer is finite a function of F2, which indicates the Markov chain Y ↔ A controversy about information theoretic explanations FY ↔ F2 ↔ F1. As a result, the information flow of Y in for MLPs is that the random variable F ∶ Ω → E for the MLP can be expressed as i Fi Fi a fully connected layer f is continuous or discrete (Gold- i I(Y,F ) I(Y,F ) I(Y,F ). (24) feld et al., 2019). All the previous works assume activations 1 ≤ 2 ≤ Y as Ei, thus Fi is continuous and H(FiðX) = −∞ under the Since X is the input of the MLP, the information of X assumption that MLPs are deterministic models, which con- seems to flow in the forward direction in the MLP (the blue tradicts simulation results, i.e., H(FiðX) < ∞. arrows in Figure5). However, all the information of Y stems The definition of (Ω , ,P ) resolves the controversy. Fi  Fi from X (the red dashed arrow in Figure5) and flows in the Since Ω is discrete, F is discrete, thereby H(F X) < ∞. 
Fi i ið backward direction (the green arrows in Figure5) imply the In particular, the Gibbs measure P regards activations of f Fi i information of X flows in both the forward and the backward as the negative energy, i.e., activations are the intermediate directions, i.e., it cannot satisfy any DPI in the MLP. variables of P , rather than E . Fi Fi 4.5. A novel information theoretic explanation for Y F 4.2. The relationship between X, and i the generalization of MLPs Since we suppose (xj, yj) ∈ being i.i.d., we can derive  In terms of deep learning, generalization indicates the H(Y ) = I(X,Y ) (the proof is presented in AppendixF), ability of neural networks adapting to new data, which does which indicates that all the information of Y stems from X, not belong to the training dataset but is drawn from the i.e., the information of Y is a subset of that of X in the Venn  same distribution P (X,Y ). Based on the above information diagram of X, Y and F (Figure4). i theoretic explanations, we propose an information theoretic Since the weights of f are randomly initialized, F does i i explanation for the generalization of MLPs. Specifically, the not contain any information of X and Y before training. In performance of the MLP on the training dataset  can be addition, we cannot guarantee that all the information of Fi measured by I(Y,FY ), and the generalization performance is learned from  after training, thus H(FiðX) ≠ 0. ̄ of the MLP can be measured by I(X,F1).

Xinjie Lan et al.: Preprint submitted to Elsevier Page 8 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Based on H(Y ) = I(X,Y ), we can derive 5.1. The estimation of I(X,Fi) Based on the definition of mutual information, we have ̄ I(X,Fi) = I(Y,Fi) + I(X,Fi), (25) I(X,F ) = H(F ) − H(F X). (30) where X̄ = Y c ∩ X is the relative complement Y in X. The i i ið information theoretic relationship is shown in Figure4. Notably, all the previous works estimate I(X,Fi) as H(Fi), The performance the MLP on the training dataset can  because IB supposes that Fi is entirely determined by X, be measured by I(Y,F ). Since we prove the MLP as the Y namely H(FiðX) = 0. However, we have H(FiðX) ≠ 0 Gibbs distribution P (FY ðX) for learning P (Y ðX) in Section in Section 4.2. As a result, we should take into account 3.2.2 and 3.2.3, we have P (yj xj) = P (yj xj) = 12 FY ðX ð Y ðX ð H(FiðX) for precisely estimating I(X,Fi). if the cross entropy loss decreases to zero. As a result, we The key to estimating I(X,Fi) is specifying P (FiðX) have H(FY ðX) = 0, thereby and P (F ). Based on the definitions of (Ω , ,P ) and the i Fi  Fi random variable Fi ∶ ΩF → EF (Section 3.2.1), we have I(FY ,X) = H(FY ) − H(FY ðX) = H(FY ). (26) i i 1 To derive H(FY ), we can reformulate P (FY ) as P (n xj) = exp[ ( !(i), f (xj) +b )]. (31) FiðX ð Z i ⟨ n 1→(i−1) ⟩ in É Fi P (FY ) = P (FY ðX = x)P (X = x). (27) x j ∈ where f i has N neurons, i.e., n ∈ [1,N], and f 1→(i−1)(x ) P X x P X is the input of f i, i.e., the output of the hidden layers from Though ( = ) is intractable due to ( ) is unknown, j P F the first one to (i − 1)th one given x . More specifically, we can simplify ( Y ) as follows when  includes large j j J the P (n x ) corresponding to the three fully connected enough i.i.d. samples, i.e., {x } ≈ . FiðX ð j=1 layers in the MLP can be expressed as 1 É P (F = yj) = P (F = yj X = xj). (28) 1 Y Y ð P n j  (1), j b J j F X( ðx ) = exp[ 1(⟨!n x ⟩ + 1n)] x ∈ 1ð Z F1 j j 1 where P (X = x ) = 1∕J because x are i.i.d. samples. P (k xj) = exp[ ( !(2), f (xj) + b )] j j j j F2ðX ð 2 ⟨ k 1 ⟩ 2k (32) If P (y x ) = P (y x ) = 1, we derive P (F ) = ZF FY ðX ð Y ðX ð Y 2 P (Y ) and H(FY ) = H(Y ). Overall, if the cross entropy loss j 1 (3) j PF X(l x ) = exp[ ! , f 2(f 1(x )) + byk] decreases to zero, I(X,FY ) = H(Y ), i.e., FY contains all Y ð ð Z ⟨ l ⟩ FY the information of Y , otherwise I(X,FY ) < H(Y ). The generalization performance of the MLP can be mea- In addition, we derive the marginal distribution PF (n) ̄ i sured by I(X,F1). First, H(Y ) = I(X,Y ) implies that the from the joint distribution P (Fi, X) as generalization of the MLP is entirely determined by how É much information of X the MLP has. Second, when the P (Fi = n) = P (FY = n, X = x) cross entropy loss decreases to zero, I(Y, MLP) achieves x∈  (33) the maximum H(Y ) in f , thus I(X, MLP) only depends É Y = P (X = x)P (FY = nðX = x). I X,̄ on ( MLP) based on Equation 25. In other words, if x∈ I(X,̄ MLP) is large, then I(X, MLP) is large and the MLP has good generalization. Since P (X) is unknown and  →  as long as  includes The information flow of X̄ in the MLP satisfies a DPI. large enough i.i.d. samples, we relax P (Fi = n) as Specifically, X̄ does not contain information of Y , thus it É j j cannot flow in the backward direction, i.e., it can only flow P (Fi = n) = P (X = x )P (Fi = nðX = x ) xj ∈ in the forward direction (the blue arrows in Figure5). We  (34) introduce a Markov chain X̄ ↔ F ↔ F ↔ F and the 1 É j 1 2 Y = P (Fi = nðX = x ). 
corresponding DPI is J j x ∈ ̄ ̄ ̄ I(X,F1) I(X,F2) I(X,FY ), (29) We can observe that P (n) measures the average prob- ≥ ≥ Fi ̄ ̄ ability of !(i) occurring in the entire dataset {xj}J , and thus I(X, MLP) can be simplified as I(X,F1). n j=1 In summary, the performance the MLP on the training P (n xj) measures the probability of !(i) occurring in FiðX ð n dataset  can be measured by I(Y,FY ), and the generaliza- the single data xj. Based on the equivalence between the ̄ tion of the MLP can be measured by I(X,F1). Kullback-Leibler (KL) divergence and mutual information, i.e., I(X,Fi) = EXKL[P (FiðX)ððP (Fi)] (Cover and Thomas, 5. The mutual information estimation 2006), we can conclude that I(X,Fi) would be small if the (i) N probability of {!n } occurring in each single data is close In this section, we estimate the mutual information I(Fi, X) n=1 (i) N and I(F ,Y ) based on the probability space (Ω , ,P ). to that of {!n } occurring the entire dataset, otherwise i Fi  Fi n=1 I(X,Fi) would be large. 2P (yj xj ) = P (F = yj X = xj ) for simplicity. FY ðX ð Y ð

Xinjie Lan et al.: Preprint submitted to Elsevier Page 9 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Image0 (label [1,0]) Image1 (label [0,1]) Image2 (label [1,0]) Image3 (label [0,1]) 1.5 1.5 1.5 1.5 1.5 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.5 1.5 1.5 1.5 1.5 (A) (B) (C) (D) (E)

Figure 6: (A) shows the deterministic image x̂ . All the synthetic images x are generated by rotating x̂ and adding the Gaussian noise  (x, 0.1), i.e., x = r(x̂ ) +  (x, 0.1), where r(⋅) defines totally four different ways to rotate x̂ and x is the expectation of x. Specifically, Image0 is the synthetic image generated by adding  (x, 0.1) without rotation, Image1 is the synthetic image generated by rotating x̂ along the secondary diagonal direction and adding  (x, 0.1), Image2 is the synthetic image generated by rotating x̂ along the vertical direction and adding  (x, 0.1), and Image3 is the synthetic image generated by rotating x̂ along the horizontal direction and adding  (x, 0.1). The four images are categorized into two different classes: Image0 and Image2 with label [1, 0], and Image1 and Image3 with label [0, 1].

5.2. The estimation of I(Fi, Y ) 6.1. Setup Based on the definition of mutual information, we have We generate a synthetic dataset consisting of 256 32×32 grayscale images based on the deterministic image x̂ shown I(Y,Fi) = H(Fi) − H(FiðY ). (35) in Figure6(A). A synthetic image x is generated by rotating 2 x̂ and adding the Gaussian noise  (x,  = 0.1), We use the same method to estimate P (Fi). However, since P (n l) is intractable, we alternatively extend it as FiðY ð x = r(x̂ ) +  (x, 0.1) (38) É P (n l) = P (n xj)P (xj l). where r(⋅) totally defines four different ways to rotate x̂ shown FiðY ð FiðX ð XðY ð (36) xj ∈ in Figure6(B)-(E), and x denotes the expectation of x. The reason for adding Gaussian noise is to avoid MLPs directly Since {xj}J is supposed to be i.i.d., P (xj l) = 1 j=1 XðY ð N(l) memorize the deterministic image. if yj = l, otherwise P (xj l) = 0, where N(l) denotes the The synthetic dataset evenly consists of 64 images with XðY ð number of samples with the label l. As a result, we have the four different rotation ways shown in Figure6(B)-(E). Compared to benchmark datasets with complicated features, 1 É P (n l) = P (n xj), (37) the synthetic dataset only has four simple features, namely FiðY ð N(l) FiðX ð the four different rotation ways. As a result, we can clearly xj ∈ ,yj =l  demonstrate the proposed probabilistic explanations for MLPs which measures the probability of !(i) occurring in the entire by visualizing the weights of MLPs. n Compared to benchmark datasets with unknown entropy, dataset with the label l. Finally, we can derive I(Y,Fi) based on Equation 34 and 37. the entropy of the synthetic dataset is known. If we do not take into account the additive Gaussian noise, the entropy Similarly, based on I(Y,Fi) = EY KL[P (FiðY )ððP (Fi)], we can conclude that I(Y,F ) would be small if the proba- of the synthetic dataset would be exactly 2 bits. Since the i noise is Gaussian, the differential entropy of the noise is bility of {!(i)}N occurring in the dataset with each label n n=1 1 log(2e2) ≈ 0.38 bits. Therefore, the total entropy of is close to that of {!(i)}N occurring in the entire dataset, 2 n n=1 the synthetic dataset is 2.38 bits, because the additive noise otherwise I(Y,Fi) would be large. is independent on the rotation ways. Since the labels [1, 0] In summary, we introduces a new method to estimate and [0, 1] evenly divide the synthetic dataset into two classes, I(X,Fi) and I(Y,Fi) in this section based on the definitions the entropy of the labels is 1 bit. As a result, the synthetic of (Ω , ,P ) and the random variable F ∶ Ω → E . Fi  Fi i Fi Fi dataset enables us to precisely examine the existing and the proposed information theoretic explanations for MLPs.

6. Experiments 6.2. The simulations on the synthetic dataset In this section, we present two set of experiments based This section demonstrates five aspects: (i) the probabil- on the synthetic dataset and benchmark datasets to demon- ity space (Ω , ,P ) for a fully connected layer f ; (ii) the Fi  Fi i strate the probabilistic representation and the information effect of an activation function  (⋅) on P ; (iii) the effect i Fi theoretic explanations for MLPs in Section3,4, and5. All of i(⋅) on H(Fi), I(X,Fi), and I(Y,Fi); (iv) the informa- 3 the simulation codes are available online . tion flow in the MLP, i.e., the variation of I(X,Fi), I(Y,Fi), ̄ 3https://github.com/EthanLan/DNN_Information_theory and I(X,Fi) over different layers, and (v) the comparison of the proposed mutual information estimator and two existing non-parametric estimators on the synthetic dataset.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 10 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

(1) (1) (1) (1) 1 2 3 4 0.4 0.4 0.4 0.3 0.2 0.2 0.2 0.2 0.1 0.0 0.0 0.0 0.0 0.1 0.2 0.2 0.2 0.2 0.4 0.4 0.3 0.4 (1) (1) (1) (1) 5 6 7 8 0.3 0.3 0.4 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.4 0.3

(1) 8 (1) (1) 1024 Figure 7: The eight possible outcomes {!n }n=1 represented by the learned weights of the eight neurons, where !n = {!mn}m=1 . (1) 1024 All the weights {!mn}m=1 are reshaped into 32 × 32 dimension for visualizing the spatial structure.

Table 1

The Gibbs measure of the first hidden layer f1 given the synthetic images in Figure6

!(1) !(1) !(1) !(1) !(1) !(1) !(1) !(1) 1 2 3 4 5 6 7 8

g1n(x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 f1n(x) 45.3 215.7 206.2 0.0 0.0 137.1 0.0 0.0 exp[f1n(x)] 4.71e+19 4.75e+93 3.56e+89 1.0 1.0 3.48e+59 1.0 1.0 P (!(1) Image0) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) -53.5 -217.7 -208.4 69.0 224.8 -134.6 204.1 171.3 f1n(x) 0.0 0.0 0.0 69.0 224.8 0.0 204.1 171.3 exp[f1n(x)] 1.0 1.0 1.0 9.25e+29 4.25e+97 1.0 4.36e+88 2.48e+74 P (!(1) Image1) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) 219.4 54.9 78.9 -211.3 -37.4 153.6 -106.6 -116.4 f1n(x) 219.4 54.9 78.9 0.0 0.0 153.6 0.0 0.0 exp[f1n(x)] 1.92e+95 6.96e+23 1.84e+34 1.0 1.0 5.10e+66 1.0 1.0 P (!(1) Image2) 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) -219.0 -55.9 -81.6 208.0 41.3 -159.6 111.8 122.1 f1n(x) 0.0 0.0 0.0 208.0 41.3 0.0 111.8 122.1 exp[f1n(x)] 1.0 1.0 1.0 2.15e+90 8.63e+17 1.0 3.58e+48 1.06e+53 P (!(1) Image3) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 F1ðX n ð (1) g1n(x) and f1n(x) are the linear output and the activation, respectively, where g1n(x) = ⟨!n , x⟩ + b1n and f1n = 1[g1n(x)].

To classify the synthetic dataset, we specify the MLP as First, we demonstrate the sample space Ω = {!(1)}N F1 n n=1 follows: (i) since a single image is 32 × 32, the input layer x for f1. We train the MLP on the synthetic dataset until the has M = 1024 nodes, (ii) two hidden layers f1 and f2 have training accuracy is 100% and visualize the learned weights N = 8 and K = 6 neurons, respectively, and (iii) the output !(1) !(1) 1024, n , , of the eight neurons, i.e., n = { mn}m=1 ∈ [1 8] in f L (1) layer Y has = 2 nodes corresponding to the labels of the Figure7, from which we observe that (i) ! can be regarded dataset. In addition, all the activation functions are chosen n as a possible outcome (i.e., the feature of x), e.g., !(1) has as ReLU (x) = max(0, x) unless otherwise specified. 2 low magnitude at top-left positions and high magnitude at 6.2.1. The probability space for a layer bottom-right positions, which describes the spatial feature To demonstrate the proposed probability space for all the of x = Image0 in Figure6; and (ii) the weights of different fully connected layers in the MLP = {x; f ; f ; f }, we neurons formulate different features. Though the weights of 1 2 Y (1) (1) only need to demonstrate (Ω , ,P ) for f , because we some neurons, e.g., ! and ! , are similar, they still can F1  F1 1 2 3 derive (Ω , ,P ) for each layer in the backward direction be viewed as different features, because their weights with Fi  Fi n n¨,!(1) !(1) based on the mathematical induction in Section 3.2.1. the same index are different, i.e., ∀ ≠ mn ≠ mn¨ .

Xinjie Lan et al.: Preprint submitted to Elsevier Page 11 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Table 2 The Gibbs probability P (!(1) Image0) with four different activation functions and the F1ðX n ð corresponding conditional entropy H(F1ðX = Image0)

!(1) !(1) !(1) !(1) !(1) !(1) !(1) !(1) H(F X) 1 2 3 4 5 6 7 8 1ð

g1n(x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 Linear f1n (x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 Linear exp[f1n (x)] 4.71e+19 4.75e+93 3.56e+89 5.88e-28 1.56e-97 3.48e+59 1.13e-88 2.98e-75 P (F1ðX) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 ReLU f1n (x) 45.3 215.7 206.2 0.0 0.0 137.1 0.0 0.0 ReLU exp[f1n (x)] 4.71e+19 4.75e+93 3.56e+89 1.0 1.0 3.48e+59 1.0 1.0 P (F1ðX) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 Tanh f1n (x) 1.0 1.0 1.0 -1.0 -1.0 1.0 -1.0 -1.0 Tanh exp[f1n (x)] 2.71 2.71 2.71 0.36 0.36 2.71 0.36 0.36 P (F1ðX) 0.22 0.22 0.22 0.03 0.03 0.22 0.03 0.03 2.53 Sigmoid f1n (x) 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 Sigmoid exp[f1n (x)] 2.71 2.71 2.71 1.0 1.0 2.71 1.0 1.0 P (F1ðX) 0.18 0.18 0.18 0.07 0.07 0.18 0.07 0.07 2.84 Linear ReLU Tanh Sigmoid f1n (x) denotes the activation without activation function. f1n (x), f1n (x), and f1n (x) denote the activations with different activation functions given the same linear output g1n(x). H(F1ðX) denotes H(F1ðX = Image0) for simplicity.

Second, we demonstrate the Gibbs measure P_{F_1} for f_1. Based on Equation 17, we derive P_{F_1|X}(ω_n^{(1)}|x) given the four images in Figure 6(B)-(E). Table 1 shows that P_{F_1} correctly measures the probability of {ω_n^{(1)}}_{n=1}^{8}. For instance, ω_2^{(1)} correctly describes the feature of Image0, thus it has the largest linear output g_12(x) = 215.7 and activation f_12(x) = 215.7, thereby P_{F_1|X}(ω_2^{(1)}|Image0) = 1. As a comparison, ω_5^{(1)} incorrectly describes the feature of Image0, thus it has the lowest linear output g_15(x) = -222.9 and activation f_15(x) = 0.0, so P_{F_1|X}(ω_5^{(1)}|Image0) = 0.

6.2.2. The effect of activation functions on the Gibbs probability measure P_F

To demonstrate the effect of activation functions on the Gibbs measure, we examine P_{F_1|X=x}(ω_n^{(1)}|Image0) in four different cases: (i) the linear activation function σ_1(x) = x, (ii) ReLU σ_1(x) = max(0, x), (iii) the hyperbolic tangent function (abbr. Tanh) σ_1(x) = (e^x - e^{-x})/(e^x + e^{-x}), and (iv) the sigmoid function σ_1(x) = 1/(1 + e^{-x}), in Table 2.

ReLU guarantees an accurate Gibbs measure because it only sets negative linear outputs to zero. For instance, ω_5^{(1)} is an irrelevant feature of Image0: Table 2 shows f_15^Linear(x) = -222.9 and exp[f_15^Linear(x)] = 1.56e-97. As a comparison, if we use ReLU, f_15^ReLU(x) = 0.0 and exp[f_15^ReLU(x)] = 1.0. The difference between exp[f_15^ReLU(x)] and exp[f_15^Linear(x)] is small, thus P^ReLU_{F_1|X}(ω_5^{(1)}|Image0) should also be close to P^Linear_{F_1|X}(ω_5^{(1)}|Image0), which is validated by the experiment, namely P^ReLU_{F_1|X}(ω_5^{(1)}|Image0) = P^Linear_{F_1|X}(ω_5^{(1)}|Image0) = 0.0. We observe similar results on other neurons in Table 2.

Tanh cannot guarantee an accurate Gibbs measure because it decreases the difference between the activation of relevant features and that of irrelevant features. For instance, ω_2^{(1)} is a relevant feature of Image0 with f_12^Linear(x) = 215.7, and ω_1^{(1)} is an irrelevant feature of Image0 with f_11^Linear(x) = 45.3, thus |f_12^Linear(x) - f_11^Linear(x)| = 170.4. As a comparison, if we use Tanh, |f_12^Tanh(x) - f_11^Tanh(x)| = 0.0, thus P^Tanh_{F_1|X}(ω_1^{(1)}|Image0) = P^Tanh_{F_1|X}(ω_2^{(1)}|Image0) = 0.22. In other words, we cannot distinguish ω_2^{(1)} and ω_1^{(1)} based on Tanh.

Sigmoid cannot guarantee an accurate Gibbs measure for the same reason. In particular, since Sigmoid confines activations to the smaller range [0, 1], it further decreases the difference between the activation of relevant features and that of irrelevant features. For instance, if we use Tanh, |f_12^Tanh(x) - f_15^Tanh(x)| = 2.0. As a comparison, if we use Sigmoid, |f_12^Sigmoid(x) - f_15^Sigmoid(x)| = 1.0. Consequently, |P^Sigmoid_{F_1|X}(ω_2^{(1)}|Image0) - P^Sigmoid_{F_1|X}(ω_5^{(1)}|Image0)| = 0.11 becomes smaller than |P^Tanh_{F_1|X}(ω_2^{(1)}|Image0) - P^Tanh_{F_1|X}(ω_5^{(1)}|Image0)| = 0.19, i.e., it becomes more difficult to distinguish ω_2^{(1)} and ω_5^{(1)} based on Sigmoid.

The experiment provides a probabilistic explanation for the limitation of saturating (i.e., bounded) activation functions (e.g., Tanh and Sigmoid) for training neural networks (Glorot and Bengio, 2010). Since saturating activation functions confine activations to a very small range and decrease the difference between the activation of relevant features and that of irrelevant features, they make it difficult to distinguish relevant features from irrelevant ones. As a result, neural networks with saturating activation functions require more computation cost, e.g., more training time or more hidden layers, to achieve the same training result.
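To make the effect of the activation function on P_{F_1} concrete, the short sketch below (not part of the original experiments) recomputes the Gibbs measure for the linear outputs of Image0 listed in Table 2, reading Equation 17 as a softmax over the eight neurons of f_1, i.e., P_{F_1|X}(ω_n^{(1)}|x) = exp[f_1n(x)] / Σ_m exp[f_1m(x)]; the g_1n(x) values are copied from Table 2, and everything else is an illustrative assumption. Up to rounding, it reproduces the probabilities and the conditional entropies of Table 2 (e.g., 0.22/0.03 and 2.53 bits for Tanh).

```python
import numpy as np

# Linear outputs g_1n(Image0) for the eight neurons of f_1 (values from Table 2).
g = np.array([45.3, 215.7, 206.2, -62.7, -222.9, 137.1, -202.5, -171.6])

activations = {
    "Linear":  lambda z: z,
    "ReLU":    lambda z: np.maximum(0.0, z),
    "Tanh":    np.tanh,
    # sigmoid written via tanh to avoid overflow for very negative linear outputs
    "Sigmoid": lambda z: 0.5 * (1.0 + np.tanh(0.5 * z)),
}

def gibbs_measure(f):
    """P_{F1|X}(omega_n | x) = exp(f_1n) / sum_m exp(f_1m)  (our reading of Equation 17)."""
    e = np.exp(f - f.max())          # subtract the maximum for numerical stability
    return e / e.sum()

for name, sigma in activations.items():
    p = gibbs_measure(sigma(g))
    h = -np.sum(p * np.log2(p + 1e-12))   # H(F1 | X = Image0) in bits
    print(f"{name:8s} P = {np.round(p, 2)}  H(F1|X=Image0) = {h:.2f}")
```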


Table 3: The distribution P(F_1) based on different activation functions and their respective H(F_1), I(X,F_1), and I(Y,F_1).

                     Linear   ReLU   Tanh   Sigmoid
P(F_1 = ω_1^(1))     0.25     0.25   0.12   0.12
P(F_1 = ω_2^(1))     0.25     0.25   0.12   0.13
P(F_1 = ω_3^(1))     0.00     0.00   0.13   0.12
P(F_1 = ω_4^(1))     0.25     0.25   0.13   0.13
P(F_1 = ω_5^(1))     0.25     0.25   0.12   0.13
P(F_1 = ω_6^(1))     0.00     0.00   0.12   0.13
P(F_1 = ω_7^(1))     0.00     0.00   0.13   0.12
P(F_1 = ω_8^(1))     0.00     0.00   0.13   0.12
H(F_1)               2.00     2.00   3.00   3.00
I(X,F_1)             2.00     2.00   0.47   0.16
I(Y,F_1)             1.00     1.00   0.35   0.16

Table 4: The number of neurons (nodes) of each layer and the activation function of all the layers in the three MLPs.

        x      f_1   f_2   f_Y   σ(·)
MLP1    1024   8     6     2     ReLU
MLP2    1024   8     6     2     Tanh
MLP3    1024   1     6     2     ReLU

In summary, activation functions have a great effect on the Gibbs measure of a fully connected layer. Consequently, a fully connected layer with different activation functions should have different entropy and mutual information, which is discussed in the next section.

6.2.3. The effect of activation functions on H(F_1), I(X,F_1), and I(Y,F_1)

Since Gaussian noise is not helpful for classifying the synthetic dataset, the upper bound of I(X,F_1) is 2. Since the label evenly divides the entire dataset into two groups, the upper bound of I(Y,F_1) is H(Y) = 1. In addition, since each synthetic image only has one feature, H(F_1|X = x) should be close to zero if f_1 models the input precisely. Table 2 summarizes H(F_1|X = Image0) given different activation functions: H(F_1|X = Image0) = 0 given ReLU, and H(F_1|X = Image0) > 2 given Tanh and Sigmoid. We can conclude that f_1 with Tanh or Sigmoid does not contain much information of Image0.

Table 3 summarizes I(X,F_1) given different activation functions based on Equation 30. I(X,F_1) = H(X) = 2.0 given ReLU indicates that f_1 with ReLU contains all the information of the entire dataset. In contrast, I(X,F_1) = 0.47 given Tanh indicates that f_1 with Tanh does not contain much information of the entire dataset.

In addition, we derive I(Y,F_1) based on Equation 35. I(Y,F_1) = H(Y) = 1.0 given ReLU indicates that f_1 with ReLU contains all the information of the labels. As a comparison, I(Y,F_1) = 0.35 given Tanh indicates that f_1 with Tanh only contains partial information of the labels.

6.2.4. The information flow in the MLPs

In this section, we demonstrate the proposed information theoretic explanations for the MLP in Section 4. We design three MLPs, namely MLP1, MLP2, and MLP3. The difference between MLP1 and MLP2 is the activation function, and the difference between MLP1 and MLP3 is the number of neurons in f_1; both are summarized in Table 4.

All the weights of the three MLPs are randomly initialized by a uniform distribution unless otherwise specified. We choose the Adam algorithm (Kingma and Ba, 2014), a variant of Stochastic Gradient Descent (SGD), to learn the weights of the three MLPs on the entire synthetic dataset over 1000 epochs with the learning rate 0.01.

Based on the synthetic dataset and the learned weights at each epoch, we derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively. To keep consistent with previous works (Chelombiev et al., 2019), we train the MLPs with 50 random initializations and use the averaged mutual information to indicate the information flow in the MLPs. Figure 8(A) and 8(E) show the variation of the cross entropy loss and the training error of MLP1 and MLP2, respectively. Figure 8(B)-8(D) and 8(F)-8(H) show the information flow in MLP1 and MLP2, respectively.

Since all the weights of the MLPs are randomly initialized, F_i is initially independent of X and Y. As a result, I(X,F_i) and I(Y,F_i) should initially be close to zero in MLP1 and MLP2, which is validated in Figure 8(B)-8(C) and 8(F)-8(G). As training continues, Figure 8(B)-8(C) and 8(F)-8(G) show that I(X,F_i) and I(Y,F_i) quickly converge to fixed values. Specifically, Figure 8(B) and 8(F) show I(X,F_1) in MLP1 converging to 2 bits and I(X,F_1) in MLP2 converging to 0.47 bits, which is consistent with the results in Table 3. Figure 8(C) and 8(G) show that MLP1 and MLP2 spend about 20 and 200 epochs, respectively, to make I(Y,F_Y) converge to H(Y) = 1 bit. That further confirms that saturating activation functions like Tanh cannot guarantee a precise Gibbs measure and require more training time to achieve the same training result.

In terms of the information flow of X in MLP1 and MLP2, Figure 8(B) shows I(X,F_1) ≥ I(X,F_2) ≥ I(X,F_Y) in MLP1 after the cross entropy loss decreases to zero. In contrast, Figure 8(F) shows I(X,F_Y) ≥ I(X,F_1) ≥ I(X,F_2) in MLP2 after the cross entropy loss decreases to zero. The results demonstrate that the information flow of X in MLPs cannot satisfy any DPI (Section 4.4).

In terms of the information flow of Y in MLP1 and MLP2, Figure 8(C) shows I(Y,F_1) = I(Y,F_2) = I(Y,F_Y) in MLP1 after the cross entropy loss decreases to zero. In addition, Figure 8(G) shows I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in MLP2 after the cross entropy loss decreases to zero. The results demonstrate that the information flow of Y in MLPs satisfies I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) (Section 4.4).
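For readers who want to reproduce numbers of the kind shown in Table 3, the sketch below computes H(F_1), I(X,F_1), and I(Y,F_1) from the per-image Gibbs measures of a layer, assuming that Equations 30 and 35 reduce to the standard discrete identities I(X,F_1) = H(F_1) - H(F_1|X) and I(Y,F_1) = H(F_1) - H(F_1|Y) with X uniform over the training images; the function names are ours and the exact estimators in the paper may differ in detail.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p + eps)))

def layer_information(p_f_given_x, labels):
    """Sketch of H(F1), I(X,F1), I(Y,F1) for one fully connected layer.

    p_f_given_x : (J, N) array, row j is the Gibbs measure P(F1 | X = x_j).
    labels      : (J,) array of class labels y_j.
    Assumes X is uniform over the J training images, so P(F1) is the average
    of the rows and conditioning on Y averages the rows within each class.
    """
    p_f_given_x = np.asarray(p_f_given_x, dtype=float)
    labels = np.asarray(labels)
    p_f = p_f_given_x.mean(axis=0)                       # marginal P(F1)
    h_f = entropy(p_f)                                    # H(F1)
    h_f_given_x = np.mean([entropy(row) for row in p_f_given_x])
    i_xf = h_f - h_f_given_x                              # I(X, F1) = H(F1) - H(F1|X)
    h_f_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        h_f_given_y += mask.mean() * entropy(p_f_given_x[mask].mean(axis=0))
    i_yf = h_f - h_f_given_y                              # I(Y, F1) = H(F1) - H(F1|Y)
    return h_f, i_xf, i_yf
```

Feeding the per-image Gibbs measures of f_1 (computed as in the previous sketch) together with the labels yields quantities of the kind reported in Table 3 under the stated uniform-X assumption.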



Figure 8: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP1 and MLP2 during training. (B)-(D) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers in MLP1 during training, respectively. Similarly, (F)-(H) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers in MLP2 during training, respectively.


Figure 9: (A) visualizes the variation of the training error and the cross entropy loss of MLP3 during training. (B)-(D) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers of MLP3 during training, respectively.

Compared to MLP1, MLP3 only has one neuron in f_1, which makes MLP3 spend more than 100 epochs minimizing the cross entropy loss to zero in Figure 9(A). More importantly, it significantly changes the information flow in MLP3. The probability space (Ω_{F_1}, ℱ, P_{F_1}) indicates that the single neuron only defines one possible outcome with 100% occurring probability, thus f_1 becomes a deterministic function and cannot transfer information to f_2 and f_Y in the forward direction, i.e., the second and third blue arrows are blocked in Figure 5. As a result, the information of X and Y can only flow into MLP3 in the backward direction.

Figure 9(B)-9(D) visualize the information flow in MLP3 and demonstrate the above theoretical discussion. First, we observe I(X,F_1) = I(Y,F_1) = 0, which validates f_1 being a deterministic function. Second, the information flows of X and Y in MLP3 satisfy I(X,F_Y) ≥ I(X,F_2) ≥ I(X,F_1) and I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in most training epochs, respectively, which validates that the information of X and Y can flow into MLP3 in the backward direction. Third, Ī(X,F_i) being very close to zero validates that all the information of each layer stems from Y based on Equation 25.

In summary, this section demonstrates the proposed information theoretic explanations in Sections 4.3 and 4.4. First, we observe three different information flows of X in the three MLPs, i.e., the information flow of X cannot satisfy any DPI in MLPs. Second, the information flow of Y in the three MLPs has a backward direction, i.e., it satisfies the DPI I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1), especially after the cross entropy loss decreases to zero. Third, we demonstrate that MLPs cannot satisfy the DPI derived from IB (Equation 4); in particular, the information of X can only flow into MLP3 in the backward direction.

To further demonstrate the proposed information theoretic explanations, the next section compares the proposed mutual information estimators (Section 5) to commonly used mutual information estimators based on two non-parametric inference methods, i.e., empirical distributions (Shwartz-Ziv and Tishby, 2017) and Gaussian KDE (Saxe et al., 2018). We use the same experimental methods and synthetic dataset as before to show the information flow of X and Y in MLP1 and MLP2 based on the three mutual information estimators.



Figure 10: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP1 during training, respectively. (B), (C), and (D) visualize the variation of I(X,F_i) over all the layers in MLP1 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively. (F), (G), and (H) visualize the variation of I(Y,F_i) over all the layers in MLP1 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively.


Figure 11: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP2 during training, respectively. (B), (C), and (D) visualize the variation of I(X,F_i) over all the layers in MLP2 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively. (F), (G), and (H) visualize the variation of I(Y,F_i) over all the layers in MLP2 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively.

6.2.5. The comparison with existing methods

Figure 10(B)-(C) show I(X,F_1) > 2 in MLP1 based on empirical distributions and KDE, which contradicts the fact that the synthetic dataset only has 2 bits of information. As a result, the two estimators cannot correctly estimate I(X,F_1).

Figure 10(F)-(G) show I(Y,F_Y) = 0.8 and I(Y,F_Y) = 1 based on empirical distributions and KDE, respectively. That contradicts the requirement that I(Y,F_Y) = H(Y) once the cross entropy loss becomes zero (Section 4.5). Specifically, Figure 10(F) shows I(Y,F_Y) < H(Y) after the cross entropy loss decreases to zero, and Figure 10(G) shows I(Y,F_Y) = H(Y) before the cross entropy loss decreases to zero.

Section 6.2.2 shows that Tanh cannot guarantee a precise Gibbs measure, and Section 6.2.3 derives that f_1 with Tanh only contains 0.47 bits of information about the synthetic dataset. However, Figure 11(B)-(C) show I(X,F_1) > 1 based on empirical distributions and KDE. As a result, the two mutual information estimators cannot correctly estimate I(X,F_1) in MLP2. In addition, Figure 11(F)-(G) demonstrate the same limitation of the two estimators for estimating I(Y,F_Y) in MLP2 as in MLP1.

In summary, the mutual information estimators based on empirical distributions and KDE cannot correctly estimate the information flow of X and Y in MLP1 and MLP2.
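For reference, the sketch below is one minimal reading of the binning (empirical-distribution) estimator in the spirit of Shwartz-Ziv and Tishby (2017): quantize the activations of a layer into a fixed number of bins, treat every binned activation vector as a single discrete symbol, and compute I(X,T) and I(Y,T) from the empirical joint frequencies. The bin count and the treatment of each input as its own symbol are assumptions on our part, not the exact configuration used in the figures above.

```python
import numpy as np
from collections import Counter

def discrete_mi(xs, ts):
    """Empirical mutual information (bits) between two sequences of hashable symbols."""
    n = len(xs)
    px, pt, pxt = Counter(xs), Counter(ts), Counter(zip(xs, ts))
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (pt[t] / n)))
               for (x, t), c in pxt.items())

def binned_layer_mi(activations, inputs_id, labels, n_bins=30):
    """Binning estimator: quantize activations, then I(X,T) and I(Y,T) from counts.

    activations : (J, N) array of a hidden layer's activations over J samples.
    inputs_id   : (J,) identifiers of the inputs (each input is its own symbol).
    labels      : (J,) class labels.
    """
    a = np.asarray(activations, dtype=float)
    edges = np.linspace(a.min(), a.max(), n_bins + 1)
    codes = np.digitize(a, edges[1:-1])              # quantize each unit into a bin index
    t_symbols = [tuple(row) for row in codes]        # one symbol per binned activation vector
    return discrete_mi(list(inputs_id), t_symbols), discrete_mi(list(labels), t_symbols)
```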



Figure 12: (A), (E), and (I) visualize the variation of the training/testing error and the cross entropy loss of MLP4, MLP5 and MLP6 during training, respectively. (B), (F), and (J) visualize the variation of I(X,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively. (C), (G), and (K) visualize the variation of I(Y,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively. (D), (H), and (L) visualize the variation of Ī(X,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively.

6.3. The simulations on the benchmark datasets

In this section, we use the MNIST dataset to demonstrate the proposed explanations for MLPs: (i) the information flow of X and Y in MLPs (Section 4.4 and 4.3) and (ii) the information theoretic explanations for generalization (Section 4.5). In addition, Appendix G presents extra experiments based on more complex MLPs on the Fashion-MNIST dataset to further validate the proposed explanations for MLPs.

Table 5: The number of neurons (nodes) of each layer and the activation functions in each MLP.

        x     f_1   f_2   f_Y   σ(·)
MLP4    784   96    32    10    ReLU
MLP5    784   96    32    10    Tanh
MLP6    784   32    96    10    ReLU

6.3.1. The information flow in the MLPs

We design three MLPs, i.e., MLP4, MLP5, and MLP6, and their differences are summarized in Table 5. All the weights of the MLPs are randomly initialized by truncated normal distributions. We still choose the Adam method to learn the weights of the MLPs on the MNIST dataset over 300 epochs with the learning rate 0.001, and use the same method as Section 6.2.4 to derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively.

The information flow in the MLPs on the MNIST dataset is consistent with the results on the synthetic dataset. More specifically, Figure 12(B), 12(F) and 12(J) visualize three different information flows of X in MLP4, MLP5, and MLP6, respectively. It confirms that the information flow of X in MLPs does not satisfy any DPI. In addition, Figure 12(C), 12(G) and 12(K) demonstrate that the information flow of Y satisfies I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in all the three MLPs. The experiment also demonstrates that IB cannot correctly explain the information flow of X and Y in MLPs, because they cannot satisfy the DPIs (Equation 4) derived from IB in Figure 12(B, C), 12(F, G) and 12(J, K).

Figure 12(D), 12(H) and 12(L) demonstrate that the information flow of X̄ in all the three MLPs satisfies Ī(X,F_1) ≥ Ī(X,F_2) ≥ Ī(X,F_Y). The next section will demonstrate that Ī(X,F_1) can measure the generalization of MLPs along with two variables: (i) the number of neurons, and (ii) the number of training samples.
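As an illustration of the architectures in Table 5, the sketch below builds MLP4 (784-96-32-10 with ReLU hidden layers and a softmax output f_Y) with a crude truncated-normal initializer and returns per-layer Gibbs measures as softmax distributions over each layer's neurons. The initializer parameters and the softmax reading of Equation 9 are our assumptions; the actual experiments train with Adam, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(shape, std=0.1):
    """Crude truncated normal: redraw values beyond two standard deviations."""
    w = rng.normal(0.0, std, size=shape)
    bad = np.abs(w) > 2 * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=bad.sum())
        bad = np.abs(w) > 2 * std
    return w

# MLP4 from Table 5: x (784) -> f1 (96, ReLU) -> f2 (32, ReLU) -> fY (10, softmax)
sizes = [784, 96, 32, 10]
params = [(truncated_normal((m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Returns the Gibbs measures of f1, f2 and the output layer fY for input x."""
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(x @ w1 + b1)
    f2 = relu(f1 @ w2 + b2)
    fy = softmax(f2 @ w3 + b3)
    # per-layer Gibbs measures over each layer's neurons (our reading of Equation 9)
    return softmax(f1), softmax(f2), fy
```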



Figure 13: (A) shows the variation of the testing accuracy and Ī(X,F_1) given different MLPs with different numbers of neurons. (B) shows the variation of the testing accuracy and Ī(X,F_1) given different numbers of training samples.

6.3.2. The information theoretic explanation for the generalization performance of MLPs

First, Ī(X,F_1) can measure the generalization of MLPs with different numbers of neurons. In general, a MLP with more neurons would have better generalization performance, thus Ī(X,F_1) of the MLP should be larger. We design six different MLPs = {x, f_1, f_2, f_Y}, of which the two hidden layers have the same number of neurons with the same ReLU activation function, but different MLPs have different numbers of neurons, i.e., 32, 64, 128, 256, 512, 1024. After the MLPs achieve 100% training accuracy on the MNIST dataset, we observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 13(A).

Second, Ī(X,F_1) can measure the generalization of MLPs with different numbers of training samples. In general, a MLP with a larger number of training samples would have better generalization, thus Ī(X,F_1) of the MLP should be larger. We generate 8 different training sets with different numbers of MNIST training samples and train MLP4 on the 8 training sets. After MLP4 achieves 100% training accuracy on the 8 training sets, we also observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 13(B).

In summary, Ī(X,F_1) demonstrates a positive correlation with the testing accuracy of MLPs, thus we conclude that Ī(X,F_1) can be regarded as a criterion for the generalization of MLPs along with two variables: (i) the number of neurons, and (ii) the number of training samples. The experiment shows potential for explaining the generalization of general DNNs from the perspective of information theory; we leave a rigorous study of this as future work.

7. Conclusions

In this paper, we introduce a probabilistic representation for improving the information theoretic interpretability. The probabilistic representation for MLPs includes three parts.

First, we demonstrate that the activations being i.i.d. is not valid for all the hidden layers of MLPs. As a result, the mutual information estimators based on non-parametric inference methods, e.g., empirical distributions and Kernel Density Estimate (KDE), are invalid for measuring the mutual information in MLPs, because the prerequisite of these non-parametric inference methods is the samples being i.i.d.

Second, we define the probability space (Ω_F, ℱ, P_F) for a fully connected layer f with N neurons given the input x. Let the experiment be f extracting a single feature of x: the sample space Ω_F consists of N possible outcomes (i.e., features), and each outcome is defined by the weights of a neuron; the event space ℱ is the σ-algebra; and the probability measure P_F is a Gibbs measure quantifying the probability of each outcome occurring in the experiment.

Third, we propose probabilistic explanations for MLPs and the back-propagation training: (i) the entire architecture of MLPs formulates a Gibbs distribution based on the Gibbs distribution P_F for each layer; and (ii) the back-propagation training aims to optimize the sample space of all the layers of MLPs for modeling the statistical connection between the input x and the label y, because the weights of each layer define the sample space Ω_F.

To the best of our knowledge, most existing information theoretic explanations for MLPs lack a solid probabilistic foundation. This not only weakens the validity of the information theoretic explanations but could also derive incorrect explanations for MLPs. To resolve the fundamental issue, we first introduce the probabilistic representation for MLPs, and then improve the information theoretic interpretability of MLPs in three aspects.

Above all, we explicitly define the random variable of f as F : Ω_F → E based on (Ω_F, ℱ, P_F). Since Ω_F is discrete, E denotes a discrete measurable space. Hence, F is a discrete random variable and H(F) < ∞. In other words, we resolve the controversy regarding F being discrete or continuous.

Furthermore, the probabilistic explanation for the back-propagation training indicates that Ω_F depends on both x and y, thereby F depends on both X and Y. That contradicts the probabilistic assumption of IB, i.e., F is independent of Y given X. As a result, the information flow of X and Y in MLPs does not satisfy IB if we take into account the back-propagation training.

In addition, we demonstrate that the performance of a MLP depends on the mutual information between the MLP and the input X, i.e., I(X, MLP). Specifically, we prove that all the information of Y stems from X, i.e., H(Y) = I(X,Y) (the relation is visualized by the Venn diagram in Figure 4), thus I(X, MLP) can be divided into two parts, I(Y, MLP) and Ī(X, MLP), where X̄ = Y^c ∩ X denotes the relative complement of Y in X. We show that the training accuracy of the MLP depends on I(Y, MLP), and the generalization of the MLP depends on Ī(X, MLP).

It is noteworthy that we design a synthetic dataset to fully demonstrate the proposed probabilistic representation and information theoretic explanations for MLPs. Compared to all the existing information theoretic explanations, which merely use benchmark datasets for validation, the synthetic dataset enables us to demonstrate the proposed information theoretic explanations clearly and comprehensively, because all the features of the synthetic dataset are known and much simpler than those of benchmark datasets.

The proposed information theoretic explanations for MLPs provide a novel viewpoint to understand the generalization of MLPs, and they deserve more effort as future research. First, since the cross entropy loss only guarantees the performance of MLPs on the training dataset, incorporating Ī(X,F_1) into the cross entropy loss could be a promising approach to improve the generalization performance of MLPs. Second, we are planning to extend the information theoretic explanations for generalization to general DNNs, which could shed light on understanding the generalization of DNNs.

A. The necessary conditions for activations being i.i.d.

A.1. The necessary conditions for activations being independent

The necessary condition for two random variables A and B being independent is that they are uncorrelated, namely the covariance Cov(A, B) = 0. Therefore, the necessary condition for {G_{2k}}_{k=1}^{K} being independent is that ∀(k, k') ∈ S_1 = {(k, k') ∈ ℤ² | k ≠ k', 1 ≤ k ≤ K, 1 ≤ k' ≤ K}, Cov(G_{2k}, G_{2k'}) = 0, which can be formulated as

Cov\Big(\sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n}+b_{2k},\ \sum_{n'=1}^{N}\omega^{(2)}_{n'k'}F_{1n'}+b_{2k'}\Big)=0.   (39)

Since the covariance between a random variable and a constant is zero, namely Cov(X, c) = 0, we derive

Cov(G_{2k},G_{2k'})=Cov\Big(\sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n},\ \sum_{n'=1}^{N}\omega^{(2)}_{n'k'}F_{1n'}\Big).   (40)

Based on Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z), Cov(G_{2k}, G_{2k'}) can be extended as

Cov(G_{2k},G_{2k'})=\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}Var(F_{1n})+\sum_{n\neq n'}\omega^{(2)}_{nk}\omega^{(2)}_{n'k'}Cov(F_{1n},F_{1n'}).   (41)

Assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), we have Cov(F_{1n}, F_{1n'}) = 0, thus we have

Cov(G_{2k},G_{2k'})=Var(F_1)\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}.   (42)

Since Var(F_1) > 0, the necessary condition for {G_{2k}}_{k=1}^{K} being independent can be formulated as ∀(k, k') ∈ S_1,

\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}=0.   (43)

Based on the theorem in Appendix A.4, we can derive that {G_{2k}}_{k=1}^{K} being independent is equivalent to {F_{2k}}_{k=1}^{K} being independent as long as the activation function σ_2(·) is invertible. In other words, if σ_2(·) is invertible, the necessary condition for {F_{2k}}_{k=1}^{K} being independent is the same as the necessary condition for {G_{2k}}_{k=1}^{K} being independent.

In summary, if the activations of a hidden layer are independent in the context of frequentist probability, the weights of the hidden layer must satisfy Equation 43 given the assumption that the inputs of the hidden layer are i.i.d. and the activation function is invertible.
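A quick numerical check of Equation 43: for the activations of f_2 to be independent, every pair of weight columns of ω^(2) must have a vanishing inner product. The sketch below evaluates the condition for an illustrative, randomly initialized weight matrix (the layer sizes follow Appendix B); the off-diagonal inner products are generally nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 512, 256                               # f1 has N neurons, f2 has K neurons
W2 = rng.uniform(-0.05, 0.05, size=(N, K))    # illustrative omega^(2)_{nk}

# Equation 43: sum_n omega^(2)_{nk} omega^(2)_{nk'} = 0 for every pair k != k'.
gram = W2.T @ W2                              # (K, K) matrix of the pairwise sums
off_diag = np.abs(gram[~np.eye(K, dtype=bool)])
print(f"max |sum_n w_nk w_nk'| = {off_diag.max():.4f}, "
      f"mean = {off_diag.mean():.4f}  (exactly zero is required)")
```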


A.2. The necessary conditions for activations being identically distributed

The necessary condition for {G_{2k}}_{k=1}^{K} being identically distributed is that ∀(k, k') ∈ S_1, we have

E(G_{2k})=E(G_{2k'}),   (44)

where E(·) denotes the expectation. Since G_{2k} = \sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n}+b_{2k}, we can derive

E(G_{2k})=\sum_{n=1}^{N}\omega^{(2)}_{nk}E(F_{1n})+b_{2k}.   (45)

Assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), we have E(F_{1n}) = E(F_1). Hence, we can further derive

E(G_{2k})=E(F_1)\sum_{n=1}^{N}\omega^{(2)}_{nk}+b_{2k}.   (46)

Based on E(G_{2k}) = E(G_{2k'}), we can derive

-E(F_1)\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'})=b_{2k}-b_{2k'}.   (47)

We assume that σ_2(·) is strictly increasing and differentiable, thus σ_2(·) is invertible and its inverse σ_2^{-1}(·) is also strictly increasing. As a result, the cumulative distribution function of F_{2k} can be expressed as

\Phi_{F_{2k}}(f)=P(F_{2k}\le f)=P(\sigma_2(G_{2k})\le f)=P(G_{2k}\le\sigma_2^{-1}(f))=\Phi_{G_{2k}}(\sigma_2^{-1}(f)),   (48)

where P(F_{2k} ≤ f) is the probability that F_{2k} takes a value less than or equal to f. Subsequently, we can obtain

P_{F_{2k}}(f)=\frac{\partial\Phi_{F_{2k}}(f)}{\partial f}=\frac{\partial\Phi_{G_{2k}}(\sigma_2^{-1}(f))}{\partial f}=P_{G_{2k}}(\sigma_2^{-1}(f))\frac{\partial\sigma_2^{-1}(f)}{\partial f}.   (49)

Equation 49 indicates that if σ_2(·) is strictly increasing and differentiable and {G_{2k}}_{k=1}^{K} are identically distributed, then {F_{2k}}_{k=1}^{K} are identically distributed as well.

In summary, if the activations of a hidden layer are identically distributed in the context of frequentist probability, the weights and the biases of the hidden layer must satisfy Equation 47 under the assumption that the inputs of the layer are i.i.d. and the activation function is strictly increasing and differentiable.

A.3. Conclusion

In summary, assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), the necessary conditions for the activations {F_{2k}}_{k=1}^{K} being i.i.d. can be summarized as

∀(k, k') ∈ S_1 = {(k, k') ∈ ℤ² | k ≠ k', 1 ≤ k ≤ K, 1 ≤ k' ≤ K}:
\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}=0,\qquad -E(F_1)\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'})=b_{2k}-b_{2k'},   (50)

σ_2(·) is strictly increasing and differentiable.   (51)

Equation 50 shows the necessary conditions for {G_{2k}}_{k=1}^{K} being independent and identically distributed. Equation 51 specifies the condition on the activation function σ_2(·) such that if {G_{2k}}_{k=1}^{K} are i.i.d., then {F_{2k}}_{k=1}^{K} are also i.i.d. The invertibility condition is not required here because being strictly increasing and differentiable implies it. It is important to note that the necessary conditions hold for arbitrary fully connected layers as long as we properly change the superscript of ω^{(2)}_{nk} and the subscript of b_{2k}.

A.4. Functions of independent random variables are independent

Theorem: Assume X and Y are independent random variables on a probability space (Ω, ℱ, P). Let g and h be real-valued functions defined on the codomains of X and Y, respectively. Then g(X) and h(Y) are independent random variables.

Proof: Let A ⊆ ℝ and B ⊆ ℝ be the ranges of g and h; the joint distribution between g(X) and h(Y) can be formulated as P(g(X) ∈ A, h(Y) ∈ B). Let g^{-1}(A) and h^{-1}(B) denote the preimages of A and B, respectively; we have

P(g(X)\in A, h(Y)\in B)=P(X\in g^{-1}(A), Y\in h^{-1}(B)).   (52)

Based on the definition of independence, we can derive that

P(X\in g^{-1}(A), Y\in h^{-1}(B))=P(X\in g^{-1}(A))P(Y\in h^{-1}(B)).   (53)

Based on the definition of preimage, we can derive that

P(g(X)\in A, h(Y)\in B)=P(g(X)\in A)P(h(Y)\in B).   (54)

Therefore, g(X) and h(Y) are independent random variables.
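Complementing the check of Equation 43 above, the sketch below examines the second condition collected in Equation 50 (i.e., Equation 47): the bias differences b_{2k} - b_{2k'} should be a linear function of the weight-column-sum differences with slope -E(F_1), which is the relation fitted by linear regression in Appendix B (Figure 14). The weights, biases, and the stand-in value for E(F_1) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 512, 256
W2 = rng.uniform(-0.05, 0.05, size=(N, K))    # illustrative omega^(2)_{nk}
b2 = rng.uniform(-0.05, 0.05, size=K)         # illustrative b_{2k}
mean_f1 = 0.4                                 # stand-in for the sample mean of F1

# Equation 47: -E(F1) * sum_n (w_nk - w_nk') = b_2k - b_2k' for every pair k != k'.
col_sum = W2.sum(axis=0)
k, kp = np.triu_indices(K, k=1)               # all pairs k < k'
x = col_sum[k] - col_sum[kp]
y = b2[k] - b2[kp]
slope = np.polyfit(x, y, 1)[0]                # regression slope, as in Figure 14(B)/(E)
print(f"regression slope = {slope:.2e}, required slope = {-mean_f1:.2e}")
```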


Figure 14: (A) W_{K×K} contains the value of |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| for all the neurons in f_2. (B) The green triangles show all the samples [\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'}), b_{2k}-b_{2k'}], the black line shows the linear regression result based on these samples, and the red line shows the linear relation indicated by the slope -\bar{f}_1 ≈ -E(F_1). (C) The blue curve and the magenta curve show the variation of r_{f_2}=\frac{1}{\sum_{k=1}^{K}\sum_{k'=1}^{k-1}1}\sum_{k=1}^{K}\sum_{k'=1}^{k-1}|\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| and the training accuracy over 201 training epochs, respectively. (D) W_{S×S} contains the value of |\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| for all the neurons in f_3. (E) The green triangles show all the samples [\sum_{k=1}^{K}(\omega^{(3)}_{ks}-\omega^{(3)}_{ks'}), b_{3s}-b_{3s'}], the black line shows the linear regression result based on these samples, and the red line shows the linear relation indicated by the slope -\bar{f}_2 ≈ -E(F_2). (F) The blue curve and the magenta curve show the variation of r_{f_3}=\frac{1}{\sum_{s=1}^{S}\sum_{s'=1}^{s-1}1}\sum_{s=1}^{S}\sum_{s'=1}^{s-1}|\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| and the training accuracy over 201 training epochs, respectively.

B. Activations are not i.i.d. in more complex MLPs on the Fashion-MNIST dataset

In this section, we demonstrate that activations cannot satisfy the necessary conditions in a more complex MLP on the Fashion-MNIST dataset, thus activations being i.i.d. is not valid for all the fully connected layers of the MLP.

To check if activations satisfy the necessary conditions, we specify a MLP = {x, f_1, f_2, f_3, f_Y} for classifying the Fashion-MNIST dataset (Xiao et al., 2017). The dimension of each Fashion-MNIST image is 28×28, thus the number of input nodes is M = 784. In addition, f_1, f_2, and f_3 have N = 512, K = 256, and S = 128 neurons, respectively, and f_Y has L = 10 nodes. All hidden layers choose the sigmoid function, which satisfies the third necessary condition (Equation 51), thus we only need to examine the first two necessary conditions.

After the training accuracy is very close to 100%, we obtain ω^{(2)}_{nk} and ω^{(3)}_{ks}, and construct two matrices W_{K×K} and W_{S×S} to contain |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| and |\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| for each activation in f_2 and f_3, respectively. Figure 14(A) and 14(D) show that \sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'} and \sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'} are far from zero for many different activations after training. As a result, activations cannot be independent after training even if we consider the estimation error.

We obtain C^{256}_{2} = 65280 samples of \sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'}) and b_{2k}-b_{2k'}, and C^{128}_{2} = 16256 samples of \sum_{k=1}^{K}(\omega^{(3)}_{ks}-\omega^{(3)}_{ks'}) and b_{3s}-b_{3s'}, which are shown by green triangles in Figure 14(B) and 14(E), respectively. Based on the linear regression, we learn the linearities with the slopes Δ_{f_1} = -8.75E-05 and Δ_{f_2} = -2.50E-03 from the samples. In addition, we derive the sample means \bar{f}_1 = 3.96E-01 and \bar{f}_2 = 2.96E-01. We observe that Δ_{f_1} and Δ_{f_2} differ greatly from -\bar{f}_1 and -\bar{f}_2, respectively. As a result, activations cannot be identically distributed after training even if we consider the estimation error.

Moreover, we demonstrate that activations being i.i.d. is also not valid during the training procedure. We use r_{f_2}=\frac{1}{\sum_{k=1}^{K}\sum_{k'=1}^{k-1}1}\sum_{k=1}^{K}\sum_{k'=1}^{k-1}|\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| (i.e., the mean of |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| over all the activations) and r_{f_3}=\frac{1}{\sum_{s=1}^{S}\sum_{s'=1}^{s-1}1}\sum_{s=1}^{S}\sum_{s'=1}^{s-1}|\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| to indicate if all the activations of f_2 and f_3 are independent during training. Figure 14(C) and 14(F) show the variation of r_{f_2} and r_{f_3} during 301 training epochs, respectively. At the beginning, r_{f_2} and r_{f_3} are close to zero because all the weights ω_{nk} and ω_{ks} are randomly initialized. However, as the training procedure goes on, r_{f_2} and r_{f_3} show an increasing trend. Therefore, all the activations cannot keep being independent, thereby not being i.i.d. during training.


Overall, since {F_{2k}}_{k=1}^{K} and {F_{3s}}_{s=1}^{S} cannot satisfy the necessary conditions, they cannot be i.i.d. under the assumption that {F_{1n}}_{n=1}^{N} and {F_{2k}}_{k=1}^{K} are i.i.d. during and after training. In other words, {F_{1n}}_{n=1}^{N}, {F_{2k}}_{k=1}^{K} and {F_{3s}}_{s=1}^{S} cannot be simultaneously i.i.d. during and after training the MLP. Therefore, activations being i.i.d. is not valid for all the hidden layers of the MLP.

C. The equivalence between the Stochastic Gradient Descent (SGD) algorithm and the first order approximation

If an arbitrary function f is differentiable at the point p* ∈ ℝ^N and its differential is represented by the Jacobian matrix ∇_{p*}f, the first order approximation of f near the point p* can be formulated as

f(p)-f(p^*)=(\nabla_{p^*}f)\cdot(p-p^*)+o(\lVert p-p^*\rVert),   (55)

where o(||p - p*||) is a quantity that approaches zero much faster than ||p - p*|| approaches zero.

Based on the first order approximation (Battiti, 1992), the activations of f_2 and f_1 in the MLP = {x, f_1, f_2, f_Y} can be expressed as follows:

f_2[f_1,\theta_{j+1}(2)]\approx f_2[f_1,\theta_j(2)]+(\nabla_{\theta_j(2)}f_2)\cdot[\theta_{j+1}(2)-\theta_j(2)],
f_1[x,\theta_{j+1}(1)]\approx f_1[x,\theta_j(1)]+(\nabla_{\theta_j(1)}f_1)\cdot[\theta_{j+1}(1)-\theta_j(1)],   (56)

where f_2[f_1, θ_{j+1}(2)] are the activations of f_2 based on the parameters of f_2 learned in the (j+1)th iteration, i.e., θ_{j+1}(2), given the activations of f_1. The definitions of f_2[f_1, θ_j(2)], f_1(x, θ_{j+1}(1)), and f_1(x, θ_j(1)) are analogous.

Since f_2 = {f_{2k} = σ_2(\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k})}_{k=1}^{K} has K neurons and each neuron has T + 1 parameters, namely θ(2) = {\omega^{(2)}_{1k};\cdots;\omega^{(2)}_{Tk};b_{2k}}_{k=1}^{K}, the dimension of ∇_{θ_j(2)}f_2 is equal to K×(T+1) and ∇_{θ_j(2)}f_2 can be expressed as

\nabla_{\theta_j(2)}f_2=(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T},   (57)

where ∇_{σ_2}f_2 = ∂f_2[f_1, θ_t(2)]/∂σ_2. Substituting (\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T} for ∇_{θ_j(2)}f_2 in Equation 56, we derive

f_2[f_1,\theta_{j+1}(2)]\approx f_2[f_1,\theta_j(2)]+(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T}\cdot\theta_{j+1}(2)-(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T}\cdot\theta_j(2).   (58)

If we only consider a single neuron, e.g., f_{2k}, we define θ_{j+1}(2k) = [\omega^{(2)}_{1k};\cdots;\omega^{(2)}_{Tk};b_{2k}] and θ_j(2k) = [\omega'^{(2)}_{1k};\cdots;\omega'^{(2)}_{Tk};b'_{2k}], thus [f_1;1]^{T}\cdot\theta_{j+1}(2k)=\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}. As a result, for a single neuron, Equation 58 can be expressed as

f_{2k}[f_1,\theta_{j+1}(2k)]\approx\underbrace{(\nabla_{\sigma_2}f_{2k})\cdot\Big[\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}\Big]}_{\text{Approximation}}+\underbrace{f_{2k}[f_1,\theta_j(2k)]-(\nabla_{\sigma_2}f_{2k})\cdot\Big[\sum_{t=1}^{T}\omega'^{(2)}_{tk}f_{1t}+b'_{2k}\Big]}_{\text{Bias}}.   (59)

Equation 59 indicates that f_{2k}[f_1, θ_{j+1}(2k)] can be reformulated as two components: the approximation and the bias. Since ∇_{σ_2}f_{2k} = ∂f_{2k}[f_1, θ_j(2)]/∂σ_2 is only related to f_1 and θ_j(2), it can be regarded as a constant with respect to θ_{j+1}(2). The bias component also does not contain any parameters of the (j+1)th training iteration.

In summary, f_{2k}(f_1, θ_{j+1}(2k)) can be reformulated as

f_{2k}(f_1,\theta_{j+1}(2k))\approx C_1\cdot\Big[\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}\Big]+C_2,   (60)

where C_1 = ∇_{σ_2}f_{2k} and C_2 = f_{2k}(f_1, θ_j(2k)) - (∇_{σ_2}f_{2k})\cdot[\sum_{t=1}^{T}\omega'^{(2)}_{tk}f_{1t}+b'_{2k}]. Similarly, the activations of f_1 can also be formulated as the approximation.

To demonstrate the first order approximation for the activations of the MLP (Equation 56), we only need to prove that θ_{j+1}(2) - θ_j(2) approaches zero, which can be guaranteed by SGD. Given the MLP = {x, f_1, f_2, f_Y} and the empirical risk \hat{l}(h), SGD aims to optimize the parameters of the MLP through minimizing \hat{l}(h) (Rumelhart et al., 1986):

\theta_{t+1}=\theta_t-\eta\nabla_{\theta_t}\hat{l}(h),   (61)

where ∇_{θ_t}\hat{l}(h) denotes the Jacobian matrix of \hat{l}(h) with respect to θ_t at the tth iteration, and η > 0 denotes the learning rate. Since the functions of all the layers are differentiable, the Jacobian matrix of \hat{l}(h) with respect to the parameters of the ith hidden layer, i.e., ∇_{θ(i)}\hat{l}(h), can be expressed as

\nabla_{\theta(Y)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{\theta(Y)}f_Y,
\nabla_{\theta(2)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{f_2}f_Y\,\nabla_{\theta(2)}f_2,   (62)
\nabla_{\theta(1)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{f_2}f_Y\,\nabla_{f_1}f_2\,\nabla_{\theta(1)}f_1,

where θ(i) denotes the parameters of the ith layer. Equations 61 and 62 indicate that θ(i) can be learned as

\theta_{t+1}(i)=\theta_t(i)-\eta[\nabla_{\theta_t(i)}\hat{l}(h)].   (63)

Table 6 summarizes the SGD training procedure for the MLP shown in Figure 1. SGD minimizing \hat{l}(h) makes ∇_{θ_t(i)}\hat{l}(h) converge to zero, thereby θ_{t+1}(i) - θ_t(i) converges to zero.


Table 6: One iteration of the SGD training procedure for the MLP.

Layer   Gradients ∇_{θ(i)} \hat{l}(h)    Parameters                                              Activations
f_Y     ∇_{θ(Y)} \hat{l}(h)  ↓           θ_{t+1}(Y) = θ_t(Y) − η[∇_{θ(Y)} \hat{l}(h)]  ↑         f_Y(f_2, θ_{t+1}(Y))  ↑
f_2     ∇_{θ(2)} \hat{l}(h)  ↓           θ_{t+1}(2) = θ_t(2) − η[∇_{θ(2)} \hat{l}(h)]  ↑         f_2(f_1, θ_{t+1}(2))  ↑
f_1     ∇_{θ(1)} \hat{l}(h)  ↓           θ_{t+1}(1) = θ_t(1) − η[∇_{θ(1)} \hat{l}(h)]  ↑         f_1(x, θ_{t+1}(1))  ↑
x       —                                —                                                       —

The down-arrow and up-arrow indicate the order of the gradient and parameter (activation) updates, respectively.
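The following toy sketch illustrates the argument of Appendix C and Table 6: after an SGD update, the new activations of a layer are close to the first-order expansion of Equation 56 around the old parameters, and the gap shrinks as the update (i.e., the learning rate times the gradient) shrinks. The single sigmoid layer and the squared loss are our simplifications, not the MLP of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy layer f2 = sigmoid(W f1 + b) and a quadratic loss, to illustrate Equation 56/59.
f1 = rng.normal(size=8)                      # fixed input activations
W, b = rng.normal(size=(4, 8)), np.zeros(4)
target = rng.normal(size=4)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

for lr in [1.0, 0.1, 0.01]:
    g = W @ f1 + b
    f2 = sigmoid(g)
    grad_g = (f2 - target) * f2 * (1 - f2)   # d loss / d g for loss = 0.5||f2 - target||^2
    dW, db = -lr * np.outer(grad_g, f1), -lr * grad_g
    true_new = sigmoid((W + dW) @ f1 + (b + db))
    first_order = f2 + f2 * (1 - f2) * (dW @ f1 + db)   # Equation 56-style expansion
    print(f"lr={lr:5.2f}  max |true - first order| = {np.abs(true_new - first_order).max():.2e}")
```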

D. The Gibbs explanation for the entire architecture of the MLP

Since the entire architecture of the MLP = {x, f_1, f_2, f_Y} in Figure 1 corresponds to a joint distribution

P(F_Y;F_2;F_1|X)=P(F_Y|F_2)P(F_2|F_1)P(F_1|X),   (64)

the marginal distribution P(F_Y|X) can be formulated as

P_{F_Y|X}(l|x)=\sum_{k=1}^{K}\sum_{t=1}^{T}P(F_Y=l,F_2=k,F_1=t|X=x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x).   (65)

Based on the definition of the Gibbs probability measure (Equation 9), we have

P_{F_1|X}(t|x)=\frac{1}{Z_{F_1}}\exp(f_{1t})=\frac{1}{Z_{F_1}}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)],   (66)

where \omega'^{(1)}_t=[\omega^{(1)}_t,b_{1t}] and x'=[x,1], i.e., \langle\omega'^{(1)}_t,x'\rangle=\langle\omega^{(1)}_t,x\rangle+b_{1t}. Similarly, we have

P_{F_2|F_1}(k|t)=\frac{1}{Z_{F_2}}\exp(f_{2k})=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)],   (67)

where f_1=\{f_{1t}\}_{t=1}^{T}=\{\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)\}_{t=1}^{T}, \omega'^{(2)}_k=[\omega^{(2)}_k,b_{2k}], and f'_1=[f_1,1], i.e., \langle\omega'^{(2)}_k,f'_1\rangle=\langle\omega^{(2)}_k,f_1\rangle+b_{2k}, thus we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\sum_{t=1}^{T}\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)]\frac{1}{Z_{F_1}}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)].   (68)

Since \langle\omega'^{(2)}_k,f'_1\rangle=\langle\omega^{(2)}_k,f_1\rangle+b_{2k}=\sum_{t=1}^{T}\omega^{(2)}_{kt}f_{1t}+b_{2k} is a constant with respect to t, we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)]\frac{1}{Z_{F_1}}\sum_{t=1}^{T}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)].   (69)

In addition, \sum_{t=1}^{T}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)]=Z_{F_1}, thus we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)].   (70)

Therefore, we can simplify P_{F_Y|X}(l|x) as

P_{F_Y|X}(l|x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)].   (71)

Similarly, since P_{F_Y|F_2}(l|k)=\frac{1}{Z_{F_Y}}\exp[\langle\omega^{(3)}_l,f_2\rangle+b_{yl}] and \langle\omega^{(3)}_l,f_2\rangle=\sum_{k=1}^{K}\omega^{(3)}_{lk}f_{2k} is also a constant with respect to k, we can derive

P_{F_Y|X}(l|x)=P_{F_Y|F_2}(l|k)=\frac{1}{Z_{F_Y}}\exp[\langle\omega^{(3)}_l,f_2\rangle+b_{yl}].   (72)

In addition, since f_2=\{f_{2k}\}_{k=1}^{K}=\{\sigma_2(\langle\omega^{(2)}_k,f_1\rangle+b_{2k})\}_{k=1}^{K}, we can extend P_{F_Y|X}(l|x) as

P_{F_Y|X}(l|x)=\frac{1}{Z_{F_Y}}\exp\Big[\big\langle\omega^{(3)}_l,\big(\sigma_2(\langle\omega^{(2)}_1,f_1\rangle+b_{21}),\ldots,\sigma_2(\langle\omega^{(2)}_K,f_1\rangle+b_{2K})\big)\big\rangle+b_{yl}\Big].   (73)

Since f_1=\{f_{1t}\}_{t=1}^{T}=\{\sigma_1(\langle\omega^{(1)}_t,x\rangle+b_{1t})\}_{t=1}^{T}, we can further extend P_{F_Y|X}(l|x) as


P_{F_Y|X}(l|x)=\frac{1}{Z_{F_Y}}\exp\Big[\big\langle\omega^{(3)}_l,\big(\sigma_2(\langle\omega^{(2)}_1,(\sigma_1(\langle\omega^{(1)}_1,x\rangle+b_{11}),\ldots,\sigma_1(\langle\omega^{(1)}_T,x\rangle+b_{1T}))\rangle+b_{21}),\ldots,\sigma_2(\langle\omega^{(2)}_K,(\sigma_1(\langle\omega^{(1)}_1,x\rangle+b_{11}),\ldots,\sigma_1(\langle\omega^{(1)}_T,x\rangle+b_{1T}))\rangle+b_{2K})\big)\big\rangle+b_{yl}\Big]=\frac{1}{Z_{MLP}(x_i)}\exp[f_{yl}(f_2(f_1(x)))].   (74)

Overall, we prove that P_{F_Y|X}(l|x) is a Gibbs distribution and it can be expressed as

P_{F_Y|X}(l|x)=\frac{1}{Z_{MLP}(x)}\exp[f_{yl}(f_2(f_1(x)))],   (75)

where E_{yl}(x)=-f_{yl}(f_2(f_1(x))) is the energy function of the label l ∈ {1, …, L} given x, and the partition function is

Z_{MLP}(x)=\sum_{l=1}^{L}\sum_{k=1}^{K}\sum_{t=1}^{T}Q(F_Y,F_2,F_1|X=x)=\sum_{l=1}^{L}\exp[f_{yl}(f_2(f_1(x)))].   (76)

E. The gradient of the cross entropy loss function with respect to the weights

If the loss function is the cross entropy, we have

l=H[P_{Y|X}(l|x),f_y(x)],   (77)

where f_y(x)=\{f_{yl}\}_{l=1}^{L} is the output of the MLP given x, and P_{Y|X}(l|x) is the one-hot probability of x given the label y, i.e., if l = y, P_{Y|X}(l|x) = 1, otherwise P_{Y|X}(l|x) = 0.

Based on the definition of the cross entropy, l can be expressed as

l=-\sum_{l=1}^{L}P_{Y|X}(l|x)\log f_{yl}.   (78)

Therefore, the derivative of l with respect to f_{yl} is

\frac{\partial l}{\partial f_{yl}}=-\frac{P_{Y|X}(l|x)}{f_{yl}}.   (79)

In addition, we have

\frac{\partial f_{yt}}{\partial g_{yl}}=\frac{\partial\big[\frac{1}{Z_Y}\exp(g_{yt})\big]}{\partial g_{yl}}=\begin{cases}f_{yl}(1-f_{yl})&\text{for }t=l\\-f_{yl}f_{yt}&\text{for }t\neq l\end{cases}.   (80)

As a result, the derivative of l with respect to g_{yl} can be expressed as

\frac{\partial l}{\partial g_{yl}}=\sum_{t=1}^{L}\frac{\partial l}{\partial f_{yt}}\frac{\partial f_{yt}}{\partial g_{yl}}=-P_Y(l)(1-f_{yl})+\sum_{t\neq l}P_{Y|X}(t|x)f_{yl}=f_{yl}-P_{Y|X}(l|x).   (81)

Therefore, the derivative of l with respect to \omega^{(3)}_{kl} can be expressed as

\frac{\partial l}{\partial\omega^{(3)}_{kl}}=\sum_{l=1}^{L}\frac{\partial l}{\partial g_{yl}}\frac{\partial g_{yl}}{\partial\omega^{(3)}_{kl}}=[f_{yl}-P_{Y|X}(l|x)]f_{2k}.   (82)

Similarly, the derivative of l with respect to g_{2k} can be expressed as

\frac{\partial l}{\partial g_{2k}}=\sum_{l=1}^{L}\frac{\partial l}{\partial g_{yl}}\frac{\partial g_{yl}}{\partial f_{2k}}\frac{\partial f_{2k}}{\partial g_{2k}}=\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k}).   (83)

The derivative of l with respect to \omega^{(2)}_{nk} can be expressed as

\frac{\partial l}{\partial\omega^{(2)}_{nk}}=\frac{\partial l}{\partial g_{2k}}\frac{\partial g_{2k}}{\partial\omega^{(2)}_{nk}}=\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})f_{1n}.   (84)

Similarly, the derivative of l with respect to g_{1n} can be expressed as

\frac{\partial l}{\partial g_{1n}}=\sum_{k=1}^{K}\frac{\partial l}{\partial g_{2k}}\frac{\partial g_{2k}}{\partial f_{1n}}\frac{\partial f_{1n}}{\partial g_{1n}}=\sum_{k=1}^{K}\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})\omega^{(2)}_{nk}\sigma'_1(g_{1n}).   (85)

The derivative of l with respect to \omega^{(1)}_{mn} can be expressed as

\frac{\partial l}{\partial\omega^{(1)}_{mn}}=\frac{\partial l}{\partial g_{1n}}\frac{\partial g_{1n}}{\partial\omega^{(1)}_{mn}}=\sum_{k=1}^{K}\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})\omega^{(2)}_{nk}\sigma'_1(g_{1n})x_m.   (86)

Based on the back-propagation algorithm, the weights are updated as

\omega^{(1)}_{mn}(t+1)=\omega^{(1)}_{mn}(t)-\eta\frac{\partial l}{\partial\omega^{(1)}_{mn}(t)},\quad
\omega^{(2)}_{nk}(t+1)=\omega^{(2)}_{nk}(t)-\eta\frac{\partial l}{\partial\omega^{(2)}_{nk}(t)},\quad
\omega^{(3)}_{kl}(t+1)=\omega^{(3)}_{kl}(t)-\eta\frac{\partial l}{\partial\omega^{(3)}_{kl}(t)},   (87)

where η is the learning rate and t is the tth training iteration.
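A compact sketch of Equations 81-87 for the MLP = {x, f_1, f_2, f_Y}: the output-layer error f_y - P_{Y|X} is propagated backwards through the two hidden layers and the weights are updated by gradient descent. One-hot targets and single-sample updates are assumed, and the variable names are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop_step(x, y_onehot, params, sigma, dsigma, lr=0.01):
    """One SGD update implementing Equations 81-87 for {x, f1, f2, fY}.

    params = [(W1, b1), (W2, b2), (W3, b3)]; sigma/dsigma are the hidden
    activation and its derivative (e.g. ReLU).  Returns updated params.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    # forward pass
    g1 = x @ W1 + b1;  f1 = sigma(g1)
    g2 = f1 @ W2 + b2; f2 = sigma(g2)
    gy = f2 @ W3 + b3; fy = softmax(gy)
    # Equation 81: d l / d g_y = f_y - P_{Y|X}
    d_gy = fy - y_onehot
    # Equations 82-83
    dW3 = np.outer(f2, d_gy)
    d_g2 = (d_gy @ W3.T) * dsigma(g2)
    # Equations 84-85
    dW2 = np.outer(f1, d_g2)
    d_g1 = (d_g2 @ W2.T) * dsigma(g1)
    # Equation 86
    dW1 = np.outer(x, d_g1)
    # Equation 87: gradient descent update
    return [(W1 - lr * dW1, b1 - lr * d_g1),
            (W2 - lr * dW2, b2 - lr * d_g2),
            (W3 - lr * dW3, b3 - lr * d_gy)]

relu = lambda z: np.maximum(0.0, z)
drelu = lambda z: (z > 0).astype(float)
```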
F. H(Y) = I(X,Y)

In this section, we prove that all the information of Y stems from X, i.e., H(Y) = I(X,Y). Based on the definition of mutual information, we have

I(X,Y)=H(Y)-H(Y|X),   (88)

thus H(Y) = I(X,Y) is equivalent to H(Y|X) = 0.

Based on the definition of conditional entropy, we have

H(Y|X)=\sum_{x\in\mathcal{X}}P(X=x)H(Y|X=x),   (89)

where H(Y|X = x) can be formulated as

H(Y|X=x)=-\sum_{y\in\mathcal{Y}}P(Y=y)\log_2 P(Y=y|X=x).   (90)

Since \mathcal{Y}=\{1,\cdots,L\} and the samples (x^j, y^j) ∈ \mathcal{D} are i.i.d., we can simplify H(Y|X = x) as

H(Y|X=x)=-\sum_{l=1}^{L}\frac{N(l)}{J}\log_2 P(Y=l|X=x),   (91)

where N(l) is the number of labels y^j = l and J is the total number of samples.

Since P(Y|X) is in one-hot format, i.e.,

P_{Y|X}(l|x^j)=\begin{cases}1&\text{if }l=y^j\\0&\text{if }l\neq y^j\end{cases},   (92)

we can derive H(Y|X = x) = 0, thereby H(Y|X) = 0. Finally, we have H(Y) = I(X,Y).

G. Information theoretic explanations for MLPs on Fashion-MNIST dataset

In this section, we design three MLPs on the Fashion-MNIST dataset to demonstrate the proposed explanations for MLPs: (i) the information flow of X and Y in MLPs (Section 4.4 and 4.3) and (ii) the information theoretic explanations for generalization (Section 4.5).

G.1. The information flow in the MLPs

To classify the Fashion-MNIST dataset, we design three MLPs = {x, f_1, f_2, f_3, f_Y}, i.e., MLP8, MLP9, and MLP10, and their differences are summarized in Table 7. All the weights of the MLPs are randomly initialized by truncated normal distributions. We still choose the Adam method to learn the weights of the MLPs on the Fashion-MNIST dataset over 500 epochs with the learning rate 0.001, and use the same method as Section 6.2.4 to derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively.

Table 7: The number of neurons (nodes) of each layer and the activation functions in each MLP.

         x     f_1   f_2   f_3   f_Y   σ(·)
MLP8     784   256   128   96    10    ReLU
MLP9     784   256   128   96    10    Tanh
MLP10    784   96    128   256   10    ReLU

The information flow in the MLPs on the Fashion-MNIST dataset is consistent with the results on the synthetic dataset. More specifically, Figure 15(B), 15(F) and 15(J) visualize three different information flows of X in MLP8, MLP9, and MLP10, respectively, which confirms that the information flow of X in MLPs does not satisfy any DPI. Figure 15(C), 15(G) and 15(K) demonstrate that the information flow of Y satisfies I(Y,F_Y) ≥ I(Y,F_3) ≥ I(Y,F_2) ≥ I(Y,F_1) in all the three MLPs. The experiment further demonstrates that IB cannot correctly explain the information flow of X and Y in MLPs, because they cannot satisfy the DPIs (Equation 4) derived from IB in Figure 15(B, C), 15(F, G) and 15(J, K). In addition, Figure 15(D), 15(H) and 15(L) demonstrate that the information flow of X̄ in all the three MLPs satisfies Ī(X,F_1) ≥ Ī(X,F_2) ≥ Ī(X,F_3) ≥ Ī(X,F_Y).



Figure 15: (A), (E), and (I) visualize the variation of the training/testing error and the cross entropy loss of MLP8, MLP9 and MLP10 during training, respectively. (B), (F), and (J) visualize the variation of I(X,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively. (C), (G), and (K) visualize the variation of I(Y,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively. (D), (H), and (L) visualize the variation of Ī(X,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively.


Figure 16: (A) shows the variation of the testing accuracy and Ī(X,F_1) given different MLPs with different numbers of neurons. (B) shows the variation of the testing accuracy and Ī(X,F_1) given different numbers of training samples.

G.2. The information theoretic explanation for the generalization performance of MLPs

First, Ī(X,F_1) can measure the generalization of MLPs with different numbers of neurons. In general, a MLP with more neurons would have better generalization, thus Ī(X,F_1) of the MLP should be larger. We design six different MLPs = {x, f_1, f_2, f_3, f_Y}. The numbers of neurons in the three hidden layers of the six MLPs have the same ratio, i.e., #(f_1) : #(f_2) : #(f_3) = 4 : 3 : 1. However, different MLPs have different numbers of neurons; specifically #(f_1) = {64, 128, 256, 512, 1024, 2048}. After all the six MLPs achieve 100% training accuracy on the Fashion-MNIST dataset, we observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 16(A).

Second, Ī(X,F_1) can measure the generalization of MLPs with different numbers of training samples. In general, a MLP with a larger number of training samples would have better generalization performance, thus Ī(X,F_1) of the MLP should be larger. We generate 8 different training sets with different numbers of Fashion-MNIST training samples and train MLP8 on the 8 training sets. After MLP8 achieves 100% training accuracy on the 8 training sets, we also observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 16(B).

In summary, Ī(X,F_1) demonstrates a positive correlation with the testing accuracy of MLPs, which keeps consistent with the results based on MLPs on the MNIST dataset. The experiment further confirms that Ī(X,F_1) can be viewed as a criterion for the generalization of MLPs.

References

Battiti, R., 1992. First and second order methods for learning: Between steepest descent and Newton's method. Neural Computation 4, 141–166.
Chelombiev, I., Houghton, C., O'Donnell, C., 2019. Adaptive estimators show information compression in deep neural networks, in: International Conference on Learning Representations.
Cover, T., Thomas, J., 2006. Elements of Information Theory. Wiley-Interscience, Hoboken, New Jersey.


Gabrié, M., Manoel, A., Luneau, C., Macris, N., Krzakala, F., Zdeborová, L., et al., 2018. Entropy and mutual information in models of deep neural networks, in: Advances in Neural Information Processing Systems, pp. 1821–1831.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 721–741.
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
Goldfeld, Z., Van Den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., Polyanskiy, Y., 2019. Estimating information flow in deep neural networks, in: Proceedings of the 36th International Conference on Machine Learning, pp. 2299–2308.
Hinton, G.E., 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771–1800.
Kabashima, Y., 2008. Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels, in: Journal of Physics: Conference Series, IOP Publishing. p. 012001.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
LeCun, Y., Bottou, L., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 11, 2278–2324.
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.J., 2006. A tutorial on energy-based learning. MIT Press.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., Sohl-Dickstein, J., 2018. Deep neural networks as Gaussian processes, in: ICLR.
Lin, H.W., Tegmark, M., Rolnick, D., 2017. Why does deep and cheap learning work so well? Journal of Statistical Physics 168, 1223–1247.
Manoel, A., Krzakala, F., Mézard, M., Zdeborová, L., 2017. Multi-layer generalized linear estimation, in: 2017 IEEE International Symposium on Information Theory (ISIT), IEEE, pp. 2098–2102.
Matthews, A., Rowland, M., Hron, J., Turner, R.E., Ghahramani, Z., 2018. Gaussian process behaviour in wide deep neural networks, in: ICLR.
Mehta, P., Schwab, D.J., 2014. An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831.
Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D.A., Pennington, J., Sohl-Dickstein, J., 2018. Bayesian deep convolutional networks with many channels are Gaussian processes, in: ICLR.
Oord, A., Schrauwen, B., 2014. Factoring variations in natural images with deep Gaussian mixture models, in: NeurIPS.
Patel, A., Nguyen, M., Baraniuk, R., 2016. A probabilistic framework for deep learning, in: NeurIPS.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Saxe, A., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B., Cox, D., 2018. On the information bottleneck theory of deep learning, in: International Conference on Learning Representations.
Shwartz-Ziv, R., Tishby, N., 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
Slonim, N., 2002. The information bottleneck: Theory and applications. Ph.D. thesis. Citeseer.
Tang, Y., Salakhutdinov, R., Hinton, G., 2015. Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.
Wasserman, L., 2006. All of Nonparametric Statistics. Springer Science & Business Media.
Xiao, H., Rasul, K., Vollgraf, R., 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Yaida, S., 2019. Non-Gaussian processes and neural networks at finite widths. arXiv preprint arXiv:1910.00019.
Yu, S., Giraldo, L.G.S., Jenssen, R., Principe, J.C., 2019. Multivariate extension of matrix-based Renyi's α-order entropy functional. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yu, S., Wickstrøm, K., Jenssen, R., Principe, J.C., 2020. Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.
