A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Xinjie Lan, Kenneth E. Barner

Department of Electrical and Computer Engineering, University of Delaware, Newark, DE, USA, 19711

ARTICLEINFO ABSTRACT

Keywords: In this paper, we propose a probabilistic representation of MultiLayer Perceptrons (MLPs) to improve deep neural networks the information theoretic interpretability. Above all, we demonstrate that the activations being i.i.d. is information bottleneck not valid for all the hidden layers of MLPs, thus the existing mutual information estimators based on probabilistic modeling non-parametric inference methods, e.g., empirical distributions and Kernel Density Estimate (KDE), non-parametric inference are invalid for measuring the information flow in MLPs. Moreover, we introduce explicit probabilistic explanations for MLPs: (i) we define the (ΩF ,  ,PF ) for a fully connected layer f and demonstrate the great effect of an activation function of f on the probability PF ; (ii) we prove the entire architecture of MLPs as a Gibbs distribution P ; and (iii) the back-propagation aims to optimize the sample space ΩF of all the fully connected layers of MLPs for learning an optimal Gibbs distribution P ∗ to express the statistical connection between the input and the label. Based on the probabilistic explanations for MLPs, we improve the information theoretic interpretability of MLPs in three aspects: (i) the of f is discrete and the corresponding is finite; (ii) the information bottleneck theory cannot correctly explain the information flow in MLPs if we take into account the back-propagation; and (iii) we propose novel information theoretic explanations for the generalization of MLPs. Finally, we demonstrate the proposed probabilistic representation and information theoretic explanations for MLPs in a synthetic dataset and benchmark datasets.

1. Introduction Notably, the non-parametric statistical models lack solid theoretical basis in the context of DNNs. As two classical Improving the interpretability of Deep Neural Networks non-parametric inference algorithms (Wasserman, 2006), the (DNNs) is a fundamental issue of deep learning. Recently, empirical distribution and KDE approach the true distribu- numerous efforts have been devoted to explaining DNNs from tion only if the samples are independently and identically the view point of information theory. In the seminal work, distributed (i.i.d.). Specifically, the prerequisite of applying Shwartz-Ziv and Tishby(2017) initially use the Information the non-parametric in DNNs is that the activations Bottleneck (IB) theory to clarify the internal logic of DNNs. of a hidden layer are i.i.d. samples of the true distribution Specifically, they claim that DNNs optimize an IB tradeoff of the layer. However, none of previous works explicitly between compression and prediction, and the generalization demonstrates the prerequisite. performance of DNNs is causally related to the compression. Moreover, the unclear definition for the random variable However, the IB explanation causes serious controversies, of a hidden layer results in an information theoretic issue especially Saxe et al.(2018) question the validity of the IB (Chelombiev et al., 2019). Specifically, a random variable explanations by some counter-examples, and Goldfeld et al. is a measurable function F ∶ Ω → E mapping the sample (2019) doubt the causality between the compression and the space Ω to the measurable space E. All the previous works generalization performance of DNNs. simply assume the activations of a hidden layer as E but not Basically, the above controversies stem from different specify Ω, which indciates F as a continuous random vari- probabilistic models for the hidden layer of DNNs. Due to able because the activations are continuous. As a result, the arXiv:2010.14054v1 [cs.LG] 27 Oct 2020 the complicated architecture of DNNs, it is extremely hard to conditional distribution P (F X) would be a delta function establish an explicit probabilistic model for the hidden layer ð under the assumption that DNNs are deterministic models, of DNNs. As a result, all the previous works have to adopt thereby the mutual information I(X,F ) = ∞, where X is non-parametric statistics to estimate the mutual information. the random variable of the input. However, that contradicts Shwartz-Ziv and Tishby(2017) model the distribution of a experimental results I(X,F ) < ∞. hidden layer as the empirical distribution (a.k.a. the binning To resolve the above information theoretic controversies method) of the activations of the layer, whereas Saxe et al. and further improve the interpretability for DNNs, this paper (2018) model the distribution as Kernel Density Estimation proposes a probabilistic representation for feedforward fully (KDE), and Goldfeld et al.(2019) model the distribution connected DNNs, i.e., the MultiLayer Perceptrons (MLPs), as the convolution between the empirical distribution and in three aspects: (i) we thoroughly study the i.i.d. property additive Gaussian noise. Inevitably, different probabilistic of the activations of a fully connected layer, (ii) we define models derive different information theoretic explanations the probability space for a fully connected layer, and (iii) we for DNNs, thereby leading to controversies. explicitly propose probabilistic explanations for MLPs and [email protected] (X. 
Lan) the back-propagation training algorithm. ORCID(s): 0000-0001-7600-106 (X. Lan)

Xinjie Lan et al.: Preprint submitted to Elsevier Page 1 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

First, we demonstrate that the correlation of activations () � � with the same label becomes larger as the layer containing � () the activations is closer to the output. Therefore, activations � () () � � � being i.i.d. is not valid for all the hidden layers of MLPs. ⋮ () () � � In other words, the existing mutual information estimators � () � based on non-parametric statistics are not valid for all the � hidden layers of MLPs as the activations of hidden layers � ⋮ � () cannot satisfy the prerequisite. � ⋮ () Second, we define the probability space (Ω , ,P ) for � � F  F � ⋮ () a fully connected layer f with N neurons given the input � ⋮ � x. Let the experiment be f extracting a single feature of � � x, (ΩF ,  ,PF ) is defined as follows: the sample space ΩF consists of N possible outcomes (i.e., features), and each � � � � outcome is defined by the weights of each neuron; the event � � � space  is the -algebra; and the PF Figure 1: The input layer x has M nodes, and f1 has N N ∑M (1) is a Gibbs measure for quantifying the probability of each neurons {f1n = 1[g1n(x)]}n=1, where g1n(x) = m=1 !mn ⋅ xm + (1) outcome occurring the experiment. Notably, the activation b1n is the nth linear function with !mn being the weight of the edge between x and f , and b being the bias.  (⋅) function of f has a great effect on PF , because an activation m 1n 1n 1 is a non-linear activation function, e.g., the ReLU function. equals the negative energy function of PF . Similarly, f = {f =  [g (f )]}K has K neurons, where Third, we propose probabilistic explanations for MLPs 2 2k 2 2k 1 k=1 g f ∑N !(2) f b . In addition, f is the softmax, and the back-propagation training: (i) we prove the entire 2k( 1) = n=1 nk ⋅ 1n + 2k Y 1 ∑K (3) thus fyl = exp(gyl) where gyl = k !kl ⋅ f2k + byl and architecture of MLPs as a Gibbs distribution based on the ZY =1 Gibbs distribution P for each layer; and (ii) we show that ∑L F ZY = l=1 exp(gyl) is the partition function. the back-propagation training aims to optimize the sample space of all the layers of MLPs for modeling the statistical connection between the input x and the label y, because the Finally, we generate a synthetic dataset to demonstrate weights of each layer define sample space. the theoretical explanations for MLPs. Since the dataset only In summary, the three probabilistic explanations for fully has four simple features, we can validate the probabilistic connected layers and MLPs establish a solid probabilistic explanations for MLPs by visualizing the weights of MLPs. foundation for explaining MLPs in an information theoretic In addition, the four features has equal probability, thus the way. Based on the probabilistic foundation, we propose three dataset has fixed entropy. As a result, we can demonstrate novel information theoretic explanations for MLPs. the information theoretic explanations for MLPs. Above all, we demonstrate that the entropy of F is finite, The rest of the paper is organized as follows. Section2 i.e., H(F ) < ∞. Based on (ΩF ,  ,PF ), we can explicitly briefly discusses the related works. Section3 and4 propose define the random variable of f as F ∶ ΩF → EF , where the probabilistic and information theoretic explanations for EF denotes discrete measurable space, thus F is a discrete MLPs, respectively. Section5 specifies the mutual informa- random variable and H(F ) < ∞. As a result, we resolve the tion estimators based on (ΩF ,  ,PF ) for a fully connected controversy regarding F being continuous. layer. 
Section6 validates the probabilistic and information Furthermore, we demonstrate that the information flow theoretic explanations for MLPs on the synthetic dataset and of X and Y in MLPs cannot satisfy IB if taking into account benchmark dataset MNIST and Fashion-MNIST. Section7 the back-propagation training. Specifically, the probabilistic concludes the paper and discusses future work. explanation for the back-propagation training indicates that Preliminaries. P (X,Y ) = P (Y ðX)P (X) is an unknown ΩF depends on both x and y, thus F depends on both X joint distribution between two random variables X and Y .A and Y , where Y is the random variable of y. However, IB j j j M j J dataset  = {(x , y ) x ∈ ℝ , y ∈ ℝ} consists of J requires that F is independent on Y given X, ð j=1 i.i.d. samples generated from P (X,Y ) with finite L classes, In addition, we further confirm none causal relationship i.e., yj ∈ {1, ⋯ ,L}. A neural network with I hidden lay- between the compression and the generalization of MLPs. ers is denoted as DNN = {x; f ; ...; f ; f } and trained Alternatively, we demonstrate that the performance of a MLP 1 I Y by , where (xj, yj) ∈ are the input of the DNN and depends on the mutual information between the MLP and X,   the label, respectively, thus x ∼ P (X) and the DNN aims i.e., I(X, MLP). More specifically, we demonstrate all the to learn P (Y X) with the one-hot format, i.e., if l = yj, information of Y coming from X, i.e., H(Y ) = I(X,Y ) ð P (l xj) = 1; otherwise P (l xj) = 0. We use the (the relation is visualized by the Venn diagram in Figure4), Y ðX ð Y ðX ð MLP = {x; f ; f ; f } in Figure1 for most theoretical thus I(X, MLP) can be divided into two parts I(Y, MLP) 1 2 Y derivations unless otherwise specified. In addition, H(X) and I(X,̄ MLP), where X̄ = Y c ∩ X denotes the relative is the entropy of X, I(X,F ) and I(Y,F ) are the mutual complement Y in X. We demonstrate that the performance i i information between X, Y and F , where F is the random of the MLP on training dataset depends on I(Y, MLP), and i i variable of the hidden layer f . the generalization of the MLP depends on I(X,̄ MLP). i

Xinjie Lan et al.: Preprint submitted to Elsevier Page 2 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

2. Related work Except the classical non-parametric inference methods, recent works propose some new mutual information estima- 2.1. Information theoretic explanations for DNNs tors. For instance, Yu et al.(2019) propose the matrix-based IB aims to optimize a random variable F as a compressed Rényi -entropy to estimate I(X,Fi) without probabilistic representation of X such that it can minimize the informa- modeling f i, in which Shannon entropy is a special case tion of X while preserve the information of Y (Slonim, 2002). of Rényi -entropy when → 1 (Yu et al., 2019, 2020). Since F is a compressed representation of X, it is entirely Gabrié et al.(2018) propose the heuristic replica method to determined by X, i.e, P (F X,Y ) = P (F X), thus the joint ð ð estimate I(X,Fi) in statistical feedforward neural networks distribution P (X,Y,F ) can be formulated as (Kabashima, 2008; Manoel et al., 2017). P (X,Y,F ) = P (X,Y )P (F X) ð (1) 2.2. Probabilistic explanations for DNNs = P (Y )P (XðY )P (F ðX). Probabilistic modeling the hidden layer of DNNs is a fundamental question of deep learning theory. Numerous As a result, the corresponding can be described probabilistic models have been proposed to explain DNNs, as Y ↔ X ↔ F and IB can be formulated as e.g., (Lee et al., 2018; Matthews et al., ∗ 2018; Novak et al., 2018), mixture model (Patel et al., 2016; P (F ðX) = arg min I(X,F ) − I(Y,F ), (2) P (F ðX) Tang et al., 2015; Oord and Schrauwen, 2014), and Gibbs distribution (Mehta and Schwab, 2014; Yaida, 2019). where is the Lagrange multiplier controlling the tradeoff As a fundamental probabilistic graphic model, the Gibbs between I(X,F ) and I(Y,F ). distribution (a.k.a., , the energy based The key to validating the IB explanation for MLPs is to model, or the renormalization group) formulates the depen- precisely measure I(X,Fi) and I(Y,Fi). Ideally, we should dence within X by associating an energy E(x; ) to each specify F ∶ Ω → E before deriving I(F , X) and I(F ,Y ). i Fi Fi i i dependence structure (Geman and Geman, 1984). However, the complicated architecture of MLPs makes it 1 hard to specify Fi. Alternatively, most previous works use P (X; ) = exp[−E(x; )], (5) non-parametric inference to estimate I(X,Fi) and I(Y,Fi). Z() Based on a classical non-parametric inference method, E x;   namely the empirical distribution of the activations of f , where ( ) is the energy function, are the parameters, i Z  ∑ E x;  1 Shwartz-Ziv and Tishby(2017) experimentally show that and the partition function is ( ) = x exp[− ( )] . The Gibbs distribution has three appealing properties. the MLP = {x; f1; f2; fY } shown in Figure1 satisfies IB and corresponds to a Markov chain First, it can be easily reformulated as various probabilistic models by redefining E(x; ), which allows us to clarify the complicated architecture of a hidden layer. For example, if Y ↔X↔F1↔F2 ↔ FY . (3) the energy function is defined as the summation of multiple ∑ As a result, the information flow in the MLP should satisfies functions, namely E(x; ) = − k fk(x; k), the Gibbs dis- the two Data Processing Inequalities (DPIs) tribution would be the Product of Experts (PoE) model, i.e., P (x; ) = 1 ∏ F , where F = exp[−f (x;  )] and Z() k k k k k H(X) ≥ I(X,F1) ≥ I(X,F2) ≥ I(X,FY ), ∏ (4) Z() = k Z(k) (Hinton, 2002). Second, since Z() only I(Y, X) ≥ I(Y,F1) ≥ I(Y,F2) ≥ I(Y,FY ). depends on , the deterministic function E(x; ) is a suffi- cient statistics of P (X; ). 
The property allows us to explain Furthermore, they claim that most of training epochs focus a deterministic hidden layer in a probabilistic way. Third, the on learning a compressed representation of input for fitting energy minimization is a typical optimization for , namely the labels, and the generalization performance of DNNs is ∗ = arg min E(x; ) (LeCun et al., 2006), which allows us causally related to the compression phase.  to explain the back-propagation training, because the energy Meanwhile, other non-parametric inference methods are minimization can be implemented by the gradient descent also used to estimate the mutual information. For instance, algorithm as long as E(x; ) is differentiable. Saxe et al.(2018) use Gaussian KDE to estimate I(X,F ) i To the best of our knowledge, Mehta and Schwab(2014) and I(Y,F ), Goldfeld et al.(2019) choose the convolution i initially explain the distribution of hidden layers as a Gibbs of the empirical distribution and additive Gaussian noise to distribution in the Restricted Boltzmann Machine (RBM). estimate I(X,F ), and Chelombiev et al.(2019) propose some i Lin et al.(2017) clarify certain advantages of DNNs based adaptive techniques for optimizing the mutual information on the Gibbs distribution. Notably, Yaida(2019) indirectly estimators based on empirical distributions and KDE. demonstrates the distribution of a fully connected layer as a However, the information theoretic explanations for DNNs Gibbs distribution. However, there is few work to extend the based on non-parametric inference have several limitations. Gibbs explanation to complicated hidden layers, e.g., fully First, it is invalid for non-saturating activation functions, e.g., connected layers and convolutional layers. the widely used ReLU. Second, the causal relation between generalization and compression cannot be validated by KDE 1We only consider the discrete case in the paper. and other recent works (Gabrié et al., 2018).

Xinjie Lan et al.: Preprint submitted to Elsevier Page 3 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

j j j j j j Rxj, xj0 Rf , f 0 Rf , f 0 Rf , f 0 0 1.0 0 1 1 1.0 0 2 2 1.0 0 Y Y 1.0

1K 0.8 1K 0.8 1K 0.8 1K 0.8

2K 0.6 2K 0.6 2K 0.6 2K 0.6

3K 0.4 3K 0.4 3K 0.4 3K 0.4

4K 0.2 4K 0.2 4K 0.2 4K 0.2

5K 5K 5K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K 0 1K 2K 3K 4K 5K (A) (B) (C) (D)

1.0 1.0 1.0 1.0 j j0 0.9 rdiff(x , x ) 0.9 0.9 0.9 j j 0.8 j j0 0.8 0 0.8 0.8 rsame(x , x ) rdiff(f1, f1) 0.7 0.7 0.7 j j 0.7 j j j j 0 0 train_acc 0 rdiff(f2, f2) rdiff(fY, fY) 0.6 0.6 rsame(f1, f1) 0.6 0.6 0.5 0.5 0.5 j j0 0.5 j j0 train_acc rsame(f2, f2) rsame(fY, fY) 0.4 0.4 0.4 0.4 0.3 0.3 0.3 train_acc 0.3 train_acc 0.2 0.2 0.2 0.2 100 101 102 100 101 102 100 101 102 100 101 102 training epochs training epochs training epochs training epochs (E) (F) (G) (H)

j 5000 ¨ Figure 2: (A) visualizes the sample correlation matrix Rxj ,xj given the 5000 testing dataset {x }j=1 . (B)-(D) visualize the three j 5000 ¨ sample correlation matrix R j j for the three layers given {x }j=1 , respectively. (E) visualizes the average sample correlation of f i ,f i j 5000 j 5000 {x }j=1 with the same labels and with different labels. (F)-(H) visualize the average sample correlation of {f i }j=1 for the three layers with the same labels and with different labels.

3. Novel probabilistic explanations Since the necessary condition for samples being i.i.d. is uncorrelation, we can use the sample correlation to examine In this section, we present three theoretical results: (i) we ¨ if activations being i.i.d.. More specifically, if F j and F j demonstrate that activations being i.i.d. is not valid for all the i i are i.i.d., the sample correlation r j j¨ must be zero, layers of MLPs, thus non-parametric inference cannot model f i ,f i the distributions of all the fully connected layers of MLPs; (ii) we define the probability space (Ω , ,P ) for a fully ∑N j ̄j j¨ ̄j¨ F  F n=1(fin − f i )(fin − f i ) r ¨ = , (6) connected layer, and propose a probabilistic explanation for f j ,f j u i i 2 the entire architecture of MLPs based on the Gibbs measure ∑N j ̄j 2 ∑N j¨ ̄j¨ n=1(fin − f i ) n=1 (fin − f i ) PF ; and (iii) we introduce a probabilistic explanation for the back-propagation training based on the sample space ΩF . j j¨ j j¨ where f i and f i are two activation samples of F i and F i ¨ ̄j N j 3.1. Activations are not i.i.d. given two sample inputs xj and xj , f = 1 ∑ f , and j M i N n=1 in Given an input x ∈ ℝ , we define the corresponding N is the number neurons of f i. Xj Xj, ,Xj Xj multivariate random variable = [ 1 ⋯ M ], where m We specify the MLP to classify the benchmark MNIST j is the scalar-valued random variable of xm. In the context dataset (LeCun et al., 1998). Since the dimension of each of frequentist probability, all the parameters of MLPs are image is 28 × 28, the number of the input nodes is M = j N K viewed as constants, thus the random variable of g1n(x ) = 784. In addition, f1, f2, and fY have = 128, = 64, ∑M !(1) xj b Gj ∑M !(1)Xj b and L = 10 neurons/nodes, respectively. All the activation m=1 mn ⋅ m + 1n is defined as 1n = m=1 mn m + 1n j j functions are sigmoid. After 200 training epochs, we derive and the random variable of the activation f = 1(g ) j 5000 1n 1n r ¨ on the 5000 testing images {x } and define the j j f j ,f j j=1 F  G i i is defined as n = 1( n). Therefore, the multivariate 1 1 matrix R ¨ to contain all the r ¨ . j j j j , j ð j , j ð random variable of f = [f , ⋯ , f ] can be defined as f i f i f i f i 1 11 1N j j¨ F j F j , ,F j As a result, we can examine if F and F being i.i.d. 1 = [ 11 ⋯ 1N ]. Similarly, we define the multivariate i i j j j j by checking if most elements in R ¨ are close to zero. In f F F , ,F f j ,f j random variable of 2 as 2 = [ 21 ⋯ 2K ] and the mul- i i j f j F j F j , ,F j addition, we rearrange the order of {x }5000 such that images tivariate random variable of Y as Y = [ y1 ⋯ yL]. j=1 Samples being i.i.d. is the prerequisite of non-parametric with the same label have consecutive index, i.e., images with inference methods, e.g., the empirical distribution and KDE. label l has the index [l×500, (l+1)×500], thus we can easily In the context of MLPs, all the previous works regard the check the correlation of activations with the same label. activations of a layer as the samples of the random variable We demonstrate that the correlation of activations with of the layer. As a result, activations being i.i.d. should be the the same label becomes larger as the layer is closer to the prerequisite of applying non-parametric inference methods output. In other words, activations being i.i.d. is not valid to estimate the true distribution of the layer. for all the layers of the MLP, thus non-parametric inference cannot correctly model the true distribution of all the layers.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 4 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

More specifically, Figure2(A) shows that the correlation 3.2.1. The probability space (ΩF ,  ,PF ) for a layer j 5000 x r ¨ , ,P of each pair of testing images { }j=1 , i.e., xj ,xj , is close In this section, we define the probability space (ΩF  F ) j j¨ j j¨ for a fully connected layer, and prove the probability space to zero. Figure2(E) shows ̄rdiff(x , x ) and ̄rsame(x , x ), which denote the average sample correlation of {xj}5000 with being valid for all the fully connected layers of MLPs. j=1 Definition. Given a fully connected layer f consisting of different labels and with the same label, respectively. N f  g x N  neurons { n = [ n( )]}n=1, where (⋅) is an activation M M L−1 function, e.g, the sigmoid function, x = {xm} ∈ ℝ is ¨ 1 É É m=1 j j M ̄rdiff(x , x ) = rxj ,xj¨ (7) ∑ N the input of f, gn(x) = m=1 !mn ⋅ xm + bn is the nth linear diff l=0 yj yj¨ ≠ filter with !mn being the weight and bn being the bias, let f extracting a single feature of x be an experiment, we define , ,P f L−1 the probability space (ΩF  F ) for as follows. j j¨ 1 É É First, the sample space ΩF includes N possible outcomes ̄rsame(x , x ) = rxj ,xj¨ (8) N M N N {!n} = {{!mn} } defined by the weights of the N same l=0 yj yj¨ l n=1 m=1 n=1 = = neurons. Since a scalar value cannot describe the feature of ¨ x, we do not take into account b for defining Ω . In terms where N and N are the total number of pairs (xj, xj ) n F diff same of , ! defines a possible feature of x. In with different labels and the same label, respectively. We n ¨ ¨ particular, the definition of the experiment guarantees that observe that r (xj, xj ) is around 0.29 and r (xj, xj ) diff same the possible outcomes are mutually exclusive (i.e., only one is around 0.43 in Figure2(E). In summary, the correlation outcome will occur on each trial of the experiment). coefficients of {xj}5000 are low, thus i.i.d. can be viewed as j=1 Second, we define the event space  as the -algebra. a valid assumption for {xj}5000. j=1 For example, if f has N = 2 neurons and Ω = {!1, !2}, In terms of the correlation of activations with the same  = {ç, {!1}, {!2}, {{!1, !2}} means that neither of the label in different layers, Figure2(B)-(D) show an ascend- outcomes, one of the outcomes, or both of the outcomes ing trend as the layer is closer to the output. For instance, could happen, respectively. the pixels at the top-left corner of R ¨ becomes lighter f j ,f j Third, the probability measure PF is the Gibbs measure i i ! N x as the layer is closer to the output, i.e., the correlation of to quantify the probability of { n}n=1 occurring in . the activations with the label 0 becomes larger. In addition, Figure2(F)-(J) also demonstrate the ascending trend, i.e., 1 1 PF (!n) = exp(fn) = exp[(gn(x))] j j¨ j j¨ Z Z ̄r f , f ̄r f , f F F same( 1 1 ) converges to 0.55, same( 2 2 ) converges to (9) j j¨ 1 0.79, and ̄r (f , f ) converges to 0.84. = exp[( !n, x + bn)] same Y Y Z ⟨ ⟩ As a comparison, Figure2(B)-(D) show the correlation F of activations with different labels being relatively stable in where ⋅, ⋅ denotes the inner product and Z = ∑N exp(f ) different layers, which is further validated by Figure2(F)-(J) ⟨ ⟩ F n=1 n ¨ ¨ ¨ is the partition function. ̄r f j , f j ̄r f j , f j ̄r f j , f j showing that diff( 1 1 ), diff( 2 2 ), and diff( Y Y ) Proof. We use the mathematical induction to prove the converge to 0.29, 0.27, and 0.33, respectively. 
probability space for all the fully connected layers of the In summary, the correlation of activations with the same MLP in the backward direction. Given three probability space label becomes larger as the layer is closer to the output, thus (ΩF ,  ,PF ), (ΩF ,  ,PF ), and (ΩF ,  ,PF ) for the three activations being i.i.d. is not valid for all the layers of the 1 1 2 2 Y Y layers f1, f2, and fY , respectively, we first prove PF as MLP. In addition, we derive the same result based on more Y a Gibbs distribution, and then we prove PF and PF being complicated MLPs on the benchmark Fashion-MINST dataset 2 1 Gibbs distributions based on PF and PF , respectively. in AppendixB. As a result, non-parametric inference, e.g., Y 2 Since the output layer fY is the softmax, each output the empirical distribution and KDE, cannot correctly model node fyl can be formulated as the true distribution of all the layers, thus they are invalid for estimating the mutual information between each layer and K 1 1 É the input/labels. Notably, this section further confirms the f = exp(g ) = exp[ !(3) ⋅ f + b ] yl Z yl Z kl 2k yl necessity for establishing a slid probabilistic foundation for FY FY k=1 (10) deriving information theoretic explanations for DNNs. 1 (3) = exp[⟨!l , f2⟩ + byl], ZF 3.2. Probabilistic explanations for MLPs Y This section proposes three probabilistic explanations for ∑L where ZF = l=1 exp(gyl) is the partition function and MLPs: (i) we define the probability space (Ω , ,P ) for a Y F  F !(3) = {!(3)}K . Comparing Equation9 and 10, we can fully connected layer, (ii) we prove the entire architecture of l kl k=1 derive that f forms a Gibbs distribution P (!(3)) = f MLPs as a Gibbs distribution P , and (iii) we demonstrate Y FY l yl that the back-propagation aims to optimize the sample space (3) to measure the probability of !l occurring in f2, which is of each layer to learn an optimal Gibbs distribution P ∗ for consistent with the definition of (Ω , ,P ). FY  FY describing the statistical connection between X and Y .

Xinjie Lan et al.: Preprint submitted to Elsevier Page 5 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Based on the properties of exponential functions, i.e., Therefore, P still can be modeled as a PoE model F2 exp(a + b) = exp(a) ⋅ exp(b) and exp(a ⋅ b) = [exp(b)]a, we (3) N can reformulate PF (! ) as 1 Ç 1 !(2) K Y l P = { [ exp(f )] nk } , (16) F2 ¨¨ 1n k=1 Z ZF K F2 n=1 1 (3) 1 Ç !(3) P (! ) = [exp(f )] kl , (11) (2) FY l ¨ 2k ! Z ¨¨ ∏N nk FY k=1 where Z = ZF ∕[exp(b2k) ⋅ Z ] and the partition F2 2 n=1 F1 ∑N (2) ¨ K function is Z = exp(f ). Similar to P (! ), we where Z = Z ∕exp(b ). Since {exp(f )} are scalar, F1 n=1 1n F2 k F FY yl 2k k=1 Y can derive the probability measure of f1 as we can introduce a new partition function Z = ∑K exp(f ) F2 k=1 2k 1 K (1) 1 such that { exp(f2k)} becomes a probability measure, PF (!n ) = exp(f1n) ZF k=1 1 Z 2 F1 (3) thus we can reformulate P (! ) as a PoE model M FY l 1 É (1) = exp[1( !mn ⋅ xm + b1n)] (17) K Z F1 m=1 (3) 1 Ç 1 !(3) P (! ) = [ exp(f )] kl , (12) FY l ¨¨ 2k 1 (1) Z ZF = exp[ ( ! , x + b )], FY k=1 2 Z 1 ⟨ n ⟩ 1n F1 K !(3) ¨¨ ∏ kl ∑N where Z = ZF ∕[exp(byl) ⋅ k [ZF ] ], especially each where Z = exp(f ) is the partition function and FY Y =1 2 F1 n=1 1n 1 (1) (1) M expert is defined as exp(f2k). !n = {!mn} . We can conclude that f corresponds to ZF m=1 1 2 (1) It is noteworthy that all the experts { 1 exp(f )}K a Gibbs distribution PF (!n ) to measure the probability of Z 2k k=1 1 F2 !(1) occurring in x, which is consistent with (Ω , ,P ). form a probability measure and establish an exact one-to-one n F1  F1 Overall, we prove the proposed probability space for all the correspondence to all the neurons in f2, thus the distribution of f can be expressed as fully connected layers in the MLP. Notably, we can easily 2 extend the probability space to an arbitrary fully connected 1 K layer through properly changing the script. PF = { exp(f2k)}k=1. (13) 2 Z Based on (ΩF , ,PF ) for a fully connected layer f, we F2  can specify the corresponding random variable F ∶ ΩF → ∑N (2) K E N Since {f =  ( ! ⋅ f + b )} , P can be F . More specifically, since ΩF = {!n}n includes finite 2k 2 n=1 nk 1n 2k k=1 F2 =1 extended as N possible outcomes, EF is a discrete measurable space and F is a discrete random variable, e.g., P (F = n) denotes the N 1 É probability of !n occurring in the experiment. P (!(2)) = exp[ ( !(2) ⋅ f + b )] F2 k Z 2 nk 1n 2k F2 n=1 (14) 3.2.2. The probabilistic explanation for the entire 1 = exp[ ( !(2), f + b )]. architecture of the MLP Z 2 ⟨ k 1⟩ 2k F2 Since ΩF is defined by !mn, it is fixed if not considering parameters updating. Therefore, Fi+1 is entirely determined where Z = ∑K exp(f ) is the partition function and by F in the MLP = {x; f ; f ; f } without considering the F2 k=1 2k i 1 2 Y (2) (2) N back-propagation training, and the MLP forms the Markov ! = {! } . We can conclude that f2 corresponds to k nk n=1 chain X ↔ F ↔ F ↔F , thus the entire architecture of a Gibbs distribution P (!(2)) to measure the probability of 1 2 Y F2 k the MLP corresponds to a joint distribution (2) !k occurring in f1, which is consistent with (ΩF ,  ,PF ). 2 2 P (F ,F ,F X) = P (F F )P (F F )P (F X). (18) Due to the non-linearity of the activation function 2(⋅), Y 2 1ð Y ð 2 2ð 1 1ð we cannot derive P (!(2)) being a PoE model only based F2 k Subsequently, we can derive the marginal distribution on the properties of exponential functions. 
Alternatively, P (F X) still being a Gibbs distribution the equivalence between the gradient descent algorithm and Y ð the first order approximation (Battiti, 1992) indicates that K N N (2) É É ∑ PF X(l x) = P (FY = l, F2k, F1 = n X = x) 2( n=1 !nk ⋅ f1n + b2k)] can be approximated as Y ð ð ð k=1 n=1 (19) N N 1 É (2) É (2) = exp[fyl(f2(f1(x)))], 2( !nk ⋅f1n+b2k)] ≈ C21⋅[ !nk ⋅f1n+b2k]+C22, (15) ZMLP(x) n=1 n=1 where Z (x) = ∑L ∑K ∑N P (l, k, n x) is MLP l=1 k=1 n=1 FY ,F ,F X ð C C f N 2 1ð where 21 and 22 only depend on the activations { 1n}n=1 the partition function. Notably, the energy function Eyl(x) = in the previous training iteration, thus they can be regarded −fyl(f2(f1(x))) indicates that P (FY X) is determined by (2) ð as constants and absorbed by !nk and b2k. The proof for the the entire architecture of the MLP. The detailed derivation approximation is included in AppendixC. of P (FY ðX) is presented in AppendixD.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 6 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Ω Ω Ω

(�) (�) (�) (�) � … � (�) … (�) … � � � �� �� �� � ⋮ ⋮ ⋮ ⋮ ⋮ ⋮

(�) (�) (�) … (�) (�) (�) � � �� … �� � … � � � ��

ℓ �

� �(�|�) �(�|�) �(�|�)

Figure 3: The probabilistic explanation for the MLP = {x; f1; f2; fY } and the training algorithm. l denotes the loss function. P (FiðFi−1) is the distribution of the layer fi given its previous layer. The oval above P (Fi) represents the corresponding sample (1) N (1) M space Ωi, which consists of possible outcomes defined by the weights of neurons. For example, Ω1 = {!n }n=1 and !n = {!mn}m=1.

3.2.3. Probabilistic explanations for training we can reformulate )l as )!(2) The back-propagation training (Rumelhart et al., 1986) nk updates the parameters of a hidden layer in the back-forward L ¨ )l É )l  (g2k) direction. In the MLP = {x; f1; f2; fY }, the weights of = !(3) 2 f . (22) (2) (3) kl 1n each layer are updated as f2k )!nk l=1 )!kl

)l )l )l !(T + 1) = !(T ) − (20) Equation 22 indicates is a function of , thus ΩF is )!(2) )!(3) 2 )!(T ) nk kl a function of ΩY . Similarly, we can derive where !(T ) denotes the learned weights in the T th training iteration, l is the cross entropy loss function, and is the K ¨ )l É )l  (g1n) = !(2) 1 x , (23) learning rate. Specifically, the gradient of l with respect to (1) (2) nk m f1n the weight of each layer in the MLP are formulated as )!mn k=1 )!nk which indicates that Ω is a function of Ω . )l F1 F2 = [f − P (l x)]f , In addition, we can derive Ω , Ω , and Ω depending (3) yl Y ðX ð 2k F1 F2 FY )!kl on the input x. Equation (21) shows that the gradient of l L with respect to the weight of a layer is a function of the input )l É f P l x !(3)¨ g f , )l )l = [ yl − Y X( ð )] ( 2k) 1n of the layer, i.e., (1) is a function of xm, (2) is a function (2) ð kl 2 )! )! )! l=1 mn nk nk )l K L of f n, and is a function of f k. f n =  [g n(x)] and 1 )!(3) 2 1 1 1 )l É É (3) ¨ (2) ¨ kl = [f − P (l x)]!  (g )!  (g )x . )l )l )l (1) yl Y ðX ð kl 2 2k nk 1 1n m f k =  [g k(f1)] imply that (1) , (2) , and (3) depend )!mn k=1 l=1 2 2 2 )! )! )! mn nk kl (21) on the input x. As a result, Ω , Ω , and Ω depend on F1 F2 FY the input x, because Ω is determined by )l . where P (l x) is the of the label )!(t) Y ðX ð y given the input x, i.e., P (l x) = 1 if l = y, otherwise In summary, the back-propagation training establishes Y ðX ð P (l x) = 0. The derivation is presented in AppendixE. the relation between the sample space of each layer and the Y ðX ð input/labels in two aspects. First, Ω , Ω , and Ω depend Since weights are randomly initialized before training, F1 F2 FY on y, i.e., Ω is a function y and Ω is a function of Ω . i.e., !(0) are random values, !(T +1) are entirely determined FY Fi Fi+1 )l T Second, Ω , Ω , and Ω depend on x. by all the gradients before T + 1, i.e., { } . Therefore, F1 F2 FY )!(t) t=1 Finally, we visualize the probabilistic explanation for the we conclude that Ω is determined by { )l }T because F )!(t) t=1 MLP in Figure3. The blue arrows indicate that the three ΩF is defined by !. layers f 1, f 2, and f Y form three conditional distributions, As a result, we can derive that Ω is a function y and i.e., P (F X), P (F F ), and P (F F ), respectively. The FY 1ð 2ð 1 Y ð 2 )l green arrows indicate that the back-propagation optimize the ΩF is a function of ΩF . First, since is a function of i i+1 )!(3) kl three sample space Ω1, Ω2, and ΩY by updating the weights P (l x), Ω can be viewed as a function of y based on Y ðX ð FY of each layer in the backward direction. In addition, the black the definition of P (l x). Second, based on Equation 21, Y ðX ð arrows indicate the mutual effect between the sample space and the corresponding Gibbs distributions.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 7 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

� F F F �

H(�) Figure 5: The information flow of X and Y in the MLP. H(�) I(�, �) Section 3.2.3 demonstrates that Ω depends on both the I(�, �) |�) Fi (� H input x and the label y, thus I(Fi, X) ≠ 0 and I(Fi,Y ) ≠ 0. Since all the information of Y stems from X, I(Fi,Y ) is a subset of I(F , X), which is shown in Figure (4). H � = + + + I �, � = i I �, � = H � = + + + 4.3. The limitation of IB H � = + I �, � = � �, � − � �, � = IB assumes that F does not contain any information about Figure 4: The Venn diagram shows the relationship between Y except the information given by X, i.e., P (F ðX,Y ) = the information of X, Y , and Fi. Since P (X,Y ) is unknown, P (F ðX). Supposing MLPs satisfy the probabilistic premise, the information of P (X,Y ) is denoted by the largest oval with Shwartz-Ziv and Tishby(2017) propose the Markov chain dashed boundary. To facilitate subsequent discussions, we (Equation3) and two DPIs (Equation4) for the MLP. still use H(X) to denote the information of the i.i.d. samples However, we demonstrate Fi ∶ ΩF → EF depending on {xj }J , because the information of the samples converges to i i j=1 both X and Y , because Section 3.2.3 show Ω depending on H as long as the number of samples is large enough. Fi (X) both x and y. As a result, MLPs not satisfy the probabilistic premise for IB if taking into account the back-propagation. Notably, the information that Y transfers to Fi during 4. Novel information theoretic explanations training will retain in Ω after training, because Ω is fixed Fi Fi Based on the probabilistic representations for MLPs, we after training. In other words, MLPs still cannot satisfy the propose five information theoretic explanations for MLPs. probabilistic premise for IB even after training. Therefore, First of all, the entropy of a fully connected layer is finite. the information flow of X and Y in MLPs cannot satisfy the Second, we specify the information theoretic relationship DPIs (Equation4) derived from IB after training. between X, Y and Fi. Third, IB cannot correctly explain MLPs because MLPs not satisfy the probabilistic premise 4.4. The information flow of Y and X Section 3.2.3 shows that Ω is a function y, Ω is a for IB. Forth, we specify the information flow of X and Y FY F2 function of Ω , and Ω is a function of Ω in the MLP. in MLPs. Fifth, we propose a novel information theoretic FY F1 F2 Based on the definition of F ∶ Ω → E , we can derive explanation for the generalization of MLPs. Fi Fi Fi that FY is a function Y , F2 is a function of FY , and F1 is 4.1. The entropy of a layer is finite a function of F2, which indicates the Markov chain Y ↔ A controversy about information theoretic explanations FY ↔ F2 ↔ F1. As a result, the information flow of Y in for MLPs is that the random variable F ∶ Ω → E for the MLP can be expressed as i Fi Fi a fully connected layer f is continuous or discrete (Gold- i I(Y,F ) I(Y,F ) I(Y,F ). (24) feld et al., 2019). All the previous works assume activations 1 ≤ 2 ≤ Y as Ei, thus Fi is continuous and H(FiðX) = −∞ under the Since X is the input of the MLP, the information of X assumption that MLPs are deterministic models, which con- seems to flow in the forward direction in the MLP (the blue tradicts simulation results, i.e., H(FiðX) < ∞. arrows in Figure5). However, all the information of Y stems The definition of (Ω , ,P ) resolves the controversy. Fi  Fi from X (the red dashed arrow in Figure5) and flows in the Since Ω is discrete, F is discrete, thereby H(F X) < ∞. 
Fi i ið backward direction (the green arrows in Figure5) imply the In particular, the Gibbs measure P regards activations of f Fi i information of X flows in both the forward and the backward as the negative energy, i.e., activations are the intermediate directions, i.e., it cannot satisfy any DPI in the MLP. variables of P , rather than E . Fi Fi 4.5. A novel information theoretic explanation for Y F 4.2. The relationship between X, and i the generalization of MLPs Since we suppose (xj, yj) ∈ being i.i.d., we can derive  In terms of deep learning, generalization indicates the H(Y ) = I(X,Y ) (the proof is presented in AppendixF), ability of neural networks adapting to new data, which does which indicates that all the information of Y stems from X, not belong to the training dataset but is drawn from the i.e., the information of Y is a subset of that of X in the Venn  same distribution P (X,Y ). Based on the above information diagram of X, Y and F (Figure4). i theoretic explanations, we propose an information theoretic Since the weights of f are randomly initialized, F does i i explanation for the generalization of MLPs. Specifically, the not contain any information of X and Y before training. In performance of the MLP on the training dataset  can be addition, we cannot guarantee that all the information of Fi measured by I(Y,FY ), and the generalization performance is learned from  after training, thus H(FiðX) ≠ 0. ̄ of the MLP can be measured by I(X,F1).

Xinjie Lan et al.: Preprint submitted to Elsevier Page 8 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Based on H(Y ) = I(X,Y ), we can derive 5.1. The estimation of I(X,Fi) Based on the definition of mutual information, we have ̄ I(X,Fi) = I(Y,Fi) + I(X,Fi), (25) I(X,F ) = H(F ) − H(F X). (30) where X̄ = Y c ∩ X is the relative complement Y in X. The i i ið information theoretic relationship is shown in Figure4. Notably, all the previous works estimate I(X,Fi) as H(Fi), The performance the MLP on the training dataset can  because IB supposes that Fi is entirely determined by X, be measured by I(Y,F ). Since we prove the MLP as the Y namely H(FiðX) = 0. However, we have H(FiðX) ≠ 0 Gibbs distribution P (FY ðX) for learning P (Y ðX) in Section in Section 4.2. As a result, we should take into account 3.2.2 and 3.2.3, we have P (yj xj) = P (yj xj) = 12 FY ðX ð Y ðX ð H(FiðX) for precisely estimating I(X,Fi). if the cross entropy loss decreases to zero. As a result, we The key to estimating I(X,Fi) is specifying P (FiðX) have H(FY ðX) = 0, thereby and P (F ). Based on the definitions of (Ω , ,P ) and the i Fi  Fi random variable Fi ∶ ΩF → EF (Section 3.2.1), we have I(FY ,X) = H(FY ) − H(FY ðX) = H(FY ). (26) i i 1 To derive H(FY ), we can reformulate P (FY ) as P (n xj) = exp[ ( !(i), f (xj) +b )]. (31) FiðX ð Z i ⟨ n 1→(i−1) ⟩ in É Fi P (FY ) = P (FY ðX = x)P (X = x). (27) x j ∈ where f i has N neurons, i.e., n ∈ [1,N], and f 1→(i−1)(x ) P X x P X is the input of f i, i.e., the output of the hidden layers from Though ( = ) is intractable due to ( ) is unknown, j P F the first one to (i − 1)th one given x . More specifically, we can simplify ( Y ) as follows when  includes large j j J the P (n x ) corresponding to the three fully connected enough i.i.d. samples, i.e., {x } ≈ . FiðX ð j=1 layers in the MLP can be expressed as 1 É P (F = yj) = P (F = yj X = xj). (28) 1 Y Y ð P n j  (1), j b J j F X( ðx ) = exp[ 1(⟨!n x ⟩ + 1n)] x ∈ 1ð Z F1 j j 1 where P (X = x ) = 1∕J because x are i.i.d. samples. P (k xj) = exp[ ( !(2), f (xj) + b )] j j j j F2ðX ð 2 ⟨ k 1 ⟩ 2k (32) If P (y x ) = P (y x ) = 1, we derive P (F ) = ZF FY ðX ð Y ðX ð Y 2 P (Y ) and H(FY ) = H(Y ). Overall, if the cross entropy loss j 1 (3) j PF X(l x ) = exp[ ! , f 2(f 1(x )) + byk] decreases to zero, I(X,FY ) = H(Y ), i.e., FY contains all Y ð ð Z ⟨ l ⟩ FY the information of Y , otherwise I(X,FY ) < H(Y ). The generalization performance of the MLP can be mea- In addition, we derive the marginal distribution PF (n) ̄ i sured by I(X,F1). First, H(Y ) = I(X,Y ) implies that the from the joint distribution P (Fi, X) as generalization of the MLP is entirely determined by how É much information of X the MLP has. Second, when the P (Fi = n) = P (FY = n, X = x) cross entropy loss decreases to zero, I(Y, MLP) achieves x∈  (33) the maximum H(Y ) in f , thus I(X, MLP) only depends É Y = P (X = x)P (FY = nðX = x). I X,̄ on ( MLP) based on Equation 25. In other words, if x∈ I(X,̄ MLP) is large, then I(X, MLP) is large and the MLP has good generalization. Since P (X) is unknown and  →  as long as  includes The information flow of X̄ in the MLP satisfies a DPI. large enough i.i.d. samples, we relax P (Fi = n) as Specifically, X̄ does not contain information of Y , thus it É j j cannot flow in the backward direction, i.e., it can only flow P (Fi = n) = P (X = x )P (Fi = nðX = x ) xj ∈ in the forward direction (the blue arrows in Figure5). We  (34) introduce a Markov chain X̄ ↔ F ↔ F ↔ F and the 1 É j 1 2 Y = P (Fi = nðX = x ). 
corresponding DPI is J j x ∈ ̄ ̄ ̄ I(X,F1) I(X,F2) I(X,FY ), (29) We can observe that P (n) measures the average prob- ≥ ≥ Fi ̄ ̄ ability of !(i) occurring in the entire dataset {xj}J , and thus I(X, MLP) can be simplified as I(X,F1). n j=1 In summary, the performance the MLP on the training P (n xj) measures the probability of !(i) occurring in FiðX ð n dataset  can be measured by I(Y,FY ), and the generaliza- the single data xj. Based on the equivalence between the ̄ tion of the MLP can be measured by I(X,F1). Kullback-Leibler (KL) divergence and mutual information, i.e., I(X,Fi) = EXKL[P (FiðX)ððP (Fi)] (Cover and Thomas, 5. The mutual information estimation 2006), we can conclude that I(X,Fi) would be small if the (i) N probability of {!n } occurring in each single data is close In this section, we estimate the mutual information I(Fi, X) n=1 (i) N and I(F ,Y ) based on the probability space (Ω , ,P ). to that of {!n } occurring the entire dataset, otherwise i Fi  Fi n=1 I(X,Fi) would be large. 2P (yj xj ) = P (F = yj X = xj ) for simplicity. FY ðX ð Y ð

Xinjie Lan et al.: Preprint submitted to Elsevier Page 9 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Image0 (label [1,0]) Image1 (label [0,1]) Image2 (label [1,0]) Image3 (label [0,1]) 1.5 1.5 1.5 1.5 1.5 1.0 1.0 1.0 1.0 1.0 0.5 0.5 0.5 0.5 0.5 0.0 0.0 0.0 0.0 0.0 0.5 0.5 0.5 0.5 0.5 1.0 1.0 1.0 1.0 1.0 1.5 1.5 1.5 1.5 1.5 (A) (B) (C) (D) (E)

Figure 6: (A) shows the deterministic image x̂ . All the synthetic images x are generated by rotating x̂ and adding the Gaussian noise  (x, 0.1), i.e., x = r(x̂ ) +  (x, 0.1), where r(⋅) defines totally four different ways to rotate x̂ and x is the expectation of x. Specifically, Image0 is the synthetic image generated by adding  (x, 0.1) without rotation, Image1 is the synthetic image generated by rotating x̂ along the secondary diagonal direction and adding  (x, 0.1), Image2 is the synthetic image generated by rotating x̂ along the vertical direction and adding  (x, 0.1), and Image3 is the synthetic image generated by rotating x̂ along the horizontal direction and adding  (x, 0.1). The four images are categorized into two different classes: Image0 and Image2 with label [1, 0], and Image1 and Image3 with label [0, 1].

5.2. The estimation of I(Fi, Y ) 6.1. Setup Based on the definition of mutual information, we have We generate a synthetic dataset consisting of 256 32×32 grayscale images based on the deterministic image x̂ shown I(Y,Fi) = H(Fi) − H(FiðY ). (35) in Figure6(A). A synthetic image x is generated by rotating 2 x̂ and adding the Gaussian noise  (x,  = 0.1), We use the same method to estimate P (Fi). However, since P (n l) is intractable, we alternatively extend it as FiðY ð x = r(x̂ ) +  (x, 0.1) (38) É P (n l) = P (n xj)P (xj l). where r(⋅) totally defines four different ways to rotate x̂ shown FiðY ð FiðX ð XðY ð (36) xj ∈ in Figure6(B)-(E), and x denotes the expectation of x. The reason for adding Gaussian noise is to avoid MLPs directly Since {xj}J is supposed to be i.i.d., P (xj l) = 1 j=1 XðY ð N(l) memorize the deterministic image. if yj = l, otherwise P (xj l) = 0, where N(l) denotes the The synthetic dataset evenly consists of 64 images with XðY ð number of samples with the label l. As a result, we have the four different rotation ways shown in Figure6(B)-(E). Compared to benchmark datasets with complicated features, 1 É P (n l) = P (n xj), (37) the synthetic dataset only has four simple features, namely FiðY ð N(l) FiðX ð the four different rotation ways. As a result, we can clearly xj ∈ ,yj =l  demonstrate the proposed probabilistic explanations for MLPs which measures the probability of !(i) occurring in the entire by visualizing the weights of MLPs. n Compared to benchmark datasets with unknown entropy, dataset with the label l. Finally, we can derive I(Y,Fi) based on Equation 34 and 37. the entropy of the synthetic dataset is known. If we do not take into account the additive Gaussian noise, the entropy Similarly, based on I(Y,Fi) = EY KL[P (FiðY )ððP (Fi)], we can conclude that I(Y,F ) would be small if the proba- of the synthetic dataset would be exactly 2 bits. Since the i noise is Gaussian, the differential entropy of the noise is bility of {!(i)}N occurring in the dataset with each label n n=1 1 log(2e2) ≈ 0.38 bits. Therefore, the total entropy of is close to that of {!(i)}N occurring in the entire dataset, 2 n n=1 the synthetic dataset is 2.38 bits, because the additive noise otherwise I(Y,Fi) would be large. is independent on the rotation ways. Since the labels [1, 0] In summary, we introduces a new method to estimate and [0, 1] evenly divide the synthetic dataset into two classes, I(X,Fi) and I(Y,Fi) in this section based on the definitions the entropy of the labels is 1 bit. As a result, the synthetic of (Ω , ,P ) and the random variable F ∶ Ω → E . Fi  Fi i Fi Fi dataset enables us to precisely examine the existing and the proposed information theoretic explanations for MLPs.

6. Experiments 6.2. The simulations on the synthetic dataset In this section, we present two set of experiments based This section demonstrates five aspects: (i) the probabil- on the synthetic dataset and benchmark datasets to demon- ity space (Ω , ,P ) for a fully connected layer f ; (ii) the Fi  Fi i strate the probabilistic representation and the information effect of an activation function  (⋅) on P ; (iii) the effect i Fi theoretic explanations for MLPs in Section3,4, and5. All of i(⋅) on H(Fi), I(X,Fi), and I(Y,Fi); (iv) the informa- 3 the simulation codes are available online . tion flow in the MLP, i.e., the variation of I(X,Fi), I(Y,Fi), ̄ 3https://github.com/EthanLan/DNN_Information_theory and I(X,Fi) over different layers, and (v) the comparison of the proposed mutual information estimator and two existing non-parametric estimators on the synthetic dataset.

Xinjie Lan et al.: Preprint submitted to Elsevier Page 10 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

(1) (1) (1) (1) 1 2 3 4 0.4 0.4 0.4 0.3 0.2 0.2 0.2 0.2 0.1 0.0 0.0 0.0 0.0 0.1 0.2 0.2 0.2 0.2 0.4 0.4 0.3 0.4 (1) (1) (1) (1) 5 6 7 8 0.3 0.3 0.4 0.3 0.2 0.2 0.2 0.2 0.1 0.1 0.1 0.0 0.0 0.0 0.0 0.1 0.1 0.1 0.2 0.2 0.2 0.2 0.3 0.3 0.4 0.3

(1) 8 (1) (1) 1024 Figure 7: The eight possible outcomes {!n }n=1 represented by the learned weights of the eight neurons, where !n = {!mn}m=1 . (1) 1024 All the weights {!mn}m=1 are reshaped into 32 × 32 dimension for visualizing the spatial structure.

Table 1

The Gibbs measure of the first hidden layer f1 given the synthetic images in Figure6

!(1) !(1) !(1) !(1) !(1) !(1) !(1) !(1) 1 2 3 4 5 6 7 8

g1n(x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 f1n(x) 45.3 215.7 206.2 0.0 0.0 137.1 0.0 0.0 exp[f1n(x)] 4.71e+19 4.75e+93 3.56e+89 1.0 1.0 3.48e+59 1.0 1.0 P (!(1) Image0) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) -53.5 -217.7 -208.4 69.0 224.8 -134.6 204.1 171.3 f1n(x) 0.0 0.0 0.0 69.0 224.8 0.0 204.1 171.3 exp[f1n(x)] 1.0 1.0 1.0 9.25e+29 4.25e+97 1.0 4.36e+88 2.48e+74 P (!(1) Image1) 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) 219.4 54.9 78.9 -211.3 -37.4 153.6 -106.6 -116.4 f1n(x) 219.4 54.9 78.9 0.0 0.0 153.6 0.0 0.0 exp[f1n(x)] 1.92e+95 6.96e+23 1.84e+34 1.0 1.0 5.10e+66 1.0 1.0 P (!(1) Image2) 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 F1ðX n ð

g1n(x) -219.0 -55.9 -81.6 208.0 41.3 -159.6 111.8 122.1 f1n(x) 0.0 0.0 0.0 208.0 41.3 0.0 111.8 122.1 exp[f1n(x)] 1.0 1.0 1.0 2.15e+90 8.63e+17 1.0 3.58e+48 1.06e+53 P (!(1) Image3) 0.0 0.0 0.0 1.0 0.0 0.0 0.0 0.0 F1ðX n ð (1) g1n(x) and f1n(x) are the linear output and the activation, respectively, where g1n(x) = ⟨!n , x⟩ + b1n and f1n = 1[g1n(x)].

To classify the synthetic dataset, we specify the MLP as First, we demonstrate the sample space Ω = {!(1)}N F1 n n=1 follows: (i) since a single image is 32 × 32, the input layer x for f1. We train the MLP on the synthetic dataset until the has M = 1024 nodes, (ii) two hidden layers f1 and f2 have training accuracy is 100% and visualize the learned weights N = 8 and K = 6 neurons, respectively, and (iii) the output !(1) !(1) 1024, n , , of the eight neurons, i.e., n = { mn}m=1 ∈ [1 8] in f L (1) layer Y has = 2 nodes corresponding to the labels of the Figure7, from which we observe that (i) ! can be regarded dataset. In addition, all the activation functions are chosen n as a possible outcome (i.e., the feature of x), e.g., !(1) has as ReLU (x) = max(0, x) unless otherwise specified. 2 low magnitude at top-left positions and high magnitude at 6.2.1. The probability space for a layer bottom-right positions, which describes the spatial feature To demonstrate the proposed probability space for all the of x = Image0 in Figure6; and (ii) the weights of different fully connected layers in the MLP = {x; f ; f ; f }, we neurons formulate different features. Though the weights of 1 2 Y (1) (1) only need to demonstrate (Ω , ,P ) for f , because we some neurons, e.g., ! and ! , are similar, they still can F1  F1 1 2 3 derive (Ω , ,P ) for each layer in the backward direction be viewed as different features, because their weights with Fi  Fi n n¨,!(1) !(1) based on the mathematical induction in Section 3.2.1. the same index are different, i.e., ∀ ≠ mn ≠ mn¨ .

Xinjie Lan et al.: Preprint submitted to Elsevier Page 11 of 26 A Probabilistic Representation of Deep Learning for Improving The Information Theoretic Interpretability

Table 2 The Gibbs probability P (!(1) Image0) with four different activation functions and the F1ðX n ð corresponding conditional entropy H(F1ðX = Image0)

!(1) !(1) !(1) !(1) !(1) !(1) !(1) !(1) H(F X) 1 2 3 4 5 6 7 8 1ð

g1n(x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 Linear f1n (x) 45.3 215.7 206.2 -62.7 -222.9 137.1 -202.5 -171.6 Linear exp[f1n (x)] 4.71e+19 4.75e+93 3.56e+89 5.88e-28 1.56e-97 3.48e+59 1.13e-88 2.98e-75 P (F1ðX) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 ReLU f1n (x) 45.3 215.7 206.2 0.0 0.0 137.1 0.0 0.0 ReLU exp[f1n (x)] 4.71e+19 4.75e+93 3.56e+89 1.0 1.0 3.48e+59 1.0 1.0 P (F1ðX) 0.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.00 Tanh f1n (x) 1.0 1.0 1.0 -1.0 -1.0 1.0 -1.0 -1.0 Tanh exp[f1n (x)] 2.71 2.71 2.71 0.36 0.36 2.71 0.36 0.36 P (F1ðX) 0.22 0.22 0.22 0.03 0.03 0.22 0.03 0.03 2.53 Sigmoid f1n (x) 1.0 1.0 1.0 0.0 0.0 1.0 0.0 0.0 Sigmoid exp[f1n (x)] 2.71 2.71 2.71 1.0 1.0 2.71 1.0 1.0 P (F1ðX) 0.18 0.18 0.18 0.07 0.07 0.18 0.07 0.07 2.84 Linear ReLU Tanh Sigmoid f1n (x) denotes the activation without activation function. f1n (x), f1n (x), and f1n (x) denote the activations with different activation functions given the same linear output g1n(x). H(F1ðX) denotes H(F1ðX = Image0) for simplicity.

Second, we demonstrate the Gibbs measure P_{F_1} for f_1. Based on Equation 17, we derive P_{F_1|X}(ω_n^{(1)}|x) given the four images in Figure 6(B)-(E). Table 1 shows that P_{F_1} correctly measures the probability of {ω_n^{(1)}}_{n=1}^{8}. For instance, ω_2^{(1)} correctly describes the feature of Image0, thus it has the largest linear output g_12(x) = 215.7 and activation f_12(x) = 215.7, thereby P_{F_1|X}(ω_2^{(1)}|Image0) = 1. As a comparison, ω_5^{(1)} incorrectly describes the feature of Image0, thus it has the lowest linear output g_15(x) = -222.9 and activation f_15(x) = 0.0, so P_{F_1|X}(ω_5^{(1)}|Image0) = 0.

6.2.2. The effect of activation functions on the Gibbs probability measure P_F

To demonstrate the effect of activation functions on the Gibbs measure, we examine P_{F_1|X=x}(ω_n^{(1)}|Image0) in four different cases: (i) the linear activation function σ_1(x) = x, (ii) ReLU σ_1(x) = max(0, x), (iii) the hyperbolic tangent function (abbr. Tanh) σ_1(x) = (e^x - e^{-x})/(e^x + e^{-x}), and (iv) the sigmoid function σ_1(x) = 1/(1 + e^{-x}), in Table 2.

ReLU guarantees an accurate Gibbs measure because it only sets negative linear outputs to zero. For instance, ω_5^{(1)} is an irrelevant feature of Image0: Table 2 shows f_15^Linear(x) = -222.9 and exp[f_15^Linear(x)] = 1.56e-97. As a comparison, if we use ReLU, f_15^ReLU(x) = 0.0 and exp[f_15^ReLU(x)] = 1.0. The difference between exp[f_15^ReLU(x)] and exp[f_15^Linear(x)] is small, thus P^ReLU_{F_1|X}(ω_5^{(1)}|Image0) should also be close to P^Linear_{F_1|X}(ω_5^{(1)}|Image0), which is validated by the experiment, namely P^ReLU_{F_1|X}(ω_5^{(1)}|Image0) = P^Linear_{F_1|X}(ω_5^{(1)}|Image0) = 0.0. We observe similar results on other neurons in Table 2.

Tanh cannot guarantee an accurate Gibbs measure because it decreases the difference between the activation of relevant features and that of irrelevant features. For instance, ω_2^{(1)} is a relevant feature of Image0 with f_12^Linear(x) = 215.7, and ω_1^{(1)} is an irrelevant feature of Image0 with f_11^Linear(x) = 45.3, thus |f_12^Linear(x) - f_11^Linear(x)| = 170.4. As a comparison, if we use Tanh, |f_12^Tanh(x) - f_11^Tanh(x)| = 0.0, thus P^Tanh_{F_1|X}(ω_1^{(1)}|Image0) = P^Tanh_{F_1|X}(ω_2^{(1)}|Image0) = 0.22. In other words, we cannot distinguish ω_2^{(1)} and ω_1^{(1)} based on Tanh.

Sigmoid cannot guarantee an accurate Gibbs measure for the same reason. In particular, since Sigmoid confines activations to the smaller range [0, 1], it further decreases the difference between the activation of relevant features and that of irrelevant features. For instance, if we use Tanh, |f_12^Tanh(x) - f_15^Tanh(x)| = 2.0. As a comparison, if we use Sigmoid, |f_12^Sigmoid(x) - f_15^Sigmoid(x)| = 1.0. Consequently, |P^Sigmoid_{F_1|X}(ω_2^{(1)}|Image0) - P^Sigmoid_{F_1|X}(ω_5^{(1)}|Image0)| = 0.11 becomes smaller than |P^Tanh_{F_1|X}(ω_2^{(1)}|Image0) - P^Tanh_{F_1|X}(ω_5^{(1)}|Image0)| = 0.19, i.e., it becomes more difficult to distinguish ω_2^{(1)} and ω_5^{(1)} based on Sigmoid.

The experiment provides a probabilistic explanation for the limitation of saturating (i.e., bounded) activation functions (e.g., Tanh and Sigmoid) for training neural networks (Glorot and Bengio, 2010). Since saturating activation functions confine activations to a very small range and decrease the difference between the activation of relevant features and that of irrelevant features, they make it difficult to distinguish relevant features from irrelevant ones. As a result, neural networks with saturating activation functions require more computation cost, e.g., more training time or more hidden layers, to achieve the same training result.
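To make the effect of the activation function on P_{F_1} concrete, the short sketch below (not part of the original experiments) recomputes the Gibbs measure for the linear outputs of Image0 listed in Table 2, reading Equation 17 as a softmax over the eight neurons of f_1, i.e., P_{F_1|X}(ω_n^{(1)}|x) = exp[f_1n(x)] / Σ_m exp[f_1m(x)]; the g_1n(x) values are copied from Table 2, and everything else is an illustrative assumption. Up to rounding, it reproduces the probabilities and the conditional entropies of Table 2 (e.g., 0.22/0.03 and 2.53 bits for Tanh).

```python
import numpy as np

# Linear outputs g_1n(Image0) for the eight neurons of f_1 (values from Table 2).
g = np.array([45.3, 215.7, 206.2, -62.7, -222.9, 137.1, -202.5, -171.6])

activations = {
    "Linear":  lambda z: z,
    "ReLU":    lambda z: np.maximum(0.0, z),
    "Tanh":    np.tanh,
    # sigmoid written via tanh to avoid overflow for very negative linear outputs
    "Sigmoid": lambda z: 0.5 * (1.0 + np.tanh(0.5 * z)),
}

def gibbs_measure(f):
    """P_{F1|X}(omega_n | x) = exp(f_1n) / sum_m exp(f_1m)  (our reading of Equation 17)."""
    e = np.exp(f - f.max())          # subtract the maximum for numerical stability
    return e / e.sum()

for name, sigma in activations.items():
    p = gibbs_measure(sigma(g))
    h = -np.sum(p * np.log2(p + 1e-12))   # H(F1 | X = Image0) in bits
    print(f"{name:8s} P = {np.round(p, 2)}  H(F1|X=Image0) = {h:.2f}")
```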


Table 3: The distribution P(F_1) based on different activation functions and their respective H(F_1), I(X,F_1), and I(Y,F_1).

                     Linear   ReLU   Tanh   Sigmoid
P(F_1 = ω_1^(1))     0.25     0.25   0.12   0.12
P(F_1 = ω_2^(1))     0.25     0.25   0.12   0.13
P(F_1 = ω_3^(1))     0.00     0.00   0.13   0.12
P(F_1 = ω_4^(1))     0.25     0.25   0.13   0.13
P(F_1 = ω_5^(1))     0.25     0.25   0.12   0.13
P(F_1 = ω_6^(1))     0.00     0.00   0.12   0.13
P(F_1 = ω_7^(1))     0.00     0.00   0.13   0.12
P(F_1 = ω_8^(1))     0.00     0.00   0.13   0.12
H(F_1)               2.00     2.00   3.00   3.00
I(X,F_1)             2.00     2.00   0.47   0.16
I(Y,F_1)             1.00     1.00   0.35   0.16

Table 4: The number of neurons (nodes) of each layer and the activation function of all the layers in the three MLPs.

        x      f_1   f_2   f_Y   σ(·)
MLP1    1024   8     6     2     ReLU
MLP2    1024   8     6     2     Tanh
MLP3    1024   1     6     2     ReLU

In summary, activation functions have a great effect on the Gibbs measure of a fully connected layer. Consequently, a fully connected layer with different activation functions should have different entropy and mutual information, which is discussed in the next section.

6.2.3. The effect of activation functions on H(F_1), I(X,F_1), and I(Y,F_1)

Since Gaussian noise is not helpful for classifying the synthetic dataset, the upper bound of I(X,F_1) is 2. Since the label evenly divides the entire dataset into two groups, the upper bound of I(Y,F_1) is H(Y) = 1. In addition, since each synthetic image only has one feature, H(F_1|X = x) should be close to zero if f_1 models the input precisely. Table 2 summarizes H(F_1|X = Image0) given different activation functions: H(F_1|X = Image0) = 0 given ReLU, and H(F_1|X = Image0) > 2 given Tanh and Sigmoid. We can conclude that f_1 with Tanh or Sigmoid does not contain much information of Image0.

Table 3 summarizes I(X,F_1) given different activation functions based on Equation 30. I(X,F_1) = H(X) = 2.0 given ReLU indicates that f_1 with ReLU contains all the information of the entire dataset. In contrast, I(X,F_1) = 0.47 given Tanh indicates that f_1 with Tanh does not contain much information of the entire dataset.

In addition, we derive I(Y,F_1) based on Equation 35. I(Y,F_1) = H(Y) = 1.0 given ReLU indicates that f_1 with ReLU contains all the information of the labels. As a comparison, I(Y,F_1) = 0.35 given Tanh indicates that f_1 with Tanh only contains partial information of the labels.

6.2.4. The information flow in the MLPs

In this section, we demonstrate the proposed information theoretic explanations for the MLP in Section 4. We design three MLPs, namely MLP1, MLP2, and MLP3. The difference between MLP1 and MLP2 is the activation function, and the difference between MLP1 and MLP3 is the number of neurons in f_1; both are summarized in Table 4.

All the weights of the three MLPs are randomly initialized by a uniform distribution unless otherwise specified. We choose the Adam algorithm (Kingma and Ba, 2014), a variant of Stochastic Gradient Descent (SGD), to learn the weights of the three MLPs on the entire synthetic dataset over 1000 epochs with the learning rate 0.01.

Based on the synthetic dataset and the learned weights at each epoch, we derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively. To keep consistent with previous works (Chelombiev et al., 2019), we train the MLPs with 50 random initializations and use the averaged mutual information to indicate the information flow in the MLPs. Figure 8(A) and 8(E) show the variation of the cross entropy loss and the training error of MLP1 and MLP2, respectively. Figure 8(B)-8(D) and 8(F)-8(H) show the information flow in MLP1 and MLP2, respectively.

Since all the weights of the MLPs are randomly initialized, F_i is initially independent of X and Y. As a result, I(X,F_i) and I(Y,F_i) should initially be close to zero in MLP1 and MLP2, which is validated in Figure 8(B)-8(C) and 8(F)-8(G). As training continues, Figure 8(B)-8(C) and 8(F)-8(G) show that I(X,F_i) and I(Y,F_i) quickly converge to fixed values. Specifically, Figure 8(B) and 8(F) show I(X,F_1) in MLP1 converging to 2 bits and I(X,F_1) in MLP2 converging to 0.47 bits, which is consistent with the results in Table 3. Figure 8(C) and 8(G) show that MLP1 and MLP2 spend about 20 and 200 epochs, respectively, to make I(Y,F_Y) converge to H(Y) = 1 bit. That further confirms that saturating activation functions like Tanh cannot guarantee a precise Gibbs measure and require more training time to achieve the same training result.

In terms of the information flow of X in MLP1 and MLP2, Figure 8(B) shows I(X,F_1) ≥ I(X,F_2) ≥ I(X,F_Y) in MLP1 after the cross entropy loss decreases to zero. In contrast, Figure 8(F) shows I(X,F_Y) ≥ I(X,F_1) ≥ I(X,F_2) in MLP2 after the cross entropy loss decreases to zero. The results demonstrate that the information flow of X in MLPs cannot satisfy any DPI (Section 4.4).

In terms of the information flow of Y in MLP1 and MLP2, Figure 8(C) shows I(Y,F_1) = I(Y,F_2) = I(Y,F_Y) in MLP1 after the cross entropy loss decreases to zero. In addition, Figure 8(G) shows I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in MLP2 after the cross entropy loss decreases to zero. The results demonstrate that the information flow of Y in MLPs satisfies I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) (Section 4.4).
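For readers who want to reproduce numbers of the kind shown in Table 3, the sketch below computes H(F_1), I(X,F_1), and I(Y,F_1) from the per-image Gibbs measures of a layer, assuming that Equations 30 and 35 reduce to the standard discrete identities I(X,F_1) = H(F_1) - H(F_1|X) and I(Y,F_1) = H(F_1) - H(F_1|Y) with X uniform over the training images; the function names are ours and the exact estimators in the paper may differ in detail.

```python
import numpy as np

def entropy(p, eps=1e-12):
    """Shannon entropy in bits of a discrete distribution p."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log2(p + eps)))

def layer_information(p_f_given_x, labels):
    """Sketch of H(F1), I(X,F1), I(Y,F1) for one fully connected layer.

    p_f_given_x : (J, N) array, row j is the Gibbs measure P(F1 | X = x_j).
    labels      : (J,) array of class labels y_j.
    Assumes X is uniform over the J training images, so P(F1) is the average
    of the rows and conditioning on Y averages the rows within each class.
    """
    p_f_given_x = np.asarray(p_f_given_x, dtype=float)
    labels = np.asarray(labels)
    p_f = p_f_given_x.mean(axis=0)                       # marginal P(F1)
    h_f = entropy(p_f)                                    # H(F1)
    h_f_given_x = np.mean([entropy(row) for row in p_f_given_x])
    i_xf = h_f - h_f_given_x                              # I(X, F1) = H(F1) - H(F1|X)
    h_f_given_y = 0.0
    for y in np.unique(labels):
        mask = labels == y
        h_f_given_y += mask.mean() * entropy(p_f_given_x[mask].mean(axis=0))
    i_yf = h_f - h_f_given_y                              # I(Y, F1) = H(F1) - H(F1|Y)
    return h_f, i_xf, i_yf
```

Feeding the per-image Gibbs measures of f_1 (computed as in the previous sketch) together with the labels yields quantities of the kind reported in Table 3 under the stated uniform-X assumption.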



Figure 8: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP1 and MLP2 during training. (B)-(D) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers in MLP1 during training, respectively. Similarly, (F)-(H) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers in MLP2 during training, respectively.


Figure 9: (A) visualizes the variation of the training error and the cross entropy loss of MLP3 during training. (B)-(D) visualize the variations of I(X,F_i), I(Y,F_i), and Ī(X,F_i) over all the layers of MLP3 during training, respectively.

Compared to MLP1, MLP3 only has one neuron in f_1, which makes MLP3 spend more than 100 epochs minimizing the cross entropy loss to zero in Figure 9(A). More importantly, it significantly changes the information flow in MLP3. The probability space (Ω_{F_1}, ℱ, P_{F_1}) indicates that the single neuron only defines one possible outcome with 100% occurring probability, thus f_1 becomes a deterministic function and cannot transfer information to f_2 and f_Y in the forward direction, i.e., the second and third blue arrows are blocked in Figure 5. As a result, the information of X and Y can only flow into MLP3 in the backward direction.

Figure 9(B)-9(D) visualize the information flow in MLP3 and demonstrate the above theoretical discussion. First, we observe I(X,F_1) = I(Y,F_1) = 0, which validates f_1 being a deterministic function. Second, the information flows of X and Y in MLP3 satisfy I(X,F_Y) ≥ I(X,F_2) ≥ I(X,F_1) and I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in most training epochs, respectively, which validates that the information of X and Y can flow into MLP3 in the backward direction. Third, Ī(X,F_i) being very close to zero validates that all the information of each layer stems from Y based on Equation 25.

In summary, this section demonstrates the proposed information theoretic explanations in Sections 4.3 and 4.4. First, we observe three different information flows of X in the three MLPs, i.e., the information flow of X cannot satisfy any DPI in MLPs. Second, the information flow of Y in the three MLPs has a backward direction, i.e., it satisfies the DPI I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1), especially after the cross entropy loss decreases to zero. Third, we demonstrate that MLPs cannot satisfy the DPI derived from IB (Equation 4); in particular, the information of X can only flow into MLP3 in the backward direction.

To further demonstrate the proposed information theoretic explanations, the next section compares the proposed mutual information estimators (Section 5) to commonly used mutual information estimators based on two non-parametric inference methods, i.e., empirical distributions (Shwartz-Ziv and Tishby, 2017) and Gaussian KDE (Saxe et al., 2018). We use the same experimental methods and synthetic dataset as before to show the information flow of X and Y in MLP1 and MLP2 based on the three mutual information estimators.



Figure 10: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP1 during training, respectively. (B), (C), and (D) visualize the variation of I(X,F_i) over all the layers in MLP1 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively. (F), (G), and (H) visualize the variation of I(Y,F_i) over all the layers in MLP1 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively.


Figure 11: (A) and (E) visualize the variation of the training error and the cross entropy loss of MLP2 during training, respectively. (B), (C), and (D) visualize the variation of I(X,F_i) over all the layers in MLP2 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively. (F), (G), and (H) visualize the variation of I(Y,F_i) over all the layers in MLP2 based on empirical distributions, Gaussian KDE, and the Gibbs distribution, respectively.

6.2.5. The comparison with existing methods

Figure 10(B)-(C) show I(X,F_1) > 2 in MLP1 based on empirical distributions and KDE, which contradicts the fact that the synthetic dataset only has 2 bits of information. As a result, the two estimators cannot correctly estimate I(X,F_1).

Figure 10(F)-(G) show I(Y,F_Y) = 0.8 and I(Y,F_Y) = 1 based on empirical distributions and KDE, respectively. That contradicts the requirement that I(Y,F_Y) = H(Y) once the cross entropy loss becomes zero (Section 4.5). Specifically, Figure 10(F) shows I(Y,F_Y) < H(Y) after the cross entropy loss decreases to zero, and Figure 10(G) shows I(Y,F_Y) = H(Y) before the cross entropy loss decreases to zero.

Section 6.2.2 shows that Tanh cannot guarantee a precise Gibbs measure, and Section 6.2.3 derives that f_1 with Tanh only contains 0.47 bits of information about the synthetic dataset. However, Figure 11(B)-(C) show I(X,F_1) > 1 based on empirical distributions and KDE. As a result, the two mutual information estimators cannot correctly estimate I(X,F_1) in MLP2. In addition, Figure 11(F)-(G) demonstrate the same limitation of the two estimators for estimating I(Y,F_Y) in MLP2 as in MLP1.

In summary, the mutual information estimators based on empirical distributions and KDE cannot correctly estimate the information flow of X and Y in MLP1 and MLP2.
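For reference, the sketch below is one minimal reading of the binning (empirical-distribution) estimator in the spirit of Shwartz-Ziv and Tishby (2017): quantize the activations of a layer into a fixed number of bins, treat every binned activation vector as a single discrete symbol, and compute I(X,T) and I(Y,T) from the empirical joint frequencies. The bin count and the treatment of each input as its own symbol are assumptions on our part, not the exact configuration used in the figures above.

```python
import numpy as np
from collections import Counter

def discrete_mi(xs, ts):
    """Empirical mutual information (bits) between two sequences of hashable symbols."""
    n = len(xs)
    px, pt, pxt = Counter(xs), Counter(ts), Counter(zip(xs, ts))
    return sum(c / n * np.log2((c / n) / ((px[x] / n) * (pt[t] / n)))
               for (x, t), c in pxt.items())

def binned_layer_mi(activations, inputs_id, labels, n_bins=30):
    """Binning estimator: quantize activations, then I(X,T) and I(Y,T) from counts.

    activations : (J, N) array of a hidden layer's activations over J samples.
    inputs_id   : (J,) identifiers of the inputs (each input is its own symbol).
    labels      : (J,) class labels.
    """
    a = np.asarray(activations, dtype=float)
    edges = np.linspace(a.min(), a.max(), n_bins + 1)
    codes = np.digitize(a, edges[1:-1])              # quantize each unit into a bin index
    t_symbols = [tuple(row) for row in codes]        # one symbol per binned activation vector
    return discrete_mi(list(inputs_id), t_symbols), discrete_mi(list(labels), t_symbols)
```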



Figure 12: (A), (E), and (I) visualize the variation of the training/testing error and the cross entropy loss of MLP4, MLP5 and MLP6 during training, respectively. (B), (F), and (J) visualize the variation of I(X,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively. (C), (G), and (K) visualize the variation of I(Y,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively. (D), (H), and (L) visualize the variation of Ī(X,F_i) over all the layers in MLP4, MLP5, and MLP6, respectively.

6.3. The simulations on the benchmark datasets

In this section, we use the MNIST dataset to demonstrate the proposed explanations for MLPs: (i) the information flow of X and Y in MLPs (Section 4.4 and 4.3) and (ii) the information theoretic explanations for generalization (Section 4.5). In addition, Appendix G presents extra experiments based on more complex MLPs on the Fashion-MNIST dataset to further validate the proposed explanations for MLPs.

Table 5: The number of neurons (nodes) of each layer and the activation functions in each MLP.

        x     f_1   f_2   f_Y   σ(·)
MLP4    784   96    32    10    ReLU
MLP5    784   96    32    10    Tanh
MLP6    784   32    96    10    ReLU

6.3.1. The information flow in the MLPs

We design three MLPs, i.e., MLP4, MLP5, and MLP6, and their differences are summarized in Table 5. All the weights of the MLPs are randomly initialized by truncated normal distributions. We still choose the Adam method to learn the weights of the MLPs on the MNIST dataset over 300 epochs with the learning rate 0.001, and use the same method as Section 6.2.4 to derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively.

The information flow in the MLPs on the MNIST dataset is consistent with the results on the synthetic dataset. More specifically, Figure 12(B), 12(F) and 12(J) visualize three different information flows of X in MLP4, MLP5, and MLP6, respectively. It confirms that the information flow of X in MLPs does not satisfy any DPI. In addition, Figure 12(C), 12(G) and 12(K) demonstrate that the information flow of Y satisfies I(Y,F_Y) ≥ I(Y,F_2) ≥ I(Y,F_1) in all the three MLPs. The experiment also demonstrates that IB cannot correctly explain the information flow of X and Y in MLPs, because they cannot satisfy the DPIs (Equation 4) derived from IB in Figure 12(B, C), 12(F, G) and 12(J, K).

Figure 12(D), 12(H) and 12(L) demonstrate that the information flow of X̄ in all the three MLPs satisfies Ī(X,F_1) ≥ Ī(X,F_2) ≥ Ī(X,F_Y). The next section will demonstrate that Ī(X,F_1) can measure the generalization of MLPs along with two variables: (i) the number of neurons, and (ii) the number of training samples.
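As an illustration of the architectures in Table 5, the sketch below builds MLP4 (784-96-32-10 with ReLU hidden layers and a softmax output f_Y) with a crude truncated-normal initializer and returns per-layer Gibbs measures as softmax distributions over each layer's neurons. The initializer parameters and the softmax reading of Equation 9 are our assumptions; the actual experiments train with Adam, which is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

def truncated_normal(shape, std=0.1):
    """Crude truncated normal: redraw values beyond two standard deviations."""
    w = rng.normal(0.0, std, size=shape)
    bad = np.abs(w) > 2 * std
    while bad.any():
        w[bad] = rng.normal(0.0, std, size=bad.sum())
        bad = np.abs(w) > 2 * std
    return w

# MLP4 from Table 5: x (784) -> f1 (96, ReLU) -> f2 (32, ReLU) -> fY (10, softmax)
sizes = [784, 96, 32, 10]
params = [(truncated_normal((m, n)), np.zeros(n)) for m, n in zip(sizes[:-1], sizes[1:])]

def relu(z):
    return np.maximum(0.0, z)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def forward(x):
    """Returns the Gibbs measures of f1, f2 and the output layer fY for input x."""
    (w1, b1), (w2, b2), (w3, b3) = params
    f1 = relu(x @ w1 + b1)
    f2 = relu(f1 @ w2 + b2)
    fy = softmax(f2 @ w3 + b3)
    # per-layer Gibbs measures over each layer's neurons (our reading of Equation 9)
    return softmax(f1), softmax(f2), fy
```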



Figure 13: (A) shows the variation of the testing accuracy and Ī(X,F_1) given different MLPs with different numbers of neurons. (B) shows the variation of the testing accuracy and Ī(X,F_1) given different numbers of training samples.

6.3.2. The information theoretic explanation for the generalization performance of MLPs

First, Ī(X,F_1) can measure the generalization of MLPs with different numbers of neurons. In general, a MLP with more neurons would have better generalization performance, thus Ī(X,F_1) of the MLP should be larger. We design six different MLPs = {x, f_1, f_2, f_Y}, of which the two hidden layers have the same number of neurons with the same ReLU activation function, but different MLPs have different numbers of neurons, i.e., 32, 64, 128, 256, 512, 1024. After the MLPs achieve 100% training accuracy on the MNIST dataset, we observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 13(A).

Second, Ī(X,F_1) can measure the generalization of MLPs with different numbers of training samples. In general, a MLP with a larger number of training samples would have better generalization, thus Ī(X,F_1) of the MLP should be larger. We generate 8 different training sets with different numbers of MNIST training samples and train MLP4 on the 8 training sets. After MLP4 achieves 100% training accuracy on the 8 training sets, we also observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 13(B).

In summary, Ī(X,F_1) demonstrates a positive correlation with the testing accuracy of MLPs, thus we conclude that Ī(X,F_1) can be regarded as a criterion for the generalization of MLPs along with two variables: (i) the number of neurons, and (ii) the number of training samples. The experiment shows potential for explaining the generalization of general DNNs from the perspective of information theory; we leave a rigorous study of this as future work.

7. Conclusions

In this paper, we introduce a probabilistic representation for improving the information theoretic interpretability. The probabilistic representation for MLPs includes three parts.

First, we demonstrate that the activations being i.i.d. is not valid for all the hidden layers of MLPs. As a result, the mutual information estimators based on non-parametric inference methods, e.g., empirical distributions and Kernel Density Estimate (KDE), are invalid for measuring the mutual information in MLPs, because the prerequisite of these non-parametric inference methods is the samples being i.i.d.

Second, we define the probability space (Ω_F, ℱ, P_F) for a fully connected layer f with N neurons given the input x. Let the experiment be f extracting a single feature of x: the sample space Ω_F consists of N possible outcomes (i.e., features), and each outcome is defined by the weights of a neuron; the event space ℱ is the σ-algebra; and the probability measure P_F is a Gibbs measure quantifying the probability of each outcome occurring in the experiment.

Third, we propose probabilistic explanations for MLPs and the back-propagation training: (i) the entire architecture of MLPs formulates a Gibbs distribution based on the Gibbs distribution P_F for each layer; and (ii) the back-propagation training aims to optimize the sample space of all the layers of MLPs for modeling the statistical connection between the input x and the label y, because the weights of each layer define the sample space Ω_F.

To the best of our knowledge, most existing information theoretic explanations for MLPs lack a solid probabilistic foundation. This not only weakens the validity of the information theoretic explanations but could also derive incorrect explanations for MLPs. To resolve the fundamental issue, we first introduce the probabilistic representation for MLPs, and then improve the information theoretic interpretability of MLPs in three aspects.

Above all, we explicitly define the random variable of f as F : Ω_F → E based on (Ω_F, ℱ, P_F). Since Ω_F is discrete, E denotes a discrete measurable space. Hence, F is a discrete random variable and H(F) < ∞. In other words, we resolve the controversy regarding F being discrete or continuous.

Furthermore, the probabilistic explanation for the back-propagation training indicates that Ω_F depends on both x and y, thereby F depends on both X and Y. That contradicts the probabilistic assumption of IB, i.e., F is independent of Y given X. As a result, the information flow of X and Y in MLPs does not satisfy IB if we take into account the back-propagation training.

In addition, we demonstrate that the performance of a MLP depends on the mutual information between the MLP and the input X, i.e., I(X, MLP). Specifically, we prove that all the information of Y stems from X, i.e., H(Y) = I(X,Y) (the relation is visualized by the Venn diagram in Figure 4), thus I(X, MLP) can be divided into two parts, I(Y, MLP) and Ī(X, MLP), where X̄ = Y^c ∩ X denotes the relative complement of Y in X. We show that the training accuracy of the MLP depends on I(Y, MLP), and the generalization of the MLP depends on Ī(X, MLP).

It is noteworthy that we design a synthetic dataset to fully demonstrate the proposed probabilistic representation and information theoretic explanations for MLPs. Compared to all the existing information theoretic explanations, which merely use benchmark datasets for validation, the synthetic dataset enables us to demonstrate the proposed information theoretic explanations clearly and comprehensively, because all the features of the synthetic dataset are known and much simpler than those of benchmark datasets.

The proposed information theoretic explanations for MLPs provide a novel viewpoint to understand the generalization of MLPs, and they deserve more effort as future research. First, since the cross entropy loss only guarantees the performance of MLPs on the training dataset, incorporating Ī(X,F_1) into the cross entropy loss could be a promising approach to improve the generalization performance of MLPs. Second, we are planning to extend the information theoretic explanations for generalization to general DNNs, which could shed light on understanding the generalization of DNNs.

A. The necessary conditions for activations being i.i.d.

A.1. The necessary conditions for activations being independent

The necessary condition for two random variables A and B being independent is that they are uncorrelated, namely the covariance Cov(A, B) = 0. Therefore, the necessary condition for {G_{2k}}_{k=1}^{K} being independent is that ∀(k, k') ∈ S_1 = {(k, k') ∈ ℤ² | k ≠ k', 1 ≤ k ≤ K, 1 ≤ k' ≤ K}, Cov(G_{2k}, G_{2k'}) = 0, which can be formulated as

Cov\Big(\sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n}+b_{2k},\ \sum_{n'=1}^{N}\omega^{(2)}_{n'k'}F_{1n'}+b_{2k'}\Big)=0.   (39)

Since the covariance between a random variable and a constant is zero, namely Cov(X, c) = 0, we derive

Cov(G_{2k},G_{2k'})=Cov\Big(\sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n},\ \sum_{n'=1}^{N}\omega^{(2)}_{n'k'}F_{1n'}\Big).   (40)

Based on Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z), Cov(G_{2k}, G_{2k'}) can be extended as

Cov(G_{2k},G_{2k'})=\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}Var(F_{1n})+\sum_{n\neq n'}\omega^{(2)}_{nk}\omega^{(2)}_{n'k'}Cov(F_{1n},F_{1n'}).   (41)

Assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), we have Cov(F_{1n}, F_{1n'}) = 0, thus we have

Cov(G_{2k},G_{2k'})=Var(F_1)\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}.   (42)

Since Var(F_1) > 0, the necessary condition for {G_{2k}}_{k=1}^{K} being independent can be formulated as ∀(k, k') ∈ S_1,

\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}=0.   (43)

Based on the theorem in Appendix A.4, we can derive that {G_{2k}}_{k=1}^{K} being independent is equivalent to {F_{2k}}_{k=1}^{K} being independent as long as the activation function σ_2(·) is invertible. In other words, if σ_2(·) is invertible, the necessary condition for {F_{2k}}_{k=1}^{K} being independent is the same as the necessary condition for {G_{2k}}_{k=1}^{K} being independent.

In summary, if the activations of a hidden layer are independent in the context of frequentist probability, the weights of the hidden layer must satisfy Equation 43 given the assumption that the inputs of the hidden layer are i.i.d. and the activation function is invertible.
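A quick numerical check of Equation 43: for the activations of f_2 to be independent, every pair of weight columns of ω^(2) must have a vanishing inner product. The sketch below evaluates the condition for an illustrative, randomly initialized weight matrix (the layer sizes follow Appendix B); the off-diagonal inner products are generally nonzero.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 512, 256                               # f1 has N neurons, f2 has K neurons
W2 = rng.uniform(-0.05, 0.05, size=(N, K))    # illustrative omega^(2)_{nk}

# Equation 43: sum_n omega^(2)_{nk} omega^(2)_{nk'} = 0 for every pair k != k'.
gram = W2.T @ W2                              # (K, K) matrix of the pairwise sums
off_diag = np.abs(gram[~np.eye(K, dtype=bool)])
print(f"max |sum_n w_nk w_nk'| = {off_diag.max():.4f}, "
      f"mean = {off_diag.mean():.4f}  (exactly zero is required)")
```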


A.2. The necessary conditions for activations being identically distributed

The necessary condition for {G_{2k}}_{k=1}^{K} being identically distributed is that ∀(k, k') ∈ S_1, we have

E(G_{2k})=E(G_{2k'}),   (44)

where E(·) denotes the expectation. Since G_{2k} = \sum_{n=1}^{N}\omega^{(2)}_{nk}F_{1n}+b_{2k}, we can derive

E(G_{2k})=\sum_{n=1}^{N}\omega^{(2)}_{nk}E(F_{1n})+b_{2k}.   (45)

Assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), we have E(F_{1n}) = E(F_1). Hence, we can further derive

E(G_{2k})=E(F_1)\sum_{n=1}^{N}\omega^{(2)}_{nk}+b_{2k}.   (46)

Based on E(G_{2k}) = E(G_{2k'}), we can derive

-E(F_1)\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'})=b_{2k}-b_{2k'}.   (47)

We assume that σ_2(·) is strictly increasing and differentiable, thus σ_2(·) is invertible and its inverse σ_2^{-1}(·) is also strictly increasing. As a result, the cumulative distribution function of F_{2k} can be expressed as

\Phi_{F_{2k}}(f)=P(F_{2k}\le f)=P(\sigma_2(G_{2k})\le f)=P(G_{2k}\le\sigma_2^{-1}(f))=\Phi_{G_{2k}}(\sigma_2^{-1}(f)),   (48)

where P(F_{2k} ≤ f) is the probability that F_{2k} takes a value less than or equal to f. Subsequently, we can obtain

P_{F_{2k}}(f)=\frac{\partial\Phi_{F_{2k}}(f)}{\partial f}=\frac{\partial\Phi_{G_{2k}}(\sigma_2^{-1}(f))}{\partial f}=P_{G_{2k}}(\sigma_2^{-1}(f))\frac{\partial\sigma_2^{-1}(f)}{\partial f}.   (49)

Equation 49 indicates that if σ_2(·) is strictly increasing and differentiable and {G_{2k}}_{k=1}^{K} are identically distributed, then {F_{2k}}_{k=1}^{K} are identically distributed as well.

In summary, if the activations of a hidden layer are identically distributed in the context of frequentist probability, the weights and the biases of the hidden layer must satisfy Equation 47 under the assumption that the inputs of the layer are i.i.d. and the activation function is strictly increasing and differentiable.

A.3. Conclusion

In summary, assuming {F_{1n}}_{n=1}^{N} are i.i.d. and {f_{1n}}_{n=1}^{N} ∼ P(F_1), the necessary conditions for the activations {F_{2k}}_{k=1}^{K} being i.i.d. can be summarized as

∀(k, k') ∈ S_1 = {(k, k') ∈ ℤ² | k ≠ k', 1 ≤ k ≤ K, 1 ≤ k' ≤ K}:
\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}=0,\qquad -E(F_1)\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'})=b_{2k}-b_{2k'},   (50)

σ_2(·) is strictly increasing and differentiable.   (51)

Equation 50 shows the necessary conditions for {G_{2k}}_{k=1}^{K} being independent and identically distributed. Equation 51 specifies the condition on the activation function σ_2(·) such that if {G_{2k}}_{k=1}^{K} are i.i.d., then {F_{2k}}_{k=1}^{K} are also i.i.d. The invertibility condition is not required here because being strictly increasing and differentiable implies it. It is important to note that the necessary conditions hold for arbitrary fully connected layers as long as we properly change the superscript of ω^{(2)}_{nk} and the subscript of b_{2k}.

A.4. Functions of independent random variables are independent

Theorem: Assume X and Y are independent random variables on a probability space (Ω, ℱ, P). Let g and h be real-valued functions defined on the codomains of X and Y, respectively. Then g(X) and h(Y) are independent random variables.

Proof: Let A ⊆ ℝ and B ⊆ ℝ be the ranges of g and h; the joint distribution between g(X) and h(Y) can be formulated as P(g(X) ∈ A, h(Y) ∈ B). Let g^{-1}(A) and h^{-1}(B) denote the preimages of A and B, respectively; we have

P(g(X)\in A, h(Y)\in B)=P(X\in g^{-1}(A), Y\in h^{-1}(B)).   (52)

Based on the definition of independence, we can derive that

P(X\in g^{-1}(A), Y\in h^{-1}(B))=P(X\in g^{-1}(A))P(Y\in h^{-1}(B)).   (53)

Based on the definition of preimage, we can derive that

P(g(X)\in A, h(Y)\in B)=P(g(X)\in A)P(h(Y)\in B).   (54)

Therefore, g(X) and h(Y) are independent random variables.
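Complementing the check of Equation 43 above, the sketch below examines the second condition collected in Equation 50 (i.e., Equation 47): the bias differences b_{2k} - b_{2k'} should be a linear function of the weight-column-sum differences with slope -E(F_1), which is the relation fitted by linear regression in Appendix B (Figure 14). The weights, biases, and the stand-in value for E(F_1) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
N, K = 512, 256
W2 = rng.uniform(-0.05, 0.05, size=(N, K))    # illustrative omega^(2)_{nk}
b2 = rng.uniform(-0.05, 0.05, size=K)         # illustrative b_{2k}
mean_f1 = 0.4                                 # stand-in for the sample mean of F1

# Equation 47: -E(F1) * sum_n (w_nk - w_nk') = b_2k - b_2k' for every pair k != k'.
col_sum = W2.sum(axis=0)
k, kp = np.triu_indices(K, k=1)               # all pairs k < k'
x = col_sum[k] - col_sum[kp]
y = b2[k] - b2[kp]
slope = np.polyfit(x, y, 1)[0]                # regression slope, as in Figure 14(B)/(E)
print(f"regression slope = {slope:.2e}, required slope = {-mean_f1:.2e}")
```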


Figure 14: (A) W_{K×K} contains the value of |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| for all the neurons in f_2. (B) The green triangles show all the samples [\sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'}), b_{2k}-b_{2k'}], the black line shows the linear regression result based on these samples, and the red line shows the linear relation indicated by the slope -\bar{f}_1 ≈ -E(F_1). (C) The blue curve and the magenta curve show the variation of r_{f_2}=\frac{1}{\sum_{k=1}^{K}\sum_{k'=1}^{k-1}1}\sum_{k=1}^{K}\sum_{k'=1}^{k-1}|\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| and the training accuracy over 201 training epochs, respectively. (D) W_{S×S} contains the value of |\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| for all the neurons in f_3. (E) The green triangles show all the samples [\sum_{k=1}^{K}(\omega^{(3)}_{ks}-\omega^{(3)}_{ks'}), b_{3s}-b_{3s'}], the black line shows the linear regression result based on these samples, and the red line shows the linear relation indicated by the slope -\bar{f}_2 ≈ -E(F_2). (F) The blue curve and the magenta curve show the variation of r_{f_3}=\frac{1}{\sum_{s=1}^{S}\sum_{s'=1}^{s-1}1}\sum_{s=1}^{S}\sum_{s'=1}^{s-1}|\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| and the training accuracy over 201 training epochs, respectively.

B. Activations are not i.i.d. in more complex MLPs on the Fashion-MNIST dataset

In this section, we demonstrate that activations cannot satisfy the necessary conditions in a more complex MLP on the Fashion-MNIST dataset, thus activations being i.i.d. is not valid for all the fully connected layers of the MLP.

To check if activations satisfy the necessary conditions, we specify a MLP = {x, f_1, f_2, f_3, f_Y} for classifying the Fashion-MNIST dataset (Xiao et al., 2017). The dimension of each Fashion-MNIST image is 28×28, thus the number of input nodes is M = 784. In addition, f_1, f_2, and f_3 have N = 512, K = 256, and S = 128 neurons, respectively, and f_Y has L = 10 nodes. All hidden layers choose the sigmoid function, which satisfies the third necessary condition (Equation 51), thus we only need to examine the first two necessary conditions.

After the training accuracy is very close to 100%, we obtain ω^{(2)}_{nk} and ω^{(3)}_{ks}, and construct two matrices W_{K×K} and W_{S×S} to contain |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| and |\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| for each activation in f_2 and f_3, respectively. Figure 14(A) and 14(D) show that \sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'} and \sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'} are far from zero for many different activations after training. As a result, activations cannot be independent after training even if we consider the estimation error.

We obtain C^{256}_{2} = 65280 samples of \sum_{n=1}^{N}(\omega^{(2)}_{nk}-\omega^{(2)}_{nk'}) and b_{2k}-b_{2k'}, and C^{128}_{2} = 16256 samples of \sum_{k=1}^{K}(\omega^{(3)}_{ks}-\omega^{(3)}_{ks'}) and b_{3s}-b_{3s'}, which are shown by green triangles in Figure 14(B) and 14(E), respectively. Based on the linear regression, we learn the linearities with the slopes Δ_{f_1} = -8.75E-05 and Δ_{f_2} = -2.50E-03 from the samples. In addition, we derive the sample means \bar{f}_1 = 3.96E-01 and \bar{f}_2 = 2.96E-01. We observe that Δ_{f_1} and Δ_{f_2} differ greatly from -\bar{f}_1 and -\bar{f}_2, respectively. As a result, activations cannot be identically distributed after training even if we consider the estimation error.

Moreover, we demonstrate that activations being i.i.d. is also not valid during the training procedure. We use r_{f_2}=\frac{1}{\sum_{k=1}^{K}\sum_{k'=1}^{k-1}1}\sum_{k=1}^{K}\sum_{k'=1}^{k-1}|\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| (i.e., the mean of |\sum_{n=1}^{N}\omega^{(2)}_{nk}\omega^{(2)}_{nk'}| over all the activations) and r_{f_3}=\frac{1}{\sum_{s=1}^{S}\sum_{s'=1}^{s-1}1}\sum_{s=1}^{S}\sum_{s'=1}^{s-1}|\sum_{k=1}^{K}\omega^{(3)}_{ks}\omega^{(3)}_{ks'}| to indicate if all the activations of f_2 and f_3 are independent during training. Figure 14(C) and 14(F) show the variation of r_{f_2} and r_{f_3} during 301 training epochs, respectively. At the beginning, r_{f_2} and r_{f_3} are close to zero because all the weights ω_{nk} and ω_{ks} are randomly initialized. However, as the training procedure goes on, r_{f_2} and r_{f_3} show an increasing trend. Therefore, all the activations cannot keep being independent, thereby not being i.i.d. during training.


Overall, since {F_{2k}}_{k=1}^{K} and {F_{3s}}_{s=1}^{S} cannot satisfy the necessary conditions, they cannot be i.i.d. under the assumption that {F_{1n}}_{n=1}^{N} and {F_{2k}}_{k=1}^{K} are i.i.d. during and after training. In other words, {F_{1n}}_{n=1}^{N}, {F_{2k}}_{k=1}^{K} and {F_{3s}}_{s=1}^{S} cannot be simultaneously i.i.d. during and after training the MLP. Therefore, activations being i.i.d. is not valid for all the hidden layers of the MLP.

C. The equivalence between the Stochastic Gradient Descent (SGD) algorithm and the first order approximation

If an arbitrary function f is differentiable at the point p* ∈ ℝ^N and its differential is represented by the Jacobian matrix ∇_{p*}f, the first order approximation of f near the point p* can be formulated as

f(p)-f(p^*)=(\nabla_{p^*}f)\cdot(p-p^*)+o(\lVert p-p^*\rVert),   (55)

where o(||p - p*||) is a quantity that approaches zero much faster than ||p - p*|| approaches zero.

Based on the first order approximation (Battiti, 1992), the activations of f_2 and f_1 in the MLP = {x, f_1, f_2, f_Y} can be expressed as follows:

f_2[f_1,\theta_{j+1}(2)]\approx f_2[f_1,\theta_j(2)]+(\nabla_{\theta_j(2)}f_2)\cdot[\theta_{j+1}(2)-\theta_j(2)],
f_1[x,\theta_{j+1}(1)]\approx f_1[x,\theta_j(1)]+(\nabla_{\theta_j(1)}f_1)\cdot[\theta_{j+1}(1)-\theta_j(1)],   (56)

where f_2[f_1, θ_{j+1}(2)] are the activations of f_2 based on the parameters of f_2 learned in the (j+1)th iteration, i.e., θ_{j+1}(2), given the activations of f_1. The definitions of f_2[f_1, θ_j(2)], f_1(x, θ_{j+1}(1)), and f_1(x, θ_j(1)) are analogous.

Since f_2 = {f_{2k} = σ_2(\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k})}_{k=1}^{K} has K neurons and each neuron has T + 1 parameters, namely θ(2) = {\omega^{(2)}_{1k};\cdots;\omega^{(2)}_{Tk};b_{2k}}_{k=1}^{K}, the dimension of ∇_{θ_j(2)}f_2 is equal to K×(T+1) and ∇_{θ_j(2)}f_2 can be expressed as

\nabla_{\theta_j(2)}f_2=(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T},   (57)

where ∇_{σ_2}f_2 = ∂f_2[f_1, θ_t(2)]/∂σ_2. Substituting (\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T} for ∇_{θ_j(2)}f_2 in Equation 56, we derive

f_2[f_1,\theta_{j+1}(2)]\approx f_2[f_1,\theta_j(2)]+(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T}\cdot\theta_{j+1}(2)-(\nabla_{\sigma_2}f_2)\cdot[f_1;1]^{T}\cdot\theta_j(2).   (58)

If we only consider a single neuron, e.g., f_{2k}, we define θ_{j+1}(2k) = [\omega^{(2)}_{1k};\cdots;\omega^{(2)}_{Tk};b_{2k}] and θ_j(2k) = [\omega'^{(2)}_{1k};\cdots;\omega'^{(2)}_{Tk};b'_{2k}], thus [f_1;1]^{T}\cdot\theta_{j+1}(2k)=\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}. As a result, for a single neuron, Equation 58 can be expressed as

f_{2k}[f_1,\theta_{j+1}(2k)]\approx\underbrace{(\nabla_{\sigma_2}f_{2k})\cdot\Big[\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}\Big]}_{\text{Approximation}}+\underbrace{f_{2k}[f_1,\theta_j(2k)]-(\nabla_{\sigma_2}f_{2k})\cdot\Big[\sum_{t=1}^{T}\omega'^{(2)}_{tk}f_{1t}+b'_{2k}\Big]}_{\text{Bias}}.   (59)

Equation 59 indicates that f_{2k}[f_1, θ_{j+1}(2k)] can be reformulated as two components: the approximation and the bias. Since ∇_{σ_2}f_{2k} = ∂f_{2k}[f_1, θ_j(2)]/∂σ_2 is only related to f_1 and θ_j(2), it can be regarded as a constant with respect to θ_{j+1}(2). The bias component also does not contain any parameters of the (j+1)th training iteration.

In summary, f_{2k}(f_1, θ_{j+1}(2k)) can be reformulated as

f_{2k}(f_1,\theta_{j+1}(2k))\approx C_1\cdot\Big[\sum_{t=1}^{T}\omega^{(2)}_{tk}f_{1t}+b_{2k}\Big]+C_2,   (60)

where C_1 = ∇_{σ_2}f_{2k} and C_2 = f_{2k}(f_1, θ_j(2k)) - (∇_{σ_2}f_{2k})\cdot[\sum_{t=1}^{T}\omega'^{(2)}_{tk}f_{1t}+b'_{2k}]. Similarly, the activations of f_1 can also be formulated as the approximation.

To demonstrate the first order approximation for the activations of the MLP (Equation 56), we only need to prove that θ_{j+1}(2) - θ_j(2) approaches zero, which can be guaranteed by SGD. Given the MLP = {x, f_1, f_2, f_Y} and the empirical risk \hat{l}(h), SGD aims to optimize the parameters of the MLP through minimizing \hat{l}(h) (Rumelhart et al., 1986):

\theta_{t+1}=\theta_t-\eta\nabla_{\theta_t}\hat{l}(h),   (61)

where ∇_{θ_t}\hat{l}(h) denotes the Jacobian matrix of \hat{l}(h) with respect to θ_t at the tth iteration, and η > 0 denotes the learning rate. Since the functions of all the layers are differentiable, the Jacobian matrix of \hat{l}(h) with respect to the parameters of the ith hidden layer, i.e., ∇_{θ(i)}\hat{l}(h), can be expressed as

\nabla_{\theta(Y)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{\theta(Y)}f_Y,
\nabla_{\theta(2)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{f_2}f_Y\,\nabla_{\theta(2)}f_2,   (62)
\nabla_{\theta(1)}\hat{l}(h)=\nabla_{f_Y}\hat{l}(h)\,\nabla_{f_2}f_Y\,\nabla_{f_1}f_2\,\nabla_{\theta(1)}f_1,

where θ(i) denotes the parameters of the ith layer. Equations 61 and 62 indicate that θ(i) can be learned as

\theta_{t+1}(i)=\theta_t(i)-\eta[\nabla_{\theta_t(i)}\hat{l}(h)].   (63)

Table 6 summarizes the SGD training procedure for the MLP shown in Figure 1. SGD minimizing \hat{l}(h) makes ∇_{θ_t(i)}\hat{l}(h) converge to zero, thereby θ_{t+1}(i) - θ_t(i) converges to zero.


Table 6: One iteration of the SGD training procedure for the MLP.

Layer   Gradients ∇_{θ(i)} \hat{l}(h)    Parameters                                              Activations
f_Y     ∇_{θ(Y)} \hat{l}(h)  ↓           θ_{t+1}(Y) = θ_t(Y) − η[∇_{θ(Y)} \hat{l}(h)]  ↑         f_Y(f_2, θ_{t+1}(Y))  ↑
f_2     ∇_{θ(2)} \hat{l}(h)  ↓           θ_{t+1}(2) = θ_t(2) − η[∇_{θ(2)} \hat{l}(h)]  ↑         f_2(f_1, θ_{t+1}(2))  ↑
f_1     ∇_{θ(1)} \hat{l}(h)  ↓           θ_{t+1}(1) = θ_t(1) − η[∇_{θ(1)} \hat{l}(h)]  ↑         f_1(x, θ_{t+1}(1))  ↑
x       —                                —                                                       —

The down-arrow and up-arrow indicate the order of the gradient and parameter (activation) updates, respectively.
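The following toy sketch illustrates the argument of Appendix C and Table 6: after an SGD update, the new activations of a layer are close to the first-order expansion of Equation 56 around the old parameters, and the gap shrinks as the update (i.e., the learning rate times the gradient) shrinks. The single sigmoid layer and the squared loss are our simplifications, not the MLP of Figure 1.

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy layer f2 = sigmoid(W f1 + b) and a quadratic loss, to illustrate Equation 56/59.
f1 = rng.normal(size=8)                      # fixed input activations
W, b = rng.normal(size=(4, 8)), np.zeros(4)
target = rng.normal(size=4)

def sigmoid(z):
    return 0.5 * (1.0 + np.tanh(0.5 * z))

for lr in [1.0, 0.1, 0.01]:
    g = W @ f1 + b
    f2 = sigmoid(g)
    grad_g = (f2 - target) * f2 * (1 - f2)   # d loss / d g for loss = 0.5||f2 - target||^2
    dW, db = -lr * np.outer(grad_g, f1), -lr * grad_g
    true_new = sigmoid((W + dW) @ f1 + (b + db))
    first_order = f2 + f2 * (1 - f2) * (dW @ f1 + db)   # Equation 56-style expansion
    print(f"lr={lr:5.2f}  max |true - first order| = {np.abs(true_new - first_order).max():.2e}")
```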

D. The Gibbs explanation for the entire architecture of the MLP

Since the entire architecture of the MLP = {x, f_1, f_2, f_Y} in Figure 1 corresponds to a joint distribution

P(F_Y;F_2;F_1|X)=P(F_Y|F_2)P(F_2|F_1)P(F_1|X),   (64)

the marginal distribution P(F_Y|X) can be formulated as

P_{F_Y|X}(l|x)=\sum_{k=1}^{K}\sum_{t=1}^{T}P(F_Y=l,F_2=k,F_1=t|X=x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x).   (65)

Based on the definition of the Gibbs probability measure (Equation 9), we have

P_{F_1|X}(t|x)=\frac{1}{Z_{F_1}}\exp(f_{1t})=\frac{1}{Z_{F_1}}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)],   (66)

where \omega'^{(1)}_t=[\omega^{(1)}_t,b_{1t}] and x'=[x,1], i.e., \langle\omega'^{(1)}_t,x'\rangle=\langle\omega^{(1)}_t,x\rangle+b_{1t}. Similarly, we have

P_{F_2|F_1}(k|t)=\frac{1}{Z_{F_2}}\exp(f_{2k})=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)],   (67)

where f_1=\{f_{1t}\}_{t=1}^{T}=\{\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)\}_{t=1}^{T}, \omega'^{(2)}_k=[\omega^{(2)}_k,b_{2k}], and f'_1=[f_1,1], i.e., \langle\omega'^{(2)}_k,f'_1\rangle=\langle\omega^{(2)}_k,f_1\rangle+b_{2k}, thus we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\sum_{t=1}^{T}\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)]\frac{1}{Z_{F_1}}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)].   (68)

Since \langle\omega'^{(2)}_k,f'_1\rangle=\langle\omega^{(2)}_k,f_1\rangle+b_{2k}=\sum_{t=1}^{T}\omega^{(2)}_{kt}f_{1t}+b_{2k} is a constant with respect to t, we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)]\frac{1}{Z_{F_1}}\sum_{t=1}^{T}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)].   (69)

In addition, \sum_{t=1}^{T}\exp[\sigma_1(\langle\omega'^{(1)}_t,x'\rangle)]=Z_{F_1}, thus we have

\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)].   (70)

Therefore, we can simplify P_{F_Y|X}(l|x) as

P_{F_Y|X}(l|x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\sum_{t=1}^{T}P_{F_2|F_1}(k|t)P_{F_1|X}(t|x)=\sum_{k=1}^{K}P_{F_Y|F_2}(l|k)\frac{1}{Z_{F_2}}\exp[\sigma_2(\langle\omega'^{(2)}_k,f'_1\rangle)].   (71)

Similarly, since P_{F_Y|F_2}(l|k)=\frac{1}{Z_{F_Y}}\exp[\langle\omega^{(3)}_l,f_2\rangle+b_{yl}] and \langle\omega^{(3)}_l,f_2\rangle=\sum_{k=1}^{K}\omega^{(3)}_{lk}f_{2k} is also a constant with respect to k, we can derive

P_{F_Y|X}(l|x)=P_{F_Y|F_2}(l|k)=\frac{1}{Z_{F_Y}}\exp[\langle\omega^{(3)}_l,f_2\rangle+b_{yl}].   (72)

In addition, since f_2=\{f_{2k}\}_{k=1}^{K}=\{\sigma_2(\langle\omega^{(2)}_k,f_1\rangle+b_{2k})\}_{k=1}^{K}, we can extend P_{F_Y|X}(l|x) as

P_{F_Y|X}(l|x)=\frac{1}{Z_{F_Y}}\exp\Big[\big\langle\omega^{(3)}_l,\big(\sigma_2(\langle\omega^{(2)}_1,f_1\rangle+b_{21}),\ldots,\sigma_2(\langle\omega^{(2)}_K,f_1\rangle+b_{2K})\big)\big\rangle+b_{yl}\Big].   (73)

Since f_1=\{f_{1t}\}_{t=1}^{T}=\{\sigma_1(\langle\omega^{(1)}_t,x\rangle+b_{1t})\}_{t=1}^{T}, we can further extend P_{F_Y|X}(l|x) as


P_{F_Y|X}(l|x)=\frac{1}{Z_{F_Y}}\exp\Big[\big\langle\omega^{(3)}_l,\big(\sigma_2(\langle\omega^{(2)}_1,(\sigma_1(\langle\omega^{(1)}_1,x\rangle+b_{11}),\ldots,\sigma_1(\langle\omega^{(1)}_T,x\rangle+b_{1T}))\rangle+b_{21}),\ldots,\sigma_2(\langle\omega^{(2)}_K,(\sigma_1(\langle\omega^{(1)}_1,x\rangle+b_{11}),\ldots,\sigma_1(\langle\omega^{(1)}_T,x\rangle+b_{1T}))\rangle+b_{2K})\big)\big\rangle+b_{yl}\Big]=\frac{1}{Z_{MLP}(x_i)}\exp[f_{yl}(f_2(f_1(x)))].   (74)

Overall, we prove that P_{F_Y|X}(l|x) is a Gibbs distribution and it can be expressed as

P_{F_Y|X}(l|x)=\frac{1}{Z_{MLP}(x)}\exp[f_{yl}(f_2(f_1(x)))],   (75)

where E_{yl}(x)=-f_{yl}(f_2(f_1(x))) is the energy function of the label l ∈ {1, …, L} given x, and the partition function is

Z_{MLP}(x)=\sum_{l=1}^{L}\sum_{k=1}^{K}\sum_{t=1}^{T}Q(F_Y,F_2,F_1|X=x)=\sum_{l=1}^{L}\exp[f_{yl}(f_2(f_1(x)))].   (76)

E. The gradient of the cross entropy loss function with respect to the weights

If the loss function is the cross entropy, we have

l=H[P_{Y|X}(l|x),f_y(x)],   (77)

where f_y(x)=\{f_{yl}\}_{l=1}^{L} is the output of the MLP given x, and P_{Y|X}(l|x) is the one-hot probability of x given the label y, i.e., if l = y, P_{Y|X}(l|x) = 1, otherwise P_{Y|X}(l|x) = 0.

Based on the definition of the cross entropy, l can be expressed as

l=-\sum_{l=1}^{L}P_{Y|X}(l|x)\log f_{yl}.   (78)

Therefore, the derivative of l with respect to f_{yl} is

\frac{\partial l}{\partial f_{yl}}=-\frac{P_{Y|X}(l|x)}{f_{yl}}.   (79)

In addition, we have

\frac{\partial f_{yt}}{\partial g_{yl}}=\frac{\partial\big[\frac{1}{Z_Y}\exp(g_{yt})\big]}{\partial g_{yl}}=\begin{cases}f_{yl}(1-f_{yl})&\text{for }t=l\\-f_{yl}f_{yt}&\text{for }t\neq l\end{cases}.   (80)

As a result, the derivative of l with respect to g_{yl} can be expressed as

\frac{\partial l}{\partial g_{yl}}=\sum_{t=1}^{L}\frac{\partial l}{\partial f_{yt}}\frac{\partial f_{yt}}{\partial g_{yl}}=-P_Y(l)(1-f_{yl})+\sum_{t\neq l}P_{Y|X}(t|x)f_{yl}=f_{yl}-P_{Y|X}(l|x).   (81)

Therefore, the derivative of l with respect to \omega^{(3)}_{kl} can be expressed as

\frac{\partial l}{\partial\omega^{(3)}_{kl}}=\sum_{l=1}^{L}\frac{\partial l}{\partial g_{yl}}\frac{\partial g_{yl}}{\partial\omega^{(3)}_{kl}}=[f_{yl}-P_{Y|X}(l|x)]f_{2k}.   (82)

Similarly, the derivative of l with respect to g_{2k} can be expressed as

\frac{\partial l}{\partial g_{2k}}=\sum_{l=1}^{L}\frac{\partial l}{\partial g_{yl}}\frac{\partial g_{yl}}{\partial f_{2k}}\frac{\partial f_{2k}}{\partial g_{2k}}=\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k}).   (83)

The derivative of l with respect to \omega^{(2)}_{nk} can be expressed as

\frac{\partial l}{\partial\omega^{(2)}_{nk}}=\frac{\partial l}{\partial g_{2k}}\frac{\partial g_{2k}}{\partial\omega^{(2)}_{nk}}=\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})f_{1n}.   (84)

Similarly, the derivative of l with respect to g_{1n} can be expressed as

\frac{\partial l}{\partial g_{1n}}=\sum_{k=1}^{K}\frac{\partial l}{\partial g_{2k}}\frac{\partial g_{2k}}{\partial f_{1n}}\frac{\partial f_{1n}}{\partial g_{1n}}=\sum_{k=1}^{K}\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})\omega^{(2)}_{nk}\sigma'_1(g_{1n}).   (85)

The derivative of l with respect to \omega^{(1)}_{mn} can be expressed as

\frac{\partial l}{\partial\omega^{(1)}_{mn}}=\frac{\partial l}{\partial g_{1n}}\frac{\partial g_{1n}}{\partial\omega^{(1)}_{mn}}=\sum_{k=1}^{K}\sum_{l=1}^{L}[f_{yl}-P_{Y|X}(l|x)]\omega^{(3)}_{kl}\sigma'_2(g_{2k})\omega^{(2)}_{nk}\sigma'_1(g_{1n})x_m.   (86)

Based on the back-propagation algorithm, the weights are updated as

\omega^{(1)}_{mn}(t+1)=\omega^{(1)}_{mn}(t)-\eta\frac{\partial l}{\partial\omega^{(1)}_{mn}(t)},\quad
\omega^{(2)}_{nk}(t+1)=\omega^{(2)}_{nk}(t)-\eta\frac{\partial l}{\partial\omega^{(2)}_{nk}(t)},\quad
\omega^{(3)}_{kl}(t+1)=\omega^{(3)}_{kl}(t)-\eta\frac{\partial l}{\partial\omega^{(3)}_{kl}(t)},   (87)

where η is the learning rate and t is the tth training iteration.
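A compact sketch of Equations 81-87 for the MLP = {x, f_1, f_2, f_Y}: the output-layer error f_y - P_{Y|X} is propagated backwards through the two hidden layers and the weights are updated by gradient descent. One-hot targets and single-sample updates are assumed, and the variable names are ours.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def backprop_step(x, y_onehot, params, sigma, dsigma, lr=0.01):
    """One SGD update implementing Equations 81-87 for {x, f1, f2, fY}.

    params = [(W1, b1), (W2, b2), (W3, b3)]; sigma/dsigma are the hidden
    activation and its derivative (e.g. ReLU).  Returns updated params.
    """
    (W1, b1), (W2, b2), (W3, b3) = params
    # forward pass
    g1 = x @ W1 + b1;  f1 = sigma(g1)
    g2 = f1 @ W2 + b2; f2 = sigma(g2)
    gy = f2 @ W3 + b3; fy = softmax(gy)
    # Equation 81: d l / d g_y = f_y - P_{Y|X}
    d_gy = fy - y_onehot
    # Equations 82-83
    dW3 = np.outer(f2, d_gy)
    d_g2 = (d_gy @ W3.T) * dsigma(g2)
    # Equations 84-85
    dW2 = np.outer(f1, d_g2)
    d_g1 = (d_g2 @ W2.T) * dsigma(g1)
    # Equation 86
    dW1 = np.outer(x, d_g1)
    # Equation 87: gradient descent update
    return [(W1 - lr * dW1, b1 - lr * d_g1),
            (W2 - lr * dW2, b2 - lr * d_g2),
            (W3 - lr * dW3, b3 - lr * d_gy)]

relu = lambda z: np.maximum(0.0, z)
drelu = lambda z: (z > 0).astype(float)
```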
F. H(Y) = I(X,Y)

In this section, we prove that all the information of Y stems from X, i.e., H(Y) = I(X,Y). Based on the definition of mutual information, we have

I(X,Y)=H(Y)-H(Y|X),   (88)

thus H(Y) = I(X,Y) is equivalent to H(Y|X) = 0.

Based on the definition of conditional entropy, we have

H(Y|X)=\sum_{x\in\mathcal{X}}P(X=x)H(Y|X=x),   (89)

where H(Y|X = x) can be formulated as

H(Y|X=x)=-\sum_{y\in\mathcal{Y}}P(Y=y)\log_2 P(Y=y|X=x).   (90)

Since \mathcal{Y}=\{1,\cdots,L\} and the samples (x^j, y^j) ∈ \mathcal{D} are i.i.d., we can simplify H(Y|X = x) as

H(Y|X=x)=-\sum_{l=1}^{L}\frac{N(l)}{J}\log_2 P(Y=l|X=x),   (91)

where N(l) is the number of labels y^j = l and J is the total number of samples.

Since P(Y|X) is in one-hot format, i.e.,

P_{Y|X}(l|x^j)=\begin{cases}1&\text{if }l=y^j\\0&\text{if }l\neq y^j\end{cases},   (92)

we can derive H(Y|X = x) = 0, thereby H(Y|X) = 0. Finally, we have H(Y) = I(X,Y).

G. Information theoretic explanations for MLPs on Fashion-MNIST dataset

In this section, we design three MLPs on the Fashion-MNIST dataset to demonstrate the proposed explanations for MLPs: (i) the information flow of X and Y in MLPs (Section 4.4 and 4.3) and (ii) the information theoretic explanations for generalization (Section 4.5).

G.1. The information flow in the MLPs

To classify the Fashion-MNIST dataset, we design three MLPs = {x, f_1, f_2, f_3, f_Y}, i.e., MLP8, MLP9, and MLP10, and their differences are summarized in Table 7. All the weights of the MLPs are randomly initialized by truncated normal distributions. We still choose the Adam method to learn the weights of the MLPs on the Fashion-MNIST dataset over 500 epochs with the learning rate 0.001, and use the same method as Section 6.2.4 to derive I(X,F_i), I(Y,F_i), and Ī(X,F_i) based on Equations (30), (35), and (25), respectively.

Table 7: The number of neurons (nodes) of each layer and the activation functions in each MLP.

         x     f_1   f_2   f_3   f_Y   σ(·)
MLP8     784   256   128   96    10    ReLU
MLP9     784   256   128   96    10    Tanh
MLP10    784   96    128   256   10    ReLU

The information flow in the MLPs on the Fashion-MNIST dataset is consistent with the results on the synthetic dataset. More specifically, Figure 15(B), 15(F) and 15(J) visualize three different information flows of X in MLP8, MLP9, and MLP10, respectively, which confirms that the information flow of X in MLPs does not satisfy any DPI. Figure 15(C), 15(G) and 15(K) demonstrate that the information flow of Y satisfies I(Y,F_Y) ≥ I(Y,F_3) ≥ I(Y,F_2) ≥ I(Y,F_1) in all the three MLPs. The experiment further demonstrates that IB cannot correctly explain the information flow of X and Y in MLPs, because they cannot satisfy the DPIs (Equation 4) derived from IB in Figure 15(B, C), 15(F, G) and 15(J, K). In addition, Figure 15(D), 15(H) and 15(L) demonstrate that the information flow of X̄ in all the three MLPs satisfies Ī(X,F_1) ≥ Ī(X,F_2) ≥ Ī(X,F_3) ≥ Ī(X,F_Y).



Figure 15: (A), (E), and (I) visualize the variation of the training/testing error and the cross entropy loss of MLP8, MLP9 and MLP10 during training, respectively. (B), (F), and (J) visualize the variation of I(X,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively. (C), (G), and (K) visualize the variation of I(Y,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively. (D), (H), and (L) visualize the variation of Ī(X,F_i) over all the layers in MLP8, MLP9, and MLP10, respectively.


Figure 16: (A) shows the variation of the testing accuracy and Ī(X,F_1) given different MLPs with different numbers of neurons. (B) shows the variation of the testing accuracy and Ī(X,F_1) given different numbers of training samples.

G.2. The information theoretic explanation for the generalization performance of MLPs

First, Ī(X,F_1) can measure the generalization of MLPs with different numbers of neurons. In general, a MLP with more neurons would have better generalization, thus Ī(X,F_1) of the MLP should be larger. We design six different MLPs = {x, f_1, f_2, f_3, f_Y}. The numbers of neurons in the three hidden layers of the six MLPs have the same ratio, i.e., #(f_1) : #(f_2) : #(f_3) = 4 : 3 : 1. However, different MLPs have different numbers of neurons; specifically #(f_1) = {64, 128, 256, 512, 1024, 2048}. After all the six MLPs achieve 100% training accuracy on the Fashion-MNIST dataset, we observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 16(A).

Second, Ī(X,F_1) can measure the generalization of MLPs with different numbers of training samples. In general, a MLP with a larger number of training samples would have better generalization performance, thus Ī(X,F_1) of the MLP should be larger. We generate 8 different training sets with different numbers of Fashion-MNIST training samples and train MLP8 on the 8 training sets. After MLP8 achieves 100% training accuracy on the 8 training sets, we also observe a positive correlation between the testing accuracy and Ī(X,F_1) in Figure 16(B).

In summary, Ī(X,F_1) demonstrates a positive correlation with the testing accuracy of MLPs, which keeps consistent with the results based on MLPs on the MNIST dataset. The experiment further confirms that Ī(X,F_1) can be viewed as a criterion for the generalization of MLPs.

References

Battiti, R., 1992. First and second order methods for learning: Between steepest descent and Newton's method. Neural Computation 4, 141–166.
Chelombiev, I., Houghton, C., O'Donnell, C., 2019. Adaptive estimators show information compression in deep neural networks, in: International Conference on Learning Representations.
Cover, T., Thomas, J., 2006. Elements of Information Theory. Wiley-Interscience, Hoboken, New Jersey.


Gabrié, M., Manoel, A., Luneau, C., Macris, N., Krzakala, F., Zdeborová, L., et al., 2018. Entropy and mutual information in models of deep neural networks, in: Advances in Neural Information Processing Systems, pp. 1821–1831.
Geman, S., Geman, D., 1984. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 721–741.
Glorot, X., Bengio, Y., 2010. Understanding the difficulty of training deep feedforward neural networks, in: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
Goldfeld, Z., Van Den Berg, E., Greenewald, K., Melnyk, I., Nguyen, N., Kingsbury, B., Polyanskiy, Y., 2019. Estimating information flow in deep neural networks, in: Proceedings of the 36th International Conference on Machine Learning, pp. 2299–2308.
Hinton, G.E., 2002. Training products of experts by minimizing contrastive divergence. Neural Computation 14, 1771–1800.
Kabashima, Y., 2008. Inference from correlated patterns: a unified theory for perceptron learning and linear vector channels, in: Journal of Physics: Conference Series, IOP Publishing. p. 012001.
Kingma, D.P., Ba, J., 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
LeCun, Y., Bottou, L., Haffner, P., 1998. Gradient-based learning applied to document recognition. Proceedings of the IEEE 11, 2278–2324.
LeCun, Y., Chopra, S., Hadsell, R., Ranzato, M., Huang, F.J., 2006. A tutorial on energy-based learning. MIT Press.
Lee, J., Bahri, Y., Novak, R., Schoenholz, S.S., Pennington, J., Sohl-Dickstein, J., 2018. Deep neural networks as Gaussian processes, in: ICLR.
Lin, H.W., Tegmark, M., Rolnick, D., 2017. Why does deep and cheap learning work so well? Journal of Statistical Physics 168, 1223–1247.
Manoel, A., Krzakala, F., Mézard, M., Zdeborová, L., 2017. Multi-layer generalized linear estimation, in: 2017 IEEE International Symposium on Information Theory (ISIT), IEEE, pp. 2098–2102.
Matthews, A., Rowland, M., Hron, J., Turner, R.E., Ghahramani, Z., 2018. Gaussian process behaviour in wide deep neural networks, in: ICLR.
Mehta, P., Schwab, D.J., 2014. An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831.
Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., Abolafia, D.A., Pennington, J., Sohl-Dickstein, J., 2018. Bayesian deep convolutional networks with many channels are Gaussian processes, in: ICLR.
Oord, A., Schrauwen, B., 2014. Factoring variations in natural images with deep Gaussian mixture models, in: NeurIPS.
Patel, A., Nguyen, M., Baraniuk, R., 2016. A probabilistic framework for deep learning, in: NeurIPS.
Rumelhart, D.E., Hinton, G.E., Williams, R.J., 1986. Learning representations by back-propagating errors. Nature 323, 533–536.
Saxe, A., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B., Cox, D., 2018. On the information bottleneck theory of deep learning, in: International Conference on Learning Representations.
Shwartz-Ziv, R., Tishby, N., 2017. Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
Slonim, N., 2002. The information bottleneck: Theory and applications. Ph.D. thesis. Citeseer.
Tang, Y., Salakhutdinov, R., Hinton, G., 2015. Deep mixtures of factor analysers. arXiv preprint arXiv:1206.4635.
Wasserman, L., 2006. All of Nonparametric Statistics. Springer Science & Business Media.
Xiao, H., Rasul, K., Vollgraf, R., 2017. Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.
Yaida, S., 2019. Non-Gaussian processes and neural networks at finite widths. arXiv preprint arXiv:1910.00019.
Yu, S., Giraldo, L.G.S., Jenssen, R., Principe, J.C., 2019. Multivariate extension of matrix-based Renyi's α-order entropy functional. IEEE Transactions on Pattern Analysis and Machine Intelligence.
Yu, S., Wickstrøm, K., Jenssen, R., Principe, J.C., 2020. Understanding convolutional neural networks with information theory: An initial exploration. IEEE Transactions on Neural Networks and Learning Systems.
