
Mutual Information Tracking for Convolutional Neural Networks

Luke Nicholas Darlow

Supervisor: Prof Amos Storkey

Centre for Doctoral Training in Data Science
School of Informatics
University of Edinburgh

This dissertation is submitted for the degree of Master of Science by Research

August 2018

Declaration

I have read and understood the University of Edinburgh’s plagiarism guidelines. I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements.

Luke Nicholas Darlow August 2018

Acknowledgements

First, a thank you to the machine learning community; our shared fascination is motivation enough for me to take one small step forward, and contribute. I would like to thank Prof Amos Storkey, my supervisor, for his patience and valued guidance. Thank you to the Bayeswatch research group for the boost and confidence. Particularly, Antreas Antoniou for your excitement about research and availability to help, and your suggestions and help with implementation; and Elliot Crowley for being a sounding board and always offering grounding advice, and for running experiments on CINIC-10. Thank you to all the members of the CDT, staff and students alike. For your confidence and always reassuring presence, thank you Piette. Finally, thank you to my family for letting me stand tall on your shoulders. You are all incredible. This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

Abstract

The information bottleneck interpretation of deep learning posits that neural networks generalise well because they learn optimal hidden representations. An optimal representation preserves maximally the task-relevant information from the input data, while compressing all task-irrelevant information. In this thesis, we tracked mutual information for a modern convolutional neural network, as it learned to either classify or autoencode realistic image data. Images are complex high-dimensional data sources, which makes computing mutual information in closed form intractable. Hence, we used decoder models to estimate mutual information lower bounds: a classifier for forward estimation and an autoregressive conditional PixelCNN++ for inverse estimation. Confirming some results in earlier research on toy problems, we found that the hidden representations first maximised shared information with the images, and then compressed task-irrelevant information for the remainder of training. Neural networks trained with stochastic gradient descent do learn to compress information. Compression was observed for both classification and autoencoding. However, whether this compression is the primary feature that enables neural networks to generalise well is still an open question. Images were also generated conditioned on hidden representations for a qualitative perspective on the nature of the information retained and/or compressed. Contrary to earlier research, we did not find any evidence in the signal-to-noise ratios of weight updates that indicated a change from fitting to compression.

Table of contents

List of figures

List of tables

1 Introduction
   1.1 Understanding Neural Networks
   1.2 An Information Theoretic Approach
   1.3 Application to Modern CNNs
   1.4 Our Contributions
   1.5 Thesis Structure

2 Technical Background: Deep Neural Networks
   2.1 Convolutional Neural Networks
      2.1.1 Hidden Representations
   2.2 Modern Techniques for Deeper Networks
      2.2.1 Batch Normalisation
      2.2.2 Residual Connections
   2.3 Thirst for Understanding
      2.3.1 Deeper Representations Disentangle Better
      2.3.2 Rethinking Generalisation
      2.3.3 Frameworks, Approaches, and Tools

3 Information Theoretic Analysis of Deep Neural Networks
   3.1 Mutual Information as a Tool
   3.2 The Information Bottleneck Interpretation
   3.3 Is Compression Necessary for Generalisation?
      3.3.1 Tanh Non-linearity and Binning
      3.3.2 The Question of Compression
      3.3.3 Two Stages of Learning and Stochastic Relaxation
      3.3.4 What of the IB Interpretation?
   3.4 Further Related Analyses
      3.4.1 Inverting Supervised Representations

4 Investigation Framework
   4.1 MI Using a Model: a Lower Bound
      4.1.1 Forward Decoding for Label MI
      4.1.2 Inverse Decoding for Input MI
   4.2 Tightness of the Bound
   4.3 Models Under Scrutiny
      4.3.1 Training and Freezing
   4.4 Data
      4.4.1 CINIC-10: CINIC-10 is Not Imagenet or CIFAR-10

5 Experiment One: Classifier MI Tracking
   5.1 Inverse Decoding: Information About Inputs
      5.1.1 Compression Through Stochastic Relaxation?
      5.1.2 Conditional Samples
   5.2 Forward Decoding: Information about Targets
      5.2.1 Data Processing Inequality Violation?
      5.2.2 Linear Separability

6 Experiment Two: Autoencoder MI Tracking
   6.1 Inverse Decoding: Information about Inputs
      6.1.1 Conditional Samples
      6.1.2 Signal to Noise Ratio Tracking
   6.2 Forward Decoding: Information about Targets

7 Conclusion
   7.1 Our Contributions
   7.2 Our Findings
   7.3 Limitations and Future Work

References

Appendix A Self-consistent IB Equations

Appendix B Training Curves
   B.1 Training the Classifier and Autoencoder
   B.2 Forward Decoder Models
   B.3 PixelCNN++ Inverse Decoder Models
      B.3.1 PixelCNN++ Bound
   B.4 Unconditional PixelCNN++

Appendix C CINIC-10: CINIC-10 Is Not Imagenet or CIFAR-10
   C.1 Motivation
   C.2 Compilation
   C.3 Analysis

List of figures

2.1 A simple deep neural network
2.2 A demonstration
2.3 Residual block

3.1 Hyperbolic tangent activation function and binning procedure

4.1 The ResNet model architecture (encoder) used to generate hidden activations in either a classification or autoencoder set-up

5.1 Mutual information curves (inverse direction) for classifier training
5.2 SNR statistics for classifier training
5.3 Samples generated using PixelCNN++, conditioned on h2 in the classifier training set-up
5.4 Samples generated using PixelCNN++, conditioned on h3 in the classifier training set-up
5.5 Samples generated using PixelCNN++, conditioned on h4 in the classifier training set-up
5.6 Classifier learned representations, forward decoding
5.7 Classifier learned representations, linear separability

6.1 Mutual information curves (inverse direction) for autoencoder training
6.2 Samples generated using PixelCNN++, conditioned on h4, the autoencoder bottleneck
6.3 SNR statistics for autoencoder training
6.4 Autoencoder learned representations, forward decoding

B.1 Models under scrutiny loss and accuracy curves
B.2 Image reconstructions from the autoencoder model
B.3 Forward decoder models loss curves for first hidden representation, classifier training regime
B.4 Forward decoder models loss curves for second hidden representation, classifier training regime
B.5 Forward decoder models loss curves for third hidden representation, classifier training regime
B.6 Forward decoder models loss curves for first hidden representation, autoencoder training regime
B.7 Forward decoder models loss curves for second hidden representation, autoencoder training regime
B.8 Forward decoder models loss curves for third hidden representation, autoencoder training regime
B.9 Inverse PixelCNN++ decoder models loss curves for second hidden representation, classifier training regime
B.10 Inverse PixelCNN++ decoder models loss curves for third hidden representation, classifier training regime
B.11 Inverse PixelCNN++ decoder models loss curves for fourth hidden representation, classifier training regime
B.12 Inverse PixelCNN++ decoder models loss curves for second hidden representation, autoencoder training regime
B.13 Inverse PixelCNN++ decoder models loss curves for third hidden representation, autoencoder training regime
B.14 Inverse PixelCNN++ decoder models loss curves for fourth hidden representation, autoencoder training regime
B.15 Areas under the curve for PixelCNN++ lower bounds
B.16 Unconditional PixelCNN++ loss curves
B.17 Unconditional PixelCNN++ generated samples

C.1 CINIC-10 contributor images’ histograms
C.2 Samples from CINIC-10, showing the differences between CIFAR-10 and ImageNet contributors

List of tables

2.1 Disentanglement example
5.1 Relative information gain and compression for the hidden representations in a classifier training regime
6.1 Relative information gain and compression for the hidden representations in an autoencoder training regime
C.1 CINIC-10 versus CIFAR-10 on some popular models for classification

Chapter 1

Introduction

1.1 Understanding Neural Networks

Deep Neural Networks are synonymous with modern machine learning and artificial intelligence partly because of their widespread success. Unfortunately, the popularity of neural networks for applications is not matched by a clear understanding of how they work. The field will advance if we have more comprehensive theory about how neural networks work, or better empirical studies that characterise them. In order to understand neural networks better, a number of researchers have proposed appropriate approaches and principled tools. We discuss these in the following section.

1.2 An Information Theoretic Approach

The information bottleneck (IB) interpretation of deep learning [45, 46, 41] claims that optimal learning is a technique for finding representations of data that are suited to a target task. An optimal representation should (1) keep maximal information about the task, and (2) retain minimal task-irrelevant information. The best-generalising model compresses all task-irrelevant information. In this thesis, we tracked information quantities to inspect the notion of compression in convolutional neural networks. Depth in neural networks results in a flexible model that enables easier compression, according to the IB perspective. Furthermore, the computational burden incurred by adding more layers does not outweigh the compression benefit. Researchers claimed that neural networks generalise well because they compress hidden representations efficiently [41]. However, recent research [36] provided counterexamples against

compression for good generalisation. This thesis explores whether compression occurs in modern convolutional models applied to realistic images. We aim to clarify whether earlier findings [41] or counter-findings [36] were a product of a limited toy-example set-up, or do indeed apply to a modern setting.

1.3 Application to Modern CNNs

Images are a domain in which neural networks have become commonplace. Deep convolutional neural networks (CNNs) are foundational elements for modelling images and solving image-related tasks [25]. Images are complex and high-dimensional data sources. Therefore, the learning process can be better understood by using realistic images as a data source. This thesis applies an information theoretic analysis to CNN models trained on realistic image data. This investigation fills a current research gap by extending previous inspection of much simpler models [41, 36]. It is not possible to compute information theoretic quantities exactly for high-dimensional data. Hence, we must estimate these quantities using models.

1.4 Our Contributions

Our contributions are:

1. An information theoretic analysis of modern CNN models that are trained using realistic image data. The PixelCNN++ [33] architecture is used to compute a mutual information lower bound for the inverse analysis. A convolutional classifier model is used to compute a lower bound for the mutual information with the labels.

2. Confirmation of the presence of compression of task-irrelevant information. The analysis is undertaken for supervised (classification) and unsupervised (autoen- coding) training regimes.

3. Visualisation of conditionally generated images. This enables understanding of what aspects of the images (colour, object location, etc.) are irrelevant.

4. A demonstration of the increase in linear separability of learned hidden representations as a function of training time.

5. Tracking of a learning signal in the form of the signal-to-noise ratios of weight updates. Earlier work [41, 46] posited that compression occurs at low signal-to-noise ratios.

6. The compilation of a new dataset, named CINIC-10, that extends CIFAR-10 [24]. CINIC-10 keeps the same task of 10-way classification, but has more samples per class.

1.5 Thesis Structure

A technical background to deep neural networks is given in Chapter 2. The current literature on information theoretic analysis of deep learning is described in Chapter 3. Chapter 4 explains the models under analysis in this research, the investigation framework for analysis, and the dataset compiled and used throughout. Chapters 5 and 6 disseminate and discuss the results of the analysis of a classifier and autoencoder, respectively. Chapter 7 concludes this work and offers directions for future research.

Chapter 2

Technical Background: Deep Neural Networks

Deep neural networks [11] are widely-used parametric machine learning models. This thesis focusses on neural networks for images [25]. Convolutional models (Section 2.1) typically improve performance in neural networks for image-related problems. Realistic images provide a suitable problem domain for information theoretic analysis since they represent a complex and high-dimensional data source.

Figure 2.1: A simple deep neural network. The input, x = (x1, x2), is processed through three layers from left to right by a weighted sum followed by a non-linearity. f(·) could, for example, be f(x) = sigmoid(w1x1 + w2x2 + b), where b is a scalar bias term and the example non-linearity is the sigmoid function. The loss, L(ŷ, y), is computed (right) using the network’s output, ŷ, and a known data label, y.

Neural networks are stacked multi-layer processing machines. Each successive layer performs a linear mapping on its input followed by a non-linear activation function. Figure 2.1 illustrates processing for a three layer fully-connected neural network. Each layer of computation can be thought of as performing feature engineering on its inputs that yields an internal hidden representation (Section 2.1.1). Neural networks are optimised end-to-end; the resultant hidden representations are learned from data (both x and ŷ in Figure 2.1) instead of being hand-crafted.

Stochastic gradient descent (SGD) through back-propagation is the standard optimisation procedure for neural networks. A loss is computed for parameter updates; the gradient of the loss with respect to the parameters defines the updates. These updates – effectively the learning signal – are back-propagated through a neural network. The current state and relative performance of each layer affects the quality of the updates. Deeper models suffer more from gradient degradation because the learning signal to earlier layers can be made overly noisy by its back-propagation through many layers. A number of modern techniques (Section 2.2) alleviate challenges associated with training very deep neural networks.

Modern neural networks often have millions (or even billions) of parameters [38], but orders of magnitude fewer training samples. Regularisation combats overfitting in complex models but does not fully account for the generalisation capability [52]. Generalisation in deep learning is an ongoing topic of research [21]. The thirst for understanding (Section 2.3) why and how neural networks function is the driving force behind this research and thesis.
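To make the layer-by-layer computation and the SGD update concrete, the following minimal sketch (in PyTorch, with illustrative layer sizes and a toy minibatch, none of which come from the thesis) builds a small fully-connected network like the one in Figure 2.1 and applies one parameter update.

```python
import torch
import torch.nn as nn

# Three layers, each a weighted sum followed by a sigmoid non-linearity.
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(),
                    nn.Linear(3, 3), nn.Sigmoid(),
                    nn.Linear(3, 1), nn.Sigmoid())
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x, y = torch.randn(8, 2), torch.rand(8, 1)             # toy minibatch
loss = nn.functional.binary_cross_entropy(net(x), y)   # L(y_hat, y)
opt.zero_grad()
loss.backward()   # back-propagate the learning signal through all layers
opt.step()        # one SGD parameter update
```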

2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) leverage the spatial equivariance property of images; the content of an image can vary in spatial location without affecting how CNNs compute features. Convolutional kernels are structured two-dimensional shared weight matrices that are the basis of convolutional layers. Kernels are convolved over their inputs to extract features for the task(s) at hand. Stacking convolutional layers induces hierarchical features [27] that become more task-relevant with depth. For example, shallow features may be edges and deep features may be patterns that form faces. The idea behind CNNs was first called the ‘Neocognitron’ [7] but only gained popularity when processing power and data-availability allowed for the widespread successful use of CNNs [25].

2.1.1 Hidden Representations

Figure 2.2: A convolution where the input (left) is an image with three colour channels. The kernel (middle – lightness representing weights) is convolved over the entire input space, with each unique position involving a dot product with a receptive window of the input and the kernel, to yield a single scalar value. That scalar value passes through a non-linear activation function, and the output becomes a value in the feature map (right). There are 28 × 28 unique locations for the kernel, hence the reduction in spatial size of the feature map.

Figure 2.2 illustrates a convolution using a kernel size of (3 × 3). The input is an image in this case and the convolution results in a single feature map output. The dot product of a windowed region from the input – called the receptive field – and the kernel results in a single scalar value in the feature map. The convolution repeats with the same kernel parameters for every unique location in the input. A ‘strided’ convolution reduces the spatial size of the feature map by applying the convolution at steps greater than one. In so doing, subsequent convolutions span a greater receptive field. Every resultant value passes through a non-linear function to construct the feature map.

The height and width dimensions are reduced by either: (1) taking a summary statistic such as mean or maximum of a windowed region, called pooling; or (2) strided convolutions. Spatial dimensionality reduction induces a larger receptive field for successive convolutions. A larger receptive field enables features that span a larger portion of the image. Moreover, many kernels can be learned for any given layer, each of which results in a single channel output. Thankfully, modern GPU libraries are optimised to perform convolutions since these can be computationally intensive.

Hidden representations are computed by iteratively stacking convolutional layers that may use striding and/or pooling. The primary focus of this thesis is the way that these representations learn to selectively keep or discard information. We discuss modern techniques that make it possible to train deeper neural networks in the next section.
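As a small illustration of these shape mechanics (not taken from the thesis; kernel counts and sizes are arbitrary assumptions), the following PyTorch snippet applies an unpadded convolution and a strided convolution to a 32 × 32 RGB input and prints the resulting feature map sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                 # (batch, channels, height, width)

conv = nn.Conv2d(3, 1, kernel_size=3)                 # one 3x3 kernel, no padding
strided = nn.Conv2d(3, 16, kernel_size=3, stride=2)   # 16 kernels, stride 2

h = torch.sigmoid(conv(x))                    # non-linearity applied to each value
print(h.shape)                                # (1, 1, 30, 30): 30x30 unique kernel positions
print(strided(x).shape)                       # (1, 16, 15, 15): stride 2 shrinks the map
```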

2.2 Modern Techniques for Deeper Networks

Earlier research focussed on finding means to make CNNs deeper [44], since an increase in depth typically improves convolutional model performance. Other generalisation advantages, such as speedier compression [41], make depth a favourable characteristic. Unfortunately, we cannot stack naively convolutional layers without incurring a severe increase in computational cost, or making the network untrainable because of gradient back-propagation issues. Batch normalisation (Section 2.2.1) and residual connections (Section 2.2.2) enable the training of deeper neural networks. We do not discuss other techniques – such as improved initialisation schemes [49] – that do not feature in the models under scrutiny in this research.

2.2.1 Batch Normalisation

Batch normalisation (BatchNorm) [19] is a technique that enables faster and more stable training. The activations of hidden layers are normalised on a minibatch basis. That is, a BatchNorm layer normalises minibatch activations so that each dimension has a mean of zero and unit variance. Two parameters are incorporated to keep the representation capacity of the model; the original activations can be recovered if they result in better performance. When employing BatchNorm, network parameters need not adapt to changes owing to differences between minibatches. Research has shown that BatchNorm has a positive effect on the characteristics of the loss surface [35]. However, minibatch size-dependence can be problematic. Running means and variances are learned and kept for inference, to alleviate minibatch size-dependence.
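A minimal sketch of the training-time normalisation described above (it omits the running statistics kept for inference, and the tensor shapes are assumptions, not the thesis's implementation):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature dimension of a minibatch to zero mean and unit
    variance, then scale and shift with the two learned parameters gamma and
    beta, which preserve the representational capacity of the layer.
    x: (batch, features)."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
```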

Minibatch size-independent normalisation techniques were proposed in subsequent research [2, 34]. The exploration of optimal normalisation techniques is not the focus of this thesis. We discuss residual connections in the next section.

2.2.2 Residual Connections

Figure 2.3: A residual block consists of two branches: (1) a processing branch involving convolutions and batch normalisation (Section 2.2.1); and (2) an identity skip connection that gets added to the result of processing, f(x) + x, before the final activation. Figure adapted from the original work on ResNets [15].

Figure 2.3 shows the residual block – the basis of residual neural networks (ResNets) [15]. Training of very deep neural networks is stabilised when using residual connections because the gradient signal can flow relatively unhindered along the skip connections. Information is also passed forward along residual connections, meaning that features can be reused in later processing stages. ResNets are widely used since they almost always improve performance and are not computationally costly. The following section explains why better neural network theory and empirical studies are paramount.
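A sketch of the block in Figure 2.3 (channel counts and the ReLU choice are illustrative assumptions, not the exact blocks used in the models under scrutiny):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolution + batch-normalisation stages on the processing branch,
    plus an identity skip connection added before the final activation."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))   # f(x): processing branch
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                  # f(x) + x, then activation
```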

2.3 Thirst for Understanding

A long-term goal of this research is to build and improve fundamental understandings of neural networks. The cliché explanation that neural networks are ‘black boxes’ [1], along with the common unnecessary marvel at their surprising performance, argues for the need to enrich understanding of the field. Comprehensive theoretical and empirical understanding of the optimisation processes that encourage good generalisation is an important milestone to work towards.

2.3.1 Deeper Representations Disentangle Better

Table 2.1: An example of how deeper features disentangle relevant information. Each row pairs an image (‘what we see’) with examples of what a machine may see: the values of the raw pixels (middle; red, green, and blue), represented as binary encoded blocks, and deeper features (right) such as ‘legs, ears, fur, grass, tail, snout’ for one image and ‘wheels, glass, road, lights, mirrors’ for the other. The raw pixels hold all the relevant information of the image, but not in a representation that is always easy to understand for either humans or machines. The features are deeper representations and are more informative in helping decide what the image may be, but may also require many layers of processing to compute.

A reasonable explanation for the success of neural networks is that they discover useful levels of representation through disentangling as many factors as possible [3], while discarding as little information as is necessary. Current information-theoretic neural network research [45, 46, 41, 36] focusses on the compression of irrelevant information. This is covered in detail in Chapter 3.

Deeper representations disentangle better [3]. Higher-level abstractions are easier to distinguish than low-level data. Consider Table 2.1 for an illustration of this: the values of the raw pixels are not readily informative of the image content; the features are immediately informative. Low-level pixel values are difficult to directly understand, even for humans, unless presented in a manner suited to our visual processing system. Another perspective as to why neural networks tend to perform better than other models is that depth offers a computational advantage for generalisation. Shwartz-Ziv and Tishby [41] argued that depth reduces the computational strain of compressing superfluous information. The colour of the background in the examples in Table 2.1 demonstrates superfluous information.

2.3.2 Rethinking Generalisation

Zhang et al. [52] highlighted the need to rethink generalisation. Conventional wisdom regarding generalisation is not readily applicable to neural networks. Conventional wisdom says that models tend to overfit when the number of parameters is much higher than the number of samples. This is not the case for neural networks. Regularisation techniques such as dropout [43] – originally described as a simple method to prevent neural network overfitting – do improve generalisation, but they are not solely responsible for it. The next section discusses some examples of frameworks, approaches, and tools developed to query generalisation dynamics in neural networks.

2.3.3 Frameworks, Approaches, and Tools

Principled approaches, tools, or frameworks make it possible to observe, deduce, or recognise the characteristics of neural networks that cause good performance and generalisation. A task to benchmark the disentanglement capability of a learning algorithm [13] exemplifies a principled approach. This particular task was designed to be very difficult to accomplish without disentanglement. The addition of linear classifier probes at hidden layers [1] enables measuring how well each layer creates linearly separable representations (regarding a target task) of the input. Neural networks that were judged empirically to be untrainable can be trained by employing linear probes. We also use linear probes in this thesis to query whether hidden layers become more linearly separable as learning progresses. Another framework being developed and used to understand neural networks is the information bottleneck (IB) interpretation of deep learning. Conclusions regarding

neural network functionality and generalisation can be drawn by assessing and tracking the shared information content between hidden representations and data (input or labels).

Conclusion

This chapter gave a technical background to deep neural networks, explaining the widely used convolutional architecture and modern techniques that are typically used to enable training and achieve suitable convergence in very deep networks. The justification behind the analysis expressed within this thesis – the thirst for understanding how, why, and what networks learn – was also expressed, with accompanying examples of tools and approaches used in earlier literature. The following chapter delves deeper into the information theoretic analysis of deep neural networks.

Chapter 3

Information Theoretic Analysis of Deep Neural Networks

Entropy

Information Theory is the mathematical theory of communication [37]. It is concerned with the coding of information, and with its quantification, transmission, and storage. A key concept is information entropy:

H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),    (3.1)

where x is a random variable and p(x) is the probability that it takes on the value x. Information entropy is the expected negative logarithm of the probability mass function. It is measured in bits, nats, or bans for logarithm bases of 2, e, and 10, respectively. We will be using nats in our assessments. The highest entropy data source is most surprising in that any occurrence is equally likely. For example, a fair coin toss is uniformly random and is the highest entropy binary data source. The lowest entropy data source is entirely unsurprising and predictable. To extend the coin toss example, the entropy of a (perfectly) rigged coin is zero. Information entropy can be thought of as the rate of information produced by a stochastic data source.

Conditional entropy

Conditional entropy quantifies the amount of information needed to describe a random variable, x, given the state of another random variable, y. It is defined as

H(x \mid y) = \sum_{y \in \mathcal{Y}} p(y) H(x \mid y = y)
            = -\sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(y) p(x \mid y) \log p(x \mid y)
            = -\sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(x, y) \log \frac{p(x, y)}{p(y)}
            = \sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(x, y) \log \frac{p(y)}{p(x, y)}.    (3.2)

Mutual information

Entropy and conditional entropy are used to define the mutual information (MI):

I(x; y) = H(x) - H(x \mid y).    (3.3)

MI is the relative entropy, known as the Kullback-Leibler (KL) divergence, between the joint distribution of two random variables and the product of their marginals:

I(x; y) = D_{KL}\big(p(x, y) \,\|\, p(x) p(y)\big) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.    (3.4)

Though entropy is only defined for discrete variables, relative entropy is meaningful for continuous variables. The summation is replaced with an integral:

I(x; y) = \int_{y} \int_{x} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy.    (3.5)
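As a small worked example of the discrete definitions above (the joint distribution is invented for illustration and is not from the thesis), the following snippet evaluates Equations 3.1 and 3.4 in nats:

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])        # joint p(x, y); rows index x, columns index y
p_x = p_xy.sum(axis=1)               # marginal p(x) = [0.5, 0.5]
p_y = p_xy.sum(axis=0)               # marginal p(y) = [0.5, 0.5]

H_x = -np.sum(p_x * np.log(p_x))                        # Equation 3.1: ~0.693 nats (a fair coin)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))   # Equation 3.4: ~0.193 nats shared
print(H_x, mi)
```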

3.1 Mutual Information as a Tool

The MI between two random variables is the shared information between them. It is the reduction in uncertainty about a random variable given knowledge of another. It is symmetric (I(x; y) = I(y; x)) and always non-negative.

Why MI for neural network analysis?

Why do we care about MI when examining neural networks? If we consider that the hidden layers of neural networks process an input random variable, x, into representations that themselves are random variables, h, we can construct an analysis of I(x; h). Now consider the standard scenario of image classification using a CNN: the data source (image) is a multi-dimensional random vector, x; y is the corresponding one-hot encoded label vector; and h is the multi-dimensional activations of some hidden layer within the CNN.

MI allows us to query the information shared between input data and a hidden representation – I(x; h) – or the analogous quantity with respect to labels – I(y; h). As useful as that interpretation is, computing MI is not straightforward for high-dimensional quantities: it requires assessing the integral of a joint probability distribution, which is combinatorial. Instead, in this thesis we estimate a MI lower bound using a model. The details of this estimate are discussed in Section 4.1 and concerns regarding the tightness of this bound are discussed in Section 4.2.

Investigating why neural networks generalise efficiently can be done in a principled manner by querying and tracking MI quantities as learning progresses. One of the confounding features of neural networks is that they exhibit state-of-the-art performance, regarding generalisation, even when their number of parameters is much greater than the number of data items used to train them. The existing body of work on the IB interpretation (Section 3.2) gives insight as to why this may be the case.

Contribution

Most of the earlier work on MI analyses – detailed in the next section – typically used limited architectures and toy datasets or problems. The machine learning community is currently debating the conclusions drawn and insights gleaned. Assessing modern neural network architectures using realistic image data is our contribution to this area of discovery. We rephrase the MI analysis as: how well can we learn to decode the activations of a hidden layer to predict either the input image or the corresponding label?

3.2 The Information Bottleneck Interpretation

Imagine a situation in which you are given a large number of extensive descriptions about ten people’s faces. These descriptions are open-ended and were compiled by a separate observation group. Each description corresponds to a unique observer. Your task is to assign a label to each extensive description. As you read these descriptions you quickly determine commonalities, such as the colour of eyes or the length of hair. You gain an inkling for what aspects of these descriptions to look out for. You also begin to whittle down the details of the descriptions that seem irrelevant to your task: ‘this is a face’, ‘this person is smiling’, or ‘there is a giant shark in the background’. You end up making decisions about what is important and what is superfluous, and in so doing you build an internal representation of these descriptions that enables you to group them fairly easily. The IB interpretation of deep learning is about the idea of learning by forgetting the irrelevant details of input data (with respect to the target task). A main claim of the IB interpretation is that the efficacy and correctness of this forgetting process has a direct impact on the generalisation of the learned representations.

Application to deep neural networks

Recent research [46, 41] viewed deep learning as optimal representation learning. In this perspective, the input data source, x, is a random vector and the targets or labels, y, form another random vector. The relevant information between x and y is defined as I(x; y). Tishby and Zaslavsky [46] argued that an optimal representation of x captures all the features relevant for predicting y, and compresses all irrelevant aspects. The relevant components of x with respect to y are called the minimum sufficient statistics of x with respect to y. The optimal representation can be postulated as a random vector, h. Hence we can write that formulation as the graphical model

x \rightarrow h \rightarrow y.    (3.6)

Any noise in the system is induced via noise in x alone unless it is intentionally added to the system. Viewing neural networks like this means that we can use information theoretic quantities to query their learning dynamics in a model-agnostic fashion [36].

Since information cannot be created, neural networks must obey the data processing inequality (DPI) [5]:

I(x; y) \geq I(h; y),    (3.7)

where any processing only ever destroys or retains information.

IB method for compression trade-offs

The IB method [45] presented the following functional:

\mathcal{L}[p(x, h)] = I(x; h) - \beta I(h; y),    (3.8)

with the positive Lagrange multiplier, β, acting as a trade-off parameter between the compression complexity rate,

R = I(x; h),    (3.9)

and the information preserved, I(h; y). This is minimised using a set of self-consistent equations for the mappings p(h|x), p(h), and p(y|h), with x ∈ \mathcal{X} (see Appendix A). The IB method is a direct iterative approach to find the approximate optimal h that compresses x for the prediction of y. In other words, it is a means of finding the minimum sufficient statistics of x that predicts y most generally.

Compression via noise for the optimal IB bound solution

By controlling the value of β in Equation 3.8, Tishby and Zaslavsky [46] constructed an experiment to show that neural networks do actually learn optimal representations. Shwartz-Ziv and Tishby [41] later claimed that the reason for this is that neural networks have two distinct phases of training, namely empirical error minimisation (fitting) and representation compression. The compression stage pushes the hidden representations toward an optimal state where information in the input is discarded when it is irrelevant to the target. The flexibility and architecture of the model being trained constrain the optimality of the representations learned.

That research [41] posited that the representation compression phase is the crucial component of neural network generalisation and is due to a phenomenon known as stochastic relaxation. This stochastic relaxation resembles diffusion, is characterised by gradients with a low signal-to-noise ratio (SNR), and thus mostly involves adding random noise to the weights of the neural network. In so doing, the entropy of the weights is maximised (but constrained by the empirical error related to the prediction, y|h). That, in turn, maximises the conditional entropy, H(x|h), where h is the hidden layer computed by the aforementioned weights. Considering Equation 3.3, we see that this is equivalent to minimising the MI between hidden layers and the input, since the entropy of the input remains constant. Stochastic relaxation can be tracked by monitoring the SNR of the weight updates

in a neural network [41, 36]. Consider some weights, W_{h_j}, for layer j that produce a hidden representation, h_j, from the activations immediately preceding these weights. The learning signal is quantified by the Frobenius norm of the mean (over a minibatch) of the gradients with respect to these weights:

m_j = \left\| \left\langle \frac{\partial L}{\partial W_{h_j}} \right\rangle \right\|_F;    (3.10)

the noise is quantified as the Frobenius norm of the standard deviation (over a minibatch) of the same gradients:

s_j = \left\| \mathrm{STD}\!\left( \frac{\partial L}{\partial W_{h_j}} \right) \right\|_F;    (3.11)

and

\mathrm{SNR}_j = m_j / s_j.    (3.12)
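A sketch of how these quantities could be computed in PyTorch (a slow but simple per-example loop; the model, loss function and parameter names are placeholders rather than the thesis's implementation):

```python
import torch

def weight_update_snr(model, loss_fn, x, y):
    """Per-layer SNR of the weight updates (Equations 3.10-3.12): the Frobenius
    norm of the minibatch mean of the per-example gradients, divided by the
    Frobenius norm of their standard deviation."""
    per_example = {n: [] for n, p in model.named_parameters() if p.requires_grad}
    for i in range(x.size(0)):                 # gradients one example at a time
        model.zero_grad()
        loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
        for n, p in model.named_parameters():
            if p.requires_grad:
                per_example[n].append(p.grad.detach().clone())
    snr = {}
    for n, grads in per_example.items():
        g = torch.stack(grads)                 # (batch, *param_shape)
        m = g.mean(dim=0).norm()               # ||<dL/dW>||_F
        s = g.std(dim=0).norm()                # ||STD(dL/dW)||_F
        snr[n] = (m / (s + 1e-12)).item()
    return snr
```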

What makes neural networks better than many other models at compressing the irrelevant features of the input space? The answer to this question may help explain why they are able to generalise well. According to the IB perspective, the layered structure of multiple consecutive hidden representations affects the stochastic relaxation process in that it shortens the time taken to compress information. The computational benefit is exponential in the number of layers, while the cost of forward and backward propagation only grows linearly. The findings by Shwartz-Ziv and Tishby [41] are perhaps seminal yet certainly under scrutiny and debate (see Section 3.3). Primarily, these are:

• Learning happens in two phases, namely empirical error minimisation and representation compression.

• Neural networks generalise well because they learn optimally compressed internal representations.

• Generalisation by noise (stochastic relaxation) is unique to neural networks trained using stochastic gradient descent.

• Error minimisation (fitting) happens rapidly while compression takes much longer; the primary benefit of depth is computational in that it shortens the compression time.

3.3 Is Compression Necessary for Generalisation?

Saxe et al. [36] constructed several experiments to explore the IB interpretation for generalisation in deep learning. That work endeavoured to demonstrate (1) how the choice of non-linearity with a binning methodology for MI computation may explain the compression evidenced by Shwartz-Ziv and Tishby [41]; (2) that generalisation may not require compression; (3) that stochastic relaxation is not the mechanism of compression; and (4) that a phase transition from error minimisation to compression is not clear from changes in the weight updates’ SNRs.

3.3.1 Tanh Non-linearity and Binning

The hyperbolic tangent function was the non-linearity of choice in the exploration of the IB interpretation of deep learning [41]:

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1.    (3.13)

Figure 3.1 shows the tanh activation function and the bins used to calculate MI. This is a saturating function: the activities tend to the values {−1, 1} as the absolute values of the inputs grow. Saxe et al. [36] argued that it is the combination of binning and a saturating non-linearity that caused the evidence of compression. This is because saturation becomes more prevalent in a network as it learns and a greater number of units map their input to fewer bins. This many-to-one mapping reduces the entropy of the hidden representation, which has a direct consequence on the MI. To test this hypothesis, a non-saturating non-linearity – the widely used rectified linear unit (ReLU: max(0, x)) – was applied and tested.

Figure 3.1: Hyperbolic tangent activation function and binning procedure (activity plotted against input over [−4, 4]). The white lines denote the regions within which the activities are binned to compute MI. As more units saturate, more activities fall within the lowest and highest bins.

ReLUs are unconstrained and thus require a different binning procedure: the maximum activity value was pre-determined by running the experiment to completion. One hundred bins were constructed over this range. Although compression was not evident in this experimental procedure, the maximum-value binning technique did not pay heed to maximum entropy binning [30] and was a source of contention regarding results.
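To make the binning methodology concrete, here is a minimal sketch of the discretisation-based estimate used in these toy settings (bin edges, bin counts, and the NumPy formulation are illustrative assumptions). Because the mapping from input to hidden layer is deterministic, I(x; h) reduces to the entropy of the discretised layer, H(h):

```python
import numpy as np

def binned_mi_nats(activations, n_bins=30, lo=-1.0, hi=1.0):
    """activations: (n_examples, n_units) hidden-layer outputs.
    Discretise each unit's activity into fixed bins, treat every unique row of
    bin indices as one discrete state of the layer, and return H(h) in nats,
    which equals I(x; h) when h is a deterministic function of x."""
    edges = np.linspace(lo, hi, n_bins + 1)
    codes = np.digitize(activations, edges)               # per-unit bin indices
    _, state = np.unique(codes, axis=0, return_inverse=True)
    p = np.bincount(state) / state.size                   # empirical p(h)
    return float(-(p * np.log(p)).sum())
```

For a saturating tanh layer the edges span [−1, 1]; for ReLU a maximum value must be chosen, which is exactly the point of contention described above.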

3.3.2 The Question of Compression

Is compression the vital component for neural network generalisation? Experiments were designed to test whether compression and generalisation are causally related. These are discussed next.

Generalisation without compression

Saxe et al. [36] showed that a deep linear network, where MI was directly computable in closed form, can generalise without compression. The toy example constructed to demonstrate this [36] may itself explain the lack of compression: the input space contained the same quantity of information throughout (it was a multivariate Gaussian).

Progressive separation and contraction

Jacobsen et al. [20] queried whether compression is necessary for generalisation by constructing a fully invertible CNN (up to the projection to the final layer) and training it on ImageNet [6]. Accuracy comparable to the state-of-the-art on the ImageNet classification task evidenced that generalisation can occur without compression, for an invertible network. They posited that a counter-explanation for generalisation is the progressive separation and contraction [31] of the input space, induced by stacked hidden layers.

Contraction, regarding neural networks, means a reduction in the input space such that intra-class samples have a greater regularity of representation. Separation, on the other hand, refers to the ease with which inter-class samples can be separated. Indeed, considering the standard classification task, neural networks are explicitly trained to map the input data space into a linearly separable vector of class probabilities. Experimental evidence aside, the argument that the advantage of depth is to progressively improve separation and contraction is intuitive and instructive, yet without a fundamental theoretical basis.

Although intentionally reversible networks [20] do not discard irrelevant components of the input space, we postulate that these instead progressively separate the irrelevant components from the relevant, allowing the final mapping (onto a classification vector) to easily discard (compress) this information. This suggestion allows for agreement between the IB perspective and progressive separation and contraction. Our suggested postulation will be tested in future work.

3.3.3 Two Stages of Learning and Stochastic Relaxation

Saxe et al. [36] showed that even neural networks that do not evidence compression (namely deep linear networks and networks with ReLU non-linearities) still exhibited two stages of learning. The diffusion-like process of stochastic relaxation was previously described as the mechanism through which neural networks compress [46, 41] and linked to low SNRs regarding weight updates. Although non-compressing networks also exhibited a switch from high to low SNRs in the weight updates, these results [36] are confounded because the computation of MI is questionable for ReLU activations owing to the chosen binning procedure [30].

3.3.4 What of the IB Interpretation?

The experiments carried out and conclusions drawn by Saxe et al. [36] are opposed to the main claims of the IB interpretation [41, 46, 45]. A debate on these topics unfolded on the OpenReview page of the refutation research [30]. The opinions and deductions thereon were not fully resolved. Nonetheless, these perspectives and experiments advance our knowledge. We do not offer fundamental theoretical conclusions regarding these issues. Instead, we endeavour to determine empirically, through the computation of MI using decoder models, whether the compression phenomenon is evident in a modern setting. Consider again the IB optimal bound and the evidence that neural networks with tanh non-linearities converge close to this bound [41]. It may be argued that the IB interpretation of compression for generalisation is sound, but also not immediately applicable because:

• For saturating non-linearities, the networks evidence compression directly through this saturation, yet do generalise well – the IB bound is tight.

• For non-saturating non-linearities or intentionally non-forgetting networks like reversible networks [20], compression is not immediately evident as measured by Saxe et al. [36] – the IB bound theorem is not yet applicable, particularly when using a maximum-value binning procedure (Section 3.3.1).

Our approach

It would seem that the experimental set-up in earlier research [41, 36], particularly with regard to how MI was measured, confounded whether compression occurs or is even necessary. The approach we take in this thesis is to compute the MI using a decoding model. For estimating the MI between images and hidden layers, we use a state-of-the-art autoregressive PixelCNN++ [33] decoding model (Section 4.1.2). We rephrase the following quantities as questions:

• I(x; h): how well can we predict the input data (images) given the activations of a hidden layer?

• I(y; h): how well can we predict the labels given a fixed set of activations computed from input images?

3.4 Further Related Information Theory Analyses

Measuring MI is difficult. Whether using a parametric method for an upper bound [22], monitoring neural networks that are designed so as to compute the MI [8] using heuristic methods, or using a proxy quantity for information entropy [51], these techniques all have disadvantages and application spaces. The most relevant work regarding MI tracking using decoding models also uses the PixelCNN++ architecture [29].

3.4.1 Inverting Supervised Representations

Concurrent to the work expressed in this thesis, Nash et al. [29] trained a number of conditional PixelCNN++ models to ‘invert’ representations learned by a simple CNN classifier. Using the MNIST [26] and CIFAR-10 [24] image datasets, they showed how a PixelCNN++ can be used as a tool to analyse the invariances and behaviour of hidden representations. The MI was also tracked using the PixelCNN++ models, revealing that CNNs do evidence compression behaviour. They did not find an initial data-fitting phase in all but the first convolutional layer; this is largely owing to the granularity of their analysis. Computing the MI prior to any learning, as we do in Chapters 5 and 6, may have solved this. They also went on to show the effect of (1) regularisation and (2) global pooling. Regularisation induces more compression and global pooling results in location-decoupled representations. Owing to the fact that the above-mentioned work is close to the research we are presenting in this thesis, a number of differences should be highlighted:

1. The models under scrutiny in our work (Section 4.3) use a ResNet architecture and BatchNorm, and are therefore an order of magnitude deeper (twenty-one convolutional layers versus two). The representations owing to the deeper convolutional layers show clearer invariances to intra-class differences (Chapters 5 and 6) because of the increased capacity. We also provide conditional samples at critical points in learning to illustrate what information these models learn to ignore.

2. We test both supervised (Chapter 5) and unsupervised (Chapter 6) training regimes and compare the differences by tracking MI.

3. We trained a larger number of PixelCNN++ models for a finer granularity view of the MI in early stages of learning.

4. A new and larger dataset – CINIC-10 – was compiled to improve this information theoretic analysis (see Section 4.4.1). PixelCNN++ models tend to overfit training data relatively early on; Nash et al. [29] pointed this out, noting that density estimation in higher dimensions is challenging. A more encompassing dataset partly alleviates this issue.

5. We also track the MI in the forward direction to interpret how well the hidden representations capture relevant information about labels.

Conclusion

In this chapter we explained how mutual information can be used as a tool to interpret learning by stochastic gradient descent in deep neural networks. The information bottleneck interpretation of deep learning was explained, with particular reference to how mutual information between input data and hidden representations decreases as learning progresses. This compression phenomenon was discussed as a questionable necessity for ‘generalisation by forgetting’: the ongoing debate in the machine learning community was outlined and our contribution was posited. Closely related work was also discussed. The following chapter describes the investigation framework developed for our research.

Chapter 4

Investigation Framework

This chapter introduces the framework we developed to investigate learning in neural networks in the context of realistic image data. The foremost quantity in our analysis is mutual information (MI, Equation 3.5). It is well defined but almost always intractable to calculate. Cases where MI is tractable are, for example, when the random variables are both Gaussian distributed, or where the variables can take on a limited number of discrete values.

An upper bound of the MI can be established by adding Gaussian noise to each dimension of the hidden representation and then applying a kernel density estimate approach. Kolchinsky et al. [22] computed an upper bound on the MI using a non-parametric density estimate. Unfortunately, non-parametric techniques do not scale well to high-dimensional input, and are therefore not suitable for our analysis. Further, the aforementioned upper bound is only a tight bound when the hidden representations cluster into well-separated classes [22].

Owing to the complex and high-dimensional nature of realistic images, an analytic solution for the MI between the activations of hidden layers and images (or targets) is not available. Therefore, we compute a lower bound using a decoding model. Our analysis effectively asks the question: can we use the activations of hidden layers in a neural network to predict the training data? We show how this is equivalent to estimating a lower bound on the MI (Section 4.1) and present models under which this can be estimated, for:

1. I(x; h): the MI between input images and some hidden layer’s activations (Section 4.1.1).

2. I(y; h): the MI between classification targets and some hidden layer’s activations (Section 4.1.2).

We track the MI in two models: a classifier (Chapter 5) and an autoencoder (Chapter 6). Section 4.3 details the particular models under scrutiny. We discuss in Section 4.4.1 the dataset, CINIC-10, compiled for our analysis.

4.1 MI Using a Model: a Lower Bound

Consider that the MI between an input random vector, x, and the corresponding hidden representation, h, with p_D as the true data distribution, can be expressed as:

I(x; h) = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x, h)}{p_D(x) p_D(h)} \right]
        = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)}{p_D(x)} \right]    (4.1)
        = \mathbb{E}_{p_D(x, h)} [\log p_D(x \mid h)] - \mathbb{E}_{p_D(x)} [\log p_D(x)]
        = \mathbb{E}_{p_D(x, h)} [\log p_D(x \mid h)] - C.

Since the entropy of the data is constant we need not estimate C. Here h is a deterministic mapping of x through a neural network and any randomness in h is purely owing to randomness in x. We introduce an auxiliary distribution, q(x|h), that will be approximated by models (Sections 4.1.1 and 4.1.2) in order to bound this estimate from below:

I(x; h) = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)\, q(x \mid h)}{q(x \mid h)} \right] - C
        = \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] + \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)}{q(x \mid h)} \right] - C    (4.2)
        = \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] + \mathbb{E}_{p_D(h)} [D_{KL}(p_D(x \mid h) \,\|\, q(x \mid h))] - C
        \geq \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] - C.

This follows because

\mathbb{E}_{p_D(h)} [D_{KL}(p_D(x \mid h) \,\|\, q(x \mid h))] \geq 0,    (4.3)

since D_{KL} \geq 0. We use sufficiently large data (of size N) for an unbiased Monte Carlo estimate,

\mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] \simeq \frac{1}{N} \sum_{x^{(i)}, h^{(i)}} \log q\big(x^{(i)} \mid h^{(i)}\big).    (4.4)

An analogous derivation can be used for I(y; h). We define the decoder models q(x|h) and q(y|h) in Sections 4.1.1 and 4.1.2, respectively. The KL-divergence between the true and estimated distributions determines the tightness of the bound (Section 4.2). Training the decoding models minimises the expected KL-divergence in Equation 4.3.
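In code, the Monte Carlo estimate in Equation 4.4 is simply an average of decoder log-likelihoods over held-out pairs (x, h). The sketch below assumes a hypothetical `decoder.log_prob(x, h)` interface returning per-example log-likelihoods in nats (for example from a conditional PixelCNN++); it is not the thesis's exact code.

```python
import torch

@torch.no_grad()
def mi_lower_bound_term(encoder, decoder, loader, device="cpu"):
    """Estimate E_{p_D(x,h)}[log q(x|h)] (Equation 4.4). Up to the constant
    C = E[log p_D(x)], this lower-bounds I(x; h)."""
    total, count = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        h = encoder(x)                    # deterministic hidden representation
        log_q = decoder.log_prob(x, h)    # assumed interface: (batch,) in nats
        total += log_q.sum().item()
        count += x.size(0)
    return total / count
```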

4.1.1 Forward Decoding for Label MI

Neural network classifiers attempt to maximise the log-likelihood of the target data, and in so doing directly maximise the MI between all hidden representations and targets [46]. To compute the MI between a given hidden representation and the targets at a point in model training we need:

1. The mapping from x to h to be fixed – any weights or other parameters associated with this mapping are frozen.

2. A neural network trained to maximise log q(y|h) – essentially a classifier that takes as input h. Note that this minimises the left hand side of Equation 4.3.

3. Enough data to train the model under scrutiny, to train the decoder model, and to evaluate the MI satisfactorily.

4. Calibration to ensure the model does not overfit, and hence produce a poor bound in Equation 4.2.

For example, consider a classifier with three hidden layers. We aim to compute the MI between the second hidden layer and the targets at some point in training. We freeze all weights that result in the second layer’s activations. The remaining weights are reinitialised and fitted, on a separate portion of the data, to estimate I(y; h) (sketched below). An entirely held out portion of the data is used to compute the MI lower bound estimate. This mitigates any overfitting or data-contamination. However, it also introduces a new challenge: we require three splits of the data, each of which should be as large as possible. Section 4.4.1 describes this challenge in more detail and offers a solution – the compilation of a new dataset.
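A sketch of the freeze-then-decode step (the encoder slicing, decoder architecture and dimensions are illustrative placeholders, not the exact models used in the thesis):

```python
import torch.nn as nn

def make_forward_decoder(encoder_up_to_h, h_dim, n_classes=10):
    """Freeze the mapping x -> h, then return a fresh, trainable decoder that
    models q(y|h); it is fitted on a separate data split and evaluated on a
    held-out split to estimate the forward MI lower bound."""
    for p in encoder_up_to_h.parameters():
        p.requires_grad = False          # freeze the mapping x -> h
    encoder_up_to_h.eval()               # also fix BatchNorm statistics
    decoder = nn.Sequential(
        nn.Flatten(),
        nn.Linear(h_dim, 256), nn.ReLU(),
        nn.Linear(256, n_classes),       # logits for q(y|h)
    )
    return decoder
```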

Recalibrating probabilities

Although neural network classifiers are powerful function approximators that can provide accurate target predictions, they are also known to be poorly calibrated [14], partly due to overfitting. The confidence of predictions matches poorly to the expected accuracy, and the log-likelihood is overestimated. A modern neural network predicts with much greater certainty than necessary. Poor calibration does not affect accuracy, but rather the log-likelihood. This is problematic since we use a (possibly poorly calibrated) neural network as a forward decoding model to estimate MI. Therefore, confidence recalibration is necessary. Guo et al. [14] dealt with this mismatch between accuracy and log-likelihood using a method called temperature scaling, which we employ here. Temperature scaling adjusts the temperature of the softmax function (Equation 4.5) used to compute the output probability estimates. A higher temperature results in a ‘softer’ output and a higher entropy over the prediction. The temperature-scaled softmax, with temperature T, used to compute the probability of element j in a multi-class setting is:

\sigma_T(\mathbf{z})_j = \frac{e^{z_j / T}}{\sum_{k=1}^{K} e^{z_k / T}}.    (4.5)

Recalibration of a converged model minimises the negative log-likelihood on held-out data, by adjusting the softmax temperature alone.
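A sketch of the recalibration step (the optimiser choice, step count and learning rate are assumptions; Guo et al. describe the same single-parameter objective, typically optimised with L-BFGS):

```python
import torch

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Find the single temperature T that minimises the negative log-likelihood
    of held-out logits (Equation 4.5). logits: (n, K); labels: (n,) class ids."""
    log_t = torch.zeros(1, requires_grad=True)   # parameterise T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)  # softmax(z / T) inside the loss
        loss.backward()
        opt.step()
    return log_t.exp().item()
```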

Single layer linear decoding

Neural network classifiers are trained explicitly to yield a (final) layer that is linearly separable: the final set of weights maps the preceding activations to a one-hot encoded class prediction. Does this linear separability improve with depth and/or over training time? Alain and Bengio [1] used linear classifier probes to query how discriminating internal features were, but did not use these to specifically analyse learning dynamics. We adopt and extend linear probing as an alternative approach to forward decoding: a weight matrix is trained for each h decoding run to map it to a class prediction, asking how easily a learned representation can be linearly separated (see Section 5.2.2 for these results; a minimal probe is sketched below). We describe the task of measuring the inverse MI in the next section.
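The probe itself is just a single linear map trained on frozen activations (dimensions here are illustrative assumptions):

```python
import torch.nn as nn

class LinearProbe(nn.Module):
    """A single weight matrix from a (frozen) hidden representation h to class
    logits; its accuracy measures how linearly separable h is."""
    def __init__(self, h_dim, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(h_dim, n_classes)

    def forward(self, h):
        return self.fc(h.flatten(start_dim=1))
```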

4.1.2 Inverse Decoding for Input MI

The high-dimensional and structurally complex nature of image data makes measuring the MI between inputs and hidden layers more challenging. We require (1) a modelling approach that sufficiently captures the relationship between inputs and hidden representations, and (2) direct access to the log-likelihood. Estimating these sorts of image data distributions is an ongoing topic of research in the machine learning community. The current state-of-the-art can be broadly sorted into implicit and explicit distribution estimators. Regarding implicit estimators, generative adversarial networks [12] are designed to generate data that can 'fool' a discriminator network tasked with discerning real from fake data. In this manner, the generated distribution matches the true data distribution without any explicit estimate of the log-likelihood. Unfortunately, regardless of their evidenced success at modelling image distributions, the lack of direct access to the data log-likelihood means these cannot be used in our analyses. Explicit estimators, on the other hand, can.

Autoregressive models

Autoregressive models estimate explicitly the data distribution. They leverage the fact that the joint distribution of an image can be decomposed as a product of conditional distributions:

\[
p(\mathbf{x} \mid \mathbf{h}) = \prod_{i=1}^{n^2} p(x_i \mid \mathbf{h}, x_1, \ldots, x_{i-1}), \tag{4.6}
\]

where, in the setting of this thesis, x is an image of n² pixels, each x_i is a pixel in an image, and the decomposition is in raster order (left to right, top to bottom). The conditioning variable h is included for completeness. PixelCNN [47] is a state-of-the-art formulation that effectively models image data distributions using this autoregressive decomposition. A PixelCNN model can be trained with or without a conditioning variable, and can also be used to generate images. On that note, images are generated one pixel per model evaluation, starting at the pixel in the top left corner of an image, and continuing in raster order. The generated pixels are fed back into the model for successive generations, thus allowing each pixel to depend in a highly non-linear and multi-modal manner on previously generated pixels [47].
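The generation procedure can be sketched as the loop below. Here `model(x, h)` is a stand-in for a conditional autoregressive model returning a per-pixel predictive distribution, so the `.sample()` call is schematic rather than the PixelCNN++ API.

```python
import torch

@torch.no_grad()
def sample_raster(model, h, image_shape=(3, 32, 32)):
    """Generate one image pixel-by-pixel in raster order, conditioned on h.
    `model(x, h)` is assumed to return a distribution over every pixel given
    the pixels generated so far (a schematic stand-in, not the PixelCNN++ API)."""
    channels, height, width = image_shape
    x = torch.zeros(1, channels, height, width)
    for i in range(height):                              # top to bottom
        for j in range(width):                           # left to right
            out = model(x, h)                            # re-evaluate on pixels generated so far
            x[:, :, i, j] = out.sample()[:, :, i, j]     # keep only the next pixel
    return x
```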

PixelCNN++

PixelCNN estimates each colour channel of each pixel using a 256-way softmax function. PixelCNN++ [33] extends and improves PixelCNN by estimating the colour-space for a single pixel as a K-way (with K = 10 in the original usage) discretized mixture of logistic sigmoids:

\[
p(x_i \mid \pi_i, \mu_i, s_i) = \sum_{k=1}^{K} \pi_{ik} \left[ \sigma\!\left(\frac{x_i + 0.5 - \mu_{ik}}{s_{ik}}\right) - \sigma\!\left(\frac{x_i - 0.5 - \mu_{ik}}{s_{ik}}\right) \right], \tag{4.7}
\]

where π_{ik} is the k-th logistic sigmoid mixture coefficient for pixel i, and µ_{ik} and s_{ik} are the corresponding mean and scale of the sigmoid, σ(·). The discretization is accomplished by binning the network's output within ±0.5. The colour channels are coupled by a simple factorisation into three components (red, green, and blue). First, the red channel is predicted using Equation 4.7. Next, the green channel is predicted in the same way, but the means of the mixture components, µ_i, are allowed to depend on the value of the red pixel. The blue channel depends on both red and green channels in this way. Salimans et al. [33] argued that assuming a latent continuous colour intensity and modelling it with a simple continuous distribution (Equation 4.7) results in more condensed gradient propagation, and a memory-efficient predictive distribution for x. Other improvements included down-sampling to capture non-local dependencies and additional skip connections to speed up training.
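A minimal sketch of Equation 4.7 for a single sub-pixel follows. It assumes raw parameters (logit mixture weights, means, log-scales) for x in [0, 255], and omits the edge-case handling at pixel values 0 and 255 and the rescaling to [-1, 1] used in the released PixelCNN++ implementation.

```python
import torch

def discretized_logistic_likelihood(x, logit_pi, mu, log_s):
    """p(x_i | pi_i, mu_i, s_i) as in Equation 4.7, for pixel values in [0, 255].
    All parameter tensors carry a trailing dimension of K mixture components;
    x is broadcast against it."""
    x = x.unsqueeze(-1)                                  # (..., 1) against (..., K)
    inv_s = torch.exp(-log_s)
    cdf_plus = torch.sigmoid((x + 0.5 - mu) * inv_s)     # CDF at the upper bin edge
    cdf_minus = torch.sigmoid((x - 0.5 - mu) * inv_s)    # CDF at the lower bin edge
    bin_prob = (cdf_plus - cdf_minus).clamp(min=1e-12)   # probability mass of the bin
    pi = torch.softmax(logit_pi, dim=-1)                 # mixture coefficients pi_{ik}
    return (pi * bin_prob).sum(dim=-1)                   # mix over the K components
```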

Conditioning

The conditioning variable is added to each gated convolution residual block, of which there are six per each of the five hyper-layers in PixelCNN++. The gated convolution residual block structure was shown empirically to improve results. The activations of a gated residual block are split into two equal parts and one half is processed through a sigmoid function to produce a mask of values in the range [0, 1]. This is element-wise multiplied with the other half of the activations as the 'gate'. As for the conditioning variable, it is conventionally a one-hot encoded class label vector that is added to the activations of each gated residual block. Considering the layers chosen for scrutiny in this thesis (Figure 4.1), most of the conditioning variables are three-dimensional: two spatial dimensions and a channel dimension. Additionally, we must account for the down-sampling used in PixelCNN++. Therefore, there are four possible transformations of h before it can be integrated into the PixelCNN++ model:

1. The conditioning variable is larger (regarding spatial width and height) than the activations to which it needs to be added. The conditioning variable is down-sampled using a convolution with a stride of two and (if necessary) average pooling with a kernel size of two. The filter width is matched in this same convolution.

2. The conditioning variable is smaller than the activations. A sub-pixel shuffle convolution [39, 40] is used for up-sampling. The sub-pixel shuffle is an alternative to deconvolution or nearest neighbour up-sampling that allows the model to learn the correct up-sampling without unnecessary padding. A non-strided convolution with a kernel of one matches the filter width.

3. The conditioning variable is the same size as the activations. A non-strided convolution with a kernel of one matches the filter width.

4. The conditioning variable is, instead, a vector – h4 in Figure 4.1. The dot product of this vector and an appropriately sized weight matrix is taken to match the activations. (A sketch of these transformations follows below.)
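The sketch below covers cases 1–3, assuming square feature maps and power-of-two size ratios; the layer choices and kernel sizes are illustrative rather than the exact PixelCNN++ configuration, and case 4 would use a learned linear map (e.g. `nn.Linear`) instead.

```python
import torch.nn as nn

def match_conditioning(h_size, h_channels, act_size, act_channels):
    """Build a module resizing a spatial conditioning variable h so it can be
    added to activations of shape (act_channels, act_size, act_size)."""
    if h_size > act_size:            # case 1: down-sample with a stride-2 conv (+ pooling)
        layers = [nn.Conv2d(h_channels, act_channels, kernel_size=3, stride=2, padding=1)]
        size = h_size // 2
        while size > act_size:       # further halving if the gap is larger than 2x
            layers.append(nn.AvgPool2d(kernel_size=2))
            size //= 2
        return nn.Sequential(*layers)
    if h_size < act_size:            # case 2: sub-pixel shuffle up-sampling
        r = act_size // h_size
        return nn.Sequential(
            nn.Conv2d(h_channels, act_channels * r * r, kernel_size=1),
            nn.PixelShuffle(r))
    # case 3: same spatial size; a 1x1 convolution matches the filter width only
    return nn.Conv2d(h_channels, act_channels, kernel_size=1)
```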

If, for example, h = h2 is a (16 × 16) × 32 (two-dimensional with 32 filters, the second hyper-layer in Figure 4.1) hidden representation, the first three aforementioned transformations would be in effect because the configuration of PixelCNN++ [33] means that there are activations with spatial resolutions of (32 × 32), (16 × 16), and (8 × 8), to which the conditioning variable must be added. The following section briefly describes the main issue with estimating a lower bound on the MI.

4.2 Tightness of the Bound

The quality of the decoding model influences the tightness of the MI lower bound according to Equation 4.3. If, for example, the layers in neural networks become more difficult to decode by moving onto a highly non-linear manifold, the MI also becomes more difficult to estimate. We acknowledge that this has the potential to affect our results negatively because it is difficult to be certain of the quality of the bound. The PixelCNN++ model can generate images conditioned on hidden layers, and in so doing gives us an intuitive grasp of the MI and bolsters the bound estimation. Just because information is present in a representation does not mean that it is readily accessible. Looking at model-based MI bounds gives us an intuitive sense of the extractable information in a given layer. Although we have not developed a formal procedure to determine the bound quality in this thesis, we will endeavour to do so in future work. The following section presents the models under scrutiny.

4.3 Models Under Scrutiny

We have chosen a residual network (ResNet) architecture for the neural networks under scrutiny in this research. Further, we define four hidden representations, hj with j ∈ {1, . . . , 4}, as the activations of layers that can be extracted for MI tracking. Figure 4.1 visualises the 'encoder' architecture and shows the locations of the hj layers. It is identical to what was used for CIFAR-10 classification in the original ResNet analysis [15]. The final layer in Figure 4.1 is decoded for either:

• Classification: where h4 is mapped to a ten-way class prediction and optimized by maximising the log-likelihood of this prediction with respect to class label data.

• Autoencoding: where h4 is processed to reconstruct x. A series of layers, that are the inverse of the encoder structure, act as the decoder portion. Any relevant spatial up-sampling uses sub-pixel shuffling [39, 40].

There are in this arrangement, therefore, two training paradigms (supervised classification versus unsupervised autoencoding) available for study, both of which share encoder structure for simplicity and comparable analyses.

4.3.1 Training and Freezing

Both the classifier and autoencoder weights were optimised using SGD with a learning rate of 0.1 and cosine annealing to zero over 200 epochs, a momentum factor of 0.9, and an L2 weight decay factor of 0.0005. We used the leaky ReLU non-linearity. Our implementation was written in PyTorch [32]. The hyper-parameters for the PixelCNN++ decoder models were identical to the original paper [33].
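A minimal sketch of this optimisation recipe (classification case) is given below; the model and data loader are placeholders, and this is an illustration of the stated hyper-parameters rather than the thesis implementation.

```python
import torch

def train_encoder(model, loader, epochs=200):
    """SGD with lr 0.1, cosine annealing to zero, momentum 0.9, and weight
    decay 5e-4, as stated above (a sketch; MSE would replace the loss for
    the autoencoder case)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=0.0)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                 # anneal the learning rate once per epoch
    return model
```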

[Figure 4.1 diagram: the encoder stacks three hyper-layers of 3×3 convolutions – seven stride-1 convolutions with 16 filters (input x: 32×32×3, output h1: 32×32×16), a stride-2 convolution followed by five stride-1 convolutions with 32 filters (h2: 16×16×32), and a stride-2 convolution followed by five stride-1 convolutions with 64 filters (h3: 8×8×64) – ending in 4×4 average pooling to give h4: 256, which feeds either the classification or autoencoder decode head.]

Figure 4.1: The ResNet model architecture (encoder) used to generate hidden activations in either a classification or autoencoder set-up. Each convolution block (inner central blocks) consists of: convolution → BatchNorm → leaky ReLU non-linearity. Additional convolution → BatchNorm blocks are used at necessary skip connections ('/2' blocks). The hidden representations (h1, . . . , h4) are taken at the ends of 'hyper layers', which are the three grouped and separately coloured series of blocks. This encoder is a 21-layer ResNet architecture, accounting for the convolutions in both skip connections.

PixelCNN++ training was stopped early at 250 epochs, owing to time constraints, while the forward decoding models were all trained for 200 epochs to match the encoder training times. Appendix B details the training of the classifier, autoencoder, forward decoding models, and PixelCNN++ inverse decoding models. The weights defining the scrutinized models were saved at intervals throughout training and restored to produce the hj under analysis. Forty-eight decoder models would be required if, for example, six points throughout training were chosen for inspection of all four hidden representations. Half of these would be PixelCNN++ models for inverse decoding. Each PixelCNN++ model takes approximately fifteen days to train on an NVIDIA 1080 Ti GPU. Therefore, specific points in encoder training were carefully selected (details in Section 5.1) and not all hj representations were inversely decoded.

4.4 Data

The analysis in this thesis involves computing MI. The degree to which the employed dataset represents its true distribution influences the accuracy of any quantities that require an unbiased estimate (Equation 4.4). Therefore, we require the following:

1. The data used to learn the decoding models should be extensive enough to provide good lower bound estimates (Sections 4.1.1 and 4.1.2);

2. The data used to evaluate MI should contain as many items as possible for a meaningful measure (see Equation 4.4); and

3. The training data should be encompassing enough to prevent overfitting for the models under scrutiny as well as the decoding models.

Three way data split

These requirements prompt a three way split of the data into the following:

1. Encoding, for training the models under scrutiny;

2. Decoding, for training the models under which MI is computed; and

3. Evaluation, to provide unbiased held-out estimates for the quantities computed under the decoding models.

CIFAR-10 [24] contains a total of 60,000 (32 × 32) colour pixel images (50,000 in the training set and 10,000 in the test set) collected into ten classes. A three-way split means that there are only 20,000 data items per split. In our initial experiments we found that this was not enough to meet the aforementioned criteria. Particularly, the PixelCNN++ models evidenced overfitting after approximately 25 epochs. The compilation of ImageNet for the large scale visual recognition challenge [6] contains many more images (1000 images per each of the 1000 classes, of various sizes and usually cropped to (256 × 256) pixels). Unfortunately, the task difficulty and computational burden associated with ImageNet are too great.

4.4.1 CINIC-10: CINIC-10 is Not Imagenet or CIFAR-10

A natural progression, therefore, is to combine elements of ImageNet into CIFAR-10. The ImageNet project, as opposed to the widely used collated dataset [6], is a database of labelled images. Images are labelled according to the WordNet hierarchy [28] and are mapped to synonym sets (synsets). Each synset ('dog, domestic dog, Canis familiaris', for example) is a leaf node that has a number of images directly associated with it. A synset can have children (various dog breeds, to extend the example). We manually selected groups of synsets that fell under the classes of CIFAR-10. The minimum number of images per class was computed and items from ImageNet were randomly selected (should there be more available than this minimum number) and down-sampled to create CINIC-10: CINIC-10 Is Not ImageNet or CIFAR-10. Full details of the collection process and a data analysis can be found in Appendix C. CINIC-10 contains 27,000 images per class and the same classes as CIFAR-10: plane, car, bird, cat, deer, dog, frog, horse, ship, and truck. CINIC-10 contains 4.5× as many data items as CIFAR-10, totalling 270,000. This provides a better representation of the realistic images in the standard CIFAR-10 image classification task, while not extending the computation- or task-related burdens as extensively as ImageNet would. We have not manually verified every data item in CINIC-10, but have run baseline experiments using modern architectures to compare expected performance (Appendix C). Moreover, experiments showed much improved behaviour regarding overfitting. We cannot ensure that the items gathered from ImageNet are drawn from the same distribution as CIFAR-10, although inspecting the distribution of pixel intensities shows that these are close (Appendix C). Therefore, we ensured that there was an equal split of CIFAR-10 between the encoder, decoder, and evaluation dataset splits. Further, all CIFAR-10 test data is in the evaluation dataset.
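For reference, the three splits could be loaded as standard image folders; the directory names below are assumptions about an on-disk layout for illustration, not a prescribed structure of the released dataset.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Sketch of loading the three CINIC-10 splits as ImageFolder datasets.
tfm = transforms.ToTensor()
splits = {name: datasets.ImageFolder(f"CINIC-10/{name}", transform=tfm)
          for name in ("encode", "decode", "evaluate")}
loaders = {name: DataLoader(ds, batch_size=64, shuffle=(name != "evaluate"))
           for name, ds in splits.items()}
```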

Conclusion

This chapter discussed the investigation framework used to undertake the research in this thesis. A lower bound on the mutual information was presented and the models with which the bound is computed were discussed. We presented the encoder network architecture used to create hidden representations for analysis. The encoder architecture is used with both classifier and autoencoder training regimes in Chapters 5 and 6, respectively. Finally, we explained the justification, compilation, and proposed use of CINIC-10, an extension of CIFAR-10.

Chapter 5

Experiment One: Classifier MI Tracking

This chapter gives the results obtained by applying the framework discussed in Chapter 4 to hidden representations induced by learning to classify images. Application to an autoencoder set-up follows in Chapter 6. Tracking the MI in convolutional models required a learned decoding model for each query point. Developing a fine-grained and detailed picture of learning progression incurs a substantial computational cost, partly because there are a number of layers that could be analysed in the encoder model architecture (see Figure 4.1). In the classifier training regime, encoding hidden layers aims to maximise the information about the targets by maximising the log-likelihood of the data. We show that the information about the targets increased in all hidden layers as training progressed (Section 5.2). We observed two phases of learning (Section 5.1) in the classifier training regime, by assessing the MI between hidden layers and input images. I(x; hj) first briefly increased and then decreased for the remainder of training. The increase in MI – the fitting stage of training – took longer for earlier layers. The observed two-stage progression confirms some of the results found by Shwartz-Ziv and Tishby [41] regarding compression in neural networks, even when using realistic image data and a modern network architecture with non-saturating non-linearities. Contrariwise, we did not find an obvious causal link between compression and stochastic relaxation by analysing the signal-to-noise ratios of weight updates during training (Section 5.1.1).

Set-up

A ResNet classifier with 22 layers (an encoder architecture – Figure 4.1 – and a final fully-connected layer for output predictions) was trained on the encoder data split of CINIC-10 (Section 4.4.1) for 200 epochs. The weights were saved throughout training and restored to produce the activations that are h1, h2, h3, and h4. Further details can be found in Appendix B. The hidden representations were decoded using the decoder split of CINIC-10 to produce lower bound estimates for MI tracking. A lower bound on the MI between input images and hidden layers requires inverse decoding (Section 5.1); a lower bound between labels and hidden layers requires forward decoding (Section 5.2).

Each I(x; hj) or I(y; hj) assessment is done on the evaluation data split of CINIC-10.

5.1 Inverse Decoding: Information About Inputs

The MI between input images and hidden layers was tracked by decoding h2, h3, and h4 at 0, 1, 5, 10, 15, 100, and 200 epochs. The MI curves are the plots (Figure 5.1) created by evaluating E_{pD(x,h)}[log q(x | h)], using the evaluation data split, over the selected classifier training epochs. Relative information gain (from initialisation to peak MI) and relative information compression are listed in Table 5.1 as ∆f and ∆c, respectively.
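Concretely, each plotted point is an average of conditional log-likelihoods over the evaluation split; up to the constant entropy H(x), this is the reported lower bound. The sketch below assumes a frozen encoder and a decoder exposing a `log_prob`-style method, which is a stand-in for the conditional PixelCNN++ likelihood rather than its actual API.

```python
import torch

def mi_lower_bound(encoder_to_h, decoder, eval_loader):
    """Estimate E_{pD(x,h)}[log q(x | h)] on the evaluation split (Equation 4.4).
    `decoder.log_prob(x, h)` is an assumed interface returning per-image
    conditional log-likelihoods in nats."""
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in eval_loader:
            h = encoder_to_h(x)                              # frozen hidden representation
            total += decoder.log_prob(x, h).sum().item()     # sum of log q(x | h)
            count += x.size(0)
    return total / count                                     # bound up to the constant H(x)
```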

Table 5.1: Relative information gain and compression for the hidden representations in a classifier training regime.

hj    Initial LL    Peak LL    Final LL    Fitting, ∆f (nats)    Compression, ∆c (nats)
h2    -7240         -6079      -6346       1162                  267
h3    -7510         -7459      -7579       41                    120
h4    -7799         -7633      -7701       166                   68

Information compression was observed at all hidden layers. In the context of training deep neural networks for image classification, hidden representations formed by layered processing in a ResNet CNN architecture, using non-saturating non-linearities, exhibited a reduction in shared information with input images after initially maximally fitting. These findings do not specifically confirm the claim that generalisation occurs because of this compression [46, 41], or that compression occurs because of stochastic relaxation, but we can confirm that compression does occur in this real-world data setting.

[Figure 5.1 plot: I(x; hj) lower bound (nats) against epochs of classifier training on a log scale, with curves for E_{pD(x,hj)}[log q(x | hj)], j = 2, 3, 4, annotated with ∆c = 267 nats (h2), 120 nats (h3), and 68 nats (h4).]

Figure 5.1: Mutual information curves: lower bound as per Equation 4.4 for classifier training. All representations exhibited two stages of learning: fitting followed by compression. The relative compression is the difference between the peak fitting and final values, listed as ∆c in nats on the right. The most compression relative to the starting MI happened at h3, which is the layer immediately prior to average pooling. The rate of compression increased as learning progressed. Note that not all evaluation points could be carried out for each hj owing to time constraints.

The onset of compression is not consistent between layers; deeper layers began compressing earlier. Compression started before a single epoch of training for h4, before five epochs for h3, and approximately before ten epochs for h2. Moreover, earlier (shallower) layers exhibited more compression. Earlier layers also have higher total capacity. More control experiments need to be run in order to decipher which properties (capacity versus position in the neural network) resulted in greater compression.

Compression at h2 was over twice the compression observed at h3: 267 nats versus 120 nats, respectively. Furthermore, even though h4 is merely h3 followed by a (4 × 4) average pooling operation, comparison of their MI curves (Figure 5.1) is useful. The relative compression for h3 is higher than for h4 (see Table 5.1), showing that compression at h3 is not forcing the same level of compression at h4.

5.1.1 Compression Through Stochastic Relaxation?

Shwartz-Ziv and Tishby [41] claimed that the mechanism behind generalisation in neural networks is stochastic relaxation: a low signal-to-noise ratio (SNR) in the SGD weight updates causes noise to dominate the learning, which has a diffusion-like effect on the weights. Empirical error minimisation (Section 3.2) was claimed to correspond to a high SNR as networks learn to fit the data in the early stages of learning, while compression was claimed to correspond to learning where the SNR is low, allowing the stochasticity owing to differences within minibatches to dominate. Saxe et al. [36] also observed a drop in SNRs with ReLU non-linearities, but found no causal link to generalisation. We analysed the SNR characteristics (see Section 3.2) throughout classifier training to uncover any agreement with the point at which the classifier enters its compression stage. The classifier model in this thesis did not exhibit a phase-shift from high to low SNRs that directly correlated with the onset of compression. Figure 5.2 shows the relevant gradient signal statistics (Equations 3.10, 3.11, and 3.12). The modern architecture and optimisation choices used in this thesis – a ResNet using batch normalisation, weight decay, and a decaying learning rate – mean that a direct comparison with toy examples and small, standard neural networks [41, 36] (i.e., not convolutional) is troublesome. Moreover, the signal statistics are notably more noisy when learning using realistic image data, and were median filtered prior to analysis. There was, nonetheless, no obvious direct connection between an SNR phase-shift and the onset of compression.
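As a rough illustration of how such statistics might be gathered, the sketch below treats Equations 3.10–3.12 as the norms of the mean and standard deviation of per-minibatch gradients for a layer's weights, and their ratio; this is an interpretation for illustration rather than a quotation of the thesis code.

```python
import torch

def gradient_snr(model, loader, loss_fn, layer_weight):
    """Collect per-minibatch gradients for one layer's weights over an epoch
    and return the SNR and the weight norm ||W|| (for scale reference)."""
    grads = []
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(layer_weight.grad.detach().clone())
    g = torch.stack(grads)                   # (num_batches, *weight_shape)
    m = g.mean(dim=0).norm()                 # ||mean gradient||: the signal
    s = g.std(dim=0).norm()                  # ||std of gradients||: the noise
    return (m / s).item(), layer_weight.detach().norm().item()
```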

[Figure 5.2 plots, panels (a)–(d): normalised gradient signal statistics (mean mj, standard deviation sj, and the weight norm ∥Whj∥ for scale) against epochs of classifier training, for the weights producing h1, h2, h3, and the prediction layer, with the observed compression start marked; panel (e): smoothed SNR of (a)–(d) and the mean SNR.]

Figure 5.2: Normalised gradient signal throughout classifier training. ∥Whj∥ denotes the L2 norm of the weights, for scale reference. The SNR (Equation 3.12) is the ratio between the mean gradient signal (Equation 3.10) and the standard deviation thereof (Equation 3.11). The mean SNR (e) is the average SNR over all signals, (a)–(d). The vertical lines indicate the observed approximate start of compression (determined from Figure 5.1), which does not correspond to a clear point in the SNR statistics, but does correspond to a similar signal value in all cases (≈ 1). We did not track MI for h1 – hence no observed compression start point.

Figure 5.2 (e) shows the mean SNR across all layers. The SNR is relatively consistent between layers, which means that this loss signal characteristic is being propagated effectively through the network. Krähenbühl et al. [23] explained and justified the importance of learning signal consistency and its propagation through layers to avoid vanishing or exploding gradients. BatchNorm improves gradient propagation [19] by reducing the dependence of gradients on the scale of parameters, and allowed for a consistency of learning signal throughout the layers of the classifier network. It may be the case that state-of-the-art techniques such as batch normalisation confound comparisons with earlier work [41, 36], but there are similarities worth considering:

• The SNR decreased as learning progressed. For consistency with earlier works, the signal statistics of Figure 5.2 were plotted on a linear-log scale to show this decrease as it is not easily noted using log-log scales as in the earlier work [41, 36].

• There was a consistency regarding the value of the gradient signal (mj ≈ 1, in this case) when compression began (estimated from Figure 5.1). Following this, the SNR decreases over time.

5.1.2 Conditional Samples

When Tishby et al. [45] first introduced the IB method, they noted that the original information theoretic formulations [37] were more concerned with the transmission of information, rather than the relevance or meaning of the transmitted information to the recipient. In fact, it is this notion of information relevance that is at the core of the IB interpretation of learning. A beneficial side-effect of using PixelCNN++ for inverse decoding hidden layers is that image samples can be conditionally generated. In this way, we are able to directly query the quality of the information in the hj layers in an intuitive fashion. Figures 5.3, 5.4, and 5.5 show generated image samples conditioned on the activations h2^(i), h3^(i), and h4^(i) (when processing images x^(i), where i refers to a sample index) in the classifier, respectively, at intervals throughout classifier training.

The samples generated conditioned on h2 in Figure 5.3 are all very similar – after ≈ 10 epochs of fitting – despite localised hue variations and global colour differences. Evidently these class-agnostic characteristics are readily compressed in this case. However, the sharp colour deviations in Figure 5.3(e) may instead be owing to issues with the PixelCNN++ model.

[Figure 5.3 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.3: Samples generated using PixelCNN++, conditioned on h2 in the classifier training set-up. The original images processed for h2 are shown in (a). Ten epochs is close to the peak of the 'fitting' stage; 200 epochs is the end of learning. Although the samples from different stages of classifier training are all structurally similar, the samples from later stages exhibit more hue and background deviations. h2 showed the most absolute compression according to Figure 5.1.

[Figure 5.4 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.4: Samples generated using PixelCNN++, conditioned on h3 in the classifier training set-up. The original images processed for h3 are shown in (a). Ten epochs is close to the peak of the fitting stage, while 200 epochs is the end of learning. Unnecessary features such as background colour are preserved at 10 epochs of training; the diversity of samples is greater at 200 epochs. I(h3; x) is lower at 200 epochs compared to no learning (Figure 5.1), but the quality of preserved information is better.

[Figure 5.5 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.5: Samples generated using PixelCNN++, conditioned on h4 in the classifier training set-up. The original images processed for h4 are shown in (a). One epoch is close to the peak of the 'fitting' stage, while 200 epochs is the end of learning. The images generated from converged representations look more realistic, yet are still diverse.

The difference between the hidden layers used to generate images for Figures 5.4 and 5.5 is entirely because h4 has 16× less capacity following a (4 × 4) average pooling operation, resulting in less consistent images when compared to those generated conditioned on h3. The pooling does, however, improve translation and scale invariance. For translation invariance, consider the relative positions of key image features (the position of the bird's head in row six, for example): these, when present, always correspond to the same spatial location in Figure 5.4, while the location varies more in Figure 5.5. For scale invariance, by an extension of the same argument as before, observe the differences in the sizes of generated horses in rows fifteen and sixteen in Figure 5.5, and the scale consistency in Figure 5.4. Similar findings were made by Nash et al. [29]. The images generated for Figures 5.4 and 5.5 also evidence that PixelCNN++ is not a faultless image model: the generated samples do not always look realistic. For further reference, Appendix B contains images generated using an unconditional PixelCNN++ trained on the decoder split of CINIC-10. It is, therefore, impossible to tell whether the failures in image generation are because the conditional PixelCNN++ models do a poor job of rendering believably realistic images from the information in conditioning variables, or whether these representations lack the necessary information because of compression. The cars in rows three and four of Figure 5.5 are examples of poorly generated samples. Section 4.2 comments further on the tightness of the estimated lower bound. However, the imperfect quality of the decoding models is consistent between representations in early and late stages of classifier training. Hence, the results in Figure 5.1 are uniformly affected. There is also irrelevant information kept within the representations – like the background of trees in the final row of Figure 5.5 – that is not class-specific and could likely be discarded. Incorporating an attention mechanism into the classifier model [48] may prove to be a suitable way of actively diminishing unnecessary information. Moreover, the fact that attention mechanisms improve performance, through a learned and forced focus on relevant portions of the input, bolsters the idea of generalisation by compression [46].

5.2 Forward Decoding: Information about Targets

The hidden layers h1,..., h3 were decoded by freezing all weights leading up to each layer. We do not need to decode h4. Decoding h4 is identical to decoding h3 in our framework because the only difference between them is a pooling operation. The remaining weights of the encoder portion of the model were retrained for classification on decoder data, and the lower bound on I(y; h) was estimated on the evaluation data split. The resultant estimates were repeated at numerous intervals throughout the training of the classifier under scrutiny, allowing us to track the MI as learning progressed. Figure 5.6 shows the MI as a function of epochs of classifier training.

[Figure 5.6 plot: I(y; hj) lower bound against epochs of classifier training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3, and E_{pD(y,x)}[log q(y | x)].]

Figure 5.6: I(y; hj) information curves: lower bounds as per Equation 4.4 for classifier training. The orange line corresponds directly to the negative loss of the classifier training as it represents the expected log-likelihood of the data under the classifier model, q (and where h = h0 = x). Where these estimates are taken using the encoder data (left), the data processing inequality seems to be violated in that I(y; h2) < I(y; h3), except in as much as the tightness of bounds differ.

The less processing involved to create a hidden representation – i.e., 'shallower' layers closer to x – the earlier on the MI saturates. This is to be expected as there are fewer weights to fit. This information 'saturation' is an interpretation for why shallow layers converge first when training neural networks.

5.2.1 Data Processing Inequality Violation?

The encoder data split of CINIC-10 was used to train the classifier under scrutiny. An intriguing phenomenon is evidenced when assessing MI on the encoder data split in late stage training. There is a point at approximately 180 epochs of training where the information about the targets in the deeper layers of the network becomes greater than the information about the targets in shallower layers. This seems to be a violation of the data processing inequality (Equation 3.7): deeper layers only ever have information provided by shallower layers and information cannot be created. More formally:

\[
I(y; \mathbf{x}) \geq I(y; \mathbf{h}_1) \geq I(y; \mathbf{h}_2) \geq I(y; \mathbf{h}_3). \tag{5.1}
\]

This seemingly impossible violation of the DPI is likely an issue with the tightness of the bound and was a prime motivation for the three-way data split. MI assessed on the held-out evaluation data evidences the correct ordering of layers (Figure 5.6, right).

5.2.2 Linear Separability

[Figure 5.7 plot: I(y; hj) lower bound under a linear probe against epochs of classifier training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3, and E_{pD(y,x)}[log q(y | x)].]

Figure 5.7: I(y; hj) lower bounds as per Equation 4.4 for classifier training, using a single linear mapping for decoding to show the increasing linear separability of hidden layers as they learn. The orange line corresponds directly to the negative loss of the classifier training as it represents the expected log-likelihood of the data under the classifier model, q (and where h = h0 = x).

I(y; hj) was evaluated under a simple linear mapping to test whether hidden layers became more linearly separable as learning progressed. The results are shown in Figure 5.7. For each evaluation, the classifier under scrutiny was frozen and a single fully-connected layer was attached to hj and trained using SGD as before. Each layer became more linearly separable as learning progressed, evidenced by the monotonic increase of linearly mapped MI assessments in Figure 5.7.
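A minimal sketch of such a probe is given below; the flattened feature size, data loader, and training length are assumptions for illustration rather than the exact configuration used.

```python
import torch
import torch.nn as nn

def linear_probe(encoder_to_h, h_dim, num_classes, decode_loader, epochs=200):
    """Single linear layer trained on a frozen representation h_j
    (a sketch; h_dim is the flattened size of the chosen layer)."""
    probe = nn.Linear(h_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    encoder_to_h.eval()
    for _ in range(epochs):
        for x, y in decode_loader:
            with torch.no_grad():
                h = encoder_to_h(x).flatten(start_dim=1)   # frozen features
            loss = loss_fn(probe(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```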

Conclusion

This chapter involved tracking the mutual information between data (images or corresponding labels) and hidden representations learned in a typical classifier set-up. The shared information between input images and hidden layers was used to show that there are two phases of learning [41]. First, the classifier network fits to the input data, evidenced by an increase in mutual information. Following this (at approximately 5% of total training time) the network goes into a compression phase, evidenced by a decrease of mutual information as it learns to discard irrelevant components of the input images. Images were also generated from the conditional PixelCNN++ decoder models to provide an intuitive means of assessing information compression. The idea that deep neural networks compress through stochastic relaxation, evidenced by a low signal-to-noise ratio in the weight updates [41, 46], was inspected. We found no clear correlation between a signal-to-noise ratio phase transition and the onset of compression, while training using image data and a modern architecture choice. The mutual information between hidden layers and labels was tracked and presented. This is essentially a quantitative means of answering: how well equipped is each layer to predict the correct label? As expected, all layers improved as learning progressed. We also showed how all layers became more linearly separable throughout learning. The following chapter repeats the core experiments under an autoencoder training regime.

Chapter 6

Experiment Two: Autoencoder MI Tracking

This chapter mirrors the experiments of Chapter 5. It differs in that the hidden layers under scrutiny in this chapter are trained using an autoencoder set-up. Compression still occurred for unsupervised autoencoding (Section 6.1). We show that the bottleneck layer became increasingly ill-suited to a classification task, starting early on in training, by tracking its shared information with classification labels (Section 6.2).

6.1 Inverse Decoding: Information about Inputs

Figure 6.1 tracks I(x; hj) over the autoencoder training regime. The bottleneck layer is h4. Compression was observed before 100 epochs for h3 and h4, but earlier for h2. The bottleneck layer may act as effective labels for the earlier layers, which then learn to discard irrelevant information to optimise for the bottleneck. Compression in an autoencoder set-up is counter-intuitive: training attempts to fit to the data for reconstruction. Samples from the PixelCNN++ decoder models were taken (Section 6.1.1) to comprehend what quality of information the autoencoder bottleneck keeps or discards. The autoencoder training regime induced a longer fitting stage and therefore a higher relative gain in MI – see Table 6.1. Nonetheless, compression occurred. Autoencoders attempt to reconstruct their input. Therefore, the decrease in MI later in training would seem to indicate a reduction in the autoencoder's ability to reconstruct input images. The conditional images in the following section show that this is not necessarily true: the autoencoder improved the representation in the bottleneck layer, toward better reconstruction, through compression.

[Figure 6.1 plot: I(x; hj) lower bound (nats) against epochs of autoencoder training on a log scale, with curves for E_{pD(x,hj)}[log q(x | hj)], j = 2, 3, 4, annotated with ∆c = 732 nats (h2), 72 nats (h3), and 64 nats (h4).]

Figure 6.1: I(x; hj) information curves: lower bounds as per Equation 4.4 for autoencoder training. The relative compression is the difference between the peak fitting and final values, listed as ∆c in nats on the right. The layers still show signs of compression even when learning these using an autoencoder set-up.

Table 6.1: Relative information gain and compression for the hidden representations in an autoencoder training regime. The autoencoder set-up induced a longer fitting stage that resulted in a higher ∆f .

hj    Initial LL    Peak LL    Final LL    Fitting, ∆f (nats)    Compression, ∆c (nats)
h2    -7409         -5204      -5936       2205                  732
h3    -7608         -6980      -7052       628                   72
h4    -7727         -7278      -7342       448                   64

6.1.1 Conditional Samples

Figure 6.2 shows samples generated from the bottleneck layer – h4 – of the autoencoder.

Compression is evident after 100 epochs for h4 according to Figure 6.1. Even though the shared information between h4 and the data decreases after 100 epochs of autoencoder training, the information quality is better at 200 epochs: the samples are closer to the original and are less noisy. This goes some way to confirm the findings by Saxe et al. [36] regarding the overlap of training phases: fitting and compression need not be separate and may occur simultaneously. Samples from other layers were not included as they are nearly identical to the original images and thus superfluous to visualise. The autoencoder learned to make better use of its bottleneck layer to capture the image information, even though this resulted in lower MI. Compare the generated samples conditioned on the autoencoder’s bottleneck in Figure 6.2, to the samples generated conditioned on the same capacity representation, but trained for classification, in Figure 5.5. Although h4 could be a layer that preserves information suitable for good reconstruction, it discards much of this information in the classification set-up.

6.1.2 Signal to Noise Ratio Tracking

For a comparable analysis with classifier training (Section 5.1.1), Figure 6.3 exhibits the learning signal statistics for autoencoder training. We see that the SNR for autoencoder training is higher than for classifier training, and does not decrease over time. The signal is of the same scale as the L2 norm of the weights for autoencoder training. Conversely, the signal is approximately 10× lower than the L2 norm of the weights for classifier training. This is partly explained by the fact that the loss is higher in the autoencoder set-up (see Appendix B). The SNR did not decrease for the autoencoder as it did for the classifier.

[Figure 6.2 panels: (a) original images; (b) no learning; (c) 5 epochs; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 6.2: Samples generated using PixelCNN++, conditioned on h4, the autoencoder bottleneck. The original images processed for h4 are shown in (a). Even though the MI is lower at 200 epochs than at 100 epochs, the image samples are more realistic and less noisy.

[Figure 6.3 plots, panels (a)–(d): normalised gradient signal statistics (mean mj, standard deviation sj, and the weight norm ∥Whj∥ for scale) against epochs of autoencoder training, for the weights producing h1, h2, h3, and the weights out of the bottleneck, with the observed compression start marked; panel (e): smoothed SNR of (a)–(d) and the mean SNR.]

Figure 6.3: Normalised gradient signal statistics throughout training of the autoencoder. ∥Whj∥ denotes the L2 norm of the weights, for scale reference. The SNR (Equation 3.12) is the ratio between the mean gradient signal (Equation 3.10) and the standard deviation thereof (Equation 3.11). The mean SNR (e) is the average SNR over all signals, (a)–(d).

6.2 Forward Decoding: Information about Targets

[Figure 6.4 plot: I(y; hj) lower bound against epochs of autoencoder training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3.]

Figure 6.4: I(y; hj) lower bounds as per Equation 4.4 for autoencoder training. Unlike the classifier training regime (Figure 5.6), there are no significant differences between assessments on encoder and evaluation data. The representation immediately prior to the autoencoder bottleneck – h3 – compressed label-relevant information as it learned.

The autoencoder never had access to class labels. Nonetheless, with our framework we queried I(y; hj) to understand how the autoencoder preserved and/or discarded class-relevant information. Figure 6.4 shows these results: a different picture is painted compared to the training of a classifier (Section 5.2). Particularly, there is a notable difference in behaviour between layer h3 and layers h1 and h2. h3 is the layer immediately preceding the average pooling operation that resulted in the autoencoder's bottleneck (see Figure 4.1), and is thus most closely linked to it.

Only h3 discarded information about the class labels as learning progressed. In other words, the information in the features that h3 learned became less useful for classification over time. The bottleneck layer is impartial to class labels by design of the autoencoder training regime. The increase of information about targets in the shallower layers – h1 and h2 – shows that these layers learned to maximise any information available within the input space prior to the bottleneck enforced on h3. The use of a standard and simple loss function for the autoencoder – mean squared error, in this case – may be the cause for the compression of class-relevant information in the bottleneck layer, since it encourages mode-averaging behaviour that results in a blurred reconstruction (see Appendix B for autoencoder image reconstructions). Some nuances of the images, assumed to be useful for classification, would be destroyed because of this behaviour. More advanced and modern loss functions can be tested in future work.

Conclusion

This chapter gave the results of tracking mutual information between data (inputs and labels) and hidden layers learned by training an autoencoder. Even though there were no class labels in this set-up, and therefore no irrelevant information, the hidden layers nonetheless compressed information later in training. This is likely owing to the mode-averaging behaviour of the mean squared error. Image samples were generated from the PixelCNN++ decoder models, conditioned on the autoencoder's bottleneck. The information in the bottleneck layer sufficed for a good reconstruction using a PixelCNN++ as a powerful decoder model. A comparison was made between images generated using the same hidden representation, except trained for classification. The bottleneck layer could have the capacity to keep information for a good reconstruction, but instead only kept what is necessary when trained for classification. The weight-update signal-to-noise ratio statistics were shown for comparison with the classifier training regime. Unlike when learning to classify, the signal-to-noise ratios did not decrease as learning progressed. The mutual information between hidden layers and labels was also scrutinised. Earlier layers maximised label-relevant (and perhaps all available) information; the layer immediately responsible for the bottleneck compressed label-relevant information almost immediately. The next chapter draws conclusions and offers avenues for future work.

Chapter 7

Conclusion

Deep neural networks generalise well even when they contain many more parameters than training samples. Principled theoretic formulations and empirical assessments drive the field of deep learning forward, toward understanding why neural networks perform so favourably. Earlier work [41] hypothesized that deep learning is successful because hidden layers learn by forgetting the irrelevant aspects of data. Hidden layers learn to compress aspects of the input data that are not necessary to accomplish the required task. The mechanism behind this compression was put forward as stochastic relaxation, where a diffusion-like process drives a neural network toward better compression by a task-constrained addition of noise to layer weights. An empirical-centric refutation paper [36] gave counter-examples to show that generalisation by compression is not necessarily the case, and sparked a debate in the machine learning community. This thesis is an information theoretic analysis of deep convolutional neural networks toward understanding the nature of the learned internal representations. We tracked the mutual information between learned representations and data, for a modern convolutional model that used residual connections and batch normalisation. Mutual information is difficult to compute for high-dimensional data. Hence, we defined and computed a lower bound on the mutual information using decoding models. We found that modern convolutional architectures, whether trained in a supervised or unsupervised context, exhibited compression behaviour as learning progressed.

7.1 Our Contributions

Most earlier research was confined to fully-connected networks, toy problems, or toy datasets. One of our primary contributions was to analyse a modern architecture set-up for encoding hidden representations, trained using realistic image data. We also provided qualitative examples in the form of image samples for an accompanying intuitive grasp on what kind of information was compressed. A framework for analysis was constructed to observe training for (1) supervised classification, and (2) unsupervised autoencoding. Information compression occurred in both cases. The signal-to-noise ratios in the stochastic parameter updates were also tracked. The evidenced compression was not obviously linked to a phase transition from high to low signal-to-noise ratios as claimed in earlier research [41]. The framework consisted of:

• A fixed (with regard to training regime) residual connection encoder architecture. The same architecture was used with targeted weight-freezing to compute the mutual information between hidden layers and classification labels.

• Conditional PixelCNN++ models to compute the mutual information between input images and hidden layers, and to conditionally generate image samples for qualitative analyses.

• A linear probing mechanism to investigate whether hidden layers become more linearly separable throughout learning.

• An analysis of whether stochastic relaxation is evident in a modern set-up.

• A new image dataset compiled to improve information theoretic analyses.

7.2 Our Findings

We observed two phases of training for all hidden layers. The mutual information between the layers and the images first increased (fitting) and then decreased for the remainder of training (compression). Compression began earlier for classifier training than for autoencoder training. Earlier layers compressed more information. We generated image samples, conditioned on hidden layers, using PixelCNN++ to show: (1) what quality of information was compressed; and (2) that the tightness of the lower bound estimate was consistent.

Regarding classification, the hidden layers became both better at capturing class-relevant information, and more linearly separable as learning progressed. The bottleneck layer of the autoencoder was the only layer that exhibited any reduction in class-relevant information. Therefore, the bottleneck layer of an autoencoder is not suited to classification. Conversely, information compression in the bottleneck resulted in better quality generated images. Contrary to earlier research [41], tracking the signal-to-noise ratios of parameter updates did not reveal a connection between learning signals and the shift from fitting to compression. Moreover, the relationship between gradient signal and noise was not the same for classification and autoencoding.

7.3 Limitations and Future Work

The quality of the decoder model influences the tightness of the lower bound on the mutual information. The better the decoder model, the tighter the bound. Therefore, we cannot be absolutely certain of the results exhibited in this thesis because assessing the quality of the bound requires knowing the mutual information (in which case we would not require a lower bound). We conditionally generated image samples to show that the decoder model did not evidence relatively worse behaviour at any assessment point. Additionally, any future advances in explicit image distribution estimation can improve the analyses in this thesis. There are a number of state-of-the-art models and architecture choices worth assessing with our framework. Attention mechanisms, for example, leverage the idea of relevant information. It would be useful to analyse the hidden representations learned in that context. We used a modern architecture and batch normalisation for the models under scrutiny. Batch normalisation affects the weight updates and may be a confounding factor regarding our analysis of stochastic relaxation as a mechanism for compression. Future work will entail a better means of testing for stochastic relaxation. The long-term goal of our research is to find a way of immediately and efficiently defining hidden layers in neural networks. Understanding the learning dynamics of neural networks is a step toward that future goal.

References

[1] Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
[2] Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[3] Bengio, Y. (2013). Deep learning of representations: looking forward. In Proceedings of the International Conference on Statistical Language and Speech Processing, pages 1–37. Springer.
[4] Chrabaszcz, P., Loshchilov, I., and Hutter, F. (2017). A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819.
[5] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
[6] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE.
[7] Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer.
[8] Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. (2018). Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785.
[9] GitHub (2018a). ImageNet utils. Accessed on 2018-08-03. Online: https://github.com/tzutalin/ImageNet_Utils.
[10] GitHub (2018b). PyTorch-CIFAR. Accessed on 2018-08-03. Online: https://github.com/kuangliu/pytorch-cifar.
[11] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org.
[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, pages 2672–2680. Curran Associates, Inc.

[13] Gülçehre, Ç. and Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1–32.
[14] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, pages 1321–1330. PMLR.
[15] He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE.
[16] He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, pages 630–645. Springer.
[17] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[18] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 2261–2269. IEEE.
[19] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pages 448–456. PMLR.
[20] Jacobsen, J.-H., Smeulders, A., and Oyallon, E. (2018). i-RevNet: Deep invertible networks. In Proceedings of the International Conference on Learning Representations.
[21] Jakubovitz, D., Giryes, R., and Rodrigues, M. R. D. (2018). Generalization error in deep learning. arXiv preprint arXiv:1808.01174v1.
[22] Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. (2017). Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.
[23] Krähenbühl, P., Doersch, C., Donahue, J., and Darrell, T. (2015). Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856.
[24] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master's thesis, Toronto University.
[25] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 1097–1105. Curran Associates, Inc.
[26] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[27] Ma, C., Huang, J.-B., Yang, X., and Yang, M.-H. (2015). Hierarchical convolutional features for visual tracking. In Proceedings of the International Conference on Computer Vision, pages 3074–3082.

[28] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
[29] Nash, C., Kushman, N., and Williams, C. K. (2018). Inverting supervised representations with autoregressive neural density models. arXiv preprint arXiv:1806.00400.
[30] OpenReview (2018). OpenReview of: On the information bottleneck theory of deep learning. Accessed on 2018-06-27. Online: openreview.net/forum?id=ry_WPG-A-.
[31] Oyallon, E. (2017). Building a regular decision boundary with deep networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 5106–5114.
[32] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the Advances in Neural Information Processing Systems Workshop.
[33] Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
[34] Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 901–909. Curran Associates, Inc.
[35] Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? (no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604.
[36] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. (2018). On the information bottleneck theory of deep learning. In Proceedings of the International Conference on Learning Representations.
[37] Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55.
[38] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[39] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., and Wang, Z. (2016a). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1874–1883. IEEE.
[40] Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., and Wang, Z. (2016b). Is the deconvolution layer the same as a convolutional layer? arXiv preprint arXiv:1609.07009.

[41] Shwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[42] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[43] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
[44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1–9. IEEE.
[45] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
[46] Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In Information Theory Workshop, pages 1–5. IEEE.
[47] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Proceedings of the Advances in Neural Information Processing Systems, pages 4790–4798. Curran Associates, Inc.
[48] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and Tang, X. (2017). Residual attention network for image classification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 6450–6458.
[49] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393.
[50] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 5987–5995. IEEE.
[51] Yu, S., Jenssen, R., and Principe, J. C. (2018). Understanding convolutional neural network training with information theory. arXiv preprint arXiv:1804.06537.
[52] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations.

Appendix A

Self-consistent IB Equations

The basis for the information bottleneck method is the joint satisfaction of the following self-consistent equations:

p(h \mid x) = \frac{p(h)}{Z(x;\beta)} \exp\big(-\beta\, D_{KL}\!\left[\,p(y \mid x)\,\|\,p(y \mid h)\,\right]\big),   (A.1)

p(y \mid h) = \sum_{x} p(y \mid x)\, p(x \mid h),   (A.2)

p(h) = \sum_{x} p(x)\, p(h \mid x),   (A.3)

where Z(x; β) is the partition function and β acts as a trade-off parameter between the compression rate and the complexity (Equation 3.8). Tishby et al. [45] describe in detail the iterative algorithm that optimises these equations by alternating their updates.
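To make that alternating optimisation concrete, the sketch below iterates Equations A.1–A.3 for a small discrete joint distribution. It is a minimal illustration only: the function name, random initialisation, and numerical-stability constant are assumptions of this sketch, not part of the procedure of Tishby et al. [45] beyond the three update equations themselves.

```python
import numpy as np

def iterative_ib(p_xy, n_h, beta, n_iters=200, seed=0):
    """Alternate the self-consistent IB equations (A.1)-(A.3) for a discrete
    joint distribution p(x, y) of shape (n_x, n_y). Illustrative sketch."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)                          # p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)       # p(y|x)

    # Random soft initialisation of the encoder p(h|x).
    p_h_given_x = rng.random((p_xy.shape[0], n_h))
    p_h_given_x /= p_h_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        p_h = p_x @ p_h_given_x                                     # (A.3)
        p_x_given_h = (p_h_given_x * p_x[:, None]).T / (p_h[:, None] + eps)
        p_y_given_h = p_x_given_h @ p_y_given_x                     # (A.2)
        # KL[p(y|x) || p(y|h)] for every (x, h) pair.
        log_ratio = (np.log(p_y_given_x + eps)[:, None, :]
                     - np.log(p_y_given_h + eps)[None, :, :])
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        p_h_given_x = p_h[None, :] * np.exp(-beta * kl)             # (A.1)
        p_h_given_x /= p_h_given_x.sum(axis=1, keepdims=True)       # Z(x; beta)
    return p_h_given_x
```

Here β plays the same trade-off role as in Equation 3.8: larger values weight prediction of y more heavily at the expense of compression.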

Appendix B

Training Curves

B.1 Training the Classifier and Autoencoder

Figure B.1 shows the classifier and autoencoder loss curves. The losses are not driven down to zero because we chose to use the same architecture that the original ResNet research used for CIFAR-10 [15]. Nonetheless, compression was still evident. These are not state-of-the-art results on CINIC-10, which was compiled for this research. Appendix C details its compilation and exhibits results on a number of state-of-the-art architectures. We chose to adhere to the standard ResNet CIFAR-10 architecture [15] to simplify the analyses in this research. Future work should entail an architecture search specific to CINIC-10. Figures B.1(a) and (b) show that the classifier overfit the encoder data. This is further evidenced when tracking I(y; hj) in the experiments of Section 5.2. Figure B.2 shows reconstructions from the trained autoencoder.

B.2 Forward Decoder Models

Classifier training

Figures B.3, B.4, and B.5 show the training curves for the decoder models estimating

I(y; h1), I(y; h2), and I(y; h3), respectively, when these representations are trained using a classifier set-up. The estimation for I(y; x) is computed directly as the negative log-likelihood of the classifier model as it learns, with recalibration. The effect of probability recalibration can be seen in the final deviation of these decoder training curves: the recalibrated results have simply been appended for comparative visualisation. The y-axis scale is not consistent across these figures because they are not intended for self-comparison, but rather as a testament to the convergence of the lower bound on I(y; hj).
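Since the lower bound on I(y; hj) is simply H(y) minus the converged decoder negative log-likelihood, the estimate itself is a one-liner. The sketch below assumes a balanced ten-class label distribution (so H(y) = log 10 nats) and uses a placeholder value for the converged, recalibrated NLL; it is not the exact evaluation code used in this thesis.

```python
import math

def forward_mi_lower_bound(decoder_nll_nats, n_classes=10):
    """Lower bound on I(y; h_j) in nats, assuming a uniform label prior:
    I(y; h_j) = H(y) - H(y|h_j) >= H(y) - E[-log q(y|h_j)]."""
    return math.log(n_classes) - decoder_nll_nats

# For example, the converged evaluation NLL of ~0.5238 nats in Figure B.3
# would give a bound of roughly log(10) - 0.5238 ≈ 1.78 nats on I(y; h1).
print(forward_mi_lower_bound(0.5238))
```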

Figure B.1: Classifier loss (a) and accuracy (b) curves, and autoencoder loss (c) curve: (a) classification training and evaluation losses (final average NLL: 0.2221 encoder data, 0.5622 evaluation data); (b) classification training and evaluation accuracies (final: 92.16% encoder data, 82.19% evaluation data); (c) autoencoder training and evaluation losses (final: 0.1207 encoder data, 0.1206 evaluation data). The architecture of the encoder was chosen according to the original ResNet [15] for CIFAR-10 data. Hence, owing to more data in CINIC-10, the losses are not driven down to zero.

Figure B.2: Image reconstructions from the autoencoder model for all three datasets (encoder, decoder, and evaluation examples). These are typically blurry owing to the use of MSE loss when training the autoencoder.

Figure B.3: Forward decoder models' loss curves for estimating I(y; h1), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h1. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.5086 / 0.2218 / 0.5238.

Autoencoder training

Figures B.6, B.7, and B.8 show the training curves for the decoder models estimating

I(y; h1), I(y; h2), and I(y; h3), respectively, when these representations are trained using an autoencoder set-up. Compared to the decoding under a classifier training regime, these loss curves show that the representations are not immediately well-suited to classification: they have greater loss initially.

Figure B.4: Forward decoder models' loss curves for estimating I(y; h2), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h2. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.4840 / 0.2523 / 0.5301.

Figure B.5: Forward decoder models' loss curves for estimating I(y; h3), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h3. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 1.9659 / 1.9657 / 1.9712 and 0.2634 / 0.4901 / 0.5051.

Figure B.6: Forward decoder models' loss curves for estimating I(y; h1), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h1. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.5862 / 0.3082 / 0.5914.

Figure B.7: Forward decoder models' loss curves for estimating I(y; h2), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h2. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.7873 / 0.5243 / 0.7933.

Figure B.8: Forward decoder models' loss curves for estimating I(y; h3), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h3. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 1.9599 / 1.9568 / 1.9631 and 2.2708 / 2.2709 / 2.2714.

B.3 PixelCNN++ Inverse Decoder Models

Classifier training

Figures B.9, B.10, and B.11 are the loss curves when using PixelCNN++ to decode the hidden representations h2, h3, and h4, respectively, for MI tracking in the classifier training regime. Each of these models took approximately fifteen days to train on an NVIDIA Titan 1080ti GPU, so we stopped training at 250 epochs owing to time and computation constraints. Additionally, the training of the PixelCNN++ decoder models for h4 was initially distributed over GPUs with less memory, which necessitated a smaller batch size; this shows as instabilities in the learning curves of Figure B.11. This is not ideal, but it is consistent amongst the decoders.
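The inverse bound follows the same recipe as the forward one: I(x; hj) = H(x) − H(x|hj) ≥ H(x) − E[−log q(x|hj)], where the conditional PixelCNN++ supplies the decoder term. H(x) is intractable for real images, so the sketch below substitutes the unconditional PixelCNN++ NLL of Section B.4 as a stand-in; that substitution, and the function name, are assumptions of this illustration rather than the exact procedure used in the experiments.

```python
def inverse_mi_estimate(conditional_nll_nats, unconditional_nll_nats):
    """Rough lower-bound estimate of I(x; h_j) in nats per image, using the
    unconditional PixelCNN++ NLL as a stand-in for H(x)."""
    return unconditional_nll_nats - conditional_nll_nats

# Purely as an arithmetic illustration, using the evaluation legend values of
# Figures B.9 and B.16: 7628.65 - 6332.61 ≈ 1296 nats per image.
print(inverse_mi_estimate(6332.6092, 7628.6529))
```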

Figure B.9: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h2), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 6332.4184 / 6206.0450 / 6332.6092.

Autoencoder training

Figures B.12, B.13, and B.14 are the loss curves when using PixelCNN++ to decode the hidden representations h2, h3, and h4, respectively, for MI tracking in the autoencoder training regime.

Figure B.10: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h3), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7498.6980 / 7392.1538 / 7515.4007.

Figure B.11: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h4), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7640.4045 / 7457.2779 / 7643.9200.

Figure B.12: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h2), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 5889.1252 / 5827.2696 / 5924.7832.

Figure B.13: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h3), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7014.7296 / 6961.8481 / 7027.5152.

Figure B.14: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h4), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7317.7591 / 7259.7478 / 7340.4536.

A side note on GPU hours

The decoder models shown in this appendix are the result of over 14000 GPU hours (approximately 600 days of training). It is therefore highly infeasible to repeat these experiments unless there is a convincing need to do so.

B.3.1 PixelCNN++ Bound

To demonstrate that the PixelCNN++ decoders approach the best decoding they can achieve, we plot the area under the curve (AUC) of the information curves (e.g., Figure 5.1) as a function of PixelCNN++ training epochs. Figure B.15 shows that the PixelCNN++ models are approaching their optimal decoding potential: the MI lower bound tends towards the tightest value these decoders can provide.
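A sketch of how such an area-under-the-curve summary can be computed is given below. It assumes the information estimates for one layer are stored as an array indexed by encoder epoch, and it reads the adjustment described in Figure B.15 as subtracting the curve's minimum value (the estimate obtained with an untrained decoder); both the function name and that reading are assumptions of this sketch.

```python
import numpy as np

def information_curve_auc(encoder_epochs, mi_estimates):
    """Area under one information curve after shifting it so that its
    minimum value (the estimate with an untrained decoder) sits at zero."""
    mi = np.asarray(mi_estimates, dtype=float)
    mi = mi - mi.min()                 # make the areas comparable across layers
    return np.trapz(mi, x=np.asarray(encoder_epochs, dtype=float))
```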

B.4 Unconditional PixelCNN++

Figure B.16 exhibits the unconditional PixelCNN++ loss for comparison with our decoder models. The lowest average loss was an NLL of 24411, which corresponds to 3.58 bits per dimension. The original PixelCNN++ achieved 2.92 bits per dimension on CIFAR-10 [33]. Generated samples are shown in Figure B.17.
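The bits-per-dimension figures quoted here follow the usual conversion from nats per image, sketched below under the assumption of 32×32×3 images and an NLL reported in nats per image (consistent with the legend values of Figure B.16); the function name is illustrative.

```python
import math

def nats_per_image_to_bits_per_dim(nll_nats, n_dims=32 * 32 * 3):
    """Convert an average NLL in nats per image into bits per dimension."""
    return nll_nats / (n_dims * math.log(2))

# The evaluation legend value of ~7628.65 nats in Figure B.16 corresponds to
# roughly 3.58 bits per dimension, matching the figure caption.
print(nats_per_image_to_bits_per_dim(7628.6529))
```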

Figure B.15: Areas under the curve, for the I(x; h2), I(x; h3), and I(x; h4) estimates, as a function of PixelCNN++ decoder model training epochs for classifier training, evidencing that the bounds become close to the best that they can be. The information curves were adjusted so that the minimum value was zero (at no PixelCNN++ training), so as to make the areas herein meaningful.

Figure B.16: Unconditional PixelCNN++ loss curves when trained on the encoder dataset of CINIC-10 (average NLL against training epochs; legend values for encoder/decoder/evaluation data: 7515.3841 / 7612.0762 / 7628.6529). Since this is only using one third of CINIC-10, it may be possible to achieve a lower loss when using a larger portion of CINIC-10. The best evaluation loss here corresponds to 3.58 bits per dimension, as opposed to the 2.92 bits per dimension on CIFAR-10 [33].

Figure B.17: Unconditional PixelCNN++ generated samples when trained on the encoder dataset of CINIC-10. These samples have good local qualities but are not particularly convincing as real images. This is a known pitfall of autoregressive explicit density estimators.

Appendix C

CINIC-10: CINIC-10 Is Not ImageNet or CIFAR-10

This appendix explains the motivation for compiling CINIC-10 (Section C.1) and how it was compiled (Section C.2). We also present results for several well-known neural network architectures and compare the images and colour distributions to CIFAR-10 (Section C.3). We plan to release this dataset publicly in the near future.

C.1 Motivation

CINIC-10 is not ImageNet or CIFAR-10, but is compiled therefrom. Consider the following:

• CIFAR-10 [24] is a commonly used image benchmarking dataset that is (1) complex enough to represent real-world images and (2) concise enough to allow for fast prototyping (unlike ImageNet [25]).

• It would be advantageous to have an extended version of CIFAR-10, since 6000 samples per class may not always be sufficient, particularly considering the three-way data split used in this thesis.

• The Fall 2011 ImageNet release [6] contains images that belong to the same (or similar) classes as CIFAR-10.

• The gap in task difficulty between CIFAR-10 and CIFAR-100 is large.

• A dataset that provides another benchmark milestone, closer to CIFAR-10, would be useful.

• ImageNet32x32 and ImageNet64x64 [4] are down-sampled versions of the standard ImageNet release. This does provide a new benchmarking milestone in that the image sizes are smaller, but in some ways it is a more challenging problem as the information content is diminished.

• Data-hungry tasks, like the information-theoretic analysis in this thesis, require more data.

• An equal train/validation/test data split gives a more principled perspective on generalisation performance.

CINIC-10 was compiled in such a manner as to provide a similar challenge to CIFAR-10 but with 4.5× as much data.

C.2 Compilation

CINIC-10 contains all of the CIFAR-10 images and 210000 additional images drawn from the ImageNet database. We acknowledge that the images in CIFAR-10 have been carefully cropped and processed; the images of CIFAR-10 and ImageNet therefore do not come from exactly the same distribution. To alleviate this, we ensured that each split of CINIC-10 contained equal numbers of CIFAR-10 images. The compilation steps were:

1. The CIFAR-10 images were processed into image format (.png) and named so as to enable backwards recovery; 'train/dog/cifar-10-train-123.png' is the image of index 123 from the 'train' set of CIFAR-10, for example.

2. The relevant synonym sets (synsets) within the ImageNet database were identified and collected. These synset-groups are listed in the additional synsets-to-cifar-10-classes.txt file.

3. The synsets were downloaded using ImageNet Utils [9].

4. The images were read and downsampled in identical fashion to ImageNet32x32 [4], for consistency.

5. The lowest number of relevant ImageNet samples was observed to be 21939. Therefore, 21000 samples were randomly selected from each of the synset-groups and distributed equally among train, validation, and test sets. Regarding this thesis, those correspond to the encoder, decoder, and evaluation sets (a short sketch of this split follows the list).

6. The additional images were named so that their origins can be directly identified; 'test/automobile/n03100240_16003.png' is the down-sampled version of image number 16003 from the synset named 'n03100240'.
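A minimal sketch of the equal three-way split in step 5 is given below. The file-list input, function name, and fixed seed are hypothetical; the actual compilation scripts are not reproduced here.

```python
import random

def split_synset_group(image_paths, per_split=7000, seed=0):
    """Randomly select 3 * per_split images from one synset-group and
    distribute them equally among the train, validation, and test splits
    (the encoder, decoder, and evaluation sets in this thesis)."""
    rng = random.Random(seed)
    chosen = rng.sample(list(image_paths), 3 * per_split)
    return {
        "train": chosen[:per_split],
        "valid": chosen[per_split:2 * per_split],
        "test": chosen[2 * per_split:],
    }
```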

C.3 Analysis

Classification results

Table C.1 gives results comparing several popular models on CIFAR-10 and CINIC-10. Each run on CINIC-10 was repeated five times with different random seeds.

Model                              CIFAR-10 test error    CINIC-10 test error    Number of parameters
VGG-16 [42]                        7.36%                  12.23 ± 0.16%          14.7M
ResNet-18 [15]                     6.98%                  9.73 ± 0.05%           11.2M
ResNet-18 (pre-activation) [16]    4.89%                  10.10 ± 0.08%          11.2M
GoogLeNet [44]                     -                      8.83 ± 0.12%           6.2M
ResNeXt29_2x64d [50]               5.18%                  8.55 ± 0.15%           9.2M
DenseNet-121 [18]                  4.96%                  8.74 ± 0.16%           7.0M
MobileNet [17]                     -                      18.00 ± 0.16%          3.2M

Table C.1: CINIC-10 versus CIFAR-10 on some popular models for classification. CIFAR-10 results are taken from a publicly available implementation [10]. For CINIC-10 training, the train dataset was a combination of the encoder and decoder sets, while the test set was the evaluation dataset.

Distributions and samples

Figure C.1 shows the pixel intensity distributions for the images taken from CIFAR-10 and the ImageNet database. These almost perfectly overlap, although the ImageNet contributions have a lower mean. Figure C.2 shows samples randomly selected from the contributing datasets. It is evident that the CIFAR-10 data is better prepared in some cases, but the subject of each image taken from ImageNet is nonetheless relatively clear.
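The colour-intensity comparison in Figure C.1 can be reproduced with a simple pooled histogram; the sketch below assumes the images are available as uint8 arrays, and the function name is a placeholder.

```python
import numpy as np

def intensity_histogram(images_uint8):
    """Normalised histogram of all pixel intensities, with all colour
    channels pooled, comparable to the curves in Figure C.1."""
    values = np.asarray(images_uint8).ravel()
    counts, _ = np.histogram(values, bins=256, range=(0, 256))
    return counts / counts.sum()
```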

Figure C.1: CINIC-10 contributor images' histograms (normalised frequency against colour intensity, for CIFAR-10 and the ImageNet contributors). These match well, but there are differences in the distribution of pixel intensities: ImageNet images have marginally lower pixel intensities.

(a) Images taken from CIFAR-10.

(b) Images taken from the ImageNet database.

Figure C.2: Samples from CINIC-10, showing the differences between CIFAR-10 and ImageNet contributors.