
Mutual Information Tracking for Convolutional Neural Networks

Luke Nicholas Darlow

Supervisor: Prof Amos Storkey

Centre for Doctoral Training in Data Science
School of Informatics
University of Edinburgh

This dissertation is submitted for the degree of Master of Science by Research

August 2018

Declaration

I have read and understood the University of Edinburgh’s plagiarism guidelines. I hereby declare that except where specific reference is made to the work of others, the contents of this dissertation are original and have not been submitted in whole or in part for consideration for any other degree or qualification in this, or any other university. This dissertation is my own work and contains nothing which is the outcome of work done in collaboration with others, except as specified in the text and Acknowledgements.

Luke Nicholas Darlow August 2018

Acknowledgements

First, a thank you to the machine learning community; our shared fascination is motivation enough for me to take one small step forward, and contribute. I would like to thank Prof Amos Storkey, my supervisor, for his patience and valued guidance. Thank you to the Bayeswatch research group for the boost and confidence. Particularly, Antreas Antoniou for your excitement about research and availability to help, and your suggestions and help with implementation; and Elliot Crowley for being a sounding board and always offering grounding advice, and for running experiments on CINIC-10. Thank you to all the members of the CDT, staff and students alike. For your confidence and always reassuring presence, thank you Piette. Finally, thank you to my family for letting me stand tall on your shoulders. You are all incredible. This work was supported in part by the EPSRC Centre for Doctoral Training in Data Science, funded by the UK Engineering and Physical Sciences Research Council (grant EP/L016427/1) and the University of Edinburgh.

Abstract

The information bottleneck interpretation of deep learning posits that neural networks generalise well because they learn optimal hidden representations. An optimal representation preserves maximally the task-relevant information from the input data, while compressing all task-irrelevant information. In this thesis, we tracked mutual information for a modern convolutional neural network, as it learned to either classify or autoencode realistic image data. Images are complex high-dimensional data sources, which makes computing mutual information in closed form intractable. Hence, we used decoder models to estimate mutual information lower bounds: a classifier for forward estimation and an autoregressive conditional PixelCNN++ for inverse estimation. Confirming some results in earlier research on toy problems, we found that the hidden representations first maximised shared information with the images, and then compressed task-irrelevant information for the remainder of training. Neural networks trained with stochastic gradient descent do learn to compress information. Compression was observed for both classification and autoencoding. However, whether this compression is the primary feature that enables neural networks to generalise well is still an open question. Images were also generated conditioned on hidden representations for a qualitative perspective on the nature of the information retained and/or compressed. Contrary to earlier research, we did not find any evidence in the signal-to-noise ratios of weight updates that indicated a change from fitting to compression.

Table of contents

List of figures

List of tables

1 Introduction
   1.1 Understanding Neural Networks
   1.2 An Information Theoretic Approach
   1.3 Application to Modern CNNs
   1.4 Our Contributions
   1.5 Thesis Structure

2 Technical Background: Deep Neural Networks
   2.1 Convolutional Neural Networks
      2.1.1 Hidden Representations
   2.2 Modern Techniques for Deeper Networks
      2.2.1 Batch Normalisation
      2.2.2 Residual Connections
   2.3 Thirst for Understanding
      2.3.1 Deeper Representations Disentangle Better
      2.3.2 Rethinking Generalisation
      2.3.3 Frameworks, Approaches, and Tools

3 Information Theoretic Analysis of Deep Neural Networks
   3.1 Mutual Information as a Tool
   3.2 The Information Bottleneck Interpretation
   3.3 Is Compression Necessary for Generalisation?
      3.3.1 Tanh Non-linearity and Binning
      3.3.2 The Question of Compression
      3.3.3 Two Stages of Learning and Stochastic Relaxation
      3.3.4 What of the IB Interpretation?
   3.4 Further Related Analyses
      3.4.1 Inverting Supervised Representations

4 Investigation Framework
   4.1 MI Using a Model: a Lower Bound
      4.1.1 Forward Decoding for Label MI
      4.1.2 Inverse Decoding for Input MI
   4.2 Tightness of the Bound
   4.3 Models Under Scrutiny
      4.3.1 Training and Freezing
   4.4 Data
      4.4.1 CINIC-10: CINIC-10 is Not Imagenet or CIFAR-10

5 Experiment One: Classifier MI Tracking
   5.1 Inverse Decoding: Information About Inputs
      5.1.1 Compression Through Stochastic Relaxation?
      5.1.2 Conditional Samples
   5.2 Forward Decoding: Information about Targets
      5.2.1 Data Processing Inequality Violation?
      5.2.2 Linear Separability

6 Experiment Two: Autoencoder MI Tracking
   6.1 Inverse Decoding: Information about Inputs
      6.1.1 Conditional Samples
      6.1.2 Signal to Noise Ratio Tracking
   6.2 Forward Decoding: Information about Targets

7 Conclusion
   7.1 Our Contributions
   7.2 Our Findings
   7.3 Limitations and Future Work

References

Appendix A Self-consistent IB Equations

Appendix B Training Curves
   B.1 Training the Classifier and Autoencoder
   B.2 Forward Decoder Models
   B.3 PixelCNN++ Inverse Decoder Models
      B.3.1 PixelCNN++ Bound
   B.4 Unconditional PixelCNN++

Appendix C CINIC-10: CINIC-10 Is Not Imagenet or CIFAR-10
   C.1 Motivation
   C.2 Compilation
   C.3 Analysis

List of figures

2.1 A simple deep neural network
2.2 A demonstration
2.3 Residual block

3.1 Hyperbolic tangent activation function and binning procedure

4.1 The ResNet model architecture (encoder) used to generate hidden activations in either a classification or autoencoder set-up

5.1 Mutual information curves (inverse direction) for classifier training
5.2 SNR statistics for classifier training
5.3 Samples generated using PixelCNN++, conditioned on h2 in the classifier training set-up
5.4 Samples generated using PixelCNN++, conditioned on h3 in the classifier training set-up
5.5 Samples generated using PixelCNN++, conditioned on h4 in the classifier training set-up
5.6 Classifier learned representations, forward decoding
5.7 Classifier learned representations, linear separability

6.1 Mutual information curves (inverse direction) for autoencoder training
6.2 Samples generated using PixelCNN++, conditioned on h4, the autoencoder bottleneck
6.3 SNR statistics for autoencoder training
6.4 Autoencoder learned representations, forward decoding

B.1 Models under scrutiny loss and accuracy curves
B.2 Image reconstructions from the autoencoder model
B.3 Forward decoder models loss curves for first hidden representation, classifier training regime
B.4 Forward decoder models loss curves for second hidden representation, classifier training regime
B.5 Forward decoder models loss curves for third hidden representation, classifier training regime
B.6 Forward decoder models loss curves for first hidden representation, autoencoder training regime
B.7 Forward decoder models loss curves for second hidden representation, autoencoder training regime
B.8 Forward decoder models loss curves for third hidden representation, autoencoder training regime
B.9 Inverse PixelCNN++ decoder models loss curves for second hidden representation, classifier training regime
B.10 Inverse PixelCNN++ decoder models loss curves for third hidden representation, classifier training regime
B.11 Inverse PixelCNN++ decoder models loss curves for fourth hidden representation, classifier training regime
B.12 Inverse PixelCNN++ decoder models loss curves for second hidden representation, autoencoder training regime
B.13 Inverse PixelCNN++ decoder models loss curves for third hidden representation, autoencoder training regime
B.14 Inverse PixelCNN++ decoder models loss curves for fourth hidden representation, autoencoder training regime
B.15 Areas under the curve for PixelCNN++ lower bounds
B.16 Unconditional PixelCNN++ loss curves
B.17 Unconditional PixelCNN++ generated samples

C.1 CINIC-10 contributor images’ histograms
C.2 Samples from CINIC-10, showing the differences between CIFAR-10 and ImageNet contributors

List of tables

2.1 Disentanglement example
5.1 Relative information gain and compression for the hidden representations in a classifier training regime
6.1 Relative information gain and compression for the hidden representations in an autoencoder training regime
C.1 CINIC-10 versus CIFAR-10 on some popular models for classification

Chapter 1

Introduction

1.1 Understanding Neural Networks

Deep Neural Networks are synonymous with modern machine learning and artificial intelligence partly because of their widespread success. Unfortunately, the popularity of neural networks for applications is not matched by a clear understanding of how they work. The field will advance if we have more comprehensive theory about how neural networks work, or better empirical studies that characterise them. In order to understand neural networks better, a number of researchers have proposed appropriate approaches and principled tools. We discuss these in the following section.

1.2 An Information Theoretic Approach

The information bottleneck (IB) interpretation of deep learning [45, 46, 41] claims that optimal learning is a technique for finding representations of data that are suited to a target task. An optimal representation should (1) keep maximal information about the task, and (2) retain minimal task-irrelevant information. The best-generalising model compresses all task-irrelevant information. In this thesis, we tracked information quantities to inspect the notion of compression in convolutional neural networks. Depth in neural networks results in a flexible model that enables easier compression, according to the IB perspective. Furthermore, the computational burden incurred by adding more layers does not outweigh the compression benefit. Researchers claimed that neural networks generalise well because they compress hidden representations efficiently [41]. However, recent research [36] provided counterexamples against

compression for good generalisation. This thesis explores whether compression occurs in modern convolutional models applied to realistic images. We aim to clarify whether earlier findings [41] or counter-findings [36] were a product of a limited toy-example set-up, or do indeed apply to a modern setting.

1.3 Application to Modern CNNs

Images are a domain in which neural networks have become commonplace. Deep convolutional neural networks (CNNs) are foundational elements for modelling images and solving image-related tasks [25]. Images are complex and high-dimensional data sources. Therefore, the learning process can be better understood by using realistic images as a data source. This thesis applies an information theoretic analysis to CNN models trained on realistic image data. This investigation fills a current research gap by extending previous inspection of much simpler models [41, 36]. It is not possible to compute information theoretic quantities exactly for high-dimensional data. Hence, we must estimate these quantities using models.

1.4 Our Contributions

Our contributions are:

1. An information theoretic analysis of modern CNN models that are trained using realistic image data. The PixelCNN++ [33] architecture is used to compute a mutual information lower bound for the inverse analysis. A convolutional classifier model is used to compute a lower bound for the mutual information with the labels.

2. Confirmation of the presence of compression of task-irrelevant information. The analysis is undertaken for supervised (classification) and unsupervised (autoen- coding) training regimes.

3. Visualisation of conditionally generated images. This enables understanding of what aspects of the images (colour, object location, etc.) are irrelevant.

4. A demonstration of the increase in linear separability of learned hidden representations as a function of training time.

5. Tracking of a learning signal in the form of the signal-to-noise ratios of weight updates. Earlier work [41, 46] posited that compression occurs at low signal-to-noise ratios.

6. The compilation of a new dataset, named CINIC-10, that extends CIFAR-10 [24]. CINIC-10 keeps the same task of 10-way classification, but has more samples per class.

1.5 Thesis Structure

A technical background to deep neural networks is given in Chapter 2. The current literature on information theoretic analysis of deep learning is described in Chapter 3. Chapter 4 explains the models under analysis in this research, the investigation framework for analysis, and the dataset compiled and used throughout. Chapters 5 and 6 disseminate and discuss the results of the analysis of a classifier and autoencoder, respectively. Chapter 7 concludes this work and offers directions for future research.

Chapter 2

Technical Background: Deep Neural Networks

Deep neural networks [11] are widely-used parametric machine learning models. This thesis focusses on neural networks for images [25]. Convolutional models (Section 2.1) typically improve performance in neural networks for image-related problems. Realistic images provide a suitable problem domain for information theoretic analysis since they represent a complex and high-dimensional data source.

Figure 2.1: A simple deep neural network. The input, x = (x1, x2), is processed through three layers from left to right by a weighted sum followed by a non-linearity. f(·) could, for example, be f(x) = sigmoid(w1x1 + w2x2 + b), where b is a scalar bias term and the example non-linearity is the sigmoid function. The loss, L(ŷ, y), is computed (right) using the network’s output, ŷ, and a known data label, y.

Neural networks are stacked multi-layer processing machines. Each successive layer performs a linear mapping on its input followed by a non-linear activation function. Figure 2.1 illustrates processing for a three layer fully-connected neural network. Each layer of computation can be thought of as performing feature engineering on its inputs that yields an internal hidden representation (Section 2.1.1). Neural networks are optimised end-to-end; the resultant hidden representations are learned from data (both x and ŷ in Figure 2.1) instead of being hand-crafted.

Stochastic gradient descent (SGD) through back-propagation is the standard optimisation procedure for neural networks. A loss is computed for parameter updates; the gradient of the loss with respect to the parameters defines the updates. These updates – effectively the learning signal – are back-propagated through a neural network. The current state and relative performance of each layer affects the quality of the updates. Deeper models suffer more from gradient degradation because the learning signal to earlier layers can be made overly noisy by its back-propagation through many layers. A number of modern techniques (Section 2.2) alleviate challenges associated with training very deep neural networks.

Modern neural networks often have millions (or even billions) of parameters [38], but orders of magnitude fewer training samples. Regularisation combats overfitting in complex models but does not fully account for the generalisation capability [52]. Generalisation in deep learning is an ongoing topic of research [21]. The thirst for understanding (Section 2.3) why and how neural networks function is the driving force behind this research and thesis.
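To make the layer-by-layer computation and the SGD update concrete, the following minimal sketch (in PyTorch, with illustrative layer sizes and a toy minibatch, none of which come from the thesis) builds a small fully-connected network like the one in Figure 2.1 and applies one parameter update.

```python
import torch
import torch.nn as nn

# Three layers, each a weighted sum followed by a sigmoid non-linearity.
net = nn.Sequential(nn.Linear(2, 3), nn.Sigmoid(),
                    nn.Linear(3, 3), nn.Sigmoid(),
                    nn.Linear(3, 1), nn.Sigmoid())
opt = torch.optim.SGD(net.parameters(), lr=0.1)

x, y = torch.randn(8, 2), torch.rand(8, 1)             # toy minibatch
loss = nn.functional.binary_cross_entropy(net(x), y)   # L(y_hat, y)
opt.zero_grad()
loss.backward()   # back-propagate the learning signal through all layers
opt.step()        # one SGD parameter update
```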

2.1 Convolutional Neural Networks

Convolutional neural networks (CNNs) leverage the spatial equivariance property of images; the content of an image can vary in spatial location without affecting how CNNs compute features. Convolutional kernels are structured two-dimensional shared weight matrices that are the basis of convolutional layers. Kernels are convolved over their inputs to extract features for the task(s) at hand. Stacking convolutional layers induces hierarchical features [27] that become more task-relevant with depth. For example, shallow features may be edges and deep features may be patterns that form faces. The idea behind CNNs was first called the ‘Neocognitron’ [7] but only gained popularity when processing power and data-availability allowed for the widespread successful use of CNNs [25].

2.1.1 Hidden Representations

Figure 2.2: A convolution where the input (left) is an image with three colour channels. The kernel (middle – lightness representing weights) is convolved over the entire input space, with each unique position involving a dot product with a receptive window of the input and the kernel, to yield a single scalar value. That scalar value passes through a non-linear activation function, and the output becomes a value in the feature map (right). There are 28 × 28 unique locations for the kernel, hence the reduction in spatial size of the feature map.

Figure 2.2 illustrates a convolution using a kernel size of (3 × 3). The input is an image in this case and the convolution results in a single feature map output. The dot product of a windowed region from the input – called the receptive field – and the kernel results in a single scalar value in the feature map. The convolution repeats with the same kernel parameters for every unique location in the input. A ‘strided’ convolution reduces the spatial size of the feature map by applying the convolution at steps greater than one. In so doing, subsequent convolutions span a greater receptive field. Every resultant value passes through a non-linear function to construct the feature map.

The height and width dimensions are reduced by either: (1) taking a summary statistic such as mean or maximum of a windowed region, called pooling; or (2) strided convolutions. Spatial dimensionality reduction induces a larger receptive field for successive convolutions. A larger receptive field enables features that span a larger portion of the image. Moreover, many kernels can be learned for any given layer, each of which results in a single channel output. Thankfully, modern GPU libraries are optimised to perform convolutions since these can be computationally intensive.

Hidden representations are computed by iteratively stacking convolutional layers that may use striding and/or pooling. The primary focus of this thesis is the way that these representations learn to selectively keep or discard information. We discuss modern techniques that make it possible to train deeper neural networks in the next section.
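As a small illustration of these shape mechanics (not taken from the thesis; kernel counts and sizes are arbitrary assumptions), the following PyTorch snippet applies an unpadded convolution and a strided convolution to a 32 × 32 RGB input and prints the resulting feature map sizes.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)                 # (batch, channels, height, width)

conv = nn.Conv2d(3, 1, kernel_size=3)                 # one 3x3 kernel, no padding
strided = nn.Conv2d(3, 16, kernel_size=3, stride=2)   # 16 kernels, stride 2

h = torch.sigmoid(conv(x))                    # non-linearity applied to each value
print(h.shape)                                # (1, 1, 30, 30): 30x30 unique kernel positions
print(strided(x).shape)                       # (1, 16, 15, 15): stride 2 shrinks the map
```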

2.2 Modern Techniques for Deeper Networks

Earlier research focussed on finding means to make CNNs deeper [44], since an increase in depth typically improves convolutional model performance. Other generalisation advantages, such as speedier compression [41], make depth a favourable characteristic. Unfortunately, we cannot stack naively convolutional layers without incurring a severe increase in computational cost, or making the network untrainable because of gradient back-propagation issues. Batch normalisation (Section 2.2.1) and residual connections (Section 2.2.2) enable the training of deeper neural networks. We do not discuss other techniques – such as improved initialisation schemes [49] – that do not feature in the models under scrutiny in this research.

2.2.1 Batch Normalisation

Batch normalisation (BatchNorm) [19] is a technique that enables faster and more stable training. The activations of hidden layers are normalised on a minibatch basis. That is, a BatchNorm layer normalises minibatch activations so that each dimension has a mean of zero and unit variance. Two parameters are incorporated to keep the representation capacity of the model; the original activations can be recovered if they result in better performance. When employing BatchNorm, network parameters need not adapt to changes owing to differences between minibatches. Research has shown that BatchNorm has a positive effect on the characteristics of the loss surface [35]. However, minibatch size-dependence can be problematic. Running means and variances are learned and kept for inference, to alleviate minibatch size-dependence.
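A minimal sketch of the training-time normalisation described above (it omits the running statistics kept for inference, and the tensor shapes are assumptions, not the thesis's implementation):

```python
import torch

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Normalise each feature dimension of a minibatch to zero mean and unit
    variance, then scale and shift with the two learned parameters gamma and
    beta, which preserve the representational capacity of the layer.
    x: (batch, features)."""
    mean = x.mean(dim=0)
    var = x.var(dim=0, unbiased=False)
    x_hat = (x - mean) / torch.sqrt(var + eps)
    return gamma * x_hat + beta
```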

Minibatch size-independent normalisation techniques were proposed in subsequent research [2, 34]. The exploration of optimal normalisation techniques is not the focus of this thesis. We discuss residual connections in the next section.

2.2.2 Residual Connections

Figure 2.3: A residual block consists of two branches: (1) a processing branch involving convolutions and batch normalisation (Section 2.2.1); and (2) an identity skip connection that gets added to the result of processing, f(x) + x, before the final activation. Figure adapted from the original work on ResNets [15].

Figure 2.3 shows the residual block – the basis of residual neural networks (ResNets) [15]. Training of very deep neural networks is stabilised when using residual connections because the gradient signal can flow relatively unhindered along the skip connections. Information is also passed forward along residual connections, meaning that features can be reused in later processing stages. ResNets are widely used since they almost always improve performance and are not computationally costly. The following section explains why better neural network theory and empirical studies are paramount.
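A sketch of the block in Figure 2.3 (channel counts and the ReLU choice are illustrative assumptions, not the exact blocks used in the models under scrutiny):

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two convolution + batch-normalisation stages on the processing branch,
    plus an identity skip connection added before the final activation."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.act = nn.ReLU()

    def forward(self, x):
        out = self.act(self.bn1(self.conv1(x)))   # f(x): processing branch
        out = self.bn2(self.conv2(out))
        return self.act(out + x)                  # f(x) + x, then activation
```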

2.3 Thirst for Understanding

A long-term goal of this research is to build and improve fundamental understandings of neural networks. The cliché explanation that neural networks are ‘black boxes’ [1], along with the common unnecessary marvel at their surprising performance, argues for the need to enrich understanding of the field. Comprehensive theoretical and empirical understanding of the optimisation processes that encourage good generalisation is an important milestone to work towards.

2.3.1 Deeper Representations Disentangle Better

Table 2.1: An example of how deeper features disentangle relevant information. Each row pairs an image (‘what we see’) with examples of what a machine may see: the values of the raw pixels (middle; red, green, and blue), represented as binary encoded blocks, and deeper features (right) such as ‘legs, ears, fur, grass, tail, snout’ for one image and ‘wheels, glass, road, lights, mirrors’ for the other. The raw pixels hold all the relevant information of the image, but not in a representation that is always easy to understand for either humans or machines. The features are deeper representations and are more informative in helping decide what the image may be, but may also require many layers of processing to compute.

A reasonable explanation for the success of neural networks is that they discover useful levels of representation through disentangling as many factors as possible [3], while discarding as little information as is necessary. Current information-theoretic neural network research [45, 46, 41, 36] focusses on the compression of irrelevant information. This is covered in detail in Chapter 3.

Deeper representations disentangle better [3]. Higher-level abstractions are easier to distinguish than low-level data. Consider Table 2.1 for an illustration of this: the values of the raw pixels are not readily informative of the image content; the features are immediately informative. Low-level pixel values are difficult to directly understand, even for humans, unless presented in a manner suited to our visual processing system. Another perspective as to why neural networks tend to perform better than other models is that depth offers a computational advantage for generalisation. Shwartz-Ziv and Tishby [41] argued that depth reduces the computational strain of compressing superfluous information. The colour of the background in the examples in Table 2.1 demonstrates superfluous information.

2.3.2 Rethinking Generalisation

Zhang et al. [52] highlighted the need to rethink generalisation. Conventional wisdom regarding generalisation is not readily applicable to neural networks. Conventional wisdom says that models tend to overfit when the number of parameters is much higher than the number of samples. This is not the case for neural networks. Regularisation techniques such as dropout [43] – originally described as a simple method to prevent neural network overfitting – do improve generalisation, but they are not solely responsible for it. The next section discusses some examples of frameworks, approaches, and tools developed to query generalisation dynamics in neural networks.

2.3.3 Frameworks, Approaches, and Tools

Principled approaches, tools, or frameworks make it possible to observe, deduce, or recognise the characteristics of neural networks that cause good performance and generalisation. A task to benchmark the disentanglement capability of a learning algorithm [13] exemplifies a principled approach. This particular task was designed to be very difficult to accomplish without disentanglement. The addition of linear classifier probes at hidden layers [1] enables measuring how well each layer creates linearly separable representations (regarding a target task) of the input. Neural networks that were judged empirically to be untrainable can be trained by employing linear probes. We also use linear probes in this thesis to query whether hidden layers become more linearly separable as learning progresses. Another framework being developed and used to understand neural networks is the information bottleneck (IB) interpretation of deep learning. Conclusions regarding

neural network functionality and generalisation can be drawn by assessing and tracking the shared information content between hidden representations and data (input or labels).

Conclusion

This chapter gave a technical background to deep neural networks, explaining the widely used convolutional architecture and modern techniques that are typically used to enable training and achieve suitable convergence in very deep networks. The justification behind the analysis expressed within this thesis – the thirst for understanding how, why, and what networks learn – was also expressed, with accompanying examples of tools and approaches used in earlier literature. The following chapter delves deeper into the information theoretic analysis of deep neural networks.

Chapter 3

Information Theoretic Analysis of Deep Neural Networks

Entropy

Information Theory is the mathematical theory of communication [37]. It is concerned with the coding of information, and with its quantification, transmission, and storage. A key concept is information entropy:

H(x) = -\sum_{x \in \mathcal{X}} p(x) \log p(x),    (3.1)

where x is a random variable and p(x) is the probability that it takes on the value x. Information entropy is the expected negative logarithm of the probability mass function. It is measured in bits, nats, or bans for logarithm bases of 2, e, and 10, respectively. We will be using nats in our assessments. The highest entropy data source is most surprising in that any occurrence is equally likely. For example, a fair coin toss is uniformly random and is the highest entropy binary data source. The lowest entropy data source is entirely unsurprising and predictable. To extend the coin toss example, the entropy of a (perfectly) rigged coin is zero. Information entropy can be thought of as the rate of information produced by a stochastic data source.

Conditional entropy

Conditional entropy quantifies the amount of information needed to describe a random variable, x, given the state of another random variable, y. It is defined as

H(x \mid y) = \sum_{y \in \mathcal{Y}} p(y) H(x \mid y = y)
            = -\sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(y) p(x \mid y) \log p(x \mid y)
            = -\sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(x, y) \log \frac{p(x, y)}{p(y)}
            = \sum_{y \in \mathcal{Y}, x \in \mathcal{X}} p(x, y) \log \frac{p(y)}{p(x, y)}.    (3.2)

Mutual information

Entropy and conditional entropy are used to define the mutual information (MI):

I(x; y) = H(x) - H(x \mid y).    (3.3)

MI is the relative entropy, known as the Kullback-Leibler (KL) divergence, between the joint distribution of two random variables and the product of their marginals:

I(x; y) = D_{KL}\big(p(x, y) \,\|\, p(x) p(y)\big) = \sum_{x \in \mathcal{X}, y \in \mathcal{Y}} p(x, y) \log \frac{p(x, y)}{p(x) p(y)}.    (3.4)

Though entropy is only defined for discrete variables, relative entropy is meaningful for continuous variables. The summation is replaced with an integral:

I(x; y) = \int_{y} \int_{x} p(x, y) \log \frac{p(x, y)}{p(x) p(y)} \, dx \, dy.    (3.5)
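As a small worked example of the discrete definitions above (the joint distribution is invented for illustration and is not from the thesis), the following snippet evaluates Equations 3.1 and 3.4 in nats:

```python
import numpy as np

p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])        # joint p(x, y); rows index x, columns index y
p_x = p_xy.sum(axis=1)               # marginal p(x) = [0.5, 0.5]
p_y = p_xy.sum(axis=0)               # marginal p(y) = [0.5, 0.5]

H_x = -np.sum(p_x * np.log(p_x))                        # Equation 3.1: ~0.693 nats (a fair coin)
mi = np.sum(p_xy * np.log(p_xy / np.outer(p_x, p_y)))   # Equation 3.4: ~0.193 nats shared
print(H_x, mi)
```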

3.1 Mutual Information as a Tool

The MI between two random variables is the shared information between them. It is the reduction in uncertainty about a random variable given knowledge of another. It is symmetric (I(x; y) = I(y; x)) and always non-negative.

Why MI for neural network analysis?

Why do we care about MI when examining neural networks? If we consider that the hidden layers of neural networks process an input random variable, x, into representations that themselves are random variables, h, we can construct an analysis of I(x; h). Now consider the standard scenario of image classification using a CNN: the data source (image) is a multi-dimensional random vector, x; y is the corresponding one-hot encoded label vector; and h is the multi-dimensional activations of some hidden layer within the CNN.

MI allows us to query the information shared between input data and a hidden representation – I(x; h) – or the analogous quantity with respect to labels – I(y; h). As useful as that interpretation is, computing MI is not straightforward for high-dimensional quantities: it requires assessing the integral of a joint probability distribution, which is combinatorial. Instead, in this thesis we estimate a MI lower bound using a model. The details of this estimate are discussed in Section 4.1 and concerns regarding the tightness of this bound are discussed in Section 4.2.

Investigating why neural networks generalise efficiently can be done in a principled manner by querying and tracking MI quantities as learning progresses. One of the confounding features of neural networks is that they exhibit state-of-the-art performance, regarding generalisation, even when their number of parameters is much greater than the number of data items used to train them. The existing body of work on the IB interpretation (Section 3.2) gives insight as to why this may be the case.

Contribution

Most of the earlier work on MI analyses – detailed in the next section – typically used limited architectures and toy datasets or problems. The machine learning community is currently debating the conclusions drawn and insights gleaned. Assessing modern neural network architectures using realistic image data is our contribution to this area of discovery. We rephrase the MI analysis as: how well can we learn to decode the activations of a hidden layer to predict either the input image or the corresponding label?

3.2 The Information Bottleneck Interpretation

Imagine a situation in which you are given a large number of extensive descriptions about ten people’s faces. These descriptions are open-ended and were compiled by a separate observation group. Each description corresponds to a unique observer. Your task is to assign a label to each extensive description. As you read these descriptions you quickly determine commonalities, such as the colour of eyes or the length of hair. You gain an inkling for what aspects of these descriptions to look out for. You also begin to whittle down the details of the descriptions that seem irrelevant to your task: ‘this is a face’, ‘this person is smiling’, or ‘there is a giant shark in the background’. You end up making decisions about what is important and what is superfluous, and in so doing you build an internal representation of these descriptions that enables you to group them fairly easily. The IB interpretation of deep learning is about the idea of learning by forgetting the irrelevant details of input data (with respect to the target task). A main claim of the IB interpretation is that the efficacy and correctness of this forgetting process has a direct impact on the generalisation of the learned representations.

Application to deep neural networks

Recent research [46, 41] viewed deep learning as optimal representation learning. In this perspective, the input data source, x, is a random vector and the targets or labels, y, form another random vector. The relevant information between x and y is defined as I(x; y). Tishby and Zaslavsky [46] argued that an optimal representation of x captures all the features relevant for predicting y, and compresses all irrelevant aspects. The relevant components of x with respect to y are called the minimum sufficient statistics of x with respect to y. The optimal representation can be postulated as a random vector, h. Hence we can write that formulation as the graphical model

x \rightarrow h \rightarrow y.    (3.6)

Any noise in the system is induced via noise in x alone unless it is intentionally added to the system. Viewing neural networks like this means that we can use information theoretic quantities to query their learning dynamics in a model-agnostic fashion [36].

Since information cannot be created, neural networks must obey the data processing inequality (DPI) [5]:

I(x; y) \geq I(h; y),    (3.7)

where any processing only ever destroys or retains information.

IB method for compression trade-offs

The IB method [45] presented the following functional:

\mathcal{L}[p(x, h)] = I(x; h) - \beta I(h; y),    (3.8)

with the positive Lagrange multiplier, β, acting as a trade-off parameter between the compression complexity rate,

R = I(x; h),    (3.9)

and the information preserved, I(h; y). This is minimised using a set of self-consistent equations for the mappings p(h|x), p(h), and p(y|h), with x ∈ \mathcal{X} (see Appendix A). The IB method is a direct iterative approach to find the approximate optimal h that compresses x for the prediction of y. In other words, it is a means of finding the minimum sufficient statistics of x that predicts y most generally.

Compression via noise for the optimal IB bound solution

By controlling the value of β in Equation 3.8, Tishby and Zaslavsky [46] constructed an experiment to show that neural networks do actually learn optimal representations. Shwartz-Ziv and Tishby [41] later claimed that the reason for this is that neural networks have two distinct phases of training, namely empirical error minimisation (fitting) and representation compression. The compression stage pushes the hidden representations toward an optimal state where information in the input is discarded when it is irrelevant to the target. The flexibility and architecture of the model being trained constrain the optimality of the representations learned.

That research [41] posited that the representation compression phase is the crucial component of neural network generalisation and is due to a phenomenon known as stochastic relaxation. This stochastic relaxation resembles diffusion, is characterised by gradients with a low signal-to-noise ratio (SNR), and thus mostly involves adding random noise to the weights of the neural network. In so doing, the entropy of the weights is maximised (but constrained by the empirical error related to the prediction, y|h). That, in turn, maximises the conditional entropy, H(x|h), where h is the hidden layer computed by the aforementioned weights. Considering Equation 3.3, we see that this is equivalent to minimising the MI between hidden layers and the input, since the entropy of the input remains constant. Stochastic relaxation can be tracked by monitoring the SNR of the weight updates

in a neural network [41, 36]. Consider some weights, W_{h_j}, for layer j that produce a hidden representation, h_j, from the activations immediately preceding these weights. The learning signal is quantified by the Frobenius norm of the mean (over a minibatch) of the gradients with respect to these weights:

m_j = \left\| \left\langle \frac{\partial L}{\partial W_{h_j}} \right\rangle \right\|_F;    (3.10)

the noise is quantified as the Frobenius norm of the standard deviation (over a minibatch) of the same gradients:

s_j = \left\| \mathrm{STD}\!\left( \frac{\partial L}{\partial W_{h_j}} \right) \right\|_F;    (3.11)

and

\mathrm{SNR}_j = m_j / s_j.    (3.12)
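A sketch of how these quantities could be computed in PyTorch (a slow but simple per-example loop; the model, loss function and parameter names are placeholders rather than the thesis's implementation):

```python
import torch

def weight_update_snr(model, loss_fn, x, y):
    """Per-layer SNR of the weight updates (Equations 3.10-3.12): the Frobenius
    norm of the minibatch mean of the per-example gradients, divided by the
    Frobenius norm of their standard deviation."""
    per_example = {n: [] for n, p in model.named_parameters() if p.requires_grad}
    for i in range(x.size(0)):                 # gradients one example at a time
        model.zero_grad()
        loss_fn(model(x[i:i + 1]), y[i:i + 1]).backward()
        for n, p in model.named_parameters():
            if p.requires_grad:
                per_example[n].append(p.grad.detach().clone())
    snr = {}
    for n, grads in per_example.items():
        g = torch.stack(grads)                 # (batch, *param_shape)
        m = g.mean(dim=0).norm()               # ||<dL/dW>||_F
        s = g.std(dim=0).norm()                # ||STD(dL/dW)||_F
        snr[n] = (m / (s + 1e-12)).item()
    return snr
```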

What makes neural networks better than many other models at compressing the irrelevant features of the input space? The answer to this question may help explain why they are able to generalise well. According to the IB perspective, the layered structure of multiple consecutive hidden representations affects the stochastic relaxation process in that it shortens the time taken to compress information. The computational benefit is exponential in the number of layers, while the cost of forward and backward propagation only grows linearly. The findings by Shwartz-Ziv and Tishby [41] are perhaps seminal yet certainly under scrutiny and debate (see Section 3.3). Primarily, these are:

• Learning happens in two phases, namely empirical error minimisation and representation compression.

• Neural networks generalise well because they learn optimally compressed internal representations.

• Generalisation by noise (stochastic relaxation) is unique to neural networks trained using stochastic gradient descent.

• Error minimisation (fitting) happens rapidly while compression takes much longer; the primary benefit of depth is computational in that it shortens the compression time.

3.3 Is Compression Necessary for Generalisation?

Saxe et al. [36] constructed several experiments to explore the IB interpretation for generalisation in deep learning. That work endeavoured to demonstrate (1) how the choice of non-linearity with a binning methodology for MI computation may explain the compression evidenced by Shwartz-Ziv and Tishby [41]; (2) that generalisation may not require compression; (3) that stochastic relaxation is not the mechanism of compression; and (4) that a phase transition from error minimisation to compression is not clear from changes in the weight updates’ SNRs.

3.3.1 Tanh Non-linearity and Binning

The hyperbolic tangent function was the non-linearity of choice in the exploration of the IB interpretation of deep learning [41]:

\tanh(x) = \frac{2}{1 + e^{-2x}} - 1.    (3.13)

Figure 3.1 shows the tanh activation function and the bins used to calculate MI. This is a saturating function: the activities tend to the values {−1, 1} as the absolute values of the inputs grow. Saxe et al. [36] argued that it is the combination of binning and a saturating non-linearity that caused the evidence of compression. This is because saturation becomes more prevalent in a network as it learns and a greater number of units map their input to fewer bins. This many-to-one mapping reduces the entropy of the hidden representation, which has a direct consequence on the MI. To test this hypothesis, a non-saturating non-linearity – the widely used rectified linear unit (ReLU: max(0, x)) – was applied and tested.

Figure 3.1: Hyperbolic tangent activation function and binning procedure (activity plotted against input over [−4, 4]). The white lines denote the regions within which the activities are binned to compute MI. As more units saturate, more activities fall within the lowest and highest bins.

ReLUs are unconstrained and thus require a different binning procedure: the maximum activity value was pre-determined by running the experiment to completion. One hundred bins were constructed over this range. Although compression was not evident in this experimental procedure, the maximum-value binning technique did not pay heed to maximum entropy binning [30] and was a source of contention regarding results.
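To make the binning methodology concrete, here is a minimal sketch of the discretisation-based estimate used in these toy settings (bin edges, bin counts, and the NumPy formulation are illustrative assumptions). Because the mapping from input to hidden layer is deterministic, I(x; h) reduces to the entropy of the discretised layer, H(h):

```python
import numpy as np

def binned_mi_nats(activations, n_bins=30, lo=-1.0, hi=1.0):
    """activations: (n_examples, n_units) hidden-layer outputs.
    Discretise each unit's activity into fixed bins, treat every unique row of
    bin indices as one discrete state of the layer, and return H(h) in nats,
    which equals I(x; h) when h is a deterministic function of x."""
    edges = np.linspace(lo, hi, n_bins + 1)
    codes = np.digitize(activations, edges)               # per-unit bin indices
    _, state = np.unique(codes, axis=0, return_inverse=True)
    p = np.bincount(state) / state.size                   # empirical p(h)
    return float(-(p * np.log(p)).sum())
```

For a saturating tanh layer the edges span [−1, 1]; for ReLU a maximum value must be chosen, which is exactly the point of contention described above.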

3.3.2 The Question of Compression

Is compression the vital component for neural network generalisation? Experiments were designed to test whether compression and generalisation are causally related. These are discussed next.

Generalisation without compression

Saxe et al. [36] showed that a deep linear network, where MI was directly computable in closed form, can generalise without compression. The toy example constructed to demonstrate this [36] may itself explain the lack of compression: the input space contained the same quantity of information throughout (it was a multivariate Gaussian).

Progressive separation and contraction

Jacobsen et al. [20] queried whether compression is necessary for generalisation by constructing a fully invertible CNN (up to the projection to the final layer) and training it on ImageNet [6]. Accuracy comparable to the state-of-the-art on the ImageNet classification task evidenced that generalisation can occur without compression, for an invertible network. They posited that a counter-explanation for generalisation is the progressive separation and contraction [31] of the input space, induced by stacked hidden layers.

Contraction, regarding neural networks, means a reduction in the input space such that intra-class samples have a greater regularity of representation. Separation, on the other hand, refers to the ease with which inter-class samples can be separated. Indeed, considering the standard classification task, neural networks are explicitly trained to map the input data space into a linearly separable vector of class probabilities. Experimental evidence aside, the argument that the advantage of depth is to progressively improve separation and contraction is intuitive and instructive, yet without a fundamental theoretical basis.

Although intentionally reversible networks [20] do not discard irrelevant components of the input space, we postulate that these instead progressively separate the irrelevant components from the relevant, allowing the final mapping (onto a classification vector) to easily discard (compress) this information. This suggestion allows for agreement between the IB perspective and progressive separation and contraction. Our suggested postulation will be tested in future work.

3.3.3 Two Stages of Learning and Stochastic Relaxation

Saxe et al. [36] showed that even neural networks that do not evidence compression (namely deep linear networks and networks with ReLU non-linearities) still exhibited two stages of learning. The diffusion-like process of stochastic relaxation was previously described as the mechanism through which neural networks compress [46, 41] and linked to low SNRs regarding weight updates. Although non-compressing networks also exhibited a switch from high to low SNRs in the weight updates, these results [36] are confounded because the computation of MI is questionable for ReLU activations owing to the chosen binning procedure [30].

3.3.4 What of the IB Interpretation?

The experiments carried out and conclusions drawn by Saxe et al. [36] are opposed to the main claims of the IB interpretation [41, 46, 45]. A debate on these topics unfolded on the OpenReview page of the refutation research [30]. The opinions and deductions thereon were not fully resolved. Nonetheless, these perspectives and experiments advance our knowledge. We do not offer fundamental theoretical conclusions regarding these issues. Instead, we endeavour to determine empirically, through the computation of MI using decoder models, whether the compression phenomenon is evident in a modern setting. Consider again the IB optimal bound and the evidence that neural networks with tanh non-linearities converge close to this bound [41]. It may be argued that the IB interpretation of compression for generalisation is sound, but also not immediately applicable because:

• For saturating non-linearities, the networks evidence compression directly through this saturation, yet do generalise well – the IB bound is tight.

• For non-saturating non-linearities or intentionally non-forgetting networks like reversible networks [20], compression is not immediately evident as measured by Saxe et al. [36] – the IB bound theorem is not yet applicable, particularly when using a maximum-value binning procedure (Section 3.3.1).

Our approach

It would seem that the experimental set-up in earlier research [41, 36], particularly with regard to how MI was measured, confounded whether compression occurs or is even necessary. The approach we take in this thesis is to compute the MI using a decoding model. For estimating the MI between images and hidden layers, we use a state-of-the-art autoregressive PixelCNN++ [33] decoding model (Section 4.1.2). We rephrase the following quantities as questions:

• I(x; h): how well can we predict the input data (images) given the activations of a hidden layer?

• I(y; h): how well can we predict the labels given a fixed set of activations computed from input images?

3.4 Further Related Information Theory Analyses

Measuring MI is difficult. Whether using a parametric method for an upper bound [22], monitoring neural networks that are designed so as to compute the MI [8] using heuristic methods, or using a proxy quantity for information entropy [51], these techniques all have disadvantages and application spaces. The most relevant work regarding MI tracking using decoding models also uses the PixelCNN++ architecture [29].

3.4.1 Inverting Supervised Representations

Concurrent to the work expressed in this thesis, Nash et al. [29] trained a number of conditional PixelCNN++ models to ‘invert’ representations learned by a simple CNN classifier. Using the MNIST [26] and CIFAR-10 [24] image datasets, they showed how a PixelCNN++ can be used as a tool to analyse the invariances and behaviour of hidden representations. The MI was also tracked using the PixelCNN++ models, revealing that CNNs do evidence compression behaviour. They did not find an initial data-fitting phase in all but the first convolutional layer; this is largely owing to the granularity of their analysis. Computing the MI prior to any learning, as we do in Chapters 5 and 6, may have solved this. They also went on to show the effect of (1) regularisation and (2) global pooling. Regularisation induces more compression and global pooling results in location-decoupled representations. Owing to the fact that the above-mentioned work is close to the research we are presenting in this thesis, a number of differences should be highlighted:

1. The models under scrutiny in our work (Section 4.3) use a ResNet architecture and BatchNorm, and are therefore an order of magnitude deeper (twenty-one convolutional layers versus two). The representations owing to the deeper convolutional layers show clearer invariances to intra-class differences (Chapters 5 and 6) because of the increased capacity. We also provide conditional samples at critical points in learning to illustrate what information these models learn to ignore.

2. We test both supervised (Chapter 5) and unsupervised (Chapter 6) training regimes and compare the differences by tracking MI.

3. We trained a larger number of PixelCNN++ models for a finer granularity view of the MI in early stages of learning.

4. A new and larger dataset – CINIC-10 – was compiled to improve this information theoretic analysis (see Section 4.4.1). PixelCNN++ models tend to overfit training data relatively early on; Nash et al. [29] pointed this out, noting that density estimation in higher dimensions is challenging. A more encompassing dataset partly alleviates this issue.

5. We also track the MI in the forward direction to interpret how well the hidden representations capture relevant information about labels.

Conclusion

In this chapter we explained how mutual information can be used as a tool to interpret learning by stochastic gradient descent in deep neural networks. The information bottleneck interpretation of deep learning was explained, with particular reference to how mutual information between input data and hidden representations decreases as learning progresses. This compression phenomenon was discussed as a questionable necessity for ‘generalisation by forgetting’: the ongoing debate in the machine learning community was outlined and our contribution was posited. Closely related work was also discussed. The following chapter describes the investigation framework developed for our research.

Chapter 4

Investigation Framework

This chapter introduces the framework we developed to investigate learning in neural networks in the context of realistic image data. The foremost quantity in our analysis is mutual information (MI, Equation 3.5). It is well defined but almost always intractable to calculate. Cases where MI is tractable are, for example, when the random variables are both Gaussian distributed, or where the variables can take on a limited number of discrete values.

An upper bound of the MI can be established by adding Gaussian noise to each dimension of the hidden representation and then applying a kernel density estimate approach. Kolchinsky et al. [22] computed an upper bound on the MI using a non-parametric density estimate. Unfortunately, non-parametric techniques do not scale well to high-dimensional input, and are therefore not suitable for our analysis. Further, the aforementioned upper bound is only a tight bound when the hidden representations cluster into well-separated classes [22].

Owing to the complex and high-dimensional nature of realistic images, an analytic solution for the MI between the activations of hidden layers and images (or targets) is not available. Therefore, we compute a lower bound using a decoding model. Our analysis effectively asks the question: can we use the activations of hidden layers in a neural network to predict the training data? We show how this is equivalent to estimating a lower bound on the MI (Section 4.1) and present models under which this can be estimated, for:

1. I(x; h): the MI between input images and some hidden layer’s activations (Section 4.1.1).

2. I(y; h): the MI between classification targets and some hidden layer’s activations (Section 4.1.2).

We track the MI in two models: a classifier (Chapter 5) and an autoencoder (Chapter 6). Section 4.3 details the particular models under scrutiny. We discuss in Section 4.4.1 the dataset, CINIC-10, compiled for our analysis.

4.1 MI Using a Model: a Lower Bound

Consider that the MI between an input random vector, x, and the corresponding hidden representation, h, with p_D as the true data distribution, can be expressed as:

I(x; h) = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x, h)}{p_D(x) p_D(h)} \right]
        = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)}{p_D(x)} \right]    (4.1)
        = \mathbb{E}_{p_D(x, h)} [\log p_D(x \mid h)] - \mathbb{E}_{p_D(x)} [\log p_D(x)]
        = \mathbb{E}_{p_D(x, h)} [\log p_D(x \mid h)] - C.

Since the entropy of the data is constant we need not estimate C. Here h is a deterministic mapping of x through a neural network and any randomness in h is purely owing to randomness in x. We introduce an auxiliary distribution, q(x|h), that will be approximated by models (Sections 4.1.1 and 4.1.2) in order to bound this estimate from below:

I(x; h) = \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)\, q(x \mid h)}{q(x \mid h)} \right] - C
        = \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] + \mathbb{E}_{p_D(x, h)} \left[ \log \frac{p_D(x \mid h)}{q(x \mid h)} \right] - C    (4.2)
        = \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] + \mathbb{E}_{p_D(h)} [D_{KL}(p_D(x \mid h) \,\|\, q(x \mid h))] - C
        \geq \mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] - C.

This follows because

\mathbb{E}_{p_D(h)} [D_{KL}(p_D(x \mid h) \,\|\, q(x \mid h))] \geq 0,    (4.3)

since D_{KL} \geq 0. We use sufficiently large data (of size N) for an unbiased Monte Carlo estimate,

\mathbb{E}_{p_D(x, h)} [\log q(x \mid h)] \simeq \frac{1}{N} \sum_{x^{(i)}, h^{(i)}} \log q\big(x^{(i)} \mid h^{(i)}\big).    (4.4)

An analogous derivation can be used for I(y; h). We define the decoder models q(x|h) and q(y|h) in Sections 4.1.1 and 4.1.2, respectively. The KL-divergence between the true and estimated distributions determines the tightness of the bound (Section 4.2). Training the decoding models minimises the expected KL-divergence in Equation 4.3.
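In code, the Monte Carlo estimate in Equation 4.4 is simply an average of decoder log-likelihoods over held-out pairs (x, h). The sketch below assumes a hypothetical `decoder.log_prob(x, h)` interface returning per-example log-likelihoods in nats (for example from a conditional PixelCNN++); it is not the thesis's exact code.

```python
import torch

@torch.no_grad()
def mi_lower_bound_term(encoder, decoder, loader, device="cpu"):
    """Estimate E_{p_D(x,h)}[log q(x|h)] (Equation 4.4). Up to the constant
    C = E[log p_D(x)], this lower-bounds I(x; h)."""
    total, count = 0.0, 0
    for x, _ in loader:
        x = x.to(device)
        h = encoder(x)                    # deterministic hidden representation
        log_q = decoder.log_prob(x, h)    # assumed interface: (batch,) in nats
        total += log_q.sum().item()
        count += x.size(0)
    return total / count
```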

4.1.1 Forward Decoding for Label MI

Neural network classifiers attempt to maximise the log-likelihood of the target data, and in so doing directly maximise the MI between all hidden representations and targets [46]. To compute the MI between a given hidden representation and the targets at a point in model training we need:

1. The mapping from x to h to be fixed – any weights or other parameters associated with this mapping are frozen.

2. A neural network trained to maximise log q(y|h) – essentially a classifier that takes as input h. Note that this minimises the left hand side of Equation 4.3.

3. Enough data to train the model under scrutiny, to train the decoder model, and to evaluate the MI satisfactorily.

4. Calibration to ensure the model does not overfit, and hence produce a poor bound in Equation 4.2.

For example, consider a classifier with three hidden layers. We aim to compute the MI between the second hidden layer and the targets at some point in training. We freeze all weights that result in the second layer’s activations. The remaining weights are reinitialised and fitted, on a separate portion of the data, to estimate I(y; h) (sketched below). An entirely held out portion of the data is used to compute the MI lower bound estimate. This mitigates any overfitting or data-contamination. However, it also introduces a new challenge: we require three splits of the data, each of which should be as large as possible. Section 4.4.1 describes this challenge in more detail and offers a solution – the compilation of a new dataset.
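A sketch of the freeze-then-decode step (the encoder slicing, decoder architecture and dimensions are illustrative placeholders, not the exact models used in the thesis):

```python
import torch.nn as nn

def make_forward_decoder(encoder_up_to_h, h_dim, n_classes=10):
    """Freeze the mapping x -> h, then return a fresh, trainable decoder that
    models q(y|h); it is fitted on a separate data split and evaluated on a
    held-out split to estimate the forward MI lower bound."""
    for p in encoder_up_to_h.parameters():
        p.requires_grad = False          # freeze the mapping x -> h
    encoder_up_to_h.eval()               # also fix BatchNorm statistics
    decoder = nn.Sequential(
        nn.Flatten(),
        nn.Linear(h_dim, 256), nn.ReLU(),
        nn.Linear(256, n_classes),       # logits for q(y|h)
    )
    return decoder
```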

Recalibrating probabilities

Although neural network classifiers are powerful function approximators that can provide accurate target predictions, they are also known to be poorly calibrated [14], partly due to overfitting. The confidence of predictions matches poorly to the expected accuracy, and the log-likelihood is overestimated. A modern neural network predicts with much greater certainty than necessary. Poor calibration does not affect accuracy, but rather the log-likelihood. This is problematic since we use a (possibly poorly calibrated) neural network as a forward decoding model to estimate MI. Therefore, confidence recalibration is necessary. Guo et al. [14] dealt with this mismatch between accuracy and log-likelihood using a method called temperature scaling, which we employ here. Temperature scaling adjusts the temperature of the softmax function (Equation 4.5) used to compute the output probability estimates. A higher temperature results in a ‘softer’ output and a higher entropy over the prediction. The temperature-scaled softmax, with temperature T, used to compute the probability of element j in a multi-class setting is:

\sigma_T(\mathbf{z})_j = \frac{e^{z_j / T}}{\sum_{k=1}^{K} e^{z_k / T}}.    (4.5)

Recalibration of a converged model minimises the negative log-likelihood on held-out data, by adjusting the softmax temperature alone.
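A sketch of the recalibration step (the optimiser choice, step count and learning rate are assumptions; Guo et al. describe the same single-parameter objective, typically optimised with L-BFGS):

```python
import torch

def fit_temperature(logits, labels, steps=200, lr=0.01):
    """Find the single temperature T that minimises the negative log-likelihood
    of held-out logits (Equation 4.5). logits: (n, K); labels: (n,) class ids."""
    log_t = torch.zeros(1, requires_grad=True)   # parameterise T = exp(log_t) > 0
    opt = torch.optim.Adam([log_t], lr=lr)
    nll = torch.nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = nll(logits / log_t.exp(), labels)  # softmax(z / T) inside the loss
        loss.backward()
        opt.step()
    return log_t.exp().item()
```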

Single layer linear decoding

Neural network classifiers are trained explicitly to yield a (final) layer that is linearly separable: the final set of weights maps the preceding activations to a one-hot encoded class prediction. Does this linear separability improve with depth and/or over training time? Alain and Bengio [1] used linear classifier probes to query how discriminating internal features were, but did not use these to specifically analyse learning dynamics. We adopt and extend linear probing as an alternative approach to forward decoding: a weight matrix is trained for each h decoding run to map it to a class prediction, asking how easily a learned representation can be linearly separated (see Section 5.2.2 for these results; a minimal probe is sketched below). We describe the task of measuring the inverse MI in the next section.
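The probe itself is just a single linear map trained on frozen activations (dimensions here are illustrative assumptions):

```python
import torch.nn as nn

class LinearProbe(nn.Module):
    """A single weight matrix from a (frozen) hidden representation h to class
    logits; its accuracy measures how linearly separable h is."""
    def __init__(self, h_dim, n_classes=10):
        super().__init__()
        self.fc = nn.Linear(h_dim, n_classes)

    def forward(self, h):
        return self.fc(h.flatten(start_dim=1))
```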

4.1.2 Inverse Decoding for Input MI

The high-dimensional and structurally complex nature of image data makes measuring the MI between inputs and hidden layers more challenging. We require (1) a modelling approach that sufficiently captures the relationship between inputs and hidden representations, and (2) direct access to the log-likelihood. Estimating these sorts of image data distributions is an ongoing topic of research in the machine learning community. The current state-of-the-art can be broadly sorted into implicit and explicit distribution estimators. Regarding implicit estimators, generative adversarial networks [12] are designed to generate data that can 'fool' a discriminator network tasked with discerning real from fake data. In this manner, the generated distribution matches the true data distribution without any explicit estimate of the log-likelihood. Unfortunately, regardless of their evidenced success at modelling image distributions, the lack of direct access to the data log-likelihood means these cannot be used in our analyses. Explicit estimators, on the other hand, can.

Autoregressive models

Autoregressive models estimate explicitly the data distribution. They leverage the fact that the joint distribution of an image can be decomposed as a product of conditional distributions:

\[
p(\mathbf{x} \mid \mathbf{h}) = \prod_{i=1}^{n^2} p(x_i \mid \mathbf{h}, x_1, \ldots, x_{i-1}), \tag{4.6}
\]

where, in the setting of this thesis, x is an image of n² pixels, each x_i is a pixel in an image, and the decomposition is in raster order (left to right, top to bottom). The conditioning variable h is included for completeness. PixelCNN [47] is a state-of-the-art formulation that effectively models image data distributions using this autoregressive decomposition. A PixelCNN model can be trained with or without a conditioning variable, and can also be used to generate images. On that note, images are generated one pixel per model evaluation, starting at the pixel in the top left corner of an image, and continuing in raster order. The generated pixels are fed back into the model for successive generations, thus allowing each pixel to depend in a highly non-linear and multi-modal manner on previously generated pixels [47].
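The generation procedure can be sketched as the loop below. Here `model(x, h)` is a stand-in for a conditional autoregressive model returning a per-pixel predictive distribution, so the `.sample()` call is schematic rather than the PixelCNN++ API.

```python
import torch

@torch.no_grad()
def sample_raster(model, h, image_shape=(3, 32, 32)):
    """Generate one image pixel-by-pixel in raster order, conditioned on h.
    `model(x, h)` is assumed to return a distribution over every pixel given
    the pixels generated so far (a schematic stand-in, not the PixelCNN++ API)."""
    channels, height, width = image_shape
    x = torch.zeros(1, channels, height, width)
    for i in range(height):                              # top to bottom
        for j in range(width):                           # left to right
            out = model(x, h)                            # re-evaluate on pixels generated so far
            x[:, :, i, j] = out.sample()[:, :, i, j]     # keep only the next pixel
    return x
```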

PixelCNN++

PixelCNN estimates each colour channel of each pixel using a 256-way softmax function. PixelCNN++ [33] extends and improves PixelCNN by estimating the colour-space for a single pixel as a K-way (with K = 10 in the original usage) discretized mixture of logistic sigmoids:

\[
p(x_i \mid \pi_i, \mu_i, s_i) = \sum_{k=1}^{K} \pi_{ik} \left[ \sigma\!\left(\frac{x_i + 0.5 - \mu_{ik}}{s_{ik}}\right) - \sigma\!\left(\frac{x_i - 0.5 - \mu_{ik}}{s_{ik}}\right) \right], \tag{4.7}
\]

where π_{ik} is the k-th logistic sigmoid mixture coefficient for pixel i, and µ_{ik} and s_{ik} are the corresponding mean and scale of the sigmoid, σ(·). The discretization is accomplished by binning the network's output within ±0.5. The colour channels are coupled by a simple factorisation into three components (red, green, and blue). First, the red channel is predicted using Equation 4.7. Next, the green channel is predicted in the same way, but the means of the mixture components, µ_i, are allowed to depend on the value of the red pixel. The blue channel depends on both red and green channels in this way. Salimans et al. [33] argued that assuming a latent continuous colour intensity and modelling it with a simple continuous distribution (Equation 4.7) results in more condensed gradient propagation, and a memory-efficient predictive distribution for x. Other improvements included down-sampling to capture non-local dependencies and additional skip connections to speed up training.
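A minimal sketch of Equation 4.7 for a single sub-pixel follows. It assumes raw parameters (logit mixture weights, means, log-scales) for x in [0, 255], and omits the edge-case handling at pixel values 0 and 255 and the rescaling to [-1, 1] used in the released PixelCNN++ implementation.

```python
import torch

def discretized_logistic_likelihood(x, logit_pi, mu, log_s):
    """p(x_i | pi_i, mu_i, s_i) as in Equation 4.7, for pixel values in [0, 255].
    All parameter tensors carry a trailing dimension of K mixture components;
    x is broadcast against it."""
    x = x.unsqueeze(-1)                                  # (..., 1) against (..., K)
    inv_s = torch.exp(-log_s)
    cdf_plus = torch.sigmoid((x + 0.5 - mu) * inv_s)     # CDF at the upper bin edge
    cdf_minus = torch.sigmoid((x - 0.5 - mu) * inv_s)    # CDF at the lower bin edge
    bin_prob = (cdf_plus - cdf_minus).clamp(min=1e-12)   # probability mass of the bin
    pi = torch.softmax(logit_pi, dim=-1)                 # mixture coefficients pi_{ik}
    return (pi * bin_prob).sum(dim=-1)                   # mix over the K components
```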

Conditioning

The conditioning variable is added to each gated convolution residual block, of which there are six per each of the five hyper-layers in PixelCNN++. The gated convolution residual block structure was shown empirically to improve results. The activations of a gated residual block are split into two equal parts and one half is processed through a sigmoid function to produce a mask of values in the range [0, 1]. This is element-wise multiplied with the other half of the activations as the 'gate'. As for the conditioning variable, it is conventionally a one-hot encoded class label vector that is added to the activations of each gated residual block. Considering the layers chosen for scrutiny in this thesis (Figure 4.1), most of the conditioning variables are three-dimensional: two spatial dimensions and a channel dimension. Additionally, we must account for the down-sampling used in PixelCNN++. Therefore, there are four possible transformations of h before it can be integrated into the PixelCNN++ model:

1. The conditioning variable is larger (regarding spatial width and height) than the activations to which it needs to be added. The conditioning variable is down-sampled using a convolution with a stride of two and (if necessary) average pooling with a kernel size of two. The filter width is matched in this same convolution.

2. The conditioning variable is smaller than the activations. A sub-pixel shuffle convolution [39, 40] is used for up-sampling. The sub-pixel shuffle is an alternative to deconvolution or nearest neighbour up-sampling that allows the model to learn the correct up-sampling without unnecessary padding. A non-strided convolution with a kernel of one matches the filter width.

3. The conditioning variable is the same size as the activations. A non-strided convolution with a kernel of one matches the filter width.

4. The conditioning variable is, instead, a vector – h4 in Figure 4.1. The dot product of this vector and an appropriately sized weight matrix is taken to match the activations. (A sketch of these transformations follows below.)
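The sketch below covers cases 1–3, assuming square feature maps and power-of-two size ratios; the layer choices and kernel sizes are illustrative rather than the exact PixelCNN++ configuration, and case 4 would use a learned linear map (e.g. `nn.Linear`) instead.

```python
import torch.nn as nn

def match_conditioning(h_size, h_channels, act_size, act_channels):
    """Build a module resizing a spatial conditioning variable h so it can be
    added to activations of shape (act_channels, act_size, act_size)."""
    if h_size > act_size:            # case 1: down-sample with a stride-2 conv (+ pooling)
        layers = [nn.Conv2d(h_channels, act_channels, kernel_size=3, stride=2, padding=1)]
        size = h_size // 2
        while size > act_size:       # further halving if the gap is larger than 2x
            layers.append(nn.AvgPool2d(kernel_size=2))
            size //= 2
        return nn.Sequential(*layers)
    if h_size < act_size:            # case 2: sub-pixel shuffle up-sampling
        r = act_size // h_size
        return nn.Sequential(
            nn.Conv2d(h_channels, act_channels * r * r, kernel_size=1),
            nn.PixelShuffle(r))
    # case 3: same spatial size; a 1x1 convolution matches the filter width only
    return nn.Conv2d(h_channels, act_channels, kernel_size=1)
```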

If, for example, h = h2 is a (16 × 16) × 32 (two-dimensional with 32 filters, the second hyper-layer in Figure 4.1) hidden representation, the first three aforementioned transformations would be in effect because the configuration of PixelCNN++ [33] means that there are activations with spatial resolutions of (32 × 32), (16 × 16), and (8 × 8), to which the conditioning variable must be added. The following section briefly describes the main issue with estimating a lower bound on the MI.

4.2 Tightness of the Bound

The quality of the decoding model influences the tightness of the MI lower bound according to Equation 4.3. If, for example, the layers in neural networks become more difficult to decode by moving onto a highly non-linear manifold, the MI also becomes more difficult to estimate. We acknowledge that this has the potential to affect our results negatively because it is difficult to be certain of the quality of the bound. The PixelCNN++ model can generate images conditioned on hidden layers, and in so doing gives us an intuitive grasp of the MI and bolsters the bound estimation. Just because information is present in a representation does not mean that it is readily accessible. Looking at model-based MI bounds gives us an intuitive sense of the extractable information in a given layer. Although we have not developed a formal procedure to determine the bound quality in this thesis, we will endeavour to do so in future work. The following section presents the models under scrutiny.

4.3 Models Under Scrutiny

We have chosen a residual network (ResNet) architecture for the neural networks under scrutiny in this research. Further, we define four hidden representations, hj with j ∈ {1, . . . , 4}, as the activations of layers that can be extracted for MI tracking. Figure 4.1 visualises the 'encoder' architecture and shows the locations of the hj layers. It is identical to what was used for CIFAR-10 classification in the original ResNet analysis [15]. The final layer in Figure 4.1 is decoded for either:

• Classification: where h4 is mapped to a ten-way class prediction and optimized by maximising the log-likelihood of this prediction with respect to class label data.

• Autoencoding: where h4 is processed to reconstruct x. A series of layers, that are the inverse of the encoder structure, act as the decoder portion. Any relevant spatial up-sampling uses sub-pixel shuffling [39, 40].

There are in this arrangement, therefore, two training paradigms (supervised classification versus unsupervised autoencoding) available for study, both of which share encoder structure for simplicity and comparable analyses.

4.3.1 Training and Freezing

Both the classifier and autoencoder weights were optimised using SGD with a learning rate of 0.1 and cosine annealing to zero over 200 epochs, a momentum factor of 0.9, and an L2 weight decay factor of 0.0005. We used the leaky ReLU non-linearity. Our implementation was written in PyTorch [32]. The hyper-parameters for the PixelCNN++ decoder models were identical to the original paper [33].
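A minimal sketch of this optimisation recipe (classification case) is given below; the model and data loader are placeholders, and this is an illustration of the stated hyper-parameters rather than the thesis implementation.

```python
import torch

def train_encoder(model, loader, epochs=200):
    """SGD with lr 0.1, cosine annealing to zero, momentum 0.9, and weight
    decay 5e-4, as stated above (a sketch; MSE would replace the loss for
    the autoencoder case)."""
    opt = torch.optim.SGD(model.parameters(), lr=0.1,
                          momentum=0.9, weight_decay=5e-4)
    sched = torch.optim.lr_scheduler.CosineAnnealingLR(opt, T_max=epochs, eta_min=0.0)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for x, y in loader:
            loss = loss_fn(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
        sched.step()                 # anneal the learning rate once per epoch
    return model
```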

[Figure 4.1 diagram: the encoder stacks three hyper-layers of 3×3 convolutions – seven stride-1 convolutions with 16 filters (input x: 32×32×3, output h1: 32×32×16), a stride-2 convolution followed by five stride-1 convolutions with 32 filters (h2: 16×16×32), and a stride-2 convolution followed by five stride-1 convolutions with 64 filters (h3: 8×8×64) – ending in 4×4 average pooling to give h4: 256, which feeds either the classification or autoencoder decode head.]

Figure 4.1: The ResNet model architecture (encoder) used to generate hidden activations in either a classification or autoencoder set-up. Each convolution block (inner central blocks) consists of: convolution → BatchNorm → leaky ReLU non-linearity. Additional convolution → BatchNorm blocks are used at necessary skip connections ('/2' blocks). The hidden representations (h1, . . . , h4) are taken at the ends of 'hyper layers', which are the three grouped and separately coloured series of blocks. This encoder is a 21-layer ResNet architecture, accounting for the convolutions in both skip connections.

PixelCNN++ training was stopped early at 250 epochs, owing to time constraints, while the forward decoding models were all trained for 200 epochs to match the encoder training times. Appendix B details the training of the classifier, autoencoder, forward decoding models, and PixelCNN++ inverse decoding models. The weights defining the scrutinized models were saved at intervals throughout training and restored to produce the hj under analysis. Forty-eight decoder models would be required if, for example, six points throughout training were chosen for inspection of all four hidden representations. Half of these would be PixelCNN++ models for inverse decoding. Each PixelCNN++ model takes approximately fifteen days to train on an NVIDIA 1080 Ti GPU. Therefore, specific points in encoder training were carefully selected (details in Section 5.1) and not all hj representations were inversely decoded.

4.4 Data

The analysis in this thesis involves computing MI. The degree to which the employed dataset represents its true distribution influences the accuracy of any quantities that require an unbiased estimate (Equation 4.4). Therefore, we require the following:

1. The data used to learn the decoding models should be extensive enough to provide good lower bound estimates (Sections 4.1.1 and 4.1.2);

2. The data used to evaluate MI should contain as many items as possible for a meaningful measure (see Equation 4.4); and

3. The training data should be encompassing enough to prevent overfitting for the models under scrutiny as well as the decoding models.

Three way data split

These requirements prompt a three way split of the data into the following:

1. Encoding, for training the models under scrutiny;

2. Decoding, for training the models under which MI is computed; and

3. Evaluation, to provide unbiased held-out estimates for the quantities computed under the decoding models.

CIFAR-10 [24] contains a total of 60,000 (32 × 32) colour pixel images (50,000 in the training set and 10,000 in the test set) collected into ten classes. A three-way split means that there are only 20,000 data items per split. In our initial experiments we found that this was not enough to meet the aforementioned criteria. Particularly, the PixelCNN++ models evidenced overfitting after approximately 25 epochs. The compilation of ImageNet for the large scale visual recognition challenge [6] contains many more images (1000 images per each of the 1000 classes, of various sizes and usually cropped to (256 × 256) pixels). Unfortunately, the task difficulty and computational burden associated with ImageNet are too great.

4.4.1 CINIC-10: CINIC-10 is Not Imagenet or CIFAR-10

A natural progression, therefore, is to combine elements of ImageNet into CIFAR-10. The ImageNet project, as opposed to the widely used collated dataset [6], is a database of labelled images. Images are labelled according to the WordNet hierarchy [28] and are mapped to synonym sets (synsets). Each synset ('dog, domestic dog, Canis familiaris', for example) is a leaf node that has a number of images directly associated with it. A synset can have children (various dog breeds, to extend the example). We manually selected groups of synsets that fell under the classes of CIFAR-10. The minimum number of images per class was computed and items from ImageNet were randomly selected (should there be more available than this minimum number) and down-sampled to create CINIC-10: CINIC-10 Is Not ImageNet or CIFAR-10. Full details of the collection process and a data analysis can be found in Appendix C. CINIC-10 contains 27,000 images per class and the same classes as CIFAR-10: plane, car, bird, cat, deer, dog, frog, horse, ship, and truck. CINIC-10 contains 4.5× as many data items as CIFAR-10, totalling 270,000. This provides a better representation of the realistic images in the standard CIFAR-10 image classification task, while not extending the computation- or task-related burdens as extensively as ImageNet would. We have not manually verified every data item in CINIC-10, but have run baseline experiments using modern architectures to compare expected performance (Appendix C). Moreover, experiments showed much improved behaviour regarding overfitting. We cannot ensure that the items gathered from ImageNet are drawn from the same distribution as CIFAR-10, although inspecting the distribution of pixel intensities shows that these are close (Appendix C). Therefore, we ensured that there was an equal split of CIFAR-10 between the encoder, decoder, and evaluation dataset splits. Further, all CIFAR-10 test data is in the evaluation dataset.
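For reference, the three splits could be loaded as standard image folders; the directory names below are assumptions about an on-disk layout for illustration, not a prescribed structure of the released dataset.

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Sketch of loading the three CINIC-10 splits as ImageFolder datasets.
tfm = transforms.ToTensor()
splits = {name: datasets.ImageFolder(f"CINIC-10/{name}", transform=tfm)
          for name in ("encode", "decode", "evaluate")}
loaders = {name: DataLoader(ds, batch_size=64, shuffle=(name != "evaluate"))
           for name, ds in splits.items()}
```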

Conclusion

This chapter discussed the investigation framework used to undertake the research in this thesis. A lower bound on the mutual information was presented and the models with which the bound is computed were discussed. We presented the encoder network architecture used to create hidden representations for analysis. The encoder architecture is used with both classifier and autoencoder training regimes in Chapters 5 and 6, respectively. Finally, we explained the justification, compilation, and proposed use of CINIC-10, an extension of CIFAR-10.

Chapter 5

Experiment One: Classifier MI Tracking

This chapter gives the results obtained by applying the framework discussed in Chapter 4 to hidden representations induced by learning to classify images. Application to an autoencoder set-up follows in Chapter 6. Tracking the MI in convolutional models required a learned decoding model for each query point. Developing a fine-grained and detailed picture of learning progression incurs a substantial computational cost, partly because there are a number of layers that could be analysed in the encoder model architecture (see Figure 4.1). In the classifier training regime, encoding hidden layers aims to maximise the information about the targets by maximising the log-likelihood of the data. We show that the information about the targets increased in all hidden layers as training progressed (Section 5.2). We observed two phases of learning (Section 5.1) in the classifier training regime, by assessing the MI between hidden layers and input images. I(x; hj) first briefly increased and then decreased for the remainder of training. The increase in MI – the fitting stage of training – took longer for earlier layers. The observed two-stage progression confirms some of the results found by Shwartz-Ziv and Tishby [41] regarding compression in neural networks, even when using realistic image data and a modern network architecture with non-saturating non-linearities. Contrariwise, we did not find an obvious causal link between compression and stochastic relaxation by analysing the signal-to-noise ratios of weight updates during training (Section 5.1.1).

Set-up

A ResNet classifier with 22 layers (an encoder architecture – Figure 4.1 – and a final fully-connected layer for output predictions) was trained on the encoder data split of CINIC-10 (Section 4.4.1) for 200 epochs. The weights were saved throughout training and restored to produce the activations that are h1, h2, h3, and h4. Further details can be found in Appendix B. The hidden representations were decoded using the decoder split of CINIC-10 to produce lower bound estimates for MI tracking. A lower bound on the MI between input images and hidden layers requires inverse decoding (Section 5.1); a lower bound between labels and hidden layers requires forward decoding (Section 5.2).

Each I(x; hj) or I(y; hj) assessment is done on the evaluation data split of CINIC-10.

5.1 Inverse Decoding: Information About Inputs

The MI between input images and hidden layers was tracked by decoding h2, h3, and h4 at 0, 1, 5, 10, 15, 100, and 200 epochs. The MI curves are the plots (Figure 5.1) created by evaluating E_{pD(x,h)}[log q(x | h)], using the evaluation data split, over the selected classifier training epochs. Relative information gain (from initialisation to peak MI) and relative information compression are listed in Table 5.1 as ∆f and ∆c, respectively.
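Concretely, each plotted point is an average of conditional log-likelihoods over the evaluation split; up to the constant entropy H(x), this is the reported lower bound. The sketch below assumes a frozen encoder and a decoder exposing a `log_prob`-style method, which is a stand-in for the conditional PixelCNN++ likelihood rather than its actual API.

```python
import torch

def mi_lower_bound(encoder_to_h, decoder, eval_loader):
    """Estimate E_{pD(x,h)}[log q(x | h)] on the evaluation split (Equation 4.4).
    `decoder.log_prob(x, h)` is an assumed interface returning per-image
    conditional log-likelihoods in nats."""
    total, count = 0.0, 0
    with torch.no_grad():
        for x, _ in eval_loader:
            h = encoder_to_h(x)                              # frozen hidden representation
            total += decoder.log_prob(x, h).sum().item()     # sum of log q(x | h)
            count += x.size(0)
    return total / count                                     # bound up to the constant H(x)
```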

Table 5.1: Relative information gain and compression for the hidden representations in a classifier training regime.

hj    Initial LL    Peak LL    Final LL    Fitting, ∆f (nats)    Compression, ∆c (nats)
h2    -7240         -6079      -6346       1162                  267
h3    -7510         -7459      -7579       41                    120
h4    -7799         -7633      -7701       166                   68

Information compression was observed at all hidden layers. In the context of training deep neural networks for image classification, hidden representations formed by layered processing in a ResNet CNN architecture, using non-saturating non-linearities, exhibited a reduction in shared information with input images after initially maximally fitting. These findings do not specifically confirm the claim that generalisation occurs because of this compression [46, 41], or that compression occurs because of stochastic relaxation, but we can confirm that compression does occur in this real-world data setting.

[Figure 5.1 plot: I(x; hj) lower bound (nats) against epochs of classifier training on a log scale, with curves for E_{pD(x,hj)}[log q(x | hj)], j = 2, 3, 4, annotated with ∆c = 267 nats (h2), 120 nats (h3), and 68 nats (h4).]

Figure 5.1: Mutual information curves: lower bound as per Equation 4.4 for classifier training. All representations exhibited two stages of learning: fitting followed by compression. The relative compression is the difference between the peak fitting and final values, listed as ∆c in nats on the right. The most compression relative to the starting MI happened at h3, which is the layer immediately prior to average pooling. The rate of compression increased as learning progressed. Note that not all evaluation points could be carried out for each hj owing to time constraints.

The onset of compression is not consistent between layers; deeper layers began compressing earlier. Compression started before a single epoch of training for h4, before five epochs for h3, and approximately before ten epochs for h2. Moreover, earlier (shallower) layers exhibited more compression. Earlier layers also have higher total capacity. More control experiments need to be run in order to decipher which properties (capacity versus position in the neural network) resulted in greater compression.

Compression at h2 was over twice the compression observed at h3: 267 nats versus 120 nats, respectively. Furthermore, even though h4 is merely h3 followed by a (4 × 4) average pooling operation, comparison of their MI curves (Figure 5.1) is useful. The relative compression for h3 is higher than for h4 (see Table 5.1), showing that compression at h3 is not forcing the same level of compression at h4.

5.1.1 Compression Through Stochastic Relaxation?

Shwartz-Ziv and Tishby [41] claimed that the mechanism behind generalisation in neural networks is stochastic relaxation: a low signal-to-noise ratio (SNR) in the SGD weight updates causes noise to dominate the learning, which has a diffusion-like effect on the weights. Empirical error minimisation (Section 3.2) was claimed to correspond to a high SNR as networks learn to fit the data in the early stages of learning, while compression was claimed to correspond to learning where the SNR is low, allowing the stochasticity owing to differences within minibatches to dominate. Saxe et al. [36] also observed a drop in SNRs with ReLU non-linearities, but found no causal link to generalisation. We analysed the SNR characteristics (see Section 3.2) throughout classifier training to uncover any agreement with the point at which the classifier enters its compression stage. The classifier model in this thesis did not exhibit a phase-shift from high to low SNRs that directly correlated with the onset of compression. Figure 5.2 shows the relevant gradient signal statistics (Equations 3.10, 3.11, and 3.12). The modern architecture and optimisation choices used in this thesis – a ResNet using batch normalisation, weight decay, and a decaying learning rate – mean that a direct comparison with toy examples and small, standard neural networks [41, 36] (i.e., not convolutional) is troublesome. Moreover, the signal statistics are notably more noisy when learning using realistic image data, and were median filtered prior to analysis. There was, nonetheless, no obvious direct connection between an SNR phase-shift and the onset of compression.
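As a rough illustration of how such statistics might be gathered, the sketch below treats Equations 3.10–3.12 as the norms of the mean and standard deviation of per-minibatch gradients for a layer's weights, and their ratio; this is an interpretation for illustration rather than a quotation of the thesis code.

```python
import torch

def gradient_snr(model, loader, loss_fn, layer_weight):
    """Collect per-minibatch gradients for one layer's weights over an epoch
    and return the SNR and the weight norm ||W|| (for scale reference)."""
    grads = []
    for x, y in loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(layer_weight.grad.detach().clone())
    g = torch.stack(grads)                   # (num_batches, *weight_shape)
    m = g.mean(dim=0).norm()                 # ||mean gradient||: the signal
    s = g.std(dim=0).norm()                  # ||std of gradients||: the noise
    return (m / s).item(), layer_weight.detach().norm().item()
```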

[Figure 5.2 plots, panels (a)–(d): normalised gradient signal statistics (mean mj, standard deviation sj, and the weight norm ∥Whj∥ for scale) against epochs of classifier training, for the weights producing h1, h2, h3, and the prediction layer, with the observed compression start marked; panel (e): smoothed SNR of (a)–(d) and the mean SNR.]

Figure 5.2: Normalised gradient signal throughout classifier training. ∥Whj∥ denotes the L2 norm of the weights, for scale reference. The SNR (Equation 3.12) is the ratio between the mean gradient signal (Equation 3.10) and the standard deviation thereof (Equation 3.11). The mean SNR (e) is the average SNR over all signals, (a)–(d). The vertical lines indicate the observed approximate start of compression (determined from Figure 5.1), which does not correspond to a clear point in the SNR statistics, but does correspond to a similar signal value in all cases (≈ 1). We did not track MI for h1 – hence no observed compression start point.

Figure 5.2 (e) shows the mean SNR across all layers. The SNR is relatively consistent between layers, which means that this loss signal characteristic is being propagated effectively through the network. Krähenbühl et al. [23] explained and justified the importance of learning signal consistency and its propagation through layers to avoid vanishing or exploding gradients. BatchNorm improves gradient propagation [19] by reducing the dependence of gradients on the scale of parameters, and allowed for a consistency of learning signal throughout the layers of the classifier network. It may be the case that state-of-the-art techniques such as batch normalisation confound comparisons with earlier work [41, 36], but there are similarities worth considering:

• The SNR decreased as learning progressed. For consistency with earlier works, the signal statistics of Figure 5.2 were plotted on a linear-log scale to show this decrease as it is not easily noted using log-log scales as in the earlier work [41, 36].

• There was a consistency regarding the value of the gradient signal (mj ≈ 1, in this case) when compression began (estimated from Figure 5.1). Following this, the SNR decreases over time.

5.1.2 Conditional Samples

When Tishby et al. [45] first introduced the IB method, they noted that the original information theoretic formulations [37] were more concerned with the transmission of information, rather than the relevance or meaning of the transmitted information to the recipient. In fact, it is this notion of information relevance that is at the core of the IB interpretation of learning. A beneficial side-effect of using PixelCNN++ for inverse decoding hidden layers is that image samples can be conditionally generated. In this way, we are able to directly query the quality of the information in the hj layers in an intuitive fashion. Figures 5.3, 5.4, and 5.5 show generated image samples conditioned on the activations h2^(i), h3^(i), and h4^(i) (when processing images x^(i), where i refers to a sample index) in the classifier, respectively, at intervals throughout classifier training.

The samples generated conditioned on h2 in Figure 5.3 are all very similar – after ≈ 10 epochs of fitting – despite localised hue variations and global colour differences. Evidently these class-agnostic characteristics are readily compressed in this case. However, the sharp colour deviations in Figure 5.3(e) may instead be owing to issues with the PixelCNN++ model.

[Figure 5.3 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.3: Samples generated using PixelCNN++, conditioned on h2 in the classifier training set-up. The original images processed for h2 are shown in (a). Ten epochs is close to the peak of the 'fitting' stage; 200 epochs is the end of learning. Although the samples from different stages of classifier training are all structurally similar, the samples from later stages exhibit more hue and background deviations. h2 showed the most absolute compression according to Figure 5.1.

[Figure 5.4 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.4: Samples generated using PixelCNN++, conditioned on h3 in the classifier training set-up. The original images processed for h3 are shown in (a). Ten epochs is close to the peak of the fitting stage, while 200 epochs is the end of learning. Unnecessary features such as background colour are preserved at 10 epochs of training; the diversity of samples is greater at 200 epochs. I(h3; x) is lower at 200 epochs compared to no learning (Figure 5.1), but the quality of preserved information is better.

[Figure 5.5 panels: (a) original images; (b) no learning; (c) 1 epoch; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 5.5: Samples generated using PixelCNN++, conditioned on h4 in the classifier training set-up. The original images processed for h4 are shown in (a). One epoch is close to the peak of the 'fitting' stage, while 200 epochs is the end of learning. The images generated from converged representations look more realistic, yet are still diverse.

The difference between the hidden layers used to generate images for Figures 5.4 and 5.5 is entirely because h4 has 16× less capacity following a (4 × 4) average pooling operation, resulting in less consistent images when compared to those generated conditioned on h3. The pooling does, however, improve translation and scale invariance. For translation invariance, consider the relative positions of key image features (the position of the bird's head in row six, for example): these, when present, always correspond to the same spatial location in Figure 5.4, while the location varies more in Figure 5.5. For scale invariance, by an extension of the same argument as before, observe the differences in the sizes of generated horses in rows fifteen and sixteen in Figure 5.5, and the scale consistency in Figure 5.4. Similar findings were made by Nash et al. [29]. The images generated for Figures 5.4 and 5.5 also evidence that PixelCNN++ is not a faultless image model: the generated samples do not always look realistic. For further reference, Appendix B contains images generated using an unconditional PixelCNN++ trained on the decoder split of CINIC-10. It is, therefore, impossible to tell whether the failures in image generation are because the conditional PixelCNN++ models do a poor job of rendering believably realistic images from the information in conditioning variables, or whether these representations lack the necessary information because of compression. The cars in rows three and four of Figure 5.5 are examples of poorly generated samples. Section 4.2 comments further on the tightness of the estimated lower bound. However, the imperfect quality of the decoding models is consistent between representations in early and late stages of classifier training. Hence, the results in Figure 5.1 are uniformly affected. There is also irrelevant information kept within the representations – like the background of trees in the final row of Figure 5.5 – that is not class-specific and could likely be discarded. Incorporating an attention mechanism into the classifier model [48] may prove to be a suitable way of actively diminishing unnecessary information. Moreover, the fact that attention mechanisms improve performance, through a learned and forced focus on relevant portions of the input, bolsters the idea of generalisation by compression [46].

5.2 Forward Decoding: Information about Targets

The hidden layers h1,..., h3 were decoded by freezing all weights leading up to each layer. We do not need to decode h4. Decoding h4 is identical to decoding h3 in our framework because the only difference between them is a pooling operation. The remaining weights of the encoder portion of the model were retrained for classification on decoder data, and the lower bound on I(y; h) was estimated on the evaluation data split. The resultant estimates were repeated at numerous intervals throughout the training of the classifier under scrutiny, allowing us to track the MI as learning progressed. Figure 5.6 shows the MI as a function of epochs of classifier training.

[Figure 5.6 plot: I(y; hj) lower bound against epochs of classifier training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3, and E_{pD(y,x)}[log q(y | x)].]

Figure 5.6: I(y; hj) information curves: lower bounds as per Equation 4.4 for classifier training. The orange line corresponds directly to the negative loss of the classifier training as it represents the expected log-likelihood of the data under the classifier model, q (and where h = h0 = x). Where these estimates are taken using the encoder data (left), the data processing inequality seems to be violated in that I(y; h2) < I(y; h3), except in as much as the tightness of bounds differ.

The less processing involved to create a hidden representation – i.e., 'shallower' layers closer to x – the earlier on the MI saturates. This is to be expected as there are fewer weights to fit. This information 'saturation' is an interpretation for why shallow layers converge first when training neural networks.

5.2.1 Data Processing Inequality Violation?

The encoder data split of CINIC-10 was used to train the classifier under scrutiny. An intriguing phenomenon is evidenced when assessing MI on the encoder data split in late stage training. There is a point at approximately 180 epochs of training where the information about the targets in the deeper layers of the network becomes greater than the information about the targets in shallower layers. This seems to be a violation of the data processing inequality (Equation 3.7): deeper layers only ever have information provided by shallower layers and information cannot be created. More formally:

\[
I(y; \mathbf{x}) \geq I(y; \mathbf{h}_1) \geq I(y; \mathbf{h}_2) \geq I(y; \mathbf{h}_3). \tag{5.1}
\]

This seemingly impossible violation of the DPI is likely an issue with the tightness of the bound and was a prime motivation for the three-way data split. MI assessed on the held-out evaluation data evidences the correct ordering of layers (Figure 5.6, right).

5.2.2 Linear Separability

[Figure 5.7 plot: I(y; hj) lower bound under a linear probe against epochs of classifier training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3, and E_{pD(y,x)}[log q(y | x)].]

Figure 5.7: I(y; hj) lower bounds as per Equation 4.4 for classifier training, using a single linear mapping for decoding to show the increasing linear separability of hidden layers as they learn. The orange line corresponds directly to the negative loss of the classifier training as it represents the expected log-likelihood of the data under the classifier model, q (and where h = h0 = x).

I(y; hj) was evaluated under a simple linear mapping to test whether hidden layers became more linearly separable as learning progressed. The results are shown in Figure 5.7. For each evaluation, the classifier under scrutiny was frozen and a single fully-connected layer was attached to hj and trained using SGD as before. Each layer became more linearly separable as learning progressed, evidenced by the monotonic increase of linearly mapped MI assessments in Figure 5.7.
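A minimal sketch of such a probe is given below; the flattened feature size, data loader, and training length are assumptions for illustration rather than the exact configuration used.

```python
import torch
import torch.nn as nn

def linear_probe(encoder_to_h, h_dim, num_classes, decode_loader, epochs=200):
    """Single linear layer trained on a frozen representation h_j
    (a sketch; h_dim is the flattened size of the chosen layer)."""
    probe = nn.Linear(h_dim, num_classes)
    opt = torch.optim.SGD(probe.parameters(), lr=0.1, momentum=0.9)
    loss_fn = nn.CrossEntropyLoss()
    encoder_to_h.eval()
    for _ in range(epochs):
        for x, y in decode_loader:
            with torch.no_grad():
                h = encoder_to_h(x).flatten(start_dim=1)   # frozen features
            loss = loss_fn(probe(h), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return probe
```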

Conclusion

This chapter involved tracking the mutual information between data (images or corresponding labels) and hidden representations learned in a typical classifier set-up. The shared information between input images and hidden layers was used to show that there are two phases of learning [41]. First, the classifier network fits to the input data, evidenced by an increase in mutual information. Following this (at approximately 5% of total training time) the network goes into a compression phase, evidenced by a decrease of mutual information as it learns to discard irrelevant components of the input images. Images were also generated from the conditional PixelCNN++ decoder models to provide an intuitive means of assessing information compression. The idea that deep neural networks compress through stochastic relaxation, evidenced by a low signal-to-noise ratio in the weight updates [41, 46], was inspected. We found no clear correlation between a signal-to-noise ratio phase transition and the onset of compression, while training using image data and a modern architecture choice. The mutual information between hidden layers and labels was tracked and presented. This is essentially a quantitative means of answering: how well equipped is each layer to predict the correct label? As expected, all layers improved as learning progressed. We also showed how all layers became more linearly separable throughout learning. The following chapter repeats the core experiments under an autoencoder training regime.

Chapter 6

Experiment Two: Autoencoder MI Tracking

This chapter mirrors the experiments of Chapter 5. It differs in that the hidden layers under scrutiny in this chapter are trained using an autoencoder set-up. Compression still occurred for unsupervised autoencoding (Section 6.1). We show that the bottleneck layer became increasingly ill-suited to a classification task, starting early on in training, by tracking its shared information with classification labels (Section 6.2).

6.1 Inverse Decoding: Information about Inputs

Figure 6.1 tracks I(x; hj) over the autoencoder training regime. The bottleneck layer is h4. Compression was observed before 100 epochs for h3 and h4, but earlier for h2. The bottleneck layer may act as effective labels for the earlier layers, which then learn to discard irrelevant information to optimise for the bottleneck. Compression in an autoencoder set-up is counter-intuitive: training attempts to fit to the data for reconstruction. Samples from the PixelCNN++ decoder models were taken (Section 6.1.1) to comprehend what quality of information the autoencoder bottleneck keeps or discards. The autoencoder training regime induced a longer fitting stage and therefore a higher relative gain in MI – see Table 6.1. Nonetheless, compression occurred. Autoencoders attempt to reconstruct their input. Therefore, the decrease in MI later in training would seem to indicate a reduction in the autoencoder's ability to reconstruct input images. The conditional images in the following section show that this is not necessarily true: the autoencoder improved the representation in the bottleneck layer, toward better reconstruction, through compression.

[Figure 6.1 plot: I(x; hj) lower bound (nats) against epochs of autoencoder training on a log scale, with curves for E_{pD(x,hj)}[log q(x | hj)], j = 2, 3, 4, annotated with ∆c = 732 nats (h2), 72 nats (h3), and 64 nats (h4).]

Figure 6.1: I(x; hj) information curves: lower bounds as per Equation 4.4 for autoencoder training. The relative compression is the difference between the peak fitting and final values, listed as ∆c in nats on the right. The layers still show signs of compression even when learning these using an autoencoder set-up.

Table 6.1: Relative information gain and compression for the hidden representations in an autoencoder training regime. The autoencoder set-up induced a longer fitting stage that resulted in a higher ∆f .

hj    Initial LL    Peak LL    Final LL    Fitting, ∆f (nats)    Compression, ∆c (nats)
h2    -7409         -5204      -5936       2205                  732
h3    -7608         -6980      -7052       628                   72
h4    -7727         -7278      -7342       448                   64

6.1.1 Conditional Samples

Figure 6.2 shows samples generated from the bottleneck layer – h4 – of the autoencoder.

Compression is evident after 100 epochs for h4 according to Figure 6.1. Even though the shared information between h4 and the data decreases after 100 epochs of autoencoder training, the information quality is better at 200 epochs: the samples are closer to the original and are less noisy. This goes some way to confirm the findings by Saxe et al. [36] regarding the overlap of training phases: fitting and compression need not be separate and may occur simultaneously. Samples from other layers were not included as they are nearly identical to the original images and thus superfluous to visualise. The autoencoder learned to make better use of its bottleneck layer to capture the image information, even though this resulted in lower MI. Compare the generated samples conditioned on the autoencoder’s bottleneck in Figure 6.2, to the samples generated conditioned on the same capacity representation, but trained for classification, in Figure 5.5. Although h4 could be a layer that preserves information suitable for good reconstruction, it discards much of this information in the classification set-up.

6.1.2 Signal to Noise Ratio Tracking

For a comparable analysis with classifier training (Section 5.1.1), Figure 6.3 exhibits the learning signal statistics for autoencoder training. We see that the SNR for autoencoder training is higher than for classifier training, and does not decrease over time. The signal is of the same scale as the L2 norm of the weights for autoencoder training. Conversely, the signal is approximately 10× lower than the L2 norm of the weights for classifier training. This is partly explained by the fact that the loss is higher in the autoencoder set-up (see Appendix B). The SNR did not decrease for the autoencoder as it did for the classifier.

[Figure 6.2 panels: (a) original images; (b) no learning; (c) 5 epochs; (d) 10 epochs; (e) 100 epochs; (f) 200 epochs.]

Figure 6.2: Samples generated using PixelCNN++, conditioned on h4, the autoencoder bottleneck. The original images processed for h4 are shown in (a). Even though the MI is lower at 200 epochs than at 100 epochs, the image samples are more realistic and less noisy.

[Figure 6.3 plots, panels (a)–(d): normalised gradient signal statistics (mean mj, standard deviation sj, and the weight norm ∥Whj∥ for scale) against epochs of autoencoder training, for the weights producing h1, h2, h3, and the weights out of the bottleneck, with the observed compression start marked; panel (e): smoothed SNR of (a)–(d) and the mean SNR.]

Figure 6.3: Normalised gradient signal statistics throughout training of the autoencoder. ∥Whj∥ denotes the L2 norm of the weights, for scale reference. The SNR (Equation 3.12) is the ratio between the mean gradient signal (Equation 3.10) and the standard deviation thereof (Equation 3.11). The mean SNR (e) is the average SNR over all signals, (a)–(d).

6.2 Forward Decoding: Information about Targets

[Figure 6.4 plot: I(y; hj) lower bound against epochs of autoencoder training on a log scale, assessed on the encoder data (left) and evaluation data (right), with curves for E_{pD(y,hj)}[log q(y | hj)], j = 1, 2, 3.]

Figure 6.4: I(y; hj) lower bounds as per Equation 4.4 for autoencoder training. Unlike the classifier training regime (Figure 5.6), there are no significant differences between assessments on encoder and evaluation data. The representation immediately prior to the autoencoder bottleneck – h3 – compressed label-relevant information as it learned.

The autoencoder never had access to class labels. Nonetheless, with our framework we queried I(y; hj) to understand how the autoencoder preserved and/or discarded class-relevant information. Figure 6.4 shows these results: a different picture is painted compared to the training of a classifier (Section 5.2). Particularly, there is a notable difference in behaviour between layer h3 and layers h1 and h2. h3 is the layer immediately preceding the average pooling operation that resulted in the autoencoder's bottleneck (see Figure 4.1), and is thus most closely linked to it.

Only h3 discarded information about the class labels as learning progressed. In other words, the information in the features that h3 learned became less useful for classification over time. The bottleneck layer is impartial to class labels by design of the autoencoder training regime. The increase of information about targets in the shallower layers – h1 and h2 – shows that these layers learned to maximise any information available within the input space prior to the bottleneck enforced on h3. The use of a standard and simple loss function for the autoencoder – mean squared error, in this case – may be the cause for the compression of class-relevant information in the bottleneck layer, since it encourages mode-averaging behaviour that results in a blurred reconstruction (see Appendix B for autoencoder image reconstructions). Some nuances of the images, assumed to be useful for classification, would be destroyed because of this behaviour. More advanced and modern loss functions can be tested in future work.

Conclusion

This chapter gave the results of tracking mutual information between data (inputs and labels) and hidden layers learned by training an autoencoder. Even though there were no class labels in this set-up, and therefore no irrelevant information, the hidden layers nonetheless compressed information later in training. This is likely owing to the mode-averaging behaviour of the mean squared error. Image samples were generated from the PixelCNN++ decoder models, conditioned on the autoencoder's bottleneck. The information in the bottleneck layer sufficed for a good reconstruction using a PixelCNN++ as a powerful decoder model. A comparison was made between images generated using the same hidden representation, except trained for classification. The bottleneck layer could have the capacity to keep information for a good reconstruction, but instead only kept what is necessary when trained for classification. The weight-update signal-to-noise ratio statistics were shown for comparison with the classifier training regime. Unlike when learning to classify, the signal-to-noise ratios did not decrease as learning progressed. The mutual information between hidden layers and labels was also scrutinised. Earlier layers maximised label-relevant (and perhaps all available) information; the layer immediately responsible for the bottleneck compressed label-relevant information almost immediately. The next chapter draws conclusions and offers avenues for future work.

Chapter 7

Conclusion

Deep neural networks generalise well even when they contain many more parameters than training samples. Principled theoretic formulations and empirical assessments drive the field of deep learning forward, toward understanding why neural networks perform so favourably. Earlier work [41] hypothesized that deep learning is successful because hidden layers learn by forgetting the irrelevant aspects of data. Hidden layers learn to compress aspects of the input data that are not necessary to accomplish the required task. The mechanism behind this compression was put forward as stochastic relaxation, where a diffusion-like process drives a neural network toward better compression by a task-constrained addition of noise to layer weights. An empirical-centric refutation paper [36] gave counter-examples to show that generalisation by compression is not necessarily the case, and sparked a debate in the machine learning community. This thesis is an information theoretic analysis of deep convolutional neural networks toward understanding the nature of the learned internal representations. We tracked the mutual information between learned representations and data, for a modern convolutional model that used residual connections and batch normalisation. Mutual information is difficult to compute for high-dimensional data. Hence, we defined and computed a lower bound on the mutual information using decoding models. We found that modern convolutional architectures, whether trained in a supervised or unsupervised context, exhibited compression behaviour as learning progressed.

7.1 Our Contributions

Most earlier research was confined to fully-connected networks, toy problems, or toy datasets. One of our primary contributions was to analyse a modern architecture set-up for encoding hidden representations, trained using realistic image data. We also provided qualitative examples in the form of image samples for an accompanying intuitive grasp on what kind of information was compressed. A framework for analysis was constructed to observe training for (1) supervised classification, and (2) unsupervised autoencoding. Information compression occurred in both cases. The signal-to-noise ratios in the stochastic parameter updates were also tracked. The evidenced compression was not obviously linked to a phase transition from high to low signal-to-noise ratios as claimed in earlier research [41]. The framework consisted of:

• A fixed (with regard to training regime) residual connection encoder architecture. The same architecture was used with targeted weight-freezing to compute the mutual information between hidden layers and classification labels.

• Conditional PixelCNN++ models to compute the mutual information between input images and hidden layers, and to conditionally generate image samples for qualitative analyses.

• A linear probing mechanism to investigate whether hidden layers become more linearly separable throughout learning.

• An analysis of whether stochastic relaxation is evident in a modern set-up.

• A new image dataset compiled to improve information theoretic analyses.

7.2 Our Findings

We observed two phases of training for all hidden layers. The mutual information between the layers and the images first increased (fitting) and then decreased for the remainder of training (compression). Compression began earlier for classifier training than for autoencoder training. Earlier layers compressed more information. We generated image samples, conditioned on hidden layers, using PixelCNN++ to show: (1) what quality of information was compressed; and (2) that the tightness of the lower bound estimate was consistent.

Regarding classification, the hidden layers became both better at capturing class-relevant information, and more linearly separable as learning progressed. The bottleneck layer of the autoencoder was the only layer that exhibited any reduction in class-relevant information. Therefore, the bottleneck layer of an autoencoder is not suited to classification. Conversely, information compression in the bottleneck resulted in better quality generated images. Contrary to earlier research [41], tracking the signal-to-noise ratios of parameter updates did not reveal a connection between learning signals and the shift from fitting to compression. Moreover, the relationship between gradient signal and noise was not the same for classification and autoencoding.

7.3 Limitations and Future Work

The quality of the decoder model influences the tightness of the lower bound on the mutual information. The better the decoder model, the tighter the bound. Therefore, we cannot be absolutely certain of the results exhibited in this thesis because assessing the quality of the bound requires knowing the mutual information (in which case we would not require a lower bound). We conditionally generated image samples to show that the decoder model did not evidence relatively worse behaviour at any assessment point. Additionally, any future advances in explicit image distribution estimation can improve the analyses in this thesis. There are a number of state-of-the-art models and architecture choices worth assessing with our framework. Attention mechanisms, for example, leverage the idea of relevant information. It would be useful to analyse the hidden representations learned in that context. We used a modern architecture and batch normalisation for the models under scrutiny. Batch normalisation affects the weight updates and may be a confounding factor regarding our analysis of stochastic relaxation as a mechanism for compression. Future work will entail a better means of testing for stochastic relaxation. The long-term goal of our research is to find a way of immediately and efficiently defining hidden layers in neural networks. Understanding the learning dynamics of neural networks is a step toward that future goal.

References

[1] Alain, G. and Bengio, Y. (2016). Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644.
[2] Ba, J. L., Kiros, J. R., and Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.
[3] Bengio, Y. (2013). Deep learning of representations: looking forward. In Proceedings of the International Conference on Statistical Language and Speech Processing, pages 1–37. Springer.
[4] Chrabaszcz, P., Loshchilov, I., and Hutter, F. (2017). A downsampled variant of ImageNet as an alternative to the CIFAR datasets. arXiv preprint arXiv:1707.08819.
[5] Cover, T. M. and Thomas, J. A. (2012). Elements of information theory. John Wiley & Sons.
[6] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In Proceedings of the Conference on Computer Vision and Pattern Recognition. IEEE.
[7] Fukushima, K. and Miyake, S. (1982). Neocognitron: A self-organizing neural network model for a mechanism of visual pattern recognition. In Competition and Cooperation in Neural Nets, pages 267–285. Springer.
[8] Gabrié, M., Manoel, A., Luneau, C., Barbier, J., Macris, N., Krzakala, F., and Zdeborová, L. (2018). Entropy and mutual information in models of deep neural networks. arXiv preprint arXiv:1805.09785.
[9] GitHub (2018a). ImageNet utils. Accessed on 2018-08-03. Online: https://github.com/tzutalin/ImageNet_Utils.
[10] GitHub (2018b). PyTorch-CIFAR. Accessed on 2018-08-03. Online: https://github.com/kuangliu/pytorch-cifar.
[11] Goodfellow, I., Bengio, Y., and Courville, A. (2016). Deep learning. MIT Press. http://www.deeplearningbook.org.
[12] Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Proceedings of the Advances in Neural Information Processing Systems, pages 2672–2680. Curran Associates, Inc.

[13] Gülçehre, Ç. and Bengio, Y. (2016). Knowledge matters: Importance of prior information for optimization. Journal of Machine Learning Research, 17(8):1–32.
[14] Guo, C., Pleiss, G., Sun, Y., and Weinberger, K. Q. (2017). On calibration of modern neural networks. In Proceedings of the International Conference on Machine Learning, pages 1321–1330. PMLR.
[15] He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 770–778. IEEE.
[16] He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In Proceedings of the European Conference on Computer Vision, pages 630–645. Springer.
[17] Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., and Adam, H. (2017). Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861.
[18] Huang, G., Liu, Z., Van Der Maaten, L., and Weinberger, K. Q. (2017). Densely connected convolutional networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 2261–2269. IEEE.
[19] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the International Conference on Machine Learning, pages 448–456. PMLR.
[20] Jacobsen, J.-H., Smeulders, A., and Oyallon, E. (2018). i-RevNet: Deep invertible networks. In Proceedings of the International Conference on Learning Representations.
[21] Jakubovitz, D., Giryes, R., and Rodrigues, M. R. D. (2018). Generalization error in deep learning. arXiv preprint arXiv:1808.01174v1.
[22] Kolchinsky, A., Tracey, B. D., and Wolpert, D. H. (2017). Nonlinear information bottleneck. arXiv preprint arXiv:1705.02436.
[23] Krähenbühl, P., Doersch, C., Donahue, J., and Darrell, T. (2015). Data-dependent initializations of convolutional neural networks. arXiv preprint arXiv:1511.06856.
[24] Krizhevsky, A. (2009). Learning multiple layers of features from tiny images. Master's thesis, Toronto University.
[25] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 1097–1105. Curran Associates, Inc.
[26] Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324.
[27] Ma, C., Huang, J.-B., Yang, X., and Yang, M.-H. (2015). Hierarchical convolutional features for visual tracking. In Proceedings of the International Conference on Computer Vision, pages 3074–3082.

[28] Miller, G. A. (1995). WordNet: A lexical database for English. Communications of the ACM, 38(11):39–41.
[29] Nash, C., Kushman, N., and Williams, C. K. (2018). Inverting supervised representations with autoregressive neural density models. arXiv preprint arXiv:1806.00400.
[30] OpenReview (2018). OpenReview of: On the information bottleneck theory of deep learning. Accessed on 2018-06-27. Online: openreview.net/forum?id=ry_WPG-A-.
[31] Oyallon, E. (2017). Building a regular decision boundary with deep networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 5106–5114.
[32] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. (2017). Automatic differentiation in PyTorch. In Proceedings of the Advances in Neural Information Processing Systems Workshop.
[33] Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. (2017). PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517.
[34] Salimans, T. and Kingma, D. P. (2016). Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Proceedings of the Advances in Neural Information Processing Systems, pages 901–909. Curran Associates, Inc.
[35] Santurkar, S., Tsipras, D., Ilyas, A., and Madry, A. (2018). How does batch normalization help optimization? (no, it is not about internal covariate shift). arXiv preprint arXiv:1805.11604.
[36] Saxe, A. M., Bansal, Y., Dapello, J., Advani, M., Kolchinsky, A., Tracey, B. D., and Cox, D. D. (2018). On the information bottleneck theory of deep learning. In Proceedings of the International Conference on Learning Representations.
[37] Shannon, C. E. (2001). A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review, 5(1):3–55.
[38] Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G., and Dean, J. (2017). Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538.
[39] Shi, W., Caballero, J., Huszár, F., Totz, J., Aitken, A. P., Bishop, R., Rueckert, D., and Wang, Z. (2016a). Real-time single image and video super-resolution using an efficient sub-pixel convolutional neural network. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1874–1883. IEEE.
[40] Shi, W., Caballero, J., Theis, L., Huszar, F., Aitken, A., Ledig, C., and Wang, Z. (2016b). Is the deconvolution layer the same as a convolutional layer? arXiv preprint arXiv:1609.07009.

[41] Shwartz-Ziv, R. and Tishby, N. (2017). Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
[42] Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
[43] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
[44] Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. (2015). Going deeper with convolutions. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 1–9. IEEE.
[45] Tishby, N., Pereira, F. C., and Bialek, W. (2000). The information bottleneck method. arXiv preprint physics/0004057.
[46] Tishby, N. and Zaslavsky, N. (2015). Deep learning and the information bottleneck principle. In Information Theory Workshop, pages 1–5. IEEE.
[47] van den Oord, A., Kalchbrenner, N., Espeholt, L., Vinyals, O., Graves, A., et al. (2016). Conditional image generation with PixelCNN decoders. In Proceedings of the Advances in Neural Information Processing Systems, pages 4790–4798. Curran Associates, Inc.
[48] Wang, F., Jiang, M., Qian, C., Yang, S., Li, C., Zhang, H., Wang, X., and Tang, X. (2017). Residual attention network for image classification. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 6450–6458.
[49] Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S. S., and Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: How to train 10,000-layer vanilla convolutional neural networks. arXiv preprint arXiv:1806.05393.
[50] Xie, S., Girshick, R., Dollár, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. In Proceedings of the Conference on Computer Vision and Pattern Recognition, pages 5987–5995. IEEE.
[51] Yu, S., Jenssen, R., and Principe, J. C. (2018). Understanding convolutional neural network training with information theory. arXiv preprint arXiv:1804.06537.
[52] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations.

Appendix A

Self-consistent IB Equations

The basis for the information bottleneck method is the joint satisfaction of the following self-consistent equations:

p(h \mid x) = \frac{p(h)}{Z(x;\beta)} \exp\big(-\beta\, D_{KL}\!\left[\,p(y \mid x)\,\|\,p(y \mid h)\,\right]\big),   (A.1)

p(y \mid h) = \sum_{x} p(y \mid x)\, p(x \mid h),   (A.2)

p(h) = \sum_{x} p(x)\, p(h \mid x),   (A.3)

where Z(x; β) is the partition function and β acts as a trade-off parameter between the compression rate and the complexity (Equation 3.8). Tishby et al. [45] describe in detail the iterative algorithm that optimises these equations by alternating their updates.
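To make that alternating optimisation concrete, the sketch below iterates Equations A.1–A.3 for a small discrete joint distribution. It is a minimal illustration only: the function name, random initialisation, and numerical-stability constant are assumptions of this sketch, not part of the procedure of Tishby et al. [45] beyond the three update equations themselves.

```python
import numpy as np

def iterative_ib(p_xy, n_h, beta, n_iters=200, seed=0):
    """Alternate the self-consistent IB equations (A.1)-(A.3) for a discrete
    joint distribution p(x, y) of shape (n_x, n_y). Illustrative sketch."""
    rng = np.random.default_rng(seed)
    eps = 1e-12
    p_x = p_xy.sum(axis=1)                          # p(x)
    p_y_given_x = p_xy / (p_x[:, None] + eps)       # p(y|x)

    # Random soft initialisation of the encoder p(h|x).
    p_h_given_x = rng.random((p_xy.shape[0], n_h))
    p_h_given_x /= p_h_given_x.sum(axis=1, keepdims=True)

    for _ in range(n_iters):
        p_h = p_x @ p_h_given_x                                     # (A.3)
        p_x_given_h = (p_h_given_x * p_x[:, None]).T / (p_h[:, None] + eps)
        p_y_given_h = p_x_given_h @ p_y_given_x                     # (A.2)
        # KL[p(y|x) || p(y|h)] for every (x, h) pair.
        log_ratio = (np.log(p_y_given_x + eps)[:, None, :]
                     - np.log(p_y_given_h + eps)[None, :, :])
        kl = np.sum(p_y_given_x[:, None, :] * log_ratio, axis=2)
        p_h_given_x = p_h[None, :] * np.exp(-beta * kl)             # (A.1)
        p_h_given_x /= p_h_given_x.sum(axis=1, keepdims=True)       # Z(x; beta)
    return p_h_given_x
```

Here β plays the same trade-off role as in Equation 3.8: larger values weight prediction of y more heavily at the expense of compression.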

Appendix B

Training Curves

B.1 Training the Classifier and Autoencoder

Figure B.1 shows the classifier and autoencoder loss curves. The losses are not driven down to zero because we chose to use the same architecture that the original ResNet research used for CIFAR-10 [15]. Nonetheless, compression was still evident. These are not state-of-the-art results on CINIC-10, which was compiled for this research. Appendix C details its compilation and exhibits results on a number of state-of-the-art architectures. We chose to adhere to the standard ResNet CIFAR-10 architecture [15] to simplify the analyses in this research. Future work should entail an architecture search specific to CINIC-10. Figures B.1(a) and (b) show that the classifier overfit the encoder data. This is further evidenced when tracking I(y; hj) in the experiments of Section 5.2. Figure B.2 shows reconstructions from the trained autoencoder.

B.2 Forward Decoder Models

Classifier training

Figures B.3, B.4, and B.5 show the training curves for the decoder models estimating

I(y; h1), I(y; h2), and I(y; h3), respectively, when these representations are trained using a classifier set-up. The estimation for I(y; x) is computed directly as the negative log-likelihood of the classifier model as it learns, with recalibration. The effect of probability recalibration can be seen in the final deviation of these decoder training curves: the recalibrated results have simply been appended for comparative visualisation. The y-axis scale is not consistent across these figures because they are not intended for self-comparison, but rather as a testament to the convergence of the lower bound on I(y; hj).
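Since the lower bound on I(y; hj) is simply H(y) minus the converged decoder negative log-likelihood, the estimate itself is a one-liner. The sketch below assumes a balanced ten-class label distribution (so H(y) = log 10 nats) and uses a placeholder value for the converged, recalibrated NLL; it is not the exact evaluation code used in this thesis.

```python
import math

def forward_mi_lower_bound(decoder_nll_nats, n_classes=10):
    """Lower bound on I(y; h_j) in nats, assuming a uniform label prior:
    I(y; h_j) = H(y) - H(y|h_j) >= H(y) - E[-log q(y|h_j)]."""
    return math.log(n_classes) - decoder_nll_nats

# For example, the converged evaluation NLL of ~0.5238 nats in Figure B.3
# would give a bound of roughly log(10) - 0.5238 ≈ 1.78 nats on I(y; h1).
print(forward_mi_lower_bound(0.5238))
```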

Figure B.1: Classifier loss (a) and accuracy (b) curves, and autoencoder loss (c) curve: (a) classification training and evaluation losses (final average NLL: 0.2221 encoder data, 0.5622 evaluation data); (b) classification training and evaluation accuracies (final: 92.16% encoder data, 82.19% evaluation data); (c) autoencoder training and evaluation losses (final: 0.1207 encoder data, 0.1206 evaluation data). The architecture of the encoder was chosen according to the original ResNet [15] for CIFAR-10 data. Hence, owing to more data in CINIC-10, the losses are not driven down to zero.

Figure B.2: Image reconstructions from the autoencoder model for all three datasets (encoder, decoder, and evaluation examples). These are typically blurry owing to the use of MSE loss when training the autoencoder.

Figure B.3: Forward decoder models' loss curves for estimating I(y; h1), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h1. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.5086 / 0.2218 / 0.5238.

Autoencoder training

Figures B.6, B.7, and B.8 show the training curves for the decoder models estimating

I(y; h1), I(y; h2), and I(y; h3), respectively, when these representations are trained using an autoencoder set-up. Compared to the decoding under a classifier training regime, these loss curves show that the representations are not immediately well-suited to classification: they have greater loss initially.

Figure B.4: Forward decoder models' loss curves for estimating I(y; h2), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h2. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.4840 / 0.2523 / 0.5301.

Figure B.5: Forward decoder models' loss curves for estimating I(y; h3), for the classifier training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.6. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h3. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 1.9659 / 1.9657 / 1.9712 and 0.2634 / 0.4901 / 0.5051.

Figure B.6: Forward decoder models' loss curves for estimating I(y; h1), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h1. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.5862 / 0.3082 / 0.5914.

Figure B.7: Forward decoder models' loss curves for estimating I(y; h2), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h2. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 0.7873 / 0.5243 / 0.7933.

Figure B.8: Forward decoder models' loss curves for estimating I(y; h3), for the autoencoder training regime, at encoder epochs 0, 1, 5, 8, 10, 20, 30, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.4. These models are the encoder architecture of Figure 4.1 with weights frozen to fix h3. The effect of recalibrating the probabilities can be seen in the noticeable deviation at the end of training: the recalibrated results were simply appended to these curves. Legend values (encoder/decoder/evaluation data): 1.9599 / 1.9568 / 1.9631 and 2.2708 / 2.2709 / 2.2714.

B.3 PixelCNN++ Inverse Decoder Models

Classifier training

Figures B.9, B.10, and B.11 are the loss curves when using PixelCNN++ to decode the hidden representations h2, h3, and h4, respectively, for MI tracking in the classifier training regime. Each of these models took approximately fifteen days to train on an NVIDIA Titan 1080ti GPU, so we stopped training at 250 epochs owing to time and computation constraints. Additionally, the training of the PixelCNN++ decoder models for h4 was initially distributed over GPUs with less memory, which necessitated a smaller batch size; this shows as instabilities in the learning curves of Figure B.11. This is not ideal, but it is consistent amongst the decoders.
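The inverse bound follows the same recipe as the forward one: I(x; hj) = H(x) − H(x|hj) ≥ H(x) − E[−log q(x|hj)], where the conditional PixelCNN++ supplies the decoder term. H(x) is intractable for real images, so the sketch below substitutes the unconditional PixelCNN++ NLL of Section B.4 as a stand-in; that substitution, and the function name, are assumptions of this illustration rather than the exact procedure used in the experiments.

```python
def inverse_mi_estimate(conditional_nll_nats, unconditional_nll_nats):
    """Rough lower-bound estimate of I(x; h_j) in nats per image, using the
    unconditional PixelCNN++ NLL as a stand-in for H(x)."""
    return unconditional_nll_nats - conditional_nll_nats

# Purely as an arithmetic illustration, using the evaluation legend values of
# Figures B.9 and B.16: 7628.65 - 6332.61 ≈ 1296 nats per image.
print(inverse_mi_estimate(6332.6092, 7628.6529))
```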

Figure B.9: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h2), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 6332.4184 / 6206.0450 / 6332.6092.

Autoencoder training

Figures B.12, B.13, and B.14 are the loss curves when using PixelCNN++ to decode the hidden representations h2, h3, and h4, respectively, for MI tracking in the autoencoder training regime.

Figure B.10: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h3), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7498.6980 / 7392.1538 / 7515.4007.

Figure B.11: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h4), for the classifier training regime, at encoder epochs 0, 1, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 5.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7640.4045 / 7457.2779 / 7643.9200.

Figure B.12: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h2), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 5889.1252 / 5827.2696 / 5924.7832.

Figure B.13: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h3), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7014.7296 / 6961.8481 / 7027.5152.

Figure B.14: Inverse PixelCNN++ decoder models' loss curves for estimating I(x; h4), for the autoencoder training regime, at encoder epochs 0, 5, 10, 15, 100, and 200. Each panel plots average NLL against decoder training epochs, and each set of curves shows the decoding run for one data point in Figure 6.1. These models were stopped at 250 epochs of decoding owing to time and computation constraints. Legend values (encoder/decoder/evaluation data): 7317.7591 / 7259.7478 / 7340.4536.

A side note on GPU hours

The decoder models shown in this appendix are the result of over 14000 GPU hours (approximately 600 days of training). It is therefore highly infeasible to repeat these experiments unless there is a convincing need to do so.

B.3.1 PixelCNN++ Bound

To demonstrate that the PixelCNN++ decoders approach the best decoding they can achieve, we plot the area under the curve (AUC) of the information curves (e.g., Figure 5.1) as a function of PixelCNN++ training epochs. Figure B.15 shows that the PixelCNN++ models are approaching their optimal decoding potential: the MI lower bound tends towards the tightest value these decoders can provide.
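A sketch of how such an area-under-the-curve summary can be computed is given below. It assumes the information estimates for one layer are stored as an array indexed by encoder epoch, and it reads the adjustment described in Figure B.15 as subtracting the curve's minimum value (the estimate obtained with an untrained decoder); both the function name and that reading are assumptions of this sketch.

```python
import numpy as np

def information_curve_auc(encoder_epochs, mi_estimates):
    """Area under one information curve after shifting it so that its
    minimum value (the estimate with an untrained decoder) sits at zero."""
    mi = np.asarray(mi_estimates, dtype=float)
    mi = mi - mi.min()                 # make the areas comparable across layers
    return np.trapz(mi, x=np.asarray(encoder_epochs, dtype=float))
```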

B.4 Unconditional PixelCNN++

Figure B.16 exhibits the unconditional PixelCNN++ loss for comparison with our decoder models. The lowest average loss was an NLL of 24411, which corresponds to 3.58 bits per dimension. The original PixelCNN++ achieved 2.92 bits per dimension on CIFAR-10 [33]. Generated samples are shown in Figure B.17.
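The bits-per-dimension figures quoted here follow the usual conversion from nats per image, sketched below under the assumption of 32×32×3 images and an NLL reported in nats per image (consistent with the legend values of Figure B.16); the function name is illustrative.

```python
import math

def nats_per_image_to_bits_per_dim(nll_nats, n_dims=32 * 32 * 3):
    """Convert an average NLL in nats per image into bits per dimension."""
    return nll_nats / (n_dims * math.log(2))

# The evaluation legend value of ~7628.65 nats in Figure B.16 corresponds to
# roughly 3.58 bits per dimension, matching the figure caption.
print(nats_per_image_to_bits_per_dim(7628.6529))
```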

Figure B.15: Areas under the curve, for the I(x; h2), I(x; h3), and I(x; h4) estimates, as a function of PixelCNN++ decoder model training epochs for classifier training, evidencing that the bounds become close to the best that they can be. The information curves were adjusted so that the minimum value was zero (at no PixelCNN++ training), so as to make the areas herein meaningful.

Figure B.16: Unconditional PixelCNN++ loss curves when trained on the encoder dataset of CINIC-10 (average NLL against training epochs; legend values for encoder/decoder/evaluation data: 7515.3841 / 7612.0762 / 7628.6529). Since this is only using one third of CINIC-10, it may be possible to achieve a lower loss when using a larger portion of CINIC-10. The best evaluation loss here corresponds to 3.58 bits per dimension, as opposed to the 2.92 bits per dimension on CIFAR-10 [33].

Figure B.17: Unconditional PixelCNN++ generated samples when trained on the encoder dataset of CINIC-10. These samples have good local qualities but are not particularly convincing as real images. This is a known pitfall of autoregressive explicit density estimators.

Appendix C

CINIC-10: CINIC-10 Is Not ImageNet or CIFAR-10

This appendix explains the motivation for compiling CINIC-10 (Section C.1) and how it was compiled (Section C.2). We also present results for several well-known neural network architectures and compare the images and colour distributions to CIFAR-10 (Section C.3). We plan to release this dataset publicly in the near future.

C.1 Motivation

CINIC-10 is not ImageNet or CIFAR-10, but is compiled therefrom. Consider the following:

• CIFAR-10 [24] is a commonly used image benchmarking dataset that is (1) complex enough to represent real-world images and (2) concise enough to allow for fast prototyping (unlike ImageNet [25]).

• It would be advantageous to have an extended version of CIFAR-10, since 6000 samples per class may not always be sufficient, particularly considering the three-way data split used in this thesis.

• The Fall 2011 ImageNet release [6] contains images that belong to the same (or similar) classes as CIFAR-10.

• The gap in task difficulty between CIFAR-10 and CIFAR-100 is large.

• A dataset that provides another benchmark milestone, closer to CIFAR-10, would be useful.

• ImageNet32x32 and ImageNet64x64 [4] are down-sampled versions of the standard ImageNet release. This does provide a new benchmarking milestone in that the image sizes are smaller, but in some ways it is a more challenging problem as the information content is diminished.

• Data-hungry tasks, like the information-theoretic analysis in this thesis, require more data.

• An equal train/validation/test data split gives a more principled perspective on generalisation performance.

CINIC-10 was compiled in such a manner as to provide a similar challenge to CIFAR-10 but with 4.5× as much data.

C.2 Compilation

CINIC-10 contains all of the CIFAR-10 images and 210000 additional images drawn from the ImageNet database. We acknowledge that the images in CIFAR-10 have been carefully cropped and processed; the images of CIFAR-10 and ImageNet therefore do not come from exactly the same distribution. To alleviate this, we ensured that each split of CINIC-10 contained equal numbers of CIFAR-10 images. The compilation steps were:

1. The CIFAR-10 images were processed into image format (.png) and named so as to enable backwards recovery; 'train/dog/cifar-10-train-123.png' is the image of index 123 from the 'train' set of CIFAR-10, for example.

2. The relevant synonym sets (synsets) within the ImageNet database were identified and collected. These synset-groups are listed in the additional synsets-to-cifar-10-classes.txt file.

3. The synsets were downloaded using ImageNet Utils [9].

4. The images were read and downsampled in identical fashion to ImageNet32x32 [4], for consistency.

5. The lowest number of relevant ImageNet samples was observed to be 21939. Therefore, 21000 samples were randomly selected from each of the synset-groups and distributed equally among train, validation, and test sets. Regarding this thesis, those correspond to the encoder, decoder, and evaluation sets (a short sketch of this split follows the list).

6. The additional images were named so that their origins can be directly identified; 'test/automobile/n03100240_16003.png' is the down-sampled version of image number 16003 from the synset named 'n03100240'.
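A minimal sketch of the equal three-way split in step 5 is given below. The file-list input, function name, and fixed seed are hypothetical; the actual compilation scripts are not reproduced here.

```python
import random

def split_synset_group(image_paths, per_split=7000, seed=0):
    """Randomly select 3 * per_split images from one synset-group and
    distribute them equally among the train, validation, and test splits
    (the encoder, decoder, and evaluation sets in this thesis)."""
    rng = random.Random(seed)
    chosen = rng.sample(list(image_paths), 3 * per_split)
    return {
        "train": chosen[:per_split],
        "valid": chosen[per_split:2 * per_split],
        "test": chosen[2 * per_split:],
    }
```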

C.3 Analysis

Classification results

Table C.1 gives results comparing several popular models on CIFAR-10 and CINIC-10. Each run on CINIC-10 was repeated five times with different random seeds.

Model                              CIFAR-10 test error    CINIC-10 test error    Number of parameters
VGG-16 [42]                        7.36%                  12.23 ± 0.16%          14.7M
ResNet-18 [15]                     6.98%                  9.73 ± 0.05%           11.2M
ResNet-18 (pre-activation) [16]    4.89%                  10.10 ± 0.08%          11.2M
GoogLeNet [44]                     -                      8.83 ± 0.12%           6.2M
ResNeXt29_2x64d [50]               5.18%                  8.55 ± 0.15%           9.2M
DenseNet-121 [18]                  4.96%                  8.74 ± 0.16%           7.0M
MobileNet [17]                     -                      18.00 ± 0.16%          3.2M

Table C.1: CINIC-10 versus CIFAR-10 on some popular models for classification. CIFAR-10 results are taken from a publicly available implementation [10]. For CINIC-10 training, the train dataset was a combination of the encoder and decoder sets, while the test set was the evaluation dataset.

Distributions and samples

Figure C.1 shows the pixel intensity distributions for the images taken from CIFAR-10 and the ImageNet database. These almost perfectly overlap, although the ImageNet contributions have a lower mean. Figure C.2 shows samples randomly selected from the contributing datasets. It is evident that the CIFAR-10 data is better prepared in some cases, but the subject of each image taken from ImageNet is nonetheless relatively clear.
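The colour-intensity comparison in Figure C.1 can be reproduced with a simple pooled histogram; the sketch below assumes the images are available as uint8 arrays, and the function name is a placeholder.

```python
import numpy as np

def intensity_histogram(images_uint8):
    """Normalised histogram of all pixel intensities, with all colour
    channels pooled, comparable to the curves in Figure C.1."""
    values = np.asarray(images_uint8).ravel()
    counts, _ = np.histogram(values, bins=256, range=(0, 256))
    return counts / counts.sum()
```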

Figure C.1: CINIC-10 contributor images' histograms (normalised frequency against colour intensity, for CIFAR-10 and the ImageNet contributors). These match well, but there are differences in the distribution of pixel intensities: ImageNet images have marginally lower pixel intensities.

(a) Images taken from CIFAR-10.

(b) Images taken from the ImageNet database.

Figure C.2: Samples from CINIC-10, showing the differences between CIFAR-10 and ImageNet contributors.