Deep Neural Networks are Lazy: On the Inductive Bias of Deep Learning

by Tarek Mansour
S.B., C.S. and Mathematics, M.I.T. (2018)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2019

© Tarek Mansour, MMXIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author: Department of Electrical Engineering and Computer Science, February 1, 2019
Certified by: Aleksander Madry, Associate Professor of Computer Science, Thesis Supervisor
Accepted by: Katrina LaCurts, Chairman, Department Committee on Graduate Theses

Abstract

Deep learning models exhibit superior generalization performance despite being heavily overparametrized. Although widely observed in practice, there is currently very little theoretical backing for such a phenomenon. In this thesis, we propose a step forward towards understanding generalization in deep learning. We present evidence that deep neural networks have an inherent inductive bias that makes them inclined to learn generalizable hypotheses and avoid memorization. In this respect, we propose results that suggest that the inductive bias stems from neural networks being lazy: they tend to learn simpler rules first. We also propose a definition of simplicity in deep learning based on the implicit priors ingrained in deep neural networks.

Thesis Supervisor: Aleksander Madry
Title: Associate Professor of Computer Science

Acknowledgments

I would like to start by thanking my advisor Aleksander Madry for his guidance and mentorship during both my undergraduate and graduate careers at MIT. Aleksander introduced me to deep learning science and constantly pushed me to think critically about problems that arise in research. He played a big role in shaping me as an engineer as well as a scientist. This thesis would not have been possible without his mentoring and support. Having Aleksander as a mentor was a phenomenal experience; I could not have hoped for a better advisor.

I would like to thank Kai Yuanqing Xiao for his significant contributions to the research presented in this thesis. He helped me throughout and played a key role in developing the ideas proposed. This work would not have been possible without him.

I would like to thank the Theory of Computation group. They provided a great environment for research through reading groups and constant discussions about deep learning science. I really enjoyed being part of such an interesting group of people. I would also like to thank my MIT friends for the constant support they have given me throughout.

I would like to thank my family for everything. Without them, I would not be where I am today. This thesis is dedicated to them.

Contents

1 Introduction
1.1 The Statistical Learning Problem
1.1.1 Preliminaries and Notation: The Learning Setup
1.1.2 Generalization and the Bias-Variance Tradeoff
1.1.3 Feature Maps
1.2 Deep Learning
1.2.1 Preliminaries and Notation
1.2.2 The Science of Deep Learning
1.2.3 Generalization in Deep Learning
1.3 Contributions: the Inductive Bias
1.3.1 The Inductive Bias: a Definition
1.3.2 Laziness, or Learning Simple Things First
1.3.3 Simplicity is Not General
1.4 Outline

2 Related Works
2.1 The Quest to Uncover Deep Learning Generalization
2.1.1 Stochastic Gradient Descent (SGD) as a Driver of Generalization
2.1.2 Overparametrization as a Feature
2.1.3 Interpolation is not Equivalent to Overfitting
2.2 Memorization in Deep Learning
2.2.1 Noise Robustness in Deep Learning
2.2.2 Memorization is Secondary
2.3 Priors in Deep Learning
2.3.1 Priors as Network Biases

3 On the Noise Robustness of Deep Learning Models
3.1 Introduction
3.1.1 Benign Noise and Adversarial Noise
3.2 Generalization with High Output Domain Noise
3.2.1 Non-Linear Networks
3.2.2 Linear Networks
3.3 Generalization with High Input and Output Domain Noise
3.3.1 Input Domain Noise as an "Easier" Task
3.3.2 Towards the "Laziness" Property of Deep Neural Networks

4 Learning Simple Things First: On the Inductive Bias in Deep Learning Models
4.1 Introduction
4.2 A Surprising Behavior: Generalization is Oblivious to Fake Images When it Matters
4.2.1 Data Generation: the Gaussian Directions and CIFAR_p
4.2.2 Generalization with Gaussian Directions
4.2.3 Generalization in CIFAR_p
4.3 Data Manifold Awareness
4.3.1 Differential Treatment of Real and Synthetic Images
4.3.2 Towards Identifying the Data Manifold: Unsupervised Learning
4.3.3 Towards Inductive Bias: Low Dimensional Compression
4.4 Learning Simple Things First
4.4.1 Data Generation: the Linear/Quadratic Dataset
4.4.2 The Simplicity Bias: A Proof of Concept
4.5 Laziness: a Force that Drives Generalization

5 Inductive Bias through Priors: Simplicity is Preconditioned by Priors
5.1 Introduction
5.1.1 Priors as a Summary of Initial Beliefs
5.1.2 Priors in Deep Learning
5.1.3 Priors Matter for Deep Learning
5.2 Simplicity, or Proximity to the Prior
5.2.1 Bias through Non-Linear Activations
5.2.2 Bias through Architecture
5.2.3 Feature Engineering through Priors

6 Conclusion

List of Figures

3-1 Adversarial example. The initial image (left) is correctly classified as a panda whereas the perturbed image (right) is classified as a gibbon, even though it looks exactly like the initial one to the human eye [GSS14].
3-2 Test accuracy on true label test points in the uniform label MNIST dataset. The generalization error stays relatively low until very high values of α (∼50), then drops sharply. We attribute the drop to difficulty in optimization rather than a fundamental limitation of the training process.
3-3 Test accuracy on true label test points in the uniform label CIFAR10 dataset. The generalization accuracy drops slowly but stays relatively high for high noise levels.
3-4 Test accuracy on true label test points in the uniform label MNIST dataset, with a linear model. We can see that the model is very robust to noise and the generalization accuracy is affected minimally.
3-5 Test accuracy on true label test points in the white noise MNIST and CIFAR10 datasets.
The added noisy images have no effect on the generalization accuracy. The accuracy on the uniform label dataset is added for comparison.
4-1 Images obtained after adding random Gaussian directions to CIFAR10 images. We use different values of ε from left to right: 0, 50, 500, 5000. We see that for small ε the images are modified negligibly.
4-2 Test accuracy vs. ε for the Gaussian Directions dataset with α = 9. We see that after ε = 45 the test accuracy is the same as the accuracy obtained on the CIFAR10 dataset without any augmentation.
4-3 Training run on a Gaussian Directions dataset with α = 9 and ε = 45. The network treats the real and fake images as two distinct entities: it learns on the true dataset first to reach good training set performance, then starts memorizing the fake labels.
4-4 The Gaussian Directions dataset. True training samples (blue) are surrounded by a number of generated data points (red).
4-5 Training run on a CIFAR_0.5 dataset. As in the Gaussian Directions case, the network learns on the true dataset first.
4-6 PCA analysis of the activations at the last hidden layer. The top images show the activations for the entire test dataset; the bottom images show the activations for real images (x) with their fake counterparts (o). We can clearly see that there is very little variation along the first 3 PCs for the fake data. The neural network maps the fake data to a very restricted subspace.
4-7 PCA analysis of the activations at the last hidden layer (single component view). The fake inputs' activations are significantly concentrated, whereas the real inputs exhibit high variance.
4-8 The Linear/Quadratic Dataset. The image on the left shows the four different types of data and the image on the right shows their assigned labels.
4-9 Train accuracies on the Linear/Quadratic Dataset. The training accuracy grows first for the L points, which require a simpler classifier.
5-1 Train and test accuracies of the comparative run for ReLU and Quad activations. We can see that the linear dataset is easier for ReLU, and the quadratic dataset is easier for Quad.
5-2 Train and test accuracies of the comparative run for max pool and no max pool networks. The network without max pooling layers achieves high train and test accuracy faster than the network with the pooling layers.