Deep Neural Networks are Lazy: On the Inductive Bias of Deep Learning
by Tarek Mansour
S.B., C.S. and Mathematics, M.I.T. (2018)

Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY

February 2019

© Tarek Mansour, MMXIX. All rights reserved.

The author hereby grants to MIT permission to reproduce and to distribute publicly paper and electronic copies of this thesis document in whole or in part in any medium now known or hereafter created.

Author ...... Department of Electrical Engineering and Computer Science, February 1, 2019

Certified by ...... Aleksander Madry, Associate Professor of Computer Science, Thesis Supervisor

Accepted by ...... Katrina LaCurts, Chairman, Department Committee on Graduate Theses

Deep Neural Networks are Lazy: On the Inductive Bias of Deep Learning
by Tarek Mansour

Submitted to the Department of Electrical Engineering and Computer Science on February 1, 2019, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Deep learning models exhibit superior generalization performance despite being heavily overparametrized. Although widely observed in practice, there is currently very little theoretical backing for such a phenomenon. In this thesis, we propose a step forward towards understanding generalization in deep learning. We present evidence that deep neural networks have an inherent inductive bias that makes them inclined to learn generalizable hypotheses and avoid memorization. In this respect, we propose results that suggest that the inductive bias stems from neural networks being lazy: they tend to learn simpler rules first. We also propose a definition of simplicity in deep learning based on the implicit priors ingrained in deep neural networks.

Thesis Supervisor: Aleksander Madry Title: Associate Professor of Computer Science

Acknowledgments

I would like to start by thanking my advisor Aleksander Madry for the guidance and mentorship during both my undergraduate and graduate careers at MIT. Aleksander introduced me to deep learning science and constantly pushed me to think critically about problems that arise in research. He played a big role in shaping me as an engineer as well as a scientist. This thesis would not have been possible without his mentoring and support. Having Aleksander as a mentor was a phenomenal experience. I could not have hoped for a better advisor. I would like to thank Kai Yuanqing Xiao for his significant contributions to the research presented in this thesis. He helped me throughout and played a key role in developing the ideas proposed. This work would not have been possible without him. I would like to thank the Theory of Computation group. They provided a great environment for research through reading groups and constant discussions about deep learning science. I really enjoyed being part of such an interesting group of people. I would also like to thank my MIT friends for the constant support they have given me throughout. I would like to thank my family for everything. Without them, I would not be where I am today. This thesis is dedicated to them.

Contents

1 Introduction 17
1.1 The Statistical Learning Problem ...... 18
1.1.1 Preliminaries and Notation: The Learning Setup ...... 18
1.1.2 Generalization and the Bias-Variance Tradeoff ...... 19
1.1.3 Feature Maps ...... 20
1.2 Deep Learning ...... 20
1.2.1 Preliminaries and Notation ...... 20
1.2.2 The Science of Deep Learning ...... 22
1.2.3 Generalization in Deep Learning ...... 23
1.3 Contributions: the Inductive Bias ...... 23
1.3.1 The Inductive Bias: a Definition ...... 23
1.3.2 Laziness, or Learning Simple Things First ...... 24
1.3.3 Simplicity is Not General ...... 24
1.4 Outline ...... 24

2 Related Works 27
2.1 The Quest to Uncover Deep Learning Generalization ...... 27
2.1.1 Stochastic Gradient Descent (SGD) as a Driver of Generalization ...... 28
2.1.2 Overparametrization as a Feature ...... 28
2.1.3 Interpolation is not Equivalent to Overfitting ...... 29
2.2 Memorization in Deep Learning ...... 30
2.2.1 Noise Robustness in Deep Learning ...... 31
2.2.2 Memorization is Secondary ...... 31

2.3 Priors in Deep Learning ...... 32
2.3.1 Priors as Network Biases ...... 32

3 On the Noise Robustness of Deep Learning Models 35
3.1 Introduction ...... 35
3.1.1 Benign Noise and Adversarial Noise ...... 35
3.2 Generalization with High Output Domain Noise ...... 37
3.2.1 Non Linear Networks ...... 37
3.2.2 Linear Networks ...... 39
3.3 Generalization with High Input and Output Domains Noise ...... 41
3.3.1 Input Domain Noise as an "Easier" Task ...... 41
3.3.2 Towards the "Laziness" Property of Deep Neural Networks ...... 41

4 Learning Simple Things First: On the Inductive Bias in Deep Learning Models 45
4.1 Introduction ...... 45
4.2 A Surprising Behavior: Generalization is Oblivious to Fake Images When it Matters ...... 47

4.2.1 Data Generation: the Gaussian Directions and CIFAR푝 ...... 47
4.2.2 Generalization with Gaussian Directions ...... 48

4.2.3 Generalization in CIFAR푝 ...... 52
4.3 Data Manifold Awareness ...... 53
4.3.1 Differential Treatment of Real and Synthetic Images ...... 54
4.3.2 Towards Identifying the Data Manifold: Unsupervised Learning ...... 54
4.3.3 Towards Inductive Bias: Low Dimensional Compression ...... 56
4.4 Learning Simple Things First ...... 57
4.4.1 Data Generation: the Linear/Quadratic Dataset ...... 57
4.4.2 The Simplicity Bias: A Proof of Concept ...... 59
4.5 Laziness: a Force that Drives Generalization ...... 60

5 Inductive Bias through Priors: Simplicity is Preconditioned by Priors 63
5.1 Introduction ...... 63
5.1.1 Priors as a Summary of Initial Beliefs ...... 63
5.1.2 Priors in Deep Learning ...... 64
5.1.3 Priors Matter for Deep Learning ...... 65
5.2 Simplicity, or Proximity to the Prior ...... 65
5.2.1 Bias through Non-Linear Activations ...... 66
5.2.2 Bias through Architecture ...... 67
5.2.3 Feature Engineering through Priors ...... 69

6 Conclusion 71

List of Figures

3-1 Adversarial example. The initial image (left) is correctly classified as a panda whereas the perturbed image (right) is classified as a gibbon, even though it looks exactly like the initial one to the human eye [GSS14]. 36

3-2 Test accuracy on true label test points in the uniform label MNIST dataset. The generalization error stays relatively low until very high values of alpha (∼ 50), then drops sharply. We attribute the drop to difficulty in optimization rather than a fundamental limitation of the training process. ...... 38

3-3 Test accuracy on true label test points in the uniform label CIFAR10 dataset. The generalization accuracy drops slowly but stays relatively high for high noise levels...... 39

3-4 Test accuracy on true label test points in the uniform label MNIST dataset, with a linear model. We can see that the model is very robust to noise and the generalization accuracy is affected minimally. . . . . 40

3-5 Test accuracy on true label test points in the white noise MNIST and CIFAR10 datasets. The added noisy images have no effect on the generalization accuracy. The accuracy on the uniform label dataset is added for comparison...... 42

4-1 Images obtained after adding random gaussian directions to CIFAR10 images. We use different values of 휖 from left to right: 0, 50, 500, 5000. We see that for small epsilon the images are modified negligibly. . . . 48

4-2 Test accuracy vs epsilon for the Gaussian Directions dataset with 훼 = 9. We see that after 휖 = 45 the test accuracy is the same as the accuracy obtained on the CIFAR10 dataset without any augmentation. 49

4-3 Training run on a Gaussian Directions dataset with 훼 = 9 and 휖 = 45. The network treats the real and fake images as two distinct entities: it learns on the true dataset first to reach good training set performance, then starts memorizing the fake labels. ...... 50

4-4 The Gaussian Directions dataset. True training samples (blue) are surrounded by a number of generated data points (red). ...... 51

4-5 Training run on a CIFAR0.5 dataset. As in the Gaussian Directions case, the network learns on the true dataset first...... 53

4-6 PCA analysis of the activations at the last hidden layer. The top images show the activations for the entire test dataset, the bottom images show the activations for real images (x) with their fake counterparts (o). We can clearly see that there’s very little variation along the first 3 PCs for the fake data. The neural network maps the fake data to a very restricted subspace. ...... 56

4-7 PCA analysis of the activations at the last hidden layer (single compo- nent view). The fake inputs activations are significantly concentrated, whereas the real inputs exhibit high variance...... 57

4-8 The Linear/Quadratic Dataset. The image on the left shows the four different types of data and the image on the right shows their assigned labels...... 58

4-9 Train accuracies on the Linear/Quadratic Dataset. The training accuracy grows for the L points, which require a simpler classifier, first. ...... 59

5-1 Train and test accuracies of the comparative run for ReLU and Quad activations. We can see that the linear dataset is easier for ReLU, and the quadratic dataset is easier for Quad. ...... 66

5-2 Train and test accuracies of the comparative run for max pool and no max pool networks. The network without max pooling layers achieves high train and test accuracy faster than the network with the pooling layers. ...... 68

List of Tables

4.1 Test and train accuracies for Gaussian Directions datasets with different values of α. We set ε = 45 for these experiments. We can see that the test accuracy on the real images does not change when we add fake training examples to the dataset. ...... 50
4.2 Test accuracies for the probing experiment of the Gaussian Directions dataset. We can see that the subspace between the true and fake training points is torn between them. However, as we go further than ε-far, the network does not recognize c_k as the label anymore. ...... 51
4.3 Test accuracies for CIFAR푝. The accuracy goes down linearly with p: as more true images get flipped, the true signal vanishes. ...... 52
4.4 Train and test accuracies for the real and fake labels dataset. We can see that after 1 epoch of training the network recognizes which points are on the data manifold. ...... 55

5.1 Number of epochs needed to reach 60% train accuracy. The networks learn significantly faster when training on the data that fits the prior imposed by the activation functions. ...... 67

Chapter 1

Introduction

In the past couple of years, deep learning has been the driving force behind successes in many learning and prediction problems. In fact, deep neural networks have allowed unprecedented achievements in fields such as computer vision, natural language processing, machine translation, games and many others. The systems developed through neural networks achieve, or even surpass, human-level performance in these fields. From a practical standpoint, it is therefore apparent that deep neural networks are important, and even essential, for future advances in the field.

Nevertheless, we currently have very little understanding of the inner mechanics of deep learning: there is very little theoretical backing for its impressive performance. As in any other area of science, developing simple mathematical rules that underlie the dynamics of deep learning is essential if we want to move forward in the field. Doing so will help with the creation of robust, reliable, and scalable systems. Our work follows this line of research on the science of deep learning. In this thesis, we aim to take a step forward towards understanding the ability of deep neural networks to generalize well. We propose a new perspective on neural networks: they have an inherent inductive bias that pushes them to learn simple things by making them "lazy".

1.1 The Statistical Learning Problem

Machine learning refers to the problem of learning, from a set of observed data, general rules that apply to unseen data. The goal is usually to make predictions or decisions based on these rules. There are three main types of machine learning methods today: supervised learning, unsupervised learning and reinforcement learning. Supervised learning refers to procedures that aim to learn a mapping from inputs x to labels y, whereas unsupervised learning is concerned with learning structure in unlabelled data. The compromise between the two takes the form of reinforcement learning, where an agent takes actions in an environment with some reward function and learns with the aim of maximizing the cumulative reward obtained through such actions. In our study, we focus on supervised learning.

1.1.1 Preliminaries and Notation: The Learning Setup

In the supervised learning problem, the goal is to learn a relationship between inputs x and outputs y from training data. We consider the data distribution 풟 on the space 풳 × 풴 and let S = {(x_i, y_i) for i = 1, ..., n} denote the training set. The inputs x_i are considered to be drawn from the d-dimensional space R^d. We focus on the problem of classification, where the labels y_i take a finite number of values from the label set 풞. Let ℱ be the space of possible estimators f : 풳 → 풴 [1] and let ℒ : 풴 × 풴 → [0, ∞) be a chosen loss function. The goal of the learning procedure is to find an estimator f* ∈ ℱ that minimizes the expected risk, or, in other words, the expected error on data drawn from 풟:

$$ f^* = \arg\min_{f \in \mathcal{F}} \mathcal{E}(f), \qquad \text{where } \mathcal{E}(f) = \mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathcal{L}(f(x), y)\big]. $$

[1] ℱ is not properly defined here. In general, it is considered to be the space of all measurable functions, but we omit such formality in this case.

However, as mentioned earlier, we only have access to the training set S. The procedure thus aims to minimize the empirical risk instead:

$$ \hat{f} = \arg\min_{f \in \mathcal{F}} \hat{\mathcal{E}}(f), \qquad \text{where } \hat{\mathcal{E}}(f) = \frac{1}{n}\sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i). $$

Minimizing the empirical risk assumes that optimizing for the objective ℰˆ is "close enough" to optimizing for the objective ℰ [2]. Additionally, it is in general difficult to achieve this without restricting the class of estimators ℱ to some other, more restricted class ℋ that has certain desirable properties [3].
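To make the notation concrete, the sketch below performs empirical risk minimization over a toy family of one-dimensional threshold classifiers with a 0-1 loss. The data, the loss, and the estimator family are illustrative assumptions only; they are not the setups studied later in this thesis.

```python
import numpy as np

def empirical_risk(f, X, Y, loss):
    """Average loss of the estimator f over the training set S = {(x_i, y_i)}."""
    return np.mean([loss(f(x), y) for x, y in zip(X, Y)])

# Illustrative 0-1 loss and a tiny family of 1-d threshold classifiers.
zero_one = lambda y_hat, y: float(y_hat != y)
family = [lambda x, t=t: int(x > t) for t in np.linspace(-1.0, 1.0, 21)]

# Toy training set drawn from an assumed distribution D.
rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=200)
Y = (X > 0.1).astype(int)  # underlying rule, unknown to the learner

# Empirical risk minimization: pick the estimator with the lowest training error.
f_hat = min(family, key=lambda f: empirical_risk(f, X, Y, zero_one))
print(empirical_risk(f_hat, X, Y, zero_one))
```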

1.1.2 Generalization and the Bias-Variance Tradeoff

The goal is to minimize ℰ, yet machine learning procedures minimize ℰˆ. The performance of a procedure is thus measured by the closest proxy to the expected risk: the error on a test set, or the generalization error. The latter can be high even if the empirical risk is minimized, a phenomenon referred to as overfitting. This happens when the family of estimators ℋ is very large: the number of estimators that can fit the data is then also very large, and it is hard for the procedure to pick the one that leads to similar expected and empirical risks. In this case, we say that the procedure has high variance. It is thus common to restrict the family of estimators further to limit the number of estimators that the procedure can find. In doing so, a certain bias away from the data is introduced, and if the family is too small, it becomes impossible to fit the training data at all, leading to both training and generalization errors being high. There is therefore a tradeoff between bias and variance, controlled by the complexity of the family or class of classifiers that are learnable through training, commonly referred to as the capacity of the model. However, the bias-variance tradeoff is not solely controlled through capacity. There are other ways to bias the model towards learning simple estimators, such as adding a regularizing term to the empirical risk objective (known as explicit regularization) or using implicit regularization such as early stopping.

[2] There are multiple metrics used to define closeness in this context.
[3] ℋ is usually chosen to be a Reproducing Kernel Hilbert Space because of the desirable properties of such spaces. These details are not necessary for the development and are thus not explained here.
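Schematically, explicit regularization replaces the empirical risk objective above with a penalized one; the penalty Ω and its weight λ below are generic placeholders rather than a specific choice made in this thesis:

$$ \hat{f}_{\lambda} = \arg\min_{f \in \mathcal{H}}\; \hat{\mathcal{E}}(f) + \lambda\,\Omega(f), $$

where Ω(f) measures the complexity of the estimator (for example, a norm of its parameters) and λ controls the balance between fitting the data and keeping the estimator simple.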

1.1.3 Feature Maps

In certain machine learning problems such as linear regression, logistic regression and Support Vector Machines (SVM), the family of estimators is taken to be a linear family. However, the mapping between 푥 and 푦 is often non-linear. Thus, it is common to map the inputs 푥 onto a space where their relationship to the labels is linear. Such a transformation is done through feature maps Φ: 풳 → ℳ. The feature space ℳ can often have a very high, or even infinite, number of dimensions. Thus, kernel methods are used to avoid having to perform computations using the features Φ(푥) [STC04]. Additionally, the choice of the right feature map to use for specific learning problems requires careful feature engineering and is often a difficult problem.
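As a small illustration of the kernel shortcut mentioned above, the sketch below compares an explicit degree-2 polynomial feature map with the corresponding kernel evaluation. The inputs and the particular feature map are assumptions chosen purely for demonstration.

```python
import numpy as np

def phi(x):
    """Explicit degree-2 polynomial feature map for 2-d inputs x = (x1, x2)."""
    x1, x2 = x
    return np.array([1.0, np.sqrt(2) * x1, np.sqrt(2) * x2,
                     x1 ** 2, np.sqrt(2) * x1 * x2, x2 ** 2])

def poly_kernel(x, z):
    """Kernel trick: computes <phi(x), phi(z)> without forming phi explicitly."""
    return (1.0 + x @ z) ** 2

x = np.array([0.5, -1.0])
z = np.array([2.0, 0.3])

# Both quantities agree, but the kernel never works in the feature space.
print(np.dot(phi(x), phi(z)))   # explicit feature computation
print(poly_kernel(x, z))        # kernel evaluation
```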

1.2 Deep Learning

Introduced for the first time as the "Neocognitron" [Fuk80], neural networks are now predominantly used in a wide range of applications. The recent surge of interest in deep learning models came after their success in the ImageNet competition in 2012 [KSH12]. Ever since then, they have been applied successfully to a wide range of problems such as image classification, object recognition, speech recognition, control theory, game playing and others [KSH12, HZRS15, GMH13, SHM+16]. The power of deep neural networks comes from the fact that they do not require practitioners to engineer specific feature maps Φ for the tasks at hand. Instead, by feeding the inputs through a sequence of layers, deep learning models learn a representation of the inputs while learning the input-to-output mapping.

1.2.1 Preliminaries and Notation

Neural networks are a sequence of layers, each consisting of a linear transformation followed by a non-linear activation function. In our work, we will use the following notation to denote the estimator f represented by a depth-k neural network:

$$ f : x \mapsto W_k\,\sigma_{k-1}\big(W_{k-1}\,\sigma_{k-2}(\cdots W_2\,\sigma_1(W_1 x))\big), $$

where each W_j denotes the parameter matrix at layer j and each σ_j denotes the activation function at layer j [4]. In our treatment, we mainly consider the ReLU activation function x ↦ max{x, 0}. Additionally, the height h of the network is usually defined as the largest row or column dimension of the matrices W_1, ..., W_k. From a statistical learning perspective, the parameters of the network (essentially the entries of the matrices) serve as an index into the space of estimators. Therefore, in deep learning, minimizing the empirical risk corresponds to learning the parameters of the network that lead to the best estimator. We denote all the parameters of the network by the vector θ [5]. Thus, the network is the estimator f_θ parametrized by θ. The class of estimators that can be represented by the network is defined by its architecture, which encompasses choices such as the depth of the network, the type of the different layers (convolutional, fully connected, max pool, etc.), and the height of these layers.

Let ℒ be an arbitrary loss function (usually the cross-entropy loss for classification problems). The most commonly used method to learn the parameters θ that minimize the empirical loss is Gradient Descent (GD). GD is an iterative first-order method that moves the network parameters in the direction opposite to the gradient, or equivalently, the direction of steepest descent with respect to the objective. The update step in GD at time t is:

$$ \theta^{t+1} \leftarrow \theta^{t} - \eta\,\frac{\partial}{\partial \theta^{t}} \sum_{i=1}^{n} \mathcal{L}\big(f_{\theta^{t}}(x_i), y_i\big), $$

where η is a hyperparameter referred to as the learning rate. In practice, GD is usually replaced by its variant, Stochastic Gradient Descent (SGD). Instead of summing over all the training examples at each step (which can be very computationally intensive),

[4] The activation functions σ are generally required to be Lipschitz continuous.
[5] θ corresponds to stacking the entries of the matrices W_1, ..., W_k into a 1-dimensional vector.

SGD approximates the empirical loss by sampling one training point (x_i, y_i) (or a batch of training points) at a time and computing the loss on that point for the update. The update is as follows:

$$ \theta^{t+1} \leftarrow \theta^{t} - \eta\,\frac{\partial}{\partial \theta^{t}} \mathcal{L}\big(f_{\theta^{t}}(x_i), y_i\big). $$

Practically, the computation of the gradients is done via backpropagation [LBH15]. Additionally, many variants of SGD are used for different types of networks in practice, such as adaptive methods that adjust the learning rate dynamically, like AdaDelta [Zei12] and Adam [KB14], or approximate second-order methods that use curvature information, like K-FAC [MG15].
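The following PyTorch sketch makes the mini-batch SGD update above concrete on a placeholder two-layer ReLU network; the architecture, the synthetic data, and the hyperparameters are assumptions for illustration and are not the models or settings used in our experiments.

```python
import torch
import torch.nn as nn

# Placeholder network f_theta : R^d -> R^{|C|}.
d, n_classes = 20, 10
model = nn.Sequential(nn.Linear(d, 64), nn.ReLU(), nn.Linear(64, n_classes))
loss_fn = nn.CrossEntropyLoss()                      # cross-entropy loss L
opt = torch.optim.SGD(model.parameters(), lr=0.1)    # learning rate eta

# Toy training set (x_i, y_i), drawn at random for illustration.
X = torch.randn(256, d)
Y = torch.randint(0, n_classes, (256,))

for epoch in range(5):
    for i in range(0, len(X), 32):            # mini-batch SGD
        xb, yb = X[i:i + 32], Y[i:i + 32]
        opt.zero_grad()
        loss = loss_fn(model(xb), yb)         # loss on the sampled batch
        loss.backward()                       # gradients via backpropagation
        opt.step()                            # theta <- theta - eta * gradient
```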

1.2.2 The Science of Deep Learning

In the past couple of years, deep neural networks have proven to be extremely powerful methods from a practical point of view. However, there is very little theory around deep learning. The root of this shortcoming is that most of the developments in learning theory are for problems that fall under the umbrella of estimators that are linear in the input data x or the features Φ(x), and that are chosen to have "nice" properties [6]. The estimators represented by deep networks are not linear and do not have such "nice" properties. To circumvent this, a line of research focuses on studying "shallow" networks, which are simplified versions of the architectures used in practice, to develop theoretical guarantees around such manageable setups. Another line of research focuses on explaining the behavior of deep neural networks through a mix of empirical and theoretical analyses. These works are mainly concerned with developing a science around deep learning. As for any novel phenomenon, the science is still in its early stages and the theoretical evidence is usually limited in scope and requires relatively strong assumptions. The goal is to develop a unified understanding of the mechanics of deep learning from both an optimization and a generalization perspective. Our work follows this line of thought

[6] We omit a formal definition of niceness because it is not important in the context of our work.

and focuses on the generalization performance of deep neural networks.

1.2.3 Generalization in Deep Learning

The most commonly used neural network architectures are heavily overparametrized [ZBH+16]: the number of parameters is often larger than the number of samples used for training. Therefore, the class of estimators represented by the neural networks is highly complex. In fact, such networks can represent any function given enough overparametrization: they are universal approximators [HSW89]. As explained in 1.1.2, traditional learning theory suggests that such networks should have a hard time generalizing. However, this is not the case in practice: often, increasing the number of parameters leads to an increase in test set accuracy [LSS14]. In our work, we investigate this odd behavior of deep learning models and tie it to a notion of inductive bias inherent to the networks.

1.3 Contributions: the Inductive Bias

The ultimate goal of any statistical learning system is to generalize, or in other words, to perform well on unseen data. To do so, the training procedure needs to avoid overfitting: the model needs to use a relatively small number of instances to learn simple rules that apply to a large number of instances. Therefore, the goal is for the learning process to be as similar as possible to induction. Traditionally, learning procedures are coupled with various methods that push them to induce, such as model complexity restriction, regularization and early stopping. In our work, we propose evidence that neural networks have an inherent inclination to induce: the inductive bias.

1.3.1 The Inductive Bias: a Definition

Induction is defined as the "inference of a general law from particular instances" [Oxf18]. The aim of most scientific fields is to induce. In fact, scientists often explain observed practical phenomena via compact rules that summarize the underlying dynamics of the observations. Newton could have written a large table mapping each observed initial condition to trajectories of moving objects, but he came up with simple equations that summarize the behavior of such objects instead. Our work suggests that deep learning models have an affinity to induce. They prioritize learning simple hypotheses from noisy observations instead of memorizing said observations.

1.3.2 Laziness, or Learning Simple Things First

As intellectually pleasing as it might be, the idea that the inductive bias in neural networks comes from a certain "human" drive to learn general dynamics is obviously far from being true. Our results show that the root of the inductive bias is a certain laziness of deep learning models: they tend to learn on simple structured data first, even if they have enough capacity to learn on the more complicated unstructured data. We propose a proof of concept that demonstrates that deep networks prioritize learning simple classification boundaries before complex and elaborate ones, and tie this phenomenon to the generalization ability of the networks.

1.3.3 Simplicity is Not General

In general, there is no global ordering for simplicity. Some boundaries may be considered simple for some networks and complex for others. We propose evidence that simplicity is defined by the implicit priors inherent to deep learning procedures. In this respect, we call for careful reasoning about the a priori beliefs that deep networks incorporate, as they can heavily bias the training procedure and impact the performance from both optimization and generalization standpoints.

1.4 Outline

The thesis is organized as follows. Chapter 2 synthesizes the key findings in the works related to the topic. In chapter 3, we discuss the robustness of deep learning models to large amounts of noise and tie it to their laziness. The latter is discussed extensively in chapter 4, where we propose evidence showing the inductive bias at play in deep learning procedures. In that chapter, we also present empirical results that suggest that neural networks learn simple things first and propose this phenomenon as a potential root of the bias to induce. Additionally, we redefine the concept of simplicity in chapter 5 as the byproduct of the a priori conditioning of the networks. Chapter 6 summarizes our results and proposes potential avenues for future research.

Chapter 2

Related Works

Deep learning models have recently helped achieve a large number of successes in many fields such as image classification, machine translation and others. However, there is still very little theory around such models and how they work. In fact, the current machine learning toolkit is insufficient to explain the performance of neural networks from both optimization and generalization standpoints. Understanding deep learning models’ ability to generalize is thus a very active area of research in machine learning and statistics today.

2.1 The Quest to Uncover Deep Learning Generalization

The traditional machine learning view on generalization is that models with high capacity, or with a number of parameters larger than the number of samples, tend to exhibit very poor test set performance [GBC16]. This is because such models have the ability to memorize and thus overfit the training set. However, although often severely overparametrized, deep neural networks have exhibited a phenomenal ability to generalize well on unseen data. There is very little theoretical backing for this phenomenon. In fact, deep neural networks are considered to be universal approximators: given enough layer width, they are capable of approximating any measurable function to any desired degree of accuracy [HSW89]. The traditional learning theory paradigm thus tells us that such networks should in principle heavily overfit the training set and struggle with achieving high generalization accuracy. The current theory is thus clearly inapplicable in this situation, and there is a range of investigative works that aim to unveil the roots of generalization in deep learning.

2.1.1 Stochastic Gradient Descent (SGD) as a Driver of Generalization

The ability of neural networks to generalize has been linked to the driving force behind their optimization: stochastic gradient descent. In [ZLR+18], it is claimed that SGD optimization pushes the network parameters towards "flat minima" instead of "sharp minima", because such minima make the solutions more robust to the inherent fluctuations between train and test sets. SGD helps achieve such minima because of the inherent noise in the gradients computed by the method. Similar results are found in [KMN+16], where it is argued that larger batch sizes lead to "sharp" minima and hurt the generalization performance, since they reduce the stochasticity of the gradient updates. We follow a different path in our work and focus on the inherent biases in the models rather than on properties of the optimization procedures.

2.1.2 Overparametrization as a Feature

Another line of work studies the relationship between overparametrization and generalization in deep neural networks. Although overparametrization is traditionally considered an inhibitor of generalization, there is evidence suggesting that it does not play such a role in deep learning. Such evidence shows that increasing the number of parameters can often lead to better generalization accuracy because the additional parameters make training faster and the optimization problem easier [LSS14, AGNZ18]. The work in [LL18] couples the effects of SGD and overparametrization. They prove that sufficient overparametrization leads SGD to learn parameters that are close to the random initialization. The idea of the model parameters moving little in terms of common distance metrics is linked to good generalization error, since moving little is considered a form of implicit regularization in learning theory. This idea is investigated in a variety of other works. The work in [NLB+18a] proposes a capacity bound based on the Frobenius (L2) distance between the parameters at convergence and the randomly initialized parameters, a bound that is correlated with the test error. Such a distance is also argued to decrease with overparametrization. Additionally, assumptions on the same distance metric are used in [ALL18] to prove that the generalization error of the solution can be independent of the number of parameters; an idea also investigated in [GRS17]. These works propose evidence towards some sort of inductive bias inherent to the networks that stems from the coupling of overparametrization, SGD and random initialization. Our results extend the investigation of this concept.

2.1.3 Interpolation is not Equivalent to Overfitting

Traditional statistical learning theory does not explain the empirical performance of deep learning. In fact, the bias-variance tradeoff is central to the understanding of generalization for traditional learning methods such as kernel regression and support vector machines (SVM) [SSBD14, CST00]. Many measures of complexity, such as the VC-dimension, and many regularization mechanisms have been developed to address the tradeoff. However, such mechanisms fail to capture the behavior of deep neural networks [ZBH+16]. In fact, such analyses allow learning procedures to fit the data perfectly only when there is very little noise in the sampling process that leads to the empirical set, which is usually not the case for the applications in question. Thus, statistical learning is currently witnessing a surge of works that rethink the bias-variance tradeoff with the aim of reconciling learning theory with the performance of methods such as kernel machines, boosting, random forests, and deep learning.

The bias-variance tradeoff suggests that fitting the data perfectly, or interpolating it, is equivalent to overfitting. However, some recent works show that this equivalence is not general. The work in [LR18] proposes a proof of concept rejecting this equivalence. They study kernel regression with a very high dimensional Hilbert space ℋ. In general, such a method has the ability to fit the training set perfectly, and a regularization term $\lambda \|f\|_{\mathcal{H}}^{2}$ (where f is an estimator chosen from the family ℋ) is usually added to the objective function to avoid overfitting. The parameter λ, which is used to balance the bias-variance tradeoff, is usually set to 0 in practice. To explain this, they propose Kernel "Ridgeless" Regression and show that the minimum-norm interpolating solution can have mechanisms of implicit regularization that find their root in high dimensionality, the curvature of the kernel function and favorable data geometry, and that are isolated from the bias-variance tradeoff. These results are reinforced in [BRT18], where it is proved that interpolation in the context of non-parametric regression and square loss prediction can lead to optimal rates. This is because, although the interpolating estimator fits all training points, the influence of each point is "local" and the estimator is, in aggregate, pulled towards the optimal estimator. The works also show a coexistence between the bias-variance tradeoff and interpolation in the estimators studied. The generalization properties of interpolation schemes are also studied in [BHM18], where the performance of interpolation is analyzed through the lens of local classification methods such as geometric simplicial interpolation and nearest-neighbor-based schemes. A running hypothesis is that this new paradigm in learning theory explains the out-of-sample performance of neural networks. Our work proposes evidence towards this hypothesis.

2.2 Memorization in Deep Learning

Deep neural networks generalize well, but they have the capacity to memorize input-to-output mappings. One potential hypothesis suggests that stochastic gradient descent based training methods, coupled with early stopping, are enough to prevent neural networks from memorizing during the training procedure. However, empirical evidence suggests that this is not the case and that deep neural networks can overfit and memorize random noise. The experiments in [ZBH+16] demonstrate that neural networks, trained with SGD, are able to fit random noise in a relatively short amount of time. Therefore, it does not seem that a lack of ability to memorize noise is what prevents neural networks from overfitting that noise.

2.2.1 Noise Robustness in Deep Learning

Deep learning methods have exhibited large robustness to noise, even though they can fit the noise. The work in [RVBS17] suggests that deep networks exhibit extreme robustness to noise in the label space. The experiments they ran consist of augmenting datasets such as MNIST and CIFAR10 with a large number of randomly (uniformly or with bias) labeled images drawn from the datasets. The results show that the generalization performance is hurt relatively little, even when noisy training points severely outnumber the signal-bearing points, as long as there are enough points of the latter type. We extend these results and analyze them in the context of the inductive bias and laziness of neural networks.

2.2.2 Memorization is Secondary

The neural network architectures used in practice can deal with a large amount of noise. In fact, they treat noisy and real data differently, as suggested in [AJB+17]. The work mentioned proposes a practical definition of memorization as training on noise, as well as a way to measure the hardness of the hypotheses learned by the networks through the proposed Critical Sampling Ratio. They suggest that neural networks do not memorize real data, but only memorize noise. Additionally, they propose empirical evidence suggesting that neural networks learn simple patterns first and memorize noise second, and that the networks take advantage of shared patterns and structure between training examples to differentiate between the two types of data. They also link higher capacity to the ability to generalize better in high noise setups, since the networks use the extra capacity to memorize the noise.

The idea that neural networks have an inherent bias to learn simple hypotheses is also discussed in [NTS14]. They propose empirical and theoretical evidence that a type of implicit regularization, orthogonal to capacity control, plays a big role in the generalization of deep neural networks. They coin it the "Real Inductive Bias". They draw an analogy to random matrix theory to suggest implicit norm regularization as a potential source for the inductive bias. The result they present is that such implicit capacity control takes the form of L1 regularization in the top layer for infinite two-layer networks with L2 weight decay. We center our analyses around the proposed inductive bias hypothesis and propose evidence to back it.

2.3 Priors in Deep Learning

The idea that neural networks generalize because they do not move much has led to extensive research around the initial configuration of such models. Such configurations take the form of priors ingrained in the model. Although priors are more commonly used in fully probabilistic settings, some works in the Bayesian literature propose a broader definition of priors as a summary of beliefs before the model is run. In some instances, such beliefs can in fact be influenced by the likelihood or the model itself [GSB17]. Armed with this perspective, a range of works investigate priors in deep learning.

2.3.1 Priors as Network Biases

Priors are considered to heavily influence the training procedure of deep architectures. In fact, convolutional neural networks (CNNs) were introduced in [LBBH98] as a superior method for image classification because of the priors they hold, such as translation independence. Following the same line of thought, the work in [UVL17] presents evidence that the structures of ConvNets hold a large number of image statistics a priori. They use non-trained ConvNets for inverse problems such as denoising and show that the structures with randomly initialized parameters are sufficient for good performance. On the flip side, some works suggest that bad priors can degrade performance significantly. In [LLM+18], ConvNets are shown to perform very poorly on tasks that involve translation dependence. The work also proposes a new network, called CoordConv, that holds better priors for such tasks. The importance of priors has led recent works to focus on the development of priors that fit different tasks or even different learning paradigms. A "consciousness" prior is proposed in [Ben17] as a mechanism to bias the network towards learning representations in an abstract space rather than the pixel space. In our work, we propose a broad study of different types of deep learning priors and analyze their effect in the context of the inductive bias.

Chapter 3

On the Noise Robustness of Deep Learning Models

3.1 Introduction

In this section, we will study deep learning models’ robustness to what we refer to as "benign" noise. In general, deep neural networks tend to exhibit superior robustness to noise when compared to other machine learning methods such as SVMs or kernel regression. In this context, robustness refers to consistency in terms of generalization performance: robust models are models for which the prediction accuracy is unchanged or changed negligibly when trained on a dataset with high noise levels. Our results confirm that the test accuracy is changed very little even when the deep learning models are simple and there is a significant amount of noise. Through these results, we also make this statement about robustness to noise more precise and tie it to a more fundamental property of deep neural networks: "laziness" (which will be discussed extensively in chapter 4).

3.1.1 Benign Noise and Adversarial Noise

Figure 3-1: Adversarial example. The initial image (left) is correctly classified as a panda whereas the perturbed image (right) is classified as a gibbon, even though it looks exactly like the initial one to the human eye [GSS14].

We use the term benign noise to refer to noise, whether on the input space or the label space, that is the result of a purely stochastic process and that is not crafted for the purpose of fooling the network or hurting its performance. We carefully define the concept to distinguish this type of noise from the adversarial noise that is usually used in adversarial attacks. Malicious noise corresponds to noise that is carefully crafted to fool the learning algorithm without changing the input in the eyes of a human observer [DDM+04]. This type of noise has also been studied extensively in the context of deep learning for image classification [SZS+13, BCM+17]. Figure 3-1 is an instance of a typical adversarial example. These examples are usually created by solving the following optimization problem, based on each test point (x_i, y_i):

$$ x_i' = \arg\max_{\delta}\; \mathcal{L}(\theta, x_i + \delta, y_i). $$

Solving this optimization problem, and coming up with defenses against the algorithms that solve it, is a very active area of research in the field [MMS+17]. In our study, we are not concerned with the impact of test set noise on the predictions emitted by the neural networks: we focus on the impact of training set noise. We study both label and image space noise that is generated randomly, without malicious intent. This is an interesting type of noise to study if we operate under the assumption that the noise is better modeled by a stochastic process than by the result of an adversarial procedure; an assumption that holds in a variety of "real world" settings. To this end, we replicate and extend results from [RVBS17]. Our findings agree

with the results of the paper: the experiments we ran and our analysis of the results point towards robustness to heavy input and output space noise in deep learning models. From a generalization perspective, the models are unchanged. However, significant levels of noise can impact the optimization procedure, which makes robustness to noise require fine model tuning.
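Before leaving adversarial noise aside, the following sketch makes the maximization in Section 3.1.1 concrete with a single gradient step in the spirit of the fast gradient sign method of [GSS14]. The model, input shapes, and step size are placeholder assumptions; this attack plays no role in the experiments of this thesis.

```python
import torch
import torch.nn as nn

def fgsm_example(model, x, y, eps):
    """One gradient step on the loss w.r.t. the input: a crude one-shot
    approximation of the maximization over the perturbation delta."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = nn.CrossEntropyLoss()(model(x_adv), y)
    loss.backward()
    # Move every pixel by eps in the direction that increases the loss.
    return (x_adv + eps * x_adv.grad.sign()).detach()

# Usage with a placeholder linear classifier on 3x32x32 inputs.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
x, y = torch.randn(1, 3, 32, 32), torch.tensor([3])
x_prime = fgsm_example(model, x, y, eps=0.03)
```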

3.2 Generalization with High Output Domain Noise

In this section, we extend some of the work in [RVBS17] to further the understanding of deep learning’s performance in setups with a low signal-to-noise ratio. More specifically, we study situations where the number of "true" images that contribute to the signal we aim to learn is significantly outnumbered by the number of "fake" images that have no, or even negative, contribution to the signal of interest. The behavior is studied in an image classification setup, where for each truly labelled image we add a number of copies with randomly drawn labels. More formally, let the number of training examples be n. For each training example (x_i, y_i), we generate α training samples such that each generated "fake" sample (x_i^f, y_i^f) follows (where 풰 denotes the uniform distribution and 풞 denotes the set of possible classes in the image classification setup):

$$ x_i^f = x_i, \qquad y_i^f \sim \mathcal{U}[\mathcal{C}]. $$

We refer to this dataset as the "uniform label" dataset. Note that in this case, we do not modify the original training dataset but merely augment it.
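A sketch of how such a uniform label augmentation can be generated, under our reading of the procedure; the tensor shapes and class count are placeholders, and this is not the exact code used in the experiments.

```python
import torch

def uniform_label_dataset(X, Y, alpha, n_classes):
    """Augment (X, Y) with alpha copies of every image, each assigned a label
    drawn uniformly at random; the original pairs are kept unchanged."""
    X_fake = X.repeat_interleave(alpha, dim=0)
    Y_fake = torch.randint(0, n_classes, (len(X) * alpha,))
    return torch.cat([X, X_fake]), torch.cat([Y, Y_fake])

# Example with MNIST-shaped placeholders.
X = torch.randn(100, 1, 28, 28)
Y = torch.randint(0, 10, (100,))
X_aug, Y_aug = uniform_label_dataset(X, Y, alpha=9, n_classes=10)
print(X_aug.shape, Y_aug.shape)  # 1000 points: 100 true + 900 fake
```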

3.2.1 Non Linear Networks

We first investigate the behavior of networks with non-linear activations (more specifically, ReLU). Figure 3-2 outlines the results of runs with different values of α on MNIST [LC10]. In this experiment, we use a simple 4-layer convolutional neural network, with an SGD optimizer run for 100 epochs.

37 Figure 3-2: Test accuracy on true label test points in the uniform label MNIST dataset. The generalization error stays relatively low until very high values of alpha (∼ 50), then drops sharply. We attribute the drop to difficulty in optimization rather than a fundamental limitation of the training process.

It is clear that, even without excessive tuning compared to state-of-the-art architectures, the network’s generalization performance is not significantly hurt until a very high value of α, around 50 more specifically. After α = 50, we observe a degradation of the test accuracy. This seems to be a consequence of the optimization becoming more difficult, which calls for additional hyper-parameter tuning to enhance the optimization procedure. We used a standard, AlexNet-based [KSH12], 4-layer architecture and intentionally stayed away from very specific tuning, to make sure that the results proposed are a property of a wide class of neural networks and not of a specific configuration of the run. We used standard Stochastic Gradient Descent for optimization. Overall, the results point towards robustness to massive label noise: when α = 50, there are 50 randomly labelled images for each image with a true label, so the network is exhibiting good generalization performance in an extremely high noise-to-signal ratio environment.
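The exact architecture is not central to the argument; the sketch below shows the kind of small, AlexNet-flavored 4-layer network we have in mind for MNIST. The layer sizes are assumptions and not the precise configuration used in these runs.

```python
import torch.nn as nn

# Two convolutional layers followed by two fully connected layers.
mnist_cnn = nn.Sequential(
    nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(64 * 7 * 7, 256), nn.ReLU(),   # 28x28 inputs -> 7x7 feature maps
    nn.Linear(256, 10),
)
```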

Additionally, we ran the same experiment on CIFAR10 [KNH]. The model we used for CIFAR10 is also standard: a 6-layer ConvNet optimized via momentum SGD.

38 Figure 3-3: Test accuracy on true label test points in the uniform label CIFAR10 dataset. The generalization accuracy drops slowly but stays relatively high for high noise levels.

We observe similar trends to what we observed with MNIST, but in lower noise-to-signal ratio regimes (Figure 3-3): after α = 10, the generalization accuracy gets hurt and drops relatively sharply.

3.2.2 Linear Networks

To investigate the hypothesis further, we look into the behavior of linear networks when faced with high label noise. We run the experiment on MNIST with the same architecture, except that we drop the non-linear activation functions, specifically ReLU. The results presented in Figure 3-4 show that linear models are even more robust: the drop in accuracy observed for high α when the network used ReLU activations is not observed anymore. In general, deep linear networks are "easier" to optimize than their non-linear counterparts. Additionally, the linear model has the same number of parameters as the ReLU-based model, so there is no capacity difference between them. Therefore, the result reinforces our hypothesis: the drop in accuracy for the non-linear networks arises from a difficulty in the optimization domain and not the generalization domain.

Figure 3-4: Test accuracy on true label test points in the uniform label MNIST dataset, with a linear model. We can see that the model is very robust to noise and the generalization accuracy is affected minimally.

Our investigation shows that deep learning models are robust to excessive amounts of label noise: the network’s learning procedure is unfazed even when 99% of the training dataset is corrupted. It is apparent that the network uses a "majority" decision rule during training: as long as the true label is marginally overrepresented in the relevant subspace, the network will pick it as the correct label for that subspace. More broadly, this hints at an interesting and more general behavior: deep networks have a predisposition to assign one label to the subspace in question instead of "overfitting" and memorizing the uniform labels. In other words, the networks are conditioned to learn simple decision rules.

In this section, we analyzed robustness to label noise, but noise can also manifest itself in the input domain. We will analyze this next.

3.3 Generalization with High Input and Output Domains Noise

In the previous section, we studied noise on the label space, so the inputs were unchanged. We now generate new training points by adding Gaussian noise to the input images. Let the input images x_i lie in R^d, let 풩 denote the multivariate normal distribution, let 0_d and I_d denote the d-dimensional zero vector and identity matrix respectively, and let σ² denote the variance. Each fake training point (x_i^f, y_i^f) is generated as:

$$ \gamma \sim \mathcal{N}(0_d, \sigma^2 I_d), \qquad x_i^f = x_i + \gamma, \qquad y_i^f \sim \mathcal{U}[\mathcal{C}]. $$

We refer to this dataset as the "white noise" dataset.
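A sketch of the white noise generation under our reading of the construction above; the noise scale σ and the tensor shapes are placeholder assumptions.

```python
import torch

def white_noise_dataset(X, Y, alpha, n_classes, sigma=0.5):
    """For every true image, create alpha fake points by adding isotropic
    Gaussian noise to the image and drawing a label uniformly at random."""
    X_fake = X.repeat_interleave(alpha, dim=0)
    X_fake = X_fake + sigma * torch.randn_like(X_fake)    # x^f = x + gamma
    Y_fake = torch.randint(0, n_classes, (len(X_fake),))  # y^f ~ U[C]
    return torch.cat([X, X_fake]), torch.cat([Y, Y_fake])
```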

3.3.1 Input Domain Noise as an "Easier" Task

The white noise dataset can help us investigate how the neural network assigns labels to subspaces. We use the same models as in Section 3.2 to train on augmented datasets generated from MNIST and CIFAR10. The results are presented in Figure 3-5. The generalization accuracy is effectively oblivious to the addition of noisy images to the dataset: it is the same across a wide range of values of α for both datasets. In some way, this shows that training on the white noise dataset is a strictly "easier" task than training on the uniform label dataset.

3.3.2 Towards the "Laziness" Property of Deep Neural Networks

For the MNIST dataset, a simple argument could be made about the results: the white noise affects the usually empty areas of the images and can create a simple "backdoor" for classification.

Figure 3-5: Test accuracy on true label test points in the white noise MNIST (a) and CIFAR10 (b) datasets. The added noisy images have no effect on the generalization accuracy. The accuracy on the uniform label dataset is added for comparison.

However, in the more general case (CIFAR10 for example), this is not necessarily obvious a priori: as discussed earlier, the network can be using a majority decision rule when it sees different signals that correspond to the same image; that does not necessarily entail that the network would learn the same decision rule when it sees neighboring images with different labels.

The reasoning used above does not seem to be sufficient to explain what is going on in this situation. The fact that the accuracy is untouched shows that the additional training examples do not influence the training process and are somewhat ignored by the network. This ties back to the hypothesis we proposed earlier about networks aiming to learn simple things first. In fact, the additional training points are very close to the true points in terms of common distance metrics such as the L2 or L∞ norm, yet they carry different labels; they are thus not "simple", in the sense that sophisticated decision boundaries would be required to classify them correctly. The network does not get influenced by these data points and prioritizes the "real" data points, which exhibit more structure. In order to do this, the networks need to be preconditioned to learn simple things first; in other words, the networks have to have an implicit inductive bias. We will discuss this extensively in chapter 4.

Additionally, the ability of the network to disregard the noisy training examples is interesting in and of itself. The generated points can be understood as being outside the data manifold, and the network seems to be aware of that. We will make this statement more precise in the next chapter.

Chapter 4

Learning Simple Things First: On the Inductive Bias in Deep Learning Models

In the last chapter, we observed an interesting property of deep neural networks: they can handle a massive amount of noise, in both the label and image spaces. Such a property points towards the networks being able to ignore fake, noisy inputs and focus on the true, signal-bearing inputs. In this chapter, we will analyze the cause of this behavior: deep learning models prefer to learn simple things, and noisy inputs are not simple.

4.1 Introduction

Our work aims to provide evidence towards deep learning models having an inherent inductive bias: they tend to learn underlying rules, in other words simple models, instead of memorizing specific training examples. If our hypothesis is true, this would explain the relatively good generalization performance typically observed in deep learning procedures. Such a behavior is surprising for deep models since they are usually highly over-parametrized, and we would thus expect the traditional statistical learning theory bias-variance tradeoff to cause a high generalization gap. In our study, we tie our results to an active area of research in statistical learning theory that rejects the traditional bias-variance tradeoff: recent works in the literature propose evidence that fitting does not imply overfitting and that models’ bias-variance tradeoffs are not necessarily tied to the training set performance [BRT18, LR18, BHM18]. The procedures in question learn simple decision rules even if they interpolate the data. The hypothesis of inductive bias would be a step forward towards reconciling deep learning’s performance with this new thread in learning theory.

As discussed in Chapter 2, the idea of inductive bias in deep learning is not novel. It has been hypothesized that deep learning models have such a bias, which manifests itself in the form of implicit regularization [NTS14]. This regularization mechanism is independent of capacity control. We aim at making this statement more precise and study the mechanism through which deep learning models are biased to learn simple and general decision rules.

The results of Chapter 3 show that even when there is a significant amount of noise, neural networks still generalize well. As mentioned earlier, this could imply that the models are pre-conditioned to learn simple rules: noisy inputs are more complex than true inputs since they have significantly less structure. We extend this result in section 4.2, which unveils a very interesting property of the networks: they learn as if the fake data were not there. Our results are augmented by additional experiments in section 4.3 showing that the networks are "aware" of the type of data they are learning from: noisy fake data or true structured data. We tie this behavior of neural networks to their tendency to work with compressed representations of the problem. In section 4.4, we create a synthetic experimental setup where we train simple networks on data with varying degrees of simplicity to show that the networks learn simple things first. Such a behavior would explain the affinity of deep learning models to compress the space in which they are operating.

4.2 A Surprising Behavior: Generalization is Oblivious to Fake Images When it Matters

In this section, we study the generalization of deep neural networks when faced with a mix of "true" structured data and "fake" unstructured data. Our results show that the generalization error is unaffected by the addition of the unstructured data: memorization of the fake data happens after learning the underlying principles governing the real data, and when it does happen, it does not impact "real learning".

4.2.1 Data Generation: the Gaussian Directions and CIFAR푝

The datasets used for the experiments are divided into two parts: true and fake. The true dataset contains image and label pairs drawn from CIFAR10. The fake dataset is synthetically generated based on the CIFAR10 images. We use $\hat{D}$ and $\hat{D}^f$ to denote the true and fake datasets respectively. There are two types of fake datasets used in this section. In section 4.2.3, the fake dataset contains the real images with randomly assigned labels. In this case, a fraction of the training examples from CIFAR10 are randomly assigned a uniform label. We use the parameter p to denote the probability of changing an image’s label to a uniform label; thus, the overall dataset contains n = 50,000 training points, with a fraction ∼ p of them having uniform labels instead of their true label. We call such datasets CIFAR푝. In section 4.2.2, the fake dataset is not generated by editing CIFAR10; it is composed of nα generated training points: for each image-label pair (x_i, y_i) in the CIFAR10 dataset, we create α fake training points (x_i^f, y_i^f) as follows, where d denotes the dimension of the images and ε is a real-valued parameter:

$$ \gamma \sim \mathcal{N}(0_d, I_d), \qquad x_i^f = x_i + \epsilon\,\frac{\gamma}{\|\gamma\|_2}, \qquad y_i^f \sim \mathcal{U}\big[\mathcal{C} \setminus \{y_i\}\big]. $$

47 Figure 4-1: Images obtained after adding random gaussian directions to CIFAR10 images. We use different values of 휖 from left to right: 0, 50, 500, 5000. We see that for small epsilon the images are modified negligibly.

Essentially, the fake points corresponding to each true point are composed of images dispersed randomly over the radius-ε hypersphere around it, with uniformly drawn labels that are misleading (not the true label). We will call this dataset the "Gaussian Directions" dataset. Figure 4-1 shows some true images with their fake counterparts for different values of ε. In general, ε needs to be on the order of 10³ for the images to be modified significantly. This makes sense, since a norm of ε ∼ 10³ in d = 3,072-dimensional space corresponds to an average shift on the order of 1 pixel.
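Under our reading of the construction above, the Gaussian Directions generation can be sketched as follows. The code assumes 4-dimensional image tensors (N, C, H, W), and the way wrong labels are drawn is one possible implementation of sampling uniformly from 풞 without the true label; it is an illustration, not the exact experiment code.

```python
import torch

def gaussian_directions(X, Y, alpha, n_classes, eps):
    """For each true image x_i, create alpha fake images at L2 distance eps
    along random directions, each with a uniformly drawn *wrong* label."""
    X_rep = X.repeat_interleave(alpha, dim=0)
    Y_rep = Y.repeat_interleave(alpha, dim=0)
    gamma = torch.randn_like(X_rep)
    gamma = gamma / gamma.flatten(1).norm(dim=1).view(-1, 1, 1, 1)  # unit norm
    X_fake = X_rep + eps * gamma                 # x^f = x + eps * gamma/||gamma||
    # Uniform over C \ {y_i}: add a random non-zero offset modulo n_classes.
    offset = torch.randint(1, n_classes, Y_rep.shape)
    Y_fake = (Y_rep + offset) % n_classes
    return X_fake, Y_fake
```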

4.2.2 Generalization with Gaussian Directions

We ran a 6-layer ConvNet on Gaussian Directions datasets with different values of α and ε. Figure 4-2 shows the generalization accuracy for different values of ε. The poor performance at low values of the Gaussian norm is explained by the fact that the images stay essentially the same (the noise is imperceptible), as discussed in section 4.2.1. This is an interesting property, since it seems that if the fake images are far "enough" from the true images, the training on the true images does not get perturbed, even if the ratio α of fake points to true points is very high. We have also run the experiment for different values of α. The results are shown in table 4.1. There is no dependence on α when ε is high enough: even when the fraction of true input points among all inputs is as low as 5%, the network’s generalization performance is unchanged. Additionally, we can see that the fake data gets memorized to some extent, since the fake train accuracy is above chance (> 10%); this means that even though the network memorizes noise, the memorization does not impact the test accuracy or the network’s learning on the true structured data. This idea has been observed in [AJB+17].

Figure 4-2: Test accuracy vs. ε for the Gaussian Directions dataset with α = 9. We see that after ε = 45 the test accuracy is the same as the accuracy obtained on the CIFAR10 dataset without any augmentation.

Additionally, the network seems to deal with true and fake images differently. If we look at any of the runs mentioned previously (an example is in Figure 4-3), we see that the training accuracy progresses faster and earlier for the true dataset. In other words, the network trains on the true dataset first, ignoring the fake images, until it reaches good accuracy and converges. The training on the fake images does not start until after the training on the true images is over, and it corresponds to memorizing the fake, unstructured labels: the test accuracy on the fake points stays fixed around random chance. It is clear from this experiment that the network prioritizes learning the true labels.

$\alpha$ | Total Train Accuracy | Real Train Accuracy | Fake Train Accuracy
9 | 28.43% | 95.40% | 21.32%
19 | 27.81% | 96.12% | 24.37%
29 | 26.78% | 95.32% | 24.52%

$\alpha$ | Total Test Accuracy | Real Test Accuracy | Fake Test Accuracy
9 | 46.71% | 83.42% | 10.00%
19 | 47.27% | 84.54% | 10.00%
29 | 46.07% | 82.13% | 10.00%

Table 4.1: Test and train accuracies for the Gaussian Directions dataset with different values of $\alpha$. We set $\epsilon = 45$ for these experiments. We can see that the test accuracy on the real images does not change when we add fake training examples to the dataset.

Figure 4-3: Training run on a Gaussian Directions dataset with $\alpha = 9$ and $\epsilon = 45$. The network treats the real and fake images as two distinct entities: it learns on the true dataset first to reach good training set performance, then starts memorizing the fake labels.

This behavior of neural networks is interesting. To see this, we reason about the Gaussian Directions dataset. We denote by $\hat{D}_{c_k}$ the subset of the true training set that pertains to class $c_k \in \mathcal{C}$: $\hat{D}_{c_k} = \{(x, y) \in \hat{D} \text{ s.t. } y = c_k\}$. We also denote by $\hat{D}^f_{c_k}$ the fake training points that pertain to class $c_k$: $\hat{D}^f_{c_k} = \{(x^f, y^f) \in \hat{D}^f \text{ s.t. } y^f = c_k\}$. Let us look at one point $(x_i, y_i) \in \hat{D}_{c_k}$ and one of its $\alpha$ fake counterparts $(x_i^f, y_i^f) \in \hat{D}^f_{c_k}$. As discussed previously, the L2 distance between these two points is $\|x_i - x_i^f\|_2 = \epsilon$. Let the distance between $x_i$ and some arbitrary test point $x_i'$ that belongs to class $c_k$ be denoted $\delta$. In our experimental setup, we have $\epsilon \ll \delta$ (as can also be seen in Figure 4-1). In some sense, the fake training points fall "between" the true training points and the true test points. Figure 4-4 proposes a pictorial representation of the dataset.

Figure 4-4: The Gaussian Directions dataset. True training samples (blue) are surrounded by a number of generated data points (red).

Thus, it would be reasonable to expect these fake training points to act as some form of "barrier" that can modify the class label for images that are further away from $x_i$. The results of our experiments show that such a barrier does not exist, as the label is still $c_k$ when we go further out and reach test points. However, some sort of barrier does get created. We used the models trained on the Gaussian Directions dataset with certain $\alpha$ and $\epsilon$ and probed the landscape, or more specifically the subspace-to-label map. To do that, we generate new test points whose images are produced through the same procedure as for the fake training set, but at a distance equal to a fraction of $\epsilon$, and with labels $c_k$. The results are summarized in table 4.2. The test accuracy is affected when we go further away from the training points. The label is modified and influenced when we move along a random direction, so the barrier exists along most random directions. Nevertheless, it does not exist when we move within a very specific subspace, which contains the true test points and other true training points: the data manifold. This is evidence that the network is aware of the manifold. We will discuss this further in section 4.3.

Distance | Test Accuracy
$0.1\epsilon$ | 82.53%
$0.5\epsilon$ | 72.45%
$1.5\epsilon$ | 2.00%
$2.0\epsilon$ | 2.00%

Table 4.2: Test accuracies for the probing experiment on the Gaussian Directions dataset. We can see that the subspace between the true and fake training points is torn between them. However, as we go further than $\epsilon$ away, the network does not recognize $c_k$ as the label anymore.
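As an illustration of the probing procedure (a sketch under our own naming assumptions, not the thesis code), the probe points at a fraction of $\epsilon$ can be generated as follows.

```python
import numpy as np

def probe_point(x_true, c_k, frac, eps, rng=None):
    """Sketch: generate one probe image at distance frac * eps from a true
    training image along a random Gaussian direction, keeping the true label c_k."""
    rng = np.random.default_rng() if rng is None else rng
    gamma = rng.standard_normal(x_true.size)
    x_probe = x_true + frac * eps * gamma / np.linalg.norm(gamma)
    return x_probe, c_k

# Probe the subspace-to-label map at several radii around one image
x_true = np.zeros(3072)
probes = [probe_point(x_true, c_k=3, frac=f, eps=45) for f in (0.1, 0.5, 1.5, 2.0)]
```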

4.2.3 Generalization in CIFAR$_p$

In this section, we provide additional evidence towards our claim, before we dig deeper into the data manifold. We run experiments on the CIFAR$_p$ dataset described in section 4.2.1 with different values of the proportion $p$. The results reported in table 4.3 show that the accuracy drops linearly with $p$. Therefore, the network's learning procedure is impacted in this case. This seems to contradict the results of section 4.2.2 a priori. However, in this case, we edited the dataset rather than augmenting it. Thus, there are fewer true training points overall, and the drop in accuracy comes mainly from a sample complexity argument. In fact, we still observe the behavior seen on the Gaussian Directions dataset: the network learns from the true data first (Figure 4-5). Therefore, even though the fake images have more structure in this case, the labels are still unstructured. This shows that the labels contribute to the structure, or lack thereof, of a training point.

$p$ | Test Accuracy
0.0 | 87.70%
0.25 | 80.84%
0.5 | 75.22%
0.75 | 62.18%
1.0 | 12.62%

Table 4.3: Test accuracies for CIFAR$_p$. The accuracy goes down linearly with $p$: as more true labels get flipped, the true signal vanishes.
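For completeness, a minimal sketch of the CIFAR$_p$ label corruption (NumPy, illustrative names only) is given below: with probability $p$, each training label is replaced by a uniformly drawn label.

```python
import numpy as np

def make_cifar_p_labels(labels, p, num_classes=10, rng=None):
    """Sketch: with probability p, replace each label by one drawn uniformly
    over all classes, as in the CIFAR_p construction of section 4.2.1."""
    rng = np.random.default_rng() if rng is None else rng
    noisy = labels.copy()
    flip = rng.random(labels.shape[0]) < p                       # ~ fraction p of points
    noisy[flip] = rng.integers(0, num_classes, size=flip.sum())  # uniform replacement
    return noisy

# Example: corrupt half of 50,000 labels
y_true = np.zeros(50_000, dtype=int)
y_noisy = make_cifar_p_labels(y_true, p=0.5)
```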

These results are not obvious: the networks are heavily over-parametrized, and given their capacity, they could very well memorize the labels in an indiscriminate way. However, the networks seem to distinguish datasets based on their structure, and they are preconditioned to learn on the more structured, less complex dataset. This is not a property observed in traditional learning theory; in fact, using high-capacity kernels, such as the RBF kernel, will usually lead to overfitting in this case. To this extent, traditional reasoning about capacity control and generalization does not apply here; there is another underlying dynamic at play. The networks aim to learn simple rules that can explain the data before memorizing outliers. In the next section, we propose a deeper dive into the network's ability to distinguish real and fake data points, as well as its interaction with the data manifold.

Figure 4-5: Training run on a CIFAR$_{0.5}$ dataset. As in the Gaussian Directions case, the network learns on the true dataset first.

4.3 Data Manifold Awareness

In the last section, we discussed an interesting phenomenon: neural networks can distinguish between real data and generated unstructured data. We also introduced the idea that the networks seem to focus their learning on data that lies on the data manifold. We will now make this statement more precise.

In deep learning, data lies in a very high-dimensional space $\mathbb{R}^d$. However, the different features of the data (in other words, the entries of the high dimensional vector) are usually far from independent: for example, pixels in images are heavily correlated, and commonly used datasets contain images drawn from a very small subset of the set of all possible images. This smaller subset is a lower dimensional set where the data lies: we call it the data manifold. More formally, let $\mathcal{D}$ be the data distribution over real-valued vectors in $\mathbb{R}^d$; the data manifold $\mathcal{V}$ is defined as $\mathrm{supp}(\mathcal{D})$ and can usually be identified with a subset of $\mathbb{R}^k$ where $k \ll d$. In this section, we provide evidence that neural networks "recognize" the data manifold and treat points generated from outside the manifold as secondary. We also relate this behavior to the networks' affinity to compress data and learn simple classifiers.

4.3.1 Differential Treatment of Real and Synthetic Images

In section 4.2, we proposed a range of evidence showing that the networks learn structured data first. In the case of the Gaussian Directions dataset, the added points were ignored by the network up until the optimization on the true points was over. The network thus optimizes for these two sets of points differently. Additionally, it seems like the classification barriers that are induced by the two sets of points have different effects: only the barriers enforced by the true training set matter for the classification accuracy. Such barriers impact points on the data manifold, which is why true test points are affected. The fact that the network makes this distinction points towards it being aware of some high dimensional manifold in which the true data lies and on which it focuses its learning. Figure 4-4 shows a representation of the relatively lower dimensional thread, or manifold, that connects the true train and test data.

4.3.2 Towards Identifying the Data Manifold: Unsupervised Learning

We claim that networks distinguish between data on the manifold and data outside of it, then use this distinction to train differently on the two types of data. We rephrase this statement more formally. The goal is to learn a conditional distribution on the output labels $\mathcal{P}_{Y|X}(Y = c \mid X = x)$, where $Y$ takes values in $\mathcal{C}$ and $(x, y) \sim \mathcal{D}$. Our claim is that the network learns an intermediary distribution $\mathcal{P}_{T|X}(T = t \mid X = x)$, where $T$ is a variable that takes values in the alphabet $\mathcal{T} = \{real, fake\}$, then uses the conditional distribution $\mathcal{P}_{Y|X,T}(Y = c \mid X = x, T = t)$ to determine the final label. Note that the variable $T$ does not have an obvious relationship with $Y$: the labels in $\mathcal{C}$ are not sub-classes of the symbols in $\mathcal{T}$. This means that training to learn $\mathcal{P}_{Y|X}$, which is what the optimization procedure is doing, does not necessarily entail or even favor learning $\mathcal{P}_{T|X}$ a priori.

In order to test our hypothesis, we take the model trained on the Gaussian Directions dataset in section 4.2.2 and fix the parameters learned to achieve low softmax loss, or in other words, to learn $\mathcal{P}_{Y|X}$. We then replace the 10-dimensional output layer with a 2-dimensional output layer, where the neurons correspond to the labels real and fake respectively. Additionally, we create a new dataset that contains CIFAR10 images with label $t = real$, and images generated using the Gaussian Directions procedure with label $t = fake$. The dataset is split in half between the two types in order not to bias the prediction in either direction. We train the last layer of the network for 1 epoch on a subset of this dataset (this is not a real training procedure but merely a rescaling and shifting of the randomly initialized parameters in the last layer). Table 4.4 summarizes the results.

$\alpha$ | Train Accuracy | Test Accuracy
9 | 97.35% | 95.63%

Table 4.4: Train and test accuracies for the real/fake labels dataset. We can see that after 1 epoch of training the network recognizes which points are on the data manifold.
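A minimal sketch of this probing step is shown below (PyTorch, illustrative; `trained_model`, the attribute name `fc`, and the feature dimension are assumptions, not details taken from the thesis): the learned parameters are frozen and only a new 2-way real/fake output layer is trained.

```python
import torch
import torch.nn as nn

def make_real_fake_probe(trained_model: nn.Module, feature_dim: int) -> nn.Module:
    """Sketch: freeze the learned representation and replace the 10-way output
    layer with a 2-way head predicting T in {real, fake}."""
    for param in trained_model.parameters():
        param.requires_grad = False               # keep the P(Y|X) features fixed
    trained_model.fc = nn.Linear(feature_dim, 2)  # assumed name of the final layer
    return trained_model

# Train only the new head for a single epoch (loader names are hypothetical):
# model = make_real_fake_probe(trained_model, feature_dim=512)
# optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)
# criterion = nn.CrossEntropyLoss()
# for images, t in real_fake_loader:              # t: 0 = real, 1 = fake
#     optimizer.zero_grad()
#     criterion(model(images), t).backward()
#     optimizer.step()
```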

We clearly see that the network learned the distribution $\mathcal{P}_{T|X}(T = t \mid X = x)$ while learning $\mathcal{P}_{Y|X}(Y = c \mid X = x)$. Thus, although the procedure corresponds to supervised learning, it is seemingly performing some unsupervised learning: the network learns a representation of the inputs, which seems orthogonal a priori to the label it is trying to predict, alongside the representation that is useful for prediction.

However, if we assume that the network uses the labels $t$ for prediction via $\mathcal{P}_{Y|X,T}$, then this learning is not really unsupervised, and this seems like a more reasonable phenomenon. This ties back to our main hypothesis that the network aims to learn simple things first and thus distinguishes between the simple, true images and the complex, fake images. To corroborate our results, we also ran a Principal Component Analysis (PCA) of the activations at the final hidden layer. The results are shown in Figures 4-6 and 4-7. The activations are very clustered and have very little variance along most major principal components when the inputs are fake. This shows that the network maps the fake training images to the same representation and deals with these images as if they were the same. This result, coupled with the results from section 4.2, is strong evidence that the network ignores most of the fake inputs, except for the ones it memorizes.

Figure 4-6: PCA analysis of the activations at the last hidden layer. The top images show the activations for the entire test dataset, the bottom images show the activations for real images (x) with their fake counterparts (o). We can clearly see that there is very little variation along the first 3 PCs for the fake data. The neural network maps the fake data to a very restricted subspace.
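The PCA comparison can be reproduced along the following lines (NumPy sketch; `acts_real` and `acts_fake` stand in for activation matrices collected from the trained network and are not part of the thesis code).

```python
import numpy as np

def pca_project(activations, k=3):
    """Sketch: project activations onto their top-k principal components."""
    centered = activations - activations.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)  # rows of vt = principal directions
    return centered @ vt[:k].T

# acts_real, acts_fake: (num_points, hidden_dim) activations at the last hidden layer
# acts = np.vstack([acts_real, acts_fake])
# coords = pca_project(acts, k=3)
# n_real = len(acts_real)
# print(coords[:n_real].var(axis=0))   # spread of real inputs along the top PCs
# print(coords[n_real:].var(axis=0))   # spread of fake inputs (expected to be small)
```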

4.3.3 Towards Inductive Bias: Low Dimensional Compression

The results in this section show that the networks learn to distinguish between the structured low dimensional data and the high dimensional data with highly independent features. This is evidence that deep learning models tend to try to discover a low dimensional representation of the problem: they perform some form of compression, which helps them find simple classifiers first. Such a representation does not fit the fake images well since they lie in a higher dimensional space than the true images and have a lot less structure. The classifiers needed to reach a good accuracy on these inputs are significantly more complex, so the networks prioritize the easier data points. They start learning the general overarching rules on these structured data points before memorizing the unstructured points: they are in some sense "lazy".

Figure 4-7: PCA analysis of the activations at the last hidden layer (single component view). The fake inputs' activations are significantly concentrated, whereas the real inputs exhibit high variance.

4.4 Learning Simple Things First

In this section, we provide evidence that deep learning models are "lazy" and that they aim to learn simple things for as long as possible. Sections 4.2 and 4.3 present experiments showing that neural networks try to uncover a low dimensional representation of the problem and then find relatively simple decision rules for it, before tackling the more complicated components of the problem. We make this statement more precise by studying the behavior of neural networks on a synthetic dataset where we control the difficulty of the data points.

4.4.1 Data Generation: the Linear/Quadratic Dataset

We generate an artificial dataset that contains two types of points: linearly separable points (L) and quadratically separable points (Q). We generate 50,000 data points that are 100-dimensional, with a 50/50 split between L and Q points. The dataset is shown in Figure 4-8. For each L point $(x, y)$, we generate each coordinate $x_j$ for $j \in \{1, \dots, 100\}$ from a uniform distribution $\mathcal{U}[-2, 2]$ and assign the label $y = \mathbb{1}\{x_1 > 0\}$. Each Q point $(x, y)$ is generated as follows:

$$y \sim \mathrm{Bern}\left(\tfrac{1}{2}\right)$$
$$\gamma \sim \mathcal{N}(0_{100}, I_{100})$$
$$\epsilon \sim \mathcal{U}[y, y+1]$$
$$x = \epsilon \frac{\gamma}{\|\gamma\|_2}.$$

Figure 4-8: The Linear/Quadratic Dataset. The image on the left shows the four different types of data and the image on the right shows their assigned labels.

In this dataset, the L points require a simple classifier, the hyperplane $x_1 = 0$, whereas the Q points require a more complex classifier, the unit hypersphere. In the next section, we study the behavior of neural networks on the two types of points.
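A minimal sketch of the Linear/Quadratic data generation (NumPy, illustrative names and defaults) follows directly from the definitions above.

```python
import numpy as np

def make_linear_quadratic(n=50_000, d=100, rng=None):
    """Sketch: half L points labeled by the sign of x_1, half Q points labeled
    by whether they lie inside or outside the unit hypersphere."""
    rng = np.random.default_rng() if rng is None else rng
    n_half = n // 2
    # L points: coordinates uniform in [-2, 2], label 1{x_1 > 0}
    x_l = rng.uniform(-2, 2, size=(n_half, d))
    y_l = (x_l[:, 0] > 0).astype(int)
    # Q points: radius eps ~ U[y, y + 1] along a random Gaussian direction
    y_q = rng.integers(0, 2, size=n_half)
    gamma = rng.standard_normal((n_half, d))
    gamma /= np.linalg.norm(gamma, axis=1, keepdims=True)
    eps = rng.uniform(y_q, y_q + 1)
    x_q = eps[:, None] * gamma
    return np.vstack([x_l, x_q]), np.concatenate([y_l, y_q])

x, y = make_linear_quadratic()
print(x.shape, y.shape)  # (50000, 100) (50000,)
```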

Figure 4-9: Train accuracies on the Linear/Quadratic Dataset. The training accuracy grows first for the L points, which require a simpler classifier.

4.4.2 The Simplicity Bias: A Proof of Concept

As discussed earlier, deep networks seem to be biased to learn simple things first. We run a ConvNet with 1 hidden layer on the Linear/Quadratic dataset. As noted above, the L points are easier to classify than the Q points since they require a very simple decision rule. The difficulty of a given dataset or training point here is defined by how "warped" the boundary needs to be to classify the point correctly. The L points are on one end of the spectrum, whereas points generated via random noise would require a significantly more complicated decision boundary¹. The results obtained confirm our hypothesis; they are presented in Figure 4-9. We can see that the network learns on the simple L points first, then learns to classify the Q points. This is more evidence that the network is biased to learn simple things that are easy to explain. Our definition of "simple things" is not necessarily the right one to use. We will make this definition more precise in chapter 5, where we discuss how simplicity depends on the network priors.

¹The Radial Basis Function (RBF) kernel would be the extreme end of the spectrum. It corresponds to a feature map that maps points into an infinite dimensional space, and it can produce decision boundaries that essentially fully memorize the points.

4.5 Laziness: a Force that Drives Generalization

As discussed earlier, the goal of any learning system in the supervised learning paradigm is to achieve good generalization accuracy. Thus, the procedure that fits the model to the training set needs to avoid overfitting on the data. Traditionally, many techniques are used to help the procedure learn simple rules that are applicable to a wide range of unseen data. These techniques, which include model selection, capacity control, explicit regularization, and implicit regularization via early stopping, bias the model to induce.

Deep learning models have exhibited an extraordinary ability to generalize, even when none of these traditional techniques are used: many models are heavily over-parametrized but still avoid overfitting the data. There is therefore some type of inherent bias to induce in the models. In the previous sections, we provided evidence that supports this claim. The models are able to learn the underlying rules even when they are force-fed massive amounts of noisy data that push them to memorize (section 4.2).

The idea of attributing this inductive bias to some "human drive to learn overarching rules" in the model seems interesting, although unlikely. We show in sections 4.3 and 4.4.2 that the networks are just "lazy": they strive to learn simple things first, before memorizing more complicated things. It is also apparent that these networks make their tasks easier by compressing the data and prioritizing learning on low-dimensional data (section 4.3). This inherent laziness helps exempt deep learning models from the traditional bias-variance tradeoff and fit them within the realm of the idea that interpolation is not necessarily equivalent to overfitting [BRT18, BHM18]. Additional work is needed to understand the force behind the inductive bias itself. A possible hypothesis is that networks are reluctant to move very far away from their initial configuration. Some recent works show that the network parameters at convergence are very close to the randomly initialized parameters in terms of L2 norm [NLB+18b].

Simplicity is a difficult concept to define. In the next chapter, we show that what is simple for neural networks is defined by the priors that the networks carry. Some tasks can be simple for some networks and complex for others. Networks are lazy, but there is no global ordering for laziness.

Chapter 5

Inductive Bias through Priors: Simplicity is Preconditioned by Priors

In the last chapter, we studied an interesting property of deep learning models: they are "lazy". We proposed experiments showing that neural networks have an inherent inductive bias that comes from a predisposition to learn simple things first. In this chapter, we define what simple tasks are more precisely by relating them to deep learning priors. In fact, priors are what condition the networks, and they determine what is easy or difficult to learn.

5.1 Introduction

In this chapter, we aim to define simplicity through the priors ingrained in deep learning models. Our work proposes that priors matter: they are a big factor in determining what is simple and what is complex for the networks.

5.1.1 Priors as a Summary of Initial Beliefs

Priors are usually used in the context of Bayesian inference. The problem setup deals with random variables whose properties we are trying to infer. There are two variables $X$ and $Y$: $X$ is the variable of interest and $Y$ is a variable that we observe. We usually have access to a model relating $X$ and $Y$, called the likelihood $\mathcal{P}(Y|X)$, and the goal is to infer $X$ after having observed $Y$, or in other words, to get the posterior $\mathcal{P}(X|Y)$. In order to do this, we apply Bayes' rule $\mathcal{P}(X|Y) \propto \mathcal{P}(Y|X)\mathcal{P}(X)$, where $\mathcal{P}(X)$ is the prior. The choice of priors is a very active topic of research in the field of Bayesian statistics, but the overall consensus is that priors matter. Some applications call for the use of uninformative or weakly informative priors to let the data drive the inference; others call for the use of stronger priors when data is limited or when the statisticians have strong a priori beliefs. These beliefs usually come from domain knowledge, other datasets, transfer learning, etc. Additionally, although priors are usually considered to be data independent, they are often tightly related to the model of interest because they can help the model from a computational standpoint (conjugate priors) or even a statistical standpoint [GSB17]. More generally, priors can also be thought of as any beliefs or knowledge that the statistician has before proceeding with the inference mechanism. With this broader definition in mind, priors can be applied to statistical procedures that are not Bayesian in nature, such as most deep learning techniques.

5.1.2 Priors in Deep Learning

One area of deep learning is concerned with Bayesian neural networks, where deep learning is used as a pure inference mechanism [Nea95]. The deep learning model is seen as a generative model where the joint distribution over the data $Z = (X, Y)$ and the parameters of the network $\theta$, $\mathcal{P}(Z, \theta)$, is well defined. Additionally, the network defines the model relating data and parameters. In this case, we have a proper prior on the network parameters $\mathcal{P}(\theta)$ and we aim to learn the posterior distribution $\mathcal{P}(\theta|Z)$. We do not discuss such models in our development. We are mainly concerned with more mainstream deep learning procedures where the parameters $\theta$ are not probabilistic and where we are performing parameter estimation instead of inference. Since the parameters are not random variables, we use the broader definition of prior here. Priors are any choices that incorporate the beliefs held before the learning algorithm takes place. With this definition, the distinction between the prior and the model itself (represented by the neural network) becomes vague, but whether we label decisions as part of the prior or the model has little consequence on our analysis. In our treatment, the deep learning priors will encompass any decision made by the modeler on the network¹. Some examples of choices that are considered to be part of the prior are the network architecture, the number of parameters, the non-linear activation functions, and the initialization procedure.

5.1.3 Priors Matter for Deep Learning

As in the Bayesian case, priors can be a big determinant of a model's success or failure. One very well known instance is the performance of Convolutional Neural Networks (CNNs) on image classification and object recognition [KSH12]. Such networks are composed of convolutional layers, which share weights across spatial locations, and pooling layers; these types of layers encode a prior belief about the task: spatial invariance. Incorporating such a belief into the model helped achieve a significant accuracy improvement in these tasks and others. This is evidence, among much else, that although deep learning networks are great at feature engineering, they hold a set of priors that bias them to learn specific representations. In our work, we propose empirical evidence that priors determine what simple tasks are for deep networks: deep learning models do learn simple things first, but what is simple depends on the prior. We also contextualize the results to propose that priors, which can be considered a part of feature engineering, are a very important component of the learning procedure and should be carefully considered when developing deep learning models.

5.2 Simplicity, or Proximity to the Prior

Linearly separable data, or more generally linear problems, are commonly considered to be simple tasks. This is because we, humans, have a good understanding of linear problems and have developed extensive intuition for them. In contrast, we usually find it harder to reason about highly non-linear problems, since we have yet to develop a good grasp on them. However, this is not necessarily the case for deep learning models: simplicity does not have a global ordering. In this section, we propose experimental evidence to argue that simplicity in deep learning should be defined differently. The definition we propose is tightly related to the implicit priors that the model holds. In fact, we propose that priors, and the preconditioning of the network that happens through such priors, are what define simple and difficult tasks for a network: a simple task is one that is "close" to, or fits, the prior incorporated in the network. We study the effect of different types of priors on the conditioning of the network.

¹Note that such decisions, which are often made via hyper-parameter tuning and cross-validation, can very well be impacted by previous runs of the model and by the data itself.

Figure 5-1: Train and test accuracies of the comparative run for ReLU and Quad activations ((a) ReLU Activation, (b) Quad Activation). We can see that the linear dataset is easier for ReLU, and the quadratic dataset is easier for Quad.

5.2.1 Bias through Non-Linear Activations

In this section, we study the impact of changing the activation functions, which are an implicit prior, on the inductive bias. We consider two 1-hidden-layer networks that have different activation functions: ReLU, or $f : x \mapsto \max\{x, 0\}$, and Quad, or $g : x \mapsto x^2$. We run both models on two datasets that are comprised of L points and Q points respectively, where the L and Q points are defined in section 4.4.1. The results of the experiments are shown in figure 5-1 and table 5.1. Figure 5-1 presents the runs described and table 5.1 summarizes the results by comparing the speed of the different runs. It is clear that networks learn faster on problems that fit the priors represented by the activation function. In fact, although networks with any non-linear activations can learn a wide variety of functions, they can do so with varying amounts of difficulty. A network with ReLU activations is preconditioned to learn linear classifiers, since such an activation would map the data $x$ to a feature space $\Phi(x)$ where it is possible to linearly classify the data, even without any training. Similarly, an untrained network with the Quad non-linearities is biased to map the data to a "quadratic" feature space, where the data is easily separated by a quadratic classifier. Therefore, when the feature map implemented a priori by the network does not fit, or is far from, the feature map required to classify the data, the network will find training on the data "less simple". Our analysis thus shows that, through changing the initial feature map, the activation functions can play a big role in defining what data is simple to learn on: a network with Quad activations considers quadratic rules simple and general.

Activation | L points | Q points
ReLU | 75 | 92
Quad | >200 | 110

Table 5.1: Number of epochs needed to reach 60% train accuracy. The networks learn significantly faster when training on the data that fits the prior imposed by the activation functions.
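The two models of this comparison can be sketched as follows (PyTorch, illustrative; the hidden width and output dimension are assumptions, not values reported in the thesis): the only difference between them is the activation function.

```python
import torch
import torch.nn as nn

class Quad(nn.Module):
    """The Quad activation g(x) = x^2."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x ** 2

def one_hidden_layer_net(activation: nn.Module, d_in: int = 100, width: int = 256) -> nn.Module:
    # Sketch of a 1-hidden-layer network; only `activation` changes between runs.
    return nn.Sequential(
        nn.Linear(d_in, width),
        activation,
        nn.Linear(width, 2),   # binary L/Q labels
    )

relu_net = one_hidden_layer_net(nn.ReLU())
quad_net = one_hidden_layer_net(Quad())
```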

5.2.2 Bias through Architecture

The recent advancements in deep learning have been tightly coupled with extensive research on network architectures. In fact, architecture is viewed as a principal factor behind the success of the method. An instance of this idea is the outperformance of ConvNets in computer vision tasks such as image classification and object recognition [KSH12]. The superior test accuracies achieved by ConvNets are tightly associated with the priors they incorporate. In fact, the tasks mentioned require the prediction to be invariant to the spatial location of the elements in the images. Weight sharing in convolutional layers and max pooling layers, which are central components of CNNs, bias the model towards translation independence in the predictions. The success of such preconditioning has been demonstrated for tasks that require such spatial invariance. We investigate the performance of such priors on translation dependent tasks. We use an idea similar to the work in [LLM+18] to develop a synthetic dataset where the labels depend on location. The dataset is comprised of black 64x64 images with a 9x9 square of white pixels. The labels correspond to the location of the center of the square². In this case, the mapping is perfectly translation dependent. The task is simple and we expect any network with reasonable capacity to learn the mapping, but we are interested in the time it takes for the network to converge to ≥ 99% training accuracy as a proxy for the simplicity of the task. We train on the dataset with two similar ConvNets, where one of them has max pooling layers and the other one does not. The hypothesis is that the network with max pooling layers should find it relatively harder to train on the dataset at hand because it has a stronger predisposition to learn translation independent mappings. The results presented in Figure 5-2 confirm our hypothesis: the network without max pool trains faster. Therefore, the translation dependent task is relatively easier for this network than for its max pooling counterpart. In fact, following our line of thought, translation dependent mappings are close to the prior that the network without pooling holds and far from the one ingrained in the network with pooling. This is evidence that architectural biases such as max pool can change the types of tasks that are simple for the network and the ones that require more effort.

²If the center is at row $i$ and column $j$ in the image, the label would be $y = 64i + j$.

Figure 5-2: Train and test accuracies of the comparative run for max pool and no max pool networks. The network without max pooling layers achieves high train and test accuracy faster than the network with the pooling layers.
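A minimal sketch of this synthetic translation-dependent dataset (NumPy, illustrative; the number of samples is our own choice) is given below.

```python
import numpy as np

def make_square_location_dataset(n=10_000, img=64, sq=9, rng=None):
    """Sketch: black img x img images with a sq x sq white square; the label
    encodes the square's center (row i, column j) as y = img * i + j."""
    rng = np.random.default_rng() if rng is None else rng
    half = sq // 2
    images = np.zeros((n, img, img), dtype=np.float32)
    labels = np.empty(n, dtype=np.int64)
    for k in range(n):
        i = rng.integers(half, img - half)   # keep the square fully inside the image
        j = rng.integers(half, img - half)
        images[k, i - half:i + half + 1, j - half:j + half + 1] = 1.0
        labels[k] = img * i + j
    return images, labels

images, labels = make_square_location_dataset()
print(images.shape, labels.min(), labels.max())
```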

5.2.3 Feature Engineering through Priors

In some sense, given our broad definition of priors as all the initial design choices in the network, priors define the initial feature map that the network implements. Therefore, the priors bias the network towards some initial representation of the data that may or may not make the task at hand simpler. More broadly, such a priori feature maps bias the model towards learning different rules and types of classifiers first: the bias imposed by the prior defines the inductive bias of the network. Therefore, careful consideration should be given to the choices made around network priors. Although it is not a mainstream belief, deep learning models do require some form of feature engineering through network priors.

Chapter 6

Conclusion

In this thesis, we propose a step forward towards understanding the generalization performance of deep neural networks. Our analysis suggests the existence of an inherent inductive bias in such networks, which makes them inclined to learn simple and generalizable hypotheses during training. Such a bias constitutes a form of implicit regularization that is independent of capacity control and could explain the ability of neural networks to generalize on unseen data despite the fact that they are over-parametrized. Additionally, it would also explain why deep networks are robust to substantial noise, since it would prevent them from memorizing unstructured noisy data. The inductive bias in deep learning seems very similar to the human desire to summarize complex phenomena with simple and elegant rules and equations. As an example, physicists aim to find one "master equation" that explains the functioning of the universe. However, the inclination to induce in deep neural networks has less philosophical roots. It stems from the fact that these networks are lazy. We present evidence showing that deep models tend to learn simple hypotheses before learning the more complex ones, without forgetting what was already learned. We showed that when trained on noisy training data, such as Gaussian noise and random labels, networks learned the true training data first and still generalized well.

Additionally, we propose a way to define simplicity through the implicit priors ingrained in the network. In this respect, we show that priors are a form of feature engineering since they precondition the network towards learning certain hypotheses. The networks do learn simple general rules first, but the rules are "not general in general": they are general and simple only given the specific priors of such networks.

The results presented in this thesis call for further work to understand generalization in deep learning. We propose some directions below.

The Role of Bounding the Norm: The idea that inductive bias leads to generalization holds if the network does not end up moving too much from its initial configuration. Although the idea has already been investigated [NLB+18a, ALL18, GRS17], further research is needed to confirm whether the network parameters at convergence stay "close" to the randomly initialized parameters and to understand the contribution of such a phenomenon to generalization.

The Role of Initialization: We defined simplicity as proximity to the implicit prior. Weight initialization is a very important component of deep networks' priors; however, it is not very well understood. A natural next step is thus to study the impact of initialization on generalization. The hypothesis to test would be whether some types of initialization precondition the network to be "far" from the desired configuration and thus hurt test error. Such work would help develop initialization schemes that are favorable to generalization.

Inductive Bias and Optimization: We studied the impact of priors and types of data on the inductive bias. Optimization is a very important factor in deep learning, thus one could envision studying such bias through the optimization perspective. It would be interesting to study the relationship of the current SGD-based optimization procedures to the inductive bias. It is not out of the question that such algorithms might play a central role in biasing deep learning models to induce.

Statistical Learning Theory and Deep Learning: We discussed a new learning theory paradigm that rejects the traditional bias-variance tradeoff. Some works have in fact shown that methods that result in interpolating estimators, such as kernel machines and random forests, can still generalize well when the influence of the points is local [LR18, BRT18, BHM18]. An interesting next step would be to investigate whether a similar phenomenon happens in deep neural networks. This thread seems like a promising direction to take in order to move the science of deep learning closer to theory and reconcile the theoretical advances in statistical learning with the practical advances in deep learning.

Bibliography

[AGNZ18] Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. CoRR, abs/1802.05296, 2018.

[AJB+17] Devansh Arpit, Stanislaw Jastrzebski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, and Simon Lacoste-Julien. A closer look at memorization in deep networks, 2017.

[ALL18] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and generalization in overparameterized neural networks, going beyond two layers. CoRR, abs/1811.04918, 2018.

[BCM+17] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndic, Pavel Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. CoRR, abs/1708.06131, 2017.

[Ben17] Yoshua Bengio. The consciousness prior. CoRR, abs/1709.08568, 2017.

[BHM18] Mikhail Belkin, Daniel Hsu, and Partha Mitra. Overfitting or perfect fitting? risk bounds for classification and regression rules that interpolate, 2018.

[BRT18] Mikhail Belkin, Alexander Rakhlin, and Alexandre B. Tsybakov. Does data interpolation contradict statistical optimality?, 2018.

[CST00] Nello Cristianini and John Shawe-Taylor. An Introduction to Support Vector Machines: And Other Kernel-based Learning Methods. Cambridge University Press, New York, NY, USA, 2000.

[DDM+04] Nilesh Dalvi, Pedro Domingos, Mausam, Sumit Sanghai, and Deepak Verma. Adversarial classification. In Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’04, pages 99–108, New York, NY, USA, 2004. ACM.

[Fuk80] Kunihiko Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202, Apr 1980.

[GBC16] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[GMH13] Alex Graves, Abdel-rahman Mohamed, and Geoffrey E. Hinton. Speech recognition with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.

[GRS17] Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. CoRR, abs/1712.06541, 2017.

[GSB17] Andrew Gelman, Daniel Simpson, and Michael Betancourt. The prior can generally only be understood in the context of the likelihood. 2017.

[GSS14] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples, 2014.

[HSW89] Kurt Hornik, Maxwell B. Stinchcombe, and Halbert White. Multilayer feedforward networks are universal approximators. Neural Networks, 2:359–366, 1989.

[HZRS15] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015.

[KB14] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.

[KMN+16] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. CoRR, abs/1609.04836, 2016.

[KNH] Alex Krizhevsky, Vinod Nair, and Geoffrey Hinton. Cifar-10 (canadian institute for advanced research).

[KSH12] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. Imagenet classification with deep convolutional neural networks. In F. Pereira, C.J. Burges, L. Bottou, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 25, pages 1097–1105. Curran Associates, Inc., 2012.

[LBBH98] Yann Lecun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. In Proceedings of the IEEE, pages 2278–2324, 1998.

[LBH15] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[LC10] Yann LeCun and Corinna Cortes. MNIST handwritten digit database. 2010.

[LL18] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. CoRR, abs/1808.01204, 2018.

[LLM+18] Rosanne Liu, Joel Lehman, Piero Molino, Felipe Petroski Such, Eric Frank, Alex Sergeev, and Jason Yosinski. An intriguing failing of convolutional neural networks and the coordconv solution. CoRR, abs/1807.03247, 2018.

[LR18] Tengyuan Liang and Alexander Rakhlin. Just interpolate: Kernel "ridgeless" regression can generalize, 2018.

[LSS14] Roi Livni, Shai Shalev-Shwartz, and Ohad Shamir. On the computational efficiency of training neural networks. CoRR, abs/1410.1141, 2014.

[MG15] James Martens and Roger B. Grosse. Optimizing neural networks with kronecker-factored approximate curvature. CoRR, abs/1503.05671, 2015.

[MMS+17] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks, 2017.

[Nea95] Radford M. Neal. Bayesian Learning for Neural Networks. PhD thesis, University of Toronto, 1995.

[NLB+18a] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks. CoRR, abs/1805.12076, 2018.

[NLB+18b] Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. Towards understanding the role of over-parametrization in generalization of neural networks, 2018.

[NTS14] Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. In search of the real inductive bias: On the role of implicit regularization in deep learning. CoRR, abs/1412.6614, 2014.

[Oxf18] OxfordDictionaries.com. Oxford Dictionary. Oxford University Press, 2018.

[RVBS17] David Rolnick, Andreas Veit, Serge J. Belongie, and Nir Shavit. Deep learning is robust to massive label noise. CoRR, abs/1705.10694, 2017.

[SHM+16] David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel, and Demis Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484–489, January 2016.

[SSBD14] Shai Shalev-Shwartz and S. Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.

[STC04] John Shawe-Taylor and Nello Cristianini. Kernel Methods for Pattern Analysis. Cambridge University Press, New York, NY, USA, 2004.

[SZS+13] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, and Rob Fergus. Intriguing properties of neural networks. CoRR, abs/1312.6199, 2013.

[UVL17] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. Deep image prior. CoRR, abs/1711.10925, 2017.

[ZBH+16] Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. CoRR, abs/1611.03530, 2016.

[Zei12] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.

[ZLR+18] Chiyuan Zhang, Qianli Liao, Alexander Rakhlin, Brando Miranda, Noah Golowich, and Tomaso Poggio. Theory of deep learning iib: Optimization properties of sgd, 2018.
