Feature Purification: How Adversarial Training Performs Robust Deep Learning

Zeyuan Allen-Zhu (Microsoft Research Redmond, [email protected])
Yuanzhi Li (Carnegie Mellon University, [email protected])

March 16, 2020 (version 3)∗

Abstract

Despite the empirical success of using adversarial training to defend deep learning models against adversarial perturbations, so far it still remains rather unclear what the principles are behind the existence of adversarial perturbations, and what adversarial training does to the neural network to remove them. In this paper, we present a principle that we call feature purification: we show that one of the causes of the existence of adversarial examples is the accumulation of certain small dense mixtures in the hidden weights during the training process of a neural network, and, more importantly, that one of the goals of adversarial training is to remove such mixtures to purify hidden weights. We present both experiments on the CIFAR-10 dataset to illustrate this principle, and a theoretical result proving that for certain natural classification tasks, training a two-layer neural network with ReLU activation using randomly initialized gradient descent indeed satisfies this principle.

Technically, we give, to the best of our knowledge, the first result proving that the following two can hold simultaneously for training a neural network with ReLU activation: (1) training over the original data is indeed non-robust to small adversarial perturbations of some radius, and (2) adversarial training, even with an empirical perturbation algorithm such as FGM, can in fact be provably robust against any perturbation of the same radius. Finally, we also prove a complexity lower bound, showing that low-complexity models such as linear classifiers, low-degree polynomials, or even the neural tangent kernel for this network, cannot defend against perturbations of this same radius, no matter what algorithms are used to train them.

∗V1 of this paper was presented at IAS on this date, and can be viewed online: https://video.ias.edu/csdm/2020/0316-YuanzhiLi. We polished writing and experiments in V1.5, V2 and V3. We would like to thank Sanjeev Arora and Hadi Salman for much useful feedback and discussion.

1 Introduction

Large-scale neural networks have shown great power to learn from a training data set and generalize to unseen data sampled from similar distributions, for applications across different domains [42, 46, 53, 87]. However, recent studies have discovered that these trained large models are extremely vulnerable to small "adversarial attacks" [19, 95]: small perturbations to the input, often small enough to be invisible to humans, can create numerous errors in prediction. Such slightly perturbed inputs are often referred to as "adversarial examples".

Since the original discovery of adversarial examples, a large body of work has studied how to improve the robustness of deep learning models against such perturbations [44, 63, 64, 83, 89]. One seminal approach is called adversarial training [65], where one iteratively computes adversarial examples from the training examples, and then retrains the model with these adversarial examples instead of the original examples (a.k.a. the clean examples). This approach was reported in [15] as the only approach that can defend against carefully designed adversarial attacks, and many follow-up works are built upon it [82, 111].

However, despite the great empirical success in improving the robustness of neural networks over various data sets, the theory of adversarial examples is much less developed. In particular, we found that the following fundamental questions remain largely unaddressed:

Questions: Why do adversarial examples exist when we train the neural networks using the original training data set? How can adversarial training further "robustify" the trained neural networks against these adversarial attacks?

To answer these questions, one sequence of theoretical works tries to explain the existence of adversarial examples using the high-dimensional nature of the input space and the over-fitting behavior due to the sample size and sample noise [32, 33, 39, 40, 67, 86, 96], and treats adversarial training from the broader view of min-max optimization [24, 65, 92, 93, 103]. However, recent observations [48] indicate that these adversarial examples can also, and arguably often, arise from features (those that do generalize) rather than bugs (those that do not generalize due to the effect of poor statistical concentration). To the best of our knowledge, all existing works studying adversarial examples either (1) apply generally to the case of arbitrarily unstructured functions f and only consider adversarial examples statistically, or (2) apply to a structured setting but only involving linear learners. These theoretical works, while shedding great light on the study of adversarial examples, do not yet give concrete mathematical answers to the following questions regarding the specific hidden-layer structure of neural networks:

1. What are the features (i.e., the hidden weights) learned by the neural network via clean training (i.e., over the original data set)? Why are those features "non-robust"?
2. What are the differences between the features learned by clean training vs. adversarial training (i.e., over a perturbed data set consisting of adversarial examples)?
3. Why do adversarial examples for a network transfer to other independently-trained networks?

Before going into the above questions regarding robustness, it is necessary to first study what the features learned by a neural network during clean training are. Theoretical studies are also limited in this direction. Most existing works (1) only focus on the case when the training data is spherical Gaussian [16, 20, 22, 36, 52, 55, 58, 60, 62, 88, 90, 97, 100, 104, 112, 114], and some of them require heavy initialization using tensor decomposition, which might fail to capture the specific structure of the input and the property of a random initialization; or (2) only consider the neural tangent kernel regime, where the neural networks are linearized so the features are not learned (they stay at random initialization) [4, 6–8, 13, 14, 26–29, 38, 45, 50, 57, 61, 105, 115, 116].

[Figure 1: Feature purification in adversarial training (for the first layer of AlexNet on CIFAR-10). Clean training performs global feature learning (forward feature learning + backward feature correction), while adversarial training purifies features locally: (a) an edge feature with 3 lines becomes 1 line, and thus more pure; (b) edge features become less colorful, and thus more pure. Visualization of deeper layers of AlexNet as well as ResNet-34 can be found in Figure 6, Figure 10, and Figure 11.]

In this paper, we present a new routine that enables us to formally study the learned features (i.e., the hidden weights) of a neural network, when the inputs are more naturally structured than being Gaussians. Using this routine, we give, to the best of our knowledge, the first theoretical result towards answering the aforementioned fundamental questions of adversarial examples, for certain neural networks with ReLU activation functions.

Our results. We prove, for a certain binary classification data set, that when we train a two-layer ReLU neural network using gradient descent,¹ starting from random initialization:

1. Given polynomially many training examples, in polynomially many iterations, the neural network will learn well-generalizing features for the original data set, and the learned network will have close-to-perfect prediction accuracy for test data sampled from the same distribution.

2. However, even with a weight-decay regularizer to avoid over-fitting, even with infinitely many training data, and even when super-polynomially many iterations are used to train the neural network to convergence, the learned network still has near-zero robust accuracy against small-norm adversarial perturbations to the data. In other words, those provably well-generalizing features on the original data set are also provably non-robust to adversarial perturbations to the data, so their non-robustness cannot be due to having too few training samples [32, 33, 39, 40, 67, 86, 96].

3. Adversarial training, using perturbation algorithms such as the Fast Gradient Method (FGM) [40], can provably and efficiently make the learned neural network achieve near-perfect robust accuracy, against even worst-case norm-bounded adversarial perturbations, via a principle we refer to as "feature purification". We illustrate feature purification in Figure 1 by an experiment, and explain it in mathematical terms next.

Feature purification: how adversarial training can perform robust deep learning. In this work, we also give a precise, mathematical characterization of the difference between the features learned by clean training versus adversarial training in the aforementioned setting, leading to (to the best of our knowledge) the first theory of how, in certain learning tasks using ReLU neural networks, the provably non-robust features after clean training can be "robustified" via adversarial training.

¹Our theory extends to stochastic gradient descent (SGD) at the expense of complicating notations.

[Figure 2: Measure of feature (local) purification, for ℓ2(1, 0.5) adversarial training on AlexNet and ResNet-34: histograms of $\theta(w_i^{(0)}, w_i)$, of $\theta(w_i, w_j)$ for $i \ne j$, and of $\theta(w_i, w_i')$, reported as the percentage of neurons per layer (layers 1-5 of AlexNet; layers 1-31 of ResNet-34). For ResNet-34, the weights $\{w_i\}$ define networks with clean accuracy > 80% and robust accuracy 0%, while the weights $\{w_i'\}$ define networks with clean accuracy > 65% and robust accuracy > 42%. Implementation details in Section 8.2.]

We emphasize that prior theoretical works [35, 51, 84, 98, 113] mainly study adversarial examples in the context of linear models (such as linear regression over prescribed feature mappings, or the neural tangent kernel). In those models, the features are not trained, so adversarial training only changes the weights associated with the linear combination of these features, but not the actual features themselves.

In contrast, this paper develops a theory showing how, over certain learning tasks, adversarial training can actually change the features of certain neural networks to improve their robustness. We abstract this feature change in our setting into a general principle that we call feature purification; although we only prove it for two-layer ReLU networks (see Theorem 5.1a + Theorem 5.3c), we empirically observe that it occurs more generally in real-world, deep neural networks on real-world data sets. We sketch its high-level idea as follows.

The Principle of Feature Purification. During adversarial training, the neural network will neither learn new, robust features nor remove existing, non-robust features learned over the original data set. Most of the work of adversarial training is done by purifying a small part of each feature learned during clean training.

Mathematically, as a provisional step to measure the change of features in a network, let us use (1) $w_i^{(0)}$ to denote the weight vector of the $i$-th neuron at initialization, (2) $w_i$ to denote its weight after clean training, and (3) $w_i'$ to denote its weight after adversarial training (using $w_i$ as initialization). The "feature purification" principle, in math, says that if we use $\theta(z, z') := \frac{|\langle z, z'\rangle|}{\|z\|_2 \|z'\|_2}$ as a provisional measure of the correlation between "features", then (see Figure 2 for real-life experiments):

1. for most neurons: $\theta(w_i^{(0)}, w_i), \theta(w_i^{(0)}, w_i') \le c$ for a small constant $c$ (such as 0.2);
2. for most neurons: $\theta(w_i, w_i') \ge C$ for a large constant $C$ (such as 0.8); and

3. for most pairs of different neurons: $\theta(w_i, w_j) \le c$ for a small constant $c$ (such as 0.2).

In words, this says that both clean training and adversarial training discover hidden weights $w_i, w_i'$ that are fundamentally different from the initialization $w_i^{(0)}$. However, since $w_i$ and $w_i'$ are close, clean

training must have already discovered a big portion of the robust features, and adversarial training merely needs to "purify" some small part of each original feature. In this paper:

• we prove this feature purification principle in the case of two-layer ReLU neural networks over certain data sets, with $c = o(1)$ and $C = 1 - o(1)$ (see Theorem 5.1a + Theorem 5.3c); and
• we provide empirical evidence that this feature purification principle also holds for deep neural networks on real-life datasets (see Figure 2, other experiments in the paper, and the measurement sketch below).
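As a concrete illustration, the following minimal NumPy sketch (ours, not the paper's code; the function names are our own) computes the correlation measure θ between two weight snapshots of the same layer, which is how one could reproduce histograms like those in Figure 2 from saved checkpoints:

```python
import numpy as np

def theta(z1, z2):
    """Correlation measure theta(z, z') = |<z, z'>| / (||z||_2 ||z'||_2)."""
    return np.abs(z1 @ z2) / (np.linalg.norm(z1) * np.linalg.norm(z2))

def purification_histogram(W_clean, W_robust):
    """Per-neuron theta(w_i, w_i') between clean-trained weights W_clean and
    adversarially fine-tuned weights W_robust; both are (num_neurons, fan_in)
    matrices, e.g. flattened convolutional filters."""
    return np.array([theta(w, w_prime) for w, w_prime in zip(W_clean, W_robust)])

# Toy usage: weights differing only by a small, shared "dense mixture" have theta close to 1.
rng = np.random.default_rng(0)
W0 = rng.normal(size=(128, 512))
W_clean = W0 / np.linalg.norm(W0, axis=1, keepdims=True)
W_robust = W_clean - 0.1 * rng.normal(size=(1, 512)) / np.sqrt(512)  # remove a small shared part
print(purification_histogram(W_clean, W_robust).mean())  # close to 1, as in the third panel of Figure 2
```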

Why does clean training learn non-robust features? Which parts of the features are "purified" during adversarial training? In our setting, we also give a mathematical characterization of where the "non-robust" part of each feature comes from during clean training. As we shall formally discuss in Section 6.2, training algorithms such as gradient descent will, at every step, add to the current parameters a direction that maximally correlates with the labeling function on average. For two-layer ReLU networks, we prove that such simple correlations will accumulate, in each neuron, a small part of its weight that correlates with the average of the training data, and we refer to it as the dense mixture (see Theorem 5.2). However, under natural assumptions on the data such as the sparse coding model (namely, inputs come from sparse combinations of hidden dictionary words/vectors), such dense mixtures cannot have high correlation with any individual, clean example. Thus, even with these "dense mixtures" in the features, the network can still generalize well on the original data set. However, we show that these portions of the features are extremely vulnerable to small, adversarial perturbations along the "dense mixture" directions. As a result, one of the main goals of adversarial training, as we show, is to purify the neurons by removing such dense mixtures. This is the supporting theory behind our feature purification principle, and we also measure it empirically in the experiment section.

We believe our result extends the reach of traditional learning theory, where statistical properties of the model (such as generalization) are often studied separately from optimization (i.e., how the models are trained). To understand adversarial examples in deep learning, however, one needs to recognize that well-generalizing and adversarially robust neural networks do exist (and can even be found efficiently using adversarial training), so such a robust network is also a global optimum of the clean training objective. It is rather a property of the traditional clean training process using SGD that biases the network towards learning a non-robust network as another global optimum of the training objective.

Moreover, in our setting, these dense mixtures in the hidden weights of the network come from the sparse coding structure of the data and the gradient descent algorithm; they are largely independent of the random initialization of the neural network. Thus, we prove that, at least in our scenario, adversarial examples for one network do transfer to other independently trained ones.

Our contribution to computational complexity. We also prove a lower bound: for the same sparse coding data model, even when the original data is linearly separable, any linear classifier, any low-degree polynomial, or even the corresponding neural tangent kernel (NTK) of our studied two-layer neural network, cannot achieve meaningful robust accuracy (although they can easily achieve high clean accuracy). Together with our upper bound, we have shown that using a higher-complexity model (such as a two-layer neural network with ReLU activation, compared to the NTK) can in fact achieve better robustness against adversarial perturbations. Thus, our theory strongly supports the experimental findings in [40, 65], where experts have noticed that robustness against adversarial examples requires a model with higher complexity.
The main intuition is that low-complexity models, including the neural tangent kernel, lack the power to zero out low-magnitude signals to improve model robustness, as illustrated in Figure 3 and Section 3.

Our experimental contributions. We present quite a few experimental results supporting our theory: (1) our sparse coding model can indeed capture real-world data to a certain degree; (2) our principle of feature purification also holds for architectures such as AlexNet and ResNet; (3) adversarial training using adversarial examples indeed purifies some "dense mixtures" in practice; and (4) during clean training, features emerge from random initialization by winning the "lottery tickets", as predicted by our theory. We present our experiments following each of the theorem statements accordingly, and include more detailed experiments in Section 8.

1.1 Related Works

Adversarial examples: empirical study. Since the seminal paper [95] showed the existence of small adversarial perturbations that change the prediction of neural networks, many empirical studies have been done to make trained neural networks robust against perturbations [44, 63, 64, 83, 89] (and we refer to the citations therein). The recent study [15] shows that the seminal approach of adversarial training [65] is the most effective way to make neural networks robust against adversarial perturbations.

Adversarial examples: theoretical study. Existing theories mostly explain the existence of adversarial examples as the result of a finite-sample data set over-fitting to high-dimensional learning problems [32, 33, 39, 40, 67, 86, 96]. Later, [48] discovered that well-generalizing features can also be non-robust. Other theories focus on the Fourier perspective of robustness [102, 108], showing that adversarial training might be preventing the network from learning the high-frequency signals of the input image. Our theoretical work is fundamentally different from explanations based on poor statistical concentration over a finite-sample data set, and our Theorem 5.1 and Theorem 5.3 strongly support [48] in that a well-trained, well-generalizing neural network can still be non-robust to adversarial attacks. Other theories about adversarial examples focus on how adversarial training might require more training data compared to clean training [84], and might decrease clean training accuracy [80, 98]. The works [35, 113] focus on how adversarial training can be performed efficiently in the neural tangent kernel regime. The purposes of these results are also fundamentally different from ours.

Sparse coding (data) model. We use a data model called sparse coding, which is a popular model for image, text and speech data [11, 12, 69, 77, 78, 101, 106, 107]. There are many existing theoretical works studying algorithms for sparse coding [9, 17, 43, 47, 54, 68, 85, 91, 94]; however, these algorithms share little similarity with training a neural network. The seminal work [10] provides a neurally-plausible algorithm for learning sparse coding, along with other works using alternating minimization [1, 2, 37, 56, 59]. However, all of these results require a (carefully picked) warm start, while our theory is for training a neural network starting from random initialization.

Threshold degree and kernel lower bound. We also provide, to the best of our knowledge, the first example where the original classification problem is learnable using a linear classifier, but no low-degree polynomial can learn the problem robustly against small adversarial perturbations; yet high-complexity neural networks can provably, efficiently and robustly learn the concept class. The lower bound for the classification accuracy of low-degree polynomials has been widely studied as the (approximate) threshold degree of a function or the sign-rank of a matrix [18, 21, 25, 41, 75, 81]. Our paper gives the first example of a function with high (approximate) robust threshold degree that is nevertheless efficiently and robustly learnable by training a ReLU neural network using gradient descent. Other related works prove lower bounds for kernel methods in the regression case [3, 5]. Generally speaking, such lower bounds are about the actual (approximate) degree of the function, instead

of the (approximate) threshold degree. It is well known that for general functions, the actual degree can be arbitrarily larger than the threshold degree.

2 Preliminaries

We use $\|x\|$ or $\|x\|_2$ to denote the $\ell_2$ norm of a vector $x$, and $\|x\|_p$ to denote its $\ell_p$ norm. For a matrix $M \in \mathbb{R}^{d \times d}$, we use $M_i$ to denote the $i$-th column of $M$, and we use $\|M\|_\infty$ to denote $\max_{i \in [d]} \sum_{j \in [d]} |M_{i,j}|$ and $\|M\|_1$ to denote $\max_{j \in [d]} \sum_{i \in [d]} |M_{i,j}|$. We use $\mathrm{poly}(d)$ to denote $\Theta(d^C)$ where the degree $C$ is some unspecified constant. We use the term clean training to refer to the neural network found by training over the original data set, and the term robust training to refer to the neural network found by adversarial training. We let $\mathrm{sign}(x) = 1$ for $x \ge 0$ and $\mathrm{sign}(x) = -1$ for $x < 0$.

Sparse coding model. In this paper, we consider training data $x \in \mathbb{R}^d$ generated from
$$x = Mz + \xi$$
for a dictionary $M \in \mathbb{R}^{d \times D}$, where $z \in \mathbb{R}^D$ is the hidden vector and $\xi \in \mathbb{R}^d$ is the noise. For simplicity, we focus on $D = d$ with $M$ a unitary matrix. Although our results extend trivially to the case of $D < d$ or when $M$ is incoherent, we point out that the orthogonal setting is more principled: as we will argue in one of our main results, clean training learns a non-robust "dense mixture" of features (see Eq (5.1)). In our orthogonal setting, the accumulation of such mixtures is clearly not due to correlation between features; rather, it is an intrinsic property of gradient descent.

We assume the hidden vector $z$ is "sparse", in the following sense: for $k \le d^{0.499}$, we have:

Assumption 2.1 (distribution of hidden vector z). The coordinates of $z$ are independent, symmetric random variables, such that $|z_i| \in \{0\} \cup [\frac{1}{\sqrt{k}}, 1]$. Moreover,
$$\mathbb{E}[z_i^2] = \Theta\left(\tfrac{1}{d}\right), \qquad \Pr[|z_i| = 1] = \Omega\left(\tfrac{1}{d}\right), \qquad \Pr\left[|z_i| = \Theta\left(\tfrac{1}{\sqrt{k}}\right)\right] = \Omega\left(\tfrac{k}{d}\right)$$

The first condition is a regularity condition, which says that $\mathbb{E}[\|z\|_2^2] = \Theta(1)$. The second and third conditions say that there is a non-trivial probability that $z_i$ attains the maximum value, and a (much) larger probability that $z_i$ is non-zero but has a small value. (Remark: it could be the case that $z_i$ is neither maximum nor too small; for example, $|z_i|$ can also be $k^{-0.314}$ with probability $\frac{k^{0.628}}{d}$, or $k^{-0.123}$ with probability $\frac{k^{0.0888}}{d}$.) The main observation is that

Fact 2.2. Under Assumption 2.1, w.h.p. $\|z\|_0 = \Theta(k)$, i.e., $z$ is a sparse vector.

We study the simplest binary-classification problem, where the labeling function is linear over the hidden vector $z$:
$$y(x) = \mathrm{sign}(\langle w^\star, z\rangle)$$
For simplicity, we assume $|w_i^\star| = \Theta(1)$ for every $i \in [D]$, so all the coordinates of $z$ have relatively equal contributions. Our theorems extend to other $w^\star$ at the expense of complicating notations.

Remark on sparse coding. The sparse coding model is very natural and is widely used to model image, text and speech data [69, 77, 78, 101, 106, 107]. There certainly exist (provable) algorithms for dictionary learning based on sum-of-squares or linear programming [17, 91], but they do not shed light on the training process of neural networks. Even the neural algorithm for sparse coding [10] is still far from training a neural network using SGD or its variants. The main point of this paper is not to show that neural networks can do sparse coding. Instead, our main point

is to distinguish the adversarial training and the clean training processes of neural networks, using the sparse coding model as a bridging tool.

Noise model. We allow the inputs $x = Mz + \xi$ to incorporate a noise vector $\xi$. Our lower bounds hold even when there is no noise ($\xi = 0$). Our upper bound theorems not only apply to $\xi = 0$, but more generally to "Gaussian noise plus spike noise":
$$\xi = \xi' + M\xi''$$

Here, the Gaussian noise is $\xi' \sim \mathcal{N}(0, \frac{\sigma_x^2}{d} I)$, where $\sigma_x \le O(1)$ can be an arbitrarily large constant. The spike noise $\xi''$ is any coordinate-wise independent, mean-zero random vector satisfying $\mathbb{E}[(\xi_i'')^2] \le O(\frac{\sigma_x^2}{d})$ and $|\xi_i''| \le \frac{1}{k^{0.501}}$ for every $i \in [d]$. Therefore, our upper bound theorems hold even under the following extreme circumstances:

• the noise $\xi$ can be of Euclidean norm $O(\sigma_x)$, larger than the signal $\|Mz\|_2 \approx 1$; and
• the spike noise $\xi_i''$ can be as large as $\frac{1}{k^{0.501}}$, which is the maximum possible (because $z_i$ can be $\frac{1}{k^{0.5}}$).

We point out that there are no dependencies among the constants in the $O$, $\Theta$ and $\Omega$ notations of this section, except for the obvious ones (e.g. $\Pr[|z_i| = 1] \le \mathbb{E}[|z_i|^2]$). In particular, $\sigma_x$ can be an arbitrarily large constant and $\Pr[|z_i| = 1]$ can be an arbitrarily small constant times $1/d$.²

Clean and robust error. The goal of clean training is to learn a model $f$ so that $\mathrm{sign}(f(x))$ is as close to $y$ as possible. We define the classification error on the original data set as:
$$\text{clean error:} \quad \mathcal{E}^c(f) \stackrel{\mathrm{def}}{=} \Pr_{x,\, y = y(x)}[\mathrm{sign}(f(x)) \ne y]$$

Next, we consider robust error against $\ell_p$ adversarial perturbations. For a value $\tau > 0$ and a norm $\|\cdot\|_p$, we define the robust error of the model $f$ (against $\ell_p$ perturbation of radius $\tau$) as:
$$\text{robust error:} \quad \mathcal{E}^r(f) \stackrel{\mathrm{def}}{=} \Pr_{x,\, y = y(x)}[\exists \delta : \|\delta\|_p \le \tau : \mathrm{sign}(f(x + \delta)) \ne y]$$
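To make the data model concrete, here is a minimal NumPy sketch (ours, not the authors' code; all sizes are illustrative) that samples pairs (x, y) from this sparse coding model, with z following the spirit of Assumption 2.1 and Gaussian noise ξ′:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, sigma_x = 1024, 16, 0.5                  # illustrative sizes; D = d

M = np.linalg.qr(rng.normal(size=(d, d)))[0]   # a unitary dictionary M
w_star = rng.choice([-1.0, 1.0], size=d)       # |w*_i| = Theta(1)

def sample(n):
    """Sample n pairs (x, y) with x = M z + xi and y = sign(<w*, z>)."""
    u = rng.random((n, d))
    # |z_i| = 1 w.p. ~1/d, |z_i| = 1/sqrt(k) w.p. ~k/d, else 0 (spirit of Assumption 2.1)
    mag = np.where(u < 1 / d, 1.0, np.where(u < k / d, 1 / np.sqrt(k), 0.0))
    z = mag * rng.choice([-1.0, 1.0], size=(n, d))             # symmetric signs
    xi = rng.normal(scale=sigma_x / np.sqrt(d), size=(n, d))   # Gaussian noise xi'
    x = z @ M.T + xi
    y = np.where(z @ w_star >= 0, 1.0, -1.0)                   # sign(0) = 1, as in Section 2
    return x, y, z

x, y, z = sample(512)
print("avg sparsity ||z||_0:", (z != 0).sum(axis=1).mean())    # ~Theta(k), cf. Fact 2.2
```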

3 Warmup Intuitions

Linear learners are not robust. Given the setting of the data set, one direct approach is to use (the sign of) a linear classifier $f(x) = \langle w^\star, M^\top x\rangle$ to predict the label of $x$. There are two issues with such a classifier:

1. When $\sigma_x$ is as large as $\Theta(1)$, such a classifier cannot even classify $x$ with good clean accuracy. Recall $f(x) = \langle w^\star, M^\top x\rangle = \langle w^\star, z\rangle + \langle Mw^\star, \xi\rangle$. By our assumption, typically $|\langle w^\star, z\rangle| = O(1)$ and $\langle Mw^\star, \xi'\rangle \sim \mathcal{N}(0, \Theta(\sigma_x^2))$. Thus, when $\sigma_x \ge \Theta(1)$, the noise can be much larger than the signal, and this linear classifier cannot be used to classify $x$ correctly. In this case, actually no linear classifier (or even constant-degree polynomial³) can give meaningful clean accuracy.

2. Even when $\sigma_x = 0$, so the original data is perfectly linearly classifiable, a linear classifier is also not robust to small perturbations. Since typically $|\langle w^\star, z\rangle| = O(1)$, one can design an adversarial perturbation $\delta = \frac{-C y M w^\star}{\|w^\star\|_2^2}$ for a large constant $C$ that changes the sign of the linear classifier $f(x) = \langle w^\star, M^\top x\rangle$ for most inputs. Since $\|w^\star\|_2 = \Theta(\sqrt{d})$, this linear classifier is not even robust to adversarial perturbations of $\ell_2$ norm $\Theta(\frac{1}{\sqrt{d}})$. In fact, no linear classifier can be robust to such small adversarial perturbations.

²Actually, our theorem extends trivially to the case even when $\Pr[|z_i| = 1] = \frac{1}{d^{1+o(1)}}$.
³One may think that using, for example, a degree-3 polynomial $\sum_i w_i^\star \langle M_i, x\rangle^3$ can reduce the level of noise, but due to the diversity in the value of $z_i$ when $z_i \ne 0$, one must use something close to linear when $|z_i|$ is large. Applying Markov brothers' inequality, one can show the low-degree polynomial must be close to a linear function.

[Figure 3: linear vs. ReLU activation (our theorem is for symmetric ReLU for the sake of proof simplicity). Higher-complexity models using a ReLU activation are more robust than a linear classifier due to the power to "zero out" low-magnitude signals.]

High-complexity models are more robust. Another choice to learn the labeling function is to use a higher-complexity model $f(x) = \sum_{i \in [d]} w_i^\star \langle M_i, x\rangle \mathbb{1}_{|\langle M_i, x\rangle| \ge \frac{1}{2\sqrt{k}}}$. Here, the "complexity" of $f$ is much higher because an indicator function is used.⁴ Since $\langle M_i, x\rangle = z_i + \langle M_i, \xi\rangle$, by our noise model, as long as the signal $z_i \ne 0$ is non-zero, we have $|\langle M_i, x\rangle| \ge \frac{1}{2\sqrt{k}}$ with high probability. Thus, this $f(x)$ is equal to the true labeling function $\langle w^\star, z\rangle$ w.h.p. over the original data set, so it is (much) more robust to noise compared to linear models.

Moreover, this $f$ is also more robust to $\ell_2$ adversarial perturbations. By Fact 2.2, w.h.p. the signal $z$ is $O(k)$-sparse, and thus there are at most $O(k)$ many coordinates $i \in [d]$ with $\mathbb{1}_{|\langle M_i, x\rangle| \ge \frac{1}{2\sqrt{k}}} = 1$. Using this, one can derive that this high-complexity model $f(x)$ has $1 - o(1)$ robust accuracy against any adversarial perturbation of $\ell_2$ radius $o(\frac{1}{\sqrt{k}})$. This is much larger than the $O(\frac{1}{\sqrt{d}})$ for a linear classifier, and it is actually information-theoretically optimal. To sum up, higher-complexity models (such as those using ReLU) have the power to zero out low-magnitude signals to improve adversarial robustness, as illustrated in Figure 3 and in the numerical sketch below.

Learning robust classifiers using neural networks. Motivated by the above discussion of linear vs. high-complexity models, our goal is to show that a two-layer neural network can (after adversarial training) learn a robust function $f(x)$ such as
$$f(x) \approx \sum_{i \in [d]} w_i^\star \left[\mathrm{ReLU}(\langle M_i, x\rangle - b) - \mathrm{ReLU}(-\langle M_i, x\rangle - b)\right]$$
Here, $\mathrm{ReLU}(y) = \max\{y, 0\}$ is the ReLU function and $b \approx \frac{1}{2\sqrt{k}}$. In this paper, we present a theorem stating that adversarial training of a (wlog. symmetric) two-layer neural network can indeed recover a neural network of this form. In other words, after adversarial training, the features learned by the hidden layer of a neural network can indeed form a basis (namely, $M_1, \dots, M_d$) of the input $x$ where the coefficients are sparse. We also present a theorem showing why clean training will not learn this robust function. We also verify experimentally that the features learned by the first layer of AlexNet (after adversarial training) indeed form a sparse basis of the images; see Figure 4.
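To illustrate this gap numerically, the following sketch (ours; it reuses sample, M, w_star, and k from the data-model block in Section 2, and all parameters are illustrative) contrasts the two classifiers under the dense perturbation δ ∝ −yMw⋆ discussed above:

```python
import numpy as np

def f_linear(x):
    return x @ M @ w_star                      # <w*, M^T x>

def f_thresh(x):
    h = x @ M                                  # h_i = <M_i, x>
    return (h * (np.abs(h) >= 1 / (2 * np.sqrt(k)))) @ w_star   # zero out low-magnitude signals

x, y, _ = sample(512)
dense = M @ w_star                                             # the dense direction M w*
delta = -(3 / np.sqrt(k)) * y[:, None] * dense / np.linalg.norm(dense)  # l2 radius 3/sqrt(k)

for f in (f_linear, f_thresh):
    clean = (np.where(f(x) >= 0, 1, -1) == y).mean()
    robust = (np.where(f(x + delta) >= 0, 1, -1) == y).mean()
    print(f"{f.__name__}: clean={clean:.2f}, robust={robust:.2f}")
```

Under these illustrative parameters, one should see the linear classifier's robust accuracy collapse (the perturbation of norm 3/√k shifts ⟨w⋆, M⊤x⟩ by Θ(√(d/k)), overwhelming the O(1) signal), while the thresholded classifier is barely affected, since the per-coordinate perturbation 3/√(kd) stays below the threshold 1/(2√k).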

4 Learner Network and Adversarial Training

In this paper we consider a simple, two-layer (symmetric)⁵ neural network with ReLU activation:
$$f(x) = \sum_{i=1}^m a_i\left[\mathrm{ReLU}(\langle w_i, x\rangle - b_i + \rho_i) - \mathrm{ReLU}(-\langle w_i, x\rangle - b_i + \rho_i)\right]$$
⁴One concrete measure of "higher complexity" is that $f$ cannot be well-approximated by a low-degree polynomial.
⁵We assume the neurons are symmetric (i.e., with $(w_i, -w_i)$ pairs) to simplify proofs.

[Figure 4: Reconstructing the original images using sparse linear combinations of AlexNet's features (adversarially trained): sparse reconstructions using learned features from the first layer of AlexNet, at sparsity levels 4.05%, 2.89%, 1.90%, 1.10%, and 0.44%. The average sparsity is only 4.05% or less. More experiments in Section 8.3.]

In this way we have $f(x) = -f(-x)$. We refer to $w_i \in \mathbb{R}^d$ as the hidden weight (or feature) of the $i$-th neuron, and $b_i > 0$ as the (negative) bias. Each $\rho_i \sim \mathcal{N}(0, \sigma_\rho^2)$ is a smoothing of the original ReLU, also known as the pre-activation noise. Equivalently, one can use the smoothed ReLU activation $\widetilde{\mathrm{ReLU}}(x) = \mathbb{E}_\rho \mathrm{ReLU}(x + \rho)$. In our result, $\sigma_\rho$ is always smaller than $b_i$ and much smaller than the typical value of $\langle w_i, x\rangle$. The main role of the pre-activation noise is simply to make the gradient of ReLU smooth: it simplifies our analysis for the sample complexity derivation (see Lemma A.2). In this paper, unless otherwise specified, we will use $\rho$ to denote $(\rho_i)_{i \in [m]}$.

To simplify analysis, we fix $a_i = 1$ throughout the training. We use $w_i^{(t)}$ to denote the hidden weights at time $t$, and use $f_t(w; x, \rho)$ to denote the network at iteration $t$:
$$f_t(w; x, \rho) = \sum_{i=1}^m \left[\mathrm{ReLU}(\langle w_i^{(t)}, x\rangle + \rho_i - b_i^{(t)}) - \mathrm{ReLU}(-\langle w_i^{(t)}, x\rangle + \rho_i - b_i^{(t)})\right]$$
Given a training set $Z = \{x_j, y_j\}_{j \in [N]}$ together with one sample of pre-activation noise $\rho^{(j)}$ for each $(x_j, y_j)$, we define

$$\mathrm{Loss}_t(w; x, y, \rho) \stackrel{\mathrm{def}}{=} \log\left(1 + e^{-y f_t(w; x, \rho)}\right)$$

$$\mathrm{Loss}_t(w) \stackrel{\mathrm{def}}{=} \mathbb{E}_{x,\, y=y(x),\, \rho}\left[\mathrm{Loss}_t(w; x, y, \rho)\right] \qquad \widetilde{\mathrm{Loss}}_t(w) \stackrel{\mathrm{def}}{=} \frac{1}{N}\sum_{j \in [N]} \mathrm{Loss}_t(w; x_j, y_j, \rho^{(j)})$$
$$\mathrm{Obj}_t(w) \stackrel{\mathrm{def}}{=} \mathrm{Loss}_t(w) + \lambda \sum_{i \in [m]} \mathrm{Reg}(w_i) \qquad \widetilde{\mathrm{Obj}}_t(w) \stackrel{\mathrm{def}}{=} \widetilde{\mathrm{Loss}}_t(w) + \lambda \sum_{i \in [m]} \mathrm{Reg}(w_i)$$

Above, $\mathrm{Loss}_t(w; x, y, \rho)$ is the standard logistic loss, $\mathrm{Loss}_t(w)$ is the population risk, and $\widetilde{\mathrm{Loss}}_t(w)$ is the empirical risk. We consider a strong, but quite natural regularizer to further avoid over-fitting, given as $\mathrm{Reg}(w_i) \stackrel{\mathrm{def}}{=} \frac{\|w_i\|_2^2}{2} + \frac{\|w_i\|_2^3}{3}$. Here, $\frac{\|w_i\|_2^2}{2}$ is known as weight decay in practice; the additional $\frac{\|w_i\|_2^3}{3}$ is an analog of weight decay combined with batch normalization [49].⁶ We consider a fixed $\lambda = \frac{\log\log\log d}{d}$ for simplicity, although our result trivially extends to other values of $\lambda$.⁷

Definition 4.1. In our case, the (clean, population) classification error at iteration $t$ is

$$\mathcal{E}_t^c \stackrel{\mathrm{def}}{=} \Pr_{x,\, y=y(x),\, \rho}\left[y \ne \mathrm{sign}(f_t(w^{(t)}; x, \rho))\right]. \qquad \text{(clean error)}$$
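For concreteness, here is a minimal PyTorch sketch (ours, not the authors' code; class and function names are our own, and all sizes are illustrative) of this learner and its regularized logistic objective:

```python
import torch

class SymmetricReLUNet(torch.nn.Module):
    """f(x) = sum_i [ReLU(<w_i,x> + rho_i - b) - ReLU(-<w_i,x> + rho_i - b)],
    with a_i fixed to 1 and a shared, manually-grown bias b."""
    def __init__(self, d, m, sigma0):
        super().__init__()
        self.W = torch.nn.Parameter(sigma0 * torch.randn(m, d))
        self.b = sigma0 * torch.log(torch.tensor(float(d))).sqrt()  # b^(0) = Theta(sigma0 sqrt(log d))
        self.sigma_rho = 0.0   # pre-activation noise level, set by the training schedule

    def forward(self, x):
        h = x @ self.W.T                                  # <w_i, x> per neuron
        rho = self.sigma_rho * torch.randn_like(h)        # fresh pre-activation noise
        return (torch.relu(h + rho - self.b) - torch.relu(-h + rho - self.b)).sum(dim=1)

def objective(model, x, y, lam):
    """Logistic loss plus the regularizer Reg(w) = ||w||^2/2 + ||w||^3/3."""
    loss = torch.nn.functional.softplus(-y * model(x)).mean()   # log(1 + exp(-y f(x)))
    norms = model.W.norm(dim=1)
    return loss + lam * (norms.pow(2) / 2 + norms.pow(3) / 3).sum()
```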

⁶Indeed, for a function $f(\frac{w}{\|w\|_2}) + \frac{\lambda}{2}\|w\|_2^2$ over normalized $w$, its gradient with respect to $w$ is proportional to $(I - \frac{ww^\top}{\|w\|_2^2})\nabla f(\frac{w}{\|w\|_2}) + \lambda w$; similarly, $\lambda w \cdot \|w\|_2$ can be viewed as the gradient of $\frac{\lambda}{3}\|w\|_2^3$.
⁷Throughout this paper, the purpose of any $\log\log\log d$ factor is to cancel out arbitrarily large constants so that we can present theorems and lemmas with simpler notations.

[Figure 5: Neuron activation sparsity on clean training of WRN-28-10 over the CIFAR-10 dataset: the percentage of activated neurons (between roughly 10% and 50%) in layers 6, 8, 16, 19, 22 and 24, over 140 training epochs.]

We also introduce the notation

$$\ell_t'(w^{(t)}; x, y, \rho) \stackrel{\mathrm{def}}{=} \frac{d}{ds}\left[\log(1 + e^s)\right]\Big|_{s = -y f_t(w^{(t)}; x, \rho)}$$
and observe $\mathbb{E}_{x,\, y=y(x),\, \rho}[\ell_t'(w^{(t)}; x, y, \rho)] \ge \Omega(\mathcal{E}_t^c)$.

4.1 Clean Training

We consider clean training as the gradient descent algorithm with step size $\eta > 0$ on the hidden weights $w_1, \dots, w_m$ over the original data set; see Algorithm 1. Our result extends to stochastic gradient descent at the expense of complicating notations. For simplicity, we assume the bias terms $b_1^{(t)} = \cdots = b_m^{(t)} = b^{(t)}$ grow together.⁸

At initialization, we let $w_i^{(0)} \sim \mathcal{N}(0, \sigma_0^2 I)$ for $\sigma_0 = \frac{1}{\mathrm{poly}(d)}$ and let $b^{(0)} = \Theta(\sigma_0 \sqrt{\log d})$. When near initialization, we manually increase the bias as $b^{(t+1)} = b^{(t)} + \eta B$ where $B = \frac{c_b}{d}$ for some small constant $c_b > 0$; this corresponds to the "lottery ticket winning" phase to be discussed later in Section 6.1. Whenever $b^{(t)}$ reaches $\frac{1}{k^{0.5001}}$ we set $B = 0$; in this phase, the neurons that have won the "lottery ticket" will keep winning and grow significantly, to be discussed in Section 6.2. We also choose the pre-activation noise $\sigma_\rho^{(t)} = \frac{b^{(t)}}{\sqrt{\log d}} \cdot \Theta((\log\log\log d)^3)$ for $t \le T_a = \frac{1}{\mathrm{poly}(d)\eta}$, and $\sigma_\rho^{(t)} = \frac{b^{(t)}}{\log d} \cdot \Theta((\log\log\log d)^3)$ for $t > T_a$. The explicit choices of $B$ and $T_a$ are given in the proofs.

4.2 Adversarial Training

We state the adversarial training algorithm in Algorithm 2. It takes as input a perturbation algorithm $A$, and repeatedly applies gradient descent over a perturbed data set (the original data set plus the perturbations given by $A$). Formally,

Definition 4.2. An (adversarial) perturbation algorithm $A$ (a.k.a. attacker) maps the current network $f$ (which includes hidden weights $\{w_i\}$, output weights $\{a_i\}$, biases $\{b_i\}$ and smoothing parameter $\sigma_\rho$), an input $x$, a label $y$, and some internal random string $r$, to a vector in $\mathbb{R}^d$ satisfying
$$\|A(f, x, y, r)\|_p \le \tau$$
for some $\ell_p$ norm. We say $A$ is an $\ell_2$ perturbation algorithm of radius $\tau$ if $p = 2$, and an $\ell_\infty$ perturbation algorithm of radius $\tau$ if $p = \infty$. For simplicity, we assume $A$ satisfies, for fixed $f, y, r$: either $\|A(f, x, y, r)\|_p \le \frac{1}{\mathrm{poly}(d)}$, or $A(f, x, y, r)$ is a $\mathrm{poly}(d)$-Lipschitz continuous function in $x$.

⁸We make several remarks about the bias growth:

• Our analysis does extend to the case of trainable biases $b_i$, when the spike noise is large (e.g. $\mathbb{E}[(\xi_i'')^2] \ge \Omega(1/d)$): in this case, by applying gradient descent on the bias, it will automatically grow until it is large enough to de-noise. This would significantly complicate the proofs, so we do not include it here.
• We grow the bias to increase activation sparsity as training goes on. This is very natural and simulates the actual training process in practice (see Figure 5).
• Alternatively, one can interpret our result as follows: in the practical sparse coding setting, even if the biases are well-tuned to simulate dictionary learning, clean training over the weights $w_i$ still fails to learn the exact dictionary; rather, it will accumulate dense mixtures in the neuron weights, leading to near-zero robust accuracy.

Algorithm 1: clean training using gradient descent
1: begin with the randomly initialized $w^{(0)}$ and the starting bias $b^{(0)}$;
2: for $t \in \{0, 1, 2, \dots, T_f - 1\}$ do
3:   for each $(x_j, y_j) \in Z$, sample pre-activation noise $\rho^{(j)}$ i.i.d. $\sim \mathcal{N}(0, (\sigma_\rho^{(t)})^2 I_{m \times m})$.
4:   define the empirical objective $\widetilde{\mathrm{Obj}}_t(w)$ at this iteration using $\{x_j, y_j, \rho^{(j)}\}_{j \in [N]}$.
5:   for each $i \in [m]$, update using gradient descent: $w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta \nabla_{w_i} \widetilde{\mathrm{Obj}}_t(w^{(t)})$
6:   update $b^{(t+1)} \leftarrow b^{(t)} + \eta B$
7: end for

Algorithm 2: adversarial training algorithm (against perturbation algorithm $A$)
1: begin with a network $f_{T_f}$ learned through clean training in Algorithm 1.
2: for $t \in \{T_f, T_f + 1, \dots, T_f + T_g - 1\}$ do
3:   for every $(x_j, y_j) \in Z$, perturb $x_j^{(\mathrm{adv})} \leftarrow x_j + A(f_t, x_j, y_j, r_j)$.
4:   for each $(x_j, y_j) \in Z$, sample pre-activation noise $\rho^{(j)}$ i.i.d. $\sim \mathcal{N}(0, (\sigma_\rho^{(t)})^2 I_{m \times m})$.
5:   define the empirical objective $\widetilde{\mathrm{Obj}}_t(w)$ at this iteration using $\{x_j^{(\mathrm{adv})}, y_j, \rho^{(j)}\}_{j \in [N]}$.
6:   for each $i \in [m]$, update using gradient descent: $w_i^{(t+1)} \leftarrow w_i^{(t)} - \eta \nabla_{w_i} \widetilde{\mathrm{Obj}}_t(w^{(t)})$
7: end for

One can verify that for our network, the fast gradient method (FGM) [40] satisfies the properties of Definition 4.2. FGM is a widely used algorithm to find adversarial examples. In our language, FGM is simply given by:⁹
$$A(f, x, y) = \begin{cases} \arg\min_{\delta : \|\delta\|_p \le \tau} \langle y \nabla_x f(x), \delta\rangle & \text{if } \|\nabla_x f(x)\|_q \ge \frac{1}{\mathrm{poly}(d)}; \\ 0 & \text{otherwise.} \end{cases}$$
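Combining FGM with Algorithm 2, a minimal PyTorch sketch (ours; it reuses the SymmetricReLUNet and objective from the sketch in Section 4, with illustrative hyper-parameters):

```python
import torch

def fgm_perturb(model, x, y, tau, p=2):
    """Fast Gradient Method: delta = argmin_{||delta||_p <= tau} <y grad_x f(x), delta>.
    For p=2 this is -tau * y * g / ||g||_2; for p=inf it is -tau * y * sign(g).
    (The paper zeroes the perturbation when the gradient is tiny; here we
    simply clamp the norm for numerical safety.)"""
    x = x.clone().requires_grad_(True)
    model(x).sum().backward()        # f is scalar per example, so this gives per-example grad_x f
    g = x.grad.detach()
    if p == 2:
        gnorm = g.norm(dim=1, keepdim=True).clamp_min(1e-12)
        return -tau * y[:, None] * g / gnorm
    return -tau * y[:, None] * g.sign()

def adversarial_training(model, X, Y, tau, lr, lam, epochs):
    """Algorithm 2: gradient descent on the FGM-perturbed data set."""
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        X_adv = X + fgm_perturb(model, X, Y, tau)     # line 3 of Algorithm 2
        opt.zero_grad()
        objective(model, X_adv, Y, lam).backward()    # lines 4-6 of Algorithm 2
        opt.step()
```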

Definition 4.3. The robust error at iteration $t$, against arbitrary $\ell_p$ perturbations of radius $\tau$, is
$$\mathcal{E}_t^r \stackrel{\mathrm{def}}{=} \Pr_{x,\, y = y(x),\, \rho}\left[\exists \delta \in \mathbb{R}^d, \|\delta\|_p \le \tau : \mathrm{sign}(f_t(x + \delta)) \ne y\right] \qquad \text{(robust error)}$$
In contrast, the empirical robust classification error against algorithm $A$ is
$$\widehat{\mathcal{E}}_t^r \stackrel{\mathrm{def}}{=} \Pr_{x,\, y = y(x),\, \rho,\, r}\left[\mathrm{sign}(f_t(x + A(f_t, x, y, r))) \ne y\right] \qquad \text{(empirical robust error)}$$
Our upper bound theorems apply to all perturbation algorithms under Definition 4.2, and give a small empirical robust error $\widehat{\mathcal{E}}^r$. To obtain a small (true) robust error $\mathcal{E}^r$, as we shall see, one can for instance let $A$ be the fast gradient method (FGM).

Initialized from clean training. In this paper, we assume adversarial training (i.e., Algorithm 2) is initialized from a network that is already clean-trained.

In contrast, in practice, adversarial training usually begins directly from random initialization. We remark here that:

• First, in practice, adversarial training from a clean-trained initialization performs no worse than from a random initialization; see Table 1 on Page 21. In fact, it is sometimes even beneficial to begin with clean training and gradually switch to adversarial training (see e.g. [82]).
• Second, to prove our main conceptual message, feature purification, it is convenient to start from a clean-trained model and then try to understand which part of the features is changed after robust training. Since neural nets have many equivalent transformations that are not well-understood (even in the two-layer case), if we adversarially train from random initialization, it is theoretically very hard to quantify how the result relates to a clean-trained model learned from random initialization (since we would need to understand all the invariants).¹⁰
• Yet, our theory still gives support to what happens in adversarial training from random initialization. As we shall prove, the "o(1) feature change" comes from dense-mixture directions (see Theorem 5.2). Thus, adversarial training from random initialization should directly avoid learning such dense mixtures, as opposed to first learning them (by clean training) and then forgetting them (by adversarial training). We illustrate this in Figure 9.

⁹Here, $\|\cdot\|_q$ is the dual norm of $\|\cdot\|_p$. In our case, due to the pre-activation noise, we define $\nabla_x f(x) = \nabla_x \mathbb{E}_\rho f(x; w, \rho)$. Also, we have zeroed out $A(f, x, y)$ when $\|\nabla_x f(x)\|_q$ is extremely small for the convenience of analysis, because otherwise $A(f, x, y)$ is not Lipschitz continuous at those points.

5 Statements and Explanations of Our Main Results

5.1 Clean Training, Adversarial Training and ℓ2 Robustness

For simplicity, in this subsection we sketch our main results for the special case $k = d^{0.36}$, although our theorems hold for a wider range of $k$ in the full appendix. Recall that the learner network $f_t(x)$ and its random initialization are given in Section 4, clean training using gradient descent is given in Algorithm 1, and $w_i^{(t)} \in \mathbb{R}^d$ is the hidden weight of the $i$-th hidden neuron at iteration $t$. We state the theorem for clean training below:

Theorem 5.1 (clean training). There exist absolute constants $C, c > 0$ such that for every constant $c_0 \in (0, c]$, every $d$ and $m$ with $m = d^{1+c_0}$, given $N \ge \Omega(d^C)$ many training data, for every random initialization weight $\sigma_0 = \frac{1}{\mathrm{poly}(d)}$, for every learning rate $\eta \in (0, \frac{1}{\Omega(d^C)})$, if we define $T_c := \Theta(\frac{d^{1.01}}{\eta})$, then for every $T_f \in [T_c, d^{\log d}/\eta]$, the following holds with high probability. The network $f_t$ with hidden weights $\{w_i^{(t)}\}_{i \in [m]}$ learned by clean training (Algorithm 1) satisfies:

(a) Global feature learning: for every $t \in [T_c, T_f]$,
$$\textstyle\sum_{i \in [m]} \langle w_i^{(t)}, w_i^{(0)}\rangle^2 = o(1) \times \sum_{i \in [m]} \|w_i^{(t)}\|_2^2 \cdot \|w_i^{(0)}\|_2^2$$
and
$$\textstyle\sum_{i,j \in [m]} \langle w_i^{(t)}, w_j^{(t)}\rangle^2 = o(1) \times \big(\sum_{i \in [m]} \|w_i^{(t)}\|_2^2\big)^2. \quad \text{(see Theorem C.2)}$$

(b) Clean training has good clean accuracy: for every $t \in [T_c, T_f]$,
$$\mathcal{E}_t^c = \Pr_{x,\, y=y(x),\, \rho}\left[y \ne \mathrm{sign}(f_t(w^{(t)}; x, \rho))\right] \le o(1). \quad \text{(see Theorem D.1)}$$

(c) Clean training is not robust to small adversarial perturbations: for every $t \in [T_c, T_f]$ and every $\tau \ge \frac{1}{k^{0.5+10c_0}}$, using the perturbation $\delta = -\tau \frac{y M w^\star}{\|Mw^\star\|_2}$ (which does not depend on $f_t$),
$$\mathcal{E}_t^r \ge \Pr_{x,\, y,\, \rho}\left[\mathrm{sign}(f_t(w^{(t)}; x + \delta, \rho)) \ne y\right] = 1 - o(1). \quad \text{(see Theorem E.1)}$$

¹⁰Even if one performs clean/adversarial training from the same random initialization, the additional randomness in SGD may quickly make the two models diverge from each other.

Theorem 5.1 indicates that in our setting, clean training of the neural network gives good clean accuracy but terrible robust accuracy. Such terrible robust accuracy is not due to over-fitting, as it holds even when super-polynomially many iterations and infinitely many training examples are used to train the neural network. In the next theorem, we give a precise characterization of what the hidden weights $\{w_i\}$ are after clean training, and why they are not robust.

Theorem 5.2 (clean training features). For every neuron $i \in [m]$, there is a fixed subset $N_i$ of size $|N_i| = O(1)$ such that, for every $t \in [T_c, d^{\log d}/\eta)$,
$$w_i^{(t)} = \sum_{j \in N_i} \alpha_{i,j} w_j^\star M_j + \sum_{j \notin N_i} \beta_{i,j} w_j^\star M_j \quad \text{(see Theorem C.2)}$$
where (1) $|\beta_{i,j}| < \frac{k}{d^{1-c}}$ for some small constant $c \in [0, 0.001]$, and (2) for at least $\Omega(d)$ many neurons $i \in [m]$, it satisfies $|N_i| = 1$ and $\alpha_{i,j} > d^{-c}$. Moreover,
$$\frac{1}{md} \sum_{i \in [m]} \sum_{j \in [d] \setminus N_i} \beta_{i,j} \in \left[\frac{1}{d^c} \times \frac{k}{d},\; d^c \times \frac{k}{d}\right]. \quad \text{(see Lemma E.2)}$$

Theorem 5.2 says that each neuron $w_i$ will learn constantly many (allegedly large) components in the directions $\{M_j : j \in N_i\}$, and its components in the remaining directions $\{M_j : j \notin N_i\}$ are all small. We emphasize that the sets $N_i$ are independent of $t$ and are solely determined by the random initialization. In other words, for each neuron $i$, which set $N_i$ it "wins" is completely determined by the "lottery ticket" (its random initialization). We discuss this in more detail in Section 6.1.

Furthermore, Theorem 5.2 shows that instead of learning the pure, robust features $\{M_j\}_{j \in [d]}$, clean training will learn neurons of the following form (intuitively, ignoring the small $d^c$ factors, focusing only on those neurons with $|N_i| = 1$, and assuming for simplicity that all the $\beta_{i,j}$ are of similar, positive magnitude):

$$w_i^{(t)} \approx \underbrace{\Theta(1) M_j}_{\text{pure, robust feature}} + \underbrace{\textstyle\sum_{j' \ne j} \Theta\left(\frac{k}{d}\right) w_{j'}^\star M_{j'}}_{\text{dense mixture}} \tag{5.1}$$

Feature purification: mathematical reasoning. Eq. (5.1) says that, after clean training, the neural network will learn a big portion of the robust feature, $\Theta(1) M_j$, plus some small dense mixture $v = \sum_{j' \ne j} \Theta(\frac{k}{d}) w_{j'}^\star M_{j'}$. In our sparse coding model, each $x$ is of the form $x = Mz + \xi$, where $z$ is a sparse vector and $\xi$ is the noise. One critical observation is that such a dense mixture $v$ has low correlation with almost all inputs $x$ from the original distribution, so it has negligible effect on clean accuracy. However, such a dense mixture is extremely vulnerable to small but dense adversarial perturbations of the input along this direction $v$, making the model non-robust. As we point out, such "dense adversarial perturbation" directions do not exist in the original data.¹¹ Thus, one has to rely on adversarial training to remove dense mixtures and make the model robust. This is the main spirit of our feature purification principle, and we illustrate it in Figure 6.

Where does the dense mixture come from? We shall explain in more detail in Section 6.2, but at a high level, in each iteration, the gradient $\nabla \mathrm{Obj}$ will bias towards the direction that correlates with the labeling function $y = \mathrm{sign}(\langle w^\star, z\rangle)$; and since $x = Mz + \xi$ in our model, such a direction

¹¹One can try to add these dense mixtures directly to the training data set, which we conjecture to be similar to the approach in [48].

[Figure 6: "horse + mixtures" becomes "pure horse", and "car + mixtures" becomes "pure car", through adversarial training (robust accuracy 0% → 42.3%; clean accuracy > 80% → 65.0%). Experiments support our theory that adversarial training does purify dense mixtures, by visualizing some deep-layer features of ResNet on CIFAR-10 data, trained against an ℓ2(1, 0.25) attacker (see Section 8.2). For more experiments on different layers of the network and different attackers, see Figure 11 and Figure 13.]

should be $Mw^\star = \sum_j w_j^\star M_j$, which is a dense mixture direction and will be accumulated across time. The accumulation of dense mixtures is consistent with the finding of [23] (for linearly-separable data). However, as we have argued in Section 3, in our setting where the noise level $\sigma_x \ge \Omega(1)$ is large, such a dense mixture direction cannot even give good clean accuracy. Therefore, during clean training, the neural network has the incentive to discover features close to $\{M_j\}_{j \in [d]}$, because they can "de-noise" $\xi$ better (see the discussion in Section 3). Yet, our critical observation is that even for a well-trained neural network that aims to de-noise $\xi$, and even when the neurons are close to the pure features $\{M_j\}_{j \in [d]}$, the "dense direction" still locally correlates with the labeling function $y$, and thus can still accumulate during the course of a local training algorithm such as gradient descent, leading to the small, non-robust part of each feature.

Next, we state the theorem for adversarial training (recall Algorithm 2). It shows that adversarial training indeed purifies the small dense mixtures, leading to local changes of the weights.

Theorem 5.3 (adversarial training). In the same setting as Theorem 5.1, suppose $A$ is an $\ell_2$ perturbation algorithm with radius $\tau \le \frac{1}{k^{0.5+c}}$. Suppose we run clean training (Algorithm 1) for $T_f \ge T_c$ iterations, followed by adversarial training (Algorithm 2) for $T_g = \Theta(\frac{k^{2+c}}{\eta})$ iterations. Then the following holds with high probability.

(a) Empirical robust accuracy: for $t = T_f + T_g$,
$$\widehat{\mathcal{E}}_t^r = \Pr_{x,\, y=y(x),\, \rho,\, r}\left[\mathrm{sign}(f_t(x + A(f_t, x, y, r))) \ne y\right] \le o(1) \quad \text{(see Theorem F.1)}$$

(b) Provable robust accuracy: when $A$ is the fast gradient method (FGM), for $t = T_f + T_g$,
$$\mathcal{E}_t^r \stackrel{\mathrm{def}}{=} \Pr_{x,\, y=y(x),\, \rho}\left[\exists \delta \in \mathbb{R}^d, \|\delta\|_2 \le \tau : \mathrm{sign}(f_t(x + \delta)) \ne y\right] \le o(1) \quad \text{(see Corollary F.2)}$$

(c) Feature (local) purification: for every $t \in [T_f, T_f + T_g - 1]$,

$$\textstyle\sum_{i \in [m]} \|w_i^{(T_f)} - w_i^{(t)}\|_2^2 = o(1) \times \sum_{i \in [m]} \|w_i^{(T_f)}\|_2^2 \quad \text{(see (F.12))}$$

We emphasize that Theorem 5.3a holds for any perturbation algorithm $A$ satisfying Definition 4.2, when we are only concerned with robustness against $A$. This means that local feature purification happens regardless of which adversarial perturbation algorithm is used to find the adversarial examples. More surprisingly, Theorem 5.3b says that when a good perturbation algorithm such as FGM is used, the robustness not only generalizes to unseen examples, it also holds against any worst-case perturbation.¹²

Density of adversarial perturbations. Our previous theorem suggests that one of the main goals of adversarial training is to remove dense mixtures to make the network more robust. Therefore, before adversarial training, the adversarial perturbations are dense in the basis $\{M_j\}_{j \in [d]}$; and after adversarial training, the adversarial perturbations ought to be more sparse and aligned with inputs from the original data set. Figure 7 confirms this theoretical finding on real-life data sets. Later in Section 8.3, we also present concrete measurements of the sparsity of these adversarial perturbations, and compare them in Figure 12.

[Figure 7: Adversarial perturbations before and after adversarial training; ResNet-34, CIFAR-10 data set. Shown: original images, adversarial perturbations for clean-trained models (scaled by x5), and adversarial perturbations for robust-trained models (scaled by x5). For clean-trained models, adversarial perturbations are "dense"; after adversarial training, the "dense mixtures" are removed and the perturbations are more aligned with actual images. More experiments in Figure 12 and Section 8.3.]

5.2 ℓ∞ Robustness and Lower Bound for Low-Complexity Models

We also have the following theorem (stated in a special case for simplicity) for ℓ∞ robustness.

Theorem 5.4 ($\ell_\infty$ adversarial training). Suppose $\|M\|_\infty, \|M\|_1 = d^{o(1)}$. There exists a constant $c_1 \in (0, c_0)$ such that, in the same setting as Theorem 5.3, except now for any $k \in [d^{c_1}, d^{0.399}]$ and with $A$ an $\ell_\infty$-perturbation algorithm of radius $\tau = \frac{1}{k^{1.75+2c_0}}$, Theorem 5.3 and Theorem 5.1 still hold and imply:
• clean training is not robust against $\ell_\infty$ perturbations of radius $\frac{1}{k^{2-c_1}}$;
• adversarial training is robust against any $\ell_\infty$-perturbation of radius $\tau = \frac{1}{k^{1.75+2c_0}}$.

This gives a gap because $c_0, c_1$ can be made arbitrarily small. (See Theorem E.1 and Theorem F.4.) We also show a lower bound that no low-degree polynomial, or even the corresponding neural tangent kernel (NTK), can robustly learn the concept class. Recall that for our two-layer ReLU network,

¹²We point out that since our results hold for any perturbation algorithm $A$, it is impossible to characterize exactly what the learned features are after training (for example, showing that they correspond to the actual dictionary), since $A$ might be a bad adversarial perturbation finding algorithm and the network does not even need to remove all the dense mixtures in order to fool $A$. However, the true robustness given by Theorem 5.3b does imply that the dense mixtures must be removed, at least in terms of functionality, if one uses a good adversarial perturbation finding algorithm.

Definition 5.5. The feature mapping of the neural tangent kernel for our two-layer network $f$ is
$$\Phi(x) = \left(\mathbb{E}_{\rho_i}\left[x\left(\mathbb{1}_{\langle w_i, x\rangle + \rho_i \ge b_i} - \mathbb{1}_{-\langle w_i, x\rangle + \rho_i \ge b_i}\right)\right]\right)_{i=1}^m

Therefore, given weights $\{v_i\}_{i \in [m]}$, the NTK function $p(x)$ is given as
$$p(x) = \sum_{i \in [m]} \mathbb{E}_{\rho_i \sim \mathcal{N}(0, \sigma_\rho^2)}\left[\langle x, v_i\rangle\left(\mathbb{1}_{\langle w_i, x\rangle + \rho_i \ge b_i} - \mathbb{1}_{-\langle w_i, x\rangle + \rho_i \ge b_i}\right)\right]$$

Without loss of generality, assume each wi ∼ N (0, I).
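As an illustration of this definition, a short NumPy sketch (ours; the helper names are our own) estimating Φ(x) and p(x) by Monte Carlo sampling over ρ:

```python
import numpy as np

rng = np.random.default_rng(1)

def ntk_features(x, W, b, sigma_rho, n_rho=256):
    """Phi(x)_i = E_rho[ x * (1{<w_i,x>+rho >= b} - 1{-<w_i,x>+rho >= b}) ],
    with the expectation over rho estimated by Monte Carlo."""
    h = W @ x                                             # (m,) pre-activations <w_i, x>
    rho = sigma_rho * rng.normal(size=(n_rho, len(h)))
    gap = ((h + rho >= b).astype(float) - (-h + rho >= b).astype(float)).mean(axis=0)
    return gap[:, None] * x[None, :]                      # (m, d) feature map

def ntk_function(x, V, W, b, sigma_rho):
    """p(x) = sum_i <x, v_i> E[1{<w_i,x>+rho>=b} - 1{-<w_i,x>+rho>=b}]."""
    return float((ntk_features(x, W, b, sigma_rho) * V).sum())

d, m = 64, 256
W = rng.normal(size=(m, d))        # w_i ~ N(0, I), fixed: the features are never trained
V = rng.normal(size=(m, d)) / m    # only these linear weights {v_i} are trainable
print(ntk_function(rng.normal(size=d), V, W, b=1.0, sigma_rho=0.1))
```

Note that W stays at random initialization and only V is trained; this is precisely why, in this regime, adversarial training can only re-weight fixed random features and cannot purify them.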

In this paper, we consider a wide range of NTK parameters: $\rho_i \sim \mathcal{N}(0, \sigma_{\rho_i}^2)$ for arbitrary $\sigma_{\rho_i} \in [0, d^{o(1)}]$ and $|b_i| \le d^{o(1)}$. Our lower bound holds even in the simplest case $M = I$ and $\sigma_x = 0$, where the original concept class is linearly separable. We prove the following:

Theorem 5.6 (lower bound). For every constant $C > 1$, suppose $m \le d^C$; then there is a constant $c > 0$ such that when $k = d^c$, considering $\ell_\infty$ perturbations of radius $\tau = \frac{1}{k^{100}}$, we have w.h.p. over the choice of $w_i$: for every $p(x)$ in the above definition, the robust error satisfies
$$\mathcal{E}^r(p) \ge \frac{1 - o(1)}{2}. \quad \text{(see Theorem G.1)}$$
(In contrast, Theorem 5.4 says adversarial training of the neural network gives robust radius $\tau = \frac{1}{k^{1.76}}$.) Since a poly-sized NTK kernel is known to be powerful enough to incorporate any low-complexity function (such as constant-degree polynomials) [6], we have the following corollary.

Corollary 5.7 (lower bound). In the same setting as Theorem 5.6, if $q(x)$ is a constant-degree polynomial, then we also have robust error $\mathcal{E}^r(q) \ge \frac{1 - o(1)}{2}$.

6 Overview of the Training Process

In this section, we present an overview of the proof for the training process, using gradient descent starting from random initialization. The complete proof is deferred to the Appendix.

6.1 Winning Lottery Tickets Near Random Initialization

Our proof begins by showing how the features in the neural network emerge from random initialization. In this phase, the loss function is not yet sufficiently minimized, so the classification accuracy remains around 50%. However, we prove that in this phase, gradient descent can already drive the neural network to learn a rich set of interesting features out of the random initialization. We call this process "lottery ticket winning" near random initialization, which is related to the study of [34].

Remark. This "lottery ticket winning" process is fundamentally different from the neural tangent kernel analysis (e.g. [4, 7, 8, 13, 14, 26–29, 38, 45, 50, 57, 61, 105, 116]). In this phase, although the loss is not sufficiently minimized, the activation patterns of the ReLU activations have changed dramatically, so that they have little correlation with the random initialization. Yet, we develop a new theoretical technique that allows us to control the change of the weights of the neurons, as we summarize below.

We derive the following property at random initialization. At iteration $t = 0$, the hidden weights are initialized as $w_i^{(0)} \sim \mathcal{N}(0, \sigma_0^2 I_{d \times d})$. Using standard properties of Gaussians, we show the following critical property: as long as $m \ge d^{1.01}$, there exist small constants $c_3 > c_4 > 0$ such that

(i) For most of the neurons $i \in [m]$, $\max_{j \in [d]}\{\langle M_j, w_i^{(0)}\rangle^2\} \le 2\sigma_0^2 \log d$.

(ii) For at most a $\frac{1}{d^{c_4}}$ fraction of the neurons $i \in [m]$, there is a dimension $j \in [d]$ with $\langle M_j, w_i^{(0)}\rangle^2 \ge 2.01\sigma_0^2 \log d$.

(iii) For at least a $\frac{1}{d^{c_3}}$ fraction of the neurons $i \in [m]$, there is one and only one $j \in [d]$ such that $\langle M_j, w_i^{(0)}\rangle^2 \ge 2.02\sigma_0^2 \log d$, and all the other $j' \in [d]$ satisfy $\langle M_{j'}, w_i^{(0)}\rangle^2 \le 2.01\sigma_0^2 \log d$.

In other words, even with very mild over-parameterization $m \ge d^{1.01}$, by the property of random Gaussian initialization, there will be some "potentially lucky neurons" in (ii), where the maximum correlation to one of the features $M_j$ is slightly higher than usual. Moreover, there will be some "surely lucky neurons" in (iii), where such a "slightly higher correlation" appears in one and only one of the target features $M_j$. In our proof, we denote the set of neurons in (iii) whose correlation with $M_j$ is slightly higher than usual by $S_{j,\mathrm{sure}}^{(0)}$, and denote those in (ii) by $S_{j,\mathrm{pot}}^{(0)}$. We will identify the following process during training, as given in Theorem C.1:

Lottery ticket winning process: For every $j \in [d]$ and every iteration $t$, if $i \in S_{j,\mathrm{sure}}^{(0)}$, then $\langle M_j, w_i^{(t)}\rangle^2$ will grow faster than $\langle M_{j'}, w_i^{(t)}\rangle^2$, until $\langle M_j, w_i^{(t)}\rangle^2$ becomes sufficiently larger than all the other $\langle M_{j'}, w_i^{(t)}\rangle^2$.

In other words, if neuron $i$ wins the lottery ticket at random initialization, then eventually it will deviate from random initialization and grow into a feature closer to (a scaling of) $M_j$. Our other main observation is that if we slightly over-parameterize the network with $m \ge d^{1.001}$, then for each $j \in [d]$, $|S_{j,\mathrm{sure}}^{(0)}| \ge 1$ and $|S_{j,\mathrm{pot}}^{(0)}| \le d^{0.01}$. In other words, for each dimension $j \in [d]$, the number of lottery tickets across all neurons is at most $d^{0.01}$, but at least one neuron will win a lottery ticket (see Lemma B.2). We also illustrate the lottery ticket winning process experimentally in Figure 8, and simulate the initialization statistics below.
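To see these initialization statistics concretely, here is a small NumPy simulation (ours, with toy sizes; the over-parameterization is exaggerated so the asymptotic statements are visible at small d). Since M is unitary we may take M = I, so that ⟨M_j, w_i^{(0)}⟩ is just a coordinate of w_i^{(0)}:

```python
import numpy as np

rng = np.random.default_rng(2)
d, m, sigma0 = 200, 40 * 200, 1.0                # toy sizes; sigma0 cancels below

W0 = sigma0 * rng.normal(size=(m, d))            # w_i^(0) ~ N(0, sigma0^2 I)
corr2 = W0 ** 2 / (sigma0 ** 2 * np.log(d))      # <M_j, w_i^(0)>^2 / (sigma0^2 log d)

top = np.sort(corr2, axis=1)[:, ::-1]            # per-neuron correlations, sorted descending
pot = top[:, 0] >= 2.01                          # (ii) "potentially lucky" neurons
sure = (top[:, 0] >= 2.02) & (top[:, 1] <= 2.01) # (iii) "surely lucky": one and only one big coordinate

print(f"potentially lucky: {pot.mean():.2%}, surely lucky: {sure.mean():.2%}")
winners = np.unique(np.argmax(corr2, axis=1)[sure])
print(f"dictionary atoms j with at least one surely-lucky neuron: {len(winners)} of {d}")
```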

[Figure 8: The lottery ticket winning process, AlexNet, CIFAR-10 data set: feature visualizations at epochs 0, 0.3, 0.6, 1, 1.5, 2, 2.5, 3, 4, 6, 10 and 20, together with the training objective over the first 20 epochs.]

6.2 The Formation of "Dense Mixtures" During Training

The next phase of our analysis begins when all the neurons have already won their lottery tickets near random initialization. After that, the loss starts to decrease significantly, so the (clean) classification error starts to drop. We shall prove that in this phase, gradient descent will also accumulate, in each neuron, a small "dense mixture" that is extremely vulnerable to small but adversarial perturbations. To show this, we maintain the following critical property, as given in Theorem C.2:

If a neuron i wins the lottery ticket for feature Mj near random initialization, then it will keep this “lottery ticket” throughout the training.

(t) 2 Or in math words, for each neuron i, after hMj, wi i becomes sufficiently larger than all the (t) 2 (t) 2 other hMj0 , wi i at the first stage, it will stay much larger than other hMj0 , wi i for the remaining of the training process. To prove this, we introduce a careful coupling between the (directional) gradient of the neuron, and the (directional) Lipschitz continuity of the network ft, this is given in Section C.4.2. The vulnerable dense mixtures. The most critical observation in this phase is the formation of “dense mixtures”, where we show that even for the “lucky neuron” that wins the lottery ticket, the hidden weight of this neuron will look like (see Theorem 5.2)     k X ? w ≈ α M + Θ w 0 M 0 (6.1) i t  j d j j  j06=j

In other words, up to scaling, these neurons will look like wi ≈ Mj + vi, where vi is a “dense k  P ? mixture” vi = Θ d j06=j wj0 Mj0 . The key observation is that vi is small and dense, in the sense that it is a mixture of all the other features {Mj0 }j0∈[d], but each of the feature has a much smaller contribution comparing to the leading term Mj. Recall in our sparse coding model, each input x = Mz + ξ; so with high probability: k  |hvi, xi| ≤ Oe kxk2 (6.2) d √ 1 This value is even smaller than k when k ≤ d. Thus, this dense mixture will not be correlated with any particular natural input, and thus the existence of these mixtures will have negligible contribution to the output of ft on clean data. P However, if we perturb input x along the dense direction δ ∝ j0∈[d] Mj0 , we can observe that:  k  |hvi, δi| = Ω √ kδk2 d Comparing this with Eq (6.2), such “dense perturbation” can change the output of the neural network ft by a lot, using a small δ whose norm is much smaller than that of x. Thus, at this phase, even when the network has a good clean accuracy, it is still non-robust to these small yet dense adversarial perturbations. Moreover, this perturbation direction is “universal”, in the sense that it does not depend on the randomness of the model at initialization, or the randomness we use during the training. This explains transfer attacks in practice: that is, the adversarial perturbation found in one model can also attack other models that are independently trained. Feature purification. Since Eq. (6.2) suggests most original inputs have negligible correlations with each dense mixture, during clean training, gradient descent will have no incentive to remove

18 those mixtures. Thus, we have to rely on adversarial training to purify those dense mixtures by introducing adversarial examples. Those examples have correlation with vi’s that are higher than usual. As we prove in Theorem 5.1 and illustrate in Figure 1, such “purifications”, albeit imposing only a small change to each neuron, will greatly improve the robustness of the neural network. The formation of the dense mixtures. To further help the readers understand how those “dense mixtures” are formed, we sketch the proof of Theorem 5.2, which shows why clean training is provably non-robust. The main observation is that when the dense mixtures are small, the negative gradient of the (say, population) loss with respect to each neuron wi is approximately given by (recall x = Mz + ξ): h i (t) 0 (t) 1 1  −∇wi Loss(w ) ≈ E y`t(w ; x, y, ρ) (t) (t) + (t) (t) Mz x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b As a result, h i (t) ? 0 (t) 1 1  ? −h∇wi Loss(w ), Mw i ≈ E y`t(w ; x, y, ρ) (t) (t) + (t) (t) hw , zi x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b Since y = sign(hw?, zi), we have yhw?, zi ≥ 0; together with the non-negativity `0 ≥ 0 and the (t) ? indicators are always non-negative, we know −h∇wi Loss(w ), Mw i is quite positive. Thus, during clean training, this dense direction Mw? will naturally accumulate in each neuron. We emphasize that this is indeed a special property of gradient descent. Consider again the case σx = Ω(1) discussed in Section 3, where x = Mz + ξ with kξk2 = Ω(1) = Ω(kMzk2). With high probability, a linear classifier using direction Mw? cannot be used to classify x correctly. Yet, this direction Mw? is still locally positively correlated with the labeling function y, especially for well-trained, well-generalizing neural networks when the ξ can be “de-noised”. (Stochastic) gradient descent, as a local update algorithm, only exams the local correlation between the update direction and the labeling function, and it does not exam whether this direction can be used in the final result. Thus, this dense direction Mw? will be accumulated step by step, leading to a non-robust part of each of the features during clean training. In fact, even if we use wi = Mi as initialization as opposed to random initialization, continuing clean training will still accumulate these small but dense mixtures. We illustrate this in Figure 9. clean training adversarial training

pure features pure features + dense mixtures (random init) (linear model)

dense mixtures adversarial training blocks this incentive by introducing Clean training has the incentive to dense adv-perturbation accumulate dense mixtures associated with the linear model

Figure 9: Overall summary of clean training, adversarial training, in the language of pure vs dense features. (Exper- iment based on AlexNet on CIFAR-10 dataset.)

19 7 Conclusion

In this paper, we made a first step towards understanding how, in principle, the features in a neural network are learned during the training process, and why after clean training, these provably well-generalizing features are still provably non-robust. Our main conclusion is that during the clean training process using (stochastic) gradient descent, neural network will accumulate, in all features, some “dense mixture directions” that have low correlations with any natural input, but are extremely vulnerable to (dense) adversarial perturbations. During adversarial training, such dense mixtures are purified to make the model more robust. Our results suggest that the non-robustness of clean training is mainly due to two reasons: 1. the inductive bias of (stochastic) gradient descent, and 2. the “sparse coding” structure of the data. Both reasons are necessary in some sense. First, a robust model is also a global minimizer of the clean training objective (at least in our setting); but even with proper regularization and infinite training examples to avoid over-fitting, gradient descent still has inductive bias towards finding a non-robust model. Second, it is easy to come up with data sets— such as linear-classifier labels over well-conditioned mixture-of-Gaussians like inputs— where clean training using gradient descent directly achieves the best robust accuracy. Thus, to understand the non-robustness of neural networks, we more or less have to take into account the gradient descent algorithm and the structure of the inputs. Indeed, our step is still very provisional. We immediately see a plethora of extensions from our work. First of all, natural images have much richer structures than sparsity; hence, those “non-robust mixtures” accumulated by clean training might also carry structural properties other than density. Moreover, we would like to extend our work to the clean and robust training of multi-layer neural networks, possibly with hierarchical feature purification processes. (Our experi- ments in Figure 10 have confirmed on such hierarchical feature purification phenomenon.) Indeed, understanding the whole picture of adversarial examples and adversarial training might require a complete understanding of deep learning.

8 Experiment Details

We perform experiments using three standard architectures, AlexNet, ResNet-16, and ResNet-34 with basic blocks, and tested on the CIFAR-10 dataset.13 We discover that learning rate 0.1 for good for ResNet and 0.02 is good for AlexNet; while weight decay 0.0001 is good for ResNet and 0.0005 is good for AlexNet (this was also recommended by the git repo authors). We use standard SGD with 0.9 momentum as the training algorithm. During adversarial training, we have implemented:

• The empirical `2 perturbation algorithm (i.e., attacker) suggested by [82]. We choose two sets 14 of parameters `2(1, 0.25) and `2(0.5, 0.12).

13We used the implementations from https://github.com/bearpaw/pytorch-classification. We used their default random crop and random flip as data augmentation. 14 We use their SmoothAdvPGD attacker with following parameters. We use σ = 0.25 which is the random Gaussian perturbation added to the input; use mtrain = 2 which is the number of Gaussian noise samples used per training sample, use TPGD = 4 which is the number of PGD attack steps, and use ε = 1 which is the `2 radius for the PGD attacker. We also follow their instruction to perform 10 warmup epochs to gradually increase ε from zero to ε = 1. We call this `2(1, 0.25). We have also implemented `2(0.5, 0.12).

20 • The empirical `∞ perturbation algorithm (i.e., attacker) [65], with `∞ radius 4/255 and 8/255, together with 7 steps of PGD attack. We call them `∞(4/255) and `∞(8/255) respectively.

We mostly focus on the `2(1, 0.25) attacker in this paper, but shall compare them in Section 8.4. Remark 8.1. In Table 1, we present robust/clean accuracies against such attackers after vanilla clean training / vanilla adversarial training. We emphasize here that, in practice, one can also first perform clean training and then apply adversarial traing using the clean-trained weights as initialization (like we have theoretically studied in this paper). This does not affect the overall performance of both robust and clean accuracies.

attacker 퓵∞(ퟒ/ퟐퟓퟓ) 퓵∞(ퟖ/ퟐퟓퟓ) 퓵ퟐ(ퟎ. ퟓ, ퟎ. ퟏퟐ) 퓵ퟐ(ퟏ, ퟎ. ퟐퟓ) clean training 0.0% (93.9%) 0.0% (93.9%) 0.0% (93.9%) 0.0% (93.9%) adversarial training 61.4% (82.3%) 45.7% (71.6%) 59.8% (78.3%) 42.3% (64.3%) (init. with random weights) adversarial training 61.9% (83.9%) 46.0% (73.7%) 60.0% (79.2%) 42.2% (65.5%) (init. with clean-trained weights)

Table 1: CIFAR-10 robust accuracy % (clean accuracy %) against different attackers using ResNet-34

8.1 Feature Visualization of Deeper Layers

(a) AlexNet feature visualization, layers 2 + 4, clean (left) vs. robust (right)

hierarchical feature purification (cascading effect)

robust / clean accuracy = 0.0% / 93.9% robust / clean accuracy = 42.8% / 66.3%

(b) ResNet-34 feature visualization, layers 15 + 25 + 29, clean (left) vs. robust (right)

Figure 10: Visualization of deep features on cleanly-trained vs. robustly-trained models. Take-away message: features from robustly-trained models are more “pure” and closer to the the real image space. We hope that our work can be extended to a “hierarchical feature purification” for multi- layer neural networks, using the recent advance in the theory of training deep neural networks efficiently and beyond NTKs [3, 5]

Visualizing the first layer of any trained architecture is trivial: for instance, for AlexNet, the weight tensor of the first layer is 3 × 11 × 11 which gives the RGB color of 11 × 11 patches (and this

21 was precisely what we presented in Figure 1). However, such visualization can be less meaningful for ResNet because the tensors are of dimension 3 × 3 × 3. Visualizing the features presented by deeper convolutional layers is an active research area, dating back at least to [31]. Perhaps the most naive approach is to start from a randomly initialize image (of size 3 × 32 × 32), then take a specific neuron n at some layer, and repeatedly take its gradient with respect to the image. If we keep adding this gradient to the input image, then ideally this gives us the image which “excites” n the most. Unfortunately, it is a common knowledge in this area that this naive approach does not lead to “visually meaningful” images as we go (even slightly) deeper into a network (see e.g. the left column of Figure 10). In existing literature, researchers have tried to various ways to resolve this issue (see e.g. an extensive survey by [76] and the references therein). At a high level, some penalizes the image to remove high-frequency noise [66, 70, 72, 79, 99]; some searches for images that can still excite the given neuron after jittering [70, 71, 79, 99]; and some searches only in the space of “real data” by building a model (e.g. using GAN) to capture the prior [70, 73, 74]. We observe that, if the model is robustly trained, then one can directly apply the naive approach to visualize features of the deep layers, and the resulting images can be “visually very meaningful.” See Figure 10.15 Our theory in fact explains this phenomenon: the dense mixtures accumulated during clean training are extremely harmful to the visualization effect, since they are “visually meaningless.” After robust training, such dense mixtures are removed so the visualization starts to align better with human concepts. Throughout this paper we stick to this naive approach for visualizing features of deep layers.16

8.2 Feature Purification at Deeper Layers Since our theory shows the feature purification principle of a single layer, we perform the following experiment to verify it in practice, to study the effect of feature (local) purification in each layer individually. We consider the (pre-activation) ResNet-34 [109] architecture which has 31 convolutional layers. We select some convolutional layer `, and17 • (stage 1) perform T epochs of adversarial training; • freeze the weights of layers 1, 2, . . . , ` − 1 and re-randomize weights of layers `, ` + 1,... ; • (stage 2) perform T epochs of clean training (by training weights of layers `, ` + 1,... ); • (stage 3) perform T epochs of adversarial training (by training weights of layers `, ` + 1,... ). Then, we visualize the features on layer ` • at the end of epoch T (indicating layer ` is random), • at the end of epoch 2T (indicating layer ` is clean trained), and • at the end of epoch 3T (indicating layer ` is adversarially trained).

15This should not be surprising given that the “jittering” technique is known to work in practice on visualizing clean models. 16Specifically, starting from a random input image, we take 2000 gradient steps to update the image so that the given neuron at a specific layer is excited the most. We added a weight decay factor to incentivize the image to go to RGB (128,128,128) — except in Figure 6 we incentivize the image to go to RGB (0,0,0). 17We choose T = 50 with initial learning rate 0.1, and decay it to 0.01 at the end of the 40-th, 90-th and 140-th epoch. Note that since we have weight decay, if we run stage 3 indefinitely, then the individual neurons on the `-th layer may change too much (because two neurons can even swap their positions). Therefore, to illustrate our theoretical finding, we stop stage 3 as long as the robust (and clean) accuracy matches that of stage 1. This usually requires less than T epochs.

22 global feature learning clean training adversarial training ( forward feature learning feature (local) purification

+ backward feature correction)

layer 9 layer

layer 13 layer

layer 17 layer

layer 19 layer layer 21 layer

clean acc = 0% robust acc = 0%, clean acc > 75% robust acc ≥ 42.3%

layer 23 layer

layer 25 layer

layer 27 layer layer 29 layer randomize ≥ ℓ clean train ≥ ℓ adversarial train ≥ ℓ

Figure 11: Visualization of the `-th layer features from ResNet-34 for ` ∈ {9, 13, 17, 19, 21, 23, 25, 27, 29}. For each case of `, the layers ≤ ` − 1 are frozen at some pre-trained robust weights, and only layers ≥ ` are trained. Left column refers to layers ≥ ` are randomly initialized, “clean” refers to layers ≥ ` are cleanly trained, and “robust” refers to layers ≥ ` are adversarially trained. Take-away message: feature purification happens even at deep layers of a neural network.

We present our findings in Figure 11, and report the correlation between neurons in Figure 2, for the `2(1, 0.25) adversarial attacker. We also point out that with this training schedule, even when the first (` − 1)-layers are fixed to “robust features” and only the `, ` + 1, ··· layers are trained, after clean training, the robust accuracy is still 0%.

8.3 Sparse Reconstruction of Input Data and of Adversarial Perturbation Recall in Figure 4, we have shown that the input images can be sparsely reconstructed from the robust features. To better quantify this observation, we compare how sparse the input images can be reconstructed from (1) random features, (2) clean features, and (3) robust features. For each of the tasks, we use Lasso to reconstruct the 100 images, and sweep over all possible weights of

23 20 20 15

15 15 10 10 10

5

sparsity % sparsity % sparsity sparsity % sparsity 5 5

0 0 0 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 regression error regression error regression error robust_feature fit data clean_feature fit data robust_feature fit data clean_feature fit data robust_feature fit data clean_feature fit data rand_feature fit data rand_feature fit data rand_feature fit data (a) AlexNet, fit input images (b) ResNet-16, fit input images (c) ResNet-34, fit input images

30 14 14 25 12 12 20 10 10 15 8 8 6 6

10

sparsity % sparsity % sparsity sparsity % sparsity 4 4 5 2 2 0 0 0 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 0.00125 0.0025 0.005 0.01 0.02 0.04 0.08 regression error regression error regression error

robust_feature fit robust_delta robust_feature fit clean_delta robust_feature fit robust_delta robust_feature fit clean_delta robust_feature fit robust_delta robust_feature fit clean_delta

(d) AlexNet, fit adv. perturbations (e) ResNet-16, fit adv. perturbations (f) ResNet-34, fit adv. perturbations

Figure 12: Sparse reconstruction of input mages and of adversarial perturbations. Take-away message for the first row: robust features can be used to reconstruct input images with better sparsity, suggesting that robust features are more pure. Take-away message for the second row: adversarial perturbations from a clean model are more “dense” comparing to those from a robust model (and in fact robust model’s adversarial perturbations are (much) closer to real input images, see Figure 7).

18 the `1 regularizer (which controls how sparse the reconstruction is). The results are presented in the first row of Figure 12. As one can see, using clean features one can also sparsely reconstruct the input, but using robust features the reconstruction can be even sparser. This, to some extent, supports our theory that robust features are more “pure” than clean features. Perhaps more importantly, our theory suggests that for clean-trained models, adversarial per- turbations (we refer to as clean delta) have “dense mixtures”; while for robust-trained models, adversarial perturbations (we refer to as robust delta) are “more pure.” This was visually illus- trated in Figure 7. Now, to better quantify this observation, we compare how sparse clean delta and robust delta can be reconstructed from robust features. See the second row of Figure 12.19 From this experiment, we confirm that in practice, adversarial perturbations on robust models are more “pure” and closer to real input images. Remark 8.2. We point out when comparing how sparse clean delta and robust delta can be recon- structed from robust features, we did not cheat. For instance, in principle clean delta may not lie in the span of robust features and if so, it cannot be (sparsely) reconstructed from them. In our experiments (namely, the second row of Figure 12), we noticed that clean delta almost lies in the span of robust features (with regression error < 0.00005 for AlexNet and < 10−9 for ResNet).

18 2 Recall the Lasso objective is miny kW y − xk2 + λkyk1, where it uses W y to reconstruct given input x, and λ is the weight of the regularizer to control how sparse y is. The convolutional version of Lasso is analogous: the matrix W becomes the “transpose” of the weight of the convolutional layer, which is for instance implemented as nn.ConvTranspose2d in PyTorch. In our implementation, we have shifted each input image so that it has zero mean in each of the three color channels. We have selected the first 100 images where the (trained) robust classifier gives correct labels; the plots are similar if one simply selects the first 100 training images. 19In fact, we have also re-scaled the perturbations so that they have similar mean and standard deviations comparing to real input images. This allows one to also compare the two rows of Figure 12.

24 8.4 Comparing Different Attackers We also demonstrate in Figure 13 that feature purification occurs against several different attackers.

clean training adversarial training global feature learning feature (local) purification

robust acc = 45.8%

robust acc = 61.9%

clean acc = 0% robust acc = 0%, clean acc > 77%

robust acc = 42.7%

robust acc = 59.9%

randomize ≥ ℓ clean train ≥ ℓ adversarial train ≥ ℓ

Figure 13: Visualization of the 27-th convolutional layer of ResNet-34 against different attackers. Take-away message: feature purification happens against different attackers, and stronger attacker gives stronger effect of feature purification.

25 Appendix: Complete Proofs

We give a quick overview of the structure of our appendix sections. In Section A, we warm up the readers by calculating the gradient of the objective, and demon- strating that polynomially many samples are sufficient for the training. (t) (t) In Section B, we formally introduce Sj,pot, the set of “potentially lucky neurons” and Sj,sure, the set of “surely lucky neurons” at iteration t. In particular, we shall emphasize on how those notions evolve as t increases. In Section C, we formally prove how “lucky neurons” continue to be lucky, and more impor- tantly, for every neuron i that is lucky in direction j, why it grows faster than other unlucky directions j0, and how much faster. Specifically, Theorem C.1 corresponds to the initial “lottery- winning” phase where the accuracy remains around 50%; and Theorem C.2 corresponds to the later phase where large signals become even larger and eventually most neurons become “pure + dense mix” of the form (6.1). This is the most difficult section of this paper. In Section D, we prove that why clean training gives good clean (testing) accuracy. It is based on the structural theorem given by Theorem C.2, and requires some non-trivial manipulations of prob- ability theory results (such as introducing a high-probability, Bernstein form of the McDiarmid’s inequality). In Section E, we prove that why the model obtained from clean training is non-robust. It formally shows how the “dense mixtures” become accumulated step by step during clean training. In Section F, we prove our theorems for both `2 and `∞ adversarial training. In particular, in this section we demonstrate why practical perturbation algorithms, such as the fast gradient method (FGM), can help the (adversarial) training process “kill” those “dense mixtures.” In Section G, we prove lower bounds for the neural tangent kernel model given by two-layer networks. In Section H, we give missing details of some probability theory lemmas.

A Notations and Warmups

We find it perhaps a good exercise to do some simple calculations to warmup the readers with our notations, before going into the proofs. Global Assumptions. Throughout the proof, 1+c • We choose m = d 0 for a very small constant c0 ∈ (0, 1).

(One should think of c0 = 0.0001 for a simple reading. Our proof generalizers to larger m = poly(d) since having more neurons does not hurt performance, but we ignore the analysis so as to provide the simplest notations.) • We assume k < d(1−c0)/2. log log log d • We choose λ = d for simplicity. (The purpose of log log log d factor is to simplify notations, and it can be tightened to constant.) • Whenever we write “for random x”, “for random z” or “for random ξ”, we mean that the come from the distributions introduced in Section 2 with x = Mz + ξ.

26 (t) (t) d Network Gradient. In every iteration t, the weights of the neurons are w1 , . . . , wm ∈ R . d Recall the output of the neural network on input x ∈ R is m (t) X (t) (t) (t) (t) ft(w ; x, ρ) = ReLU(hwi , xi + ρi − b ) − ReLU(−hwi , xi + ρi − b ) . i=1 (t) 0 (t) def d s e−yft(w ;x,ρ) Fact A.1. If we denote by `t(w ; x, y, ρ) = [log(1 + e )] |s=−yf (w(t);x,ρ)= (t) , then ds t 1+e−yft(w ;x,ρ) (t) 0 (t) (t) ∇wi Losst(w ; x, y, ρ) = −y`t(w ; x, y, ρ)∇wi ft(w ; x, ρ)   0 (t) 1 1 = −y`t(w ; x, y, ρ) (t) (t) + (t) (t) · x hwi ,xi+ρi≥b −hwi ,xi+ρi≥b Let us also note that

∇Reg(wi) = (kwik2 + 1) · wi Lemma A.2. Suppose Z = {x(1), . . . , x(N)} are i.i.d. samples from D and y(i) = y(x(i)), and (t) suppose N ≥ poly(d) for some sufficiently large polynomial. Let f = ft and suppose b ≤ poly(d) 1 and σρ ≥ poly(d) . Then, for every w1, . . . , wN that may depend on the randomness of Z and satisfies kwik ≤ poly(d), it satisfies

1 X  (i) (i)    1 E Loss(w; x , y , ρ) − E Loss(w; x, y, ρ) ≤ N ρ x∼D,y=y(x),ρ poly(d) i∈[N]

1 X  (i) (i)    1 E ∇wLoss(w; x , y , ρ) − E ∇wLoss(w; x, y, ρ) ≤ N ρ x∼D,y=y(x),ρ poly(d) i∈[N] F (i) 2 In addition, suppose for every i ∈ [N], we have an i.i.d. random sample ρ ∼ N (0, σρI) that is 2 independent of Z and w. Then, with probability at least 1 − e−Ω(log d) over ρ, we have

1 X  (i) (i)  1 X (i) (i) (i)  1 E Loss(w; x , y , ρ) − Loss(w; x , y , ρ ) ≤ N ρ N poly(d) i∈[N] i∈[N]

1 X  (i) (i)  1 X (i) (i) (i)  1 E ∇wLoss(w; x , y , ρ) − ∇wLoss(w; x , y , ρ ) ≤ N ρ N poly(d) i∈[N] i∈[N] F Proof. The proof of the first part can be done by trivial VC dimension or Rademacher complexity arguments. For instance, the function Eρ[∇Loss(w; x, y, ρ)] is Lipschitz continuous in w with Lipschitz parameter at most poly(d) (note that this relies on the fact that we take an expectation in ρ), and thus one can take an epsilon-net over all possible choices of w, and then apply a union bound over them. The proof of the second part can be done by trivial Hoeffding bounds. 

B Neuron Structure and Initialization Properties

1+c We consider m = d 0 for a very small constant c0 ∈ (0, 1), and consider constants c1 > c2 to be chosen shortly. Let us define a few notations to characterize each neuron’s behavior. (t) Definition B.1 (neuron characterization). Recall wi is the weight for the i-th neuron at iteration (t) t. We shall choose a parameter σw at each iteration t ≥ 0 and define the following notions. Consider any dimension j ∈ [d].

27 (t) 1. Let Sj,sure ⊆ [m] be those neurons i ∈ [m] satisfying (t) 2 (t) 2 • hwi , Mji ≥ (c1 + c2)(σw ) log d, (t) 2 (t) 2 0 • hwi , Mj0 i < (c1 − c2)(σw ) log d for every j 6= j, (t) ? • sign(hwi , Mji) = sign(wj ). (t) 2. Let Sj,pot ⊆ [m] be those neurons i ∈ [m] satisfying (t) 2 (t) 2 • hwi , Mji ≥ (c1 − c2)(σw ) log d (t) 3. Let Sept ⊆ [m] be the set of neurons i ∈ [m] satisfying (t) 2 (t) 2 • kwi k2 ≤ 2(σw ) d (t) 2 (t) 2 • hw , Mji ≥ (c1 − c2)(σw ) log d for at most O(1) many j ∈ [d]. i √ (t) 2 (t) 2√ − log d • hwi , Mji ≥ 2(σw ) log d for at most 2 d many j ∈ [d]. (t) (t) σw d • |hwi , Mji| ≤ log d for at least Ω( log d ) many j ∈ [d].

(0) 2 (0) Lemma B.2 (geometry at initialization). Suppose each wi ∼ N (0, σ0I) and suppose σw = σ0. For every constants c0 ∈ (0, 1) and γ ∈ (0, 0.1), by choosing c1 = 2 + 2(1 − γ)c0 and c2 = γc0, we have with probability ≥ 1 − o(1/d3) over the random initialization, for all j ∈ [d]: (0)  γ  (0) (0) 4 c0 2γc0  |Sj,sure| = Ω d =: Ξ1 |Sj,pot| ≤ O d =: Ξ2 Sept = [m]

Remark. In the rest of the paper, we shall assume Lemma B.2 holds in all upper-bound related 100 theorems/lemmas, and for simplicity, we assume γ > 0 is some small constant so that (Ξ2) ≤ c d 0 . The notations Ξ1 and Ξ2 shall be used throughout the paper.

Definition B.3 (neuron characterization, continued). Recall b(t) is the bias at iteration t, and let us introduce more notions. (t) 1. Let Sept+ ⊆ [m] be the set of neurons i ∈ [m] satisfying

(t) 2 kw(t)k2 ≤ (σw ) d , • i 2 log2 d (t) hw(t), M i ≥ σw for at most O(1) many j ∈ [d]. • i j log d (t) 2. Let Sept++ ⊆ [m] be the set of neurons i ∈ [m] satisfying (t) (t) (σ )2 def 2 w √ 1 • kwi k2 ≤ β2 for β = 10 . kΞ2 (t) 3. Let Sj,pot+ ⊆ [m] be the set of neurons i ∈ [m] satisfying (t) k (t) • |hwi , Mji| ≥ dβ b . (t) 4. Let Sj,sure+ ⊆ [m] be the set of neurons i ∈ [m] satisfying (t) 2 (t) 2 • hwi , Mji ≥ 4k(b ) , (t) ? • sign(hwi , Mji) = sign(wj ).

(t) (t) (t) (t) Note that we do not have good properties on Sept+, Sept++, Sj,pot+ or Sj,sure+ at initialization t = 0; however, they will gradually begin to satisfy certain properties as the training process goes. See Section C for details.

28 B.1 Proof of Lemma B.2 Proof of Lemma B.2. Recall if g is standard Gaussian, then for every t > 0,

1 t 2 1 1 2 √ e−t /2 < Pr [g > t] < √ e−t /2 2π t2 + 1 g∼N (0,1) 2π t Therefore, for every i ∈ [m] and j ∈ [d], (0) p = Pr[hw , M i2 ≥ (c + c )σ2 log d] = Θ( 1 ) · 1 = Θ( √ 1 ) · 1 • 1 i j 1 2 0 log d d(c1+c2)/2 log d d·d(1−γ/2)c0 (0) p = Pr[hw , M i2 ≥ (c − c )σ2 log d] = Θ( 1 ) · 1 = Θ( √ 1 ) · 1 • 2 i j 1 2 0 log d d(c1−c2)/2 log d d·d(1−3γ/2)c0

(0) d−1 1. We first lower bound |Sj,sure|. For every i ∈ [m], with probability at least p1/2 · (1 − p2) ≥ γ c0 Ω( √ 1 ) · d 2 it satisfies log d m (0) 2 2 (0) ? hwi , Mji ≥ (c1 + c2)σ0 log d, sign(hwi , Mji)sign(wj ) ≥ 0 0 (0) 2 2 ∀j 6= j, hwi , Mj0 i ≤ (c1 − c2)σ0 log d By concentration with respect to all m choices of i ∈ [m], we know with probability at least (0)  γ  1 4 c0 1 − o( d3 ) it satisfies |Sj,sure| = Ω d .

3γ (0) c0 2. We next upper bound |S |. For every i ∈ [m], with probability at most p < O( √ 1 )· d 2 j,pot 2 log d m it satisfies (0) 2 2 hwi , Mji ≥ (c1 − c2)σ0 log d 1 By concentration with respect to all m choices of i, we know with probability at least 1−o( d3 ) (0) 2γc0 it satisfies |Sj,pot| = O(d ). (0) 3. As for Sept, we first note that for every i ∈ [m], by chi-square distribution’s tail bound, with h 2 i 3 (0) 2 σ0 d 2 probability at least 1 − o(1/d ) it satisfies kwi k2 ∈ 2 , 2σ0d .

For every i ∈ [m], the probability of existing q = 20/c0 different (0) 2 2 j1, ··· , jq ∈ [d]: s.t.∀r ∈ [q]: hwi , Mjr i ≥ (c1 − c2)σ0 log d c0 q q −q· 2 1 is at most d · (p2) ≤ d ≤ d4m . Union bounding over all possible i ∈ [m] gives the proof that, with probability at least 1 − 1/d4, for all but at most q = O(1) values of j ∈ [d], it (0) 2 (t) 2 satisfies hwi , Mji < (c1 − c2)(σw ) log d. √ For every i ∈ [m] and j ∈ [d], with probability at least 1 − e− log d it satisfies hw(0), M i2 ≤ √ i j 2√ 3 − log d 2σ0 log d. Therefore, with probability at least 1 − o(1/d ), there are ≥ d(1 − 2 ) indices (0) 2 2√ j ∈ [d] satisfying hwi , Mji ≤ 2σ0 log d. (0) For every i ∈ [m] and j ∈ [d], with probability at least 1√ it satisfies |hw , M i| ≤ 40000 log d i j σ√0 . Therefore, with probability at least 1−o(1/d3), there are 1√ indices j ∈ [d] 10000 log d 100000 log d (0) satisfying |hw , M i| ≤ σ√0 . i j 10000 log d 

29 C Neuron Structure Change During Training

For analysis purpose, we consider two phases during training. In Phase I, the neurons have moved so little so that the accuracy remains 50% for binary classification; however, some neurons shall start to win lottery and form “singleton” structures. We summarize this as the following theorem.

Theorem C.1 (phase I). Suppose the high-probability initialization event in Lemma B.2 holds. 1 −Ω(log2 d) Suppose η, σ0 ∈ (0, poly(d) ) and N ≥ poly(d). With probability at least 1 − e , the following  2  def d σ0 holds for all t ≤ Tb = Θ kη

(0) (t) 1. Sj,sure ⊆ Sj,sure for every j ∈ [d]. (0) (t) 2. Sj,pot ⊇ Sj,pot for every j ∈ [d]. (t) 3. Sept = [m] 2.5 (t) def  dσ0 log d  4. Sept+ = [m] for every t ≥ Ta = Θ η . (t) (0) (t) 5. Sept++ = [m] and Sj,pot ⊇ Sj,pot+ for every j ∈ [d] at this iteration t = Tb. (t) (t) (Recall according to Definition B.1 we have Sj,sure ⊆ Sj,pot.) In Phase II, the neurons start to move much more so that the network output becomes more meaningful; in phase II, the “singleton” neurons become even more singleton.

Theorem C.2 (phase II). In the same setting as Theorem C.1, with probability at least 1 − −Ω(log2 d)  O(log d)  e , the following holds for all t ∈ Tb, d /η . (t) (t) 1. Sept++ = Sept+ = [m]. (0) (t) 2. Sj,sure ⊆ Sj,sure for every j ∈ [d]. (0) (t) (t) 3. Sj,pot ⊇ Sj,pot+ ⊇ Sj,pot for every j ∈ [d].   4. S(0) ⊆ S(t) for every j ∈ [d], as long as t ≥ T def= Θ d . j,sure j,sure+ e ηΞ2 log d (t) (t) (Recall according to Definition B.3 we have Sj,sure+ ⊆ Sj,pot.) Remark C.3. Theorem C.2 immediately implies the first claim of Theorem 5.1 and the first claim of Theorem 5.2, after plugging in the definitions of those neuron structure sets introduced in Section B. For instance, we can write (t) X (t) X (t) wi = hwi , MjiMj + hwi , MjiMj (0) (0) j∈[d]: i∈Sj,pot j∈[d]: i6∈Sj,pot

We make several observations, when t ≥ Te: (t) (0) • Sept = [m] implies the cardinality of {j ∈ [d]: i ∈ Sj,pot} is ≤ O(1), so we can define it as Ni. 2 (0) (t) (t) k (t) kΞ2 • For every i 6∈ Sj,pot, it also satisfies i 6∈ Sj,pot+, so we have |hwi , Mji| < dβ b ≤ d (using (t) 2 b ≤ βΞ2 from Definition C.12). √ (t) (0) (t) ∗ (t) 1 (t) 2 • For every i ∈ Sj,sure+ ⊆ Sj,pot, we have hwi , Mji · sign(wi ) ≥ 2 kb > 8 (using b = βΞ2 Ξ2 from Definition C.12 and our choice of β).

30 They together imply the first claim of Theorem 5.2. One can similarly derive the first claim of Theorem 5.1.

C.1 Auxiliary Lemma 1: Geometry of Crossing Boundary We present a lemma to bound the size of the pre-activation signal.

Lemma C.4 (pre-activation signal size). For every t ≥ 0, every i ∈ [m], every λ ≥ 0, every j ∈ [d]: (t) (a) If i ∈ Sept then

  λ D (t) E2 (t) −Ω( ) 1/4 P 2 2 log1/4 d − log d Prz,ξ wi , j06=j Mj0 zj0 + ξ ≥ λ (σw ) ≤ e + e

(t) (b) If i ∈ Sept+ then  2  D (t) P E 2 (t) 2 −Ω(λ log d) −Ω(λ2 log d) k  Prz,ξ wi , j06=j Mj0 zj0 + ξ ≥ λ (σw ) ≤ e + e + O d

0 1 Proof of Lemma C.4a. Let E be the event where there exists j ∈ [d] with |z 0 | ≥ and j log2 d (t) 2 (t) 2√ 2 1  h 1 i  log4 d  hw , M 0 i ≥ 2(σ ) log d. Since [z ] = O , we know that Pr |z 0 | ≥ ≤ O . i j w E j0 d j log2 d d (t) By the definition of Sept and union bound, we know  4  √ log d 1/4 Pr[E] ≤ O × 2− log dd ≤ e− log d d

0 (t) 2 (t) 2 Let F be the event where there exists j ∈ [d] with zj0 6= 0 and hwi , Mj0 i ≥ Ω((σw ) log d). (t) Again, by the definition of Sept, we know that   k 1/4 Pr[F] ≤ O ≤ e− log d d Thus, when neither E or F happens, we have for every j0 ∈ [d]:   (t) 2 (t) 2p (t) 2 1 (t) 2p hw , Mj0 zj0 i ≤ min 2(σ ) log d · 1,O((σ ) log d) · ≤ 2(σ ) log d i w w log4 d w At the same time, we also have

X (t) 2 X 1 (t) 2 (t) 2 E hwi , Mj0 zj0 i ≤ O( )hwi , Mj0 i ≤ O((σw ) ) zj0 d j0∈[d] j0∈[d] Apply Bernstein concentration bound we complete the proof that

  λ D (t) E2 2 (t) −Ω( ) 1/4 P λ 2 log1/4 d − log d Prz,ξ wi , j06=j Mj0 zj0 ≥ 2 (σw ) ≤ e + e

(t) 2 2 (t) kwi k σx (t) 2 Finally, for the ξ part, let us recall hwi , ξi variable with variance at most O( d ) ≤ O((σw ) ) (t) and each |hw(t), M ihM , ξi| ≤ σw w.h.p. Using Bernstein concentration of random variables, we i j j log2 d finish the proof.  0 (t) Proof of Lemma C.4b. Let F be the event where there exists j ∈ [d] with zj0 6= 0 and |hwi , Mj0 i| ≥

31 (t) σw (t) Ω( log d ). By the definition of Sept+, we know that k  Pr[F] ≤ O d

(t) 0 (t) σw When F does not happen, we have for every j ∈ [d]: |hwi , Mj0 zj0 i| ≤ O( log d ) and at the same time (t) 2 X (t) 2 X 1 (t) 2 (σw ) E hwi , Mj0 zj0 i ≤ O( )hwi , Mj0 i ≤ O( ) zj0 d log d j0∈[d] j0∈[d] Apply Bernstein concentration bound we complete the proof that  2  D (t) P E λ2 (t) 2 −Ω(λ log d) −Ω(λ2 log d) k  Prz,ξ wi , j06=j Mj0 zj0 ≥ 2 (σw ) ≤ e + e + O d

(t) 2 2 (t) kwi k σx Finally, for the ξ part, let us recall hwi , ξi is a random variable with variance at most O( d ) ≤ (t) 2 O( (σw ) ). Using the Bernstein concentration bound, we finish the proof. log2 d 

C.2 Auxiliary Lemma 2: A Critical Lemma for Gradient Bound In this section we present a critical lemma that shall be used multiple times to bound the gradient in many of the following sections. Recall c2 ∈ (0, 0.1) is a constant from Lemma B.2. c1 p Lemma C.5 (critical). Let Y (z, S1): R × R → [−1, 1] be a center symmetric function, meaning p Y (z, S1) = Y (−z, −S1). Let S1 ∈ R and S2 ∈ R be random variables, where (S1,S2) is symmetri- cally distributed, meaning (S1,S2) distributes the same as (−S1, −S2). 2 For every α > 0, suppose ρ ∼ N (0, σρ), define quantity  1 1  ∆ := E Y (1,S1) α+S2+ρ≥b − Y (−1,S1) −α+S2+ρ≥b S1,S2,ρ 2 and define parameters V := E[(S2) ] and L := ES1 [|Y (1,S1) − Y (0,S1)|]. Then,  √  (a) it always satisfies |∆| ≤ O V + L σρ √ (b) if Y (z, S ) is a monotonically non-decreasing in z ∈ for every S ∈ p, then ∆ ≥ −Ω V  1 R 1 R σρ 0 00 0 00 0 0 00 00 Furthermore, suppose we can write S1 = (S1,S1 ) and S2 = S2 + S2 for (S1,S2) and (S1 ,S2 ) being 0 0 00 00 independent (although S1,S2 may be dependent, and S1 ,S2 may be dependent). Then, we have  2 2    (c) if α ≤ b(1 − c2 ), then |∆| ≤ e−Ω(b /σρ) + Γ min{1,O( α )} + L + Γ 2c1 σρ y y where the parameters h i h i Γ := Pr |S | ≥ c2 · b and Γ := Pr |S00| ≥ c2 · b • 2 10c1 y 2 10c1  0 00 0 00 L := max 0 00 [|Y (1,S ,S ) − Y (0,S ,S )|] ≤ 1 • y S1 ES1 1 1 1 1 1 Proof of Lemma C.5. We first focus on Y (1,S1) α+S2+ρ≥b, and write 1 1 1 Y (1,S1) α+S2+ρ≥b = Y (0,S1) α+S2+ρ≥b + Y (1,S1) − Y (0,S1) α+S2+ρ≥b (C.1) 1 1 = Y (0,S1) α+S2+ρ≥b ± Y (1,S1) − Y (0,S1) α+S2+ρ≥b (C.2)

32 1 Focusing on the term Y (0,S1) α+S2+ρ≥b, by the symmetric properties of Y and (S1,S2), we have 1 | [Y (0,S )1 ]| = | [Y (0,S )1 + Y (0, −S )1 ]| E 1 α+S2+ρ≥b 2 E 1 α+S2+ρ≥b 1 α−S2+ρ≥b 1 = | [Y (0,S )(1 − 1 )]| 2 E 1 α+S2+ρ≥b α−S2+ρ≥b 1 1 ≤ Pr[ α+S2+ρ≥b 6= α−S2+ρ≥b] ≤ Pr[ρ ∈ [b − α − S2, b − α + S2]] ! √ !  [|S |] p [S2] V = O E 2 = O E 2 = O (C.3) σρ σρ σρ

Putting (C.3) into (C.2), applying the bound L = ES1 [|Y (1,S1) − Y (0,S1)|], and repeating the 1 same analysis for Y (−1,S1) −α+S2+ρ≥b gives √ ! V |∆| ≤ O + L . σρ

This proves Lemma C.5a. When Y (z, S1) is a monotone non-decreasing in z ∈ R, we have 1 1 Y (1,S1) α+S2+ρ≥b ≥ Y (0,S1) α+S2+ρ≥b 1 1 Y (−1,S1) −α+S2+ρ≥b ≤ Y (0,S1) −α+S2+ρ≥b so we can go back to (C.1) (and repeating for Y (−1,S1)) to derive that √ ! V ∆ ≥ −Ω σρ This proves Lemma C.5b. Finally, when α ≤ b(1 − c2 ), we can bound ∆ differently 2c1 1 1 1 |∆| ≤ 2 Pr[ α+S2+ρ≥b 6= −α+S2+ρ≥b] + E [|Y (1,S1) − Y (−1,S1)| · α+S2+ρ≥b] 1 = 2 Pr[ρ ∈ [b − S2 − α, b − S2 + α]] + E [|Y (1,S1) − Y (−1,S1)| · α+S2+ρ≥b] (C.4) To bound the first term in (C.4) we consider two cases.:

2 2 when |S | ≤ b , we have Pr [ρ ∈ [b − S − α, b − S + α]] ≤ min{1, α }e−Ω(b /σρ); • 2 4 ρ 2 2 σρ when |S | ≥ b (happening w.p. ≤ Γ), we have Pr [ρ ∈ [b−S −α, b−S +α]] ≤ min{1,O α }. • 2 4 ρ 2 2 σρ Putting together, we know that α −Ω(b2/σ2)  Pr[ρ ∈ [b − S2 − α, b − S2 + α]] ≤ min{1, }· e ρ + Γ (C.5) σρ 0 00 0 00 To bound the second term in (C.4), first recall S1 = (S1,S1 ) and S2 = S2 + S2 , so we can write h 0 00 0 00 i |Y (1,S ,S ) − Y (−1,S ,S )| · 1 0 00 E 1 1 1 1 α+S2+S2 +ρ≥b h 0 00 0 00 i 1 c 1 c  ≤ E |Y (1,S1,S1 ) − Y (−1,S1,S1 )| · |α+S0 +ρ|≥(1− 2 )·b + |S00|≥ 2 ·b 2 10c1 2 10c1 h 0 00 0 00 i 1 c ≤ E |Y (1,S1,S1 ) − Y (−1,S1,S1 )| · |α+S0 +ρ|≥(1− 2 )·b + Γy (C.6) 2 10c1 00 To bound the first term in (C.6), we can take expectation over S1 and use the bound Ly to derive   h 0 00 0 00 i 0 c2 1 c E |Y (1,S1,S1 ) − Y (−1,S1,S1 )| · |α+S0 +ρ|≥(1− 2 )·b ≤ Ly Pr |α + S2 + ρ| ≥ (1 − ) · b 2 10c1 10c1

33 but since α ≤ (1 − c2 ) · b and S0 = S − S00, we can further bound 2c1 2 2 2         0 c2 00 c2 c2 c2 Pr |α + S2 + ρ| ≥ (1 − ) · b ≤ Pr |S2 | ≥ · b + Pr |S2| ≥ · b + Pr ρ ≥ · b 10c1 10c1 10c1 10c1 −Ω(b2/σ2) ≤ Γy + Γ + e ρ Putting these back to (C.6), we have h 0 00 0 00 i  −Ω(b2/σ2)  |Y (1,S ,S ) − Y (−1,S ,S )|1 0 00 ≤ e ρ + Γ + Γ L + Γ (C.7) E 1 1 1 1 α+S2+S2 +ρ≥b y y y Putting (C.5) and (C.7) back to (C.4), we conclude the when α ≤ b(1 − c2 ), we have 2c1    −Ω(b2/σ2)  α |∆| ≤ e ρ + Γ O( ) + Ly + Γy  σρ

C.3 Phase I: Winning lottery tickets near initialization Definition C.6. In Phase I, we have two sub-phases:

(t) √ (t)√ (t) (t) 3 • In Phase I.1, we pick b = c1σw log d and σρ = σw (log log log d)  2.5  (t+1) (t) Cη dσ0 log d We grow b = b + d for Ta = Θ η iterations. √ (t)√ (t) (t) (log log log d)3 In Phase I.2, we pick b(t) = c σ log d and σ = σ · √ • 1 w ρ w log d  2  (t+1) (t) Cη d σ0 We grow b = b + d for Tb = Θ kη iterations.

C.3.1 Activation Probability Recall x = P M z + ξ. Recall also c2 ∈ (0, 0.1) is a constant from Lemma B.2. j j j c1

Lemma C.7 (activation probability). We define Γt to be any value such that h i (t) P c2 (t) Pr w , 0 M 0 z 0 + ξ ≥ b ≤ Γ for every i ∈ [m] and j ∈ [d]; • x i j 6=j j j 10c1 t h i Pr w(t), x ≥ c2 b(t) ≤ Γ for every i ∈ [m] • x i 10c1 t h i Pr |ρ | ≥ c2 b(t) ≤ Γ for every i ∈ [m] • x i 10c1 t

We define Γt,y to be any value such that for every i ∈ [m] and j ∈ [d], there exists Λ ⊆ [d] \{j} with |Λ| ≥ Ω( √ d ) satisfying • log d h D E i (t) P c2 (t) Prx w , 0 Mj0 zj0 ≥ b ≤ Γt,y i j ∈Λ 10c1 Then,

(t) −Ω(log1/4 d) 1 • If we are in Phase I.1 and Sept = [m], then we can choose Γt = e and Γt,y = d10 . (t) k 1 • If we are in Phase I.2 and Sept+ = [m], then we can choose Γt = O( d ) and Γt,y = d10 .

(t) (t)√ Proof. Recall b = Θ(σw log d). Applying Lemma C.4a and Lemma C.4b we immediately have

h D E i 1/4 (t) (t) P c2 (t) −Ω(log d) • If S = [m], then Prx w , 0 Mj0 zj0 + ξ ≥ b ≤ e ; ept i j 6=j 10c1 h D E i (t) (t) P c2 (t) k  • If S = [m], then Prx w , 0 Mj0 zj0 + ξ ≥ b ≤ O . ept+ i j 6=j 10c1 d

34 P P Now, recall x = j0∈[d] Mj0 zj0 +ξ so it differs from j06=j Mj0 zj0 +ξ only by one term. Therefore, we h D E i (t) c2 (t) have the same bound on Prx w , x ≥ b by modifying the statements of Lemma C.4a i 10c1 and Lemma C.4b (without changing the proofs) to include this missing term.

h D E i 1/4 (t) (t) c2 (t) −Ω(log d) • If S = [m], Prx,ρ w , x ≥ b ≤ e ept i 10c1 h D E i (t) (t) c2 (t) k  • If S = [m], Prx,ρ w , x ≥ b ≤ O ept+ i 10c1 d

(t) 2 At the same time, using ρi ∼ N (0, (σρ ) ), we also have

(t) 3 h i 1/4 (t) σw (log log log d) (t) c2 (t) −Ω(log d) In Phase I.1, because σρ = Θ( √ )b , we have Pr |ρ | ≥ b  e • log d ρ i 10c1

(t) 3 h i (t) σw (log log log d) (t) c2 (t) k  In Phase I.2, because σρ = Θ( )b , we have Pr |ρ | ≥ b  O • log d ρ i 10c1 d

As for the bound on Γt,y, for every i ∈ [m], j ∈ [d], let Λ ⊆ [d] \{j} be the subset containing (t) 0 (t) def σw (t) all j ∈ [d] \{j} with |hwi , Mj0 i| ≤ q = log d . By the assumption Sept = [m], we know |Λ| ≥ Ω(d/ log d). (t) 2 Since |hwi , Mj0 i| ≤ q, E[zj ] = Θ(1/d) and |zj| ≤ 1, by Bernstein’s inequality, we have   * + (b(t))2 c −Ω( ) 1.5 (t) X 2 (t) q2|Λ|/d+q·b(t) −Ω(log d) Pr  wi , Mj0 zj0 ≥ b  ≤ e ≤ e .  x 10c1 j0∈Λ

C.3.2 Growth Lemmas (t) Our first lemma here shall be used to (lower) bound how hwi , Mji (i.e., the weight with respect to neuron i in direction Mj) grows for those i ∈ Sj,sure.

(t) Lemma C.8 (signal growth). Suppose we (1) either are in Phase I.1 with Sept = [m], (2) or are (t) (t) (t) in Phase I.2 with Sept+ = [m]. Then, for every j ∈ [d], every i ∈ Sj,sure, as long as |hwi , Mji| = O(b(t) log log log d), the following holds: h i   1 1 E y (t) (t) zj = Θ x,y,ρ hwi ,xi+ρi≥b d

(t) (t) ? Proof of Lemma C.8. Recall that i ∈ Sj,sure means sign(hwi , Mji) = sign(wj ). Without loss of (t) ? generality, let us assume sign(hwi , Mji) = sign(wj ) = 1. q First consider the case when |z | = 1. Since j ∈ S(t) , we have hw(t), M i ≥ b(t) 1 + c2 so j j,sure i j c1 applying Lemma C.7, (t) (t) Pr[hwi , xi + ρi ≥ b | zj = 1] ≥ 1 − 2Γt = 1 − o(1) (t) (t) Pr[hwi , xi + ρi ≥ b | zj = −1] ≤ 2Γt = o(1) ? Moreover, a simple calculation using |wj | = Θ(1) gives us Ex,y [y | zj = 1] = Θ(1) (can be proven by Lemma H.1) and therefore h i 1 E y (t) (t) zj |zj| = 1 = Θ(1) x,y,ρ hwi ,xi+ρi≥b

35 For all other non-zero value |z | = s > 0, we have s ≥ √1 and wish to apply Lemma C.5 to j k bound h i 1 ∆s := E y (t) (t) sign(zj) | |zj| = s x,y,ρ hwi ,xi+ρi≥b In Phase I.1, to apply Lemma C.5, we choose parameters as follows:

P ? D (t) P E (t) • Y = y, S1 = j06=j wj0 zj0 , S2 = wi , j06=j Mj0 zj0 + ξ , α = hwi , Mji · s > 0, ρ = ρi,

2 (t) 2 • V = E[S2 ] = O((σw ) ), L = Θ(s) (using Lemma H.1), Γ = Γt (using Lemma C.7), 00 00 D (t) P E • let Λ be the subset defined in Lemma C.7, then we can let S1 = (zj)j∈Λ and S2 = wi , j0∈Λ Mj0 zj0 1 • we have Γy = Γt,y = d10 (from Lemma C.7) and  ? Ly = max [|sign(w zj + S1) − sign(S1) 0 E 0 j zj0 for j ∈ [d] \{j}\ Λ zj0 for j ∈ Λ  00 0 ? 0 ? ≤ max Pr [S ∈ [−S − |w zj|, −S + |w zj|]] 0 0 1 1 j 1 j zj0 for j ∈ [d] \{j}\ Λ zj0 for j ∈ Λ x s 1 1 ≤ O( + ) ≤ O((s + √ )plog d) ≤ O(s · plog d) p|Λ|/d p|Λ|k/d k d where inequality x uses Lemma H.1a and |Λ| ≥ Ω( log d ) from Lemma C.7. Hence, invoking Lemma C.5, we have (t) (t)   σw O(1) σw O(1) 1 • ∆s ≥ − (t) ≥ − (log log log d)3 and |∆s| ≤ (t) + s ≤ (log log log d)3 + s when s = Ω log log log d σρ σρ  2 2    1/4   |∆ | ≤ e−Ω(b /σρ) + Γ O( α ) + L +Γ ≤ e−Ω(log d) ·s when s = O 1 (which • s t σρ y t,y log log log d b(t) implies α < 4 ) 2 Notice that E[zj ] = O (1/d), which implies that   1  (log log log d)2  Pr |z | = Ω = O j log log log d d This together gives us the bound that h i 1 1 1 −O( ) ≤ E y (t) (t) zj | |zj| < 1 ≤ O( ) d · log log log d x,y,ρ hwi ,xi+ρi≥b d In Phase I.2, the analysis is similar with different parameters: in particular,

(t) 2 2 (σw ) • V = E[S2 ] = O( log d ) which is tighter, Therefore, we have

(t) (t) σw O(1) σw O(1) • ∆s ≥ − (t)√ ≥ − (log log log d)3 and |∆s| ≤ (t)√ + s ≤ (log log log d)3 + s when s = σρ log d σρ log d  1  Ω log log log d

 2 2      |∆ | ≤ e−Ω(b /σρ) + Γ O( α ) + L + Γ ≤ O( ks log d ) when s = O 1 (which • s t σρ y t,y d log log log d b(t) implies α < 4 ). (This uses Γ = O(k/d) and α ≤ o(s log d).) t σρ

(1−c )/2 Taking expectation over zj as before, and using k < d 0 finishes the proof. 

36 (t) Our next lemma shall be used to upper bound how hwi , Mji can grown for every i ∈ [m].

(t) Lemma C.9 (maximum growth). Suppose we (1) either are in Phase I.1 with Sept = [m], (2) or (t) are in Phase I.2 with Sept+ = [m]. Then, for every j ∈ [d], every i ∈ [m], the following holds: h i   1 1 | E y (t) (t) zj | = O . x,y,ρ hwi ,xi+ρi≥b d Proof. Proof is analogous to that of Lemma C.8, and the reason we no longer need the requirement (t) (t) |hwi , Mji| = O(b log log log d) is because, when invoking Lemma C.5, it suffices for us to apply 1  Lemma C.5a for every non-zero values of z (as opposed to only those z = Ω log log log d ) which no longer requires α ≤ b.  (t) (t) Our next lemma shall be used to upper bound how hwi , Mji can grown for every i ∈ [m]\Sj,pot.

(t) Lemma C.10 (non-signal growth). Suppose we (1) either are in Phase I.1 with Sept = [m], (2) (t) (t) or are in Phase I.2 with Sept+ = [m]. Then, for every j ∈ [d], every i ∈ [m] \Sj,pot, the following holds:

h i Γt · log d y1 (t) z = O( ) E hw ,xi+ρ ≥b(t) j x,y,ρ i i d where Γt is given from Lemma C.7.

(t) Proof of Lemma C.10. Suppose |zj| = s and without loss of generality hw , Mji ≥ 0. We choose √ √ i (t) (t) (t)q c2 (t) c2 α = hw , M i · s as before. Then, we have α ≤ c − c σw ) log d = b 1 − ≤ b (1 − ) i j 1 2 c1 2c1 always holds. Therefore, using the same notation as the proof of Lemma C.8, we always have the bound    −Ω(b2/σ2)  α |∆s| ≤ e ρ + Γt O( ) + Ly + Γt,y σρ Plugging in the parameters we finish the proof.  (t) Our final lemma shall be used to upper bound how hwi , Mji can grown with respect to the noise ξ in the input.

Lemma C.11 (noise growth). For every i ∈ [m], every j ∈ [d], the following holds: ! h i 2 1 σx (t) E y (t) (t) hξ, Mji = O Γt |hw , Mji| x,y,ρ hw ,xi+ρi≥b (t) i i dσρ

Proof of Lemma C.11. We can define α = |hMj, ξii| and study h i 1 ∆s := E y (t) (t) sign(hξ, Mji) |hξ, Mji| = α (C.8) x,y,ρ hwi ,xi+ρi≥b

2 (t) 2 This time we have L = Ly = 0, Γy = 0, V = E[S2 ] = O((σw ) ), so applying Lemma C.5 we have,

|hw(t),M i|hM ,ξi|  2 2  (t) i j j −Ω(b /σρ) • when α ≤ b /4, |∆s| ≤ (t) · e + Γt ; kMj k2σρ (t) when α > b(t)/4 (which happens with exponentially small prob.), |∆ | ≤ O( σw ). • s σρ

37 p 2 Together, using the fact that E[|∆s|] ≤ E[∆s] we have: ! h i 2 1 σx (t) E y (t) (t) hξ, Mji = O Γt |hw , Mji| x,y,ρ hw ,xi+ρi≥b (t) i i dσρ 

C.3.3 Proof of Theorem C.1 Suppose in Lemma C.8 the hidden constant is 20C for the lower bound, that is, h i 1 20C E y (t) (t) zj ≥ x,y,ρ hwi ,xi+ρi≥b d Proof of Theorem C.1. Let us prove by induction with respect to t. Suppose the properties all hold at t = 0. Recall from Fact A.1, for iteration t, for every neuron i ∈ [m],   (t) 0 (t) 1 1 ∇wi Losst(w ; x, y, ρ) = −y`t(w ; x, y, ρ) (t) (t) + (t) (t) · x . hwi ,xi+ρi≥b −hwi ,xi+ρi≥b (t) (t) 1 0 (t) Using the bound on kwi k2 (from Sept = [m]) and the fact σ0 ≤ poly(d) , we know `t(w ; x, y, ρ) = 1 1 2 ± poly(d) . Also, recall also from Lemma A.2 that

 (t)  (t) 1 E ∇wi Losst(w ; x, y, ρ) = ∇wi Loss] t(w ) ± . x∼D,y=y(x),ρ poly(d) Together, we have a clean formulation for our gradient update rule: h   i (t+1) (t) (t) 1 1 η wi = wi (1 − ηλ − ηλkwi k2) + E y (t) (t) + (t) (t) · x ± . x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b poly(d) and as a result for every j ∈ [d] η hw(t+1), M i = hw(t), M i(1 − ηλ − ηλkw(t)k ) ± i j i j i 2 poly(d) h   i 1 1  + E y (t) (t) + (t) (t) · zj + hξ, Mji . (C.9) x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b We now prove each statement separately (and note our proofs apply both to Phase I.1 and I.2). (t) 1. For every i 6∈ Sj,pot, by substituting Lemma C.10 and Lemma C.11 into (C.9), we have

1/4 η · e−Ω(log d) √ hw(t+1) − w(t), M i ≤  c − c (σ(t+1) − σ(t))plog d (C.10) i i j d 1 2 w w (t+1) √ (t+1)√ (t+1) so we also have hwi , Mji < c1 − c2σw log d and thus i 6∈ Sj,pot . (t) (t) (t) (t) 2. For every i ∈ Sj,sure, suppose wlog hwi , Mji is positive. Then, either hwi , Mji > Ω(b log log log d) (t+1) √ (t+1)√ (t) in such a case we still have hwi , Mji ≥ c1 + c2σw log d. Otherwise, if hwi , Mji ≤ Ω(b(t) log log log d) then by substituting Lemma C.8 and Lemma C.11 into (C.9), we have (us- 1 log d ing σ0 ≤ poly(d) and λ ≤ d ) 20Cη √ hw(t+1), M i ≥ (1 − ηλ)hw(t), M i + ≥ hw(t), M i + c + c (σ(t+1) − σ(t))plog d i j i j d i j 1 2 w w (t+1) √ (t+1)√ so by induction we also have hwi , Mji ≥ c1 + c2σw log d. Combining this with (0) (t+1) (t+1) Sj,pot ⊇ Sj,pot , we conclude that i ∈ Sj,sure.

38 (t+1) 3. To check Sept = [m], we need to verify four things:

(t) 2 (t) 2 • hwi , Mji ≥ (c1 − c2)(σw ) log d for at most O(1) many j ∈ [d]. (0) (0) (t+1) This is so because Sept = [m] and Sj,pot ⊇ Sj,pot . √ (t+1) 2 (t+1) 2√ − log d • hwi , Mji ≥ 2(σw ) log d for at most 2 d many j ∈ [d]. This can be derived from (C.10) in the same way. (t+1) (t+1) σw d • |hwi , Mji| ≤ log d for at least Ω( log d ) many j ∈ [d]. This can be derived from (C.10) in the same way. (t) 2 (t) 2 • kwi k2 ≤ 2(σw ) d (t) For every i ∈ [m], suppose wlog hwi , Mji is positive. Then, by substituting Lemma C.9 and Lemma C.11 into (C.9), we have η hw(t+1) − w(t), M i ≤ O( ) i i j d Applying this formula for t + 1 times, we derive that η(t + 1) |hw(t+1), M i| ≤ O  + |hw(0), M i| (C.11) i j d i j and therefore applying this together with (C.10), (t+1) 2 X (t+1) 2 X (t+1) 2 kwi k2 = |hwi , Mji| + |hwi , Mji| (0) (0) j : i∈Sj,pot j : i6∈Sj,pot

x η(t + 1) η(t + 1) 1/4 ≤ 1.5kw(0)k2 + O(1) · 2 + d · 2 · e−Ω(log d) ≤ 2(σ(t+1))2d i 2 d d w (0) (Above, inequality x uses that there are at most O(1) indices j ∈ [d] such that i ∈ Sj,pot.)

2.5 (t+1) dσ0 log d 4. Finally, to check Sept+ = [m] for t ≥ Ta = Θ( η ), we first derive that η σ(t+1) = σ + Θ( √ ) · (t + 1) ≥ σ · Ω(log2 d) . w 0 d log d 0

(t+1) • For every i 6∈ Sj,pot , (C.10) gives

1/4 e−Ω(log d) |hw(t+1), M i| ≤ |hw(0), M i| + · η(t + 1) i j i j d −Ω(log1/4 d) (t+1) p e σw ≤ O(σ0 log d) + · η(t + 1) ≤ O( ) d log1.5 d

(0) (t+1) In particular, this together with Sept = [m] ensures that for every i ∈ [m], hwi , Mji ≥ (t+1) σw log d for at most O(1) many j ∈ [d]. (0) • For any i ∈ Sj,pot, using (C.11) we have η(t + 1) η(t + 1) |hw(t+1), M i| ≤ |hw(0), M i| + O( ) ≤ O(σ plog d) + O( ) = Θ(σ(t+1)plog d) i j i j d 0 d w (C.12)

(0) (t+1) 2 Using this together with the previous item, as well as |Sj,pot| ≤ O(1), we have kwi k ≤ (t+1) 2 O( (σw ) d ). log3 d

39 (t+1) Putting them together we have Sept+ = [m] for every t ≥ Ta. (t) (0) 5. After t = T iterations, we have σ = Θ(σ + √η · T ), for every i 6∈ S , by Lemma C.10 b w 0 d log d b j,pot and Lemma C.11 √ O(k/d) · log d k log d |hw(t), M i| ≤ |hw(Ta), M i| + · η(T − T ) ≤ O(σ(t) · ) i j i j d b a w d Combining this with (C.11), we immediately have

(t) 2 X (t+1) 2 X (t+1) 2 kwi k = |hwi , Mji| + |hwi , Mji| (0) (0) j : i∈Sj,pot j : i6∈Sj,pot k2 log2 d ≤ (σ(t))2 · O( + log d) w d (t) (0) (t) This implies Sept++ = [m] and Sj,pot ⊇ Sj,pot+ at this iteration t. 

C.4 Phase II: Signal Growth After Winning Lottery Definition C.12. In phase II we make the following parameter choices.

√ (t)√ (t) (t) (log log log d)3 In Phase II, we pick b(t) = c σ log d and σ = σ · √ . • 1 w ρ w log d (t+1) (t) Cη We grow b = b + d as before (the same as phase I.2 in Definition C.6) for each iteration, (t) (t) 2 but stop growing b when it reaches a threshold b = βΞ2. We first introduce a notation on a (high-probability) version of the coordinate Lipscthiz conti- nuity.

Definition C.13 (coordinate Lipschitzness). At every iteration t, for every j ∈ [d], we de- −Ω(log2 d) −Ω(log2 d) fine Lt,j > e to be the smallest value such that w.p. at least 1 − e over the 0 choice of {zj0 }j06=j and ξ, for every z ∈ [−1, 1] and z = (z1, ··· , zj−1, z, zj+1, ··· , zd), z = 0 0 (z1, ··· , zj−1, 0, zj+1, ··· , zd), x = Mz + ξ and x = Mz + ξ: 0 ft(x) − ft(x ) ≤ Lt,j|z| .

C.4.1 Growth Lemmas In this subsection, we provide new growth lemmas Lemma C.14, Lemma C.15, Lemma C.16, Lemma C.17 that are specific to Phase II, to replace the user of the old growth lemmas Lemma C.8, Lemma C.9, Lemma C.10, Lemma C.11 from Phase I.

(t) (t) Lemma C.14 (signal growth II). Suppose we Sept+ = Sept++ = [m]. Then, for every j ∈ [d], every (t) i ∈ Sj,sure, the following holds: √ √   (t) ! h 0 (t) i 1 Lt,j kσρ log d k y` (w ; x, y, ρ)1 (t) z = Θ ± O + + E t (t) j 3/2 x,y,ρ hwi ,xi+ρi≥b d d d βd

(t) ? Proof of Lemma C.14. First, without loss of generality, assuming that sign(hwi , Mji) = sign(wj ) =

40 0 0 0 1. Let us define z = (z1, ··· , zj−1, 0, zj+1, ··· , zd) and x = Mz + ξ. Define (t) def X  (t) (t) (t) (t)  ft,i(w ; x, ρ) = ReLU(hwj , xi + ρj + bj ) − ReLU(−hwj , xi + ρj + bj ) j6=i  (t) (t) (t) (t)  + ReLU(hwi , xi + bi ) − ReLU(−hwi , xi + bi )

0 (t) def d s ` (w ; x, y, ρ) = [log(1 + e )] | (t) t,i ds s=−yft,i(w ;x,ρ) es −Ω(log2 d) Now, since 1+es is an O(1)-Lipschitz function in s, we know that w.p. at least 1 − e 0 (t) 0 (t) 0 (t) |`t(w ; x, y, ρ) − `t,i(w ; x , y, ρ)| = O(Lt,j · |zj| + σρ log d) and this implies that

h 0 (t) 0 (t) 0 i y(` (w ; x, y, ρ) − ` (w ; x , y, ρ))1 (t) z E t t,i hw ,xi+ρ ≥b(t) j x,y,ρ i i  2 (t)  −Ω(log2 d) ≤ O Lt,j E[zj ] + σρ log d · E[|zj|] + e (t) √ ! L σ log d · k 2 = O t,j + ρ + e−Ω(log d) d d h i 0 (t) 0 1 so we only need to bound Ex,y,ρ y`t,i(w ; x , y, ρ) (t) (t) zj . hwi ,xi+ρi≥b (t) (t) Let us first focus on the case that |zj| = 1. As before, since j ∈ Sj,sure, we have hwi , Mji ≥ q b(t) 1 + c2 so applying Lemma C.7, c1

(t) (t) Pr[hwi , xi + ρi ≥ b | zj = 1] ≥ 1 − 2Γt = 1 − o(1) (t) (t) Pr[hwi , xi + ρi ≥ b | zj = −1] ≤ 2Γt = o(1) This means h i h i 0 (t) 0 1 1 0 (t) 0 E y`t,i(w ; x , y, ρ) (t) (t) zj | |zj| = 1 ≥ E y`t,i(w ; x , y, ρ) | zj = 1 − o(1) x,y,ρ hwi ,xi+ρi≥b 2 x,y,ρ ? ? Now recall y(zj, z) = sign(wj zj + hw , zi). ? ? • When |hw , zi| > |wj |, then we know that y(zj, z) and y(zj, −z) have different signs, but 0 (t) 0 0 (t) 0 `t,i(w ; x , y, ρ) = `t,i(w ; −x , −y, ρ) remains the same if we flip z to −z. By symmetry, we have h 0 (t) 0 ? ? i y` (w ; x , y, ρ) zj = 1 ∧ |hw , zi| > |w | = 0 x,y,ρE t,i j ? ? ? • Suppose otherwise |hw , zi| ≤ |wj |. Since |wj | = Θ(1), by Lemma H.1, this event happens with at least constant probability. When it happens, we have y(zj, z) = y(zj, −z) = +1, but 0 (t) 0 0 (t) 0 `t,i(w ; x , y, ρ) + `t,i(w ; −x , y, ρ) = 1. Therefore,

h 0 (t) 0 ? ? i 1 E y`t,i(w ; x , y, ρ) zj = 1 ∧ |hw , zi| ≤ |wj | ≥ x,y,ρ 2 Together, we have h i 0 (t) 0 1 E y`t,i(w ; x , y, ρ) (t) (t) zj | |zj| = 1 = Ω(1) (C.13) x,y,ρ hwi ,xi+ρi≥b

Next, conditioning on |zj| = s for some 0 < s < 1, we can apply Lemma C.5 with Y = 0 (t) 0 (t) P (t) y`t,i(w ; x , y, ρ) on s = sign(zj), α = hwi , Mjizj, S1 = z, S2 = j06=jhwi , Mj0 izj0 and ρ = ρi.

41 0 (t) 0 Since `t(w ; x , y, ρ) ≥ 0, we can conclude that Y is a monotone non-decreasing function of in s. [S2] One can verify that L = O(s) using Lemma H.1. Moreover, since i ∈ S(t) , we have E 2 ≤ 1 . ept++  (t)2 dβ2 σρ Let us denote by h i 0 (t) 0 1 ∆s := E y`t,i(w ; x , y, ρ) (t) (t) sign(zj) | |zj| = s x,y,ρ hwi ,xi+ρi≥b so according to Lemma C.5 we have  1   1  Ω √ ≤ ∆s = O √ + s β d β d √ k This implies, using E[|zj|] ≤ d , that √   ! h 0 (t) 0 i E[|zj|] 1 k 1 y` (w ; x , y, ρ)1 (t) z · 1 ≤ [∆ · |z | · 1 ] ≤ O √ + ≤ O + E t,i hw ,xi+ρ ≥b(t) j |zj |<1 E zj j |zj |<1 1.5 x,y,ρ i i β d d βd d √   ! h 0 (t) 0 i E[|zj|] k y` (w ; x , y, ρ)1 (t) z · 1 ≥ [∆ |z | · 1 ] ≥ −Ω √ ≥ −Ω E t,i hw ,xi+ρ ≥b(t) j |zj |<1 E zj j |zj |<1 1.5 x,y,ρ i i β d βd

Combining this with (C.13), and using Pr[|zj| = 1] ≥ Ω(1/d) finishes the proof.  Similarly, we have the following Lemma (t) (t) Lemma C.15 (maximum growth II). Suppose we Sept+ = Sept++ = [m]. Then, for every j ∈ [d], every i ∈ [m], the following holds: √ √ (t) ! h 0 (t) i 1 + Lt,j kσρ log d k y` (w ; x, y, ρ)1 (t) z = O + + E t hw ,xi+ρ ≥b(t) j 3/2 x,y,ρ i i d d βd

(t) (t) Lemma C.16 (non-signal growth II). Suppose we Sept+ = Sept++ = [m]. Then, for every j ∈ [d], (t) every i∈ / Sj,pot and i ∈ [m], the following holds: √ (t) ! h 0 (t) i log d + Lt,j kσρ log d y` (w ; x, y, ρ)1 (t) z ≤ Γ · O + E t hw ,xi+ρ ≥b(t) j t x,y,ρ i i d d Proof of Lemma C.16. In the same notation as the proof of Lemma C.14, we have

h 0 (t) 0 0 i y(` (w ; x, y, ρ) − ` (x , y))1 (t) z E t t,i hw ,xi+ρ ≥b(t) j x,y,ρ i i   2 2 1 (t) 1 −Ω(log d) ≤ O Lt,j E[zj · (t) (t) ] + σρ log d · E[|zj| · (t) (t) ] + e hwi ,xi+ρi≥b hwi ,xi+ρi≥b (t) √ ! L σ log d · k 2 ≤ Γ O t,j + ρ + e−Ω(log d) t d d

(t) where the last inequality uses Lemma C.7 and the fact i 6∈ Sj,pot (which, as before, implies if we (t) 2 (t) 2 b(t) 2 choose α = hwi , Mji · z then α ≤ (c1 − c2)(σw ) log d ≤ 4 ). Thus, we only need to bound h i 0 0 1 E y`t,i(x , y) (t) (t) zj . x,y,ρ hwi ,xi+ρi≥b 0 0 Conditioning on |zj| = s for some 0 < s ≤ 1, we can apply Lemma C.5 again with Y = y`t,i(x , y) (t) P (t) on s = sign(zj), α = hwi , Mjizj, S1 = z, S2 = j06=jhwi , Mj0 izj0 and ρ = ρi. This time, we use

42 1 1/4 Γy = d10 and Ly ≤ O(z · log d). Define h i 0 0 1 ∆s := E y`t,i(x , y) (t) (t) sign(zj) | |zj| = s x,y,ρ hwi ,xi+ρi≥b

b(t) Since α < 4 , Lemma C.5 tells us    −Ω(b2/σ2)  α |∆s| ≤ e ρ + Γt O( ) + Ly + Γy ≤ O(Γt log d) · z σρ and therefore

h 0 0 i 2 Γt log d y` (x , y)1 (t) z = [∆ |z |] ≤ O(Γ log d) · [z ] = O( ) E t,i hw ,xi+ρ ≥b(t) j E zj j t E j x,y,ρ i i d

Combining this with (C.13), and using Pr[|zj| = 1] ≥ Ω(1/d) finishes the proof.  Finally, we derive a more fine-grind bound for the noise:

(t) (t) Lemma C.17 (noise growth II). Suppose we Sept+ = Sept++ = [m]. Then, for every j ∈ [d], (a) for every j ∈ [d],

2 ! h i Γ L σ 2 0 (t) 1 Γt (t) 2 t t,j x −Ω(log d) E y`t(w ; x, y, ρ) (t) (t) hξ, Mji = O |hw , Mji|σx + + e x,y,ρ hw ,xi+ρi≥b (t) i i dσρ d

(0) (t) (b) suppose also Sj,pot ⊇ Sj,pot+ for every j ∈ [d], then  3 4  X h 0 (t) i k Ξ2 2 −Ω(log2 d) y` (w ; x, y, ρ)1 (t) hξ, M i ≤ O σ + e E t hw ,xi+ρ ≥b(t) j 2 x x,y,ρ i i d j∈[d]

Proof of Lemma C.17. We can first decompose the noise ξ into > 0 ξ = (I − MjMj )ξ + hMj, ξiMj =: ξj + hMj, ξiMj . 0 0 Let us define xj = Mz + ξj.

(t) 0 b(t) • On one hand we have with probability at least 1 − Γt, |hwi , xji| ≤ 10 (using a variant of Lemma C.7). Using the randomness of hξ, Mji and ρi we also have with probability at least −Ω(log2 d) (t) b(t) 1 − e it satisfies |hwi , Mji · hMj, ξi| + |ρi| ≤ 10 . Therefore, with probability at least −Ω(log2 d) 1 1 1 − Γt − e , we have (t) (t) = (t) 0 (t) . hwi ,xi+ρi≥b hwi ,xj i+ρi≥b (t) 0 b(t) • Otherwise, in the event that |hwi , xji| ≥ 10 , using the randomness of ρi, we have that  h (t) i   E hξ, Mjihw , Mji 1 1 i Pr (t) (t) 6= (t) 0 (t) ≤ O   ρ hw ,xi+ρi≥b hw ,x i+ρi≥b (t) i i i j σρ

Together, we have     0 (t) y` (w ; x, y, ρ) 1 (t) − 1 (t) hξ, M i E t hw ,xi+ρ ≥b(t) hw ,x0 i+ρ ≥b(t) j x,y,ρ i i i j i h i  2 (t)  ! E hξ, Mji hwi , Mji Γ ≤ O Γ = O t |hw(t), M i|σ2 (C.14)  t (t)  (t) i j x σρ dσρ

43 Using the coordinate Lipscthizness, we also have   0 (t) 0 0 y(` (w ; x, y, ρ) − ` (x , y))1 (t) hξ, M i E t t j hw ,x0 i+ρ ≥b(t) j x,y,ρ i j i Γ L σ2  ≤ L [hξ, M i2] Pr[hw(t), x0 i + ρ ≥ b(t)] = O t t,j x (C.15) t,j E j i j i d   0 0 1 Finally, we have Ex,y,ρ y`t(xj, y) (t) 0 (t) hξ, Mji = 0. We can thus combine (C.14) and hwi ,xj i+ρi≥b (C.15) to complete the proof of Lemma C.17a. Next, we want to prove Lemma C.17b. We have: denote

qi0 = (1 (t) (t) + 1 (t) (t) hw ,Mzi+ρ 0 +b ≥−|b(t)|/10 −hw ,Mzi+ρ 0 +b ≥−|b(t)|/10 i0 i i0 i0 i i0

−Ω(log2 d) (t) |b(t)| Since w.p. at least 1 − e , |hwi0 , ξi| ≤ 10 , in this case, we know that X 0 (t) 0 0 |`t(w ; x, y, ρ) − `t(xj, y)||hξ, Mji| j∈[d] X 2 X (t) ≤ hMj, ξi · |hw 0 , Mji| · (1 (t) (t) + 1 (t) (t) ) i hw ,xi+ρ 0 +b ≥0 −hw ,xi+ρ 0 +b ≥0 i0 i i0 i0 i i0 j∈[d] i0∈[m] X 2 X (t) ≤ hMj, ξi · |hwi0 , Mji| · qi0 j∈[d] i0∈[m] Note also we have: X (t) (t) X (t) |hwi0 , Mji| ≤ O(1) · kwi0 k2 + |hwi0 , Mji| 0 j∈[d] j : i 6∈Sj,pot+ σ(t) k k ≤ O(1) · w + d · b(t) ≤ 2 · b(t) β dβ β By Lemma C.19, we have that

X h 0 (t) 0 0 i y(` (w ; x, y, ρ) − ` (x , y))1 (t) hξ, M i E t t j hw ,xi+ρ ≥b(t) j x,y,ρ i i j∈[d] 2 h i −Ω(log d) X 0 (t) 0 0 1 ≤ e + E |(`t(w ; x, y, ρ) − `t(xj, y))||hξ, Mji| (t) (t) x,y,ρ hwi ,xi+ρi≥b j∈[d] −Ω(log2 d) X h 0 (t) 0 0 i ≤ e + |(` (w ; x, y, ρ) − ` (x , y))||hξ, Mji|qi x,y,ρE t t j j∈[d]   −Ω(log2 d) X 2 X (t) ≤ e + hMj, ξi · |hw 0 , Mji| · qi0 qi ··· taking expectation w.r.t. ξ first. x,y,ρE  i  j∈[d] i0∈[m]    2  −Ω(log2 d) σx X X (t) ≤ e + O · E  |hw 0 , Mji| · qi0 qi d x,y,ρ i j∈[d] i0∈[m]  2       3 4  σ k k 2 k Ξ 2 ≤ O x · O b(t) · kΞ · O + e−Ω(log d) ≤ O 2 σ2 + e−Ω(log d) (C.16) d β 2 d d2 x

44 Next, similar to the (C.14), we also have     X 0 0 y` (x , y) 1 (t) − 1 (t) hξ, M i E t j hw ,xi+ρ ≥b(t) hw ,x0 i+ρ ≥b(t) j x,y,ρ i i i j i j∈[d] ! !  2  X Γt (t) 2 k 2 X (t) k log d 2 ≤ O |hwi , Mji|σx ≤ O σx · |hwi , Mji| ≤ O σx (C.17) (t) 2 (t) d2β j∈[d] dσρ d σρ j∈[d] Combining (C.16) and (C.17) we finish the proof of Lemma C.17b. 

C.4.2 Growth Coupling We also have the following lemma which says, essentially, that all those neurons i ∈ [m] satisfying √ (t) (t) |hwi , Mji| ≥ 2 kb for the same j, grows roughly in the same direction that is independent of i.

Lemma C.18 (growth coupling). Suppose at iteration t, S(t) = [m]. Then, for every j ∈ [d], √ ept+ (t) (t) every i ∈ [m] such that |hwi , Mji| ≥ 2 kb , we have: ! h i h i 3/2 0 (t) 1 1  0 (t) k E y`t(w ; x, y, ρ) (t) (t) + (t) (t) zj = E y`t(w ; x, y, ρ)zj ± O 2 x,y,ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b x,y,ρ d √ (t) (t) Proof of Lemma C.18. We first focus on the case when hwi , Mji| ≥ 2 kb is positive, and the 1 reverse case is analogous. Conditional on |zj| = s > 0, we know that s ≥ √ . Thus, when √ k (t) (t) (t) (t) (t) |hwi , Mji| ≥ 2 kb , |hwi , Mjis| ≥ 2b . Now, using Sept+ = [m] and Lemma C.7, we can conclude that (t) (t) k  • when zj > 0, Pr[hwi , xi + ρi ≥ b | zj = s] ≥ 1 − O d ; (t) (t) k  • when zj < 0, Pr[hwi , xi + ρi ≥ b | zj = −s] ≤ O d . Thus, we can obtain h i h i   0 (t) 1 0 (t) 1 k E y`t(w ; x, y, ρ) (t) (t) zj = E y`t(w ; x, y, ρ)zj zj >0 ± O × E |zj| x,y,ρ hwi ,xi+ρi≥b x,y,ρ d ! h i 3/2 0 (t) 1 k = E y`t(w ; x, y, ρ)zj zj >0 ± O x,y,ρ d2 In the symmetric case, we also have ! h i h i 3/2 0 (t) 1 0 (t) 1 k E y`t(w ; x, y, ρ) (t) (t) zj = E y`t(w ; x, y, ρ)zj zj <0 ± O 2 x,y,ρ −hwi ,xi+ρi≥b x,y,ρ d 

C.4.3 Activation Probabilities (t) (0) (t) Lemma C.19 (activation after ept+). Suppose Sept+ = [m] and Sj,pot ⊇ Sj,pot+ for every j ∈ [d]. 2 Then, with probability at least 1 − e−Ω(log d),

n (t) o i ∈ [m] s.t. |hw(t), xi| ≥ b ≤ O(kΞ ) . • i 10 2   (t) P b(t) w , (t) M z + ξ ≤ for every i ∈ [m]. • i j∈[d]: i6∈S j j 10 j,pot+

45 (t) (t) k (t) Proof. For every i ∈ [m] and j ∈ [d] with i 6∈ Sj,pot+, we have |hwi , Mji| ≤ dβ b . Therefore, by 2 Bernstein’s inequality (similar to Lemma C.4b), we know with probability at least 1 − e−Ω(log d), for every i ∈ [m],

* + (t) (t) X b w , M z + ξ ≤ (C.18) i j j 10 (t) j∈[d]: i6∈Sj,pot+ −Ω(log2 d) P 1 With probability at least 1 − e it satisfies j∈[d] zj 6=0 ≤ O(k) (since each zj 6= 0 with probability at most O( k )). Therefore, denoting by Λ = S S(t) , we have |Λ| ≤ O(kΞ ) d j∈[d]: zj 6=0 j,pot+ 2 (t) (since every |Sj,pot+| ≤ Ξ2). Now, for any i ∈ [m] \ Λ, inequality (C.18) immediately gives D E b(t) w(t), x ≤ . i 10 D E (t) Therefore, the number of i ∈ [m] satisfying w(t), x ≥ b cannot be more than O(kΞ ). i 10 2 

C.4.4 Coordinate Lipscthizness Bound (t) P (t) Lemma C.20 (coordinate Lipschitzness). For every j ∈ [d], let us define γj = (t) |hwi , Mji|. i∈Sj,pot+ (t) (0) (t) Then, suppose Sept+ = [m] and suppose Sj,pot ⊇ Sj,pot+, we have     (t) k (t) 1 Lt,j ≤ γj + O √ ≤ γj + O 3 d Ξ2 Proof of Lemma C.20. Clearly we have X (t) X (t) 1 1 Lt,j ≤ |hwi , Mji| + |hwi , Mji| · ( (t) (t) + (t) (t) ) hwi ,xi+ρi≥0.9bi −hwi ,xi+ρi≥0.9bi (0) (0) i∈Sj,pot+ i6∈Sj,pot+

−Ω(log2 d) By Lemma C.19 and the randomness of ρi, we know with probability at least 1 − e , the (t) (t) (t) (t) (t) number of activate neurons i 6∈ Sj,pot+—meaning hwi , xi+ρi ≥ 0.9bi or −hwi , xi+ρi ≥ 0.9bi — (t) is at most O(kΞ2). On the other hand, when i∈ / Sj,pot+, we know that k kΞ2 |hw(t), M i| ≤ b(t) ≤ 2 i j dβ d (t) Therefore, together, the total contribution from these active neurons with i∈ / Sj,pot+ is at most 2 2 3   kΞ2 k Ξ2 1 · O(kΞ2) < O( ) < O 3 . This completes the proof. d d Ξ2 

C.4.5 Regularization (t) Following the same argument as (C.9) from phase I, we know at any iteration t, as long as Sept+ = [m], η hw(t+1), M i = hw(t), M i(1 − ηλ − ηλkw(t)k ) ± i j i j i 2 poly(d) " # m   0 (t) X 1 1  + E y`t(w ; x, y, ρ) (t) (t) + (t) (t) · zj + hξ, Mji . (C.19) x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b i=1

46 In this and the next subsection, we shall repeatedly apply growth lemmas to (C.19). Before (t) (t) 2 doing so, let us note σρ = o(b log d) ≤ o(βΞ2 log d), so using our parameter choice of β and using k ≤ d1−c0 , √ √ √ ! √ kσ(t) log d k kβΞ2 log2 d k 1 ρ + = o 2 + = o( ) (C.20) d βd3/2 d βd3/2 d This means, when applying the aforementioned growth lemmas Lemma C.14, Lemma C.15, Lemma C.16, √ (t) √ the additional terms kσρ log d and k are negligible. d βd3/2 We also have the following regularity lemma:

O(log d) (t) (t) (0) Lemma C.21 (regularity). For every T ≤ d /η, suppose Sept+ = Sept++ = [m] and Sj,pot ⊇ (t) Sj,pot+ hold for every t ≤ T and j ∈ [d]. Then, we have for every t ≤ T , with probability at least 2 1 − e−Ω(log d), 2 (t) 2 2 ∀j ∈ [d], ∀i ∈ [m]: Lt,j ≤ O(Ξ2), kwi k2 ≤ O(Ξ2), |ft(x)| ≤ O(Ξ2 log d) . Proof of Lemma C.21. By substituting Lemma C.15, Lemma C.17a and (C.20) into (C.19), we have for every j ∈ [d]: 1 + L  |hw(t+1), M i| ≤ |hw(t), M i|(1 − ηλkw(t)k ) + ηO t,j i j i j i 2 d   1 + L  ≤ |hw(t), M i| 1 − ηλ|hw(t), M i| + ηO t,j i j i j d (0) (0) Summing up over all i ∈ Sj,pot, and using Cauchy-Schwarz inequality together with |Sj,pot| ≤ Ξ2, we have  2   X (t+1) X (t) ηλ X (t) 1 + Lt,j |hw , M i| ≤ |hw , M i| −  |hw , M i| + ηO · Ξ i j i j Ξ  i j  d 2 (0) (0) 2 (0) i∈Sj,pot i∈Sj,pot i∈Sj,pot P (t) 1 1 Combining this with Lt,j ≤ (0) |hwi , Mji| + O( 3 ) from Lemma C.20 and our choice λ ≥ d , i∈Sj,pot Ξ2 we have (for every j ∈ [d] and t ≤ T ),

X (t) 2 |hwi , Mji| ≤ O(Ξ2) (0) i∈Sj,pot 2 This also implies Lt,j ≤ O(Ξ2) as well as 2 (t) X (t) X (t) k kw k2 = |hw , M i|2 + |hw , M i|2 ≤ O(Ξ4) + O( (b(t))2) ≤ O(Ξ4) . i i j i j 2 dβ2 2 (t) (t) j : i∈Sj,pot+ j : i6∈Sj,pot+ 2 Finally, for the objective value, we wish use Lt,j ≤ O(Ξ2) and apply a high-probability Bernstein variant of the McDiarmid’s inequality (see Lemma H.3). P Specifically, consider random z, ξ, ρ. For notation simplicity, let us write ξ = j∈[d] Mjξj for 2 σx i.i.d. random ξj ∼ N (0, d ). 0 0 Now, for every j ∈ [m], suppose we change zj to zj and ξj to ξj with the same distribution. 2 Then, with probability at least 1 − e−Ω(log d), 0 0 0 0 |ft(z, ξ, ρ) − ft(z−j, zj, ξ−j, ξj, ρ)| ≤ Lt,j · (|zj| + |zj| + |ξj| + |ξj|)

47 2 This implies with probability at least 1 − e−Ω(log d),

0 2 2 1 |ft(z, ξ, ρ) − ft(z−j, z , ξ, ρ)| ≤ O(L ) · E 0 j t,j zj ,zj d

2 Therefore, we can apply Lemma H.3 to derive that with probability at least 1 − e−Ω(log d), 2 |ft(x, ρ) − E [ft(x, ρ)]| ≤ O(Ξ2 log d) z,ξ

Finally, noticing that for every ρ, by symmetry Ez,ξ[ft(x, ρ)] = 0. This finishes the bound on the objective value.  We also prove this Lemma, which gives a lower bound on the loss:

Lemma C.22 (loss lower bound). In every iteration t, define Lmax := maxj∈[d]{Lt,j} and suppose 2 Lmax ≤ O(Ξ2). Then we have:   1  [`0 (w(t); x, y, ρ)] = Ω min 1, x,ρE t 2 2 Lmax log d  1  Proof of Lemma C.22. Let α ∈ 5 , 1 be a fixed value to be chosen later, and S0 ⊆ [d] be an (Ξ2) arbitrary subset of size |S0| = αd. Consider a randomly sampled vector z and let x = Mz + ξ be the corresponding input. We construct another z0 that is generated from the following process   1. Let S ⊆ S be the set consisting of all i ∈ S with |z | = Θ √1 . re,z 0 0 i k 0 2. For all i∈ / Sre,z, pick zi = zi. 0 0 3. For all i ∈ Sre,z, pick zi = zi or zi = −zi each with probability 0.5, independently at random. Obviously, z0 has the same distribution as z. Now, let us define x0 = Mz0 + ξ, y0 = sign(hw?, z0i). h  i Since |S | = αd, recalling the distribution property that Pr |z | = Θ √1 = Ω k , we know 0 i k d −Ω(log2 d) with probability at least 1 − e over the choice of z, |Sre,z| = Θ(αk). We call this event E1(z). z0 Let us denote by b = i ∈ {−1, 1} for every i ∈ S . We can therefore write f (w(t); x0, ρ) = i zi re,z t f(z, b, ξ, ρ) to emphasize that the randomness comes from z, b, ξ, ρ. Using the definition of coordi- 2 nate Lipscthizness, we know with probability at least 1 − e−Ω(log d) over z, ξ, ρ, it satisfies   Sre,z 0 Lmax ∀k ∈ Sre,z, ∀b ∈ {−1, 1} , ∀b ∈ {−1, +1}: |f(z, b, ξ, ρ) − f(z, (b−k, bk), ξ, ρ)| ≤ O √ k k

Let E2(z, ξ, ρ) denote the event where the above statement holds. Now, conditioning on E1(z) and E2(z, ξ, ρ) both hold, we can apply standard MiDiarmid’s in- equality (see Lemma H.2) over the randomness of b, and derive that with probability at least 2 1 − e−Ω(log d) over b,

√  f(z, b, ξ, ρ) − E [f(z, b, ξ, ρ)] ≤ O Lmax α log d b

Let E3(b||z, ξ, ρ) denote the (conditional) event where the above statement holds. −Ω(log2 d) 0 In sum, by combining E1, E2, E3, we know with probability at least 1 − e over z, z , ξ, ρ, it satisfies

(t) 0 h (t) 0 i √  ft(w ; x , ρ) − E ft(w ; x , ρ) | z, ξ, ρ ≤ O Lmax α log d . z0

48 As a simple corollary, if we generate another copy z00 in the same way as z0, and denote by 2 x00 = Mz00 + ξ, then with probability at least 1 − e−Ω(log d) over z, z0, z00, ξ, ρ, it satisfies √ (t) 0 (t) 00  ft(w ; x , ρ) − ft(w ; x , ρ) ≤ O Lmax α log d . (C.21) Now, let us denote by y0 = sign(hw?, z0i) and y00 = sign(hw?, z00i) and compare them. Let us write X ? X ? 0 X ? 00 A = wi zi ,B = wi zi ,C = wi zi . i∈[d]\Sre,z i∈Sre,z i∈Sre,z Thus, we have y0 = sign(A + B) and y00 = sign(A + C). First using a minor variant of Lemma H.1b, we have20 √ √ Pr A ∈ [0, α] ≥ Ω( α)

Denote this event by E4(z). Next, conditioning on any fixed z which satisfies E1(z) and E4(z), we know that B and C become independent, each controlled by |Sre,z| = Θ(αk) random Bernoulli variables. Therefore, we can apply a Wasserstein distance version of the central limit theorem (that can be derived from [110], full statement see [6, Appendix A.2]) to derive that, for a Gaussian variable g ∼ (0,V 2) where V 2 = P (z )2 = Θ(α), the Wasserstein distance: j∈Sre,z j log k  log k  W2 (B, g) ≤ O √ and W2 (C, g) ≤ O √ k k √ √ This means with probability at least Ω(1), it satisfies B ∈ [0, α] and C ≤ −5 α. √ √ √ To sum up, we know with probability at least Ω( α), it satisfies A, B ∈ [0, α] and C ≤ −5 α. This means y0 6= y00, or in symbols, √ Pr[y0 6= y00] ≥ Ω( α) . (C.22) Finally, conditioning on both (C.21) and (C.22) happen, we know that (t) 0 (t) 00 0 (t) 0 0 0 (t) 00 00 • either sign(ft(w ; x , ρ)) = sign(ft(w ; x , ρ), in which case `t(w ; x , y , ρ)+` (w ; x , y , ρ) ≥ 1 2 , (t) 0 √ (t) 00 √ • or |ft(w ; x , ρ)| ≤ O (Lmax α log d) and |ft(w ; x , ρ)| ≤ O (Lmax α log d), in which case 1 1 (t) 0 (t) 00 if we choose α = min{ , 2 2 }, then we have |ft(w ; x , ρ)|, |ft(w ; x , ρ)| ≤ O(1) and 2 Lmax log d 0 (t) 0 0 0 (t) 00 00 therefore `t(w ; x , y , ρ) + `t(w ; x , y , ρ) ≥ Ω(1). To sum up, we have

0 (t) 1 0 (t) 0 0 0 (t) 00 00 E [`t(w ; x, y, ρ)] = E [`t(w ; x , y , ρ) + `t(w ; x , y , ρ)] x,y=y(x),ρ 2 z,z0,z00,ξ,ρ √  1  ≥ Ω( α) = Ω min{1, } . 2 2  Lmax log d

C.4.6 Proof of Theorem C.2

Proof of Theorem C.2. We first prove that for every t ≥ Tb, (t) (t) (0) Sj,pot ⊆ Sj,pot+ ⊆ Sj,pot (C.23)

20 d To be precise, we can do so since we still have at least (1 − α)d ≥ 2 coordinates.

49 (t) (t) Note from the definitions the relationship Sj,pot ⊆ Sj,pot+ always holds, so we only need to prove the second inclusion. (0) Suppose (C.23) holds until iteration t. Then, for every i 6∈ Sj,pot, let us apply Lemma C.16, 2 Lemma C.17a together with (C.20) and Lt,j ≤ O(Ξ2) (using Lemma C.21) to (C.19). We get ηkΞ2  |hw(t+1), M i| ≤ |hw(t), M i|(1 − ηλ) + O 2 i j i j d2 (t+1) 2 1 Therefore, for those t that are sufficiently large so that b = βΞ2, we have (using λ ≥ d ) ηkΞ2  1  kΞ2  k |hw(t+1), M i| ≤ O 2 · = O 2 ≤ b(t+1) i j d2 ηλ d · dλ dβ

(t+1) η(t+1) and for those t that are still small so that b = Θ( d ), we have ηkΞ2  k |hw(t+1), M i| ≤ O 2 · (t + 1)  b(t+1) i j d2 dβ

(t+1) O(log d) Together, this means i 6∈ Sj,pot+ so (C.23) holds for all t ≥ Tb and T ≤ d /η.

Phase II.1. We will construct a threshold Te and prove inductively for all t ∈ [Tb,Te]. Initially at t = Tb, by Lemma C.20 we have Lt,j = o(1). As long as Lt,j = o(1) holds for all j ∈ [d], we have • for every i ∈ [m], substituting Lemma C.15, Lemma C.17a and (C.20) into (C.19), η  η  |hw(t+1), M i| ≤ |hw(t), M i| + O ≤ · · · ≤ O · t i j i j d d (t) • for every i 6∈ Sj,pot, substituting Lemma C.16, Lemma C.17a and (C.20) into (C.19), ηk log d ηk log d  |hw(t+1), M i| ≤ |hw(t), M i| + O ≤ · · · ≤ O · t i j i j d2 d2

(t) (t) (0) Since for each i, the number of j satisfying i ∈ Sj,pot is at most O(1) (using Sj,pot ⊆ Sj,pot and (0) Sept = [m]), we have η kw(t+1)k ≤ O( · t) (C.24) i d These bounds together mean several things:   L = o(1) for all j ∈ [d] and t ∈ [T ,T ] with T = Θ d . • t,j b e e ηΞ2 log d Indeed, (C.24) gives |hw(t), M i| ≤ O( 1 ), but the number of j satisfying i ∈ S(t) is at i j Ξ2 log d j,pot most O(1). So we can apply Lemma C.20 to get Lt,j = o(1). (t) • Sept++ = [m] for all t ∈ [Tb,Te]. Indeed,

(t) (t) – for those t that are small so that σ = Θ( √η t), we have (C.24) implies kw k ≤ w d log d i √ (t) (t) σw O( log d · σw )  β ; and (t) βΞ2 (t) – for those t that are large so that σ = Θ( √ 2 ), we have (C.24) implies kw k ≤ w log d i (t) O( 1 )  σw . Ξ2 log d β

(t) Together we have i ∈ Sept++.

50 (t) • Sept+ = [m] for all t ∈ [Tb,Te]. (t) This is a direct corollary of Sept++ = [m] together with the property that the number of j (t) satisfying i ∈ Sj,pot+ is at most O(1). (0) Next, let us consider any j ∈ [d] with i ∈ Sj,sure. At any iteration t ∈ [Tb,Te], substituting Lemma C.14, Lemma C.17a, (C.24), and (C.20) into (C.19), η  |hw(t+1), M i| ≥ |hw(t), M i|(1 − ηλ − ηλkw(t)k ) + Ω i j i j i 2 d η  ≥ |hw(t), M i|(1 − 2ηλ) + Ω i j d This means two things: (t) η 1 1 • The value |hwi , Mji| keeps increasing as t increases, until it reaches Θ( d · ηλ ) = Θ( dλ ) and 1 (t) at that point it may decrease but will not fall below Θ( dλ ). This ensures i ∈ Sj,sure. (t) • At t = Te, we must have i ∈ Sj,sure+ because     (t) 2 (t) η 1 1 (b ) |hwi , Mji| ≥ Ω( Te) ≥ Ω ≥ Ω · 2 4 d Ξ2 log d Ξ2 log d β Ξ2   1 (t) 2 (t)2 ≥ Ω 2 5 · 4k(b ) ≥ 4k b kβ Ξ2 log d

To sum up, at iteration t = Te, we have   for i ∈ S(0) , |hw(t), M i| ≥ Ω 1 ; • j,sure i j Ξ2 log d for i ∈ S(0) , |hw(t), M i| ≤ O( 1 ) • j,pot i j Ξ2 log d for i 6∈ S(0) , |hw(t), M i| ≤ O( k ) • j,pot i j dΞ2 Phase II.2. We first make a quick observation that (t) (t) • Sept+ = Sept++ = [m] for all t ≥ Te. (t) 2 Indeed, from iteration t = Te on, we have b = βΞ2. Using Lemma C.21 we have for every (t) (t) 2 σw (t) (t) i ∈ [m], kwi k ≤ O(Ξ2) ≤ β . Thus, Sept++ = [m] holds for all t ≥ Te. As for Sept+ = [m], it (t) is a simple corollary of Sept++ = [m] together with the property that the number of j satisfying (t) i ∈ Sj,pot+ is at most O(1).

(Te) Next, we claim for every i ∈ Sj,sure+ and every t ≥ Te, it must hold that   (t) 1 (t) (t) |hwi , Mji| ≥ max |hwi0 , Mji| and i ∈ Sj,sure+ (C.25) C0 i0∈[m] for some sufficiently large constant C0 > 1. We prove by induction. Suppose (C.25) holds for t and √ (t) (t) (t) we consider t + 1. By the definition of i ∈ Sj,sure+, we know |hwi , Mji| ≥ 2 kb . Now, consider every other i0 ∈ [m] \{i}

(t) 0 (t) (t+1) 0 (t+1) • if |hwi0 , Mji| < 2C |hwi , Mji|, then after one iteration we still have |hwi0 , Mji| < C |hwi , Mji|.

51 (t) 0 (t) • if |hwi0 , Mji| > 2C |hwi , Mji|, then we have 2 (t) 2 (t) 2 X (t) 2 (t) 2 k (t) 2 kw k = |hw , M i| + |hw , M 0 i| ≤ |hw , M i| + (d − 1) · (b ) i i j i j i j d2β2 j06=j 10 ≤ 10|hw(t), M i|2 ≤ |hw(t), M i|2 ≤ kw(t)k2 (C.26) i j 2C0 i0 j i0 Therefore, applying Lemma C.18 and Lemma C.17a (for i and i0), and using β ≤ √1 , we have k  1.5  (t+1) (t) (t) h 0 (t) i ηk |hw , Mji| = |hw , Mji|(1 − ηλ − ηλkw k2) + η E y`t(w ; x, y, ρ)zj ± O i i i x,y,ρ d2  1.5  (t+1) (t) (t) h 0 (t) i ηk |hw 0 , Mji| = |hw 0 , Mji|(1 − ηλ − ηλkw 0 k2) + η E y`t(w ; x, y, ρ)zj ± O i i i x,y,ρ d2 Taking the difference and using (C.26), we have   ηk1.5  |hw(t+1), M i| − |hw(t+1), M i| ≤ |hw(t), M i| − |hw(t), M i| (1 − ηλ − ηλkw(t)k ) + O i0 j i j i0 j i j i 2 d2   √ ηk1.5  ≤ |hw(t), M i| − |hw(t), M i| − Ω(ηλ( kb(t))3) + O i0 j i j d2  (t) (t)  ≤ |hwi0 , Mji| − |hwi , Mji|

(t+1) 0 (t+1) thus we continue to have |hwi0 , Mji| ≤ C |hwi , Mji|. Putting these together we show that the first half of (C.25) holds at t + 1. (t+1) As for why i ∈ Sj,sure+, we consider two cases. √ √ (t) (t) (t+1) (t) • If√|hwi , Mji| ≥ 4 kb , then in one iteration we should still have |hwi , Mji| ≥ 2 kb = 2 kb(t+1). √ (t) (t) • If |hw , Mji| ≤ 4 kb , then by the first half of (C.25) together with Lemma C.20, we know i √ (t)  1  the Lipscthizness Lt,j ≤ O( kb ·Ξ2)+O 3 ≤ o(1). In this case, we also have (see (C.26)) Ξ2 (t) 2 (t) 2 kwi k ≤ O(k(b ) ) = o(1). Applying Lemma C.14 and Lemma C.17a again we have η  |hw(t+1), M i| ≥ |hw(t), M i|(1 − ηλ − ηλkw(t)k ) + Ω i j i j i 2 d η  ≥ |hw(t), M i|(1 − 2ηλ) + Ω ≥ |hw(t), M i| i j d i j

(t+1) Putting both cases together we have i ∈ Sj,sure+ so the second half of (C.25) holds at t + 1. 

D Clean Accuracy Convergence Analysis

In this section we show the upper bound on how the clean training of a two-layer neural network N can learn the labeling function from N training samples {xi, yi}i=1 up to small generalization error.

52 Theorem D.1. Suppose the high-probability initialization event in Lemma B.2 holds, and suppose 1 −Ω(log2 d) def η, σ0 ∈ (0, poly(d) ) and N ≥ poly(d). With probability at least 1 − e , for any T ≥ Tc = 6 dΞ2 O(log d) Ω( η ) and T ≤ d /η, if we run the Algorithm 1 for Tf = Te + T iterations, we have

Te+T −1 1 X (t) E Objt(w ; x, y, ρ) ≤ o(1) T x,y,ρ t=Te

In other words, at least 99% of the iterations t = Te,Te + 1,...,Te + T − 1 will have population risk o(1) and clean population accuracy ≥ 1 − o(1).

Remark D.2. With additional efforts, one can also prove that Theorem D.1 holds with high prob- −Ω(log2 d) ability 1 − e for all T in the range T ∈ [Tc,Tf ]. We do not prove it here since it completes the notations and is not beyond the scope of this paper.

D.1 Proof of Theorem D.1: Convergence Theorem Our convergence analysis will rely on the following (what we call) coupling function which is the first-order approximation of the neural network. Definition D.3 (coupling). At every iteration t, we define a linear function in µ m   def X 1 (t) 1 (t) gt(µ; x, ρ) = (t) (t) · (hµi, xi + ρi − b ) − (t) (t) · (−hµi, xi + ρi − b ) hwi ,xi+ρi≥b −hwi ,xi+ρi≥b i=1 and it equals the output of the real network at point µ = w(t) both on zero and first order: (t) (t) gt(w ; x, ρ) = ft(w ; x, ρ) and ∇µgt(µ; x, ρ) µ=w(t) = ∇wft(w; x, ρ) w=w(t) In the analysis, we shall also identify a special choice µ? defined as follows.

(0) (0) ? ? Definition D.4. Recall S1,sure,..., Sd,sure ⊆ [m] are disjoint, so we construct µ1, . . . , µm by

  w?   j (0) ? def α (0) Mj, i ∈ Sj,sure for some j ∈ [d]; µi = |Sj,sure|  ~0, otherwise. Above, α = o(1) is a parameter to be chosen later. One can easily check (using Lemma B.2) that

P ? 2 α2 P ? 3 α3 Claim D.5. i∈[m] kµi k ≤ O( d) and i∈[m] kµi k ≤ O( 2 d) Ξ1 Ξ1 More interestingly, our so-constructed µ? satisfies (to be proved in Section D.2)

(t) (t) (0) (t) (0) (t) Lemma D.6. Suppose Sept+ = Sept++ = [m], Sj,pot ⊇ Sj,pot+ and Sj,sure ⊆ Sj,sure+ for every j ∈ [d]. Then, −Ω(log2 d) ? ? 1 (a) with probability at least 1 − e over x, ρ, gt(µ ; x, ρ) = αhw , zi ± O( 2 ) Ξ2 ?    −y·gt(µ ;x,ρ) 1 1 (b) Ex,y=y(x),ρ log 1 + e ≤ O 2 + 2 α Ξ2

(t+1) (t) (t) We are now ready to prove Theorem D.1. Since wi = wi − η∇wi Objg t(w ), we have the identity η2 1 1 ηh∇Objg (w(t)), w(t) − µ?i = k∇Objg (w(t))k2 + kw(t) − µ?k2 − kw(t+1) − µ?k2 t 2 t F 2 F 2 F

53 (t) (t) Applying Lemma A.2, we know that by letting Objt(w ) = Ex,y,ρ Objt(w ; x, y, ρ), it satisfies 1 1 η ηh∇Obj (w(t)), w(t) − µ?i ≤ η2 · poly(d) + kw(t) − µ?k2 − kw(t+1) − µ?k2 + t 2 F 2 F poly(d) Let us define a pseudo objective

0 def  −y·gt(µ;x,ρ)  X Objt(µ) = E log(1 + e ) + λ Reg(µi) , x,y=y(x),ρ i∈[m] which is a convex function in µ because gt(µ; x, ρ) is linear in µ. We have, for every t ≥ Te,

(t) (t) ? x 0 (t) (t) ? h∇Objt(w ), w − µ i = h∇Objt(w ), w − µ i 0 (t) 0 ? (t) 0 ? ≥ Objt(w ) − Objt(µ ) = Objt(w ) − Objt(µ ) y X kµ?k3 kµ?k2   1 1  ≥ Obj (w(t)) − λ i + i − O + t 3 2 α2 Ξ2 i∈[m] 2 z  3 2  (t) α log d α log d 1 1 ≥ Objt(w ) − O 2 + + 2 + 2 Ξ1 Ξ1 α Ξ2 √  (t) log d ≥ Objt(w ) − O √ Ξ1

Above, x uses the definition of gt, y uses Lemma D.6b (and Theorem C.2 for the prerequisite for ? 2 ? 3 Lemma D.6b), and z uses Claim D.5 for the bound on kµi k and kµi k . Putting these together, we have  √  (t) log d 2 1 (t) ? 2 1 (t+1) ? 2 η η Objt(w ) − O √ ≤ η · poly(d) + kw − µ kF − kw − µ kF + Ξ1 2 2 poly(d) 1 Therefore, after telescoping for t = Te,Te + 1,...,Te + T − 1, and using η ≤ poly(d) , we have T +T −1 √ e    (Te) ? 2 4 1 X (t) log d O(kw − µ kF ) O(Ξ2m) Objt(w ) − O √ ≤ ≤ T Ξ1 ηT ηT t=Te Finally, we calculate

(Te) 2 X (Te) 2 X (Te) 2 kw kF ≤ kwi k2 + kwi k2 S (0) S (0) i∈ j Sj,pot i6∈ j Sj,pot k2 ≤ dΞ · O(Ξ4) + m · O( Ξ4) ≤ O(dΞ5) 2 2 d2 2 2 and this finishes the proof. 

D.2 Proof of Claim D.6: Main Coupling The proof of Lemma D.6a comes from Claim D.7 and Claim D.8 below. In the two claims, we split ? gt(µ ; x) = gt,1 + gt,4 into two terms, and bound them separately. Define α · w?   ? X j X 1 1 gt,1(µ ; x, ρ) = (t) (t) + (t) (t) · zj (0) hwi ,xi+ρi≥b −hwi ,xi+ρi≥b j∈[d] |Sj,sure| (0) i∈Sj,sure   X (t) 1 (t) 1 gt,4(x, ρ) = (ρi − b ) (t) (t) + (b − ρi) (t) (t) hwi ,xi+ρi≥b −hwi ,xi+ρi≥b i∈[m]

54 ? ? −Ω(log2 d) Claim D.7. Prx,ρ[gt,1(µ ; x, ρ) = αhw , zi] ≥ 1 − e

(0) Proof of Claim D.7. Recall for each i ∈ Sj,sure, √ (t) (t) (t) • it satisfies i ∈ Sj,sure+ so |hwi , Mji| ≥ 2 kb ; (t) 0 (t) k (t) • it also implies i 6∈ Sj0,pot+ for any j 6= j, so |hwi , Mj0 i| ≤ dβ b ; (t) 2 (t) (t) (log log log d)3 • recall ρi ∼ N (0, (σρ ) ) for σρ = Θ(b · log d ).

(t) 2 2 (t) kwi k σx • recall hwi , ξi is a variable with variance at most O( d ) for σx = O(1). 2 Applying Lemma C.19, we know with probability at least 1 − e−Ω(log d) it satisfies (t) 2 (t) P b βΞ2 hw , 0 M 0 z 0 + ξi + |ρ | ≤ = (D.1) i j 6=j j j i 2 2 and when this happens it satisfies, whenever zj 6= 0, 1 1 (t) (t) + (t) (t) = 1 hwi ,xi+ρi≥b −hwi ,xi+ρi≥b (0) −Ω(log2 d) Summing up over all i ∈ Sj,sure and j ∈ [d], we have with probability at least 1 − e over ? ? x, ρ: gt,1(µ ; x) = αhw , zi.  1 −Ω(log2 d) Claim D.8. Prx,ρ[|gt,4(x, ρ)| ≤ O( 2 )] ≥ 1 − e Ξ2 P Proof of Claim D.8. Let us write ξ = j∈[d] Mjξj where each ξj is i.i.d. Let us write X gt,4(x, ρ) = gt,4,i(x, ρi) i∈[m]   def (t) 1 (t) 1 for gt,4,i(x, ρi) = (ρi − b ) (t) (t) + (b − ρi) (t) (t) hwi ,xi+ρi≥b −hwi ,xi+ρi≥b

We note that gt,4(x, ρ) is a random variable that depends on independent variables

ξ1, . . . , ξd, z1, . . . , zd, ρ1, . . . , ρm , so we also want to write it as gt,4(z, ξ, ρ) and gt,4,i(z, ξ, ρi). b(t) b(t) We can without loss of generality assume as if |ρi| ≤ and |ξj| ≤ 10 := B always hold, both 10 Ξ2 2 of which happen with probability at least 1 − e−Ω(log d). In the rest of the proof we condition on this happens. By symmetry we have

∀ρ: E [gt,4(z, ξ, ρ)] = 0 . ξ,z We wish to apply a high-probability version of the McDiarmid’s inequality (see Lemma H.3) to bound gt,4. In order to do so, we need to check the sensitivity of gt,4(x, ρ) regarding every random variable. 0 0 0 • For every zj, suppose we perturb it to an arbitrary zj ∈ [−1, 1]. We also write z = (z−j, zj) and x0 = Mz0 + ξ.

(t) – Now, for every i ∈ Sj,pot+, we have the naive bound 0 (t) |g (z, ξ, ρ ) − g (z , ξ, ρ )| ≤ 2b · 1 0 t,4,i i t,4,i i zj 6=zj (t) and there are at most |Sj,pot+| ≤ Ξ2 such neurons i.

55 (t) (t) k (t) – For every i 6∈ Sj,pot+, we have |hwi , Mji| ≤ dβ b . Define event   (t)  (t) X b  E = |hw , M 0 z 0 + ξi| ≥ i i j j 2  j06=j  1 1 ∗ When event Ei does not happen, we have (t) (t) = (t) 0 (t) , and thus hwi ,xi+ρi≥b hwi ,x i+ρi≥b 0 gt,4,i(z, ξ, ρi) = gt,4,i(z , ξ, ρi) .

∗ When Ei happens, using the randomness of ρi, we have (t) 0 ! h i |hw , Mji| · |zj − z | 1 1 i j Pr (t) (t) 6= (t) 0 (t) ≤ O ρ hw ,xi+ρi≥b hw ,x i+ρi≥b (t) i i i σρ k log d ≤ O · |z − z0 | dβ j j and thus (  k log d 0  0 0, w.p. ≥ 1 − O dβ |zj − zj| over ρi; |gt,4,i(z, ξ, ρi) − gt,4,i(z , ξ, ρi)| = O(b(t)), otherwise.

−Ω(log2 d) Note with probability at least 1 − e , the number of i ∈ [m] with Ei holds is at most O(kΞ2) (using Lemma C.19). Therefore, by applying Chernoff bound, we know k log d  |g (z, ξ, ρ) − g (z0, ξ, ρ)| ≤ O(b(t)) · |z − z0 | · kΞ + Ξ t,4 t,4 dβ j j 2 2

−Ω(log2 d) This means two things that both hold with probability at least 1 − e over z−j, ξ, ρ: √ 0 0 (t)  k log d  (t) 1 – For all zj, zj, |gt,4(x, ρ) − gt,4(x , ρ)| ≤ O(b ) · · kΞ2 + Ξ2 ≤ O( kb ) < o( 2 ) dβ Ξ2  2  0 2 (t) 2  k log d  0 2 2 – 0 |g (x, ρ)−g (x , ρ)| ≤ O (b ) · 0 · kΞ |z − z | + Ξ 1 0 ≤ Ezj ,zj t,4 t,4 Ezj ,zj dβ 2 j j 2 zj 6=zj  4 2 2   k Ξ2 log d 1 2 k (t) 2 1 O 2 2 + Ξ2 (b ) < o( 4 ) d β d d dΞ2 0 0 0 • For every ξj, suppose we perturb it to ξj ∈ [−B,B]. We write ξ = ξ + Mj(ξj − ξj) and x0 = Mz + ξ0.

(t) −Ω(log2 d) (t) P – Now, for every i ∈ Sj,pot+, with probability at least 1−e we have |hwi , j06=j Mj0 ξj0 i| ≤ b(t) (t) b(t) 0 10 . Therefore, if it also happens that |hwi , Mzi| ≤ 10 , then gt,4,i(z, ξ, ρi) = gt,4,i(z, ξ , ρi). In other words, we have 0 (t) 1 |gt,4,i(z, ξ, ρi) − gt,4,i(z , ξ, ρi)| ≤ 2b · (t) b(t) . |hwi ,Mzi|≥ 10 (t) Summing up over i ∈ Sj,pot+, and taking expectation in z, we have     X 0 (t) X (t) kΞ2 g (z, ξ, ρ ) − g (z , ξ, ρ ) ≤ 2b ·  1 (t)  ≤ O b E t,4,i i t,4,i i E (t) b z z  |hwi ,Mzi|≥ 10  d (t) (t) i∈Sj,pot+ i∈Sj,pot+ (t) where the last inequality uses a variant of Lemma C.7 and |Sj,pot+| ≤ Ξ2.

56 (t) (t) k (t) – For every i 6∈ Sj,pot+, we have |hwi , Mji| ≤ dβ b . Define event   (t)  (t) X b  E = |hw , Mz + M 0 ξ 0 i| ≥ i i j j 2  j06=j  1 1 ∗ When event Ei does not happen, we have (t) (t) = (t) 0 (t) , and thus hwi ,xi+ρi≥b hwi ,x i+ρi≥b 0 gt,4,i(z, ξ, ρi) = gt,4,i(z , ξ, ρi) .

∗ When Ei happens, using the randomness of ρi, we have (t) 0 ! h i |hw , Mji| · |ξj − ξ | 1 1 i j Pr (t) (t) 6= (t) 0 (t) ≤ O ρ hw ,xi+ρi≥b hw ,x i+ρi≥b (t) i i i σρ k log d ≤ O · |ξ − ξ0 | dβ j j and thus (  k log d 0  0 0, w.p. ≥ 1 − O dβ |ξj − ξj| over ρi; |gt,4,i(z, ξ, ρi) − gt,4,i(z, ξ , ρi)| = O(b(t)), otherwise.

−Ω(log2 d) Note with probability at least 1 − e , the number of i ∈ [m] with Ei holds is at most O(kΞ2) (using a minor variant of Lemma C.19). Therefore, by applying Chernoff −Ω(log2 d) bound, we know with probability at least 1 − e over z, ξ−j, ρ X k log d  | g (z, ξ, ρ ) − g (z, ξ0, ρ )| ≤ O(b(t)) · |ξ − ξ0 | · kΞ t,4,i i t,4,i i dβ j j 2 (t) i6∈Sj,pot+ −Ω(log2 d) Taking expectation over z, we have with probability at least 1 − e over ξ−j, ρ:

√ ! X 0 (t) k log d 0 gt,4,i(z, ξ, ρi) − gt,4,i(z, ξ , ρi) ≤ O(b ) · √ |ξj − ξ | · kΞ2 Ez j (t) d i6∈Sj,pot+

−Ω(log2 d) Putting the two cases together, we have with probability at least 1 − e over ξ−j, ρ:   0 (t) k log d 0 kΞ2 E gt,4(z, ξ, ρ) − E gt,4(z, ξ , ρ) ≤ O(b ) · O |ξj − ξj| · kΞ2 + z z dβ d

−Ω(log2 d) This means two things that both hold with probability at least 1 − e over ξ−j, ρ:   0 0 (t) k log d kΞ2 1 – For all ξj, ξj, | Ez gt,4(z, ξ, ρ) − Ez gt,4(z, ξ , ρ)| ≤ O(b ) · B · kΞ2 +  o( 2 ) dβ d Ξ2  2  0 2 (t) 2  k log d  0 2 k2 2 – 0 | g (z, ξ, ρ)− g (z, ξ , ρ)| ≤ O (b ) · 0 · kΞ |ξ − ξ | + 2 Ξ ≤ Eξj ,ξj Ez t,4 Ez t,4 Ezj ,zj dβ 2 j j d 2  4 2 2 2   k Ξ2 log d 1 k 2 (t) 2 1 O 2 2 + 2 Ξ2 (b )  o( 4 ) d β d d dΞ2 We are now ready to apply the high-probability version of the McDiarmid’s inequality (see Lemma H.3). We apply it twice. In the first time, we use the perturbation on z to derive that, 2 with probability at least 1 − e−Ω(log d) over z, ξ, ρ: 1 |gt,4(z, ξ, ρ) − gt,4(z, ξ, ρ)| ≤ O( ) Ez 2 Ξ2

57 In the second time, we use the perturbation on ξ to derive that, with probability at least 1 − 2 e−Ω(log d) over ξ, ρ, 1 | gt,4(z, ξ, ρ) − gt,4(z, ξ, ρ)| ≤ O( ) Ez E 2 z,ξ Ξ2 Finally, noticing that Ez,ξ gt,4(z, ξ, ρ) = 0 for every ρ, we finish the proof.  This finishes the proof of Lemma D.6a. We are only left to prove Lemma D.6b. By Lipscthiz continuity of the log(1 + e−x) function, we know with probability at least 1 − 2 e−Ω(log d),

? ? ? −y(x)·gt(µ ;x) −y(x)·αhw ,zi 1 −α|hw ,zi| 1 log 1 + e = log 1 + e ± O( 2 ) = log 1 + e ± O( 2 ) Ξ2 Ξ2 Taking expectation (and using the exponential tail) we have

h ? i h ? i −y(x)·gt(µ ;x) −α|hw ,zi| 1 E log 1 + e = E log 1 + e ± O( 2 ) Ξ2 Note if we take expectation over z, we have Z −α|hw?,zi| −αt ? E[log 1 + e ] ≤ log 1 + e ) · Pr[|hw , zi| ≤ t]dt z t≥0 x Z 1 1 1 −αt √  √ ≤ O(1) · e · t + ≤ O( 2 + ) t≥0 k α k where x uses Lemma H.1a. This finishes the proof of Lemma D.6b. 

E Why Clean Training is Non-Robust

In this section we shall show that clean training will not achieve robustness against `2 perturbation  d0.4999  0.3334 P of size τ = Ω k2 as long as k = Ω(e d ). Recall kMk1 = j∈[d] kMjk∞.

Theorem E.1 (clean training is non-robust). Suppose the high-probability initialization event (1−c0)/3 1 in Lemma B.2 holds. Suppose k > d and consider any iteration t ≥ Ω( 2 ) and t ≤ ηλΞ2 log d −Ω(log2 d) min{d /η, Tf }. With probability at least 1 − e the following holds. If we perturb every 10 ? 2 −Ω(log2 d) input x by δ = −yΞ2 (Mw )/k , then the accuracy drops below e : h  i −Ω(log2 d) Pr sign ft(x − δ) = y ≤ e , x,y=y(x),ρ h  i −Ω(log2 d) Pr sign E[ft(x − δ)] = y ≤ e . x,y=y(x) ρ Note that √ Ξ10 d Ξ10kMk kδk ≤ 2 and kδk ≤ 2 1 . 2 k2 ∞ k2 The proof of Theorem E.1 relies on the following main lemma (to be proved in Section E.1). It (t) ? says that towards the end of clean training, neurons wi have a (small) common direction in Mw . 1 def (0) Lemma E.2 (non-robust). For any iteration t ≥ Ω( 2 ), let subset S = ∪j∈[d]Sj,sure, then ηλΞ2    3 4  X (t) kd (t) 1 k Ξ hw , Mw?i = Ω and ∀i ∈ [m]: hw , Mw?i ≥ −O 2 σ2 i Ξ7 i λ d2 x i∈S 2

58 With the help of Lemma E.2, one can calculate that by perturbing input in this direction −y·Mw?, the output label of the network can change dramatically. This is the proof of Theorem E.1 and details can be found in Section E.2.

E.1 Proof of Lemma E.2: Common Direction Among Neurons Before proving Lemma E.2, let us first present Claim E.3.

Claim E.3. We have  k2 η  hw(t+1), Mw?i ≥ hw(t), Mw?i · (1 − ηλ − ηλkw(t)k) − O η σ2 + i i i d1.5 x poly(d) h i 0 (t) 1 ? + η E `t(w ; x, y, ρ) (t) (t) |hw , zi| x,ρ hwi ,xi+ρi≥b Proof of Claim E.3. Let us recall from (C.19) that η hw(t+1), M i = hw(t), M i · (1 − ηλ − ηλkw(t)k) ± i j i j i poly(d) h i 0 (t) 1 1   + η E y`t(w ; x, y, ρ) (t) (t) + (t) (t) zj + hξ, Mji x,y=y(x),ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b and therefore (t+1) X (t) X (t) η hw , w?M i ≥ hw , w?M i · (1 − ηλ − ηλkw k) − i j j i j j i poly(d) j∈[d] j∈[d] h i 0 (t) 1 1  ? + η E y`t(w ; x, y, ρ) (t) (t) + (t) (t) hw , zi x,ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b

X h 0 (t)  i − O(η) · y` (w ; x, y, ρ) 1 (t) + 1 (t) hξ, M i E t hw ,xi+ρ ≥b(t) −hw ,xi+ρ ≥b(t) j x,ρ i i i i j∈[d]

∗ ∗ (t) 2 Applying Lemma C.17b and using yhw , zi = |hw , zi| and kwi k ≤ O(Ξ2) (see Lemma C.21), we have  3 4  (t+1) X (t) X (t) k Ξ η hw , w?M i ≥ hw , w?M i · (1 − ηλ − ηλkw k) − O η 2 σ2 + i j j i j j i d2 x poly(d) j∈[d] j∈[d] h i 0 (t) 1 1  ? + η E `t(w ; x, y, ρ) (t) (t) + (t) (t) |hw , zi| x,ρ hwi ,xi+ρi≥b −hwi ,xi+ρi≥b  3 4  (t) X (t) k Ξ η ≥ hw , w?M i · (1 − ηλ − ηλkw k) − O η 2 σ2 + i j j i d2 x poly(d) j∈[d] h i 0 (t) 1 ? + η E `t(w ; x, y, ρ) (t) (t) |hw , zi| . x,ρ hwi ,xi+ρi≥b  Proof of Lemma E.2. Recall Claim E.3 says that  3 4  X (t+1) X (t) (t) k Ξ hw , Mw?i ≥ hw , Mw?i · (1 − ηλ − ηλkw k) − O η|S| 2 σ2 i i i d2 x i∈S i∈S " # 0 (t) ? X 1 + η E `t(w ; x, y, ρ)|hw , zi| (t) (t) x,ρ hwi ,xi+ρi≥b i∈S

59 (0) (t) Using Sj,sure ⊆ Sj,sure+, and a similar analysis to Lemma C.19, we know with probability at least −Ω(log2 d) P 1 1 − e it satisfies i∈S (s) (s) ≥ Ω(k). Therefore, the above inequality gives hwi ,xi+ρi≥b  3 4  X (t+1) X (t) (t) k Ξ hw , Mw?i ≥ hw , Mw?i · (1 − ηλ − ηλkw k) − O η|S| 2 σ2 i i i d2 x i∈S i∈S h i + Ω(ηk) · `0 (w(t); x, y, ρ)|hw?, zi| x,ρE t Now, using small ball probability Lemma H.1a we have

h ∗ 0 (t) i 1 0 (t) 1 Pr |hw , zi| ≤ 0.01 E[`s(w ; x, y, ρ)] ≤ E[`s(w ; x, y, ρ)] + O(√ ) . 2 k 0 0 (t) Therefore, let us abbreviate by writing `s = `s(w ; x, y, ρ), then h 0 (t) ∗ i E `s(w ; x, y, ρ) · |hw , zi| h i h i 0 ∗ ∗ 0 ∗ 0 ≥ E `s · |hw , zi| |hw , zi| ≥ 0.01 E[`s] · Pr |hw , zi| ≥ 0.01 E[`s] h i h i 0 0 ∗ 0 ∗ 0 ≥ 0.01 E[`s] · E `s |hw , zi| ≥ 0.01 E[`s] · Pr |hw , zi| ≥ 0.01 E[`s]  h i h i 0  0  0 ∗ 0 ∗ 0 = 0.01 E[`s] · E `s − E `s |hw , zi| < 0.01 E[`s] · Pr |hw , zi| < 0.01 E[`s] 0   0  h ∗ 0 i ≥ 0.01 E[`s] · E `s − Pr |hw , zi| < 0.01 E[`s]   x 0 1  0  1 1 ≥ 0.01 E[`s] · E `s − O(√ ) ≥ 5 2 k Ξ2 where the last inequality x uses Lemma C.21 and Lemma C.22. Therefore, using |S| ≤ dΞ2, we have  3 4  X (t+1) X (t) (t) k Ξ ηk hw , Mw?i ≥ hw , Mw?i · (1 − ηλ − ηλkw k) − O η|S| 2 σ2 + Ω( ) i i i d2 x Ξ5 i∈S i∈S 2 X (t) ηk ≥ hw , Mw?i · (1 − O(ηλΞ2)) + Ω( ) i 2 Ξ5 i∈S 2 1 so we conclude for every t ≥ Ω( 2 ) it satisfies ηλΞ2   X (t) kd hw , Mw?i ≥ Ω . i Ξ7 i∈S 2 As for any arbitrary i ∈ [m], we have  k3Ξ4  hw(t+1), Mw?i ≥ hw(t), Mw?i · (1 − ηλ − ηλkw(t)k) − O η 2 σ2 i i i d2 x  1 k3Ξ4  ≥ · · · ≥ −O 2 σ2 . λ d2 x 

E.2 Proof of Theorem E.1

def (0) (0) (t) Proof of Theorem E.1. Recall S = ∪j∈[d]Sj,sure from Lemma E.2 and Sj,sure ⊆ Sj,sure+ from (0) (t) −Ω(log2 d) Theorem C.2. For every i ∈ Sj,sure ⊆ Sj,sure+, we know with probability at least 1 − e , the

60 same proof of (D.1) gives

1 (t) = 1 ? and 1 (t) = 1 ? (t) wj zj >0 (t) wj zj <0 hwi ,xi+ρi≥10b −hwi ,xi+ρi≥10b (t) (t) ? √β 2 Therefore, setting δ = δ0Mw for some δ0 ∈ (0, ), and using kwi k2 ≤ Ξ2 (since Sept++ = [m]), √ d (t) 2 (t) we have |hwi , δi| ≤ δ0Ξ2 d ≤ b . Using this, we can sum up over all i ∈ S: X (t) (t) (t) (t) ReLU(hwi , x − δi + ρi − b ) − ReLU(−hwi , x − δi + ρi − b ) i∈S X (t) (t) (t) (t) X (t) 1 ? 1 ?  = ReLU(hw , xi + ρi − b ) − ReLU(−hw , xi + ρi − b ) − w zj >0 + w zj <0 hw , δi i i ji i ji i i i∈S i∈S | {z } ♣ where j ∈ [d] is the unique index such that i ∈ S(0) . We can rewrite the decrement i ji,sure X  X (t) ? ♣ = δ 1 ? + 1 ? hw , Mw i 0 wj zj >0 wj zj <0 i j∈[d] (0) i∈Sj,sure √ √ P (t) ? 2 (0) 3  Using | (0) hw , Mw i| ≤ Ξ d|S | ≤ Ξ d and 1 ? + 1 ? = 1 with proba- i 2 j,sure 2 wj zj >0 wj zj <0 i∈Sj,sure k bility Θ( d ), we can apply Bernstein’s inequality and derive √ h 3 i −Ω(log2 d) Pr |♣ − [♣]| > δ0 · Ξ kd log d ≤ e Ez 2  k Also using 1 ? + 1 ? = 1 with probability Θ( ), we can derive using Lemma E.2 that wj zj >0 wj zj <0 d

2 ! k X (t) ? k 2 E[♣] ≥ δ0 · Ω( ) hw , Mw i − O(kΞ2) · σx z d i λd1.5 i∈S  2 3 4  2 k 1 k Ξ2 2 k ≥ δ0 · Ω( 7 ) − O(kΞ2) · 2 σx ≥ δ0 · Ω( 7 ) Ξ2 λ d Ξ2 Combining the above equations andusing k > d(1−c0)/3 , we have with probability at least 1 − 2 e−Ω(log d),  k2  ♣ ≥ δ0 · Ω 7 Ξ2 √ (t) 2 (t) For the remainder terms, we using |hwi , δi| ≤ δ0Ξ2 d ≤ b /2, we have X (t) (t) (t) (t) ReLU(hwi , x − δi + ρi − b ) − ReLU(−hwi , x − δi + ρi − b ) i∈[m]\S n o X (t) (t) (t) (t) X 1 (t) ≤ ReLU(hwi , xi + ρi − b ) − ReLU(−hwi , xi + ρi − b ) − (t) b(t) min 0, hwi , δi |hwi ,xi|+|ρi|> i∈[m]\S i∈[m]\S 2 | {z } ♠

2 Using Lemma E.2 and Lemma C.19 we have with probability at least 1 − eΩ(log d),  3 4   3 4  1 k Ξ2 2 X 1 1 k Ξ2 2 ♣ ♠ ≥ −δ0 · O 2 σx · (t) b(t) ≥ −δ0 · O 2 σx · O(kΞ2) ≥ − . λ d |hwi ,xi|+|ρi|> λ d 2 i∈[m]\S 2

61 Putting together the bounds for ♣ and ♠ we have X (t) (t) (t) (t) ft(x − δ) = ReLU(hwi , x − δi + ρi − b ) − ReLU(−hwi , x − δi + ρi − b ) i∈[m]  2  X (t) (t) ♣ k ≤ ReLU(hw , xi + ρ − b(t)) − ReLU(−hw , xi + ρ − b(t)) − ≤ f (x) − δ · Ω i i i i 2 t 0 Ξ7 i∈[m] 2

10 Ξ2 2 In other words, choosing δ0 = k2 , then combining with |ft(x)| ≤ O(Ξ2 log d) from Lemma C.21, we immediately have ft(x − δ) < 0. Using an analogous proof, one can also show that ft(x + δ) > 0. Therefore, if we choose a ∗ perturb direction −δ0yMw = −yδ, we have h ∗  i −Ω(log2 d) Pr sign ft(x − δ0yMw ) = y ≤ e . x,y=y(x),ρ √ −Ω(log2 d) ∗ ∗ This means the robust accuracy is below e . Finally, using kMw k2 ≤ O( d) and kMw k∞ ≤ P O( j∈[d] kMjk∞) = O(kMk1) finishes the proof. Note that a similar proof as above also shows h ∗  i −Ω(log2 d) Pr sign E[ft(x − δ0yMw )] = y ≤ e . x,y=y(x) ρ 

F Robust Training Through Local Feature Purification

6 dΞ2 Suppose we run clean training for Tf ≥ Ω( η ) iterations following Theorem D.1. From this itera- tion on, let us perform T more steps of robust training. During the robust training phase, let us consider an arbitrary (norm-bounded) adversarial perturbation algorithm A. Recall from Definition 4.2 that, given the current network f (which includes hidden weights {wi}, output weights {ai}, bias {bi} and smoothing parameter σρ), an input x, a label y, and some internal random string r, the perturbation algorithm A outputs a vector satisfying

kA(f, x, y, r)kp ≤ τ . for some `p norm. Our two main theorems below apply to all such perturbation algorithms A (including the Fast Gradient Method, FGM).

Theorem F.1 (`2-adversarial training). In the same setting as Theorem D.1, suppose we first 6 dΞ2 log d run Tf iterations of clean training with Ω( η ) ≤ Tf ≤ d /η and obtain

 (Tf )  Objclean = E ObjT (w ; x, y, ρ) ≤ o(1) . x,y=y(x),ρ f

2(1−2c ) d 0 2.5 1−2c0 Next, suppose σx ≤ min{O(1), k5.5 } and k < d / log d. Starting from iteration Tf , k2Ξ4m log d 2 suppose we perform robust training for additional T = T = Θ( 2 ) ≤ O( k ) iterations, g ηd d1−2c0 def 1 −Ω(log2 d) against some `2 perturbation algorithm A with radius τ = √ . With probability ≥ 1−e , k·dc0

Tf +T −1 1 X  (t)  E Objt(w ; x + δ, y, ρ) ≤ Objclean + o(1) T x,y=y(x),δ=A(ft,x,y,r),ρ t=Tf

62 Corollary F.2. In Theorem F.1, if A is Fast Gradient Method (FGM) with `2 radius τ, and  (t)  E Objt(w ; x + δ, y, ρ) ≤ o(1) x,y=y(x),δ=A(ft,x,y),ρ for some t ∈ [Tf ,Tf + Tg]. Then,   d (t) Pr ∃δ ∈ R , kδk2 ≤ τ : sign(E ft(w ; x + δ, ρ)) 6= y ≤ o(1) . x,y=y(x) ρ

Corollary F.3. Consider for instance σx = 0, c0 = 0.00001, and sufficiently large d > 1. 0.0001 0.3999 1 • For k ∈ [d , d ], robust training gives 99.9% accuracy against `2 perturbation ≥ k0.5·d0.0001 . 0.3334 d0.5001 • For k ≥ d , clean training gives 0.01% accuracy against `2 perturbation radius ≤ k2 . 0.3334 0.3999 • For k ∈ [d , d ], robust training provably beats clean training in `2 robust accuracy.

Theorem F.4 (`∞-adversarial training). In the same setting as Theorem D.1, suppose we first 6 dΞ2 log d run Tf iterations of clean training with Ω( η ) ≤ Tf ≤ d /η and obtain

 (Tf )  Objclean = E ObjT (w ; x, y, ρ) ≤ o(1) . x,y=y(x),ρ f

2.5 1−2c0 Next, suppose σx ≤ O(1) and k < d / log d. Starting from iteration Tf , suppose we perform k2Ξ4m log d 2 robust training for additional T = T = Θ( 2 ) ≤ O( k ) iterations, against some ` g ηd d1−2c0 ∞ def 1 −Ω(log2 d) perturbation algorithm A of radius τ = 1.75 c . Then, with probability ≥ 1 − e , k ·kMk∞·d 0

Tf +T −1 1 X  (t)  E Objt(w ; x + δ, y, ρ) ≤ Objclean + o(1) T x,y=y(x),δ=A(ft,x,y,r),ρ t=Tf

Corollary F.5. In Theorem F.4, if A is Fast Gradient Method (FGM) with `∞ radius τ, and  (t)  E Objt(w ; x + δ, y, ρ) ≤ o(1) x,y=y(x),δ=A(ft,x,y),ρ for some t ∈ [Tf ,Tf + Tg]. Then,   d (t) Pr ∃δ ∈ R , kδk∞ ≤ τ : sign(E ft(w ; x + δ, ρ)) 6= y ≤ o(1) . x,y=y(x) ρ Corollary F.6. 0.0001 0.3999 • For k ∈ [d , d ], robust training gives 99.9% accuracy against `∞ perturbation ≥ 1 1.75 0.0001 . k ·d ·kMk∞ 0.0001 0.3334 d kMk1 • For k ≥ d , clean training gives 0.01% accuracy against `∞ perturbation ≤ k2 . 0.3334 0.3999 0.1248 • For k ∈ [d , d ] and when kMk∞, kMk1 ≤ d , robust training provably beats clean training in `∞ robust accuracy. Remark F.7. With additional efforts, one can also prove that Theorem F.1 and Theorem F.4 holds with high probability for all T in the range T = Θ(Tg). We do not prove it here since it is not beyond the scope of this paper.

63 F.1 Some Notations We first note some simple structural properties that are corollaries of Theorem C.2.

Proposition F.8. At iteration t = Tf , for every neuron i ∈ [m], we can write (t) def def X wi = gi + ui = αi,jMr + ui j∈Si

2 (0) 2 kΞ2 where Si ⊆ {j ∈ [d] | i ∈ Sj,pot} with |Si| = O(1), |αi,j| ≤ O(Ξ2) and maxj∈[d]{|hui, Mji|} = d .

(t) Proof. We can let αi,j = hwi , Mji and let ui be the remaining part. We have |Si| ≤ O(1) because (0) (t) (t) 2 Sept = [m]. We have |αi,j| = |hwi , Mji| ≤ kwi k ≤ O(Ξ2). We also have 2 (t) k (t) kΞ2 max{|hui, Mji|} = max {|hwi , Mji|} ≤ b ≤ . j∈[d] (0) dβ d  j∈[d]: i6∈Sj,pot+ We next introduce an important notation that shall be used throughout the proofs of this section.

(t) (t) (t) def (t) Definition F.9. For every t ≥ Tf , we write wi = gi + vi by defining vi = wi − gi.

F.2 Robust Coupling (t) (t) Definition F.10 (robust coupling). At every iteration t, recalling wi = gi + vi , we define a linear function in µ m  X 1 (t) gt(µ; x, x0, ρ) = (t) (t) · (hgi + µi, xi + ρi − b ) hgi+vi ,x0i+ρi≥b i=1  1 (t) − (t) (t) · (−hgi + µi, xi + ρi − b ) −hgi+vi ,x0i+ρi≥b and it equals the output of the real network at point µ = v(t) both on its zero and first order:

gt(µ; x + δ, x + δ, ρ) µ=v(t) = ft(w; x + δ, ρ) w=w(t)

∇µgt(µ; x + δ, x + δ, ρ) µ=v(t) = ∇wft(w; x + δ, ρ) w=w(t)

We shall show in this section that, recalling w(t) = g + v(t), then (t) (t) gt(v ; x + δ, x, ρ) ≈ ft(w ; x + δ, ρ)

(Tf ) gt(0; x + δ, x, ρ) ≈ ft(w , x, ρ)

However, as regarding how close they are, it depends on whether we have an `2 bound or `∞ bound on δ, so we shall prove the two cases separately in Section F.2.1 and F.2.2. It is perhaps worth nothing that the “closeness” of the above terms depend on two things, P (t) 2 • One is regarding how small i∈[m] kvi k2 is, and this shall later be automatically guaranteed via implicit regularization of first-order methods. (t) (t) • The other is regarding how small kvi k2 or kvi k1 is for every individual neuron i ∈ [m]. This is a bit non-trivial to prove, and we shall spend the entire Section F.3 to deal with this.

64 F.2.1 Robust Coupling for `2 Perturbation P (t) 2 2 (t) Lemma F.11. Suppose at iteration t, i∈[m] kvi k2 ≤ r m for some r ≤ 1, and suppose maxi∈[m] kvi k2 ≤ 0 d r . Then for any vector δ ∈ R that can depend on x (but not on ρ) with kδk2 ≤ τ for some b τ ≤ o( 2 0 ), we have Ξ2+r  5 2 0 2 2  h (t) (t) i 2 Ξ2 (Ξ2 + r ) r m E gt(v ; x + δ, x, ρ) − ft(w ; x + δ, ρ) ≤ O(τ ) · + 2 x,ρ σρ db σρ  kΞ2  As a corollary, in the event of r ≤ O √ 2 and r0 ≤ 1 and using m = d1+c0 , we have d  5 3.5  h (t) (t) i 2 Ξ2 k E gt(v ; x + δ, x, ρ) − ft(w ; x + δ, ρ) ≤ O(τ ) · + 1−2c x,ρ σρ d 0

(t) (t) Proof of Lemma F.11. Let us abbreviate the notations by setting vi = vi and b = b . (t) (t) To upper bound |gt(v ; x + δ, x, ρ) − ft(w ; x + δ, ρ)| it suffices to upper bound |V1 − V2| for X 1 V1 := (hgi + vi, x + δi + ρi − b) hgi+vi,x+δi+ρi≥b i∈[m] X 1 V2 := (hgi + vi, x + δi + ρi − b) hgi+vi,xi+ρi≥b i∈[m] (and one also needs to take into account the reverse part, whose proof is analogous). k  We first make some calculations. Using the definition of gi, we have Prx[hgi, xi ≥ |b|/10] ≤ O d for every i ∈ [m]. Thus, we can easily calculate that21     X 21 2 X 2 1  2 2 k E  hvi, δi hg ,xi≥|b|/10 ≤ τ kvik E hg ,xi≥|b|/10 = O τ · r m · (F.1) x i x i d i∈[m] i∈[m]   X 2 2 2 4 0 2 X   (hvi, δi + hgi, δi )1hv ,xi≥|b|/10 ≤ τ · O(Ξ + (r ) ) 1hv ,xi≥|b|/10 Ex  i  2 Ex i i∈[m] i∈[m]  2  2 2 0 2 X hvi, xi ≤ τ · O((Ξ2 + r ) ) · O E x b2 i∈[m]  r2m = O τ 2(Ξ2 + r0)2 · (F.2) 2 db2

X 2 X 2 2 X > 2 5 hgi, δi 1hg ,xi≥|b|/10 ≤ hgi, δi ≤ τ gig ≤ O(τ Ξ ) i i 2 i∈[m] i∈[m] i∈[m] spectral−norm (F.3) Now, for every i ∈ [m], b b • Case 1, |hvi, xi| ≤ 10 and |hgi, xi| ≤ 10 both happen. In this case, it must satisfy |hgi +vi, δi| ≤ 2 0 b −Ω(log2 d) (kgik+kvik)·τ ≤ O(Ξ2 +r )·τ ≤ 10 . . Also, with probability at least 1−e , it satisfies

21 P > Here, the spectral norm bound of i∈[m] gigi holds for the following reason. Each gi is a sparse vector supported > 2 only on |Si| = O(1) coordinates, and thus gigi  Di holds for a diagonal matrix Di that where [Di]j,j = kgik ≤ 4 (0) O(Ξ2) for j ∈ Si and [Di]j,j = 0 otherwise. Now, using the fact that |Sj,pot| ≤ Ξ2, we immediately have that 5 D1 + ··· + Dm  O(Ξ2) · Id×d.

65 b |ρi| ≤ 10 . To sum up, with high probability we have 1 1 hgi+vi,x+δi+ρi≥b = hgi+vi,xi+ρi≥b = 0 b b 1 • Case 2, either |hvi, xi| > 10 or |hgi, xi| > 10 . In this case, to satisfy hgi+vi,x+δi+ρi≥b 6= 1 hgi+vi,xi+ρi≥b, one must have |hgi + vi, x + δi − b + ρi| ≤ |hvi, δi| + |hgi, δi|. Also, using the randomness of ρi, we have   1 1  |hvi, δi| + |hgi, δi| Pr hgi+vi,x+δi+ρi≥b 6= hgi+vi,xi+ρi≥b ≤ O ρi σρ Together, we have   X  1 1 E [|V1 − V2|] ≤ E  |hvi, δi| + |hgi, δi| hg +v ,x+δi+ρ ≥b − hg +v ,xi+ρ ≥b  x,ρ,δ x,δ,ρ i i i i i i i∈[m]   2 2 X hvi, δi + hgi, δi 1 1  1 ≤ O(1) · E  hvi,xi≥|b|/10 + hgi,xi≥|b|/10  + (F.4) x,δ,ρ σρ poly(d) i∈[m]  5 2 2 0 2 2  2 Ξ2 kr m (Ξ2 + r ) r m ≤ O(τ ) · + + 2 σρ dσρ db σρ Ξ5 (Ξ2 + r0)2r2m ≤ O(τ 2) · 2 + 2 . 2  σρ db σρ

(t) Lemma F.12. Suppose at iteration t, max {|hu , M i|} ≤ √r and P kv k2 ≤ r2m i∈[m],j∈[d] i j d i∈[m] i 2 d with r ≤ 1. Then for any vector δ ∈ R that can depend on x (but not on ρ) with kδk2 ≤ τ, we have r ! mr2 r log d r2m (Tf ) √ 3 E [|gt(0; x + δ, x, ρ) − ft(w , x, ρ)|] ≤ O + kΞ2 + τΞ2 kΞ2 + 2 x,ρ dσρ d db

 kΞ2  As a corollary, in the event of r ≤ O √ 2 and using m = d1+c0 , we have d  k2.5 q  (Tf ) 7 E [|gt(0; x + δ, x, ρ) − ft(w , x, ρ)|] ≤ O + τ · kΞ x,ρ d1−2c0 2

(T ) Proof of Lemma F.12. To upper bound |gt(0; x + δ, x, ρ) − ft(v f , x)| it suffices to upper bound |V3 − V4| for X 1 V3 := (hgi, x + δi − b + ρi) hgi+vi,xi+ρi≥b i∈[m] X 1 V4 := (hgi + ui, xi − b + ρi) hgi+ui,xi+ρi≥b i∈[m] (and one also needs to take into account the reverse part, whose proof is analogous). Let us first define X 1 V5 := (hgi, xi − b + ρi) hgi+vi,xi+ρi≥b i∈[m] P 1 Let us define s = i∈[m] hgi+vi,xi+ρi≥b. By the properties that (1) gi is only supported on Si 2 with |Si| ≤ O(1), (2) for each j ∈ [d] at most Ξ2 of the gi are supported on i, and (3) kgik2 ≤ O(Ξ2),

66 we can obtain

√ X 1 3 3 q E gi hg +v ,xi+ρ ≥b ≤ O(Ξ2) E [ s] ≤ O(Ξ2) E [s] x,ρ i i i x,ρ x,ρ i∈[m] 2 At the same time, we know     2 X 1 1 1 X E[hvi, xi ] E [s] ≤ E  hg ,xi≥ b + hv ,xi≥ b  + ≤ O(kΞ2) + O   x,ρ x,ρ i 4 i 4 poly(d) b2 i∈[m] i∈[m]  1  ≤ O kΞ + r2m · 2 db2 Putting them together, we have

r 2 ! X 1 3 r m E |V3 − V5| ≤ τ · E gi hg +v ,xi+ρ ≥b ≤ O τΞ2 kΞ2 + (F.5) x,ρ x,ρ i i i db2 i∈[m] 2 Next, let us define X 1 V6 := (hgi, xi − b + ρi) hgi,xi+ρi≥b i∈[m] Using a similar analysis to (F.4), we have X |V5 − V6| ≤ |hgi, xi − b + ρi| · 1hg ,xi+ρ ≥b − 1hg +v ,xi+ρ ≥b x,ρE x,ρE i i i i i i∈[m] X ≤ |hvi, xi| [|1hg ,xi+ρ ≥b − 1hg +v ,xi+ρ ≥b|] x,ρE E i i i i i i∈[m]  2   2  hvi, xi r m ≤ O E ≤ O (F.6) x,ρ σρ dσρ Finally, we also have

X 1 1 E |V6 − V4| = E (hgi + ui, xi − b + ρi) hg +u ,xi+ρ ≥b − (hgi, xi − b + ρi) hg ,xi+ρ ≥b x,ρ x,ρ i i i i i i∈[m]

X 1 X 1 1  ≤ E hui, xi hg +u ,xi+ρ ≥b + E (hgi, xi − b + ρi) hg +u ,xi+ρ ≥b − hg ,xi+ρ ≥b x,ρ i i i x,ρ i i i i i i∈[m] i∈[m] x  2  X 1 mr ≤ E |hui, xi| hgi+ui,xi+ρi≥b + O x,ρ dσρ i∈[m] y  2  X 1 1  mr 1 ≤ E |hui, xi| hgi,xi≥b/4 + hui,xi≥b/4 + O + x,ρ dσρ poly(d) i∈[m]

Above, inequality x is due to a similar analysis as (F.6), and inequality y is because |ρi| ≤ b/4 2 with probability at least 1 − e−Ω(log d). Next, let us recall hu , M i ≤ √r and thus, by Bernstein’s i j d 2 inequality, with probability at least 1 − e−Ω(log d), r log2 d b |hui, xi| ≤ O √  . d 4

67 Putting this back we have mr2 r log d E |V6 − V4| ≤ O + kΞ2 √ x,ρ dσρ d

Combining the bounds on |V6 − V4|, |V3 − V5|, and |V5 − V6| finishes the proof. 

F.2.2 Robust Coupling for `∞ Perturbation P (t) 2 2 (t) Lemma F.13. Suppose at iteration t, i∈[m] kvi k2 ≤ r m for some r ≤ 1, and suppose maxi∈[m] kvi k1 ≤ 0 d r . Then for any vector δ ∈ R that can depend on x (but not on ρ) with kδk∞ ≤ τ for some b τ ≤ o( 2 0 ), we have Ξ2+r  5 2 0 2 2 0 2  h (t) (t) i 2 kΞ2 (Ξ2 + r ) r m (r ) kΞ2 E gt(v ; x + δ, x, ρ) − ft(w ; x + δ, ρ) ≤ O(τ ) · + 2 + x,ρ σρ db σρ σρ  kΞ2  As a corollary, in the event of r ≤ O √ 2 and r0 ≤ O(kΞ2 · kMk ) and using m = d1+c0 , we d 2 ∞ have h i (t) (t) 2 3.5 c0 2 gt(v ; x + δ, x, ρ) − ft(w ; x + δ, ρ) ≤ O(τ ) · k d · kMk x,ρE ∞ Proof of Lemma F.13. The proof is analogous to Lemma F.11 so we only highly the differences. In fact, we only need to change (F.1), (F.2) and (F.3) with the following calculations. k  Using the definition of gi, we have Prx[hgi, xi ≥ |b|/10] ≤ O d for every i ∈ [m] as well as P 1 22 i∈[m] Ex[ hgi,xi≥|b|/10] ≤ O (kΞ2). Thus, we can easily calculate that   X 2 2 X 0 2   2 0 2  hvi, δi 1hg ,xi≥|b|/10 ≤ τ (r ) 1hg ,xi≥|b|/10 = O τ · (r ) · kΞ2 Ex  i  Ex i i∈[m] i∈[m]   X 2 2 2 4 0 2 X   (hvi, δi + hgi, δi )1hv ,xi≥|b|/10 ≤ τ · O(Ξ + (r ) ) 1hv ,xi≥|b|/10 Ex  i  2 Ex i i∈[m] i∈[m]  2  2 2 0 2 X hvi, xi ≤ τ · O((Ξ2 + r ) ) · O E x b2 i∈[m]  r2m = O τ 2(Ξ2 + r0)2 · 2 db2   X 2 2 4 X   2 5 hgi, δi 1hg ,xi≥|b|/10 ≤ O(τ Ξ ) 1hg ,xi≥|b|/10 ≤ O(τ kΞ ) Ex  i  2 Ex i 2 i∈[m] i∈[m] Putting those into the rest of the proof (to replace (F.1), (F.2) and (F.3)) finishes the proof.  (t) Lemma F.14. Suppose at iteration t, max {|hu , M i|} ≤ √r and P kv k2 ≤ r2m i∈[m],j∈[d] i j d i∈[m] i 2 d with r ≤ 1. Then for any vector δ ∈ R that can depend on x (but not on ρ) with kδk∞ ≤ τ, we

22 P > Here, the spectral norm bound of i∈[m] gigi holds for the following reason. Each gi is a sparse vector supported > 2 only on |Si| = O(1) coordinates, and thus gigi  Di holds for a diagonal matrix Di that where [Di]j,j = kgik ≤ 4 (0) O(Ξ2) for j ∈ Si and [Di]j,j = 0 otherwise. Now, using the fact that |Sj,pot| ≤ Ξ2, we immediately have that 5 D1 + ··· + Dm  O(Ξ2) · Id×d.

68 have mr2 r log d  r2m (Tf ) √ 3 E [|gt(0; x + δ, x, ρ) − ft(w ; x, ρ)|] ≤ O + kΞ2 + τΞ2 kΞ2 + 2 x,ρ dσρ d db  kΞ2  As a corollary, in the event of r ≤ O √ 2 and using m = d1+c0 , we have d  k2.5  (Tf ) 4 E [|gt(0; x + δ, x, ρ) − ft(w ; x, ρ)|] ≤ O + τ · kΞ2 x,ρ d1−2c0 Proof of Lemma F.14. The proof is analogous to Lemma F.12 so we only highly the differences. Recall we have defined X 1 V3 := (hgi, x + δi − b + ρi) hgi+vi,xi+ρi≥b i∈[m] X 1 V4 := (hgi + ui, xi − b + ρi) hgi+ui,xi+ρi≥b i∈[m] X 1 V5 := (hgi, xi − b + ρi) hgi+vi,xi+ρi≥b i∈[m] X 1 V6 := (hgi, xi − b + ρi) hgi,xi+ρi≥b i∈[m]

The bounds on E |V5 −V6| and E |V6 −V4| state exactly the same comparing to Lemma F.12 (because they do not have δ involved). Let us now recalculate the difference |V3 − V5|. P 1 Let us define s = i∈[m] hgi+vi,xi+ρi≥b. By the properties that (1) gi is only supported on Si 2 with |Si| ≤ O(1), (2) for each j ∈ [d] at most Ξ2 of the gi are supported on i, and (3) kgik2 ≤ O(Ξ2), we can obtain

X 1 3 E gi hg +v ,xi+ρ ≥b ≤ O(Ξ2) E [s] x,ρ i i i x,ρ i∈[m] 1 At the same time, we know     2 X 1 1 1 X E[hvi, xi ] E [s] ≤ E  hg ,xi≥ b + hv ,xi≥ b  + ≤ O(kΞ2) + O   x,ρ x,ρ i 4 i 4 poly(d) b2 i∈[m] i∈[m]  1  ≤ O kΞ + r2m · 2 db2 Putting them together, we have

  2  X 1 3 r m E |V3 − V5| ≤ τ · E gi hg +v ,xi+ρ ≥b ≤ O τΞ2 kΞ2 + x,ρ,δ x,ρ i i i db2 i∈[m] 1 Using this new bound on E[V3 − V5] to replace the old one (F.5), the rest of the proof follows. 

F.3 Individual Neuron Growth Lemma

(Tf +T ) As mentioned earlier, the purpose of this section is to upper bound maxi∈[m] kvi k2 (if it is `2 (Tf +T ) perturbation) or maxi∈[m] kvi k1 (if it is `∞ perturbation) during the course of robust training. We have two subsections to deal with the two cases.

69 F.3.1 Growth Lemma for `2 Perturbation (t) We first bound kvi k2 during `2 robust training. 0 Lemma F.15 (movement bound). Suppose at iteration t, maxi∈[m] kvik2 ≤ r . Let ` ∈ [−1, 1] be d any random variable that can depend on x, ρ, and δ ∈ R be any random vector that can depend on x with kδk2 ≤ τ. Then, for every i ∈ [m], √ √ ! ! k (r0)2 log d k (r0)2 k r0 `1 (x + δ) ≤ O + τ + + √ + σ log d + E hgi+vi,x+δi+ρi≥b 2 2 x x,y=y(x),ρ 2 d db d db d db

As a corollary, suppose we run robust training from iteration Tf to Tf + T with T η ≤ o(db), 2 2 τ ≤ √ 1 and σ ≤ o( d√b ), then k log d x (T η)2 k log d √ 2 ! (Tf +T ) kΞ2 k max kvi k2 ≤ O √ + T η · ≤ o(1) i∈[m] d d

Proof of Lemma F.15. First of all we can reuse the analysis of (F.4) and derive that 1 Pr[hg + v , x + δi + ρ ≥ b] ≤ + Pr[hg , xi ≥ |b|/10] + Pr[hv , xi ≥ |b|/10] i i i poly(d) i i k (r0)2  ≤ O + =: κ d db2 This immediately gives  1  1  k E ` hgi+vi,x+δi+ρi≥bδ k2 ≤ τ · E hgi+vi,x+δi+ρi≥b ≤ κτ . def  1  Next, in order to bound the norm of φ = E ` hgi+vi,x+δi+ρi≥bx , we first inner product it with Mj for each j ∈ [d]. This gives  1  |hφ, Mji| = E ` hgi+vi,x+δi+ρi≥bhx, Mji 1 ≤ [(1 + 1 ) · |hx, M i|] + E hgi,xi≥b/10 hvi,xi≥b/10 j poly(d) 1 ≤ [(1 + 1 + 1 ) · |hx, M i|] + (F.7) E hgi,xi≥b/10 hvi,x−Mj zj i≥b/20 hvi,Mj izj ≥b/20 j poly(d) We bound the three terms separately. • For the first term, 1 1 log d 1 k log d E[ hg ,xi≥b/10 · |hx, Mji|] ≤ E[ hg ,xi≥b/10 · (|zj| + O( √ ))] ≤ E[ hg ,xi≥b/10 · |zj|] + O( )) i i d i d1.5 P Using the property of g we have 1 ≤ 0 1 for |S | ≤ O(1) and therefore i hgi,xi≥b/10 j ∈Si zj0 6=0 i    √ X k/d, if j ∈ Si; [1 · |zj|] ≤ 1 · |zj| = E hgi,xi≥b/10 E  zj0 6=0  k1.5/d2, if j 6∈ S . 0 i j ∈Si Therefore, we have   X 2 k [1 · |hx, M i|] ≤ O (F.8) E hgi,xi≥b/10 j d2 j∈[d]

70 • For the second term, h i h i 1 1 E hvi,x−Mj zj i≥b/20 · |hx, Mji| = E hvi,x−Mj zj i≥b/20 · |zj + hMj, ξi| h i   1 log d ≤ E hv ,x−M z i≥b/20 · O E[|zj|] + √ σx i j j d √  2! ! E hvi, x − Mjzji k log d ≤ O · O + √ σx b2 d d √ ! (r0)2  k log d ≤ O · O + √ σx db2 d d and therefore X  h i2 (r0)4  k  1 · |hx, M i| ≤ O · + σ2 log2 d (F.9) E hvi,x−Mj zj i≥b/20 j d2b4 d x j∈[d]

• For the third term,    1 1 log d E[ hv ,M iz ≥b/20 · |hx, Mji|] ≤ E hv ,M iz ≥b/20 · |zi| + O( √ ) i j j i j j d   |hvi, Mjizj| |hvi, Mjizj| log d ≤ E · |zi| + · O( √ ) b b d √ ! 1 k log d  1  ≤ |hv , M i| · O + ≤ |hv , M i| · O i j db bd1.5 i j db and therefore X  2 (r0)2  [1 · |hx, M i|] ≤ O (F.10) E hvi,Mj izj ≥b/20 j d2b2 j∈[d]

Putting (F.7), (F.8), (F.9), (F.10) these together, we have X  k (r0)4 k  (r0)2  kφk2 ≤ |hφ, M i|2 ≤ O + + σ2 log2 d + j d2 d2b4 d x d2b2 j∈[d] Summing everything up, we have √ √ ! ! k (r0)2  k (r0)2 k r0 `1 (x + δ) ≤ O + τ + + √ + σ log d + E hgi+vi,x+δi+ρi≥b 2 2 x x,y=y(x),ρ 2 d db d db d db

Now, suppose we run robust training for t = Tf ,Tf + 1,...,Tf + T − 1 and suppose for all of them (T ) 0 we have kvi k2 ≤ r satisfied. Then, using the gradient update formula (see e.g. (C.19)) √ √  0 2  0 2 ! 0 ! (Tf +T ) (Tf ) k (r ) k (r ) k r kv k2 ≤ kv k2 + T η · O + τ + + √ + σx log d + i i d db2 d db2 d db

(Tf +T ) 0 0 This means, in order to show kvi k2 ≤ r we can choose any r > 0 satisfying √ √ 2  0 2  0 2 ! 0 ! kΞ2 k (r ) k (r ) k r 0 O( √ ) + T η · O + τ + + √ + σx log d + ≤ r . d d db2 d db2 d db

2.5 2 Using the assumption of T η ≤ o(db) (which also implies (T η)2 ≤ o( d b )), τ ≤ √ 1 (which also k k log d

71 2 2 2 2 implies τ ≤ o( d√b )), and σ ≤ o( d√b ), we can choose (T η)2 k log d x (T η)2 k log d √ ! kΞ2 k r0 ≤ O( √ 2 ) + T η · O . d d 

F.3.2 Growth Lemma for `∞ Perturbation (t) def We now bound kvi k1 during `∞ robust training. Recall kMk∞ = maxj∈[d] kMjk1. 0 Lemma F.16 (movement bound). Suppose at iteration t, maxi∈[m] kvik1 ≤ r . Let ` ∈ [−1, 1] be d any random variable that can depend on x, ρ, and δ ∈ R be any random vector that can depend on x with kδk∞ ≤ τ. Then, for every i ∈ [m], k (r0)2  k `1 (x + δ) k ≤ O + · (τd + kMk log d) E hgi+vi,x+δi+ρi≥b 1 d db2 ∞ db2 As a corollary, suppose we run robust training from iteration Tf to Tf + T with T η ≤ 2 3 kMk∞kΞ2 b2  and τ ≤ o 2 , then T η·kΞ2kMk∞

(Tf +T ) 2 max kvi k1 ≤ O(kΞ2 · kMk∞) i∈[m] 0 Proof of Lemma F.16. Similar to the proof of Lemma F.15, and using kvik2 ≤ kvik1 ≤ r , we have 1 Pr[hg + v , x + δi + ρ ≥ b] ≤ + Pr[hg , xi ≥ |b|/10] + Pr[hv , xi ≥ |b|/10] i i i poly(d) i i k (r0)2  ≤ O + =: κ d db2 This implies  1  k E ` hgi+vi,x+δi+ρi≥bδ k1 ≤ κ · τd  1  d On the other hand, let us look at h := E ` hgi+vi,x+δi+ρi≥bx and set u = sign(h) ∈ {−1, 1} . We have  1  khk1 = E ` hgi+vi,x+δi+ρi≥bhx, ui −Ω(log d) Since with probability at least 1 − e it satisfies |hx, ui| = O(maxj∈[d] kMjk1 log d), we can conclude that khk1 = O(κkMk∞ log d). Together we have  0 2    k (r ) `1hg +v ,x+δi+ρ ≥b(x + δ) ≤ O + · (τd + kMk∞ log d) E i i i 1 d db2

Now, suppose we run robust training for t = Tf ,Tf + 1,...,Tf + T − 1 and suppose for all of them (T ) 0 we have kvi k1 ≤ r satisfied. Then, using the gradient update formula (see e.g. (C.19)) k (r0)2  kv(Tf +T )k ≤ kv(Tf )k + T η · O + · (τd + kMk log d) i 1 i 1 d db2 ∞

2 (Tf ) kΞ2 Recalling |hvi , Mji| ≤ d from (F.8), we have

(Tf ) X (Tf ) 2 vi = hvi , Mji · Mj ≤ kΞ2 · kMk∞ 1 j∈[d] 1

72

(Tf ) 0 0 This means, to prove that vi ≤ r , we can choose any r satisfying 1 k (r0)2  kΞ2 · kMk + T η · O + · (τd + kMk log d) ≤ r0 2 ∞ d db2 ∞ db2 b2  and using the assumption of T η ≤ 2 3 (which implies T η ≤ O(d)), and τ ≤ o 2 kMk∞kΞ2 T η·kΞ2kMk∞ 1 (which implies τ ≤ ηT ), we can choose

0 2 2 r ≤ 2kΞ2 · kMk∞ + T η · O (τk) ≤ O(kΞ2 · kMk∞) . 

F.4 Robust Convergence We are now ready to prove the main convergence theorem (that is, Theorem F.1 and F.4) for robust learning. Let us first calculate a simple bound: √ P (Tf ) 4 Claim F.17. i∈[m] Reg(gi) − Reg(wi ) ≤ O(k dΞ2)

kΞ2 Proof. Recalling kg k ≤ O(Ξ2) and ku k ≤ O( √ 2 ) from Proposition F.8, we have i 2 2 i d 4 3 3 2 2  kΞ2 kgik − kgi + uik ≤ O kuik · (kuik + kgik ) ≤ O( √ ) . d 

F.4.1 Robust Convergence for `2 Perturbation

(t+1) (t) (t) Proof of Theorem F.1. Since wi = wi − η∇wi RobObj^ t(w ), we have the identity η2 1 1 ηh∇RobObj^ (w(t)), w(t) − gi = k∇RobObj^ (w(t))k2 + kw(t) − gk2 − kw(t+1) − gk2 t 2 t F 2 F 2 F Applying (a variant of) Lemma A.2 (which requires us to use the Lipscthiz continuity assumption on A, see Definition 4.2), we know that by letting   RobObjt(w) = E Objt(w; x + δ, y, ρ) , x,y=y(x),δ,ρ it satisfies 1 1 η ηh∇RobObj (w(t)), w(t) − gi ≤ η2 · poly(d) + kw(t) − gk2 − kw(t+1) − gk2 + (F.11) t 2 F 2 F poly(d) Let us also define the clean objective and the pseudo objective as follows:   Objt(w) = E Objt(w; x, y, ρ) x,y=y(x),ρ

0  −y·gt(µ;x+δ,x,ρ)  X RobObjt(µ) = E log(1 + e ) + λ Reg(gi + µi) , x,y=y(x),δ,ρ i∈[m] which is a convex function in µ because gt(µ; x + δ, x, ρ) is linear in µ. Now, we inductively prove that at every iteration t ∈ [Tf ,Tf + T ], it satisfies  2  X (t) kΞ kv k2 ≤ r2m for r = Θ √ 2 (F.12) i 2 d i∈[m] (t) 0 0 max kvi k2 ≤ r for r = 1 (F.13) i∈[m]

73 In the base case t = Tf this is obvious due to Proposition F.8. Next, suppose (F.12) and (F.13) (t) (t) t hold at iteration t. Using the notation wi = gi + v and the Lipscthiz continuity of log(1 + e ), we have23 0 (t) (t) h (t) (t) i E[|RobObjt(v ) − RobObjt(w )|] ≤ E gt(v ; x + δ, x, ρ) − ft(w , x + δ)  5 3.5  2 Ξ2 k ≤ O(τ ) · + 1−2c (using Lemma F.11) σρ d 0  1  ≤ O (using τ ≤ √ 1 ) log d k·dc0

0 (Tf ) (Tf ) X (Tf ) E[|RobObjt(0) − Objt(w )|] ≤ E[|gt(0; x + δ, x) − ft(w , x)|] + λ Reg(w ) − Reg(gi) i i∈[m]  k2.5 q  kΞ4 log d ≤ O + τ kΞ7 + O 2√ d1−2c0 2 d (using Lemma F.12 and Claim F.17)  1 q  ≤ O + τ kΞ7 (using k2.5 < d1−2c0 / log d ) log d 2  1  ≤ O (using τ ≤ √ 1 ) log d k·dc0 Therefore, we can bound the left hand side of (F.11) as follows: (t) (t) 0 (t) (t) h∇RobObjt(w ), w − gi = h∇RobObjt(v ), v i  1  ≥ RobObj0 (v(t)) − RobObj0 (0) ≥ RobObj (w(t)) − Obj (w(Tf )) − O . t t t t log d

Putting this back to (F.11) and telescoping for t = Tf ,Tf + 1,...,Tf + T0 − 1 for any T0 ≤ T , we have T +T −1 1 f 0   1  X (t) (Tf ) RobObjt(w ) − Objt(w ) − O T0 log d t=Tf 1 1 1 k2Ξ4  1 (Tf ) 2 (Tf +T0) 2 2 (Tf +T0) 2 ≤ kw − gkF − kw − gkF ≤ · O m − kw − gkF 2ηT0 2ηT0 ηT0 d 2ηT0 (F.14) Inequality (F.14) now implies that  2 4  X (T +T ) k Ξ kv f 0 k2 = kw(Tf +T0) − gk2 ≤ O 2 m i F d i∈[m] so (F.12) holds at iteration t = Tf + T0. We can then also apply Lemma F.15 which ensures (F.13) holds at iteration t = Tf + T0. 2 4 k Ξ2m log d Finally, let us go back to (F.14) and choose T0 = T = Θ( ηd ). It implies T +T −1 1 f X   1  1 RobObj (w(t)) − Obj (w(Tf )) − O ≤ O( ) T t t log d log d t=Tf

23 b Note to apply Lemma F.11 we also need to check τ ≤ o( 2 0 ) but this is automatically satisfied under our Ξ2+r parameter choice τ ≤ √ 1 . k·dc0

74 Note that our final choice of T also ensures that the pre-requisite T η ≤ o(db) and τ, σx ≤ 2 2 o( d√b ) of Lemma F.15 hold. (T η)2 k log d 

F.4.2 Robust Convergence for `∞ Perturbation Proof of Theorem F.4. The proof is nearly identical to that of Theorem F.1. In particular, we want to inductively prove that at every iteration t ∈ [Tf ,Tf + T ], it satisfies  2  X (t) kΞ kv k2 ≤ r2m for r = Θ √ 2 (F.15) i 2 d i∈[m] (t) 0 0 2 max kvi k1 ≤ r for r = Θ(kΞ2 · kMk∞) (F.16) i∈[m] We also need to redo the following calculations:24

0 (t) (t) h (t) (t) i E[|RobObjt(v ) − RobObjt(w )|] ≤ E gt(v ; x + δ, x, ρ) − ft(w , x + δ)

2 3.5 c0 2 ≤ O(τ ) · k d · kMk∞ (using Lemma F.13)   1 1 ≤ O (using τ ≤ 1.75 c ) log d k ·kMk∞·d 0

0 (Tf ) (Tf ) X (Tf ) E[|RobObjt(0) − Objt(w )|] ≤ E[|gt(0; x + δ, x) − ft(w , x)|] + λ Reg(w ) − Reg(gi) i i∈[m]  2.5   4  k 4 kΞ2 log d ≤ O + τ · kΞ2 + O √ d1−2c0 d (using Lemma F.14 and Claim F.17)  1 q  ≤ O + τ kΞ7 (using k2.5 < d1−2c0 / log d ) log d 2   1 1 ≤ O (using τ ≤ 1.75 c ) log d k ·kMk∞·d 0 

F.5 Fast Gradient Method (FGM) Robust Training

Let us prove Corollary F.2 only for the `2 case, and the other `∞ case Corollary F.5 is completely analogous.

d Proof of Corollary F.2. At any iteration t ∈ [Tf ,Tf + Tg], consider any perturbation vector δ ∈ R which may depend on x but not on ρ, with kδk2 ≤ τ. Recall from Lemma F.11 that  5 3.5    h (t) (t) i 2 Ξ2 k 1 E E gt(v ; x + δ, x, ρ) − E ft(w ; x + δ, ρ) ≤ O(τ ) · + 1−2c ≤ O 2 x ρ ρ σρ d 0 log d 1 This means for at least 1 − O( log d ) probability mass of inputs x, we have   (t) (t) 1 E gt(v ; x + δ, x, ρ) − E ft(w ; x + δ, ρ) ≤ O ρ ρ log d

24 b Note to apply Lemma F.13 we also need to check τ ≤ o( 2 0 ) but this is automatically satisfied under our Ξ2+r parameter choice for τ.

75 For those choices of x, using the fact that gt is linear in δ, we also have (t) (t) gt(v ; x + δ, x, ρ) − ft(w ; x, ρ) Eρ Eρ (t) (t) (t) = gt(v ; x + δ, x, ρ) − gt(v ; x, x, ρ) = h∇x gt(v ; x, x, ρ), δi Eρ Eρ Eρ (t) = h∇x ft(w ; x, ρ), δi Eρ Putting them together we have       (t) (t) (t) 1 −y(x) E ft(w ; x + δ, ρ) = −y(x) E ft(w ; x, ρ) − hy(x)∇x E ft(w ; x, ρ), δi ± O ρ ρ ρ log d (F.17)     (t) (t) ? 1 ≤ −y(x) E ft(w ; x, ρ) − hy(x)∇x E ft(w ; x, ρ), δ i + O ρ ρ log d (F.18) ? where δ = A(ft, x, y) is the perturbation obtained by the fast gradient method with `2 radius τ. This means two things.  (t) On the other hand, by applying Markov’s inequality and Jensen’s inequality to Ex,y=y(x),ρ Objt(w ; x+ δ?, y, ρ) ≤ o(1), we know for at least 1 − o(1) probability mass of the choices of x, it satisfies

(t) ? log(1 + e−y(x) Eρ ft(w ;x+δ ,ρ)) ≤ o(1) (t) ? =⇒ −y(x) ft(w ; x + δ , ρ) ≤ −10 Eρ Therefore, for all of those x (with total mass ≥ 1 − o(1)) satisfying both, we can first apply (F.17) (with δ = δ?) to derive (t) (t) ? −y(x) ft(w ; x, ρ) − hy(x)∇x ft(w ; x, ρ), δ i ≤ −9 Eρ Eρ Applying (F.18) then we obtain (for any δ) (t) −y(x) ft(w ; x + δ, ρ) ≤ −8 Eρ

This means, the output of the network ft is robust at point x against any perturbation δ with radius τ. We finish the proof of Corollary F.2. 

G NTK Lower Bound For `∞ Perturbation

Recall from Definition 5.5 that the feature mapping of the neural tangent kernel for our two-layer network f is  m  Φ(x) = x 1hw ,xi+ρ ≥b − 1−hw ,xi+ρ ≥b ρE i i i i i i i i=1

Therefore, given weights {vi}i∈[m], the NTK function p(x) is given as X  p(x) = hx, vii 1hw ,xi+ρ ≥b − 1−hw ,xi+ρ ≥b E 2 i i i i i i ρi∼N (0,σ ) i∈[m] ρi To make our lower bound stronger, in this section, we consider the simplest input distribution with M = I and σx = 0 (so ξ ≡ 0). Our main theorem is the following.

76 d C Theorem G.1. Suppose w1, . . . , wm ∈ R are i.i.d. sampled from N (0, I) with m ≤ d for some 2 o(1) o(1) constant C > 1; and suppose ρi ∼ N (0, σρi ) with |σρi | ≤ d and |bi| ≤ d . Then, there exists −Ω(log2 d) 1 constant c6 > 0 so that, with probability at least 1 − e , choosing τ = dc6 , then for any  c6 0.5− c6  k ∈ d 100 , d 100 and sufficiently large d.

h d i 1 − o(1) Pr ∃δ ∈ R , kδk∞ ≤ τ : sign(p(x + δ)) 6= y ≥ . x,y=y(x) 2

G.1 Proof of Theorem G.1 We first note the following:

d ? Claim G.2. Suppose at point z ∈ R , some function p(z) gives the correct label y(z) = sign(hw , zi) 2 against any ` perturbation of radius τ. Then, letting ζ ∼ N (0, τ I ) be a random vector, ∞ log8 d d×d d and δ ∈ R be any vector with kδk∞ ≤ τ/2, it satisfies ? −Ω(log2 d) E [p(z + δ + ζ)] · sign(hw , zi) ≥ −e max{kvik2}i∈[m] ζ i∈[m]

−Ω(log2 d) Proof of Claim G.2. With probability at least 1 − e it satisfies kζ + δk∞ ≤ τ. When this happens, we must have p(z + δ + ζ) · sign(hw?, zi) ≥ 0. 

Therefore, for the analysis purpose (by sacrificing `∞ norm radius from τ to τ/2), we can imagine as if the input is randomly perturbed by ζ. This serves for the purpose of smoothing the NTK function p(·), which originally has indicator functions in it so may be trickier to analyze. Next, we define def MW (x) = max |hwi, xi| . i∈[m] One can carefully apply the Taylor expansion of the smoothed indicator function (using the ran- domness of ζ), to derive the following claim. (Detailed proof in Section G.4.) √ d 2 Claim G.3. Consider any NTK function p(x) with parameters kwik2 ≥ 2 , kwik∞ ≤ log d, ρ ∼ N (0, σ2 ) with |σ | ≤ do(1) and |b | ≤ do(1). Suppose τ ∈ [ 1 , 1], then there exists coefficients i ρi ρi i d1/5 0 00 {ci,r, ci,r, ci }i∈[m],r≥0 with 0 • each |ci,r|, |ci,r| ≤ O (1), 0 −0.1 • each |ci,r| ≤ |ci,r| · O(d r), 1  • each |ci,r| ≥ Ω d2 for every odd constant r ≥ 1 00 −1/4 • each |ci | ≤ O(d ). 1/4 1/4 so that, for every z with kzk1 ≤ d and every δ with kδk∞ ≤ τ/2 and MW (δ) ≤ τd , we have: [p(z + δ + ζ)] E2 ζ∼N (0, τ I ) log8 d d×d      r X 00 X vi 0 wi hwi, z + δi = kvik2 ci + ci,rh , z + δi + ci,rh , z + δi  kvik2 kwik2 τkwik2 r≥0 i∈[m]

77 Using the above formula, we can write

X ⊗r+1 E[p(x + ζ)] = CONST + Tr+1(x ) ζ r≥0    r ⊗r+1 def X vi 0 wi hwi, xi for Tr+1(x ) = kvik2 ci,r + ci,r , x kvik2 kwik2 τkwik2 i∈[m] 1  0 −0.1 25 Using |ci,r| ≥ Ω d2 for odd constant r ≥ 1, and |ci,r| ≤ O(d ) · |ci,r|, by applying Lemma G.4, we know that when r = 3C + 3 (say wlog. 3C + 3 is odd),  1  kT3C+4kF = Ω max{kvik2} (G.1) d3 i∈[m] √ Also, for a parameter q = d, let us apply Lemma G.5 to derive   def ⊗r 1 λr = max Tr(δ ) ≥ Ω r kTrkF (G.2) d √ δ∈R :kδk∞≤τ,MW (δ)≤τ q (τ)

Let R ≥ 3C + 3 be a constant to be chosen later, λmax = maxr

 r X ⊗r+1 X X hwi, z + δmaxi Tr+1((z + δmax) ) = ci,rhvi, z + δmaxi τkwik2 r≥R r≥R i∈[m] √ !r 4 X Oe(1) + τ q ≤ d m max{kvik2}i∈[m] √ i∈[m] r≥R τ d  r 5 X O(1) ≤ d m max{kvik2}i∈[m] i∈[m] d1/4 r≥R When R ≥ 10000(C + 1), we have

  X ⊗r+1 maxi∈[m]{kvik2}i∈[m] Tr+1((z + δmax) ) ≤ O (G.3) d100C r≥R Next, for every s ∈ [1/2, 1], let us define

X ⊗r+1 q

25 vi 0 wi  Specifically, one should substitute kvik2 ci,r + c as the new vi when applying Lemma G.4. kvik2 i,r kwik2

78 • On the other hand, by Claim G.7, we know that there is an s ∈ [1/2, 1] such that

|q

Without loss of generality, suppose q

q

G.2 Tensor Lower Bound Next, for each degree-r homogenous part of the polynomial expansion of Claim G.3, we can write it as a tensor and lower bound its Frobenius norm as follows. d C Lemma G.4. Suppose w1, . . . , wm ∈ R are i.i.d. sampled from N (0, I) with m ≤ d for some d constant C > 0. Let v1, . . . , vm ∈ R be arbitrary vectors that can depend on the randomness of d×(r+1) w1, . . . , wm. Let us denote by Tr+1 the symmetric tensor R → R such that  r ⊗r+1 X hwi, xi Tr+1(x ) = hvi, xi kwik2 i∈[m]

−Ω(log2 d) We have as long as r ≥ 3C, then w.p. ≥ 1 − e over the randomness of {wi}i∈[m], for every {vi}i∈[m] we have  1  kTr+1kF ≥ Ω √ max {kvik2} d i∈[m] Proof of Lemma G.4. Consider any fixed j ∈ [m], and some γ ∈ [−1, 1] to be chosen later. wj vj Let us define x = + γ which satisfies kxk2 ≤ 1. We have 2kwj k2 2kvj k2  r ⊗r+1 X hwi, xi Tr+1(x ) = hvi, xi kwik2 i∈[m]  r    r X hwi, xi γ hvj, wji 1 hwj, vji = hvi, xi + kvjk2 + + γ kwik2 2 2kwjk2 2 2kvjk2kwjk2 i∈[m]\{j}

79 2 Note that for every j 6= i, with probability at least 1 − e−Ω(log d),   hwi, xi hwi, wji hwi, vji log d = + γ ≤ O √ + γ . kwik2 2kwik2kwjk2 2kvjk2kwik2 d This implies that as long as |γ| ≤ √1 , d 1 γ hv , w i O(log d)r ⊗r+1 j j √ |Tr+1(x )| ≥ r kvjk2 + − m · max{kvik2} 3 2 2kwjk2 d i∈[m] Since the above lower bound holds for every |γ| ≤ √1 and every j ∈ [d], we immediately know d    ⊗r+1 1 max Tr+1(x ) ≥ Ω √ max {kvik2} kxk2≤1 d i∈[m] This implies our bound on the Frobenius norm as well. 

G.3 Tensor Perturbation We present the following critical lemma, which serves as the major step to prove the non-robustness of Neural Tangent Kernel:

Lemma G.5 (Tensor difference). For every m = poly(d), every set of vectors W = {w } with √ i i∈[m] each kwik2 = O( d), for every constant r > 0, every q ∈ [0, d], every symmetric tensor T of degree d×r r: R → R, for every τ > 0, 1. The following is true   ⊗r 1 λ := max T (δ ) ≥ Ω r kT kF d √ δ∈R :kδk∞≤τ,MW (δ)≤τ q (τ)

2. For every vectors z , ··· , z ∈ d with kz k ≤ 1,M (z ) = O(1) and supp(z )∩supp(z ) = 1 q R i ∞ W i √ e i j ∅ for i 6= j, for every y such that kyk∞ ≤ τ and MW (y) ≤ τ q, the following holds:   X ⊗r ⊗r λ T (y ) − T ((y + zi) ) ≤ Oe τ r i∈[q]

 τ 2  Proof of Lemma G.5. For the first item, we can simply let δ ∼ N 0, 2 . This choice of δ √ log d satisfies kδk∞ ≤ τ and MW (δ) ≤ τ q with high probability. Furthermore, by applying anti- concentration of Gaussian polynomials (see for instance [5, Lemma I.1]), we know with at least ⊗r Ω(kT kF ) constant probability |T (δ )| ≥ τ r . This proves the first item. To see the second item, we first note by tensor r-linearity and symmetry, r   X ⊗r ⊗r X r X ⊗r0 ⊗(r−r0) T (y ) − T ((y + zi) ) ≤ |T (z , y )| r0 j i∈[q] r0=1 j∈[q] and therefore we only need to bound the terms on the right hand side for any fixed r0 ∈ [r]. Define random variable {ξi,j}i∈[r0−1],j∈[q] where each ξi,j is i.i.d. uniformly at random chosen from {−τ, τ}. Consider arbitrary fixed values γj ∈ {−1, 1} for j ∈ [q]. Let us define random d variables Z1,Z2, ··· ,Zr0 ∈ R as: 0 X X  Y  ∀i ∈ [r − 1] : Zi := ξi,jzj,Zr0 := γj ξi,j zj j∈[q] j∈[q] i∈[r0−1]

80 From these notions one can directly calculate that ⊗(r−r0) r0 X ⊗r0 ⊗(r−r0) E[T (Z1,Z2, ··· ,Zr0 , y )] = τ γjT zj , y ξ j∈[q]

On the other hand, we have kZik∞ ≤ τ and moreover, using the randomness of ξi,j, we know w.h.p. √ |MW (Zi)| = Oe(τ q) for every i ∈ [q]. Hence, by Claim G.8, we know that ⊗(r−r0) | E[T (Z1,Z2, ··· ,Zr0 , y )]| = Oe(λ)

0 0   Putting them together, we have P γ T (z⊗r , y⊗(r−r )) = O λ , and since this holds for i∈[q] i i e τ r0 every γi ∈ {−1, 1}, we conclude that:   X ⊗r0 ⊗(r−r0) λ |T (z , y )| = Oe 0 i τ r i∈[q] Putting this back to the binomial expansion finishes the proof. 

G.4 Smoothed ReLU Taylor Series: Proof of Claim G.3 We first note the following Taylor expansion formula for smoothed ReLU. Claim G.6 (smoothed ReLU). Let a ≥ 0 be any real and ρ ∼ N (0, σ2) for σ ≥ a. Then, for every x ∈ [−a, a], ∞ ∞  2i  2i+1 1 X x 1 1 X 0 x E [ρ ρ+x≥0] = σ c2i and E [ ρ+x≥0] = + c2i+1 ρ σ ρ 2 σ i=0 i=0

1  0  1  where |c2i| = Θ i! , |c2i+1| = Θ (i+1)! Proof of Claim G.6. We can directly calculate that Z 2 2 1 − ρ 1 − x E [ρ1ρ+x≥0] = √ ρe 2σ2 dρ = √ e 2σ2 σ 2πσ ρ≥−x 2π 2 − x so using Taylor expansion of e 2σ2 we prove the first equation. As for the second equation, we have Z 2 1 − ρ E [1ρ+x≥0] = √ e 2σ2 dρ 2πσ ρ≥−x This implies that 2 d 1 − x E [1ρ+x≥0] = −√ e 2σ2 dx 2πσ Using Taylor expansion and integrating once, we prove the second equation.  We are now ready to prove Claim G.3.

Proof of Claim G.3. Specifically, for each i ∈ [m], denoting by x = z + δ, we wish to apply Claim G.6 to 1 1  E hx + ζ, vii hwi,x+ζi+ρi≥bi − −hwi,x+ζi+ρi≥bi ρi,ζ 1 1  1 1  = E hx, vii hwi,x+ζi+ρi≥bi − −hwi,x+ζi+ρi≥bi + E hζ, vii hwi,x+ζi+ρi≥bi − −hwi,x+ζi+ρi≥bi ρi,ζ ρi,ζ | {z } | {z } ♥ ♦

81 2 2 2 Note that g def= hw , ζi + ρ ∼ N (0, σ2) for σ2 = τ kw k2 + σ2 ∈  τ kw k2, 2τ kw k2. i i log16 d i 2 ρi log16 d i 2 log16 d i 2 • We first deal with the ♥ part. Using Claim G.6, we have ∞  2r+1! 1 1 1 X 0 hwi, xi − bi E hx, vii hw ,x+ζi+ρ ≥b = hx, vii E hw ,xi−b +g≥0 = hx, vii + c2r+1 ρ ,ζ i i i g i i 2 σ i r=0

0  1  for |c2r+1| = Θ (r+1)! . Similarly, we also have ∞  2r+1! 1 1 X 0 −hwi, xi − bi − E hx, vii −hw ,x+ζi+ρ ≥b = hx, vii − − c2r+1 ρ ,ζ i i i 2 σ i r=0 −0.2 Putting them together, and using the fact that bi  d  σ, we can write 1 1  ♥ = E hx, vii hwi,x+ζi+ρi≥bi − −hwi,x+ζi+ρi≥bi ρi,ζ ∞  2r+1  2r+1! X hwi, xi − bi −hwi, xi − bi = hx, v i c0 − c0 i 2r+1 σ 2r+1 σ r=0  r X hwi, xi = hx, v i c00 i r σ r≥0

00  1  00  1  for |c2r| ≤ O (r)! for every r ≥ 0 and |c2r+1| ≥ Ω (r+1)! . k • Let us now focus on the ♦ part. Let vi be the part of vi that is parallel to wi. Then obviously we have kvkk 1 k 1 i 2 1 E hζ, vii hwi,x+ζi+ρi≥bi = E hζ, vi i hwi,x+ζi+ρi≥bi = E hζ, wii hwi,x+ζi+ρi≥bi ζ,ρi ζ,ρi kwik2 ζ,ρi kvkk x i 2 1 k −1/4 = E (hζ, wii + ρi) hwi,x+ζi+ρi≥bi ± kvi k2 · O(d ) kwik2 ζ,ρi √ o(1) Above, the last x is due to σρi ≤ d and kwik2 ≥ Ω( d). def 2 Next, we again treat g = hζ, wii + ρi ∼ N (0, σ ) and apply Claim G.6. We have ∞  2r 1 1 X hwi, xi − bi E (hζ, wii + ρi) hw ,x+ζi+ρ ≥b = E g hw ,xi−b +g≥0 = σ c2r ρ ,ζ i i i g i i σ i r=0 1  for |c2r| = Θ r! . Putting them together, and doing the same thing for the symmetric part, we have 1 1  ♦ = E hζ, vii hwi,x+ζi+ρi≥bi − −hwi,x+ζi+ρi≥bi ρi,ζ k ∞  2r  2r! kv k2σ X hwi, xi − bi −hwi, xi − bi k = i c − c ± kv k · O(d−1/4) kw k 2r σ 2r σ i 2 i 2 r=0 k ∞  r x kv k2σ X hwi, xi k = i c000 ± kv k · O(d−1/4) kw k r σ i 2 i 2 r=1 ∞  r k wi X hwi, xi k = kv k h , xi c000 ± kv k · O(d−1/4) i 2 kw k r+1 σ i 2 i 2 r=0

82 −0.2 000 d−0.1 Above, using the property of bi  d  σ, equation x holds for some |c2r| ≤ O( r! ) and 000 d−0.1 |c2r+1| ≤ O( (r+1)! ).

Finally, putting the bounds for ♥ and ♦ together, and using τkwik2 ≤ σ ≤ τkw k , we derive that log8 d i 2    r X 0000 vi 00000 wi hwi, xi −1/4 ♥ + ♦ = kvik2 · ci,rhx, i + ci,r hx, i ± kvik2 · O(d ) kvik2 kwik2 τkwik2 r≥0 0000 00000 −0.1 0000 0000 1  for |ci,r| ≤ O(1) for every r ≥ 0, cr ≤ O(d r) · cr for every r ≥ 0, and |ci,r| ≥ Ω d2 for every odd constant r ≥ 1. This finishes the proof of Claim G.3. 

G.5 Simple Lemmas We have the following claim relating polynomial value with its coefficients: Claim G.7 (low degree polynomial). Let p : R → R be a constant-degree polynomial p(x) = PR r r=0 crx , then there exists x ∈ [1/2, 1] such that   |p(x)| ≥ Ω max |cr| . r=0,1,2,··· ,R

def 1  PR 0 r Proof of Claim G.7. Let us define q(x) = p x + 2 and write accordingly q(x) = r=0 crx . Using PR 0 r PR r the identity formula r=0 crx = r=0 cr(x + 0.5) we can derive (recalling R is constant)   0 max |cr| ≤ O max |cr| r=0,1,2,··· ,R r=0,1,2,··· ,R PR 0 r PR r Conversely, by writing r=0 cr(x−0.5) = r=0 crx , we also have the other direction and therefore   0 max |cr| = Θ max |cr| r=0,1,2,··· ,R r=0,1,2,··· ,R dr 0 Now, notice that dxr q(x) |x=0 = Θ(|cr|), so we can apply Markov brother’s inequality to derive that   0 max |q(x)| ≥ Ω max |cr| . x∈[0,1/2] r=0,1,2,··· ,R This finishes the proof.  Using this Claim, we also have the following claim about symmetric tensor:

Claim G.8 (symmetric tensor norms). For every constant r > 0, every fixed a1, a2 > 0, every d×r symmetric tensor T of degree r of the form T : R → R, let ⊗r λ1 := max {|T (x )|}, λ2 := max {|T (x1, x2, ··· , xr)|} x:kxk∞≤a1,MW (x)≤a2 {xi}i∈[r]:kxik∞≤a1,MW (xi)≤a2 then we have:

λ1 ≤ λ2 ≤ O(λ1)

Proof of Claim G.8. λ1 ≤ λ2 is obvious so let us prove the other direction. Define polynomial  ⊗r  r+1 (r+1)2 (r+1)r−1  p(s) = T x1 + s x2 + s x3 + ··· + s xr

P r0−1 The coefficient of p(s) at degree r0∈[r](r+1) is Θ (T (x1, x2, ··· , xr)). Thus, applying Claim G.7 and appropriately scaling the operator, we complete the proof. 

83 H Appendix for Probability Theory

H.1 Small ball probability: The basic property We also have the following property:

Lemma H.1 (small ball probability, 1-d case). (a) For every subset Λ ⊆ [d], every r, and every t > 0,

h P ? i t 1 Pr | w · zj − r| ≤ t ≤ O( + ) j∈Λ j p|Λ|/d p|Λ|k/d

(b) For every subset Λ ⊆ [d] with |Λ| ≥ Ω(d), and every t > 0,   h P ? i log k Pr | w · zj| ≤ t ≥ Ω(t) − O √ j∈Λ j k

k 0 Proof of Lemma H.1. Recall we have Przj [zj 6= 0] ≥ Ω( d ) for each j ∈ Λ. Let Λ ⊆ Λ be the subset of such indices j with non-zero z , so by our assumption we have |z | ≥ √1 for each j ∈ Λ0. By j j k −Ω(|Λ|k/d) 0 k Chernoff bound, with probability at least 1 − e , we know |Λ | ≥ Ω( d ) · |Λ|. Conditioning on such Λ0, by the Littlewood-Offord problem (a.k.a. small ball probability theo- rem, or anti-concentration for sum of Bernoulli variables, see [30]), we know   √ √ X ? kt + 1 kt + 1 Pr | w · zj − r| ≤ t E ≤ O( ) = O( )  j  p 0 p j∈Λ |Λ | |Λ|k/d 00 As for the lower bound, let us denote by Λ ⊆ Λ be the subset of such indices j with non-zero zj and |z | ≤ O( √1 ). We know with high probability |Λ00| ≥ Ω(k). Using  P w? ·z  ≤ O(1), j k E j∈Λ\Λ00 j j we can apply Markov’s inequality and get   X ? Pr  wj · zj ≤ B ≥ 0.6 for some constant B = O(1) j∈Λ\Λ00 Now, for the sum over Λ00, we can apply a Wasserstein distance version of the central limit theorem (that can be derived from [110], full statement see [6, Appendix A.2]) to derive that, for a Gaussian 2 P ? 2 2 variable g ∼ (0,V ) where V = j∈Λ00 (wj ) E[(zj) ] ≥ Ω(1), the Wasserstein distance:     X ? log k W2  wj · zj, g ≤ O √ j∈Λ00 k Using the property of Gaussian variables and B = O(1), we have    X t X t Pr g ∈ w? · z − , w? · z + ≥ Ω(t)   j j 2 j j 2 j∈Λ\Λ00 j∈Λ\Λ00 and using the above Wasserstein distance bound, we have      X ? X ? X ? log k Pr  wj · zj ∈  wj · zj − t, wj · zj + t ≥ Ω(t) − O √ k j∈Λ00 j∈Λ\Λ00 j∈Λ\Λ00 

84 H.2 McDiarmid’s Inequality and An Extension We state the standard McDiarmid’s inequality,

Lemma H.2 (McDiarmid’s inequality). Consider independent random variables x1, ··· , xn ∈ X n 0 and a mapping f : X → R. If for all i ∈ [n] and for all y1, ··· , yn, yi ∈ X , the function f satisfies 0 |f(y1, ··· , yi−1, yi, yi+1, ··· , yn) − f(y1, ··· , yi−1, yi, yi+1, ··· , yn)| ≤ ci. Then −2t2 Pr[f(x1, ··· , xn) − E f ≥ t] ≥ exp(Pn 2 ), i=1 ci 2t2 Pr[f(x1, ··· , xn) − E f ≤ −t] ≥ exp(Pn 2 ). i=1 ci We prove a more general version of McDiarmid’s inequality,

Lemma H.3 (McDiarmid extension). Let w1, . . . , wN be independent random variables and f :(w1, . . . , wN ) 7→ [0,B]. Suppose it satisfies for every k ∈ {2, 3,...,N},

• with probability at least 1 − p over w1, . . . , wN , it satisfies 00 00 ∀wk : f(w−k, wk) − f(w−k, wk) ≤ c

• with probability at least 1 − p over w1, . . . , wk−1, wk+1, . . . , wN , it satisfies (f(w , w0 ) − f(w , w00))2 ≤ V 2 0E 00 −k k −k k k wk,wk Then,  

Pr f(w1, . . . , wN ) − E [f(w1, . . . , wN ) | w1] ≥ t w1,...,wN w2,...,wN ! √ −Ω(t2) ≤ O(N p) + exp . √ PN 2 √ 2 2 t(c + pB) + t=2(Vt + pB ) √ Proof of Lemma H.3. For each t = 1,...,N − 1, we have with probability at least 1 − p over w1, . . . , wt, it satisfies  00 00  √ Pr ∀wt+1 : f(w≤t, wt+1, w>t+1) − f(w≤t, wt+1, w>t+1) ≤ c ≥ 1 − p . wt+1,...,wN √ We also have with probability at least 1 − p over w1, . . . , wt, it satisfies " # 0 00 2 2 √ Pr f(w≤t, w , w>t+1) − f(w≤t, w , w>t+1) ≤ V ≥ 1 − p . w ,...,w 0 E 00 t+1 t+1 t+1 t+2 N wt+1,wt+1 We denote by K the event (over w = (w , . . . , w )) that the above two statements hold. We ≤t √ ≤t 1 t know that Pr[w≤t ∈ K≤t] ≥ 1 − 2 p. For notational simplicity, we denote by K≤N the full set over all possible (w1, . . . , wN ). Define random variable Xt (which depends only on w1, . . . , wt) as 1 Xt := E [f(~w) | w≤t] (w≤1,...,w≤t)∈K≤1×···×K≤t ∈ [0,B] w>t

For every t and fixed w1, . . . , wt−1.

• If (w≤1, . . . , w

85 • If (w≤1, . . . , w

– If w≤t 6∈ K≤t, then Xt − Xt−1 = 0 − Xt−1 ≤ 0. – If w≤t ∈ K≤t, then

Xt − Xt−1 = E [f(wt) | w≤t] − E [f(wt) | wt w≥t √ Recall the property wt, it satisfies 00 00 ∀wt : f(wt) − f(wt) ≤ c

Taking expectation over wt and w>t, we have 00  00  √ ∀wt : E f(wt) − E [f(wt)] ≤ (c + pB) w>t w≥t √ This precisely means Xt − Xt−1 ≤ c + pB. √ – Using the property wt, it satisfies 00 2 2 f(wt) − f(wt) ≤ V E 00 t t wt,wt Taking expectation also over w>t, we have 00 2 2 √ 2 f(wt) − f(wt) ≤ V + pB E00 t t wt,wt ,w>t  2 00 2 √ 2 =⇒ [f(wt)] − [f(wt)] ≤ V + pB wE wE 00E t t t >t wt ,w>t 00 Now observe that, since (w , . . . , w ) ∈ K ×· · ·×K , we have X = 00 [f(w , w , w )]. ≤1 t t We also have that as long as w ∈ K , then X = [f(w , w , w )]. Putting these ≤t ≤t t √Ew>t t together, and using the fact Pr[w≤t ∈ K≤t] ≥ 1 − 2 p, we have  2  2 √ 2 E (Xt+1 − Xt) | w

In sum, we have just shown that for all choices of w1, . . . , wt−1, √  2  2 √ 2 Xt − Xt−1 ≤ (c + pB) and E (Xt+1 − Xt) | w t] ≤ exp √ PN 2 √ 2 2 t(c + pB) + t=2(Vt + pB ) Recalling 1 XN := f(~w) (w≤1,...,w≤t)∈K≤1×···×K≤t √ and we have XN = f(w1, . . . , wN ) with probability at least 1 − 2N p (and XN = 0 with the remaining probability). Also recalling 1 X1 := E [f(~w) | w1] w≤1 w2,...,wN √ and we have X1 = Ew2,...,wN [f(~w) | w1] with probability at least 1 − 2 p (and X1 = 0 with the remaining probability). Together, we have the desired theorem. 

86 Let us state, for completeness’ sake, a simple one-sided Bernstein form of martingale concen- tration (that we do not know a good reference to it).

Lemma H.4. Suppose we have a submartingale sequence X0,X1,...,XN , satisfying:

• X0 = 0 and E[Xt | Xt−1] ≤ Xt,

• Xt − Xt−1 ≤ c always holds, and 2 2 • EXt [(Xt − Xt−1) | Xt−1] ≤ Vt always holds. Then, 2 −Ω( t ) tc+P V 2 Pr[XN > t] ≤ e t t

η Xt Proof. Define potential function Ψt = e 2c for some η ∈ (0, 1) to be chosen later. We have

η   (Xt−X ) η(Xt − Xt−1) η(Xt − Xt−1)2 Ψ = Ψ · e 2c t−1 ≤ Ψ · 1 + + t t−1 t−1 2c 2c where the inequality is due to ey ≤ 1+y +y2 which holds for all −∞ < y ≤ 0.5. Taking conditional expectation, we have  X − X X − X  [Ψ | X ] ≤ Ψ · 1 + η  t t−1 | X  + η2  t t−1 2 | X  E t t−1 t−1 E 2c t−1 E 2c t−1

 2  V 2 2 Vt η2 t ≤ Ψ · 1 + η ≤ Ψ · e 4c2 . t−1 4c2 t−1

P V 2 η2 t t After telescoping, we have E[ΨN ] ≤ e 4c2 , and therefore

ηXN /(2c) P V 2 E[e ] η2 t t −η t Pr[X > t] ≤ ≤ e 4c2 2c N eηt/(2c) Choosing the optimal η ∈ (0, 1) gives us bound

2 −Ω( t ) tc+P V 2 Pr[XN > t] ≤ e t t 

References

[1] Alekh Agarwal, Animashree Anandkumar, and Praneeth Netrapalli. Exact recovery of sparsely used overcomplete dictionaries. stat, 1050:8–39, 2013. [2] Alekh Agarwal, Animashree Anandkumar, Prateek Jain, and Praneeth Netrapalli. Learning sparsely used overcomplete dictionaries via alternating minimization. SIAM Journal on Optimization, 26(4): 2775–2799, 2016. [3] Zeyuan Allen-Zhu and Yuanzhi Li. What Can ResNet Learn Efficiently, Going Beyond Kernels? In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1905.10337. [4] Zeyuan Allen-Zhu and Yuanzhi Li. Can SGD Learn Recurrent Neural Networks with Provable Gener- alization? In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1902.01028. [5] Zeyuan Allen-Zhu and Yuanzhi Li. Backward feature correction: How deep learning performs deep learning. arXiv preprint arXiv:2001.04413, 2020. [6] Zeyuan Allen-Zhu, Yuanzhi Li, and Yingyu Liang. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. In NeurIPS, 2019. Full version available at http: //arxiv.org/abs/1811.04918.

87 [7] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. On the convergence rate of training recurrent neural networks. In NeurIPS, 2019. Full version available at http://arxiv.org/abs/1810.12065. [8] Zeyuan Allen-Zhu, Yuanzhi Li, and Zhao Song. A convergence theory for deep learning via over- parameterization. In ICML, 2019. Full version available at http://arxiv.org/abs/1811.03962. [9] Sanjeev Arora, Rong Ge, and Ankur Moitra. New algorithms for learning incoherent and overcomplete dictionaries. In Conference on Learning Theory, pages 779–806, 2014. [10] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. Journal of Research, 40(2015), 2015. [11] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. A latent variable model approach to pmi-based word embeddings. Transactions of the Association for Computational Linguis- tics, 4:385–399, 2016. [12] Sanjeev Arora, Yuanzhi Li, Yingyu Liang, Tengyu Ma, and Andrej Risteski. Linear algebraic structure of word senses, with applications to polysemy. Transactions of the Association for Computational Linguistics, 6:483–495, 2018. [13] Sanjeev Arora, Simon S Du, Wei Hu, Zhiyuan Li, Ruslan Salakhutdinov, and Ruosong Wang. On exact computation with an infinitely wide neural net. arXiv preprint arXiv:1904.11955, 2019. [14] Sanjeev Arora, Simon S. Du, Wei Hu, Zhiyuan Li, and Ruosong Wang. Fine-grained analysis of opti- mization and generalization for overparameterized two-layer neural networks. CoRR, abs/1901.08584, 2019. URL http://arxiv.org/abs/1901.08584. [15] Anish Athalye, Nicholas Carlini, and David Wagner. Obfuscated gradients give a false sense of security: Circumventing defenses to adversarial examples. arXiv preprint arXiv:1802.00420, 2018. [16] Ainesh Bakshi, Rajesh Jayaram, and David P Woodruff. Learning two layer rectified neural networks in polynomial time. arXiv preprint arXiv:1811.01885, 2018. [17] Boaz Barak, Jonathan A Kelner, and David Steurer. Dictionary learning and tensor decomposition via the sum-of-squares method. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pages 143–151, 2015. [18] Richard Beigel. The polynomial method in circuit complexity. In [1993] Proceedings of the Eigth Annual Structure in Complexity Theory Conference, pages 82–95. IEEE, 1993. [19] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Srndi´c,Pavelˇ Laskov, Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In Joint European conference on machine learning and knowledge discovery in databases, pages 387–402. Springer, 2013. [20] Digvijay Boob and Guanghui Lan. Theoretical properties of the global optimizer of two layer neural network. arXiv preprint arXiv:1710.11241, 2017. [21] Jehoshua Bruck and Roman Smolensky. Polynomial threshold functions, acˆ0 functions, and spectral norms. SIAM Journal on Computing, 21(1):33–42, 1992. [22] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with gaussian inputs. arXiv preprint arXiv:1702.07966, 2017. [23] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. Sgd learns over- parameterized networks that provably generalize on linearly separable data. arXiv preprint arXiv:1710.10174, 2017. URL https://arxiv.org/abs/1710.10174. [24] S´ebastien Bubeck, Eric Price, and Ilya Razenshteyn. Adversarial examples from computational con- straints. arXiv preprint arXiv:1805.10204, 2018. [25] Mark Bun and Justin Thaler. 
Improved bounds on the sign-rank of acˆ 0. In 43rd International Colloquium on Automata, Languages, and Programming (ICALP 2016). Schloss Dagstuhl-Leibniz- Zentrum fuer Informatik, 2016. [26] Yuan Cao and Quanquan Gu. Generalization bounds of stochastic gradient descent for wide and deep neural networks. In Advances in Neural Information Processing Systems, pages 10835–10845, 2019. [27] Amit Daniely, Roy Frostig, and Yoram Singer. Toward deeper understanding of neural networks: The power of initialization and a dual view on expressivity. In Advances in Neural Information Processing

88 Systems (NIPS), pages 2253–2261, 2016. [28] Simon S Du, Jason D Lee, Haochuan Li, Liwei Wang, and Xiyu Zhai. Gradient descent finds global minima of deep neural networks. arXiv preprint arXiv:1811.03804, November 2018. [29] Simon S Du, Xiyu Zhai, Barnabas Poczos, and Aarti Singh. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018. [30] Paul Erd¨os.On a lemma of littlewood and offord. Bulletin of the American Mathematical Society, 51 (12):898–902, 1945. [31] Dumitru Erhan, Y. Bengio, Aaron Courville, and Pascal Vincent. Visualizing higher-layer features of a deep network. Technical Report, Univeriste de Montreal, 01 2009. [32] Alhussein Fawzi, Hamza Fawzi, and Omar Fawzi. Adversarial vulnerability for any classifier. In Advances in Neural Information Processing Systems, pages 1178–1187, 2018. [33] Nic Ford, Justin Gilmer, Nicolas Carlini, and Dogus Cubuk. Adversarial examples are a natural consequence of test error in noise. arXiv preprint arXiv:1901.10513, 2019. [34] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635, 2018. [35] Ruiqi Gao, Tianle Cai, Haochuan Li, Cho-Jui Hsieh, Liwei Wang, and Jason D Lee. Convergence of ad- versarial training in overparametrized neural networks. In Advances in Neural Information Processing Systems, pages 13009–13020, 2019. [36] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017.

[37] Quan Geng and John Wright. On the local correctness of `1-minimization for dictionary learning. In 2014 IEEE International Symposium on Information Theory, pages 3180–3184. IEEE, 2014. [38] Behrooz Ghorbani, Song Mei, Theodor Misiakiewicz, and Andrea Montanari. Linearized two-layers neural networks in high dimension. arXiv preprint arXiv:1904.12191, 2019. [39] Justin Gilmer, Luke Metz, Fartash Faghri, Samuel S Schoenholz, Maithra Raghu, Martin Wattenberg, and Ian Goodfellow. Adversarial spheres. arXiv preprint arXiv:1801.02774, 2018. [40] Ian J Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014. [41] Craig Gotsman and Nathan Linial. Spectral properties of threshold functions. Combinatorica, 14(1): 35–50, 1994. [42] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Acoustics, speech and signal processing (icassp), 2013 ieee international conference on, pages 6645–6649. IEEE, 2013. [43] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on International Conference on Machine Learning, pages 399–406, 2010. [44] Chuan Guo, Mayank Rana, Moustapha Cisse, and Laurens Van Der Maaten. Countering adversarial images using input transformations. arXiv preprint arXiv:1711.00117, 2017. [45] Boris Hanin and Mihai Nica. Finite depth and width corrections to the neural tangent kernel. arXiv preprint arXiv:1909.05989, 2019. [46] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recogni- tion. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016. [47] Patrik O Hoyer. Non-negative sparse coding. In Proceedings of the 12th IEEE Workshop on Neural Networks for Signal Processing, pages 557–565. IEEE, 2002. [48] Andrew Ilyas, Shibani Santurkar, Dimitris Tsipras, Logan Engstrom, Brandon Tran, and Aleksander Madry. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pages 125–136, 2019.

89 [49] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015. [50] Arthur Jacot, Franck Gabriel, and Cl´ement Hongler. Neural tangent kernel: Convergence and gener- alization in neural networks. In Advances in neural information processing systems, pages 8571–8580, 2018. [51] Adel Javanmard, Mahdi Soltanolkotabi, and Hamed Hassani. Precise tradeoffs in adversarial training for linear regression. arXiv preprint arXiv:2002.10477, 2020. [52] Kenji Kawaguchi. Deep learning without poor local minima. In Advances in Neural Information Processing Systems, pages 586–594, 2016. [53] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo- lutional neural networks. In Advances in neural information processing systems, pages 1097–1105, 2012. [54] Honglak Lee, Alexis Battle, Rajat Raina, and Andrew Y Ng. Efficient sparse coding algorithms. In Advances in neural information processing systems, pages 801–808, 2007. [55] Yuanzhi Li and Zehao Dou. When can wasserstein gans minimize wasserstein distance? arXiv preprint arXiv:2003.04033, 2020. [56] Yuanzhi Li and Yingyu Liang. Provable alternating gradient descent for non-negative matrix fac- torization with strong correlations. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 2062–2070. JMLR. org, 2017. [57] Yuanzhi Li and Yingyu Liang. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, 2018. [58] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems, pages 597–607. http://arxiv.org/abs/1705.09886, 2017. [59] Yuanzhi Li, Yingyu Liang, and Andrej Risteski. Recovery guarantee of non-negative matrix factoriza- tion via alternating updates. In Advances in neural information processing systems, pages 4987–4995, 2016. [60] Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In COLT, 2018. [61] Yuanzhi Li, Colin Wei, and Tengyu Ma. Towards explaining the regularization effect of initial large learning rate in training neural networks. arXiv preprint arXiv:1907.04595, 2019. [62] Yuanzhi Li, Tengyu Ma, and Hongyang R Zhang. Learning over-parametrized two-layer relu neural networks beyond ntk. arXiv preprint arXiv:2007.04596, 2020. [63] Xuanqing Liu, Minhao Cheng, Huan Zhang, and Cho-Jui Hsieh. Towards robust neural networks via random self-ensemble. In Proceedings of the European Conference on Computer Vision (ECCV), pages 369–385, 2018. [64] Xingjun Ma, Bo Li, Yisen Wang, Sarah M Erfani, Sudanthi Wijewickrema, Grant Schoenebeck, Dawn Song, Michael E Houle, and James Bailey. Characterizing adversarial subspaces using local intrinsic dimensionality. arXiv preprint arXiv:1801.02613, 2018. [65] Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. To- wards deep learning models resistant to adversarial attacks. In ICLR. arXiv preprint arXiv:1706.06083, 2018. [66] Aravindh Mahendran and Andrea Vedaldi. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5188– 5196, 2015. [67] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. 
The curse of concentration in robust learning: Evasion and poisoning attacks from concentration of measure. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4536–4543, 2019. [68] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online dictionary learning for sparse

90 coding. In Proceedings of the 26th annual international conference on machine learning, pages 689–696, 2009. [69] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11(Jan):19–60, 2010. [70] Alexander Mordvintsev. Deepdreaming with tensorflow. https://github.com/tensorflow/ tensorflow/blob/master/tensorflow/examples/tutorials//deepdream.ipynb, 2016. [71] Alexander Mordvintsev, Christopher Olah, and Mike Tyka. Inceptionism: Go- ing deeper into neural networks. https://research.googleblog.com/2015/06/ inceptionism-going-deeper-into-neural.html, 2015. [72] Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 427–436, 2015. [73] Anh Nguyen, Alexey Dosovitskiy, Jason Yosinski, Thomas Brox, and Jeff Clune. Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In Advances in neural information processing systems, pages 3387–3395, 2016. [74] Anh Nguyen, Jeff Clune, Yoshua Bengio, Alexey Dosovitskiy, and Jason Yosinski. Plug & play gener- ative networks: Conditional iterative generation of images in latent space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4467–4477, 2017. [75] Ryan O’Donnell and Rocco A Servedio. New degree bounds for polynomial threshold functions. Com- binatorica, 30(3):327–358, 2010. [76] Chris Olah, Alexander Mordvintsev, and Ludwig Schubert. Feature visualization. Distill, 2017. doi: 10.23915/distill.00007. https://distill.pub/2017/feature-visualization. [77] Bruno A Olshausen and David J Field. Sparse coding with an overcomplete basis set: A strategy employed by v1? Vision research, 37(23):3311–3325, 1997. [78] Bruno A Olshausen and David J Field. Sparse coding of sensory inputs. Current opinion in neurobi- ology, 14(4):481–487, 2004. [79] Audun Øygard. Visualizing googlenet classes. https://www.auduno.com/2015/07/29/ visualizing-googlenet-classes, 2015. [80] Aditi Raghunathan, Sang Michael Xie, Fanny Yang, John C Duchi, and Percy Liang. Adversarial training can hurt generalization. arXiv preprint arXiv:1906.06032, 2019. [81] Alexander A Razborov and Alexander A Sherstov. The sign-rank of ac ˆ0. SIAM Journal on Com- puting, 39(5):1833–1855, 2010. [82] Hadi Salman, Jerry Li, Ilya Razenshteyn, Pengchuan Zhang, Huan Zhang, Sebastien Bubeck, and Greg Yang. Provably robust deep learning via adversarially trained smoothed classifiers. In Advances in Neural Information Processing Systems, pages 11289–11300, 2019. [83] Pouya Samangouei, Maya Kabkab, and Rama Chellappa. Defense-gan: Protecting classifiers against adversarial attacks using generative models. arXiv preprint arXiv:1805.06605, 2018. [84] Ludwig Schmidt, Shibani Santurkar, Dimitris Tsipras, Kunal Talwar, and Aleksander Madry. Ad- versarially robust generalization requires more data. In Advances in Neural Information Processing Systems, pages 5014–5026, 2018. [85] Karin Schnass. On the identifiability of overcomplete dictionaries via the minimisation principle un- derlying k-svd. Applied and Computational Harmonic Analysis, 37(3):464–491, 2014. [86] Ali Shafahi, W Ronny Huang, Christoph Studer, Soheil Feizi, and Tom Goldstein. Are adversarial examples inevitable? arXiv preprint arXiv:1809.02104, 2018. 
[87] David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of go with deep neural networks and tree search. nature, 529(7587):484, 2016. [88] Mahdi Soltanolkotabi, Adel Javanmard, and Jason D Lee. Theoretical insights into the optimization landscape of over-parameterized shallow neural networks. arXiv preprint arXiv:1707.04926, 2017.

91 [89] Yang Song, Taesup Kim, Sebastian Nowozin, Stefano Ermon, and Nate Kushman. Pixeldefend: Leveraging generative models to understand and defend against adversarial examples. arXiv preprint arXiv:1710.10766, 2017. [90] Daniel Soudry and Yair Carmon. No bad local minima: Data independent training error guarantees for multilayer neural networks. arXiv preprint arXiv:1605.08361, 2016. [91] Daniel A Spielman, Huan Wang, and John Wright. Exact recovery of sparsely-used dictionaries. In Conference on Learning Theory, pages 37–1, 2012. [92] David Stutz, Matthias Hein, and Bernt Schiele. Disentangling adversarial robustness and general- ization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6976–6987, 2019. [93] Arun Sai Suggala, Adarsh Prasad, Vaishnavh Nagarajan, and Pradeep Ravikumar. Revisiting adver- sarial risk. arXiv preprint arXiv:1806.02924, 2018. [94] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere. In 2015 Interna- tional Conference on Sampling Theory and Applications (SampTA), pages 407–410. IEEE, 2015. [95] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint arXiv:1312.6199, 2013. [96] Thomas Tanay and Lewis Griffin. A boundary tilting persepective on the phenomenon of adversarial examples. arXiv preprint arXiv:1608.07690, 2016. [97] Yuandong Tian. An analytical formula of population gradient for two-layered relu network and its applications in convergence and critical point analysis. arXiv preprint arXiv:1703.00560, 2017. [98] Dimitris Tsipras, Shibani Santurkar, Logan Engstrom, Alexander Turner, and Aleksander Madry. Robustness may be at odds with accuracy. In ICLR, number 2019, 2019. [99] Mike Tyka. Class visualization with bilateral filters. https://mtyka.github.io/deepdream/2016/ 02/05/bilateral-class-vis.html, 2016. [100] Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for training one- hidden-layer neural networks. arXiv preprint arXiv:1805.02677, 2018. [101] William E Vinje and Jack L Gallant. Sparse coding and decorrelation in primary visual cortex during natural vision. Science, 287(5456):1273–1276, 2000. [102] Haohan Wang, Xindi Wu, Pengcheng Yin, and Eric P Xing. High frequency component helps explain the generalization of convolutional neural networks. arXiv preprint arXiv:1905.13545, 2019. [103] Yisen Wang, Xingjun Ma, James Bailey, Jinfeng Yi, Bowen Zhou, and Quanquan Gu. On the conver- gence and robustness of adversarial training. In International Conference on Machine Learning, pages 6586–6595, 2019. [104] Bo Xie, Yingyu Liang, and Le Song. Diversity leads to generalization in neural networks. arXiv preprint Arxiv:1611.03131, 2016. [105] Greg Yang. Scaling limits of wide neural networks with weight sharing: Gaussian process behavior, gradient independence, and neural tangent kernel derivation. arXiv preprint arXiv:1902.04760, 2019. [106] Jianchao Yang, Kai Yu, Yihong Gong, and Thomas Huang. Linear spatial pyramid matching using sparse coding for image classification. In 2009 IEEE Conference on computer vision and pattern recognition, pages 1794–1801. IEEE, 2009. [107] Meng Yang, Lei Zhang, Jian Yang, and David Zhang. Robust sparse coding for face recognition. In CVPR 2011, pages 625–632. IEEE, 2011. [108] Dong Yin, Raphael Gontijo Lopes, Jon Shlens, Ekin Dogus Cubuk, and Justin Gilmer. A fourier perspective on model robustness in computer vision. 
In H. Wallach, H. Larochelle, A. Beygelzimer, F. d Alche-Buc, E. Fox, and R. Garnett, editors, Advances in Neural Information Processing Sys- tems 32, pages 13276–13286. Curran Associates, Inc., 2019. URL http://papers.nips.cc/paper/ 9483-a-fourier-perspective-on-model-robustness-in-computer-vision.pdf. [109] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016.

92 [110] Alex Zhai. A high-dimensional CLT in W2 distance with near optimal convergence rate. Probability Theory and Related Fields, 170(3-4):821–845, 2018. [111] Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric Xing, Laurent El Ghaoui, and Michael Jordan. Theoretically principled trade-off between robustness and accuracy. In International Conference on Machine Learning, 2019. [112] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-layer relu networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018. [113] Yi Zhang, Orestis Plevrakis, Simon S Du, Xingguo Li, Zhao Song, and Sanjeev Arora. Over- parameterized adversarial training: An analysis overcoming the curse of dimensionality. arXiv preprint arXiv:2002.06668, 2020. [114] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175, 2017. [115] Difan Zou and Quanquan Gu. An improved analysis of training over-parameterized deep neural net- works. In Advances in Neural Information Processing Systems, pages 2053–2062, 2019. [116] Difan Zou, Yuan Cao, Dongruo Zhou, and Quanquan Gu. Stochastic gradient descent optimizes over- parameterized deep relu networks. arXiv preprint arXiv:1811.08888, 2018.

93