<<

RATT: Leveraging Unlabeled Data to Guarantee Generalization

Saurabh Garg ∗1, Sivaraman Balakrishnan1,3, J. Zico Kolter1,2, and Zachary C. Lipton1

1Machine Learning Department, Carnegie Mellon University 2Computer Science Department, Carnegie Mellon University 3Department of Statistics and Data Science, Carnegie Mellon University

Abstract. To assess generalization, machine learning scientists typically either (i) bound the generalization gap and then (after training) plug in the empirical risk to obtain a bound on the true risk; or (ii) validate empirically on holdout data. However, (i) typically yields vacuous guarantees for overparameterized models. Furthermore, (ii) shrinks the training set and its guarantee erodes with each re-use of the holdout set. In this paper, we introduce a method that leverages unlabeled data to produce generalization bounds. After augmenting our (labeled) training set with randomly labeled fresh examples, we train in the standard fashion. Whenever classifiers achieve low error on clean data and high error on noisy data, our bound provides a tight upper bound on the true risk. We prove that our bound is valid for 0-1 empirical risk minimization and for linear classifiers trained by gradient descent. Our approach is especially useful in conjunction with deep learning due to the early learning phenomenon whereby networks fit true labels before noisy labels, but it requires one intuitive assumption. Empirically, on canonical computer vision and NLP tasks, our bound provides non-vacuous generalization guarantees that track actual performance closely. This work provides practitioners with an option for certifying the generalization of deep nets even when unseen labeled data is unavailable and provides theoretical insights into the relationship between random label noise and generalization.

1 Introduction

Typically, machine learning scientists establish generalization in one of two ways. One approach, favored by learning theorists, places an a priori bound on the gap between the empirical and true risks, usually in terms of the complexity of the hypothesis class. After fitting the model on the available data, one can plug in the empirical risk to obtain a guarantee on the true risk. The second approach, favored by practitioners, involves splitting the available data into training and holdout partitions, fitting the models on the former and estimating the population risk with the latter. Surely, both approaches are useful, with the former providing theoretical insights and the latter guiding the development of a vast array of practical technology. Nevertheless, both methods have drawbacks. Most a priori generalization bounds rely on uniform convergence and thus fail to explain the ability of deep neural networks to generalize (Zhang et al., 2016; Nagarajan & Kolter, 2019b). On the other hand, provisioning a holdout dataset restricts the amount of labeled data available for training. Moreover, risk estimates based on holdout sets lose their validity with successive re-use of the holdout data due to adaptive overfitting (Murphy, 2012; Dwork et al., 2015; Blum & Hardt, 2015). However, recent empirical studies suggest that on large benchmark datasets, adaptive overfitting is surprisingly absent (Recht et al., 2019).

∗Correspondence to [email protected]

Figure 1: Predicted lower bound on the clean population error with ResNet and MLP on binary CIFAR (test accuracy and predicted bound vs. training epoch). Results aggregated over 5 seeds. '*' denotes the best test performance achieved when training with only clean data and the same hyperparameters (except for the stopping point). The bound predicted by RATT (RHS in (2)) closely tracks the population accuracy on clean data.

In this paper, we propose Randomly Assign, Train and Track (RATT), a new method that leverages unlabeled data to provide a post-training bound on the true risk (i.e., the population error). Here, we assign random labels to a fresh batch of unlabeled data, augmenting the clean training dataset with these randomly labeled points. Next, we train on this data, following standard risk minimization practices. Finally, we track the error on the randomly labeled portion of the training data, estimating the error on the mislabeled portion and using this quantity to upper bound the population error. Counterintuitively, we guarantee generalization by guaranteeing overfitting. Specifically, we prove that Empirical Risk Minimization (ERM) with 0-1 loss leads to lower error on the mislabeled training data than on the mislabeled population. Thus, if despite minimizing the loss on the combined training data, we nevertheless have high error on the mislabeled portion, then the (mislabeled) population error will be even higher. Then, by complementarity, the (clean) population error must be low. Finally, we show how to obtain this guarantee using randomly labeled (vs. mislabeled) data, thus enabling us to incorporate unlabeled data.

To expand the applicability of our idea beyond ERM on the 0-1 error, we prove corresponding results for a linear classifier trained by gradient descent to minimize squared loss. Furthermore, leveraging the connection between early stopping and ℓ2-regularization in linear models (Ali et al., 2018, 2020; Suggala et al., 2018), our results extend to early-stopped gradient descent. Because we make no assumptions on the data distribution, our results on linear models hold for more complex models such as kernel regression and neural networks in the Neural Tangent Kernel (NTK) regime (Jacot et al., 2018; Du et al., 2018, 2019; Allen-Zhu et al., 2019b; Chizat et al., 2019). Addressing practical deep learning models, our guarantee requires an additional (reasonable) assumption, but our experiments verify that the bound yields non-vacuous guarantees that track test error across several major architectures on a range of benchmark datasets for computer vision and Natural Language Processing (NLP). In practice, overparameterized deep networks exhibit an early learning phenomenon, tending to fit clean data before fitting mislabeled data (Liu et al., 2020; Arora et al., 2019; Li et al., 2019). Thus, our procedure yields tight bounds in the early phases of learning. Experimentally, we confirm the early learning phenomenon in standard Stochastic Gradient Descent (SGD) training and illustrate the effectiveness of weight decay combined with large initial learning rates in avoiding interpolation of the mislabeled data while maintaining fit on the training data, strengthening the guarantee provided by our method.

To be clear, we do not advocate RATT as a blanket replacement for the holdout approach. Our main contribution is to introduce a new theoretical perspective on generalization and to provide a method that

may be applicable even when the holdout approach is unavailable. Notably, unlike generalization bounds based on uniform convergence that restrict the complexity of the hypothesis class (Neyshabur et al., 2018, 2015, 2017b; Bartlett et al., 2017; Nagarajan & Kolter, 2019a), our post hoc bounds depend only on the fit to mislabeled data. We emphasize that our theory does not guarantee a priori that early learning should take place, but only a posteriori that when it does, we can provide non-vacuous bounds on the population error. Conceptually, this finding underscores the significance of the early-learning phenomenon in the presence of noisy labels and motivates further work to explain why it occurs.
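To make the procedure concrete, the following is a minimal sketch of our own (not the authors' released code) of the RATT recipe with a generic scikit-learn-style classifier; the model and function names are placeholders.

```python
import numpy as np

def ratt_train_and_track(model, X_clean, y_clean, X_unlabeled, num_classes, seed=0):
    """Randomly Assign, Train and Track (RATT), sketched for a generic classifier.

    1. Assign uniformly random labels to the fresh unlabeled batch.
    2. Train on the union of clean and randomly labeled data.
    3. Track the two empirical errors needed by the bounds of Sec. 3.
    """
    rng = np.random.default_rng(seed)
    y_random = rng.integers(0, num_classes, size=len(X_unlabeled))

    X_mix = np.concatenate([X_clean, X_unlabeled])
    y_mix = np.concatenate([y_clean, y_random])
    model.fit(X_mix, y_mix)  # standard training on the augmented dataset

    err_clean = np.mean(model.predict(X_clean) != y_clean)        # error on clean training data
    err_random = np.mean(model.predict(X_unlabeled) != y_random)  # error on randomly labeled data
    return err_clean, err_random
```

For a binary task, an error on the randomly labeled portion close to 1/2 signals that the model has not fit the random labels, and the bound developed in Sec. 3 is then dominated by the error on the clean data.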

2 Preliminaries

Notation. By $\|\cdot\|$ and $\langle\cdot,\cdot\rangle$ we denote the Euclidean norm and inner product, respectively. For a vector $v\in\mathbb{R}^d$, we use $v_j$ to denote its $j$-th entry, and for an event $E$ we let $\mathbb{I}[E]$ denote the binary indicator of the event.

Suppose we have a multiclass classification problem with input domain $\mathcal{X}\subseteq\mathbb{R}^d$ and label space $\mathcal{Y} = \{1, 2, \ldots, k\}$ (for binary classification, we use $\mathcal{Y} = \{-1, 1\}$). By $\mathcal{D}$ we denote the distribution over $\mathcal{X}\times\mathcal{Y}$. A dataset $S := \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ contains $n$ points sampled i.i.d. from $\mathcal{D}$. By $S$, $T$, and $\widetilde{S}$ we also denote the (uniform) empirical distributions over the points in the datasets $S$, $T$, and $\widetilde{S}$, respectively. Let $\mathcal{F}$ be a class of hypotheses mapping $\mathcal{X}$ to $\mathbb{R}^k$. A training algorithm $\mathcal{A}$ takes a dataset $S$ and returns a classifier $f_{(\mathcal{A},S)}\in\mathcal{F}$. When the context is clear, we drop the parentheses for convenience. Given a classifier $f$ and datum $(x,y)$, we denote the 0-1 error (i.e., classification error) on that point by $\mathcal{E}(f(x), y) := \mathbb{I}\left[y\notin\arg\max_{j\in\mathcal{Y}} f_j(x)\right]$. We express the population error on $\mathcal{D}$ as $\mathcal{E}_{\mathcal{D}}(f) := \mathbb{E}_{(x,y)\sim\mathcal{D}}\left[\mathcal{E}(f(x),y)\right]$ and the empirical error on $S$ as $\mathcal{E}_S(f) := \mathbb{E}_{(x,y)\sim S}\left[\mathcal{E}(f(x),y)\right] = \frac{1}{n}\sum_{i=1}^n\mathcal{E}(f(x_i), y_i)$.

Throughout the paper, we consider the following random label assignment procedure: draw $x\sim\mathcal{D}_{\mathcal{X}}$ (the underlying distribution over $\mathcal{X}$), and then assign a label sampled uniformly at random. We denote a randomly labeled dataset by $\widetilde{S} := \{(x_i, y_i)\}_{i=1}^m \sim \widetilde{\mathcal{D}}^m$, where $\widetilde{\mathcal{D}}$ is the distribution of randomly labeled data. By $\mathcal{D}'$, we denote the mislabeled distribution that corresponds to selecting examples $(x,y)$ according to $\mathcal{D}$ and then re-assigning the label by sampling among the incorrect labels $y'\neq y$ (renormalizing the label marginal).

3 Generalization Bound for RATT with ERM

We now present our generalization bound and proof sketches for ERM on the 0-1 loss (full proofs in App. A). For any dataset $T$, ERM returns the classifier $\hat f$ that minimizes the empirical error:

$$\hat f := \arg\min_{f\in\mathcal{F}}\ \mathcal{E}_T(f)\,. \qquad (1)$$

We focus first on binary classification. Assume we have a clean dataset $S\sim\mathcal{D}^n$ of $n$ points and a randomly labeled dataset $\widetilde{S}\sim\widetilde{\mathcal{D}}^m$ of $m\,(<n)$ points whose labels are assigned uniformly at random. We show that with 0-1 loss minimization on the union of $S$ and $\widetilde{S}$, we obtain a classifier whose error on $\mathcal{D}$ is upper bounded by a function of the empirical errors on clean data $\mathcal{E}_S$ (lower is better) and on randomly labeled data $\mathcal{E}_{\widetilde{S}}$ (higher is better):

Theorem 1. For any classifier $\hat f$ obtained by ERM (1) on the dataset $S\cup\widetilde{S}$, for any $\delta>0$, with probability at least $1-\delta$, we have
$$\mathcal{E}_{\mathcal{D}}(\hat f) \;\le\; \mathcal{E}_S(\hat f) + \left(1 - 2\mathcal{E}_{\widetilde{S}}(\hat f)\right) + \left(2\mathcal{E}_{\widetilde{S}}(\hat f) + 2 + \frac{\sqrt{m}}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}}\,. \qquad (2)$$
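As a quick illustration (our own sketch, using the form of (2) as reconstructed above), the bound can be evaluated directly from the two tracked empirical errors:

```python
import math

def ratt_binary_bound(err_clean, err_random, n, m, delta=0.1):
    """Evaluate the RHS of (2): an upper bound on the clean population error.

    err_clean  : empirical 0-1 error on the clean training data
    err_random : empirical 0-1 error on the randomly labeled training data
    n, m       : number of clean and randomly labeled training points
    """
    dominating = err_clean + (1.0 - 2.0 * err_random)
    slack = (2.0 * err_random + 2.0 + math.sqrt(m) / (2.0 * n)) \
        * math.sqrt(math.log(4.0 / delta) / m)
    return dominating + slack

# Example: 5% clean error, 45% error on the random labels, n = 45000, m = 5000
print(ratt_binary_bound(0.05, 0.45, n=45000, m=5000))
```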

In short, this theorem tells us that if, after training on both clean and randomly labeled data, we achieve low error on the clean data but high error (close to 1/2) on the randomly labeled data, then low population error is guaranteed. Note that because the labels in $\widetilde{S}$ are assigned randomly, the error $\mathcal{E}_{\widetilde{S}}(f)$ for any fixed predictor $f$ (not dependent on $\widetilde{S}$) will be approximately 1/2. Thus, if ERM produces a classifier that has not fit the randomly labeled data, then $(1 - 2\mathcal{E}_{\widetilde{S}}(\hat f))$ will be approximately 0, and our error will be determined by the fit to clean data. The final term accounts for finite sample error and notably (i) does not depend on the complexity of the hypothesis class; and (ii) approaches 0 at an $\mathcal{O}(1/\sqrt{m})$ rate (for $m < n$).

Our proof strategy unfolds in three steps. First, in Lemma 1 we bound $\mathcal{E}_{\mathcal{D}}(\hat f)$ in terms of the error on the mislabeled subset of $\widetilde{S}$. Next, in Lemmas 2 and 3, we show that the error on the mislabeled subset can be accurately estimated using only clean and randomly labeled data.

To begin, assume that we actually knew the original labels for the randomly labeled data. By $\widetilde{S}_C$ and $\widetilde{S}_M$, we denote the clean and mislabeled portions of the randomly labeled data, respectively (with $\widetilde{S} = \widetilde{S}_M \cup \widetilde{S}_C$). Note that for binary classification, a lower bound on the mislabeled population error $\mathcal{E}_{\mathcal{D}'}(\hat f)$ directly upper bounds the error on the original population $\mathcal{E}_{\mathcal{D}}(\hat f)$. Thus we only need to prove that the empirical error on the mislabeled portion of our data is lower than the error on unseen mislabeled data, i.e., $\mathcal{E}_{\widetilde{S}_M}(\hat f) \le \mathcal{E}_{\mathcal{D}'}(\hat f) = 1 - \mathcal{E}_{\mathcal{D}}(\hat f)$ (up to $\mathcal{O}(1/\sqrt{m})$).

Lemma 1. Assume the same setup as in Theorem 1. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we have

$$\mathcal{E}_{\mathcal{D}}(\hat f) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\hat f) + \sqrt{\frac{\log(1/\delta)}{m}}\,. \qquad (3)$$

Proof Sketch. The main idea of our proof is to regard the clean portion of the data ($S\cup\widetilde{S}_C$) as fixed. Then, there exists a classifier $f^*$ that is optimal over draws of the mislabeled data $\widetilde{S}_M$. Formally,
$$f^* := \arg\min_{f\in\mathcal{F}} \mathcal{E}_{\widehat{\mathcal{D}}}(f)\,,$$
where $\widehat{\mathcal{D}}$ is a combination of the empirical distribution over correctly labeled data $S\cup\widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that $\hat f = \arg\min_{f\in\mathcal{F}} \mathcal{E}_{S\cup\widetilde{S}}(f)$. Since $\hat f$ minimizes the 0-1 error on $S\cup\widetilde{S}$, we have $\mathcal{E}_{S\cup\widetilde{S}}(\hat f) \le \mathcal{E}_{S\cup\widetilde{S}}(f^*)$. Moreover, since $f^*$ is independent of $\widetilde{S}_M$, we have with probability at least $1-\delta$ that
$$\mathcal{E}_{\widetilde{S}_M}(f^*) \le \mathcal{E}_{\mathcal{D}'}(f^*) + \sqrt{\frac{\log(1/\delta)}{m}}\,.$$
Finally, since $f^*$ is the optimal classifier on $\widehat{\mathcal{D}}$, we have $\mathcal{E}_{\widehat{\mathcal{D}}}(f^*) \le \mathcal{E}_{\widehat{\mathcal{D}}}(\hat f)$. Combining the above steps and using the fact that $\mathcal{E}_{\mathcal{D}} = 1 - \mathcal{E}_{\mathcal{D}'}$, we obtain the desired result.

While the LHS in (3) depends on the unknown portion $\widetilde{S}_M$, our goal is to use unlabeled data (with randomly assigned labels) for which the mislabeled portion cannot be readily identified. Fortunately, we do not need to identify the mislabeled points to estimate the error on these points in aggregate, $\mathcal{E}_{\widetilde{S}_M}(\hat f)$. Note that because the label marginal is uniform, approximately half of the data will be correctly labeled and the remaining half will be mislabeled. Consequently, we can utilize the value of $\mathcal{E}_{\widetilde{S}}(\hat f)$ and an estimate of $\mathcal{E}_{\widetilde{S}_C}(\hat f)$ to lower bound $\mathcal{E}_{\widetilde{S}_M}(\hat f)$. We formalize this as follows:

Lemma 2. Assume the same setup as Theorem 1. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde{S}$, we have
$$2\mathcal{E}_{\widetilde{S}}(\hat f) - \mathcal{E}_{\widetilde{S}_C}(\hat f) - \mathcal{E}_{\widetilde{S}_M}(\hat f) \le 2\mathcal{E}_{\widetilde{S}}(\hat f)\sqrt{\frac{\log(4/\delta)}{2m}}\,.$$

To complete the argument, we show that due to the exchangeability of the clean data $S$ and the clean portion of the randomly labeled data $\widetilde{S}_C$, we can estimate the error on the latter, $\mathcal{E}_{\widetilde{S}_C}(\hat f)$, by the error on the former, $\mathcal{E}_S(\hat f)$.

Lemma 3. Assume the same setup as Theorem 1. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde{S}_C$ and $S$, we have
$$\mathcal{E}_{\widetilde{S}_C}(\hat f) - \mathcal{E}_S(\hat f) \le \left(1 + \frac{m}{2n}\right)\sqrt{\frac{\log(2/\delta)}{m}}\,.$$

Lemma 3 establishes a tight bound on the difference of the error of classifier $\hat f$ on $\widetilde{S}_C$ and on $S$. The proof uses Hoeffding's inequality for randomly sampled points from a fixed population (Hoeffding, 1994; Bardenet et al., 2015).

Having established these core components, we can now summarize the proof strategy for Theorem 1. We bound the population error on clean data (the term on the LHS of (2)) in three steps: (i) use Lemma 1 to upper bound the error on the clean distribution, $\mathcal{E}_{\mathcal{D}}(\hat f)$, by the error on mislabeled training data $\mathcal{E}_{\widetilde{S}_M}(\hat f)$; (ii) approximate $\mathcal{E}_{\widetilde{S}_M}(\hat f)$ by $\mathcal{E}_{\widetilde{S}_C}(\hat f)$ and the error on randomly labeled training data (i.e., $\mathcal{E}_{\widetilde{S}}(\hat f)$) using Lemma 2; and (iii) use Lemma 3 to estimate $\mathcal{E}_{\widetilde{S}_C}(\hat f)$ using the error on clean training data ($\mathcal{E}_S(\hat f)$).

Comparison with Rademacher bound. Our bound in Theorem 1 shows that we can upper bound the clean population error of a classifier by estimating its accuracy on the clean and randomly labeled portions of the training data. Next, we show that our bound's dominating term is upper bounded by the Rademacher complexity (Shalev-Shwartz & Ben-David, 2014), a standard distribution-dependent complexity measure.

Proposition 1. Fix a randomly labeled dataset $\widetilde{S}\sim\widetilde{\mathcal{D}}^m$.
Then for any classifier $f\in\mathcal{F}$ (possibly dependent on $\widetilde{S}$; we restrict $\mathcal{F}$ to functions that output a label in $\mathcal{Y} = \{-1,1\}$) and for any $\delta > 0$, with probability at least $1-\delta$ over random draws of $\widetilde{S}$, we have
$$1 - 2\mathcal{E}_{\widetilde{S}}(f) \le \mathbb{E}_{\epsilon,x}\left[\sup_{f\in\mathcal{F}}\left(\frac{\sum_i \epsilon_i f(x_i)}{m}\right)\right] + \sqrt{\frac{2\log(2/\delta)}{m}}\,,$$
where $\epsilon$ is drawn from a uniform distribution over $\{-1,1\}^m$ and $x$ is drawn from $\mathcal{D}_{\mathcal{X}}^m$. In other words, the proposition above highlights that the accuracy on the randomly labeled data is never larger than the Rademacher complexity of $\mathcal{F}$ w.r.t. the underlying distribution over $\mathcal{X}$, implying that our bound is never looser than a Rademacher complexity based bound. The proof follows by application of the bounded difference condition and McDiarmid's inequality (McDiarmid, 1989). We now discuss extensions of Theorem 1 to regularized ERM and multiclass classification.

Extension to regularized ERM. Consider any function $\mathcal{R}:\mathcal{F}\to\mathbb{R}$, e.g., a regularizer that penalizes some measure of complexity for functions in class $\mathcal{F}$. Consider the following regularized ERM:

$$\hat f := \arg\min_{f\in\mathcal{F}}\ \mathcal{E}_S(f) + \lambda\mathcal{R}(f)\,, \qquad (4)$$
where $\lambda$ is a regularization constant. If the regularization coefficient is independent of the training data $S\cup\widetilde{S}$, then our guarantee (Theorem 1) holds. Formally,

Theorem 2. For any regularization function $\mathcal{R}$, assume we perform regularized ERM as in (4) on $S\cup\widetilde{S}$ and obtain a classifier $\hat f$. Then, for any $\delta > 0$, with probability at least $1-\delta$, we have
$$\mathcal{E}_{\mathcal{D}}(\hat f) \le \mathcal{E}_S(\hat f) + \left(1 - 2\mathcal{E}_{\widetilde{S}}(\hat f)\right) + \left(2\mathcal{E}_{\widetilde{S}}(\hat f) + 2 + \frac{\sqrt{m}}{2n}\right)\sqrt{\frac{\log(1/\delta)}{m}}\,.$$

A key insight here is that the proof of Theorem 1 treats the clean data $S$ as fixed and considers random draws of the mislabeled portion. Thus a data-independent regularization function doesn't alter our chain of arguments and hence has no impact on the resulting inequality. We prove this result formally in App. A. We note one immediate corollary from Theorem 2: when learning any function $f$ parameterized by $w$ with an L2-norm penalty on the parameters $w$, the population error of $\hat f$ is determined by the error on the clean training data as long as the error on randomly labeled data is high (close to 1/2).

Extension to multiclass classification. Thus far, we have addressed binary classification. We now extend these results to the multiclass setting. As before, we obtain datasets $S$ and $\widetilde{S}$. Here, random labels are assigned uniformly among all classes.

Theorem 3. For any regularization function $\mathcal{R}$, assume we perform regularized ERM as in (4) on $S\cup\widetilde{S}$ and obtain a classifier $\hat f$. For a multiclass classification problem with $k$ classes, for any $\delta > 0$, with probability at least $1-\delta$, we have
$$\mathcal{E}_{\mathcal{D}}(\hat f) \le \mathcal{E}_S(\hat f) + (k-1)\left(1 - \frac{k}{k-1}\mathcal{E}_{\widetilde{S}}(\hat f)\right) + c\sqrt{\frac{k\log(4/\delta)}{2m}}\,, \qquad (5)$$
for some constant $c \le \left(2k + \sqrt{k} + \frac{k\sqrt{m}}{n}\right)$.

We first discuss the implications of Theorem 3. Besides the empirical error on clean data, the dominating term in the above expression is given by $(k-1)\left(1 - \frac{k}{k-1}\mathcal{E}_{\widetilde{S}}(\hat f)\right)$. For any predictor $f$ (not dependent on $\widetilde{S}$), the term $\mathcal{E}_{\widetilde{S}}(f)$ would be approximately $(k-1)/k$, and for $\hat f$, the difference now evaluates to the accuracy on the randomly labeled data. Note that for binary classification, (5) simplifies to Theorem 1.

The core of our proof involves obtaining an inequality similar to (3). While for binary classification, we could upper bound $\mathcal{E}_{\widetilde{S}_M}$ with $1 - \mathcal{E}_{\mathcal{D}}$ (in the proof of Lemma 1), for multiclass classification, the error on the mislabeled data and the accuracy on the clean data in the population are not so directly related. To establish an inequality analogous to (3), we break the error on the (unknown) mislabeled data into two parts: one term corresponds to predicting the true label on mislabeled data, and the other corresponds to predicting neither the true label nor the assigned (mis-)label. Finally, we relate these errors to their population counterparts to establish an inequality similar to (3).
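The corresponding computation for the multiclass bound (our own sketch, with the constant in (5) as reconstructed above) is as follows:

```python
import math

def ratt_multiclass_bound(err_clean, err_random, n, m, k, delta=0.1):
    """Evaluate the dominating term of (5) plus the finite-sample slack."""
    dominating = err_clean + (k - 1.0) * (1.0 - (k / (k - 1.0)) * err_random)
    c = 2.0 * k + math.sqrt(k) + k * math.sqrt(m) / n
    slack = c * math.sqrt(k * math.log(4.0 / delta) / (2.0 * m))
    return dominating + slack

# For k = 10 classes, err_random near (k - 1) / k = 0.9 indicates little fit to random labels.
print(ratt_multiclass_bound(0.08, 0.88, n=45000, m=5000, k=10))
```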

4 Generalization Bound for RATT with Gradient Descent

In the previous section, we presented results for ERM on the 0-1 loss. While minimizing the 0-1 loss is hard in general, these results provide important theoretical insights. In this section, we show parallel results for linear models trained with Gradient Descent (GD). To begin, we introduce the setup and some additional notation. For simplicity, we begin the discussion with binary classification with $\mathcal{X} = \mathbb{R}^d$. Define a linear function $f(x; w) := w^T x$ for some $w\in\mathbb{R}^d$ and $x\in\mathcal{X}$. Given a training set $S$, we suppose that the parameters of the linear function are obtained via gradient descent on the following L2-regularized problem:

$$\mathcal{L}_S(w; \lambda) := \sum_{i=1}^n \left(w^T x_i - y_i\right)^2 + \lambda\,\|w\|_2^2\,, \qquad (6)$$
where $\lambda \ge 0$ is a regularization parameter. Our choice to analyze squared loss minimization for linear networks is motivated in part by its analytical convenience, and follows recent theoretical work which analyzes neural networks trained via squared loss minimization in the Neural Tangent Kernel (NTK) regime, where they are well approximated by linear networks (Jacot et al., 2018; Arora et al., 2019; Du et al., 2019;

Hu et al., 2019). Moreover, recent research suggests that for classification tasks, squared loss minimization performs comparably to cross-entropy loss minimization (Muthukumar et al., 2020; Hui & Belkin, 2020). For a given training set $S$, we use $S^{(i)}$ to denote the training set $S$ with the $i$-th point removed. We now introduce one stability condition:

Condition 1 (Hypothesis Stability). We have $\beta$ hypothesis stability if our training algorithm $\mathcal{A}$ satisfies the following:

$$\forall\, i\in\{1, 2, \ldots, n\},\quad \mathbb{E}_{S,\,(x,y)\sim\mathcal{D}}\left[\left|\mathcal{E}(f(x), y) - \mathcal{E}\!\left(f_{(i)}(x), y\right)\right|\right] \le \frac{\beta}{n}\,,$$
where $f_{(i)} := f_{(\mathcal{A}, S^{(i)})}$ and $f := f_{(\mathcal{A}, S)}$. This condition is similar to a notion of stability called hypothesis stability (Bousquet & Elisseeff, 2002; Kearns & Ron, 1999; Elisseeff et al., 2003). Intuitively, Condition 1 states that the empirical leave-one-out error and the average population error of leave-one-out classifiers are close. This condition is mild and does not guarantee generalization. We discuss the implications in more detail in App. B.3.

Now we present the main result of this section. As before, we assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^n \sim \mathcal{D}^n$ and a randomly labeled dataset $\widetilde{S} = \{(x_i, y_i)\}_{i=n+1}^{n+m} \sim \widetilde{\mathcal{D}}^m$. Let $X = [x_1, x_2, \cdots, x_{m+n}]^T$ and $y = [y_1, y_2, \cdots, y_{m+n}]$. Fix a positive learning rate $\eta$ such that $\eta \le 1/\left(\|X^T X\|_{\mathrm{op}} + \lambda\right)$ and an initialization $w_0 = 0$. Consider the following gradient descent iterates to minimize objective (6) on $S\cup\widetilde{S}$:

$$w_t = w_{t-1} - \eta\,\nabla_w \mathcal{L}_{S\cup\widetilde{S}}(w_{t-1}; \lambda) \qquad \forall\, t = 1, 2, \ldots \qquad (7)$$

Then the iterates $\{w_t\}$ converge to the limiting solution $\hat w = \left(X^T X + \lambda I\right)^{-1} X^T y$. Define $\hat f(x) := f(x; \hat w)$.
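The closed form above is easy to check numerically. The following small sketch of our own runs the iterates (7) on synthetic data and compares them with the ridge solution; all quantities are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, lam = 200, 20, 1.0
X = rng.normal(size=(n, d))
y = np.sign(X @ rng.normal(size=d) + 0.1 * rng.normal(size=n))

# Closed-form minimizer of (6): w_hat = (X^T X + lam * I)^{-1} X^T y
w_closed = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Gradient descent iterates (7), with a step size comfortably below 1 / (||X^T X||_op + lam)
eta = 0.5 / (np.linalg.norm(X.T @ X, 2) + lam)
w = np.zeros(d)
for _ in range(20000):
    grad = 2 * X.T @ (X @ w - y) + 2 * lam * w   # gradient of the objective in (6)
    w = w - eta * grad

print(np.linalg.norm(w - w_closed))  # small: the iterates approach the ridge solution
```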

Theorem 4. Assume that this gradient descent algorithm satisfies Condition 1 with $\beta = \mathcal{O}(1)$. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of datasets $S$ and $\widetilde{S}$, we have:

$$\mathcal{E}_{\mathcal{D}}(\hat f) \le \mathcal{E}_S(\hat f) + \left(1 - 2\mathcal{E}_{\widetilde{S}}(\hat f)\right) + \left(2\mathcal{E}_{\widetilde{S}}(\hat f) + 1 + \frac{\sqrt{m}}{2n}\right)\sqrt{\frac{\log(4/\delta)}{m}} + \sqrt{\frac{4}{\delta}\left(\frac{1}{m} + \frac{3\beta}{m+n}\right)}\,. \qquad (8)$$

With a mild regularity condition, we establish the same bound for GD training with squared loss, notably with the same dominating term on the population error as in Theorem 1. In App. B.2, we present the extension to multiclass classification, where we again obtain a result parallel to Theorem 3.

Proof Sketch. Because squared loss minimization does not imply 0-1 error minimization, we cannot use arguments from Lemma 1. This is the main technical difficulty. To compare the 0-1 error at a training point with that at an unseen point, we use the closed-form expression for $\hat w$. We show that the training error on mislabeled points is less than the population error on the distribution of mislabeled data (parallel to Lemma 1). For a mislabeled training point $(x_i, y_i)$ in $\widetilde{S}$, we show that

$$\mathbb{I}\left[y_i x_i^T \hat w \le 0\right] \le \mathbb{I}\left[y_i x_i^T \hat w_{(i)} \le 0\right]\,, \qquad (9)$$

where $\hat w_{(i)}$ is the classifier obtained by leaving out the $i$-th point from the training set. Intuitively, this condition states that the training error at a training point is less than the leave-one-out error at that point, i.e., the error obtained by removing that point and re-training. We then use Condition 1 to relate the average leave-one-out error (the average over the index $i$ of the RHS in (9)) to the population error on the mislabeled distribution to obtain an inequality similar to (3).

Extensions to kernel regression. Since the result in Theorem 4 doesn't impose any regularity conditions on the underlying distribution over $\mathcal{X}\times\mathcal{Y}$, our guarantees straightforwardly extend to kernel regression by using the transformation $x \to \phi(x)$ for some feature transform function $\phi$. Furthermore, recent literature has pointed out a concrete connection between neural networks and kernel regression with the so-called Neural Tangent Kernel (NTK), which holds in a certain regime where weights don't change much during training (Jacot et al., 2018; Du et al., 2019, 2018; Chizat et al., 2019). Using this correspondence, our bounds on the clean population error (Theorem 4) extend to wide neural networks operating in the NTK regime.
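For instance, a minimal sketch of our own of the same closed form after a feature map, written in its standard kernelized form with an RBF kernel; all names and parameters here are illustrative, not from the paper.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X_train, y_train, lam=1.0, gamma=0.5):
    # alpha = (K + lam * I)^{-1} y is the kernelized analogue of (X^T X + lam I)^{-1} X^T y
    K = rbf_kernel(X_train, X_train, gamma)
    alpha = np.linalg.solve(K + lam * np.eye(len(X_train)), y_train)
    return lambda X_new: rbf_kernel(X_new, X_train, gamma) @ alpha

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = np.sign(X[:, 0] + 0.1 * rng.normal(size=100))
predict = kernel_ridge_fit(X, y)
print(np.mean(np.sign(predict(X)) != y))  # training 0-1 error of the kernel predictor
```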

Extensions to early stopped GD. Often in practice, gradient descent is stopped early. We now provide theoretical evidence that our guarantees may continue to hold for an early stopped GD iterate. Concretely, we show that in expectation, the outputs of the GD iterates are close to those of a problem with data-independent regularization (as considered in Theorem 2). First, we define some notation. By $\mathcal{L}_S(w)$, we denote the objective in (6) with $\lambda = 0$. Consider the GD iterates defined in (7). Let $\hat w_\lambda = \arg\min_w \mathcal{L}_{S\cup\widetilde{S}}(w; \lambda)$. Define $f_t(x) := f(x; w_t)$ as the solution at the $t$-th iterate and $\hat f_\lambda(x) := f(x; \hat w_\lambda)$ as the regularized solution. Let $\kappa$ be the condition number of the population covariance matrix and let $s_{\min}$ be the minimum positive singular value of the empirical covariance matrix.

Proposition 2 (informal). For $\lambda = \frac{1}{t\eta}$, we have
$$\mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\left[\left(f_t(x) - \hat f_\lambda(x)\right)^2\right] \le c(t, \eta)\cdot\mathbb{E}_{x\sim\mathcal{D}_{\mathcal{X}}}\left[f_t(x)^2\right],$$
where $c(t, \eta) \approx \kappa\cdot\min\!\left(0.25,\ \frac{1}{s_{\min}^2 t^2 \eta^2}\right)$. An equivalent guarantee holds for a point $x$ sampled from the training data.

The proposition above states that for large enough $t$, GD iterates stay close to a regularized solution with a data-independent regularization constant. Together with our guarantees in Theorem 4 for the regularized solution with $\lambda = \frac{1}{t\eta}$, Proposition 2 shows that our guarantees with RATT may hold for early stopped GD. See the formal result in App. B.4.

Remark. Proposition 2 only bounds the expected squared difference between the $t$-th gradient descent iterate and a corresponding regularized solution. The expected squared difference and the expected difference of classification errors, which we wish to bound, are not related in general. They can, however, be related under standard low-noise (margin) assumptions. For instance, under the Tsybakov noise condition (Tsybakov et al., 1997; Yao et al., 2007), we can lower bound the expression on the LHS of Proposition 2 with the difference of expected classification errors.
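A small numerical illustration of our own of the correspondence behind Proposition 2: the GD iterate $w_t$ (unregularized) is compared against the ridge solution with $\lambda = 1/(t\eta)$ on synthetic data; all quantities are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 500, 30
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.5 * rng.normal(size=n)

eta = 0.5 / np.linalg.norm(X.T @ X, 2)
w = np.zeros(d)
for t in range(1, 2001):
    w -= eta * (2 * X.T @ (X @ w - y))           # unregularized GD on (6) with lambda = 0
    if t in (10, 100, 1000):
        lam = 1.0 / (t * eta)                    # data-independent lambda matched to the stopping time
        w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
        rel = np.linalg.norm(X @ (w - w_ridge)) / np.linalg.norm(X @ w)
        print(t, round(rel, 3))                  # relative gap between the two predictors shrinks as t grows
```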

Extensions to deep learning. Note that the main lemma underlying our bound on the (clean) population error states that when training on a mixture of clean and randomly labeled data, we obtain a classifier whose empirical error on the mislabeled training data is lower than its population error on the distribution of mislabeled data. We prove this for ERM on the 0-1 loss (Lemma 1). For linear models (and networks in the NTK regime), we obtained this result by assuming hypothesis stability and relating the training error at a datum with the leave-one-out error (Theorem 4). However, to extend our bound to deep models we must assume that training on the mixture of random and clean data leads to overfitting on the random mixture. Formally:

Assumption 1. Let $\hat f$ be a model obtained by training with an algorithm $\mathcal{A}$ on a mixture of clean data $S$ and randomly labeled data $\widetilde{S}$. Then with probability $1-\delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we assume that the following condition holds:
$$\mathcal{E}_{\widetilde{S}_M}(\hat f) \le \mathcal{E}_{\mathcal{D}'}(\hat f) + c\sqrt{\frac{\log(1/\delta)}{2m}}\,,$$
for a fixed constant $c > 0$.

Figure 2: We plot the accuracy and the corresponding bound (RHS in (2)) at δ = 0.1 for binary classification tasks (panels: underparameterized model, MNIST, IMDb; y-axis: accuracy). Results aggregated over 3 seeds. (a) Accuracy vs. fraction of unlabeled data (w.r.t. clean data) in the toy setup with a linear model trained with GD. (b) Accuracy vs. fraction of unlabeled data for a 2-layer wide network trained with SGD on binary MNIST. With SGD and no regularization (red curve in (b)), we interpolate the training data and hence the predicted lower bound is 0. However, with early stopping (or weight decay) we obtain tight guarantees. (c) Accuracy vs. gradient iteration on the IMDb dataset with the unlabeled fraction fixed at 0.2. In plot (c), '*' denotes the best test accuracy with the same hyperparameters and training only on clean data. See App. C for exact hyperparameter values.

Under Assumption 1, our results in Theorems 1, 2 and 3 extend beyond ERM with the 0-1 loss to general learning algorithms. We include the formal result in App. B.5. Note that given the ability of neural networks to interpolate the data, this assumption seems uncontroversial in the later stages of training. Moreover, concerning the early phases of training, recent research has shown that learning dynamics for complex deep networks resemble those for linear models (Nakkiran et al., 2019; Hu et al., 2020), much like the wide neural networks that we do analyze. Together, these arguments help to justify Assumption 1 and hence, the applicability of our bound in deep learning. Motivated by our analysis of linear models trained with gradient descent, we discuss conditions in App. B.6 which imply Assumption 1 for constant values δ > 0. In the next section, we empirically demonstrate the applicability of our bounds for deep models.

5 Empirical Study and Implications

Having established our framework theoretically, we now demonstrate its utility experimentally. First, for linear models and wide networks in the NTK regime where our guarantee holds, we confirm that our bound is not only valid, but closely tracks the generalization error. Next, we show that in practical deep learning settings, optimizing cross-entropy loss by SGD, the expression for our (0-1) ERM bound nevertheless tracks test performance closely and, in numerous experiments on diverse models and datasets, is never violated empirically.

Datasets. To verify our results on linear models, we consider a toy dataset, where the class-conditional distribution p(x|y) for each label is Gaussian. For binary tasks, we use binarized CIFAR-10 (first 5 classes vs. rest) (Krizhevsky & Hinton, 2009), binary MNIST (0-4 vs. 5-9) (LeCun et al., 1998) and the IMDb sentiment analysis dataset (Maas et al., 2011). For the multiclass setup, we use MNIST and CIFAR-10.

Architectures. To simulate the NTK regime, we experiment with 2-layer wide networks, both (i) with the second layer fixed at random initialization and (ii) updating both layers' weights. For vision datasets (e.g., MNIST and CIFAR10), we consider (fully connected) multilayer perceptrons (MLPs) with ReLU activations and ResNet18 (He et al., 2016). For the IMDb dataset, we train Long Short-Term Memory

Networks (LSTMs; Hochreiter & Schmidhuber (1997)) with ELMo embeddings (Peters et al., 2018) and fine-tune an off-the-shelf uncased BERT model (Devlin et al., 2018; Wolf et al., 2020).

Methodology. To bound the population error, we require access to both clean and unlabeled data. For toy datasets, we obtain unlabeled data by sampling from the underlying distribution over X. For image and text datasets, we hold out a small fraction of the clean training data and discard their labels to simulate unlabeled data. We use the random labeling procedure described in Sec. 2. After augmenting the clean training data with randomly labeled data, we train in the standard fashion. See App. C for experimental details.
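A sketch of our own of this data preparation: hold out a fraction of the clean training set, discard its labels, relabel it uniformly at random, and mix it back in. The function name and the fraction mirror the description above but are otherwise illustrative.

```python
import numpy as np

def make_ratt_split(X, y, num_classes, unlabeled_frac=0.2, seed=0):
    """Split a clean training set into (clean, randomly labeled) parts for RATT."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    m = int(unlabeled_frac * len(X))
    rand_idx, clean_idx = idx[:m], idx[m:]

    X_rand, X_clean, y_clean = X[rand_idx], X[clean_idx], y[clean_idx]
    y_rand = rng.integers(0, num_classes, size=m)   # original labels of the held-out part are discarded

    X_train = np.concatenate([X_clean, X_rand])
    y_train = np.concatenate([y_clean, y_rand])
    # mask so the error on the randomly labeled portion can be tracked during training
    is_random = np.concatenate([np.zeros(len(X_clean), bool), np.ones(m, bool)])
    return X_train, y_train, is_random

# tiny demo on synthetic data
X_demo = np.arange(20, dtype=float).reshape(10, 2)
y_demo = np.array([0, 1] * 5)
X_tr, y_tr, mask = make_ratt_split(X_demo, y_demo, num_classes=2)
```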

Underparameterized linear models. On toy Gaussian data, we train linear models with GD to minimize cross-entropy loss and mean squared error. Varying the fraction of randomly labeled data, we observe that the accuracy on clean unseen data is barely impacted (Fig. 2(a)). This highlights that for low-dimensional models, adding randomly labeled data to the clean dataset (in the toy setup) has minimal effect on the performance on unseen clean data. Moreover, we find that RATT offers a tight lower bound on the unseen clean data accuracy. We observe the same behavior with Stochastic Gradient Descent (SGD) training (ref. App. C).

Wide Nets. Next, we consider MNIST binary classification with a wide 2-layer fully-connected network. In experiments with SGD training on MSE loss without early stopping or weight decay regularization, we find that adding extra randomly labeled data hurts the unseen clean performance (Fig. 2(b)). Additionally, due to the perfect fit on the training data, our bound is rendered vacuous. However, with early stopping (or weight decay), we observe close to zero performance difference with additional randomly labeled data. Alongside, we obtain tight bounds on the unseen clean data accuracy, paying only a small-to-negligible price for incorporating randomly labeled data. Similar results hold for SGD and GD and when cross-entropy loss is substituted for MSE (ref. App. C).

Deep Nets. We verify our findings on (i) ResNet-18 and 5-layer MLPs trained on binary CIFAR (Fig. 1); and (ii) ELMo-LSTM and BERT-Base models fine-tuned on the IMDb dataset (Fig. 2(c)). See App. C for additional results with deep models on binary MNIST. We fix the amount of unlabeled data at 20% of the clean dataset size and train all models with standard hyperparameters. Consistently, we find that our predicted bounds are never violated in practice. As training proceeds, the fit on the mislabeled data increases, with perfect overfitting in the interpolation regime rendering our bounds vacuous. However, with early stopping, our bound predicts test performance closely. For example, on the IMDb dataset with BERT fine-tuning, we predict 79.8 as the accuracy of the classifier, when the true performance is 88.04 (and the best achievable performance on unseen data is 92.45). Additionally, we observe that our method tracks the performance from the beginning of training and not just towards the end.

Finally, we verify our multiclass bound on MNIST and CIFAR10 with deep MLPs and ResNets (see results in Table 1 and per-epoch curves in App. C). As before, we fix the amount of unlabeled data at 20% of the clean dataset and minimize cross-entropy loss via SGD. In all four settings, our bound predicts non-vacuous performance on unseen data. In App. C, we investigate our approach on CIFAR100, showing that even though our bound grows pessimistic with greater numbers of classes, the error on the mislabeled data nevertheless tracks population accuracy.

6 Discussion and Connections to Prior Work

Implicit bias in deep learning. Several recent lines of research attempt to explain the generalization of neural networks despite massive overparameterization via the implicit bias of gradient descent (Soudry et al., 2018; Gunasekar et al., 2018a,b; Ji & Telgarsky, 2019; Chizat & Bach, 2020). Noting that even for overparameterized linear models, there exist multiple parameters capable of overfitting the training data (with arbitrarily low loss), of which some generalize well and others do not, they seek to characterize the favored solution. Notably, Soudry et al. (2018) find that for linear networks, gradient descent converges (slowly) to the max-margin solution. A complementary line of work focuses on the early phases of training, finding both empirically (Rolnick et al., 2017; Arpit et al., 2017) and theoretically (Arora et al., 2019; Li et al., 2020; Liu et al., 2020) that even in the presence of a small amount of mislabeled data, gradient descent is biased to fit the clean data first during the initial phases of training. However, to the best of our knowledge, no prior work leverages this phenomenon to obtain generalization guarantees on the clean data, which is the primary focus of our work. Our method exploits this phenomenon to produce non-vacuous generalization bounds. Even when we cannot prove a priori that models will fit the clean data well while performing badly on the mislabeled data, we can observe that it indeed happens (often in practice), and thus, a posteriori, provide tight bounds on the population error. Moreover, by using regularizers like early stopping or weight decay, we can accentuate this phenomenon, enabling our framework to provide even tighter guarantees.

Dataset    Model    Pred. Acc.    Test Acc.    Best Acc.
MNIST      MLP      93.1          97.4         97.9
MNIST      ResNet   96.8          98.8         98.9
CIFAR10    MLP      48.4          54.2         60.0
CIFAR10    ResNet   76.4          88.9         92.3

Table 1: Results on multiclass classification tasks. With pred. acc. we refer to the dominating term in the RHS of (5). At the given sample size and δ = 0.1, the remaining term evaluates to 30.7, decreasing our predicted accuracy by the same. We note that test acc. denotes the corresponding accuracy on unseen clean data. Best acc. is the best achievable accuracy with training on just the clean data (and the same hyperparameters except the stopping point). Note that across all tasks our predicted bound is tight and the gap between the best accuracy and the test accuracy is small. Exact hyperparameters are included in App. C.

Non-vacuous generalization bounds. In light of the inapplicability of traditional complexity-based bounds to deep neural networks (Zhang et al., 2016; Nagarajan & Kolter, 2019b), researchers have investigated alternative strategies to provide non-vacuous generalization bounds for deep nets (Neyshabur et al., 2015, 2017b,a, 2018; Dziugaite & Roy, 2017; Bartlett et al., 2017; Xu & Raginsky, 2017; Arora et al., 2018; Li & Liang, 2018; Allen-Zhu et al., 2019a; Pensia et al., 2018; Zhou et al., 2018; Nagarajan & Kolter, 2019a; Nakkiran et al., 2020). However, these bounds typically remain numerically loose relative to the true generalization error. Notably, Dziugaite & Roy (2017) and Zhou et al. (2018) provide non-vacuous generalization guarantees. Specifically, they transform a base network into consequent networks that do not interpolate the training data, either by adding stochasticity to the network weights (Dziugaite & Roy, 2017) or by compressing the original neural network (Zhou et al., 2018). In a similar spirit, our work provides guarantees on overparameterized networks by using early stopping or weight decay regularization, preventing a perfect fit on the training data. Notably, in our framework, the model can perfectly fit the clean portion of the data, so long as it nevertheless fits the mislabeled data poorly.

Leveraging noisy data to provide generalization guarantees. In parallel work, Bansal et al. (2020) presented an upper bound on the generalization gap of linear classifiers trained on representations learned via self-supervision. Under certain noise-robustness and rationality assumptions on the training procedure, the authors obtained bounds dependent on the complexity of the linear classifier and independent of the complexity of the representations. By contrast, we present generalization bounds for supervised learning that are non-vacuous by virtue of the early learning phenomenon. While both frameworks highlight how

robustness to random label corruptions can be leveraged to obtain bounds that do not depend directly on the complexity of the underlying hypothesis class, our framework, methodology, claims, and generalization results are very different from theirs.

Other related work. A long line of work relates early stopped GD to a corresponding regularized solution (Friedman & Popescu, 2003; Yao et al., 2007; Suggala et al., 2018; Ali et al., 2018; Neu & Rosasco, 2018; Ali et al., 2020). In the most relevant work, Ali et al. (2018) and Suggala et al. (2018) address a regression task, theoretically relating the solutions of early-stopped GD and a regularized problem obtained with a data-independent regularization coefficient. Towards understanding generalization, numerous stability conditions have been discussed (Kearns & Ron, 1999; Bousquet & Elisseeff, 2002; Mukherjee et al., 2006; Shalev-Shwartz et al., 2010). Hardt et al. (2016) study the uniform stability property to obtain generalization guarantees with early-stopped SGD. While we assume a benign stability condition to relate leave-one-out performance with population error, we do not rely on any stability condition that implies generalization.

7 Conclusion and Future work

Our work introduces a new approach for obtaining generalization bounds that do not directly depend on the underlying complexity of the model class. For linear models, we provably obtain a bound in terms of the fit on randomly labeled data added during training. Our findings raise a number of questions to be explored next. While our empirical findings and theoretical results with the 0-1 loss hold absent further assumptions and shed light on why the bound may apply for more general models, we hope to extend our proof that overfitting (in terms of classification error) to the finite sample of mislabeled data occurs with SGD training to broader classes of models and loss functions. We hope to build on some early results (Nakkiran et al., 2019; Hu et al., 2020) which provide evidence that deep models behave like linear models in the early phases of training. We also wish to extend our framework to the interpolation regime. Since many important aspects of neural network learning take place within the early epochs (Achille et al., 2017; Frankle et al., 2020), including gradient dynamics converging to a very small subspace (Gur-Ari et al., 2018), we might imagine operationalizing our bounds in the interpolation regime by discarding the randomly labeled data after the initial stages of training.

Acknowledgements

SG thanks Divyansh Kaushik for help with the NLP code. This material is based on research sponsored by the Air Force Research Laboratory (AFRL) under agreement number FA8750-19-1-1000. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation therein. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, DARPA or the U.S. Government. SB acknowledges funding from the NSF grants DMS-1713003 and CIF-1763734, as well as Amazon AI and a Google Research Scholar Award. ZL acknowledges Amazon AI, Salesforce Research, Facebook, UPMC, Abridge, PwC and the Center for Machine Learning and Health for their generous support of ACMI Lab's research on machine learning under distribution shift.

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, 2016.

Abou-Moustafa, K. and Szepesvári, C. An exponential Efron-Stein inequality for Lq stable learning rules. arXiv preprint arXiv:1903.05457, 2019.

Achille, A., Rovere, M., and Soatto, S. Critical learning periods in deep neural networks. arXiv preprint arXiv:1711.08856, 2017.

Ali, A., Kolter, J. Z., and Tibshirani, R. J. A continuous-time view of early stopping for least squares. arXiv preprint arXiv:1810.10082, 2018.

Ali, A., Dobriban, E., and Tibshirani, R. J. The implicit regularization of stochastic gradient ow for least squares. arXiv preprint arXiv:2003.07802, 2020.

Allen-Zhu, Z., Li, Y., and Liang, Y. Learning and generalization in overparameterized neural networks, going beyond two layers. In Advances in neural information processing systems, pp. 6158–6169, 2019a.

Allen-Zhu, Z., Li, Y., and Song, Z. A convergence theory for deep learning via over-parameterization. In International Conference on Machine Learning, pp. 242–252. PMLR, 2019b.

Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

Arora, S., Du, S. S., Hu, W., Li, Z., and Wang, R. Fine-grained analysis of optimization and generalization for overparameterized two-layer neural networks. arXiv preprint arXiv:1901.08584, 2019.

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 233–242. JMLR. org, 2017.

Bansal, Y., Kaplun, G., and Barak, B. For self-supervised learning, rationality implies generalization, provably. arXiv preprint arXiv:2010.08508, 2020.

Bardenet, R., Maillard, O.-A., et al. Concentration inequalities for sampling without replacement. Bernoulli, 21(3):1361–1385, 2015.

Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in neural information processing systems, pp. 6240–6249, 2017.

Blum, A. and Hardt, M. The ladder: A reliable leaderboard for machine learning competitions. In International Conference on Machine Learning, pp. 1006–1014. PMLR, 2015.

Bousquet, O. and Elisseeff, A. Stability and generalization. Journal of machine learning research, 2(Mar): 499–526, 2002.

Chizat, L. and Bach, F. Implicit bias of gradient descent for wide two-layer neural networks trained with the logistic loss. In Conference on Learning Theory, pp. 1305–1338. PMLR, 2020.

Chizat, L., Oyallon, E., and Bach, F. On lazy training in differentiable programming. In Advances in Neural Information Processing Systems, pp. 2937–2947, 2019.

Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

Du, S., Lee, J., Li, H., Wang, L., and Zhai, X. Gradient descent finds global minima of deep neural networks. In International Conference on Machine Learning, pp. 1675–1685. PMLR, 2019.

Du, S. S., Zhai, X., Poczos, B., and Singh, A. Gradient descent provably optimizes over-parameterized neural networks. arXiv preprint arXiv:1810.02054, 2018.

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., and Roth, A. L. Preserving statistical validity in adaptive data analysis. In Proceedings of the forty-seventh annual ACM symposium on Theory of computing, pp. 117–126, 2015.

Dziugaite, G. K. and Roy, D. M. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.

Elisseeff, A., Pontil, M., et al. Leave-one-out error and stability of learning algorithms with applications. NATO Science Series Sub Series III: Computer and Systems Sciences, 190:111–130, 2003.

Frankle, J., Schwab, D. J., and Morcos, A. S. The early phase of neural network training. arXiv preprint arXiv:2002.10365, 2020.

Friedman, J. and Popescu, B. E. Gradient directed regularization for linear regression and classification. Technical report, Technical Report, Statistics Department, Stanford University, 2003.

Gunasekar, S., Lee, J., Soudry, D., and Srebro, N. Implicit bias of gradient descent on linear convolutional networks. arXiv preprint arXiv:1806.00468, 2018a.

Gunasekar, S., Woodworth, B., Bhojanapalli, S., Neyshabur, B., and Srebro, N. Implicit regularization in matrix factorization. In 2018 Information Theory and Applications Workshop (ITA), pp. 1–10. IEEE, 2018b.

Gur-Ari, G., Roberts, D. A., and Dyer, E. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018.

Hardt, M., Recht, B., and Singer, Y. Train faster, generalize better: Stability of stochastic gradient descent. In International Conference on Machine Learning, pp. 1225–1234. PMLR, 2016.

He, K., Zhang, X., Ren, S., and Sun, J. Deep Residual Learning for Image Recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

Hoeffding, W. Probability inequalities for sums of bounded random variables. In The Collected Works of Wassily Hoeffding, pp. 409–426. Springer, 1994.

Hu, W., Li, Z., and Yu, D. Simple and eective regularization methods for training on noisily labeled data with generalization guarantee. arXiv preprint arXiv:1905.11368, 2019.

Hu, W., Xiao, L., Adlam, B., and Pennington, J. The surprising simplicity of the early-time learning dynamics of neural networks. arXiv preprint arXiv:2006.14599, 2020.

Hui, L. and Belkin, M. Evaluation of neural architectures trained with square loss vs cross-entropy in classification tasks. arXiv preprint arXiv:2006.07322, 2020.

Jacot, A., Gabriel, F., and Hongler, C. Neural tangent kernel: Convergence and generalization in neural networks. In Advances in neural information processing systems, pp. 8571–8580, 2018.

Ji, Z. and Telgarsky, M. The implicit bias of gradient descent on nonseparable data. In Conference on Learning Theory, pp. 1772–1798. PMLR, 2019.

Kearns, M. and Ron, D. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation. Neural computation, 11(6):1427–1453, 1999.

Krizhevsky, A. and Hinton, G. Learning Multiple Layers of Features from Tiny Images. Technical report, Citeseer, 2009.

LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86, 1998.

Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.

Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. In International Conference on Articial Intelligence and Statistics, pp. 4313–4324. PMLR, 2020.

Li, Y. and Liang, Y. Learning overparameterized neural networks via stochastic gradient descent on structured data. In Advances in Neural Information Processing Systems, pp. 8157–8166, 2018.

Liu, S., Niles-Weed, J., Razavian, N., and Fernandez-Granda, C. Early-learning regularization prevents memorization of noisy labels. arXiv preprint arXiv:2007.00151, 2020.

Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., and Potts, C. Learning word vectors for sentiment analysis. In Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies, pp. 142–150, 2011.

McDiarmid, C. On the method of bounded differences, pp. 148–188. London Mathematical Society Lecture Note Series. Cambridge University Press, 1989.

Mukherjee, S., Niyogi, P., Poggio, T., and Rifkin, R. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization. Advances in Computational Mathematics, 25(1):161–193, 2006.

Murphy, K. P. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.

Muthukumar, V., Narang, A., Subramanian, V., Belkin, M., Hsu, D., and Sahai, A. Classification vs regression in overparameterized regimes: Does the loss function matter? arXiv preprint arXiv:2005.08054, 2020.

Nagarajan, V. and Kolter, J. Z. Deterministic pac-bayesian generalization bounds for deep networks via generalizing noise-resilience. arXiv preprint arXiv:1905.13344, 2019a.

Nagarajan, V. and Kolter, J. Z. Uniform convergence may be unable to explain generalization in deep learning. In Advances in Neural Information Processing Systems, pp. 11615–11626, 2019b.

Nakkiran, P., Kaplun, G., Kalimeris, D., Yang, T., Edelman, B. L., Zhang, F., and Barak, B. SGD on neural networks learns functions of increasing complexity. arXiv preprint arXiv:1905.11604, 2019.

Nakkiran, P., Neyshabur, B., and Sedghi, H. The deep bootstrap: Good online learners are good offline generalizers. arXiv preprint arXiv:2010.08127, 2020.

Neu, G. and Rosasco, L. Iterate averaging as regularization for stochastic gradient descent. In Conference On Learning Theory, pp. 3222–3242. PMLR, 2018.

Neyshabur, B., Tomioka, R., and Srebro, N. Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401, 2015.

Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring generalization in deep learning. arXiv preprint arXiv:1706.08947, 2017a.

Neyshabur, B., Bhojanapalli, S., and Srebro, N. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017b.

Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2018.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, 2019.

Pensia, A., Jog, V., and Loh, P.-L. Generalization error bounds for noisy, iterative algorithms. In 2018 IEEE International Symposium on Information Theory (ISIT), pp. 546–550. IEEE, 2018.

Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. In Proc. of NAACL, 2018.

Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? In International Conference on Machine Learning, pp. 5389–5400. PMLR, 2019.

Rolnick, D., Veit, A., Belongie, S., and Shavit, N. Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694, 2017.

Shalev-Shwartz, S. and Ben-David, S. Understanding machine learning: From theory to algorithms. Cambridge university press, 2014.

Shalev-Shwartz, S., Shamir, O., Srebro, N., and Sridharan, K. Learnability, stability and uniform convergence. The Journal of Machine Learning Research, 11:2635–2670, 2010.

Sherman, J. and Morrison, W. J. Adjustment of an inverse matrix corresponding to a change in one element of a given matrix. The Annals of Mathematical Statistics, 21(1):124–127, 1950.

Soudry, D., Hoffer, E., Nacson, M. S., Gunasekar, S., and Srebro, N. The implicit bias of gradient descent on separable data. The Journal of Machine Learning Research, 19(1):2822–2878, 2018.

Suggala, A., Prasad, A., and Ravikumar, P. K. Connecting optimization and regularization paths. In Advances in Neural Information Processing Systems, pp. 10608–10619, 2018.

Tsybakov, A. B. et al. On nonparametric estimation of density level sets. The Annals of Statistics, 25(3): 948–969, 1997.

Wolf, T., Debut, L., Sanh, V., Chaumond, J., Delangue, C., Moi, A., Cistac, P., Rault, T., Louf, R., Funtowicz, M., Davison, J., Shleifer, S., von Platen, P., Ma, C., Jernite, Y., Plu, J., Xu, C., Scao, T. L., Gugger, S., Drame, M., Lhoest, Q., and Rush, A. M. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pp. 38–45. Association for Computational Linguistics, 2020.

Xu, A. and Raginsky, M. Information-theoretic analysis of generalization capability of learning algorithms. arXiv preprint arXiv:1705.07809, 2017.

Yao, Y., Rosasco, L., and Caponnetto, A. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.

Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862, 2018.

Supplementary Material

Throughout this discussion, we will make frequent use of the following standard results concerning the exponential concentration of random variables:

Lemma 4 (Hoeding’s inequality for independent RVs (Hoeding, 1994)). Let Z1,Z2,...,Zn be independent bounded random variables with Zi P ra, bs for all i, then

1 n 2nt2 P pZi ´ E rZisq ě t ď exp ´ n pb ´ aq2 ˜ i“1 ¸ ÿ ˆ ˙ and

1 n 2nt2 P pZi ´ E rZisq ď ´t ď exp ´ n pb ´ aq2 ˜ i“1 ¸ ÿ ˆ ˙ for all t ě 0.

Lemma 5 (Hoeding’s inequality for sampling with replacement (Hoeding, 1994)). Let Z “ pZ1,Z2,...,ZN q be a nite population of N points with Zi P ra.bs for all i. Let X1,X2,...Xn be a random sample drawn without replacement from Z. Then for all t ě 0, we have

1 n 2nt2 P pXi ´ µq ě t ď exp ´ n pb ´ aq2 ˜ i“1 ¸ ÿ ˆ ˙ and

1 n 2nt2 P pXi ´ µq ď ´t ď exp ´ , n pb ´ aq2 ˜ i“1 ¸ ÿ ˆ ˙ 1 N where µ “ N i“1 Zi. We now discussř one condition that generalizes the exponential concentration to dependent random variables.

Condition 2 (Bounded difference inequality). Let $\mathcal{Z}$ be some set and $\phi:\mathcal{Z}^n\to\mathbb{R}$. We say that $\phi$ satisfies the bounded difference assumption if there exist $c_1, c_2, \ldots, c_n \ge 0$ s.t. for all $i$, we have
$$\sup_{Z_1, Z_2, \ldots, Z_n, Z_i'\in\mathcal{Z}^{n+1}} \left|\phi(Z_1, \ldots, Z_i, \ldots, Z_n) - \phi(Z_1, \ldots, Z_i', \ldots, Z_n)\right| \le c_i\,.$$

Lemma 6 (McDiarmid's inequality (McDiarmid, 1989)). Let $Z_1, Z_2, \ldots, Z_n$ be independent random variables on set $\mathcal{Z}$ and let $\phi:\mathcal{Z}^n\to\mathbb{R}$ satisfy the bounded difference inequality (Condition 2). Then for all $t > 0$, we have
$$\mathbb{P}\left(\phi(Z_1, Z_2, \ldots, Z_n) - \mathbb{E}\left[\phi(Z_1, Z_2, \ldots, Z_n)\right] \ge t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right)$$
and
$$\mathbb{P}\left(\phi(Z_1, Z_2, \ldots, Z_n) - \mathbb{E}\left[\phi(Z_1, Z_2, \ldots, Z_n)\right] \le -t\right) \le \exp\left(-\frac{2t^2}{\sum_{i=1}^n c_i^2}\right).$$

A Proofs from Sec. 3

Additional notation. Let $m_1$ denote the number of mislabeled points (those in $\widetilde{S}_M$) and $m_2$ the number of correctly labeled points (those in $\widetilde{S}_C$). Note that $m_1 + m_2 = m$.

A.1 Proof of Theorem 1

Proof of Lemma 1. The main idea of our proof is to regard the clean portion of the data ($S \cup \widetilde{S}_C$) as fixed. Then, there exists an (unknown) classifier $f^*$ that minimizes the expected risk calculated on the (fixed) clean data and (random draws of) the mislabeled data $\widetilde{S}_M$. Formally,
\[ f^* := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{\widehat{\mathcal{D}}}(f) \,, \tag{10} \]
where
\[ \widehat{\mathcal{D}} = \frac{n}{m+n}\, S + \frac{m_2}{m+n}\, \widetilde{S}_C + \frac{m_1}{m+n}\, \mathcal{D}' \,. \]
Note here that $\widehat{\mathcal{D}}$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that
\[ \widehat{f} := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{S \cup \widetilde{S}}(f) \,. \tag{11} \]
Since $\widehat{f}$ minimizes the 0-1 error on $S \cup \widetilde{S}$, ERM optimality in (11) gives
\[ \mathcal{E}_{S \cup \widetilde{S}}(\widehat{f}) \le \mathcal{E}_{S \cup \widetilde{S}}(f^*) \,. \tag{12} \]
Moreover, since $f^*$ is independent of $\widetilde{S}_M$, Hoeffding's bound gives, with probability at least $1-\delta$,
\[ \mathcal{E}_{\widetilde{S}_M}(f^*) \le \mathcal{E}_{\mathcal{D}'}(f^*) + \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{13} \]
Finally, since $f^*$ is the optimal classifier on $\widehat{\mathcal{D}}$, we have
\[ \mathcal{E}_{\widehat{\mathcal{D}}}(f^*) \le \mathcal{E}_{\widehat{\mathcal{D}}}(\widehat{f}) \,. \tag{14} \]
Now, to relate (12) and (14), we multiply (13) by $\frac{m_1}{m+n}$ and add $\frac{n}{m+n} \mathcal{E}_S(f^*) + \frac{m_2}{m+n} \mathcal{E}_{\widetilde{S}_C}(f^*)$ to both sides. Hence, we can rewrite (13) as
\[ \mathcal{E}_{S \cup \widetilde{S}}(f^*) \le \mathcal{E}_{\widehat{\mathcal{D}}}(f^*) + \frac{m_1}{m+n} \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{15} \]
Now we combine (12), (15), and (14) to get
\[ \mathcal{E}_{S \cup \widetilde{S}}(\widehat{f}) \le \mathcal{E}_{\widehat{\mathcal{D}}}(\widehat{f}) + \frac{m_1}{m+n} \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,, \tag{16} \]
which implies
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat{f}) \le \mathcal{E}_{\mathcal{D}'}(\widehat{f}) + \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{17} \]
Since $\widetilde{S}$ is obtained by randomly labeling an unlabeled dataset, we assume $2 m_1 \approx m$ (formally, with probability at least $1-\delta$, we have $\lvert m - 2 m_1 \rvert \le \sqrt{m \log(1/\delta)/2}$). Moreover, using $\mathcal{E}_{\mathcal{D}'} = 1 - \mathcal{E}_{\mathcal{D}}$, we obtain the desired result.

Proof of Lemma 2. Recall that $\mathcal{E}_{\widetilde{S}}(f) = \frac{m_1}{m} \mathcal{E}_{\widetilde{S}_M}(f) + \frac{m_2}{m} \mathcal{E}_{\widetilde{S}_C}(f)$. Hence, we have
\begin{align}
2 \mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) &= \left( \frac{2 m_1}{m} \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_M}(f) \right) + \left( \frac{2 m_2}{m} \mathcal{E}_{\widetilde{S}_C}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \right) \tag{18} \\
&= \left( \frac{2 m_1}{m} - 1 \right) \mathcal{E}_{\widetilde{S}_M}(f) + \left( \frac{2 m_2}{m} - 1 \right) \mathcal{E}_{\widetilde{S}_C}(f) \,. \tag{19}
\end{align}
Since the dataset is labeled uniformly at random, with probability at least $1-\delta$, we have $\left\lvert \frac{2 m_1}{m} - 1 \right\rvert \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Similarly, with probability at least $1-\delta$, $\left\lvert \frac{2 m_2}{m} - 1 \right\rvert \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Using a union bound, with probability at least $1-\delta$, we have
\[ 2 \mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \le \sqrt{\frac{\log(2/\delta)}{2m}} \left( \mathcal{E}_{\widetilde{S}_M}(f) + \mathcal{E}_{\widetilde{S}_C}(f) \right) \,. \tag{20} \]
Re-arranging $\mathcal{E}_{\widetilde{S}_M}(f) + \mathcal{E}_{\widetilde{S}_C}(f)$ and using the inequality $1 - a \le \frac{1}{1+a}$, we have
\[ 2 \mathcal{E}_{\widetilde{S}}(f) - \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \le 2 \mathcal{E}_{\widetilde{S}}(f) \sqrt{\frac{\log(2/\delta)}{2m}} \,. \tag{21} \]

Proof of Lemma 3. In the set of correctly labeled points $S \cup \widetilde{S}_C$, the subset $\widetilde{S}_C$ is a random subset of $S \cup \widetilde{S}_C$. Hence, using Hoeffding's inequality for sampling without replacement (Lemma 5), we have with probability at least $1-\delta$
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S \cup \widetilde{S}_C}(\widehat f) \le \sqrt{\frac{\log(1/\delta)}{2 m_2}} \,. \tag{22} \]
Re-writing $\mathcal{E}_{S \cup \widetilde{S}_C}(\widehat f)$ as $\frac{m_2}{m_2+n} \mathcal{E}_{\widetilde{S}_C}(\widehat f) + \frac{n}{m_2+n} \mathcal{E}_{S}(\widehat f)$, we have with probability at least $1-\delta$
\[ \frac{n}{n + m_2} \left( \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S}(\widehat f) \right) \le \sqrt{\frac{\log(1/\delta)}{2 m_2}} \,. \tag{23} \]
As before, assuming $2 m_2 \approx m$, we have with probability at least $1-\delta$
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S}(\widehat f) \le \left( 1 + \frac{m_2}{n} \right) \sqrt{\frac{\log(1/\delta)}{m}} \le \left( 1 + \frac{m}{2n} \right) \sqrt{\frac{\log(1/\delta)}{m}} \,. \tag{24} \]

Proof of Theorem 1. Having established these core intermediate results, we can now combine the three lemmas above to prove the main result. In particular, we bound the population error on clean data ($\mathcal{E}_{\mathcal{D}}(\widehat f)$) as follows:

(i) First, use (17) to obtain an upper bound on the population error on clean data, i.e., with probability at least $1-\delta/4$, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) + \sqrt{\frac{\log(4/\delta)}{m}} \,. \tag{25} \]

(ii) Second, use (21) to relate the error on the mislabeled fraction with the error on the clean portion of the randomly labeled data and the error on the whole randomly labeled dataset, i.e., with probability at least $1-\delta/2$, we have
\[ -\mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\widetilde{S}_C}(\widehat f) - 2 \mathcal{E}_{\widetilde{S}}(\widehat f) + 2 \mathcal{E}_{\widetilde{S}}(\widehat f) \sqrt{\frac{\log(4/\delta)}{2m}} \,. \tag{26} \]

(iii) Finally, use (24) to relate the error on the clean portion of the randomly labeled data with the error on the clean training data, i.e., with probability at least $1-\delta/4$, we have
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + \left( 1 + \frac{m}{2n} \right) \sqrt{\frac{\log(4/\delta)}{m}} \,. \tag{27} \]

Using a union bound over the above three steps, we have with probability at least $1-\delta$:
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + 1 - 2 \mathcal{E}_{\widetilde{S}}(\widehat f) + \left( \sqrt{2}\, \mathcal{E}_{\widetilde{S}}(\widehat f) + 2 + \frac{m}{2n} \right) \sqrt{\frac{\log(4/\delta)}{m}} \,. \tag{28} \]
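For illustration (this is our sketch, not code from the paper), the RHS of (28) can be computed directly from the two empirical errors; the function and variable names below are hypothetical.

import math

def ratt_bound_binary(err_clean, err_rand, n, m, delta=0.1):
    # err_clean: empirical 0-1 error on the clean training set S (size n)
    # err_rand:  empirical 0-1 error on the randomly labeled set (size m)
    slack = (math.sqrt(2) * err_rand + 2 + m / (2 * n)) * math.sqrt(math.log(4 / delta) / m)
    return err_clean + 1 - 2 * err_rand + slack

# Example: low clean error, randomly labeled data near 50% error
print(ratt_bound_binary(err_clean=0.02, err_rand=0.45, n=40000, m=8000))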

A.2 Proof of Proposition 1

Proof of Proposition 1. For a classifier $f : \mathcal{X} \to \{-1, 1\}$, we have $1 - 2\, \mathbb{I}\left[ f(x) \ne y \right] = y \cdot f(x)$. Hence, by the definition of $\mathcal{E}$, we have
\[ 1 - 2 \mathcal{E}_{\widetilde{S}}(f) = \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i) \le \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i) \,. \tag{29} \]
Note that for fixed inputs $(x_1, x_2, \ldots, x_m)$ in $\widetilde{S}$, the labels $(y_1, y_2, \ldots, y_m)$ are random. Define $\phi_1(y_1, y_2, \ldots, y_m) := \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i)$. We have the following bounded difference condition on $\phi_1$: for all $i$,
\[ \sup_{y_1, \ldots, y_m, y_i' \in \{-1, 1\}} \left\lvert \phi_1(y_1, \ldots, y_i, \ldots, y_m) - \phi_1(y_1, \ldots, y_i', \ldots, y_m) \right\rvert \le 1/m \,. \tag{30} \]
Similarly, define $\phi_2(x_1, x_2, \ldots, x_m) := \mathbb{E}_{y_i \sim \mathcal{U}\{-1, 1\}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i) \right]$. We have the following bounded difference condition on $\phi_2$: for all $i$,
\[ \sup_{x_1, \ldots, x_m, x_i' \in \mathcal{X}} \left\lvert \phi_2(x_1, \ldots, x_i, \ldots, x_m) - \phi_2(x_1, \ldots, x_i', \ldots, x_m) \right\rvert \le 1/m \,. \tag{31} \]
Using McDiarmid's inequality (Lemma 6) twice with conditions (30) and (31), with probability at least $1-\delta$, we have
\[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i) - \mathbb{E}_{x, y}\left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} y_i \cdot f(x_i) \right] \le \sqrt{\frac{2 \log(2/\delta)}{m}} \,. \tag{32} \]
Combining (29) and (32), we obtain the desired result.

A.3 Proof of Theorem 2

The proof of Theorem 2 proceeds along the same lines as the proof of Theorem 1. Note that the same results in Lemma 1, Lemma 2, and Lemma 3 hold in the regularized ERM case. However, the arguments in the proof of Lemma 1 change slightly. Hence, we state the lemma for regularized ERM and prove it here for completeness.

Lemma 7. Assume the same setup as Theorem 2. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) + \sqrt{\frac{\log(1/\delta)}{m}} \,. \tag{33} \]

Proof. The main idea of the proof remains the same, i.e., regard the clean portion of the data ($S \cup \widetilde{S}_C$) as fixed. Then, there exists a classifier $f^*$ that is optimal over draws of the mislabeled data $\widetilde{S}_M$. Formally,
\[ f^* := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{\widehat{\mathcal{D}}}(f) + \lambda R(f) \,, \tag{34} \]
where
\[ \widehat{\mathcal{D}} = \frac{n}{m+n}\, S + \frac{m_2}{m+n}\, \widetilde{S}_C + \frac{m_1}{m+n}\, \mathcal{D}' \,. \]
That is, $\widehat{\mathcal{D}}$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that
\[ \widehat{f} := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{S \cup \widetilde{S}}(f) + \lambda R(f) \,. \tag{35} \]
Since $\widehat{f}$ minimizes the regularized 0-1 error on $S \cup \widetilde{S}$, optimality in (35) gives
\[ \mathcal{E}_{S \cup \widetilde{S}}(\widehat f) + \lambda R(\widehat f) \le \mathcal{E}_{S \cup \widetilde{S}}(f^*) + \lambda R(f^*) \,. \tag{36} \]
Moreover, since $f^*$ is independent of $\widetilde{S}_M$, Hoeffding's bound gives, with probability at least $1-\delta$,
\[ \mathcal{E}_{\widetilde{S}_M}(f^*) \le \mathcal{E}_{\mathcal{D}'}(f^*) + \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{37} \]
Finally, since $f^*$ is the optimal (regularized) classifier on $\widehat{\mathcal{D}}$, we have
\[ \mathcal{E}_{\widehat{\mathcal{D}}}(f^*) + \lambda R(f^*) \le \mathcal{E}_{\widehat{\mathcal{D}}}(\widehat f) + \lambda R(\widehat f) \,. \tag{38} \]
Now, to relate (36) and (38), we can re-write (37) as follows:
\[ \mathcal{E}_{S \cup \widetilde{S}}(f^*) \le \mathcal{E}_{\widehat{\mathcal{D}}}(f^*) + \frac{m_1}{m+n} \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{39} \]
After adding $\lambda R(f^*)$ to both sides of (39), we combine equations (36), (39), and (38) to get
\[ \mathcal{E}_{S \cup \widetilde{S}}(\widehat f) \le \mathcal{E}_{\widehat{\mathcal{D}}}(\widehat f) + \frac{m_1}{m+n} \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,, \tag{40} \]
which implies
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{41} \]
As before, since $\widetilde{S}$ is obtained by randomly labeling an unlabeled dataset, we assume $2 m_1 \approx m$. Moreover, using $\mathcal{E}_{\mathcal{D}'} = 1 - \mathcal{E}_{\mathcal{D}}$, we obtain the desired result.

A.4 Proof of Theorem 3

To prove our results in the multiclass case, we first state and prove lemmas parallel to those used in the proof of the balanced binary case. We then combine these results to obtain the result in Theorem 3. Before stating the result, we define the mislabeled distribution $\mathcal{D}'$ for any $\mathcal{D}$. While $\mathcal{D}'$ and $\mathcal{D}$ share the same marginal distribution over inputs $\mathcal{X}$, the conditional distribution over labels $y$ given an input $x \sim \mathcal{D}_{\mathcal{X}}$ is changed as follows: for any $x$, the probability mass function (PMF) over $y$ is defined as $p_{\mathcal{D}'}(\cdot \mid x) := \frac{1 - p_{\mathcal{D}}(\cdot \mid x)}{k-1}$, where $p_{\mathcal{D}}(\cdot \mid x)$ is the PMF over $y$ for the distribution $\mathcal{D}$.

Lemma 8. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of mislabeled data $\widetilde{S}_M$, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le (k-1)\left( 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) \right) + (k-1) \sqrt{\frac{\log(1/\delta)}{m}} \,. \tag{42} \]

Proof. The main idea of the proof remains the same. We begin by regarding the clean portion of the data ($S \cup \widetilde{S}_C$) as fixed. Then, there exists a classifier $f^*$ that is optimal over draws of the mislabeled data $\widetilde{S}_M$. However, in the multiclass case, we cannot as easily relate the population error on mislabeled data to the population accuracy on clean data. While for binary classification we could lower bound the population accuracy $1 - \mathcal{E}_{\mathcal{D}}$ with the empirical error on mislabeled data $\mathcal{E}_{\widetilde{S}_M}$ (in the proof of Lemma 1), for multiclass classification, the error on the mislabeled data and the accuracy on the clean data in the population are not so directly related. To establish (42), we break the error on the (unknown) mislabeled data into two parts: one term corresponds to predicting the true label on mislabeled data, and the other corresponds to predicting neither the true label nor the assigned (mis-)label. Finally, we relate these errors to their population counterparts to establish (42). Formally,
\[ f^* := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{\widehat{\mathcal{D}}}(f) + \lambda R(f) \,, \tag{43} \]
where
\[ \widehat{\mathcal{D}} = \frac{n}{m+n}\, S + \frac{m_2}{m+n}\, \widetilde{S}_C + \frac{m_1}{m+n}\, \mathcal{D}' \,. \]
That is, $\widehat{\mathcal{D}}$ is a combination of the empirical distribution over correctly labeled data $S \cup \widetilde{S}_C$ and the (population) distribution over mislabeled data $\mathcal{D}'$. Recall that
\[ \widehat{f} := \operatorname*{arg\,min}_{f \in \mathcal{F}} \; \mathcal{E}_{S \cup \widetilde{S}}(f) + \lambda R(f) \,. \tag{44} \]
Following the exact steps from the proof of Lemma 7, with probability at least $1-\delta$, we have

\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{\log(1/\delta)}{2 m_1}} \,. \tag{45} \]
Similar to before, since $\widetilde{S}$ is obtained by randomly labeling an unlabeled dataset, we assume $\frac{k}{k-1} m_1 \approx m$. Now we will relate $\mathcal{E}_{\mathcal{D}'}(\widehat f)$ with $\mathcal{E}_{\mathcal{D}}(\widehat f)$. Let $y^{T}$ denote the (unknown) true label for a mislabeled point $(x, y)$ (i.e., the label before it was replaced with a mislabel). Then
\[ \mathbb{E}_{(x,y) \sim \mathcal{D}'}\left[ \mathbb{I}\left[ \widehat f(x) \ne y \right] \right] = \underbrace{\mathbb{E}_{(x,y) \sim \mathcal{D}'}\left[ \mathbb{I}\left[ \widehat f(x) \ne y \wedge \widehat f(x) \ne y^{T} \right] \right]}_{\mathrm{I}} + \underbrace{\mathbb{E}_{(x,y) \sim \mathcal{D}'}\left[ \mathbb{I}\left[ \widehat f(x) \ne y \wedge \widehat f(x) = y^{T} \right] \right]}_{\mathrm{II}} \,. \tag{46} \]
Clearly, term II is the accuracy on the clean unseen data (one minus the clean population error), i.e.,
\[ \mathrm{II} = 1 - \mathbb{E}_{x, y \sim \mathcal{D}}\left[ \mathbb{I}\left[ \widehat f(x) \ne y \right] \right] = 1 - \mathcal{E}_{\mathcal{D}}(\widehat f) \,. \tag{47} \]
Next, we relate term I with the error on the unseen clean data. We show that term I equals the error on the unseen clean data scaled by $\frac{k-2}{k-1}$, where $k$ is the number of labels. Using the definition of the mislabeled distribution $\mathcal{D}'$, we have
\[ \mathrm{I} = \mathbb{E}_{(x,y) \sim \mathcal{D}}\left[ \frac{1}{k-1} \sum_{i \in \mathcal{Y} \wedge i \ne y} \mathbb{I}\left[ \widehat f(x) \ne i \wedge \widehat f(x) \ne y \right] \right] = \frac{k-2}{k-1}\, \mathcal{E}_{\mathcal{D}}(\widehat f) \,. \tag{48} \]
Combining the results in (47), (48), and (46), we have
\[ \mathcal{E}_{\mathcal{D}'}(\widehat f) = 1 - \frac{1}{k-1}\, \mathcal{E}_{\mathcal{D}}(\widehat f) \,. \tag{49} \]
Finally, combining the result in (49) with equation (45), we have with probability at least $1-\delta$,
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le (k-1)\left( 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) \right) + (k-1) \sqrt{\frac{k \log(1/\delta)}{2 (k-1) m}} \,. \tag{50} \]

Lemma 9. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde{S}$, we have
\[ k \mathcal{E}_{\widetilde{S}}(\widehat f) - \mathcal{E}_{\widetilde{S}_C}(\widehat f) - (k-1) \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le 2 k \sqrt{\frac{\log(4/\delta)}{2m}} \,. \]

Proof. Recall that $\mathcal{E}_{\widetilde{S}}(f) = \frac{m_1}{m} \mathcal{E}_{\widetilde{S}_M}(f) + \frac{m_2}{m} \mathcal{E}_{\widetilde{S}_C}(f)$. Hence, we have
\begin{align}
k \mathcal{E}_{\widetilde{S}}(f) - (k-1) \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) &= (k-1)\left( \frac{k m_1}{(k-1) m} \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_M}(f) \right) + \left( \frac{k m_2}{m} \mathcal{E}_{\widetilde{S}_C}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \right) \\
&= k \left[ \left( \frac{m_1}{m} - \frac{k-1}{k} \right) \mathcal{E}_{\widetilde{S}_M}(f) + \left( \frac{m_2}{m} - \frac{1}{k} \right) \mathcal{E}_{\widetilde{S}_C}(f) \right] .
\end{align}
Since the dataset is randomly labeled, we have with probability at least $1-\delta$ that $\left\lvert \frac{m_1}{m} - \frac{k-1}{k} \right\rvert \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Similarly, with probability at least $1-\delta$, $\left\lvert \frac{m_2}{m} - \frac{1}{k} \right\rvert \le \sqrt{\frac{\log(1/\delta)}{2m}}$. Using a union bound, we have with probability at least $1-\delta$
\[ k \mathcal{E}_{\widetilde{S}}(f) - (k-1) \mathcal{E}_{\widetilde{S}_M}(f) - \mathcal{E}_{\widetilde{S}_C}(f) \le k \left( \mathcal{E}_{\widetilde{S}_M}(f) + \mathcal{E}_{\widetilde{S}_C}(f) \right) \sqrt{\frac{\log(2/\delta)}{2m}} \,. \tag{51} \]

Lemma 10. Assume the same setup as Theorem 3. Then for any $\delta > 0$, with probability at least $1-\delta$ over the random draws of $\widetilde{S}_C$ and $\widetilde{S}$, we have
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S}(\widehat f) \le 1.5 \sqrt{\frac{k \log(2/\delta)}{2m}} \,. \]

Proof. In the set of correctly labeled points $S \cup \widetilde{S}_C$, the subset $\widetilde{S}_C$ is a random subset of $S \cup \widetilde{S}_C$. Hence, using Hoeffding's inequality for sampling without replacement (Lemma 5), we have with probability at least $1-\delta$
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S \cup \widetilde{S}_C}(\widehat f) \le \sqrt{\frac{\log(1/\delta)}{2 m_2}} \,. \tag{52} \]
Re-writing $\mathcal{E}_{S \cup \widetilde{S}_C}(\widehat f)$ as $\frac{m_2}{m_2+n} \mathcal{E}_{\widetilde{S}_C}(\widehat f) + \frac{n}{m_2+n} \mathcal{E}_{S}(\widehat f)$, we have with probability at least $1-\delta$
\[ \frac{n}{n + m_2} \left( \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S}(\widehat f) \right) \le \sqrt{\frac{\log(1/\delta)}{2 m_2}} \,. \tag{53} \]
As before, assuming $k m_2 \approx m$, we have with probability at least $1-\delta$
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) - \mathcal{E}_{S}(\widehat f) \le \left( 1 + \frac{m_2}{n} \right) \sqrt{\frac{k \log(1/\delta)}{2m}} \le \left( 1 + \frac{m}{k n} \right) \sqrt{\frac{k \log(1/\delta)}{2m}} \,. \tag{54} \]

Proof of Theorem 3. Having established these core intermediate results, we can now combine the three lemmas above. In particular, we bound the population error on clean data ($\mathcal{E}_{\mathcal{D}}(\widehat f)$) as follows:

(i) First, use (50) to obtain an upper bound on the population error on clean data, i.e., with probability at least $1-\delta/4$, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le (k-1)\left( 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) \right) + (k-1) \sqrt{\frac{k \log(4/\delta)}{2 (k-1) m}} \,. \tag{55} \]

(ii) Second, use (51) to relate the error on the mislabeled fraction with the error on the clean portion of the randomly labeled data and the error on the whole randomly labeled dataset, i.e., with probability at least $1-\delta/2$, we have
\[ -(k-1)\, \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\widetilde{S}_C}(\widehat f) - k \mathcal{E}_{\widetilde{S}}(\widehat f) + k \sqrt{\frac{\log(4/\delta)}{2m}} \,. \tag{56} \]

(iii) Finally, use (54) to relate the error on the clean portion of the randomly labeled data with the error on the clean training data, i.e., with probability at least $1-\delta/4$, we have
\[ \mathcal{E}_{\widetilde{S}_C}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + \left( 1 + \frac{m}{k n} \right) \sqrt{\frac{k \log(4/\delta)}{2m}} \,. \tag{57} \]

Using a union bound over the above three steps, we have with probability at least $1-\delta$:
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + (k-1) - k \mathcal{E}_{\widetilde{S}}(\widehat f) + \left( \sqrt{k(k-1)} + k + \sqrt{k} + \frac{m}{n \sqrt{k}} \right) \sqrt{\frac{\log(4/\delta)}{2m}} \,. \tag{58} \]
Simplifying the term on the RHS of (58) gives the desired final bound.
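Analogously (again a sketch of ours, using the constants as reconstructed in (58) above; all names are hypothetical), the multiclass bound can be evaluated as:

import math

def ratt_bound_multiclass(err_clean, err_rand, n, m, k, delta=0.1):
    # err_clean: empirical 0-1 error on the clean training set S (size n)
    # err_rand:  empirical 0-1 error on the randomly labeled set (size m), cf. Eq. (58)
    coef = math.sqrt(k * (k - 1)) + k + math.sqrt(k) + m / (n * math.sqrt(k))
    slack = coef * math.sqrt(math.log(4 / delta) / (2 * m))
    return err_clean + (k - 1) - k * err_rand + slack

# Example: CIFAR10-like setting (k = 10); random labels give roughly 90% error
print(ratt_bound_multiclass(err_clean=0.05, err_rand=0.89, n=40000, m=8000, k=10))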

B Proofs from Sec. 4

We suppose that the parameters of the linear function are obtained via gradient descent on the following $\ell_2$-regularized problem:
\[ \mathcal{L}_S(w; \lambda) := \sum_{i=1}^{n} \left( w^T x_i - y_i \right)^2 + \lambda \lVert w \rVert_2^2 \,, \tag{59} \]
where $\lambda \ge 0$ is a regularization parameter. We assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}^n$ and a randomly labeled dataset $\widetilde{S} = \{(x_i, y_i)\}_{i=n+1}^{n+m}$ drawn from the randomly labeled distribution. Let $X = [x_1, x_2, \cdots, x_{m+n}]^T$ and $y = [y_1, y_2, \cdots, y_{m+n}]^T$. Fix a positive learning rate $\eta$ such that $\eta \le 1/\bigl( \lVert X^T X \rVert_{\mathrm{op}} + \lambda \bigr)$ and an initialization $w_0 = 0$. Consider the following gradient descent iterates to minimize objective (59) on $S \cup \widetilde{S}$:
\[ w_t = w_{t-1} - \eta \nabla_{w} \mathcal{L}_{S \cup \widetilde{S}}(w_{t-1}; \lambda) \qquad \forall\, t = 1, 2, \ldots \tag{60} \]
Then the iterates $\{w_t\}$ converge to the limiting solution $\widehat{w} = \bigl( X^T X + \lambda I \bigr)^{-1} X^T y$. Define $\widehat{f}(x) := f(x; \widehat{w})$.

B.1 Proof of Theorem 4

We use a standard result from linear algebra, namely the Sherman–Morrison formula (Sherman & Morrison, 1950) for matrix inversion:

Lemma 11 (Sherman & Morrison (1950)). Suppose $A \in \mathbb{R}^{n \times n}$ is an invertible square matrix and $u, v \in \mathbb{R}^n$ are column vectors. Then $A + u v^T$ is invertible iff $1 + v^T A^{-1} u \ne 0$, and in particular
\[ \bigl( A + u v^T \bigr)^{-1} = A^{-1} - \frac{A^{-1} u v^T A^{-1}}{1 + v^T A^{-1} u} \,. \tag{61} \]
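As a quick numerical sanity check (illustrative only, not part of the proof), the snippet below verifies identity (61) on a random, well-conditioned matrix.

import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.normal(size=(n, n)) + n * np.eye(n)      # invertible by construction
u, v = rng.normal(size=(n, 1)), rng.normal(size=(n, 1))

A_inv = np.linalg.inv(A)
lhs = np.linalg.inv(A + u @ v.T)
rhs = A_inv - (A_inv @ u @ v.T @ A_inv) / (1.0 + float(v.T @ A_inv @ u))
print(np.allclose(lhs, rhs))                     # True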

For a given training set $S \cup \widetilde{S}$, define the leave-one-out error on the mislabeled points in the training data as
\[ \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} = \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)}{\lvert \widetilde{S}_M \rvert} \,, \]
where $f_{(i)} := f\bigl( \mathcal{A}, (S \cup \widetilde{S})^{(i)} \bigr)$. To relate the empirical leave-one-out error and the population error under the hypothesis stability condition, we use the following lemma:

Lemma 12 (Bousquet & Elisseeff (2002)). For the leave-one-out error, we have
\[ \mathbb{E}\left[ \left( \mathcal{E}_{\mathcal{D}'}(\widehat f) - \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \right)^2 \right] \le \frac{1}{2 m_1} + \frac{3 \beta}{n + m} \,. \tag{62} \]
The proof of the above lemma is similar to the proof of Lemma 9 in Bousquet & Elisseeff (2002) and can be found in App. D. Before presenting the proof of Theorem 4, we introduce some more notation. Let $X_{(i)}$ denote the matrix of covariates with the $i$-th point removed. Similarly, let $y_{(i)}$ be the array of responses with the $i$-th point removed. Define the corresponding regularized GD solution as $\widehat{w}_{(i)} = \bigl( X_{(i)}^T X_{(i)} + \lambda I \bigr)^{-1} X_{(i)}^T y_{(i)}$, and define $\widehat{f}_{(i)}(x) := f(x; \widehat{w}_{(i)})$.

Proof of Theorem 4. Because squared loss minimization does not imply 0-1 error minimization, we cannot use the arguments from Lemma 1. This is the main technical difficulty. To compare the 0-1 error at a training point with that at an unseen point, we use the closed-form expression for $\widehat{w}$ and the Sherman–Morrison formula to upper bound the training error with the leave-one-out cross-validation error. The proof is divided into three parts. In part one, we show that the 0-1 error on mislabeled points in the training set is lower than the corresponding leave-one-out error at those points. In part two, we relate this leave-one-out error with the population error on the mislabeled distribution using Condition 1. While the empirical leave-one-out error is an unbiased estimator of the average population error of the leave-one-out classifiers, we need hypothesis stability to control the variance of the empirical leave-one-out error. Finally, in part three, we show that the error on the mislabeled training points can be estimated with just the randomly labeled and clean training data (as in the proof of Theorem 1).

Part 1. First, we relate the training error with the leave-one-out error. For any training point $(x_i, y_i)$ in $S \cup \widetilde{S}$, we have
\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ y_i \cdot x_i^T \widehat{w} < 0 \right] = \mathbb{I}\left[ y_i \cdot x_i^T \bigl( X^T X + \lambda I \bigr)^{-1} X^T y < 0 \right] \tag{63} \\
&= \mathbb{I}\Bigl[ y_i \cdot x_i^T \underbrace{\bigl( X_{(i)}^T X_{(i)} + x_i x_i^T + \lambda I \bigr)^{-1}}_{\mathrm{I}} \bigl( X_{(i)}^T y_{(i)} + y_i \cdot x_i \bigr) < 0 \Bigr] \,. \tag{64}
\end{align}
Letting $A = X_{(i)}^T X_{(i)} + \lambda I$ and using Lemma 11 on term I, we have
\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ y_i \cdot x_i^T \left( A^{-1} - \frac{A^{-1} x_i x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \right) \bigl( X_{(i)}^T y_{(i)} + y_i \cdot x_i \bigr) < 0 \right] \tag{65} \\
&= \mathbb{I}\left[ y_i \cdot \frac{x_i^T A^{-1} \bigl( 1 + x_i^T A^{-1} x_i \bigr) - x_i^T A^{-1} x_i x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \bigl( X_{(i)}^T y_{(i)} + y_i \cdot x_i \bigr) < 0 \right] \tag{66} \\
&= \mathbb{I}\left[ y_i \cdot \frac{x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \bigl( X_{(i)}^T y_{(i)} + y_i \cdot x_i \bigr) < 0 \right] \,. \tag{67}
\end{align}
Since $1 + x_i^T A^{-1} x_i > 0$, we have
\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ y_i \cdot x_i^T A^{-1} \bigl( X_{(i)}^T y_{(i)} + y_i \cdot x_i \bigr) < 0 \right] \tag{68} \\
&= \mathbb{I}\left[ x_i^T A^{-1} x_i + y_i \cdot x_i^T A^{-1} X_{(i)}^T y_{(i)} < 0 \right] \tag{69} \\
&\le \mathbb{I}\left[ y_i \cdot x_i^T A^{-1} X_{(i)}^T y_{(i)} < 0 \right] = \mathcal{E}\bigl( \widehat{f}_{(i)}(x_i), y_i \bigr) \,. \tag{70}
\end{align}
Using (70), we have
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} := \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}\bigl( \widehat{f}_{(i)}(x_i), y_i \bigr)}{\lvert \widetilde{S}_M \rvert} \,. \tag{71} \]

Part 2. We now relate the RHS in (71) with the population error on the mislabeled distribution. To do this, we leverage Condition 1 and Lemma 12. In particular, we have
\[ \mathbb{E}_{S \cup \widetilde{S}}\left[ \left( \mathcal{E}_{\mathcal{D}'}(\widehat f) - \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \right)^2 \right] \le \frac{1}{2 m_1} + \frac{3 \beta}{m + n} \,. \tag{72} \]
Using Chebyshev's inequality, with probability at least $1-\delta$, we have
\[ \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{1}{\delta}\left( \frac{1}{2 m_1} + \frac{3 \beta}{m + n} \right)} \,. \tag{73} \]

Part 3. Combining (73) and (71), we have
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{1}{\delta}\left( \frac{1}{2 m_1} + \frac{3 \beta}{m + n} \right)} \,. \tag{74} \]
Compare (74) with (17) in the proof of Lemma 1. We obtain a similar relationship between $\mathcal{E}_{\widetilde{S}_M}$ and $\mathcal{E}_{\mathcal{D}'}$, but with a polynomial concentration instead of an exponential concentration. In addition, since we only use concentration arguments to relate the mislabeled error to the errors on the clean portion and on the whole randomly labeled dataset, we can directly use the results in Lemma 2 and Lemma 3. Therefore, combining the results in Lemma 2, Lemma 3, and (74) with a union bound, we have with probability at least $1-\delta$
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + 1 - 2 \mathcal{E}_{\widetilde{S}}(\widehat f) + \left( \sqrt{2}\, \mathcal{E}_{\widetilde{S}}(\widehat f) + 1 + \frac{m}{2n} \right) \sqrt{\frac{\log(4/\delta)}{m}} + \sqrt{\frac{4}{\delta}\left( \frac{1}{m} + \frac{3 \beta}{m + n} \right)} \,. \tag{75} \]

B.2 Extension to multiclass classification

For multiclass problems with squared loss minimization, as is standard practice, we consider a one-hot encoding for the underlying label, i.e., a class label $c \in [k]$ is treated as $(0, \cdots, 0, 1, 0, \cdots, 0) \in \mathbb{R}^k$ (with the $c$-th coordinate being 1). As before, we suppose that the parameters of the linear function are obtained via gradient descent on the following $\ell_2$-regularized problem:
\[ \mathcal{L}_S(w; \lambda) := \sum_{i=1}^{n} \left\lVert w^T x_i - y_i \right\rVert_2^2 + \lambda \sum_{j=1}^{k} \lVert w_j \rVert_2^2 \,, \tag{76} \]
where $\lambda \ge 0$ is a regularization parameter. We assume access to a clean dataset $S = \{(x_i, y_i)\}_{i=1}^{n} \sim \mathcal{D}^n$ and a randomly labeled dataset $\widetilde{S} = \{(x_i, y_i)\}_{i=n+1}^{n+m}$. Let $X = [x_1, x_2, \cdots, x_{m+n}]^T$ and $y = [e_{y_1}, e_{y_2}, \cdots, e_{y_{m+n}}]^T$. Fix a positive learning rate $\eta$ such that $\eta \le 1/\bigl( \lVert X^T X \rVert_{\mathrm{op}} + \lambda \bigr)$ and an initialization $w^0 = 0$. Consider the following gradient descent iterates to minimize objective (76) on $S \cup \widetilde{S}$:
\[ w_j^t = w_j^{t-1} - \eta \nabla_{w_j} \mathcal{L}_{S \cup \widetilde{S}}(w^{t-1}; \lambda) \qquad \forall\, t = 1, 2, \ldots \text{ and } j = 1, 2, \ldots, k \,. \tag{77} \]

Then the iterates $\{w_j^t\}$, for all $j = 1, 2, \cdots, k$, converge to the limiting solution $\widehat{w}_j = \bigl( X^T X + \lambda I \bigr)^{-1} X^T y_j$. Define $\widehat{f}(x) := f(x; \widehat{w})$.

Theorem 5. Assume that this gradient descent algorithm satisfies Condition 1 with $\beta = \mathcal{O}(1)$. Then for a multiclass classification problem with $k$ classes, for any $\delta > 0$, with probability at least $1-\delta$, we have:
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + (k-1)\left( 1 - \frac{k}{k-1}\, \mathcal{E}_{\widetilde{S}}(\widehat f) \right) + \left( \sqrt{k(k-1)} + k + \sqrt{k} + \frac{m}{n \sqrt{k}} \right) \sqrt{\frac{\log(4/\delta)}{2m}} + \sqrt{\frac{4}{\delta}\left( \frac{1}{m} + \frac{3 \beta}{m + n} \right)} \,. \tag{78} \]

Proof. The proof of this theorem is divided into two parts. In the first part, we relate the error on the mislabeled samples with the population error on the mislabeled data. Similar to the proof of Theorem 4, we use the Sherman–Morrison formula to upper bound the training error with the leave-one-out error for each $\widehat{w}_j$. The second part of the proof follows entirely from the proof of Theorem 3. In essence, the first part derives an equivalent of (45) for GD training with squared loss, and then the second part follows from the proof of Theorem 3.

Part 1: Consider a training point $(x_i, y_i)$ in $S \cup \widetilde{S}$. For simplicity, we use $c_i$ to denote the class of the $i$-th point and $y_i$ for the corresponding one-hot embedding. Recall that the multiclass error at a point is given by $\mathcal{E}(\widehat f(x_i), y_i) = \mathbb{I}\bigl[ c_i \notin \operatorname*{arg\,max}_{j} x_i^T \widehat{w}_j \bigr]$. Thus, there exists a $j \ne c_i \in [k]$ such that we have

\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ c_i \notin \operatorname*{arg\,max}_{j'} x_i^T \widehat{w}_{j'} \right] = \mathbb{I}\left[ x_i^T \widehat{w}_{c_i} < x_i^T \widehat{w}_{j} \right] \tag{79} \\
&= \mathbb{I}\left[ x_i^T \bigl( X^T X + \lambda I \bigr)^{-1} X^T y_{c_i} < x_i^T \bigl( X^T X + \lambda I \bigr)^{-1} X^T y_{j} \right] \tag{80} \\
&= \mathbb{I}\Bigl[ x_i^T \underbrace{\bigl( X_{(i)}^T X_{(i)} + x_i x_i^T + \lambda I \bigr)^{-1}}_{\mathrm{I}} \bigl( X_{(i)}^T y_{c_i (i)} + x_i - X_{(i)}^T y_{j (i)} \bigr) < 0 \Bigr] \,. \tag{81}
\end{align}
Letting $A = X_{(i)}^T X_{(i)} + \lambda I$ and using Lemma 11 on term I, we have
\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ x_i^T \left( A^{-1} - \frac{A^{-1} x_i x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \right) \bigl( X_{(i)}^T y_{c_i (i)} + x_i - X_{(i)}^T y_{j (i)} \bigr) < 0 \right] \tag{82} \\
&= \mathbb{I}\left[ \frac{x_i^T A^{-1} \bigl( 1 + x_i^T A^{-1} x_i \bigr) - x_i^T A^{-1} x_i x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \bigl( X_{(i)}^T y_{c_i (i)} + x_i - X_{(i)}^T y_{j (i)} \bigr) < 0 \right] \tag{83} \\
&= \mathbb{I}\left[ \frac{x_i^T A^{-1}}{1 + x_i^T A^{-1} x_i} \bigl( X_{(i)}^T y_{c_i (i)} + x_i - X_{(i)}^T y_{j (i)} \bigr) < 0 \right] \,. \tag{84}
\end{align}
Since $1 + x_i^T A^{-1} x_i > 0$, we have
\begin{align}
\mathcal{E}(\widehat f(x_i), y_i) &= \mathbb{I}\left[ x_i^T A^{-1} \bigl( X_{(i)}^T y_{c_i (i)} + x_i - X_{(i)}^T y_{j (i)} \bigr) < 0 \right] \tag{85} \\
&= \mathbb{I}\left[ x_i^T A^{-1} x_i + x_i^T A^{-1} X_{(i)}^T y_{c_i (i)} - x_i^T A^{-1} X_{(i)}^T y_{j (i)} < 0 \right] \tag{86} \\
&\le \mathbb{I}\left[ x_i^T A^{-1} X_{(i)}^T y_{c_i (i)} - x_i^T A^{-1} X_{(i)}^T y_{j (i)} < 0 \right] = \mathcal{E}\bigl( \widehat{f}_{(i)}(x_i), y_i \bigr) \,. \tag{87}
\end{align}
Using (87), we have
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} := \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}\bigl( \widehat{f}_{(i)}(x_i), y_i \bigr)}{\lvert \widetilde{S}_M \rvert} \,. \tag{88} \]
We now relate the RHS in (88) with the population error on the mislabeled distribution. As before, we leverage Condition 1 and Lemma 12. Using (73) and (88), we have
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{1}{\delta}\left( \frac{1}{2 m_1} + \frac{3 \beta}{m + n} \right)} \,. \tag{89} \]
We have now derived a parallel to (45). Using the same arguments as in the proof of Lemma 8, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le (k-1)\left( 1 - \mathcal{E}_{\widetilde{S}_M}(\widehat f) \right) + (k-1) \sqrt{\frac{k}{\delta (k-1)}\left( \frac{1}{2 m_1} + \frac{3 \beta}{m + n} \right)} \,. \tag{90} \]

Part 2: We now combine the results in Lemma 9 and Lemma 10 to obtain the final inequality in terms of quantities that can be computed from just the randomly labeled and clean data. As in the binary case, we obtain a polynomial concentration instead of an exponential concentration. Combining (90) with Lemma 9 and Lemma 10, we have with probability at least $1-\delta$
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + (k-1)\left( 1 - \frac{k}{k-1}\, \mathcal{E}_{\widetilde{S}}(\widehat f) \right) + \left( \sqrt{k(k-1)} + k + \sqrt{k} + \frac{m}{n \sqrt{k}} \right) \sqrt{\frac{\log(4/\delta)}{2m}} + \sqrt{\frac{4}{\delta}\left( \frac{1}{m} + \frac{3 \beta}{m + n} \right)} \,. \tag{91} \]

B.3 Discussion on Condition 1

The quantity on the LHS of Condition 1 measures how much the function learned by the algorithm (in terms of error on an unseen point) changes when one point in the training set is removed. We need the hypothesis stability condition to control the variance of the empirical leave-one-out error, so as to show concentration of the average leave-one-out error around the population error. Additionally, we note that while the dominating term on the RHS of Theorem 4 matches the dominating term in the ERM bound of Theorem 1, Theorem 4 contains a polynomial concentration term (dependence on $1/\delta$ instead of $\log(1/\delta)$ under the square root). Since hypothesis stability only bounds the variance, the polynomial concentration is due to the use of Chebyshev's inequality instead of an exponential tail inequality (as in Lemma 1). Recent work has highlighted that a slightly stronger condition than hypothesis stability can be used to obtain exponential concentration for the leave-one-out error (Abou-Moustafa & Szepesvári, 2019), but we leave this for future work.

B.4 Formal statement and proof of Proposition 2

Before formally presenting the result, we introduce some notation. By $\mathcal{L}_S(w)$, we denote the objective in (59) with $\lambda = 0$. Assume the Singular Value Decomposition (SVD) of $X$ is $\sqrt{n}\, U S^{1/2} V^T$. Hence $X^T X = V S V^T$. Consider the GD iterates defined in (60). We now derive a closed-form expression for the $t$-th iterate of gradient descent:
\[ w_t = w_{t-1} + \eta \cdot X^T (y - X w_{t-1}) = \bigl( I - \eta V S V^T \bigr) w_{t-1} + \eta X^T y \,. \tag{92} \]
Rotating by $V^T$, we get
\[ \widetilde{w}_t = (I - \eta S)\, \widetilde{w}_{t-1} + \eta\, \widetilde{y} \,, \tag{93} \]
where $\widetilde{w}_t = V^T w_t$ and $\widetilde{y} = V^T X^T y$. Assuming the initial point $w_0 = 0$ and applying the recursion in (93), we get
\[ \widetilde{w}_t = S^{-1}\bigl( I - (I - \eta S)^t \bigr)\, \widetilde{y} \,. \tag{94} \]
Projecting the solution back to the original space, we have
\[ w_t = V S^{-1}\bigl( I - (I - \eta S)^t \bigr) V^T X^T y \,. \tag{95} \]
Define $f_t(x) := f(x; w_t)$ as the solution at the $t$-th iterate. Let $w_\lambda = \operatorname*{arg\,min}_{w} \mathcal{L}_S(w; \lambda) = \bigl( X^T X + \lambda I \bigr)^{-1} X^T y = V (S + \lambda I)^{-1} V^T X^T y$, and define $f_\lambda(x) := f(x; w_\lambda)$ as the regularized solution. Let $\kappa$ be the condition number of the population covariance matrix and let $s_{\min}$ be the minimum positive singular value of the empirical covariance matrix. Our proof idea is inspired by recent work relating the gradient flow solution and the regularized solution for regression problems (Ali et al., 2018). We will use the following lemma in the proof:

Lemma 13. For all $x \in [0, 1]$ and for all $k \in \mathbb{N}$, we have (a) $\frac{kx}{1 + kx} \le 1 - (1-x)^k$ and (b) $1 - (1-x)^k \le 2 \cdot \frac{kx}{kx + 1}$.

Proof. Using $(1-x)^k \le \frac{1}{1 + kx}$, we have part (a). For part (b), we numerically maximize $\frac{(1 + kx)\bigl( 1 - (1-x)^k \bigr)}{kx}$ over all $k \ge 1$ and all $x \in [0, 1]$.

Proposition 3 (Formal statement of Proposition 2). Let $\lambda = \frac{1}{t \eta}$. For a training point $x$, we have
\[ \mathbb{E}_{x \sim S}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] \le c(t, \eta) \cdot \mathbb{E}_{x \sim S}\left[ f_t(x)^2 \right] , \]
where $c(t, \eta) := \min\left( 0.25, \frac{1}{s_{\min}^2 t^2 \eta^2} \right)$. Similarly, for a test point, we have
\[ \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] \le \kappa \cdot c(t, \eta) \cdot \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ f_t(x)^2 \right] . \]

Proof. We want to analyze the expected squared difference between the output of regularized linear regression with regularization constant $\lambda = \frac{1}{\eta t}$ and the gradient descent solution at the $t$-th iterate. We separately expand the algebraic expression for the squared difference at a training point and at a test point. The main step is then to show that $S^{-1}\bigl( I - (I - \eta S)^t \bigr) - (S + \lambda I)^{-1} \preceq c_1(\eta, t) \cdot S^{-1}\bigl( I - (I - \eta S)^t \bigr)$ for $c_1(\eta, t) = \sqrt{c(\eta, t)}$.

Part 1. First, we analyze the squared difference of the output at a training point (for simplicity, we refer to $S \cup \widetilde{S}$ as $S$):

\begin{align}
\mathbb{E}_{x \sim S}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] &= \lVert X w_t - X w_\lambda \rVert_2^2 \tag{96} \\
&= \left\lVert X V S^{-1}\bigl( I - (I - \eta S)^t \bigr) V^T X^T y - X V (S + \lambda I)^{-1} V^T X^T y \right\rVert_2^2 \tag{97} \\
&= \left\lVert X V \left( S^{-1}\bigl( I - (I - \eta S)^t \bigr) - (S + \lambda I)^{-1} \right) V^T X^T y \right\rVert_2^2 \tag{98} \\
&= y^T X V \underbrace{\left( S^{-1}\bigl( I - (I - \eta S)^t \bigr) - (S + \lambda I)^{-1} \right)}_{\mathrm{I}}\, S\, \left( S^{-1}\bigl( I - (I - \eta S)^t \bigr) - (S + \lambda I)^{-1} \right) V^T X^T y \,. \tag{99}
\end{align}
We now separately consider term I. Substituting $\lambda = \frac{1}{t \eta}$, we get
\begin{align}
S^{-1}\bigl( I - (I - \eta S)^t \bigr) - (S + \lambda I)^{-1} &= S^{-1}\left( \bigl( I - (I - \eta S)^t \bigr) - \bigl( I + \lambda S^{-1} \bigr)^{-1} \right) \tag{100} \\
&= S^{-1} \underbrace{\left( \bigl( I - (I - \eta S)^t \bigr) - \bigl( I + (S t \eta)^{-1} \bigr)^{-1} \right)}_{A} \,. \tag{101}
\end{align}
We now separately bound the diagonal entries of the matrix $A$. By $s_i$, we denote the $i$-th diagonal entry of $S$. Note that since $\eta \le 1/\lVert S \rVert_{\mathrm{op}}$, we have $\eta s_i \le 1$ for all $i$. Considering the $i$-th (non-zero) diagonal entry of the diagonal matrix $A$, we have
\[ A_{ii} = 1 - (1 - s_i \eta)^t - \frac{t \eta s_i}{1 + t \eta s_i} = \bigl( 1 - (1 - s_i \eta)^t \bigr) \underbrace{\left( 1 - \frac{t \eta s_i}{(1 + t \eta s_i)\bigl( 1 - (1 - s_i \eta)^t \bigr)} \right)}_{\mathrm{II}} \le \frac{1}{2} \bigl( 1 - (1 - s_i \eta)^t \bigr) \,, \tag{102} \]
where the last inequality uses Lemma 13 (b). Additionally, we can also show the following upper bound on term II:
\begin{align}
1 - \frac{t \eta s_i}{(1 + t \eta s_i)\bigl( 1 - (1 - s_i \eta)^t \bigr)} &= \frac{(1 + t \eta s_i)\bigl( 1 - (1 - s_i \eta)^t \bigr) - t \eta s_i}{(1 + t \eta s_i)\bigl( 1 - (1 - s_i \eta)^t \bigr)} \tag{103} \\
&= \frac{1 - (1 - s_i \eta)^t - t \eta s_i (1 - s_i \eta)^t}{(1 + t \eta s_i)\bigl( 1 - (1 - s_i \eta)^t \bigr)} \le \frac{1}{t \eta s_i} \,, \tag{104}
\end{align}
where the final step uses Lemma 13 (a). Combining both upper bounds on each diagonal entry $A_{ii}$, we have
\[ S^{-1} A \preceq c_1(\eta, t) \cdot S^{-1}\bigl( I - (I - \eta S)^t \bigr) \,, \tag{105} \]
where $c_1(\eta, t) = \min\bigl( 0.5, \frac{1}{t s_{\min} \eta} \bigr)$. Plugging this into (99), we have
\begin{align}
\mathbb{E}_{x \sim S}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] &\le c(\eta, t) \cdot y^T X V \bigl( S^{-1}( I - (I - \eta S)^t ) \bigr)\, S\, \bigl( S^{-1}( I - (I - \eta S)^t ) \bigr) V^T X^T y \tag{106, 107} \\
&= c(\eta, t) \cdot \lVert X w_t \rVert_2^2 \tag{108} \\
&= c(\eta, t) \cdot \mathbb{E}_{x \sim S}\left[ f_t(x)^2 \right] \,, \tag{109}
\end{align}
where $c(\eta, t) = \min\bigl( 0.25, \frac{1}{t^2 s_{\min}^2 \eta^2} \bigr)$.

Part 2. By $\Sigma$, we denote the underlying true covariance matrix. We now consider the squared difference of the output at an unseen point:
\begin{align}
\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] &= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( x^T w_t - x^T w_\lambda \right)^2 \right] \tag{110} \\
&= \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( x^T V \left( S^{-1}( I - (I - \eta S)^t ) - (S + \lambda I)^{-1} \right) V^T X^T y \right)^2 \right] \tag{111, 112} \\
&= y^T X V \left( S^{-1}( I - (I - \eta S)^t ) - (S + \lambda I)^{-1} \right) V^T \Sigma V \left( S^{-1}( I - (I - \eta S)^t ) - (S + \lambda I)^{-1} \right) V^T X^T y \tag{113, 114} \\
&\le \sigma_{\max} \cdot y^T X V \underbrace{\left( S^{-1}( I - (I - \eta S)^t ) - (S + \lambda I)^{-1} \right)}_{\mathrm{I}}^{2} V^T X^T y \,, \tag{115}
\end{align}
where $\sigma_{\max}$ is the maximum eigenvalue of the underlying covariance matrix $\Sigma$. Using the upper bound on term I from (105), we have
\begin{align}
\mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( f_t(x) - f_\lambda(x) \right)^2 \right] &\le \sigma_{\max} \cdot c(\eta, t) \cdot y^T X V \bigl( S^{-1}( I - (I - \eta S)^t ) \bigr)^2 V^T X^T y \tag{116} \\
&= \kappa \cdot c(\eta, t) \cdot \sigma_{\min} \cdot \left\lVert V S^{-1}( I - (I - \eta S)^t ) V^T X^T y \right\rVert_2^2 \tag{117} \\
&\le \kappa \cdot c(\eta, t) \cdot y^T X V \bigl( S^{-1}( I - (I - \eta S)^t ) \bigr) V^T \Sigma V \bigl( S^{-1}( I - (I - \eta S)^t ) \bigr) V^T X^T y \tag{118, 119} \\
&= \kappa \cdot c(\eta, t) \cdot \mathbb{E}_{x \sim \mathcal{D}_{\mathcal{X}}}\left[ \left( x^T w_t \right)^2 \right] \,, \tag{120}
\end{align}
where $\sigma_{\min}$ is the minimum eigenvalue of $\Sigma$ and $\kappa = \sigma_{\max}/\sigma_{\min}$ is its condition number.
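To make Proposition 3 concrete, here is a small numerical sketch (ours, not the paper's code) comparing the GD iterate at step t on the unregularized squared loss with the ridge solution at the matched penalty λ = 1/(tη); all names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 20
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

eta = 1.0 / np.linalg.norm(X.T @ X, 2)           # step size <= 1 / ||X^T X||_op
t = 500
w = np.zeros(d)
for _ in range(t):                               # GD on the unregularized squared loss
    w = w - eta * X.T @ (X @ w - y)

lam = 1.0 / (t * eta)                            # matched ridge penalty
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

rel_gap = np.linalg.norm(X @ w - X @ w_ridge) ** 2 / np.linalg.norm(X @ w) ** 2
print(rel_gap)                                   # small, consistent with Proposition 3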

B.5 Extension to deep learning

Under Assumption 1 (discussed in App. B.6), we present the formal result parallel to Theorem 3.

Theorem 6. Consider a multiclass classification problem with $k$ classes. Under Assumption 1, for any $\delta > 0$, with probability at least $1-\delta$, we have
\[ \mathcal{E}_{\mathcal{D}}(\widehat f) \le \mathcal{E}_{S}(\widehat f) + (k-1)\left( 1 - \frac{k}{k-1}\, \mathcal{E}_{\widetilde{S}}(\widehat f) \right) + c' \sqrt{\frac{\log(4/\delta)}{2m}} \,, \tag{121} \]
for some constant $c' \le (c + 1) k + \sqrt{k} + \frac{m}{n \sqrt{k}}$, where $c$ is the constant in Assumption 1.

The proof follows exactly as in steps (i) to (iii) of the proof of Theorem 3.

B.6 Justifying Assumption 1

Motivated by the analysis of linear models, we now discuss alternate (and weaker) conditions that imply Assumption 1. We need hypothesis stability (Condition 1) and the following assumption relating the training error and the leave-one-out error:

Assumption 2. Let $\widehat f$ be a model obtained by training with algorithm $\mathcal{A}$ on a mixture of clean data $S$ and randomly labeled data $\widetilde{S}$. Then we assume that
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \,, \]
where, for all $(x_i, y_i) \in \widetilde{S}_M$, $f_{(i)} := f\bigl( \mathcal{A}, (S \cup \widetilde{S})^{(i)} \bigr)$ and $\mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} := \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}( f_{(i)}(x_i), y_i )}{\lvert \widetilde{S}_M \rvert}$.

Intuitively, this assumption states that the error on a (mislabeled) datum $(x, y)$ included in the training set is less than the error on that datum obtained by a model trained on the training set with $(x, y)$ removed. We proved this for linear models trained with GD in the proof of Theorem 5. Condition 1 with $\beta = \mathcal{O}(1)$ and Assumption 2, together with Lemma 12, imply Assumption 1 with a polynomial residual term (instead of logarithmic in $1/\delta$):
\[ \mathcal{E}_{\widetilde{S}_M}(\widehat f) \le \mathcal{E}_{\mathcal{D}'}(\widehat f) + \sqrt{\frac{1}{\delta}\left( \frac{1}{m} + \frac{3 \beta}{m + n} \right)} \,. \tag{122} \]

C Additional experiments and details

C.1 Datasets

Toy dataset. Assume fixed constants $\mu$ and $\sigma$. For a given label $y$, we simulate features $x$ in our toy classification setup as follows:
\[ x := \mathrm{concat}\,[x_1, x_2] \quad \text{where} \quad x_1 \sim \mathcal{N}\bigl( y \cdot \mu, \sigma^2 I_{d \times d} \bigr) \quad \text{and} \quad x_2 \sim \mathcal{N}\bigl( 0, \sigma^2 I_{d \times d} \bigr) \,. \]
In experiments throughout the paper, we fix the dimension $d = 100$, $\mu = 1.0$, and $\sigma = \sqrt{d}$. Intuitively, $x_1$ carries the information about the underlying label and $x_2$ is additional noise independent of the underlying label (a data-generation sketch follows at the end of this subsection).

CV datasets. We use MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky & Hinton, 2009). We produce a binary variant of the multiclass classification problem by mapping classes {0, 1, 2, 3, 4} to label 1 and {5, 6, 7, 8, 9} to label −1. For the CIFAR dataset, we also use the standard data augmentation of random crop and horizontal flip. The PyTorch code is as follows: (transforms.RandomCrop(32, padding=4), transforms.RandomHorizontalFlip()).

NLP dataset. We use the IMDb sentiment analysis corpus (Maas et al., 2011).
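A minimal sketch of the toy data generation described above (our illustration; function and variable names are ours):

import numpy as np

def make_toy_dataset(n, d=100, mu=1.0, seed=0):
    # x = concat[x1, x2]; x1 ~ N(y * mu, sigma^2 I), x2 ~ N(0, sigma^2 I), sigma = sqrt(d)
    rng = np.random.default_rng(seed)
    sigma = np.sqrt(d)
    y = rng.choice([-1.0, 1.0], size=n)                      # balanced binary labels
    x1 = y[:, None] * mu + sigma * rng.normal(size=(n, d))   # label-dependent block
    x2 = sigma * rng.normal(size=(n, d))                     # label-independent noise block
    return np.concatenate([x1, x2], axis=1), y

X, y = make_toy_dataset(50000)
print(X.shape, y.shape)                                      # (50000, 200) (50000,)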

C.2 Architecture Details

All experiments were run on NVIDIA GeForce RTX 2080 Ti GPUs. We used PyTorch (Paszke et al., 2019) and Keras with a TensorFlow (Abadi et al., 2016) backend for experiments.

Linear model. For the toy dataset, we simulate a linear model with scalar output and the same number of parameters as the number of dimensions.

Wide nets. To simulate the NTK regime, we experiment with 2-layer wide nets. The PyTorch code for the 2-layer wide MLP is as follows:

nn.Sequential(
    nn.Flatten(),
    nn.Linear(input_dims, 200000, bias=True),
    nn.ReLU(),
    nn.Linear(200000, 1, bias=True)
)

We experiment both (i) with the second layer fixed at random initialization, and (ii) with both layers' weights updated.

Deep nets for CV tasks. We consider a 4-layer MLP. The PyTorch code for the 4-layer MLP is as follows:

nn.Sequential(
    nn.Flatten(),
    nn.Linear(input_dim, 5000, bias=True),
    nn.ReLU(),
    nn.Linear(5000, 5000, bias=True),
    nn.ReLU(),
    nn.Linear(5000, 5000, bias=True),
    nn.ReLU(),
    nn.Linear(5000, num_label, bias=True)
)

For MNIST, we use 1000 nodes instead of 5000 nodes in the hidden layers. We also experiment with convolutional nets. In particular, we use ResNet18 (He et al., 2016). Implementation adapted from: https://github.com/kuangliu/pytorch-cifar.git.

Deep nets for NLP. We use a simple LSTM model with embeddings initialized with ELMo embeddings (Peters et al., 2018). Code adapted from: https://github.com/kamujun/elmo_experiments/blob/master/elmo_experiment/notebooks/elmo_text_classification_on_imdb.ipynb. We also evaluate our bounds with a BERT model. In particular, we fine-tune an off-the-shelf uncased BERT model (Devlin et al., 2018). Code adapted from Hugging Face Transformers (Wolf et al., 2020): https://huggingface.co/transformers/v3.1.0/custom_datasets.html.

C.3 Additional experiments

Results with SGD on underparameterized linear models.

Figure 3: We plot the accuracy and the corresponding bound (RHS in (1)) at $\delta = 0.1$ for the toy binary classification task. Results aggregated over 3 seeds. Accuracy vs. fraction of unlabeled data (w.r.t. clean data) in the toy setup with a linear model trained with SGD. Results parallel to Fig. 2(a) with SGD.

Results with wide nets on binary MNIST.

Figure 4: We plot the accuracy and the corresponding bound (RHS in (1)) at $\delta = 0.1$ for binary MNIST classification. Results aggregated over 3 seeds. Accuracy vs. fraction of unlabeled data for a 2-layer wide network on binary MNIST, with both layers training in (a, b) and only the first layer training in (c). Panels: (a) GD with MSE loss; (b) SGD with CE loss; (c) SGD with MSE loss. Results parallel to Fig. 2(b).

Results on CIFAR10 and MNIST. We plot epoch-wise error curves for the results in Table 1 (Fig. 5 and Fig. 6). We observe the same trend as in Fig. 1. Additionally, we plot an oracle bound obtained by tracking the error on mislabeled data for which the true label was nevertheless predicted. To obtain an exact empirical value of the oracle bound, we need the underlying true labels for the randomly labeled data. While we cannot calculate the oracle bound with access to extra unlabeled data alone, we note that the oracle bound is very tight and never violated in practice, underscoring an important aspect of generalization in multiclass problems. This highlights that an even stronger conjecture may hold in multiclass classification: the error on mislabeled data (on which the true label was nevertheless predicted) lower bounds the population error on the distribution of mislabeled data, and hence the error on (a specific) mislabeled portion predicts the population accuracy on clean data. On the other hand, the dominating term in Theorem 3 is loose when compared with the oracle bound. The main reason, we believe, is the pessimistic upper bound in (45) in the proof of Lemma 8. We leave an investigation of this gap for future work.

Figure 5: Per-epoch curves for CIFAR10 corresponding to the results in Table 1; panels: (a) MLP, (b) ResNet. As before, we plot only the dominating term in the RHS of (5) as the predicted bound. Additionally, we also plot the lower bound predicted by the error on mislabeled data for which the true label was nevertheless predicted. We refer to this as the "Oracle bound". See text for more details.

Figure 6: Per-epoch curves for MNIST corresponding to the results in Table 1; panels: (a) MLP, (b) ResNet. As before, we plot only the dominating term in the RHS of (5) as the predicted bound, together with the "Oracle bound" defined as in Figure 5. See text for more details.

Results on CIFAR100. On CIFAR100, our bound in (5) yields vacuous bounds. However, the oracle bound explained above yields tight guarantees in the initial phase of learning (i.e., when the learning rate is less than 0.1) (Fig. 7).

C.4 Hyperparameter Details

Fig. 1. We use a clean training dataset of size 40,000. We fix the amount of unlabeled data at 20% of the clean size, i.e., we include an additional 8,000 points with randomly assigned labels. We use a test set of 10,000 points. For both MLP and ResNet, we use SGD with an initial learning rate of 0.1 and momentum 0.9. We fix the weight decay parameter at $5 \times 10^{-4}$. After 100 epochs, we decay the learning rate to 0.01. We use an SGD batch size of 100.

Fig. 2 (a). We obtain a toy dataset according to the process described in Sec. C.1. We fix $d = 100$ and create a dataset of 50,000 points with balanced classes. Moreover, we sample additional covariates with the same procedure to create the randomly labeled dataset. For both SGD and GD training, we use a fixed learning rate of 0.1.

Figure 7: Lower bound predicted by the error on mislabeled data for which the true label was nevertheless predicted ("Oracle bound"), with ResNet18 on CIFAR100. See text for more details. The bound predicted by RATT (RHS in (5)) is vacuous.

Fig. 2 (b). Similar to binary CIFAR, we use a clean training dataset of size 40,000 and fix the amount of unlabeled data at 20% of the clean dataset size. To train wide nets, we use a fixed learning rate of 0.001 with GD and SGD. We choose the weight decay parameter and the early stopping point that maximize our generalization bound (i.e., without peeking at unseen data). We use an SGD batch size of 100.

Fig. 2 (c). With the IMDb dataset, we use a clean dataset of size 20,000 and, as before, fix the amount of unlabeled data at 20% of the clean data. To train the ELMo model, we use the Adam optimizer with a fixed learning rate of 0.01 and weight decay $10^{-6}$ to minimize the cross-entropy loss. We train with batch size 32 for 3 epochs. To fine-tune the BERT model, we use the Adam optimizer with learning rate $5 \times 10^{-5}$ to minimize the cross-entropy loss. We train with a batch size of 16 for 1 epoch.

Table 1. For multiclass datasets, we train both MLP and ResNet with the same hyperparameters as described before. We sample a clean training dataset of size 40,000 and fix the amount of unlabeled data at 20% of the clean size. We use SGD with an initial learning rate of 0.1 and momentum 0.9. We fix the weight decay parameter at $5 \times 10^{-4}$. After 30 epochs for ResNet and after 50 epochs for MLP, we decay the learning rate to 0.01. We use SGD with batch size 100. For Fig. 7, we use the same hyperparameters as for CIFAR10 training, except that we now decay the learning rate after 100 epochs. In all experiments, to identify the best possible accuracy on just the clean data, we use the exact same set of hyperparameters except for the stopping point; we choose a stopping point that maximizes test performance.
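For concreteness, here is a minimal PyTorch sketch (ours, not the paper's released code) of assembling the training set used throughout these experiments: clean labeled examples are concatenated with previously unlabeled examples that receive uniformly random labels. All names are hypothetical.

import torch
from torch.utils.data import TensorDataset, ConcatDataset, DataLoader

def build_ratt_dataset(clean_x, clean_y, unlabeled_x, num_classes):
    # Assign uniformly random labels to the unlabeled points and concatenate with clean data
    rand_y = torch.randint(0, num_classes, (unlabeled_x.shape[0],))
    return ConcatDataset([TensorDataset(clean_x, clean_y),
                          TensorDataset(unlabeled_x, rand_y)])

# Example with dummy tensors: 40,000 clean points and 8,000 unlabeled points (20%)
clean_x = torch.randn(40000, 3, 32, 32)
clean_y = torch.randint(0, 10, (40000,))
unlabeled_x = torch.randn(8000, 3, 32, 32)
train_set = build_ratt_dataset(clean_x, clean_y, unlabeled_x, num_classes=10)
loader = DataLoader(train_set, batch_size=100, shuffle=True)
print(len(train_set))                             # 48000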

C.5 Summary of experiments

Classification type | Model category | Model | Dataset
Binary | Low dimensional | Linear model | Toy Gaussian dataset
Binary | Overparameterized linear nets | 2-layer wide net | Binary MNIST
Binary | Deep nets | MLP | Binary MNIST, Binary CIFAR
Binary | Deep nets | ResNet | Binary MNIST, Binary CIFAR
Binary | Deep nets | ELMo-LSTM model | IMDb Sentiment Analysis
Binary | Deep nets | BERT pre-trained model | IMDb Sentiment Analysis
Multiclass | Deep nets | MLP | MNIST, CIFAR10
Multiclass | Deep nets | ResNet | MNIST, CIFAR10, CIFAR100

D Proof of Lemma 12

Proof of Lemma 12. Recall that we have a training set $S \cup \widetilde{S}$. We defined the leave-one-out error on mislabeled points as
\[ \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} = \frac{\sum_{(x_i, y_i) \in \widetilde{S}_M} \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)}{\lvert \widetilde{S}_M \rvert} \,, \]
where $f_{(i)} := f\bigl( \mathcal{A}, (S \cup \widetilde{S})^{(i)} \bigr)$. Define $S' := S \cup \widetilde{S}$, and let $(x, y)$ and $(x', y')$ be i.i.d. samples from $\mathcal{D}'$. Using Lemma 25 in Bousquet & Elisseeff (2002), we have
\begin{align}
\mathbb{E}\left[ \left( \mathcal{E}_{\mathcal{D}'}(\widehat f) - \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \right)^2 \right] \le\; & \mathbb{E}_{S', (x,y), (x',y')}\left[ \mathcal{E}(\widehat f(x), y)\, \mathcal{E}(\widehat f(x'), y') \right] - 2\, \mathbb{E}_{S', (x,y)}\left[ \mathcal{E}(\widehat f(x), y)\, \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) \right] \nonumber \\
& + \frac{m_1 - 1}{m_1}\, \mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)\, \mathcal{E}\bigl( f_{(j)}(x_j), y_j \bigr) \right] + \frac{1}{m_1}\, \mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) \right] . \tag{123}
\end{align}
We can rewrite the equation above as:
\begin{align}
\mathbb{E}\left[ \left( \mathcal{E}_{\mathcal{D}'}(\widehat f) - \mathcal{E}_{\mathrm{LOO}(\widetilde{S}_M)} \right)^2 \right] \le\; & \underbrace{\mathbb{E}_{S', (x,y), (x',y')}\left[ \mathcal{E}(\widehat f(x), y)\, \mathcal{E}(\widehat f(x'), y') - \mathcal{E}(\widehat f(x), y)\, \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) \right]}_{\mathrm{I}} \nonumber \\
& + \underbrace{\mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)\, \mathcal{E}\bigl( f_{(j)}(x_j), y_j \bigr) - \mathcal{E}(\widehat f(x), y)\, \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) \right]}_{\mathrm{II}} \nonumber \\
& + \frac{1}{m_1} \underbrace{\mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) - \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)\, \mathcal{E}\bigl( f_{(j)}(x_j), y_j \bigr) \right]}_{\mathrm{III}} \,. \tag{124}
\end{align}
We will now bound term III. Using the Cauchy–Schwarz inequality, we have
\begin{align}
\mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr) - \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)\, \mathcal{E}\bigl( f_{(j)}(x_j), y_j \bigr) \right] &\le \sqrt{ \mathbb{E}_{S'}\left[ \mathcal{E}\bigl( f_{(i)}(x_i), y_i \bigr)^2 \right]\, \mathbb{E}_{S'}\left[ \bigl( 1 - \mathcal{E}( f_{(j)}(x_j), y_j ) \bigr)^2 \right] } \tag{125} \\
&\le \frac{1}{4} \,. \tag{126}
\end{align}
Note that since $(x_i, y_i)$, $(x_j, y_j)$, $(x, y)$, and $(x', y')$ are all from the same distribution $\mathcal{D}'$, we directly incorporate the bounds on terms I and II from the proof of Lemma 9 in Bousquet & Elisseeff (2002). Combining that with (126) and our definition of hypothesis stability in Condition 1, we have the required claim.
