How does Early Stopping Help Generalization against Label Noise?

Hwanjun Song 1 Minseok Kim 1 Dongmin Park 1 Jae-Gil Lee 1

Abstract

Noisy labels are very common in real-world training data, and they lead to poor generalization on test data because of overfitting to the noisy labels. In this paper, we claim that such overfitting can be avoided by "early stopping" the training of a deep neural network before the noisy labels are severely memorized. We then resume training the early stopped network using a "maximal safe set," which maintains a collection of almost certainly true-labeled samples at each epoch since the early stop point. Putting these together, our novel two-phase training method, called Prestopping, realizes noise-free training under any type of label noise for practical use. Extensive experiments using four image benchmark data sets verify that our method significantly outperforms four state-of-the-art methods in test error by 0.4–8.2 percentage points under real-world noise.

1. Introduction

By virtue of massive labeled data, deep neural networks (DNNs) have achieved remarkable success in numerous tasks (Krizhevsky et al., 2012; Redmon et al., 2016). However, owing to their high capacity to memorize any label noise, the generalization performance of DNNs drops drastically when noisy labels are contained in the training data (Jiang et al., 2018; Han et al., 2018; Song et al., 2020b). In particular, Zhang et al. (2017) have shown that a standard convolutional neural network (CNN) can easily fit the entire training data with any ratio of noisy labels, which eventually leads to very poor generalization on the test data. Thus, it is challenging to train a DNN robustly when noisy labels exist in the training data.

A popular approach to dealing with noisy labels is "sample selection," which selects true-labeled samples from the noisy training data (Ren et al., 2018; Han et al., 2018; Yu et al., 2019). Here, the (1−τ)×100% of training samples with the smallest loss are treated as true-labeled ones and then used to update a DNN robustly, where τ ∈ [0, 1] is the noise rate. This loss-based separation is commonly justified by the memorization effect (Arpit et al., 2017), namely that DNNs tend to learn easy patterns first and only gradually memorize all samples.

Despite its great success, a recent study (Song et al., 2019) has argued that the performance of the loss-based separation becomes considerably worse depending on the type of label noise. For instance, the loss-based approach separates true-labeled samples from false-labeled ones well under symmetric noise (Figure 1(a)), but many false-labeled samples are misclassified as true-labeled ones because the two distributions overlap closely under pair and real-world noises (Figures 1(b) and 1(c)), both of which are more realistic than symmetric noise (Ren et al., 2018; Yu et al., 2019). This limitation calls for a new approach that supports any type of label noise for practical use.

In this regard, as shown in Figure 2(a), we thoroughly investigated the memorization effect of a DNN under the two types of noise and found two interesting properties:

• A noise type affects the memorization rate for false-labeled samples: The memorization of false-labeled samples proceeds faster with pair noise than with symmetric noise. That is, the red portion in Figure 2(a) starts to appear earlier in pair noise than in symmetric noise. This observation is consistent with the significant overlap of true-labeled and false-labeled samples in Figure 1(b). Thus, the loss-based separation performs well only if the false-labeled samples are scarcely learned at an early stage of training, as in symmetric noise.

• There is a period where the network accumulates the label noise severely: Regardless of the noise type, the memorization of false-labeled samples increases significantly at a late stage of training. That is, the red portion in Figure 2(a) grows rapidly after the dashed line, which we call the error-prone period. We note that training in that period brings no benefit: the generalization performance of "Default" deteriorates sharply, as shown in Figure 2(c).
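For reference, the loss-based separation criticized above amounts to keeping the (1−τ)×100% smallest-loss samples of each mini-batch. The following minimal NumPy sketch (our own illustration; the function name and example values are hypothetical, not taken from any of the cited implementations) makes the selection rule explicit.

import numpy as np

def select_small_loss(losses, noise_rate):
    # Keep the (1 - noise_rate) fraction of samples with the smallest loss;
    # these are treated as the presumed true-labeled samples.
    num_keep = int(np.ceil((1.0 - noise_rate) * len(losses)))
    return np.argsort(losses)[:num_keep]   # ascending sort, so smallest losses come first

# Example: with tau = 0.4, 60% of a mini-batch survives the selection.
batch_losses = np.array([0.2, 2.9, 0.1, 1.7, 0.4])
kept_indices = select_small_loss(batch_losses, noise_rate=0.4)   # array([2, 0, 4])

As Figure 1 illustrates, this rule is reliable only when the loss distributions of true-labeled and false-labeled samples are well separated, which is exactly what breaks down under pair and real-world noise.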

1 Graduate School of Knowledge Service Engineering, Daejeon, Korea. Correspondence to: Jae-Gil Lee. Presented at the ICML 2020 Workshop on Uncertainty and Robustness in Deep Learning. Copyright 2020 by the author(s).

Based on these findings, we contend that eliminating this error-prone period should have a profound impact on robust optimization. In this paper, we propose a novel approach, called Prestopping, that achieves noise-free training based on the early stopping mechanism.

[Figure 1 omitted: loss histograms of true-labeled vs. false-labeled samples with the (1−τ)×100% small-loss threshold marked; x-axis: Loss (log-scale). Panels: (a) Symmetric Noise, (b) Pair Noise, (c) Real-World Noise.]
Figure 1. Loss distributions at a training accuracy of 50%: (a) and (b) show those on CIFAR-100 with two types of synthetic noise of 40%, where "symmetric noise" flips a true label into other labels with equal probability, and "pair noise" flips a true label into a specific false label; (c) shows those on FOOD-101N (Lee et al., 2018) with the real-world noise of 18.4%.

[Figure 2 omitted: memorization ratio of true-labeled vs. false-labeled samples over training epochs, with the error-prone period and the early stop point marked, and the test error curves of Default and Prestopping; x-axis: Epochs. Panels: (a) Default, (b) Prestopping, (c) Test Error Convergence.]
Figure 2. Key idea of Prestopping: (a) and (b) show how many true-labeled and false-labeled samples are memorized when training DenseNet (L=40, k=12)² on CIFAR-100 with the pair noise of 40%. "Default" is a standard training method, and "Prestopping" is our proposed one; (c) contrasts the convergence of test error between the two methods.

Because there is no benefit from the error-prone period, Prestopping early stops training before that period begins. This early stopping effectively prevents the network from overfitting to false-labeled samples, and the samples memorized until that point are added to a maximal safe set because they are true-labeled (i.e., blue in Figure 2(a)) with high precision. Then, Prestopping resumes training the early stopped network using only the maximal safe set, in support of noise-free training. Notably, our proposed merger of "early stopping" and "learning from the maximal safe set" indeed eliminates the error-prone period from the training process, as shown in Figure 2(b). As a result, the generalization performance of the DNN improves remarkably for both noise types, as shown in Figure 2(c).

2. Preliminaries

A k-class classification problem requires the training data D = {(x_i, y_i^*)}_{i=1}^N, where x_i is a sample and y_i^* ∈ {1, 2, ..., k} is its true label. Following the label noise scenario, let's consider the noisy training data D̃ = {(x_i, ỹ_i)}_{i=1}^N, where ỹ_i ∈ {1, 2, ..., k} is a noisy label that may not be true. Then, in conventional training, when a mini-batch B_t = {(x_i, ỹ_i)}_{i=1}^b consists of b samples randomly drawn from the noisy training data D̃ at time t, the network parameter θ_t is updated in the descent direction of the expected loss on the mini-batch B_t as in Eq. (1), where α is a learning rate and L is a loss function.

θ_{t+1} = θ_t − α∇( (1/|B_t|) Σ_{x∈B_t} L(x, ỹ; θ_t) )    (1)

As for the notion of network memorization, a sample x is defined to be memorized by a network if the majority of its recent predictions at time t coincide with its given label, as in Definition 2.1.

Definition 2.1. (Memorized Sample) Let ŷ_t = Φ(x|θ_t) be the predicted label of a sample x at time t and H_x^t(q) = {ŷ_{t_1}, ŷ_{t_2}, ..., ŷ_{t_q}} be the history of the sample x that stores the predicted labels of the recent q epochs, where Φ is a neural network. Next, P(y|x, t; q) is formulated such that it provides the probability of the label y ∈ {1, 2, ..., k} being estimated as the label of the sample x based on H_x^t(q), as in Eq. (2), where [·] is the Iverson bracket³.

P(y|x, t; q) = ( Σ_{ŷ∈H_x^t(q)} [ŷ = y] ) / |H_x^t(q)|    (2)

Then, the sample x with its noisy label ỹ is a memorized sample of the network with the parameter θ_t at time t if argmax_y P(y|x, t; q) = ỹ holds.

3. Robust Training via Prestopping

The key idea of Prestopping is learning from a maximal safe set with an early stopped network. Thus, its two components, "early stopping" and "learning from the maximal safe set," respectively raise the questions: (Q1) when is the best point to early stop the training process? and (Q2) what is the maximal safe set that enables noise-free training during the remaining period?

² The learning rate, as usual, was decayed at 50% and 75% of the total number of training epochs.
³ The Iverson bracket [P] returns 1 if P is true, and 0 otherwise.
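To make Definition 2.1 concrete, the short Python sketch below maintains the history H_x^t(q) of one sample as a fixed-length queue and applies the estimate of Eq. (2) together with the memorization test; the helper names are ours for illustration and are not taken from the authors' implementation.

from collections import Counter, deque

def label_probability(history, label):
    # P(y | x, t; q) in Eq. (2): fraction of the stored predictions equal to `label`.
    return sum(1 for y_hat in history if y_hat == label) / max(len(history), 1)

def is_memorized(history, noisy_label):
    # Definition 2.1: x is memorized if its given (noisy) label is the most frequent
    # recent prediction, i.e., argmax_y P(y | x, t; q) equals the given label.
    if not history:
        return False
    mode_label, _ = Counter(history).most_common(1)[0]
    return mode_label == noisy_label

# History of the last q = 5 predicted labels of one sample, H_x^t(q).
history = deque([3, 3, 7, 3, 3], maxlen=5)
print(label_probability(history, 3))   # 0.8
print(is_memorized(history, 3))        # True, assuming the sample's given label is 3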

[Figure 3 omitted: memorization precision/recall and the estimated stop point over training epochs under symmetric and pair noise; legend: Memorization Precision, Memorization Recall, Best Point, Approx. Stop Point; x-axis: Epochs. Panels: (a) Ideal Method, (b) Validation Heuristic.]
Figure 3. Early stop point estimated by ideal and heuristic methods when training DenseNet (L=40, k=12) on CIFAR-100 with two types of synthetic noise of 40%: (a) and (b) show the stop point derived by the ground-truth labels and the clean validation set, respectively.

3.1. Q1: Best point to early stop

It is desirable to stop the training process at the point when the network (i) accumulates little noise from the false-labeled samples but (ii) acquires sufficient information from the true-labeled ones. Intuitively, as indicated by the dashed line in Figure 3(a), the best stop point is the moment when memorization precision and memorization recall in Definition 3.1 cross each other, because it is the best trade-off between the two metrics. The period beyond this point is what we call the error-prone period, because memorization precision starts decreasing rapidly. If the ground truth y* of a noisy label ỹ is known, the best stop point can be easily calculated from these metrics.

Definition 3.1. (Memorization Metrics) Let M_t ⊆ D̃ be the set of memorized samples at time t. Then, memorization precision (MP) and memorization recall (MR) (Han et al., 2018) are formulated as Eq. (3).

MP = |{(x, ỹ) ∈ M_t : ỹ = y*}| / |M_t|,   MR = |{(x, ỹ) ∈ M_t : ỹ = y*}| / |{(x, ỹ) ∈ D̃ : ỹ = y*}|    (3)

However, it is not straightforward to find the exact best stop point without the ground-truth labels. Hence, we present two practical heuristics for approximating the best stop point with minimal supervision. These heuristics require either a small clean validation set or the noise rate τ, both of which are widely regarded as available in many studies (Veit et al., 2017; Ren et al., 2018; Han et al., 2018; Yu et al., 2019; Song et al., 2019).

• Validation Heuristic: If a clean validation set is given, we stop training the network when the validation error is the lowest. It is reasonable to expect that the lowest validation error is achieved near the cross point; after that point (i.e., in the error-prone period), the validation error likely increases because the network becomes overfitted to the false-labeled samples. As shown in Figure 3(b), the estimated stop point is fairly close to the best stop point.

• Noise-Rate Heuristic: If the noise rate τ is known, we stop training the network when the training error reaches τ × 100%. If we assume that all of the (1−τ)×100% true-labeled samples are memorized before any of the τ×100% false-labeled samples (Arpit et al., 2017), this point coincides with the cross point of MP and MR, with both values equal to 1. This heuristic performs worse than the validation heuristic because the assumption does not hold perfectly.

3.2. Q2: Criterion of a maximal safe set

Because the network is early stopped at the (estimated) best point, the set of memorized samples at that time is quantitatively sufficient and qualitatively less noisy. That is, it can be used as a safe and effective training set to resume the training of the early stopped network without accumulating the label noise. Based on this intuition, we define a maximal safe set in Definition 3.2, which is initially derived from the memorized samples at the early stop point and is gradually enlarged as well as purified along with the network's learning progress. In each mini-batch B_t, the network parameter θ_t is updated using the current maximal safe set S_t as in Eq. (4), and subsequently a more refined maximal safe set S_{t+1} is derived from the updated network.

Definition 3.2. (Maximal Safe Set) Let t_stop be the early stop point. A maximal safe set S_t at time t is defined to be the set of memorized samples of the network Φ(x; θ_t) when t ≥ t_stop. The network Φ(x; θ_t) at t = t_stop is the early stopped network, i.e., S_{t_stop} = M_{t_stop}, and the network Φ(x; θ_t) at t > t_stop is obtained by Eq. (4).

θ_{t+1} = θ_t − α∇( (1/|B'_t|) Σ_{x∈B'_t} L(x, ỹ; θ_t) ),  where B'_t = {x | x ∈ S_t ∩ B_t}    (4)

Hence, Prestopping iteratively updates its maximal safe set at each iteration, thereby adding more hard-yet-informative clean samples, as empirically shown by the gradual increase in MR after the transition point in Appendix A.

3.3. Algorithm

For lack of space, we describe the overall procedure of Prestopping with the validation and noise-rate heuristics in Appendix B.
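Read as code, Eq. (4) restricts each gradient step to B'_t = S_t ∩ B_t. The PyTorch-style sketch below is a minimal illustration under the assumption that per-sample prediction histories are already being tracked; the function and variable names are ours and do not come from the authors' released code.

from collections import Counter
import torch
import torch.nn.functional as F

def in_safe_set(history, noisy_label):
    # Definition 3.2 via Definition 2.1: a sample is in S_t if its given label is
    # the most frequent among its recent q predicted labels.
    return bool(history) and Counter(history).most_common(1)[0][0] == noisy_label

def phase2_step(model, optimizer, batch_x, batch_y, batch_idx, histories):
    # One update by Eq. (4): train only on B'_t = S_t ∩ B_t.
    keep = [j for j, i in enumerate(batch_idx)
            if in_safe_set(histories[i], int(batch_y[j]))]
    if not keep:
        return None                                   # no safe samples in this mini-batch
    logits = model(batch_x[keep])
    loss = F.cross_entropy(logits, batch_y[keep])     # mean over |B'_t|, i.e., the 1/|B'_t| factor
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

Samples outside S_t are simply skipped in the current step; they remain in the data set and can re-enter later mini-batches once their predictions stabilize, which is how the safe set grows and is purified over the remaining epochs.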

[Figure 4 omitted: best test error vs. noise rate (0%–40%) on CIFAR-10 and CIFAR-100; legend: Default, Active Bias, Co-teaching, Co-teaching+, SELFIE, Prestopping (Validation), Prestopping+ (Validation). Panels: (a) Pair Noise, (b) Symmetric Noise.]
Figure 4. Best test errors using DenseNet on two data sets with varying pair and symmetric noise rates.

[Figure 5 omitted: best test error of each method using DenseNet and VGG; legend: Default, Active Bias, Co-teaching, Co-teaching+, SELFIE, Prestopping (Validation), Prestopping+ (Validation). Panels: (a) ANIMAL-10N (τ ≈ 8.0%), (b) FOOD-101N (τ ≈ 18.4%).]
Figure 5. Best test errors using two CNNs on two data sets with real-world noises.

4. Evaluation

We compared Prestopping and Prestopping+⁴ with not only a baseline algorithm but also four state-of-the-art robust training algorithms: Default trains the network without any processing for noisy labels; Active Bias re-weights the loss of training samples based on their prediction variance; Co-teaching selects a certain number of small-loss samples to train the network based on co-training (Blum & Mitchell, 1998); Co-teaching+ is similar to Co-teaching, but its small-loss samples are selected from the disagreement set; SELFIE selectively refurbishes noisy samples and exploits them together with the small-loss samples. Related work and the experimental setup are detailed in Appendices D and E.

4.1. Performance Comparison

Figure 4 shows the test error of the seven training methods using two CNNs on two data sets with varying pair and symmetric noise rates. Figure 5 shows the test error on two real-world noisy data sets with different noise rates. The results of Prestopping and Prestopping+ in Section 4 were obtained using the validation heuristic. See Appendix B.2 for the results of the noise-rate heuristic.

4.1.1. RESULT WITH PAIR NOISE (FIGURE 4(A))

In general, either Prestopping or Prestopping+ achieved the lowest test error over a wide range of noise rates on both CIFAR data sets. With the help of the refurbished samples, Prestopping+ achieved slightly better performance than Prestopping on CIFAR-10, whereas the opposite trend was observed on CIFAR-100. Although SELFIE achieved relatively lower test error among the existing methods, its test error was still worse than that of Prestopping. Co-teaching did not work well because many false-labeled samples were misclassified as clean ones; Co-teaching+ was even worse than Co-teaching despite being an improvement of it (see Section F.2 for details). The test error of Active Bias was not comparable to that of Prestopping. The performance improvement of our methods over the others increased as the label noise became heavier. In particular, at a heavy noise rate of 40%, Prestopping or Prestopping+ significantly reduced the absolute test error by 2.2pp–18.1pp compared with the other robust methods.

⁴ Collaboration with sample refurbishment in SELFIE (Song et al., 2019). See Appendix C for details.

4.1.2. RESULT WITH SYMMETRIC NOISE (FIGURE 4(B))

Similar to the pair noise, both Prestopping and Prestopping+ generally outperformed the others. In particular, the performance of Prestopping+ was the best at any noise rate on all data sets, because the synergistic effect was higher in symmetric noise than in pair noise. At a heavy noise rate of 40%, our methods showed a significant reduction in the absolute test error of 0.3pp–11.7pp compared with the other robust methods. Unlike the pair noise, Co-teaching and SELFIE achieved a low test error comparable to Prestopping.

4.1.3. RESULT WITH REAL-WORLD NOISE (FIGURE 5)

Both Prestopping and Prestopping+ maintained their dominance over the other methods under real-world label noise as well. Prestopping+ achieved the lowest test error when the number of classes is small (e.g., ANIMAL-10N), while Prestopping was the best when the number of classes is large (e.g., FOOD-101N), owing to the difficulty of label correction in Prestopping+. (Thus, practitioners are recommended to choose between Prestopping and Prestopping+ depending on the number of classes at hand.) Specifically, they improved the absolute test error by 0.4pp–4.6pp and 0.5pp–8.2pp on ANIMAL-10N and FOOD-101N, respectively. Therefore, we believe that the advantage of our methods is unquestionable even in real-world scenarios.

5. Conclusion

In this paper, we proposed a novel two-phase training strategy for noisy training data, which we call Prestopping. The first phase, "early stopping," retrieves an initial set of as many true-labeled samples as possible, and the second phase, "learning from a maximal safe set," completes the rest of the training process using only the true-labeled samples with high precision. Prestopping can be easily applied to many real-world cases because it additionally requires only either a small clean validation set or a noise rate. Furthermore, we combined this novel strategy with sample refurbishment to develop Prestopping+. Through extensive experiments using various real-world and simulated noisy data sets, we verified that either Prestopping or Prestopping+ achieved the lowest test error among the seven compared methods, thus significantly improving the robustness to diverse types of label noise. Overall, we believe that our work of dividing the training process into two phases by early stopping is a new direction for robust training and can trigger many subsequent studies.

Acknowledgements

This work was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2020-0-00862, DB4DL: High-Usability and Performance In-Memory Distributed DBMS for Deep Learning).

References

Arpit, D., Jastrzebski, S., Ballas, N., Krueger, D., Bengio, E., Kanwal, M. S., Maharaj, T., Fischer, A., Courville, A., Bengio, Y., et al. A closer look at memorization in deep networks. In ICML, pp. 233–242, 2017.

Blum, A. and Mitchell, T. Combining labeled and unlabeled data with co-training. In COLT, pp. 92–100, 1998.

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101: Mining discriminative components with random forests. In ECCV, pp. 446–461, 2014.

Breunig, M. M., Kriegel, H.-P., Ng, R. T., and Sander, J. LOF: Identifying density-based local outliers. In ACM SIGMOD Record, volume 29, pp. 93–104. ACM, 2000.

Chang, H.-S., Learned-Miller, E., and McCallum, A. Active Bias: Training more accurate neural networks by emphasizing high variance samples. In NeurIPS, pp. 1002–1012, 2017.

Chen, P., Liao, B. B., Chen, G., and Zhang, S. Understanding and utilizing deep neural networks trained with noisy labels. In ICML, pp. 1062–1070, 2019.

Han, B., Yao, Q., Yu, X., Niu, G., Xu, M., Hu, W., Tsang, I., and Sugiyama, M. Co-teaching: Robust training of deep neural networks with extremely noisy labels. In NeurIPS, pp. 8536–8546, 2018.

Hendrycks, D., Lee, K., and Mazeika, M. Using pre-training can improve model robustness and uncertainty. In ICML, pp. 2712–2721, 2019.

Jiang, L., Zhou, Z., Leung, T., Li, L.-J., and Fei-Fei, L. MentorNet: Learning data-driven curriculum for very deep neural networks on corrupted labels. In ICML, pp. 2309–2318, 2018.

Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In NeurIPS, pp. 1097–1105, 2012.

Krizhevsky, A., Nair, V., and Hinton, G. CIFAR-10 and CIFAR-100 datasets, 2014. https://www.cs.toronto.edu/~kriz/cifar.html.

Lee, K.-H., He, X., Zhang, L., and Yang, L. CleanNet: Transfer learning for scalable image classifier training with label noise. In CVPR, pp. 5447–5456, 2018.

Li, M., Soltanolkotabi, M., and Oymak, S. Gradient descent with early stopping is provably robust to label noise for overparameterized neural networks. arXiv preprint arXiv:1903.11680, 2019.

Ma, X., Wang, Y., Houle, M. E., Zhou, S., Erfani, S. M., Xia, S.-T., Wijewickrema, S., and Bailey, J. Dimensionality-driven learning with noisy labels. In ICML, pp. 3361–3370, 2018.

Malach, E. and Shalev-Shwartz, S. Decoupling "when to update" from "how to update". In NeurIPS, pp. 960–970, 2017.

Oymak, S., Fabian, Z., Li, M., and Soltanolkotabi, M. Generalization guarantees for neural networks via harnessing the low-rank structure of the jacobian. arXiv preprint arXiv:1906.05392, 2019.

Patrini, G., Rozza, A., Menon, A. K., Nock, R., and Qu, L. Making deep neural networks robust to label noise: A loss correction approach. In CVPR, pp. 2233–2241, 2017.

Redmon, J., Divvala, S., Girshick, R., and Farhadi, A. You only look once: Unified, real-time object detection. In CVPR, pp. 779–788, 2016.

Reed, S., Lee, H., Anguelov, D., Szegedy, C., Erhan, D., and Rabinovich, A. Training deep neural networks on noisy labels with bootstrapping. In ICLR, 2015.

Ren, M., Zeng, W., Yang, B., and Urtasun, R. Learning to reweight examples for robust deep learning. In ICML, pp. 4334–4343, 2018.

Shen, Y. and Sanghavi, S. Learning with bad training data via iterative trimmed loss minimization. In ICML, pp. 5739–5748, 2019.

Song, H., Kim, M., and Lee, J.-G. SELFIE: Refurbishing unclean samples for robust deep learning. In ICML, pp. 5907–5915, 2019.

Song, H., Kim, M., Kim, S., and Lee, J.-G. Carpe diem, seize the samples uncertain "at the moment" for adaptive batch selection. In Proceedings of the 29th ACM International Conference on Information and Knowledge Management (CIKM), 2020a.

Song, H., Kim, M., Park, D., and Lee, J.-G. Learning from noisy labels with deep neural networks: A survey. arXiv preprint arXiv:2007.08199, 2020b.

Song, H., Kim, S., Kim, M., and Lee, J.-G. Ada-Boundary: Accelerating DNN training via adaptive boundary batch selection. Machine Learning, pp. 1–17, 2020c.

Veit, A., Alldrin, N., Chechik, G., Krasin, I., Gupta, A., and Belongie, S. Learning from noisy large-scale datasets with minimal supervision. In CVPR, pp. 839–847, 2017.

Wang, Y., Liu, W., Ma, X., Bailey, J., Zha, H., Song, L., and Xia, S.-T. Iterative learning with open-set noisy labels. In CVPR, pp. 8688–8696, 2018.

Yu, X., Han, B., Yao, J., Niu, G., Tsang, I., and Sugiyama, M. How does disagreement help generalization against label corruption? In ICML, pp. 7164–7173, 2019.

Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. Understanding deep learning requires rethinking generalization. In ICLR, 2017.

A. Ablation Study on Learning from the Maximal Safe Set

Figure 6 shows the main advantage of learning from the maximal safe set during the remaining period. Even when the noise rate is quite high (e.g., 40%), this learning paradigm exploits most of the true-labeled samples, considering that the label recall of the maximal safe set was maintained above 0.81 in symmetric noise and above 0.84 in pair noise after the 80th epoch. Further, by excluding unsafe samples that might accumulate the label noise, the label precision of the maximal safe set increases rapidly at the beginning of the remaining period; it was maintained above 0.99 in symmetric noise and above 0.94 even in pair noise after the 80th epoch, which could not be realized by the small-loss separation (Han et al., 2018; Song et al., 2019).

[Figure 6 omitted: label precision and label recall (MP and MR) of the maximal safe set vs. epochs after the approximated stop point. Panels: (a) Symmetric Noise, (b) Pair Noise; x-axis: Epochs.]
Figure 6. Memorization precision and memorization recall of the maximal safe set during the remaining epochs when training DenseNet (L=40, k=12) on CIFAR-100 with two types of synthetic noise of 40%.

B. Algorithm Description

Algorithm 1 Prestopping with Validation Heuristic
INPUT: D̃: data, V: clean validation data, epochs: total number of epochs, q: history length
OUTPUT: θ_t: network parameters
 1: t ← 1; θ_t ← Initialize the network parameter;
 2: θ_{t_stop} ← ∅; /* The parameter of the stopped network */
 3: for i = 1 to epochs do /* Phase I: Learning from a noisy training data set */
 4:   for j = 1 to |D̃|/|B_t| do
 5:     Draw a mini-batch B_t from D̃;
 6:     θ_{t+1} = θ_t − α∇( (1/|B_t|) Σ_{x∈B_t} L(x, ỹ; θ_t) ); /* Update by Eq. (1) */
 7:     val_err ← Get_Validation_Error(V, θ_t); /* Validation error at time t */
 8:     if isMin(val_err) then θ_{t_stop} ← θ_t; /* Save the network when the validation error is the lowest */
 9:     t ← t + 1;
10: θ_t ← θ_{t_stop}; /* Load the network stopped at t_stop */
11: for i = stop_epoch to epochs do /* Phase II: Learning from a maximal safe set */
12:   for j = 1 to |D̃|/|B_t| do
13:     Draw a mini-batch B_t from D̃;
14:     S_t ← {x | argmax_y P(y|x, t; q) = ỹ}; /* The maximal safe set of Definition 3.2 */
15:     θ_{t+1} = θ_t − α∇( (1/|S_t ∩ B_t|) Σ_{x∈S_t∩B_t} L(x, ỹ; θ_t) ); /* Update by Eq. (4) */
16:     t ← t + 1;
17: return θ_t, S_t;

B.1. Prestopping with Validation Heuristic (Main Algorithm)

Algorithm 1 describes the overall procedure of Prestopping with the validation heuristic, which is self-explanatory. First, the network is trained on the noisy training data D̃ in the default manner (Lines 3–6). During this first phase, the validation data V is used to evaluate the best point for the early stop, and the network parameter is saved at the time of the lowest validation error (Lines 7–8). Then, during the second phase, Prestopping continues to train the early stopped network for the remaining learning period (Lines 10–12). Here, the maximal safe set S_t at the current moment is retrieved, and each sample x ∈ S_t ∩ B_t is used to update the network parameter. The mini-batch samples not included in S_t are no longer used, in pursuit of robust learning (Lines 14–15).
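Lines 7–8 of Algorithm 1 amount to keeping the parameters with the lowest validation error seen so far. A PyTorch-flavored sketch of just that bookkeeping is given below; it is our own illustration with assumed names (model, val_loader), not the authors' code.

import copy
import torch

@torch.no_grad()
def validation_error(model, val_loader):
    # Fraction of misclassified clean validation samples (Line 7 of Algorithm 1).
    model.eval()
    wrong, total = 0, 0
    for x, y in val_loader:
        wrong += (model(x).argmax(dim=1) != y).sum().item()
        total += y.numel()
    model.train()
    return wrong / max(total, 1)

def maybe_checkpoint(model, val_loader, best_err, best_state):
    # Line 8: if the current validation error is the lowest so far, save theta_{t_stop}.
    err = validation_error(model, val_loader)
    if err < best_err:
        best_err, best_state = err, copy.deepcopy(model.state_dict())
    return best_err, best_state

# After Phase I, reload the early stopped parameters (Line 10):
# model.load_state_dict(best_state)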

Algorithm 2 Prestopping with Noise-Rate Heuristic
INPUT: D̃: data, epochs: total number of epochs, q: history length, τ: noise rate
OUTPUT: θ_t: network parameters
 1: t ← 1; θ_t ← Initialize the network parameter;
 2: θ_{t_stop} ← ∅; /* The parameter of the stopped network */
 3: for i = 1 to epochs do /* Phase I: Learning from a noisy training data set */
 4:   for j = 1 to |D̃|/|B_t| do
 5:     Draw a mini-batch B_t from D̃;
 6:     θ_{t+1} = θ_t − α∇( (1/|B_t|) Σ_{x∈B_t} L(x, ỹ; θ_t) ); /* Update by Eq. (1) */
 7:     train_err ← Get_Training_Error(D̃, θ_t); /* Training error at time t */
 8:     if train_err ≤ τ then /* Save the network when train_err ≤ τ */
 9:       θ_{t_stop} ← θ_t; break;
10:     t ← t + 1;
11: θ_t ← θ_{t_stop}; /* Load the network stopped at t_stop */
12: for i = stop_epoch to epochs do /* Phase II: Learning from a maximal safe set */
13:   for j = 1 to |D̃|/|B_t| do
14:     Draw a mini-batch B_t from D̃;
15:     S_t ← {x | argmax_y P(y|x, t; q) = ỹ}; /* The maximal safe set of Definition 3.2 */
16:     θ_{t+1} = θ_t − α∇( (1/|S_t ∩ B_t|) Σ_{x∈S_t∩B_t} L(x, ỹ; θ_t) ); /* Update by Eq. (4) */
17:     t ← t + 1;
18: return θ_t, S_t;

B.2. Prestopping with Noise-Rate Heuristic

Algorithm 2 describes the overall procedure of Prestopping with the noise-rate heuristic, which is also self-explanatory. Compared with Algorithm 1, only the way of determining the best stop point (Lines 7–9) has changed.
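The only difference from Algorithm 1 is the stop test of Lines 7–9; as a minimal sketch (illustrative names only, not the authors' code), the condition reduces to comparing the training error, measured against the given noisy labels, with the known noise rate τ.

def should_stop_phase1(train_error, noise_rate):
    # Lines 7-9 of Algorithm 2: stop Phase I once the training error on the noisy
    # labels has fallen to tau x 100%. Under the assumption that the (1 - tau)
    # fraction of true-labeled samples is memorized before any false-labeled one,
    # this is the point where MP and MR cross with value 1.
    return train_error <= noise_rate

# Inside the Phase-I loop (hypothetical variables):
# if should_stop_phase1(train_err, tau):
#     best_state = copy.deepcopy(model.state_dict())   # theta_{t_stop}
#     break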

B.2.1. RESULT WITH SYNTHETIC NOISE (FIGURE 7)

To verify the performance of Prestopping and Prestopping+ with the noise-rate heuristic in Section 3.1, we trained a VGG-19 network on two simulated noisy data sets with the same configuration as in Section 4. Figure 7 shows the test error of our two methods using the noise-rate heuristic as well as those of the other five training methods. Again, the test error of either Prestopping or Prestopping+ was the lowest at most noise rates with any noise type. The trend of the noise-rate heuristic here was almost the same as that of the validation heuristic in Section 4.1. Especially when the noise rate was 40%, Prestopping and Prestopping+ significantly improved the test error by 5.1pp–17.0pp in the pair noise (Figure 7(a)) and 0.3pp–17.3pp in the symmetric noise (Figure 7(b)) compared with the other robust methods.

[Figure 7 omitted: best test error vs. noise rate (0%–40%) on CIFAR-10 and CIFAR-100; legend: Default, Active Bias, Co-teaching, Co-teaching+, SELFIE, Prestopping (Noise-Rate), Prestopping+ (Noise-Rate). Panels: (a) Pair Noise, (b) Symmetric Noise.]
Figure 7. Best test errors using VGG-19 on two simulated noisy data sets with varying noise rates.

B.2.2. COMPARISON WITH VALIDATION HEURISTIC (FIGURE 8)

Figure 8 shows the difference in test error caused by the two heuristics. Overall, the performance with the noise-rate heuristic was worse than that with the validation heuristic, even though the worse of the two still outperformed the other training methods, as shown in Figure 7. As the assumption of the noise-rate heuristic does not hold perfectly, the lower performance of this heuristic looks reasonable. However, we expect that its performance can be improved by stopping a little earlier than the estimated point, and we leave this extension as future work.

[Figure 8 omitted: best test error vs. noise rate (0%–40%) on CIFAR-10 and CIFAR-100; legend: Prestopping (Validation), Prestopping+ (Validation), Prestopping (Noise-Rate), Prestopping+ (Noise-Rate). Panels: (a) Pair Noise, (b) Symmetric Noise.]
Figure 8. Difference in test error between two heuristics on two simulated noisy data sets.

C. Collaboration with Sample Refurbishment: Prestopping+

To further improve the performance of Prestopping, we introduce Prestopping+, which employs the concept of selectively refurbishing false-labeled samples in SELFIE (Song et al., 2019). In detail, the final maximal safe set S_{t_end} is retrieved from the first run of Prestopping, and then it is used as the true-labeled set for the next run of SELFIE, with the network parameter re-initialized. Following the update principle of SELFIE, the modified gradient update rule in Eq. (5) is used to train the network. For each sample x in the mini-batch B_t, the given label ỹ of x is selectively refurbished into y^refurb if it can be corrected with high precision. However, the mini-batch samples included in S_{t_end} are omitted from the refurbishment because their labels are already highly credible. Then, the mini-batch samples in the refurbished set R_t are provided to update the network together with those in the final maximal safe set S_{t_end}. Refer to Song et al. (2019)'s work for more details.

θ_{t+1} = θ_t − α∇( (1/|{x | x ∈ R_t ∪ S_{t_end}}|) ( Σ_{x∈R_t∩S^c_{t_end}} L(x, y^refurb; θ_t) + Σ_{x∈S_{t_end}} L(x, ỹ; θ_t) ) )    (5)
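Read as code, Eq. (5) sums a refurbished-label loss over the mini-batch samples of R_t that lie outside S_{t_end} and an ordinary loss over those in S_{t_end}, and averages over all contributing samples; samples in neither set contribute nothing. The PyTorch-style sketch below is our own illustration with assumed mask inputs, not the authors' implementation.

import torch
import torch.nn.functional as F

def prestopping_plus_step(model, optimizer, x, y_noisy, y_refurb, safe_mask, refurb_mask):
    # One update by Eq. (5). safe_mask / refurb_mask are boolean tensors marking
    # membership of each mini-batch sample in S_{t_end} and R_t, respectively.
    logits = model(x)
    per_sample = torch.zeros(len(x), device=logits.device)
    refurb_only = refurb_mask & ~safe_mask            # R_t outside the safe set: use y^refurb
    if refurb_only.any():
        per_sample[refurb_only] = F.cross_entropy(logits[refurb_only], y_refurb[refurb_only],
                                                  reduction="none")
    if safe_mask.any():
        per_sample[safe_mask] = F.cross_entropy(logits[safe_mask], y_noisy[safe_mask],
                                                reduction="none")
    used = refurb_only | safe_mask
    if not used.any():
        return None                                   # nothing to learn from in this mini-batch
    loss = per_sample[used].sum() / used.sum()        # average over all contributing samples
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()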

D. Related Work

Numerous studies have been conducted to address the problem of learning from noisy labels. A typical method is "loss correction," which estimates the label transition matrix and corrects the loss of the samples in the mini-batch. Bootstrap (Reed et al., 2015) updates the network based on its own reconstruction-based objective with the notion of perceptual consistency. F-correction (Patrini et al., 2017) reweights the forward or backward loss of the training samples based on the label transition matrix estimated by a pre-trained normal network. D2L (Ma et al., 2018) employs a simple measure called local intrinsic dimensionality and then uses it to modify the forward loss in order to reduce the effects of false-labeled samples in learning. Active Bias (Chang et al., 2017) heuristically evaluates uncertain samples with high prediction variances and then gives higher weights to their backward losses. Ren et al. (2018) include small clean validation data into the training data and re-weight the backward loss of the mini-batch samples such that the updated gradient minimizes the loss of those validation data. However, this family of methods is known to accumulate severe noise from false correction, especially when the number of classes or the number of false-labeled samples is large (Han et al., 2018; Yu et al., 2019).

To be free from false correction, many recent studies have adopted "sample selection," which trains the network on selected samples. These methods attempt to select true-labeled samples from the noisy training data for use in updating the network. Decouple (Malach & Shalev-Shwartz, 2017) maintains two networks simultaneously and updates the models only using the samples that have different label predictions from the two networks. Wang et al. (2018) proposed an iterative learning framework that learns deep discriminative features from well-classified noisy samples based on the local outlier factor algorithm (Breunig et al., 2000). MentorNet (Jiang et al., 2018) introduces a collaborative learning paradigm in which a pre-trained mentor network guides the training of a student network. Based on the small-loss criterion, the mentor network provides the student network with the samples whose labels are probably correct. Co-teaching (Han et al., 2018) and Co-teaching+ (Yu et al., 2019) also maintain two networks, but each network selects a certain number of small-loss samples and feeds them to its peer network for further training. Compared with Co-teaching, Co-teaching+ further employs the disagreement strategy of Decouple. ITLM (Shen & Sanghavi, 2019) iteratively minimizes the trimmed loss by alternating between selecting a fraction of small-loss samples at the current moment and retraining the network using them. INCV (Chen et al., 2019) randomly divides the noisy training data and then utilizes cross-validation to classify true-labeled samples while removing large-loss samples at each iteration. However, their general philosophy of selecting small-loss samples works well only in some cases, such as symmetric noise. Differently from this family, we exploit the maximal safe set initially derived from the memorized samples at the early stop point.

Most recently, a hybrid of the "loss correction" and "sample selection" approaches was proposed by Song et al. (2019). Their algorithm, called SELFIE, trains the network on selectively refurbished false-labeled samples together with small-loss samples. SELFIE not only minimizes the number of falsely corrected samples but also exploits full exploration of the entire training data. Its component for sample selection can be replaced by our method to further improve the performance.

For completeness of the survey, we mention the work that provides an empirical or theoretical analysis of why early stopping helps learning with label noise. Oymak et al. (2019) and Hendrycks et al. (2019) have argued that early stopping is a suitable strategy because the network eventually begins to memorize all noisy samples if it is trained too long. Li et al. (2019) have theoretically proved that the network memorizes false-labeled samples at a later stage of training, and thus claimed that the early stopped network is fairly robust to label noise. However, they did not mention how to take advantage of the early stopped network for further training. Please note that Prestopping adopts early stopping to derive a seed for the maximal safe set, which is exploited to achieve noise-free training during the remaining learning period. Thus, our novelty lies in the "merger" of early stopping and learning from the maximal safe set.

E. Experimental Configuration

All the algorithms were implemented using TensorFlow 1.8.0 and executed on 16 NVIDIA Titan Volta GPUs. For reproducibility, we provide the source code at https://bit.ly/2l3g9Jx. In support of reliable evaluation, we repeated every task three times and report the average and standard error of the best test errors, which are common measures of robustness to noisy labels (Chang et al., 2017; Jiang et al., 2018; Ren et al., 2018; Song et al., 2020c;a).

Data Sets: To verify the superiority of Prestopping, we performed an image classification task on four benchmark data sets: CIFAR-10 (10 classes)⁵ and CIFAR-100 (100 classes)⁵, subsets of the 80 million categorical images (Krizhevsky et al., 2014); ANIMAL-10N (10 classes)⁶, a real-world noisy data set of human-labeled online images of confusing animals (Song et al., 2019); and FOOD-101N (101 classes)⁷, a real-world noisy data set of crawled food images annotated by their search keywords in the FOOD-101 taxonomy (Bossard et al., 2014; Lee et al., 2018). We did not apply any data augmentation.

Noise Injection: As all labels in the CIFAR data sets are clean, we artificially corrupted the labels in these data sets using typical methods for the evaluation of synthetic noise (Ren et al., 2018; Han et al., 2018; Yu et al., 2019). For k classes, we applied the label transition matrix T: (i) symmetric noise: ∀_{j≠i} T_{ij} = τ/(k−1); and (ii) pair noise: ∃_{j≠i} T_{ij} = τ ∧ ∀_{k'≠i, k'≠j} T_{ik'} = 0, where T_{ij} is the probability of the true label i being flipped to the corrupted label j and τ is the noise rate. For the pair noise, the corrupted label j was set to be the next label of the true label i, following recent work (Yu et al., 2019; Song et al., 2019). To evaluate the robustness on noise rates varying from light to heavy, we tested five noise rates τ ∈ {0.0, 0.1, 0.2, 0.3, 0.4}. In contrast, we did not inject any label noise into ANIMAL-10N and FOOD-101N⁸ because they already contain real label noise, estimated at 8.0% and 18.4%, respectively (Lee et al., 2018; Song et al., 2019).

Clean Validation Data: Recall that a clean validation set is needed for the validation heuristic in Section 3.1. For ANIMAL-10N and FOOD-101N, we exploited their own clean validation data; 5,000 and 3,824 images, respectively, were included in their validation sets. However, as no validation data exists for the CIFAR data sets, we constructed a small clean validation set by randomly selecting 1,000 images from the original training data of 50,000 images. Please note that the noise injection process was applied only to the remaining 49,000 training images.

Networks and Hyperparameters: For the classification task, we trained DenseNet (L=40, k=12) and VGG-19 with a momentum optimizer. Specifically, we used a momentum of 0.9, a batch size of 128, a dropout of 0.1, and batch normalization. Prestopping has only one unique hyperparameter, the history length q, which was set to 10, the best value found by grid search (see Appendix F.1 for details). The hyperparameters used in the compared algorithms were favorably set to the best values presented in the original papers. As for the training schedule, we trained the network for 120 epochs and used an initial learning rate of 0.1, which was divided by 5 at 50% and 75% of the total number of epochs.

⁵ https://www.cs.toronto.edu/~kriz/cifar.html
⁶ https://dm.kaist.ac.kr/datasets/animal-10n
⁷ https://kuanghuei.github.io/Food-101N
⁸ In FOOD-101N, we used a subset of the entire training data marked with whether the label is correct or not.
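The transition matrices described above are easy to materialize; the NumPy sketch below builds T for both noise types and resamples each clean label from its row. It is our own illustration of the stated scheme (function names and the seeding convention are assumptions), not the authors' injection script.

import numpy as np

def transition_matrix(num_classes, tau, noise_type):
    # Row i holds the probabilities of observing each label j given the true label i.
    T = np.zeros((num_classes, num_classes))
    np.fill_diagonal(T, 1.0 - tau)
    if noise_type == "symmetric":
        T += (tau / (num_classes - 1)) * (1.0 - np.eye(num_classes))
    elif noise_type == "pair":
        for i in range(num_classes):
            T[i, (i + 1) % num_classes] = tau   # flip to the "next" label
    else:
        raise ValueError("noise_type must be 'symmetric' or 'pair'")
    return T

def inject_noise(clean_labels, T, seed=None):
    # Resample each label from the row of T indexed by its clean label.
    rng = np.random.default_rng(seed)
    k = T.shape[0]
    return np.array([rng.choice(k, p=T[y]) for y in clean_labels])

# Example: 40% pair noise on CIFAR-100-style labels (k = 100).
# noisy_labels = inject_noise(clean_labels, transition_matrix(100, 0.4, "pair"), seed=0)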

F. Supplementary Evaluation

F.1. Hyperparameter Selection (Figure 9)

Prestopping requires one additional hyperparameter, the history length q in Definition 2.1. For hyperparameter tuning, we trained DenseNet (L=40, k=12) on CIFAR-10 and CIFAR-100 with a noise rate of 40%, and the history length q was chosen from the grid {1, 5, 10, 15, 20}. Figure 9 shows the test error of Prestopping obtained by the grid search on the two noisy CIFAR data sets. Regardless of the noise type, the lowest test error was typically achieved when q was 10 on both data sets. Therefore, the history length q was set to 10 in all experiments.

[Figure 9 omitted: best test error (%) vs. history length q ∈ {1, 5, 10, 15, 20} for pair and symmetric noise. Panels: (a) CIFAR-10, (b) CIFAR-100.]
Figure 9. Grid search on CIFAR-10 and CIFAR-100 with two types of noises of 40%.

F.2. Anatomy of Co-teaching+ (Figure 10)

Although Co-teaching+ is the latest method, its performance was worse than expected, as shown in Section 4.1. Thus, we looked into Co-teaching+ in more detail. The poor performance of Co-teaching+ was attributed to the fast consensus of the label predictions for true-labeled samples, especially when training a complex network. In other words, because the two complex networks in Co-teaching+ start making the same predictions for true-labeled samples too early, these samples are excluded from training too early. As shown in Figures 10(a) and 10(b), the disagreement ratio with regard to the true-labeled samples dropped faster with a complex network than with a simple network, considering that the ratio drastically decreased to 49.8% during the first 5 epochs. Accordingly, it is evident that Co-teaching+ with a complex network causes a narrow exploration of the true-labeled samples, and the selection accuracy of Co-teaching+ naturally degraded from 60.4% to 44.8% for that reason, as shown in Figure 10(c). Therefore, we conclude that Co-teaching+ may not suit a complex network.

[Figure 10 omitted: disagreement ratio for true-labeled samples vs. epochs for a simple and a complex network (legend: Disagreement, Agreement), and the selection accuracy of Co-teaching vs. Co-teaching+ over epochs. Panels: (a) Simple Network, (b) Complex Network, (c) Selection Accuracy.]
Figure 10. Anatomy of Co-teaching+ on CIFAR-100 with 40% symmetric noise: (a) and (b) show the change in the disagreement ratio for all true-labeled samples when using Co-teaching+ to train two networks with different complexity, where "simple network" is a network with seven layers used by Yu et al. (2019), and "complex network" is a DenseNet (L=40, k=12) used for our evaluation; (c) shows the accuracy of selecting true-labeled samples on the DenseNet.

G. Case Study: Noisy Labels in Original CIFAR-100

One interesting observation is a noticeable improvement of Prestopping+ even when the noise rate was 0%. It turned out that Prestopping+ sometimes refurbished the labels of false-labeled samples that were originally contained in the CIFAR data sets. Figure 11 shows a few successful refurbishment cases. For example, an image falsely annotated as a "Boy" was refurbished as a "Baby" (Figure 11(a)), and an image falsely annotated as a "Mouse" was refurbished as a "Hamster" (Figure 11(c)). Thus, this sophisticated label correction of Prestopping+ helps overcome the residual label noise in well-known benchmark data sets, which are misconceived to be clean, and ultimately further improves the generalization performance of a network.

(a) Boy → Baby (b) Boy → Baby (c) Mouse → Hamster (d) Shrew → Mouse (e) Lion → Tiger (f) Worm → Snake

(g) Spider → Crab (h) Flat Fish → Trout (i) Train → Street Car (j) Train → Street Car (k) Street Car → Train (l) Pear → Apple

(m) Sweet Pepper → Apple (n) Pear → Sweet Pepper (o) Tulip → Poppy (p) Forest → Maple Tree (q) House → Road (r) Plain → Sea

(s) Sea → Plain (t) Plain → Cloud (u) Cloud → Can (v) Can → Bowl (w) Can → Bottle (x) Skyscraper → Rocket

Figure 11. Refurbishing of false-labeled samples originally contained in CIFAR-100. The subcaption represents “original label” → “refurbished label” recognized by Prestopping+.