Deep Learning is Robust to Massive Label Noise

David Rolnick*1, Andreas Veit*2, Serge Belongie2, Nir Shavit3

*Equal contribution. 1Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA. 2Department of Computer Science & Cornell Tech, Cornell University, New York, NY, USA. 3Department of Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA. Correspondence to: David Rolnick, Andreas Veit.

Abstract

Deep neural networks trained on large supervised datasets have led to impressive results in image classification and other tasks. However, well-annotated datasets can be time-consuming and expensive to collect, lending increased interest to larger but noisy datasets that are more easily obtained. In this paper, we show that deep neural networks are capable of generalizing from training data for which true labels are massively outnumbered by incorrect labels. We demonstrate remarkably high test performance after training on corrupted data from MNIST, CIFAR, and ImageNet. For example, on MNIST we obtain test accuracy above 90 percent even after each clean training example has been diluted with 100 randomly-labeled examples. Such behavior holds across multiple patterns of label noise, even when erroneous labels are biased towards confusing classes. We show that training in this regime requires a significant but manageable increase in dataset size that is related to the factor by which correct labels have been diluted. Finally, we provide an analysis of our results that shows how increasing noise decreases the effective batch size.

1. Introduction

Deep neural networks are typically trained using supervised learning on large, carefully annotated datasets. However, the need for such datasets restricts the space of problems that can be addressed. This has led to a proliferation of deep learning results on the same tasks using the same well-known datasets. Moreover, carefully annotated data is difficult to obtain, especially for classification tasks with large numbers of classes (requiring extensive annotation) or with fine-grained classes (requiring skilled annotation). Thus, annotation can be expensive and, for tasks requiring expert knowledge, may simply be unattainable at scale.

To address this limitation, other training paradigms have been investigated to alleviate the need for expensive annotations, such as unsupervised learning (Le, 2013), self-supervised learning (Pinto et al., 2016; Wang & Gupta, 2015) and learning from noisy annotations (Joulin et al., 2016; Natarajan et al., 2013; Veit et al., 2017). Very large datasets (e.g., Krasin et al. (2016); Thomee et al. (2016)) can often be obtained, for example from web sources, with partial or unreliable annotation. This can allow neural networks to be trained on a much wider variety of tasks or classes and with less manual effort. The good performance obtained from these large, noisy datasets indicates that deep learning approaches can tolerate modest amounts of noise in the training set.

In this work, we study the behavior of deep neural networks under extremely low label reliability, only slightly above chance. The insights from our study can help guide future settings in which arbitrarily large amounts of data are easily obtainable, but in which labels come without any guarantee of validity and may merely be biased towards the correct distribution.

The key takeaways from this paper may be summarized as follows:

• Deep neural networks are able to generalize after training on massively noisy data, instead of merely memorizing noise. We demonstrate that standard deep neural networks still perform well even on training sets in which label accuracy is as low as 1 percent above chance. On MNIST, for example, performance still exceeds 90 percent even with this level of label noise (see Figure 1). This behavior holds, to varying extents, across datasets as well as patterns of label noise, including when noisy labels are biased towards confused classes.

• A sufficiently large training set can accommodate a wide range of noise levels. We find that the minimum dataset size required for effective training increases with the noise level (see Figure 9). Increasing the dataset size further, however, does not appreciably increase accuracy (see Figure 8).

• High levels of label noise decrease the effective batch size, as noisy labels roughly cancel out and only a small learning signal remains. As such, dataset noise can be partly compensated for by larger batch sizes and by scaling the learning rate with the effective batch size.

Figure 1. Performance on MNIST as different amounts of noisy labels are added to a fixed training set of clean labels. We compare a perceptron, MLPs with 1, 2, and 4 hidden layers, and a 4-layer ConvNet. Even with 100 noisy labels for every clean label the ConvNet still attains a performance of 91%.

2. Related Work

Learning from noisy data. Several studies have investigated the impact of noisy datasets on machine classifiers. Approaches to learning from noisy data can generally be categorized into two groups. The first group comprises approaches that aim to learn directly from noisy labels and focus on noise-robust algorithms, e.g., Beigman & Klebanov (2009); Guan et al. (2017); Joulin et al. (2016); Krause et al. (2016); Manwani & Sastry (2013); Misra et al. (2016); Van Horn et al. (2015); Reed et al. (2014). The second group comprises mostly label-cleansing methods that aim to remove or correct mislabeled examples, e.g., Brodley & Friedl (1999). Methods in this group frequently face the challenge of disambiguating between mislabeled and hard training examples. To address this challenge, they often use semi-supervised approaches that combine noisy data with a small set of clean labels (Zhu, 2005). Some approaches model the label noise as conditionally independent from the input image (Natarajan et al., 2013; Sukhbaatar et al., 2014) and some propose image-conditional noise models (Veit et al., 2017; Xiao et al., 2015). Our work differs from these approaches in that we do not aim to clean the training dataset or propose new noise-robust training algorithms. Instead, we study the behavior of standard neural network training procedures in settings with massive label noise. We show that even without explicit cleaning or noise-robust algorithms, neural networks can learn from data that has been diluted by an arbitrary amount of label noise.

Analyzing the robustness of neural networks. Several investigative studies aim to improve our understanding of convolutional neural networks. One particular stream of research in this space seeks to investigate neural networks by analyzing their robustness. For example, Veit et al. (2016) show that network architectures with residual connections have a high redundancy in terms of parameters and are robust to the deletion of multiple complete layers during test time. Further, Szegedy et al. (2014) investigate the robustness of neural networks to adversarial examples. They show that even for fully trained networks, small changes in the input can lead to large changes in the output and thus misclassification. In contrast, we focus on non-adversarial noise during training time. Within this stream of research, closest to our work are studies that focus on the impact of noisy training datasets on classification performance (e.g., Sukhbaatar et al. (2014); Van Horn et al. (2015); Zhang et al. (2017)). In these studies, an increase in noise is assumed to decrease not only the proportion of correct examples, but also their absolute number. In contrast to these studies, we separate the two effects and show in §4 that a decrease in the number of correct examples is more destructive to learning than an increase in the number of noisy labels.

3. Learning with massive label noise

In this work, we are concerned with scenarios of abundant data of very poor label quality, i.e., the regime in which falsely labeled training examples vastly outnumber correctly labeled examples. In particular, our experiments involve observing the performance of deep neural networks on multi-class classification tasks as label noise is increased.

To formalize the problem, we denote the number of original training examples by n. To model the amount of noise, we dilute the dataset by adding α noisy examples to the training set for each original training example.
Thus, the total number of noisy labels in the training set is αn. Note that by varying the noise level α, we do not change the available number of original examples. Thus, even in the presence of high noise, there is still appreciable data to learn from, if we are able to pick it out. This is in contrast to previous work (e.g., Sukhbaatar et al. (2014); Van Horn et al. (2015); Zhang et al. (2017)), in which an increase in noise also implies a decrease in the absolute number of correct examples.
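To make the dilution procedure concrete, the following sketch (our own illustration, not code released with the paper; array names are hypothetical) builds a noisy training set from n clean examples by pairing each clean example with α copies carrying uniformly random labels:

```python
import numpy as np

def dilute_with_uniform_noise(images, labels, alpha, num_classes, seed=0):
    """Join each clean example with `alpha` copies that carry labels
    drawn uniformly at random, so that a fraction 1/(1 + alpha) of the
    resulting labels is guaranteed correct."""
    rng = np.random.default_rng(seed)
    n = len(labels)
    # Repeat every image alpha additional times to model abundant noisy data.
    noisy_images = np.repeat(images, alpha, axis=0)
    noisy_labels = rng.integers(0, num_classes, size=alpha * n)
    all_images = np.concatenate([images, noisy_images], axis=0)
    all_labels = np.concatenate([labels, noisy_labels], axis=0)
    perm = rng.permutation(len(all_labels))  # shuffle clean and noisy together
    return all_images[perm], all_labels[perm]
```

Since the noisy labels are uniform over all m classes, roughly a 1/m fraction of them is correct by accident; for MNIST at α = 100, the fraction of correct labels is 1/101 + 100/1010 = 11/101 ≈ 0.109, about 1 percentage point above the chance level of 0.1.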



Figure 2. Performance on CIFAR-10 as different amounts of noisy labels are added to a fixed training set of clean labels. We tested ConvNets with 4 and 6 layers, and a ResNet with 101 layers. Even with 10 noisy labels for every clean label the ResNet still attains a performance of 85%.

Figure 3. Performance on ImageNet as different amounts of noisy labels are added to a fixed training set of clean labels. Even with 5 noisy labels for every clean label, the 18-layer ResNet still attains a performance of 70%.

In the following experiments, we investigate three different types of label noise: uniform label-swapping, structured label-swapping, and out-of-vocabulary examples. In all cases, the noisy label is allowed to depend on the correct class but not on the image itself.

A key assumption in this paper is that unreliable labels are better modeled by an unknown stochastic process than by the output of an adversary. This is a natural assumption for data that is pulled from the environment, in which antagonism is not to be expected in the noisy annotation process. Deep neural networks have been shown to be exceedingly brittle to adversarial noise patterns (Szegedy et al., 2014). In this work, we demonstrate that even massive amounts of non-adversarial noise present far less of an impediment to learning.

3.1. Experiment 1: Training with uniform label noise

As a first experiment, we show that common training procedures for neural networks are resilient even to settings where correct labels are outnumbered by labels sampled uniformly at random at a ratio of 100 to 1. For this experiment we focus on the task of image classification and work with three commonly used datasets: MNIST (LeCun et al., 1998), CIFAR-10 (Krizhevsky & Hinton, 2009) and ImageNet (Deng et al., 2009).

In Figures 1 and 2 we show the classification performance with varying levels of label noise. For MNIST, we vary the ratio α of randomly labeled examples to cleanly labeled examples from 0 (no noise) to 100 (at which point only about 11 out of 101 labels are correct, as compared with 10.1 for pure chance). For the more challenging dataset CIFAR-10, we vary α from 0 to 10. For the most challenging dataset, ImageNet, we let α range from 0 to 5. We compare various architectures of neural networks: multilayer perceptrons with different numbers of hidden layers, convolutional networks (ConvNets) with different numbers of convolutional layers, and residual networks (ResNets) with different numbers of layers (He et al., 2016). We evaluate performance after training on a test dataset that is free from noisy labels. Full details of our experimental setup are provided in §3.4.

Our results show that, remarkably, it is possible to attain over 90 percent accuracy on MNIST even when there are 100 randomly labeled images for every cleanly labeled example, over 85 percent accuracy on CIFAR-10 with 10 random labels for every clean label, and over 70 percent top-5 accuracy on ImageNet with 5 random labels for every clean label. Thus, in this high-noise regime, deep networks are able not merely to perform above chance, but to attain accuracies that would be respectable even without noise.

Further, we observe from Figures 1 and 2 that larger neural network architectures tend also to be more robust to label noise. On MNIST, the performance of a perceptron decays rapidly with increasing noise (though it still attains 40 percent accuracy, well above chance, at α = 100). The performance of a multilayer perceptron drops off more slowly, and the ConvNet is even more robust. Likewise, for CIFAR-10, the accuracy of the residual network drops more slowly than that of the smaller ConvNets. This observation provides further support for the effectiveness of ConvNets and ResNets in particular for applications where noise tolerance may be important.





Figure 4. Illustration of uniform and structured noise models. In the case of structured noise, the order of false labels is important; we tested decreasing order of confusion, increasing order of confusion, and random order. The parameter δ parameterizes the degree of structure in the noise; it defines how much more likely the second most likely class is over chance.

Figure 5. Performance on MNIST with fixed α = 20 noisy labels per clean label. Noise is drawn from three types of structured distribution: (1) “confusing order” (highest probability for the most confusing label), (2) “reverse confusing order”, and (3) random order. We interpolate between uniform noise, δ = 0, and noise so highly skewed that the most common false label is as likely as the correct label, δ = 1. Except for δ ≈ 1, performance is similar to uniform noise.

3.2. Experiment 2: Training with structured label noise

We have seen that neural networks are extremely robust to uniform label noise. However, label noise in datasets gathered from a natural environment is unlikely to follow a perfectly uniform distribution. In this experiment, we investigate the effects of various forms of structured noise on the performance of neural networks. Figure 4 illustrates the procedure used to model noise structure.

In the uniform noise setting, as illustrated on the left side of Figure 4, correct labels are more likely than any individual false label; overall, however, false labels vastly outnumber correct labels. We denote the likelihood over chance for a label to be correct as ε. Note that ε = 1/(1 + α), where α is the ratio of noisy labels to certainly correct labels. To induce structure in the noise, we bias noisy labels towards certain classes. We introduce the parameter δ to parameterize the degree of structure in the noise: it defines how much more likely the second most likely class is over chance. With δ = 0 the noise is uniform, whereas for δ = 1 the second most likely class is as likely as the correct class. The likelihood for the remaining classes is scaled linearly, as illustrated in Figure 4 on the right. We investigate three different setups for structured noise: labels biased towards easily confused classes, towards rarely confused classes, and towards random classes.

Figure 5 shows the results on MNIST for the three different types of structured noise, as δ varies from 0 to 1. In this experiment, we train 4-layer ConvNets on a dataset that is diluted with 20 noisy labels for each clean label. We vary the order of false labels so that, besides the correct class, labels are assigned most frequently to (1) those most often confused with the correct class, (2) those least often confused with it, and (3) classes in a random order. We determine commonly confused labels by training the network repeatedly on a small subset of MNIST and observing the errors it makes on a test set.

The results show that deep neural nets are robust even to structured noise, as long as the correct label remains the most likely by at least a small margin. Generally, we do not observe large differences between the different models of noise structure, only that bias towards random classes seems to hurt performance a little more than bias towards confused classes. This result might help explain why we often observe quite good results from real-world noisy datasets, where label noise is more likely to be biased towards related and confusing classes.
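To make the structured noise model concrete, the following sketch (our construction; the paper does not spell out the exact normalization, so the linear scaling below is one plausible reading of Figure 4) produces the label distribution for a single true class. It reduces exactly to uniform dilution at δ = 0 and makes the most-favored false class as likely as the correct label at δ = 1:

```python
import numpy as np

def structured_label_distribution(m, alpha, delta, false_class_order):
    """Label distribution for one true class under structured noise.

    `false_class_order` lists the m - 1 false classes from most to least
    favored (e.g. by descending confusion with the true class). Each
    label's probability is a common baseline plus an excess: eps for the
    correct label, and delta * eps falling off linearly to 0 across the
    ordered false classes.
    """
    eps = 1.0 / (1.0 + alpha)  # likelihood over chance of a correct label
    # Baseline chosen so that all m probabilities sum to 1.
    base = (1.0 - eps * (1.0 + delta * (m - 1) / 2.0)) / m
    assert base >= 0, "delta too large for this alpha and m"
    true_class = next(c for c in range(m) if c not in false_class_order)
    probs = np.full(m, base)
    probs[true_class] += eps
    for rank, c in enumerate(false_class_order):  # rank 0 = most favored
        probs[c] += delta * eps * (1.0 - rank / (m - 2))
    assert np.isclose(probs.sum(), 1.0)
    return probs
```

Under this reading, δ = 0 reproduces the uniform noise of Experiment 1 (correct-label probability ε + (1 − ε)/m), while at δ = 1 the most-favored false class ties the correct label, which is the regime in Figure 5 where performance finally degrades.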
3.3. Experiment 3: Source of noisy labels

In the preceding experiments, we diluted the training sets with noisy examples drawn from the same dataset; i.e., falsely labeled examples were images from within other categories of the dataset. In natural scenarios, however, noisy examples likely also include categories not included in the dataset that have erroneously been assigned labels within the dataset.






Figure 6. Performance on CIFAR-10 for varying amounts of noisy labels from different sources. Noisy training examples are drawn from (1) CIFAR-10 itself, but mislabeled uniformly at random, (2) CIFAR-100, with uniformly random labels, and (3) white noise with mean and variance chosen to match those of CIFAR-10. Noise drawn from CIFAR-100 resulted in only half the drop in performance observed with noise from CIFAR-10 itself, while white noise examples did not appreciably affect performance.

Figure 7. Comparison of the effect of reusing images vs. using novel images as noisy examples. Essentially no difference is observed between the two types of noisy examples, supporting the use of repeated examples in our experiments.

Thus, we now consider two alternative sources for noisy training examples. First, we dilute the training set with examples that are drawn from a similar but different dataset. In particular, we use CIFAR-10 as our training dataset and dilute it with examples from CIFAR-100, assigning each image a category from CIFAR-10 at random. Second, we also consider a dilution of the training set with “examples” that are simply white noise; in this case, we match the mean and variance of pixels within CIFAR-10 and again assign labels uniformly at random.

Figure 6 shows the results obtained by a six-layer ConvNet on the different noise sources for varying levels of noise. We observe that both alternative sources of noise lead to better performance than noise originating from the same dataset. For noisy examples drawn from CIFAR-100, performance drops only about half as much as when the noise originates from CIFAR-10 itself, a trend that is consistent across noise levels. For white noise, performance does not drop regardless of noise level; this is in line with prior work showing that neural networks are able to fit random input (Zhang et al., 2017). This indicates that the scenarios considered in Experiments 1 and 2 represent, in some sense, a worst case.

In natural scenarios, we may expect massively noisy datasets to fall somewhere in between the cases exemplified by CIFAR-10 and CIFAR-100. That is, some examples will be relevant but mislabeled, while it is likely that many examples will not be from any classes under consideration and will therefore influence training less negatively. In fact, it is possible that such examples might increase accuracy, if the erroneous labels reflect underlying similarity between the examples in question.
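A minimal sketch of the white-noise source (again our own illustration; function names are hypothetical): draw Gaussian “images” whose mean and standard deviation match the clean training pixels, and assign labels uniformly at random.

```python
import numpy as np

def white_noise_examples(clean_images, num_examples, num_classes, seed=0):
    """Gaussian noise images matching the pixel statistics of
    `clean_images`, each paired with a uniformly random label."""
    rng = np.random.default_rng(seed)
    mu, sigma = clean_images.mean(), clean_images.std()
    shape = (num_examples,) + clean_images.shape[1:]
    noise_images = rng.normal(mu, sigma, size=shape)
    random_labels = rng.integers(0, num_classes, size=num_examples)
    return noise_images, random_labels
```

Depending on the input pipeline, one may additionally wish to clip the generated values to the valid pixel range.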
3.4. Experimental setup

All models are trained with AdaDelta (Zeiler, 2012) as the optimizer and a batch size of 128. For each level of label noise, we train separate models with different learning rates in {0.01, 0.05, 0.1, 0.5} and pick the learning rate that results in the best performance. Generally, we observe that the higher the label noise, the lower the optimal learning rate. We investigate this trend in detail in §5.

In Experiments 1 and 2, noisy labels are drawn from the same dataset as the labels guaranteed to be correct. This involves drawing the same example many times from the dataset, giving it the correct label once, and in every other instance picking a random label according to the noise distribution in question. We show in Figure 7 that performance would have been comparable had we been able to draw noisy labels from an extended dataset, instead of repeating images. Specifically, we train a convolutional network on a subset of MNIST, with 2,500 certainly correct labels and with noisy labels drawn either with repetition from this set of 2,500 or without repetition from the remaining examples in the MNIST dataset. The results are essentially identical between repeated and novel examples, supporting our setup in the preceding experiments.
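In outline, the training procedure of this section corresponds to something like the following PyTorch sketch (a reconstruction from the description above, not the authors' released code; `build_model` and the data loaders are placeholders the reader must supply):

```python
import torch

@torch.no_grad()
def accuracy(model, loader):
    """Fraction of correctly classified examples (on a clean test set)."""
    correct = total = 0
    for x, y in loader:
        correct += (model(x).argmax(dim=1) == y).sum().item()
        total += y.numel()
    return correct / total

def train_with_lr_search(build_model, noisy_train_loader, test_loader, epochs=50):
    """Train one model per candidate learning rate and keep the best,
    as in Section 3.4; batch size 128 is assumed set in the loader."""
    loss_fn = torch.nn.CrossEntropyLoss()
    best_acc, best_model = 0.0, None
    for lr in (0.01, 0.05, 0.1, 0.5):
        model = build_model()
        opt = torch.optim.Adadelta(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in noisy_train_loader:
                opt.zero_grad()
                loss_fn(model(x), y).backward()
                opt.step()
        acc = accuracy(model, test_loader)
        if acc > best_acc:
            best_acc, best_model = acc, model
    return best_model, best_acc
```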






Figure 8. Performance on MNIST at various noise levels, as a function of the number of clean labels. There seems to be a critical amount of clean training data required to successfully train the networks; this threshold increases as the noise level rises. For example, at α = 10, 2,000 clean labels are needed to attain 90% performance, while at α = 50, 10,000 clean labels are needed.

Figure 9. Relationship between the amount of noise in the dataset and the critical number of clean training examples needed to achieve high test accuracy. Different curves reflect different target accuracies. We observe that the required amount of clean data increases slightly more than linearly in the ratio of noisy to clean data.

4. The importance of larger datasets

Underlying the ability of deep networks to learn from noisy data is the size of the data in question. It is well established, see e.g., Deng et al. (2009); Sun et al. (2017), that traditional deep learning relies upon large datasets. We will now see how this is particularly true of noisy datasets.

In Figure 8, we compare the performance of a ConvNet on MNIST as the size of the training set varies. In particular, we show how the performance of the ConvNet varies with the number of cleanly labeled training examples, comparing the same ConvNet trained on MNIST diluted with different amounts of noisy labels sampled uniformly. For example, for the blue curve of α = 10 and 1,000 clean labels, the network is trained on 11,000 examples: 1,000 cleanly labeled examples and 10,000 with random labels. In Figure 10, we show that a similar pattern occurs when a ResNet with 18 layers is trained on ImageNet with noisy labels.

Generally, we observe that, independent of the noise level, the networks benefit from more data and that, given sufficient data, the networks reach similar results. Further, the results indicate that there seems to be a critical amount of clean training data required to successfully train the networks. This critical amount of clean data depends on the noise level; in particular, it increases as the noise level rises. Note that as the amount of clean data increases, the size of the overall noisy training set increases still faster. In Figure 9, we re-plot the data from Figure 8 to show how the amount of clean data required to attain a threshold accuracy varies as a function of noise. It appears that the requisite amount of clean data rises slightly faster than linearly in the ratio of noisy to clean examples. Since performance rapidly levels off past the critical threshold, the main requirement for the clean training set is to be of sufficient size.
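In outline, the experiment behind Figures 8 and 9 can be reproduced by combining the pieces sketched earlier (our reconstruction, not the authors' experiment code; `dilute_with_uniform_noise`, `train_with_lr_search`, and the remaining names are the hypothetical helpers and arrays introduced above):

```python
# Sweep clean-set size and noise level to locate the critical threshold,
# in the style of Figure 8.
results = {}
for alpha in (0, 10, 20, 50):
    for n_clean in (500, 1000, 2000, 5000, 10000, 20000):
        xs, ys = mnist_images[:n_clean], mnist_labels[:n_clean]  # assumed arrays
        noisy_xs, noisy_ys = dilute_with_uniform_noise(xs, ys, alpha, num_classes=10)
        loader = make_loader(noisy_xs, noisy_ys, batch_size=128)  # hypothetical helper
        _, acc = train_with_lr_search(build_convnet, loader, test_loader)
        results[(alpha, n_clean)] = acc
```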




Figure 10. Performance on ImageNet at various noise levels, as a function of the number of clean labels, when training a ResNet with 18 layers. As in Figure 8, we observe that noise increases the critical number of clean training examples needed to achieve high accuracy. Once past the critical threshold for dataset size, accuracy plateaus and additional training examples are not of significant marginal utility.

Figure 11. Performance on MNIST using a 2-layer ConvNet for varying batch size as a function of noise level. Higher batch size gives better performance, reflecting our analysis that increasing amounts of noise reduce the effective batch size. We approximate the limit of infinite batch size by training without noisy labels, using instead the noisy loss function Hα (see equation (3)).

5. Analysis

The success of training with substantial label noise comes as a surprise. Stochastic gradient descent and its variations consist of walks through the space of networks in locally optimal directions. Adding noise to this process means that in many cases the step taken is in an erroneous direction. It is well known (see e.g. Shalev-Shwartz et al. (2017); Jin et al. (2017); Kawaguchi (2016); Ge et al. (2017)) that the space of networks has abundant hard-to-escape saddle points, flat regions, and local optima; therefore, it might be supposed that too many erroneous steps could lead to a part of the space from which further optimization becomes impossible. Our result that networks generalize even when trained with almost entirely incorrect labels suggests that trajectories in network space are more flexible than might be expected.

In this section, we analyze one aspect of the standard learning procedure that reduces the stochasticity of the gradient updates: averaging gradients over a batch. In the preceding sections, our results were obtained by training neural networks with fixed batch size and running a parameter search to pick the optimal learning rate from five possible values. We now look in more detail into how different batch sizes and learning rates affect learning on noisy datasets.

In Figure 11, we compare the performance of a simple 2-layer ConvNet on MNIST with increasing noise, as batch size varies from 32 to 256. (Note that all other experiments in this paper are performed with a fixed batch size of 128.) We observe that increasing the batch size provides greater robustness to noisy labels. One possible explanation for this behavior is that, within a batch, gradient updates from randomly sampled noisy labels roughly cancel out, while gradients from the correct examples, which are marginally more frequent, sum together and contribute to learning. By this logic, large batch sizes are more robust to noise, since the mean gradient over a larger batch is closer to the gradient for correct labels.

To investigate this explanation further, we also consider the theoretical case of infinite batch size, in which gradients are averaged over the entire space of possible inputs at each training step. While this is often impossible to perform in practice, we can simulate such behavior by an auxiliary loss function.

In classification tasks, we are given an input x and aim to predict the class f(x) ∈ {1, 2, …, m}. The value f(x) is encoded within a neural network by the 1-hot vector y(x) such that

$$y_k(x) = \begin{cases} 1 & \text{if } k = f(x) \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

for 1 ≤ k ≤ m. Then, the standard cross-entropy loss over a batch X is given by

$$H(X) = -\left\langle \log \hat{y}_{f(x)} \right\rangle_X, \qquad (2)$$

where ŷ is the predicted vector and ⟨·⟩_X denotes the expected value over the batch X. We assume ŷ is normalized (e.g. by the softmax function) so that the entries sum to 1.

For a training set with noisy labels, we may consider the label f(x) given in the training set to be merely an approximation to the true label f0(x). Consider the case of n training examples, and αn noisy labels that are sampled uniformly at random from the set {1, 2, …, m}. Then, f(x) = f0(x) with probability 1/(1 + α), and otherwise f(x) takes each of the values 1, 2, …, m with probability α/(m(1 + α)). As batch size increases, the expected value over the batch X is approximated more closely by these probabilities. In the limit of infinite batch size, equation (2) takes the form of a noisy loss function Hα:

$$H_\alpha(X) := -\frac{1}{1+\alpha}\left\langle \log \hat{y}_{f_0(x)} \right\rangle_X - \frac{\alpha}{m(1+\alpha)} \sum_{k=1}^{m} \left\langle \log \hat{y}_k \right\rangle_X \;\propto\; -\left\langle \log \hat{y}_{f_0(x)} \right\rangle_X - \alpha \left\langle \log \left( \prod_{k=1}^{m} \hat{y}_k \right)^{1/m} \right\rangle_X \qquad (3)$$

We can therefore compare training using the cross-entropy loss with αn noisy labels to training using the noisy loss function Hα without noisy labels. The term on the right-hand side of (3) represents the noise contribution and is clearly minimized when the ŷk are all equal. As α increases, this contribution is weighted more heavily against −⟨log ŷ_{f0(x)}⟩_X, which is minimized at ŷ(x) = y(x).
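For concreteness, Hα can be written in a few lines (our sketch in PyTorch, not the authors' code; `log_probs` is assumed to hold log-softmax outputs):

```python
import torch

def noisy_loss_h_alpha(log_probs, true_targets, alpha):
    """H_alpha from equation (3): cross-entropy on the true labels plus a
    uniform term over all classes, simulating alpha uniformly noisy
    labels per clean label in the infinite-batch limit.

    log_probs:    (batch, m) log-softmax outputs
    true_targets: (batch,)   indices of the true classes f0(x)
    """
    clean_term = -log_probs.gather(1, true_targets.unsqueeze(1)).mean()
    # Mean over batch and classes equals (1/m) * sum_k <log y_hat_k>_X.
    noise_term = -log_probs.mean()
    return (clean_term + alpha * noise_term) / (1.0 + alpha)
```

Training on clean data with this loss is how the Hα curve of Figure 11 approximates the infinite-batch limit.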

We show in Figure 11 the results of training our 2-layer ConvNet on MNIST with the noisy loss function Hα, simulating αn noisy labels with infinite batch size. We observe that the network's accuracy does not decrease as α increases. This can be explained by the observation that, for large batch sizes, increasing α merely decreases the magnitude of the true gradient, rather than altering its direction.

Our observations indicate that increasing noise in the training set reduces the effective batch size, as noisy signals roughly cancel out and only a small learning signal remains. Thus, increasing the batch size is a simple practical means to mitigate the effect of noisy training labels.
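The cancellation argument can be checked directly at the output layer, where the softmax cross-entropy gradient with respect to the logits for one example is ŷ − y. The small numpy experiment below (our illustration, not from the paper) shows that averaging this gradient over uniformly noisy labels leaves the label-dependent signal scaled down by the clean fraction 1/(1 + α), plus a label-independent pull towards the uniform distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
m, alpha, batch = 10, 20, 100_000
eps = 1 / (1 + alpha)                      # fraction of clean labels
y_hat = rng.dirichlet(np.ones(m))          # a fixed predicted distribution
true_class = 3

# Sample a huge batch of labels: clean with probability eps, else uniform.
labels = np.where(rng.random(batch) < eps,
                  true_class,
                  rng.integers(0, m, size=batch))
# Per-example logit gradient of softmax cross-entropy: y_hat - onehot(label).
mean_grad = (y_hat[None, :] - np.eye(m)[labels]).mean(axis=0)

clean_grad = y_hat - np.eye(m)[true_class]
expected = eps * clean_grad + (1 - eps) * (y_hat - np.full(m, 1 / m))
print(np.allclose(mean_grad, expected, atol=1e-2))  # True, up to sampling error
```

The clean signal thus survives with weight 1/(1 + α), consistent with the claim that noise shrinks the effective batch size rather than redirecting the gradient.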

It has become common practice in training deep neural networks to scale the learning rate with the batch size. In particular, it has been shown that the smaller the batch size, the lower the optimal learning rate (Krizhevsky, 2014). As noisy labels reduce the effective batch size, we would expect lower learning rates to perform better than large learning rates as noise increases. Figure 12 shows the performance of a 4-layer ConvNet trained with different learning rates on CIFAR-10 for varying label noise. As expected, we observe that the optimal learning rate decreases as noise increases. For example, the optimal learning rate for the clean dataset is 1, while, with the introduction of noise, this learning rate becomes unstable.

Figure 12. Performance on CIFAR-10 using a 4-layer ConvNet for varying learning rate as a function of noise level. Lower learning rates are generally optimal as the noise level increases.

6. Conclusion

In this paper, we studied the behavior of deep neural networks on training sets with very noisy labels. In a series of experiments, we have demonstrated that learning is robust to an essentially arbitrary amount of label noise, provided that the number of clean labels is sufficiently large. We have further shown that the required number of clean labels increases slightly faster than linearly with the ratio of noisy to clean labels. Finally, we have observed that noisy labels reduce the effective batch size, an effect that can be mitigated by larger batch sizes and by downscaling the learning rate.

It is worthy of note that although deep networks appear robust to even high degrees of label noise, clean labels still always perform better than noisy labels, given the same quantity of training data. Further, one still requires expert-vetted test sets for evaluation. Lastly, it is important to reiterate that our studies focus on non-adversarial noise.

Our work suggests numerous directions for future investigation. For example, we are interested in how label-cleaning and semi-supervised methods affect the performance of networks in a high-noise regime. Are such approaches able to lower the threshold for training set size? Finally, it remains to translate the results we present into an actionable trade-off between data annotation and acquisition costs, which can be utilized in real-world training pipelines for deep networks on massive noisy data.

ACKNOWLEDGMENTS

We would like to thank Surya Ganguli, Edward S. Boyden, Ramakrishna Vedantam, Daphne Ippolito, and Y. William Yu for insightful discussions and feedback. This work was supported in part by NSF grant 1122374, by Oath through the Connected Experiences Laboratory at Cornell Tech, a Google Focused Research Award, AWS Cloud Credits for Research, and a Facebook equipment donation.

References

Beigman, Eyal and Klebanov, Beata Beigman. Learning with annotation noise. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP. Association for Computational Linguistics, 2009.

Brodley, Carla E and Friedl, Mark A. Identifying mislabeled training data. Journal of Artificial Intelligence Research (JAIR), 11:131–167, 1999.

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), 2009.

Ge, Rong, Jin, Chi, and Zheng, Yi. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017.

Guan, Melody Y, Gulshan, Varun, Dai, Andrew M, and Hinton, Geoffrey E. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774, 2017.

He, Kaiming, Zhang, Xiangyu, Ren, Shaoqing, and Sun, Jian. Deep residual learning for image recognition. In Computer Vision and Pattern Recognition (CVPR), 2016.

Jin, Chi, Ge, Rong, Netrapalli, Praneeth, Kakade, Sham M, and Jordan, Michael I. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017.

Joulin, Armand, van der Maaten, Laurens, Jabri, Allan, and Vasilache, Nicolas. Learning visual features from large weakly supervised data. In European Conference on Computer Vision (ECCV). Springer, 2016.

Kawaguchi, Kenji. Deep learning without poor local minima. In Advances in Neural Information Processing Systems (NIPS), pp. 586–594, 2016.

Krasin, I, Duerig, T, Alldrin, N, Veit, A, Abu-El-Haija, S, Belongie, S, Cai, D, Feng, Z, Ferrari, V, Gomes, V, et al. OpenImages: A public dataset for large-scale multi-label and multiclass image classification. Dataset available from https://github.com/openimages, 2(6):7, 2016.

Krause, Jonathan, Sapp, Benjamin, Howard, Andrew, Zhou, Howard, Toshev, Alexander, Duerig, Tom, Philbin, James, and Fei-Fei, Li. The unreasonable effectiveness of noisy data for fine-grained recognition. In European Conference on Computer Vision (ECCV). Springer, 2016.

Krizhevsky, Alex. One weird trick for parallelizing convolutional neural networks. arXiv preprint arXiv:1404.5997, 2014.

Krizhevsky, Alex and Hinton, Geoffrey. Learning multiple layers of features from tiny images, 2009.

Le, Quoc V. Building high-level features using large scale unsupervised learning. In International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2013.

LeCun, Yann, Cortes, Corinna, and Burges, Christopher JC. The MNIST database of handwritten digits, 1998.

Manwani, Naresh and Sastry, PS. Noise tolerance under risk minimization. IEEE Transactions on Cybernetics, 43(3):1146–1151, 2013.

Misra, Ishan, Lawrence Zitnick, C, Mitchell, Margaret, and Girshick, Ross. Seeing through the human reporting bias: Visual classifiers from noisy human-centric labels. In Computer Vision and Pattern Recognition (CVPR), 2016.

Natarajan, Nagarajan, Dhillon, Inderjit S, Ravikumar, Pradeep K, and Tewari, Ambuj. Learning with noisy labels. In Advances in Neural Information Processing Systems (NIPS), 2013.

Pinto, Lerrel, Gandhi, Dhiraj, Han, Yuanfeng, Park, Yong-Lae, and Gupta, Abhinav. The curious robot: Learning visual representations via physical interactions. In European Conference on Computer Vision (ECCV). Springer, 2016.

Reed, Scott, Lee, Honglak, Anguelov, Dragomir, Szegedy, Christian, Erhan, Dumitru, and Rabinovich, Andrew. Training deep neural networks on noisy labels with bootstrapping. arXiv preprint arXiv:1412.6596, 2014.

Shalev-Shwartz, Shai, Shamir, Ohad, and Shammah, Shaked. Failures of gradient-based deep learning. In Proceedings of the 34th International Conference on Machine Learning, volume 70, pp. 3067–3075, 2017.

Sukhbaatar, Sainbayar, Bruna, Joan, Paluri, Manohar, Bourdev, Lubomir, and Fergus, Rob. Training convolutional networks with noisy labels. arXiv preprint arXiv:1406.2080, 2014.

Sun, Chen, Shrivastava, Abhinav, Singh, Saurabh, and Gupta, Abhinav. Revisiting unreasonable effectiveness of data in deep learning era. In IEEE International Conference on Computer Vision (ICCV), pp. 843–852. IEEE, 2017.
Szegedy, Christian, Zaremba, Wojciech, Sutskever, Ilya, Bruna, Joan, Erhan, Dumitru, Goodfellow, Ian, and Fergus, Rob. Intriguing properties of neural networks. In International Conference on Learning Representations (ICLR), 2014.

Thomee, Bart, Shamma, David A, Friedland, Gerald, Elizalde, Benjamin, Ni, Karl, Poland, Douglas, Borth, Damian, and Li, Li-Jia. YFCC100M: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.

Van Horn, Grant, Branson, Steve, Farrell, Ryan, Haber, Scott, Barry, Jessie, Ipeirotis, Panos, Perona, Pietro, and Belongie, Serge. Building a bird recognition app and large scale dataset with citizen scientists: The fine print in fine-grained dataset collection. In Computer Vision and Pattern Recognition (CVPR), 2015.

Veit, Andreas, Wilber, Michael, and Belongie, Serge. Residual networks behave like ensembles of relatively shallow networks. In Neural Information Processing Systems (NIPS), 2016.

Veit, Andreas, Alldrin, Neil, Chechik, Gal, Krasin, Ivan, Gupta, Abhinav, and Belongie, Serge. Learning from noisy large-scale datasets with minimal supervision. In Computer Vision and Pattern Recognition (CVPR), 2017.

Wang, Xiaolong and Gupta, Abhinav. Unsupervised learning of visual representations using videos. In International Conference on Computer Vision (ICCV), 2015.

Xiao, Tong, Xia, Tian, Yang, Yi, Huang, Chang, and Wang, Xiaogang. Learning from massive noisy labeled data for image classification. In Computer Vision and Pattern Recognition (CVPR), pp. 2691–2699, 2015.

Zeiler, Matthew D. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.

Zhang, Chiyuan, Bengio, Samy, Hardt, Moritz, Recht, Benjamin, and Vinyals, Oriol. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations (ICLR), 2017.

Zhu, Xiaojin. Semi-supervised learning literature survey. Computer Science, University of Wisconsin-Madison, 2(3):4, 2005.