arXiv:1902.03680v3 [cs.LG] 17 Jun 2019

Learning From Noisy Labels By Regularized Estimation Of Annotator Confusion

Ryutaro Tanno1 ∗, Ardavan Saeedi2, Swami Sankaranarayanan2, Daniel C. Alexander1, Nathan Silberman2
1University College London, UK
2Butterfly Network, New York, USA
1{r.tanno, [email protected] 2{asaeedi,swamiviv,[email protected]

∗A part of the work done during internship at Butterfly Network.

Abstract

The predictive performance of supervised learning algorithms depends on the quality of labels. In a typical label collection process, multiple annotators provide subjective noisy estimates of the “truth” under the influence of their varying skill-levels and biases. Blindly treating these noisy labels as the ground truth limits the accuracy of learning algorithms in the presence of strong disagreement. This problem is critical for applications in domains such as medical imaging where both the annotation cost and inter-observer variability are high. In this work, we present a method for simultaneously learning the individual annotator model and the underlying true label distribution, using only noisy observations. Each annotator is modeled by a confusion matrix that is jointly estimated along with the classifier predictions. We propose to add a regularization term to the loss function that encourages convergence to the true annotator confusion matrix. We provide a theoretical argument as to how the regularization is essential to our approach both for the case of single annotator and multiple annotators. Despite the simplicity of the idea, experiments on image classification tasks with both simulated and real labels show that our method either outperforms or performs on par with the state-of-the-art methods and is capable of estimating the skills of annotators even with a single label available per image.

1. Introduction

In many practical applications, supervised learning algorithms are trained on noisy labels obtained from multiple annotators of varying skill levels and biases. When there is a substantial amount of disagreement in the labels, conventional training algorithms that treat such labels as the “truth” lead to models with limited predictive performance. To mitigate such variation, practitioners typically abide by the principle of “wisdom of crowds” [1] and aggregate labels by computing the majority vote. However, this approach has limited efficacy in applications where the number of annotations is modest or the tasks are ambiguous. For example, vision applications in medical image analysis [2] require annotations from clinical experts, which incur high costs and often suffer from high inter-reader variability [3, 4, 5, 6].

However, if the exact process by which each annotator generates the labels was known, we could correct the annotations accordingly and thus train our model on a cleaner set of data. Furthermore, this additional knowledge of the annotators’ skills can be utilized to decide which examples should be labeled by which annotators [7, 8, 9]. Therefore, methods that can accurately model the label noise of annotators are useful for improving not only the accuracy of the trained model, but also the quality of labels in the future.

Previous work proposed various methods for jointly estimating the skills of the annotators and the ground truth (GT) labels. We categorize these methods into two groups: (1) two-stage approach and (2) simultaneous approach. Methods in the first category perform label aggregation and training of a supervised learning model in two separate steps. The noisy labels $\tilde{Y}$ are first aggregated by building a probabilistic model of annotators. The observable variables are the noisy labels $\tilde{Y}$, and the latent variables/parameters to be estimated are the annotator skills and GT labels $Y$. Then, a machine learning model is trained on the pairs of aggregated labels $Y$ and input examples $X$ (e.g. images) to perform the task of interest. The initial attempt was made in [10] in the early 1970s and more recently, numerous lines of research [11, 6, 12, 13, 14] proposed extensions of this work, e.g. by estimating the difficulty of each example. However, in all these cases, information about the raw inputs $X$ is completely neglected in the generative model of noisy labels used in the aggregation step, and this highly limits the quality of estimated true labels in practice.

The simultaneous approaches [15, 16, 17, 18] address this issue by integrating the prediction of the supervised learning model (i.e. distribution $p(Y \mid X)$) into the probabilistic model of noisy labels, and have been shown to improve the predictive performance. These methods employ variants of the expectation-maximization (EM) algorithm during training, and require a reasonable number of labels for each example. However, in most real world applications, it is practically prohibitive to collect a large number of labels per example, and this requirement limits their applications. A notable exception is the Model Bootstrapped EM (MBEM) algorithm presented in [19] that is capable of learning even with little label redundancy.

In this paper, we propose a more effective alternative to these EM-based approaches for jointly modeling the annotator skills and GT label distribution. Our method separates the annotation noise from true labels by (1) ensuring high fidelity with the data by minimizing the cross-entropy loss and (2) encouraging the estimated annotators to be maximally unreliable by minimizing the trace of the estimated confusion matrices. Our method is also simpler to implement, only requiring an addition of a regularization term to the cross-entropy loss. Furthermore, we provide a theoretical result that such regularization is capable of recovering the annotation noise as long as the average confusion matrix (CM) over annotators is diagonally dominant.

Experiments on image classification tasks with both simulated and real noisy labels demonstrate that our method, despite being much simpler, leads to better or comparable performance with MBEM [19] and generalized EM [15, 20], and is capable of recovering CMs even when there is only one label available per example. We simulated a diverse range of annotator types on MNIST and CIFAR10 data sets while we used an ultrasound dataset for cardiac view classification to test the efficacy in a real-world application. We also show the importance of modeling individual annotators by comparing against various modern noise-robust methods [21, 22, 23, 24], when the inter-annotator variability is high.

Other Related Works. More broadly, our work is related to methods for robust learning in the presence of label noise. There is a large body of literature that, unlike our method, does not explicitly model individual annotators. The effects of label noise are well studied in common classifiers such as SVMs and logistic regression, and robust variants have been proposed [25, 26, 27]. More recently, various attempts have been made to train deep neural networks under label noise. Reed et al. [21] developed a robust loss to model “prediction consistency”, which was later extended by [28]. In [29] and [22], label noise was parametrized in the form of a transition matrix and incorporated into neural networks for binary and multi-way classification. A more effective alternative for estimating such a transition matrix was proposed in [30], and a method for capturing image dependency of label noise was shown in [31]. [33] proposed a method for learning to weigh examples during each training iteration by using the validation loss on clean labels as the meta-objective. [34] employs a similar approach, but trains a separate network that proposes weighting. However, curating a set of clean labels of sufficient size is expensive for many applications, and this work focuses on the scenario of learning from purely noisy labels.

2. Methods

We assume that a set of images $\{x_i\}_{i=1}^{N}$ are assigned with noisy labels $\{\tilde{y}_i^{(r)}\}_{i=1,\ldots,N}^{r=1,\ldots,R}$ from multiple annotators, where $\tilde{y}_i^{(r)}$ denotes the label from annotator $r$ given to example $x_i$, but no ground truth (GT) labels $\{y_i\}_{i=1,\ldots,N}$ are available. In this work, we present a new procedure for multiclass classification problems that can simultaneously estimate the annotator noise and GT label distribution $p(y \mid x)$ from such a noisy set of data $D = \{x_i, \tilde{y}_i^{(1)}, \ldots, \tilde{y}_i^{(R)}\}_{i=1,\ldots,N}$. The method only requires adding a regularization term, that is the average accuracy of all annotator models, to the cross-entropy loss function. Intuitively, the method biases our models of each annotator to be as inaccurate as possible while having the model still explain the data. We will show that this is capable of decoupling the annotation noise from the true label distribution, as long as the average labels of the real annotators are “sufficiently” correct (which we formalize in Sec. 2.3). For simplicity, we first describe the method in the dense label scenario in which each image has labels from all annotators, and then extend to scenarios with missing labels where only a subset of annotators label each image. As we shall see later, the method works even when each image is only labelled by a single annotator.

2.1. Noisy Observation Model

We first describe our probabilistic model of the observed noisy labels from multiple annotators. In particular, we make two key assumptions: (1) annotators are statistically independent, (2) annotation noise is independent of the input image. By assumption (1), the probability of observing noisy labels $\{\tilde{y}^{(1)}, \ldots, \tilde{y}^{(R)}\}$ on image $x$ can be written as:

$$p(\tilde{y}^{(1)}, \ldots, \tilde{y}^{(R)} \mid x) = \prod_{r=1}^{R} \int_{y \in \mathcal{Y}} p(\tilde{y}^{(r)} \mid y, x) \cdot p(y \mid x)\, dy \tag{1}$$

where $p(y \mid x)$ denotes the true label distribution of the image, and $p(\tilde{y}^{(r)} \mid y, x)$ describes the noise model by which
