Snuba: Automating Weak Supervision to Label Training Data
Paroma Varma, Christopher Ré
Stanford University

ABSTRACT
As deep learning models are applied to increasingly diverse problems, a key bottleneck is gathering enough high-quality training labels tailored to each task. Users therefore turn to weak supervision, relying on imperfect sources of labels like pattern matching and user-defined heuristics. Unfortunately, users have to design these sources for each task. This process can be time consuming and expensive: domain experts often perform repetitive steps like guessing optimal numerical thresholds and developing informative text patterns. To address these challenges, we present Snuba, a system to automatically generate heuristics using a small labeled dataset to assign training labels to a large, unlabeled dataset in the weak supervision setting. Snuba generates heuristics that each label the subset of the data they are accurate for, and it iteratively repeats this process until the heuristics together label a large portion of the unlabeled data. We develop a statistical measure that guarantees the iterative process will automatically terminate before it degrades training label quality. Snuba automatically generates heuristics in under five minutes and performs up to 9.74 F1 points better than the best known user-defined heuristics developed over many days. In collaborations with users at research labs and Stanford Hospital, and on open source datasets, Snuba outperforms other automated approaches like semi-supervised learning by up to 14.35 F1 points.

[Figure 1: Snuba uses a small labeled and a large unlabeled dataset to iteratively generate heuristics (e.g., thresholds on tumor area and perimeter that return a label or abstain). It uses existing label aggregators to assign training labels to the large dataset.]

PVLDB Reference Format:
Paroma Varma, Christopher Ré. Snuba: Automating Weak Supervision to Label Training Data. PVLDB, 12(3): 223-236, 2018.
DOI: https://doi.org/10.14778/3291264.3291268

1. INTRODUCTION
The success of machine learning for tasks like image recognition and natural language processing [12, 14] has ignited interest in using similar techniques for a variety of tasks. However, gathering enough training labels is a major bottleneck in applying machine learning to new tasks. In response, there has been a shift towards relying on weak supervision: methods that can assign noisy training labels to unlabeled data, like crowdsourcing [9, 22, 60], distant supervision [8, 32], and user-defined heuristics [38, 39, 50]. Over the past few years, we have been part of the broader effort to enhance methods based on user-defined heuristics and to extend their applicability to text, image, and video data for tasks in computer vision, medical imaging, bioinformatics, and knowledge base construction [4, 39, 50].

Through our engagements with users at large companies, we find that experts spend a significant amount of time designing these weak supervision sources. As deep learning techniques are adopted for unconventional tasks like analyzing codebases, and for now-commodity tasks like driving marketing campaigns, the few domain experts with the knowledge required to write heuristics cannot reasonably keep up with the demand for many specialized, labeled training datasets. Even machine learning experts, such as researchers at the computer vision lab at Stanford, are impeded by the need to crowdsource labels before even starting to build models for novel visual prediction tasks [23, 25]. This raises an important question: can we make weak supervision techniques easier to adopt by automating the process of generating heuristics that assign training labels to unlabeled data?

The key challenge in automating weak supervision lies in replacing the human reasoning that drives heuristic development. In our collaborations with users with varying levels of machine learning expertise, we noticed that the process of developing these weak supervision sources can be fairly repetitive. For example, radiologists at the Stanford Hospital and Clinics have to guess the correct threshold for each heuristic that uses a geometric property of a tumor to determine if it is malignant (example shown in Figure 1 and sketched below).
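To make the form of these heuristics concrete, here is a minimal Python sketch of the two threshold heuristics shown in Figure 1. The numeric thresholds come from the figure; the integer label convention and the fallback branches, which the figure elides, are our own assumptions.

```python
# Label convention for this sketch: 1 = positive ("True" in Figure 1),
# 0 = negative ("False" in Figure 1), ABSTAIN = no label assigned.
ABSTAIN = -1

def area_heuristic(area: float) -> int:
    # Thresholds as shown in Figure 1; the final branch is assumed.
    if area > 210.8:
        return 0          # "return False" in Figure 1
    if area < 150:
        return ABSTAIN    # "return Abstain" in Figure 1
    return 1

def perimeter_heuristic(perim: float) -> int:
    # Thresholds as shown in Figure 1; the final branch is assumed.
    if perim > 120:
        return 1          # "return True" in Figure 1
    if perim > 80:
        return ABSTAIN    # "return Abstain" in Figure 1
    return 0
```

Abstention is what lets each heuristic label only the subset it is accurate for, which is central to the accuracy-versus-coverage trade-off discussed next.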
We instead take advantage of a small, labeled dataset to automatically generate noisy heuristics. Though the labeled dataset is too small to train an end model, it carries enough information to generate heuristics that can assign noisy labels to a large, unlabeled dataset and improve end model performance by up to 12.12 F1 points. To aggregate labels from these heuristics, we improve over majority vote by relying on existing factor graph-based statistical techniques in weak supervision that can model the noise in, and correlation among, these heuristics [2, 4, 39, 41, 48, 50]. However, these techniques were intended to work with user-designed labeling sources and therefore have limits on how robust they are. Automatically generated heuristics can be noisier than what these models can account for, which introduces the following challenges:

Accuracy. Users tend to develop heuristics that assign accurate labels to a subset of the unlabeled data. An automated method has to properly model this trade-off between accuracy and coverage for each heuristic based only on the small, labeled dataset. Empirically, we find that generating heuristics that each label all the datapoints can degrade end model performance by up to 20.69 F1 points.

Diversity. Since each heuristic has limited coverage, users develop multiple heuristics that each label a different subset to ensure that a large portion of the unlabeled data receives a label. In an automated approach, we could mimic this by maximizing the number of unlabeled datapoints the heuristics label as a set. However, this approach can select heuristics that cover a large portion of the data but perform poorly. There is a need to account for both the diversity and the performance of the heuristics as a set. Empirically, balancing both aspects improves end model performance by up to 18.20 F1 points compared to selecting the heuristic set that labels the most datapoints.

Termination Condition. Users stop generating heuristics when they have exhausted their domain knowledge. An automated method, however, can continue to generate heuristics that deteriorate the overall quality of the training labels assigned to the unlabeled data, such as heuristics that are worse than random for the unlabeled data. Not accounting for performance on the unlabeled dataset can affect end model performance by up to 7.09 F1 points.

Our Approach. To address the challenges above, we introduce Snuba, an automated system that takes as input a small labeled and a large unlabeled dataset and outputs probabilistic training labels for the unlabeled data, as shown in Figure 1. These labels can be used to train a downstream machine learning model of choice, which can operate over the raw data and generalize beyond the heuristics Snuba generates. Snuba generates heuristics iteratively using three components:

Synthesizer to Generate Heuristics. The synthesizer (Section 3.1) generates candidate heuristics that operate over primitives, or features of the data, which can come from open source libraries [35, 49] and data models in existing weak supervision frameworks [38, 58]. Primitive examples in our evaluation include bag-of-words for text and bounding box attributes for images.

Pruner for Diversity. To ensure that the set of heuristics is diverse and assigns high-quality labels to a large portion of the unlabeled data, the pruner (Section 3.2) ranks the heuristics the synthesizer generates by the weighted average of their performance on the labeled set and their coverage on the unlabeled set. It selects the best heuristic at each iteration and adds it to the collection of existing heuristics. This method performs up to 6.57 F1 points better than ranking heuristics by performance only.
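As a rough illustration of the pruner's ranking rule, the sketch below scores each candidate by a weighted average of its F1 score on the labeled set and its coverage on the unlabeled set. The default equal weighting, the function names, and the interfaces are our assumptions, not Snuba's exact formulation; the label convention matches the earlier sketch.

```python
from typing import Callable, Sequence

ABSTAIN = -1  # same label convention as the earlier sketch

def f1_on_labeled(h: Callable[[float], int], xs: Sequence[float],
                  ys: Sequence[int]) -> float:
    """F1 score of heuristic h on the small labeled set, ignoring abstentions."""
    tp = fp = fn = 0
    for x, y in zip(xs, ys):
        pred = h(x)
        if pred == ABSTAIN:
            continue
        tp += pred == 1 and y == 1
        fp += pred == 1 and y == 0
        fn += pred == 0 and y == 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def coverage_on_unlabeled(h: Callable[[float], int],
                          xs: Sequence[float]) -> float:
    """Fraction of unlabeled datapoints the heuristic does not abstain on."""
    return sum(h(x) != ABSTAIN for x in xs) / len(xs)

def best_heuristic(candidates, labeled_x, labeled_y, unlabeled_x, w=0.5):
    """Select the candidate maximizing w * F1 + (1 - w) * coverage."""
    return max(candidates,
               key=lambda h: w * f1_on_labeled(h, labeled_x, labeled_y)
                             + (1 - w) * coverage_on_unlabeled(h, unlabeled_x))
```

Scoring by F1 alone would favor narrow, near-perfect heuristics; scoring by coverage alone would favor broad, noisy ones. The weighted average is what balances the two, per the paragraph above.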
Verifier to Determine Termination Condition. The verifier uses existing statistical techniques to aggregate labels from the heuristics into probabilistic labels for the unlabeled datapoints [4, 39, 50]. However, the automated heuristic generation process can surpass the noise levels to which these techniques are robust and degrade end model performance by up to 7.09 F1 points. We develop a statistical measure that uses the small, labeled set to determine whether the noise in the generated heuristics is below the threshold these techniques can handle (Section 4). A schematic of the full iterative loop is sketched after the contribution summary below.

Contribution Summary. We describe Snuba, a system that automatically generates heuristics using a small labeled dataset to assign training labels to a large, unlabeled dataset in the weak supervision setting. A summary of our contributions is as follows:

• We describe the system architecture, the iterative process of generating heuristics, and the optimizers used in the three components (Section 3). We also show that our automated optimizers can affect end model performance by up to 20.69 F1 points (Section 5).

• We present a theoretical guarantee that Snuba will terminate the iterative process before the noise in the heuristics surpasses the threshold to which the statistical techniques are robust (Section 4). This theoretical result translates to improving end model performance by up to 7.09 F1 points compared to generating as many heuristics as possible (Section 5).

• We evaluate our system in Section 5 by using Snuba labels to train downstream models, which generalize beyond the heuristics Snuba generates.
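Tying the three components together, below is a rough schematic of the iterative loop described above, not Snuba's actual implementation. Here `synthesize`, `select_best`, and `passes_noise_check` are hypothetical callables standing in for the synthesizer, the pruner, and the statistical measure of Section 4, and plain majority vote stands in for the factor graph-based label aggregator.

```python
from collections import Counter

ABSTAIN = -1  # same label convention as the earlier sketches

def majority_vote(heuristics, x):
    """Stand-in aggregator: majority vote over non-abstaining heuristics."""
    votes = [v for v in (h(x) for h in heuristics) if v != ABSTAIN]
    return Counter(votes).most_common(1)[0][0] if votes else ABSTAIN

def snuba_loop(synthesize, select_best, passes_noise_check,
               labeled_x, labeled_y, unlabeled_x, max_iters=25):
    """Grow the heuristic set one heuristic per iteration, stopping when
    the verifier's noise check fails or the iteration budget runs out."""
    heuristics = []
    for _ in range(max_iters):
        candidates = synthesize(labeled_x, labeled_y)              # synthesizer
        heuristics.append(
            select_best(candidates, labeled_x, labeled_y, unlabeled_x))  # pruner
        if not passes_noise_check(heuristics, labeled_x, labeled_y):     # verifier
            heuristics.pop()  # discard the heuristic that crossed the threshold
            break
    # Training labels for the unlabeled set (probabilistic in Snuba;
    # hard majority-vote labels in this sketch).
    return [majority_vote(heuristics, x) for x in unlabeled_x]
```

The pruner sketch's `best_heuristic` could be passed as `select_best` here; the verifier's check is the piece Section 4 makes precise.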