WEAK SUPERVISION FROM HIGH-LEVEL ABSTRACTIONS

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Braden Jay Hancock August 2019

© 2019 by Braden Jay Hancock. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/ns523jd4552

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Chris Ré, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Dan Jurafsky

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Percy Liang

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The interfaces for interacting with models are changing. Consider, for example, that while computers run on 1s and 0s, that is no longer the level of abstraction we use to program most computers. Instead, we use higher-level abstractions such as assembly language, high-level languages, or declarative languages to more efficiently convert our objectives into code. Similarly, most machine learning models are trained with "1s and 0s" (individually labeled examples), but we need not limit ourselves to interacting with them at this low level. Instead, we can use higher-level abstractions to more efficiently convert our domain knowledge into the inputs our models require. In this work, we show that weak supervision from high-level abstractions can be used to train high-performance machine learning models. At each of three different levels of abstraction, we describe a system we built to enable such interaction. We begin with Snorkel, which elevates label generation from a manual process to a programmatic one. With this system, domain experts encode their knowledge in potentially noisy and correlated black-box functions called labeling functions. These functions can then be automatically denoised and applied to unlabeled data to create large training sets quickly. Next, with Fonduer we enable an abstraction one step higher where advanced primitives defined over multiple modalities (visual, textual, structural, and tabular) allow users to programmatically supervise over richly formatted data (e.g., PDFs with tables and formatting). Finally, in BabbleLabble we show that we can even utilize supervision given in the form of natural language explanations, maintaining the benefits of programmatic supervision while removing the burden of writing code. For all of these systems, we demonstrate their effectiveness with empirical results and present real-world use cases where they have enabled rapid development of machine learning applications, including in bio-medicine, commerce, and defense.

Acknowledgments

My Ph.D. experience was rich and rewarding, and for that I owe a great debt to a great many people.

I am deeply grateful to my advisor, Chris Ré. In an advisor I hoped for a master in identifying problems where progress would translate into real-world impact. I absolutely found this in Chris, but much more. From him I internalized valuable lessons such as: focus on process, not products; a paper is a receipt for good work, not the good work itself; and the reward for hard work is always more hard work. No one works harder than Chris, and it was an honor to be a part of his lab during my tenure at Stanford.

I am grateful to Percy Liang and Dan Jurafsky, whose classes and insights during my first year at Stanford gave me a love for natural language processing and appreciation for education at the highest level.

I owe a great deal to my fellow students and labmates in the Hazy Research group. It takes a village to raise a researcher, and innumerable paper swaps, whiteboard discussions, and late night hackathons together have made me the researcher I am today. In particular, I would like to acknowledge my fellow Ph.D. students Alex Ratner and Paroma Varma, who were with me through it all.

I have had outstanding mentors at every stage of my education who took chances on me when they certainly did not have to: John Clark at AFRL before I knew a thing about programming, Christopher Mattson at BYU before I knew a thing about research, Mark Dredze at Johns Hopkins before I knew a thing about NLP, Vijay Gadepally at MIT Lincoln Laboratory before I knew a thing about machine learning, Hongrae Lee at before I knew a thing about production environments, and Antoine Bordes and Jason Weston at Facebook before I knew a thing about dialogue.

My research would not have been possible without the financial support of my funders: the NSF Graduate Research Fellowship and Stanford Finch Family Fellowship especially, but also the DOE, NIH, ONR, DARPA, member companies of Stanford DAWN, and many other organizations supporting the Hazy Research group.

Finally and most of all, I am grateful for my wife Lauren and daughters Annie and Pippa. When we moved to Stanford with a 10-day-old baby, I had hope but no assurance that I would be able to keep up with the rigorous demands of a top-tier doctoral program while simultaneously learning how to raise a family. I could not have anticipated then how many times more beautiful and rich my experience would be because it was shared with them.

Contents

Abstract iv

Acknowledgments v

1 Introduction 1

2 Weak Supervision from Code 4
2.1 Introduction ...... 4
2.2 Snorkel Architecture ...... 8
2.2.1 A Language for Weak Supervision ...... 10
2.2.2 Generative Model ...... 14
2.2.3 Discriminative Model ...... 15
2.3 Weak Supervision Tradeoffs ...... 15
2.3.1 Modeling Accuracies ...... 16
2.3.2 Modeling Structure ...... 20
2.4 Evaluation ...... 23
2.4.1 Applications ...... 24
2.4.2 User Study ...... 31
2.5 Extensions & Next Steps ...... 34
2.5.1 Extensions for Real-World Deployments ...... 34
2.5.2 Multi-Task Weak Supervision ...... 34
2.5.3 Future Directions ...... 35
2.6 Related Work ...... 35
2.7 Conclusion ...... 36

3 Weak Supervision from Primitives 38
3.1 Introduction ...... 38
3.2 Background ...... 42
3.2.1 Knowledge Base Construction ...... 42
3.2.2 Recurrent Neural Networks ...... 43
3.3 The Fonduer Framework ...... 45
3.3.1 Fonduer’s Data Model ...... 45
3.3.2 User Inputs and Fonduer’s Pipeline ...... 46
3.3.3 Fonduer’s Programming Model for KBC ...... 49
3.4 KBC in Fonduer ...... 50
3.4.1 Candidate Generation ...... 50
3.4.2 Multimodal LSTM Model ...... 51
3.4.3 Multimodal Supervision ...... 54
3.5 Experiments ...... 54
3.5.1 Experimental Settings ...... 54
3.5.2 Experimental Results ...... 56
3.5.3 Ablation Studies ...... 58
3.6 User Study ...... 61
3.7 Extensions ...... 63
3.8 Related Work ...... 64
3.9 Conclusion ...... 64

4 Weak Supervision from Natural Language 65
4.1 Introduction ...... 65
4.2 The BabbleLabble Framework ...... 67
4.2.1 Explanations ...... 68
4.2.2 Semantic Parser ...... 68
4.2.3 Filter Bank ...... 69
4.2.4 Label Aggregator ...... 70
4.2.5 Discriminative Model ...... 71
4.3 Experimental Setup ...... 71
4.3.1 Datasets ...... 72
4.3.2 Experimental Settings ...... 73
4.4 Experimental Results ...... 73
4.4.1 High Bandwidth Supervision ...... 73
4.4.2 Utility of Incorrect Parses ...... 74
4.4.3 Using LFs as Functions or Features ...... 75
4.5 Related Work and Discussion ...... 76
4.6 Extensions ...... 77

5 Discussion & Conclusion 78
5.1 Advantages of Programmatic Supervision ...... 78
5.2 Limitations ...... 79
5.3 The Supervision Stack ...... 80
5.4 Conclusion ...... 81

A Snorkel Appendix 82
A.1 Additional Material for Sec. 3.1 ...... 82
A.1.1 Minor Notes ...... 82
A.1.2 Proof of Proposition 1 ...... 82
A.1.3 Proof of Proposition 2 ...... 84
A.1.4 Proof of Proposition 3 ...... 85

B Fonduer Appendix 88
B.1 Data Programming ...... 88
B.1.1 Components of Data Programming ...... 88
B.1.2 Theoretical Guarantees ...... 89
B.2 Extended Feature Library ...... 89
B.3 Fonduer at Scale ...... 91
B.3.1 Data Caching ...... 91
B.3.2 Data Representations ...... 92
B.4 Future Work ...... 93
B.5 GwasKB Web Interface ...... 95

C BabbleLabble Appendix 97
C.1 Predicate Examples ...... 97
C.2 Sample Explanations ...... 99

List of Tables

2.1 Modeling advantage Aw attained using a generative model for several applications in Snorkel (Section 2.4.1), the upper bound Ã∗ used by our optimizer, the modeling strategy selected by the optimizer—either majority vote (MV) or generative model (GM)—and the empirical label density dΛ ...... 18
2.2 Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task ...... 24
2.3 Evaluation of Snorkel on relation extraction tasks from text. Snorkel’s generative and discriminative models consistently improve over distant supervision, measured in F1, the harmonic mean of precision (P) and recall (R). We compare with hand-labeled data when available, coming within an average of 1 F1 point ...... 25
2.4 Number of candidates in the training, development, and test splits for each dataset ...... 26
2.5 Evaluation on cross-modal experiments. Labeling functions that operate on or represent one modality (text, crowd workers) produce training labels for models that operate on another modality (images, text), and approach the predictive performance of large hand-labeled training datasets ...... 28
2.6 Comparison between training the discriminative model on the labels estimated by the generative model, versus training on the unweighted average of the LF outputs. Predictive performance gains show that modeling LF noise helps ...... 29
2.7 Labeling function ablation study on CDR. Adding different types of labeling functions improves predictive performance ...... 30
2.8 Self-reported skill levels—no previous experience (New), beginner (Beg.), intermediate (Int.), and advanced (Adv.)—for all user study participants ...... 31

3.1 Summary of the datasets used in our experiments ...... 55
3.2 End-to-end quality in terms of precision, recall, and F1 score for each application compared to the upper bound of state-of-the-art systems ...... 57
3.3 End-to-end quality vs. existing knowledge bases ...... 57
3.4 Comparing approaches to featurization based on Fonduer’s data model ...... 61

3.5 Comparing the features of SRV and Fonduer ...... 61
3.6 Comparing document-level RNN and Fonduer’s deep-learning model on a single ELECTRONICS relation ...... 61

4.1 Predicates in the grammar supported by BabbleLabble’s rule-based semantic parser ...... 69
4.2 The total number of unlabeled training examples (a pair of annotated entities in a sentence), labeled development examples (for hyperparameter tuning), labeled test examples (for assessment), and the fraction of positive labels in the test split ...... 72
4.3 F1 scores obtained by a classifier trained with BabbleLabble (BL) using 30 explanations or with traditional supervision (TS) using the specified number of individually labeled examples. BabbleLabble achieves the same F1 score as traditional supervision while using fewer user inputs by a factor of over 5 (Protein) to over 100 (Spouse) ...... 73
4.4 The number of LFs generated from 30 explanations (pre-filters), discarded by the filter bank, and remaining (post-filters), along with the percentage of LFs that were correctly parsed from their corresponding explanations ...... 74
4.5 F1 scores obtained using BabbleLabble with no filter bank (BL-FB), as normal (BL), and with a perfect parser (BL+PP) simulated by hand ...... 75
4.6 F1 scores obtained using explanations as functions for data programming (BL) or features (Feat), optionally with no discriminative model (-DM) or using a perfect parser (+PP) ...... 76

B.1 Features from Fonduer’s feature library. Example values are drawn from the example candidate in Figure 3.1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature’s value ...... 90

List of Figures

1.1 Similar to the way we program computers, we can program our machine learning models using higher-level abstractions than individual bits or labels ...... 2

2.1 In Snorkel, rather than labeling training data by hand, users write labeling functions, which programmatically label data points or abstain. These labeling functions have different unknown accuracies and correlations. Snorkel automatically models and combines their outputs using a generative model, then uses the resulting probabilistic labels to train a discriminative model ...... 5
2.2 In Example 2.1.1, training data is labeled by sources of differing accuracy and coverage. Two key challenges arise in using this weak supervision effectively. First, we need a way to estimate the unknown source accuracies to resolve disagreements. Second, we need to pass on this critical lineage information to the end model being trained ...... 6
2.3 An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network ...... 7
2.4 Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate ...... 10
2.5 Labeling functions expressing pattern-matching, heuristic, and distant supervision approaches, respectively, in Snorkel’s Jupyter notebook interface, for the Spouses example. Full code is available in Snorkel’s Intro tutorial ...... 12
2.6 The data viewer utility in Snorkel, showing candidate spouse relation mentions from the Spouses example, composed of person-person mention pairs ...... 13
2.7 A plot of the modeling advantage, i.e., the improvement in label accuracy from the generative model, as a function of the number of labeling functions (equivalently, the label density) on a synthetic dataset ...... 17

2.8 The predicted (Ã∗) and actual (Aw) advantage of using the generative labeling model (GM) over majority vote (MV) on the CDR application as the number of LFs is increased. At 9 LFs, the optimizer switches from choosing MV to choosing GM; this leads to faster modeling in early development cycles, and more accurate results in later cycles ...... 20
2.9 Predictive performance of the generative model and number of learned correlations versus the correlation threshold ε. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task ...... 20
2.10 Precision-recall curves for the relation extraction tasks. The top plots compare a majority vote of all labeling functions, Snorkel’s generative model, and Snorkel’s discriminative model. They show that the generative model improves over majority vote by providing more granular information about candidates, and that the discriminative model can generalize to candidates that no labeling functions label. The bottom plots compare the discriminative model trained on an unweighted combination of the labeling functions, hand supervision (when available), and Snorkel’s discriminative model. They show that the discriminative model benefits from the weighted labels provided by the generative model, and that Snorkel is competitive with hand supervision, particularly in the high-precision region ...... 25
2.11 The increase in end model performance (measured in F1 score) for different amounts of unlabeled data, measured in the number of candidates. We see that as more unlabeled data is added, the performance increases ...... 29
2.12 Predictive performance attained by our 14 user study participants using Snorkel. The majority (57%) of users matched or exceeded the performance of a model trained on 7 hours (2,500 instances) of hand-labeled data ...... 32
2.13 The profile of the best performing user by F1 score was an MS or Ph.D. degree in any field, strong Python coding skills, and intermediate to advanced experience with machine learning. Prior experience with text mining added no benefit ...... 32
2.14 We bucketed labeling functions written by user study participants into three types—pattern-based, distant supervision, and complex. Participants tended to mainly write pattern-based labeling functions, but also universally expressed more complex heuristics as well ...... 33

3.1 A KBC task to populate relation HasCollectorCurrent(Transistor Part, Current) from datasheets. Part and Current mentions are in blue and green, respectively ...... 39
3.2 An overview of Fonduer KBC over richly formatted data. Given a set of richly formatted documents and a series of lightweight inputs from the user, Fonduer extracts facts and stores them in a relational database ...... 43
3.3 Fonduer’s data model ...... 45

3.4 Tradeoff between (a) quality and (b) execution time when pruning the number of candidates using throttlers ...... 51
3.5 An illustration of Fonduer’s multimodal LSTM for candidate (SMBT3904, 200) in Figure 3.1 ...... 52
3.6 Average F1 score over four relations when broadening the extraction context scope in ELECTRONICS ...... 58
3.7 The impact of each modality in the feature library ...... 59
3.8 Study of different supervision resources’ effect. Metadata includes structural, tabular, and visual modalities ...... 62
3.9 F1 quality over time with 95% confidence intervals (left). Modality distribution of user LFs (right) ...... 62

4.1 In BabbleLabble, the user provides a natural language explanation for each labeling decision. These explanations are parsed into labeling functions that convert unlabeled data into a large labeled dataset for training a classifier ...... 66
4.2 Natural language explanations are parsed into candidate labeling functions (LFs). Many incorrect LFs are filtered out automatically by the filter bank. The remaining functions provide heuristic labels over the unlabeled dataset, which are aggregated into one noisy label per example, yielding a large, noisily-labeled training set for a classifier ...... 67
4.3 Valid parses are found by iterating over increasingly large subspans of the input looking for matches among the right hand sides of the rules in the grammar. Rules are either lexical (converting tokens into symbols), unary (converting one symbol into another symbol), or compositional (combining many symbols into a single higher-order symbol). A rule may optionally ignore unrecognized tokens in a span (denoted here with a dashed line) ...... 68
4.4 An example and explanation for each of the three datasets ...... 72
4.5 Incorrect LFs often still provide useful signal. On top is an incorrect LF produced for the Disease task that had the same accuracy as the correct LF. On bottom is a correct LF from the Spouse task and a more accurate incorrect LF discovered by randomly perturbing one predicate at a time as described in Section 4.4.2. (Person 2 is always the second person in the sentence) ...... 74
4.6 When logical forms of natural language explanations are used as functions for data programming (as they are in BabbleLabble), performance can improve with the addition of unlabeled data, whereas using them as features does not benefit from unlabeled data ...... 75
4.7 The Babble Labble interface ...... 77

5.1 Abstract schematics of labeling rate for (a) manual labeling and (b) programmatic labeling approaches...... 79

B.1 The GWASkb web application hosted at http://gwaskb.stanford.edu for exploring the contents of the GWASkb knowledge base created with Fonduer ...... 95
B.2 Users can search by genotype (e.g., rs7329174) or phenotype (e.g., breast cancer) and see all related studies and associations with links to the corresponding articles for further exploration ...... 96

Chapter 1

Introduction

Training a machine learning model for a new application requires three primary components: a model to train, hardware to train on, and data to train with. With the proliferation of cloud computing products, users around the world have access to state-of-the-art hardware for mere cents per hour.123 Similarly, model zoos456, industry standards7, and heavily supported open source frameworks [Paszke et al., 2017, Abadi et al., 2015] have made state-of-the-art models readily available for use. Consequently, the bottleneck to obtaining high quality in new machine learning applications has increasingly become obtaining the necessary training data. Furthermore, exacerbating this bottleneck is the fact that the predominant process for generating training data (manually labeling examples one-by-one) is so low-level. We can compare the process of training or “programming” machine learning models to that of programming computers. For example, even though computers are programmed with individual bits (1s and 0s), that is no longer the level of abstraction we use to write the programs our computers run. Instead, we use higher abstractions that compile down into that form (Figure 1.1).

• We use low-level assembly code to write multiple bytes at a time.
• We use high-level languages like C++ to allow access to more advanced concepts.
• And even higher, we use declarative languages like SQL where users can simply describe what they want, and the code that will be executed gets written automatically.

Compare this process to that of training supervised machine learning models via labeled examples. For many problems, while the data may be very complex, the labels themselves are quite low-level. Consider, for example, that the labels in the common binary classification setting are, like bits, also 1s and 0s.

1https://aws.amazon.com/pricing/
2https://azure.microsoft.com/en-us/pricing/
3https://cloud.google.com/pricing/
4https://www.tensorflow.org/hub
5https://pytorch.org/hub
6https://github.com/huggingface/pytorch-transformers
7https://onnx.ai/


Figure 1.1: Similar to the way we program computers, we can program our machine learning models using higher-level abstractions than individual bits or labels.

However, even though our models — like computers — are trained on low-level inputs, that does not have to be the interface we use to program them.

• We can programmatically generate labels using code.
• We can expose advanced primitives (or "helper functions") to make it easier to supervise complex concepts efficiently.
• And we can support natural language supervision, where users convey their domain expertise with words and the labels are generated automatically.
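To make these three interfaces concrete, the snippet below sketches the same piece of domain knowledge expressed at each level. It is purely illustrative; the helper function and explanation format shown here are hypothetical stand-ins, not the actual interfaces of the systems described in later chapters.

# Illustrative only: the same piece of domain knowledge expressed at three
# levels of abstraction. Helper names here are hypothetical, not real APIs.

# (1) Supervision from code: a hand-written labeling function.
def lf_causes(sentence):
    return 1 if "causes" in sentence else 0   # 0 means abstain

# (2) Supervision from primitives: the same heuristic via a higher-level helper.
def contains_word(word):
    return lambda sentence: 1 if word in sentence else 0
lf_causes_v2 = contains_word("causes")

# (3) Supervision from natural language: the heuristic stated as an explanation,
# to be compiled into a labeling function by a semantic parser (Chapter 4).
explanation = 'Label True because the word "causes" appears in the sentence.'

print(lf_causes("magnesium causes quadriplegia"),
      lf_causes_v2("magnesium causes quadriplegia"))   # -> 1 1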

In this work, we present three systems built at increasingly high levels of abstraction for supervising machine learning models, corresponding to the three levels just described. For each system, we present real-world use cases where it has been applied, including in defense, commerce, and medicine. The thesis of this dissertation is that weak supervision from high-level abstractions can be used to train high-performance machine learning models. Weak supervision comes in many forms, and can be loosely defined as using cheaper, higher-level, and/or potentially noisier inputs than ground truth labels as supervision. Perhaps the most commonly recognized form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels [Bunescu and Mooney, 2007, Mintz et al., 2009b, Alfonseca et al., 2012b]. Additional forms include crowdsourced labels [Yuen et al., 2011, Quinn and Bederson, 2011], using individual rules and heuristics to label data [Zhang et al., 2017, Rekatsinas et al., 2017a], and others [Zaidan and Eisner, 2008b, Liang et al., 2009b, Mann and McCallum, 2010, Stewart and Ermon, 2017]. In this work, we focus in particular on the setting where we have access to multiple weak supervision sources, which generally leads to better performance than using any particular source on its own.

We begin with Snorkel8 in Chapter 2, a first-of-its-kind system that enables users to train state-of-the-art models without labeling any training data by hand. Instead, users write programmatic labeling functions that express arbitrary heuristics, which can have unknown and varying accuracies and correlations. Snorkel denoises these sources without access to ground truth by incorporating the first end-to-end implementation of our recently proposed machine learning paradigm, data programming. We present a flexible interface layer for writing labeling functions based on our experiences collaborating with companies, agencies, and research labs. In a user study, subject matter experts build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling. We study the modeling tradeoffs in this new setting and propose an optimizer for automating tradeoff decisions that gives up to 1.8× speedup per pipeline execution. In two collaborations with government agencies and on four open-source text and image data sets representative of other deployments, Snorkel provides 132% average improvements to predictive performance over prior heuristic approaches and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

Next, with Fonduer9 in Chapter 3 we enable an abstraction one step higher where advanced primitives allow users to easily supervise over multiple modalities: visual, textual, structural, and tabular. We analyze the contributions of this abstraction in the context of knowledge base construction (KBC) from richly formatted data (e.g., PDFs or webpages where formatting and layout convey information beyond the raw text). We compare Fonduer against state-of-the-art KBC approaches in four different domains. We find that Fonduer achieves an average improvement of 41 F1 points on the quality of the output knowledge base—and in some cases produces up to 1.87× the number of correct entries—compared to expert-curated public knowledge bases. We also conduct a user study to assess the usability of Fonduer’s programming model, showing that after using Fonduer for only 30 minutes, non-domain experts are able to design KBC systems that achieve on average 23 F1 points higher quality than traditional machine-learning-based KBC approaches.

Third, with BabbleLabble10 in Chapter 4 we show that we can even utilize supervision given in the form of natural language explanations, maintaining the benefits of programmatic supervision while removing the burden of writing code. With BabbleLabble, annotators provide a natural language explanation for each labeling decision that they make. A semantic parser compiles these explanations into labeling functions composed of relevant primitives. On three relation extraction tasks, we find that users are able to train classifiers with comparable F1 scores 5–100× faster by providing explanations instead of just labels. Furthermore, given the inherent imperfection of labeling functions, we find that a simple rule-based semantic parser suffices.

Finally, in Chapter 5 we discuss the advantages and limitations of programmatic supervision at large, exploring the ramifications of the supervision stack which we have proposed.

8https://github.com/snorkel-team/snorkel
9https://github.com/HazyResearch/fonduer
10https://github.com/HazyResearch/babble

Chapter 2

Weak Supervision from Code

We begin by attempting to answer the question of whether we can effectively train machine learning models without labels generated individually by hand. We accomplish this with Snorkel, a system for combining weak supervision sources to rapidly create training data. As we will see throughout this dissertation, the primary abstraction utilized by Snorkel, the labeling function, will serve as the fundamental unit in subsequent higher-level interfaces. To use the computer analogy, labeling functions will serve as the x86 of the supervision stack.

The work in this chapter is the result of collaboration with Alexander Ratner, Stephen Bach, Henry Ehrenberg, Jason Fries, Sen Wu, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré, and many others. It draws on content from the following Snorkel-related publications: [Ratner et al., 2016b, 2017b, 2018, 2019b,a, Bach et al., 2017, 2019].

2.1 Introduction

In the last several years, there has been an explosion of interest in machine-learning-based systems across industry, government, and academia, with an estimated spend this year of $12.5 billion [idc, 2017]. A central driver has been the advent of deep learning techniques, which can learn task-specific representations of input data, obviating what used to be the most time-consuming development task: feature engineering. These learned representations are particularly effective for tasks like natural language processing and image analysis, which have high-dimensional, high-variance input that is impossible to fully capture with simple rules or hand-engineered features [Graves and Schmidhuber, 2005, Deng et al., 2009]. However, deep learning has a major upfront cost: these methods need massive training sets of labeled examples to learn from—often tens of thousands to millions to reach peak predictive performance [Sun et al., 2017]. Such training sets are enormously expensive to create, especially when domain expertise is required. For example, reading scientific papers, analyzing intelligence data, and interpreting medical images all require labeling by trained subject matter experts (SMEs).

Figure 2.1: In Snorkel, rather than labeling training data by hand, users write labeling functions, which programmatically label data points or abstain. These labeling functions have different unknown accuracies and correlations. Snorkel automatically models and combines their outputs using a generative model, then uses the resulting probabilistic labels to train a discriminative model.

Moreover, we observe from our engagements with collaborators like research labs and major technology companies that problem specifications (e.g., class definitions or granularities) tend to change as projects progress, necessitating re-labeling. Some big companies are able to absorb this cost, hiring large teams to label training data [Metz, 2016, Eadicicco, 2017, Davis et al., 2013]. Other practitioners utilize classic techniques like active learning [Settles, 2012], transfer learning [Pan and Yang, 2010], and semi-supervised learning [Chapelle et al., 2009] to reduce the number of training labels needed. However, the bulk of practitioners are increasingly turning to some form of weak supervision: cheaper sources of labels that are noisier or heuristic. The most popular form is distant supervision, in which the records of an external knowledge base are heuristically aligned with data points to produce noisy labels [Bunescu and Mooney, 2007, Mintz et al., 2009b, Alfonseca et al., 2012b]. Other forms include crowdsourced labels [Yuen et al., 2011, Quinn and Bederson, 2011], individual rules and heuristics for labeling data [Zhang et al., 2017, Rekatsinas et al., 2017a], and others [Zaidan and Eisner, 2008b, Liang et al., 2009b, Mann and McCallum, 2010, Stewart and Ermon, 2017]. While these sources are inexpensive, they often have limited accuracy and coverage.

Ideally, we would combine the labels from many weak supervision sources to increase the accuracy and coverage of our training set. However, two key challenges arise in doing so effectively. First, sources will overlap and conflict, and to resolve their conflicts we need to estimate their accuracies and correlation structure, without access to ground truth. Second, we need to pass on critical lineage information about label quality to the end model being trained.

Example 2.1.1. In Figure 2.2, we obtain labels from a high accuracy, low coverage Source 1, and from a low accuracy, high coverage Source 2, which overlap and disagree (split-color points). If we take an unweighted majority vote to resolve conflicts, we end up with null (tie-vote) labels. If we could correctly estimate the source accuracies, we would resolve conflicts in the direction of Source 1. We would still need to pass this information on to the end model being trained. Suppose that we took labels from Source 1 where available, and otherwise took labels from Source 2. Then, the expected training set accuracy would be 60.3%—only marginally better than the weaker source. Instead we should represent training label lineage in end model training, weighting labels generated by high-accuracy sources more.

Figure 2.2: In Example 2.1.1, training data is labeled by sources of differing accuracy and coverage. Two key challenges arise in using this weak supervision effectively. First, we need a way to estimate the unknown source accuracies to resolve disagreements. Second, we need to pass on this critical lineage information to the end model being trained.
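For reference, the 60.3% figure follows from a weighted average over the two sources' coverage, assuming the 1k points labeled by Source 1 fall within the 100k points labeled by Source 2:

\[
\frac{1{,}000 \times 0.90 + 99{,}000 \times 0.60}{100{,}000} = \frac{60{,}300}{100{,}000} = 0.603 .
\]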

In recent work, we developed data programming as a paradigm for addressing both of these challenges by modeling multiple label sources without access to ground truth, and generating probabilistic training labels representing the lineage of the individual labels. We prove that, surprisingly, we can recover source accuracy and correlation structure without hand-labeled training data [Ratner et al., 2016a, Bach et al., 2017]. However, there are many practical aspects of implementing and applying this abstraction that have not been previously considered. We present Snorkel, the first end-to-end system for combining weak supervision sources to rapidly create training data. We built Snorkel as a prototype to study how people could use data programming, a fundamentally new approach to building machine learning applications. Through weekly hackathons and office hours held at Stanford University over the past year, we have interacted with a growing user community around Snorkel’s open source implementation.1 We have observed SMEs in industry, science, and government deploying Snorkel for knowledge base construction, image analysis, bioinformatics, fraud detection, and more. From this experience, we have distilled three principles that have shaped Snorkel’s design:

1. Bring All Sources to Bear: The system should enable users to opportunistically use labels from all available weak supervision sources.

2. Training Data as the Interface to ML: The system should model label sources to produce a single, probabilistic label for each data point and train any of a wide range of classifiers to generalize beyond those sources.

3. Supervision as Interactive Programming: The system should provide rapid results in response to user supervision. We envision weak supervision as the REPL-like interface for machine learning.

1http://snorkel.stanford.edu

Figure 2.3: An overview of the Snorkel system. (1) SME users write labeling functions (LFs) that express weak supervision sources like distant supervision, patterns, and heuristics. (2) Snorkel applies the LFs over unlabeled data and learns a generative model to combine the LFs’ outputs into probabilistic labels. (3) Snorkel uses these labels to train a discriminative classification model, such as a deep neural network.

Our work makes the following technical contributions:

A Flexible Interface for Sources: We observe that the heterogeneity of weak supervision strategies is a stumbling block for developers. Different types of weak supervision operate on different scopes of the input data. For example, distant supervision has to be mapped programmatically to specific spans of text. Crowd workers and weak classifiers often operate over entire documents or images. Heuristic rules are open ended; they can leverage information from multiple contexts simultaneously, such as combining information from a document’s title, named entities in the text, and knowledge bases. This heterogeneity was cumbersome enough to completely block users of early versions of Snorkel. To address this challenge, we built an interface layer around the abstract concept of a labeling function (LF). We developed a flexible language for expressing weak supervision strategies and supporting data structures. We observed accelerated user productivity with these tools, which we validated in a user study where SMEs build models 2.8× faster and increase predictive performance an average 45.5% versus seven hours of hand labeling.

Tradeoffs in Modeling of Sources: Snorkel learns the accuracies of weak supervision sources without access to ground truth using a generative model [Ratner et al., 2016a]. Furthermore, it also learns correlations and other statistical dependencies among sources, correcting for dependencies in labeling functions that skew the estimated accuracies [Bach et al., 2017]. This paradigm gives rise to previously unexplored tradeoff spaces between predictive performance and speed. The natural first question is: when does modeling the accuracies of sources improve predictive performance? Further, how many dependencies, such as correlations, are worth modeling? We study the tradeoffs between predictive performance and training time in generative models for weak supervision. While modeling source accuracies and correlations will not hurt predictive performance, we present a theoretical analysis of when a simple majority vote will work just as well. Based on our conclusions, we introduce an optimizer for deciding when to model accuracies of labeling functions, and when learning can be skipped in favor of a simple majority vote. Further, our optimizer automatically decides which correlations to model among labeling functions. This optimizer correctly predicts the advantage of generative modeling over majority vote to within 2.16 accuracy points on average on our evaluation tasks and accelerates pipeline executions by up to 1.8×. It also enables us to gain 60%–70% of the benefit of correlation learning while saving up to 61% of training time (34 minutes per execution).

First End-to-End System for Data Programming: Snorkel is the first system to implement our recent work on data programming [Ratner et al., 2016a, Bach et al., 2017]. Previous ML systems that we and others developed [Zhang et al., 2017] required extensive feature engineering and model specification, leading to confusion about where to inject relevant domain knowledge. While programming weak supervision seems superficially similar to feature engineering, we observe that users approach the two processes very differently. Our vision—weak supervision as the sole port of interaction for machine learning—implies radically different workflows, requiring a proof of concept. Snorkel demonstrates that this paradigm enables users to develop high-quality models for a wide range of tasks. We report on two deployments of Snorkel, in collaboration with the U.S. Department of Veterans Affairs and Stanford Hospital and Clinics, and the U.S. Food and Drug Administration, where Snorkel improves over heuristic baselines by an average 110%. We also report results on four open-source datasets that are representative of other Snorkel deployments, including bioinformatics, medical image analysis, and crowdsourcing; on which Snorkel beats heuristics by an average 153% and comes within an average 3.60% of the predictive performance of large hand-curated training sets.

2.2 Snorkel Architecture

Snorkel’s workflow is designed around data programming [Ratner et al., 2016a, Bach et al., 2017], a fundamentally new paradigm for training machine learning models using weak supervision, and proceeds in three main stages (Figure 2.3):

1. Writing Labeling Functions: Rather than hand-labeling training data, users of Snorkel write labeling functions, which allow them to express various weak supervision sources such as patterns, heuristics, external knowledge bases, and more. This was the component most informed by early interactions (and mistakes) with users over the last year of deployment, and we present a flexible interface and supporting data model.

2. Modeling Accuracies and Correlations: Next, Snorkel automatically learns a generative model over the labeling functions, which allows it to estimate their accuracies and correlations. This step uses no ground-truth data, learning instead from the agreements and disagreements of the labeling functions. We observe that this step improves end predictive performance 5.81% over Snorkel with unweighted label combination, and anecdotally that it streamlines the user development experience by providing actionable feedback about labeling function quality.

3. Training a Discriminative Model: The output of Snorkel is a set of probabilistic labels that can be used to train a wide variety of state-of-the-art machine learning models, such as popular deep learning models. While the generative model is essentially a re-weighted combination of the user-provided labeling functions—which tend to be precise but low-coverage—modern discriminative models can retain this precision while learning to generalize beyond the labeling functions, increasing coverage and robustness on unseen data.
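The three stages can be pictured with the following minimal, runnable sketch on toy data. The unweighted-average combiner is only a stand-in for the generative model of Section 2.2.2, and no end model is trained; it is meant to show the shape of the pipeline, not Snorkel's implementation.

# A minimal, runnable sketch of Snorkel's three-stage workflow on toy data.
# The unweighted-average combiner below is only a stand-in for the generative
# model of Section 2.2.2, and no end model is trained here (Section 2.2.3).

def lf_keyword_causes(sentence):                  # stage (1): labeling functions
    return 1 if "causes" in sentence else 0       # 0 = abstain

def lf_keyword_not(sentence):
    return -1 if "not associated" in sentence else 0

sentences = [
    "magnesium causes quadriplegia",
    "the drug is not associated with headaches",
    "aspirin was administered",
]
lfs = [lf_keyword_causes, lf_keyword_not]

# stage (1): apply LFs over unlabeled data -> label matrix (m x n)
label_matrix = [[lf(s) for lf in lfs] for s in sentences]

# stage (2): combine LF outputs into one probabilistic label per data point.
# (Stand-in: unweighted average mapped to [0, 1]; Snorkel learns LF accuracies.)
def combine(row):
    votes = [v for v in row if v != 0]
    return 0.5 if not votes else (sum(votes) / len(votes) + 1) / 2

probabilistic_labels = [combine(row) for row in label_matrix]
print(probabilistic_labels)   # -> [1.0, 0.0, 0.5]

# stage (3): these soft labels would now be used to train any discriminative
# model (e.g., a deep neural network) with a noise-aware loss.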

Next we set up the problem Snorkel addresses and describe its main components and design decisions.

Setup: Our goal is to learn a parameterized classification model hθ that, given a data point x ∈ X, predicts its label y ∈ Y, where the set of possible labels Y is discrete. For simplicity, we focus on the binary setting Y = {−1, 1}, though we include a multi-class application in our experiments. For example, x might be a medical image, and y a label indicating normal versus abnormal. In the relation extraction examples we look at, we often refer to x as a candidate. In a traditional supervised learning setup, we would learn hθ by fitting it to a training set of labeled data points. However, in our setting, we assume that we only have access to unlabeled data for training. We do assume access to a small set of labeled data used during development, called the development set, and a blind, held-out labeled test set for evaluation. These sets can be orders of magnitude smaller than a training set, making them economical to obtain.

The user of Snorkel aims to generate training labels by providing a set of labeling functions, which are black-box functions, λ : X → Y ∪ {∅}, that take in a data point and output a label, where we use ∅ to denote that the labeling function abstains. Given m unlabeled data points and n labeling functions, Snorkel applies the labeling functions over the unlabeled data to produce a matrix of labeling function outputs Λ ∈ (Y ∪ {∅})^{m×n}. The goal of the remaining Snorkel pipeline is to synthesize this label matrix Λ—which may contain overlapping and conflicting labels for each data point—into a single vector of probabilistic training labels Ỹ = (ỹ1, ..., ỹm), where ỹi ∈ [0, 1]. These training labels can then be used to train a discriminative model. Next, we introduce the running example of a text relation extraction task as a proxy for many real-world knowledge base construction and data analysis tasks:

Example 2.2.1. Consider the task of extracting mentions of adverse chemical-disease relations from the biomedical literature (see CDR task, Section 2.4.1). Given documents with mentions of chemicals and diseases tagged, we refer to each co-occurring (chemical, disease) mention pair as a candidate extraction, which we view as a data point to be classified as either true or false. For example, in Figure 2.2, we would have two candidates with true labels y1 = True and y2 = False:

x_1 = Causes("magnesium", "quadriplegic")
x_2 = Causes("magnesium", "preeclampsia")

Data Model: A design challenge is managing complex, unstructured data in a way that enables SMEs to write labeling functions over it. In Snorkel, input data is stored in a context hierarchy. It is made up of context types connected by parent/child relationships, which are stored in a relational database and made available via an object-relational mapping (ORM) layer built with SQLAlchemy.2 Each context type represents a conceptual component of data to be processed by the system or used when writing labeling functions; for example a document, an image, a paragraph, a sentence, or an embedded table. Candidates—i.e., data points x—are then defined as tuples of contexts (Figure 2.4).

Figure 2.4: Labeling functions take as input a Candidate object, representing a data point to be classified. Each Candidate is a tuple of Context objects, which are part of a hierarchy representing the local context of the Candidate.

Example 2.2.2. In our running CDR example, the input documents can be represented in Snorkel as a hierarchy consisting of Documents, each containing one or more Sentences, each containing one or more Spans of text. These Spans may also be tagged with metadata, such as Entity markers identifying them as chemical or disease mentions (Figure 2.4). A candidate is then a tuple of two Spans.
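The context hierarchy and candidate abstraction can be pictured with a few plain data classes. This is a simplified sketch for illustration only, not Snorkel's actual ORM-backed Context classes.

# Simplified sketch of the context hierarchy (not Snorkel's actual ORM classes).
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Document:
    name: str
    sentences: List["Sentence"]

@dataclass
class Sentence:
    words: List[str]

@dataclass
class Span:
    sentence: Sentence
    start: int           # word offsets into the parent sentence
    end: int
    entity_type: str     # e.g., "Chemical" or "Disease"

    def get_word_range(self) -> Tuple[int, int]:
        return self.start, self.end

# A candidate is a tuple of contexts, here a (chemical, disease) Span pair.
Candidate = Tuple[Span, Span]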

2.2.1 A Language for Weak Supervision

Snorkel uses the core abstraction of a labeling function to allow users to specify a wide range of weak supervision sources such as patterns, heuristics, external knowledge bases, crowdsourced labels, and more. This higher-level, less precise input is more efficient to provide (see Section 3.6), and can be automatically denoised and synthesized, as described in subsequent sections. In this section, we describe our design choices in building an interface for writing labeling functions, which we envision as a unifying programming language for weak supervision. These choices were informed to a large degree by our interactions—primarily through weekly office hours—with Snorkel users in bioinformatics, defense, industry, and other areas over the past year.3 For example, while we initially intended to have a more complex structure for labeling functions, with manually specified types and correlation structure, we quickly found that simplicity in this respect was critical to usability (and not empirically detrimental to our ability to model their outputs). We also quickly discovered that users wanted either far more expressivity or far less of it, compared to our first library of function templates. We thus trade off expressivity and efficiency by allowing users to write labeling functions at two levels of abstraction: custom Python functions and declarative operators.

2https://www.sqlalchemy.org/
3http://snorkel.stanford.edu#users

Hand-Defined Labeling Functions: In its most general form, a labeling function is just an arbitrary snippet of code, usually written in Python, which accepts as input a Candidate object and either outputs a label or abstains. Often these functions are similar to extract-transform-load scripts, expressing basic patterns or heuristics, but may use supporting code or resources and be arbitrarily complex. Writing labeling functions by hand is supported by the ORM layer, which maps the context hierarchy and associated metadata to an object-oriented syntax, allowing the user to easily traverse the structure of the input data.

Example 2.2.3. In our running example, we can write a labeling function that checks if the word "causes" appears between the chemical and disease mentions. If it does, it outputs True if the chemical mention is first and False if the disease mention is first. If "causes" does not appear, it outputs None, indicating abstention:

def LF_causes(x):
    cs, ce = x.chemical.get_word_range()
    ds, de = x.disease.get_word_range()
    if ce < ds and "causes" in x.parent.words[ce+1:ds]:
        return True
    if de < cs and "causes" in x.parent.words[de+1:cs]:
        return False
    return None

We could also write this with Snorkel’s declarative interface:

LF_causes = lf_search("{{1}}.*\Wcauses\W.*{{2}}", reverse_args=False)
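A pattern-based template like lf_search could be implemented in a few lines. The sketch below is a hypothetical implementation under assumed candidate attributes (parent_text, arg1, arg2); it is not Snorkel's actual declarative operator.

import re

# Hypothetical sketch of a pattern-based labeling function template.
# It returns an LF that labels True if the pattern matches with the arguments
# in their given order, False if it matches with them reversed, and abstains
# otherwise. The candidate attributes used here are assumptions.
def lf_search(pattern, reverse_args=False):
    def lf(x):
        text = x.parent_text   # assumed: the text of the candidate's sentence
        fwd = pattern.replace("{{1}}", re.escape(x.arg1)).replace("{{2}}", re.escape(x.arg2))
        rev = pattern.replace("{{1}}", re.escape(x.arg2)).replace("{{2}}", re.escape(x.arg1))
        if re.search(fwd, text):
            return False if reverse_args else True
        if re.search(rev, text):
            return True if reverse_args else False
        return None            # abstain
    return lf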

Declarative Labeling Functions: Snorkel includes a library of declarative operators that encode the most common weak supervision function types, based on our experience with users over the last year. The semantics and syntax of these operators is simple and easily-customizable, consisting of two main types: (i) labeling function templates, which are simply functions that take one or more arguments and output a single labeling function; and (ii) labeling function generators, which take one or more arguments and output a set of labeling functions (described below). These functions capture a range of common forms of weak supervision, for example:

• Pattern-based: Pattern-based heuristics embody the motivation of soliciting higher information density input from SMEs. For example, pattern-based heuristics encompass feature annotations [Zaidan and Eisner, 2008b] and pattern-bootstrapping approaches [Hearst, 1992, Gupta and Manning, 2014] (Example 2.2.3).

• Distant supervision: Distant supervision generates training labels by heuristically aligning data points with an external knowledge base, and is one of the most popular forms of weak supervision [Mintz et al., 2009b, Alfonseca et al., 2012b, Hoffmann et al., 2011a]. CHAPTER 2. WEAK SUPERVISION FROM CODE 12

• Weak classifiers: Classifiers that are insufficient for our task—e.g., limited coverage, noisy, biased, and/or trained on a different dataset—can be used as labeling functions.

• Labeling function generators: One higher-level abstraction that we can build on top of labeling functions in Snorkel is labeling function generators, which generate multiple labeling functions from a single resource, such as crowdsourced labels and distant supervision from structured knowledge bases (Example 2.2.4).

Example 2.2.4. A challenge in traditional distant supervision is that different subsets of knowledge bases have different levels of accuracy and coverage. In our running example, we can use the Comparative Toxicogenomics Database (CTD)4 as distant supervision, separately modeling different subsets of it with separate labeling functions. For example, we might write one labeling function to label a candidate True if it occurs in the “Causes” subset, and another to label it False if it occurs in the “Treats” subset. We can write this using a labeling function generator,

LFs_CTD = Ontology(ctd, {"Causes": True, "Treats": False})

which creates two labeling functions. In this way, generators can be connected to large resources and create hundreds of labeling functions with a line of code.
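The generator pattern itself is simple to sketch. The following illustrative implementation assumes a knowledge base represented as a dictionary mapping subset names to sets of (chemical, disease) string pairs, and candidates with chemical_text and disease_text attributes; it is not Snorkel's actual Ontology operator.

# Illustrative sketch of a labeling function generator (not Snorkel's Ontology).
# `kb` is assumed to map subset names to sets of (chemical, disease) pairs.
def ontology_lfs(kb, subset_labels):
    lfs = []
    for subset, label in subset_labels.items():
        pairs = kb[subset]
        def lf(x, pairs=pairs, label=label):       # bind loop variables
            if (x.chemical_text, x.disease_text) in pairs:
                return label
            return None                            # abstain
        lf.__name__ = f"LF_{subset}"
        lfs.append(lf)
    return lfs

# Usage, mirroring the example above:
# LFs_CTD = ontology_lfs(ctd, {"Causes": True, "Treats": False})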

Figure 2.5: Labeling functions expressing pattern-matching, heuristic, and distant supervision approaches, respectively, in Snorkel’s Jupyter notebook interface, for the Spouses example. Full code is available in Snorkel’s Intro tutorial.5

4http://ctdbase.org/
5https://github.com/HazyResearch/snorkel/tree/master/tutorials/intro

Figure 2.6: The data viewer utility in Snorkel, showing candidate spouse relation mentions from the Spouses example, composed of person-person mention pairs.

Interface Implementation: Snorkel’s interface is designed to be accessible to subject matter expert (SME) users without advanced programming skills. All components run in Jupyter iPython notebooks,6 including writing labeling functions.7 Users can therefore write labeling functions as arbitrary Python functions for maximum flexibility (Figure 2.5). We also provide a library of labeling function primitives and generators to more declaratively program weak supervision, and a viewer utility (Figure 2.6) that displays candidates, and also supports annotation, e.g., for constructing a small held-out test set for end evaluation.

Execution Model: Since labeling functions operate on discrete candidates that are labeled independently, their execution is embarrassingly parallel. If Snorkel is connected to a relational database that supports simultaneous connections, e.g., PostgreSQL, then the master process (usually the notebook kernel) distributes the primary keys of the candidates to be labeled to Python worker processes. The workers independently read from the database to materialize the candidates via the ORM layer, then execute the labeling functions over them. The labels are returned to the master process which persists them via the ORM layer. Collecting the labels at the master is more efficient than having workers write directly to the database, due to table-level locking.

Snorkel includes a Spark8 integration layer, enabling labeling functions to be run across a cluster. Once the set of candidates is cached as a Spark data frame, only the closure of the labeling functions and the resulting labels need to be communicated to and from the workers. This is particularly helpful in Snorkel’s iterative workflow. Distributing a large unstructured data set across a cluster is relatively expensive, but only has to be performed once. Then, as users refine their labeling functions, they can be rerun efficiently.

6http://jupyter.org/
7Note that all code is open source and available—with tutorials, blog posts, workshop lectures, and other material—at snorkel.stanford.edu.
8https://spark.apache.org/
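Because each candidate is labeled independently, the labeling step parallelizes with standard tools. The sketch below is a simplified stand-in for the database-backed master/worker execution described above, using Python's multiprocessing module and in-memory toy data.

# Simplified sketch of parallel labeling function execution (no database/ORM).
from multiprocessing import Pool

def lf_causes(sentence):
    return 1 if "causes" in sentence else 0

def lf_negation(sentence):
    return -1 if "no evidence" in sentence else 0

LFS = [lf_causes, lf_negation]          # module-level so workers can see them

def label_one(sentence):
    # Each worker labels a candidate independently with every LF.
    return [lf(sentence) for lf in LFS]

if __name__ == "__main__":
    candidates = ["magnesium causes quadriplegia",
                  "no evidence links aspirin to headaches"] * 1000
    with Pool(processes=4) as pool:
        label_matrix = pool.map(label_one, candidates, chunksize=100)
    print(len(label_matrix), label_matrix[0])    # -> 2000 [1, 0]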

2.2.2 Generative Model

The core operation of Snorkel is modeling and integrating the noisy signals provided by a set of labeling functions. Using the recently proposed approach of data programming [Ratner et al., 2016a, Bach et al., 2017], we model the true class label for a data point as a latent variable in a probabilistic model. In the simplest case, we model each labeling function as a noisy “voter” which is independent—i.e., makes errors that are uncorrelated with the other labeling functions. This defines a generative model of the votes of the labeling functions as noisy signals about the true label.

We can also model statistical dependencies between the labeling functions to improve predictive performance. For example, if two labeling functions express similar heuristics, we can include this dependency in the model and avoid a “double counting” problem. We observe that such pairwise correlations are the most common, so we focus on them in this paper (though handling higher order dependencies is straightforward). We use our structure learning method for generative models [Bach et al., 2017] to select a set C of labeling function pairs (j, k) to model as correlated (see Section 2.3.2).

Now we can construct the full generative model as a factor graph. We first apply all the labeling functions to the unlabeled data points, resulting in a label matrix Λ, where Λi,j = λj(xi). We then encode the generative model pw(Λ, Y) using three factor types, representing the labeling propensity, accuracy, and pairwise correlations of labeling functions:

\[
\phi^{\mathrm{Lab}}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} \neq \emptyset\}, \qquad
\phi^{\mathrm{Acc}}_{i,j}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = y_i\}, \qquad
\phi^{\mathrm{Corr}}_{i,j,k}(\Lambda, Y) = \mathbb{1}\{\Lambda_{i,j} = \Lambda_{i,k}\}, \quad (j, k) \in C
\]

For a given data point xi, we define the concatenated vector of these factors for all the labeling functions j = 1, ..., n and potential correlations C as φi(Λ, Y), and the corresponding vector of parameters w ∈ R^{2n+|C|}. This defines our model:

\[
p_w(\Lambda, Y) = Z_w^{-1} \exp\left( \sum_{i=1}^{m} w^{T} \phi_i(\Lambda, y_i) \right),
\]
where Zw is a normalizing constant. To learn this model without access to the true labels Y, we minimize the negative log marginal likelihood given the observed label matrix Λ:

\[
\hat{w} = \arg\min_{w} \, -\log \sum_{Y} p_w(\Lambda, Y).
\]
We optimize this objective by interleaving stochastic gradient descent steps with Gibbs sampling ones, similar to contrastive divergence [Hinton, 2002]; for more details, see [Ratner et al., 2016a, Bach et al., 2017].

We use the Numbskull library (https://github.com/HazyResearch/numbskull), a Python NUMBA-based Gibbs sampler. We then use the predictions,

\[
\tilde{Y} = p_{\hat{w}}(Y \mid \Lambda),
\]
as probabilistic training labels.
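To make the role of the learned parameters concrete, the following sketch computes probabilistic labels under the independent, accuracy-only special case of this model, where P(y_i | Λ_i) is proportional to exp of the summed weights of the labeling functions that voted for y_i. It is a simplified stand-in for the full factor graph (it ignores correlation factors and the parameter learning step itself), with labels encoded as {-1, 0, +1}.

    import numpy as np

    def probabilistic_labels(L, w):
        # L: (m, n) label matrix with entries in {-1, 0, +1}; 0 denotes an abstention.
        # w: (n,) accuracy weights (log-odds) for the n labeling functions.
        # Returns P(y = +1 | Lambda_i) for each data point under the independent model.
        score_pos = (L == 1) @ w   # total weight of labeling functions voting +1
        score_neg = (L == -1) @ w  # total weight of labeling functions voting -1
        return np.exp(score_pos) / (np.exp(score_pos) + np.exp(score_neg))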

2.2.3 Discriminative Model

The end goal in Snorkel is to train a model that generalizes beyond the information expressed in the labeling functions. We train a discriminative model hθ on our probabilistic labels Y˜ by minimizing a noise-aware variant of the loss l(hθ(xi), y), i.e., the expected loss with respect to Y˜:

\[
\hat{\theta} = \arg\min_{\theta} \sum_{i=1}^{m} \mathbb{E}_{y \sim \tilde{Y}}\left[ l(h_{\theta}(x_i), y) \right].
\]
A formal analysis shows that as we increase the amount of unlabeled data, the generalization error of discriminative models trained with Snorkel will decrease at the same asymptotic rate as traditional supervised learning models do with additional hand-labeled data [Ratner et al., 2016a], allowing us to increase predictive performance by adding more unlabeled data. Intuitively, this property holds because as more data is provided, the discriminative model sees more features that co-occur with the heuristics encoded in the labeling functions.
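As an illustration, in the binary case this noise-aware objective is simply an expected log loss with respect to the probabilistic labels. The sketch below is a generic NumPy version under that assumption, where y_tilde holds P(y = +1) from the generative model and p holds the discriminative model's predicted probabilities.

    import numpy as np

    def noise_aware_loss(p, y_tilde, eps=1e-8):
        # p:       (m,) predicted P(y = +1 | x_i) from the discriminative model
        # y_tilde: (m,) probabilistic training labels P(y = +1 | Lambda_i)
        # Expected log loss, taking the expectation over y ~ y_tilde for each point.
        return -np.mean(y_tilde * np.log(p + eps) + (1 - y_tilde) * np.log(1 - p + eps))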

Example 2.2.5. The CDR data contains the sentence, “Myasthenia gravis presenting as weakness after magnesium administration.” None of the 33 labeling functions we developed vote on the corresponding Causes(magnesium, myasthenia gravis) candidate, i.e., they all abstain. However, a deep neural network trained on probabilistic training labels from Snorkel correctly identifies it as a true mention.

Snorkel provides connectors for popular machine learning libraries such as TensorFlow [Abadi et al., 2015], allowing users to exploit commodity models like deep neural networks that do not require hand-engineering of features and have robust predictive performance across a wide range of tasks.

2.3 Weak Supervision Tradeoffs

We study the fundamental question of when—and at what level of complexity—we should expect Snorkel’s generative model to yield the greatest predictive performance gains. Understanding these performance regimes can help guide users, and introduces a tradeoff space between predictive performance and speed. We characterize this space in two parts: first, by analyzing when the generative model can be approximated by an unweighted majority vote, and second, by automatically selecting the complexity of the correlation structure to model. We then introduce a two-stage, rule-based optimizer to support fast development cycles.


2.3.1 Modeling Accuracies

The natural first question when studying systems for weak supervision is, “When does modeling the accuracies of sources improve end-to-end predictive performance?” We study that question in this subsection and propose a heuristic to identify settings in which this modeling step is most beneficial.

Tradeoff Space

We start by considering the label density dΛ of the label matrix Λ, defined as the mean number of non-abstention labels per data point. In the low-density setting, sparsity of labels will mean that there is limited room for even an optimal weighting of the labeling functions to diverge much from the majority vote. Conversely, as the label density grows, known theory confirms that the majority vote will eventually be optimal [Li et al., 2013]. It is the middle-density regime where we expect to most benefit from applying the generative model. We start by defining a measure of the benefit of weighting the labeling functions by their true accuracies—in other words, the predictions of a perfectly estimated generative model—versus an unweighted majority vote:

Definition 1. (Modeling Advantage) Let the weighted majority vote of n labeling functions on data point xi be denoted as \(f_w(\Lambda_i) = \sum_{j=1}^{n} w_j \Lambda_{i,j}\), and the unweighted majority vote (MV) as \(f_1(\Lambda_i) = \sum_{j=1}^{n} \Lambda_{i,j}\), where we consider the binary classification setting and represent an abstaining vote as 0. We define the modeling advantage Aw as the improvement in accuracy of fw over f1 for a dataset:

\[
A_w(\Lambda, y) = \frac{1}{m} \sum_{i=1}^{m} \Big( \mathbb{1}\{ y_i f_w(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \leq 0 \} \;-\; \mathbb{1}\{ y_i f_w(\Lambda_i) \leq 0 \,\wedge\, y_i f_1(\Lambda_i) > 0 \} \Big)
\]

In other words, Aw is the number of times fw correctly disagrees with f1 on a label, minus the number of times it incorrectly disagrees. Let the optimal advantage A* = Aw* be the advantage using the optimal weights w* (WMV*).
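The quantities above are simple to evaluate empirically. The sketch below computes the label density dΛ and the modeling advantage Aw for a given label matrix, ground-truth vector, and weight vector, using the same {-1, 0, +1} label encoding as before.

    import numpy as np

    def modeling_advantage(L, y, w):
        # L: (m, n) label matrix in {-1, 0, +1}; y: (m,) true labels in {-1, +1};
        # w: (n,) labeling function weights (use np.ones(n) for the unweighted vote).
        f_w = L @ w                            # weighted majority vote scores
        f_1 = L.sum(axis=1)                    # unweighted majority vote scores
        density = (L != 0).sum(axis=1).mean()  # label density d_Lambda
        gained = np.mean((y * f_w > 0) & (y * f_1 <= 0))  # WMV correct, MV not
        lost = np.mean((y * f_w <= 0) & (y * f_1 > 0))    # MV correct, WMV not
        return gained - lost, density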

Additionally, let:

\[
\bar{\alpha}^{*} = \frac{1}{n} \sum_{j=1}^{n} \alpha^{*}_{j} = \frac{1}{n} \sum_{j=1}^{n} \frac{1}{1 + \exp(-w^{*}_{j})}
\]
be the average accuracies of the labeling functions. To build intuition, we start by analyzing the optimal advantage for three regimes of label density (see Figure 2.7):

Figure 2.7: A plot of the modeling advantage, i.e., the improvement in label accuracy from the generative model, as a function of the number of labeling functions (equivalently, the label density) on a synthetic dataset (a class-balanced dataset of m = 1000 data points with binary labels, and n independent labeling functions with average accuracy 75% and a fixed 10% probability of voting). We plot the advantage obtained by a learned generative model (GM), Aw; by an optimal model, A*; the upper bound Ã* used in our optimizer; and the low-density bound (Proposition 1). The low- and high-density regimes favor majority vote (MV), while the mid-density regime favors the generative model (GM).

Low Label Density: In this sparse setting, very few data points have more than one non-abstaining label; only a small number have multiple conflicting labels. We have observed this occurring, for example, in the early stages of application development. We see that with non-adversarial labeling functions (w*_j > 0), even an optimal generative model (WMV*) can only disagree with MV when there are disagreeing labels, which will occur infrequently. We see that the expected optimal advantage will have an upper bound that falls quadratically with label density:

Proposition 1. (Low-Density Upper Bound) Assume that P(Λi,j ≠ 0) = pl ∀i, j, and w*_j > 0 ∀j. Then, the expected label density is d̄ = npl, and

\[
\mathbb{E}_{\Lambda, y, w^{*}}\left[ A^{*} \right] \;\leq\; \bar{d}^{\,2}\, \bar{\alpha}^{*} \left( 1 - \bar{\alpha}^{*} \right) \tag{2.1}
\]

Proof Sketch: We bound the advantage above by computing the expected number of pairwise disagreements; for details, see the Appendix of the extended online version (https://arxiv.org/abs/1711.10160).

High Label Density: In this setting, the majority of the data points have a large number of labels. For example, we might be working in an extremely high-volume crowdsourcing setting, or an application with many high-coverage knowledge bases as distant supervision. Under modest assumptions—namely, that the average labeling function accuracy ᾱ* is greater than 50%—it is known that the majority vote converges exponentially to an optimal solution as the average label density d̄ increases, which serves as an upper bound for the expected optimal advantage as well:

Table 2.1: Modeling advantage Aw attained using a generative model for several applications in Snorkel (Section 2.4.1), the upper bound Ã* used by our optimizer, the modeling strategy selected by the optimizer—either majority vote (MV) or generative model (GM)—and the empirical label density dΛ.

Dataset     Aw (%)   Ã* (%)   Modeling Strategy   dΛ
Radiology   7.0      12.4     GM                  2.3
CDR         4.9      7.9      GM                  1.8
Spouses     4.4      4.6      GM                  1.4
Chem        0.1      0.3      MV                  1.2
EHR         2.8      4.8      GM                  1.2

Proposition 2. (High-Density Upper Bound) Assume that P(Λi,j ≠ 0) = pl ∀i, j, and that ᾱ* > 1/2. Then:

\[
\mathbb{E}_{\Lambda, y, w^{*}}\left[ A^{*} \right] \;\leq\; e^{-2 p_l \left( \bar{\alpha}^{*} - \frac{1}{2} \right)^{2} \bar{d}} \tag{2.2}
\]

Proof: This follows from an application of Hoeffding’s inequality; for details, see Appendix A.1.

Medium Label Density: In this middle regime, we expect that modeling the accuracies of the labeling functions will deliver the greatest gains in predictive performance because we will have many data points with a small number of disagreeing labeling functions. For such points, the estimated labeling function accuracies can heavily affect the predicted labels. We indeed see gains in the empirical results using an independent generative model that only includes accuracy factors φ^Acc_{i,j} (Table 2.1). Furthermore, the guarantees in [Ratner et al., 2016a] establish that we can learn the optimal weights, and thus approach the optimal advantage.

Automatically Choosing a Modeling Strategy

The bounds in the previous subsection imply that there are settings in which we should be able to safely skip modeling the labeling function accuracies, simply taking the unweighted majority vote instead. However, in practice, the overall label density dΛ is insufficiently precise to determine the transition points of interest, given a user time-cost tradeoff preference (characterized by the advantage tolerance parameter γ in Algorithm 1). We show this in Table 2.1 using our application data sets from Section 2.4.1. For example, we see that the Chem and EHR label matrices have equivalent label densities; however, modeling the labeling function accuracies has a much greater effect for EHR than for Chem.

Instead of simply considering the average label density dΛ, we develop a best-case heuristic based on the ratio of positive to negative labels for each data point. This heuristic serves as an upper bound on the true expected advantage, and thus we can use it to determine when we can safely skip training the generative model (see Algorithm 1). Let \(c_y(\Lambda_i) = \sum_{j=1}^{n} \mathbb{1}\{\Lambda_{i,j} = y\}\) be the count of labels of class y for xi, and assume that the true labeling function weights lie within a fixed range, wj ∈ [wmin, wmax], and have

a mean w̄ (in our implementation, we fix these at defaults of (wmin, w̄, wmax) = (0.5, 1.0, 1.5), which corresponds to assuming labeling functions have accuracies between 62% and 82%, with an average accuracy of 73%). Then, define:

\[
\Phi(\Lambda_i, y) = \mathbb{1}\{ c_y(\Lambda_i)\, w_{\max} \geq c_{-y}(\Lambda_i)\, w_{\min} \}
\]
\[
\tilde{A}^{*}(\Lambda) = \frac{1}{m} \sum_{i=1}^{m} \sum_{y \in \pm 1} \mathbb{1}\{ y f_1(\Lambda_i) \leq 0 \} \, \Phi(\Lambda_i, y) \, \sigma\!\left( 2 f_{\bar{w}}(\Lambda_i)\, y \right)
\]

where σ(·) is the sigmoid function, fw̄ is the majority vote with all weights set to the mean w̄, and Ã*(Λ) is the predicted modeling advantage used by our optimizer. Essentially, we are taking the expected counts of instances in which a weighted majority vote could possibly flip the incorrect predictions of the unweighted majority vote under best-case conditions, which is an upper bound for the expected advantage:

Proposition 3. (Optimizer Upper Bound) Assume that the labeling functions have accuracy parameters (log-odds weights) wj ∈ [wmin, wmax], and have E[w] = w̄. Then:

\[
\mathbb{E}_{y, w^{*}}\left[ A^{*} \mid \Lambda \right] \;\leq\; \tilde{A}^{*}(\Lambda) \tag{2.3}
\]

Proof Sketch: We upper-bound the modeling advantage by the expected number of instances in which WMV* is correct and MV is incorrect. We then upper-bound this by using the best-case probability of the weighted majority vote being correct given (wmin, wmax).
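This upper bound Ã*(Λ) is cheap to compute directly from the label matrix. The sketch below is one possible implementation under the default weight range (wmin, w̄, wmax) = (0.5, 1.0, 1.5) mentioned above, again with labels encoded in {-1, 0, +1}; it is an illustration, not the exact implementation.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def optimizer_upper_bound(L, w_min=0.5, w_bar=1.0, w_max=1.5):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns the heuristic A~*(Lambda).
        f_1 = L.sum(axis=1)  # unweighted majority vote score per data point
        total = 0.0
        for y in (+1, -1):
            c_y = (L == y).sum(axis=1)       # votes for class y
            c_other = (L == -y).sum(axis=1)  # votes for the opposite class
            phi = (c_y * w_max >= c_other * w_min)  # best case: WMV could flip to y
            mv_wrong = (y * f_1 <= 0)               # MV does not already predict y
            score = w_bar * (c_y - c_other)         # equals f_wbar(Lambda_i) * y
            total += np.sum(mv_wrong * phi * sigmoid(2 * score))
        return total / L.shape[0]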

We apply Ã* to a synthetic dataset and plot it in Figure 2.7. Next, we compute Ã* for the labeling matrices from experiments in Section 2.4.1, and compare with the empirical advantage of the trained generative models (Table 2.1). (Note that in Section 2.4, due to known negative class imbalance in relation extraction problems, we default to a negative value if majority vote yields a tie-vote label of 0; our reported F1 score metric therefore hides instances in which the generative model learns to correctly, or incorrectly, break ties. In Table 2.1, however, we do count such instances as improvements over majority vote, as these instances have an effect on the training of the end discriminative model: they yield additional training labels.) We see that our approximate quantity Ã* serves as a correct guide in all cases for determining which modeling strategy to select, which for the mature applications reported on is indeed most often the generative model. However, we see that while EHR and Chem have equivalent label densities, our optimizer correctly predicts that Chem can be modeled with majority vote, speeding up each pipeline execution by 1.8×.

Accelerating Initial Development Cycles

We find in our applications that the optimizer can save execution time especially during the initial cycles of iterative development. To illustrate this empirically, in Figure 2.8 we measure the modeling advantage of the generative model versus a majority vote of the labeling functions on increasingly large random subsets of the CDR labeling functions. We see that the modeling advantage grows as the number of labeling functions increases, and that our optimizer approximation closely tracks it; thus, the optimizer can save execution time by choosing to skip the generative model and run majority vote instead during the initial cycles of iterative development.


Figure 2.8: The predicted (Ã*) and actual (Aw) advantage of using the generative labeling model (GM) over majority vote (MV) on the CDR application as the number of LFs is increased. At 9 LFs, the optimizer switches from choosing MV to choosing GM; this leads to faster modeling in early development cycles, and more accurate results in later cycles.

Figure 2.9: Predictive performance of the generative model and number of learned correlations versus the correlation threshold ε. The selected elbow point achieves a good tradeoff between predictive performance and computational cost (linear in the number of correlations). Left: simulation of structure learning correcting the generative model. Middle: the CDR task. Right: all user study labeling functions for the Spouses task.


2.3.2 Modeling Structure

In this subsection, we consider modeling additional statistical structure beyond the independent model. We study the tradeoff between predictive performance and computational cost, and describe how to automatically select a good point in this tradeoff space.

Structure Learning We observe many Snorkel users writing labeling functions that are statistically dependent. Examples we have observed include:

• Functions that are variations of each other, such as checking for matches against similar regular expressions.

• Functions that operate on correlated inputs, such as raw tokens of text and their lemmatizations.

• Functions that use correlated sources of knowledge, such as distant supervision from overlapping knowledge bases.

Modeling such dependencies is important because they affect our estimates of the true labels. Consider the extreme case in which not accounting for dependencies is catastrophic:

Example 2.3.1. Consider a set of 10 labeling functions, where 5 are perfectly correlated, i.e., they vote the same way on every data point, and 5 are conditionally independent given the true label. If the correlated labeling functions have accuracy α = 50% and the uncorrelated ones have accuracy β = 99%, then the maximum likelihood estimate of their accuracies according to the independent model is αˆ = 100% and βˆ = 50%.

Specifying a generative model to account for such dependencies by hand is impractical for three reasons. First, it is difficult for non-expert users to specify these dependencies. Second, as users iterate on their labeling functions, their dependency structure can change rapidly, like when a user relaxes a labeling function to label many more candidates. Third, the dependency structure can be dataset specific, making it impossible to specify a priori, such as when a corpus contains many strings that match multiple regular expressions used in different labeling functions. We observed users of earlier versions of Snorkel struggling for these reasons to construct accurate and efficient generative models with dependencies. We therefore seek a method that can quickly identify an appropriate dependency structure from the labeling function outputs Λ alone. Naively, we could include all dependencies of interest, such as all pairwise correlations, in the generative model and perform parameter estimation. However, this approach is impractical. For 100 labeling functions and 10,000 data points, estimating parameters with all possible correlations takes roughly 45 minutes. When multiplied over repeated runs of hyperparameter searching and development cycles, this cost greatly inhibits labeling function development. We therefore turn to our method for automatically selecting which dependencies to model without access to ground truth [Bach et al., 2017]. It uses a pseudolikelihood estimator, which does not require any sampling or other approximations to compute the objective gradient exactly. It is much faster than maximum likelihood estimation, taking 15 seconds to select pairwise correlations to be modeled among 100 labeling functions with 10,000 data points. However, this approach relies on a selection threshold hyperparameter ε, which induces a tradeoff space between predictive performance and computational cost.
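As a rough illustration of the kind of output such structure selection produces, the sketch below picks candidate correlated pairs by thresholding the empirical correlation between labeling function output columns. This is a deliberately simplified stand-in for the pseudolikelihood-based method of [Bach et al., 2017], not that method itself, but it exposes the same threshold-style knob ε.

    import numpy as np
    from itertools import combinations

    def select_correlated_pairs(L, epsilon):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns pairs (j, k) of labeling
        # functions whose outputs are empirically correlated above the threshold.
        corr = np.corrcoef(L, rowvar=False)  # (n, n) correlation of LF output columns
        return [(j, k) for j, k in combinations(range(L.shape[1]), 2)
                if abs(corr[j, k]) >= epsilon]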

Tradeoff Space

Such structure learning methods, whether pseudolikelihood or likelihood-based, crucially depend on a selection threshold ε for deciding which dependencies to add to the generative model. Fundamentally, the choice of ε determines the complexity of the generative model (specifically, ε is both the coefficient of the ℓ1 regularization term used to induce sparsity, and the minimum absolute weight in log scale that a dependency must have to be selected). We study the tradeoff between predictive performance and computational cost that this induces.


We find that generally there is an “elbow point” beyond which the number of correlations selected—and thus the computational cost—explodes, and that this point is a safe tradeoff point between predictive performance and computation time.

Predictive Performance: At one extreme, a very large value of ε will not include any correlations in the generative model, making it identical to the independent model. As ε is decreased, correlations will be added. At first, when ε is still high, only the strongest correlations will be included. As these correlations are added, we observe that the generative model’s predictive performance tends to improve. Figure 2.9, left, shows the result of varying ε in a simulation where more than half the labeling functions are correlated. After adding a few key dependencies, the generative model resolves the discrepancies among the labeling functions. Figure 2.9, middle, shows the effect of varying ε for the CDR task. Predictive performance improves as ε decreases until the model overfits. Finally, we consider a large number of labeling functions that are likely to be correlated. In our user study (described in Section 2.4.2), participants wrote labeling functions for the Spouses task. We combined all 125 of their functions and studied the effect of varying ε. Here, we expect there to be many correlations since it is likely that users wrote redundant functions. We see in Figure 2.9, right, that structure learning surpasses the best performing individual’s generative model (50.0 F1).

Computational Cost: Computational cost is correlated with model complexity. Since learning in Snorkel is done with a Gibbs sampler, the overhead of modeling additional correlations is linear in the number of correlations. The dashed lines in Figure 2.9 show the number of correlations included in each model versus ε. For example, on the Spouses task, fitting the parameters of the generative model at ε = 0.5 takes 4 minutes, and fitting its parameters with ε = 0.02 takes 57 minutes. Further, parameter estimation is often run repeatedly during development for two reasons: (i) fitting generative model hyperparameters using a development set requires repeated runs, and (ii) as users iterate on their labeling functions, they must re-estimate the generative model to evaluate them.

Automatically Choosing a Model

Based on our observations, we seek to automatically choose a value of ε that trades off between predictive performance and computational cost using the labeling functions’ outputs Λ alone. Including ε as a hyperparameter in a grid search over a development set is generally not feasible because of its large effect on running time. We therefore want to choose ε before other hyperparameters, without performing any parameter estimation. We propose using the number of correlations selected at each value of ε as an inexpensive indicator. The dashed lines in Figure 2.9 show that as ε decreases, the number of selected correlations follows a pattern. Generally, the number of correlations grows slowly at first, then hits an “elbow point” beyond which the number explodes, which fits the assumption that the correlation structure is sparse. In all three cases, setting ε to this elbow point is a safe tradeoff between predictive performance and computational cost. In cases where performance grows consistently (left and right), the elbow point achieves most of the predictive performance gains at a small fraction of the computational cost.

For example, on Spouses (right), choosing ε = 0.08 achieves a score of 56.6 F1—within one point of the best score—but only takes 8 minutes for parameter estimation. In cases where predictive performance eventually degrades (middle), the elbow point also selects a relatively small number of correlations, giving a 0.7 F1 point improvement and avoiding overfitting. Performing structure learning for many settings of ε is inexpensive, especially since the search needs to be performed only once before tuning the other hyperparameters. On the large number of labeling functions in the Spouses task, structure learning for 25 values of ε takes 14 minutes. On CDR, with a smaller number of labeling functions, it takes 30 seconds. Further, if the search is started at a low value of ε and increased, it can often be terminated early, when the number of selected correlations reaches a low value. Selecting the elbow point itself is straightforward. We use the point with greatest absolute difference from its neighbors, but more sophisticated schemes can also be applied [Satopaa et al., 2011]. Our full optimization algorithm for choosing a modeling strategy and (if necessary) correlations is shown in Algorithm 1.

Algorithm 1: Modeling Strategy Optimizer
  Input: Label matrix Λ ∈ (Y ∪ {∅})^{m×n}, advantage tolerance γ, structure search resolution η
  Output: Modeling strategy
  if Ã*(Λ) < γ then
      return MV
  end if
  Structures ← []
  for i from 1 to 1/(2η) do
      ε ← i · η
      C ← LearnStructure(Λ, ε)
      Structures.append((|C|, ε))
  end for
  ε ← SelectElbowPoint(Structures)
  return GM
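A minimal Python rendering of Algorithm 1 might look like the following. Here learn_structure stands in for the structure learning procedure of [Bach et al., 2017] (for instance, a thresholded selection like the earlier sketch) and optimizer_upper_bound computes Ã*(Λ) as sketched above; both names, and the simple neighbor-difference elbow rule, are illustrative choices rather than the exact implementation.

    def select_elbow_point(structures):
        # structures: list of (num_correlations, epsilon) pairs ordered by epsilon.
        # Pick the interior point whose correlation count differs most from its neighbors.
        counts = [c for c, _ in structures]
        diffs = [abs(counts[i] - counts[i - 1]) + abs(counts[i + 1] - counts[i])
                 for i in range(1, len(counts) - 1)]
        return structures[1 + diffs.index(max(diffs))][1]

    def choose_modeling_strategy(L, gamma, eta, learn_structure):
        # Returns ("MV", None) or ("GM", epsilon), mirroring Algorithm 1.
        if optimizer_upper_bound(L) < gamma:
            return "MV", None
        structures = []
        for i in range(1, int(1 / (2 * eta)) + 1):
            epsilon = i * eta
            structures.append((len(learn_structure(L, epsilon)), epsilon))
        return "GM", select_elbow_point(structures)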

2.4 Evaluation

We evaluate Snorkel by drawing on deployments developed in collaboration with users. We report on two real-world deployments and four tasks on open-source data sets representative of other deployments. Our evaluation is designed to support the following three main claims:

• Snorkel outperforms distant supervision baselines. In distant supervision [Mintz et al., 2009b], one of the most popular forms of weak supervision used in practice, an external knowledge base is heuristically aligned with input data to serve as noisy training labels. By allowing users to easily incorporate a broader, more heterogeneous set of weak supervision sources—for example, pattern matching, structure-based, and other more complex heuristics—Snorkel exceeds models trained via distant supervision by an average of 132%.

Table 2.2: Number of labeling functions, fraction of positive labels (for binary classification tasks), number of training documents, and number of training candidates for each task.

Task        # LFs   % Pos.   # Docs   # Candidates
Chem        16      4.1      1,753    65,398
EHR         24      36.8     47,827   225,607
CDR         33      24.6     900      8,272
Spouses     11      8.3      2,073    22,195
Radiology   18      36.0     3,851    3,851
Crowd       102     -        505      505

• Snorkel approaches hand supervision. We see that by writing tens of labeling functions, we were able to approach or match results using hand-labeled training data which took weeks or months to assemble, coming within 2.11% of the F1 score of hand supervision on relation extraction tasks and an average 5.08% accuracy or AUC on cross-modal tasks, for an average 3.60% across all tasks.

• Snorkel enables a new interaction paradigm. We measure Snorkel’s efficiency and ease-of-use by reporting on a user study of biomedical researchers from across the U.S. These participants learned to write labeling functions to extract relations from news articles as part of a two-day workshop on learning to use Snorkel, and matched or outperformed models trained on hand-labeled training data, showing the efficiency of Snorkel’s process even for first-time users.

We now describe our results in detail. First, we describe the six applications that validate our claims. We then show that Snorkel’s generative modeling stage helps to improve the predictive performance of the discriminative model, demonstrating that it is 5.81% more accurate when trained on Snorkel’s probabilistic labels versus labels produced by an unweighted average of labeling functions. We also validate that the ability to incorporate many different types of weak supervision incrementally improves results with an ablation study. Finally, we describe the protocol and results of our user study.

2.4.1 Applications

To evaluate the effectiveness of Snorkel, we consider several real-world deployments and tasks on open-source datasets that are representative of other deployments in information extraction, medical image classification, and crowdsourced sentiment analysis. Summary statistics of the tasks are provided in Table 2.2.

Discriminative Models: One of the key bets in Snorkel’s design is that the trend of increasingly powerful, open-source machine learning tools (e.g., models, pre-trained word embeddings and initial layers, automatic tuners, etc.) will only continue to accelerate. To best take advantage of this, Snorkel creates probabilistic training labels for any discriminative model with a standard loss function.

Table 2.3: Evaluation of Snorkel on relation extraction tasks from text. Snorkel’s generative and discriminative models consistently improve over distant supervision, measured in F1, the harmonic mean of precision (P) and recall (R). We compare with hand-labeled data when available, coming within an average of 1 F1 point.

            Distant Supervision       Snorkel (Gen.)
Task        P      R      F1          P      R      F1     Lift
Chem        11.2   41.2   17.6        78.6   21.6   33.8   +16.2
EHR         81.4   64.8   72.2        77.1   72.9   74.9   +2.7
CDR         25.5   34.8   29.4        52.3   30.4   38.5   +9.1
Spouses     9.9    34.8   15.4        53.5   62.1   57.4   +42.0

            Snorkel (Disc.)                  Hand Supervision
Task        P      R      F1     Lift        P      R      F1
Chem        87.0   39.2   54.1   +36.5       -      -      -
EHR         80.2   82.6   81.4   +9.2        -      -      -
CDR         38.8   54.3   45.3   +15.9       39.9   58.1   47.3
Spouses     48.4   61.6   54.2   +38.8       47.8   62.5   54.2

Figure 2.10: Precision-recall curves for the relation extraction tasks (Chem, EHR, CDR, and Spouses). The top plots compare a majority vote of all labeling functions, Snorkel’s generative model, and Snorkel’s discriminative model. They show that the generative model improves over majority vote by providing more granular information about candidates, and that the discriminative model can generalize to candidates that no labeling functions label. The bottom plots compare the discriminative model trained on an unweighted combination of the labeling functions, hand supervision (when available), and Snorkel’s discriminative model. They show that the discriminative model benefits from the weighted labels provided by the generative model, and that Snorkel is competitive with hand supervision, particularly in the high-precision region.

In the following experiments, we control for end model selection by using currently popular, standard choices across all settings. For text modalities, we choose a bidirectional long short-term memory (LSTM) sequence model [Graves and Schmidhuber, 2005], and for the medical image classification task we use a 50-layer ResNet [He et al., 2015] pre-trained on the ImageNet object classification dataset [Deng et al., 2009].

Table 2.4: Number of candidates in the training, development, and test splits for each dataset.

Task        # Train.   # Dev.   # Test
Chem        65,398     1,292    1,232
EHR         225,607    913      604
CDR         8,272      888      4,620
Spouses     22,195     2,796    2,697
Radiology   3,851      385      385
Crowd       505        63       64

Both models are implemented in TensorFlow [Abadi et al., 2015] and trained using the Adam optimizer [Kingma and Ba, 2014], with hyperparameters selected via random grid search using a small labeled development set. Final scores are reported on a held-out labeled test set. See the full version [Ratner et al., 2017a] for details. A key takeaway of the following results is that the discriminative model generalizes beyond the heuristics encoded in the labeling functions (as in Example 2.2.5). In Section 2.4.1, we see that on relation extraction applications the discriminative model improves performance over the generative model primarily by increasing recall by 43.15% on average. In Section 2.4.1, the discriminative model classifies entirely new modalities of data to which the labeling functions cannot be applied.

Data Set Details Additional information about the sizes of the datasets is included in Table 2.4. Specifically, we report the size of the (unlabeled) training set and hand-labeled development and test sets, in terms of number of candidates. Note that the development and test sets can be orders of magnitude smaller than the training sets. Labeled development and test sets were either used when already available as part of a benchmark dataset, or labeled with the help of our collaborators, limited to a maximum of several hours of labeling time. Note that test sets were labeled by individuals not involved with labeling function development to keep the test sets properly blinded.

Relation Extraction from Text

We first focus on four relation extraction tasks on text data, as it is a challenging and common class of problems that are well studied and for which distant supervision is often considered. Predictive performance is summarized in Table 2.3, and precision-recall curves are shown in Figure 2.10. We briefly describe each task.

Scientific Articles (Chem): With modern online repositories of scientific literature, such as PubMed (https://www.ncbi.nlm.nih.gov/pubmed/) for biomedical articles, research results are more accessible than ever before. However, actually extracting fine-grained pieces of information in a structured format and using this data to answer specific questions at scale remains a significant open challenge for researchers.

To address this challenge in the context of drug safety research, Stanford and U.S. Food and Drug Administration (FDA) collaborators used Snorkel to develop a system for extracting chemical reagent and reaction product relations from PubMed abstracts. The goal was to build a database of chemical reactions that researchers at the FDA can use to predict unknown drug interactions. We used the chemical reactions described in the Metacyc database [Caspi et al., 2016] for distant supervision.

Electronic Health Records (EHR): As patients’ clinical records increasingly become digitized, researchers hope to inform clinical decision making by retrospectively analyzing large patient cohorts, rather than conducting expensive randomized controlled studies. However, much of the valuable information in electronic health records (EHRs)—such as fine-grained clinical details, practitioner notes, etc.—is not contained in standardized medical coding systems, and is thus locked away in the unstructured text notes sections. In collaboration with researchers and clinicians at the U.S. Department of Veterans Affairs, Stanford Hospital and Clinics (SHC), and the Stanford Center for Biomedical Informatics Research, we used Snorkel to develop a system to extract structured data from unstructured EHR notes. Specifically, the system’s task was to extract mentions of pain levels at precise anatomical locations from clinician notes, with the goal of using these features to automatically assess patient well-being and detect complications after medical interventions like surgery. To this end, our collaborators created a cohort of 5,800 patients from SHC EHR data, with visit dates between 1995 and 2015, resulting in 500K unstructured clinical documents. Since distant supervision from a knowledge base is not applicable, we compared against regular-expression-based labeling previously developed for this task.

Chemical-Disease Relations (CDR): We used the 2015 BioCreative chemical-disease relation dataset [Wei et al., 2015b], where the task is to identify mentions of causal links between chemicals and diseases in PubMed abstracts. We used all pairs of chemical and disease mentions co-occurring in a sentence as our candidate set. We used the Comparative Toxicogenomics Database (CTD) [Davis et al., 2016] for distant supervision, and additionally wrote labeling functions capturing language patterns and information from the context hierarchy. To evaluate Snorkel’s ability to discover previously unknown information, we randomly removed half of the relations in CTD and evaluated on candidates not contained in the remaining half.

Spouses: Our fourth task is to identify mentions of spouse relationships in a set of news articles from the Signal Media dataset [Corney et al., 2016a]. We used all pairs of person mentions (tagged with SpaCy’s NER module, https://spacy.io/) co-occurring in the same sentence as our candidate set. To obtain hand-labeled data for evaluation, we crowdsourced labels for the candidates via Amazon Mechanical Turk, soliciting labels from three workers for each example and assigning the majority vote. We then wrote labeling functions that encoded language patterns and distant supervision from DBpedia [Lehmann et al., 2014].

Table 2.5: Evaluation on cross-modal experiments. Labeling functions that operate on or represent one modality (text, crowd workers) produce training labels for models that operate on another modality (images, text), and approach the predictive performance of large hand-labeled training datasets.

Task              Snorkel (Disc.)   Hand Supervision
Radiology (AUC)   72.0              76.2
Crowd (Acc)       65.6              68.8

Cross-Modal: Images & Crowdsourcing

In the cross-modal setting, we write labeling functions over one data modality (e.g., a text report, or the votes of crowdworkers) and use the resulting labels to train a classifier defined over a second, totally separate modality (e.g., an image or the text of a tweet). This demonstrates the flexibility of Snorkel, in that the labeling functions (and by extension, the generative model) do not need to operate over the same domain as the discriminative model being trained. Predictive performance is summarized in Table 2.5.

Abnormality Detection in Lung Radiographs (Rad): In many real-world radiology settings, there are large repositories of image data with corresponding narrative text reports, but limited or no labels that could be used for training an image classification model. In this application, in collaboration with radiologists, we wrote labeling functions over the text radiology reports, and used the resulting labels to train an image classifier to detect abnormalities in lung X-ray images. We used a publicly available dataset from the OpenI biomedical image repository (http://openi.nlm.nih.gov/) consisting of 3,851 distinct radiology reports—composed of unstructured text and Medical Subject Headings (MeSH, https://www.nlm.nih.gov/mesh/meshhome.html) codes—and accompanying X-ray images.

Crowdsourcing (Crowd): We trained a model to perform sentiment analysis using crowdsourced annotations from the weather sentiment task from Crowdflower (https://www.crowdflower.com/data/weather-sentiment/). In this task, contributors were asked to grade the sentiment of often-ambiguous tweets relating to the weather, choosing between five categories of sentiment. Twenty contributors graded each tweet, but due to the difficulty of the task and lack of crowdworker filtering, there were many conflicts in worker labels. We represented each crowdworker as a labeling function—showing Snorkel’s ability to subsume existing crowdsourcing modeling approaches—and then used the resulting labels to train a text model over the tweets, for making predictions independent of the crowd workers.
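Concretely, converting crowd annotations into Snorkel's input format amounts to building a label matrix with one column per worker. The sketch below does this for a hypothetical table of (tweet_id, worker_id, label) votes, with abstentions (0) wherever a worker did not grade a tweet; the binarized {-1, +1} encoding of the five sentiment categories is an illustrative simplification.

    import numpy as np

    def crowd_votes_to_label_matrix(votes, num_tweets, num_workers):
        # votes: iterable of (tweet_id, worker_id, label) tuples with labels in {-1, +1}.
        # Each worker becomes one "labeling function" column in the label matrix.
        L = np.zeros((num_tweets, num_workers), dtype=int)  # 0 = worker abstained
        for tweet_id, worker_id, label in votes:
            L[tweet_id, worker_id] = label
        return L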

Effect of Generative Modeling

An important question is the significance of modeling the accuracies and correlations of the labeling functions on the end predictive performance of the discriminative model (versus Section 2.3, where we only considered the effect on the accuracy of the generative model). We compare Snorkel with a simpler pipeline that skips the generative modeling stage and trains the discriminative model on an unweighted average of the labeling functions’ outputs. Table 2.6 shows that the discriminative model trained on Snorkel’s probabilistic labels consistently predicts better, improving 5.81% on average. These results demonstrate that the discriminative model effectively learns from the additional signal contained in Snorkel’s probabilistic training labels over simpler modeling strategies.

Table 2.6: Comparison between training the discriminative model on the labels estimated by the generative model, versus training on the unweighted average of the LF outputs. Predictive performance gains show that modeling LF noise helps.

Task          Disc. Model on Unweighted LFs   Disc. Model   Lift
Chem          48.6                            54.1          +5.5
EHR           80.9                            81.4          +0.5
CDR           42.0                            45.3          +3.3
Spouses       52.8                            54.2          +1.4
Crowd (Acc)   62.5                            65.6          +3.1
Rad. (AUC)    67.0                            72.0          +5.0

Figure 2.11: The increase in end model performance (measured in F1 score) for different amounts of unlabeled data (CDR, Spouses, and EHR), measured in the number of candidates on a log scale. We see that as more unlabeled data is added, the performance increases.

Scaling with Unlabeled Data

One of the most exciting potential advantages of using a programmatic supervision approach as in Snorkel is the ability to incorporate additional unlabeled data, which is often cheaply available. Recently proposed theory characterizing the data programming approach predicts that discriminative model generalization risk (i.e., predictive performance on the held-out test set) should improve with additional unlabeled data, at the same asymptotic rate as in traditional supervised methods with respect to labeled data [Ratner et al., 2016a]. That is, with a fixed amount of effort writing labeling functions, we could then get improved discriminative model performance simply by adding more unlabeled data.

Table 2.7: Labeling function ablation study on CDR. Adding different types of labeling functions improves predictive performance.

LF Type                 P      R      F1     Lift
Text Patterns           42.3   42.4   42.3
+ Distant Supervision   37.5   54.1   44.3   +2.0
+ Structure-based       38.8   54.3   45.3   +1.0

We validate this theoretical prediction empirically on three of our datasets (Figure 2.11). We see that by adding additional unlabeled data—in these datasets, candidates from additional documents—we get significant improvements in the end discriminative model performance, with no change in the labeling functions. For example, in the EHR experiment, where we had access to a large unlabeled corpus, we were able to achieve significant gains (8.1 F1 score points) in going from 100 to 50 thousand documents. Further empirical validation of these strong unlabeled scaling results can be found in follow-up work using Snorkel in a range of application domains, including aortic valve classification in MRI videos [Fries et al., 2018], industrial-scale content classification at Google [Bach et al., 2019], fine-grained named entity recognition [Ratner et al., 2019a], radiology image triage [Khandwala et al., 2017], and others. Based on both this empirical validation, and feedback from Snorkel users in practice, we see this ability to leverage available unlabeled data without any additional user labeling effort as a significant advantage of the proposed weak supervision approach.

Labeling Function Type Ablation

We also examine the impact of different types of labeling functions on end predictive performance, using the CDR application as a representative example of three common categories of labeling functions:

• Text Patterns: Basic word, phrase, and regular expression labeling functions.

• Distant Supervision: External knowledge bases mapped to candidates, either directly or filtered by a heuristic.

• Structure-Based: Labeling functions expressing heuristics over the context hierarchy, e.g., reasoning about position in the document or relative to other candidates.

We show an ablation in Table 2.7, sorting by stand-alone score. We see that distant supervision adds recall at the cost of some precision, as we would expect, but ultimately improves F1 score by 2 points; and that structure-based labeling functions, enabled by Snorkel’s context hierarchy data representation, add an additional F1 point.

Table 2.8: Self-reported skill levels—no previous experience (New), beginner (Beg.), intermediate (Int.), and advanced (Adv.)—for all user study participants.

Subject             New   Beg.   Int.   Adv.
Python              0     3      8      4
Machine Learning    5     1      4      5
Info. Extraction    2     6      5      2
Text Mining         3     6      4      2

2.4.2 User Study

We conducted a formal study of Snorkel to (i) evaluate how quickly subject matter expert (SME) users could learn to write labeling functions, and (ii) empirically validate the core hypothesis that writing labeling functions is more time-efficient than hand-labeling data. Users were given instruction on Snorkel, and then asked to write labeling functions for the Spouses task described in the previous subsection.

Participants: In collaboration with the Mobilize Center [Ku et al., 2015], an NIH-funded Big Data to Knowledge (BD2K) center, we distributed a national call for applications to attend a two-day workshop on using Snorkel for biomedical knowledge base construction. Selection criteria included a strong biomedical project proposal and little-to-no prior experience using Snorkel. In total, 15 researchers were invited to attend out of 33 team applications submitted, with varying backgrounds in bioinformatics, clinical informatics, and data mining from universities, companies, and organizations around the United States (one participant declined to write labeling functions, so their score is not included in our analysis). The education demographics included 6 bachelors, 4 masters, and 5 Ph.D. degrees. All participants could program in Python, with 80% rating their skill as intermediate or better; 40% of participants had little-to-no prior exposure to machine learning; and 53-60% had no prior experience with text mining or information extraction applications (Table 2.8).

Protocol: The first day focused entirely on labeling functions, ranging from theoretical motivations to details of the Snorkel API. Over the course of 7 hours, participants were instructed in a classroom setting on how to use and evaluate models developed using Snorkel. Users were presented with 4 tutorial Jupyter notebooks providing skeleton code for evaluating labeling functions, along with a small labeled development candidate set, and were given 2.5 hours of dedicated development time in aggregate to write their labeling functions. All workshop materials are available online (https://github.com/HazyResearch/snorkel/tree/master/tutorials/workshop).

Baseline: To compare our users’ performance against models trained on hand-labeled data, we collected a large hand-labeled dataset via Amazon Mechanical Turk (the same set used in the previous subsection). We then split this into 15 datasets representing 7 hours worth of hand-labeling time each—based on the crowd-worker average of 10 seconds per label—simulating the alternative scenario where users skipped both instruction and labeling function development sessions and instead spent the full day hand-labeling data. Partitions were created by drawing a uniform random sample of 2,500 labels from the total Amazon Mechanical Turk-generated Spouse dataset. For 15 such random samples, the mean F1 score was 20.9 (min: 11.7, max: 29.5). Scaling to 55 random partitions, the mean F1 score was 22.5 (min: 11.7, max: 34.1).

Figure 2.12: Predictive performance attained by our 14 user study participants using Snorkel. The majority (57%) of users matched or exceeded the performance of a model trained on 7 hours (2,500 instances) of hand-labeled data.

Figure 2.13: The profile of the best-performing users by F1 score: an MS or Ph.D. degree in any field, strong Python coding skills, and intermediate to advanced experience with machine learning. Prior experience with text mining added no benefit.

Results: Our key finding is that labeling functions written in Snorkel, even by SME users, can match or exceed a traditional hand-labeling approach. The majority (8) of subjects matched or outperformed these hand-labeled data models. The average Snorkel user’s score was 30.4 F1, and the average hand-supervision score was 20.9 F1. The best performing user model scored 48.7 F1, 19.2 points higher than the best supervised model using hand-labeled data. The worst participant scored 12.0 F1, 0.3 points higher than the lowest hand-labeled model. The full distribution of scores by participant, and broken down by participant background, compared against the baseline models trained with hand-labeled data are shown in Figures 2.12 and 2.14 respectively.

Additional Details We note that participants only needed to create a fairly small set of labeling functions to achieve the reported performances, writing a median of 10 labeling functions (with a minimum of 2, and a maximum of 15). In general, these labeling functions had a simple form; for example, here are two from our user study:

Figure 2.14: Labeling function types by user. We bucketed labeling functions written by user study participants into three types—pattern-based, distant supervision, and complex. Participants tended to mainly write pattern-based labeling functions, but also universally expressed more complex heuristics as well.

    import re

    def LF_fictional(c):
        # Label negative (-1) if the sentence suggests a fictional or acted relationship.
        fictional = ("played the husband", "played the wife", "plays the husband",
                     "plays the wife", "acting role")
        if re.search("|".join(fictional), c.get_parent().text, re.I):
            return -1
        else:
            return 0

    def LF_family(c):
        # Label negative (-1) if a non-spouse family or friend term appears between the
        # two person mentions.
        family = {"business partner", "son", "daughter", "father", "dad", "mother", "mom",
                  "children", "child", "twins", "cousin", "friend", "girlfriend",
                  "boyfriend", "sister", "brother"}
        if len(family.intersection(get_between_tokens(c))) > 0:
            return -1
        else:
            return 0

Participant labeling functions had a median length of 2 lines of Python code (min: 2, max: 12). We grouped participant-designed functions into three types:

1. Pattern-based (regular expressions, small term sets)

2. Distant Supervision (interacts with a knowledge base)

3. Complex (misc. heuristics, e.g. counting PERSON named entity tags, comparing last names of a pair of PERSON entities)

On average, 58% of participants’ labeling functions were pattern-based (min: 25%, max: 82%). The best labeling function design strategy used by participants appeared to be defining small term sets correlated with positive and negative labels. Participants with the lowest F1 scores tended to design labeling functions with low coverage of negative labels. This is a common difficulty encountered when designing labeling functions, as writing heuristics for negative examples is sometimes counter-intuitive. Users with the highest overall F1 scores wrote 1-2 high coverage negative labeling functions and several medium-to-high accuracy positive labeling functions.

We note that the best single participant’s pipeline achieved an F1 score of 48.7, compared to the authors’ score of 54.2. User study participants favored pattern-based labeling functions; the most common design was creating small positive and negative term sets. Author labeling functions were similar, but were more accurate overall (e.g., better pattern matching).

2.5 Extensions & Next Steps

In this section, we briefly discuss extensions and use cases of Snorkel that have been developed since its initial release, as well as next steps and future directions more broadly. One such extension, higher-level interfaces, we defer to later chapters of this work.

2.5.1 Extensions for Real-World Deployments

Since its release, Snorkel has been used at organizations such as the Stanford Hospital, Google, Intel, Microsoft, Facebook, Alibaba, NEC, BASF, Toshiba, and Accenture; in the fight against human trafficking as part of DARPA’s MEMEX program; and in production at several large technology companies. With various teams at the Stanford School of Medicine, we have worked to extend the cross-modal radiology application described in Section 2.4 to a range of other similar cross-modal medical problems, which has involved building robust interfaces for various multi-modal clinical data and formats [Khandwala et al., 2017]. In collaboration with several teams at Google, we recently developed a new version of Snorkel, Snorkel DryBell, to interface with Google’s organizational weak supervision resources and compute infrastructure, and enable weak supervision at industrial scale [Bach et al., 2019].

2.5.2 Multi-Task Weak Supervision

Many real-world use cases of machine learning involve multiple related classification tasks—both because there are multiple tasks of interest, and because available weak supervision sources may in fact label different related tasks. Handling this multi-task weak supervision setting has been the focus of recent work on a new version of Snorkel, Snorkel MeTaL (https://github.com/HazyResearch/metal), which handles labeling functions that label different tasks, and in turn can be used to supervise popular multi-task learning (MTL) discriminative models [Ratner et al., 2017b, 2019a]. For example, we might be aiming to train a fine-grained named entity recognition (NER) system which tags specific types of people, places, and things, and have access to both fine-grained labeling functions—e.g., that label doctors versus lawyers—and coarse-grained ones, e.g., that label people versus organizations. By representing these as different logically-related tasks, we can model and combine these multi-granularity labeling functions using this new multi-task version of Snorkel.

2.5.3 Future Directions

In addition to working on the core directions outlined—real-world deployment and multi-task supervision—several other directions are natural and exciting extensions of Snorkel. One is the extension to other classic machine learning settings, such as structured prediction, regression, and anomaly detection settings. Another direction is extending the possible output signature of labeling functions to include continuous values, probability distributions, or other more complex outputs. The extension of the core modeling techniques—for example, learning labeling function accuracies that are conditioned on specific subsets of the data, or jointly learning the generative and discriminative models—also provides exciting avenues for future research. Another practical and interesting direction is exploring integrations with other complementary techniques for dealing with the lack of hand-labeled training data (see Other Forms of Supervision in Section 2.6). One example is active learning [Settles, 2012], in which the goal is to intelligently sample data points to be labeled; in our setting, we could intelligently select sets of data points to show to the user when writing labeling functions—e.g., data points not labeled by existing labeling functions—and potentially with interesting visualizations and graphical interfaces to aid and direct this development process. Another interesting direction is formalizing the connection between labeling functions and transfer learning [Pan and Yang, 2010], and making more formal and practical connections to semi-supervised learning [Chapelle et al., 2009].
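For instance, one simple way to direct labeling function development along these lines would be to surface the candidates that the current labeling functions say the least about; the sketch below ranks candidates by how few non-abstaining votes they receive in the label matrix, and is purely an illustrative example of the idea, not a proposed algorithm.

    import numpy as np

    def least_covered_candidates(L, k=10):
        # L: (m, n) label matrix in {-1, 0, +1}. Returns indices of the k candidates
        # with the fewest non-abstaining labels, e.g., to show a user writing new LFs.
        coverage = (L != 0).sum(axis=1)
        return np.argsort(coverage)[:k]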

2.6 Related Work

This section is an overview of techniques for managing weak supervision, many of which are subsumed in Snorkel. We also contrast it with related forms of supervision.

Combining Weak Supervision Sources: The main challenge of weak supervision is how to combine multiple sources. For example, if a user provides two knowledge bases for distant supervision, how should a data point that matches only one knowledge base be labeled? Some researchers have used multi-instance learning to reduce the noise in weak supervision sources [Riedel et al., 2010b, Hoffmann et al., 2011a], essentially modeling the different weak supervision sources as soft constraints on the true label, but this approach is limited in that it requires using a specific end model that supports multi-instance learning. Researchers have therefore considered how to estimate the accuracy of label sources without a gold standard with which to compare—a classic problem [Dawid and Skene, 1979]—and combine these estimates into labels that can be used to train an arbitrary end model. Much of this work has focused on crowdsourcing, in which workers have unknown accuracy [Dalvi et al., 2013, Joglekar et al., 2015, Zhang et al., 2016b]. Such methods use generative probabilistic models to estimate a latent variable—the true class label—based on noisy observations. Other methods use generative models with hand-specified dependency structures to label data for specific modalities, such as topic models for text [Alfonseca et al., 2012b] or denoising distant supervision sources [Takamatsu et al., 2012b, Roth and Klakow, 2013b]. Other techniques for estimating latent class labels given noisy observations include spectral methods [Parisi et al., 2014].

latent class labels given noisy observations include spectral methods [Parisi et al., 2014]. Snorkel is distin- guished from these approaches because its generative model supports a wide range of weak supervision sources, and it learns the accuracies and correlation structure among weak supervision sources without ground truth data.

Other Forms of Supervision: Work on semi-supervised learning considers settings with some labeled data and a much larger set of unlabeled data, and then leverages various domain- and task-agnostic assump- tions about smoothness, low-dimensional structure, or distance metrics to heuristically label the unlabeled data [Chapelle et al., 2009]. Work on active learning aims to automatically estimate which data points are optimal to label, thereby hopefully reducing the total number of examples that need to be manually anno- tated [Settles, 2012]. Transfer learning considers the strategy of repurposing models trained on different datasets or tasks where labeled training data is more abundant [Pan and Yang, 2010]. Another class of supervision includes self-training [Scudder, 1965, Agrawala, 1970] and co-training [Blum and Mitchell, 1998], which involve training a model or pair of models on data that they labeled themselves. Weak supervision is distinct in that the goal is to solicit input directly from SMEs, but at a higher level of abstraction and/or in an inherently noisier form. Snorkel is focused on managing weak supervision sources, but combining its methods with these other types of supervision is straightforward.

Related Data Management Problems: Researchers have considered related problems in data manage- ment, such as data fusion [Dong and Srivastava, 2015, Rekatsinas et al., 2017b] and truth discovery [Li et al., 2015b]. In these settings, the task is to estimate the reliability of data sources that provide assertions of facts and determine which facts are likely true. Many approaches to these problems use probabilistic graphical models that are related to Snorkel’s generative model in that they represent the unobserved truth as a latent variable, e.g., the latent truth model [Zhao et al., 2012]. Our setting differs in that labeling functions assign labels to user-provided data, and they may provide any label or abstain, which we must model. Work on data fusion has also explored how to model user-specified correlations among data sources [Pochampally et al., 2014]. Snorkel automatically identifies which correlations among labeling functions to model.

2.7 Conclusion

Snorkel provides a new paradigm for soliciting and managing weak supervision to create training data sets. In Snorkel, users provide higher-level supervision in the form of labeling functions that capture domain knowledge and resources, without having to carefully manage the noise and conflicts inherent in combining weak supervision sources. Our evaluations demonstrate that Snorkel significantly reduces the cost and difficulty of training powerful machine learning models while exceeding prior weak supervision methods and approaching the quality of large, hand-labeled training sets. Snorkel’s deployments in industry, research labs, and government agencies show that it has real-world impact, offering developers an improved way to build models. Thus, we confidently affirm that weak supervision from the high-level abstraction of labeling

functions can be used to train high-performance machine learning models.

Chapter 3

Weak Supervision from Primitives

In Chapter 2, we answered the question of whether it is possible to train high-quality machine learning models without labels generated by hand. Indeed, we found that users can supervise with code instead, using the programming abstraction of labeling functions to programmatically generate labels. However, many types of data are not easily supervised with just a couple of lines of Python code. For example, while pattern-matching LFs based on regular expressions are convenient and often effective for supervising raw text documents, they are incapable of capturing many of the patterns that humans rely on when parsing information from richly formatted documents (e.g., PDFs, spreadsheets, webpages, etc.), such as visual alignments, formatting cues, and document structure. To successfully extend Snorkel to this very difficult setting, we created Fonduer, a system which raises the level of abstraction once more from writing custom labeling logic in code snippets to composing labeling functions from rich primitives generated a priori with an advanced parser. This alleviates much of the cognitive and coding burden on subject matter experts so that they can focus more on expressing their domain expertise and less on generating hundreds or thousands of lines of code to expose the types of relationships they would like to refer to at supervision time. We explore this setting with richly formatted data in the context of knowledge base construction, once again demonstrating that weak supervision from an even higher-level abstraction (rich programmatic primitives) can be used to train high-performance machine learning models. The work in this chapter is the result of collaboration with Sen Wu, Luke Hsiao, Xiao Cheng, Theodoros Rekatsinas, Phil Levis, Christopher Ré, and others. It draws on content from the following Fonduer-related publications: [Wu et al., 2018, Kuleshov et al., 2019].

3.1 Introduction

Knowledge base construction (KBC) is the process of populating a database with information from data such as text, tables, images, or video. Extensive efforts have been made to build large, high-quality knowledge bases (KBs), such as Freebase [Bollacker et al., 2008], YAGO [Suchanek et al., 2008], IBM [Brown et al.,

Figure 3.1: A KBC task to populate relation HasCollectorCurrent(Transistor Part, Current) from datasheets. Part and Current mentions are in blue and green, respectively.

2013, Ferrucci et al., 2010], PharmGKB [Hewett et al., 2002], and Google Knowledge Graph [Singhal, 2012]. Traditionally, KBC solutions have focused on relation extraction from unstructured text [Shin et al., 2015, Madaan et al., 2016, Nakashole et al., 2011, Yahya et al., 2014]. These KBC systems already support a broad range of downstream applications such as information retrieval, question answering, medical diagnosis, and data visualization. However, troves of information remain untapped in richly formatted data, where relations and attributes are expressed via combinations of textual, structural, tabular, and visual cues. In these scenarios, the semantics of the data are significantly affected by the organization and layout of the document. Examples of richly formatted data include webpages, business reports, product specifications, and scientific literature. We use the following example to demonstrate KBC from richly formatted data.

Example 3.1.1 (HasCollectorCurrent). We are given a collection of transistor datasheets (like the one shown in Figure 3.1), and we want to build a KB of their maximum collector currents.1 The output KB can power a tool that verifies that transistors do not exceed their maximum ratings in a circuit. Figure 3.1 shows how relevant information is located in both the document header and table cells and how their relationship is expressed using semantics from multiple modalities.

The heterogeneity of signals in richly formatted data poses a major challenge for existing KBC systems. The above example shows how KBC systems that focus on text data—and adjacent textual contexts such as

1 Transistors are semiconductor devices often used as switches or amplifiers. Their electrical specifications are published by manufacturers in datasheets.

sentences or paragraphs—can miss important information due to this breadth of signals in richly formatted data. We review the major challenges of KBC from richly formatted data.

Challenges. KBC on richly formatted data poses a number of challenges beyond those present with un- structured data: (1) accommodating prevalent document-level relations, (2) capturing the multimodality of information in the input data, and (3) addressing the tremendous data variety.

Prevalent Document-Level Relations We define the context of a relation as the scope of information that needs to be considered when extracting the relation. Context can range from a single sentence to a whole document. KBC systems typically limit the context to a few sentences or a single table, assuming that relations are expressed relatively locally. However, for richly formatted data, many relations rely on the context of a whole document.

Example 3.1.2 (Document-Level Relations). In Figure 3.1, transistor parts are located in the document header (boxed in blue), and the collector current value is in a table cell (boxed in green). Moreover, the interpretation of some numerical values depends on their units reported in another table column (e.g., 200 mA).

Limiting the context scope to a single sentence or table misses many potential relations—up to 97% in the ELECTRONICS application. On the other hand, considering all possible entity pairs in a document as candidates renders the extraction problem computationally intractable due to the combinatorial explosion of candidates.

Multimodality Classical KBC systems model input data as unstructured text [Madaan et al., 2016, Shin et al., 2015, Mintz et al., 2009a]. In richly formatted data, semantics are expressed via multiple different modalities or classes of features— textual, structural, tabular, and visual.

Example 3.1.3 (Multimodality). In Figure 3.1, important information (e.g., the transistor names in the header) is expressed in larger, bold fonts (displayed in yellow). Furthermore, the meaning of a table entry depends on other entries with which it is visually or tabularly aligned (shown by the red arrow). For instance, the semantics of a numeric value is specified by an aligned unit.

Semantics from different modalities can vary significantly but can convey complementary information.

Data Variety With richly formatted data, there are two primary sources of data variety: (1) format variety (e.g., file or table formatting) and (2) stylistic variety (e.g., linguistic variation).

Example 3.1.4 (Data Variety). In Figure 3.1, numeric intervals are expressed as “-65 . . . 150,” but other datasheets show intervals as “-65 ∼ 150,” or “-65 to 150.” Similarly, tables can be formatted with a variety of spanning cells, header hierarchies, and layout orientations.

Data variety requires KBC systems to adopt data models that are generalizable and robust against heterogeneous input data.

Our Approach. We introduce Fonduer, a machine-learning-based system for KBC from richly formatted data. Fonduer takes as input richly formatted documents, which may be of diverse formats, including PDF, HTML, and XML. Fonduer parses the documents and analyzes the corresponding multimodal, document-level contexts to extract relations. The final output is a knowledge base with the relations classified to be correct. Fonduer’s machine-learning-based approach must tackle a series of technical challenges.

Technical Challenges The challenges in designing Fonduer are: (1) Reasoning about relation candidates that are manifested in heterogeneous formats (e.g., text and tables) and span an entire document requires Fonduer’s machine-learning model to analyze heterogeneous, document-level context. While deep-learning models such as recurrent neural networks [Bahdanau et al., 2014] are effective with sentence- or paragraph-level context [Li et al., 2015a], they fall short with document-level context, such as contexts that span both textual and visual features (e.g., information conveyed via fonts or alignment). Developing such models is an open challenge and active area of research [LeCun et al., 2015]. (2) The heterogeneity of contexts in richly formatted data magnifies the need for large amounts of training data. Manual annotation is prohibitively expensive, especially when domain expertise is required. At the same time, human-curated KBs, which can be used to generate training data, may exhibit low coverage or not exist altogether. Alternatively, weak supervision sources can be used to programmatically create large training sets, but it is often unclear how to consistently apply these sources to richly formatted data. Whereas patterns in unstructured data can be identified based on text alone, expressing patterns consistently across different modalities in richly formatted data is challenging. (3) Considering candidates across an entire document leads to a combinatorial explosion of possible candidates, and thus random variables, which need to be considered during learning and inference. This leads to a fundamental tension between building a practical KBC system and learning accurate models that exhibit high recall. In addition, the combinatorial explosion of possible candidates results in a large class imbalance, where the number of “True” candidates is much smaller than the number of “False” candidates. Therefore, techniques that prune candidates to balance running time and end-to-end quality are required.

Technical Contributions Our main contributions are as follows: (1) To account for the breadth of signals in richly formatted data, we design a new data model that preserves structural and semantic information across different data modalities. The role of Fonduer’s data model is twofold: (a) to allow users to specify multimodal domain knowledge that Fonduer leverages to automate the KBC process over richly formatted data, and (b) to provide Fonduer’s machine-learning model with the necessary representation to reason about document-wide context (see Section 3.3). (2) We empirically show that existing deep-learning models [Zhang et al., 2016a] tailored for text information extraction (such as long short-term memory (LSTM) networks [Hochreiter and Schmidhuber, 1997]) struggle to capture the multimodality of richly formatted data. We introduce a multimodal LSTM network that combines textual context with universal features that correspond to structural and visual properties of the input documents. These features are captured by Fonduer’s data model and are generated automatically (see Section 3.4.2). We also introduce a series of data layout optimizations to ensure the scalability of Fonduer to

millions of document-wide candidates (see Appendix B.3). (3) Fonduer introduces a programming model in which no development cycles are spent on feature engineering. Users only need to specify candidates, the potential entries in the target KB, and provide lightweight supervision rules which capture a user’s domain knowledge to programmatically label subsets of candidates, which are used to train Fonduer’s deep-learning model (see Section 3.4.3). We conduct a user study to evaluate Fonduer’s programming model. We find that when working with richly formatted data, users rely on semantics from multiple modalities of the data, including both structural and textual information in the document. Our study demonstrates that given 30 minutes, Fonduer’s programming model allows users to attain F1 scores that are 23 points higher than supervision via manual labeling candidates (see Section 3.6).

Summary of Results. Fonduer-based systems are in production in a range of academic and industrial use cases, including a major online retailer. Fonduer introduces several advancements over prior KBC systems (see Appendix 3.8): (1) In contrast to prior systems that focus on adjacent textual data, Fonduer can extract document-level relations expressed in diverse formats, ranging from textual to tabular formats; (2) Fonduer reasons about multimodal context, i.e., both textual and visual characteristics of the input documents, to extract more accurate relations; (3) In contrast to prior KBC systems that rely heavily on feature engineering to achieve high quality [Ré et al., 2014], Fonduer obviates the need for feature engineering by extending a bidirectional LSTM—a strong, standard deep-learning baseline in natural language processing—to obtain a representation needed to automate relation extraction from richly formatted data. We evaluate Fonduer in four real-world applications of richly formatted information extraction and show that Fonduer enables users to build high-quality KBs, achieving an average improvement of 41 F1 points over state-of-the-art KBC systems.

3.2 Background

3.2.1 Knowledge Base Construction

The input to a KBC system is a corpus of documents. The output of the system is a relational database containing facts extracted from the input and stored in an appropriate schema. To describe this process, we adopt terminology from the KBC community. There are four types of objects that play key roles in KBC systems: (1) entities, (2) relations, (3) entity mentions, and (4) relation mentions. An entity e in a knowledge base represents a distinct real-world person, place, or object. Entities can be grouped into different entity types T1, T2,..., Tn. Entities also participate in relationships. A relationship between n entities is represented as an n-ary relation R(e1, e2,..., en) and is described by a schema

SR(T1, T2, ..., Tn), where ei ∈ Ti. A mention m is a span of text that refers to an entity. A relation mention candidate (referred to as a candidate in this chapter) is an n-ary tuple c = (m1, m2, ..., mn) that represents a potential instance of a relation R(e1, e2, ..., en). A candidate classified as true is called a relation mention, denoted by rR.

Example 3.2.1 (KBC). Consider the HasCollectorCurrent task in Figure 3.1. Fonduer takes a corpus of transistor datasheets as input and constructs a KB containing the (Transistor Part, Current) binary relation as output. Parts like SMBT3904 and Currents like 200mA are entities. The spans of text that read “SMBT3904” and “200” (boxed in blue and green, respectively) are mentions of those two entities, and together they form a candidate. If the evidence in the document suggests that these two mentions are related, then the output KB will include the relation mention (SMBT3904, 200mA) of the HasCollectorCurrent relation.

Knowledge base construction is defined as follows. Given a set of documents D and a KB schema

SR(T1, T2,..., Tn), where each Ti corresponds to an entity type, extract a set of relation mentions rR from D, which populate the schema’s relational tables.
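To make these definitions concrete, the following sketch shows one simple way to represent mentions and candidates as tuples; the class names and field values here are purely illustrative and are not Fonduer's actual data structures.

from typing import NamedTuple, Tuple

class Mention(NamedTuple):
    text: str        # the span of text, e.g., "SMBT3904"
    context_id: str  # pointer to the sentence or cell that contains it

class Candidate(NamedTuple):
    mentions: Tuple[Mention, ...]  # (m1, ..., mn), one mention per relation argument

# A potential instance of HasCollectorCurrent(Transistor Part, Current)
cand = Candidate((Mention("SMBT3904", "header-sentence-1"),
                  Mention("200", "table-cell-7")))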

Figure 3.2: An overview of Fonduer KBC over richly formatted data. Given a set of richly formatted documents and a series of lightweight inputs from the user, Fonduer extracts facts and stores them in a relational database.

Like other machine-learning-based KBC systems [Carlson et al., 2010, Shin et al., 2015], Fonduer converts KBC to a statistical learning and inference problem: each candidate is assigned a Boolean random variable that can take the value “True” if the corresponding relation mention is correct, or “False” otherwise. In machine-learning-based KBC systems, each candidate is associated with certain features that provide evidence for the value that the corresponding random variable should take. Machine-learning-based KBC systems use machine learning to maximize the probability of correctly classifying candidates, given their features and ground truth examples.

3.2.2 Recurrent Neural Networks

The machine-learning model we use in Fonduer is based on a recurrent neural network (RNN). RNNs have obtained state-of-the-art results in many natural-language processing (NLP) tasks, including information extraction [Graves et al., 2009, 2013, Wu et al., 2016]. RNNs take sequential data as input. For each element in the input sequence, the information from previous inputs can affect the network output for the current element.

For sequential data $\{x_1, \ldots, x_T\}$, the structure of an RNN is mathematically described as:

$$h_t = f(x_t, h_{t-1}), \qquad y = g(\{h_1, \ldots, h_T\})$$

where $h_t$ is the hidden state for element $t$, and $y$ is the representation generated by the sequence of hidden states $\{h_1, \ldots, h_T\}$. Functions $f$ and $g$ are nonlinear transformations. For RNNs, we have $f = \tanh(W_h x_t + U_h h_{t-1} + b_h)$, where $W_h$ and $U_h$ are parameter matrices and $b_h$ is a vector. Function $g$ is typically task-specific.
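The recurrence above is straightforward to express in code. The following is a small illustrative sketch in NumPy (not Fonduer's implementation), using mean pooling as a stand-in for the task-specific function g:

import numpy as np

def rnn_forward(xs, W_h, U_h, b_h):
    # xs: list of input vectors x_1, ..., x_T; returns hidden states h_1, ..., h_T
    h = np.zeros(U_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_h @ x + U_h @ h + b_h)  # h_t = f(x_t, h_{t-1})
        hs.append(h)
    return hs

def g(hs):
    # y = g({h_1, ..., h_T}); mean pooling here, purely for illustration
    return np.mean(hs, axis=0)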

Long Short-term Memory LSTM [Hochreiter and Schmidhuber, 1997] networks are a special type of RNN that introduce new structures referred to as gates, which control the flow of information and can capture long-term dependencies. There are three types of gates: input gates $i_t$ control which values are updated in a memory cell; forget gates $f_t$ control which values remain in memory; and output gates $o_t$ control which values in memory are used to compute the output of the cell. The final structure of an LSTM is given below.

$$i_t = \sigma(W_i x_t + U_i h_{t-1} + b_i)$$
$$f_t = \sigma(W_f x_t + U_f h_{t-1} + b_f)$$
$$o_t = \sigma(W_o x_t + U_o h_{t-1} + b_o)$$
$$c_t = f_t \circ c_{t-1} + i_t \circ \tanh(W_c x_t + U_c h_{t-1} + b_c)$$
$$h_t = o_t \circ \tanh(c_t)$$

where $c_t$ is a cell state vector, the $W$, $U$, and $b$ are parameter matrices and vectors, $\sigma$ is the sigmoid function, and $\circ$ is the Hadamard product.

Bidirectional LSTMs consist of forward and backward LSTMs. The forward LSTM $f^F$ reads the sequence from $x_1$ to $x_T$ and calculates a sequence of forward hidden states $(h_1^F, \ldots, h_T^F)$. The backward LSTM $f^B$ reads the sequence from $x_T$ to $x_1$ and calculates a sequence of backward hidden states $(h_1^B, \ldots, h_T^B)$. The final hidden state for the sequence is the concatenation of the forward and backward hidden states, e.g., $h_i = [h_i^F, h_i^B]$.

Attention Previous work explored using pooling strategies to train an RNN, such as max pooling [Verga et al., 2016], which compresses the information contained in potentially long input sequences to a fixed-length internal representation by considering all parts of the input sequence impartially. This compression of information can make it difficult for RNNs to learn from long input sequences. In recent years, the attention mechanism has been introduced to overcome this limitation by using a soft word-selection process that is conditioned on the global information of the sentence [Bahdanau et al., 2014]. That is, rather than squashing all information from a source input (regardless of its length), this mechanism allows an RNN to pay more attention to the subsets of the input sequence where the most relevant information is concentrated. Fonduer uses a bidirectional LSTM with attention to represent textual features of relation candidates from the documents. We extend this LSTM with features that capture other data modalities.
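For reference, a single step of the LSTM cell defined above can be sketched as follows (illustrative NumPy code, not Fonduer's implementation); the dictionary p holds the parameter matrices and bias vectors, and * denotes the element-wise (Hadamard) product:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, p):
    i_t = sigmoid(p["W_i"] @ x_t + p["U_i"] @ h_prev + p["b_i"])  # input gate
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])  # forget gate
    o_t = sigmoid(p["W_o"] @ x_t + p["U_o"] @ h_prev + p["b_o"])  # output gate
    c_t = f_t * c_prev + i_t * np.tanh(p["W_c"] @ x_t + p["U_c"] @ h_prev + p["b_c"])
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t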

Figure 3.3: Fonduer’s data model.

3.3 The Fonduer Framework

An overview of Fonduer is shown in Figure 3.2. Fonduer takes as input a collection of richly formatted documents and a collection of user inputs. It follows a machine-learning-based approach to extract relations from the input documents. The relations extracted by Fonduer are stored in a target knowledge base. We introduce Fonduer’s data model for representing different properties of richly formatted data. We then review Fonduer’s data processing pipeline and describe the new programming paradigm introduced by Fonduer for KBC from richly formatted data. The design of Fonduer was strongly guided by interactions with collaborators (see the user study in Section 3.6). We find that to support KBC from richly formatted data, a unified data model must: • Serve as an abstraction for system and user interaction. • Capture candidates that span different regions (e.g. sections of pages) and data modalities (e.g., textual and tabular data). • Represent the formatting variety in richly formatted data sources in a unified manner. Fonduer introduces a data model that satisfies these requirements.

3.3.1 Fonduer’s Data Model

Fonduer’s data model is a directed acyclic graph (DAG) that contains a hierarchy of contexts, whose structure reflects the intuitive hierarchy of document components. In this graph, each node is a context (represented as boxes in Figure 3.3). The root of the DAG is a Document, which contains Section contexts. Each Section is divided into: Texts, Tables, and Figures. Texts can contain multiple Paragraphs; Tables and Figures can

contain Captions; Tables can also contain Rows and Columns, which are in turn made up of Cells. Each context ultimately breaks down into Paragraphs that are parsed into Sentences. In Figure 3.3, a downward edge indicates a parent-contains-child relationship. This hierarchy serves as an abstraction for both system and user interaction with the input corpus. In addition, this data model allows us to capture candidates that come from different contexts within a document. For each context, we also store the textual contents, pointers to the parent contexts, and a wide range of attributes from each modality found in the original document. For example, standard NLP pre-processing tools are used to generate linguistic attributes, such as lemmas, parts of speech tags, named entity recognition tags, dependency paths, etc., for each Sentence. Structural and tabular attributes of a Sentence, such as tags, and row/column information, and parent attributes, can be captured by traversing its path in the data model. Visual attributes for the document are recorded by storing bounding box and page information for each word in a Sentence.

Example 3.3.1 (Data Model). The data model representing the PDF in Figure 3.1 contains one Section with three children: a Text for the document header, a Text for the description, and a Table for the table itself (with 10 Rows and 4 Columns). Each Cell links to both a Row and Column. Texts and Cells contain Paragraphs and Sentences.

Fonduer’s multimodal data model unifies inputs of different formats, which addresses the data variety that comes from variations in format. To construct the DAG for each document, we extract all the words in their original order. For structural and tabular information, we use tools such as Poppler2 to convert an input file into HTML format; for visual information, such as coordinates and bounding boxes, we use a PDF printer to convert an input file into PDF. If a conversion occurred, we associate the multimodal information in the converted file with all extracted words. We align the word sequences of the converted file with their originals by checking if both their characters and number of repeated occurrences before the current word are the same. Fonduer can recover from conversion errors by using the inherent redundancy in signals from other modalities. This DAG structure also simplifies the variation in format that comes from table formatting.
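The alignment step can be sketched roughly as follows; this is a simplified illustration of the idea rather than Fonduer's actual code. Each word is keyed by its characters together with the number of identical words seen before it, and the two keyed sequences are then matched.

from collections import defaultdict

def align_words(original_words, converted_words):
    # Map indices in the original word sequence to indices in the converted one
    def keyed(words):
        seen = defaultdict(int)
        keys = []
        for w in words:
            keys.append((w, seen[w]))  # (characters, occurrences seen so far)
            seen[w] += 1
        return keys

    converted_index = {key: i for i, key in enumerate(keyed(converted_words))}
    return {i: converted_index.get(key) for i, key in enumerate(keyed(original_words))}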

Takeaways. Fonduer unifies a diverse variety of document formats, types of contexts, and modality semantics into one model in order to address variety inherent in richly formatted data. Fonduer’s data model serves as the formal representation of the intermediate data utilized in all future stages of the extraction process.

3.3.2 User Inputs and Fonduer’s Pipeline

The Fonduer processing pipeline follows three phases. We briefly describe each phase in turn and focus on the user inputs required by each phase. Fonduer’s internals are described in Section 3.4.

2 https://poppler.freedesktop.org

(1) KBC Initialization The first phase in Fonduer’s pipeline is to initialize the target KB where the extracted relations will be stored. During this phase, Fonduer requires the user to specify a target schema that corresponds to the relations to be extracted. The target schema SR(T1,..., Tn) defines a relation R to be extracted from the input documents. An example of such a schema is provided below.

Example 3.3.2 (Relation Schema). An example SQL schema for the relation in Figure 3.1 is:

CREATE TABLE HasCollectorCurrent(
    TransistorPart varchar,
    Current varchar
);

Fonduer uses the user-specified schema to initialize an empty relational database where the output KB will be stored. Furthermore, Fonduer iterates over its input corpus and transforms each document into an instance of Fonduer’s data model to capture the variety and multimodality of richly formatted documents.

(2) Candidate Generation In Phase 2, Fonduer extracts relation candidates from the input documents. Here, users are required to provide two types of input functions: (1) matchers and (2) throttlers.

Matchers To generate candidates for relation R, Fonduer requires that users define matchers for each distinct mention type in schema SR. Matchers are how users specify what a mention looks like. In Fonduer, matchers are Python functions that accept a span of text as input—which has a reference to its data model—and output whether or not the match conditions are met. Matchers range from simple regular expressions to complex functions that account for signals across multiple modalities of the input data and can also incorporate existing methods such as named-entity recognition.

Example 3.3.3 (Matchers). From the HasCollectorCurrent relation in Figure 3.1, users define matchers for each type of the schema. A dictionary of valid transistor parts can be used as the first matcher. For maximum current, users can exploit the pattern that these values are commonly expressed as a numerical value between 100 and 995 for their second matcher.

import re

# Use a dictionary to match transistor parts
def transistor_part_matcher(span):
    return 1 if span in part_dictionary else 0

# Use RegEx to extract numbers between [100, 995]
def max_current_matcher(span):
    return 1 if re.match('[1-9][0-9][0-5]', span) else 0

Throttlers Users can optionally provide throttlers, which act as hard filtering rules to reduce the number of candidates that are materialized. Throttlers are also Python functions, but rather than accepting spans of text as input, they operate on candidates, and output whether or not a candidate meets the specified condition. Throttlers limit the number of candidates considered by Fonduer.

Example 3.3.4 (Throttler). Continuing the example shown in Figure 3.1, the user provides a throttler, which only keeps candidates whose Current has the word “Value” as its column header.

def value_in_column_header(cand):
    return 1 if 'Value' in header_ngrams(cand.current) else 0

Given the input matchers and throttlers, Fonduer extracts relation candidates by traversing its data model representation of each document. By applying matchers to each leaf of the data model, Fonduer can generate sets of mentions for each component of the schema. The cross-product of these mentions produces candidates:

Candidate(id_candidate, mention_1, ..., mention_n), where mentions are spans of text and contain pointers to their context in the data model of their respective document. The output of this phase is a set of candidates, C.
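Putting the pieces of this phase together, candidate generation can be sketched roughly as below. The helper leaf_spans is a hypothetical accessor standing in for the traversal of the data model; the sketch is illustrative, not Fonduer's implementation.

from itertools import product

def generate_candidates(document, matchers, throttlers):
    # One mention set per argument of the relation schema
    mention_sets = [
        [span for span in leaf_spans(document) if matcher(span)]
        for matcher in matchers
    ]
    # The cross-product of the mention sets yields the raw candidates
    candidates = list(product(*mention_sets))
    # Throttlers prune candidates that fail the hard filtering rules
    return [c for c in candidates if all(t(c) for t in throttlers)]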

(3) Training a Multimodal LSTM for KBC In this phase, Fonduer trains a multimodal LSTM network to classify the candidates generated during Phase 2 as “True” or “False” mentions of target relations. Fonduer’s multimodal LSTM combines both visual and textual features. Recent work has also proposed the use of LSTMs for KBC but has focused only on textual data [Zhang et al., 2016a]. In Section 3.5.3, we experimentally demonstrate that state-of-the-art LSTMs struggle to capture the multimodal characteristics of richly formatted data, and thus, obtain poor-quality KBs. Fonduer uses a bidirectional LSTM (see Section 3.2.2) to capture textual features and extends it with additional structural, tabular, and visual features captured by Fonduer’s data model. The LSTM used by Fonduer is described in Section 3.4.2. Training in Fonduer is split into two sub-phases: (1) a multimodal featurization phase and (2) a phase where supervision data is provided by the user.

Multimodal Featurization Fonduer traverses its internal data model instance for each document and automatically generates features that correspond to structural, tabular, and visual modalities as described in Section 3.4.2. These constitute a bare-bones feature library (called feature_lib, below), which augments the textual features learned by the LSTM. All features are stored in a relation:

Features(id_candidate, LSTM_textual, feature_lib_others)

No user input is provided in this step. Fonduer obviates the need for feature engineering and shows that incorporating multimodal information is key to achieving high-quality relation extraction.

Supervision To train its multimodal LSTM, Fonduer requires that users provide some form of supervision. Collecting sufficient training data for multi-context deep-learning models is a well-established challenge. As stated by LeCun et al. [2015], taking into account a context of more than a handful of words for text-based deep-learning models requires very large training corpora. To soften the burden of traditional supervision, Fonduer uses a supervision paradigm referred to as data programming [Ratner et al., 2016b]. Data programming is a human-in-the-loop paradigm for training machine-learning systems. In data programming, users only need to specify lightweight functions, referred to as labeling functions (LFs), that programmatically assign labels to the input candidates. A detailed overview of data programming is provided in Appendix B.1. While existing work on data programming [Ratner et al.,

2017b] has focused on labeling functions over textual data, Fonduer paves the way for specifying labeling functions over richly formatted data. Fonduer requires that users specify labeling functions that label the candidates from Phase 2. Labeling functions in Fonduer are Python functions that take a candidate as input and assign +1 to label it as “True,” −1 to label it as “False,” or 0 to abstain.

Example 3.3.5 (Labeling Functions). Looking at the datasheet in Figure 3.1, users can express patterns such as having the Part and Current y-aligned on the visual rendering of the page. Similarly, users can write a rule that labels a candidate whose Current is in the same row as the word “current” as “True.”

# Rule-based LF based on visual information
def y_axis_aligned(cand):
    return 1 if cand.part.y == cand.current.y else 0

# Rule-based LF based on tabular content
def has_current_in_row(cand):
    return 1 if 'current' in row_ngrams(cand.current) else 0

As shown in Example 3.3.5, Fonduer’s internal data model allows users to specify labeling functions that capture supervision patterns across any modality of the data (see Section 3.4.3). In our user study, we find that it is common for users to write labeling functions that span multiple modalities and consider both textual and visual patterns of the input data (see Section 3.6). The user-specified labeling functions, together with the candidates generated by Fonduer, are passed as input to Snorkel [Ratner et al., 2017b], a data-programming engine, which converts the noisy labels generated by the input labeling functions to denoised labeled data used to train Fonduer’s multimodal LSTM model (see Appendix B.1).

Classification Fonduer uses its trained LSTM to assign a marginal probability to each candidate. The last layer of Fonduer’s LSTM is a softmax classifier (described in Section 3.4.2) that computes the probability of a candidate being a “True” relation. In Fonduer, users can specify a threshold over the output marginal probabilities to determine which candidates will be classified as “True” (those whose marginal probability of being true exceeds the specified threshold) and which are “False” (those whose marginal probability falls below the threshold). This threshold depends on the requirements of the application. Applications that require high accuracy can set a high threshold value to ensure only candidates with a high probability of being “True” are classified as such. As shown in Figure 3.2, supervision and classification are typically executed over several iterations as users develop a KBC application. This feedback loop allows users to quickly receive feedback and improve their labeling functions, and avoids the overhead of rerunning candidate extraction and materializing features (see Section 3.6).
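The thresholding itself is simple. As a sketch, assuming a hypothetical mapping marginals from candidates to their predicted probabilities:

def classify(marginals, threshold=0.5):
    # Split candidates into "True" and "False" relation mentions by marginal probability
    true_mentions = [c for c, p in marginals.items() if p > threshold]
    false_mentions = [c for c, p in marginals.items() if p <= threshold]
    return true_mentions, false_mentions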

3.3.3 Fonduer’s Programming Model for KBC

Fonduer is the first system to provide the necessary abstractions and mechanisms to enable the use of weak

supervision as a means to train a KBC system for richly formatted data. Traditionally, machine-learning-based KBC focuses on feature engineering to obtain high-quality KBs. This requires that users rerun feature extraction, learning, and inference after every modification of the feature set. With Fonduer’s machine-learning approach, features are generated automatically. This puts emphasis on (1) specifying the relation candidates and (2) providing multimodal supervision rules via LFs. This approach allows users to leverage multiple sources of supervision to address data variety introduced by variations in style better than traditional manual labeling [Shin et al., 2015]. Fonduer’s programming paradigm removes the need for feature engineering and introduces two modes of operation for Fonduer applications: (1) development and (2) production. During development, LFs are iteratively improved, in terms of both coverage and accuracy, through error analysis as shown by the blue arrows in Figure 3.2. LFs are applied to a small sample of labeled candidates and evaluated by the user on their accuracy and coverage (the fraction of candidates receiving non-zero labels). To support efficient error analysis, Fonduer enables users to easily inspect the resulting candidates and provides a set of LF metrics, such as coverage, conflict, and overlap, which provide users with a rough assessment of how to improve their LFs. Our users generated a sufficiently tuned set of LFs in about 20 iterations (see Section 3.6). In production, the finalized LFs are applied to the entire set of candidates, and learning and inference are performed only once to generate the final KB. On average, only a small number of LFs are needed to achieve high-quality KBC. For example, in the ELECTRONICS application, 16 LFs, on average, are sufficient to achieve an average F1 score of over 75. We also find that tabular and visual signals are particularly valuable forms of supervision for KBC from richly formatted data, and are complementary to traditional textual signals (see Section 3.6).
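The LF metrics used during this error analysis can be computed directly from the matrix of LF outputs. The following is a minimal sketch, assuming a NumPy array L of shape (number of candidates, number of LFs) with entries in {-1, 0, +1}, where 0 means abstain; it is illustrative rather than the system's actual implementation.

import numpy as np

def lf_metrics(L):
    labeled = L != 0
    n, m = L.shape
    coverage = labeled.mean(axis=0)  # fraction of candidates each LF labels
    overlap = np.zeros(m)
    conflict = np.zeros(m)
    for j in range(m):
        others = np.delete(np.arange(m), j)
        others_labeled = labeled[:, others].any(axis=1)
        # Labeled by LF j and by at least one other LF
        overlap[j] = np.mean(labeled[:, j] & others_labeled)
        # Labeled by LF j while some other LF assigns a different non-abstain label
        disagrees = ((L[:, others] != 0) & (L[:, others] != L[:, [j]])).any(axis=1)
        conflict[j] = np.mean(labeled[:, j] & disagrees)
    return coverage, overlap, conflict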

3.4 KBC in Fonduer

Here, we focus on the implementation of each component of Fonduer. In Appendix B.3 we discuss a series of optimizations that enable Fonduer’s scalability to millions of candidates.

3.4.1 Candidate Generation

Candidate generation from richly formatted data relies on access to document-level contexts, which is provided by Fonduer’s data model. Due to the significantly increased context needed for KBC from richly formatted data, naïvely materializing all possible candidates is intractable as the number of candidates grows combinatorially with the number of relation arguments. This combinatorial explosion can lead to performance issues for KBC systems. For example, in the ELECTRONICS domain, just 100 documents can generate over 1M candidates. In addition, we find that the majority of these candidates do not express true relations, creating a significant class imbalance that can hinder learning performance [Japkowicz and Stephen, 2002]. To address this combinatorial explosion, Fonduer allows users to specify throttlers, in addition to matchers, to prune away excess candidates. We find that throttlers must:

• Maintain high accuracy by only filtering negative candidates.
• Seek high coverage of the candidates.

Throttlers can be viewed as a knob that allows users to trade off precision and recall and promote scalability by reducing the number of candidates to be classified during KBC.


Figure 3.4: Tradeoff between (a) quality and (b) execution time when pruning the number of candidates using throttlers.

Figure 3.4 shows how using throttlers affects the quality-performance tradeoff in the ELECTRONICS domain. We see that throttling significantly improves system performance. However, increased throttling does not monotonically improve quality since it hurts recall. This tradeoff captures the fundamental tension between optimizing for system performance and optimizing for end-to-end quality. When no candidates are pruned, the class imbalance resulting from many negative candidates to the relatively small number of positive candidates harms quality. Therefore, as a rule of thumb, we recommend that users apply throttlers to balance negative and positive candidates. Fonduer provides users with mechanisms to evaluate this balance over a small holdout set of labeled candidates.

Takeaways. Fonduer’s data model is necessary to perform candidate generation with richly formatted data. Pruning negative candidates via throttlers to balance negative and positive candidates not only ensures the scalability of Fonduer but also improves the precision of Fonduer’s output.

3.4.2 Multimodal LSTM Model

We now describe Fonduer’s deep-learning model in detail. Fonduer’s model extends a bidirectional LSTM (Bi-LSTM), a strong, standard deep-learning baseline for NLP, with a simple set of dynamically generated features that capture semantics from the structural, tabular, and visual modalities of the data model. A detailed

list of these features is provided in Appendix B.2. In Section 3.5.3, we perform an ablation study demonstrating that non-textual features are key to obtaining high-quality KBs. We find that the quality of the output KB deteriorates up to 33 F1 points when non-textual features are removed. Figure 3.5 illustrates Fonduer’s LSTM. We now review each component of Fonduer’s LSTM.

Figure 3.5: An illustration of Fonduer’s multimodal LSTM (a Bi-LSTM with attention combined with an extended feature library of structural, tabular, and visual features) for the candidate (SMBT3904, 200) in Figure 3.1.

Bidirectional LSTM with Attention Traditionally, the primary source of signal for relation extraction comes from unstructured text. In order to understand textual signals, Fonduer uses an LSTM network to extract textual features. For mentions, Fonduer builds a Bi-LSTM to get the textual features of the mention from both directions of the sentences containing the candidate. For sentence $s_i$ containing the $i$th mention in the document, the textual features $h_{ik}$ of each word $w_{ik}$ are encoded by both a forward (denoted by superscript $F$ in the equations) and a backward (superscript $B$) LSTM, which summarizes information about the whole sentence with a focus on $w_{ik}$. This takes the structure:

$$h_{ik}^{F} = \mathrm{LSTM}(h_{i(k-1)}^{F}, \Phi(s_i, k))$$
$$h_{ik}^{B} = \mathrm{LSTM}(h_{i(k+1)}^{B}, \Phi(s_i, k))$$
$$h_{ik} = [h_{ik}^{F}, h_{ik}^{B}]$$

where $\Phi(s_i, k)$ is the word embedding [Turian et al., 2010], which is the representation of the semantics of the $k$th word in sentence $s_i$.

The textual feature representation for a mention, $t_i$, is calculated by the following attention mechanism to model the importance of different words from the sentence $s_i$ and to aggregate the feature representation of those words to form a final feature representation:

$$u_{ik} = \tanh(W_w h_{ik} + b_w)$$
$$\alpha_{ik} = \frac{\exp(u_{ik}^{\top} u_w)}{\sum_j \exp(u_{ij}^{\top} u_w)}$$
$$t_i = \sum_j \alpha_{ij} u_{ij}$$

where $W_w$, $u_w$, and $b_w$ are parameter matrices and vectors, $u_{ik}$ is the hidden representation of $h_{ik}$, and $\alpha_{ik}$ models the importance of each word in the sentence $s_i$. Special candidate markers (shown in red in Figure 3.5) are added to the sentences to draw attention to the candidates themselves. Finally, the textual features of a candidate are the concatenation of its mentions’ textual features $[t_1, \ldots, t_n]$.
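The attention pooling above amounts to only a few lines of code. A minimal illustrative sketch in NumPy (not Fonduer's implementation), where H stacks the hidden states of one sentence:

import numpy as np

def attention_pool(H, W_w, b_w, u_w):
    # H: hidden states h_{i1}, ..., h_{iT} as an array of shape (T, d)
    U = np.tanh(H @ W_w.T + b_w)                   # u_{ik} = tanh(W_w h_{ik} + b_w)
    scores = U @ u_w                               # u_{ik}^T u_w
    alpha = np.exp(scores) / np.exp(scores).sum()  # softmax over the words of the sentence
    return (alpha[:, None] * U).sum(axis=0)        # t_i = sum_j alpha_{ij} u_{ij}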

Extended Feature Library Features for structural, tabular, and visual modalities are generated by leveraging the data model, which preserves each modality’s semantics. For each candidate, such as the candidate (SMBT3904, 200) shown in Figure 3.5, Fonduer locates each mention in the data model and traverses the DAG to compute features from the modality information stored in the nodes of the graph. For example, Fonduer can traverse sibling nodes to add tabular features such as featurizing a node based on the other mentions in the same row or column. Similarly, Fonduer can traverse the data model to extract structural features from tags stored while parsing the document along with the hierarchy of the document elements themselves. We review each modality: Structural features. These provide signals intrinsic to a document’s structure. These features are dynami- cally generated and allow Fonduer to learn from structural attributes, such as parent and sibling relationships and XML/HTML tag metadata found in the data model (shown in yellow in Figure 3.5). The data model also allows Fonduer to track structural distances of candidates, which helps when a candidate’s mentions are visually distant, but structurally close together. Specifically, featurizing a candidate with the distance to the lowest common ancestor in the data model is a positive signal for linking table captions to table contents. Tabular features. These are a subset of structural features since tables are common structures inside documents and have high information density. Table features are drawn from the grid-like representation of rows and columns stored in the data model, shown in green in Figure 3.5. In addition to the tabular location of mentions, Fonduer also featurizes candidates with signals such as being in the same row/column. For example, consider a table that has cells with multiple lines of text; recording that two mentions share a row captures a signal that a visual alignment feature could easily miss. Visual features. These provide signals observed from a visual rendering of a document. In cases where tabular or structural features are noisy—including nearly all documents converted from PDF to HTML by generic tools—visual features can provide a complementary view of the dependencies among text. Visual features encode many highly predictive types of semantic information implicitly, such as position on a page, which may imply when text is a title or header. An example of this is shown in red in Figure 3.5.
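To give a flavor of how such features are read off the data model, the following sketch emits a few structural, tabular, and visual features for a candidate. The accessors row_ngrams, col_ngrams, and lowest_common_ancestor_depth are hypothetical names standing in for data model traversals; this is not Fonduer's actual feature library.

def extended_features(cand):
    part, current = cand.part, cand.current
    # Tabular: words sharing the Current mention's row and column
    for w in row_ngrams(current):
        yield "CURRENT_ROW_WORD_" + w
    for w in col_ngrams(current):
        yield "CURRENT_COL_WORD_" + w
    # Structural: depth of the lowest common ancestor of the two mentions in the DAG
    yield "LCA_DEPTH_" + str(lowest_common_ancestor_depth(part, current))
    # Visual: whether the two mentions are vertically aligned on the rendered page
    if part.y == current.y:
        yield "Y_ALIGNED"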

Training All parameters of Fonduer’s LSTM are jointly trained, including the parameters of the Bi-LSTM

as well as the weights of the last softmax layer that correspond to additional features.

Takeaways. To achieve high-quality KBC with richly formatted data, it is vital to have features from multiple data modalities. These features are only obtainable through traversing and accessing modality attributes stored in the data model.

3.4.3 Multimodal Supervision

Unlike KBC from unstructured text, KBC from richly formatted data requires supervision from multiple modalities of the data. In richly formatted data, useful patterns for KBC are more sparse and hidden in non-textual signals, which motivates the need to exploit overlap and repetition in a variety of patterns over multiple modalities. Fonduer’s data model allows users to directly express correctness using textual, structural, tabular, or visual characteristics, in addition to traditional supervision sources like existing KBs. In the ELECTRONICS domain, over 70% of labeling functions written by our users are based on non-textual signals. It is acceptable for these labeling functions to be noisy and conflict with one another. Data programming theory (see Appendix B.1.2) shows that, with a sufficient number of labeling functions, data programming can still achieve quality comparable to using manually labeled data. In Section 3.5.3, we find that using metadata in the ELECTRONICS domain, such as structural, tabular, and visual cues, results in a 66 F1 point increase over using textual supervision sources alone. Using both sources gives a further increase of 2 F1 points over metadata alone. We also show that supervision using information from all modalities, rather than textual information alone, results in an increase of 43 F1 points, on average, over a variety of domains. Using multiple supervision sources is crucial to achieving high-quality information extraction from richly formatted data.

Takeaways. Supervision using multiple modalities of richly formatted data is key to achieving high end-to- end quality. Like multimodal featurization, multimodal supervision is also enabled by Fonduer’s data model and addresses stylistic data variety.

3.5 Experiments

We evaluate Fonduer in four applications: ELECTRONICS, ADVERTISEMENTS, PALEONTOLOGY, and GENOMICS—each containing several relation extraction tasks. We seek to answer: (1) how does Fonduer compare against both state-of-the-art KBC techniques and manually curated knowledge bases? and (2) how does each component of Fonduer affect end-to-end extraction quality?

3.5.1 Experimental Settings

Datasets. The datasets used for evaluation vary in size and format. Table 3.1 shows a summary of these datasets.

Table 3.1: Summary of the datasets used in our experiments.

Dataset   Size     #Docs   #Rels   Format
ELEC.     3GB      7K      4       PDF
ADS.      52GB     9.3M    4       HTML
PALEO.    95GB     0.3M    10      PDF
GEN.      1.8GB    589     4       XML

Electronics The ELECTRONICS dataset is a collection of single bipolar transistor specification datasheets from over 20 manufacturers, downloaded from Digi-Key.3 These documents consist primarily of tables and express relations containing domain-specific symbols. We focus on the relations between transistor part numbers and several of their electrical characteristics. We use this dataset to evaluate Fonduer on data consisting primarily of tables and numerical data.

Advertisements The ADVERTISEMENTS dataset contains webpages that may contain evidence of human trafficking activity. These webpages may provide prices of services, locations, contact information, physical characteristics of the victims, etc. We extract attributes associated with these trafficking ads. The output is deployed in production and is used by law enforcement agencies. This is a heterogeneous dataset containing millions of webpages over 692 web domains in which users create customized ads, resulting in 100,000s of unique layouts. We use this dataset to examine the robustness of Fonduer to the presence of significant data variety.

Paleontology The PALEONTOLOGY dataset is a collection of well-curated paleontology journal articles on fossils and ancient organisms. Here, we extract relations between paleontological discoveries and their corresponding physical measurements. These papers often contain tables spanning multiple pages. Thus, achieving high quality in this application requires linking content in tables to the text that references it, which can be separated by 20 pages or more in the document. We use this dataset to test Fonduer’s ability to draw candidates from document-level contexts.

Genomics The GENOMICS dataset is a collection of open-access biomedical papers on gene-wide association studies (GWAS) from the manually curated GWAS Catalog [Welter et al., 2014]. Here, we extract relations between single-nucleotide polymorphisms and human phenotypes found to be statistically significant. This dataset is published in XML format, thus, we do not have visual representations. We use this dataset to evaluate how well the Fonduer framework extracts relations from data that is published natively in a tree-based format.

Comparison Methods. We use two different methods to evaluate the quality of Fonduer’s output: the upper bound of state-of-the-art KBC systems (Oracle) and manually curated knowledge bases (Existing Knowledge Bases).

3 https://www.digikey.com

Oracle Existing state-of-the-art information extraction (IE) methods focus on either textual data or semi-structured and tabular data. We compare Fonduer against both types of IE methods. Each IE method can be split into (1) a candidate generation stage and (2) a filtering stage, the latter of which eliminates false positive candidates. For comparison, we approximate the upper bound of quality of three state-of-the-art information extraction techniques by experimentally measuring the recall achieved in the candidate generation stage of each technique and assuming that all candidates found using a particular technique are correct. That is, we assume the filtering stage is perfect by assuming a precision of 1.0.
• Text: We consider IE methods over text [Shin et al., 2015, Madaan et al., 2016]. Here, candidates are extracted from individual sentences, which are pre-processed with standard NLP tools to add part-of-speech tags, linguistic parsing information, etc.
• Table: For tables, we use an IE method for semi-structured data [Barowy et al., 2015]. Candidates are drawn from individual tables by utilizing table content and structure.
• Ensemble: We also implement an ensemble (proposed in [Dong et al., 2014]) as the union of candidates generated by Text and Table.

Existing Knowledge Base We use existing knowledge bases as another comparison method. The ELECTRONICS application is compared against the transistor specifications published by Digi-Key, while GENOMICS is compared to both GWAS Central [Beck et al., 2014] and GWAS Catalog [Welter et al., 2014], which are the most comprehensive collections of GWAS data and widely-used public datasets. Knowledge bases such as these are constructed using a combination of manual entry, web aggregation, paid third-party services, and automation tools.

Fonduer Details. Fonduer is implemented in Python, with database operations being handled by PostgreSQL. All experiments are executed in Jupyter Notebooks on a machine with four CPUs (each CPU is a 14-core 2.40 GHz Xeon E5-4657L), 1 TB RAM, and 12 × 3 TB hard drives, with the Ubuntu 14.04 operating system.

3.5.2 Experimental Results

Oracle Comparison

We compare the end-to-end quality of Fonduer to the upper bound of state-of-the-art systems. In Table 3.2, we see that Fonduer outperforms these upper bounds for each dataset. In ELECTRONICS, Fonduer results in a significant improvement of 71 F1 points over a text-only approach. In contrast, ADVERTISEMENTS has a higher upper bound with text than with tables, which reflects how advertisements rely more on text than the largely numerical tables found in ELECTRONICS. In the PALEONTOLOGY dataset, which depends on linking references from text to tables, the unified approach of Fonduer results in an increase of 43 F1 points over the Ensemble baseline. In GENOMICS, all candidates are cross-context, preventing both the text-only and the table-only approaches from finding any valid candidates.

Table 3.2: End-to-end quality in terms of precision, recall, and F1 score for each application compared to the upper bound of state-of-the-art systems.

Sys.     Metric   Text    Table   Ensemble   Fonduer
ELEC.    Prec.    1.00    1.00    1.00       0.73
         Rec.     0.03    0.20    0.21       0.81
         F1       0.06    0.40    0.42       0.77
ADS.     Prec.    1.00    1.00    1.00       0.87
         Rec.     0.44    0.37    0.76       0.89
         F1       0.61    0.54    0.86       0.88
PALEO.   Prec.    0.00    1.00    1.00       0.72
         Rec.     0.00    0.04    0.04       0.38
         F1       0.00*   0.08    0.08       0.51
GEN.     Prec.    0.00    0.00    0.00       0.89
         Rec.     0.00    0.00    0.00       0.81
         F1       0.00#   0.00#   0.00#      0.85
* Text did not find any candidates.
# No full tuples could be created using Text or Table alone.

Table 3.3: End-to-end quality vs. existing knowledge bases.

System                        ELEC.       GEN.
Knowledge Base                Digi-Key    GWAS Central   GWAS Catalog
# Entries in KB               376         3,008          4,023
# Entries in Fonduer          447         6,420          6,420
Coverage                      0.99        0.82           0.80
Accuracy                      0.87        0.87           0.89
# New Correct Entries         17          3,154          2,486
Increase in Correct Entries   1.05×       1.87×          1.42×

Existing Knowledge Base Comparison

We now compare Fonduer against existing knowledge bases for ELECTRONICS and GENOMICS. No manually curated KBs are available for the other two datasets. In Table 3.3, we find that Fonduer achieves high coverage of the existing knowledge bases, while also correctly extracting novel relation entries with over 85% accuracy in both applications. In ELECTRONICS, Fonduer achieved 99% coverage and extracted an additional 17 correct entries not found in Digi-Key’s catalog. In the GENOMICS application, we see that Fonduer provides over 80% coverage of both existing KBs and finds 1.87× and 1.42× more correct entries than GWAS Central and GWAS Catalog, respectively.

Takeaways. Fonduer achieves over 41 F1 points higher quality on average when compared against the upper bound of state-of-the-art approaches. Furthermore, Fonduer attains over 80% of the data in existing public knowledge bases while providing up to 1.87× the number of correct entries with high accuracy.

Figure 3.6: Average F1 score over four relations when broadening the extraction context scope in ELECTRONICS (Sentence 0.06, Table 0.30, Page 0.66, Document 0.77).

3.5.3 Ablation Studies

We conduct ablation studies to assess the effect of context scope, multimodal features, featurization approaches, and multimodal supervision on the quality of Fonduer. In each study, we change one component of Fonduer and hold the others constant.

Context Scope Study

To evaluate the importance of addressing the non-local nature of candidates in richly formatted data, we analyze how the different context scopes contribute to end-to-end quality. We limit the extracted candidates to four levels of context scope in ELECTRONICS and report the average F1 score for each. Figure 3.6 shows that increasing context scope can significantly improve the F1 score. Considering document context gives an additional 71 F1 points (12.8×) over sentence contexts and 47 F1 points (2.6×) over table contexts. The positive correlation between quality and context scope matches our expectations, since larger context scope is required to form candidates jointly from both table content and surrounding text. We see a smaller increase of 11 F1 points (1.2×) in quality between page and document contexts since many of the ELECTRONICS relation mentions are presented on the first page of the document.

Takeaways. Semantics can be distributed in a document or implied in its structure, thus requiring larger context scope than the traditional sentence-level contexts used in previous KBC systems.

Feature Ablation Study

We evaluate Fonduer’s multimodal features. We analyze how different features benefit information extraction from richly formatted data by comparing the effects of disabling one feature type while leaving all other types enabled, and report the average F1 scores of each configuration in Figure 3.7.

Figure 3.7: The impact of each modality in the feature library (average F1 for each dataset when all features are enabled versus when structural, visual, textual, or tabular features are disabled in turn).

We find that removing a single feature set resulted in drops of 2 F1 points (no textual features in PALEONTOLOGY) to 33 F1 points (no textual features in ADVERTISEMENTS). While it is clear in Figure 3.7 that each application depends on different feature types, we find that it is necessary to incorporate all feature types to achieve the highest extraction quality.

The characteristics of each dataset affect how valuable each feature type is to relation classification. The ADVERTISEMENTS dataset consists of webpages that often use tables to format and organize information—many relations can be found within the same cell or phrase. This heavy reliance on textual features is reflected by the drop of 33 F1 points when textual features are disabled. In ELECTRONICS, both components of the (part, attribute) tuples we extract are often isolated from other text. Hence, we see a small drop of 5 F1 points when textual features are disabled. We see a drop of 21 F1 points when structural features are disabled in the PALEONTOLOGY application due to its reliance on structural features to link between formation names (found in text sections or table captions) and the table itself. Finally, we see similar decreases when disabling structural and tabular features in the GENOMICS application (24 and 29 F1 points, respectively). Because this dataset is published natively in XML, structural and tabular features are almost perfectly parsed, which results in similar impacts of these features.

Takeaways. It is necessary to utilize multimodal features to provide a robust, domain-agnostic description for real-world data.

Featurization Study

We compare Fonduer’s multimodal featurization with: (1) a human-tuned multimodal feature library that leverages Fonduer’s data model, requiring feature engineering; (2) a Bi-LSTM with attention model; this RNN considers textual features only; (3) a machine-learning-based system for information extraction, referred to as SRV, which relies on HTML features [Freitag, 1998]; and (4) a document-level RNN [Li et al., 2015a], which learns a representation over all available modes of information captured by Fonduer’s data model. We find that:
• Fonduer’s automatic multimodal featurization approach produces results that are comparable to manually-tuned feature representations requiring feature engineering. Fonduer’s neural network is able to extract relations with a quality comparable to the human-tuned approach in all datasets, differing by no more than 2 F1 points (see Table 3.4).
• Fonduer’s RNN significantly outperforms a standard, out-of-the-box Bi-LSTM. The F1 score obtained by Fonduer’s multimodal RNN model is 1.7× to 2.2× higher than that of a typical Bi-LSTM (see Table 3.4).
• Fonduer outperforms extraction systems that leverage HTML features alone. Table 3.5 shows a comparison between Fonduer and SRV [Freitag, 1998] in the ADVERTISEMENTS domain—the only one of our datasets with HTML documents as input. Fonduer’s features capture more information than SRV’s HTML-based features, which only capture structural and textual information. This results in 2.3× higher quality.
• Using a document-level RNN to learn a single representation across all possible modalities results in neural networks with structures that are too large and too unique to batch effectively. This leads to slow runtime during training and poor-quality KBs. In Table 3.6, we compare the performance of a document-level RNN [Li et al., 2015a] and Fonduer’s approach of appending non-textual information in the last layer of the model (a sketch of this late-fusion design follows this list). As shown, Fonduer’s multimodal RNN obtains an F1 score that is almost 3× higher while being three orders of magnitude faster to train.
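To make that last comparison concrete, the following is a minimal PyTorch-style sketch of the late-fusion design referenced above: the candidate text is encoded with a Bi-LSTM and the structural, tabular, and visual features are concatenated only at the final layer. The module and parameter names are illustrative assumptions, not Fonduer’s actual implementation.

import torch
import torch.nn as nn

class LateFusionRelationClassifier(nn.Module):
    """Sketch: encode candidate text with a Bi-LSTM, then append the
    non-textual (structural/tabular/visual) features just before the
    output layer, instead of threading them through the recurrent encoder."""

    def __init__(self, vocab_size, embed_dim=100, hidden_dim=100,
                 n_other_feats=500, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        # The final layer sees both the text representation and the
        # non-textual multimodal features (late fusion).
        self.out = nn.Linear(2 * hidden_dim + n_other_feats, n_classes)

    def forward(self, token_ids, other_feats):
        # token_ids: (batch, seq_len); other_feats: (batch, n_other_feats),
        # e.g., indicators for row/column headers, page position, HTML tags.
        x = self.embed(token_ids)
        _, (h, _) = self.lstm(x)
        text_repr = torch.cat([h[0], h[1]], dim=1)  # forward + backward final states
        return self.out(torch.cat([text_repr, other_feats], dim=1))

Keeping the sequence model over text only is what keeps the network small and batchable, in contrast to a document-level RNN over every modality.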

Takeaways. Direct feature engineering is unnecessary when utilizing deep learning as a basis to obtain the feature representation needed to extract relations from richly formatted data.

Supervision Ablation Study

We study how quality is affected when using only textual LFs, only metadata LFs, and the combination of the two sets of LFs. Textual LFs only operate on textual modality characteristics while metadata LFs operate on structural, tabular, and visual modality characteristics. Figure 3.8 shows that applying metadata-based LFs achieves higher quality than traditional textual-level LFs alone. The highest quality is achieved when both types of LFs are used. In ELECTRONICS, we see an increase of 66 F1 points (9.2×) when using metadata LFs and a 3 F1 point (1.04×) improvement over metadata LFs when both types are used.

Table 3.4: Comparing approaches to featurization based on Fonduer’s data model.

Sys.     Metric   Human-tuned   Bi-LSTM w/ Attn.   Fonduer
ELEC.    Prec.    0.71          0.42               0.73
         Rec.     0.82          0.50               0.81
         F1       0.76          0.45               0.77
ADS.     Prec.    0.88          0.51               0.87
         Rec.     0.88          0.43               0.89
         F1       0.88          0.47               0.88
PALEO.   Prec.    0.92          0.52               0.76
         Rec.     0.37          0.15               0.38
         F1       0.53          0.23               0.51
GEN.     Prec.    0.92          0.66               0.89
         Rec.     0.82          0.41               0.81
         F1       0.87          0.47               0.85

Table 3.5: Comparing the features of SRV and Fonduer.

Feature Model   Precision   Recall   F1
SRV             0.72        0.34     0.39
Fonduer         0.87        0.89     0.88

Because the ELECTRONICS dataset relies more heavily on distant signals, LFs that can label correctness based on column or row header content significantly improve extraction quality. The ADVERTISEMENTS application benefits equally from metadata and textual LFs. Yet, we get an increase of 20 F1 points (1.2×) when both types of LFs are applied. The PALEONTOLOGY and GENOMICS applications show more moderate increases of 40 (4.6×) and 40 (1.8×) F1 points, respectively, when using both types over only textual LFs.

3.6 User Study

Traditionally, ground truth data is created through manual annotation, crowdsourcing, or other time-consuming methods and then used as data for training a machine-learning model. In Fonduer, we use the data-programming model for users to programmatically generate training data, rather than needing to perform manual annotation—a human-in-the-loop approach. In this section we qualitatively evaluate the effectiveness of our approach compared to traditional human labeling and observe the extent to which users leverage non-textual semantics when labeling candidates.

We conducted a user study with 10 users, where each user was asked to complete the relation extraction task of extracting maximum collector-emitter voltages from the ELECTRONICS dataset. Using the same experimental settings, we compare the effectiveness of two approaches for obtaining training data: (1) manual annotations (Manual) and (2) using labeling functions (LF).

Table 3.6: Comparing document-level RNN and Fonduer’s deep-learning model on a single ELECTRONICS relation.

Learning Model       Runtime during Training (secs/epoch)   Quality (F1)
Document-level RNN   37,421                                  0.26
Fonduer              48                                      0.65

Figure 3.8: Study of different supervision resources’ effect. Metadata includes structural, tabular, and visual modalities.

Figure 3.9: F1 quality over time with 95% confidence intervals (left). Modality distribution of user LFs (right).

We selected users with a basic knowledge of Python but no expertise in the ELECTRONICS domain. Users completed a 20-minute walk-through to familiarize themselves with the interface and procedures. To minimize the effect of cognitive fatigue and familiarity with the task, half of the users performed the task of manually annotating training data first, then the task of writing labeling functions, while the other half performed the tasks in the reverse order. We allotted 30 minutes for each task and evaluated the quality that was achieved using each approach at several checkpoints. For manual annotations, we evaluated every five minutes. We plotted the quality achieved by each user’s labeling functions each time the user performed an iteration of supervision and classification as part of Fonduer’s iterative approach. We filtered out two outliers and report results of eight users.

In Figure 3.9 (left), we report the quality (F1 score) achieved by the two different approaches. The average F1 achieved using manual annotation was 0.26 while the average F1 score using labeling functions was 0.49,

an improvement of 1.9×. We found with statistical significance that all users were able to achieve higher F1 scores using labeling functions than manually annotating candidates, regardless of the order in which they performed the approaches. There are two primary reasons for this trend. First, labeling functions provide a larger set of training data than manual annotations by enabling users to apply patterns they find in the data programmatically to all candidates—a natural desire they often vocalized while performing manual annotation. On average, our users manually labeled 285 candidates in the allotted time, while the labeling functions they created labeled 19,075 candidates. Users provided seven labeling functions on average. Second, labeling functions tend to allow Fonduer to learn more generic features, whereas manual annotations may not adequately cover the characteristics of the dataset as a whole. For example, labeling functions are easily applied to new data.

In addition, we found that for richly formatted data, users relied less on textual information—a primary signal in traditional KBC tasks—and more on information from other modalities, as shown in Figure 3.9 (right). Users utilized the semantics from multiple modalities of the richly formatted data, with 58.5% of their labeling functions using tabular information. This reflects the characteristics of the ELECTRONICS dataset, which contains information that is primarily found in tables. In our study, the most common labeling functions in each modality were the following (illustrative sketches of each style appear after the list):
• Tabular: labeling a candidate based on the words found in the same row or column.
• Visual: labeling a candidate based on its placement in a document (e.g., which page it was found on).
• Structural: labeling a candidate based on its tag names.
• Textual: labeling a candidate based on the textual characteristics of the voltage mention (e.g., magnitude).
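The sketches below illustrate what LFs in each of these styles might look like for the collector-emitter voltage task. The candidate attributes and the helper functions (get_row_ngrams, get_page, get_ancestor_tag_names) are stand-ins for accessors over Fonduer’s data model and should be read as assumptions rather than the system’s exact API; a return value of 0 means the LF abstains.

# Assumptions: each candidate c pairs a part mention c.part with an attribute
# mention c.attr, and c.attr_text holds the attribute's raw text. The get_*
# helpers are hypothetical accessors over Fonduer's data model.

def lf_row_header_mentions_vce(c):
    # Tabular: label positive if the candidate's row header mentions the
    # target attribute (here, the maximum collector-emitter voltage).
    row_words = set(get_row_ngrams(c.attr))
    return 1 if {"collector-emitter", "vceo"} & row_words else 0

def lf_late_page(c):
    # Visual: specification tables rarely appear deep inside a datasheet.
    return -1 if get_page(c.attr) > 2 else 0

def lf_heading_tag(c):
    # Structural: part numbers inside heading tags are usually document
    # titles rather than table entries.
    return -1 if "h1" in get_ancestor_tag_names(c.part) else 0

def lf_voltage_magnitude(c):
    # Textual: implausibly large numbers are unlikely to be this voltage.
    return -1 if float(c.attr_text) > 1000 else 0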

Takeaways. We found that when working with richly formatted data, users relied heavily on non-textual signals to identify candidates and weakly supervise the KBC system. Furthermore, leveraging weak supervision allowed users to create knowledge bases more effectively than traditional manual annotations alone.

3.7 Extensions

The GENOMICS application described in this chapter was explored in greater detail in a follow-up Nature Communications article [Kuleshov et al., 2019]. In that work, we introduced GWASkb, a machine-compiled knowledge base of over 6000 genotype–phenotype associations, constructed using Fonduer. There we include a more in-depth analysis of the types of associations that are recovered by Fonduer and missed by existing manually curated databases. We also create a web interface4 (see Appendix B.5) for searching the contents of GWASkb by genotype or phenotype.

4http://gwaskb.stanford.edu/

3.8 Related Work

Context Scope Existing KBC systems often restrict candidates to specific context scopes such as single sentences [Madaan et al., 2016, Yahya et al., 2014] or tables [Carlson et al., 2010]. Others perform KBC from richly formatted data by ensembling candidates discovered using separate extraction tasks [Dong et al., 2014, Govindaraju et al., 2013], which overlooks candidates composed of mentions that must be found jointly from document-level context scopes.

Multimodality Information extraction systems for unstructured data utilize only textual features [Mintz et al., 2009a]. Recognizing the need to represent layout information as well when working with richly formatted data, various additional feature libraries have been proposed. Some have relied predominantly on structural features, usually in the context of web tables [Tengli et al., 2004, Penn et al., 2001, Pinto et al., 2002, Freitag, 1998]. Others have built systems that rely only on visual information [Gatterbauer et al., 2007, Yang and Zhang, 2001]. There have been instances of visual information being used to supplement a tree-based representation of a document [Kovacevic et al., 2002, Cosulschi et al., 2004], but these systems were designed for other tasks, such as document classification and page segmentation. By utilizing our deep-learning-based featurization approach, which supports all of these representations, Fonduer obviates the need to focus on feature engineering and frees the user to iterate over the supervision and learning stages of the framework.

Supervision Sources Distant supervision is one effective way to programmatically create training data for use in machine learning. In this paradigm, facts from existing knowledge bases are paired with unlabeled documents to create noisy or “weakly” labeled training examples [Mintz et al., 2009a, Min et al., 2013, Nguyen and Moschitti, 2011, Angeli et al., 2014]. In addition to existing knowledge bases, crowdsourcing [Gao et al., 2011] and heuristics from domain experts [Pasupat and Liang, 2014] have also proven to be effective weak supervision sources. In our work, we show that by incorporating all kinds of supervision in one framework in a noise-aware way, we are able to achieve high quality in knowledge base construction. Furthermore, through our programming model, we empower users to add supervision based on intuition from any modality of data.

3.9 Conclusion

In this chapter, we study how to extract information from richly formatted data. We show that the key challenges of this problem are (1) prevalent document-level relations, (2) multimodality, and (3) data variety. To address these, we propose Fonduer, the first KBC system for richly formatted information extraction. We describe Fonduer’s data model, which enables users to perform candidate extraction, multimodal featurization, and multimodal supervision through a simple programming model. We evaluate Fonduer on four real-world domains and show an average improvement of 41 F1 points over the upper bound of state-of-the-art approaches. In some domains, Fonduer extracts up to 1.87× the number of correct relations compared to expert-curated public knowledge bases.

Chapter 4

Weak Supervision from Natural Language

While moving up the stack from manual to programmatic supervision makes supervising a task much more scalable, it also makes it less accessible: anyone can click yes or no when providing a label, but not everyone can code. One way to rectify this is to move up to an even higher level interface. We accomplish this with the system BabbleLabble, demonstrating that the very high-level input of natural language can be effectively compiled down the supervision stack into inputs suitable for training high-quality models. The work in this chapter is the result of collaboration with Paroma Varma, Stephanie Wang, Martin Bringmann, Percy Liang, Christopher Ré, and others. It draws on content from the following BabbleLabble-related publications: [Hancock et al., 2017, 2018].

4.1 Introduction

The standard protocol for obtaining a labeled dataset is to have a human annotator view each example, assess its relevance, and provide a label (e.g., positive or negative for binary classification). However, this only provides one bit of information per example. This invites the question: how can we get more information per example, given that the annotator has already spent the effort reading and understanding an example? Previous works have relied on identifying relevant parts of the input such as labeling features [Druck et al., 2009, Raghavan et al., 2005, Liang et al., 2009a], highlighting rationale phrases in text [Zaidan and Eisner, 2008a, Arora and Nyberg, 2009], or marking relevant regions in images [Ahn et al., 2006]. But there are certain types of information which cannot be easily reduced to annotating a portion of the input, such as the absence of a certain word, or the presence of at least two words. In this work, we tap into the power of natural language and allow annotators to provide supervision to a classifier via natural language explanations. Specifically, we propose a framework in which annotators provide a natural language explanation for each label they assign to an example (see Figure 4.1).


Example: “Both cohorts showed signs of optic nerve toxicity due to ethambutol.”

Label: Does this chemical cause this disease? Y / N

Explanation: Why do you think so? “Because the words ‘due to’ occur between the chemical and the disease.”

Labeling Function:
def lf(x):
    return 1 if "due to" in between(x.chemical, x.disease) else 0

Figure 4.1: In BabbleLabble, the user provides a natural language explanation for each labeling decision. These explanations are parsed into labeling functions that convert unlabeled data into a large labeled dataset for training a classifier.

These explanations are parsed into logical forms representing labeling functions (LFs), functions that heuristically map examples to labels [Ratner et al., 2016b]. The labeling functions are then executed on many unlabeled examples, resulting in a large, weakly-supervised training set that is then used to train a classifier.

Semantic parsing of natural language into logical forms is recognized as a challenging problem and has been studied extensively [Zelle and Mooney, 1996, Zettlemoyer and Collins, 2005, Liang et al., 2011, Liang, 2016]. One of our major findings is that in our setting, even a simple rule-based semantic parser suffices, for three reasons. First, we find that the majority of incorrect LFs can be automatically filtered out either semantically (e.g., is it consistent with the associated example?) or pragmatically (e.g., does it avoid assigning the same label to the entire training set?). Second, LFs near the gold LF in the space of logical forms are often just as accurate (and sometimes even more accurate). Third, techniques for combining weak supervision sources are built to tolerate some noise [Alfonseca et al., 2012a, Takamatsu et al., 2012a, Ratner et al., 2017b]. The significance of this is that we can deploy the same semantic parser across tasks without task-specific training. We show how we can tackle a real-world biomedical application with the same semantic parser used to extract instances of spouses.

Our work is most similar to that of Srivastava et al. [2017], who also use natural language explanations to train a classifier, but with two important differences. First, they jointly train a task-specific semantic parser and classifier, whereas we use a simple rule-based parser. In Section 4.4, we find that in our weak supervision framework, the rule-based semantic parser and the perfect parser yield nearly identical downstream performance. Second, while they use the logical forms of explanations to produce features that are fed directly to a classifier, we use them as functions for labeling a much larger training set. In Section 4.4, we show that using functions yields a 9.5 F1 improvement (26% relative improvement) over features, and that the F1 score scales with the amount of available unlabeled data.

Figure 4.2: Natural language explanations are parsed into candidate labeling functions (LFs). Many incorrect LFs are filtered out automatically by the filter bank. The remaining functions provide heuristic labels over the unlabeled dataset, which are aggregated into one noisy label per example, yielding a large, noisily-labeled training set for a classifier.

We validate our approach on two existing datasets from the literature (extracting spouses from news articles and disease-causing chemicals from biomedical abstracts) and one real-world use case with our biomedical collaborators at OccamzRazor to extract protein-kinase interactions related to Parkinson’s disease from text. We find empirically that users are able to train classifiers with comparable F1 scores up to 100× faster when they provide natural language explanations instead of individual labels. Our code is available at https://github.com/HazyResearch/babble.

4.2 The BabbleLabble Framework

The BabbleLabble framework converts natural language explanations and unlabeled data into a noisily-labeled training set (see Figure 4.2). There are three key components: a semantic parser, a filter bank, and a label aggregator. The semantic parser converts natural language explanations into a set of logical forms representing labeling functions (LFs). The filter bank removes as many incorrect LFs as possible without requiring ground truth labels. The remaining LFs are applied to unlabeled examples to produce a matrix of labels. This label matrix is passed into the label aggregator, which combines these potentially conflicting and overlapping labels into one label for each example. The resulting labeled examples are then used to train an arbitrary discriminative model.

Figure 4.3: Valid parses are found by iterating over increasingly large subspans of the input looking for matches among the right hand sides of the rules in the grammar. Rules are either lexical (converting tokens into symbols), unary (converting one symbol into another symbol), or compositional (combining many symbols into a single higher-order symbol). A rule may optionally ignore unrecognized tokens in a span.

4.2.1 Explanations

To create the input explanations, the user views a subset S of an unlabeled dataset D (where |S| ≪ |D|) and provides for each input xi ∈ S a label yi and a natural language explanation ei, a sentence explaining why the example should receive that label. The explanation ei generally refers to specific aspects of the example (e.g., in Figure 4.2, the location of a specific string “his wife”).

4.2.2 Semantic Parser

The semantic parser takes a natural language explanation ei and returns a set of LFs (logical forms or labeling functions) {f1,..., fk} of the form fi : X → {−1, 0, 1} in a binary classification setting, with 0 representing abstention. We emphasize that the goal of this semantic parser is not to generate the single correct parse, but rather to have coverage over many potentially useful LFs.1 We choose a simple rule-based semantic parser that can be used without any training. Formally, the parser uses a set of rules of the form α → β, where α can be replaced by the token(s) in β (see Figure 4.3 for example rules). To identify candidate LFs, we recursively construct a set of valid parses for each span of the explanation, based on the substitutions defined by the grammar rules. At the end, the parser returns all valid parses (LFs in our case) corresponding to the entire explanation. We also allow an arbitrary number of tokens in a given span to be ignored when looking for a matching rule. This improves the ability of the parser to handle unexpected input, such as unknown words or typos, since the portions of the input that are parseable can still result in a valid parse. For example, in Figure 4.3, the word “person” is ignored.
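To make the mechanism concrete, here is a minimal sketch of this kind of bottom-up span parsing. The actual parser is built on SippyCup and additionally supports unary rules, rule templates, and skipping unrecognized tokens inside a span; the toy grammar below is only an assumption for illustration.

# Toy grammar: lexical rules map single tokens to symbols; binary rules
# combine two adjacent symbols into a higher-order symbol with a semantics.
LEXICAL = {
    "label": ("LABEL", None),
    "true": ("BOOL", 1),
    "false": ("BOOL", -1),
    "because": ("BECAUSE", None),
}
BINARY = {
    ("LABEL", "BOOL"): ("LABELED", lambda _, b: b),
    ("LABELED", "BECAUSE"): ("LF_HEAD", lambda lab, _: lab),
    # ... compositional rules for conditions (ARGLIST, comparisons, etc.) go here
}

def parse(tokens):
    n = len(tokens)
    # chart[i][j] holds every (symbol, semantics) parse covering tokens[i:j]
    chart = [[[] for _ in range(n + 1)] for _ in range(n + 1)]
    for i, tok in enumerate(tokens):
        chart[i][i + 1].append(LEXICAL.get(tok, ("TOKEN", tok)))
    for length in range(2, n + 1):          # increasingly large subspans
        for i in range(n - length + 1):
            j = i + length
            for k in range(i + 1, j):       # split point
                for sym1, sem1 in chart[i][k]:
                    for sym2, sem2 in chart[k][j]:
                        if (sym1, sym2) in BINARY:
                            sym, combine = BINARY[(sym1, sym2)]
                            chart[i][j].append((sym, combine(sem1, sem2)))
    return chart[0][n]                      # all parses spanning the whole input

print(parse("label false because".split()))  # -> [('LF_HEAD', -1)]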

1Indeed, we find empirically that an incorrect LF nearby the correct one in the space of logical forms actually has higher end-task accuracy 57% of the time (see Section 4.4.2).

Predicate                           Description
bool, string, int, float,          Standard primitive data types
  tuple, list, set
and, or, not, any, all, none       Standard logic operators
=, ≠, <, ≤, >, ≥                   Standard comparison operators
lower, upper, capital, all_caps    Return True for strings of the corresponding case
starts_with, ends_with, substring  Return True if the first string starts/ends with or contains the second
person, location, date, number,    Return True if a string has the corresponding NER tag
  organization
alias                              A frequently used list of words may be predefined and referred to with an alias
count, contains, intersection      Operators for checking size, membership, or common elements of a list/set
map, filter                        Apply a functional primitive to each member of a list/set to transform or filter the elements
word_distance, character_distance  Return the distance between two strings by words or characters
left, right, between, within       Return as a string the text that is left/right/within some distance of a string or between two designated strings

Table 4.1: Predicates in the grammar supported by BabbleLabble’s rule-based semantic parser.

All predicates included in our grammar (summarized in Table 4.1) are provided to annotators, with minimal examples of each in use (Appendix C.1). Importantly, all rules are domain independent (e.g., all three relation extraction tasks that we tested used the same grammar), making the semantic parser easily transferable to new domains. Additionally, while this paper focuses on the task of relation extraction, in principle the BabbleLabble framework can be applied to other tasks or settings by extending the grammar with the necessary primitives (e.g., adding primitives for rows and columns to enable explanations about the alignments of words in tables). To guide the construction of the grammar, we collected 500 explanations for the Spouse domain from workers on Amazon Mechanical Turk and added support for the most commonly used predicates. These were added before the experiments described in Section 4.4. The grammar contains a total of 200 rule templates.

4.2.3 Filter Bank

The input to the filter bank is a set of candidate LFs produced by the semantic parser. The purpose of the filter bank is to discard as many incorrect LFs as possible without requiring additional labels. It consists of two

classes of filters: semantic and pragmatic.

Recall that each explanation ei is collected in the context of a specific labeled example (xi, yi). The semantic filter checks for LFs that are inconsistent with their corresponding example; formally, any LF f for which f(xi) ≠ yi is discarded. For example, in the first explanation in Figure 4.2, the word “right” can be interpreted as either “immediately” (as in “right before”) or simply “to the right.” The latter interpretation results in a function that is inconsistent with the associated example (since “his wife” is actually to the left of person 2), so it can be safely removed.

The pragmatic filters remove LFs that are constant, redundant, or correlated. For example, in Figure 4.2, LF_2a is constant, as it labels every example positively (since all examples contain two people from the same sentence). LF_3b is redundant, since even though it has a different syntax tree from LF_3a, it labels the training set identically and therefore provides no new signal. Finally, out of all LFs from the same explanation that pass all the other filters, we keep only the most specific (lowest coverage) LF. This prevents multiple correlated LFs from a single example from dominating.

As we show in Section 4.4, over three tasks, the filter bank removes over 95% of incorrect parses, and the incorrect ones that remain have average end-task accuracy within 2.5 points of the corresponding correct parses.
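A simplified sketch of the two filter classes is given below. It checks exact label-signature duplicates as a stand-in for the redundancy and correlation checks and omits the most-specific-LF-per-explanation step, so it should be read as an illustration rather than the exact filter bank.

def apply_filter_bank(candidate_lfs, example, label, unlabeled_examples):
    """Return the candidate LFs that survive the semantic and pragmatic filters."""
    # Semantic filter: an LF must label its own originating example correctly.
    lfs = [lf for lf in candidate_lfs if lf(example) == label]

    survivors, seen_signatures = [], set()
    for lf in lfs:
        signature = tuple(lf(x) for x in unlabeled_examples)
        if len(set(signature)) == 1:      # constant: same output on every example
            continue
        if signature in seen_signatures:  # redundant: duplicates another LF's labels
            continue
        seen_signatures.add(signature)
        survivors.append(lf)
    return survivors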

4.2.4 Label Aggregator

The label aggregator combines multiple (potentially conflicting) suggested labels from the LFs into a single probabilistic label per example. Concretely, if m LFs pass the filter bank and are applied to n examples, the label aggregator implements a function f : {−1, 0, 1}^{m×n} → [0, 1]^n. A naive solution would be to use a simple majority vote, but this fails to account for the fact that LFs can vary widely in accuracy and coverage. Instead, we use data programming [Ratner et al., 2016b], which models the relationship between the true labels and the output of the labeling functions as a factor graph. More specifically, given the true labels Y ∈ {−1, 1}^n (latent) and label matrix Λ ∈ {−1, 0, 1}^{m×n} (observed) where

Λ_{i,j} = LF_i(x_j), we define two types of factors representing labeling propensity and accuracy:

φ^{Lab}_{i,j}(Λ, Y) = 1{Λ_{i,j} ≠ 0}          (4.1)
φ^{Acc}_{i,j}(Λ, Y) = 1{Λ_{i,j} = y_j}.       (4.2)

Denoting the vector of factors pertaining to a given data point x_j as φ_j(Λ, Y) ∈ R^{2m}, define the model:

p_w(Λ, Y) = Z_w^{−1} exp( Σ_{j=1}^{n} w · φ_j(Λ, Y) ),          (4.3)

where w ∈ R^{2m} is the weight vector and Z_w is the normalization constant. To learn this model without

knowing the true labels Y, we minimize the negative log marginal likelihood given the observed labels Λ:

ŵ = argmin_w − log Σ_Y p_w(Λ, Y),          (4.4)

using SGD and Gibbs sampling for inference, and then use the marginals p_ŵ(Y | Λ) as probabilistic training labels. Intuitively, we infer accuracies of the LFs based on the way they overlap and conflict with one another. Since noisier LFs are more likely to have high conflict rates with others, their corresponding accuracy weights in w will be smaller, reducing their influence on the aggregated labels.
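For intuition, the sketch below shows the shape of the computation on the label matrix Λ. It replaces the learned factor-graph model of Equations 4.1–4.4 with a crude stand-in that scores each LF against a provisional majority vote and then takes an accuracy-weighted vote, so it illustrates the interface, not the actual inference procedure.

import numpy as np

def aggregate_labels(L):
    """L: (m, n) label matrix with entries in {-1, 0, +1}, where 0 = abstain.
    Returns an estimate of P(y = +1) for each of the n examples."""
    majority = np.sign(L.sum(axis=0))            # provisional label per example
    weights = np.zeros(L.shape[0])
    for i, row in enumerate(L):
        voted = row != 0
        if voted.any():
            # agreement of this LF's non-abstentions with the provisional labels
            acc = (row[voted] == majority[voted]).mean()
            weights[i] = np.clip(acc, 0.05, 0.95)
    scores = (weights[:, None] * L).sum(axis=0)  # accuracy-weighted vote
    return 1.0 / (1.0 + np.exp(-scores))         # squash to a probability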

4.2.5 Discriminative Model

The noisy training set that the label aggregator outputs is used to train an arbitrary discriminative model. One advantage of training a discriminative model on the task instead of using the label aggregator as a classifier directly is that the label aggregator only takes into account those signals included in the LFs. A discriminative model, on the other hand, can incorporate features that were not identified by the user but are nevertheless informative.2 Consequently, even examples for which all LFs abstained can still be classified correctly. Additionally, passing supervision information from the user to the model in the form of a dataset—rather than hard rules—promotes generalization in the new model (rather than memorization), similar to distant supervision [Mintz et al., 2009a]. On the three tasks we evaluate, using the discriminative model averages 4.3 F1 points higher than using the label aggregator directly. For the results reported in this paper, our discriminative model is a simple logistic regression classifier with generic features defined over dependency paths.3 These features include unigrams, bigrams, and trigrams of lemmas, dependency labels, and part of speech tags found in the siblings, parents, and nodes between the entities in the dependency parse of the sentence. We found this to perform better on average than a biLSTM, particularly for the traditional supervision baselines with small training set sizes; it also provided easily interpretable features for analysis.
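Training the discriminative model on these probabilistic labels amounts to minimizing the expected loss under the aggregator’s marginals. A hedged scikit-learn sketch of this idea (an assumed tool choice; the experiments in this chapter use the Snorkel implementation) duplicates each example as a positive and a negative, weighted by its marginal probability:

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_noise_aware(X, p_pos):
    """X: (n, d) feature matrix; p_pos: (n,) probabilistic labels P(y = +1 | Lambda)
    from the label aggregator. Weighting each duplicated example by its class
    marginal makes the objective the expected log loss under the noisy labels."""
    X_rep = np.vstack([X, X])
    y_rep = np.concatenate([np.ones(len(X)), np.zeros(len(X))])
    w_rep = np.concatenate([p_pos, 1.0 - p_pos])
    clf = LogisticRegression(max_iter=1000)
    clf.fit(X_rep, y_rep, sample_weight=w_rep)
    return clf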

4.3 Experimental Setup

We evaluate the accuracy of BabbleLabble on three relation extraction tasks, which we refer to as Spouse, Disease, and Protein. The goal of each task is to train a classifier for predicting whether the two entities in an example are participating in the relationship of interest, as described below.

2We give an example of two such features in Section 4.4.3.
3https://github.com/HazyResearch/treedlib

Spouse
Example: “They include Joan Ridsdale, a 62-year-old payroll administrator from County Durham who was hit with a €16,000 tax bill when her husband Gordon died.” (person 1, person 2)
Explanation: “True, because the phrase ‘her husband’ is within three words of person 2.”

Disease
Example: “Young women on replacement estrogens for ovarian failure after cancer therapy may also have increased risk of endometrial carcinoma and should be examined periodically.” (chemical, disease)
Explanation: “True, because ‘risk of’ comes before the disease.”

Protein
Example: “Here we show that c-Jun N-terminal kinases JNK1, JNK2 and JNK3 phosphorylate tau at many serine/threonine-prolines, as assessed by the generation of the epitopes of phosphorylation-dependent anti-tau antibodies.” (protein, kinase)
Explanation: “True, because at least one of the words ‘phosphorylation’, ‘phosphorylate’, ‘phosphorylated’, ‘phosphorylates’ is found in the sentence and the number of words between the protein and kinase is smaller than 8.”

Figure 4.4: An example and explanation for each of the three datasets.

Task      Train    Dev     Test    % Pos.
Spouse    22,195   2,796   2,697   8%
Disease   6,667    773     4,101   23%
Protein   5,546    1,011   1,058   22%

Table 4.2: The total number of unlabeled training examples (a pair of annotated entities in a sentence), labeled development examples (for hyperparameter tuning), labeled test examples (for assessment), and the fraction of positive labels in the test split.

4.3.1 Datasets

Statistics for each dataset are reported in Table 4.2, with one example and one explanation for each given in Figure 4.4 and additional explanations shown in Appendix C.2.

In the Spouse task, annotators were shown a sentence with two highlighted names and asked to label whether the sentence suggests that the two people are spouses. Sentences were pulled from the Signal Media dataset of news articles [Corney et al., 2016b]. Ground truth data was collected from Amazon Mechanical Turk workers, accepting the majority label over three annotations. The 30 explanations we report on were sampled randomly from a pool of 200 that were generated by 10 graduate students unfamiliar with BabbleLabble.

In the Disease task (also called CDR in Chapter 2), annotators were shown a sentence with highlighted names of a chemical and a disease and asked to label whether the sentence suggests that the chemical causes the disease. Sentences and ground truth labels came from a portion of the 2015 BioCreative chemical-disease relation dataset [Wei et al., 2015a], which contains abstracts from PubMed. Because this task requires specialized domain expertise, we obtained explanations by having someone unfamiliar with BabbleLabble translate from Python to natural language labeling functions from an existing publication that explored applying weak supervision to this task [Ratner et al., 2017b].

The Protein task was completed in conjunction with OccamzRazor, a neuroscience company targeting biological pathways of Parkinson’s disease. For this task, annotators were shown a sentence from the relevant biomedical literature with highlighted names of a protein and a kinase and asked to label whether or not the kinase influences the protein in terms of a physical interaction or phosphorylation. The annotators had domain expertise but minimal programming experience, making BabbleLabble a natural fit for their use case.

            BL     TS
# Inputs    30     30     60     150    300    1,000   3,000   10,000
Spouse      50.1   15.5   15.9   16.4   17.2   22.8    41.8    55.0
Disease     42.3   32.1   32.6   34.4   37.5   41.9    44.5    -
Protein     47.3   39.3   42.1   46.8   51.0   57.6    -       -
Average     46.6   28.9   30.2   32.5   35.2   40.8    43.2    55.0

Table 4.3: F1 scores obtained by a classifier trained with BabbleLabble (BL) using 30 explanations or with traditional supervision (TS) using the specified number of individually labeled examples. BabbleLabble achieves the same F1 score as traditional supervision while using fewer user inputs by a factor of over 5 (Protein) to over 100 (Spouse).

4.3.2 Experimental Settings

Text documents are tokenized with spaCy.4 The semantic parser is built on top of the Python-based implementation SippyCup.5 On a single core, parsing 360 explanations takes approximately two seconds. We use existing implementations of the label aggregator, feature library, and discriminative classifier described in Sections 4.2.4–4.2.5 provided by the open-source project Snorkel [Ratner et al., 2017b]. Hyperparameters for all methods we report were selected via random search over thirty configurations on the same held-out development set. We searched over learning rate, batch size, L2 regularization, and the subsampling rate (for improving balance between classes).6 All reported F1 scores are the average value of 40 runs with random seeds and otherwise identical settings.

4.4 Experimental Results

We evaluate the performance of BabbleLabble with respect to its rate of improvement by number of user inputs, its dependence on correctly parsed logical forms, and the mechanism by which it utilizes logical forms.

4.4.1 High Bandwidth Supervision

In Table 4.3 we report the average F1 score of a classifier trained with BabbleLabble using 30 explanations or traditional supervision with the indicated number of labels. On average, it took the same amount of time to collect 30 explanations as 60 labels.7 We observe that in all three tasks, BabbleLabble achieves a given F1 score with far fewer user inputs than traditional supervision, by as much as 100 times in the case of the Spouse task. Because explanations are applied to many unlabeled examples, each individual input from the user can implicitly contribute many (noisy) labels to the learning algorithm.

4https://github.com/explosion/spaCy
5https://github.com/wcmac/sippycup
6Hyperparameter ranges: learning rate (1e-2 to 1e-4), batch size (32 to 128), L2 regularization (0 to 100), subsampling rate (0 to 0.5)
7Zaidan and Eisner [2008a] also found that collecting annotator rationales in the form of highlighted substrings from the sentence only doubled annotation time.

Explanation: “False, because a word starting with ‘improve’ appears before the chemical.”
LF_1a (correct parse, accuracy 84.6%): return -1 if any(w.startswith("improv") for w in left(x.person2)) else 0
LF_1b (incorrect parse, accuracy 84.6%): return -1 if "improv" in left(x.person2) else 0

Explanation: “True, because ‘husband’ occurs right before person 1.”
LF_2a (correct parse, accuracy 13.6%): return 1 if "husband" in left(x.person1, dist==1) else 0
LF_2b (incorrect parse, accuracy 66.2%): return 1 if "husband" in left(x.person2, dist==1) else 0

Figure 4.5: Incorrect LFs often still provide useful signal. On top is an incorrect LF produced for the Disease task that had the same accuracy as the correct LF. On bottom is a correct LF from the Spouse task and a more accurate incorrect LF discovered by randomly perturbing one predicate at a time as described in Section 4.4.2. (Person 2 is always the second person in the sentence.)

           Pre-filters       Discarded        Post-filters
           LFs    Correct    Sem.    Prag.    LFs    Correct
Spouse     156    10%        19      118      19     84%
Disease    102    23%        34      40       28     89%
Protein    122    14%        44      58       20     85%

Table 4.4: The number of LFs generated from 30 explanations (pre-filters), discarded by the filter bank, and remaining (post-filters), along with the percentage of LFs that were correctly parsed from their corresponding explanations.

We also observe, however, that once the number of labeled examples is sufficiently large, traditional supervision once again dominates, since ground truth labels are preferable to noisy ones generated by labeling functions. However, in domains where there is much more unlabeled data available than labeled data (which in our experience is most domains), we can gain in supervision efficiency from using BabbleLabble.

Of those explanations that did not produce a correct LF, 4% were caused by the explanation referring to unsupported concepts (e.g., one explanation referred to “the subject of the sentence,” which our simple parser doesn’t support). Another 2% were caused by human errors (the correct LF for the explanation was inconsistent with the example). The remainder were due to unrecognized paraphrases (e.g., the explanation said “the order of appearance is X, Y” instead of a supported phrasing like “X comes before Y”).

4.4.2 Utility of Incorrect Parses

In Table 4.4, we report LF summary statistics before and after filtering. LF correctness is based on exact match with a manually generated parse for each explanation. Surprisingly, the simple heuristic-based filter bank successfully removes over 95% of incorrect LFs in all three tasks, resulting in final LF sets that are 86% correct on average. Furthermore, among those LFs that pass through the filter bank, we found that the average difference in end-task accuracy between correct and incorrect parses is less than 2.5%. Intuitively, the filters are effective because it is quite difficult for an LF to be parsed from the explanation, label its own example correctly (passing the semantic filter), and not label all examples in the training set with the same label or identically to another LF (passing the pragmatic filter).

           BL-FB   BL     BL+PP
Spouse     15.7    50.1   49.8
Disease    39.8    42.3   43.2
Protein    38.2    47.3   47.4
Average    31.2    46.6   46.8

Table 4.5: F1 scores obtained using BabbleLabble with no filter bank (BL-FB), as normal (BL), and with a perfect parser (BL+PP) simulated by hand.

Figure 4.6: When logical forms of natural language explanations are used as functions for data programming (as they are in BabbleLabble), performance can improve with the addition of unlabeled data, whereas using them as features does not benefit from unlabeled data.

We went one step further: using the LFs that would be produced by a perfect semantic parser as starting points, we searched for “nearby” LFs (LFs differing by only one predicate) with higher end-task accuracy on the test set and succeeded 57% of the time (see Figure 4.5 for an example). In other words, when users provide explanations, the signals they describe provide good starting points, but they are actually unlikely to be optimal. This observation is further supported by Table 4.5, which shows that the filter bank is necessary to remove clearly irrelevant LFs, but with that in place, the simple rule-based semantic parser and a perfect parser have nearly identical average F1 scores.

4.4.3 Using LFs as Functions or Features

Once we have relevant logical forms from user-provided explanations, we have multiple options for how to use them. Srivastava et al. [2017] propose using these logical forms as features in a linear classifier, essentially using a traditional supervision approach with user-specified features. We choose instead to use them as functions for weakly supervising the creation of a larger training set via data programming [Ratner et al., 2016b]. In Table 4.6, we compare the two approaches directly, finding that the data programming approach outperforms a feature-based one by 9.5 F1 points on average with the rule-based parser, and by 4.5 points with a perfect parser.

           BL-DM   BL     BL+PP   Feat   Feat+PP
Spouse     46.5    50.1   49.8    33.9   39.2
Disease    39.7    42.3   43.2    40.8   43.8
Protein    40.6    47.3   47.4    36.7   44.0
Average    42.3    46.6   46.8    37.1   42.3

Table 4.6: F1 scores obtained using explanations as functions for data programming (BL) or features (Feat), optionally with no discriminative model (-DM) or using a perfect parser (+PP).

We attribute this difference primarily to the ability of data programming to utilize a larger feature set and unlabeled data. In Figure 4.6, we show how the data programming approach improves with the number of unlabeled examples, even as the number of LFs remains constant. We also observe qualitatively that data programming exposes the classifier to additional patterns that are correlated with our explanations but not mentioned directly. For example, in the Disease task, two of the features weighted most highly by the discriminative model were the presence of the trigrams “could produce a” or “support diagnosis of” between the chemical and disease, despite none of these words occurring in the explanations for that task. In Table 4.6 we see a 4.3 F1 point improvement (10%) when we use the discriminative model that can take advantage of these features rather than applying the LFs directly to the test set and making predictions based on the label aggregator’s outputs.

4.5 Related Work and Discussion

Our work has two themes: modeling natural language explanations/instructions and learning from weak supervision. The closest body of work is on “learning from natural language.” As mentioned earlier, Srivastava et al. [2017] convert natural language explanations into classifier features (whereas we convert them into labeling functions). Goldwasser and Roth [2011] convert natural language into concepts (e.g., the rules of a card game). Ling and Fidler [2017] use natural language explanations to assist in supervising an image captioning model. Weston [2016], Li et al. [2016] learn from natural language feedback in a dialogue. Wang et al. [2017] convert natural language definitions to rules in a semantic parser to build up progressively higher-level concepts.

We lean on the formalism of semantic parsing [Zelle and Mooney, 1996, Zettlemoyer and Collins, 2005, Liang, 2016]. One notable trend is to learn semantic parsers from weak supervision [Clarke et al., 2010, Liang et al., 2011], whereas our goal is to obtain weak supervision signal from semantic parsers.

The broader topic of weak supervision has received much attention; we mention some works most related to relation extraction. In distant supervision [Craven et al., 1999, Mintz et al., 2009a] and multi-instance learning [Riedel et al., 2010a, Hoffmann et al., 2011b], an existing knowledge base is used to (probabilistically) impute a training set. Various extensions have focused on aggregating a variety of supervision sources by learning generative models from noisy labels [Alfonseca et al., 2012a, Takamatsu et al., 2012a, Roth and Klakow, 2013a, Ratner et al., 2016b, Varma et al., 2017].

Figure 4.7: The BabbleLabble interface.

Finally, while we have used natural language explanations as input to train models, they can also be output to interpret models [Krening et al., 2017, Lei et al., 2016]. More generally, from a machine learning perspective, labels are the primary asset, but they are a low bandwidth signal between annotators and the learning algorithm. Natural language opens up a much higher-bandwidth communication channel. We have shown that weak supervision from the high-level abstraction of natural language can be used to train high-performance models in relation extraction (where one explanation can be “worth” 100 labels), and it would be interesting to extend our framework to other tasks and more interactive settings.

4.6 Extensions

The graphical user interface to the BabbleLabble system shown in Figure 4.7 was featured as a demonstration at NeurIPS 2017 [Hancock et al., 2017].8 With this interface, users interact with the system as described above, but with additional summary statistics being reported with each explanation that is submitted, such as total explanation count, dataset coverage, current performance on the development set, etc. Users can also immediately view examples labeled correctly or incorrectly by a given parse, alternative parses considered and filtered, etc. Qualitatively, we observe that this interface increases both the quantity and quality of supervision sources generated by users in a given amount of time. We believe that user-focused application interfaces such as this will play a significant role in furthering real-world adoption of weak supervision approaches.

8https://www.youtube.com/watch?v=YBeAX-deMDg

Chapter 5

Discussion & Conclusion

The primary difference between a traditional labeling approach and the weak supervision approaches described in this dissertation is that the former is manual while the latter are programmatic. This fundamental difference comes with a number of potential advantages and limitations. In this chapter we discuss these considerations and offer concluding thoughts.

5.1 Advantages of Programmatic Supervision

The first advantage of programmatic supervision is speed. With manual supervision, labeling time is proportional to the amount of time it takes a human to label an example. On the other hand, when supervision is encoded in labeling functions—or higher-level abstractions that compile down into labeling functions—labeling time is proportional to the time it takes a program to label an example. While there is a startup cost to creating these labeling functions—writing a labeling function will almost certainly take longer than labeling an example—once they exist, the cost of labeling additional examples is marginal, making it possible to label far more (potentially orders of magnitude more) examples when using a programmatic labeling approach. Because of this, a programmatic labeling approach is often most advantageous when the amount of available data to label is very large and the problem is one that will benefit from a larger training set.

The second advantage of programmatic supervision is cost. This follows naturally from the same reason stated above: once labeling functions are generated, labeling additional examples is not only fast, but also cheap when compared to the cost of paying for additional manual annotations. Furthermore, for some tasks and datasets, “embarrassingly parallel” scaling of annotation through crowdsourcing is not an option, either due to privacy constraints (e.g., financial transactions) or lack of expertise (e.g., medical images). In these cases, the ability to magnify the supervision ability of a small number of qualified individuals via programmatic supervision is particularly advantageous.

A third advantage of programmatic supervision is dynamism. Unlike static benchmark tasks with frozen datasets and objectives, real-world problems often experience changes in the task schema (e.g., a two-class

78 CHAPTER 5. DISCUSSION & CONCLUSION 79 Labels Labels write run programs programs

Time Time (a) Manual Labeling (b) Programmatic Labeling

Figure 5.1: Abstract schematics of labeling rate for (a) manual labeling and (b) programmatic labeling approaches. problem becoming a three-class problem), task definition (e.g., what was previously classified as “illegal” becoming “legal” after legislation passes), or simply shifts in the underlying data distribution over time. When a training dataset needs to be updated to reflect such changes, a manual approach may require revisiting every label individually. On the other hand, if supervision is programmatic, these updates can often be achieved by simply adding or updating a small number of labeling functions related to the change. In other words, when domain knowledge is stored in programs instead of individually labels, if our labels are no longer valid, we haven’t necessarily lost our supervision. Finally, a fourth advantage of programmatic supervision is that it promotes transparency. When an annotator provides a label for an example, we generally do not receive any additional information suggesting the cause for that label being given. When labels are programmatically generated, however, we have label provenance information that allows us to observe which labeling functions ultimately contributed to that label. Thus, if a source of bias is discovered in our dataset, it may be possible to identify the sources (i.e., the labeling functions) responsible for that bias and remove or update them while leaving the rest of our supervision intact. Related to this is the notion of label auditability. If, for example, it is illegal for a bank to take into account gender or race when considering credit limit increases, by examining the source code of the contributing labeling functions, we can confirm that none of the disallowed attributes were used explicitly. Note, however, that this does not necessarily guarantee that the model will not learn a biased representation due to other correlations in the data or biases encoded in less transparent labeling functions such as wrapped third-party models.
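As a minimal illustration of this provenance property, the sketch below (plain Python in the spirit of a labeling-function interface; the candidate fields and the audit helper are hypothetical, not Snorkel's actual API) shows how each programmatic label can be traced back to the functions that produced it.

# Minimal sketch: programmatic labels carry provenance back to their sources.
# The candidate schema and the audit helper are illustrative only.
ABSTAIN, NEGATIVE, POSITIVE = 0, -1, 1

def lf_contains_spouse_word(candidate):
    """Vote POSITIVE if a spouse-indicating word appears between the two people."""
    spouse_words = {"wife", "husband", "spouse", "married"}
    return POSITIVE if spouse_words & set(candidate["between_tokens"]) else ABSTAIN

def lf_many_people(candidate):
    """Vote NEGATIVE if the sentence mentions many people (likely a list, not a couple)."""
    return NEGATIVE if candidate["num_people"] > 3 else ABSTAIN

labeling_functions = [lf_contains_spouse_word, lf_many_people]

def label_with_provenance(candidate):
    """Return non-abstaining votes along with the name of the LF that cast each one."""
    votes = [(lf.__name__, lf(candidate)) for lf in labeling_functions]
    return [(name, v) for name, v in votes if v != ABSTAIN]

example = {"between_tokens": ["and", "his", "wife"], "num_people": 2}
print(label_with_provenance(example))  # [('lf_contains_spouse_word', 1)]

Because every vote is attributed to a named function, removing or updating a problematic source of supervision amounts to editing a single function rather than relabeling examples.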

5.2 Limitations

The programmatic supervision approaches we have described require three main ingredients: first, a set of labeling functions that users can write; second, an appropriate model to train; and third, a preferably large

amount of data to label. While there are many current and exciting future directions for enabling users to more easily write labeling functions in diverse circumstances, there are some tasks or datasets where encoding domain knowledge in this form is particularly difficult. For example, a particular domain may have few attributes that can be referred to when writing labeling functions; there may be few third-party resources (preprocessors, classifiers, etc.) available for reuse; or the interactions between attributes may be very complex, making it difficult to write succinct labeling functions with sufficient accuracy.

Next, the approaches we describe implicitly rely on easily available discriminative models so that a user can focus on providing supervision, rather than feature engineering or model architecture design. For common modalities such as text, images, audio, and video, there are increasingly commoditized architectures available as open source. However, as machine learning is applied to increasingly broad problems, there will very likely be domains that have not reached this same level of maturity, where simpler models are preferred and the relative benefit of larger training sets is diminished.

Finally, programmatic supervision benefits from settings where unlabeled data is readily available, as demonstrated empirically in the experiments described in this work. Where available data for labeling is limited, or where all data has already been labeled and relabeling requirements are infrequent, a user may be better served by focusing on other areas of the problem.

5.3 The Supervision Stack

Finally, we note that to the best of our knowledge, we are the first to make the analogy between the programming stack and the supervision stack as described in Chapter 1. We have found this analogy to have an outsized impact on our own research pursuits and recognize its potential to lead to future research directions. One way in which the parallel stack analogy is particularly appropriate is that for the three abstractions described in this dissertation, the higher abstractions truly do compile down into the lower ones, just like programming languages. For example, the explanations in BabbleLabble are parsed into advanced primitives like those supported by Fonduer, and the resulting functions are instances of labeling functions compatible with Snorkel. However, one might consider even higher levels up the supervision stack, where supervision signal is collected not from human language, but from human behavior—e.g., having users interact with a system and inferring labeling functions based on their implicit supervision with clicks and views. In this case, the higher-level abstraction may skip certain intermediate ones (e.g., natural language) on the way down the stack, suggesting that the hierarchy we explore in this dissertation is merely one instantiation of many such potential stacks.

Another growing trend in machine learning is that of transfer learning [Pan and Yang, 2009, Weiss et al., 2016], wherein a model is pre-trained with training data corresponding to one task (an auxiliary task), then fine-tuned with the training data from another (the primary task). In computer vision [Girshick et al., 2014, Long et al., 2015], natural language processing [Devlin et al., 2018, Radford et al., 2019], and other domains,

this has proven to be an effective way to decrease the amount of training data required for the primary task to obtain a given level of performance, or at the very least to speed up the training process [He et al., 2018]. In the context of our supervision stack, rather than corresponding to another abstraction layer in a given stack, transfer learning is akin to utilizing the labels generated by a separate nearby stack. Ultimately, however, the use of transfer learning is orthogonal to weak supervision and the two can easily be combined, as weak supervision focuses on the creation of training labels and transfer learning focuses on the use of multiple sets of training labels, regardless of the processes that created them.

5.4 Conclusion

Decades of learning have gone into designing best practices and improved abstractions for general purpose programming. As the supervision for machine learning models becomes less bit-like and more program-like in nature, we have the opportunity to transfer many of those same lessons to this exciting and blossoming field. In this dissertation, we have described a supervision stack, where higher-level interfaces can be built up and compiled down like a programming stack, affording flexible and powerful abstractions for interacting with our models. Ultimately, we hope and expect to see weak supervision from these high-level abstractions empowering individuals, magnifying their ability to supervise powerful models in high-impact problems.

Appendix A

Snorkel Appendix

A.1 Additional Material for Sec. 3.1

A.1.1 Minor Notes

Note that for the independent generative model (i.e., |C| = 0), the weight corresponding to the accuracy factor, wj, for labeling function j is just the log-odds of its accuracy:

$$\alpha_j = P(\Lambda_{i,j} = 1 \mid Y_i = 1, \Lambda_{i,j} \neq 0) = \frac{P(\Lambda_{i,j} = 1,\, Y_i = 1,\, \Lambda_{i,j} \neq 0)}{P(Y_i = 1,\, \Lambda_{i,j} \neq 0)} = \frac{\exp(w_j)}{\exp(w_j) + \exp(-w_j)}$$
$$\implies w_j = \frac{1}{2}\log\left(\frac{\alpha_j}{1 - \alpha_j}\right)$$

Also note that the accuracy we consider is conditioned on the labeling function not abstaining, i.e.,:

$$P(\Lambda_{i,j} = 1 \mid Y_i = 1) = \alpha_j \cdot P(\Lambda_{i,j} \neq \emptyset)$$

because a separate factor $\phi^{\mathrm{Lab}}_{i,j}$ captures how often each labeling function votes.
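As a quick numerical illustration of the log-odds relation above (a sketch added here for convenience, not part of the original derivation), the snippet below round-trips between a weight $w_j$ and its implied accuracy $\alpha_j$.

# Small numerical check of alpha_j = exp(w)/(exp(w)+exp(-w)) and w_j = (1/2) log-odds(alpha_j).
import math

w_j = 0.8
alpha_j = math.exp(w_j) / (math.exp(w_j) + math.exp(-w_j))   # accuracy implied by the weight
w_back = 0.5 * math.log(alpha_j / (1 - alpha_j))             # invert: w_j = (1/2) log(alpha/(1-alpha))
assert abs(w_back - w_j) < 1e-12
print(f"alpha_j = {alpha_j:.4f}, recovered w_j = {w_back:.4f}")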

A.1.2 Proof of Proposition 1

In this proposition, our goal is to obtain a simple upper bound for the expected optimal advantage $\mathbb{E}_{\Lambda,y,w^*}[A^*]$ in the low label density regime. We consider a simple model where all the labeling functions have a fixed


probability of emitting a non-zero label,

$$P(\Lambda_{i,j} \neq \emptyset) = p_l \quad \forall\, i, j \tag{A.1}$$
and that the labeling functions are all non-adversarial, i.e., they all have accuracies greater than 50%, or equivalently,

$$w_j^* > 0 \quad \forall\, j \tag{A.2}$$

First, we start by only counting cases where the optimal weighted majority vote (WMV*)—i.e., the predictions of the generative model with perfectly estimated weights—is correct and the majority vote (MV) is incorrect, which is an upper bound on the modeling advantage:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A_{w^*}(\Lambda, y)]
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y_i, w^*}\big[\mathbf{1}\{y_i f_{w^*}(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \le 0\} \\
&\qquad\qquad\quad - \mathbf{1}\{y_i f_{w^*}(\Lambda_i) \le 0 \,\wedge\, y_i f_1(\Lambda_i) > 0\}\big] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y_i, w^*}\big[\mathbf{1}\{y_i f_{w^*}(\Lambda_i) > 0 \,\wedge\, y_i f_1(\Lambda_i) \le 0\}\big]
\end{align*}
Next, by (A.2), the only way that WMV* and MV could possibly disagree is if there is at least one disagreeing pair of labels:

$$\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)] \le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y}\big[\mathbf{1}\{c_1(\Lambda_i) > 0 \,\wedge\, c_{-1}(\Lambda_i) > 0\}\big]$$
where $c_y(\Lambda_i) = \sum_{j=1}^{n} \mathbf{1}\{\Lambda_{i,j} = y\}$, in other words, the counts of positive or negative labels for a given data point $x_i$. Then, we can bound this by the expected number of disagreeing, non-abstaining pairs of labels:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)]
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{\Lambda_i, y}\Bigg[\sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \mathbf{1}\{\Lambda_{i,j} \neq \Lambda_{i,k} \,\wedge\, \Lambda_{i,j}, \Lambda_{i,k} \neq 0\}\Bigg] \\
&= \frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \mathbb{E}_{\Lambda_i, y}\big[\mathbf{1}\{\Lambda_{i,j} \neq \Lambda_{i,k} \,\wedge\, \Lambda_{i,j}, \Lambda_{i,k} \neq 0\}\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \sum_{y' \in \pm 1} \sum_{\lambda \in \pm 1} P(\Lambda_{i,j} = \lambda,\, \Lambda_{i,k} = -\lambda,\, y_i = y')
\end{align*}

Since we are considering the independent model, $\Lambda_{i,j} \perp \Lambda_{i,k \neq j} \mid y_i$, we have that:

\begin{align*}
P(\Lambda_{i,j} = \lambda,\, \Lambda_{i,k} = -\lambda,\, y_i = \lambda)
&= P(\Lambda_{i,j} = \lambda \mid y_i = \lambda)\, P(\Lambda_{i,k} = -\lambda \mid y_i = \lambda)\, P(y_i = \lambda) \\
&= \alpha_j (1 - \alpha_k)\, p_l^2\, P(y_i = \lambda)
\end{align*}

Thus we have:

\begin{align*}
\mathbb{E}_{\Lambda,y,w^*}[A^*(\Lambda, y)]
&\le p_l^2 \sum_{j=1}^{n-1}\sum_{k=j+1}^{n} \big(\alpha_j(1 - \alpha_k) + (1 - \alpha_j)\alpha_k\big) \\
&= p_l^2 \sum_{j=1}^{n}\sum_{k \neq j} \alpha_j(1 - \alpha_k) \\
&\le p_l^2 \sum_{j=1}^{n}\sum_{k=1}^{n} \alpha_j(1 - \alpha_k) \\
&= n^2 p_l^2\, \bar{\alpha}(1 - \bar{\alpha}) = \bar{d}^{\,2}\, \bar{\alpha}(1 - \bar{\alpha})
\end{align*}
where we have defined the average labeling function accuracy as $\bar{\alpha}$, and where the label density is defined as $\bar{d} = p_l n$. Thus we have shown that the expected advantage scales at most quadratically in the label density.

A.1.3 Proof of Proposition 2

Recall that for data point i with true label yi ∈ {−1, 1}, Λi,j ∈ {−1, 0, 1} is a random variable representing the output label of the jth labeling function, with accuracy αj and fixed labeling propensity βj = pl:

\begin{align*}
\alpha_j &= P(\Lambda_{i,j} = 1 \mid y_i = 1,\, \Lambda_{i,j} \neq 0) = P(\Lambda_{i,j} = -1 \mid y_i = -1,\, \Lambda_{i,j} \neq 0) \\
\beta_j &= P(\Lambda_{i,j} \neq 0) = p_l
\end{align*}
where recall that we model the labeling functions as having class-symmetric accuracies; thus, without loss of generality, we consider $y_i = 1$. Consider the average, which is proportional to the unweighted majority vote $f_1$:

$$\bar{\Lambda}_i = \frac{1}{n}\sum_{j=1}^{n} \Lambda_{i,j} = \frac{1}{n} f_1(\Lambda_i)$$

Applying Hoeffding’s inequality, we have for any t > 0:

$$P\left(\bar{\Lambda}_i - \mathbb{E}\big[\bar{\Lambda}_i \mid y_i = 1\big] \le -t \;\middle|\; y_i = 1\right) \le \exp\left(-\tfrac{1}{2} n t^2\right)$$

By linearity of expectation, we have:

\begin{align*}
\mathbb{E}\big[\bar{\Lambda}_i \mid y_i = 1\big]
&= \frac{1}{n}\sum_{j=1}^{n} \mathbb{E}[\Lambda_{i,j} \mid y_i = 1] \\
&= \frac{1}{n}\sum_{j=1}^{n} \beta_j (2\alpha_j - 1) \\
&= p_l (2\bar{\alpha} - 1)
\end{align*}

where $\bar{\alpha} = \frac{1}{n}\sum_{j} \alpha_j$ is the average labeling function accuracy. Thus, assuming that $\bar{\alpha} > 0.5$, we can set $t = p_l(2\bar{\alpha} - 1)$ and get:

$$P\left(\bar{\Lambda}_i \le 0 \mid y_i = 1\right) \le \exp\left(-\tfrac{1}{2} n\, p_l^2\, (2\bar{\alpha} - 1)^2\right)$$

Re-writing slightly, we get:

$$P\left(y_i f_1(\Lambda_i) \le 0\right) \le \exp\left(-2 p_l \left(\bar{\alpha} - \tfrac{1}{2}\right)^2 \bar{d}\right)$$

Thus, we have a bound for the error rate of the unweighted majority vote in terms of the label density $\bar{d}$, which is in turn an upper bound for the optimal advantage $A^*$.
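The bound above can be checked numerically; the following minimal Monte Carlo sketch (added here for illustration, not part of the original proof) assumes that all labeling functions share the same accuracy $\bar{\alpha}$ and labeling propensity $p_l$, and counts ties as errors, matching $y_i f_1(\Lambda_i) \le 0$.

# Monte Carlo sanity check of the majority-vote error bound (illustrative assumptions only).
import numpy as np

rng = np.random.default_rng(0)
n, p_l, alpha, trials = 20, 0.3, 0.75, 200_000

y = 1  # class-symmetric accuracies, so fix the true label WLOG
votes = rng.random((trials, n)) < p_l              # which LFs emit a non-zero label
correct = rng.random((trials, n)) < alpha          # whether a vote is correct
labels = np.where(votes, np.where(correct, y, -y), 0)
f1 = labels.sum(axis=1)                            # unweighted majority vote

empirical = np.mean(y * f1 <= 0)                   # ties counted as errors
d_bar = p_l * n
bound = np.exp(-2 * p_l * (alpha - 0.5) ** 2 * d_bar)
print(f"empirical error {empirical:.4f} <= bound {bound:.4f}")

With these settings the empirical error is well below the (loose) exponential bound, as expected.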

A.1.4 Proof of Proposition 3

In this proposition, our goal is to find a tractable upper bound on the conditional modeling advantage, i.e., the modeling advantage given the observed label matrix $\Lambda$. This will be useful because, given our label matrix, we can compute this quantity and, when it is small, safely skip learning the generative model and just use an unweighted majority vote (MV) of the labeling functions. We assume in this proposition that the true weights of the labeling functions lie within a fixed range, $w_j \in [w_{\min}, w_{\max}]$ with $w_{\min} > 0$, and have a mean $\bar{w}$. For notational convenience, let

$$y' = \begin{cases} 1 & f_1(\Lambda) > 0 \\ 0 & f_1(\Lambda) = 0 \\ -1 & f_1(\Lambda) < 0 \end{cases}$$

We start with the expected advantage, and upper-bound by the expected number of instances in which WMV* is correct and MV is incorrect (note that for tie votes, we simply upper bound by trivially assuming an expected advantage of one):

\begin{align*}
\mathbb{E}_{w^*, y}[A^*(\Lambda, y) \mid \Lambda]
&= \mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda, w^*)}[A_{w^*}(\Lambda, y)] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda_i, w^*)}\big[\mathbf{1}\{y_i \neq y_i'\}\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\Big[\mathbb{E}_{y \sim P(\cdot \mid \Lambda_i, w^*)}\big[\mathbf{1}\{y_i \neq y_i'\}\big]\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\Big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\, \mathbf{1}\{y_i' f_{w^*}(\Lambda_i) \le 0\}\big]
\end{align*}
Next, define:

$$\Phi(\Lambda_i, y'') = \mathbf{1}\{c_{y''}(\Lambda_i)\, w_{\max} \ge c_{-y''}(\Lambda_i)\, w_{\min}\}$$
i.e., this is an indicator for whether WMV* could possibly output $y''$ as a prediction under best-case circumstances. We use this in turn to upper-bound the expected modeling advantage again:

\begin{align*}
\mathbb{E}_{w^*,\, y \sim P(\cdot \mid \Lambda, w^*)}[A_{w^*}(\Lambda, y)]
&\le \frac{1}{m}\sum_{i=1}^{m} \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\, \Phi(\Lambda_i, -y_i')\big] \\
&= \frac{1}{m}\sum_{i=1}^{m} \Phi(\Lambda_i, -y_i')\, \mathbb{E}_{w^*}\big[P(y_i \neq y_i' \mid \Lambda_i, w^*)\big] \\
&\le \frac{1}{m}\sum_{i=1}^{m} \Phi(\Lambda_i, -y_i')\, P(y_i \neq y_i' \mid \Lambda_i, \bar{w})
\end{align*}
Now, recall that, for $y' \in \pm 1$:

\begin{align*}
P(y_i = y' \mid \Lambda_i, w)
&= \frac{P(y_i = y',\, \Lambda_i \mid w)}{\sum_{y'' \in \pm 1} P(y_i = y'',\, \Lambda_i \mid w)} \\
&= \frac{\exp\left(w^T \phi_i(\Lambda_i, y_i = y')\right)}{\sum_{y'' \in \pm 1} \exp\left(w^T \phi_i(\Lambda_i, y_i = y'')\right)} \\
&= \frac{\exp\left(w^T \Lambda_i\, y'\right)}{\exp\left(w^T \Lambda_i\right) + \exp\left(-w^T \Lambda_i\right)} \\
&= \sigma\left(2 f_w(\Lambda_i)\, y'\right)
\end{align*}

where $\sigma(\cdot)$ is the sigmoid function. Note that we are considering a simplified independent generative model with only accuracy factors; however, in this discriminative formulation the labeling propensity factors would drop out anyway since they do not depend on $y$, so their omission is just for notational simplicity. Putting this all together, removing the $y_i'$ placeholder and simplifying notation to match the main body of the paper, we have:

$$\mathbb{E}_{w^*, y}[A^*(\Lambda, y) \mid \Lambda] \le \frac{1}{m}\sum_{i=1}^{m} \sum_{y \in \pm 1} \mathbf{1}\{y f_1(\Lambda_i) \le 0\}\, \Phi(\Lambda_i, y)\, \sigma\left(2 y f_{\bar{w}}(\Lambda_i)\right) = \tilde{A}^*(\Lambda)\,.$$

Appendix B

Fonduer Appendix

B.1 Data Programming

Machine-learning-based KBC systems rely heavily on ground truth data (called training data) to achieve high quality. Traditionally, manual annotations or incomplete KBs are used to construct training data for machine-learning-based KBC systems. However, these resources are either costly to obtain or may have limited coverage over the candidates considered during the KBC process. To address this challenge, Fonduer builds upon the newly introduced paradigm of data programming [Ratner et al., 2016b], which enables both domain experts and non-domain experts alike to programmatically generate large training datasets by leveraging multiple weak supervision sources and domain knowledge.

In data programming, which provides a framework for weak supervision, users provide weak supervision in the form of user-defined functions, called labeling functions. Each labeling function provides potentially noisy labels for a subset of the input data, and these labels are combined to create large, potentially overlapping sets of labels which can be used to train a machine-learning model. Many different weak-supervision approaches can be expressed as labeling functions. This includes strategies that use existing knowledge bases, individual annotators' labels (as in crowdsourcing), or user-defined functions that rely on domain-specific patterns and dictionaries to assign labels to the input data.

The aforementioned sources of supervision can have varying degrees of accuracy, and may conflict with each other. Data programming relies on a generative probabilistic model to estimate the accuracy of each labeling function by reasoning about the conflicts and overlap across labeling functions. The estimated labeling function accuracies are in turn used to assign a probabilistic label to each candidate. These labels are used alongside a noise-aware discriminative model to train a machine-learning model for KBC.

B.1.1 Components of Data Programming

The main components in data programming are as follows:


Candidates A set of candidates C to be probabilistically classified.

Labeling Functions Labeling functions are used to programmatically provide labels for training data. A labeling function is a user-defined procedure that takes a candidate as input and outputs a label. Labels can be as simple as true or false for binary tasks, or one of many classes for more complex multiclass tasks. Since each labeling function is applied to all candidates and labeling functions are rarely perfectly accurate, there may be disagreements between them. The labeling functions provided by the user for binary classification can be more formally defined as follows. For each labeling function $\lambda_i$ and $r \in C$, we have $\lambda_i : r \mapsto \{-1, 0, 1\}$, where $+1$ or $-1$ denotes a candidate as "True" or "False" and $0$ abstains. The output of applying a set of $l$ labeling functions to $k$ candidates is the label matrix $\Lambda \in \{-1, 0, 1\}^{k \times l}$.

Output Data-programming frameworks output a confidence value $p$ for the classification of each candidate as a vector $Y \in \{p\}^k$. To perform data programming in Fonduer, we rely on a data-programming engine, Snorkel [Ratner et al., 2017b]. Snorkel accepts candidates and labels as input and produces marginal probabilities for each candidate as output. These input and output components are stored as relational tables. Their schemas are detailed in Section 3.3.
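As a concrete illustration of these components, the following minimal sketch (plain Python with NumPy; the candidates and labeling functions are hypothetical placeholders, not Fonduer's actual interface) assembles the label matrix $\Lambda$ from a handful of labeling functions.

# Minimal sketch of building the label matrix Lambda from labeling functions.
# Candidates and labeling functions are illustrative placeholders only.
import numpy as np

candidates = [
    {"text": "headaches induced by the drug", "chemical": "drug", "disease": "headaches"},
    {"text": "the drug did not cause nausea", "chemical": "drug", "disease": "nausea"},
]

def lf_induced_by(c):
    # Vote True if an "induced by" pattern appears in the sentence.
    return 1 if "induced by" in c["text"] else 0

def lf_negation(c):
    # Vote False if a negation word appears in the sentence.
    return -1 if " not " in c["text"] else 0

lfs = [lf_induced_by, lf_negation]

# Lambda has one row per candidate (k) and one column per labeling function (l).
Lambda = np.array([[lf(c) for lf in lfs] for c in candidates])
print(Lambda)  # [[ 1  0]
               #  [ 0 -1]]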

B.1.2 Theoretical Guarantees

While data programming uses labeling functions to generate noisy training data, it theoretically achieves a learning rate similar to methods that use manually labeled data [Ratner et al., 2016b]. In the typical supervised-learning setup, users are required to manually label $\tilde{O}(\epsilon^{-2})$ examples for the target model to achieve an expected loss of $\epsilon$. To achieve this rate, data programming only requires the user to specify a constant number of labeling functions that does not depend on $\epsilon$. Let $\beta$ be the minimum coverage across labeling functions (i.e., the probability that a labeling function provides a label for an input point) and $\gamma$ be the minimum reliability of labeling functions, where $\gamma = 2a - 1$ with $a$ denoting the accuracy of a labeling function. Then under the assumptions that (1) labeling functions are conditionally independent given the true labels of input data, (2) the number of user-provided labeling functions is at least $\tilde{O}(\gamma^{-3}\beta^{-1})$, and (3) there are $k = \tilde{O}(\epsilon^{-2})$ candidates, data programming achieves an expected loss of $\epsilon$. Despite the strict assumptions with respect to labeling functions, we find that using data programming to develop KBC systems for richly formatted data leads to high-quality KBs (across diverse real-world applications) even when some of the data-programming assumptions are not met (see Section 3.5.2).

B.2 Extended Feature Library

Fonduer augments a bidirectional LSTM with features from an extended feature library in order to better model the multiple modalities of richly formatted data. In addition, these extended features can provide signals drawn from large contexts since they can be calculated using Fonduer's data model of the document rather

Table B.1: Features from Fonduer's feature library. Example values are drawn from the example candidate in Figure 3.1. Capitalized prefixes represent the feature templates and the remainder of the string represents a feature's value.

[Table B.1 lists each feature template with an example value, a description, its arity (unary or binary), and its feature type (structural, tabular, or visual). Structural templates include TAG_, HTML_ATTR_, PARENT_TAG_, PREV_SIB_TAG_, NEXT_SIB_TAG_, NODE_POS_, ANCESTOR_CLASS_, ANCESTOR_TAG_, ANCESTOR_ID_, COMMON_ANCESTOR_, and LOWEST_ANCESTOR_DEPTH_. Tabular templates include CELL_, ROW_NUM_, COL_NUM_, ROW_SPAN_, COL_SPAN_, ROW_HEAD_, COL_HEAD_, SAME_TABLE, SAME_TABLE_ROW_DIFF_, SAME_TABLE_COL_DIFF_, SAME_TABLE_MANHATTAN_DIST_, SAME_CELL, WORD_DIFF_, CHAR_DIFF_, SAME_PHRASE, DIFF_TABLE, DIFF_TABLE_ROW_DIFF_, DIFF_TABLE_COL_DIFF_, and DIFF_TABLE_MANHATTAN_DIST_. Visual templates include ALIGNED_, PAGE_, SAME_PAGE, HORZ_ALIGNED, VERT_ALIGNED, VERT_ALIGNED_LEFT, VERT_ALIGNED_RIGHT, and VERT_ALIGNED_CENTER.]

a All N-grams are 1-grams by default. b This feature was not present in the example candidate. The values shown are example values from other documents. c In this example, the mention is 200, which forms part of the feature prefix. The value is shown in square brackets.

than being limited to a single sentence or table. In Section 3.5, we find that including multimodal features is critical to achieving high-quality relation extraction. Our extended feature library serves as a baseline example of these types of features that can be easily enhanced in the future. However, even with these baseline features, our users have built high-quality knowledge bases for their applications. The extended feature library consists of a baseline set of features from the structural, tabular, and visual modalities. Table B.1 lists the details of the extended feature library. Features are represented as strings, and each feature space is then mapped into a one-dimensional bit vector for each candidate, where each bit represents whether the candidate has the corresponding feature.

B.3 Fonduer at Scale

We use two optimizations to enable Fonduer’s scalability to millions of candidates (see Section 3.3.1): (1) data caching and (2) data representations that optimize data access during the KBC process. Such optimizations are standard in database systems. Nonetheless, their impact on KBC has not been studied in detail. Each candidate to be classified by Fonduer’s LSTM as “True” or “False” is associated with a set of mentions (see Section 3.3.2). For each candidate, Fonduer’s multimodal featurization generates features that describe each individual mention in isolation and features that jointly describe the set of all mentions in the candidate. Since each mention can be associated with many different candidates, we cache the featurization of each mention. Caching during featurization results in a 100× speed-up on average in the ELECTRONICS domain yet only accounts for 10% of this stage’s memory usage. Recall from Section 3.3.3 that Fonduer’s programming model introduces two modes of operation: (1) development, where users iteratively improve the quality of labeling functions without executing the entire pipeline; and (2) production, where the full pipeline is executed once to produce the knowledge base. We use different data representations to implement the abstract data structures of Features and Labels (a structure that stores the output of labeling functions after applying them over the generated candidates). Implementing Features as a list-of-lists structure minimizes runtime in both modes of operation since it accounts for sparsity. We also find that Labels implemented as a coordinate list during the development mode are optimal for fast updates. A list-of-lists implementation is used for Labels in production mode.

B.3.1 Data Caching

With richly formatted data, which frequently requires document-level context, thousands of candidates need to be featurized for each document. Candidate features from the extended feature library are computed at both the mention level and the relation level by traversing the data model and accessing modality attributes. Because each mention is part of many candidates, naïve featurization of candidates can result in the redundant computation of thousands of mention features. This pattern highlights the value of data caching when performing multimodal featurization on richly formatted data.

Traditional KBC systems that operate on single sentences of unstructured text pragmatically assume that only a small number of candidates will need to be featurized for each sentence and do not cache mention features as a result.

Example B.3.1 (Inefficient Featurization). In Figure 3.1, the transistor part mention MMBT3904 could match with up to 15 different numerical values in the datasheet. Without caching, features would be unnecessarily recalculated 14 times, once for each candidate. In real documents, hundreds of feature calculations would be wasted.

In Example B.3.1, eliminating unnecessary feature computations can improve performance by an order of magnitude. To optimize the feature-generation process, Fonduer implements a document-level caching scheme for mention features. The first computation of a mention feature requires traversing the data model. Then, the result is cached for fast access if the feature is needed again. All features are cached until all candidates in a document are fully featurized, after which the cache is flushed. Because Fonduer operates on documents atomically, caching a single document at a time improves performance without adding significant memory overhead. In the ELECTRONICS application, we find that caching achieves over 100× speed-up on average and in some cases even over 1000×, while only accounting for approximately 10% of the memory footprint of the featurization stage.
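The document-level caching pattern can be sketched as follows (a minimal illustration in plain Python; the feature computation and document structure below are hypothetical stand-ins for Fonduer's data-model traversal, not its actual implementation).

# Minimal sketch of document-level caching of mention features.
from functools import lru_cache

def featurize_document(document):
    calls = {"mention_featurizations": 0}

    @lru_cache(maxsize=None)  # cache lives only for the duration of this document
    def mention_features(mention):
        calls["mention_featurizations"] += 1
        # Stand-in for an expensive traversal of the data model
        # (structural / tabular / visual attributes of the mention).
        return (f"LEN_{len(mention)}", f"UPPER_{mention.isupper()}")

    features = {}
    for cand_id, (m1, m2) in document["candidates"].items():
        # Mention-level features are computed once per mention and reused across
        # candidates; relation-level features are computed per candidate.
        features[cand_id] = mention_features(m1) + mention_features(m2) + (
            f"SAME_ROW_{document['row'][m1] == document['row'][m2]}",)
    return features, calls["mention_featurizations"]

doc = {
    "candidates": {0: ("MMBT3904", "200"), 1: ("MMBT3904", "40"), 2: ("MMBT3904", "6")},
    "row": {"MMBT3904": 1, "200": 5, "40": 5, "6": 2},
}
feats, n_calls = featurize_document(doc)
print(n_calls)  # 4 unique mentions featurized instead of 6 without caching

Because the cache is scoped to a single document and discarded afterward, the speed-up comes without a persistent memory cost.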

Takeaways. When performing feature generation from richly formatted data, caching the intermediate results can yield over 1000× improvements in featurization runtime without adding significant memory overhead.

B.3.2 Data Representations

The Fonduer programming model involves two modes of operation: (1) development and (2) production. In development, users iteratively improve the quality of their labeling functions through error analysis, without executing the full pipeline as in previous techniques such as incremental KBC [Shin et al., 2015]. Once the labeling functions are finalized, the Fonduer pipeline is only run once in production. In both modes of operation, Fonduer produces two abstract data structures (Features and Labels, as described in Section 3.3). These data structures have three access patterns: (1) materialization, where the data structure is created; (2) updates, which include inserts, deletions, and value changes; and (3) queries, where users can inspect the features and labels to make informed updates to labeling functions. Both Features and Labels can be viewed as matrices, where each row represents annotations for a candidate (see Section 3.3.2). Features are dynamically named during multimodal featurization but are static for the lifetime of a candidate. Labels are statically named in classification but updated during development. Typically, Features are sparse: in the ELECTRONICS application, each candidate has about 100 features while the number of unique features can be more than 10M. Labels are also sparse, where the number of unique labels is the number of labeling functions.

The data representation that is implemented to store these abstract data structures can significantly affect overall system runtime. In the ELECTRONICS application, multimodal featurization accounts for 50% of end-to-end runtime, while classification accounts for 15%. We discuss two common sparse matrix representations that can be materialized in a SQL database.

• List of lists (LIL): Each row stores a list of (column_key, value) pairs. Zero-valued pairs are omitted. An entire row can be retrieved in a single query. However, updating values requires iterating over sublists.

• Coordinate list (COO): Rows store (row_key, column_key, value) triples. Zero-valued triples are omitted. With COO, multiple queries must be performed to fetch a row's attributes. However, updating values takes constant time.

The choice of data representation for Features and Labels reflects their different access patterns, as well as the mode of operation. During development, Features are materialized once, but frequently queried during the iterative KBC process. Labels are updated each time a user modifies labeling functions. In production, Features' access pattern remains the same. However, Labels are not updated once users have finalized their set of labeling functions.

From the access patterns in the Fonduer pipeline, and the characteristics of each sparse matrix representation, we find that implementing Features as an LIL minimizes runtime in both production and development. Labels, however, should be implemented as COO to support fast insertions during iterative KBC and reduce runtimes for each iteration. In production, Labels can also be implemented as LIL to avoid the computation overhead of COO. In the ELECTRONICS application, we find that LIL provides 1.4× speed-up over COO in production and that COO provides over 5.8× speed-up over LIL when adding a new labeling function.
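The trade-off between the two representations can be illustrated with a standard sparse-matrix library; the sketch below uses SciPy purely as an analogy (Fonduer materializes Features and Labels as relational tables in a SQL database, not as SciPy matrices).

# Illustrative sketch of the LIL vs. COO trade-off with SciPy sparse matrices.
import numpy as np
from scipy.sparse import lil_matrix, coo_matrix

k, l = 5, 3  # candidates x labeling functions

# LIL: efficient row access, i.e., querying all labels of one candidate at once ...
labels_lil = lil_matrix((k, l), dtype=np.int8)
labels_lil[0, 1] = 1
print(labels_lil.getrow(0).toarray())   # fast row query

# ... while COO is built from (row, col, value) triples, so appending the votes
# of a newly added labeling function is a cheap constant-time append per vote.
rows, cols, vals = [0, 2], [1, 1], [1, -1]
rows += [1]; cols += [2]; vals += [1]   # new LF's vote appended as another triple
labels_coo = coo_matrix((vals, (rows, cols)), shape=(k, l), dtype=np.int8)
print(labels_coo.toarray())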

Takeaways. We find that Labels should be implemented as a coordinate list during development, which supports fast updates for supervision, while Features should use a list of lists, which provides faster query times. In production, both Features and Labels should use a list-of-list representation.

B.4 Future Work

Being able to extract information from richly formatted data enables a wide range of applications, and represents a new and interesting research direction. While we have demonstrated that Fonduer can already obtain high-quality knowledge bases in several applications, we recognize that many interesting challenges remain. We briefly discuss some of these challenges.

Data Model One challenge in extracting information from richly formatted data comes directly at the data level—we cannot perfectly preserve all document information. Future work in parsing, OCR, and computer vision has the potential to improve the quality of Fonduer's data model for complex table structures and figures. For example, improving the granularity of Fonduer's data model to be able to identify axis titles,

legends, and footnotes could provide additional signals to learn from and additional specificity for users to leverage while using the Fonduer programming model.

Deep-Learning Model Fonduer’s multimodal recurrent neural network provides a prototypical automated featurization approach that achieves high quality across several domains. However, future developments for incorporating domain-specific features could strengthen these models. In addition, it may be possible to expand our deep-learning model to perform additional tasks (e.g., identifying candidates) to simplify the Fonduer pipeline.

Programming Model Fonduer currently exposes a Python interface to allow users to provide weak supervision. However, further research in user interfaces for weak supervision could bolster user efficiency in Fonduer. For example, allowing users to use natural language or graphical interfaces in supervision may result in improved efficiency and reduced development time through a more powerful programming model. Similarly, feedback techniques like active learning [Settles, 2012] could empower users to more quickly recognize classes of candidates that need further disambiguation with LFs.

B.5 GwasKB Web Interface

Figure B.1: The GWASkb web application hosted at http://gwaskb.stanford.edu for exploring the contents of the GWASkb knowledge base created with Fonduer.

Figure B.2: Users can search by genotype (e.g., rs7329174) or phenotype (e.g., breast cancer) and see all related studies and associations with links to the corresponding articles for further exploration.

Appendix C

BabbleLabble Appendix

C.1 Predicate Examples

Below are the predicates in the rule-based semantic parser grammar, each of which may have many supported paraphrases, only one of which is listed here in a minimal example.

Logic
and: X is true and Y is true
or: X is true or Y is true
not: X is not true
any: Any of X or Y or Z is true
all: All of X and Y and Z are true
none: None of X or Y or Z is true

Comparison
=: X is equal to Y
≠: X is not Y
<: X is smaller than Y
≤: X is no more than Y
>: X is larger than Y
≥: X is at least Y

Syntax
lower: X is lowercase
upper: X is upper case
capital: X is capitalized


all_caps: X is in all caps
starts_with: X starts with "cardio"
ends_with: X ends with "itis"
substring: X contains "-induced"

Named-entity Tags
person: A person is between X and Y
location: A place is within two words of X
date: A date is between X and Y
number: There are three numbers in the sentence
organization: An organization is right after X

Lists
list: (X, Y) is in Z
set: X, Y, and Z are true
count: There is one word between X and Y
contains: X is in Y
intersection: At least two of X are in Y
map: X is at the start of a word in Y
filter: There are three capitalized words to the left of X
alias: A spouse word is in the sentence ("spouse" is a predefined list from the user)

Position
word_distance: X is two words before Y
char_distance: X is twenty characters after Y
left: X is before Y
right: X is after Y
between: X is between Y and Z
within: X is within five words of Y
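To make the connection to labeling functions concrete, the following minimal sketch (plain Python; the helper predicate and candidate fields are illustrative, not the actual output of the semantic parser) shows the kind of function that an explanation built from the within predicate compiles into.

# Minimal sketch: a labeling function corresponding to an explanation such as
# "Label true because 'married' occurs within five words of person X."
def within(tokens, word, anchor_index, k):
    """Predicate 'within': word occurs within k tokens of the anchor position."""
    lo, hi = max(0, anchor_index - k), anchor_index + k + 1
    return word in tokens[lo:hi]

def lf_married_near_x(candidate):
    tokens = candidate["tokens"]
    return 1 if within(tokens, "married", candidate["x_index"], 5) else 0

cand = {"tokens": ["Alice", "and", "Bob", "were", "married", "in", "2010"], "x_index": 0}
print(lf_married_near_x(cand))  # 1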

C.2 Sample Explanations

The following are a sample of the explanations provided by users for each task.

Spouse Users referred to the first person in the sentence as “X” and the second as “Y”.

Label true because "and" occurs between X and Y and "marriage" occurs one word after person1.

Label true because person Y is preceded by ‘beau’.

Label false because the words "married", "spouse", "husband", and "wife" do not occur in the sentence.

Label false because there are more than 2 people in the sentence and "actor" or "actress" is left of person1 or person2.

Disease

Label true because the disease is immediately after the chemical and ’induc’ or ’assoc’ is in the chemical name.

Label true because a word containing ’develop’ appears somewhere before the chemical, and the word ’following’ is between the disease and the chemical.

Label true because "induced by", "caused by", or "due to" appears between the chemical and the disease."

Label false because "none", "not", or "no" is within 30 characters to the left of the disease.

Protein

Label true because "Ser" or "Tyr" are within 10 characters of the protein.

Label true because the words "by" or "with" are between the protein and kinase and the words "no", "not" or "none" are not in between the protein and kinase APPENDIX C. BABBLELABBLE APPENDIX 100

and the total number of words between them is smaller than 10.

Label false because the sentence contains "mRNA", "DNA", or "RNA".

Label false because there are two "," between the protein and the kinase with less than 30 characters between them.

Bibliography

Worldwide semiannual cognitive/artificial intelligence systems spending guide. Technical report, International Data Corporation, 2017.

M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. J. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Józefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. G. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. A. Tucker, V. Vanhoucke, V. Vasudevan, F. B. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2015.

A. K. Agrawala. Learning with a probabilistic teacher. IEEE Transactions on Infomation Theory, 16:373–379, 1970.

L. V. Ahn, R. Liu, and M. Blum. Peekaboom: a game for locating objects in images. In Conference on Human Factors in Computing Systems (CHI), pages 55–64, 2006.

E. Alfonseca, K. Filippova, J. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Association for Computational Linguistics (ACL), pages 54–59, 2012a.

E. Alfonseca, K. Filippova, J.-Y. Delort, and G. Garrido. Pattern learning for relation extraction with a hierarchical topic model. In Meeting of the Association for Computational Linguistics (ACL), 2012b.

G. Angeli, S. Gupta, M. Jose, C. Manning, C. Ré, J. Tibshirani, J. Wu, S. Wu, and C. Zhang. Stanford’s 2014 slot filling systems. TAC KBP, 695, 2014.

S. Arora and E. Nyberg. Interactive annotation learning with indirect feature voting. In Association for Computational Linguistics (ACL), pages 55–60, 2009.

S. Bach, B. He, A. Ratner, and C. Ré. Learning the structure of generative models without labeled data. In International Conference on Machine Learning (ICML), 2017.

S. Bach, D. Rodriguez, Y. Liu, C. Luo, H. Shao, C. Xia, S. Sen, A. Ratner, B. Hancock, H. Alborzi, R. Kuchhal, C. Ré, and R. Malkin. Snorkel DryBell: A case study in deploying weak supervision at industrial scale. Arxiv, 2019.


D. Bahdanau, K. Cho, and Y. Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

D. Barowy, S. Gulwani, T. Hart, and B. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In ACM SIGPLAN Notices, volume 50, pages 218–228. ACM, 2015.

T. Beck, R. Hastings, S. Gollapudi, R. Free, and A. Brookes. GWAS Central: a comprehensive resource for the comparison and interrogation of genome-wide association studies. EJHG, 22(7):949–952, 2014.

A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. In Workshop on Computational Learning Theory (COLT), 1998.

K. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In International Conference on Management of Data (SIGMOD), pages 1247–1250, 2008.

E. Brown, E. Epstein, J. Murdock, and T. Fin. Tools and methods for building Watson. IBM Research. Abgerufen am, 14:2013, 2013.

R. C. Bunescu and R. J. Mooney. Learning to extract relations from the Web using minimal supervision. In Meeting of the Association for Computational Linguistics (ACL), 2007.

A. Carlson, J. Betteridge, B. Kisiel, B. Settles, E. R. H. Jr, and T. M. Mitchell. Toward an architecture for never-ending language learning. In Association for the Advancement of Artificial Intelligence (AAAI), 2010.

R. Caspi, R. Billington, L. Ferrer, H. Foerster, C. Fulcher, I. Keseler, A. Kothari, M. Krummenacker, M. La- tendresse, L. Mueller, Q. Ong, S. Paley, P. Subhraveti, D. Weaver, and P. Karp. The MetaCyc database of metabolic pathways and enzymes and the BioCyc collection of pathway/genome databases. Nucleic Acids Research, 44(D1):D471–D480, 2016.

O. Chapelle, B. Schölkopf, and A. Zien, editors. Semi-Supervised Learning. Adaptive Computation and Machine Learning. MIT Press, 2009.

J. Clarke, D. Goldwasser, M. Chang, and D. Roth. Driving semantic parsing from the world’s response. In Computational Natural Language Learning (CoNLL), pages 18–27, 2010.

D. Corney, D. Albakour, M. Martinez, and S. Moussa. What do a million news articles look like? In Workshop on Recent Trends in News Information Retrieval, 2016a.

D. Corney, D. Albakour, M. Martinez-Alvarez, and S. Moussa. What do a million news articles look like? In NewsIR@ ECIR, pages 42–47, 2016b.

Mirel Cosulschi, Nicolae Constantinescu, and Mihai Gabroveanu. Classification and comparison of information structures from a web page. Annals of the University of Craiova-Mathematics and Computer Science Series, 31, 2004.

M. Craven, J. Kumlien, et al. Constructing biological knowledge bases by extracting information from text sources. In ISMB, pages 77–86, 1999.

N. Dalvi, A. Dasgupta, R. Kumar, and V. Rastogi. Aggregating crowdsourced binary ratings. In International World Wide Web Conference (WWW), 2013.

A. Davis, C. Grondin, R. Johnson, D. Sciaky, B. King, R. McMorran, J. Wiegers, T. Wiegers, and C. Mattingly. The comparative toxicogenomics database: update 2017. Nucleic Acids Research, 2016.

A. Davis et al. A CTD–Pfizer collaboration: Manual curation of 88,000 scientific articles text mined for drug–disease and drug–phenotype interactions. Database, 2013.

A. P. Dawid and A. M. Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society C, 28(1):20–28, 1979.

J. Deng, W. Dong, R. Socher, L. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition (CVPR), pages 248–255, 2009.

J. Devlin, M. Chang, K. Lee, and K. Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.

X. Dong, E. Gabrilovich, G. Heitz, W. Horn, N. Lao, K. Murphy, T. Strohmann, S. Sun, and W. Zhang. Knowledge vault: A web-scale approach to probabilistic knowledge fusion. In International Conference on Knowledge Discovery and Data Mining (KDD), pages 601–610, 2014.

X. L. Dong and D. Srivastava. Big Data Integration. Synthesis Lectures on Data Management. Morgan & Claypool Publishers, 2015.

G. Druck, B. Settles, and A. McCallum. Active learning by labeling features. In Empirical Methods in Natural Language Processing (EMNLP), pages 81–90, 2009.

L. Eadicicco. Baidu’s on the future of artificial intelligence, 2017. Time [Online; posted 11-January-2017].

D. Ferrucci, E. Brown, J. Chu-Carroll, J. Fan, D. Gondek, A. Kalyanpur, A. Lally, J. Murdock, E. Nyberg, J. Prager, et al. Building watson: An overview of the deepqa project. AI magazine, 31(3):59–79, 2010.

D. Freitag. Information extraction from HTML: Application of a general machine learning approach. In AAAI/IAAI, pages 517–523, 1998.

J. Fries, P. Varma, V. Chen, K. Xiao, H. Tejeda, P. Saha, J. Dunnmon, H. Chubb, S. Maskatia, M. Fiterau, S. Delp, E. Ashley, C. Ré, and J. Priest. Weakly supervised classification of rare aortic valve malformations using unlabeled cardiac MRI sequences. bioRxiv, 2018. doi: 10.1101/339630. URL https://www.biorxiv.org/content/early/2018/08/22/339630.

H. Gao, G. Barbier, and R. Goolsby. Harnessing the crowdsourcing power of social media for disaster relief. IEEE Intelligent Systems, 26(3):10–14, 2011.

W. Gatterbauer, P. Bohunsky, M. Herzog, B. Krüpl, and B. Pollak. Towards domain-independent information extraction from web tables. In WWW, pages 71–80, 2007.

R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 580–587, 2014.

D. Goldwasser and D. Roth. Learning from natural instructions. In International Joint Conference on Artificial Intelligence (IJCAI), pages 1794–1800, 2011.

V. Govindaraju, C. Zhang, and C. Ré. Understanding tables in context using standard nlp toolkits. In ACL, pages 658–664, 2013.

A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610, 2005.

A. Graves, M. Liwicki, S. Fernández, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5):855–868, 2009.

A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural networks. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6645–6649. IEEE, 2013.

S. Gupta and C. Manning. Improved pattern learning for bootstrapped entity extraction. In CoNLL, 2014.

B. Hancock, S. Wang, P. Varma, P. Liang, and C. Ré. Babble labble: Learning from natural language explanations. In Advances in Neural Information Processing Systems (NeurIPS) Demonstrations, 2017.

B. Hancock, P. Varma, S. Wang, M. Bringmann, P. Liang, and C. Ré. Training classifiers with natural language explanations. In Association for Computational Linguistics (ACL), 2018.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CoRR, abs/1512.03385, 2015. URL http://arxiv.org/abs/1512.03385.

K. He, R. Girshick, and P. Dollár. Rethinking imagenet pre-training. arXiv preprint arXiv:1811.08883, 2018.

M. A. Hearst. Automatic acquisition of hyponyms from large text corpora. In Interational Conference on Computational linguistics, pages 539–545, 1992.

M. Hewett, D. Oliver, D. Rubin, K. Easton, J. Stuart, R. Altman, and T. Klein. PharmGKB: the pharmacogenetics knowledge base. Nucleic Acids Research, 30(1):163–165, 2002.

G. Hinton. Training products of experts by minimizing contrastive divergence. Neural computation, 14(8): 1771–1800, 2002.

S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation, 9(8):1735–1780, 1997.

R. Hoffmann, C. Zhang, X. Ling, L. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Meeting of the Association for Computational Linguistics (ACL), 2011a.

R. Hoffmann, C. Zhang, X. Ling, L. S. Zettlemoyer, and D. S. Weld. Knowledge-based weak supervision for information extraction of overlapping relations. In Association for Computational Linguistics (ACL), pages 541–550, 2011b.

N. Japkowicz and S. Stephen. The class imbalance problem: A systematic study. Intelligent data analysis, 6 (5):429–449, 2002.

M. Joglekar, H. Garcia-Molina, and A. Parameswaran. Comprehensive and reliable crowd assessment algorithms. In International Conference on Data Engineering (ICDE), 2015.

N. Khandwala, A. Ratner, J. Dunnmon, R. Goldman, M. Lungren, D. Rubin, and C. Ré. Cross-modal data programming for medical images. NIPS ML4H Workshop, 2017.

D. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

M. Kovacevic, M. Diligenti, M. Gori, and V. Milutinovic. Recognition of common areas in a web page using visual information: a possible application in a page classification. In ICDM, pages 250–257, 2002.

S. Krening, B. Harrison, K. M. Feigh, C. L. Isbell, M. Riedl, and A. Thomaz. Learning from explanations using sentiment and advice in RL. IEEE Transactions on Cognitive and Developmental Systems, 9(1): 44–55, 2017.

J. Ku, J. Hicks, T. Hastie, J. Leskovec, C. Ré, and S. Delp. The Mobilize center: an NIH big data to knowledge center to advance human movement research and improve mobility. Journal of the American Medical Informatics Association, 22(6):1120–1125, 2015.

Volodymyr Kuleshov, Jialin Ding, Christopher Vo, Braden Hancock, Alexander Ratner, Yang Li, Christopher Ré, Serafim Batzoglou, and Michael Snyder. A machine-compiled database of genome-wide association studies. Nature communications, 10(1):3341, 2019.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

J. Lehmann, R. Isele, M. Jakob, A. Jentzsch, D. Kontokostas, P. Mendes, S. Hellmann, M. Morsey, P. van Kleef, S. Auer, and C. Bizer. DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web Journal, 2014.

T. Lei, R. Barzilay, and T. Jaakkola. Rationalizing neural predictions. In Empirical Methods in Natural Language Processing (EMNLP), 2016.

H. Li, B. Yu, and D. Zhou. Error rate analysis of labeling by crowdsourcing. In ICML Workshop: Machine Learning Meets Crowdsourcing. Atalanta, Georgia, USA, 2013.

J. Li, T. Luong, and D. Jurafsky. A hierarchical neural autoencoder for paragraphs and documents. In ACL, pages 1106–1115, 2015a.

J. Li, A. H. Miller, S. Chopra, M. Ranzato, and J. Weston. Learning through dialogue interactions. arXiv preprint arXiv:1612.04936, 2016.

Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. SIGKDD Explor. Newsl., 17(2), 2015b.

P. Liang. Learning executable semantic parsers for natural language understanding. Communications of the ACM, 59, 2016.

P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009a.

P. Liang, M. I. Jordan, and D. Klein. Learning from measurements in exponential families. In International Conference on Machine Learning (ICML), 2009b.

P. Liang, M. I. Jordan, and D. Klein. Learning dependency-based compositional semantics. In Association for Computational Linguistics (ACL), pages 590–599, 2011.

H. Ling and S. Fidler. Teaching machines to describe images via natural language feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2017.

J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

A. Madaan, A. Mittal, G. Mausam, G. Ramakrishnan, and S. Sarawagi. Numerical relation extraction with minimal supervision. In AAAI, pages 2764–2771, 2016.

G. S. Mann and A. McCallum. Generalized expectation criteria for semi-supervised learning with weakly labeled data. Journal of Machine Learning Research, 11:955–984, 2010.

C. Metz. Google’s hand-fed AI now gives answers, not just search results, 2016. Wired [Online; posted 29-November-2016].

B. Min, R. Grishman, L. Wan, C. Wang, and D. Gondek. Distant supervision for relation extraction with an incomplete knowledge base. In North American Association for Computational Linguistics (NAACL), pages 777–782, 2013.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Association for Computational Linguistics (ACL), pages 1003–1011, 2009a.

M. Mintz, S. Bills, R. Snow, and D. Jurafsky. Distant supervision for relation extraction without labeled data. In Meeting of the Association for Computational Linguistics (ACL), 2009b.

N. Nakashole, M. Theobald, and G. Weikum. Scalable knowledge harvesting with high precision and high recall. In WSDM, pages 227–236. ACM, 2011.

T. Nguyen and A. Moschitti. End-to-end relation extraction using distant supervision from external semantic repositories. In HLT, pages 277–282, 2011.

S. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on knowledge and data engineering, 22 (10):1345–1359, 2009.

S. J. Pan and Q. Yang. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.

F. Parisi, F. Strino, B. Nadler, and Y. Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences of the USA, 111(4):1253–1258, 2014.

P. Pasupat and P. Liang. Zero-shot entity extraction from web pages. In ACL (1), pages 391–401, 2014.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga, and A. Lerer. Automatic differentiation in pytorch, 2017.

G. Penn, J. Hu, H. Luo, and R. McDonald. Flexible web document analysis for delivery to narrow-bandwidth devices. In ICDAR, volume 1, pages 1074–1078, 2001.

D. Pinto, M. Branstein, R. Coleman, W. Croft, M. King, W. Li, and X. Wei. Quasm: a system for question answering using semi-structured data. In JCDL, pages 46–55, 2002.

R. Pochampally, Anish Das Sarma, X. L. Dong, A. Meliou, and D. Srivastava. Fusing data with correlations. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2014.

A. J. Quinn and B. B. Bederson. Human computation: A survey and taxonomy of a growing field. In ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), 2011.

A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever. Language models are unsupervised multitask learners. 2019.

H. Raghavan, O. Madani, and R. Jones. Interactive feature selection. In International Joint Conference on Artificial Intelligence (IJCAI), volume 5, pages 841–846, 2005.

A. Ratner, C. De Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Neural Information Processing Systems (NIPS), 2016a.

A. Ratner, S. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. CoRR, abs/1711.10160, 2017a. URL http://arxiv.org/abs/1711.10160.

A. Ratner, S. H. Bach, H. Ehrenberg, J. Fries, S. Wu, and C. Ré. Snorkel: Rapid training data creation with weak supervision. In Very Large Data Bases (VLDB), number 3, pages 269–282, 2017b.

A. Ratner, B. Hancock, J. Dunnmon, R. Goldman, and C. Ré. Snorkel metal: Weak supervision for multi-task learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning, page 3. ACM, 2018.

A. Ratner, B. Hancock, J. Dunnmon, F. Sala, S. Pandey, and C. Ré. Training complex models with multi-task weak supervision. AAAI, 2019a.

A. Ratner, B. Hancock, and C. Ré. The role of massively multi-task and weak supervision in software 2.0. In Conference on Innovative Data Systems Research, 2019b.

A. J. Ratner, C. M. D. Sa, S. Wu, D. Selsam, and C. Ré. Data programming: Creating large training sets, quickly. In Advances in Neural Information Processing Systems (NeurIPS), pages 3567–3575, 2016b.

C. Ré, A. Sadeghian, Z. Shan, J. Shin, F. Wang, S. Wu, and C. Zhang. Feature engineering for knowledge base construction. IEEE Data Engineering Bulletin, 2014.

T. Rekatsinas, X. Chu, I. Ilyas, and C. Ré. HoloClean: Holistic data repairs with probabilistic inference. PVLDB, 10(11):1190–1201, 2017a.

T. Rekatsinas, M. Joglekar, H. Garcia-Molina, A. Parameswaran, and C. Ré. SLiMFast: Guaranteed results for data fusion and source reliability. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2017b.

S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In Machine Learning and Knowledge Discovery in Databases (ECML PKDD), pages 148–163, 2010a.

S. Riedel, L. Yao, and A. McCallum. Modeling relations and their mentions without labeled text. In European Conference on Machine Learning and Knowledge Discovery in Databases (ECML PKDD), 2010b.

B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In Empirical Methods in Natural Language Processing (EMNLP), pages 24–29, 2013a.

B. Roth and D. Klakow. Combining generative and discriminative model scores for distant supervision. In Conference on Empirical Methods on Natural Language Processing (EMNLP), 2013b.

V. Satopaa, J. Albrecht, D. Irwin, and B. Raghavan. Finding a “kneedle” in a haystack: Detecting knee points in system behavior. In International Conference on Distributed Computing Systems Workshops, 2011.

H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11:363–371, 1965.

B. Settles. Active Learning. Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers, 2012.

J. Shin, S. Wu, F. Wang, C. De Sa, C. Zhang, and C. Ré. Incremental knowledge base construction using DeepDive. In Very Large Data Bases (VLDB), number 11, pages 1310–1321, 2015.

A. Singhal. Introducing the knowledge graph: things, not strings. Official Google Blog, 2012.

S. Srivastava, I. Labutov, and T. Mitchell. Joint concept learning and semantic parsing from natural language explanations. In Empirical Methods in Natural Language Processing (EMNLP), pages 1528–1537, 2017.

R. Stewart and S. Ermon. Label-free supervision of neural networks with physics and other domain knowledge. In AAAI Conference on Artificial Intelligence (AAAI), 2017.

F. Suchanek, G. Kasneci, and G. Weikum. YAGO: A large ontology from Wikipedia and WordNet. Web Semantics: Science, Services and Agents on the World Wide Web, 6(3):203–217, 2008.

C. Sun, A. Shrivastava, S. Singh, and A. Gupta. Revisiting unreasonable effectiveness of data in deep learning era. arXiv preprint arXiv:1707.02968, 2017.

S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Association for Computational Linguistics (ACL), pages 721–729, 2012a.

S. Takamatsu, I. Sato, and H. Nakagawa. Reducing wrong labels in distant supervision for relation extraction. In Meeting of the Association for Computational Linguistics (ACL), 2012b.

A. Tengli, Y. Yang, and N. Ma. Learning table extraction from examples. In International Conference on Computational Linguistics (COLING), page 987, 2004.

J. Turian, L. Ratinov, and Y. Bengio. Word representations: A simple and general method for semi-supervised learning. In Association for Computational Linguistics (ACL), pages 384–394, 2010.

P. Varma, B. He, D. Iter, P. Xu, R. Yu, C. De Sa, and C. Ré. Socratic learning: Augmenting generative models to incorporate latent subsets in training data. arXiv preprint arXiv:1610.08123, 2017.

P. Verga, D. Belanger, E. Strubell, B. Roth, and A. McCallum. Multilingual relation extraction using compositional universal schema. In HLT-NAACL, pages 886–896, 2016.

S. I. Wang, S. Ginn, P. Liang, and C. D. Manning. Naturalizing a programming language via interactive learning. In Association for Computational Linguistics (ACL), 2017.

C.-H. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In Proceedings of the fifth BioCreative challenge evaluation workshop, pages 154–166, 2015a.

C.-H. Wei, Y. Peng, R. Leaman, A. P. Davis, C. J. Mattingly, J. Li, T. C. Wiegers, and Z. Lu. Overview of the BioCreative V chemical disease relation (CDR) task. In BioCreative Challenge Evaluation Workshop, 2015b.

K. Weiss, T. Khoshgoftaar, and D. Wang. A survey of transfer learning. Journal of Big Data, 3(1):9, 2016.

D. Welter, J. MacArthur, J. Morales, T. Burdett, P. Hall, H. Junkins, A. Klemm, P. Flicek, T. Manolio, L. Hindorff, et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Research, 42:D1001–D1006, 2014.

J. E. Weston. Dialog-based language learning. In Advances in Neural Information Processing Systems (NeurIPS), pages 829–837, 2016.

S. Wu, L. Hsiao, X. Cheng, B. Hancock, T. Rekatsinas, P. Levis, and C. Ré. Fonduer: Knowledge base construction from richly formatted data. In ACM SIGMOD International Conference on Management of Data (SIGMOD), 2018.

Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.

M. Yahya, S. Whang, R. Gupta, and A. Halevy. ReNoun: Fact extraction for nominal attributes. In Empirical Methods in Natural Language Processing (EMNLP), pages 325–335, 2014.

Y. Yang and H. Zhang. HTML page analysis based on visual cues. In International Conference on Document Analysis and Recognition (ICDAR), pages 859–864, 2001.

M.-C. Yuen, I. King, and K.-S. Leung. A survey of crowdsourcing systems. In Privacy, Security, Risk and Trust (PASSAT) and International Conference on Social Computing (SocialCom), 2011.

O. F. Zaidan and J. Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Empirical Methods in Natural Language Processing (EMNLP), 2008a.

O. F. Zaidan and J. Eisner. Modeling annotators: A generative approach to learning from annotator rationales. In Conference on Empirical Methods in Natural Language Processing (EMNLP), 2008b.

J. M. Zelle and R. J. Mooney. Learning to parse database queries using inductive logic programming. In Association for the Advancement of Artificial Intelligence (AAAI), pages 1050–1055, 1996.

L. S. Zettlemoyer and M. Collins. Learning to map sentences to logical form: Structured classification with probabilistic categorial grammars. In Uncertainty in Artificial Intelligence (UAI), pages 658–666, 2005.

C. Zhang, C. Ré, M. Cafarella, C. De Sa, A. Ratner, J. Shin, F. Wang, and S. Wu. DeepDive: Declarative knowledge base construction. Commun. ACM, 60(5):93–102, 2017.

Y. Zhang, A. Chaganty, A. Paranjape, D. Chen, J. Bolton, P. Qi, and C. Manning. Stanford at TAC KBP 2016: Sealing pipeline leaks and understanding Chinese. In Text Analysis Conference (TAC), 2016a.

Y. Zhang, X. Chen, D. Zhou, and M. I. Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. Journal of Machine Learning Research, 17:1–44, 2016b.

B. Zhao, B. Rubinstein, J. Gemmell, and J. Han. A Bayesian approach to discovering truth from conflicting sources for data integration. PVLDB, 5(6):550–561, 2012.