The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Training Complex Models with Multi-Task Weak Supervision

Alexander Ratner, Braden Hancock, Jared Dunnmon, Frederic Sala, Shreyash Pandey, Christopher Ré
Department of Computer Science, Stanford University
{ajratner, bradenjh, jdunnmon, fredsala, shreyash, chrismre}@stanford.edu

Abstract

As models continue to increase in complexity, collecting large hand-labeled training sets has become one of the biggest roadblocks in practice. Instead, weaker forms of supervision that provide noisier but cheaper labels are often used. However, these weak supervision sources have diverse and unknown accuracies, may output correlated labels, and may label different tasks or apply at different levels of granularity. We propose a framework for integrating and modeling such weak supervision sources by viewing them as labeling different related sub-tasks of a problem, which we refer to as the multi-task weak supervision setting. We show that by solving a matrix completion-style problem, we can recover the accuracies of these multi-task sources given their dependency structure, but without any labeled data, leading to higher-quality supervision for training an end model. Theoretically, we show that the generalization error of models trained with this approach improves with the number of unlabeled data points, and characterize the scaling with respect to the task and dependency structures. On three fine-grained classification problems, we show that our approach leads to average gains of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a majority vote baseline, and 4.1 points over a previously proposed weak supervision method that models tasks separately.

1 Introduction

One of the greatest roadblocks to using modern machine learning models is collecting hand-labeled training data at the massive scale they require. In real-world settings where domain expertise is needed and modeling goals change frequently, hand-labeling training sets is prohibitively slow, expensive, and static. For these reasons, practitioners are increasingly turning to weak supervision techniques wherein noisier, often programmatically-generated labels are used instead. Common weak supervision sources include external knowledge bases (Mintz et al. 2009; Zhang et al. 2017a; Craven and Kumlien 1999; Takamatsu, Sato, and Nakagawa 2012), heuristic patterns (Gupta and Manning 2014; Ratner et al. 2018), feature annotations (Mann and McCallum 2010; Zaidan and Eisner 2008), and noisy crowd labels (Karger, Oh, and Shah 2011; Dawid and Skene 1979). The use of these sources has led to state-of-the-art results in a range of domains (Zhang et al. 2017a; Xiao et al. 2015). A theme of weak supervision is that using the full diversity of available sources is critical to training high-quality models (Ratner et al. 2018; Zhang et al. 2017a).

The key technical difficulty of weak supervision is determining how to combine the labels of multiple sources that have different, unknown accuracies, may be correlated, and may label at different levels of granularity. In our experience with users in academia and industry, the complexity of real world weak supervision sources makes this integration phase the key time sink and stumbling block. For example, if we are training a model to classify entities in text, we may have one available source of high-quality but coarse-grained labels (e.g. "Person" vs. "Organization") and one source that provides lower-quality but finer-grained labels (e.g. "Doctor" vs. "Lawyer"); moreover, these sources might be correlated due to some shared component or data source (Bach et al. 2017; Varma et al. 2017). Handling such diversity requires addressing a core technical challenge: estimating the unknown accuracies of multi-granular and potentially correlated supervision sources without any labeled data.

To overcome this challenge, we propose MeTaL, a framework for modeling and integrating weak supervision sources with different unknown accuracies, correlations, and granularities. In MeTaL, we view each source as labeling one of several related sub-tasks of a problem—we refer to this as the multi-task weak supervision setting. We then show that given the dependency structure of the sources, we can use their observed agreement and disagreement rates to recover their unknown accuracies. Moreover, we exploit the relationship structure between tasks to observe additional cross-task agreements and disagreements, effectively providing extra signal from which to learn. In contrast to previous approaches based on sampling from the posterior of a graphical model directly (Ratner et al. 2016; Bach et al. 2017), we develop a simple and scalable matrix completion-style algorithm, which we are able to analyze by applying strong matrix concentration bounds (Tropp 2015). We use this algorithm to learn and model the accuracies of diverse weak supervision sources, and then combine their labels to produce training data that can be used to supervise arbitrary models, including increasingly popular multi-task learning models (Caruana 1993; Ruder 2017).

Copyright © 2019, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Compared to previous methods which only handled the single-task setting (Ratner et al. 2016; 2018), and generally considered conditionally-independent sources (Anandkumar et al. 2014; Dawid and Skene 1979), we demonstrate that our multi-task aware approach leads to average gains of 4.1 points in accuracy in our experiments, and has at least three additional benefits. First, many dependency structures between weak supervision sources may lead to non-identifiable models of their accuracies, where a unique solution cannot be recovered. We provide a compiler-like check to establish identifiability—i.e. the existence of a unique set of source accuracies—for arbitrary dependency structures, without resorting to the standard assumption of non-adversarial sources (Dawid and Skene 1979), alerting users to this potential stumbling block that we have observed in practice. Next, we provide sample complexity bounds that characterize the benefit of adding additional unlabeled data and the scaling with respect to the user-specified task and dependency structure. While previous approaches required thousands of sources to give non-vacuous bounds, we capture regimes with small numbers of sources, better reflecting the real-world uses of weak supervision we have observed. Finally, we are able to solve our proposed problem directly with SGD, leading to over 100× faster runtimes compared to prior Gibbs-sampling based approaches (Ratner et al. 2016; Platanios et al. 2017), and enabling simple implementation using libraries like PyTorch.

We validate our framework on three fine-grained classification tasks in named entity recognition, relation extraction, and medical document classification, for which we have diverse weak supervision sources at multiple levels of granularity. We show that by modeling them as labeling hierarchically-related sub-tasks and utilizing unlabeled data, we can get an average improvement of 20.2 points in accuracy over a traditional supervised approach, 6.8 points over a basic majority voting weak supervision baseline, and 4.1 points over data programming (Ratner et al. 2016), an existing weak supervision approach in the literature that is not multi-task-aware. We also extend our framework to handle unipolar sources that only label one class, a critical aspect of weak supervision in practice that leads to an average 2.8 point contribution to our gains over majority vote. From a practical standpoint, we argue that our framework represents an efficient way for practitioners to supervise modern machine learning models, including new multi-task variants, for complex tasks by opportunistically using the diverse weak supervision sources available to them. To further validate this, we have released an open-source implementation of our framework.¹

2 Related Work

Our work builds on and extends various settings studied in machine learning.

Weak Supervision: We draw motivation from recent work which models and integrates weak supervision using generative models (Ratner et al. 2016; 2018; Bach et al. 2017) and other methods (Guan et al. 2017; Khetan, Lipton, and Anandkumar 2017). These approaches, however, do not handle multi-granularity or multi-task weak supervision, require expensive sampling-based techniques that may lead to non-identifiable solutions, and leave room for sharper theoretical characterization of weak supervision scaling properties. More generally, our work is motivated by a wide range of specific weak supervision techniques, which include traditional distant supervision approaches (Mintz et al. 2009; Craven and Kumlien 1999; Zhang et al. 2017a; Hoffmann et al. 2011; Takamatsu, Sato, and Nakagawa 2012), co-training methods (Blum and Mitchell 1998), pattern-based supervision (Gupta and Manning 2014; Zhang et al. 2017a), and feature-annotation techniques (Mann and McCallum 2010; Zaidan and Eisner 2008; Liang, Jordan, and Klein 2009).

Crowdsourcing: Our approach also has connections to the crowdsourcing literature (Karger, Oh, and Shah 2011; Dawid and Skene 1979), and in particular to spectral and method of moments-based approaches (Zhang et al. 2014; Dalvi et al. 2013a; Ghosh, Kale, and McAfee 2011; Anandkumar et al. 2014). In contrast, the goal of our work is to support and explore settings not covered by crowdsourcing work, such as sources with correlated outputs, the proposed multi-task supervision setting, and regimes wherein a small number of labelers (weak supervision sources) each label a large number of items (data points). Moreover, we theoretically characterize the generalization performance of an end model trained with the weakly labeled data.

Multi-Task Learning: Our proposed approach is motivated by recent progress on multi-task learning models (Caruana 1993; Ruder 2017; Søgaard and Goldberg 2016), in particular their need for multiple large hand-labeled training datasets. We note that the focus of our paper is on generating supervision for these models, not on the particular multi-task learning model being trained, which we seek to control for by fixing a simple architecture in our experiments.

Our work is also related to recent techniques for estimating classifier accuracies without labeled data in the presence of structural constraints (Platanios et al. 2017). We use matrix structure estimation (Loh and Wainwright 2012) and concentration bounds (Tropp 2015) for our core results.

3 Programming Machine Learning with Weak Supervision

As modern machine learning models become both more complex and more performant on a range of tasks, developers increasingly interact with them by programmatically generating noisier or weak supervision. These approaches of effectively programming machine learning models have recently been formalized by the following pipeline (Ratner et al. 2016; 2018): First, users provide one or more weak supervision sources, which are applied to unlabeled data to generate a set of noisy labels. These labels may overlap and conflict; we model and combine them via a label model in order to produce a final set of training labels. These labels are then used to train some discriminative model, which we refer to as the end model. This programmatic weak supervision approach can utilize sources ranging from heuristic rules to other models, and in this way can also be viewed as a pragmatic and flexible form of multi-source transfer learning.

¹github.com/HazyResearch/metal

Figure 1: A schematic of the MeTaL pipeline. To generate training data for an end model, such as a multi-task model as in our experiments, the user inputs a task graph G_task defining the relationships between task labels Y_1, ..., Y_t; a set of unlabeled data points X; a set of multi-task weak supervision sources s_i which each output a vector λ_i of task labels for X; and the dependency structure between these sources, G_source. We train a label model to learn the accuracies of the sources, outputting a vector of probabilistic training labels Ỹ for training the end model.

Figure 2: An example fine-grained entity classification problem, where weak supervision sources label three sub-tasks of different granularities: (i) Person vs. Organization, (ii) Doctor vs. Lawyer (or N/A), (iii) Hospital vs. Office (or N/A). The example weak supervision sources use a pattern heuristic and dictionary lookup respectively.

In our experiences with users from science and industry, we have found it critical to utilize all available sources of weak supervision for complex modeling problems, including ones which label at multiple levels of granularity. However, this diverse, multi-granular weak supervision does not easily fit into existing paradigms. We propose a formulation where each weak supervision source labels some sub-task of a problem, which we refer to as the multi-task weak supervision setting. We consider an example:

Example 1 A developer wants to train a fine-grained Named Entity Recognition (NER) model to classify mentions of entities in the news (Figure 2). She has a multitude of available weak supervision sources which she believes have relevant signal for her problem—for example, pattern matchers, dictionaries, and pre-trained generic NER taggers. However, it is unclear how to properly use and combine them: some of them label phrases coarsely as PERSON versus ORGANIZATION, while others classify specific fine-grained types of people or organizations, with a range of unknown accuracies. In our framework, she can represent them as labeling tasks of different granularities—e.g. Y_1 = {Person, Org}, Y_2 = {Doctor, Lawyer, N/A}, Y_3 = {Hospital, Office, N/A}, where the label N/A applies, for example, when the type-of-person task is applied to an organization.

In our proposed multi-task supervision setting, the user specifies a set of structurally-related tasks, and then provides a set of weak supervision sources which are user-defined functions that either label each data point or abstain for each task, and may have some user-specified dependency structure. These sources can be arbitrary black-box functions, and can thus subsume a range of weak supervision approaches relevant to both text and other data modalities, including use of pattern-based heuristics, distant supervision (Mintz et al. 2009), crowd labels, other weak or biased classifiers, declarative rules over unsupervised feature extractors (Varma et al. 2017), and more. Our goal is to estimate the unknown accuracies of these sources, combine their outputs, and use the resulting labels to train an end model.
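To make this concrete, the following is a minimal sketch of how sources like those in Example 1 might be written as small Python functions that emit a label vector over the task hierarchy, with 0 denoting abstention. This is an illustration only, not the released MeTaL API: the label encodings, attribute names (x.span, x.left_context), and function names are our own assumptions.

```python
# Hypothetical integer encodings for the three tasks in Figure 2 (0 = abstain).
PERSON, ORG = 1, 2                        # task Y_1
DOCTOR, LAWYER, NA_PERSON_TYPE = 1, 2, 3  # task Y_2 (N/A encoded as 3)
HOSPITAL, OFFICE, NA_ORG_TYPE = 1, 2, 3   # task Y_3 (N/A encoded as 3)

HOSPITAL_NAMES = {"general hospital", "st. mary's medical center"}

def lf_title_doctor(x):
    """Pattern heuristic: a 'Dr.' title before the span suggests a doctor.
    Labeling Y_2 = Doctor also implies Y_1 = Person and Y_3 = N/A."""
    if "Dr." in x.left_context:
        return [PERSON, DOCTOR, NA_ORG_TYPE]
    return [0, 0, 0]  # abstain on every task

def lf_hospital_dictionary(x):
    """Dictionary lookup: a known hospital name labels the organization tasks."""
    if x.span.lower() in HOSPITAL_NAMES:
        return [ORG, NA_PERSON_TYPE, HOSPITAL]
    return [0, 0, 0]

def lf_coarse_ner(x, tagger=None):
    """A pre-trained generic NER tagger used as a coarse source: its coverage
    set is just {1}, so it only ever labels Y_1 and abstains elsewhere."""
    tag = tagger(x.span) if tagger is not None else None
    if tag == "PER":
        return [PERSON, 0, 0]
    if tag == "ORG":
        return [ORG, 0, 0]
    return [0, 0, 0]
```

Note how the finer-grained sources implicitly constrain the coarser tasks through the task structure, which is exactly the cross-task signal exploited in the next section.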
4 Modeling Multi-Task Weak Supervision

The core technical challenge of the multi-task weak supervision setting is recovering the unknown accuracies of weak supervision sources given their dependency structure and a schema of the tasks they label, but without any ground-truth labeled data. We define a new algorithm for recovering the accuracies in this setting using a matrix completion-style optimization objective. We establish conditions under which the resulting estimator returns a unique solution. We then analyze the sample complexity of our estimator, characterizing its scaling with respect to the amount of unlabeled data, as well as the task schema and dependency structure, and show how the estimation error affects the generalization performance of the end model we aim to train. Finally, we highlight how our approach handles abstentions and unipolar sources, two critical scenarios in the weak supervision setting.

A Multi-Task Weak Supervision Estimator

Problem Setup Let X ∈ 𝒳 be a data point and Y = [Y_1, Y_2, ..., Y_t]^T be a vector of categorical task labels, Y_i ∈ {1, ..., k_i}, corresponding to t tasks, where (X, Y) is drawn i.i.d. from a distribution 𝒟.²

²The variables we introduce throughout this section are summarized in a glossary in the Appendix, which can be accessed at https://arxiv.org/abs/1810.02840.

The user provides a specification of how these tasks relate to each other; we denote this schema as the task structure G_task. The task structure expresses logical relationships between tasks, defining a feasible set of label vectors 𝒴, such that Y ∈ 𝒴. For example, Figure 2 illustrates a hierarchical task structure over three tasks of different granularities pertaining to a fine-grained entity classification problem. Here, the tasks are related by logical subsumption relationships: for example, if Y_2 = DOCTOR, this implies that Y_1 = PERSON, and that Y_3 = N/A, since the task label Y_3 concerns types of organizations, which is inapplicable to persons. Thus, in this task structure, Y = [PERSON, DOCTOR, N/A]^T is in 𝒴 while Y = [PERSON, N/A, HOSPITAL]^T is not. While task structures are often simple to define, as in the previous example, or are explicitly defined by existing resources—such as ontologies or graphs—we note that if no task structure is provided, our approach becomes equivalent to modeling the t tasks separately, a baseline we consider in the experiments.

In our setting, rather than observing the true label Y, we have access to m multi-task weak supervision sources s_i ∈ S which emit label vectors λ_i that contain labels for some subset of the t tasks. Let 0 denote a null or abstaining label, and let the coverage set τ_i ⊆ {1, ..., t} be the fixed set of tasks for which the ith source emits non-zero labels, such that λ_i ∈ 𝒴_{τ_i}. For convenience, we let τ_0 = {1, ..., t} so that 𝒴_{τ_0} = 𝒴. For example, a source from our previous example might have a coverage set τ_i = {1, 3}, emitting coarse-grained labels such as λ_i = [PERSON, 0, N/A]^T.

Note that sources often label multiple tasks implicitly due to the constraints of the task structure; for example, a source that labels types of people (Y_2) also implicitly labels people vs. organizations (Y_1 = PERSON), and types of organizations (as Y_3 = N/A). Thus sources tailored to different tasks still have agreements and disagreements; we use this additional cross-task signal in our approach.

The user also provides the conditional dependency structure of the sources as a graph G_source = (V, E), where V = {Y, λ_1, λ_2, ..., λ_m} (Figure 3). Specifically, if (λ_i, λ_j) is not an edge in G_source, this means that λ_i is independent of λ_j conditioned on Y and the other source labels. Note that if G_source is unknown, it can be estimated using statistical techniques such as (Bach et al. 2017). Importantly, we do not know anything about the strengths of the correlations in G_source, or the sources' accuracies.

Figure 3: An example of a weak supervision source dependency graph G_source (left) and its junction tree representation (right), where Y is a vector-valued random variable with a feasible set of values, Y ∈ 𝒴. Here, the outputs of sources 1 and 2 are modeled as dependent conditioned on Y. This results in a junction tree with singleton separator sets, Y. Here, the observable cliques are O = {λ_1, λ_2, λ_3, λ_4, {λ_1, λ_2}} ⊂ 𝒞.

Our overall goal is to apply the set of weak supervision sources S = {s_1, ..., s_m} to an unlabeled dataset 𝒳_U consisting of n data points, then use the resulting weakly-labeled training set to supervise an end model f_w : 𝒳 → 𝒴 (Figure 1). This weakly-labeled training set will contain overlapping and conflicting labels, from sources with unknown accuracies and correlations. To handle this, we will learn a label model P_μ(Y | λ), parameterized by a vector of source correlations and accuracies μ, which for each data point X takes as input the noisy labels λ = {λ_1, ..., λ_m} and outputs a single probabilistic label vector Ỹ. Succinctly, given a user-provided tuple (𝒳_U, S, G_source, G_task), our key technical challenge is recovering the parameters μ without access to ground truth labels Y.

Modeling Multi-Task Sources To learn a label model over multi-task sources, we introduce sufficient statistics over the random variables in G_source. Let 𝒞 be the set of cliques in G_source, and define an indicator random variable for the event of a clique C ∈ 𝒞 taking on a set of values y_C:

    ψ(C, y_C) = 1{ ∩_{i∈C} V_i = (y_C)_i },

where (y_C)_i ∈ 𝒴_{τ_i}. We define ψ(C) ∈ {0,1}^{∏_{i∈C}(|𝒴_{τ_i}| − 1)} as the vector of indicator random variables for all combinations of all but one of the labels emitted by each variable in clique C—thereby defining a minimal set of statistics—and define ψ(𝒞') accordingly for any set of cliques 𝒞' ⊆ 𝒞. Then μ = E[ψ(𝒞)] is the vector of sufficient statistics for the label model we want to learn.

We work with two simplifying conditions in this section. First, we consider the setting where G_source is triangulated and has a junction tree representation with singleton separator sets. If this is not the case, edges can always be added to G_source to make this setting hold; otherwise, we describe how our approach can directly handle non-singleton separator sets in the Appendix.

Second, we use a simplified class-conditional model of the noisy labeling process, where we learn one accuracy parameter for each label value λ_i that each source s_i emits. This is equivalent to assuming that a source may have a different accuracy on each different class, but that if it emits a certain label incorrectly, it does so uniformly over the different true labels Y. This is a more expressive model than the commonly considered one, where each source is modeled by a single accuracy parameter, e.g. in (Dawid and Skene 1979; Ratner et al. 2016), and in particular allows us to capture the unipolar setting considered later on.
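As a simplified illustration of these statistics, the sketch below builds the observed indicator matrix ψ(O) and its empirical mean and covariance from an n × m matrix of source outputs. It assumes each source's output has already been encoded as a single integer per data point (0 for abstain) and only handles singleton and pairwise observable cliques; the function names and this encoding are our own assumptions, not part of the paper.

```python
import numpy as np

def indicator_block(labels, num_label_values):
    """Indicators for a source taking each of its non-abstain values except the
    last one (an 'all but one' minimal set of statistics); 0 = abstain."""
    n = labels.shape[0]
    block = np.zeros((n, num_label_values - 1))
    for v in range(1, num_label_values):      # values 1 .. num_label_values - 1
        block[:, v - 1] = (labels == v).astype(float)
    return block

def observed_statistics(L, num_label_values, dep_edges):
    """Build psi(O) for an n x m label matrix L (entries in {0, 1, ..., k_i})
    and return its empirical mean and covariance. dep_edges lists source pairs
    (i, j) modeled as dependent, i.e. the non-singleton observable cliques."""
    n, m = L.shape
    blocks = [indicator_block(L[:, i], num_label_values[i]) for i in range(m)]
    # Indicators for pairwise cliques: outer products of the member indicators.
    for (i, j) in dep_edges:
        bi, bj = blocks[i], blocks[j]
        blocks.append(np.einsum("na,nb->nab", bi, bj).reshape(n, -1))
    psi_O = np.hstack(blocks)                  # n x d_O
    mu_O = psi_O.mean(axis=0)                  # empirical E[psi(O)]
    sigma_O = np.cov(psi_O, rowvar=False)      # empirical Sigma_O
    return mu_O, sigma_O
```

These empirical quantities are exactly the observable inputs used by the estimator developed next.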

Our Approach The chief technical difficulty in our problem is that we do not observe Y. We overcome this by analyzing the covariance matrix of an observable subset of the cliques in G_source, leading to a matrix completion-style approach for recovering μ. We leverage two pieces of information: (i) the observability of part of Cov[ψ(𝒞)], and (ii) a result from (Loh and Wainwright 2012) which states that the inverse covariance matrix Cov[ψ(𝒞)]^{-1} is structured according to G_source, i.e., if there is no edge between λ_i and λ_j in G_source, then the corresponding entries are 0.

We start by considering two disjoint subsets of 𝒞: the set of observable cliques, O ⊆ 𝒞—i.e., those cliques not containing Y—and the separator set cliques of the junction tree, 𝒮 ⊆ 𝒞. In the setting we consider here, 𝒮 = {Y} (see Figure 3). We then write the covariance matrix of the indicator variables for O ∪ 𝒮, Cov[ψ(O ∪ 𝒮)], in block form, similar to (Chandrasekaran, Parrilo, and Willsky 2010), as:

    Cov[ψ(O ∪ 𝒮)] ≡ Σ = [ Σ_O  Σ_OS ; Σ_OS^T  Σ_S ]        (1)

and similarly define its inverse:

    K = Σ^{-1} = [ K_O  K_OS ; K_OS^T  K_S ]        (2)

Here, Σ_O = Cov[ψ(O)] ∈ R^{d_O × d_O} is the observable block of Σ, where d_O = Σ_{C∈O} ∏_{i∈C}(|𝒴_{τ_i}| − 1). Next, Σ_OS = Cov[ψ(O), ψ(𝒮)] is the unobserved block which is a function of μ, the label model parameters that we wish to recover. Finally, Σ_S = Cov[ψ(𝒮)] = Cov[ψ(Y)] is a function of the class balance P(Y).

We make two observations about Σ_S. First, while the full form of Σ_S is the covariance of the |𝒴| − 1 indicator variables for each individual value of Y but one, given our simplified class-conditional label model, we in fact only need a single indicator variable for Y (see Appendix); thus, Σ_S is a scalar. Second, Σ_S is a function of the class balance P(Y), which we assume is either known, or has been estimated according to the unsupervised approach we detail in the Appendix. Thus, given Σ_O and Σ_S, our goal is to recover the vector Σ_OS from which we can recover μ.

Applying the block matrix inversion lemma, we have:

    K_O = Σ_O^{-1} + c Σ_O^{-1} Σ_OS Σ_OS^T Σ_O^{-1},        (3)

where c = (Σ_S − Σ_OS^T Σ_O^{-1} Σ_OS)^{-1} ∈ R^+. Let z = √c Σ_O^{-1} Σ_OS; we can then express (3) as:

    K_O = Σ_O^{-1} + z z^T.        (4)

The right hand side of (4) consists of an empirically observable term, Σ_O^{-1}, and a rank-one term, z z^T, which we can solve for to directly recover μ. For the left hand side, we apply an extension of Corollary 1 from (Loh and Wainwright 2012) (see Appendix) to conclude that K_O has graph-structured sparsity, i.e., it has zeros determined by the structure of dependencies between the sources in G_source. This suggests an algorithmic approach of estimating z as a matrix completion problem in order to recover an estimate of μ (Algorithm 1). In more detail: let Ω be the set of indices (i, j) where (K_O)_{i,j} = 0, determined by G_source, yielding a system of equations,

    0 = (Σ_O^{-1})_{i,j} + (z z^T)_{i,j}   for (i, j) ∈ Ω,        (5)

which is now a matrix completion problem. Define ||A||_Ω as the Frobenius norm of A with entries not in Ω set to zero; then we can rewrite (5) as ||Σ_O^{-1} + z z^T||_Ω = 0. We solve this equation to estimate z, and thereby recover Σ_OS, from which we can directly recover the label model parameters μ algebraically.
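Because the objective ||Σ̂_O^{-1} + z z^T||_Ω is a smooth function of z, it can be minimized directly with stochastic gradient methods, which is the source of the runtime gains noted in the introduction. Below is a minimal PyTorch sketch of this step under our own naming (it is not the released implementation); it takes an empirical Σ̂_O and a 0/1 mask over Ω as inputs:

```python
import torch

def estimate_z(sigma_O, omega_mask, steps=2000, lr=0.01, seed=0):
    """Minimize || inv(Sigma_O) + z z^T ||_Omega over z with Adam.

    sigma_O: (d_O, d_O) empirical covariance of the observed clique indicators.
    omega_mask: (d_O, d_O) 0/1 tensor selecting the entries (i, j) in Omega,
        i.e. the pairs forced to zero in K_O by the dependency structure G_source.
    """
    torch.manual_seed(seed)
    sigma_O_inv = torch.linalg.inv(sigma_O)
    d_O = sigma_O.shape[0]
    z = torch.randn(d_O, 1, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        residual = (sigma_O_inv + z @ z.T) * omega_mask  # zero outside Omega
        loss = torch.norm(residual)                      # Frobenius norm on Omega
        loss.backward()
        opt.step()
    return z.detach()
```

Given the resulting ẑ, recovering Σ̂_OS and then μ̂ is closed-form (Algorithm 1), and the sign ambiguity in ẑ is resolved as discussed next.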
Checking for Identifiability A first question is: which dependency structures G_source lead to unique solutions for μ? This question presents a stumbling block for users, who might attempt to use non-identifiable sets of correlated weak supervision sources. We provide a simple, testable condition for identifiability.

Let G_inv be the inverse graph of G_source; note that Ω is the edge set of G_inv expanded to include all indicator random variables ψ(𝒞). Then, let M_Ω be a matrix with dimensions |Ω| × d_O such that each row in M_Ω corresponds to a pair (i, j) ∈ Ω with 1's in positions i and j and 0's elsewhere. Taking the log of the squared entries of (5), we get a system of linear equations M_Ω l = q_Ω, where l_i = log(z_i^2) and q_(i,j) = log(((Σ_O^{-1})_{i,j})^2). Assuming we can solve this system (which we can always ensure by adding sources; see Appendix), we can uniquely recover the z_i^2, meaning our model is identifiable up to sign.

Given estimates of the z_i^2, we can see from (5) that the sign of a single z_i determines the sign of all other z_j reachable from z_i in G_inv. Thus to ensure a unique solution, we only need to pick a sign for each connected component in G_inv. In the case where the sources are assumed to be independent, e.g., (Dalvi et al. 2013b; Zhang et al. 2014; Dawid and Skene 1979), it suffices to make the assumption that the sources are on average non-adversarial; i.e., select the sign of the z_i that leads to higher average accuracies of the sources. Even a single source that is conditionally independent from all the other sources will cause G_inv to be fully connected, meaning we can use this symmetry breaking assumption in the majority of cases even with correlated sources. Otherwise, a sufficient condition is the standard one of assuming non-adversarial sources, i.e. that all sources have greater than random accuracy.

Source Accuracy Estimation Algorithm Now that we know when a set of sources with correlation structure G_source is identifiable, yielding a unique z, we can estimate the accuracies μ using Algorithm 1. We also use the function ExpandTied, which is a simple algebraic expansion of tied parameters according to the simplified class-conditional model used in this section; see Appendix for details. In Figure 4, we plot the performance of our algorithm on synthetic data, showing its scaling with the number of unlabeled data points n, the density of pairwise dependencies in G_source, and the runtime performance as compared to a prior Gibbs sampling-based approach. Next, we theoretically analyze the scaling of the error ||μ̂ − μ*||.

Figure 4: (Left) Estimation error ||μ̂ − μ*|| decreases with increasing n. (Middle) Given G_source, our model successfully recovers the source accuracies even with many pairwise dependencies among sources, where a naive conditionally-independent model fails. (Right) The runtime of MeTaL is independent of n after an initial matrix multiply, and can thus be multiple orders of magnitude faster than Gibbs sampling-based approaches (Ratner et al. 2016).

Algorithm 1 Source Accuracy Estimation for Multi-Task Weak Supervision

Input: Observed labeling rates Ê[ψ(O)] and covariance Σ̂_O; class balance Ê[ψ(Y)] and variance Σ̂_S; correlation sparsity structure Ω
  ẑ ← argmin_z ||Σ̂_O^{-1} + z z^T||_Ω
  ĉ ← Σ̂_S^{-1}(1 + ẑ^T Σ̂_O ẑ),   Σ̂_OS ← Σ̂_O ẑ / √ĉ
  μ̂′ ← Σ̂_OS + Ê[ψ(Y)] · Ê[ψ(O)]
  return ExpandTied(μ̂′)

Theoretical Analysis: Scaling with Diverse Multi-Task Supervision

Our ultimate goal is to train an end model using the source labels, denoised and combined by the label model μ̂ we have estimated. We connect the generalization error of this end model to the estimation error of Algorithm 1, ultimately showing that the generalization error scales as n^{-1/2}, where n is the number of unlabeled data points. This key result establishes the same asymptotic scaling as traditionally supervised learning methods, but with respect to unlabeled data points.

Let P_μ̂(Ỹ | λ) be the probabilistic label (i.e. distribution) predicted by our label model, given the source labels λ as input, which we compute using the estimated μ̂. We then train an end multi-task discriminative model f_w : 𝒳 → 𝒴 parameterized by w, by minimizing the expected loss with respect to the label model over n unlabeled data points. Let l(w, X, Y) = (1/t) Σ_{s=1}^t l_s(w, X, Y_s) be a bounded multi-task loss function such that without loss of generality l(w, X, Y) ≤ 1; then we minimize the empirical noise aware loss:

    ŵ = argmin_w (1/n) Σ_{i=1}^n E_{Ỹ ∼ P_μ̂(·|λ)} [ l(w, X_i, Ỹ) ],        (6)

and let w̃ be the w that minimizes the true noise-aware loss. This minimization can be performed by standard methods and is not the focus of our paper; let the solution ŵ satisfy E[||ŵ − w̃||^2] ≤ γ.
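For intuition, when l is a cross-entropy loss the inner expectation in (6) reduces to a soft cross-entropy against the label model's probabilistic labels. Below is a minimal PyTorch sketch of this noise-aware objective under our own naming; it is an illustration of the loss in (6), not the exact training code used in the experiments.

```python
import torch.nn.functional as F

def noise_aware_loss(logits, y_prob):
    """Expected cross-entropy under the label model's distribution for one task.

    logits: (batch, k) end model scores.
    y_prob: (batch, k) probabilistic labels from the label model.
    """
    log_p = F.log_softmax(logits, dim=1)
    return -(y_prob * log_p).sum(dim=1).mean()

def multi_task_noise_aware_loss(logits_per_task, y_prob_per_task):
    """The multi-task loss in (6): average the per-task noise-aware terms."""
    losses = [noise_aware_loss(lg, yp)
              for lg, yp in zip(logits_per_task, y_prob_per_task)]
    return sum(losses) / len(losses)
```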

We make several assumptions, following (Ratner et al. 2016): (1) that for some label model parameters μ*, sampling (λ, Y) ∼ P_μ*(·) is the same as sampling from the true distribution, (λ, Y) ∼ 𝒟; and (2) that the task labels Y_s are independent of the features of the end model given λ sampled from P_μ*(·), that is, the output of the optimal label model provides sufficient information to discern the true label. Then we have the following result:

Theorem 1 Let w̃ minimize the expected noise aware loss, using weak supervision source parameters μ̂ estimated with Algorithm 1. Let ŵ minimize the empirical noise aware loss with E[||ŵ − w̃||^2] ≤ γ, w* = argmin_w l(w, X, Y), and let the assumptions above hold. Then the generalization error is bounded by:

    E[l(ŵ, X, Y) − l(w*, X, Y)] ≤ γ + 4|𝒴| ||μ̂ − μ*||.

Thus, to control the generalization error, we must control ||μ̂ − μ*||, which we do in Theorem 2:

Theorem 2 Let μ̂ be an estimate of μ* produced by Algorithm 1 run over n unlabeled data points. Let a := (d_O/Σ_S + (d_O/Σ_S)^2 λ_max(K_O))^{1/2} and b := ||Σ_O^{-1}||^2 / (Σ_O^{-1})_min. Then, we have:

    E[||μ̂ − μ*||] ≤ 16(|𝒴| − 1) d_O^2 a b σ_max(M_Ω^+) √(32π/n)
                    × (3 √d_O a λ_min^{-1}(Σ_O) + 1)(κ(Σ_O) + λ_min^{-1}(Σ_O)).

Interpreting the Bound We briefly explain the key terms controlling the bound in Theorem 2; more detail is found in the Appendix. Our primary result is that the estimation error scales as n^{-1/2}. Next, σ_max(M_Ω^+), the largest singular value of the pseudoinverse M_Ω^+, has a deep connection to the density of the graph G_inv. The smaller this quantity, the more information we have about G_inv, and the easier it is to estimate the accuracies. Next, λ_min(Σ_O), the smallest eigenvalue of the observed covariance matrix, reflects the conditioning of Σ_O; better conditioning yields easier estimation, and is roughly determined by how far away from random guessing the worst weak supervision source is, as well as how conditionally independent the sources are. λ_max(K_O), the largest eigenvalue of the upper-left block of the inverse covariance matrix, similarly reflects the overall conditioning of Σ. Finally, (Σ_O^{-1})_min, the smallest entry of the inverse observed matrix, reflects the smallest non-zero correlation between source accuracies; distinguishing between small correlations and independent sources requires more samples.

Extensions: Abstentions & Unipolar Sources

We briefly highlight two extensions handled by our approach which we have found empirically critical: handling abstentions, and modeling unipolar sources.

Handling Abstentions. One fundamental aspect of the weak supervision setting is that sources may abstain from labeling a data point entirely—that is, they may have incomplete and differing coverage (Ratner et al. 2018; Dalvi et al. 2013b). We can easily deal with this case by extending the coverage ranges 𝒴_{τ_i} of the sources to include the vector of all zeros, and we do so in the experiments.

Handling Unipolar Sources. Finally, we highlight the fact that our approach models class-conditional source accuracies, in particular motivated by the case we have frequently observed in practice of unipolar weak supervision sources, i.e., sources that each only label a single class or abstain. In practice, we find that users most commonly use such unipolar sources; for example, a common template for a heuristic-based weak supervision source over text is one that looks for a specific pattern, and if the pattern is present emits a specific label, else abstains. As compared to prior approaches that did not model class-conditional accuracies, e.g. (Ratner et al. 2016), we show in our experiments that we can use our class-conditional modeling approach to yield an improvement of 2.8 points in accuracy.

5 Experiments

We validate our approach on three fine-grained classification problems—entity classification, relation classification, and document classification—where weak supervision sources are available at both coarser and finer-grained levels (e.g. as in Figure 2). We evaluate the predictive accuracy of end models supervised with training data produced by several approaches, finding that our approach outperforms traditional hand-labeled supervision by 20.2 points, a baseline majority vote weak supervision approach by 6.8 points, and a prior weak supervision denoising approach (Ratner et al. 2016) that is not multi-task-aware by 4.1 points.

Datasets Each dataset consists of a large (3k-63k) amount of unlabeled training data and a small (200-350) amount of labeled data which we refer to as the development set, which we use for (a) a traditional supervision baseline, and (b) hyperparameter tuning of the end model (see Appendix). The average number of weak supervision sources per task was 13, with sources expressed as Python functions, averaging 4 lines of code and comprising a mix of pattern matching heuristics, external knowledge base or dictionary lookups, and pre-trained models. In all three cases, we choose the decomposition into sub-tasks so as to align with weak supervision sources that are either available or natural to express.

Named Entity Recognition (NER): We represent a fine-grained named entity recognition problem—tagging entity mentions in text documents—as a hierarchy of three sub-tasks over the OntoNotes dataset (Weischedel et al. 2011): Y_1 ∈ {Person, Organization}, Y_2 ∈ {Businessperson, Other Person, N/A}, Y_3 ∈ {Company, Other Org, N/A}, where again we use N/A to represent "not applicable".

Relation Extraction (RE): We represent a relation extraction problem—classifying entity-entity relation mentions in text documents—as a hierarchy of six sub-tasks which either concern labeling the subject, object, or subject-object pair of a possible or candidate relation in the TACRED dataset (Zhang et al. 2017b). For example, we might label a relation as having a Person subject, Location object, and Place-of-Residence relation type.

Medical Document Classification (Doc): We represent a radiology report triaging (i.e. document classification) problem from the OpenI dataset (National Institutes of Health 2017) as a hierarchy of three sub-tasks: Y_1 ∈ {Acute, Non-Acute}, Y_2 ∈ {Urgent, Emergent, N/A}, Y_3 ∈ {Normal, Non-Urgent, N/A}.

End Model Protocol Our goal was to test the performance of a basic multi-task end model using training labels produced by various different approaches. We use an architecture consisting of a shared bidirectional LSTM input layer with pre-trained embeddings, shared linear intermediate layers, and a separate final linear layer ("task head") for each task. Hyperparameters were selected with an initial search for each application (see Appendix), then fixed.

Core Validation We compare the accuracy of the end multi-task model trained with labels from our approach versus those from three baseline approaches (Table 1):

• Traditional Supervision [Gold (Dev)]: We train the end model using the small hand-labeled development set.

• Hierarchical Majority Vote [MV]: We use a hierarchical majority vote of the weak supervision source labels: i.e. for each data point, for each task we take the majority vote and proceed down the task tree accordingly. This procedure can be thought of as a hard decision tree, or a cascade of if-then statements as in a rule-based approach (a minimal sketch appears below).

• Data Programming [DP]: We model each task separately using the data programming approach for denoising weak supervision (Ratner et al. 2018).

In all settings, we used the same end model architecture as described above. Note that while we choose to model these problems as consisting of multiple sub-tasks, we evaluate with respect to the broad primary task of fine-grained classification (for subtask-specific scores, see Appendix).
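The hierarchical majority vote baseline can be stated in a few lines; the following is a minimal sketch under our own naming and data layout (votes stored per task, child tasks consulted only when the parent's winning label makes them applicable), not the exact baseline code:

```python
from collections import Counter

def majority_vote(votes):
    """Most common non-abstain vote (0 = abstain), or 0 if all sources abstain."""
    counts = Counter(v for v in votes if v != 0)
    return counts.most_common(1)[0][0] if counts else 0

def hierarchical_majority_vote(task_votes, task_tree, root):
    """Cascade of majority votes down the task tree.

    task_votes: dict task -> list of source votes for one data point.
    task_tree: dict (task, winning_label) -> child task to descend into, if any.
    root: the coarsest task (e.g. Person vs. Organization).
    """
    labels = {}
    task = root
    while task is not None:
        labels[task] = majority_vote(task_votes.get(task, []))
        task = task_tree.get((task, labels[task]))  # descend only where applicable
    return labels
```

For the task graph of Figure 2, task_tree would map ("Y1", PERSON) to "Y2" and ("Y1", ORG) to "Y3"; abstentions at a node simply stop the cascade.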

                          NER           RE            Doc           Average
Gold (Dev)                63.7 ± 2.1    28.4 ± 2.3    62.7 ± 4.5    51.6
MV                        76.9 ± 2.6    43.9 ± 2.6    74.2 ± 1.2    65.0
DP (Ratner et al. 2016)   78.4 ± 1.2    49.0 ± 2.7    75.8 ± 0.9    67.7
MeTaL                     82.2 ± 0.8    56.7 ± 2.1    76.6 ± 0.4    71.8

Table 1: Performance Comparison of Different Supervision Approaches. We compare the micro accuracy (avg. over 10 trials) with 95% confidence intervals of an end multi-task model trained using the training labels from the hand-labeled development set (Gold Dev), hierarchical majority vote (MV), data programming (DP), and our approach (MeTaL).

We observe in Table 1 that our approach of leveraging multi-granularity weak supervision leads to large gains—20.2 points over traditional supervision with the development set, 6.8 points over hierarchical majority vote, and 4.1 points over data programming.

         # Train    LM     EM     Gain
NER       62,547    75.2   82.2   7.0
RE         9,090    55.3   57.4   2.1
Doc        2,630    75.6   76.6   1.0

Figure 6: Using the label model (LM) predictions directly versus using an end model trained on them (EM).

Figure 5: In the OntoNotes dataset, end model accuracy scales with the amount of available unlabeled data.

Ablations We examine individual factors:

Unipolar Correction: Modeling unipolar sources (Sec 4), which we find to be especially common when fine-grained tasks are involved, leads to an average gain of 2.8 points of accuracy in MeTaL performance.

Joint Task Modeling: Next, we use our algorithm to estimate the accuracies of sources for each task separately, to observe the empirical impact of modeling the multi-task setting jointly as proposed. We see average gains of 1.3 points in accuracy (see Appendix).

End Model Generalization: Though not possible in many settings, in our experiments we can directly apply the label model to make predictions. In Figure 6, we show that the end model improves performance by an average of 3.4 points in accuracy, validating that the models trained do indeed learn to generalize beyond the provided weak supervision. Moreover, the largest generalization gain of 7 points in accuracy came from the dataset with the most available unlabeled data (n=63k), demonstrating scaling consistent with the predictions of our theory (Fig. 5). This ability to leverage additional unlabeled data and more sophisticated end models is a key advantage of the weak supervision approach in practice.

6 Conclusion

We presented MeTaL, a framework for training models with weak supervision from diverse, multi-task sources having different granularities, accuracies, and correlations. We tackle the core challenge of recovering the unknown source accuracies via a scalable matrix completion-style algorithm, introduce theoretical bounds characterizing the key scaling with respect to unlabeled data, and demonstrate empirical gains on real-world datasets. In future work, we hope to learn the task relationship structure and cover a broader range of settings where labeled training data is a bottleneck.

Acknowledgements We gratefully acknowledge the support of DARPA under Nos. FA87501720095 (D3M) and FA86501827865 (SDH), NIH under No. N000141712266 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and DGE-114747 (NSF GRF), ONR under No. N000141712266 (Unifying Weak Supervision), the Moore Foundation, NXP, Xilinx, LETI-CEA, Intel, , NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, the Okawa Foundation, and American Family Insurance, the Stanford Interdisciplinary Graduate and Bio-X fellowships, the Intelligence Community Postdoctoral Fellowship, and members of the Stanford DAWN project: Intel, Microsoft, Teradata, Facebook, Google, Ant Financial, NEC, SAP, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of DARPA, NIH, ONR, or the U.S. Government.

References

Anandkumar, A.; Ge, R.; Hsu, D.; Kakade, S. M.; and Telgarsky, M. 2014. Tensor decompositions for learning latent variable models. The Journal of Machine Learning Research 15(1):2773–2832.

Bach, S. H.; He, B.; Ratner, A. J.; and Ré, C. 2017. Learning the structure of generative models without labeled data.

Blum, A., and Mitchell, T. 1998. Combining labeled and unlabeled data with co-training.

Caruana, R. 1993. Multitask learning: A knowledge-based source of inductive bias.

Chandrasekaran, V.; Parrilo, P. A.; and Willsky, A. S. 2010. Latent variable graphical model selection via convex optimization. In Communication, Control, and Computing (Allerton), 2010 48th Annual Allerton Conference on, 1610–1613. IEEE.

Craven, M., and Kumlien, J. 1999. Constructing biological knowledge bases by extracting information from text sources.

Dalvi, N.; Dasgupta, A.; Kumar, R.; and Rastogi, V. 2013a. Aggregating crowdsourced binary ratings.

Dalvi, N.; Dasgupta, A.; Kumar, R.; and Rastogi, V. 2013b. Aggregating crowdsourced binary ratings.

Dawid, A. P., and Skene, A. M. 1979. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics 20–28.

Ghosh, A.; Kale, S.; and McAfee, P. 2011. Who moderates the moderators?: Crowdsourcing abuse detection in user-generated content.

Guan, M. Y.; Gulshan, V.; Dai, A. M.; and Hinton, G. E. 2017. Who said what: Modeling individual labelers improves classification. arXiv preprint arXiv:1703.08774.

Gupta, S., and Manning, C. D. 2014. Improved pattern learning for bootstrapped entity extraction.

Hoffmann, R.; Zhang, C.; Ling, X.; Zettlemoyer, L.; and Weld, D. S. 2011. Knowledge-based weak supervision for information extraction of overlapping relations.

Karger, D. R.; Oh, S.; and Shah, D. 2011. Iterative learning for reliable crowdsourcing systems.

Khetan, A.; Lipton, Z. C.; and Anandkumar, A. 2017. Learning from noisy singly-labeled data. arXiv preprint arXiv:1712.04577.

Liang, P.; Jordan, M. I.; and Klein, D. 2009. Learning from measurements in exponential families.

Loh, P.-L., and Wainwright, M. J. 2012. Structure estimation for discrete graphical models: Generalized covariance matrices and their inverses.

Mann, G. S., and McCallum, A. 2010. Generalized expectation criteria for semi-supervised learning with weakly labeled data. JMLR 11(Feb):955–984.

Mintz, M.; Bills, S.; Snow, R.; and Jurafsky, D. 2009. Distant supervision for relation extraction without labeled data.

National Institutes of Health. 2017. Open-i.

Platanios, E.; Poon, H.; Mitchell, T. M.; and Horvitz, E. J. 2017. Estimating accuracy from unlabeled data: A probabilistic logic approach.

Ratner, A. J.; De Sa, C. M.; Wu, S.; Selsam, D.; and Ré, C. 2016. Data programming: Creating large training sets, quickly.

Ratner, A.; Bach, S.; Ehrenberg, H.; Fries, J.; Wu, S.; and Ré, C. 2018. Snorkel: Rapid training data creation with weak supervision.

Ruder, S. 2017. An overview of multi-task learning in deep neural networks. CoRR abs/1706.05098.

Søgaard, A., and Goldberg, Y. 2016. Deep multi-task learning with low level tasks supervised at lower layers.

Takamatsu, S.; Sato, I.; and Nakagawa, H. 2012. Reducing wrong labels in distant supervision for relation extraction.

Tropp, J. A. 2015. An introduction to matrix concentration inequalities. Foundations and Trends in Machine Learning 8(1-2):1–230.

Varma, P.; He, B. D.; Bajaj, P.; Khandwala, N.; Banerjee, I.; Rubin, D.; and Ré, C. 2017. Inferring generative model structure with static analysis.

Weischedel, R.; Hovy, E.; Marcus, M.; Palmer, M.; Belvin, R.; Pradhan, S.; Ramshaw, L.; and Xue, N. 2011. OntoNotes: A large training corpus for enhanced processing. Handbook of Natural Language Processing and Machine Translation. Springer.

Xiao, T.; Xia, T.; Yang, Y.; Huang, C.; and Wang, X. 2015. Learning from massive noisy labeled data for image classification.

Zaidan, O. F., and Eisner, J. 2008. Modeling annotators: A generative approach to learning from annotator rationales.

Zhang, Y.; Chen, X.; Zhou, D.; and Jordan, M. I. 2014. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing.

Zhang, C.; Ré, C.; Cafarella, M.; De Sa, C.; Ratner, A.; Shin, J.; Wang, F.; and Wu, S. 2017a. DeepDive: Declarative knowledge base construction. Commun. ACM 60(5):93–102.

Zhang, Y.; Zhong, V.; Chen, D.; Angeli, G.; and Manning, C. D. 2017b. Position-aware attention and supervised data improve slot filling.
