The Importance of Calibration for Estimating Proportions from Annotations

The Importance of Calibration for Estimating Proportions from Annotations Dallas Card Noah A. Smith Machine Learning Department Paul G. Allen School of CSE Carnegie Mellon University University of Washington Pittsburgh, PA, 15213, USA Seattle, WA, 98195, USA [email protected] [email protected] Abstract Note that the goal would not be to identify individ- ual instances, but rather to estimate a proportion, Estimating label proportions in a target cor- as a way of measuring the prevalence of a social pus is a type of measurement that is useful for answering certain types of social-scientific phenomenon. Although we assume that trained questions. While past work has described a annotators could recognize this phenomenon with number of relevant approaches, nearly all are some acceptable level of agreement, relying solely based on an assumption which we argue is on manual annotation would restrict the number invalid for many problems, particularly when of messages that could be considered, and would dealing with human annotations. In this paper, limit the analysis to the messages available at the we identify and differentiate between two rele- time of annotation.1 vant data generating scenarios (intrinsic vs. extrinsic labels), introduce a simple but novel We thus treat proportion estimation as a mea- method which emphasizes the importance of surement problem, and seek a way to train an in- calibration, and then analyze and experimen- strument from a limited number of human anno- tally validate the appropriateness of various tations to measure label proportions in an unanno- methods for each of the two scenarios. tated target corpus. This problem can be cast within a supervised 1 Introduction learning framework, and past work has demon- A methodological tool often used in the social sci- strated that it is possible to improve upon a na¨ıve ences and humanities (and practical settings like classification-based approach, even without access journalism) is content analysis – the manual cat- to any labeled data from the target corpus (For- egorization of pieces of text into a set of cate- man, 2005, 2008; Bella et al., 2010; Hopkins and gories which have been developed to answer a sub- King, 2010; Esuli and Sebastiani, 2015). How- stantive research question (Krippendorff, 2012). ever, as we argue ( 2), most of this work is based § Automated content analysis holds great promise on a set of assumptions that we believe are in- for augmenting the efforts of human annotators valid in a significant portion of text-based research (O’Connor et al., 2011; Grimmer and Stewart, projects in the social sciences and humanities. 2013). While this task bears similarity to text cat- Our contributions in this paper include: egorization problems such as sentiment analysis, identifying two different data-generating sce- the quantity of real interest is often the proportion • narios for text data (intrinsic vs. extrinsic la- of documents in a dataset that should receive each bels) and and establishing their importance to label (Hopkins and King, 2010). This paper tack- the problem of estimating proportions ( 2); les the problem of estimating label proportions in § a target corpus based on a small sample of human analyzing which methods are suitable for • annotated data. each setting, and proposing a simple alterna- As an example, consider the hypothetical ques- tive approach for extrinsic labels ( 3); and tion (not explored in this work) of whether hate § speech is increasingly prevalent in social media an empirical comparison of methods that val- • posts in recent years. “Hate speech” is a difficult- idates our analysis ( 4). § to-define category only revealed (at least initially) 1For additional examples see Grimmer et al.(2012), Hop- through human judgments (Davidson et al., 2017). kins and King(2010), and references therein. 1636 Proceedings of NAACL-HLT 2018, pages 1636–1646 New Orleans, Louisiana, June 1 - 6, 2018. c 2018 Association for Computational Linguistics Complicating matters somewhat is the fact that controlled real-world process, such as users rat- annotation may take place before the entire col- ing a review’s helpfulness. Note that our setup lection is available, so that the subset of instances does include the special case in which true gold- that are manually annotated may represent a bi- standard labels are available for each instance ased sample ( 2). Because this is so frequently the (such as the authors of documents in an author- § case, all of the results in this paper assume that we ship attribution problem). In such a case, is de- A must confront the challenges of transfer learning terministic (assuming unique inputs). or domain adaptation. (The simpler case, where Given that our objective is to mimic the annota- we can sample from the true population of inter- tion process, we seek to estimate the proportion of est, is revisited in 5.) documents in the target corpus expected to be cat- § egorized into each of the K categories, if we had 2 Problem Definition an unlimited budget and full access to the target Our setup is similar to that faced in transfer learn- corpus at the time of annotation. That is, we wish (T ) ing, and we use similar terminology (Pan and to estimate q , which we define as: Yang, 2010; Weiss et al., 2016). We assume that (T ) 1 NT (T ) q(y = k X ) , p(yi = k x ). we have a source and a target corpus, comprised | NT i=1 | i (2) of NS and NT documents respectively, the latter P of which are not available for annotation. We will Given a set of documents sampled from the represent each corpus as a set of documents, i.e., source corpus and L applications of the annotation (S) (S) X(S) = x , ..., x , and similarly for X(T ). function, we can obtain, at some cost, a labeled h 1 Ns i We further assume that we have a set of K mu- training corpus of L documents, i.e., D(train) = tually exclusive categories, = 1,...,K , and (x , y ),..., (x , y ) . Because the source and Y { } h 1 1 L L i that we wish to estimate the proportion of doc- target corpora are not in general drawn from the uments in the target corpus that belong to each same distribution, we seek to make explicit our as- category. These would typically correspond to a sumptions about how they differ.3 Past literature quantity we wish to measure, such as what frac- on transfer learning has identified several patterns tion of news articles frame a policy issue in a par- of dataset shift (Storkey, 2009). Here we focus ticular way, what fraction of product reviews are on two particularly important cases, linking them considered helpful, or what fraction of social me- to the relevant data generating processes, and ana- dia messages convey positive sentiment. Gener- lyze their relevance to estimating proportions. ally speaking, these categories will be designed based on theoretical assumptions, an understand- Two kinds of distributional shift. There are ing of the design of the platform that produced the two natural assumptions we could make about data, and/or initial exploration of the data itself. what is constant between the two corpora. We In idealized text classification scenarios, it is could assume that there is no change in the dis- conventional to assume training data with already- tribution of text given a document’s label, that is p(S)(x y) = p(T )(x y). Alternately, we could assigned gold-standard labels. Here, we are in- | | assume that there is no change in the distribution terested in scenarios where we must generate our (S) (T ) 2 of labels given text, i.e., p (y x) = p (y labels via an annotation process. Specifically, as- | | sume that we have some annotation function, , x). The former is assumed in the case of prior A which produces a distribution over the K mutu- probability shift, where we assume that p(y) differs but p(x y) is constant, and the later is as- ally exclusive labels, conditional on text. Given | sumed in the case of covariate shift, where we as- a document, xi, the annotation process samples a sume that p(x) differs but p(y x) is constant label from the annotation function, defined as: | (Storkey, 2009). (x , k) p(y = k x ). (1) These two assumptions correspond to two fun- A i , i | i damentally different types of scenarios that we Typically, the annotation function would repre- need to consider, which are summarized in Table sent the behavior of a human annotator (or group 1. The first is where we are dealing with what we of annotators), but it could also represent a less 3Clearly, if we make no assumptions about how the source 2This could include gathering multiple independent anno- and target distributions are related, there is no guarantee that tations per instance, but we will typically assume only one. supervised learning will work (Ben-David et al., 2012). 1637 Label type Intrinsic Extrinsic by a change in the underlying distribution of text, Data generating x p(x y) y p(y x) process ∼ | ∼ | p(x). With intrinsic labels, by contrast, it may be Assumed to differ the case that p(x y) is the same for the source p(y) p(x) | across domains and the target, assuming there are no additional Assumed constant p(x y) p(y x) across domains | | factors influencing the generation of text. In that Corresponding Prior Covariate case, a shift in the distribution of features would distributional probability shift shift shift be fully explained by a difference in the underlying label proportions. Table 1: Data generating scenarios and corresponding The idea that there are different data generat- distributional properties.

Load more