The Importance of Calibration for Estimating Proportions from Annotations

Dallas Card
Carnegie Mellon University
Pittsburgh, PA 15213, USA
[email protected]

Noah A. Smith
Paul G. Allen School of CSE
University of Washington
Seattle, WA 98195, USA
[email protected]

Abstract

Estimating label proportions in a target corpus is a type of measurement that is useful for answering certain types of social-scientific questions. While past work has described a number of relevant approaches, nearly all are based on an assumption which we argue is invalid for many problems, particularly when dealing with human annotations. In this paper, we identify and differentiate between two relevant data generating scenarios (intrinsic vs. extrinsic labels), introduce a simple but novel method which emphasizes the importance of calibration, and then analyze and experimentally validate the appropriateness of various methods for each of the two scenarios.

1 Introduction

A methodological tool often used in the social sciences and humanities (and practical settings like journalism) is content analysis – the manual categorization of pieces of text into a set of categories which have been developed to answer a substantive research question (Krippendorff, 2012). Automated content analysis holds great promise for augmenting the efforts of human annotators (O'Connor et al., 2011; Grimmer and Stewart, 2013). While this task bears similarity to text categorization problems such as sentiment analysis, the quantity of real interest is often the proportion of documents in a dataset that should receive each label (Hopkins and King, 2010). This paper tackles the problem of estimating label proportions in a target corpus based on a small sample of human-annotated data.

As an example, consider the hypothetical question (not explored in this work) of whether hate speech is increasingly prevalent in social media posts in recent years. "Hate speech" is a difficult-to-define category only revealed (at least initially) through human judgments (Davidson et al., 2017). Note that the goal would not be to identify individual instances, but rather to estimate a proportion, as a way of measuring the prevalence of a social phenomenon. Although we assume that trained annotators could recognize this phenomenon with some acceptable level of agreement, relying solely on manual annotation would restrict the number of messages that could be considered, and would limit the analysis to the messages available at the time of annotation.1

1 For additional examples see Grimmer et al. (2012), Hopkins and King (2010), and references therein.

We thus treat proportion estimation as a measurement problem, and seek a way to train an instrument from a limited number of human annotations to measure label proportions in an unannotated target corpus.

This problem can be cast within a supervised learning framework, and past work has demonstrated that it is possible to improve upon a naïve classification-based approach, even without access to any labeled data from the target corpus (Forman, 2005, 2008; Bella et al., 2010; Hopkins and King, 2010; Esuli and Sebastiani, 2015). However, as we argue (§2), most of this work is based on a set of assumptions that we believe are invalid in a significant portion of text-based research projects in the social sciences and humanities.

Our contributions in this paper include:

• identifying two different data-generating scenarios for text data (intrinsic vs. extrinsic labels) and establishing their importance to the problem of estimating proportions (§2);

• analyzing which methods are suitable for each setting, and proposing a simple alternative approach for extrinsic labels (§3); and

• an empirical comparison of methods that validates our analysis (§4).


Complicating matters somewhat is the fact that annotation may take place before the entire collection is available, so that the subset of instances that are manually annotated may represent a biased sample (§2). Because this is so frequently the case, all of the results in this paper assume that we must confront the challenges of transfer learning or domain adaptation. (The simpler case, where we can sample from the true population of interest, is revisited in §5.)

2 Problem Definition

Our setup is similar to that faced in transfer learning, and we use similar terminology (Pan and Yang, 2010; Weiss et al., 2016). We assume that we have a source and a target corpus, comprised of N_S and N_T documents respectively, the latter of which are not available for annotation. We will represent each corpus as a set of documents, i.e., X^(S) = <x_1^(S), ..., x_{N_S}^(S)>, and similarly for X^(T). We further assume that we have a set of K mutually exclusive categories, Y = {1, ..., K}, and that we wish to estimate the proportion of documents in the target corpus that belong to each category. These would typically correspond to a quantity we wish to measure, such as what fraction of news articles frame a policy issue in a particular way, what fraction of product reviews are considered helpful, or what fraction of social media messages convey positive sentiment. Generally speaking, these categories will be designed based on theoretical assumptions, an understanding of the design of the platform that produced the data, and/or initial exploration of the data itself.

In idealized text classification scenarios, it is conventional to assume training data with already-assigned gold-standard labels. Here, we are interested in scenarios where we must generate our labels via an annotation process.2 Specifically, assume that we have some annotation function, A, which produces a distribution over the K mutually exclusive labels, conditional on text. Given a document, x_i, the annotation process samples a label from the annotation function, defined as:

\mathcal{A}(x_i, k) \triangleq p(y_i = k \mid x_i). \qquad (1)

2 This could include gathering multiple independent annotations per instance, but we will typically assume only one.

Typically, the annotation function would represent the behavior of a human annotator (or group of annotators), but it could also represent a less controlled real-world process, such as users rating a review's helpfulness. Note that our setup does include the special case in which true gold-standard labels are available for each instance (such as the authors of documents in an authorship attribution problem). In such a case, A is deterministic (assuming unique inputs).

Given that our objective is to mimic the annotation process, we seek to estimate the proportion of documents in the target corpus expected to be categorized into each of the K categories, if we had an unlimited budget and full access to the target corpus at the time of annotation. That is, we wish to estimate q^(T), which we define as:

q^{(T)}(y = k \mid X^{(T)}) \triangleq \frac{1}{N_T} \sum_{i=1}^{N_T} p(y_i = k \mid x_i^{(T)}). \qquad (2)

Given a set of documents sampled from the source corpus and L applications of the annotation function, we can obtain, at some cost, a labeled training corpus of L documents, i.e., D^(train) = <(x_1, y_1), ..., (x_L, y_L)>. Because the source and target corpora are not in general drawn from the same distribution, we seek to make explicit our assumptions about how they differ.3 Past literature on transfer learning has identified several patterns of dataset shift (Storkey, 2009). Here we focus on two particularly important cases, linking them to the relevant data generating processes, and analyze their relevance to estimating proportions.

3 Clearly, if we make no assumptions about how the source and target distributions are related, there is no guarantee that learning will work (Ben-David et al., 2012).
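To make the estimation target concrete, the following is a minimal sketch (ours, not the authors' code) of Eq. (2): given per-document annotation probabilities for the target corpus, the target quantity is just their average. In practice these probabilities are unknown, which is exactly why an approximation to the annotation function must be learned; the array below is a hypothetical stand-in.

```python
import numpy as np

def expected_label_proportions(annotation_probs: np.ndarray) -> np.ndarray:
    """Eq. (2): q^(T) is the mean over target documents of p(y_i = k | x_i),
    i.e., the label proportions we would expect if the annotation function
    were applied to every document in the target corpus.

    annotation_probs: array of shape (N_T, K); row i holds the (hypothetical)
    annotation distribution over the K labels for document i.
    """
    return annotation_probs.mean(axis=0)

# Toy illustration with K = 2 and N_T = 4 (values are made up).
probs = np.array([[0.9, 0.1],
                  [0.6, 0.4],
                  [0.2, 0.8],
                  [0.7, 0.3]])
print(expected_label_proportions(probs))  # -> [0.6, 0.4]
```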

Two kinds of distributional shift. There are two natural assumptions we could make about what is constant between the two corpora. We could assume that there is no change in the distribution of text given a document's label, that is, p^(S)(x|y) = p^(T)(x|y). Alternately, we could assume that there is no change in the distribution of labels given text, i.e., p^(S)(y|x) = p^(T)(y|x). The former is assumed in the case of prior probability shift, where we assume that p(y) differs but p(x|y) is constant, and the latter is assumed in the case of covariate shift, where we assume that p(x) differs but p(y|x) is constant (Storkey, 2009).

These two assumptions correspond to two fundamentally different types of scenarios that we need to consider, which are summarized in Table 1.

Label type                          | Intrinsic               | Extrinsic
Data generating process             | x ~ p(x|y)              | y ~ p(y|x)
Assumed to differ across domains    | p(y)                    | p(x)
Assumed constant across domains     | p(x|y)                  | p(y|x)
Corresponding distributional shift  | Prior probability shift | Covariate shift

Table 1: Data generating scenarios and corresponding distributional properties.

The first is where we are dealing with what we will call intrinsic labels, that is, labels which are inherent to each instance, and which in some sense precede and predict the generation of the text of that instance. A classic example of this scenario is the case of authorship attribution (e.g., Mosteller and Wallace, 1964), in which different authors are assumed to have different propensities to use different styles and vocabularies. The identity of the author of a document is arguably an intrinsic property of that document, and it is easy to see a text as having been generated conditional on its author.

The contrasting scenario is what we will refer to as extrinsic labels; this scenario is our primary interest. We assume here that the labels are not inherent in the documents, but rather have been externally generated, conditional on the text as a stimulus to some behavioral process.4 We argue that this is the relevant assumption for most annotation-based projects in the social sciences, where the categories of interest do not correspond to pre-existing categories that might have existed in the minds of authors before writing, or affected the writing process. Rather, these are theorized categories that have been developed specifically to analyze or measure some aspect of the document's effect that is of interest to the researcher.

4 Fong and Grimmer (2016) also consider this process in attempting to identify the causal effects of texts.

We won't always know the true distributional properties of our datasets, but distinguishing between intrinsic and extrinsic labels provides a guide. The critical point is that these two different labeling scenarios have different implications for robustness to distributional shift. In the case of extrinsic labels, especially when working with trained annotators, it is reasonable to assume that the behavior of the annotation function is determined purely by the text, such that p(y|x) is unchanged between source and target, and any change in label proportions is explained by a change in the underlying distribution of text, p(x). With intrinsic labels, by contrast, it may be the case that p(x|y) is the same for the source and the target, assuming there are no additional factors influencing the generation of text. In that case, a shift in the distribution of features would be fully explained by a difference in the underlying label proportions.

The idea that there are different data generating processes is obviously not new.5 What is novel here, however, is asking how these different assumptions affect the estimation of proportions. Virtually all past work on estimating proportions has only considered prior probability shift, assuming that p(x|y) is constant.6 Existing methods take advantage of this assumption, and can be shown empirically to work well when it is satisfied (e.g., through artificial modification of real datasets to alter label proportions in a corpus). We expect them to fail, however, in the case of extrinsic annotations, as there is no reason to think that the required assumption should necessarily hold.

5 Peters et al. (2014) describe these, somewhat confusingly, as causal and anti-causal problems.

6 For example, Hopkins and King (2010) argue that bloggers first decide on the sentiment they wish to convey and then write a blog post conditional on that sentiment.

By contrast, the problem of covariate shift is in some sense less of a problem because we directly observe X^(T). Since the annotation function is assumed to be unchanging, we could perfectly predict the expected label proportions in the target corpus if we could learn the annotation function using labeled data from the source corpus. The problem thus becomes how to learn a well-calibrated approximation of the annotation function from a limited amount of labeled data.

3 Methods

Given a labeled training set and a target corpus, the naïve approach is to train a classifier through any conventional means, predict labels on the target corpus, and return the relative prevalence of predicted labels. Following Forman (2005), we refer to this approach as classify and count (CC). If using a probabilistic classifier, averaging the predicted posterior probabilities rather than predicted labels will be referred to as probabilistic classify and count (PCC; Bella et al., 2010).
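As a concrete illustration of the difference between CC and PCC, here is a minimal sketch (not the authors' code) using scikit-learn; the random feature matrices are hypothetical stand-ins for the n-gram features and classifier settings described later.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def classify_and_count(clf, X_target):
    """CC: predict hard labels on the target corpus and return the
    fraction predicted to be positive."""
    return clf.predict(X_target).mean()

def probabilistic_classify_and_count(clf, X_target):
    """PCC: average the predicted posterior probabilities of the
    positive class instead of hard labels."""
    return clf.predict_proba(X_target)[:, 1].mean()

# Hypothetical data: X_source/y_source would be labeled source documents,
# X_target the unannotated target corpus (e.g., n-gram count vectors).
rng = np.random.default_rng(0)
X_source = rng.normal(size=(200, 50))
y_source = (X_source[:, 0] + rng.normal(size=200) > 0).astype(int)
X_target = rng.normal(size=(500, 50))

clf = LogisticRegression(max_iter=1000).fit(X_source, y_source)
print("CC estimate: ", classify_and_count(clf, X_target))
print("PCC estimate:", probabilistic_classify_and_count(clf, X_target))
```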

Both approaches can fail, however. In the case of intrinsic labels, this is because these approaches will not account for the shift in prior label probability, p(y), which is assumed to have occurred (Hopkins and King, 2010). In the case of covariate shift, the difference in p(x) will result in a model that is not optimal (in terms of classification performance) for the target domain. In both cases, there is also the problem of classifier bias or miscalibration. Particularly in the case of unbalanced labels, a standard classifier is likely to be biased, overestimating the probability of the more common labels, and vice versa (Zhao et al., 2017). Here we present a simple but novel method for extrinsic labels, followed by a number of baseline approaches against which we will compare. (See supplementary material for additional details.)

3.1 Proposed method: calibrated probabilistic classify and count (PCCcal)

One simple solution, which we propose here, is to attempt to train a well-calibrated classifier. To be clear, calibration refers to the long-run accuracy of predicted probabilities. That is, a probabilistic classifier, h_θ(x), is well calibrated at the level µ if, among all instances for which the classifier predicts class k with a probability of µ, the proportion that are truly assigned to class k is also equal to µ.7

7 For example, a weather forecaster will be well-calibrated if it rains on 60% of days for which the forecaster predicted a 60% chance of rain, etc.

It has previously been shown (DeGroot and Fienberg, 1983; Bröcker, 2009) that any proper scoring rule (e.g., cross entropy, Brier score, etc.) can be factored into two components representing calibration and refinement, the latter of which effectively measures how close predicted probabilities are to zero or one. Minimizing a corresponding loss function thus involves a trade-off between these two components.

Optimizing only for calibration is not helpful, as a trivial solution is to simply predict a probability distribution equal to the observed label proportions in the training data for all instances (which is perfectly calibrated on the labeled sample). The alternative we propose here is to train a classifier using a typical objective (here, regularized log loss) but use calibration on held-out data as a criterion for model selection, i.e., when we tune hyperparameters via cross validation. We refer to this method as calibrated PCC (PCCcal). Specifically, we select regularization strength via grid search, choosing the value that leads to the lowest average calibration error across training / held-out splits. Of course, other hyperparameters could be included in model selection as well.

To estimate calibration error (CE) during cross-validation, we use an approximation due to Nguyen and O'Connor (2015), adaptive binning. In the case of binary labels, this is computed as:

\mathrm{CE} \triangleq \frac{1}{B} \sum_{j=1}^{B} \left[ \frac{1}{|\mathcal{B}_j|} \sum_{i \in \mathcal{B}_j} \left( y_i - p_\theta(x_i) \right) \right]^2, \qquad (3)

using B bins, where bin B_j contains instances for which p_θ(x_i) are in the j-th quantile, and p_θ(x_i) is the predicted probability of a positive label for instance i. For added robustness, we take the average of CE for B ∈ {3, 4, 5, 6, 7}.
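A minimal sketch of this adaptive-binning calibration error (our reading of Eq. (3), not the authors' released code), averaged over several bin counts as described above:

```python
import numpy as np

def calibration_error(y_true, p_pred, n_bins):
    """Adaptive-binning CE (Eq. 3): split instances into quantile bins of the
    predicted probability, then average the squared difference between the
    mean label and the mean prediction within each bin."""
    order = np.argsort(p_pred)
    bins = np.array_split(order, n_bins)  # roughly equal-sized quantile bins
    errors = [(y_true[idx].mean() - p_pred[idx].mean()) ** 2 for idx in bins]
    return float(np.mean(errors))

def average_calibration_error(y_true, p_pred, bin_counts=(3, 4, 5, 6, 7)):
    """Average CE over several bin counts, as suggested above for robustness."""
    return float(np.mean([calibration_error(y_true, p_pred, b) for b in bin_counts]))
```

In PCCcal, a quantity like this, computed on held-out folds, would replace classification performance as the criterion for choosing the regularization strength.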
In our experiments, we consider two variants of PCC: the first, PCCF1, which is a baseline, is tuned conventionally for classification performance, whereas the other (PCCcal) is tuned for calibration, as measured using CE, but is otherwise identically trained. As a base classifier we make use of l1-regularized logistic regression, operating on n-gram features.8

8 More complex models could be considered, but we use logistic regression because it is a well-understood and widely applicable model that has been shown to be relatively well-calibrated in general (Niculescu-Mizil and Caruana, 2005).

3.2 Existing methods appropriate for extrinsic labels

The idea of extrinsic labels has not been previously considered by past work on estimating proportions, but it is closely related to the problems of calibration and covariate shift. Here we briefly summarize two representative methods, which we consider as baselines (see supplementary material for details).

Platt scaling. One approach to calibration is to train a model using conventional methods and to then learn a secondary calibration model. One of the most common and successful variations on this approach is Platt scaling, which learns a logistic regression classifier on held-out training data, taking the scores from the primary classifier as input. This model is then applied to the scores returned by the primary classifier on the target corpus (Platt, 1999). To estimate proportions, the predicted probabilities are then averaged, as in PCC.
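A minimal sketch of this two-stage procedure (assuming a held-out split and a scikit-learn style primary classifier with a decision_function; not the exact implementation used in the paper):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def platt_scaled_proportion(primary, X_heldout, y_heldout, X_target):
    """Fit a secondary logistic regression on the primary classifier's
    held-out scores (Platt scaling), then average the calibrated
    probabilities on the target corpus, as in PCC."""
    s_heldout = primary.decision_function(X_heldout).reshape(-1, 1)
    s_target = primary.decision_function(X_target).reshape(-1, 1)
    calibrator = LogisticRegression(max_iter=1000).fit(s_heldout, y_heldout)
    return calibrator.predict_proba(s_target)[:, 1].mean()
```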

Reweighting for covariate shift. Although they are not typically thought of in the context of estimating proportions, several methods have been proposed to deal directly with the problem of covariate shift, including kernel mean matching and its extensions (Huang et al., 2006; Sugiyama et al., 2011). Here, we consider the two-stage method from Bickel et al. (2009), which uses a logistic regression model to distinguish between source and target domains, and then uses the probabilities from this model to re-weight labeled training instances, to more heavily favor those that are representative of the target domain. The appeal of this method is that all unlabeled data can be used to estimate this shift.

3.3 Existing methods appropriate for intrinsic labels

As previously mentioned, virtually all of the past work on estimating proportions makes the assumption that p(x|y) is constant between source and target. Under this assumption, it can be shown that p(y^(θ) = j | y = k) is also constant for all j and k, where y^(θ) is the predicted label from h_θ, and y is the true (intrinsic) label. If these values were known, then the label proportions in the target corpus could be found by taking the model's estimate of label proportions in the target corpus (CC), and then solving a linear system of equations as a post-classification correction. Although a number of variations on this model have been proposed, all are based on the same assumption, thus we take a method known as adjusted classify and count (ACC) as an exemplar, which directly estimates the relevant quantities using a confusion matrix (Forman, 2005). In the case of binary classification, this reduces to:

\hat{q}_{\mathrm{ACC}}(y = 1 \mid X^{(T)}) = \frac{\frac{1}{N_T} \sum_{i=1}^{N_T} y_i^{(\theta)} - \mathrm{FPR}}{\mathrm{TPR} - \mathrm{FPR}}, \qquad (4)

where FPR = p̂(y^(θ) = 1 | y = 0) and TPR = p̂(y^(θ) = 1 | y = 1) are both estimated using held-out data.
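A minimal sketch of the ACC correction in the binary case (Eq. 4), with TPR and FPR assumed to have been estimated on held-out data; clipping the result to [0, 1] is our own addition and is not part of the equation above.

```python
import numpy as np

def adjusted_classify_and_count(y_pred_target, tpr, fpr):
    """ACC (Eq. 4): correct the raw classify-and-count estimate using the
    classifier's true- and false-positive rates from held-out data."""
    cc_estimate = np.mean(y_pred_target)           # fraction predicted positive
    corrected = (cc_estimate - fpr) / (tpr - fpr)  # post-classification correction
    return float(np.clip(corrected, 0.0, 1.0))     # clipping is our addition

# Hypothetical numbers: 40% predicted positive, TPR = 0.8, FPR = 0.1.
preds = np.array([1, 0, 1, 0, 0, 1, 0, 0, 1, 0])
print(adjusted_classify_and_count(preds, tpr=0.8, fpr=0.1))  # ~0.43
```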
4 Experiments

For our experiments, we focus on the case of binary classification where the difference between the source and target corpora results from a difference in time—that is, the training documents are sampled from one time period, and the goal is to estimate label proportions on documents from a future time period. We include examples of both intrinsic and extrinsic labels to demonstrate the importance of this distinction to the effectiveness of different methods. As described below, we create multiple subtasks from each dataset by using different partitions of the data. In all cases, we report absolute error (AE) on the proportion of positive instances, averaged across the subtasks of each dataset.

Although we do not have access to the true annotation function, we approximate the expected label proportions in the target corpus by averaging the available labels, which should be a very close approximation when the number of available labels is large (which informed our choice of datasets for these experiments). For a single subtask, the absolute error is thus evaluated as

\mathrm{AE} = \left| \hat{q}(y = 1 \mid X^{(T)}) - \frac{1}{N_T} \sum_{i=1}^{N_T} y_i^{(T)} \right|. \qquad (5)

For all experiments, we also report the AE we would obtain from using the observed label proportions in the training sample as a prediction (labeled "Train"). Although this does not correspond to an interesting prediction (as it only says the future will always look exactly like the past), it does represent a fundamental baseline. If a method is unable to do better than this, it suggests that the method has too much measurement error to be useful.

To test for statistically significant differences between methods, we use an omnibus application of the Wilcoxon signed-rank test to compare one method against all others, including a Bonferroni correction for the total number of tests per hypothesis. With 4 datasets, each with 2 sample sizes, comparing against 6 other methods, this results in a significance threshold of approximately 0.001.

Finally, in order to connect this work with past literature on estimating proportions, we also include a side experiment with one intrinsically-labeled dataset where we have artificially modified the label proportions in the target corpus by dropping positively or negatively labeled instances in order to simulate a large prior probability shift between the source and target domains.
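A minimal sketch of this evaluation protocol (Eq. 5 plus the paired significance test), assuming per-subtask arrays of estimates and gold labels; scipy's wilcoxon provides the signed-rank test, and the Bonferroni threshold mirrors the calculation above.

```python
import numpy as np
from scipy.stats import wilcoxon

def absolute_error(q_hat, y_target):
    """AE (Eq. 5): |estimated proportion - observed proportion of positives|."""
    return abs(q_hat - float(np.mean(y_target)))

def compare_methods(ae_method_a, ae_method_b, n_tests=48):
    """Wilcoxon signed-rank test on paired per-subtask AEs, with a Bonferroni
    threshold (e.g., 4 datasets x 2 sample sizes x 6 comparisons = 48 tests,
    giving roughly 0.05 / 48, i.e., about 0.001)."""
    stat, p_value = wilcoxon(ae_method_a, ae_method_b)
    return p_value, p_value < 0.05 / n_tests
```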

4.1 Datasets

We briefly describe the datasets we have used here and provide additional details in the supplementary material. Note that although this work is primarily focused on applications in which the amount of human-annotated data is likely to be small, fair evaluation of these methods requires datasets that are large enough that we can approximate the expected label proportion in the target corpus using the available labels; as such, the following datasets were chosen so as to have a representative sample of sufficiently large intrinsically and extrinsically-labeled data, where documents were time-stamped, with label proportions that differ between time periods.

Media Frames Corpus (MFC): As a primary example of extrinsic labels, we use a dataset of several thousand news articles that have been annotated in terms of a set of broad-coverage framing dimensions (such as economics, morality, etc.). We treat annotations as indicating the presence or absence of each dimension, and consider each one as a separate sub-task. As with all datasets, we create a source and target corpus by dividing the datasets by year. Particularly for this dataset, it seems reasonable to posit that the annotation function was relatively constant between source and target, as the annotators worked without explicit knowledge of the article's date (Card et al., 2015).

Amazon reviews: As a secondary example of extrinsic labels, we make use of a subset of Amazon reviews for five different product categories, each of which has tens of thousands of reviews. For this dataset, we ignore the star rating associated with the review, and instead focus on predicting the proportion of people that would rate the review as helpful. Here we create separate subtasks for each product category by considering each pair of adjacent years as a source and target corpus, respectively (McAuley et al., 2015).

Yelp reviews: As a primary example of a large dataset with intrinsic labels, we make use of the Yelp10 dataset, treating the source location of the review as the label of interest. Specifically, we create binary classification tasks by choosing pairs of cities with approximately the same number of reviews, and again use year of publication to divide the data into source and target corpora, creating multiple subtasks per pair of cities.

Twitter sentiment: Finally, we include a Twitter sentiment analysis dataset which was collected and automatically labeled, using the presence of certain emoticons as implicit labels indicating positive or negative sentiment (with the emoticons then removed from the text). Because of the way this data was collected, and the relatively narrow time coverage, it seems plausible to treat the sentiment as an intrinsic label. As with the above datasets, we create subtasks by considering all pairs of temporally adjacent days with sufficient tweets, and treating them as paired source and target corpora, respectively (Go et al., 2009).

4.2 Results

The results on the datasets with extrinsic and intrinsic labels are presented in Figures 1 and 2, respectively.

As expected, the results differ in important ways between intrinsically and extrinsically labeled datasets, although there are some results which hold in all cases. In all settings, CC is worse on average than predicting the observed proportions in the training data (significantly worse for the Amazon and Twitter datasets), reinforcing the idea that averaging the predictions from a classifier will lead to a biased estimate of label proportions. This same finding holds for PCCF1 when the amount of labeled data is small (L = 500), suggesting that simply averaging the predicted probabilities is not reliable without a sufficiently large labeled dataset.

For the datasets with extrinsic labels, PCCcal performs best on average in all settings. For the MFC dataset, PCCcal is significantly better than all methods except Platt scaling when L = 500 and significantly better than all methods except reweighting and PCCF1 when L = 2000 (after a Bonferroni correction, as in all cases). As expected, ACC is actually worse on average than CC on the extrinsic datasets, presumably because of the mismatched assumptions. Reweighting for covariate shift offers mediocre performance in all settings, perhaps because, while it attempts to account for covariate shift, it may still suffer from miscalibration.

On the datasets with intrinsic labels, by contrast, no one method dominates the others. As expected, ACC does poorly when the amount of labeled data is small (L = 500); it does improve upon CC when L = 4000, but not by enough to do significantly better than other methods, perhaps calling into question the validity of the assumption that p(x|y) is constant in these datasets.

Surprisingly, both Platt scaling and PCCcal also offer competitive performance in the experiments with intrinsic labels. However, this is likely the case in part because the change in label proportions is relatively small from year to year (or day to day in the case of Twitter).

[Figure 1 here. Panels: MFC (L=500), MFC (L=2000), Amazon (L=500), Amazon (L=4000); each panel shows AE (x-axis, 0.0–0.3) for Train, CC, PCCF1, ACC, Reweighting, Platt, and PCCcal.]

Figure 1: Absolute error (AE) on datasets with extrinsic labels. Each dot represents the result for a single subtask, and bars show the mean. PCCcal (bottom row) performs best on average in all cases and is significantly better than most other methods on MFC.

[Figure 2 here. Panels: Yelp (L=500), Yelp (L=4000), Twitter (L=500), Twitter (L=4000); each panel shows AE (x-axis, 0.0–0.3) for Train, CC, PCCF1, ACC, Reweighting, Platt, and PCCcal.]

Figure 2: Absolute error (AE) on datasets with intrinsic labels. No method is significantly better than all others.

This is illustrated by Figure 3, which presents the results of the side experiment with artificially modified (intrinsic) label proportions using a subset of the Twitter data. These results confirm past findings, and show that ACC drastically outperforms other methods such as PCCF1 if we selectively drop instances so as to enforce a large difference in label proportions between source and target. This is the expected result, as ACC is the only method tailored to deal with prior probability shift (which is being artificially simulated). Unfortunately, its advantage is not maintained when the difference between source and target is small, which is the case for all of the naturally-occurring differences we found in the Yelp and Twitter datasets. Although past work has relied heavily on these sorts of simulated differences and artificial experiments, it is unclear whether they are a good substitute for real-world data, given that we mostly observed relatively small differences in practice.

[Figure 3 here: AE for PCCF1 and ACC as a function of the modified target label proportion (0.45–0.80).]

Figure 3: Absolute error (AE) for predictions on one day of Twitter data (L = 5000) when artificially modifying target proportions. The proportion of positive labels in the source corpus is 0.625. ACC performs significantly better given a large artificially-created difference in label proportions between source and target, but not when the difference is small.

Finally, we also tested the effect of using l2 instead of l1 regularization, but found that it tended to produce significantly worse estimates of proportions using CC and PCCF1 on the datasets with extrinsic labels, and statistically indistinguishable results using other methods, suggesting that either type of regularization could serve as a basis for PCCcal or Platt scaling.
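For reference, the following is a minimal sketch (our own illustration, not the authors' experiment code) of the kind of artificial prior-probability-shift simulation used for the side experiment in Figure 3: instances of one class are dropped from the target corpus until a desired positive proportion is reached.

```python
import numpy as np

def simulate_prior_shift(X_target, y_target, desired_positive_rate, seed=0):
    """Subsample the target corpus by dropping positively- or negatively-
    labeled instances so that the positive proportion matches the desired
    rate, simulating a large prior probability shift."""
    rng = np.random.default_rng(seed)
    pos = np.flatnonzero(y_target == 1)
    neg = np.flatnonzero(y_target == 0)
    if desired_positive_rate >= len(pos) / len(y_target):
        # Raise the positive rate by dropping negatives.
        n_neg = int(round(len(pos) * (1 - desired_positive_rate) / desired_positive_rate))
        keep = np.concatenate([pos, rng.choice(neg, size=n_neg, replace=False)])
    else:
        # Lower the positive rate by dropping positives.
        n_pos = int(round(len(neg) * desired_positive_rate / (1 - desired_positive_rate)))
        keep = np.concatenate([rng.choice(pos, size=n_pos, replace=False), neg])
    keep.sort()
    return X_target[keep], y_target[keep]
```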

5 Discussion

As anyone who has worked with human annotations can attest, the process of collecting annotations is messy and time-consuming, and tends to involve large numbers of disagreements (Artstein and Poesio, 2008). Although it is conventional to treat disagreements as errors on the behalf of some subset of annotators, this paper provides an alternative way of understanding these. By treating annotation as a stochastic process, conditional on text, we can explain not only the disagreements between annotators, but also the lack of self-consistency that is also sometimes observed. Although the assumption that p(y|x) does not change is clearly a simplification, it seems reasonable when working with trained annotators. Certainly this assumption seems much better justified than the conventional assumption that p(x|y) is constant, since the latter does not account for differences in the distribution of text arising from differences in subject matter, etc.

Although we have demonstrated that using a method that is appropriate to the data generating process is beneficial, it is important to note that all methods presented here can still result in relatively large errors in the worst cases. In part this is due to the difficulty of learning a conditional distribution involving high-dimensional data (such as text) with only a limited number of annotations. Even with much more annotated data, however, previously unseen features could still have a potentially large impact on future annotations. Ultimately, we should be cautious about all such predictions, and always validate where possible, by eventually sampling and annotating data from the target corpus.

What if we can sample from the target corpus? Although there are many situations in which domain adaptation is unavoidable (such as predicting public opinion from Twitter in real time with models trained on the past), at least some research projects in the humanities and social sciences might reasonably have access to all data of interest from the beginning of the project, such as when working with a historical corpus. Although a full proof is beyond the scope of this paper, in this case, the best approach is almost certainly to simply sample a random set of documents, label them using the annotation function, and report the relative prevalence of each label (Hopkins and King, 2010).

Although this simple random sampling (SRS) approach ignores the text, it is an unbiased estimator with variance that can easily be calculated, at least in approximation.9 More importantly, because it is independent of the dimensionality of the data, it works well on high-dimensional data, such as text, whereas classification-based approaches will struggle. We can illustrate this by comparing SRS and PCC in simulation. Figure 4 shows the mean AE (averaged over 200 trials) for a case in which we know the true model (including the prior on the weights, and thus the appropriate amount of regularization) and only need to learn the values of the weights. Even in this idealized scenario, SRS remains better than PCC for all values of L. (See supplementary material for details.)

9 If we were sampling with replacement, the variance in the binary case would be given by the standard formula V[\hat{q}^{\mathrm{SRS}}] = \frac{\bar{p}(1 - \bar{p})}{L}, where \bar{p} = \frac{1}{N_T} \sum_{i=1}^{N_T} p(y_i = 1 \mid x_i). This may not be possible, however, as annotators seeing a document for the second or third time would likely be affected by their own past decisions. Nevertheless, using this as the basis for a plug-in estimator should still be a reasonable approximation when the target corpus is large. Please refer to supplementary material for additional details.

[Figure 4 here: mean AE of SRS and PCC as a function of the amount of labeled data (L, log scale).]

Figure 4: Comparison of SRS and PCC in simulation when we know the true model and sample from the target corpus (averaged over 200 repetitions).

Depending on the level of accuracy required, simply sampling a few hundred documents and labeling them should be sufficient to get a reasonably reliable estimate of the overall label proportions, along with an approximate confidence interval. Unfortunately, this option is only available when we have full access to the target corpus at the time of annotation.
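A minimal sketch of the SRS estimate and the plug-in confidence interval implied by footnote 9 (the normal approximation is our choice of presentation, not something specified in the paper):

```python
import numpy as np

def srs_estimate(labels, z=1.96):
    """Estimate the positive proportion from a simple random sample of
    annotated documents, with an approximate interval based on the
    plug-in variance p(1 - p) / L from footnote 9."""
    labels = np.asarray(labels)
    p_hat = labels.mean()
    std_err = np.sqrt(p_hat * (1.0 - p_hat) / len(labels))
    return p_hat, (p_hat - z * std_err, p_hat + z * std_err)

# Example with 300 hypothetical sampled annotations, 96 of them positive.
print(srs_estimate(np.array([1] * 96 + [0] * 204)))  # ~0.32, +/- ~0.05
```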

It seems unclear that this is a good simulation of the kind of shift in distribution that one is likely to encounter in practice. An exception to this is Esuli and Sebastiani (2015), who test their method on the RCV1-v2 corpus, also splitting by time. They perform a large number of experiments, but unfortunately, nearly all of their experiments involve only a very small difference in label proportions between the source and target (with the vast majority < 0.01), which limits the generalizability of their findings. Additional methods for calibration could also be considered, such as the isotonic regression approach of Zadrozny and Elkan (2002), but in practice we would expect the results to be very similar to Platt scaling.

Another line of work has approached the problem of aggregating labels from multiple annotators (Raykar et al., 2009; Hovy et al., 2013; Yan et al., 2013). That is, if we believe that some annotators are more reliable than others, it might make sense to try to determine this in an unsupervised manner, and give more weight to the annotations from the reliable annotators. This seems particularly appropriate when dealing with uncooperative annotators, as might be encountered, for example, in crowdsourcing (Snow et al., 2008; Zhang et al., 2016). However, with a team of trained annotators, we believe that honest disagreements could contain valuable information better not ignored.

Finally, this work also relates to the problem of active learning, where the goal is to interactively choose instances to be labeled, in a way that maximizes accuracy while minimizing the total cost of annotation (Beygelzimer et al., 2009; Baldridge and Osborne, 2004; Rai et al., 2010; Settles, 2012). This is an interesting area that might be productively combined with the ideas in this paper. In general, however, the use of active learning involves additional logistical complications and does not always work better than random sampling in practice (Attenberg and Provost, 2011).

6 Conclusions

When estimating proportions in a target corpus, it is important to take seriously the data generating process. We have argued that in the case of data annotated by humans in terms of categories designed to help answer social-scientific research questions, labels should be treated as extrinsic, generated probabilistically conditional on text, rather than as a combination of correct and incorrect judgements about a label intrinsic to the document. Moreover, it is reasonable to assume in this case that p(y|x) is unchanging between source and target, and methods that aim to learn a well-calibrated classifier, such as PCCcal, are likely to perform best. By contrast, if p(x|y) is unchanging between source and target, then various correction methods from the literature on estimating proportions, such as ACC, can perform well, especially when differences are large. Ultimately, any of these methods can still result in large errors in the worst cases. As such, validation remains important when treating the estimation of proportions as a type of measurement.

Acknowledgements

We would like to thank Philip Resnik, Brendan O'Connor, anonymous reviewers, and all members of Noah's ARK for helpful comments and discussion, as well as XSEDE and Microsoft Azure for grants of computational resources used for this work.

References

Ron Artstein and Massimo Poesio. 2008. Inter-Coder Agreement for Computational Linguistics. Computational Linguistics 34(4):555–596. https://doi.org/10.1162/coli.07-034-R2.

Josh Attenberg and Foster Provost. 2011. Inactive learning?: Difficulties employing active learning in practice. SIGKDD Explorations Newsletter 12(2):36–41. https://doi.org/10.1145/1964897.1964906.

Jason Baldridge and Miles Osborne. 2004. Active learning and the total cost of annotation. In Proceedings of EMNLP.

Antonio Bella, Maria Jose Ramirez-Quintana, Jose Hernandez-Orallo, and Cesar Ferri. 2010. Quantification via probability estimators. In IEEE International Conference on Data Mining. https://doi.org/10.1109/ICDM.2010.75.

Shai Ben-David, Shai Shalev-Shwartz, and Ruth Urner. 2012. Domain adaptation – can quantity compensate for quality? Annals of Mathematics and Artificial Intelligence 70:185–202. https://doi.org/10.1007/s10472-013-9371-9.

Alina Beygelzimer, Sanjoy Dasgupta, and John Langford. 2009. Importance weighted active learning. In Proceedings of ICML. https://doi.org/10.1145/1553374.1553381.

Steffen Bickel, Michael Brückner, and Tobias Scheffer. 2009. Discriminative Learning Under Covariate Shift. Journal of Machine Learning Research 10.

Jochen Bröcker. 2009. Reliability, sufficiency, and the decomposition of proper scores. Quarterly Journal of the Royal Meteorological Society 135(643):1512–1519.

Dallas Card, Amber E. Boydstun, Justin H. Gross, Philip Resnik, and Noah A. Smith. 2015. The media frames corpus: Annotations of frames across issues. In Proceedings of ACL. https://doi.org/10.3115/v1/P15-2072.

Thomas Davidson, Dana Warmsley, Michael Macy, and Ingmar Weber. 2017. Automated hate speech detection and the problem of offensive language. In Proceedings of ICWSM.

Morris H. DeGroot and Stephen E. Fienberg. 1983. The comparison and evaluation of forecasters. The Statistician: Journal of the Institute of Statisticians 32:12–22.

Andrea Esuli and Fabrizio Sebastiani. 2015. Optimizing text quantifiers for multivariate loss functions. ACM Trans. Knowl. Discov. Data 9(4). https://doi.org/10.1145/2700406.

Christian Fong and Justin Grimmer. 2016. Discovery of treatments from text corpora. In Proceedings of ACL. https://doi.org/10.18653/v1/P16-1151.

George Forman. 2005. Counting positives accurately despite inaccurate classification. In Proceedings of the European Conference on Machine Learning.

George Forman. 2008. Quantifying counts and costs via classification. Data Mining and Knowledge Discovery 17(2):164–206. https://doi.org/10.1007/s10618-008-0097-y.

Alec Go, Richa Bhayani, and Lei Huang. 2009. Twitter sentiment classification using distant supervision. Technical report.

Justin Grimmer, Solomon Messing, and Sean J. Westwood. 2012. How words and money cultivate a personal vote: The effect of legislator credit claiming on constituent credit allocation. American Political Science Review 106(4):703–719. https://doi.org/10.1017/S0003055412000457.

Justin Grimmer and Brandon M. Stewart. 2013. Text as data: The promise and pitfalls of automatic content analysis methods for political texts. Political Analysis 21(3):267–297. https://doi.org/10.1093/pan/mps028.

Daniel Hopkins and Gary King. 2010. A method of automated nonparametric content analysis for social science. American Journal of Political Science 54(1):220–247. https://doi.org/10.1111/j.1540-5907.2009.00428.x.

Dirk Hovy, Taylor Berg-Kirkpatrick, Ashish Vaswani, and Eduard H. Hovy. 2013. Learning whom to trust with MACE. In Proceedings of NAACL.

Jiayuan Huang, Alexander J. Smola, Arthur Gretton, Karsten M. Borgwardt, and Bernhard Schölkopf. 2006. Correcting sample selection bias by unlabeled data. In Proceedings of NIPS.

Klaus Krippendorff. 2012. Content analysis: an introduction to its methodology. SAGE.

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton van den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of SIGIR. https://doi.org/10.1145/2766462.2767755.

Frederick Mosteller and David L. Wallace. 1964. Inference and Disputed Authorship. Addison-Wesley Publishing Company, Inc. https://doi.org/10.1080/01621459.1963.10500849.

Khanh Nguyen and Brendan O'Connor. 2015. Posterior calibration and exploratory analysis for natural language processing models. In Proceedings of EMNLP.

Alexandru Niculescu-Mizil and Rich Caruana. 2005. Predicting good probabilities with supervised learning. In Proceedings of ICML. https://doi.org/10.1145/1102351.1102430.

Brendan O'Connor, David Bamman, and Noah A. Smith. 2011. Computational text analysis for social science: Model assumptions and complexity. In NIPS Workshop on Computational Social Science and the Wisdom of Crowds.

S. J. Pan and Q. Yang. 2010. A survey on transfer learning. IEEE Transactions on Knowledge and Data Engineering 22(10):1345–1359. https://doi.org/10.1109/TKDE.2009.191.

Jonas Peters, Joris M. Mooij, Dominik Janzing, and Bernhard Schölkopf. 2014. Causal discovery with continuous additive noise models. Journal of Machine Learning Research 15:2009–2053.

John C. Platt. 1999. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in Large Margin Classifiers, pages 61–74.

Piyush Rai, Avishek Saha, Hal Daumé III, and Suresh Venkatasubramanian. 2010. Domain adaptation meets active learning. In Proceedings of NAACL.

Vikas C. Raykar, Shipeng Yu, Linda H. Zhao, Anna K. Jerebko, Charles Florin, Gerardo Hermosillo, Luca Bogoni, and Linda Moy. 2009. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of ICML. https://doi.org/10.1145/1553374.1553488.

Burr Settles. 2012. Active Learning. Morgan & Claypool. https://doi.org/10.2200/S00429ED1V01Y201207AIM018.

Rion Snow, Brendan O'Connor, Daniel Jurafsky, and Andrew Y. Ng. 2008. Cheap and fast—but is it good?: Evaluating non-expert annotations for natural language tasks. In Proceedings of EMNLP.

Amos J. Storkey. 2009. When training and test sets are different: Characterising learning transfer. In Joaquin Quiñonero-Candela, Masashi Sugiyama, Anton Schwaighofer, and Neil D. Lawrence, editors, Dataset Shift in Machine Learning, MIT Press, chapter 1, pages 3–28.

Masashi Sugiyama, Makoto Yamada, Paul von Bünau, Taiji Suzuki, Takafumi Kanamori, and Motoaki Kawanabe. 2011. Direct density-ratio estimation with dimensionality reduction via least-squares hetero-distributional subspace search. Neural Networks 24(2). https://doi.org/10.1016/j.neunet.2010.10.005.

Karl Weiss, Taghi M. Khoshgoftaar, and DingDing Wang. 2016. A survey of transfer learning. Journal of Big Data 3(1). https://doi.org/10.1186/s40537-016-0043-6.

Yan Yan, Rómer Rosales, Glenn Fung, Subramanian Ramanathan, and Jennifer G. Dy. 2013. Learning from multiple annotators with varying expertise. Machine Learning 95:291–327. https://doi.org/10.1007/s10994-013-5412-1.

Bianca Zadrozny and Charles Elkan. 2002. Transforming classifier scores into accurate multiclass probability estimates. In Proceedings of KDD. https://doi.org/10.1145/775047.775151.

Jing Zhang, Xindong Wu, and Victor S. Sheng. 2016. Learning from crowdsourced labeled data: a survey. Artif. Intell. Rev. 46:543–576. https://doi.org/10.1007/s10462-016-9491-9.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of EMNLP.
