Comparing Bayesian Models of Annotation

Comparing Bayesian Models of Annotation Silviu Paun1 Bob Carpenter2 Jon Chamberlain3 Dirk Hovy4 Udo Kruschwitz3 Massimo Poesio1 1School of Electronic Engineering and Computer Science, Queen Mary University of London 2Department of Statistics, Columbia University 3School of Computer Science and Electronic Engineering, University of Essex 4Department of Marketing, Bocconi University Abstract Research suggests that models of annotation can solve these problems of standard practices when The analysis of crowdsourced annotations in natural language processing is con- applied to crowdsourcing (Dawid and Skene, cerned with identifying (1) gold standard 1979; Smyth et al., 1995; Raykar et al., 2010; labels, (2) annotator accuracies and biases, Hovy et al., 2013; Passonneau and Carpenter, and (3) item difficulties and error patterns. 2014). Such probabilistic approaches allow us Traditionally, majority voting was used for to characterize the accuracy of the annotators 1, and coefficients of agreement for 2 and 3. Lately, model-based analysis of corpus and correct for their bias, as well as account- annotations have proven better at all three ing for item-level effects. They have been shown tasks. But there has been relatively little to perform better than non-probabilistic alterna- work comparing them on the same datasets. tives based on heuristic analysis or adjudication This paper aims to fill this gap by ana- (Quoc Viet Hung et al., 2013). But even though lyzing six models of annotation, covering a large number of such models has been proposed different approaches to annotator ability, (Carpenter, 2008; Whitehill et al., 2009; Raykar item difficulty, and parameter pooling (tying) across annotators and items. We et al., 2010; Hovy et al., 2013; Simpson et al., evaluate these models along four aspects: 2013; Passonneau and Carpenter, 2014; Felt et al., comparison to gold labels, predictive accu- 2015a; Kamar et al., 2015; Moreno et al., 2015, in- racy for new annotations, annotator char- ter alia), it is not immediately obvious to potential acterization, and item difficulty, using four users how these models differ or, in fact, how they datasets with varying degrees of noise in the should be applied at all. To our knowledge, the form of random (spammy) annotators. We literature comparing models of annotation is lim- conclude with guidelines for model selec- ited, focused exclusively on synthetic data (Quoc tion, application, and implementation. Viet Hung et al., 2013) or using publicly available implementations that constrain the experiments al- 1 Introduction most exclusively to binary annotations (Sheshadri The standard methodology for analyzing crowd- and Lease, 2013). sourced data in NLP is based on majority voting (selecting the label chosen by the majority of Contributions coders) and inter-annotator coefficients of agreement, such as Cohen’s κ (Artstein and Poesio, • Our selection of six widely used models 2008). However, aggregation by majority vote (Dawid and Skene, 1979; Carpenter, 2008; implicitly assumes equal expertise among the Hovy et al., 2013) covers models with vary- annotators. This assumption, though, has been re- ing degrees of complexity: pooled models, peatedly shown to be false in annotation prac- which assume all annotators share the same tice (Poesio and Artstein, 2005; Passonneau and ability; unpooled models, which model in- Carpenter, 2014; Plank et al., 2014b). Chance- dividual annotator parameters; and partially adjusted coefficients of agreement also have many pooled models, which use a hierarchical shortcomings—for example, agreements in mis- structure to let the level of pooling be dictated take, overly large chance-agreement in datasets by the data. with skewed classes, or no annotator bias correc- tion (Feinstein and Cicchetti, 1990; Passonneau • We carry out the evaluation on four datasets with and Carpenter, 2014). varying degrees of sparsity and annotator 571 Transactions of the Association for Computational Linguistics, vol. 6, pp. 571–585, 2018. Action Editor: Jordan Boyd-Graber. Submission batch: 3/2018; Revision batch: 6/2018; Published 12/2018. c 2018 Association for Computational Linguistics. Distributed under a CC-BY 4.0 license. Figure 2: Plate diagram of the Dawid and Skene model. Figure 1: Plate diagram for multinomial model. The hyperparameters are left out. 2.1 A Pooled Model accuracy in both gold-standard dependent Multinomial (MULTINOM) The simplest Bayesian and independent settings. model of annotation is the binomial model proposed in Albert and Dodd(2004) and discussed • We use fully Bayesian posterior inference to in Carpenter(2008). This model pools all annota- quantify the uncertainty in parameter esti- tors (i.e., assumes they have the same ability; see mates. Figure1). 1 The generative process is: • We provide guidelines for both model selec- • For every class k 2 f1; 2; :::; Kg: tion and implementation. – Draw class-level abilities K 2 Our findings indicate that models which in- ζk ∼ Dirichlet(1 ) clude annotator structure generally outperform • Draw class prevalence π ∼ Dirichlet(1K ) other models, though unpooled models can over- fit. Several open-source implementations of each • For every item i 2 f1; 2; :::; Ig: model type are available to users. – Draw true class ci ∼ Categorical(π) 2 Bayesian Annotation Models – For every position n 2 f1; 2; :::; Nig: ∗ Draw annotation All Bayesian models of annotation that we de- y ∼ Categorical(ζ ) scribe are generative: They provide a mechanism i;n ci to generate parameters θ characterizing the pro- 2.2 Unpooled Models cess (annotator accuracies and biases, prevalence, Dawid and Skene (D&S) The model proposed by etc.) from the prior p(θ), then generate the ob- Dawid and Skene (1979) is, to our knowledge, the served labels y from the parameters according to first model-based approach to annotation proposed the sampling distribution p(yjθ). Bayesian infer- in the literature.3 It has found wide application ence allows us to condition on some observed (e.g., Kim and Ghahramani, 2012; Simpson et al., data y to draw inferences about the parameters 2013; Passonneau and Carpenter, 2014). It is an θ; this is done through the posterior, p(θjy). unpooled model, namely, each annotator has their The uncertainty in such inferences may then be own response parameters (see Figure2), which are used in applications such as jointly training clas- given fixed priors. Its generative process is: sifiers (Smyth et al., 1995; Raykar et al., 2010), comparing crowdsourcing systems (Lease and • For every annotator j 2 f1; 2; :::; Jg: Kazai, 2011), or characterizing corpus accuracy (Passonneau and Carpenter, 2014). – For every class k 2 f1; 2; :::; Kg: This section describes the six models we eval- ∗ Draw class annotator abilities K uate. These models are drawn from the litera- βj;k ∼ Dirichlet(1 ) ture, but some had to be generalized from binary to multiclass annotations. The generalization nat- 1Carpenter(2008) parameterizes ability in terms of specificity urally comes with parameterization changes, al- and sensitivity. For multiclass annotations, we generalize to a though these do not alter the fundamentals of the full response matrix (Passonneau and Carpenter, 2014). models. (One aspect tied to the model parameter- 2Notation: 1K is a K-dimensional vector of 1 values. 3 ization is the choice of priors. The guideline we Dawid and Skene fit maximum likelihood estimates us- followed was to avoid injecting any class prefer- ing expectation maximization (EM), but the model is easily extended to include fixed prior information for regularization, ences a priori and let the data uncover this infor- or hierarchical priors for fitting the prior jointly with the abil- mation; see more in §3.) ity parameters and automatically performing partial pooling. 572 Figure 4: Plate diagram for the hierarchical Dawid and Skene model. Figure 3: Plate diagram for the MACE model. the overall population of annotators (see Figure4). This structure provides partial pooling, using in- • Draw class prevalence π ∼ Dirichlet(1K ) formation about the population to improve estimates of individuals by regularizing toward the • For every item i 2 f1; 2; :::; Ig: population mean. This is particularly helpful with – Draw true class ci ∼ Categorical(π) low count data as found in many crowdsourcing – For every position n 2 f1; 2; :::; Nig: tasks (Gelman et al., 2013). The full generative 6 ∗ Draw annotation process is as follows: y ∼ Categorical(β )4 i;n jj[i;n];ci • For every class k 2 f1; 2; :::; Kg: Multi-Annotator Competence Estimation (MACE) – Draw class ability means This model, introduced by Hovy et al.(2013), 0 ζk;k0 ∼ Normal(0; 1); 8k 2 f1; :::; Kg takes into account the credibility of the annotators 5 – Draw class s.d.’s and their spamming preference and strategy (see 0 Ω 0 ∼ HalfNormal(0; 1); 8k Figure3). This is another example of an unpooled k;k model, and possibly the model most widely • For every annotator j 2 f1; 2; :::; Jg: applied to linguistic data (e.g., Plank et al., 2014a; Sabou et al., 2014; Habernal and Gurevych, 2016, – For every class k 2 f1; 2; :::; Kg: inter alia). Its generative process is: ∗ Draw class annotator abilities 0 • For every annotator j 2 f1; 2; :::; Jg: βj;k;k0 ∼ Normal(ζk;k0 ; Ωk;k0 ); 8k – Draw spamming behavior • Draw class prevalence π ∼ Dirichlet(1K ) K j ∼ Dirichlet(10 ) • For every item i 2 f1; 2; :::; Ig: – Draw credibility θj ∼ Beta(0:5; 0:5) • For every item i 2 f1; 2; :::; Ig: – Draw true class ci ∼ Categorical(π) – For every position n 2 f1; 2; :::; Nig: – Draw true class ci ∼ Uniform ∗ Draw annotation yi;n ∼ – For every position n 2 f1; 2; :::; Nig: 7 Categorical(softmax(βjj[i;n];ci )) ∗ Draw a spamming indicator si;n ∼ Bernoulli(1 − θjj[i;n]) Item Difficulty (ITEMDIFF) We also test an exten- ∗ If si;n = 0 then: sion of the “Beta-Binomial by Item” model from · yi;n = ci Carpenter(2008), which does not assume any an- ∗ Else: notator structure; instead, the annotations of an item are made to depend on its intrinsic difficulty.

Comparing Bayesian Models of Annotation

STATISTICAL SCIENCE Volume 36, Number 3 August 2021

A Matching Based Theoretical Framework for Estimating Probability of Causation

December 2000

ALEXANDER PHILIP DAWID Emeritus Professor of Statistics University of Cambridge E-Mail: [email protected] Web

Theory and Applications of Proper Scoring Rules

PHIL 8670 (Fall, 2015): Philosophy of Statistics

Computational and Financial Econometrics (CFE 2017)

Hyper and Structural Markov Laws for Graphical Models

Probability Forecasts and Prediction Markets A

Extended Conditional Independence and Applications in Causal Inference

Sufficient Covariate, Propensity Variable and Doubly Robust

Decision-Theoretic Foundations for Statistical Causality