Bayesian Aggregation of Categorical Distributions with Applications in Crowdsourcing∗
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Alexandry Augustin, Southampton University, Southampton, UK
Matteo Venanzi, Microsoft, London, UK
Alex Rogers, Oxford University, Oxford, UK
Nicholas R. Jennings, Imperial College, London, UK

∗This work was supported by General Electric Aviation.

Abstract

A key problem in crowdsourcing is the aggregation of judgments of proportions. For example, workers might be presented with a news article or an image, and be asked to identify the proportion of each topic, sentiment, object, or colour present in it. These varying judgments then need to be aggregated to form a consensus view of the document's or image's contents. Often, however, these judgments are skewed by workers who provide judgments randomly. Such spammers make the cost of acquiring judgments more expensive and degrade the accuracy of the aggregation. For such cases, we provide a new Bayesian framework for aggregating these responses (expressed in the form of categorical distributions) that for the first time accounts for spammers. We elicit 796 judgments about proportions of objects and colours in images. Experimental results show comparable aggregation accuracy when 60% of the workers are spammers, as other state of the art approaches do when there are no spammers.

1 Introduction

The emergence of crowdsourcing platforms such as Amazon Mechanical Turk (AMT), Crowdflower and oDesk, has impacted a number of domains. In particular in areas such as sentiment analysis, citizen science, and digital humanitarianism, it is now possible to collect low-cost judgments rapidly with reduced reliance on domain experts. These judgments usually take the form of single or multiple discrete labels, but can be any metadata attached to an item. A key problem of interest in this area is the aggregation of judgments of proportions [Varey et al., 1990; Ho et al., 2016]. Text categorisation offers an example scenario. In this context, the task is to assign categories for a text document whose content may cover multiple topics or evoke various sentiments [Blei et al., 2003]. The documents are assigned to humans who provide a judgment about the proportion of each category. For example, if a worker believes that 1/2 of a document evokes anger, 1/4 surprise, and 1/4 disgust, he will submit the following judgment {anger: 50%; surprise: 25%; disgust: 25%}. Another domain of interest is digital humanitarianism. Satellite imagery is increasingly used to map deforestations and natural disasters rapidly by volunteers. However, image quality is often poor, due to cloud obstruction or low resolution, with features that are difficult to identify precisely by the volunteers. A volunteer who believes 3/4 of the image consists of rubble and 1/4 of undamaged buildings would provide the following judgment {rubble: 75%; undamaged: 25%}.

While collecting judgments from individual workers is appealing, such methods raise issues of reliability, due to the unknown incentives of the participants. There are likely to be malicious workers (i.e. spammers) who provide judgments randomly. For example, in the sentiment analysis domain, workers may provide as many random judgments as possible for quick financial gain [Difallah et al., 2012]. Similarly, in the digital humanitarian domain, malicious volunteers may provide misleading reports to divert the attention of ecologists or emergency services. It has been estimated that up to 45% of workers on crowdsourcing platforms may fall into this category [Vuurens et al., 2011]. Thus, reliably aggregating multiple judgments from crowd workers is non-trivial.

To address these challenges, we regard the problem of judging proportions as one of probabilistic modelling where the judgments from the workers are categorical distributions [Oakley, 2010; Geng, 2016]. Now, a number of approaches have been proposed to aggregate probability distributions. Specifically, label distribution learning [Geng, 2016] is a comprehensive framework for designing algorithms that infer distributions. While general in nature, it fails to explicitly address the problem of spammers. Simpler methods such as opinion pools are perhaps the most commonly used approach. In particular, the linear opinion pool (LinOp) [Bacharach, 1979] and the logarithmic opinion pool [Weerahandi and Zidek, 1981] are the most popular methods. They aggregate individual workers' judgments using a weighted arithmetic average and a weighted geometric average respectively. They are simple and frequently yield useful results with a moderate amount of computation. Alternatively, approaches based on confusion matrices, such as IBCC [Kim and Ghahramani, 2012] and CBCC [Venanzi et al., 2014], have been shown to produce more accurate aggregations when faced with spammers compared to single weight methods [Ipeirotis et al., 2010]. Here, a confusion matrix is a square stochastic matrix where each row represents the worker's accuracy of a given category. It is this breakdown of the workers' accuracy per category that overcomes the limitation of using single weights. However, to date, these models only address settings where documents have a single category chosen among multiple alternatives. When faced with documents with multiple categories, such models infer inaccurate aggregation and workers' ability.

To address these shortcomings, we extend IBCC to deal with settings where documents have multiple categories. Specifically, we seek to aggregate multiple workers' judgments of the proportion of those categories. We elicit these judgments in the form of categorical distributions from each worker, and use these distributions as input to our model. We then sample and weight (through the workers' confusion matrices) each distribution repeatedly to obtain multiple discrete observations for a single document by the same worker. This sampling step is crucial to enable the use of confusion matrices to identify spammers, while avoiding the constraint of classifying documents into a single category as found in IBCC. A by-product of using confusion matrices is that it enables the classification of spammers from diligent workers. This is key to improving the aggregation accuracy as spammers can be removed from the original dataset; and can further be excluded from the pool of available workers in future crowdsourcing campaigns. Furthermore, our approach is sufficiently flexible to learn the workers' ability in both unsupervised and semi-supervised settings. In particular, the former approach learns the workers' ability and the documents' proportion simultaneously, making it especially suitable when the ground truth is not available. On the other hand, if the true proportion is known for some of the documents, performance can be improved on new documents using this limited training data.

In more detail, we make the following contributions to the state of the art: (i) we define a Bayesian model that jointly learns, for the first time, the per category accuracies of individual workers, together with the distribution associated with each document; (ii) we empirically show that our model outperforms existing methods on three real-world datasets; achieving a comparable level of accuracy when 60% of the workers are spammers, as other approaches do when there are no spammers; (iii) we show a five times improvement in the expected number of misclassified spammers compared to existing methods.

2 Preliminaries

We denote by $x \perp y$ the fact that the random variable $x$ is independent of the random variable $y$; $\sim$ means "distributed as"; and $\mid$ expresses conditional probabilities. If $x$ has a categorical distribution we write $x \sim \mathrm{Cat}(\cdot)$. The Kronecker delta $\delta(x)$ is 1 if $x = 0$; 0 otherwise.

Given this notation, IBCC assumes that each document has a single unknown category which we want to infer. This target category $t_i$ for document $i$ takes a value in $j \in \{1, \cdots, J\}$, where $J$ is the number of alternatives. Categories are assumed to be drawn from a categorical distribution with probability vector $\kappa$:

$$t_i \sim \mathrm{Cat}(\kappa). \tag{1}$$

Given a set of $K$ workers, each worker $k \in \{1, \cdots, K\}$ submits a judgment $c_i^{(k)} = l$ of the target category $t_i = j$ for document $i$, where $l \in \{1, \cdots, J\}$ is the set of discrete judgments that the worker can make. A judgment $c_i^{(k)}$ from worker $k$ is assumed to have been generated from a categorical distribution

$$c_i^{(k)} \sim \mathrm{Cat}\left(\pi_{t_i}^{(k)}\right). \tag{2}$$

Furthermore, the workers' judgments are assumed to be conditionally independent given the target category $t_i$

$$c_i^{(k)} \perp c_i^{(\{1, \cdots, K\} \setminus k)} \mid t_i, \quad \forall k \in \{1, \cdots, K\}.$$

This is the assumption commonly used in naïve Bayes classifiers which ignore correlations between workers. It is a reasonable assumption in crowdsourcing since workers do not typically interact with each other. The probabilities $\pi_{j,l}^{(k)}$ are the individual error-rates of the $k$-th worker. The confusion matrix

$$\Pi^{(k)} = \left\{ \pi_1^{(k)}, \cdots, \pi_J^{(k)} \right\}$$

for worker $k$ is a square stochastic matrix defined on $\mathbb{R}^{J \times J}$ capturing the probabilistic dependency between the workers' judgments and the consensus. Each row represents an alternative $j$, while each column represents the worker's judgment $l$ regarding each category. All rows of the confusion matrix are assumed independent within and across workers

$$\pi_i^{(k)} \perp \pi_j^{(\{1, \cdots, K\} \setminus k)}, \quad \forall k \in \{1, \cdots, K\} \text{ and } \forall i \neq j.$$

This means that the workers' ability to identify a given category is not dependent on their ability to identify the other alternatives. In line with Bayesian inference, the parameters $\kappa$ and $\Pi$ are random variables. Therefore, a conjugate Dirichlet prior distribution is assigned to the parameter vector $\kappa$, such that