Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence (IJCAI-17)

Bayesian Aggregation of Categorical Distributions with Applications in Crowdsourcing∗

Alexandry Augustin (Southampton University, Southampton, UK) [email protected]
Matteo Venanzi (Microsoft, London, UK) [email protected]
Alex Rogers (Oxford University, Oxford, UK) [email protected]
Nicholas R. Jennings (Imperial College, London, UK) [email protected]

∗This work was supported by General Electric Aviation.

Abstract

A key problem in crowdsourcing is the aggregation of judgments of proportions. For example, workers might be presented with a news article or an image, and be asked to identify the proportion of each topic, sentiment, object, or colour present in it. These varying judgments then need to be aggregated to form a consensus view of the document's or image's contents. Often, however, these judgments are skewed by workers who provide judgments randomly. Such spammers make the cost of acquiring judgments more expensive and degrade the accuracy of the aggregation. For such cases, we provide a new Bayesian framework for aggregating these responses (expressed in the form of categorical distributions) that for the first time accounts for spammers. We elicit 796 judgments about proportions of objects and colours in images. Experimental results show comparable aggregation accuracy when 60% of the workers are spammers, as other state-of-the-art approaches do when there are no spammers.

1 Introduction

The emergence of crowdsourcing platforms such as Amazon Mechanical Turk (AMT), Crowdflower and oDesk has impacted a number of domains. In particular, in areas such as sentiment analysis, citizen science, and digital humanitarianism, it is now possible to collect low-cost judgments rapidly with reduced reliance on domain experts. These judgments usually take the form of single or multiple discrete labels, but can be any metadata attached to an item. A key problem of interest in this area is the aggregation of judgments of proportions [Varey et al., 1990; Ho et al., 2016]. Text categorisation offers an example scenario. In this context, the task is to assign categories to a text document whose content may cover multiple topics or evoke various sentiments [Blei et al., 2003]. The documents are assigned to humans who provide a judgment about the proportion of each category. For example, a worker who believes that 1/2 of a document evokes anger, 1/4 surprise, and 1/4 disgust will submit the judgment {anger: 50%, surprise: 25%, disgust: 25%}. Another domain of interest is digital humanitarianism. Satellite imagery is increasingly used by volunteers to map deforestation and natural disasters rapidly. However, image quality is often poor, due to cloud obstruction or low resolution, with features that are difficult for the volunteers to identify precisely. A volunteer who believes 3/4 of an image consists of rubble and 1/4 of undamaged buildings would provide the judgment {rubble: 75%, undamaged: 25%}.

While collecting judgments from individual workers is appealing, such methods raise issues of reliability, due to the unknown incentives of the participants. There are likely to be malicious workers (i.e. spammers) who provide judgments randomly. For example, in the sentiment analysis domain, workers may provide as many random judgments as possible for quick financial gain [Difallah et al., 2012]. Similarly, in the digital humanitarian domain, malicious volunteers may provide misleading reports to divert the attention of ecologists or emergency services. It has been estimated that up to 45% of workers on crowdsourcing platforms may fall into this category [Vuurens et al., 2011]. Thus, reliably aggregating multiple judgments from crowd workers is non-trivial.

To address these challenges, we regard the problem of judging proportions as one of probabilistic modelling, where the judgments from the workers are categorical distributions [Oakley, 2010; Geng, 2016]. Now, a number of approaches have been proposed to aggregate probability distributions. Specifically, label distribution learning [Geng, 2016] is a comprehensive framework for designing algorithms that infer distributions. While general in nature, it fails to explicitly address the problem of spammers. Simpler methods such as opinion pools are perhaps the most commonly used approach. In particular, the linear opinion pool (LinOp) [Bacharach, 1979] and the logarithmic opinion pool [Weerahandi and Zidek, 1981] are the most popular methods. They aggregate individual workers' judgments using a weighted arithmetic average and a weighted geometric average respectively.
They are simple and frequently yield useful results with a moderate amount of computation. Alternatively, approaches based on confusion matrices, such as IBCC [Kim and Ghahramani, 2012] and CBCC [Venanzi et al., 2014], have been shown to produce more accurate aggregations when faced with spammers compared to single-weight methods [Ipeirotis et al., 2010]. Here, a confusion matrix is a square stochastic matrix where each row represents the worker's accuracy on a given category. It is this breakdown of the workers' accuracy per category that overcomes the limitation of using single weights. However, to date, these models only address settings where documents have a single category chosen among multiple alternatives. When faced with documents with multiple categories, such models infer inaccurate aggregations and workers' abilities.

To address these shortcomings, we extend IBCC to deal with settings where documents have multiple categories. Specifically, we seek to aggregate multiple workers' judgments of the proportion of those categories. We elicit these judgments in the form of categorical distributions from each worker, and use these distributions as input to our model. We then sample and weight (through the workers' confusion matrices) each distribution repeatedly to obtain multiple discrete observations for a single document by the same worker. This sampling step is crucial to enable the use of confusion matrices to identify spammers, while avoiding the constraint of classifying documents into a single category as found in IBCC. A by-product of using confusion matrices is that they enable spammers to be distinguished from diligent workers. This is key to improving the aggregation accuracy, as spammers can be removed from the original dataset, and can further be excluded from the pool of available workers in future crowdsourcing campaigns. Furthermore, our approach is sufficiently flexible to learn the workers' ability in both unsupervised and semi-supervised settings. In particular, the former approach learns the workers' ability and the documents' proportions simultaneously, making it especially suitable when the ground truth is not available. On the other hand, if the true proportion is known for some of the documents, performance can be improved on new documents using this limited training data.

In more detail, we make the following contributions to the state of the art: (i) we define a Bayesian model that jointly learns, for the first time, the per-category accuracies of individual workers, together with the distribution associated with each document; (ii) we empirically show that our model outperforms existing methods on three real-world datasets, achieving a comparable level of accuracy when 60% of the workers are spammers, as other approaches do when there are no spammers; (iii) we show a five-times improvement in the expected number of misclassified spammers compared to existing methods.

The remainder of this paper is organised as follows. We first introduce our notation and present IBCC in Section 2. In Section 3, we detail our method. In Section 4, we present the results of our experimental evaluation. We conclude in Section 5.

2 Preliminaries

We denote by $x \perp y$ the fact that the random variable $x$ is independent of the random variable $y$; $\sim$ means "distributed as"; and $|$ expresses conditional probabilities. If $x$ has a categorical distribution we write $x \sim \mathrm{Cat}(\cdot)$. The function $\delta(x)$ is 1 if $x = 0$, and 0 otherwise.

Given this notation, IBCC assumes that each document has a single unknown category which we want to infer. This target category $t_i$ for document $i$ takes a value $j \in \{1, \dots, J\}$, where $J$ is the number of alternatives. Categories are assumed to be drawn from a categorical distribution

$$t_i \sim \mathrm{Cat}(\kappa). \quad (1)$$

Given a set of $K$ workers, each worker $k \in \{1, \dots, K\}$ submits a judgment $c_i^{(k)} = l$ of the target category $t_i = j$ for document $i$, where $l \in \{1, \dots, J\}$ is the set of discrete judgments that the worker can make. A judgment $c_i^{(k)}$ from worker $k$ is assumed to have been generated from a categorical distribution

$$c_i^{(k)} \sim \mathrm{Cat}\left(\pi_{t_i}^{(k)}\right). \quad (2)$$

Furthermore, the workers' judgments are assumed to be conditionally independent given the target category $t_i$:

$$c_i^{(k)} \perp c_i^{(\{1,\dots,K\}\setminus k)} \mid t_i, \quad \forall k \in \{1, \dots, K\}.$$

This is the assumption commonly used in naïve Bayes classifiers, which ignore correlations between workers. It is a reasonable assumption in crowdsourcing since workers do not typically interact with each other. The probabilities $\pi_{j,l}^{(k)}$ are the individual error-rates of the $k$-th worker. The confusion matrix

$$\Pi^{(k)} = \left\{\pi_1^{(k)}, \dots, \pi_J^{(k)}\right\}$$

for worker $k$ is a square stochastic matrix defined on $\mathbb{R}^{J \times J}$ capturing the probabilistic dependency between the worker's judgments and the consensus. Each row represents an alternative $j$, while each column represents the worker's judgment $l$ regarding each category. All rows of the confusion matrix are assumed independent within and across workers:

$$\pi_i^{(k)} \perp \pi_j^{(\{1,\dots,K\}\setminus k)}, \quad \forall k \in \{1, \dots, K\} \text{ and } \forall i \neq j.$$

This means that a worker's ability to identify a given category does not depend on their ability to identify the other alternatives. In line with Bayesian inference, the parameters $\kappa$ and $\Pi$ are random variables. Therefore, a conjugate Dirichlet prior distribution is assigned to the parameter vector $\kappa$, such that

$$\kappa \sim \mathrm{Dir}(\nu) \quad (3)$$

where the hyperparameter $\nu \in \mathbb{R}^J$ can be viewed as pseudo-counts of prior observations; that is, the number of documents in each category across the corpus.
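To make these generative assumptions concrete, the following sketch (our own illustrative code, with arbitrary sizes and a diagonally dominant Dirichlet prior on the confusion-matrix rows) draws target categories and worker judgments exactly as in Equations 1 and 2:

```python
import numpy as np

rng = np.random.default_rng(0)
I, J, K = 100, 3, 5  # documents, categories (alternatives), workers

# Prior over the category distribution kappa (Eq. 3); nu acts as
# pseudo-counts of documents per category across the corpus.
nu = np.ones(J)
kappa = rng.dirichlet(nu)

# One confusion matrix per worker: row j is the distribution over the
# worker's judgments given true category j, so each row sums to one.
alpha = 5 * np.eye(J) + 1  # assumption: workers are mostly accurate a priori
Pi = np.array([[rng.dirichlet(alpha[j]) for j in range(J)] for _ in range(K)])

t = rng.choice(J, size=I, p=kappa)            # t_i ~ Cat(kappa)            (Eq. 1)
c = np.array([[rng.choice(J, p=Pi[k, t[i]])   # c_i^(k) ~ Cat(pi_{t_i}^(k)) (Eq. 2)
               for k in range(K)] for i in range(I)])
```

Inference in IBCC then runs this process in reverse, recovering $\kappa$ and the $\Pi^{(k)}$ from the observed judgments $c$.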


Algorithm 1 Generative process of MBCC.
 1: Input: the confusion matrices $\Pi$ and the category proportions $\Lambda$
 2:
 3: for each document $i \in \{1, \dots, I\}$ do
 4:   for each sample $n \in \{1, \dots, N\}$ do
 5:     Sample $z_{i,n} \sim \mathrm{Cat}(\Lambda_i)$
 6:     for each worker $k \in \{1, \dots, K\}$ do
 7:       Sample $c_{i,n}^{(k)} \sim \mathrm{Cat}\left(\pi_{z_{i,n}}^{(k)}\right)$
 8:     end for
 9:   end for
10:
11:   for each worker $k \in \{1, \dots, K\}$ do
12:     for each category $j \in \{1, \dots, J\}$ do
13:       $\Phi_{i,j}^{(k)} = \beta_{i,j}^{(k)} \sum_{n=1}^{N} \delta\left(c_{i,n}^{(k)} - j\right)$, where $\beta_{i,j}^{(k)}$ is a normalising constant.
14:     end for
15:   end for
16: end for
17:
18: return $\Phi$

Figure 1: Factor graph of MBCC [Dietz, 2010]. (Plates: documents, workers, samples, categories; factors: Dir. and Cat. The image itself is not reproduced here.)

A conjugate Dirichlet prior distribution is similarly introduced over each parameter $\pi_j^{(k)}$ with hyperparameter $\alpha_j^{(k)}$, such that

$$\pi_j^{(k)} \sim \mathrm{Dir}\left(\alpha_j^{(k)}\right). \quad (4)$$

The set of hyperparameters $\alpha_j^{(k)}$ forms a matrix

$$A^{(k)} = \left\{\alpha_1^{(k)}, \dots, \alpha_J^{(k)}\right\},$$

where each row $j$ is a vector $\alpha_j^{(k)} \in \mathbb{R}^J_{\geq 0}$. The matrix $A^{(k)}$ is chosen to represent any prior level of uncertainty in the worker's confusion matrix, and can also be regarded as pseudo-counts of prior observations; that is, the number of documents of each category that worker $k$ has already judged.

Now, IBCC assumes that workers classify documents into a single category among multiple alternatives. This limitation provides the basis for our extension.

3 Aggregating Judgments over Multiple Categories

Our proposed model, referred to hereafter as multi-category independent Bayesian classifier combination (MBCC), is a generalisation of IBCC that deals with settings where documents have multiple categories. In more detail, we introduce a categorical distribution (i.e. the category proportion) with parameter $\Lambda_i$ over the $J$ categories for each document $i$ (Equation 5), instead of the single categorical distribution $\kappa$ common to all documents found in IBCC (Equation 1). Moreover, each worker $k$ submits a distribution $\Phi_i^{(k)}$ over the $J$ categories for each document $i$, instead of the single category $c_i^{(k)}$ required by IBCC. The confusion matrix $\Pi^{(k)}$ of each worker $k$ is kept unchanged from IBCC. Therefore, in this new setting, to assess the accuracy of each worker through their confusion matrix $\Pi^{(k)}$, we need to first independently sample the aggregated distribution $\Lambda_i$ to obtain a set of categories $z_{i,n}$ for each document $i$, such that

$$z_{i,n} \sim \mathrm{Cat}(\Lambda_i), \quad (5)$$

for all samples $n \in \{1, \dots, N\}$. The vector $z_i$ of dimension $N$ can be seen as independent target categories $t_i$ of document $i$ (as in IBCC) drawn from the category proportion in Equation 5. We subsequently match these samples against samples from each of the workers' distributions $\Phi_i^{(k)}$ to obtain

$$c_{i,n}^{(k)} \sim \mathrm{Cat}\left(\pi_{z_{i,n}}^{(k)}\right) \quad (6)$$

for each document $i$. This is equivalent to running IBCC $N$ times with different values of the workers' judgment $c_i^{(k)}$ drawn from their distribution $\Phi_i^{(k)}$. In practice, one can perform this sampling until the accuracy no longer increases. Alternatively, a theoretical upper bound for an arbitrary level of error can be calculated using Lemma 3 in [Devroye, 1983]. Thus, the joint distribution is

$$p(z, c, \Lambda, \Pi) = \prod_{i=1}^{I} \mathrm{Cat}(\Lambda_i)\,\mathrm{Dir}(\epsilon_i) \times \prod_{k=1}^{K} \prod_{n=1}^{N} \mathrm{Cat}\left(\pi_{z_{i,n}}^{(k)}\right) \prod_{j=1}^{J} \mathrm{Dir}\left(\alpha_j^{(k)}\right), \quad (7)$$

where the hyperparameter $\nu$ in IBCC has been renamed for clarity to $\epsilon_i$ (i.e. the pseudo-counts of categories for each document $i$), such that

$$\Lambda_i \sim \mathrm{Dir}(\epsilon_i). \quad (8)$$

The generative process, that is, the random process by which MBCC assumes the workers' judgments $\Phi_i^{(k)}$ arose, is summarised in Figure 1 and Algorithm 1. In particular, we start with the confusion matrices $\Pi$ and the category proportions $\Lambda$ sampled from Equations 4 and 8 respectively (Line 1). We then sample $N$ categories $z_{i,n}$ from each category proportion $\Lambda_i$ using Equation 5 (Line 5). Given each category $z_{i,n}$, we sample judgments $c_{i,n}^{(k)}$ from the $z_{i,n}$-th row of each worker's confusion matrix $\Pi^{(k)}$ (Line 7). Finally, we find the most likely categorical distributions $\Phi_i^{(k)}$ which generated the samples $z_i^{(k)}$ for all documents $i$ and workers $k$ (Line 13).


The key inferential problem that needs to be solved in order to use MBCC is that of computing the posterior distribution of the latent variables $z$ and the parameters $\Lambda$ and $\Pi$ given the data $c$, that is, $p(z, \Lambda, \Pi \mid c)$. To do this, we can use approximation methods based on statistical sampling (such as Markov chain Monte Carlo [Geman and Geman, 1984]) or density approximation (such as variational approximation [Jordan et al., 1999; Attias, 2000] or Laplace approximation [Laplace, 1986]). In particular, our implementation uses the variational message passing algorithm [Winn and Bishop, 2005], as it has proven superior in terms of speed and accuracy compared to both sampling and density-based approximation [Minka, 2013]; but other approaches could be used if appropriate.

4 Empirical Evaluation

To evaluate the efficacy of our model, we use an independently gathered dataset and introduce two new datasets, all of which include ground truth from expert annotators. We then compare performance against four state-of-the-art benchmarks. The experiments are run in an unsupervised setting, where the ground truth is never exposed to the algorithms and is only used to measure their accuracy. The source code and datasets are available at [Augustin and Venanzi, 2017].

4.1 Datasets

Our experiments use a total of three datasets.

SemEval. This dataset contains judgments of sentiments within one hundred news headlines sampled from the SemEval-2007 test set [Strapparava and Mihalcea, 2007; Snow et al., 2008]. Each worker was presented with a list of headlines and was asked to give numeric judgments between zero and one hundred for each of six sentiments. Ten judgments were collected for each headline, for a total of 1,000 judgments. These judgments were obtained from 38 workers who provided a minimum of 20 judgments each, and 26 on average. We truncate the total number of judgments per worker to 20 to avoid discrepancies in the accuracy of each inferred confusion matrix. We normalise the values submitted by each worker into valid probability distributions by ensuring that they sum to one.

IAPR-TC12. We crowdsourced judgments on a set of 16 images sampled from the IAPR TC-12 dataset [Escalante et al., 2010; Augustin and Venanzi, 2017]. This is a collection of 20,000 images of urban and rural scenes manually segmented into regions. Each pixel belongs to one of six regions. We gathered a total of 21 judgments per image from 21 workers. Workers were asked to estimate the proportion of each region in the image. The ground truth proportion for each category is calculated by dividing the number of pixels in the region by the total number of pixels in the image. The workers reported their judgments with a pie chart, enabling quick and accurate judgments of proportion [Hollands and Spence, 1992].

Colours. We crowdsourced a set of 460 judgments of the proportion of colours in the flags of 20 countries [Augustin and Venanzi, 2017]. We asked 23 participants to judge, from memory, the proportion of 10 colours in each country's flag.

The alternatives in each dataset are complete; that is, using all the categories can always fully describe the instance. Furthermore, although these three datasets may already contain spammers, we augment the datasets with additional synthetic spammers to explore the loss of accuracy as their number increases. Vuurens et al. [2011] identified four types of spammers: sloppy, uniform, random and semi-random. While sloppy workers try to the best of their abilities to complete the tasks, they may be insufficiently precise in their judgments. Uniform spammers, on the other hand, use a fixed uniform judgment pattern across all documents. Finally, random spammers provide unique meaningless answers for each document, and semi-random spammers also answer a few questions properly. While the obvious characteristic of repeated judgment patterns can be used to manually filter out uniform spammers, random spammers are the most challenging to detect. For this reason, we use a random spamming strategy where a spammer always provides a random categorical distribution for each document. The distribution of each spammer $k$ for each document $i$ shares a prior such that

$$\Phi_i^{(k)} \sim \mathrm{Dir}(\mathbf{1}), \quad (9)$$

where the pseudo-count of 1 ensures that all possible distributions $\Phi_i^{(k)}$ are equally likely.

4.2 Experimental Setting

We set the parameter of the prior of each confusion matrix for all workers and spammers to $A^{(k)} = 100 \times I + \mathbf{1}\mathbf{1}^T$. This means that workers are initially assumed to be reasonably accurate before seeing any data. Using a different assumption leads to distinct aggregation accuracy profiles that will be discussed in more detail in Section 4.5. Furthermore, to ensure fair comparison between the benchmarks, we do not adjust the parameters $\epsilon_i$ of the prior distributions over categories to reflect the balance of each dataset. Finally, we run all models a hundred times each to achieve statistically significant results at the 99% confidence level.

4.3 Benchmarks

We compare the performance of our model to four state-of-the-art benchmark methods.

Uniform assigns a uniform distribution as each aggregated distribution (i.e. $p_j = 1/J$ for all $j \in \{1, \dots, J\}$), making it independent of the dataset. These particular values of $p_j$ maximise the entropy of the categorical distribution. That is, if one were to guess an aggregated distribution far from the ground truth on average (given some distance metric), its error can be expected to exceed that of the uniform model.
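The synthetic spammers of Equation 9 and the confusion-matrix prior of Section 4.2 are straightforward to construct. The sketch below is our own illustrative code (sizes chosen to mirror the IAPR-TC12 setup); a real run would keep the genuine workers' elicited distributions where `is_spammer` is false:

```python
import numpy as np

rng = np.random.default_rng(2)
I, J, K = 16, 6, 21   # documents, categories, workers (IAPR-TC12-like sizes)
spam_ratio = 0.5      # fraction of the pool replaced by spammers

# Eq. 9: a random spammer draws a fresh judgment Phi_i^(k) ~ Dir(1,...,1)
# for every document, so all categorical distributions are equally likely.
def random_spammer_judgments(num_docs, num_cats):
    return rng.dirichlet(np.ones(num_cats), size=num_docs)

is_spammer = rng.random(K) < spam_ratio
judgments = np.array([random_spammer_judgments(I, J) for _ in range(K)])
# (Here every entry is synthetic for brevity; only the rows flagged by
# is_spammer would be replaced in the actual augmentation.)

# Section 4.2 prior on every confusion matrix: A^(k) = 100 * I + 1 1^T,
# i.e. heavy pseudo-counts on the diagonal plus one everywhere.
A = 100 * np.eye(J) + np.ones((J, J))
```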


Figure 2: Average error on the aggregated distributions Λ on the SemEval dataset when increasing: (left) the ratio of spammers at N = 180 samples, (right) the number of samples at a ratio of spammers of 50%.

Figure 3: Average error on the aggregated distributions when increasing the ratio of spammers on: (left) the IAPR-TC12 dataset at N = 180 samples, (right) the Colours dataset at N = 330 samples.

LinOp averages the distributions provided by the workers. Unlike IBCC and MBCC, it does not sample the workers' distributions to produce discrete observations, but takes the distributions directly as input. As there is no training set for assigning informative weights to the workers, we assign an equal weight $\omega^{(k)} = 1/K$ to each worker.

Median estimates the aggregated distribution by arranging the judgments for each document in ascending order and taking the middle judgment. Each judgment is considered with equal weight, for the same reason as for LinOp. The median is robust against extreme judgments, since it will not give an arbitrarily large or small result if no more than half of the judgments for a document are incorrect.

IBCC combines discrete judgments from multiple workers and models the ability of each individual worker using confusion matrices [Kim and Ghahramani, 2012]. Although IBCC takes a single category as ground truth, its output is a posterior distribution over the categories, which can be compared to the ground truth category proportion.

4.4 Accuracy Metrics

To assess the accuracy of the inference, we use the Euclidean distance. Specifically, we define the average error of the aggregation on the entire dataset by

$$E_\Lambda = \frac{1}{I} \sum_{i=1}^{I} d(\Lambda_i^*, \Lambda_i) \quad (10)$$

where $d(\cdot)$ is the Euclidean distance, $I$ the total number of documents, $\Lambda_i^*$ the ground truth distribution for document $i$, and $\Lambda_i$ the aggregated distribution provided by the model for document $i$. Furthermore, we define the deviation of the confusion matrix $\Pi^{(k)}$ of worker $k$ from the identity matrix $I$ by

$$E_{\Pi^{(k)}} = \sum_{j=1}^{J} d\left(I_j, \pi_j^{(k)}\right). \quad (11)$$

Other distance metrics can be used, provided that they give finite results when the distributions are not absolutely continuous.

4.5 Results

We now present the results of our empirical evaluation regarding a number of key aspects: (i) accuracy of aggregation and robustness to spammers, (ii) convergence and running time, and (iii) classification of spammers. We first consider in detail the results on the SemEval dataset, and then briefly discuss the results on the other datasets.
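The baselines of Section 4.3 and the error metrics of Equations 10 and 11 can be written compactly. This is our own sketch; the renormalisation in the Median baseline is a common convention that the paper does not spell out:

```python
import numpy as np

def linop(judgments):
    """LinOp with equal weights 1/K: arithmetic mean of the workers'
    distributions. judgments: (K, J) array, one row per worker."""
    return judgments.mean(axis=0)

def median_pool(judgments):
    """Per-category median of the workers' judgments, renormalised so the
    result is a valid distribution."""
    m = np.median(judgments, axis=0)
    return m / m.sum()

def average_error(lam_true, lam_est):
    """E_Lambda (Eq. 10): mean Euclidean distance between ground-truth and
    aggregated distributions over the I documents. Both args: (I, J)."""
    return float(np.mean(np.linalg.norm(lam_true - lam_est, axis=1)))

def confusion_deviation(pi_k):
    """E_{Pi^(k)} (Eq. 11): summed row-wise Euclidean distance of a worker's
    confusion matrix from the identity; zero for a perfect worker. Section 4.5
    thresholds this value to flag spammers."""
    J = pi_k.shape[0]
    return float(np.linalg.norm(np.eye(J) - pi_k, axis=1).sum())

# Example: three workers judging one document over J = 3 categories.
judg = np.array([[0.5, 0.25, 0.25],
                 [0.6, 0.20, 0.20],
                 [0.4, 0.30, 0.30]])
print(linop(judg))                     # [0.5, 0.25, 0.25]
print(confusion_deviation(np.eye(3)))  # 0.0
```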


Figure 4: ROC curves on the SemEval dataset. The ratio of spammers is set to 50%.

Figure 2 (left) shows the average error as the ratio of spammers increases on the SemEval dataset. To illustrate, a ratio of spammers of 50% means half the workers in the dataset are spammers. As can be seen, MBCC achieves a comparable aggregation accuracy when 50% of the workers are spammers as LinOp does when no spammers are added. However, the added complexity of detecting spammers in MBCC comes at a cost when the level of spammers is low. Indeed, the highest precision is initially produced by LinOp, but as the ratio of spammers increases a crossover point is reached, after which MBCC is the most precise. The value of this crossover point can be adjusted by varying the pseudo-counts of prior observations $A_{ii}^{(k)}$ on the diagonal of the workers' confusion matrices. For high values, MBCC assumes all workers are truthful. This is the assumption underpinning LinOp, where all workers have an implicit identity matrix as their confusion matrix. In fact, for high pseudo-counts, the performance of MBCC matches that of LinOp regardless of the number of spammers. On the other hand, with lower values of the pseudo-counts, MBCC has greater flexibility to learn the spammers, at the cost of a greater error when there are fewer spammers in the dataset. Therefore, the added degrees of freedom lead to a tradeoff in accuracy at different ratios of spammers.

Furthermore, we set the number of samples $N$ empirically. Figure 2 (right) shows the average error of the aggregated distribution when increasing the number of samples at a ratio of spammers of 50%. As the number of samples increases, we observe convergence of the error at 100 samples to values of 0.75 and 0.37 for IBCC and MBCC respectively. Inevitably, the running time also increases with the number of samples. In particular, the running time of LinOp and Median is typically 3 ms, while that of IBCC and MBCC ranges from 12 s and 13 s, to 6 min and 28 min respectively, across the range of samples shown in Figure 2 (right).

We now evaluate the accuracy of the confusion-matrix-based models at distinguishing workers from spammers on the SemEval dataset. To do this, we first collect the inferred confusion matrix $\Pi^{(k)}$ for each worker. We then compute the deviation $E_{\Pi^{(k)}}$ of each confusion matrix from the identity matrix (Equation 11). The identity matrix represents the confusion matrix of a perfect worker, that is, one that always gives judgments in accordance with the consensus. We then set a threshold on the error, above which a worker is classified as a spammer. The receiver operating characteristic (ROC) curves in Figure 4 capture the effect of varying this threshold. As can be seen, MBCC has an area under the curve (AUC) of 0.99, a five-times improvement over IBCC in terms of the expected number of misclassified spammers. The AUC eventually decreases for both models at higher ratios of spammers.

We now briefly discuss the results on the two other datasets. On the IAPR-TC12 dataset (Figure 3 (left)), LinOp, Median and MBCC achieve equal accuracies when no spammers are added. However, since each image in this dataset is dominated by a single category, that is, at least one category has more than 50% coverage in each image, IBCC performs reasonably well even with a high ratio of spammers. This is because IBCC always assigns a probability of one to the most likely category, which reflects its assumption that documents have exactly one category. In fact, as the documents' proportions regress towards Kronecker deltas, the error of IBCC decreases. On the Colours dataset (Figure 3 (right)), the accuracy of MBCC remains a lower bound of LinOp's throughout the range of ratios of spammers. However, Median initially achieves the lowest error, with a crossover point with MBCC at 15% spammers. This is because the judgments provided by the workers include sufficient outliers to make the Median a good measure of central tendency. Finally, convergence of the error on the aggregation is reached at 100 samples for the IAPR-TC12 dataset and 150 for the Colours dataset. Since the Colours dataset has 4 more categories than the IAPR-TC12 dataset, it requires more samples to achieve convergence.

5 Conclusions

We introduced a novel model for the aggregation of categorical distributions. The key innovations of our method are the elicitation and sampling of judgments of proportions in the form of probability distributions, and the use of these samples to improve the accuracy of aggregation. In particular, we showed empirically, on three real-world datasets, that our approach outperforms existing methods by up to 28% in terms of accuracy. We have also shown a comparable level of accuracy when 60% of the workers are spammers as other approaches achieve when there are no spammers. Finally, we improved the expected number of misclassified spammers by up to five times compared to existing methods.


References

[Attias, 2000] H. Attias. A Variational Bayesian Framework for Graphical Models. In Advances in Neural Information Processing Systems 12, pages 209–215. MIT Press, 2000.

[Augustin and Venanzi, 2017] A. Augustin and M. Venanzi. MBCC, 2017. https://github.com/alexandry-augustin/mbcc/.

[Bacharach, 1979] M. Bacharach. Normal Bayesian Dialogues. Journal of the American Statistical Association, 74(368):837, 1979.

[Blei et al., 2003] D.M. Blei, A.Y. Ng, and M.I. Jordan. Latent Dirichlet Allocation. Journal of Machine Learning Research, 3:993–1022, 2003.

[Devroye, 1983] L. Devroye. The Equivalence of Weak, Strong and Complete Convergence in L1 for Kernel Density Estimates. The Annals of Statistics, pages 896–904, 1983.

[Dietz, 2010] L. Dietz. Directed Factor Graph Notation for Generative Models. Technical report, Max Planck Institute for Informatics, 2010.

[Difallah et al., 2012] D.E. Difallah, G. Demartini, and P. Cudre-Mauroux. Mechanical Cheat: Spamming Schemes and Adversarial Techniques on Crowdsourcing Platforms. In CrowdSearch, pages 26–30. Citeseer, 2012.

[Escalante et al., 2010] H.J. Escalante, C.A. Hernandez, J.A. Gonzalez, A. Lopez-Lopez, M. Montes, E.F. Morales, L.E. Sucar, L. Villasenor, and M. Grubinger. The Segmented and Annotated IAPR TC-12 Benchmark. Computer Vision and Image Understanding, 114(4):419–428, 2010.

[Geman and Geman, 1984] S. Geman and D. Geman. Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images. IEEE Transactions on Pattern Analysis and Machine Intelligence, PAMI-6(6):721–741, 1984.

[Geng, 2016] X. Geng. Label Distribution Learning. IEEE Transactions on Knowledge and Data Engineering, 28(7):1734–1748, 2016.

[Ho et al., 2016] C.J. Ho, R. Frongillo, and Y. Chen. Eliciting Categorical Data for Optimal Aggregation. In Advances in Neural Information Processing Systems, pages 2442–2450, 2016.

[Hollands and Spence, 1992] J.G. Hollands and I. Spence. Judgments of Change and Proportion in Graphical Perception. Human Factors: The Journal of the Human Factors and Ergonomics Society, 34(3):313–334, 1992.

[Ipeirotis et al., 2010] P.G. Ipeirotis, F. Provost, and J. Wang. Quality Management on Amazon Mechanical Turk. In Proceedings of the ACM SIGKDD Workshop on Human Computation, pages 64–67. ACM, 2010.

[Jordan et al., 1999] M.I. Jordan, Z. Ghahramani, T.S. Jaakkola, and L.K. Saul. An Introduction to Variational Methods for Graphical Models. Machine Learning, 37(2):183–233, 1999.

[Kim and Ghahramani, 2012] H.C. Kim and Z. Ghahramani. Bayesian Classifier Combination. In Proceedings of the 15th International Conference on Artificial Intelligence and Statistics (AISTATS), 2012.

[Laplace, 1986] P.S. Laplace. Memoir on the Probability of the Causes of Events. Statistical Science, 1(3):364–378, 1986.

[Minka, 2013] T.P. Minka. Expectation Propagation for Approximate Bayesian Inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence. arXiv:1301.2294, 2013.

[Oakley, 2010] J. Oakley. Eliciting Univariate Probability Distributions. Rethinking Risk Measurement and Reporting, 1, 2010.

[Snow et al., 2008] R. Snow, B. O'Connor, D. Jurafsky, and A.Y. Ng. Cheap and Fast - But is it Good?: Evaluating Non-expert Annotations for Natural Language Tasks. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pages 254–263. Association for Computational Linguistics, 2008.

[Strapparava and Mihalcea, 2007] C. Strapparava and R. Mihalcea. SemEval-2007 Task 14: Affective Text. In Proceedings of the 4th International Workshop on Semantic Evaluations, pages 70–74. Association for Computational Linguistics, 2007.

[Varey et al., 1990] C.A. Varey, B.A. Mellers, and M.H. Birnbaum. Judgments of Proportions. Journal of Experimental Psychology: Human Perception and Performance, 16(3):613, 1990.

[Venanzi et al., 2014] M. Venanzi, J. Guiver, G. Kazai, P. Kohli, and M. Shokouhi. Community-based Bayesian Aggregation Models for Crowdsourcing. In Proceedings of the 23rd International Conference on World Wide Web, pages 155–164. ACM Press, 2014.

[Vuurens et al., 2011] J. Vuurens, A.P. de Vries, and C. Eickhoff. How Much Spam Can You Take? An Analysis of Crowdsourcing Results to Increase Accuracy. In Proceedings of the ACM SIGIR Workshop on Crowdsourcing for Information Retrieval (CIR'11), pages 21–26, 2011.

[Weerahandi and Zidek, 1981] S. Weerahandi and J.V. Zidek. Multi-Bayesian Statistical Decision Theory. Journal of the Royal Statistical Society, Series A (General), 144(1):85, 1981.

[Winn and Bishop, 2005] J. Winn and C.M. Bishop. Variational Message Passing. Journal of Machine Learning Research, 6:661–694, 2005.
