General-Domain Truth Discovery via Average Proximity

RESHEF MEIR, Technion – Israel Institute of Technology, Israel
OFRA AMIR, Technion – Israel Institute of Technology, Israel
OMER BEN-PORAT, Technion – Israel Institute of Technology, Israel
TSVIEL BEN-SHABAT, Technion – Israel Institute of Technology, Israel
GAL COHENSIUS, Technion – Israel Institute of Technology, Israel
LIRONG XIA, RPI, USA

Truth discovery is a general name for a broad range of statistical methods aimed at extracting the correct answers to questions from multiple answers coming from noisy sources, for example, workers on a crowdsourcing platform. In this paper, we suggest a simple heuristic for estimating workers’ competence using average proximity to other workers. We prove that this estimate approximates the actual competence level well and enables separating high- and low-quality workers in a wide spectrum of domains and statistical models. We then design a simple proximity-based truth discovery algorithm (P-TD) that weighs workers according to their average proximity. The answers to questions may be of different forms, such as real-valued, categorical, rankings, or other complex labels, and P-TD can be combined with any existing aggregation function or voting rule to improve their accuracy. We demonstrate through an extensive empirical study on real and synthetic data that P-TD and its iterative variants outperform other heuristics and state-of-the-art truth discovery methods in the above domains.

“All happy families are alike; each unhappy family is unhappy in its own way.” — Leo Tolstoy, Anna Karenina

Manuscript submitted for review to the 22nd ACM Conference on Economics & Computation (EC'21).

1 INTRODUCTION

Consider a standard crowdsourcing task such as identifying which images contain a person or a car [15], or identifying the location in which pictures were taken [32]. Such tasks are often used to construct large datasets that can later be used to train and test machine learning algorithms. Crowdsourcing workers are usually not experts; thus answers obtained this way often contain many mistakes [38, 39], and multiple answers are aggregated to improve accuracy.

From a theoretical/statistical perspective, “truth discovery" is a general name for a broad range of methods that aim to extract some underlying ground truth from noisy answers. While the mathematics of truth discovery dates back to the early days of statistics, at least to the Condorcet Jury Theorem [12], the rise of crowdsourcing platforms suggests an exciting modern application for truth discovery.

In this paper, we consider an approach to truth discovery that is based on workers’ proximity, with crowdsourcing as our main motivation. This is inspired both by existing methods of truth discovery (see related work), and by the use of voters’ proximity in social choice [11]. We consider a set of workers who answer the same set of questions, where each question has a single correct answer. We want to: (a) identify the more competent and accurate workers; and (b) aggregate workers’ answers such that the outcome will be as close as possible to the true answers. The separation of (a) and (b) is particularly important when considering existing crowdsourcing platforms, which may already use certain aggregation methods due to legacy, simplicity, incentives, or explainability considerations.

We assume that each worker has some underlying fault level 푓푖 that is the expected distance between her answers and the ground truth, so that the “good” workers have low fault. In addition, we define the disparity of each worker to be her average distance from all other workers. Our main theoretical result can be written as follows:

Theorem (Anna Karenina principle, informal). The expected disparity of each worker is roughly linearly increasing in her fault level 푓푖 .

Essentially, the theorem says that as in Tolstoy’s novel, “good workers are all alike” (so their disparities remain low), whereas “each bad worker is bad in her own way” and thus not particularly close to other workers. We explore the conditions under which this principle holds exactly or approximately, and can thus be used to identify good and bad workers. In the most general model, we assume almost nothing about the domain, except that there is some way to measure distances between workers. To the best of our knowledge, our method is the first that has theoretical guarantees that are domain independent. By making progressively more assumptions on the structure of labels and/or the noise model, we can obtain a more accurate estimation of the fault. Armed with this estimate, we can weigh workers when aggregating them in order to significantly improve the accuracy of predicted labels.

1.1 Contribution and paper structure

After the preliminary definitions in Section 2, we prove a formal version of the Anna Karenina principle and show how it can be used to identify poor workers in Section 3, under very mild assumptions on the data and without reliance on any aggregation method. We show how special or relaxed versions of the theorem apply when we add statistical assumptions or alter the distance measure. In particular, when the noise is symmetric, the (expected) disparity provides us with a tight 2-approximation of the true fault level under any distance metric, and with an exact estimate under any squared norm. When the noise is Gaussian, the disparity induces the maximum likelihood estimator.

We then describe a general proxy-based truth discovery algorithm (P-TD) in Section 4, where workers’ weights are set following the results of the previous section, leading to a statistically consistent estimation of the ground truth. We then suggest iterative variants of our algorithm. Finally, we show in Section 5, via extensive testing on real and synthetic data, that the Anna Karenina principle applies in practice, and that our algorithms substantially improve aggregation in a broad range of domains, outperforming specially designed algorithms for specific models. Most proofs, as well as additional empirical results, are relegated to the appendices.

1.2 Related work

The idea of estimating workers’ competence in order to improve aggregation is far from new, and many algorithms have been proposed that aim to do just that (a recent survey is in [28]). In the social choice literature, extensions of the Condorcet Jury Theorem to experts with heterogeneous competence levels were surveyed by Grofman et al. [20].

We can very roughly classify works on truth discovery into two classes. The first class entwines the two estimation problems mentioned above, by iteratively aggregating labels to obtain an estimate of the ground truth, and using that approximate truth to estimate workers’ competence. This approach was pioneered by the EM-style Dawid-Skene estimator [14]. Followup works include [3, 19, 24, 42], which provide formal convergence guarantees on variants of the above iterative approach under specific noise models. In particular, they assume a specific structure of the labels (typically categorical or real-valued). The estimation involves an optimization process, and may or may not have a closed-form solution depending on the statistical assumptions. In the most common domain of categorical labels, the methods above also differ on whether they maintain a soft or hard estimate of the ground truth (e.g., [24] applies the former approach, and [19] applies the latter). The simple algorithm we propose can be thought of as a single-iteration soft estimate that works in general domains. In Section 4.1 we extend our algorithm to multiple iterations, making it competitive with state-of-the-art algorithms while maintaining generality and simplicity.

The second class of algorithms uses spectral methods to infer the competence (and sometimes other latent variables) from the covariance matrix of the workers [33, 45]. There is both a conceptual and a technical connection to our work, as workers’ covariance can be thought of as a special case of pairwise proximity in a binary setting. A key advantage of the spectral approach is that it sometimes enables identifying cartels of like-minded workers trying to bias the outcome [33].

In addition, many other aggregation methods have been developed for particular domains with more complex labels. For example, aggregation of full and partial rankings has been considered in the social choice literature for centuries, and dozens of voting rules have been proposed, including some recent suggestions in the context of crowdsourcing [6, 7, 34]. While these rules often treat all voters symmetrically, they usually have natural weighted variants. Therefore, our suggested method does not compete with voting rules, but rather complements and improves them by assigning suitable weights.

Distance functions in truth discovery. When there are complex labels that are not numbers or categories, but for example contain text, graphics and/or hierarchical structure, there may not be a way to aggregate them but we may still want to evaluate workers’ competence. Two recent works we are aware of suggested using a distance metric as a substitute for general or complex labels. The first is by Li et al. [27], where the distance is calculated from the currently aggregated labels, and not from other workers. Here both the distance function and the aggregation part (estimation of the ground truth) are part of the model. In contrast, our average-proximity approach only deals with estimating competence levels, leaving the choice of aggregation method to the user.

The second paper, which is closer to our approach, is by Braylan and Lease [5]. They emphasized the challenges that such complex labels pose, and proposed to use pairwise distances as an abstraction for complex labeling. Their multidimensional annotation scaling (MAS) model extends the Dawid-Skene model by calculating the labels and competence levels that would maximize the likelihood of the observed distance matrix. Just like our approach (and in contrast to all the algorithms mentioned above), the MAS algorithm can be applied to any dataset that has a well-defined distance measure, regardless of label structure. MAS is the most computationally intensive method of all those we mentioned above: it uses the Stan probabilistic programming language [8] to find a MAP estimate of the solution, and is not even guaranteed to converge. While Braylan and Lease do not discuss aggregation in the paper (they just use the label of the most competent worker), clearly their solution can be combined with any weighted aggregation method.

Other challenges. There are many variants of the truth discovery problem that are outside the scope of the current paper. Yet we should note that due to its simplicity, it is often possible to combine our proximity-based algorithm with various other solutions. In particular, we mention the following challenges for which flexible solutions have been proposed in the literature:
Partial data: In real crowdsourcing, workers often provide input only on subsets of questions. This may pose nontrivial challenges to optimization-based methods (see e.g. [13, 24]).
Semi-supervised learning: Often it is possible to obtain verified answers to some of the questions in the dataset. This can be used in order to further improve the estimate of workers’ competence; see e.g. [43].
Partition to subtasks: The dataset may be partitioned into subtasks on which the workers have varying competence, and there is a need to trade off “local" and “global" competence estimation. See e.g. [5].

Beyond the theoretical and algorithmic work described above, similarity among workers was most commonly used in “Games with a Purpose,” which incorporated the “output agreement” mechanism [23, 37]. Our theoretical results further justify the use of such methods in practice.

2 PRELIMINARIES

We consider a set of 푛 workers 푁, each providing a report in some space 푍. We denote elements of 푍 (typically 푚-length vectors, see below) in bold. Thus an instance of truth discovery is a pair 퐼 = ⟨푆 = (풔푖)푖∈푁, 풛⟩, where 풔푖 ∈ 푍 is the report of worker 푖, and 풛 ∈ 푍 is the ground truth. 푆 is also called a dataset. We denote by ⟦퐶⟧ the indicator variable of a true/false condition 퐶.

Statistical model of the data. We do not make any assumptions regarding the ground truth 풛. A dataset is constructed in three steps, demonstrated in Table 1: we start from a proto-population T, which is a distribution over all possible worker types 푇. From T we sample a finite population of workers with types 푡ì = (푡푖)푖∈푁 (vectors of length 푛 will be marked with an arrow accent). The last step generates the reported answers (the dataset) as follows. The answers that each worker provides depend on three factors: the ground truth, her type, and a random factor. Formally, a noise model is a function Y : 푍 × 푇 → Δ(푍). That is, the report of worker 푖 is a random variable 풔푖 sampled from Y(풛, 푡푖). We note that T, Y and 풛 together induce a distribution Y(풛, T) over answers (and thus over datasets), where 풔 ∼ Y(풛, T) means we first sample a type 푡 ∼ T and then a report 풔 ∼ Y(풛, 푡). Table 2 shows an example of a dataset sampled from the population in Table 1, when the noise model Y is a multivariate independent Normal distribution with mean 풛 and variance 푡푖. This model is known as additive white Gaussian noise (AWG, see [16]), and will be one of the models we consider in this work.

| Step | Notation | Depends on | Example |
|---|---|---|---|
| Proto-population | T | 푇 | 푇 = [0.5, 2], T = 푈([0.5, 2]) |
| Population | 푡ì = (푡푖)푖∈푁 | T, 푛 | 푛 = 5, 푡ì = (0.55, 0.82, 1.08, 1.19, 1.32) |
| Dataset | 푆 = (풔푖)푖∈푁 | 푡ì, 풛, Y | 풛 = (8, −1, 4, 3), 푆 is shown in Table 2 |

Table 1. The three stages of generating a dataset. In this example 푍 = R⁴, a type is a real number in the range [0.5, 2], and T is a uniform distribution over types.

| 푖 \ 푗 | 푡푖 | 1 | 2 | 3 | 4 | 푑(풔푖, 풛) |
|---|---|---|---|---|---|---|
| 풛 | | 8 | −1 | 4 | 3 | 0 |
| 1 | 0.55 | 9.28 | −0.26 | 3.32 | 2.56 | 0.26 |
| 2 | 0.82 | 7.22 | −1.66 | 3.58 | 1.84 | 0.65 |
| 3 | 1.08 | 8.13 | −1.48 | 5.30 | 2.07 | 0.29 |
| 4 | 1.19 | 9.93 | 1.18 | 3.91 | 0.98 | 0.97 |
| 5 | 1.32 | 6.20 | 0.89 | 3.32 | 2.88 | 1.76 |
| 푟푀(푆) | | 8.15 | −0.27 | 3.89 | 2.07 | 0.57 |
| 푟푂(푆) | | 8.31 | −0.40 | 3.80 | 2.13 | 0.01 |

Table 2. An example of a dataset 푆 = (푠푖,푗) sampled from the AWG model. Workers’ types and the ground truth are taken from Table 1. The two bottom rows show aggregated results using the unweighted mean and the oracle-weighted mean, respectively. The rightmost column shows the error, i.e., the distance of every row from the ground truth (using NSED). On the right, we see a graphical representation of 푆 (we use two 2-dimensional plots instead of a 4-dimensional one). The blue X marks the ground truth. Workers’ reports are marked by gray circles, whose size is proportional to 푡푖. The mean 푟푀(푆) is marked by a green diamond.
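To make the generative process concrete, here is a minimal simulation sketch of the three-step process behind Tables 1 and 2, assuming Python with NumPy; the variable names and random seed are ours and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Steps 1-2: proto-population U([0.5, 2]) and a sampled population of n = 5 types (cf. Table 1).
n, m = 5, 4
z = np.array([8.0, -1.0, 4.0, 3.0])            # ground truth
t = np.sort(rng.uniform(0.5, 2.0, size=n))     # workers' types = their variances

# Step 3: AWG noise model -- each report is the truth plus independent N(0, t_i) noise.
S = z + rng.normal(0.0, np.sqrt(t)[:, None], size=(n, m))

def nsed(x, y):
    """Normalized squared Euclidean distance."""
    return np.mean((x - y) ** 2)

mean_agg = S.mean(axis=0)                                  # unweighted mean r^M(S)
w = 1.0 / t                                                # oracle: inverse-variance weights
oracle_agg = (w[:, None] * S).sum(axis=0) / w.sum()        # oracle-weighted mean r^O(S)

print("NSED of unweighted mean:     ", nsed(mean_agg, z))
print("NSED of oracle-weighted mean:", nsed(oracle_agg, z))
```

The oracle-weighted mean typically has a much lower NSED than the unweighted mean, mirroring the last two rows of Table 2.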

Workers’ competence. Competent workers are close to the truth. More formally, given some ground truth 풛 and a distance function 푑, we define the fault level (or incompetence) of a worker 푖 as her expected distance from the ground truth, denoted 푓푖(풛) := 퐸풔푖∼Y(풛,푡푖)[푑(풔푖, 풛)]. We denote by 휇T(풛) := 퐸풔∼Y(풛,T)[푑(풔, 풛)] the mean fault level, omitting T when clear from the context. Note that the higher 휇 is, the more errors we would expect on average.

Most distance functions we consider in this work will be derived from an inner product. Formally, consider an arbitrary symmetric inner product space (푍, ⟨·, ·⟩). This induces a norm ∥풙∥² := ⟨풙, 풙⟩ and a distance measure 푑(풙, 풚) := ∥풙 − 풚∥². A special case of interest is the normalized Euclidean product on 푍 = R푚, defined as ⟨풙, 풚⟩퐸 := (1/푚) Σ_{푗≤푚} 푥푗푦푗; and the corresponding normalized squared Euclidean distance (NSED), a natural way to capture the dissimilarity of two items [9, 10, 26]. It is easy to see that the fault level of a worker in the AWG model under NSED is exactly her variance, that is, 푓푖(풛) = 푡푖 for any 풛.

Aggregation. An aggregation function 푟 : 푍푛 → 푍 maps 푛 workers’ reports to a single answer. Common examples include the mean, majority, plurality, or some ordinal voting rule. Given an instance 퐼 = ⟨푆, 풛⟩ and an aggregation function or algorithm 푟, the error of 푟 on 퐼 is defined as 푑(풛, 푟(푆)). The goal of truth discovery is to find aggregation functions that tend to have low error. E.g., the penultimate row in Table 2 demonstrates the mean aggregation rule 푟푀 and its error.

When the type of every worker is known, for many noise models there are accurate characterizations of the optimal aggregation functions. For example, the optimal aggregation under the AWG model with NSED is taking the mean of workers’ answers, inversely weighted by their variance [2]. Since the true fault levels are not known, we call this the ‘oracle’ aggregation function 푟푂. Indeed we can see in the bottom row of Table 2 that the oracle aggregation has a very low error. Thus our focus in this work is on reasonable estimation of the fault level, which is both important on its own, and leads to better aggregation in a broad range of domains.

3 ESTIMATING FAULT LEVELS

Our key approach relies on estimating 푓푖 from the average distance of worker 푖 from all other workers. Formally, we define

푑푖푖′ := 푑(풔푖, 풔푖′);  and  휋푖 = 휋푖(푆) := (1/(푛 − 1)) Σ_{푖′∈푁\{푖}} 푑푖푖′.

Next, we analyze the relation between 휋푖 (which is a random variable) and 푓푖, which is an inherent property that is deterministically induced by the worker’s type. For an element 풔 ∈ 푍 we consider the induced noise variable 흐풔 := 풔 − 풛. We denote by Y˜(풛, 푡) the distribution of 흐풔 (where 풔 ∼ Y(풛, 푡)). Thus under NSED we have that 푑(풔, 풛) = ∥흐풔∥². We define 풃푖(풛) := 퐸흐푖∼Y˜(풛,푡푖)[흐푖] as the bias of a type-푖 worker, and 풃T(풛) := 퐸흐∼Y˜(풛,T)[흐] as the mean bias of the proto-population. E.g., in Euclidean space 풃푖(풛) is a vector where 푏푖푗(풛) > 0 if 푖 tends to overestimate the answer of question 푗, and negative values mean underestimation. Our main conceptual result is the following connection:

Theorem 1 (Anna Karenina Principle).

퐸푆∼Y(풛,T)푛[휋푖(푆)|푡푖, 풛] = 푓푖(풛) + 휇T(풛) − 2⟨풃푖(풛), 풃T(풛)⟩.

The proof is rather straightforward and is relegated to Appendix B. In particular, it shows by direct computation that the expectation of 푑푖푖′ := 푑(풔푖, 풔푖′) for every pair of workers is

퐸[푑푖푖′ |푡푖, 푡푖′, 풛] = 푓푖(풛) + 푓푖′(풛) − 2⟨풃푖(풛), 풃푖′(풛)⟩. (1)

Despite (or perhaps because of) its simplicity, the principle above is highly useful for estimating workers’ competence. Indeed, it seems that 휋푖 is increasing in 푓푖, so a naïve approach to estimating 푓푖 from the data is to set 푓ˆ푖 to be some increasing function of 휋푖. However, there are several obstacles we need to overcome in order to get a good estimate of 푓푖 using this approach. In particular: concentration bounds; dependency on the ground truth 풛; estimation of 휇; and the biases that appear in the last term. All of these will be tackled in the next sections. For concreteness, we assume in the remainder of the paper, except where explicitly stated otherwise (see Sec. 3.5), that the inner product space (푍, ⟨·, ·⟩) is R푚, and thus 푑 is NSED.
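The disparity 휋푖 and the relation in Theorem 1 are easy to check numerically. The following sketch (ours, assuming NumPy, with zero-bias Gaussian noise so that the last term of Theorem 1 vanishes) compares each worker's disparity to 푓푖 + 휇:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 30, 500                                  # many questions so the disparities concentrate
z = rng.normal(0.0, 3.0, size=m)
f = rng.uniform(0.5, 2.0, size=n)               # under NSED, the fault of worker i is her variance
S = z + rng.normal(0.0, np.sqrt(f)[:, None], size=(n, m))

# Pairwise NSED matrix and disparities pi_i (average distance to the *other* workers).
D = np.mean((S[:, None, :] - S[None, :, :]) ** 2, axis=2)
pi = D.sum(axis=1) / (n - 1)                    # the zero diagonal drops out of the sum

mu = f.mean()                                   # (empirical) mean fault of the population
# With zero bias, Theorem 1 predicts E[pi_i] = f_i + mu; the gap below should be small.
print("max deviation from f_i + mu:", np.abs(pi - (f + mu)).max())
```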

Concentration bounds. With a sufficiently large population and independent questions, 휋푖(푆) is close to its expectation w.h.p. (see Appendix G for details):

Proposition 2. Suppose the fourth moment of each distribution Y˜(풛, 푡), 푡 ∈ 푇, is bounded, and that (푠푖푗) are conditionally i.i.d. on 푡푖, 푧푗. Denote 휋∗푖 := 퐸푆∼Y(풛,T)푛[휋푖(푆) | 풛, 푡푖] for some 풛 ∈ R푚, 푡푖 ∈ 푇. For 훽, 훿 > 0, if 푛, 푚 > Ω(1/(훿훽²)), then Pr푆∼Y(풛,T)푛[|휋푖(푆) − 휋∗푖| > 훽 | 풛, 푡푖] < 훿.

Truth Invariance. In general, the noise may depend on 풛. However, often this is not the case, and we formally define the following condition:

Definition 1 (Truth Invariance). A noise model Y is truth invariant (TI) if for any 풛, 풛′ ∈ R푚 there is 흃 ∈ {−1, 1}푚 such that 푃푟흐∼Y˜(푡,풛)[∀푗, 휖푗 = 휖∗푗] = 푃푟흐′∼Y˜(푡,풛′)[∀푗, 휖′푗 = 휖∗푗휉푗] for all types 푡 ∈ 푇 and any 휖∗푗 ∈ R.

Truth invariance means that under different ground truths we have exactly the same noise distribution (of all types), up to reflection.¹ Intuitively, reflection should not affect pairwise distances, and this is indeed the case. The following version of our result is the main one we will use throughout the paper:

¹An example for violation of truth invariance is a yes/no question with different false positive and false negative rates.

Theorem 3 (Anna Karenina Principle for TI Models). For a truth invariant noise model, and any 풛, 퐸푆∼Y (풛,T)푛 [휋푖 (푆)|푡푖 ] = 푓푖 +휇T −2 ⟨풃푖, 풃T ⟩.

Symmetric Noise. A trivial implication of Theorem 3 is for symmetric noise models.

Corollary 4. If Y is truth invariant and 풃T = 0 then 퐸[휋푖 (푆)|푡푖 ] = 푓푖 + 휇T for all 푖.

This means that given 휇T and enough samples for a sufficiently accurate estimation of the expected disparity, we can retrieve workers’ exact fault level with high accuracy. Cor. 4 applies in particular to the AWG model.

3.1 Model-free bounds

What can be said without symmetry or other assumptions on the model? We argue that we can at least tell particularly poor workers from good workers (this also holds without TI):

Corollary 5. Consider a “bad" worker 푖∗ with 푓푖∗ > 9휇T, and a “good" worker 푖∗∗ with 푓푖∗∗ < 휇T. Then 퐸[휋푖∗] > 퐸[휋푖∗∗]. No better separation is possible (i.e., there is an instance where all the inequalities become equalities).

The proof relies on the following lemmas. The first one follows from the Cauchy-Schwarz and Jensen inequalities, and the second is due to the Anna Karenina principle.

Lemma 6. For all 푖, |⟨풃푖, 풃T⟩| ≤ √(푓푖휇).

Lemma 7. For any worker 푖 and any 훾 ≥ 0, if 푓푖 = 훾휇, then 퐸[휋푖] ∈ (1 ± √훾)²휇.

Proof. For the upper bound,

퐸[휋푖 |푓푖] = 푓푖 + 휇 − 2⟨풃푖, 풃T⟩   (by Thm. 1)
≤ 푓푖 + 휇 + 2√(푓푖휇)   (by Lemma 6)
≤ 훾휇 + 휇 + 2√(훾휇)·√휇 = (1 + 훾 + 2√훾)휇 = (1 + √훾)²휇.   (since 푓푖 ≤ 훾휇)

The proof of the lower bound is symmetric. □

Hammer-Spammer. A special case of interest is when there are only two types of workers, a situation known as “hammer-spammer" [24]. In contrast to [24], we do not assume labels are binary or categorical. All hammers have the same fault 푓퐻 and bias 풃퐻, and likewise for the spammers (denoted 푓푃, 풃푃). Denote ℎ := ∥풃퐻∥ and 푝 := ∥풃푃∥. Also denote by 훼 ∈ [0, 1] the fraction of spammers in the population.

A trivial case where separation is easy is when the hammers are perfect, i.e., 푓퐻 = 0. Then by Lemma 7 with 훾 = 0, we know that their expected disparity is exactly 퐸[휋퐻] = 휇 = 훼푓푃, whereas the expected disparity of each spammer is 퐸[휋푃] = 휇 + 푓푃 = (1 + 훼)푓푃. Clearly 퐸[휋푃] > 퐸[휋퐻].

We next present more general conditions that (each) guarantee we can distinguish the good workers from the bad ones. This is possible either if the fraction of spammers 훼 is sufficiently small (condition (a)), or as long as they have a sufficiently low bias (conditions (b) and (c)). The proof once again follows from the Anna Karenina theorem via Lemma 7.

Corollary 8. Suppose there are two types of workers, with 푓푃 > 푓퐻. For all 푖∗ ∈ 푁푃, 푖∗∗ ∈ 푁퐻, if any of the following conditions apply, then 퐸[휋푖∗] > 퐸[휋푖∗∗]:
(a) ℎ = 0 and 훼 < (푓푃 − 푓퐻)/(2푝²); (b) 푓푃 − 푓퐻 > 푝² − ℎ²; (c) 푝 ≤ ℎ.

3.2 The AWG model

Since AWG is a symmetric and truth-invariant noise model, we know from Cor. 4 that 퐸[휋푖(푆)|푡푖] = 푓푖 + 휇. Thus if we have a good estimate 휇ˆ of 휇, setting 푓ˆ푖 := 휋푖 − 휇ˆ is a reasonable heuristic. However, ideally we would want something better: the maximum likelihood estimator of 푓푖. In this section we show that under a slight relaxation of the AWG model, the heuristic above is the MLE.

Denote 휇¯ := (1/(2푛)) Σ_{푖∈푁} 휋푖(푆), that is, half the average pairwise distance. Note that for a symmetric noise model we have by Cor. 4 that 퐸[Σ_{푖∈푁} 휋푖(푆)] = 2푛휇, and thus 휇¯ is an unbiased estimator of 휇. We thus set 푓ˆ푁푃푖 := 휋푖 − 휇¯, where NP stands for Naïve Proxy.

Computing the MLE. Recall from Eq. (1) that for symmetric noise models, we have 퐸[푑푖푖′ |푡푖, 푡푖′] = 푓푖 + 푓푖′. However, even under the AWG model, the pairwise distances (and hence the disparities) are correlated. For the analysis, we will neglect these correlations, and assume that the pairwise distances are all independent conditional on workers’ types. More formally, under the Pairwise Additive White Gaussian (pAWG) model, 푑푖푖′ = 푓푖 + 푓푖′ + 휖푖푖′, where all 휖푖푖′ are sampled i.i.d. from a normal distribution with mean 0 and unknown variance. Ideally, we would like to find 푓b = (푓ˆ푖)푖∈푁 ² that minimize the estimation errors 휖푖푖′, which is an Ordinary Least Squares (regression) problem. We next show how to derive a closed-form solution for the maximum likelihood estimator of the fault levels, also allowing for a regularization term with coefficient 휆. The theorem might be of independent interest, as it allows us to estimate the diagonal of a noisy matrix created by adding a vector to itself (an “outer sum” matrix).

Theorem 9. Let 휆 ≥ 0, and let 퐷 = (푑푖,푖′)푖,푖′∈푁 be an arbitrary symmetric nonnegative matrix (representing pairwise distances). Then

argmin_{푓b} Σ_{푖,푖′∈푁:푖≠푖′} (푓ˆ푖 + 푓ˆ푖′ − 푑푖푖′)² + 휆∥푓b∥² = ( 2(푛 − 1)휋ì − (8푛(푛 − 1)/(4푛 + 휆 − 4)) 휇¯ ) / (2푛 + 휆 − 4).

Before proving the theorem, we describe its implications.

Corollary 10. The maximum likelihood estimator of 푓ì in the pAWG model is proportional to: (a) 푓b푀퐿 := 휋ì − (푛/(푛 − 1))휇¯ for 휆 = 0; (b) 푓b푁푃 := 휋ì − 휇¯ for 휆 = 4.

That is, our heuristic estimate 푓b푁푃 is in fact the optimal solution of the (regularized) pAWG model. Further, by setting 휆 = 0 (no regularization) we get a slight variation of this heuristic that we name 푓b푀퐿 (for ‘maximum likelihood’). We compare these estimates empirically in Appendix C, showing (as expected) that the regularized version works better for small datasets but the unregularized estimate 푓b푀퐿 quickly converges to the correct estimate as we increase 푚.

Proof of Theorem 9 for 휆 = 0. Let 푃 be a list of all 푛² − 푛 ordered pairs of [푛] (without the main diagonal) in arbitrary order. Setting 휆 = 0, we are left with the following least squares equation:

min_{푓b} Σ_{푖,푖′} (푓ˆ푖 + 푓ˆ푖′ − 푑푖푖′)² = min_{푓b} ∥퐴푓b − 풅∥²,

where 퐴 is a |푃| × 푛 matrix with 푎푘푖 = 1 iff 푖 ∈ 푃푘, and 풅 is a |푃|-length vector with 푑푘 = 푑푖푖′ for 푃푘 = (푖, 푖′).

Fortunately, 퐴 has a very specific structure that allows us to obtain the above closed-form solution. Note that every row of 퐴 has exactly two ‘1’ entries, at the row index 푖 and column index 푖′ of 푃푘; the total number of ‘1’ entries is 2(푛² − 푛); there are 2푛 − 2 ones in every column; and every two

²We use a wide hat 푓b instead of a hat+arrow accent.

distinct columns 푖, 푖′ share exactly two non-zero entries (at the rows 푘 s.t. 푃푘 = (푖, 푖′) and 푃푘 = (푖′, 푖)). This means that (퐴푇퐴) has 2푛 − 2 on the diagonal and 2 in any other entry. The optimal solution for ordinary least squares is obtained at 푓b such that (퐴푇퐴)푓b = 퐴푇풅. Thus

[퐴푇풅]푖 = 2 Σ_{푖′≠푖} 푑푖푖′ = 2(푛 − 1)휋푖,   (2)

and [(퐴푇퐴)푓b]푖 = (2푛 − 2)푓ˆ푖 + 2 Σ_{푖′≠푖} 푓ˆ푖′ = 2(푛 − 2)푓ˆ푖 + 2 Σ_{푖′∈푁} 푓ˆ푖′.   (3)

Denote

훼 := Σ_{푖∈푁} [퐴푇풅]푖 = 2 Σ_{푖∈푁} Σ_{푖′≠푖} 푑푖푖′ = 4푛(푛 − 1)휇¯,   (4)

then

훼 = Σ_{푖∈푁} [(퐴푇퐴)푓b]푖 = Σ_{푖∈푁} [2(푛 − 2)푓ˆ푖 + 2 Σ_{푖′∈푁} 푓ˆ푖′]   (by Eq. (3))
  = 2(푛 − 2) Σ_{푖∈푁} 푓ˆ푖 + 2 Σ_{푖′∈푁} 푓ˆ푖′ (Σ_{푖∈푁} 1) = 4(푛 − 1) Σ_{푖∈푁} 푓ˆ푖.   (5)

We can now write the 푛 linear equations as

2(푛 − 1)휋푖 = [퐴푇풅]푖 = [(퐴푇퐴)푓b]푖 = (2푛 − 4)푓ˆ푖 + 2 Σ_{푖′∈푁} 푓ˆ푖′   (by Eqs. (2),(3))
⟺ (푛 − 1)휋푖 = (푛 − 2)푓ˆ푖 + Σ_{푖′∈푁} 푓ˆ푖′
⟺ 푓ˆ푖 = ((푛 − 1)/(푛 − 2))휋푖 − (1/(푛 − 2)) Σ_{푖′∈푁} 푓ˆ푖′ = ((푛 − 1)/(푛 − 2))휋푖 − (1/(푛 − 2)) · 훼/(4(푛 − 1))   (by (5))
  = ((푛 − 1)/(푛 − 2))휋푖 − (푛/(푛 − 2))휇¯ = ((푛 − 1)/(푛 − 2))푓ˆ푀퐿푖,   (by (4))

as required. □
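A small numerical sketch of the two estimators in Corollary 10 (assuming NumPy; the helper names are ours). It also solves the least-squares system directly to check the closed form, up to the proportionality constant (푛 − 1)/(푛 − 2) derived above:

```python
import numpy as np

def fault_estimates(D):
    """Naive-proxy and 'maximum likelihood' fault estimates of Section 3.2,
    computed from a symmetric pairwise-distance matrix D with a zero diagonal."""
    n = D.shape[0]
    pi = D.sum(axis=1) / (n - 1)              # disparities
    mu_bar = pi.sum() / (2 * n)               # half the average disparity
    f_np = pi - mu_bar                        # regularized solution (lambda = 4)
    f_ml = pi - n / (n - 1) * mu_bar          # unregularized solution (lambda = 0)
    return f_np, f_ml

def least_squares_faults(D):
    """Solve the lambda = 0 least-squares problem directly: (A^T A) f = A^T d."""
    n = D.shape[0]
    AtA = 2.0 * np.ones((n, n)) + (2 * n - 4) * np.eye(n)   # 2n-2 on the diagonal, 2 elsewhere
    Atd = 2.0 * D.sum(axis=1)                               # = 2(n-1) * pi
    return np.linalg.solve(AtA, Atd)

# Sanity check on an arbitrary symmetric matrix (n = 6): the direct solution equals
# (n-1)/(n-2) times f^ML, as in the last line of the proof above.
rng = np.random.default_rng(2)
X = rng.random((6, 6))
D = (X + X.T) / 2
np.fill_diagonal(D, 0.0)
_, f_ml = fault_estimates(D)
print(np.allclose(least_squares_faults(D), (6 - 1) / (6 - 2) * f_ml))   # True
```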

3.3 Categorical labels

Categorical labels are perhaps the most common domain where truth discovery is applied. In this domain both the ground truth 풛 and workers’ answers 풔1, . . . , 풔푛 are given as vectors in {1, 2, . . . , 푘}푚. In order to apply the Anna Karenina principle, we need to show all conditions apply. Formally, we embed each categorical question in the Euclidean space R푘, by identifying each label with one of the 푘 unit vectors (times a constant); thus 푚 categorical questions are embedded in an 푚 × 푘-dimensional Euclidean space: for every 풙 ∈ [푘]푚 let 풔(풙) be the embedding of 풙, where 푠푗,ℓ(풙) = √(푘/2) if 푥푗 = ℓ and 0 otherwise. Then the NSED and Hamming distance coincide:

Lemma 11. For every two vectors 풙, 풚 ∈ [푘]푚, 푑퐸(풔(풙), 풔(풚)) = (1/푚) Σ_{푗≤푚} ⟦푥푗 ≠ 푦푗⟧.

In the most general definition of a type, 푡푖 may be an arbitrary confusion matrix for every answer. We assume the Independent Errors (IER) noise model where: (a) each worker 푖 has some fixed error probability 푓∗푖; and (b) conditional on being wrong, all workers have exactly the same distribution over wrong answers (although some wrong answers may be more likely than others).³ We denote by 훾ℓ := 푃푟[푠푖푗 = ℓ | 푠푖푗 ≠ 푧푗] and 휃 := Σ_{ℓ≤푘} (훾ℓ)². That is, 휃 is the probability that two workers who are wrong select the same answer. E.g., if all 푘 − 1 false answers are equally likely, then 휃 = 1/(푘 − 1).

Lemma 12. The IER noise model is truth invariant.

We can therefore apply Theorem 3 to obtain a concrete estimation of the fault levels.

³In the binary case this is known as the one coin or the Dawid-Skene model.

Corollary 13 (Anna Karenina Principle for the IER model). For categorical questions with the IER noise model, 퐸[휋푖(푆)] = 휇 + (1 − (1 + 휃)휇)푓푖. In particular we get 퐸[휋푖(푆)] = 휇 + (1 − 2휇)푓푖 in the binary one-coin case.

Proof. By Lemma 11, 푓푖(풛) = 퐸[푑(풔푖, 풛)] = (1/푚) Σ_{푗≤푚} 푃푟[푠푖푗 ≠ 푧푗] = (1/푚) Σ_{푗≤푚} 푓∗푖 = 푓∗푖. Note that 푓푖 does not depend on 풛.

As for the bias, it gets a bit more complicated due to the added dimensions. Due to truth-invariance, assume w.l.o.g. that 푧푗 = 1 and denote Y푖 := Y(풛, 푡푖). For every false label ℓ ≠ 1 = 푧푗 (meaning 푧푗,ℓ = 0), we have

푏푖,푗,ℓ(풛) = 퐸풔푖∼Y푖[푠푖,푗,ℓ − 푧푗,ℓ] = √(푘/2) 푃푟풔푖∼Y푖[푠푖푗 = ℓ] = √(푘/2) 푓∗푖훾ℓ = √(푘/2) 훾ℓ푓푖.

For the true label ℓ = 푧푗 = 1 (where 푧푗,ℓ = √(푘/2)), we have

푏푖,푗,ℓ = 퐸풔푖∼Y푖[푠푖,푗,ℓ − 푧푗,ℓ] = −√(푘/2) 푃푟풔푖∼Y푖[푠푖,푗,ℓ = 0] = −√(푘/2) 푃푟풔푖∼Y푖[푠푖푗 ≠ 푧푗] = −√(푘/2) 푓푖.

This entails 푏T,푗,ℓ = √(푘/2) 훾ℓ휇 for ℓ ≠ 1 and 푏T,푗,1 = −√(푘/2) 휇. Finally, we get from Thm. 3 that

퐸[휋푖] = 푓푖 + 휇 − 2⟨풃푖, 풃T⟩ = 푓푖 + 휇 − (2/(푘푚)) Σ_{푗≤푚} Σ_{ℓ≤푘} 푏푖,푗,ℓ푏T,푗,ℓ
= 푓푖 + 휇 − (2/(푘푚)) Σ_{푗≤푚} (−√(푘/2) 푓푖)(−√(푘/2) 휇) − (2/(푘푚)) Σ_{푗≤푚} Σ_{2≤ℓ≤푘} (√(푘/2) 훾ℓ푓푖)(√(푘/2) 훾ℓ휇)
= 푓푖 + 휇 − (1/푚) Σ_{푗≤푚} 푓푖휇 − (1/푚) Σ_{푗≤푚} Σ_{2≤ℓ≤푘} 훾ℓ푓푖훾ℓ휇
= 푓푖 + 휇 − 푓푖휇 − 푓푖휇 Σ_{2≤ℓ≤푘} (훾ℓ)² = 푓푖 + 휇 − 푓푖휇 − 푓푖휇휃 = 휇 + (1 − (1 + 휃)휇)푓푖,

as required. □
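Inverting the linear relation of Cor. 13 turns Hamming-distance disparities into fault estimates (this is essentially the EstimateFault used by P-TD for categorical labels in Section 4). A minimal sketch, assuming NumPy; the function and variable names are ours, and 휇 is estimated by half the average disparity:

```python
import numpy as np

def estimate_fault_categorical(S, theta):
    """Fault estimates for categorical answers S (an n x m array of labels),
    obtained by inverting E[pi_i] = mu + (1 - (1 + theta) * mu) * f_i (Cor. 13)."""
    n = S.shape[0]
    D = (S[:, None, :] != S[None, :, :]).mean(axis=2)   # pairwise Hamming distances
    pi = D.sum(axis=1) / (n - 1)
    mu_hat = pi.sum() / (2 * n)                          # simple (conservative) estimate of mu
    return np.maximum(0.0, (pi - mu_hat) / (1.0 - (1.0 + theta) * mu_hat))

# Quick check on simulated one-coin data (k = 2, so theta = 1/(k-1) = 1).
rng = np.random.default_rng(3)
n, m = 40, 200
z = rng.integers(2, size=m)
f_true = rng.uniform(0.05, 0.45, size=n)
wrong = rng.random((n, m)) < f_true[:, None]
S = np.where(wrong, 1 - z, z)                            # flip the binary label when wrong
print(np.corrcoef(f_true, estimate_fault_categorical(S, theta=1.0))[0, 1])  # close to 1
```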

3.4 Ranking Domain

Let L(퐶) be the set of all permutations over a set of candidates 퐶. Any ranking (permutation) 퐿 ∈ L(퐶) corresponds to a binary vector 풙퐿 ∈ {0, 1}푚, where 푚 = |퐶|(|퐶| − 1)/2 (i.e., each dimension is a pair of candidates) and 0 means the first candidate is ranked higher. E.g., the order 퐿 = (푐2, 푐1, 푐3, 푐4) corresponds to 풙퐿 = (1, 0, 0, 0, 0, 0), since the first entry compares 푐1, 푐2, which are in reversed order. If there is 퐿 ∈ L(퐶) s.t. 풙퐿 = 풙 then we denote it 퐿풙. In particular we denote by 퐿풛 the ground truth order over 퐶. A natural metric over rankings is the Kendall-tau distance (a.k.a. swap distance), which coincides with the Hamming distance on the binary representation [25], and thus with NSED by Lemma 11.

Independent Condorcet Noise model (ICN). According to the ICN model [44], an agent of type 푡푖 ∈ [0, 1] observes a vector 풔푖 ∈ {0, 1}푚 where for every pair of candidates 푗 = (푎, 푏), we have 푠푖푗 ≠ 푧푗 with probability 푡푖.

By definition, the ICN model is a special case of the binary IER model, where 푚 = |퐶|(|퐶| − 1)/2, and the fault level of a voter is 푓푖 = 푡푖. We get that 퐸[휋푖(푆)|푡푖] = 휇 + (1 − 2휇)푓푖 as an immediate corollary of Cor. 13.
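The encoding of rankings as binary pairwise vectors, and the resulting Kendall-tau-as-Hamming distance, can be written in a few lines. This is an illustrative sketch (ours, assuming NumPy), reproducing the example 퐿 = (푐2, 푐1, 푐3, 푐4) from the text:

```python
import numpy as np
from itertools import combinations

def pairwise_vector(ranking):
    """Binary encoding of a ranking over candidate pairs (a, b), a < b in name order:
    0 if a is ranked above b, 1 otherwise."""
    pos = {c: i for i, c in enumerate(ranking)}
    return np.array([0 if pos[a] < pos[b] else 1
                     for a, b in combinations(sorted(pos), 2)])

def kendall_tau(r1, r2):
    """Kendall-tau (swap) distance as the normalized Hamming distance between encodings."""
    return np.mean(pairwise_vector(r1) != pairwise_vector(r2))

# The example from the text: L = (c2, c1, c3, c4) disagrees with (c1, c2, c3, c4) only on {c1, c2}.
print(pairwise_vector(["c2", "c1", "c3", "c4"]))                              # [1 0 0 0 0 0]
print(kendall_tau(["c2", "c1", "c3", "c4"], ["c1", "c2", "c3", "c4"]))        # 1/6 ~ 0.167
```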

Fig. 1. Each scatterplot presents all workers in a single instance, with their disparity 휋푖 vs. their real error 푑(풔푖, 풛). In the top row we have a generated instance for each of four domains and their respective distance measures: continuous (with NSED); categorical (with Hamming distance); ranking (with Kendall-tau distance); and interval matching (with Jaccard distance). In the bottom row we have four real datasets from these domains.

3.5 General Distance Metrics

Suppose that 푑 is an arbitrary distance metric over a space 푍, 풛 ∈ 푍 is the ground truth, and 풔푖 ∈ 푍 is the report of worker 푖. 푓푖 and 휋푖 are defined as before. We say that the noise model Y is symmetric if there is an invertible mapping 휙 : 푍 → 푍, such that for all 풙, 푑(풙, 휙(풙)) = 2푑(풙, 풛), and for all 푡 and 푋 ⊆ 푍, 푃푟풔∼Y(풛,푡)[풔 ∈ 푋] = 푃푟풔∼Y(풛,푡)[풔 ∈ 휙(푋)]. In words, for every point 풙 there is an equally-likely point 휙(풙) that is on “the other side" of 풛. The AWG model is symmetric by this definition under any 퐿푝 norm, by setting 휙(풙) = 2풛 − 풙. Note that this is a stronger symmetry requirement than the one we had on NSED, and in particular it entails 풃푖 = 0 for every worker.

Theorem 14 (Anna Karenina principle for symmetric noise and distance metrics). If 푑 is any distance metric and Y is symmetric, then max{휇, 푓푖} ≤ 퐸[휋푖(푆)] ≤ 휇 + 푓푖.

In contrast to all previous results that follow from Theorem 1, the last theorem requires a separate proof, as there is no assumption on the model other than symmetry. In fact, the upper bound holds regardless of symmetry. An immediate corollary of Theorem 14 is that for poor workers with 푓푖 ≥ 휇, 휋푖 is a 2-approximation of 푓푖 (up to noise). See details and a tightness proof in Appendix F.

Empirical evidence. Fig. 1 shows the tight relation between the observed disparity 휋푖 and the true (unknown) distance to the ground truth 풛. We can see that there is a very strong, almost-linear relation, irrespective of the domain or the distance function used. This holds both for synthetic data to which our theoretical results apply, and for real data.⁴

4 AGGREGATION

Our Proximity-based Truth Discovery (P-TD) algorithm, described in Alg. 1, assigns to each worker 푖 a disparity 휋푖 (her average distance to other workers), and then uses 휋푖 to derive the estimated fault level 푓ˆ푖 and weight 푤푖. The implementation of 퐸푠푡푖푚푎푡푒퐹푎푢푙푡() is guided by the appropriate Anna Karenina theorem, when available. Similarly, known results on weighted aggregation in each domain guide the implementation of the 푆푒푡푊푒푖푔ℎ푡() function.

⁴In schemaMatching, users were asked to align the attributes of two database schemata [36]. Intervals is a synthetic dataset we constructed where 풛 is a set of intervals, and each report 풔푖 is a noisy version of it (see Appendix J). The other datasets are explained in Section 5.

ALGORITHM 1: (P-TD)
    Input: Dataset 푆 ∈ 푋푛.
    Output: Est. fault levels 푓b ∈ R푛; answers 풛ˆ ∈ 푋.
    Compute 푑푖푖′ ← 푑(풔푖, 풔푖′) for every pair of workers;
    For each worker 푖 ∈ 푁, set 휋푖 ← (1/(푛 − 1)) Σ_{푖′≠푖} 푑푖푖′;
    Set 휇ˆ ← (1/(2푛)) Σ_{푖∈푁} 휋푖;
    for each worker 푖 ∈ 푁 do
        Set 푓ˆ푖 ← 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푖, 휇ˆ);
        Set 푤푖 ← 푆푒푡푊푒푖푔ℎ푡(푓ˆ푖);
    end
    Set 풛ˆ ← 푟(푆, 푤ì);

ALGORITHM 2: (D-TD)
    Input: Dataset 푆 ∈ 푋푛.
    Output: 푓b0 ∈ R푛; 풛ˆ ∈ 푋.
    Set 풚 ← 푟(푆);
    for each worker 푖 ∈ 푁 do
        Set 푓ˆ0푖 ← 푑(풚, 풔푖);
        Set 푤푖 ← 푆푒푡푊푒푖푔ℎ푡(푓ˆ0푖);
    end
    Set 풛ˆ ← 푟(푆, 푤ì);
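For concreteness, here is a sketch of Alg. 1 instantiated for continuous reports with NSED, the inverse-fault weights and the weighted mean rule discussed below (assuming NumPy; the small clipping constant is an implementation guard of ours, not part of the algorithm):

```python
import numpy as np

def p_td(S, variant="NP"):
    """Sketch of Alg. 1 for continuous reports with NSED and the weighted mean rule.
    variant="NP" uses f_i = pi_i - mu_bar; variant="ML" uses f_i = pi_i - n/(n-1) * mu_bar."""
    n = S.shape[0]
    D = np.mean((S[:, None, :] - S[None, :, :]) ** 2, axis=2)   # pairwise NSED
    pi = D.sum(axis=1) / (n - 1)                                # disparities
    mu_bar = pi.sum() / (2 * n)
    f_hat = pi - (mu_bar if variant == "NP" else n / (n - 1) * mu_bar)
    f_hat = np.maximum(f_hat, 1e-9)     # guard so that 1/f_hat stays finite (not part of Alg. 1)
    w = 1.0 / f_hat                     # SetWeight for the continuous/NSED case
    z_hat = (w[:, None] * S).sum(axis=0) / w.sum()              # weighted mean r(S, w)
    return f_hat, z_hat

# Example on an AWG sample: P-TD should beat the unweighted mean on average.
rng = np.random.default_rng(4)
n, m = 20, 30
z = rng.normal(size=m)
t = rng.uniform(0.5, 2.0, size=n)
S = z + rng.normal(0.0, np.sqrt(t)[:, None], size=(n, m))
_, z_hat = p_td(S)
print(np.mean((S.mean(axis=0) - z) ** 2), np.mean((z_hat - z) ** 2))
```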

In particular:

Continuous labels with NSED: We know from [2] that, at least for Gaussian noise, the weighted mean with 푤푖 ∝ 1/푓푖 is the maximum likelihood estimator of 풛, and it is therefore justified to set 푆푒푡푊푒푖푔ℎ푡(푓ˆ푖) = 1/푓ˆ푖. As for estimating the fault level, Cor. 10(a) suggests 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푖, 휇ˆ) = 휋푖 − (푛/(푛 − 1))휇ˆ = 휋푖 − (푛/(푛 − 1))휇¯ = 푓ˆ푀퐿푖. If we want to be more conservative, we may use a regularized version, where a natural candidate is 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푖, 휇ˆ) = 휋푖 − 휇ˆ = 휋푖 − 휇¯ = 푓ˆ푁푃푖 by Cor. 10(b). This induces two variants of the P-TD algorithm that we name P-TD푀퐿 and P-TD푁푃, respectively.

Categorical labels: In the categorical IER model with 푘 categories, the maximum likelihood estimator is a weighted plurality vote with weights 푤퐺푖 := log((1 − 푓푖)(푘 − 1)/푓푖) [4, 20].⁵ For the binary case, we refer to 푤퐺푖 = log((1 − 푓푖)/푓푖) as Grofman weights. By reversing the linear relation in Cor. 13 (Sec. 3.3), we set 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푖, 휇ˆ) = max{0, (휋푖 − 휇ˆ)/(1 − (1 + 휃)휇ˆ)}; and 푆푒푡푊푒푖푔ℎ푡(푓ˆ푖) = log((1 − 푓ˆ푖)(푘 − 1)/푓ˆ푖).

Rankings: For voters/workers with equal competence, the maximum likelihood estimator is the Kemeny-Young voting rule [44]. Due to the equivalence of the ICN and IER noise models, it sounds natural to guess that Kemeny-Young voting with Grofman weights is optimal for heterogeneous workers. We are not aware of such a result, but we prove that this is approximately true; see Appendix E for details. We thus define both 퐸푠푡푖푚푎푡푒퐹푎푢푙푡() and 푆푒푡푊푒푖푔ℎ푡() as in the binary labels case, and heuristically apply this regardless of the voting rule in use.

General distance metrics: In some domains there are complex labels (parse trees, free-form plots, etc.) with no straightforward way to aggregate them. Yet if we properly estimate workers’ competence, we can always use the label of the most competent worker to boost quality [5]. Formally, this amounts to setting a weight of 1 for the most competent worker and 0 for all others. Following Thm. 14, we recommend setting 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푖) = 휋푖 (ignoring 휇ˆ), which is a conservative heuristic estimate of the fault. We extend the recommendation of heuristically setting 푓b = 휋ì to any domain where we do not have a more accurate estimation.

Consistency. An immediate question is whether the P-TD algorithm is consistent. In the simple AWG model the answer is positive.

Theorem 15 (Consistency of P-TD (AWG model)). When Y has bounded support and 풛 is bounded, P-TD with the weighted mean rule 푟 is consistent. That is, as 푛 → ∞, 푚 = 휔(log(푛)) and 푛 = 휔(log푚):

⁵Grofman et al. [20] attribute the result for 푘 = 2 to several papers by different authors from the 1960s and 1970s. Ben-Yashar and Paroush [4] extended it to multiple categories.

• For any constant 훿 > 0, 푃푟[|푓ˆ푖 − 푓푖| > 훿] → 0 for all 푖 ∈ 푁;
• For any constant 휏 > 0, 푃푟[푑(풛ˆ, 풛) > 휏] → 0.

However, the consistency proof relies on the consistency of the 휇ˆ estimator for 휇. While our estimate 휇¯ is indeed consistent in the AWG model, in the IER model 퐸[휇¯] < 휇, and thus 휇¯ is inconsistent but conservative.⁶ For completeness we also provide a consistent estimate 휇ˇ for 휇 in the IER model, but we recommend using the more conservative 휇¯. Formally, 휇ˇ = 1/(1 + 휃) − √( (1/(1 + 휃))² − 푑ˆ/(1 + 휃) ), where 푑ˆ := (1/⌊푛/2⌋) Σ_{푖≤⌊푛/2⌋} 푑(2푖−1),2푖. See Appendix G for proofs.

Distance-based truth discovery. Another common approach to truth discovery (mentioned e.g. in [20]) is to first aggregate the reports to a vector 풚 = 푟(푆), then estimate competence according to 푓ˆ0푖 = 푑(풔푖, 풚), returning 푟(푆, 푤ì) (where 푤푖 = 푆푒푡푊푒푖푔ℎ푡(푓ˆ0푖) uses the same function as in P-TD). We denote this simple algorithm as distance-from-outcome Truth Discovery (D-TD, see also Alg. 2). Note that unlike P-TD, this approach requires not just a distance measure but also a specific aggregation function in order to estimate competence.
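For comparison, a sketch of Alg. 2 (D-TD) for the same continuous setting, again assuming NumPy:

```python
import numpy as np

def d_td(S):
    """Sketch of Alg. 2 (D-TD) for continuous reports: aggregate first (unweighted mean),
    then weigh each worker by her NSED to the outcome and aggregate again."""
    y = S.mean(axis=0)                                  # y = r(S)
    f0 = np.mean((S - y) ** 2, axis=1)                  # estimated fault d(s_i, y)
    w = 1.0 / np.maximum(f0, 1e-9)                      # same SetWeight as in the P-TD sketch
    z_hat = (w[:, None] * S).sum(axis=0) / w.sum()
    return f0, z_hat
```

In line with Theorem 16 below, these fault estimates are proportional to 푓b푁푃 (by a factor 푛/(푛 − 1)), so after normalizing the weights this sketch returns the same aggregate as the P-TD sketch above (up to the clipping guard).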

Theorem 16. In the continuous domain, D-TD and P-TD푁 푃 coincide.

This holds regardless of any assumptions on the data. The proof outline is to first show that 휇¯ is proportional to the average of all 푓ˆ0푖, and then show that 푓b푁푃 is proportional to 푓b0. The theorem then follows since the workers’ weights are inversely proportional to the fault, and thus both algorithms apply the weighted mean with the same (normalized) weights.

4.1 Improving P-TD

As we explained in Section 1.2, some advanced methods from the truth discovery literature are based on the following reasoning: a good estimate of workers’ competence leads to aggregated answers that are close to the ground truth; and a good approximation of the ground truth can give us a good estimate of workers’ competence. Thus we can iteratively improve both estimates.

Iterative average-proximity truth discovery. We can adopt a similar iterative approach by extending our P-TD algorithm to Iterative P-TD (or iP-TD, see Alg. 3). Intuitively, in each iteration we compute the disparity of a worker according to her weighted average distance from all workers, and use it to give higher weight to more competent workers in the next iteration. Note that iP-TD does not use the aggregation rule 푟 in order to produce the estimate 푓b, meaning that just like P-TD it can be applied in any domain that has a distance function.

Analysis. Why would it be beneficial to calculate a weighted average rather than the plain average distance from the other workers? Ideally, since we use the disparity 휋ì to derive the estimated fault levels 푓b, we would like 휋ì to be as close as possible to the real fault 푓ì, i.e., to minimize ∥휋ì − 푓ì∥. For a normalized weight vector 휔ì, let 휋휔ì푖 := Σ_{푖′∈푁} 휔푖′푑푖푖′, i.e., the weighted average distance according to 휔ì. We show that in the IER categorical model, the estimate 휋ì휔ì improves as weights are more skewed towards the better workers.

Proposition 17. Fix a population 푓ì. In the IER model, lim푚→∞ ∥휋ì휔ì − 푓ì∥ is monotone in ⟨휔ì, 푓ì⟩.

⁶By conservative we mean that the induced difference in the estimated fault of good and bad workers becomes less extreme. An even more conservative estimate would be 휇ˆ = 0, which implies 푓ˆ푖 = 휋푖.

ALGORITHM 3: Iterative-Proxy-based-Truth-Discovery (iP-TD)
    Input: number of iterations 푇; dataset 푆
    Output: Fault levels 푓b = (푓ˆ푖)푖∈푁; answers 풛ˆ
    Compute 푑푖푖′ ← 푑(풔푖, 풔푖′) for every pair of workers;
    Initialize 휔ì0 ← 1ì;
    for 푡 = 1, 2, . . . , 푇 do
        For every worker 푖 ∈ 푁, set 휋푡푖 ← (Σ_{푖′≠푖} 휔푡−1푖′ 푑푖푖′) / (Σ_{푖′≠푖} 휔푡−1푖′);
        For every worker 푖 ∈ 푁, set 휔푡푖 ← max푖′(휋푡푖′) − 휋푡푖;
    end
    Set 휇ˆ ← (1/(2푛)) Σ_{푖∈푁} 휋푇푖;
    for every worker 푖 ∈ 푁 do
        Set 푓ˆ푖 ← 퐸푠푡푖푚푎푡푒퐹푎푢푙푡(휋푇푖, 휇ˆ);
        Set 푤푖 ← 푆푒푡푊푒푖푔ℎ푡(푓ˆ푖);
    end
    Set 풛ˆ ← 푟(푆, 푤ì);
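The reweighting loop of Alg. 3 takes only a few lines (a sketch assuming NumPy; the small additive constant is our guard against a degenerate all-zero weight vector and is not part of Alg. 3):

```python
import numpy as np

def ip_td_disparities(D, T=10):
    """The reweighting loop of Alg. 3 (iP-TD): each round computes weighted disparities
    and gives higher weight to workers with lower disparity. D is a symmetric
    pairwise-distance matrix with a zero diagonal; returns the final disparities."""
    n = D.shape[0]
    w = np.ones(n)
    for _ in range(T):
        # Weighted average distance to the other workers (the zero diagonal drops out).
        pi = (D * w[None, :]).sum(axis=1) / (w.sum() - w)
        # The worker with the largest disparity gets (almost) zero weight in the next round.
        w = pi.max() - pi + 1e-12        # tiny constant: our guard, not part of Alg. 3
    return pi

# The returned disparities are then passed to EstimateFault/SetWeight exactly as in P-TD.
```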

Proof. In the limit, every pairwise distance 푑푖푖′ equals its expectation 푓푖 + (1 − (1 + 휃)푓푖)푓푖′. Denote 휋ì¯휔ì := lim푚→∞ 휋ì휔ì. Therefore

휋¯휔ì푖 = Σ_{푖′∈푁} 휔푖′푑푖푖′ = Σ_{푖′∈푁} 휔푖′(푓푖 + (1 − (1 + 휃)푓푖)푓푖′) = 푓푖 Σ_{푖′∈푁} 휔푖′ + (1 − (1 + 휃)푓푖) Σ_{푖′∈푁} 휔푖′푓푖′ = 푓푖 + (1 − (1 + 휃)푓푖)⟨휔ì, 푓ì⟩,

and |휋¯휔ì푖 − 푓푖| = |(1 − (1 + 휃)푓푖)⟨휔ì, 푓ì⟩| = |1 − (1 + 휃)푓푖| · ⟨휔ì, 푓ì⟩. Thus

lim푚→∞ ∥휋ì휔ì − 푓ì∥² = ∥휋ì¯휔ì − 푓ì∥² = Σ_{푖∈푁} (휋¯휔ì푖 − 푓푖)² = ⟨휔ì, 푓ì⟩² Σ_{푖∈푁} (1 − (1 + 휃)푓푖)².

The sum in the last term is a positive constant (it does not depend on 휔ì). Thus lim푚→∞ ∥휋ì휔ì − 푓ì∥ is monotone in ⟨휔ì, 푓ì⟩, as required. □

By Prop. 17, giving higher weights to better workers (i.e., reducing ⟨휔ì, 푓ì⟩) brings 휋ì휔ì closer to 푓ì, which justifies the re-weighting in every iteration of the iP-TD algorithm (at least for the IER model), so that in the final iteration 휋ì푇 (and thus 푓b) would be closer to 푓ì. We conjecture that a result similar to Prop. 17 holds at least qualitatively in more general settings, and thus recommend using iP-TD when possible.

Excluding questions. The iP-TD algorithm is an example of how we can take an idea from other truth discovery algorithms and combine it with the average-proximity approach. Another idea that was used in the literature is exclusion. Specifically, in the Belief Propagation algorithm of [24], a worker is not modelled by a single latent variable (representing competence). Rather, there are 푚 such variables, where the variables used to estimate 푧푗 in each iteration are computed while excluding question 푗. We can easily incorporate this idea into our iP-TD algorithm, resulting in the Exclusion-Iterative P-TD (eiP-TD); see Appendix H.

5 EMPIRICAL EVALUATION

Algorithms. Our P-TD algorithm and its variants (iP-TD, eiP-TD) can be applied to any data that has a distance measure. A natural baseline is unweighted aggregation (UA). Oracle aggregation (OA) uses the oracle fault 푑(풔푖, 풛), marking the lowest error we can hope to obtain by readjusting workers’ weights. In every domain, we also compared with the other relevant algorithms:
• D-TD and its iterative variant iD-TD can work on any data that has well-defined distances and an aggregation rule.
• Multidimensional Annotation Scaling (MAS) [5] can work on any data that has well-defined distances, just like our algorithms. However, due to its high runtime, we only included it in a subset of our simulations.
• Conflict Resolution on Heterogeneous Data (CRH) works on categorical or continuous data [27] with NSED.
• The other algorithms are defined only for binary data: Iterative Belief Propagation (IBP) [24]; EigenVector Decomposition (EVD) [33]; and the Dawid-Skene estimator (DS) [14, 19].
For MAS we used the code from [5], courtesy of the authors. We implemented the other algorithms following the respective papers. For all algorithms that do not specify how to weigh workers (all variants of D-TD, P-TD, and MAS), we used the same 푆푒푡푊푒푖푔ℎ푡() function on the estimated fault, as explained in Sec. 4. See detailed implementation notes in Appendix H.

Distance measures and aggregation rules.
• For continuous data, most of our simulations use NSED and the mean aggregation rule. We also used the L1 metric with the median rule.
• For categorical data we used the Hamming distance and the plurality aggregation rule.
• For ranking data we used the Kendall-tau distance with a variety of 10 voting rules for ranking data (Borda, Copeland, Kemeny-Young, weighted Kemeny-Young, Plurality, Veto, Random dictator, Best dictator, Modal voting [7], and Error-correcting voting [34]). See voting rule definitions in Appendix I. Additional results with Spearman rank-correlation and Footrule as the distance measures are provided in Appendix K.
• For the Translations data (see below) we used the GLUE score [40], following [5]. The same distance measure was used both for evaluation and as input to MAS and P-TD.

5.1 Experimental Setup

Synthetic data. In order to sample workers’ types in all domains, we used a distribution T that is either a truncated Normal distribution (denoted N) or a “Hammer-Spammer” distribution with 80% spammers (denoted HS). See more details in Appendix J. For the continuous domain we generated workers’ answers using the AWG model. In the categorical domain we used the IER model, with 2 or 4 available answers. In the ranking domain we used the Mallows model.

Real datasets. We used three datasets from [35] (GoldenGate, Flags, Dogs) where workers labeled images from 2, 4, and 10 categories, respectively. In the Triangles datasets, subjects guessed the coordinates of a vertex in a triangle based on a partial drawing [21]. In the Predict datasets from [29], subjects predicted whether certain events will occur under four different treatments (T1-T4) that vary payment rules and feedback policies. For ranking, we used eight datasets collected by Mao et al. [31], where subjects were asked to put 4 images in the right order, under four different difficulty levels (DOTS-X and PUZZLES-X). In the Translations dataset from [5] there are 400 sentences translated from Japanese to English. We processed each sentence separately as a single sample.

Fig. 2. Error of all algorithms as 푛 is increasing, in the continuous NSED setting. P-TD푁푃 and P-TD푀퐿 are indistinguishable in this figure. We omit the y-axis label since the scale is arbitrary.

Algorithm results with 푑 = NSED (mean rule):

| Cont. | 푛 | CRH | MAS | P-TD푀퐿 | P-TD푁푃 |
|---|---|---|---|---|---|
| AWG.N | 10 | −44.8% | +82.8% | −73.0% | **−73.7%** |
| AWG.N | 20 | −39.1% | +6.2% | **−79.9%** | −79.2% |
| AWG.HS | 10 | −33.5% | +66.2% | **−77.2%** | −75.2% |
| AWG.HS | 20 | −30.1% | +3.0% | −72.7% | **−86.0%** |
| TRI1 | 10 | **−38.2%** | +122.8% | −23.7% | −38.1% |
| TRI1 | 20 | −26.3% | +56.6% | −30.1% | **−32.3%** |
| TRI2 | 10 | −16.2% | +32.5% | −19.9% | **−22.1%** |
| TRI2 | 20 | −13.3% | −7.5% | **−25.2%** | −25.0% |
| Build | 10 | −15.9% | +58.8% | −10.6% | **−16.7%** |
| Build | 20 | −21.7% | −3.0% | −31.1% | **−32.5%** |

Algorithm results with 푑 = L1 (median rule):

| Cont. | 푛 | D-TD | P-TD | iP-TD |
|---|---|---|---|---|
| AWG.N | 10 | **−54.0%** | −27.3% | −32.9% |
| AWG.N | 20 | **−65.3%** | −25.4% | −31.0% |
| AWG.HS | 10 | **−77.9%** | −29.6% | −36.9% |
| AWG.HS | 20 | **−73.8%** | −34.6% | −45.1% |
| TRI1 | 10 | +40.8% | +0.3% | +5.9% |
| TRI1 | 20 | +10.8% | **−0.3%** | +1.4% |
| TRI2 | 10 | +14.7% | −4.7% | **−5.6%** |
| TRI2 | 20 | −5.2% | −7.1% | **−9.8%** |
| Build | 10 | +11.2% | **−4.5%** | −2.9% |
| Build | 20 | −8.0% | −13.6% | **−21.0%** |

Table 3. Results on continuous data, 푚 = 15. Left: with NSED and the mean rule. Right: with the L1 metric and the median rule.

| Binary | 푛 | D-TD | P-TD | IBP | CRH | DS | EVD | iD-TD | MAS | iP-TD | eiP-TD |
|---|---|---|---|---|---|---|---|---|---|---|---|
| IER.N | 12 | −24.4% | **−36.2%** | +76.6% | −19.9% | −37.4% | −30.8% | −28.1% | −28.1% | **−42.1%** | −38.1% |
| IER.N | 30 | −37.6% | **−64.1%** | −25.8% | −18.5% | −69.5% | −64.9% | −41.6% | −48.1% | −71.4% | **−71.5%** |
| IER.HS | 12 | −3.9% | **−6.3%** | +29.4% | −3.6% | −7.3% | −5.1% | −4.4% | −7.8% | **−9.5%** | −7.6% |
| IER.HS | 30 | −8.8% | **−18.2%** | +22.0% | −3.7% | −24.8% | −24.4% | −10.9% | −8.4% | −28.3% | **−28.8%** |
| GGate | 15 | −51.0% | **−64.8%** | −67.2% | −27.6% | −79.4% | −37.1% | −53.3% | −78.3% | −77.1% | **−80.9%** |
| DotsB | 30 | +0.6% | **−1.5%** | +61.6% | 0.0% | +6.3% | +2.7% | +1.6% | +0.3% | +0.4% | +2.2% |
| Pred.T1 | 30 | −2.5% | **−9.7%** | +9.5% | −1.2% | −7.1% | −7.0% | −2.9% | −3.4% | −11.7% | **−12.2%** |
| Pred.T2 | 30 | −0.6% | **−2.2%** | +9.0% | −0.1% | −1.7% | **−8.5%** | −1.0% | | −5.5% | −4.8% |
| Pred.T3 | 30 | −2.0% | **−7.0%** | +1.9% | −1.0% | −6.7% | **−13.0%** | −2.2% | | −11.0% | −10.9% |
| Pred.T4 | 30 | −3.1% | **−14.1%** | −3.3% | −1.4% | −11.4% | **−30.5%** | −3.4% | | −18.8% | −19.7% |

Table 4. Binary datasets, with Hamming distance, 푚 = 15, and the Majority rule. D-TD and P-TD are the simple algorithms; the rest are advanced. With GGate and 푛 = 30 almost all algorithms reach perfect accuracy, so we show 푛 = 15.

We collected two additional datasets, one similar to DOTS above but with only a binary choice (BinaryDots); and one where subjects were asked to estimate the heights of several buildings from pictures (Buildings). The latter can be used either as continuous data or as ranking.

Sampling. To obtain robust results we used the following sampling process: given 푛, 푚 and a dataset (real or synthetic), we sampled 푛 workers and 푚 questions without repetition, and repeated the process at least 1000 times for every combination, running all algorithms on the same samples. A detailed description of all datasets is in Appendix J.

Evaluation. The error of every algorithm is the distance (as specified above) to the ground truth, averaged over all samples of a particular dataset of a certain size. To present results in a normalized way, we computed the relative improvement (RI) of each algorithm 퐴푙푔 as 푅퐼(퐴푙푔) := (퐸푟푟(퐴푙푔) − 퐸푟푟(UA)) / (퐸푟푟(UA) − 퐸푟푟(OA)). This means that 푅퐼(UA) = 0% and 푅퐼(OA) = −100%, which is the best we can expect from any algorithm that weighs the workers (as OA weighs them based on their true competence). A positive number means that 퐴푙푔 performs worse than UA, on average.
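The RI metric in code form (a trivial helper of ours):

```python
def relative_improvement(err_alg, err_ua, err_oa):
    """RI(Alg) = (Err(Alg) - Err(UA)) / (Err(UA) - Err(OA)); RI(UA) = 0, RI(OA) = -1."""
    return (err_alg - err_ua) / (err_ua - err_oa)

print(relative_improvement(0.8, 1.0, 0.5))   # -0.4, i.e. an RI of -40%
```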

| 푘 | Categorical | 푛 | 푚 | D-TD | P-TD | CRH | iD-TD | MAS | iP-TD |
|---|---|---|---|---|---|---|---|---|---|
| 4 | IER.N | 30 | 15 | −4.9% | **−8.5%** | −2.1% | −5.6% | −4.1% | **−10.8%** |
| 4 | IER.N | 80 | 15 | −7.3% | **−22.8%** | −2.0% | −8.7% | | **−31.1%** |
| 4 | IER.N | 30 | 50 | −18.8% | **−25.9%** | −6.6% | −27.9% | | **−37.8%** |
| 4 | Flags | 15 | 15 | −43.0% | **−45.7%** | −30.4% | −47.9% | | **−58.7%** |
| 4 | Flags | 15 | 30 | **−57.8%** | −51.4% | −43.9% | −66.3% | **−79.8%** | −65.8% |
| 4 | Flags | 15 | 80 | **−68.9%** | −55.5% | −52.9% | **−81.7%** | | −71.5% |
| 10 | Dogs | 5 | 15 | −6.9% | **−9.3%** | −10.4% | −7.6% | | **−13.0%** |
| 10 | Dogs | 15 | 30 | −9.8% | **−21.1%** | −8.7% | −9.8% | −22.4% | **−28.3%** |

Table 5. Results on categorical data with 푘 > 2 categories. D-TD and P-TD are the simple algorithms; the rest are advanced.

Fig. 3. Results of all algorithms on categorical datasets as 푛 increases and 푚 = 15. The gray lines are the competing algorithms.

| Ranking | 푟 | D-TD | P-TD | MAS | iD-TD | iP-TD |
|---|---|---|---|---|---|---|
| Mal.N | BD | −14.7% | **−92.6%** | −92.4% | −15.4% | −92.5% |
| Mal.N | Plu. | −1.9% | **−54.5%** | −56.9% | +21.7% | **−59.3%** |
| Mal.N | KY | −9.6% | **−17.4%** | **−18.3%** | −8.5% | −15.7% |
| Mal.HS | BD | −17.5% | **−67.0%** | | −24.4% | **−78.2%** |
| Mal.HS | Plu. | −2.7% | **−22.1%** | | −5.7% | **−42.5%** |
| Mal.HS | KY | **−19.3%** | −14.0% | | −19.6% | **−34.9%** |
| Build. | BD | −8.5% | **−62.0%** | −56.2% | −10.2% | **−62.8%** |
| Build. | Plu. | −9.0% | **−46.9%** | **−51.5%** | +5.3% | −46.5% |
| Build. | KY | **−0.6%** | −0.5% | **−2.8%** | +2.0% | +2.2% |
| DOTS7 | BD | −26.4% | **−83.8%** | −84.6% | −30.3% | **−86.4%** |
| DOTS7 | Plu. | +1.8% | **−35.4%** | −15.9% | −1.7% | **−46.1%** |
| DOTS7 | KY | −3.0% | **−16.1%** | −4.5% | +1.2% | **−19.6%** |
| PUZZLES7 | BD | −24.3% | **−87.7%** | −86.8% | −30.1% | **−87.9%** |
| PUZZLES7 | Plu. | +0.6% | **−40.4%** | −29.9% | +1.4% | **−51.9%** |
| PUZZLES7 | KY | +1.4% | **−11.5%** | −2.6% | −0.8% | −6.7% |

Table 6. Results for ranking with the Kendall-tau distance, 푛 = 30. 푟 is the voting rule (Best Dictator, Plurality, Kemeny-Young). D-TD and P-TD are the simple algorithms; the rest are advanced. In DOTS and PUZZLES we used all 4 alternatives; in the other datasets we sampled 6 alternatives.

Fig. 4. P-TD and UA under 3 voting rules.

5.2 Results

Tables 3-6 present the relative improvement (RI) of all algorithms on various datasets. The names of our algorithms are always in bold and in the rightmost columns. We mark in bold the best simple and best overall algorithm for each dataset. We also underline all algorithms that are at most 2 standard deviations from the best one. If UA provides the lowest error then no method is marked. In the figures that include an error bar, it represents one standard error.

Label accuracy. One immediate observation is that P-TD substantially outperforms both UA and D-TD in almost all domains and datasets.⁷ This is true across all domains: continuous, categorical, and text translation (Fig. 5), as well as with ranking under various voting rules and distance functions (Figs. 4, 7 and more in Appendix K). One exception is the continuous domain with the

⁷Recall that in the continuous domain with NSED, P-TD and D-TD coincide.

Fig. 5. Results on the Translations dataset from [5]. On the left we see a random sample of 100 sentences. Dots below the dashed diagonal are sentences on which the algorithm performed better than a random selection among the ten translators. The algorithms are compared on the right over all 400 sentences.

Fig. 6. Error of all algorithms as 푚 is increasing, in continuous and categorical settings.

L1 distance and the median rule, where P-TD still improves over UA, but D-TD is even better on synthetic datasets. As for the advanced methods, no algorithm dominates the others, and in fact every algorithm ‘wins’ in some combinations of (dataset, 푛, 푚). Yet our iP-TD and eiP-TD algorithms are consistently among the top performers, often by a large margin.

The good empirical results on the synthetic AWG data are not surprising, due to the theoretical results in Section 3.2. However, we find it very encouraging, and a bit unexpected, that our general-domain, easy-to-compute methods perform consistently better than the computation-intensive MAS algorithm, and beat EVD and DS on synthetic data from the Dawid-Skene model: the very model for which their theoretical guarantees were derived.

Varying the level of difficulty. Generally speaking, truth discovery becomes harder when there are fewer workers to aggregate. We can see differences among the algorithms both in their ‘initial’ performance with few workers, and in their rate of convergence as 푛 increases. Again our algorithms consistently perform better than others, especially when 푛 is in the low-medium range. This holds for continuous labels with NSED (Fig. 2); categorical labels with two or more categories (Fig. 3); and ranking (see Appendix K). Again there are exceptions, e.g., on the BinaryDots dataset all algorithms perform poorly. Some of the competing methods (IBP in particular) perform very poorly when there is little data, but substantially better as 푛 increases. However, in typical crowdsourcing situations we cannot expect a large number of answers for every question.

We observe similar patterns when increasing the number of questions 푚 (Fig. 6). Note however that when 푛 is fixed, convergence is not to 0 error but to the performance of the oracle. In the DOTS and PUZZLES datasets, there are 4 levels of difficulty that again exhibit similar patterns (Fig. 7(left) for PUZZLES).

Fig. 7. Left: Results on all 4 PUZZLES datasets with 푛 = 15. Right: we focus on two datasets, showing the error of the error-correcting (EC) voting rule from [34] with and without weighing the workers. The unweighted results of Best-dictator, Plurality, and Kemeny-Young are shown in comparison. The color key is as in Fig. 6. Voting rules. Recall that the P-TD algorithm can be used to the voters’ weights in various voting rules. In Fig. 7(left) we see the benefit of using P-TD and iP-TD with almost every voting rule, where the benefit is particularly strong for simple rules. In fact, selecting the best-dictator combined with P-TD, outperforms the (unweighted) Kemeny-Young voting rule (Fig. 4). Adding weights to more sophisticated voting rules like Kemeny-Young or error-correcting voting, improves them further by small margin. Fig. 7(right) shows that it is not obvious how to choose the 휌 parameter in the error-correcting voting rule of [34] (which is supposed to determine the robustness to adversarial noise). Yet we can use P-TD to improve performance at almost any value. 6 CONCLUSION Average proximity can be used as a general scheme to estimate workers’ competence in a broad range of truth discovery and crowdsourcing scenarios. Due to the “Anna Karenina principle,” we expect the disparity of competent and incompetent workers to be substantially different even under very weak assumptions on the domain and the noise model. Under more explicit assumptions, the disparity accurately estimates the true competence level. The above results suggest an extremely simple and general algorithm for truth-discovery (the P-TD algorithm), that weighs workers by their average proximity to others, and can be combined with most aggregation methods. This is particularly useful in the context of existing crowdsourc- ing systems where the aggregation rule may be subject to constraints due to legacy, simplicity, explainability, legal, or other considerations (e.g. a voting rule with certain axiomatic properties). Despite its simplicity, the P-TD algorithm substantially improves the outcome compared to unweighted aggregation and the D-TD algorithm, and is competitive with highly sophisticated algorithms that are specifically designed for their respective domains. An obvious shortcoming of P-TD is that a group of workers that submit similar labels (e.g. by acting strategically) can boost their own weights. Future work will consider how to identify and/or mitigate the affect of such groups. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 19

A RESUBMISSION COVER LETTER This paper is a substantially revised version of an EC’20 submission. This cover letter is included with explicit permission from EC’21 program chair.

A.1 Key points of EC’20 reviewers and our related changes We order the comments by the significance of the change in the new version. • Results rely on specific choices of noise models and distances, there is no oneuni- fied proof (R1,R3). We restructured the paper so that there is a single ‘Anna Karenina’ theorem that makes no assumptions on the noise model, and from which most other results (in particular for categorical, continuous and ranking domains) follow as special cases. Further, we added general bounds on the expected proxy distance that make fairly weak assumptions on the noise and/or distances (Section 3.1,Section 3.5). So we get a natural tradeoff where results become stronger as the models become more specific. Note that this also directly answers the two main future challenges we specified in the discussion of the EC’20 version (a unified approach, and correlations among workers). • Proofs of the Anna Karenina theorems are technically straightforward (R1). While we agree that the contribution of the Anna Karenina principle is mainly conceptual, we think its simplicity only contributes to its applicability. We added multiple novel results, including showing that under a common noise model, our approach coincides with the maximum likelihood estimator of workers’ competence levels (Theorem 9 and corollaries). • An in-depth discussion of previous work is missing (R2,R3). The reviewers did not specify what previous work they were referring to, but the current version substantially expands the literature review and discussion, see Sec. 1.2. • How sensitive results are to the function EstimateExp? (R3) We removed the function and use a simple uniform calculation to estimate 휇 (half of the average disparity), and we theoretically justify this calculation in the text. Changing this estimation (e.g. replacing it with 0) results in a very slight effect on the empirical results. • The Categorical model assumes all wrong answers occur with equal probability (R1). We extended the results to include arbitrary error probabilities (we only make use of the probability that two wrong answers coincide). See Cor. 13. • The connection with proxy voting is unclear (R1,R2). While that was our primary inspiration, we agree that these are now separate topics. We changed the title and much of the terminology. We further extended the empirical analysis, including comparison with very recent algorithms and more varied datasets.

A.2 Full reviews from EC’20 Review 1. The theoretical contributions of the paper seem to be relatively marginal, but perhaps the significant generality of the approach and the promising empirical results are sufficient to overcome that weakness. Strengths The fact that the general approach adopted by the paper generalizes to cover many relevant settings for crowdsourcing tasks is useful, even so far as to expand the domain of truth-discovery tasks beyond the current standard settings, as the paper discusses in the conclusion. It is also helpful to see empirical results that provide evidence that the algorithms perform well in practice against Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 20 other methods from the truth discovery literature (and against natural baselines), even though those methods can be more specifically tailored to their respective domains. Weaknesses Perhaps the primary challenge for this setting is in framing the problem correctly and making the correct definitions, but given the presentation and definitions in the paper, the proofs ofthemain theoretical result in each section (the various specific statements of the “Anna Karenina principle”) are pretty trivial, largely involving basic facts about probabilities, expectations, and variances. Further, the results for each setting seem heavily rely on the specific choices of the noise models and distance functions. These choices are not really discussed, and the only justification for them seems to be occasionally mentioning that they are common in the literature. Particularly with regard to the noise models, which intuitively seem highly unrealistic (see more below), a more detailed justification, or at least discussion, of the choice of model would be helpful. Lastly, the term "proxy", especially in the term proxy distance, which I believe is original to this paper, seems to be misused or at least used vaguely. It is not at all clear from the paper what the word "proxy" in the term "proxy distance" is meant to convey and I would suggest rethinking that terminology. Similarly the connection with proxy voting is not at all clear. The introduction references a suggestion that perhaps representative voters being similar to one another helps to explain the effectiveness of proxy voting, which somewhat evokes the "Anna Karenina principle," but thatdoes not really draw a connection with proxy voting itself. However, the title “Truth Discovery via Proxy Voting” as well as phrasing such as "we apply proxy voting to estimate each worker’s competence" implies a strong, direct connection. I think that such phrasing (including the title) should be reconsidered. At the very least, a more thorough explanation of the connection is warranted. Noise models (in response to SPC that asked for elaboration): One example that is maybe particularly egregious is that in the Categorical Answers Domain, the noise model (the Independent Errors Model) assumes that all wrong answers occur with equal probability, given that a worker answers the question incorrectly. This is almost never even close to true, but generally there are one or two wrong answers that are the most common. For example, if we had a multiple choice question asking about the capital of Pennsylvania, with the options being Harrisburg, Philadelphia, Pittsburgh, and Scranton, basically no one would pick Scranton and far more people would pick Philadelphia than Pittsburgh. There are similar easy criticisms of the other noise models. 
For example, in the Continuous Answers Domain, the Independent Normal Noise model eliminates the possibility that workers are systematically biased in their answers (e.g. that a worker might have a tendency to overestimate the correct answer). Almost certainly, these models are chosen because they make the math easy, but they never actually discuss the choice of models, besides mentioning that they are common models in the literature. It is fine to use an unrealistic model, but the choice of a particular model warrants more discussion than "other people have used this model," especially when the choice of model seems to drive the technical results.

Review 2. Comments for the authors: This paper considers the problem of truth discovery via proxy voting in crowdsourcing. The setting is the following: there is a set of workers who provide answers to some task, and these answers are then aggregated with the goal of recovering the true answer (the ground truth). However, the answers of the workers are usually noisy; the workers have fault levels, which is the expected distance between their answers and the ground truth. Hence, the authors focus on first identifying the better workers and then applying simple aggregation rules, which assign more weight to the answers of the better workers. The authors consider three different domains: continuous (the answer is a real value), categorical (the answer is one ofmultiple Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 21 choices), and rankings (the answer is a linear order of a set of elements). The authors show that the average distance of each worker from all other workers is linear in her fault level, meaning that a worker is closer to other workers if her fault level is small (and hence is a good worker) and further away if her fault level is large. The exact form of this linear function depends on the domain and the underlying probability distribution from which the fault levels are drawn. Using this, the authors are able to show that proxy voting recovers the correct fault levels of the workers with a very good precision. The authors also present an iterative version of proxy voting, and perform simulations to compare their methods to other from previous work. Evaluation: The method proposed in this paper seems to be quite robust and to work under a plethora of different settings as demonstrated, which is very nice. The approach is interesting and quite easy to understand, even though some parts of the paper are very densely written and hard to follow. However, I am not quite sure I understand how significant and novel the conceptual and technical contribution of the paper actually is. An in-depth discussion of previous work is definitely missing. The authors could also reconsider which proofs appear in the main text; the main focus seems to be on the Anna Karenina principle, but I think the most important theorems are those showing the efficiency of the method (that the error is almost zero). I am also not quitesurewhy this approach is termed "proxy voting" and not "weighted voting", but this is minor. Overall, I find the paper interesting and support it, but only weakly. Review 3. Comments for the authors: This paper deals with aggregating responses of a set of workers in a metric space. The authors argue that by observing the alignment of the workers in the space one can identify the workers which are more accurate than the others: intuitively such workers tend to be close to each other. Then the accurate voters can be assigned higher weights, and so the quality of the aggregation can be improved. The authors prove this claim under certain assumptions in three different settings, which they refer to as the continuous domain, categorical domain and rankings domain. This is an interesting result and I like the general punchline of the paper. The fact that the paper consists of three rather independent parts, one per each model, makes me slightly less excited than if the paper proved a result that would apply more broadly. The results could be better presented. 
A few things were unclear: • The descriptions of the models (continuous domain, categorical domain and rankings domain) given in page 2 were to short and hard to understand. In my opinion these models should be presented more precisely, specifying how the elements of each space look like, what they correspond to, etc., perhaps giving some examples. • It is not entirely clear to me how the results depend on a particular distribution. How sensitive they are to the function EstimateExp? • I was somehow lacking a thorough discussion on the related work, which I could find in one place and which would discuss how the similar types of problems have been handled so far. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 22

Contents Abstract 0 1 Introduction 1 1.1 Contribution and paper structure 1 1.2 Related work 2 2 Preliminaries 3 3 Estimating Fault Levels 5 3.1 Model-free bounds 6 3.2 The AWG model 7 3.3 Categorical labels 8 3.4 Ranking Domain 9 3.5 General Distance Metrics 10 4 Aggregation 10 4.1 Improving P-TD 12 5 Empirical Evaluation 14 5.1 Experimental Setup 14 5.2 Results 16 6 Conclusion 18 A Resubmission Cover Letter 19 A.1 Key points of EC’20 reviewers and our related changes 19 A.2 Full reviews from EC’20 19 Contents 22 B General Anna Karenina Theorem 23 C Continuous Domain 26 C.1 Optimal estimation for the pAWG model 28 D Categorical Domain 31 E Ranking Domain 32 F General Distance Metrics 32 G Sample Complexity and Consistency 34 H Implementation notes 42 I Definitions of voting rules 43 J Datasets 44 J.1 Synthetic datasets 44 J.2 Real world data 45 K More Empirical Results 46 References 47 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 23

B GENERAL ANNA KARENINA THEOREM Consider an inner product space with a symmetric inner product ⟨, ⟩ . Theorem 1.

퐸푆∼Y (풛,T)푛 [휋푖 (푆)|푡푖, 풛] = 푓푖 (풛) + 휇T (풛) − 2 ⟨풃푖 (풛), 풃T (풛)⟩ .

Proof.

1 ∑︁ 2 퐸 ∼Y ( T)푛 [휋 (푆)|푡 , 풛] = 퐸 ∼Y ( T)푛 [ ∥풔 − 풔 ′ ∥ |푡 , 풛] 푆 풛, 푖 푖 푆 풛, 푛 − 1 푖 푖 푖 푖′≠푖 1 ∑︁ 2 = 퐸 ∼Y ( T)푛 [ ∥풔 − 풛 − (풔 ′ − 풛)∥ |푡 , 풛] 푆 풛, 푛 − 1 푖 푖 푖 푖′≠푖 1 ∑︁ 2 2 = 퐸 ∼Y ( T)푛 [ ∥풔 − 풛∥ + ∥풔 ′ − 풛∥ − 2⟨ 풔 − 풛, 풔 ′ − 풛⟩|푡 , 풛] 푆 풛, 푛 − 1 푖 푖 푖 푖 푖 푖′≠푖 1 ∑︁ 2 1 ∑︁ 2 = 퐸 ∼Y ( ) [ ∥풔 − 풛∥ |푡 , 풛] + 퐸 푛−1 [ ∥풔 ′ − 풛∥ |푡 , 풛] 풔푖 풛,푡푖 푛 − 1 푖 푖 푆∼Y (풛,T) 푛 − 1 푖 푖 푖′≠푖 푖′≠푖 1 ∑︁ − 2퐸 ∼Y ( T)푛 [ ⟨ 풔 − 풛, 풔 ′ − 풛⟩|푡 , 풛] 푆 풛, 푛 − 1 푖 푖 푖 푖′≠푖 1 ∑︁ 2 = 푓 (푧) + 퐸 ′ ∼ [ 퐸 ′ ∼Y ( ′ ) [∥풔 ′ − 풛∥ |푡 ′, 풛|풛] 푖 푛 − 1 푡푖 휏 푠푖 풛,푡푖 푖 푖 푖′≠푖 1 ∑︁ − 2 퐸 ∼ [ 퐸 ∼Y ( ) ′ ∼Y ( ′ ) [⟨ 풔 − 풛, 풔 ′ − 풛⟩|푡 ′, 푡 , 풛]|푡 , 풛] 푛 − 1 푡푖 휏 풔푖 풛,푡푖 ,푠푖 풛,푡푖 푖 푖 푖 푖 푖 푖′≠푖 1 ∑︁ = 푓 (풛) + 퐸 ∼Y ( T) [푓 ′ |풛] 푖 푛 − 1 푆 풛, 푖 푖′≠푖 1 ∑︁ − 2 퐸 ∼ [⟨ 퐸 ∼Y ( ) [풔 − 풛|푡 , 풛], 퐸 ′ ∼Y ( ′ ) [풔 ′ − 풛|푡 ′, 풛]⟩|푡 , 풛] 푛 − 1 푡푖 휏 풔푖 풛,푡푖 푖 푖 풔푖 풛,푡푖 푖 푖 푖 푖′≠푖 = 푓푖 (풛) + 휇T (풛) 1 ∑︁ − 2 ⟨ 퐸 ∼ [퐸 ∼Y ( ) [풔 − 풛|푡 , 풛]|푡 , 풛], 퐸 ∼ [퐸 ′ ∼Y ( ′ ) [풔 ′ − 풛|푡 ′, 풛]|푡 , 풛]⟩ 푛 − 1 푡푖 휏 풔푖 풛,푡푖 푖 푖 푖 푡푖 휏 풔푖 풛,푡푖 푖 푖 푖 푖′≠푖 = 푓푖 (풛) + 휇T (풛) 1 ∑︁ − 2 ⟨ 퐸 ∼Y ( ) [풔 − 풛|푡 , 풛], 퐸 ∼Y ( T) [풃 ′ (풛)|풛]⟩ 푛 − 1 풔푖 풛,푡푖 푖 푖 푆 풛, 푖 푖′≠푖 = 푓푖 (풛) + 휇T (풛) − 2 ⟨풃푖 (풛), 풃T (풛)⟩ □

Definition 2 (Truth Invariance for Distances). We say that a noise model Y is truth in- ′ ′ variant for distance if for all 풛, 풛 and any types 푡푖, 푡푖′ , we have 푓푖 (풛) = 푓푖 (풛 ) and ⟨풃푖 (풛), 풃푖′ (풛)⟩ = ′ ′ ⟨풔ˆ푖 (풛 ), 풃푖′ (풛 )⟩.

Lemma 18. If Y is truth invariant then it is truth invariant for distances. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 24

Proof. Consider any 풛, 풛′ ∈ R푚, and let 흃 as in the definition of truth invariance for distances. 2 Note that (휉 푗 ) = 1 for all 푗. We first show for 푓푖 .

푓 (풛) = 퐸 ∼Y ( ) [푑(풔 , 풛)] = 퐸 [푑(풛 + 흐 , 풛)] 푖 풔푖 풛,푡푖 푖 흐푖 ∼Y(˜ 풛,푡푖 ) 푖

= 퐸 ′ [푑(풛 + 흐 , 풛)] = 퐸 ′ [푑(풛 + 흐 · 휉, 풛)] 흐푖 ∼Y(˜ 풛 ,푡푖 )휉 푖 흐푖 ∼Y(˜ 풛 ,푡푖 ) 푖 푚 푚 ∑︁ 2 ∑︁ 2 = 퐸 ′ [ (푧 + 휖 · 휉 − 푧 ) ] = 퐸 ′ [ (휖 · 휉 ) ] 흐푖 ∼Y(˜ 풛 ,푡푖 ) 푗 푖 푗 푗 푗 흐푖 ∼Y(˜ 풛 ,푡푖 ) 푖 푗 푗 푗=1 푗=1 푚 푚 ∑︁ 2 2 ∑︁ 2 = 퐸 ′ [ 휖 휉 ] = 퐸 ′ [ 휖 ] 흐푖 ∼Y(˜ 풛 ,푡푖 ) 푖 푗 흐푖 ∼Y(˜ 풛 ,푡푖 ) 푖 푗=1 푗=1 ′ ′ 퐸 ′ 푑 , 푓 . = 풔푖 ∼Y (풛 ,푡푖 ) [ (풔푖 풛 )] = 푖 (풛 )

Next, consider two workers of types 푡푖, 푡푖′ .

푚 푚 ∑︁ ∑︁ 푚 ⟨풃푖 (풛), 풃푖′ (풛)⟩ = 푏푖 푗 (풛)푏푖′ 푗 (풛) = 퐸 ˜ [휖푖 푗 ]퐸 ˜ [휖푖′ 푗 ] 흐푖 ∼Y(풛,푡푖 ) 흐푖′ ∼Y(풛,푡푖′ ) 푗=1 푗=1 푚 푚 ∑︁ ∑︁ 2 = 퐸 ˜ ′ [휖푖 푗 휉 푗 ]퐸 ˜ ′ [휖푖′ 푗 휉 푗 ] = (휉 푗 ) 퐸 ˜ ′ [휖푖 푗 ]퐸 ˜ ′ [휖푖′ 푗 ] 흐푖 ∼Y(풛 ,푡푖 ) 흐푖′ ∼Y(풛 ,푡푖′ ) 흐푖 ∼Y(풛 ,푡푖 ) 흐푖′ ∼Y(풛 ,푡푖′ ) 푗=1 푗=1 푚 푚 ∑︁ ∑︁ ′ ′ ′ ′ = 퐸 ˜ ′ [휖푖 푗 ]퐸 ˜ ′ [휖푖′ 푗 ] = 푏푖 푗 (풛 )푏푖′ 푗 (풛 ) = 푚 ⟨풃푖 (풛 ), 풃푖′ (풛 )⟩ , 흐푖 ∼Y(풛 ,푡푖 ) 흐푖′ ∼Y(풛 ,푡푖′ ) 푗=1 푗=1 as required. □

Theorem 3. For a truth invariant noise model, 퐸푆∼Y (풛,T)푛 [휋푖 (푆)|푡푖 ] = 푓푖 + 휇T − 2 ⟨풃푖, 풃T ⟩, regardless of 풛.

Proof. By Lemma 18, any truth invariant model is also truth invariant for distances.

⟨ ( ), ( )⟩ , 퐸 [ ′ ( )] 퐸 [⟨ ( ), ′ ( )⟩] 퐸 [⟨ , ′ ⟩] Since 풃푖 풛 풃T 풛 = 풃푖 푡푖′ ∼T 풃푖 풛 = 푡푖′ ∼T 풃푖 풛 풃푖 풛 = 푡푖′ ∼T 풃푖 풃푖 , the theo- rem follows immediately from truth invariance for distances and from Theorem 1. □

√︁ √ Model-free bounds. Lemma 6. For all 푖, | ⟨풃푖, 풃T ⟩ | ≤ 푓푖 휇.

Proof. Due to Cauchy-Schwarz inequality,

uv 푚 uv 푚 uv 푚 uv 푚 t 1 ∑︁ t 1 ∑︁ t 1 ∑︁ t 1 ∑︁ | ⟨ , ⟩ | ≤ ∥ ∥ · ∥ ∥ = 푏 · 푏 = (푏 )2 · (푏 )2 풃푖 풃T 풃푖 풃T 푚 푖 푗 푚 T,푗 푚 푖 푗 푚 T,푗 푗=1 푗=1 푗=1 푗=1

uv 푚 uv 푚 t 1 ∑︁ t 1 ∑︁ = (퐸 [푠 − 푧 ])2 · (퐸 퐸 [푠 − 푧 ])2 푚 풔푖 ∼Y푖 푖 푗 푗 푚 푡∼T 풔∼Y (푡) 푗 푗 푗=1 푗=1 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 25

By Jensen inequality and convexity of the square function,

uv 푚 uv 푚 t 1 ∑︁ t 1 ∑︁ ≤ 퐸 [(푠 − 푧 )2]· 퐸 퐸 [(푠 − 푧 )2] 푚 풔푖 ∼Y푖 푖 푗 푗 푚 푡∼T 풔∼Y (푡) 푗 푗 푗=1 푗=1

uv 푚 uv 푚 t 1 ∑︁ t 1 ∑︁ ≤ 퐸 [ (푠 − 푧 )2]· 퐸 퐸 [ (푠 − 푧 )2] 풔푖 ∼Y푖 푚 푖 푗 푗 푡∼T 풔∼Y (푡) 푚 푗 푗 푗=1 푗=1 √︃ √ √︁퐸 푑 , 퐸 퐸 푑 , √︁푓 휇, ≤ 풔푖 ∼Y푖 [ (풔푖 풛)]· 푡∼T 풔∼Y (푡) [ (풔 풛)] ≤ 푖 · as required. □

The following lemma is another corollary of the Anna Karenina principle. √ 2 Lemma 7. For any worker 푖 and any 훼 ≥ 0, if 푓푖 = 훼휇, then 퐸[휋푖 ] ∈ (1 ± 훼) 휇.

Proof. For the upper bound,

퐸[휋푖 ] = 푓푖 + 휇 − 2 ⟨풃푖, 풃T ⟩ (By Thm. 1) √︁ √ ≤ 푓푖 + 휇 + 2 푓푖 휇 (By Lemma 6) √ √ √ √ 2 ≤ 훼휇 + 휇 + 2 훼휇 휇 = (1 + 훼 + 2 훼)휇 = (1 + 훼) 휇. (since 푓푖 ≤ 훼휇) The proof of the lower bound is symmetric. □

∗ ∗∗ Corollary 5. Consider a “poor" worker 푖 with 푓푖∗ > 9휇T , and a “good" worker 푖 with 푓푖∗∗ < 휇T . Then 퐸[휋푖∗∗ ] < 퐸[휋푖∗ ]. No better separation is possible.

Proof. By Lemma. 7, 퐸[휋푖∗∗ ] ≤ 4휇 < 퐸[휋푖∗ ]. Without further assumptions, this condition is tight. To see why, consider a population on R ∗∗ ∗ where 푧 = 0. The good worker 푖 provides the fixed report 푠푖∗∗ = −1, the poor worker 푖 provides ′ the fixed report 푠푖∗ = 3 − 훿. However the measure of types 푡푖∗, 푡푖∗∗ in T is 0, and w.p. 1 type 푡 is selected with a fixed report 푠 = 1. Note that 휇 = (푠 − 푧) = 1, and thus 푓푖∗∗ = 1 = 휇 whereas 푓푖∗ = 9 = 9휇. However, the reports of 푖∗, 푖∗∗ are completely symmetric around 1, in the absenxe of more workers there is no way to distinguish between these two workers, by their proxy score or otherwise. □

Corollary 8. Suppose there are two types of workers, with 푃 > 퐻. For all 푖∗ ∈ 푁 푃, 푖∗∗ ∈ 푁 퐻 , if any of the following conditions apply, then 퐸[휋푖∗ ] > 퐸[휋푖∗∗ ]: ℎ = 훼 푃−퐻 (a) 0 and < 2푝2 ; (b) 퐻 = 0 and 훼 < 0.5; (c) 푃 − 퐻 > 푝2 − ℎ2; (d) 푝 ≤ ℎ. Intuitively, this means that it is easy to distinguish hammers from spammers as there are fewer spammers; as the difference in competence becomes larger; and as the spammers are less biased.

Proof. Denote by 훽 := 풔푃, 풔퐻 the amount of agreement in the bias of hammers and spammers. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 26

For every spammer 푖 ∈ 푁 푃

퐸[휋푖 |푡푖 = 푃] = 푓푖 + 휇 − 2 ⟨풃푖, 풃T ⟩ = 푃 + 훼푃 − 2 풔푃, 훼풔푃 + (1 − 훼)풔퐻 = 푃 + 훼푃 − 훼2 풔푃, 풔푃 − (1 − 훼)2 풔푃, 풔퐻 = 푃 + 훼푃 − 2훼푝2 − (1 − 훼)2 풔푃, 풔퐻 = 푃 + 훼푃 − 2훼푝2 − 2(1 − 훼)훽. (6)

Similarly, it applies for every hammer 푖 ∈ 푁 퐻 :

퐸[휋푖 |푡푖 = 퐻] = 푓푖 + 휇 − 2 ⟨풃푖, 풃T ⟩ = 퐻 + 훼푃 − 2 풔퐻 , 훼풔푃 + (1 − 훼)풔퐻 = 퐻 + 훼푃 − 2훼훽 − 2(1 − 훼)ℎ2. (7)

(a) Follows immediately from Eqs. (6),(7), since 훽 = 0.

2 퐻 = 푃−퐻 = 푃 ≥ 푝 = 1 (b) Suppose 0, then Condition (a) is implied, since 2푝2 2푝2 2푝2 2 . (c) We note that due to Cauchy-Schwartz inequality,

√︁ 푝2 + ℎ2 |훽| = | 풔푃, 풔퐻 | ≤ ∥풔푃 ∥ · ∥풔퐻 ∥ = 푝 · ℎ = 푝2ℎ2 ≤ . (8) 2 Due to Eqs. (6),(7):

2 2 퐸[휋푖∗ ] − 퐸[휋푖∗∗ ] = 푃 + 훼푃 − 2훼푝 − 2(1 − 훼)훽 − (퐻 + 훼푃 − 2훼훽 − 2(1 − 훼)ℎ ) = 푃 − 퐻 − (2훽(1 − 2훼) + 2훼 (푝2 + ℎ2) − 2ℎ2) ≥ 푃 − 퐻 − ((푝2 + ℎ2)(1 − 2훼) + 2훼 (푝2 + ℎ2) − 2ℎ2) (By Eq. (8)) ≥ 푃 − 퐻 − ((푝2 + ℎ2) − 2ℎ2) = 푃 − 퐻 − (푝2 − ℎ2) > 0, as required.

(d) If 푝 ≤ ℎ then 푃 − 퐻 > 0 ≥ 푝2 − ℎ2 thus Condition (c) is implied. □

C CONTINUOUS DOMAIN Recall that for the mean function, we have 푓ˆ0 푑 , 푧 푑 , 푟 푀 푆 , 푖 = (풔푖 ˆ) = (풔푖 ( )) which is the distance of 푖 from the unweighted mean of all workers. Denote 1 ∑︁ 휇0 = 푓ˆ0. ˆ : 푛 푖 푖 ≤푛 푟 휇0 푛−1 휇 Lemma 19. When is the mean function, ˆ = 푛 . Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 27

Proof. 1 1 ∑︁ 1 1 ∑︁ 1 ∑︁ 휇 = 휋 (푆) = 푑(풔 , 풔 ′ ) 2 푛 푖 2 푛 푛 − 1 푖 푖 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 }

1 ∑︁ ∑︁ ∑︁ 2 = (푠 − 푠 ′ ) 2푛(푛 − 1) 푖 푗 푖 푗 푖 ∈푁 푖′ ∈푁 \{푖 } 푗 ≤푚

1 ∑︁ ∑︁ ∑︁ 2 ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ ∑︁ 2 = [ (푠 ) − 2 푠 푠 ′ + (푠 ′ ) ] 2푛(푛 − 1) 푖 푗 푖 푗 푖 푗 푖 푗 푖 ∈푁 푖′ ∈푁 \{푖 } 푗 ≤푚 푖 ∈푁 푖′ ∈푁 \{푖 } 푗 ≤푚 푖 ∈푁 푖′ ∈푁 \{푖 } 푗 ≤푚

1 ∑︁ ∑︁ 2 1 ∑︁ ∑︁ ∑︁ 1 ∑︁ ∑︁ 2 = (푠 ) − 푠 푠 ′ + (푠 ′ ) 2푛 푖 푗 푛(푛 − 1) 푖 푗 푖 푗 2푛 푖 푗 푖 ∈푁 푗 ≤푚 푖 ∈푁 푖′ ∈푁 \{푖 } 푗 ≤푚 푖′ ∈푁 푗 ≤푚

∑︁ 1 ∑︁ 2 1 ∑︁ ∑︁ = ( (푠 ) − 푠 푠 ′ ) 푛 푖 푗 푛(푛 − 1) 푖 푗 푖 푗 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 } Whereas for the mean aggregation 풛ˆ,

0 1 ∑︁ 1 ∑︁ 1 ∑︁ 휇 = 푑( , ˆ) = 푑( , ′ ) ˆ 푛 풔푖 풛 푛 풔푖 푛 풔푖 푖 ∈푁 푖 ∈푁 푖′ ∈푁 1 ∑︁ ∑︁ 1 ∑︁ 2 = (푠 − 푠 ′ ) 푛 푖 푗 푛 푖 푗 푖 ∈푁 푗 ≤푚 푖′ ∈푁

1 ∑︁ ∑︁ 2 2 ∑︁ 1 ∑︁ ∑︁ = ((푠 ) − 푠 푠 ′ + 푠 ′ 푠 ′′ ) 푛 푖 푗 푛 푖 푗 푖 푗 푛2 푖 푗 푖 푗 푖 ∈푁 푗 ≤푚 푖′ ∈푁 푖′ ∈푁 푖′′ ∈푁 ! ∑︁ 1 ∑︁ 2 2 ∑︁ ∑︁ 1 ∑︁ ∑︁ = (푠 ) − 푠 푠 ′ + 푠 ′ 푠 푛 푖 푗 푛2 푖 푗 푖 푗 푛2 푖 푗 푖 푗 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 푖′ ∈푁 푖 ∈푁

! ∑︁ 1 ∑︁ 2 1 ∑︁ ∑︁ = (푠 ) − 푠 푠 ′ 푛 푖 푗 푛2 푖 푗 푖 푗 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁

∑︁ 1 ∑︁ 2 1 ∑︁ 2 ∑︁ = © (푠 ) − ((푠 ) + 푠 ′ 푠 )ª ­푛 푖 푗 푛2 푖 푗 푖 푗 푖 푗 ® 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 } « ¬

∑︁ 1 1 ∑︁ 2 1 ∑︁ ∑︁ = ©( + ) (푠 ) − 푠 ′ 푠 ª ­ 푛 푛2 푖 푗 푛2 푖 푗 푖 푗 ® 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 } « ¬ 푛 푛 ∑︁ − 1 ∑︁ 2 − 1 1 ∑︁ ∑︁ = © (푠 ) − 푠 ′ 푠 ª ­ 푛2 푖 푗 푛 푛(푛 − 1) 푖 푗 푖 푗 ® 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 } « ¬ 푛 푛 − 1 ∑︁ 1 ∑︁ 2 1 ∑︁ ∑︁ − 1 = © (푠 ) − 푠 ′ 푠 ª = 휇, 푛 ­푛 푖 푗 푛(푛 − 1) 푖 푗 푖 푗 ® 푛 푗 ≤푚 푖 ∈푁 푖 ∈푁 푖′ ∈푁 \{푖 } « ¬ as required. □ 푓ˆ푁 푃 휋 휇 Recall that 푖 = 푖 − ¯. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 28

푓ˆ0 푛−1 푓ˆ푁 푃 푖 Proposition 20. 푖 = 푛 푖 for all . Proof. First note that

0 1 ∑︁ 1 ∑︁ 1 ∑︁ 2 푓ˆ = 푑 ( , ′ ) = (푠 − 푠 ′ ) 푖 퐸 풔 푛 풔푖 푚 푖 푗 푛 푖 푗 푖′ ∈푁 푗 ≤푚 푖′ ∈푁 ! 1 ∑︁ 2 2 ∑︁ 1 ∑︁ 2 = 푠 − 푠 푠 ′ + ( 푠 ′ ) 푚 푖 푗 푖 푗 푛 푖 푗 푛 푖 푗 (9) 푗 ≤푚 푖′ ∈푁 푖′ ∈푁

푛−1 0 1 휋 휋 푓˜ 휋 − 휇 푑( , ) 휋 Í ′ 푑( , ′ ) Denote ˜푖 := 푛 푖 and 푖 := ˜푖 ˆ . Since 풔푖 풔푖 = 0, we have ˜푖 = 푛 푖 ∈푁 풔푖 풔푖 .

0 1 ∑︁ 1 ∑︁ 푓˜ = 휋 − 휇 = 푑 ( , ′ ) − 푑 ( ′, ˆ) 푖 ˜푖 ˆ 푛 퐸 풔푖 풔푖 푛 퐸 풔푖 풛 푖′ ∈푁 푖′ ∈푁 1 ∑︁ 1 ∑︁ 0 0 = 푑 (풔 , 풔 ′ ) − 푓ˆ′ (By def. of 푓ˆ′ ) 푛 퐸 푖 푖 푛 푖 푖 푖′ ∈푁 푖′ ∈푁

! 1 ∑︁ 1 ∑︁ 2 1 ∑︁ 1 ∑︁ 2 2 ∑︁ 1 ∑︁ 2 = (푠 − 푠 ′ ) − 푠 ′ − 푠 ′ 푠 ′′ + ( 푠 ′′ ) 푛 푚 푖 푗 푖 푗 푛 푚 푖 푗 푖 푗 푛 푖 푗 푛 푖 푗 (By Eq. (9)) 푖′ ∈푁 푗 ≤푚 푖′ ∈푁 푗 ≤푚 푖′′ ∈푁 푖′′ ∈푁 " !# 1 ∑︁ 1 ∑︁ 2 1 ∑︁ 2 2 ∑︁ 1 ∑︁ 2 = (푠 − 푠 ′ ) − 푠 ′ − 푠 ′ 푠 ′′ + ( 푠 ′′ ) 푚 푛 푖 푗 푖 푗 푛 푖 푗 푖 푗 푛 푖 푗 푛 푖 푗 푗 ≤푚 푖′ ∈푁 푖′ ∈푁 푖′′ ∈푁 푖′′ ∈푁 " !# 1 ∑︁ 2 1 ∑︁ 2 2 2 ∑︁ 1 ∑︁ 2 = 푠 + 푠 ′ − 푠 푠 ′ − (푠 ′ − 푠 ′ 푠 ′′ + ( 푠 ′′ ) 푚 푖 푗 푛 푖 푗 2 푖 푗 푖 푗 푖 푗 푖 푗 푛 푖 푗 푛 푖 푗 푗 ≤푚 푖′ ∈푁 푖′′ ∈푁 푖′′ ∈푁 " # 1 ∑︁ 2 1 ∑︁ 1 ∑︁ 2 ∑︁ 1 ∑︁ 1 ∑︁ 2 = 푠 − 푠 푠 ′ + 푠 ′ 푠 ′′ − ( 푠 ′′ ) 푚 푖 푗 2 푖 푗 푛 푖 푗 푛 푖 푗 푛 푖 푗 푛 푛 푖 푗 푗 ≤푚 푖′ ∈푁 푖′ ∈푁 푖′′ ∈푁 푖′ ∈푁 푖′′ ∈푁 " # 1 ∑︁ 2 1 ∑︁ 1 ∑︁ 2 1 ∑︁ 2 = 푠 − 푠 푠 ′ + ( 푠 ′′ ) − ( 푠 ′′ ) 푚 푖 푗 2 푖 푗 푛 푖 푗 2 푛 푖 푗 푛 푖 푗 푗 ≤푚 푖′ ∈푁 푖′′ ∈푁 푖′′ ∈푁 " # 1 ∑︁ 2 1 ∑︁ 1 ∑︁ 2 0 = 푠 − 푠 푠 ′ + ( 푠 ′′ ) = 푓ˆ . 푚 푖 푗 2 푖 푗 푛 푖 푗 푛 푖 푗 푖 (By Eq. (9)) 푗 ≤푚 푖′ ∈푁 푖′′ ∈푁

Finally, by Lemma 19, 푛 푛 푛 푛 푛 푓ˆ푁 푃 = 휋 − 휇 = 휋˜ − 휇ˆ0 = (휋˜ − 휇ˆ0) = 푓˜ = 푓ˆ0, 푖 푖 푛 − 1 푖 푛 − 1 푛 − 1 푖 푛 − 1 푖 푛 − 1 푖 as required. □

C.1 Optimal estimation for the pAWG model Suppose that the pairwise distances are all independent conditional on workers’ types. More formally, under the pairwise Additive White Gaussian (pAWG) noise model, 푑푖푖′ = 푓푖 + 푓푖′ +휖푖푖′ , where all of 휖푖푖′ are sampled i.i.d. from a normal distribution with mean 0 and unknown variance. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 29

1, 2 1 1 0 0 0 , © ª 1 3 ­ 1 0 1 0 0 ® , ­ ® 1 4 ­ 1 0 0 1 0 ® , ­ ® 1 5 ­ 1 0 0 0 1 ® , ­ ® 2 3 ­ 0 1 1 0 0 ® 8 2 2 2 2 , ­ ® 2 4 ­ 0 1 0 1 0 ® © 2 8 2 2 2 ª , ­ ® ­ ® 푃 = 2 5 ; 퐴 = ­ 0 1 0 0 1 ® ; 퐴푇 퐴 = ­ 2 2 8 2 2 ® , ­ ® ­ ® 3 4 ­ 0 0 1 1 0 ® ­ 2 2 2 8 2 ® , ­ ® ­ ® 3 5 ­ 0 1 1 0 1 ® 2 2 2 2 8 , ­ ® « ¬ 4 5 ­ 0 0 0 1 1 ® , ­ ® 5 4 ­ 0 0 0 1 1 ® . ­ . . ® . ­ . . ® ­ ® 2, 1 1 1 0 0 0 « ¬

Fig. 8. The matrices 퐴 and 퐴푇 퐴 for 푛 = 5. The part below the dashed line is a reflection of the part above

Recovering 푓ì from all the off-diagonal entries of the pairwise distance matrix 퐷 can be done by solving the following ordinary least squares (regression) problem:8 ∑︁ ˆ ˆ 2 min (푓푖 + 푓푖′ − 푑푖푖′ ) . 푓b 푖,푖′ We will show a closed form solution to a somewhat more general claim, that also allows for a regularization term.

Theorem 9. Let 휆 ≥ 0, and 퐷 = (푑푖,푖′ )푖,푖′ ∈푁 be an arbitrary symmetric nonnegative matrix (repre- senting pairwise distances). Then (푛 − )휋 ì − 8푛(푛−1) 휇 ∑︁ 2 ∑︁ 2 2 1 4푛+휆−4 argmin (푓ˆ + 푓ˆ′ − 푑 ′ ) + 휆 (푓ˆ) = . 푓b 푖 푖 푖푖 푖 2푛 + 휆 − 4 푖,푖′ ∈푁 :푖≠푖′ 푖 ∈푁

Proof. Let 푃 be a list of all 푛2 − 푛 ordered pairs of [푛] (without the main diagonal) in arbitrary order. We can now rewrite the least squares problem as   ∑︁ 푓ˆ 푓ˆ 푑 2 휆 ∑︁ 푓ˆ 2 퐴푓 2 휆 푓 2 , min ( 푖 + 푖′ − 푖푖′ ) + ( 푖 ) = min ∥ b− 풅∥2 + ∥ b∥2 푓b (푖,푖′) ∈푃 푖 ∈푁 푓b

where 퐴 is a |푃 | × 푛 matrix with 푎푘푖 = 1 iff 푖 ∈ 푃푘 , and 풅 is a |푃 |-length vector with 푑푘 = 푑푖푖′ for ′ 푃푘 = (푖, 푖 ). The optimal solution for regularized least squares is obtained at 푓bsuch that (퐴푇 퐴 + 휆퐼)푓b= 퐴푇 풅. (10) Fortunately, 퐴 has a very specific structure that allows us to obtain the above closed-form solution. Note that every row of 퐴 has exactly two ‘1’ entries, in the row index 푖 and column index ′ 2 푖 of 푃푘 ; the total number of ‘1’ is 2(푛 − 푛); there are 2푛 − 2 ones in every column; and every two ′ ′ ′ distinct columns 푖, 푖 share exactly two non-zero entries (at rows 푘 s.t. 푃푘 = (푖, 푖 ) and 푃푘 = (푖 , 푖)).

8Using an OLS for this problem is inspired by the work of [33], although they do not provide a closed form solution as we do here, nor do they consider regularization. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 30

This means that (퐴푇 퐴) has 2푛 − 2 everywhere on the main diagonal and 2 everywhere else. See Fig. 8 for a visual example. Thus 푇 ∑︁ [퐴 풅]푖 = 2 푑푖푖′ = 2(푛 − 1)휋푖, (11) 푖′≠푖 and 푇 ˆ ∑︁ ˆ ˆ [(퐴 퐴 + 휆퐼)푓b]푖 = (2푛 − 2)푓푖 + 2 푓푖′ + 휆푓푖 푖′≠푖 ˆ ∑︁ ˆ ˆ ∑︁ ˆ = (2푛 − 2 + 휆)푓푖 + 2 푓푖′ = (2푛 − 4 + 휆)푓푖 + 2 푓푖′ . (12) 푖′≠푖 푖′ ∈푁 Denote ∑︁ 푇 ∑︁ 훼 := [퐴 풅]푖 = 2(2(푛 − 1)휋푖 ) = 4푛(푛 − 1)휇, (13) 푖 ∈푁 푖 ∈푁 then ∑︁ 푇 ∑︁ 푇 훼 = [퐴 풅]푖 = [(퐴 퐴 + 휆퐼)푓b]푖 (By Eq. (10)) 푖 ∈푁 푖 ∈푁 ∑︁ ˆ ∑︁ ˆ = [(2푛 − 4 + 휆)푓푖 + 2 푓푖′ ] (By Eq. (12)) 푖 ∈푁 푖′ ∈푁 ∑︁ ˆ ∑︁ ˆ ∑︁ = (2푛 − 4 + 휆) 푓푖 + 2 푓푖′ ( 1) 푖 ∈푁 푖′ ∈푁 푖 ∈푁 ∑︁ ˆ ∑︁ ˆ ∑︁ ˆ = (2푛 − 4 + 휆) 푓푖 + 2 푓푖′푛 = (4푛 + 휆 − 4) 푓푖 . (14) 푖 ∈푁 푖′ ∈푁 푖 ∈푁

We can now write the 푛 linear equations specifying 푓b: 푇 푇 2(푛 − 1)휋푖 = [퐴 풅]푖 = [(퐴 퐴 + 휆퐼)푓b]푖 (By Eqs. (10),(11)) ˆ ∑︁ ˆ = (2푛 − 4 + 휆)푓푖 + 2 푓푖′ ⇒ (By Eq. (12)) 푖′ ∈푁

2(푛 − 1)휋 − 2 Í ′ 푓ˆ′ 푓ˆ = 푖 푖 ∈푁 푖 (isolating 푓ˆ) 푖 2푛 − 4 + 휆 푖 4푛(푛−1) 2(푛 − 1)휋푖 − 2 휇 = 4푛−4+휆 , (by Eqs.(13),(14)) 2푛 − 4 + 휆 as required. □

We can use 푓bobtained for different values of the regularizing constant 휆 in the P-TD algorithm. • By setting 휆 = 0 (no regularization), we get (푛 − )휋 ì − 8푛(푛−1) 휇 푛 휋 푛휇 푛 2 1 4푛−4 ( − 1) ì − 푓b푀퐿 = = ∝휋 ì − 휇. 2푛 − 4 푛 − 2 푛 − 1 • By setting 휆 = 2 we get (푛 − )휋 ì − 8푛(푛−1) 휇 푛 ( = ) 2 1 4푛−2 2 푓b푀퐿 휆 2 = = 휋ì − 휇. 2푛 − 2 2푛 − 1 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 31

Fig. 9. A comparison of the aggregation error of P-TD (see Aggregation section), with different regularization factors. Results are shown for synthetic data in the IER model under two different distributions T .

• By setting 휆 = 4 we get (푛 − )휋 ì − 8푛(푛−1) 휇 푛 ( = ) 2 1 4푛 − 1 푓b푀퐿 휆 4 = = (휋 ì − 휇) = 푓b0 ∝ 푓b푁 푃 . 2푛 푛 Recall that 휆 = 4 gives us the naïve P-TD algorithm. We compare the results empirically in Fig. 9. “Oracle" is a weighted mean that uses the true types of workers to set the weights (see OA in Appendix K). We can see that while the naïve variant performs better when the pairwise distances are noisy (low 푚), as 푚 grows the 푀퐿 variant converges to the performance of the oracle.

D CATEGORICAL DOMAIN , 푘 푚 푑 , 1 Í푚 푥 푦 Lemma 11. For every two vectors 풙 풚 ∈ [ ] , (풔(풙) 풔(풚)) = 푗=1 푗 ≠ 푗 . 푚 J K Proof. 푚 푘 푚 푘 1 ∑︁ ∑︁ 1 ∑︁ ∑︁ 푘 푑(풔(풙), 풔(풚)) = (풔 (풙) − 풔 (풚))2 = 푠 (풙) ≠ 푠 (풚) 푚푘 푗,ℓ 푗,ℓ 푚푘 2 푗,ℓ 푗,ℓ 푗=1 ℓ=1 푗=1 ℓ=1 J K 푚 푘 1 ∑︁ ∑︁ = (푠 (풙) = 1 ∧ 푠 (풚) = 0) ∨ (푠 (풙) = 0 ∧ 푠 (풚) = 1) 2푚 푗,ℓ 푗,ℓ 푗,ℓ 푗,ℓ 푗=1 ℓ=1 J K 푚 푘 1 ∑︁ ∑︁ = (푥 = ℓ ∧ 푦 ≠ ℓ) ∨ (푥 ≠ ℓ ∧ 푦 = ℓ) 2푚 푗 푗 푗 푗 푗=1 ℓ=1 J K 푚 푘 1 ∑︁ ∑︁ = 푥 = ℓ ∧ 푦 ≠ ℓ + 푥 ≠ ℓ ∧ 푦 = ℓ  2푚 푗 푗 푗 푗 푗=1 ℓ=1 J K J K 푚 푘 푚 푘 1 ∑︁ ∑︁ 1 ∑︁ ∑︁ = 푥 = ℓ ∧ 푦 ≠ ℓ + 푥 ≠ ℓ ∧ 푦 = ℓ 2푚 푗 푗 2푚 푗 푗 푗=1 ℓ=1 J K 푗=1 ℓ=1 J K 푚 푚 푚 1 ∑︁ 1 ∑︁ 1 ∑︁ = 푥 ≠ 푦 + 푥 ≠ 푦 = 푥 ≠ 푦 . (15) 2푚 푗 푗 2푚 푗 푗 푚 푗 푗 푗=1 J K 푗=1 J K 푗=1 J K □ Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 32

E RANKING DOMAIN Note that while our results on fault estimation from the binary domain directly apply (at least to the ICN model), aggregation is more tricky: an issue-by-issue aggregation may result in a non-transitive (thus invalid) solution. The voting rules we consider are guaranteed to output a valid ranking. While there is much literature on which voting rule comes closest to the ground truth (see, e.g., [4, 18]), we assume here that the voting rule has already been fixed, and are concerned mainly about the effect of weighing voters according to their estimated fault. The Kemeny rule and near-optimal aggregation. It is well known that for both Condorcet noise model and Mallows model (a restriction of Condorcet model to transitive orders, see also Ap- pendix J.1), when all voters have the same fault level, the maximum likelihood estimator of 퐿풛 is obtained by applying the Kemeny-Young (henceforth, KY) voting rule on 푆 [44]. The KY rule 푟퐾푌 computes the binary vector풚0 that corresponds to the majority applied separately 0 Í 푟퐾푌 푆 푑 , 0 on every pair of candidates (that is, 풚 := sign 푖 ≤푛 풔푖 ); then ( ) := argmin퐿∈L (퐶) 퐻 (풙퐿 풚 ). In particular it can be applied when 푆 is composed of transitive vectors. A natural question is whether there is a weighted version of KY that is an MLE and/or minimizes ∗ the expected distance to 퐿 when fault levels 푓푖 are known. We did not find any explicit reference to this question or to the case of distinct fault levels in general. There are some other extensions: [18] deal with a different variation of the noise model where it is less likely to swap pairsthat are further apart; and in [41] the KY rule is extended to deal with more general noise models and partial orders. As it turns out, using weighted KY with (binary) Grofman weights 푤퐺 = log 1−푓푖 provides us 푖 푓푖 with (at least) an approximately optimal outcome.

Proposition 21. Suppose that the ground truth 퐿풛 is sampled from a uniform prior on L(퐶). ∗ Suppose that instance 퐼 = ⟨푆, 풛⟩ is sampled from population (푓푖 )푖 ∈푁 via the ICN model. Let 퐿 := 퐾푌 퐺 ì ′ ∗ 푟 (푆,푤ì (푓 )). Let 퐿 ∈ L(퐶) be any random variable that may depend on 푆. Then 퐸[푑퐾푇 (퐿 , 퐿풛)] ≤ ′ 2퐸[푑퐾푇 (퐿 , 퐿풛)], where expectation is over all instances. Í 푤 ′ , 푚 Proof. Consider the ground truth binary vector 풛. Let 풚 := sign 푖 ≤푛 푖 풔푖 . Let 풙 ∈ {−1 1} be an arbitrary random variable that may depend on the input profile. For every 푗 ≤ 푚 (i.e., every 푦 푧 푃푟 푥 ′ 푧 푃푟 푦 푧 pair of elements), we know from [20] that 푗 is the MLE for 푗 , and thus [ 푗 = 푗 ] ≤ [ 푗 = 푗 ]. ∗ Now, recall that by definition of the KY rule, 퐿 minimizes the Hamming distance to 퐿풚 . By triangle inequality, and since 풙퐿∗ is the closest transitive vector to 풚, ∗ 퐸[푑퐾푇 (퐿 , 퐿풛)] = 퐸[푑퐻 (풙퐿∗, 풛)] ≤ 퐸[푑퐻 (풙퐿∗,풚) + 푑퐻 (풚, 풛)] ≤ 퐸[2푑퐻 (풚, 풛)] 2 ∑︁ 2 ∑︁ = 푃푟 [푦 ≠ 푧 ] ≤ 푃푟 [푥 ′ ≠ 푧 ] = 퐸[푑 ( ′, )]. 푚 푗 푗 푚 푗 푗 2 퐻 풙 풛 푗 ≤푚 푗 ≤푚 ′ ′ In particular, this holds for any transitive vector 풙 = 풙퐿′ corresponding to ranking 퐿 , thus ∗ ′ 퐸[푑퐾푇 (퐿 , 퐿풛)] ≤ 2퐸[푑퐻 (풙퐿′, 풛)] = 2퐸[푑퐾푇 (퐿 , 퐿풛)], as required. □

F GENERAL DISTANCE METRICS Suppose that 푑 is an arbitrary distance metric over some space 푍. 풛 ∈ 푍 is the ground truth, and 풔푖 ∈ 푍 is the report of worker 푖. The fault level and proxy score are defined as before.

Theorem 14. If푑 is any distance metric and Y is symmetric, then max{휇T (풛), 푓푖 (풛)} ≤ 퐸[휋푖 (푆)|푡푖, 풛] ≤ 휇T (풛) + 푓푖 (풛). Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 33

Proof. We start with the upper bound, which does not require the symmetry assumption. 1 ∑︁ 퐸[휋 (푆)|푡 , 풛] = 퐸[ 푑(풔 , 풔 ′ )|푡 , 풛] 푖 푖 푛 − 1 푖 푖 푖 푖′≠푖 1 ∑︁ ≤ 퐸[ 푑(풔 , 풛) + 푑(풛, 풔 ′ )|푡 , 풛] (triangle inequality) 푛 − 1 푖 푖 푖 푖′≠푖 1 ∑︁ 1 ∑︁ = 퐸[ 푑(풔 , 풛)|푡 , 풛] + 퐸[ 푑(풛, 풔 ′ )|푡 , 풛] 푛 − 1 푖 푖 푛 − 1 푖 푖 푖′≠푖 푖′≠푖 1 ∑︁ = 퐸[푑(풔 , 풛)|푡 , 풛] + 퐸[푑(풛, 풔 ′ )|풛] 푖 푖 푛 − 1 푖 푖′≠푖 1 ∑︁ = 푓 (풛) + 퐸 ′ ∼T퐸 ′ ∼Y ( ′ ) [푑(풛, 풔 ′ |푡 , 풛] 푖 푛 − 1 푡푖 풔푖 푡푖 ,풛 푖 푖 푖′≠푖 1 ∑︁ = 푓 (풛) + 퐸 ′ ∼T [푓 ′ (풛)] 푖 푛 − 1 푡푖 푖 푖′≠푖 1 ∑︁ = 푓 (풛) + 휇T (풛) = 푓 (풛) + 휇T (풛). 푖 푛 − 1 푖 푖′≠푖 For the lower bound, we denote for every point 풙 ∈ 푋 its symmetric location by 푥ˆ = 휙 (풙). Fix ′ some location 풔푖 and some other agent 푖 .

퐸[푑( , ′ )| , 푡 ′, ] 퐸 [푑( , ′ )] 풔푖 풔푖 풔푖 푖 풛 = 풔푖′ ∼Y (푡푖′,풛) 풔푖 풔푖 1 1 = 퐸 ′ ∼Y ( ′ ) [ 푑(풔 , 풔 ′ ) + 푑(풔 , 풃 ′ )] (symmetry) 풔푖 푡푖 ,풛 2 푖 푖 2 푖 푖 1 = 퐸 ′ ∼Y ( ′ ) [푑(풔 , 풔 ′ ) + 푑(풔 , 풃 ′ )] 2 풔푖 푡푖 ,풛 푖 푖 푖 푖 1 ≥ 퐸 ′ ∼Y ( ′ ) [푑(풔 ′, 풃 ′ )] (triangle inequality) 2 풔푖 푡푖 ,풛 푖 푖 1 = 퐸 ′ ∼Y ( ′ ) [2푑(풔 ′, 풛)] (definition of 풃 ′ ) 2 풔푖 푡푖 ,풛 푖 푖 = 푓푖′ (풛). Similarly, for any point 풙, 퐸 푑 , 푡 , 퐸 푑 , [ (풔푖 풙)| 푖 풛] = 풔푖 ∼Y (푡푖,풛) [ (풔푖 풙)] 1 1 = 퐸 ∼Y ( ) [ 푑(풙, 풔 ) + 푑(풙, 풃 )] (symmetry) 풔푖 푡푖,풛 2 푖 2 푖 1 = 퐸 ∼Y ( ) [푑(풙, 풔 ) + 푑(풙, 풃 )] 2 풔푖 푡푖,풛 푖 푖 1 ≥ 퐸 ∼Y ( ) [푑(풔 , 풃 )] (triangle inequality) 2 풔푖 푡푖,풛 푖 푖 퐸 푑 , 푓 . = 풔푖 ∼Y (푡푖,풛) [ (풔푖 풛)] = 푖 (풛)

That is, the expected distance of 푖 from any point is at least 푓푖 . Therefore 퐸[푑(풔푖, 풔푖′ )|푡푖, 푡푖′ ] ≥ max{푓푖 (풛), 푓푖′ (풛)}. Now, it follows that

퐸[휋 (푆)|푡 , ] 퐸 [푑( , ′ )|푡 , 푡 ′, ] ≥ 퐸 [ {푓 ( ), 푓 ′ ( )}] ≥ {푓 ( ), 휇 ( )}, 푖 푖 풛 = 푡푖′ 풔푖 풔푖 푖 푖 풛 푡푖′ ∼T max 푖 풛 푖 풛 max 푖 풛 T 풛 as required. □ Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 34

퐸 [ {푓 , 푓 ′ }] The proof is actually showing a more informative lower bound: 푡푖′ ∼T max 푖 푖 . For the best worker this would equal 휇, whereas for the worst worker this would equal 푓푖 . ˆ Note that taking 푓푖 = 휋푖 − 휇ˆ is a good heuristic (assuming 휇ˆ is a good estimate of 휇), in particular for very good or very bad workers: if 푓푖  0 then due to our upper bound, ˆ 푓푖 = 휋푖 − 휇ˆ ≤ 푓푖 + 휇 − 휇ˆ  푓푖 .

If 푓푖 ≫ 휇 then due to the lower bound, ˆ 푓푖 = 휋푖 − 휇ˆ ≥ 푓푖 − 휇ˆ  푓푖 − 휇 = (1 − 표(1))푓푖 .

We can also use this result to show that 휋 provides a 2-approximation of 푓푖 , at least for the worse-than-average workers.

Corollary 22. If 푑 is any distance metric and Y is symmetric, then for any 푖 s.t. 푓푖 ≥ 휇, we have then 푓푖 ∈ [퐸[휋푖 ]/2, 퐸[휋푖 ]].

This is since 푓푖 = max{휇, 푓푖 } ≤ 퐸[휋푖 ] on one hand, and 퐸[휋푖 ] ≤ 푓푖 + 휇 ≤ 2푓푖 on the other hand. To see that this is tight (for an Euclidean distance in one dimension), suppose that 풛 = 0. consider 푛 − 1 workers with 풔푖 sampled uniformly from {−1, 1}, and another worker 푛 where 풔푛 = −푛 w.p. 1 푛 1 푓 휇 퐸[휋 ] + 표( ) 2푛 , 풔푛 = w.p. 2푛 , and otherwise 풔푛 = 0. Then 푖 = = 1 for all workers, but 1 = 1 1 and 퐸[휋푛] = 2 − 표(1). A slight modification of the above example also shows why the mean distance alone cannot fully distinct between two workers with 푓 = 2푓 ′: if we change the support of worker 푛 above to {−푛/2, 0, 푛/2} (without changing the probabilities) then 푓푛 = 휇/2 = 푓1/2, yet 퐸[휋1], 퐸[휋푛] both equal 1 + 표(1).

G SAMPLE COMPLEXITY AND CONSISTENCY

Proposition 2. Suppose the fourth moment of each distribution Y(˜ 풛, 푡), 푡 ∈ T is bounded, and that questions are i.i.d. conditional on worker’s type. That is, Y(˜ 풛, 푡) = Z(풛, 푡)푚 for some some distribution 푚 1 Z 휇 = 퐸 푛 [휋 (푆) | , 푡 ] ∈ R , 푡 ∈ T 푛,푚 ( ) . Denote 풔푖,풛 : 푆∼Y (풛,T) 푖 풛 푖 for some 풛 푖 . If > Ω 훿휖2 , we have:

휋 푆 휇 휖 , 푡 훿. Pr [| 푖 ( ) − 풔푖,풛 | > | 풛 푖 ] < (16) 푆∼Y (풛,T)푛

More specifically, let 푀푡 (푡 = 1, 2, 3, 4) denote the supremum of the 푡-th moment of any distribution 24(4(푀 )2 + 4푀 + 2푀 ) Y(˜ 풛, 푡), 푡 ∈ T (16) 푛 > 2 2 4 + 1 . Then the bound in Eq. holds whenever 훿휖2 and 12(푀 + 4푀 푀 + 4(푀 )2푀 ) 3푀 푚 > max{ 4 1 3 1 2 , 4 } 훿휖2 훿 .

Proof. W.l.o.g. let 풛 = 0. The proofs for other cases are similar. For any 풔푖 , let 휇 퐸 푛 [휋 (푆)| , ] 퐸 푛−1 [휋 (푆 , )| ] 퐸(휋 (푆)) 풔푖,풛 = 푆∼Y (풛,T) 푖 풔푖 풛 = 푆−푖 ∼Y (풛,T) 푖 −푖 풔푖 풛 . It follows from definition that 푖 = 퐸 휇 풔푖 ∼Y (풛,푡푖 ) ( 풔푖,풛). We define the following three events. 휋 푆 휇 휖 • Event A. | 푖 ( ) − 풔푖,풛 | > 퐴. The event will be used conditioned on 풔푖 satisfying some properties. 2 2 2 ∥풔푖 ∥2 ∥풔푖 ∥2 ∥풔푖 ∥2 • Event B. − 퐸풔 ∼Y (풛,푡 ) ( ) > 휖퐵. The event requires to be not too close to 푚 푖 푖 푚 푚 ∥풔 ∥2 Y( , 푡 ) 푖 2 ≤ 푀 + 휖 its mean, where 풔푖 is drawn from 풛 푖 . If this event does not hold, then 푚 2 퐵. Later in the proof we will set 휖퐵 = 1. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 35

휇 퐸 휇 휖 휇 • Event C. | 풔푖,풛 − 풔푖 ( 풔푖,풛)| ≤ 퐶 . The event requires 풔푖,풛 to be not too close to its mean 퐸 휇 퐸 휋 푆 휇 풔푖 ( 풔푖,풛) = ( 푖 ( )), where the randomness comes from 풔푖 . 풔푖,풛 is a function that only depends on 풔푖 . In the following three claims, we give lower bounds on Events A (Claim 23), B (Claim 24), and C (Claim 25), respectively. The first claim states that the probability for Event A to hold isupper- ∥풔 ∥2 푖 2 푛 휖 bounded by a function that depends on 푚 , , and 퐴.

Claim 23. For any 풔푖 , we have

2 ∥풔푖 ∥2 8푀2 + 2푀4 Pr(Event A|풔 ) = Pr(|휋 (푆) − 휇 | > 휖 |풔 ) < 푚 푖 푖 풔푖,풛 퐴 푖 (푛 − )휖2 1 퐴

Proof. For any 풔푖 , 휋푖 (푆|풔푖 ) can be seen as average of 푛 − 1 i.i.d. samples for 푑(풔푖, 풙), where (푥 , . . . , 푥 ) Y( , T) 푑( , ) 1 ∥ ∥2 − 2 · + 1 ∥ ∥2 풙 = 1 푚 is distributed as 풛 . For any fixed 풔푖 , 풔푖 풙 = 푚 풔푖 2 푚 풔푖 풙 푚 풙 2, where ∥ · ∥2 is the 퐿2 norm. Therefore, the variance of 푑(풔푖, 풙) given 풔푖 is the same as the variance − 2 · + 1 ∥ ∥2 of 푚 풔푖 풙 푚 풙 2, calculated as follows.

1 푉퐴푅[푑(풔 , 풙)|풔 ] = 푉퐴푅[−2풔 · 풙 + ∥풙∥2|풔 ] 푖 푖 푚2 푖 2 푖 1 ≤ 퐸{(−2풔 · 풙 + ∥풙∥2)2|풔 } 푚2 푖 2 푖 푚 1 ∑︁ ≤ 퐸{2푚( (4푠2 푥2 + 푥4))|풔 }(Cauchy-Schuwarz) 푚2 푖 푗 푗 푗 푖 푗=1 푚 2 2 ∑︁ ∥풔푖 ∥ ≤ ( 푀 ∥ ∥2 + 푚푀 ) = 푀 2 + 푀 푚 4 2 풔푖 2 4 8 2 푚 2 4 푗=1

Recall that 푀2 (respectively, 푀4) is the maximum second (respectively, fourth) moment of random variables in Y(풛, T). The claim follows after Chebyshev’s inequality. □

Claim 24. For any 푡푖 , we have

∥풔 ∥2 ∥풔 ∥2 푀 푖 2 퐸 푖 2 휖 4 Pr (Event B) = Pr ( − 풔푖 ∼Y (풛,푡푖 ) ( ) > 퐵) < 풔 ∼Y (풛,푡 ) 풔 ∼Y (풛,푡 ) 푚 푚 푚휖2 푖 푖 푖 푖 퐵

Proof. Note that the components of 풔푖 are i.i.d., whose variances are no more than 푀4. Therefore, ∥풔 ∥2 푖 2 푀4 ) the variance of 푚 is no more than 푚 . The claim follows after Chebyshev’s inequality. □

2 Claim 25. Let 퐶3 = 푀4 + 4푀1푀3 + 4(푀1) 푀2. For any 푡푖 , we have

퐶 휇 퐸 휇 휖 3 Pr (Event C) = Pr (| 풔푖,풛 − 풔푖 ( 풔푖,풛)| > 퐶 ) < 풔 ∼Y (풛,푡 ) 풔 ∼Y (풛,푡 ) 푚휖2 푖 푖 푖 푖 퐶 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 36

Proof. For any given 풔푖 , we have 푚 1 ∑︁ 휇 = 퐸 { (푠 − 푥 )2} 풔푖,풛 푚 풙∼Y (풛,T) 푖 푗 푗 푗=1 푚 푚 푚 1 ∑︁ ∑︁ ∑︁ = 퐸 { 푠2 + 푥2 + 푠 푥 } 푚 풙∼Y (풛,T) 푖 푗 푗 2 푗 푗 푗=1 푗=1 푗=1 2 푚 푚 ∥풔푖 ∥ 1 ∑︁ 2 ∑︁ = 2 + 퐸(푥2) + 푠 퐸(푥 ) 푚 푚 푗 푚 푗 푗 푗=1 푗=1 2 푚 ∥풔푖 ∥ 2푀 ∑︁ ≤ 2 + 푀 + 1 푠 푚 2 푚 푗 푗=1

Recall that 푀1 (respectively, 푀2) is the superium first (respectively, second) moment of random variables in T . Now let us consider the case where 풔푖 is generated from Y(풛, 푡푖 ). We have 2 푚 ∥풔푖 ∥ 2푀 ∑︁ 푉퐴푅(휇 ) =푉퐴푅( 2 + 푀 + 1 푠 ) 풔푖,풛 푚 2 푚 푗 푗=1 푚 1 ∑︁ = 푉퐴푅( (풔2 + 2푀 푠 )) 푚2 푖 푗 1 푖 푗 푗=1 푚 1 ∑︁ = 푉퐴푅(푠2 + 2푀 푠 ) (independence of 푠 ’s) 푚2 푖 푗 1 푖 푗 푖 푗 푗=1 1 1 ≤ 퐸 (푠2 + 2푀 푠 )2 ≤ (푀 + 4푀 푀 + 4(푀 )2푀 ) 푚 푠푖1 푖1 1 푖1 푚 4 1 3 1 2 The claim follows after Chebyshev’s inequality. □ 휖 휖 휖 휖 Let 퐴 = 퐶 = 2 and 퐵 = 1. We have: 휖 휖 Pr(|휋 − 퐸(휋 )| > 휖) < Pr(|휋 − 휇 | > ) + Pr(|휇 − 퐸(휋 )| > ) 푖 푖 푖 풔푖,풛 2 풔푖,풛 푖 2 = Pr(Event A) + Pr(Event C) 4퐶 ≤ Pr(Event A) + 3 (Claim 25) 푚휖2 4퐶 ≤ Pr(Event A&¬Event B) + Pr(Event A&Event B) + 3 푚휖2 4퐶 ≤ Pr(Event A|¬Event B) Pr(¬Event B) + Pr(Event B) + 3 푚휖2 푀 4퐶 ≤ Pr(Event A|¬Event B) + 4 + 3 (Claim 24) 푚휖2 푚휖2 퐵 4(8푀 (푀 + 1) + 2푀 ) 푀 4퐶 ≤ 2 2 4 + 4 + 3 (Claim 23 and 휖 = 1) (푛 − 1)휖2 푚 푚휖2 퐵 When 푚 and 푛 satisfy the conditions described in the theorem, each of the three parts is no more 훿 than 3 , which proves the theorem. □ Consistency for P-TD with continuous data. Corollary 4 of course does not yet guarantee that either algorithm returns accurate answers. For this, we need the following two results. The first says that a good approximation of 푓ì entails a good approximation of 풛; and the second shows that Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 37 in the limit P-TD with the mean function 푟 푀 returns the correct answers. Given a fault estimation 푓 푤퐴 푤퐴 (푓 ) 1 , we denote the Aitkin weights = := 푓 , following [2]. 1 Í 1 ∗ 1 Í 1 Recall that 풛ˆ = 푖 ∈푁 풔푖 is the output of P-TD; and that 풙 = 푖 ∈푁 풔푖 is the best possible 푛 푓ˆ푖 푛 푓푖 estimation of 풛 by [2]. 푓ˆ 훿 푓 훿 , 푤 푤퐴 Lemma 26. Suppose that 푖 ∈ (1 ± ) 푖 for some ∈ [0 1]. Then 푖 is in the range [ 푖 (1 − 훿 ,푤퐴 훿 ) 푖 (1 + 2 )]. Proof. 1 1 1 − 훿 푤 = ≥ = 푖 ˆ 푓 ( + 훿) 푓 ( + 훿)( − 훿) 푓푖 푖 1 푖 1 1 1 − 훿 1 − 훿 = ≥ = ( − 훿)푤퐴. 2 1 푖 푓푖 (1 − 훿 ) 푓푖 and 1 1 1 + 2훿 푤 = ≤ = 푖 ˆ 푓 ( − 훿) 푓 ( − 훿)( + 훿) 푓푖 푖 1 푖 1 1 2 1 + 2훿 1 + 2훿 = ≤ = ( + 훿)푤퐴. 2 1 2 푖 푓푖 (1 + 훿 − 2훿 ) 푓푖 □ ˆ Theorem 27. Suppose that ∀푖 ∈ 푁, 푓푖 ∈ (1 ± 훿)푓푖 for some 훿 ∈ [0, 1]; it holds that ∗ 2 푑(풛ˆ, 풛) ≤ 푑(풙 , 풛) + 푂 (훿 · max푖,푗 (푠푖 푗 ) ). 푤 푤퐴 훿 ,푤퐴 훿 Proof. Note first that by Lemma 26, each 푖 is in the range [ 푖 (1 − ) 푖 (1 + 2 )]. W.l.o.g., Í 푤퐴 푖 ∈푁 푖 = 1. Thus ∑︁ (1 − 훿) ≤ 푤ˆ 푖′ ≤ (1 + 2훿), 푖′ and 1 ∈ ( − 훿, + 훿) Í 푤 1 4 1 4 푖′ ˆ 푖′ as well. Denote 푠 = max푖 푗 푠푖 푗 . For each 푗 ≤ 푚,

2 1 ∑︁ 2 (푧ˆ푗 − 푧푗 ) = ( 푤ˆ 푖푠푖 푗 − 푧푗 ) Í ′ 푤ˆ ′ 푖 푖 푖 ∈푁 1 ∑︁푤퐴 휏 푠 푧 2 휏 훿, 훿 = ( ˆ 푖 (1 + 푖 ) 푖 푗 − 푗 ) (for some 푖 ∈ [−2 2 ]) Í ′ 푤ˆ ′ 푖 푖 푖 ∈푁 ∑︁푤퐴 휏 ′ 푠 푧 2 휏 ′ 훿, 훿 = ( 푖 (1 + 푖 ) 푖 푗 − 푗 ) (for some 푖 ∈ [−8 8 ]) 푖 ∈푁 ∑︁ ∑︁ 푤퐴 휏 ′ 푠 푤퐴 휏 ′ 푠 푧2 푧 ∑︁푤퐴 휏 ′ 푠 = 푖 (1 + 푖 ) 푖 푗 푖′ (1 + 푖′ ) 푖′ 푗 + 푗 − 2 푗 ˆ 푖 (1 + 푖 ) 푖 푗 푖 ∈푁 푖′ ∈푁 푖 ∈푁 ∑︁ ∑︁ 푤퐴푠 푤퐴푠 휏 ′ 휏 ′ 휏 ′휏 ′ 푧2 푧 ∑︁푤퐴 휏 ′ 푠 = 푖 푖 푗 푖′ 푖′ 푗 (1 + 푖 + 푖′ + 푖 푖′ ) + 푗 − 2 푗 ˆ 푖 (1 + 푖 ) 푖 푗 푖 ∈푁 푖′ ∈푁 푖 ∈푁 ∑︁ ∑︁ 푤퐴푠 푤퐴푠 휏 ′′ 푧2 푧 ∑︁푤퐴 휏 ′ 푠 = 푖 푖 푗 푖′ 푖′ 푗 (1 + 푖푖′ ) + 푗 − 2 푗 푖 (1 + 푖 ) 푖 푗 푖 ∈푁 푖′ ∈푁 푖 ∈푁 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 38

휏 ′′ 훿 For | 푖푖′ | < 25 . Similarly,

푥∗ 푧 2 ∑︁ ∑︁ 푤퐴푠 푤퐴푠 푧2 푧 ∑︁푤퐴푠 . ( − 푗 ) = 푖 푖 푗 푖′ 푖′ 푗 + 푗 − 2 푗 푖 푖 푗 푖 ∈푁 푖′ ∈푁 푖 ∈푁

Next,

1 ∑︁   푑(ˆ, ) − 푑( ∗, ) = (푧 − 푧 )2 − (푥∗ − 푧 )2 풛 풛 풙 풛 푚 ˆ푗 푗 푗 푗 푗 ≤푚 ! 1 ∑︁ ∑︁ ∑︁ 퐴 퐴 ′′ ∑︁ 퐴 ′ = 푤 푠 푤 ′ 푠 ′ 휏 ′ + 푧 푤 푠 휏 푚 푖 푖 푗 푖 푖 푗 푖푖 2 푗 푖 푖 푗 푖 푗 ≤푚 푖 ∈푁 푖′ ∈푁 푖 ∈푁 ! 1 ∑︁ ∑︁ ∑︁ 퐴 퐴 ∑︁ 퐴 2 ≤ 훿 푤 푠 푤 ′ 푠 ′ + 훿 푤 ( ) 25 푚 푖 푖 푗 푖 푖 푗 16 푖 풔 푗 ≤푚 푖 ∈푁 푖′ ∈푁 푖 ∈푁

1 ∑︁ ∑︁ ∑︁ 퐴 퐴 2 ∑︁ 퐴 2 ≤ 훿 푤 푤 ′ (푠) + 훿 푤 (푠) 25 푚 푖 푖 16 푖 푗 ≤푚 푖 ∈푁 푖′ ∈푁 푖 ∈푁 ≤ 50훿 (푠)2, as required. □

Theorem 15. When Y has bounded support and 풛 is bounded, P-TD with the weighted mean rule 푟, is consistent. That is, as 푛 → ∞, 푚 = 휔 (log(푛)) and 푛 = 휔 (log푚): ˆ • For any constant 훿 > 0, 푃푟 [|푓푖 − 푓푖 | > 훿] → 0 for all 푖 ∈ 푁 ; • For any constant 휏 > 0, 푃푟 [푑(풛ˆ, 풛) > 휏] → 0.

The proof in fact shows the consistency of D-TD, and thus of P-TD휆 with 휆 = 4 (which is identical). Since P-TD휆 are identical in the limit for all 휆, consistency holds for any algorithm in the class.

ˆ Proof. We start by proving the first part, namely that |푓푖 − 푓푖 | → 0 under the stated conditions. We divide the proof into the following two steps. 0 1 푗 ≤ 푚 푦 Í ′ 푠 ′ 푧 Step 1: for each , 푗 = 푛 푖 ∈푁 푖 푗 is close to 푗 with high probability. This can be seen as the average of 푛 i.i.d. random variables such that the mean of each random variable is 푧푗 and the variance of each random variable is 휇F. Therefore, by Hoeffding’s inequality for sub-Gaussian random variables, we have that for any 푇1 > 0:

푦0 푧 푇 퐶 푇 2푛 , Pr(| 푗 − 푗 | > 1) < exp{− 1 1 }

for constant 퐶1 > 1. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 39

푗 푚 푦0 푧 푇 Í푚 푠 푦0 2 푚 푓 Step 2: for each ≤ , given | 푗 − 푗 | ≤ 1, then we have that 푗=1 ( 푖 푗 − 푗 ) / is close to 푖 with high probability. 푚 푠 푦0 2 ∑︁ ( 푖 푗 − 푗 ) 푚 푗=1 푚 푠2 푚 푦02 푚 푦0푠 ∑︁ 푖 푗 ∑︁ 푗 ∑︁ 푗 푖 푗 = + − 푚 푚 2 푚 푗=1 푗=1 푗=1 푚 푠2 푚 2 푚 푚 ∑︁ 푖 푗 ∑︁ (푧푗 + 푇1) ∑︁ 푧푗푠푖 푗 ∑︁ 푠푖 푗 ≤ + − + 푇 푚 푚 2 푚 2 1 푚 푗=1 푗=1 푗=1 푗=1 푚 2 푚 ∑︁ (푠푖 푗 − 푧푗 ) ∑︁ (푠푖 푗 + 푧푗 ) = + 푇 2 + 푇 푚 1 2 1 푚 푗=1 푗=1

Because for each 푗 ≤ 푚, 푠푖 푗 − 푧푗 is a sample of the Gaussian distribution with variance 푓푖 , for any 푇2 > 0, we have: 푚 ∑︁ 푠 푧 푚 푇 퐶 푇 2푚 , Pr(| ( 푖 푗 − 푗 )/ | > 2) < exp{− 2 2 } 푗=1 for constant 퐶2 > 0, where 퐶2 is related to the minimum variance in F . 푗 푚 퐸 푠 푧 2 푓 푉푎푟 푠 푧 2 푓 2 푇 Because for each ≤ , [( 푖 푗 − 푗 ) ] = 푖 and [( 푖 푗 − 푗 ) ] = 2 푖 , for any 3 > 0, we have: 푚 ∑︁ 푠 푧 2 푚 푓 푇 퐶 푇 2푚 , Pr(| ( 푖 푗 − 푗 ) / − 푖 | > 3) < exp{− 3 3 } 푗=1 for constant 퐶3 > 0. 푚 퐶 푇 2푛 푛 퐶 푇 2푚 푛 퐶 푇 2푚 By union bound, with probability at least 1− exp{− 1 1 }+ exp{− 2 2 }+ exp{− 3 3 }, we have: 푗 푚 푦0 푧 푇 • For every ≤ , | 푗 − 푗 | ≤ 1. 푖 푛 Í푚 푠 푧 푚 푇 • For every ≤ , | 푗=1 ( 푖 푗 − 푗 )/ | ≤ 2. 푖 푛 Í푚 푠 푧 2 푚 푓 푇 • For every ≤ , | 푗=1 ( 푖 푗 − 푗 ) / − 푖 | ≤ 3 This means that 푚 푓ˆ ∑︁ 푠 푦0 2 푚 푖 = ( 푖 푗 − 푗 ) / 푗=1 푚 푚 ∑︁ 푠 푧 2 푚 푇 2 푇 ∑︁ 푠 푧 푚 = ( 푖 푗 − 푗 ) / + 1 + 2 1 ( 푖 푗 + 푗 )/ 푗=1 푗=1 푚 푚 푓 푇 푇 2 푇 ∑︁ 푠 푧 푚 푇 ∑︁ 푧 푚 ≤ 푖 + 3 + 1 + 2 1 ( 푖 푗 − 푗 )/ + 4 1 푗 / 푗=1 푗=1 푚 푓 푇 푇 2 푇 푇 푇 ∑︁ 푧 푚 ≤ 푖 + 3 + 1 + 2 1 2 + 4 1 ( 푗 / ) 푗=1 푚 퐶 푇 2푛 푛 퐶 푇 2푚 Similarly, we can prove that with probability at least 1 − 2( exp{− 1 1 } + exp{− 2 2 } + 푛 퐶 푇 2푚 푖 푛 exp{− 3 3 )}, for all ≤ , we have: 푚 푓ˆ 푓 푇 푇 2 푇 푇 푇 ∑︁ 푧 푚 , | 푖 − 푖 | ≤ 3 + 1 + 2 1 2 + 4 1 ( 푗 / ) 푗=1 Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 40

√︃ √︃ 푇 휔 ( ) log푚 푇 푇 휔 ( ) log푛 Consistency follows after setting letting 1 = 1 푛 , 2 = 3 = 1 푚 . Proof of part 2. The consistency of P-TD follows after the first part of this theorem, Theorem 27, and the fact that 풙∗ is a consistent estimator. □

Consistency for the Categorical domain. Lemma 28 (Consistency of plurality with Grofman Weights). Plurality with Grofman F( 푘−1 ) weights is consistent for IER with observed fault levels, if and only if 푘 ≠ 1. Proof. W.l.o.g. let 푚 = 1 and 푧 = 0. Let 푊 ∗ (respectively, 푊 ) denote the random variable for a weighted vote given to alternative 0 (respectively, any other fixed alternative) under IER with observed fault levels. In other words, for any 0 < 푓 < 1, conditioned on the fault level being 푓 (which happens with probability F(푓 )), we have • 푊 ∗ (1−푓 )(푘−1) − 푓 푊 ∗ = log 푓 with probability 1 , and = 0 otherwise. • 푊 (1−푓 )(푘−1) 푓 푊 = log 푓 with probability 푘−1 , and = 0 otherwise. Therefore, ∫ 푘 (1 − 푓 )(푘 − 1) E(푊 ∗ − 푊 ) = (1 − 푓 ) log 푑F(푓 ) (17) 푘 − 1 푓 − 푘 푓 (1−푓 )(푘−1) We note that 1 푘−1 and log 푓 always have the same sign (positive, negative, or 0), and 푓 푘−1 both of them are 0 if and only if = 푘 , that is, the voter chooses an alternative uniformly at F( 푘−1 ) 푊¯ ∗ Í푛 푊 ∗/푛 푊¯ Í푛 푊 /푛 random. Therefore, when 푘 < 1, we have (17)>0. Let = 푖=1 and let = 푖=1 . ∗ By the Law of Large Numbers, we have lim푛→∞ Pr(푊¯ −푊¯ > 0) = 1. This means that as 푛 increases, the total weighted votes of 0 is strictly larger than the total weighted votes of 1 with probability close to 1. It follows from the union bound that as 푛 → ∞, the probability for 0 to be the winner goes to 1, which proves the consistency of the algorithm. F( 푘−1 ) When 푘 = 1, with probability 1 the votes are uniformly distributed with the same weights regardless of the ground truth. Therefore the algorithm is not consistent. □

Theorem 29 (Consistency of P-TD (IER model)). Suppose the support of T is a closed subset of ( , ) F( 푘−1 ) 휇 휇 0 1 , 푘 ≠ 1, and a consistent estimator ˆ for is used. Then P-TD with the weighted plurality rule 푟 is consistent. More precisely, as 푛 → ∞ and 푚 = 휔 (log푛): ˆ (1) For any constant 훿 > 0, 푃푟 [|푓푖 − 푓푖 | > 훿] → 0 for all 푖 ∈ 푁 ; (2) For any constant 휏 > 0, 푃푟 [푑(풛ˆ, 풛) > 휏] → 0. Proof. Proof of the first part. The proof proceeds by bounding the probability for the following three events as 푛 → ∞ and 푚 = 휔 (푛). We will show that by choosing 푇1,푇2,푇3 in the following ˆ ì three events appropriately, 푓푖 ’s converge to the ground truth fault levels 푓 .

Event A: Given 푇1 > 0, 휇ˆ is no more than 푇1 away from 휇. ′ ′ Event B: Given 푇2 > 0, for all 푖 ≤ 푛 and 푖 ≤ 푛 with 푖 ≠ 푖 , 푑푖푖′ is no more than 푇2 away from 푘 푓 + 푓 ′ − 푓 푓 ′ 푖 푖 푘−1 푖 푖 . 1 푇 푖 ≤ 푛 Í ′ 푓 ′ 푇 휇 Event C: Given 3 > 0, for all , 푛−1 푖 ≠푖 푖 is no more than 3 away from (which is ′ 퐸(푓푖′ ) for every 푖 ). ˆ ì We first show that when events A, B, and C holding simultaneously, 푓푖 ’s are close to 푓 . Because event Í Í 1 푖′≠푖 푓푖′ 푘 푖′≠푖 푓푖′ 퐵 푖 ≤ 푛 휋 Í ′ 푑 ′ 푇 푓 + − 푓 holds, for every , 푖 = 푛−1 푖 ≠푖 푖푖 is no more than 2 away from 푖 푛−1 푘−1 푖 푛−1 , 푇 + 푇 + 푘 푓 푇 푓 + 휇 − 푘 푓 휇 퐶 which is no more than 2 3 푘−1 푖 3 away from 푖 푘−1 푖 , because event holds. Also Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 41 according to event 퐴, we have 푘 푘 푘 |휋 − (푓 + 휇 − 푓 휇)| ≤푇 + 푇 + 푓 푇 + 푇 + 푓 푇 푖 푖 푘 − 1 푖 2 3 푘 − 1 푖 3 1 푘 − 1 푖 1 푘 ≤푇 + 푇 + 푇 + (푇 + 푇 ) 1 2 2 푘 − 1 1 3 Therefore, as 푇1,푇2,푇3 → 0, for every 푖 ≤ 푛 we have 푘 휋 → (푓 + 휇ˆ − 푓 휇ˆ), 푖 푖 푘 − 1 푖 푓ˆ 휋푖 −휇ˆ 푓 which means that 푖 = 푘 → 푖 . 1− 푘−1 휇ˆ We now show that when 푚 = 휔 (log푛), for any positive 푇1,푇2, and 푇3 and any 훿 > 0, there exists 푁 ∈ N such that for all 푛 ≥ 푁 , the probability for events A, B, and C hold simultaneously is at least 1 − 훿. For event A. Following the consistency of 휇ˆ, there exists 푁1 ∈ N such that for every 푛 > 푁1, event A holds with probability at least 1 − 훿/3. ′ For event B. Given any 푖 ≠ 푖 , 푑푖푖′ can be seen as the average of 푚 i.i.d. random variables, each of which takes 1 with probability 1 − (1 − 푓푖 )(1 − 푓푖′ ) − 푓푖 푓푖′ /(푘 − 1) and takes 0 otherwise. Therefore, by Hoeffding’s inequality, we have: 푘 −2푇 2푚 Pr(|푑 ′ − (푓 + 푓 ′ − 푓 푓 ′ )| > 푇 ) ≤ 2푒 2 (18) 푖푖 푖 푖 푘 − 1 푖 푖 2 ′ 2 −2푇 2푚 By the union bound, the probability for (18) to hold for any pairs of 푖 ≠ 푖 is no more than 2푛 푒 2 . Therefore, as 푛 → ∞ and 푚 = 휔 (log푛), the probability for event C to hold goes to 1. Consequently, there exists 푁2 ∈ N such that for every 푛 > 푁2, event B happens with probability at least 1 − 훿/3. ′ For event C. For any 푖 ≤ 푛, notice that each 푓푖′ with 푖 ≠ 푖 is an i.i.d. sample from 휇. Therefore, it follows from Hoeffding’s inequality that there exists 푁3 ∈ N such that for all 푛 > 푁3, we have

Í ′ 푓 ′ 훿 Pr(| 푖 ≠푖 푖 | > 푇 ) < 푛 − 1 3 3푛 It follows from the union bound that for any 푛 > 푁3, event C happens with probability at least 1 − 훿/3. The first part of the theorem follows after the union bound with 푁 = max(푁1, 푁2, 푁3). Proof of the second part. Because the support of F is a closed subset of (0, 1), the estimated Grofman weight is bounded. By the first part of the theorem, for any 휏 > 0 and 훿 > 0, there exists 푁 such that for all 푛 > 푁 and 푚 = 휔 (log푛), 푤 푤퐺 휏 푖 푛 훿, Pr(| ˆ 푖 − 푖 | < for all ≤ ) > 1 − 푤 푤퐺 푖 where ˆ 푖 and 푖 are the estimated Grofman weight for agent obtained by using the estimated 푤 = (푤 ,...,푤 ) 푤ì퐺 = (푤퐺, . . . ,푤퐺 ) fault, and the true Grofman weight, respectively. Let b ˆ 1 ˆ 푛 and 1 푛 . 푙 퐴 푙,푤 푙 푤 For any alternative ∈ , let Score( b) denote the weighted score of alternative with weights b. This means that with probability at least 1 − 훿, for each 푙 ∈ 퐴, we have Score(푙,푤) Score(푙,풘푮 ) | b − | 휏 푛 푛 < The consistency of P-TD follows after the fact that 푟 푃 (·,푤ì퐺 ) is consistent and the gap between the average score of the winner and the average score of any other alternative is bounded below by F( 푘−1 ) the constant in (17) (which is strictly positive because 푘 ≠ 1) with probability that goes to 1 as 푛 → ∞. □ Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 42

As we argue in the main text, 휇¯ is a conservative estimate of 휇, and we recommend using it in practice. However it is not a consistent estimate of 휇 and therefore cannot be used for the consistency proof. For completeness, we propose an alternative consistent estimate of 휇 that is based on the pairwise distance matrix: √︄ 1 1 푑ˆ 휇ˇ = − ( )2 − , (19) 1 + 휃 1 + 휃 1 + 휃

푑ˆ 1 Í⌊푛/2⌋ 푑 where := ⌊푛/2⌋ 푖=1 (2푖−1),2푖 .

휇 ≤ 1 휇 휇 Proposition 30. Under the IER model, for any 1+휃 , ˇ converges to in probability. 푛 푖 푛 푑 푑 , Proof. For any ∈ N, notice that for 1 ≤ ≤ ⌊ 2 ⌋, (2푖−1),2푖 are i.i.d. samples of (풔1 풔2), where 푛 풔1 and 풔2 generated independently from Y(풛, T) . It follows that

2 (푑(풔1, 풔2)) = 2휇 − (1 + 휃)휇 , (20) which means that 푑ˆ converges to 2휇 − (1 + 휃)휇2 in probability. The consistency of 휇ˇ follows after the continuous mapping theorem [30], by noticing that 휇ˇ is a continuous function of 푑ˆ and 휇 = 휇ˇ(퐸[푑ˆ]) = 휇ˇ(퐸[푑(풔1, 풔2)]). √︃ 1 ± ( 1 )2 − 푑ˆ Finally, we note that Eq. (20) has two solutions, 1+휃 1+휃 1+휃 , but only the one in Eq. (19) satisfies our assumption on 휇. □

H IMPLEMENTATION NOTES Our eiP-TD algorithm is described in Alg. 4. Implementation notes for all algorithms: • In all algorithms that have an initialization step, we initialized all weights to 1.9 • In all of the algorithms, we clipped the estimated fault level to [0.001, 0.999], to avoid division- by-zero issues. • We did not allow negative weights. • If most weights were negative, we used uniform weights instead. • In the continuous domain we used the NSED or L1 for the distance. In the categorical domain we used Hamming distance. Specific implementation notes: • The EVD algorithm in [33] is more general than the version we implemented, in that it holds two estimates for each worker, allowing for different false positive and false negative rates, as well as for imbalanced data. We implemented the special case that fits the balanced single- parameter IER model (we also made sure that our synthetic datasets are indeed balanced). • [33] four variants of the algorithm that differ in the way they fill the main diagonal of 푄ˆ. We implemented the one they reported as most effective. • We obtained the code of the MAS algorithm from the authors of [5]. We ran it without changing the meta-parameters. We did not partition datasets into subtasks - every dataset (or sample from a dataset) was considered as a separate question.

9[24] suggest to use random initialization but in all of our simulations this has led to worse performance of their algorithm. Reshef Meir, Ofra Amir, Omer Ben-Porat, Tsviel Ben-Shabat, Gal Cohensius, and Lirong Xia 43

ALGORITHM 4: Exclusion-Iterative Proximity-based Truth Discovery (eiP-TD) Input: number of iterations 푇 ; dataset 푆 Output: answers 풛ˆ Initialize 푤ì 0 ← 1ì; −푗 푑 푑−푗 , ′ 푗 Compute 푖푖′ ← (풔푖 풔푖 ) for every pair of workers excluding question ; for 푡 = 1, 2, . . . ,푇 do for every question 푗 ∈ 푀 do for every worker 푖 ∈ 푁 do Í 푡−1 ′≠ 휔 ′ 푑 ′ 휋푡 ← 푖 푖 푖 ,푗 푖푖 ; 푖,푗 Í ′ 푡−1 푖 ≠푖 휔푖′,푗 end 푖 푁 휔푡 ′ 휋푡 휋푡 For every worker ∈ , set 푖,푗 ← max푖 ( 푖′,푗 ) − 푖,푗 ; end end for every question 푗 ∈ 푀 do 휇 ← 1 Í 휋푇 Set ˆ푗 2푛 푖 ∈푁 푖,푗 ; for every worker 푖 ∈ 푁 do 푓ˆ 퐸푠푡푖푚푎푡푒퐹푎푢푙푡 휋푇 , 휇 Set 푖,푗 ← ( 푖,푗 ˆ푗 ); ˆ Set 푤푖,푗 ← 푆푒푡푊 푒푖푔ℎ푡 (푓푖,푗 ); end 푧 푟 푆 ,푤푇 Set ˆ푗 ← ( 푗 ì 푗 ); end return 풛ˆ.

I DEFINITIONS OF VOTING RULES

All voting rules get as input a voting profile S = (L_1, ..., L_n) and non-negative normalized weights w⃗. We denote by L_i(c) the rank of candidate c by voter i (lower is better). For a pair of candidates c, c' we denote: c ≺_i c' if L_i(c) < L_i(c'); v(c, c') := Σ_{i: c ≺_i c'} w_i is the (weighted) fraction of voters preferring c to c'; and y(c, c') := ⟦v(c, c') > 0.5⟧ indicates that c beats c' in a pairwise match. Note that v and y are vectors of length |C|(|C|−1)/2 (one entry per candidate pair). Recall that x_L is the binary vector corresponding to rank L.

Every voting rule r generates a score q^r(c, S, w⃗) for each candidate c, and then ranks candidates by increasing score (breaking ties at random). Therefore we need to specify how each voting rule determines the score.

Borda: q^r(c, S, w⃗) = Σ_{i∈N} w_i (L_i(c) − 1) = Σ_{c'≠c} v(c', c).
Copeland: q^r(c, S, w⃗) = Σ_{c'≠c} y(c', c).
Kemeny (KY): q^r(c, S, w⃗) = L*(c), where L* = argmin_{L∈L} d_H(x_L, y).
Weighted-graph Kemeny (WKY): q^r(c, S, w⃗) = L*(c), where L* = argmin_{L∈L} d_{ℓ1}(x_L, v).
Plurality: q^r(c, S, w⃗) = n − Σ_{i∈N: L_i(c)=1} w_i.
Veto: q^r(c, S, w⃗) = Σ_{i∈N: L_i(c)=m} w_i.
Best Dictator: q^r(c, S, w⃗) = L_{i*}(c), where i* = argmax_{i∈N} w_i.
Random Dictator: similar to Best Dictator, except that i* is selected at random, proportionally to its weight w_i.
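The scores above can be computed directly from the rank matrix; the following short sketch (ours, not the code used in the experiments) implements the Borda, Copeland and Plurality scores, assuming ranks start at 1 and the weights are normalized.

```python
import numpy as np

def weighted_scores(L, w):
    """L[i, c] = rank of candidate c by voter i (1 = best); w[i] = voter i's weight.

    Returns Borda, Copeland and Plurality scores as defined above
    (candidates are then ranked by increasing score).
    """
    n, k = L.shape
    # v[c, c'] = weighted fraction of voters preferring c to c'.
    v = np.array([[(w * (L[:, c] < L[:, cc])).sum() for cc in range(k)] for c in range(k)])
    y = (v > 0.5).astype(float)                      # pairwise-majority indicator
    borda = (w[:, None] * (L - 1)).sum(axis=0)       # = sum over c' != c of v(c', c)
    copeland = y.sum(axis=0) - np.diag(y)            # sum over c' != c of y(c', c)
    plurality = n - np.array([(w * (L[:, c] == 1)).sum() for c in range(k)])
    return borda, copeland, plurality
```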

Modal ranking [7]: this rule simply selects the ranking L that is most frequent in S. In the weighted variation, it selects the L that maximizes the total weight of the voters whose vote is L. We break ties at random, if needed.

Error-Correcting (EC) [34]: given a profile S and a parameter t, the EC voting rule first sets B_t ⊆ L(C) as the set of all rankings whose average distance from S is at most t. Then, if this set is not empty, it selects the ranking from L(C) that minimizes the maximal distance to any ranking in B_t. We can easily extend the EC rule to use voters' weights when computing B_t; then, just like any other weighted voting rule, we can combine it with our proxy-based algorithms. We applied it while varying the value of t = width(S)/ρ in the range ρ ∈ [1, 4], where width(S) := max_{L,L'∈S} d_KT(L, L'). When B_t was empty, we used the weighted KY voting rule (note that with ρ = 1 it is never empty). All ties were broken uniformly at random.
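Since the rankings in our data involve few alternatives, the weighted EC rule can be implemented by brute-force enumeration of all rankings. The sketch below is our own illustration of the definition above (with the fallback to weighted Kemeny omitted), not the experimental code.

```python
from itertools import permutations

def kendall_tau(L1, L2):
    """Number of candidate pairs ordered differently by the two rankings (tuples of candidates)."""
    pos1 = {c: i for i, c in enumerate(L1)}
    pos2 = {c: i for i, c in enumerate(L2)}
    cands = list(L1)
    return sum(1 for a in range(len(cands)) for b in range(a + 1, len(cands))
               if (pos1[cands[a]] < pos1[cands[b]]) != (pos2[cands[a]] < pos2[cands[b]]))

def weighted_ec(profile, weights, t):
    """Weighted Error-Correcting rule (sketch): B_t holds the rankings whose weighted average
    distance from the profile is at most t; return the ranking minimizing the maximal
    distance to B_t. Feasible only for small candidate sets (all rankings are enumerated)."""
    all_rankings = list(permutations(profile[0]))
    W = sum(weights)
    B_t = [L for L in all_rankings
           if sum(w * kendall_tau(L, Li) for w, Li in zip(weights, profile)) / W <= t]
    if not B_t:
        return None  # in the experiments we fall back to the weighted KY rule here
    return min(all_rankings, key=lambda L: max(kendall_tau(L, Lb) for Lb in B_t))
```

For the 4-alternative ranking datasets, enumerating all |C|! = 24 rankings is trivial.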

J DATASETS

J.1 Synthetic datasets

Continuous data. For synthetic continuous data we generated instances from the IER model. For the proto-population T we used either a truncated Normal distribution (f_i ∼ N(1, 0.5), clipped to nonnegative values) or a noisy variant of the "Hammer-Spammer" (HS) distribution. In the HS distribution every worker had a 20% chance to be a "hammer" with f_i = 0.2, and an 80% chance to be a "spammer" with f_i = 1.

Categorical data. For synthetic categorical data we generated instances from the IER model. For the proto-population T we used either a truncated Normal distribution (f_i ∼ N(0.35, 0.15) for k = 2 and f_i ∼ N(0.7, 0.1) for k = 4, clipped to [0, 1]) or a noisy variant of the "Hammer-Spammer" (HS) distribution. In the HS distribution every worker had a 20% chance to be a "hammer" with f_i = 0.2, and an 80% chance to be a "spammer" with worse-than-random performance (f_i = 0.52).

Mallows models. In the main text we discussed the Condorcet noise model. The downside of the Condorcet noise model is that it may result in nontransitive answers. Mallows model is similar, except that it is guaranteed to produce transitive answers (rankings). Formally, given a ground truth ranking L_z ∈ L and a parameter φ_i > 0, the probability of observing the order L_i ∈ L is proportional to φ_i^{d(L_z, L_i)}. Thus for φ_i = 1 we get a uniform distribution, whereas for low φ_i we get orders concentrated around L_z. In fact, if we throw away all non-transitive samples, the probability of getting rank L_i under the Condorcet noise model with parameter f_i (conditional on the outcome being transitive) is exactly the same as the probability of getting L_i under Mallows model with parameter φ_i = f_i/(1 − f_i).

We generated populations from the Mallows distribution, where the proto-population (the distribution of φ_i) was either a clipped Normal distribution (we used N(0.65, 0.15) and N(0.85, 0.15)) or a "Hammer-Spammer" (HS) distribution (20% hammers with φ_i = 0.3, and 80% spammers with φ_i = 0.99, which is just slightly better than random).
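The equivalence between the (transitivity-conditioned) Condorcet noise model and Mallows model also suggests a simple way to sample a Mallows ranking: flip each pair independently with probability f_i and reject non-transitive outcomes. The sketch below is our own illustration of this equivalence (assuming f_i < 0.5), not necessarily the generator used in the experiments, where a direct Mallows sampler would be more efficient for low φ_i.

```python
import numpy as np
from itertools import combinations

def sample_mallows_by_rejection(ground_truth, f, rng):
    """Draw one ranking from Mallows(phi = f/(1-f)) around ground_truth by sampling the
    Condorcet noise model (each pair flipped independently w.p. f) and rejecting
    non-transitive outcomes. Assumes f < 0.5."""
    idx = {c: i for i, c in enumerate(ground_truth)}
    while True:
        # Sample a tournament: beats[(a, b)] is True iff a is ranked above b.
        beats = {}
        for a, b in combinations(ground_truth, 2):
            above = idx[a] < idx[b]              # ground-truth order of the pair
            if rng.random() < f:
                above = not above                # pair flipped with probability f
            beats[(a, b)] = above
        # Order candidates by number of pairwise wins and accept only if transitive.
        wins = {c: 0 for c in ground_truth}
        for (a, b), a_above_b in beats.items():
            wins[a if a_above_b else b] += 1
        order = sorted(ground_truth, key=lambda c: -wins[c])
        pos = {c: i for i, c in enumerate(order)}
        if all(a_above_b == (pos[a] < pos[b]) for (a, b), a_above_b in beats.items()):
            return order

rng = np.random.default_rng(0)
print(sample_mallows_by_rejection(["a", "b", "c", "d"], f=0.2, rng=rng))
```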

Interval matching. For the interval matching domain, we generated data as follows. The ground truth was set to be the union of the intervals [0, 100] and [130, 170] (see the black intervals in Fig. 10). We arbitrarily assigned a parameter q_i between 0.3 and 0.9 to each of the n = 50 workers (specifically, 3 workers with 0.9, 11 workers with 0.8, 5 workers with 0.7, 10 workers with 0.6, 15 workers with 0.5, and 6 workers with 0.3). Each worker with parameter q_i identifies each interval with probability q_i. In addition, if she identifies both, she identifies them as a single unified interval w.p. 1 − q_i. Then some noise, monotone in q_i, is added to the boundaries of each identified interval (specifically, if the real boundary is at x then the reported boundary is at x + ε, where ε ∼ U[−60·q_i, 60·q_i] for the left interval and ε ∼ U[−30·q_i, 30·q_i] for the right interval, and this is done independently for every reported boundary). We then measured the Jaccard distance from each report to the ground truth and to all other reports. The distance between two empty reports is defined as 0.

Fig. 10. A single generated instance of the interval matching problem. On the left we can see the empirical fault level of all 50 workers vs. their disparity. On the right we see the ground truth (the two black intervals) and the reports of three of the workers. We can also see where these workers are in the scatter plot on the left.
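For reference, here is a minimal sketch of the Jaccard distance between two interval reports, under the convention stated above that two empty reports are at distance 0 (our illustration; the implementation used in the experiments may differ in details).

```python
def interval_union_length(intervals):
    """Total length covered by a union of (start, end) intervals."""
    total, last_end = 0.0, None
    for s, e in sorted(intervals):
        if last_end is None or s > last_end:
            total += e - s
            last_end = e
        elif e > last_end:
            total += e - last_end
            last_end = e
    return total

def jaccard_distance(report_a, report_b):
    """1 - |A ∩ B| / |A ∪ B| for reports given as lists of (start, end) intervals."""
    if not report_a and not report_b:
        return 0.0  # convention: two empty reports are identical
    # The union of the pairwise intersections equals the intersection of the two unions.
    inter = [(max(s1, s2), min(e1, e2))
             for s1, e1 in report_a for s2, e2 in report_b
             if min(e1, e2) > max(s1, s2)]
    inter_len = interval_union_length(inter)
    union_len = interval_union_length(report_a) + interval_union_length(report_b) - inter_len
    return 1.0 - inter_len / union_len if union_len > 0 else 0.0

# e.g., the distance of a report to the ground truth [0, 100] ∪ [130, 170]:
print(jaccard_distance([(5, 95), (128, 180)], [(0, 100), (130, 170)]))
```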

J.2 Real world data

Datasets we collected. Participants were given short instructions, and then had to answer m = 25 questions. We recruited participants through Amazon Mechanical Turk, restricting participation to workers who had at least 50 approved assignments. We planted in each survey a simple question that can easily be answered by anyone who understands the instructions of the experiment (a "gold standard" question). Participants who answered the gold-standard question correctly received a payment of $0.3. Participants did not receive bonuses for accuracy. The study protocol was approved by the Institutional Review Board at our institution.

DotsBinary: Subjects were asked which picture had more dots (binary questions).

Buildings: Subjects were asked to mark the height of the building in the picture on a slider bar. The cardinal labels were later removed and replaced with the induced ranking.

Datasets from previous studies. Another dataset with continuous answers came from a study of people's geometric reasoning [21].¹⁰

Triangles: Participants in the study were shown the base of a triangle (two vertices and their angles) and were asked to position the third vertex; that is, their answer to each question is an x-coordinate and a y-coordinate of a vertex. In our analysis we treat each of the coordinates as a separate question.

In addition, we used three datasets from [35]; in those experiments workers had to identify an object in pictures. Workers got paid according to the Double-or-Nothing payment scheme described in their paper (they had a fourth dataset, HeadsOfCountries, where almost all subjects gave perfect answers, and hence we did not use it).

GoldenGate: Subjects were asked whether the Golden Gate Bridge appears in the picture (binary questions). 35 workers, 21 questions.

Dogs: Subjects were asked to identify the breed of the dog in the picture (10 possible answers). 31 workers, 85 questions. We omitted 4 workers due to missing data in their reports.

¹⁰ https://github.com/StatShapeGeometricReasoning/StatisticalShapeGeometricReasoning

Flags: Subjects were asked to identify the country to which the shown flag belongs (4 possible answers). 35 workers, 126 questions.

In addition, we used two datasets from [31]; in those experiments workers had to rank four alternatives. The two datasets consist of 6400 rankings from 1300 unique workers. Workers got paid $0.10 per ranking.

DOTS: Subjects were asked to rank four pictures of dots, from fewest dots to most dots. This task has been suggested as a benchmark task for human computation in [22].

PUZZLES: Subjects were asked to rank four 8-puzzles by the least number of moves the puzzles are from the solution, from closest to furthest.

We further used data with complex labels, collected by other researchers [1, 36].

SchemaMatching: Schema matching is the task of aligning the attributes of two database schemata. For example, an "orderData" attribute in one table may correspond to a "poDay" attribute in another table. In a study aimed at characterizing human matchers, subjects (N = 211) were asked to match attributes from two schemata taken from the "purchase order" dataset [17], which had 142 and 46 attributes, respectively. A gold standard reference match was compiled by domain experts. Each subject was asked to indicate attributes that correspond to each other in the two schemata. We used the Jaccard distance function to determine the similarity between subjects and their fault levels (distance from the gold standard matches).

Translations: This dataset, made available to us by the authors of [5], contains translations of 400 sentences from Japanese to English (in [5] they used a similar dataset of Urdu-English translations). Each sentence has 10 translations by different workers, and was used (in our study) as a separate dataset.

The last source of datasets was a recent study on long-term prediction [29].

Predict: Subjects were asked to assess the likelihood that 18 events will occur (in particular, to predict whether or not they will occur), e.g., "Will D.C. United win the D.C. United vs. FC Dallas soccer game on Sat Oct. 13th?" Subjects were assigned to four treatments of the payment method: (1) according to the Brier scoring rule (207 workers); (2) according to peer prediction (220 workers); (3)+(4) according to combinations of the two (238 and 226 workers, respectively). The datasets were collected on different dates (Oct. 8–13). We display results for the first date in the dataset; the results for other dates are similar.

K MORE EMPIRICAL RESULTS

Computing infrastructure and runtime. Running any of the algorithms on an input of size n = 200, m = 1000 (the largest inputs we used) takes well under a second on a Dell laptop with an Intel i7 processor, so runtime is not a significant consideration in this work. The only exception is the MAS algorithm, for which we ran the original code on a virtual machine and restricted the number of samples and the settings in which it was tested.

Ranking. Figure 11 shows the same information as Fig. 7 in the main text, but for the other datasets DOTS-X and Buildings; we see the same patterns. Figure 12 shows similar information for synthetic data, under all three distance functions. While P-TD seems most effective with the Kendall-tau distance, it still provides an improvement in the other settings. Finally, Figure 13 shows more information on how P-TD affects the Error-Correcting voting rule of [34]; it appears to be beneficial mainly when n is small.

Fig. 11. Results on all DOTS datasets (with 푛 = 15) and on the Buildings dataset with 6 sampled alternatives.

Fig. 12. Results on a synthetic dataset with nine voting rules under each of the 3 distance measures we used for rankings. The proto-population is a Mallows distribution with φ_i ∼ N(0.65, 0.15), with 7 sampled alternatives.

Fig. 13. A comparison of D-TD, P-TD, UA and OA on the Error-Correcting (EC) voting rule, varying the parameter ρ, using the Kendall-tau measure.

REFERENCES
[1] Rakefet Ackerman, Avigdor Gal, Tomer Sagi, and Roee Shraga. A cognitive model of human bias in matching. In PRICAI'19, pages 632–646. Springer, 2019.
[2] A. C. Aitken. On least squares and linear combination of observations. Proceedings of the Royal Society of Edinburgh, 55:42–48, 1935.
[3] Bahadir Ismail Aydin, Yavuz Selim Yilmaz, Yaliang Li, Qi Li, Jing Gao, and Murat Demirbas. Crowdsourcing for multiple-choice question answering. In Twenty-Sixth IAAI Conference, 2014.
[4] Ruth Ben-Yashar and Jacob Paroush. Optimal decision rules for fixed-size committees in polychotomous choice situations. Social Choice and Welfare, 18(4):737–746, 2001.
[5] Alexander Braylan and Matthew Lease. Modeling and aggregation of complex annotations via annotation distances. In Proceedings of The Web Conference 2020, pages 1807–1818, 2020.
[6] Ioannis Caragiannis, Ariel D Procaccia, and Nisarg Shah. When do noisy votes reveal the truth? In EC'13, pages 143–160. ACM, 2013.
[7] Ioannis Caragiannis, Ariel D Procaccia, and Nisarg Shah. Modal ranking: A uniquely robust voting rule. In AAAI'14, 2014.
[8] Bob Carpenter, Andrew Gelman, Matthew D Hoffman, Daniel Lee, Ben Goodrich, Michael Betancourt, Marcus A Brubaker, Jiqiang Guo, Peter Li, and Allen Riddell. Stan: A probabilistic programming language. Journal of Statistical Software, 76(1):1–32, 2017.
[9] Randy L Carter, Robin Morris, and Roger K Blashfield. On the partitioning of squared Euclidean distance and its applications in cluster analysis. Psychometrika, 54(1):9–23, 1989.
[10] Sung-Hyuk Cha. Comprehensive survey on distance/similarity measures between probability density functions. International Journal of Mathematical Models and Methods in Applied Sciences, 1(2):1, 2007.
[11] Gal Cohensius, Shie Mannor, Reshef Meir, Eli Meirom, and Ariel Orda. Proxy voting for better outcomes. In AAMAS'17, pages 858–866, 2017.
[12] Marie J Condorcet et al. Essai sur l'application de l'analyse à la probabilité des décisions rendues à la pluralité des voix, volume 252. American Mathematical Soc., 1785.
[13] Nilesh Dalvi, Anirban Dasgupta, Ravi Kumar, and Vibhor Rastogi. Aggregating crowdsourced binary ratings. In WWW'13, pages 285–294, 2013.
[14] Alexander Philip Dawid and Allan M Skene. Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1):20–28, 1979.
[15] Jia Deng, Olga Russakovsky, Jonathan Krause, Michael S Bernstein, Alex Berg, and Li Fei-Fei. Scalable multi-label annotation. In SIGCHI'14, pages 3099–3102. ACM, 2014.
[16] Francis X Diebold. Elements of Forecasting. South-Western College Pub., 1998.
[17] Hong-Hai Do and Erhard Rahm. COMA—a system for flexible combination of schema matching approaches. In VLDB'02, pages 610–621. Elsevier, 2002.
[18] Mohamed Drissi-Bakhkhat and Michel Truchon. Maximum likelihood approach to vote aggregation with variable probabilities. Social Choice and Welfare, 23(2):161–185, 2004.
[19] Chao Gao and Dengyong Zhou. Minimax optimal convergence rates for estimating ground truth from crowdsourced labels. arXiv preprint arXiv:1310.5764, 2013.
[20] Bernard Grofman, Guillermo Owen, and Scott L Feld. Thirteen theorems in search of the truth. Theory and Decision, 15(3):261–278, 1983.
[21] Yuval Hart, Moira R Dillon, Andrew Marantan, Anna L Cardenas, Elizabeth Spelke, and L Mahadevan. The statistical shape of geometric reasoning. Scientific Reports, 8(1):12906, 2018.
[22] John J Horton. The dot-guessing game: A fruit fly for human computation research. SSRN, 2010.
[23] Shih-Wen Huang and Wai-Tat Fu. Enhancing reliability using peer consistency evaluation in human computation. In CSCW, pages 639–648. ACM, 2013.
[24] David R Karger, Sewoong Oh, and Devavrat Shah. Iterative learning for reliable crowdsourcing systems. In NeurIPS'11, pages 1953–1961, 2011.
[25] Maurice George Kendall. Rank Correlation Methods. Griffin, 1948.
[26] Esvey Kosman and KJ Leonard. Similarity coefficients for molecular markers in studies of genetic relationships between individuals for haploid, diploid, and polyploid species. Molecular Ecology, 14(2):415–424, 2005.
[27] Qi Li, Yaliang Li, Jing Gao, Bo Zhao, Wei Fan, and Jiawei Han. Resolving conflicts in heterogeneous data by truth discovery and source reliability estimation. In SIGMOD'14, 2014.
[28] Yaliang Li, Jing Gao, Chuishi Meng, Qi Li, Lu Su, Bo Zhao, Wei Fan, and Jiawei Han. A survey on truth discovery. ACM SIGKDD Explorations Newsletter, 17(2):1–16, 2016.

[29] Debmalya Mandal, Goran Radanovic, and David C Parkes. The effectiveness of peer prediction in long-term forecasting. In AAAI'20, 2020.
[30] Henry B Mann and Abraham Wald. On stochastic limit and order relationships. The Annals of Mathematical Statistics, 14(3):217–226, 1943.
[31] Andrew Mao, Ariel D Procaccia, and Yiling Chen. Better human computation through principled voting. In AAAI'13, 2013.
[32] Elliot McLaughlin. Image overload: Help us sort it all out, NASA requests. CNN.com. Retrieved 18/9/2014, 2014.
[33] Fabio Parisi, Francesco Strino, Boaz Nadler, and Yuval Kluger. Ranking and combining multiple predictors without labeled data. Proceedings of the National Academy of Sciences, 111(4):1253–1258, 2014.
[34] Ariel D Procaccia, Nisarg Shah, and Yair Zick. Voting rules as error-correcting codes. Artificial Intelligence, 231:1–16, 2016.
[35] Nihar Bhadresh Shah and Denny Zhou. Double or nothing: Multiplicative incentive mechanisms for crowdsourcing. In NeurIPS'15, pages 1–9, 2015.
[36] Roee Shraga, Avigdor Gal, and Haggai Roitman. What type of a matcher are you? Coordination of human and algorithmic matchers. In Proceedings of the HILL'19 Workshop, pages 1–7, 2018.
[37] Luis Von Ahn and Laura Dabbish. Designing games with a purpose. Communications of the ACM, 51(8):58–67, 2008.
[38] Jeroen Vuurens, Arjen P de Vries, and Carsten Eickhoff. How much spam can you take? An analysis of crowdsourcing results to increase accuracy. In ACM SIGIR Workshop on CIR'11, pages 21–26, 2011.
[39] Paul Wais, Shivaram Lingamneni, Duncan Cook, Jason Fennell, Benjamin Goldenberg, Daniel Lubarov, David Marin, and Hari Simons. Towards building a high-quality workforce with Mechanical Turk. NeurIPS workshop, pages 1–5, 2010.
[40] Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144, 2016.
[41] Lirong Xia and Vincent Conitzer. A maximum likelihood approach towards aggregating partial orders. In IJCAI'11, 2011.
[42] Houping Xiao, Jing Gao, Zhaoran Wang, Shiyu Wang, Lu Su, and Han Liu. A truth discovery approach with theoretical guarantee. In SIGKDD'16, pages 1925–1934, 2016.
[43] Xiaoxin Yin and Wenzhao Tan. Semi-supervised truth discovery. In Proceedings of the 20th International Conference on World Wide Web, pages 217–226, 2011.
[44] H Peyton Young. Condorcet's theory of voting. American Political Science Review, 82(4):1231–1244, 1988.
[45] Yuchen Zhang, Xi Chen, Dengyong Zhou, and Michael I Jordan. Spectral methods meet EM: A provably optimal algorithm for crowdsourcing. The Journal of Machine Learning Research, 17(1):3537–3580, 2016.