Crowdsourced Algorithms in Data Management
DASFAA 2014 Tutorial
Dongwon Lee, Ph.D.
(Thanks to Jongwuk Lee, Ph.D.)
Penn State University, USA [email protected] 2 Latest Version l Latest version of this DASFAA14 tutorial is available at:
http://bit.ly/1h6wHAS 3 TOC l Crowdsourcing Basics l Definition l Apps l Crowdsourced Algorithms in DB l Sort l Top-1 l Top-k l Select l Count l Join 4 James Surowiecki, 2004
“Collective intelligence can be brought to bear on a wide variety of problems, and complexity is no bar… conditions that are necessary for the crowd to be wise: diversity, independence, and … decentralization” 5 Jeff Howe, WIRED, 2006
“Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. … The crucial prerequisite is the use of the open call format and the large network of potential laborers…”
http://www.wired.com/wired/archive/14.06/crowds.html 6 Daren Brabhan, 2013
“Crowdsourcing as an online, distributed problem-solving and production model that leverages the collective intelligence of online communities to serve specific organizational goals” 7 “Human Computation”, 2011
“Human computation is simply computation that is carried out by humans… Crowdsourcing can be considered a method or a tool that human computation systems can use…”
By Edith Law & Luis von Ahn 8 Game with a Purpose: GWAP l Luis von Ahn @ CMU l Eg, l ESP Game à Google Image Labeler
l Foldit
l Duolingo: crowdsourced language translation 9 Eg, Francis Galton, 1906
Weight-judging competition: 1,197 (mean of 787 crowds) vs. 1,198 pounds (actual measurement) 10 Eg, StolenSidekick, 2006 l A woman lost a cellphone in a taxi l A 16-year-old girl ended up having the phone l Refused to return the phone l Evan Guttman, the woman’s friend, sets up a blog site about the incident l http://stolensidekick.blogspot.com/ l http://www.evanwashere.com/StolenSidekick/ l Attracted a growing amount of attention à the story appeared in Digg main page à NY Times and CNN coverage à Crowds pressure on police … l NYPD arrested the girl and re-possessed the phone
http://www.nytimes.com/2006/06/21/nyregion/21sidekick.html?_r=0 11 Eg, “Who Wants to be a Millionaire”
Asking the audience usually works è Audience members have diverse knowledge that can be coordinated to provide a correct answer in sum 12 Eg, DARPA Challenge, 2009 l To locate 10 red balloons in arbitrary locations of US l Winner gets $40K l MIT team won the race with the strategy: l 2K per balloon to the first person, A, to send the correct coordinates l 1K to the person, B, who invited A l 0.5K to the person, C, who invited B, … 13 Eg, Threadless.com l Sells t-shirts, designed/voted by crowds l Artists whose designs are chosen get paid 14 Eg, reCAPCHA
As of 2012
Captcha: 200M every day
ReCaptcha: 750M to date 15 Eg, l Crowdfunding, started in 2009 l Project creators choose a deadline and a minimum funding goal l Creators only from US, UK, and Canada l Donors pledge money to support projects, in exchange of non-monetary values l Eg, t-shirt, thank-u-note, dinner with creators l Donors can be from anywhere l Eg, Pebble, smartwatch l 68K people pledged 10M 16
http://www.resultsfromcrowds.com/features/crowdsourcing-landscape/ What is Crowdsourcing? 17
l Many defini ons l A few characteris cs l Online and distributed l Open call & right incen ve l Diversity and independence l Top-down & bo om-up
l Q: What are the computa onal issues in crowdsourcing? l Micro-tasks for large crowds 18 What is Computational Crowdsourcing? l Focus on computational aspect of crowdsourcing l Mainly use micro-tasks l Algorithmic aspect l Optimization problem with three parameters l When to use Computational Crowdsourcing? l Machine can’t do the task well l Large crowds can do it well l Task can be split to many micro-tasks Computational Crowdsourcing 19 Find an outlier among three l Requesters images l People submit some tasks l Pay rewards to workers
Submit tasks Collect answers l Marketplaces l Provide crowds with tasks
Find tasks Return answers l Find an outlier among three Crowds images l Workers perform tasks 20 Crowdsourcing Marketplaces l Platforms for posting/performing (often micro) tasks l Those who want to have tasks done via crowdsourcing è Requesters l Eg, companies, researchers l Those who want to perform tasks for monetary profits è Workers l Eg, individuals for extra income 21 Crowdsourcing Platforms l Notables ones: l Mechanical Turk (AMT)
l CrowdFlower
l CloudCrowd
l Clickworker
l SamaSource 22 AMT: mturk.com
Workers Requesters 23 AMT (cont) l Workers l Register w. credit account (only US workers can register as of 2013) l Bid to do tasks for earning money l Requesters l First deposit money to account l Post tasks o Task can specify a qualification for workers l Gather results l Pay to workers if results are satisfactory 24 AMT (cont) l Tasks l Called HIT (Human Intelligence Task) l Micro-task l Eg l Data cleaning l Tagging / labeling l Sentiment analysis l Categorization l Surveying l Photo moderation
l Transcription Translation task 25 AMT: HIT List
Workers qualification 26 AMT: HIT Example 27 Eg, Text Transcription [Miller-13]
l Problem: one person can’t do a good transcription l Key idea: iterative improvement by many workers
Greg Li le et al. “Exploring itera ve and parallel human computa on processes.” HCOMP 2010 Eg, Text Transcription [Miller-13] 28
improvement $0.05 Eg, Text Transcription [Miller-13] 29
3 votes @ $0.01 Eg, Text Transcription [Miller-13] 30
After 9 iterations Eg, Text Transcription [Miller-13] 31
I had intended to hit the nail, but I’m not a very good aim it seems and I ended up hitting my thumb. This is a common occurrence I know, but it doesn’t make me feel any less ridiculous having done it myself. My new strategy will involve lightly tapping the nail while holding it until it is embedded into the wood enough that the wood itself is holding it straight and then I’ll remove my hand and pound carefully away. We’ll see how this goes.
Another example: blurry text After 8 iterations Eg, Computer Vision [Li-HotDB12] 32
l How similar is the artistic style?
Human and Machine Detec on of Stylis c Similarity in Art. Adriana Kovashka and Ma hew Lease. CrowdConf 2010 33 TOC l Crowdsourcing Basics l Definition l Apps l Crowdsourced Algorithms in DB l Sort l Top-1 l Top-k l Select l Count l Join 34 Crowdsourcing DB Projects l CDAS @ NUS l CrowdDB @ UC Berkeley & ETH Zurich l MoDaS @ Tel Aviv U. l Qurk @ MIT l sCOOP @ Stanford & UCSC Eg, CrowdDB System 35
l Crowd-enabled databases l Hybrid human/machine databases l Building a database engine that can dynamically crowdsource certain operations
[Franklin-SIGMOD11] 36 Preliminaries: 3 Factors
l Latency (or execution time) l Worker pool size l Job attractiveness How long do we wait for?
l Monetary cost Latency l # of questions l # of workers l Cost per question Quality
l Quality of answers How much is the quality of l Worker maliciousness Cost answers satisfied? l Worker skills How much $$ does we spend? l Task difficulty 37 Preliminaries: Size of Comparison l Diverse forms of questions in a HIT l Different sizes of comparisons in a question l Eg, Binary question Which is better? o s = 2
l Eg, N-ary question Which is the best? o s = N . . . 38 Preliminaries: Batch l Repetitions of questions within a HIT l Eg, two n-ary questions (batch factor b=2)
Which is the best?
. . .
Which is the best?
. . . 39 Preliminaries: Response (r) l # of human responses seeked for a HIT
W1
Which is better? W2 r = 3
W3 40 Preliminaries: Round (= Step)
l Algorithms are executed in rounds l # of rounds ≈ latency Round #1 Round #2
Which is better?
Which is better? Parallel Execution Which is better?
Sequential Execution 41 Sort Operation l Rank N items using crowdsourcing with respect to the constraint C l Eg, C as “Representative,” “Dangerous,” “Beautiful” SELECT !*! FROM! !ImageTable AS I! WHERE !I.Date > 2014 AND I.loc = “NY”! ORDER BY !CrowdOp(“Representative”)! 42 Naïve Sort l Assuming pair-wise comparison of 2 items l Eg, “Which of two images is better?” l Cycle: A > B, B > C, and C > A l If no cycle occurs N l Naïve all pair-wise comparisons takes 2 comparisons l Optimal # of comparison is O(N log N) l If cycle exists C l More comparisons are required
A B 43 Sort [Marcus-VLDB11] l Proposed 3 crowdsourced sort algorithms l #1: Comparison-based Sort l Workers rank S items (S ⊂ N ) per HIT ! S $ l Each HIT yields # & pair-wise comparisons " 2 % l Build a DAG using all pair-wise comparisons from all workers o If i > j, then add an edge from i to j l Break a cycle in the DAG: “head-to-head” o Eg, If i > j occurs 3 times and i < j occurs 2 times, keep only i > j l Perform a topological sort in the DAG 44 Sort [Marcus-VLDB11]
1 3 2 5 4
4 3 5 1 2 45 Sort [Marcus-VLDB11]
l N=5, S=3
A A W1 > > 1
B B 1 E
C 1
C D D
E 46 Sort [Marcus-VLDB11]
l N=5, S=3
A A W1 > > 1
B B 1 E W2 > > C 2 1
C D D 1
E 47 Sort [Marcus-VLDB11]
l N=5, S=3
A A W1 > > 1 1
B E B 1 1 W2 > > 1 C 2 1
W3 C D D > > 1
E 48 Sort [Marcus-VLDB11]
l N=5, S=3
A A W1 > > 1 1
1 B E B 1 1 W2 > > ✖ 1 C 2 1 1 1 W3 C D D > > 1
E W4 > > 49 Sort [Marcus-VLDB11] l N=5, S=3
A A Topological Sort B B E A > B > C > C E > D D
C D E 50 Sort [Marcus-VLDB11] l #2: Rating-based Sort l W workers rate each item along a numerical scale l Compute the mean of W ratings of each item l Sort all items using their means l Requires W*N HITs: O(N) Mean rating Worker Rating 1.3 W1 4 W2 3 W3 4 3.6 . . . Worker Rating . . . W1 1 W2 2 W3 1 8.2 51 Sort [Marcus-VLDB11] 52 Sort [Marcus-VLDB11] l #3: Hybrid Sort l First, do rating-based sort à sorted list L l Second, do comparison-based sort on S ( S ⊂ L ) l How to select the size of S o Random o Confidence-based o Sliding window 53 Sort [Marcus-VLDB11]
Rank correlation btw. Worker agreement Comparison vs. rating 54 Sort [Marcus-VLDB11] 55 Top-1 Operation l Find the top-1, either MAX or MIN, among N items w.r.t. “something” l Objective l Avoid sorting all N items to find top-1 56 Top-1 Operation l Examples l [Venetis-WWW12] introduces the bubble max and tournament-based max in a parameterized framework l [Guo-SIGMOD12] studies how to find max using pair-wise questions in the tournament-like setting and how to improve accuracy by asking more questions 57 Max [Venetis-WWW12] l Introduced two Max algorithms l Bubble Max l Tournament Max l Parameterized framework
l si: size of sets compared at the i-th round
l ri: # of human responses at the i-th round
Which is better? Which is the best?
si = 2 si = 3 ri = 3 ri = 2 58 Max [Venetis-WWW12] l Bubble Max Case #1
¥ N = 5 s = 2 1 ¥ Rounds = 3 r1= 3 ¥ # of questions =
r1 + r2 + r3 = 11
s2 = 3 r2 = 3
s3 = 2 r3 = 5 59 Max [Venetis-WWW12] l Bubble Max Case #2
¥ N = 5 ¥ Rounds = 2 ¥ # of questions =
r1 + r2 = 8
s1 = 4 r1= 3
s2 = 2 r2 = 5 60 Max [Venetis-WWW12] l Tournament Max ¥ N = 5 s1 = 2 ¥ Rounds = 3 r = 1 1 ¥ # of questions s3 = 2 = r1 + r2 + r3 + r4 = 10 r3 = 3
s2 = 2 r2 = 1 s4 = 2 r4 = 5 61 Max [Venetis-WWW12] l How to find optimal parameters?: si and ri l Tuning Strategies (using Hill Climbing)
l Constant si and ri
l Constant si and varying ri
l Varying si and ri 62 Max [Venetis-WWW12] l Bubble Max
l Worst case: with si=2, O(N) comparisons needed l Tournament Max
l Worst case: with si=2, O(N) comparisons needed l Bubble Max is a special case of Tournament Max 63 Max [Venetis-WWW12] 64 Max [Venetis-WWW12] 65 Max [Venetis-WWW12]
Budget 66 Top-k Operation l Find top-k items among N items w.r.t. “something” l Top-k list vs. top-k set l Objective l Avoid sorting all N items to find top-k 67 Top-k Operation l Examples l [Davidson-ICDT13] inves gates the variable user error model in solving top-k list problem l [Polychronopoulous-WebDB13] proposes tournament-based top-k set solu on 68 Top-k Operation l Naïve solution is to “sort” N items and pick top-k items l Eg, N=5, k=2, “Find two best Bali images?” 5 l Ask = 10 pair-wise questions to get a total 2 order l Pick top-2 images 69 Top-k: Tournament Solution (k = 2) l Phase 1: Building a tournament tree l For each comparison, only winners are promoted to the next round
Round 1 70 Top-k: Tournament Solution (k = 2) l Phase 1: Building a tournament tree l For each comparison, only winners are promoted to the next round
Round 2
Round 1 71 Top-k: Tournament Solution (k = 2) l Phase 1: Building a tournament tree l For each comparison, only winners are promoted to the next round
Round 3
Round 2 Total, 4 questions with 3 rounds
Round 1 72 Top-k: Tournament Solution (k = 2) l Phase 2: Updating a tournament tree l Iteratively asking pair-wise questions from the bottom level
Round 3
Round 2
Round 1 73 Top-k: Tournament Solution (k = 2) l Phase 2: Updating a tournament tree l Iteratively asking pair-wise questions from the bottom level
Round 4 74 Top-k: Tournament Solution (k = 2) l Phase 2: Updating a tournament tree l Iteratively asking pair-wise questions from the bottom level
Round 5
Round 4 Total, 6 questions With 5 rounds 75 Top-k: Tournament Solution l This is a top-k list algorithm l Analysis
k = 1 k ≥ 2 # of questions O(n)
# of rounds l If there is no constraint for the number of rounds, this tournament sort yields the optimal result 76 Top-k [Polychronopoulous-WebDB13] l Top-k set algorithm l Top-k items are “better” than remaining items l Capture NO ranking among top-k items
K items
l Tournament-based approach l Can become a Top-k list algorithm l Eg, Top-k set algorithm, followed by [Marcus- VLDB11] to sort k items 77 Top-k [Polychronopoulous-WebDB13] l Algorithm l Input: N items, integer k and s (ie, s > k) l Output: top-k items l Procedure: o O ß N items o While |O| > k § Partition O into disjoint subsets of size s § Identify top-k items in each subset of size s: s-rank(s) § Merge all top-k items into O o Return O l Effective only when s and k are small l Eg, s-rank(20) with k=10 won’t work well 78 Top-k [Polychronopoulous-WebDB13] l Eg, N=10, s=4, k=2 Top-2 items
s-rank()
s-rank()
s-rank() s-rank() 79 Top-k [Polychronopoulous-WebDB13] l s-rank(s) // workers rank s items and aggregate l Input: s items, integer k (ie, s > k), w workers l Output: top-k items among s items l Procedure: o For each of w workers § Rank s items ≈ comparison-based sort [Marcus-VLDB11] o Merge w rankings of s items into a single ranking § Use median-rank aggregation [Dwork-WWW01] o Return top-k item from the merged ranking of s items 80 Top-k [Polychronopoulous-WebDB13] l Eg, s-rank(): s=4, k=2, w=3
W1
4 1 2 3
W2
4 2 1 3
W3
3 2 3 4
Median Ranks 4 2 2 3 Top-2 81 Top-k [Polychronopoulous-WebDB13] l How to set # of workers, w, per s-rank() l Basic o same # to all s-rank() l Adaptive o 3-level assignments: low, medium, high o if rankings from workers disagree, allowing more workers would improve the accuracy § Needs more rounds for improvement 82 Top-k [Polychronopoulous-WebDB13] l Basic vs. Adaptive 83 Top-k [Polychronopoulous-WebDB13] l Comparison to Sort [Marcus-VLDB11] 84 Top-k [Polychronopoulous-WebDB13] l Comparison to Max [Venetis-WWW12] 85 Top-k [Polychronopoulous-WebDB13] l AMT result 86 Select Operation l Given N items, select k items that satisfy a predicate P l ≈ Filter, Find, Screen, Search 87 Select Operation l Examples l [Yan-MobiSys10] uses crowds to find an image relevant to a query l [Parameswaran-SIGMOD12] develops ilters l [Franklin-ICDE13] efficiently enumerates items satisfying conditions via crowdsourcing l [Sarma-ICDE14] finds a bounded number of items satisfying predicates using the optimal solution by the skyline of cost and time 88 Select [Parameswaran-SIGMOD12] l Novel grid-based visualization
Yes B
A C
D
No 89 Select [Parameswaran-SIGMOD12] l Common strategies l Always ask X questions, return most likely answer à Triangular strategy
l If X YES return “Pass”, Y NO return “Fail”, else keep asking à Rectangular strategy
l Ask until |#YES - #NO| > X, or at most Y questions à Chopped off triangle 90 Select [Parameswaran-SIGMOD12] l What is the best strategy? Find strategy with minimum overall expected cost 1. Overall expected error is less than threshold 2. # of questions per item never exceeds m
YESs
6
5
4
3
2
1
1 2 3 4 5 6 NOs 91 Count Operation l Given N items, estimate a fraction of items M that satisfy a predicate P l Selectivity estimation in DB à crowd- powered query optimizers l Evaluating queries with GROUP BY + COUNT/AVG/SUM operators l Eg, “Find photos of females with red hairs” l Selectivity(“female”) ≈ 50% l Selectivity(“red hair”) ≈ 2% l Better to process predicate(“red hair”) first 92 Count [Marcus-VLDB13] l Hypothesis: Humans can estimate the frequency of objects’ properties in a batch without having to explicitly label each item l Two approaches l #1: Label Count o Sampling theory based o Have workers label samples explicitly l #2: Batch Count o Have workers estimate the frequency in a batch 93 Count [Marcus-VLDB13] l Label Count (via sampling) 94 Count [Marcus-VLDB13] l Batch Count 95 Count [Marcus-VLDB13] l Findings on accuracy l Images: Batch count > Label count l Texts: Batch count < Label count l Further Contributions l Detecting spammers l Avoiding coordinated attacks 96 Join Operation l Identify matching records or entities within or across tables l ≈ similarity join, entity resolution (ER), record linkage, de-duplication, … l Beyond the exact matching l [Chaudhuri-ICDE06] similarity join
l R JOINp S, where p=sim(R.A, S.A) > t l sim() can be implemented as UDFs in SQL l Often, the evaluation is expensive o DB applies UDF-based join predicate after Cartesian product of R and S 97 Join Operation l Examples l [Marcus-VLDB11] proposes 3 types of joins l [Wang-VLDB12] generates near-optimal cluster-based HIT design to reduce join cost l [Wang-SIGMOD13] reduces join cost further by exploiting transitivity among items l [Whang-VLDB13] selects right questions to ask to crowds to improve join accuracy l [Gokhale-SIGMOD14] proposes the hands-off crowdsourcing for join workflow 98 Join [Marcus-VLDB11] l To join tables R and S l #1: Simple Join l Pair-wise comparison HIT l |R||S| HITs needed l #2: Naïve Batching Join l Repetition of #1 with a batch factor b l |R||S|/b HITs needed l #3: Smart Batching Join l Show r and s images from R and S l Workers pair them up l |R||S|/rs HITs needed 99 Join [Marcus-VLDB11]
#1 Simple Join 100 Join [Marcus-VLDB11]
Batch factor #2 Naïve b = 2 Batching Join 101 Join [Marcus-VLDB11]
r images s images from R from S
#3 Smart Batching Join 102 Join [Marcus-VLDB11] 103 Join [Marcus-VLDB11]
Last 50% of wait time is spent completing the last 5% of tasks 104 Join [Wang-VLDB12] l [Marcus-VLDB11] proposed two batch joins l More efficient smart batch join still generates |R||S|/rs # of HITs l Eg, (10,000 X 10,000) / (20 x 20) = 250,000 HITs à Still too many ! l [Wang-VLDB12] contributes CrowdER: 1. A hybrid human-machine join o #1 machine-ER prunes obvious non-matches o #2 human-ER examines likely matching cases § Eg, candidate pairs with high similarity scores 2. Algorithm to generate min # of HITs for step #2 105 Join [Wang-VLDB12] l Hybrid idea: generate candidate pairs using existing similarity measures (eg, Jaccard)
Main Issue: HIT Generation Problem 106 Join [Wang-VLDB12] Pair-based HIT Generation Cluster-based HIT Generation = Naïve Batching in = Smart Batching in [Marcus-VLDB11] [Marcus-VLDB11] 107 Join [Wang-VLDB12] l HIT Generation Problem l Input: pairs of records P, # of records in HIT k l Output: minimum # of HITs s.t. o All HITs have at most k records
o Each pair (pi, pj) ∈ P must be in at least one HIT
1. Pair-based HIT Generation l Trivial: P/k # of HITs s.t. each HIT contains k pairs in P 2. Cluster-based HIT Generation l NP-hard problem à approximation solution 108 Join [Wang-VLDB12]
Cluster-based HIT #1
r1, r2, r3, r7
k = 4 Cluster-based HIT #2
r3, r4, r5, r6
Cluster-based HIT #3
r , r , r , r This is the minimal # of cluster-based HITs 4 7 8 9 satisfying previous two conditions 109 Join [Wang-VLDB12] l Two-tiered Greedy Algorithm l Build a graph G from pairs of records in P l CC ß connected components in G o LCC: large CC with more than k nodes o SCC: small CC with no more than k nodes l Step 1: Partition LCC into SCCs l Step 2: Pack SCCs into HITs with k nodes o Integer programming based 110 Join [Wang-VLDB12] l Eg, Generate cluster-based HITs (k = 4) 1. Partition the LCC into 3 SCCs
o {r1, r2, r3, r7}, {r3, r4, r5, r6}, {r4, r7} 2. Pack SCCs into HITs
o A single HIT per {r1, r2, r3, r7} and {r3, r4, r5, r6}
o Pack {r4, r7} and {r8, r9} into a HIT
111 Join [Wang-VLDB12] l Step 1: Partition l Input: LCC, k Output: SCCs
l rmax ß node in LCC with the max degree
l scc ß {rmax}
l conn ß nodes in LCC directly connected to rmax l while |scc| < k and |conn| > 0
o rnew ß node in conn with max indegree (# of edges to scc) and min outdegree (# of edges to non-scc) if tie
o move rnew from conn to scc o update conn using new scc l add scc into SCC
112 Join [Wang-VLDB12] 113 Join [Wang-VLDB12] 114 Join [Wang-VLDB12] 115 Join [Wang-SIGMOD13] l Use the same hybrid machine-human framework as [Wang-VLDB12] l Aim to reduce # of HITs further l Exploit transitivity among records 116 Join [Wang-SIGMOD13] l Positive transitive relation l If a=b, and b=c, then a=c
iPad 2nd Gen = iPad Two iPad 2nd Gen = iPad 2 iPad Two = iPad 2 l Negative transitive relation l If a = b, b ≠ c, then a ≠ c
iPad 2nd Gen = iPad Two iPad 2nd Gen ≠ iPad 3 iPad Two ≠ iPad 3 117 Join [Wang-SIGMOD13] l Three transitive relations l If there exists a path from o to o’ which only consists of matching pairs, then (o, o’) can be deduced as a matching pair l If there exists a path from o to o’ which only contains a single non-matching pair, then (o, o’) can be deduced as a non-matching pair l If any path from o to o’ contains more than one non-matching pairs, (o, o’) cannot be deduced. 118 Join [Wang-SIGMOD13]
(o3, o5) à match
(o5, o7) à non-match
(o1, o7) à ?
119 Join [Wang-SIGMOD13] l Given a pair (oi, oj), to check the transitivity
l Enumerate path from oi to oj à exponential ! l Count # of non-matching pairs in each path l Solution: Build a cluster graph l Merge matching pairs to a cluster l Add inter-cluster edge for non-matching pairs
(o5, o6) à ?
(o1, o5) à ?
120 Join [Wang-SIGMOD13] l Problem Definition: l Given a set of pairs that need to be labeled, minimize the # of pairs requested to crowd workers based on transi ve rela ons
? 121 Join [Wang-SIGMOD13] l Labeling order matters !
(o1, o2), (o1, o6), (o2, o6)
vs.
N (o1, o6), (o2, o6), (o1, o2)
è Given a set of pairs to label, how to order them affects the # of pairs to deduce using the transitivity 122 Join [Wang-SIGMOD13] l Optimal labeling order
w =
w’ =
l If pi is a matching pair and pi+1 is a non-matching pair, then C(w) ≤ C(w’) o C(w): # of crowdsourced pairs required for w l That is, always better to first label a matching pair and then a non-matching pair l In reality, optimal label order cannot be achieved 123 Join [Wang-SIGMOD13] l Expected optimal labeling order
l w =
l P(pi = crowdsourced)
o Enumerate all possible labels of
whether pi is crowdsourced 124 Join [Wang-SIGMOD13] l Expected optimal labeling order
l w1 =
l E[C(w1)] = 1 + 1 + 0.05 = 2.05
o P1: P(P1 = crowdsourced) = 1
o P2: P(P2 = crowdsourced) = 1
o P3: P(P3 = crowdsourced) = P(both P1 and P2 are non- matching) = (1-0.9)(1-0.5) = 0.05
Expected value o Probability 2 w1 =
l Eg, p1, p2, p3, p4, p5, p6, p7, p8
High 126 Join [Wang-SIGMOD13] l Parallel labeling l To reduce the rounds à smaller latency
l Eg, w=<(o1, o2), (o2, o3), (o3, o4)>
o (o1, o2) has to be crowdsourced
o (o2, o3) has to be crowdsourced whether the label of (o1, o2) is matching or not
o (o3, o4) has to be crowdsourced no matter which labels (o1, o2) or (o2, o3) has o è All three pairs can be crowdsourced concurrently
o2
o1 o3 o4 127 Join [Wang-SIGMOD13] l Parallel labeling
l w=
l pi needs to be crowdsourced iff:
o pi cannot be deduced from
o
l If an intermediate pair ph is unlabeled
o Assume ph as matching and check the transitivity
l For each pair pi (1 ≤ i ≤ n)
o Output pi as a crowdsourced pair if pi cannot be deduced from