Crowdsourced Algorithms in Data Management

Crowdsourced Algorithms in Data Management DASFAA 2014 Tutorial Dongwon Lee, Ph.D. (Thanks to Jongwuk Lee, Ph.D.) Penn State University, USA [email protected] 2 Latest Version l Latest version of this DASFAA14 tutorial is available at: http://bit.ly/1h6wHAS 3 TOC l Crowdsourcing Basics l Definition l Apps l Crowdsourced Algorithms in DB l Sort l Top-1 l Top-k l Select l Count l Join 4 James Surowiecki, 2004 “Collective intelligence can be brought to bear on a wide variety of problems, and complexity is no bar… conditions that are necessary for the crowd to be wise: diversity, independence, and … decentralization” 5 Jeff Howe, WIRED, 2006 “Crowdsourcing represents the act of a company or institution taking a function once performed by employees and outsourcing it to an undefined (and generally large) network of people in the form of an open call. … The crucial prerequisite is the use of the open call format and the large network of potential laborers…” http://www.wired.com/wired/archive/14.06/crowds.html 6 Daren Brabhan, 2013 “Crowdsourcing as an online, distributed problem-solving and production model that leverages the collective intelligence of online communities to serve specific organizational goals” 7 “Human Computation”, 2011 “Human computation is simply computation that is carried out by humans… Crowdsourcing can be considered a method or a tool that human computation systems can use…” By Edith Law & Luis von Ahn 8 Game with a Purpose: GWAP l Luis von Ahn @ CMU l Eg, l ESP Game à Google Image Labeler l Foldit l Duolingo: crowdsourced language translation 9 Eg, Francis Galton, 1906 Weight-judging competition: 1,197 (mean of 787 crowds) vs. 1,198 pounds (actual measurement) 10 Eg, StolenSidekick, 2006 l A woman lost a cellphone in a taxi l A 16-year-old girl ended up having the phone l Refused to return the phone l Evan Guttman, the woman’s friend, sets up a blog site about the incident l http://stolensidekick.blogspot.com/ l http://www.evanwashere.com/StolenSidekick/ l Attracted a growing amount of attention à the story appeared in Digg main page à NY Times and CNN coverage à Crowds pressure on police … l NYPD arrested the girl and re-possessed the phone http://www.nytimes.com/2006/06/21/nyregion/21sidekick.html?_r=0 11 Eg, “Who Wants to be a Millionaire” Asking the audience usually works è Audience members have diverse knowledge that can be coordinated to provide a correct answer in sum 12 Eg, DARPA Challenge, 2009 l To locate 10 red balloons in arbitrary locations of US l Winner gets $40K l MIT team won the race with the strategy: l 2K per balloon to the first person, A, to send the correct coordinates l 1K to the person, B, who invited A l 0.5K to the person, C, who invited B, … 13 Eg, Threadless.com l Sells t-shirts, designed/voted by crowds l Artists whose designs are chosen get paid 14 Eg, reCAPCHA As of 2012 Captcha: 200M every day ReCaptcha: 750M to date 15 Eg, l Crowdfunding, started in 2009 l Project creators choose a deadline and a minimum funding goal l Creators only from US, UK, and Canada l Donors pledge money to support projects, in exchange of non-monetary values l Eg, t-shirt, thank-u-note, dinner with creators l Donors can be from anywhere l Eg, Pebble, smartwatch l 68K people pledged 10M 16 http://www.resultsfromcrowds.com/features/crowdsourcing-landscape/ What is Crowdsourcing? 17 l Many deﬁni*ons l A few characteris*cs l Online and distributed l Open call & right incen*ve l Diversity and independence l Top-down & boom-up l Q: What are the computaonal issues in crowdsourcing? l Micro-tasks for large crowds 18 What is Computational Crowdsourcing? l Focus on computational aspect of crowdsourcing l Mainly use micro-tasks l Algorithmic aspect l Optimization problem with three parameters l When to use Computational Crowdsourcing? l Machine can’t do the task well l Large crowds can do it well l Task can be split to many micro-tasks Computational Crowdsourcing 19 Find an outlier among three l Requesters images l People submit some tasks l Pay rewards to workers Submit tasks Collect answers l Marketplaces l Provide crowds with tasks Find tasks Return answers l Find an outlier among three Crowds images l Workers perform tasks 20 Crowdsourcing Marketplaces l Platforms for posting/performing (often micro) tasks l Those who want to have tasks done via crowdsourcing è Requesters l Eg, companies, researchers l Those who want to perform tasks for monetary profits è Workers l Eg, individuals for extra income 21 Crowdsourcing Platforms l Notables ones: l Mechanical Turk (AMT) l CrowdFlower l CloudCrowd l Clickworker l SamaSource 22 AMT: mturk.com Workers Requesters 23 AMT (cont) l Workers l Register w. credit account (only US workers can register as of 2013) l Bid to do tasks for earning money l Requesters l First deposit money to account l Post tasks o Task can specify a qualification for workers l Gather results l Pay to workers if results are satisfactory 24 AMT (cont) l Tasks l Called HIT (Human Intelligence Task) l Micro-task l Eg l Data cleaning l Tagging / labeling l Sentiment analysis l Categorization l Surveying l Photo moderation l Transcription Translation task 25 AMT: HIT List Workers qualification 26 AMT: HIT Example 27 Eg, Text Transcription [Miller-13] l Problem: one person can’t do a good transcription l Key idea: iterative improvement by many workers Greg Lile et al. “Exploring iterave and parallel human computaon processes.” HCOMP 2010 Eg, Text Transcription [Miller-13] 28 improvement $0.05 Eg, Text Transcription [Miller-13] 29 3 votes @ $0.01 Eg, Text Transcription [Miller-13] 30 After 9 iterations Eg, Text Transcription [Miller-13] 31 I had intended to hit the nail, but I’m not a very good aim it seems and I ended up hitting my thumb. This is a common occurrence I know, but it doesn’t make me feel any less ridiculous having done it myself. My new strategy will involve lightly tapping the nail while holding it until it is embedded into the wood enough that the wood itself is holding it straight and then I’ll remove my hand and pound carefully away. We’ll see how this goes. Another example: blurry text After 8 iterations Eg, Computer Vision [Li-HotDB12] 32 l How similar is the artistic style? Human and Machine Detecon of Stylis0c Similarity in Art. Adriana Kovashka and Ma<hew Lease. CrowdConf 2010 33 TOC l Crowdsourcing Basics l Definition l Apps l Crowdsourced Algorithms in DB l Sort l Top-1 l Top-k l Select l Count l Join 34 Crowdsourcing DB Projects l CDAS @ NUS l CrowdDB @ UC Berkeley & ETH Zurich l MoDaS @ Tel Aviv U. l Qurk @ MIT l sCOOP @ Stanford & UCSC Eg, CrowdDB System 35 l Crowd-enabled databases l Hybrid human/machine databases l Building a database engine that can dynamically crowdsource certain operations [Franklin-SIGMOD11] 36 Preliminaries: 3 Factors l Latency (or execution time) l Worker pool size l Job attractiveness How long do we wait for? l Monetary cost Latency l # of questions l # of workers l Cost per question Quality l Quality of answers How much is the quality of l Worker maliciousness Cost answers satisfied? l Worker skills How much $$ does we spend? l Task difficulty 37 Preliminaries: Size of Comparison l Diverse forms of questions in a HIT l Different sizes of comparisons in a question l Eg, Binary question Which is better? o s = 2 l Eg, N-ary question Which is the best? o s = N . 38 Preliminaries: Batch l Repetitions of questions within a HIT l Eg, two n-ary questions (batch factor b=2) Which is the best? . Which is the best? . 39 Preliminaries: Response (r) l # of human responses seeked for a HIT W1 Which is better? W2 r = 3 W3 40 Preliminaries: Round (= Step) l Algorithms are executed in rounds l # of rounds ≈ latency Round #1 Round #2 Which is better? Which is better? Parallel Execution Which is better? Sequential Execution 41 Sort Operation l Rank N items using crowdsourcing with respect to the constraint C l Eg, C as “Representative,” “Dangerous,” “Beautiful” SELECT !*! FROM! !ImageTable AS I! WHERE !I.Date > 2014 AND I.loc = “NY”! ORDER BY !CrowdOp(“Representative”)! 42 Naïve Sort l Assuming pair-wise comparison of 2 items l Eg, “Which of two images is better?” l Cycle: A > B, B > C, and C > A l If no cycle occurs N l Naïve all pair-wise comparisons takes 2 comparisons l Optimal # of comparison is O(N log N) l If cycle exists C l More comparisons are required A B 43 Sort [Marcus-VLDB11] l Proposed 3 crowdsourced sort algorithms l #1: Comparison-based Sort l Workers rank S items (S ⊂ N ) per HIT ! S $ l Each HIT yields # & pair-wise comparisons " 2 % l Build a DAG using all pair-wise comparisons from all workers o If i > j, then add an edge from i to j l Break a cycle in the DAG: “head-to-head” o Eg, If i > j occurs 3 times and i < j occurs 2 times, keep only i > j l Perform a topological sort in the DAG 44 Sort [Marcus-VLDB11] 1 3 2 5 4 4 3 5 1 2 45 Sort [Marcus-VLDB11] l N=5, S=3 A A W1 > > 1 B B 1 E C 1 C D D E 46 Sort [Marcus-VLDB11] l N=5, S=3 A A W1 > > 1 B B 1 E W2 > > C 2 1 C D D 1 E 47 Sort [Marcus-VLDB11] l N=5, S=3 A A W1 > > 1 1 B E B 1 1 W2 > > 1 C 2 1 W3 C D D > > 1 E 48 Sort [Marcus-VLDB11] l N=5, S=3 A A W1 > > 1 1 1 B E B 1 1 W2 > > ✖ 1 C 2 1 1 1 W3 C D D > > 1 E W4 > > 49 Sort [Marcus-VLDB11] l N=5, S=3 A A Topological Sort B B E A > B > C > C E > D D C D E 50 Sort [Marcus-VLDB11] l #2: Rating-based Sort l W workers rate each item along a numerical scale l Compute the mean of W ratings of each item l Sort all items using their means l Requires W*N HITs: O(N) Mean rating Worker Rating 1.3 W1 4 W2 3 W3 4 3.6 .

Crowdsourced Algorithms in Data Management

Combating Attacks and Abuse in Large Online Communities

High Converting Email Marketing Templates

The Scam Filter That Fights Back Final Report

Social Turing Tests: Crowdsourcing Sybil Detection

Read a Sample

Crowdsourcing Real-Time Viral Disease and Pest Information: A

Text Corpora and the Challenge of Newly Written Languages

A Study Into the Motivations of Internet Users Contributing to Translation Crowdsourcing: the Case of Polish Facebook User-Translators

Proceedings Template

Multiple Tabs in Quip Spreadsheet

Crowdsourcing and Open Access: Collaborative Techniques for Disseminating Legal Materials and Scholarship Timothy K

Using Crowdsourcing for Enterprise Software Localization