Ethics, Crowdsourcing and Traceability for Big Data in Human Language Technology

"Where are the data coming from?"
Ethics, crowdsourcing and traceability for Big Data in Human Language Technology

Gilles Adda, Laurent Besacier, Alain Couillault, Karën Fort, Joseph Mariani, Hugues de Mazancourt
[email protected]

Karën Fort ([email protected]) Ethics, crowdsourcing and traceability

1 HLT and data
    The need for more
    The cost of more
2 HLT and human computation
3 The Ethics and Big Data Charter
4 Conclusion

The growing need for (ever larger) corpora
- Success of probabilistic machine learning methods + evaluation paradigm
  ⇒ need for more and more (manually annotated) corpora:
  - for learning purposes
  - for evaluation purposes
- Success of the technology
  ⇒ the annotations needed become more and more diverse and more and more complex

The notoriously high cost of corpora development
- Prague Dependency Treebank [Böhmová et al., 2001]: 1.8 million words
  ⇒ 5 years, 22 persons, $600,000
- GENIA [Kim et al., 2008]: 9,372 sentences (gene and protein names)
  ⇒ 5 part-time annotators, 1 senior and 1 junior coordinator for 1.5 years

2 HLT and human computation
    Amazon Mechanical Turk
    MTurk: the legend
    Long-term consequences

History: von Kempelen's Mechanical Turk
- A mechanical chess player created by J. W. von Kempelen in 1770.
- In fact, a human chess master was hiding inside to operate the machine.
- Artificial artificial intelligence!

Amazon Mechanical Turk
- MTurk is a crowdsourcing system: work is outsourced via the Web and done by many people (the crowd), here the Turkers.
- MTurk is also a microworking system: tasks are cut into small pieces (HITs) and their execution is paid for by the Requesters.

MTurk: the Dream-Come-True Story?
It's Cheap, Fast, Good [Snow et al., 2008] and a Hobby for Turkers!

How many Turkers?
[Fort et al., 2011]: although 500k people are registered as Turkers in the MTurk system, there are actually only between 15,059 and 42,912 of them.

MTurk: a hobby for Turkers?
[Ross et al., 2010, Ipeirotis, 2010] show that:
- Turkers are primarily financially motivated (91%):
  - 20% use MTurk as their primary source of income;
  - 50% as their secondary source of income;
  - leisure is important for only a minority (30%).
- 20% of the Turkers spend more than 15 hours a week on MTurk, and contribute to 80% of the tasks.
- The observed mean hourly wage is below US$2.

Does MTurk produce equivalent quality?
- Quality resources can be produced in some specific cases (e.g. speech transcription).
- But quality is questionable:
  - quality decreases when the task becomes complex (e.g. summarizing [Gillick and Liu, 2010])
  - UI issues [Tratz and Hovy, 2010]
  - Turkers (cheaters, spammers)
  - by-task payment model [Kochhar et al., 2010]
- For some simple tasks, NLP tools perform better than MTurk [Wais et al., 2010].

Is MTurk ethical and/or legal?
Ethics:
- No identification: no relation between Requesters and Turkers, or among Turkers
- No possibility to unionize, to protest against wrongdoing or to go to court
- No minimum wage (less than US$2/hour on average)
- Requesters can refuse to pay the Turkers
Legality:
- Amazon license agreement: Turkers are considered independent workers ⇒ they are supposed to pay all the taxes.
- Illusory, given the very low wages ⇒ States are deprived of a legitimate income source.

Consequences on data production
Vicious circle:
- Usage of low-cost systems (such as MTurk) in projects
  ⇒ Funding agencies see a huge (but unethical) reduction in costs
  ⇒ Funding agencies become reluctant to pay for projects developed outside MTurk
  ⇒ MTurk costs become a standard
  ⇒ Other, more costly systems disappear

3 The Ethics and Big Data Charter
    Creation process
    Contents
    Usage
    Example

Writers and contributors

Collaboration
- 1 meeting a month, from June to December 2012
- validation by each participant

A self-declared charter
- A form to fill in and provide with each grant proposal

Key points
- Traceability: history of the data
- Quality: description of the means deployed to ensure the quality of the data
- Ethics: status and remuneration of the participants
- Legal aspects: license and relevant laws
[Couillault et al., 2014] showed that most published data in HLT do not provide all of this information.

Goal
- Have the Charter adopted by funding agencies, so that they can define an ethical selection policy
- Provide (force?) some space for researchers to take a more global perspective on their project and allow them to also reflect on other issues (privacy, surveillance, etc.)

TCOF-POS [Benzitoun et al., 2012]
- From TCOF (Traitement de Corpus Oraux en Français): spontaneous speech corpus, transcribed with Transcriber
- TCOF-POS: annotated with parts of speech
  - correction of pre-annotations
  - 2 annotators + 1 validator
  - using a spreadsheet
  - methodology used: regularly computed inter-annotator agreement

TCOF-POS: extract
L2      LOC         L2
ok      FNO         ok
L3      LOC         L3
il      PRO:clsi    il
y       PRO:cloy
aura    VER:futu    avoir
il      PRO:clsi    il
y       PRO:cloy
aura    VER:futu    avoir

TCOF-POS charter: some details

Writing the Charter for TCOF-POS
- 2h of work: A. Couillault (Aproged) and K. Fort
- revision: C. Benzitoun (ATILF)

4 Conclusion

MTurk: latest evolutions
Amazon
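
Returning to the TCOF-POS example: the slides mention a regularly computed inter-annotator agreement but do not say which measure was used. As a minimal sketch, assuming Cohen's kappa (a common choice for two annotators), the computation could look like the following. The function name and the toy tag sequences are hypothetical; only the tag labels themselves (PRO:clsi, PRO:cloy, VER:futu, FNO) come from the extract.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: share of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: computed from each annotator's own label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[tag] * freq_b[tag] for tag in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data, not taken from the TCOF-POS corpus:
# the annotators disagree on one token out of six.
annotator_1 = ["PRO:clsi", "PRO:cloy", "VER:futu", "FNO", "PRO:clsi", "VER:futu"]
annotator_2 = ["PRO:clsi", "PRO:cloy", "VER:futu", "FNO", "PRO:cloy", "VER:futu"]

print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Unlike raw percentage agreement, kappa discounts the agreement expected by chance, which matters when a few tags dominate the corpus.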
