Truth Is a Lie: Crowd Truth and the Seven Myths of Human Annotation
Lora Aroyo, Chris Welty
Published in AI Magazine, 36(1), 15-24, Spring 2015. https://doi.org/10.1609/aimag.v36i1.2564

Big data is having a disruptive impact across the sciences. Human annotation of semantic interpretation tasks is a critical part of big data semantics, but it is based on an antiquated ideal of a single correct truth that needs to be similarly disrupted. We expose seven myths about human annotation, most of which derive from that antiquated ideal of truth, and dispel these myths with examples from our research. We propose a new theory of truth, crowd truth, that is based on the intuition that human interpretation is subjective, and that measuring annotations on the same objects of interpretation (in our examples, sentences) across a crowd will provide a useful representation of their subjectivity and the range of reasonable interpretations.

I amar prestar aen … "The world is changed." In the past decade the amount of data and the scale of computation available have increased by a previously inconceivable amount. Computer science, and AI along with it, has moved solidly out of the realm of thought problems and into an empirical science. However, many of the methods we use predate this fundamental shift, including the ideal of truth. Our central purpose is to revisit this ideal in computer science and AI, expose it as a fallacy, and begin to form a new theory of truth that is more appropriate for big data semantics. We base this new theory on the claim that, outside mathematics, truth is entirely relative and is most closely related to agreement and consensus.

Our theories arise from experimental data that has been published previously (Aroyo and Welty 2013a; Soberon et al. 2013; Inel et al. 2013; Dumitrache et al. 2013), and we use throughout the article examples from our natural language
processing (NLP) work in medical relation extraction; however, the ideas generalize across all of semantics, including semantics from images, audio, video, sensor networks, medical data, and others. We begin by looking at the current fallacy of truth in AI and where it has led us, and then examine how big data, crowdsourcing, and semantics can be combined to correct the fallacy and improve big data analytic systems.

Human Annotation

In the age of big data, machine assistance for empirical analysis is required in all the sciences. Empirical analysis is very much akin to semantic interpretation: data is abstracted into categories, and meaningful patterns, correlations, associations, and implications are extracted. For our purposes, we consider all semantic interpretation of data to be a form of empirical analysis.

In NLP, which has been a big data science for more than 20 years, numerous methods for empirical analysis have been developed and are more or less standard. One of these methods is human annotation to create a gold standard or ground truth. This method is founded on the ideal that there is a single, universally constant truth. But we all know this is a fallacy, as there is not only one universally constant truth.

Gold standards exist in order to train, test, and evaluate algorithms that do empirical analysis. Humans perform the same analysis on small amounts of example data to provide annotations that establish the truth. This truth specifies for each example what the correct output of the analysis should be. Machines can learn (in the machine-learning sense) from these examples, or human programmers can develop algorithms by looking at them, and the correctness of their performance can be measured on annotated examples that weren't seen during training.

The quality of annotation gold standards is established by measuring the interannotator agreement, which is roughly the average pairwise probability that two people agree, adjusted for chance (Cohen 1960); a minimal sketch of this computation appears after the list of myths below. This follows from the ideal of truth: a higher-quality ground truth is one in which multiple humans provide the same annotation for the same examples. Again, we all know this is a fallacy, as there is more than one truth for every example. In the extreme case, if we want to interpret music, or a poem, we would not expect all human annotations of it to be the same. In our experiments we have found that we don't need extreme cases to see clearly multiple human perspectives reflected in annotation; this has revealed the fallacy of truth and helped us to identify a number of myths in the process of collecting human annotated data.

The Seven Myths

The need for human annotation of data is very real. We need to be able to measure machine performance on tasks that require empirical analysis. As the need for machines to handle the scale of the data increases, so will the need for human annotated gold standards. We have found that, just as the sciences and humanities are reinventing themselves in the presence of data, so too must the collection of human annotated data.

We have discovered the following myths that directly influence the practice of collecting human annotated data. Like most myths, they are based in fact but have grown well beyond it, and need to be revisited in the context of the new changing world:

Myth One: One Truth. Most data collection efforts assume that there is one correct interpretation for every input example.

Myth Two: Disagreement Is Bad. To increase the quality of annotation data, disagreement among the annotators should be avoided or reduced.

Myth Three: Detailed Guidelines Help. When specific cases continuously cause disagreement, more instructions are added to limit interpretations.

Myth Four: One Is Enough. Most annotated examples are evaluated by one person.

Myth Five: Experts Are Better. Human annotators with domain knowledge provide better annotated data.

Myth Six: All Examples Are Created Equal. The mathematics of using ground truth treats every example the same; either you match the correct result or not (see the accuracy sketch after this list).

Myth Seven: Once Done, Forever Valid. Once human annotated data is collected for a task, it is used over and over with no update. New annotated data is not aligned with previous data.
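The chance-adjusted pairwise agreement described in the Human Annotation section above is Cohen's kappa (Cohen 1960). The following minimal sketch shows that computation for two annotators; the relation labels and the annotations themselves are invented for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Chance-adjusted agreement between two annotators (Cohen 1960)."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of examples where the annotators match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical relation labels assigned to six sentences by two annotators.
ann1 = ["CAUSES", "CAUSES", "TREATS", "TREATS", "NONE", "NONE"]
ann2 = ["CAUSES", "ASSOCIATED WITH", "TREATS", "NONE", "NONE", "NONE"]
print(round(cohens_kappa(ann1, ann2), 2))  # 0.54: moderate agreement
```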
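Myths One and Six together describe the conventional arithmetic of ground truth. The sketch below (our illustration, not code from the article) makes it concrete: one gold label per example, and an exact-match accuracy in which a miss on a genuinely ambiguous sentence costs exactly as much as a miss on a clear one.

```python
def accuracy(gold, predicted):
    """Myth Six in code: every example weighs the same, match or no match."""
    assert len(gold) == len(predicted)
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)

# One "correct" label per sentence (Myth One), even for ambiguous cases.
gold      = ["CAUSES", "TREATS", "TREATS", "NONE"]
predicted = ["CAUSES", "TREATS", "NONE",   "NONE"]
print(accuracy(gold, predicted))  # 0.75: the single miss, however
                                  # debatable the example, counts in full
```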
Debunking the Myths

Now let us revisit these myths in the context of the new changing world, and thus in the face of a new theory of truth.

One Truth

Our basic premise is that the ideal of truth is a fallacy for semantic interpretation and needs to be changed. All analytics are grounded in this fallacy, and human annotation efforts proceed from the assumption that for each task, every example has a "correct" interpretation and all others are incorrect. With the widespread use of classifiers as a tool, this "one truth" myth has become so pervasive that it is assumed even in cases where it obviously does not hold, like analyzing a segment of music for its mood (Lee and Hu 2012). In our research we have found countless counterexamples to the one truth myth. …erate the same ground truth. Rather than accepting this as a natural property of semantic interpretation, …

ex1  [GADOLINIUM AGENTS] used for patients with severe renal failure show signs of [NEPHROGENIC SYSTEMIC FIBROSIS].
ex2  He was the first physician to identify the relationship between [HEMOPHILIA] and [HEMOPHILIC ARTHROPATHY].
ex3  [Antibiotics] are the first line treatment for indications of [TYPHUS].
ex4  With [Antibiotics] in short supply, DDT was used during World War II to control the insect vectors of [TYPHUS].
ex5  [Monica Lewinsky] came here to get away from the chaos in [the nation's capital].
ex6  [Osama bin Laden] used money from his own construction company to support the [Muhajadeen] in Afghanistan against Soviet forces.
def1  MANIFESTATION links disorders to the observations that are closely associated with them; for example, abdominal distension is a manifestation of liver failure.
def2  CONTRAINDICATES refers to a condition that indicates that a drug or treatment SHOULD NOT BE USED; for example, patients with obesity should avoid using danazol.
def3  ASSOCIATED WITH refers to signs, symptoms, or findings that often appear together; for example, patients who smoke often have yellow teeth.

Table 1. Example Sentences and Definitions
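The crowd truth intuition is that the distribution of labels a crowd produces for a sentence is signal, not noise: a clear sentence like ex3 in Table 1 concentrates on one label, while an ambiguous one like ex4 legitimately spreads across several. The sketch below is our simplified illustration of that idea; the "clarity" score is a plain majority fraction, not the published CrowdTruth metrics, and the crowd labels are invented.

```python
from collections import Counter

def sentence_vector(labels):
    """Distribution of crowd labels for one sentence."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

def clarity(labels):
    """Fraction of workers choosing the majority label: high for clear
    sentences, low for genuinely ambiguous ones."""
    counts = Counter(labels)
    return max(counts.values()) / sum(counts.values())

# Hypothetical crowd labels for two Table 1 sentences, 15 workers each.
crowd = {
    "ex3": ["TREATS"] * 13 + ["PREVENTS"] * 2,
    "ex4": ["TREATS"] * 5 + ["PREVENTS"] * 6 + ["NONE"] * 4,
}
for sid, labels in crowd.items():
    print(sid, sentence_vector(labels), "clarity =", round(clarity(labels), 2))
```

Under the one truth myth, the spread on ex4 would be discarded as annotator error; under crowd truth it is a measurement of the range of reasonable interpretations.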