EACL 2017

Ethics in Natural Language Processing

Proceedings of the First ACL Workshop

April 4th, 2017
Valencia, Spain

Sponsors:

© 2017 The Association for Computational Linguistics

Order copies of this and other ACL proceedings from:

Association for Computational Linguistics (ACL)
209 N. Eighth Street
Stroudsburg, PA 18360 USA
Tel: +1-570-476-8006
Fax: +1-570-476-0860
[email protected]

ISBN 978-1-945626-47-0

Introduction

Welcome to the first ACL Workshop on Ethics in Natural Language Processing! We are pleased to have participants from a variety of backgrounds and perspectives: social science, computational linguistics, and philosophy; academia, industry, and government.

The workshop consists of invited talks, contributed discussion papers, posters, demos, and a panel discussion. Invited speakers include Graeme Hirst, a Professor in NLP at the University of Toronto, who works on lexical semantics, pragmatics, and text classification, with applications to intelligent text understanding for disabled users; Quirine Eijkman, a Senior Researcher at Leiden University, who leads work on security governance, the sociology of law, and human rights; Jason Baldridge, a co-founder and Chief Scientist of People Pattern, who specializes in computational models of discourse as well as the interaction between machine learning and human bias; and Joanna Bryson, a Reader in artificial intelligence and natural intelligence at the University of Bath, who works on action selection, systems AI, transparency of AI, political polarization, income inequality, and ethics in AI.

We received paper submissions that span a wide range of topics, addressing issues related to overgeneralization, dual use, privacy protection, bias in NLP models, underrepresentation, fairness, and more. Their authors share insights about the intersection of NLP and ethics in academic, industrial, and clinical work. Common themes include the role of tasks, datasets, annotations, training populations, and modelling. We selected four papers for oral presentation, eight for poster presentation, and one for demo presentation, and have paired each oral presentation with a discussant outside the authors’ areas of expertise to help contextualize the work in a broader perspective. All papers additionally provide the basis for panel and participant discussion.

We hope this workshop will help to define and raise awareness of ethical considerations in NLP throughout the community, and will establish them as a recurring theme to consider at future NLP conferences. We would like to thank all authors, speakers, panelists, and discussants for their thoughtful contributions. We are also grateful to our sponsors (Bloomberg, Google, and HITS), who have helped make the workshop in this form possible.

The Organizers:
Margaret, Dirk, Shannon, Emily, Hanna, Michael


Organizers:
Dirk Hovy, University of Copenhagen (Denmark)
Shannon Spruit, Delft University of Technology (Netherlands)
Margaret Mitchell, Google Research & Machine Intelligence (USA)
Emily M. Bender, University of Washington (USA)
Michael Strube, Heidelberg Institute for Theoretical Studies (Germany)
Hanna Wallach, Microsoft Research, UMass Amherst (USA)

Program Committee:
Gilles Adda, Fernando Diaz, Nikola Ljubesic, Molly Roberts, Nikolaos Aletras, Benjamin Van Durme, Adam Lopez, Tim Rocktäschel, Mark Alfano, Jacob Eisenstein, L. Alfonso Urena Lopez, Frank Rudzicz, Jacob Andreas, Jason Eisner, Teresa Lynn, Alexander M. Rush, Isabelle Augenstein, Desmond Elliott, Nitin Madnani, Derek Ruths, Tim Baldwin, Micha Elsner, Gideon Mann, Asad Sayeed, Miguel Ballesteros, Katrin Erk, Daniel Marcu, David Schlangen, David Bamman, Raquel Fernandez, Jonathan May, Natalie Schluter, Mohit Bansal, Laura Fichtner, Kathy McKeown, H. Andrew Schwartz, Solon Barocas, Karën Fort, Paola Merlo, Hinrich Schütze, Daniel Bauer, Victoria Fossum, David Mimno, Djamé Seddah, Eric Bell, Lily Frank, Shachar Mirkin, Dan Simonson, Steven Bethard, Sorelle Friedler, Alessandro Moschitti, Sameer Singh, Rahul Bhagat, Annemarie Friedrich, Jason Naradowsky, Vivek Srikumar, Chris Biemann, Juri Ganitkevich, Roberto Navigli, Sanja Stajner, Yonatan Bisk, Spandana Gella, Arvind Neelakantan, Pontus Stenetorp, Michael Bloodgood, Kevin Gimpel, Ani Nenkova, Brandon Stewart, Matko Bosnjak, Joao Graca, Dong Nguyen, Veselin Stoyanov, Chris Brockett, Yvette Graham, Brendan O’Connor, Anders Søgaard, Miles Brundage, Keith Hall, Diarmuid O’Seaghdha, Ivan Titov, Joana J. Bryson, Oul Han, Miles Osborne, Sara Tonelli, Ryan Calo, Graeme Hirst, Jahna Otterbacher, Oren Tsur, Marine Carpuat, Nathan Hodas, Sebastian Padó, Yulia Tsvetkov, Yejin Choi, Kristy Hollingshead, Alexis Palmer, Lyle Ungar, Munmun De Choudhury, Ed Hovy, Martha Palmer, Suresh Venkatasubramanian, Grzegorz Chrupala, Georgy Ishmaev, Michael Paul, Yannick Versley, Ann Clifton, Jing Jiang, Ellie Pavlick, Aline Villavicencio, Kevin B. Cohen, Anna Jobin, Emily Pitler, Andreas Vlachos, Shay B. Cohen, Anders Johannsen, Barbara Plank, Rob Voigt, Court Corley, David Jurgens, Thierry Poibeau, Svitlana Volkova, Ryan Cotterell, Brian Keegan, Chris Potts, Martijn Warnier, Aron Culotta, Roman Klinger, Vinod Prabhakaran, Zeerak Waseem, Walter Daelemans, Ekaterina Kochmar, Daniel Preotiuc, Bonnie Webber, Dipanjan Das, Philipp Koehn, Nikolaus Pöchhacker, Joern Wuebker, Hal Daumé III, Zornitsa Kozareva, Will Radford, François Yvon, Steve DeNeefe, Jayant Krishnamurthy, Siva Reddy, Luke Zettlemoyer, Francien Dechesne, Jonathan K. Kummerfeld, Luis Reyes-Galindo, Janneke van der Zwaan, Leon Derczynski, Vasileios Lampos, Sebastian Riedel, Aliya Deri, Angeliki Lazaridou, Ellen Riloff, Mona Diab, Alessandro Lenci, Brian Roark

Invited Speakers:
Graeme Hirst, University of Toronto (Canada)
Quirine Eijkman, Leiden University (Netherlands)
Jason Baldridge, People Pattern (USA)
Joanna Bryson, University of Bath (UK)


Table of Contents

Gender as a Variable in Natural-Language Processing: Ethical Considerations Brian Larson ...... 1

These are not the Stereotypes You are Looking For: Bias and Fairness in Authorial Gender Attribution Corina Koolen and Andreas van Cranenburgh...... 12

A Quantitative Study of Data in the NLP community Margot Mieskes ...... 23

Ethical by Design: Ethics Best Practices for Natural Language Processing Jochen L. Leidner and Vassilis Plachouras...... 30

Building Better Open-Source Tools to Support Fairness in Automated Scoring Nitin Madnani, Anastassia Loukina, Alina von Davier, Jill Burstein and Aoife Cahill ...... 41

Gender and Dialect Bias in YouTube’s Automatic Captions Rachael Tatman ...... 53

Integrating the Management of Personal Data Protection and Open Science with Research Ethics Dave Lewis, Joss Moorkens and Kaniz Fatema ...... 60

Ethical Considerations in NLP Shared Tasks Carla Parra Escartín, Wessel Reijers, Teresa Lynn, Joss Moorkens, Andy Way and Chao-Hong Liu 66

Social Bias in Elicited Natural Language Inferences Rachel Rudinger, Chandler May and Benjamin Van Durme ...... 74

A Short Review of Ethical Challenges in Clinical Natural Language Processing Simon Suster, Stephan Tulkens and Walter Daelemans...... 80

Goal-Oriented Design for Ethical Machine Learning and NLP Tyler Schnoebelen ...... 88

Ethical Research Protocols for Social Media Health Research Adrian Benton, Glen Coppersmith and Mark Dredze ...... 94

Say the Right Thing Right: Ethics Issues in Natural Language Generation Systems Charese Smiley, Frank Schilder, Vassilis Plachouras and Jochen L. Leidner ...... 103


Workshop Program

Tuesday, 4 April, 2017

9:30–11:00 Morning Session 1

9:30–9:40 Welcome, Overview Dirk, Margaret, Shannon, Michael

9:40–10:15 Invited Talk Graeme Hirst

10:15–10:50 Invited Talk Joanna Bryson

11:00–11:30 Coffee Break

11:30–13:00 Morning Session 2 - Gender

11:30–11:45 Gender as a Variable in Natural-Language Processing: Ethical Considerations Brian Larson

11:45–12:00 These are not the Stereotypes You are Looking For: Bias and Fairness in Authorial Gender Attribution Corina Koolen and Andreas van Cranenburgh

12:00–12:25 Paper Discussion Authors and Discussant

12:25–13:00 Invited Talk Quirine Eijkman

Tuesday, 4 April, 2017 (continued)

13:00–14:30 Lunch Break

14:30–16:00 Afternoon Session 1 - Data and Design

14:30–15:05 Invited Talk Jason Baldridge

15:05–15:20 A Quantitative Study of Data in the NLP community Margot Mieskes

15:20–15:35 Ethical by Design: Ethics Best Practices for Natural Language Processing Jochen L. Leidner and Vassilis Plachouras

15:35–16:00 Paper Discussion Authors and Discussant

16:00–17:00 Afternoon Session 2 - Coffee and Posters

16:00–17:00 Building Better Open-Source Tools to Support Fairness in Automated Scoring Nitin Madnani, Anastassia Loukina, Alina von Davier, Jill Burstein and Aoife Cahill

16:00–17:00 Gender and Dialect Bias in YouTube’s Automatic Captions Rachael Tatman

16:00–17:00 Integrating the Management of Personal Data Protection and Open Science with Research Ethics Dave Lewis, Joss Moorkens and Kaniz Fatema

16:00–17:00 Ethical Considerations in NLP Shared Tasks Carla Parra Escartín, Wessel Reijers, Teresa Lynn, Joss Moorkens, Andy Way and Chao-Hong Liu

16:00–17:00 Social Bias in Elicited Natural Language Inferences Rachel Rudinger, Chandler May and Benjamin Van Durme

16:00–17:00 A Short Review of Ethical Challenges in Clinical Natural Language Processing Simon Suster, Stephan Tulkens and Walter Daelemans

Tuesday, 4 April, 2017 (continued)

16:00–17:00 Goal-Oriented Design for Ethical Machine Learning and NLP Tyler Schnoebelen

16:00–17:00 Ethical Research Protocols for Social Media Health Research Adrian Benton, Glen Coppersmith and Mark Dredze

16:00–17:00 Say the Right Thing Right: Ethics Issues in Natural Language Generation Systems Charese Smiley, Frank Schilder, Vassilis Plachouras and Jochen L. Leidner

17:00–18:00 Evening Session

17:10–17:45 Panel Discussion Panelists

17:45–18:00 Concluding Remarks Dirk, Margaret, Shannon, Michael


Gender as a Variable in Natural-Language Processing: Ethical Considerations

Brian N. Larson
Georgia Institute of Technology
686 Cherry St. MC 0165
Atlanta, GA 30363 USA
[email protected]

Abstract

Researchers and practitioners in natural-language processing (NLP) and related fields should attend to ethical principles in study design, ascription of categories/variables to study participants, and reporting of findings or results. This paper discusses theoretical and ethical frameworks for using gender as a variable in NLP studies and proposes four guidelines for researchers and practitioners. The principles outlined here should guide practitioners, researchers, and peer reviewers, and they may be applicable to other social categories, such as race, applied to human beings connected to NLP research.

1 Introduction

Bamman et al. (2014) challenged simplistic notions of a gender binary and the common quest in natural-language processing (NLP) studies merely to predict gender based on text, making the following observation:

If we start with the assumption that ‘female’ and ‘male’ are the relevant categories, then our analyses are incapable of revealing violations of this assumption. . . . [W]hen we turn to a descriptive account of the interaction between language and gender, this analysis becomes a house of mirrors, which by design can only find evidence to support the underlying assumption of a binary gender opposition (p. 148).

Gender is a common variable in NLP studies. For example, a search of the ACL Anthology (aclanthology.info) for the keyword “gender” in the title field revealed seven papers in 2016 alone that made use of personal (as opposed to grammatical) gender as a central variable. Many others used gender as a variable without referring to gender in their titles. It is not uncommon, however, for studies regarding gender to be reported without any explanation of how gender labels were ascribed to authors or their texts.

This paper argues that using gender as a variable in NLP is an ethical issue. Researchers and practitioners in NLP who unreflectively apply gender category labels to texts and their authors may violate ethical principles that govern the use of human participants or “subjects” in research (Belmont Report, 1979; Common Rule, 2009). By failing to explain in study reports what theory of gender they are using and how they assigned gender categories, they may also run afoul of other ethical frameworks that demand transparency and accountability from researchers (Breuch et al., 2002; FAT-ML, nd; MacNealy, 1998).

This paper discusses theoretical and ethical frameworks for using gender as a variable in NLP studies. The principles outlined here should guide researchers and peer reviewers, and they may be applicable to other social categories, such as race, applied to human beings connected to NLP research. Note that this paper does not purport to select the best theory of gender or method of ascribing gender categories for NLP. Rather, it urges a continual process of thoughtfulness and debate regarding these issues, both within each study and among the authors and readers of study reports.

In summary, researchers and practitioners should (1) formulate research questions making explicit theories of what “gender” is; (2) avoid using gender as a variable unless it is necessary to answer research questions; (3) make explicit methods for assigning gender categories to participants and linguistic artifacts; and (4) respect the difficulties of respondents when asking them to self-identify for gender.

Section 2 considers theoretical foundations for gender as a research construct and rationales for studying it. Section 3 proposes ethical frameworks for academic researchers and for practitioners. Section 4 examines several studies in NLP that are representative of the range of studies using gender as a variable. Section 5 concludes with recommendations for best practices in designing, reporting, and peer-reviewing NLP studies using gender as a variable.

2 Gender and rationales for its study

2.1 Three views of gender

There are many views of how gender functions as a social construct. This section presents just three: the common or folk view of gender, a performative view of gender, and one social psychological view of gender. None of these views can be seen as correct for all contexts and applications. The view that is appropriate for a given project will depend on the research questions posed and the goals of the project.

A folk belief, as the term is used here, refers to the doxa or beliefs of the many that may or may not be supported by systematic inquiry—common beliefs distinguished from scientific knowledge or philosophical theories (Plato, 2005). In the folk conception, the “heteronormative gender binary” (Larson, 2016, p. 365) conflates sex, the chromosomal and biological characteristics of people, with gender, their outward appearances and behaviors. The salience of these categories and their binary nature are taken as obvious and natural. Consequently, the options available on a survey for the question “Gender?” are frequently “male” or “female” (sex categories) rather than “masculine” or “feminine” (gender categories). There is a growing understanding in contemporary western culture, however, that some individuals either do not fall easily into the binary or exhibit gender characteristics inconsistent with the biological sex ascribed to them at birth—these persons are sometimes referred to as being “transgender,” while those whose sex and gender are congruent are “cisgender” (DeFrancisco et al., 2014). Various communities of persons who are not cisgender have other names they prefer to use for themselves, including “gender non-conforming,” “non-binary,” and “genderqueer” (GLAAD, nd b). According to one academic report, there are 1.4 million transgender people in the United States alone, and for these persons, the language used to characterize them can function as respectful on the one hand or offensive and defamatory on the other (GLAAD, nd a). Note that the gender labels that transgender persons ascribe to themselves do not include “other.” The folk view of gender might be an appropriate frame for the NLP researcher seeking to explore study participants’ use of language in relation to their own conceptions of their genders.

Another view of gender sees it as performative. So, according to DeFrancisco et al. (2014, p. 3) gender consists in “the behaviors and appearances society dictates a body of a particular sex should perform,” structuring “people’s understanding of themselves and each other.” According to Larson (2016), an actor’s gender knowledge comprises components of the actor’s cognitive environment—beliefs about behaviors the actor expects to have a particular effect or effects on another based on knowledge about a typified situation in the actor’s cognitive environment. Among these behaviors is language. Butler (1993) characterized gender as a form of performativity arising in “an unexamined framework of normative heterosexuality” (p. 97). According to all these theories, gender performativity is not merely performance, but rather performances that respond to, or are constrained by, norms or conventions and simultaneously reinforce them. This approach to gender could be useful, for example, in a study exploring the ways that language might be used to resist folk views of gender, especially in a context like transgender communities, where resistance to gender doxa is essential to building identity. Similarly, it could be useful in studying cases where persons of one gender attempt to appropriate conventional communicative practices of another gender without adopting a transgender identity. Bamman et al. (2014) made specific reference to this family of theories in their study of Twitter users.

A third approach to thinking about gender is to assume a gender binary, identify characteristics that cluster around the modes of the binary, and assess the gender of study participants based on their closeness of fit to these modes. This is exactly the approach of the Bem Sex Roles Inventory (Bem, 1974) and other instruments developed by social psychologists to assess gender. This approach allows the researcher to break gender down into constituent features. So, for example, the BSRI
associates self-reliance, independence, and athleticism with masculinity and loyalty, sympathy, and sensitivity with femininity (Blanchard-Fields et al., 1994). This approach might be useful, for example, for an NLP practitioner seeking to identify consumers exhibiting individual characteristics—like independence and athleticism—in order to market a particular product to those consumers without regard to their gender or sex. Such approaches may not be available to NLP researchers, though, as they require participants to fill out surveys.

These are only three of many possible approaches to gender, and as the examples suggest, they vary widely in the kinds of research questions they can help to answer.

2.2 Rationales for studying gender

Broadly speaking, NLP studies focused on gender stem from two sources: researchers and practitioners. Borrowing from concepts in the field of research with human participants, we can characterize research as “activity designed to test an hypothesis, permit conclusions to be drawn, and thereby to develop or contribute to generalizable knowledge” (Belmont Report, 1979). Practitioners, by contrast, are interested in providing solutions or “interventions that are designed solely to enhance the well-being of an individual. . . client”—in other words, the development of commercial applications. These two rationales can blend when academics disseminate research with the intention of attracting commercial interest and when practitioners disseminate study findings to the academic community with a goal, in part, of attracting attention to their commercial activities. Practitioners may also intend to develop applications that serve the needs of multiple clients, as when they seek to sell a technical solution to many players within an industry.

The practitioner may have more instrumental objectives, hoping, for example, for insights about consumer behavior applicable to an employer’s or client’s commercial goals. Practitioners engaged in such studies need not be concerned about the finer points of academic-researcher ethics. They should be conscious, however, of the social effects of their research when it is disseminated, covered in the news, etc. Even if their research is used only internally for their companies or clients, they may use variables in machine learning applications in such a way as to cause “algorithmic discrimination,” where “an individual or group receives unfair treatment as a result of algorithmic decision-making” (Goodman, 2016). The ethical frameworks discussed in the next section provide reasons to avoid such discrimination.

3 Ethical frameworks

Academic researchers and commercial practitioners may draw their ethical principles from different ethical frameworks, but they have similar ethical obligations for ascribing category labels to persons and for using and reporting the research resulting from them.

In the United States, academic researchers are generally guided by principles articulated in the Belmont Report (1979), which calls on researchers to observe three principles:

• Respect for persons represents the right of a human taking part or being observed in research (sometimes called a “subject” or “participant”) to make an informed decision about whether to take part and for a researcher “to give weight to autonomous persons’ considered opinions and choices.”

• Beneficence requires that the research first do no harm to participants and second “maximize possible benefits and minimize possible harms.”

• Justice demands that the costs and benefits of research be distributed fairly, so that one group does not endure the costs of research while another enjoys its benefits.

Under regulations of the U.S. Department of Health and Human Services known as the Common Rule, “all research involving human subjects conducted, supported or otherwise subject to regulation by any federal department or agency” must be subjected to review by an institutional review board or IRB (Common Rule, 2009). As a practical matter, most research universities in the United States require that all research involving human participants be subject to IRB review. The Common Rule embodies many of the principles of the Belmont Report and of the Declaration of Helsinki (World Medical Association, 1964).

Other authorities argue that academic researchers have ethical responsibilities regarding their research, even if it does not involve human
participants. In that context, internal and external validity (or validity and reliability) of research findings are ethical concerns (Breuch et al., 2002; MacNealy, 1998). Not being explicit about what the researcher means by the research construct gender raises a problem for readers of research reports, as they cannot evaluate a researcher’s claims without knowing in principle what the researcher means by her central terms. Not being explicit about the ascription of the category gender as a variable to participants or communication artifacts that they create brings into question internal and external validity of research findings, because it makes it difficult or impossible for other scholars to reproduce, test, or extend study findings. In short, doing good science is an ethical obligation of good scientists.

Practitioners are bound by ethical frameworks that are applicable to all persons generally. In the West, these may be drawn from normative frameworks that determine circumstances under which one can be called ethical: “virtue ethics”—having ethical thoughts and an ethical character (Hursthouse and Pettigrove, 2016); “deontological” ethics—conforming to rules, laws, and other statements of ethical duty (Alexander and Moore, 2016); and “consequentialism”—engaging in action that causes more good than harm (Sinnott-Armstrong, 2015). Other western and non-western ethical systems may prioritize other values (Hennig, 2010). Deontological ethics is drawn from sets of rules, such as religious texts, industry codes of ethics, and laws. Deontological theorists derive such rules from theoretical procedures, such as Kant’s categorical imperative, where “all those possibly affected” can “will a just maxim as a general rule”; Rawls’s “veil of ignorance,” in which participants cannot know what role they will play in the society for which they posit rules; or Habermas’s discourse ethics, rules resulting from a “noncoercive rational discourse among free and equal participants” (Habermas, 1995, p. 117). In a sense the Belmont Report provides a set of rules for deontological evaluation.

Consequentialist ethical systems like utilitarianism evaluate actions not by their means but their ends. They are thus consistent with the Belmont Report edict that research’s benefits should outweigh its costs. But neither the Belmont Report nor other ethical systems typically permit actors to ignore the means they use to pursue their ends.

Some researchers/practitioners have argued for fairness, accountability, and transparency as ethical principles in applications of machine learning, a technology commonly used in NLP. Consider, for example, Hardt (2014) and Wallach (2014), and the group of researchers and practitioners behind FAT-ML (FAT-ML, nd). In this literature, it is not always clear what these three terms are meant to represent. So, for example, fairness appears to be a social metric similar to the Belmont Report’s beneficence and justice. Wallach refers to it almost strictly in the phrase “bias, fairness, and inclusion.” This seems concerned with fairness in the distributive sense of the Belmont Report’s justice rather than the aggregate sense of consequentialist ethical systems. Wallach’s uses of transparency and accountability echo the ethical principles for researchers suggested by Breuch et al. (2002) and MacNealy (1998). She appears to view them as principles to which researchers and practitioners should aspire.

FAT-ML could be operationalized as an ethical framework this way: NLP studies would expose their theoretical commitments, describe their research constructs (including gender), and explain their methods (including their ascription of gender categories). The resulting transparency permits accountability to peer reviewers and other researchers and practitioners, who may assess a given study against principles intended to result in valid and reliable scientific findings, principles designed to ensure respect for persons, justice, beneficence, and other evolving ethical principles under the rubric of fairness. Identification of the applicable rules awaits the rational non-coercive discourse of which the First Workshop on Ethics in NLP is an early and important example.

4 Applying frameworks to previous studies

This section considers how previously published and disseminated studies satisfy the ethical frameworks noted above and whether those frameworks may challenge the studies. Note that consideration of these particular studies is not meant to suggest that they are ethically flawed; they have been selected because they are recent studies or high-quality studies that have been widely cited. Generally, the studies discussed in this section included very careful descriptions of their methods of data collection and analysis. However, though
each purported to tell us something about gender, hardly any defined what they meant by “gender” or “sex,” many did not indicate how they ascribed the gender categories to their participants or artifacts, and some that did explain the ascription of gender categories left room for concerns.

A great many studies have explored gender differences in human communication. An early and widely cited study is Koppel et al. (2002), where the researchers used machine learning to predict the gender of authors of published texts in the British National Corpus (BNC). Koppel and colleagues noted that the works they selected from the BNC were labeled for author gender, but they did not indicate how that labeling was done.

Like Koppel et al., many study authors allow the ascription of the gender category to be the result of an opaque process—that is, they do not fully embrace the transparency and accountability principles identified above, making the validity of studies difficult to assess. For example, in a study of computer-mediated communication, Herring and Paolillo (2006) assigned gender to blog authors “by examining each blog qualitatively for indications of gender such as first names, nicknames, explicit gender statements. . . and gender-indexical language.” The authors did not provide readers with examples of the process of assigning these labels—called “coding” here as it is frequently by qualitative researchers, and not to be confused with the computer programmer’s notion of “coding” or writing code—a coding guide, which is the set of instructions that researchers use to assign category labels to persons or artifacts, or a statement about whether the researchers compared coding by two or more coders to assess inter-rater reliability (Potter and Levine-Donnerstein, 1999).

Rao et al. (2010) examined Twitter posts (“tweets”) to predict the gender categories they had ascribed to the texts’ authors. They identified 1,000 Twitter users and inferred their gender based upon a heuristic: “For gender, the seed set for the crawl came from initial sources including sororities, fraternities, and male and female hygiene products. This produced around 500 users in each class” (2010, p. 38). Of course, using linguistic performances (profiles and tweets) to assign gender to Twitter accounts and then using linguistic performances to predict the genders of those accounts is very like the “house of mirrors” that Bamman et al. (2014) warned of above.

The approach of Rao and colleagues and Herring and Paolillo also appears to put the researcher in the position of deciding what counts as male and female in the data. This raises questions of fairness with regard to participants who have been labeled according to the researchers’ expectations, or perhaps their biases, rather than autonomous decisions by the participants.

Other studies make their ascription of gender categories explicit but fail to cautiously approach such labels. For example, two early studies, Yan and Yan (2006) and Argamon et al. (2007), used machine learning to classify blogs by their authors’ genders. They used blog profile account settings to ascribe gender categories. Burger et al. (2011) assigned gender to Twitter users by following links from Twitter accounts to users’ blogs on blogging platforms that required users to indicate their genders. More recently, Rouhizadeh et al. (2016) studied Facebook users from the period 2009–2011 based on their self-identified genders (but these data were gathered before Facebook’s current gender options, see below), and Wang et al. (2016) looked at Weibo users, collecting self-identified gender data from their profiles.

None of the studies in the previous paragraph described how frequently account holders indicated their own genders, what gender options were possible, or how researchers accounted for account holders posing with genders other than their own. The answers to such questions would make the studies more transparent, helping readers to assess their validity and fairness. For example, if many users of a site refused to disclose their genders, it is possible that the decision not to disclose might correlate with other characteristics that would make gender distinctions in the data more or less pronounced. The Belmont Report’s concern about autonomy would best be addressed by understanding the options given to participants to represent themselves as gendered persons on these blogging platforms. If there were only two gender options—probably “male” and “female”—we might well wonder whether transgender persons may have refused to answer the question, or if forced to answer, how they chose which gender.

One study deserves special mention: Bamman et al. (2014) compared user names on Twitter profiles to U.S. Census data which showed a gender distribution for the 9,000 most commonly appearing first names. Though some names
were ambiguous—used for persons of different genders—in the census data, 95 percent of the users included in the study had a name that was “at least 85 percent associated with its majority gender” (p. 140). They then examined correlations between gender and language use. This approach might fall prey to criticisms regarding category ascription similar to those leveled at the studies above. Bamman et al., however, exhibited much more caution in the use of gender categories than any of the other studies cited here and engaged in cluster analyses that showed patterns of language use that crossed the gender-binary boundary. By describing the theory of gender they used and the method of ascribing the gender label, they made their study transparent and accountable. Whether it is fair is an assessment for their peers to make.

5 Guidelines for using gender as a variable

This section describes four guidelines for researchers and practitioners using gender as a variable in NLP studies that fall broadly under these admonitions: (1) formulate research questions making explicit theories of what “gender” is; (2) avoid using gender as a variable unless it is necessary to answer research questions; (3) make explicit methods for assigning gender categories to participants and linguistic artifacts; and (4) respect the difficulties of respondents when asking them to self-identify for gender. It also includes a recommendation for peer reviewers for conference-paper and research-article submissions. Note that this paper does not advocate for a particular theory of gender or method of ascribing gender categories to cover all NLP studies. Rather, it advocates for exposing decisions on these matters to aid in making studies more transparent, accountable, and fair. The decisions that practitioners and researchers make will be subject to debate among them, peer reviewers, and other practitioners and researchers.

5.1 Make theory of gender explicit

Researchers and practitioners should make explicit the theory of gender that undergirds their research questions. This step is essential to make studies accountable, transparent, and valid. For other researchers or practitioners to fully interpret a study and to interrogate, challenge, or reproduce it, they need to understand its theoretical grounds. Ideally, a researcher would provide an extended discussion of the central variable in his or her study. For example, Larson (2016) offered a definition of “gender” used in the study along with a lengthy discussion of the concept. Both the discussion and analysis in Bamman et al. (2014) engaged with previous theoretical literature on gender and challenged the gender constructs used in previous NLP studies. But articles using gender as a variable need not go to this extent. The goal of making gender theory explicit can be achieved by quoting a definition of “gender” from earlier research and giving some evidence of actually having read some of the earlier research. In the alternative, the researcher may adopt a construct definition for gender; that is, the researcher may answer the question, “What does ‘gender’ measure?” Thus, researchers can either choose definitions of “gender” from existing theories or identify what they mean by “gender” by defining it themselves.

Practitioners may take a different view. Consider, for example, a practitioner working at a social media site that requires its users to self-identify in response to the question “gender.” It is reasonable for this practitioner to use NLP tools to study the site’s customers based on their responses to this question, seeking usage patterns, correlations, etc. But a challenge arises as social media platforms recognize nuances in gender identity. For example, in 2015 Facebook began allowing its users to indicate that their gender is “female,” “male,” or “custom,” and the custom option is an open text box (Bell, 2015). A practitioner there using gender data will be compelled to use many labels or group them in a manner selected by the practitioner. Using all the labels presents difficulties for classifiers and for the practitioner attempting to explain results. Grouping labels requires the practitioner to theorize about how they should be grouped. This takes us back to the admonition that the researcher or practitioner should make explicit the theory of gender being used.

5.2 Avoid using gender unless necessary

This admonition is perhaps obvious: Given the efforts that this paper suggests should surround the selection, ascription, use, and reporting of gender categories in NLP studies, it would be foolish to use gender as a category unless it is necessary to achieve the researcher’s objectives, because the effort is unlikely to be commensurate with the payoff.
It is likely, though, that the casual use of gender as a routine demographic question in studies where gender is not a central concern will remain commonplace. It seems an easy question to ask, and once the data are collected, it seems easy to perform a cross-tabulation of findings or results based on the response to this question.

The reasons for avoiding the use of gender as a variable unless necessary are grounded in all the ethical principles discussed above. A failure to give careful consideration to the questions presented in this paper creates a variety of risks. Thus, researchers should resist the temptation to ask: “I wonder if the women responded differently than the men.” The best way to resist this temptation is to resist asking the gender question in the first place, unless it is important to presenting findings or results.

A reviewer of this paper noted that following this recommendation might inadvertently discourage researchers and practitioners from checking the algorithmic bias of their systems. Indeed, it is thoroughly consistent with values described here for researchers and practitioners to engage in such checking. In that case, gender is a necessary category, but where such work is anticipated, the other recommendations of this Section 5 should be carefully followed from the outset.

5.3 Make category assignment explicit

Researchers and practitioners should make explicit the method(s) they use to ascribe gender categories to study participants or communication artifacts. This step is essential to make the researcher’s or practitioner’s studies accountable, transparent, and valid. Just as the study’s theory of gender is an essential basis for interpreting the findings—for interrogating, challenging, and reproducing them—so are the methods of ascribing the variable of study. This category provides the largest number of specific recommendations. (In this section, the term “researcher” refers both to researchers as discussed above and to practitioners who choose to disseminate their studies into the research community.)

Researchers have several choices here. Outside of NLP, they have very commonly ascribed gender to study participants based on the researchers’ own best-guess assessments: The researcher interacts with a participant and concludes that she is female or he is male. For small-scale studies, this approach will not likely go away; but the researcher should consider at the time of study design whether and how to do this. Researchers reporting findings should acknowledge if this is the approach they have taken.

A related approach makes sense where the researcher is studying how participants behave toward each other based on what they perceive others’ genders to be. For example, if studying whether a teacher treats students differently based on student genders, the researcher may need to know what genders the teacher ascribes to students. The researcher should give thought to how to collect information about this category ascription from the teacher. The process could prove challenging if the researcher and teacher operate in an environment where students challenge traditional gender roles or where students outwardly identify as transgender.

But participant self-identification should be the gold standard for ascribing gender categories. Except in circumstances where one might not expect complete candor, one can count on participants to say what their own genders are. On the one hand, this approach to ascribing a gender label respects the autonomy of study participants, as it allows them to assert the gender with which they identify. On the other hand, it does not account for the fact that each study participant may have a different conception of gender, its meaning, its relation to sex, etc. For example, a 76-year-old woman who has lived in the United States her whole life may have a very different conception of what it means to be “female” or “feminine” than does a 20-year-old recent immigrant to Germany from Turkey. Despite this, each may be attempting to make sense of her identity as including a female or feminine gender.

In theory, the researcher could address the concerns regarding participant self-identification using a gender-role inventory. In fact, one study looking for gender differences in writing did exactly that, using the Bem Sex Role Inventory (BSRI) to assess author genders (Janssen and Murachver, 2004). The challenge with these approaches is that gender is a moving target. Sandra Bem introduced the BSRI in 1974 (Bem, 1974). It has since been criticized on a wide variety of grounds, but of importance here is the fact that it was based on gender role stereotypes from the time when it was created. Thus a meta-analysis by
Twenge (1997) of studies using the BSRI showed that the masculinity score of women taking the BSRI had increased steadily over 15 years, and men’s masculinity scores showed a steady decrease in correlation over the same period. These developments make sense in the context of a gender roles inventory that is necessarily validated over a period of years after it is first developed, resulting in an outdated set of gender stereotypes being embodied in the test, stereotypes that are not confirmed later as gender roles change. This does not mean that these inventories have no value for some applications; rather, researchers using them should explain that they are using them, why they are using them, and what their limitations are.

Researchers should consider the following specific recommendations: First, if a study relies upon a gender-category ascription provided by someone else, as does Koppel et al. (2002), it should provide as much information as possible about how the category was ascribed and acknowledge the third-party category ascription as a limitation. This supports the goals of research validity, transparency, and accountability.

Second, if the researchers relied upon self-identified gender from a technology or social media platform, the study report should show that the researchers have reflected on the possibility that users of the platform have not identified their genders at all (where the platform does not require it), that users may intentionally misidentify their genders, that transgender users may be unable to identify themselves accurately (if the platform presents only a binary), or that they may have been insulted by the question (if the platform presents them with “male,” “female,” and “other,” for example). All these reflections address questions of validity, transparency, and accountability. The final two, however, also implicate the autonomy and respect for persons the Belmont Report calls for.

Third, if researchers use a heuristic or qualitative coding scheme to assess an author’s gender, it is critically important that readers be presented with a full description of the process. This includes providing a copy of the coding guide (the set of instructions that researchers use to assign category labels to persons or artifacts) and describing the process by which researchers checked their code ascriptions, including a measure of inter-rater reliability. Studies that use automated means to ascribe category labels should include copies of computer code used to make the ascriptions. This supports the goals of accountability, transparency, and validity.

Fourth, researchers who group gender labels collected from participant self-identification or use a heuristic to assign gender categories to participants or artifacts should consider “denaturalizing” the resulting category labels. This challenge is only likely to increase as sites like social media platforms recognize nuances in gender identity, as this section previously noted with regard to Facebook. For example, Larson (2016) asked participants to identify their own genders, giving them an open text box in which to do it. (See also the detailed discussion of methods in Larson (2017).) This permitted participants in the study to identify with any gender they chose, and respondents responded with eight different gender labels. Larson explained his grouping of the responses and chose to denaturalize the gender categories by not using their common names. The article thus grouped “F,” “Fem,” “Female,” and “female” together with the category label Gender F and “Cis Male,” “M,” “Male,” and “Masculine” with the label Gender M. Such disclosure or transparency supports the goals of accountability and fairness.

The steps described here would have strengthened already fine studies like those cited in the previous section. Of course, they would not insulate them from criticism. For example, Larson (2016) collected self-identified gender information and denaturalized the gender categories as explained above, but the result was nevertheless a gender binary consistent with that prevalent in the folk-theory of gender. The transparency of the study methods, however, provides a basis for critique; had it simply reported findings based on “male” and “female” participants, the reader would not even be able to identify this basis for critique.

5.4 Respect persons

One final recommendation is applicable to researchers and to practitioners who may have a role in deciding how to collect self-identified gender labels from participants. Here, the practitioner or researcher should take pains to recognize differences and difficulties that respondents may face in ascribing gender to themselves or to others. For example, assuming that one is collecting demographic information with an online survey, one might offer respondents two options for gender:
“male” and “female.” In contemporary western culture, however, it is not unusual to have respondents who do not easily identify with one gender or another or who actively refuse to be classed in a particular gender. Others are confidently transgender or intersex. Thus, two options may not be enough. However, the addition of an “other” might seem degrading or insulting to those who do not consider themselves to be “male” or “female.” Another option might be “none of the above,” but this again seems to function as an othering selection. There are so many ways that persons might choose to describe their genders that listing them might also be impractical, especially as the list itself might have reactive effects by drawing special attention to the gender question. Such effects might arise if the comprehensive nature of the list tips research participants off that gender is an object of study in the research. But even the “free-form” space discussed above presents difficulties for practitioners and researchers.

Grappling with this challenge, and in the case of researchers and practitioners disseminating their research, documenting that grappling, is the best way to ensure ethical outcomes.

5.5 Reviewers: Expect ethical practices

The way to ensure that researchers (and practitioners who disseminate their studies as research) conform to ethical principles is to make them accountable at the time of peer review. A challenge for researchers and peer reviewers alike, however, is space. A long paper for EACL is eight pages at the time of initial submission. A researcher may not feel able to report fully on a study’s background, data, methods, findings, and significance in that space and still have space to explain steps taken to ensure the use of the gender variable is ethical. At least two possible solutions come to mind.

First, researchers may make efforts to weave evidence of ethical study design and implementation into study write-ups. It may be possible with the addition of a small number of sentences to satisfy a peer reviewer that a researcher has followed the guidelines in this paper.

Second, a researcher could write up a supplemental description of the study addressing particularly these issues. The researcher could signal the presence of the supplemental description by noting its existence in the first draft submitted for peer review. If the paper is accepted, the supplemental description could be added to the paper before publication of the proceedings without adding excessive length to the paper. In the alternative, the supplemental description could be made available via a link to a web resource apart from the paper itself. ACL has provided for the submission of “supplementary material” at least at some of its conferences “to report preprocessing decisions, model parameters, and other details necessary for the replication of the experiments reported” (Association for Computational Linguistics, 2016). Other NLP conferences and technical reports should follow this lead. In any case, it may be helpful if the peer-review mechanisms for journals and conferences include a means for the researcher to attach the supplemental description, as its quality may influence the votes of some reviewers regarding the quality of the paper.

6 Conclusion

This paper represents only a starting point for treating the research variable gender in an ethical fashion. The guidelines for researchers and practitioners here are intended to be straightforward and simple. However, to engage in research or practice that measures up to high ethical standards, we should see ethics not as a checklist at the beginning or end of a study’s design and execution. Rather, we should view it as a process where we continually ask whether our actions respect human beings, deliver benefits and not harms, distribute potential benefits and harms fairly, and explain our research so that others may interrogate, test, and challenge its validity.

Other sets of social labels, such as race, ethnicity, and religion, raise similar ethical concerns, and researchers studying data including those categories should also consider the advice presented here.

Acknowledgments

Thanks to the anonymous reviewers for helpful guidance. This project received support from the University of Minnesota’s Writing Studies Department James I. Brown fellowship fund and its College of Liberal Arts Graduate Research Partnership Program.
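The following is a minimal, hypothetical Python sketch of the kind of explicit, reportable grouping of self-identified gender labels described in Section 5.3; the survey responses, the grouping, and the label names are invented for illustration and do not reproduce Larson's data or implementation.

```python
# Illustrative sketch only: hypothetical responses and label groupings,
# not Larson's actual data, code, or recommended categories.
from collections import Counter

# An explicit, publishable mapping from normalized free-text responses to
# "denaturalized" category labels (cf. Section 5.3). Keeping this table in
# the study report makes the ascription step transparent and contestable.
GROUPING = {
    "f": "Gender F",
    "fem": "Gender F",
    "female": "Gender F",
    "m": "Gender M",
    "male": "Gender M",
    "cis male": "Gender M",
    "masculine": "Gender M",
}


def ascribe_label(response: str) -> str:
    """Map one self-identified response to a category label.

    Responses outside the documented grouping are reported verbatim rather
    than being forced into a catch-all "other" bin.
    """
    normalized = response.strip().lower()
    return GROUPING.get(normalized, "Ungrouped: " + response.strip())


def summarize(responses):
    """Tabulate how the self-identified responses were grouped."""
    return Counter(ascribe_label(r) for r in responses)


if __name__ == "__main__":
    # Hypothetical answers collected from an open text box.
    survey = ["Female", "F", "M", "Cis Male", "genderqueer", "Fem", "male"]
    for label, count in summarize(survey).most_common():
        print(label, count)
```

Reporting the full mapping alongside the grouped counts, rather than only the final categories, is what allows reviewers and other researchers to interrogate and challenge the ascription step, as Sections 3 and 5 recommend.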

References

Larry Alexander and Michael Moore. 2016. Deontological ethics. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2016 edition.
Shlomo Argamon, Moshe Koppel, James W. Pennebaker, and Jonathan Schler. 2007. Mining the blogosphere: Age, gender and the varieties of self-expression. First Monday, 12(9).
Association for Computational Linguistics. 2016. Call for papers. The 55th Annual Meeting of the Association for Computational Linguistics | ACL Member Portal, November. Retrieved February 17, 2017, from https://www.aclweb.org/portal/content/55th-annual-meeting-association-computational-linguistics.
David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.
Karissa Bell. 2015. Facebook’s new gender options let you choose anything you want. Mashable. Retrieved January 1, 2017, from http://mashable.com/2015/02/26/facebooks-new-custom-gender-options/.
Belmont Report. 1979. The Belmont Report: Ethical principles and guidelines for the protection of human subjects of research. Retrieved January 24, 2017, from https://www.hhs.gov/ohrp/regulations-and-policy/belmont-report/index.html.
Sandra L. Bem. 1974. The measurement of psychological androgyny. Journal of Consulting and Clinical Psychology, 42(2):155–162.
Fredda Blanchard-Fields, Lynda Suhrer-Roussel, and Christopher Hertzog. 1994. A confirmatory factor analysis of the Bem Sex Role Inventory: Old questions, new answers. Sex Roles, 30(5-6):423–457.
Lee-Ann Kastman Breuch, Andrea M. Olson, and Andrea Frantz. 2002. Considering ethical issues in technical communication research. In Laura J. Gurak and Mary M. Lay, editors, Research in Technical Communication, pages 1–22. Praeger Publishers, Westport, CT.
John Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on Twitter. Technical report, MITRE Corporation, Bedford, MA.
Judith Butler. 1993. Bodies That Matter: On the Discursive Limits of “Sex”. Routledge, New York.
Common Rule. 2009. Protection of Human Subjects. 45 Code of Federal Regulations Part 46. Retrieved February 13, 2017, from https://www.hhs.gov/ohrp/regulations-and-policy/regulations/45-cfr-46/.
Victoria Pruin DeFrancisco, Catherine Helen Palczewski, and Danielle Dick McGeough. 2014. Gender in Communication: A Critical Introduction. Sage Publications, Thousand Oaks, CA, 2nd edition.
FAT-ML. n.d. Fairness, accountability, and transparency in machine learning. Retrieved January 23, 2017, from http://www.fatml.org/.
GLAAD. n.d. a. Gay and Lesbian Alliance Against Defamation. GLAAD media reference guide. In focus: Covering the transgender community. Retrieved January 23, 2017, from http://www.glaad.org/reference/covering-trans-community.
GLAAD. n.d. b. Gay and Lesbian Alliance Against Defamation. Glossary of terms: Transgender. Retrieved January 23, 2017, from http://www.glaad.org/reference/transgender.
Bryce W. Goodman. 2016. A step towards accountable algorithms? Algorithmic discrimination and the European Union general data protection. In 29th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona. NIPS Foundation.
Jurgen Habermas. 1995. Reconciliation Through the Public Use of Reason: Remarks on John Rawls’s Political Liberalism. The Journal of Philosophy, 92(3):109–131.
Moritz Hardt. 2014. How big data is unfair: Understanding sources of unfairness in data driven decision making, September. Retrieved January 23, 2017, from https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de#.jr0yrklo0.
Alicia Hennig. 2010. Confucianism as corporate ethics strategy. China Business and Research, 2010(5):1–7.
Susan C. Herring and John C. Paolillo. 2006. Gender and genre variation in weblogs. Journal of Sociolinguistics, 10(4):439–459.
Rosalind Hursthouse and Glen Pettigrove. 2016. Virtue ethics. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2016 edition.
Anna Janssen and Tamar Murachver. 2004. The relationship between gender and topic in gender-preferential language use. Written Communication, 21(4):344–367.
Moshe Koppel, Shlomo Argamon, and Anat Rachel Shimoni. 2002. Automatically categorizing written texts by author gender. Literary and Linguistic Computing, 17(4):401–412.
Brian N. Larson. 2016. Gender/genre: The lack of gendered register in texts requiring genre knowledge. Written Communication, 33(4):360–384.
Brian N. Larson. 2017. First-year law students’ court memoranda. Web download LDC2017T03, Linguistic Data Consortium, University of Pennsylvania, Philadelphia, February. http://catalog.ldc.upenn.edu/LDC2017T03.
Mary Sue MacNealy. 1998. Strategies for Empirical Research in Writing. Longman, Boston.
Plato. 2005. Meno. In Plato: Meno and Other Dialogues, pages 99–143. Oxford University Press, Oxford.
W. James Potter and Deborah Levine-Donnerstein. 1999. Rethinking validity and reliability in content analysis. Journal of Applied Communication Research, 27(3):258–284.
Delip Rao, David Yarowsky, Abhishek Shreevats, and Manaswi Gupta. 2010. Classifying latent user attributes in Twitter. In Proceedings of the 2nd International Workshop on Search and Mining User-generated Contents, pages 37–44, Toronto, ON, Canada, October. ACM.
Masoud Rouhizadeh, Lyle Ungar, Anneke Buffone, and Andrew H. Schwartz. 2016. Using syntactic and semantic context to explore psychodemographic differences in self-reference. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2054–2059. Association for Computational Linguistics.
Walter Sinnott-Armstrong. 2015. Consequentialism. In Edward N. Zalta, editor, The Stanford Encyclopedia of Philosophy. Metaphysics Research Lab, Stanford University, Winter 2015 edition.
Jean M. Twenge. 1997. Changes in masculine and feminine traits over time: A meta-analysis. Sex Roles, 36(5-6):305–325.
Hanna Wallach. 2014. Big data, machine learning, and the social sciences: Fairness, accountability, and transparency. Retrieved January 23, 2017, from https://medium.com/@hannawallach/big-data-machine-learning-and-the-social-sciences-927a8e20460d#.czusepxiz.
Yuan Wang, Yang Xiao, Chao Ma, and Zhen Xiao. 2016. Improving users’ demographic prediction via the videos they talk about. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1359–1368. Association for Computational Linguistics.
World Medical Association. 1964. Declaration of Helsinki: Ethical Principles for Medical Research Involving Human Subjects. World Medical Association, Ferney-Voltaire, France, October 2013 edition.
Xiang Yan and Ling Yan. 2006. Gender classification of weblog authors. In AAAI Spring Symposium: Computational Approaches to Analyzing Weblogs, pages 228–230, Palo Alto, CA, March. Association for the Advancement of Artificial Intelligence.

These are not the Stereotypes You are Looking For: Bias and Fairness in Authorial Gender Attribution

Corina Koolen, Institute for Logic, Language and Computation, University of Amsterdam, [email protected]
Andreas van Cranenburgh, Institut für Sprache und Information, Heinrich Heine University Düsseldorf, [email protected]

Abstract

Stylometric and text categorization results show that author gender can be discerned in texts with relatively high accuracy. However, it is difficult to explain what gives rise to these results and there are many possible confounding factors, such as the domain, genre, and target audience of a text. More fundamentally, such classification efforts risk invoking stereotyping and essentialism. We explore this issue in two datasets of Dutch literary novels, using commonly used descriptive (LIWC, topic modeling) and predictive (machine learning) methods. Our results show the importance of controlling for variables in the corpus and we argue for taking care not to overgeneralize from the results.

1 Introduction

Women write more about emotions, men use more numbers (Newman et al., 2008). Conclusions such as these, based on Natural Language Processing (NLP) research into gender, are not just compelling to a general audience (Cameron, 1996); they are specific and seem objective, and hence are published regularly.

The ethical problem with this type of research, however, is that stressing difference—where there is often considerable overlap—comes with the tendency of enlarging the perceived gap between female and male authors, especially when results are interpreted using gender stereotypes. Moreover, many researchers are not aware of possible confounding variables related to gender, resulting in well-intentioned but unsound research.

But, rather than suggesting not performing research into gender at all, we look into practical solutions to conduct it more soundly.¹ The reason we do not propose to abandon gender analysis in NLP altogether is that female-male differences are quite striking when it comes to cultural production. We focus on literary fiction. Female authors still remain back-benched when it comes to gaining literary prestige: novels by women are still much less likely to be reviewed, or to win a literary award (Berkers et al., 2014; Verboord, 2012). Moreover, literary works by female authors are readily compared to popular bestselling genres typically written by and for women, referred to as ‘women’s novels,’ whereas literary works by male authors are rarely gender-labeled or associated with popular genres (Groos, 2011). If we want to do research into the gender gap in cultural production, we need to investigate the role of author gender in texts without overgeneralizing to effects more properly explained by text-extrinsic perceptions of gender and literary quality.

In other words, NLP research can be very useful in revealing the mechanisms behind the differences, but for that to be possible, researchers need to be aware of the issues and learn how to avoid essentialistic explanations. Thus, our question is: how can we use NLP tools to research the relationship between gender and text meaningfully, without resorting to stereotyping or essentialism?

Analysis of gender with NLP has roughly two methodological strands, the first descriptive and the second predictive. The first, descriptive strand is technically the least complex. The researcher divides a set of texts into two parts, half written by female and half by male authors, processes these with the same computational tool(s), and tries to explain the

¹ We are not looking to challenge the use of gender as a binary construct in this paper, although this is a position that can be argued as well. Butler (2011) has shown how gender is not simply a biological given, nor a valid dichotomy. We recognize that computational methods may encourage this dichotomy further, but we shall focus on practical steps.

12 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 12–22, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics observed differences. Examples are Jockers (2013, research into gender. pp. 118–153) and Hoover (2013). Olsen (2005) cleverly reinterprets Cixous’ notion of ecriture´ 2.1 Preemptive categorization feminine´ to validate an examination of female au- Admittedly, categorization is hard to do without. thors separately from male authors (Cixous et al., We use it to make sense of the world around us. It 1976). is necessary to function properly, for instance to The second, at a first glance more neutral strand be able to distinguish a police officer from other of automated gender division, is to use predictive persons. Gender is not an unproblematic category methods such as text categorization: training a ma- however, for a number of reasons. chine learning model to automatically recognize First, feminists have argued that although many texts written by either women or men, and to mea- people fit into the categories female and male, there sure the success of its predictions (e.g., Koppel are more than two sexes (Bing and Bergvall, 1996, et al., 2002; Argamon et al., 2009). Johannsen p. 2). Our having to decide how to categorize the et al. (2015) combines descriptive and predictive novel by the transgender male in our corpus pub- approaches and mines a dataset for distinctive fea- lished before his transition is a case in point (we tures with respect to gender. We will apply both opted for male). descriptive and predictive methods as well. Second, it is problematic because gender is such The rest of this paper is structured as follows. a powerful categorization. Gender is the primary Section 2 discusses two theoretical issues that characteristic that people use for classification, over should be considered before starting NLP research others like race, age and occupational role, re- into gender: preemptive categorization, and the gardless of actual importance (Rudman and Glick, semblance of objectivity. These two theoretical 2012, p. 84). Baker (2014) analyzes research that issues are related to two potential practical pitfalls, finds gender differences in the spoken section of the the ones which we hope to remedy with these pa- British National Corpus (BNC), which indicates per: dataset bias and interpretation bias (Section 3). gender differences are quite prominent. However, In short, if researchers choose to do research into the context also turned out to be different: women gender (a) they should be much more rigorous in were more likely to have been recorded at home, selecting their dataset, i.e., confounding variables men at work (p. 30). Only when one assumes that need to be given more attention when constructing gender causes the contextual difference, can we a dataset; and (b) they need to avoid potential in- attribute the differences to gender. There is no di- terpretative pitfalls: essentialism and stereotyping. rect causation, however. Because of the saliency Lastly, we provide computational evidence for our of the category of gender, this ‘in-between step’ of argument, and give handles on how to deal with causation is not always noticed. Cameron (1996) the practical issues, based on a corpus of Dutch, altogether challenges the “notion of gender as a literary novels (Sections 4 through 6). 
pre-existing demographic correlate which accounts Note that none of the gender-related issues we for behavior, rather than as something that requires argue are new, nor is the focus on computational explanation in its own right” (p. 42). analysis (see Baker, 2014). What is novel, how- This does not mean that gender differences do ever, is the practical application onto contemporary not exist or that we should not research them. But, fiction. We want to show how fairly simple, com- as Bing and Bergvall (1996) point out: “The issue, monly used computational tools can be applied in of course, is not difference, but oversimplification a way that avoids bias and promotes fairness—in and stereotyping” (p. 15). Stereotypes can only be this case with respect to gender, but note that the built after categorization has taken place at all (Rud- method is relevant to other categorizations as well. man and Glick, 2012). This means that the method 2 Theoretical issues of classification itself inherently comes with the potential pitfall of stereotyping. Gender research in NLP gives rise to several eth- Although the differences found in a divided cor- ical questions, as argued in for instance Bing and pus are not necessarily meaningful, nor always re- Bergvall (1996) and Nguyen et al. (2016). We dis- producible with other datasets, an ‘intuitive’ ex- cuss two theoretical issues here, which researchers planation is a trap easily fallen into: rather than need to consider carefully before performing NLP being restricted to the particular dataset, results can

13 be unjustly ascribed to supposedly innate qualities ity items (Argamon et al., 2009), and verbs stands of all members of that gender, and extrapolated to out (Johannsen et al., 2015; Newman et al., 2008). all members of the gender in trying to motivate a What these differences mean, or why they are im- result. This type of bias is called essentialism (All- portant for literary analysis (other than a functional port, 1979; Gelman, 2003). benefit), is not generally made sufficiently evident. Rudman and Glick (2012) argue that stereotypes But while evaluations of out-of-sample predic- (which are founded on essentialism) cause harm tions provide an objective measure of success, the because they can be used to unfairly discriminate technique is ultimately not any more neutral than against individuals—even if they are accurate on the descriptive method, with its preemptive group average differences (p. 95). selection. Even though the algorithm automatically On top of that, ideas on how members of each finds gender differences, the fact remains that the gender act do not remain descriptive, but become researcher selects the gender as two groups to train prescriptive. This means that based on certain dif- for, and the predictive success says nothing about ferences, social norms form on how members of a the merits (e.g., explanatory value) of this division. certain gender should act, and these are then rein- In other words, it starts with the same premise as forced, with punishment for deviation. As Baker the descriptive method, and thus needs to keep the (2014) notes: “The gender differences paradigm same ethical issues in mind. creates expectations that people should speak at the linguistic extremes of their sex in order to be seen 3 Practical concerns as normal and/or acceptable, and thus it problema- Although the two theoretical issues are unavoid- tizes people who do not conform, creating in- and able, there are two practical issues inextricably out-groups.” (p. 42) linked to them, dataset and interpretation bias, Thus, although categorization in itself can appear which the researcher should strive to address. unproblematic, actively choosing to apply it has the potential pitfall of reinforcing essentialistic ideas 3.1 Dataset bias on gender and enlarging stereotypes. This is of Strictly speaking, a corpus is supposed to represent course not unique to NLP, but the lure of making a statistically representative sample, and the con- sweeping claims with big data, coupled with NLP’s clusions from experiments with corpora are only semblance of objectivity, makes it a particularly valid insofar as this assumption is met. In gender pressing topic for the discipline. research, this assumptions is too often violated, as potential confounding factors are not accounted for, 2.2 Semblance of objectivity exacerbating the ethical issues discussed. An issue which applies to NLP techniques in gen- For example, Johannsen et al. (2015) work with eral, but particularly to machine learning, is the a corpus of online reviews divided by gender and semblance of neutrality and objectivity (see Rieder age. However, reflected in the dataset is the types and Rohle,¨ 2012). Machine learning models can of products that men and women tend to review make predictions on unseen texts, and this shows (e.g., cars vs. makeup). 
They argue that their use of that one can indeed automatically identify differ- abstract syntactic features may overcome this do- ences between male and female authors, which are main bias, but this argument is not very convincing. relatively consistent over multiple text types and For example, the use of measurement phrases as a domains. Note first that the outcome of these ma- distinctive feature for men can also be explained by chine learning classifiers are different from what its higher relevance in automotive products versus many general readers expect: the nature of these makeup, instead of as a gender marker. differences is often stylistic, rather than content- Argamon et al. (2009) carefully select texts by related (e.g., Flekova et al. 2016; Janssen and Mu- men and women from the same domain, French lit- rachver 2005, pp. 211–212). For men they in- erature, which overcomes this problem. However, clude a higher proportion of determiners, numer- since the corpus is largely based on nineteenth cen- ical quantifiers (Argamon et al., 2009; Johannsen tury texts, any conclusions are strongly influenced et al., 2015), and overall verbosity (longer sen- by literary and gender norms from this time period tences and texts; Newman et al. 2008). For women (which evidently differ from contemporary norms). a higher use of personal pronouns, negative polar- Koppel et al. (2002) compose a corpus from the

14 BNC, which has more recent texts from the 1970s, tween men and women are innate and/or socially and includes genre classifications which together constructed, such interpretations are not only un- with gender are balanced in the resulting corpus. sound, they promote the separation of female and Lastly, Sarawgi et al. (2011) present a study that male authors in literary judgments. But it can be carefully and systematically controls for topic and done differently. A good example of research based genre bias. They show that in cross-domain tasks, on careful gender-related analysis is Muzny et al. the performance of gender attribution decreases, (2016) who consider gender as performative lan- and investigate the different characteristics of lex- guage use in its dialogue and social context. ical, syntactic, and character-based features; the Dataset and interpretation bias are quite hard to latter prove to be most robust. avoid with this type of research, because of the On the surface the latter two seem to be a rea- theoretical issues discussed in Section 2. We now sonable approach of controlling variables where provide two experiments that show why it is so possible. One remaining issue is the potential for important to try to avoid these biases, and provide publication bias: if for whatever reason women are first steps as to how this can be done. less likely to be published, it will be reflected in this corpus without being obvious (a hidden variable). 4 Data In sum, controlling for author characteristics To support our argument, we analyze two datasets. should not be neglected. Moreover, it is often not The first is the corpus of the Riddle of Literary clear from the datasets whether text variables are Quality: 401 Dutch-language (original and trans- sufficiently controlled for either, such as period, lated) novels published between 2007–2012, that text type, or genre. Freed (1996) has shown that re- were bestsellers or most often lent from libraries in searchers too easily attribute differences to gender, the period 2009–2012 (henceforth: Riddle corpus). when in fact other intersecting variables are at play. It consists mostly of suspense novels (46.4 %) and We argue that there is still much to gain in the con- general fiction (36.9 %), with smaller portions of sideration of author and text type characteristics, romantic novels (10.2 %) and other genres (fantasy, but we focus on the latter here. Even within the horror, etc.; 6.5 %). It contains about the same text type of fictional novels, in a very restricted pe- amount of female authors (48.9 %) as male authors riod of time, as we shall show, there is a variety of (47.6 %) and 3.5 % of unknown gender, or duo’s of subgenres that each have their own characteristics, mixed gender. In the genre of general fiction how- which might erroneously be attributed to gender. ever (where the literary works are situated), there are more originally Dutch works by male authors, 3.2 Interpretation bias and more translated work by female authors. The acceptance of gender as a cause of difference The second corpus (henceforth: Nominee cor- is not uncommon in computational research (cf. pus) was compiled because of this skewness; there Section 1). Supporting research beyond the chosen are few Dutch female literary authors in the Riddle dataset is not always sought, because the align- corpus. 
It is a set of 50 novels that were nomi- ment of results with ‘common knowledge’ (which nated for one of the two most well-known literary is generally based on stereotypes) is seen as suffi- prizes in the Netherlands, the AKO Literatuurprijs cient, when in fact this is more aptly described as (currently called ECI Literatuurprijs) and the Lib- researcher’s bias. Conversely, it is also problematic ris Literatuur Prijs, in the period 2007-2012, but when counterintuitive results are labeled as deviant which were not part of the Riddle corpus. Variables and inexplicable (e.g., in Hoover, 2013). This is controlled for are gender (24 female, 25 male, 1 a form of cherry picking. Another subtle exam- transgender male who was then still known as a ple of this is the choice of visualization in Jock- female), country of origin (Belgium and the Nether- ers and Mimno (2013) to illustrate a topic model. lands), and whether the novel won a prize or not (2 They choose to visualize only gender-stereotypical within each gender group). The corpus is relatively topics, even though they make up a small part of small, because the percentage of female nominees the results, as they do note carefully (Jockers and was small (26.2 %). Mimno, 2013, p. 762). Still, this draws attention to 5 Experiments with LIWC the stereotype-confirming topics. Regardless of the issue whether differences be- Newman et al. (2008) relate a descriptive method

of extracting gender differences, using Linguistic Inquiry and Word Count (LIWC; Pennebaker et al., 2007). LIWC is a text analysis tool typically used for sentiment mining. It collects word frequencies based on word lists and calculates the relative frequency per word list in given texts. The word lists, or categories, are of different orders: psychological, linguistic, and personal concerns; see Table 1. LIWC and other word list based methods have been applied to research of fiction (e.g., Nichols et al., 2014; Mohammad, 2011). We use a validated Dutch translation of LIWC (Zijlstra et al., 2005).

5.1 Riddle corpus

We apply LIWC to the Riddle corpus, where we compare the corpus along author gender lines. We also zoom in on the two biggest genres in the corpus, general fiction and suspense. When we compare the results of novels by male authors versus those by female authors, we find that 48 of 66 LIWC categories differ significantly (p < 0.01), after a Benjamini-Hochberg False Discovery Rate correction. In addition to significance tests, we report Cohen's d effect size (Cohen, 1988). An effect size |d| > 0.2 can be considered non-negligible.

The results coincide with gender stereotypical notions. Gender stereotypes can relate to several attributes: physical characteristics, preferences and interest, social roles and occupations; but psychological research generally focuses on personality. Personality traits related to agency and power are often attributed to men, and nurturing and empathy to women (Rudman and Glick, 2012, pp. 85–86). The results in Table 1 were selected from the categories with the largest effect sizes. These stereotype-affirming effects remain when only a subset of the corpus with general fiction and suspense novels is considered.

In other words, quite some gender stereotype-confirming differences appear to be genre independent here, plus there are some characteristics that were also identified by the machine learning experiments mentioned in section 2.2. Novels by female authors for instance score significantly higher overall and within genre in Affect, Pronoun, Home, Body and Social; whereas novels by male authors score significantly higher on Articles, Prepositions, Numbers, and Occupation.

The only result here that counters stereotypes is the higher score for female authors on Cognitive Processes, which describes thought processes and has been claimed to be a marker of science fiction—as opposed to fantasy and mystery—because “reasoned decision-making is constitutive of the resolution of typical forms of conflict in science fiction” (Nichols et al., 2014, p. 30). Arguably, reasoned decision-making is stereotypically associated with the male gender.

It is quite possible to leave the results at that, and attempt an explanation. The differences are not just found in the overall corpus, where a reasonable amount of romantic novels (approximately 10 %, almost exclusively by female authors) could be seen as the cause for a gender stereotypical outcome. The results are also found within the traditionally ‘male’ genre of suspense (although half of the suspense authors are female in this corpus), and within the genre of general fiction.

Nonetheless, there are some elements to the corpus that were not considered. The most important factor not taken into account is whether the novel has been originally written in Dutch or whether it is a translation. As noted, the general fiction category is skewed along gender lines: there are very few originally Dutch female authors.

Another, more easily overlooked factor is the existence of subgenres which might skew the outcome. Suspense and general fiction are categories that are already considerably more specific than the ‘genres’ (what we would call text-types) researched in the previously mentioned studies, such as fiction versus non-fiction. For instance, there is a typical subgenre in Dutch suspense novels, the so-called ‘literary thriller’, which has a very specific content and style (Jautze, 2013). The gender of the author—female—is part of its signature.

Readership might play a role in this as well. The percentage of readers for female and male authors, taken from the Dutch 2013 National Reader Survey (approximately 14,000 respondents), shows how gendered the division of readers is.

[Figure 1: Kernel density estimation of the percentage of male readers with respect to author gender; x-axis: % male readers, y-axis: density (novels), with separate curves for male and female authors.]
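To make the procedure of Section 5.1 concrete, the following minimal sketch shows word-list scoring and the group comparison (Welch's t-test, Benjamini-Hochberg correction, Cohen's d) in Python. It is an illustration only, not the authors' implementation (their code is linked in Section 7); the word lists are hypothetical stand-ins and LIWC's wildcard entries are omitted.

```python
from collections import Counter

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

# Hypothetical stand-ins for (Dutch) LIWC word lists; the real tool also
# supports wildcard/prefix entries, which are omitted here.
CATEGORIES = {
    "Affect": {"blij", "lief", "huilen"},
    "Numbers": {"twee", "drie", "honderd"},
}


def category_percentages(tokens, categories=CATEGORIES):
    """Relative frequency (in %) of each word-list category in one novel."""
    counts = Counter(tokens)
    total = max(sum(counts.values()), 1)
    return {name: 100.0 * sum(counts[w] for w in words) / total
            for name, words in categories.items()}


def compare_groups(scores_female, scores_male, alpha=0.01):
    """Welch t-test per category, BH-FDR correction, and Cohen's d.

    The arguments are lists of per-novel dicts as returned by
    category_percentages().
    """
    cats = list(CATEGORIES)
    results, pvalues = {}, []
    for cat in cats:
        f = np.array([s[cat] for s in scores_female])
        m = np.array([s[cat] for s in scores_male])
        _, p = ttest_ind(f, m, equal_var=False)
        # Cohen's d with a simple pooled SD (adequate for similar group sizes).
        pooled_sd = np.sqrt((f.var(ddof=1) + m.var(ddof=1)) / 2)
        results[cat] = {"d": (f.mean() - m.mean()) / pooled_sd}
        pvalues.append(p)
    reject, p_adj, _, _ = multipletests(pvalues, alpha=alpha, method="fdr_bh")
    for cat, rej, p in zip(cats, reject, p_adj):
        results[cat].update(p_adj=p, significant=bool(rej))
    return results
```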

Table 1: A selection of LIWC categories with results on the Riddle corpus. The indented categories are subcategories forming a subset of the preceding category. * indicates a significant result.

LIWC category          Examples                     Female mean (SD)   Male mean (SD)   Effect size (d)   Sign.
Linguistic
Prepositions           to, with, above              11.38 (0.86)       11.92 (0.86)     -0.63             *
Pronouns               I, them, itself              12.58 (1.90)       10.14 (2.10)      1.22             *
Negations              no, not, never                2.02 (0.31)        1.78 (0.35)      0.74             *
Article                a, an, the                    8.48 (1.08)        9.71 (1.19)     -1.08             *
Numbers                                              0.61 (0.15)        0.79 (0.25)     -0.86             *
Psychological
Social                 mate, talk, they, child      10.81 (2.00)        9.54 (1.73)      0.68             *
  Friends              buddy, friend, neighbor       0.10 (0.04)        0.09 (0.04)      0.23
  Humans                                             0.43 (0.16)        0.41 (0.15)      0.11
Affect                 happy, cried, abandon         2.84 (0.49)        2.35 (0.38)      1.12             *
  Positive emotions    love, nice, sweet             1.38 (0.34)        1.13 (0.23)      0.86             *
Cognitive processes    cause, know, ought            5.51 (0.67)        5.03 (0.72)      0.69             *
Occupation             work, class, boss             0.54 (0.15)        0.67 (0.20)     -0.75             *
Current concerns
Home                   apartment, kitchen, family    0.42 (0.13)        0.34 (0.14)      0.57             *
Money                  cash, taxes, income           0.20 (0.10)        0.21 (0.10)     -0.12
Body                   ache, breast, sleep           1.30 (0.41)        1.06 (0.33)      0.63             *

This distribution is visualized in Figure 1, which is a Kernel Density Estimation (KDE). A KDE can be seen as a continuous (smoothed) variant of a histogram, in which the x-axis shows the variable of interest, and the y-axis indicates how common instances are for a given value on the x-axis. In this case, the graph indicates the number of novels read by a given proportion of male versus female readers. Male readers barely read the female authors in our corpus; female readers read both genders; there is a selection of novels which is only read by female readers. Hence, the gender of the target reader group differs per genre as well, and this is another possible influence on author style.

In sum, there is no telling whether we are looking purely at author gender, or also at translation and/or subgenre, or even at productions of gendered perceptions of genre.

5.2 Comparison with Nominees corpus

We now consider a corpus of novels that were nominated for the two most well-known literary awards in the Netherlands, the AKO Literatuurprijs and the Libris Literatuur Prijs. This corpus has fewer confounding variables, as these novels were all originally written in Dutch, and are all of the same genre. They are fewer, however, fifty in total. We hypothesize that there are few differences in LIWC scores between the novels by the female and male authors, as they have been nominated for a literary award, and will not be marked as overtly by a genre. All of them have passed the bar of literary quality—and few female authors have made the cut in this period of time to begin with;² thus, we contend, they will be more similar to the male authors in this corpus than in the Riddle corpus containing bestsellers.

However, here we run into the problem that significance tests on this corpus of different size would not be comparable to those on the previous corpus; for example, due to the smaller size, there will be a lower chance of finding a significant effect (and indeed, repeating the procedure of the previous section yields no significant results for this corpus). Moreover, comparing only means is of limited utility. Inspection does reveal that five effect sizes increase: Negations, Positive emotions, Cognitive processes, Friends, and Money; all relate more strongly to female authors. Other effect sizes decrease, mostly mildly.

In light of these problems with the t-test in analyzing LIWC scores, we offer an alternative. In interpretation, the first step is to note the strengths and weaknesses of the method applied. The largest problem with comparing LIWC scores between two groups with a t-test is that it only tests means: the mean score for female authors versus the mean score for male authors in our research. A t-test to compare means is restricted to examining the groups as a whole, which, as we argued, is unsound to begin with.

² Note that female authors not being nominated for literary prizes does not say anything about the relationship between gender and literary quality. Perhaps female authors are overlooked, or they write materials of lesser literary quality, or they are simply judged this way because men have set the standard and the standard is biased towards ‘male’ qualities.
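Figure 1 above, and Figure 2 in the next section, are kernel density estimates of exactly such per-novel scores. A minimal sketch of this kind of plot with standard scientific-Python tooling follows; the function and argument names are illustrative, not the authors' plotting code.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import gaussian_kde


def plot_category_kde(scores_female, scores_male, category, ax=None):
    """Overlay smoothed score distributions of one category for both groups."""
    ax = ax or plt.gca()
    values = np.concatenate([scores_female, scores_male])
    grid = np.linspace(values.min(), values.max(), 200)
    for label, scores in (("female authors", scores_female),
                          ("male authors", scores_male)):
        ax.plot(grid, gaussian_kde(scores)(grid), label=label)
    ax.set_xlabel(f"% words: {category}")
    ax.set_ylabel("density (novels)")
    ax.legend()
    return ax
```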

That is why we only use it as a means to an end. A KDE plot of scores on each category gives better insight into the distribution and differences across the novels; see Figure 2.

[Figure 2: Kernel density estimation of four LIWC categories (Occupation, Anger, Cognitive processes, Body) across the novels of the Riddle (left) and Nominees (right) corpus; x-axes: % words in the category, y-axes: density (texts), with separate curves for male and female authors.]

Occupation and Anger are two categories of which the difference in means largely disappears with the Nominees corpus, showing an effect size of d < 0.1. The plots demonstrate nicely how the overlap has become near perfect with the Nominees corpus, indicating that subgenre and/or translation might have indeed been factors that caused the difference in the Riddle corpus. Cognitive processes (Cogmech) is a category which increases in effect size with the Nominees corpus. We see that the overlap with female and male authors is large, but that a small portion of male authors uses the words in this category less often than other authors and a small portion of the female authors uses it more often than other authors.

While the category Body was found to have a significant difference with the Riddle corpus, in the KDE plot it looks remarkably similar, while in the Nominees corpus there is a difference not in mean but in variance. It appears that on the one hand, there are quite some male authors who use the words less often than female authors, and on the other, there is a similar-sized group of male authors who—and this counters stereotypical explanations—use the words more often than female authors. The individual differences between authors appear to be more salient than differences between the means; contrary to what the means indicate, Body apparently is a category and topic worth looking into. This shows how careful one must be in comparing means of groups within a corpus, with respect to (author) gender or otherwise.

6 Machine Learning Experiments

In order to confirm the results in the previous section, we now apply machine learning methods that have proved most successful in previous work. Since we want to compare the two corpora, we opt for training and fitting the models on the Riddle corpus, and applying those models to both corpora.

6.1 Predictive: Classification

We replicate the setup of Argamon et al. (2009), which is to use frequencies of lemmas to train a support vector classifier. We restrict the features to the 60 % most common lemmas in the corpus and transform their counts to relative frequencies (i.e., a bag-of-words model; BoW). Because of the robust results reported with character n-grams in Sarawgi et al. (2011), we also run the experiment with character trigrams, in this case without a restriction on the features. We train on the Riddle corpus, and evaluate on both the Riddle corpus and the Nominees corpus; for the former we use 5-fold cross-validation to ensure an out-of-sample evaluation. We leave out authors of unknown or multiple genders, since this class is too small to learn from.

See Table 2 for the results; Table 4 shows the confusion matrix with the number of correct and incorrect classifications.

Table 2: Gender classification scores (F1) on the Riddle corpus (above) and the Nominees corpus (below).

Riddle         BoW     char3grams   support
female         83.7    80.8         196
male           82.1    79.9         191
avg / total    82.9    80.4         387

Nominees       BoW     char3grams   support
female         63.2    57.9          24
male           77.4    74.2          26
avg / total    70.6    66.4          50
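The classification setup of Section 6.1 can be sketched with scikit-learn as follows. This is an approximation, not the authors' code: max_features stands in for the restriction to the most common lemmas, and the texts/labels variables holding the lemmatized novels and author genders are assumed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import cross_val_predict
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def gender_classifier(char_ngrams=False):
    """Linear SVM over relative frequencies of lemmas or character trigrams."""
    if char_ngrams:
        vectorizer = TfidfVectorizer(analyzer="char", ngram_range=(3, 3),
                                     use_idf=False, norm="l1")
    else:
        # use_idf=False with an l1 norm yields relative frequencies (BoW);
        # max_features roughly stands in for "the most common lemmas".
        vectorizer = TfidfVectorizer(analyzer="word", max_features=5000,
                                     use_idf=False, norm="l1")
    return make_pipeline(vectorizer, LinearSVC())


# Usage with the (assumed) lemmatized texts and gender labels of the corpus:
#   predictions = cross_val_predict(gender_classifier(), texts, labels, cv=5)
#   print(classification_report(labels, predictions, digits=3))
```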

18 female: toespraak, engel, energie, champagne, gehoorzaam, grendel, drug, tante, echtgenoot, vleug

speech (noun), angel, energy, champagne, docile, lock, drug, aunt, spouse, tad

male: wee, datzelfde, hollen, conversatie, plak, kruimel, strijken, gelijk, inpakken, ondergaan

woe, same, run, conversation, slice, crumble, iron (verb), right/just, pack, undergo

Table 3: A sample of 10 distinctive, mid-frequency features.
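Feature samples like Table 3 can be obtained by inspecting the weights of the fitted linear classifier (the authors additionally filter out the most frequent words and sort by the differences on the Nominees corpus, as described in the text below). A generic sketch, assuming the pipeline from the previous listing has been fitted:

```python
import numpy as np


def top_features(fitted_pipeline, n=10):
    """Strongest negative and positive features of a vectorizer + LinearSVC
    pipeline; which sign corresponds to which gender depends on label order."""
    vectorizer, svm = fitted_pipeline[0], fitted_pipeline[-1]
    names = np.asarray(vectorizer.get_feature_names_out())
    weights = svm.coef_.ravel()
    order = np.argsort(weights)
    return {"class_0": names[order[:n]].tolist(),
            "class_1": names[order[::-1][:n]].tolist()}


# Example (hypothetical variable names):
#   clf = gender_classifier().fit(texts, labels)
#   print(top_features(clf))
```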

Table 4: Confusion matrices for the SVM results with BoW. The diagonal indicates the number of correctly classified texts. The rows show the true labels, while the columns show the predictions.

Riddle      female   male
female      170      26
male        40       151

Nominees    female   male
female      12       12
male        2        24

[Figure 3: Comparison of mean topic weights w.r.t. gender and corpus (Riddle/Nominees, female/male), showing the largest (above) and smallest (below) male-female differences; x-axis: topic score. Largest differences: t1: self-development (life, time, human, moment, stay); t48: dialogues/colloquial language (simply, totally, to-tell, time, of-course); t23: settling down (life, house, child, woman, year); t37: military (soldier, lieutenant, army, two, to-get); t2: family (father, mother, child, year, son). Smallest differences: t8: quotation/communication (madam, grandma, old, to-tell, to-hear); t44: looks & parties (woman, glass, dress, nice, to-look); t46: author: Kinsella/Wickham (mom, suddenly, just, to-get, to-feel); t14: non-verbal communication (man, few, to-get, time, to-nod); t9: house (house, wall, old, to-lie, to-hang).]

As in the previous section, it appears that gender differences are less pronounced in the Nominees corpus, shown by the substantial difference of almost 10 F1 percentage points. We also see the effect of a different training and test corpus: the classifier reveals a bias for attributing texts to male authors with the Nominees corpus, shown by the distribution of misclassifications in Table 4. On the one hand, the success can be explained by similarities of the corpora; on the other, the male bias reveals that the model is also affected by particularities of the training corpus. Sarawgi et al. (2011) show that with actual cross-domain classification, performance drops more significantly.

A linear model³ is in principle straightforward to interpret: features make either a positive or a negative contribution to the final prediction. However, due to the fact that thousands of features are involved, and words may be difficult to interpret without context, looking at the features with the highest weight may not give much insight; the tail may be so long that the sign of the prediction still flips multiple times after the contribution of the top 20 features has been taken into account.

³ Other models such as decision trees are even more amenable to interpretation. However, in the context of text categorization, bag-of-word models with large numbers of features work best, which do not work well in combination with decision trees.

Indeed, looking at the features with the highest weight does not show a clear picture: the top 20 consists mostly of pronouns and other function words. We have tried to overcome this by filtering out the most frequent words and sorting words with the largest difference in the Nominees corpus (which helps to focus on the differences that remain in the corpus other than the one on which the model has been trained). As an indication of the sort of differences the classifier exploits, Table 3 shows a selection of features; the results cannot be easily aligned with stereotypes, and it remains difficult to explain the success of the classifier from as small a sample as this. We now turn to a different model to analyze the differences between the two corpora in terms of gender.

6.2 Descriptive: Topic Model

We use a topic model of the Riddle corpus presented in Jautze et al. (2016) to infer topic weights for both corpora. This model of 50 topics was derived with Latent Dirichlet Allocation (LDA), based on a lemmatized version of the Riddle corpus without function words or punctuation, divided into chunks of 1000 tokens. We compare the topic weights with respect to gender by taking the mean topic weights of the texts of each gender. From the list of 50 topics we show the top 5 with both

19 the largest and the smallest (absolute) difference However, we do want to keep doing textual anal- between the genders (with respect to the Nominees ysis into gender, as we argued we should, in order corpus);4 see Figure 3. Note that the topic labels to analyze gender bias in cultural production. The were assigned by hand, and other interpretations of good news is that we can take practical steps to the topic keys are possible. minimize their effect. We show that we can do The largest differences contain topics that con- this by taking care to avoid two practical problems firm stereotypes: military (male) and settling down that are intertwined with the two theoretical issues: (female). This is not unexpected: the choice to ex- dataset bias and interpretation bias. amine the largest differences ensures these are the Dataset bias can be avoided by controlling for extreme ends of female-male differences.5 How- more variables than is generally done. We argue ever, the topics that are most similar for the gen- that apart from author variables (which we have ders in the Nominees corpus contain stereotype- chosen not to focus on in this paper, but which confirming topics as well—i.e., they both score should be taken into account), text variables should similarly low on ‘looks and parties.’ be applied more restrictively. Fiction, even, is too Finally, the large difference on dialogue and col- broad as a genre; subgenres as specific as ‘literary loquial language shows that speech representation thriller’ can become confounding factors as well, forms a fruitful hypothesis for explaining at least as we have shown in our set of Dutch bestsellers, part of the gender differences. both in the experiments with LIWC as well as the machine learning experiments. 7 Discussion and Conclusion Interpretation bias stems from considering fe- male and male authors as groups that can be re- Gender is not a self-explanatory variable. In this lied upon and taken for granted. We have shown paper, we have used fairly simple, commonly ap- with visualizations that statistically significant dif- plied Natural Language Processing (NLP) tech- ferences between genders can be caused by out- niques to demonstrate how a seemingly ‘neutral’ liers on each end of the spectrum, even though corpus—one that consists of only one text-type, the gender overlap is large on the one hand; and fiction, and with a balanced number of male and that possibly interesting within-group differences female authors—can easily be used to produce become confounded by solely using means over stereotype-affirming results, while in fact (at least) gender groups on the other hand, missing differ- two other variables were not controlled for prop- ences that might be interesting. Taking these extra erly. Researchers need to be much more careful in visualization steps makes for a better basis for anal- selecting their data and interpreting results when ysis that does right by authors, no matter of which performing NLP research into gender, to minimize gender they are. the ethical issues discussed. This work has focused on standard explanatory From an ethics point of view, care should be and predictive text analysis tools. 
Recent devel- taken with NLP research into gender, due to the un- opments with more advanced techniques, in par- avoidable ethical-theoretical issues we discussed: ticular word embeddings, appear to allow gender (1) Preemptive categorization: dividing a dataset in prejudice in word associations to be isolated, and two preemptively invites essentialist or even stereo- even eliminated (Schmidt, 2015; Bolukbasi et al., typing explanations; (2) The semblance of objec- 2016; Caliskan-Islam et al., 2016); applying these tivity: because a computer algorithm calculates methods to literature is an interesting avenue for differences between genders, this lends a sense of future work. objectivity; we are inclined to forget that the re- The code and results for this paper are avail- searcher has chosen to look or train for these two able as a notebook at https://github.com/ categories of female and male. andreasvc/ethnlpgender 4By comparing absolute differences in topic weights, rarer topics with small but nevertheless consistent differences may Acknowledgments be overlooked; using relative differences would remove this bias, but introduces the risk of giving too much weight to rarer topics. We choose the former to focus on the more prominent We thank the six (!) reviewers for their insightful and representative topics. and valuable comments. 5Note that the topics were derived from the Riddle corpus, which contains romance and spy novels.

20 References Lyle Ungar, and Daniel Preot¸iuc-Pietro. 2016. An- alyzing biases in human perception of user age and Gordon Willard Allport. 1979. The nature of prejudice. gender from text. In Proceedings of ACL, pages Basic books. 843–854. http://aclweb.org/anthology/ P16-1080. Shlomo Argamon, Jean-Baptiste Goulain, Russell Horton, and Mark Olsen. 2009. Vive la Alice Freed. 1996. Language and gender research in an Difference!´ Text mining gender difference in experimental setting. In Bergvall et al. (1996). French literature. Digital Humanities Quarterly, 3(2). http://www.digitalhumanities. Susan A. Gelman. 2003. The essential child: Origins of org/dhq/vol/3/2/000042/000042.html. essentialism in everyday thought. Oxford University Press. Paul Baker. 2014. Using corpora to analyze gender. A&C Black. Marije Groos. 2011. Wie schrijft die blijft? schri- jfsters in de literaire kritiek van nu. Tijd- Victoria L. Bergvall, Janet M. Bing, and Alice F. Freed, schrift voor Genderstudies, 3(3):31–36. Transl.: editors. 1996. Rethinking language and gender re- Who writes endures? Women writers in cur- search: theory and practice. Longman, London. rent literary criticism. http://rjh.ub.rug. nl/genderstudies/article/view/1575. Pauwke Berkers, Marc Verboord, and Frank Weij. 2014. Genderongelijkheid in de dagbladberichtgev- David Hoover. 2013. Text analysis. In Kenneth Price ing over kunst en cultuur. Sociologie, 10(2):124– and Ray Siemens, editors, Literary Studies in the 146. Transl.: Gender inequality in newspaper cover- Digital Age: An Evolving Anthology. Modern Lan- age of arts and culture. https://doi.org/10. guage Association, New York. 5117/SOC2014.2.BERK. Anna Janssen and Tamar Murachver. 2005. Read- Janet M. Bing and Victoria L. Bergvall. 1996. The ers’ perceptions of author gender and literary question of questions: Beyond binary thinking. In genre. Journal of Language and Social Psychol- Bergvall et al. (1996). ogy, 24(2):207–219. http://dx.doi.org/10. 1177%2F0261927X05275745. Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam T Kalai. Kim Jautze. 2013. Hoe literair is de literaire thriller? 2016. Man is to computer programmer as Blog post. Transl.: How literary is the literary woman is to homemaker? debiasing word thriller?, http://kimjautze.blogspot. embeddings. In Advances in Neural Infor- nl/2013/11/hoe-literair-is-de- mation Processing Systems, pages 4349–4357. literaire-thriller.html. http://papers.nips.cc/paper/6228- man-is-to-computer-programmer-as- Kim Jautze, Andreas van Cranenburgh, and Co- woman-is-to-homemaker-debiasing- rina Koolen. 2016. Topic modeling literary qual- word-embeddings.pdf. ity. In Digital Humanities 2016: Conference Ab- stracts, pages 233–237. Krakow,´ Poland. http: Judith Butler. 2011. Gender trouble: Feminism and the //dh2016.adho.org/abstracts/95. subversion of identity. Routledge, New York, NY. Matthew L. Jockers. 2013. Macroanalysis: Digital Aylin Caliskan-Islam, Joanna J. Bryson, and Arvind methods and literary history. University of Illinois Narayanan. 2016. Semantics derived automatically Press, Urbana, Chicago, Springfield. from language corpora necessarily contain human biases. ArXiv preprint, https://arxiv.org/ Matthew L. Jockers and David Mimno. 2013. Sig- abs/1608.07187. nificant themes in 19th-century literature. Poet- ics, 41(6):750–769. http://dx.doi.org/10. Deborah Cameron. 1996. The language-gender inter- 1016/j.poetic.2013.08.005. face: challenging co-optation. In Bergvall et al. (1996). 
Anders Johannsen, Dirk Hovy, and Anders Søgaard. 2015. Cross-lingual syntactic variation over age Hel´ ene` Cixous, Keith Cohen, and Paula Cohen. 1976. and gender. In Proceedings of CoNLL, pages The laugh of the Medusa. Signs: Journal of Women 103–112. http://aclweb.org/anthology/ in Culture and Society, 1(4):875–893. http://dx. K15-1011. doi.org/10.1086/493306. Moshe Koppel, Shlomo Argamon, and Anat Rachel Jacob Cohen. 1988. Statistical power analysis for Shimoni. 2002. Automatically categorizing the behavioral sciences. Routledge Academic, New written texts by author gender. Literary York, NY. and Linguistic Computing, 17(4):401–412. http://llc.oxfordjournals.org/ Lucie Flekova, Jordan Carpenter, Salvatore Giorgi,

21 content/17/4/401.abstract. Marc Verboord. 2012. Female bestsellers: A cross- national study of gender inequality and the popular– Saif Mohammad. 2011. From once upon a time to highbrow culture divide in fiction book produc- happily ever after: Tracking emotions in novels and tion, 1960-2009. European Journal of Communica- fairy tales. In Proceedings of the 5th Workshop on tion, 27(4):395–409. http://dx.doi.org/10. Language Technology for Cultural Heritage, Social 1177%2F0267323112459433. Sciences, and Humanities, pages 105–114. http: //aclweb.org/anthology/W11-1514. Hanna Zijlstra, Henriet¨ Van Middendorp, Tanja Van Meerveld, and Rinie Geenen. 2005. Validiteit Grace Muzny, Mark Algee-Hewitt, and Dan Juraf- van de Nederlandse versie van de Linguistic Inquiry sky. 2016. The dialogic turn and the performance and Word Count (LIWC). Netherlands journal of of gender: the English canon 1782–2011. In psychology, 60(3):50–58. Transl.: Validity of the Digital Humanities 2016: Conference Abstracts, Dutch version of LIWC. http://dx.doi.org/ pages 296–299. http://dh2016.adho.org/ 10.1007/BF03062342. abstracts/153.

Matthew L. Newman, Carla J. Groom, Lori D. Handelman, and James W. Pennebaker. 2008. Gender differences in language use: An analysis of 14,000 text samples. Discourse Processes, 45(3):211–236. http://dx.doi.org/10.1080/01638530802073712.

Dong Nguyen, A. Seza Doğruöz, Carolyn P. Rosé, and Franciska de Jong. 2016. Computational Sociolinguistics: A survey. Computational Linguistics, 42(3):537–593. http://aclweb.org/anthology/J16-3007.

Ryan Nichols, Justin Lynn, and Benjamin Grant Purzycki. 2014. Toward a science of science fiction: Applying quantitative methods to genre individuation. Scientific Study of Literature, 4(1):25–45. http://dx.doi.org/10.1075/ssol.4.1.02nic.

Mark Olsen. 2005. Écriture féminine: searching for an indefinable practice? Literary and Linguistic Computing, 20(Suppl. 1):147–164.

James W. Pennebaker, Roger J. Booth, and Martha E. Francis. 2007. Linguistic inquiry and word count: LIWC [computer software]. www.liwc.net.

Bernhard Rieder and Theo Röhle. 2012. Digital methods: Five challenges. In Understanding Digital Humanities, pages 67–84. Palgrave Macmillan, London.

Laurie A. Rudman and Peter Glick. 2012. The social psychology of gender: How power and intimacy shape gender relations. Guilford Press.

Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: tracing stylometric evidence beyond topic and genre. In Proceedings of CoNLL, pages 78–86. http://aclweb.org/anthology/W11-0310.

Ben Schmidt. 2015. Rejecting the gender binary: a vector-space operation. Blog post, http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html.

A Quantitative Study of Data in the NLP community

Margot Mieskes, Information Science, Darmstadt University of Applied Sciences, [email protected]

Abstract

We present results on a quantitative analysis of publications in the NLP domain on collecting, publishing and availability of research data. We find that a wide range of publications rely on data crawled from the web, but few give details on how potentially sensitive data was treated. Additionally, we find that while links to repositories of data are given, they often do not work even a short time after publication. We put together several suggestions on how to improve this situation based on publications from the NLP domain, but also other research areas.

Therefore, important questions are what happens to the data and how reliable the results obtained through them are. We present a quantitative analysis of how often data is being collected, how data is published, and what data types are being collected. Taken together it gives insight into issues arising from collecting data and from distributing it via channels that do not allow for reproducing results, even after a comparably short period of time. Based on this, we open a discussion about best practices on data collection, storage and distribution in order to ensure high-quality research that is solid and reproducible, but also to make sure that users of, for instance, social media channels are treated according to general standards concerning sensitive data.

1 Introduction 2 Related Work The Natural Language Processing (NLP) commu- In the following we give a broad overview on re- nity makes extensive use of resources available on usability of published code and data sets, but also the internet. And as research in NLP attracts more on results of actual reproducibility studies and pri- attention by the general public, we have to make vacy issues from various domains. sure, our results are solid and reliable, similar to medicine and pharmacy. In the case of medicine, General Guidelines “One goal of scientific the general public is often too optimistic. In NLP publication is to share results in enough detail to this over-optimism can have a negative impact, allow other research teams to reproduce them and such as in articles on automatic speech recogni- to build on them” (Iorns, 2012). But even in med- tion1 or personality profiling2. Few point out, that ical or pharmaceutical research failure to replicate the algorithms are not perfect and do not solve all results can be as high as 89% (Iorns, 2012). Jour- the problems, as on terrorism prevention3 or senti- nals such as Nature5 and PLOS6 require their au- ment analysis4. thors to make relevant code available to editors 1https://theintercept.com/2015/05/ and reviewers. If code cannot be shared, the editor 05/nsa-speech-recognition-snowden- can decline a paper from publication.5 Addition- searchable-text/ ally, they list a range of repositories that are “rec- 2http://www.digitaltonto.com/2013/the- dark-side-of-technology/ ognized and trusted within their respective com- 3http://www.telegraph.co.uk/news/ munities” and meet accepted criteria as “trustwor- uknews/terrorism-in-the-uk/11431757/ Algorithms-and-computers-wont-stop- 5http://www.nature.com/authors/ terrorism.html policies/availability.html 4http://www.nytimes.com/2009/08/24/ 6http://journals.plos.org/plosone/s/ technology/internet/24emotion.html?_r=1 data-availability

23 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 23–29, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics thy digital repositories” for storing data6. This en- suggest a resource innovation impact factor to en- ables authors to follow best practices in their fields courage the publication of data and resources. for the preparation, recording and storing of data. Gratta et al. (2016) studied the types of re- sources published during the previous three LREC Study on re-usability of Code Collberg et al. conferences. They found that more than half (2015) did an extensive study into the release and (58%) of the resources were corpora. They visu- usability of code in the domain of computer sci- alized collaborations between researchers on spe- ence. The authors categorized published code into cific resources and pointed out issues concerning three categories: Projects that were obtained and the meta-data provided by data publishers. built in less than 30 minutes, projects that were successfully built in more than 30 minutes and Replication Studies in NLP Experiments in re- projects where the authors had to rely on the state- producing results in the NLP domain such as ment of the author of the published code. (Fokkens et al., 2013) are still quite rare. One rea- Additionally, they carried out a user study, to son might be, that when undertaking such projects, look into reasons why code was not shared. Rea- “sometimes conflicting results are obtained by re- sons were (among others), that the code will be peating a study” (Jones, 2009). Fokkens et al. available soon, that the programmer left or that the (2013) found, that their experiments were diffi- authors do not intend to release the code at all. cult to carry out and to obtain meaningful results. Their study also presents reasons why code or The 4Real workshop focused on the “the topic of support is unavailable. They found that prob- the reproducibility of research results and the cita- lems in building code were (among others) based tion of resources, and its impact on research in- 9 9 on “files missing from the distribution” and “in- tegrity” . Their call for papers asked for submis- complete documentation”. The authors also list sions of “actual replication exercises of previous lessons learned from their experiment, formulated published results” (see also (Branco et al., 2016)). as advice to the community such as: plan to re- Results from this workshop found that reproduc- lease the code, plan for students to leave, create ing experiments can give additional insights, and project websites and plan for longevity. can therefore be beneficial for the researchers as Finally, the authors present a list of suggestions well as for the community (Cohen et al., 2016). to improve sharing of research artifacts, among Data Privacy and Ethics Another important as- others on how to give details about the sharing in pect is data privacy. An overview on how to deal the publications, beyond using public repositories with data taken from, for example, social me- and coding conventions. dia channels can be found in (Diesner and Chin, Re-using Data Some of the findings by Coll- 2016). The authors raise various issues regard- berg et al. (2015) apply to data as well. Data ing the usage of data crawled from the web. 
As has to be “independently understandable”, which data obtained through these channels is, strictly means, that it is not necessary to consult the orig- speaking, restricted in terms of redistribution, re- inal provider (Peer et al., 2014). A researcher has producibility is a problem. the responsibility to publish data, code and rele- Wu et al. (2016) present work on develop- vant material (Hovy and Spruit, 2016). Addition- ing and implementing principles for creating re- ally, Peer (2014) argued, that a data review process sources based on patient data in the medical do- as carried out by data archives such as ICSPR7 or main and working with this data. ISPS8 is feasible. Bleicken et al. (2016) report efforts on Milsutkaˇ et al. (2016) propose to store URLs as anonymization of video data from sign language. persistent identifiers to allow for future references The authors developed a semi-automatic proce- and support long-term availability. dure to black relevant parts of the video, where Francopoulo et al. (2016) looked at NLP publi- named entities are mentioned. cations and NLP resources and carried out a quan- Fort and Couillault (2016) report on a survey titative study into resource re-usage. The authors on the awareness and care NLP researchers show towards ethical issues. The authors scope also 7http://www.icpsr.umich.edu/icpsrweb/ considered working conditions for crowd workers. index.jsp 8http://isps.yale.edu/research/data 9http://4real.di.fc.ul.pt/

Their results indicate that the majority (84%) consider licensing and distribution of language data during their work. Over three-quarters of the participants (77%) think that "ethics should be part of the subjects for the call for papers".

3 Research Questions

In the course of this work, we looked at various aspects of experimental work:

Collection: NLP researchers collect data, often without informing the persons or entities who produced this data. These data sets are analyzed, conclusions are drawn about how people write, behave, etc., and others make use of these findings in other contexts. This gave rise to the questions:

• Has data been collected?
• If the data contains potentially sensitive data, which post-processing steps have been taken (i.e. anonymization)?
• Was the resulting data published?
• Is there enough information/is it possible to obtain the data?

Replicability/Reproducibility: Often the data on which these studies are based is not published or not available anymore. This can be due to various reasons.10 Among those are that webpages or e-mail addresses are no longer functional after a researcher left a specific research institute, that after a webpage re-design some data has not been moved to the new page, and that copyright or data privacy issues could not be resolved.

This gives rise to issues such as the reproducibility of research results. Original results from these studies are published and later referred to, but they cannot be verified on the original data. In some cases, data is being re-used and extended. But often only specific parts of the original data are used. Details on how to reproduce the changed data set (e.g. code/scripts used to obtain the subset) are not published, and descriptions of the procedure are insufficient. This extends the questions:

• Was previously published data used in a different way and/or extended?

These questions target how easy it would be to follow up by reproducing published results and extending the work. Our results give an indication of the availability of research data.

Specific to data taken, for example, from social media channels is another, additional aspect:

Personal Data: Researchers present and publish their data and the results of their research at conferences and workshops, often using examples taken from the actual data. And of course, they aim to look for examples that are entertaining, especially during a presentation. But we also observed that names are being used: not just fairly common names, but real names or aliases used on social media. This renders the person identifiable as defined by the data protection act below. Therefore, we added the questions:

• Did the data contain sensitive data?
• Was the data anonymized?

These questions look at how researchers deal with potentially sensitive data. The results indicate how seriously they take their responsibility towards their research subjects, who are either voluntarily or involuntarily taking part in a study.

What constitutes sensitive data? Related to the above presented questions, we had to define what sensitive data is. In a leaflet from the MIT Information Services and Technology, sensitive data includes information about "ethnicity, race, political or religious views, memberships, physical or mental health, personal life (...) information on a person as consumer, client, employee, patient, student". It also includes contact information, ID, birth date, parents' names, etc. (Services and Technology, 2009). The UK data protection act contains a similar list.11 The European Commission (Schaar, 2007) formulates personal and therefore sensitive data as "any information relating to an identified or identifiable natural person". And even anonymizing data does not solve all issues here, as "(...) information may be presented as aggregated data, the original sample is not sufficiently large and other pieces of information may enable the identification of individuals". Based on these definitions, we counted towards the sensitive data aspect everything that users themselves report ("user generated web content" (Diesner and Chin, 2016)), but also what is being reported about them, e.g. data gathered from equipment such as mobile phones, which allows one to identify a specific person.

10 This is based on personal experience and therefore not quantified.
11 https://ico.org.uk/for-organisations/guide-to-data-protection/key-definitions/
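The anonymization asked about in the questions above can be partly automated with off-the-shelf named entity recognition. The following is only an illustrative sketch, not part of the study described here: it assumes spaCy and its small English model are installed, the entity types and placeholder scheme are arbitrary choices, and generic NER models typically miss social-media handles and aliases, so manual review and additional rules remain necessary.

```python
# Minimal pseudonymization sketch: replace person names detected by an
# off-the-shelf NER model with stable placeholders before data is shared.
# Assumes spaCy is installed and the small English model has been downloaded
# (pip install spacy && python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def pseudonymize(text, labels=("PERSON",)):
    doc = nlp(text)
    mapping = {}   # original surface form -> placeholder, reused for consistency
    out, last = [], 0
    for ent in doc.ents:
        if ent.label_ not in labels:
            continue
        placeholder = mapping.setdefault(ent.text, f"[{ent.label_}_{len(mapping) + 1}]")
        out.append(text[last:ent.start_char])
        out.append(placeholder)
        last = ent.end_char
    out.append(text[last:])
    return "".join(out), mapping

masked, mapping = pseudonymize("Alice Miller replied to @bob_k about her diagnosis.")
print(masked)   # e.g. "[PERSON_1] replied to @bob_k about her diagnosis."
```

Note in the example that the alias "@bob_k" is left untouched, which mirrors the observation above that aliases used on social media can still make a person identifiable.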

4 Quantitative Analysis

Our quantitative analysis was carried out on publications from NAACL (Knight et al., 2016), ACL (Erk and Smith, 2016), EMNLP (Su et al., 2016), LREC (Calzolari et al., 2016) and Coling (Matsumoto and Prasad, 2016) from 2016. This resulted in a data set of 1758 publications, which includes long papers for ACL, long and short papers for NAACL, technical papers for Coling and full proceedings for EMNLP and LREC, but no workshop publications.

Procedure: All publications were manually checked by the author. Creating an automatic method proved to be infeasible, as the descriptions of whether or not data was collected, whether it is provided to the research community, through which channel, etc. are too heterogeneous across the publications. We checked the abstracts for pointers to the specific work, looked at the respective sections on procedure and data collection, and looked for mentions of publication plans, a link, or the availability of the data. This information was collected and stored in a table for later evaluation. This analysis could have been extended by contacting the data set authors and looking at the content of the data sets. While this definitely would be a worthwhile study, it would have gone beyond the scope of the current paper, as it would have meant contacting over 700 authors individually. Additionally, this project was intended to raise awareness of how data is being collected and published.

Reproducibility of Results: Of the 1758 publications, 704 reported to have collected or extended/changed existing data12 (approx. 40%). Table 1 shows the results with respect to the number of publications and the number of papers reporting data usage and/or extension. LREC saw the highest number of published papers containing collected and/or published data.

  Venue    # papers   # data published   Ratio
  NAACL    182        57                 31.3%
  ACL      231        63                 27.3%
  EMNLP    264        81                 30.7%
  Coling   337        89                 26.4%
  LREC     744        414                55.6%
  total    1758       704                40.0%

Table 1: Results of papers reporting the usage and the publication of data.

Table 2 gives details about the availability of the data sets used. 468 of the 704 publications (58%) report a link where the data can be downloaded. Another 35% report no link at all, and below 1% mention that the data is proprietary and cannot be published. Out of the links given, 18% do not work at all. This includes cases where the mentioned page did not exist (anymore) or where it is inaccessible. Most cases where links did not work (15.7%) were due to incomplete or broken links to personal webpages at the respective research institutions. Therefore, we looked in more detail at the hosting methods for publishing data. We found that only about 20.7% were published on public hosting services such as github13 or bitbucket14. While these services are targeted towards code and might not be appropriate for data collections, they are at least independent of personal or research institute webpages. LREC publications also mention hosting services such as metashare15 and the LRE Map16, or state that data will be provided through LDC17 or ELRA18 (8.9%).

  Category             Percentage
  Link available       65.2%
  Link does not work   15.7%
  No Link              31.4%
  On Request           1.8%
  Proprietary data     < 1%

Table 2: Detailed numbers on available and working links.
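The link-availability figures in Table 2 were compiled by hand. If the collected table were exported to a spreadsheet or CSV, a periodic re-check could be scripted roughly as follows; this is only a sketch, the file name and column names are invented for illustration, and probing with HEAD followed by GET is a heuristic that some hosts will still reject.

```python
# Sketch of an automatic availability check for data links collected from
# publications, to re-derive the "link works" / "link does not work" counts.
# Assumes a CSV with the (hypothetical) columns "paper_id" and "data_url",
# and the third-party "requests" package.
import csv
import requests

def check_links(csv_path, timeout=10):
    counts = {"ok": 0, "broken": 0, "no_link": 0}
    with open(csv_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            url = (row.get("data_url") or "").strip()
            if not url:
                counts["no_link"] += 1
                continue
            try:
                r = requests.head(url, allow_redirects=True, timeout=timeout)
                if r.status_code >= 400:          # some hosts reject HEAD; retry with GET
                    r = requests.get(url, stream=True, timeout=timeout)
                counts["ok" if r.status_code < 400 else "broken"] += 1
            except requests.RequestException:
                counts["broken"] += 1
    return counts

if __name__ == "__main__":
    print(check_links("data_links.csv"))
```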
Responsibility towards Research Subjects: Out of the 704 publications, about 32.8% collected or used data from social media or otherwise sensitive data as outlined in Section 3 above. Only about 3.5% of these report the anonymization of the data. In some cases it was obvious that no anonymization has been carried out, as the discussion of the data and results mentions user names or aliases, which makes the person identifiable. The remaining publications do not mention how

12 Publications used more than one data set; therefore, sums can be more than 100%.
13 https://github.com/
14 https://bitbucket.org/product
15 http://www.meta-share.eu/
16 http://www.resourcebook.eu/
17 https://www.ldc.upenn.edu/
18 http://catalog.elra.info/

the data was treated or processed. It is possible that most of them anonymized their data, but it is not clearly stated. Other data collected was generally written data such as news (37%), spoken data (11%) and annotations (27%). In LREC, a considerable amount of data from the medical domain (recordings of the elderly, pathological voices, and data from proficiency observations, such as children or foreign language learners) was reported (7%). But in only 10% of the cases anonymization was reported or became obvious through the webpage or published pictures.

5 Suggestions for future direction

From the above presented analysis, we raise several discussion points, which the NLP community should address together. The following is meant as a starting point to flesh out a code of conduct and potential future activities to improve the situation.

Data Collection and Usage: This addresses issues such as how to collect data, how to pre-/post-process data (i.e. anonymization), and recommendations for available tools supporting these. Additionally, guidelines on how to present data in publications and presentations should enforce anonymization. This could be supported by allowing one additional page for submitted papers, where details on collections, procedures and treatment are given. A checklist both for authors and reviewers (a machine-readable sketch follows below) should contain at least:

• Has data been collected?
• How was this data collected and processed?
• Was previously available data used/extended – which one?
• Is a link or a contact given?
• Where does it point (private page, research institute, official repository)?

For journals, the availability and usability of data (and potentially code) should be mandatory, similar to Nature and PLOS (see Section 2).
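One possible machine-readable rendering of such a checklist is sketched below. The field names and example values are invented for illustration; they are not a standard proposed by this paper, and a conference or journal adopting the idea would need to agree on its own schema.

```python
# Illustrative, machine-readable version of the author/reviewer checklist.
# Field names and values are hypothetical examples, not a mandated format.
DATA_STATEMENT = {
    "data_collected": True,
    "collection_method": "Twitter streaming API, keyword-filtered",
    "contains_sensitive_data": True,
    "anonymization": "user names and @-handles replaced by placeholders",
    "reuses_existing_data": False,
    "extended_dataset": None,                                # name of reused/extended resource, if any
    "distribution_link": "https://example.org/dataset-v1",   # placeholder URL
    "link_target": "official repository",                    # private page | research institute | official repository
    "processing_tools": ["tokenizer vX", "NER model vY"],    # supports reproducibility
}

REQUIRED = ["data_collected", "collection_method", "contains_sensitive_data",
            "reuses_existing_data", "distribution_link", "link_target"]

def missing_fields(statement):
    """Return required checklist fields that are absent or empty."""
    return [k for k in REQUIRED if statement.get(k) in (None, "")]

print(missing_fields(DATA_STATEMENT))   # [] when the checklist is complete
```

A reviewer-side script could reject submissions whose statement leaves required fields empty, which would make the proposed checklist enforceable rather than advisory.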
Data Distribution: This addresses issues on how data should be distributed to the community, respecting data privacy issues as well. We should define standards for publications that are not tied to a specific lab or even the personal website of a researcher, similar to the recommended repositories for Nature or PLOS (see Section 2), but rather provide means and guidelines to gather, work with and publish data. On publication, a defined set of meta data should be provided. This should also include information on the methods and tools which have been used to process the data. This simplifies the reproduction of experiments and results.19 All of this could be collected in a repository where code and data are stored. Various efforts in this direction already exist, such as the LRE Map20 or the ACL Data and Code Repository21. The ACL Repository currently lists only 9 resources from 2008 to 2011. The LRE Map contains over 2,000 corpora, but the newest dates from LREC 2014. So the data that was analyzed here has not been provided there.

Adding a reproducibility section to conferences and journals in the NLP domain would support the validation of previously presented results. Studies verified by independent researchers could be raised in awareness, and appropriate credit could be given to both the original researchers and the verification. This could be tied together with extending, encouraging and enforcing the usage of data repositories such as the ACL Repository or the LRE Map, and with finding common interfaces between the various efforts. In the long term, virtual research environments would allow for working with sensitive data without distributing it, which would foster collaboration across research labs.

6 Future Work

Future work includes extending this preliminary study in two directions: earlier publications, and how usable published data sets are. Are various high-profile studies actually replicable, and what can we learn from the results? Additionally, the suggestions sketched in the previous section have to be fleshed out and put into action in a continuous revision process.

Acknowledgments

This work was partially supported by the DFG-funded research training group "Adaptive Preparation of Information from Heterogeneous Sources" (AIPHES, GRK 1994/1). We would like to thank the reviewers for their valuable comments that helped to considerably improve the paper.

19 Ideally, a lab book or experiment protocol containing all the necessary information about the experiments should be published as well.
20 http://www.resourcebook.eu/
21 https://www.aclweb.org/aclwiki/index.php?title=ACL_Data_and_Code_Repository

References

Julian Bleicken, Thomas Hanke, Uta Salden, and Sven Wagner. 2016. Using a Language Technology Infrastructure for German in order to Anonymize German Sign Language Corpus Data. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

António Branco, Nicoletta Calzolari, and Khalid Choukri, editors. 2016. Portorož, Slovenia. An LREC 2016 Workshop.

Nicoletta Calzolari, Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Hélène Mazo, Asunción Moreno, Jan Odijk, and Stelios Piperidis, editors. 2016. Tenth International Conference on Language Resources and Evaluation (LREC 2016). European Language Resources Association, Portorož, Slovenia, May 23–28, 2016. Published online at: http://www.lrec-conf.org/proceedings/lrec2016/index.html.

Kevin Cohen, Jingbo Xia, Christophe Roeder, and Lawrence Hunter. 2016. Reproducibility in Natural Language Processing: A Case Study of two R Libraries for Mining PubMed/MEDLINE. In 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pages 6–12, Portorož, Slovenia, May. An LREC 2016 Workshop.

Christian Collberg, Todd Proebsting, and Alex M. Warren. 2015. Repeatability and Benefaction in Computer Systems Research – A Study and a Modest Proposal. Technical Report TR 14-04, University of Arizona.

Jana Diesner and Chieh-Li Chin. 2016. Gratis, Libre, or Something Else? Regulations and Misassumptions Related to Working with Publicly Available Text Data. In ETHI-CA2 2016: ETHics In Corpus Collection, Annotation & Application, Portorož, Slovenia, May. An LREC 2016 Workshop.

Katrin Erk and Noah A. Smith, editors. 2016. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Berlin, Germany, August.

Antske Fokkens, Marieke van Erp, Marten Postma, Ted Pedersen, Piek Vossen, and Nuno Freire. 2013. Offspring from Reproduction Problems: What Replication Failure Teaches Us. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1691–1701, Sofia, Bulgaria, August. Association for Computational Linguistics.

Karën Fort and Alain Couillault. 2016. Yes, We Care! Results of the Ethics and Natural Language Processing Surveys. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

Gil Francopoulo, Joseph Mariani, and Patrick Paroubek. 2016. Linking Language Resources and NLP Papers. In 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pages 24–32, Portorož, Slovenia, May. An LREC 2016 Workshop.

Riccardo Del Gratta, Francesca Frontini, Monica Monachini, Gabriella Pardelli, Irene Russo, Roberto Bartolini, Fahad Khan, Claudia Soria, and Nicoletta Calzolari. 2016. LREC as a Graph: People and Resources in a Network. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

Dirk Hovy and Shannon L. Spruit. 2016. The Social Impact of Natural Language Processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany, August. Association for Computational Linguistics.

Elizabeth Iorns. 2012. Is medical science built on shaky foundations? https://www.newscientist.com/article/mg21528826.000-is-medical-science-built-on-shaky-foundations/, September.

Val Jones. 2009. Science-Based Medicine 101: Reproducibility. https://sciencebasedmedicine.org/science-based-medicine-101-reproducibility/, August.

Kevin Knight, Ani Nenkova, and Owen Rambow, editors. 2016. Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Association for Computational Linguistics, San Diego, California, June.

Yuji Matsumoto and Rashmi Prasad, editors. 2016. Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers. The COLING 2016 Organizing Committee, Osaka, Japan, December.

Jozef Mišutka, Ondřej Košarko, and Amir Kamran. 2016. SHORTREF.ORG – Making URLs Easy-to-Cite. In 4REAL Workshop: Workshop on Research Results Reproducibility and Resources Citation in Science and Technology of Language, pages 19–23, Portorož, Slovenia, May. An LREC 2016 Workshop.

Limor Peer, Ann Green, and Elizabeth Stephenson. 2014. Committing to Data Quality Review. International Journal of Digital Curation, 9(1):263–291.

Limor Peer. 2014. Mind the gap in data reuse: Sharing data is necessary but not sufficient for future reuse. http://blogs.lse.ac.uk/impactofsocialsciences/2014/03/28/mind-the-gap-in-data-reuse/, March.

Peter Schaar. 2007. Opinion 4/2007 on the concept of personal data. http://ec.europa.eu/justice/policies/privacy/docs/wpdocs/2007/wp136_en.pdf.

Information Services and Technology. 2009. Sensitive Data: Your Money AND Your Life. http://web.mit.edu/infoprotect/docs/protectingdata.pdf, January.

Jian Su, Kevin Duh, and Xavier Carreras, editors. 2016. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Austin, Texas, November.

Stephen Wu, Tamara Timmons, Amy Yates, Meikun Wang, Steven Bedrick, William Hersh, and Hongfang Liu. 2016. On Developing Resources for Patient-level Information Retrieval. In Nicoletta Calzolari (Conference Chair), Khalid Choukri, Thierry Declerck, Sara Goggi, Marko Grobelnik, Bente Maegaard, Joseph Mariani, Helene Mazo, Asuncion Moreno, Jan Odijk, and Stelios Piperidis, editors, Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Paris, France, May. European Language Resources Association (ELRA).

Ethical by Design: Ethics Best Practices for Natural Language Processing

Jochen L. Leidner and Vassilis Plachouras
Thomson Reuters, Research & Development, 30 South Colonnade, London E14 5EP, United Kingdom.
{jochen.leidner,vassilis.plachouras}@thomsonreuters.com

Abstract

Natural Language Processing (NLP) systems analyze and/or generate human language, typically on users' behalf. One natural and necessary question that needs to be addressed in this context, both in research projects and in production settings, is the question how ethical the work is, both regarding the process and its outcome.

Towards this end, we articulate a set of issues, propose a set of best practices, notably a process featuring an ethics review board, and sketch how they could be meaningfully applied. Our main argument is that ethical outcomes ought to be achieved by design, i.e. by following a process aligned by ethical values. We also offer some response options for those facing ethics issues.

While a number of previous works exist that discuss ethical issues, in particular around big data and machine learning, to the authors' knowledge this is the first account of NLP and ethics from the perspective of a principled process.

1 Introduction

Ethics, the part of practical philosophy concerned with all things normative (moral philosophy, answering the fundamental question of how to live one's life), permeates all aspects of human action. Applying it to Natural Language Processing (NLP), we can ask the following core questions: 1. What ethical concerns exist in the realm of NLP? 2. How should these ethical issues be addressed? At the time of writing, automation using machine learning is making great practical progress, and this includes NLP tasks, but is by no means limited to it. As more areas in life are affected by these new technologies, the practical need for clarification of ethical implications increases; in other words, we have reached the level where the topic is no longer purely academic: we need to have solutions for what a driverless car should morally do in situations that can be described as ethical dilemmas, and in language- and speech-enabled systems ethical questions also arise (see below). Governments and NGOs are also trying to come to grips with what machine learning, which NLP also relies on, means for policy making (Armstrong, 2015).

In this paper, a more principled way to deal with ethical questions in NLP projects is proposed, which is inspired by previous work on the more narrowly confined space of privacy, which we attempt to generalize. In doing so, we want to make sure that common pitfalls such as compartmentalization (i.e., considering one area in isolation and solving problems in a way that creates problems elsewhere) do not hinder the pursuit of ethical NLP research and development, and we shall present some possible response options for those facing non-ethical situations to stimulate discussion.

Paper Plan. The rest of this paper is structured as follows: Sec. 2 introduces the concept of "ethical by design". After reviewing some related work in Sec. 3, Sec. 4 reviews ethics issues in NLP. Sec. 5 introduces a proposed process model and some possible responses for those facing ethics dilemmas. Sec. 6 discusses the shortcomings, and Sec. 7 summarizes and concludes this paper.

2 Ethical by Design

Ann Cavoukian (2009), a Canadian privacy and information officer, has devised a set of seven principles for privacy by design, a sub-set of which

we can generalize—so that they apply to general ethics standards instead of the single issue of privacy—as follows.

1. Proactive not reactive: by planning to do things in an ethical way we avoid having to react remedially to non-ethical situations more often than without a planning approach;
2. Ethical as the default setting: by making a commitment to pursuing originally ethical paths, we create alignment within organizations towards a more streamlined set of options that comply with common values;1
3. Ethics embedded into the process: a process firmly inclusive of ethics at all stages and levels is less likely to create accidental harm;
4. End-to-end ethics: ethics cannot be confined to a stage; it must be an all-encompassing property of a process from basic research over product design to dissemination or delivery, i.e. the full life-cycle of a technology;
5. Visibility and transparency: a process that is published can be scrutinized, criticized and ultimately improved by a caring community;
6. Respect for user values: whatever values a research institute, university or company may hold is one thing; being user-centric means to also consider the values of the user (of a component, product) and the subjects that take part in experiments (ratings, data annotations).

How could such principles be applied to NLP, concretely? We ought to make some practical proposals how to proceed, e.g. in a research project or when developing a product, to avoid ethical issues. To this end, we will now look at some potential issues, review best practices that are available, and then put it all together in the form of a process recommendation and possible responses for dealing with ethical issues as they arise.

3 Related Work

Prior work on the topics of ethics in NLP can be grouped into three categories. First, there is the general body of literature covering applied ethics and moral philosophy. Second, within computer science, there are discussions around big data, data mining and machine learning and their ethical implications, often focused on privacy and bias/discrimination. Few if any of these works have mentioned issues specific to language processing, but a lot of the unspecific issues also do apply to NLP.2 Third, there is a body of works on professional ethics, often talked about in the context of curriculum design for computer science teaching (didactics of computing), governance and professional conduct, and legal/ethical aspects of computing (computing as a profession, continued professional development).

Moral Philosophy & Ethics. We cannot even aspire to give a survey of centuries of moral philosophy in a few sentences; instead, we briefly sketch three exemplary schools of moral philosophy to represent the fact that there is no single school of thought that settles all moral questions.3 Aristotle's "Nicomachean Ethics" (Aristotle, 1925; Aristotle, 1934)4, Utilitarianism (Mill, 1879) and Kant's (1785) categorical imperative are just three examples of philosophical frameworks that can be used as a frame of reference to study ethics, including the ethics of NLP and its applications. Aristotle based his system on happiness (Greek εὐδαιμονία) as the highest attainable and ultimate goal for humans, and takes a consensus-informed view starting with those moral principles that people with "good upbringing" agree on. Kant's categorical imperative posits a decision criterion to decide whether an action is moral or not, namely whether we would want to lift up our behaviour so that it may become a law of nature. Utilitarianism suggests to maximise happiness for the largest number of people, which implies a quantification- and outcome-oriented aspect; however, it also contains a severe flaw: it can be used to justify unethical behavior towards minorities as long as a majority benefits.

Information & Big Data Ethics. There is a body of work within philosophy on information ethics (Allen et al., 2005; Bynum, 2008); big data has created its own challenges, which are begin-

1 There is a catch, namely different people may agree to slightly different ethical premises, or they may draw different conclusions from the same premises.
2 An edited collection on ethics and related topics in the context of artificial companions exists (Wilks, 2010), but as Masthoff (2011) points out, NLP does not feature in it.
3 For general background reading in ethics and moral philosophy, see Gensler (2011). For computing-related ethics background there already exist many suitable entries to the literature (Brey et al., 2009; Quinn, 2016; Stahl et al., 2016; Bynum, 2008; Bynum and Rogerson, 2004; Cary et al., 2003).
4 Named after its dedication to Nicomachus, likely either Aristotle's father or son.

31 ning to be discussed. Pasquale (2015) provides cal implications of crowdsourcing, the program- a thorough analysis of the societal impact of data controlled automation of work conducted by collection, user profiling, data vendors and buy- anonymous human subjects. ers, and application algorithms and the associated Professional Ethics. Professional ethics is the issues; it contains numerous real case examples. integration of codification into education and con- However, the exposition does not appear to in- tinuous professional development, and the com- clude examples likely to rely on NLP. Supervised puting profession developed the ACM Code of learning, clustering, data mining and recommen- Ethics and Professional Conduct (1992), which dation methods can account for the vast majority communicates a detailed set of 22 values; how- of examples (collaborative filtering, Apriori algo- ever, the compliance with them has been made rithm), which raises the questions of whether there voluntary, there are no negative consequences to will be a second wave of more sophisticated profil- people not adhering to them; further more, insuf- ing attempts relying on NLP and neural networks. ficient detail is given with regards to where moral Machine Learning & Bias. Since 2014, the boundaries are on specific issues, or how they may Fairness, Accountability, and Transparency in Ma- be obtained. The current code, then, falls short chine Learning (FATML, 2014) workshop series of an “algorithm how to be a good professional”, (originally organized by S. Barocas and M. Hardt if that can even exist. More recently (ACM, at NIPS) have been concerned with technical so- 2017), 7 principles were postulated to promote lutions associated with a lack of accountabil- the transparency and accountability of algorithms: ity, transparency and fairness in machine learning 1. Awareness; 2. Access and redress; 3. Ac- models. countability; 4. Explanation; 5. Data Provenance; 6. Auditability; and 7. Validation and Testing. NLP Application Ethics. Thieltges, Schmidt The Association of Internet Researchers has pub- and Hegelich (2016) discuss NLP chat-bots; in lished ethics guidelines (Markham and Buchanan, particular, they focus on the dilemma they call 2012), which have seen some adoption.5 Per- “devil’s triangle”, a tension between transparency, haps the most interesting empirical finding to date accuracy and robustness of any proposed auto- is a survey by Fort and Couillault (2016), who matic chat-bot detection classifier. Software that posed ethics-related questions to French and inter- interacts with humans and/or acts on humans’ be- national audiences, respectively, in two polls. For half, such as robot control code or chat-bots will example, over 40% of respondents said they have need to contain embodied decisions to ensure that refused to work on a project on ethical grounds. the software acts as if it was a moral agent, in other words we would expect the software to act This work draws heavily on Cavoukian (2009)’s in a way such that a human acting in the same proposal but goes beyond in that we propose a way would be considered morally acting (Allen et process model that makes intentional and non- al., 2005). Most recently, Hovy and Spruit (2016) intentional violations harder to go unnoticed. provided a broad account and thought-provoking Our process is also informed by the holis- call for investigation to the NLP community to tic “Resources-Product-Target” (RPT) model of explore their impact on society. 
To give an ex- Floridi (2013). As he points out (Floridi, 2013, ample of an unethical or at least highly ques- p. 20), many models focus rather narrowly on a tionable application, Mengelkamp, Rohmann and “ethics of information resources”, “ethics of in- Schumann (2016) survey (but do not advocate) formation products” or “ethics of the informa- practices of credit rating agencies’ use of social tional environment” view (which he calls mi- (user-generated) content, mostly unknown and un- croethical). His counter-proposal, the RPT model approved by the creators of that data. Fairfield (Figure 1), in contrast, is a macro-ethical approach and Shtein (2014) analyze ethics from a journal- that tries to avoid dilemmas and counterproductive ism point of view, which bears some similarity effects by taking too narrow a view: RPT stands with the perspective of automatic NLP, as jour- for “Resources-Product-Target”, because Floridi’s nalists also scrutinize textual sources and produce model considers the life cycle of producing an in- text, albeit not algorithmically. formation product (output) from and information

NLP Methodology Ethics. Fort, Adda and Co- 5e.g. by the journal PeerJ. See also AoIR (2017 to appear) hen (2011) provide an early account of the ethi- for an upcoming event on ethics in Internet research.

resource (input), during which an effect on the environment (infosphere) is created (target). By considering the environment as well, compartmentalization, i.e. behaving ethically in a narrowly confined realm but not overall, can be avoided.6 Information ethics within NLP is nascent; however, there is a lot of general work that can be borrowed from, going back as far as early computer science itself (Wiener, 1954).

[Figure 1: The Holistic Resources-Product-Target (RPT) Model after Floridi (2013). The diagram connects a user ("Alice") with an information resource (R), an information product (P) and an information target (T) within the infosphere.]

4 Ethical Issues in NLP

While the related work in the previous sections reviewed work on ethics including but not limited to ethics and NLP, in this section we discuss the types of ethical issues that we are aware of, and give some examples from the NLP literature. We will link each type to one or more parts of Floridi's model. An interesting question is what specifically is different in NLP with respect to ethics compared to other data-related topics or ethics in general. This question can be split into two parts: first, since it pertains to human language processing, and human language touches many parts of life, these areas also have an ethics dimension. For example, languages define linguistic communities, so inclusion and bias become relevant topics. Second, NLP is about processing by machine. This means that automation (and its impact on work) and errors (and their impact, whether intentional or not) become ethical topics. Furthermore, if NLP systems are used as (information) access mechanisms, accessibility is another concern (inclusion of language-impaired users).

Unethical NLP Applications (pertains to P for Product in Floridi's RPT model). The earliest ethical issue involving NLP that the authors could find during the research for this paper surrounds the UNIX spell(1) command. Spell is a spell-checker: it prints out words not found in a lexicon so they can be corrected. However, in the course of its invocation, McIlroy's 1978 version (unlike Johnson's original implementation) emailed words that are not found in its lexicon to its implementer to support lexicon improvements (Bentley, 1986, p. 144); while this is technically commendable (perhaps even one of the earliest examples of log-file analysis), from a privacy point of view the author of a document may disapprove of this.7 Youyou, Kosinski and Stillwell (2015) show that automated psychometrics—they use social media "like"s—now rivals human determination of personality traits; one interesting moral aspect is that when subjects wrote a piece of text they were likely not aware that in the future this may be possible, in the same way that many people who uploaded their photos online were not aware that one day face recognition at scale would reach maturity, which has now happened. In a similar spirit, Kosinski, Stillwell and Graepel (2013) demonstrate that other private personal traits and attributes can be computed from a user's data, including from their network of personal relationships. Thieltges, Schmidt and Hegelich (2016) describe another issue, namely that of chat-bots, which may act in a political way, such as steering or influencing a discussion or, worse, completely destroying meaningful human discourse by injecting noise: on Twitter, chat-bots made real conversation impossible for the topic channel #YaMeCansé dedicated to reducing violence and corruption in Mexico, and real-life human activists were reportedly followed and threatened. It seems prudent that any bot self-identify as an automated entity, and from an ethical—if not legal—point of view, a respectful, low-traffic interaction is warranted. NLP developers should not participate in such efforts, not let themselves be instrumentalized by state actors or commercial interests, should withdraw from dubious projects, and pub-
6 A famous example of compartmentalization is the cruel dictator that is loving to his children at home. In an NLP context, an example could be a friendly and caring scientist that unwittingly abuses workers using a crowdsourcing API, because he needs gold data and has a small budget.
7 The authors of this paper do not know if all users of spell(1) were privy to this feature (we have not received a response from Prof. McIlroy to an email request for clarification while this paper was under review). In any case, it should be clear that the works cited here are listed to ignite the ethics discussion, not to criticize individual works or authors, whose work we greatly respect.

33 licize and disclose immoral practices.8 In con- sitive psychiatric material (Pestian et al., 2012; trast, research into the automatic detection of such Brew, 2016) constructed to study causes for ter- bots seems ethical, to increase transparency and minating one’s life are much more private still. reduce manipulation in a society, and this may re- Is the ability to construct a classifier that detects quire manipulative bots to be developed for test how “serious” a suicide note should be taken a purposes; however, they should not be deployed, good thing? It may prevent harm by directing but be kept in sandboxed environments or just be scarce health resources in better ways, but does implemented as simulations. Hovy and that justify the privacy invasion of anyone’s per- Spruit (2016) point out the dual nature of some sonal good-byes, without their consent? Althoff, work: like a knife can be used to cut bread or to Clark and Leskovec (2016) describe an analy- harm others, NLP may “have dual” use potential. sis of counseling conversations using NLP meth- There are two possible responses: either object if ods; here, perhaps because the patients are still the non-ethical use is clearly involved in a project alive, even stronger privacy protection is indi- or product. Or alternatively, act conservatively and cated. Another privacy-related issue is excessive avoid obvious dual-use technologies entirely in fa- government surveillance, which can lead to self- vor of ethical-only use technologies (e.g. work on censoring and ultimately undermine democracy health-care applications instead of command-and- (Penney, 2016). control software). Building an NLP application, Fairness, Bias & Discrimination (pertains to like any other human activity that is a means to T for target in Floridi’s RPT model). Picture a an end, can have an ethical end or not. For exam- spoken dialog system that is easy to use for a ple, an NLP application could be an instance of an young male financial professional user with a Lon- unethical application if its purpose is not consis- don English pronunciation, but that may barely tent with ethical norms. If one adopts cherishing work for an elderly lady from Uddingston (near human life as an absolute moral value, developing Glasgow, Scotland). As automated information a smart weapon using voice control would be an systems are becoming more pervasive, they may example of an application that is ethically wrong. eventually substitute human information kiosks Davis and Patterson (2012) list identity, privacy, for cost reasons, and then out-of-sample user ownership and reputation as the four core areas of groups could be excluded and left behind without big data ethics. What is the range of potential eth- an alternative. The internal functioning of NLP ical issues in NLP in specific? This paper cannot systems can raise questions of transparency & ac- provide an exhaustive list, but we will try to give a countability: what if a parser does not work for nucleus list that serves to illustrate the breadth of particular types of inputs, and the developer does topics that can be affected. not communicate this aspect to an application de- Privacy (pertains to T for target in Floridi’s veloper, who wants to build a medical application RPT model). Collecting linguistic data may lead that uses it. 
It is responsible behavior to disclose 9 to ethical questions around privacy : Corpora limitations of a system to its users, and NLP sys- such as the British National Corpus, the Collins tems are no exception. In the context of machine COBUILD corpus or the Penn Treebank contain learning, governments have started looking into names of individuals and often substantial per- the impact of ML on society, the need for policy sonal information about them; e-mail corpora to guidance and regulation (Armstrong, 2015). study the language of email (Klimt and Yang, 2004), or corpora of suicide notes and other sen- Abstraction & Compartmentalization (per- tains to all parts of Floridi’s RPT model). As men- 8One reviewer has called our call for non-participation in tioned earlier, Floridi’s (2013) model was explic- un-ethical work “na¨ıve”; however, we believe individuals can effect positive change through their personal choices, and es- itly designed to overcome an overly narrow fo- pecially in sought-after professions no-one has an excuse that cus only on the input or project output. Abstract- they had to feed their family (or whatever justification one ing over humans happens in crowdsourcing (see may bring forth). Also, by buying from and working for, more ethical companies, a pull towards increasing ethical be- above) when work is farmed out to an API, which havior overall may be generated. has individual humans behind it, but this fact can 9 Privacy as a concept is discussed in Westin (1967); See be all to easily ignored by the API’s caller. If the recent seminal literature on big data privacy (Lane et al., 2014; Zimmer, 2010) for more in-depth discussions of data abstraction can lead to ethical ignorance in one and privacy. dimension, compartmentalization can lead to the

34 same in another. For example, an information ex- ers from menial jobs so they can pursue more in- traction system that gets build without looking at teresting work thereafter. However, many may the political context in which it is likely deployed lack the qualifications or intellect, or may at least may lead to unethical actions by a team of other- perceive stress by that perspective of being forced wise morality-oriented researchers. to taking on more and more challenging jobs. Oth- Complexity (pertains to T as in target in ers even see automation as “freeing humans from Floridi’s RPT model). Today’s big data sys- duty of work” completely, which would be an eth- tems are cloud-based pipelines of interconnected ical pro-argument. In reality, most humans like data feeds and “black box” processes (Pasquale, to have work, and may even need it to give their 2015) combining and transforming a multitude of lives a structure and purpose; certainly many peo- sources each, and these transcend individual orga- ple define who they are by what they do profes- nizational boundaries. This means that an organi- sionally. Therefore, taking their work from them zation offering e.g. an NLP API ceases control of without their consent, means taking their dignity how it may be used externally; this creates com- from them. It is also often argued that NLP sys- plex, hard-to-understand macro-ecosystems. tems merely aim to make human analysts more productive. It is desirable to do so, and then au- (pertains to R Un-ethical Research Methods tomation would seem morally legitimate. How- as in Resource and T as in target in RPT). Doing ever, in practice, many customers of applications research itself can be done in more or less eth- desire automation as a tool to reduce the work- ical ways, so the discussion should not be lim- force, because they are under cost pressure. ited to the outcome. An example for applying wrong standards could be setting up a psycholin- 5 Best Practices guistic experiment about language acquisition of children in a kindergarden without briefing and 5.1 Ethics Review Board getting the consent of their parents. Doing the In order to establish ethical behavior as a default, research itself may involve hiring helpers, who installing a process likely increases the awareness; should not be kept in unethical work conditions; it assigns responsibility and improves consistency crowdsourcing has been criticized to be a form of procedure and outcome. An ethics review board of slavery, for instance (Fort et al., 2011). Re- for companies (as already implemented by uni- cently, crowdsourcing has become a common el- versities for use in the context of the approval of ement of the NLP toolbox to create gold data or experiments with human and animal subjects) in- to carry out cost-effective evaluations. Crowd- cluded in the discussion of new products, services, sourcing is now ubiquitous, even indispensable or planned research experiments should be con- for researchers in HCI, cognitive science, psycho- sidered, very much like it already exists in uni- linguistics and NLP. And it poses not just tax versity environments in an experimental natural compliance issues (who issues tax returns for my science context; in the 21st century, experiments workers that I do not know?), but also the fact with data about people is a proxy for doing ex- that the mutual anonymity leads to a loose, non- periments with people, as that data affects their committal relationship between researchers and lives. 
Figure 2 shows our proposal for such a pro- crowd workers that stand against pride in the of cess for a hypothetical company or research in- quality of work output or a moral sense duty of stitution. It shows a vetting process featuring an care for workers. Ethics Review Board (ERB), which would oper- Automation (pertains to R as in Resource and T ate as follows before and after executing research as in target in RPT). Finally, it could be questioned projects, before and after product development, as whether NLP in general is unethical per se, based well as during deployment of a product or ser- on it being an instance of automation: any activity vice on an ongoing basis (at regular intervals), the that leads to, or contributes to, the loss of jobs that ERB gets to review propositions (the “what”) and people use to generate their own existential sup- methods (the “how”) and either gives its bless- port: The argument that automation destroys jobs ing (approve) or not (veto). Ethics stakehold- is old (Ford, 2015; Susskind and Susskind, 2015), ers participate in research, product/service design and traditionally, two counterarguments have been & development, operations & customer service. presented. Some claim automation relieves work- Each of them could report to the Chief Informa-

[Figure 2 layout: an Ethics Approval Board vetoes/approves, via requests for comment, complaints and regular reviews, the stages Research (Experiment), Product Development (Experiment, Business Case) and Deployment (Product/Service); caption below.]


Figure 2: A Process Proposal for “Ethics by Design” in an Organization. tion Officer via a Vice President for Governance projects are mostly pertaining to automation as an rather than phase-specific management to give the issue, worker representatives would be good to in- ethics review more robustness through indepen- clude. Moral philosophers, human rights experts dence. There are associated business benefits (em- and lawyers could be included as well; in gen- ployee identification, reduced risk of reputation eral, some delegates should be independent and damage), however these are distinct from the eth- should have a well-developed conscience.10 In ical motive, i.e. desiring to do the right thing for practice, the ERB involvement incorporates ele- its own sake. They are also distinct from legal ments from “value sensitive design” (Friedman et motives, because acting legally is often not suf- al., 2008) (thus generalizing Cavoukian’s “privacy ficient for acting ethically, especially in emerging by design” idea) and works as follows: a prod- technology areas where the law lags behind tech- uct manager generates a document outlining a new nology developments. An ERB might be too ex- project, or a scientist creates an idea for a new re- pensive to operate for smaller institutions, but it search project. At the decision point (launch or could be outsourced to independent contractors or nor), the ERB is involved to give its ethics ap- companies that perform an external audit function proval (in addition to other, already existing func- (“ERB as a service”). The board would convene tions, e.g. finance or strategy). At this stage, the at well-defined points to sign off that there was an morality of the overall idea is assessed. Working ethics conversation, documenting identified issues at a conceptual level, the ERB needs to identify and recommending resolution pathways. ERB au- the stakeholders, both direct and indirect, as well dits could benefit from check-lists that are collated as the values that are implicated in the proposed in the organization based on the experience ob- idea or project. Once the project is launched, de- tained in past projects (Smiley et al., 2017). In tailed written specifications are usually produced. the US, Institutional Review Boards are already They again are brought to the ERB for review. legally required for certain kinds of institutions The project plan itself is reviewed by the ERB and businesses (Enfield and Truwit, 2008; Pope, with a view to scrutinizing research methods, un- 2009). Our proposal is to adopt similar practices, derstanding how the stakeholders prioritize im- and to customize them to accommodate particu- plicated values and trade-offs between competing lar, NLP-specific issues. An ERB should be em- values, as well as how the technology supports cer- powered to veto new products or NLP projects on tain values. The ERB may send documents back ethical grounds at the planning stage or at project with additional requests to clarify particular as- launch time (earlier means less money wasted). pects. The ERB documents permanently anything The ERB could be installed by the board to make identified as unethical. Ideally, they would have a the company more sustainable, and more attrac- powerful veto right, but different implementations tive to ethical investors. Note that it is not re- are thinkable. 
Much harder is the involvement in quired that ERB members agree on one and the ongoing review activities, for example to decide same school of ethics: a diverse ERB with voting whether or not code is ethical. It appears that procedures, comprising members, each of which a committee meeting is not well suited to ascer- driven by their own conscience, might converge tain moral principles are adhered to; a better way towards wise decisions, and that may be the best could be if the ERB was in regular informal touch way to adopt as a practical solution (“crowd wis- 10 dom”). The ethics board should ideally contain It would definitely help to include more inexperienced and younger members, whose idealism may not have been all stakeholder groups: if the organization’s NLP corrupted by too much exposure to so-called “real life”.

36 with scientists and developers in order to probe the the software or the process that leads to the soft- team with the right questions. For example, a men- ware can itself be unethical in part or as a whole. tion of the use of crowdsourcing could trigger a There is also an interaction between the conversa- suggestion to pay the legal minimal hourly salary. tion whether AI (including NLP) can/should even aspire doing what it does, as it does, because fram- 5.2 Responses to Ethics Dilemmas ing of the task brings ethical baggage that some A lot of the literature focuses on how to decide see as distraction from other (more?) important what is ethical or not. While this is obviously a issues: as Lanier (2014) points out, the direc- core question, the discussion must not rest there: tional aspiration and framing of AI as a movement of similar relevance is an elaboration about pos- that either aims to or accidentally results in re- sible remedies. Table 1 shows a set of possible placing humans or superseding the human race, responses to ethics issues. Some of these acts are effectively creating a post-human software-only about the individual’s response to a situation with species (“the end of human agency”) is a “non- possible ethical implications in order to avoid be- optimal” way of looking at the dangers of AI; coming co- responsible/complicit, whereas others in his view, it adds a layer of distractive argu- are more outcome-oriented. They are loosely ori- ments (e.g. ethical/religious ones about whether ented from least (bottom) to most (top) serious, we should do this) that divert the discourse from and include internal and external activities. other, more pressing conversations, such as over- promising (and not delivering), which leads to 6 Discussion subsequent funding cuts (“AI winter”). We will likely be producing a world that may be likened The privacy issue in the early UNIX spell(1) tool to Brazil rather than Skynet. While Lanier has differs from the Mexican propaganda chat-bots a point regarding over-selling, in our view the in that the former wrongly (un-ethically) imple- ethical problems need to be addressed regardless, ments a particular function (if there is no on- but his argument helps to order them by immedi- screen warning to ensure informed consent of the ate urgency. One could argue that our ERB pro- user), whereas in the latter case, the chat-bot ap- posal may slow innovation within an organization. plication as a whole is to be rejected on moral However, it is central to protecting the organiza- grounds. We can use these two situations to anec- tion from situations that have a significant impact dotally test our “ethics by design” process in a on its reputations and its customers, hence, reduc- thought experiment: what if both situations arose ing the organization’s risk exposure. One may also in an organization implementing the process as argue that, if implemented well, it could guide in- described above? The spell tool’s hidden emails novation processes towards ethical innovation. should be unearthed in the “review” stage (which could well include code reviews by independent 7 Summary & Conclusion consultants or developers in cases where question- able practices have been unearthed or suspected). In this paper, we have discussed some of the ethi- And clearly, the Mexican bots should be rejected cal issues associated with NLP and related tech- by the ERB at the planning stage. By way of self- niques like machine learning. 
We proposed an criticism, flagging functional issues like the hid- “ethics by design” approach and we presented a den spell email feature is perhaps less likely de- new process model for ethics reviews in compa- tectable than application-level ethical issues, since nies or research institutions, similar to ethics re- over-keen programmers may either forget, inten- view boards that in universities must approve ex- 11 tionally not communicate, or mis-assess the im- periments with animal and human subjects. We portance of the hidden-email property; neverthe- also presented a list of remedies that researchers less, using the process arguably makes it more can consider when facing ethical dilemmas. In fu- likely to detect both issues than without using it. ture work, professional codes of conduct should be Floridi’s model, which wass designed for infor- strengthened, and compliance be made mandatory mation ethics, may have to be extended in the di- for professionals. rection of information processing ethics (covering 11See also the journal IRB: Ethics & Human Research (cur- the software that creates or co-creates with hu- rently in its 39th volume) dedicated to related topics in other mans the information under consideration), since disciplines.

Table 1: Remedies: Pyramid of Possible Responses to Unethical Behavior.

  Demonstration    to effect a change in society by public activism
  Disclosure       to document/to reveal injustice to regulators, the police, investigative journalists ("Look what they do!", "Stop what they do!")
  Resignation      to distance oneself III ("I should not/cannot be part of this.")
  Persuasion       to influence in order to halt non-ethical activity ("Our organization should not do this.")
  Rejection        to distance oneself II; to deny participation; conscientious objection ("I can't do this.")
  Escalation       raise with senior management/ethics boards ("You may not know what is going on here.")
  Voicing dissent  to distance oneself I ("This project is wrong.")
  Documentation    ensure all the facts, plans and potential and actual issues are preserved.

Acknowledgments. The authors would like to Chris Brew. 2016. Classifying ReachOut posts with a express their gratitude to Aaron J. Mengelkamp, radial basis function SVM. In Kristy Hollingshead Frank Schilder, Isabelle Moulinier, and Lucas and Lyle H. Ungar, editors, Proceedings of the 3rd Workshop on Computational Linguistics and Clini- Carstens for pointers and discussions, and to cal Psychology, CLPsych@NAACL-HLT 2016, June Khalid Al-Kofahi for supporting this work. We 16, 2016, San Diego, California, USA, pages 138– would also like to express our gratitude for the de- 142. Association for Computational Linguistics. tailed constructive feedback from the anonymous Brey, Philip, and Johnny H. Soraker. 2009. Philoso- reviewers. phy of computing and information technology. In Dov M. Gabbay, Antonie Neijers, and John Woods, editors, Philosophy of Technology and Engineering References Sciences, volume 9, pages 1341–1408. North Hol- land, Burlington, MA, USA. ACM. 1992. ACM Code of Ethics and Pro- fessional Conduct. Online, cited 2016-12-27, Terrell Ward Bynum and Simon Rogerson. 2004. http://bit.ly/2kbdhoD. Computer Ethics and Professional Responsibility: Introductory Text and Readings. Wiley-Blackwell, ACM. 2017. Principles for Algorithmic Trans- Malden, MA, USA. parency and Accountability. online, cited 2017-01- 15, http://bit.ly/2jVROTx. Terell Bynum. 2008. Computer and information ethics. In Stanford Encyclopedia of Philosophy. Colin Allen, Iva Smit, and Wendell Wallach. 2005. Stanford University. Artificial morality: Top-down, bottom-up, and hy- brid approaches. Ethics and Information Technol- C. Cary, H. J. Wen, and P. Mahatanankoon. 2003. Data ogy, 7(3):149–155. mining: Consumer privacy, ethical policy, and sys- tem development practices. Human Systems Man- Tim Althoff, Kevin Clark, and Jure Leskovec. 2016. agement, 22. Large-scale analysis of counseling conversations: An application of natural language processing to Ann Cavoukian. 2009. Privacy by design. Techni- mental health. Transactions of the Association for cal report, The Office of the Information and Pri- Computational Linguistics, 4(463–476). vacy Commissioner of Ontario, Toronto, Ontario, Canada. Aristotle. 1925. The Nicomachean Ethics: Translated Kord Davis and Doug Patterson. 2012. Ethics of Big with an Introduction. Oxford University Press, Ox- Data: Balancing Risk and Innovation. O’Reilly, Se- ford, England, UK. bastopol, CA, USA.


Building Better Open-Source Tools to Support Fairness in Automated Scoring

Nitin Madnani1, Anastassia Loukina1, Alina von Davier2, Jill Burstein1 and Aoife Cahill1

1Educational Testing Service, Princeton, NJ   2ACT, Iowa City, IA
1{nmadnani, aloukina, jburstein, acahill}@ets.org   [email protected]

Abstract

Automated scoring of written and spoken responses is an NLP application that can significantly impact lives, especially when deployed as part of high-stakes tests such as the GRE® and the TOEFL®. Ethical considerations require that automated scoring algorithms treat all test-takers fairly. The educational measurement community has done significant research on fairness in assessments, and automated scoring systems must incorporate their recommendations. The best way to do that is by making available automated, non-proprietary tools to NLP researchers that directly incorporate these recommendations and generate the analyses needed to help identify and resolve biases in their scoring systems. In this paper, we attempt to provide such a solution.

1 Introduction

Natural Language Processing (NLP) applications now form a large part of our everyday lives. As researchers who build such applications, we have a responsibility to ensure that we prioritize the ideas of fairness and transparency and not just blindly pursue better algorithmic performance.

In this paper, we discuss the ethical considerations pertaining to automated scoring of written or spoken test responses, referred to as "constructed responses". Automated scoring is an NLP application which aims to automatically predict a score for such responses. We focus on automated systems designed to score open-ended constructed response questions. Such systems generally use text and speech processing techniques to extract a set of features from responses which are then combined into a scoring model to predict the final score assigned by a human rater (Page, 1966; Burstein et al., 1998; Zechner et al., 2009; Bernstein et al., 2010).

Test scores, whether assigned by human raters or computers, can have a significant effect on people's lives and, therefore, must be fair to all test takers. Automated scoring systems may offer some advantages over human raters, e.g., higher score consistency (Williamson et al., 2012). Yet, like any other machine learning algorithm, models used for score prediction may inadvertently encode discrimination into their decisions due to biases or other imperfections in the training data, spurious correlations, and other factors (Romei and Ruggieri, 2013b; von Davier, 2016).[1]

The paper has the following structure. We first draw awareness to the psychometric research and recommendations on quantifying potential biases in automated scoring and how it relates to the ideas of fairness, accountability, and transparency in machine learning (FATML). The second half of the paper presents an open-source tool called RSMTool[2] for developers of automated scoring models which directly integrates these psychometric recommendations. Since such developers are likely to be NLP or machine learning researchers, the tool provides an important bridge from the educational measurement side to the NLP side. Next, we discuss further challenges related to fairness in automated scoring that are not currently addressed by RSMTool as well as methods for avoiding bias in automated scoring rather than just detecting it. The paper concludes with a discussion of how these tools and methodologies may, in fact, be applicable to other NLP applications beyond automated scoring.

[1] Some of these problems were recently discussed at a panel focused on Fairness in Machine Learning in Educational Measurement that was held at the annual meeting of the National Council for Educational Measurement (von Davier and Burstein, 2016).
[2] http://github.com/EducationalTestingService/rsmtool

41 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 41–52, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics plicable to other NLP applications beyond auto- developers of the test conduct both qualita- mated scoring. tive and quantitative reviews of each question (Angoff, 2012; Duong and von Davier, 2013; 2 Ethics and Fairness in Constructed Oliveri and von Davier, 2016; Zieky, 2016). Response Scoring 2. Test administration. All test-takers must At this point in the paper, we believe it is important be provided with comparable opportunities to to define exactly what we refer to as fairness for demonstrate the abilities being measured by the field of scoring constructed responses, whether the test. This includes considerations such as it is done manually by humans or automatically by the location and number of test centers across NLP systems. the world, and whether the testing conditions A key concept here is the idea of a “construct” in each test center are standardized and se- which is defined as a set of related knowledge, cure. For example, Bridgeman et al. (2003) skills, and other abilities that a test is designed to showed that, at least for some tests, exami- measure. Examples of possible constructs include nee test scores may be affected by screen res- logical reasoning, language proficiency, reading olution of the monitors used to administer the comprehension etc. A fair test is one where dif- test. This means that for such tests to be fair, ferences in test scores between the test-takers are it is necessary to ensure that all test-takers use due only to differences in skills which are part of monitors with a similar configuration. the construct. Any consistent differences in scores between different groups of test takers that result 3. Test scoring. There should also be no bias from other factors not immediately related to the in the test scores irrespective of whether they construct (i.e., “construct-irrelevant”) — e.g., test- are produced by human raters or by auto- taker gender — may indicate that the test is unfair. mated scoring models. The unequal distribu- Specifically, for a test to be fair, the non-random tion of social, economic, and educational re- effects of construct-irrelevant factors need to be sources means that some differences in per- minimized during the four major phases of a test: formance across subgroups are to be ex- test development, test administration, test scoring, pected. However, differences large enough to and score interpretation (Xi, 2010; Zieky, 2016): have practical consequences must be investi- gated to ensure that they are not caused by 1. Test development. All tests must be free construct-irrelevant factors (AERA, 1999). of bias, i.e., no questions on a test should include any content that may advantage or 4. Score interpretation Finally, while most disadvantage any specific subgroup of test- tests tend to have a constant structure, the takers in ways that are unrelated to the con- actual questions change regularly. Some- struct the test is designed to assess. The sub- times several different versions of a test (“test groups in this case are defined based on fac- forms”) exist in parallel. Even if two test- tors that include test-taker personal informa- takers take different versions of a test, their tion such as gender, race, or disability, but test scores should still be comparable. 
To may also go beyond the standard protected achieve this, a separate statistical process properties. For example, Xi (2010) discusses process called “test equating” is often used to how familiarity with the subject matter in an adjust for unintended differences in the diffi- English language proficiency test may impact culty of the test forms (Lee and von Davier, test performance and, thus, would require an 2013; Liu and Dorans, 2016). This process explicit analysis of fairness for a group de- itself must also be investigated for fairness to fined by test-taker fields of study. Addition- ensure that it does not introduce bias against ally, on the same test, test-takers whose na- any group of test-takers. tive languages use the Roman alphabet will have an advantage over test-takers with native In this paper, we focus on the third phase: the languages based on other alphabets. How- fairness of test scores as measured by the impact ever, this advantage is allowable because it is of construct-irrelevant factors. As Xi (2010) dis- relevant to the construct of English compre- cusses in detail, unfair decisions based on scores hension. To ensure bias-free questions, the assigned to test-takers from oft-disadvantaged

42 groups are likely to have profound consequences: their own biases (Feldman et al., 2015) either due they may be denied career opportunities and ac- to an existing bias in the training data or due to cess to resources that they deserve. Therefore, a minority group being inadequately represented it is important to ensure — among other things in the training data. Automated scoring is cer- — that construct-irrelevant factors do not intro- tainly not immune to such biases and, in fact, duce systematic biases in test scores, irrespective several studies have documented differing perfor- of whether they are produced by human raters or mance of automated scoring models for test-takers by an automated scoring system. with different native languages or with disabilities Over the last few years, there has been a sig- (Burstein and Chodorow, 1999; Bridgeman et al., nificant amount of work done on ensuring fair- 2012; Wang and von Davier, 2014; Wang et al., ness, accountability, and transparency for machine 2016; An et al., 2016; Loukina and Buzick, In learned models from what is now referred to as the print). FATML community (Kamiran and Calders, 2009; Biases can also arise because of techniques used Kamishima et al., 2012; Luong et al., 2011; Zemel to develop new features for automated scoring et al., 2013). More recently, Friedler et al. (2016) models. The automated score may be based on proposed a formal framework for conceptualizing features which are construct-irrelevant despite be- the idea of fairness. Within that framework, the ing highly correlated with the human scores in the authors define the idea of “structural bias”: the training data. As an example, consider that more unequal treatment of subgroups when there is no proficient writers tend to write longer responses. clear mapping between the features that are easily Therefore, one almost always observes a consis- observable for those subgroups (e.g., largely irrel- tent positive correlation between essay length and evant, culturally and historically defined charac- human proficiency score (Perelman, 2014; Sher- teristics) and the true features on which algorith- mis, 2014b). This is acceptable since verbal flu- mic decisions should actually be based (the “con- ency — a correlate of response length — is consid- struct”). Our conceptualization of fairness for au- ered an important part of the writing proficiency. tomated scoring models in this paper — avoiding Yet, longer essays should not automatically re- systematic biases in test scores across subgroups ceive higher scores. Therefore, without proper due to construct-irrelevant factors — fits perfectly model validation to consider the relative impact in this framework. of such features, decisions might be made that are unfair to test-takers. 3 Detecting Biases in Automated Scoring On this basis, the psychometric guidelines re- Human scoring of constructed responses is a sub- quire that if automated scoring models are to be jective process. 
Among the factors that can im- used for making high-stakes decisions for col- pact the assigned scores are rater fatigue (Ling lege admissions or employment, the NLP re- et al., 2014), differences between novice and ex- searchers developing those models should perform perienced raters (Davis, 2015), and the effect of model validation to ensure that demographic and raters’ linguistic background on their evaluation construct-irrelevant factors are not causing their of the language skill being measured (Carey et models to produce significant differences in scores al., 2011). Furthermore, the same response can across different subgroups of test-takers (Yang et sometimes receive different scores from different al., 2002; Clauser et al., 2002; Williamson et al., raters. To guard against such rater inconsistencies, 2012). This is exactly what fairness – as we define responses for high-stakes tests are often scored by it in this paper – purports to measure. multiple raters (Wang and von Davier, 2014; Pen- However, it is not easy for an NLP or ma- field, 2016). Automated scoring of constructed re- chine learning researcher to perform comprehen- sponses can overcome many of these issues inher- sive model validation since they may be unfamil- ent to human scoring: computers do not get tired, iar with the required psychometric and statistical do not have personal biases, and can be config- checks. The solution we propose is a tool that ured to always assign the same score to a given incorporates both the standard machine learning response. pipeline necessary for building an automated scor- However, recent studies in machine learning ing model and a set of psychometric and statis- have highlighted that algorithms often introduce tical analyses aimed at detecting possible bias in

43 engine performance. We believe that such a tool cus solely on the fairness-driven evaluation capa- should be open-source and non-proprietary so that bilities of the tool that are directly relevant to the the automated scoring community can not only au- issues we have discussed so far. Readers inter- dit the source code of the already available analy- ested in other parts of the RSMToolare referred ses to ensure their compliance with fairness stan- to the comprehensive documentation available at dards but also contribute new analyses. http://rsmtool.readthedocs.org. We describe the design of such a tool in the rest Before we describe the fairness analyses im- of the paper. Specifically, our tool provides the fol- plemented in the tool, we want to acknowledge lowing model validation functionality to NLP/ML that there are many different ways in which re- researchers working on automated scoring: (a) searchers might approach building as well as eval- defining custom subgroups and examining differ- uating scoring models (Chen and He, 2013; Sher- ences in the performance of the automated scor- mis, 2014a). The list of learners and fairness anal- ing model across these groups; (b) examining the yses the tool provides is not, and cannot be, ex- effect of construct-irrelevant factors on automated haustive. In fact, later in the paper, we discuss scores; and (c) comparing the effects of such fac- some analyses that could be implemented in future tors in two different versions of the same scoring versions of the tool since one of the core charac- model, e.g., a version with a new feature added to teristics of the tool is its flexible architecture. See the model and a version without the same feature. 4.4 for more details. § In the next section, we present in detail the anal- 4 RSMTool yses incorporated into RSMTool aimed at detecting the various sources of biases we introduced earlier. In this section, we present an open-source Python As it is easier to show the analyses in the context tool called RSMTool developed by two of the au- of an actual example, we use data from the Hewlett thors for building and evaluating automated scor- Foundation Automated Student Assessment Prize ing models. The tool is intended for NLP re- (ASAP) competition on automated essay scoring searchers who have already extracted features (Shermis, 2014a).3 As our scoring model, we from the responses and need to choose a learner use ordinary linear regression with features ex- function and evaluate the performance as well as tracted from the text of the essay; see Attali and the fairness of the entire scoring pipeline (the Burstein (2006) for details of the features. Note training data, the features, and the learner func- that since the original ASAP data does not con- tion). tain any demographic information, we simulate an Once the responses have been represented as a L1 attribute (the test-taker’s native language) for set of features, automated scoring essentially be- illustration purposes.4 The complete report auto- comes a machine learning problem and NLP re- matically generated by RSMTool is available at: searchers are free to use any of the large number http://bit.ly/fair-tool. The report contains of existing machine learning toolkits. However, links to the raw data used to generate it and to most of those toolkits are general-purpose and do other input files needed to run RSMTool. We fo- not provide the aforementioned fairness analyses. cus on specific sections of the report below. 
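For concreteness, the setup just described (pre-extracted essay features, an ordinary linear regression scoring model, and a simulated L1 attribute) can be approximated outside RSMTool with a few lines of scikit-learn. This is a minimal sketch, not the authors' code; the file and column names (train_features.csv, response_id, human_score) are illustrative assumptions rather than the ASAP or RSMTool schema.

```python
# Minimal sketch: a baseline linear scoring model on pre-extracted features,
# with a simulated L1 (native language) attribute for fairness analyses.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(42)

train = pd.read_csv("train_features.csv")   # hypothetical: response_id, human_score, features
test = pd.read_csv("test_features.csv")
feature_cols = [c for c in train.columns if c not in ("response_id", "human_score")]

# The public essay data carries no demographics, so simulate an L1 column.
l1_values = ["Arabic", "Chinese", "Hindi", "Korean", "Spanish"]
train["L1"] = rng.choice(l1_values, size=len(train))
test["L1"] = rng.choice(l1_values, size=len(test))

model = LinearRegression().fit(train[feature_cols], train["human_score"])
test["predicted_score"] = model.predict(test[feature_cols])
```

All of the fairness checks sketched in the following sections operate on exactly these ingredients: a feature table, human scores, machine predictions, and a subgroup column.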
Instead, we leverage one such toolkit — scikit- learn (Pedregosa et al., 2011) — to build a tool 4.1 Differential Feature Functioning that integrates these fairness analyses directly into In order to evaluate the fairness of a machine the machine learning pipeline and researchers then learning algorithm, Feldman et al. (2015) recom- get them automatically in the form of a compre- mend preventive auditing of the training data to hensive HTML report. determine if the resulting decisions will be fair, ir- Note that the automated scoring pipeline built respective of the machine learning model learned into the tool provides functionality for each step from that training data. RSMTool incorporates sev- of the process of building and evaluating auto- 3https:/www.kaggle.com/c/asap-aes/ mated scoring models: (a) feature transformation, data/ (b) manual and automatic feature selection, and 4We believe it is more transparent to use a publicly avail- (c) access to linear and non-linear learners from able dataset with simulated demographics, rather than a pro- prietary dataset with real demographics that cannot be shared scikit-learn as well as the custom linear learners publicly. The value added by the fairness analyses comes we have implemented. In this paper, we will fo- through in either case.
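The three pipeline stages listed above, (a) feature transformation, (b) feature selection, and (c) a learner, have a direct scikit-learn analogue. The sketch below is a generic stand-in under assumed column names; RSMTool's own pipeline code differs.

```python
# Illustrative pipeline only; not RSMTool's internal implementation.
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import LinearRegression

train = pd.read_csv("train_features.csv")  # hypothetical feature file, as before
feature_cols = [c for c in train.columns if c not in ("response_id", "human_score", "L1")]

scoring_pipeline = Pipeline([
    ("transform", StandardScaler()),                         # (a) feature transformation
    ("select", SelectKBest(score_func=f_regression, k=10)),  # (b) feature selection
    ("learn", LinearRegression()),                           # (c) learner
])
scoring_pipeline.fit(train[feature_cols], train["human_score"])
```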

44 eral such auditing approaches borrowed from pre- by test-taker L1 subgroups in our sample dataset; vious research in both educational measurement Figure 1(b) shows a DFF line plot for the same and machine learning. feature. These plots indicate that the values for The first step in evaluating the fairness of an au- the GRAMMAR feature are consistently lower for tomated scoring model is to ensure that the per- one of the test-taker subgroups (L1=Hindi) across formance of each feature is not primarily deter- all score levels. If such a pattern were observed in mined by construct-irrelevant factors. The tra- real data, it would warrant further investigation to ditional way to approach this is to have an ex- establish the reasons for such behavior. pert review the features and ensure that their de- scription and method of computation are in line 4.1.2 Continuous Factors with the definition of the specific set of skills that This type of construct-irrelevant factors includes the given test purports to measure (Deane, 2013). continuous covariates which despite being corre- However, features incorporated into a modern au- lated with human scores are either not directly rel- tomated scoring system often rely on multiple un- evant to the construct measured by the test or, even derlying NLP components such as part-of-speech if they are, should not be the primary contributor taggers and syntactic parsers as well as complex to the model’s predictions. Response length, as computational algorithms and, therefore, a quali- previously discussed, is an example of such co- tative review may not be sufficient. Furthermore, variates. Even though it provides an important some aspects of spoken or written text can only be indication of verbal fluency, a model which pre- measured indirectly given the current state of NLP dominantly relies on length will not generate fair technologies (Somasundaran et al., 2014). scores. To explore the impact of such factors, RSMTool allows the user to explore the quan- RSMTool computes two types of correlations: (a) titative effect of two types of construct-irrelevant the marginal correlation between each feature and factors that may affect feature performance: cate- the covariate, and (b) the “partial” correlation be- gorical and continuous. tween each feature and the human score, with the effects of the covariate removed (Cramer,´ 1947). 4.1.1 Categorical Factors This helps to clearly bring out the contribution This group of factors generally includes variables of a feature above and beyond being a proxy for that can take on one of a fixed number of possible the identified covariate. The marginal and par- values, e.g., test-takers’ demographic characteris- tial correlation coefficients for our example are tics, different versions of the same test question, shown in Figure 1(c). It shows that although all or various testing conditions. We refer to these features in our simulated dataset contribute infor- factors as “subgroups” though they are not always mation beyond response length, for some features, limited to demographic subgroups. length accounts for a substantial part of their per- When this information is available for all or formance. some of the responses, RSMTool allows the user to compare the feature distributions for differ- 4.2 Bias in Model Performance ent subgroups using box-plots and other distri- Not all types of machine learning algorithms lend butional statistics such as mean and standard de- themselves easily to the differential feature func- viations. 
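The subgroup summaries and the differential feature functioning (DFF) comparison described here reduce to conditional means: the average value of a feature for test-takers who received the same score but belong to different subgroups. A rough pandas sketch follows, with GRAMMAR, human_score, and L1 as assumed column names; it is not the tool's implementation.

```python
# Sketch only: DFF-style conditional means and subgroup distribution summaries.
import pandas as pd

df = pd.read_csv("train_features.csv")  # hypothetical file with feature, score, and L1 columns

# Mean (and standard error) of the GRAMMAR feature per score level and L1.
# Consistently lower values for one L1 at every score level is the pattern
# that a DFF line plot would surface.
dff = (df.groupby(["human_score", "L1"])["GRAMMAR"]
         .agg(["mean", "sem", "count"])
         .unstack("L1"))
print(dff)

# Box-plot style distributional statistics per subgroup.
print(df.groupby("L1")["GRAMMAR"].describe())
```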
However, feature distributions depend tioning analysis. Furthermore, the sheer number on the scores which may differ across subgroups of features in some models may make the results and, therefore, differences in a feature’s distri- of such analyses difficult to interpret. Therefore, bution across subgroups may not always indicate a second set of fairness analyses included into that the feature is biased. To address this, RSM- RSMTool considers how well the automated scores Tool also includes Differential feature functioning agree with the human scores (or another, user- (DFF) analysis (Penfield, 2016; Zhang et al., In specified gold standard criterion) and whether this print). This approach compares the mean values of agreement is consistent across different groups of a given feature for test-takers with the same score test-takers. but belonging to different subgroups. These dif- RSMTool computes all the standard evaluation ferences can be described and reviewed directly metrics generally used for regression-based ma- using DFF line plots. Figure 1(a) shows a box- chine learning models such as Pearson’s correla- plot for the distribution of the GRAMMAR feature tion coefficient (r), coefficient of determination

(a) Box-plots showing the distribution of standardized GRAMMAR feature values by test-taker native language (L1). The dotted red lines represent the thresholds for outlier truncation, computed as the mean feature value ± 4 standard deviations.
(b) A differential feature functioning (DFF) plot for the GRAMMAR feature. Each line represents an L1; each point shows the mean and 95% confidence intervals of the feature values computed for test-takers with that L1 and that assigned score.

(c) Pearson's correlation coefficients (r) between features and human scores: (a) Marginal: the marginal correlation of each feature with the human score, (b) Partial-all: the correlation of each feature with the human score with the effects of all other features removed, and (c) Partial-length: the correlation of each feature with the human score with the effect of response length removed. The two dotted lines represent the correlation thresholds recommended by Williamson et al. (2012).

Figure 1: Examples of RSMTool fairness analyses for categorical and continuous factors.
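The marginal and partial correlations in Figure 1(c) can be reproduced with standard tools: the partial correlation of a feature with the human score, with response length removed, is the correlation between the residuals obtained after regressing each variable on length. The sketch below assumes hypothetical GRAMMAR, human_score, and length columns and is not RSMTool's implementation.

```python
# Sketch of marginal vs. length-partialled correlations for one feature.
import numpy as np
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("train_features.csv")  # hypothetical feature file with a length covariate

def residualize(y, covariate):
    # Residuals of a simple linear regression of y on the covariate.
    slope, intercept = np.polyfit(covariate, y, deg=1)
    return y - (slope * covariate + intercept)

marginal_r, _ = pearsonr(df["GRAMMAR"], df["human_score"])
partial_r, _ = pearsonr(residualize(df["GRAMMAR"], df["length"]),
                        residualize(df["human_score"], df["length"]))
print(f"marginal r = {marginal_r:.3f}, partial r (length removed) = {partial_r:.3f}")
```

A feature whose marginal correlation is high but whose partial correlation collapses is acting largely as a proxy for length, which is exactly the situation the figure is designed to expose.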

46 (R2), and root mean squared error (RMSE). In avoid population drift which can occur when the addition, it also computes other measures that are test-taker population used to train the model no specifically recommended in psychometric liter- longer matches the population currently evaluated ature for evaluating automated scoring models: by this model. quadratically-weighted kappa, percentage agree- When updating an automated scoring system ment with human scores, and the standardized for one of the above reasons, one should not only mean difference (SMD) between human and au- conduct a fairness analysis for the new version tomated scores (Williamson et al., 2012; Rami- of the model, but also a comprehensive compar- neni and Williamson, 2013). These metrics are ison of the old and the new version. For example, computed for the whole evaluation set as well as a change in the percentage of existing test-takers for each subgroup separately in order to evaluate who have passed a particular test resulting from whether the accuracy of automated scores is con- the update would need to be explained not only to sistent across different groups of test-takers. Fig- the test-takers but also to the people making deci- ure 2 shows a plot illustrating how the model R2 sions based on test scores (von Davier, 2016). computed on the evaluation set varies across the RSMTool includes the functionality to conduct different test-taker L1 subgroups. a comprehensive comparison of two different ver- sions of a scoring system and produce a report which includes fairness analyses for each of the versions as well as how these analyses differ be- tween the two versions. As an example, we com- pare two versions of our example scoring model — one that uses all features and another that does not include the GRAMMAR feature. The compar- ison report can be be seen here: http://bit.ly/ fair-tool-compare.
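A few of the agreement statistics named above (R², quadratically-weighted kappa, and the standardized mean difference between machine and human scores) can be computed per subgroup in a handful of lines. The column names, and the pooled-standard-deviation form of the SMD, are assumptions for illustration; Williamson et al. (2012) give the definitions used operationally.

```python
# Sketch of per-subgroup evaluation: R^2, quadratically-weighted kappa, and SMD.
import numpy as np
import pandas as pd
from sklearn.metrics import r2_score, cohen_kappa_score

scores = pd.read_csv("predictions.csv")  # hypothetical: human_score, predicted_score, L1

def subgroup_metrics(group):
    human, machine = group["human_score"], group["predicted_score"]
    pooled_sd = np.sqrt((human.var(ddof=1) + machine.var(ddof=1)) / 2.0)
    return pd.Series({
        "N": len(group),
        "R2": r2_score(human, machine),
        "QWK": cohen_kappa_score(human.round().astype(int),
                                 machine.round().astype(int),
                                 weights="quadratic"),
        "SMD": (machine.mean() - human.mean()) / pooled_sd,
    })

print(scores.groupby("L1").apply(subgroup_metrics))
```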

4.4 Customizing RSMTool The measurement guidelines currently imple- mented in RSMTool follow the psychometric framework suggested by Williamson et al. (2012). It was developed for the evaluation of e-rater, an automated system designed to score English writ- ing proficiency (Attali and Burstein, 2006), but is generalizable to other applications of automated Figure 2: The performance of our scoring model scoring. This framework was chosen because it (R2) for different subgroups of test-takers as de- offers a comprehensive set of criteria for both the fined by their native language (L1). Before com- accuracy as well as the fairness of the predicted puting the R2, the predictions of the model are scores. Note that not all of these recommenda- trimmed and then re-scaled to match the human tions are universally accepted by the automated score distribution in the training data. scoring community. For example, Yannakoudakis and Cummins (2015) recently proposed a different set of metrics for evaluating the accuracy of auto- 4.3 Model comparison mated scoring models. Like any other software, automated scoring sys- Furthermore, the machine learning community tems are updated on a regular basis as researchers has recently developed various analyses aimed at develop new features or identify better machine detecting bias in algorithm performance that could learning algorithms. Even in scenarios where new be applied in the context of automated scoring. features or algorithms are not needed, changes in For example, in addition to reviewing individual external dependencies used by the scoring pipeline features, one could also attempt to predict the might necessitate new releases. Automated scor- subgroup membership from the features used to ing models may also be regularly re-trained to score the responses (Feldman et al., 2015). If this

47 prediction is generally accurate, then there is a rather than due to construct-irrelevant factors. risk that subgroup membership could be implic- itly used by the scoring model and lead to unfair The automated scoring models used in sys- scores. However, if the subgroup prediction has tems such as e-rater for assessing writing profi- high error over all models generated from the fea- ciency in English (Attali and Burstein, 2006) or tures, then the scores assigned by a model trained SpeechRater for spoken proficiency (Zechner et on this data are likely to be fair. al., 2009) have traditionally been linear models with a small number of interpretable features be- RSMTool has been designed to make it easy for cause such models lend themselves more easily the user to add new evaluations and analyses of to a detailed fairness review and allow decision- these types. The evaluation and report-generation makers to understand how, and to what extent, dif- components of RSMTool (including the fairness ferent parts of the test-takers’ skill set are being analyses) can be run on predictions from any ex- covered by the features in the model (Loukina et ternal learner, not just the ones that are provided al., 2015). For such linear models, RSMTool dis- by the tool itself. Each section of its report is plays a detailed model description including the implemented as a separate Jupyter/IPython note- model fit (R2) computed on the training set as well book (Kluyver et al., 2016). The user can choose as the contribution of each feature to the final score which sections should be included into the final (via raw, standardized, and relative coefficients). HTML report and in which order. Furthermore, NLP researchers who want to use different evalu- At the same time, recent studies (Heilman and ation metrics or custom fairness analyses can pro- Madnani, 2015; Madnani et al., 2016) on scor- vide them in the form of new Jupyter notebooks; ing actual content rather than just language profi- these analyses are dynamically executed and in- ciency suggest that it is possible to achieve higher corporated into the final report along with the ex- performance, as measured by agreement with hu- isting analyses or even in their place, if so desired, man raters, by employing many low-level fea- without modifying a single line of code. tures and more sophisticated machine learning al- Finally, for those who want to make more sub- gorithms such as support vector machines or ran- stantive changes, the tool is written entirely in dom forests. Generally, these models are built us- Python, is open-source with an Apache 2.0 li- ing sparse feature types such as word n-grams, of- cense, and has extensive online documentation. ten resulting in hundreds of thousands of predom- We also provide a well-documented API which inantly binary features. Using models with such allows users to integrate various components of a large feature space means that it is no longer RSMTool into their own applications. clear how to map the individual features and their weights to various parts of the test-takers’ skill set, 4.5 Model Transparency & Interpretability and, therefore, difficult to identify whether any differences in the model performance stem from The analyses produced by RSMTool only suggest the effects of construct-irrelevant factors. a potential bias and flag individual subgroups or features for further consideration. 
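The auditing idea attributed to Feldman et al. (2015) above, checking whether subgroup membership can be predicted from the scoring features, is straightforward to prototype. In the sketch below (hypothetical file and column names), cross-validated accuracy close to a majority-class baseline suggests little implicit subgroup signal in the features, while accuracy well above it flags a risk worth investigating.

```python
# Sketch of a subgroup-predictability audit over the scoring features.
import pandas as pd
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("train_features.csv")  # hypothetical features plus an L1 column
feature_cols = [c for c in df.columns if c not in ("response_id", "human_score", "L1")]

audit_acc = cross_val_score(LogisticRegression(max_iter=1000),
                            df[feature_cols], df["L1"], cv=5).mean()
baseline_acc = cross_val_score(DummyClassifier(strategy="most_frequent"),
                               df[feature_cols], df["L1"], cv=5).mean()
print(f"subgroup prediction accuracy: {audit_acc:.3f} (chance baseline: {baseline_acc:.3f})")
```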
As we indi- One way to increase the interpretability of such cated earlier, the presence of differences across models is to group multiple features by feature subgroups does not automatically imply that the type (e.g. “syntactic relationships”) and build a model is unfair; further review is required to estab- stacked model (Wolpert, 1992) containing sim- lish the source of such differences. One of the first pler models for each feature type. These stacked steps in such a review usually involves examining models can then be combined in a final linear each feature separately as well as the individual model which can be examined in the usual manner contribution of each feature to the final score. It is for fairness considerations (Madnani and Cahill, important to note here that unfairness may also be 2016). The idea of making complex machine- introduced by what is not in the model. An auto- learned models more interpretable to users and mated scoring system may not cover a particular stakeholders has been investigated more thor- aspect of the construct which can be evaluated by oughly in recent years and several promising so- humans. If the performance across subgroups dif- lutions have been proposed that could also be used fers on this aspect of the construct, the difference for content-scoring models (Kim et al., 2016; Wil- may be due to “construct under-representation” son et al., 2016).
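The feature-group stacking described above also has a compact generic form: fit one simple model per feature group, then combine the groups' out-of-fold predictions in a final linear model whose handful of coefficients can be reviewed for fairness. The group definitions and column names below are invented for illustration and do not correspond to any deployed system.

```python
# Sketch of feature-group stacking (in the sense of Wolpert, 1992) for interpretability.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_predict

df = pd.read_csv("train_features.csv")  # hypothetical low-level features plus human_score
feature_groups = {                       # assumed grouping of columns by feature type
    "grammar": ["GRAMMAR_1", "GRAMMAR_2"],
    "vocabulary": ["VOCAB_1", "VOCAB_2"],
}

# Out-of-fold predictions from one sub-model per feature group.
stacked = pd.DataFrame({
    name: cross_val_predict(LinearRegression(), df[cols], df["human_score"], cv=5)
    for name, cols in feature_groups.items()
})

# The final combiner exposes a single, reviewable coefficient per feature group.
combiner = LinearRegression().fit(stacked, df["human_score"])
print(dict(zip(feature_groups, combiner.coef_.round(3))))
```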

5 Mitigating Bias in Automated Scoring
6 Conclusion

So far we have primarily discussed techniques In this paper, we discussed considerations that for detecting potential biases in automated scoring go into developing fairer automated scoring mod- models. We showed that there are multiple sources els for constructed responses. We also presented of possible bias which makes it unlikely that there RSMTool, an open-source tool to help NLP re- would be a single “silver bullet” that can make searchers detect potential biases in their scoring test scores completely bias-free. The approach models. We described the analyses currently in- currently favored in the educational measurement corporated into the tool for evaluating the impact community is to try and reduce susceptibility to of construct-irrelevant categorical and continuous construct-irrelevant factors by design. This in- factors. We also showed that the tool is designed in cludes an expert review of each feature before it is a flexible manner which allows users to easily add added to the model to ensure that it is theoretically their own custom fairness analyses and showed and practically consistent with the skill set being some examples of such analyses. measured by the test. These features are then com- While RSMTool has been designed for auto- bined in an easily interpretable model (usually lin- mated scoring research (some terminology in the ear regression) which is trained on a representative tool and the report is specific to automated scor- sample of test-taker population. ing), its flexible nature and well-documented API However, simpler scoring models may not al- allow it to be easily adapted for any machine learn- ways be the right solution. For one, as we dis- ing task in which the numeric prediction is gener- cussed in 3, several studies have shown that ated by regressing on a set of non-sparse, numeric § even such simple models may still exhibit bias. features. Furthermore, the evaluation component In addition, recent studies on scoring test-takers’ can be used separately which allows users to eval- knowledge of content rather than proficiency have uate the performance and fairness of any model shown that using more sophisticated — and hence, that generates numeric predictions. less transparent — models yields non-trivial gains in the accuracy of the predicted scores. There- Acknowledgments fore, ensuring completely fair automated scoring at large requires more complex solutions. We would like to thank the anonymous review- The machine learning community has identified ers for their valuable comments and sugges- several broad approaches to deal with discrimina- tions. We would also like to thank Lei Chen, tion that could, in theory, be used for automated Sorelle Friedler, Brent Bridgeman, Vikram Rama- scoring models, especially those using more com- narayanan, and Keelan Evanini for their contribu- plex non-linear algorithms: the training data can tions and comments. be modified (Feldman et al., 2015; Kamiran and Calders, 2012; Hajian and Domingo-Ferrer, 2013; Mancuhan and Clifton, 2014), the algorithm it- References self can be changed so that it optimizes for fair- AERA. 1999. Standards for Educational and Psy- ness as well as the selection criteria (Kamishima et chological Testing. American Educational Research al., 2012; Zemel et al., 2013; Calders and Verwer, Association. 2010), and the output decisions can be changed Ji An, Vincent Kieftenbeld, and Raghuveer Kan- after-the-fact (Kamiran et al., 2012). A survey of neganti. 2016. 
Fairness in Automated Scoring: such approaches is provided by Romei and Rug- Screening Features for Subgroup Differences. Pre- gieri (2013a). Future work in automated scoring sented at the Annual Meeting of the National Coun- cil on Measurement in Education, Washington DC. could explore whether these methods can address some of the known biases. William H. Angoff. 2012. Perspectives on Differen- Of course, it is also important to note that such tial Item Functioning Methodology. In P.W. Holland bias-mitigating approaches often lead to a decline and H. Wainer, editors, Differential Item Function- in the overall model performance and, therefore, ing, pages 3–23. Taylor & Francis. one needs to balance model fairness and accuracy Yigal Attali and Jill Burstein. 2006. Automated Essay which likely depends on the stakes for which the Scoring with e-rater V. 2. The Journal of Technol- model is going to be used. ogy, Learning and Assessment, 4(3):1–30.

49 Jared Bernstein, A. Van Moere, and Jian Cheng. 2010. Minh Q. Duong and Alina von Davier. 2013. Hetero- Validating Automated Speaking Tests. Language geneous Populations and Multistage Test Design. In Testing, 27(3):355–377. Roger E. Millsap, L. Andries van der Ark, Daniel M. Bolt, and Carol M. Woods, editors, New Devel- Brent Bridgeman, Mary Lou Lennon, and Altamese opments in Quantitative Psychology: Presentations Jackenthal. 2003. Effects of Screen Size, Screen from the 77th Annual Psychometric Society Meeting, Resolution, and Display Rate on Computer-Based pages 151–170. Springer New York. Test Performance. Applied Measurement in Educa- tion, 16(3):191–205. Michael Feldman, Sorelle A. Friedler, John Moeller, Brent Bridgeman, Catherine Trapani, and Yigal Attali. Carlos Scheidegger, and Suresh Venkatasubrama- 2012. Comparison of Human and Machine Scor- nian. 2015. Certifying and Removing Disparate Im- ing of Essays: Differences by Gender, Ethnicity, pact. Proceedings of the 21th ACM SIGKDD Inter- and Country. Applied Measurement in Education, national Conference on Knowledge Discovery and 25(1):27–40. Data Mining, pages 259–268. Jill Burstein and Martin Chodorow. 1999. Automated Sorelle A. Friedler, Carlos Scheidegger, and Suresh essay scoring for nonnative english speakers. In Venkatasubramanian. 2016. On the (Im)possibility Proceedings of a Symposium on Computer Medi- of Fairness. CoRR, abs/1609.07236. ated Language Assessment and Evaluation in Nat- ural Language Processing, pages 68–75, Strouds- Sara Hajian and Josep Domingo-Ferrer. 2013. A burg, PA, USA. Methodology for Direct and Indirect Discrimina- tion Prevention in Data Mining. IEEE Transactions Jill Burstein, Karen Kukich, Susanne Wolff, Chi on Knowledge and Data Engineering, 25(7):1445 – Lu, Martin Chodorow, Lisa Braden-Harder, and 1459. Mary Dee Harris. 1998. Automated Scoring Us- ing A Hybrid Feature Identification Technique. In Michael Heilman and Nitin Madnani. 2015. The Proceedings of the 36th Annual Meeting of the As- Impact of Training Data on Automated Short An- sociation for Computational Linguistics and 17th In- swer Scoring Performance. In Proceedings of the ternational Conference on Computational Linguis- Tenth Workshop on Innovative Use of NLP for Build- tics, Volume 1, pages 206–210, Montreal, Que- ing Educational Applications, pages 81–85, Denver, bec, Canada, August. Association for Computa- Colorado, June. Association for Computational Lin- tional Linguistics. guistics. Toon Calders and Sicco Verwer. 2010. Three Naive Bayes Approaches for Discrimination-Free Classi- Faisal Kamiran and Toon Calders. 2009. Classifying fication. Data Mining journal; special issue with without Discriminating. In Proceedings of the IEEE selected papers from ECML/PKDD. International Conference on Computer, Control and Communication, pages 1–6. M. D. Carey, R. H. Mannell, and P. K. Dunn. 2011. Does a Rater’s Familiarity with a Candidate’s Pro- Faisal Kamiran and Toon Calders. 2012. Data Prepro- nunciation Affect the Rating in Oral Proficiency In- cessing Techniques for Classification Without Dis- terviews? Language Testing, 28(2):201–219. crimination. Knowledge and Information Systems, 33(1):1 – 33. Hongbo Chen and Ben He. 2013. Automated Essay Scoring by Maximizing Human-Machine Agree- ment. In Proceedings of the 2013 Conference on Faisal Kamiran, Asad Karim, and Xiangliang Zhang. Empirical Methods in Natural Language Process- 2012. 
Decision Theory for Discrimination-aware ing, pages 1741–1752, Seattle, Washington, USA, Classification. In International Conference on Data October. Association for Computational Linguistics. Mining (ICDM). Brian E. Clauser, Michael T. Kane, and David B. Swan- Toshihiro Kamishima, Shotaro Akaho, Hideki Asoh, son. 2002. Validity Issues for Performance-Based and Jun Sakuma. 2012. Fairness-aware Classifier Tests Scored With Computer-Automated Scoring with Prejudice Remover Regularizer. In Proceed- Systems. Applied Measurement in Education, ings of the Joint European Conference on Machine 15(4):413–432. Learning and Knowledge Discovery in Databases, pages 35–50. Harald Cramer.´ 1947. Mathematical Methods of Statistics. Princeton University Press. Been Kim, Dmitry Malioutov, and Kush Varshney, ed- Larry Davis. 2015. The Influence of Training and Ex- itors. 2016. Proceedings of the ICML Workshop on perience on Rater Performance in Scoring Spoken Human Interpretability in Machine Learning. Language. Language Testing, 33:117–135. Thomas Kluyver, Benjamin Ragan-Kelley, Fernando Paul Deane. 2013. On the Relation between Auto- Prez, Brian Granger, Matthias Bussonnier, Jonathan mated Essay Scoring and Modern Views of the Writ- Frederic, Kyle Kelley, Jessica Hamrick, Jason ing Construct. Assessing Writing, 18(1):7–24. Grout, Sylvain Corlay, Paul Ivanov, Damin Avila,

50 Safia Abdalla, Carol Willing, and Jupyter Develop- Fabian Pedregosa, Gal Varoquaux, Alexandre Gram- ment Team. 2016. Jupyter Notebooks — A Publish- fort, Vincent Michel, Bertrand Thirion, Olivier ing Format for Reproducible Computational Work- Grisel, Mathieu Blondel, Peter Prettenhofer, Ron flows. In Proceedings of the 20th International Con- Weiss, Vincent Dubourg, Jake Vanderplas, Alexan- ference on Electronic Publishing. IOS Press. dre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and douard Duchesnay. 2011. Yi-Hsuan Lee and Alina von Davier. 2013. Moni- Scikit-learn: Machine learning in Python. Journal toring Scale Scores over Time via Quality Control of Machine Learning Research, 12:2825–2830. Charts, Model-Based Approaches, and Time Series Techniques. Psychometrika, 78(3):557–575. Randall D. Penfield. 2016. Fairness in Test Scoring. In Neil J. Dorans and Linda L. Cook, editors, Fairness G. Ling, P. Mollaun, and X. Xi. 2014. A Study on the in Educational Assessment and Measurement, pages Impact of Fatigue on Human Raters when Scoring 55–76. Routledge. Speaking Responses. Language Testing, 31:479– 499. Les Perelman. 2014. When the state of the art is Jinghua Liu and Neil J. Dorans. 2016. Fairness in Counting Words. Assessing Writing, 21:104–111. Score Interpretation. In Neil J. Dorans and Linda L. Cook, editors, Fairness in educational assessment Chaitanya Ramineni and David M. Williamson. 2013. and measurement, pages 77–96. Routledge. Automated Essay Scoring: Psychometric Guidelines and Practices. Assessing Writing, 18(1):25–39. Anastassia Loukina and Heather Buzick. In print. Au- tomated Scoring of Speakers with Speech Impair- Andrea Romei and Salvatore Ruggieri. 2013a. A ments. ETS Research Report Series, In print. Multidisciplinary Survey on Discrimination Analy- sis. The Knowledge Engineering Review, pages 1– Anastassia Loukina, Klaus Zechner, Lei Chen, and 57. Michael Heilman. 2015. Feature Selection for Automated Speech Scoring. In Proceedings of the Andrea Romei and Salvatore Ruggieri. 2013b. Dis- Tenth Workshop on Innovative Use of NLP for Build- crimination Data Analysis: A Multi-disciplinary ing Educational Applications, pages 12–19, Denver, Bibliography. In Bart Custers, Toon Calders, Bart Colorado, June. Association for Computational Lin- Schermer, and Tal Zarsky, editors, Discrimination guistics. and Privacy in the Information Society: Data Min- Binh Thanh Luong, Salvatore Ruggieri, and Franco ing and Profiling in Large Databases, pages 109– Turini. 2011. k-NN as an Implementation of Situ- 135. Springer Berlin Heidelberg. ation Testing for Discrimination Discovery and Pre- vention. In Proceedings of the 17th ACM SIGKDD Mark D. Shermis. 2014a. State-of-the-art Automated international conference on Knowledge discovery Essay Scoring: Competition, Results, and Future Di- and data mining, pages 502–510. rections from a United States demonstration. As- sessing Writing, 20:53–76. Nitin Madnani and Aoife Cahill. 2016. Automated Scoring of Content. Presented at the panel on Fair- Mark D. Shermis. 2014b. The Challenges of Emu- ness and Machine Learning for Educational Prac- lating Human Behavior in Writing Assessment. As- tice, Annual Meeting of the National Council on sessing Writing, 22:91–99. Measurement in Education, Washington DC. Swapna Somasundaran, Jill Burstein, and Martin Nitin Madnani, Aoife Cahill, and Brian Riordan. 2016. Chodorow. 2014. 
Lexical Chaining for Measuring Automatically Scoring Tests of Proficiency in Music Discourse Coherence Quality in Test-taker Essays. Instruction. In Proceedings of the 11th Workshop In Proceedings of COLING 2014, the 25th Inter- on Innovative Use of NLP for Building Educational national Conference on Computational Linguistics: Applications, pages 217–222, San Diego, CA, June. Technical Papers, pages 950–961, Dublin, Ireland, Association for Computational Linguistics. August. Dublin City University and Association for Computational Linguistics. Koray Mancuhan and Chris Clifton. 2014. Combat- ing Discrimination Using Bayesian Networks. Artif. Alina von Davier and Jill Burstein. 2016. Fairness Intell. Law, 22(2):211–238. and Machine Learning for Educational Practice. Co- Mara Elena Oliveri and Alina von Davier. 2016. Psy- ordinated Session, Annual meeting of the National chometrics in Support of a Valid Assessment of Lin- Council on Measurement in Education, Washington guistic Minorities: Implications for the Test and DC. Sampling Designs. International Journal of Testing, 16(3):220–239. Alina von Davier. 2016. Fairness Concerns in Com- putational Psychometrics. Presented at the panel Ellis B. Page. 1966. The Imminence of ... Grad- on Fairness and Machine Learning for Educational ing Essays by Computer. The Phi Delta Kappan, Practice, Annual Meeting of the National Council 47(5):238–243. on Measurement in Education, Washington DC.

51 Zhen Wang and Alina von Davier. 2014. Monitor- ing of scoring using the e-rater automated scoring system and human raters on a writing test. ETS Re- search Report Series, 2014(1):1–21. Zhen Wang, Klaus Zechner, and Yu Sun. 2016. Mon- itoring the Performance of Human and Automated Scores for Spoken Responses. Language Testing, pages 1–20. David M. Williamson, Xiaoming Xi, and F. Jay Breyer. 2012. A Framework for Evaluation and Use of Au- tomated Scoring. Educational Measurement: Issues and Practice, 31(1):2–13. Andrew Gordon Wilson, Been Kim, and William Her- lands, editors. 2016. Proceedings of the NIPS Work- shop on Interpretable Machine Learning for Com- plex Systems. David H. Wolpert. 1992. Stacked Generalization. Neural Networks, 5:241–259. Xiaoming Xi. 2010. How do we go about Investigating Test Fairness? Language Testing, 27(2):147–170.

Yongwei Yang, Chad W. Buckendahl, Piotr J. Juszkiewicz, and Dennison S. Bhola. 2002. A Review of Strategies for Validating Computer- Automated Scoring. Applied Measurement in Ed- ucation, 15(4):391–412, oct. Helen Yannakoudakis and Ronan Cummins. 2015. Evaluating the Performance of Automated Text Scoring systems. In Proceedings of the Tenth Work- shop on Innovative Use of NLP for Building Edu- cational Applications, pages 213–223, Denver, Col- orado, June. Association for Computational Linguis- tics. Klaus Zechner, Derrick Higgins, Xiaoming Xi, and David M. Williamson. 2009. Automatic Scoring of Non-native Spontaneous Speech in Tests of Spoken English. Speech Communication, 51(10):883–895. Richard S Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. 2013. Learning fair representations. In Proceedings of ICML, pages 325–333. Mo Zhang, Neil J. Dorans, Chen Li, and Andre A. Rupp. In print. Differential feature functioning in automated essay scoring. In H. Jiao and R.W. Lis- sitz, editors, Test fairness in the new generation of large-scale assessment.

Michael J. Zieky. 2016. Fairness in Test Design and Development. In Neil J. Dorans and Linda L. Cook, editors, Fairness in Educational Assessment and Measurement, pages 9–32. Routledge.

Gender and Dialect Bias in YouTube’s Automatic Captions

Rachael Tatman
Department of Linguistics
University of Washington
[email protected]

Abstract tion in language use, for example, has been ex- tensively studied (Trudgill, 1972; Eckert, 1989, This project evaluates the accuracy of among many others). There is also robust vari- YouTube’s automatically-generated cap- ation in language use by native speakers across tions across two genders and five dialects dialect regions. For instance, English varies dra- of English. Speakers’ dialect and gen- matically between the United States (Cassidy and der was controlled for by using videos others, 1985), New Zealand (Hay et al., 2008) and uploaded as part of the “accent tag chal- Scotland (Milroy and Milroy, 2014). lenge”, where speakers explicitly iden- Sociolinguistic variation has historically been tify their language background. The re- a source of error for natural language process- sults show robust differences in accuracy ing. Differences across genders in automatic across both gender and dialect, with lower speech recognition accuracy have been previously accuracy for 1) women and 2) speakers reported, with better recognition rates reported for from Scotland. This finding builds on both men (Ali et al., 2007) and women (Gold- earlier research finding that speaker’s so- water et al., 2010; Sawalha and Abu Shariah, ciolinguistic identity may negatively im- 2013). Previous work has also found evidence pact their ability to use automatic speech of dialectal bias in speech recognition in both recognition, and demonstrates the need for English (Wheatley and Picone, 1991) and Ara- sociolinguistically-stratified validation of bic (Droua-Hamdani et al., 2012). In addition, systems. there are many anecdotal accounts of bias against dialect in speech recognition. For example, in 1 Introduction 2010 Microsoft’s Kinect was released and, while The overall accuracy of automatic speech recog- it shipped with Spanish voice recognition, it did nition (ASR) has increased substantially over the not recognize Castilian Spanish (Plunkett, 2010). past decade: a decade ago it was not uncommon This study investigates whether YouTube’s auto- to report a ASR error rates of 27% (Sha and Saul, matic captions have different WER for native En- 2007), while a recent Microsoft system achieved a glish speakers across two genders and five dialect word error rate (WER) of just 6.3% on the Switch- regions. board corpus (Xiong et al., 2016). Have these strong gains benefited all speakers evenly? Pre- 2 Method vious work, briefly discussed below, has found systematic bias both by dialect and gender. This Data for this project was collected by hand check- paper provides additional evidence that sociolin- ing YouTube’s automatic captions (Harrenstien, guistic variation continues to provide a source of 2009) on the word list portion of accent tag videos. avoidable error by showing that the WER is ro- Annotation was done by a phonetically-trained bustly different for male and female native English listener familiar with the dialects in the study. speakers from different dialect regions. YouTube’s automatic captions were chosen for It is well established in the field of sociolinguis- three reasons. The first is that they’re backed by tics that there is quantifiable variation in language Google’s speech recognition software, which is use between social groups. Gender-based varia- both very popular and among the more accurate

The second is the fact that the accuracy of YouTube's automatic captions specifically is an area of immediate concern to the Deaf community and is a frequent topic of (frustrated) discussion: they are often referred to as "autocraptions" (Lockrey, 2015) due to their low accuracy and the fact that content creators will often rely on them instead of providing accurate captions. Finally, YouTube's large, diverse userbase allowed for the direct comparison of speakers from a range of demographic backgrounds.

2.1 Accent tag

The accent tag, developed by Bert Vaux and based on the Harvard dialect survey (Vaux and Golder, 2003), has become a popular and sustained internet phenomenon. Though it was designed to elicit differences between dialect regions in the United States, it has achieved wide popularity across the English-speaking world. Variously called the "accent tag", "dialect meme", "accent challenge" or "Tumblr/Twitter/YouTube accent challenge", videos in this genre follow the same basic outline. First, speakers introduce themselves and describe their linguistic background, with a focus on regional dialect. Then speakers read a list of words designed to elicit phonological dialect differences. Finally, speakers read and then answer a list of questions designed to elicit lexical variation. For example, one question asks "What do you call gym shoes?", which speakers variously answered "sneakers", "tennis shoes", "gym shoes" or whatever the preferred term is in their dialect.

This study focuses on only the word list portion of the accent tag. Over time, the word list has been changed and appended, most notably with terms commonly used in on-line communities such as "GPOY" (gratuitous picture of yourself) or "gif" (graphics interchange format, a popular digital image format). Even with these variations, all videos discussed here used some subset of the word list shown in Table 1.

Table 1: Word list for accent tag videos. Again, Alabama, Aluminum, Arizona, Ask, Atlantic, Attitude, Aunt, Auto, Avocado, Bandanna, Both, Car, Caramel, Catch, Caught, Cool Whip, Coupon, Crayon, Data, Eleven, Envelope, Figure, Fire, Florida, Gif, GPOY, Guarantee, Halloween, Image, Iron, Lawyer, Marriage, Mayonnaise, Muslim, Naturally, New Orleans, Officer, Oil, Oregon, Pajamas, Pecan, Potato, Probably, Quarter, Roof, Route, Ruin, Salmon, Sandwich, Saw, Spitting, Sure, Syrup, Theater, Three, Tomato, Twenty, Waffle, Wagon, Wash, Water.

It should be noted that this is a particularly difficult ASR task. First, words are presented in isolation rather than within a frame sentence, which means that ASR systems cannot benefit from the use of language models. Second, the word-list portion of the accent tag challenge was intentionally constructed to only include words with multiple possible pronunciations and that serve as dialect markers. "Lawyer", for example, is generally pronounced [lɔɪ.jɚ] in New England and California, but [lɔ.jɚ] in Georgia (Vaux and Golder, 2003). These facts do place the ASR system used to generate the automatic captions at a disadvantage, and may help to explain the high error rates.

2.2 Speakers

A total of eighty speakers were sampled for this project. Videos for eight men and eight women from each dialect region were included. The dialect regions were California, Georgia, New England (Maine and New Hampshire), New Zealand and Scotland. These regions were chosen based on their high degree of geographic separation from each other, distinct local regional dialects and (relatively) comparable populations. Of these regions, California has the largest population, with approximately 38.8 million residents, and New England the smallest, with Maine and New Hampshire having a combined population of approximately 2.6 million (although the United States census bureau estimates the population of New England as a region at over 14 million as of 2010 (Bogue et al., 2010)).

Sampling was done by searching YouTube using the exact term "accent challenge" or "accent tag" and the name of the geographical region.

Only videos which had automatic captions were included in this study. For each speaker, the word error rate (WER) was calculated separately. Data and code used for analysis are available online1.
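The per-speaker WER computation is not spelled out in the paper; as a point of reference, the following is a minimal sketch of the standard word-level edit-distance formulation in Python (the whitespace tokenisation and the function and variable names are illustrative assumptions, not taken from the released analysis code):

def word_error_rate(reference, hypothesis):
    # Edit distance over words (substitutions, insertions, deletions) divided by reference length.
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# Example: word_error_rate("aunt avocado bandanna", "ant avocado bandana") == 2/3

Computed this way, a WER of 0 means the captions match the reference word list exactly, while values above 1 are possible when the system inserts more words than the reference contains.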

3 Results

The effect of dialect and gender on WER was evaluated using linear mixed-effects regression. Both speaker and year were included as random effects. Speaker was included to control for both individual variability in speech clarity and also recording quality, since only one recording per speaker was used. Year was included to control for improvements in ASR over time. Automatic captions are generated just after the video is uploaded to YouTube, and the recordings used were uploaded over a five year period, so it was important to account for overall improvements in speech recognition.

A model which included both gender and dialect as fixed effects more closely fit the data (i.e. had a lower Akaike information criterion) than nested models without gender (χ2(5, N=80) = 31, p < 0.01), without dialect (χ2(5, N=80) = 14, p < 0.01) or without either (χ2(5, N=80) = 31, p < 0.01). In terms of dialect, speakers from Scotland had reliably worse performance than speakers from the United States or New Zealand, as can be seen in Figure 1. The lower level of accuracy for Scottish English cannot be explained by, for example, a small number of speakers of that variety. The population of New Zealand, the dialect which had the second-lowest WER, is roughly 80% that of Scotland. Nor is it a factor of wealth: Scotland and New Zealand have a GDP per capita that falls within one hundred US dollars of each other.

Figure 1: YouTube automatic caption word error rate by speaker's dialect region. Points indicate individual speakers.

There was also a significant effect of gender: the word error rate was higher for women than men (t(78) = -3.5, p < 0.01). This is shown in Figure 2. This is somewhat surprising given earlier studies which found the opposite result (Goldwater et al., 2010; Sawalha and Abu Shariah, 2013).

Figure 2: YouTube automatic caption word error rate by speaker's gender. Points indicate individual speakers.

In addition, there was an interaction between gender and dialect. Adding an interaction term between gender and dialect to the model above significantly improved model fit (χ2(5, N=80) = 16, p < 0.01). As can be seen in Figure 3, the effect of gender was not equal across dialects. Differences between genders were largest for speakers from New Zealand and New England.
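The paper reports the model comparison only in prose; a rough sketch of how such a nested-model comparison could be run with the statsmodels package is shown below. This is not the author's released analysis (linked above); the column names, the file name and the use of upload year alone as the grouping factor are simplifying assumptions made here for illustration.

import pandas as pd
import statsmodels.formula.api as smf
from scipy.stats import chi2

# Hypothetical layout: one row per speaker with columns wer, gender, dialect, upload_year.
df = pd.read_csv("caption_wer.csv")

# Random intercept for upload year; the paper also includes a per-speaker random effect,
# omitted here because this toy layout has a single recording (row) per speaker.
full = smf.mixedlm("wer ~ gender + dialect", df, groups=df["upload_year"]).fit(reml=False)
reduced = smf.mixedlm("wer ~ dialect", df, groups=df["upload_year"]).fit(reml=False)

# Likelihood-ratio test for dropping the gender fixed effect.
lr = 2 * (full.llf - reduced.llf)
df_diff = len(full.fe_params) - len(reduced.fe_params)
print("LR =", lr, "p =", chi2.sf(lr, df=df_diff))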

1https://github.com/rctatman/youtubeDialectAccuracy

Figure 3: Interaction of gender and dialect. The difference in Word Error Rates between genders was largest for speakers from New Zealand and New England. In no dialect was accuracy reliably better for women than men.

Given the nature of this project, there is limited access to other demographic information about speakers which might be important, such as age, level of education, socioeconomic status, race or ethnicity2. The last is of particular concern given recent findings that automatic natural language processing tools, including language identifiers and parsers, struggle with African American English (Blodgett et al., 2016).

4 Effects of pitch on YouTube automatic captions

One potential explanation for the different error rates found for male and female speakers is differences in pitch. Pitch differences are one of the most reliable and well-studied perceptual markers of gender in speech (Wu and Childers, 1991; Gelfer and Mikos, 2005), and speech with a high fundamental frequency (typical of women's speech) has also been found to be more difficult for automatic speech recognizers (Hirschberg et al., 2004; Goldwater et al., 2010). A small experiment was carried out to determine whether pitch differences were indeed underlying the differing word error rates for male and female speakers.

First, a female speaker of standardized American English was recorded clearly reading the word list shown in Table 1. In order to better approximate the environment of the recordings in the accent tag videos, the recording was made using a consumer-grade headset microphone in a quiet environment, rather than using a professional-grade microphone in a sound-attenuated booth. The original recording had a mean pitch of 192 Hz and a median of 183 Hz, which is slightly lower than average for a female speaker of American English (Pépiot, 2014). The pitch of the original recording was artificially scaled both up and down 60 Hz in 20 Hz intervals using Praat (Boersma and others, 2002). This resulted in a total of seven recordings: the original, three progressively lower pitched and three progressively higher pitched. These resulting sound files were then uploaded to YouTube and automatic captions were generated. The video, and captions, can be viewed on YouTube3.
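The pitch manipulation described above could be scripted along the following lines using the parselmouth interface to Praat; this is an illustrative sketch rather than the exact procedure used for the paper (which used Praat directly), and the file names, time step and pitch floor/ceiling values are assumptions.

import parselmouth
from parselmouth.praat import call

snd = parselmouth.Sound("wordlist_original.wav")

for shift in (-60, -40, -20, 20, 40, 60):
    # Build a Praat Manipulation object (time step 0.01 s, pitch floor 75 Hz, ceiling 600 Hz).
    manipulation = call(snd, "To Manipulation", 0.01, 75, 600)
    pitch_tier = call(manipulation, "Extract pitch tier")
    # Shift the whole pitch contour up or down by a fixed amount in Hertz.
    call(pitch_tier, "Shift frequencies", snd.xmin, snd.xmax, shift, "Hertz")
    call([pitch_tier, manipulation], "Replace pitch tier")
    shifted = call(manipulation, "Get resynthesis (overlap-add)")
    shifted.save("wordlist_shift_{:+d}Hz.wav".format(shift), "WAV")

Together with the unmodified original, this yields the seven recordings described above.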

Overall, the automatic captions for the word list were very accurate; there were a total of 9 errors across all 434 tokens, for a WER of .02. Though it may be due to ceiling effects, there was no significant effect of pitch on accuracy. The much higher accuracy of this set of captions may be due to improvement in the algorithms underlying the automatic captions or to the nature of the speech in the recording, which was clear, careful and slow. More investigation with a larger sample of voices is necessary to determine whether pitch differences, or perhaps another factor such as intensity, underlie the differences in WER for male and female speakers. That said, even if gender-based differences in accuracy can be attributed to acoustic differences associated with gender, that would not account for the strong effect of dialect region.

5 Discussion

The results presented above show that there are differences in WER between dialect areas and genders, and that manipulating one speaker's pitch was not sufficient to affect WER for that speaker. While the latter needs additional data to form a robust generalization, the size of the effect for the former is deeply disturbing. Why do these differences exist?

2Speakers in this sample did not self-report their race or ethnicity and, given the complex nature of race and ethnicity in both New Zealand and the US, the researcher opted not to guess at speakers' race and ethnicity.
3https://www.YouTube.com/watch?v=eUgrizlV-R4

From a linguistics standpoint, no dialect is inherently more or less intelligible. The main factor which determines how well a listener understands a dialect is the amount of exposure they have had to it (Clarke and Garrett, 2004; Sumner and Samuel, 2009); with sufficient exposure, any human listener can learn any language variety. In addition, earlier research that found lower WER for female speakers shows that creating such ASR systems is possible (Goldwater et al., 2010; Sawalha and Abu Shariah, 2013). Given that there is also a difference between dialects, these differences are most likely due to something besides the inherent qualities of the signal.

One candidate for the cause of these differences is imbalances in the training dataset. Any bias in the training data will be embedded in a system trained on it (Torralba and Efros, 2011; Bock and Shamir, 2015). While the system behind YouTube's automatic captions is proprietary and it is thus impossible to validate this supposition, there is room for improvement in the social stratification of many speech corpora. LibriVox, for example, is a popular open-source speech data set that "suffers from major gender and per speaker duration imbalances" (Panayotov et al., 2015). TIMIT, the most-distributed corpus available through the Linguistic Data Consortium, is balanced for speaker dialect but approximately 69% of the speech in it comes from male speakers (Garofolo et al., 1993). Switchboard (Godfrey et al., 1992) undersamples women, Southerners and non-college-educated speakers. Many other popular speech corpora, such as the Numbers corpus (Cole et al., 1995) or the AMI meeting corpus (McCowan et al., 2005), don't include information on speaker gender or dialect background. Taken together, these observations suggest that socially stratified sampling of speakers has historically not been the priority during corpus construction for computational applications.

One solution to imbalanced training sets is to focus on collecting unbiased, socially stratified samples, or at the very least documenting the ways in which samples are unbalanced, for future speech corpora. This is already being addressed in the data collection of some new corpora such as the Automatic Tagging and Recognition of Stance (ATAROS) corpus (Freeman et al., 2014).

This does not help to address existing imbalances in training data, however. One way of doing this is to include information about speakers' social identity, such as the geographic location of the speaker (Ye et al., 2016), or to use gender-dependent speech recognition models (Konig and Morgan, 1992; Abdulla and Kasabov, 2001).

Regardless of the method used to correct biases, it is imperative that the NLP community work to do so. Robust differences in accuracy of automatic speech recognition based on a speaker's social identity are an ethical issue (Hovy and Spruit, 2016). In particular, if NLP and ASR systems consistently perform worse for users from disadvantaged groups than they do for users from privileged groups, this exacerbates existing inequalities. The ideal would be for systems to perform equally well for users regardless of their sociolinguistic backgrounds.

Differences in performance derived from speakers' social identity are particularly concerning given the increasing use of speech-analysis algorithms during the hiring process (Shahani, 2015; Morrison, 2017). Given the evidence that speech analysis tools perform more poorly on some speakers who are members of protected classes, this could legally be discrimination (Ajunwa et al., 2016). Error analyses that compare performance across sociolinguistically active social groups, like the one presented in this paper, can help ensure that this is not the case and highlight any imbalances that might exist.

Acknowledgments

The author would like to thank Richard Wright, Gina-Anne Levow, Alicia Beckford Wassink, Emily M. Bender, the University of Washington Computational Linguistics Laboratory and the reviewers for their helpful comments and support. Any remaining errors are mine. This work has been supported by National Science Foundation grant DGE-1256082.

References

Waleed H. Abdulla and Nikola K. Kasabov. 2001. Improving speech recognition performance through gender separation. changes, 9:10.

Ifeoma Ajunwa, Sorelle Friedler, Carlos Scheidegger, and Suresh Venkatasubramanian. 2016. Hiring by algorithm: predicting and preventing disparate impact. Available at SSRN.

S. Ali, K. Siddiqui, N. Safdar, K. Juluru, W. Kim, and E. Siegel. 2007. Affect of gender on speech

57 recognition accuracy. In American Journal of John J. Godfrey, Edward C. Holliman, and Jane Mc- Roentgenology, volume 188. American Roentgen Daniel. 1992. Switchboard: Telephone speech cor- Ray Society. pus for research and development. In Acoustics, Speech, and Signal Processing, 1992. ICASSP-92., Su Lin Blodgett, Lisa Green, and Brendan O’Connor. 1992 IEEE International Conference on, volume 1, 2016. Demographic Dialectal Variation in Social pages 517–520. IEEE. Media: A Case Study of African-American English. arXiv preprint arXiv:1608.08868. Sharon Goldwater, Dan Jurafsky, and Christopher D. Manning. 2010. Which words are hard to rec- Benjamin Bock and Lior Shamir. 2015. Assessing the ognize? prosodic, lexical, and disfluency factors efficacy of benchmarks for automatic speech accent that increase speech recognition error rates. Speech recognition. In Proceedings of the 8th International Communication, 52(3):181–200. Conference on Mobile Multimedia Communications, pages 133–136. ICST (Institute for Computer Sci- Ken Harrenstien. 2009. Automatic captions in ences, Social-Informatics and Telecommunications YouTube. The Official Google Blog, 11. Engineering). Jennifer Hay, Margaret Maclagan, and Elizabeth Gor- Paul Boersma et al. 2002. Praat, a system for do- don. 2008. New Zealand English. Edinburgh Uni- ing phonetics by computer. Glot international, versity Press. 5(9/10):341–345. Julia Hirschberg, Diane Litman, and Marc Swerts. Donald J. Bogue, Douglas L. Anderton, and Richard E. 2004. Prosodic and other cues to speech recognition Barrett. 2010. The population of the United States. failures. Speech Communication, 43(1):155–175. Simon and Schuster. Dirk Hovy and Shannon L. Spruit. 2016. The so- Frederic Gomes Cassidy et al. 1985. Dictionary of cial impact of natural language processing. In Pro- American Regional English. Belknap Press of Har- ceedings of the 54th Annual Meeting of the Associa- vard University Press. tion for Computational Linguistics, volume 2, pages 591–598. Constance M. Clarke and Merrill F. Garrett. 2004. Yochai Konig and Nelson Morgan. 1992. Gdnn: Rapid adaptation to foreign-accented English. The A gender-dependent neural network for continuous Journal of the Acoustical Society of America, speech recognition. In Neural Networks, 1992. 116(6):3647–3658. IJCNN., International Joint Conference on, vol- Ronald A. Cole, Mike Noel, Terri Lander, and Terry ume 2, pages 332–337. IEEE. Durham. 1995. New telephone speech corpora at Hank Liao, Erik McDermott, and Andrew Senior. CSLU. In Eurospeech. Citeseer. 2013. Large scale deep neural network acous- Ghania Droua-Hamdani, Sid-Ahmed Selouani, and tic modeling with semi-supervised training data for Malika Boudraa. 2012. Speaker-independent asr YouTube video transcription. In Automatic Speech for modern standard arabic: effect of regional ac- Recognition and Understanding (ASRU), 2013 IEEE cents. International Journal of Speech Technology, Workshop on, pages 368–373. IEEE. 15(4):487–493. Michael Lockrey. 2015. YouTube automatic craptions score an incredible 95% accuracy rate! Penelope Eckert. 1989. The whole woman: Sex and medium.com, July. [Online; posted 25-July-2015]. gender differences in variation. Language variation and change, 1(03):245–267. Iain McCowan, Jean Carletta, W. Kraaij, S. Ashby, S. Bourban, M. Flynn, M. Guillemot, T. Hain, Valerie Freeman, Julian Chan, Gina-Anne Levow, J. Kadlec, V. Karaiskos, et al. 2005. The AMI meet- Richard Wright, Mari Ostendorf, Victoria Zayats, ing corpus. 
In Proceedings of the 5th International Yi Luan, Heather Morrison, Lauren Fox, Maria An- Conference on Methods and Techniques in Behav- toniak, et al. 2014. ATAROS Technical Report ioral Research, volume 88. 1: Corpus collection and initial task validation. U. Washington Linguistic Phonetics Lab. James Milroy and Lesley Milroy. 2014. Real English: the grammar of English dialects in the British Isles. John S. Garofolo, Lori F. Lamel, William M. Fisher, Routledge. Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, and Victor Zue. 1993. TIMIT acoustic- Lennox Morrison. 2017. Speech analysis could now phonetic continuous . Linguistic data land you a promotion. BBC News, Jan. consortium, Philadelphia, 33. Vassil Panayotov, Guoguo Chen, Daniel Povey, and Marylou Pausewang Gelfer and Victoria A. Mikos. Sanjeev Khudanpur. 2015. Librispeech: an ASR 2005. The relative contributions of speaking funda- corpus based on public domain audio books. In mental frequency and formant frequencies to gender 2015 IEEE International Conference on Acous- identification based on isolated vowels. Journal of tics, Speech and Signal Processing (ICASSP), pages Voice, 19(4):544–554. 5206–5210. IEEE.

58 Erwan Pepiot.´ 2014. Male and female speech: a study of mean f0, f0 range, phonation type and speech rate in Parisian French and American English speakers. In Speech Prosody 7, pages 305–309. Luke Plunkett. 2010. Report: Kinect Doesn’t Speak Spanish (It Speaks Mexican). September. M. Sawalha and M. Abu Shariah. 2013. The effects of speakers’ gender, age, and region on overall perfor- mance of Arabic automatic speech recognition sys- tems using the phonetically rich and balanced Mod- ern Standard Arabic speech corpus. In Proceedings of the 2nd Workshop of Arabic WACL-2. Leeds. Fei Sha and Lawrence K. Saul. 2007. Large margin hidden Markov models for automatic speech recog- nition. Advances in neural information processing systems, 19:1249. Aarti Shahani. 2015. Now algorithms are deciding whom to hire, based on voice. All Tech Considered: Tech, culture and connection, March. Meghan Sumner and Arthur G. Samuel. 2009. The effect of experience on the perception and represen- tation of dialect variants. Journal of Memory and Language, 60(4):487–501. Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset bias. In Computer Vision and Pat- tern Recognition (CVPR), 2011 IEEE Conference on, pages 1521–1528. IEEE. Peter Trudgill. 1972. Sex, covert prestige and linguis- tic change in the urban British English of Norwich. Language in society, 1(02):179–195. Bert Vaux and Scott Golder. 2003. The Harvard di- alect survey. Cambridge, MA: Harvard University Linguistics Department. Barbara Wheatley and Joseph Picone. 1991. Voice Across America: Toward robust speaker- independent speech recognition for telecommuni- cations applications. Digital Signal Processing, 1(2):45–63. Ke Wu and Donald G. Childers. 1991. Gender recognition from speech. part i: Coarse analysis. The journal of the Acoustical society of America, 90(4):1828–1840. W. Xiong, J. Droppo, X. Huang, F. Seide, M. Seltzer, A. Stolcke, D. Yu, and G. Zweig. 2016. The Microsoft 2016 Conversational Speech Recognition System. arXiv preprint arXiv:1609.03528. Guoli Ye, Chaojun Liu, and Yifan Gong. 2016. Geo- location dependent deep neural network acoustic model for speech recognition. In 2016 IEEE Inter- national Conference on Acoustics, Speech and Sig- nal Processing (ICASSP), pages 5870–5874. IEEE.

Integrating the Management of Personal Data Protection and Open Science with Research Ethics

David Lewis, ADAPT Centre, Trinity College Dublin, Ireland ([email protected])
Joss Moorkens, ADAPT Centre/SALIS, Dublin City University, Ireland ([email protected])
Kaniz Fatema, ADAPT Centre, Trinity College Dublin, Ireland ([email protected])

Abstract

This paper examines the impact of the EU General Data Protection Regulation, in the context of the requirement from many research funders to provide open access research data, on current practices in Language Technology Research. We analyse the challenges that arise and the opportunities to address many of them through the use of existing open data practices for sharing language research data. We discuss the impact of this also on current practice in academic and industrial research ethics.

1 Introduction

Language Technology (LT) research is facing an unprecedented confluence of issues in the management of experimental data. The EU's adoption of the General Data Protection Regulation (GDPR) (European Parliament and Council of the European Union, 2016) imposes new requirements for tracking informed consent for the usage of personal data that may impact all European LT research significantly. National guidelines now need to be established on how GDPR applies to scientific data, and given the large penalties involved, this uncertainty presents significant institutional risk for those undertaking research with the unanonymised or unanonymisable data often needed in LT research.

In addition, the European Commission (EC) and other research funding bodies increasingly encourage open science practices. The aim is to publish research data alongside research papers in order to reduce the cost of obtaining research data and improve the repeatability, replicability and reproducibility of research. While this is a positive move for the quality and integrity of LT research, it must respect the needs of data protection legislation, including different EU member states' implementation of GDPR, and the data protection regimes in jurisdictions outside the EU. These may greatly complicate and delay the benefits of open science policies. This paper reviews these trends and aims to distil the issues that research institutes as well as national and transnational research bodies need to face in the coming years to effectively manage research data amid these parallel and sometimes conflicting needs. In particular, we highlight the interdependency between these issues and how those who manage research ethics will need to react.

2 GDPR and LT Research Data

As Hovy and Spruit (2016) point out, language data contains latent characteristics of the person producing it, and language technology therefore has the inherent potential to expose personal characteristics of the individual. Coulthard (2000) notes that identification of authors is very difficult from linguistic data alone, but has been successful when accompanied by metadata "information which massively restricts the number of possible authors". This presents a distinct data protection challenge for the sharing and reuse of language resources, as they are difficult to reliably anonymise and in some cases can already be used as a biometric.

As the sharing of language resources is an established feature of LT research internationally, we must carefully examine the provisions coming into force in the EU with the introduction of GDPR. As an example we can consider research conducted into the productivity changes to translator practice resulting from the use of LT. Translation memory (TM) data is often used for MT training, although identifying metadata is almost always re-

60 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 60–65, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics moved beforehand. Measures to retain the meta- Data Processor, and in LT research this would cor- data in order to strengthen copyright claims in respond to other research organisations receiving respect of translators, as suggested by Moorkens and reusing research data from the controller. The et al. (2016), would create a risk of data breach potential penalties for a data breach fall within two under the terms of GDPR. This means that one categories with differing maximum fines. A fine of possibility for extending human translator earn- up to 20 million or 4% of turnover (whichever is ings will almost definitely become an impossibil- greater) may be imposed for failure to adhere to ity for creators of MT systems. Machine transla- basic principles for processing, including condi- tion (MT) is another popular LT technique use in tions for consent (Articles 5, 6, 7 and 9), infring- translation practice. The impact of MT is being ing on the rights of Data Subjects (Articles 20, 21, increasingly assessed through detailed analyses of 22), or improper international data transfers (Arti- keystroke logs of translators making corrections to cles 44-49). Other failures to comply, such as by such translations. These logs may also be pub- failing to obtain proper consent for childrens’ data, lished to accompany such studies (Carl, 2012), but to keep sufficient records, or to apply proper safe- are known at the level of keystroke timings to pos- guards, may result in a fine of up to 10 million or sess biometric signals that can identify the transla- 2% of turnover (whichever is greater). Both Data tor. Another growing practice is translation dicta- Controllers and Data Processors may be consid- tion using automated speech translation. Here, re- ered liable for the security of personal data, and peatable studies may involve the sharing of record- any data breach must be reported within 72 hours. ings and transcripts of spoken translation, where again speech recording could be used to identify The GDPR explicitly encompasses the speaker. pseudonymised data, which would require additional information (stored separately) to 2.1 What is the GDPR? identify the Data Subject. This would include, for example, TM data with a translation unit ID that The GDPR is an EU Regulation, adopted in April can be attributed to an individual. Personal data 2016 and due to come into force in May 2018. It should be retained for a period no longer than is addresses protection of people with regard to the necessary to accomplish the purpose for which processing and free movement of personal data, it was collected. However long-term archiving replacing the 1995 Data Protection Directive. The is permitted if this is in the public interest for GDPR (Article 4) defines Personal Data as “in- scientific and historical research purposes, or formation relating to an identified or identifiable statistical purposes, providing that there are some natural person”, who it refers to as a Data Sub- safeguards. These exemptions for research aim ject. An identifier for the Data Subject may be a to reconcile privacy with data-driven innovation name, an identification number, location data, an and the public good that may result. 
The GDPR online identifier or “one or more factors specific to states that the designation of scientific research the physical, physiological, genetic, mental, eco- should be “interpreted in a broad manner” in- nomic, cultural or social identity” of the Data Sub- cluding technological development, fundamental ject. It is these latter factors that perhaps lie latent research, applied research and privately funded in language resources and data that is increasingly research. Importantly, therefore, consideration subject to analysis by language technologies, such of GDPR exemptions for LT research may have as samples of utterances from specific data sub- widespread implications for industry as well as jects. for academia. GDPR might not apply to data The organisation that collects and uses personal processing where the focus is not on “personal information is the Data Controller, and bears the data, but aggregate data” and is statistical rather primary responsibility for implementing the pro- than referring to a particular individual. Where visions of GDPR. This role would be conducted personal data is processed however, separate by research institute which will be responsible for consents are required for different processing GDPR on many forms of personal data (e.g. stu- activities. However, providing that safeguards dent, staff and alumni records) beyond that gen- are implemented, secondary processing and erated by research. Other organisations in receipt processing of sensitive categories of data may of personal data from a controller is known as a be permitted for research purposes where data

61 has been collected lawfully. Article 89, which search data repositories. However, for such open addresses exemptions for scientific research that, access to work at scale, improved level of inter- states these safeguards should include technical operability will be required for the meta-data as- and organisational measures to protect data, sociated with data sets made availalbe through following the principle of data minimisation, and different institutional research data repositories. may include pseudonymisation or anonymisation Such meta-data interoperability is needed to sup- where possible or appropriate. As discussed above port the aggregation, indexing and searching of however, these mechanisms may not be adequate experimental data from different repositories so for protecting data subject identity in the sharing that researchers can find suitable data with less and reuse of language resources. Significantly, the effort. Further, reflecting data protection and re- precise nature of the safeguard required by Article search ethics properties in such meta-data will also 89 are left for EU member states to legislate on reduce the effort required to ensure that reusing (Beth Thompson, 2016). So while this enables experimental data from another source does not in- interpretation of GDPR that aligns with existing cur data protection compliance risks. national standards for research data, different interpretations may impede efforts to share and 3.1 Open Data for Open Science reuse experimental data internationally if differing In parallel to other initiatives, Linked Open Data GDPR enforcement regimes emerge. based on open data standards of the World Wide Web Consortium is being adopted as a common 3 Requirements of Open Science means for sharing all types of data between organ- The requirement for open access research publica- isations, with strong uptake reported in the public tion of the results of publicly funded research has sector. Linked Open Data is based upon the prin- become common practice in recent years. How- ciple of interlinking resources and data with stan- ever the central importance of data in all empiri- dardised Resource Description Framework (RDF) cal research, in addition to the growth of big data and Uniform Resource Identifiers (URIs) that can research approaches, has heightened the call for be read and queried by machines through powerful common policies on publishing and sharing re- standardised querying mechanisms (Bizer et al., search data associated with a publication (of Eu- 2009). ropean Research Universities, 2013). Open RDF-based data vocabularies such as DCAT 2 help in expressing authorship of research Major research funders, including the EC, have 3 widened their guidelines on open science to now data sets, while ODRL vocabulary can express address open research data (European Commis- usage rights and licensing. The provenance of sion, 2016). The aim in doing so is to make it eas- an experiment, in terms of which people and pro- ier for researchers to: build on previous research grammes performed which actions on which re- sources at what time, can be captured and mod- and improve the quality of research results; col- 4 laborate and avoid duplication of effort to improve elled using the PROV family of data vocabular- the efficiency of publicly funded research; acceler- ies. Garijo et al. 
(2014) build on these standards ate progress to market in order to realise economic to propose an open data format for recording both and social benefits; and involve citizens and soci- the sequence of experimental steps and the data ety. It is anticipated that EC-funded projects will resources passing between them. This would al- transition from optional involvement in open data low the publication and discovery of experimental pilots to working under a stronger obligation to descriptions with specific metadata (such as usage provide open access to research data. This how- rights or data subject consent) associated with spe- ever has to be provided within the constraints of cific data elements. EU and national data regulations, now including Experiential description using these open vo- GDPR. Initiatives such as the Open Access Infras- cabularies can be collected or aggregated to form 1 linked repositories such as those supported by tructure for Research in Europe (OpenAIRE) pro- 5 vide additional information and support on linking OpenAire and Linghub , which are being piloted publications to underlying research data, and is de- 2http://www.w3.org/TR/vocab-dcat/ veloping open interfaces for exchange between re- 3http://www.w3.org/TR/odrl/ 4http://www.w3.org/TR/prov-0/ 1https://www.openaire.eu/ 5http://linghub.lider-project.eu

62 for language resources. Existing research for these form processing of that data for purposes that are standard vocabularies has provided best practice compatible with that consent. for publishing data sets’ metadata as linked open This will mean therefore that exchange of LT data (Brummer et al., 2014). The machine read- research data with latent personal features can- able nature of metadata can make it easy for an not proceed without an appropriate contract on the automated system to verify the correctness of the usage of this data being signed and recorded for data, or perform other operations such as check- GDPR compliance purposes. It should also in- ing of data formats, completeness of metadata clude an undertaking by the data processor not to and the provenance of data used (Freudenberg et attempt the identification of natural persons from al., 2016). This approach is amenable to exten- the data, including through analysis in aggregation sion with domain-specific experimental metadata, with other data. This goes beyond the standard such as the machine learning metadata proposed form license agreements already in place for reuse in the MEX vocabulary (Esteves et al., 2015). The of language resources, e.g. the META-SHARE li- LT research community has already developed a cences 6, which focus mostly on issues of copy- schema, termed META-SHARE (Piperidis, 2012), right ownership and usage conditions related to for language resource metadata that shares many attribution or, in some cases, to compensation. characteristics with the OpenAire scheme. The To avoid GDPR unduly impeding the sharing and META-SHARE schema has also been mapped reuse of experimental data, we recommend that onto RDF with relevant attributes mapped to spe- bodies such as the EC and ELRA develop stan- cific properties from the standard vocabularies dard form contract terms for the reuse of research previously mentioned, and is used by LingHub as data that the LT research community can use in an aggregation source (McCrae et al., 2015). documenting this aspect of GDPR compliance. GDPR, in Recital 33, acknowledges that it “is 4 Discussion often not possible to fully identify the purpose of personal data processing for scientific research Combining the emerging imperative of GDPR purposes at the time of collection”. It allows data compliance and data science poses the following subjects to provide consent to only specific parts of challenges for organisations undertaking LT re- a research activity “when in keeping with recog- search and concerned with research ethics. Firstly nised ethical standards for scientific research”. the encouragement by funders for researchers to This highlights the fact that good practice in re- provide open access to experimental data must be search ethics already incorporates many features tempered by the overriding legal requirements of now formalised in GDPR, i.e. the need for a clear GDPR compliance. While GDPR offers deroga- explanation of the data collection and processing tion of certain rights when dealing with personal purposes; the explicit gathering of informed con- data for the purposes of scientific research, this sent and the option of the data subject to with- does not remove the obligation for research per- draw from any part of the research activity at any forming organisations in the EU to demonstrate time. 
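As an illustration of how the DCAT, ODRL and PROV vocabularies discussed in this section can be combined to describe an experimental data set in machine-readable form, the sketch below uses the rdflib Python library. The identifiers, the consent-record property and the policy name are hypothetical placeholders rather than terms from any existing schema such as META-SHARE.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DCAT, DCTERMS, PROV, RDF

EX = Namespace("http://example.org/lt-research/")   # hypothetical project namespace
ODRL = Namespace("http://www.w3.org/ns/odrl/2/")

g = Graph()
dataset = EX["keystroke-logs-v1"]
collection = EX["collection-activity-2016"]

g.add((dataset, RDF.type, DCAT.Dataset))
g.add((dataset, DCTERMS.title, Literal("Translator keystroke logs (pseudonymised)")))
g.add((dataset, DCTERMS.license, URIRef("https://creativecommons.org/licenses/by-nc/4.0/")))
# Usage conditions expressed as an ODRL policy attached to the data set.
g.add((dataset, ODRL.hasPolicy, EX["policy-no-reidentification"]))
# Provenance: which activity produced the data, and under which (hypothetical) consent record.
g.add((dataset, PROV.wasGeneratedBy, collection))
g.add((collection, RDF.type, PROV.Activity))
g.add((collection, EX.informedConsentRecord, EX["consent-form-2016-03"]))

print(g.serialize(format="turtle"))

Aggregators could then index such descriptions and let a prospective reuser filter data sets by licence, consent conditions and provenance before any data is transferred.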
If EU proposals for ePrivacy (European Par- their conformance to GDPR, to conduct data pro- liament and Council of the European Union, 2017) tection impact assessments, and to ensure that the move on to become a regulation, the data subject appropriate safeguards for the derogation are in will have more control over whether data may be place, especially in cases where anonymisation of repurposed by being offered “the option to prevent experimental data is not possible. third parties from storing information on the termi- In GDPR terms, a data processor receiving data nal equipment of an end user or processing infor- from an LT experiment will need to know the mation already stored on that equipment” (Article terms of consent agreed to by the data subject in 10). Key to data subjects exercising such control giving the primary data collector permission to use over the processing of personal data is their full their personal data for a stated purpose. This will understanding of the scientific research purposes enable the receiving party to assess whether the to which their data will be subject. purpose to which they now intend to put the data Further research is required to assess the com- is compatible with that consent. Given that infor- prehensibility of plain language descriptions of mation, the receiving data processor would also need to give an undertaking that it will only per- 6http://wizard.elda.org/

63 purpose typically used by researchers for data sub- data standards tracking the transfer and use of per- jects. The META-SHARE schema, for example, sonal data in a way that can support GDPR com- supports a ’purpose’ attribute, but it is populated pliance checking. Use of open data standards that with names of different areas of LT research that capture the detail of data processing workflows are unlikely to be accessible to data subjects. Fur- may be annotated to better record the processes ther, more applied research, perhaps conducted by by which: informed consent is gathered from in- industry, may be conducted with the known in- dividual data subjects; their individual objections tention of supporting new service features, e.g. to specific uses of personal data is handled and the personalisation, targeted marketing, or differential purposes to which personal data is put is audited. pricing. As these are of direct concern to the data Consent must be first collected, then stored and subject, such intentions should not be concealed processed for checking compliance with data pro- by statements of purpose related to the broader cessing. generation of knowledge when seeking informed Consent can be modified. The modification can consent. From such research, the LT community be initiated by the data subject or due to change and research institutes should seek to find clas- of context the controller can re-solicit for consent sifications of purpose that are both accessible to that can lead to modification of consent. Consent data subjects and convey the differences in pur- can be revoked. After revocation, the data may pose of basic and applied research. Current rules be archived for the time necessary for research re- and practices on academic research ethics tend to sult verification and finally destroyed. Further re- vary from institution to institution, with the in- search is needed on how to annotate open experi- tention of protecting participants and researchers mental workflow provenance records with details by making clear the purpose of data collection, of consent management and its impact on the life- and requesting explicit consent to use personal cycle management of the subject’s data. A pos- data for that purpose. Researchers may have to sible benefit of an open data approach, is that it make an undertaking with regard to data protec- may allow individual institutes to publish the at- tion, but there is rarely any follow-up to ascer- tributes of their differing ethics review processes, tain whether the data has been stored or destroyed allowing collection and analysis of variations that as promised. In contrast, GDPR compliance will may assist in normalising standards. This will also require rigorous organisational and technical sys- allow those reusing others’ data to be reassured tems for record keeping and tracking the use to that it was collected under ethical standards with which data is put by data processors, including which they are familiar. Ultimately this could re- data transfer to processors in other institutes and sult is a simple badge system, similar to that em- other jurisdictions. Further, much LT research data ployed for creative commons, that could simplify processing involves secondary processing of in- the selection of LT research data according to the dustrial data, such as TMs or glossaries. 
As these compatibility of research ethics and data protec- are not collected from directly from experimental tion protocols under which it was produced with data subjects but via industrial processes, this data those sought by the research hoping to use that is collected, stored, retained, and shared without data. Design of such a seal for reuse of experimen- a reliable trace of research ethics clearance. Fur- tal data could benefit from the work already under- ther, as LT research is increasingly undertaken by way in developing data protection seals 7 given the large companies with access to vast data-sets of overlap between research ethics protocols and the customer information, the resulting experimental informed consent requirements of GDPR. data is typically not subject to the a priori scrutiny of institutional review boards or ethics commit- Acknowledgements tees as is common with publicly funded research. Supported by the ADAPT Centre for Digital Con- This disparity between public and private norms tent Technology which is funded under the SFI for undertaking research ethics may create barri- Research Centres Programme (Grant 13/RC/2106) ers to research collaboration and impede the pro- and is co-funded under the European Regional De- gression of reproducable research results into the velopment Fund. public domain. An opportunity therefore exists for the LT research community to better leverage open 7https://ico.org.uk/for-organisations/improve-your- practices/privacy-seals/

64 References Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceed- Beth Thompson. 2016. Analysis: Research and the ings of the 54th Annual Meeting of the Association of general data protection regulation. Technical report, Computational Linguistics, pages 591–598, August. July. John P. McCrae, Penny Labropoulou, Jorge Gra- Chris Bizer, Tom Heath, and Martin Hepp. 2009. Spe- cia, Marta Villegas, V´ıctor Rodr´ıguez-Doncel, and cial issue on linked data. International Journal on Philipp Cimiano, 2015. One Ontology to Bind Them Semantic Web and Information Systems, 5(3):1–22. All: The META-SHARE OWL Ontology for the Inter- operability of Linguistic Datasets on the Web, pages Martin Brummer, Ciro Baron, Ivan Ermilov, Markus 271–282. Springer International Publishing, Cham. Freudenberg, Dimitris Kontokostas, and Sebastian Hellmann. 2014. Dataid: Towards semantically rich Joss Moorkens, David Lewis, Wessel Reijers, Eva Van- metadata for complex datasets. In Proceedings of massenhove, and Andy Way. 2016. Translation re- the 10th International Conference on Semantic Sys- sources and translator disempowerment. In ETHI- tems (SEM’14), New York, USA. ACM. CA 2016: Workshop on ETHics In Corpus Collec- tion, Annotation and Application, Portoroz, Slove- Michael Carl. 2012. Translog-ii: a program for record- nia, May. ing user activity data for empirical reading and writ- ing research. In Proceedings of the Eight Interna- League of European Research Universities. 2013. tional Conference on Language Resources and Eval- Leru roadmap for research data. Technical report, uation (LREC’12), Istanbul, Turkey, may. European Dec. Language Resources Association (ELRA). Stelios Piperidis. 2012. The meta-share language Malcolm Coulthard. 2000. Whose text is it? on resources sharing infrastructure: Principles, chal- the linguistic investigation of authorship. In Srikant lenges, solutions. In Proceedings of the Eight In- Sarangi and Malcolm Coulthard, editor, Discourse ternational Conference on Language Resources and and Social Life, chapter 15, pages 270–288. Rout- Evaluation (LREC’12), Istanbul, Turkey, may. Euro- ledge, London. pean Language Resources Association (ELRA). Diego Esteves, Diego Moussallem, Ciro Baron Neto, Tommaso Soru, Ricardo Usbeck, Markus Acker- mann, and Jens Lehmann. 2015. Mex vocabulary: A lightweight interchange format for machine learn- ing experiments. In Proceedings of the 11th Interna- tional Conference on Semantic Systems, SEMAN- TICS ’15, pages 169–176, New York, NY, USA. ACM.

European Commission. 2016. Guidelines on open ac- cess to scientific publications and research data in horizon 2020. Technical report, February.

European Parliament and Council of the European Union. 2016. Regulation (eu) 2016/679 of the euro- pean parliament and of the council (GDPR). Official Journal of the European Union, 119(1):1–88.

European Parliament and Council of the European Union. 2017. Proposal for a regulation concern- ing the respect for private life and the protection of personal data in electronic communications.

Markus Freudenberg, Martin Brummer, Jessika Ruck- nagel, Robert Ulrich, Thomas Eckart, Dimitris Kon- tokostas, and Sebastian Hellmann. 2016. The meta- data ecosystem of dataid. In Special Track on Meta- data & Semantics for Open Repositories at 10th In- ternational Conference on Metadata and Semantics Research.

Daniel Garijo, Yolanda Gil, and Oscar Corcho. 2014. Ninth workshop on workflows in support of large- scale science (works). In Ninth Workshop on Work- flows in Support of Large-Scale Science (WORKS),, New Orleans, USA, Nov. IEEE.

Ethical Considerations in NLP Shared Tasks

Carla Parra Escartín1, Wessel Reijers2, Teresa Lynn2, Joss Moorkens1, Andy Way2 and Chao-Hong Liu2

1ADAPT Centre, SALIS, Dublin City University, Ireland
2ADAPT Centre, School of Computing, Dublin City University, Ireland
{carla.parra, wessel.reijers, teresa.lynn, joss.moorkens, andy.way, chaohong.liu}@adaptcentre.ie

Abstract

Shared tasks are increasingly common in our field, and new challenges are suggested at almost every conference and workshop. However, as this has become an established way of pushing research forward, it is important to discuss how we researchers organise and participate in shared tasks, and make that information available to the community to allow further research improvements. In this paper, we present a number of ethical issues along with other areas of concern that are related to the competitive nature of shared tasks. As such issues could potentially impact on research ethics in the Natural Language Processing community, we also propose the development of a framework for the organisation of and participation in shared tasks that can help mitigate against these issues arising.

1 Introduction

Shared tasks are competitions to which researchers or teams of researchers submit systems that address specific, predefined challenges. The competitive nature of shared tasks arises from the publication of a system ranking in which the authors of the systems achieving the highest scores obtain public acknowledgement of their work. In this paper, we discuss a number of ethical issues and various other areas of concern that relate to the competitive nature of shared tasks. We then move to propose the creation of a common framework for shared tasks that could help in overcoming these issues.

The primary goal of shared tasks is to encourage wider international participation in solving particular tasks at hand. A second objective is to learn from the competing systems so that research can move forward from one year to the next, or to establish best practices as to how to tackle a particular challenge.

Over the past few years, the organisation of and participation in shared tasks has become more popular in Natural Language Processing (NLP), speech and image processing. In the field of NLP study, researchers now have an array of annual tasks in which they can participate. For example, several shared tasks are organised at the Conference on Natural Language Learning (CoNLL)1, the Conference and Labs of the Evaluation Forum (CLEF)2, or the International Workshop on Semantic Evaluation (SEMEVAL)3. For those working on a topic that proves to be particularly challenging, it has also become a trend to propose a new shared task co-located at a related conference or workshop in order to encourage contributions from the wider community to address the problem at hand. The NLP community has seen a rapid increase in the number of shared tasks recently, with many repeated periodically while others have been organised only once. During all collocated workshops at ACL 2016 alone, a total of 9 new shared tasks were proposed, along with others held annually. The 2016 Conference on Machine Translation (WMT16), for instance, offered 6 shared tasks already held in previous years along with 4 new ones4.

A distinctive feature of shared tasks is their integral competitive nature. In the field of research ethics, the factor of competition in research projects has been shown to have potentially negative ethical consequences for upholding research

1http://www.signll.org/conll
2http://www.clef-initiative.eu/
3http://alt.qcri.org/semeval2016/
4http://www.statmt.org/wmt16/

66 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 66–73, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics integrity and the open character of scientific re- on speech processing and started in 1987.5 search. For instance, McCain (1991) has argued In 1992, new initiatives focused on the field that increased competition for funding and pub- of text understanding under the umbrella of lications in the field of genetics has resulted in the DARPA TIPSTER Program (Harman, 1992). undesirable ‘secretive’ behaviour of scientists, of Since then, researchers in NLP have experienced refusal to provide access to data sets or conflicts how this type of benchmarking for NLP tools and about ownership of experimental materials. Addi- systems has become a tradition in many sub-areas. tionally, Mumford and Helton (2001) argued that In fact, some of the current annual shared tasks the negative perception among researchers of the date all the way back to 1998 and 1999 when the intentions of their competitors might invoke un- first SEMEVAL (then called SENSEVAL-1)6 and ethical behaviour. These are serious consequences CONLL7 were organised. of elements of competition on the work of re- Typically, shared tasks consist of 4 distinct searchers. However, to date, little attention seems phases (Paroubek et al., 2007): to have been paid to preventing these problems from arising in the organisation of shared tasks in 1. Training phase, NLP research. 2. Dry-run phase, With the experience gathered in our commu- nity thanks to the organisation of shared tasks over 3. Evaluation phase, and the past 30 years, we believe the time is right to initiate an open discussion on a common ethical 4. Adjudication phase. framework for the organisation of shared tasks, in order to reduce the potential negative ethical con- During the training phase, participants are pro- sequences of their competitive character. Such dis- vided with data to calibrate and train their sys- cussions should be held in the spirit of trying to tems. Such systems are subsequently used to pro- globally establish – as a research community – cess a blind test set during the dry-run phase, and which ethical issues should be tackled and con- their results are evaluated against a ‘gold standard’ sidered across all shared tasks; the purpose of this previously prepared by the shared task organisers. paper is not to criticise how any particular shared In the adjudication phase, participants are asked task is or has been organised thus far in the field. to raise any issues observed during the evaluation The remainder of this paper is organised as fol- and validate the obtained results. lows: Section 2 is devoted to an overview of the 2.1 Why are shared tasks important in our role of shared tasks in NLP, including their defi- field? nition, importance as well as particular issues in the existing shared tasks in our field. Section 3 Shared tasks are important because they help boost is devoted to a discussion on the potential nega- the pace of development in our field and encour- tive ethical impacts of the factor of competition age a culture of improving upon the state-of-the- that is insufficiently regulated, and finally Section art. 
Shared tasks have an additional advantage: by 4 proposes steps towards the creation of a common using the same data, all systems can be evaluated framework for the organisation of shared tasks in objectively and comparisons across systems could NLP that assists at overcoming the ethical issues be made easier. we identify. At the same time, some best practices and de facto standards have evolved from shared tasks, 2 Shared Tasks in NLP e.g. the widely used CoNLL format used in pars- ing and many other NLP tasks, and the splitting As mentioned in Section 1, shared tasks are com- of German compounds in MT proposed by Koehn petitions to which researchers or teams of re- and Knight (2003). searchers submit systems that address a particular A by-product of these shared tasks are the new challenge. In the field of NLP, the first shared tasks datasets that are made available for use by the were initiated in the United States by NIST in 5 collaboration with DARPA (Mariani et al., 2014). See Pallett (2003) for an overview of these first shared tasks and the role that NIST played in them. Paroubek et al. (2007) report that the first shared 6http://www.senseval.org/ tasks – then called evaluation campaigns – focused 7http://www.cnts.ua.ac.be/conll99/npb/

67 wider research community. Shared tasks encour- task on unseen data. age the development of new resources and also In some cases, the shared task organisers will encourage innovative approaches to data collec- distinguish between two different tracks for the tion. Moreover, provided the data is made avail- same shared task depending on the source of the able, any researcher can measure the performance data being used to train the systems. In most cases, of their system against past shared task data. It all teams in an evaluation test their systems on the is also possible for any researcher outside one same datasets to allow for easier across-the-board shared task (possibly investigating different top- comparisons (‘closed’ track). Other shared tasks ics) to use the publicly available shared task data allow for the inclusion of additional data by indi- after the event to benchmark new systems or appli- vidual teams (‘open’ track). In these ‘open’ tracks, cations, and allow for replication of experiments, the inclusion of other data is not necessarily veri- e.g. Mate Tools development reported evaluations fied and based on a trust system. It is worth not- based on datasets from the CoNLL 2009 Shared ing that, to date, this system has worked well and, Task (Bohnet, 2010). to the best of our knowledge, there have been no Shared tasks with a large number of partici- known issues of mistrust in NLP shared tasks. pants can also indicate the need to tackle a partic- Depending on the type of shared tasks, differ- ular problem, or point to challenges that are par- ent evaluation methodologies will be used, rang- ticularly attractive for the NLP research commu- ing from purely automatic metrics, such as preci- nity. The participation of industry-based teams in sion and recall for many of the shared tasks focus- shared tasks shows that some of them are relevant ing on Information Retrieval, to human evaluation, beyond the academic research community. such as the ranking of MT outputs or automati- Taken together, shared tasks have proven them- cally generated text.8 selves to be very effective in incentivising re- search in specialised areas, but they come at a 3 Potential ethical issues concerning the cost: organisers need to prepare the datasets well competitive nature of shared tasks in advance, define the evaluation criteria, gather As we have seen in the previous section, there is enough interest for participation, rank the submit- currently a great variability and a lack of standard- ted systems, and so on. At the same time, there is isation in the organisation of shared tasks. Be- little information sharing among shared tasks that cause shared tasks have become an important part would allow organisers to benefit from the expe- of the scientific research in NLP, a certain level rience of others. As a result, shared tasks vary of standardisation is nonetheless required in or- greatly in the way they are organised, how the der to safeguard satisfactory levels of scientific in- datasets are shared, and the type of information tegrity and openness of scientific research. This (and data) which is available to participants and standardisation in the organisation of shared tasks the research community both before, during, and is needed specifically to mitigate potential nega- after the evaluation. tive ethical impacts of their competitive character. 
With a view to proposing a standard approach to 2.2 Variability across shared tasks shared task organisation, in this section, we dis- Depending on the task at hand, shared tasks are or- cuss the potential negative ethical impacts of com- ganised in different ways. In some cases (such as petition in scientific research and subsequently il- the MT shared tasks), no annotated data is needed, lustrate this by addressing potentially problematic and thus only aligned bilingual data is used. aspects of the organisation of shared tasks in NLP. In others, prior to the shared task, the organis- ers create annotated data that will be distributed 3.1 Ethical issues arising from competition in to all participating teams to allow them to pre- scientific research pare their systems for the task. Such annotated Competition is a factor in scientific research that is data is used with two main aims: (i) adjusting to not limited to the field of NLP. In the organisation the format required for submissions, and (ii) al- of contemporary science, competitive schemes for lowing researchers to explore the data to develop 8Paroubek et al. (2007) offer a good overview of the or- automatic systems, either rule-based or machine- ganisation of shared tasks and the different types of evalua- learning based, so that they are able to perform the tion that one may come across.

68 scientific positions, publication possibilities or re- peers (Anderson et al., 2007). This might search funding are increasingly influential. How- lead researchers to have the tendency to be- ever, shared tasks are distinctive from traditional have unethically with regards to their peers forms of research, such as the writing of individ- in order to preserve or strengthen their repu- ual research papers or the organisation of exper- tation. iments in a closed research project, because the element of competition is integral to the research 3.2 Potential negative effects of competition activity. In other words, shared tasks not only take in shared tasks in NLP place within a competitive context, they are com- The motivation for involvement in shared tasks has petitions per se. evolved somewhat over the recent past. Many re- For this reason, the effects of competition on the searchers in MT, for example, will participate in conduct of researchers should be taken seriously. the annual shared tasks organised at WMT, where As Anderson et al. (2007, p. 439) argue: “the rela- there is a ranking of the best systems for a pro- tionship between competition and academic mis- posed task. Participation and success in tasks conduct is a serious concern”. A number of neg- such as these are often used to demonstrate re- ative ethical impacts of competition in scientific search excellence to funding agencies. At the research are discussed in the literature on research same time, performance of the systems may also ethics. We suggest that the NLP community could have a greater impact on the funding for a com- draw on previous experiences and studies in the plete research area (not only for individual teams wider scientific community. We present three of or institutions). We only need to look back to the most important ones: the notorious ALPAC report (Pierce and Carroll, 1966), whose consequences for research funding Secretive behaviour. This effect of competi- • in the US for MT were devastating for a consider- tion results from the tendency of researchers able period. Such funding-related motivation can to give themselves an unfair competitive ad- in turn lead to increased competitiveness. vantage in terms of knowledge concerning the research challenge at hand. McCain When we revisit the shared tasks within NLP, (1991) suggests that this behaviour can have the potential negative ethical impacts of competi- several concrete forms, such as the unwill- tion identified in the literature on research ethics ingness to publish research results in a timely can also be found in this field. Here, we discuss fashion, refusal to provide access to data sets the main issues identified which require mecha- and conflicts concerning the ‘ownership’ of nisms to be established by our community to pre- experimental materials. vent them from happening.

Overlooking the relevance of ethical con- Secretiveness. Competitiveness can some- • • cerns. Another effect of competition is the times lead to secretiveness with respect to tendency of the teams competing to overlook the specific features used to tune a system the relevance of ethical concerns in their re- to ensure that the best methods and/or ap- search. As Mumford and Helton (2001) ex- proaches stay in the same institution/team. plain, this might have the form of disregard- Participants usually submit their system de- ing ethical concerns in general, or specifi- scriptions to the shared task, in the form of cally with regard to one’s own work while presentations and research papers. However, anticipating the potential ethical misconduct the way in which such systems are described of others (“if they can do it, why shouldn’t may vary greatly, as one can always choose we?”). This can lead to careless – or ques- a more abstract higher-level description to tionable – research conduct. avoid ‘spilling the beans’ about the method- ology applied, and retaining the knowledge Relations with other scientists. Because the • rather than sharing it. stakes in competitions can be very high (they might result in further or decreased research Unconscious overlooking of ethical con- • funding, or in opening up or closing off of cerns. Leading on from the secretiveness is- future career paths), competitions might have sue raised above, teams may unintentionally negative impacts on the relations between be vague in reporting details of their systems’

69 parameters and functionality solely on the ba- However, as shown by Fanelli (2010), re- sis that other teams have also previously re- searchers have a tendency to report only pos- ported in this way. Such practice can simply itive results. He claims that this may be arise from the existence of convention in the because they “attract more interest and are absence of guidelines or standards. cited more often”, adding that there is a be- lief that “journal editors and peer reviewers Potential conflicts of interest. Finally, an- • might tend to favour them, which will further other potential ethical issue is related to or- increase the desirability of a positive outcome ganisers or annotators being allowed to par- to researchers”. ticipate in the shared task in which they are involved. Again, while some find it unethical Furthermore, with PhD students being to participate in their own shared task, oth- pressed to publish in the top conferences in ers disagree, and the community trusts that their fields, they may be reluctant to sub- in such cases the organisers trained their sys- mit systems that do not report on positive re- tems under the same conditions as the rest of sults. As a result, while we always discover the participants, i.e. they did not take advan- what worked for a particular task, we are tage of prior access to data and did not train not usually told what did not work, although their systems for a longer period of time, or that may be of equal or (even) greater impor- have a sneak peak at – or hand-select for op- tance than the methodology that worked, as it timal performance – the test data to improve would help others to avoid repeating the same their system’s performance and cause it to mistake in the future. In fact, it may be the be highly ranked. Both points of view are case that the same approach has been tested perfectly valid, and in some cases even jus- by different institutions with no success and tified, e.g. teams working on (usually low- that we are incurring a hidden redundancy resourced) languages for which they them- that does not help us to move forward as a selves are one of the few potential partici- field. In order to prevent these issues from pants for those languages. While the overlap occurring, we should design mechanisms that of organisers, annotators and participants has incentivise the publication of negative results not yet revealed itself to be a major issue in and more thorough error analysis of systems our field, and the goodwill and ethical con- Similarly, it may be the case that although it duct of all involved is generally trusted, it is is highly desirable that industry-based teams worth considering the establishment of meth- participate in a shared task, some may be ods for minimising the risk of this happening reluctant to do so on the basis of the nega- in the future. One such measure could be for tive impact that this may have for their prod- the organisers to explicitly state whether the uct if it does not end up among the first 9 overlap is likely to happen. ranked. Thus, rather than strengthening the academia-industry relationship and learning Subsequently, we have identified a number of from each other, we risk making the gap be- other potential conflicts with the objectivity and tween the two bigger rather than bridging integrity of research that may arise from the com- it. Should we not address this and establish petitive nature of shared tasks in NLP. 
Whether in- mechanisms that encourage industrial teams tentional or unintentional, these issues are worth to participate in shared tasks without such as- considering when developing a common frame- sociated risks? work for the organisation of and participation in shared tasks: Withdrawal from competition. Some • Lack of description of negative results. The teams may prefer to withdraw from the com- • fact that negative results are also informa- petition rather than participate if they fear tive is something that no researcher will deny. that their performance may have a negative impact in their future funding: how could 9 This type of overlap was highlighted by the organis- research excellence on a particular topic be ers of the PARSEME shared task at the 13th Workshop on Multiword Expressions (MWE 2017): http://bit.ly/ argued if one’s team came last in a com- 2jPsu2n. petition? Again, mechanisms could be de-

70 signed with the aim of discouraging this type related challenge. This is a real problem, as of withdrawal. For example, one possible so- if this is the case, we should ask ourselves lution would be to only report on the upper what we as a field are really learning. At 50% of the ranked systems. the same time, our field experiences a lot of redundancy, as we try to reimplement oth- Potential ‘gaming the system’. Another • ers’ algorithms against which we test our own concern is the impact of the results of the systems. This is the case particularly when shared task beyond the shared task itself (e.g. systems participating in a shared task are not real-world applications, end-users). Shared subsequently released to the community.10 tasks are evaluated against a common test set under the auspices of a ‘fair’ compari- Unequal playing field. Another potential • son among systems. However, as the ultimate risk is the fact that larger teams at institu- goal of most participating teams is to obtain tions with greater processing power (e.g. bet- the highest positions in the ranking, there is ter funded research centres or large multi- a risk of focusing on winning, rather than on nationals) may have a clear unfair advantage the task itself. Of course, accurate evaluation in developing better performing systems, ren- is crucial when reporting results of NLP tasks dering the ‘competition’ as an unequal play- (e.g. Summarisation (Mackie et al., 2014); ing field for researchers in general. This MT (Graham, 2015)). As evaluation met- could be mitigated against by establishing, rics play a crucial role in determining who beforehand, the conditions under which sys- is the winner of a shared task, many partic- tems are trained and tested for the task. ipating teams will tune their systems so that they achieve the highest possible score for the In this section, we have identified several po- objective function at hand, as opposed to fo- tential ethical concerns related to the organization cusing on whether this approach is actually and participation in shared tasks. As observed, the the best way to solve the problem. This, in three issues discussed in the academic literature on turn, impacts directly on the real-world ap- competition in research (cf. Section 3.1) appear plications for which solving that challenge is to be important considerations for shared tasks in particularly relevant, as it may be the case NLP. In addition, we have highlighted some other that the ‘winning’ systems are not necessar- areas of potential ethical consideration in our field ily the best ones to be used in practice. with respect to shared tasks. In the next section, we discuss potential paths to tackle the ethical As discussed previously, some shared tasks concerns raised here. allow for ‘closed’ and ‘open’ variants, i.e. in the ‘closed’ sub-task, participants use only 4 Future directions the data provided by the shared task organis- ers, such that the playing field really is level The great value of shared tasks is there for all (we ignore for now the question as to whether to see, and there is no doubt that they will con- the leading system really is the ‘best’ system tinue to be a major venue for many researchers in for the task at hand, or (merely) has the best NLP in the future. Nonetheless, we have pointed pre-preprocessing component, for instance). 
out several ethical concerns that we believe should By contrast, in the ‘open’ challenge, teams be addressed by the NLP community, and mech- are permitted to add extra data such that true anisms created to prevent them should be also comparison of the merits of the competing agreed upon. At the same time, there may be other systems is much harder to bring about. ethical considerations that the authors have omit- ted due to lack of knowledge about all shared tasks Redundancy and replicability in the field. • 10 11 Another important issue is that, although this The existence of initiatives such as CLARIN or the re- cent efforts made by ELDA to try to standardize even vari- should be the overriding goal, we typically ous versions of the ‘same’ dataset, evaluation metric, or even find that for any new data set – even for the a particular run of an experiment show that we are shift- same language pair – optimal parameter set- ing to a new research paradigm where open data, research transparency, reproducibility of results and a collaborative tings established in a previous shared task do approach to advancements in science are advocated (Peder- not necessarily carry over to the new, albeit sen, 2008; Perovsekˇ et al., 2015).

71 in NLP, or simply because they arose within par- Public release of the results of the participat- • ticipation in specific shared tasks and have never ing systems after the shared task, under an been shared with the community. Thus, we see agreed license; that a first step towards determining potential eth- ical issues related to the organisation of and par- Declaration of the list of contributors to a cer- • ticipation in shared tasks is to conduct a survey tain system at submission time; in our community to ensure broad contribution. Such a survey – to be launched shortly after dis- Anonymisation of the lower (50% ?) of sys- • cussions at the 2017 Ethics in NLP workshop – tems evaluated to be referred to by name in consists of two parts. The first tries to gauge the published results; varying requirements of shared tasks, and the sec- ond one aims at assessing what people feel are ... important factors for consideration when drawing • up a common framework for shared tasks in NLP. This common framework will ensure greater trans- 5 Conclusion parency and understanding of shared tasks in our community, and prevent us from encountering the In this paper we have discussed a number of po- potential negative impact of the ethical concerns tential ethical issues in the organisation and partic- raised here. ipation of shared tasks that NLP scientists should Questions regarding past experiences related to address to prevent them from arising as problems shared tasks (either as organisers, annotators or in the future. Besides taking into account the par- participants) are included in the survey to gather ticular features of shared tasks, we investigated the information regarding (i) best practices used in potential ethical issues of competition in scientific specific shared tasks that could be extrapolated research and extrapolated such issues to the po- to new ones, (ii) the type of information that is tential problems that may arise in our own field. available to participants before, during and after In addition, as we believe this should be tackled the shared task, (iii) potential ethical concerns en- by the NLP community as a whole, we have pro- countered in the past and how they were tackled, posed the launch of a survey to gather further in- (iv) other causes for concern from the NLP com- formation about shared tasks in NLP that will help munity and (v) good experiences that we should in the development of a common framework in the aim at replicating. near future. This would include current best prac- Besides recommendations on best practice, we tice, a series of recommendations and checklists envisage the creation of shared task checklists as to what issues should be taken into account, based on the questions in the survey and their as well as what information is provided to partic- replies. These checklists would target the organ- ipants, depending on the type of shared tasks in isers, annotators and participating teams in shared question. tasks, and would be used to state any relevant in- Finally, shared tasks in our field play an essen- formation required in each case. By subsequently tial role in NLP. They have undoubtedly helped making them publicly available to the community improve the quality of the systems we develop (e.g. 
at the shared task website), any participat- across a range of NLP sub-fields, to a point where ing team or researcher interested in the shared task many of them comprise essential components of topic would know how specific topics were ad- professional workflows. The system as such is not dressed in the shared task, and what information irretrievably broken, so there may be a temptation was or will be available to them. What follows is to not fix the issues outlined in this paper. How- a non-exhaustive list of some of the items that we ever, we firmly believe that the field of NLP has foresee including in the checklist (subject to dis- reached a level of maturity where some reflection cussion and amendment): on the practices that we currently take for granted is merited, such that our shared tasks become ever Participation of organisers in the shared task; more reliable and consistent across our discipline, • and further strides are made to the benefit of the Participation of annotators or people who had field as a whole as well as to the wider commu- • prior access to the data in the shared task; nity.

72 Acknowledgements Science, Technology & Human Values, 16(4):491– 516. The authors wish to thank the anonymous re- viewers for their valuable feedback. This re- Michael D. Mumford and Whitney B. Helton. 2001. Organizational influences on scientific integrity. In search is supported by Science Foundation Ire- Nicholas Steneck and Mary Scheetz, editors, Inves- land in the ADAPT Centre (Grant 13/RC/2106) tigating research integrity: Proceedings of the first (www.adaptcentre.ie) at Dublin City University. ORI research conference on research integrity, vol- ume 5583, pages 73–90. David S. Pallett. 2003. A look at nist’s benchmark asr References tests: past, present, and future. In 2003 IEEE Work- Melissa S. Anderson, Emily A. Ronning, Raymond shop on Automatic Speech Recognition and Under- de Vries, and Brian C. Martinson. 2007. The standing (IEEE Cat. No.03EX721), pages 483–488, perverse effects of competition on scientists’ work Nov. and relationships. Science and Engineering Ethics, 13(4):437–461. Patrick Paroubek, Stephane´ Chaudiron, and Lynette Hirschman. 2007. Principles of Evaluation in Nat- Bernd Bohnet. 2010. Very high accuracy and fast de- ural Language Processing. Traitement Automatique pendency parsing is not a contradiction. In Proceed- des Langues, 48(1):7–31, May. ings of the 23rd International Conference on Com- putational Linguistics, COLING ’10, pages 89–97, Ted Pedersen. 2008. Empiricism is not a matter of Stroudsburg, PA, USA. Association for Computa- faith. Computational Linguistics, 34(3):465–470. tional Linguistics. Matic Perovsek,ˇ Vid Podpecan,ˇ Janez Kranjc, Tomazˇ Daniele Fanelli. 2010. Do pressures to publish in- Erjavec, Senja Pollak, Quynh Ngoc Thi Do, Xiao crease scientists’ bias? an empirical support from Liu, Cameron Smith, Mark Cavazza, and Nada us states data. PLoS ONE, 5(4):e10271, 04. Lavrac.ˇ 2015. Text mining platform for nlp work- flow design, replication and reuse. In Proceedings of Yvette Graham. 2015. Improving evaluation of ma- the Workshop on Replicability and Reproducibility chine translation quality estimation. In Proceed- in Natural Language Processing: adaptive methods, ings of the 53rd Annual Meeting of the Association resources and software at IJCAI 2015. for Computational Linguistics and the 7th Interna- tional Joint Conference on Natural Language Pro- John R. Pierce and John B. Carroll. 1966. Language cessing (Volume 1: Long Papers), pages 1804–1813, and Machines: Computers in Translation and Lin- Beijing, China, July. Association for Computational guistics. National Academy of Sciences/National Linguistics. Research Council, Washington, DC, USA. Donna Harman. 1992. The darpa tipster project. SI- GIR Forum, 26(2):26–28, October. Philipp Koehn and Kevin Knight. 2003. Empirical Methods for Compound splitting. In Proceedings of the Tenth Conference of the European Chapter of the Association for Computational Linguistics, pages 187–193, Stroudsburg, PA, USA. Association for Computational Linguistics. Stuart Mackie, Richard McCreadie, Craig Macdonald, and Iadh Ounis. 2014. On choosing an effective au- tomatic evaluation metric for microblog summarisa- tion. In Proceedings of the 5th Information Interac- tion in Context Symposium, pages 115–124. ACM. Joseph Mariani, Patrick Paroubek, Gil Francopoulo, and Olivier Hamon. 2014. Rediscovering 15 years of discoveries in language resources and evaluation: The lrec anthology analysis. 
In Proceedings of the Ninth International Conference on Language Re- sources and Evaluation (LREC’14), Reykjavik, Ice- land, may. European Language Resources Associa- tion (ELRA). Katherine W. McCain. 1991. Communication, com- petition, and secrecy: The production and dissem- ination of research-related information in genetics.

73 Social Bias in Elicited Natural Language Inferences

Rachel Rudinger* Chandler May* Benjamin Van Durme Johns Hopkins University Johns Hopkins University Johns Hopkins University [email protected] [email protected] [email protected]

Abstract Stock word embeddings have been shown to ex- hibit gender bias, leading to proposed debiasing We analyze the Stanford Natural Lan- algorithms (Bolukbasi et al., 2016). Word em- guage Inference (SNLI) corpus in an in- beddings have been shown to reproduce harmful vestigation of bias and stereotyping in implicit associations exhibited by human subjects NLP data. The human-elicitation proto- in implicit association tests (Caliskan-Islam et al., col employed in the construction of the 2016). Gender bias in sports journalism has been SNLI makes it prone to amplifying bias studied via language modeling, confirming that and stereotypical associations, which we male athletes receive questions more focused on demonstrate statistically (using pointwise the game than female athletes (Fu et al., 2016). In mutual information) and with qualitative guessing the gender, age, and education level of examples. the authors of Tweets, crowdworkers found to ex- aggerate stereotypes (Carpenter et al., 2017). 1 Introduction A prerequisite to resolving the above issues is basic awareness among NLP researchers and prac- Since the statistical revolution in Artificial Intel- titioners of where systematic bias in datasets ex- ligence (AI), it is standard in areas such as nat- ists, and how it may arise. In service of this goal, ural language processing and computer vision to we offer a case study of bias in the Stanford Nat- train models on large amounts of empirical data. ural Language Inference (SNLI) dataset. SNLI is This “big data” approach popularly connotes ob- a recent but popular NLP dataset for textual infer- jectivity; however, as a cultural, political, and eco- ence, the largest of its kind by two orders of mag- nomic phenomenon in addition to a technological nitude, offering the potential to substantially ad- one, big data carries subjective aspects (Crawford vance research in Natural Language Understand- et al., 2014). The data mining process involves ing (NLU). We select this dataset because (1) we defining a target variable and evaluation criteria, predict that natural language inference as a NLP collecting a dataset, selecting a manner in which to task may be generally susceptible to emulating hu- represent the data, and sometimes eliciting anno- man cognitive biases like social stereotyping, and tations: bias, whether or implicit or explicit, may (2) we are interested in how eliciting written infer- be introduced in the performance of each of these ences from humans with minimal provided context tasks (Barocas and Selbst, 2016). may encourage stereotyped responses. We focus on the problem of overgeneralization, in which a data mining model extrapolates ex- Using the statistical measure of pointwise mu- cessively from observed patterns, leading to bias tual information along with qualitative examples, confirmation among the model’s users (Hovy and we demonstrate the existence of stereotypes of Spruit, 2016). High-profile cases of overgener- various forms in the elicited hypotheses of SNLI. alization in the public sphere abound (Crawford, 2013; Crawford, 2016; Barocas and Selbst, 2016). 2 The SNLI Dataset Research on the measurement and correction of Bowman et al. (2015) introduce the Stanford Nat- overgeneralization in NLP in particular is nascent. ural Language Inference corpus. The corpus * denotes equal contribution. was generated by presenting crowdworkers with

a photo caption (but not the corresponding photo) from the Flickr30k corpus (Young et al., 2014) and instructing them to write a new alternate caption for the unseen photo under one of the following specifications: The new caption must either be [1] "definitely a true description of the photo," [2] "might be a true description of the photo," or [3] "definitely a false description of the photo." Thus, in the parlance of Natural Language Inference, the original caption and the newly elicited caption form a sentence pair consisting of a premise (the original caption) and a hypothesis (the newly elicited sentence). The pair is labeled with one of three entailment relation types (ENTAILMENT, NEUTRAL, or CONTRADICTION), corresponding to conditions [1–3] above. The dataset contains 570K such pairs in total.

Given the construction of this dataset, we identify two possible sources of social bias: caption bias,¹ already present in the premises from the Flickr30k corpus (van Miltenburg, 2016), and (inference) elicitation bias, resulting from the SNLI protocol of eliciting possible inferences from humans provided an image caption. Though we recognize these sources of bias may not be as tidy and independent as their names suggest, it is a useful conceptual shorthand: In this paper, we are primarily interested in detecting elicitation bias.

3 Methodology

We are ultimately concerned with the impact of a dataset's biases on the models and applications that are trained on it. To avoid dependence on a particular model or model family, we evaluate the SNLI dataset in a model-agnostic fashion using the pointwise mutual information (PMI) measure of association (Church and Hanks, 1990) and likelihood ratio tests of independence (Dunning, 1993) between lexical units.

Given categorical random variables W1 and W2 representing word occurrences in a corpus, for each word type (or bigram) w1 in the range of W1 and for each word type (or bigram) w2 in the range of W2, PMI is defined as

\[ \mathrm{PMI}(w_1, w_2) = \log \frac{P(W_1 = w_1, W_2 = w_2)}{P(W_1 = w_1)\, P(W_2 = w_2)}. \]

To compute PMI from corpus statistics, we plug in maximum-likelihood estimates of the joint and marginal probabilities:

\[ \hat{P}(W_1 = w_1, W_2 = w_2) = C(w_1, w_2) / C(*, *), \]
\[ \hat{P}(W_1 = w_1) = C(w_1, *) / C(*, *), \]
\[ \hat{P}(W_2 = w_2) = C(*, w_2) / C(*, *), \]

where C(w1, w2) represents the co-occurrence count of W1 = w1 and W2 = w2 in the corpus and * denotes marginalization (summation) over the corresponding variable. We wish to focus on the bias introduced in the hypothesis elicitation process, so we count co-occurrences between words (or bigrams) w1 in a premise and words (or bigrams) w2 in a corresponding hypothesis.

For each pair of word types (or bigrams) w1 and w2, we can check the independence between the indicator variables X_{w1} = I{W1 = w1} and Y_{w2} = I{W2 = w2} with a likelihood ratio test. (Hereafter we omit subscripts w1 and w2 for ease of notation.) Denote the observed counts of X and Y over the corpus by C0(x, y) for x, y ∈ {0, 1}.² The test statistic is

\[ \Lambda(C_0) = \frac{\prod_{x,y} \hat{P}(X = x)^{C_0(x,y)}\, \hat{P}(Y = y)^{C_0(x,y)}}{\prod_{x,y} \hat{P}(X = x, Y = y)^{C_0(x,y)}}, \]

where P̂ is the maximum likelihood estimator (using C0), the products range over x, y ∈ {0, 1}, and we have dropped the subscripts w1 and w2 for ease of notation. The quantity −2 log Λ(C0) is χ²-distributed with one degree of freedom, so we can use it to test rejection of the null hypothesis (independence between X and Y) for significance. That quantity is also equal to a factor of 2 C0(*, *) times the mutual information between X and Y, and the PMI between W1 and W2 (on which X and Y are defined) is a (scaled) component of the mutual information. Noting this relationship between PMI (which we use to sort all candidate word pairs) and the likelihood ratio test statistic (which we use to test for independence of the top word

¹ Note that what we call caption bias may be due either to the Flickr30k caption writing procedure, or the underlying distribution of images themselves. Distilling these two sources of bias is outside the scope of this paper, as the SNLI corpus makes no direct use of the images themselves. Put another way, because SNLI annotators did not see images, the elicited hypotheses are independent of the Flickr images, conditioned on the premises.
² For example, note C0(1, 1) = C(w1, w2) and C0(1, 0) = C(w1, *) − C(w1, w2); the other counts C0(0, 1) and C0(0, 0) can also be computed in this manner.

GENDER
woman: hairdresser‡ fairground grieving receptionist widow
man: rock-climbing videoing armband tatooes gent
women: actresses† husbands‡ womens‡ gossip‡ wemon‡
men: gypsies supervisors contractors mens‡ cds
girl: schoolgirl piata cindy pigtails‡ gril
boy: misbehaving see-saw timmy lad‡ sprained
girls: fifteen‡ slumber sking‡ jumprope† ballerinas‡
boys: giggle‡ youths‡ sons‡ brothers‡ skip
mother: kissed‡ parent‡ mom‡ feeds daughters
father: fathers‡ dad‡ sons† daughters plant

AGE
old: ferret‡ quilts‡ knits‡ grandpa‡ elderly‡
young: giggle cds youthful‡ tidal amusing
old woman: knits‡ grandmother‡ scarf† elderly‡ lady‡
young woman: salon† attractive blow blowing feeds
old man: ferret‡ grandpa‡ wrapping‡ grandfather‡ elderly‡
young man: boarder disabled rollerblades graduation skate‡

RACE/ETHNICITY/NATIONALITY
indian: indians‡ india‡ native‡ traditional‡ pouring†
caucasian: blond white‡ american asian blonde
indian woman: cooking† clothes lady using making
american: patriotic‡ canadian‡ americans‡ reenactment‡ america‡
indian man: food couple a‡ sleeping sitting
american woman: women‡ black white front her‡
asian: kimonos‡ asians‡ asain‡ oriental‡ chinatown‡
american man: speaking‡ money‡ black‡ white‡ music
asians: asian‡ food people‡ eating friends
black woman: african‡ american asian white‡ giving
asian woman: oriental‡ indian† chinese‡ listens† customers
black man: african‡ american white‡ roller face
asian man: shrimp† rice† chinese‡ businessman cooks†
native american: americans‡ music‡ dressed they woman
white woman: protesting‡ lady‡ looks women‡ was
african american: caucasian asian‡ speaking‡ black‡ white‡
white man: pancakes‡ caucasian‡ class black† concert
african: africans‡ africa‡ pots† receives† village†

Table 1: Top five words in hypothesis by PMI with specified words in premise, filtered to co-occurrences with a unigram with count at least five. Queries in bold. Significance of a likelihood ratio test for independence denoted by † (α = 0.01) and ‡ (α = 0.001). pairs), we control for the family-wise error rate us- Preliminary results contained many bigrams in ing the Holm-Bonferroni procedure (Holm, 1979) the top-five lists that overlapped with the query— on all candidate word pairs. The procedure is ap- exactly or by lemma—along with a stop word. To plied separately within each view of the corpus mitigate this redundancy we filter the query results that we analyze: the all–inference-type view, EN- to unigrams before sorting and truncating. TAILMENT-only view, NEUTRAL-only view, and CONTRADICTION-only view. The U.S. Equal Employment Opportunity Com- 4 Results mission (EEOC) characterizes discrimination by type, where types of discrimination include age, We analyze bias in the SNLI dataset using both disability, national origin, pregnancy, race/color, PMI as a statistical measure of association (Sec. religion, and sex.3 To test for the existence of 4.1) and with demonstrative examples (Sec. 4.2). harmful stereotypes in the SNLI dataset we pick words and bigrams used to describe people la- beled as belonging to each of these categories, 4.1 Top Associated Terms by PMI such as Asian or woman, and list the top five or ten co-occurrences with each of those query terms For each social identifier of interest (for exam- in the SNLI dataset, sorted by PMI.4 We omit co- ple, “woman,” “man,” “Asian,” “African Ameri- occurrences with a count of less than five. We in- can,” etc.) we query for the top 5 or 10 unigrams clude both broad and specific query words; for ex- in the dataset that share the highest PMI with the ample, we include adjectives describing nationali- identifier. In Table 1, the results are broken down ties as well as those describing regions and races. by gender-, age-, and race/ethnicity/nationality- We also include query bigrams describing people based query terms, though some query terms com- labeled as belonging to more than one category, bine more than one type of identifier (for exam- such as Asian woman. Due to space constraints, ple, gender and race). Table 2 shows the results we report a subset of the top-five lists exhibit- for the same gender-based queries run over differ- ing harmful stereotypes. The code and query list ent portions of SNLI, as partitioned by entailment used in our analysis are available online, facilitat- type (ENTAILMENT, NEUTRAL, and CONTRADIC- ing further analysis of the complete results.5 TION). As described in Sec. 3, the pairwise counts used to estimate PMI are between a word in the 3https://www.eeoc.gov/laws/types/ premise and a word in the hypothesis; thus, query 4 We use the provided Stanford tokenization of the SNLI terms correspond with SNLI premises, and the re- dataset, converting all words to lowercase before counting co- occurrences. sults of the query correspond with hypotheses. A 5https://github.com/cjmay/snli-ethics discussion of these results follows in Sec. 5.
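To make the statistics of Section 3 concrete, the following Python sketch (our own illustration, not code released with the paper; function and variable names are hypothetical) computes the PMI and the −2 log Λ statistic for a single premise-word/hypothesis-word pair directly from the four counts C(w1, w2), C(w1, *), C(*, w2) and C(*, *). For one degree of freedom the χ² tail probability has a closed form, so no external statistics library is assumed.

import math

def pmi_and_g_statistic(c11, c1_any, c_any1, n):
    # c11    : co-occurrence count C(w1, w2) of w1 in a premise and w2 in the hypothesis
    # c1_any : marginal count C(w1, *)
    # c_any1 : marginal count C(*, w2)
    # n      : total pair count C(*, *)
    # Assumes c11 > 0 (in practice only pairs above a count threshold are queried).
    pmi = math.log((c11 / n) / ((c1_any / n) * (c_any1 / n)))

    # 2x2 contingency table C0(x, y) of the indicator variables X and Y.
    table = {
        (1, 1): c11,
        (1, 0): c1_any - c11,
        (0, 1): c_any1 - c11,
        (0, 0): n - c1_any - c_any1 + c11,
    }
    p_x = {1: c1_any / n, 0: 1.0 - c1_any / n}
    p_y = {1: c_any1 / n, 0: 1.0 - c_any1 / n}

    # -2 log Lambda = 2 * sum_{x,y} C0(x, y) * log( P(x, y) / (P(x) * P(y)) )
    g = 0.0
    for (x, y), count in table.items():
        if count > 0:
            g += 2.0 * count * math.log((count / n) / (p_x[x] * p_y[y]))

    # Under independence, -2 log Lambda is chi-squared with one degree of freedom,
    # whose upper tail probability is erfc(sqrt(g / 2)).
    p_value = math.erfc(math.sqrt(g / 2.0)) if g > 0 else 1.0
    return pmi, g, p_value

Sorting candidate pairs by the returned PMI value and testing the highest-ranked ones with the returned p-values mirrors the procedure behind Tables 1 and 2.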

ENTAILMENT
women: scarves† ladies‡ womens‡ wemon‡ females‡ woman‡ affection dressing chat smile†
men: mens‡ guys‡ guitars cowboys† remove dock dudes workers‡ computers‡ boxers
girls: cheerleaders‡ females‡ girl‡ dancers children‡ smile practice dance‡ outfits laughing
boys: males‡ children‡ boy‡ kids‡ four‡ fighting† exercise play‡ pose fun

NEUTRAL
women: actresses‡ gossip‡ husbands‡ womens‡ nuns† bridesmaids† gossiping‡ ladies‡ strippers purses
men: lumberjacks mens‡ supervisors thieves‡ homosexual roofers reminisce† contractors groomsmen engineers‡
girls: fifteen‡ slumber† gymnasts‡ cheerleading‡ bikinis† sisters‡ cheerleaders‡ daughters‡ selfies† teenage‡
boys: skip† sons‡ brothers‡ twins‡ muddy trunks† males† league‡ cards recess†

CONTRADICTION
women: womens† wemon bikinis‡ ladies‡ towels females‡ politics dresses‡ discussing men‡
men: dudes mens‡ motel‡ gossip surfboards wives caps sailors floors helmets
girls: sking‡ boys‡ 50 brothers sisters dolls† pose opposite phones hopscotch
boys: girls‡ sisters‡ sons bunk homework† males coats beds† guns professional

Table 2: Top-ten words in hypothesis by PMI with gender-related query words in premise, filtered to co-occurrences with a unigram with count of at least five, sorted by inference type (ENTAILMENT, NEUTRAL, or CONTRADICTION). Queries in bold. Significance of a likelihood ratio test for independence denoted by † (α = 0.01) and ‡ (α = 0.001).
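Lists such as those in Tables 1 and 2 can be produced with a short query routine. The sketch below is again our own illustrative code, reusing pmi_and_g_statistic from the earlier snippet; the data structures are assumptions, with the minimum co-occurrence count of five taken from the table captions. It ranks hypothesis-side words by PMI for a premise-side query term and applies the Holm-Bonferroni step-down procedure to control the family-wise error rate over all tested pairs.

def top_associations(pair_counts, query, k=5, min_count=5):
    # pair_counts : dict mapping (premise_word, hypothesis_word) -> co-occurrence count
    # Returns the k hypothesis words with highest PMI for the query, with their p-values.
    n = sum(pair_counts.values())
    premise_totals, hypothesis_totals = {}, {}
    for (w1, w2), c in pair_counts.items():
        premise_totals[w1] = premise_totals.get(w1, 0) + c
        hypothesis_totals[w2] = hypothesis_totals.get(w2, 0) + c

    scored = []
    for (w1, w2), c in pair_counts.items():
        if w1 != query or c < min_count:
            continue
        pmi, _, p = pmi_and_g_statistic(c, premise_totals[w1], hypothesis_totals[w2], n)
        scored.append((pmi, w2, p))
    return sorted(scored, reverse=True)[:k]


def holm_bonferroni(p_values, alpha=0.001):
    # p_values : dict mapping a (premise_word, hypothesis_word) pair to its p-value.
    # Holm (1979) step-down procedure: sort p-values in increasing order and compare the
    # i-th smallest (0-based) against alpha / (m - i), stopping at the first failure.
    rejected = set()
    m = len(p_values)
    for i, (pair, p) in enumerate(sorted(p_values.items(), key=lambda item: item[1])):
        if p <= alpha / (m - i):
            rejected.add(pair)
        else:
            break
    return rejected

The α values 0.01 and 0.001 correspond, as far as the captions indicate, to the pairs marked with † and ‡ in the tables.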

4.2 Qualitative Examples Explicit introduction of harmful stereotypes or pejorative language by crowdworkers (such as that Some forms of bias in a dataset may only be presented here) is a form of elicitation bias; it may detectable with aggregate statistics such as PMI. be a result of many factors, including the crowd- Other, more explicit forms of bias may be appar- worker’s personal experiences, cultural identities, ent from individual data points. Here we present native English dialect, political ideology, socioe- some example sentence pairs from SNLI that conomic status, anonymity (and hence relative im- outwardly exhibit harmful stereotypes (labeled punity), and lack of awareness of their responses’ HS) or the use of pejorative language or slurs potential impact. As one reviewer suggested, in (labeled PL).6 Note that in these examples, the the case of CONTRADICTION elicitation, some identifiable biases have been introduced as a result crowdworkers may even have “viewed their role of the SNLI inference elicitation protocol, that is, as being not just contradictory, but outrageously they arise in the hypothesis. so.” While these explanations are speculative, PREMISE: An African American man looking at some the harmful language and stereotypes observed in butchered meat that is hanging from a rack outside a these examples are not. building. HYPOTHESIS (CONTRA.): A black man is in jail [HS] 5 Discussion of Results PREMISE: New sport is being played to show apprecia- tion to the kids who can not walk. From the top associated terms by PMI, as re- HYPOTHESIS (ENTAIL.): People are playing a sport in ported in Tables 1 and 2, the clearest stereotypi- honor of crippled people. [PL] cal patterns emerge for gender categories. Stereo- typical associations evoked for women (but not PREMISE: Several people, including a shirtless man and a woman in purple shorts which say “P.I.N.K.” on the men) include: expectations of emotional labor back, are walking through a crowded outdoor area. (smile, kissed), “pink collar” jobs (hairdresser), HYPOTHESIS (ENTAIL.): The woman is wearing slutty sexualization and emphasis on physical appear- shorts. [PL] ance (bikinis), talkativeness (gossip, gossiping), PREMISE: adult with red boots and purse walking down and being defined in relation to men (men, hus- the street next to a brink wall. bands). Conversely, stereotypical views of men HYPOTHESIS (NEUTR.): A whore looking for clients. [PL, HS] are also evoked: performance of physical labor (cowboys, workers), and professionals in technical PREMISE: Several Muslim worshipers march towards computers engineers Mecca. jobs ( , ). HYPOTHESIS (NEUTR.): The Muslims are terrorists. Gender-based stereotypes in the corpus cut [HS] across age, as well. Girls are associated with particular sports (ballerinas, cheerleaders, PREMISE: A man dressed as a woman and other people stand around tables with checkered tablecloths and a lad- cheerleading, dance, gymnasts), games and toys der. (jumprope, dolls), outward appearances (pigtails, HYPOTHESIS (NEUTR.): The man is a transvestite.[PL] bikinis), and activities (slumber [parties], self- ies). Boys, meanwhile, are stereotyped as trouble- 6The authors recognize the partially subjective nature of makers (fighting) and active outdoors (recess, applying these labels. league, play).

77 Though gender stereotypes appear in all three idence of stereotypes, including stereotypes of in- entailment categories in Table 2, those under the tersectional identities, by merging the counts of NEUTRAL label appear especially strong. We hy- semantically related terms (or, conversely, by de- pothesize this is a result of the less constrained na- coupling the counts of homonyms). It could also ture of eliciting inferences that are neither “defi- be fruitful to infer dependency parses and compute nitely true” nor “definitely false”: Eliciting infer- co-occurrences between dependency paths rather ences that merely “might be true” may actually en- than individual words to facilitate interpretation of courage stereotyped responses. Formally, neutral the results (Lin and Pantel, 2001; Chambers and inferences may or may not be true, so those ex- Jurafsky, 2009), if sparsity can be controlled. pressing stereotypes could be assumed to have no We have focused on the identities and accom- negative impact on the downstream model. How- panying biases present in the SNLI dataset, in par- ever, if the model assumes neutral inferences are ticular those created in the hypothesis elicitation equally likely to be true or false a priori, that process; one complement to our study would mea- assumption’s impact may be greater on minority sure the demographic bias in the corpus. Cor- groups subject to harmful negative stereotypes. relations introduced at any level in the data col- As represented by top-k PMI lists, individual lection process—including real-world correlations terms for race, ethnicity, and nationality appear present in the population—are subject to scrutiny, to have less strongly stereotyped associations than as they may be both creations and creators of gender terms, but some biased associations are structural inequality. still observed. Words associated with Asians in As artificial intelligence absorbs the world’s this dataset, for example, appear to center around collective knowledge with increasing efficiency food and eating; the problematic term “Oriental” and comprehension, our collective knowledge is in is also highly associated (another example of pe- turn shaped by the outputs of artificial intelligence. jorative language, as discussed in Sec. 4.2). For It is thus imperative that we understand how the many race, ethnicity, and nationality descriptors, bias pervading our society is encoded in artificial some of the top-5 results by PMI are terms for intelligence. This work constitutes a first step to- other races, ethnicities, or nationalities. This is in ward understanding and accounting for the social large part a result of an apparent SNLI annotator bias present in natural language inference. tactic for CONTRADICTION examples: If the race, ethnicity, or nationality of a person in the premise Acknowledgments is specified, simply replace it with a different one. We are grateful to our many reviewers who offered both candid and thoughtful feedback. 6 Conclusion This material is based upon work supported by the JHU Human Language Technology Cen- We used a simple and interpretable association ter of Excellence (HLTCOE), DARPA LORELEI, measure, namely pointwise mutual information, and the National Science Foundation Graduate to test the SNLI corpus for elicitation bias, not- Research Fellowship under Grant No. DGE- ing that bias at the level of word co-occurrences 1232825. The U.S. 
Government is authorized to is likely to lead to overgeneralization in a large reproduce and distribute reprints for Governmen- family of downstream models. We found evidence tal purposes. The views and conclusions contained that the elicited hypotheses introduced substantial in this publication are those of the authors and gender stereotypes as well as varying degrees of should not be interpreted as representing official racial, religious, and age-based stereotypes. We policies or endorsements of DARPA, the NSF, or caution that our results do not imply the latter the U.S. Government. stereotypes are not present: rather, the prominence of gender stereotypes may be due to the relatively visual expression of gender, and the absence of References other stereotypes in our results may be due to spar- Solon Barocas and Andrew D. Selbst. 2016. Big sity. We also note that our analysis reflects our data’s disparate impact. California Law Review, own experiences, beliefs, and biases, inevitably in- 104(3):671–732. fluencing our results. Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Future work may find more comprehensive ev- Venkatesh Saligrama, and Adam T. Kalai. 2016.

78 Man is to computer programmer as woman is to Dirk Hovy and Shannon L. Spruit. 2016. The social homemaker? debiasing word embeddings. In D. D. impact of natural language processing. In Proceed- Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and ings of the 54th Annual Meeting of the Association R. Garnett, editors, Advances in Neural Information for Computational Linguistics (Volume 2: Short Pa- Processing Systems 29, pages 4349–4357. pers), pages 591–598, Berlin, Germany, August. As- sociation for Computational Linguistics. Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015. A large anno- Dekang Lin and Patrick Pantel. 2001. Dirt – discovery tated corpus for learning natural language inference. of inference rules from text. In Proceedings of the In Proceedings of the 2015 Conference on Empiri- seventh ACM SIGKDD international conference on cal Methods in Natural Language Processing, pages Knowledge discovery and data mining, pages 323– 632–642, Lisbon, Portugal, September. Association 328. ACM. for Computational Linguistics. Emiel van Miltenburg. 2016. Stereotyping and bias Aylin Caliskan-Islam, Joanna J. Bryson, and Arvind in the Flickr30K dataset. In Jens Edlund, Dirk Narayanan. 2016. Semantics derived automatically Heylen, and Patrizia Paggio, editors, Proceedings from language corpora necessarily contain human of the Workshop on Multimodal Corpora (MMC- biases. Preprint, arXiv:1608.07187. 2016), pages 1–4, May. Jordan Carpenter, Daniel Preotiuc-Pietro, Lucie Peter Young, Alice Lai, Micah Hodosh, and Julia Flekova, Salvatore Giorgi, Courtney Hagan, Mar- Hockenmaier. 2014. From image descriptions to garet L. Kern, Anneke E. K. Buffone, Lyle Ungar, visual denotations. Transactions of the Association and Martin E. P. Seligman. 2017. Real men dont say of Computational Linguistics, 2:67–78. “cute”: Using automatic language analysis to isolate inaccurate aspects of stereotypes. to appear. Nathanael Chambers and Dan Jurafsky. 2009. Unsu- pervised learning of narrative schemas and their par- ticipants. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th In- ternational Joint Conference on Natural Language Processing of the AFNLP, pages 602–610, Suntec, Singapore, August. Association for Computational Linguistics. Kenneth Ward Church and Patrick Hanks. 1990. Word association norms mutual information, and lexicog- raphy. Computational Linguistics, Volume 16, Num- ber 1, March 1990, 16(1). Kate Crawford, Kate Miltner, and Mary L. Gray. 2014. Critiquing big data: Politics, ethics, epistemology. International Journal of Communication, 8:1663– 1672. Kate Crawford. 2013. Think again: Big data. http://atfp.co/2k9jaBT. Accessed 2017-01-26. Kate Crawford. 2016. Artificial intelligence’s white guy problem. http://nyti.ms/2jVLJUh. Accessed 2017-01-22. Ted Dunning. 1993. Accurate methods for the statis- tics of surprise and coincidence. Computational Linguistics, Special Issue on Using Large Corpora: I, 19(1). Liye Fu, Cristian Danescu-Niculescu-Mizil, and Lil- lian Lee. 2016. Tie-breaker: Using language mod- els to quantify gender bias in sports journalism. In Proceedings of the IJCAI workshop on NLP meets Journalism. Sture Holm. 1979. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, 6(2):65–70.

79 A Short Review of Ethical Challenges in Clinical Natural Language Processing

Simon Susterˇ 1,2, Stephan´ Tulkens1 and Walter Daelemans1

1CLiPS, University of Antwerp 2Antwerp University Hospital simon.suster,stephan.tulkens,walter.daelemans @uantwerpen.be { }

Abstract comparison, in biomedical NLP, where the working data consist of biomedical research literature, these Clinical NLP has an immense potential in conditions have been present to a much lesser de- contributing to how clinical practice will gree, and the progress has been more rapid (Cohen be revolutionized by the advent of large and Demner-Fushman, 2014). The main contribut- scale processing of clinical records. How- ing factor to this situation has been the sensitive ever, this potential has remained largely nature of data, whose processing may in certain untapped due to slow progress primarily situations put patient’s privacy at risk. caused by strict data access policies for re- The ethics discussion is gaining momentum in searchers. In this paper, we discuss the con- general NLP (Hovy and Spruit, 2016). We aim in cern for privacy and the measures it entails. this paper to gather the ethical challenges that are We also suggest sources of less sensitive especially relevant for clinical NLP, and to stim- data. Finally, we draw attention to biases ulate discussion about those in the broader NLP that can compromise the validity of empir- community. Although enhancing privacy through ical research and lead to socially harmful restricted data access has been the norm, we do applications. not only discuss the right to privacy, but also draw attention to the social impact and biases emanating 1 Introduction from clinical notes and their processing. The chal- The use of notes written by healthcare providers lenges we describe here are in large part not unique in the clinical settings has long been recognized to clinical NLP, and are applicable to general data to be a source of valuable information for clini- science as well. cal practice and medical research. Access to large quantities of clinical reports may help in identifying 2 Sensitivity of data and privacy causes of diseases, establishing diagnoses, detect- ing side effects of beneficial treatments, and mon- Because of legal and institutional concerns arising itoring clinical outcomes (Agus, 2016; Goldacre, from the sensitivity of clinical data, it is difficult 2014; Murdoch and Detsky, 2013). The goal of for the NLP community to gain access to relevant clinical natural language processing (NLP) is to data (Barzilay, 2016; Friedman et al., 2013). This develop and apply computational methods for lin- is especially true for the researchers not connected guistic analysis and extraction of knowledge from with a healthcare organization. Corpora with trans- free text reports (Demner-Fushman et al., 2009; parent access policies that are within reach of NLP Hripcsak et al., 1995; Meystre et al., 2008). But researchers exist, but are few. An often used corpus while the benefits of clinical NLP and data mining is MIMICII(I) (Johnson et al., 2016; Saeed et al., have been universally acknowledged, progress in 2011). Despite its large size (covering over 58,000 the development of clinical NLP techniques has hospital admissions), it is only representative of been slow. Several contributing factors have been patients from a particular clinical domain (the in- identified, most notably difficult access to data, tensive care in this case) and geographic location limited collaboration between researchers from dif- (a single hospital in the United States). Assuming ferent groups, and little sharing of implementations that such a specific sample is representative of a and trained models (Chapman et al., 2011). 
For larger population is an example of sampling bias


(we discuss further sources of bias in section 3). Increasing the size of a sample without recognizing that this sample is atypical for the general population (e.g. not all patients are critical care patients) could also increase sampling bias (Kaplan et al., 2014).¹ We need more large corpora for various medical specialties, narrative types, as well as languages and geographic areas.

Related to difficult access to raw clinical data is the lack of available annotated datasets for model training and benchmarking. The reality is that annotation projects do take place, but they are typically constrained to a single healthcare organization. Therefore, much of the effort put into annotation is lost afterwards due to the impossibility of sharing with the larger research community (Chapman et al., 2011; Fan et al., 2011). Again, exceptions are either few—e.g. THYME (Styler IV et al., 2014), a corpus annotated with temporal information—or consist of small datasets resulting from shared tasks like the i2b2 and ShARe/CLEF. In addition, stringent access policies hamper reproduction efforts, impede scientific oversight and limit collaboration, not only between institutions but also more broadly between the clinical and NLP communities.

There are known cases of datasets that had been used in published research (including reproduction) in their full form, like MiPACQ², Blulab, EMC Dutch Clinical Corpus and 2010 i2b2/VA (Albright et al., 2013; Kim et al., 2015; Afzal et al., 2014; Uzuner et al., 2011), but were later trimmed down or made unavailable, likely due to legal issues. Even if these datasets were still available in full, their small size is still a concern, and the comments above regarding sampling bias certainly apply. For example, a named entity recognizer trained on 2010 i2b2/VA data, which consists of 841 annotated patient records from three different specialty areas, will due to its size only contain a small portion of possible named entities. Similarly, in linking clinical concepts to an ontology, where the number of output classes is larger (Pradhan et al., 2013), the small amount of training data is a major obstacle to the deployment of systems suitable for general use.

¹ Sampling bias could also be called selection bias; it is not inherent to the individual documents, but stems from the way these are arranged into a single corpus.
² Access to the MiPACQ corpus will be re-enabled in the future within the Health NLP Center for distributing linguistic annotations of clinical texts (Guergana Savova, personal communication).

2.1 Protecting the individual

Clinical notes contain detailed information about patient-clinician encounters in which patients confide not only their health complaints, but also their lifestyle choices and possibly stigmatizing conditions. This confidential relationship is legally protected in the US by the HIPAA privacy rule in the case of individuals' medical data. In the EU, the conditions for scientific usage of health data are set out in the General Data Protection Regulation (GDPR). Sanitization of sensitive data categories and individuals' informed consent are at the forefront of those legislative acts and bear immediate consequences for NLP research.

The GDPR lists general principles relating to the processing of personal data, including that processing must be lawful (e.g. by means of consent), fair and transparent; it must be done for explicit and legitimate purposes; and the data should be kept limited to what is necessary and for as long as necessary. This is known as data minimization, and it includes sanitization. The scientific usage of health data concerns "special categories of personal data". Their processing is only allowed when the data subject gives explicit consent, or when the personal data is made public by the data subject. Scientific usage is defined broadly and includes technological development, fundamental and applied research, as well as privately funded research.

Sanitization. Sanitization techniques are often seen as the minimum requirement for protecting individuals' privacy when collecting data (Berman, 2002; Velupillai et al., 2015). The goal is to apply a procedure that produces a new version of the dataset that looks like the original for the purposes of data analysis, but which maintains the privacy of those in the dataset to a certain degree, depending on the technique. Documents can be sanitized by replacing, removing or otherwise manipulating sensitive mentions such as names and geographic locations. A distinction is normally drawn between anonymization, pseudonymization and de-identification. We refer the reader to Polonetsky et al. (2016) for an excellent overview of these procedures.
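As a concrete illustration of what such a procedure does, the minimal sketch below replaces two kinds of sensitive mentions (titled person names and dates) with numbered surrogate tags. It is only a toy example of pseudonymization: the regular expressions, the placeholder scheme and the sample note are all invented for this illustration, and real de-identification systems, such as those evaluated by Deleger et al. (2013), rely on trained sequence models and far richer pattern inventories covering many more categories of protected health information.

```python
import re

# Hypothetical, deliberately minimal patterns; real systems cover many more
# PHI categories (addresses, phone numbers, record IDs, ...) and combine
# rules with statistical taggers.
PATTERNS = {
    "NAME": re.compile(r"\b(?:Mr\.|Mrs\.|Ms\.|Dr\.)\s+[A-Z][a-z]+\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{2,4}\b"),
}

def pseudonymize(note: str) -> str:
    """Replace matched sensitive mentions with numbered surrogate tags."""
    counters = {}
    for label, pattern in PATTERNS.items():
        def repl(match, label=label):
            counters[label] = counters.get(label, 0) + 1
            return f"[{label}-{counters[label]}]"
        note = pattern.sub(repl, note)
    return note

if __name__ == "__main__":
    example = "Dr. Smith saw the patient on 03/04/2016 for follow-up."
    print(pseudonymize(example))
    # -> [NAME-1] saw the patient on [DATE-1] for follow-up.
```

Even this tiny example shows why sanitization degrades utility: the surrogate tags destroy exactly the kind of contextual detail (who, where, when) that some downstream analyses need, which is one of the criticisms discussed next.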

Although it is a necessary first step in protecting the privacy of patients, sanitization has been criticized for several reasons. First, it affects the integrity of the data and, as a consequence, their utility (Duquenoy et al., 2008). Second, although sanitization in principle promotes data access and sharing, it may often not be sufficient to eliminate the need for consent. This is largely due to the well-known fact that original sensitive data can be re-identified through deductive disclosure (Amblard et al., 2014; De Mazancourt et al., 2015; Hardt et al., 2016; Malin et al., 2013; Tene, 2011).³ Finally, sanitization focuses on protecting the individual, whereas ethical harms are still possible on the group level (O'Doherty et al., 2016; Taylor et al., 2017). Instead of working towards increasingly restrictive sanitization and access measures, another course of action could be to work towards heightening the perception of scientific work, emphasizing professionalism and the existence of punitive measures for illegal actions (Fairfield and Shtein, 2014; Mittelstadt and Floridi, 2016).

Consent. Clinical NLP typically requires a large amount of clinical records describing cases of patients with a particular condition. Although obtaining consent is a necessary first step, obtaining explicit informed consent from each patient can also compromise the research in several ways. First, obtaining consent is time consuming in itself, and it results in financial and bureaucratic burdens. It can also be infeasible for practical reasons such as a patient's death. Next, it can introduce bias, as those willing to grant consent represent a skewed population (Nyrén et al., 2014). Finally, it can be difficult to satisfy the informedness criterion: information about the experiment sometimes cannot be communicated in an unambiguous way, or experiments happen at a speed that makes enacting informed consent extremely hard (Bird et al., 2016).

The alternative might be a default opt-in policy with a right to withdraw (opt-out). Here, consent can be presumed either in a broad manner—allowing unspecified future research, subject to ethical restrictions—or in a tiered manner—allowing certain areas of research but not others (Mittelstadt and Floridi, 2016; Terry, 2012). Since the information about the intended use is no longer uniquely tied to each research case but is more general, this could facilitate the reuse of datasets by several research teams, without the need to ask for consent each time. The success of implementing this approach in practice is likely to depend on public trust and awareness about possible risks and opportunities. We also believe that a distinction between academic research and commercial use of clinical data should be implemented, as the public is more willing to allow research than commercial exploitation (Lawrence, 2016; van Staa et al., 2016).

Yet another possibility is open consent, in which individuals make their data publicly available. Initiatives like the Personal Genome Project may have an exemplary role; however, they can only provide limited data and they represent a biased population sample (Mittelstadt and Floridi, 2016).

Secure access. Since withholding data from researchers would be a dubious way of ensuring confidentiality (Berman, 2002), research has long been active on secure access and storage of sensitive clinical data, and on the balance between the degree of privacy loss and the degree of utility. This is a broad topic that is outside the scope of this article. The interested reader can find the relevant information in Dwork and Pottenger (2013), Malin et al. (2013) and Rindfleisch (1997).

Promotion of knowledge and the application of best-of-class approaches to health data is seen as one of the ethical duties of researchers (Duquenoy et al., 2008; Lawrence, 2016). But for this to be put in practice, ways need to be guaranteed (e.g. with government help) to provide researchers with access to the relevant data. Researchers can also go to the data rather than have the data sent to them. It is an open question, though, whether medical institutions—especially those with less developed research departments—can provide the infrastructure (e.g. enough CPU and GPU power) needed in statistical NLP. Also, granting access to one healthcare organization at a time does not satisfy interoperability (cross-organizational data sharing and research), which can reduce bias by allowing for more complete input data. Interoperability is crucial for epidemiology and rare disease research, where data from one institution can not yield sufficient statistical power (Kaplan et al., 2014).

Are there less sensitive data? One criterion which may have an influence on data accessibility is whether the data is about living subjects or not. The HIPAA privacy rule under certain conditions allows disclosure of personal health information of deceased persons, without the need to seek IRB agreement and without the need for sanitization (Huser and Cimino, 2014). It is not entirely clear, though, how often this possibility has been used in clinical NLP research or beyond.

³ Additionally, it may be due to organizational skepticism about the effectiveness of sanitization techniques, although it has been shown that automated de-identification systems for English perform on par with manual de-identification (Deleger et al., 2013).

Next, the work on surrogate data has recently seen a surge in activity. Increasingly more health-related texts are produced in social media (Abbasi et al., 2014), and patient-generated data are available online. Admittedly, these may not resemble the clinical discourse, yet they pertain to the same individuals whose health is documented in the clinical reports. Indeed, linking individuals' health information from online resources to their health records to improve documentation is an active line of research (Padrez et al., 2015). Although it is generally easier to obtain access to social media data, the use of social media still requires similar ethical considerations as in the clinical domain. See, for example, the influential study on emotional contagion in Facebook posts by Kramer et al. (2014), which has been criticized for not properly gaining prior consent from the users who were involved in the study (Schroeder, 2014).

Another way of reducing the sensitivity of data and improving the chances of IRB approval is to work on derived data. Data that can not be used to reconstruct the original text (and, when sanitized, can not directly re-identify the individual) include text fragments, various statistics and trained models. Working on randomized subsets of clinical notes may also improve the chances of obtaining the data. When we only have access to trained models from disparate sources, we can refine them through ensembling and the creation of silver standard corpora, cf. Rebholz-Schuhmann et al. (2011).
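To make that last point concrete, the sketch below combines token-level predictions from several hypothetical NER systems by majority vote into a single silver-standard annotation, loosely in the spirit of the CALBC silver-standard corpora (Rebholz-Schuhmann et al., 2011). The system outputs, label set and agreement threshold are invented for the example; a real harmonization pipeline would also have to reconcile entity boundaries and label inventories across systems.

```python
from collections import Counter

def silver_standard(predictions, min_agreement=2):
    """Majority-vote token labels from several systems into one silver annotation.

    predictions: list of label sequences, one per system, all of equal length.
    Tokens that do not reach min_agreement votes fall back to 'O' (outside).
    """
    n_tokens = len(predictions[0])
    assert all(len(p) == n_tokens for p in predictions), "systems must be aligned"
    silver = []
    for i in range(n_tokens):
        votes = Counter(p[i] for p in predictions)
        label, count = votes.most_common(1)[0]
        silver.append(label if count >= min_agreement else "O")
    return silver

# Hypothetical outputs of three clinical NER systems for one five-token sentence.
system_a = ["O", "B-PROBLEM", "I-PROBLEM", "O", "B-TREATMENT"]
system_b = ["O", "B-PROBLEM", "I-PROBLEM", "O", "O"]
system_c = ["O", "B-PROBLEM", "O",         "O", "B-TREATMENT"]

print(silver_standard([system_a, system_b, system_c]))
# -> ['O', 'B-PROBLEM', 'I-PROBLEM', 'O', 'B-TREATMENT']
```

Because only the individual systems' outputs are exchanged, the original notes never have to leave the institutions that hold them, which is part of what makes derived data attractive from a privacy standpoint.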

Finally, clinical NLP is also possible on veterinary texts. Records of companion animals are perhaps less likely to involve legal issues, while still amounting to a large pool of data. As an example, around 40M clinical documents from different veterinary clinics in the UK and Australia are stored centrally in the VetCompass repository. First NLP steps in this direction were described in the invited talk at the Clinical NLP 2016 workshop (Baldwin, 2016).

3 Social impact and biases

Unlocking knowledge from free text in the health domain has a tremendous societal value. However, discrimination can occur when individuals or groups receive unfair treatment as a result of automated processing, which might be a result of biases in the data that were used to train models. The question is therefore what the most important biases are and how to overcome them, not only out of ethical but also legal responsibility. Related to the question of bias is so-called algorithm transparency (Goodman, 2016; Kamarinou et al., 2016), as this right to explanation requires that the influences of bias in the training data are charted. In addition to sampling bias, which we introduced in section 2, we discuss in this section further sources of bias. Unlike sampling bias, which is a corpus-level bias, the biases here are already present in the documents, and are therefore hard to account for by introducing larger corpora.

Data quality. Texts produced in clinical settings do not always tell a complete or accurate patient story (e.g. due to time constraints or due to patient treatment in different hospitals), yet important decisions can be based on them.⁴ As language is situated, a lot of information may be implicit, such as the circumstances in which treatment decisions are made (Hersh et al., 2013). If we fail to detect a medical concept during automated processing, this can not necessarily be taken as negative evidence.⁵ Work on identifying and imputing missing values holds promise for reducing incompleteness; see Lipton et al. (2016) for an example in sequential modeling applied to diagnosis classification.

⁴ A way to increase data completeness and reduce selection bias is the use of nationwide patient registries, as known for example in Scandinavian countries (Schmidt et al., 2015).
⁵ We can take timing-related "censoring" effects as an example. In event detection, events prior to the start of an observation may be missed or are uncertain, which means that the first appearance of a diagnosis in the clinical record may not coincide with the occurrence of the disease. Similarly, key events after the end of the observation may be missing (e.g. death, when it occurred in another institution).

Reporting bias. Clinical texts may include bias coming from both the patient's and the clinician's reporting. Clinicians apply their subjective judgments to what is important during the encounter with patients. In other words, there is a separation between, on the one side, what is observed by the clinician and communicated by the patient, and on the other, what is noted down. Cases of more serious illness may be more accurately documented as a result of the clinician's bias (increased attention) and the patient's recall bias. On the other hand, cases of stigmatized diseases may include suppressed information. In the case of traffic injuries, documentation may even be distorted to avoid legal consequences (Indrayan, 2013).

We need to be aware that clinical notes may reflect health disparities. These can originate from prejudices held by healthcare practitioners, which may impact patients' perceptions; they can also originate from communication difficulties in the case of ethnic differences (Zestcott et al., 2016). Finally, societal norms can play a role. Brady et al. (2016) find that obesity is often not documented equally well for both sexes in weight-addressing clinics. Young males are less likely to be recognized as obese, possibly due to societal norms seeing them as "stocky" as opposed to obese. Unless we are aware of such bias, we may draw premature conclusions about the impact of our results.

It is clear that during the processing of clinical texts, we should strive to avoid reinforcing the biases. It is difficult to give a solution on how to actually reduce the reporting bias after the fact. One possibility might be to model it. If we see clinical reports as noisy annotations for the patient story, in which information is left out or altered, we could try to decouple the bias from the reports. Inspiration could be drawn, for example, from the work on decoupling reporting bias from annotations in visual concept recognition (Misra et al., 2016).
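One possible shape for such a model is sketched below: a classifier whose probability that a condition is mentioned in a note factors into the probability that the condition is actually present (predicted from patient data) and the probability that a clinician would write it down given that it is present (predicted from context such as demographics or clinic type). This is only a schematic adaptation of the factorization used by Misra et al. (2016) for image labels; the feature dimensions, module names and training setup are placeholders, and nothing here has been validated on clinical data.

```python
import torch
import torch.nn as nn

class ReportingAwareClassifier(nn.Module):
    """Factor p(mentioned) = p(present) * p(reported | present).

    Supervision comes from what is written in the note (a noisy signal),
    while the presence head is the quantity we would actually like to recover.
    """
    def __init__(self, n_patient_feats=128, n_context_feats=16):
        super().__init__()
        self.presence_head = nn.Linear(n_patient_feats, 1)   # is the condition there?
        self.reporting_head = nn.Linear(n_context_feats, 1)  # would it be noted down?

    def forward(self, patient_feats, context_feats):
        p_present = torch.sigmoid(self.presence_head(patient_feats))
        p_reported = torch.sigmoid(self.reporting_head(context_feats))
        return p_present * p_reported, p_present

model = ReportingAwareClassifier()
loss_fn = nn.BCELoss()
patient_feats = torch.randn(32, 128)              # e.g. encoded history and labs
context_feats = torch.randn(32, 16)               # e.g. demographics, clinic type
mentioned = torch.randint(0, 2, (32, 1)).float()  # noisy, note-derived labels
p_mentioned, p_present = model(patient_feats, context_feats)
loss = loss_fn(p_mentioned, mentioned)            # train against what was reported
loss.backward()                                   # p_present remains the debiased estimate
```

The appeal of this shape is that the two heads can be inspected separately: systematic differences in the reporting head across demographic groups would directly surface the kind of documentation bias described above.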
Observational bias. Although variance in health outcome is affected by social, environmental and behavioral factors, these are rarely noted in clinical reports (Kaplan et al., 2014). The bias of missing explanatory factors because they can not be identified within the given experimental setting is also known as the streetlight effect. In certain cases, we could obtain important prior knowledge (e.g. demographic characteristics) from data other than clinical notes.

Dual use. We have already mentioned linking personal health information from online texts to clinical records as a motivation for exploring surrogate data sources. However, this and many other applications also have the potential to be applied in both beneficial and harmful ways. It is easy to imagine how sensitive information from clinical notes can be revealed about an individual who is present in social media with a known identity. More general examples of dual use are when NLP tools are used to analyze clinical notes with the goal of determining individuals' insurability and employability.

4 Conclusion

In this paper, we reviewed some challenges that we believe are central to the work in clinical NLP. Difficult access to data due to privacy concerns has been an obstacle to progress in the field. We have discussed how the protection of privacy through sanitization measures and the requirement for informed consent may affect the work in this domain. Perhaps it is time to rethink the right to privacy in health in the light of recent work in the ethics of big data, especially its uneasy relationship to the right to science, i.e. being able to benefit from science and participate in it (Tasioulas, 2016; Verbeek, 2014). We also touched upon possible sources of bias that can have an effect on the application of NLP in the health domain, and which can ultimately lead to unfair or harmful treatment.

Acknowledgments

We would like to thank Madhumita and the anonymous reviewers for useful comments. Part of this research was carried out in the framework of the Accumulate IWT SBO project, funded by the government agency for Innovation by Science and Technology (IWT).

References

Ahmed Abbasi, Donald Adjeroh, Mark Dredze, Michael J. Paul, Fatemeh Mariam Zahedi, Huimin Zhao, Nitin Walia, Hemant Jain, Patrick Sanvanson, Reza Shaker, et al. 2014. Social media analytics for smart health. IEEE Intelligent Systems, 29(2):60–80.

Zubair Afzal, Ewoud Pons, Ning Kang, Miriam C.J.M. Sturkenboom, Martijn J. Schuemie, and Jan A. Kors. 2014. ContextD: an algorithm to identify contextual properties of medical terms in a Dutch clinical corpus. BMC Bioinformatics, 15(1):373.

David B. Agus. 2016. Give Up Your Data to Cure Disease. The New York Times, February 6. https://goo.gl/0REG0n.

D. Albright, A. Lanfranchi, A. Fredriksen, W. F. Styler, C. Warner, J. D. Hwang, J. D. Choi, D. Dligach, R. D. Nielsen, J. Martin, W. Ward, M. Palmer, and G. K. Savova. 2013. Towards comprehensive syntactic and semantic annotations of the clinical narrative. Journal of the American Medical Informatics Association, 20(5):922–930.

Maxime Amblard, Karën Fort, Michel Musiol, and Manuel Rebuschi. 2014. L'impossibilité de l'anonymat dans le cadre de l'analyse du discours. In Journée ATALA éthique et TAL.

Timothy Baldwin. 2016. VetCompass: Clinical Natural Language Processing for Animal Health. Clinical NLP 2016 keynote. https://goo.gl/ScGFa2.

Regina Barzilay. 2016. How NLP can help cure cancer? NAACL'16 keynote. https://goo.gl/hi5nrq.

Jules J. Berman. 2002. Confidentiality issues for medical data miners. Artificial Intelligence in Medicine, 26(1):25–36.

Sarah Bird, Solon Barocas, Kate Crawford, Fernando Diaz, and Hanna Wallach. 2016. Exploring or Exploiting? Social and Ethical Implications of Autonomous Experimentation in AI. In Workshop on Fairness, Accountability, and Transparency in Machine Learning.

Cassandra C. Brady, Vidhu V. Thaker, Todd Lingren, Jessica G. Woo, Stephanie S. Kennebeck, Bahram Namjou-Khales, Ashton Roach, Jonathan P. Bickel, Nandan Patibandla, Guergana K. Savova, et al. 2016. Suboptimal Clinical Documentation in Young Children with Severe Obesity at Tertiary Care Centers. International Journal of Pediatrics, 2016.

Wendy W. Chapman, Prakash M. Nadkarni, Lynette Hirschman, Leonard W. D'Avolio, Guergana K. Savova, and Özlem Uzuner. 2011. Overcoming barriers to NLP for clinical text: the role of shared tasks and the need for additional creative solutions. Journal of the American Medical Informatics Association, 18(5):540–543.

Kevin Bretonnel Cohen and Dina Demner-Fushman. 2014. Biomedical natural language processing. John Benjamins Publishing Company.

Hugues De Mazancourt, Alain Couillault, Gilles Adda, and Gaëlle Recourcé. 2015. Faire du TAL sur des données personnelles : un oxymore ? In TALN 2015.

Louise Deleger, Katalin Molnar, Guergana Savova, Fei Xia, Todd Lingren, Qi Li, Keith Marsolo, Anil Jegga, Megan Kaiser, Laura Stoutenborough, and Imre Solti. 2013. Large-scale evaluation of automated clinical note de-identification and its impact on information extraction. Journal of the American Medical Informatics Association, 20(1):84.

Dina Demner-Fushman, Wendy W. Chapman, and Clement J. McDonald. 2009. What can natural language processing do for clinical decision support? Journal of Biomedical Informatics, 42(5):760–772.

Penny Duquenoy, Carlisle George, and Anthony Solomonides. 2008. Considering something ELSE: Ethical, legal and socio-economic factors in medical imaging and medical informatics. Computer Methods and Programs in Biomedicine, 92(3):227–237.

Cynthia Dwork and Rebecca Pottenger. 2013. Toward practicing privacy. Journal of the American Medical Informatics Association, 20(1):102–108.

Joshua Fairfield and Hannah Shtein. 2014. Big data, big problems: Emerging issues in the ethics of data science and journalism. Journal of Mass Media Ethics, 29(1):38–51.

Jung-wei Fan, Rashmi Prasad, Romme M. Yabut, Richard M. Loomis, Daniel S. Zisook, John E. Mattison, and Yang Huang. 2011. Part-of-speech tagging for clinical text: wall or bridge between institutions. In AMIA Annual Symposium Proceedings.

Carol Friedman, Thomas C. Rindflesch, and Milton Corn. 2013. Natural language processing: state of the art and prospects for significant progress, a workshop sponsored by the National Library of Medicine. Journal of Biomedical Informatics, 46(5):765–773.

Ben Goldacre. 2014. The NHS plan to share our medical data can save lives but must be done right. https://goo.gl/MH2eC0.

Bryce W. Goodman. 2016. A Step Towards Accountable Algorithms?: Algorithmic Discrimination and the European Union General Data Protection. In NIPS Symposium on Machine Learning and the Law.

Moritz Hardt, Eric Price, and Nati Srebro. 2016. Equality of opportunity in supervised learning. In NIPS.

William R. Hersh, Mark G. Weiner, Peter J. Embi, Judith R. Logan, Philip R.O. Payne, Elmer V. Bernstam, Harold P. Lehmann, George Hripcsak, Timothy H. Hartzog, James J. Cimino, and Joel H. Saltz. 2013. Caveats for the use of operational electronic health record data in comparative effectiveness research. Medical care, 51(8 0 3).

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 591–598, Berlin, Germany, August. Association for Computational Linguistics.

George Hripcsak, Carol Friedman, Philip O. Alderson, William DuMouchel, Stephen B. Johnson, and Paul D. Clayton. 1995. Unlocking clinical data from narrative reports: a study of natural language processing. Annals of internal medicine, 122(9):681–688.

Vojtech Huser and James J. Cimino. 2014. Don't take your EHR to heaven, donate it to science: legal and research policies for EHR post mortem. Journal of the American Medical Informatics Association, 21(1):8–12.

Abhaya Indrayan. 2013. Varieties of bias to guard against. https://goo.gl/SqnuZY.

Alistair E.W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3.

Dimitra Kamarinou, Christopher Millard, and Jatinder Singh. 2016. Machine Learning with Personal Data. Queen Mary School of Law Legal Studies Research Paper, (247/2016).

Robert M. Kaplan, David A. Chambers, and Russell E. Glasgow. 2014. Big data and large sample size: a cautionary note on the potential for bias. Clinical and translational science, 7(4):342–346.

Youngjun Kim, Ellen Riloff, and John F. Hurdle. 2015. A study of concept extraction across different types of clinical notes. In AMIA Annual Symposium Proceedings, volume 2015, page 737. American Medical Informatics Association.

Adam D.I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790.

Neil Lawrence. 2016. Data Analysis, NHS and Industrial Partners. https://goo.gl/rRIcu5.

Zachary C. Lipton, David Kale, and Randall Wetzel. 2016. Modeling Missing Data in Clinical Time Series with RNNs. arXiv preprint arXiv:1606.04130.

Bradley A. Malin, Khaled El Emam, and Christine M. O'Keefe. 2013. Biomedical data privacy: problems, perspectives, and recent advances. Journal of the American Medical Informatics Association, 20(1):2–6.

Stéphane M. Meystre, Guergana K. Savova, Karin C. Kipper-Schuler, John F. Hurdle, et al. 2008. Extracting information from textual documents in the electronic health record: a review of recent research. Yearbook of Medical Informatics, 35:128–44.

Ishan Misra, C. Lawrence Zitnick, Margaret Mitchell, and Ross B. Girshick. 2016. Seeing through the Human Reporting Bias: Visual Classifiers from Noisy Human-Centric Labels. In The IEEE Conference on Computer Vision and Pattern Recognition.

Brent Daniel Mittelstadt and Luciano Floridi. 2016. The ethics of big data: current and foreseeable issues in biomedical contexts. Science and engineering ethics, 22(2):303–341.

Travis B. Murdoch and Allan S. Detsky. 2013. The inevitable application of big data to health care. JAMA, 309(13):1351–1352.

Olof Nyrén, Magnus Stenbeck, and Henrik Grönberg. 2014. The European Parliament proposal for the new EU General Data Protection Regulation may severely restrict European epidemiological research. European Journal of Epidemiology, 29(4):227–230.

Kieran C. O'Doherty, Emily Christofides, Jeffery Yen, Heidi Beate Bentzen, Wylie Burke, Nina Hallowell, Barbara A. Koenig, and Donald J. Willison. 2016. If you build it, they will come: unintended future uses of organised health data collections. BMC Medical Ethics, 17(1).

Kevin A. Padrez, Lyle Ungar, Hansen Andrew Schwartz, Robert J. Smith, Shawndra Hill, Tadas Antanavicius, Dana M. Brown, Patrick Crutchley, David A. Asch, and Raina M. Merchant. 2015. Linking social media and medical record data: a study of adults presenting to an academic, urban emergency department. BMJ quality & safety.

Jules Polonetsky, Omer Tene, and Kelsey Finch. 2016. Shades of Gray: Seeing the Full Spectrum of Practical Data De-identification. Santa Clara Law Review, 56(3).

Sameer Pradhan, Noemie Elhadad, Brett R. South, David Martinez, Amy Vogel, Hanna Suominen, Wendy W. Chapman, and Guergana Savova. 2013. Task 1: ShARe/CLEF eHealth evaluation lab. In Online Working Notes of CLEF.

Dietrich Rebholz-Schuhmann, Antonio Jimeno Yepes, Chen Li, et al. 2011. Assessment of NER solutions against the first and second CALBC Silver Standard Corpus. Journal of Biomedical Semantics, 2(5):S11.

Thomas C. Rindfleisch. 1997. Privacy, information technology, and health care. Communications of the ACM, 40(8):92–100.

Mohammed Saeed, Mauricio Villarroel, Andrew T. Reisner, Gari Clifford, Li-Wei Lehman, George Moody, Thomas Heldt, Tin H. Kyaw, Benjamin Moody, and Roger G. Mark. 2011. Multiparameter Intelligent Monitoring in Intensive Care II (MIMIC-II): a public-access intensive care unit database. Critical care medicine, 39(5):952.

Morten Schmidt, S.A. Schmidt, Jakob Lynge Sandegaard, Vera Ehrenstein, Lars Pedersen, and Henrik Toft Sørensen. 2015. The danish national patient registry: a review of content, data quality, and research potential. Clinical Epidemiology, 7(449):e490.

Ralph Schroeder. 2014. Big data and the brave new world of social media research. Big Data & Society, 1(2):2053951714563194.

William Styler IV, Steven Bethard, Sean Finan, Martha Palmer, Sameer Pradhan, Piet de Groen, Brad Erickson, Timothy Miller, Chen Lin, Guergana Savova, and James Pustejovsky. 2014. Temporal annotation in the clinical domain. Transactions of the Association for Computational Linguistics, 2:143–154.

John Tasioulas. 2016. Van Hasselt Lecture 2016: Big Data, Human Rights and the Ethics of Scientific Research. https://goo.gl/QREHUN.

Linnet Taylor, Luciano Floridi, and Bart van der Sloot. 2017. Group Privacy: New Challenges of Data Technologies. Springer International Publishing.

Omer Tene. 2011. Privacy: The new generations. International Data Privacy Law, 1(1):15–27.

Nicolas P. Terry. 2012. Protecting patient privacy in the age of big data. University of Missouri-Kansas City Law Review, 81(2).

Özlem Uzuner, Brett R. South, Shuying Shen, and Scott L. DuVall. 2011. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. Journal of the American Medical Informatics Association, 18(5):552–556.

Tjeerd-Pieter van Staa, Ben Goldacre, Iain Buchan, and Liam Smeeth. 2016. Big health data: the need to earn public trust. BMJ, 354:i3636.

Sumithra Velupillai, D. Mowery, Brett R. South, Maria Kvist, and Hercules Dalianis. 2015. Recent advances in clinical natural language processing in support of semantic analysis. Yearbook of Medical Informatics, 10:183–193.

Peter-Paul Verbeek. 2014. Op de vleugels van Icarus. Lemniscaat.

Colin A. Zestcott, Irene V. Blair, and Jeff Stone. 2016. Examining the presence, consequences, and reduction of implicit bias in health care: A narrative review. Group Processes & Intergroup Relations.

Goal-Oriented Design for Ethical Machine Learning and NLP

Tyler Schnoebelen
Decoded AI
[email protected]

Abstract

The argument made in this paper is that to act ethically in machine learning and NLP requires focusing on goals. NLP projects are often classificatory systems that deal with human subjects, which means that goals from people affected by the systems should be included. The paper takes as its core example a model that detects criminality, showing the problems of training data, categories, and outcomes. The paper is oriented to the kinds of critiques on power and the reproduction of inequality that are found in social theory, but it also includes concrete suggestions on how to put goal-oriented design into practice.

1 Introduction

Ethics asks us to consider how we live and how we discern right and wrong in particular circumstances. Ethicists differ on what they consider fundamental: the actor's moral character and dispositions (virtue ethics), the duties and obligations of the actor given their role (deontology), or the outcomes of the actions (consequentialism). Computational linguists do not need to answer a question of primacy, but the three themes of virtues, duties, and consequences do need to be considered.

This paper uses goals to draw out each of the three themes. Goals are states of affairs that people would like to achieve, maintain, or avoid in the face of changes and obstacles. The use of "goals" here is expansive so that it includes not just designers and users of a system, but also those who are (or would be) affected by the system.

NLP practitioners design and build technologies that connect to law, finance, education and many other domains that substantially affect people, often those with less access to resources and information. Privileged positions come with responsibilities: namely, to recognize that systems affect people unevenly. To design with virtues, duties, and consequences in mind is to recognize the limits of one's perspective and then design systems with these limitations in mind.

2 Wicked problems

Simple NLP problems and simple NLP projects require you to identify stakeholders, articulate their goals, and build a plan. Another category of complex problems includes those that are only actually complex until they are decomposed into multiple simple problems.

A third category is wicked problems: those in which you can articulate goals but they are fundamentally in conflict (Rittel and Webber, 1973). For example, a traffic planner wants to build a highway because they want less congestion. But community members don't want their neighborhood cut in half because it destroys their goal of affiliation.

Wicked problems have no definitive solution because there are multiple valid viewpoints: you cannot take for granted that there is a single objective that will let you judge your solution as correct and finished.

We often shield ourselves from ethical problems by ignoring populations who would throw light on a project's wicked complexity. This is a good indication of an ethical problem: turning a blind eye to people who will be affected by the system but who are difficult to reach or who may have inconveniently conflicting goals.

3 An easy unethical project

In order to illustrate the ethical implications of goal-oriented design, let's take an example from machine learning that most readers will find straightforwardly problematic. Here are two conclusions from an abstract on automated inference of criminality using faces (Wu and Zhang, 2016):

All four classifiers perform consistently well and produce evidence for the validity of automated face-induced inference on criminality… Also, we find some discriminating structural features for predicting criminality, such as lip curvature, eye inner corner distance, and the so-called nose-mouth angle.

Can the goal for this project be simply stated? It seems to be, "Improve safety by having computers automatically detect the criminality of people's faces." This goal inherently categorizes people by degrees of criminality based on physical characteristics. It takes the perspective of the safety-minded, yet anyone categorized as criminal has legitimate goals to consider. Regardless of how many iterations the models go through, meeting the main goal will always create a group of criminal-looking people who will not agree with that definition and its consequences. This is a wicked problem.

In the United States, people of color have radically different experiences with the criminal justice system than white people. Attempting to use U.S. police data for training will not work: the criminal justice system in the U.S. is systemically biased, as can be seen in Hetey et al. (2016), which shows racial biases in Oakland police stops, searches, handcuffings, and arrests.

The Hetey et al. data is not merely counts of police actions on different kinds of civilians; it is also an examination of the differences in the aggressiveness of the language used by the police with African-American men. In other parts of the justice system, language—non-standard dialects—causes crucial testimony to be ignored (Rickford and King, 2016). Linguistic profiling is common in housing and many other areas (Baugh, 2016).

Ignoring the social and ideological uses of language means ignoring some of the ways NLP techniques are applied. There are multiple companies working on models that use language data to decide who to give loans to. As with police stops, the features detected are not intentionally racially biased, but they have the same effect in excluding specific individuals from access to credit because of who they look, sound, or read like.

Such wicked problems are adjudicated by acts of authority. Neither wicked problems nor adjudication are inherently unethical. But dismissing the claims and goals of affected populations usually is. Such populations get hit by a double whammy: they are unlikely to be represented by technologists and other stakeholders and they have much less room to maneuver in whatever system is built.

4 The trouble with training data

The data for Wu and Zhang (2016) comes from China and includes only men, but it is ethically safer to assume that data from the ministry of public security and various police departments is biased than it is to assume that it is balanced and representative.

Interrogating training data is important for building effective machine learning models and it's also important for building ethical ones. Machine learning techniques depend upon training data, which causes two kinds of problems. The first problem is that whatever you build, it's biased towards the contexts that you can sample and the ways you get it annotated. Some populations are overrepresented and some are underrepresented.

The second problem with training data is that whatever your categories are, they are wrong. Categories can still be meaningful and useful, but it is a mistake to consider them to be natural or uncontestable. As Bowker and Star (1999) discuss, categorization always valorizes some point of view and erases others.

For example, gender detection is common in NLP. These projects typically begin with the assumption that a binary division of humans is relevant. But as Bamman et al. (2014) show, even binary models with high accuracy are descriptively inadequate. This is also the central point of intersectionality: people are not just the sum of different demographic characteristics (Crenshaw, 1989).

Goals for binary-gender detection projects are generally couched in terms of understanding people. But to what end and in which ways? Making goals explicit can help uncover latent biases in your mental model of what kind of people there are in the world and how you believe they move through it.

5 What you think of people

The ethics you adopt has a lot to do with what you think of human beings. In the case of Wu and Zhang (2016), tying facial structure to criminality suggests that some humans are "bad".

Plenty of serious thinkers have considered people to be fundamentally good. For example, that's what Confucian thinkers like Mencius and Wang Yang Ming believed (Chan, 2008).

Evidence suggests that individual choices—our goodness and our badness—are strongly dependent upon context. For example, what happens when you give theology students a chance to help a stranger? Darley and Batson (1973) demonstrate how localized the choices are: students in a rush to give a talk do not help—even when the talk they are hurrying to is about The Good Samaritan.

People commonly remember psychologist Walter Mischel's Marshmallow Test as proof that traits like self-control are destiny: certain kinds of children resist taking a marshmallow and they grow up to be successful. But the idea that people have durable traits is precisely not Mischel's conclusion. Rather, it is that people are fundamentally flexible: if you reframe how you think about a situation, you change how you react to it (Mischel, 2014).

People seem stable and consistent because they tend to be put in the same situations; in those situations they have the same role, and the same kinds of relationships to you. How do we keep getting into the same situations? The answer requires us to appreciate individuals' agentive choices as well as to recognize the social structures that give rise to and constrain those choices. The systems we build enable, enforce, and constrain choices.

Defining goals, building models, and adjudicating conflict are clear exercises of power. But the powerful have another benefit. "Power means not having to act, or more accurately, the capacity to be more negligent and casual about any single performance" (Scott, 1990). Systems are not equally hospitable to all people and require some to perform acrobatics and contortions to get by.

Deontologists are the ethicists who focus on duties and obligations. As people in relative positions of power, we have an outsized impact on systems and therefore greater obligations to the people who are marginalized or victimized by them (Kamm, 2008).

6 Outcomes and reiterations

Utilitarians are consequentialist ethicists famous for focusing on the goodness of outcomes (Foot, 1967; Taurek, 1977; Parfit, 1978; Thomson, 1985). Outcomes are complicated: let's say criminal recognition worked. The odds are that it would make the world marginally safer for many people. But none of us have built a system with zero false positives. So a "working" criminal recognition system would make the lives of some innocent people who were treated as criminals much, much worse. Goal-oriented ethical design requires thinking about outcomes, with a special focus on which systems are created and maintained, and how disparate the outcomes are for the people subject to the system.

To think ethically is to think self-skeptically: "What is the worst possible way this technology could be used and how sound are my mitigation strategies?" Recently, a number of American consulting firms attempted to answer a Request for Proposals from an oil-rich country that wanted to understand social media sentiment on government projects like the building of a new stadium. But stated and elicited goals and use cases are not necessarily how something will be used or even what is actually desired.

The RFP stayed open for over a year, suggesting that consulting firms had difficulty finding NLP practitioners willing to take the stated goal at face value. It has subsequently been shown that, in fact, this project was intended to identify dissidents. The ability to identify sentiment about government projects can give a voice to people about those projects, which seems positive. But the worst-case scenario is that it can find people who are negative about the government for the government to track, regulate, discipline, and punish.

Considering the system-wide consequences of models leads us back to criminality recognition. It is one thing to identify an actual perpetrator of a crime, but to identify someone who has not committed a crime is to invite harassment from the police. Corporations could also use these models to make it hard to get a job, go into stores, or open bank accounts. In short, it could become nearly impossible for certain innocent individuals to operate within the law.

Systems shape the choices people are allowed to make and therefore they shape the people—not just the people suspected of being criminals, but everyone else, too. People who are not identified as criminals by the system may come to believe it works and that others who look bad are bad. In social theory terms, "subjects regulated by such structures are, by virtue of being subjected to them, formed, defined, and reproduced in accordance with the requirements of those structures" (Butler, 1999).

It seems handy to have something else make choices that we probably would have made anyhow. Even without any algorithms, there are more choice points in our lives than we can possibly give thoughtful consideration to. That's one reason why status quos maintain themselves: we tend to do the things we've tended to do (Bourdieu, 1977; Giddens, 1984; Butler, 1999). The more consistent our systems are and the more rapidly they converge on consistency, the more they are likely to reiterate—and possibly exaggerate—what already exists. The actions a system takes may be small. But the ramifications may not be, as is the case with news recommendation engines operating in an already partisan context.

Routines in industry often serve to reduce anxiety. But whose anxiety? Each human or algorithmic choice offers the possibility of disturbing the status quo, but the vast majority of the time, they reproduce what came before. By considering the goals of people affected by the systems we build, we have a better chance of seeing how much people have to conform or contort themselves to receive benefits and avoid problems. In turn, these perspectives give us a better ability to abandon projects or reconceive them to give people ways of thwarting and hindering unethical instruments and effects of power.

7 Practical recommendations

NLP practitioners are used to thinking critically about models and algorithms. Taking an ethical stance means looking at goals just as critically, which in turn requires deeper interrogation of the training data, the categories, and the effects of the system. It also means seriously considering how the outputs of the specific system being built become inputs for other systems. But how does one do this other than "thinking harder?"

Perform a premortem (Klein, 2007). In a premortem, a team at the beginning of a project imagines the project was completed and turned out to be a complete disaster. They narrate, individually and collectively, the stories of the failures. This is a generally useful way of identifying weaknesses in design, planning, and implementation. Premortems can also be used to diagnose ethical problems. Ideally, participants approach the premortem from a place of true concern for people, but premortems can be helpful even if participants are orienting to problems of human resources, public relations, and customer service.

Ask for justifications. There are lots of things you could be doing, but why do managers and executives want to do this? Any of the following replies should put you on Ethical High Alert:
1. Everyone else is doing it and we have to keep up
2. No one else is doing it so we can lead the pack
3. It makes money
4. It's legal
5. It's inevitable
Projects that get these responses may be ethical, but these are terrible justifications in any event (for more on problematic justifications see Pope and Vasquez, 2016). You may get an idea because competitors are doing it, and you certainly want to check on legality, but we shouldn't confuse wishes, plans, and circumstances with justifications. Even if markets and the law worked to promote ethical behavior (a big if), they will necessarily lag behind the new ethical problems that computational linguists, data scientists and A.I. practitioners bring forth (Moor, 1985).

List the people affected. Which groups are specifically represented in the training data and which ones are left out? Who will use the system? Who will the system itself affect, distinguishing people immediately affected from those affected as the system outputs become inputs to other systems? How awful is it to be a false positive or a false negative? Who is most/least vulnerable to the negative effects of the system? The point of making a list is to keep technical models from becoming unmoored from human beings.

Is it a WMD? Cathy O'Neil describes the three characteristics of Weapons of Math Destruction: they are opaque to the people they affect, they affect important aspects of life (education, housing, work, justice, finance/credit), and they can do real damage.

What values are enshrined? We orient ethics around dilemmas of preventing harm, but it is also worth asking whether our systems bring about good. Which values are served and which are eschewed by a planned technology? A non-comprehensive list to consider: freedom, peace, security, dignity, respect, justice, equality.

Principles and values come into conflict—there's even an adage, "it's not a principle unless it costs you something". For example, a project centered on security may have negative implications for equality. Conflict is not to be avoided, it's to be made explicit—and, most difficultly, it is to be made explicit to people affected.

Sweeping concerns under the rug or otherwise obfuscating them are convenient solutions, not ethical ones.

8 Conclusion

Technology does not just appear and impact society; it is the product of society, carrying with it the baggage of what has come before and usually reproducing it, discriminatory warts and all. Technology does not just appear: we make it.

But as Bruno Latour points out, "If there is one thing toward which 'making' does not lead, it is to the concept of a human actor fully in command" (Latour, 2003). At a construction site you can witness builders who may have mastery but certainly not full control: materials resist, personnel get sick, the weather won't cooperate, the planning department requires another form, the client is late with payment but fresh with a new idea.

Mastery and expertise do not imply control over objects and people; they imply practice and the ability to translate that practice into both plans and improvisations.

An important aspect of virtue ethics is practicing and developing dispositions towards moral choices (Annas, 1998). To develop habits of bravery, justice, self-control, and other virtues means practicing them. By focusing on goals, we focus on the connections between systems and people. We talk to people about their goals and their situations. We reason through surface conflicts that can be solved and discover where compromise is impossible so that we know when to reimagine our systems and when to abandon them. Done consistently, this kind of design develops habits of thinking and feeling that enable and refine our capacity to be ethical and build ethically.

It is necessary to acknowledge and address Crawford (2016)'s critique: most of the people who build technology come from privileged backgrounds, which makes it difficult for our imagination and our empathy to extend out to everyone our systems will affect.

The implication extends us beyond what is comfortable for many people and organizations: not only to attend to issues of diversity and representation, but to go out and educate communities who will be affected so that they, too, can voice their goals and values. In other words, the practice of ethical design among NLP experts leads to greater ethical capacity—but ethics are too important to be left only to experts.

References

Julia Annas. 1998. Virtue and eudaimonism. Social Philosophy and Policy, 15(1):37–55.

David Bamman, Jacob Eisenstein, and Tyler Schnoebelen. 2014. Gender identity and lexical variation in social media. Journal of Sociolinguistics, 18(2):135–160.

John Baugh. 2016. Linguistic Profiling and Discrimination. The Oxford Handbook of Language and Society:349–368.

Pierre Bourdieu. 1977. Outline of a Theory of Practice. Cambridge University Press, Cambridge, England.

Geoffrey C. Bowker and Susan Leigh Star. 1999. Sorting things out: classification and its consequences. MIT Press, Cambridge, MA.

Judith Butler. 1999. Gender Trouble. Routledge, New York.

Wing-tsit Chan. 2008. A source book in Chinese philosophy. Princeton University Press.

Kate Crawford. 2016. Artificial Intelligence's White Guy Problem. New York Times.

Kimberle Crenshaw. 1989. Demarginalizing the intersection of race and sex: A black feminist critique of antidiscrimination doctrine, feminist theory and antiracist politics. University of Chicago Legal Forum:139–167.

John M. Darley and C. Daniel Batson. 1973. "From Jerusalem to Jericho": A study of situational and dispositional variables in helping behavior. Journal of personality and social psychology, 27(1):100–108.

Philippa Foot. 1967. The problem of abortion and the doctrine of double effect. Oxford Review, 5:5–15.

Anthony Giddens. 1984. The Constitution of Society: Outline of the Theory of Structuration. University of California Press, Berkeley, CA.

Rebecca C. Hetey, Benoît Monin, Amrita Maitreyi, and Jennifer L. Eberhardt. 2016. Data for Change: A Statistical Analysis of Police Stops, Searches, Handcuffings, and Arrests in Oakland, Calif., 2013-2014.

Frances Myrna Kamm. 2008. Intricate ethics: Rights, responsibilities, and permissible harm. Oxford University Press.

Gary Klein. 2007. Performing a project premortem. Harvard Business Review, 85(9):18–19.

Bruno Latour. 2003. The Promises of Constructivism. In Chasing technoscience: Matrix for materiality. Indiana University Press, Bloomington, IN.

Walter Mischel. 2014. The marshmallow test: understanding self-control and how to master it. Random House.

James H. Moor. 1985. What is computer ethics? Metaphilosophy, 16(4):266–275.

Derek Parfit. 1978. Innumerate ethics. Philosophy & Public Affairs:285–301.

Kenneth S. Pope and Melba JT Vasquez. 2016. Ethics in psychotherapy and counseling: A practical guide. John Wiley & Sons.

John R. Rickford and Sharese King. 2016. Language and linguistics on trial: Hearing Rachel Jeantel (and other vernacular speakers) in the courtroom and beyond. Language, 92(4):948–988.

Horst W.J. Rittel and Melvin M. Webber. 1973. Dilemmas in a general theory of planning. Policy sciences, 4(2):155–169.

James Scott. 1990. Domination and the Arts of Resistance: Hidden Transcripts. Yale University Press, New Haven, CT.

John M. Taurek. 1977. Should the numbers count? Philosophy & Public Affairs:293–316.

Judith Jarvis Thomson. 1985. The trolley problem. The Yale Law Journal, 94(6):1395–1415.

Xiaolin Wu and Xi Zhang. 2016. Automated Inference on Criminality using Face Images. arXiv:1611.04135 [cs], November.

Ethical Research Protocols for Social Media Health Research

Adrian Benton
Center for Language and Speech Processing
Johns Hopkins University
[email protected]

Glen Coppersmith
Qntfy
[email protected]

Mark Dredze
Human Language Technology Center of Excellence
Johns Hopkins University
[email protected]

Abstract

Social media have transformed data-driven research in political science, the social sciences, health, and medicine. Since health research often touches on sensitive topics that relate to ethics of treatment and patient privacy, similar ethical considerations should be acknowledged when using social media data in health research. While much has been said regarding the ethical considerations of social media research, health research leads to an additional set of concerns. We provide practical suggestions in the form of guidelines for researchers working with social media data in health research. These guidelines can inform an IRB proposal for researchers new to social media health research.

1 Introduction

Widely available social media data – including Twitter, Facebook, discussion forums and other platforms – have emerged as grounds for data-driven research in several disciplines, such as political science (Tumasjan et al., 2011), public health (Paul and Dredze, 2011), economics (Bollen et al., 2011), and the social sciences in general (Schwartz et al., 2013). Researchers have access to massive corpora of online conversations about a range of topics as never before. What once required painstaking data collection or controlled experiments can now be quickly collected and analyzed with computational tools. The impact of such data is especially significant in health and medicine, where advances in our understanding of disease transmission, medical decision making, human behavior and public perceptions of health topics could directly lead to saving lives and improving quality of life.

Health research often touches on sensitive topics that relate to ethics of treatment and patient privacy. Based on decades of research experience and public debate, the research community has developed an extensive set of guidelines surrounding ethical practices that guide modern research programs. These guidelines focus on human subjects research, which involves research with data from living individuals. The core principles of human subjects research were codified in the Belmont Report (National Commission, 1978), which serves as the essential reference for institutional review boards (IRBs) in the United States. IRB guidelines include a range of exemptions from full review for research protocols that consider certain types of data or populations. For example, research projects that rely on online data sources may be exempt since the data are publicly available. Historically, public data exemptions included previously compiled databases containing human subject data that have entered the public domain.

The recent proposal to modernize the U.S. Common Rule for the Protection of Human Subjects acknowledges the widespread use of social media for health research, but does little to clarify the ethical obligations of social media health researchers, generally reducing the oversight necessary for research placed under expedited review (National Research Council, 2014).


A more participatory research model is emerging in social, behavioral, and biomedical research, one in which potential research subjects and communities express their views about the value and acceptability of research studies. This participatory model has emerged alongside a broader trend in American society, facilitated by the widespread use of social media, in which Americans are increasingly sharing identifiable personal information and expect to be involved in decisions about how to further share the personal information, including health-related information that they have voluntarily chosen to provide.

In general, it provides a more permissive definition of what qualifies as exempt research. It suggests exempting observational studies of publicly available data where appropriate measures are taken to secure sensitive data, and demonstrably benign behavioral intervention studies.

The intersection of these ethics traditions and social media research poses new challenges for the formulation of research protocols. These challenges are further complicated by the discipline of the researchers conducting these studies. Health research is typically conducted by researchers with training in medical topics, who have an understanding of human subjects research protocols and issues regarding IRBs. In contrast, social media research may be conducted by computer scientists and engineers, disciplines that are typically unaccustomed to these guidelines (Conway, 2014).

Although this dichotomy is not absolute, many researchers are still unclear on what measures are required by an IRB before analyzing social media data for health research. Conversations by the authors with colleagues have revealed a wide range of "standard practice" from IRBs at different institutions. In fact, the (excellent) anonymous reviews of this paper stated conflicting perceptions on this point. One claimed that online data did not necessarily qualify for an exemption if account handles were included, whereas another reviewer stated that health research solely on public social media data did not constitute human subjects research.

The meeting of non-traditional health researchers, health topics, and non-traditional data sets has led to questions regarding the ethical and privacy concerns of such research. This document is meant to serve as a guide for researchers who are unfamiliar with health-related human subjects research and want to craft a research proposal that complies with the requirements of most IRBs or ethics committees.

How are we to apply the ethical principles of human subjects research to projects that analyze publicly available social media posts? What protections or restrictions apply to the billions of Twitter posts publicly available and accessible by anyone in the world? Are tweets that contain personal information – including information about the author or individuals known to the author – subject to the same exemptions from full IRB review that have traditionally been granted to public data sources? Are corpora that include public data from millions of individuals subject to the same informed consent requirements of traditional human subjects research? Should researchers produce annotations on top of these datasets and share them publicly with the research community? The answers to these and other questions influence the design of research protocols regarding social media data.

Ethical issues surrounding social media research have been discussed in numerous papers, a survey of which can be found in McKee (2013) and Conway (2014). Additionally, Mikal et al. (2016) used focus groups to understand the perceived ethics of using social media data for mental health research. Our goal in this paper is complementary to these ethics surveys: we want to provide practical guidance for researchers working with social media data in human subjects research. We, ourselves, are not ethicists; we are practitioners who have spent time considering practical suggestions in consultation with experts in ethics and privacy. These guidelines encapsulate our experience implementing privacy and ethical ideals and principles.

These guidelines are not meant as a fixed set of standards; rather, they are a starting point for researchers who want to ensure compliance with ethical and privacy guidelines, and they can be included with an IRB application as a reflection of current best practices. We intend these to be a skeleton upon which formal research protocols can be developed, and precautions when working with these data. Readers will also note the wide range of suggestions we provide, which reflects the wide range of research and associated risk. Finally, we include software packages to support implementation of some of these guidelines.

2 Discussion

The start of each research study includes asking core questions about the benefits and risks of the proposed research. What is the potential good this particular application allows? What is the potential harm it may cause and how can the harm be mitigated? Is there another feasible route to the good with less potential harm?

Answers to these questions provide a framework within which we can decide which avenues of research should be pursued. Virtually all technology is dual-use: it can be used for good or ill. The existence of an ill use does not mean that the technology should not be developed, nor does the existence of a good mean that it should.

To focus our discussion on the pragmatic, we will use mental health research as a concrete use case. A research community has grown around using social media data to assess and understand mental health (Resnik et al., 2013; Schwartz et al., 2013; Preotiuc-Pietro et al., 2015; Coppersmith et al., 2015a; De Choudhury et al., 2016). Our discussion on the benefits and risks of such research is sharpened by the discrimination and stigma surrounding mental illness. The discrimination, paired with potentially lethal outcomes, puts the risks and benefits of this type of research in stark relief – not sufficiently protecting users'/subjects' privacy may exacerbate the challenge, discourage individuals from seeking treatment and erode public trust in researchers. Similarly, insufficient research results in a cost measured in human lives – in the United States, more than 40,000 die from suicide each year (Curtin et al., 2016). Mental health may be an extreme case for the gravity of these choices, but similar risks and benefits are present in many other health research domains. Clearly identifying the risks and the potential reward helps to inform the stance and guidelines one should adopt.

We found it helpful to enumerate facts and observations that inform each research protocol decision:

• We want to make a positive impact upon society, and one significant contribution we may provide is to better understand mental illness. Specifically, we want to learn information that will aid mental health diagnosis and help those challenged by mental illness. Thus, the driving force behind this research is to prevent suffering from mental illness.

• Intervention has great potential for good and for harm. Naturally, we would like to help those around us that are suffering, but that does not mean that we are properly equipped to do so. Interventions enacted at a time of emotional crisis amplify the risks and benefits. The approach we have taken in previous studies was to observe and understand mental illness, not to intervene. This is likely true for many computer and data science research endeavors, but that does not absolve the consideration of interventions. Ultimately, if the proposed research is successful it will inform the way that medicine is practiced, and thus will directly or indirectly have an effect on interventions.

• Machine learning algorithms do not learn perfectly predictive models. Errors and misclassifications will be made, and this should be accounted for by the researcher. Even less clearly error-prone systems, such as databases for sensitive patient data, are liable to being compromised.

• Social media platforms, like Twitter, are often public broadcast media. Nevertheless, much has been written about the perception that users do not necessarily treat social media as a purely public space (McKee, 2013). Mikal et al. (2016) found that many Twitter users in focus groups do have a skewed expectation of privacy, even in an explicitly public platform like Twitter, driven by "users' (1) failure to understand data permanence, (2) failure to understand data reach, and (3) failure to understand the big data computational tools that can be used to analyze posts".

Our guidelines emerge from these tenets and our experience with mental health research on social media, where we try to strike a balance between enabling important research and the concerns of risk to the privacy of the target population. We encourage all researchers to frame their own research tenets first to establish guiding principles as to how research should proceed.

3 Guidelines

In contrast to others (Neuhaus and Webmoor, 2012; McKee, 2013; Conway, 2014) who have offered broad ethical frameworks and high-level guidance in social media health research, we offer specific suggestions grounded in our own experience conducting health research with social media. At the same time, the risk of a study varies depending on the type of health annotations collected and whether the research is purely observational or not. Therefore, we do not provide hard rules, but different options given the risk associated with the study.

Researchers familiar with human subjects research may ask how our guidelines differ from those recommended for all such research, regardless of connections with social media data. While the main points are general to human subjects research, we describe how these issues specifically arise in the context of social media research, and provide relevant examples. Additionally, social media raises some specific concerns and suggestions described below, such as (1) concern of inadvertently compromising user privacy by linking data, even when all the linked datasets are public, (2) using alternatives to traditionally obtained informed consent, (3) additional steps to de-identify social media data before analysis and dissemination, and (4) care when attributing and presenting information in public forums. Furthermore, our intended audience are readers unfamiliar with human subjects research guidelines, as opposed to seasoned researchers in this area.

3.1 Institutional Review Board

In the United States, all federally-funded human subject research must be approved by a committee of at least five persons, with at least one member from outside of the institution (Edgar and Rothman, 1995). This committee is the Institutional Review Board (IRB), and in practice, many American institutions require all performed research to be sanctioned by the IRB. Ethics committees serve a similar role as IRBs in European Union member states (European Parliament and Council of the European Union, 2001). These committees have different regulations, but typically make similar approval judgments as IRBs (Edwards et al., 2007).

Human subjects are any living individuals about whom an investigator conducting research obtains "(1) Data through intervention or interaction with the individual, or (2) Identifiable private information" (US Department of HHS, 2009). Collecting posts, examining networks, or in any way observing the activity of people means that social media health research qualifies as human subjects research (O'Connor, 2013) and requires the review of an IRB. The distinction between social media research that involves human subjects and research that does not is nebulous, as the inclusion of individuals in research alone is insufficient. For example, research that requires the annotation of corpora for training models involves human annotators. But since the research does not study the actions of those annotators, the research does not involve human subjects. By contrast, if the goal of the research was to study how humans annotate data, such as to learn about how humans interpret language, then the research may constitute human subjects research. When in doubt, researchers should consult their appropriate IRB contact.

IRB review provides a series of exemption categories that exempt research protocols from a full review by the IRB. Exemption category 4 in section 46.101 (b) concerns public datasets (US Department of HHS, 2009):

    Research involving the collection or study of existing data, documents, records, pathological specimens, or diagnostic specimens, if these sources are publicly available or if the information is recorded by the investigator in such a manner that subjects cannot be identified, directly or through identifiers linked to the subjects.

Since these projects pose a minimal risk to subjects, they require minimal review. Since most social media projects rely on publicly available data, and do not include interventions or interactions with the population, they may qualify for IRB exempt status (Hudson and Bruckman, 2004). Such research still requires an application to the IRB, but with a substantially expedited and simplified review process. This is an important point: research that involves human subjects, even if it falls under an exemption, must obtain an exemption from the IRB. Research that does not involve human subjects need not obtain any approval from the IRB.

3.2 Informed Consent

Obtain informed consent when possible.

A fundamental tenet of human subjects research is to obtain informed consent from study participants. Research that analyzes public corpora that include millions of individuals cannot feasibly obtain informed consent from each individual (O'Connor, 2013). Therefore, the vast majority of research that analyzes collected social media posts cannot obtain such consent. Still, we advocate for informed consent where possible due to the central role of consent in human subjects research guidelines. In cases where researchers solicit data from users, such as private Facebook or Twitter messages, informed consent may be required (Celli et al., 2013). Be explicit about how subject data will be used, and how it will be stored and protected. OurDataHelps (https://ourdatahelps.org), which solicits data donations for mental health research, provides such information.

Even if you have not explicitly dealt with consent while collecting public subject data, attaching a "statement of responsibility" and description of how the data were compiled and are to be used will give you, the researcher, a measure of accountability (Neuhaus and Webmoor, 2012; Vayena et al., 2013). This statement of responsibility would be posted publicly on the research group's website, and contains a description of the type of data that are collected, how they are being protected, and the types of analyses that will be conducted using it. Users could explicitly choose to opt their data out of the research by providing their account handle. An IRB or ethics committee may not explicitly request such a statement (although some IRBs do require such a statement and the ability for users to opt out of the study; see the University of Rochester guidelines for social media research: https://www.rochester.edu/ohsp/documents/ohsp/pdf/policiesAndGuidance/Guideline_for_Research_Using_Social_Media.pdf), but it serves to ensure trust in subjects who typically have no say in how their online data are used.

3.3 User Interventions

Research that involves user interventions may not qualify for an IRB exemption.

Research that starts by analyzing public data may subsequently lead to interacting with users or modifying user experience. For example, research may start with identifying public Twitter messages on a given topic, and then generating an interaction with the user of the message. The well known study of Kramer et al. (2014) manipulated Facebook users' news feeds to vary the emotional content and monitor how the feed influenced users' emotional states. This study raised particularly strong ethical reservations since informed consent agreements were never obtained, and was followed by an "Editorial Expression of Concern". While we cannot make definitive judgements as to what studies can receive IRB exemptions, interacting with users often comes with testing specific interventions, which typically require a full IRB review. In these cases, it is the responsibility of the researchers to work with the IRB to minimize risks to study subjects, and such risk minimization may qualify for expedited IRB review (McKee, 2013). In short, researchers should be careful not to conflate exemptions for public datasets with blanket permission for all social media research.

3.4 Protections for Sensitive Data

Develop appropriate protections for sensitive data.

Even publicly available data may include sensitive data that requires protection. For example, users may post sensitive information (e.g. diagnoses, personal attributes) that, while public, is still considered sensitive by the user. Furthermore, algorithms may infer latent attributes of users from publicly posted information that can be considered sensitive. This is often the case in mental health research, where algorithms identify users who may be challenged by a mental illness even when this diagnosis isn't explicitly mentioned by the user. Additionally, domain experts may manually label users for different medical conditions based on their public statements. These annotations, either manually identified or automatically extracted, may be considered sensitive user information even when derived from public data.

Proper protections for these data should be developed before the data are created. These may include:

1. Restrict access to sensitive data. This may include placing such data on a protected server, restricting access using OS level permissions, and encrypting the drives. This is common practice for medical record data.

2. Separate annotations from user data. The raw user data can be kept in one location, and the sensitive annotations in another. The two data files are linked by an anonymous ID so as not to rely on publicly identifiable user handles.

The extent to which researchers should rely on these and other data protections depends on the nature of the data. Some minimal protections, such as OS level permissions, are easy to implement and may be appropriate for a wide range of data types. For example, the dataset of users who self-identified as having a mental condition as compiled in Coppersmith et al. (2015a) was protected in this way during the 3rd Annual Frederick Jelinek Summer Workshop. More extreme measures, such as the use of air-gapped servers – computers that are physically removed from external networks – may be appropriate when data is particularly sensitive and the risk of harm is great. Certainly in cases where public data (e.g. social media) is linked to private data (e.g. electronic medical records) greater restrictions may be appropriate to control data access (Padrez et al., 2015).
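As a concrete illustration of points 1 and 2 above, the sketch below separates sensitive labels from the raw user data and links the two files only through a random identifier. It is a minimal example under assumed file names and column layout (handle, text, label, with a header row), not a prescribed implementation.

```python
import csv
import os
import uuid

def split_sensitive_annotations(combined_path, posts_path, annotations_path):
    """Split a combined CSV of user data and sensitive labels into two files
    linked only by a random ID, so labels never sit next to public handles."""
    with open(combined_path, newline="") as src, \
         open(posts_path, "w", newline="") as posts_out, \
         open(annotations_path, "w", newline="") as ann_out:
        reader = csv.reader(src)
        next(reader, None)  # assumed header row: handle,text,label
        posts = csv.writer(posts_out)
        annotations = csv.writer(ann_out)
        posts.writerow(["anon_id", "handle", "text"])
        annotations.writerow(["anon_id", "label"])
        for handle, text, label in reader:
            anon_id = uuid.uuid4().hex  # random link key, not derived from the handle
            posts.writerow([anon_id, handle, text])
            annotations.writerow([anon_id, label])
    # Restrict the sensitive file to the owning account (POSIX permissions).
    os.chmod(annotations_path, 0o600)
```

Keeping the two files on separate servers, or encrypting the annotation file at rest, extends the same idea when the risk is higher.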

3.5 User Attribution

De-identify data and messages in public presentations to minimize risk to users.

While messages posted publicly may be freely accessible to anyone, users may not intend for their posts to have such a broad audience. For example, on Twitter many users engage in public conversations with other users knowing that their messages are public, but do not expect a large audience to read their posts. Public users may be aware that their tweets can be read by anyone, but posted messages may still be intended for their small group of followers (Hudson and Bruckman, 2004; Quercia et al., 2011; Neuhaus and Webmoor, 2012; O'Connor, 2013; Kandias et al., 2013). The result is that while technically and legally public messages may be viewable by anyone, the author's intention and care with which they wrote the message may not reflect this reality. Therefore, we suggest that messages be de-identified or presented without attribution in public talks and papers unless it is necessary and appropriate to do otherwise. This is especially true when the users discuss sensitive topics, or are identified as having a stigmatized condition.

In practice, we suggest:

1. Remove usernames and profile pictures from papers and presentations where the tweet includes potentially sensitive information (McKee, 2013).

2. Paraphrase the original message. In cases where the post is particularly sensitive, the true author may be identifiable through text searches over the relevant platform. In these cases, paraphrase or modify the wording of the original message to preserve its meaning but obscure the author.

3. Use synthetic examples. In many cases it may be appropriate to create new message content in public presentations that reflects the type of content studied without using a real example. Be sure to inform your audience when the examples are artificial.

Not all cases require obfuscation of message authorship; in many situations it may be perfectly acceptable to show screen shots or verbatim quotes of real content with full attribution. When making these determinations, you should consider if your inclusion of content with attribution may bring unwanted attention to the user, demonstrate behavior the user may not want to highlight, or pose a non-negligible risk to the user. For example, showing an example of an un-anonymized tweet from someone with schizophrenia, or another stigmatized condition, can be much more damaging to them than posting a tweet from someone who smokes tobacco. While the content may be publicly available, you do not necessarily need to draw attention to it.

3.6 User De-identification in Analysis

Remove the identity of a user or other sensitive personal information if it is not needed in your analysis.

It is good practice to remove usernames and other identifying fields when the inclusion of such information poses risk to the user. For example, in the 2015 CLPsych shared task, tweets were de-identified by removing references to usernames, URLs, and most metadata fields (Coppersmith et al., 2015b). Carefully removing such information can be a delicate process, so we encourage the use of existing software for this task: https://github.com/qntfy/deidentify_twitter. This tool is clearly not a panacea for social media health researchers, and depending on the sensitivity of the data, more time-consuming de-identification measures will need to be taken. For example, before analyzing a collection of breast cancer message board posts, Benton et al. (2011) trained a model to de-identify several fields: named entities such as person names, locations, as well as phone numbers and addresses. When analyzing text data, perfect anonymization may be impossible to achieve, since a Google search can often retrieve the identity of a user given a single message they post.
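To make the kind of field removal described above concrete, the sketch below strips the most obviously identifying surface forms with regular expressions. It is only an illustrative first pass: the patterns and placeholder tokens are our own assumptions, and it does not replace the dedicated tool linked above or a trained de-identification model.

```python
import re

MENTION = re.compile(r"@\w{1,15}")            # Twitter-style handles
URL = re.compile(r"https?://\S+|www\.\S+")    # links often point back to the author
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")  # rough phone-number pattern

def deidentify(text):
    """Replace obviously identifying surface forms with neutral placeholders.
    Names, locations, and rare personal events will survive this pass."""
    text = MENTION.sub("@USER", text)
    text = URL.sub("URL", text)
    text = PHONE.sub("PHONE", text)
    return text

# A synthetic example, per the attribution guidance in Section 3.5:
print(deidentify("thanks @someuser, my story is at https://example.com/post/123"))
# -> "thanks @USER, my story is at URL"
```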

3.7 Sharing Data

Ensure that other researchers will respect ethical and privacy concerns.

We strongly encourage researchers to share datasets and annotations they have created so that others can replicate research findings and develop new uses for existing datasets. In many cases, there may be no risk to users in sharing data and such data should be freely shared. However, where there may be risk to users, data should not be shared blindly without concern for how it will be used.

First, if protective protocols of the kind described above were established for the data, new researchers who will use the data should agree to the same protocols. This agreement was implemented in the MIMIC-III hospital admissions database by Johnson et al. (2016). Researchers are required to present a certificate of human subjects training before receiving access to a de-identified dataset of hospital admissions. Additionally, the new research team may need to obtain their own IRB approval before receiving a copy of the data.

Second, do not share sensitive or identifiable information if it is not required for the research. For example, if sensitive annotations were created for users, you may instead share an anonymized version of the corpus where features such as, for example, individual posts they made, are not shared. Otherwise, the original user handle may be recovered using a search for the message text. For NLP-centric projects where models are trained to predict sensitive annotations from text, this means that either opaque feature vectors should be shared (disallowing others from preprocessing the data differently), or the messages be replaced with de-identified tokens, allowing other researchers to use token frequency statistics as features, but not, for example, gazetteers or pre-trained word vectors as features in their models.

It is also important to refer to the social media platform terms of service before sharing datasets. For example, section F.2 of Twitter's Developer Policy restricts sharing to no more than 50,000 tweets and user information objects per downloader per day (https://dev.twitter.com/overview/terms/agreement-and-policy).

3.8 Data Linkage Across Sites

Be cautious about linking data across sites, even when all data are public.

While users may share data publicly on multiple platforms, they may not intend for combinations of data across platforms to be public (McKee, 2013). For example, a user may create a public persona on Twitter, and a less identifiable account on a mental health discussion forum. The discussions they have on this health forum should not be inadvertently linked to their Twitter account by an overzealous researcher, since it may "out" their condition to the Twitter community.

There have been several cases of identifying users in anonymized data based on linking data across sources. Douriez et al. (2016) describe how the New York City Taxi Dataset can be de-anonymized by collecting taxi location information from four popular intersections. Narayanan and Shmatikov (2008) showed that the identity of users in the anonymized Netflix challenge data can be revealed by mining the Internet Movie Database.

Combinations of public data can create new sensitivities and must be carefully evaluated on a case-by-case basis. In some cases, users may explicitly link accounts across platforms, such as including in a Twitter profile a link to a LinkedIn page or blog (Burger et al., 2011). Other times users may not make these links explicit, intentionally try to hide the connections, or the connections are inferred by the researcher, e.g. by similarity in user handles. These factors should be considered when conducting research that links users across multiple platforms. It goes without saying that linking public posts to private, sensitive fields (electronic health records) should be handled with the utmost care (Padrez et al., 2015).
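One way to realize the token-replacement option from Section 3.7 is to release a keyed hash of each token instead of the text itself, so collaborators can compute token-frequency features while the original messages remain unsearchable. This is a sketch of that idea under an assumed secret key, not a complete release protocol; the key itself must never be shared.

```python
import hashlib
import hmac

SECRET_KEY = b"illustrative-key-kept-off-the-release"  # store separately from the data

def opaque_tokens(message):
    """Map each token to a stable but meaningless identifier so shared corpora
    support frequency-based features without exposing the underlying text."""
    return [
        hmac.new(SECRET_KEY, tok.lower().encode("utf-8"), hashlib.sha256).hexdigest()[:12]
        for tok in message.split()
    ]
```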

4 Conclusion

We have provided a series of ethical recommendations for health research using social media. These recommendations can serve as a guide for developing new research protocols, and researchers can decide on specific practices based on the issues raised in this paper. We hope that researchers new to the field find these guidelines useful to familiarize themselves with ethical issues.

References

Adrian Benton, Lyle Ungar, Shawndra Hill, Sean Hennessy, Jun Mao, Annie Chung, Charles E. Leonard, and John H. Holmes. 2011. Identifying potential adverse effects using the web: A new approach to medical hypothesis generation. Journal of biomedical informatics, 44(6):989–996.

Johan Bollen, Huina Mao, and Xiaojun Zeng. 2011. Twitter mood predicts the stock market. Journal of computational science, 2(1):1–8.

John D. Burger, John Henderson, George Kim, and Guido Zarrella. 2011. Discriminating gender on twitter. In Empirical Methods in Natural Language Processing (EMNLP), pages 1301–1309.

Fabio Celli, Fabio Pianesi, David Stillwell, and Michal Kosinski. 2013. Workshop on computational personality recognition (shared task). In Workshop on Computational Personality Recognition.

Mike Conway. 2014. Ethical issues in using Twitter for public health surveillance and research: developing a taxonomy of ethical concepts from the research literature. Journal of Medical Internet Research, 16(12):e290.

Glen Coppersmith, Mark Dredze, Craig Harman, and Kristy Hollingshead. 2015a. From ADHD to SAD: analyzing the language of mental health on Twitter through self-reported diagnoses. In NAACL Workshop on Computational Linguistics and Clinical Psychology.

Glen Coppersmith, Mark Dredze, Craig Harman, Kristy Hollingshead, and Margaret Mitchell. 2015b. CLPsych 2015 shared task: Depression and PTSD on Twitter. In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality, pages 31–39.

Sally C. Curtin, Margaret Warner, and Holly Hedegaard. 2016. Increase in suicide in the United States, 1999-2014. NCHS data brief, 241:1–8.

Munmun De Choudhury, Emre Kiciman, Mark Dredze, Glen Coppersmith, and Mrinal Kumar. 2016. Discovering shifts to suicidal ideation from mental health content in social media. In Conference on Human Factors in Computing Systems (CHI), pages 2098–2110.

Marie Douriez, Harish Doraiswamy, Juliana Freire, and Claudio T. Silva. 2016. Anonymizing NYC taxi data: Does it matter? In IEEE International Conference on Data Science and Advanced Analytics (DSAA), pages 140–148.

Harold Edgar and David J. Rothman. 1995. The institutional review board and beyond: Future challenges to the ethics of human experimentation. The Milbank Quarterly, 73(4):489–506.

Sarah J. L. Edwards, Tracey Stone, and Teresa Swift. 2007. Differences between research ethics committees. International journal of technology assessment in health care, 23(01):17–23.

European Parliament and Council of the European Union. 2001. Directive 2001/20/EC.

James M. Hudson and Amy Bruckman. 2004. "Go away": participant objections to being studied and the ethics of chatroom research. The Information Society, 20(2):127–139.

Alistair E. W. Johnson, Tom J. Pollard, Lu Shen, Li-wei H. Lehman, Mengling Feng, Mohammad Ghassemi, Benjamin Moody, Peter Szolovits, Leo Anthony Celi, and Roger G. Mark. 2016. MIMIC-III, a freely accessible critical care database. Scientific data, 3.

Miltiadis Kandias, Konstantina Galbogini, Lilian Mitrou, and Dimitris Gritzalis. 2013. Insiders trapped in the mirror reveal themselves in social media. In International Conference on Network and System Security, pages 220–235.

Adam D. I. Kramer, Jamie E. Guillory, and Jeffrey T. Hancock. 2014. Experimental evidence of massive-scale emotional contagion through social networks. Proceedings of the National Academy of Sciences, 111(24):8788–8790.

Rebecca McKee. 2013. Ethical issues in using social media for health and health care research. Health Policy, 110(2):298–301.

Jude Mikal, Samantha Hurst, and Mike Conway. 2016. Ethical issues in using Twitter for population-level depression monitoring: a qualitative study. BMC medical ethics, 17(1):1.

Arvind Narayanan and Vitaly Shmatikov. 2008. Robust de-anonymization of large sparse datasets. In IEEE Symposium on Security and Privacy, pages 111–125.

National Commission. 1978. The Belmont Report: Ethical Principles and Guidelines for the Protection of Human Subjects of Research – the National Commission for the Protection of Human Subjects of Biomedical and Behavioral Research. US Government Printing Office.

National Research Council. 2014. Proposed revisions to the common rule for the protection of human subjects in the behavioral and social sciences. National Academies Press.

Fabian Neuhaus and Timothy Webmoor. 2012. Agile ethics for massified research and visualization. Information, Communication & Society, 15(1):43–65.

Dan O'Connor. 2013. The apomediated world: regulating research when social media has changed research. Journal of Law, Medicine, and Ethics, 41(2):470–483.

Kevin A. Padrez, Lyle Ungar, H. Andrew Schwartz, Robert J. Smith, Shawndra Hill, Tadas Antanavicius, Dana M. Brown, Patrick Crutchley, David A. Asch, and Raina M. Merchant. 2015. Linking social media and medical record data: a study of adults presenting to an academic, urban emergency department. Quality and Safety in Health Care.

Michael J. Paul and Mark Dredze. 2011. You are what you tweet: Analyzing twitter for public health. In International Conference on Weblogs and Social Media (ICWSM), pages 265–272.

Daniel Preotiuc-Pietro, Johannes Eichstaedt, Gregory Park, Maarten Sap, Laura Smith, Victoria Tobolsky, H. Andrew Schwartz, and Lyle Ungar. 2015. The role of personality, age and gender in tweeting about mental illnesses. In Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality.

Daniele Quercia, Michal Kosinski, David Stillwell, and Jon Crowcroft. 2011. Our Twitter profiles, our selves: Predicting personality with Twitter. In IEEE Third International Conference on Privacy, Security, Risk and Trust (PASSAT), pages 180–185.

Philip Resnik, Anderson Garron, and Rebecca Resnik. 2013. Using topic modeling to improve prediction of neuroticism and depression. In Empirical Methods in Natural Language Processing (EMNLP), pages 1348–1353.

H. Andrew Schwartz, Johannes C. Eichstaedt, Margaret L. Kern, Lukasz Dziurzynski, Richard E. Lucas, Megha Agrawal, Gregory J. Park, Shrinidhi K. Lakshmikanth, Sneha Jha, Martin E. P. Seligman, and Lyle H. Ungar. 2013. Characterizing geographic variation in well-being using tweets. In International Conference on Weblogs and Social Media (ICWSM).

Andranik Tumasjan, Timm O. Sprenger, Philipp G. Sandner, and Isabell M. Welpe. 2011. Election forecasts with twitter: How 140 characters reflect the political landscape. Social science computer review, 29(4):402–418.

US Department of HHS. 2009. Code of federal regulations. Title 45. Public Welfare CFR, 46.

Effy Vayena, Anna Mastroianni, and Jeffrey Kahn. 2013. Caught in the web: informed consent for online health research. Sci Transl Med, 5(173):1–3.

Say the Right Thing Right: Ethics Issues in Natural Language Generation Systems

Charese Smiley & Frank Schilder
Thomson Reuters R&D
610 Opperman Drive
Eagan, MN 55123
USA
[email protected]

Vassilis Plachouras & Jochen L. Leidner
Thomson Reuters R&D
30 South Colonnade
London E14 5EP
United Kingdom
[email protected]

Abstract

We discuss the ethical implications of Natural Language Generation systems. We use one particular system as a case study to identify and classify issues, and we provide an ethics checklist, in the hope that future system designers may benefit from conducting their own ethics reviews based on our checklist.

1 Introduction

With the advent of big data, there is increasingly a need to distill information computed from these datasets into automated summaries and reports that users can quickly digest without the need for time-consuming data munging and analysis. However, with automated summaries comes not only the added benefit of easy access to the findings of large datasets but the need for ethical considerations in ensuring that these reports accurately reflect the true nature of the underlying data and do not make any misleading statements.

This is especially vital from a Natural Language Generation (NLG) perspective because with large datasets, it may be impossible to read every generation and reasonable-sounding, but misleading, generations may slip through without proper validation. As users read the automatically generated summaries, any misleading information can affect their subsequent actions, having a real-world impact. Such summaries may also be consumed by other automated processes, which extract information or calculate sentiment for example, potentially amplifying any misrepresented information. Ideally, the research community and industry should be building NLG systems which avoid altogether behaviors that promote ethical violations. However, given the difficulty of such a task, before we reach this goal, it is necessary to have a list of best practices for building NLG systems.

This paper presents a checklist of ethics issues arising when developing NLG systems in general and more specifically from the development of an NLG system to generate descriptive text for macro-economic indicators as well as insights gleaned from our experiences with other NLG projects. While not meant to be comprehensive, it provides high and low-level views of the types of considerations that should be taken when generating directly from data to text.

The remainder of the paper is organized as follows: Section 2 covers related work in ethics for NLG systems. Section 3 introduces an ethics checklist for guiding the design of NLG systems. Section 4 describes a variety of issues we have encountered. Section 5 outlines ways to address these issues emphasizing various methods we propose should be applied while developing an NLG system. We present our conclusions in Section 6.

2 Related work

Many of the ethical issues of NLG systems have been discussed in the context of algorithmic journalism (Dörr and Hollnbuchner, 2016). They outline a general framework of moral theories following Weischenberg et al. (2006) that should be applied to algorithmic journalism in general and especially when NLG systems are used. We are building on their framework by providing concrete issues we encounter while creating actual NLG systems.

Kent (2015) proposes a concrete checklist for robot journalism (http://mediashift.org/2015/03/an-ethical-checklist-for-robot-journalism/) that lists various guidelines for utilizing NLG systems in journalism. He also points out that a link back to the source data is essential and that such systems should at least in the beginning go through rigorous quality checks.

103 Proceedings of the First Workshop on Ethics in Natural Language Processing, pages 103–108, Valencia, Spain, April 4th, 2017. c 2017 Association for Computational Linguistics

QUESTION | EXAMPLE RESPONSE | SECTION

Human consequences
Are there ethical objections to building the application? | No objections anticipated | 4.3
How could a user be disadvantaged by the system? | No anticipated disadvantages to user | 4.4-4.7
Does the system use any Personally Identifiable Information? | No PII collected or used | 4.5

Data issues
How accurate is the underlying data?* | Data is drawn from trusted source | 4
Are there any misleading rankings given? | Yes, detected via data validation | 4.1
Are there (automatic) checks for missing data? | Yes, detected via data validation | 4.2
Does the data contain any outliers? | Yes, detected via data validation | 4.2

Generation issues
Can you defend how the story is written?* | Yes via presupposition checks and disclosure | 5
Does the style of the automated report match your style?* | Yes, generations reviewed by domain experts | 5
Who is watching the machines?* | Conducted internal evaluation and quality control | 5

Provenance
Will you disclose your methods?* | Disclosure text | 4.4
Will you disclose the underlying data sources? | Provide link to open data & source for proprietary data | 4.4

Table 1: An ethics checklist for NLG systems. There is an overlap with questions from the checklist Thomas Kent proposed and they are indicated by *.

A comprehensive overview of ethical issues on designing computer systems can be found in (IEEE, 2016). More specifically, Amodei et al. (2016) propose an array of machine learning-based strategies for ensuring safety in general AI systems, mostly focussing on autonomous systems interacting with a real world environment. Their research questions encompass avoiding negative side effects, robustness to distributional shift (i.e. the machine's situational awareness) and scalable oversight (i.e. autonomy of the machine in decision-making). The last question is clearly relevant to defining safeguards for NLG systems as well. Ethical questions addressing the impact of specifically NLP systems are addressed by Hovy and Spruit (2016).

To ensure oversight of an AI system, they draw inspiration from semi-supervised reinforcement learning and suggest learning a reward function either based on supervised or semi-supervised active learning. We follow this suggestion and propose creating such a reward-based model for NLG systems in order to learn whether the generated texts may lie outside of the normal parameters.

Actual NLG systems are faced with word choice problems and possible data problems. Such systems, however, normally do not address the ethical consequences of the choices taken, but see Joshi et al. (1984) for an exception. Choosing the appropriate word in an NLG system was already addressed by (Ward, 1988; Barzilay and Lee, 2002), among others. More recently, Smiley et al. (2016), for example, derive the word choice of verbs describing the trend between two data points from an extensive corpus analysis. Grounding the verb choice in data helps to correctly describe the intensity of a change.

The problem of missing data can taint every data analysis and lead to misleading conclusions if not handled appropriately. Equally important as the way one imputes missing data points in the analysis is the transparent description of how data is handled. NLG system designers, in particular, have to be very careful about which kind of data their generated text is based on. To our knowledge, this problem has not been systematically addressed in the literature on creating NLG systems.

At the application level, Mahamood and Reiter (2011) present an NLG system for the neonatal care domain, which arguably is particularly sensitive as far as medical sub-domains are concerned. They generate summaries about the health status of young babies, including affective elements to calm down potentially worried parents to an appropriate degree. If a critically ill baby has seen dramatic deterioration or has died, the system appropriately does not generate any output, but refers to a human medic (Ehud Reiter, personal communication).

3 Ethics Checklist

While there is a large body of work on metrics and methodologies for improving data quality (Batini et al., 2008), reaching a state where an NLG system could automatically determine edge cases (problems that occur at the extremes or outside of normal data ranges) or issues in the data, is a difficult task.

         2009         2010  2011         2012  2013  2014
Curaçao  76.15609756  ..    77.47317073  ..    ..    77.82439024

Table 2: Life expectancy at birth, total (years) for Curaçao.

             2006  2007  2008            2009            2010            2011
South Sudan  ..    ..    15,550,136,279  12,231,362,023  15,727,363,443  17,826,697,892

Table 3: GDP (current US$) for South Sudan.

Until such systems are built, we believe it could be helpful to have some guidance in the form of an ethics checklist, which could be integrated in any existing project management process.

In Table 1, we propose such a checklist, with the aim to aid the developers of NLG systems on how to address the ethical issues arising from the use of an NLG system, and to provide a starting point for outlining mechanisms and processes to address these issues. We divided the checklist up into 4 areas starting with questions on developing NLP systems in general. The table also contains the response for a system we designed and developed and pointers to sections of the paper which discuss methods that could be deployed to make sure the issues raised by the questions are adequately addressed. The checklist was derived from our own experience with NLG systems as well as informed by the literature. We do not assert its completeness, but rather offer it as a starting point that may be extended by others; also, other kinds of NLP systems may lead to specific checklists following the same methodology.

4 Current issues

This section consists of issues encountered when developing an NLG system for generating summaries for macro-economic data (Plachouras et al., 2016). To illustrate these issues we use World Bank Open Data (http://data.worldbank.org), an open access repository of global development indicator data. While this repository contains a wealth of data that can be used for generating automatic summaries, it also contains a variety of edge cases that are typical of large datasets. Managing edge cases is essential not only due to issues of grammaticality (e.g. noun-number agreement, subject-verb agreement), but because they can lead to misstatements and misrepresentations of the data that a user might act on. These issues are discussed in turn in this section.

4.1 Ranking

It is common to provide a ranking among entities with values that can be ordered. However, when there are a small number of entities, ranking may not be informative, especially if the size of the set is not also given. For example, if there is only one country reporting in a region for a particular indicator, an NLG engine could claim that the country is either the highest or lowest in the region. A region like North America, for which World Bank lists Bermuda, Canada, and the United States, will sometimes only have data for 2 countries as Bermuda is dramatically smaller, so clarity in which countries are being compared for a given indicator and timespan is essential.

4.2 Time series

Missing Data: Enterprise applications will usually contain Terms of Use of products stating that data may be incomplete and calculations may include missing points. However, users may still assume that content shown by an application is authoritative, leading to a wrong impression about the accuracy of the data. Table 2 shows the life expectancy for Curaçao from 2009-2015. Here we see that 2010, 2012, and 2013 are missing. NLG systems should check for missing values and users should be informed if calculations are performed on data with missing values or if values presented to the user have been imputed.

Leading/trailing empty cells: Similar to issues with missing data, leading/trailing zeros and missing values in the data may be accurate or may signal that data was not recorded during that time period or that the phenomena started/ended when the first or last values were reported. For example, Table 3 shows empty leading values for South Sudan, a country that only recently became independent.
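The checks called for above can be implemented as a small audit step that runs before any trend or ranking sentence is generated. The sketch below is a minimal illustration only; the year-to-value mapping and the returned fields are our own assumptions, not the validation layer of the system described in this paper.

```python
def audit_series(values):
    """Summarize gaps in a year -> value mapping (None marks a missing observation),
    so the generator can disclose imputation or suppress a trend statement."""
    years = sorted(values)
    missing = [y for y in years if values[y] is None]
    leading, trailing = [], []
    for y in years:              # leading gap: the series may simply not have existed yet
        if values[y] is None:
            leading.append(y)
        else:
            break
    for y in reversed(years):    # trailing gap: the most recent data may not be reported
        if values[y] is None:
            trailing.append(y)
        else:
            break
    return {
        "missing_years": missing,
        "leading_gap": leading,
        "trailing_gap": list(reversed(trailing)),
        "coverage": 1 - len(missing) / len(years) if years else 0.0,
    }

# Table 2, encoded with None for the empty cells:
curacao = {2009: 76.156, 2010: None, 2011: 77.473, 2012: None, 2013: None, 2014: 77.824}
print(audit_series(curacao)["missing_years"])  # [2010, 2012, 2013]
```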

Small Changes: The reported life expectancy of St. Lucia was very stable in the late 1990s. In 1996, World Bank gives a life expectancy of 71.1574878 and in 1997, 71.15529268. Depending on our algorithm, one generation would say that there was no change in St. Lucia's life expectancy between 1996 and 1997 if the number was rounded to 2 decimal places. If the difference is calculated without rounding then the generation would say that there was virtually no change. Using the second wording allows for a more precise accounting of the slight difference seen from one year to the next.
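A simple way to operationalize this is to pick the hedge from the unrounded difference rather than from the rounded display values. The thresholds below are illustrative placeholders, not the corpus-derived choices of Smiley et al. (2016).

```python
def describe_change(old, new, tiny=1e-4, slight=1e-2):
    """Choose wording for a year-over-year change from the unrounded values,
    so a real but tiny difference is not reported as 'no change'."""
    if new == old:
        return "did not change"
    relative = abs(new - old) / abs(old)
    if relative < tiny:
        return "remained virtually unchanged"
    direction = "rose" if new > old else "fell"
    return f"{direction} slightly" if relative < slight else direction

print(describe_change(71.1574878, 71.15529268))  # remained virtually unchanged
```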
Temporal scope: It is common to report activity starting from the current time and extending to some fixed point in the past (e.g. over the past 10 years). While this is also a frequent occurrence in human written texts and dialogues, it is quite ambiguous and could refer to the start of the first year, the start of the fiscal calendar on the first year, a precise number of days extending from today to 10 years ago, or a myriad of other interpretations. Likewise, what is meant by the current time period is also ambiguous as data may or may not be reported for the current time period. If, for example, the Gross Domestic Product (GDP) for the current year is not available the generation should inform the user that the data is current as of the earliest year available.

4.3 Ethical Objections

Before beginning any NLG project, it is important to consider whether there are any reasons why the system should not be built. A system that would cause harm to the user by producing generations that are offensive should not be built without appropriate safeguards. For example, in 2016, Microsoft released Tay, a chatbot which unwittingly began to generate hate speech due to lack of filtering for racist content in its training data and output (http://read.bi/2ljdvww).

4.4 Provenance

In the computer medium, authority is ascribed based on a number of factors (Conrad et al., 2008): the user may have a prior trust distribution into humans and machines (on the "species" and individual level), and they may ascribe credibility based on the generated message itself. Only being transparent about where data originated permits humans to apply their prior beliefs, whereas hiding whether generated text originated from a machine or a human leaves the user in the dark about how to use their prior beliefs to ascribe trust (or not). Once users are informed about the provenance of the information, they are enabled to decide for themselves whether or how much they trust a piece of information output by a system, such as a natural language summary.

As pointed out by Kent (2015), disclaimers on the completeness and correctness of the data should be added to the generation, or the website where it's shown. Ideally, a link to the actual data source should also be provided and in general a description of how the generation is carried out in order to provide full transparency to the user. For example, such a description should state whether the generated texts are personalized to match the profile of each user.

4.5 Personalization

One of the advantages of NLG systems is the capability to produce text customized to the profile of individual users. Instead of writing one text for all users, the NLG system can incorporate the background and context of a user to increase the communication effectiveness of the text. However, users are not always aware of personalization. Hence, insights they may obtain from the text can be aligned with their profile and history, but may also be missing alternative insights that are weighed down by the personalization algorithm. One way to address this limitation is to make users aware of the use of personalization, similar to how provenance can be addressed.

4.6 Fraud Prevention

In sensitive financial systems, in theory a rogue developer could introduce fraudulent code that generates overly positive or negative-sounding sentiment for a company, for their financial gain. A code audit can bring attempts to manipulate any code base to light, and pair programming may make any attempts less likely.

4.7 Accessibility

In addition to providing misleading texts, the accessibility of the texts generated automatically is an additional way in which users may be put in a disadvantaged position by the use of an NLG system. First, the readability of the generated text may not match the expectations of the target users, limiting their understanding due to the use of specialized terminology, or complex structure. Second, the quality of the user experience may be affected if the generated text has been constructed without considering the requirements of how users access the text. For example, delivering a text through a text-to-speech synthesizer may require expanding numerical expressions or constructing shorter texts because of the time required for the articulation of speech.

5 Discussion

The research community and the industry should aim to design NLG systems that do not promote unethical behavior, by detecting issues in the data and automatically identifying cases where the automated summaries do not reflect the true nature of the data.

There are a couple of methods we want to highlight because they address the problems of solving ethical issues from two different angles. The first method, which we call a presupposition check, draws on a principled way of describing pragmatic issues in language by adding semantic and pragmatic constraints informed by Grice's Cooperative Principles and presupposition (Grice, 1975; Beaver, 1997): adding formal constraints to the generation process will make NLG more transparent, and less potentially misleading (Joshi, 1982).

If an NLG system, for example, is asked to generate a phrase expressing the minimum, average or maximum of a group of numbers ("The smallest/average/largest (Property) of (Group) is (Value)"), an automatic check should be installed that determines whether the cardinality of the set comprising that group is greater than one. If this check only finds one entity, the generation should not be licensed, and the system avoids that the user is misled into believing the very notion of calculating a minimum, average or maximum actually makes sense. Instead, in such a situation a better response may be "There is only one (Property) in (Group), and it is (Value)." (cf. work on the NLG of gradable properties by van Deemter (2006)).
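The cardinality check just described is straightforward to encode. The sketch below uses placeholder property and group names and illustrative values; it shows the shape of the check, not the actual implementation of the system discussed in this paper.

```python
def describe_extreme(values, prop, group, largest=True):
    """Only license a superlative when the comparison set has more than one member;
    otherwise fall back to the non-comparative phrasing suggested above."""
    if not values:
        return f"No {prop} is reported for {group}."
    if len(values) == 1:
        (name, value), = values.items()
        return f"There is only one {prop} in {group}, and it is {value}."
    pick = max if largest else min
    name, value = pick(values.items(), key=lambda item: item[1])
    label = "largest" if largest else "smallest"
    return f"The {label} {prop} of {group} is {value} ({name})."

# Illustrative values only:
print(describe_extreme({"Country A": 3.2}, "growth rate", "the region"))
# -> "There is only one growth rate in the region, and it is 3.2."
```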
A second method to ensure that the output of the generated system is valid involves evaluating and monitoring the quality of the text. A model can be trained to identify problematic generations based on an active learning approach. For example, interquartile ranges can be computed for numerical data used for the generation, determining outliers in the data. In addition, the fraction of missing data points and the number of input elements in aggregate functions can be estimated from the respective data. Then, domain experts can rate whether the generated text is acceptable or not as a description of the respective data. The judgements can be used to train a classifier that can be applied to future data sets and generations.
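As a small illustration of the screening step mentioned above, the following sketch flags inputs that fall outside the usual interquartile-range fences; in the workflow described here, a flagged generation would be routed to a domain expert for rating rather than suppressed automatically. The multiplier and example data are arbitrary assumptions.

```python
import statistics

def iqr_outliers(numbers, k=1.5):
    """Return inputs outside [Q1 - k*IQR, Q3 + k*IQR]; generations built on
    flagged inputs can be queued for expert review and later used as training data."""
    q1, _, q3 = statistics.quantiles(numbers, n=4)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [x for x in numbers if x < low or x > high]

print(iqr_outliers([1.1, 1.15, 1.2, 1.2, 1.25, 1.3, 9.7]))  # [9.7]
```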

6 Conclusions

We analyzed how the development of an NLG system can have ethical implications considering in particular data problems and how the meaning of the generated text can be potentially misleading. We also introduced best practice guidelines for creating an NLP system in general and transparency in interaction with a user.

Based on the checklist for the NLG systems we proposed various methods for ensuring that the right utterance is generated. We discussed in particular two methods that future research should focus on: (a) the validation of utterances via a presupposition checker and (b) a better evaluation framework that may be able to learn from feedback and improve upon that feedback.

Checklists can be collected as project management artifacts for each completed NLP project in order to create a learning organization, and they are a useful resource that inform Ethics Review Boards, as introduced by Leidner and Plachouras (2017).

Acknowledgments

We would like to thank Khalid Al-Kofahi for supporting this work.

References

Dario Amodei, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. 2016. Concrete problems in AI safety. arXiv preprint arXiv:1606.06565.

Regina Barzilay and Lillian Lee. 2002. Bootstrapping lexical choice via multiple-sequence alignment. In Proceedings of the 2002 Conference on Empirical Methods in Natural Language Processing, pages 164–171. Association for Computational Linguistics, July.

Carlo Batini, Federico Cabitza, Cinzia Cappiello, and Chiara Francalanci. 2008. A comprehensive data quality methodology for Web and structured data. Int. J. Innov. Comput. Appl., 1(3):205–218, July.

David Beaver. 1997. Presupposition. In Johan van Benthem and Alice ter Meulen, editors, The Handbook of Logic and Language, pages 939–1008. Elsevier, Amsterdam.

Anja Belz and Ehud Reiter. 2006. Comparing automatic and human evaluation of NLG systems. In Proc. EACL'06, pages 313–320.

Jack G. Conrad, Jochen L. Leidner, and Frank Schilder. 2008. Professional credibility: Authority on the Web. In Proceedings of the 2nd ACM Workshop on Information Credibility on the Web, WICOW 2008, pages 85–88, New York, NY, USA. ACM.

Konstantin Nicholas Dörr and Katharina Hollnbuchner. 2016. Ethical challenges of algorithmic journalism. Digital Journalism, pages 1–16.

Paul Grice. 1975. Logic and conversation. In P. Cole and J. Morgan, editors, Syntax and Semantics III: Speech Acts, pages 41–58. Academic Press, New York, NY, USA.

Dirk Hovy and Shannon L. Spruit. 2016. The social impact of natural language processing. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, volume 2, pages 591–598.

IEEE, editor. 2016. Ethically Aligned Design: A Vision For Prioritizing Wellbeing With Artificial Intelligence And Autonomous Systems. IEEE – Advanced Technology for Humanity.

Aravind Joshi, Bonnie Webber, and Ralph M. Weischedel. 1984. Preventing false inferences. In Proceedings of the 10th International Conference on Computational Linguistics and 22nd Annual Meeting of the Association for Computational Linguistics, pages 134–138, Stanford, California, USA, July. Association for Computational Linguistics.

Aravind Joshi. 1982. Mutual beliefs in question-answering systems. In Neil S. Smith, editor, Mutual Knowledge, pages 181–197. Academic Press, London.

Thomas Kent. 2015. An ethical checklist for robot journalism. Online, cited 2017-01-25, http://mediashift.org/2015/03/an-ethical-checklist-for-robot-journalism/.

Jochen L. Leidner and Vassilis Plachouras. 2017. Ethical by design: Ethics best practices for natural language processing. In Proceedings of the Workshop on Ethics & NLP held at the EACL Conference, April 3-7, 2017, Valencia, Spain. ACL.

Saad Mahamood and Ehud Reiter. 2011. Generating affective natural language for parents of neonatal infants. In Proceedings of the 13th European Workshop on Natural Language Generation, ENLG 2011, pages 12–21, Nancy, France. Association for Computational Linguistics.

Leo L. Pipino, Yang W. Lee, and Richard Y. Wang. 2002. Data quality assessment. Commun. ACM, 45(4):211–218, April.

Vassilis Plachouras, Charese Smiley, Hiroko Bretz, Ola Taylor, Jochen L. Leidner, Dezhao Song, and Frank Schilder. 2016. Interacting with financial data using natural language. In Raffaele Perego, Fabrizio Sebastiani, Javed A. Aslam, Ian Ruthven, and Justin Zobel, editors, Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, Pisa, Italy, July 17-21, 2016, SIGIR 2016, pages 1121–1124. ACM.

Ehud Reiter. 2007. An architecture for data-to-text systems. In Proceedings of the Eleventh European Workshop on Natural Language Generation, ENLG '07, pages 97–104, Stroudsburg, PA, USA. Association for Computational Linguistics.

Frank Schilder, Blake Howald, and Ravi Kondadadi. 2013. Gennext: A consolidated domain adaptable NLG system. In Proceedings of the 14th European Workshop on Natural Language Generation, pages 178–182, Sofia, Bulgaria, August. Association for Computational Linguistics.

Charese Smiley, Vassilis Plachouras, Frank Schilder, Hiroko Bretz, Jochen L. Leidner, and Dezhao Song. 2016. When to plummet and when to soar: Corpus based verb selection for natural language generation. In The 9th International Natural Language Generation Conference, page 36.

Kees van Deemter. 2006. Generating referring expressions that involve gradable properties. Computational Linguistics, 32(2):195–222.

Nigel Ward. 1988. Issues in word choice. In Proceedings of the 12th Conference on Computational Linguistics – Volume 2, pages 726–731. Association for Computational Linguistics.

Siegfried Weischenberg, Maja Malik, and Armin Scholl. 2006. Die Souffleure der Mediengesellschaft. Report über die Journalisten in Deutschland. Konstanz: UVK, page 204.

Author Index

Benton, Adrian, 94 Burstein, Jill, 41

Cahill, Aoife, 41 Coppersmith, Glen, 94

Daelemans, Walter, 80 Dredze, Mark, 94

Fatema, Kaniz, 60

Koolen, Corina, 12

Larson, Brian, 1 Leidner, Jochen L., 30, 103 Lewis, Dave, 60 Liu, Chao-Hong, 66 Loukina, Anastassia, 41 Lynn, Teresa, 66

Madnani, Nitin, 41 May, Chandler, 74 Mieskes, Margot, 23 Moorkens, Joss, 60, 66

Parra Escartín, Carla, 66 Plachouras, Vassilis, 30, 103

Reijers, Wessel, 66 Rudinger, Rachel, 74

Schilder, Frank, 103 Schnoebelen, Tyler, 88 Smiley, Charese, 103 Suster, Simon, 80

Tatman, Rachael, 53 Tulkens, Stephan, 80 van Cranenburgh, Andreas, 12 Van Durme, Benjamin, 74 von Davier, Alina, 41

Way, Andy, 66
