Toward -Inclusive Coreference Resolution

Yang Trista Cao Hal Daume´ III University of Maryland University of Maryland [email protected] Microsoft Research [email protected]

Abstract crimination in trained coreference systems, show- ing that current systems over-rely on social stereo- Correctly resolving textual mentions of people types when resolving HE and SHE pronouns1 (see fundamentally entails making inferences about those people. Such inferences raise the risk of §2). Contemporaneously, critical work in Human- systemic biases in coreference resolution sys- Computer Interaction has complicated discussions tems, including biases that can harm binary around gender in other fields, such as computer and non-binary trans and cis stakeholders. To vision (Keyes, 2018; Hamidi et al., 2018). better understand such biases, we foreground Building on both lines of work, and inspired by nuanced conceptualizations of gender from so- Keyes’s (2018) study of vision-based automatic ciology and sociolinguistics, and develop two gender recognition systems, we consider gender new datasets for interrogating bias in crowd annotations and in existing coreference reso- bias from a broader conceptual frame than the bi- lution systems. Through these studies, con- nary “folk” model. We investigate ways in which ducted on English text, we confirm that with- folk notions of gender—namely that there are two out acknowledging and building systems that , assigned at birth, immutable, and in per- recognize the complexity of gender, we build fect correspondence to gendered linguistic forms— systems that lead to many potential harms. lead to the development of technology that is exclu- 1 Introduction sionary and harmful of binary and non-binary trans and cis people.2 Addressing such issues is critical Coreference resolution—the task of determining not just to improve the quality of our systems, but which textual references resolve to the same real- more pointedly to minimize the harms caused by world entity—requires making inferences about our systems by reinforcing existing unjust social those entities. Especially when those entities are hierarchies (Lambert and Packer, 2019). people, coreference resolution systems run the risk There are several stakeholder groups who may of making unlicensed inferences, possibly resulting easily face harms when coreference systems is in harms either to individuals or groups of people. used (Blodgett et al., 2020). Those harms includes Embedded in coreference inferences are varied as- several possible harms, both allocational and rep- pects of gender, both because gender can show up resentation harms (Barocas et al., 2017), including explicitly (e.g., pronouns in English, morphology

arXiv:1910.13913v4 [cs.CL] 2 Dec 2020 quality of service, erasure, and stereotyping harms. in Arabic) and because societal expectations and Following Bender’s(2019) taxonomy of stakehold- stereotypes around gender roles may be explicitly or implicitly assumed by speakers or listeners. This 1Throughout, we avoid mapping pronouns to a “gender” la- can lead to significant biases in coreference resolu- bel, preferring to use the pronoun directly, include (in English) SHE, HE, the non-binary use of singular THEY, and neopro- tion systems: cases where systems “systematically nouns (e.g., ZE/HIR, XEY/XEM), which have been in usage and unfairly discriminate against certain individ- since at least the 1970s (Bustillos, 2011; Merriam-Webster, uals or groups of individuals in favor of others” 2016; Bradley et al., 2019; Hord, 2016; Spivak, 1997). 2Following GLAAD(2007), individuals are (Friedman and Nissenbaum, 1996, p. 332). those whose gender differs from the sex they were assigned Gender bias in coreference resolution can mani- at birth. This is in opposition to individuals, whose fest in many ways; work by Rudinger et al.(2018), assigned sex at birth happens to correspond to their gender. Transgender individuals can either be binary (those whose Zhao et al.(2018a), and Webster et al.(2018) fo- gender falls in the “male/female” dichotomy) or non-binary cused largely on the case of binary gender dis- (those for which the relationship is more complex). ers and Barocas et al.’s(2017) taxonomy of harms, written by and about trans people).3 In all cases, there are several ways in which trans exclusionary we focus largely on harms due to over- and under- coreference resolution systems can cause harm: representation (Kay et al., 2015), replicating stereo- types (Sweeney, 2013; Caliskan et al., 2017) (par-  Indirect: subject of query. If a person is the ticular those that are cisnormative and/or heteronor- subject of a web query, pages about xem may mative), and quality of service differentials (Buo- be missed if “multiple mentions of query” is a lamwini and Gebru, 2018). ranking feature, and the system cannot resolve xyr pronouns ⇒ quality of service, erasure. The primary contributions of this paper are:  Direct: by choice. If a grammar checker uses (1) Connecting existing work on gender bias in coreference, it may insist that an author writ- NLP to sociological and sociolinguistic concep- ing hir third-person autobiography is repeatedly tions of gender to provide a scaffolding for fu- making errors when referring to hirself ⇒ qual- ture work on analyzing “gender bias in NLP” (§3). ity of service, stereotyping, denigration. (2) Developing an ablation technique for measur-  Direct: not by choice. If an information extrac- ing gender bias in coreference resolution annota- tion system run on resum´ es´ relies on cisnorma- tions, focusing on the human bias that can enter tive assumptions, job experiences by a candidate into annotation tasks (§4). (3) Constructing a new who has transitioned and changed his pronouns dataset, the Gender Inclusive Coreference dataset may be missed ⇒ allocative, erasure. (GICOREF), for testing performance of coreference  Many stakeholders. If a machine translation sys- resolution systems on texts that discuss non-binary tem uses discourse context to generate pronouns, and binary transgender people (§5). then errors can results in directly misgendering subjects of the document being translated ⇒ 2 Related Work quality of service, denigration, erasure. There are four recent papers that consider gender To address such harms as well as understand where bias in coreference resolution systems. Rudinger and how they arise, we need to complicate (a) what et al.(2018) evaluates coreference systems for evi- “gender” means and (b) how harms can enter into dence of occupational stereotyping, by construct- natural language processing (NLP) systems. To- ing Winograd-esque (Levesque et al., 2012) test ex- ward (a), we begin with a unifying analysis (§3) amples. They find that humans can reliably resolve of how gender is socially constructed, and how so- these examples, but systems largely fail at them, cial conditions in the world impose expectations typically in a gender-stereotypical way. In contem- around people’s gender. Of particular interest is poraneous work, Zhao et al.(2018a) proposed a how gender is reflected in language, and how that very similar, also Winograd-esque scheme, also both matches and potentially mismatches the way for measuring gender-based occupational stereo- people experience their gender in the world. Then, types. In addition to reaching similar conclusions in order to understand social biases around gen- to Rudinger et al.(2018), this work also used a der, we find it necessary to consider the different similar “counterfactual” data process as we use in ways in which gender can be realized linguisti- § 4.1 in order to provide additional training data cally, breaking down what previously have been to a coreference resolution system. Webster et al. considered “gendered words” in NLP papers into (2018) produced the GAP dataset for evaluating finer-grained categories that have been identified in coreference systems, by specifically seeking exam- the sociolinguistics literature of lexical, referential, ples where “gender” (left underspecified) could not grammatical, and social gender. be used to help coreference. They found that coref- erence systems struggle in these cases, also point- Toward (b), we focus on how bias can enter ing to the fact that some success of current corefer- into two stages of machine learning systems: data ence systems is due to reliance on (binary) gender annotation (§ 4) and model definition (§ 5). We stereotypes. Finally, Ackerman(2019) presents construct two new datasets: (1) MAP (a similar an alternative breakdown of gender than we use dataset to GAP (Webster et al., 2018) but without (§ 3), and proposes matching criteria for model- binary gender constraints) on which we can per- 3Both datasets are released under a BSD license at form counterfactual manipulations and (2) GICoref github.com/TristaCao/into inclusivecoref (a fully annotated coreference resolution dataset with corresponding datasheets (Gebru et al., 2018). ing coreference resolution linguistically, taking a 3.1 Sociological Gender trans-inclusive perspective on gender. Many modern trans-inclusive models of gender rec- Gender bias in NLP has been considered more ognize that gender encompasses many different broadly than just in coreference resolution, includ- aspects. These aspects include the experience that ing, natural language inference (Rudinger et al., one has of gender (or lack thereof), the way that 2017), word embeddings (e.g., Bolukbasi et al., one expresses one’s gender to the world, and the 2016; Romanov et al., 2019; Gonen and Goldberg, way that normative social conditions impose gender 2019), sentiment analysis (Kiritchenko and Mo- norms, typically as a dichotomy between mascu- hammad, 2018), machine translation (Font and line and feminine roles or traits (Kramarae and Tre- Costa-jussa`, 2019; Prates et al., 2019; Dryer, 2013; ichler, 1985; West and Zimmerman, 1987; Butler, Frank et al., 2004; Wandruszka, 1969; Nissen, 1990; Risman, 2009; Serano, 2007). Gender self- 2002; Doleschal and Schmid, 2001), among many determination, on the other hand, holds that each others (Blodgett et al., 2020, inter alia). Gender is person is the “ultimate authority” on their own gen- also an object of study in gender recognition sys- der identity (Zimman, 2019; Stanley, 2014), with tems (Hamidi et al., 2018). Much of this work has Zimman(2019) further arguing the importance of focused on gender bias with a (usually implicit) the role language plays in that determination. binary lens, an issue which was also called out Such trans-inclusive models deconflate anatomi- recently by Larson(2017b) and May(2019). cal and biological traits and the sex that a person 3 Linguistic & Social Gender had assigned to them at birth from one’s gendered position in society; this includes people, The concept of gender is complex and contested, whose anatomical/biological factors do not match covering (at least) aspects of a person’s internal ex- the usual designational criteria for either sex. Trans- perience, how they express this to the world, how inclusive views typically recognize that gender ex- social conditions in the world impose expectations ists beyond the regressive “female”/“male” binary4; on them (including expectations around their sex- additionally, one’s gender may shift by time or con- uality), and how they are perceived and accepted text (often “genderfluid”), and some people do not (or not). When this complex concept is realized in experience gender at all (often “agender”) (Kessler language, the situation becomes even more com- and McKenna, 1978; Schilt and Westbrook, 2009; plex: linguistic categories of gender do not even Darwin, 2017; Richards et al., 2017). In §5 we an- remotely map one-to-one to social categories. As alyze the degree to which NLP papers make trans- observed by Bucholtz(1999): inclusive or trans-exclusive assumptions. “Attempts to read linguistic structure di- Social gender refers to the imposition of gen- rectly for information about social gender der roles or traits based on normative social condi- are often misguided.” tions (Kramarae and Treichler, 1985), which often For instance, when working in a language like En- includes imposing a dichotomy between feminine glish which formally marks gender on pronouns, it and masculine (in behavior, dress, speech, occupa- is all too easy to equate “recognizing the pronoun tion, societal roles, etc.). Ackerman(2019) high- that corefers with this name” with “recognizing the lights a highly overlapping concept, “bio-social real-world gender of referent of that name.” gender”, which consists of gender role, gender ex- Furthermore, despite the impossibility of a per- pression, and . Taking gender role fect alignment with linguistic gender, it is gener- as an example, upon learning that a nurse is coming ally clear that an incorrectly gendered reference to their hospital room, a patient may form expecta- to a person (whether through pronominalization tions that this person is likely to be “female,” and or otherwise) can be highly problematic (Johnson may generate expectations around how their face or et al., 2019; McLemore, 2015). This process of body may look, how they are likely to be dressed, misgendering is problematic for both trans and cis how and where hair may appear, how to refer to individuals to the extent that transgender historian them, and so on. This process, often referred to as Stryker(2008) writes: gendering (Serano, 2007) occurs both in real world “[o]ne’s gender identity could perhaps best be described as how one feels about being 4Some authors use female/male for sex and woman/man for gender; we do not need this distinction (which is itself referred to by a particular pronoun.” contestable) and use female/male for gender. interactions, as well as in purely linguistic settings as gendered terms of address like “Mrs.” Impor- (e.g., reading a newspaper), in which readers may tantly, lexical gender is a property of the linguistic use social gender clues to assign gender(s) to the unit, not a property of its referent in the real world, real world people being discussed. which may or may not exist. For instance, in “Ev- ery son loves his parents”, there is no real world 3.2 Linguistic Gender referent of “son” (and therefore no referential gen- Our discussion of linguistic gender largely fol- der), yet it still (likely) takes HIS as a pronoun lows (Corbett, 1991; Ochs, 1992; Craig, 1994; Cor- anaphor because “son” has lexical gender MASC. bett, 2013; Hellinger and Motschenbacher, 2015; Fuertes-Olivera, 2007), departing from earlier char- 3.3 Social and Linguistic Gender Interplays acterizations that postulate a direct mapping from The relationship between these aspects of gender language to gender (Lakoff, 1975; Silverstein, is complex, and none is one-to-one. The refer- 1979). Our taxonomy is related but not identical to ential gender of an individual (e.g., pronouns in (Ackerman, 2019), which we discuss in§2. English) may or may not match their social gender Grammatical gender, similarly defined in Ack- and this may change by context. This can happen in erman(2019), is nothing more than a classification the case of people whose everyday life experience of nouns based on a principle of grammatical agree- of their gender fluctuates over time (at any inter- ment. In “gender languages” there are typically val), as well as in the case of drag performers (e.g., two or three grammatical genders that have, for some men who perform drag are addressed as SHE animate or personal references, considerable cor- while performing, and HE when not (for Transgen- respondence between a FEM (resp. MASC) gram- der Equality, 2017)). The other linguistic forms of matical gender and referents with female- (resp. gender (grammatical, lexical) also need not match 5 male-) social gender. In comparison, “noun class each other, nor match referential gender (Hellinger languages” have no such correspondence, and typ- and Motschenbacher, 2015). ically many more classes. Some languages have Social gender (societal expectations, in particu- no grammatical gender at all; English is generally lar) captures the observation that upon hearing “My seen as one (Nissen, 2002; Baron, 1971) (though cousin is a librarian”, many speakers will infer “fe- this is contested (Bjorkman, 2017)). male” for “cousin”, because of either an entailment Referential gender (similar, but not identical of “librarian” or some sort of probabilistic infer- to Ackerman’s (2019) “conceptual gender”) re- ence (Lyons, 1977), but not based on either gram- lates linguistic expressions to extra-linguistic re- matical gender (which does not exist in English) or ality, typically identifying referents as “female,” lexical gender. We focus on English, which has no “male,” or “gender-indefinite.” Fundamentally, ref- grammatical gender, but does have lexical gender. erential gender only exists when there is an entity English also marks referential gender on singular being referred to, and their gender (or sex) is real- third person pronouns. ized linguistically. The most obvious examples in Below, we use this more nuanced notion of dif- English are gendered third person pronouns (SHE, ferent types of gender to inspect how bias play out HE), including neopronouns (ZE, EM) and singular in coreference resolution systems. These biases 6 THEY , but also includes cases like “policeman” may arise in the context of any of these notions of when the intended referent of this noun has so- gender, and we encourage future work to extend cial gender “male” (though not when “policeman” care over and be explicit about what notions of is used non-referentially, as in “every policeman gender are being utilized and when. needs to hold others accountable”). Lexical gender refers to an extra-linguistic 4 Bias in Human Annotation properties of female-ness or male-ness in a non- referential way, as in terms like “mother” as well A possible source of bias in coreference systems comes from human annotations on the data used 5One difficulty in this discussion is that linguistic gender and social gender use the terms “feminine” and “masculine” to train them. Such biases can arise from a com- differently; to avoid confusion, when referring to the linguistic bination of (possibly) underspecified annotations properties, we use FEM and MASC. guidelines and the positionality of annotators them- 6People’s mental acceptability of singular THEY is still rel- atively low even with its increased usage (Prasad and Morris, selves. In this section, we study how different 2020), and depends on context (Conrod, 2018). aspects of linguistic notions impact an annotator’s (d) (b) (c) Mrs. −−→ /0 Rebekah Johnson Bobbitt −−→ M. Booth was the younger sister −→ sibling of (b) Lyndon B. Johnson −−→ T. Schneider, 36th President of the United States. Born in 1910 in Stonewall, (a) Texas, she −−→ they worked in the cataloging department of the Library of Congress in the 1930s before (a) (c) her −−→ their brother −→ sibling entered politics.

Figure 1: Example of applying all ablation substitutions for an example context in the MAP corpus. Each substitution type is marked over the arrow and separately color-coded. judgments of anaphora. This parallels Ackerman ing Webster et al.’s approach, but we do not restrict (2019) linguistic analysis, in which a Broad Match- the two names to match gender. ing Criterion is proposed, which posits that “match- In ablating gender information, one challenge ing gender requires at least one level of the mental is that removing social gender cues (e.g., “nurse” representation of gender to be identical to the can- tending female) is not possible because they can ex- didate antecedent in order to match.” ist anywhere. Likewise, it is not possible to remove Our study can be seen as evaluating which con- syntactic cues in a non-circular manner. For exam- ceptual properties of gender are most salient in ple in (1), syntactic structure strongly suggests the human judgments. We start with natural text in antecedent of “herself” is “Liang”, making it less which we can cast the coreference task as a binary likely that “He” corefers with Liang later (though classification problem (“which of these two names it is possible, and such cases exist in natural data does this pronoun refer to?”) inspired by Webster due either to genderfluidity or misgendering). et al.(2018). We then generate “counterfactual aug- (1) Liang saw herself in the mirror. . .He ... mentations” of this dataset by ablating the various Fortunately, it is possible to enumerate a high cov- notions of linguistic gender described in §3.2, sim- erage list of English terms that signal lexical gen- ilar to Zmigrod et al.(2019). We finally evaluate der: terms of address (Mrs., Mr.) and semantically the impact of these ablations on human annotation gendered nouns (mother).8 We assembled a list by behavior to answer the question: which forms of taking many online lists (mostly targeted at English linguistic knowledge are most essential for human language learners), merging them, and manual fil- annotators to make consistent judgments. See Ap- tering. The assembling process and the final list is pendix A for examples of how linguistic gender published with the MAP dataset and its datasheet. may be used to infer social gender. To execute the “hiding” of various aspects of 4.1 Ablation Methodology gender, we use the following substitutions: (a) ¬PRO: Replace third person pronouns with In order to determine which cues annotators are us- gender neutral variants (THEY, XEY, ZE). ing and the degree to which they use them, we con- (b) ¬NAME: Replace names by random names struct an ablation study in which we hide various with only a first initial and last name. aspects of gender and evaluate how this impacts (c) ¬SEM: Replace semantically gendered nouns annotators’ judgments of anaphoricity. We con- with gender-indefinite variants. struct binary classification examples taken from (d) ¬ADDR: Remove terms of address.9 Wikipedia pages, in which a single pronoun is See Figure 1 for an example of all substitutions. selected, and two possible antecedent names are We perform two sets of experiments, one fol- given, and the annotator must select which one. We lowing a “forward selection” type ablation (start cannot use Webster et al.’s GAP dataset directly, with everything removed and add each back in one- because their data is constrained that the “gender” at-a-time) and one following “backward selection” 7 of the two possible antecedents is “the same” ; for (remove each separately). Forward selection is nec- us, we are specifically interested in how annotators essary in order to de-conflate syntactic cues from make decisions even when additional gender infor- 8 mation is available. Thus, we construct a dataset These are, however, sometimes complex. For instance, “actress” signals lexical gender of female, while “actor” may called Maybe Ambiguous Pronoun (MAP) follow- signal social gender of male and, in certain varieties of English, may also signal lexical gender of male. 7It is unclear from the GAP dataset what notion of “gender” 9An alternative suggested by Cassidy Henry that we did is used, nor how it was determined to be “the same.” not explore would be to replace all with Mx. or Dr. stereotypes; while backward selection gives a sense of how much impact each type of gender cue has in the context of all the others. We begin with ZERO, in which we apply all four substitutions. Since this also removes gender cues from the pronouns themselves, an annotator cannot substantially rely on social gender to per- form these resolutions. We next consider adding back in the original pronouns (always HE or SHE here), yielding ¬NAME ¬SEM ¬ADDR. Any dif- ference in annotation behavior between ZERO and ¬NAME ¬SEM ¬ADDR can only be due to so- cial gender stereotypes. The next setting, ¬SEM ¬ADDR removes both forms of lexical gender (se- mantically gendered nouns and terms of address); Figure 2: Human annotation results for the ablation differences between ¬SEM ¬ADDR and ¬NAME study on MAP dataset. Each column is a different abla- ¬SEM ¬ADDR show how much names are relied tion, and the y-axis is the degree of accuracy with 95% on for annotation. Similarly, ¬NAME ¬ADDR re- significance intervals. Bottom bar plots are annotator moves names and terms of address, showing the im- certainties as how sure they are in their choices. pact of semantically gendered nouns, and ¬NAME ¬SEM removes names and semantically gendered role when annotators resolve coreferences. This nouns, showing the impact of terms of address. confirms the findings of Rudinger et al.(2018) and In the backward selection case, we begin with Zhao et al.(2018a) that human annotated data in- ORIG, which is the unmodified original text. To corporates bias from stereotypes. this, we can apply the pronoun filter to get ¬PRO; Moreover, if we compare ORIG with columns differences in annotation between ORIG and ¬PRO left to it, we see that name is another significant give a measure of how much any sort of gender- cue for annotator judgments, while lexical gender based inference is used. Similarly, we get ¬NAME cues do not have significant impacts on human by only removing names, which gives a measure annotation accuracies. This is likely in part due of how much names are used (in the context of to the low appearance frequency of lexical gen- all other cues); we get ¬SEM by only removing der cues in our dataset. Every example has pro- semantically gendered words; and ¬ADDR by only nouns and names, whereas 49% of the examples removing terms of address. have semantically gendered nouns but only 3% of 4.2 Annotation Results the examples include terms of address. We also We construct examples using the methodology de- note that if we compare ¬NAME ¬SEM ¬ADDR fined above. We then conduct annotation experi- to ¬SEM ¬ADDR and ¬NAME ¬ADDR, accuracy ments using crowdworkers on Amazon Mechanical drops when removing gender cues. Though the Turk following the methodology by which the origi- differences are not statistically significant, we did nal GAP corpus was created10. Because we wanted not expect the accuracy drop. to also capture uncertainty, we ask the crowdwork- Finally, we find annotators’ certainty values fol- ers how sure they are in their choices, between low the same trend as the accuracy: annotators “definitely” sure, “probably” sure and “unsure.” have a reasonable sense of when they are unsure. Figure 2 shows the human annotation results as We also note that accuracy score are essentially binary classification accuracy for resolving the pro- the same for ZERO and ¬PRO, which suggests that noun to the antecedent. We can see that removing once explicit binary gender is gone from pronouns, pronouns leads to significant drop in accuracy. This the impact of any other form of linguistic gender indicates that gender-based inferences, especially in annotator’s decisions is also removed. social gender stereotypes, play the most significant 5 Bias in Model Specifications 10Our study was approved by the Microsoft Research Ethics Board. Workers were paid $1 to annotate ten contexts (the In addition to biases that can arise from the data average annotation time was seven minutes). that a system is trained on, as studied in the previ- ous section, bias can also come from how models All Papers Coref Papers are structured. For instance, a system may fail to L.G? 52.6% (of 150) 95.4% (of 22) recognize anything other than a dictionary of fixed S.G? 58.0% (of 150) 86.3% (of 22) pronouns as possible referents to entities. Here, L.G6=S.G? 11.1% (of 27) 5.5% (of 18) S.G Binary? 92.8% (of 84) 94.4% (of 18) we analyze prior work in models for coreference S.G Immutable? 94.5% (of 74) 100.0% (of 14) resolution in three ways. First, we do a literature They/Neo? 3.5% (of 56) 7.1% (of 14) study to quantify how NLP papers discuss gender. Second, similar to Zhao et al.(2018a) and Rudinger Table 1: Analysis of a corpus of 150 NLP papers that mention “gender” along the lines of what assumptions et al.(2018), we evaluate five freely available sys- around gender are implicitly or explicitly made. tems on the ablated data from §4. Third, we evalu- ate these systems on the dataset we created: Gender Inclusive Coreference (GICOREF).  Only 7.1% (one paper!) considers neopronouns and/or specific singular THEY. 5.1 Cis-normativity in published NLP papers The situation for papers not specifically about In our first study, we adapt the approach Keyes coreference is similar (the majority of these pa- (2018) took for analyzing the degree to which com- pers are either purely linguistic papers about gram- puter vision papers encoded trans-exclusive models matical gender in languages other than English, of gender. In particular, we began with a random or papers that do “gender recognition” of au- sample of ∼150 papers from the ACL anthology thors based on their writing; May(2019) discusses that mention the word “gender” and coded them the (re)production of gender in automated gender according to the following questions: recognition in NLP in much more detail). Overall, • Does the paper discuss coreference resolution? the situation more broadly is equally troubling, and • Does the paper study English? generally also fails to escape from the folk theory • L.G: Does the paper deal with linguistic gender of gender. In particular, none of the differences are (grammatical gender or gendered pronouns)? significant at a p = 0.05 level except for the first • S.G: Does the paper deal with social gender? two questions, due to the small sample size (accord- • L.G6=S.G: (If yes to L.G and S.G:) Does the ing to an n − 1 chi-squared test). The result is that paper distinguish linguistic from social gender? although we do not know exactly what decisions • S.G Binary: (If yes to S.G:) Does the paper ex- are baked in to all systems, the vast majority in our plicitly or implicitly assume that social gender is study (including two papers by one of the authors binary? (Daume´ and Marcu, 2005; Orita et al., 2015)) come • S.G Immutable: (If yes to S.G:) Does the paper with strong assumptions, and exist explicitly or implicitly assume social gender is within a broader sphere of literature which erases immutable? non-binary and binary trans identities. • They/Neo: (If yes to S.G and to English:) Does the paper explicitly consider uses of definite sin- 5.2 System performance on MAP gular “they” or neopronouns? Next, we analyze the effect that our different ab- The results of this coding are in Table 1 (the full lation mechanisms have on existing coreference annotation is in Appendix B). We see out of the resolutions systems. In particular, we run five 22 coreference papers analyzed, the vast majority coreference resolution systems on our ablated data: conform to a “folk” theory of language: the AI2 system (AI2; Gardner et al., 2017), hug-  Only 5.5% distinguish social from linguistic gen- ging face (HF; Wolf, 2017), which is a neural sys- der (despite it being relevant); tem based on spacy, and the Stanford deterministic  Only 5.6% explicitly model gender as inclusive (SfdD; Raghunathan et al., 2010), statistical (SfdS; of non-binary identities; Clark and Manning, 2015) and neural (SfdN; Clark  No papers treat gender as anything other than and Manning, 2016) systems. Figure 3 shows the completely immutable;11 results. We can see that the system accuracies mostly follow the same pattern as human accu- 11The most common ways in which papers implicitly as- sume that social gender is immutable is either 1) by relying on racy scores, though all are significantly lower than external knowledge bases that map names to “gender”; or 2) human results. Accuracy scores for systems drop by scraping a history of a user’s social media posts or emails and assuming that their “gender” today matches the gender of that historical record. Precision Recall F1 AI2 40.4% 29.2% 33.9% HF 68.8% 22.3% 33.6% SfdD 50.8% 23.9% 32.5% SfdS 59.8% 24.1% 34.3% SfdN 59.4% 24.0% 34.2%

Table 2: LEA scores on GICOREF (incorrect reference excluded) with various coreference resolution systems. Rows are different systems while columns are preci- sion, recall, and F1 scores. When evaluate, we only count exact matches of pronouns and name entities.

uments from three types of sources: articles from Figure 3: Coreference resolution systems results for English Wikipedia about people with non-binary the ablation study on MAP dataset. The y-axis is the gender identities, articles from LGBTQ periodi- degree of accuracy with 95% significance intervals. cals, and fan-fiction stories from Archive Of Our Own (with the respective author’s permission)12. These documents were each annotated by both of dramatically when we ablate out referential gender the authors and adjudicated.13 This data includes in pronouns. This reveals that those coreference many examples of people who use pronouns other resolution systems reply heavily on gender-based than SHE or HE (the dataset contains 27% HE, 20% inferences. In terms of each systems, HF and SfdN SHE, 35% THEY, and 18% neopronouns, people systems have similar results and outperform other who are genderfluid and whose names or pronouns systems in most cases. SfdD accuracy drops signif- change through the article, people who are mis- icantly once names are ablated. gendered, and people in relationships that are not These results echo and extend previous observa- heteronormative. In addition, incorrect references tions made by Zhao et al.(2018a), who focus on de- (misgendering and deadnaming14) are explicitly tecting stereotypes within occupations. They detect annotated.15 Two example annotated documents, gender bias by checking if the system accuracies one from Wikipedia, and one from Archive of Our are the same for cases that can be resolved by syn- Own, are provided in Appendix C and Appendix D. tactic cues and cases that cannot, with original data We run the same systems as before on this and reversed-gender data. Similarly, Rudinger et al. dataset. Table 2 reports results according the stan- (2018) focus on detecting stereotypes within occu- dard coreference resolution evaluation metric LEA pations as well. They construct dataset without any (Moosavi and Strube, 2016). Since no systems gender cues other than stereotypes, and check how are implemented to explicitly mark incorrect ref- systems perform with different pronouns – THEY, erences, and no current evaluation metrics address SHE, HE. Ideally, they should all perform the same this case, we perform the same evaluation twice. because there is not any gender cues in the sen- One with incorrect references included as regular tence. However, they find that systems do not work references in the ground truth; and other with in- on “they” and perform better on “he” than “she”. correct references excluded. Due to the limited Our analysis breaks this stereotyping down further number of incorrect references in the dataset, the to detect which aspects of gender signals are most leveraged by current systems. 12See https://archiveofourown.org; thanks to Os Keyes for this suggestion. 5.3 System behavior on gender-inclusive data 13We evaluate inter-annotator agreement by treating one annotation as gold standard and the other as system output Finally, in order to evaluate current coreference res- and computing the LEA metric; the resulting F1-score is 92%. olution models in gender inclusive contexts we in- During the adjudication process we found that most of the dis- agreement are due to one of the authors missing/overlooking troduce a new dataset, GICOREF. Here we focused mentions, and rarely due to true “disagreement.” on naturally occurring data, but sampled specifi- 14According to Clements(2017) deadnaming occurs when cally to surface more gender-related phenomena someone, intentionally or not, refers to a person who’s trans- gender by the name they used before they transitioned. than may be found in, say, the Wall Street Journal. 15Thanks to an anonymous reader of a draft version of this Our new GICOREF dataset consists of 95 doc- paper for this suggestion. difference of the results are not significant. Here We also emphasize that while we endeavored to we only report the latter. be inclusive, our own positionality has undoubt- The first observation is that there is still plenty edly led to other biases. One in particular is a room for coreference systems to improve; the best largely Western bias, both in terms of what models performing system achieves as F1 score of 34%, but of gender we use and also in terms of the data we the Stanford neural system’s F1 score on CoNLL- annotated. We have attempted to partially compen- 2012 test set reaches 60% (Moosavi, 2020). Ad- sate for this bias by intentionally including docu- ditionally, we can see system precision dominates ments with non-Western non-binary expressions recall. This is likely partially due to poor recall of of gender in the GICoref dataset16, but the dataset pronouns other than HE and SHE. To analyze this, nonetheless remains Western-dominant. we compute the recall of each system for finding Additionally, our ability to collect naturally oc- referential pronouns at all, regardless of whether curring data was limited because many sources they are correctly linked to their antecedents. We simply do not yet permit (or have only recently find that all systems achieve a recall of at least 95% permitted) the use of gender inclusive language in for binary pronouns, a recall of around 90% on their articles. This led us to counterfactual text average for THEY, and a recall of around a paltry manipulation, which, while useful, is essentially 13% for neopronouns (two systems—Stanford de- impossible to do flawlessly. Moreover, our abil- terministic and Stanford neural—never identify any ity to evaluate coreference systems with data that neopronouns at all). includes incorrect references was limited as well, because current systems do not mark any forms of 6 Discussion and Moving Forward misgendering or deadnaming explicitly, and cur- rent metrics do not take this into account. Finally, Our goal in this paper was to analyze how gender because the social construct of gender is fundamen- bias exist in coreference resolution annotations and tally contested, some of our results may apply only models, with a particular focus on how it may fail under some frameworks. to adequately process text involving binary and We hope this paper can serve as a roadmap for fu- non-binary trans referents. We thus created two ture studies. In particular, the gender taxonomy we datasets: MAP and GICOREF. Both datasets show presented, while not novel, is (to our knowledge) significant gaps in system performance, but perhaps previously unattested in discussions around gender moreso, show that taking crowdworker judgments bias in NLP systems; we hope future work in this as “gold standard” can be problematic. It may be area can draw on these ideas. We also hope that the case that to truly build gender inclusive datasets developers of datasets or systems can use some of and systems, we need to hire or consult experiential our analysis as inspiration for how one can attempt experts (Patton et al., 2019; Young et al., 2019). to measure—and then root out—different forms Moreover, although we studied crowdworkers on of bias in coreference resolution systems and NLP Mechanical Turk (because they are often employed systems more broadly. as annotators for NLP resources), if other popula- tions are used for annotation, it becomes important Acknowledgments to consider their positionality and how that may im- pact annotations. This echoes a related finding in The authors are grateful to a number of people annotation of hate-speech that annotator positional- who have provided pointers, edits, suggestions, and ity matters (Olteanu et al., 2019). More broadly, we annotation facilities to improve this work: Lau- found that trans-exclusionary assumptions around ren Ackerman, Cassidy Henry, Os Keyes, Chan- gender in NLP papers is made commonly (and dler May, Hanyu Wang, and Marion Zepf, all con- implicitly), a practice that we hope to see change tributed to various aspects of this work, includ- in the future because it fundamentally limits the ing suggestions for data sources for the GI Coref applicability of NLP systems. dataset. We also thank the CLIP lab at the Univer- sity of Maryland for comments on previous drafts. The primary limitation of our study and analysis is that it is limited to English. This is particularly 16We endeavored to represent some non-Western gender limiting because English lacks a grammatical gen- identies that do not fall into the male/female binary, including people who identify as (Indian subcontinent), phuying der system, and some extensions of our work to (Thailand, sometimes referred to as ), (Oaxaca), languages with grammatical gender are non-trivial. two-spirit (Americas), fa’afafine (Samoa) and mah¯ u¯ (Hawaii). References Papers, pages 153–156, Columbus, Ohio. Associa- tion for Computational Linguistics. Saleem Abuleil, Khalid Alsamara, and Martha Evens. 2002. Acquisition system for Arabic noun morphol- Ibrahim Badr, Rabih Zbib, and James Glass. 2009. Syn- ogy. In Proceedings of the ACL-02 Workshop on tactic phrase reordering for English-to-Arabic sta- Computational Approaches to Semitic Languages, tistical machine translation. In Proceedings of the Philadelphia, Pennsylvania, USA. Association for 12th Conference of the European Chapter of the ACL Computational Linguistics. (EACL 2009), pages 86–93, Athens, Greece. Associ- ation for Computational Linguistics. Lauren Ackerman. 2019. Syntactic and cognitive is- sues in investigating gendered coreference. Glossa: R. I. Bainbridge. 1985. Montagovian definite clause a journal of general linguistics, 4. grammar. In Second Conference of the European Chapter of the Association for Computational Lin- Apoorv Agarwal, Jiehan Zheng, Shruti Kamath, Sri- guistics, Geneva, Switzerland. Association for Com- ramkumar Balasubramanian, and Shirin Ann Dey. putational Linguistics. 2015. Key female characters in film have more to talk about besides men: Automating the Bechdel Janet Baker, Larry Gillick, and Robert Roth. 1994. Re- test. In Proceedings of the 2015 Conference of the search in large vocabulary continuous speech recog- North American Chapter of the Association for Com- nition. In Human Language Technology: Proceed- putational Linguistics: Human Language Technolo- ings of a Workshop held at Plainsboro, New Jersey, gies, pages 830–840, Denver, Colorado. Association March 8-11, 1994. for Computational Linguistics. Murali Raghu Babu Balusu, Taha Merghani, and Jacob Tafseer Ahmed Khan. 2014. Automatic acquisition of Eisenstein. 2018. Stylistic variation in social media Urdu nouns (along with gender and irregular plu- part-of-speech tagging. In Proceedings of the Sec- rals). In Proceedings of the Ninth International ond Workshop on Stylistic Variation, pages 11–19, Conference on Language Resources and Evalua- New Orleans. Association for Computational Lin- tion (LREC’14), pages 2846–2850, Reykjavik, Ice- guistics. land. European Language Resources Association (ELRA). Francesco Barbieri and Jose Camacho-Collados. 2018. How gender and skin tone modifiers affect emoji se- Sarah Alkuhlani and Nizar Habash. 2011. A corpus mantics in twitter. In Proceedings of the Seventh for modeling morpho-syntactic agreement in Arabic: Joint Conference on Lexical and Computational Se- Gender, number and rationality. In Proceedings of mantics, pages 101–106, New Orleans, Louisiana. the 49th Annual Meeting of the Association for Com- Association for Computational Linguistics. putational Linguistics: Human Language Technolo- gies, pages 357–362, Portland, Oregon, USA. Asso- Solon Barocas, Kate Crawford, Aaron Shapiro, and ciation for Computational Linguistics. Hanna Wallach. 2017. The Problem With Bias: Al- locative Versus Representational Harms in Machine Sarah Alkuhlani and Nizar Habash. 2012. Identify- Learning. In Proceedings of SIGCIS. ing broken plurals, irregular gender, and rationality in Arabic text. In Proceedings of the 13th Confer- Naomi S. Baron. 1971. A reanalysis of english gram- ence of the European Chapter of the Association matical gender. Lingua, 27:113–140. for Computational Linguistics, pages 675–685, Avi- gnon, France. Association for Computational Lin- Emily M. Bender. 2019. A typology of ethical risks guistics. in language technology with an eye towards where transparent documentation can help. The Future of Tania Avgustinova and Hans Uszkoreit. 2000. An on- Artificial Intelligence: Language, Ethics, Technol- tology of systematic relations for a shared grammar ogy. of Slavic. In COLING 2000 Volume 1: The 18th International Conference on Computational Linguis- Shane Bergsma and Dekang Lin. 2006. Bootstrapping tics. path-based pronoun resolution. In Proceedings of the 21st International Conference on Computational Bogdan Babych, Jonathan Geiger, Mireia Linguistics and 44th Annual Meeting of the Associ- Ginest´ı Rosell, and Kurt Eberle. 2014. Deriv- ation for Computational Linguistics, pages 33–40, ing de/het gender classification for Dutch nouns for Sydney, Australia. Association for Computational rule-based MT generation tasks. In Proceedings Linguistics. of the 3rd Workshop on Hybrid Approaches to Machine Translation (HyTra), pages 75–81, Gothen- Shane Bergsma, Dekang Lin, and Randy Goebel. 2009. burg, Sweden. Association for Computational Glen, glenda or glendale: Unsupervised and semi- Linguistics. supervised learning of English noun gender. In Pro- ceedings of the Thirteenth Conference on Computa- Ibrahim Badr, Rabih Zbib, and James Glass. 2008. Seg- tional Natural Language Learning (CoNLL-2009), mentation for English-to-Arabic statistical machine pages 120–128, Boulder, Colorado. Association for translation. In Proceedings of ACL-08: HLT, Short Computational Linguistics. Shane Bergsma, Matt Post, and David Yarowsky. 2012. Maria Bustillos. 2011. Our desperate, 250-year-long Stylometric analysis of scientific articles. In Pro- search for a gender-neutral pronoun. permalink. ceedings of the 2012 Conference of the North Amer- ican Chapter of the Association for Computational Judith Butler. 1990. Gender Trouble. Routledge. Linguistics: Human Language Technologies, pages Aylin Caliskan, Joanna J. Bryson, and Arvind 327–337, Montreal,´ Canada. Association for Com- Narayanan. 2017. Semantics derived automatically putational Linguistics. from language corpora contain human-like biases. Bronwyn M. Bjorkman. 2017. Singular they and Science, 356(6334). the syntactic representation of gender in english. Michael Carl, Sandrine Garnier, Johann Haller, Anne Glossa: A Journal of General Linguistics, 2(1):80. Altmayer, and Barbel¨ Miemietz. 2004. Control- Su Lin Blodgett, Solon Barocas, Hal Daume,´ III, and ling gender equality with shallow NLP techniques. Hanna Wallach. 2020. Language (technology) is In COLING 2004: Proceedings of the 20th Inter- power: The need to be explicit about NLP harms. national Conference on Computational Linguistics, In Proceedings of the Conference of the Association pages 820–826, Geneva, Switzerland. COLING. for Computational Linguistics (ACL). Sophia Chan and Alona Fyshe. 2018. Social and emo- Ondrejˇ Bojar, Rudolf Rosa, and Alesˇ Tamchyna. 2013. tional correlates of capitalization on twitter. In Chimera – three heads for English-to-Czech trans- Proceedings of the Second Workshop on Computa- lation. In Proceedings of the Eighth Workshop on tional Modeling of People’s Opinions, Personality, Statistical Machine Translation, pages 92–98, Sofia, and Emotions in Social Media, pages 10–15, New Bulgaria. Association for Computational Linguis- Orleans, Louisiana, USA. Association for Computa- tics. tional Linguistics. Tolga Bolukbasi, Kai-Wei Chang, James Zou, Songsak Channarukul, Susan W. McRoy, and Syed S. Venkatesh Saligrama, and Adam Kalai. 2016. Ali. 2000. Enriching partially-specified represen- Man is to Computer Programmer as Woman is to tations for text realization using an attribute gram- INLG’2000 Proceedings of the First Interna- Homemaker? Debiasing Word Embeddings. In mar. In tional Conference on Natural Language Generation Proceedings of NeurIPS. , pages 163–170, Mitzpe Ramon, Israel. Association Constantinos Boulis and Mari Ostendorf. 2005.A for Computational Linguistics. quantitative analysis of lexical differences between Eric Charton and Michel Gagnon. 2011. Poly-co: a genders in telephone conversations. In Proceed- multilayer perceptron approach for coreference de- ings of the 43rd Annual Meeting of the Association tection. In Proceedings of the Fifteenth Confer- for Computational Linguistics (ACL’05), pages 435– ence on Computational Natural Language Learning: 442, Ann Arbor, Michigan. Association for Compu- Shared Task, pages 97–101, Portland, Oregon, USA. tational Linguistics. Association for Computational Linguistics. Evan Bradley, Julia Salkind, Ally Moore, and Sofi Teit- Chen Chen and Vincent Ng. 2014. Chinese zero sort. 2019. Singular ‘they’ and novel pronouns: pronoun resolution: An unsupervised probabilistic Proceedings of gender-neutral, nonbinary, or both? model rivaling supervised resolvers. In Proceedings the Linguistic Society of America, 4(1):36–1–7. of the 2014 Conference on Empirical Methods in Mary Bucholtz. 1999. Gender. Journal of Linguistic Natural Language Processing (EMNLP), pages 763– Anthropology. Special issue: Lexicon for the New 774, Doha, Qatar. Association for Computational Millennium, ed. Alessandro Duranti. Linguistics. Joy Buolamwini and Timnit Gebru. 2018. Gender Morgane Ciot, Morgan Sonderegger, and Derek Ruths. shades: Intersectional accuracy disparities in com- 2013. Gender inference of Twitter users in non- mercial gender classification. English contexts. In Proceedings of the 2013 Con- ference on Empirical Methods in Natural Language John D. Burger, John Henderson, George Kim, and Processing, pages 1136–1145, Seattle, Washington, Guido Zarrella. 2011. Discriminating gender on USA. Association for Computational Linguistics. twitter. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, Kevin Clark and Christopher D. Manning. 2015. pages 1301–1309, Edinburgh, Scotland, UK. Associ- Entity-centric coreference resolution with model ation for Computational Linguistics. stacking. In ACL. Kevin Clark and Christopher D. Manning. 2016. Deep Felix Burkhardt, Martin Eckert, Wiebke Johannsen, reinforcement learning for mention-ranking corefer- and Joachim Stegmann. 2010. A database of age ence models. In EMNLP. and gender annotated telephone speech. In Proceed- ings of the Seventh International Conference on Lan- KC Clements. 2017. What is deadnaming? Blog post. guage Resources and Evaluation (LREC’10), Val- letta, Malta. European Language Resources Associ- Kirby Conrod. 2018. Changes in singular they. In Cas- ation (ELRA). cadia Workshop in Sociolinguistics. Greville G. Corbett. 1991. Gender. Cambridge Univer- Thierry Declerck, Nikolina Koleva, and Hans-Ulrich sity Press. Krieger. 2012. Ontology-based incremental annota- tion of characters in folktales. In Proceedings of the Greville G. Corbett. 2013. Number of genders. In 6th Workshop on Language Technology for Cultural Matthew S. Dryer and Martin Haspelmath, editors, Heritage, Social Sciences, and Humanities, pages The World Atlas of Language Structures Online. 30–34, Avignon, France. Association for Computa- Max Planck Institute for Evolutionary Anthropol- tional Linguistics. ogy, Leipzig. Liviu P. Dinu, Vlad Niculae, and Octavia-Maria S¸ulea. Marta R. Costa-jussa.` 2017. Why Catalan-Spanish neu- 2012. The Romanian neuter examined through a ral machine translation? analysis, comparison and two-gender n-gram classification system. In Pro- combination with standard rule and phrase-based ceedings of the Eighth International Conference technologies. In Proceedings of the Fourth Work- on Language Resources and Evaluation (LREC’12), shop on NLP for Similar Languages, Varieties and pages 907–910, Istanbul, Turkey. European Lan- Dialects (VarDial), pages 55–62, Valencia, Spain. guage Resources Association (ELRA). Association for Computational Linguistics. Ursula Doleschal and Sonja Schmid. 2001. Doing gen- der in Russian. Gender Across Languages. The lin- Colette G. Craig. 1994. Classifier languages. The en- guistic representation of women and men, 1:253– cyclopedia of language and linguistics, 2:565–569. 282.

Silviu Cucerzan and David Yarowsky. 2003. Mini- Michael Dorna, Anette Frank, Josef van Genabith, and mally supervised induction of grammatical gender. Martin C. Emele. 1998. Syntactic and semantic In Proceedings of the 2003 Human Language Tech- transfer with f-structures. In 36th Annual Meet- nology Conference of the North American Chapter ing of the Association for Computational Linguis- of the Association for Computational Linguistics, tics and 17th International Conference on Computa- pages 40–47. tional Linguistics, Volume 1, pages 341–347, Mon- treal, Quebec, Canada. Association for Computa- Ali Dada. 2007. Implementation of the Arabic numer- tional Linguistics. als and their syntax in GF. In Proceedings of the 2007 Workshop on Computational Approaches to Matthew S. Dryer. 2013. Expression of pronominal Semitic Languages: Common Issues and Resources, subjects. In Matthew S. Dryer and Martin Haspel- pages 9–16, Prague, Czech Republic. Association math, editors, The World Atlas of Language Struc- for Computational Linguistics. tures Online. Max Planck Institute for Evolutionary Anthropology, Leipzig. Laurence Danlos and Fiametta Namer. 1988. Morphol- ogy and cross dependencies in the synthesis of per- Esin Durmus and Claire Cardie. 2018. Understanding sonal pronouns in romance languages. In Coling Bu- the effect of gender and stance in opinion expression dapest 1988 Volume 1: International Conference on in debates on “abortion”. In Proceedings of the Sec- Computational Linguistics. ond Workshop on Computational Modeling of Peo- ple’s Opinions, Personality, and Emotions in Social Helana Darwin. 2017. Doing gender beyond the bi- Media, pages 69–75, New Orleans, Louisiana, USA. nary: A virtual ethnography. Symbolic Interaction, Association for Computational Linguistics. 40(3):317–334. Jason Eisner and Damianos Karakos. 2005. Bootstrap- ping without the boot. In Proceedings of Human Kareem Darwish, Ahmed Abdelali, and Hamdy Language Technology Conference and Conference Mubarak. 2014. Using stem-templates to improve on Empirical Methods in Natural Language Process- Arabic POS and gender/number tagging. In Pro- ing, pages 395–402, Vancouver, British Columbia, ceedings of the Ninth International Conference on Canada. Association for Computational Linguistics. Language Resources and Evaluation (LREC’14), pages 2926–2931, Reykjavik, Iceland. European Ahmed El Kholy and Nizar Habash. 2012. Rich mor- Language Resources Association (ELRA). phology generation using statistical machine trans- lation. In INLG 2012 Proceedings of the Seventh Hal Daume,´ III and Daniel Marcu. 2005. A large-scale International Natural Language Generation Confer- exploration of effective global features for a joint en- ence, pages 90–94, Utica, IL. Association for Com- tity detection and tracking model. In HLT/EMNLP, putational Linguistics. pages 97–104. Katja Filippova. 2012. User demographics and lan- Łukasz Debowski. 2003. A reconfigurable stochastic guage in an implicit social network. In Proceedings tagger for languages with complex tag structure. In of the 2012 Joint Conference on Empirical Methods Proceedings of the 2003 EACL Workshop on Mor- in Natural Language Processing and Computational phological Processing of Slavic Languages, pages Natural Language Learning, pages 1478–1488, Jeju 63–70, Budapest, Hungary. Association for Compu- Island, Korea. Association for Computational Lin- tational Linguistics. guistics. Lucie Flekova, Jordan Carpenter, Salvatore Giorgi, Goran Glavas,ˇ Damir Korenciˇ c,´ and Jan Snajder.ˇ 2013. Lyle Ungar, and Daniel Preot¸iuc-Pietro. 2016. An- Aspect-oriented opinion mining from user reviews alyzing biases in human perception of user age and in Croatian. In Proceedings of the 4th Biennial In- gender from text. In Proceedings of the 54th An- ternational Workshop on Balto-Slavic Natural Lan- nual Meeting of the Association for Computational guage Processing, pages 18–23, Sofia, Bulgaria. As- Linguistics (Volume 1: Long Papers), pages 843– sociation for Computational Linguistics. 854, Berlin, Germany. Association for Computa- tional Linguistics. Yoav Goldberg and Michael Elhadad. 2013. Word seg- mentation, unknown-word resolution, and morpho- Joel Escude´ Font and Marta R. Costa-jussa.` 2019. logical agreement in a Hebrew parsing system. Com- Equalizing gender biases in neural machine transla- putational Linguistics, 39(1):121–160. tion with word embeddings techniques. In Proceed- ings of the 1st ACL Workshop on Gender Bias for Hila Gonen and Yoav Goldberg. 2019. Lipstick on a Natural Language Processing. Pig: Debiasing Methods Cover up Systematic Gen- der Biases in Word Embeddings But do not Remove Anke Frank, Chr Hoffmann, Maria Strobel, et al. 2004. Them. In Proceedings of NAACL-HLT. Gender issues in machine translation. Univ. Bremen. Rob van der Goot, Nikola Ljubesiˇ c,´ Ian Matroos, Malv- Batya Friedman and Helen Nissenbaum. 1996. Bias in ina Nissim, and Barbara Plank. 2018. Bleaching computer systems. ACM Transactions on Informa- text: Abstract features for cross-lingual gender pre- tion Systems (TOIS), 14(3):330–347. diction. In Proceedings of the 56th Annual Meet- Pedro A. Fuertes-Olivera. 2007. A corpus-based view ing of the Association for Computational Linguis- of lexical gender in written business english. En- tics (Volume 2: Short Papers), pages 383–389, Mel- glish for Specific Purposes, 26(2):219–234. bourne, Australia. Association for Computational Linguistics. Matt Gardner, Joel Grus, Mark Neumann, Oyvind Tafjord, Pradeep Dasigi, Nelson F. Liu, Matthew Liane Guillou. 2012. Improving pronoun translation Peters, Michael Schmitz, and Luke S. Zettlemoyer. for statistical machine translation. In Proceedings of 2017. AllenNLP: A deep semantic natural language the Student Research Workshop at the 13th Confer- processing platform. arXiv:1803.07640. ence of the European Chapter of the Association for Computational Linguistics, pages 1–10, Avignon, Nikesh Garera and David Yarowsky. 2009. Modeling France. Association for Computational Linguistics. latent biographic attributes in conversational genres. In Proceedings of the Joint Conference of the 47th Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Annual Meeting of the ACL and the 4th International Branham. 2018. Gender recognition or gender re- Joint Conference on Natural Language Processing ductionism?: The social implications of embedded of the AFNLP, pages 710–718, Suntec, Singapore. gender recognition systems. In CHI, page 8. ACM. Association for Computational Linguistics. Sanda M. Harabagiu and Steven J. Maiorano. 1999. Aparna Garimella and Rada Mihalcea. 2016. Zooming Knowledge-lean coreference resolution and its rela- in on gender differences in social media. In Proceed- tion to textual cohesion and coherence. In The Rela- ings of the Workshop on Computational Modeling of tion of Discourse/Dialogue Structure and Reference. People’s Opinions, Personality, and Emotions in So- cial Media (PEOPLES), pages 1–10, Osaka, Japan. Marlis Hellinger and Heiko Motschenbacher. 2015. The COLING 2016 Organizing Committee. Gender across languages, volume 4. John Ben- jamins Publishing Company. Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jennifer Wortman Vaughan, Hanna Wallach, Hal Toma´sˇ Holan, Vladislav Kubon,ˇ and Martin Platek.´ Daume,´ III, and Kate Crawford. 2018. Datasheets 1997. A prototype of a grammar checker for Czech. arXiv:1803.09010 for datasets. . In Fifth Conference on Applied Natural Language Damien Genthial, Jacques Courtin, and Jacques Processing, pages 147–154, Washington, DC, USA. Menezo. 1994. Towards a more user-friendly cor- Association for Computational Linguistics. rection. In COLING 1994 Volume 2: The 15th Inter- national Conference on Computational Linguistics. Levi C. R. Hord. 2016. Bucking the linguistic binary: Gender neutral language in english, swedish, french, Ona de Gibert, Naiara Perez, Aitor Garc´ıa-Pablos, and and german. Montse Cuadros. 2018. Hate speech dataset from a white supremacy forum. In Proceedings of the Dirk Hovy. 2015. Demographic factors improve classi- 2nd Workshop on Abusive Language Online (ALW2), fication performance. In Proceedings of the 53rd An- pages 11–20, Brussels, Belgium. Association for nual Meeting of the Association for Computational Computational Linguistics. Linguistics and the 7th International Joint Confer- ence on Natural Language Processing (Volume 1: GLAAD. 2007. Media reference guide–transgender. Long Papers), pages 752–762, Beijing, China. As- permalink. sociation for Computational Linguistics. Hongyan Jing, Nanda Kambhatla, and Salim Roukos. Bennett Kleinberg, Maximilian Mozes, and Isabelle 2007. Extracting social networks and biographical van der Vegt. 2018. Identifying the sentiment styles facts from conversational speech transcripts. In Pro- of YouTube’s vloggers. In Proceedings of the 2018 ceedings of the 45th Annual Meeting of the Asso- Conference on Empirical Methods in Natural Lan- ciation of Computational Linguistics, pages 1040– guage Processing, pages 3581–3590, Brussels, Bel- 1047, Prague, Czech Republic. Association for Com- gium. Association for Computational Linguistics. putational Linguistics. Dimitrios Kokkinakis, Ann Ighe, and Mats Malm. Anders Johannsen, Dirk Hovy, and Anders Søgaard. 2015. Gender-based vocation identification in 2015. Cross-lingual syntactic variation over age Swedish 19th century prose fiction using linguistic and gender. In Proceedings of the Nineteenth Con- patterns, NER and CRF learning. In Proceedings of ference on Computational Natural Language Learn- the Fourth Workshop on Computational Linguistics ing, pages 103–112, Beijing, China. Association for for Literature, pages 89–97, Denver, Colorado, USA. Computational Linguistics. Association for Computational Linguistics.

Kelly Johnson, Colette Auerswald, Allen J. LeBlanc, Corina Koolen and Andreas van Cranenburgh. 2017. and Walter O. Bockting. 2019. 7. invalidation expe- These are not the stereotypes you are looking for: riences and protective factors among non-binary ado- Bias and fairness in authorial gender attribution. In lescents. Journal of Adolescent Health, 64(2, Sup- Proceedings of the First ACL Workshop on Ethics in plement):S4. Natural Language Processing, pages 12–22, Valen- cia, Spain. Association for Computational Linguis- Megumi Kameyama. 1986. A property-sharing con- tics. straint in centering. In 24th Annual Meeting of the Cheris Kramarae and Paula A. Treichler. 1985. A femi- Association for Computational Linguistics, pages nist dictionary. Pandora Press. 200–206, New York, New York, USA. Association for Computational Linguistics. Matthias Kraus, Johannes Kraus, Martin Baumann, and Wolfgang Minker. 2018. Effects of gender stereo- Sweta Karlekar, Tong Niu, and Mohit Bansal. 2018. types on trust and likability in spoken human-robot Detecting linguistic characteristics of Alzheimer’s interaction. In Proceedings of the Eleventh Interna- dementia by interpreting neural models. In Proceed- tional Conference on Language Resources and Eval- ings of the 2018 Conference of the North Ameri- uation (LREC 2018), Miyazaki, Japan. European can Chapter of the Association for Computational Language Resources Association (ELRA). Linguistics: Human Language Technologies, Vol- ume 2 (Short Papers), pages 701–707, New Orleans, Robin Lakoff. 1975. Language and woman’s place. Louisiana. Association for Computational Linguis- New York ao: Harper and Row. tics. Max Lambert and Melina Packer. 2019. How gendered Matthew Kay, Cynthia Matuszek, and Sean A. Munson. language leads scientists astray. New York Times. 2015. Unequal representation and gender stereo- types in image search results for occupations. In Brian Larson. 2017a. Gender as a variable in natural- CHI. language processing: Ethical considerations. In Pro- ceedings of the First ACL Workshop on Ethics in Suzanne J. Kessler and Wendy McKenna. 1978. Gen- Natural Language Processing, pages 1–11, Valencia, der: An ethnomethodological approach. University Spain. Association for Computational Linguistics. of Chicago Press. Brian N. Larson. 2017b. Gender as a variable in Mike Kestemont. 2014. Function words in authorship natural-language processing: Ethical considerations. ACL Workshop on Ethics in NLP attribution. from black magic to theory? In Proceed- In . ings of the 3rd Workshop on Computational Linguis- Ronan Le Nagard and Philipp Koehn. 2010. Aiding tics for Literature (CLFL), pages 59–66, Gothen- pronoun translation with co-reference resolution. In burg, Sweden. Association for Computational Lin- Proceedings of the Joint Fifth Workshop on Statisti- guistics. cal Machine Translation and MetricsMATR, pages 252–261, Uppsala, Sweden. Association for Compu- Os Keyes. 2018. The misgendering machines: tational Linguistics. Trans/HCI implications of automatic gender recog- nition. CHI. Hector Levesque, Ernest Davis, and Leora Morgen- stern. 2012. The winograd schema challenge. In Svetlana Kiritchenko and Saif Mohammad. 2018. Ex- Thirteenth International Conference on the Princi- amining gender and race bias in two hundred sen- ples of Knowledge Representation and Reasoning. timent analysis systems. In Proceedings of the Seventh Joint Conference on Lexical and Compu- Moshe Levinger, Uzzi Ornan, and Alon Itai. 1995. tational Semantics, pages 43–53, New Orleans, Learning morpho-lexical probabilities from an un- Louisiana. Association for Computational Linguis- tagged corpus with an application to Hebrew. Com- tics. putational Linguistics, 21(3):383–404. Rivka Levitan. 2013. Entrainment in spoken dialogue In Proceedings of the Second Workshop on NLP and systems: Adopting, predicting and influencing user Computational Social Science, pages 1–6, Vancou- behavior. In Proceedings of the 2013 NAACL HLT ver, Canada. Association for Computational Linguis- Student Research Workshop, pages 84–90, Atlanta, tics. Georgia. Association for Computational Linguistics. Veronica´ Lopez-Lude´ na,˜ Ruben´ San-Segundo, Sarah Ita Levitan, Yocheved Levitan, Guozhen An, Syaheerah Lufti, Juan Manuel Lucas-Cuesta, Michelle Levine, Rivka Levitan, Andrew Rosenberg, Julian´ David Echevarry, and Beatriz Mart´ınez- and Julia Hirschberg. 2016. Identifying individual Gonzalez.´ 2011. Source language categorization for differences in gender, ethnicity, and personality from improving a speech into sign language translation dialogue for deception detection. In Proceedings of system. In Proceedings of the Second Workshop the Second Workshop on Computational Approaches on Speech and Language Processing for Assistive to Deception Detection, pages 40–44, San Diego, Technologies, pages 84–93, Edinburgh, Scotland, California. Association for Computational Linguis- UK. Association for Computational Linguistics. tics. John Lyons. 1977. Semantics. Cambridge University Sarah Ita Levitan, Angel Maredia, and Julia Hirschberg. Press. 2018. Linguistic cues to deception and perceived deception in interview dialogues. In Proceedings of Justina Mandravickaite˙ and Tomas Krilavicius.ˇ 2017. the 2018 Conference of the North American Chap- Stylometric analysis of parliamentary speeches: ter of the Association for Computational Linguistics: Gender dimension. In Proceedings of the 6th Work- Human Language Technologies, Volume 1 (Long Pa- shop on Balto-Slavic Natural Language Processing, pers), pages 1941–1950, New Orleans, Louisiana. pages 102–107, Valencia, Spain. Association for Association for Computational Linguistics. Computational Linguistics.

Dingcheng Li, Tim Miller, and William Schuler. 2011. Inderjeet Mani, T. Richard Macmillan, Susann Luper- A pronoun anaphora resolution system based on fac- foy, Elaine Lusher, and Sharon Laskowski. 1993. torial hidden Markov models. In Proceedings of the Identifying unknown proper names in newswire text. 49th Annual Meeting of the Association for Com- In Acquisition of Lexical Knowledge from Text. putational Linguistics: Human Language Technolo- gies, pages 1169–1178, Portland, Oregon, USA. As- Harmony Marchal, Benoˆıt Lemaire, Maryse Bianco, sociation for Computational Linguistics. and Philippe Dessus. 2008. A MDL-based model of gender knowledge acquisition. In CoNLL 2008: Shoushan Li, Bin Dai, Zhengxian Gong, and Guodong Proceedings of the Twelfth Conference on Compu- Zhou. 2016. Semi-supervised gender classification tational Natural Language Learning, pages 73–80, with joint textual and social modeling. In Proceed- Manchester, England. Coling 2008 Organizing Com- ings of COLING 2016, the 26th International Con- mittee. ference on Computational Linguistics: Technical Pa- pers, pages 2092–2100, Osaka, Japan. The COLING David Marecek,ˇ Rudolf Rosa, Petra Galusˇcˇakov´ a,´ and 2016 Organizing Committee. Ondrejˇ Bojar. 2011. Two-step translation with gram- Wen Li and Markus Dickinson. 2017. Gender predic- matical post-processing. In Proceedings of the tion for Chinese social media data. In Proceedings Sixth Workshop on Statistical Machine Translation, of the International Conference Recent Advances in pages 426–432, Edinburgh, Scotland. Association Natural Language Processing, RANLP 2017, pages for Computational Linguistics. 438–445, Varna, Bulgaria. INCOMA Ltd. Matej Martinc and Senja Pollak. 2018. Reusable work- Olga Litvinova, Pavel Seredin, Tatiana Litvinova, and flows for gender prediction. In Proceedings of the John Lyell. 2017. Deception detection in Russian Eleventh International Conference on Language Re- texts. In Proceedings of the Student Research Work- sources and Evaluation (LREC 2018), Miyazaki, shop at the 15th Conference of the European Chap- Japan. European Language Resources Association ter of the Association for Computational Linguistics, (ELRA). pages 43–52, Valencia, Spain. Association for Com- putational Linguistics. Yuval Marton, Nizar Habash, and Owen Rambow. 2010. Improving Arabic dependency parsing with Yuanchao Liu, Ming Liu, Xiaolong Wang, Limin lexical and inflectional morphological features. In Wang, and Jingjing Li. 2013. PAL: A chatterbot Proceedings of the NAACL HLT 2010 First Work- system for answering domain-specific questions. In shop on Statistical Parsing of Morphologically-Rich Proceedings of the 51st Annual Meeting of the As- Languages, pages 13–21, Los Angeles, CA, USA. sociation for Computational Linguistics: System Association for Computational Linguistics. Demonstrations, pages 67–72, Sofia, Bulgaria. As- sociation for Computational Linguistics. Yuval Marton, Nizar Habash, and Owen Rambow. 2013. Dependency parsing of modern standard Ara- Nikola Ljubesiˇ c,´ Darja Fiser,ˇ and Tomazˇ Erjavec. 2017. bic with lexical and inflectional features. Computa- Language-independent gender prediction on twitter. tional Linguistics, 39(1):161–194. Austin Matthews, Waleed Ammar, Archna Bhatia, We- Arjun Mukherjee and Bing Liu. 2010. Improving gen- ston Feely, Greg Hanneman, Eva Schlinger, Swabha der classification of blog authors. In Proceedings Swayamdipta, Yulia Tsvetkov, Alon Lavie, and of the 2010 Conference on Empirical Methods in Chris Dyer. 2014. The CMU machine translation Natural Language Processing, pages 207–217, Cam- systems at WMT 2014. In Proceedings of the Ninth bridge, MA. Association for Computational Linguis- Workshop on Statistical Machine Translation, pages tics. 142–149, Baltimore, Maryland, USA. Association for Computational Linguistics. Smruthi Mukund, Debanjan Ghosh, and Rohini Srihari. 2011. Using sequence kernels to identify opinion en- Chandler May. 2019. Neurips2019. tities in Urdu. In Proceedings of the Fifteenth Con- ference on Computational Natural Language Learn- Kevin A. McLemore. 2015. Experiences with mis- ing, pages 58–67, Portland, Oregon, USA. Associa- gendering: Identity misclassification of transgender tion for Computational Linguistics. spectrum individuals. Self and Identity, 14(1):51– 74. Hidetsugu Nanba, Haruka Taguma, Takahiro Ozaki, Daisuke Kobayashi, Aya Ishino, and Toshiyuki C. S. Mellish. 1988. Implementing systemic classi- Takezawa. 2009. Automatic compilation of travel in- fication by unification. Computational Linguistics, formation from automatically identified travel blogs. 14(1). In Proceedings of the ACL-IJCNLP 2009 Confer- ence Short Papers, pages 205–208, Suntec, Singa- Merriam-Webster. 2016. Words we’re watching: Sin- pore. Association for Computational Linguistics. gular ’they’. permalink. Ajit Narayanan and Lama Hashem. 1993. On abstract Timothee Mickus, Olivier Bonami, and Denis Paperno. finite-state morphology. In Sixth Conference of the 2019. Distributional effects of gender contrasts European Chapter of the Association for Computa- across categories. In Proceedings of the Society for tional Linguistics, Utrecht, The Netherlands. Asso- Computation in Linguistics (SCiL) 2019, pages 174– ciation for Computational Linguistics. 184. Vivi Nastase and Marius Popescu. 2009. What’s in Saif Mohammad, Felipe Bravo-Marquez, Mohammad a name? In some languages, grammatical gender. Salameh, and Svetlana Kiritchenko. 2018. SemEval- In Proceedings of the 2009 Conference on Empiri- 2018 task 1: Affect in tweets. In Proceedings of cal Methods in Natural Language Processing, pages The 12th International Workshop on Semantic Eval- 1368–1377, Singapore. Association for Computa- uation, pages 1–17, New Orleans, Louisiana. Asso- tional Linguistics. ciation for Computational Linguistics. Costanza Navarretta. 2004. An algorithm for resolv- Saif Mohammad and Tony Yang. 2011. Tracking sen- ing individual and abstract anaphora in Danish texts timent in mail: How genders differ on emotional and dialogues. In Proceedings of the Conference axes. In Proceedings of the 2nd Workshop on Com- on Reference Resolution and Its Applications, pages putational Approaches to Subjectivity and Sentiment 95–102, Barcelona, Spain. Association for Compu- Analysis (WASSA 2.011), pages 70–79, Portland, tational Linguistics. Oregon. Association for Computational Linguistics. Vincent Ng. 2010. Supervised noun phrase coreference Sridhar Moorthy, Ruth Pogacar, Samin Khan, and Yang research: The first fifteen years. In Proceedings of Xu. 2018. Is Nike female? exploring the role the 48th Annual Meeting of the Association for Com- of sound symbolism in predicting brand name gen- putational Linguistics, pages 1396–1411, Uppsala, der. In Proceedings of the 2018 Conference on Em- Sweden. Association for Computational Linguistics. pirical Methods in Natural Language Processing, pages 1128–1132, Brussels, Belgium. Association Dong Nguyen, Dolf Trieschnigg, A. Seza Dogru˘ oz,¨ Ri- for Computational Linguistics. lana Gravel, Mariet¨ Theune, Theo Meder, and Fran- ciska de Jong. 2014a. Why gender and age predic- Nafise Moosavi and Michael Strube. 2016. Which tion from tweets is hard: Lessons from a crowd- coreference evaluation metric do you trust? a pro- sourcing experiment. In Proceedings of COLING posal for a link-based entity aware metric. pages 2014, the 25th International Conference on Compu- 632–642. tational Linguistics: Technical Papers, pages 1950– 1961, Dublin, Ireland. Dublin City University and Nafise Sadat Moosavi. 2020. Robustness in Corefer- Association for Computational Linguistics. ence Resolution. PhD dissertation, University of Heidelberg. Dong Nguyen, Dolf Trieschnigg, and Theo Meder. 2014b. TweetGenie: Development, evaluation, and Cristina Mota, Paula Carvalho, and Elisabete Ranch- lessons learned. In Proceedings of COLING 2014, hod. 2004. Multiword lexical acquisition and dic- the 25th International Conference on Computational tionary formalization. In Proceedings of the Work- Linguistics: System Demonstrations, pages 62–66, shop on Enhancing and Using Electronic Dictionar- Dublin, Ireland. Dublin City University and Associ- ies, pages 73–76, Geneva, Switzerland. COLING. ation for Computational Linguistics. Uwe Kjær Nissen. 2002. Aspects of translating gender. 1–9, Athens, Greece. Association for Computational Linguistik online, 11(2):02. Linguistics. ˇ Michal Novak´ and Zdenekˇ Zabokrtsky.´ 2014. Cross- Barbara Plank. 2018. Predicting authorship and au- lingual coreference resolution of pronouns. In Pro- thor traits from keystroke dynamics. In Proceedings ceedings of COLING 2014, the 25th International of the Second Workshop on Computational Mod- Conference on Computational Linguistics: Techni- eling of People’s Opinions, Personality, and Emo- cal Papers, pages 14–24, Dublin, Ireland. Dublin tions in Social Media, pages 98–104, New Orleans, City University and Association for Computational Louisiana, USA. Association for Computational Lin- Linguistics. guistics. Elinor Ochs. 1992. Indexing gender. Rethinking context: Language as an interactive phenomenon, Fred Popowich. 1989. Tree unification grammar. In 11:335. 27th Annual Meeting of the Association for Com- putational Linguistics, pages 228–236, Vancouver, Alexandra Olteanu, Carlos Castillo, Fernando Diaz, British Columbia, Canada. Association for Compu- and Emre Kiciman. 2019. Social data: Bi- tational Linguistics. ases, methodological pitfalls, and ethical boundaries. Frontiers in Big Data, 2:13. Vinodkumar Prabhakaran, Emily E. Reid, and Owen Rambow. 2014. Gender and power: How gen- Naho Orita, Eliana Vornov, Naomi H. Feldman, and der and gender environment affect manifestations of Hal Daume,´ III. 2015. Why discourse affects speak- power. In Proceedings of the 2014 Conference on ers’ choice of referring expressions. In Proceedings Empirical Methods in Natural Language Processing of the Conference of the Association for Computa- (EMNLP), pages 1965–1976, Doha, Qatar. Associa- tional Linguistics (ACL). tion for Computational Linguistics.

Serguei V. Pakhomov, James Buntrock, and Christo- Grusha Prasad and Joanna Morris. 2020. The p600 for pher G. Chute. 2003. Identification of patients singular “they”: How the brain reacts when john de- with congestive heart failure using a binary clas- cides to treat themselves to sushi. PsyArXiv. sifier: A case study. In Proceedings of the ACL 2003 Workshop on Natural Language Processing in Biomedicine, pages 89–96, Sapporo, Japan. Associ- Marcelo Prates, Pedro Avelar, and Luis C. Lamb. 2019. ation for Computational Linguistics. Assessing gender bias in machine translation – a case study with google translate. Neural Computing Ji Ho Park, Jamin Shin, and Pascale Fung. 2018. Re- and Applications. ducing gender bias in abusive language detection. In Proceedings of the 2018 Conference on Em- Daniel Preot¸iuc-Pietro, Johannes Eichstaedt, Gregory pirical Methods in Natural Language Processing, Park, Maarten Sap, Laura Smith, Victoria Tobolsky, pages 2799–2804, Brussels, Belgium. Association H. Andrew Schwartz, and Lyle Ungar. 2015. The for Computational Linguistics. role of personality, age, and gender in tweeting about mental illness. In Proceedings of the 2nd Work- Desmond Patton, Philipp Blandfort, William Frey, shop on Computational Linguistics and Clinical Psy- Michael Gaskell, and Svebor Karaman. 2019. An- chology: From Linguistic Signal to Clinical Real- notating twitter data from vulnerable populations: ity, pages 21–30, Denver, Colorado. Association for Evaluating disagreement between domain experts Computational Linguistics. and graduate student annotators. Peng Qian, Xipeng Qiu, and Xuanjing Huang. 2016. Carlos Perez´ Estruch, Roberto Paredes Palacios, and Investigating language universal and specific prop- Paolo Rosso. 2017. Learning multimodal gender erties in word embeddings. In Proceedings of the profile using neural networks. In Proceedings of the 54th Annual Meeting of the Association for Compu- International Conference Recent Advances in Natu- tational Linguistics (Volume 1: Long Papers), pages ral Language Processing, RANLP 2017, pages 577– 1478–1488, Berlin, Germany. Association for Com- 582, Varna, Bulgaria. INCOMA Ltd. putational Linguistics. Veronica´ Perez-Rosas,´ Quincy Davenport, Anna Meng- dan Dai, Mohamed Abouelenien, and Rada Mihal- J. Joachim Quantz. 1994. An HPSG parser based cea. 2017. Identity deception detection. In Proceed- on description logics. In COLING 1994 Volume ings of the Eighth International Joint Conference on 1: The 15th International Conference on Computa- Natural Language Processing (Volume 1: Long Pa- tional Linguistics. pers), pages 885–894, Taipei, Taiwan. Asian Federa- tion of Natural Language Processing. Chris Quirk and Simon Corston-Oliver. 2006. The im- pact of parse quality on syntactically-informed sta- Wido van Peursen. 2009. How to establish a verbal tistical machine translation. In Proceedings of the paradigm on the basis of ancient Syriac manuscripts. 2006 Conference on Empirical Methods in Natural In Proceedings of the EACL 2009 Workshop on Com- Language Processing, pages 62–69, Sydney, Aus- putational Approaches to Semitic Languages, pages tralia. Association for Computational Linguistics. Ella Rabinovich, Raj Nath Patel, Shachar Mirkin, Lu- Language Technologies, Volume 2 (Short Papers), cia Specia, and Shuly Wintner. 2017. Personal- pages 8–14, New Orleans, Louisiana. Association ized machine translation: Preserving original author for Computational Linguistics. traits. In Proceedings of the 15th Conference of the European Chapter of the Association for Computa- Maarten Sap, Gregory Park, Johannes Eichstaedt, Mar- tional Linguistics: Volume 1, Long Papers, pages garet Kern, David Stillwell, Michal Kosinski, Lyle 1074–1084, Valencia, Spain. Association for Com- Ungar, and Hansen Andrew Schwartz. 2014. Devel- putational Linguistics. oping age and gender predictive lexica over social media. In Proceedings of the 2014 Conference on Karthik Raghunathan, Heeyoung Lee, Sudarshan Ran- Empirical Methods in Natural Language Processing garajan, Nathanael Chambers, Mihai Surdeanu, Dan (EMNLP), pages 1146–1151, Doha, Qatar. Associa- Jurafsky, and Christopher Manning. 2010. A multi- tion for Computational Linguistics. pass sieve for coreference resolution. In EMNLP. Maarten Sap, Marcella Cindy Prasettio, Ari Holtzman, Anil Ramakrishna, Nikolaos Malandrakis, Elizabeth Hannah Rashkin, and Yejin Choi. 2017. Connota- Staruk, and Shrikanth Narayanan. 2015. A quanti- tion frames of power and agency in modern films. tative analysis of gender differences in movies us- In Proceedings of the 2017 Conference on Empiri- ing psycholinguistic normatives. In Proceedings of cal Methods in Natural Language Processing, pages the 2015 Conference on Empirical Methods in Nat- 2329–2334, Copenhagen, Denmark. Association for ural Language Processing, pages 1996–2001, Lis- Computational Linguistics. bon, Portugal. Association for Computational Lin- guistics. Emili Sapena, Llu´ıs Padro,´ and Jordi Turmo. 2011. RelaxCor participation in CoNLL shared task on Sravana Reddy and Kevin Knight. 2016. Obfuscating coreference resolution. In Proceedings of the Fif- gender in social media writing. In Proceedings of teenth Conference on Computational Natural Lan- the First Workshop on NLP and Computational So- guage Learning: Shared Task, pages 35–39, Port- cial Science, pages 17–26, Austin, Texas. Associa- land, Oregon, USA. Association for Computational tion for Computational Linguistics. Linguistics.

Christina Richards, Walter Pierre Bouman, and Meg- Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. John Barker. 2017. Genderqueer and Non-Binary 2011. Gender attribution: Tracing stylometric evi- Genders. Springer. dence beyond topic and genre. In Proceedings of the Fifteenth Conference on Computational Natural Barbara J. Risman. 2009. From doing to undoing: Gen- Language Learning der as we know it. Gender & Society, 23(1). , pages 78–86, Portland, Oregon, USA. Association for Computational Linguistics. Livio Robaldo and Jurij Di Carlo. 2009. Disambiguat- ing quantifier scope in DTS. In Proceedings of the Kristen Schilt and Laurel Westbrook. 2009. Doing Eight International Conference on Computational gender, doing heteronormativity. Gender & Society, Semantics, pages 195–209, Tilburg, The Nether- 23(4). lands. Association for Computational Linguistics. Alexandra Schofield and Leo Mehr. 2016. Gender- Lina Maria Rojas-Barahona, Thierry Bazillon, distinguishing features in film dialogue. In Proceed- Matthieu Quignard, and Fabrice Lefevre. 2011. ings of the Fifth Workshop on Computational Lin- Using MMIL for the high level semantic annotation guistics for Literature, pages 32–39, San Diego, Cal- of the French MEDIA dialogue corpus. In Pro- ifornia, USA. Association for Computational Lin- ceedings of the Ninth International Conference on guistics. Computational Semantics (IWCS 2011). H. Andrew Schwartz, Gregory Park, Maarten Sap, Alexey Romanov, Maria De-Arteaga, Hanna Wal- Evan Weingarten, Johannes Eichstaedt, Margaret lach, Jennifer Chayes, Christian Borgs, Alexandra Kern, David Stillwell, Michal Kosinski, Jonah Chouldechova, Sahin Geyik, Krishnaram Kentha- Berger, Martin Seligman, and Lyle Ungar. 2015. Ex- padi, Anna Rumshisky, and Adam Tauman Kalai. tracting human temporal orientation from Facebook 2019. What’s in a name? reducing bias in bios with- language. In Proceedings of the 2015 Conference of out access to protected attributes. In NAACL. the North American Chapter of the Association for Computational Linguistics: Human Language Tech- Rachel Rudinger, Chandler May, and Benjamin nologies, pages 409–419, Denver, Colorado. Associ- Van Durme. 2017. Social bias in elicited natural lan- ation for Computational Linguistics. guage inferences. In ACL Workshop on Ethics in NLP, pages 74–79. Julia Serano. 2007. Whipping Girl: A Woman on Sexism and the Scapegoating of Feminin- Rachel Rudinger, Jason Naradowsky, Brian Leonard, ity. Seal Press. and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of the 2018 Candace L. Sidner. 1981. Focusing for interpretation Conference of the North American Chapter of the of pronouns. American Journal of Computational Association for Computational Linguistics: Human Linguistics, 7(4):217–231. Maxim Sidorov, Stefan Ultes, and Alexander Schmitt. Susan Stryker. 2008. . Seal Press. 2014. Comparison of gender- and speaker-adaptive emotion recognition. In Proceedings of the Ninth In- Latanya Sweeney. 2013. Discrimination in online ad ternational Conference on Language Resources and delivery. ACM Queue. Evaluation (LREC’14), pages 3476–3480, Reyk- javik, Iceland. European Language Resources Asso- Marko Tadic´ and Sanja Fulgosi. 2003. Building ciation (ELRA). the Croatian morphological lexicon. In Proceed- ings of the 2003 EACL Workshop on Morphologi- Michael Silverstein. 1979. Language structure and lin- cal Processing of Slavic Languages, pages 41–45, guistic ideology. The elements: A parasession on Budapest, Hungary. Association for Computational linguistic units and levels, pages 193–247. Linguistics.

Noah A. Smith, David A. Smith, and Roy W. Tromble. Tomoki Taniguchi, Shigeyuki Sakaki, Ryosuke Shige- 2005. Context-based morphological disambiguation naka, Yukihiro Tsuboshita, and Tomoko Ohkuma. with random fields. In Proceedings of Human Lan- 2015. A weighted combination of text and image guage Technology Conference and Conference on classifiers for user gender inference. In Proceed- Empirical Methods in Natural Language Process- ings of the Fourth Workshop on Vision and Lan- ing, pages 475–482, Vancouver, British Columbia, guage, pages 87–93, Lisbon, Portugal. Association Canada. Association for Computational Linguistics. for Computational Linguistics.

Juan Soler-Company and Leo Wanner. 2014. How Rachael Tatman. 2017. Gender and dialect bias in to use less features and reach better performance YouTube’s automatic captions. In Proceedings of in author gender identification. In Proceedings of the First ACL Workshop on Ethics in Natural Lan- the Ninth International Conference on Language guage Processing, pages 53–59, Valencia, Spain. As- Resources and Evaluation (LREC’14), pages 1315– sociation for Computational Linguistics. 1319, Reykjavik, Iceland. European Language Re- sources Association (ELRA). Trang Tran and Mari Ostendorf. 2016. Characterizing the language of online communities and its relation Juan Soler-Company and Leo Wanner. 2017. On the to community reception. In Proceedings of the 2016 relevance of syntactic and discourse features for au- Conference on Empirical Methods in Natural Lan- thor profiling and identification. In Proceedings of guage Processing, pages 1030–1035, Austin, Texas. the 15th Conference of the European Chapter of the Association for Computational Linguistics. Association for Computational Linguistics: Volume 2, Short Papers, pages 681–687, Valencia, Spain. As- National Center for Transgender Equality. 2017. Un- sociation for Computational Linguistics. derstanding drag. Blog post. Danny Soloman and Mary McGee Wood. 1994. Learn- Ashwini Vaidya, Owen Rambow, and Martha Palmer. ing a radically lexical grammar. In The Balanc- 2014. Light verb constructions with ‘do’ and ‘be’ in ing Act: Combining Symbolic and Statistical Ap- Hindi: A TAG analysis. In Proceedings of Work- proaches to Language . shop on Lexical and Grammatical Resources for Michael Spivak. 1997. The Joy of TEX: A Gourmet Language Processing, pages 127–136, Dublin, Ire- Guide to Typesetting with the AMS-TEX Macro Pack- land. Association for Computational Linguistics and age, 1st edition. American Mathematical Society, Dublin City University. USA. Eva Vanmassenhove, Christian Hardmeier, and Andy E. Stanley. 2014. Gender self-determination. TSQ: Way. 2018. Getting gender right in neural machine Transgender Studies Quarterly, 1:89–91. translation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Process- Ian Stewart. 2014. Now we stronger than ever: African- ing, pages 3003–3008, Brussels, Belgium. Associa- American English syntax in twitter. In Proceed- tion for Computational Linguistics. ings of the Student Research Workshop at the 14th Conference of the European Chapter of the Asso- Ben Verhoeven, Iza Skrjanec,ˇ and Senja Pollak. 2017. ciation for Computational Linguistics, pages 31– Gender profiling for Slovene twitter communication: 37, Gothenburg, Sweden. Association for Computa- the influence of gender marking, content and style. tional Linguistics. In Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing, pages 119–125, Va- Oliver Streiter, Leonhard Voltmer, and Yoann Goudin. lencia, Spain. Association for Computational Lin- 2007. From tombstones to corpora: TSML for re- guistics. search on language, culture, identity and gender dif- ferences. In Proceedings of the 21st Pacific Asia Adam Vogel and Dan Jurafsky. 2012. He said, she said: Conference on Language, Information and Compu- Gender in the ACL anthology. In Proceedings of tation, pages 450–458, Seoul National University, the ACL-2012 Special Workshop on Rediscovering Seoul, Korea. The Korean Society for Language and 50 Years of Discoveries, pages 33–41, Jeju Island, Information (KSLI). Korea. Association for Computational Linguistics. Thurid Vogt and Elisabeth Andre.´ 2006. Improving Bei Yu. 2012. Function words for Chinese authorship automatic emotion recognition from speech via gen- attribution. In Proceedings of the NAACL-HLT 2012 der differentiaion. In Proceedings of the Fifth In- Workshop on Computational Linguistics for Litera- ternational Conference on Language Resources and ture, pages 45–53, Montreal,´ Canada. Association Evaluation (LREC’06), Genoa, Italy. European Lan- for Computational Linguistics. guage Resources Association (ELRA). Wajdi Zaghouani and Anis Charfi. 2018. Arap-tweet: Svitlana Volkova, Theresa Wilson, and David A large multi-dialect twitter corpus for gender, age Yarowsky. 2013. Exploring demographic lan- and language variety identification. In Proceed- guage variations to improve multilingual sentiment ings of the Eleventh International Conference on analysis in social media. In Proceedings of the Language Resources and Evaluation (LREC 2018), 2013 Conference on Empirical Methods in Natural Miyazaki, Japan. European Language Resources As- Language Processing, pages 1815–1827, Seattle, sociation (ELRA). Washington, USA. Association for Computational Linguistics. Dong Zhang, Shoushan Li, Hongling Wang, and Guodong Zhou. 2016. User classification with mul- Mario Wandruszka. 1969. Sprachen: vergleichbar und tiple textual perspectives. In Proceedings of COL- unvergleichlich. R. Piper & Company. ING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, pages Zijian Wang and David Jurgens. 2018. It’s going to be 2112–2121, Osaka, Japan. The COLING 2016 Orga- okay: Measuring access to support in online com- nizing Committee. munities. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Process- Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- ing, pages 33–45, Brussels, Belgium. Association donez, and Kai-Wei Chang. 2017. Men also like for Computational Linguistics. shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of the 2017 Kellie Webster, Marta Recasens, Vera Axelrod, and Ja- Conference on Empirical Methods in Natural Lan- son Baldridge. 2018. Mind the GAP: A balanced guage Processing, pages 2979–2989, Copenhagen, corpus of gendered ambiguous pronouns. Transac- Denmark. Association for Computational Linguis- tions of the Association for Computational Linguis- tics. tics, 6:605–617. Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Or- Marion Weller, Alexander Fraser, and Sabine donez, and Kai-Wei Chang. 2018a. Gender bias Schulte im Walde. 2013. Using subcategoriza- in coreference resolution: Evaluation and debiasing tion knowledge to improve case prediction for methods. In Proceedings of the 2018 Conference translation to German. In Proceedings of the of the North American Chapter of the Association 51st Annual Meeting of the Association for Com- for Computational Linguistics: Human Language putational Linguistics (Volume 1: Long Papers), Technologies, Volume 2 (Short Papers), pages 15–20, pages 593–603, Sofia, Bulgaria. Association for New Orleans, Louisiana. Association for Computa- Computational Linguistics. tional Linguistics.

Candace West and Don H. Zimmerman. 1987. Doing Jieyu Zhao, Yichao Zhou, Zeyu Li, Wei Wang, and Kai- gender. Gender & society, 1(2):125–151. Wei Chang. 2018b. Learning gender-neutral word Thomas Wolf. 2017. State-of-the-art neural corefer- embeddings. In Proceedings of the 2018 Conference ence resolution for chatbots. Blog post. on Empirical Methods in Natural Language Process- ing, pages 4847–4853, Brussels, Belgium. Associa- Zach Wood-Doughty, Nicholas Andrews, Rebecca tion for Computational Linguistics. Marvin, and Mark Dredze. 2018. Predicting twit- ter user demographics from names alone. In Pro- Lal Zimman. 2019. Trans self-identification and the ceedings of the Second Workshop on Computational language of neoliberal selfhood: Agency, power, Modeling of People’s Opinions, Personality, and and the limits of monologic discourse. International Emotions in Social Media, pages 105–111, New Or- Journal of the Sociology of Language, 2019:147– leans, Louisiana, USA. Association for Computa- 175. tional Linguistics. Ran Zmigrod, Sebastian J. Mielke, Hanna Wallach, Kei Yoshimoto. 1988. Identifying zero pronouns in and Ryan Cotterell. 2019. Counterfactual data Japanese dialogue. In Coling Budapest 1988 Volume augmentation for mitigating gender stereotypes in 2: International Conference on Computational Lin- languages with rich morphology. arXiv preprint guistics. arXiv:1906.04571. Meg Young, Lassana Magassa, and Batya Friedman. Michael Zock, Gil Francopoulo, and Abdellatif Laroui. 2019. Toward inclusive tech policy design: A 1988. Language learning as problem solving. In method for underrepresented voices to strengthen Coling Budapest 1988 Volume 2: International Con- tech policy documents. Ethics and Information ference on Computational Linguistics. Technology. A Examples of Possible Bias in Data Annotation Bias can enter coreference resolution datasets, which we use to train our systems, through annotation phase. Annotators may use linguistic notions to infer social gender. For instance, consider (2) below, in which an annotator is likely to determine that “her” refers to “Mary” and not “John” due to assumptions on likely ways that names may map to pronouns (or possibly by not considering that SHE pronouns could refer to someone named “John”). While in (3), an annotator is likely to have difficulty making a determination because both “Sue” and “Mary” suggest “her”. In (4), an annotator lacking knowledge of name stereotypes on typical Chinese and Indian names (plus the fact that given names in Chinese — especially when romanized —generally do not signal gender strongly), respectively, will likewise have difficulty.

(2) John and Mary visited her mother.

(3) Sue and Mary visited her mother.

(4) Liang and Aditya visited her mother. In all these cases, the plausible rough inference is that a reader takes a name, uses it to infer the social gender of the extra-linguistic referent. Later the reader sees the SHE pronoun, infers the referential gender of that pronoun, and checks to see if they match. An equivalent inference happens not just for names, but also for lexical gender references (both gendered nouns (5) and terms of address (6)), grammatical gender references (in gender languages like Arabic (7)), and social gender references (8). The last of these ((8)) is the case in which the correct referent is likely to be least clear to most annotators, and also the case studied by Rudinger et al.(2018) and Zhao et al.(2018a).

(5) My brother and niece visited her mother.

(6) Mr. Hashimoto and Mrs. Iwu visited her mother.

المطرب و الممثلة شاهدا والدتها (7) walidatu -ha shahidanaan walidatuha w almutarab mother -her saw actor[FEM] and singer[MASC] The singer[MASC] and actor[FEM] saw her mother.

walidatuha shahadaa almomathela w almutreb (8) Theher nursemother and the saw actor visited actor[f]her mother. and singer[m]

B Annotation of ACL Anthology Papers Below we list the complete set of annotations we did of the papers described in §5.1. For each of the papers considered, we annotate the following items: • Coref: Does the paper discuss coreference resolution? • L.G: Does the paper deal with linguistic gender (grammatical gender or gendered pronouns)? • S.G: Does the paper deal with social gender? • Eng: Does the paper study English? •L6=G: (If yes to L.G and S.G:) Does the paper distinguish linguistic from social gender? • 0/1: (If yes to S.G:) Does the paper explicitly or implicitly assume that social gender is binary? • Imm: (If yes to S.G:) Does the paper explicitly or implicitly assume social gender is immutable? • Neo: (If yes to S.G and to English:) Does the paper explicitly consider uses of definite singular “they” or neopronouns? For each of these, we mark with [Y] if the answer is yes, [N] if the answer is no, and [-] if this question is not applicable (ie it doesn’t pass the conditional checks).

Citation Coref L.G S.G Eng L6=S 0/1 Imm Neo Sidner(1981)YYYYN--- Bainbridge(1985)YYNY---- Kameyama(1986)YYYYNYYN Mellish(1988)NYNY---- Danlos and Namer(1988)NYNN---- Yoshimoto(1988)NYNN---- Zock et al.(1988)NYNN---- Popowich(1989)NYNY---- Mani et al.(1993)YNYY-Y-- Narayanan and Hashem(1993)NYNN---- Soloman and Wood(1994)NYNY---- Quantz(1994)NYNY---- Baker et al.(1994)------Genthial et al.(1994)NYNN---- Levinger et al.(1995)NYNN---- Holan et al.(1997)NYNN---- Dorna et al.(1998)NNNY---- Harabagiu and Maiorano(1999)YYYYNYYN Avgustinova and Uszkoreit(2000)NYNN---- Channarukul et al.(2000)NYNY---- Abuleil et al.(2002)NYNN---- Cucerzan and Yarowsky(2003)NYNN---- Pakhomov et al.(2003)NNYY---- Tadic´ and Fulgosi(2003)NYNN---- Debowski(2003)NYNN---- Navarretta(2004)YYYNNYY- Carl et al.(2004)YYYNNYY- Mota et al.(2004)NYNY---- Eisner and Karakos(2005)NYNY---- Boulis and Ostendorf(2005)NNYY-YYN Smith et al.(2005)NYNN---- Bergsma and Lin(2006)YYYYNYYN Vogt and Andre´(2006)NNYN-YY- Quirk and Corston-Oliver(2006)NYNY---- Dada(2007)NYNN---- Citation Coref L.G S.G Eng L6=S 0/1 Imm Neo Streiter et al.(2007)NNYN---- Jing et al.(2007)YYYYNY-N Badr et al.(2008)NYNN---- Marchal et al.(2008)NYNN---- van Peursen(2009)NYNN---- Badr et al.(2009)NYNN---- Garera and Yarowsky(2009)NYYYNYYN Bergsma et al.(2009)YYYYNYYN Nastase and Popescu(2009)NYNN---- Nanba et al.(2009)NNNY---- Robaldo and Di Carlo(2009)NNNY---- Mukherjee and Liu(2010)NNYY-YY- Ng(2010)YYYYNYYN Burkhardt et al.(2010)NNYN-YY- Marton et al.(2010)NYNN---- Le Nagard and Koehn(2010)YYYYNYYN Rojas-Barahona et al.(2011)NYNN---- Mukund et al.(2011)NYNN---- Sarawgi et al.(2011)NNYY-YYN Li et al.(2011)YYYYNYYN Burger et al.(2011)NNYY-YYN Mohammad and Yang(2011)NNYY-YYN Sapena et al.(2011)YYYYNYYN Charton and Gagnon(2011)YYYYNYYN Alkuhlani and Habash(2011)NYNN---- Marecekˇ et al.(2011)NYNN---- Lopez-Lude´ na˜ et al.(2011)NYNN---- Declerck et al.(2012)YYNY---- Bergsma et al.(2012)NNYY-YYN Alkuhlani and Habash(2012)NYNN---- Filippova(2012)NNYY-Y-- Dinu et al.(2012)NYNN---- El Kholy and Habash(2012)NYNN---- Yu(2012)NNNN---- Guillou(2012)YYYYYY-- Vogel and Jurafsky(2012)NNYY-YYN Goldberg and Elhadad(2013)NYNN---- Marton et al.(2013)NYNN---- Weller et al.(2013)NYNY---- Ciot et al.(2013)NNYN-YY- Volkova et al.(2013)NNYY-YYN Levitan(2013)NNYY-NNN Bojar et al.(2013)NYNN---- Glavasˇ et al.(2013)NYNN---- Liu et al.(2013)NNNN---- Kestemont(2014)NNNY---- Novak´ and Zabokrtskˇ y´(2014)YYNY---- Babych et al.(2014)NYNN---- Soler-Company and Wanner(2014)NNYY-YYN Chen and Ng(2014)YYYYNYYN Citation Coref L.G S.G Eng L6=S 0/1 Imm Neo Sap et al.(2014)NNYY-YY- Nguyen et al.(2014a)NNYY-YYN Prabhakaran et al.(2014)NNYY-YYN Sidorov et al.(2014)NNYY-YYN Darwish et al.(2014)NYNN---- Ahmed Khan(2014)NYNN---- Nguyen et al.(2014b)NNYN-YY- Stewart(2014)NNYY-YY- Matthews et al.(2014)NYNN---- Vaidya et al.(2014)NYNN---- Kokkinakis et al.(2015)NYYNNY-- Johannsen et al.(2015)NNYY-YY- Schwartz et al.(2015)NNNY---- Hovy(2015)NNYY-YYN Agarwal et al.(2015)NYYYNYYN Preot¸iuc-Pietro et al.(2015)NNYYNYY- Ramakrishna et al.(2015)NYYYNYYN Taniguchi et al.(2015)NNYY-NYN Schofield and Mehr(2016)NNYY-YYN Levitan et al.(2016)NNYY-YYN Flekova et al.(2016)NNYY-YYN Tran and Ostendorf(2016)NNNY---- Qian et al.(2016)NYNY---- Li et al.(2016)NNYY-YYN Zhang et al.(2016)NNYY-YYN Garimella and Mihalcea(2016)NNYY-YYN Reddy and Knight(2016)NNYY-YYN Li and Dickinson(2017)NNYN-YY- Perez´ Estruch et al.(2017)NNYY-YYN Perez-Rosas´ et al.(2017)NNYY-YYN Rabinovich et al.(2017)NNYN-YY- Costa-jussa`(2017)NYNN---- Sap et al.(2017)NNYY-Y-- Zhao et al.(2017)NNYY-YYN Mandravickaite˙ and Krilaviciusˇ (2017)NNYY-YYN Verhoeven et al.(2017)NNYY-YYN Larson(2017a)NYYYYNNY Koolen and van Cranenburgh(2017)NNYN-NY- Tatman(2017)NNYY-YYN Soler-Company and Wanner(2017)NNYY-YYN Ljubesiˇ c´ et al.(2017)NNYN-YY- Litvinova et al.(2017)NNYN-YY- Mohammad et al.(2018)NNYY-Y-- Wang and Jurgens(2018)NYYYYNNN Kraus et al.(2018)NNYY-Y-- Martinc and Pollak(2018)NNYY-YYN Chan and Fyshe(2018)NNYY-YYN Durmus and Cardie(2018)NNNY---- Zaghouani and Charfi(2018)NYYNNYY- Plank(2018)NNYY-YYN Citation Coref L.G S.G Eng L6=S 0/1 Imm Neo Wood-Doughty et al.(2018)NNYY-YYN Moorthy et al.(2018)NNYY-Y-- Levitan et al.(2018)NNYY-YYN Webster et al.(2018)YYYYNYYN Park et al.(2018)NYYYNYYN Vanmassenhove et al.(2018)NYYNNYY- Kleinberg et al.(2018)NNYY-YYN Zhao et al.(2018b)NNYY-YYN Balusu et al.(2018)NNNY---- Rudinger et al.(2018)YYYYNN-Y Zhao et al.(2018a)YYYYNYYN Kiritchenko and Mohammad(2018)------Barbieri and Camacho-Collados(2018)NNYY-YN- van der Goot et al.(2018)NNYN-YY- Karlekar et al.(2018)NNYY-YYN de Gibert et al.(2018)NNNY---- Mickus et al.(2019)NYNN---- C Example GICoref Document from Wikipedia: Dana Zzyym

[[Source: https://en.wikipedia.org/wiki/Dana_Zzyym]]

Dana Alix ZzyymA is an Intersex activist and former sailor who was the first military veteran in the United States to seek a non - binary gender U.S. passport , in a lawsuit ZzyymA v. PompeoC .

Early life ZzyymA has expressed that theirA childhood as a military brat made it out of the question for themA to be associated with the queer community as a youth due to the prevalence of homophobia in the armed forces . TheirA parentsB hid ZzyymA ’s status as intersex from themA and ZzyymA discovered theirA identity and the surgeries theirA parentsB had approved for themA by themselvesB after theirA Navy service . In 1978 , ZzyymA joined the Navy as a machinist ’s mate .

Activism ZzyymA has been an avid supporter of the Intersex Campaign for Equality .

Legal case

ZzyymA is the first veteran to seek a non - binary gender U.S. passport . In light of the State Department ’s continuing refusal to recognize an appropriate gender marker , on June 27 , 2017 a federal court granted Lambda Legal ’s motion to reopen the case . On September 19 , 2018 , the United States District Court for the District of Colorado enjoined the U.S. Department of State from relying upon its binary - only gender marker policy to withhold the requested passport . D Example GICoref Document from AO3: Scar Tissue

[[Source: https://archiveofourown.org/works/14476524]] [[Author: cornheck]]

Despite dreading theirA first true series of final exams , CronaA ’s relieved to have a particularly absorbative memory , lucky to recall all the material theyA ’d been required to catch up on . Half a semester of attendance , a whole year of course content . The only true moment of discomfort came when theyA ’d arrived at the essay portion . Thankful it was easy enough to answer , however , theirA subtle eye - roll stemmed entirely from just how much writing it asked of themA , hands already beginning to ache at the thought of scrawling out two pages on the origins , history , and importance of partnered and grouped soul resonance . By the end of it all , theirA neck , wrist , back , and ribs ached from the strain of theirA typical , hunched posture – a habit theyA defaulted to , and Miss MarieB silently wished theyA ’d be more mindful of . It was a relief , at least to themA , not to be the last one out of the lecture hall . Booklet turned in , theyA left the room as quietly as possible and lingered just outside , an air of hesitance settling upon themA as theyA considered what to do now that , it seemed , everything was over with . No more class , no more lessons , just ... students on break from their studies for the season . “ Kind of a breeze , was n’t it ? ” EvansC ’ voice echoes in the arched hall and CronaA ’s shoulders jump , theirA frame still a tense and anxious mess . “ Oh , ” theyA sigh , “ IA ... IA suppose so . It was n’t ... necessarily hard . ” CronaA answers , putting forth a vaguely forced smile . Smiling with the assumed purpose of making SoulC comfortable with the interaction . A defense mechanism . “ IA - IA guess , for a final , it was easier than IA expected ... everyone ... made it sound like it ’d be difficult . ” “ If by everyone , youA mean Black StarD , then yeah , ” SoulC chuckles , “ heD does n’t really do well on ‘ em ... bad test - taker . ” “ Ah , ” theirA facade falls just in time to be replaced by a much more genuine grin . Of the little theyA ’d spent talking to Black StarD , heD certainly had confidence and skill enough to make up for the lost exam points given hisD performance in every other grading category . “ That ... makes sense . ” “ MakaE ’s always the first one done when it comes to this stuff , sheE practically studies in herE sleep . IC ’m convinced sheE must be practicing clairvoyance the way sheE burns through essay questions , ” SoulC laughs , turning to the meek teenA who gives himC a simple nod in response . Determined not to let an impending awkward silence fall between themF , SoulC pipes up again , “ So , are youA staying here for break ? ” “ Ye - well , IA ... IA think so , ” theyA begin , stuttering , but encouraged to continue by a cock of SoulC ’s head ; a social cue even theyA could read , “ The professorH ... and Miss MarieB G asked if IA ’d like to come and stay with themG for the time being . ” “ Oh , huh , SteinH and MarieB G ? Nice , ” hisC brows lift , clearly some varying degree of happy for the otherA . The optimism is short - lived , observing as CronaA ’s expression falls back to its characteristic expressionless gaze . “ It seems like youA ’ve got a good thing going with those twoG .” “ IA have n’t decided , yet , if IA should accept the invitation , ” theyA shift a bit where theyA stand . Never having been the best at reassuring others , even hisC own meisterA , SoulC kept hisC mouth shut to avoid stuttering while heC searched for the right words a web of thoughts . “ Y’A know , IC think it ’s less of an invitation and more of an extended welcome . ” The otherA raises theirA head , taken aback , “ Oh , ” CronaA mutters , in a poignant tone , “ IA ... never considered something like that . ” SoulC does n’t leave much wiggle room for theirA mood to fall any further ( nothing past a flat - lipped frown ) , “ TheyG ’d probably love to have youA , IC bet theyG drive each other nuts sometimes all by themselvesG .” Though EvansC wo n’t admit it , heC knows it ’s all too likely SteinH might actually put some more effort into taking care of himselfH if heH had someone else besides MarieB to look after . “ IA - IA see , ” theyA exhale with a nod , giving SoulC a hint of affirmation that heC ’d done something to boost the kidA ’s confidence . “ IC mean , it ’s got ta be lonely not to mention boring hanging here all summer ... and the weather , ” SoulC nearly gasps , dramatizing it for added effect , “ Oh , man , IC do n’t know how youA can stay cooped up in that room of yoursA when it ’s so nice out , ” heC grins . “ But ... meh . Different strokes . IC ca n’t judge . ” HisC comments comfort themA , an for a moment theyA forget how this came to be . The cathedral in Italy , Lady MedusaI ’s wrath , and the black blood that infected himC . Every moment theyA spent in the presence of Soul EvansC builds always up to this ; fixation on the memories of theirJ first encounters and all the pain theyA ’ve caused himC , the pain theyA ’ve caused heC and MakaE K both . As quickly as SoulC had lifted the swordsmanA ’s spirits , theyA ’d weighed themselvesA down once more . It seemed so normal , though . SoulC could n’t bring himselfC to feel any sense of accomplishment in the coaxing - out of CronaA ’s smile when the return of theirA self doubt was as certain as the sun in the sky . HisC own stubbornness could n’t let hisC diminished self worth lie . With another encouraging smile , rows of sharpened incisors appearing oddly charismatic , heC opens hisC mouth to speak – but finds himselfC cut off before heC can even squeeze a word in . “ SoulC , IA ’m sorry , ” the meisterA blurts . Having been pent - up for months , the apology comes forth without inhibition , rolling effortlessly off theirA tongue . “ Sorry ... ? For what ? ” EvansC quirks a brow , chuckling . HeC adjusts hisC stance to face CronaA with the whole of hisC body , maintaining hisC positive demeanor . “ F - for what ... ? ” TheyA stammer , shaking theirA head . For all theirA remorse , theyA thought this would have been obvious . “ For everything , it ’s ... the first time weF dueled , IA was the enemy ! IA - IA almost killed youC , IA - IA ... IA really , really hurt youC ,” theyA answer , still so sick with guild that even theirA confession of responsibility is tainted with frustration . SoulC seems stunned for a moment before harnessing hisC quick wit . “ Hey , now , youA ca n’t take all the credit like that , RagnarokL did most of the damage , ” heC ...