Mind the GAP: A Balanced Corpus of Gendered Ambiguous Pronouns

Kellie Webster, Marta Recasens, Vera Axelrod and Jason Baldridge
Google AI Language
{websterk|recasens|vaxelrod|jasonbaldridge}@google.com

Abstract

Coreference resolution is an important task for natural language understanding, and the resolution of ambiguous pronouns a long-standing challenge. Nonetheless, existing corpora do not capture ambiguous pronouns in sufficient volume or diversity to accurately indicate the practical utility of models. Furthermore, we find gender bias in existing corpora and systems favoring masculine entities. To address this, we present and release GAP, a gender-balanced labeled corpus of 8,908 ambiguous pronoun-name pairs sampled to provide diverse coverage of challenges posed by real-world text. We explore a range of baselines which demonstrate the complexity of the challenge, the best achieving just 66.9% F1. We show that syntactic structure and continuous neural models provide promising, complementary cues for approaching the challenge.

1 Introduction

Coreference resolution involves linking referring expressions that evoke the same entity, as defined in shared tasks such as CoNLL 2011/12 (Pradhan et al., 2012) and MUC (Grishman and Sundheim, 1996). Unfortunately, high scores on these tasks do not necessarily translate into acceptable performance for downstream applications such as machine translation (Guillou, 2012) and fact extraction (Nakayama, 2008). In particular, high-scoring systems successfully identify coreference relationships between string-matching proper names, but fare worse on anaphoric mentions such as pronouns and common noun phrases (Stoyanov et al., 2009; Rahman and Ng, 2012; Durrett and Klein, 2013).

We consider the problem of resolving gendered ambiguous pronouns in English, such as she [1] in:

(1) In May, Fujisawa joined Mari Motohashi's rink as the team's skip, moving back from Karuizawa to Kitami where she had spent her junior days.

[1] The examples throughout the paper highlight the ambiguous pronoun in bold, the two potential coreferent names in italics, and the correct one also underlined.

With this work, we make three key contributions:

• We design an extensible, language-independent mechanism for extracting challenging ambiguous pronouns from text.
• We build and release GAP, a human-labeled corpus of 8,908 ambiguous pronoun-name pairs derived from Wikipedia. [2] This dataset targets the challenges of resolving naturally-occurring ambiguous pronouns and rewards systems which are gender-fair.
• We run four state-of-the-art coreference resolvers and several competitive simple baselines on GAP to understand limitations in current modeling, including gender bias. We find that syntactic structure and Transformer models (Vaswani et al., 2017) provide promising, complementary cues for approaching GAP.

[2] http://goo.gl/language/gap-coreference

Coreference resolution decisions can drastically alter how automatic systems process text. Biases in automatic systems have caused a wide range of underrepresented groups to be served in an inequitable way by downstream applications (Hardt, 2014). We take the construction of the new GAP corpus as an opportunity to reduce gender bias in coreference datasets; in this way, GAP can promote equitable modeling of coreference phenomena complementary to the recent work of Zhao et al. (2018) and Rudinger et al. (2018). Such approaches promise to improve equity of downstream models, such as triple extraction for knowledge base population.
2 Background

Existing datasets do not capture ambiguous pronouns in sufficient volume or diversity to benchmark systems for practical applications.

2.1 Datasets with Ambiguous Pronouns

Winograd schemas (Levesque et al., 2012) are closely related to our work as they contain ambiguous pronouns. They are pairs of short texts with an ambiguous pronoun and a special word (in square brackets) that switches its referent:

(2) The trophy would not fit in the brown suitcase because it was too [big/small].

The Definite Pronoun Resolution Dataset (Rahman and Ng, 2012) comprises 943 Winograd schemas written by undergraduate students and later extended by Peng et al. (2015). The First Winograd Schema Challenge (Morgenstern et al., 2016) released 60 examples adapted from published literary works (Pronoun Disambiguation Problem) [3] and 285 manually constructed schemas (Winograd Schema Challenge) [4]. More recently, Rudinger et al. (2018) and Zhao et al. (2018) have created two Winograd schema-style datasets containing 720 and 3160 sentences, respectively, where each sentence contains a gendered pronoun and two occupation (or participant) antecedent candidates that break occupational gender stereotypes. Overall, ambiguous pronoun datasets have been limited in size and, most notably, consist only of manually constructed examples which do not necessarily reflect the challenges faced by systems in the wild.

[3] https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/PDPChallenge2016.xml
[4] https://cs.nyu.edu/faculty/davise/papers/WinogradSchemas/WSCollection.xml

In contrast, the largest and most widely-used coreference corpus, OntoNotes (Pradhan et al., 2007), is general purpose. In OntoNotes, simpler and high-frequency coreference examples (e.g. those captured by string matching) greatly outnumber examples of ambiguous pronouns, which obscures performance results on that key class (Stoyanov et al., 2009; Rahman and Ng, 2012). Ambiguous pronouns greatly impact main entity resolution in Wikipedia, the focus of Ghaddar and Langlais (2016a), who use WikiCoref, a corpus of 30 full articles annotated with coreference (Ghaddar and Langlais, 2016b).

GAP examples are not strictly Winograd schemas because they have no reference-flipping word. Nonetheless, they contain two person named entities of the same gender and an ambiguous pronoun that may refer to either (or neither). As such, they represent a similarly difficult challenge and require the same inferential capabilities. More importantly, GAP is larger than existing Winograd schema datasets and the examples are from naturally occurring Wikipedia text. GAP complements OntoNotes by providing an extensive targeted dataset of naturally occurring ambiguous pronouns.

2.2 Modeling Ambiguous Pronouns

State-of-the-art coreference systems struggle to resolve ambiguous pronouns that require world knowledge and commonsense reasoning (Durrett and Klein, 2013). Past efforts have tried to mine semantic preferences and inferential knowledge via predicate-argument statistics mined from corpora (Dagan and Itai, 1990; Yang et al., 2005), semantic roles (Kehler et al., 2004; Ponzetto and Strube, 2006), contextual compatibility features (Liao and Grishman, 2010; Bansal and Klein, 2012), and event role sequences (Bean and Riloff, 2004; Chambers and Jurafsky, 2008). These usually bring small improvements in general coreference datasets and larger improvements in targeted Winograd datasets.

Rahman and Ng (2012) scored 73.05% precision on their Winograd dataset after incorporating targeted features such as narrative chains, Web-based counts, and selectional preferences. Peng et al. (2015)'s system improved the state of the art to 76.41% by acquiring ⟨subject, verb, object⟩ and ⟨subject/object, verb, verb⟩ knowledge triples.

In the First Winograd Schema Challenge (Morgenstern et al., 2016), participants used methods ranging from logical axioms and inference to neural network architectures enhanced with commonsense knowledge (Liu et al., 2017), but no system qualified for the second round. Recently, Trinh and Le (2018) have achieved the best results on the Pronoun Disambiguation Problem and Winograd Schema Challenge datasets, achieving 70% and 63.7%, respectively, which are 3% and 11% above Liu et al.'s (2017) previous state of the art. Their model is an ensemble of word-level and character-level recurrent language models, which, despite not being trained on coreference data, encode commonsense as part of the more general language modeling task. It is unclear how these systems perform on naturally-occurring ambiguous pronouns. For example, Trinh and Le's (2018) system relies on choosing a candidate from a pre-specified list, and it would need to be extended to handle the case that the pronoun does not corefer with any given candidate. By releasing GAP, we aim to foster research in this direction, and set several competitive baselines without using targeted resources.

2.3 Bias in Machine Learning

While existing corpora have promoted research into coreference resolution, they suffer from gender bias. Specifically, of the over 2,000 gendered pronouns in the OntoNotes test corpus, less than 25% are feminine (Zhao et al., 2018). The imbalance is more pronounced on the development and training sets, with less than 20% feminine pronouns each. WikiCoref contains only 12% feminine pronouns. In the Definite Pronoun Resolution Dataset training data, 27% of the gendered pronouns are feminine, while the Winograd Schema Challenge datasets contain 28% and 33% feminine examples. Two exceptions are the recent WinoBias (Zhao et al., 2018) and Winogender schemas (Rudinger et al., 2018) datasets, which reveal how occupation-specific gender bias pervades the majority of publicly-available coreference resolution systems by including a balanced number of feminine pronouns that corefer with anti-stereotypical occupations (see Example (3) from WinoBias). These datasets focus on pronominal coreference where the antecedent is a nominal mention, while GAP focuses on relations where the antecedent is a named entity.

(3) The salesperson sold some books to the librarian because she was trying to sell them.

The pervasive bias in existing datasets is concerning given that learned NLP systems often reflect and even amplify training biases (Bolukbasi et al., 2016; Caliskan et al., 2017; Zhao et al., 2017). A growing body of work defines notions of fairness, bias, and equality in data and machine-learned systems (Pedreshi et al., 2008; Hardt et al., 2016; Zafar et al., 2017; Skirpan and Gorelick, 2017), and debiasing strategies include expanding and rebalancing data (Torralba and Efros, 2011; Ryu et al., 2017; Shankar et al., 2017; Buda, 2017), and balancing performance across subgroups (Dwork et al., 2012). In the context of coreference resolution, Zhao et al. (2018) have shown how debiasing techniques (e.g. swapping the gender of male pronouns and antecedents in OntoNotes, using debiased word embeddings, balancing Bergsma and Lin's (2006) gender list) succeed at reducing the gender bias of multiple off-the-shelf coreference systems.

We work towards fairness in coreference by releasing a diverse, gender-balanced corpus for ambiguous pronoun resolution and further investigating performance differences by gender, not specifically on pronouns with an occupation antecedent but more generally on gendered pronouns.

3 GAP Corpus

We create a corpus of 8,908 human-annotated ambiguous pronoun-name examples from Wikipedia. Examples are obtained from a large set of candidate contexts and are filtered through a multi-stage process designed to improve quality and diversity.

We choose Wikipedia as our base dataset given its wide use in natural language understanding tools, but are mindful of its well-known gender biases. Specifically, less than 15% of biographical Wikipedia pages are about women. Furthermore, women are written about differently than men: e.g. women's biographies are more likely to mention marriage or divorce (Bamman and Smith, 2014), abstract terms are more positive in male biographies than female biographies (Wagner et al., 2016), and articles about females are less central to the article graph (Graells-Garrido et al., 2015).

Type | Pattern | Example
FINALPRO | (Name, Name, Pronoun) | Preckwinkle criticizes Berrios' nepotism: [. . . ] County's ethics rules don't apply to him.
MEDIALPRO | (Name, Pronoun, Name) | McFerran's horse farm was named Glen View. After his death in 1885, John E. Green acquired the farm.
INITIALPRO | (Pronoun, Name, Name) | Judging that he is suitable to join the team, Butcher injects Hughie with a specially formulated mix.

Table 1: Extraction patterns and example contexts for each.
Dimension | Values | Ratio
Page coverage | – | 1 per page per pronoun form
Gender | masc. : fem. | 1 : 1
Extraction Pattern | final : medial : initial | 6.2 : 1 : 1
Page Entity | true : false | 1.3 : 1
Coreferent Name | nameA : nameB | 1 : 1

Table 2: Corpus diversity statistics in final corpus.

3.1 Extraction and Filtering

Extraction targets three patterns, given in Table 1, that characterize locally ambiguous pronoun contexts. We limit to singular mentions, gendered non-reflexive pronouns, and names whose head tokens are different from one another. Additionally, we do not allow intruders: there can be no other compatible mention (by gender, number, and entity type) between the pronoun and the two names.

To limit the success of naïve resolution heuristics, we apply a small set of constraints to focus on those pronouns that are truly hard to resolve; a schematic sketch of these checks follows the list.

• FINALPRO. Both names must be in the same sentence, and the pronoun may appear in the same or directly following sentence.
• MEDIALPRO. The first name must be in the sentence directly preceding the pronoun and the second name, both of which are in the same sentence. To decrease the bias for the pronoun to be coreferential with the first name, the pronoun must be in an initial subordinate clause or be a possessive in an initial prepositional phrase.
• INITIALPRO. All three mentions must be in the same sentence and the pronoun must be in an initial subordinate clause or a possessive in an initial prepositional phrase.
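The sketch below is only an illustrative approximation of the pattern and intruder checks above, shown for the FINALPRO case; it is not the extraction pipeline used to build GAP, and the Mention record and its fields are hypothetical stand-ins for the pre-processing output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Mention:
    """Hypothetical mention record; fields are stand-ins for pre-processing output."""
    text: str
    start: int            # token offset of the mention in the snippet
    sentence: int         # index of the sentence containing the mention
    gender: str           # "masc", "fem", or "unknown"
    is_reflexive: bool = False


def final_pro_candidate(name_a: Mention, name_b: Mention, pronoun: Mention,
                        all_mentions: List[Mention]) -> bool:
    """Rough check of the FINALPRO pattern (Name, Name, Pronoun).

    Both names must be in the same sentence, the pronoun in the same or the
    directly following sentence, and no compatible intruder mention may sit
    between the second name and the pronoun.
    """
    if pronoun.is_reflexive:
        return False
    if name_a.sentence != name_b.sentence:
        return False
    if pronoun.sentence not in (name_b.sentence, name_b.sentence + 1):
        return False
    for m in all_mentions:                      # intruder constraint
        if m in (name_a, name_b, pronoun):
            continue
        between = name_b.start < m.start < pronoun.start
        compatible = m.gender in (pronoun.gender, "unknown")
        if between and compatible:
            return False
    return True
```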
From the extracted contexts, we sub-sample those to send for annotation. We do this to improve diversity in five dimensions:

• Page Coverage. We retain at most 3 examples per page-gender pair to ensure a broad coverage of domains.
• Gender. The raw pipeline extracts contexts with a m:f ratio of 9:1. We oversampled feminine pronouns to achieve a 1:1 ratio. [5]
• Extraction Pattern. The raw pipeline output contains 7 times more FINALPRO contexts than MEDIALPRO and INITIALPRO combined, so we oversampled the latter two to lower the ratio to 6:1:1.
• Page Entity. Pronouns in a Wikipedia page often refer to the entity the page is about. We include such examples in our dataset but balance them 1:1 against examples that do not include mentions of the page entity.
• Coreferent Name. To ensure mention order is not a cue for systems, our final dataset is balanced for label — i.e. whether Name A or Name B is the pronoun's referent.

[5] In doing this, we observed that many feminine pronouns in Wikipedia refer to characters in film and television.

We applied these constraints to the raw extractions to select 8,604 contexts (17,208 examples) for annotation that were globally balanced in all dimensions (e.g. 1:1 gender ratio in MEDIALPRO extractions). Table 2 summarizes the diversity ratios obtained in the final dataset, whose compilation is described next.

3.2 Annotation

We used a pool of in-house raters for human annotation of our examples. Each example was presented to three workers, who selected one of five labels (Table 3). Full sentences of at least 50 tokens preceding each example were presented as context (prior context beyond a section break is not included). Rating instructions accompany the dataset release.

Despite workers not being expert linguists, we find good agreement both within workers and between workers and an expert. Inter-annotator agreement was κ = 0.74 on the Fleiss et al. (2003) Kappa statistic; in 73% of cases there was full agreement between workers, in 25% of cases two of three workers agreed, and only in 2% of cases there was no consensus. We discard the 194 cases with no consensus. On 30 examples rated by an expert linguist, there was agreement on 28 and one was deemed to be truly ambiguous with the given context.

Label | Raw | Final
Name A | 2913 | 1979
Name B | 3047 | 1985
Neither Name A nor Name B | 1614 | 490
Both Name A and Name B | 1016 | 0
Not Sure | 14 | 0
Total | 8604 | 4454

Table 3: Consensus label counts for the extracted examples (Raw) and after further filtering (Final).

To produce our final dataset, we applied additional high-precision filtering to remove some error cases identified by workers, [6] and discarded the "Both" (no ambiguity) and "Not Sure" contexts. Given that many of the feminine examples received the "Both" label from having stage and married names (4), this unbalanced the number of masculine and feminine examples.

(4) Ruby Buckton is a fictional character from the Australian Channel Seven soap opera Home and Away, played by Rebecca Breeds. She debuted . . .

To correct this, we discarded masculine examples to re-achieve 1:1 gender balance. Additionally, we imposed the constraint that there be one example per Wikipedia article per pronoun form (e.g. his), to reduce similarity between examples. The final counts for each label are given in the second column of Table 3. Given that the 4,454 contexts each contain two annotated names, this comprises 8,908 pronoun-name pair labels.

[6] E.g. missing sentence breaks, list environments, and non-referential personal roles/nationalities.

4 Experiments

We set up the GAP challenge and analyze the applicability of a range of off-the-shelf tools. We find that existing resolvers do not perform well and are biased to favor better resolution of masculine pronouns. We empirically validate the observation that Transformer models (Vaswani et al., 2017) encode coreference relationships, adding to the results by Voita et al. (2018) on machine translation, and Trinh and Le (2018) on language modeling. Furthermore, we show they complement traditional linguistic cues such as syntactic distance and parallelism.

All experiments use the Google Cloud NL API [7] for pre-processing, unless otherwise noted.

[7] https://cloud.google.com/natural-language/

4.1 GAP Challenge

GAP is an evaluation corpus and we segment the final dataset into a development and test set of 4,000 examples each; [8] we reserve the remaining 908 examples as a small validation set for parameter tuning. All examples are presented with the URL of the source Wikipedia page, allowing us to define two task settings: snippet-context in which the URL may not be used, and page-context in which it may. While name spans are given in the data, we urge the community not to treat this as a gold mention or Winograd-style task. That is, systems should detect mentions for inference automatically, and access labeled spans only to output predictions.

To reward unbiased modeling, we define two evaluation metrics: F1 score and Bias. Concretely, we calculate F1 score Overall as well as by the gender of the pronoun (Masculine and Feminine). Bias is calculated by taking the ratio of feminine to masculine F1 scores, typically less than one. [9]

[8] All examples extracted from the same URL are partitioned into the same set.
[9] http://goo.gl/language/gap-coreference
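For concreteness, the two metrics can be computed from per-example decisions as in the sketch below. This is a schematic re-implementation under assumed input conventions, not the released scoring script.

```python
from typing import List, Tuple


def f1_score(pairs: List[Tuple[bool, bool]]) -> float:
    """F1 over (gold, predicted) coreference decisions for pronoun-name pairs."""
    tp = sum(1 for gold, pred in pairs if gold and pred)
    fp = sum(1 for gold, pred in pairs if not gold and pred)
    fn = sum(1 for gold, pred in pairs if gold and not pred)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def gap_scores(examples: List[Tuple[str, bool, bool]]) -> dict:
    """examples: (gender, gold, pred) triples with gender in {"masc", "fem"}."""
    masc = [(g, p) for sex, g, p in examples if sex == "masc"]
    fem = [(g, p) for sex, g, p in examples if sex == "fem"]
    m_f1, f_f1 = f1_score(masc), f1_score(fem)
    overall = f1_score([(g, p) for _, g, p in examples])
    bias = f_f1 / m_f1 if m_f1 else float("nan")   # Bias: ratio of feminine to masculine F1
    return {"M": m_f1, "F": f_f1, "B": bias, "O": overall}
```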

[Figure 1: Precision-Recall on the GAP development dataset—Overall (solid markers), Masculine, Feminine—for off-the-shelf resolvers and Parallelism.]

System | M | F | B | O
Lee et al. (2013) | 55.4 | 45.5 | 0.82 | 50.5
Clark and Manning | 58.5 | 51.3 | 0.88 | 55.0
Wiseman et al. | 68.4 | 59.9 | 0.88 | 64.2
Lee et al. (2017) | 67.2 | 62.2 | 0.92 | 64.7

Table 4: Performance of off-the-shelf resolvers on the GAP development set, split by Masculine and Feminine (Bias shows F/M), and Overall. Bold indicates best performance.

System | M | F | B | O
Lee et al. (2013) | 47.7 | 53.2 | 1.12 | 49.2
Clark and Manning | 64.3 | 63.9 | 0.99 | 64.2
Wiseman et al. | 61.9 | 58.0 | 0.94 | 60.6
Lee et al. (2017) | 68.9 | 51.9 | 0.75 | 63.4

Table 5: Pronoun-name F1 score, by gender, of off-the-shelf systems on the OntoNotes test set. Scores based on 2091 masculine pronoun-named entity pairs (in 403 clusters) and 1095 feminine pairs (in 104 clusters). Bold indicates best performance.

4.2 Off-the-Shelf Resolvers

The first set of baselines we explore are four representative off-the-shelf coreference systems: the rule-based system of Lee et al. (2013) and three neural resolvers—Clark and Manning (2015) [10], Wiseman et al. (2016) [11], and Lee et al. (2017) [12]. All were trained on OntoNotes and run in as close to their out-of-the-box configuration as possible. [13] System clusters were scored against GAP examples according to whether the cluster containing the target pronoun also contained the correct name (TP) or the incorrect name (FP), using mention heads for alignment. We report here their performance on GAP as informative baselines, but expect retraining on Wikipedia-like texts to yield an overall improvement in performance. (This remains as future work.)

Table 4 shows that all systems struggle on GAP. That is, despite modeling improvements in recent years, ambiguous pronoun resolution remains a challenge. We note particularly the large difference in performance between genders, which traditionally has not been tracked but has fairness implications for downstream tasks using these publicly available models.

Table 5 provides evidence that this low performance is not solely due to domain and task differences between GAP and OntoNotes. Specifically, with the exception of Clark and Manning (2015), the table shows that system performance on pronoun-name coreference relations in the OntoNotes test set [14] is not vastly better compared to GAP. One possible reason that in-domain OntoNotes performance and out-of-domain GAP performance are not very different could be that state-of-the-art systems are highly tuned for resolving names rather than ambiguous pronouns.

Further, the relative performance of the four systems is different on GAP than on OntoNotes. Particularly interesting is that the current strongest system overall for OntoNotes, namely Lee et al. (2017), scores best on GAP pronouns but has the largest gender bias on OntoNotes. This perhaps is not surprising given the dominance of masculine examples in that corpus. It is outside the scope of this paper to provide an in-depth analysis of the data and modeling decisions which cause this bias; instead we release GAP to address the measurement problem behind the bias.

Figure 1 compares the recall/precision trade-off for each system split by Masculine and Feminine examples, as well as combined (Overall). Also shown is a simple syntactic Parallelism heuristic in which subject and direct object pronouns are resolved to names with the same grammatical role (see Section 4.3). In this visualization, we see a further factor contributing to the low performance of off-the-shelf systems, namely their low recall. That is, while personal pronouns are overwhelmingly anaphoric in both OntoNotes and Wikipedia texts, OntoNotes-trained models are conservative. This observation is consistent with the results for Lee et al. (2013) on the Definite Pronoun Resolution Dataset (Rahman and Ng, 2012), on which the system scored 47.2% F1, [15] failing to beat a random baseline due to conservativeness.

[10] https://stanfordnlp.github.io/CoreNLP/download.html
[11] https://github.com/swiseman/nn_coref
[12] https://github.com/kentonl/e2e-coref
[13] We run Lee et al. (2017) in the final (single-model) configuration, with NLTK preprocessing (Bird and Loper, 2004); for Wiseman et al. (2016) we use Berkeley preprocessing (Durrett and Klein, 2014) and the Stanford systems are run within Stanford CoreNLP (Manning et al., 2014).
[14] For each gendered pronoun in a gold OntoNotes cluster, we compare the system cluster with that pronoun. We count a TP if the system entity contains at least one gold coreferent NE mention; an FP if the system entity contains at least one non-gold NE mention; and an FN if the system entity does not contain any gold NE mention.
[15] Calculated based on the reported performance of 40.07% Correct, 29.79% Incorrect, and 30.14% No decision.
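A rough sketch of this cluster-based scoring is given below, assuming the resolver's output is available as sets of mention strings; the head extraction and the false-negative handling here are simplifications of the procedure described above.

```python
from typing import List, Set, Tuple


def head(name: str) -> str:
    """Crude head word for a name span; a real implementation would use a parser."""
    return name.split()[-1]


def score_example(clusters: List[Set[str]], pronoun: str,
                  correct_name: str, incorrect_name: str) -> Tuple[int, int, int]:
    """Return (tp, fp, fn) for one GAP example.

    `clusters` holds the resolver's entity clusters as sets of mention strings.
    The cluster containing the target pronoun is compared against the two
    candidate names, aligning on mention heads.
    """
    pron_cluster = next((c for c in clusters if pronoun in c), None)
    if pron_cluster is None:
        return 0, 0, 1                       # conservative system: missed gold link
    cluster_heads = {head(m) for m in pron_cluster}
    tp = int(head(correct_name) in cluster_heads)
    fp = int(head(incorrect_name) in cluster_heads)
    return tp, fp, 1 - tp
```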
System | M | F | B | O
Random | 43.6 | 39.3 | 0.90 | 41.5
Token Distance | 50.1 | 42.4 | 0.85 | 46.4
Topical Entity | 51.5 | 43.7 | 0.85 | 47.7
Syntactic Distance | 63.0 | 56.2 | 0.89 | 59.7
Parallelism | 67.1 | 63.1 | 0.94 | 65.2
Parallelism+URL | 71.1 | 66.9 | 0.94 | 69.0
Transformer-Single | 58.6 | 51.2 | 0.87 | 55.0
Transformer-Multi | 59.3 | 52.9 | 0.89 | 56.2

Table 6: Performance of our baselines on the development set. Parallelism+URL tests the page-context setting; all others test the snippet-context setting. Bold indicates best performance in each setting.

System | M | F | B | O
Random | 47.5 | 50.5 | 1.06 | 49.0
Token Distance | 50.6 | 47.5 | 0.94 | 49.1
Topical Entity | 50.2 | 47.3 | 0.94 | 48.8
Syntactic Distance | 66.7 | 66.7 | 1.00 | 66.7
Parallelism | 69.3 | 69.2 | 1.00 | 69.2
Parallelism+URL | 74.2 | 71.6 | 0.96 | 72.9
Transformer-Single | 59.6 | 56.6 | 0.95 | 58.1
Transformer-Multi | 62.9 | 61.7 | 0.98 | 62.3

Table 7: Performance of our baselines on the development set in the gold-two-mention task (access to the two candidate name spans). Parallelism+URL tests the page-context setting; all others test the snippet-context setting. Bold indicates best performance in each setting.

4.3 Coreference-Cue Baselines

To understand the shortcomings of state-of-the-art coreference systems on GAP, the upper sections of Table 6 consider several simple baselines based on traditional cues for coreference.

To calculate these baselines, we first detect candidate antecedents by finding all mentions of PERSON entity type, NAME mention type (headed by a proper noun), and, for structural cues, which are not in a syntactic position that precludes coreference with the pronoun. We do not require gender match because gender annotations are not provided by the Google Cloud NL API and, even if they were, gender predictions on last names (without the first name) are not reliable in the snippet-context setting. Second, we select among the candidates using one of the heuristics described next. For scoring purposes, we do not require exact string match for mention alignment, that is, if the selected candidate is a substring of a given name (or vice versa), we infer a coreference relation between that name and the target pronoun. [16]

[16] Note that requiring exact string match drops recall and causes only a small difference in F1 performance.

Surface Cues. Baseline cues which require only access to the input text are:

• RANDOM. Select a candidate uniformly at random.
• TOKEN DISTANCE. Select the closest candidate to the pronoun, with distance measured as the number of tokens between spans.
• TOPICAL ENTITY. Select the closest candidate which contains the most frequent token string among extracted candidates.

The performance of RANDOM (41.5 Overall) is lower than an otherwise possible guess rate of ~50%. This is because the baseline considers all possible candidates, not just the two annotated names. Moreover, the difference between masculine and feminine examples suggests that there are more distractor mentions in the context of feminine pronouns in GAP. To measure the impact of pronoun context, we include performance on the artificial gold-two-mention setting where only the two name spans are candidates for inference (Table 7). RANDOM is indeed closer here to the expected 50% and other baselines are closer to gender-parity.

TOKEN DISTANCE and TOPICAL ENTITY are only weak improvements above RANDOM, validating that our dataset creation methodology controlled for these factors.

Structural Cues. Baseline cues which may additionally access syntactic structure are listed next; a sketch of the back-off chain follows the list.

• SYNTACTIC DISTANCE. Select the syntactically closest candidate to the pronoun. Back off to TOKEN DISTANCE.
• PARALLELISM. If the pronoun is a subject or direct object, select the closest candidate with the same grammatical argument. Back off to SYNTACTIC DISTANCE.
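The following is a minimal sketch of that back-off chain; the mention attributes (.role, .tree_distance, .token_distance) are illustrative names for information that would come from the syntactic pre-processing, not an exact reproduction of our implementation.

```python
def parallelism(pronoun, candidates):
    """PARALLELISM with its back-offs to SYNTACTIC DISTANCE and TOKEN DISTANCE.

    Each mention is assumed to expose .role (e.g. "subj", "dobj"),
    .tree_distance (syntactic distance to the pronoun, or None if unavailable)
    and .token_distance; the attribute names are assumptions.
    """
    def by_token_distance(cands):
        return min(cands, key=lambda c: c.token_distance)

    def by_syntactic_distance(cands):
        parsed = [c for c in cands if c.tree_distance is not None]
        if parsed:
            return min(parsed, key=lambda c: c.tree_distance)
        return by_token_distance(cands)            # back off to TOKEN DISTANCE

    if pronoun.role in ("subj", "dobj"):
        same_role = [c for c in candidates if c.role == pronoun.role]
        if same_role:
            return by_token_distance(same_role)    # closest candidate sharing the role
    return by_syntactic_distance(candidates)       # back off to SYNTACTIC DISTANCE
```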
Both cues yield strong baselines comparable to the strongest OntoNotes-trained systems (cf. Table 4). In fact, Lee et al. (2017) and PARALLELISM produce remarkably similar output: of the 2000 example pairs in the development set, the two have completely opposing predictions (i.e. Name A vs. Name B) on only 325 examples. Further, the cues are markedly gender-neutral, improving the Bias metric by 9% in the standard task formulation and to parity in the gold-two-mention case. In contrast to surface cues, having the full candidate set is helpful: mention alignment via a non-indicated candidate successfully scores 69% of PARALLELISM predictions.

Wikipedia Cues. To explore the page-context setting, we consider a Wikipedia-specific cue:

• URL. Select the syntactically closest candidate which has a token overlap with the page title. Back off to PARALLELISM.

The heuristic gives a performance gain of 2% overall compared to PARALLELISM. That this feature is not more helpful again validates our methodology for extracting diverse examples. We expect future work to greatly improve on this baseline by using the wealth of cues in Wikipedia articles, including page text.

Head | L0 | L1 | L2 | L3 | L4 | L5
H0 | 46.9 | 47.4 | 45.8 | 46.2 | 45.8 | 45.7
H1 | 45.3 | 46.5 | 46.4 | 46.2 | 49.4 | 46.3
H2 | 45.8 | 46.7 | 46.3 | 46.5 | 45.7 | 45.9
H3 | 46.0 | 46.3 | 46.8 | 46.0 | 46.6 | 48.0
H4 | 45.7 | 46.3 | 46.5 | 47.8 | 45.1 | 47.0
H5 | 47.0 | 46.5 | 46.5 | 45.6 | 46.2 | 52.9
H6 | 46.7 | 45.4 | 46.4 | 45.3 | 46.9 | 47.0
H7 | 43.8 | 46.6 | 46.4 | 55.0 | 46.4 | 46.2

Table 8: Coreference signal of a Transformer model on the validation dataset, by encoder attention layer (columns) and head (rows).

 | PARALLELISM correct | PARALLELISM incorrect
TRANSF. correct | 48.7% | 13.4%
TRANSF. incorrect | 21.6% | 16.3%

Table 9: Comparison of the predictions of the PARALLELISM and TRANSFORMER-SINGLE heuristics over the GAP development dataset.

4.4 Transformer Models for Coreference

The recent Transformer model (Vaswani et al., 2017) demonstrated tantalizing representations for coreference: when trained for machine translation, some self-attention layers appear to show stronger attention weights between coreferential elements. [17] Voita et al. (2018) found evidence for this claim for the English pronouns it, you, and I in a movie subtitles dataset (Lison et al., 2018). GAP allows us to explore this claim on Wikipedia for ambiguous personal pronouns. To do so, we investigate the heuristic:

• TRANSFORMER. Select the candidate which attends most to the pronoun.

The Transformer model underlying our experiments is trained for 350k steps on the 2014 English-German NMT task, [18] using the same settings as Vaswani et al. (2017). The model processes texts as a series of subtokens (text fragments the size of a token or smaller) and learns three multi-head attention matrices over these: two self-attention matrices (one over the subtokens of the source sentences and one over those of the target sentences), and a cross-attention matrix between the source and target. Each attention matrix is decomposed into a series of feed-forward layers, each composed of discrete heads designed to specialize for different dimensions in the training signal. We input GAP snippets as English source text and extract attention values from the source self-attention matrix; the target side (German translations) is not used.

We calculate the attention between a name and the pronoun to be the mean over all subtokens in these spans; the attention between two subtokens is the sum of the raw attention values between all occurrences of those subtoken strings in the input snippet. These two factors control for variation between Transformer models and the spreading of attention between different mentions of the same entity.

[17] See Figure 4 at https://arxiv.org/abs/1706.03762
[18] http://www.statmt.org/wmt14/translation-task.html
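The aggregation just described can be sketched as below, assuming the self-attention of a single (layer, head) pair has been exported as a matrix indexed by subtoken positions; the function and argument names are illustrative rather than those of our implementation.

```python
import numpy as np


def subtoken_attention(att: np.ndarray, subtokens, s1: str, s2: str) -> float:
    """Sum of raw attention between every occurrence of subtoken strings s1 and s2.

    `att` is the source self-attention matrix of one (layer, head), indexed by
    [query position, key position]; `subtokens` lists the snippet's subtoken
    strings. Shapes and names are assumptions for illustration.
    """
    pos1 = [i for i, t in enumerate(subtokens) if t == s1]
    pos2 = [j for j, t in enumerate(subtokens) if t == s2]
    return float(sum(att[i, j] for i in pos1 for j in pos2))


def name_pronoun_attention(att, subtokens, name_pieces, pronoun_pieces) -> float:
    """Mean over all subtoken pairs drawn from the name span and the pronoun span."""
    vals = [subtoken_attention(att, subtokens, n, p)
            for n in name_pieces for p in pronoun_pieces]
    return float(np.mean(vals))


def transformer_heuristic(att, subtokens, pronoun_pieces, candidates):
    """TRANSFORMER: select the candidate whose span attends most to the pronoun."""
    return max(candidates,
               key=lambda pieces: name_pronoun_attention(att, subtokens,
                                                         pieces, pronoun_pieces))
```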

TRANSFORMER-SINGLE. Table 8 gives the performance of the TRANSFORMER heuristic over each self-attention head on the development dataset. Consistent with the observations by Vaswani et al. (2017), we observe that the coreference signal is localized on specific heads and that these heads are in the deep layers of the network (e.g. L3H7). During development, we saw that the specific heads which specialize for coreference are different between different models.

The TRANSFORMER-SINGLE baseline in Table 6 is the one set by L3H7 in Table 8. Despite not having access to syntactic structure, TRANSFORMER-SINGLE far outperforms all surface cues above. That is, we find evidence for the claim that Transformer models implicitly learn language understanding relevant to coreference resolution. Even more promising, we find that the instances of coreference that TRANSFORMER-SINGLE can handle are substantially different from those of PARALLELISM, see Table 9.

TRANSFORMER-MULTI. We learn to compose the signals from different self-attention heads using extra trees classifiers (Geurts et al., 2006). [19] We choose this classifier since we have little available training data and a small feature set. Specifically, for each candidate antecedent, we:

• Extract one feature for each of the 48 Transformer heads. The feature value is True if there is a substring overlap between the candidate and the prediction of TRANSFORMER-SINGLE.
• Use the χ² statistic to reduce dimensionality. We found k=3 worked well.
• Learn an extra trees classifier over these three features with the validation dataset (a schematic of this recipe is sketched below).

That TRANSFORMER-MULTI is stronger than TRANSFORMER-SINGLE in Table 6 suggests that different self-attention heads encode different dimensions of the coreference problem. Though the gain is modest when all mentions are under consideration, Table 7 shows a 4.2% overall improvement over TRANSFORMER-SINGLE for the gold-two-mention task. Future work could explore filtering the candidate list presented to Transformer models to reduce the impact of distractor mentions in a pronoun's context—for example, by gender in the page-context setting. It is also worth stressing that these models are trained on very little data (the GAP validation set). These preliminary results suggest that learned models incorporating such features from the Transformer and using more data are worth exploring further.

[19] http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.ExtraTreesClassifier.html
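A minimal sketch of this recipe using scikit-learn follows; the feature matrix here is a random placeholder standing in for the 48 binary head features described above, so the numbers it produces are meaningless.

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline

# Placeholder data: one row per candidate antecedent, 48 binary features (one
# per attention head, True when that head's prediction overlaps the candidate)
# and a binary label marking the true antecedent. Real features would be
# derived from the GAP validation set as described above.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(908, 48))
y = rng.integers(0, 2, size=908)

transformer_multi = make_pipeline(
    SelectKBest(chi2, k=3),                        # chi-squared feature selection, k=3
    ExtraTreesClassifier(n_estimators=100, random_state=0),
)
transformer_multi.fit(X, y)
candidate_scores = transformer_multi.predict_proba(X)[:, 1]
```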

4.5 GAP Benchmarks

Table 10 sets the baselines for the GAP challenge. We include the off-the-shelf system which performed best Overall on the development set (Lee et al., 2017), as well as our strongest baselines for the two task settings, PARALLELISM [20] and PARALLELISM+URL. We note that strict comparisons cannot be made between our snippet-context baselines given that Lee et al. (2017) has access to OntoNotes annotations that we do not, and we have access to pronoun ambiguity annotations that Lee et al. (2017) do not.

[20] We also trained an extra trees classifier over all explored coreference-cue baselines (including Transformer-based heuristics), but its performance was similar to PARALLELISM and the predictions matched in the vast majority of instances.

System | M | F | B | O
Lee et al. (2017) | 67.7 | 60.0 | 0.89 | 64.0
Parallelism | 69.4 | 64.4 | 0.93 | 66.9
Parallelism+URL | 72.3 | 68.8 | 0.95 | 70.6

Table 10: Baselines on the GAP challenge test set.

5 Error Analysis

We have shown that GAP is challenging for both off-the-shelf systems and our baselines. To assess the variance between these systems and gain a more qualitative understanding of what aspects of GAP are challenging, we use the number of off-the-shelf systems that agree with the rater-provided labels (Agreement with Gold) as a proxy for difficulty. Table 11 breaks down the name-pronoun examples in the development set by Agreement with Gold (the smaller the agreement, the harder the example). [21]

[21] Given that system predictions are not independent for the two candidate names for a given snippet, we only focus on the positive coreferential name-pronoun pair when the gold label is either "Name A" or "Name B"; we use both name-pronoun pairs when the gold label is "Neither".

Agreement with Gold is low (average 2.1) and spread. Less than 30% of the examples are successfully solved by all systems (labeled Green), and just under 15% are so challenging that none of the systems gets them right (Red). The majority are in between (Yellow). Many Green cases have syntactic cues for coreference, but we find no systematic trends within Yellow.

Difficulty | Agreement with Gold | # | %
Green | 4 | 631 | 28.7
Yellow | 3 | 469 | 21.3
Yellow | 2 | 420 | 19.1
Yellow | 1 | 353 | 16.0
Red | 0 | 328 | 14.9

Table 11: Analysis of the GAP development examples by the number of systems (out of 4) agreeing with gold.
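The Agreement-with-Gold proxy amounts to simple counting, as in the sketch below; the input format is an assumption made for illustration.

```python
from collections import Counter


def difficulty_buckets(examples):
    """examples: (gold_label, [pred_1, ..., pred_4]) per name-pronoun pair.

    Returns a Counter over Agreement with Gold, i.e. how many of the four
    off-the-shelf systems match the rater-provided label (4 = Green, 0 = Red).
    """
    agreement = [sum(pred == gold for pred in preds) for gold, preds in examples]
    return Counter(agreement)
```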
Table 12 provides a fine-grained analysis of 75 Red cases. When labeling these cases, two important considerations emerged: (1) labels often overlap, with one example possibly fitting into multiple categories; and (2) GAP requires global reasoning—cues from different entity mentions work together to build a snippet's interpretation. The Red examples in particular exemplify the challenge of GAP, and point toward the need for multiple modeling strategies to achieve significantly higher scores on the dataset.

Category | Description | Example (abridged) | #
NARRATIVE ROLES | Inference involving the roles people take in described events | As Nancy tried to pull Hind down by the arm in the final meters as what was clearly an attempt to drop her [...] | 28
COMPLEX | Syntactic cues are present but in complex constructions | Sheena thought back to the 1980s [...] and thought of idol Hiroko Mita, who had appeared on many posters for medical products, acting as if her stomach or head hurt | 20
TOPICALITY | Inference involving the entity topicality, inc. parentheticals | The disease is named after Eduard Heinrich Henoch (1820–1910), a German pediatrician (nephew of Moritz Heinrich Romberg) and his teacher | 15
DOMAIN KNOWLEDGE | Inference involving knowledge specific to a domain, e.g. sport | The half finished 4–0, after Hampton converted a penalty awarded against Arthur Knight for handball when Fleming's powerful shot struck his arm. | 6
ERROR | Annotation error, inc. truly ambiguous cases | When she gets into an altercation with Queenie, Fiona makes her act as Queenie's slave [...] | 6

Table 12: Fine-grained categorization of 75 Red examples from the GAP development set (no system agreed with the worker-selected name). Underlining indicates the rater-selected name.

6 Conclusions

We have presented a dataset and a set of strong baselines for a new coreference task, GAP. We designed GAP to represent the challenges posed by real-world text, in which ambiguous pronouns are important and difficult to resolve. We highlighted gaps in the existing state of the art, and proposed the application of Transformer models to address these. Specifically, we show how traditional linguistic features and modern sentence encoder technology are complementary.

Our work contributes to the emerging body of work on the impact of bias in machine learning. We saw systematic differences between genders in analysis; this is consistent with many studies which call out differences in how males and females are discussed publicly. By rebalancing our dataset for gender, we hope to reward systems which are able to capture these complexities fairly. It has been outside the scope of this paper to explore bias in other dimensions, to analyze coreference in other languages, and to study the impact on downstream systems of improved coreference resolution. We look forward to future work in these directions.

Acknowledgments

We would like to thank our anonymous reviewers and the Google AI Language team, especially Emily Pitler, for the insightful comments that contributed to this paper. Many thanks also to the Data Compute team, especially Ashwin Kakarla, Henry Jicha and Daphne Luong, for their help with the annotations, and thanks to Llion Jones for his help with the Transformer experiments.

References
David Bamman and Noah A. Smith. 2014. Unsupervised discovery of biographical structure from text. Transactions of the ACL, 2:363–376.

Mohit Bansal and Dan Klein. 2012. Coreference semantics from web features. In Proceedings of ACL, pages 389–398, Jeju Island, Korea.

David Bean and Ellen Riloff. 2004. Unsupervised learning of contextual role knowledge for coreference resolution. In Proceedings of HLT-NAACL, pages 297–304, Boston, Massachusetts.

Shane Bergsma and Dekang Lin. 2006. Bootstrapping path-based pronoun resolution. In Proceedings of ACL, pages 33–40, Sydney, Australia.

Steven Bird and Edward Loper. 2004. NLTK: The natural language toolkit. In Proceedings of the ACL 2004 Interactive Poster and Demonstration Sessions, Barcelona, Spain.

Tolga Bolukbasi, Kai-Wei Chang, James Y. Zou, Venkatesh Saligrama, and Adam Kalai. 2016. Man is to computer programmer as woman is to homemaker? Debiasing word embeddings. In Proceedings of NIPS, pages 4349–4357, Barcelona, Spain.

Mateusz Buda. 2017. A systematic study of the class imbalance problem in convolutional neural networks. Master's thesis, KTH Royal Institute of Technology.

Aylin Caliskan, Joanna J. Bryson, and Arvind Narayanan. 2017. Semantics derived automatically from language corpora contain human-like biases. Science, 356(6334):183–186.

Nathanael Chambers and Dan Jurafsky. 2008. Unsupervised learning of narrative event chains. In Proceedings of ACL: HLT, pages 789–797, Columbus, Ohio.

Kevin Clark and Christopher D. Manning. 2015. Entity-centric coreference resolution with model stacking. In Proceedings of ACL, pages 1405–1415, Beijing, China.

Ido Dagan and Alon Itai. 1990. Automatic processing of large corpora for the resolution of anaphora references. In Proceedings of COLING, pages 330–332, Helsinki, Finland.

Greg Durrett and Dan Klein. 2013. Easy victories and uphill battles in coreference resolution. In Proceedings of EMNLP, pages 1971–1982, Seattle, Washington.

Greg Durrett and Dan Klein. 2014. A joint model for entity analysis: Coreference, typing, and linking. Transactions of the ACL, 2:477–490.

Cynthia Dwork, Moritz Hardt, Toniann Pitassi, Omer Reingold, and Richard S. Zemel. 2012. Fairness through awareness. In Proceedings of ITCS, pages 214–226.

Joseph L. Fleiss, Bruce Levin, and Myunghee Cho Paik. 2003. The Measurement of Interrater Agreement, 3rd edition. John Wiley and Sons Inc.

Pierre Geurts, Damien Ernst, and Louis Wehenkel. 2006. Extremely randomized trees. Machine Learning, 63(1):3–42.

Abbas Ghaddar and Philippe Langlais. 2016a. Coreference in Wikipedia: Main concept resolution. In Proceedings of CoNLL, pages 229–238, Berlin, Germany.

Abbas Ghaddar and Philippe Langlais. 2016b. WikiCoref: An English coreference-annotated corpus of Wikipedia articles. In Proceedings of LREC, Portorož, Slovenia.

Eduardo Graells-Garrido, Mounia Lalmas, and Filippo Menczer. 2015. First women, second sex: Gender bias in Wikipedia. In Proceedings of the 26th ACM Conference on Hypertext and Social Media, pages 165–174, Guzelyurt, Northern Cyprus.

Ralph Grishman and Beth Sundheim. 1996. Message Understanding Conference-6: A brief history. In Proceedings of COLING, pages 466–471, Copenhagen, Denmark.

Liane Guillou. 2012. Improving pronoun translation for statistical machine translation. In Proceedings of the Student Research Workshop at the 13th Conference of the EACL, pages 1–10, Avignon, France.

Moritz Hardt. 2014. How big data is unfair: Understanding unintended sources of unfairness in data driven decision making. https://medium.com/@mrtz/how-big-data-is-unfair-9aa544d739de.

Moritz Hardt, Eric Price, and Nathan Srebro. 2016. Equality of opportunity in supervised learning. In Proceedings of NIPS, pages 3323–3331, Barcelona, Spain.

Andrew Kehler, Douglas Appelt, Lara Taylor, and Aleksandr Simma. 2004. The (non)utility of predicate-argument frequencies for pronoun interpretation. In Proceedings of HLT-NAACL, pages 289–296, Boston, Massachusetts.

Heeyoung Lee, Angel Chang, Yves Peirsman, Nathanael Chambers, Mihai Surdeanu, and Dan Jurafsky. 2013. Deterministic coreference resolution based on entity-centric, precision-ranked rules. Computational Linguistics, 39(4):885–916.

Kenton Lee, Luheng He, Mike Lewis, and Luke Zettlemoyer. 2017. End-to-end neural coreference resolution. In Proceedings of EMNLP, pages 188–197, Copenhagen, Denmark.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd Schema Challenge. In Proceedings of KR, pages 552–561, Rome, Italy.

Shasha Liao and Ralph Grishman. 2010. Large corpus-based semantic feature extraction for pronoun coreference. In Proceedings of the 2nd Workshop on NLP Challenges in the Information Explosion Era (NLPIX), pages 60–68, Beijing, China.

Pierre Lison, Jörg Tiedemann, and Milen Kouylekov. 2018. OpenSubtitles2018: Statistical rescoring of sentence alignments in large, noisy parallel corpora. In Proceedings of LREC, pages 1742–1748, Miyazaki, Japan.

Quan Liu, Hui Jiang, Zhen-Hua Ling, Xiaodan Zhu, Si Wei, and Yu Hu. 2017. Combing context and commonsense knowledge through neural networks for solving Winograd schema problems. In Proceedings of AAAI, pages 315–321, San Francisco, California.

Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL System Demonstrations, pages 55–60, Baltimore, Maryland.

Leora Morgenstern, Ernest Davis, and Charles L. Ortiz. 2016. Planning, executing, and evaluating the Winograd Schema Challenge. AI Magazine, 37(1):50–54.

Kotaro Nakayama. 2008. Wikipedia mining for triple extraction enhanced by co-reference resolution. In Proceedings of the ISWC Workshop on Social Data on the Web, Berlin, Germany.

Dino Pedreshi, Salvatore Ruggieri, and Franco Turini. 2008. Discrimination-aware data mining. In Proceedings of KDD, pages 560–568, Las Vegas, Nevada.

Haoruo Peng, Daniel Khashabi, and Dan Roth. 2015. Solving hard coreference problems. In Proceedings of NAACL, pages 809–819, Denver, Colorado.

Simone Paolo Ponzetto and Michael Strube. 2006. Exploiting semantic role labeling, WordNet and Wikipedia for coreference resolution. In Proceedings of HLT-NAACL, pages 192–199, New York, New York.

Sameer Pradhan, Lance Ramshaw, Ralph Weischedel, Jessica MacBride, and Linnea Micciulla. 2007. Unrestricted coreference: Identifying entities and events in OntoNotes. In Proceedings of ICSC, pages 446–453, Irvine, California.

Sameer Pradhan, Alessandro Moschitti, Nianwen Xue, Olga Uryupina, and Yuchen Zhang. 2012. CoNLL-2012 Shared Task: Modeling multilingual unrestricted coreference in OntoNotes. In Proceedings of CoNLL: Shared Task, pages 1–40, Jeju, Republic of Korea.

Altaf Rahman and Vincent Ng. 2012. Resolving complex cases of definite pronouns: The Winograd Schema Challenge. In Proceedings of EMNLP-CoNLL, pages 777–789, Jeju, Republic of Korea.

Rachel Rudinger, Jason Naradowsky, Brian Leonard, and Benjamin Van Durme. 2018. Gender bias in coreference resolution. In Proceedings of NAACL, New Orleans, Louisiana.

Hee Jung Ryu, Margaret Mitchell, and Hartwig Adam. 2017. Improving smiling detection with race and gender diversity. ArXiv e-prints v2, https://arxiv.org/abs/1712.00193.

Shreya Shankar, Yoni Halpern, Eric Breck, James Atwood, Jimbo Wilson, and D. Sculley. 2017. No classification without representation: Assessing geodiversity issues in open data sets for the developing world. In Proceedings of the NIPS Workshop on Machine Learning for the Developing World, Long Beach, California.

Michael Skirpan and Micha Gorelick. 2017. The authority of "fair" in machine learning. In Proceedings of the KDD Workshop on Fairness, Accountability, and Transparency in Machine Learning (FAT/ML), Halifax, Nova Scotia.

Veselin Stoyanov, Nathan Gilbert, Claire Cardie, and Ellen Riloff. 2009. Conundrums in noun phrase coreference resolution: Making sense of the state-of-the-art. In Proceedings of ACL-IJCNLP, pages 656–664, Singapore.

Antonio Torralba and Alexei A. Efros. 2011. Unbiased look at dataset bias. In Proceedings of CVPR 2011, pages 1521–1528, Colorado Springs, Colorado.

Trieu H. Trinh and Quoc V. Le. 2018. A simple method for commonsense reasoning. ArXiv e-prints v1, https://arxiv.org/abs/1806.02847.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of NIPS, pages 6000–6010, Long Beach, California.

Elena Voita, Pavel Serdyukov, Rico Sennrich, and Ivan Titov. 2018. Context-aware neural machine translation learns anaphora resolution. In Proceedings of ACL, Melbourne, Australia.

Claudia Wagner, Eduardo Graells-Garrido, David Garcia, and Filippo Menczer. 2016. Women through the glass ceiling: Gender asymmetries in Wikipedia. EPJ Data Science, 5.

Sam Wiseman, Alexander M. Rush, and Stuart M. Shieber. 2016. Learning global features for coreference resolution. In Proceedings of NAACL-HLT, pages 994–1004, San Diego, California.

Xiaofeng Yang, Jian Su, and Chew Lim Tan. 2005. Improving pronoun resolution using statistics-based semantic compatibility information. In Proceedings of ACL, pages 165–172, Ann Arbor, Michigan.

Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. 2017. Fairness beyond disparate treatment & disparate impact: Learning classification without disparate mistreatment. In Proceedings of WWW, pages 1171–1180, Perth, Australia.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2017. Men also like shopping: Reducing gender bias amplification using corpus-level constraints. In Proceedings of EMNLP, pages 2941–2951, Copenhagen, Denmark.

Jieyu Zhao, Tianlu Wang, Mark Yatskar, Vicente Ordonez, and Kai-Wei Chang. 2018. Gender bias in coreference resolution: Evaluation and debiasing methods. In Proceedings of NAACL, New Orleans, Louisiana.