The KNOWREF Coreference Corpus: Removing Gender and Number Cues for Difficult Pronominal Anaphora Resolution

Ali Emami*1, Paul Trichelair*1, Adam Trischler2, Kaheer Suleman2, Hannes Schulz2, and Jackie Chi Kit Cheung1
1School of Computer Science, Mila/McGill University
2Microsoft Research Montreal
*Equal contribution

Abstract

We introduce a new benchmark for coreference resolution and NLI, KNOWREF, that targets common-sense understanding and world knowledge. Previous coreference resolution tasks can largely be solved by exploiting the number and gender of the antecedents, or have been handcrafted and do not reflect the diversity of naturally occurring text. We present a corpus of over 8,000 annotated text passages with ambiguous pronominal anaphora. These instances are both challenging and realistic. We show that various coreference systems, whether rule-based, feature-rich, or neural, perform significantly worse on the task than humans, who display high inter-annotator agreement. To explain this performance gap, we show empirically that state-of-the-art models often fail to capture context, instead relying on the gender or number of candidate antecedents to make a decision. We then use problem-specific insights to propose a data-augmentation trick called antecedent switching to alleviate this tendency in models. Finally, we show that antecedent switching yields promising results on other tasks as well: we use it to achieve state-of-the-art results on the GAP coreference task.

1 Introduction

Coreference resolution is one of the best known tasks in Natural Language Processing (NLP). Despite a large body of work in the area over the last few decades (Morton, 2000; Bean and Riloff, 2004; McCallum and Wellner, 2005; Rahman and Ng, 2009), the task remains challenging. Many coreference resolution decisions require extensive world knowledge and understanding common points of reference (Pradhan et al., 2011). In the case of pronominal anaphora resolution, these forms of "common sense" become much more important when cues like gender and number do not by themselves indicate the correct resolution (Trichelair et al., 2018).

To date, most existing methods for coreference resolution (Raghunathan et al., 2010; Lee et al., 2011; Durrett et al., 2013; Lee et al., 2017, 2018) have been evaluated on a few popular datasets, including the CoNLL 2011 and 2012 shared coreference resolution tasks (Pradhan et al., 2011, 2012). These datasets were proposed as the first comprehensively tagged and large-scale corpora for coreference resolution, to spur progress in state-of-the-art techniques. According to Durrett and Klein (2013), this progress would contribute in the "uphill battle" of modelling not just syntax and discourse, but also semantic compatibility based on world knowledge and context.

Despite improvements in benchmark dataset performance, the question of what exactly current systems learn or exploit remains open, particularly with recent neural coreference resolution models. Lee et al. (2017) note that their model does "little in the uphill battle of making coreference decisions that require world knowledge," and highlight a few examples in the CoNLL 2012 task that rely on more complex understanding or inference. Because these cases are infrequent in the data, systems can perform very well on the CoNLL tasks according to standard metrics by exploiting surface cues. High-performing models have also been observed to rely on social stereotypes present in the data, which could unfairly impact their decisions for some demographics (Zhao et al., 2018).
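To illustrate what such a surface cue looks like in practice, consider the toy resolver below, which picks whichever candidate agrees with the pronoun in gender and number and never consults the context. This is our own minimal sketch, not a system from the literature; the feature lexicons and the surface_cue_resolver function are hypothetical stand-ins for the mention attributes that feature-based systems typically compute.

```python
# A toy illustration (not a published system) of the surface cue at issue:
# resolve a pronoun by gender/number agreement alone, ignoring context.
# The tiny lexicons below are hypothetical stand-ins for the mention
# attributes that feature-based coreference systems compute at scale.

PRONOUN_FEATURES = {
    "he": ("masc", "sg"),
    "she": ("fem", "sg"),
    "it": ("neut", "sg"),
    "they": (None, "pl"),  # gender unspecified
}

CANDIDATE_FEATURES = {
    "John": ("masc", "sg"),
    "Bill": ("masc", "sg"),
    "Mary": ("fem", "sg"),
}

def surface_cue_resolver(pronoun: str, candidates: list[str]) -> str | None:
    """Return the sole candidate whose gender and number agree with the
    pronoun, or None when the cue does not single one out."""
    gender, number = PRONOUN_FEATURES[pronoun]
    matches = [
        c for c in candidates
        if CANDIDATE_FEATURES[c][1] == number
        and (gender is None or CANDIDATE_FEATURES[c][0] == gender)
    ]
    return matches[0] if len(matches) == 1 else None

# The cue alone resolves "John met Mary before [she] left":
assert surface_cue_resolver("she", ["John", "Mary"]) == "Mary"
# But when both candidates agree with the pronoun, the shortcut yields nothing:
assert surface_cue_resolver("he", ["John", "Bill"]) is None
```

KNOWREF is constructed so that the second case always holds: both candidate antecedents match the pronoun, so any remaining signal must come from the context.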
There is a recent trend, therefore, to develop more challenging and diverse coreference tasks. Perhaps the most popular of these is the Winograd Schema Challenge (WSC), which has emerged as an alternative to the Turing test (Levesque et al., 2011). The WSC task is carefully controlled such that heuristics involving syntactic salience, the number and gender of the antecedents, or other obvious syntactic/semantic cues are ineffective.

Previous approaches to common sense reasoning, based on logical formalisms (Bailey et al., 2015) or deep neural models (Liu et al., 2016), have solved only restricted subsets of the WSC with high precision. These shortcomings can in part be attributed to the limited size of the corpus (273 instances), which is a side effect of its hand-crafted nature.

Webster et al. (2018) recently presented a corpus called GAP that consists of about 4,000 unique binary coreference instances from English Wikipedia. This corpus is intended to address gender bias and the aforementioned size limitations of the WSC. We believe that gender bias in coreference resolution is part and parcel of a more general problem: current models are unable to abstract away from the entities in the sentence to take advantage of the wider context to make a coreference decision.

To tackle this issue, we present a coreference resolution corpus called KNOWREF that specifically targets the ability of systems to reason about a situation described in the context.[1] We designed this task to be challenging, large-scale, and based on natural text. The main contributions of this paper are as follows:

1. We develop mechanisms by which we construct a human-labeled corpus of 8,724 Winograd-like text samples whose resolution requires significant common sense and background knowledge. As an example:

   Marcus is undoubtedly faster than Jarrett right now but in [his] prime the gap wasn't all that big. (answer: Jarrett)

2. We propose a task-specific metric called consistency that measures the extent to which a model uses the full context (as opposed to a surface cue) to make a coreference decision. We use this metric to analyze the behavior of state-of-the-art methods and demonstrate that they generally under-utilize context information.

3. We find that a fine-tuned version of the recent large-scale language model, BERT (Devlin et al., 2018), performs significantly better than other methods on KNOWREF, although with substantial room for improvement to match human performance.

4. We demonstrate the benefits of a data-augmentation technique called antecedent switching in expanding our corpus, further deterring models from exploiting surface cues, as well as in transferring to models trained on other coreference tasks like GAP, leading to state-of-the-art results. (A minimal sketch of this transform and of the consistency metric follows this list.)

[1] The corpus, the code to scrape the sentences from the source texts, as well as the code to reproduce all of our experimental results are available at https://github.com/aemami1/KnowRef.
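To make contributions 2 and 4 concrete, here is a minimal sketch, under our own assumptions, of how antecedent switching and the consistency metric could be implemented. The instance fields (text, candidate_1, candidate_2, label), the switch_antecedents helper, and the predict callback are hypothetical names introduced for illustration, not the authors' code or the released corpus schema; see https://github.com/aemami1/KnowRef for the actual implementation.

```python
from typing import Callable, Dict, List

Instance = Dict[str, str]  # hypothetical schema, not the released format

EXAMPLE: Instance = {
    "text": ("Marcus is undoubtedly faster than Jarrett right now "
             "but in [his] prime the gap wasn't all that big."),
    "candidate_1": "Marcus",
    "candidate_2": "Jarrett",
    "label": "Jarrett",  # correct antecedent of [his]
}

def switch_antecedents(inst: Instance) -> Instance:
    """Antecedent switching: swap the two candidate names throughout the
    text and flip the gold label to match. The context that licenses the
    resolution is unchanged, so a model that actually reads it should
    flip its prediction too."""
    a, b = inst["candidate_1"], inst["candidate_2"]
    tmp = "\x00"  # placeholder so the two replacements don't collide
    text = inst["text"].replace(a, tmp).replace(b, a).replace(tmp, b)
    label = a if inst["label"] == b else b
    return {**inst, "text": text, "label": label}

def consistency(instances: List[Instance],
                predict: Callable[[Instance], str]) -> float:
    """Fraction of instances on which a model's prediction flips when the
    antecedents are switched, i.e., on which it appears to rely on the
    context rather than on a cue tied to the candidate names themselves."""
    flipped = sum(
        predict(switch_antecedents(inst)) != predict(inst)
        for inst in instances
    )
    return flipped / len(instances)
```

Used as data augmentation rather than as a diagnostic, the same transform doubles the corpus while making any association between a particular name and the label uninformative, which is the sense in which it deters models from exploiting surface cues.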
2 Related Work

2.1 General coreference resolution

Automated techniques for standard coreference resolution (that is, the task of correctly partitioning the entities and events that occur in a document into resolution classes) date back to decision trees and hand-written rules (Hobbs, 1977; McCarthy, 1995). The earliest evaluation corpora were those of the Message Understanding Conferences (MUC) (Grishman and Sundheim, 1996) and the ACE corpus (Doddington et al., 2004). These focused on noun phrases tagged with coreference information, but were limited in either size or annotation coverage.

The datasets of Pradhan et al. (2011, 2012) from the CoNLL-2011 and CoNLL-2012 Shared Tasks were proposed as large-scale corpora with high inter-annotator agreement. They were constructed by restricting the data to coreference phenomena with highly consistent annotations, and were packaged with a standard evaluation framework to facilitate performance comparisons.

The quality of these tasks led to their widespread use and the emergence of many resolution systems, ranging from hand-engineered methods to deep-learning approaches. The multi-pass sieve system of Raghunathan et al. (2010) is fully deterministic and makes use of mention attributes like gender and number; it maintained the best results on the CoNLL 2011 task for a number of years (Lee et al., 2011). Later, lexical learning approaches emerged as the new state of the art (Durrett and Klein, 2013), followed more recently by neural models (Wiseman et al., 2016; Clark and Manning, 2016). The current state-of-the-art result on the CoNLL 2012 task is by an end-to-end neural model from Lee et al. (2018) that does not rely on a syntactic parser or a hand-engineered mention detector.

2.2 Gender bias in general coreference resolution

Zhao et al. (2018) observed that state-of-the-art methods for coreference resolution become gender-biased, exploiting various stereotypes that leak from society into data. They devise a dataset of 3,160 manually written sentences called WinoBias that serves both as a gender-bias test for coreference resolution models and as a training set to counter stereotypes in existing corpora (i.e., the two CoNLL tasks). The following example is representative:

(1) The physician hired the secretary because he was overwhelmed with clients.

(2) The physician hired the secretary because she was overwhelmed with clients.

2.3 The Winograd Schema Challenge

The goal of the Winograd Schema Challenge was that any successful system would necessarily use common-sense knowledge. Although the WSC is an important step in evaluating systems en route to human-like language understanding, its size and other characteristics are a bottleneck for progress in pronoun disambiguation (Trichelair et al., 2018). A Winograd-like expanded corpus was proposed by Rahman and Ng (2012) to address the WSC's size limitations; however, systems that perform well on the expanded dataset do not transfer successfully to the original WSC (Rahman and Ng, 2012; Peng et al., 2015), likely due to loosened constraints in the former.