WikiCREM: A Large Unsupervised Corpus for Coreference Resolution

Vid Kocijan 1, Oana-Maria Camburu 1,2, Ana-Maria Crețu 3, Yordan Yordanov 1, Phil Blunsom 1,4, Thomas Lukasiewicz 1,2
1 University of Oxford, UK; 2 Alan Turing Institute, London, UK; 3 Imperial College London, UK; 4 DeepMind, London, UK
[email protected], [email protected]

arXiv:1908.08025v3 [cs.CL] 13 Oct 2019

Abstract

Pronoun resolution is a major area of natural language understanding. However, large-scale training sets are still scarce, since manually labelling data is costly. In this work, we introduce WIKICREM (Wikipedia CoREferences Masked), a large-scale, yet accurate dataset of pronoun disambiguation instances. We use a language-model-based approach for pronoun resolution in combination with our WIKICREM dataset. We compare a series of models on a collection of diverse and challenging coreference resolution problems, where we match or outperform previous state-of-the-art approaches on 6 out of 7 datasets, such as GAP, DPR, WNLI, PDP, WINOBIAS, and WINOGENDER. We release our model to be used off-the-shelf for solving pronoun disambiguation.

1 Introduction

Pronoun resolution, also called coreference or anaphora resolution, is a natural language processing (NLP) task that aims to link pronouns to their referents. This task is of crucial importance in various other NLP tasks, such as information extraction (Nakayama, 2019) and machine translation (Guillou, 2012). Due to its importance, pronoun resolution has seen a series of different approaches, such as rule-based systems (Lee et al., 2013) and end-to-end-trained neural models (Lee et al., 2017; Liu et al., 2019). However, the recently released GAP dataset (Webster et al., 2018) shows that most of these solutions perform worse than naïve baselines when the answer cannot be deduced from the syntax. Addressing this drawback is difficult, partially due to the lack of large-scale challenging datasets needed to train data-hungry neural models.

As observed by Trinh and Le (2018), language models are a natural approach to pronoun resolution: the pronoun is resolved by selecting the replacement that forms the sentence with the highest probability. Additionally, language models have the advantage of being pre-trained on a large collection of unstructured text and then fine-tuned on a specific task using much less training data. This procedure has obtained state-of-the-art results on a series of natural language understanding tasks (Devlin et al., 2018).
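To make the scoring idea concrete, the sketch below ranks candidate antecedents by the probability a masked language model assigns to each candidate's WordPiece tokens in the pronoun's position. This is a minimal illustration, assuming the HuggingFace transformers library and bert-base-uncased; the function names and the averaging choice are ours, not the paper's exact implementation.

```python
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

def candidate_log_prob(text: str, candidate: str) -> float:
    """Mean log-probability of the candidate's WordPiece tokens when the
    [MASK] placeholder in `text` is expanded to one mask per WordPiece."""
    cand_ids = tokenizer.encode(candidate, add_special_tokens=False)
    expanded = text.replace("[MASK]",
                            " ".join([tokenizer.mask_token] * len(cand_ids)))
    input_ids = tokenizer.encode(expanded, return_tensors="pt")
    with torch.no_grad():
        logits = model(input_ids).logits[0]
    log_probs = torch.log_softmax(logits, dim=-1)
    # Positions of the mask tokens, in order, paired with the candidate ids.
    mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
    return sum(log_probs[p, t].item()
               for p, t in zip(mask_pos, cand_ids)) / len(cand_ids)

sentence = ("The city councilmen refused the demonstrators a permit "
            "because [MASK] feared violence.")
candidates = ["the city councilmen", "the demonstrators"]
print(max(candidates, key=lambda c: candidate_log_prob(sentence, c)))
```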
In this work, we address the lack of large training sets for pronoun disambiguation by introducing a large dataset that can be easily extended. To generate this dataset, we find passages of text where a personal name appears at least twice and mask one of its non-first occurrences. To make the disambiguation task more challenging, we also ensure that at least one other distinct personal name is present in the text in a position before the masked occurrence. We instantiate our method on English Wikipedia and generate the Wikipedia CoREferences Masked (WIKICREM) dataset with 2.4M examples, which we make publicly available for further usage [1]. We show its value by using it to fine-tune the BERT language model (Devlin et al., 2018) for pronoun resolution.

[1] The code can be found at https://github.com/vid-koci/bert-commonsense. The dataset and the models can be obtained from https://ora.ox.ac.uk/objects/uuid:c83e94bb-7584-41a1-aef9-85b0e764d9e3

To show the usefulness of our dataset, we train several models that cover three real-world scenarios: (1) when the target data distribution is completely unknown, (2) when training data from the target distribution is available, and (3) the transductive scenario, where the unlabeled test data is available at training time. We show that fine-tuning BERT with WIKICREM consistently improves the model in each of the three scenarios, when evaluated on a collection of 7 datasets. For example, we outperform the state-of-the-art approaches on GAP (Webster et al., 2018), DPR (Rahman and Ng, 2012), and PDP (Davis et al., 2017) by 5.9%, 8.4%, and 12.7%, respectively. Additionally, models trained with WIKICREM show increased performance and reduced bias on the gender diagnostic datasets WINOGENDER (Rudinger et al., 2018) and WINOBIAS (Zhao et al., 2018).

2 Related Work

There are several large and commonly used benchmarks for coreference resolution (e.g., Pradhan et al., 2012; Schäfer et al., 2012; Ghaddar and Langlais, 2016). However, Webster et al. (2018) argue that high performance on these datasets does not correlate with high accuracy in practice, because examples where the answer cannot be deduced from the syntax (we refer to them as hard pronoun resolution) are underrepresented. Therefore, several hard pronoun resolution datasets have been introduced (Webster et al., 2018; Rahman and Ng, 2012; Rudinger et al., 2018; Davis et al., 2017; Zhao et al., 2018; Emami et al., 2019). However, they are all relatively small, often created only as a test set.

Consequently, most of the pronoun resolution models that address hard pronoun resolution rely on little (Liu et al., 2019) or no training data, via unsupervised pre-training (Trinh and Le, 2018; Radford et al., 2019). Another approach involves using external knowledge bases (Emami et al., 2018; Fähndrich et al., 2018); however, the accuracy of these models still lags behind that of the aforementioned pre-trained models.

A similar approach to ours for unsupervised data generation and language-model-based evaluation has recently been presented in our previous work (Kocijan et al., 2019). There, we generated MASKEDWIKI, a large unsupervised dataset created by searching for repeated occurrences of nouns. However, training on MASKEDWIKI on its own is not always enough and sometimes makes a difference only in combination with additional training on the DPR dataset (called WSCR) (Rahman and Ng, 2012). In contrast, WIKICREM brings a much more consistent improvement over a wider range of datasets, strongly improving models' performance even when they are not fine-tuned on additional data. As opposed to our previous work (Kocijan et al., 2019), we evaluate models on a larger collection of test sets, showing the usefulness of WIKICREM beyond the Winograd Schema Challenge.

Moreover, a manual comparison of WIKICREM and MASKEDWIKI (Kocijan et al., 2019) shows a significant difference in the quality of the examples. We annotated 100 random examples from each dataset. In MASKEDWIKI, we looked for examples where the masked noun can be replaced with a pronoun, and only in 7 examples did we obtain a natural-sounding and grammatically correct sentence. In contrast, we estimated that 63% of the annotated examples in WIKICREM form a natural-sounding sentence when the appropriate pronoun is inserted, showing that WIKICREM consists of examples that are much closer to the target data. We highlight that pronouns are not actually inserted into the sentences, and thus none of the examples sound unnatural; this analysis was performed to show that the data distribution of WIKICREM is closer to the target tasks than that of MASKEDWIKI.

3 The WIKICREM Dataset

In this section, we describe how we obtained WIKICREM. Starting from English Wikipedia [2], we search for sentences and pairs of sentences with the following properties: at least two distinct personal names appear in the text, and one of them is repeated. To collect concise examples only, we do not use pieces of text longer than two sentences. Personal names in the text are called "candidates". One non-first occurrence of the repeated candidate is masked, and the goal is to predict the masked name, given the correct and one incorrect candidate. If the sentence contains more than one incorrect candidate, several datapoints are constructed, one for each incorrect candidate.

[2] https://dumps.wikimedia.org/enwiki/, dump id: enwiki-20181201
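A minimal sketch of this construction is shown below: a non-first occurrence of a repeated name is masked, and each distinct name that precedes the mask yields one datapoint as the incorrect candidate. Name detection (e.g., with an off-the-shelf NER tagger) and the sentence-level filters described next are omitted, and all function and field names are illustrative, not those of the released pipeline.

```python
import re
from typing import Dict, List, Tuple

def make_examples(text: str, names: List[Tuple[str, int]]) -> List[Dict]:
    """Build WikiCREM-style datapoints from a short piece of text.

    `names` lists the detected personal names as (surface form, char offset),
    in order of appearance. For simplicity, this sketch masks every non-first
    occurrence of a repeated name; the released dataset masks one such
    occurrence per example.
    """
    examples = []
    seen = set()
    for i, (name, offset) in enumerate(names):
        if name not in seen:          # first occurrence: never masked
            seen.add(name)
            continue
        masked = text[:offset] + "[MASK]" + text[offset + len(name):]
        # One datapoint per distinct incorrect candidate seen before the mask.
        for wrong in {n for n, _ in names[:i] if n != name}:
            examples.append({"text": masked, "correct": name, "wrong": wrong})
    return examples

demo = ("When asked about Adams' report, Powell found many of the statements "
        "to be inaccurate, including a claim that Adams first surveyed the area.")
spans = [(m.group(), m.start()) for m in re.finditer(r"Adams|Powell", demo)]
print(make_examples(demo, spans))
```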
We ensure that the alternative candidate appears before the masked-out name in the text, in order to avoid trivial examples. Thus, an example is retained in the dataset if:
(a) the repeated name appears after both candidates, all in a single sentence; or
(b) both candidates appear in a single sentence, and the repeated name appears in the sentence directly following.
Examples where one of the candidates appears in the same sentence as the repeated name, while the other candidate does not, are discarded, as they are often too trivial.

We illustrate the procedure with the following example:

"When asked about Adams' report, Powell found many of the statements to be inaccurate, including a claim that Adams first surveyed an area that was surveyed in 1857 by Joseph C."

The second occurrence of "Adams" is masked. The goal is to determine which of the two candidates ("Adams", "Powell") has been masked out. The masking process resembles replacing a name with a pronoun, but the pronoun is not inserted, to keep the process fully unsupervised and error-free.

Although GAP is also collected from English Wikipedia, the two collection processes differ. In GAP, passages with pronouns are collected and the pronouns are manually annotated, while WIKICREM is generated by masking names that appear in the text. Even if the same text is used, the different masking processes will result in different inputs and outputs, making the examples different under the transductive hypothesis.

WIKICREM statistics. We analyze our dataset for gender bias. We use the Gender guesser library to determine the gender of the candidates. To mimic the analysis of pronoun genders performed in the related works (Webster et al., 2018;
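Assuming the text refers to the gender-guesser Python package (an assumption on our part; the paper does not show its analysis code), the gender of each candidate can be guessed as in the sketch below.

```python
from collections import Counter
import gender_guesser.detector as gender

detector = gender.Detector(case_sensitive=False)

def guess_gender(candidate: str) -> str:
    """Guess gender from the first given name of a candidate.
    Returns: male, female, mostly_male, mostly_female, andy, or unknown."""
    return detector.get_gender(candidate.split()[0])

# Hypothetical candidate list; in practice this iterates over the dataset.
candidates = ["Mary Shelley", "John Adams", "Powell"]
print(Counter(guess_gender(c) for c in candidates))
```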