arXiv:2004.06870v2 [cs.CL] 6 Oct 2020

Coreferential Reasoning Learning for Language Representation

Deming Ye1,2, Yankai Lin4, Jiaju Du1,2, Zhenghao Liu1,2, Peng Li4, Maosong Sun1,3, Zhiyuan Liu1,3

1 Department of Computer Science and Technology, Tsinghua University, Beijing, China
  Institute for Artificial Intelligence, Tsinghua University, Beijing, China
  Beijing National Research Center for Information Science and Technology
2 State Key Lab on Intelligent Technology and Systems, Tsinghua University, Beijing, China
3 Beijing Academy of Artificial Intelligence
4 Pattern Recognition Center, WeChat AI, Tencent Inc.
[email protected]

Abstract

Language representation models such as BERT can effectively capture contextual semantic information from plain text, and have been shown to achieve promising results on many downstream NLP tasks with appropriate fine-tuning. However, most existing language representation models cannot explicitly handle coreference, which is essential to the coherent understanding of the whole discourse. To address this issue, we present CorefBERT, a novel language representation model that can capture the coreferential relations in context. The experimental results show that, compared with existing baseline models, CorefBERT can achieve significant improvements consistently on various downstream NLP tasks that require coreferential reasoning, while maintaining comparable performance to previous models on other common NLP tasks. The source code and experiment details of this paper can be obtained from https://github.com/thunlp/CorefBERT.

1 Introduction

Recently, language representation models such as BERT (Devlin et al., 2019) have attracted considerable attention. These models usually conduct self-supervised pre-training tasks over large-scale corpora to obtain informative language representations, which can capture the contextual semantics of the input text. Benefiting from this, language representation models have made significant strides in many natural language understanding tasks including natural language inference (Zhang et al., 2020), sentiment classification (Sun et al., 2019b), question answering (Talmor and Berant, 2019), relation extraction (Peters et al., 2019), fact extraction and verification (Zhou et al., 2019), and coreference resolution (Joshi et al., 2019).

However, existing pre-training tasks, such as masked language modeling, usually only require models to collect local semantic and syntactic information to recover the masked tokens. Hence, language representation models may not model well the long-distance connections beyond sentence boundaries in a text, such as coreference. Previous work has shown that the performance of these models is not as good as human performance on tasks requiring coreferential reasoning (Paperno et al., 2016; Dasigi et al., 2019), and that they can be further improved on long-text tasks with external coreference information (Cheng and Erk, 2020; Xu et al., 2020; Zhao et al., 2020).

Coreference occurs when two or more expressions in a text refer to the same entity, which is an important element for a coherent understanding of the whole discourse. For example, to comprehend the whole context of “Antoine published The Little Prince in 1943. The book follows a young prince who visits various planets in space.”, we must realize that The book refers to The Little Prince. Therefore, resolving coreference is an essential step for many higher-level NLP tasks requiring full-text understanding.

To improve the capability of coreferential reasoning for language representation models, a straightforward solution is to fine-tune these models on supervised coreference resolution data. Nevertheless, on the one hand, we find in our preliminary experiments that fine-tuning on existing small coreference datasets cannot improve the model performance on downstream tasks. On the other hand, it is impractical to obtain a large-scale supervised coreference dataset.

To address this issue, we present CorefBERT, a language representation model designed to better capture and represent the coreference information. To learn coreferential reasoning ability from large-scale unlabeled corpora, CorefBERT introduces a novel pre-training task called Mention Reference Prediction (MRP). MRP leverages repeated mentions (e.g., nouns or noun phrases) that appear multiple times in the passage to acquire abundant co-referring relations. Among the repeated mentions in a passage, MRP applies a mention reference masking strategy to mask one or several mentions and requires the model to predict the masked mention’s corresponding referents. Figure 1 shows an example of the MRP task: we substitute one of the repeated mentions, Claire, with [MASK] and ask the model to find the proper contextual candidate for filling it. To explicitly model the coreference information, we further introduce a copy-based training objective to encourage the model to select words from the context instead of the whole vocabulary. The internal logic of our method is essentially similar to that of coreference resolution, which aims to find all the mentions that refer to the masked mentions in a text. Besides, rather than using a context-free word embedding matrix when predicting words from the vocabulary, copying from context encourages the model to generate more context-sensitive representations, which are better suited to modeling coreferential reasoning.

[Figure 1 diagram: the input “Jane presents evidence against Claire but [MASK] presents a strong [MASK]” is encoded by a multi-layer Transformer. Masked positions are scored against contextual candidates and vocabulary candidates, and the loss ℒ sums negative log-probability terms grouped as ℒ_MRP and ℒ_MLM.]

Figure 1: An illustration of CorefBERT’s training process. In this example, the second Claire and a common word defense are masked. The overall loss of Claire consists of the loss of both Mention Reference Prediction (MRP) and Masked Language Modeling (MLM). MRP requires the model to select contextual candidates to recover the masked tokens, while MLM asks the model to choose from vocabulary candidates. In addition, we also sample some other tokens, such as defense in the figure, which are trained only with the MLM loss.

We conduct experiments on a suite of downstream tasks which require coreferential reasoning in language understanding, including extractive question answering, relation extraction, fact extraction and verification, and coreference resolution. The results show that CorefBERT outperforms the vanilla BERT on almost all benchmarks and even strengthens the performance of the strong RoBERTa model. To verify the model’s robustness, we also evaluate CorefBERT on other common NLP tasks, where CorefBERT still achieves comparable results to BERT. This demonstrates that introducing the new pre-training task on coreferential reasoning does not impair BERT’s ability in common language understanding.

2 Related Work

Pre-training language representation models aim to capture language information from text, which facilitates various downstream NLP applications (Kim, 2014; Lin et al., 2016; Seo et al., 2017). Early works (Mikolov et al., 2013; Pennington et al., 2014) focus on learning static word embeddings from unlabeled corpora, which have the limitation that they cannot handle polysemy well. In recent years, contextual language representation models pre-trained on large-scale unlabeled corpora have attracted intensive attention and efforts from both academia and industry. SA-LSTM (Dai and Le, 2015) and ULMFiT (Howard and Ruder, 2018) pre-train language models on unlabeled text and perform task-specific fine-tuning. ELMo (Peters et al., 2018) further employs a bidirectional LSTM-based language model to extract context-aware word embeddings. Moreover, OpenAI GPT (Radford et al., 2018) and BERT (Devlin et al., 2019) learn pre-trained language representations with the Transformer architecture (Vaswani et al., 2017), achieving state-of-the-art results on various NLP tasks. Beyond them, various improvements on pre-training language representation have been proposed more recently, including (1) designing new pre-training tasks or objectives such as SpanBERT (Joshi et al., 2020) with span-based learning, XLNet (Yang et al., 2019) considering masked positions dependency with auto-regressive loss, MASS (Song et al., 2019) and BART (Wang et al., 2019b) with sequence-to-sequence pre-training, ELECTRA (Clark et al., 2020) learning from replaced token detection with generative adversarial networks and InfoWord (Kong et al., 2020) with contrastive learning; (2) integrating external knowledge such as factual knowledge in knowledge graphs (Zhang et al., 2019; Peters et al., 2019; Liu et al., 2020a); and (3) exploring multilingual learning (Conneau and Lample, 2019; Tan and Bansal, 2019; Kondratyuk and Straka, 2019) or multimodal learning (Lu et

mention reference masking strategy to mask one of the repeated mentions and then employs a copy-based training objective to predict the masked tokens by copying from other tokens in the sequence.
(2) Masked Language Modeling (MLM)¹ was proposed in vanilla BERT (Devlin et al., 2019), aiming to learn general language understanding. MLM is regarded as a kind of cloze task and aims to predict the missing tokens according to their final contextual representations. Besides MLM, Next Sentence Prediction (NSP) is also commonly used
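As a rough illustration of the Mention Reference Prediction idea described above, the following minimal sketch masks one occurrence of a repeated mention and scores the masked position against the other context positions, adding an MLM term over a toy vocabulary. It is not the authors' implementation: the function names (`repeated_mentions`, `mask_one_mention`, `mrp_loss`, `mlm_loss`), the capitalization heuristic standing in for noun detection, the plain dot-product scoring, and the random vectors standing in for Transformer hidden states are all assumptions made for illustration.

```python
import math
import random
from collections import defaultdict

MASK = "[MASK]"

def repeated_mentions(tokens, is_mention):
    """Map each mention that occurs more than once to its positions."""
    positions = defaultdict(list)
    for i, tok in enumerate(tokens):
        if is_mention(tok):
            positions[tok].append(i)
    return {tok: idxs for tok, idxs in positions.items() if len(idxs) > 1}

def mask_one_mention(tokens, groups, rng):
    """Mask one occurrence of a repeated mention; the rest are its referents."""
    word = rng.choice(sorted(groups))
    target = rng.choice(groups[word])
    masked = list(tokens)
    masked[target] = MASK
    referents = [i for i in groups[word] if i != target]
    return masked, target, referents

def log_softmax(scores):
    top = max(scores)  # subtract the maximum for numerical stability
    log_z = top + math.log(sum(math.exp(s - top) for s in scores))
    return [s - log_z for s in scores]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def mrp_loss(hidden, target, referents):
    """Copy-based loss: -log of the probability mass on referent positions.

    Assumption: similarity is a plain dot product between hidden states.
    `referents` must be non-empty.
    """
    candidates = [j for j in range(len(hidden)) if j != target]
    log_probs = log_softmax([dot(hidden[target], hidden[j]) for j in candidates])
    mass = sum(math.exp(lp) for j, lp in zip(candidates, log_probs) if j in referents)
    return -math.log(mass)

def mlm_loss(hidden_vec, vocab_embeddings, gold_id):
    """Standard cloze loss: -log softmax over the whole vocabulary."""
    return -log_softmax([dot(hidden_vec, e) for e in vocab_embeddings])[gold_id]

tokens = "Jane presents evidence against Claire but Claire presents a strong defense".split()
# Hypothetical mention detector: capitalized tokens stand in for nouns.
groups = repeated_mentions(tokens, lambda tok: tok[0].isupper())
print(groups)  # {'Claire': [4, 6]}

rng = random.Random(0)
masked, target, referents = mask_one_mention(tokens, groups, rng)

# Random vectors stand in for the encoder's hidden states and word embeddings.
hidden = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in tokens]
vocab = sorted(set(tokens))
vocab_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in vocab]

# The masked mention is trained with both terms, as in the Figure 1 caption.
total = mrp_loss(hidden, target, referents) + mlm_loss(
    hidden[target], vocab_embeddings, vocab.index("Claire")
)
print(masked, round(total, 3))
```

In this sketch the MRP term normalizes over context positions only, while the MLM term normalizes over the vocabulary, which mirrors the contrast between contextual candidates and vocabulary candidates. A token masked without a repeated mention (such as defense in Figure 1) would receive only the `mlm_loss` term.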
