ERICA: Improving Entity and Relation Understanding for Pre-Trained Language Models Via Contrastive Learning
Yujia Qin♣♠♦, Yankai Lin}, Ryuichi Takanobu|}, Zhiyuan Liu♣∗, Peng Li}, Heng Ji♠∗, Minlie Huang|, Maosong Sun|, Jie Zhou}
|Department of Computer Science and Technology, Tsinghua University, Beijing, China
♠University of Illinois at Urbana-Champaign
}Pattern Recognition Center, WeChat AI, Tencent Inc.
[email protected]

Abstract

Pre-trained Language Models (PLMs) have shown superior performance on various downstream Natural Language Processing (NLP) tasks. However, conventional pre-training objectives do not explicitly model relational facts in text, which are crucial for textual understanding. To address this issue, we propose ERICA, a novel contrastive learning framework, to obtain a deep understanding of the entities and their relations in text. Specifically, we define two novel pre-training tasks to better understand entities and relations: (1) the entity discrimination task, which distinguishes which tail entity can be inferred from a given head entity and relation; and (2) the relation discrimination task, which distinguishes whether two relations are semantically close, and which involves complex relational reasoning. Experimental results demonstrate that ERICA can improve typical PLMs (BERT and RoBERTa) on several language understanding tasks, including relation extraction, entity typing and question answering, especially under low-resource settings.1

∗Corresponding author.
1Our code and data are publicly available at https://github.com/thunlp/ERICA.

Culiacán
[1] Culiacán is a city in northwestern Mexico. [2] Culiacán is the capital of the state of Sinaloa. [3] Culiacán is also the seat of Culiacán Municipality. [4] It had an urban population of 785,800 in 2015 while 905,660 lived in the entire municipality. [5] While Culiacán Municipality has a total area of 4,758 km², Culiacán itself is considerably smaller, measuring only. [6] Culiacán is a rail junction and is located on the Panamerican Highway that runs south to Guadalajara and Mexico City. [7] Culiacán is connected to the north with Los Mochis, and to the south with Mazatlán, Tepic.
Q: where is Guadalajara? A: Mexico.

Figure 1: An example for a document “Culiacán”, in which all entities are underlined. We show entities and their relations as a relational graph, and highlight the important entities and relations needed to find out “where is Guadalajara”.

1 Introduction

Pre-trained Language Models (PLMs) (Devlin et al., 2018; Yang et al., 2019; Liu et al., 2019) have shown superior performance on various Natural Language Processing (NLP) tasks such as text classification (Wang et al., 2018), named entity recognition (Sang and De Meulder, 2003), and question answering (Talmor and Berant, 2019). Benefiting from various effective self-supervised learning objectives, such as masked language modeling (Devlin et al., 2018), PLMs can effectively capture the syntax and semantics in text and generate informative language representations for downstream NLP tasks.

However, conventional pre-training objectives do not explicitly model relational facts, which appear frequently in text and are crucial for understanding the whole text. To address this issue, some recent studies attempt to improve PLMs to better understand relations between entities (Soares et al., 2019; Peng et al., 2020). However, they mainly focus on within-sentence relations in isolation, ignoring both the understanding of entities and the interactions among multiple entities at the document level, whose relation understanding involves complex reasoning patterns.
According to statistics on a human-annotated corpus sampled from Wikipedia documents by Yao et al. (2019), at least 40.7% of relational facts need to be extracted from multiple sentences. Specifically, we show an example in Figure 1: to understand that “Guadalajara is located in Mexico”, we need to consider the following clues jointly: (i) “Mexico” is the country of “Culiacán” from sentence 1; (ii) “Culiacán” is a rail junction located on the “Panamerican Highway” from sentence 6; (iii) the “Panamerican Highway” connects to “Guadalajara” from sentence 6. From this example, we can see that there are two main challenges in capturing in-text relational facts:

1. To understand an entity, we should consider its relations to other entities comprehensively. In the example, the entity “Culiacán”, occurring in sentences 1, 2, 3, 5, 6 and 7, plays an important role in finding the answer. To understand “Culiacán”, we should consider all its connected entities and the diverse relations among them.

2. To understand a relation, we should consider the complex reasoning patterns in text. For example, to follow the inference chain in the example, we need to perform multi-hop reasoning, i.e., inferring that the “Panamerican Highway” is located in “Mexico” through the first two clues.

In this paper, we propose ERICA, a novel framework to improve PLMs’ capability of Entity and RelatIon understanding via ContrAstive learning, aiming to better capture in-text relational facts by considering the interactions among entities and relations comprehensively. Specifically, we define two novel pre-training tasks: (1) the entity discrimination task, which distinguishes which tail entity can be inferred from a given head entity and relation; it improves the understanding of each entity by considering its relations to other entities in text; and (2) the relation discrimination task, which distinguishes whether two relations are semantically close; by constructing entity pairs with document-level distant supervision, it implicitly takes complex relational reasoning chains into account and thus improves relation understanding.

We conduct experiments on a suite of language understanding tasks, including relation extraction, entity typing and question answering. The experimental results show that ERICA improves the performance of typical PLMs (BERT and RoBERTa) and outperforms baselines, especially under low-resource settings, which demonstrates that ERICA effectively improves PLMs’ entity and relation understanding and captures in-text relational facts.
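As an illustration of the contrastive idea behind the entity discrimination task described above, the following PyTorch-style sketch scores each candidate entity in a document against the head entity and trains the ground-truth tail entity to be the closest one. The function name, the use of cosine similarity, the temperature value, and the InfoNCE-style cross-entropy formulation are illustrative assumptions, not the exact objective used by ERICA.

```python
import torch
import torch.nn.functional as F


def entity_discrimination_loss(head_rep, candidate_reps, gold_index, temperature=0.07):
    """Illustrative contrastive loss for the entity discrimination (ED) idea:
    the ground-truth tail entity should be closer to the head entity (here,
    by cosine similarity) than every other candidate entity in the document.

    head_rep:        (hidden_dim,) representation of the head entity.
    candidate_reps:  (num_candidates, hidden_dim) representations of all
                     candidate entities, including the ground-truth tail.
    gold_index:      position of the ground-truth tail among the candidates.
    """
    # Cosine similarity between the head entity and every candidate entity.
    sims = F.cosine_similarity(head_rep.unsqueeze(0), candidate_reps, dim=-1)
    # InfoNCE-style objective: treat the scaled similarities as logits over candidates.
    logits = (sims / temperature).unsqueeze(0)          # shape (1, num_candidates)
    target = torch.tensor([gold_index])                 # shape (1,)
    return F.cross_entropy(logits, target)


# Toy usage with random representations (hidden_dim = 768, 5 candidate entities).
loss = entity_discrimination_loss(torch.randn(768), torch.randn(5, 768), gold_index=2)
```

The candidate representations here are assumed to be the global entity representations defined later in Eq. (1).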
2 Related Work

Dai and Le (2015) and Howard and Ruder (2018) propose to pre-train universal language representations on unlabeled text and perform task-specific fine-tuning. With the advance of computing power, PLMs such as OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018) and XLNet (Yang et al., 2019), based on the deep Transformer (Vaswani et al., 2017) architecture, demonstrate their superiority on various downstream NLP tasks. Since then, numerous PLM extensions have been proposed to further explore the impact of different model architectures (Song et al., 2019; Raffel et al., 2020), larger model sizes (Raffel et al., 2020; Lan et al., 2020; Fedus et al., 2021), more pre-training corpora (Liu et al., 2019), etc., in order to obtain better general language understanding ability. Although achieving great success, these PLMs usually regard words as the basic units of textual understanding, ignoring the informative entities and their relations, which are crucial for understanding the whole text.

To improve the entity and relation understanding of PLMs, a typical line of work is knowledge-guided PLMs, which incorporate external knowledge such as Knowledge Graphs (KGs) into PLMs to enhance entity and relation understanding. Some enforce PLMs to memorize information about real-world entities and propose novel pre-training objectives (Xiong et al., 2019; Wang et al., 2019; Sun et al., 2020; Yamada et al., 2020). Others modify the internal structures of PLMs to fuse both textual and KG information (Zhang et al., 2019; Peters et al., 2019; Wang et al., 2020; He et al., 2020). Although knowledge-guided PLMs introduce extra factual knowledge from KGs, these methods ignore the intrinsic relational facts in text, making it hard to understand out-of-KG entities or knowledge in downstream tasks, let alone cope with the errors and incompleteness of KGs. This verifies the necessity of teaching PLMs to understand relational facts from context.

Another line of work directly models entities or relations in text during the pre-training stage to break the limitations of individual token representations. Some focus on obtaining better span representations, including entity mentions, via span-based pre-training (Sun et al., 2019; Joshi et al., 2020; Kong et al., 2020; Ye et al., 2020). Others learn to extract relation-aware semantics from text by comparing sentences that share the same entity pair or the same distantly supervised relation in a KG (Soares et al., 2019; Peng et al., 2020). However, these methods only consider either individual entities or within-sentence relations, which limits their performance in dealing with multiple entities and relations at the document level. In contrast, our ERICA considers the interactions among multiple entities and relations at the document level.

Figure 2: An example of the Entity Discrimination task. For an entity pair with its distantly supervised relation in text, the ED task requires the ground-truth tail entity to be closer to the head entity than other entities.

Given the hidden states {h_1, h_2, ..., h_{|d_i|}} produced by the PLM encoder for a document d_i, we apply a mean pooling operation over the consecutive tokens that mention an entity e_{ij} to obtain its local entity representations. Note that e_{ij} may appear multiple times in d_i; the k-th occurrence of e_{ij}, which spans the tokens from index n^k_{start} to n^k_{end}, is represented as

m^k_{e_{ij}} = \mathrm{MeanPool}(h_{n^k_{start}}, \ldots, h_{n^k_{end}}).   (1)

To aggregate all information about e_{ij}, we average the representations m^k_{e_{ij}} of all its occurrences to obtain the global entity representation e_{ij}. Following Soares et al. (2019), we concatenate the final representations of two entities e_{ij_1} and e_{ij_2} as their relation representation, i.e., r^i_{j_1 j_2} = [e_{ij_1}; e_{ij_2}].
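The following is a minimal PyTorch sketch of how these representations could be computed from a document's hidden states: local mention representations via mean pooling as in Eq. (1), global entity representations by averaging over occurrences, and relation representations by concatenation. The helper names, the (start, end) span convention with an inclusive end index, and the tensor layout are assumptions for illustration.

```python
import torch


def entity_and_relation_representations(hidden_states, mention_spans_per_entity):
    """Build local, global, and relation representations from PLM hidden states.

    hidden_states:             (doc_len, hidden_dim) tensor holding the PLM
                               outputs h_1, ..., h_{|d_i|} for one document d_i.
    mention_spans_per_entity:  list over entities e_ij; each element is a list of
                               (start, end) token indices (end inclusive), one
                               pair per occurrence of that entity in d_i.
    """
    global_reps = []
    for spans in mention_spans_per_entity:
        # Eq. (1): mean-pool the tokens of each occurrence to get a local
        # representation m^k for that mention.
        local_reps = [hidden_states[start:end + 1].mean(dim=0) for start, end in spans]
        # Average all occurrences to obtain the global entity representation e_ij.
        global_reps.append(torch.stack(local_reps).mean(dim=0))
    entity_reps = torch.stack(global_reps)               # (num_entities, hidden_dim)

    def relation_rep(j1, j2):
        # Relation representation as the concatenation [e_ij1 ; e_ij2].
        return torch.cat([entity_reps[j1], entity_reps[j2]], dim=-1)

    return entity_reps, relation_rep


# Toy usage: a 10-token document with 768-dim hidden states and two entities,
# the first mentioned twice and the second once.
hidden = torch.randn(10, 768)
entities, rel = entity_and_relation_representations(hidden, [[(0, 1), (6, 7)], [(3, 4)]])
r = rel(0, 1)   # (1536,) relation representation for the entity pair (0, 1)
```

One property of this concatenation is that it is order-sensitive, so the representation of (e_{ij_1}, e_{ij_2}) differs from that of (e_{ij_2}, e_{ij_1}), which is consistent with relations having distinct head and tail entities.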