ERICA: Improving Entity and Relation Understanding for Pre-trained Language Models via Contrastive Learning

Yujia Qin♣♠♦, Yankai Lin♦, Ryuichi Takanobu♣♦, Zhiyuan ♣∗, Peng Li♦, Heng Ji♠∗, Minlie Huang♣, Maosong Sun♣, Jie Zhou♦ ♣Department of Computer Science and Technology, Tsinghua University, , China ♠University of Illinois at Urbana-Champaign ♦Pattern Recognition Center, WeChat AI, Tencent Inc. [email protected]

Abstract Culiacán [1] Culiacán is a city in northwestern Mexico. [2] Culiacán is Pre-trained Language Models (PLMs) have the capital of the state of Sinaloa. [3] Culiacán is also the seat of Culiacán Municipality. [4] It had an urban population of shown superior performance on various down- 785,800 in 2015 while 905,660 lived in the entire municipality. stream Natural Language Processing (NLP) [5] While Culiacán Municipality has a total area of 4,758 k!!, tasks. However, conventional pre-training ob- Culiacán itself is considerably smaller, measuring only. [6] Culiacán is a rail junction and is located on the Panamerican jectives do not explicitly model relational facts Highway that runs south to Guadalajara and Mexico City. [7] in text, which are crucial for textual under- Culiacán is connected to the north with Los Mochis, and to the standing. To address this issue, we propose a south with Mazatlán, Tepic. novel contrastive learning framework ERICA Q: where is Guadalajara? A: Mexico. locate on to obtain a deep understanding of the entities Culiacán Panamerican Highway

and their relations in text. Specifically, we de- Culiacán

Municipality Los Mochis south to city of Mexico City fine two novel pre-training tasks to better un- Sinaloa derstand entities and relations: (1) the entity Mexico Guadalajara discrimination task to distinguish which tail entity can be inferred by the given head en- Figure 1: An example for a document “Culiacán”, in tity and relation; (2) the relation discrimination which all entities are underlined. We show entities and task to distinguish whether two relations are their relations as a relational graph, and highlight the close or not semantically, which involves com- important entities and relations to find out “where is plex relational reasoning. Experimental results Guadalajara”. demonstrate that ERICA can improve typical PLMs (BERT and RoBERTa) on several lan- guage understanding tasks, including relation extraction, entity typing and question answer- However, conventional pre-training objectives ing, especially under low-resource settings.1 do not explicitly model relational facts, which fre- quently distribute in text and are crucial for under- 1 Introduction standing the whole text. To address this issue, some Pre-trained Language Models (PLMs) (Devlin recent studies attempt to improve PLMs to better et al., 2018; Yang et al., 2019; Liu et al., 2019) have understand relations between entities (Soares et al., shown superior performance on various Natural 2019; Peng et al., 2020). However, they mainly Language Processing (NLP) tasks such as text clas- focus on within-sentence relations in isolation, ig- sification (Wang et al., 2018), named entity recog- noring the understanding of entities, and the inter- nition (Sang and De Meulder, 2003), and question actions among multiple entities at document level, answering (Talmor and Berant, 2019). Benefiting whose relation understanding involves complex rea- from designing various effective self-supervised soning patterns. According to the statistics on a learning objectives, such as masked language mod- human-annotated corpus sampled from Wikipedia eling (Devlin et al., 2018), PLMs can effectively documents by Yao et al.(2019), at least 40.7% re- capture the syntax and semantics in text to gener- lational facts require to be extracted from multiple ate informative language representations for down- sentences. Specifically, we show an example in Fig- stream NLP tasks. ure1, to understand that “Guadalajara is located in Mexico”, we need to consider the following clues ∗Corresponding author. 1Our code and data are publicly available at https:// jointly: (i) “Mexico” is the country of “Culiacán” github.com/thunlp/ERICA. from sentence 1; (ii) “Culiacán” is a rail junction lo-

3350

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, pages 3350–3363 August 1–6, 2021. ©2021 Association for Computational Linguistics cated on “Panamerican Highway” from sentence 6; BERT (Devlin et al., 2018) and XLNet (Yang et al., (iii) “Panamerican Highway” connects to “Guadala- 2019) based on deep Transformer (Vaswani et al., jara” from sentence 6. From the example, we can 2017) architecture demonstrate their superiority in see that there are two main challenges to capture various downstream NLP tasks. Since then, nu- the in-text relational facts: merous PLM extensions have been proposed to 1. To understand an entity, we should consider further explore the impacts of various model ar- its relations to other entities comprehensively. In chitectures (Song et al., 2019; Raffel et al., 2020), the example, the entity “Culiacán”, occurring in larger model size (Raffel et al., 2020; Lan et al., sentence 1, 2, 3, 5, 6 and 7, plays an important 2020; Fedus et al., 2021), more pre-training cor- role in finding out the answer. To understand “Culi- pora (Liu et al., 2019), etc., to obtain better general acán”, we should consider all its connected entities language understanding ability. Although achiev- and diverse relations among them. ing great success, these PLMs usually regard words 2. To understand a relation, we should consider as basic units in textual understanding, ignoring the the complex reasoning patterns in text. For exam- informative entities and their relations, which are ple, to understand the complex inference chain in crucial for understanding the whole text. the example, we need to perform multi-hop reason- To improve the entity and relation understand- ing, i.e., inferring that “Panamerican Highway” is ing of PLMs, a typical line of work is knowledge- located in “Mexico” through the first two clues. guided PLM, which incorporates external knowl- In this paper, we propose ERICA, a novel frame- edge such as Knowledge Graphs (KGs) into PLMs work to improve PLMs’ capability of Entity and to enhance the entity and relation understanding. RelatIon understanding via ContrAstive learning, Some enforce PLMs to memorize information aiming to better capture in-text relational facts by about real-world entities and propose novel pre- considering the interactions among entities and re- training objectives (Xiong et al., 2019; Wang et al., lations comprehensively. Specifically, we define 2019; Sun et al., 2020; Yamada et al., 2020). Oth- two novel pre-training tasks: (1) the entity discrim- ers modify the internal structures of PLMs to fuse ination task to distinguish which tail entity can both textual and KG’s information (Zhang et al., be inferred by the given head entity and relation. 2019; Peters et al., 2019; Wang et al., 2020; He It improves the understanding of each entity via et al., 2020). Although knowledge-guided PLMs considering its relations to other entities in text; introduce extra factual knowledge in KGs, these (2) the relation discrimination task to distinguish methods ignore the intrinsic relational facts in text, whether two relations are close or not semantically. making it hard to understand out-of-KG entities or Through constructing entity pairs with document- knowledge in downstream tasks, let alone the errors level distant supervision, it takes complex relational and incompleteness of KGs. This verifies the ne- reasoning chains into consideration in an implicit cessity of teaching PLMs to understand relational way and thus improves relation understanding. facts from contexts. We conduct experiments on a suite of language understanding tasks, including relation extraction, Another line of work is to directly model entities entity typing and question answering. The experi- or relations in text in pre-training stage to break mental results show that ERICA improves the per- the limitations of individual token representations. formance of typical PLMs (BERT and RoBERTa) Some focus on obtaining better span representa- and outperforms baselines, especially under low- tions, including entity mentions, via span-based resource settings, which demonstrates that ERICA pre-training (Sun et al., 2019; Joshi et al., 2020; effectively improves PLMs’ entity and relation un- Kong et al., 2020; Ye et al., 2020). Others learn derstanding and captures the in-text relational facts. to extract relation-aware semantics from text by comparing the sentences that share the same entity 2 Related Work pair or distantly supervised relation in KGs (Soares et al., 2019; Peng et al., 2020). However, these Dai and Le(2015) and Howard and Ruder(2018) methods only consider either individual entities or propose to pre-train universal language representa- within-sentence relations, which limits the perfor- tions on unlabeled text, and perform task-specific mance in dealing with multiple entities and rela- fine-tuning. With the advance of computing power, tions at document level. In contrast, our ERICA PLMs such as OpenAI GPT (Radford et al., 2018), considers the interactions among multiple entities

3351 {h1, h2, ..., h|di|}, then we apply mean pooling op- eration over the consecutive tokens that mention eij to obtain local entity representations. Note eij may appear multiple times in di, the k-th occurrence of k eij, which contains the tokens from index nstart to k nend, is represented as:

k m = MeanPool(h k , ..., h k ). (1) eij nstart nend

To aggregate all information about eij, we aver- Figure 2: An example of Entity Discrimination task. age2 all representations of each occurrence mk For an entity pair with its distantly supervised relation eij as the global entity representation e . Follow- in text, the ED task requires the ground-truth tail entity ij to be closer to the head entity than other entities. ing Soares et al.(2019), we concatenate the final representations of two entities eij1 and eij2 as their ri = [e ; e ] relation representation, i.e., j1j2 ij1 ij2 . and relations comprehensively, achieving a better understanding of in-text relational facts. 3.3 Entity Discrimination Task Entity Discrimination (ED) task aims at inferring 3 Methodology the tail entity in a document given a head entity and a relation. By distinguishing the ground-truth In this section, we introduce the details of ERICA. tail entity from other entities in the text, it teaches We first describe the notations and how to represent PLMs to understand an entity via considering its entities and relations in documents. Then we detail relations with other entities. the two novel pre-training tasks: Entity Discrimi- As shown in Figure2, in practice, we first nation (ED) task and Relation Discrimination (RD) sample a tuple ti = (d , e , ri , e ) from T +, task, followed by the overall training objective. jk i ij jk ik PLMs are then asked to distinguish the ground- 3.1 Notations truth tail entity eik from other entities in the document di. To inform PLMs of which head ERICA is trained on a large-scale unlabeled cor- entity and relation to be conditioned on, we pus leveraging the distant supervision from an ex- i concatenate the relation name of rjk, the men- ternal KG K. Formally, let D = {d }|D| be a i i=1 tion of head entity eij and a separation token E = {e }|Ei| ∗ batch of documents and i ij j=1 be all [SEP] in front of di, i.e., di =“relation_name 3 named entities in di, where eij is the j-th entity entity_mention[SEP] di” . The goal of entity in di. For each document di, we enumerate all discrimination task is equivalent to maximizing the entity pairs (e , e ) and link them to their corre- i ij ik posterior P(eik|eij, rjk) = softmax(f(eik)) (f(·) i sponding relation rjk in K (if possible) and obtain indicates an entity classifier). However, we empiri- i i a tuple set Ti = {tjk = (di, eij, rjk, eik)|j 6= k}. cally find directly optimizing the posterior cannot We assign no_relation to those entity pairs with- well consider the relations among entities. Hence, out relation annotation in K. Then we obtain the we borrow the idea of contrastive learning (Hadsell S S S overall tuple set T = T1 T2 ... T|D| for this et al., 2006) and push the representations of pos- + batch. The positive tuple set T is constructed itive pair (eij, eik) closer than negative pairs, the by removing all tuples with no_relation from loss function of ED task can be formulated as: T . Benefiting from document-level distant su- + X exp(cos(eij, eik)/τ) pervision, T includes both intra-sentence (rel- LED = − log , |Ei| atively simple cases) and inter-sentence entity pairs ti ∈T + P jk exp(cos(eij, eil)/τ) (hard cases), whose relation understanding involves l=1, l6=j cross-sentence, multi-hop, or coreferential reason- (2) ing, i.e., T + = T + S T + . single cross 2Although weighted summation by attention mechanism is an alternative, the specific method of entity information 3.2 Entity & Relation Representation aggregation is not our main concern. 3 ∗ Here we encode the modified document di to obtain the For each document di, we first use a PLM to entity representations. The newly added entity_mention is encode it and obtain a series of hidden states not considered for head entity representation.

3352 Document 1 single-sentence where N is a hyper-parameter. We ensure t is … Since 1773, when the Royal Swedish Opera B was founded by Gustav III of Sweden … founded by sampled in Z and construct N − 1 negative exam- Document 2 cross-sentence ples by sampling tC (rA 6= rC ) from T , instead … Gates is an American business magnate, +5

software developer, and philanthropist … He left Pre of T . By additionally considering the last three

his board positions at Microsoft … - trained Language Modeltrained terms of LRD in Eq.3, which require the model to Document 3 single-sentence distinguish complex inter-sentence relations with

… Samarinda is the capital of East Kalimantan, founded by Indonesia, on the island of Borneo … Samarinda other relations in the text, our model could have bet- is known for its traditional food amplang, as well capital of as the cloth Sarung Samarinda … ter coverage and generality of the reasoning chains. country PLMs are trained to perform reasoning in an im- Document 3 cross-sentence

… Samarinda is the capital of East Kalimantan, plicit way to understand those “hard” inter-sentence Indonesia, on the island of Borneo … Samarinda is known for its traditional food amplang, as well cases. as the cloth Sarung Samarinda … 3.5 Overall Objective Figure 3: An example of Relation Discrimination task. For entity pairs belonging to the same relations, the RD Now we present the overall training objective task requires their relation representations to be closer. of ERICA. To avoid catastrophic forgetting (Mc- Closkey and Cohen, 1989) of general language understanding ability, we train masked language where cos(·, ·) denotes the cosine similarity be- modeling task (LMLM) together with ED and RD tween two entity representations and τ (temper- tasks. Hence, the overall learning objective is for- ature) is a hyper-parameter. mulated as follows: 3.4 Relation Discrimination Task L = L + L + L . (4) Relation Discrimination (RD) task aims at distin- ED RD MLM guishing whether two relations are close or not It is worth mentioning that we also try to mask semantically. Compared with existing relation- entities as suggested by Soares et al.(2019) and enhanced PLMs, we employ document-level rather Peng et al.(2020), aiming to avoid simply relearn- than sentence-level distant supervision to further ing an entity linking system. However, we do not make PLMs comprehend the complex reasoning observe performance gain by such a masking strat- chains in real-world scenarios and thus improve egy. We conjecture that in our document-level set- PLMs’ relation understanding. ting, it is hard for PLMs to overfit on memoriz- As depicted in Figure3, we train the text-based ing entity mentions due to the better coverage and relation representations of the entity pairs that share generality of document-level distant supervision. the same relations to be closer in the semantic Besides, masking entities creates a gap between space. In practice, we linearly4 sample a tuple pre-training and fine-tuning, which may be a short- pair tA = (dA, eA1 , rA, eA2 ) and tB = (dB, eB1 , coming of previous relation-enhanced PLMs. + + + + rB, eB2 ) from Ts (Tsingle) or Tc (Tcross), where rA = rB. Using the method mentioned in Sec. 3.2, 4 Experiments we obtain the positive relation representations r tA In this section, we first describe how we construct and r for t and t . To discriminate positive tB A B the distantly supervised dataset and pre-training examples from negative ones, similarly, we adopt details for ERICA. Then we introduce the experi- contrastive learning and define the loss function of ments we conduct on several language understand- RD task as follows: ing tasks, including relation extraction (RE), en-

T ,T X exp(cos(rt , rt )/τ) tity typing (ET) and question answering (QA). L 1 2 = − log A B , RD Z We test ERICA on two typical PLMs, including tA∈T1,tB ∈T2 BERT and RoBERTa (denoted as ERICABERT and N 6 X ERICARoBERTa) . We leave the training details Z = exp(cos(rtA , rtC )/τ), 5 tC ∈T /{tA} In experiments, we find introducing no_relation entity + + + + + + + + pairs as negative samples further improves the performance Ts ,Ts Ts ,Tc Tc ,Ts Tc ,Tc LRD = LRD + LRD + LRD + LRD , and the reason is that increasing the diversity of training entity (3) pairs is beneficial to PLMs. 6Since our main focus is to demonstrate the superiority 4The sampling rate of each relation is proportional to its of ERICA in improving PLMs to capture relational facts and total number in the current batch. advance further research explorations, we choose base models

3353 for downstream tasks and experiments on GLUE Size 1% 10% 100% benchmark (Wang et al., 2018) in the appendix. Metrics F1 IgF1 F1 IgF1 F1 IgF1 CNN - - 42.3 40.3 4.1 Distantly Supervised Dataset BILSTM - - 51.1 50.3 Construction BERT 30.4 28.9 47.1 44.9 56.8 54.5 HINBERT - - 55.6 53.7 Following Yao et al.(2019), we construct our pre- CorefBERT 32.8 31.2 46.0 43.7 57.0 54.5 training dataset leveraging distant supervision from SpanBERT 32.2 30.4 46.4 44.5 57.3 55.0 the English Wikipedia and Wikidata. First, we ERNIE 26.7 25.5 46.7 44.2 56.6 54.2 7 MTB 29.0 27.6 46.1 44.1 56.9 54.3 use spaCy to perform Named Entity Recognition, CP 30.3 28.7 44.8 42.6 55.2 52.7 and then link these entity mentions as well as ERICABERT 37.8 36.0 50.8 48.3 58.2 55.9 Wikipedia’s mentions with hyper-links to Wikidata RoBERTa 35.3 33.5 48.0 45.9 58.5 56.1 items, thus we obtain the Wikidata ID for each en- ERICARoBERTa 40.1 38.0 50.3 48.3 59.0 56.6 tity. The relations between different entities are Table 1: Results on document-level RE (DocRED). We annotated distantly by querying Wikidata. We keep report micro F1 (F1) and micro ignore F1 (IgF1) on test the documents containing at least 128 words, 4 set. IgF1 metric ignores the relational facts shared by entities and 4 relational triples. In addition, we the train and dev/test sets. ignore those entity pairs appearing in the test sets of RE and QA tasks to avoid test set leakage. In Dataset TACRED SemEval 1, 000, 000 the end, we collect documents (about Size 1% 10% 100% 1% 10% 100% 1G storage) in total with more than 4, 000 relations BERT 36.0 58.5 68.1 43.6 79.3 88.1 annotated distantly. On average, each document MTB 35.7 58.8 68.2 44.2 79.2 88.2 contains 186.9 tokens, 12.9 entities and 7.2 rela- CP 37.1 60.6 68.1 40.3 80.0 88.5 tional triples, an entity appears 1.3 times per docu- ERICABERT 36.5 59.7 68.5 47.9 80.1 88.0 ment. Based on the human evaluation on a random RoBERTa 26.3 61.2 69.7 46.0 80.3 88.8 sample of the dataset, we find that it achieves an F1 ERICARoBERTa 40.0 61.9 69.8 46.3 80.4 89.2 score of 84.7% for named entity recognition, and Table 2: Results (test F1) on sentence-level RE (TA- an F1 score of 25.4% for relation extraction. CRED and SemEval-2010 Task8) on three splits (1%, 10% and 100%). 4.2 Pre-training Details

We initialize ERICABERT and ERICARoBERTa with three partitions of the training set (1%, 10% and bert-base-uncased and roberta-base checkpoints 100%) and report results on test sets. released by Google8 and Huggingface9. We adopt AdamW (Loshchilov and Hutter, 2017) as the opti- Document-level RE For document-level RE, we mizer, warm up the learning rate for the first 20% choose DocRED (Yao et al., 2019), which requires steps and then linearly decay it. We set the learning reading multiple sentences in a document and syn- rate to 3 × 10−5, weight decay to 1 × 10−5, batch thesizing all the information to identify the relation size to 2, 048 and temperature τ to 5 × 10−2. For between two entities. We encode all entities in the LRD, we randomly select up to 64 negative sam- same way as in pre-training phase. The relation rep- ples per document. We train both models with 8 resentations are obtained by adding a bilinear layer NVIDIA Tesla P40 GPUs for 2, 500 steps. on top of two entity representations. We choose the following baselines: (1) CNN (Zeng et al., 2014), 4.3 Relation Extraction BILSTM (Hochreiter and Schmidhuber, 1997), Relation extraction aims to extract the relation be- BERT (Devlin et al., 2018) and RoBERTa (Liu tween two recognized entities from a pre-defined et al., 2019), which are widely used as text encoders relation set. We conduct experiments on both for relation extraction tasks; (2) HINBERT (Tang document-level and sentence-level RE. We test et al., 2020) which employs a hierarchical infer- ence network to leverage the abundant information for experiments. from different sources; (3) CorefBERT (Ye et al., 7 https://spacy.io/ 2020) which proposes a pre-training method to help 8https://github.com/google-research/bert 9https://github.com/huggingface/ BERT capture the coreferential relations in context; transformers (4) SpanBERT (Joshi et al., 2020) which masks

3354 Metrics Macro F1 Micro F1 Setting Standard Masked BERT 75.50 72.68 Size 1% 10% 100% 1% 10% 100% MTB 76.37 72.94 FastQA - 27.2 - 38.0 CP 76.27 72.48 BiDAF - 49.7 - 59.8 ERNIE 76.51 73.39 ERICABERT 77.85 74.71 BERT 35.8 53.7 69.5 37.9 53.1 73.1 CorefBERT 38.1 54.4 68.8 39.0 53.5 70.7 RoBERTa 79.24 76.38 SpanBERT 33.1 56.4 70.7 34.0 55.4 73.2 ERICA 80.77 77.04 RoBERTa MTB 36.6 51.7 68.4 36.2 50.9 71.7 CP 34.6 50.4 67.4 34.1 47.1 69.4 Table 3: Results on entity typing (FIGER). We report ERICABERT 46.5 57.8 69.7 40.2 58.1 73.9 macro F1 and micro F1 on the test set. RoBERTa 37.3 57.4 70.9 41.2 58.7 75.5 ERICARoBERTa 47.4 58.8 71.2 46.8 63.4 76.6 and predicts contiguous random spans instead of Table 4: Results (accuracy) on the dev set of WikiHop. random tokens; (5) ERNIE (Zhang et al., 2019) We test both the standard and masked settings on three which incorporates KG information into BERT to splits (1%, 10% and 100%). enhance entity representations; (6) MTB (Soares et al., 2019) and CP (Peng et al., 2020) which in- Setting SQuAD TriviaQA NaturalQA troduce sentence-level relation contrastive learning Size 10% 100% 10% 100% 10% 100% for BERT via distant supervision. For fair compari- BERT 79.7 88.9 60.8 70.7 68.4 78.4 son, we pre-train these baselines on our constructed MTB 63.5 87.1 52.0 67.8 61.2 76.7 pre-training data10 based on the implementation re- CP 69.0 87.1 52.9 68.1 63.3 77.3 ERICABERT 81.8 88.9 63.5 71.9 70.2 79.1 leased by Peng et al.(2020) 11. From the results RoBERTa 82.9 90.5 63.6 72.0 71.8 80.0 shown in Table1, we can see that: (1) ERICA ERICARoBERTa 85.0 90.4 63.6 72.1 73.7 80.5 outperforms all baselines significantly on each su- pervised data size, which demonstrates that ER- Table 5: Results (F1) on extractive QA (SQuAD, Triv- ICA could better understand the relations among iaQA and NaturalQA) on two splits (10% and 100%). entities in the document via implicitly considering Results on 1% split are left in the appendix. their complex reasoning patterns in the pre-training; (2) both MTB and CP achieve worse results than ERICA does not impair PLMs’ performance on BERT, which means sentence-level pre-training, sentence-level relation understanding. lacking consideration for complex reasoning pat- terns, hurts PLM’s performance on document-level 4.4 Entity Typing RE tasks to some extent; (3) ERICA outperforms Entity typing aims at classifying entity men- baselines by a larger margin on smaller training tions into pre-defined entity types. We choose sets, which means ERICA has gained pretty good FIGER (Ling et al., 2015), which is a sentence- document-level relation reasoning ability in con- level entity typing dataset labeled with distant trastive learning, and thus obtains improvements supervision. BERT, RoBERTa, MTB, CP and more extensively under low-resource settings. ERNIE are chosen as baselines. From the results listed in Table3, we observe that, ERICA outper- Sentence-level RE For sentence-level RE, we forms all baselines, which demonstrates that ER- choose two widely used datasets: TACRED (Zhang ICA could better represent entities and distinguish et al., 2017) and SemEval-2010 Task 8 (Hendrickx them in text via both entity-level and relation-level et al., 2019). We insert extra marker tokens to in- contrastive learning. dicate the head and tail entities in each sentence. For baselines, we compare ERICA with BERT, 4.5 Question Answering RoBERTa, MTB and CP. From the results shown Question answering aims to extract a specific an- in Table2, we observe that ERICA achieves almost swer span in text given a question. We conduct comparable results on sentence-level RE tasks with experiments on both multi-choice and extractive CP, which means document-level pre-training in QA. We test multiple partitions of the training set.

10 In practice, documents are split into sentences and we Multi-choice QA For Multi-choice QA, we only keep within-sentence entity pairs. 11https://github.com/thunlp/ choose WikiHop (Welbl et al., 2018), which re- RE-Context-or-Names quires models to answer specific properties of an

3355 entity after reading multiple documents and con- Dataset DocRED FIGER WikiHop ducting multi-hop reasoning. It has both stan- BERT 44.9 72.7 53.1 dard and masked settings, where the latter setting -NSP 45.2 72.6 53.6 -NSP+LED 47.6 73.8 59.8 masks all entities with random IDs to avoid in- + + T c ,T c -NSP+LRD 46.4 72.6 52.2 formation leakage. We first concatenate the ques- + + -NSP+LT s ,T s 47.3 73.5 51.2 tion and documents into a long sequence, then RD -NSP+LRD 48.0 74.0 52.0 we find all the occurrences of an entity in the ERICABERT 48.3 74.7 58.1 documents, encode them into hidden representa- tions and obtain the global entity representation Table 6: Ablation study. We report test IgF1 on Do- by applying mean pooling on these hidden rep- cRED (10%), test micro F1 on FIGER and dev accu- racy on the masked setting of WikiHop (10%). resentations. Finally, we use a classifier on top of the entity representation for prediction. We choose the following baselines: (1) FastQA (Weis- ERICA. Then we give a thorough analysis on how senborn et al., 2017) and BiDAF (Seo et al., 2016), pre-training data’s domain / size and methods for which are widely used question answering systems; entity encoding impact the performance. Lastly, (2) BERT, RoBERTa, CorefBERT, SpanBERT, we visualize the entity and relation embeddings MTB and CP, which are introduced in previous learned by ERICA. sections. From the results listed in Table4, we ob- serve that ERICA outperforms baselines in both set- 5.1 Ablation Study tings, indicating that ERICA can better understand To demonstrate that the superior performance of entities and their relations in the documents and ERICA is not owing to its longer pretraining (2500 extract the true answer according to queries. The steps) on masked language modeling, we include significant improvements in the masked setting also a baseline by optimizing L only (removing indicate that ERICA can better perform multi-hop MLM the Next Sentence Prediction (-NSP) loss (Devlin reasoning to synthesize and analyze information et al., 2018)). In addition, to explore how L and from contexts, instead of relying on entity mention ED L impact the performance, we keep only one of “shortcuts” (Jiang and Bansal, 2019). RD these two losses and compare the results. Lastly, Extractive QA For extractive QA, we adopt to evaluate how intra-sentence and inter-sentence three widely-used datasets: SQuAD (Rajpurkar entity pairs contribute to RD task, we compare et al., 2016), TriviaQA (Joshi et al., 2017) and Natu- the performances of only sampling intra-sentence + + Ts ,Ts ralQA (Kwiatkowski et al., 2019) in MRQA (Fisch entity pairs (LRD ) or inter-sentence entity pairs + + et al., 2019) to evaluate ERICA in various domains. Tc ,Tc (LRD ), and sampling both of them (LRD) during Since MRQA does not provide the test set for each pre-training. We conduct experiments on DocRED, dataset, we randomly split the original dev set into WikiHop (masked version) and FIGER. For Do- two halves and obtain the new dev/test set. We fol- cRED and WikiHop, we show the results on 10% low the QA setting of BERT (Devlin et al., 2018): splits and the full results are left in the appendix. we concatenate the given question and passage into From the results shown in Table6, we can see one long sequence, encode the sequence by PLMs that: (1) extra pretraining (-NSP) only contributes a and adopt two classifiers to predict the start and end little to the overall improvement. (2) For DocRED index of the answer. We choose BERT, RoBERTa, and FIGER, either LED or LRD is beneficial, and MTB and CP as baselines. From the results listed combining them further improves the performance; in Table5, we observe that ERICA outperforms For WikiHop, LED dominates the improvement all baselines, indicating that through the enhance- while LRD hurts the performance slightly, this is ment of entity and relation understanding, ERICA possibly because question answering more resem- is more capable of capturing in-text relational facts bles the tail entity discrimination process, while and synthesizing information of entities. This abil- the relation discrimination process may have con- ity further improves PLMs for question answering. flicts with it. (3) For LRD, both intra-sentence and 5 Analysis inter-sentence entity pairs contribute, which demon- strates that incorporating both of them is necessary In this section, we first conduct a suite of ablation for PLMs to understand relations between entities studies to explore how LED and LRD contribute to in text comprehensively. We also found empiri- 3356 36 Size 1% 10% 100% 48 34 55.5 47

BERT 28.9 44.9 54.5 32 55.0 ERICABERT 36.0 48.3 55.9 46 DocRED 30 ERICABERT 36.3 48.6 55.9 45 54.5 0% 10% 30% 50% 70%100% 0% 10% 30% 50% 70%100% 0% 10% 30% 50% 70%100% 1% DocRED 10% DocRED 100% DocRED Table 7: Effects of pre-training data’s entity distribu- tion shifting. We report test IgF1 on DocRED. Figure 5: Impacts of pre-training data’s size. X axis denotes different ratios of pre-training data, Y axis de-

36 notes test IgF1 on different partitions of DocRED. 48 34 55.5 47

32 55.0 46 Size 1% 10% 100% 30 45 54.5 Metrics F1 IgF1 F1 IgF1 F1 IgF1 0% 30% 50% 70% 100% 0% 30% 50% 70% 100% 0% 30% 50% 70% 100% 1% DocRED 10% DocRED 100% DocRED Mean Pool BERT 30.4 28.9 47.1 44.9 56.8 54.5 Figure 4: Impacts of relation distribution shifting. X ERICABERT 37.8 36.0 50.8 48.3 58.2 55.9 DocRED axis denotes different ratios of relations, Y axis denotes ERICABERT 38.5 36.3 51.0 48.6 58.2 55.9 test IgF1 on different partitions of DocRED. Entity Marker BERT 23.0 21.8 46.5 44.3 58.0 55.6 ERICABERT 34.9 33.0 50.2 48.0 59.9 57.6 DocRED cally that when these two auxiliary objectives are ERICABERT 36.9 34.8 52.5 50.3 60.8 58.4 only added into the fine-tuning stage, the model does not have performance gain. The reason is that Table 8: Results (IgF1) on how entity encoding strat- the size and diversity of entities and relations in egy influences ERICA’s performance on DocRED. We downstream training data are limited. Instead, pre- also show the impacts of entity distribution shifting (ERICADocRED and ERICADocRED) as is mentioned in the training with distant supervision on a large corpus BERT BERT main paper. provides a solution for increasing the diversity and quantity of training examples. their performances. From the results in Figure4, 5.2 Effects of Domain Shifting we observe that the performance of ERICA im- We investigate two domain shifting factors: entity proves constantly as the diversity of relation do- distribution and relation distribution, to explore main increases, which reveals the importance of how they impact ERICA’s performance. using diverse training data on relation-related tasks. Through detailed analysis, we further find that ER- Entity Distribution Shifting The entities in su- ICA is less competent at handling unseen relations pervised datasets of DocRED are recognized by in the corpus. This may result from the construc- human annotators while our pre-training data is tion of our pre-training dataset: all the relations are processed by spaCy. Hence there may exist an en- annotated distantly through an existing KG with tity distribution gap between pre-training and fine- a pre-defined relation set. It would be promising tuning. To study the impacts of entity distribution to introduce more diverse relation domains during shifting, we fine-tune a BERT model on training data preparation in future. set of DocRED for NER tagging and re-tag enti- ties in our pre-training dataset. Then we pre-train 5.3 Effects of Pre-training Data’s Size ERICA on the newly-labeled training corpus (de- To explore the effects of pre-training data’s size, we noted as ERICADocRED). From the results shown in BERT train ERICA on 10%, 30%, 50% and 70% of the Table7, we observe that it performs better than the original pre-training dataset, respectively. We re- original ERICA, indicating that pre-training on a port the results in Figure5, from which we observe dataset that shares similar entity distributions with that with the scale of pre-training data becoming downstream tasks is beneficial. larger, ERICA is performing better. Relation Distribution Shifting Our pre-training data contains over 4, 000 Wikidata relations. To 5.4 Effects of Methods for Entity Encoding investigate whether training on a more diverse rela- For all the experiments mentioned above, we en- tion domain benefits ERICA, we train it with the code each occurrence of an entity by mean pooling pre-training corpus that randomly keeps only 30%, over all its tokens in both pre-training and down- 50% and 70% the original relations, and compare stream tasks. Ideally, ERICA should have consis-

3357 entity entity tent improvements on other kinds of methods for MISC TIME ORG MISC PER ORG LOC LOC entity encoding. To demonstrate this, we try an- TIME PER NUM NUM other entity encoding method mentioned by Soares et al.(2019) on three splits of DocRED (1%, 10% and 100%). Specifically, we insert a special start token [S] in front of an entity and an end token [E] after it. The representation for this entity is calculated by averaging the representations of all BERT: entity ERICA-BERT: entity relation relation P17 P176 its start tokens in the document. To help PLMs P131 P150 P1001 P17 P27 P131 discriminate different entities, we randomly assign P150 P175 P175 P27 P179 P569 P57 P1001 different marker pairs ([S1], [E1]; [S2], [E2], ...) P176 P57 P569 P179 for each entity in a document in both pre-training and downstream tasks12. All occurrences of one entity in a document share the same marker pair. We show in Table8 that ERICA achieves consistent performance improvements for both methods (de- BERT: relation ERICA-BERT: relation noted as Mean Pool and Entity Marker), indicat- Figure 6: t-SNE plots of learned entity and rela- ing that ERICA is applicable to different methods tion embeddings on DocRED comparing BERT and for entity encoding. Specifically, Entity Marker ERICABERT. achieves better performance when the scale of train- ing data is large while Mean Pool is more powerful 6 Conclusions under low-resource settings. We also notice that training on a dataset that shares similar entity dis- In this paper, we present ERICA, a general frame- tributions is more helpful for Mean Pool, where work for PLMs to improve entity and relation un- DocRED ERICABERT achieves 60.8 (F1) and 58.4 (IgF1) derstanding via contrastive learning. We demon- on 100% training data. strate the effectiveness of our method on several language understanding tasks, including relation extraction, entity typing and question answering. 5.5 Embedding Visualization The experimental results show that ERICA outper- forms all baselines, especially under low-resource In Figure6, we show the learned entity and re- settings, which means ERICA helps PLMs better lation embeddings of BERT and ERICABERT on capture the in-text relational facts and synthesize DocRED’s dev set by t-distributed stochastic neigh- information about entities and their relations. bor embedding (t-SNE) (Hinton and Roweis, 2002). We label points with different colors to represent Acknowledgments its corresponding category of entities or relations13 This work is supported by the National Key Re- in Wikidata and only visualize the most frequent 10 search and Development Program of China (No. relations. From the figure, we can see that jointly 2020AAA0106501) and Beijing Academy of Arti- training L with L and L leads to a more MLM ED RD ficial Intelligence (BAAI). This work is also sup- compact clustering of both entities and relations ported by the Pattern Recognition Center, WeChat belonging to the same category. In contrast, only AI, Tencent Inc. training LMLM exhibits random distribution. This verifies that ERICA could better understand and represent both entities and relations in the text. References Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in neural informa- 12In practice, we randomly initialize 100 entity marker pairs. tion processing systems, pages 3079–3087. 13(Key, value) pairs for relations defined in Wikidata are: (P176, manufacturer); (P150, contains administrative territo- Jacob Devlin, Ming-Wei Chang, Kenton Lee, and rial entity); (P17, country); (P131, located in the administra- Kristina Toutanova. 2018. BERT: Pre-training of tive territorial entity); (P175, performer); (P27, country of deep bidirectional transformers for language under- citizenship); (P569, date of birth); (P1001, applies to jurisdic- standing. In Proceedings of the 2019 Conference tion); (P57, director); (P179, part of the series). of the North American Chapter of the Association

3358 for Computational Linguistics: Human Language Jeremy Howard and Sebastian Ruder. 2018. Universal Technologies, NAACL-HLT 2019, Minneapolis, MN, language model fine-tuning for text classification. In USA, June 2-7, 2019, Volume 1 (Long and Short Pa- Proceedings of the 56th Annual Meeting of the As- pers), pages 4171–4186. sociation for Computational Linguistics (Volume 1: Long Papers), pages 328–339, Melbourne, Australia. Markus Eberts and Adrian Ulges. 2019. Span-based Association for Computational Linguistics. joint entity and relation extraction with transformer pre-training. CoRR, abs/1909.07755. Yichen Jiang and Mohit Bansal. 2019. Avoiding rea- soning shortcuts: Adversarial evaluation, training, William Fedus, Barret Zoph, and Noam Shazeer. 2021. and model development for multi-hop qa. In Pro- Switch transformers: Scaling to trillion parameter ceedings of the 57th Annual Meeting of the Associa- models with simple and efficient sparsity. arXiv tion for Computational Linguistics, ACL 2019, July preprint arXiv:2101.03961. 28, 2019, Florence, Italy, pages 2726–2736. Associ- ation for Computational Linguistics. Adam Fisch, Alon Talmor, Robin Jia, Minjoon Seo, Eu- nsol Choi, and Danqi Chen. 2019. MRQA 2019 Mandar Joshi, Danqi Chen, Yinhan Liu, Daniel S. shared task: Evaluating generalization in reading Weld, Luke Zettlemoyer, and Omer Levy. 2020. comprehension. In Proceedings of the 2nd Work- SpanBERT: Improving pre-training by representing shop on Machine Reading for Question Answering, and predicting spans. Transactions of the Associa- pages 1–13, Hong Kong, China. Association for tion for Computational Linguistics, 8:64–77. Computational Linguistics. Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Harsha Gurulingappa, Abdul Mateen Rajput, Angus Zettlemoyer. 2017. TriviaQA: A large scale dis- Roberts, Juliane Fluck, Martin Hofmann-Apitius, tantly supervised challenge dataset for reading com- and Luca Toldo. 2012. Development of a bench- prehension. In Proceedings of the 55th Annual Meet- mark corpus to support the automatic extraction of ing of the Association for Computational Linguistics drug-related adverse effects from medical case re- (Volume 1: Long Papers), pages 1601–1611, Van- Journal of biomedical informatics ports. , 45(5):885– couver, Canada. Association for Computational Lin- 892. guistics. Raia Hadsell, Sumit Chopra, and Yann LeCun. 2006. Dimensionality reduction by learning an invariant Diederik P Kingma and Jimmy Ba. 2014. Adam: A 3rd Inter- mapping. In 2006 IEEE Computer Society Confer- method for stochastic optimization. In ence on Computer Vision and Pattern Recognition national Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7, 2015, Con- (CVPR’06), volume 2, pages 1735–1742. IEEE. ference Track Proceedings. Bin He, Di Zhou, Jinghui Xiao, Xin Jiang, Qun Liu, Nicholas Jing Yuan, and Tong Xu. 2020. BERT- Lingpeng Kong, Cyprien de Masson d’Autume, Lei Yu, MK: Integrating graph contextualized knowledge Wang Ling, Zihang Dai, and Dani Yogatama. 2020. into pre-trained language models. In Findings of the A mutual information maximization perspective of Association for Computational Linguistics: EMNLP language representation learning. In Proceedings 2020, pages 2281–2290, Online. Association for of 8th International Conference on Learning Repre- Computational Linguistics. sentations, ICLR 2020, Virtual Conference, April 26, 2020, Conference Track Proceedings. Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid O Séaghdha, Sebastian Tom Kwiatkowski, Jennimaria Palomaki, Olivia Red- Padó, Marco Pennacchiotti, Lorenza Romano, and field, Michael Collins, Ankur Parikh, Chris Alberti, Stan Szpakowicz. 2019. SemEval-2010 Task 8: Danielle Epstein, Illia Polosukhin, Jacob Devlin, Multi-way classification of semantic relations be- Kenton Lee, et al. 2019. Natural questions: a bench- tween pairs of nominals. In Proceedings of the mark for question answering research. Transactions Workshop on Semantic Evaluations: Recent Achieve- of the Association for Computational Linguistics, ments and Future Directions (SEW-2009), pages 94– 7:453–466. 99. Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Geoffrey E Hinton and Sam Roweis. 2002. Stochas- Kevin Gimpel, Piyush Sharma, and Radu Soricut. tic neighbor embedding. In Advances in neural in- 2020. ALBERT: A lite BERT for self-supervised formation processing systems 15: 16th Annual Con- learning of language representations. In Proceed- ference on Neural Information Processing Systems ings of 8th International Conference on Learning 2002. Proceedings of a meeting held September 12, Representations, ICLR 2020, Virtual Conference, 2002, Vancouver, British Columbia, Canada, vol- April 26, 2020, Conference Track Proceedings. ume 15, pages 857–864. Xiao Ling, Sameer Singh, and Daniel S Weld. 2015. Sepp Hochreiter and Jürgen Schmidhuber. 1997. Design challenges for entity linking. Transactions Long short-term memory. Neural computation, of the Association for Computational Linguistics, 9(8):1735–1780. 3:315–328.

3359 Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Man- Minjoon Seo, Aniruddha Kembhavi, Ali Farhadi, and dar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Hannaneh Hajishirzi. 2016. Bidirectional attention Luke Zettlemoyer, and Veselin Stoyanov. 2019. flow for machine comprehension. In Proceedings of RoBERTa: A robustly optimized BERT pretraining 5th International Conference on Learning Represen- approach. CoRR, abs/1907.11692. tations, ICLR 2017, Toulon, France, April 24, 2017, Con- ference Track Proceedings. Ilya Loshchilov and Frank Hutter. 2017. Decoupled weight decay regularization. In Proceedings of 7th Livio Baldini Soares, Nicholas FitzGerald, Jeffrey International Conference on Learning Representa- Ling, and Tom Kwiatkowski. 2019. Matching the tions, ICLR 2019. blanks: Distributional similarity for relation learn- ing. In Proceedings of the 57th Annual Meeting Michael McCloskey and Neal J Cohen. 1989. Catas- of the Association for Computational Linguistics, trophic interference in connectionist networks: the pages 2895–2905. Association for Computational sequential learning problem. In Psychology of learn- Linguistics. ing and motivation, volume 24, pages 109–165. El- sevier. Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie- Yan Liu. 2019. Mass: Masked sequence to sequence Hao Peng, Tianyu Gao, Xu Han, Yankai Lin, Peng pre-training for language generation. In Proceed- Li, Zhiyuan Liu, Maosong Sun, and Jie Zhou. 2020. ings of International Conference on Machine Learn- Learning from context or names? an empirical study ing, pages 5926–5936. PMLR. on neural relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Tianxiang Sun, Yunfan Shao, Xipeng Qiu, Qipeng Guo, Language Processing (EMNLP), pages 3661–3672, Yaru Hu, Xuanjing Huang, and Zheng Zhang. 2020. Online. Association for Computational Linguistics. CoLAKE: Contextualized language and knowledge embedding. In Proceedings of the 28th Inter- Matthew E Peters, Mark Neumann, Robert L Logan IV, national Conference on Computational Linguistics, Roy Schwartz, Vidur Joshi, Sameer Singh, and pages 3660–3670, Barcelona, Spain (Online). Inter- Noah A Smith. 2019. Knowledge enhanced con- national Committee on Computational Linguistics. textual word representations. In Proceedings of the 2019 Conference on Empirical Methods in Natu- Yu Sun, Shuohuan Wang, Yukun Li, Shikun Feng, Xuyi ral Language Processing and the 9th International Chen, Han Zhang, Xin Tian, Danxiang Zhu, Hao Joint Conference on Natural Language Processing Tian, and Hua Wu. 2019. Ernie: Enhanced rep- (EMNLP-IJCNLP). Association for Computational resentation through knowledge integration. arXiv Linguistics. preprint arXiv:1904.09223. Alec Radford, Karthik Narasimhan, Tim Salimans, and Alon Talmor and Jonathan Berant. 2019. MultiQA: An Ilya Sutskever. 2018. Improving language under- empirical investigation of generalization and trans- standing by generative pre-training. fer in reading comprehension. In Proceedings of the Colin Raffel, Noam Shazeer, Adam Roberts, Katherine 57th Annual Meeting of the Association for Compu- Lee, Sharan Narang, Michael Matena, Yanqi Zhou, tational Linguistics, pages 4911–4921. Association Wei Li, and Peter J Liu. 2020. Exploring the limits for Computational Linguistics. of transfer learning with a unified text-to-text trans- former. Journal of Machine Learning Research, Hengzhu Tang, Yanan Cao, Zhenyu Zhang, Jiangxia 21:1–67. Cao, Fang Fang, Shi Wang, and Pengfei Yin. 2020. Hin: Hierarchical inference network for document- Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and level relation extraction. In Advances in Knowledge Percy Liang. 2016. SQuAD: 100,000+ questions for Discovery and Data Mining-24th Pacific-Asia Con- machine comprehension of text. In Proceedings of ference, PAKDD 2020, Singapore, May 11, 2020, the 2016 Conference on Empirical Methods in Natu- Proceedings, Part I, volume 12084 of Lecture Notes ral Language Processing, pages 2383–2392, Austin, in Computer Science, pages 197–209. Springer. Texas. Association for Computational Linguistics. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Dan Roth and Wen-tau Yih. 2004. A linear program- Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz ming formulation for global inference in natural lan- Kaiser, and Illia Polosukhin. 2017. Attention is all guage tasks. In Proceedings of the Eighth Confer- you need. In Advances in Neural Information Pro- ence on Computational Natural Language Learn- cessing Systems 30: Annual Conference on Neural ing (CoNLL-2004) at HLT-NAACL 2004, pages 1–8, Information Processing Systems 2017, 4 December Boston, Massachusetts, USA. Association for Com- 2017, Long Beach, CA, USA, pages 5998–6008. putational Linguistics. Alex Wang, Amanpreet Singh, Julian Michael, Felix Erik F Sang and Fien De Meulder. 2003. Intro- Hill, Omer Levy, and Samuel R Bowman. 2018. duction to the conll-2003 shared task: Language- GLUE: A multi-task benchmark and analysis plat- independent named entity recognition. In Proceed- form for natural language understanding. In Pro- ings of the Seventh Conference on Natural Language ceedings of the 2018 EMNLP Workshop Black- Learning at HLT-NAACL 2003. boxNLP: Analyzing and Interpreting Neural Net-

3360 works for NLP1. Association for Computational Lin- In Proceedings of the 2020 Conference on Empirical guistics. Methods in Natural Language Processing (EMNLP), pages 7170–7186. Association for Computational Ruize Wang, Duyu Tang, Nan Duan, Zhongyu Wei, Linguistics. Xuanjing Huang, Cuihong Cao, Daxin Jiang, Ming Zhou, et al. 2020. K-adapter: Infusing knowl- Daojian Zeng, Kang Liu, Siwei Lai, Guangyou Zhou, edge into pre-trained models with adapters. arXiv and Jun Zhao. 2014. Relation classification via con- preprint arXiv:2002.01808. volutional deep neural network. In Proceedings of COLING 2014, the 25th International Conference Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhiyuan on Computational Linguistics: Technical Papers, Liu, Juanzi Li, and Jian Tang. 2019. KEPLER: A pages 2335–2344. Dublin City University and Asso- unified model for knowledge embedding and pre- ciation for Computational Linguistics. trained language representation. Transactions of the Association for Computational Linguistics. Yuhao Zhang, Victor Zhong, Danqi Chen, Gabor An- geli, and Christopher D Manning. 2017. Position- Dirk Weissenborn, Georg Wiese, and Laura Seiffe. aware attention and supervised data improve slot fill- 2017. Making neural QA as simple as possible ing. In Proceedings of the 2017 Conference on Em- but not simpler. In Proceedings of the 21st Con- pirical Methods in Natural Language Processing, ference on Computational Natural Language Learn- pages 35–45. Association for Computational Lin- ing (CoNLL 2017), pages 271–280. Association for guistics. Computational Linguistics. Zhengyan Zhang, Xu Han, Zhiyuan Liu, Xin Jiang, Johannes Welbl, Pontus Stenetorp, and Sebastian Maosong Sun, and Qun Liu. 2019. ERNIE: En- Riedel. 2018. Constructing datasets for multi-hop hanced language representation with informative en- reading comprehension across documents. Transac- tities. In Proceedings of the 57th Annual Meet- tions of the Association for Computational Linguis- ing of the Association for Computational Linguis- tics, 6:287–302. tics, pages 1441–1451. Association for Computa- tional Linguistics. Wenhan Xiong, Jingfei Du, William Yang Wang, and Veselin Stoyanov. 2019. Pretrained encyclopedia: Weakly supervised knowledge-pretrained language model. In Proceedings of 8th International Confer- ence on Learning Representations, ICLR 2020, Vir- tual Conference, April 26, 2020, Conference Track Proceedings.

Ikuya Yamada, Akari Asai, Hiroyuki Shindo, Hideaki Takeda, and Yuji Matsumoto. 2020. LUKE: Deep contextualized entity representations with entity- aware self-attention. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP). Association for Com- putational Linguistics.

Zhilin Yang, Zihang Dai, Yiming Yang, Jaime Car- bonell, Russ R Salakhutdinov, and Quoc V Le. 2019. XLNet: Generalized autoregressive pretraining for language understanding. In Advances in Neural Information Processing Systems 32: Annual Con- ference on Neural Information Processing Systems 2019, NeurIPS 2019, 8-14 December 2019, Vancou- ver, BC, Canada.

Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and Maosong Sun. 2019. DocRED: A large-scale document-level relation extraction dataset. In Pro- ceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28- August 2, 2019, Volume 1: Long Pa- pers, pages 764–777.

Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Maosong Sun, and Zhiyuan Liu. 2020. Coreferen- tial reasoning learning for language representation.

3361 Appendices models for three epochs, other hyper-parameters are kept the same as ERNIE. A Training Details for Downstream Tasks A.3 Question Answering Multi-choice QA For multi-choice question an- In this section, we introduce the training details for swering, we choose WikiHop (Welbl et al., 2018). downstream tasks (relation extraction, entity typing Since the standard setting of WikiHop does not and question answering). We implement all models 14 provide the index for each candidate, we then find based on Huggingface transformers . them by exactly matching them in the documents. We did experiments on three partitions of the origi- A.1 Relation Extraction nal training data (1%, 10% and 100%). We set the Document-level Relation Extraction For batch size to 8 and learning rate to 5 × 10−5, and document-level relation extraction, we did ex- train for two epochs. periments on DocRED (Yao et al., 2019). We Extractive QA For extractive question answer- modify the official code15 for implementation. For ing, we adopt MRQA (Fisch et al., 2019) as the experiments on three partitions of the original testbed and choose three datasets: SQuAD (Ra- training set (1%, 10% and 100%), we adopt jpurkar et al., 2016), TriviaQA (Joshi et al., 2017) batch size of 10, 32, 32 and training epochs of and NaturalQA (Kwiatkowski et al., 2019). We 400, 400, 200, respectively. We choose Adam adopt Adam as the optimizer, set the learning rate optimizer (Kingma and Ba, 2014) as the optimizer to 3 × 10−5 and train for two epochs. In the main and the learning rate is set to 4 × 10−5. We paper, we report results on two splits (10% and evaluate on dev set every 20/20/5 epochs and then 100%) and results on 1% are listed in Table 11. test the best checkpoint on test set on the official 16 evaluation server . B Generalized Language Understanding (GLUE) Sentence-level Relation Extraction For sentence-level relation extraction, we did ex- The General Language Understanding Evalua- periments on TACRED (Zhang et al., 2017) tion (GLUE) benchmark (Wang et al., 2018) pro- and SemEval-2010 Task 8 (Hendrickx et al., vides several natural language understanding tasks, 2019) based on the implementation of Peng et al. which is often used to evaluate PLMs. To test 17 (2020) . We did experiments on three partitions whether LED and LRD impair the PLMs’ per- (1%, 10% and 100%) of the original training set. formance on these tasks, we compare BERT, The relation representation for each entity pair is ERICABERT, RoBERTa and ERICARoBERTa. We fol- obtained in the same way as in pre-training phase. low the widely used setting and use the [CLS] to- Other settings are kept the same as Peng et al. ken as representation for the whole sentence or (2020) for fair comparison. sentence pair for classification or regression. Ta- ble9 shows the results on dev sets of GLUE Bench- A.2 Entity Tying mark. It can be observed that both ERICABERT and For entity typing, we choose FIGER (Ling et al., ERICARoBERTa achieve comparable performance 2015), whose training set is labeled with distant than the original model, which suggests that jointly supervision. We modify the implementation of training LED and LRD with LMLM does not hurt ERNIE (Zhang et al., 2019)18. In fine-tuning PLMs’ general ability of language understanding. phrase, we encode the entities in the same way C Full results of ablation study as in pre-training phase. We set the learning rate to 3 × 10−5 and batch size to 256, and fine-tune the Full results of ablation study (DocRED, WikiHop and FIGER) are listed in Table 10. 14https://github.com/huggingface/ transformers D Joint Named Entity Recognition and 15 https://github.com/thunlp/DocRED Relation Extraction 16https://competitions.codalab.org/ competitions/20717 Joint Named Entity Recognition (NER) and Re- 17https://github.com/thunlp/ RE-Context-or-Names lation Extraction (RE) aims at identifying enti- 18https://github.com/thunlp/ERNIE ties in text and the relations between them. We

3362 Dataset MNLI(m/mm) QQP QNLI SST-2 CoLA STS-B MRPC RTE BERT 84.0/84.4 88.9 90.6 92.4 57.2 89.7 89.4 70.1 ERICABERT 84.5/84.7 88.3 90.7 92.8 57.9 89.5 89.5 69.6 RoBERTa 87.5/87.3 91.9 92.8 94.8 63.6 91.2 90.2 78.7 ERICARoBERTa 87.5/87.5 91.6 92.6 95.0 63.5 90.7 91.5 78.5

Table 9: Results on dev sets of GLUE Benchmark. We report matched/mismatched (m/mm) accuracy for MNLI, F1 score for QQP and MRPC, spearman correlation for STS-B and accuracy for other tasks.

Dataset DocRED WikiHop (m) FIGER Size 1% 10% 100% 1% 10% 100% 100% BERT 28.9 44.9 54.5 37.9 53.1 73.1 72.7 -NSP 30.1 45.2 54.6 38.2 53.6 73.3 72.6 -NSP+LED 34.4 47.6 55.8 41.1 59.8 74.8 73.8 + + T c ,T c -NSP+LRD 34.8 46.4 54.7 37.4 52.2 72.8 72.6 + + T s ,T s -NSP+LRD 33.9 47.3 55.5 38.0 51.2 72.5 73.5 -NSP+LRD 35.9 48.0 55.6 37.2 52.0 72.7 74.0 ERICABERT 36.0 48.3 55.9 40.2 58.1 73.9 74.7

Table 10: Full results of ablation study. We report test IgF1 on DocRED, dev accuracy on the masked (m) setting of WikiHop and test micro F1 on FIGER.

Setting SQuAD TriviaQA NaturalQA helping PLMs better understand and represent both BERT 15.8 28.7 31.5 entities and relations in text. MTB 11.2 22.0 28.4 CP 12.5 25.6 29.4 ERICABERT 51.3 51.4 42.9 RoBERTa 22.1 40.6 34.0 ERICARoBERTa 57.6 51.3 57.6

Table 11: Results (F1) on extractive QA (SQuAD, Triv- iaQA and NaturalQA) on 1% split.

CoNLL04 ADE Model NER RE NER RE BERT 88.5 70.3 89.2 79.2 ERICABERT 89.3 71.5 89.5 80.2 RoBERTa 89.8 72.0 89.7 81.6 ERICARoBERTa 90.0 72.8 90.2 82.4

Table 12: Results (F1) on joint NER&RE. adopt SpERT (Eberts and Ulges, 2019) as the base model and conduct experiments on two datasets: CoNLL04 (Roth and Yih, 2004) and ADE (Gu- rulingappa et al., 2012) by replacing the base en- coders (BERT and RoBERTa) with ERICABERT and ERICARoBERTa, respectively. We modify the imple- mentation of SpERT19 and keep all the settings the same. From the results listed in Table 12, we can see that ERICA outperforms all baselines, which again demonstrates the superiority of ERICA in

19https://github.com/markus-eberts/spert

3363