Quick viewing(Text Mode)

Arxiv:2106.08657V1 [Cs.CL] 16 Jun 2021

Arxiv:2106.08657V1 [Cs.CL] 16 Jun 2021

EIDER: Evidence-enhanced Document-level Relation Extraction

Yiqing Xie, Jiaming Shen, Sha Li, Yuning Mao, Jiawei Han University of Illinois at Urbana-Champaign, IL, USA {xyiqing2, js2, shal2, yuningm2, hanj}@illinois.edu

Abstract h: Hero of the Day t: the United States r: [country of origin] Ground truth evidence: [1,10] Predicted evidence: [1,10]

Document-level relation extraction (DocRE) Original document as input: [1] Load is the sixth studio album by the American heavy metal band , released on June 4, aims at extracting the semantic relations 1996 by Elektra Records in the United States and by Vertigo among entity pairs in a document. In DocRE, Records internationally. … [9] It was certified 5×platinum by the a subset of the sentences in a document, called Recording Industry Association of America ( RIAA ) for shipping five million copies in the United States. [10] Four singles—"Until It the evidence sentences, might be sufficient for Sleeps", "Hero of the Day", "Mama Said" , and ""— predicting the relation between a specific en- were released as part of the marketing campaign for the album. tity pair. To make better use of the evidence Prediction results (logits): NA: 17.63 country of origin: 14.79 sentences, in this paper, we propose a three- Predicted evidence as input: [1] Load is … in the United States and by Vertigo Records internationally. [10] Four singles—"Until It stage evidence-enhanced DocRE framework Sleeps", "Hero of the Day", … for the album. consisting of joint relation and evidence ex- Prediction results (logits): country of origin: 18.31 NA: 13.45

traction, evidence-centered relation extraction Final prediction result of our model: country of origin (RE), and fusion of extraction results. We first jointly train an RE model with a sim- Figure 1: A test sample in the DocRED dataset (Yao ple and memory-efficient evidence extraction et al., 2019), where the ith sentence in the document model. Then, we construct pseudo documents is marked with [i] at the start. Our model correctly based on the extracted evidence sentences and predicts [1,10] as evidence, and if we only use the ex- run the RE model again. Finally, we fuse the tracted evidence as input, the model can predict the re- extraction results of the first two stages us- lation “country of origin” correctly. ing a blending layer and make a final predic- tion. Extensive experiments show that our pro- posed framework achieves state-of-the-art per- level relation extraction (DocRE) (Quirk and Poon, formance on the DocRED dataset, outperform- 2017; Peng et al., 2017; Gupta et al., 2019). ing the second best method by 0.76/0.82 Ign Although multiple sentences might be necessary F1/F1. In particular, our method significantly to infer a relation, the sentences in a document are improves the performance on inter-sentence re- not equally important for each entity pair and some lations by 1.23 Inter F1. sentences could be irrelevant for the relation pre- diction. We refer to the minimal set of sentences 1 Introduction required to infer a relation as evidence sentences Relation extraction (RE) is the task of extracting (Huang et al., 2021). In the example shown in Fig- ure1, the 1st and 10th sentences serve as evidence arXiv:2106.08657v1 [cs.CL] 16 Jun 2021 semantic relations among entities within a given text, which is a critical step in information extrac- sentences to the “country of origin” relation be- tion and has abundant applications (Yu et al., 2017; tween “Hero of the Day” and “the United States”. Shi et al., 2019; Trisedya et al., 2019). Prior stud- The 1st sentence indicates that Load is originated ies mostly focus on sentence-level RE, where the from the United States, and the 10th indicates Hero two entities of interest co-occur in the same sen- of the Day is a song of Load. Hence, Hero of the tence and it is assumed that their relation can be Day is also from the United States. On the other derived from the sentence (Zeng et al., 2015; Cai hand, although the 9th sentence also mentions “the et al., 2016). However, this assumption does not United States”, it is irrelevant to the relation pre- always hold and some relations between entities diction. Sometimes including such irrelevant sen- can only be inferred given multiple sentences as the tences in the input might introduce noise to the context. As a result, recent studies have been mov- model and be more detrimental than beneficial. ing towards the more realistic setting of document- In light of the observations above, we propose two approaches to make better use of evidence sen- and make another set of predictions based on the tences. The first is to jointly extract relations and pseudo document. In the last stage, we fuse the evidence. Intuitively, both tasks should focus on predictions based on the original document and the the information relevant to the current entity pair, pseudo document using a blending layer (Wolpert, such as the underlined “Load” and “the album” in 1992). In this way, EIDER puts more attention to the 10th sentence of Figure1. This suggests that the important sentences extracted in the first stage, the two tasks have commonality and can provide ad- while still having access to the whole document to ditional training signals for each other. Huang et al. avoid information loss. (2020) trains these two tasks in a multi-task learn- Contributions. (1) We propose a memory- ing manner. However, their model makes one pre- efficient multi-task learning DocRE framework for diction for every joint relation and evidence extraction, which only tuple, which requires massive GPU memory and requires around 14% additional memory compared long training time. Our method adopts a much sim- to a single RE model alone. (2) We refine the pler model structure, which only predicts for each model inference stage by using a blending layer to (relation, entity, entity) tuple and can be trained on fuse the prediction results from the original docu- a single consumer GPU. ment and the extracted evidence, which allows the The second approach is to conduct evidence- model to focus more on the important sentences centered relation extraction with the evidence sen- with no information loss. (3) We conduct exten- tences as model input. In the extreme case, if there sive experiments to demonstrate the effectiveness is only one sentence related to the relation predic- of our method, which achieves state-of-the-art per- tion, we can make predictions solely based on this formance on the large-scale DocRED dataset. sentence and reduce the problem to sentence-level relation extraction. One concurrent work (Huang 2 Problem Formulation et al., 2021) shows the effectiveness of replacing the original documents with sentences extracted Given a document d comprised of N sentences by hand-crafted rules. However, the sentences ex- N {st}t=1 and a set of entities {ei} appearing in d, the tracted by rules are not perfect. Solely relying task of document-level relation extraction (DocRE) on heuristically extracted sentences may result in is to predict the relations between all entity pairs S information loss and harm model performance in (eh, et) from a pre-defined relation set R {NA}. certain cases. Instead, our evidence is obtained by We refer to eh and et as the head entity and tail en- a dedicated evidence extraction model. In addition, tity, respectively. An entity ei may appear multiple we fuse the prediction results of both the original times in document d, where we denote its corre- document and extracted evidence to avoid informa- i sponding mentions as {mj}. A relation r ∈ R tion loss, and demonstrate that the two sources of between (eh, et) exists if it is expressed by any predictions are complementary to each other. pair of their mentions, and otherwise labeled as Specifically, in this paper, we propose an NA. For each entity pair (eh, et) that possesses a 1 evidence-enhanced RE framework, named EIDER, non-NA relation, we define its evidence sentences K which automatically extracts evidence and effec- Vh,t = {svi }i=1 as the subset of sentences in the tively leverages the extracted evidence to improve document that are sufficient for human annotators the performance of DocRE in three stages. In the to infer the relation. first stage, we train a relation extraction model and an evidence extraction model in a multi-task learn- 3 Methodology ing manner. We adopt localized context pooling (Zhou et al., 2021) in both models, which enhances The framework of EIDER consists of three stages: the entity embedding with additional context rel- joint relation and evidence extraction (Sec. 3.1), evant to the current entity pair. To reduce mem- evidence-centered relation extraction (Sec. 3.2) ory usage and training time, we use the same sen- and fusion of extraction results (Sec. 3.3). An tence representation for each relation and only train illustration of our framework is shown in Figure2. the evidence extraction model on entity pairs with at least one relation. In the second stage, we re- 1We use “evidence sentence” and “evidence” interchange- gard the extracted evidence as a pseudo document ably through the paper. Predicted relation from Orig doc: NA Predicted evidence: [1, 4] Final predicted relation: Country & Located in (Learned threshold: -0.28) Relation Classifier Evidence Classifier Sent embs Head emb Blending Layer … [1] … Context emb … [2] … … … Pred Scores from Orig doc Pred Scores from Pseudo doc Tail emb … [4] Country: -1.05 Country: 3.76 Weights Located in: -1.82 Located in: 2.70 … … Citizenship: -11.53 Citizenship: -5.54 … … Paul Desmarais is a Canadian businessman … Desmarais was born in Ontario …

Attention to head & tail Canadian Ontario After convergence Trained model Pre-trained Encoder

Original Document: [1] Paul Desmarais is a Canadian businessman in his Pseudo Document: [1] Paul Desmarais is a hometown of Montreal . [2] He is the eldest son of Paul Desmarais Sr. and Canadian businessman in his hometown of Jacqueline … [4] Desmarais was born in Ontario . [5] He was educated at … Montreal . [4] Desmarais was born in Ontario .

Joint Relation and Evidence Extraction Evidence-Centered Relation Extraction & Fusion

Figure 2: The overall architecture of EIDER.

3.1 Joint Relation and Evidence Extraction attention towards both eh and et are important to In our framework, the relation extraction model both entities and hence essential to the relation. We and evidence extraction model share a pre-trained first compute the attention from each token to each th M l encoder and have their own prediction heads. Intu- mention mj under the k head, noted as Aj,k ∈ R . itively, tokens relevant to the relation are essential Then, we compute the attention from each token in both models, such as “Paul Desmarais” in the to each entity ei by averaging attention over men- i E l 1st sentence and “Desmarais” in the 4th sentence tions mj ∈ ei, denoted as Ai,k ∈ R . The context in the example shown in Figure2. By sharing the embedding of (eh, et) is then obtained by: base encoder, the two models are able to provide (h,t) additional training signals to each other and hence ch,t = Ha mutually enhance each other (Ruder, 2017; Shen K (3) et al., 2018; Liu et al., 2019). (h,t) X E E a = softmax( Ah,k · At,k). N i=1 Encoder. Given a document d = [st]t=1, we insert a special token “*” before and after each entity mention. We then encode the document with a Relation Prediction Head. We first map the em- pre-trained BERT encoder (Devlin et al., 2019) to beddings of e and e to context-aware representa- obtain the embedding of each token: h t tions zh, zt by combining their entity embeddings H = [h1, ..., hL] = Encoder([s1, ..., sN ]). (1) with the context embedding ch,t, and then obtain the probability of relation r ∈ R holds between For each mention of an entity ei, we first use the (eh, et) via a bilinear function: embedding of the start symbol “*” as its mention embedding. Then, we adopt LogSumExp pooling zh = tanh (Wheh + ch,t) , to obtain the embedding of entity ei, which is a smooth version of max pooling: zt = tanh (Wtet + ch,t) , (4) P(r|eh, et) = σ (zhWrzt + br) , Nei X  i ei = log exp m , (2) j where W , W , W , b are learnable parameters. j,1 h t r r We adopt the adaptive-thresholding loss (Zhou where Nei is the number of entity ei’s occurrence et al., 2021) as the loss function of the RE model. i in the document and mj is the embedding of its Specifically, We say a relation is in the positive th j mention. We compute a context embedding class PT if it actually exists between the entity pair, for each entity pair (eh, et) based on the attention and otherwise it is in the negative classes NT . Then K×l×l matrix A ∈ R in the pre-trained encoder fol- we introduces a dummy relation class TH, and en- lowing (Zhou et al., 2021), where K is the number courages the logits of positive classes to be larger of attention heads. Intuitively, tokens with high than that of TH, and the logits of negative classes smaller than TH: Inference. After the model is trained, we feed the ! original documents as input for relation extraction. X exp (logitr) For each entity pair (eh, et), we obtain the predic- LRE = − log P r0∈P ∪{TH} exp (logitr0 ) tion score of each relation r ∈ R by: r∈PT T ! exp (logit ) ( TH logitr − logitTH if logitr ∈ top_k(logit) − log P , Sh,t,r = (9) 0 exp (logit 0 ) − inf otherwise, r ∈NT ∪{TH} r (5) where logit is the hidden representation in the last where top_k(logit) means the top k relations with layer before Sigmoid. the largest probability, which might also contains the dummy class TH. Evidence Prediction Head. The evidence ex- We also extract the evidence from the joint traction model predicts whether each sentence si 0 2 model, noted as Vh,t. For simplicity, we predict is an evidence sentence of entity pair (eh, et) . si as an evidence sentence if P(si|eh, et) > 0.5. To obtain its sentence embedding si, we apply a mean pooling over all the tokens in the sentence: 3.2 Evidence-centered Relation Extraction 1 P si = (hl) . The mean pooling shows |si| hl∈si Suppose we are given the ground truth evidence better performance compared to LogSumExp pool- during inference and the evidence are labeled per- ing in experiments. fectly, which is to say, they already contain all the Intuitively, the tokens contributing more to c h,t information relevant to the relation, then there is no are more important to both e and e , and hence h t need for us to use the whole document during rela- may be relevant to the relation prediction. Simi- tion extraction. Instead, we can construct a pseudo larly, if s is an evidence sentence of (e , e ), the i h t document d0 for each entity pair (e , e ) by con- tokens in s would also be relevant to the relation h,t h t i catenating the evidence sentences V in the order prediction. Hence, a function with both c and s h,t h,t i they are presented in the original document, and as input could be helpful to predict whether s is i feed the pseudo documents to the trained model. in the evidence V . Specifically, we use a bilinear h,t Noticing that the evidence information is only function between context embedding c and sen- h,t available during training, we replace the evidence tence embedding s to measure the importance of i sentences in the construction of pseudo documents sentence s to entity pair (e , e ): i h t with the evidence extracted by our model, noted as V 0 , and obtain another set of prediction scores by P(si|eh, et) = σ (siWvch,t + bv) , (6) h,t Eq.9. We note the prediction scores from origi- S(O) where Wv and bv are learnable parameters. nal documents and pseudo documents as and (E) As an entity pair may have more than one evi- S , respectively. Since the same trained model is dence sentence, we use the binary cross entropy as used for prediction with both the original document the objective to train the evidence extraction model. and the extracted evidence. This means that our method does not require training multiple rounds X and is still comparable to other methods with only LEvi = − yi · P(si|eh, et) + one round of prediction. si∈D (7) (1 − yi) · log(1 − P(si|eh, et)), 3.3 Fusion of Extraction Results where yi is 1 when si ∈ Vh,t and yi = 0 otherwise. On one hand, if the extracted evidence is com- Optimization. Finally, we optimize our model by pletely accurate, directly using the extracted ev- the combination of the relation extraction loss LRE idence for prediction may simplify the structure of and evidence extraction loss LEvi: the original document, and hence make it easier for the model to make the correct predictions. On the L = LRE + α ·LEvi, (8) other hand, the quality of the extracted evidence is not perfect. Besides, the non-evidence sentences where α is a hyper-parameter that balances the two in the original document may also provide back- losses. ground information of the entities and is possible 2The evidence information is available during training but to contribute to the prediction. Hence, solely rely- is not required during inference. ing on these sentences may result in information loss and lead to sub-optimal performance. As a 2020), CorefBERT (Ye et al., 2020), HIN (Tang result, we combine the prediction results on both et al., 2020), ATLOP (Zhou et al., 2021) and Do- the original documents and the extracted evidence. cuNet (Yuan et al., 2021). For fair comparisons, we After obtaining two sets of relation prediction use BERTbase as the base encoder for all methods. results from the original documents and the pseudo Evaluation Metrics. Following prior studies (Yao documents, we fuse the results by aggregating their et al., 2019; Zhou et al., 2021), we use F1 and Ign (O) (E) prediction scores S and S through a blending F1 as the main evaluation metrics. Ign F1 measures layer (Wolpert, 1992): the F1 score excluding the relations shared by the training and development/test set. We also report P (r|e , e ) = σ(S(O) + S(E) − τ), F use h t h,t,r h,t,r (10) Intra F1 and Inter F1, where the former measures the performance on the co-occurred entity pairs where τ is a learnable parameter. We optimize the (intra-sentence) and the latter measures the perfor- parameter τ on the developing set as follows: mance on inter-sentence relations where none of X X X the entity mention pairs co-occur. LF use = − yr · PF use (r|eh, et) + d∈D h6=t r∈R Implementation Details. Our model is imple- mented based on Pytorch and Huggingface’s Trans- (1 − yr) · log(1 − PF use (r|eh, et)), (11) formers (Wolf et al., 2019). We use cased BERT- where yr = 1 if the relation r holds between based (Devlin et al., 2019) as the base encoder and (eh, et) and yr = 0 otherwise. optimize our model using AdamW with learning rate 5e−5 for the base encoder and 1e−4 for other 4 Experiments parameters. We adopt a linear warmup for the first 6% steps. The batch size (number of documents 4.1 Experiment Setup per batch) is set to 4 and the ratio between relation Dataset. We evaluate all compared methods using extraction loss and evidence extraction loss α is the DocRED benchmark (Yao et al., 2019), a large set to 0.1. We perform early stopping based on the human-annotated document-level RE dataset that F1 score on the development set, with a maximum consists of 3,053/1,000/1,000 documents for train- of 30 epochs. All our models are trained with one ing/development/testing, respectively. DocRED is GTX 1080 Ti GPU. constructed from Wikipedia, involving 96 relation types, 132,275 entities, and 56,354 relations. More 4.2 Main Results than 40.7% relations can only be extracted by con- Relation Extraction Results. Table1 presents the sidering multiple sentences. The dataset provides relation extraction results of EIDER and baseline evidence sentences as part of the annotation. How- models on the DocRED dataset. First, we can ob- ever, during inference stage, our model does not serve that our method outperforms all the baseline have access to this information. methods in terms of all compared metrics on both Baseline Methods. We compare EIDER with state- the development set and test set. Furthermore, com- of-the-art DocRE methods from two categories. pared to ATLOP (Zhou et al., 2021), which uses the The first one includes graph-based methods that same base relation extraction model as our method, construct document graphs by simple rules such EIDER improves its performance significantly by as entity co-occurrence or co-reference and ex- 1.56/1.40 F1/Ign F1 on the development set and plicitly perform inference on the graph. Exam- 1.17/1.09 on the test set. This demonstrates the ples include LSR (Nan et al., 2020), GAIN (Zeng usefulness of joint extraction and integration of et al., 2020), GLRE (Wang et al., 2020), DRN (Xu extracted evidence during inference. et al., 2021) and SIRE (Zeng et al., 2021). The The experiment results also show that EIDER per- second category contains transformer-based meth- forms better than previous methods by 1.23/0.48 ods that adopt the Transformer architecture to cap- Inter F1/Intra F1 on the development set. We notice ture long-distance token dependencies implicitly that the improvement on Inter F1 is much larger and model cross-sentence relations. Our model than that on Intra F1, which indicates that using also falls in this category, and we compare it evidence extraction as an auxiliary task and utiliz- with BERT (Devlin et al., 2019), BERT-Two- ing the extracted evidence in the inference stage Step (Wang et al., 2019), E2GRE (Huang et al., can largely improve the inter-sentence prediction Dev Test Model Ign F1 F1 Intra F1 Inter F1 Ign F1 F1

LSR-BERTbase (Nan et al., 2020) 52.43 59.00 65.26 52.05 56.97 59.05 GLRE-BERTbase (Wang et al., 2020) - - - - 55.40 57.40 GAIN-BERTbase (Zeng et al., 2020) 59.14 61.22 67.10 53.90 59.00 61.24 DRN-BERTbase (Xu et al., 2021) 59.33 61.39 - - 59.15 61.37 SIRE-BERTbase (Zeng et al., 2021) 59.82 61.60 68.07 54.01 60.18 62.05

BERTbase (Wang et al., 2019) - 54.16 61.61 47.15 - 53.20 BERT-Two-Step (Wang et al., 2019) - 54.42 61.80 47.28 - 53.92 HIN-BERTbase (Tang et al., 2020) 54.29 56.31 - - 53.70 55.60 E2GRE-BERTbase (Huang et al., 2020) 55.22 58.72 - - - - CorefBERTbase (Ye et al., 2020) 55.32 57.51 - - 54.54 56.96 † † ATLOP-BERTbase (Zhou et al., 2021) 59.22 61.09 67.44 53.17 59.31 61.30 DocuNet-BERTbase (Yuan et al., 2021) 59.86 61.83 - - 59.93 61.86

EIDER-BERTbase 60.62 62.65 68.55 55.24 60.42 62.47

Table 1: Performance comparison on DocRED. Results with † are based on our implementation. Others are reported in their original papers. We separate graph-based and transformer-based methods into two groups.

Model Dev F1 Test F1 Ablation Ign F1 F1 Intra F1 Inter F1

E2GRE-BERTbase 47.14 - EIDER-Full 60.62 62.65 68.55 55.24 E2GRE-RoBERTalarge - 50.50 NoJoint 59.98 62.03 68.51 54.10 NoEvi 59.70 61.53 67.55 54.01 EIDER-BERTbase 50.71 51.27 NoOrigDoc 59.25 61.57 67.59 54.10 Table 2: The evidence extraction results on DocRED. NoBlending 58.93 61.46 67.33 54.37 The performance of E2GRE (Huang et al., 2020) is re- ported in their original paper. Table 3: Ablation study of EIDER on DocRED. training our joint model requires 10,916 MB on a ability of the model. GAIN and ATLOP have sim- single GTX 1080 Ti GPU, while E2GRE fails to ilar overall F1/Ign F1 scores, but the Inter F1 of run on the same GPU and requires 36,182 MB on GAIN is 0.73 higher and the Intra F1 of ATLOP is a RTX A6000 GPU, which shows EIDER is much 0.34 higher. This indicates that this methods may more memory-efficient. Moreover, we find that the capture the long-distance dependency between en- standalone relation extraction model training con- tities by directly connecting them on the graph. Al- sumes 9,579 MB GPU memory, which indicates though EIDER does not involve explicit multi-hop that our joint model training incurs less than 14% reasoning modules, it still significantly outperforms GPU memory overhead. the graph-based models in terms of Inter F1. This demonstrates that the evidence-centered relation ex- 4.3 Performance Analysis traction also help EIDER to capture long-distance Ablation Studies. We conduct ablation studies to dependencies between entities and hence infer the further analyze the utility of each module in EIDER. complicated relation from multiple sentences. The results are shown in Table3. Evidence Extraction Results. We list the re- We first train the RE model and the evidence sults of evidence prediction in Table2. EIDER- extraction model separately, denoted as NoJoint. BERTbase outperforms E2GRE-BERTbase signif- The performance of Intra F1/Inter F1 drops by icantly by 3.57 F1 on the development set, and 0.04/1.14. We observe that the drop in Inter F1 ourperforms E2GRE-RoBERTalarge by 0.77 F1 on is much more significant, which shows that with- the test set. Note that our model adopts a much sim- out taking the evidence extraction as an auxiliary pler structure for evidence extraction. To compare task, the ability of the RE model for identifying the memory consumption of EIDER and E2GRE, the important context is almost the same for intra- we implement the evidence prediction model of sentence pairs, but is largely decreased for inter- E2GRE and jointly train it with the same relation sentence pairs. Consequently, the RE model fails to extraction model as ours. Experiments show that infer complicated relations from multiple sentences Intra Coref Bridge Total co-occurs is mainly from combining extracted ev- idence during inference. For the Bridge category, Count 6711 984 3212 10,907 the full model outperforms our two ablations by Percent 54.46% 7.99% 26.07% 88.52% a large margin, and the two ablations further out- Table 4: The statistics of categories among the 12,323 perform ATLOP by a large margin. This reveals relations in the DocRED development set. that both extracting evidence in training and pre- dicting with evidence during inference improve the if it is trained alone. performance of entity pairs that require multi-hop Then, we remove the extracted evidence and reasoning over a third entity. the original document in the inference stage sepa- 2.5 rately. These two ablations are denoted as NoEvi Ours­Full NoEvi NoJoint ATLOP and NoOrigDoc, respectively. We observe that 2.0 +2.0 removing either source will lead to performance +1.6 1.5 1 drops, which indicates both the extracted evidence F +1.1+1.1 +1.0 and the original document are important for relation 1.0 +0.8 prediction. The original documents may contain +0.7 0.5 irrelevant and noisy sentences and the extracted evi- +0.3 +0.167.5 61.0 51.4 dence sentences only may fail to cover all important 0.0 Intra Coref Bridge information in the original document. Therefore, it Figure 3: Performance gains in F1 by relation cate- is necessary to use them both during inference. gories. The gains are relative to ATLOP. Finally, we remove the blending layer by sim- ply taking the union of the extracted relations of 4.4 Case Studies original documents and pseudo documents, noted Table5 shows a few example output cases of our as NoBlending. The performance drop sharply model EIDER. In the first example, the extracted by 1.19/1.71 F1/Ign F1. It further demonstrates evidence contains the ground truth evidence, and that the blending layer can successfully learns a the prediction on pseudo document is correct. In dynamic threshold to combine the two sets of pre- the second example, the 6th sentence is missing diction results. in the extracted evidence, but fortunately, the pre- Performance Breakdown. To further analyze the diction in the original document is correct and the performance of EIDER on different types of entity final prediction result is correct. The last example pairs, we categorize the relations into three cate- shows a case where EIDER successfully predicts gories: (1) Intra, where two entities co-occur in the evidence, but the prediction result with pseudo the same sentence, (2) Coref, where none of their document as input is still “NA”. This is consistent explicit entity mention pairs co-occur, but their co- with our argument that the non-evidence sentences reference co-occurs, (3) Bridge, where the first two in the original document may also provide back- situations are not satisfied, but there exists a third ground information of the entities and is possible entity whose mention co-occurs with both the head to contribute to the prediction. entity and the tail entity. The statistics of each cat- 5 Related Work egory is listed in Table4, where the co-reference of each entity is extracted by HOI (Xu and Choi, Relation Extraction. Previous research efforts on 2020). From the statistics, we can see that the three relation extraction mainly concentrate on predict- categories cover over 88% of the relations in the ing relations within a sentence (Cai et al., 2016; development set. Zeng et al., 2015; Feng et al., 2018; Zheng et al., The results on each category are shown in Figure 2021; Zhang et al., 2018, 2019, 2020). While these 3. We can see that our model outperforms ATLOP approaches tackle the sentence-level RE task ef- in all three categories. The differences between fectively, in the real world, certain relations can models varies per category. Under the Intra and only be inferred from multiple sentences. Con- Coref category, our full model and the NoJoint ab- sequently, recent studies (Quirk and Poon, 2017; lation significantly outperforms the NoEvi ablation Peng et al., 2017; Yao et al., 2019; Wang et al., and ATLOP, which indicates that the performance 2019; Tang et al., 2020) have proposed to work on gain for entity pairs whose mention or co-reference the document-level relation extraction (DocRE). Ground Truth Relation: Place of birth Ground Truth Evidence Sentence(s): [3] Document: [1] Kurt Tucholsky (9 January 1890 – 21 December 1935) was a German - Jewish journalist , satirist , and writer. [2] He also wrote under the pseudonyms Kaspar Hauser (after the historical figure), Peter Panter, Theobald Tiger and Ignaz Wrobel. [3] Born in Berlin - Moabit, he moved to Paris in 1924 and then to Sweden in 1929. [4] Tucholsky was one of the most important journalists of ... Extracted Evidence Sentence(s): [1, 3] Prediction based on Orig. Document: NA Prediction based on Extracted Evidences: Place of Birth Final Predicted Type: Place of Birth Ground Truth Relation: Inception Ground Truth Evidence Sentence(s): [5, 6] Document: [1] Oleg Tinkov (born 25 December 1967 ) is a Russian entrepreneur and cycling sponsor. ... [5] Tinkoff is the founder and chairman of the Tinkoff Bank board of directors (until 2015 it was called Tinkoff Credit Systems). [6] The bank was founded in 2007 and as of December 1, 2016, it is ranked 45 in terms of assets and 33 for equity among Russian banks. ... Extracted Evidence Sentence(s): [5] Prediction based on Orig. Document: Inception Prediction based on Extracted Evidences: NA Final Predicted Type: Inception Ground Truth Relation: Original network Ground Truth Evidence Sentence(s): [1, 2] Document: [1] "How to Save a Life" is the twenty-first episode of the eleventh season of the American television medical drama Grey’s Anatomy, and is the 241st episode overall. [2] It aired on April 23, 2015 on ABC in the United States. [3] The episode was written by showrunner Shonda Rhimes and directed by Rob Hardy, making it the first episode Rhimes has written since the season eight finale "Flight". ... Extracted Evidence Sentence(s): [1, 2] Prediction based on Orig. Document: Original network Prediction based on Extracted Evidences: NA Final Predicted Type: Original network

Table 5: Case studies of our proposed framework EIDER. We use red, blue and green to color the head entity, tail entity and relation, respectively. The indices of ground truth evidence sentences are highlighted with yellow.

Graph-based DocRE. Graph-based DocRE meth- designs several hand-crafted rules to extract sen- ods generally first construct a graph with men- tences that are important to the prediction. Similar tions, entities, sentences or documents as the nodes, to our method, Huang et al.(2020) learns a model and then infer the relations by reasoning on this to perform joint relation extraction and evidence ex- graph. Specifically, Nan et al.(2020) constructs a traction. However, our method uses a much simpler document-level graph and iteratively updates the model structure for the evidence extraction model node representations and refines the graph topologi- and hence reduces the memory usage and improves cal structure. Zeng et al.(2020) first constructs both the training efficiency. We are also the first work a mention-level graph and an entity-level graph, to fuse the predictions based on extracted evidence and then performs multi-hop reasoning on both sentences in inference. graphs. Xu et al.(2020) extracts a reasoning path between each entity pair holding at least one rela- 6 Conclusion tion, and encourages the model to reconstruct the In this work, we propose a three-stage DocRE path during training. These methods simplify the framework consisting of joint relation and evidence input document by extracting a graph with entities extraction, evidence-centered relation extraction and performing explicit graph reasoning. However, and fusion of extraction results. The joint training the complicated operations on the graphs lower the stage adopts simple model structure and is memory- efficiency of the methods. efficient. The relation extraction and evidence ex- Transformer-based DocRE. Another line of stud- traction model provide additional training signals ies solely relies on the transformer architecture to each other and mutually enhance each other. We (Devlin et al., 2019) to model cross-sentence rela- combine the prediction results on both the origi- tions since transformers is able to implicitly cap- nal document and the extracted evidence, which ture long-distance dependencies. Zhou et al.(2021) encourages the model to focus on the important uses attention in the transformers to extract useful sentences while reducing information loss. Exper- context and adopts an adaptive threshold for each iment results demonstrate that our model signifi- entity pair. Yuan et al.(2021) views DocRE as a cantly outperforms existing methods on DocRED, semantic segmentation task. Huang et al.(2021) especially on inter-sentence relations. References J. Shen, Maryam Karimzadehgan, Michael Bender- sky, Zhen Qin, and Donald Metzler. 2018. Multi- Rui Cai, Xiaodong Zhang, and Houfeng Wang. 2016. task learning for email search ranking with auxiliary Bidirectional recurrent convolutional neural network query clustering. In CIKM. for relation classification. In ACL, pages 756–765. Y. Shi, Jiaming Shen, Yuchen Li, N. Zhang, Xinwei Jacob Devlin, Ming-Wei Chang, Kenton Lee, and He, Zhengzhi Lou, Q. Zhu, M. Walker, Myung- Kristina Toutanova. 2019. BERT: Pre-training of Hwan Kim, and Jiawei Han. 2019. Discovering hy- deep bidirectional transformers for language under- pernymy in text-rich heterogeneous information net- standing. In Proceedings of the 2019 Conference of work by exploiting context granularity. In CIKM. the North American Chapter of the Association for Computational Linguistics, pages 4171–4186. Hengzhu Tang, Yanan Cao, Zhenyu Zhang, Jiangxia Cao, Fang Fang, Shi Wang, and Pengfei Yin. 2020. Jun Feng, Minlie Huang, Li Zhao, Yang Yang, and Xi- HIN: hierarchical inference network for document- aoyan Zhu. 2018. Reinforcement learning for rela- level relation extraction. In Advances in Knowledge tion classification from noisy data. In AAAI, pages Discovery and Data Mining - 24th Pacific-Asia Con- 5779–5786. ference, PAKDD 2020, Singapore, May 11-14, 2020, Proceedings, Part I, volume 12084 of Lecture Notes Pankaj Gupta, Subburam Rajaram, Hinrich Schütze, in Computer Science, pages 197–209. and Thomas A. Runkler. 2019. Neural relation extraction within and across sentence boundaries. Bayu Distiawan Trisedya, Gerhard Weikum, Jianzhong In The Thirty-Third AAAI Conference on Artificial Qi, and Rui Zhang. 2019. Neural relation extrac- Intelligence, AAAI 2019, The Thirty-First Innova- tion for knowledge base enrichment. In Proceed- tive Applications of Artificial Intelligence Confer- ings of the 57th Annual Meeting of the Association ence, IAAI 2019, The Ninth AAAI Symposium on Ed- for Computational Linguistics, pages 229–240, Flo- ucational Advances in Artificial Intelligence, EAAI rence, Italy. Association for Computational Linguis- 2019, pages 6513–6520. tics.

Kevin Huang, Guangtao Wang, Tengyu Ma, and Jing Difeng Wang, Wei Hu, Ermei Cao, and Weijian Huang. 2020. Entity and evidence guided relation Sun. 2020. Global-to-local neural networks for extraction for docred. document-level relation extraction. In Proceed- ings of the 2020 Conference on Empirical Methods Quzhe Huang, Shengqi Zhu, Yansong Feng, Yuan Ye, in Natural Language Processing (EMNLP), pages Yuxuan Lai, and Dongyan Zhao. 2021. Three sen- 3711–3721. tences are all you need: Local path enhanced docu- ment relation extraction. Hong Wang, Christfried Focke, Rob Sylvester, Nilesh Mishra, and William Wang. 2019. Fine-tune bert for docred with two-step process. Computing Research Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jian- Repository, arXiv:1909.11898. feng Gao. 2019. Multi-task deep neural networks for natural language understanding. In ACL. Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pier- Guoshun Nan, Zhijiang Guo, Ivan Sekulic, and Wei Lu. ric Cistac, Tim Rault, Rémi Louf, Morgan Funtow- 2020. Reasoning with latent structure refinement for icz, et al. 2019. Huggingface’s transformers: State- document-level relation extraction. In Proceedings of-the-art natural language processing. ArXiv, pages of the 58th Annual Meeting of the Association for arXiv–1910. Computational Linguistics, pages 1546–1557. David H. Wolpert. 1992. Stacked generalization. Neu- Nanyun Peng, Hoifung Poon, Chris Quirk, Kristina ral Networks, 5:241–259. Toutanova, and Wen-tau Yih. 2017. Cross-sentence n-ary relation extraction with graph LSTMs. Trans- Liyan Xu and Jinho D. Choi. 2020. Revealing the actions of the Association for Computational Lin- myth of higher-order inference in coreference reso- guistics, 5:101–115. lution. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing Chris Quirk and Hoifung Poon. 2017. Distant super- (EMNLP), pages 8527–8533. Association for Com- vision for relation extraction beyond the sentence putational Linguistics. boundary. In Proceedings of the 15th Conference of the European Chapter of the Association for Compu- Wang Xu, Kehai Chen, and Tiejun Zhao. 2020. tational Linguistics: Volume 1, Long Papers, pages Document-level relation extraction with reconstruc- 1171–1182. tion.

Sebastian Ruder. 2017. An overview of multi- Wang Xu, Kehai Chen, and Tiejun Zhao. 2021. Dis- task learning in deep neural networks. ArXiv, criminative reasoning for document-level relation abs/1706.05098. extraction. Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai Lin, Wenxuan Zhou, Kevin Huang, Tengyu Ma, and Jing Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, Huang. 2021. Document-level relation extraction and Maosong Sun. 2019. DocRED: A large-scale with adaptive thresholding and localized context document-level relation extraction dataset. In Pro- pooling. In Proceedings of the AAAI Conference on ceedings of the 57th Annual Meeting of the Associa- Artificial Intelligence. tion for Computational Linguistics, pages 764–777.

Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, Peng Li, Maosong Sun, and Zhiyuan Liu. 2020. Corefer- ential Reasoning Learning for Language Represen- tation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 7170–7186.

Mo Yu, Wenpeng Yin, Kazi Saidul Hasan, Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2017. Im- proved neural relation detection for knowledge base question answering. In Proceedings of the 55th An- nual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 571– 581, Vancouver, Canada. Association for Computa- tional Linguistics.

Changsen Yuan, Heyan Huang, Chong Feng, Ge Shi, and Xiaochi Wei. 2021. Document-level relation ex- traction with entity-selection attention. Information Sciences, 568:163–174.

Daojian Zeng, Kang Liu, Yubo Chen, and Jun Zhao. 2015. Distant supervision for relation extraction via piecewise convolutional neural networks. In EMNLP, pages 1753–1762.

Shuang Zeng, Yuting Wu, and Baobao Chang. 2021. Sire: Separate intra- and inter-sentential reasoning for document-level relation extraction.

Shuang Zeng, Runxin Xu, Baobao Chang, and Lei Li. 2020. Double graph based reasoning for document- level relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Lan- guage Processing (EMNLP), pages 1630–1640.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Jiaoyan Chen, Wei Zhang, and Huajun Chen. 2020. Rela- tion adversarial network for low resource knowledge graph completion. In Proceedings of The Web Con- ference 2020.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Xi Chen, Wei Zhang, and Huajun Chen. 2018. Attention- based capsule networks with dynamic routing for re- lation extraction. In EMNLP.

Ningyu Zhang, Shumin Deng, Zhanlin Sun, Guany- ing Wang, Xi Chen, Wei Zhang, and Huajun Chen. 2019. Long-tail relation extraction via knowledge graph embeddings and graph convolution networks. In NAACL-HLT.

Hengyi Zheng, Rui Wen, Xi Chen, Yifan Yang, Yun- nan Zhang, Ziheng Zhang, Ningyu Zhang, Bin Qin, Xu Ming, and Yefeng Zheng. 2021. Prgc: Potential relation and global correpondence based joint rela- tional triple extraction. In ACL.