Coarse-to-Fine Entity Representations for Document-level Relation Extraction

Damai Dai1∗ , Jing Ren1∗ , Shuang Zeng1 , Baobao Chang1,2 , Zhifang Sui1,2 1MOE Key Lab of Computational Linguistics, School of EECS, Peking University 2Peng Cheng Laboratory, China {daidamai,rjj,zengs,chbb,szf}@pku.edu.cn

Abstract (0) Romeo Void was an American new wave band … (1) The band primarily member of Document-level Relation Extraction (RE) requires consisted of saxophonist Benjamin Bossi … (7) The success of their extracting relations expressed within and across signed by performer sentences. Recent works show that graph-based second release, a 4-song EP, Never Say Never resulted in a distribution record label methods, usually constructing a document-level deal with Columbia Records … graph that captures document-aware interactions, can obtain useful entity representations thus help- Figure 1: An example for document-level RE. Word spans with ing tackle document-level RE. These methods ei- background colors denote the mentions. Mentions that refer to the ther focus more on the entire graph, or pay more same entity have the same background color. The solid lines denote attention to a part of the graph, e.g., paths be- intra-sentence relations. The dotted line denotes an inter-sentence tween the target entity pair. However, we find that relation. Document-level RE requires extracting both intra- and inter-sentence relations between all the target entity pairs. document-level RE may benefit from focusing on both of them simultaneously. Therefore, to ob- tain more comprehensive entity representations, we some inter-sentence relations while a considerable number of propose the Coarse-to-Fine Entity Representation relations are expressed beyond a single sentence but across model (CFER) that adopts a coarse-to-fine strat- multiple sentences in a long document [Yao et al., 2019]. egy involving two phases. First, CFER uses graph Therefore, document-level RE has attracted much attention neural networks to integrate global information in in recent years. the entire graph at a coarse level. Next, CFER uti- lizes the global information as a guidance to selec- Figure 1 shows an example document and corresponding tively aggregate path information between the tar- relational facts for document-level RE. In the example, to get entity pair at a fine level. In classification, we extract the relation between Benjamin Bossi and Columbia combine the entity representations from both two Records, two entities separated by several sentences, we need levels into more comprehensive representations for the following inference steps. First, we need to know that relation extraction. Experimental results on two Benjamin Bossi is a member of Romeo Void. Next, we need to document-level RE datasets, DocRED and CDR, infer that Never Say Never is performed by Romeo Void, and show that CFER outperforms existing models and released by Columbia Records. Based on these facts, we can is robust to the uneven label distribution.1 draw a conclusion that Benjamin Bossi is signed by Columbia Records. The example indicates that, to tackle document- level RE, a model needs the ability to capture interactions arXiv:2012.02507v2 [cs.CL] 6 Jun 2021 1 Introduction between long-distance entities. In addition, since a document may have an extremely long text, a model also needs the abil- Relation Extraction (RE) aims to extract semantic relations ity to integrate global contextual information for words. between named entities from plain text. It is an efficient way to acquire structured knowledge automatically, thus ben- Recent works show that graph-based methods can obtain efiting various natural language processing (NLP) applica- useful entity representations thus helping tackle document- tions, especially knowledge graph construction [Luan et al., level RE. These methods usually construct a document-level 2018]. Most of the previous RE works focus on the sen- graph, which represents words as nodes and uses edges be- tence level, i.e., they extract the relations within only a sin- tween them to capture document-aware interactions. To ob- gle sentence [Zeng et al., 2014; Zhou et al., 2016]. However, tain entity representations for relation extraction, some of [ in real-world scenarios, sentence-level RE models may omit them use graph neural networks (GNN) Marcheggiani and Titov, 2017; Guo et al., 2019b] to integrate neighborhood in- ∗Equal Contribution. formation for each node [Sahu et al., 2019; Nan et al., 2020]. 1Our code and data are available at https://github.com/ Although they consider the entire graph structure, they may Hunter-DDM/cfer-document-level-RE. fail to model the interactions between long-distance entities syntactic dependency adjacent word self-loop adjacent sentence coreferential mention Romeo Void was an American new wave band from , formed in 1979. (root)

The band primarily consisted of saxophonist Benjamin Bossi , vocalist Debora Iyall , guitarist Peter Woods , and bassist Frank Zincavage . (root)

Figure 2: An illustration of a document-level graph corresponding to a two-sentence document. Each node in the graph corresponds to a word in the document. We design five categories of edges to connect nodes in the graph. For the simplicity of the illustration, we omit some self-loop edges and adjacent word edges. due to the inherent over-smoothing problem in GNN [Li et ability of CFER to model long-distance interactions. al., 2018]. Other works attempt to encode path information between the target entity pair in the graph [Quirk and Poon, 2 Methodology 2017; Christopoulou et al., 2019]. They have the ability to alleviate the problem of modeling long-distance entity inter- 2.1 Task Formulation actions, but they may fail to capture global contextual infor- Let D denote a document consisting of N sentences D = mation since they usually integrate only local contextual in- N M {s } , where s = {w } denotes the i-th sentence con- formation for nodes in the graph. Therefore, to obtain more i i=1 i j j=1 comprehensive entity representations, it is necessary to find taining M words denoted by wj. Let E denote an entity set P Q a way to integrate global contextual information and model containing P entities E = {ei}i=1, where ei = {mj}j=1 long-distance entity interactions simultaneously. denotes the coreferential mention set of the i-th entity, con- In this paper, we propose the Coarse-to-Fine Entity taining Q word spans of corresponding mentions denoted by Representation model (CFER) to obtain comprehensive en- mj. Given D and E, document-level RE requires extracting tity representations for document-level RE. More specifically, all the relational facts in the form of triplets, i.e., extracting we first construct a document-level graph that captures rich {(ei, r, ej)|ei, ej ∈ E, r ∈ R}, where R is a pre-defined rela- document-aware interactions, reflected by syntactic depen- tion category set. Since an entity pair may have multiple se- dencies, adjacent word connections, cross-sentence connec- mantic relations, we formulate document-level RE as a multi- tions, and coreferential mention interactions. Based on the label classification task. constructed graph, we design a coarse-to-fine strategy with two phases. First, we use Densely Connected Graph Con- 2.2 Document-level Graph Construction volutional Networks (DCGCN) [Guo et al., 2019b] to inte- Given a document, we first construct a document-level graph grate global contextual information in the entire graph at a that captures rich document-aware interactions, reflected by coarse level. Next, we adopt an attention-based path encoding syntactic dependencies, adjacent word connections, cross- mechanism, which takes the global contextual information as sentence connections, and coreferential mention interactions. a guidance, to selectively aggregate path information between Figure 2 shows an example document-level graph corre- the potentially long-distance target entity pair at a fine level. sponding to a two-sentence document. The graph regards Given the entity representations from both two levels that fea- words in the document as nodes and captures document- ture global contextual information and long-distance interac- aware interactions by five categories of edges. These undi- tions, respectively, we can obtain more comprehensive entity rected edges are described as follows. representations for relation extraction by combining them. Syntactic dependency edge: Syntactic dependency informa- Our contributions are summarized as follows: tion is proved effective for document-level or cross-sentence • We propose a novel document-level RE model called RE in previous works [Sahu et al., 2019]. Therefore, we use CFER. It uses a coarse-to-fine strategy to integrate the dependency parser in spaCy2 to parse the syntactic de- global contextual information and model long-distance pendency tree for each sentence. After that, we add edges interactions between the target entities simultaneously, between all node pairs that have dependency relations. thus obtaining comprehensive entity representations. Adjacent word edge: Quirk and Poon [2017] point out that • Experimental results on two popular document-level RE adding edges between adjacent words can mitigate the de- datasets, DocRED and CDR, show that CFER achieves pendency parser errors. Therefore, we add edges between all better performance than existing models. node pairs that are adjacent in the document. Self-loop edge: For a node, in addition to its neighborhood • Elaborate analysis validates the effectiveness of our information, the historical information of the node itself is coarse-to-fine strategy. Further, we highlight the robust- ness of CFER to the uneven label distribution and the 2https://spacy.io/ Figure 3: An illustration of our proposed model, CFER. It is composed of four modules: a text encoding module, a coarse-level representation module, a fine-level representation module, and a classification module. also essential in information integration. Therefore, we add a sub-layers. At the l-th sub-layer in block k, the calculation self-loop edge for each node. for node i is defined as Adjacent sentence edge: To ensure that information can be   integrated across sentences, for each adjacent sentence pair, (k,l) X (k,l) h = W(k,l)hˆ + b(k,l) , we add an edge between their dependency tree roots. i ReLU  j  (2) Coreferential mention edge: Coreferential mentions may j∈N (i) share information captured from their respective contexts hˆ(k,l) = [x(k); h(k,1); ...; h(k,l−1)], (3) with each other. This could be regarded as global cross- j j j j sentence interactions. Therefore, we add edges between the (k) dh (k,l) where xj ∈ R is the block input of node j. hi ∈ first words of all mention pairs that refer to the same entity. Rdh/mk is the output of node i at the l-th sub-layer in block (k,l) ˆ dh+(l−1)dh/mk 2.3 Coarse-to-Fine Entity Representations k. hj ∈ R is the neighborhood input, ob- In this subsection, we describe our proposed Coarse-to-Fine tained by concatenating x(k) and the outputs of node j from Entity Representation model (CFER) in detail. As shown in j all previous sub-layers in the same block. W(k,l) and b(k,l) Figure 3, our model is composed of a text encoding module, are trainable parameters. N (i) denotes the neighbor set of a coarse-level representation module, a fine-level representa- node i in the document-level graph. Finally, block k adds up tion module, and a classification module. the block input and the concatenation of all sub-layer outputs, Text Encoding Module: This module aims to encode each and then adopts a fully connected layer to compute the block word in the document as a vector with text contextual infor- (k) dh mation. By default, CFER uses GloVe [Pennington et al., output oi ∈ R : 2014] embeddings and a Bi-GRU model [Cho et al., 2014; (k)  (k) (k,1) (k,mk)  Schuster and Paliwal, 1997] as the encoder. To improve the oi = FC xi + [hi ; ...; hi ] . (4) ability of contextual text encoding, we can also replace this (n) module by Pre-Trained Models (PTM) such as BERT [De- Finally, we take the output of the final block oi as the output vlin et al., 2019] or RoBERTa [Liu et al., 2019]. This mod- of the coarse-level representation module. ule finally outputs contextual word representations hi of each Fine-level Representation Module: The coarse-level repre- word in the document: sentations can capture rich contextual information, but may fail to model long-distance entity interactions. Taking the h = Encoding(D), (1) i global contextual information as a guidance, the fine-level d where hi ∈ R h , and dh denotes the hidden dimension. representation module aims to utilize path information be- Coarse-level Representation Module: This module aims tween the target entity pair to alleviate this problem. This to integrate local and global contextual information in the module adopts an attention-based path encoding mechanism entire document-level graph. As indicated by Guo et based on a path encoder and an attention aggregator. al. [2019b], Densely Connected Graph Convolutional Net- For a target entity pair (e1, e2), we denote the numbers of works (DCGCN) have the ability to capture rich local and their corresponding mentions as |e1| and |e2|, respectively. global contextual information. Therefore, we adopt DCGCN We first extract |e1| × |e2| shortest paths between all mention layers as the coarse-level representation module. DCGCN pairs in a subgraph that contains only syntactic dependency layers are organized into n blocks and the k-th block has mk and adjacent sentence edges. For the i-th path [w1, ..., wleni ], we then adopt a Bi-GRU model as the path encoder to com- contains 5, 053 documents, 132, 375 entities and 56, 354 re- pute the path-aware mention representations: lational facts divided into 96 relation categories. Among the −−−→   annotated documents, 3, 053 are for training, 1, 000 are for p−−→ = GRU p−−−→, o(n) , i,j i,j−1 wj (5) development, and 1, 000 are for testing. Chemical-Disease ←−−−   Reactions (CDR) [Li et al., 2016] is a popular dataset in the p←−− = GRU p←−−−, o(n) , i,j i,j+1 wj (6) biomedical domain containing 500 training examples, 500 (h) ←−− (t) −−−→ development examples, and 500 testing examples. mi = pi,1, mi = pi,leni , (7)

−−→ ←−− dh 3.2 Experimental Settings where pi,j, pi,j ∈ R are the forward and backward GRU hidden states of the j-th node in the i-th path, respectively. We tune hyper-parameters on the development set. Gener- (h) (t) dh mi , mi ∈ R are the path-aware representations of the ally, we use AdamW [Loshchilov and Hutter, 2017] as the head and tail mentions in the i-th path, respectively. optimizer, and use the DCGCN consisting of two blocks with Since not all the paths contain useful information, we de- 4 sub-layers in each block. On CDR, we use a biomedical sign an attention aggregator, which takes the global contex- domain pre-trained model BioBERT [Lee et al., 2020] as the tual information as a guidance, to selectively aggregate the text encoding module. Details of hyper-parameters such as path-aware mention representations: learning rate, batch size, dropout rate, hidden dimension, and others for each version of CFER are shown in Appendix A. 1 X (n) 1 X (n) he = oj , et = oj , (8) Following previous works, for DocRED, we choose micro |e1| |e2| j∈e1 j∈e2 F1 and Ign F1 as evaluation metrics. For CDR, we choose  (h) (t)  micro F1, Intra-F1, and Inter-F1 as metrics. Ign F1 denotes αi = Softmax Wa[he;et; mi ; mi ] + ba , (9) i F1 excluding relational facts that appear in both the train- X (h) X (t) ing set and the development or testing set. Intra- and Inter- h = m · α , t = m · α , (10) i i i i F1 denote F1 of relations expressed within and across sen- i i tences, respectively. We determine relation-specific thresh- d where he,et ∈ R h are the coarse-level head and tail entity olds δr based on the micro F1 on the development set. With representations, respectively, which are computed by averag- these thresholds, we classify a triplet (e1, r, e2) as positive if ing their corresponding coarse-level mention representations. P (r|e1, e2) > δr or negative otherwise. d Wa and ba are trainable parameters. h, t ∈ R h are the path- aware fine-level representations of the head and tail entities, 3.3 Main Results respectively. To obtain the main evaluation results, we run each version Classification Module: In this module, we combine the en- of CFER three times, take the median F1 on the devel- tity representations from both two levels to obtain compre- opment set to report, and test the corresponding model on hensive representations that capture both global contextual the test set. Table 1 shows the main evaluation results on information and long-distance entity interactions. Next, we DocRED. We compare CFER with 21 existing models on predict the probability of each relation by a bilinear scorer: three tracks: (1) non-PTM track: models on this track do  T  not use Pre-Trained Models (PTM). GCNN [Sahu et al., P (r|e1, e2) = Sigmoid [he; h] Wc[et; t] + bc , (11) r 2019], EoG [Christopoulou et al., 2019], GCGCN [Zhou 2dh×nr ×2dh nr et al., 2020], GEDA [Li et al., 2020], LSR [Nan et al., where Wc ∈ R and bc ∈ R are trainable pa- 2020], GAIN [Zeng et al., 2020], and HeterGSAN-Recon rameters with nr denoting the number of relation categories. are graph-based models that leverage GNN to encode nodes. 2.4 Optimization Objective Bi-LSTM [Yao et al., 2019] and HIN [Tang et al., 2020] are Considering that an entity pair may have multiple relations, text-based models without constructing a graph. (2) BERT we choose the binary cross entropy loss between the ground track: models on this track are based on BERTBase [De- vlin et al., 2019]. GEDA-BERT , HIN-BERT , truth label yr and P (r|e1, e2) as the optimization objective: Base Base GCGCN-BERTBase, LSR-BERTBase, HeterGSAN-Recon- X   L = − yr · log P (r|e1, e2) + BERTBase, and GAIN-BERTBase are BERT-versions of cor- r (12) responding non-PTM models. BERT-Two-Step [Wang et  al., 2019] and MIUK [Li et al., 2021] are two text-based (1 − yr) · log 1−P (r|e1, e2) . models. CorefBERTBase [Ye et al., 2020] and ERICA- BERTBase [Qin et al., 2020] are two pre-trained models. 3 Experiments and Analysis (3) RoBERTa track: models on this track are based on RoBERTa [Liu et al., 2019]: ERICA-RoBERTaBase [Qin et 3.1 Dataset al., 2020] and CorefRoBERTaLarge [Ye et al., 2020]. From We evaluate our model on two document-level RE datasets. Table 1, we observe that on all three tracks, CFER achieves DocRED [Yao et al., 2019] is a large-scale human-annotated the best performance and significantly outperforms most of dataset constructed from Wikipedia and Wikidata [Vrande- the existing models. Further, we find that graph-based models cic and Krotzsch,¨ 2014], and is currently the largest human- generally have significant advantages over text-based models annotated dataset for general domain document-level RE. It in document-level RE. Dev Test Model Ign F1 F1 Ign F1 F1 Bi-LSTM [Yao et al., 2019] 48.87 50.94 48.78 51.06 GCNN [Sahu et al., 2019] 46.22 51.52 49.59 51.62 EoG [Christopoulou et al., 2019] 45.94 52.15 49.48 51.82 HIN [Tang et al., 2020] 51.06 52.95 51.15 53.30 GCGCN [Zhou et al., 2020] 51.14 53.05 50.87 53.13 GEDA [Li et al., 2020] 51.03 53.60 51.22 52.97 LSR [Nan et al., 2020] 48.82 55.17 52.15 54.18 GAIN [Zeng et al., 2020] 53.05 55.29 52.66 55.08 HeterGSAN-Recon [Wang et al., 2020] 54.27 56.22 53.27 55.23 CFER (Ours) 54.29 56.40 53.43 55.75 BERT-Two-Step [Wang et al., 2019] - 54.42 - 53.92 GEDA-BERTBase [Li et al., 2020] 54.52 56.16 53.71 55.74 HIN-BERTBase [Tang et al., 2020] 54.29 56.31 53.70 55.60 GCGCN-BERTBase [Zhou et al., 2020] 55.43 57.35 54.53 56.67 CorefBERTBase [Ye et al., 2020] 55.32 57.51 54.54 56.96 ERICA-BERTBase [Qin et al., 2020] 56.7 58.8 55.9 58.2 LSR-BERTBase [Nan et al., 2020] 52.43 59.00 56.97 59.05 MIUK [Li et al., 2021] 58.27 60.11 58.05 59.99 HeterGSAN-Recon-BERTBase [Wang et al., 2020] 58.13 60.18 57.12 59.45 GAIN-BERTBase [Zeng et al., 2020] 59.14 61.22 59.00 61.24 CFER-BERTBase (Ours) 59.23 61.41 59.16 61.28

ERICA-RoBERTaBase [Qin et al., 2020] 56.3 58.6 56.6 59.0 CorefRoBERTaLarge [Ye et al., 2020] 57.84 59.93 57.68 59.91 CFER-RoBERTaLarge (Ours) 60.91 62.80 60.70 62.76

Table 1: Main evaluation results on DocRED. Bold denotes the best result. Underline denotes the second-best result.

Model F1 Intra-F1 Inter-F1 modify our model in three ways: (1) Replace the attention ag- gregator by a simple mean aggregator (denoted by - Attention [ ] GCNN Sahu et al., 2019 58.6 - - Aggregator). (2) Replace all used shortest paths by a ran- [ ] LSR Nan et al., 2020 61.2 66.2 50.3 dom shortest path (denoted by - Multiple Paths). (3) Remove [ ] BRAN Verga et al., 2018 62.1 - - the whole fine-level representations from the final classifica- [ ] EoG Christopoulou et al., 2019 63.6 68.2 50.9 tion (denoted by - Fine-level Repr.). F1 scores of these three LSR w/o MDP Nodes 64.8 68.9 53.1 modified models decrease by 1.48, 2.72, and 4.12, respec- CFER (Ours) 65.9 70.7 57.8 tively. This verifies that our attention aggregator for multiple paths has the ability to selectively aggregate useful informa- Table 2: Main evaluation results on the biomedical dataset CDR. tion thus producing effective fine-level representations. Sec- ondly, to explore the effectiveness of coarse-level represen- Table 2 shows the main evaluation results on CDR. Besides tations, we modify our model in two ways: (1) Remove the GCNN, LSR, EoG mentioned above, we compare two more whole coarse-level representations from the final classifica- models: BRAN [Verga et al., 2018], a text-based method, tion (denoted by - Coarse-level Repr.). (2) Remove DCGCN and LSR w/o MDP Nodes, a modified version of LSR. From blocks (denoted by - DCGCN Blocks). F1 scores of these Table 2 we find that CFER outperforms all the baselines, es- two modified models decrease by 1.14 and 2.08, respectively. pecially on Inter-F1. This suggests that CFER has stronger This verifies that DCGCN has the ability to capture rich lo- advantages in modeling inter-sentence interactions. Note cal and global contextual information, thus producing high- that the modified version of LSR performs better than full quality coarse-level representations that benefit relation ex- LSR. This may imply that LSR does not always work on all traction. Finally, we remove both coarse- and fine-level rep- datasets, while CFER does not have this problem. resentation modules (denoted by - Both-level Modules). This operation makes our model degenerate into a simple version 3.4 Model Analysis similar to Bi-LSTM. As a result, this simple version achieves a similar performance to the Bi-LSTM baseline as expected. Ablation study: To verify the effectiveness of each mod- This suggests that our text encoding module is not much dif- ule in CFER, we show ablation experiment results of CFER- ferent from the Bi-LSTM baseline, and the performance im- GloVe on the development set of DocRED in Table 3. Firstly, provement is totally introduced by our coarse-to-fine strategy. to explore the effectiveness of fine-level representations, we Robustness to the uneven label distribution: To analyze Model F1 our performance on different relations, we divide 96 relations in DocRED into 4 categories according to their ground-truth CFER-GloVe (Full) 56.40 (-0.00) positive label numbers in the development set. We show mi- - Attention Aggregator 54.92 (-1.48) cro F1 on each category achieved by two baseline models - Multiple Paths 53.68 (-2.72) and CFER in Figure 5. As shown in the figure, compared - Fine-level Repr. 52.28 (-4.12) to Bi-LSTM, BERT-Two-Step makes more improvement on relations with more than 20 positive labels (denoted by major - Coarse-level Repr. 55.26 (-1.14) relations), but less improvement (10.15%) on long-tail rela- - DCGCN Blocks 54.32 (-2.08) tions with less than or equal to 20 positive labels. By con- - Both-level Modules 51.67 (-4.73) trast, keeping the improvement on major relations, our model makes a much more significant improvement (86.18%) on Table 3: Ablation experiments on the development set of DocRED. long-tail relations. With high performance on long-tail rela- tions, our model narrows the F1 gap between major and long- tail relations from 46.05 (Bi-LSTM) to 38.12. This suggests Distance [0, 4) [4, 8) [8, ) that our model is more robust to the uneven label distribution. Ability to model long-distance interactions: To analyze Micro F1 66.3 59.8 61.5 our performance on entity pairs with different distances, we divide all entity pairs in CDR into 3 categories according to Table 4: Micro F1 on entity pairs with different sentence distances. their sentence distance, i.e., the number of sentences that sep- arate them. Specifically, the distance of an intra-sentence en- tity pair is 0. Table 4 shows the micro F1 on entity pairs with performance as CFER (Full). different sentence distances. From the table, we find that as the distance increases, the performance of CFER does not de- crease much, and even increases from [4, 8) to [8, ). This validates again the ability of CFER to model long-distance 4 Related Work interactions between entity pairs. Most of existing document-level RE models are graph- 3.5 Case Study based. They usually construct a document-level graph, which Figure 4 shows an extraction case selected from the develop- represents words as nodes and uses edges between them to ment set. In this case, CFER-GloVe (- Both-level Modules), a capture document-aware interactions. Quirk and Poon [2017] simple version similar to the Bi-LSTM baseline, extracts only design feature templates to extract features from multiple two simple relational facts. CFER-GloVe (- Fine-level Repr.) paths for classification. Sahu et al. [2019] use a labeled edge extracts three more relational facts since it adopts DCGCN GCN [Marcheggiani and Titov, 2017] to integrate informa- blocks to integrate richer local and global contextual infor- tion. Guo et al. [2019a] and Nan et al. [2020] apply graph mation. CFER-GloVe (- DCGCN Blocks) makes use of path convolutional networks to a complete graph with iteratively information to model long-distance entity interactions, and refined edges weights. Christopoulou et al. [2019] propose also extracts three more relational facts compared to CFER- an edge-oriented model that represents paths through a walk- GloVe (- Both-level Modules). CFER-GloVe (Full) combines based iterative inference mechanism. Wang et al. [2020] pro- all the advantages of its modified versions. As a result, it ex- pose a reconstructor to model path dependency between an tracts the most relational facts. entity pair. Li et al. [2020] propose to improve inter-sentence To reveal how our coarse-to-fine strategy works, we further reasoning by characterizing interactions between sentences analyze the relational fact (0, P140, 5), which is extracted by and relation instances. Li et al. [2021] leverage knowledge only CFER (Full). In Figure 4, for CFER (Full) and CFER (- graphs in document-level RE. Zhou et al. [2020] propose a DCGCN Blocks), we additionally show several high-weight context-enhanced model to capture global context informa- paths along with their attention weights used for producing tion. Zeng et al. [2020] design a double graph to cope with fine-level representations. From the shown case, we can find document-level RE. that CFER (Full) gives relatively smooth weights to its high- Besides graph-based methods, there are also text-based weight paths. This enables it to aggregate richer path in- methods without constructing a graph. Verga et al. [2018] formation from several useful paths. In fact, CFER (Full) adopt a transformer [Vaswani et al., 2017] to encode the doc- successfully aggregates both intra-sentence (within sentence ument. Wang et al. [2019] adopt BERT [Devlin et al., 2019] 7) and inter-sentence (across sentences 5, 6, and 7) informa- to encode the document and predict the existence of relations tion. By contrast, without the guidance of global contextual before predicting the specific relations. Tang et al. [2020] de- information, CFER (- DCGCN Blocks) learns extremely un- sign a hierarchical architecture to make full use of informa- balanced weights and pays almost all its attention to a sub- tion from several levels. Ye et al. [2020] attempt to explicitly optimal path. As a result, it fails to extract the P140 relation. capture relations between coreferential noun phrases to co- As for CFER-GloVe (- Fine-level Repr.) and CFER-GloVe herently comprehend the whole document. Qin et al. [2020] (- Both-level Module), they do not consider any path infor- use contrastive learning to obtain a deeper understanding of mation. Therefore, it is hard for them to achieve as good entities and relations in text. Document Text (0) Daniel Ajayi Adeniran is a Pentecostal pastor from Nigeria . (1) As of 2011 , he heads the expansion of the African - based Redeemed Christian Church of God in North America . (2) An educated man who worked as a civil servant in Nigeria , Ajayi Adeniran experienced problems with alcohol , and in 1989 visited the Redeemed Christian Church of God across the street from his home near Ibadan . (3) In 1990 , he converted to the doctrines of the Redeemed Christian Church and was ordained through that denomination 1994 . (4) He moved to the United States in 1995 because of the political conditions under the dictator Sani Abacha in Nigeria . (5) After arriving in the U.S. , Ajayi Adeniran became part of the first parish of the Redeemed Christian Church of God in North America located on Roosevelt Island . (6) Soon after , he became the pastor of a newly formed branch of the church , meeting in a Bronx storefront . (7) Ajayi Adeniran states his goal is to inform others of the mission of the Redeemed Christian so that in each household in the world there will be at least one member of Redeemed Christian Church of God .

Entity & Relation Dictionary Extracted Relational Facts Path Attention Weights for Relational Fact (0, P140, 5) Entities: CFER-GloVe (Full) CFER-GloVe (- DCGCN Blocks) 0: Daniel Ajayi Adeniran Path weight: 0.53 Path weight: 0.21 Path weight: 0.98 2: Nigeria CFER-GloVe (- Both-level Modules) 5: Redeemed Christian Church Ajayi Ajayi Daniel (head) (head) 6: North America Adeniran Adeniran Ajayi (head) (0, P27, 2) 8: Ibadan became (root 5) states (root 7) Adeniran 9: 1990 (13, P27, 2) became (root 6) is is 11: the United States states (root 7) inform heads (root 0) 13: Sani Abacha CFER-GloVe CFER-GloVe is others expansion (root 1) (8, P17, 2) 14: Roosevelt Island (- Fine-level Repr.) (- DCGCN Blocks) inform of of Relations: (6, P527, 11) others mission Redeemed P17: country (14, P17, 11) (11, P30, 6) of of Christian P27: country of citizenship mission Redeemed Church (tail) P30: continent (tail) of Christian of P140: religion (0, P140, 5) (11, P361, 6) P527: has part Redeemed God CFER-GloVe (Full) (tail) P361: part of Christian

Figure 4: An extraction case from the development set. We show the relational facts extracted by four versions of CFER-GloVe. For CFER- GloVe (Full) and CFER-GloVe (- DCGCN Blocks), we additionally show the paths along with their attention weights used for producing fine-level representations of entity 0 (Daniel Ajayi Adeniran) and entity 5 (Redeemed Christian Church).

encoder-decoder for statistical machine translation. In EMNLP 2014, pages 1724–1734, 2014. [Christopoulou et al., 2019] Fenia Christopoulou, Makoto Miwa, and Sophia Ananiadou. Connecting the dots: Document-level neural relation extraction with edge-oriented graphs. In EMNLP- IJCNLP 2019, pages 4924–4935, 2019. [Devlin et al., 2019] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL-HLT 2019, pages 4171–4186, 2019. [Guo et al., 2019a] Zhijiang Guo, Yan Zhang, and Wei Lu. Atten- tion guided graph convolutional networks for relation extraction. In ACL 2019, pages 241–251, 2019. Figure 5: Micro F1 on different categories of relations. CFER makes [Guo et al., 2019b] Zhijiang Guo, Yan Zhang, Zhiyang Teng, and more improvement on long-tail relations with fewer positive labels. Wei Lu. Densely connected graph convolutional networks for graph-to-sequence learning. TACL, 7:297–312, 2019. 5 Conclusion [Lee et al., 2020] Jinhyuk Lee, Wonjin Yoon, Sungdong Kim, Donghyeon Kim, Sunkyu Kim, Chan Ho So, and Jaewoo Kang. In this paper, we propose CFER with a coarse-to-fine strategy Biobert: a pre-trained biomedical language representation model to learn comprehensive representations for document-level for biomedical text mining. Bioinform., 36(4):1234–1240, 2020. RE. Our model integrates global contextual information and [Li et al., 2016] Jiao Li, Yueping Sun, Robin J. Johnson, Daniela models long-distance interactions between the target entity Sciaky, Chih-Hsuan Wei, Robert Leaman, Allan Peter Davis, pair simultaneously, thus addressing the disadvantages that Carolyn J. Mattingly, Thomas C. Wiegers, and Zhiyong Lu. existing graph-based models suffer from. Experimental re- Biocreative V CDR task corpus: a resource for chemical disease sults on two document-level RE datasets, DocRED and CDR, relation extraction. Database, 2016, 2016. show that CFER outperforms existing models. Further, elab- [Li et al., 2018] Qimai Li, Zhichao Han, and Xiao-Ming Wu. orate analysis verifies the effectiveness of our coarse-to-fine Deeper insights into graph convolutional networks for semi- strategy, and highlights the robustness of CFER to the un- supervised learning. In AAAI 2018, pages 3538–3545, 2018. even label distribution and the ability of CFER to model long- [ ] distance interactions. Note that our coarse-to-fine strategy is Li et al., 2020 Bo Li, Wei Ye, Zhonghao Sheng, Rui Xie, Xiangyu Xi, and Shikun Zhang. Graph enhanced dual attention network not limited to only the task of document-level RE. It has the for document-level relation extraction. In COLING 2020, pages potential to be applied to a variety of other NLP tasks. 1551–1560, 2020. [Li et al., 2021] Bo Li, Wei Ye, Canming Huang, and Shikun References Zhang. Multi-view inference for relation extraction with uncer- AAAI 2021 [Cho et al., 2014] Kyunghyun Cho, Bart van Merrienboer, C¸aglar tain knowledge. In , 2021. Gulc¸ehre,¨ Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, [Liu et al., 2019] Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, and Yoshua Bengio. Learning phrase representations using RNN Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly opti- [Ye et al., 2020] Deming Ye, Yankai Lin, Jiaju Du, Zhenghao Liu, mized bert pretraining approach. CoRR, abs/1907.11692, 2019. Maosong Sun, and Zhiyuan Liu. Coreferential reasoning learning [Loshchilov and Hutter, 2017] Ilya Loshchilov and Frank Hut- for language representation. CoRR, abs/2004.06870, 2020. ter. Fixing weight decay regularization in adam. CoRR, [Zeng et al., 2014] Daojian Zeng, Kang Liu, Siwei Lai, Guangyou abs/1711.05101, 2017. Zhou, and Jun Zhao. Relation classification via convolutional [Luan et al., 2018] Yi Luan, Luheng He, Mari Ostendorf, and Han- deep neural network. In COLING 2014, pages 2335–2344, 2014. naneh Hajishirzi. Multi-task identification of entities, relations, [Zeng et al., 2020] Shuang Zeng, Runxin Xu, Baobao Chang, and and coreference for scientific knowledge graph construction. In Lei Li. Double graph based reasoning for document-level relation EMNLP 2018, pages 3219–3232, 2018. extraction. In EMNLP 2020, pages 1630–1640, 2020. [Marcheggiani and Titov, 2017] Diego Marcheggiani and Ivan [Zhou et al., 2016] Peng Zhou, Wei Shi, Jun Tian, Zhenyu Qi, Titov. Encoding sentences with graph convolutional networks Bingchen Li, Hongwei Hao, and Bo Xu. Attention-based bidi- for semantic role labeling. In EMNLP 2017, 2017. rectional long short-term memory networks for relation classifi- [Nan et al., 2020] Guoshun Nan, Zhijiang Guo, Ivan Sekulic, cation. In ACL 2016, pages 207–212, 2016. and Wei Lu. Reasoning with latent structure refinement for [Zhou et al., 2020] Huiwei Zhou, Yibin Xu, Weihong Yao, Zhe Liu, document-level relation extraction. In ACL 2020, 2020. Chengkun Lang, and Haibin Jiang. Global context-enhanced [Pennington et al., 2014] Jeffrey Pennington, Richard Socher, and graph convolutional networks for document-level relation extrac- COLING 2020 Christopher D. Manning. Glove: Global vectors for word repre- tion. In , pages 5259–5270, 2020. sentation. In EMNLP 2014, pages 1532–1543, 2014. [Qin et al., 2020] Yujia Qin, Yankai Lin, Ryuichi Takanobu, Zhiyuan Liu, Peng Li, Heng Ji, Minlie Huang, Maosong Sun, and Appendices Jie Zhou. ERICA: improving entity and relation understanding for pre-trained language models via contrastive learning. CoRR, A Details of Hyper-parameters abs/2012.15022, 2020. We search the best hyper-parameters based on F1 on the devel- [ ] Quirk and Poon, 2017 Chris Quirk and Hoifung Poon. Distant su- opment set. Generally, for all of CFER-GloVe, CFER-BERT , pervision for relation extraction beyond the sentence boundary. Base CFER-RoBERTaLarge and CFER for CDR, we use AdamW with In EACL 2017, pages 1171–1182, 2017. β1 = 0.9, β2 = 0.999,  = 1e − 6, weight decay = 0.0001 as [Sahu et al., 2019] Sunil Kumar Sahu, Fenia Christopoulou, the optimizer, apply exponential moving average on all parameters Makoto Miwa, and Sophia Ananiadou. Inter-sentence relation with a decay rate of 0.9999, use ReLU as the activation function, extraction with document-level graph convolutional neural and use the DCGCN consisting of two blocks with 4 sub-layers in network. In ACL 2019, pages 4309–4316, 2019. each block. We adopt a slanted triangular scheduling strategy for [Schuster and Paliwal, 1997] Mike Schuster and Kuldip K. Paliwal. learning rate, which first linearly increases the learning rate from 0 Bidirectional recurrent neural networks. IEEE Transactions on to the peak value in the first 10% steps (warm-up steps), and then Signal Processing, 45(11):2673–2681, 1997. linearly decreases it to 0 in remaining steps. For other key hyper- parameters, we state the values tried and the finally selected value [Tang et al., 2020] Hengzhu Tang, Yanan Cao, Zhenyu Zhang, for four models separately as follows. Jiangxia Cao, Fang Fang, Shi Wang, and Pengfei Yin. HIN: hi- For CFER-GloVe: (1) We search the peak learning rate for all erarchical inference network for document-level relation extrac- modules in {1e − 3, 1e − 4}, and finally choose 1e − 3. (2) We tion. In PAKDD 2020, volume 12084 of Lecture Notes in Com- search the batch size in {8, 16, 32}, and finally select 16. (3) We puter Science, pages 197–209, 2020. search the dropout rate for DCGCN modules in {0.2, 0.4, 0.6}, and [Vaswani et al., 2017] Ashish Vaswani, Noam Shazeer, Niki Par- finally select 0.4. (4) We search the dropout rate for other modules mar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz in {0.2, 0.4, 0.6}, and finally select 0.2. (5) We set the embedding Kaiser, and Illia Polosukhin. Attention is all you need. In dimension to 300, the same as the dimension of used GloVe embed- NeurIPS 2017, pages 5998–6008, 2017. dings. (6) We search the hidden size in {100, 300, 512}, and finally select 300. (7) For each hyper-parameter configuration, we train 300 [Verga et al., 2018] Patrick Verga, Emma Strubell, and Andrew epochs and select the best F1 achieved during these 300 epochs to McCallum. Simultaneously self-attending to all mentions for evaluate the performance under this configuration. full-abstract biological relation extraction. In NAACL-HLT 2018, For CFER-BERT : (1) We search the peak learning rate for pages 872–884, 2018. Base BERT modules in {1e − 4, 5e − 5, 1e − 5}, and finally select [Vrandecic and Krotzsch,¨ 2014] Denny Vrandecic and Markus 1e − 5. (2) We search the peak learning rate for the other mod- Krotzsch.¨ Wikidata: a free collaborative knowledgebase. Com- ules in {1e − 3, 5e − 4, 1e − 4}, and finally select 1e − 3. (3) We munications of the ACM, 57(10):78–85, 2014. search the batch size in {8, 16, 32}, and finally select 32. (4) We [Wang et al., 2019] Hong Wang, Christfried Focke, Rob Sylvester, search the dropout rate for DCGCN modules in {0.2, 0.4, 0.6}, and Nilesh Mishra, and William Wang. Fine-tune bert for docred with finally select 0.6. (5) We search the dropout rate for other modules two-step process. CoRR, abs/1909.11898, 2019. in {0.2, 0.4, 0.6}, and finally select 0.2. (6) We search the hidden size in {300, 512, 768}, and finally select 512. (7) For each hyper- [ ] Wang et al., 2020 Xu Wang, Kehai Chen, and Tiejun Zhao. parameter configuration, we train 300 epochs and select the best F1 Document-level relation extraction with reconstruction. CoRR, achieved during these 300 epochs to evaluate the performance under abs/2012.11384, 2020. this configuration. [Yao et al., 2019] Yuan Yao, Deming Ye, Peng Li, Xu Han, Yankai For CFER-RoBERTaLarge: (1) We search the peak learning rate Lin, Zhenghao Liu, Zhiyuan Liu, Lixin Huang, Jie Zhou, and for RoBERTa modules in {1e − 4, 5e − 5, 1e − 5}, and finally select Maosong Sun. Docred: A large-scale document-level relation 1e − 5. (2) We search the peak learning rate for the other modules extraction dataset. In ACL 2019, pages 764–777, 2019. in {1e − 3, 5e − 4, 1e − 4}, and finally select 1e − 3. (3) We search the batch size in {8, 16, 32}, and finally select 32. (4) We search the dropout rate for DCGCN modules in {0.2, 0.4, 0.6}, and finally select 0.6. (5) We search the dropout rate for other modules in {0.2, 0.4, 0.6}, and finally select 0.2. (6) We search the hidden size in {512, 768, 1024}, and finally select 1024. (7) For each hyper- parameter configuration, we train 300 epochs and select the best F1 achieved during these 300 epochs to evaluate the performance under this configuration. For CFER for CDR: (1) We search the peak learning rate for BioBERT modules in {1e − 4, 5e − 5, 1e − 5}, and finally select 1e − 5. (2) We search the peak learning rate for the other mod- ules in {1e − 3, 5e − 4, 1e − 4}, and finally select 1e − 4. (3) We search the batch size in {4, 8, 16}, and finally select 4. (4) We search the dropout rate for DCGCN modules in {0.2, 0.4, 0.6}, and finally select 0.6. (5) We search the dropout rate for other modules in {0.2, 0.4, 0.6}, and finally select 0.2. (6) We search the hidden size in {512, 768, 1024}, and finally select 1024. (7) For each hyper- parameter configuration, we train 100 epochs and select the best F1 achieved during these 100 epochs to evaluate the performance under this configuration.