Arxiv:2012.02507V2 [Cs.CL] 6 Jun 2021 1 Introduction Between Long-Distance Entities

Coarse-to-Fine Entity Representations for Document-level Relation Extraction Damai Dai1∗ , Jing Ren1∗ , Shuang Zeng1 , Baobao Chang1,2 , Zhifang Sui1,2 1MOE Key Lab of Computational Linguistics, School of EECS, Peking University 2Peng Cheng Laboratory, China fdaidamai,rjj,zengs,chbb,[email protected] Abstract (0) Romeo Void was an American new wave band … (1) The band primarily member of Document-level Relation Extraction (RE) requires consisted of saxophonist Benjamin Bossi … (7) The success of their extracting relations expressed within and across signed by performer sentences. Recent works show that graph-based second release, a 4-song EP, Never Say Never resulted in a distribution record label methods, usually constructing a document-level deal with Columbia Records … graph that captures document-aware interactions, can obtain useful entity representations thus help- Figure 1: An example for document-level RE. Word spans with ing tackle document-level RE. These methods ei- background colors denote the mentions. Mentions that refer to the ther focus more on the entire graph, or pay more same entity have the same background color. The solid lines denote attention to a part of the graph, e.g., paths be- intra-sentence relations. The dotted line denotes an inter-sentence tween the target entity pair. However, we find that relation. Document-level RE requires extracting both intra- and inter-sentence relations between all the target entity pairs. document-level RE may benefit from focusing on both of them simultaneously. Therefore, to obtain more comprehensive entity representations, we some inter-sentence relations while a considerable number of propose the Coarse-to-Fine Entity Representation relations are expressed beyond a single sentence but across model (CFER) that adopts a coarse-to-fine strat- multiple sentences in a long document [Yao et al., 2019]. egy involving two phases. First, CFER uses graph Therefore, document-level RE has attracted much attention neural networks to integrate global information in in recent years. the entire graph at a coarse level. Next, CFER uti- lizes the global information as a guidance to selec- Figure 1 shows an example document and corresponding tively aggregate path information between the tar- relational facts for document-level RE. In the example, to get entity pair at a fine level. In classification, we extract the relation between Benjamin Bossi and Columbia combine the entity representations from both two Records, two entities separated by several sentences, we need levels into more comprehensive representations for the following inference steps. First, we need to know that relation extraction. Experimental results on two Benjamin Bossi is a member of Romeo Void. Next, we need to document-level RE datasets, DocRED and CDR, infer that Never Say Never is performed by Romeo Void, and show that CFER outperforms existing models and released by Columbia Records. Based on these facts, we can is robust to the uneven label distribution.1 draw a conclusion that Benjamin Bossi is signed by Columbia Records. The example indicates that, to tackle document- level RE, a model needs the ability to capture interactions arXiv:2012.02507v2 [cs.CL] 6 Jun 2021 1 Introduction between long-distance entities. In addition, since a document may have an extremely long text, a model also needs the abil- Relation Extraction (RE) aims to extract semantic relations ity to integrate global contextual information for words. between named entities from plain text. It is an efficient way to acquire structured knowledge automatically, thus ben- Recent works show that graph-based methods can obtain efiting various natural language processing (NLP) applica- useful entity representations thus helping tackle document- tions, especially knowledge graph construction [Luan et al., level RE. These methods usually construct a document-level 2018]. Most of the previous RE works focus on the sen- graph, which represents words as nodes and uses edges be- tence level, i.e., they extract the relations within only a sin- tween them to capture document-aware interactions. To ob- gle sentence [Zeng et al., 2014; Zhou et al., 2016]. However, tain entity representations for relation extraction, some of [ in real-world scenarios, sentence-level RE models may omit them use graph neural networks (GNN) Marcheggiani and Titov, 2017; Guo et al., 2019b] to integrate neighborhood in- ∗Equal Contribution. formation for each node [Sahu et al., 2019; Nan et al., 2020]. 1Our code and data are available at https://github.com/ Although they consider the entire graph structure, they may Hunter-DDM/cfer-document-level-RE. fail to model the interactions between long-distance entities syntactic dependency adjacent word self-loop adjacent sentence coreferential mention Romeo Void was an American new wave band from San Francisco California , formed in 1979. (root) The band primarily consisted of saxophonist Benjamin Bossi , vocalist Debora Iyall , guitarist Peter Woods , and bassist Frank Zincavage . (root) Figure 2: An illustration of a document-level graph corresponding to a two-sentence document. Each node in the graph corresponds to a word in the document. We design five categories of edges to connect nodes in the graph. For the simplicity of the illustration, we omit some self-loop edges and adjacent word edges. due to the inherent over-smoothing problem in GNN [Li et ability of CFER to model long-distance interactions. al., 2018]. Other works attempt to encode path information between the target entity pair in the graph [Quirk and Poon, 2 Methodology 2017; Christopoulou et al., 2019]. They have the ability to alleviate the problem of modeling long-distance entity inter- 2.1 Task Formulation actions, but they may fail to capture global contextual infor- Let D denote a document consisting of N sentences D = mation since they usually integrate only local contextual in- N M fs g , where s = fw g denotes the i-th sentence con- formation for nodes in the graph. Therefore, to obtain more i i=1 i j j=1 comprehensive entity representations, it is necessary to find taining M words denoted by wj. Let E denote an entity set P Q a way to integrate global contextual information and model containing P entities E = feigi=1, where ei = fmjgj=1 long-distance entity interactions simultaneously. denotes the coreferential mention set of the i-th entity, con- In this paper, we propose the Coarse-to-Fine Entity taining Q word spans of corresponding mentions denoted by Representation model (CFER) to obtain comprehensive en- mj. Given D and E, document-level RE requires extracting tity representations for document-level RE. More specifically, all the relational facts in the form of triplets, i.e., extracting we first construct a document-level graph that captures rich f(ei; r; ej)jei; ej 2 E; r 2 Rg, where R is a pre-defined rela- document-aware interactions, reflected by syntactic depen- tion category set. Since an entity pair may have multiple se- dencies, adjacent word connections, cross-sentence connec- mantic relations, we formulate document-level RE as a multi- tions, and coreferential mention interactions. Based on the label classification task. constructed graph, we design a coarse-to-fine strategy with two phases. First, we use Densely Connected Graph Con- 2.2 Document-level Graph Construction volutional Networks (DCGCN) [Guo et al., 2019b] to inte- Given a document, we first construct a document-level graph grate global contextual information in the entire graph at a that captures rich document-aware interactions, reflected by coarse level. Next, we adopt an attention-based path encoding syntactic dependencies, adjacent word connections, cross- mechanism, which takes the global contextual information as sentence connections, and coreferential mention interactions. a guidance, to selectively aggregate path information between Figure 2 shows an example document-level graph corre- the potentially long-distance target entity pair at a fine level. sponding to a two-sentence document. The graph regards Given the entity representations from both two levels that fea- words in the document as nodes and captures document- ture global contextual information and long-distance interac- aware interactions by five categories of edges. These undi- tions, respectively, we can obtain more comprehensive entity rected edges are described as follows. representations for relation extraction by combining them. Syntactic dependency edge: Syntactic dependency informa- Our contributions are summarized as follows: tion is proved effective for document-level or cross-sentence • We propose a novel document-level RE model called RE in previous works [Sahu et al., 2019]. Therefore, we use CFER. It uses a coarse-to-fine strategy to integrate the dependency parser in spaCy2 to parse the syntactic de- global contextual information and model long-distance pendency tree for each sentence. After that, we add edges interactions between the target entities simultaneously, between all node pairs that have dependency relations. thus obtaining comprehensive entity representations. Adjacent word edge: Quirk and Poon [2017] point out that • Experimental results on two popular document-level RE adding edges between adjacent words can mitigate the de- datasets, DocRED and CDR, show that CFER achieves pendency parser errors. Therefore, we add edges between all better performance than existing models. node pairs that are adjacent in the document. Self-loop edge: For a node, in addition to its neighborhood • Elaborate analysis validates the effectiveness of our information, the historical information of the node itself is coarse-to-fine strategy. Further, we highlight the robust- ness of CFER to the uneven label distribution and the 2https://spacy.io/ Figure 3: An illustration of our proposed model, CFER. It is composed of four modules: a text encoding module, a coarse-level representation module, a fine-level representation module, and a classification module. also essential in information integration. Therefore, we add a sub-layers. At the l-th sub-layer in block k, the calculation self-loop edge for each node.

Load more