Adjacency List Oriented Relational Fact Extraction via Adaptive Multi-task Learning

Fubang Zhao1*, Zhuoren Jiang2*†, Yangyang Kang1, Changlong Sun1, Xiaozhong Liu3
1 Alibaba Group, Hangzhou, China
2 School of Public Affairs, Zhejiang University, Hangzhou, China
3 School of Informatics, Computing and Engineering, IUB, Bloomington, USA
[email protected], [email protected], [email protected], [email protected], [email protected]

* These two authors contributed equally to this research.
† Zhuoren Jiang is the corresponding author.

Abstract

Relational fact extraction aims to extract semantic triplets from unstructured text. In this work, we show that all of the relational fact extraction models can be organized according to a graph-oriented analytical perspective. An efficient model, aDjacency lIst oRiented rElational faCT (DIRECT), is proposed based on this analytical framework. To alleviate the challenges of error propagation and sub-task loss equilibrium, DIRECT employs a novel adaptive multi-task learning strategy with dynamic sub-task loss balancing. Extensive experiments are conducted on two benchmark datasets, and the results prove that the proposed model outperforms a series of state-of-the-art (SoTA) models for relational triplet extraction.

[Figure 1: Example of exploring the relational fact extraction task from the perspective of directed graph representation method as output data structure.]

1 Introduction

Relational fact extraction, as an essential NLP task, is playing an increasingly important role in knowledge graph construction (Han et al., 2019; Distiawan et al., 2019). It aims to extract relational triplets from text. A relational triplet is in the form of (subject, relation, object), or (s, r, o) (Zeng et al., 2019). While various prior models have been proposed for relational fact extraction, few of them analyze this task from the perspective of output data structure.

As shown in Figure 1, relational fact extraction can be characterized as a directed graph construction task, where graph representation flexibility and heterogeneity bring additional benefits. In practice, there are three common ways to represent graphs (Gross and Yellen, 2005):

Edge List is utilized to predict a sequence of triplets (edges). The recent sequence-to-sequence based models, such as NovelTagging (Zheng et al., 2017), CopyRE (Zeng et al., 2018), CopyRL (Zeng et al., 2019), and PNDec (Nayak and Ng, 2020), fall into this category. An edge list is a simple and space-efficient way to represent a graph (Arifuzzaman and Khan, 2015). However, there are three problems. First, the triplet overlapping problem (Zeng et al., 2018). For instance, as shown in Figure 1, for the triplets (Obama, nationality, USA) and (Obama, president of, USA), there are two types of relations between "Obama" and "USA". If the model only generates one sequence from the text (Zheng et al., 2017), it may fail to identify the multiple relations between the entities. Second, to overcome the triplet overlapping problem, the model may have to extract triplet elements repeatedly (Zeng et al., 2018), which increases the extraction cost. Third, there could be an ordering problem (Zeng et al., 2019): for multiple triplets, the extraction order can influence the model performance.

Adjacency Matrices are used to predict matrices that indicate exactly which entities (vertices) have semantic relations (edges) between them. Most early works, which take a pipeline approach (Zelenko et al., 2003; Zhou et al., 2005), belong to this category. These models first recognize all entities in the text and then perform relation classification for each entity pair. The subsequent neural network-based models (Bekoulis et al., 2018; Dai et al., 2019), which attempt to extract entities and relations jointly, can also be classified into this category. Compared to an edge list, adjacency matrices have better relation (edge) searching efficiency (Arifuzzaman and Khan, 2015). Furthermore, adjacency matrices oriented models are able to cover the different overlapping cases (Zeng et al., 2018) of the relational fact extraction task. But the space cost of this approach can be expensive. In most cases, the output matrices are very sparse. For instance, for a sentence with n tokens and m kinds of relations, the output space is n · n · m, which can be costly in terms of graph representation efficiency. This phenomenon is also illustrated in Figure 1.

Adjacency List is designed to predict an array of linked lists that serves as a representation of a graph. As depicted in Figure 1, in the adjacency list, each vertex v (key) points to a list (value) containing all other vertices connected to v by edges. The adjacency list is a hybrid graph representation between the edge list and adjacency matrices (Gross and Yellen, 2005), which can balance space and searching efficiency[1]. Due to the structural characteristic of the adjacency list, this type of model usually adopts a cascade fashion to identify subject, object, and relation sequentially. For instance, the recent state-of-the-art model CasRel (Wei et al., 2020) can be considered as an exemplar. It utilizes a two-step framework to recognize the possible object(s) of a given subject under a specific relation. However, CasRel is not fully adjacency list oriented: in the first step, it uses the subject as the key; while in the second step, it predicts (relation, object) pairs using an adjacency matrix representation. (A short code sketch contrasting the three representations is given at the end of this section.)

Despite its considerable potential, the cascade fashion of an adjacency list oriented model may cause problems of sub-task error propagation (Shen et al., 2019), i.e., errors from ancestor sub-tasks may accumulate and threaten downstream ones, and sub-tasks can hardly share supervision signals. Multi-task learning (Caruana, 1997) can alleviate this problem; however, the sub-task loss balancing problem (Chen et al., 2018; Sener and Koltun, 2018) could compromise its performance.

Based on the analysis from the perspective of output data structure, we propose a novel solution, the aDjacency lIst oRiented rElational faCT extraction model (DIRECT), with the following advantages:
• For efficiency, DIRECT is a fully adjacency list oriented model, which consists of a shared BERT encoder, Pointer-Network based subject and object extractors, and a relation classification module. In Section 3.4, we provide a detailed comparative analysis[2] to demonstrate the efficiency of the proposed method.
• From the performance viewpoint, to address the sub-task error propagation and sub-task loss balancing problems, DIRECT employs a novel adaptive multi-task learning strategy with a dynamic sub-task loss balancing approach. In Sections 3.2 and 3.3, the empirical experimental results demonstrate that DIRECT achieves state-of-the-art performance on the relational fact extraction task, and that the adaptive multi-task learning strategy plays a positive role in improving the task performance.

The major contributions of this paper can be summarized as follows:
1. We refurbish the relational fact extraction problem by leveraging an analytical framework of graph-oriented output structure. To the best of our knowledge, this is a pioneer investigation to explore the output data structure of relational fact extraction.
2. We propose a novel solution, DIRECT[3], which is a fully adjacency list oriented model with a novel adaptive multi-task learning strategy.
3. Through extensive experiments on two benchmark datasets, we demonstrate the efficiency and efficacy of DIRECT. The proposed DIRECT outperforms the state-of-the-art baseline models.

[1] More detailed complexity analyses of the different graph representations are provided in Appendix Section 6.3.
[2] A theoretical representation efficiency analysis of graph representative models is described in Appendix Section 6.4.
[3] To help other scholars reproduce the experiment outcome, we will release the code and datasets via GitHub: https://github.com/fyubang/direct-ie.
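To make the contrast among the three output structures concrete, here is a minimal sketch in Python. The two facts come from the Figure 1 example; the container layouts themselves are illustrative, not the paper's data format.

```python
# The two facts from Figure 1, (Obama, nationality, USA) and
# (Obama, president_of, USA), under the three graph representations.

# 1) Edge list: a flat sequence of (subject, relation, object) triplets.
#    Overlapping triplets force "Obama" and "USA" to be emitted repeatedly.
edge_list = [
    ("Obama", "nationality", "USA"),
    ("Obama", "president_of", "USA"),
]

# 2) Adjacency matrices: one |V| x |V| 0/1 matrix per relation type.
#    The output space is n * n * m and is typically very sparse.
vertices = ["Obama", "USA"]            # row/column order of the matrices
adjacency_matrices = {
    "nationality":  [[0, 1],
                     [0, 0]],
    "president_of": [[0, 1],
                     [0, 0]],
}

# 3) Adjacency list: each subject (key) maps to a list of
#    (relation, object) pairs; each entity is stored only once.
adjacency_list = {
    "Obama": [("nationality", "USA"), ("president_of", "USA")],
}
```

DIRECT's pipeline in Section 2 materializes exactly the last structure: it first extracts the keys (subjects), then the connected objects, and finally the relation labels on each edge.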
2 The DIRECT Framework

In this section, we introduce the framework of the proposed DIRECT model, which includes a shared BERT encoder and three output layers: subject extraction, object extraction, and relation classification. As shown in Figure 2, DIRECT is fully adjacency list oriented. The input sentence is first fed into the subject extraction module to extract all subjects. Then each extracted subject is concatenated with the sentence and fed into the object extraction module to extract all objects, which forms a set of subject-object pairs. Finally, each subject-object pair is concatenated with the sentence and fed into the relation classification module to get the relations between them. To balance the weights of the sub-task losses and improve the global task performance, the three modules share the BERT encoder layer and are trained with an adaptive multi-task learning strategy.

[Figure 2: An overview of the proposed DIRECT framework.]

2.1 Shared BERT Encoder

In the DIRECT framework, the encoder is used to extract the semantic features from the inputs for the three modules. As aforementioned, we employ BERT (Devlin et al., 2019) as the shared encoder to make use of its pre-trained knowledge and attention mechanism. The architecture of the sharing method is shown in Figure 2. The lower embedding layer and transformers (Vaswani et al., 2017) are shared across all three modules, while the top layers represent the task-specific outputs. The encoding process is as follows:

h^t = BERT(x^t)    (1)

where x^t = [w_1, ..., w_n] is the input text of task t and h^t is the hidden vector sequence of the input. Due to limited space, for the detailed architecture of BERT, please refer to the original paper (Devlin et al., 2019).

2.2 Subject and Object Extraction

The subject and object extraction modules are motivated by the Pointer-Network (Vinyals et al., 2015) architecture, which is widely used in Machine Reading Comprehension (MRC) tasks (Rajpurkar et al., 2016). Different from the MRC task, which only needs to extract a single span, subject and object extraction needs to extract multiple spans. Therefore, in the training phase, we replace the softmax function with the sigmoid function as the activation function of the output layer, and replace cross entropy (CE) (Goodfellow et al., 2016) with binary cross entropy (BCE) (Luc et al., 2016) as the loss function. Specifically, we perform independent binary classifications for each token twice, to indicate whether the current token is the start or the end of a span. The probability of a token being a start or an end is:

p^t_{i,start} = \sigma(W^t_{start} \cdot h_i + b^t_{start})    (2)

p^t_{i,end} = \sigma(W^t_{end} \cdot h_i + b^t_{end})    (3)

where h_i represents the hidden vector of the i-th token, t ∈ {s, o} denotes subject or object extraction respectively, W^t ∈ R^{h×1} is a trainable weight, b^t ∈ R^1 is the bias, and σ is the sigmoid function.

During inference, we first recognize all the start positions by checking whether the probability p^t_{i,start} > α, where α is the extraction threshold. Then, we identify the corresponding end position as the one with the largest probability p^t_{i,end} between two neighboring start positions. Concretely, assuming pos_{j,start} is the start position of the j-th span, the corresponding end position is:

pos_{j,end} = \arg\max_{pos_{j,start} \le i < pos_{j+1,start}} p^t_{i,end}    (4)

Algorithm 1: Adaptive Multi-task Learning with Dynamic Loss Balancing
Initialize model parameters Θ randomly;
Load pre-trained BERT parameters;
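To make the inference rule of Eqs. (2)-(4) concrete, here is a minimal decoding sketch. The function name and the toy probabilities are ours, not the released implementation; the threshold α = 0.9 follows the setting reported in Appendix 6.1.

```python
def decode_spans(p_start, p_end, alpha=0.9):
    """Pointer-network style span decoding per Eqs. (2)-(4).

    p_start, p_end: per-token start/end probabilities from the sigmoid layer.
    alpha: extraction threshold applied to start probabilities.
    Returns a list of (start, end) token spans.
    """
    n = len(p_start)
    # All positions whose start probability exceeds the threshold alpha.
    starts = [i for i in range(n) if p_start[i] > alpha]
    spans = []
    for j, s in enumerate(starts):
        # Search region: from this start up to (excluding) the next start.
        limit = starts[j + 1] if j + 1 < len(starts) else n
        # Eq. (4): the end is the argmax of p_end within the region.
        e = max(range(s, limit), key=lambda i: p_end[i])
        spans.append((s, e))
    return spans

# Toy check: two spans, tokens 1-2 and 4-4.
p_start = [0.05, 0.95, 0.10, 0.02, 0.97]
p_end   = [0.10, 0.20, 0.90, 0.05, 0.93]
print(decode_spans(p_start, p_end))  # [(1, 2), (4, 4)]
```

For the dynamic loss balancing of Algorithm 1, the following is one plausible reading, not the paper's exact rule: each sub-task's loss magnitude is tracked with an exponential moving average (the decay 0.99 matches Appendix 6.1), and the current loss is scaled by a normalized inverse-EMA weight so that no sub-task dominates the shared encoder's gradients.

```python
class EmaLossBalancer:
    """Illustrative EMA-based loss weighting (decay 0.99 per Appendix 6.1).

    Tracks an exponential moving average of each sub-task's loss and weights
    each loss by the inverse of its EMA, normalized across tasks, so that a
    sub-task with a systematically larger loss cannot dominate training.
    The exact rule of Algorithm 1 may differ; this is a sketch.
    """

    def __init__(self, tasks, decay=0.99):
        self.decay = decay
        self.ema = {t: 1.0 for t in tasks}  # neutral starting magnitude

    def weighted(self, task, loss_value):
        # Update this task's loss EMA.
        self.ema[task] = self.decay * self.ema[task] + (1 - self.decay) * loss_value
        # Inverse-EMA weight, normalized so the weights average to 1.
        inv = {t: 1.0 / v for t, v in self.ema.items()}
        w = inv[task] * len(inv) / sum(inv.values())
        return w * loss_value

balancer = EmaLossBalancer(["subject", "object", "relation"])
# Per training step (MT-DNN style, one sub-task batch at a time):
# scaled_loss = balancer.weighted("subject", subject_loss)
```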

Table 2: Results of different methods on NYT and WebNLG datasets. EL: Edge List; AM: Adjacency Matrices; ALP: Adjacency List (Partially); ALF: Adjacency List (Fully).

Method          Element  NYT   WebNLG
CasRel          s        93.5  95.7
                o        93.5  95.3
                r        94.9  94.0
DIRECT (Ours)   s        95.4  97.3
                o        96.4  96.4
                r        97.8  97.4

Table 3: F1-score for extracting elements of relational triplets on NYT and WebNLG datasets.

A possible explanation is that, although the outputs of these two modules are similar, the semantics of subject and object are different. Hence, directly sharing the output parameters of the two modules can lead to unsatisfactory performance.

Effects of Adaptive Multi-task Learning (RQ2). As described in Section 2, the adaptive multi-task learning strategy with the dynamic sub-task loss balancing approach is proposed for improving the task performance. To answer RQ2, we replaced the adaptive multi-task learning strategy with an ordinary learning strategy, in which the losses of the three sub-tasks were computed with equal weights, denoted as DIRECTequal. From the results in Table 5, we can observe that, by using adaptive multi-task learning, DIRECT was able to get a 1.5 percentage point improvement on the F1-score. This significant improvement indicates that adaptive multi-task learning plays a positive role in balancing sub-task learning and can improve the global task performance.

                NYT                               WebNLG
Method          N=1   N=2   N=3   N=4   N≥5       N=1   N=2   N=3   N=4   N≥5
Count           3244  1045  312   291   108       268   174   128   89    44
CopyRE-One      66.6  52.6  49.7  48.7  20.3      65.2  33.0  22.2  14.2  13.2
CopyRE-Mul      67.1  58.6  52.0  53.6  30.0      59.2  42.5  31.7  24.2  30.0
GraphRel-1p     69.1  59.5  54.4  53.9  37.5      63.8  46.3  34.7  30.8  29.4
GraphRel-2p     71.0  61.5  57.4  55.1  41.1      66.0  48.3  37.0  32.1  32.1
CopyRL          71.7  72.6  72.5  77.9  45.9      63.4  62.2  64.4  57.2  55.7
CasRel          88.2  90.3  91.9  94.2  83.7      89.3  90.8  94.2  92.4  90.9
DIRECT (Ours)   90.4  93.1  94.3  95.8  93.1      90.3  92.8  94.8  94.0  92.9

Table 4: F1-score of extracting relational triplets from sentences with different numbers (denoted as N) of triplets.

                 NYT
Method           Prec.  Rec.  F1
DIRECTshared     92.1   91.6  91.9
DIRECTequal      90.6   91.3  91.0
DIRECT           92.3   92.8  92.5

Table 5: Results of model variants for ablation tests.

3.4 Graph Representation Efficiency Analysis

Based on the amount estimation of predicted logits[7], we conduct a graph representation efficiency analysis to demonstrate the efficiency of the proposed method[8]. For each graph representation category, we choose one representative algorithm. Edge List: CopyRE (Zeng et al., 2018); Adjacency Matrices: MHS (Bekoulis et al., 2018); Adjacency List: CasRel (partially) (Wei et al., 2020) and the proposed DIRECT (fully).

The averaged predicted logits estimation for one sample[9] of the different models on the two datasets is shown in Table 6. MHS is adjacency matrices oriented; it has the most logits to predict. Since CasRel is partially adjacency list oriented, it needs to predict more logits than DIRECT. Theoretically, as an edge list oriented model, CopyRE should predict the fewest logits. But, as described in Section 1, it needs to extract entities repeatedly to handle the overlapping problem; hence, its graph representation efficiency can be worse than our model's. The structure of our model is simple and fully adjacency list oriented. Therefore, from the viewpoint of predicted logits estimation, DIRECT is the most representation-efficient model.

[7] Numeric output (0/1) of the last layer.
[8] From the graph representation perspective, when a method requires fewer logits to represent the graph (set of triples), it will reduce the model fitting difficulty.
[9] The theoretical analysis of predicted logits for the different models is described in Appendix Section 6.4.

Method   Category  NYT    WebNLG
CopyRE   EL        329    712
MHS      AM        57369  26518
CasRel   ALP       3084   15836
DIRECT   ALF       238    542

Table 6: Graph representation efficiency estimation based on the predicted logits amount. EL: Edge List; AM: Adjacency Matrices; ALP: Adjacency List (Partially); ALF: Adjacency List (Fully).

4 Related Work

Relation Fact Extraction. In this work, we show that all of the relational fact extraction models can be unified into a graph-oriented output structure analytical framework. From the perspective of graph representation, the prior models can be divided into three categories. Edge List: this type of model usually employs a sequence-to-sequence fashion, e.g., NovelTagging (Zheng et al., 2017), CopyRE (Zeng et al., 2018), CopyRL (Zeng et al., 2019), and PNDec (Nayak and Ng, 2020). Some models of this category may suffer from the triplet overlapping problem and expensive extraction cost. Adjacency Matrices: many early pipeline approaches (Zelenko et al., 2003; Zhou et al., 2005; Mintz et al., 2009) and recent neural network-based models (Bekoulis et al., 2018; Dai et al., 2019; Fu et al., 2019) can be classified into this category. The main problem for this type of model is graph representation efficiency. Adjacency List: the recent state-of-the-art model CasRel (Wei et al., 2020) is a partially adjacency list oriented model. In this work, we propose DIRECT, which is a fully adjacency list oriented relational fact extraction model. To the best of our knowledge, few previous works analyze this task from the output data structure perspective. GraphRel (Fu et al., 2019) employs a graph-based approach, but it is utilized from an encoding perspective, while we analyze the task from the perspective of output structure. Our work is a pioneer investigation of the output data structure of relational fact extraction.

Multi-task Learning. Multi-task learning (MTL) can improve model performance. Caruana (1997) summarizes the goal succinctly: "it improves generalization by leveraging the domain-specific information contained in the training signals of related tasks." It has two benefits (Vandenhende et al.): (1) multiple tasks share a single model, which can save memory; (2) associated tasks complement and constrain each other by sharing information, which can reduce overfitting and improve global performance. There are two main types of MTL: hard parameter sharing (Baxter, 1997) and soft parameter sharing (Duong et al., 2015). Most multi-task learning is done by summing the losses directly; this approach is not suitable for our case, since when the inputs and outputs are different, it is impossible to get two losses in one forward propagation. MT-DNN (Liu et al., 2019b) is proposed for this problem. Furthermore, MTL is difficult to train: the magnitudes of different task losses differ, and the direct summation of losses may lead to a bias toward a particular task. Several studies have been proposed to address this problem (Chen et al., 2018; Guo et al., 2018; Liu et al., 2019a). They all try to dynamically adjust the weight of each loss according to the magnitude of the loss, the difficulty of the problem, the speed of learning, etc. In this study, we adopt MT-DNN's framework and propose an adaptive multi-task learning strategy that can dynamically adjust the loss weights based on the averaged EMA (Lawrance and Lewis, 1977) of the training data amount, task difficulty, etc.

5 Conclusion

In this paper, we introduce a new analytical perspective to organize relational fact extraction models and propose the DIRECT model for this task. Unlike existing methods, DIRECT is fully adjacency list oriented and employs a novel adaptive multi-task learning strategy with dynamic sub-task loss balancing. Extensive experiments on two public datasets prove the efficiency and efficacy of the proposed methods.

Acknowledgments

We are thankful to the anonymous reviewers for their helpful comments. This work is supported by Alibaba Group through the Alibaba Research Fellowship Program, the National Natural Science Foundation of China (61876003), the Key Research and Development Plan of Zhejiang Province (Grant No. 2021C03140), the Fundamental Research Funds for the Central Universities, and the Guangdong Basic and Applied Basic Research Foundation (2019A1515010837).

References

Shaikh Arifuzzaman and Maleq Khan. 2015. Fast parallel conversion of edge list to adjacency list for large-scale graphs. In Proceedings of the Symposium on High Performance Computing, pages 17–24.

Jonathan Baxter. 1997. A bayesian/information theoretic model of learning to learn via multiple task sampling. Machine Learning, 28(1):7–39.

Giannis Bekoulis, Johannes Deleu, Thomas Demeester, and Chris Develder. 2018. Joint entity recognition and relation extraction as a multi-head selection problem. Expert Systems with Applications, 114:34–45.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Zhao Chen, Vijay Badrinarayanan, Chen-Yu Lee, and Andrew Rabinovich. 2018. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In International Conference on Machine Learning, pages 794–803. PMLR.

Dai Dai, Xinyan Xiao, Yajuan Lyu, Shan Dou, Qiaoqiao She, and Haifeng Wang. 2019. Joint extraction of entities and overlapping relations using position-attentive sequence labeling. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 6300–6308.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186.

Bayu Distiawan, Gerhard Weikum, Jianzhong Qi, and Rui Zhang. 2019. Neural relation extraction for knowledge base enrichment. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 229–240.

Long Duong, Trevor Cohn, Steven Bird, and Paul Cook. 2015. Low resource dependency parsing: Cross-lingual parameter sharing in a neural network parser. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 845–850.

Tsu-Jui Fu, Peng-Hsuan Li, and Wei-Yun Ma. 2019. Graphrel: Modeling text as relational graphs for joint entity and relation extraction. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1409–1418.

Claire Gardent, Anastasia Shimorina, Shashi Narayan, and Laura Perez-Beltrachini. 2017. Creating training corpora for nlg micro-planners. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 179–188.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. 2016. Deep learning, volume 1. MIT Press, Cambridge.

Jonathan L Gross and Jay Yellen. 2005. Graph theory and its applications. CRC Press.

Michelle Guo, Albert Haque, De-An Huang, Serena Yeung, and Li Fei-Fei. 2018. Dynamic task prioritization for multitask learning. In European Conference on Computer Vision, pages 282–299. Springer.

Xu Han, Tianyu Gao, Yuan Yao, Deming Ye, Zhiyuan Liu, and Maosong Sun. 2019. Opennre: An open and extensible toolkit for neural relation extraction. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP): System Demonstrations, pages 169–174.

Diederik P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).

Thomas N. Kipf and Max Welling. 2017. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR).

AJ Lawrance and PAW Lewis. 1977. An exponential moving-average sequence and point process (ema1). Journal of Applied Probability, pages 98–113.

Shikun Liu, Edward Johns, and Andrew J Davison. 2019a. End-to-end multi-task learning with attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1871–1880.

Xiaodong Liu, Pengcheng He, Weizhu Chen, and Jianfeng Gao. 2019b. Multi-task deep neural networks for natural language understanding. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4487–4496.

Pauline Luc, Camille Couprie, Soumith Chintala, and Jakob Verbeek. 2016. Semantic segmentation using adversarial networks. arXiv preprint arXiv:1611.08408.

Mike Mintz, Steven Bills, Rion Snow, and Dan Jurafsky. 2009. Distant supervision for relation extraction without labeled data. In Proceedings of the Joint Conference of the 47th Annual Meeting of the ACL and the 4th International Joint Conference on Natural Language Processing of the AFNLP, pages 1003–1011.

Tapas Nayak and Hwee Tou Ng. 2020. Effective modeling of encoder-decoder architecture for joint entity and relation extraction. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8528–8535.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. Squad: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Sebastian Riedel, Limin Yao, and Andrew McCallum. 2010. Modeling relations and their mentions without labeled text. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 148–163. Springer.

Ozan Sener and Vladlen Koltun. 2018. Multi-task learning as multi-objective optimization. In Proceedings of the 32nd International Conference on Neural Information Processing Systems, pages 525–536.

Tao Shen, Xiubo Geng, Tao Qin, Daya Guo, Duyu Tang, Nan Duan, Guodong Long, and Daxin Jiang. 2019. Multi-task learning for conversational question answering over a large-scale knowledge base. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 2442–2451.

Simon Vandenhende, Stamatios Georgoulis, Wouter Van Gansbeke, Marc Proesmans, Dengxin Dai, and Luc Van Gool. Multi-task learning for dense prediction tasks: A survey.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st International Conference on Neural Information Processing Systems, pages 6000–6010.

Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, pages 2692–2700.

Zhepei Wei, Jianlin Su, Yue Wang, Yuan Tian, and Yi Chang. 2020. A novel cascade binary tagging framework for relational triple extraction. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 1476–1488.

Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. Journal of Machine Learning Research, 3(Feb):1083–1106.

Daojian Zeng, Haoran Zhang, and Qianying Liu. 2020. Copymtl: Copy mechanism for joint extraction of entities and relations with multi-task learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 9507–9514.

Xiangrong Zeng, Shizhu He, Daojian Zeng, Kang Liu, Shengping Liu, and Jun Zhao. 2019. Learning the extraction order of multiple relational facts in a sentence with reinforcement learning. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 367–377.

Xiangrong Zeng, Daojian Zeng, Shizhu He, Kang Liu, and Jun Zhao. 2018. Extracting relational facts by an end-to-end neural model with copy mechanism. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 506–514.

Ranran Haoran Zhang, Qianying Liu, Aysa Xuemo Fan, Heng Ji, Daojian Zeng, Fei Cheng, Daisuke Kawahara, and Sadao Kurohashi. 2020. Minimize exposure bias of seq2seq models in joint entity and relation extraction. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: Findings, pages 236–246.

Suncong Zheng, Feng Wang, Hongyun Bao, Yuexing Hao, Peng Zhou, and Bo Xu. 2017. Joint extraction of entities and relations based on a novel tagging scheme. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1227–1236.

GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 427–434.

6 Appendix

6.1 Implementation Details

We adopted the pre-trained BERT model [BERT-Base-Cased][10] as our encoder, where the number of Transformer layers was 12 and the hidden size was 768. The token types of the input were always set to 0.

We used Adam as our optimizer and applied a triangular learning rate schedule as suggested by the original BERT paper. In addition, we adopted a lazy mechanism for optimization. Different from the momentum mechanism of the ordinary Adam optimizer (Kingma and Ba, 2015), which updates the output layer parameters for all tasks, this lazy-Adam mechanism does not update the parameters of non-current tasks.

The decay rate of the EMA was set to 0.99 as default. The max sequence length was 128. The other hyper-parameters were determined on the validation set. Notably, considering our special decoding strategy, we raised the threshold of extraction to 0.9 to balance precision and recall. The threshold of relation classification was set to 0.5 as default. The hyper-parameter settings are listed in Table 7.

Hyper-parameter  NYT   WebNLG
Learning Rate    8e-5  1e-4
Epoch Num.       15    60
Batch Size       32    16

Table 7: Hyper-parameter setting for NYT and WebNLG datasets.

Our method was implemented with PyTorch[11] and run on a server configured with a Tesla V100 GPU, 16 CPUs, and 64G memory.

6.2 Supplementary Experimental Results

Ablation Study. To validate the effectiveness of the components in DIRECT, we implemented several model variants for ablation tests. For experimental fairness, we kept the other components in the same settings when modifying one module.

• DIRECTshared: we merged the subject extraction and object extraction layers into one entity extraction layer by sharing the parameters of the output layers of these two modules.
• DIRECTequal: we replaced the adaptive multi-task learning strategy with an ordinary learning strategy, in which the losses of the three sub-tasks were computed with equal weights.
• DIRECTthreshold: we simply recognized all the start and end positions of entities by checking whether the probability p^t_{i,start/end} > α, where α was the threshold of extraction.
• DIRECTadam: we used ordinary Adam as the optimizer.

                  NYT
Method            Prec.  Rec.  F1
DIRECTshared      92.1   91.6  91.9
DIRECTequal       90.6   91.3  91.0
DIRECTthreshold   92.8   92.0  92.4
DIRECTadam        92.1   92.9  92.5
DIRECT            92.9   92.1  92.5

Table 8: Results of model variants for ablation tests.

From the results of Table 8, we can observe that:

1. Sharing the parameters of the output layers of the subject and object extraction modules would reduce the performance of the model.
2. Compared to the ordinary multi-task learning strategy, by using adaptive multi-task learning, DIRECT was able to get a 1.5 percentage point improvement on the F1-score.
3. There would be a slight drop in performance if we just used a simple threshold policy to recognize the start and end positions of an entity.
4. Despite the difference in precision and recall, there was no significant difference between the two optimizers (ordinary Adam and lazy Adam) for the task.

Results on Extracting Elements of Relational Triplets. The complete extraction performance on relational triplet elements for DIRECT and CasRel is listed in Table 9. DIRECT outperformed CasRel in terms of all relational triplet elements on both datasets. These empirical results suggest that, for relational triplet extraction, a fully adjacency list oriented model (DIRECT) may have advantages over a partially oriented one (CasRel).

[10] Available at: https://storage.googleapis.com/bert_models/2018_10_18/cased_L-12_H-768_A-12.zip
[11] https://pytorch.org/

                 NYT                   WebNLG
Method  Element  Prec.  Rec.  F1       Prec.  Rec.  F1
CasRel  s        94.6   92.4  93.5     98.7   92.8  95.7
        o        94.1   93.0  93.5     97.7   93.0  95.3
        r        96.0   93.8  94.9     96.6   91.5  94.0
Ours    s        95.1   95.1  95.1     97.1   96.8  96.9
        o        97.2   96.3  96.7     96.4   96.3  96.3
        r        98.6   98.3  98.5     97.6   97.3  97.4

Table 9: Results on extracting elements of relational triplets.

                                  NYT
Method                            Prec.  Rec.  F1
MHS* (Bekoulis et al., 2018)      60.7   58.6  59.6
CopyMTL-one (Zeng et al., 2020)   72.7   69.2  70.9
CopyMTL-mul (Zeng et al., 2020)   75.7   68.7  72.0
WDec (Nayak and Ng, 2020)         88.1   76.1  81.7
PNDec (Nayak and Ng, 2020)        80.6   77.3  78.9
Seq2UMTree (Zhang et al., 2020)   79.1   75.1  77.1
DIRECT (ours)                     90.2   90.2  90.2

Table 10: Results of different methods under Exact-Match Metrics. * marks results reproduced by the official implementation.

Results of Different Methods under Exact-Match Metrics. In the experiment section, we followed the match metric from (Zeng et al., 2018), which only requires matching the first token of an entity span. Many previous works adopted this match metric (Fu et al., 2019; Zeng et al., 2019; Wei et al., 2020). In fact, our model is capable of extracting complete entities. Therefore, we collected papers that reported results under exact-match metrics (requiring to match the complete entity span). The following strong state-of-the-art (SoTA) models have been compared:

• CopyMTL (Zeng et al., 2020) is a multi-task learning framework, where a conditional random field is used to identify entities, and a seq2seq model is adopted to extract relational triplets.
• WDec (Nayak and Ng, 2020) fuses a seq2seq model with a new representation scheme, which enables the decoder to generate one word at a time and can handle full entity names of different lengths and overlapping entities.
• PNDec (Nayak and Ng, 2020) is a modification of the seq2seq model. Pointer networks are used in the decoding framework to identify the entities in the sentence using their start and end locations.
• Seq2UMTree (Zhang et al., 2020) is a modification of the seq2seq model, which employs an unordered-multi-tree decoder to minimize exposure bias.

The task performances on the NYT dataset are summarized in Table 10. The proposed DIRECT model outperformed all baseline models in terms of all evaluation metrics. These experimental results further confirm the efficacy of DIRECT for the relational fact extraction task.

6.3 Complexity Analysis of Graph Representations

For a graph G = (V, E), |V| denotes the number of nodes/entities and |E| denotes the number of edges/relations. Suppose there are m kinds of relations; d(v) denotes the number of edges from node v.

• Edge List Complexity
  − Space: O(|E|)
  − Find all edges/relations from a node: O(|E|)
• Adjacency Matrices Complexity
  − Space: O(|V| · |V| · m)
  − Find all edges/relations from a node: O(|V| · m)
• Adjacency List Complexity
  − Space: O(|V| + |E|)
  − Find all edges/relations from a node: O(d(v))

6.4 Graph Representation Efficiency Analysis

Based on the amount estimation of predicted logits[12] (0/1), we conduct a graph representation efficiency analysis to demonstrate the efficiency of the proposed method[13].

For each graph representation category, we choose one representative model. Edge List: CopyRE (Zeng et al., 2018); Adjacency Matrices: MHS (Bekoulis et al., 2018); Adjacency List: CasRel (partially) (Wei et al., 2020) and DIRECT (fully).

Formally, for a sentence of length l (l tokens), there are r types of relations, and k denotes the number of triplets. Suppose there are s keys (subjects) and o values (the corresponding number of object-based lists) in the adjacency list. The theoretical logits amounts and the estimated logits amounts on the two benchmark datasets (NYT and WebNLG) are shown in Table 11. From the viewpoint of predicted logits estimation, DIRECT is the most representation-efficient model.

Category                    Method  Theoretical    NYT    WebNLG
Edge List                   CopyRE  4kl + kr       329    712
Adjacency Matrices          MHS     l · l · r      57369  26518
Adjacency List (Partially)  CasRel  2l + 2slr      3084   15836
Adjacency List (Fully)      DIRECT  2l + 2sl + or  238    542

Table 11: Graph representation efficiency based on the theoretical logits amount and the estimated logits amount on two benchmark datasets.

[12] Numeric output of the last layer.
[13] As aforementioned, from the graph representation perspective, when a method requires fewer logits to represent the graph (set of triples), it will reduce the model fitting difficulty.
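The theoretical counts in Table 11 are easy to check numerically. In the sketch below, the formulas are exactly those of Table 11; the sentence statistics (l, r, k, s, o) are hypothetical values chosen only for illustration.

```python
def predicted_logits(l, r, k, s, o):
    """Theoretical predicted-logits counts per Table 11.

    l: sentence length in tokens; r: number of relation types;
    k: number of triplets; s: number of subjects (adjacency-list keys);
    o: number of object-based lists (adjacency-list values).
    """
    return {
        "Edge List (CopyRE)":               4 * k * l + k * r,
        "Adjacency Matrices (MHS)":         l * l * r,
        "Adjacency List, partial (CasRel)": 2 * l + 2 * s * l * r,
        "Adjacency List, full (DIRECT)":    2 * l + 2 * s * l + o * r,
    }

# Hypothetical sentence: 30 tokens, 24 relation types, 2 triplets,
# 1 subject, and 2 subject-object pairs.
for name, n in predicted_logits(l=30, r=24, k=2, s=1, o=2).items():
    print(f"{name}: {n}")
# The ordering reproduces the pattern of Table 11: the full adjacency
# list needs the fewest logits, the adjacency matrices by far the most.
```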