Stage-wise Fine-tuning for Graph-to-Text Generation

Qingyun Wang1,∗ Semih Yavuz2, Victoria Lin3, Heng Ji1, Nazneen Fatema Rajani2 1 University of Illinois at Urbana-Champaign2 Salesforce Research 3 Facebook Research syavuz,[email protected] [email protected] {qingyun4,hengji}@illinois.edu

Abstract International Telangana Tennis Federation Graph-to-text generation has benefited from Northeast sports Governing Body pre-trained language models (PLMs) in achiev- Acharya ing better performance than structured graph Karnataka state Institute of sports Offered Tennis Technology encoders. However, they fail to fully utilize West was given the 'Technical Campus' by the structure information of the input graph. In All India Arabian this paper, we aim to further improve the per- Council for Sea location Mumbai formance of the pre-trained language model Technical by proposing a structured graph-to-text model Education with a two-step fine-tuning mechanism which first fine-tunes model on Wikipedia before Figure 1: Input RDF Knowledge Graph adapting to the graph-to-text generation. In addition to using the traditional token and and T5 (Raffel et al., 2020) have achieved state- position embeddings to encode the knowl- edge graph (KG), we propose a novel tree- of-the-art results on WebNLG dataset due to fac- level embedding method to capture the inter- tual knowledge acquired in the pre-training phase dependency structures of the input graph. This (Harkous et al., 2020; Ribeiro et al., 2020b; Kale, new approach has significantly improved the 2020; Chen et al., 2020a). performance of all text generation metrics for Despite such improvement, PLMs fine-tuned 1 English WebNLG 2017 dataset. only on the clean (or labeled) data might be more prone to hallucinate factual knowledge (e.g., 1 Introduction “Visvesvaraya Technological University” in Table 1). Inspired by the success of domain-adaptive In the graph-to-text generation task (Gardent et al., pre-training (Gururangan et al., 2020), we propose 2017), the model takes in a complex KG (an exam- a novel two-step fine-tuning mechanism graph-to- ple is in Figure1) and generates a corresponding text generation task. Unlike (Ribeiro et al., 2020b; faithful natural language description (Table1). Pre- Herzig et al., 2020; Chen et al., 2020a) which di- vious efforts for this task can be mainly divided rectly fine-tune the PLMs on the training set, we into two categories: sequence-to-sequence mod- first fine-tune our model over noisy RDF graphs els that directly solve the generation task with arXiv:2105.08021v1 [cs.CL] 17 May 2021 and related article pairs crawled from Wikipedia LSTMs (Gardent et al., 2017) or Transformer before final fine-tuning on the clean/labeled train- (Castro Ferreira et al., 2019); and graph-to-text ing set. The additional fine-tuning step benefits models (Trisedya et al., 2018; Marcheggiani and our model by leveraging triples not included in the Perez-Beltrachini, 2018) which use a graph en- training set and reducing the chances that the model coder to capture the structure of the KGs. Re- fabricates facts based on the language model. cently, Transformer-based PLMs such as GPT- Meanwhile, the PLMs might also fail to cover all 2 (Radford et al., 2019), BART (Lewis et al., 2020), relations in the KG by creating incorrect or miss- This∗ research was conducted during the author’s intern- ing facts. For example, in Table1, although the ship at Salesforce Research T5-large with Wikipedia fine-tuning successfully 1The programs, data and resources will be made publicly available for research purpose at: https://github.com/ removes the unwanted contents, it still ignores the EagleW/Stage-wise-Fine-tuning “sports Governing Body” relation and incorrectly Category Output Reference The Acharya Institute of Technology in Karnataka state was given Technical Campus status by All India Council for Technical Education in Mumbai . The school offers tennis which is governed by the International Tennis Federation . Karnataka has the Arabian Sea to its west and in the northeast is Telangana . T5-large The state of Karnataka is located southwest of Telangana and east of the Arabian Sea . It is the lo- cation of the Acharya Institute of Technology which was granted the Technical Campus status by the All India Council for Technical Education in Mumbai . The Institute is affiliated with the Visvesvaraya Tech- nological University and offers the sport of tennis. [International Tennis Federation] T5-large The Acharya Institute of Technology is located in the state of Karnataka . It was given the Technical Campus sta- + Wiki tus by the All India Council for Technical Education which is located in Mumbai . The institute offers tennis and has Telangana to its northeast and the Arabian Sea to its west. [International Tennis Federation] T5-large The Acharya Institute of Technology is located in the state of Karnataka which has Telangana to + its northeast and the Arabian Sea to its west. It was given the Technical Campus status by the Position All India Council for Technical Education in Mumbai . The Institute offers tennis which is governed by the International Tennis Federation. T5-large The Acharya Institute of Technology is located in the state of Karnataka which has Telangana to its northeast + Wiki + and the Arabian Sea to its west. The Institute was given the Technical Campus status by the All India Council Position for Technical Education in Mumbai . One of the sports offered at the Institute is tennis which is governed by the International Tennis Federation.

Table 1: Human and System Generated Description in Figure1. We use the color box to frame each entity out with the same color as the corresponding entity in Figure1. We highlight fabricated facts, [missed relations], and incorrect relations with different color. links the university to both “Telangana” and “Ara- respectively. We augment our model with addi- bian Sea”. To better capture the structure and in- Token [CLS] S| Karnataka P| Northeast ... terdependence of facts in the KG, instead of using Embeddings a complex graph encoder, we leverage the power Position ... POS0 POS1 POS2 POS3 POS4 of Transformer-based PLMs with additional posi- Embeddings Triple Role ROL ROL ROL ROL ROL ... tion embeddings which have been proved effective Embeddings 0 1 1 1 2 Tree-level LV LV LV LV LV ... in various generation tasks (Herzig et al., 2020; Embeddings 0 2 2 2 2 Chen et al., 2020a,b). Here, we extend the embed- ding layer of Transfomer-based PLMs with two Figure 2: Position Embeddings for the KG in Figure1 additional triple role and tree-level embeddings to capture graph structure. tional position embeddings to capture the structure We explore the proposed stage-wise fine-tuning of the KG. To feed the input for the large-scale and structure-preserving embedding strategies for Transformer-based PLM, we flatten the graph as a graph-to-text generation task on WebNLG corpus concatenation of linearized triple sequences: (Gardent et al., 2017). Our experimental results clearly demonstrate the benefit of each strategy in |S s1 |P r1 |O o1 ... |S sn |P rn |O on achieving the state-of-the-art performance on most |S, |P, |O commonly reported automatic evaluation metrics. following Ribeiro et al.(2020b), where are special tokens prepended to indicate whether 2 Method the phrases in the relations are subjects, relations, or objects, respectively. Instead of directly fine- Given an RDF graph with multiple relations tuning the PLM on the WebNLG dataset, we first G = {(s1, r1, o1), (s2, r2, o2), ..., (sn, rn, on)}, fine-tune our model on a noisy, but larger corpus our goal is to generate a text faithfully describing crawled from Wikipedia, then we fine-tune the the input graph. We represent each relation with model on the training set. a triple (si, ri, oi) ∈ G for i ∈ {1, ..., n}, where Positional embeddings Since the input of the si, ri, and oi are natural language phrases that rep- WebNLG is a small KG which describes proper- resent the subject, type, and object of the relation, ties of entities, we introduce additional positional BLEU(%)↑ METEOR↑ TER↓ Model Seen Unseen All Seen Unseen All Seen Unseen All Without Gardent et al.(2017) 54.52 33.27 45.13 0.41 0.33 0.37 0.40 0.55 0.47 Pretrained Moryossef et al.(2019) 2 53.30 34.41 47.24 0.44 0.34 0.39 0.47 0.56 0.51 LM Zhao et al.(2020) 64.42 38.23 52.78 0.45 0.37 0.41 0.33 0.53 0.42 With Radev et al.(2020) 52.86 37.85 45.89 0.42 0.37 0.40 0.44 0.59 0.51 Pretrained Kale(2020) 63.90 52.80 57.10 0.46 0.41 0.44 --- LM Ribeiro et al.(2020b) 64.71 53.67 59.70 0.46 0.42 0.44 --- Our model T5-large + position + Wiki 65.68 54.05 60.43 0.46 0.42 0.44 0.32 0.41 0.36

Table 2: System Results on WebNLG Test Set Evaluated by BLEU, METEOR, and TER with Official Scripts embeddings to enhance the flattened input of pre- graph structure into text. We remove triples and trained Transformer-based sequence-to-sequence description pairs that have already appeared in the models such as BART and TaPas (Herzig et al., WebNLG dataset. After intermediate pre-training 2020). We extend the input layer with two position- on this noisy corpus, we continue with fine-tuning aware embeddings in addition to the original posi- our model on the WebNLG dataset. tion embeddings3 as shown in the Figure2: 3 Experiments • Position ID, which is the same as the original 3.1 Dataset and Implementation details position ID used in BART, is the index of the token in the flattened sequence |S s1 |P r1 |O Model BLEU↑ P↑ R↑ F1↑ o1 ... |S sn |P rn |O on . BART-base 57.8 68.7 68.9 67.0 • Triple Role ID takes 3 values for a specific + Wikipedia 59.7 69.6 70.7 68.4 + Position 58.8 68.7 69.9 67.6 triple (si, ri, oi): 1 for the subject si, 2 for the + Wiki + Position 57.3 67.8 69.0 66.6 relation ri, and 3 for the object oi. BART-large 58.3 67.9 69.4 66.8 + Wikipedia 59.0 68.0 70.4 67.4 • Tree level ID calculates the distance (number + Position 58.1 67.6 69.4 66.6 + Wiki + Position 60.0 68.6 69.2 67.1 of relations) from the root which is the source distill-BART-xsum 59.1 69.9 70.6 68.5 vertex of the RDF graph. + Wikipedia 59.8 69.7 71.1 68.8 + Position 59.2 69.8 70.2 68.3 Two-step Fine-tuning To get better domain adap- + Wiki + Position 59.9 70.1 70.1 68.7 T5-base 61.2 72.3 72.0 70.6 tation ability (Gururangan et al., 2020; Herzig et al., + Wikipedia 60.9 72.0 71.8 70.2 2020), following TaPas and Wikipedia Person and + Position 60.8 72.4 72.4 70.8 Animal Dataset (Wang et al., 2018), we perform + Wiki + Position 60.3 72.2 72.0 70.5 T5-large 60.0 71.6 72.1 70.2 intermediate pre-training by coupling noisy En- + Wikipedia 61.3 72.2 72.0 70.5 glish Wikipedia data with Wikidata triples, both + Position 60.6 72.1 72.4 70.6 of which are crawled in March 2020. We select + Wiki + Position 61.5 72.4 72.7 71.0 15 related categories (Astronaut, University, Monu- Table 3: Results with both Wikipedia Fine-tuning and ment, Building, ComicsCharacter, Food, Airport, Positional Embedding for Various Pre-trained Models SportsTeam, WrittenWork, Athlete, Artist, City, over All Categories on Development Set Evaluated by MeanOfTransportation, CelestialBody, Politician) average of PARENT precision, recall, F1 and BLEU that appear in the WebNLG dataset (Gardent et al., (%) 2017) and collect 542,192 data pairs. For each Wikipedia article, we query its corresponding Wiki- We use the original version of English Data triples and remove sentences which contain WebNLG2017 (Gardent et al., 2017) dataset which no values in the Wikidata triples to form graph-text contains 18,102/2,268/4,928 graph-description pairs. Unlike (Chen et al., 2020a) which focuses on pairs for training, validation, and testing set re- individual entity-sentence pairs for distant super- spectfully. For this task, we investigate a variety vision, our pre-training corpus, on the other hand, of the BART and T5 models with our novel tree- is designed to better adapt to translating deeper level embeddings. The statistics and more details

3For T5 models, we only keep the Triple Role and Tree- 3For this baseline, we use result reported from Zhao et al. level embeddings. (2020) who also use official evaluation scripts. of those models are listed in Appendix A. relations that do not have a clear direction, the lan- guage model is not powerful enough to consider 3.2 Results and Analysis the deep connections between the subject and the We use NLG evaluation metrics to report results: object. For example, for the relation “doctoralStu- BLEU (Papineni et al., 2002), METEOR (Lavie dent”, the model mistakenly generates a professor and Agarwal, 2007), and TER (Snover et al., 2006), as a Ph.D. student. Similarly, the model treats an as- as shown in Table2. We also evaluate each model teroid as a person because it has an epoch date. For with PARENT (Dhingra et al., 2019) metric which KGs with multiple triples, the generator still has a measures the overlap between predictions and both chance to miss relations or mixes the subject and reference texts and graph contents. Dhingra et al. the object of different relations, especially for the (2019) show PARENT metric has better human unseen category. For instance, for a soccer player rating correlations. Table3 shows the pre-trained with multiple clubs, the system might confuse the models with 2-step fine-tuning and position embed- subject of one club’s relation with another club. dings achieve better results.4

Results with Wikipedia fine-tuning. The 4 Related Work Wikipedia fine-tuning helps the model handle unseen relations such as “inOfficeWhileVicePresi- dent”, and “activeYearsStartYear” by stating “His The WebNLG task is similar to the Wikibio gen- vice president is Atiku Abubakar.” and “started eration (Lebret et al., 2016; Wang et al., 2018), playing in 1995” respectively. It also combines AMR-to-text generation (Song et al., 2018) and relations with the same type together with correct ROTOWIRE (Wiseman et al., 2017; Puduppully order, e.g., given two death places of a person, et al., 2019). Previous methods usually treat the the model generates: “died in Sidcup, London” graph-to-text generation as an end-to-end gener- instead of generating two sentences or placing the ation task. Those models (Trisedya et al., 2018; city name ahead of the area name. Gong et al., 2019; Shen et al., 2020) usually first lin- eralize the knowledge graph and then use attention Results with positional embeddings. For the mechanism to generate the description sentences. KG with multiple triples, additional positional em- While the linearization of input graph may sacrifice beddings help reduce the errors introduced by pro- the inter-dependency inside input graph, some pa- noun ambiguity. For instance, for a KG which has pers (Ribeiro et al., 2019, 2020a; Zhao et al., 2020) “leaderName” relation to both country’s leader and use graph encoder such as GCN (Duvenaud et al., university’s dean, position embeddings can distin- 2015) and graph transformer (Wang et al., 2020a; guish these two relations by stating “Denmark’s Koncel-Kedziorski et al., 2019) to encode the in- leader is Lars Løkke Rasmussen” instead of “its put graphs. Others (Shen et al., 2020; Wang et al., leader is Lars Løkke Rasmussen”. The tree-level 2020b) try to carefully design loss function to con- embeddings also help the model arrange multiple trol the generation quality. With the development triples into one sentence, such as combining the of computation resources, large scale PLMs such as city, the country, the affiliation, and the affiliation’s GPT-2 (Radford et al., 2019), BART (Lewis et al., headquarter of a university into a single sentence: 2020) and T5 (Raffel et al., 2020) achieve state-of- “The School of Business and Social Sciences at the the-art results even with simple linearized graph in- Aarhus University in Aarhus, Denmark is affili- put (Harkous et al., 2020; Chen et al., 2020a; Kale, ated to the European University Association in 2020; Ribeiro et al., 2020b). Instead of directly Brussels”. fine-tuning the PLMs, we propose a two-step fine- 3.3 Remaining Challenges tuning mechanism to get better domain adaptation ability. In addition, using positional embeddings as Because the LM is heavily pre-trained, it is biased an extension for PLMs has shown its effectiveness against the occurrence of patterns that would en- in table-based question answering (Herzig et al., able it to infer the right relation. For example, 2020), fact verification (Chen et al., 2020b), and for the “activeYearsStartYear” relation, the model graph-to-text generation (Chen et al., 2020a). We might confuse it with the birth year. For some capture the graph structure by enhancing the input 4For more examples, please check Appendix for reference. layer with the triple role and tree-level embeddings. 5 Conclusions and Future Work Heng Gong, Xiaocheng Feng, Bing Qin, and Ting Liu. 2019. Table-to-text generation with effective hier- We propose a new two-step structured generation archical encoder on three dimensions (row, column task for the graph-to-text generation task based on and time). In Proceedings of the 2019 Conference a two-step fine-tuning mechanism and novel tree- on Empirical Methods in Natural Language Process- ing and the 9th International Joint Conference on level position embeddings. In the future, we aim Natural Language Processing (EMNLP-IJCNLP), to address the remaining challenges and extend the pages 3143–3152, Hong Kong, China. Association framework for broader applications. for Computational Linguistics. Suchin Gururangan, Ana Marasovic,´ Swabha Swayamdipta, Kyle Lo, Iz Beltagy, Doug Downey, References and Noah A. Smith. 2020. Don’t stop pretraining: Thiago Castro Ferreira, Chris van der Lee, Emiel Adapt language models to domains and tasks. In van Miltenburg, and Emiel Krahmer. 2019. Neu- Proceedings of the 58th Annual Meeting of the ral data-to-text generation: A comparison between Association for Computational Linguistics, pages pipeline and end-to-end architectures. In Proceed- 8342–8360, Online. Association for Computational ings of the 2019 Conference on Empirical Methods Linguistics. in Natural Language Processing and the 9th Interna- Hamza Harkous, Isabel Groves, and Amir Saffari. 2020. tional Joint Conference on Natural Language Pro- Have your text and use it too! end-to-end neural cessing (EMNLP-IJCNLP), pages 552–562, Hong data-to-text generation with semantic fidelity. In Kong, China. Association for Computational Lin- Proceedings of the 28th International Conference guistics. on Computational Linguistics, pages 2410–2424, Barcelona, Spain (Online). International Committee Wenhu Chen, Yu Su, Xifeng Yan, and William Yang on Computational Linguistics. Wang. 2020a. KGPT: Knowledge-grounded pre- training for data-to-text generation. In Proceed- Jonathan Herzig, Pawel Krzysztof Nowak, Thomas ings of the 2020 Conference on Empirical Methods Muller,¨ Francesco Piccinno, and Julian Eisenschlos. in Natural Language Processing (EMNLP), pages 2020. TaPas: Weakly supervised table parsing via 8635–8648, Online. Association for Computational pre-training. In Proceedings of the 58th Annual Linguistics. Meeting of the Association for Computational Lin- guistics, pages 4320–4333, Online. Association for Wenhu Chen, Hongmin Wang, Jianshu Chen, Yunkai Computational Linguistics. Zhang, Hong Wang, Shiyang Li, Xiyou Zhou, and William Yang Wang. 2020b. Tabfact: A large-scale Mihir Kale. 2020. Text-to-text pre-training for data-to- dataset for table-based fact verification. In Proceed- text tasks. Computation and Language Repository, ings of the 8th International Conference on Learning arXiv:2005.10433. Version 2. Representations. Diederik P Kingma and Jimmy Ba. 2015. Adam: A Bhuwan Dhingra, Manaal Faruqui, Ankur Parikh, method for stochastic optimization. In Proceedings Ming-Wei Chang, Dipanjan Das, and William Co- of the 3rd International Conference on Learning hen. 2019. Handling divergent reference texts when Representations. evaluating table-to-text generation. In Proceedings of the 57th Annual Meeting of the Association for Rik Koncel-Kedziorski, Dhanush Bekal, Yi Luan, Computational Linguistics, pages 4884–4895, Flo- Mirella Lapata, and Hannaneh Hajishirzi. 2019. rence, Italy. Association for Computational Linguis- Text Generation from Knowledge Graphs with tics. Graph Transformers. In Proceedings of the 2019 Conference of the North American Chapter of the David K Duvenaud, Dougal Maclaurin, Jorge Ipar- Association for Computational Linguistics: Human raguirre, Rafael Bombarell, Timothy Hirzel, Alan Language Technologies, Volume 1 (Long and Short Aspuru-Guzik, and Ryan P Adams. 2015. Convolu- Papers), pages 2284–2293, Minneapolis, Minnesota. tional networks on graphs for learning molecular fin- Association for Computational Linguistics. gerprints. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Alon Lavie and Abhaya Agarwal. 2007. METEOR: An Neural Information Processing Systems 28, pages automatic metric for MT evaluation with high levels 2224–2232. Curran Associates, Inc. of correlation with human judgments. In Proceed- ings of the Second Workshop on Statistical Machine Claire Gardent, Anastasia Shimorina, Shashi Narayan, Translation, pages 228–231, Prague, Czech Repub- and Laura Perez-Beltrachini. 2017. The WebNLG lic. Association for Computational Linguistics. challenge: Generating text from RDF data. In Pro- ceedings of the 10th International Conference on Remi´ Lebret, David Grangier, and Michael Auli. 2016. Natural Language Generation, pages 124–133, San- Neural text generation from structured data with tiago de Compostela, Spain. Association for Compu- application to the biography domain. In Proceed- tational Linguistics. ings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 1203–1213, Leonardo F. R. Ribeiro, Claire Gardent, and Iryna Austin, Texas. Association for Computational Lin- Gurevych. 2019. Enhancing AMR-to-text genera- guistics. tion with dual graph representations. In Proceed- ings of the 2019 Conference on Empirical Methods Mike Lewis, Yinhan Liu, Naman Goyal, Mar- in Natural Language Processing and the 9th Inter- jan Ghazvininejad, Abdelrahman Mohamed, Omer national Joint Conference on Natural Language Pro- Levy, Veselin Stoyanov, and Luke Zettlemoyer. cessing (EMNLP-IJCNLP), pages 3183–3194, Hong 2020. BART: Denoising sequence-to-sequence pre- Kong, China. Association for Computational Lin- training for natural language generation, translation, guistics. and comprehension. In Proceedings of the 58th An- nual Meeting of the Association for Computational Leonardo F. R. Ribeiro, Yue Zhang, Claire Gardent, Linguistics, pages 7871–7880, Online. Association and Iryna Gurevych. 2020a. Modeling global and for Computational Linguistics. local node contexts for text generation from knowl- edge graphs. Transactions of the Association for Diego Marcheggiani and Laura Perez-Beltrachini. Computational Linguistics, 8:589–604. 2018. Deep graph convolutional encoders for struc- tured data to text generation. In Proceedings of Leonardo FR Ribeiro, Martin Schmitt, Hinrich Schutze,¨ the 11th International Conference on Natural Lan- and Iryna Gurevych. 2020b. Investigating pre- guage Generation, pages 1–9, Tilburg University, trained language models for graph-to-text genera- The Netherlands. Association for Computational tion. arXiv preprint arXiv:2007.08426. Linguistics. Xiaoyu Shen, Ernie Chang, Hui Su, Cheng Niu, and Amit Moryossef, Yoav Goldberg, and Ido Dagan. 2019. Dietrich Klakow. 2020. Neural data-to-text genera- Step-by-step: Separating planning from realization tion via jointly learning the segmentation and corre- in neural data-to-text generation. In Proceedings of spondence. In Proceedings of the 58th Annual Meet- the 2019 Conference of the North American Chap- ing of the Association for Computational Linguistics, ter of the Association for Computational Linguistics: pages 7155–7165, Online. Association for Computa- Human Language Technologies, Volume 1 (Long tional Linguistics. and Short Papers), pages 2267–2277, Minneapolis, Minnesota. Association for Computational Linguis- Matthew Snover, Bonnie Dorr, Richard Schwartz, Lin- tics. nea Micciulla, and Ralph Weischedel. 2006. A study of translation error rate with targeted human annota- Kishore Papineni, Salim Roukos, Todd Ward, and Wei- tion. In In Proceedings of the Association for Ma- Jing Zhu. 2002. Bleu: a method for automatic eval- chine Transaltion in the Americas (AMTA 2006). uation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Com- putational Linguistics, pages 311–318, Philadelphia, Linfeng Song, Yue Zhang, Zhiguo Wang, and Daniel Pennsylvania, USA. Association for Computational Gildea. 2018. A graph-to-sequence model for AMR- Linguistics. to-text generation. In Proceedings of the 56th An- nual Meeting of the Association for Computational Ratish Puduppully, Li Dong, and Mirella Lapata. 2019. Linguistics (Volume 1: Long Papers), pages 1616– Data-to-text generation with content selection and 1626, Melbourne, Australia. Association for Compu- planning. In Proceedings of the AAAI Conference on tational Linguistics. Artificial Intelligence, volume 33, pages 6908–6915. Bayu Distiawan Trisedya, Jianzhong Qi, Rui Zhang, Dragomir Radev, Rui Zhang, Amrit Rau, Abhinand and Wei Wang. 2018. GTR-LSTM: A triple encoder Sivaprasad, Chiachun Hsieh, Nazneen Fatema Ra- for sentence generation from RDF data. In Proceed- jani, Xiangru Tang, Aadit Vyas, Neha Verma, ings of the 56th Annual Meeting of the Association Pranav Krishna, et al. 2020. Dart: Open-domain for Computational Linguistics (Volume 1: Long Pa- structured data record to text generation. Com- pers), pages 1627–1637, Melbourne, Australia. As- putation and Language Repository, arXiv preprint sociation for Computational Linguistics. arXiv:2007.02871. Qingyun Wang, Xiaoman Pan, Lifu Huang, Boliang Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Zhang, Zhiying Jiang, Heng Ji, and Kevin Knight. Dario Amodei, and Ilya Sutskever. 2019. Language 2018. Describing a knowledge base. In Proceed- models are unsupervised multitask learners. OpenAI ings of the 11th International Conference on Natu- blog, 1(8):9. ral Language Generation, pages 10–21, Tilburg Uni- versity, The Netherlands. Association for Computa- Colin Raffel, Noam Shazeer, Adam Roberts, Kather- tional Linguistics. ine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring Tianming Wang, Xiaojun Wan, and Hanqi Jin. 2020a. the limits of transfer learning with a unified text-to- AMR-to-text generation with graph transformer. text transformer. Journal of Machine Learning Re- Transactions of the Association for Computational search, 21(140):1–67. Linguistics, 8:19–33. Zhenyi Wang, Xiaoyang Wang, Bang An, Dong Yu, Ba, 2015) to optimize each model with learning and Changyou Chen. 2020b. Towards faithful neural rate of 3 × 10−5 with  = 1 × 10−8 for a maxi- table-to-text generation with content-matching con- mum of 10 epochs. We run each experiment on one straints. In Proceedings of the 58th Annual Meet- ing of the Association for Computational Linguistics, Nvidia Tesla V100 GPU with 16G DRAM. We first pages 1072–1086, Online. Association for Computa- fine-tuned the PLMs on crawled Wikipedia pairs tional Linguistics. for 3 epochs. The Wikipedia Fine-tuning stage Sam Wiseman, Stuart Shieber, and Alexander Rush. takes about 24 hours for T5-large and 10 hours for 2017. Challenges in data-to-document generation. the rest of models. The final WebNLG fine-tuning In Proceedings of the 2017 Conference on Empiri- stage takes less than 1 hour for all the models. We cal Methods in Natural Language Processing, pages chose our best model based on multi-BLEU score7. 2253–2263, Copenhagen, Denmark. Association for Computational Linguistics. For inference, we use beam search with beam size in the range {3,5}. Table4 shows the number of Thomas Wolf, Lysandre Debut, Victor Sanh, Julien the parameters for each pre-trained model. Chaumond, Clement Delangue, Anthony Moi, Pier- ric Cistac, Tim Rault, Remi Louf, Morgan Funtow- icz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander Rush. 2020. Trans- formers: State-of-the-art natural language process- ing. In Proceedings of the 2020 Conference on Em- pirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online. Asso- ciation for Computational Linguistics. Chao Zhao, Marilyn Walker, and Snigdha Chaturvedi. 2020. Bridging the structural gap between encod- ing and decoding for data-to-text generation. In Pro- ceedings of the 58th Annual Meeting of the Asso- ciation for Computational Linguistics, pages 2481– 2491, Online. Association for Computational Lin- guistics. A Hyperparameters and Statistics of the Model

Origin5 + Position BART-base 139.42M 139.43M distil-BART-xsum 305.51M 305.53M BART-large 406.29M 406.31M T5-base 222.88M 222,90M T5-large 737.64M 737.65M

Table 4: # of Model Parameters

Our model is built based on the Huggingface framework (Wolf et al., 2020)6. Because the aver- age lengths for source and target text in the train- ing set are 31 and 22 words respectively, we set the maximum length for both source and target to 100 words. For T5 preprocessing, we prepend “translate RDF to English:” before the input. For BART-base, distil-BART-xsum, and T5-base, we use a batch size of 32 and train the model. We use a batch size of 16 for Bart-large, and 6 for T5- 7https://gitlab.com/webnlg/webnlg-baseline/- large. We use the Adam optimizer (Kingma and /blob/master/multi-bleu.perl 7# of parameters are slightly different because we add 6https://github.com/huggingface/transformers special tokens to the vocabulary B Sample Generation Results

Category Output Input S| Aaron Turner P| associatedBand/associatedMusicalArtist O| Twilight (band) S| Aaron Turner P| associatedBand/associatedMusicalArtist O| Lotus Eaters (band) S| Aaron Turner P| genre O| Sludge metal S| Aaron Turner P| origin O| Massachusetts S| Aaron Turner P| activeYearsStartYear O| 1995

Reference Aaron Turner was born in Massachusetts and started performing in 1995 . He formerly played with the band Twilight and is now in the Lotus Eaters . He is is closely associated with the Sludge metal genre. T5-large Aaron Turner, who is from Massachusetts , is a member of the Twilight band and the Lotus Eaters band. He is also associated with the Sludge metal genre and started playing in 1995. T5-large Aaron Turner was born in Massachusetts and started playing in 1995 . He is associated with the band + Wiki Twilight and the band Lotus Eaters . His music style is Sludge metal. T5-large + Aaron Turner, who was born in Massachusetts , is associated with Twilight and Lotus Eaters . He is a Position member of Sludge metal band and has also been a musical artist. [1995]

T5-large Aaron Turner was born in Massachusetts and started playing Sludge metal in 1995 . He is also associated + Wiki + with Lotus Eaters and Twilight. Position

Table 5: Human and System Generated Description We use the color box to frame each entity out with the same color as the corresponding entity. We highlight fabricate facts, [missed relations], and incorrect relations with different color.

Category Output Input S| Alfred Garth Jones P| deathPlace O| Sidcup S| Alfred Garth Jones P| deathPlace O| London S| Alfred Garth Jones P| birthPlace O| Manchester S| Manchester P| leaderName O| Labour Party (UK) S| London P| leaderTitle O| European Parliament

Reference Alfred Garth Jones was born in Manchester and died in Sidcup , London . The leader of London is the European Parliament , and the leader of Manchester is The Labour Party (UK). T5-large Alfred Garth Jones was born in Manchester (UK) and died in London (leader of which is the European Parliament). He was born in Sidcup. [Labour Party (UK)] T5-large Alfred Garth Jones was born in Manchester and died in Sidcup , London . Manchester is lead by the + Wiki Labour Party (UK) and London is lead by the European Parliament. T5-large + Alfred Garth Jones was born in Manchester and died in London . He was a member of the Position Labour Party and the European Parliament. He died in Sidcup. T5-large Alfred Garth Jones was born in Manchester and died in Sidcup , London . The leader of Manchester is + Wiki + the Labour Party (UK) and the leader of London is the European Parliament. Position

Table 6: Human and System Generated Description We use the color box to frame each entity out with the same color as the corresponding entity. We highlight fabricate facts, [missed relations], and incorrect relations with different color. Category Output Input S| P| commander O| Jefferson Davis S| P| country O| S| United States P| ethnicGroup O| Native Americans in the United States S| Alfred Moore Scales P| battles O| S| Siege of Petersburg P| isPartOfMilitaryConflict O| American Civil War target Reference American Civil War , Siege of Petersburg involved U.S. born Alfred Moore Scales. The American Civil War was commanded by Jefferson Davis . An ethnic group of the United States includes Native Americans . T5-large Alfred Moore Scales is from the United States where Native Americans are one of the ethnic groups. He fought in the Siege of Petersburg which was part of the American Civil War commanded by Jefferson Davis. T5-large The American Civil War was fought in the Siege of Petersburg. Jefferson Davis was the commander + Wiki of the war. Alfred Moore Scales was born in the United States where Native Americans are an ethnic group.[S | Alfred Moore Scales P| battles O| Siege of Petersburg] T5-large + Alfred Moore Scales was born in the United States , where Native Americans are an ethnic group. He Position fought in the American Civil War, which was led by Jefferson Davis . The Siege of Petersburg is part of the American Civil War.[S | Alfred Moore Scales P| battles O| Siege of Petersburg] T5-large Alfred Moore Scales is from the United States where Native Americans are one of the ethnic groups. + Wiki + He fought in the Siege of Petersburg which is part of the American Civil War . Jefferson Davis was the Position commander of the American Civil War.

Table 7: Human and System Generated Description We use the color box to frame each entity out with the same color as the corresponding entity. We highlight fabricate facts, [missed relations], and incorrect relations with different color.

Category Output Input S| 101 HelenaP | discoverer O| James Craig Watson S| James Craig Watson P| almaMater O| University of Michigan S| 101 Helena P| discovered O| 1868-08-15 S| James Craig Watson P| nationality O| Canada S| James Craig Watson P| deathPlace O| Madison, Wisconsin Reference James Craig Watson, who discovered 101 Helena on 15th August 1868 , is a Canadian national who attended the University of Michigan . He died in Madison, Wisconsin. T5-large James Craig Watson is a Canadian who graduated from the University of Michigan . He was the discoverer of 101 Helena which was discovered on 15 August 1868 . He died in Madison, Wisconsin. T5-large James Craig Watson, a Canadian , graduated from the University of Michigan and discovered 101 Helena + Wiki on August 15th, 1868 . He died in Madison, Wisconsin. T5-large + James Craig Watson, a Canadian national, was a student at the University of Michigan and discovered Position 101 Helena on August 15, 1868 . He died in Madison, Wisconsin. T5-large James Craig Watson, a Canadian , graduated from the University of Michigan and discovered 101 Helena + Wiki + on August 15th, 1868 . He died in Madison, Wisconsin. Position

Table 8: Human and System Generated Description We use the color box to frame each entity out with the same color as the corresponding entity. We highlight fabricate facts, [missed relations], and incorrect relations with different color.