Learning to Select Bi-Aspect Information for Document-Scale Text Content Manipulation
Xiaocheng Feng,¹ Yawei Sun,¹ Bing Qin,¹ Heng Gong,¹ Yibo Sun,¹ Wei Bi,² Xiaojiang Liu,² Ting Liu¹
¹Harbin Institute of Technology, Harbin, China
²Tencent AI Lab, Shenzhen, China
{xcfeng, ywsun, qinb, hgong, ybsun, [email protected]}, {victoriabi, [email protected]}

Abstract

In this paper, we focus on a new practical task, document-scale text content manipulation, which is the opposite of text style transfer and aims to preserve text styles while altering the content. In detail, the input is a set of structured records and a reference text describing another recordset. The output is a summary that accurately describes the partial content in the source recordset with the same writing style as the reference. The task is unsupervised due to the lack of parallel data, and it is challenging both to select suitable records and style words from the bi-aspect inputs and to generate a high-fidelity long document. To tackle these problems, we first build a dataset based on a basketball game report corpus as our testbed, and we present an unsupervised neural model with an interactive attention mechanism, which learns the semantic relationship between records and reference texts to achieve better content transfer and better style preservation. In addition, we explore the effectiveness of back-translation in our task for constructing pseudo-training pairs. Empirical results show the superiority of our approaches over competitive methods, and our models also yield a new state-of-the-art result on a sentence-level dataset.[1]

[1] Our code and data are available at: https://github.com/syw1996/SCIR-TG-Data2text-Bi-Aspect

[Figure 1: An example input (Table and Reference Summary) of document-level text content manipulation and its desired output. Text portions that fulfill the writing style are highlighted in orange. The figure pairs a Rockets vs. Spurs record table (with entity, type, value, and home/visiting feature per record) and a reference summary of a Clippers vs. Jazz game with the desired output: a report describing the Rockets vs. Spurs records in the reference's writing style.]

Introduction

Data-to-text generation is an effective way to address data overload, especially with the development of sensor and data storage technologies, which have rapidly increased the amount of data produced in various fields such as weather, finance, medicine and sports (Barzilay and Lapata 2005).
However, related methods mainly focus on content fidelity, ignoring and lacking control over language-rich style attributes (Wang et al. 2019). For example, a sports journalist prefers to use some repetitive words when describing different games (Iso et al. 2019). It can be more attractive and practical to generate an article in a particular style that describes the conditioning content.

In this paper, we focus on a novel research task in the field of text generation, named document-scale text content manipulation: the task of converting the content of one document into another while preserving the content-independent style words. For example, given a set of structured records and a reference report, such as statistical tables for a basketball game and a summary of another game, we aim to automatically select partial items from the given records and describe them with the same writing style (e.g., logical expressions, wording, transitions) as the reference text, directly generating a new report (Figure 1).

In this task, the definition of the text content (e.g., the statistical records of a basketball game) is clear, but the text style is vague (Dai et al. 2019), and it is difficult to construct paired sentences or documents for text content manipulation. Therefore, the majority of existing text editing studies develop controlled generators with unsupervised generation models, such as Variational Auto-Encoders (VAEs) (Kingma and Welling 2013), Generative Adversarial Networks (GANs) (Goodfellow et al. 2014), and auto-regressive networks (Oord, Kalchbrenner, and Kavukcuoglu 2016) with additional pre-trained discriminators.

Despite the effectiveness of these approaches, it remains challenging to generate a high-fidelity long summary from the inputs. One reason for the difficulty is that the input structured records for document-level generation are complex and redundant, so it is hard to determine which part of the data should be mentioned based on the reference text. Similarly, the model also needs to select suitable style words according to the input records. One straightforward way to address this problem is to use relevant algorithms from data-to-text generation, such as the pre-selector (Mei, Bansal, and Walter 2015) and the content selector (Puduppully, Dong, and Lapata 2018). However, these supervised methods cannot be directly transferred, considering that we impose the additional goal of preserving the style words, for which there is no parallel data or explicit training objective.

In addition, when the generation length is expanded from a sentence to a document, the sentence-level text content manipulation method (Wang et al. 2019) can hardly preserve the style words (see the case study, Figure 4).

In this paper, we present a neural encoder-decoder architecture to deal with document-scale text content manipulation. First, we design a powerful hierarchical record encoder to model the structured records. Afterwards, instead of modeling records and the reference summary as two independent modules (Wang et al. 2019), we create fusion representations of records and reference words with an interactive attention mechanism. It captures the semantic relatedness of the source records to the reference text, equipping the system with the capability of content selection over the two different types of inputs. Finally, we incorporate back-translation (Sennrich, Haddow, and Birch 2016) into the training procedure, which provides an extra training objective for our model and further improves results.
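The paper describes the interactive attention mechanism only at this high level here, so the following PyTorch-style sketch is a rough, non-authoritative illustration of the idea, not the authors' implementation: each record encoding attends over the reference-word encodings, and the resulting context is concatenated to form a fusion representation. All tensor names, shapes, and the concatenation choice are our own assumptions.

```python
import torch
import torch.nn.functional as F

def interactive_attention(record_enc: torch.Tensor, ref_enc: torch.Tensor) -> torch.Tensor:
    """Fuse record and reference encodings via soft attention (illustrative sketch).

    record_enc: (L_x, d) -- one vector per table record
    ref_enc:    (L_y, d) -- one vector per reference-summary token
    Returns fused record representations of shape (L_x, 2 * d).
    """
    # Similarity between every record and every reference token.
    scores = record_enc @ ref_enc.t()        # (L_x, L_y)
    # Attention distribution over reference words, per record.
    weights = F.softmax(scores, dim=-1)      # (L_x, L_y)
    # Reference context vector for each record.
    context = weights @ ref_enc              # (L_x, d)
    # "Fusion" representation: record encoding + its reference context.
    return torch.cat([record_enc, context], dim=-1)

# Toy usage with random encodings, sized roughly like one document-level
# instance from Table 1 (606 input records, ~340 reference tokens).
fused = interactive_attention(torch.randn(606, 128), torch.randn(340, 128))
print(fused.shape)  # torch.Size([606, 256])
```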
To verify the effectiveness of our text manipulation approaches, we first build a large unsupervised document-level text manipulation dataset, which is extracted from an NBA game report corpus (Wiseman, Shieber, and Rush 2017). Experiments with different methods on this new corpus show that our full model achieves 35.02 in Style BLEU and 39.47 F-...

Preliminaries

Problem Statement

Our goal is to automatically select partial items from the given content and describe them with the same writing style as the reference text. As illustrated in Figure 1, each input instance consists of a statistical table x and a reference summary y′. We regard each cell in the table as a record, $r = \{r_o\}_{o=1}^{L_x}$, where $L_x$ is the number of records in table x. Each record r consists of four types of information: the entity r.e (the name of a team or player, such as LA Lakers or Lebron James), the type r.t (the record type, e.g., points, assists or rebounds), the value r.v (the value of a certain player or team on a certain type), and the feature r.f (e.g., home or visiting), which indicates whether a player or team competes on its home court or not. In practice, each player or team takes one row in the table and each column contains one type of record, such as points, assists, etc. The reference summary or report consists of multiple sentences, which are assumed to describe content that has the same types but different entities and values from those of the table x.

Furthermore, following the same setting as sentence-level text content manipulation (Wang et al. 2019), we also provide additional information at training time. For instance, each given table x is paired with a corresponding y_aux, which was originally written to describe x, and each reference summary y′ also has its corresponding table x′ containing the record information. This additional information can help models learn the table structure and how the desired records can be expressed in natural language during training. It is worth noting that we do not utilize any side information beyond (x, y′) during the testing phase, and the task is unsupervised in that there is no ground-truth target text for training.

Document-scale Data Collection

In this subsection, we construct a large document-scale text content manipulation dataset as a testbed for our task. The dataset is derived from an NBA game report corpus, ROTOWIRE (Wiseman, Shieber, and Rush 2017), which consists of 4,821 human-written NBA basketball game summaries aligned with their corresponding game tables. Statistics of the document-level (D) and sentence-level (S) datasets are shown in Table 1.

                            Train (D/S)      Dev (D/S)       Test (D/S)
#Instances                  3,371/31,751     722/6,833       728/6,999
Avg Ref Length              335.55/25.90     341.17/25.82    346.83/25.99
#Data Types                 37/34            37/34           37/34
Avg Input Record Length     606/5            606/5           606/5
Avg Output Record Length    38.05/4.88       37.80/4.85      31.32/4.94

Table 1: Document-level/Sentence-level data statistics.
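To make the record notation and the training-time side information concrete, here is a minimal Python sketch of one record and one training instance. The class and field names (Record, TrainingInstance, y_ref, etc.) are our own hypothetical illustration, not part of the released dataset or the authors' code; the example records are read off the Figure 1 table.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Record:
    """One table cell r, with the four information types from the Problem Statement."""
    entity: str   # r.e, e.g. "Rockets" or "James Harden"
    rtype: str    # r.t, e.g. "PTS", "REB", "AST"
    value: str    # r.v, e.g. "117"
    feature: str  # r.f, "HOME" or "VISITING"

@dataclass
class TrainingInstance:
    """A training instance: (x, y') plus training-only side information (x', y_aux).

    Only x and y_ref are available at test time; x_ref and y_aux are the
    side information described above, used only during training.
    """
    x: List[Record]      # source table x
    y_ref: List[str]     # tokenized reference summary y'
    x_ref: List[Record]  # table x' underlying the reference summary
    y_aux: List[str]     # summary originally written to describe x

# A few records from the Figure 1 example table.
x = [
    Record("Rockets", "PTS", "117", "HOME"),
    Record("James Harden", "PTS", "16", "HOME"),
    Record("Spurs", "PTS", "91", "VISITING"),
]
```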