Neural Data-to-Text Generation with LM-based Text Augmentation

Ernie Chang†, Xiaoyu Shen∗, Dawei Zhu†, Vera Demberg†, Hui Su⊗
†Dept. of Language Science and Technology, Saarland University
∗Amazon Alexa AI, Berlin
⊗Center, WeChat AI, Tencent Inc, China
{cychang,xiaoyu}@coli.uni-saarland.de
∗Work done prior to joining Amazon.

Abstract

For many new application domains for data-to-text generation, the main obstacle in training neural models consists of a lack of training data. While usually large numbers of instances are available on the data side, often only very few text samples are available. To address this problem, we here propose a novel few-shot approach for this setting. Our approach automatically augments the data available for training by (i) generating new text samples based on replacing specific values with alternative ones from the same category, (ii) generating new text samples based on GPT-2, and (iii) proposing an automatic method for pairing the new text samples with data samples. As the text augmentation can introduce noise to the training data, we use cycle consistency as an objective, in order to make sure that a given data sample can be correctly reconstructed after having been formulated as text (and that text samples can be reconstructed from data). On both the E2E and WebNLG benchmarks, we show that this weakly supervised training paradigm is able to outperform fully supervised seq2seq models with less than 10% annotations. By utilizing all annotated data, our model can boost the performance of a standard seq2seq model by over 5 BLEU points, establishing a new state-of-the-art on both datasets.

Figure 1: Few-shot scenario: the model is expected to learn data-to-text generation with few labeled instances (i.e. table-text pairs). The example is taken from the E2E dataset.

1 Introduction

Neural data-to-text generation has been the subject of much recent research. The task aims at transforming source-side structured data into target-side natural language text (Reiter and Dale, 2000; Barzilay and Lapata, 2005). While neural end-to-end systems afford the advantage of easy adaptability (Lebret et al., 2016; Wiseman et al., 2017), huge amounts of data-text pairs are still necessary to perform on par with their rule-based counterparts (van der Lee et al., 2018). This makes using neural systems less appealing: oftentimes, in-domain text samples are not readily available, and there is a high cost to collecting in-domain texts which fit the data samples and annotating these texts with the data labels; this cost might hence even outweigh the effort of designing a rule-based system (Gkatzia, 2016). The goal of this work is to improve the performance of neural data-to-text models in scenarios where only very few text samples exist (we assume that these text samples are paired with corresponding data samples). We aim to answer how we can make the most of the scarce annotations, together with large amounts of unlabelled data, in order to push the limit of neural data-to-text models. Figure 1 illustrates the scenario.

To address the limited-data challenge, we propose a simple yet effective way of augmenting the text side with the pretrained language model (LM) GPT-2 (Radford et al., 2019). Unlike other text augmentation work employed in data-to-text generation systems (Freitag and Roy, 2018; Agarwal et al., 2018), our proposal assumes little to no domain-dependent heuristics. It consists of two steps: (1) information augmentation by slot-value replacement and (2) LM augmentation by GPT-2 generation.

Once we have augmented the set of text samples, we are essentially in a similar setting as previously proposed semi-supervised approaches to data-to-text generation (Schmitt and Schütze, 2019; Qader et al., 2019; Su et al., 2020b), which assume the presence of vast amounts of unpaired data and text instances. These approaches exploit a cycle consistency objective in order to learn a pairing for the data samples. The cycle consistency objective tries to make sure that data samples can be reconstructed correctly from their textual formulations, and similarly that texts can be reconstructed after having been parsed into a data representation.

As the automatically generated text samples from GPT-2 might be very noisy and not pair well with data samples, we align each augmented text sample with its most similar unlabeled data sample, as defined in their encoded vector space. This idea is inspired by recent work on representation matching in MT (Artetxe and Schwenk, 2019; Ruiter et al., 2019). To ensure good quality of the training data, only pairs above a certain similarity threshold are retained as pseudo pairs for training. The quality of the pseudo pairs gradually improves as the encoder improves during training. In return, the learning of the encoder is also facilitated by the improved quality of the pseudo pairs, forming a virtuous cycle.

On two data-to-text benchmarks, E2E (Novikova et al., 2017) and WebNLG (Gardent et al., 2017), we show that our LM-augmented weakly supervised model succeeds in outperforming a fully supervised seq2seq model while utilizing less than 10% of the data annotations. It even outperforms previous work which additionally has access to all unpaired text samples. When trained with full data annotations, it is able to boost the model performance by up to 5 BLEU points, establishing a new state-of-the-art on both datasets.

In summary, this work makes the following contributions:

1. We study the few-shot data-to-text scenario where, unlike in previous works, no further target-side text is available.

2. We present an effective way of automatically augmenting target text by resorting to the pretrained LM GPT-2.

3. We propose utilizing the augmented text through a combination of cycle consistency and representation matching. The resulting model outperforms a standard seq2seq model with less than 10% data annotations.

4. The proposed model is shown to be complementary with current seq2seq pretraining techniques, and can offer orthogonal improvements when combining both.

2 Related Work

Building neural data-to-text systems with few paired samples (but a large set of unpaired samples) has been a hot research topic recently. Most works adopt the idea of cycle consistency (Zhu et al., 2017), which has been used in many text generation tasks like machine translation (Artetxe et al., 2017; Lample et al., 2017) and style transfer (Prabhumoye et al., 2018; Subramanian et al., 2018). Schmitt and Schütze (2019); Qader et al. (2019); Su et al. (2020b); Chang et al. (2020b, 2021a,b) applied this idea to the task of data-to-text generation and reported promising results. Ma et al. (2019) separate the generation process into few-shot content selection and surface realization components and learn them separately. Nonetheless, all of these approaches assume the existence of huge quantities of unpaired text samples, which, as we mentioned, is an unrealistic assumption for the task of data-to-text generation. Freitag and Roy (2018) propose to reconstruct usable sequences re-written from data with rules for unsupervised data-to-text generation. Unfortunately, designing these rules requires effort similar to building a template-based system. Budzianowski and Vulić (2019); Chen et al. (2020); Peng et al. (2020) tackle the few-shot challenge by finetuning a pretrained LM to incorporate prior knowledge from general-domain text or data-text pairs. We show that our technique is complementary with them and can offer orthogonal improvements when combining both.
3 Problem Formulation

We represent the data samples as D and the text samples as T. In our work, we do not restrict the format of the data. Each d ∈ D can be a set of key-value pairs, as in Figure 1, or in the form of RDF triples as in Gardent et al. (2017). Each text t ∈ T consists of a sequence of words. In few-shot settings, we are assumed to have (1) k labeled pairs (D_L, T_L) and (2) large quantities of unlabeled data D_U, where $|D_U| \gg k > 0$.¹ This, we believe, is a more realistic setting, as unlabeled data are usually abundant and can also be easily fabricated from predefined schemata. Notably, we assume no access to outside resources containing in-domain text. The k annotations are all we know about the text side.

¹ We force k > 0 as we believe a reasonable generation system needs at least a few demonstrations of the annotation.

4 Approach

In this section, we first explain our proposed method for text sample augmentation, and then discuss methods to remove noise and automatically align the data by elaborating on the ideas of cycle consistency and representation matching. Finally, we summarize the approach and present the detailed algorithm.

4.1 Text Augmentation

To mitigate the paucity of the set of text samples T, we propose a pipeline approach to augment the text samples by (1) information augmentation and (2) LM augmentation.

4.1.1 Information Augmentation

We generate additional text samples by performing slot-value replacements. As many data values are copied verbatim into the text samples, this copied information can be easily detected and replaced with other values (of the same slot type) to enrich the information space of the text samples. This can be considered a simplified version of traditional template mining methods, where key words are extracted to construct templates (Kondadadi et al., 2013; Oya et al., 2014). An example is shown in Figure 2. Each text sample is augmented with 10 more distinct text samples, or with all possible values being replaced. The slot-value replacement is efficient to implement (a minimal sketch follows below).
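The replacement step only needs exact string matching, since the swappable values are precisely those copied into the text. The following sketch illustrates the idea under simple assumptions; the function name, the value inventory and the sampling scheme are illustrative, not the authors' released implementation.

```python
import random

def augment_by_slot_value_replacement(pairs, value_inventory, max_new=10, seed=0):
    """Create new (data, text) pairs by swapping copied slot values (Sec. 4.1.1).

    pairs: list of (data, text), where data is a dict mapping slot -> value.
    value_inventory: dict mapping slot -> alternative values (e.g. collected from D_U).
    """
    rng = random.Random(seed)
    augmented = []
    for data, text in pairs:
        # Only slots whose value occurs verbatim in the text can be swapped safely.
        copied = {slot: val for slot, val in data.items() if val in text}
        for _ in range(max_new):
            new_data, new_text = dict(data), text
            for slot, old_val in copied.items():
                new_val = rng.choice(value_inventory.get(slot, [old_val]))
                new_data[slot] = new_val
                new_text = new_text.replace(old_val, new_val)
            augmented.append((new_data, new_text))
    return augmented

# Toy usage on an E2E-style annotation.
pairs = [({"name": "Blue Spice", "area": "riverside"},
          "Blue Spice is located in riverside and has a price range of less than £20.")]
inventory = {"name": ["The Punter", "Alimentum"], "area": ["city centre", "riverside"]}
for d, t in augment_by_slot_value_replacement(pairs, inventory, max_new=3):
    print(d, "->", t)
```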
However, slot-value replacement can only detect identical values and augment the text with the same combinatorial patterns as the few-shot annotations. To enrich the linguistic realizations of the text sentences and enable new combinations of information, we further propose an LM augmentation approach using GPT-2.

4.1.2 LM Augmentation

GPT-2 (Radford et al., 2019) is a language model pretrained on the collected WebText. It has demonstrated remarkable zero-shot multitask adaptability by simply feeding the input of each task into the LM and continuing to generate words. It has also been shown that GPT-2 is able to improve classification tasks via in-domain text augmentation (Papanikolaou and Pierleoni, 2020; Sun et al., 2020). We use a similar technique by first finetuning GPT-2 on the few-shot annotations (Wolf et al., 2019), and then applying it to produce synthetic text through an iterative conditional generation process: with the initial seeds being the samples of T_L plus the new samples from information augmentation, the LM iteratively conditions on its previous output sentence to generate in-domain text.² Each synthetic sentence is pruned if it (1) is shorter than 5 words or (2) contains only special tokens. The iterative generation is terminated when all tokens in the initial seeds are covered or when the maximum of 100 runs is reached. All unpruned synthetic text samples are added to the space of T to benefit the learning directions t → d′ → t and t̃ → t. Figure 2 depicts the generation process of GPT-2.

² We adopt the top-k random sampling setting with k = 2 to encourage diversity and reduce repetition (Radford et al., 2019).

In practice, obtaining clean in-domain text requires extreme effort in designing heuristic rules. Nonetheless, the synthetic text from GPT-2 makes decent sense and can already provide useful signals to drive the learning process.
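A minimal sketch of this iterative conditional generation with the Hugging Face transformers library is shown below. The pruning thresholds and the top-k value follow the description above; the checkpoint name, prompt handling and generation length are assumptions (in the paper, GPT-2 is first finetuned on the k annotations, which is omitted here).

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# "gpt2" stands in for a checkpoint already finetuned on the few-shot annotations.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def lm_augment(seeds, max_runs=100, top_k=2, min_words=5):
    """Iteratively condition GPT-2 on its previous output sentences (Sec. 4.1.2)."""
    seed_tokens = {tok for s in seeds for tok in s.split()}
    covered, synthetic, current = set(), [], list(seeds)
    for _ in range(max_runs):
        next_round = []
        for sent in current:
            ids = tokenizer.encode(sent, return_tensors="pt")
            with torch.no_grad():
                out = model.generate(ids, do_sample=True, top_k=top_k,
                                     max_new_tokens=40,
                                     pad_token_id=tokenizer.eos_token_id)
            text = tokenizer.decode(out[0, ids.shape[1]:],
                                    skip_special_tokens=True).strip()
            if len(text.split()) < min_words:   # pruning rule (1); rule (2) is
                continue                        # approximated by skip_special_tokens
            synthetic.append(text)
            next_round.append(text)
            covered |= set(text.split()) & seed_tokens
        current = next_round or list(seeds)
        if covered >= seed_tokens:              # stop once all seed tokens are covered
            break
    return synthetic

print(lm_augment(["The Punter is located in city centre."], max_runs=2)[:3])
```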

[Figure 2 example: the annotated pair (Blue Spice, Riverside, less than £20) with the text "Blue Spice is located in Riverside and has a price range of less than £20." is turned by slot-value replacement into (The Punter, city centre, £20-25) with the text "The Punter is located in city centre and has a price range of £20-25."; the augmented texts seed iterative GPT-2 generation, and each generated sentence is then paired with its most similar table and kept or dropped according to the similarity score.]

Figure 2: Depiction of text augmentation and representation matching. Each text sample first goes through information augmentation by slot-value replacement and is then passed to GPT-2 for iterative conditional generation. The augmented text samples are paired with the most similar data from the corpus with a threshold cutoff.

4.2 Cycle Consistency

The core idea of encouraging cycle consistency is that, starting from one sample in a domain, the model first maps it into the other domain and then maps it back (He et al., 2016). The resulting sample should be identical to the original sample. Specifically, let $p_{\theta}(t \mid d)$ be the probability distribution that maps a data sample d to its corresponding text t, and $p_{\phi}(d \mid t)$ be the probability distribution that maps text back to data. Starting from a data sample d ∈ D, the objective is:

$\max_{\phi} \; \mathbb{E}_{d \sim p(D)} \log p_{\phi}(d \mid t'), \quad t' \sim p_{\theta}(t \mid d)$   (1)

which ensures consistency in the direction d → t′ → d. Note that only $p_{\phi}$ is updated in this direction; $p_{\theta}$ serves only as an auxiliary function to provide pseudo samples t′ from d. Though it is also possible to update θ at the same time through tricks like Gumbel-softmax (Jang et al., 2016) or REINFORCE (Williams, 1992), we found that this did not lead to better performance, yet complicated the training. Similar observations have been made in Lample et al. (2018); He et al. (2020); Garcia et al. (2020).

Similarly, starting from a text t ∈ T, the objective is to ensure consistency in the direction t → d′ → t:³

$\max_{\theta} \; \mathbb{E}_{t \sim p(T)} \log p_{\theta}(t \mid d'), \quad d' \sim p_{\phi}(d \mid t)$   (2)

Finally, we further add two denoising autoencoding objectives on both the data and text sides:

$\max_{\theta,\phi} \; \mathbb{E}_{d \sim p(D),\, t \sim p(T)} \log p_{\phi}(d \mid \tilde{d})\, p_{\theta}(t \mid \tilde{t})$   (3)

where d̃ and t̃ are corrupted versions of d and t. We use the same noise function as in Lample et al. (2018), which randomly permutes and pads a portion of the input. This can encourage the encoder to learn meaningful latent representations by reconstructing the input itself (Currey et al., 2017; Lample et al., 2018).

Figure 3 illustrates all four directions of the cycle consistency objective.

Figure 3: Four directions of cycle consistency. Gradients are backpropagated only through solid lines.

We use one shared encoder Enc for both the data and text sides. Each data sample is flattened into a sequence by listing its slot-value pairs and is fed into the same encoder. Using the same encoder for both types of input gives the model an inductive bias to project similar data/text into nearby regions of the latent space. We will show later that encoder sharing is essential for good performance under the few-shot scenario. From the shared encoded space, two separate decoders Dec_d and Dec_t are used to decode d and t respectively.⁴

³ In the MT community, the equivalent step is usually called back translation (Sennrich et al., 2016; Lample et al., 2018).
⁴ The shared encoding has also been shown effective in other tasks like machine translation (Lample et al., 2018) and image translation (Zhu et al., 2017). We further tried sharing the decoder as in Johnson et al. (2017) but found no improvement (see Table 2).
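The four directions can be written as three Monte-Carlo loss terms around the shared encoder. The sketch below is schematic: the scorer and generator callables and the noise function are illustrative assumptions about the interface, not the paper's released code; d and t are assumed to be token sequences (the data sample already flattened into slot-value tokens).

```python
import random

def corrupt(tokens, window=3, pad="<pad>", pad_prob=0.1, rng=None):
    """Noise for Eq. 3 in the spirit of Lample et al. (2018):
    shuffle tokens locally and pad a portion of the input."""
    rng = rng or random.Random(0)
    noisy = []
    for i in range(0, len(tokens), window):
        chunk = list(tokens[i:i + window])
        rng.shuffle(chunk)
        noisy.extend(chunk)
    return [pad if rng.random() < pad_prob else tok for tok in noisy]

def cycle_consistency_losses(d, t, log_p_data, log_p_text, gen_data, gen_text):
    """Losses for one unpaired data sample d and one unpaired text t (Eqs. 1-3).

    log_p_data(d, x): log p_phi(d | x)   -- shared encoder + data decoder
    log_p_text(t, x): log p_theta(t | x) -- shared encoder + text decoder
    gen_data / gen_text: generation used to build the pseudo sample; in training,
                         no gradient would flow through this step (cf. Sec. 4.2).
    """
    t_pseudo = gen_text(d)                     # d -> t'
    loss_eq1 = -log_p_data(d, t_pseudo)        # reconstruct d from t'       (Eq. 1)

    d_pseudo = gen_data(t)                     # t -> d'  (back translation)
    loss_eq2 = -log_p_text(t, d_pseudo)        # reconstruct t from d'       (Eq. 2)

    loss_eq3 = -log_p_data(d, corrupt(d)) - log_p_text(t, corrupt(t))  #      (Eq. 3)
    return loss_eq1 + loss_eq2 + loss_eq3

# Toy check with dummy scorers/generators.
print(cycle_consistency_losses(
    ["name", "Blue", "Spice"], ["Blue", "Spice", "serves", "English", "food"],
    log_p_data=lambda d, x: -1.0, log_p_text=lambda t, x: -1.0,
    gen_data=lambda x: x, gen_text=lambda x: x))
```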
To summarize, the process of 4: Create T 0 by LM augmentation ; 0 representation matching can be described as: 5: T ← TL ∪ T ; 1 ∗ 6: repeat max Et∼p(T 0) cos(d∗,t)>ε( log pθ(t|d ) θ,φ 7: Sample batch data from (DL,T ) ; ∗ 8: Cycle consistency: + log pφ(d |t)); (4) 9: Optimize by Eq.1 + Eq.2 + Eq.3; d∗ = arg max cos(d, t) d∈D 10: Representation Matching: 11: Optimize by Eq.4; 0 where T is augmented text from the LM and 1 12: Supervised Training: is the indicator function. We also perform mean 13: Optimize by Eq.5; pooling over the encoded representations before 14: until convergence matching them. ε is a threshold. Pseudo pairs with a cosine similarity less than ε will be discarded. Ideally, as the encoder improves, the pseudo pairs with initial learning rate at 0.0002. Batch size created by representation matching will make more is kept at 48 with a dropout rate at 0.3. We em- sense, which can in turn benefit the training of the ploy beam search with size 3 for decoding and encoder. select models based on BLEU-4 scores on the de- velopment set. The score is averaged over 10 ran- 4.4 Summary dom initialization runs. In this work, the seq2seq Apart from the above unsupervised objective, on models are built upon the long short-term mem- the few annotated data-text pairs, we can impose ory (LSTM) (Hochreiter and Schmidhuber, 1997). the supervised objective: For LSTM cells, both the encoder and decoder have 3 layers, amounting to 18M parameters for max Ed,t∼p(D ,T ) log pθ(t|d) + log pφ(d|t) (5) θ,φ L L the seq2seq model (600-dimension and 1024 hid- den units). Maximum sequence length is set as 100 where (DL,TL) contains the k data annotations. for E2E and 200 for WebNLG (SPM-based). All Putting all together, we summarize it in Algorithm encoder parameters are shared between data and 1. In the training stage, we optimize the objec- text samples. All models were trained on 1 Nvidia tives of cycle consistency, representation matching V100 GPUs (32GB and CUDA Version 10.2) for 4k and supervised learning sequentially to maintain a steps. The total batch size is around 48K tokens per constant ratio of signals from all sides. GPU and we use the Adam optimizer ( = 1e−6, β2 = 0.98) along with linear learning rate decay 5 Experiment Setting scheduling. The total number of updates is set to Data We conduct experiments on the 8000 for all training and models are selected based E2E (Novikova et al., 2017) and WebNLG (Colin on optimal validation BLEU4. At decoding time, et al., 2016) datasets. E2E is a crowd-sourced sentences are generated using greedy decoding. dataset containing 50k instances in the restaurant domain. The inputs are dialogue acts consisting of 6 Results and Analysis three to eight slot-value pairs. WebNLG contains In this section, we present experiment results and 25k instances describing entities belonging to analysis. We first compare our model with other fifteen distinct DBpedia categories. The inputs baselines on both datasets, then perform a set of are up to seven RDF triples of the form (subject, ablation studies on the E2E dataset to see the effects relation, object). of each component. Finally, we analyze how text Configuration The model is implemented based augmentation helps improves the model, include on fairseq (Ott et al., 2019). We use 600- example outputs and show the human evaluation dimensional token embedding and Adam optimizer results in the end. 
5 Experiment Setting

Data  We conduct experiments on the E2E (Novikova et al., 2017) and WebNLG (Colin et al., 2016) datasets. E2E is a crowd-sourced dataset containing 50k instances in the restaurant domain. The inputs are dialogue acts consisting of three to eight slot-value pairs. WebNLG contains 25k instances describing entities belonging to fifteen distinct DBpedia categories. The inputs are up to seven RDF triples of the form (subject, relation, object).

Configuration  The model is implemented based on fairseq (Ott et al., 2019). We use 600-dimensional token embeddings and the Adam optimizer with an initial learning rate of 0.0002. Batch size is kept at 48 with a dropout rate of 0.3. We employ beam search with size 3 for decoding and select models based on BLEU-4 scores on the development set. The score is averaged over 10 runs with random initialization. In this work, the seq2seq models are built upon the long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997). For the LSTM cells, both the encoder and decoder have 3 layers, amounting to 18M parameters for the seq2seq model (600-dimensional embeddings and 1024 hidden units). The maximum sequence length is set to 100 for E2E and 200 for WebNLG (SPM-based). All encoder parameters are shared between data and text samples. All models were trained on one Nvidia V100 GPU (32GB, CUDA version 10.2) for 4k steps. The total batch size is around 48K tokens per GPU and we use the Adam optimizer (ε = 1e−6, β2 = 0.98) along with linear learning rate decay scheduling. The total number of updates is set to 8000 for all training runs and models are selected based on the optimal validation BLEU-4. At decoding time, sentences are generated using greedy decoding.

6 Results and Analysis

In this section, we present experiment results and analysis. We first compare our model with other baselines on both datasets, then perform a set of ablation studies on the E2E dataset to see the effects of each component. Finally, we analyze how text augmentation helps improve the model, include example outputs and show the human evaluation results in the end.

                        E2E - 10%                          E2E - 100%
Model                BLEU   NIST   METEOR  ROUGE-L     BLEU   NIST   METEOR  ROUGE-L
SLUG                   -      -       -       -        66.19  8.61   44.54   67.72
Seq2seq              53.38  6.10    38.10   60.53      63.32  6.81   41.25   62.91
Qader et al. (2019)  58.10  6.24    41.32   62.84      64.20  7.14   44.68   65.31
Chen et al. (2020)   59.10  7.49    40.25   63.23      63.72  7.76   40.25   66.23
Proposed (LSTM)      64.24  7.71    43.53   66.81      68.88  8.89   48.53   72.12

                       WebNLG - 10%                       WebNLG - 100%
Model                BLEU   NIST   METEOR  ROUGE-L     BLEU   NIST   METEOR  ROUGE-L
MELBOURNE              -      -       -       -        44.93  8.98   36.58   60.40
Seq2seq              36.54  7.30    35.00   54.61      44.60  8.49   38.23   59.67
Qader et al. (2019)  38.66  7.81    34.10   56.95      47.19  8.71   37.90   58.61
Chen et al. (2020)   39.40  7.84    37.25   56.23      46.15  8.52   39.10   58.50
Proposed (LSTM)      43.75  8.29    33.58   58.49      50.26  8.86   40.71   61.29

Table 1: Performance on E2E and WebNLG with 10% and 100% of the data. Qader et al. (2019) utilizes all ground-truth unpaired text samples, while our proposed model only has access to the few-shot data annotations.

Comparison with Other Models  In Table 1, we compare our model with (1) the seq2seq baseline, (2) the cycle consistency model as in Qader et al. (2019)⁶ and (3) the finetuned GPT-2 model as in Chen et al. (2020)⁷. For all models, we report runs with 10% and 100% of the annotations to see how they perform under different data sizes. Our model is implemented with LSTM encoder-decoders, the same as the seq2seq baseline, for a fair comparison. Note that Qader et al. (2019) further utilized all the ground-truth unpaired text samples, while the other models run only on the few-shot annotations. We also include the results of SLUG (Juraska et al., 2018) and MELBOURNE (Gardent et al., 2017), the overall winners on automatic metrics in the E2E and WebNLG challenges respectively (both seq2seq-based). SLUG uses a heuristic slot aligner based on a set of handcrafted rules and combines a complex pipeline of data augmentation, selection, model ensembling and reranking.

The results show that our proposed model significantly improves over the baseline in both the few-shot and the fully supervised setting. The improvement is more evident when only 10% of the annotations are available, with a leap of 11 and 7 BLEU points on E2E and WebNLG respectively. It also outperforms systems relying on task-dependent heuristics. In comparison, Qader et al. (2019), though with access to all text samples at all percentages, still underperforms our model by a tangible margin. In the fully supervised setting, it brings little to no difference compared with the seq2seq baseline, as no extra data is incorporated in the training process. Similarly, we observe that the text augmentation from the finetuned GPT-2 model helps the proposed model in the few-shot setting, but its advantage also vanishes when all data annotations are available.

In Figure 4, we plot the model performance with a varying number of data annotations. All models are trained from scratch with 10 different random initializations, and the standard deviation of the BLEU-4 score is visualized. Our model (LSTM-based), though with a relatively larger standard deviation due to the uncertainty of text augmentation sampling, still consistently outperforms the other baselines significantly and even surpasses the fully supervised seq2seq model with less than 10% of the data annotations.

Ablation Study on Cycle Consistency  In Table 2, we study how the four directions, input noise and parameter sharing affect the performance of cycle consistency. The experiments are conducted with 10% of the annotations and no further unpaired text samples available.

Model/Share     None    Enc     Dec     Both
Supervised      53.20    -       -       -
+ t → d′ → t    53.19   53.28   53.17   53.29
+ d → t′ → d    53.15   56.12   53.49   56.07
+ t → t         53.74   56.39   55.29   55.73
+ d → d         53.37   56.44   56.09   56.11
+ Noise         54.13   57.37   56.59   57.04

Table 2: Ablation study for cycle consistency (10% annotations). BLEU-4 scores are reported. Each line adds one condition on top of the previous one. Supervised is the supervised seq2seq baseline.

⁶ The authors did not open-source their code. We reproduced their model based on our implementation. The results with 10k annotations match the numbers reported in their paper.
⁷ https://github.com/czyssrs/Few-Shot-NLG


Figure 4: Model performance with a varying number of data annotations. Our model with 10% of the annotations outperforms the seq2seq model trained on 100% of the pairs (dotted line) on both datasets. Shades are the sample standard deviation based on 10 runs with different model initializations.

As can be observed, adding the training direction t → d′ → t (i.e. back translation) has little effect on top of the supervised seq2seq baseline. This is expected, since back translation is naturally designed to incorporate additional unpaired text samples; when run only on the few-shot annotations, its power is very limited. The backward direction d → t′ → d is surprisingly useful when the encoder is shared between data and text. Though this direction does not affect the text decoder at all, the improvement suggests that the model can benefit a lot from simply structuring its encoded space and mapping aligned data-text pairs to a similar vector space. The autoencoding directions bring a small improvement. When combined with input noise, the performance further increases. This is similar to previous findings that denoising autoencoding is more helpful in inducing a meaningful latent space (Lample et al., 2018) than simply learning to copy the original input.

The results also suggest that encoder sharing is important for the cycle consistency objective to work in our few-shot setting. Decoder sharing, in contrast, has little or even negative influence. This is somewhat similar to multilingual machine translation, where sharing the decoder among languages might negatively interfere with the performance (Johnson et al., 2017).

Ablation Study on Text Augmentation  On top of the four-direction cycle consistency training, we study the effects of text augmentation in Table 3. We compare our proposed info + LM augmentation with (1) random augmentation, where random text from Wikipedia is sampled into the augmented text space, (2) unsupervised data augmentation (UDA) (Xie et al., 2019), where text samples are augmented with paraphrases of the current annotations, and (3) ground-truth augmentation with references obtained from the left-out training corpus, which can serve as an upper bound for text augmentation techniques. We test the performance with 1%, 5%, 10% and 20% of the annotations to see the effects with a varying amount of supervision.

Text Augmentation   1%      5%      10%     20%
None                44.18   50.22   57.37   63.28
Random              41.30   49.62   57.71   62.79
UDA                 44.24   50.09   57.66   61.30
Info                45.63   52.22   58.80   63.22
 + LM               48.67   53.53   59.04   64.78
Reference           55.33   54.92   59.11   64.26
Random (+RM)        42.32   50.10   58.53   63.29
UDA (+RM)           44.32   52.22   58.80   61.27
Info (+RM)          48.63   56.66   60.80   63.52
 + LM (+RM)         53.18   59.12   64.24   65.38
Reference (+RM)     62.74   63.28   64.93   65.27

Table 3: Ablation study for text augmentation with a varying number of annotations. Experiments are performed on the E2E dataset. LM augmentation outperforms Random by a large margin, even outperforming augmentation with ground-truth references on some occasions. Representation matching (RM) further boosts the overall performance.

                         E2E                    WebNLG
Model              Fluency  Miss  Wrong   Fluency  Miss  Wrong
Seq2Seq             3.68     49    63      3.95     57    48
Cycle-only          4.08     46    66      4.23     48    44
Finetuned GPT-2     4.21     43    57      4.10     39    45
Proposed (LSTM)     4.33     39    44      4.27     31    39

Table 4: Human evaluation on the sampled outputs (100 instances) for models with 10% annotated data. Cycle-only indicates the approach in Qader et al. (2019); Finetuned GPT-2 refers to Chen et al. (2020).
Data: [name] Blue Spice [eat type] restaurant [food] Chinese [area] city centre [family friendly] no [near] Rainbow Vegetarian Café
Reference: at the [city centre], there is a [restaurant] called the [Blue Spice].
seq2seq/info-aug: Blue Spice restaurant near Rainbow Vegetarian Café has a 5 star rating. prices start at £30.
+LM-aug: located near Rainbow Vegetarian Café is a Chinese theme eatery and restaurant called Blue Spice. It is in the city centre area.

Figure 5: Generation examples with different text augmentation techniques, trained on 5 data annotations (see the toy training set). The attribute combination of the input data is unseen in the 5 annotations. Hallucinated contents are italicized.


Figure 6: Additional number of decoded unique tokens (non-copied) and unique combinations of information on the test set with a varying number of annotations, for no-aug, info-aug, uda-aug and lm-aug.

As can be seen, the random augmentation even harms the model performance, suggesting that reasonable in-domain text augmentation is necessary for the model to improve. UDA augmentation also makes rather little difference, as it simply paraphrases the currently available annotations but cannot bring in any new information. The information augmentation by slot-value replacement helps improve the results a bit. When combined with the LM, the performance can be further boosted, especially in lower-resource scenarios. The representation matching always helps lift the performance, with gains of up to 10 BLEU points. As expected, the benefit from text augmentation gradually vanishes as more annotations are collected, especially for datasets with relatively simple patterns such as E2E.

How text augmentation helps  Intuitively, the GPT-2 augmentation is expected to impose new tokens and combination patterns on the few-shot annotations. To investigate whether this is the case, we count, for the decoded text in the test phase, the number of unique tokens (excluding copied data values) and of unique information combination patterns (attribute combinations in E2E). The results in Figure 6 show that LM augmentation indeed greatly enriches the vocabulary space, even doubling the number of generated unique tokens in low-resource scenarios. The same happens for new combination patterns. In contrast, info-aug cannot insert new tokens or combinations at all, since all it does is replace data values within the same text annotations. UDA can impose new tokens by paraphrasing the annotations, but it hardly helps the model generalize to new combinations of information. Figure 5 shows generation examples with the different text augmentation techniques: we train the systems in a toy setting with only 5 data annotations (the training set is given in the Appendix) and pick an input with an attribute combination unseen in those 5 annotations to test whether the models generalize correctly. Seq2seq and info-aug produce wrong outputs that overfit to the information in the 5 training instances, whereas with LM augmentation the model adapts to the new combination and connects the information correctly.
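The two diversity statistics can be computed directly from the decoded outputs. A rough sketch follows, assuming copied values and realized attributes are detected by simple lowercase matching (the paper does not specify the exact matching procedure):

```python
def diversity_stats(outputs, data_inputs):
    """Count unique non-copied tokens and unique realized attribute
    combinations in the decoded test outputs (cf. Figure 6)."""
    unique_tokens, unique_combos = set(), set()
    for text, data in zip(outputs, data_inputs):
        lower = text.lower()
        copied = {tok for val in data.values() for tok in str(val).lower().split()}
        unique_tokens |= {tok for tok in lower.split() if tok not in copied}
        realized = frozenset(slot for slot, val in data.items()
                             if str(val).lower() in lower)
        unique_combos.add(realized)
    return len(unique_tokens), len(unique_combos)

# Toy usage.
outs = ["blue spice is a chinese restaurant in the city centre ."]
data = [{"name": "Blue Spice", "food": "Chinese", "area": "city centre"}]
print(diversity_stats(outs, data))
```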
Human Evaluation  We further run a human evaluation of the model outputs to closely check the generation quality. We compare four types of models: the seq2seq baseline, seq2seq plus cycle consistency as in Qader et al. (2019), finetuned GPT-2 as in Chen et al. (2020), and our proposed model. All models are LSTM-based apart from the finetuned GPT-2 one. We sample 100 data instances from the test set and apply all the models to generate the corresponding text. The data and generated text are evaluated by 50 crowdworkers on Prolific.⁸ For each data-text pair, the annotator is instructed to evaluate (1) whether the text is fluent (score 0-5, with 5 being fully fluent), (2) whether it misses information contained in the source data, and (3) whether it includes wrong information. The average fluency scores and the counts of missed and wrong information are presented in Table 4. The scores are generally consistent with the automatic evaluation results: our proposed model outperforms the other ones by a large margin, even though cycle-only can access all unpaired text and finetuned GPT-2 is significantly larger than our LSTM-based seq2seq. The generated texts are more fluent while maintaining information completeness and correctness to a large extent.

⁸ https://www.prolific.co/

7 Conclusion Christos Baziotis, Ion Androutsopoulos, Ioannis Kon- stas, and Alexandros Potamianos. 2019. Seqˆ3: Dif- ferentiable sequence-to-sequence-to-sequence au- We study few-shot data-to-text generation with toencoder for unsupervised abstractive sentence only limited annotated data. We propose text aug- compression. In Proceedings of the 2019 Confer- mentation with slot-value replacement followed by ence of the North American Chapter of the Associ- ation for Computational Linguistics: Human Lan- GPT-2 generation. The augmented text, when com- guage Technologies, Volume 1 (Long and Short Pa- bined with cycle consistency and representation pers), pages 673–681. matching, is shown to help the model to gener- alize to unseen new tokens and patterns of token Paweł Budzianowski and Ivan Vulic.´ 2019. Hello, it’s gpt-2-how can i help you? towards the use of pre- combinations. With less than 10% annotations, trained language models for task-oriented dialogue it outperforms supervised seq2seq model trained systems. In Proceedings of the 3rd Workshop on on 100% annotations and is extensible enough to Neural Generation and Translation, pages 15–22. be combined with pretraining techniques. For fu- ture works, we hope to apply the techniques to Ernie Chang, David Adelani, Xiaoyu Shen, and Vera Demberg. 2020a. Unsupervised pidgin text genera- annotation scenarios (Hong et al., 2019; de Souza tion by pivoting english data and self-training. In In et al., 2018; Zhuang and Chang, 2017; Chang et al., Proceedings of Workshop at ICLR. 2020a; Wiehr et al., 2020; Shen et al., 2020; Chang et al., 2020b; Su et al., 2020a; Chang et al., 2021a,b; Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen, and Vera Demberg. 2020b. Dart: A Crook et al., 2018). lightweight quality-suggestive data-to-text annota- tion tool. In COLING 2020, pages 12–17.

Acknowledgements Ernie Chang, Vera Demberg, and Alex Marin. 2021a. Jointly improving language understanding and gen- This research was funded in part by the German eration with quality-weighted weak supervision of Research Foundation (DFG) as part of SFB 248 automatic labeling. In EACL 2021. “Foundations of Perspicuous Software Systems”. Ernie Chang, Hui-Syuan Yeh, and Vera Demberg. We sincerely thank the anonymous reviewers for 2021b. Does the order of training samples matter? their insightful comments that helped us to improve improving neural data-to-text generation with cur- this paper. riculum learning. In EACL 2021. Zhiyu Chen, Harini Eavani, Yinyin Liu, and William Yang Wang. 2020. Few-shot nlg with pre- 8https://www.prolific.co/ trained language model. ACL. Eric Chu and Peter Liu. 2019. Meansum: a neural Eric Jang, Shixiang Gu, and Ben Poole. 2016. Categor- model for unsupervised multi-document abstractive ical reparameterization with gumbel-softmax. arXiv summarization. In International Conference on Ma- preprint arXiv:1611.01144. chine Learning, pages 1223–1232. Melvin Johnson, Mike Schuster, Quoc V Le, Maxim Emilie Colin, Claire Gardent, Yassine M’rabet, Shashi Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Narayan, and Laura Perez-Beltrachini. 2016. The Fernanda Viegas,´ Martin Wattenberg, Greg Corrado, WebNLG challenge: Generating text from DBPedia et al. 2017. Google’s multilingual neural machine data. In Proceedings of the 9th International Nat- translation system: Enabling zero-shot translation. ural Language Generation conference, pages 163– Transactions of the Association for Computational 167, Edinburgh, UK. Association for Computational Linguistics, 5:339–351. Linguistics. Juraj Juraska, Panagiotis Karagiannis, Kevin Bowden, Paul A Crook, Alex Marin, Vipul Agarwal, Saman- and Marilyn Walker. 2018. A deep ensemble model tha Anderson, Ohyoung Jang, Aliasgar Lanewala, with slot alignment for sequence-to-sequence natu- Karthik Tangirala, and Imed Zitouni. 2018. Con- ral language generation. In Proceedings of the 2018 versational semantic search: Looking beyond web Conference of the North American Chapter of the search, q&a and dialog systems. In Proceedings of Association for Computational Linguistics: Human the Eleventh ACM International Conference on Web Language Technologies, Volume 1 (Long Papers), Search and Data Mining, pages 763–766. pages 152–162. Anna Currey, Antonio Valerio Miceli-Barone, and Ken- neth Heafield. 2017. Copied monolingual data im- Ravi Kondadadi, Blake Howald, and Frank Schilder. proves low-resource neural machine translation. In 2013. A statistical nlg framework for aggregated Proceedings of the Second Conference on Machine planning and realization. In Proceedings of the Translation, pages 148–156. 51st Annual Meeting of the Association for Compu- tational Linguistics (Volume 1: Long Papers), vol- Markus Freitag and Scott Roy. 2018. Unsupervised ume 1, pages 1406–1415. natural language generation with denoising autoen- coders. In Proceedings of the 2018 Conference on Guillaume Lample, Alexis Conneau, Ludovic Denoyer, Empirical Methods in Natural Language Processing, and Marc’Aurelio Ranzato. 2017. Unsupervised ma- pages 3922–3929. chine translation using monolingual corpora only. arXiv preprint arXiv:1711.00043. Xavier Garcia, Pierre Foret, Thibault Sellam, and Ankur P Parikh. 2020. A multilingual view of Guillaume Lample, Myle Ott, Alexis Conneau, Lu- unsupervised machine translation. arXiv preprint dovic Denoyer, et al. 2018. 
Phrase-based & neu- arXiv:2002.02955. ral unsupervised machine translation. In Proceed- Claire Gardent, Anastasia Shimorina, Shashi Narayan, ings of the 2018 Conference on Empirical Methods and Laura Perez-Beltrachini. 2017. The webnlg in Natural Language Processing, pages 5039–5049. challenge: Generating text from rdf data. In Pro- ceedings of the 10th International Conference on Remi´ Lebret, David Grangier, and Michael Auli. 2016. Natural Language Generation, pages 124–133. Neural text generation from structured data with ap- plication to the biography domain. In Proceedings Dimitra Gkatzia. 2016. Content selection in data- of the 2016 Conference on Empirical Methods in to-text systems: A survey. arXiv preprint Natural Language Processing, pages 1203–1213. arXiv:1610.08375. Chris van der Lee, Emiel Krahmer, and Sander Di He, Yingce Xia, Tao Qin, Liwei Wang, Nenghai Yu, Wubben. 2018. Automated learning of templates for Tie-Yan Liu, and Wei-Ying Ma. 2016. Dual learn- data-to-text generation: comparing rule-based, sta- ing for machine translation. In Advances in neural tistical and neural methods. In Proceedings of the information processing systems, pages 820–828. 11th International Conference on Natural Language Generation, pages 35–45. Junxian He, Xinyi Wang, Graham Neubig, and Taylor Berg-Kirkpatrick. 2020. A probabilistic formulation Shuming Ma, Pengcheng Yang, Tianyu Liu, Peng Li, ICLR of unsupervised text style transfer. . Jie Zhou, and Xu Sun. 2019. Key fact as pivot: A Sepp Hochreiter and Jurgen¨ Schmidhuber. 1997. two-stage model for low resource table-to-text gen- Long short-term memory. Neural computation, eration. In Proceedings of the 57th Annual Meet- 9(8):1735–1780. ing of the Association for Computational Linguistics, pages 2047–2057. Xudong Hong, Ernie Chang, and Vera Demberg. 2019. Improving language generation from feature-rich Jekaterina Novikova, Ondrejˇ Dusek,ˇ and Verena Rieser. tree-structured data with relational graph convolu- 2017. The e2e dataset: New challenges for end-to- tional encoders. In Proceedings of the 2nd Work- end generation. In Proceedings of the 18th Annual shop on Multilingual Surface Realisation (MSR SIGdial Meeting on Discourse and Dialogue, pages 2019), pages 75–80. 201–206. Myle Ott, Sergey Edunov, Alexei Baevski, Angela Xiaoyu Shen, Ernie Chang, Hui Su, Jie Zhou, Fan, Sam Gross, Nathan Ng, David Grangier, and and Dietrich Klakow. 2020. Neural data-to- Michael Auli. 2019. fairseq: A fast, extensible text generation via jointly learning the seg- toolkit for sequence modeling. In Proceedings of mentation and correspondence. In ACL 2020. NAACL-HLT 2019: Demonstrations. https://www.aclweb.org/anthology/2020.acl- main.641.pdf. Tatsuro Oya, Yashar Mehdad, Giuseppe Carenini, and Raymond Ng. 2014. A template-based abstractive Jose´ GC de Souza, Michael Kozielski, Prashant Mathur, meeting summarization: Leveraging summary and Ernie Chang, Marco Guerini, Matteo Negri, Marco source text relationships. In Proceedings of the 8th Turchi, and Evgeny Matusov. 2018. Generating e- International Natural Language Generation Confer- commerce product titles and predicting their quality. ence (INLG), pages 45–53. In INLG, pages 233–243. Hui Su, Xiaoyu Shen, Zhou Xiao, Zheng Zhang, Ernie Yannis Papanikolaou and Andrea Pierleoni. 2020. Chang, Cheng Zhang, Cheng Niu, and Jie Zhou. Dare: Data augmented relation extraction with gpt-2. 2020a. Moviechats: Chat like humans in a closed arXiv preprint arXiv:2004.13845. domain. In EMNLP 2020, pages 6605–6619.

Baolin Peng, Chenguang Zhu, Chunyuan Li, Xi- Shang-Yu Su, Chao-Wei Huang, and Yun-Nung Chen. ujun Li, Jinchao Li, Michael Zeng, and Jian- 2020b. Towards unsupervised language understand- feng Gao. 2020. Few-shot natural language gen- ing and generation by joint dual learning. ACL. eration for task-oriented dialog. arXiv preprint arXiv:2002.12328. Sandeep Subramanian, Guillaume Lample, Eric Michael Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y-Lan Boureau. Shrimai Prabhumoye, Yulia Tsvetkov, Ruslan Salakhut- 2018. Multiple-attribute text style transfer. arXiv dinov, and Alan W Black. 2018. Style transfer preprint arXiv:1811.00552. through back-translation. In Proceedings of the 56th Annual Meeting of the Association for Compu- Fan-Keng Sun, Cheng-Hao Ho, and Hung-Yi Lee. tational Linguistics (Volume 1: Long Papers), pages 2020. {LAMAL}: {LA}nguage modeling is all you 866–876. need for lifelong language learning. In International Conference on Learning Representations. Raheel Qader, Franc¸ois Portet, and Cyril Labbe.´ 2019. Semi-supervised neural text generation by joint Frederik Wiehr, Anke Hirsch, Florian Daiber, Antonio learning of natural language generation and natural Kruger, Alisa Kovtunova, Stefan Borgwardt, Ernie language understanding models. In Proceedings of Chang, Vera Demberg, Marcel Steinmetz, and Hoff- the 12th International Conference on Natural Lan- mann Jorg. 2020. Safe handover in mixed-initiative guage Generation, pages 552–562. control for cyber-physical systems. In Proceedings of Workshop at CHI. Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Ronald J Williams. 1992. Simple statistical gradient- Dario Amodei, and Ilya Sutskever. 2019. Language following algorithms for connectionist reinforce- models are unsupervised multitask learners. OpenAI ment learning. , 8(3-4):229–256. Blog, 1(8). Sam Wiseman, Stuart Shieber, and Alexander Rush. Ehud Reiter and Robert Dale. 2000. Building natural 2017. Challenges in data-to-document generation. language generation systems. Cambridge university In Proceedings of the 2017 Conference on Empiri- press. cal Methods in Natural Language Processing, pages 2253–2263. Dana Ruiter, Cristina Espana-Bonet,˜ and Josef van Genabith. 2019. Self-supervised neural machine Thomas Wolf, Lysandre Debut, Victor Sanh, Julien translation. In Proceedings of the 57th Annual Meet- Chaumond, Clement Delangue, Anthony Moi, Pier- ing of the Association for Computational Linguistics, ric Cistac, Tim Rault, R’emi Louf, Morgan Funtow- pages 1828–1834, Florence, Italy. Association for icz, and Jamie Brew. 2019. Huggingface’s trans- Computational Linguistics. formers: State-of-the-art natural language process- ing. ArXiv, abs/1910.03771. Martin Schmitt and Hinrich Schutze.¨ 2019. Unsuper- Qizhe Xie, Zihang Dai, Eduard Hovy, Minh-Thang Lu- vised text generation from structured data. arXiv ong, and Quoc V Le. 2019. Unsupervised data aug- preprint arXiv:1904.09447. mentation for consistency training. arXiv preprint arXiv:1904.12848. Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Improving neural machine translation mod- Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A els with monolingual data. In Proceedings of the Efros. 2017. Unpaired image-to-image translation 54th Annual Meeting of the Association for Compu- using cycle-consistent adversarial networks. In Pro- tational Linguistics (Volume 1: Long Papers), vol- ceedings of the IEEE international conference on ume 1, pages 86–96. computer vision, pages 2223–2232. 
WenLi Zhuang and Ernie Chang. 2017. Neobility at semeval-2017 task 1: An attention-based sentence similarity model. In In Proceedings of SemEval- 2017 at ACL 2017.