Denoising based Sequence-to-Sequence Pre-training for Text Generation

Liang Wang1, Wei Zhao1, Ruoyu Jia1, Sujian Li2, Jingming Liu1
1Yuanfudao AI Lab, Beijing, China
2Key Laboratory of Computational Linguistics, Peking University, MOE, China
{wangliang01,zhaowei01,jiary,liujm}@fenbi.com
[email protected]

Abstract

This paper presents a new sequence-to-sequence (seq2seq) pre-training method, PoDA (Pre-training of Denoising Autoencoders), which learns representations suitable for text generation tasks. Unlike encoder-only (e.g., BERT) or decoder-only (e.g., OpenAI GPT) pre-training approaches, PoDA jointly pre-trains both the encoder and decoder by denoising noise-corrupted text, and it also has the advantage of keeping the network architecture unchanged in the subsequent fine-tuning stage. Meanwhile, we design a hybrid model of Transformer and pointer-generator networks as the backbone architecture for PoDA. We conduct experiments on two text generation tasks: abstractive summarization and grammatical error correction. Results on four datasets show that PoDA can improve model performance over strong baselines without using any task-specific techniques and significantly speed up convergence.[1]

1 Introduction

Methods based on unsupervised pre-training and supervised fine-tuning for NLP have achieved phenomenal successes in the last two years. Most of the proposed methods in the literature choose language modeling or one of its variants as the pre-training task. After the pre-training stage, ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017) directly use the learned representations as additional features for downstream tasks, while BERT (Devlin et al., 2018), ULMFiT (Howard and Ruder, 2018), XLM (Lample and Conneau, 2019), and OpenAI GPT (Radford et al., 2018, 2019) require fine-tuning both pre-trained parameters and task-specific parameters on labeled data. The state of the art has been significantly advanced for classification and sequence labeling tasks, such as natural language inference (Bowman et al., 2015), named-entity recognition, and SQuAD question answering (Rajpurkar et al., 2016).

However, little attention has been paid to pre-training for seq2seq text generation (Sutskever et al., 2014). A typical seq2seq network consists of a bidirectional encoder, a unidirectional decoder, and attention between the encoder and decoder. Previous work mainly focuses on encoder-only or decoder-only pre-training. For example, BERT pre-trains a bidirectional encoder, and OpenAI GPT pre-trains a language model which is essentially a unidirectional decoder. Ramachandran et al. (2016) propose to train two independent language models for the encoder and decoder respectively. All of the aforementioned methods are only able to partially pre-train the seq2seq networks, and therefore are unable to unleash the full potential of transfer learning for text generation.

In this paper, we present PoDA, a denoising based pre-training method that is able to jointly pre-train all components of seq2seq networks. Like denoising autoencoders, PoDA works by denoising noise-corrupted text sequences. Any noising function that fits in the seq2seq framework can be used. We experiment with three types of noise: randomly shuffling, deleting, or replacing the words in a given sequence. PoDA is simple, easy to implement, and applicable to virtually all seq2seq architectures, including ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017). Here, we adopt the hybrid architecture of Transformer and pointer-generator networks (See et al., 2017). The Transformer is effective at modeling long-distance dependencies, is highly parallelizable, and demonstrates good performance empirically.
Pointer-generator network incorpo- 1The code and pre-trained models are available at https: rates copying mechanism (Gu et al., 2016; Gul- //github.com/yuantiku/PoDA. cehre et al., 2016) which is helpful for most text


Figure 1: PoDA model architecture. The masked loss is calculated only for the blue underlined words. The decoder input starts with a special begin-of-sequence padding symbol. The example input-output pair is explained in Section 2.2.

The text corpora used for pre-training are the Billion Word Benchmark (Chelba et al., 2013) and English Wikipedia, both of which are publicly available and together consist of nearly 2.99 billion words. We conduct experiments on two abstractive summarization datasets (CNN/Daily Mail (See et al., 2017) and Gigaword (Rush et al., 2015)) and two grammatical error correction datasets (CoNLL-2014 (Ng et al., 2014) and JFLEG (Napoles et al., 2017)). With simple maximum likelihood training and no task-specific techniques, PoDA achieves superior or comparable performance against state-of-the-art systems and speeds up convergence on all four datasets.

2 Method

2.1 Model Architecture

First, we design a seq2seq model as the backbone architecture of our proposed pre-training method, which is a combination of Transformer and pointer-generator networks, as shown in Figure 1.

The input representations are the sum of word embeddings and sinusoidal positional encodings. Both the Transformer encoder and the decoder consist of 6 layers of transformer blocks, and each block is a multi-head self-attention layer followed by one layer of positionwise feedforward network. For the output layer, we use a pointer-generator layer to allow both copying from the input sequence and generation from a fixed vocabulary. The implementation is detailed in the Appendix.

As a side note, we want to point out that pointer-generator networks are also not the only solution for handling out-of-vocabulary (OOV) words; subword-based methods such as sentencepiece (Kudo and Richardson, 2018) can be used instead, at the cost of making the input and output sequences longer.

2.2 Noising and Denoising

Similar to denoising autoencoders, PoDA involves two parts: noising and denoising. The noising part corrupts a given word sequence x = {x_i}_{i=1}^{n} and gets a noisy word sequence x' = {x'_i}_{i=1}^{n'}. The denoising part tries to recover x given x' using a seq2seq model.

We use three noising functions: randomly shuffle, delete, or replace the words in x. The details are shown in Algorithm 1, where N(0, σ) is a Gaussian distribution with mean 0 and variance σ, B(p) is a Bernoulli distribution, and Beta(α, β) is a beta distribution serving as the prior for B(p). Take the function DELETE (lines 10 to 15 in Algorithm 1) as an example: it first samples an expectation p for the Bernoulli distribution from Beta(α, β), and then each word is deleted with probability p. The usage of the Beta(α, β) prior makes the model robust to different degrees of noise.

We exemplify the operations above in Figure 1. Starting from the original word sequence x = "The fox jumps over the lazy dog .", after three noising operations (delete "The", replace "jumps" with "fly", and swap "lazy" and "dog"), we get the noisy word sequence x' = "fox fly over the dog lazy .".

The denoising part maximizes the conditional probability p(x | x'), which can be factorized as:

    p(x | x') = ∏_{i=1}^{n} p(x_i | x', x_{<i})

where x_{<i} denotes the left context of x_i.
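To make this objective concrete, the following is a minimal PyTorch-style sketch of the token-level loss under standard teacher forcing. Restricting the loss to noised positions via loss_mask follows the "masked loss" mentioned in the caption of Figure 1; the exact masking rule is our assumption rather than a detail spelled out in the text.

    import torch
    import torch.nn.functional as F

    def denoising_loss(logits, target, loss_mask):
        # logits:    (batch, seq_len, vocab) decoder scores for p(x_i | x', x_{<i})
        # target:    (batch, seq_len) token ids of the clean sequence x
        # loss_mask: (batch, seq_len) float mask, 1.0 where the loss is counted
        #            (e.g. only positions affected by the noising step)
        nll = F.cross_entropy(
            logits.view(-1, logits.size(-1)),
            target.view(-1),
            reduction="none",
        ).view_as(target)
        return (nll * loss_mask).sum() / loss_mask.sum().clamp(min=1.0)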

Algorithm 1: The Noising Algorithm
Input: x is a sequence of words; α, β, σ are hyperparameters
 1: function NOISING(x)
 2:   Apply SHUFFLE, DELETE, REPLACE to x in random order
 3: end function
 4: function SHUFFLE(x)
 5:   for i ← 1 to len(x) do
 6:     indices[i] ← i + δ_i, where δ_i ∼ N(0, σ)
 7:   end for
 8:   Rearrange x based on argsort(indices)
 9: end function
10: function DELETE(x)
11:   Sample p ∼ Beta(α, β)
12:   for w in x do
13:     Delete w if µ ∼ B(p) is 1
14:   end for
15: end function
16: function REPLACE(x)
17:   Sample p ∼ Beta(α, β)
18:   for w in x do
19:     Replace w with w' sampled from the unigram distribution if µ ∼ B(p) is 1
20:   end for
21: end function

α and β are chosen so that the Beta distribution has mean 0.15 and standard deviation 0.03.

2.3 Pre-training Procedure

Corpus                    #words
English Wikipedia         2.22B
Billion Word Benchmark    0.76B
Total                     2.99B

Table 1: Text corpora used for pre-training.

For pre-training, we use two text corpora: the full dump of English Wikipedia and the Billion Word Benchmark, as shown in Table 1. For English Wikipedia, we remove paragraphs with fewer than 3 words or more than 30% OOV words, and each paragraph is split into text segments of no more than 128 words each. The Billion Word Benchmark is a sentence-level corpus. Sentences with more than 500 words are ignored during training.

The pre-training is performed on 4 GPUs using synchronous data parallelism; gradients are averaged across the different GPUs. Each batch on a single GPU consists of at most 3000 tokens. We pre-train the network for 5 million iterations, which is roughly 14 epochs over the entire dataset. The final perplexity on the validation set is about 6.8.
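For readers who prefer code, the noising procedure in Algorithm 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the released implementation: the value of sigma and the exact unigram sampling setup (vocab, weights) are assumptions, while alpha and beta follow the Beta(mean 0.15, std 0.03) choice above.

    import random

    def shuffle(words, sigma):
        # SHUFFLE: jitter each position with Gaussian noise and re-sort (local shuffling).
        keys = [i + random.gauss(0.0, sigma) for i in range(len(words))]
        return [w for _, w in sorted(zip(keys, words))]

    def delete(words, alpha, beta):
        # DELETE: drop each word with probability p, where p ~ Beta(alpha, beta).
        p = random.betavariate(alpha, beta)
        return [w for w in words if random.random() >= p]

    def replace(words, alpha, beta, vocab, weights):
        # REPLACE: with probability p, swap a word for one drawn from the unigram distribution.
        p = random.betavariate(alpha, beta)
        return [random.choices(vocab, weights=weights)[0] if random.random() < p else w
                for w in words]

    def add_noise(words, sigma, alpha, beta, vocab, weights):
        # NOISING: apply the three functions in random order, as in Algorithm 1.
        ops = [lambda x: shuffle(x, sigma),
               lambda x: delete(x, alpha, beta),
               lambda x: replace(x, alpha, beta, vocab, weights)]
        random.shuffle(ops)
        for op in ops:
            words = op(words)
        return words

Here vocab and weights stand for the unigram distribution estimated from the pre-training corpora.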

For fine-tuning, we use simple maximum likelihood training and do not directly optimize evaluation metrics such as ROUGE (Lin, 2004) or BLEU (Papineni et al., 2002), because doing so may overfit the evaluation metrics and barely show improvements in human evaluations (Wu et al., 2016).

3 Experiments

3.1 Setup

The network architecture used in our experiments has 97 million parameters. It consists of 6 layers of encoder blocks, 6 layers of decoder blocks, and 1 pointer-generator layer. The hidden size of each positionwise feedforward layer is 4096. We use 8 heads for all multi-head attention layers. The vocabulary consists of the top 50k most frequent words (case sensitive), and the dimension of the word embeddings is 512. We tie the parameters of the encoder word embeddings, the decoder word embeddings, and the output softmax layer. The NAG (Nesterov Accelerated Gradient) optimizer is used with an initial learning rate of 2 x 10^-3. Dropout of 0.2 is applied to all self-attention layers, positionwise feedforward layers, and input embedding layers. The gradient norm is clipped to a maximum value of 2. We follow the Transformer implementation from fairseq.[4]
[4] https://github.com/pytorch/fairseq

For task-specific fine-tuning, unless explicitly specified, we reuse the hyperparameters from the pre-training stage. After each training epoch, we compute the validation loss and halve the learning rate whenever the validation loss stops decreasing. The training procedure terminates if the learning rate drops below 10^-4. An exponential moving average (EMA) with decay rate 0.9995 is used to stabilize training. At inference time, we use standard beam search decoding based on the length-normalized log-likelihood. For ensemble models, we use different random seeds and pre-trained checkpoints for fine-tuning. Ensemble decoding is done by averaging the output probabilities from the different models at every decoding step.

When reporting experimental results, "PoDA w/o pre-training" refers to the proposed architecture in Section 2.1 trained only on the supervised data, and "PoDA w/o fine-tuning" only pre-trains on unlabeled data. PoDA first pre-trains a denoising autoencoder and then fine-tunes on the supervised data.
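As a small illustration of the fine-tuning schedule above (halve the learning rate when the validation loss stops decreasing, stop once it drops below 10^-4), here is a Python sketch; train_one_epoch and validate are hypothetical helpers standing in for the actual fairseq training loop.

    def fine_tune(model, train_one_epoch, validate, lr=2e-3, min_lr=1e-4):
        best_val = float("inf")
        while lr >= min_lr:
            train_one_epoch(model, lr)      # one pass over the labeled data
            val_loss = validate(model)      # validation loss after the epoch
            if val_loss < best_val:
                best_val = val_loss
            else:
                lr /= 2.0                   # validation loss stopped decreasing
        return model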
3.2 Abstractive Summarization

Datasets
We use two summarization datasets: CNN/Daily Mail[5] (See et al., 2017) and Gigaword (Rush et al., 2015). The official split for training, validation, and test is shown in Table 2.
[5] We use the non-anonymized version, which is considered to be more realistic.

Corpus            train        valid     test
CNN/Daily Mail    287,113      13,368    11,490
Gigaword          3,803,957    189,651   1,951

Table 2: Dataset statistics for abstractive summarization.

The CNN/Daily Mail dataset contains approximately 300k news articles with an average of 781 words per article, and each article is paired with summaries of 56 words on average. We use the preprocessing script[6] provided by See et al. (2017). The articles are truncated to 800 words for both training and testing. The summaries are truncated to 130 words for training.
[6] https://github.com/abisee/cnn-dailymail

Gigaword is a headline-generation dataset consisting of nearly 4 million examples. Headline generation can be seen as a sentence summarization task. Each example in Gigaword consists of one sentence with an average length of 31.3 words, which is much shorter than CNN/Daily Mail, and one short headline with an average length of 8.3 words. The Gigaword dataset provided by Rush et al. (2015) is already tokenized and lower-cased. Since our vocabulary is case-sensitive, this inconsistency is expected to hurt our system's performance.

Evaluation
We report evaluation results in terms of ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) using the pyrouge[7] package. For the CNN/Daily Mail dataset, PGNet (See et al., 2017), Lead3 (See et al., 2017), rnn-ext + RL (?), and NeuSum (Zhou et al., 2018) are used as baselines. For the Gigaword dataset, ABS+ (Rush et al., 2015), CGU (Lin et al., 2018), FTSum (Cao et al., 2018b), and Re3Sum (Cao et al., 2018a) are used as baselines.
[7] https://github.com/andersjo/pyrouge

Results for CNN/Daily Mail
Considering the characteristics of news articles, baselines such as Lead3 (which simply chooses the first 3 sentences) can achieve strong performance in terms of ROUGE scores, as shown in Table 3. "rnn-ext+RL" combines both extractive and abstractive methods and achieves performance improvements (?). PoDA is a purely abstractive summarization system and performs stably better than all these methods. "PoDA w/o pre-training" only has moderate success with ROUGE-1 40.82, ROUGE-2 18.46 and ROUGE-L 37.61. When combined with pre-training, PoDA establishes a new state of the art on the CNN/Daily Mail dataset.

System                   ROUGE-1   ROUGE-2   ROUGE-L
Lead3                    40.34     17.70     36.57
PGNet                    36.44     15.66     33.42
rnn-ext + RL             41.47     18.72     37.76
NeuSum                   41.59     19.01     37.98
PoDA w/o pre-training    40.82     18.46     37.61
PoDA                     41.87     19.27     38.54

Table 3: ROUGE scores for the CNN/Daily Mail dataset.

Results for Gigaword
The Gigaword dataset is much larger than CNN/Daily Mail, and this enables "PoDA w/o pre-training" to have competitive performance even without pre-training. As shown in Table 4, PoDA substantially improves ROUGE-1 from 37.24 to 38.29 (+1.05), ROUGE-2 from 18.28 to 19.06 (+0.78), and ROUGE-L from 34.53 to 35.45 (+0.92) with pre-training added. We can see that PoDA performs favorably over the state-of-the-art Re3Sum method, which utilizes unlabeled text corpora by first retrieving and then rewriting relevant snippets.

System                   ROUGE-1   ROUGE-2   ROUGE-L
ABS+                     29.76     11.88     26.96
CGU                      36.3      18.0      33.8
FTSum                    37.27     17.65     34.24
Re3Sum                   37.04     19.03     34.46
PoDA w/o pre-training    37.24     18.28     34.53
PoDA                     38.29     19.06     35.45

Table 4: ROUGE scores for the Gigaword dataset.

3.3 Grammatical Error Correction (GEC)

Datasets
GEC can also be seen as a text generation task, where the input sequence is a sentence possibly containing some grammatical errors, and the output is a clean and grammatical sentence. We experiment with PoDA on two GEC datasets: CoNLL-2014 (Ng et al., 2014) and JFLEG (Napoles et al., 2017). We use three public datasets for training: Lang-8 NAIST (Mizumoto et al., 2011), NUCLE (Dahlmeier et al., 2013) and CLC FCE (Felice et al., 2014). The test set of the CoNLL-2013 shared task is used as the validation set for the CoNLL-2014 task. JFLEG has its own validation set. For preprocessing, we use NLTK[8] to tokenize sentences, and remove all sentence pairs without any edits in Lang-8 NAIST. Simple spelling errors are corrected based on edit distance. The dataset statistics are shown in Table 5.
[8] https://www.nltk.org/

Corpus                  #Sent Pairs    Split
Lang-8 NAIST            1,097,274      train
NUCLE                   57,113         train
CLC FCE                 32,073         train
CoNLL-2013 test set     1,381          valid
JFLEG valid set         754            valid
CoNLL-2014 test set     1,312          test
JFLEG test set          747            test

Table 5: Dataset statistics for grammatical error correction. Due to the lack of a standard preprocessing script, the number of sentence pairs in the training set is slightly different from previous work.

Evaluation
To compare with previous work, we use the official evaluation metrics: MaxMatch (M2) F0.5 (Dahlmeier and Ng, 2012) for CoNLL-2014 and GLEU (Napoles et al., 2015) for the JFLEG dataset. Both metrics are shown to correlate well with human evaluation scores. MLConv (Chollampatt and Ng, 2018a), char-seq2seq (Xie et al., 2016), dual-boost (Ge et al., 2018a), Hybrid SMT-NMT (Grundkiewicz and Junczys-Dowmunt, 2018), NQE (Chollampatt and Ng, 2018b), and NRL (Sakaguchi et al., 2017) are used as baselines.
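The preprocessing above only states that simple spelling errors are corrected based on edit distance. One straightforward realization is sketched below; the vocabulary lookup and the distance threshold of 1 are our assumptions.

    def edit_distance(a, b):
        # Standard dynamic-programming Levenshtein distance.
        prev = list(range(len(b) + 1))
        for i, ca in enumerate(a, 1):
            cur = [i]
            for j, cb in enumerate(b, 1):
                cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
            prev = cur
        return prev[-1]

    def correct_token(token, vocab, max_dist=1):
        # Replace an out-of-vocabulary token with the closest in-vocabulary word,
        # but only if it lies within a small edit distance.
        if token in vocab or not vocab:
            return token
        best = min(vocab, key=lambda w: edit_distance(token, w))
        return best if edit_distance(token, best) <= max_dist else token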

Results
As shown in Table 6 and Table 7, when trained only on the supervised data, "PoDA w/o pre-training" can still achieve an impressive performance, with an F0.5 score of 54.01 on the CoNLL-2014 test set and a GLEU score of 56.52 on the JFLEG test set, surpassing previous state-of-the-art single-model results. This once again shows the effectiveness of the Transformer architecture. For the GEC task, most words in the output sequence also appear in the input sequence, and the pointer-generator makes it easier to learn such a prior. With denoising based pre-training, PoDA greatly boosts the F0.5 score from 54.01 to 59.40 (+5.39) on the CoNLL-2014 dataset, and the GLEU score from 56.52 to 59.02 (+2.50) on JFLEG. By ensembling 4 models initialized with different pre-trained checkpoints and trained with different random seeds, the performance can be further boosted on both datasets (+0.94 for CoNLL-2014 and +0.46 for JFLEG), outperforming the other ensemble models such as "Hybrid SMT-NMT".

System (single)          P        R        F0.5
char-seq2seq             49.24    23.77    40.56
MLConv                   60.90    23.74    46.38
dual-boost               62.70    27.69    50.04
PoDA w/o fine-tuning     19.22    31.73    20.86
PoDA w/o pre-training    65.63    31.62    54.01
PoDA                     70.10    36.88    59.40
Ensemble
MLConv (+rerank)         65.49    33.14    54.79
SMT-NMT (+rerank)        66.77    34.49    56.25
NQE                      -        -        56.52
PoDA                     71.01    37.68    60.34

Table 6: Precision (P), recall (R) and F0.5 scores for the CoNLL-2014 test set. We only list systems trained on public data. Ge et al. (2018b) reported better performance with an additional 4 million non-public sentence pairs.

System (single)          valid    test
MLConv                   47.71    51.34
NRL                      49.82    53.98
dual-boost               51.35    56.33
PoDA w/o fine-tuning     34.43    36.83
PoDA w/o pre-training    51.57    56.52
PoDA                     53.16    59.02
Ensemble
MLConv (+rerank)         52.48    57.47
SMT-NMT (+rerank)        -        61.50
PoDA                     53.29    59.48
Human                    -        62.38

Table 7: GLEU scores for the JFLEG dataset.

We also report the performance of "PoDA w/o fine-tuning", which does not conduct fine-tuning. The F0.5 score only reaches 20.86 on the CoNLL-2014 dataset and the GLEU score is 36.83 on JFLEG. These results are even worse than the weakest baselines in Table 6 and Table 7. The denoising based pre-training and the GEC task share some similarities in the sense that both of them attempt to convert noisy texts into clean and grammatical texts. However, the poor results of "PoDA w/o fine-tuning" show that PoDA cannot be seen as a simple data augmentation method for GEC. Instead, PoDA learns generic text representations and requires task-specific fine-tuning.

Techniques from previous work for GEC such as language model based reranking (Chollampatt and Ng, 2018a), data augmentation (Ge et al., 2018a), and domain adaptation (Junczys-Dowmunt et al., 2018) can be easily incorporated. A parallel work (Zhao et al., 2019) observes a similar gain by combining a simpler pre-training strategy with various GEC-specific techniques.
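The ensembles reported above average the output probabilities of the member models at every decoding step, as described in Section 3.1. A greedy-decoding sketch of that combination step is shown below; the model interface (decode_step) is schematic and not fairseq's actual API.

    import torch

    @torch.no_grad()
    def ensemble_next_token(models, encoder_outs, prev_tokens):
        # Average next-token distributions over the ensemble members.
        probs = [torch.softmax(m.decode_step(prev_tokens, enc), dim=-1)
                 for m, enc in zip(models, encoder_outs)]
        avg = torch.stack(probs).mean(dim=0)     # (batch, vocab) averaged probabilities
        return avg.argmax(dim=-1)                # greedy pick; beam search scores candidates the same way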

4 Analysis

In the following analysis, we only choose one task (summarization or GEC) to analyze each aspect due to space limitations. Similar conclusions also hold for the other task.

4.1 Linguistic Quality Analysis

In Table 8, we show some summaries generated by PoDA on the Gigaword dataset. In the first example, PoDA successfully deletes the relatively unimportant modifier "ivory coast striker", and keeps the picture complete by including both "bremen" and "saint-etienne" in the summary. In the second example, "PoDA w/o pre-training" misses an important date ("first quarter") of the event "economic crisis". Both examples show our model is able to identify the important text snippets in the source sequence and organize them into a short and fluent summary. More example outputs by PoDA are listed in the Appendix.

4.2 Effects of the Number of Pre-training Iterations

Figure 2: F0.5 score on the CoNLL-2014 test set with respect to the number of pre-training iterations.

Source:                 ivory coast striker boubacar sanogo is set to leave werder bremen for french first division side saint-etienne.
Target:                 sanogo set to sign for saint-etienne
PoDA w/o pre-training:  ivory coast striker sanogo set to join saint-etienne
PoDA:                   sanogo set to leave bremen for saint-etienne

Source:                 thailand's battered economy should start to bottom out in the first quarter of #### provided the government's efforts are n't neutralized by outside developments, a news report said monday.
Target:                 economic crisis to bottom out early next year minister says
PoDA w/o pre-training:  thai economy expected to start to bottom out in UNK
PoDA:                   thai economy to start to bottom out in first quarter

Table 8: Examples of generated summaries from Gigaword dataset.

In Figure 2, we show the F0.5 score on the CoNLL-2014 dataset when the model is initialized with different pre-trained checkpoints. Though the F0.5 score has some fluctuations, due to random factors in training neural networks and the limited size of the test set, the overall trend is very clear: the F0.5 score first improves greatly and then keeps improving slowly after about 1 million iterations, from 54 at the very beginning to 59 after convergence.

4.3 Effects of Pre-trained Encoder and Decoder

To show the effectiveness of the pre-trained encoder and decoder, we train the model using only the encoder-side pre-trained parameters ("w/o pre-trained decoder") or only the decoder-side pre-trained parameters ("w/o pre-trained encoder"). We do not compare with a pre-trained encoder from BERT or a pre-trained decoder from OpenAI GPT, mainly because the corresponding model capacity, tokenization and text corpora used for pre-training are very different.

System                     P        R        F0.5
Fully pre-trained          70.10    36.88    59.40
w/o pre-trained encoder    66.14    34.67    55.98
w/o pre-trained decoder    66.62    36.10    56.98
w/o pre-training           65.63    31.62    54.01

Table 9: Ablations for the pre-trained encoder and decoder on the CoNLL-2014 test set.

Table 9 shows that the performance degrades by a large margin if the network is only partially pre-trained. The pre-trained encoder (-3.42 drop in F0.5) is more important than the pre-trained decoder (-2.42 drop in F0.5).

4.4 Effects of Dataset Size

We also conduct ablations in few-shot learning settings to see how the performance changes when the model only has access to a small percentage of labeled data. We randomly sample 10^3 to 10^5 training examples from the Gigaword dataset and train "PoDA w/o pre-training" and PoDA (with pre-training) using exactly the same hyperparameters.

Figure 3: ROUGE-1 on the Gigaword test set with respect to the number of training examples. ABS+ is a baseline method from Rush et al. (2015) using attention. The x-axis is in log scale.

Figure 3 shows the ROUGE-1 score on the Gigaword test set. With only 10^3 training examples, PoDA reaches a reasonably good performance comparable to ABS+ (an attention-based system trained on nearly 4 million examples). With more labeled data available, the performance gap between "PoDA w/o pre-training" and PoDA slowly decreases from 15 to 2 in Figure 3. However, pre-training still helps even when the models are trained on the full dataset (shown in Table 4).


Figure 4: Validation perplexity with respect to the training epochs on CNN/Daily Mail and Gigaword datasets. Perplexity is related to the vocabulary size, so the values here are not comparable with previous work.

4.5 Convergence Analysis

Pre-training not only achieves better final performance but also helps the model converge faster. In Figure 4, we show the validation perplexity after each training epoch for both "PoDA w/o pre-training" and PoDA.

We can clearly see that the validation perplexity of PoDA is consistently lower than that of "PoDA w/o pre-training", especially in the first few epochs. After 5 epochs, PoDA can arrive at a validation perplexity that "PoDA w/o pre-training" usually takes 30 or more epochs to reach, on both the CNN/Daily Mail and Gigaword datasets. This nice property is particularly helpful when computational resources are limited. Other pre-training methods such as BERT also demonstrate similar behaviors.

5 Related Work

Network Pre-training The idea of pre-training neural networks dates back to the early days of deep learning. Bengio et al. (2007) proposed layer-wise pre-training for deep belief networks (DBN) to tackle the difficulty of training deep neural networks based on a reconstruction objective. Erhan et al. (2010) and Dahl et al. (2012) showed the effectiveness of pre-training for tasks such as speech recognition. In the area of computer vision, using ImageNet pre-trained models has become a standard practice. In the NLP community, using pre-trained word embeddings is the most popular way to transfer knowledge from unlabeled corpora. There is also work on semi-supervised sequence learning (Dai and Le, 2015; Peters et al., 2017) attempting to incorporate language modeling as an auxiliary task. Recently, several pre-training methods based on language models have been presented, such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and XLM (Lample and Conneau, 2019). The combination of more compute, larger model capacity and large-scale text corpora leads to significant improvements on NLP benchmarks (Wang et al., 2018).

Autoencoders have long been used for representation learning of images (Vincent et al., 2010) and text (Li et al., 2015). However, precisely reconstructing the clean input is probably too easy for high-capacity models. Sparse autoencoders (Deng et al., 2013), contractive autoencoders (Rifai et al., 2011), and denoising autoencoders (Vincent et al., 2010) are several popular variants. Denoising autoencoders (DA) are shown to be able to learn better representations for downstream tasks (Vincent et al., 2010, 2008; Hill et al., 2016). Freitag and Roy (2018) use seq2seq DAs for unsupervised natural language generation in dialogue, and Kim et al. (2018) propose to improve the quality of machine translation with DAs.

Text Generation covers a wide spectrum of NLP tasks, including machine translation (Wu et al., 2016), summarization (See et al., 2017), response generation (Vinyals and Le, 2015), paraphrase generation, grammatical error correction, etc. Early studies on text generation mainly adopt template-based (Reiter and Dale, 2000) or example-based (Watanabe and Takeda, 1998) methods. With the emergence of deep learning for NLP, seq2seq models (Sutskever et al., 2014) became a popular choice for text generation tasks and show better performance in terms of both automatic evaluation metrics and human evaluations (Wu et al., 2016). There are also studies focusing on text generation from structured data, such as SQL-to-text (Xu et al., 2018). Previous pre-training for text generation is usually done by independently pre-training encoder-side or decoder-side language models (Ramachandran et al., 2016). Concurrent to our work, Edunov et al. (2019) augment encoder representations with ELMo-style models, MASS (Song et al., 2019) masks continuous text fragments for pre-training, and UNILM (Dong et al., 2019) proposes to pre-train for both language understanding and generation tasks.

6 Conclusion

This paper presents a new transfer learning approach for seq2seq text generation named PoDA. It involves two steps: first, pre-train a customized seq2seq denoising autoencoder on large-scale unlabeled text corpora; then, fine-tune on in-domain labeled data. The pre-training step is independent of downstream tasks and jointly learns both encoder and decoder representations. PoDA is simple, intuitive and doesn't require changing the network architecture during the fine-tuning stage. Experiments on several abstractive summarization and grammatical error correction datasets demonstrate that PoDA leads to better performance and faster convergence.

For future work, we would like to validate our model on other tasks such as response generation, explore more effective unsupervised sequence-to-sequence pre-training methods, and handle cross-lingual tasks such as machine translation.

Acknowledgments

We want to thank the three anonymous reviewers for their valuable comments, and the EMNLP-IJCNLP 2019 organizers for their efforts.

References

Yoshua Bengio, Pascal Lamblin, Dan Popovici, and Hugo Larochelle. 2007. Greedy layer-wise training of deep networks. In Advances in Neural Information Processing Systems, pages 153-160.
Samuel R Bowman, Gabor Angeli, Christopher Potts, and Christopher D Manning. 2015. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632-642.
Ziqiang Cao, Wenjie Li, Sujian Li, and Furu Wei. 2018a. Retrieve, rerank and rewrite: Soft template based neural summarization. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 152-161.
Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018b. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
Ciprian Chelba, Tomas Mikolov, Mike Schuster, Qi Ge, Thorsten Brants, Phillipp Koehn, and Tony Robinson. 2013. One billion word benchmark for measuring progress in statistical language modeling. arXiv preprint arXiv:1312.3005.
Shamil Chollampatt and Hwee Tou Ng. 2018a. A multilayer convolutional encoder-decoder neural network for grammatical error correction. arXiv preprint arXiv:1801.08831.
Shamil Chollampatt and Hwee Tou Ng. 2018b. Neural quality estimation of grammatical error correction. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2528-2539.
George E Dahl, Dong Yu, Li Deng, and Alex Acero. 2012. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30-42.
Daniel Dahlmeier and Hwee Tou Ng. 2012. Better evaluation for grammatical error correction. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 568-572. Association for Computational Linguistics.
Daniel Dahlmeier, Hwee Tou Ng, and Siew Mei Wu. 2013. Building a large annotated corpus of learner English: The NUS Corpus of Learner English. In Proceedings of the Eighth Workshop on Innovative Use of NLP for Building Educational Applications, pages 22-31.
Andrew M Dai and Quoc V Le. 2015. Semi-supervised sequence learning. In Advances in Neural Information Processing Systems, pages 3079-3087.
Jun Deng, Zixing Zhang, Erik Marchi, and Bjorn Schuller. 2013. Sparse autoencoder-based feature transfer learning for speech emotion recognition. In 2013 Humaine Association Conference on Affective Computing and Intelligent Interaction, pages 511-516. IEEE.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
Sergey Edunov, Alexei Baevski, and Michael Auli. 2019. Pre-trained language model representations for language generation. arXiv preprint arXiv:1903.09722.
Dumitru Erhan, Yoshua Bengio, Aaron Courville, Pierre-Antoine Manzagol, Pascal Vincent, and Samy Bengio. 2010. Why does unsupervised pre-training help deep learning? Journal of Machine Learning Research, 11(Feb):625-660.
Mariano Felice, Zheng Yuan, Øistein E Andersen, Helen Yannakoudakis, and Ekaterina Kochmar. 2014. Grammatical error correction using hybrid systems and type filtering. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 15-24.
Markus Freitag and Scott Roy. 2018. Unsupervised natural language generation with denoising autoencoders. arXiv preprint arXiv:1804.07899.
Tao Ge, Furu Wei, and Ming Zhou. 2018a. Fluency boost learning and inference for neural grammatical error correction. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1055-1065.
Tao Ge, Furu Wei, and Ming Zhou. 2018b. Reaching human-level performance in automatic grammatical error correction: An empirical study. arXiv preprint arXiv:1807.01270.
Jonas Gehring, Michael Auli, David Grangier, Denis Yarats, and Yann N Dauphin. 2017. Convolutional sequence to sequence learning. arXiv preprint arXiv:1705.03122.
Roman Grundkiewicz and Marcin Junczys-Dowmunt. 2018. Near human-level performance in grammatical error correction with hybrid machine translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pages 284-290.
Jiatao Gu, Zhengdong Lu, Hang Li, and Victor OK Li. 2016. Incorporating copying mechanism in sequence-to-sequence learning. arXiv preprint arXiv:1603.06393.
Caglar Gulcehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 140-149.
Felix Hill, Kyunghyun Cho, and Anna Korhonen. 2016. Learning distributed representations of sentences from unlabelled data. In Proceedings of NAACL-HLT, pages 1367-1377.
Jeremy Howard and Sebastian Ruder. 2018. Universal language model fine-tuning for text classification. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 328-339.
Marcin Junczys-Dowmunt, Roman Grundkiewicz, Shubha Guha, and Kenneth Heafield. 2018. Approaching neural grammatical error correction as a low-resource machine translation task. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 595-606.
Yunsu Kim, Jiahui Geng, and Hermann Ney. 2018. Improving unsupervised word-by-word translation with language model and denoising autoencoder. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 862-868.
Taku Kudo and John Richardson. 2018. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66-71.
Guillaume Lample and Alexis Conneau. 2019. Cross-lingual language model pretraining. arXiv preprint arXiv:1901.07291.
Jiwei Li, Thang Luong, and Dan Jurafsky. 2015. A hierarchical neural autoencoder for paragraphs and documents. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1106-1115.
Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
Junyang Lin, Xu Sun, Shuming Ma, and Qi Su. 2018. Global encoding for abstractive summarization. arXiv preprint arXiv:1805.03989.
Bryan McCann, James Bradbury, Caiming Xiong, and Richard Socher. 2017. Learned in translation: Contextualized word vectors. In Advances in Neural Information Processing Systems, pages 6294-6305.

Tomoya Mizumoto, Mamoru Komachi, Masaaki Nagata, and Yuji Matsumoto. 2011. Mining revision log of language learning SNS for automated Japanese error correction of second language learners. In Proceedings of 5th International Joint Conference on Natural Language Processing, pages 147-155.
Courtney Napoles, Keisuke Sakaguchi, Matt Post, and Joel Tetreault. 2015. Ground truth for grammatical error correction metrics. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 588-593.
Courtney Napoles, Keisuke Sakaguchi, and Joel Tetreault. 2017. JFLEG: A fluency corpus and benchmark for grammatical error correction. In Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 2, Short Papers, pages 229-234.
Hwee Tou Ng, Siew Mei Wu, Ted Briscoe, Christian Hadiwinoto, Raymond Hendy Susanto, and Christopher Bryant. 2014. The CoNLL-2014 shared task on grammatical error correction. In Proceedings of the Eighteenth Conference on Computational Natural Language Learning: Shared Task, pages 1-14.
Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pages 311-318. Association for Computational Linguistics.
Matthew Peters, Waleed Ammar, Chandra Bhagavatula, and Russell Power. 2017. Semi-supervised sequence tagging with bidirectional language models. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1756-1765.
Matthew Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 2227-2237.
Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training. OpenAI technical report.
Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. 2019. Language models are unsupervised multitask learners. OpenAI Blog, 1(8).
Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383-2392.
Prajit Ramachandran, Peter J Liu, and Quoc V Le. 2016. Unsupervised pretraining for sequence to sequence learning. arXiv preprint arXiv:1611.02683.
Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press.
Salah Rifai, Pascal Vincent, Xavier Muller, Xavier Glorot, and Yoshua Bengio. 2011. Contractive auto-encoders: Explicit invariance during feature extraction. In Proceedings of the 28th International Conference on Machine Learning, pages 833-840. Omnipress.
Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 379-389.
Keisuke Sakaguchi, Matt Post, and Benjamin Van Durme. 2017. Grammatical error correction with neural reinforcement learning. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 366-372.
Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
Kaitao Song, Xu Tan, Tao Qin, Jianfeng Lu, and Tie-Yan Liu. 2019. MASS: Masked sequence to sequence pre-training for language generation. In International Conference on Machine Learning, pages 5926-5936.
Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems, pages 3104-3112.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998-6008.
Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. 2008. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pages 1096-1103. ACM.
Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio, and Pierre-Antoine Manzagol. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research, 11(Dec):3371-3408.
Oriol Vinyals and Quoc Le. 2015. A neural conversational model. arXiv preprint arXiv:1506.05869.
Alex Wang, Amapreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.
Hideo Watanabe and Koichi Takeda. 1998. A pattern-based machine translation system extended by example-based processing. In Proceedings of the 17th International Conference on Computational Linguistics - Volume 2, pages 1369-1373. Association for Computational Linguistics.
Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, et al. 2016. Google's neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
Ziang Xie, Anand Avati, Naveen Arivazhagan, Dan Jurafsky, and Andrew Y Ng. 2016. Neural language correction with character-based attention. arXiv preprint arXiv:1603.09727.
Kun Xu, Lingfei Wu, Zhiguo Wang, Yansong Feng, and Vadim Sheinin. 2018. SQL-to-text generation with graph-to-sequence model. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 931-936.

Wei Zhao, Liang Wang, Kewei Shen, Ruoyu Jia, and Jingming Liu. 2019. Improving grammatical error correction via pre-training a copy-augmented architecture with unlabeled data. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 156-165.

Qingyu Zhou, Nan Yang, Furu Wei, Shaohan Huang, Ming Zhou, and Tiejun Zhao. 2018. Neural document summarization by jointly learning to score and select sentences. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 654-663.

Yukun Zhu, Ryan Kiros, Rich Zemel, Ruslan Salakhutdinov, Raquel Urtasun, Antonio Torralba, and Sanja Fidler. 2015. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19-27.

A Supplemental Material

A.1 Implementation of Pointer-Generator Layer

The pointer-generator layer calculates a probability distribution over the union of a fixed vocabulary V and the words in the input sequence {x_i}_{i=1}^{n}. For a word w, the probability p(w) is calculated as follows:

    α = MultiHeadAttention(h_dec^t, h_enc)
    p_gen = σ(W_1 AttPooling(h_enc, α))
    p(w) = p_gen p_v(w) + (1 − p_gen) Σ_{i: x'_i = w} α_i        (3)

h_enc is the top-layer output of the Transformer encoder, and h_dec^t is the output of the Transformer decoder at timestep t. p_v(w) is the standard softmax probability for word w over the vocabulary (p_v(w) = 0 if w is an OOV word). α is the copy attention over the input sequence, AttPooling is the attentive pooling of the encoder outputs h_enc with distribution α, and p_gen denotes the probability of generating from the fixed vocabulary. p(w) is a linear interpolation of the generation and copying probabilities.
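A compact PyTorch-style sketch of the interpolation in Equation (3) is given below. Computing the copy attention α and p_gen (multi-head attention plus attentive pooling) is left out, and the way OOV input words are indexed in an extended vocabulary is our assumption, since the appendix does not spell it out.

    import torch

    def pointer_generator(p_vocab, copy_attn, p_gen, src_tokens, vocab_size):
        # p_vocab:    (batch, vocab_size) softmax over the fixed vocabulary V
        # copy_attn:  (batch, src_len)    copy attention alpha over the input sequence x'
        # p_gen:      (batch, 1)          probability of generating from V
        # src_tokens: (batch, src_len)    input word ids; OOV words are assumed to be
        #                                 mapped to extra ids >= vocab_size
        extended_size = max(int(src_tokens.max().item()) + 1, vocab_size)
        p_w = p_vocab.new_zeros(p_vocab.size(0), extended_size)
        p_w[:, :vocab_size] = p_gen * p_vocab
        # add (1 - p_gen) * alpha_i onto the vocabulary entry of each source word
        p_w.scatter_add_(1, src_tokens, (1.0 - p_gen) * copy_attn)
        return p_w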

Source:                 I do think there is difference in it and I believe many of us will agree.
Target:                 I do think there is a difference and I believe many of us will agree.
PoDA w/o pre-training:  I do think there is difference in it and I believe many of us will agree.
PoDA:                   I do think there is a difference in it and I believe many of us will agree.

Source:                 Almost all students and young adults possess the Facebook or Twitter account.
Target:                 Almost all students and young adults possess a Facebook or Twitter account.
PoDA w/o pre-training:  Almost all students and young adults possess Facebook or Twitter accounts.
PoDA:                   Almost all students and young adults possess a Facebook or Twitter account.

Table 10: Examples of corrected sentences from CoNLL-2014 dataset. The important text snippets are highlighted with bold font.

Source:                 But, on the contrary, he argues that fluoride also some disadvantage.
Target:                 But, on the contrary, he argues that fluoride also has some disadvantages.
PoDA w/o pre-training:  But, on the contrary, he argues that there are also some disadvantages.
PoDA:                   But, on the contrary, he argues that fluoride also has some disadvantages.

Source:                 Such people impressed other people through their strong well and divoution to duty.
Target:                 Such people impressed others through their strong will and devotion to duty.
PoDA w/o pre-training:  Such people impressed other people through their strong and divoution to duty.
PoDA:                   Such people impressed other people through their strong will and devotion to duty.

Table 11: Examples of corrected sentences from JFLEG dataset. The important text snippets are highlighted with bold font.

Input:           the best university in the world.
Output samples:  Harvard is considered the 10th best university in the world.
                 I believe that's the best university in the world.
                 Nevada offers the best continuing education in the whole world.
                 Greek is becoming the best university in the world.

Input:           The meaning of life is.
Output samples:  The meaning of daily life is unclear.
                 "The real meaning of daily life is peaceful."
                 The underlying meaning of your life is lost forever.
                 The immediate meaning of our political life is undeniable.

Table 12: Some input-output samples by our pre-trained PoDA models. We input some incomplete sentences and output the generated sentences by sampling the output distribution of pre-trained PoDA models at each timestep. We can see that PoDA successfully transforms the inputs to coherent and grammatical sentences, though the statements entailed by these output sentences are not always correct.
