Improving Maximum Likelihood Training for Text Generation with Density Ratio Estimation
Yuxuan Song (Shanghai Jiao Tong University), Ning Miao (Bytedance AI lab), Hao Zhou (Bytedance AI lab), Lantao Yu (Stanford University), Mingxuan Wang (Bytedance AI lab), Lei Li (Bytedance AI lab)

Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS) 2020, Palermo, Italy. PMLR: Volume 108. Copyright 2020 by the author(s).

The work was done during the first author's internship at Bytedance AI lab.

Abstract

Autoregressive sequence generative models trained by Maximum Likelihood Estimation suffer from the exposure bias problem in practical finite sample scenarios. The crux is that the number of training samples for Maximum Likelihood Estimation is usually limited and the input data distributions are different at training and inference stages. Many methods have been proposed to solve the above problem (Yu et al., 2017; Lu et al., 2018), but they rely on sampling from the non-stationary model distribution and suffer from high variance or biased estimations. In this paper, we propose ψ-MLE, a new training scheme for autoregressive sequence generative models, which is effective and stable when operating in the large sample space encountered in text generation. We derive our algorithm from a new perspective of self-augmentation and introduce bias correction with density ratio estimation. Extensive experimental results on synthetic data and real-world text generation tasks demonstrate that our method stably outperforms Maximum Likelihood Estimation and other state-of-the-art sequence generative models in terms of both quality and diversity.

1 Introduction

Deep generative models are dedicated to learning a target distribution and have shown great promise in numerous scenarios, such as image generation (Arjovsky et al., 2017; Goodfellow et al., 2014), density estimation (Ho et al., 2019; Salimans et al., 2017; Kingma and Welling, 2013; Townsend et al., 2019), stylization (Ulyanov et al., 2016), and text generation (Yu et al., 2017; Li et al., 2016). Learning generative models for text data is an important task which has significant impact on several real-world applications, e.g., machine translation, literary creation and article summarization. However, text generation remains a challenging task due to the discrete nature of the data and the huge sample space, which increases exponentially with the sentence length.

Text generation is nontrivial because of its huge sample space. For generating sentences of various lengths, current text generation models are mainly based on density factorization instead of directly modeling the joint distribution, which has led to the prosperity of neural autoregressive models in language modeling. As neural autoregressive models have an explicit likelihood function, it is straightforward to employ Maximum Likelihood Estimation (MLE) for training. Although MLE is asymptotically consistent, in practical finite sample scenarios it is prone to overfitting on the training set. Additionally, during the inference (generation) stage, the error at each time step accumulates along the sentence generation process, which is also known as the exposure bias problem (Ranzato et al., 2015).

Many efforts have been devoted to addressing the above limitations of MLE. Researchers have proposed several non-MLE methods based on minimizing different discrepancy measures, e.g., Sequential GANs (Yu et al., 2017; Che et al., 2017; Kusner and Hernández-Lobato, 2016) and CoT (Lu et al., 2018). However, non-MLE methods typically rely on sampling from the generative distribution to estimate gradients, which results in high variance and instability during training, as the generative distribution is non-stationary during the training process. A recent study (Caccia et al., 2018) empirically shows that non-MLE methods potentially suffer from the mode collapse problem and cannot actually outperform MLE in terms of the quality-diversity tradeoff.

In this paper, we seek to leverage the ability of generative models themselves to provide an unlimited amount of samples to augment the training dataset. This has the potential of alleviating the overfitting problem due to limited samples, as well as addressing the exposure bias problem by providing the model with prefixes (input partial sequences) sampled from its own distribution. To correct the bias incurred by sampling from the model distribution, we propose to learn a progressive density ratio estimator based on Bregman divergence minimization. The above procedures together form a novel training scheme for sequence generative models, termed ψ-MLE.

Another essential difference between MLE and ψ-MLE lies in the fact that, in MLE, the likelihoods of samples not in the training set are equally penalized through normalization, whether they are near or far from the true distribution. In contrast, ψ-MLE takes the difference in the quality of unseen samples into account through the importance weight assigned by the density ratio estimator, which can be expected to yield further improvement. Empirically, MLE with mixture training data gives the same performance as vanilla MLE training with only training data, whereas our proposed ψ-MLE consistently outperforms vanilla MLE training. Additionally, we empirically demonstrate the superiority of our algorithm over many strong baselines like GAN in terms of generative performance (in the quality-diversity space) with both synthetic and real-world datasets.

2 Preliminary

2.1 Notations

We denote the target data distribution as $p_{\text{data}}$, and the empirical data distribution as $\hat{p}_{\text{data}}$. The parameters of the generative model $G$ are represented by $\theta$ and the parameters of a density ratio estimator $r$ are represented by $\psi$. $p_\theta$ denotes the distribution implied by the tractable density generative model $G$. The objective is to fit the underlying data distribution $p_{\text{data}}$ with a parameterized model distribution $p_\theta$ using empirical samples from $p_{\text{data}}$. We use $s$ to stand for a sample sequence from the datasets or from the generator's output, and $s_l$ stands for the $l$-th token of $s$, where $s_0 = \emptyset$.

2.2 MLE vs Sequential GANs

It should be noticed that both MLE and GANs for sequence generation suffer from their corresponding issues. In this section, we delve deeply into the specific properties of MLE and GANs, and explore how these properties affect their performance in modeling sequential data.

MLE

The objective of Maximum Likelihood Estimation (MLE) is:

$$L_{\text{MLE}}(\theta) = \mathbb{E}_{s \sim p_{\text{data}}}\left[\log p_\theta(s)\right] \tag{1}$$

where $p_\theta(s)$ is the learned probability of sequence $s$ in the generative model. Maximizing the objective is equivalent to minimizing the Kullback-Leibler (KL) divergence:

$$D_{\text{KL}}(p_{\text{data}} \,\|\, p_\theta) = \mathbb{E}_{s \sim p_{\text{data}}}\left[\log \frac{p_{\text{data}}(s)}{p_\theta(s)}\right] \tag{2}$$

Though MLE has lots of attractive properties, it has two critical issues:

1) MLE is prone to overfitting on small training sets. When training an autoregressive sequence generative model with MLE on a training set consisting of sentences of length $L$, the standard objective can be derived as follows:

$$L_{\widehat{\text{MLE}}}(\theta) = \mathbb{E}_{s \sim \hat{p}_{\text{data}}}\left[\sum_{l=1}^{L} \log p_\theta(s_l \mid s_{1:l-1})\right] \tag{3}$$

The forced exposure to the ground-truth data shown in Eq. 3 is known as "teacher forcing", which causes the problem of overfitting. What makes things worse is the exposure bias. During training, the model only learns to predict $s_l$ given $s_{1:l-1}$, which are fluent prefixes in the training set. During sampling, when there are some small mistakes and the first $l-1$ tokens can no longer make up a very fluent sentence, the model may easily fail to predict $s_l$.

2) KL-divergence punishes the situation where the generation model gives real data points low probabilities much more severely than the situation where unreasonable data points are given high probabilities. As a result, models trained with MLE will focus more on not missing real data points than on avoiding generating data points of low quality.

Sequential GANs

Sequential GANs (Yu et al., 2017; Guo et al., 2018) are proposed to overcome the above shortcomings of MLE. Their typical objective is:

$$L_{\text{GAN}}(\theta) = \min_\theta \; -\mathbb{E}_{s \sim p_\theta}\left[\sum_{t=1}^{n} Q_t(s_{1:t-1}, s_t) \cdot \log p_\theta(s_t \mid s_{1:t-1})\right] \tag{4}$$

$Q_t(s_{1:t-1}, s_t)$ is the action value, which is usually approximated by a discriminator's evaluation on the complete sequences sampled from the prefix $s_{t+1} = [s_{1:t-1}, s_t]$. The main advantage of GANs is that when we update the generative model, error will be explicitly reduced by the effect of the normalizing constant.

However, there is also a major drawback of GANs. As the gradient is estimated by the REINFORCE algorithm (Yu et al., 2017), the generated distribution is non-stationary. As a result, the estimated gradient may suffer from high variance. Though many methods have been proposed to stabilize the training of sequential GANs, e.g., control variates (Che et al., 2017) or MLE pretraining (Yu et al., 2017), they only have [...]

[...] where $m \in [0, 1]$ is the proportion of training data. By ψ-MLE, we extend O to the whole space. And since there are real training data in the mixture samples, the gradients are more informative with lower variances.

For training, we directly minimize the forward KL divergence between $p_{\text{mix}}$ and $p_\theta$, which is equivalent to performing MLE on samples from $p_{\text{mix}}$. The training goal at each step is to maximize:

$$\mathbb{E}_{p_{\text{mix}}(S)}\left[\log p_\theta(S)\right] \tag{6}$$

so when the KL-divergence decreases, the gap between $p_\theta$ and $p_{\text{data}}$ gets smaller. Eventually, when $p_\theta \approx p_{\text{mix}}$, $p_\theta$ also approximates $p_{\text{data}}$.

However, $p_{\text{mix}}$ may be very different from $p_{\text{data}}$, especially at the beginning of training. This discrepancy may result in generating really poor samples which have high likelihoods in $p_\theta$ but not in $p_{\text{data}}$. As a result, the training set gets noisier, which may harm performance.
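To make the mixture training objective concrete, the following is a minimal Python sketch of drawing samples from a mixture of training data and model samples and forming a Monte Carlo estimate of the quantity in Eq. 6. The definition of the mixture distribution is not reproduced in this excerpt, so the sketch assumes the simple form m * (empirical data) + (1 - m) * (model), with m the proportion of training data. The toy categorical model and all names (`MODEL_PROBS`, `TRAIN_SET`, `sample_mixture`, `mixture_objective`) are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import random

# Hypothetical toy setup: a "sentence" is a tuple of tokens, and the model
# p_theta is a fixed categorical distribution over three sentences.
MODEL_PROBS = {("a", "b"): 0.5, ("a", "c"): 0.3, ("b", "b"): 0.2}
TRAIN_SET = [("a", "b"), ("a", "c")]  # stand-in for the empirical data


def model_sampler(rng):
    """Draw one sentence from the toy model p_theta."""
    sentences = list(MODEL_PROBS)
    weights = [MODEL_PROBS[s] for s in sentences]
    return rng.choices(sentences, weights=weights, k=1)[0]


def model_log_prob(sentence):
    """Return log p_theta(sentence) under the toy model."""
    return math.log(MODEL_PROBS[sentence])


def sample_mixture(train_set, sampler, m, n, rng):
    """Draw n sentences from the ASSUMED mixture m * p_data_hat + (1 - m) * p_theta."""
    batch = []
    for _ in range(n):
        if rng.random() < m:
            # With probability m: a real training sentence.
            batch.append(rng.choice(train_set))
        else:
            # Otherwise: a sentence generated by the model itself.
            batch.append(sampler(rng))
    return batch


def mixture_objective(batch, log_prob):
    """Monte Carlo estimate of E_{p_mix(S)}[log p_theta(S)], the quantity in Eq. 6."""
    return sum(log_prob(s) for s in batch) / len(batch)


rng = random.Random(0)
batch = sample_mixture(TRAIN_SET, model_sampler, m=0.5, n=1000, rng=rng)
estimate = mixture_objective(batch, model_log_prob)
```

In an actual training loop the model would be parameterized and `estimate` would be maximized with respect to θ by gradient ascent; here the model is held fixed so that only the sampling and averaging steps are shown. Setting `m=1.0` in this sketch recovers sampling from the training set alone, which corresponds to vanilla MLE on the empirical data.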