arXiv:2101.00828v2 [cs.CL] 8 Jul 2021
Transformer-based Conditional Variational Autoencoder for Controllable Story Generation

Le Fang†, Tao Zeng§, Chaochun Liu§, Liefeng Bo§, Wen Dong†, Changyou Chen†
†University at Buffalo, §JD Finance America Corporation, AI Lab
flefang, wendong, [email protected] ftao.zeng, chaochun.liu, [email protected]

Abstract

We investigate large-scale latent variable models (LVMs) for neural story generation, an under-explored application for open-domain long text, with objectives in two threads: generation effectiveness and controllability. LVMs, especially the variational autoencoder (VAE), have achieved both effective and controllable generation by exploiting flexible distributional latent representations. Recently, Transformers and their variants have achieved remarkable effectiveness without explicit latent representation learning, and thus lack satisfactory controllability in generation. In this paper, we advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a Transformer-based pre-trained architecture to build a conditional variational autoencoder (CVAE). Model components such as the encoder, the decoder and the variational posterior are all built on top of pre-trained language models, specifically GPT2 in this paper. Experiments demonstrate the state-of-the-art conditional generation ability of our model, as well as its excellent representation learning capability and controllability.

Introduction

Neural text generation has achieved remarkable success in enabling both effective and controllable generation, at a level where computational models can write like humans to satisfy practical needs. Among various research objectives, the most significant ones are the effectiveness and the controllability of generation, where there are always emerging opportunities and challenges.

Deep latent variable models (LVMs), especially the variational autoencoder (VAE) (Kingma and Welling 2013; Rezende, Mohamed, and Wierstra 2014), have been a significant class of methods to achieve both effective and controllable generation (Bowman et al. 2015; Miao, Yu, and Blunsom 2016; Zhao, Zhao, and Eskenazi 2017; Zhao, Lee, and Eskenazi 2018; Zhou and Neubig 2017; Hu et al. 2017; Bao et al. 2019b; Shah and Barber 2018). These models generally work with recurrent neural networks (RNN) such as long short-term memory (LSTM) (Hochreiter and Schmidhuber 1997) and gated recurrent unit networks (GRU) (Cho et al. 2014). The advantage of LVMs is that they learn and exploit flexible distributional latent representations to capture holistic features of the input and further guide the generation of sentences. Such powerful representation learning can deal with both the effectiveness and the controllability of generation.

In recent years, Transformers (Vaswani et al. 2017) and their variants have become the mainstream workhorses and boosted previous generation effectiveness by large margins. Models based on similar self-attention architectures (Devlin et al. 2018; Radford et al. 2018, 2019) can leverage both big models and big training data. "Pre-training + fine-tuning" has emerged as a dominant paradigm on a number of natural language processing tasks. Even without explicitly learning latent representations, Transformer-based models can effectively learn from training data and generate high-quality text. It is thrilling to witness computational models generate consistent long text of thousands of words with ease. However, given state-of-the-art generation effectiveness, the controllability of these models, especially when generating long text, is still under-explored. The emerging challenge is: how can controllable generation be achieved in the era of Transformers and in a long text setting?

In this paper, we advocate reviving latent variable modeling, essentially the power of representation learning, in the era of Transformers to enhance controllability without hurting state-of-the-art generation effectiveness. Specifically, we integrate latent representation vectors with a self-attention based pre-trained architecture to build a conditional variational autoencoder (CVAE). Model components such as the encoder, the decoder and the variational posterior are all built on top of pre-trained language models, specifically GPT2 (Radford et al. 2019). We demonstrate the excellent representation learning capability and controllability of our Transformer-based LVMs through learning and manipulating the latent representation vectors.

On the application side, we emphasize a much more challenging and under-explored task, i.e. neural story generation, which creatively writes open-domain long text of hundreds or thousands of words conditioned on very short and abstract prompts (Fan, Lewis, and Dauphin 2018). The task, featuring much longer output, leads to higher complexity and more flexibility in a broader space than short text generation. Previous literature (Fan, Lewis, and Dauphin 2018; Mao et al. 2019; See et al. 2019; Ziegler et al. 2019) has at most studied how to effectively learn the mapping between prompt and story through explicit end-to-end (end2end) training. However, controllability in such a setting has rarely been studied. For instance, how can story development and semantic transitions be controlled over the span of a long text? Pure end2end learning seems quite rigid, and could miss flexible and controllable mechanisms inside a black box. A reasonable solution to this issue is to introduce latent representation vectors, which is the treatment we consider in this paper.

To summarize, our paper is, to our knowledge, among the first works to build Transformer-based latent variable models to solve the controllability issue in the setting of long text generation. Recently, we noticed an independent parallel work (Li et al. 2020), which proposes a similar Transformer-based architecture to incorporate latent representation vectors. We note that there are a number of differences between that work and ours. Most significantly, we consider both VAE and CVAE in a long text setting, while Li et al. (2020) consider a pre-trained VAE model in the traditional short text setting. Our datasets and source code are available on GitHub¹.

The Model Architecture

Conditional Variational Autoencoder

Conditional story generation (Fan, Lewis, and Dauphin 2018) refers to generating open-domain long text based on a short prompt, which provides either a starting point or an abstract summary for the writing. In this paper, we propose a Transformer-based conditional variational autoencoder to learn the generative process from prompt to story.

Figure 1: Graphical Model of VAE and CVAE. In controllable story generation, x and y refer to a prompt and a story, respectively. z refers to a latent variable.

Variational Autoencoder (VAE) (Bowman et al. 2015). Figure 1 illustrates the graphical model of VAE, an unsupervised learning method for unconditional generation. VAE consists of a generative network (decoder) and an inference network (encoder). Given a language dataset $\mathcal{D} = \{y_i\}_{i=1}^{|\mathcal{D}|}$, where $y_i = [y_{1i}, \cdots, y_{Ti}]$ represents the $i$-th sentence of length $T$, and a prior distribution $p(z)$, VAE generates a sentence $y$ using the deep generative network $p_\theta(y|z)$ parameterized by $\theta$. The prior $p(z)$ is typically assumed to be a standard multivariate Gaussian. The decoder $p_\theta(y|z)$ typically takes an auto-regressive form $p_\theta(y|z) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, z)$. In this paper, we build the decoder on the pre-trained GPT2 rather than on traditional recurrent neural networks.

The goal of VAE training is to maximize the marginal data log-likelihood $\mathbb{E}_{y\sim\mathcal{D}}[\log p_\theta(y)]$. Since exact posterior inference is generally intractable, a $\phi$-parameterized encoder is introduced to approximate $p_\theta(z|y) \propto p_\theta(y|z)\,p(z)$ with a variational distribution $q_\phi(z|y)$. Variational inference is employed for VAE learning, yielding the following evidence lower bound (ELBO):

$$\mathbb{E}_{y\sim\mathcal{D}} \log p_\theta(y) \ge \mathbb{E}_{y\sim\mathcal{D}}\left[\mathbb{E}_{z\sim q_\phi(z|y)} \log p_\theta(y|z)\right] - \mathbb{E}_{y\sim\mathcal{D}}\left[\mathrm{KL}\left(q_\phi(z|y)\,\|\,p(z)\right)\right] \tag{1}$$

Conditional Variational Autoencoder (CVAE) (Zhao, Zhao, and Eskenazi 2017). Figure 1 also illustrates the CVAE, an adaptation of VAE to fit supervised learning and conditional generation. The training dataset consists of pairs $\mathcal{D} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{D}|}$, where $x_i = [x_{1i}, \cdots, x_{Ti}]$ and $y_i = [y_{1i}, \cdots, y_{Ti}]$ represent the $i$-th pair of sentences of length $T$. In controllable story generation, $x$ and $y$ refer to a prompt and a story, respectively. Given an input $x$, CVAE encodes the prior knowledge of the latent code as $p_\theta(z|x)$, and generates the target $y$ using the deep generative network $p_\theta(y|x, z)$ parameterized by $\theta$.

The goal of CVAE is to maximize the conditional data log-likelihood $\mathbb{E}_{x,y\sim\mathcal{D}}[\log p_\theta(y|x)]$. Similarly, variational inference is employed for CVAE learning, yielding the following evidence lower bound (ELBO):

$$\mathbb{E}_{x,y\sim\mathcal{D}} \log p_\theta(y|x) \ge \mathbb{E}_{x,y\sim\mathcal{D}}\left[\mathbb{E}_{z\sim q_\phi(z|x,y)} \log p_\theta(y|x,z)\right] - \mathbb{E}_{x,y\sim\mathcal{D}}\left[\mathrm{KL}\left(q_\phi(z|x,y)\,\|\,p_\theta(z|x)\right)\right] \tag{2}$$

Note that both the prior $p_\theta(z|x)$ and the posterior $q_\phi(z|x, y)$ are learnable in CVAE.

Architecture Design

Our model architecture is illustrated in Figure 3. It consists of a prior, a posterior and a conditional generator based on a multi-layer self-attention architecture (Vaswani et al. 2017), more specifically built on top of pre-trained models.

Parameter initialization with GPT2. In order to exploit the power of pre-trained models, we propose to reuse the GPT2 model (Radford et al. 2019) as our decoder. For ease of computation, we adopt the smallest public version with $L = 12$ layers, $H = 12$ heads per layer, a model dimension of $d = 768$ units and 117M parameters in total. The encoder has $L_1 = 6$ unmasked/bi-directional self-attention layers, whose parameters are initialized to the parameters of the first $L_1$ layers of the GPT2 model (initialized but not shared afterwards). Moreover, the word embedding and positional embedding tables in the encoder and decoder are shared.

Compared with the masked/uni-directional structure used in the decoder for auto-regressive generation, the key point is to have an unmasked/bi-directional structure in the encoder to allow full information scope. In this sense, our design is comparable to that of Li et al. (2020), which reuses BERT in the encoder and GPT2 in the decoder.
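To make the CVAE bound in (2) concrete, the following is a minimal sketch of the per-example training loss, estimated from a single sample of z. It assumes that both the prior and the posterior are diagonal Gaussians described by a mean and a log-variance vector, which this section does not state. The function names `kl_diag_gaussians` and `negative_elbo` are hypothetical, not taken from the authors' code.

```python
import math

def kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over latent dimensions.

    Assumption for illustration: q stands for the posterior q(z|x,y) and
    p for the learnable prior p(z|x), both diagonal Gaussians.
    """
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

def negative_elbo(token_log_probs, mu_q, logvar_q, mu_p, logvar_p):
    """Negative of the bound in (2) for one (prompt, story) pair.

    token_log_probs holds log p(y_t | y_<t, x, z) for each story token,
    so their sum is the auto-regressive log-likelihood log p(y | x, z).
    """
    reconstruction = sum(token_log_probs)
    kl = kl_diag_gaussians(mu_q, logvar_q, mu_p, logvar_p)
    return -(reconstruction - kl)

# Toy numbers: two story tokens and a one-dimensional latent code.
loss = negative_elbo([-1.0, -2.0], [1.0], [0.0], [0.0], [0.0])
```

With these toy values the reconstruction term is -3.0 and the KL term is 0.5, so minimizing `loss` trades off story likelihood against the distance between posterior and prior.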
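The initialization scheme described above (encoder layers copied from the first 6 decoder layers but not shared afterwards, embedding tables shared, bi-directional versus uni-directional attention) can be sketched with plain Python data structures. This is a toy illustration only: the dictionary layout and the names `init_encoder_from_decoder` and `attention_mask` are hypothetical and do not reflect the authors' implementation or any particular library API.

```python
import copy

def init_encoder_from_decoder(decoder, num_encoder_layers=6):
    """Build an encoder from the first layers of a decoder-like dict."""
    return {
        # Deep copy: same starting values, but later updates do not propagate
        # ("initialized but not shared afterwards").
        "layers": copy.deepcopy(decoder["layers"][:num_encoder_layers]),
        # Same objects: embedding tables stay tied between encoder and decoder.
        "word_embedding": decoder["word_embedding"],
        "position_embedding": decoder["position_embedding"],
    }

def attention_mask(length, causal):
    """mask[i][j] == 1 means position i may attend to position j."""
    return [[1 if (not causal or j <= i) else 0 for j in range(length)]
            for i in range(length)]

# Toy stand-in for a 12-layer decoder; each "layer" is just a list of weights.
decoder = {
    "layers": [[float(i)] for i in range(12)],
    "word_embedding": [[0.1, 0.2]],
    "position_embedding": [[0.3, 0.4]],
}
encoder = init_encoder_from_decoder(decoder)

decoder_mask = attention_mask(3, causal=True)    # masked / uni-directional
encoder_mask = attention_mask(3, causal=False)   # unmasked / bi-directional
```

The deep copy lets the encoder layers drift away from the decoder during fine-tuning, while the embedding tables remain a single shared object.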