Adversarial Mutual Information for Text Generation
Boyuan Pan 1*, Yazheng Yang 2*, Kaizhao Liang 3, Bhavya Kailkhura 4, Zhongming Jin 5, Xian-Sheng Hua 5, Deng Cai 1, Bo Li 3

*Equal contribution. 1 State Key Lab of CAD&CG, Zhejiang University. 2 College of Computer Science, Zhejiang University. 3 University of Illinois Urbana-Champaign. 4 Lawrence Livermore National Laboratory. 5 Alibaba Group. Correspondence to: Boyuan Pan <[email protected]>, Bo Li <[email protected]>.

Proceedings of the 37th International Conference on Machine Learning, Vienna, Austria, PMLR 119, 2020. Copyright 2020 by the author(s). arXiv:2007.00067v1 [cs.CL] 30 Jun 2020.

Abstract

Recent advances in maximizing mutual information (MI) between the source and target have demonstrated its effectiveness in text generation. However, previous works paid little attention to modeling the backward network of MI (i.e., the dependency from the target to the source), which is crucial to the tightness of the variational information maximization lower bound. In this paper, we propose Adversarial Mutual Information (AMI): a text generation framework formulated as a novel saddle point (min-max) optimization aiming to identify joint interactions between the source and target. Within this framework, the forward and backward networks iteratively promote or demote each other's generated instances by comparing the real and synthetic data distributions. We also develop a latent noise sampling strategy that leverages random variations in the high-level semantic space to enhance the long-term dependency in the generation process. Extensive experiments on different text generation tasks demonstrate that the proposed AMI framework significantly outperforms several strong baselines, and we also show that AMI has the potential to lead to a tighter lower bound of maximum mutual information for the variational information maximization problem.

1. Introduction

Generating diverse and meaningful text is one of the coveted goals in machine learning research. Most sequence transduction models for text generation can be efficiently trained by maximum likelihood estimation (MLE) and have demonstrated dominant performance in various tasks, such as dialog generation (Serban et al., 2016; Park et al., 2018), machine translation (Bahdanau et al., 2014; Vaswani et al., 2017) and document summarization (See et al., 2017). However, MLE-based training schemes have been shown to produce safe but dull texts, which is ascribed to the relative frequency of generic phrases in the dataset such as "I don't know" or "what do you do" (Serban et al., 2016).

Recently, Li et al. (2016a) proposed to use maximum mutual information (MMI) as the objective function in dialog generation. They showed that modeling mutual information between the source and target significantly decreases the chance of generating bland sentences and improves the informativeness of the generated responses, which has inspired many important works on MI-prompting text generation (Li et al., 2016b; Zhang et al., 2018b; Ye et al., 2019). In practice, directly maximizing the mutual information is intractable, so it is often converted to a lower bound, the variational information maximization (VIM) problem, by defining an auxiliary backward network to approximate the posterior of the dependency from the target to the source (Chen et al., 2016). Specifically, the likelihood of generating the real source text by the backward network, whose input is the synthetic target text produced by the forward network, is used as a reward to optimize the forward network, tailoring it to generate sentences with higher rewards.

However, most of these works optimize the backward network by maximizing the likelihood stated above. Without any guarantee on the quality of the synthetic target text produced by the forward network, this procedure encourages the backward model to generate the real source text from many uninformative sentences and is unlikely to make the backward network approach the true posterior, thus affecting the tightness of the variational lower bound. On the other hand, since the likelihood of the backward network is a reward for optimizing the forward network, a biased backward network may provide unreliable reward scores, which will mislead the forward network into still generating bland text.

In this paper, we present a simple yet effective framework based on the maximum mutual information objective that trains the forward network and the backward network adversarially. We propose Adversarial Mutual Information (AMI), a new text generation framework that addresses the above issues by solving an information-regularized minimax optimization problem. Instead of maximizing the variational lower bound for both the forward network and the backward network, we maximize the objective when training the forward network while minimizing it when training the backward network. In addition to the original objective, we add the expectation of the negative likelihood of the backward model whose sources and targets are sampled from the real data distribution. In this way, we encourage the backward network to generate the real source text only when its input is in the real target distribution. This scheme remedies the problem that the backward model gives high rewards when its input is bland text, or text that looks human-generated but has little content related to the real data. To stabilize training, we prove that our objective function can be transformed into minimizing the Wasserstein distance (Arjovsky et al., 2017), which is known to be differentiable almost everywhere under mild assumptions.
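Taken literally, the description above corresponds to a saddle-point objective of roughly the following form, where P_θ(T|S) denotes the forward network and Q_φ(S|T) the backward network (this notation is introduced in Section 2), T' is a synthetic target sampled from the forward network, and λ is an assumed balancing coefficient. This display is only an illustrative paraphrase of the two preceding paragraphs, not the exact objective derived later in the paper:

$$
\max_{\theta}\,\min_{\phi}\;\;
\mathbb{E}_{P(S)}\,\mathbb{E}_{P_{\theta}(T'\mid S)}\!\left[\log Q_{\phi}(S\mid T')\right]
\;-\;\lambda\,\mathbb{E}_{P(S,T)}\!\left[\log Q_{\phi}(S\mid T)\right]
$$

Under this reading, the backward network is rewarded for reconstructing the source only from real targets and penalized for doing so from synthetic ones, matching the intuition described above.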
Moreover, we present a latent noise sampling strategy that adds sampled noise to the high-level latent space of the model in order to affect the long-term dependency of the sequence, which enables the model to generate more diverse and informative text.

Technical Contributions. In this paper, we take the first step towards generating diverse and informative text by optimizing adversarial mutual information. We make contributions on both the theoretical and empirical fronts.

• We propose Adversarial Mutual Information (AMI), a novel text generation framework that adversarially optimizes the mutual information maximization objective function. We provide a theoretical analysis of how our objective can be transformed into a Wasserstein distance minimization problem under certain modifications.

• To encourage generating diverse and meaningful text, we develop a latent noise sampling method that adds random variations to the latent semantic space to enhance the long-term dependency for the generation process.

• We conduct extensive experiments on the tasks of dialog generation and machine translation, which demonstrate that the proposed AMI framework significantly outperforms several strong baselines. We show that our method has the potential to lead to a tighter lower bound of maximum mutual information.

2. Background

2.1. Sequence Transduction Model

The sequence-to-sequence models (Cho et al., 2014; Sutskever et al., 2014) and the Transformer-based models (Vaswani et al., 2017) have been firmly established as the state-of-the-art approaches in text generation problems such as dialog generation or machine translation (Wu et al., 2019; Pan et al., 2018; Akoury et al., 2019). In general, most of these sequence transduction models hold an encoder-decoder architecture (Bahdanau et al., 2014), where the encoder maps an input sequence of symbol representations $S = \{x_1, x_2, \ldots, x_n\}$ to a sequence of continuous representations $z = \{z_1, z_2, \ldots, z_n\}$. Given $z$, the decoder then generates an output sequence $T = \{y_1, y_2, \ldots, y_m\}$ of tokens one element at a time. At each time step the model is auto-regressive, consuming the previously generated tokens as additional input when generating the next.

For sequence-to-sequence models, the encoder and the decoder are usually based on recurrent or convolutional neural networks, while the Transformer builds them using stacked self-attention and point-wise fully connected layers. Given the source $S$ and the target $T$, the objective of a sequence transduction model can be simply formulated as:

$$\theta^{*} \;=\; \arg\max_{\theta}\, P_{\theta}(T \mid S) \qquad (1)$$

where $\theta$ denotes the parameters of the sequence transduction model.
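The paper does not include code, but as a concrete illustration of the MLE objective in Eq. (1), the following minimal PyTorch sketch trains a toy GRU encoder-decoder with teacher forcing. The class name, model sizes, and random toy batch are placeholders, and a Transformer encoder-decoder could be substituted without changing the objective.

```python
# Minimal sketch (not the paper's code): MLE training of a toy GRU
# encoder-decoder, i.e. maximizing log P_theta(T | S) as in Eq. (1).
import torch
import torch.nn as nn

VOCAB, HIDDEN, PAD = 100, 64, 0

class Seq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN, padding_idx=PAD)
        self.encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.decoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def forward(self, src, tgt_in):
        # Encode S = {x_1..x_n} into continuous states; keep the final state z.
        _, z = self.encoder(self.emb(src))
        # Teacher forcing: the decoder consumes the gold prefix of T.
        dec_states, _ = self.decoder(self.emb(tgt_in), z)
        return self.out(dec_states)          # logits over the vocabulary

model = Seq2Seq()
optim = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss(ignore_index=PAD)

# Toy batch: random token ids standing in for (source, target) pairs.
src = torch.randint(1, VOCAB, (8, 10))
tgt = torch.randint(1, VOCAB, (8, 12))

optim.zero_grad()
logits = model(src, tgt[:, :-1])              # predict each next token of T
nll = loss_fn(logits.reshape(-1, VOCAB), tgt[:, 1:].reshape(-1))
nll.backward()                                # minimize -log P_theta(T | S)
optim.step()
```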
2.2. Maximum Mutual Information

Maximum mutual information for text generation is increasingly being studied (Li et al., 2016a; Zhang et al., 2018b; Ye et al., 2019). Compared to the maximum likelihood estimation (MLE) objective, maximizing the mutual information $I(S,T)$ between the source and target encourages the model to generate sequences that are more specific to the source. Formally, we have

$$I(S,T) \;=\; \mathbb{E}_{P(T,S)}\!\left[\log \frac{P(S,T)}{P(S)P(T)}\right] \qquad (2)$$

However, directly optimizing the mutual information is intractable. To provide a principled approach to maximizing MI, a Variational Information Maximization (Barber & Agakov, 2003; Chen et al., 2016; Zhang et al., 2018b) lower bound is adopted:

$$
\begin{aligned}
I(S,T) &= H(S) - H(S \mid T) \\
&= H(S) + \mathbb{E}_{P(T,S)}\!\left[\log P(S \mid T)\right] \\
&= H(S) + \mathbb{E}_{P(T)}\!\left[D_{\mathrm{KL}}\!\left(P(S \mid T)\,\|\,Q_{\phi}(S \mid T)\right)\right] + \mathbb{E}_{P(T,S)}\!\left[\log Q_{\phi}(S \mid T)\right] \\
&\ge H(S) + \mathbb{E}_{P(S)}\,\mathbb{E}_{P_{\theta}(T \mid S)}\!\left[\log Q_{\phi}(S \mid T)\right]
\end{aligned} \qquad (3)
$$

where $H(\cdot)$ denotes the entropy and $D_{\mathrm{KL}}(\cdot,\cdot)$ denotes the KL divergence between two distributions. $Q_{\phi}(S \mid T)$ is a backward network that approximates the unknown $P(S \mid T)$, and usually shares the same architecture with the forward network.

[Figure 1. The overview of frameworks for Maximum Mutual Information (left part) and the proposed Adversarial Mutual Information (right part). Different colors of text denote different objectives.]
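To connect Eq. (3) with the Maximum Mutual Information training flow in Figure 1 (left part), here is a minimal, assumption-laden PyTorch sketch of one VIM-style update: a synthetic target is sampled from the forward network, its log-likelihood under the backward network serves as a policy-gradient reward, and the backward network is then updated by maximum likelihood on the synthetic pair (step 4), which is the step AMI replaces with an adversarial update. All names (ToySeq2Seq, sample, log_prob) and hyperparameters are illustrative, and a variance-reducing baseline for the reward is omitted for brevity.

```python
# Minimal sketch (assumptions, not the paper's code): one MMI/VIM training
# step with a policy-gradient reward from the backward network.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, BOS, MAXLEN = 100, 64, 1, 12

class ToySeq2Seq(nn.Module):
    def __init__(self):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, HIDDEN)
        self.enc = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.dec = nn.GRU(HIDDEN, HIDDEN, batch_first=True)
        self.out = nn.Linear(HIDDEN, VOCAB)

    def log_prob(self, src, tgt):
        """Sum of per-token log-probabilities log P(tgt | src)."""
        _, z = self.enc(self.emb(src))
        inp = torch.cat([torch.full_like(tgt[:, :1], BOS), tgt[:, :-1]], dim=1)
        h, _ = self.dec(self.emb(inp), z)
        logp = F.log_softmax(self.out(h), dim=-1)
        return logp.gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(dim=1)

    @torch.no_grad()
    def sample(self, src):
        """Ancestral sampling of a synthetic target, one token at a time."""
        _, z = self.enc(self.emb(src))
        tok = torch.full((src.size(0), 1), BOS, dtype=torch.long)
        outs = []
        for _ in range(MAXLEN):
            h, z = self.dec(self.emb(tok), z)
            tok = torch.multinomial(F.softmax(self.out(h[:, -1]), dim=-1), 1)
            outs.append(tok)
        return torch.cat(outs, dim=1)

forward_net, backward_net = ToySeq2Seq(), ToySeq2Seq()
opt_f = torch.optim.Adam(forward_net.parameters(), lr=1e-4)
opt_b = torch.optim.Adam(backward_net.parameters(), lr=1e-4)

src = torch.randint(2, VOCAB, (8, 10))           # toy "real source" batch

# 1) Sample a synthetic target T' ~ P_theta(. | S).
syn_tgt = forward_net.sample(src)

# 2) Reward for the forward network: r = log Q_phi(S | T'), detached from phi.
reward = backward_net.log_prob(syn_tgt, src).detach()

# 3) REINFORCE update of theta: maximize E[r * log P_theta(T' | S)].
pg_loss = -(reward * forward_net.log_prob(src, syn_tgt)).mean()
opt_f.zero_grad()
pg_loss.backward()
opt_f.step()

# 4) Standard (non-adversarial) update of phi: maximize log Q_phi(S | T').
#    This is what the AMI framework replaces with a min-max objective.
bwd_loss = -backward_net.log_prob(syn_tgt, src).mean()
opt_b.zero_grad()
bwd_loss.backward()
opt_b.step()
```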