Generative Adversarial Networks for Text Using Word2vec Intermediaries
Akshay Budhkar 1,2,4, Krishnapriya Vishnubhotla 1, Safwan Hossain 1,2 and Frank Rudzicz 1,2,3,5
1 Department of Computer Science, University of Toronto, {abudhkar, vkpriya, [email protected]}
2 Vector Institute, [email protected]
3 St Michael's Hospital
4 Georgian Partners
5 Surgical Safety Technologies Inc.

Abstract

Generative adversarial networks (GANs) have shown considerable success, especially in the realistic generation of images. In this work, we apply similar techniques for the generation of text. We propose a novel approach to handle the discrete nature of text, during training, using word embeddings. Our method is agnostic to vocabulary size and achieves competitive results relative to methods with various discrete gradient estimators.

1 Introduction

Natural Language Generation (NLG) is often regarded as one of the most challenging tasks in computation (Murty and Kabadi, 1987). It involves training a model to do language generation for a series of abstract concepts, represented either in some logical form or as a knowledge base. Goodfellow introduced generative adversarial networks (GANs) (Goodfellow et al., 2014) as a method of generating synthetic, continuous data with realistic attributes. The model includes a discriminator network (D), responsible for distinguishing between the real and the generated samples, and a generator network (G), responsible for generating realistic samples with the goal of fooling the D. This setup leads to a minimax game where we maximize the value function with respect to D, and minimize it with respect to G. The ideal optimal solution is the complete replication of the real distributions of data by the generated distribution.

GANs, in this original setup, often suffer from the problem of mode collapse, where the G manages to find a few modes of data that resemble real data, using them consistently to fool the D. Workarounds for this include updating the loss function to incorporate an element of multi-diversity. An optimal D would provide G with the information to improve; however, if at the current stage of training it is not doing that yet, the gradient of G vanishes. Additionally, with this loss function, there is no correlation between the metric and the generation quality, and the most common workaround is to generate targets across epochs and then measure the generation quality, which can be an expensive process.

W-GAN (Arjovsky et al., 2017) rectifies these issues with its updated loss. Wasserstein distance is the minimum cost of transporting mass in converting data from distribution P_r to P_g. This loss forces the GAN to perform in a min-max, rather than a max-min, a desirable behavior as stated in (Goodfellow, 2016), potentially mitigating mode-collapse problems. The loss function is given by:

L_{critic} = \min_G \max_{D \in \mathcal{D}} \left( \mathbb{E}_{x \sim p_r(x)}[D(x)] - \mathbb{E}_{\tilde{x} \sim p_g(x)}[D(\tilde{x})] \right)    (1)

where \mathcal{D} is the set of 1-Lipschitz functions and P_g is the model distribution implicitly defined by \tilde{x} = G(z), z \sim p(z). A differentiable function is 1-Lipschitz iff it has gradients with norm at most 1 everywhere. Under an optimal D, minimizing the value function with respect to the generator parameters minimizes W(p_r, p_g), where W is the Wasserstein distance, as discussed in (Vallender, 1974). To enforce the Lipschitz constraint, the authors propose clipping the weights of D within a compact space [-c, c].

(Gulrajani et al., 2017) show that even though this setup leads to more stable training compared to the original GAN loss function, the architecture suffers from exploding and vanishing gradient problems. They introduce the concept of gradient penalty as an alternative way to enforce the Lipschitz constraint, by penalizing the gradient norm directly in the loss. The loss function is given by:

L = L_{critic} + \lambda \, \mathbb{E}_{\hat{x} \sim p(\hat{x})}\left[ \left( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \right)^2 \right]    (2)

where \hat{x} are random samples drawn from p(\hat{x}), and L_{critic} is the loss defined in Equation 1.
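As a concrete illustration of Equations 1 and 2, a minimal PyTorch-style sketch of the critic objective with gradient penalty might look as follows. The function name, the penalty weight lambda_gp, and the assumption that D accepts a batch of equally shaped real and generated samples are illustrative choices, not details taken from this paper.

import torch

def critic_loss_with_gradient_penalty(D, real, fake, lambda_gp=10.0):
    # Critic update treats generator samples as constants.
    fake = fake.detach()

    # Wasserstein critic term from Equation 1, negated so that minimizing
    # this loss maximizes E[D(real)] - E[D(fake)].
    wasserstein_term = D(fake).mean() - D(real).mean()

    # Gradient penalty from Equation 2: sample points between real and
    # generated batches and penalize deviation of the gradient norm from 1.
    alpha = torch.rand(real.size(0), *([1] * (real.dim() - 1)), device=real.device)
    interpolated = (alpha * real + (1 - alpha) * fake).requires_grad_(True)
    d_interp = D(interpolated)
    grads = torch.autograd.grad(
        outputs=d_interp,
        inputs=interpolated,
        grad_outputs=torch.ones_like(d_interp),
        create_graph=True,
    )[0]
    grad_norm = grads.view(grads.size(0), -1).norm(2, dim=1)
    gradient_penalty = ((grad_norm - 1.0) ** 2).mean()

    return wasserstein_term + lambda_gp * gradient_penalty

In a full training loop, this loss would be minimized with respect to D's parameters, while G is updated separately on its own adversarial loss.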
Empirical results of GANs over the past year or so have been impressive. GANs have gotten state-of-the-art image-generation results on datasets like ImageNet (Brock et al., 2018) and LSUN (Radford et al., 2015). Such GANs are fully differentiable and allow for back-propagation of gradients from D through the samples generated by G. However, if the data is discrete, as in the case of text, the gradient cannot be propagated back from D to G without some approximation. Workarounds to this problem include techniques from reinforcement learning (RL), such as policy gradients to choose a discrete entity, and reparameterization to represent the discrete quantity in terms of an approximated continuous function (Williams, 1992; Jang et al., 2016).

1.1 Techniques for GANs for text

SeqGAN (Yu et al., 2017) uses policy gradient techniques from RL to approximate the gradient from discrete G outputs, and applies MC rollouts during training to obtain a loss signal for each word in the corpus. MaliGAN (Che et al., 2017) rescales the reward to control for the vanishing gradient problem faced by SeqGAN. RankGAN (Lin et al., 2017) replaces D with an adversarial ranker and minimizes a pair-wise ranking loss to get better convergence; however, it is more expensive than other methods due to the extra sampling from the original data. (Kusner and Hernández-Lobato, 2016) used the Gumbel-softmax approximation of the discrete one-hot encoded output of the G, and showed that the model learns rules of a context-free grammar from training samples. (Rajeswar et al., 2017), the state of the art in 2017, forced the GAN to operate on continuous quantities by approximating the one-hot output tokens with a softmax distribution layer at the end of the G network.

MaskGAN (Fedus et al., 2018) uses policy gradient with the REINFORCE estimator (Williams, 1992) to train the model to predict a word based on its context, and shows that for the specific blank-filling task, their model outperforms a maximum likelihood model using the perplexity metric. LeakGAN (Guo et al., 2018) allows for long sentence generation by leaking high-level information from D to G, and generates a latent representation from the features of the already generated words, to aid in the next word generation. TextGAN (Zhang et al., 2017) adds an element of diversity to the original GAN loss by employing the Maximum Mean Discrepancy objective to alleviate mode collapse.

In the latter half of 2018, (Zhu et al., 2018) introduced Texygen, a benchmarking platform for natural language generation, while introducing standard metrics apt for this task. (Lu et al., 2018) surveys all these new methods along with other baselines, and documents model performance on standard corpora like EMNLP2017 WMT News (http://www.statmt.org/wmt17/) and Image COCO (http://cocodataset.org/).

2 Motivation

2.1 Problems with the Softmax Function

The final layer of nearly all existing language generation models is the softmax function. It is usually the slowest to compute, leaves a large memory footprint, and can lead to significant speedups if replaced by approximate continuous outputs (Kumar and Tsvetkov, 2018). Given this bottleneck, models usually limit the vocabulary size to a few thousand words and use an unknown token (unk) for the rare words. Any change in the allowed vocabulary size also means that the researcher needs to modify the existing model architecture.

Our work breaks this bottleneck by having our G produce a sequence (or stack) of continuous distributed word vectors, with n dimensions, where n << V and V is the vocabulary size. The expectation is that the model will output words in a semantic space: that is, produced words would either be correct or close synonyms (Mikolov et al., 2013; Kumar and Tsvetkov, 2018), while having a smaller memory footprint and faster training and inference procedures.
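To make the size argument concrete, the sketch below contrasts a conventional softmax output layer with an embedding-valued output head. The layer shapes and the sizes hidden, V, and n are hypothetical values chosen for illustration, not numbers from this paper.

import torch.nn as nn

hidden, V, n = 512, 50_000, 64  # hypothetical sizes, with n << V

# Conventional generation head: one logit per vocabulary entry, followed by a
# softmax over all V words at every generation step.
softmax_head = nn.Sequential(nn.Linear(hidden, V), nn.LogSoftmax(dim=-1))

# Embedding-valued head in the spirit of this work: emit an n-dimensional
# real-valued word vector directly, independent of the vocabulary size.
embedding_head = nn.Linear(hidden, n)

print(sum(p.numel() for p in softmax_head.parameters()))    # 25,650,000 parameters
print(sum(p.numel() for p in embedding_head.parameters()))  # 32,832 parameters

With an embedding-valued head, changing the vocabulary only affects the decoding step, not the generator architecture itself.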
2.2 GAN2vec

In this work, we propose GAN2vec: GANs that generate real-valued word2vec-like vectors (as opposed to discrete one-hot encoded outputs). While this work focuses specifically on word2vec-based representations, it can be easily extended to other embedding techniques like GloVe and fastText.

Expecting a neural network to generate text is, intuitively, expecting it to learn all the nuances of natural language, including the rules of grammar, context, coherent sentences, and so on. Word2vec has been shown to capture parts of these subtleties by capturing the inherent semantic meaning of words, as shown by the empirical results in the original paper (Mikolov et al., 2013) and with theoretical justifications by (Ethayarajh et al., 2018). GAN2vec breaks the problem of generation down into two steps: the first is the word2vec mapping, with the following network expected to address the other aspects of sentence generation. It also allows the model designers to swap out word2vec for a different type of word representation that is best suited for the specific language task at hand.

Generated embeddings are converted back to words at regular intervals during training and during inference for human interpretation. A nearest-neighbor approach based on cosine similarity is used to find the closest word to the generated embedding in the vector space.

4 The Algorithm

The complete GAN2vec flow is presented in Algorithm 1.

Algorithm 1 GAN2vec Framework
1: Train a word2vec model, e, on the train corpus
2: Transform text to a stack of word2vec vectors using e
3: Pre-train D for t iterations on real data
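A minimal sketch of steps 1 and 2 and of the cosine-similarity decoding described above, assuming the gensim 4.x API; the toy corpus, the vector size, and the helper names are placeholders for illustration rather than the paper's actual setup.

import numpy as np
from gensim.models import Word2Vec

# Placeholder corpus; in practice this would be the tokenized train split.
corpus = [["the", "bird", "flew", "away"], ["a", "bird", "sang", "today"]]

# Step 1: train the word2vec mapping e on the train corpus (gensim 4.x API).
e = Word2Vec(sentences=corpus, vector_size=64, min_count=1, epochs=50)

# Step 2: transform a sentence into a stack of word2vec vectors using e.
def to_vectors(sentence):
    return np.stack([e.wv[word] for word in sentence])

real_stack = to_vectors(corpus[0])  # shape: (sentence_length, 64)

# Decoding: map a (generated) embedding back to its nearest word by cosine similarity.
def nearest_word(vector):
    return e.wv.similar_by_vector(vector, topn=1)[0][0]

print(nearest_word(real_stack[1]))  # recovers "bird" for this real vector

During GAN training, the same nearest-word lookup would be applied to each row of G's output stack whenever generated sentences need to be inspected.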