arXiv:1805.08720v1 [cs.LG] 22 May 2018

Adversarial Training of Word2Vec for Basket Completion

Ugo Tanielian, Criteo Research, UPMC, Paris, France
Mike Gartrell, Criteo Research, Paris, France
Flavian Vasile, Criteo Research, Paris, France

ABSTRACT
In recent years, the Word2Vec model trained with the Negative Sampling loss function has shown state-of-the-art results in a number of machine learning tasks, including language modeling tasks, such as word analogy and word similarity, and in recommendation tasks, through Prod2Vec, an extension that applies to modeling user shopping activity and user preferences. Several methods that aim to improve upon the standard Negative Sampling loss have been proposed. In our paper we pursue more sophisticated Negative Sampling, by leveraging ideas from the field of Generative Adversarial Networks (GANs), and propose Adversarial Negative Sampling. We build upon the recent progress made in stabilizing the training objective of GANs in the discrete data setting, and introduce a new GAN-Word2Vec model. We evaluate our model on the task of basket completion, and show significant improvements in performance over Word2Vec trained using standard loss functions, including Noise Contrastive Estimation and Negative Sampling.

CCS CONCEPTS
• Theory of computation → Structured prediction; Adversarial learning; • Computer systems organization → Neural networks;

ACM Reference format:
Ugo Tanielian, Mike Gartrell, and Flavian Vasile. 2018. Adversarial Training of Word2Vec for Basket Completion. In Proceedings of the ACM Conference on Recommender Systems, Vancouver, Canada, October 2018 (RecSys'18), 5 pages.

1 INTRODUCTION
The task of basket completion is a key part of many online retail recommendation applications. Basket completion involves computing predictions for the next item that should be added to a shopping basket, given the collection of items that the user has already added to the basket.

In this context of basket completion, learning item embedding representations can lead to state-of-the-art results, as shown in [24]. Within this class of approaches, many extensions of the classical Word2Vec model [15, 16], such as Swivel [21] and its item-based extension Prod2Vec [10], have become the de-facto standard approach, due to their conceptual simplicity, implementation simplicity, and state-of-the-art performance.

In terms of training and the use of negatives, there have been several attempts to improve upon the standard Negative Sampling (NS) loss function [16]. However, these approaches do not have a dynamic way of sampling the most informative negatives. This shortcoming was addressed by [4], which proposes an active sampling heuristic. In our paper we propose GAN-Word2Vec, an extension of Word2Vec that uses ideas from Generative Adversarial Networks (GANs) to create a negative sampling approach that places dynamic negative sampling on firmer theoretical grounds, and that shows significant improvements in performance over the classical training approaches. In our structure the generator is also trained adversarially, and benefits from a better training signal coming from the discriminator. In terms of training stability, which can be an issue in GAN-like settings, our algorithm builds upon recent advances that make GAN training stable for discrete input data. We evaluate the performance of our GAN-Word2Vec model on a basket-completion task, and show that it outperforms classical supervised approaches such as Word2Vec with Negative Sampling by a significant margin.

Overall, the main contributions of this paper are the following:
• We propose a new dynamic negative sampling scheme for Word2Vec based on ideas from GANs. To the best of our knowledge, we are the first to implement adversarial training for Word2Vec.
• We introduce a stable training algorithm that implements our adversarial sampling scheme.
• Through an experimental evaluation on two real-world datasets, we show that our GAN-Word2Vec model outperforms classical sampling schemes for Word2Vec.

We briefly discuss related work on sampling schemes for Word2Vec and on recent developments in GAN training in discrete settings in Section 2. In Section 3, we formally introduce the GAN-Word2Vec model and describe the training algorithm. We highlight the performance of our method in Section 4, and conclude with the main ideas and directions for future work in Section 5.

2 RELATED WORK

2.1 Basket Completion with Embedding Representations
In recent years, a substantial amount of work has focused on improving the performance of language modeling and text generation. In both tasks, Deep Neural Networks (DNNs) have proven to be extremely effective, and are now considered state-of-the-art. In this paper, we focus on the task of basket completion with learned embedding representations. For the task of basket completion, very little work has focused on applying DNNs, with the notable exception of [24]. In that paper, the authors introduce a new family of networks designed to work on sets, where the output is invariant to any permutation in the order of objects in the input set. The authors of [24] empirically show, on different tasks such as point cloud classification, that their method outperforms state-of-the-art results with good generalization properties. As an alternative to a set-based interpretation, basket completion can be approached as a sequence generation task, where one seeks to predict the distribution of the next item conditioned on the items already present in the basket. First, one could use the Skip-Gram model proposed by [16], and use the average of the embeddings for the items within a basket to compute the next-item prediction for the basket. Second, building upon work on models for text generation, it is natural to leverage Recurrent Neural Networks (RNNs), particularly Long Short-Term Memory (LSTM) cells, as proposed by [12], or bi-directional LSTMs [9, 20]. Convolutional neural networks could also be used, for example by using a Text-CNN-like architecture as proposed by [13].

2.2 Negative sampling schemes for Word2Vec
When dealing with the training of multi-class models with thousands or millions of output classes, candidate sampling algorithms can speed up training by considering a small randomly-chosen subset of contrastive candidates for each batch of training examples. Ref. [11] introduced Noise Contrastive Estimation (NCE) as an unbiased estimator of the softmax loss, which has been proven to be efficient for learning word embeddings [17]. In [16], the authors propose negative sampling, which directly samples candidates from a noise distribution.

More recently, in [4], the authors provide an insightful analysis of negative sampling. They show that negative samples with high inner-product scores with a context word are more informative in terms of gradients on the loss function. Leveraging this analysis, the authors propose a dynamic sampling method based on inner-product rankings. This result can be intuitively interpreted by seeing that negative samples with a higher inner product will lead to a better approximation of the softmax.

In our setup, we simultaneously train two neural networks, and use the output distribution coming from one network to generate the negative samples for the second network. This adversarial negative sampling proves to be a dynamic and efficient way to improve the training of our system. This method echoes the recent work on Generative Adversarial Networks.
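As a point of reference for the sampling schemes discussed above, the following is a minimal NumPy sketch of the standard Skip-Gram negative-sampling objective of [16]. The function and variable names, the embedding dimensionality, and the Gaussian stand-ins for the embeddings of sampled noise items are illustrative assumptions, not part of any of the cited implementations.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_objective(w_target, w_context, w_noise):
    """Skip-gram negative-sampling objective (to be maximized):
    log sigma(w_t . w_c) + sum_i log sigma(-w_ni . w_c),
    where the noise items are drawn from a fixed noise distribution."""
    positive = np.log(sigmoid(w_target @ w_context))
    negative = np.sum(np.log(sigmoid(-(w_noise @ w_context))))
    return positive + negative

# Toy usage: 32-dimensional embeddings and 5 noise items.
rng = np.random.default_rng(0)
dim, k = 32, 5
w_t, w_c = rng.normal(size=dim), rng.normal(size=dim)
w_noise = rng.normal(size=(k, dim))  # stand-ins for the embeddings of sampled noise items
print(negative_sampling_objective(w_t, w_c, w_noise))
```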
2.3 GANs
First proposed by [8] in 2014, Generative Adversarial Networks (GANs) have been quite successful at generating realistic images. GANs can be viewed as a framework for training generative models by posing the training procedure as a minimax game between the generative and discriminative models. This adversarial learning framework bypasses many of the difficulties associated with maximum likelihood estimation (MLE), and has demonstrated impressive results in natural image generation tasks.

Theoretical work [1, 14, 18, 25] and practical studies [2, 5, 19] have stabilized GAN training and enabled many improvements in continuous spaces. However, in discrete settings a number of important limitations have prevented major breakthroughs. A major issue involves the complexity of backpropagating the gradients for the generative model. To bypass this differentiation problem, [3, 23] proposed a cost function inspired by reinforcement learning: the discriminator is used as a reward function and the generator is trained via policy gradient [22]. Additionally, the authors propose the use of importance sampling and variance reduction techniques to stabilize training.

3 OUR MODEL: GAN-WORD2VEC
We begin this section by formally defining the basket completion task. We denote Z as the set of all potential items, products, or words. Z^d is the space of all baskets of size d, and U = ∪_{d=1}^∞ Z^d is the space of all possible baskets of any size. The objective of GANs is to generate candidate, or synthetic, samples from a target data distribution p⋆, from which only true samples – in our case baskets of products – are available. Therefore, the collection of true baskets S is a subset of U. We are given each basket {X} ∈ S, and an element z ∈ Z randomly selected from {X}. Knowing {X\z}, which denotes basket X with item z removed, we want to be able to predict the missing item z. We denote P(Z) as the space of probability distributions defined on Z.

As in [16], we work with embeddings. For a given target item z, we denote C_z as its context, where C_z = {X\z}. The embeddings of z and C_z are w_z and w_{C_z}, respectively.

3.1 Notation
Both generators and discriminators have the form of a parametric family of functions from Z^{d'} to P(Z), where d' is the length of the prefix basket. We denote G = {G_θ}_{θ∈Θ} as the family of generator functions, with parameters Θ ⊂ R^p, and D = {D_α}_{α∈Λ} as the family of discriminator functions, with parameters Λ ⊂ R^q. Each function G_θ and D_α is intended to be applied to a d'-long random input basket. Both the generator and the discriminator output probabilities that are conditioned on an input basket. In our setting, G and D may therefore be from the same class of models, and may be parametrized by the same class of functions. Given the context C_z, we denote P_G(·|C_z) and P_D(·|C_z) as the conditional probabilities output by G and D, respectively. The samples drawn from P_G(·|C_z) are denoted as O.

3.2 Adversarial Negative Sampling Loss
In the usual GAN setting, D is trained with a binary cross-entropy loss function, with positives coming from p⋆ and negatives generated by G. In our setting, we modify the discriminator's loss using an extension of the standard approach to Negative Sampling, which has proven to be very efficient for language modeling. We define the Adversarial Negative Sampling Loss by the following objective:

    log σ(w_z^T w_{C_z}) + Σ_{i=1}^{k} E_{O_i ∼ P_G(·|C_z)} [ log σ(−w_{O_i}^T w_{C_z}) ]    (1)

where k is the number of negatives O_i sampled from P_G(·|C_z). As in the standard GAN setting, D's task is to learn to discriminate between true samples and synthetic samples coming from G. Compared to standard negative sampling, the main difference is that the negatives are now drawn from a dynamic distribution that is by design more informative than the fixed distribution used in standard negative sampling. Ref. [4] proposes a dynamic sampling strategy based on self-embedded features, but our approach is a fully adversarial sampling method that is not based on heuristics.
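To make Eq. (1) concrete, the sketch below computes the Adversarial Negative Sampling Loss in PyTorch for a Word2Vec-style model in which w_{C_z} is the average of the input embeddings of the items in C_z. The BasketModel architecture, the class and function names, and the toy catalog size are illustrative assumptions; this is a sketch of the objective, not the exact implementation used in our experiments.

```python
import torch
import torch.nn.functional as F

class BasketModel(torch.nn.Module):
    """Shared model class for G and D (Section 3.1): average the input embeddings
    of the items in the context basket and score every item in the catalog.
    The architecture is an illustrative assumption."""
    def __init__(self, n_items, dim):
        super().__init__()
        self.in_emb = torch.nn.Embedding(n_items, dim)   # used to build w_Cz
        self.out_emb = torch.nn.Embedding(n_items, dim)  # w_z and w_Oi

    def context_vector(self, basket):                    # basket: 1-D LongTensor of item ids
        return self.in_emb(basket).mean(dim=0)

    def item_probs(self, basket):                        # conditional P(.|Cz) over the catalog
        return torch.softmax(self.out_emb.weight @ self.context_vector(basket), dim=0)

def discriminator_adversarial_ns_loss(D, G, basket, target, k=5):
    """Eq. (1): log sigma(w_z . w_Cz) plus k generator-sampled negatives
    contributing log sigma(-w_Oi . w_Cz); the objective is to be maximized."""
    w_cz = D.context_vector(basket)
    pos = F.logsigmoid(D.out_emb(target) @ w_cz)
    with torch.no_grad():                                # negatives O_i come from G
        negatives = torch.multinomial(G.item_probs(basket), k, replacement=True)
    neg = F.logsigmoid(-(D.out_emb(negatives) @ w_cz)).sum()
    return pos + neg

# Toy usage with a catalog of 1,000 items.
G, D = BasketModel(1000, 32), BasketModel(1000, 32)
basket = torch.tensor([3, 17, 256])   # context C_z = {X \ z}
target = torch.tensor(42)             # held-out item z
print(discriminator_adversarial_ns_loss(D, G, basket, target).item())
```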
3.3 Training the Generator
In discrete settings, training G using the standard GAN architecture is not feasible, due to discontinuities in the space of discrete data that prevent updates of the generator. To address this issue, we sample potential outcomes from P_G(·|C_z), and use D as a reward function on these outcomes.

In our case, the training of the generator has been inspired by [3]. In that paper, the authors define two main loss functions for training the generator. The first loss, called the basic MALIGAN, only uses signals from D. Adapted to our setting, we have the following formulation for the generator's loss:

    L_G(θ) = Σ_{i=1}^{m} ( r_D(O_i) / Σ_{j=1}^{m} r_D(O_j) ) log P_G(O_i|C_z)    (2)

where:
• O_i ∼ P_G(·|C_z). That is, we draw negative samples in the form of "next-item" samples, where only the missing element z for each basket {X} is sampled. This missing element is sampled by conditioning P_G on the context C_z for z.
• r_D(O_i) = p_D(O_i|C_z) / (1 − p_D(O_i|C_z)). This term allows us to incorporate reward from the discriminator into updates for the generator. The expression for r_D comes from a property of the optimal D for a GAN; a full explanation is provided in [3].
• m is the number of negatives sampled from P_G(·|C_z) used to compute the gradients.

Unlike [3], we do not use any b parameter in Eq. (2), as we did not observe the need for the further variance reduction provided by this term in our applications. As both models output probability distributions, computing P_G(O_i|C_z) and r_D(O_i) is straightforward.
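The following PyTorch sketch implements the importance-weighted surrogate of Eq. (2), taking the two conditional distributions P_G(·|C_z) and P_D(·|C_z) as inputs. The function name, the default number of samples m, and the random distributions in the usage example are illustrative assumptions.

```python
import torch

def maligan_generator_loss(p_g, p_d, m=16):
    """Eq. (2): sample m candidates O_i from P_G(.|Cz), weight each by the
    normalized reward r_D(O_i) = p_D(O_i|Cz) / (1 - p_D(O_i|Cz)), and return
    sum_i weight_i * log P_G(O_i|Cz) as a surrogate to be maximized."""
    with torch.no_grad():
        samples = torch.multinomial(p_g, m, replacement=True)  # O_i ~ P_G(.|Cz)
        r = p_d[samples] / (1.0 - p_d[samples])                # reward from D
        weights = r / r.sum()                                  # normalized r_D
    log_pg = torch.log(p_g[samples])                           # keeps gradients w.r.t. G
    return (weights * log_pg).sum()

# Toy usage with random distributions over a catalog of 1,000 items;
# in the full model these would come from G.item_probs and D.item_probs.
p_g = torch.softmax(torch.randn(1000), dim=0)
p_d = torch.softmax(torch.randn(1000), dim=0)
print(maligan_generator_loss(p_g, p_d).item())
```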
The second loss, Mixed MLE-MALIGAN, mixes the adversarial loss and the standard maximum likelihood estimation (MLE) loss. In our case, this loss mixes the adversarial training loss and a standard sampled softmax loss (negative sampling loss):

    L_G(θ) = 0.5 · Σ_{i=1}^{m} ( r_D(O_i) / Σ_{j=1}^{m} r_D(O_j) ) log P_G(O_i|C_z)
             + 0.5 · ( log σ(w_z^T w_{C_z}) + Σ_{i=1}^{k} E_{N_i ∼ U(Z)} [ log σ(−w_{N_i}^T w_{C_z}) ] )    (3)

where N_i ∼ U(Z) are negatives uniformly sampled among the potential next items. We empirically find that this mixed loss provides more stable gradients than the loss in Eq. (2), leading to faster convergence during training.

A description of our algorithm can be found in Algorithm 1. We pre-train both G and D using a standard negative sampling loss before training these components adversarially. We empirically show improvements with this procedure in the following section.

Algorithm 1 GAN-Word2Vec
Require: generator policy G_θ; discriminator D_α; a basket dataset S
  Initialize G_θ, D_α with random weights θ, α.
  Pre-train G_θ and D_α using sampled softmax on S
  repeat
    for g-steps do
      Get random true sub-baskets {X\z} and the targets z.
      Generate negatives by sampling from P_G(·|{X\z})
      Update generator parameters via policy gradient, Eq. (3)
    end for
    for d-steps do
      Get random true sub-baskets {X\z} and the targets z.
      Generate adversarial samples for D from P_G(·|{X\z})
      Train discriminator D_α by Eq. (1)
    end for
  until GAN-Word2Vec converges
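The sketch below compresses Algorithm 1 into a PyTorch training loop, reusing BasketModel, discriminator_adversarial_ns_loss, and maligan_generator_loss from the earlier sketches. The pre-training step is omitted, and the optimizer, learning rate, step counts, unbatched per-basket updates, and the choice of the held-out item are illustrative assumptions rather than the settings used in our experiments.

```python
import torch
import torch.nn.functional as F

def mixed_generator_loss(G, D, basket, target, m=16, k=5):
    """Eq. (3): 0.5 * MALIGAN term (Eq. 2) + 0.5 * standard negative-sampling
    term, with the NS negatives N_i drawn uniformly from the catalog."""
    adv = maligan_generator_loss(G.item_probs(basket), D.item_probs(basket).detach(), m)
    w_cz = G.context_vector(basket)
    uniform_neg = torch.randint(0, G.out_emb.num_embeddings, (k,))  # N_i ~ U(Z)
    ns = F.logsigmoid(G.out_emb(target) @ w_cz) \
         + F.logsigmoid(-(G.out_emb(uniform_neg) @ w_cz)).sum()
    return 0.5 * adv + 0.5 * ns

def train_gan_word2vec(G, D, baskets, epochs=5, g_steps=1, d_steps=1, lr=1e-3):
    """Alternating updates of Algorithm 1; `baskets` is an iterable of 1-D
    LongTensors with at least two items each."""
    opt_g = torch.optim.Adam(G.parameters(), lr=lr)
    opt_d = torch.optim.Adam(D.parameters(), lr=lr)
    for _ in range(epochs):
        for basket in baskets:
            target, context = basket[-1], basket[:-1]   # hold out one item as z
            for _ in range(g_steps):                    # g-steps: maximize Eq. (3)
                opt_g.zero_grad()
                (-mixed_generator_loss(G, D, context, target)).backward()
                opt_g.step()
            for _ in range(d_steps):                    # d-steps: maximize Eq. (1)
                opt_d.zero_grad()
                (-discriminator_adversarial_ns_loss(D, G, context, target)).backward()
                opt_d.step()
```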

4 EXPERIMENTS
All our experiments were run on the task of basket completion, which is a well-known recommendation task.

4.1 Datasets
In [6] and [7], the authors present state-of-the-art results on basket completion datasets. We performed our experiments on two of the datasets used in this prior work: the Amazon Baby Registries and the Belgian Retail datasets.
• Amazon Baby Registries - This public dataset consists of registries of baby products from 15 different categories (such as 'feeding', 'diapers', 'toys', etc.), where the item catalog and registries for each category are disjoint. Each category therefore provides a small dataset, with a maximum of 15,000 purchased baskets per category. We use a random split of 80% of the data for training and 20% for testing.
• Belgian Retail Supermarket - This is a public dataset composed of shopping baskets purchased over three non-consecutive time periods from a Belgian retail supermarket store. There are 88,163 purchased baskets, with a catalog of 16,470 unique items. We use a random split of 80% of the data for training and 20% for testing.

4.2 Task definition and associated metrics
In the following evaluation, we consider two metrics; a short computational sketch of both metrics follows the definitions below.
• Mean Percentile Rank (MPR) - For a basket {X} and one item z randomly removed from this basket, we rank all potential items i from the candidate set Z according to their probabilities of completing {X\z}, which are P_G(i|{X\z}) and P_D(i|{X\z}). The Percentile Rank (PR) of the missing item z is defined by:

    PR_j = ( Σ_{j'∈Z} I(p_j ≥ p_{j'}) / |Z| ) × 100%

where I is the indicator function and |Z| is the number of items in the candidate set. The Mean Percentile Rank (MPR) is the average PR over all the instances in the test set T:

    MPR = ( Σ_{t∈T} PR_t ) / |T|

MPR = 100 means that the held-out item for every test instance is placed at the head of the ranked list of predictions, while MPR = 50 is equivalent to random selection.
• Precision@k - We define this metric as

    precision@k = ( Σ_{t∈T} I[rank_t ≤ k] ) / |T|

where rank_t is the predicted rank of the held-out item for test instance t. In other words, precision@k is the fraction of instances in the test set for which the predicted rank of the held-out item falls within the top k predictions.
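The following NumPy sketch computes both metrics from a matrix of completion scores, one row per test instance. The tie-handling convention (ties counted in favor of the held-out item for PR, and rank defined as one plus the number of strictly higher-scoring items for precision@k) is an assumption made here for illustration.

```python
import numpy as np

def percentile_rank(scores, target_idx):
    """PR of the held-out item: percentage of catalog items whose score does
    not exceed the target's score (100 = ranked first, ~50 = random)."""
    return 100.0 * np.mean(scores <= scores[target_idx])

def mpr(all_scores, all_targets):
    """Mean Percentile Rank over the test set."""
    return float(np.mean([percentile_rank(s, t) for s, t in zip(all_scores, all_targets)]))

def precision_at_k(all_scores, all_targets, k=1):
    """Fraction of test instances whose held-out item is ranked within the top k."""
    hits = [np.sum(s > s[t]) < k for s, t in zip(all_scores, all_targets)]
    return float(np.mean(hits))

# Toy usage: 3 test baskets, a catalog of 10 items, random completion scores
# standing in for P_G(i | X \ z) or P_D(i | X \ z).
rng = np.random.default_rng(0)
scores = rng.random((3, 10))
targets = np.array([4, 1, 7])   # indices of the held-out items z
print(mpr(scores, targets), precision_at_k(scores, targets, k=1))
```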

4.3 Experimental results
We compare our GAN-Word2Vec model with Word2Vec models trained using classical loss functions, including the Noise Contrastive Estimation loss (NCE) [11, 17] and the Negative Sampling loss (NEG) [16]. We observe that we obtain better results with the Mixed Loss of Eq. (3). We also find that pre-training both G and D with a Negative Sampling loss leads to better predictive quality for GAN-Word2Vec.

After pre-training, we train G and D using Eq. (3) and Eq. (1), respectively. We observe that the discriminator initially benefits from adversarial sampling, and its performance on both MPR and precision@1 increases. However, after convergence, the generator ultimately provides better performance than the discriminator on both metrics. We conjecture that this may be explained by the fact that basket completion is a generative task.

Table 1: One-item basket completion task on the Amazon and Belgian retail datasets.

    Method       Precision@1      MPR
    Amazon dataset
    W2V-NCE      14.80 ± 0.07     80.15 ± 0.05
    W2V-NEG      15.40 ± 0.05     80.20 ± 0.07
    W2V-GANs     16.30 ± 0.08     80.50 ± 0.1
    Belgian retail dataset
    W2V-NCE      29.50 ± 0.05     87.54 ± 0.04
    W2V-NEG      34.35 ± 0.07     88.55 ± 0.05
    W2V-GANs     35.82 ± 0.09     89.45 ± 0.1

From Table 1, we see that our GAN-Word2Vec model consistently provides statistically significant improvements over the Word2Vec baseline models on both the Precision@1 and MPR metrics. As suggested by the experiments, we expect our method to be more effective on larger datasets. We also see that Word2Vec trained using Negative Sampling (W2V-NEG) is generally a stronger baseline than Word2Vec trained via NCE.

5 CONCLUSIONS
In this paper, we have proposed a new adversarial negative sampling algorithm suitable for models such as Word2Vec. Based on recent progress made on GANs in discrete data settings, our solution eliminates much of the complexity of implementing a generative adversarial structure for such models. In particular, our adversarial training approach can be easily applied to models that use standard sampled softmax training, where the generator and discriminator can be of the same family of models.

Regarding future work, we plan to investigate the effectiveness of this training procedure on other models. It is possible that models with more capacity than Word2Vec could benefit even more from using softmax with the adversarial negative sampling loss structure that we have proposed. Therefore, we plan to test this procedure on models such as Text-CNNs, RNNs, and determinantal point processes (DPPs) [6, 7], which are known to be effective in modeling discrete set structures.

GANs have proven to be quite effective in conjunction with deep neural networks when applied to image generation. In this work, we have shown that adversarial training can also be applied to simpler models, in discrete settings, and can bring statistically significant improvements in predictive quality.

REFERENCES
[1] M. Arjovsky and L. Bottou. 2017. Towards principled methods for training generative adversarial networks. In International Conference on Learning Representations.
[2] M. Arjovsky, S. Chintala, and L. Bottou. 2017. Wasserstein generative adversarial networks. In Proceedings of the 34th International Conference on Machine Learning, D. Precup and Y. Whye Teh (Eds.). Vol. 70. Proceedings of Machine Learning Research, 214–223.
[3] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. 2017. Maximum-Likelihood Augmented Discrete Generative Adversarial Networks. CoRR abs/1702.07983 (2017). arXiv:1702.07983 http://arxiv.org/abs/1702.07983
[4] Long Chen, Fajie Yuan, Joemon M. Jose, and Weinan Zhang. 2018. Improving Negative Sampling for Word Representation using Self-embedded Features. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 99–107.
[5] G.K. Dziugaite, D.M. Roy, and Z. Ghahramani. 2015. Training generative neural networks via maximum mean discrepancy optimization. In Proceedings of the Thirty-First Conference on Uncertainty in Artificial Intelligence. AUAI Press, Arlington, 258–267.
[6] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. 2016. Bayesian low-rank determinantal point processes. In Proceedings of the 10th ACM Conference on Recommender Systems. ACM, 349–356.
[7] Mike Gartrell, Ulrich Paquet, and Noam Koenigstein. 2017. Low-Rank Factorization of Determinantal Point Processes. In AAAI. 1912–1918.
[8] I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. 2014. Generative adversarial nets. In Advances in Neural Information Processing Systems 27, Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger (Eds.). Curran Associates, Inc., Red Hook, 2672–2680.
[9] Alex Graves, Santiago Fernández, and Jürgen Schmidhuber. 2005. Bidirectional LSTM networks for improved phoneme classification and recognition. In International Conference on Artificial Neural Networks. Springer, 799–804.
[10] Mihajlo Grbovic, Vladan Radosavljevic, Nemanja Djuric, Narayan Bhamidipati, Jaikit Savla, Varun Bhagwan, and Doug Sharp. 2015. E-commerce in your inbox: Product recommendations at scale. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1809–1818.
[11] M. Gutmann and A. Hyvärinen. 2010. Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. JMLR WCP, Vol. 9. Journal of Machine Learning Research - Proceedings Track, 297–304.
[12] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long Short-Term Memory. Neural Computation 9, 8 (1997), 1735–1780.
[13] Yoon Kim. 2014. Convolutional Neural Networks for Sentence Classification. CoRR abs/1408.5882 (2014). arXiv:1408.5882 http://arxiv.org/abs/1408.5882
[14] S. Liu, O. Bousquet, and K. Chaudhuri. 2017. Approximation and convergence properties of generative adversarial learning. In Advances in Neural Information Processing Systems 30, I. Guyon, U.V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook, 5551–5559.
[15] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[16] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems. 3111–3119.
[17] Andriy Mnih and Yee Whye Teh. 2012. A fast and simple algorithm for training neural probabilistic language models. arXiv preprint arXiv:1206.6426 (2012).
[18] S. Nowozin, B. Cseke, and R. Tomioka. 2016. f-GAN: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems 29, D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook, 271–279.
[19] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen. 2016. Improved techniques for training GANs. In Advances in Neural Information Processing Systems 29, D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett (Eds.). Curran Associates, Inc., Red Hook, 2234–2242.
[20] Mike Schuster and Kuldip K. Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing 45, 11 (1997), 2673–2681.
[21] Noam Shazeer, Ryan Doherty, Colin Evans, and Chris Waterson. 2016. Swivel: Improving Embeddings by Noticing What's Missing. CoRR abs/1602.02215 (2016). arXiv:1602.02215 http://arxiv.org/abs/1602.02215
[22] Richard S. Sutton, David A. McAllester, Satinder P. Singh, and Yishay Mansour. 2000. Policy gradient methods for reinforcement learning with function approximation. In Advances in Neural Information Processing Systems. 1057–1063.
[23] Lantao Yu, Weinan Zhang, Jun Wang, and Yong Yu. 2016. SeqGAN: Sequence Generative Adversarial Nets with Policy Gradient. CoRR abs/1609.05473 (2016). arXiv:1609.05473 http://arxiv.org/abs/1609.05473
[24] Manzil Zaheer, Satwik Kottur, Siamak Ravanbakhsh, Barnabás Póczos, Ruslan Salakhutdinov, and Alexander J. Smola. 2017. Deep Sets. CoRR abs/1703.06114 (2017). arXiv:1703.06114 http://arxiv.org/abs/1703.06114
[25] P. Zhang, Q. Liu, D. Zhou, T. Xu, and X. He. 2018. On the Discrimination-Generalization Tradeoff in GANs. In International Conference on Learning Representations.