Unbiased Auxiliary Classifier GANs with MINE

Ligong Han, Anastasis Stathopoulos, Tao Xue, Dimitris Metaxas
Rutgers University

Abstract

Auxiliary Classifier GANs (AC-GANs) [15] are widely used conditional generative models and are capable of generating high-quality images. Previous work [18] has pointed out that AC-GAN learns a biased distribution. To remedy this, Twin Auxiliary Classifier GAN (TAC-GAN) [5] introduces a twin classifier to the min-max game. However, it has been reported that using a twin auxiliary classifier may cause instability in training. To this end, we propose an Unbiased Auxiliary Classifier GAN (UAC-GAN) that utilizes the Mutual Information Neural Estimator (MINE) [2] to estimate the mutual information between the generated data distribution and labels. To further improve performance, we also propose a novel projection-based statistics network architecture for MINE.* Experimental results on three datasets, Mixture of Gaussian (MoG), MNIST [12], and CIFAR10 [11], show that our UAC-GAN performs better than AC-GAN and TAC-GAN. Code can be found on the project website.†

* This is an extended version of a CVPRW'20 workshop paper with the same title. In the current version the projection form of MINE is detailed.
† https://github.com/phymhan/ACGAN-PyTorch

1. Introduction

Generative Adversarial Networks (GANs) [6] are generative models that can be used to sample from high-dimensional non-parametric distributions, such as natural images or videos. Conditional GANs [13] are an extension of GANs that utilizes label information to enable sampling from the class-conditional data distribution. Class-conditional sampling can be achieved by either (1) conditioning the discriminator directly on labels [13, 9, 14], or (2) incorporating an additional classification loss in the training objective [15]. The latter approach originates in Auxiliary Classifier GAN (AC-GAN) [15].

Despite its simplicity and popularity, AC-GAN is reported to produce less diverse data samples [18, 14]. This phenomenon is formally discussed in Twin Auxiliary Classifier GAN (TAC-GAN) [5]. The authors of TAC-GAN reveal that, due to a missing negative conditional entropy term in the objective of AC-GAN, it does not exactly minimize the divergence between the real and fake conditional distributions. TAC-GAN proposes to estimate this missing term by introducing an additional classifier in the min-max game. However, it has also been reported that using such twin auxiliary classifiers might result in unstable training [10].

In this paper, we propose to incorporate the negative conditional entropy in the min-max game by directly estimating the mutual information between generated data and labels. The resulting method enjoys the same theoretical guarantees as TAC-GAN and avoids the instability caused by using a twin auxiliary classifier. We term the proposed method UAC-GAN because (1) it learns an Unbiased distribution, and (2) MINE [2] relates to Unnormalized bounds [16]. Finally, our method demonstrates superior performance compared to AC-GAN and TAC-GAN on 1-D mixture of Gaussian synthetic data, MNIST [12], and the CIFAR10 [11] dataset.

2. Related Work

Learning unbiased AC-GANs. In CausalGAN [10], the authors incorporate a binary Anti-Labeler in AC-GAN and theoretically show its necessity for the generator to learn the true class-conditional data distributions. The Anti-Labeler is similar to the twin auxiliary classifier in TAC-GAN, but it is used only for binary classification. Shu et al. [18] formulate the AC-GAN objective as the Lagrangian of a constrained optimization problem and show that AC-GAN tends to push the data points away from the decision boundary of the auxiliary classifiers. TAC-GAN [5] builds on the insights of [18] and shows that the bias in AC-GAN is caused by a missing negative conditional entropy term. In addition, [5] proposes to make AC-GAN unbiased by introducing a twin auxiliary classifier that competes in an adversarial game with the generator. TAC-GAN can be considered a generalization of CausalGAN's Anti-Labeler to the multi-class setting.

Mutual information estimation. Learning a twin auxiliary classifier is essentially estimating the mutual information between generated data and labels. We refer readers to [16] for a comprehensive review of variational mutual information estimators. In this paper, we employ the Mutual Information Neural Estimator (MINE) [2].

3. Background

3.1. Bias in Auxiliary Classifier GANs

First, we review AC-GAN [15] and the analysis in [5, 18] to show why AC-GAN learns a biased distribution. AC-GAN introduces an auxiliary classifier C and optimizes the following objective:

\min_{G,C} \max_{D} \; L_{\mathrm{AC}}(G, C, D) =
    \underbrace{\mathbb{E}_{x \sim P_X}[\log D(x)] + \mathbb{E}_{z \sim P_Z,\, y \sim P_Y}[\log(1 - D(G(z, y)))]}_{(a)}
    \underbrace{-\, \mathbb{E}_{x, y \sim P_{XY}}[\log C(x, y)]}_{(b)}
    \underbrace{-\, \mathbb{E}_{z \sim P_Z,\, y \sim P_Y}[\log C(G(z, y), y)]}_{(c)},   (1)

where (a) is the value function of a vanilla GAN, and (b), (c) correspond to the cross-entropy classification error on real and fake data samples, respectively. Let Q^c_{Y|X} denote the conditional distribution induced by C. As pointed out in [5], adding the data-dependent negative conditional entropy -H_P(Y|X) to (b) yields the Kullback-Leibler (KL) divergence between P_{Y|X} and Q^c_{Y|X},

-H_P(Y|X) + (b) = \mathbb{E}_{x \sim P_X} D_{\mathrm{KL}}(P_{Y|X} \,\|\, Q^c_{Y|X}).   (2)

Similarly, adding a term -H_Q(Y|X) to (c) yields the KL-divergence between Q_{Y|X} and Q^c_{Y|X},

-H_Q(Y|X) + (c) = \mathbb{E}_{x \sim Q_X} D_{\mathrm{KL}}(Q_{Y|X} \,\|\, Q^c_{Y|X}).   (3)

As illustrated above, if we were to optimize (2) and (3), the generated data posterior Q_{Y|X} and the real data posterior P_{Y|X} would effectively be chained together by the two KL-divergence terms. However, H_Q(Y|X) cannot be treated as a constant when updating G. Thus, to make the original AC-GAN unbiased, the term -H_Q(Y|X) has to be added to the objective function. Without this term, the generator tends to generate data points that lie away from the decision boundary of C, and thus learns a biased (degenerate) distribution. Intuitively, minimizing -H_Q(Y|X) over G forces the generator to generate diverse samples with high (conditional) entropy.
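For concreteness, the following is a minimal PyTorch sketch of the losses in (1), written for the common setting in which the auxiliary classifier C is an extra output head of the discriminator network. The names `netG(z, y)` and `netD(x)` (the latter returning an adversarial logit and class logits) are illustrative assumptions, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def d_step(netD, netG, x_real, y_real, z, y_fake):
    """Discriminator/classifier update: term (a) plus the real-data classification term (b)."""
    adv_real, cls_real = netD(x_real)
    with torch.no_grad():
        x_fake = netG(z, y_fake)
    adv_fake, _ = netD(x_fake)
    # (a): vanilla GAN value function, written as a minimization for D
    loss_adv = F.binary_cross_entropy_with_logits(adv_real, torch.ones_like(adv_real)) \
             + F.binary_cross_entropy_with_logits(adv_fake, torch.zeros_like(adv_fake))
    # (b): cross-entropy classification error on real samples
    loss_cls_real = F.cross_entropy(cls_real, y_real)
    return loss_adv + loss_cls_real

def g_step(netD, netG, z, y_fake):
    """Generator update: non-saturating adversarial loss plus the fake-data classification term (c)."""
    x_fake = netG(z, y_fake)
    adv_fake, cls_fake = netD(x_fake)
    loss_adv = F.binary_cross_entropy_with_logits(adv_fake, torch.ones_like(adv_fake))
    # (c): cross-entropy on fake samples; the absence of its -H_Q(Y|X) counterpart
    # is exactly the source of the bias discussed above
    loss_cls_fake = F.cross_entropy(cls_fake, y_fake)
    return loss_adv + loss_cls_fake
```

Sharing all layers between D and C except the two output heads is the usual implementation choice, which is why the sketch reads both heads from a single `netD` call.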
3.2. Twin Auxiliary Classifier GANs

Twin Auxiliary Classifier GAN (TAC-GAN) [5] tries to estimate H_Q(Y|X) by introducing another auxiliary classifier C^{mi}. First, notice that the mutual information can be decomposed in two symmetrical forms,

I_Q(X; Y) = H(Y) - H_Q(Y|X) = H_Q(X) - H_Q(X|Y).

Herein, the subscript Q denotes the corresponding distribution Q induced by G. Since H(Y) is constant, optimizing -H_Q(Y|X) is equivalent to optimizing I_Q(X; Y). TAC-GAN shows that when Y is uniform, the latter form of I_Q can be written as the Jensen-Shannon divergence (JSD) between the conditionals {Q_{X|Y=1}, ..., Q_{X|Y=K}}. Finally, TAC-GAN introduces the following min-max game

\min_{G} \max_{C^{mi}} \; V_{\mathrm{TAC}}(G, C^{mi}) = \mathbb{E}_{z \sim P_Z,\, y \sim P_Y}[\log C^{mi}(G(z, y), y)],   (4)

to minimize the JSD between the multiple conditional distributions. The overall objective is

\min_{G,C} \max_{D, C^{mi}} \; L_{\mathrm{TAC}}(G, D, C, C^{mi}) = L_{\mathrm{AC}} + \underbrace{V_{\mathrm{TAC}}}_{(d)}.   (5)

3.3. Insights on Twin Auxiliary Classifier GANs

TAC-GAN from a variational perspective. Training the twin auxiliary classifier minimizes the label reconstruction error on fake data, as in InfoGAN [3]. Thus, when optimizing over G, TAC-GAN minimizes a lower bound of the mutual information. To see this,

V_{\mathrm{TAC}} = \mathbb{E}_{x, y \sim Q_{XY}}[\log C^{mi}(x, y)]
    = \mathbb{E}_{x \sim Q_X} \mathbb{E}_{y \sim Q_{Y|X}}\left[\log \frac{Q^{mi}(y|x)}{Q(y|x)}\, Q(y|x)\right]
    = \mathbb{E}_{x \sim Q_X} \mathbb{E}_{y \sim Q_{Y|X}}[\log Q(y|x)] - \mathbb{E}_{x \sim Q_X} D_{\mathrm{KL}}(Q_{Y|X} \,\|\, Q^{mi}_{Y|X})
    \le -H_Q(Y|X).   (6)

The above shows that (d) is a lower bound of -H_Q(Y|X). The bound is tight when the classifier C^{mi} learns the true posterior Q_{Y|X} on fake data. However, minimizing a lower bound might be problematic in practice. Indeed, previous literature [10] has reported unstable training behavior when using an adversarial twin auxiliary classifier in AC-GAN.

TAC-GAN as a generalized CausalGAN. A binary version of the twin auxiliary classifier has been introduced as the Anti-Labeler in CausalGAN [10] to tackle the issue of label-conditioned mode collapse. As pointed out in [10], the use of the Anti-Labeler brings practical challenges for gradient-based training. Specifically, (1) in the early stage, the Anti-Labeler quickly minimizes its loss if the generator exhibits label-conditioned mode collapse, and (2) in the later stage, as the generator produces more and more realistic images, the Anti-Labeler behaves more and more like the Labeler (the other auxiliary classifier). Therefore, maximizing the Anti-Labeler loss and minimizing the Labeler loss become contradicting tasks, which ends up in unstable training. To account for this, CausalGAN adds an exponentially decaying weight before the Anti-Labeler loss term (or (d) in (5) when optimizing G).
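To make the extra min-max game in (4) and (5) concrete, here is a minimal PyTorch sketch of the two updates TAC-GAN adds on top of the AC-GAN sketch above. The twin auxiliary classifier `netCmi` is a hypothetical module returning class logits; the sign of the twin term in the generator loss reflects that G minimizes V_TAC while C^{mi} maximizes it.

```python
import torch
import torch.nn.functional as F

def cmi_step(netCmi, netG, z, y_fake):
    """Twin classifier update: maximize E[log C^mi(G(z, y), y)],
    i.e. minimize cross-entropy on fake samples."""
    with torch.no_grad():
        x_fake = netG(z, y_fake)
    return F.cross_entropy(netCmi(x_fake), y_fake)

def g_step_tac(netD, netCmi, netG, z, y_fake):
    """Generator update for Eq. (5): the L_AC terms plus V_TAC, which G minimizes."""
    x_fake = netG(z, y_fake)
    adv_fake, cls_fake = netD(x_fake)
    loss_adv = F.binary_cross_entropy_with_logits(adv_fake, torch.ones_like(adv_fake))
    loss_cls = F.cross_entropy(cls_fake, y_fake)   # term (c) of L_AC
    # term (d): minimizing V_TAC = E[log C^mi] means maximizing the twin cross-entropy,
    # hence the negative sign; G plays against C^mi here
    loss_twin = -F.cross_entropy(netCmi(x_fake), y_fake)
    return loss_adv + loss_cls + loss_twin
```

This adversarial twin term is precisely the component that [10] and Section 3.3 identify as a source of training instability, which motivates replacing it with a direct mutual information estimate in the next section.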
4.1. Mutual Information Neural Estimator

The mutual information I_Q(X; Y) is equal to the KL-divergence between the joint Q_{XY} and the product of the marginals Q_X ⊗ Q_Y (here we denote Q_Y = P_Y for a consistent and general notation),

I_Q(X; Y) = D_{\mathrm{KL}}(Q_{XY} \,\|\, Q_X \otimes Q_Y).   (7)

MINE is built on top of the bound of Donsker and Varadhan [4] for the KL-divergence between distributions P and Q,

D_{\mathrm{KL}}(P \,\|\, Q) = \sup_{T} \; \mathbb{E}_P[T] - \log \mathbb{E}_Q[e^{T}],

where the supremum is taken over all functions T for which the two expectations are finite.
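As an illustration, below is a minimal, generic MINE sketch in PyTorch: a statistics network T(x, y) is trained to maximize the Donsker-Varadhan lower bound E_{Q_XY}[T] - log E_{Q_X ⊗ Q_Y}[e^T], with samples from the product of marginals obtained by shuffling labels within a batch. The concatenation-based architecture, layer sizes, and names are assumptions for illustration; this is not the projection-based statistics network proposed in the paper.

```python
import math
import torch
import torch.nn as nn

class StatisticsNetwork(nn.Module):
    """A generic T(x, y): image features concatenated with a label embedding."""
    def __init__(self, feat_dim, num_classes, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(num_classes, feat_dim)
        self.net = nn.Sequential(
            nn.Linear(2 * feat_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x_feat, y):
        return self.net(torch.cat([x_feat, self.embed(y)], dim=1))

def dv_lower_bound(T, x_feat, y):
    """Donsker-Varadhan estimate of I(X; Y) on a batch:
    E_joint[T(x, y)] - log E_marginal[exp(T(x, y'))], with y' a shuffled copy of y."""
    y_shuffled = y[torch.randperm(y.size(0), device=y.device)]
    t_joint = T(x_feat, y).mean()
    t_marginal = torch.logsumexp(T(x_feat, y_shuffled).squeeze(1), dim=0) - math.log(y.size(0))
    return t_joint - t_marginal
```

The statistics network ascends this bound, and the resulting estimate of I_Q(X; Y), which equals -H_Q(Y|X) up to the constant H(Y), can then be added to the generator objective in place of TAC-GAN's V_TAC term.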