JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015

A Review on Generative Adversarial Networks: Algorithms, Theory, and Applications
Jie Gui, Zhenan Sun, Yonggang Wen, Dacheng Tao, Jieping Ye
Abstract—Generative adversarial networks (GANs) have recently become a hot research topic. GANs have been studied extensively since 2014, and a large number of algorithms have been proposed. However, there are few comprehensive studies that explain the connections among different GAN variants and how they have evolved. In this paper, we attempt to provide a review of the various GAN methods from the perspectives of algorithms, theory, and applications. First, the motivations, mathematical representations, and structures of most GAN algorithms are introduced in detail. Furthermore, GANs have been combined with other machine learning algorithms for specific applications, such as semi-supervised learning, transfer learning, and reinforcement learning. This paper compares the commonalities and differences among these GAN methods. Second, theoretical issues related to GANs are investigated. Third, typical applications of GANs in image processing and computer vision, natural language processing, music, speech and audio, the medical field, and data science are illustrated. Finally, future open research problems for GANs are pointed out.
Index Terms—Deep Learning, Generative Adversarial Networks, Algorithm, Theory, Applications.
1 INTRODUCTION
Generative adversarial networks (GANs) have become a hot research topic recently. Yann LeCun, a legend in deep learning, said in a Quora post that "GANs are the most interesting idea in the last 10 years in machine learning." There are a large number of papers related to GANs according to Google Scholar. For example, there were about 11,800 papers related to GANs in 2018. That is to say, there were about 32 papers every day, and more than one paper every hour, related to GANs in 2018.

GANs consist of two models: a generator and a discriminator. These two models are typically implemented by neural networks, but they can be implemented with any form of differentiable system that maps data from one space to the other. The generator tries to capture the distribution of true examples in order to generate new data examples. The discriminator is usually a binary classifier, discriminating generated examples from true examples as accurately as possible. The optimization of GANs is a minimax optimization problem. The optimization terminates at a saddle point that is a minimum with respect to the generator and a maximum with respect to the discriminator. That is, the optimization goal is to reach a Nash equilibrium [1]. Then, the generator can be thought to have captured the real distribution of true examples.

Some previous work has adopted the concept of making two neural networks compete with each other. The most relevant work is predictability minimization [2]. The connections between predictability minimization and GANs can be found in [3], [4].

The popularity and importance of GANs have led to several previous reviews. The differences from previous work are summarized in the following.

1) GANs for specific applications: There are surveys of using GANs for specific applications such as image synthesis and editing [5] and audio enhancement and synthesis [6].
2) General surveys: The earliest relevant review was probably the paper by Wang et al. [7], which mainly introduced the progress of GANs before 2017. References [8], [9] mainly introduced the progress of GANs prior to 2018. Reference [10] introduced the architecture-variants and loss-variants of GANs related only to computer vision. Other related work can be found in [11]–[13].

As far as we know, this paper is the first to provide a comprehensive survey on GANs from the algorithm, theory, and application perspectives that introduces the latest progress. Furthermore, our paper focuses not only on applications in image processing and computer vision, but also on sequential data such as natural language processing, and on other related areas such as the medical field.

The remainder of this paper is organized as follows: related work is discussed in Section 2. Sections 3-5 introduce GANs from the algorithm, theory, and application perspectives, respectively. Tables 1 and 2 show the GAN algorithms and applications that will be discussed in Sections 3 and 5, respectively. Open research problems are discussed in Section 6, and Section 7 concludes the survey.

J. Gui is with the Department of Computational Medicine and Bioinformatics, University of Michigan, USA (e-mail: [email protected]).
Z. Sun is with the Center for Research on Intelligent Perception and Computing, Chinese Academy of Sciences, Beijing 100190, China (e-mail: [email protected]).
Y. Wen is with the School of Computer Science and Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected]).
D. Tao is with the UBTECH Sydney Artificial Intelligence Center and the School of Information Technologies, the Faculty of Engineering and Information Technologies, the University of Sydney, Australia (e-mail: [email protected]).
J. Ye is with DiDi AI Labs, P.R. China and University of Michigan, Ann Arbor (e-mail: [email protected]).

arXiv:2001.06937v1 [cs.LG] 20 Jan 2020

2 RELATED WORK

GANs belong to generative algorithms. Generative algorithms and discriminative algorithms are two categories of

TABLE 1: An overview of GANs' algorithms discussed in Section 3
GANs' representative variants:
  InfoGAN [14], cGANs [15], CycleGAN [16], f-GAN [17], WGAN [18], WGAN-GP [19], LS-GAN [20]
GANs training
  Objective function: LSGANs [21], [22], hinge loss based GAN [23]–[25], MDGAN [26], unrolled GAN [27], SN-GANs [23], RGANs [28]
  Skills: ImprovedGANs [29], AC-GAN [30]
  Structure: LAPGAN [31], DCGANs [32], PGGAN [33], StackedGAN [34], SAGAN [35], BigGANs [36], StyleGAN [37], hybrids of autoencoders and GANs (EBGAN [38], BEGAN [39], BiGAN [40]/ALI [41], AGE [42]), multi-discriminator learning (D2GAN [43], GMAN [44]), multi-generator learning (MGAN [45], MAD-GAN [46]), multi-GAN learning (CoGAN [47])
Task driven GANs
  Semi-supervised learning: CatGANs [48], feature matching GANs [29], VAT [49], ∆-GAN [50], Triple-GAN [51]
  Transfer learning: DANN [52], CycleGAN [53], DiscoGAN [54], DualGAN [55], StarGAN [56], CyCADA [57], ADDA [58], [59], FCAN [60], unsupervised pixel-level domain adaptation (PixelDA) [61]
  Reinforcement learning: GAIL [62]
TABLE 2: Applications of GANs discussed in Section 5
Field: Image processing and computer vision
  Super-resolution: SRGAN [63], ESRGAN [64], Cycle-in-Cycle GANs [65], SRDGAN [66], TGAN [67]
  Image synthesis and manipulation: DR-GAN [68], TP-GAN [69], PG2 [70], PSGAN [71], APDrawingGAN [72], IGAN [73], introspective adversarial networks [74], GauGAN [75]
  Texture synthesis: MGAN [76], SGAN [77], PSGAN [78]
  Object detection: Segan [79], perceptual GAN [80], MTGAN [81]
  Video: VGAN [82], DRNET [83], Pose-GAN [84], video2video [85], MoCoGan [86]
Field: Sequential data
  Natural language processing (NLP): RankGAN [87], IRGAN [88], [89], TAC-GAN [90]
  Music: RNN-GAN (C-RNN-GAN) [91], ORGAN [92], SeqGAN [93], [94]
machine learning algorithms. If a machine learning algo- It may fail to represent the complexity of true data distribu- rithm is based on a fully probabilistic model of the observed tion and learn the high-dimensional data distributions [100]. data, this algorithm is generative. Generative algorithms have become more popular and important due to their wide 2.1.2 Implicit density model practical applications. An implicit density model does not directly estimate or fit the data distribution. It produces data instances from 2.1 Generative algorithms the distribution without an explicit hypothesis [101] and utilizes the produced examples to modify the model. Prior Generative algorithms can be classified into two classes: to GANs, the implicit density model generally needs to be explicit density model and implicit density model. trained utilizing either ancestral sampling [102] or Markov chain-based sampling, which is inefficient and limits their 2.1.1 Explicit density model practical applications. GANs belong to the directed implicit An explicit density model assumes the distribution and density model category. A detailed summary and relevant utilizes true data to train the model containing the distri- papers can be found in [103]. bution or fit the distribution parameters. When finished, new examples are produced utilizing the learned model or 2.1.3 The comparison between GANs and other generative distribution. The explicit density models include maximum algorithms likelihood estimation (MLE), approximate inference [95], GANs were proposed to overcome the disadvantages of [96], and Markov chain method [97]–[99]. These explicit other generative algorithms. The basic idea behind adver- density models have an explicit distribution, but have lim- sarial learning is that the generator tries to create as real- itations. For instance, MLE is conducted on true data and istic examples as possible to deceive the discriminator. 
The the parameters are updated directly based on the true data, discriminator tries to distinguish fake examples from true which leads to an overly smooth generative model. The gen- examples. Both the generator and discriminator improve erative model learned by approximate inference can only through adversarial learning. This adversarial process gives approach the lower bound of the objective function rather GANs notable advantages over other generative algorithms. than directly approach the objective function, because of More specifically, GANs have advantages over other gener- the difficulty in solving the objective function. The Markov ative algorithms as follows: chain algorithm can be used to train generative models, but it is computationally expensive. Furthermore, the explicit 1) GANs can parallelize the generation, which is im- density model has the problem of computational tractability. possible for other generative algorithms such as JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 3 PixelCNN [104] and fully visible belief networks 3.1.1.1 Original minimax game: (FVBNs) [105], [106]. The objective function of GANs [3] is 2) The generator design has few restrictions. min max V (D,G) = Ex∼p (x) [log D (x)] 3) GANs are subjectively thought to produce better G D data (1) +E [log (1 − D (G (z)))] . examples than other methods. z∼pz (z) T Refer to [103] for more detailed discussions about this. log D (x) is the cross-entropy between [1 0] and T [D (x) 1 − D (x)] . Similarly, log (1 − D (G (z))) T is the cross-entropy between [0 1] and 2.2 Adversarial idea T [D (G (z)) 1 − D (G (z))] . For fixed G, the optimal The adversarial idea has been successfully applied to many discriminator D is given by [3]: areas such as machine learning, artificial intelligence, com- puter vision and natural language processing. The recent ∗ pdata (x) DG (x) = . 
(2) event that AlphaGo [107] defeats world’s top human player pdata (x) + pg (x) engages public interest in artificial intelligence. The interme- The minmax game in (1) can be reformulated as: diate version of AlphaGo utilizes two networks competing C(G) = max V (D,G) with each other. D ∗ Adversarial examples [108]–[117] have the adversarial = Ex∼pdata [log DG (x)] ∗ idea, too. Adversarial examples are those examples which +Ez∼pz [log (1 − DG (G (z)))] ∗ ∗ are very different from the real examples, but are classified = Ex∼p [log D (x)] + Ex∼p [log (1 − D (x))] (3) data h G gi G into a real category very confidently, or those that are = E log pdata(x) x∼pdata 1 (p (x)+p (x)) slightly different than the real examples, but are classified 2 data g h pg (x) i into a wrong category. This is a very hot research topic +Ex∼pg 1 − 2 log 2. 2 (pdata(x)+pg (x)) recently [112], [113]. To be against adversarial attacks [118], [119], references [120], [121] utilize GANs to conduct the The definition of KullbackLeibler (KL) divergence and right defense. Jensen-Shannon (JS) divergence between two probabilistic p (x) q (x) Adversarial machine learning [122] is a minimax prob- distributions and are defined as lem. The defender, who builds the classifier that we want Z p (x) KL(pk q) = p (x) log dx, (4) to work correctly, is searching over the parameter space to q (x) find the parameters that reduce the cost of the classifier as much as possible. Simultaneously, the attacker is searching 1 p + q 1 p + q JS(pk q) = KL(pk ) + KL(qk ). (5) over the inputs of the model to maximize the cost. 2 2 2 2 The adversarial idea exists in adversarial networks, ad- Therefore, (3) is equal to versarial learning, and adversarial examples. However, they pdata+pg pdata+pg have different objectives. C(G) = KL(pdatak 2 ) + KL(pgk 2 ) −2 log 2 (6) = 2JS(pdatak pg) − 2 log 2. 3 ALGORITHMS Thus, the objective function of GANs is related to both KL In this section, we first introduce the original GANs. 
Then, divergence and JS divergence. the representative variants, training, evaluation of GANs, 3.1.1.2 Non-saturating game: and task-driven GANs are introduced. It is possible that the Equation (1) cannot provide sufficient gradient for G to learn well in practice. Generally speak- 3.1 Generative Adversarial Nets (GANs) ing, G is poor in early learning and samples are clearly The GANs framework is straightforward to implement different from the training data. Therefore, D can reject the when the models are both neural networks. In order to learn generated samples with high confidence. In this situation, log (1 − D (G (z))) G the generator’s distribution pg over data x, a prior on input saturates. We can train to maximize log (D (G (z))) log (1 − D (G (z))) noise variables is defined as pz (z) [3] and z is the noise vari- rather than minimize . able. Then, GANs represent a mapping from noise space to The cost for the generator then becomes data space as G (z, θg), where G is a differentiable function (G) J = Ez∼pz (z) [− log (D (G (z)))] represented by a neural network with parameters θg. Other (7) = Ex∼pg [− log (D (x))] . than G, the other neural network D (x, θd) is also defined This new objective function results in the same fixed point with parameters θd and the output of D (x) is a single scalar. D G D (x) denotes the probability that x was from the data of the dynamics of and but provides much larger rather than the generator G. The discriminator D is trained gradients early in learning. The non-saturating game is to maximize the probability of giving the correct label to heuristic, not being motivated by theory. However, the both training data and fake samples generated from the non-saturating game has other problems such as unstable G D∗ generator G. G is trained to minimize log (1 − D (G (z))) numerical gradient for training . With optimal G, we simultaneously . 
have ∗ ∗ Ex∼pg [− log (DG (x))] + Ex∼pg [log (1 − DG (x))] h (1−D∗ (x)) i h p (x) i 3.1.1 Objective function G g = Ex∼pg log ∗ = Ex∼pg log (8) DG(x) pdata(x) Different objective functions can be used in GANs. = KL(pgk pdata). JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 4
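The chain of identities in (2)-(6) can be checked numerically for discrete distributions. The following sketch is ours, not part of the paper, and the function names are illustrative: it evaluates C(G) under the optimal discriminator of (2) and compares it with 2 JS(p_data || p_g) - 2 log 2 from (6).

```python
import math

def kl(p, q):
    # KL divergence of eq. (4) for discrete distributions given as lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    # Jensen-Shannon divergence of eq. (5).
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def c_of_g(p_data, p_g):
    # C(G) of eq. (3): the inner maximization evaluated with the optimal
    # discriminator D*(x) = p_data(x) / (p_data(x) + p_g(x)) of eq. (2).
    d_star = [pd / (pd + pg) for pd, pg in zip(p_data, p_g)]
    real_term = sum(pd * math.log(d) for pd, d in zip(p_data, d_star) if pd > 0)
    fake_term = sum(pg * math.log(1 - d) for pg, d in zip(p_g, d_star) if pg > 0)
    return real_term + fake_term

p_data, p_g = [0.7, 0.2, 0.1], [0.1, 0.3, 0.6]
lhs = c_of_g(p_data, p_g)
rhs = 2 * js(p_data, p_g) - 2 * math.log(2)
assert abs(lhs - rhs) < 1e-12  # eq. (6) holds
```

When p_g equals p_data, the JS term vanishes and C(G) attains its minimum, -2 log 2 = -log 4.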
Therefore, E_{x~p_g}[-log D*_G(x)] is equal to

  E_{x~p_g}[-log D*_G(x)] = KL(p_g || p_data) - E_{x~p_g}[log(1 - D*_G(x))].  (9)

From (3) and (6), we have

  E_{x~p_data}[log D*_G(x)] + E_{x~p_g}[log(1 - D*_G(x))] = 2 JS(p_data || p_g) - 2 log 2.  (10)

Therefore, E_{x~p_g}[log(1 - D*_G(x))] equals

  E_{x~p_g}[log(1 - D*_G(x))] = 2 JS(p_data || p_g) - 2 log 2 - E_{x~p_data}[log D*_G(x)].  (11)

By substituting (11) into (9), (9) reduces to

  E_{x~p_g}[-log D*_G(x)] = KL(p_g || p_data) - 2 JS(p_data || p_g) + E_{x~p_data}[log D*_G(x)] + 2 log 2.  (12)

From (12), we can see that the optimization of the alternative G loss in the non-saturating game is contradictory, because the first term aims to make the divergence between the generated distribution and the real distribution as small as possible, while the second term aims to make the divergence between these two distributions as large as possible due to the negative sign. This brings an unstable numerical gradient for training G. Furthermore, the KL divergence is not a symmetric quantity, which is reflected in the following two examples:

• If p_data(x) → 0 and p_g(x) → 1, we have KL(p_g || p_data) → +∞.
• If p_data(x) → 1 and p_g(x) → 0, we have KL(p_g || p_data) → 0.

The penalizations for the two kinds of errors made by G are completely different. The first error is that G produces implausible samples, and the penalization is rather large. The second error is that G does not produce real samples, and the penalization is quite small. The first error means the generated samples are inaccurate, while the second error means the generated samples are not diverse enough. Based on this, G prefers producing repeated but safe samples rather than taking the risk of producing different but unsafe samples, which leads to the mode collapse problem.

Fig. 1: The three curves "Original", "Non-saturating", and "Maximum likelihood cost" denote log(1 - D(G(z))), -log(D(G(z))), and -D(G(z))/(1 - D(G(z))) in (1), (7), and (13), respectively. The cost that the generator pays for generating a sample G(z) is decided only by the discriminator's response to that generated sample. The larger the probability the discriminator assigns to the real label for the generated sample, the less cost the generator pays. This figure is reproduced from [103], [123]. (The plot itself is omitted here; its annotations mark the left end, where samples are likely to be fake, as a region of almost vanishing gradient for the original and maximum likelihood costs, and the right end as the region that dominates the maximum likelihood gradient.)

3.1.1.3 Maximum likelihood game:
There are many methods to approximate (1) in GANs. Under the assumption that the discriminator is optimal, minimizing

  J^(G) = E_{z~p_z(z)}[-exp(σ^{-1}(D(G(z))))] = E_{z~p_z(z)}[-D(G(z)) / (1 - D(G(z)))],  (13)

where σ is the logistic sigmoid function, equals minimizing (1) [123]. A demonstration of this equivalence can be found in Section 8.3 of [103]. Furthermore, there are other possible ways of approximating maximum likelihood within the GANs framework [17]. A comparison of the original zero-sum game, the non-saturating game, and the maximum likelihood game is shown in Fig. 1.

Three observations can be obtained from Fig. 1:

• First, when the sample is likely to be fake, that is, on the left end of the figure, both the maximum likelihood game and the original minimax game suffer from gradient vanishing. The heuristically motivated non-saturating game does not have this problem.
• Second, the maximum likelihood game also has the problem that almost all of the gradient comes from the right end of the curve, which means that a rather small number of samples in each minibatch dominate the gradient computation. This suggests that variance reduction methods could be an important research direction for improving the performance of GANs based on the maximum likelihood game.
• Third, the heuristically motivated non-saturating game has lower sample variance, which is a possible reason it is more successful in real applications.

GAN Lab [124] is an interactive visualization tool designed for non-experts to learn and experiment with GANs. Bau et al. [125] present an analytic framework to visualize and understand GANs.

3.2 GANs' representative variants

There are many papers related to GANs [126]–[131], such as CSGAN [132] and LOGAN [133]. In this subsection, we introduce GANs' representative variants.

3.2.1 InfoGAN
Rather than utilizing a single unstructured noise vector z, InfoGAN [14] proposes to decompose the input noise vector into two parts: z, which is seen as incompressible noise, and c, which is called the latent code and will target the significant
structured semantic features of the real data distribution. InfoGAN [14] aims to solve

  min_G max_D V_I(D, G) = V(D, G) - λ I(c; G(z, c)),  (14)

where V(D, G) is the objective function of the original GANs, G(z, c) is the generated sample, I is the mutual information, and λ is the tunable regularization parameter. Maximizing I(c; G(z, c)) means maximizing the mutual information between c and G(z, c) to make c contain as many important and meaningful features of the real samples as possible. However, I(c; G(z, c)) is difficult to optimize directly in practice, since it requires access to the posterior P(c|x). Fortunately, we can obtain a lower bound of I(c; G(z, c)) by defining an auxiliary distribution Q(c|x) to approximate P(c|x). The final objective function of InfoGAN [14] is

  min_G max_D V_I(D, G) = V(D, G) - λ L_I(c; Q),  (15)

where L_I(c; Q) is the lower bound of I(c; G(z, c)). InfoGAN has several variants such as causal InfoGAN [134] and semi-supervised InfoGAN (ss-InfoGAN) [135].

3.2.2 Conditional GANs (cGANs)
GANs can be extended to a conditional model if both the discriminator and the generator are conditioned on some extra information y. The objective function of conditional GANs [15] is:

  min_G max_D V(D, G) = E_{x~p_data(x)}[log D(x|y)] + E_{z~p_z(z)}[log(1 - D(G(z|y)))].  (16)

By comparing (15) and (16), we can see that the generator of InfoGAN is similar to that of cGANs. However, the latent code c of InfoGAN is not known in advance; it is discovered by training. Furthermore, InfoGAN has an additional network Q to output the conditional variables Q(c|x).

Based on cGANs, we can generate samples conditioned on class labels [30], [136], text [34], [137], [138], or bounding boxes and keypoints [139]. In [34], [140], text-to-photo-realistic image synthesis is conducted with stacked generative adversarial networks (SGAN) [141]. cGANs have been used for convolutional face generation [142], face aging [143], image translation [144], synthesizing outdoor images with specific scenery attributes [145], natural image description [146], and 3D-aware scene manipulation [147]. Chrysos et al. [148] proposed robust cGANs. Thekumparampil et al. [149] discussed the robustness of conditional GANs to noisy labels. Conditional CycleGAN [16] uses cGANs with cyclic consistency. Mode seeking GANs (MSGANs) [150] propose a simple yet effective regularization term to address the mode collapse issue for cGANs.

The discriminator of the original GANs [3] is trained to maximize the log-likelihood that it assigns to the correct source [30]:

  L = E[log P(S = real | X_real)] + E[log P(S = fake | X_fake)],  (17)

which is equal to (1). The objective function of the auxiliary classifier GAN (AC-GAN) [30] has two parts: the log-likelihood of the correct source, L_S, and the log-likelihood of the correct class label, L_C. L_S is equivalent to L in (17). L_C is defined as

  L_C = E[log P(C = c | X_real)] + E[log P(C = c | X_fake)].  (18)

The discriminator and the generator of AC-GAN maximize L_C + L_S and L_C - L_S, respectively. AC-GAN was the first variant of GANs that was able to produce recognizable examples of all the ImageNet [151] classes.

The discriminators of most cGAN-based methods [31], [41], [152]–[154] feed the conditional information y into the discriminator by simply concatenating (embedded) y to the input or to the feature vector at some middle layer. cGANs with a projection discriminator [155] instead adopt an inner product between the condition vector y and the feature vector.

Isola et al. [156] used cGANs and sparse regularization for image-to-image translation; the corresponding software is called pix2pix. In GANs, the generator learns a mapping from random noise z to G(z). In contrast, there is no noise input in the generator of pix2pix. A novelty of pix2pix is that its generator learns a mapping from an observed image y to an output image G(y), for example, from a grayscale image to a color image.

Fig. 2: Illustration of pix2pix: training a conditional GAN to map grayscale→color. The discriminator D learns to classify between real {grayscale, color} tuples and fake ones (synthesized by the generator). The generator G learns to fool the discriminator. Different from the original GANs, both the generator and the discriminator observe the input grayscale image, and there is no noise input for the generator of pix2pix.

The objective of cGANs in [156] can be expressed as

  L_cGANs(D, G) = E_{x,y}[log D(x, y)] + E_y[log(1 - D(y, G(y)))].  (19)

Furthermore, an l1 distance is used:

  L_l1(G) = E_{x,y}[||x - G(y)||_1].  (20)

The final objective of [156] is

  L_cGANs(D, G) + λ L_l1(G),  (21)

where λ is the free parameter. As a follow-up to pix2pix, pix2pixHD [157] used cGANs and a feature matching loss for high-resolution image synthesis and semantic manipulation. With the discriminators, the learning problem is a multi-task learning problem:

  min_G max_{D1,D2,D3} Σ_{k=1,2,3} L_GAN(G, D_k).  (22)

The training set is given as a set of pairs of corresponding images {(s_i, x_i)}, where x_i is a natural photo and s_i is a corresponding semantic label map. The i-th-layer feature extractor of discriminator D_k is denoted D_k^(i) (from the input to the i-th layer of D_k). The feature matching loss L_FM(G, D_k) is:

  L_FM(G, D_k) = E_{(s,x)} Σ_{i=1}^{T} (1/N_i) [||D_k^(i)(s, x) - D_k^(i)(s, G(s))||_1],  (23)

where N_i is the number of elements in each layer and T denotes the total number of layers. The final objective function of [157] is

  min_G max_{D1,D2,D3} Σ_{k=1,2,3} (L_GAN(G, D_k) + λ L_FM(G, D_k)).  (24)

3.2.3 CycleGAN
Image-to-image translation is a class of graphics and vision problems where the goal is to learn the mapping between an output image and an input image using a training set of aligned image pairs. When paired training data is available, the method of [156] can be used for these image-to-image translation tasks. However, [156] cannot be used for unpaired data (no input/output pairs), a problem that was well solved by cycle-consistent GANs (CycleGAN) [53]. CycleGAN is an important advance for unpaired data. It has been proved that cycle-consistency is an upper bound of the conditional entropy [158]. CycleGAN can also be derived as a special case within the proposed variational inference (VI) framework [159], naturally establishing its relationship with approximate Bayesian inference methods.

The basic ideas of DiscoGAN [54] and CycleGAN [53] are nearly the same; both were proposed independently at nearly the same time. The only difference between CycleGAN [53] and DualGAN [55] is that DualGAN uses the loss format advocated by Wasserstein GAN (WGAN) rather than the sigmoid cross-entropy loss used in CycleGAN.

3.2.4 f-GAN
As we know, the Kullback-Leibler (KL) divergence measures the difference between two given probability distributions. A large class of assorted divergences are the so-called Ali-Silvey distances, also known as the f-divergences [160]. Given two probability distributions P and Q which have, respectively, absolutely continuous density functions p and q with regard to a base measure dx defined on the domain X, the f-divergence is defined as

  D_f(P || Q) = ∫_X q(x) f(p(x) / q(x)) dx.  (25)

Different choices of f recover popular divergences as special cases of the f-divergence. For example, if f(a) = a log a, the f-divergence becomes the KL divergence. The original GANs [3] are a special case of f-GAN [17], which is based on the f-divergence. Reference [17] shows that any f-divergence can be used for training GANs. Furthermore, reference [17] discusses the advantages of different choices of divergence functions for both the quality of the produced generative models and the training complexity. Im et al. [161] quantitatively evaluated GANs with the divergences proposed for training. Uehara et al. [162] extend the f-GAN further: the f-divergence is directly minimized in the generator step, and the ratio of the distributions of real and generated data is predicted in the discriminator step.

3.2.5 Integral Probability Metrics (IPMs)
Denote by P the set of all Borel probability measures on a topological space (M, A). The integral probability metric (IPM) [163] between two probability distributions P ∈ P and Q ∈ P is defined as

  γ_F(P, Q) = sup_{f∈F} | ∫_M f dP - ∫_M f dQ |,  (26)

where F is a class of real-valued bounded measurable functions on M. Nonparametric density estimation and convergence rates for GANs under Besov IPM losses are discussed in [164]. IPMs include the RKHS-induced maximum mean discrepancy (MMD) as well as the Wasserstein distance used in Wasserstein GANs (WGAN).

3.2.5.1 Maximum Mean Discrepancy (MMD):
The maximum mean discrepancy (MMD) [165] is a measure of the difference between two distributions P and Q, given by the supremum, over a function space F, of the differences between the expectations with regard to the two distributions. The MMD is defined by:

  MMD(F, P, Q) = sup_{f∈F} (E_{X~P}[f(X)] - E_{Y~Q}[f(Y)]).  (27)

MMD has been used for deep generative models [166]–[168] and model criticism [169].

3.2.5.2 Wasserstein GAN (WGAN):
WGAN [18] conducted a comprehensive theoretical analysis of how the Earth Mover (EM) distance behaves in comparison with the popular probability distances and divergences utilized in the context of learning distributions, such as the total variation (TV) distance, the Kullback-Leibler (KL) divergence, and the Jensen-Shannon (JS) divergence. The definition of the EM distance is

  W(p_data, p_g) = inf_{γ∈Π(p_data, p_g)} E_{(x,y)~γ}[||x - y||],  (28)

where Π(p_data, p_g) denotes the set of all joint distributions γ(x, y) whose marginals are p_data and p_g, respectively. However, the infimum in (28) is highly intractable. Reference [18] uses the following equation to approximate the EM distance:

  max_{w∈W} E_{x~p_data(x)}[f_w(x)] - E_{z~p_z(z)}[f_w(G(z))],  (29)

where there is a parameterized family of functions {f_w}_{w∈W} that are all K-Lipschitz for some K, and f_w can be realized by the discriminator D. When D is optimized, (29) denotes the approximated EM distance. Then the aim of G is to minimize (29) to make the generated distribution as close to the real distribution as possible. Therefore, the overall objective function of WGAN is

  min_G max_{w∈W} E_{x~p_data(x)}[f_w(x)] - E_{z~p_z(z)}[f_w(G(z))]
    = min_G max_D E_{x~p_data(x)}[D(x)] - E_{z~p_z(z)}[D(G(z))].  (30)
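For intuition, the primal EM distance of (28) and a clipped-critic approximation in the spirit of (29) can be compared on one-dimensional empirical samples. This sketch is ours, not from the paper: the one-parameter linear critic f_w(x) = w*x and the helper names are simplifying assumptions (a real WGAN critic is a neural network trained by stochastic gradient methods).

```python
def wasserstein1_empirical(xs, ys):
    # Primal EM distance of eq. (28) for equal-size 1-D samples: the optimal
    # coupling matches sorted samples, so W1 is the mean absolute difference.
    xs, ys = sorted(xs), sorted(ys)
    return sum(abs(a - b) for a, b in zip(xs, ys)) / len(xs)

def critic_estimate(xs, ys, clip=1.0, steps=200, lr=0.05):
    # Dual estimate in the spirit of eq. (29) with a linear critic f_w(x) = w*x.
    # Weight clipping keeps |w| <= clip, i.e. f_w is 1-Lipschitz, as in WGAN.
    w = 0.0
    mean_gap = sum(xs) / len(xs) - sum(ys) / len(ys)
    for _ in range(steps):
        # gradient of E_xs[f_w] - E_ys[f_w] with respect to w is mean_gap
        w += lr * mean_gap
        w = max(-clip, min(clip, w))  # weight clipping step
    return w * mean_gap

# For a pure shift, a linear 1-Lipschitz critic attains the supremum:
print(wasserstein1_empirical([0, 1, 2], [3, 4, 5]))  # 3.0
print(critic_estimate([0, 1, 2], [3, 4, 5]))         # 3.0
```

Note that this toy critic class is too weak in general; it attains the EM distance here only because one 1-D distribution is a pure shift of the other.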
Then the aim [3] is a special case of f-GAN [17] which is based on f- of G is to minimize (29) to make the generated distribution divergence. The reference [17] shows that any f-divergence as close to the real distribution as possible. Therefore, the can be used for training GAN. Furthermore, the reference overall objective function of WGAN is [17] discusses the advantages of different choices of di- min max Ex∼pdata(x) [fw (x)] − Ez∼pz (z) [fw (G (z))] vergence functions on both the quality of the produced G w∈W (30) = min max Ex∼p (x) [D (x)] − Ez∼p (z) [D (G (z))] . generative models and training complexity. Im et al. [161] G D data z JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2015 7 By comparing (1) and (30), we can see three differences between the objective function of original GANs and that of WGAN: min LG = Ez∼p (z) [Lθ (G (z))] , (34) G z • First, there is no log in the objective function of + WGAN. where [y] = max(0, y), λ is the free tuning-parameter, and • Second, the D in original GANs is utilized as a θ is the paramter of the discriminator D. binary classifier while D utilized in WGAN is to approximate the Wasserstein distance, which is a regression task. Therefore, the sigmoid in the last 3.2.7 Summary layer of D is not used in the WGAN. The output There is a website called “The GAN Zoo” (https: of the discriminator of the original GANs is between //github.com/hindupuravinash/the-gan-zoo) which lists zero and one while there is no constraint for that of many GANs’ variants. Please refer to this website for more WGAN. details. • Third, the D in WGAN is required to be K-Lipschitz for some K and therefore WGAN uses weight clip- ping. 3.3 GANs Training Compared with traditional GANs training, WGAN can improve the stability of learning and provide meaningful Despite the theoretical existence of unique solutions, GANs learning curves useful for hyperparameter searches and training is hard and often unstable for several reasons [29], debugging. 
However, it is a challenging task to approxi- [32], [179]. One difficulty is from the fact that optimal mate the K-Lipschitz constraint which is required by the weights for GANs correspond to saddle points, and not Wasserstein-1 metric. WGAN-GP [19] is proposed by utiliz- minimizers, of the loss function. ing gradient penalty for restricting K-Lipschitz constraint There are many papers on GANs training. Yadav et al. and the objective function is [180] stabilized GANs with prediction methods. By using independent learning rates, [181] proposed a two time- L = −Ex∼pdata [D (x)] + Ex˜∼pg [D (˜x)] scale update rule (TTUR) for both discriminator and gen- h 2i (31) +λExˆ∼pxˆ (k∇xˆD (ˆx)k2 − 1) erator to ensure that the model can converge to a stable local Nash equilibrium. Arjovsky [179] made theoretical where the first two terms are the objective function of steps towards fully understanding the training dynamics of WGAN and xˆ is sampled from the distribution pxˆ which GANs; analyzed why GANs was hard to train; studied and samples uniformly along straight lines between pairs of proved rigorously the problems including saturation and points sampled from the real data distribution pdata and instability that occurred when training GANs; examined a the generated distribution pg. There are some other methods practical and theoretically grounded direction to mitigate closely related to WGAN-GP such as DRAGAN [170]. Wu et these problems; and introduced new tools to study them. al. [171] propose a novel and relaxed version of Wasserstein- Liang et al. [182] think that GANs training is a continual 1 metric: Wasserstein divergence (W-div), which does not learning problem [183]. require the K-Lipschitz constraint. Based on W-div, Wu et al. One method to improve GANs training is to assess [171] introduce a Wasserstein divergence objective for GANs the empirical “symptoms” that might occur in training. 
CramerGAN [172] argues that the Wasserstein distance leads to biased gradients and suggests the Cramér distance between two distributions instead. Other papers related to WGAN can be found in [173]–[178].

3.2.6 Loss Sensitive GAN (LS-GAN):
Similar to WGAN, LS-GAN [20] also has a Lipschitz constraint. It is assumed in LS-GAN that p_data lies in a set of Lipschitz densities with a compact support. In LS-GAN, the loss function L_θ(x) is parameterized with θ, and LS-GAN assumes that a generated sample should have a larger loss than a real one. The loss function can be trained to satisfy the following constraint:

L_θ(x) ≤ L_θ(G(z)) − ∆(x, G(z)),   (32)

where ∆(x, G(z)) is the margin measuring the difference between the generated sample G(z) and the real sample x. The objective functions of LS-GAN are

min_D L_D = E_{x∼p_data(x)}[L_θ(x)] + λ E_{x∼p_data(x), z∼p_z(z)}[∆(x, G(z)) + L_θ(x) − L_θ(G(z))]_+,   (33)

min_G L_G = E_{z∼p_z(z)}[L_θ(G(z))],   (34)

where [y]_+ = max(0, y), λ is a free tuning parameter, and θ is the parameter of the discriminator D.

3.2.7 Summary:
There is a website called "The GAN Zoo" (https://github.com/hindupuravinash/the-gan-zoo) which lists many GANs' variants. Please refer to this website for more details.

3.3 GANs Training
Despite the theoretical existence of unique solutions, GANs training is hard and often unstable, for several reasons [29], [32], [179]. One difficulty comes from the fact that the optimal weights for GANs correspond to saddle points, and not minimizers, of the loss function.

There are many papers on GANs training. Yadav et al. [180] stabilized GANs with prediction methods. By using independent learning rates, [181] proposed a two time-scale update rule (TTUR) for both the discriminator and the generator to ensure that the model can converge to a stable local Nash equilibrium. Arjovsky [179] made theoretical steps towards fully understanding the training dynamics of GANs: this work analyzed why GANs are hard to train; studied and rigorously proved the saturation and instability problems that occur when training GANs; examined a practical and theoretically grounded direction to mitigate these problems; and introduced new tools to study them. Liang et al. [182] regard GANs training as a continual learning problem [183].

One method to improve GANs training is to assess the empirical "symptoms" that might occur in training. These symptoms include: the generative model collapsing to produce very similar samples for diverse inputs [29]; the discriminator loss converging quickly to zero [179], providing no gradient updates to the generator; and difficulties in making the pair of models converge [32].

We will introduce GANs training from three perspectives: objective function, skills, and structure.

3.3.1 Objective function
As we can see from Subsection 3.1, utilizing the original objective function in equation (1) causes the vanishing gradient problem when training G, and utilizing the alternative G loss (12) in the non-saturating game causes the mode collapse problem. These problems are caused by the objective function and cannot be solved by changing the structure of GANs. Re-designing the objective function is a natural way to mitigate them. Based on the theoretical flaws of GANs, many objective-function-based variants have been proposed to change the objective function of GANs based on theoretical analyses, such as least squares generative adversarial networks [21], [22].
3.3.1.1 Least squares generative adversarial networks (LSGANs):
LSGANs [21], [22] are proposed to overcome the vanishing gradient problem in the original GANs. This work shows that the decision boundary for D of the original GAN yields only a very small error for updating G on those generated samples which are far from the decision boundary. LSGANs adopt the least squares loss rather than the cross-entropy loss of the original GANs. Suppose that the a-b coding is used for the LSGANs' discriminator [21], where a and b are the labels for generated samples and real samples, respectively. The LSGANs' discriminator loss V_LSGAN(D) and generator loss V_LSGAN(G) are defined as:

min_D V_LSGAN(D) = E_{x∼p_data(x)}[(D(x) − b)^2] + E_{z∼p_z(z)}[(D(G(z)) − a)^2],   (35)

min_G V_LSGAN(G) = E_{z∼p_z(z)}[(D(G(z)) − c)^2],   (36)

where c is the value that G wants D to believe for generated samples. The reference [21] shows that there are two advantages of LSGANs in comparison with the original GANs:
• The new decision boundary produced by D penalizes with a large error those generated samples which are far from the decision boundary, which makes those "low quality" generated samples move toward the decision boundary. This is good for generating higher quality samples.
• Penalizing the generated samples far from the decision boundary supplies more gradient when updating G, which overcomes the vanishing gradient problem of the original GANs.

3.3.1.2 Hinge loss based GAN:
The hinge loss based GAN is proposed and used in [23]–[25], and its objective functions are:

V_D(Ĝ, D) = E_{x∼p_data(x)}[min(0, −1 + D(x))] + E_{z∼p_z(z)}[min(0, −1 − D(Ĝ(z)))],   (37)

V_G(G, D̂) = −E_{z∼p_z(z)}[D̂(G(z))].   (38)

The softmax cross-entropy loss [184] is also used in GANs.
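The difference between the least squares loss (35) and the hinge objective (37) can be seen on a handful of raw discriminator scores. The numbers below are our own illustrative values, with the common coding a = 0, b = 1:

```python
import numpy as np

# Raw discriminator outputs on a few real and generated samples
# (illustrative numbers only).
d_real = np.array([0.9, 2.0, -0.5])
d_fake = np.array([-1.5, 0.3, -0.2])

# LSGAN discriminator loss (equation (35)) with a = 0 (fake), b = 1 (real):
# every deviation from the target label is penalized quadratically.
a, b = 0.0, 1.0
lsgan_d = np.mean((d_real - b) ** 2) + np.mean((d_fake - a) ** 2)

# Hinge objective (equation (37)): scores already beyond the +-1 margin
# (e.g. the real score 2.0) contribute nothing extra.
hinge_d = np.mean(np.minimum(0.0, -1.0 + d_real)) + \
          np.mean(np.minimum(0.0, -1.0 - d_fake))

print(round(lsgan_d, 4), round(hinge_d, 4))   # → 1.88 -1.2333
```

Note how the confident real score 2.0 still incurs a least squares penalty (it overshoots b = 1) but is ignored by the hinge margin, which is one intuition for the different gradient behavior of the two losses.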
3.3.1.3 Energy-based generative adversarial network (EBGAN):
EBGAN's discriminator is seen as an energy function, attributing high energy to the fake ("generated") samples and lower energy to the real samples. As for energy functions, please refer to the tutorial [185]. Given a positive margin m, the loss functions of EBGAN can be defined as follows:

L_D(x, z) = D(x) + [m − D(G(z))]_+,   (39)

L_G(z) = D(G(z)),   (40)

where [y]_+ = max(0, y) is the rectified linear unit (ReLU) function. Note that in the original GANs, the discriminator D gives high scores to real samples and low scores to the generated ("fake") samples, whereas the discriminator in EBGAN attributes low energy (score) to the real samples and higher energy to the generated ones. EBGAN exhibits more stable behavior than the original GANs during training.

3.3.1.4 Boundary equilibrium generative adversarial networks (BEGAN):
Similar to EBGAN [38], dual-agent GAN (DA-GAN) [186], [187], and margin adaptation for GANs (MAGANs) [188], BEGAN also uses an auto-encoder as the discriminator. Using proportional control theory, BEGAN proposes a novel equilibrium method to balance the generator and discriminator in training, which is fast, stable, and robust to parameter changes.

3.3.1.5 Mode regularized generative adversarial networks (MDGAN):
Che et al. [26] argue that GANs' unstable training and mode collapse are due to the very special functional shape of the trained discriminators in high dimensional spaces, which can make training get stuck or push probability mass in the wrong direction, towards points of higher concentration than those of the real data distribution. Che et al. [26] introduce several ways of regularizing the objective, which can stabilize the training of GAN models. The key idea of MDGAN is to utilize an encoder E(x): x → z to produce the latent variable z for the generator G rather than utilizing noise. This procedure has two advantages:
• The encoder guarantees the correspondence between z (E(x)) and x, which makes G capable of covering the diverse modes in the data space. Therefore, it prevents the mode collapse problem.
• Because the reconstruction of the encoder can add more information to the generator G, it is not easy for the discriminator D to distinguish between real samples and generated ones.

The loss functions for the generator and the encoder of MDGAN are

L_G = −E_{z∼p_z(z)}[log(D(G(z)))] + E_{x∼p_data(x)}[λ_1 d(x, G∘E(x)) + λ_2 log D(G∘E(x))],   (41)

L_E = E_{x∼p_data(x)}[λ_1 d(x, G∘E(x)) + λ_2 log D(G∘E(x))],   (42)

where both λ_1 and λ_2 are free tuning parameters, d is a distance metric such as the Euclidean distance, and G∘E(x) = G(E(x)).
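The EBGAN margin loss (39)-(40) reduces to simple arithmetic once energies are assigned; the energies below are our own illustrative stand-ins for an auto-encoder discriminator's outputs:

```python
import numpy as np

def relu(y):
    return np.maximum(0.0, y)

# Illustrative energies: low for real samples, higher for generated ones.
d_real = np.array([0.1, 0.3])        # D(x)
d_fake = np.array([0.8, 2.5])        # D(G(z))
m = 1.0                              # positive margin

# Equation (39): real samples are pushed toward low energy; fake samples
# are pushed up only while their energy is still below the margin m
# (the fake with energy 2.5 contributes nothing).
loss_d = d_real.mean() + relu(m - d_fake).mean()

# Equation (40): the generator tries to lower the energy of its samples.
loss_g = d_fake.mean()

print(round(loss_d, 4), round(loss_g, 4))
```

The margin plays the same role as in the hinge loss: once a generated sample's energy exceeds m, the discriminator stops spending capacity on it.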
3.3.1.6 Unrolled GAN:
Metz et al. [27] introduce a technique to stabilize GANs by defining the generator objective with respect to an unrolled optimization of the discriminator. This allows training to be adjusted between utilizing the current value of the discriminator, which is usually unstable and leads to poor solutions, and utilizing the optimal discriminator solution in the generator's objective, which is ideal but infeasible in real applications. Let f(θ_G, θ_D) denote the objective function of the original GANs. A local optimal solution θ_D^* of the discriminator parameters can be expressed as the fixed point of an iterative optimization procedure,

θ_D^0 = θ_D,   (43)

θ_D^{k+1} = θ_D^k + η^k df(θ_G, θ_D^k)/dθ_D^k,   (44)

θ_D^*(θ_G) = lim_{k→∞} θ_D^k,   (45)

where η^k is the learning rate. By unrolling for K steps, a surrogate objective for the update of the generator is created:

f_K(θ_G, θ_D) = f(θ_G, θ_D^K(θ_G, θ_D)).   (46)

When K = 0, this objective is the same as the standard GAN objective. When K → ∞, this objective is the true generator objective function f(θ_G, θ_D^*(θ_G)). By adjusting the number of unrolling steps K, we are capable of interpolating between standard GAN training dynamics, with their associated pathologies, and more expensive gradient descent on the true generator loss. The generator and discriminator parameter updates of the unrolled GAN using this surrogate loss are

θ_G = θ_G − η df_K(θ_G, θ_D)/dθ_G,   (47)

θ_D = θ_D + η df(θ_G, θ_D)/dθ_D.   (48)

Metz et al. [27] show how this method mitigates mode collapse, stabilizes the training of GANs, and increases the diversity and coverage of the distribution produced by the generator.
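A minimal numerical sketch of the unrolling idea (equations (43)-(47)) on a toy scalar objective, with the derivatives taken by finite differences instead of automatic differentiation; the objective and step sizes are our own illustrative choices:

```python
# Toy objective f(g, d): the discriminator ascends in d, and the
# generator descends the K-step unrolled surrogate f_K in g.
def f(g, d):
    return -(d - g) ** 2 + 0.1 * g ** 2

def unrolled_d(g, d, K, eta=0.5, h=1e-5):
    # K gradient-ascent steps of the discriminator (equation (44)).
    for _ in range(K):
        grad_d = (f(g, d + h) - f(g, d - h)) / (2 * h)
        d = d + eta * grad_d
    return d

def f_K(g, d, K):
    # Surrogate generator objective (equation (46)).
    return f(g, unrolled_d(g, d, K))

g, d, K, h = 1.0, 0.0, 5, 1e-5
# Generator step on the surrogate, differentiating THROUGH the unrolled
# discriminator steps (equation (47)), here via a finite difference in g.
grad_g = (f_K(g + h, d, K) - f_K(g - h, d, K)) / (2 * h)
g_new = g - 0.1 * grad_g
print(g_new)
```

Because the unrolled discriminator depends on g, the surrogate gradient sees how the discriminator would react to a generator change, which is exactly the feedback missing when K = 0.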
3.3.1.7 Spectrally normalized GANs (SN-GANs):
SN-GANs [23] propose a novel weight normalization method, named spectral normalization, to make the training of the discriminator stable. This normalization technique is computationally efficient and easy to integrate into existing methods. Spectral normalization [23] uses a simple method to make the weight matrix W satisfy the Lipschitz constraint σ(W) = 1:

W̄_SN(W) := W / σ(W),   (49)

where W is the weight matrix of each layer in D and σ(W) is the spectral norm of W. It is shown in [23] that SN-GANs can generate images of equal or better quality in comparison with previous training stabilization methods. In theory, spectral normalization is capable of being applied to all GANs variants. Both BigGANs [36] and SAGAN [35] use spectral normalization and perform well on ImageNet.

3.3.1.8 Relativistic GANs (RGANs):
In the original GANs, the discriminator can be defined, in terms of the non-transformed layer C(x), as D(x) = sigmoid(C(x)). A simple way to make the discriminator relativistic (i.e., making the output of D depend on both real and generated samples) [28] is to sample real/generated data pairs x̃ = (x_r, x_g) and define it as

D(x̃) = sigmoid(C(x_r) − C(x_g)).   (50)

This modification can be understood in the following way [28]: D estimates the probability that the given real sample is more realistic than a randomly sampled generated sample. Similarly, D_rev(x̃) = sigmoid(C(x_g) − C(x_r)) can be defined as the probability that the given generated sample is more realistic than a randomly sampled real sample. The discriminator and generator loss functions of the Relativistic Standard GAN (RSGAN) are:

L_D^RSGAN = −E_{(x_r, x_g)}[log(sigmoid(C(x_r) − C(x_g)))],   (51)

L_G^RSGAN = −E_{(x_r, x_g)}[log(sigmoid(C(x_g) − C(x_r)))].   (52)

Most GANs can be parametrized as:

L_D^GAN = E_{x_r}[f_1(C(x_r))] + E_{x_g}[f_2(C(x_g))],   (53)

L_G^GAN = E_{x_r}[g_1(C(x_r))] + E_{x_g}[g_2(C(x_g))],   (54)

where f_1, f_2, g_1, g_2 are scalar-to-scalar functions. If we adopt a relativistic discriminator, the loss functions of these GANs become:

L_D^RGAN = E_{(x_r, x_g)}[f_1(C(x_r) − C(x_g))] + E_{(x_r, x_g)}[f_2(C(x_g) − C(x_r))],   (55)

L_G^RGAN = E_{(x_r, x_g)}[g_1(C(x_r) − C(x_g))] + E_{(x_r, x_g)}[g_2(C(x_g) − C(x_r))].   (56)

3.3.2 Skills
NIPS 2016 held a workshop on adversarial training, with an invited talk by Soumith Chintala named "How to train a GAN." This talk contains assorted tips and tricks. For example, it suggests that if you have labels, you should train the discriminator to also classify the examples, as in AC-GAN [30]. Refer to the GitHub repository associated with Soumith's talk, https://github.com/soumith/ganhacks, for more advice.

Salimans et al. [29] proposed very useful and improved techniques for training GANs (ImprovedGANs), such as feature matching, minibatch discrimination, historical averaging, one-sided label smoothing, and virtual batch normalization.

3.3.3 Structure
The original GANs utilized the multi-layer perceptron (MLP).
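Spectral normalization (49) is usually implemented with a few steps of power iteration rather than a full SVD. The sketch below uses a small matrix whose singular values (3 and 1) are known in advance, so the result can be checked exactly; the function name and iteration count are our own choices:

```python
import numpy as np

def spectral_normalize(W, n_iter=50):
    # Estimate the spectral norm sigma(W) with power iteration, as in
    # SN-GANs, then rescale the weights (equation (49)).
    u = np.ones(W.shape[0]) / np.sqrt(W.shape[0])
    for _ in range(n_iter):
        v = W.T @ u
        v /= np.linalg.norm(v)
        u = W @ v
        u /= np.linalg.norm(u)
    sigma = u @ W @ v          # spectral norm estimate
    return W / sigma

# Symmetric test matrix with singular values 3 and 1.
W = np.array([[2.0, 1.0],
              [1.0, 2.0]])
W_sn = spectral_normalize(W)

# After normalization the layer satisfies the Lipschitz constraint:
print(round(np.linalg.svd(W_sn, compute_uv=False)[0], 6))   # → 1.0
```

In practice (e.g., in SN-GAN implementations) a single power-iteration step per training update suffices, because the weights change slowly between updates.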
A specific type of structure may be good for a specific application, e.g., the recurrent neural network (RNN) for time series data and the convolutional neural network (CNN) for images.

3.3.3.1 The original GANs:
The original GANs used MLPs as the generator G and discriminator D. MLPs can only be used for small datasets such as CIFAR-10 [189], MNIST [190], and the Toronto Face Database (TFD) [191]. However, MLPs do not generalize well on more complex images [10].

3.3.3.2 Laplacian generative adversarial networks (LAPGAN) and SinGAN:
LAPGAN [31] is proposed for producing higher resolution images in comparison with the original GANs. LAPGAN uses a cascade of CNNs within a Laplacian pyramid framework [192] to generate high quality images.

SinGAN [193] learns a generative model from a single natural image. SinGAN has a pyramid of fully convolutional GANs, each of which learns the patch distribution at a different scale of the image. Similar to SinGAN, InGAN [194] also learns a generative model from a single natural image.

3.3.3.3 Deep convolutional generative adversarial networks (DCGANs):
In the original GANs, G and D are defined by MLPs. Because CNNs handle images better than MLPs, G and D are defined by deep convolutional neural networks (DCNNs) in DCGANs [32], which have better performance. Most current GANs are at least loosely based on the DCGANs architecture [32]. Three key features of the DCGANs architecture are listed as follows:
• First, the overall architecture is mostly based on the all-convolutional net [195]. This architecture has neither pooling nor "unpooling" layers. When G needs to increase the spatial dimensionality of the representation, it uses transposed convolution (deconvolution) with a stride greater than 1.
• Second, batch normalization is used for most layers of both G and D. The last layer of G and the first layer of D are not batch normalized, so that the neural network can learn the correct mean and scale of the data distribution.
• Third, the Adam optimizer is used instead of SGD with momentum.

3.3.3.4 Progressive GAN:
Progressive GAN (PGGAN) [33] proposes a new training methodology for GANs. The structure of Progressive GAN is based on progressive neural networks, first proposed in [196]. The key idea of Progressive GAN is to grow both the generator and discriminator progressively: starting from a low resolution, new layers that model increasingly fine details are added as training progresses.

3.3.3.5 Self-Attention Generative Adversarial Network (SAGAN):
SAGAN [35] is proposed to allow attention-driven, long-range dependency modeling for image generation tasks. Previously, the spectral normalization technique had been applied only to the discriminator [23]. SAGAN uses spectral normalization for both the generator and the discriminator, and it is found that this improves training dynamics. Furthermore, it is confirmed that the two time-scale update rule (TTUR) [181] is effective in SAGAN.

Note that AttnGAN [197] utilizes attention over word embeddings within an input sequence rather than self-attention over internal model states.

3.3.3.6 BigGANs and StyleGAN:
Both BigGANs [36] and StyleGAN [37], [198] made great advances in the quality of GANs.

BigGANs [36] is a large-scale TPU implementation of GANs, which is quite similar to SAGAN but scaled up greatly. BigGANs successfully generates images with quite high resolution, up to 512 by 512 pixels. If you do not have enough data, it can be a challenging task to replicate the BigGANs results from scratch. Lucic et al. [199] propose to train a BigGANs-quality model with fewer labels. BigBiGAN [200], based on BigGANs, extends it to representation learning by adding an encoder and modifying the discriminator. BigBiGAN achieves the state of the art in both unsupervised representation learning on ImageNet and unconditional image generation.

In the original GANs [3], G and D are defined by MLPs. Karras et al. [37] proposed the StyleGAN architecture for GANs, which won the CVPR 2019 best paper honorable mention. StyleGAN's generator is a very high-quality generator for generation tasks such as generating faces. It is particularly exciting because it allows the separation of different factors, such as hair, age, and sex, that are involved in controlling the appearance of the final example, so that we can control them independently of each other. StyleGAN [37] has also been used for tasks such as generating high-resolution images of fashion models wearing custom outfits [201].

3.3.3.7 Hybrids of autoencoders and GANs:
An autoencoder is a type of neural network used for learning efficient data codings in an unsupervised way. The autoencoder has an encoder and a decoder. The encoder aims to learn a representation (encoding) z = E(x) for a set of data, typically for dimensionality reduction. The decoder aims to reconstruct the data, x̂ = g(z). That is to say, the decoder tries to generate, from the reduced encoding, a representation as close as possible to its original input x.

GANs with an autoencoder: The adversarial autoencoder (AAE) [202] is a probabilistic autoencoder based on GANs. Adversarial variational Bayes (AVB) [203], [204] is proposed to unify variational autoencoders (VAEs) and GANs. Sun et al. [205] proposed an UNsupervised Image-to-image Translation (UNIT) framework that is based on GANs and VAEs. Hu et al. [206] aimed to establish formal connections between GANs and VAEs through a new formulation of them. By combining a VAE with a GAN, Larsen et al. [207] utilize learned feature representations in the GAN discriminator as the basis for the VAE reconstruction. Therefore, element-wise errors are replaced with feature-wise errors, to better capture the data distribution while offering invariance towards transformations such as translation. Rosca et al. [208] proposed variational approaches for auto-encoding GANs. By adopting an encoder-decoder architecture for the generator, disentangled representation GAN (DR-GAN) [68] addresses pose-invariant face recognition, which is a hard problem due to the drastic changes in an image for each diverse pose.

GANs with an encoder: References [40], [42] only add an encoder to GANs. The original GANs [3] cannot learn the inverse mapping, i.e., projecting data back into the latent space. To solve this problem, Donahue et al. [40] proposed Bidirectional GANs (BiGANs), which can learn this inverse mapping through the encoder, and showed that the resulting learned feature representation is useful. Similarly, Dumoulin et al. [41] proposed the adversarially learned inference (ALI) model, which also utilizes an encoder to learn the latent feature distribution. The structures of BiGAN and ALI are shown in Fig. 3(a). Besides the discriminator and generator, BiGAN also has an encoder, which is used for mapping the data back to the latent space.
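The pairing of data with latent codes that BiGAN/ALI feed to the discriminator can be sketched with toy linear maps standing in for the trained networks (the maps E and G below are our own illustrative choices, deliberately picked so that they invert each other, which is the solution BiGAN training encourages):

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative linear encoder/generator (stand-ins for neural networks).
def E(x):            # encoder: data -> latent space
    return 0.5 * x

def G(z):            # generator: latent space -> data
    return 2.0 * z

x = rng.normal(size=(4, 2))          # real data samples
z = rng.normal(size=(4, 2))          # latent noise samples

# BiGAN's discriminator sees joint pairs, never data alone:
real_pairs = [(xi, E(xi)) for xi in x]        # (x, E(x))
fake_pairs = [(G(zi), zi) for zi in z]        # (G(z), z)

# With these toy maps the generator and encoder invert each other,
# i.e. E(G(z)) = z, the inverse mapping discussed in the text.
print(np.allclose(E(G(z)), z))   # → True
```

When the discriminator cannot tell the two kinds of pairs apart, the joint distributions of (x, E(x)) and (G(z), z) match, which is what forces E to approximate the inverse of G.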
Fig. 3: The structures of (a) BiGAN and ALI and (b) AGE.

The input of the discriminator is a pair composed of a data point and its corresponding latent code. For real data x, the pair is (x, E(x)), where E(x) is obtained from the encoder E. For the generated data G(z), the pair is (G(z), z), where z is the noise vector generating the data G(z) through the generator G. Similar to (1), the objective function of BiGAN is

min_{G,E} max_D V(D, E, G) = E_{x∼p_data(x)}[log D(x, E(x))] + E_{z∼p_z(z)}[log(1 − D(G(z), z))].   (57)

The generator of [40], [42] can be seen as the decoder, since the generator maps vectors from the latent space to the data space, which is the same function the decoder performs.

Similar to utilizing an encoding process to model the distribution of latent samples, Gurumurthy et al. [209] model the latent space as a mixture of Gaussians and learn the mixture components that maximize the likelihood of generated samples under the data generating distribution.

In an encoding-decoding model, the output (also known as a reconstruction) ought to be similar to the input in the ideal case. Generally, the fidelity of reconstructed samples synthesized using a BiGAN/ALI is poor. With an additional adversarial cost on the distribution of data samples and their reconstructions [158], the fidelity of samples may be improved. Other related methods include the variational discriminator bottleneck (VDB) [210] and MDGAN (detailed in Paragraph 3.3.1.5).

Combination of a generator and an encoder: Different from the previous hybrids of autoencoders and GANs, the Adversarial Generator-Encoder (AGE) network [42] is set up directly between the generator and the encoder, and no external mappings are trained in the learning process. The structure of AGE is shown in Fig. 3(b), where R is the reconstruction loss function.
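The reconstruction losses that recur in these encoder-generator hybrids (the term d(x, G∘E(x)) in equations (41)-(42), and the loss R in AGE) all reduce to comparing a sample with its encode-decode round trip. A toy sketch with hand-picked linear maps (our own illustrative stand-ins for trained networks):

```python
import numpy as np

# Toy encoder/decoder pair (illustrative stand-ins for trained networks).
def E(x):                                    # encoder: 2-D data -> 1-D code
    return x @ np.array([[1.0], [1.0]])

def G(z):                                    # generator: 1-D code -> 2-D data
    return z @ np.array([[0.5, 0.5]])

x = np.array([[1.0, 1.0],
              [2.0, 0.0]])

# Euclidean reconstruction loss d(x, G(E(x))): zero for the first sample
# (it lies in the subspace the code preserves), positive for the second.
recon = G(E(x))
loss = np.linalg.norm(x - recon, axis=1).mean()
print(round(loss, 4))   # → 0.7071
```

The non-zero term shows exactly what such a loss penalizes: information about x that the bottleneck code E(x) fails to preserve.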
In AGE, there are two reconstruction losses: one between the latent variable z and E(G(z)), and one between the data x and G(E(x)). AGE is similar to CycleGAN. However, there are two differences between them:
• CycleGAN [53] is used between two modalities of an image, such as grayscale and color, whereas AGE acts between the latent space and the true data space.
• There is a discriminator for each modality in CycleGAN, while there is no discriminator in AGE.

3.3.3.8 Multi-discriminator learning:
GANs have a discriminator together with a generator. Different from GANs, the dual discriminator GAN (D2GAN) [43] has a generator and two binary discriminators. D2GAN is analogous to a minimax game, wherein one discriminator gives high scores for samples from the generated distribution whilst the other discriminator, conversely, favors data from the true distribution, and the generator generates data to fool both discriminators. The reference [43] develops a theoretical analysis to show that, given the optimal discriminators, optimizing the generator of D2GAN minimizes both the KL and the reverse KL divergence between the true distribution and the generated distribution, thus effectively overcoming the mode collapsing problem. The generative multi-adversarial network (GMAN) [44] further extends GANs to a generator and multiple discriminators. Albuquerque et al. [211] performed multi-objective training of GANs with multiple discriminators.

3.3.3.9 Multi-generator learning:
The multi-generator GAN (MGAN) [45] is proposed to train GANs with a mixture of generators to avoid the mode collapsing problem. More specifically, MGAN has one binary discriminator, K generators, and a multi-class classifier. The distinguishing feature of MGAN is that the generated samples are produced by multiple generators, and then only one of them is randomly selected as the final output, similar to the mechanism of a probabilistic mixture model. The classifier shows which generator a generated sample comes from.

The method most closely related to MGAN is multi-agent diverse GANs (MAD-GAN) [46]. The difference between MGAN and MAD-GAN can be found in [45]. SentiGAN [212] uses a mixture of generators and a multi-class discriminator to generate sentimental texts.

3.3.3.10 Multi-GAN learning:
Coupled GAN (CoGAN) [47] is proposed for learning a joint distribution of two-domain images. CoGAN is composed of a pair of GANs, GAN1 and GAN2, each of which synthesizes images in one domain. Two GANs respectively considering structure and style are proposed in [213] based on cGANs. Causal fairness-aware GANs (CFGAN) [214] use two generators and two discriminators for generating fair data. The structures of GANs, D2GAN, MGAN, and CoGAN are shown in Fig. 4.

3.3.3.11 Summary:
There are many GANs' variants, and milestone ones are shown in Fig. 5. Due to space limitations, only a limited number of variants are shown in Fig. 5.

GANs' objective-function-based variants can be generalized to structure variants. Compared with the other objective-function-based variants, both SN-GANs and RGANs show stronger generalization ability, and these two can be generalized to the other objective-function-based variants: spectral normalization is capable of being generalized to any type of GANs' variant, while RGANs can be generalized to any IPM-based GAN.

3.4 Evaluation metrics for GANs
In this subsection, we show the evaluation metrics [215], [216] that are used for GANs.

3.4.1 Inception Score (IS)
The Inception Score (IS) is proposed in [29]. It uses the Inception model [217] on every generated image to get the conditional label distribution p(y|x). Images that have