A Survey of Image Synthesis and Editing with Generative Adversarial Networks
Xian Wu, Kun Xu*, Peter Hall

Abstract: This paper presents a survey of image synthesis and editing with generative adversarial networks (GANs). GANs consist of two deep networks, a generator and a discriminator, which are trained in a competitive way. Due to the power of deep networks and the competitive training manner, GANs are capable of producing reasonable and realistic images, and have shown great capability in many image synthesis and editing applications. This paper surveys recent GAN papers on topics including, but not limited to, texture synthesis, image inpainting, image-to-image translation, and image editing.

Key words: image synthesis; image editing; constrained image synthesis; generative adversarial networks; image-to-image translation

• Xian Wu and Kun Xu are with TNList, the Department of Computer Science and Technology, Tsinghua University, Beijing, China; Peter Hall is with the Department of Computer Science, University of Bath, Bath, UK.
* Corresponding author, [email protected]

1 Introduction

With the rapid development of the Internet and digital capturing devices, huge volumes of images have become readily available. There are now widespread demands for tasks requiring synthesizing and editing images, such as removing unwanted objects from wedding photographs, adjusting the colors of landscape images, and turning photographs into artwork (or vice versa). These and other problems have attracted significant attention within both the computer graphics and computer vision communities. A variety of methods have been proposed for image/video editing and synthesis, including texture synthesis [1–3], image inpainting [4–6], image stylization [7, 8], image deformation [9, 10], and so on. Although many methods have been proposed, intelligent image synthesis and editing remains a challenging problem, because these traditional methods are mostly based on pixels [1, 4, 11], patches [8, 10, 12, 13], and low-level image features [3, 14], and lack high-level semantic information.

In recent years, deep learning techniques have made a breakthrough in computer vision. Trained on large-scale data, deep neural networks substantially outperform previous techniques with regard to the semantic understanding of images, achieving state of the art in various tasks including image classification [15–17], object detection [18, 19], image segmentation [20, 21], etc.

Deep learning has also shown great ability in content generation. In 2014, Goodfellow et al. proposed a generative model called generative adversarial networks (GANs) [22]. GANs contain two networks: a generator and a discriminator. The discriminator tries to distinguish fake images from real ones; the generator produces fake images and tries to fool the discriminator. The two networks are jointly trained in a competitive way, and the resulting generator is able to synthesize plausible images. GAN variants have now achieved impressive results in a variety of image synthesis and editing applications.

In this survey, we cover recent papers that leverage generative adversarial networks (GANs) for image synthesis and editing applications, and discuss their ideas, contributions, and drawbacks. The survey is structured as follows. Section 2 provides a brief introduction to GANs and related variants. Section 3 discusses applications in image synthesis, including texture synthesis, image inpainting, and face and human image synthesis. Section 4 discusses applications in constrained image synthesis, including general image-to-image translation, text-to-image, and sketch-to-image. Section 5 discusses applications in image editing and video generation. Finally, Section 6 provides a summary discussion and current challenges and limitations of GAN-based methods.
2 Generative Adversarial Networks

Generative adversarial networks (GANs) were proposed by Goodfellow et al. [22] in 2014. They contain two networks: a generator G and a discriminator D. The generator tries to create fake but plausible images, while the discriminator tries to distinguish fake images (produced by the generator) from real images. Formally, the generator G maps a noise vector z in the latent space to an image, G(z) → x, and the discriminator is defined as D(x) → [0, 1], which classifies an image as real (i.e., output close to 1) or fake (i.e., output close to 0).

To train the networks, the loss function is formulated as:

$$\min_{G}\max_{D}\ \mathbb{E}_{x\in X}[\log D(x)] + \mathbb{E}_{z\in Z}[\log(1 - D(G(z)))], \qquad (1)$$

where X denotes the set of real images and Z denotes the latent space. The above loss function (Equation 1) is referred to as the adversarial loss. The two networks are trained in a competitive fashion with back propagation. The structure of GANs is illustrated in Fig. 1.

Fig. 1 Structure of GANs.

Compared with other generative models such as variational autoencoders (VAEs) [23], images generated by GANs are usually less blurred and more realistic. It is also theoretically proven that optimal GANs exist: the generator then perfectly produces images matching the distribution of real images, and the discriminator always outputs 1/2 [22]. However, in practice, training GANs is difficult for several reasons: firstly, network convergence is difficult to achieve [24]; secondly, GANs often fall into "mode collapse", in which the generator produces the same or similar images for different noise vectors z. Various extensions of GANs have been proposed to improve training stability [24–28].

cGANs. Mirza et al. [29] introduced conditional generative adversarial networks (cGANs), which extend GANs into a conditional model. In cGANs, the generator G and the discriminator D are conditioned on some extra information c; this is done by feeding c as an additional input to both G and D. The extra information could be class labels, text, or sketches. cGANs provide additional control over which kind of data are being generated, whereas the original GANs lack such control. This makes cGANs popular for image synthesis and image editing applications. The structure of cGANs is illustrated in Fig. 2.

Fig. 2 Structure of cGANs.

DCGANs. Radford and Metz presented deep convolutional generative adversarial networks (DCGANs) [24], a class of architecturally constrained convolutional networks for both the generator and the discriminator. The architectural constraints include: 1) replacing all pooling layers with strided convolutions (in the discriminator) and fractional-strided convolutions (in the generator); 2) using batchnorm layers; 3) removing fully connected hidden layers; 4) in the generator, using tanh as the activation function for the output layer and rectified linear units (ReLU) for the other layers; 5) in the discriminator, using LeakyReLU activation for all layers. DCGANs have been shown to be more stable in training and able to produce higher quality images; hence they have been widely used in many applications.
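To make the alternating optimization of Equation (1) concrete, the following is a minimal PyTorch sketch of a GAN training step. It is an illustration, not the setup of any surveyed paper: LATENT_DIM, the layer sizes, the learning rates, and the train_step helper are all assumptions, and the generator uses the common non-saturating variant (maximizing log D(G(z)) rather than minimizing log(1 − D(G(z)))).

```python
import torch
import torch.nn as nn

LATENT_DIM = 100  # dimensionality of the noise vector z (assumed)

# Generator G: maps a noise vector z to a flattened 28x28 image in [-1, 1].
G = nn.Sequential(
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Tanh(),           # tanh output, as in DCGANs
)

# Discriminator D: maps an image to a probability of being real, in [0, 1].
D = nn.Sequential(
    nn.Linear(784, 256), nn.LeakyReLU(0.2),   # LeakyReLU, as in DCGANs
    nn.Linear(256, 1), nn.Sigmoid(),
)

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCELoss()  # binary cross-entropy realizes the log terms of Eq. (1)

def train_step(real_images):
    """One alternating update of D and G on a batch of flattened images."""
    n = real_images.size(0)
    ones, zeros = torch.ones(n, 1), torch.zeros(n, 1)

    # Discriminator step: maximize log D(x) + log(1 - D(G(z))).
    fake_images = G(torch.randn(n, LATENT_DIM)).detach()  # no grad into G
    loss_d = bce(D(real_images), ones) + bce(D(fake_images), zeros)
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()

    # Generator step: fool D by pushing D(G(z)) toward the "real" label.
    loss_g = bce(D(G(torch.randn(n, LATENT_DIM))), ones)
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
    return loss_d.item(), loss_g.item()

# Usage: feed batches of real images scaled to [-1, 1], e.g.
# loss_d, loss_g = train_step(torch.rand(64, 784) * 2 - 1)
```

For a cGAN, the condition c would simply be concatenated to the generator input z and to the discriminator input; for a DCGAN, the linear layers would be replaced by the (fractional-)strided convolutions listed above.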
LAPGAN. Laplacian generative adversarial networks (LAPGAN) [30] are composed of a cascade of convolutional GANs within the framework of a Laplacian pyramid with K levels. At the coarsest level, K, a GAN is trained which maps a noise vector to an image at the coarsest resolution. At each level of the pyramid except the coarsest one (i.e., level k, 0 ≤ k < K), a separate cGAN is trained, which takes the output image at the coarser level (i.e., level k + 1) as a conditional variable to generate the residual image at this level. Owing to this coarse-to-fine process, LAPGANs are able to produce images of higher resolution.

Other extensions. Zhao et al. proposed an energy-based generative adversarial network (EBGAN) [25], which views the discriminator as an energy function instead of a probability function; they show that EBGANs are more stable in training. To overcome the vanishing gradient problem, Mao et al. proposed least squares generative adversarial networks (LSGAN) [26], which replace the log function with a least squares function in the adversarial loss. Arjovsky et al. proposed Wasserstein generative adversarial networks (WGAN) [27]. They first theoretically show that the Earth Mover (EM) distance yields better gradient behavior for learning distributions than other distance measures, and accordingly train the networks to minimize an approximation of the EM distance.

3 Image Synthesis

3.1 Texture Synthesis

In neural texture synthesis, a texture is characterized by the Gram-matrix statistics of deep features: given an input texture, Gram matrices are computed from the correlations of feature responses in some layers of a pre-trained network. The target texture is obtained by minimizing the distance between the Gram-matrix representation of the target texture and that of the input texture. The target texture starts from random noise and is iteratively optimized through back propagation; hence, its computational cost is high.

MGANs. Li et al. [33] proposed a real-time texture synthesis method. They first introduced Markovian deconvolutional adversarial networks (MDANs). Given a content image x_c (e.g., a face image) and a texture image x_t (e.g., a texture of leaves), MDANs synthesize a target image x_s (e.g., a face image textured by leaves). The feature maps of an image are defined as the feature maps extracted from a pre-trained VGG19 [34] by feeding the image into it, and neural patches of an image are defined as patches sampled from its feature maps.
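The Gram-matrix pipeline above is short to express in code. The sketch below is a minimal illustration, assuming torchvision's pre-trained VGG19; the layer indices in LAYERS, the equal loss weighting, and the iteration count are assumptions rather than the exact configuration of the cited methods, and input normalization is omitted for brevity.

```python
import torch
import torchvision.models as models

# Pre-trained VGG19 feature extractor (assumes torchvision >= 0.13).
vgg = models.vgg19(weights=models.VGG19_Weights.IMAGENET1K_V1).features.eval()
for p in vgg.parameters():
    p.requires_grad_(False)  # only the synthesized image is optimized

LAYERS = {1, 6, 11, 20}  # a few early activation layers; an assumed choice

def gram_matrix(feat):
    """Gram matrix of feature responses: channel-by-channel correlations."""
    b, c, h, w = feat.shape
    f = feat.view(b, c, h * w)
    return f @ f.transpose(1, 2) / (c * h * w)  # normalized C x C matrix

def gram_features(image):
    """Collect Gram matrices of the feature responses at selected layers."""
    grams, x = [], image
    for i, layer in enumerate(vgg):
        x = layer(x)
        if i in LAYERS:
            grams.append(gram_matrix(x))
    return grams

def texture_loss(target, exemplar):
    """Squared distance between the Gram representations of two images."""
    return sum((gt - ge).pow(2).sum()
               for gt, ge in zip(gram_features(target), gram_features(exemplar)))

# Optimization-based synthesis: start from random noise and descend on the
# loss through back propagation.
exemplar = torch.rand(1, 3, 224, 224)                 # stand-in input texture
target = torch.rand(1, 3, 224, 224, requires_grad=True)
optimizer = torch.optim.LBFGS([target])

def closure():
    optimizer.zero_grad()
    loss = texture_loss(target, exemplar)
    loss.backward()
    return loss

for _ in range(20):  # each step runs full forward/backward passes through VGG
    optimizer.step(closure)
```

The inner loop of repeated forward and backward passes through the pre-trained network is exactly the computational cost noted above, and is what feed-forward approaches such as MGANs are designed to avoid.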