High-Fidelity Image Generation with Fewer Labels
High-Fidelity Image Generation With Fewer Labels

Mario Lucic*¹, Michael Tschannen*², Marvin Ritter*¹, Xiaohua Zhai¹, Olivier Bachem¹, Sylvain Gelly¹

*Equal contribution. ¹Google Research, Brain Team. ²ETH Zurich. Correspondence to: Mario Lucic <[email protected]>, Michael Tschannen <[email protected]>, Marvin Ritter <[email protected]>.

Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).

Abstract

Deep generative models are becoming a cornerstone of modern machine learning. Recent work on conditional generative adversarial networks has shown that learning complex, high-dimensional distributions over natural images is within reach. While the latest models are able to generate high-fidelity, diverse natural images at high resolution, they rely on a vast quantity of labeled data. In this work we demonstrate how one can benefit from recent work on self- and semi-supervised learning to outperform the state of the art on unsupervised ImageNet synthesis, as well as in the conditional setting. In particular, the proposed approach is able to match the sample quality (as measured by FID) of the current state-of-the-art conditional model BigGAN on ImageNet using only 10% of the labels and outperform it using 20% of the labels.

Figure 1. Median FID of the baselines and the proposed method. The vertical line indicates the baseline (BIGGAN), which uses all of the labeled data. The proposed method (S³GAN) is able to match the state of the art while using only 10% of the labeled data and outperform it with 20%.

1. Introduction

Deep generative models have received a great deal of attention due to their power to learn complex high-dimensional distributions, such as distributions over natural images (van den Oord et al., 2016b; Dinh et al., 2017; Brock et al., 2019), videos (Kalchbrenner et al., 2017), and audio (van den Oord et al., 2016a). Recent progress was driven by scalable training of large-scale models (Brock et al., 2019; Menick & Kalchbrenner, 2019), architectural modifications (Zhang et al., 2019; Chen et al., 2019a; Karras et al., 2019), and normalization techniques (Miyato et al., 2018).

Currently, high-fidelity natural image generation hinges upon having access to vast quantities of labeled data. The labels induce rich side information into the training process, which effectively decomposes the extremely challenging image generation task into semantically meaningful sub-tasks. However, this dependence on vast quantities of labeled data is at odds with the fact that most data is unlabeled, and labeling itself is often costly and error-prone. Despite the recent progress on unsupervised image generation, the gap between conditional and unsupervised models in terms of sample quality remains significant.

In this work, we take a significant step towards closing the gap between conditional and unsupervised generation of high-fidelity images using generative adversarial networks (GANs). We leverage two simple yet powerful concepts:

(i) Self-supervised learning: A semantic feature extractor for the training data can be learned via self-supervision, and the resulting feature representation can then be employed to guide the GAN training process.

(ii) Semi-supervised learning: Labels for the entire training set can be inferred from a small subset of labeled training images, and the inferred labels can be used as conditional information for GAN training (a minimal sketch of this idea is given at the end of this section).

Our contributions. In this work, we

1. propose and study various approaches to reduce or fully omit ground-truth label information for natural image generation tasks,

2. achieve a new state of the art (SOTA) in unsupervised generation on IMAGENET, match the SOTA on 128×128 IMAGENET using only 10% of the labels, and set a new SOTA (measured by FID) using 20% of the labels, and

3. open-source all the code used for the experiments at github.com/google/compare_gan.
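To make idea (ii) concrete, the following is a minimal, hypothetical sketch rather than the paper's actual pipeline: a linear classifier is fit on feature vectors of the small labeled subset and then used to infer labels for the remaining images. The names (infer_labels, features, n_labeled) are illustrative assumptions; in the paper's setting the features would come from a self-supervised network.

# Minimal sketch of idea (ii): infer labels for the full training set from a
# small labeled subset, then use the inferred labels as GAN conditioning.
# Hypothetical setup: `features` holds one semantic feature vector per image
# (e.g., from a self-supervised network); only the first `n_labeled` rows
# come with ground-truth labels.
import numpy as np
from sklearn.linear_model import LogisticRegression

def infer_labels(features: np.ndarray, labels: np.ndarray, n_labeled: int) -> np.ndarray:
    """Fit a linear classifier on the labeled subset and predict the rest."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(features[:n_labeled], labels[:n_labeled])
    inferred = clf.predict(features[n_labeled:])
    # Keep ground-truth labels where available, inferred labels elsewhere.
    return np.concatenate([labels[:n_labeled], inferred])

The resulting label vector can then be fed to a label-conditional GAN exactly as ground-truth labels would be.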
2. Background and related work

High-fidelity GANs on IMAGENET. Besides BIGGAN (Brock et al., 2019), only a few prior methods have managed to scale GANs to ImageNet, most of them relying on class-conditional generation using labels. Among the earliest attempts are GANs with an auxiliary classifier (AC-GANs) (Odena et al., 2017), which feed one-hot encoded label information with the latent code to the generator and equip the discriminator with an auxiliary head predicting the image class in addition to whether the input is real or fake. More recent approaches rely on a label projection layer in the discriminator, essentially resulting in per-class real/fake classification (Miyato & Koyama, 2018), and self-attention in the generator (Zhang et al., 2019). Both methods use modulated batch normalization (De Vries et al., 2017) to provide label information to the generator. On the unsupervised side, Chen et al. (2019b) showed that an auxiliary rotation loss added to the discriminator has a stabilizing effect on the training. Finally, appropriate gradient regularization enables scaling MMD-GANs to ImageNet without using labels (Arbel et al., 2018).

Figure 2. Top row: 128×128 samples from our implementation of the fully supervised current SOTA model BIGGAN. Bottom row: samples from the proposed S³GAN, which matches BIGGAN in terms of FID and IS using only 10% of the ground-truth labels.

Semi-supervised GANs. Several recent works have leveraged GANs for semi-supervised learning of classifiers. Both Salimans et al. (2016) and Odena (2016) train a discriminator that classifies its input into K+1 classes: K image classes for real images, and one class for generated images. Similarly, Springenberg (2016) extends the standard GAN objective to K classes. This approach was also considered by Li et al. (2017), where separate discriminator and classifier models are applied. Other approaches incorporate inference models to predict missing labels (Deng et al., 2017) or harness joint distribution (of labels and data) matching for semi-supervised learning (Gan et al., 2017). To the best of our knowledge, improvements in sample quality through partial label information are only reported in Li et al. (2017); Deng et al. (2017); Sricharan et al. (2017), all of which consider only low-resolution data sets from a restricted domain.

Self-supervised learning. Self-supervised learning methods employ a label-free auxiliary task to learn a semantic feature representation of the data. This approach has been successfully applied to different data modalities, such as images (Doersch et al., 2015; Caron et al., 2018), video (Agrawal et al., 2015; Lee et al., 2017), and robotics (Jang et al., 2018; Pinto & Gupta, 2016). The current state-of-the-art method on IMAGENET is due to Gidaris et al. (2018), who proposed predicting the rotation angle of rotated training images as an auxiliary task (see the sketch below). This simple self-supervision approach yields representations which are useful for downstream image classification tasks. Other forms of self-supervision include predicting the relative locations of disjoint image patches of a given image (Doersch et al., 2015; Mundhenk et al., 2018) or estimating the permutation of randomly swapped image patches on a regular grid (Noroozi & Favaro, 2016). A study of self-supervised learning with modern neural architectures is provided in Kolesnikov et al. (2019).
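As an illustration of the rotation task of Gidaris et al. (2018) described above, the following is a minimal sketch, not the authors' implementation; `model` is assumed to be any image classifier with four output logits, and the images are assumed square.

# Minimal sketch of rotation self-supervision (Gidaris et al., 2018): rotate
# each image by 0, 90, 180, and 270 degrees and train a network to predict
# the rotation, yielding a semantic feature extractor without any labels.
import torch

def rotation_batch(images: torch.Tensor):
    """Build the 4 rotated copies of a batch and their rotation targets."""
    # images: (N, C, H, W) with H == W; rot90 acts on the spatial dims.
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    targets = torch.arange(4, device=images.device).repeat_interleave(images.size(0))
    return rotated, targets

def rotation_loss(model: torch.nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over the 4-way rotation prediction task."""
    rotated, targets = rotation_batch(images)
    logits = model(rotated)  # shape (4N, 4): one logit per rotation class
    return torch.nn.functional.cross_entropy(logits, targets)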
3. Reducing the appetite for labeled data

In a nutshell, instead of providing hand-annotated ground-truth labels for real images to the discriminator, we will provide inferred ones. To obtain these labels we make use of recent advancements in self- and semi-supervised learning. We propose and study several different methods with different degrees of computational and conceptual complexity. We emphasize that our work focuses on using few labels to improve the quality of the generative model, rather than on training a powerful classifier from a few labels, as extensively studied in prior work on semi-supervised GANs.

Before introducing these methods in detail, we discuss how label information is used in state-of-the-art GANs. The following exposition assumes familiarity with the basics of the GAN framework (Goodfellow et al., 2014).

Incorporating the labels. To provide the label information to the discriminator we employ a linear projection layer, as proposed by Miyato & Koyama (2018). To make the exposition self-contained, we briefly recall the main ideas. In a "vanilla" (unconditional) GAN, the discriminator D learns to predict whether the image x at its input is real or generated by the generator G. We decompose the discriminator into a learned discriminator representation, D̃, which is fed into a linear classifier, c_r/f, i.e., the discriminator is given by c_r/f(D̃(x)). In the projection discriminator, one learns an embedding for each class of the same dimension as the representation D̃(x). Then, for a given image-label input (x, y), the decision on whether the sample is real or generated is based on two components: (a) whether the representation D̃(x) itself is consistent with the real data, and (b) whether the representation D̃(x) is consistent with the real data from class y. More formally, the discriminator takes the form D(x, y) = c_r/f(D̃(x)) + ⟨P(y), D̃(x)⟩, where P(y) is the learned embedding of class y (see Figure 3 and the sketch below).

Figure 3. Conditional GAN with projection discriminator. The discriminator tries to predict from the representation D̃ whether a real image x_r (with label y_r) or a generated image x_f (with label y_f) is at its input, by combining an unconditional classifier c_r/f and a class-conditional projection layer P.

Figure 4. CLUSTERING: Unsupervised approach based on clustering the representations obtained by solving a self-supervised task. F corresponds to the feature extractor learned via self-supervision and c_CL is the cluster assignment function.
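The projection discriminator just described (Figure 3) can be sketched in a few lines. This is a minimal illustration under assumed shapes, not the BIGGAN discriminator used in the paper; `backbone` stands in for the learned representation D̃.

# Minimal sketch of the projection discriminator (Miyato & Koyama, 2018):
# D(x, y) = c_r/f(D~(x)) + <P(y), D~(x)>.
import torch
import torch.nn as nn

class ProjectionDiscriminator(nn.Module):
    def __init__(self, backbone: nn.Module, feature_dim: int, num_classes: int):
        super().__init__()
        self.backbone = backbone                          # D~: image -> feature vector
        self.c_rf = nn.Linear(feature_dim, 1)             # unconditional real/fake logit
        self.embed = nn.Embedding(num_classes, feature_dim)  # class embeddings P

    def forward(self, x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        h = self.backbone(x)  # D~(x), shape (N, feature_dim)
        # (a) unconditional term + (b) projection onto the class-y embedding.
        return self.c_rf(h) + (self.embed(y) * h).sum(dim=1, keepdim=True)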
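Similarly, the CLUSTERING approach of Figure 4 can be sketched as follows. This is a hypothetical minimal version, not the paper's exact procedure: the cluster assignment function c_CL is obtained by running k-means on features produced by a self-supervised extractor F (e.g., a rotation-prediction network as sketched above), and the number of clusters is an illustrative choice.

# Minimal sketch of CLUSTERING (Figure 4): pseudo-labels y = c_CL(F(x)) from
# k-means over self-supervised features, usable in place of ground-truth
# labels for conditional GAN training (generator conditioning and projection).
import numpy as np
from sklearn.cluster import KMeans

def cluster_labels(features: np.ndarray, num_clusters: int = 50) -> np.ndarray:
    """c_CL: assign each image a pseudo-label via k-means over F(x)."""
    kmeans = KMeans(n_clusters=num_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(features)  # pseudo-labels in {0, ..., num_clusters - 1}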