Dual Contradistinctive Generative Autoencoder

Gaurav Parmar*¹  Dacheng Li*¹  Kwonjoon Lee*²  Zhuowen Tu²
¹Carnegie Mellon University  ²UC San Diego
{gparmar,dacheng2}@andrew.cmu.edu  {kwl042,ztu}@ucsd.edu

(a) DC-VAE (ours) Reconstruction results. Left: 128 × 128. Right: 512 × 512.

(b) DC-VAE (ours) Sampling results. Left: 128 × 128. Right: 512 × 512.

Figure 1: DC-VAE reconstruction (top) and sampling (bottom) on LSUN Bedroom [68] at resolution 128 × 128 (left) and CelebA-HQ [35] at resolution 512 × 512 (right).

Abstract

We present a new generative autoencoder model with dual contradistinctive losses to improve generative autoencoders that perform simultaneous inference (reconstruction) and synthesis (sampling). Our model, named dual contradistinctive generative autoencoder (DC-VAE), integrates an instance-level discriminative loss (maintaining the instance-level fidelity for the reconstruction/synthesis) with a set-level adversarial loss (encouraging the set-level fidelity for the reconstruction/synthesis), both being contradistinctive. Extensive experimental results by DC-VAE across different resolutions, including 32×32, 64×64, 128×128, and 512×512, are reported. The two contradistinctive losses work harmoniously in DC-VAE, leading to a significant qualitative and quantitative performance enhancement over the baseline VAEs without architectural changes. State-of-the-art or competitive results among generative autoencoders for image reconstruction, image synthesis, image interpolation, and representation learning are observed. DC-VAE is a general-purpose VAE model, applicable to a wide variety of downstream tasks in computer vision and machine learning. Code: https://github.com/mlpc-ucsd/DC-VAE.

1. Introduction

Tremendous progress has been made in deep learning for the development of various learning frameworks [40, 24, 17, 64]. The autoencoder (AE) [44, 27] aims to compactly represent and faithfully reproduce the original input signal by concatenating an encoder and a decoder in an end-to-end learning framework. The goal of AE is to make the encoded representation semantically efficient and sufficient for reproducing the input signal by its decoder. The autoencoder's generative companion, the variational autoencoder (VAE) [38], additionally learns a variational model for the latent variables to capture the underlying sample distribution.

The key objective for a generative autoencoder is to maintain two types of fidelity: (1) an instance-level fidelity to make the reconstruction/synthesis faithful to the individual input data sample, and (2) a set-level fidelity to make the reconstruction/synthesis of the decoder faithful to the entire input data set.

* indicates equal contribution.

The VAE/GAN algorithm [42] combines a reconstruction loss (for instance-level fidelity) with an adversarial loss (for set-level fidelity). However, the result of VAE/GAN is sub-optimal, as shown in Table 1.

The pixel-wise reconstruction loss in the standard VAE [38] typically results in blurry images with degenerated semantics. A possible solution lies in two aspects: (1) turning the measure in the pixel space into one in an induced feature space that is more semantically meaningful; and (2) changing the per-pixel L2 distance into a learned instance-level distance function for the entire image (akin to generative adversarial networks, which learn set-level distance functions). Taking these two steps allows us to design an instance-level classification loss that is aligned with the adversarial loss in the GAN model enforcing set-level fidelity. Motivated by these observations, we develop a new generative autoencoder model with dual contradistinctive losses by adopting a discriminative loss performing instance-level classification (enforcing instance-level fidelity), which is rooted in metric learning [41] and contrastive learning [21, 66, 63]. Combined with the adversarial loss for set-level fidelity, both terms are formulated in the induced feature space and perform contradistinction: (1) the instance-level contrastive loss considers each input instance (image) itself as a class, and (2) the set-level adversarial loss treats the entire input set as a positive class. We name our method dual contradistinctive generative autoencoder (DC-VAE) and make the following contributions:

• We develop a new algorithm, dual contradistinctive generative autoencoder (DC-VAE), by combining instance-level and set-level classification losses in the VAE framework, and systematically show the significance of these two loss terms in DC-VAE;
• The effectiveness of DC-VAE is illustrated in a number of tasks, including image reconstruction, image synthesis, image interpolation, and representation learning, by reconstructing and sampling images across different resolutions including 32 × 32, 64 × 64, 128 × 128, and 512 × 512;
• Under the new loss term, DC-VAE attains a significant performance boost over the competing methods without architectural change, thus potentially providing a handy solution for many AE/VAE-based modeling and representation applications. DC-VAE greatly reduces the gap in image synthesis performance between the baseline VAE and competitive GAN models.

2. Related Work

Related work can be roughly divided into three categories: (1) generative autoencoders, (2) deep generative models, and (3) contrastive learning.

Generative autoencoder. The variational autoencoder (VAE) [38] points to an exciting direction of generative models by developing an Evidence Lower BOund (ELBO) objective. However, the VAE reconstruction/synthesis is known to be blurry. To improve the image quality, a sequence of VAE-based models has been developed [42, 15, 30, 3, 69]. VAE/GAN [42] adopts an adversarial loss to improve image quality, but its output for both reconstruction and synthesis (new samples) is still unsatisfactory. IntroVAE [30] adds a loop from the output back to the input and is able to attain image quality that is on par with some modern GANs in some aspects; however, its full illustration for both reconstruction and synthesis is unclear. PGA [69] adds a constraint to the latent variables.

Deep generative model. Pioneering works [61, 20] alleviate the difficulty of learning densities by approximating likelihoods via classification (real (positive) samples vs. fake (pseudo-negative or adversarial) samples). The generative adversarial network (GAN) [17] builds on neural networks and amortized sampling (a decoder network that maps a noise vector into an image). The subsequent development after GAN [56, 2, 18, 35, 16] has led to a great leap forward in building decoder-based generative models. It has been widely observed that the adversarial loss in GANs contributes significantly to the improved quality of image synthesis. InfoGAN [9] aims to maximize mutual information between latent variables and the output of the generator; in contrast, DC-VAE can be understood as increasing mutual information between the input and output of the autoencoder. Furthermore, InfoGAN does not learn an encoder. ALI/BiGAN [15, 14] additionally learn an inference network (encoder); however, they are shown to have poor reconstruction. Energy-based generative models [57, 67, 43, 32, 47], which aim to directly model the data density, are making steady improvements toward a single model that is simultaneously generative and discriminative.

Contrastive learning. From another angle, contrastive learning [21, 66, 23, 6] has lately shown a particular advantage in the unsupervised training of CNN features. It overcomes the limitation of unsupervised learning, where the class label is missing, by turning each image instance into its own class; thus, the softmax function from standard discriminative classification training can be applied. Contrastive learning can be connected to metric learning [4, 10, 5].

In this paper, we aim to improve VAE [38] by introducing a contrastive loss [63] to address instance-level fidelity between the input and the reconstruction in the induced feature space. Unlike in self-supervised representation learning methods [63, 23, 6], where self-supervision requires generating a transformed input (via augmentation operations), the reconstruction naturally fits into the contrastive term, which encourages the matching between the reconstruction and the input image instance while pushing the reconstruction away from the rest of the images in the entire training set. Thus, the instance-level and set-level contradistinctive terms collaborate with each other to encourage high fidelity of the reconstruction and synthesis.

In Figure 3 and Table 1, we systematically show the effect of including or removing the instance-level and set-level contradistinctive terms. In addition, we explore multi-scale contrastive learning via two schemes in Section 4.2: 1) deep supervision for contrastive learning at different convolution layers, and 2) patch-based contrastive learning for fine-grained data fidelity. In the experiments, we show competitive results for the proposed DC-VAE on a number of benchmarks for three tasks: image synthesis, image reconstruction, and representation learning.

3. Preliminaries: VAE and VAE/GAN

Variational autoencoder (VAE). Assume a given training set S = {x_i}_{i=1}^n, where each x_i ∈ R^m. We suppose that each x_i is sampled from a generative process p(x|z). In the literature, the vector z refers to the latent variables. In practice, the latent variables z and the generative process p(x|z) are unknown. The objective of a variational autoencoder (VAE) [38] is to simultaneously train an inference network q_φ(z|x) and a generator network p_θ(x|z). In VAE [38], the inference network is a neural network that outputs the parameters of a Gaussian distribution q_φ(z|x) = N(μ_φ(x), Σ_φ(x)). The generator is a deterministic neural network f_θ(z) parameterized by θ. The generative density p_θ(x|z) is assumed to be Gaussian: p_θ(x|z) = N(f_θ(z), σ²I). These models can be trained by minimizing the negative of the evidence lower bound (ELBO) in Eq. (1) below:

    L_ELBO(θ, φ; x) = −E_{z∼q_φ(z|x)}[log p_θ(x|z)] + KL[q_φ(z|x) ‖ p(z)]        (1)

where p(z) is the prior, which is assumed to be N(0, I). The first term, −E_{q_φ(z|x)}[log p_θ(x|z)], reduces to the standard pixel-wise reconstruction loss E_{q_φ(z|x)}[‖x − f_θ(z)‖²₂] (up to a constant) due to the Gaussian assumption. The second term is a regularization term, which prevents the conditional q_φ(z|x) from deviating from the Gaussian prior N(0, I). The inference network and the generator network are jointly optimized over the training samples by:

    min_{θ,φ} E_{x∼p_data(x)} L_ELBO(θ, φ; x)        (2)

where p_data is the distribution induced by the training set S. VAE has an elegant formulation. However, it relies on a pixel-wise reconstruction loss, which is known not to be reflective of perceptual realism [33, 31] and often results in blurry images. From another viewpoint, it can be thought of as using a kernel density estimator (with an isotropic Gaussian kernel) in the pixel space. Although this allows efficient training and inference, such a non-parametric approach is overly simplistic for modeling the semantics and perception of natural images.
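For concreteness, a minimal PyTorch-style sketch of the negative ELBO of Eqs. (1)–(2) is given below. It assumes a Gaussian posterior with diagonal covariance and uses placeholder `encoder`/`decoder` modules; it is an illustration of the standard VAE objective, not the paper's released code.

```python
import torch
import torch.nn.functional as F

def vae_elbo_loss(encoder, decoder, x):
    """Negative ELBO of Eq. (1): pixel-wise reconstruction + KL to N(0, I).

    `encoder` is assumed to return (mu, logvar) of q_phi(z|x);
    `decoder` plays the role of f_theta(z). Both are placeholders.
    """
    mu, logvar = encoder(x)
    std = torch.exp(0.5 * logvar)
    z = mu + std * torch.randn_like(std)            # reparameterization trick
    x_rec = decoder(z)
    # -E[log p_theta(x|z)] reduces to an L2 reconstruction term (up to a constant)
    rec = F.mse_loss(x_rec, x, reduction="sum") / x.size(0)
    # KL[q_phi(z|x) || N(0, I)] in closed form for a diagonal Gaussian
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    return rec + kl
```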

VAE/GAN. Generative adversarial networks (GANs) [17] and their variants [56], on the other hand, are shown to produce highly realistic images. The success is largely attributed to learning a fidelity function (often referred to as a discriminator) that measures how realistic the generated images are. This can be achieved by learning to contrast (classify) the set of training images with the set of generated images [61, 20, 17].

VAE/GAN [42] augments the ELBO objective (Eq. (2)) with the GAN objective. Specifically, the objective of VAE/GAN consists of two terms, namely the modified ELBO (Eq. (3)) and the GAN objective. To keep the later notation consistent, we define the set of given training images as S = {x_i}_{i=1}^n, in which a total of n unlabeled training images are present. For each input image x_i, the modified ELBO computes the reconstruction loss in the feature space of the discriminator instead of the pixel space:

    L_ELBO(θ, φ, D; x_i) = E_{z∼q_φ(z|x_i)}[‖F_D(x_i) − F_D(f_θ(z))‖²₂] + KL[q_φ(z|x_i) ‖ p(z)]        (3)

where F_D(·) denotes the feature embedding from the discriminator D. This feature reconstruction loss (also referred to as a perceptual loss) is similar to that used in style transfer [33]. The modified GAN objective considers both reconstructed images (latent code from q_φ(z|x)) and sampled images (latent code from the prior p(z)) as its fake samples:

    L_GAN(θ, φ, D; x_i) = log D(x_i) + E_{z∼p(z)}[log(1 − D(f_θ(z)))] + E_{z∼q_φ(z|x_i)}[log(1 − D(f_θ(z)))]        (4)

The VAE/GAN objective becomes:

    min_{θ,φ} max_D Σ_{i=1}^n [ L_ELBO(θ, φ, D; x_i) + L_GAN(θ, φ, D; x_i) ]        (5)
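The sketch below shows, under the same placeholder-module assumptions as above, how the pieces of Eqs. (3)–(5) can be wired together: the reconstruction is measured in the discriminator's feature space, and both reconstructions and prior samples act as fakes for the discriminator. It is an illustrative reading of the VAE/GAN objective, not the authors' training loop, and it writes the adversarial term in the usual binary cross-entropy form.

```python
import torch
import torch.nn.functional as F

def vae_gan_losses(encoder, decoder, disc, x):
    """VAE/GAN terms of Eqs. (3)-(4), as a sketch.

    `disc.features` plays the role of F_D; `disc.logit` returns the
    real/fake score. Both attributes are assumptions for illustration.
    """
    mu, logvar = encoder(x)
    z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
    x_rec = decoder(z)                          # latent code from q_phi(z|x)
    x_gen = decoder(torch.randn_like(mu))       # latent code from the prior p(z)

    # Eq. (3): feature-space reconstruction + KL regularization
    rec = F.mse_loss(disc.features(x_rec), disc.features(x).detach())
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    elbo_feat = rec + kl

    # Eq. (4): the discriminator treats both x_rec and x_gen as fake
    bce = F.binary_cross_entropy_with_logits
    logit_real = disc.logit(x)
    d_loss = bce(logit_real, torch.ones_like(logit_real))
    for x_fake in (x_rec.detach(), x_gen.detach()):
        logit_fake = disc.logit(x_fake)
        d_loss = d_loss + bce(logit_fake, torch.zeros_like(logit_fake))

    return elbo_feat, d_loss
```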

4. Dual contradistinctive generative autoencoder (DC-VAE)

Here we want to address the following questions: is the degeneration of the images synthesized by a VAE inevitable once the decoder is joined with an encoder? Can the problem be remedied by using a more informative loss?

Although VAE/GAN improves the image quality of VAE by integrating a set-level contrastive loss (the GAN objective of Eq. (4)), it still does not accurately model instance-level fidelity. Inspired by the literature on instance-level classification [51], approximating likelihood by classification [61], and contrastive learning [21, 66, 23], we propose to model instance-level fidelity with a contrastive loss (commonly referred to as the InfoNCE loss) [63].

Figure 2: Model architecture for the proposed DC-VAE algorithm. The encoder produces (μ, σ) for the latent code z; the decoder reconstructs the image; a shared feature embedding feeds both a contrastive head and a discriminator head (real vs. fake).
In DC-VAE, we perform the following minimization and loosely call each term a loss:

    L_instance(θ, φ, D; i, {x_j}_{j=1}^n) ≜ −E_{z∼q_φ(z|x_i)}[ log ( e^{h(x_i, f_θ(z))} / Σ_{j=1}^n e^{h(x_j, f_θ(z))} ) ]        (6)

where i is the index of a training sample (instance), {x_j}_{j=1}^n is the union of positive and negative samples, and h(x, y) is the critic function that measures the compatibility between x and y. Following the popular choice of [23], h(x, y) is the cosine similarity between the embeddings of x and y: h(x, y) = F_D(x)ᵀ F_D(y) / (‖F_D(x)‖₂ ‖F_D(y)‖₂). Note that, unlike in contrastive self-supervised learning methods [63, 23, 6], where two views (independent augmentations) of an instance constitute a positive pair, in DC-VAE an input instance x_i and its reconstruction f_θ(z) comprise a positive pair. Likewise, the reconstruction f_θ(z) and any instance that is not x_i form a negative pair.

To bridge the gap between the instance-level contrastive loss (Eq. (6)) and the log-likelihood in the ELBO term (Eq. (1)), we observe the following connection.

Remark 1 (From [50, 55]) The following objective is minimized, i.e., the optimal critic h is achieved, when h(f_θ(z), x) = log p(x|z) + c(x), where c(x) is any function that does not depend on z:

    I_NCE ≜ E_{x_1,…,x_K} E_i [ L_instance(θ, φ, D; i, {x_j}_{j=1}^n) ]        (7)

It can be seen from [50, 55] that the contrastive loss of Eq. (6) implicitly estimates the log-likelihood log p_θ(x|z) required for the evidence lower bound (ELBO). Hence, we modify the ELBO objective of Eq. (1) as follows and name it the implicit ELBO (IELBO):

    L_IELBO(θ, φ, D; x_i) = L_instance(θ, φ, D; i, {x_j}_{j=1}^n) + KL[q_φ(z|x_i) ‖ p(z)]        (8)

Finally, the combined objective for the proposed DC-VAE algorithm becomes:

    min_{θ,φ} max_D Σ_{i=1}^n [ L_IELBO(θ, φ, D; x_i) + L_GAN(θ, φ, D; x_i) ]        (9)

The definition of L_GAN follows Eq. (4). Note that we also consider the term in Eq. (4) to be contradistinctive, since it performs a discriminative classification between the input ("real") image set and the reconstructed/generated ("fake") image set. Below we highlight the significance of the two contradistinctive terms. Figure 2 shows the model architecture.
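Read as a classification problem, Eq. (6) with the cosine critic is a softmax cross-entropy in which the reconstruction is pulled toward its own input (the positive) and pushed away from the other training images (the negatives). The sketch below is one illustrative PyTorch rendering of that term; the embedding function `embed` (playing the role of F_D), the explicit negative batch, and the temperature `tau` are assumptions commonly used in contrastive learning rather than details taken from the released code.

```python
import torch
import torch.nn.functional as F

def instance_contrastive_loss(embed, x_rec, x_pos, x_neg, tau=0.1):
    """Instance-level loss of Eq. (6) with a cosine critic h (a sketch).

    embed : maps images to the contrastive feature embedding (role of F_D)
    x_rec : reconstructions f_theta(z), shape (B, C, H, W)
    x_pos : the matching inputs x_i (positives), shape (B, C, H, W)
    x_neg : other training images acting as negatives, shape (K, C, H, W)
    """
    q = F.normalize(embed(x_rec), dim=1)              # (B, d)
    k_pos = F.normalize(embed(x_pos), dim=1)          # (B, d)
    k_neg = F.normalize(embed(x_neg), dim=1)          # (K, d)

    pos = (q * k_pos).sum(dim=1, keepdim=True)        # h(x_i, f_theta(z)), (B, 1)
    neg = q @ k_neg.t()                               # h(x_j, f_theta(z)), (B, K)
    logits = torch.cat([pos, neg], dim=1) / tau
    target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)  # positive at index 0
    return F.cross_entropy(logits, target)            # -log softmax over {positive, negatives}
```

Replacing the feature reconstruction term of Eq. (3) with this loss, while keeping the KL term and the GAN term of Eq. (4), gives the IELBO of Eq. (8) and the combined objective of Eq. (9).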

4.1. Understanding the loss terms

Instance-level fidelity. The first term in Eq. (8) is an instance-level fidelity term encouraging the reconstruction to be as close as possible to the input image while being different from all the rest of the images. A key advantage of the contrastive loss in Eq. (8) over the standard reconstruction loss in Eq. (3) is its relaxed, background-instance-aware formulation. In general, the reconstruction term in Eq. (3) wants a perfect match between the reconstruction and the input, whereas the contrastive loss in Eq. (8) only asks the reconstruction to be most similar to its own input among the training samples. This way, the contrastive loss becomes more cooperative and conflicts less with the GAN loss than the reconstruction loss does. The introduction of the contrastive loss results in a significant improvement over VAE and VAE/GAN.

We further explain the difference between the reconstruction and contrastive losses based on the input x and its reconstruction f_θ(z). To simplify the notation, we use x instead of the output-layer feature F_D(x) (shown in Eq. (4)) for illustration purposes. The reconstruction loss enforces similarity between the reconstructed image and the input image, min ‖x − f_θ(z)‖, while the GAN loss computes an adversarial loss, min max_w log(1 / (1 + exp{−w·x})) + log(1 − 1 / (1 + exp{−w·f_θ(z)})), where w refers to the classifier parameters. The reconstruction term enforces pixel-wise/feature matching between the input and the reconstruction, while the GAN loss encourages the reconstruction and the input to be discriminatively non-separable; the two are measured in different ways, resulting in a conflict. Our contrastive loss, on the other hand, is also a discriminative term; it can be viewed as min −log( exp{x · f_θ(z)} / Σ_{j=1}^n exp{x_j · f_θ(z)} ). Comparing the two: the reconstruction loss wants an exact match between the reconstruction and the input, whereas the contrastive loss is more relaxed and is satisfied, even without an exact match, as long as the reconstruction is closest to its own input amongst all the training samples. In other words, the reconstruction loss demands a perfect match for instance-level fidelity, whereas the contrastive loss only asks for being the most similar one among the given training samples. Using the contrastive loss gives more room and creates less conflict with the GAN loss.

Set-level fidelity. The second term in Eq. (9) is a set-level fidelity term encouraging the entire set of synthesized images to be indistinguishable from the input image set. Having this term (Eq. (4)) is still important, since the instance contrastive loss alone (Eq. (6)) would lead to a degenerate situation: the input image and its reconstruction can be projected to the same point in the new feature space without any guarantee that the reconstruction itself lies on the valid "real" image manifold.

Figure 3: Qualitative results on CIFAR-10 [39] images (resolution 32 × 32) for the experiments in Table 1. Rows: Input, VAE, Contrastive w/o GAN, VAE/GAN, DC-VAE (ours). (a) CIFAR-10 reconstruction. (b) CIFAR-10 samples.

Figure 3 and Table 1 compare the results with and without the individual terms of Eq. (9). We observe evident effectiveness of the proposed DC-VAE, which combines both the instance-level fidelity term (Eq. (6)) and the set-level fidelity term (Eq. (4)), compared with VAE (pixel-wise reconstruction loss without the GAN objective), VAE/GAN (feature reconstruction loss with the GAN objective), and VAE-Contrastive (contrastive loss without the GAN objective).

In the experiments, we show that both terms are required to achieve faithful reconstruction (captured by the InfoNCE loss) with perceptual realism (captured by the GAN loss).

4.2. Multi-scale contrastive learning

Inspired by [46], we utilize information from feature maps at different scales. In addition to contrasting on the last layer of D in Eq. (9), we add a contrastive objective on f_l(z), where f_l is some function on top of an intermediate layer l of D. We do this in two different ways:

1. Deep supervision: we use a 1×1 convolution to reduce the dimension channel-wise and a linear layer to obtain f_l.
2. Local patch: we use a random location across channels at layer l (size 1×1×d, where d is the channel depth).

The intuition for the second scheme is that, in a convolutional neural network, one location in a feature map corresponds to a receptive area (patch) in the original image. Thus, by contrasting locations across channels in the same feature maps, we encourage the original image and its reconstruction to have locally similar content, while encouraging locally dissimilar content with respect to other images. We use deep supervision for initial training and add the local-patch term after a certain number of iterations. We observe that with multi-scale contrastive learning, the perceptual distance for CIFAR-10 reconstruction improves from 72.2 to 52.9, while the FID score increases slightly from 15.7 to 17.9. While the decrease in perceptual distance can be understood, we hypothesize that the increase in FID occurs because multi-scale learning focuses on details and thus creates a trade-off in the overall FID score.
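One way to read the local-patch scheme above: sample a spatial location in an intermediate discriminator feature map, take the d-dimensional vector at that location for the reconstruction as the query, the vector at the same location for its input as the positive, and the vectors at that location for other images as negatives. The sketch below encodes that reading; `feat_l` (the intermediate feature extractor) and the single-location sampling are assumptions made for illustration, not the paper's exact procedure.

```python
import torch
import torch.nn.functional as F

def local_patch_contrastive_loss(feat_l, x_rec, x_pos, x_neg, tau=0.1):
    """Patch-level contrastive term at an intermediate layer l (a sketch).

    feat_l : returns a feature map of shape (N, d, H, W) from layer l of D.
    A single random location (h, w) is contrasted: reconstruction vs. its
    own input (positive) and vs. other images at that location (negatives).
    """
    fq, fp, fn = feat_l(x_rec), feat_l(x_pos), feat_l(x_neg)
    _, _, H, W = fq.shape
    h = torch.randint(H, (1,)).item()
    w = torch.randint(W, (1,)).item()

    q = F.normalize(fq[:, :, h, w], dim=1)         # (B, d) query patch vectors
    k_pos = F.normalize(fp[:, :, h, w], dim=1)     # (B, d) positives
    k_neg = F.normalize(fn[:, :, h, w], dim=1)     # (K, d) negatives

    logits = torch.cat([(q * k_pos).sum(1, keepdim=True), q @ k_neg.t()], dim=1) / tau
    target = torch.zeros(q.size(0), dtype=torch.long, device=q.device)
    return F.cross_entropy(logits, target)
```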

5. Experiments

5.1. Implementation

Datasets. To validate our method, we train on several datasets: CIFAR-10 [39], STL-10 [11], CelebA [49], CelebA-HQ [35], and LSUN Bedroom [68]. See the appendix for more detailed descriptions.

Network architecture. For 32 × 32 resolution, we design the encoder and decoder subnetworks of our model in a similar way to the discriminator and generator found through neural architecture search in AutoGAN [16]. For the higher-resolution experiments (128×128 and 512×512), we use Progressive GAN [35] as the backbone. A network architecture diagram is available in the appendix.

Training details. The number of negative samples for contrastive learning is 8096 for all datasets. The latent dimension of the VAE decoder is 128 for CIFAR-10 and STL-10, and 512 for CelebA, CelebA-HQ, and LSUN Bedroom. The learning rate is 0.0002 with Adam parameters (β1, β2) = (0.0, 0.9) and a batch size of 128 for CIFAR-10 and STL-10. For the CelebA, CelebA-HQ, and LSUN Bedroom datasets, we use the optimizer parameters given in [35]. The contrastive embedding dimension is 16 in all experiments.

5.2. Ablation Study

Table 1: Ablation studies on CIFAR-10 for the proposed DC-VAE algorithm. We follow [33] and measure the perceptual distance at the relu4_3 layer of a pretrained VGG network. ↓ means lower is better; ↑ means higher is better.

Method | FID↓/IS↑ Sampling | FID↓/IS↑ Reconstruction | Pixel Distance↓ | Perceptual Distance↓
VAE | 115.8 / 3.8 | 108.4 / 4.3 | 21.8 | 65.8
VAE/GAN | 39.8 / 7.4 | 29.0 / 7.6 | 62.7 | 57.2
VAE-Contrastive | 240.4 / 1.8 | 242 / 1.9 | 53.6 | 104.2
DC-VAE | 17.9 / 8.2 | 21.4 / 7.9 | 45.9 | 52.9

To demonstrate the necessity of the GAN loss (Eq. (4)) and the contrastive loss (Eq. (8)), we conduct four experiments with the same backbone: VAE (no GAN, no contrastive), VAE/GAN (with GAN, no contrastive), VAE-Contrastive (no GAN, with contrastive), and ours (with GAN, with contrastive). Here, GAN denotes Eq. (4) and Contrastive denotes Eq. (8).

Qualitative analysis. From Figure 3, we see that without the GAN and contrastive terms, images are blurry. Without GAN, the contrastive head can classify images, but the reconstructions are not on the image manifold. Without the contrastive term, reconstructed images are on the image manifold because of the discriminator, but they differ from the input images. These experiments show that it is necessary to combine both instance-level and set-level fidelity, and to do so in a contradistinctive manner.

Quantitative analysis. In Table 1 we observe the same trend. VAE generates blurry images; thus its FID/IS (Inception Score) is not ideal. VAE-Contrastive does not generate images on the natural image manifold; thus its FID/IS is poor. VAE/GAN combines set-level and instance-level information, but its L2 objective is not ideal; thus its FID/IS is sub-optimal. For both reconstruction and sampling, DC-VAE generates high-fidelity images and has favorable FID and Inception scores. This illustrates the advantage of having a contradistinctive objective at both the set level and the instance level. To measure the faithfulness of the reconstructed images, we compute the pixel-wise L2 distance and the perceptual distance [33]. For the pixel distance, VAE has the lowest value because it directly optimizes this distance during training; our pixel-wise distance is better than that of VAE/GAN and VAE-Contrastive. For the perceptual distance, our method outperforms the other three methods, which confirms that contrastive learning helps the model retain more of the semantic features of the input in the reconstructed image.

5.3. Comparison to existing generative models

Table 2: Comparison on CIFAR-10 and STL-10: average Inception scores (IS) [58] and FID scores [25]. Results derived from [16]. †Result from [1]. ∗Result from [12].

Method | CIFAR-10 IS↑ | CIFAR-10 FID↓ | STL-10 IS↑ | STL-10 FID↓
Methods based on GAN:
DCGAN [56] | 6.6 | - | - | -
ProbGAN [22] | 7.8 | 24.6 | 8.9 | 46.7
WGAN-GP ResNet [18] | 7.9 | - | - | -
RaGAN [34] | - | 23.5 | - | -
SN-GAN [53] | 8.2 | 21.7 | 9.1 | 40.1
MGAN [28] | 8.3 | 26.7 | - | -
Progressive GAN [35] | 8.8 | - | - | -
Improving MMD GAN [65] | 8.3 | 16.2 | 9.3 | 37.6
PULSGAN [19] | - | 22.3 | - | -
AutoGAN [16] | 8.6 | 12.4 | 9.2 | 31.0
Methods based on VAE:
VAE | 3.8 | 115.8 | - | -
VAE/GAN | 7.4 | 39.8 | - | -
VEEGAN∗ [59] | - | 95.2 | - | -
WAE-GAN [60] | - | 93.1 | - | -
NCP-VAE [1] | - | 24.08 | - | -
NVAE† [62] Sampling | - | 50.8 | - | -
NVAE† [62] Reconstruction | - | 2.67 | - | -
DC-VAE (ours) Sampling | 8.2 | 17.9 | 8.1 | 41.9
DC-VAE (ours) Reconstruction | 7.9 | 21.4 | 8.4 | 43.6

Table 2 gives a quantitative comparison on the CIFAR-10 and STL-10 datasets. In general, there is a large difference in FID and IS between the GAN family and the VAE family of models. Our model achieves state-of-the-art results within the VAE family and is comparable to state-of-the-art GAN models on CIFAR-10. Similarly, Tables 3, 5, and 4 show that DC-VAE generates images comparable to GAN-based methods even on higher-resolution datasets such as LSUN Bedrooms, CelebA, and CelebA-HQ. Our method achieves state-of-the-art results on these datasets among VAE-based methods that focus on building better architectures. The qualitative comparison in Figure 5 and the reconstruction comparison in Table 8 show that our model yields more faithful reconstructions than existing state-of-the-art generative autoencoder methods.

5.4. Latent Space Representation: Image and style interpolation

We further validate the effectiveness of DC-VAE for representation learning. One benefit of having an AE/VAE framework, compared with a decoder-only model as in GAN [17], is the ability to directly obtain the latent representation of input images. The encoder and decoder modules allow us to readily perform image/style interpolation by mixing the latent variables of different images and reconstructing/synthesizing new ones.
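A minimal sketch of such latent-space interpolation, under the same placeholder `encoder`/`decoder` assumptions as before: encode two images, linearly interpolate their latent codes, and decode each intermediate point. Using the posterior mean μ as the code is one reasonable choice for deterministic results; it is not necessarily the exact protocol used for the figures.

```python
import torch

@torch.no_grad()
def interpolate_latents(encoder, decoder, x_a, x_b, steps=8):
    """Decode evenly spaced points between the latent codes of two images."""
    mu_a, _ = encoder(x_a)                 # posterior mean of image(s) A
    mu_b, _ = encoder(x_b)                 # posterior mean of image(s) B
    frames = []
    for t in torch.linspace(0.0, 1.0, steps):
        z = (1.0 - t) * mu_a + t * mu_b    # linear interpolation in latent space
        frames.append(decoder(z))
    return torch.stack(frames, dim=0)      # (steps, B, C, H, W)
```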

Figure 4: DC-VAE samples on LSUN Bedroom [68] at resolution 128 × 128 (left) and CelebA-HQ [35] at resolution 512 × 512 (right).

Table 4: FID comparison on CelebA-HQ at 256×256 resolution. ↓ means lower is better.

Method | FID↓
StyleALAE [54] | 19.21
NVAE [62] (from [1]) | 40.26
NCP-VAE [1] | 24.79
DC-VAE (ours) | 15.81

Table 5: FID comparison on CelebA. ∗64×64 resolution. †128×128 resolution. ↓ means lower is better.

Method | FID↓
Methods based on GAN:
PresGAN∗ [12] | 29.1
LSGAN [52] (from [29]) | 53.9
COCO-GAN† [48] | 5.7
ProGAN† [35] (from [48]) | 7.30
Methods based on VAE:
VEE-GAN† [59] (from [12]) | 46.2
WAE-GAN∗ [60] | 42
NCP-VAE∗ [1] | 5.3
DC-VAE† (ours) Reconstruction | 14.3
DC-VAE† (ours) Sampling | 19.9

Table 3: Image generation quality (FID) comparison on LSUN Bedrooms. †128×128 resolution. ‡256×256 resolution. ↓ means lower is better.

Method | FID↓ (Sampling) | FID↓ (Reconstruction)
Progressive GAN‡ [35] | 8.3 | -
SNGAN† [53] (from [7]) | 16.0 | -
SSGAN† [7] | 13.3 | -
StyleALAE‡ [54] | 17.13 | 15.92
DC-VAE† (ours) | 14.3 | 10.57

Figure 5: Comparison of DC-VAE (resolution 512 × 512) with IntroVAE [30] (resolution 1024 × 1024). (a) Input image (1024 × 1024). (b) IntroVAE reconstruction (1024 × 1024). (c) DC-VAE reconstruction (ours, 512 × 512). Zoom in for a better visualization.

We demonstrate qualitative results on image interpolation (Fig. 6), style interpolation, and image editing (Fig. 7). We directly use the trained DC-VAE model without disentanglement learning [36]. We also quantitatively compare latent space disentanglement through the perceptual path length (PPL) [36] (Table 7). We observe that DC-VAE learns a more disentangled latent space representation than the backbone Progressive GAN [35] and than StyleALAE [54], which uses a much more capable StyleGAN [36] backbone.

5.5. Latent Space Representation: Classification

To show that our model learns a good representation, we measure performance on the downstream MNIST classification task [13]. The VAE models were trained on the MNIST dataset [45]. We feed input images into our VAE encoder to obtain the latent representation and then train a linear classifier on that representation to predict the classes of the input images. Results in Table 6 show that our model gives the lowest classification error in most cases. This experiment demonstrates that our model not only gains the ability to perform faithful synthesis and reconstruction but also gains better representation ability on the VAE side.
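A minimal sketch of the linear-probe evaluation described above, assuming an `encoder` that returns (μ, logvar). Fitting a scikit-learn logistic-regression classifier on the frozen posterior means is one straightforward realization; it is not necessarily the authors' exact protocol for the probe.

```python
import torch
from sklearn.linear_model import LogisticRegression

@torch.no_grad()
def encode_dataset(encoder, loader):
    """Collect posterior means mu as frozen features, with labels."""
    feats, labels = [], []
    for x, y in loader:
        mu, _ = encoder(x)
        feats.append(mu.cpu())
        labels.append(y)
    return torch.cat(feats).numpy(), torch.cat(labels).numpy()

def linear_probe_error(encoder, train_loader, test_loader):
    """Train a linear classifier on the latents; return test error in percent."""
    x_tr, y_tr = encode_dataset(encoder, train_loader)
    x_te, y_te = encode_dataset(encoder, test_loader)
    clf = LogisticRegression(max_iter=1000).fit(x_tr, y_tr)
    return 100.0 * (1.0 - clf.score(x_te, y_te))
```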

Figure 6: Interpolation results generated by DC-VAE (ours) on CelebA-HQ [35] images (512 × 512, left) and LSUN Bedroom [68] images (128 × 128, right). (Zoom in for a better visualization.)

Table 6: Comparison to prior VAE-based representation learning methods: classification error on the MNIST dataset. ↓: lower is better. 95% confidence intervals are from 5 trials. Results derived from [13].

Method | d_z = 16 ↓ | d_z = 32 ↓ | d_z = 64 ↓
VAE [38] | 2.92%±0.12 | 3.05%±0.42 | 2.98%±0.14
β-VAE (β=2) [26] | 4.69%±0.18 | 5.26%±0.22 | 5.40%±0.33
FactorVAE (γ=5) [37] | 6.07%±0.05 | 6.18%±0.20 | 6.35%±0.48
β-TCVAE (α=1, β=5, γ=1) [8] | 1.62%±0.07 | 1.24%±0.05 | 1.32%±0.09
Guided-VAE [13] | 1.85%±0.08 | 1.60%±0.08 | 1.49%±0.06
Guided-β-TCVAE [13] | 1.47%±0.12 | 1.10%±0.03 | 1.31%±0.06
DC-VAE (ours) | 1.30%±0.035 | 1.27%±0.037 | 1.29%±0.034

Table 7: PPL comparison on CelebA-HQ [35].

Method | Backbone | PPL Full↓
StyleALAE [54] | StyleGAN [36] | 33.29
ProGAN [35] | ProGAN [35] | 40.71
DC-VAE (ours) | ProGAN [35] | 24.66

Table 8: Reconstruction comparison on the CelebA-HQ [35] validation set. We follow [33] and measure the perceptual distance at the relu4_3 layer of a pretrained VGG network. ↓ means lower is better.

Method | Backbone | Pixel Distance↓ | Perceptual Distance↓
StyleALAE [54] | StyleGAN [36] | 0.117 | 40.40
DC-VAE (ours) | ProGAN [35] | 0.072 | 38.63

Figure 7: Latent traversal on CelebA-HQ [35] (resolution 512 × 512) and LSUN Bedroom [68] (resolution 128 × 128), and example image editing on CelebA-HQ [35] images (attribute edits such as − glasses, + glasses, − window, + window, + blonde, − beard, + female applied to an input image).

6. Conclusion

In this paper, we have developed the dual contradistinctive generative autoencoder (DC-VAE), a new framework that integrates an instance-level discriminative loss (InfoNCE) with a set-level adversarial loss (GAN) in a single variational autoencoder framework. Our experiments show state-of-the-art or competitive results on several tasks, including image synthesis, image reconstruction, representation learning for image interpolation, and representation learning for classification. DC-VAE is a general-purpose VAE model, and it points to an encouraging direction that attains high-quality synthesis (decoding) and inference (encoding).

Acknowledgment. This work is funded by NSF IIS-1717431 and NSF IIS-1618477. We thank Qualcomm Inc. for an award support. The work was performed when G. Parmar and D. Li were with UC San Diego.

References

[1] Jyoti Aneja, Alexander Schwing, Jan Kautz, and Arash Vahdat. NCP-VAE: Variational autoencoders with noise contrastive priors, 2020.
[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
[4] Jane Bromley, Isabelle Guyon, Yann LeCun, Eduard Säckinger, and Roopak Shah. Signature verification using a "siamese" time delay neural network. In Neural Information Processing Systems, 1993.
[5] Gal Chechik, Varun Sharma, Uri Shalit, and Samy Bengio. Large scale online learning of image similarity through ranking. J. Mach. Learn. Res., 11:1109–1135, 2010.
[6] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. arXiv:2002.05709, 2020.
[7] Ting Chen, Xiaohua Zhai, Marvin Ritter, Mario Lucic, and Neil Houlsby. Self-supervised GANs via auxiliary rotation loss. In CVPR, pages 12154–12163, 2019.
[8] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, 2018.
[9] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NeurIPS, 2016.
[10] Sumit Chopra, Raia Hadsell, and Yann LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, pages 539–546, 2005.
[11] Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In AISTATS, pages 215–223, 2011.
[12] Adji B Dieng, Francisco JR Ruiz, David M Blei, and Michalis K Titsias. Prescribed generative adversarial networks. arXiv:1910.04302, 2019.
[13] Zheng Ding, Yifan Xu, Weijian Xu, Gaurav Parmar, Yang Yang, Max Welling, and Zhuowen Tu. Guided variational autoencoder for disentanglement learning. In CVPR, 2020.
[14] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
[15] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
[16] Xinyu Gong, Shiyu Chang, Yifan Jiang, and Zhangyang Wang. AutoGAN: Neural architecture search for generative adversarial networks. In ICCV, 2019.
[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, 2014.
[18] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of Wasserstein GANs. In Advances in Neural Information Processing Systems, 2017.
[19] T. Guo, C. Xu, J. Huang, Y. Wang, B. Shi, C. Xu, and D. Tao. On positive-unlabeled classification in GAN. In CVPR, pages 8382–8390, 2020.
[20] Michael U. Gutmann and Aapo Hyvärinen. Noise-contrastive estimation of unnormalized statistical models, with applications to natural image statistics. J. Mach. Learn. Res., 13:307–361, 2012.
[21] Raia Hadsell, Sumit Chopra, and Yann LeCun. Dimensionality reduction by learning an invariant mapping. In CVPR, 2006.
[22] Hao He, Hao Wang, Guang-He Lee, and Yonglong Tian. ProbGAN: Towards probabilistic GAN with theoretical guarantees. In ICLR, 2019.
[23] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[24] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[25] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, 2017.
[26] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-VAE: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
[27] Geoffrey E Hinton and Richard S Zemel. Autoencoders, minimum description length and Helmholtz free energy. In Advances in Neural Information Processing Systems, 1994.
[28] Quan Hoang, Tu Dinh Nguyen, Trung Le, and Dinh Phung. MGAN: Training generative adversarial nets with multiple generators. In ICLR, 2018.
[29] Yedid Hoshen, Ke Li, and Jitendra Malik. Non-adversarial image synthesis with generative latent nearest neighbors. In CVPR, pages 5811–5819, 2019.
[30] Huaibo Huang, Ran He, Zhenan Sun, Tieniu Tan, et al. IntroVAE: Introspective variational autoencoders for photographic image synthesis. In Advances in Neural Information Processing Systems, pages 52–63, 2018.
[31] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A Efros. Image-to-image translation with conditional adversarial networks. In CVPR, 2017.
[32] Long Jin, Justin Lazarow, and Zhuowen Tu. Introspective classification with convolutional nets. In Advances in Neural Information Processing Systems, 2017.
[33] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer and super-resolution. In ECCV, 2016.
[34] Alexia Jolicoeur-Martineau. The relativistic discriminator: A key element missing from standard GAN. arXiv preprint arXiv:1807.00734, 2018.

[35] Tero Karras, Timo Aila, Samuli Laine, and Jaakko Lehtinen. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
[36] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial networks. In CVPR, pages 4401–4410, 2019.
[37] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In ICML, 2018.
[38] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
[39] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
[40] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 2012.
[41] Brian Kulis et al. Metric learning: A survey. Foundations and Trends in Machine Learning, 5(4):287–364, 2012.
[42] Anders Boesen Lindbo Larsen, Søren Kaae Sønderby, Hugo Larochelle, and Ole Winther. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
[43] Justin Lazarow, Long Jin, and Zhuowen Tu. Introspective neural networks for generative modeling. In ICCV, 2017.
[44] Yann LeCun. Modèles connexionnistes de l'apprentissage. PhD thesis, Université Paris 6, 1987.
[45] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/, 1998.
[46] Chen-Yu Lee, Saining Xie, Patrick Gallagher, Zhengyou Zhang, and Zhuowen Tu. Deeply-supervised nets. In AISTATS, pages 562–570, 2015.
[47] Kwonjoon Lee, Weijian Xu, Fan Fan, and Zhuowen Tu. Wasserstein introspective neural networks. In CVPR, 2018.
[48] Chieh Hubert Lin, Chia-Che Chang, Yu-Sheng Chen, Da-Cheng Juan, Wei Wei, and Hwann-Tzong Chen. COCO-GAN: Generation by parts via conditional coordinating. In ICCV, pages 4512–4521, 2019.
[49] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
[50] Zhuang Ma and Michael Collins. Noise contrastive estimation and negative sampling for conditional models: Consistency and statistical efficiency. In EMNLP, 2018.
[51] Tomasz Malisiewicz, Abhinav Gupta, and Alexei A. Efros. Ensemble of exemplar-SVMs for object detection and beyond. In ICCV, 2011.
[52] Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, pages 2794–2802, 2017.
[53] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICLR, 2018.
[54] Stanislav Pidhorskyi, Donald Adjeroh, and Gianfranco Doretto. Adversarial latent autoencoders, 2020.
[55] Ben Poole, Sherjil Ozair, Aäron van den Oord, Alex Alemi, and George Tucker. On variational bounds of mutual information. In ICML, 2019.
[56] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.
[57] Ruslan Salakhutdinov and Geoffrey Hinton. Deep Boltzmann machines. In AISTATS, volume 5, pages 448–455, 2009.
[58] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In Advances in Neural Information Processing Systems, 2016.
[59] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In Advances in Neural Information Processing Systems, pages 3308–3318, 2017.
[60] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. arXiv preprint arXiv:1711.01558, 2017.
[61] Zhuowen Tu. Learning generative models via discriminative approaches. In CVPR, 2007.
[62] Arash Vahdat and Jan Kautz. NVAE: A deep hierarchical variational autoencoder. In NeurIPS, 2020.
[63] Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. CoRR, abs/1807.03748, 2018.
[64] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, 2017.
[65] Wei Wang, Yuan Sun, and Saman Halgamuge. Improving MMD-GAN training with repulsive loss function. In ICLR, 2019.
[66] Zhirong Wu, Yuanjun Xiong, Stella X Yu, and Dahua Lin. Unsupervised feature learning via non-parametric instance discrimination. In CVPR, pages 3733–3742, 2018.
[67] Jianwen Xie, Yang Lu, Song-Chun Zhu, and Yingnian Wu. A theory of generative ConvNet. In ICML, 2016.
[68] Fisher Yu, Yinda Zhang, Shuran Song, Ari Seff, and Jianxiong Xiao. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
[69] Zijun Zhang, Ruixiang Zhang, Zongpeng Li, Yoshua Bengio, and Liam Paull. Perceptual generative autoencoders. In ICLR, 2019.
