Semi-Supervised Adversarial Variational Autoencoder

Ryad Zemouri
To cite this version: Ryad Zemouri. Semi-Supervised Adversarial Variational Autoencoder. Machine Learning and Knowledge Extraction, MDPI, 2020, 2 (3), pp. 361–378. doi:10.3390/make2030020. hal-02931571 (https://hal-cnam.archives-ouvertes.fr/hal-02931571), submitted on 21 Sep 2020. Distributed under a Creative Commons Attribution 4.0 International License.

Ryad Zemouri, Cedric-Lab, CNAM, HESAM Université, 292 rue Saint-Martin, 75141 Paris CEDEX 03, France; [email protected]

Received: 1 August 2020; Accepted: 3 September 2020; Published: 6 September 2020

Abstract: We present a method to improve the reconstruction and generation performance of a variational autoencoder (VAE) by injecting adversarial learning. Instead of comparing the reconstructed data with the original data to calculate the reconstruction loss, we use a consistency principle for deep features. The main contributions are threefold. Firstly, our approach seamlessly combines the two models, i.e., GAN and VAE, and thus improves the generation and reconstruction performance of the VAE. Secondly, the VAE is trained in two steps, which makes it possible to dissociate the constraints used for the construction of the latent space from those used for the training of the decoder.
By using this two-step learning process, our method can be applied more widely than to image processing alone. While training the encoder, the label information is integrated to better structure the latent space in a supervised way. The third contribution is to use the trained encoder for the consistency principle applied to deep features extracted from its hidden layers. We present experimental results showing that our method gives better performance than the original VAE. The results demonstrate that the adversarial constraints allow the decoder to generate images that are more authentic and realistic than those of the conventional VAE.

Keywords: variational autoencoder; adversarial learning; deep feature consistent; data generation

1. Introduction

Deep generative models (DGMs) are part of the deep models family and are a powerful way to learn any distribution of observed data through unsupervised learning. DGMs consist mainly of variational autoencoders (VAEs) [1–4] and generative adversarial networks (GANs) [5]. VAEs are mainly used to extract features from the input vector in an unsupervised way, while GANs are used to generate synthetic samples through adversarial learning, by reaching an equilibrium between a generator and a discriminator. The strength of the VAE comes from an extra layer used for sampling the latent vector z and an additional term in the loss function, which together make the latent space more continuous than that of standard autoencoders. VAEs have met with great success in recent years in several application areas, including anomaly detection [6–9], text classification [10], sentence generation [11], speech synthesis and recognition [12–14], spatio-temporal solar irradiance forecasting [15], and geoscience data assimilation [2].
In other respects, the two major application areas of VAEs are biomedical and healthcare recommendation [16–19], and industrial applications for nonlinear process monitoring [1,3,4,20–25]. VAEs are very efficient at reconstructing input data through a continuous and optimized latent space. Indeed, one of the terms of the VAE loss function is the reconstruction term, which is suitable for reconstructing the input data, but with less realism than GANs; for example, an image reconstructed by a VAE is generally blurred and less realistic than the original. However, VAEs are far less effective than GANs at generating new synthetic data from the latent space. In the adversarial learning of GANs, a discriminator is employed to estimate the probability that a sample created by the generator is real or fake. Most applications of GANs are focused on the field of image processing, such as biomedical images [26,27], but recently some papers in the industrial field have been published, mainly focused on data generation to solve the problem of imbalanced training data [28–33]. Through continuous improvement of the GAN, its performance has been steadily enhanced and several variants have recently been developed [34–51]. The weakness of GANs lies in their latent space, which, unlike that of VAEs, is not structured, due to the lack of an encoder that could support the inverse mapping. With an additional encoding function, GANs can explore and use a structured latent space to discover "concept vectors" that represent high-level attributes [26]. To this end, some research has recently been developed, such as [48,52–55]. In this paper, we propose a method to inject adversarial learning into a VAE to improve its reconstruction and generation performance.
Instead of comparing the reconstructed data with the original data to calculate the reconstruction loss, we use a consistency principle for deep features, such as the style reconstruction loss of Gatys et al. [56]. To do this, the training process of the VAE is divided into two steps: training the encoder and then training the decoder. While training the encoder, we integrate the label information to structure the latent space in a supervised way, in a manner similar to [53,55]. For the consistency principle for deep features, contrary to [56] and Hou et al. [54], who used the pretrained deep convolutional neural network VGGNet [57], we use the features extracted from the hidden layers of the encoder. Thus, our method can be applied to areas other than image processing, such as industrial nonlinear monitoring or biomedical applications. We present experimental results showing that our method gives better performance than the original VAE. In brief, our main contributions are threefold.

• Our approach seamlessly combines the two models, i.e., GAN and VAE, and thus improves the generation and reconstruction performance of the VAE.
• The VAE is trained in two steps, which makes it possible to dissociate the constraints used for the construction of the latent space from those used for the training of the decoder.
• The encoder is used for the consistency principle for deep features extracted from its hidden layers.

The rest of the paper is organized as follows. We first briefly review generative models in Section 2. In Section 3, we describe the proposed method. In Section 4, we present experimental results, which show that the proposed method improves the performance of the VAE. The last section summarizes the conclusions.

2. Generative Networks and Adversarial Learning

2.1. Autoencoder and the Variational Form

The autoencoder (AE) represents one of the first generative models trained to recreate or reproduce its input vector x [58–61].
The AE is composed of two main structures, an encoder and a decoder (Figure 1), which are multilayered neural networks (NNs) parameterized by φ and θ, respectively. The first part encodes the input data x into a latent representation z through the encoder function z = f_φ(x), whereas the second NN decodes this latent representation into x̂ = h_θ(z), an approximation or reconstruction of the original data. In an AE, an equal number of units is used in the input/output layers, while fewer units are used in the latent space (Figure 1). The variational form of the AE (Figure 1) has become a popular generative model by combining Bayesian inference with the efficiency of NNs to obtain a nonlinear low-dimensional latent space [1–4]. The Bayesian inference is obtained by an additional layer used for sampling the latent vector z with a specified prior distribution p(z), usually assumed to be a standard Gaussian N(0, I), where I is the identity matrix. Each element z_i of the latent layer is obtained as follows:

z_i = μ_i + σ_i · ε  (1)

where μ_i and σ_i are the i-th components of the mean and standard deviation vectors, and ε is a random variable following a standard Normal distribution (ε ∼ N(0, 1)). Unlike the AE, which generates the latent vector z directly, the VAE generates the vectors of means μ_i and standard deviations σ_i. This allows more continuity in the latent space than in the original AE. The VAE loss function, given by Equation (2), has two terms. The first term L_rec is the reconstruction loss (Equation (3)), which minimizes the difference between the input and output instances.

Figure 1. Schematic architecture of a standard deep autoencoder and a variational deep autoencoder. Both architectures have two parts: an encoder and a decoder.
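Equation (1) is the reparameterization trick: the sample is written as a deterministic function of (μ, σ) plus external noise, so gradients can flow through the mean and standard deviation layers during training. A minimal NumPy sketch (function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Reparameterization trick: z_i = mu_i + sigma_i * eps, eps ~ N(0, 1).

    Sampling is a deterministic function of (mu, sigma) plus external
    noise eps, so mu and sigma remain differentiable parameters.
    """
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

# Example: sampling a 2-dimensional latent vector
mu = np.array([0.5, -1.0])
sigma = np.array([0.1, 0.2])
z = sample_latent(mu, sigma, np.random.default_rng(0))
```

With σ fixed to zero the layer degenerates to a deterministic encoder, which is what distinguishes the plain AE from the VAE described above.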
Both the negative expected log-likelihood (e.g., the cross-entropy function) and the mean squared error (MSE) can be used. When the sigmoid function is used in the output layer, the derivatives of the MSE and cross-entropy losses have similar forms. The second term L_kl (Equation (4)) corresponds to the Kullback–Leibler (D_kl) divergence loss term, which forces the generated latent vectors toward the specified Normal distribution [62,63].
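For a diagonal Gaussian encoder and a standard Normal prior, both terms have simple closed forms. The following NumPy sketch is illustrative only (helper names are assumptions, not the paper's code); it uses MSE for L_rec and the closed-form Gaussian KL divergence for L_kl:

```python
import numpy as np

def reconstruction_loss(x, x_hat):
    """L_rec: mean squared error between input and reconstruction."""
    return np.mean((x - x_hat) ** 2)

def kl_loss(mu, sigma):
    """L_kl: closed-form D_kl( N(mu, diag(sigma^2)) || N(0, I) )."""
    return -0.5 * np.sum(1.0 + np.log(sigma ** 2) - mu ** 2 - sigma ** 2)

def vae_loss(x, x_hat, mu, sigma):
    """Total VAE loss with the two-term structure of Equation (2)."""
    return reconstruction_loss(x, x_hat) + kl_loss(mu, sigma)
```

Note that kl_loss is exactly zero when μ = 0 and σ = 1, i.e., when the encoder output already matches the prior; any departure from the prior is penalized, which is what pulls the latent space toward the specified Normal distribution.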