Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Student-t Variational Autoencoder for Robust Density Estimation

Hiroshi Takahashi1, Tomoharu Iwata2, Yuki Yamanaka3, Masanori Yamada3, Satoshi Yagi1
1 NTT Software Innovation Center  2 NTT Communication Science Laboratories  3 NTT Secure Platform Laboratories
{takahashi.hiroshi, iwata.tomoharu, yamanaka.yuki, yamada.m, yagi.satoshi}@lab.ntt.co.jp

Abstract

We propose a robust multivariate density estimator based on the variational autoencoder (VAE). The VAE is a powerful deep generative model, and is used for multivariate density estimation. With the original VAE, the distribution of observed continuous variables is assumed to be a Gaussian, where its mean and variance are modeled by deep neural networks taking latent variables as their inputs. This distribution is called the decoder. However, the training of the VAE often becomes unstable. One reason is that the decoder of the VAE is sensitive to the error between the data point and its estimated mean when its estimated variance is almost zero. We solve this instability problem by making the decoder robust to the error using a Bayesian approach to the variance estimation: we set a prior for the variance of the Gaussian decoder, and marginalize it out analytically, which leads to proposing the Student-t VAE. Numerical experiments with various datasets show that training of the Student-t VAE is robust, and the Student-t VAE achieves high density estimation performance.

1 Introduction

Multivariate density estimation [Scott, 2015], which estimates the distribution of continuous data, is an important task for artificial intelligence. This fundamental task is widely performed, from basic analysis such as clustering and data visualization to applications such as image processing, speech recognition, natural language processing, and anomaly detection.

For these tasks, conventional density estimation methods such as kernel density estimation [Silverman, 1986] and the Gaussian mixture model [McLachlan and Peel, 2004] are often used. However, recent developments of networks and sensors have made data more high-dimensional, complicated, and noisy, and hence multivariate density estimation has become very difficult.

Meanwhile, the variational autoencoder (VAE) [Kingma and Welling, 2013; Rezende et al., 2014] was presented as a powerful generative model for learning high-dimensional complicated data by using neural networks, and the VAE is used for multivariate density estimation. The VAE is composed of two conditional distributions: the encoder and the decoder, where neural networks are used to model the parameters of these conditional distributions. The encoder infers the posterior distribution of continuous latent variables given an observation. The decoder infers the posterior distribution of an observation given a latent variable. The encoder and decoder neural networks are optimized by minimizing the training objective function. In the density estimation task, since the observed variables are continuous, a Gaussian distribution is used for the decoder. We call this type of VAE the Gaussian VAE.

However, the training of the Gaussian VAE often becomes unstable. The reason is as follows: the training objective function of the Gaussian VAE is sensitive to the error between the data point and its decoded mean when its decoded variance is almost zero, and hence the objective function can give an extremely large value even with a small error. We call this problem the zero-variance problem. The zero-variance problem often occurs with biased data, i.e., data in which some clusters of data points have small variance. Real-world datasets such as network, sensor, and media datasets often have this bias; therefore, this problem is serious considering the application of the VAE to real-world datasets.

Our purpose is to solve this instability by making the decoder robust to the error. In this paper, we introduce a Bayesian approach to the inference of the Gaussian decoder: we set a Gamma prior for the inverse of the variance of the decoder and marginalize it out analytically, which leads to introducing a Student-t distribution as the decoder distribution. We call this proposed method the Student-t VAE. Since the Student-t distribution is a heavy-tailed distribution [Lange et al., 1989], the Student-t decoder is robust to the error between the data point and its decoded mean, which makes the training of the Student-t VAE stable.

2 Variational Autoencoder

First, we review the variational autoencoder (VAE) [Kingma and Welling, 2013; Rezende et al., 2014]. The VAE is a probabilistic latent variable model that relates an observed variable vector x to a continuous latent variable vector z by a conditional distribution. Since our task is density estimation, we assume that the observed variables x are continuous.

With the VAE, the marginal likelihood of a data point x is given by

p_θ(x) = ∫ p_θ(x|z) p(z) dz,    (1)

where p(z) is a prior of the latent variable vector, which is usually modeled by a standard Gaussian distribution N(z|0, I), and p_θ(x|z) = N(x|μ_θ(z), σ_θ²(z)) is a Gaussian distribution with mean μ_θ(z) and variance σ_θ²(z), which are modeled by neural networks with parameter θ and input z. These neural networks are called the decoder. We call this type of VAE the Gaussian VAE.

Given a dataset X = {x^(1), ..., x^(N)}, the sum of the marginal log-likelihoods is given by

ln p_θ(X) = Σ_{i=1}^{N} ln p_θ(x^(i)),    (2)

where N is the number of data points. The marginal log-likelihood is bounded below by the variational lower bound, which is derived from Jensen's inequality, as follows:

ln p_θ(x^(i)) = ln E_{q_φ(z|x^(i))}[ p_θ(x^(i)|z) p(z) / q_φ(z|x^(i)) ]
             ≥ E_{q_φ(z|x^(i))}[ ln ( p_θ(x^(i)|z) p(z) / q_φ(z|x^(i)) ) ]
             = L(θ, φ; x^(i)),    (3)

where E[·] represents the expectation, q_φ(z|x) = N(z|μ_φ(x), σ_φ²(x)) is the posterior of z given x, and its mean μ_φ(x) and variance σ_φ²(x) are modeled by neural networks with parameter φ. These neural networks are called the encoder.

The variational lower bound (3) can also be written as:

L(θ, φ; x^(i)) = E_{q_φ(z|x^(i))}[ ln p_θ(x^(i)|z) ] − D_KL( q_φ(z|x^(i)) || p(z) ),    (4)

where D_KL(P || Q) is the Kullback-Leibler (KL) divergence between P and Q. The parameters of the encoder and decoder neural networks are optimized by maximizing the variational lower bound using stochastic gradient descent (SGD) [Duchi et al., 2011; Zeiler, 2012; Tieleman and Hinton, 2012; Kingma and Ba, 2014]. The expectation term in (4) is approximated by the reparameterization trick [Kingma and Welling, 2013]:

E_{q_φ(z|x^(i))}[ ln p_θ(x^(i)|z) ] ≈ (1/L) Σ_{ℓ=1}^{L} ln p_θ(x^(i)|z^(i,ℓ)),    (5)

where z^(i,ℓ) = μ_φ(x^(i)) + ε ⊙ σ_φ(x^(i)), ε is a sample drawn from N(0, I), and L is the sample size of the reparameterization trick. L = 1 is usually used [Kingma and Welling, 2013]. Then, the resulting objective function is

L(θ, φ; x^(i)) ≈ (1/L) Σ_{ℓ=1}^{L} ln p_θ(x^(i)|z^(i,ℓ)) − D_KL( q_φ(z|x^(i)) || p(z) ) = L̂(θ, φ; x^(i)).    (6)

The KL divergence between Gaussian distributions D_KL( q_φ(z|x^(i)) || p(z) ) and its gradient can be calculated analytically [Kingma and Welling, 2013]. In this paper, we minimize the negative of (6) instead of maximizing it.
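To make the objective concrete, the following is a minimal sketch, not the authors' released code, of how the Gaussian VAE and the negative of (6) might be implemented in PyTorch; the class name, layer sizes, and helper names are illustrative assumptions.

```python
# Sketch of a Gaussian VAE with the negative of eq. (6) as training loss.
import math
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim=500):
        super().__init__()
        # Encoder q_phi(z|x): two-layer network with tanh activations (cf. Section 3).
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): Gaussian with mean mu_theta(z) and variance sigma_theta^2(z).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.dec_mu = nn.Linear(h_dim, x_dim)
        self.dec_logvar = nn.Linear(h_dim, x_dim)

    def negative_elbo(self, x):
        h = self.enc(x)
        mu_z, logvar_z = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick with L = 1, eq. (5): z = mu + eps * sigma.
        z = mu_z + torch.randn_like(mu_z) * torch.exp(0.5 * logvar_z)
        g = self.dec(z)
        mu_x, logvar_x = self.dec_mu(g), self.dec_logvar(g)
        # Gaussian log-likelihood ln p_theta(x|z), summed over dimensions.
        log_px = -0.5 * (((x - mu_x) ** 2) / logvar_x.exp()
                         + logvar_x + math.log(2 * math.pi)).sum(dim=1)
        # Analytic KL divergence between q_phi(z|x) and the standard Gaussian prior p(z).
        kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - logvar_z - 1).sum(dim=1)
        # Negative of eq. (6), averaged over the mini-batch.
        return (kl - log_px).mean()
```

A training step would call negative_elbo(x) on a mini-batch and back-propagate with Adam, mirroring the setup described in Sections 3 and 5.2.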
3 Instability of Training Gaussian VAE

To investigate the Gaussian VAE, we applied it to the SMTP data1, which is a subset of the KDD Cup 1999 data provided by the scikit-learn community [Pedregosa et al., 2011]. The KDD Cup 1999 data were generated using a closed network and hand-injected attacks for evaluating the performance of supervised network intrusion detection, and they have often been used also for unsupervised anomaly detection. The SMTP data consists of three-dimensional continuous data, and contains 95,156 data points. Figure 1a shows a visualization of this dataset. This dataset has some bias: the variance of some clusters of data points is small along the dimension directions.

1This dataset is available at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_kddcup99.html

We trained the Gaussian VAE using this dataset by Adam [Kingma and Ba, 2014] with a mini-batch size of 100. We used a two-dimensional latent variable vector z, two-layer neural networks (500 hidden units per layer) as the encoder and the decoder, and a hyperbolic tangent as the activation function. The data were standardized with zero mean and unit variance. We used 10% of this dataset for training.

Figure 1b shows the mean training loss, which equals −Σ_{i=1}^{N} L̂(θ, φ; x^(i)) / N. The training loss was very unstable. One reason for this is that the variance σ_θ²(z^(i,ℓ)) in the decoder N(x^(i)|μ_θ(z^(i,ℓ)), σ_θ²(z^(i,ℓ))) becomes almost zero, where z^(i,ℓ) is sampled from the encoder N(z|μ_φ(x^(i)), σ_φ²(x^(i))). For example, at the 983rd epoch, the training loss jumped up sharply. Figure 1c shows the relationship between the difference in the training losses and the variance σ_θ²(z^(i,ℓ)) at this epoch. The training loss of the data points with small variance σ_θ²(z^(i,ℓ)) increased drastically.
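The size of this jump is easy to reproduce in isolation. The snippet below is a self-contained illustration rather than part of the paper's experiments: it evaluates the one-dimensional Gaussian negative log-likelihood of a fixed small reconstruction error as the decoded variance shrinks; the chosen error and variance values are arbitrary assumptions.

```python
# Illustration (assumed values): the Gaussian log-likelihood term explodes as the
# decoded variance approaches zero, even for a tiny reconstruction error.
import math

error = 0.01  # |x_d - mu_theta,d(z)|, a small fixed error (assumption)
for var in [1.0, 1e-2, 1e-4, 1e-6, 1e-8]:
    # One-dimensional negative Gaussian log-density (cf. the per-dimension term in eq. (7) below).
    nll = 0.5 * (error ** 2 / var + math.log(2 * math.pi * var))
    print(f"sigma^2 = {var:.0e}  ->  negative log-likelihood = {nll:.1f}")
# The term error**2 / (2 * var) grows without bound, which is exactly the
# zero-variance problem described in the text.
```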



Figure 1: (a) Visualization of the SMTP data. (b) Mean training loss for the SMTP data. The inset is an enlargement near the 983rd epoch. (c) Relation of the decoded variance to the difference in training losses. Loss^(t) represents −L̂(θ, φ; x^(i)) for each data point at the t-th epoch, and min ln σ_θ²(z) represents the minimum of ln σ_θ²(z). We plotted the means and standard deviations of Loss^(t) − Loss^(t−1), where min ln σ_θ²(z) ∈ [c − 0.5, c + 0.5], for c in ℤ. From the 982nd to the 983rd epoch, the training loss of data points with small ln σ_θ²(z) increased drastically.

When the decoded variance σ_θ²(z^(i,ℓ)) is almost zero, the Gaussian decoder is sensitive to the error between the data point and its decoded mean: even if x^(i) differs only slightly from its decoded mean μ_θ(z^(i,ℓ)), the value of the first term of the training objective function (6):

ln p_θ(x^(i)|z^(i,ℓ)) = ln N(x^(i)|μ_θ(z^(i,ℓ)), σ_θ²(z^(i,ℓ)))
                      = −Σ_d [ (x_d^(i) − μ_{θ,d}(z^(i,ℓ)))² / (2σ_{θ,d}²(z^(i,ℓ))) + (1/2) ln 2πσ_{θ,d}²(z^(i,ℓ)) ],    (7)

changes drastically, where d is the dimension index of x^(i), μ_θ(z^(i,ℓ)), and σ_θ²(z^(i,ℓ)). This sensitivity makes the training of the Gaussian VAE unstable. We call this problem the zero-variance problem. The zero-variance problem often occurs with biased data, since some data points are from a cluster with very small variance, and their decoded variance becomes much smaller as the training proceeds. Since the cause of this problem is the sensitivity of the objective function, changing the optimizer or tuning its hyperparameters does not solve it.

4 Student-t VAE

We would like to solve this zero-variance problem by making the decoder robust to the error between the data point and its decoded mean. We propose a Bayesian approach to the inference of the Gaussian decoder, which leads to using a Student-t distribution as the decoder.

We introduce a prior distribution for the variance of the Gaussian decoder. Let the precision parameter τ_θ(z) be the inverse of the variance, τ_θ(z) = 1/σ_θ²(z). We use the conjugate prior for the precision parameter, which is the following Gamma distribution:

Gam(τ|a, b) = b^a τ^(a−1) exp(−bτ) / Γ(a),    (8)

where a is a shape parameter and b is a rate parameter. The domains of a and b are positive.

4.1 MAP Estimation

First, we present maximum a posteriori (MAP) estimation as a way to solve the zero-variance problem. To simplify the calculation, we use Gam(τ|1, b) as the prior for 1/σ_θ²(z). Its log probability is given by

ln Gam(τ_θ(z)|1, b) = ln b − bτ_θ(z) ∝ −b / σ_θ²(z).    (9)

Accordingly, the objective function of MAP estimation for the VAE is:

(1/L) Σ_{ℓ=1}^{L} { ln p_θ(x^(i)|z^(i,ℓ)) − b / σ_θ²(z^(i,ℓ)) } − D_KL( q_φ(z|x^(i)) || p(z) ).    (10)

We call this type of VAE the MAP VAE. This objective function can be viewed as a regularized version of the original Gaussian VAE objective function (6), where a small variance σ_θ²(z^(i,ℓ)) is penalized by −b/σ_θ²(z^(i,ℓ)) with the regularization parameter b. This regularization parameter b can be tuned by cross-validation, which requires a heavy computational cost. In addition, the MAP VAE has a problem in that the prior distribution is assumed to be independent of the latent variables z, which leads to a lack of flexibility in density estimation.
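As a rough sketch of how (10) changes the training loss relative to (6), the following function, an illustration under the same assumptions as the earlier Gaussian VAE sketch rather than the authors' implementation, adds the −b/σ_θ²(z) penalty of the Gamma prior to the per-sample Gaussian term.

```python
# Sketch (assumed interface): MAP VAE penalty from eq. (10). `log_px` is the Gaussian
# log-likelihood ln p_theta(x|z) per sample, `logvar_x` the decoded log-variance, and
# `kl` the analytic KL term; all three follow the earlier GaussianVAE sketch.
import torch

def map_vae_negative_objective(log_px, logvar_x, kl, b=0.001):
    # Gam(tau | 1, b) prior on the precision contributes -b / sigma_theta^2(z),
    # i.e. -b * exp(-logvar_x), summed over dimensions.
    penalty = b * torch.exp(-logvar_x).sum(dim=1)
    # Negative of eq. (10): small decoded variances are now explicitly penalized.
    return (kl - (log_px - penalty)).mean()
```

The regularization strength b would still have to be chosen by cross-validation, which is the cost that the marginalized Student-t decoder in Section 4.2 avoids.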



4.2 Marginalization of the Variance Parameter

We propose a more flexible and computationally efficient approach by introducing a Gamma prior distribution that depends on the latent variables, Gam(τ|a(z), b(z)), where a(z) and b(z) are the shape and rate parameters, respectively, which take the latent variable z as their inputs. By analytically integrating out the precision τ, the conditional distribution of the observation x given the latent variable z becomes a Student-t distribution as follows:

p_θ(x|z) = ∫_0^∞ N(x|μ_θ(z), τ^(−1)) Gam(τ|a(z), b(z)) dτ
         = [Γ((ν_θ(z)+1)/2) / Γ(ν_θ(z)/2)] [λ_θ(z) / (πν_θ(z))]^(1/2) [1 + λ_θ(z)(x − μ_θ(z))² / ν_θ(z)]^(−(ν_θ(z)+1)/2)
         = St(x|μ_θ(z), λ_θ(z), ν_θ(z)),    (11)

where λ_θ(z) = a(z)/b(z) is a precision2 of the Student-t distribution, and ν_θ(z) = 2a(z) is a degree of freedom. We use neural networks to model the parameters of the Student-t distribution, λ_θ(z) and ν_θ(z) as well as μ_θ(z), and use them as the decoder. We call this type of VAE the Student-t VAE. The diagram of the Student-t decoder is illustrated in Figure 2.

2This parameter is not always equal to the inverse of the variance.

Figure 2: The diagram of decoders. The decoder neural network takes the latent variable z as input, and estimates the parameters of the assumed distribution. (a) The Gaussian decoder network estimates the mean μ_θ(z) and variance σ_θ²(z). (b) The Student-t decoder network estimates the mean μ_θ(z), precision λ_θ(z), and degree of freedom ν_θ(z).

Figure 3: Plot of the Student-t distribution St(x|0, 1, ν) in log scale for various values of ν. The Student-t distribution has a heavier tail compared with a Gaussian, and in the case of ν → ∞, St(x|μ, λ, ν) corresponds to a Gaussian N(x|μ, λ^(−1)).

Figure 3 shows a plot of the Student-t distribution. The Student-t distribution is obtained by mixing an infinite number of Gaussians that have the same mean but different variances, and it has an important property: tail heaviness. It is well known that a heavy-tailed distribution is robust because the probability in the tail region is higher than that of a light-tailed distribution such as a Gaussian distribution [Lange et al., 1989]. Therefore, the Student-t decoder is robust to the error between the data point and its decoded mean, which makes the training of the Student-t VAE stable. Since the degree of tail heaviness can be adjusted by tuning ν, the Student-t VAE can set an appropriate robustness for each data point by estimating ν_θ(z), which leads to better flexibility than the MAP VAE. Since there is no need to use cross-validation for tuning the parameters as in the MAP VAE, the Student-t VAE requires only a lightweight computational cost.
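A possible implementation of the Student-t decoder head and its log-likelihood, again a sketch under the same assumptions as the earlier Gaussian VAE code rather than the authors' implementation, is shown below. It uses torch.distributions.StudentT, whose scale argument corresponds to λ_θ(z)^(−1/2) in the notation of (11); softplus activations keep λ_θ(z) and ν_θ(z) positive.

```python
# Sketch (assumptions: layer sizes, softplus parameterization) of a Student-t decoder
# head for eq. (11), replacing the Gaussian decoder's mean/variance outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT

class StudentTDecoder(nn.Module):
    def __init__(self, z_dim, x_dim, h_dim=500):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, x_dim)        # mean mu_theta(z)
        self.pre_lam = nn.Linear(h_dim, x_dim)   # pre-activation for precision lambda_theta(z)
        self.pre_nu = nn.Linear(h_dim, x_dim)    # pre-activation for degrees of freedom nu_theta(z)

    def log_prob(self, z, x):
        h = self.net(z)
        mu = self.mu(h)
        lam = F.softplus(self.pre_lam(h)) + 1e-6  # precision > 0
        nu = F.softplus(self.pre_nu(h)) + 1e-6    # degrees of freedom > 0
        # St(x | mu, lam, nu) with precision lam corresponds to scale = lam^(-1/2).
        dist = StudentT(df=nu, loc=mu, scale=lam.rsqrt())
        # ln p_theta(x|z), summed over dimensions; replaces the Gaussian term in (6).
        return dist.log_prob(x).sum(dim=1)
```

Training would then proceed exactly as for the Gaussian VAE: the encoder, the reparameterization trick, and the KL term are unchanged, and the negative of (6) is minimized with this log_prob in place of the Gaussian one.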
5 Experiments

In this section, we evaluate the robustness of the training and the density estimation performance of the Student-t VAE.

5.1 Data

We used the following five datasets: SMTP, Aloi, Thyroid, Cancer, and Satellite. The SMTP data is the same as the data used in Section 3, where we used 10% of this dataset for training. We also used 10% of this dataset for validation and the remaining 80% for test. The Aloi data is the Amsterdam library of object images [Geusebroek et al., 2005], and the Thyroid, Cancer, and Satellite datasets were obtained from the UCI Machine Learning Repository [Lichman, 2013]. We used the transformations of these datasets by [Goldstein and Uchida, 2016]3. For these datasets, we used 50% for training, 10% for validation, and the remaining 40% for test. The total number of data points and the dimensions of the observation of the five datasets are listed in Table 1.

3These datasets are available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OPQMVF

             SMTP      Aloi      Thyroid   Cancer   Satellite
Data size    95,156    50,000    6,916     367      5,100
Dimension    3         27        21        30       36

Table 1: Number of data points and dimensions of the five datasets.

5.2 Setup

We used two-layer neural networks (500 hidden units per layer) as the encoder and the decoder, and a hyperbolic tangent as the activation function. We trained the VAE by using Adam [Kingma and Ba, 2014] with a mini-batch size of 100. We set the sample size of the reparameterization trick to L = 1. The maximum number of epochs was 500, and we used early stopping [Goodfellow et al., 2016] based on the validation data. We used a two-dimensional latent variable vector z with the SMTP data, and a 20-dimensional latent variable vector for the other datasets.



Figure 4: Mean training loss (−Σ_{i=1}^{N} L̂(θ, φ; x^(i)) / N) versus wall clock time (seconds) for each dataset: (a) SMTP, (b) Aloi, (c) Thyroid, (d) Cancer, (e) Satellite, comparing the Gaussian VAE, MAP VAE (b = 1 and b = 0.001), and Student-t VAE.

For the evaluation, we calculated the marginal log-likelihood of the test data by using importance sampling [Burda et al., 2015]. We set the sample size of the importance sampling to 100. We ran all experiments ten times each. The data were standardized with zero mean and unit variance. We used the following setup: the CPU was an Intel Xeon E5-2640 v4 2.40GHz, the memory size was 1 TB, and the GPU was an NVIDIA Tesla M40.
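The marginal log-likelihood in (1) has no closed form, so the test score is estimated by importance sampling with the encoder as the proposal, as in [Burda et al., 2015]. The following is a sketch of that estimator, reusing the interfaces assumed in the earlier code snippets; it is not the authors' evaluation script.

```python
# Sketch of the importance-sampling estimate of ln p_theta(x) used for evaluation,
# with K = 100 samples from the encoder q_phi(z|x) as the proposal distribution.
# `encoder(x)` returning (mu_z, logvar_z) and `decoder_log_prob(z, x)` returning
# ln p_theta(x|z) are assumed to follow the earlier sketches.
import math
import torch
from torch.distributions import Normal

def estimate_log_likelihood(x, encoder, decoder_log_prob, K=100):
    mu_z, logvar_z = encoder(x)                                  # parameters of q_phi(z|x)
    q = Normal(mu_z, (0.5 * logvar_z).exp())                     # encoder distribution
    p = Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))    # prior p(z) = N(0, I)
    log_ws = []
    for _ in range(K):
        z = q.rsample()
        # log importance weight: ln p_theta(x|z) + ln p(z) - ln q_phi(z|x)
        log_w = decoder_log_prob(z, x) + p.log_prob(z).sum(-1) - q.log_prob(z).sum(-1)
        log_ws.append(log_w)
    # ln p_theta(x) is approximated by logsumexp_k(log_w_k) - ln K, per data point.
    return torch.logsumexp(torch.stack(log_ws, dim=0), dim=0) - math.log(K)
```

Averaging this estimate over the test set corresponds to the test log-likelihoods reported in Table 2.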



5.3 Results

Figures 4a–4e show the relationship between the mean training loss and wall clock time (seconds), and Table 2 compares the test log-likelihoods of the Gaussian VAE, MAP VAE, and Student-t VAE.

Method           SMTP             Aloi              Thyroid           Cancer            Satellite
Gaussian         -1.248 ± 0.404   45.418 ± 5.457    15.519 ± 4.422    -18.668 ± 3.448   -1.852 ± 0.370
MAP(b = 1)       -4.864 ± 0.020   -38.210 ± 0.156   -31.266 ± 0.159   -45.895 ± 0.843   -50.895 ± 0.238
MAP(b = 0.001)   -1.932 ± 0.404   30.406 ± 0.383    18.037 ± 1.318    -19.017 ± 3.273   -1.899 ± 0.372
Student-t        0.827 ± 0.105    77.022 ± 0.539    69.543 ± 0.634    -18.253 ± 2.629   -1.811 ± 0.289

Table 2: Comparison of test log-likelihoods. We highlighted the best result in bold, and we also highlighted in bold the results that are not statistically different from the best result according to a pair-wise t-test with a p-value of 5%.

Figure 5: Relationship between the test log-likelihoods and various settings of the Gaussian VAE and MAP VAE. We used the SMTP data, and plotted the test log-likelihoods in Table 2 as dashed lines for comparison. The semi-transparent area and error bars represent standard deviations. (a) Relationship between the test log-likelihoods and mini-batch size with the Gaussian VAE. (b) Relationship between the test log-likelihoods and learning rate with the Gaussian VAE. (c) Relationship between the test log-likelihoods and regularization parameter b with the MAP VAE. (d) Relationship between the test log-likelihoods and the stability of training under a noisy environment.

First, we focused on the Gaussian VAE. With the SMTP, Aloi, and Thyroid data, training of the Gaussian VAE was unstable, and the test log-likelihoods of the Gaussian VAE were worse than those of the Student-t VAE. On the other hand, with the Cancer and Satellite data, the training of the Gaussian VAE was stable, and the Gaussian VAE and the Student-t VAE achieved comparable test log-likelihoods. Figures 5a and 5b show the test log-likelihoods of the Gaussian VAE for different mini-batch sizes and different learning rates for the SMTP data, respectively. Even though the mini-batch size and learning rate changed, the test log-likelihood of the Gaussian VAE was worse than that of the Student-t VAE. These results indicate that improving the robustness of the objective function is better than tuning the hyperparameters of the optimizer.

Second, we focused on the MAP VAE. When the regularization parameter was large, b = 1, the negative variational lower bound was stable but did not decrease well. When the regularization parameter was small, b = 0.001, with the SMTP, Aloi, and Thyroid data, the negative variational lower bound became unstable, and it behaved similarly to that of the Gaussian VAE. The test log-likelihoods of the MAP VAE were equal to or worse than those of the Gaussian VAE, because the constant parameter b of the Gamma prior was not flexible. Figure 5c shows the test log-likelihoods with different regularization parameter values b for the SMTP data. When b ≤ 0.001, the test log-likelihood of the MAP VAE was not statistically different from that of the Gaussian VAE, and when b ≥ 0.01, the larger b was, the worse the test log-likelihood of the MAP VAE became. These results indicate that setting an appropriate robustness for each data point is effective to avoid the zero-variance problem without losing the flexibility of density estimation.

Third, we focused on the Student-t VAE. With all of the datasets, the Student-t VAE reduced the training loss stably, and obtained equal to or better density estimation performance than that of the Gaussian VAE and the MAP VAE. Since the Student-t VAE can set an appropriate robustness for each data point by estimating ν_θ(z), it avoids the zero-variance problem while improving the flexibility of the decoder, which makes the training stable and leads to good density estimation performance.

In addition, we evaluated the stability of training under a noisy environment. We added noisy samples that follow a uniform distribution to the SMTP data, and evaluated the test log-likelihoods. Figure 5d shows the relationship between the test log-likelihoods and the ratio of noisy samples to the dataset. Whereas the training of the Gaussian VAE becomes unstable as the ratio of noisy samples increases, the training of the Student-t VAE continues to be stable. This result indicates that the Student-t VAE is useful in noisy environments such as real-world applications.
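As a rough sketch of how the noise injection above might be set up (the paper does not specify the range of the uniform noise, so the bounds below are assumptions), one can append uniformly distributed samples to the standardized training set:

```python
# Sketch (assumed noise range): append uniform noisy samples to the training data
# at a given ratio, as in the robustness experiment of Figure 5d.
import numpy as np

def add_uniform_noise(X_train, ratio, low=-5.0, high=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n_noise = int(ratio * len(X_train))
    noise = rng.uniform(low, high, size=(n_noise, X_train.shape[1]))
    return np.vstack([X_train, noise])
```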
These results indicate that the Student-t VAE is a good alternative to the Gaussian VAE: the training of the Student-t VAE is stable, it requires only a lightweight computational cost, and the density estimation performance of the Student-t VAE is equal to or better than that of the Gaussian VAE.

6 Related Work

Regarding related work on improving the stability of training the VAE, a number of methods have been proposed that reduce the variance of stochastic gradients [Johnson and Zhang, 2013; Wang et al., 2013; Kingma et al., 2015; Roeder et al., 2017; Miller et al., 2017]. These methods focused on the stability of optimization methods, such as the reparameterization trick and the stochasticity due to data subsampling. The Student-t VAE focuses on a different part from these methods: the robustness of the training objective function. This requires only a lightweight computational cost, and can be used together with these existing methods.

It is well known that the Student-t distribution has robustness [Lange et al., 1989], and this robustness has been applied to a number of machine learning algorithms, such as stochastic neighbor embedding [Maaten and Hinton, 2008], Gaussian process regression [Jylänki et al., 2011], and Bayesian optimization [Martinez-Cantin et al., 2017]. These algorithms used a Student-t distribution to reduce the influence of noise included in the observed data. The Student-t VAE uses a Student-t distribution to reduce the influence of the error between the data point and its decoded mean, which makes the training objective function of the Student-t VAE robust.

7 Conclusion

We proposed the Student-t variational autoencoder: a robust multivariate density estimator based on the variational autoencoder (VAE). The training of the Gaussian VAE often becomes unstable. We investigated the cause of this instability, and revealed that the Gaussian decoder is sensitive to the error between the data point and its decoded mean when the decoded variance is almost zero.

In order to improve the robustness of the VAE, we introduced a Bayesian approach to the Gaussian decoder: we set a Gamma prior for the inverse of the decoded variance and marginalized it out analytically, which led to using the Student-t distribution as the decoder. Since the Student-t distribution is a heavy-tailed distribution, the Student-t decoder is robust to the error, which makes the training stable. We demonstrated the robustness of the training and the high density estimation performance of the Student-t VAE in experiments using five datasets.

In the future, we will try to apply the Student-t VAE to real-world applications such as anomaly detection [Suh et al., 2016] and image generation [van den Oord et al., 2016].


References

[Burda et al., 2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[Geusebroek et al., 2005] Jan-Mark Geusebroek, Gertjan J Burghouts, and Arnold WM Smeulders. The Amsterdam library of object images. International Journal of Computer Vision, 61(1):103–112, 2005.

[Goldstein and Uchida, 2016] Markus Goldstein and Seiichi Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PloS one, 11(4):e0152173, 2016.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[Johnson and Zhang, 2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[Jylänki et al., 2011] Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227–3257, 2011.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[Kingma et al., 2015] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[Lange et al., 1989] Kenneth L Lange, Roderick JA Little, and Jeremy MG Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881–896, 1989.

[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.

[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[Martinez-Cantin et al., 2017] Ruben Martinez-Cantin, Michael McCourt, and Kevin Tee. Robust Bayesian optimization with Student-t likelihood. arXiv preprint arXiv:1707.05729, 2017.

[McLachlan and Peel, 2004] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.

[Miller et al., 2017] Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems 30, pages 3711–3721, 2017.

[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[Rezende et al., 2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

[Roeder et al., 2017] Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6928–6937, 2017.

[Scott, 2015] David W Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

[Silverman, 1986] Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.

[Suh et al., 2016] Suwon Suh, Daniel H Chae, Hyon-Goo Kang, and Seungjin Choi. Echo-state conditional variational autoencoder for anomaly detection. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 1015–1022. IEEE, 2016.

[Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

[van den Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[Wang et al., 2013] Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.

[Zeiler, 2012] Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
