Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence (IJCAI-18)

Student-t Variational Autoencoder for Robust Density Estimation

Hiroshi Takahashi1, Tomoharu Iwata2, Yuki Yamanaka3, Masanori Yamada3, Satoshi Yagi1
1 NTT Software Innovation Center  2 NTT Communication Science Laboratories  3 NTT Secure Platform Laboratories
{takahashi.hiroshi, iwata.tomoharu, yamanaka.yuki, yamada.m, yagi.satoshi}@lab.ntt.co.jp

Abstract

We propose a robust multivariate density estimator based on the variational autoencoder (VAE). The VAE is a powerful deep generative model, and is used for multivariate density estimation. With the original VAE, the distribution of observed continuous variables is assumed to be a Gaussian, where its mean and variance are modeled by deep neural networks taking latent variables as their inputs. This distribution is called the decoder. However, the training of the VAE often becomes unstable. One reason is that the decoder of the VAE is sensitive to the error between the data point and its estimated mean when its estimated variance is almost zero. We solve this instability problem by making the decoder robust to the error using a Bayesian approach to the variance estimation: we set a prior for the variance of the Gaussian decoder, and marginalize it out analytically, which leads to proposing the Student-t VAE. Numerical experiments with various datasets show that training of the Student-t VAE is robust, and the Student-t VAE achieves high density estimation performance.

1 Introduction

Multivariate density estimation [Scott, 2015], which estimates the distribution of continuous data, is an important task for artificial intelligence. This fundamental task is widely performed, from basic analysis such as clustering and data visualization to applications such as image processing, speech recognition, natural language processing, and anomaly detection.

For these tasks, conventional density estimation methods such as kernel density estimation [Silverman, 1986] and the Gaussian mixture model [McLachlan and Peel, 2004] are often used. However, recent developments of networks and sensors have made data more high-dimensional, complicated, and noisy, and hence multivariate density estimation has become very difficult.

Meanwhile, the variational autoencoder (VAE) [Kingma and Welling, 2013; Rezende et al., 2014] was presented as a powerful generative model for learning high-dimensional complicated data by using neural networks, and the VAE is used for multivariate density estimation. The VAE is composed of two conditional distributions: the encoder and the decoder, where neural networks are used to model the parameters of these conditional distributions. The encoder infers the posterior distribution of continuous latent variables given an observation. The decoder infers the posterior distribution of an observation given a latent variable. The encoder and decoder neural networks are optimized by minimizing the training objective function. In the density estimation task, since the observed variables are continuous, a Gaussian distribution is used for the decoder. We call this type of VAE the Gaussian VAE.

However, the training of the Gaussian VAE often becomes unstable. The reason is as follows: the training objective function of the Gaussian VAE is sensitive to the error between the data point and its decoded mean when its decoded variance is almost zero, and hence the objective function can give an extremely large value even with a small error. We call this problem the zero-variance problem. The zero-variance problem often occurs with biased data, i.e., data in which some clusters of data points have small variance. Real-world datasets such as network, sensor, and media datasets often have this bias; therefore, this problem is serious considering the application of the VAE to real-world datasets.

Our purpose is to solve this instability by making the decoder robust to the error. In this paper, we introduce a Bayesian approach to the inference of the Gaussian decoder: we set a Gamma prior for the inverse of the variance of the decoder and marginalize it out analytically, which leads to introducing a Student-t distribution as the decoder distribution. We call this proposed method the Student-t VAE. Since the Student-t distribution is a heavy-tailed distribution [Lange et al., 1989], the Student-t decoder is robust to the error between the data point and its decoded mean, which makes the training of the Student-t VAE stable.

2 Variational Autoencoder

First, we review the variational autoencoder (VAE) [Kingma and Welling, 2013; Rezende et al., 2014]. The VAE is a probabilistic latent variable model that relates an observed variable vector x to a continuous latent variable vector z by a conditional distribution. Since our task is density estimation, we assume that the observed variables x are continuous.

With the VAE, the marginal likelihood of a data point x is given by

p_θ(x) = ∫ p_θ(x|z) p(z) dz,    (1)

where p(z) is a prior of the latent variable vector, which is usually modeled by a standard Gaussian distribution N(z|0, I), and p_θ(x|z) = N(x|μ_θ(z), σ_θ²(z)) is a Gaussian distribution with mean μ_θ(z) and variance σ_θ²(z), which are modeled by neural networks with parameter θ and input z. These neural networks are called the decoder. We call this type of VAE the Gaussian VAE.

Given a dataset X = {x^(1), ..., x^(N)}, the sum of the marginal log-likelihoods is given by

ln p_θ(X) = Σ_{i=1}^{N} ln p_θ(x^(i)),    (2)

where N is the number of data points. The marginal log-likelihood is bounded below by the variational lower bound, which is derived from Jensen's inequality, as follows:

ln p_θ(x^(i)) = ln E_{q_φ(z|x^(i))}[ p_θ(x^(i)|z) p(z) / q_φ(z|x^(i)) ]
             ≥ E_{q_φ(z|x^(i))}[ ln ( p_θ(x^(i)|z) p(z) / q_φ(z|x^(i)) ) ]
             = L(θ, φ; x^(i)),    (3)

where E[·] represents the expectation, q_φ(z|x) = N(z|μ_φ(x), σ_φ²(x)) is the posterior of z given x, and its mean μ_φ(x) and variance σ_φ²(x) are modeled by neural networks with parameter φ. These neural networks are called the encoder.

The variational lower bound (3) can also be written as:

L(θ, φ; x^(i)) = E_{q_φ(z|x^(i))}[ ln p_θ(x^(i)|z) ] − D_KL( q_φ(z|x^(i)) || p(z) ),    (4)

where D_KL(P || Q) is the Kullback-Leibler (KL) divergence between P and Q. The parameters of the encoder and decoder neural networks are optimized by maximizing the variational lower bound using stochastic gradient descent (SGD) [Duchi et al., 2011; Zeiler, 2012; Tieleman and Hinton, 2012; Kingma and Ba, 2014]. The expectation term in (4) is approximated by the reparameterization trick [Kingma and Welling, 2013]:

E_{q_φ(z|x^(i))}[ ln p_θ(x^(i)|z) ] ≈ (1/L) Σ_{ℓ=1}^{L} ln p_θ(x^(i)|z^(i,ℓ)),    (5)

where z^(i,ℓ) = μ_φ(x^(i)) + ε ⊙ σ_φ(x^(i)), ε is a sample drawn from N(0, I), and L is the sample size of the reparameterization trick. L = 1 is usually used [Kingma and Welling, 2013]. Then, the resulting objective function is

L(θ, φ; x^(i)) ≈ (1/L) Σ_{ℓ=1}^{L} ln p_θ(x^(i)|z^(i,ℓ)) − D_KL( q_φ(z|x^(i)) || p(z) ) = L̂(θ, φ; x^(i)).    (6)

The KL divergence between Gaussian distributions D_KL( q_φ(z|x^(i)) || p(z) ) and its gradient can be calculated analytically [Kingma and Welling, 2013]. In this paper, we minimize the negative of (6) instead of maximizing it.
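To make the objective concrete, the following is a minimal sketch, not the authors' released code, of how the Gaussian VAE and the negative of (6) might be implemented in PyTorch; the class name, layer sizes, and helper names are illustrative assumptions.

```python
# Sketch of a Gaussian VAE with the negative of eq. (6) as training loss.
import math
import torch
import torch.nn as nn

class GaussianVAE(nn.Module):
    def __init__(self, x_dim, z_dim, h_dim=500):
        super().__init__()
        # Encoder q_phi(z|x): two-layer network with tanh activations (cf. Section 3).
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.enc_mu = nn.Linear(h_dim, z_dim)
        self.enc_logvar = nn.Linear(h_dim, z_dim)
        # Decoder p_theta(x|z): Gaussian with mean mu_theta(z) and variance sigma_theta^2(z).
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.dec_mu = nn.Linear(h_dim, x_dim)
        self.dec_logvar = nn.Linear(h_dim, x_dim)

    def negative_elbo(self, x):
        h = self.enc(x)
        mu_z, logvar_z = self.enc_mu(h), self.enc_logvar(h)
        # Reparameterization trick with L = 1, eq. (5): z = mu + eps * sigma.
        z = mu_z + torch.randn_like(mu_z) * torch.exp(0.5 * logvar_z)
        g = self.dec(z)
        mu_x, logvar_x = self.dec_mu(g), self.dec_logvar(g)
        # Gaussian log-likelihood ln p_theta(x|z), summed over dimensions.
        log_px = -0.5 * (((x - mu_x) ** 2) / logvar_x.exp()
                         + logvar_x + math.log(2 * math.pi)).sum(dim=1)
        # Analytic KL divergence between q_phi(z|x) and the standard Gaussian prior p(z).
        kl = 0.5 * (mu_z ** 2 + logvar_z.exp() - logvar_z - 1).sum(dim=1)
        # Negative of eq. (6), averaged over the mini-batch.
        return (kl - log_px).mean()
```

A training step would call negative_elbo(x) on a mini-batch and back-propagate with Adam, mirroring the setup described in Sections 3 and 5.2.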
3 Instability of Training Gaussian VAE

To investigate the Gaussian VAE, we applied it to the SMTP data1, which is a subset of the KDD Cup 1999 data provided by the scikit-learn community [Pedregosa et al., 2011]. The KDD Cup 1999 data were generated using a closed network and hand-injected attacks for evaluating the performance of supervised network intrusion detection, and they have often been used also for unsupervised anomaly detection. The SMTP data consists of three-dimensional continuous data, and contains 95,156 data points. Figure 1a shows a visualization of this dataset. This dataset has some bias: the variance of some clusters of data points is small along the dimension directions.

1This dataset is available at http://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_kddcup99.html

We trained the Gaussian VAE using this dataset by Adam [Kingma and Ba, 2014] with a mini-batch size of 100. We used a two-dimensional latent variable vector z, two-layer neural networks (500 hidden units per layer) as the encoder and the decoder, and a hyperbolic tangent as the activation function. The data were standardized with zero mean and unit variance. We used 10% of this dataset for training.

Figure 1b shows the mean training loss, which equals −Σ_{i=1}^{N} L̂(θ, φ; x^(i)) / N. The training loss was very unstable. One reason for this is that the variance σ_θ²(z^(i,ℓ)) in the decoder N(x^(i)|μ_θ(z^(i,ℓ)), σ_θ²(z^(i,ℓ))) becomes almost zero, where z^(i,ℓ) is sampled from the encoder N(z|μ_φ(x^(i)), σ_φ²(x^(i))). For example, at the 983rd epoch, the training loss jumped up sharply. Figure 1c shows the relationship between the difference in the training losses and the variance σ_θ²(z^(i,ℓ)) at this epoch. The training loss of the data points with small variance σ_θ²(z^(i,ℓ)) increased drastically.
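The size of this jump is easy to reproduce in isolation. The snippet below is a self-contained illustration rather than part of the paper's experiments: it evaluates the one-dimensional Gaussian negative log-likelihood of a fixed small reconstruction error as the decoded variance shrinks; the chosen error and variance values are arbitrary assumptions.

```python
# Illustration (assumed values): the Gaussian log-likelihood term explodes as the
# decoded variance approaches zero, even for a tiny reconstruction error.
import math

error = 0.01  # |x_d - mu_theta,d(z)|, a small fixed error (assumption)
for var in [1.0, 1e-2, 1e-4, 1e-6, 1e-8]:
    # One-dimensional negative Gaussian log-density (cf. the per-dimension term in eq. (7) below).
    nll = 0.5 * (error ** 2 / var + math.log(2 * math.pi * var))
    print(f"sigma^2 = {var:.0e}  ->  negative log-likelihood = {nll:.1f}")
# The term error**2 / (2 * var) grows without bound, which is exactly the
# zero-variance problem described in the text.
```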



Figure 1: (a) Visualization of the SMTP data. (b) Mean training loss for the SMTP data. The inset is an enlargement near the 983rd epoch. (c) Relation of the decoded variance to the difference in training losses. Loss^(t) represents −L̂(θ, φ; x^(i)) for each data point at the t-th epoch, and min ln σ_θ²(z) represents the minimum of ln σ_θ²(z). We plotted the means and standard deviations of Loss^(t) − Loss^(t−1), where min ln σ_θ²(z) ∈ [c − 0.5, c + 0.5], for c in ℤ. From the 982nd to the 983rd epoch, the training loss of data points with small ln σ_θ²(z) increased drastically.

When the decoded variance σ_θ²(z^(i,ℓ)) is almost zero, the Gaussian decoder is sensitive to the error between the data point and its decoded mean: even if x^(i) differs only slightly from its decoded mean μ_θ(z^(i,ℓ)), the value of the first term of the training objective function (6):

ln p_θ(x^(i)|z^(i,ℓ)) = ln N(x^(i)|μ_θ(z^(i,ℓ)), σ_θ²(z^(i,ℓ)))
                      = −Σ_d [ (x_d^(i) − μ_{θ,d}(z^(i,ℓ)))² / (2σ_{θ,d}²(z^(i,ℓ))) + (1/2) ln 2πσ_{θ,d}²(z^(i,ℓ)) ],    (7)

changes drastically, where d is the dimension index of x^(i), μ_θ(z^(i,ℓ)), and σ_θ²(z^(i,ℓ)). This sensitivity makes the training of the Gaussian VAE unstable. We call this problem the zero-variance problem. The zero-variance problem often occurs with biased data, since some data points are from a cluster with very small variance, and their decoded variance becomes much smaller as the training proceeds. Since the cause of this problem is the sensitivity of the objective function, changing the optimizer or tuning its hyperparameters does not solve it.

4 Student-t VAE

We would like to solve this zero-variance problem by making the decoder robust to the error between the data point and its decoded mean. We propose a Bayesian approach to the inference of the Gaussian decoder, which leads to using a Student-t distribution as the decoder.

We introduce a prior distribution for the variance of the Gaussian decoder. Let the precision parameter τ_θ(z) be the inverse of the variance, τ_θ(z) = 1/σ_θ²(z). We use the conjugate prior for the precision parameter, which is the following Gamma distribution:

Gam(τ|a, b) = b^a τ^(a−1) exp(−bτ) / Γ(a),    (8)

where a is a shape parameter and b is a rate parameter. The domains of a and b are positive.

4.1 MAP Estimation

First, we present maximum a posteriori (MAP) estimation as a way to solve the zero-variance problem. To simplify the calculation, we use Gam(τ|1, b) as the prior for 1/σ_θ²(z). Its log probability is given by

ln Gam(τ_θ(z)|1, b) = ln b − bτ_θ(z) ∝ −b / σ_θ²(z).    (9)

Accordingly, the objective function of MAP estimation for the VAE is:

(1/L) Σ_{ℓ=1}^{L} { ln p_θ(x^(i)|z^(i,ℓ)) − b / σ_θ²(z^(i,ℓ)) } − D_KL( q_φ(z|x^(i)) || p(z) ).    (10)

We call this type of VAE the MAP VAE. This objective function can be viewed as a regularized version of the original Gaussian VAE objective function (6), where a small variance σ_θ²(z^(i,ℓ)) is penalized by −b/σ_θ²(z^(i,ℓ)) with the regularization parameter b. This regularization parameter b can be tuned by cross-validation, which requires a heavy computational cost. In addition, the MAP VAE has a problem in that the prior distribution is assumed to be independent of the latent variables z, which leads to a lack of flexibility in density estimation.
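As a rough sketch of how (10) changes the training loss relative to (6), the following function, an illustration under the same assumptions as the earlier Gaussian VAE sketch rather than the authors' implementation, adds the −b/σ_θ²(z) penalty of the Gamma prior to the per-sample Gaussian term.

```python
# Sketch (assumed interface): MAP VAE penalty from eq. (10). `log_px` is the Gaussian
# log-likelihood ln p_theta(x|z) per sample, `logvar_x` the decoded log-variance, and
# `kl` the analytic KL term; all three follow the earlier GaussianVAE sketch.
import torch

def map_vae_negative_objective(log_px, logvar_x, kl, b=0.001):
    # Gam(tau | 1, b) prior on the precision contributes -b / sigma_theta^2(z),
    # i.e. -b * exp(-logvar_x), summed over dimensions.
    penalty = b * torch.exp(-logvar_x).sum(dim=1)
    # Negative of eq. (10): small decoded variances are now explicitly penalized.
    return (kl - (log_px - penalty)).mean()
```

The regularization strength b would still have to be chosen by cross-validation, which is the cost that the marginalized Student-t decoder in Section 4.2 avoids.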



4.2 Marginalization of the Variance Parameter

We propose a more flexible and computationally efficient approach by introducing a Gamma prior distribution that depends on the latent variables, Gam(τ|a(z), b(z)), where a(z) and b(z) are the shape and rate parameters, respectively, which take the latent variable z as their inputs. By analytically integrating out the precision τ, the conditional distribution of the observation x given the latent variable z becomes a Student-t distribution as follows:

p_θ(x|z) = ∫_0^∞ N(x|μ_θ(z), τ^(−1)) Gam(τ|a(z), b(z)) dτ
         = [Γ((ν_θ(z)+1)/2) / Γ(ν_θ(z)/2)] [λ_θ(z) / (πν_θ(z))]^(1/2) [1 + λ_θ(z)(x − μ_θ(z))² / ν_θ(z)]^(−(ν_θ(z)+1)/2)
         = St(x|μ_θ(z), λ_θ(z), ν_θ(z)),    (11)

where λ_θ(z) = a(z)/b(z) is a precision2 of the Student-t distribution, and ν_θ(z) = 2a(z) is a degree of freedom. We use neural networks to model the parameters of the Student-t distribution, λ_θ(z) and ν_θ(z) as well as μ_θ(z), and use them as the decoder. We call this type of VAE the Student-t VAE. The diagram of the Student-t decoder is illustrated in Figure 2.

2This parameter is not always equal to the inverse of the variance.

Figure 2: The diagram of decoders. The decoder neural network takes the latent variable z as input, and estimates the parameters of the assumed distribution. (a) The Gaussian decoder network estimates the mean μ_θ(z) and variance σ_θ²(z). (b) The Student-t decoder network estimates the mean μ_θ(z), precision λ_θ(z), and degree of freedom ν_θ(z).

Figure 3: Plot of the Student-t distribution St(x|0, 1, ν) in log scale for various values of ν. The Student-t distribution has a heavier tail compared with a Gaussian, and in the case of ν → ∞, St(x|μ, λ, ν) corresponds to a Gaussian N(x|μ, λ^(−1)).

Figure 3 shows a plot of the Student-t distribution. The Student-t distribution is obtained by mixing an infinite number of Gaussians that have the same mean but different variances, and it has an important property: tail heaviness. It is well known that a heavy-tailed distribution is robust because the probability in the tail region is higher than that of a light-tailed distribution such as a Gaussian distribution [Lange et al., 1989]. Therefore, the Student-t decoder is robust to the error between the data point and its decoded mean, which makes the training of the Student-t VAE stable. Since the degree of tail heaviness can be adjusted by tuning ν, the Student-t VAE can set an appropriate robustness for each data point by estimating ν_θ(z), which leads to better flexibility than the MAP VAE. Since there is no need to use cross-validation for tuning the parameters as in the MAP VAE, the Student-t VAE requires only a lightweight computational cost.
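A possible implementation of the Student-t decoder head and its log-likelihood, again a sketch under the same assumptions as the earlier Gaussian VAE code rather than the authors' implementation, is shown below. It uses torch.distributions.StudentT, whose scale argument corresponds to λ_θ(z)^(−1/2) in the notation of (11); softplus activations keep λ_θ(z) and ν_θ(z) positive.

```python
# Sketch (assumptions: layer sizes, softplus parameterization) of a Student-t decoder
# head for eq. (11), replacing the Gaussian decoder's mean/variance outputs.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import StudentT

class StudentTDecoder(nn.Module):
    def __init__(self, z_dim, x_dim, h_dim=500):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, x_dim)        # mean mu_theta(z)
        self.pre_lam = nn.Linear(h_dim, x_dim)   # pre-activation for precision lambda_theta(z)
        self.pre_nu = nn.Linear(h_dim, x_dim)    # pre-activation for degrees of freedom nu_theta(z)

    def log_prob(self, z, x):
        h = self.net(z)
        mu = self.mu(h)
        lam = F.softplus(self.pre_lam(h)) + 1e-6  # precision > 0
        nu = F.softplus(self.pre_nu(h)) + 1e-6    # degrees of freedom > 0
        # St(x | mu, lam, nu) with precision lam corresponds to scale = lam^(-1/2).
        dist = StudentT(df=nu, loc=mu, scale=lam.rsqrt())
        # ln p_theta(x|z), summed over dimensions; replaces the Gaussian term in (6).
        return dist.log_prob(x).sum(dim=1)
```

Training would then proceed exactly as for the Gaussian VAE: the encoder, the reparameterization trick, and the KL term are unchanged, and the negative of (6) is minimized with this log_prob in place of the Gaussian one.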
5 Experiments

In this section, we evaluate the robustness of the training and the density estimation performance of the Student-t VAE.

5.1 Data

We used the following five datasets: SMTP, Aloi, Thyroid, Cancer, and Satellite. The SMTP data is the same as the data used in Section 3, where we used 10% of this dataset for training. We also used 10% of this dataset for validation and the remaining 80% for test. The Aloi data is the Amsterdam library of object images [Geusebroek et al., 2005], and the Thyroid, Cancer, and Satellite datasets were obtained from the UCI Machine Learning Repository [Lichman, 2013]. We used the transformations of these datasets by [Goldstein and Uchida, 2016]3. For these datasets, we used 50% for training, 10% for validation, and the remaining 40% for test. The total number of data points and the dimensions of the observation of the five datasets are listed in Table 1.

3These datasets are available at https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/OPQMVF

             SMTP      Aloi      Thyroid   Cancer   Satellite
Data size    95,156    50,000    6,916     367      5,100
Dimension    3         27        21        30       36

Table 1: Number of data points and dimensions of the five datasets.

5.2 Setup

We used two-layer neural networks (500 hidden units per layer) as the encoder and the decoder, and a hyperbolic tangent as the activation function. We trained the VAE by using Adam [Kingma and Ba, 2014] with a mini-batch size of 100. We set the sample size of the reparameterization trick to L = 1. The maximum number of epochs was 500, and we used early stopping [Goodfellow et al., 2016] based on the validation data. We used a two-dimensional latent variable vector z with the SMTP data, and a 20-dimensional latent variable vector for the other datasets.



Figure 4: Mean training loss (−Σ_{i=1}^{N} L̂(θ, φ; x^(i)) / N) versus wall clock time (seconds) for each dataset: (a) SMTP, (b) Aloi, (c) Thyroid, (d) Cancer, (e) Satellite, comparing the Gaussian VAE, MAP VAE (b = 1 and b = 0.001), and Student-t VAE.

For the evaluation, we calculated the marginal log-likelihood of the test data by using importance sampling [Burda et al., 2015]. We set the sample size of the importance sampling to 100. We ran all experiments ten times each. The data were standardized with zero mean and unit variance. We used the following setup: the CPU was an Intel Xeon E5-2640 v4 2.40GHz, the memory size was 1 TB, and the GPU was an NVIDIA Tesla M40.
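The marginal log-likelihood in (1) has no closed form, so the test score is estimated by importance sampling with the encoder as the proposal, as in [Burda et al., 2015]. The following is a sketch of that estimator, reusing the interfaces assumed in the earlier code snippets; it is not the authors' evaluation script.

```python
# Sketch of the importance-sampling estimate of ln p_theta(x) used for evaluation,
# with K = 100 samples from the encoder q_phi(z|x) as the proposal distribution.
# `encoder(x)` returning (mu_z, logvar_z) and `decoder_log_prob(z, x)` returning
# ln p_theta(x|z) are assumed to follow the earlier sketches.
import math
import torch
from torch.distributions import Normal

def estimate_log_likelihood(x, encoder, decoder_log_prob, K=100):
    mu_z, logvar_z = encoder(x)                                  # parameters of q_phi(z|x)
    q = Normal(mu_z, (0.5 * logvar_z).exp())                     # encoder distribution
    p = Normal(torch.zeros_like(mu_z), torch.ones_like(mu_z))    # prior p(z) = N(0, I)
    log_ws = []
    for _ in range(K):
        z = q.rsample()
        # log importance weight: ln p_theta(x|z) + ln p(z) - ln q_phi(z|x)
        log_w = decoder_log_prob(z, x) + p.log_prob(z).sum(-1) - q.log_prob(z).sum(-1)
        log_ws.append(log_w)
    # ln p_theta(x) is approximated by logsumexp_k(log_w_k) - ln K, per data point.
    return torch.logsumexp(torch.stack(log_ws, dim=0), dim=0) - math.log(K)
```

Averaging this estimate over the test set corresponds to the test log-likelihoods reported in Table 2.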



5.3 Results

Figures 4a–4e show the relationship between the mean training loss and wall clock time (seconds), and Table 2 compares the test log-likelihoods of the Gaussian VAE, MAP VAE, and Student-t VAE.

Method           SMTP             Aloi              Thyroid           Cancer            Satellite
Gaussian         -1.248 ± 0.404   45.418 ± 5.457    15.519 ± 4.422    -18.668 ± 3.448   -1.852 ± 0.370
MAP(b = 1)       -4.864 ± 0.020   -38.210 ± 0.156   -31.266 ± 0.159   -45.895 ± 0.843   -50.895 ± 0.238
MAP(b = 0.001)   -1.932 ± 0.404   30.406 ± 0.383    18.037 ± 1.318    -19.017 ± 3.273   -1.899 ± 0.372
Student-t        0.827 ± 0.105    77.022 ± 0.539    69.543 ± 0.634    -18.253 ± 2.629   -1.811 ± 0.289

Table 2: Comparison of test log-likelihoods. We highlighted the best result in bold, and we also highlighted in bold the results that are not statistically different from the best result according to a pair-wise t-test with a p-value of 5%.

Figure 5: Relationship between the test log-likelihoods and various settings of the Gaussian VAE and MAP VAE. We used the SMTP data, and plotted the test log-likelihoods in Table 2 as dashed lines for comparison. The semi-transparent area and error bars represent standard deviations. (a) Relationship between the test log-likelihoods and mini-batch size with the Gaussian VAE. (b) Relationship between the test log-likelihoods and learning rate with the Gaussian VAE. (c) Relationship between the test log-likelihoods and regularization parameter b with the MAP VAE. (d) Relationship between the test log-likelihoods and the stability of training under a noisy environment.

First, we focused on the Gaussian VAE. With the SMTP, Aloi, and Thyroid data, training of the Gaussian VAE was unstable, and the test log-likelihoods of the Gaussian VAE were worse than those of the Student-t VAE. On the other hand, with the Cancer and Satellite data, the training of the Gaussian VAE was stable, and the Gaussian VAE and the Student-t VAE achieved comparable test log-likelihoods. Figures 5a and 5b show the test log-likelihoods of the Gaussian VAE for different mini-batch sizes and different learning rates for the SMTP data, respectively. Even though the mini-batch size and learning rate changed, the test log-likelihood of the Gaussian VAE was worse than that of the Student-t VAE. These results indicate that improving the robustness of the objective function is better than tuning the hyperparameters of the optimizer.

Second, we focused on the MAP VAE. When the regularization parameter was large, b = 1, the negative variational lower bound was stable but did not decrease well. When the regularization parameter was small, b = 0.001, with the SMTP, Aloi, and Thyroid data, the negative variational lower bound became unstable, and it behaved similarly to that of the Gaussian VAE. The test log-likelihoods of the MAP VAE were equal to or worse than those of the Gaussian VAE, because the constant parameter b of the Gamma prior was not flexible. Figure 5c shows the test log-likelihoods with different regularization parameter values b for the SMTP data. When b ≤ 0.001, the test log-likelihood of the MAP VAE was not statistically different from that of the Gaussian VAE, and when b ≥ 0.01, the larger b was, the worse the test log-likelihood of the MAP VAE became. These results indicate that setting an appropriate robustness for each data point is effective to avoid the zero-variance problem without losing the flexibility of density estimation.

Third, we focused on the Student-t VAE. With all of the datasets, the Student-t VAE reduced the training loss stably, and obtained equal to or better density estimation performance than that of the Gaussian VAE and the MAP VAE. Since the Student-t VAE can set an appropriate robustness for each data point by estimating ν_θ(z), it avoids the zero-variance problem while improving the flexibility of the decoder, which makes the training stable and leads to good density estimation performance.

In addition, we evaluated the stability of training under a noisy environment. We added noisy samples that follow a uniform distribution to the SMTP data, and evaluated the test log-likelihoods. Figure 5d shows the relationship between the test log-likelihoods and the ratio of noisy samples to the dataset. Whereas the training of the Gaussian VAE becomes unstable as the ratio of noisy samples increases, the training of the Student-t VAE continues to be stable. This result indicates that the Student-t VAE is useful in noisy environments such as real-world applications.
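As a rough sketch of how the noise injection above might be set up (the paper does not specify the range of the uniform noise, so the bounds below are assumptions), one can append uniformly distributed samples to the standardized training set:

```python
# Sketch (assumed noise range): append uniform noisy samples to the training data
# at a given ratio, as in the robustness experiment of Figure 5d.
import numpy as np

def add_uniform_noise(X_train, ratio, low=-5.0, high=5.0, seed=0):
    rng = np.random.default_rng(seed)
    n_noise = int(ratio * len(X_train))
    noise = rng.uniform(low, high, size=(n_noise, X_train.shape[1]))
    return np.vstack([X_train, noise])
```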
These results indicate that the Student-t VAE is a good alternative to the Gaussian VAE: the training of the Student-t VAE is stable, it requires only a lightweight computational cost, and the density estimation performance of the Student-t VAE is equal to or better than that of the Gaussian VAE.

6 Related Work

Regarding related work on improving the stability of training the VAE, a number of methods have been proposed that reduce the variance of stochastic gradients [Johnson and Zhang, 2013; Wang et al., 2013; Kingma et al., 2015; Roeder et al., 2017; Miller et al., 2017]. These methods focused on the stability of optimization methods, such as the reparameterization trick and the stochasticity due to data subsampling. The Student-t VAE focuses on a different part from these methods: the robustness of the training objective function. This requires only a lightweight computational cost, and can be used together with these existing methods.

It is well known that the Student-t distribution has robustness [Lange et al., 1989], and this robustness has been applied to a number of machine learning algorithms, such as stochastic neighbor embedding [Maaten and Hinton, 2008], Gaussian process regression [Jylänki et al., 2011], and Bayesian optimization [Martinez-Cantin et al., 2017]. These algorithms used a Student-t distribution to reduce the influence of noise included in the observed data. The Student-t VAE uses a Student-t distribution to reduce the influence of the error between the data point and its decoded mean, which makes the training objective function of the Student-t VAE robust.

7 Conclusion

We proposed the Student-t variational autoencoder: a robust multivariate density estimator based on the variational autoencoder (VAE). The training of the Gaussian VAE often becomes unstable. We investigated the cause of this instability, and revealed that the Gaussian decoder is sensitive to the error between the data point and its decoded mean when the decoded variance is almost zero.

In order to improve the robustness of the VAE, we introduced a Bayesian approach to the Gaussian decoder: we set a Gamma prior for the inverse of the decoded variance and marginalized it out analytically, which led to using the Student-t distribution as the decoder. Since the Student-t distribution is a heavy-tailed distribution, the Student-t decoder is robust to the error, which makes the training stable. We demonstrated the robustness of the training and the high density estimation performance of the Student-t VAE in experiments using five datasets.

In the future, we will try to apply the Student-t VAE to real-world applications such as anomaly detection [Suh et al., 2016] and image generation [van den Oord et al., 2016].


References

[Burda et al., 2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.

[Duchi et al., 2011] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[Geusebroek et al., 2005] Jan-Mark Geusebroek, Gertjan J Burghouts, and Arnold WM Smeulders. The Amsterdam library of object images. International Journal of Computer Vision, 61(1):103–112, 2005.

[Goldstein and Uchida, 2016] Markus Goldstein and Seiichi Uchida. A comparative evaluation of unsupervised anomaly detection algorithms for multivariate data. PloS one, 11(4):e0152173, 2016.

[Goodfellow et al., 2016] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[Johnson and Zhang, 2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, pages 315–323, 2013.

[Jylänki et al., 2011] Pasi Jylänki, Jarno Vanhatalo, and Aki Vehtari. Robust Gaussian process regression with a Student-t likelihood. Journal of Machine Learning Research, 12(Nov):3227–3257, 2011.

[Kingma and Ba, 2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[Kingma and Welling, 2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[Kingma et al., 2015] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[Lange et al., 1989] Kenneth L Lange, Roderick JA Little, and Jeremy MG Taylor. Robust statistical modeling using the t distribution. Journal of the American Statistical Association, 84(408):881–896, 1989.

[Lichman, 2013] M. Lichman. UCI machine learning repository, 2013.

[Maaten and Hinton, 2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[Martinez-Cantin et al., 2017] Ruben Martinez-Cantin, Michael McCourt, and Kevin Tee. Robust Bayesian optimization with Student-t likelihood. arXiv preprint arXiv:1707.05729, 2017.

[McLachlan and Peel, 2004] Geoffrey McLachlan and David Peel. Finite Mixture Models. John Wiley & Sons, 2004.

[Miller et al., 2017] Andrew Miller, Nick Foti, Alexander D'Amour, and Ryan P Adams. Reducing reparameterization gradient variance. In Advances in Neural Information Processing Systems 30, pages 3711–3721, 2017.

[Pedregosa et al., 2011] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct):2825–2830, 2011.

[Rezende et al., 2014] Danilo J Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In Proceedings of the 31st International Conference on Machine Learning, pages 1278–1286, 2014.

[Roeder et al., 2017] Geoffrey Roeder, Yuhuai Wu, and David K Duvenaud. Sticking the landing: Simple, lower-variance gradient estimators for variational inference. In Advances in Neural Information Processing Systems, pages 6928–6937, 2017.

[Scott, 2015] David W Scott. Multivariate Density Estimation: Theory, Practice, and Visualization. John Wiley & Sons, 2015.

[Silverman, 1986] Bernard W Silverman. Density Estimation for Statistics and Data Analysis, volume 26. CRC Press, 1986.

[Suh et al., 2016] Suwon Suh, Daniel H Chae, Hyon-Goo Kang, and Seungjin Choi. Echo-state conditional variational autoencoder for anomaly detection. In Neural Networks (IJCNN), 2016 International Joint Conference on, pages 1015–1022. IEEE, 2016.

[Tieleman and Hinton, 2012] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4(2):26–31, 2012.

[van den Oord et al., 2016] Aaron van den Oord, Nal Kalchbrenner, Lasse Espeholt, Oriol Vinyals, Alex Graves, et al. Conditional image generation with PixelCNN decoders. In Advances in Neural Information Processing Systems, pages 4790–4798, 2016.

[Wang et al., 2013] Chong Wang, Xi Chen, Alexander J Smola, and Eric P Xing. Variance reduction for stochastic gradient optimization. In Advances in Neural Information Processing Systems, pages 181–189, 2013.

[Zeiler, 2012] Matthew D Zeiler. ADADELTA: An adaptive learning rate method. arXiv preprint arXiv:1212.5701, 2012.
