
On the Use of Natural Gradient for Variational AutoEncoders

Sabin Roman and Luigi Malagò
Romanian Institute of Science and Technology
e-mail: {roman,malago}@rist.ro

In this work-in-progress paper we study the use of Natural Gradient for the training of Variational AutoEncoders [Kingma and Welling, 2013, Rezende et al., 2014], a family of powerful generative models that combine the advantages of neural networks with those of variational inference. In the first part of the work we review the literature on the Natural Gradient [Amari, 1998] for neural networks, together with a discussion of the different approaches which have recently been proposed for training Variational AutoEncoders using geometric methods. Next, we discuss how to define a proper Fisher Information metric for generative models with latent variables, based on the nature of the statistical model considered in the analysis. Notice that the computation of the Fisher Information matrix is required for the evaluation of the Natural Gradient, which is given by the inverse of the Fisher Information matrix times the vector of partial derivatives of the function to be optimized. Finally, we provide a proposal for a new family of algorithms, based on a two-step optimization procedure in which the Natural Gradient is computed separately for the recognition and reconstruction networks.

The Natural Gradient corresponds to the Riemannian gradient of a function defined over a statistical manifold, endowed with a non-Euclidean metric given by the Fisher Information matrix. In the language of Information Geometry [Amari and Nagaoka, 2007], a statistical manifold is defined as a statistical model, i.e., a set of probability density functions parametrized by a vector of parameters which plays the role of a coordinate set. The Natural Gradient has many applications in machine learning; for instance, it can be used to define more sophisticated variants of classical Stochastic Gradient Descent, in which the vector of partial derivatives is replaced by the Riemannian gradient. Several numerical schemes for the computation of the Natural Gradient for feed-forward neural networks have been proposed in the literature, see [Park et al., 2000, González and Dorronsoro, 2008, Roux et al., 2008], but computations remained prohibitive for deep architectures. Recent years have seen a revival of interest in using the Natural Gradient for the training of neural networks [Martens and Grosse, 2015, Pascanu and Bengio, 2013, Ollivier, 2013], and several efficient algorithms which can be used to train deeper architectures have been proposed, e.g., [Martens and Grosse, 2015, Grosse and Martens, 2016, Desjardins et al., 2015, Marceau-Caron and Ollivier, 2016].

On the other hand, the Natural Gradient also finds relevant applications in variational inference [Honkela et al., 2007], especially for conjugate models where the computations become particularly efficient [Hoffman et al., 2013], and recently also in more general contexts [Khan and Lin, 2017]. However, in the literature on Variational AutoEncoders, where neural networks and variational inference are efficiently combined, the Natural Gradient has only recently started to be exploited [Zhang et al., 2017]. Recent literature focuses only on some components of the full inference process, see [Johnson et al., 2016] and [Lin et al., 2017], where the Fisher matrix is computed only for the encoder. We are interested in a fully geometrical framework to train Variational AutoEncoders. In order to achieve this, we need to define the Fisher Information matrix for a generative model with latent variables.
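As a point of reference for the discussion that follows, the minimal sketch below shows how the vector of partial derivatives is replaced by the Natural Gradient in a plain gradient descent update. This is only an illustration under simplifying assumptions: the helper names (empirical_fisher, natural_gradient_step) are hypothetical, the Fisher matrix is approximated by the empirical average of outer products of per-sample score vectors, and a damping term is added only to keep the estimate invertible.

```python
import numpy as np

def empirical_fisher(per_sample_scores):
    """Empirical Fisher matrix: average outer product of per-sample
    score vectors, F ~ E[ (d/dtheta log p) (d/dtheta log p)^T ]."""
    s = np.asarray(per_sample_scores)        # shape (n_samples, n_params)
    return s.T @ s / s.shape[0]

def natural_gradient_step(theta, grad, per_sample_scores,
                          learning_rate=1e-2, damping=1e-4):
    """One natural gradient descent step: theta <- theta - lr * F^{-1} grad.
    A small damping term keeps the estimated Fisher matrix invertible."""
    F = empirical_fisher(per_sample_scores)
    F = F + damping * np.eye(F.shape[0])
    nat_grad = np.linalg.solve(F, grad)      # solves F x = grad, i.e. x = F^{-1} grad
    return theta - learning_rate * nat_grad
```

For deep architectures the dense inversion in this sketch is exactly what becomes prohibitive, which motivates the structured approximations cited above, e.g., the Kronecker factorizations of [Martens and Grosse, 2015, Grosse and Martens, 2016].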
As for Restricted Boltzmann Machines, based on the choice of the underlying statistical model, in principle one could think of considering the joint probability model over the visible variables x and the hidden variables h, and compute derivatives of log p(x, h), or alternatively evaluate log derivatives of the marginal probability distribution p(x) = ∫ p(x|h) p(h) dh, yielding different results, cf. [Ollivier et al., 2011, Desjardins et al., 2013]. For Variational AutoEncoders a further issue appears: indeed, the choice of the approximate posterior is arbitrary and it does not affect the computation of the marginal distribution

p(x) = ∫ p(x|z) p(z) q(z|x)/q(z|x) dz .

This implies that the parameters of the approximate distribution do not play a role in parameterizing the marginal p(x), and if considered in the computation of the Fisher matrix this would result in singularities. However, the Fisher matrix needs to be approximated from samples, and a good choice for q(z|x) ensures more efficient estimations, implying a dependence of the estimator on the parameters of the approximate posterior. On the other hand, if the generative model represented by p(x|z) is kept fixed, and the approximate posterior is learned as in variational inference tasks, it is possible to introduce the Fisher Information matrix only for q(z|x), as it is done for example in stochastic variational inference [Hoffman et al., 2013] and the related literature. These observations suggest the introduction of a two-step procedure for the optimization of the parameters of a Variational AutoEncoder, in which the weights of the encoder and the weights of the decoder are optimized separately, using update rules based on the Natural Gradient; a schematic sketch of such a procedure is given below. It is interesting to notice that the possibility of training the recognition and the generative model separately has recently been suggested, for different reasons, in [Rainforth et al., 2018], where the authors analyze the impact of using tighter bounds for the log-likelihood during training, and the associated difficulty in obtaining reliable estimates of them. The analysis carried out in this work represents a preliminary step towards the design of novel training algorithms for Variational AutoEncoders, able to exploit the geometric nature of the optimization problem, using Information Geometric principles.
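The sketch below is one possible, purely illustrative realization of the two-step procedure mentioned above, not a specification of the proposed algorithm: it alternates a Natural Gradient update of the encoder parameters, with the Fisher matrix of q(z|x) as in stochastic variational inference, and a Natural Gradient update of the decoder parameters, with a Fisher matrix defined by the generative model only. It reuses the hypothetical natural_gradient_step helper from the previous sketch, and the score and gradient callables are placeholders for modelling choices that are not specified in this abstract.

```python
def two_step_vae_update(phi, theta, batch,
                        encoder_scores, decoder_scores,
                        encoder_grad, decoder_grad,
                        lr=1e-2, damping=1e-4):
    """One alternating update for a Variational AutoEncoder.

    phi            : parameters of the recognition model q(z|x)
    theta          : parameters of the generative model p(x|z)
    encoder_scores : callable returning per-sample d/dphi   log q(z|x)
    decoder_scores : callable returning per-sample d/dtheta log p(x|z)
    encoder_grad   : callable returning the objective gradient w.r.t. phi
    decoder_grad   : callable returning the objective gradient w.r.t. theta
    (all four callables are placeholders supplied by the user)."""
    # Step 1: update the recognition network with the generative model fixed,
    # using the Fisher Information matrix of the approximate posterior q(z|x).
    phi = natural_gradient_step(phi,
                                encoder_grad(phi, theta, batch),
                                encoder_scores(phi, theta, batch),
                                learning_rate=lr, damping=damping)
    # Step 2: update the generative network with the approximate posterior fixed,
    # using a Fisher Information matrix defined by p(x|z) alone, so that the
    # parameters of q(z|x) never enter the metric used for this step.
    theta = natural_gradient_step(theta,
                                  decoder_grad(phi, theta, batch),
                                  decoder_scores(phi, theta, batch),
                                  learning_rate=lr, damping=damping)
    return phi, theta
```

In this reading, the encoder step is a standard natural gradient variational inference update for q(z|x), while the decoder step uses a metric defined only by the generative model, which keeps the parameters of the approximate posterior out of the Fisher matrix and avoids the singularities discussed above.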

References

[Amari, 1998] Amari, S.-I. (1998). Natural gradient works efficiently in learning. Neural Computation, 10(2):251–276.

[Amari and Nagaoka, 2007] Amari, S.-I. and Nagaoka, H. (2007). Methods of Information Geometry, volume 191. American Mathematical Society.

[Desjardins et al., 2013] Desjardins, G., Pascanu, R., Courville, A., and Bengio, Y. (2013). Metric-free natural gradient for joint-training of Boltzmann machines. arXiv preprint arXiv:1301.3545.

[Desjardins et al., 2015] Desjardins, G., Simonyan, K., Pascanu, R., et al. (2015). Natural neural networks. In Advances in Neural Information Processing Systems, pages 2071–2079.

[González and Dorronsoro, 2008] González, A. and Dorronsoro, J. R. (2008). Natural conjugate gradient training of multilayer perceptrons. Neurocomputing, 71(13-15):2499–2506.

[Grosse and Martens, 2016] Grosse, R. and Martens, J. (2016). A Kronecker-factored approximate Fisher matrix for convolution layers. In International Conference on Machine Learning, pages 573–582.

[Hoffman et al., 2013] Hoffman, M. D., Blei, D. M., Wang, C., and Paisley, J. (2013). Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347.

[Honkela et al., 2007] Honkela, A., Tornio, M., Raiko, T., and Karhunen, J. (2007). Natural conjugate gradient in variational inference. In International Conference on Neural Information Processing, pages 305–314. Springer.

[Johnson et al., 2016] Johnson, M. J., Duvenaud, D., Wiltschko, A., Datta, S., and Adams, R. (2016). Structured VAEs: Composing probabilistic graphical models and variational autoencoders. NIPS 2016.

[Khan and Lin, 2017] Khan, M. E. and Lin, W. (2017). Conjugate-computation variational inference: Converting variational inference in non-conjugate models to inferences in conjugate models. arXiv preprint arXiv:1703.04265.

[Kingma and Welling, 2013] Kingma, D. P. and Welling, M. (2013). Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114.

[Lin et al., 2017] Lin, W., Khan, M. E., Hubacher, N., and Nielsen, D. (2017). Natural-gradient stochastic variational inference for non-conjugate structured variational autoencoder.

[Marceau-Caron and Ollivier, 2016] Marceau-Caron, G. and Ollivier, Y. (2016). Practical Riemannian neural networks. arXiv preprint arXiv:1602.08007.

[Martens and Grosse, 2015] Martens, J. and Grosse, R. (2015). Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pages 2408–2417.

[Ollivier, 2013] Ollivier, Y. (2013). Riemannian metrics for neural networks I: Feedforward networks. arXiv preprint arXiv:1303.0818.

[Ollivier et al., 2011] Ollivier, Y., Arnold, L., Auger, A., and Hansen, N. (2011). Information-geometric optimization algorithms: A unifying picture via invariance principles. arXiv preprint arXiv:1106.3708.

[Park et al., 2000] Park, H., Amari, S.-I., and Fukumizu, K. (2000). Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755–764.

[Pascanu and Bengio, 2013] Pascanu, R. and Bengio, Y. (2013). Revisiting natural gradient for deep networks. arXiv preprint arXiv:1301.3584.

[Rainforth et al., 2018] Rainforth, T., Kosiorek, A. R., Le, T. A., Maddison, C. J., Igl, M., Wood, F., and Teh, Y. W. (2018). Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537.

[Rezende et al., 2014] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082.

[Roux et al., 2008] Roux, N. L., Manzagol, P.-A., and Bengio, Y. (2008). Topmoumoute online natural gradient algorithm. In Advances in Neural Information Processing Systems, pages 849–856.

[Zhang et al., 2017] Zhang, G., Sun, S., Duvenaud, D., and Grosse, R. (2017). Noisy natural gradient as variational inference. arXiv preprint arXiv:1712.02390.
