
Joint Stochastic Approximation and Its Application to Learning Discrete Latent Variable Models

Zhijian Ou and Yunfu Song
Electronic Engineering Department, Tsinghua University, Beijing, China

Abstract

Although with progress in introducing auxiliary amortized inference models, learning discrete latent variable models is still challenging. In this paper, we show that the annoying difficulty of obtaining reliable stochastic gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed in a new method based on stochastic approximation (SA) theory of the Robbins-Monro type. Specifically, we propose to directly maximize the target log-likelihood and simultaneously minimize the inclusive divergence between the posterior and the inference model. The resulting learning algorithm is called joint SA (JSA). To the best of our knowledge, JSA represents the first method that couples an SA version of the EM (expectation-maximization) algorithm (SAEM) with an adaptive MCMC procedure. Experiments on several benchmark generative modeling and structured prediction tasks show that JSA consistently outperforms recent competitive algorithms, with faster convergence, better final likelihoods, and lower variance of gradient estimates.

1 INTRODUCTION

A wide range of machine learning tasks involves observed and unobserved data. Latent variable models explain observations as part of a partially observed system and usually express a joint distribution p_θ(x, h) over observation x and its unobserved counterpart h, with parameter θ. Models with discrete latent variables are broadly applied, including mixture modeling, unsupervised learning (Neal, 1992; Mnih and Gregor, 2014), structured output prediction (Raiko et al., 2014; Mnih and Rezende, 2016) and so on.

Currently variational methods are widely used for learning latent variable models, especially those parameterized using neural networks. In such methods, an auxiliary amortized inference model q_φ(h|x) with parameter φ is introduced to approximate the posterior p_θ(h|x) (Kingma and Welling, 2014; Rezende et al., 2014), and some bound of the marginal log-likelihood, used as a surrogate objective, is optimized over both θ and φ. Two well-known bounds are the evidence lower bound (ELBO) (Jordan et al., 1999) and the multi-sample importance-weighted (IW) lower bound (Burda et al., 2015). Despite progress (as reviewed in Section 4), a difficulty in variational learning of discrete latent variable models is to obtain reliable (unbiased, low-variance) Monte Carlo estimates of the gradient of the bound with respect to (w.r.t.) φ.[1]

Additionally, a common drawback in many existing methods for learning latent variable models is that they indirectly optimize some bound of the target marginal log-likelihood. This leaves an uncontrolled gap between the marginal log-likelihood and the bound, depending on the expressiveness of the inference model. There have been hard efforts to develop more expressive but increasingly complex inference models to reduce the gap (Salimans et al., 2014; Rezende and Mohamed, 2015; Maaløe et al., 2016; Kingma et al., 2017).[2] But it is highly desirable that we can eliminate the effect of the gap on model learning or, ideally, directly optimize the marginal log-likelihood, without increasing the complexity of the inference model.

In this paper, we show that the annoying difficulty of obtaining reliable stochastic gradients for the inference model and the drawback of indirectly optimizing the target log-likelihood can be gracefully addressed in a new method based on stochastic approximation (SA) theory of the Robbins-Monro type (Robbins and Monro, 1951). These two seemingly unrelated issues are in fact both rooted in the choice of optimizing the ELBO or the similar IW lower bound. Specifically, we propose to directly maximize the marginal log-likelihood w.r.t. θ[3] and simultaneously minimize w.r.t. φ the inclusive divergence KL[p_θ(h|x) || q_φ(h|x)][4] between the posterior and the inference model, and, fortunately, we can use the SA framework to solve this joint optimization problem. The key is to recast the two gradients as expectations and equate them to zero; the resulting equations can then be solved by applying the SA algorithm, in which the inference model serves as an adaptive proposal for constructing the Markov chain Monte Carlo (MCMC) sampler. The resulting learning algorithm is called joint SA (JSA), as it couples SA-based model learning and SA-based adaptive MCMC and jointly finds the two sets of unknown parameters (θ and φ).

It is worthwhile to recall that there is another class of methods for learning latent variable models by maximum likelihood (ML), even prior to the recent development of variational learning for approximate ML, which consists of the expectation-maximization (EM) algorithm (Dempster et al., 1977) and its extensions. Interestingly, we show that the JSA method amounts to coupling an SA version of EM (SAEM) (Delyon et al., 1999; Kuhn and Lavielle, 2004) with an adaptive MCMC procedure. This represents a new extension among the various stochastic versions of EM in the literature. Revealing this connection between JSA and SAEM is important for appreciating the new JSA method.

The JSA learning algorithm can handle both continuous and discrete latent variables. The application of JSA in the continuous case is not pursued in the present work, and we leave it as an avenue of future exploration. In this paper, we mainly present experimental results for learning discrete latent variable models with Bernoulli and categorical variables, consisting of stochastic layers or neural network layers. Our results on several benchmark generative modeling and structured prediction tasks demonstrate that JSA consistently outperforms recent competitive algorithms, with faster convergence, better final likelihoods, and lower variance of gradient estimates.

[1] This is because the gradient of the ELBO w.r.t. θ usually can be reliably estimated.
[2] Notably, some methods are mainly applied to continuous latent variables; e.g., it is challenging to directly apply normalizing flows to discrete random variables, though recently there has been some effort (Ziegler and Rush, 2019).
[3] "Directly" is in the sense that we set the marginal log-likelihood as the objective function in stochastic optimization.
[4] KL[p || q] ≜ ∫ p log(p/q).

Proceedings of the 36th Conference on Uncertainty in Artificial Intelligence (UAI), PMLR volume 124, 2020.

2 PRELIMINARIES

2.1 STOCHASTIC APPROXIMATION (SA)

Stochastic approximation methods are an important family of iterative stochastic optimization algorithms, introduced in (Robbins and Monro, 1951) and extensively studied (Benveniste et al., 1990; Chen, 2002). Basically, stochastic approximation provides a mathematical framework for stochastically solving a root-finding problem which has the form of expectations being equal to zeros. Suppose that the objective is to find the solution θ* of f(θ) = 0 with

    f(θ) = E_{z∼p_θ(·)}[F_θ(z)],    (1)

where θ is a d-dimensional parameter vector and z is an observation from a probability distribution p_θ(·) depending on θ. F_θ(z) ∈ R^d is a function of z, providing d-dimensional stochastic measurements of the so-called mean-field function f(θ). Intuitively, we solve a system of simultaneous equations, f(θ) = 0, which consists of d constraints, for determining the d-dimensional θ.

Given some initialization θ^(0) and z^(0), a general SA algorithm iterates Monte Carlo sampling and parameter updating, as shown in Algorithm 1.

Algorithm 1 The general stochastic approximation (SA) algorithm
    for t = 1, 2, … do
        Monte Carlo sampling: Draw a sample z^(t) with a Markov transition kernel K_{θ^(t−1)}(z^(t−1), ·), which starts with z^(t−1) and admits p_{θ^(t−1)}(·) as the invariant distribution.
        SA updating: Set θ^(t) = θ^(t−1) + γ_t F_{θ^(t−1)}(z^(t)), where γ_t is the learning rate.
    end for

The convergence of SA has been established under conditions (Benveniste et al., 1990; Andrieu et al., 2005; Song et al., 2014), including a few technical requirements on the mean-field function f(θ), the Markov transition kernel K_{θ^(t−1)}(z^(t−1), ·) and the learning rates. In particular, when f(θ) corresponds to the gradient of some objective function, θ^(t) will converge to a local optimum, driven by the stochastic gradients F_θ(z). To speed up convergence, during each SA iteration it is possible to generate a set of multiple observations z by performing the Markov transition repeatedly and then use the average of the corresponding values of F_θ(z) for updating θ, which is known as SA with multiple moves (Wang et al., 2018), as shown in Algorithm 3 in the Appendix.

Remarkably, Algorithm 1 shows stochastic approximation with Markovian perturbations (Benveniste et al., 1990). It is more general than non-Markovian SA, which requires exact sampling z^(t) ∼ p_{θ^(t−1)}(·) at each iteration and in some tasks can hardly be realized. In non-Markovian SA, we check that F_θ(z) is an unbiased estimate of f(θ), while in SA with Markovian perturbations, we check the ergodicity property of the Markov transition kernel.

2.2 VARIATIONAL LEARNING METHODS

Here we briefly review the variational methods recently developed for learning latent variable models (Kingma and Welling, 2014; Rezende et al., 2014). Consider a latent variable model p_θ(x, h) for observation x and latent variable h, with parameter θ. Instead of directly maximizing the marginal log-likelihood log p_θ(x) for the above latent variable model, variational methods maximize the variational lower bound (also known as the ELBO), after introducing

The gradients of the two objectives in Eq. (2) can be derived as follows[6]:

    g_θ ≜ ∇_θ { −KL[ p̃(x) || p_θ(x) ] } = E_{p̃(x) p_θ(h|x)} [ ∇_θ log p_θ(x, h) ]
                                                                                    (3)
    g_φ ≜ −∇_φ E_{p̃(x)} KL[ p_θ(h|x) || q_φ(h|x) ] = E_{p̃(x) p_θ(h|x)} [ ∇_φ log q_φ(h|x) ]

By the following Proposition 1, Eq. (3) can be recast in the expectation form of Eq. (1).
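As a concrete instance of Algorithm 1, the toy Robbins-Monro run below (a sketch with invented constants, not an algorithm from the paper) finds the root of f(θ) = E_{z∼N(μ,1)}[z − θ], i.e. θ* = μ. Sampling here is exact, so this is the non-Markovian special case of the algorithm.

```python
import random

# Toy Robbins-Monro run: solve f(theta) = E_{z ~ N(mu,1)}[z - theta] = 0,
# whose root is theta* = mu, from noisy measurements F_theta(z) = z - theta.
# Sampling is exact here (the non-Markovian special case of Algorithm 1).

random.seed(0)
mu = 3.0      # unknown to the algorithm; only samples z ~ N(mu, 1) are seen
theta = 0.0   # initialization theta^(0)

for t in range(1, 5001):
    z = random.gauss(mu, 1.0)     # Monte Carlo sampling step
    gamma = 1.0 / t               # learning rates: sum = inf, sum of squares < inf
    theta += gamma * (z - theta)  # SA updating: theta^(t) = theta^(t-1) + gamma_t F(z^(t))

print(theta)  # close to mu = 3.0
```

With γ_t = 1/t this recursion is exactly the running average of the draws. Averaging F over several draws per iteration before updating θ gives the "SA with multiple moves" variant mentioned above.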
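To see the two gradient expectations in Eq. (3) at work, the sketch below builds a deliberately tiny model (invented for illustration: a Bernoulli prior with logit θ, a fixed likelihood table, and an amortized inference model q_φ(h=1|x) = sigmoid(φ_x)). Because the model is tiny, the posterior is available in closed form and is sampled directly; the JSA algorithm of the paper instead samples the posterior with an adaptive MCMC step that uses q_φ as the proposal.

```python
import math
import random

# Tiny latent variable model: h ~ Bernoulli(sigmoid(theta)), with fixed
# likelihood p(x=1|h=1) = 0.9, p(x=1|h=0) = 0.2. Inference model:
# q_phi(h=1|x) = sigmoid(phi[x]). Both updates below are single-sample
# estimates of the expectations in Eq. (3), with h drawn from the posterior.

random.seed(0)
sigmoid = lambda u: 1.0 / (1.0 + math.exp(-u))
LIK = {1: 0.9, 0: 0.2}  # p(x=1 | h)

# Synthetic data from a "true" prior pi* = 0.7.
data = []
for _ in range(20000):
    h_true = 1 if random.random() < 0.7 else 0
    data.append(1 if random.random() < LIK[h_true] else 0)

theta = 0.0             # prior logit, to be learned
phi = {0: 0.0, 1: 0.0}  # inference-model logits, one per observed x

for t, x in enumerate(data, start=1):
    # Exact posterior p_theta(h=1 | x) for this tiny model.
    pi = sigmoid(theta)
    lik1 = LIK[1] if x == 1 else 1.0 - LIK[1]
    lik0 = LIK[0] if x == 1 else 1.0 - LIK[0]
    post1 = pi * lik1 / (pi * lik1 + (1.0 - pi) * lik0)
    h = 1 if random.random() < post1 else 0  # Monte Carlo sampling step

    gamma = 0.5 / math.sqrt(t)
    theta += gamma * (h - sigmoid(theta))    # grad_theta log p_theta(x, h)
    phi[x] += gamma * (h - sigmoid(phi[x]))  # grad_phi  log q_phi(h | x)

print(sigmoid(theta))   # approaches the true prior 0.7
print(sigmoid(phi[1]))  # approaches p_theta(h=1 | x=1)
```

The θ update is stochastic gradient ascent on the marginal log-likelihood (via the Fisher identity), and the φ update drives q_φ toward the posterior under the inclusive divergence, matching the two objectives the paper optimizes jointly.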