Reparameterization Gradients Through Acceptance-Rejection Sampling Algorithms
Christian A. Naesseth†‡   Francisco J. R. Ruiz‡§   Scott W. Linderman‡   David M. Blei‡
†Linköping University   ‡Columbia University   §University of Cambridge

Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS) 2017, Fort Lauderdale, Florida, USA. JMLR: W&CP volume 54. Copyright 2017 by the author(s).

Abstract

Variational inference using the reparameterization trick has enabled large-scale approximate Bayesian inference in complex probabilistic models, leveraging stochastic optimization to sidestep intractable expectations. The reparameterization trick is applicable when we can simulate a random variable by applying a differentiable deterministic function on an auxiliary random variable whose distribution is fixed. For many distributions of interest (such as the gamma or Dirichlet), simulation of random variables relies on acceptance-rejection sampling. The discontinuity introduced by the accept-reject step means that standard reparameterization tricks are not applicable. We propose a new method that lets us leverage reparameterization gradients even when variables are outputs of an acceptance-rejection sampling algorithm. Our approach enables reparameterization on a larger class of variational distributions. In several studies of real and synthetic data, we show that the variance of the estimator of the gradient is significantly lower than other state-of-the-art methods. This leads to faster convergence of stochastic gradient variational inference.

1 Introduction

Variational inference [Hinton and van Camp, 1993, Waterhouse et al., 1996, Jordan et al., 1999] underlies many recent advances in large-scale probabilistic modeling. It has enabled sophisticated modeling of complex domains such as images [Kingma and Welling, 2014] and text [Hoffman et al., 2013]. By definition, the success of variational approaches depends on our ability to (i) formulate a flexible parametric family of distributions, and (ii) optimize the parameters to find the member of this family that most closely approximates the true posterior. These two criteria are at odds: the more flexible the family, the more challenging the optimization problem. In this paper, we present a novel method that enables more efficient optimization for a large class of variational distributions, namely, for distributions that we can efficiently simulate by acceptance-rejection sampling, or rejection sampling for short.

For complex models, the variational parameters can be optimized by stochastic gradient ascent on the evidence lower bound (ELBO), a lower bound on the marginal likelihood of the data. There are two primary means of estimating the gradient of the ELBO: the score function estimator [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014] and the reparameterization trick [Kingma and Welling, 2014, Rezende et al., 2014, Price, 1958, Bonnet, 1964], both of which rely on Monte Carlo sampling. While the reparameterization trick often yields lower-variance estimates and therefore leads to more efficient optimization, this approach has been limited in scope to a few variational families (typically Gaussians). Indeed, some lines of research have already tried to address this limitation [Knowles, 2015, Ruiz et al., 2016].

There are two requirements to apply the reparameterization trick. The first is that the random variable can be obtained through a transformation of a simple random variable, such as a uniform or standard normal; the second is that the transformation be differentiable. In this paper, we observe that all random variables we simulate on our computers are ultimately transformations of uniforms, often followed by accept-reject steps. So if the transformations are differentiable, then we can use these existing simulation algorithms to expand the scope of the reparameterization trick.

Thus, we show how to use existing rejection samplers to develop stochastic gradients of variational parameters. In short, each rejection sampler uses a highly-tuned transformation that is well-suited for its distribution. We can construct new reparameterization gradients by "removing the lid" from these black boxes, applying 65+ years of research on transformations [von Neumann, 1951, Devroye, 1986] to variational inference. We demonstrate that this broadens the scope of variational models amenable to efficient inference and provides lower-variance estimates of the gradient compared to state-of-the-art approaches.

We first review variational inference, with a focus on stochastic gradient methods. We then present our key contribution, rejection sampling variational inference (RSVI), showing how to use efficient rejection samplers to produce low-variance stochastic gradients of the variational objective. We study two concrete examples, analyzing rejection samplers for the gamma and Dirichlet to produce new reparameterization gradients for their corresponding variational factors. Finally, we analyze two datasets with a deep exponential family (DEF) [Ranganath et al., 2015], comparing RSVI to the state of the art. We found that RSVI achieves a significant reduction in variance and faster convergence of the ELBO. Code for all experiments is provided at github.com/blei-lab/ars-reparameterization.
2 Variational Inference

Let p(x, z) be a probabilistic model, i.e., a joint probability distribution of data x and latent (unobserved) variables z. In Bayesian inference, we are interested in the posterior distribution p(z | x) = p(x, z) / p(x). For most models, the posterior distribution is analytically intractable and we have to use an approximation, such as Monte Carlo methods or variational inference. In this paper, we focus on variational inference.

In variational inference, we approximate the posterior with a variational family of distributions q(z; θ), parameterized by θ. Typically, we choose the variational parameters θ that minimize the Kullback-Leibler (KL) divergence between q(z; θ) and p(z | x). This minimization is equivalent to maximizing the ELBO [Jordan et al., 1999], defined as

\mathcal{L}(\theta) = \mathbb{E}_{q(z;\theta)}[f(z)] + H[q(z;\theta)],
f(z) := \log p(x, z),                                                  (1)
H[q(z;\theta)] := \mathbb{E}_{q(z;\theta)}[-\log q(z;\theta)].

When the model and variational family satisfy conjugacy requirements, we can use coordinate ascent to find a local optimum of the ELBO [Blei et al., 2016]. If the conjugacy requirements are not satisfied, a common approach is to build a Monte Carlo estimator of the gradient of the ELBO [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014, Salimans and Knowles, 2013, Kingma and Welling, 2014]. This results in a stochastic optimization procedure, where different Monte Carlo estimators of the gradient amount to different algorithms. We review below two common estimators: the score function estimator and the reparameterization trick.¹

¹ In this paper, we assume for simplicity that the gradient of the entropy \nabla_\theta H[q(z;\theta)] is available analytically. The method that we propose in Section 3 can be easily extended to handle non-analytical entropy terms. Indeed, the resulting estimator of the gradient may have lower variance when the analytic gradient of the entropy is replaced by its Monte Carlo estimate. Here we do not explore that.

Score function estimator. The score function estimator, also known as the log-derivative trick or REINFORCE [Williams, 1992, Glynn, 1990], is a general way to estimate the gradient of the ELBO [Paisley et al., 2012, Ranganath et al., 2014, Mnih and Gregor, 2014]. The score function estimator expresses the gradient as an expectation with respect to q(z; θ):

\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{q(z;\theta)}\left[ f(z) \nabla_\theta \log q(z;\theta) \right] + \nabla_\theta H[q(z;\theta)].

We then form Monte Carlo estimates by approximating the expectation with independent samples from the variational distribution. Though it is very general, the score function estimator typically suffers from high variance. In practice we also need to apply variance reduction techniques such as Rao-Blackwellization [Casella and Robert, 1996] and control variates [Robert and Casella, 2004].
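To make this concrete, the following is a minimal NumPy sketch (not part of the paper or its code release) of the score function estimator for a toy model p(z) = N(0, 1), p(x | z) = N(z, 1) with a Gaussian variational family q(z; θ) = N(μ, σ²), θ = (μ, log σ). The model, the Gaussian family, and all function names are illustrative assumptions; the entropy gradient is added analytically, as in the expression above.

```python
# Illustrative sketch only: Monte Carlo score-function (REINFORCE) estimate of
# the ELBO gradient for a toy model p(z) = N(0, 1), p(x | z) = N(z, 1) with a
# Gaussian variational family q(z; theta) = N(mu, sigma^2), theta = (mu, log_sigma).
# The model, family, and function names are assumptions, not the paper's code.
import numpy as np

def log_joint(x, z):
    """f(z) = log p(x, z) for the toy model above."""
    log_prior = -0.5 * (z ** 2 + np.log(2.0 * np.pi))
    log_lik = -0.5 * ((x - z) ** 2 + np.log(2.0 * np.pi))
    return log_prior + log_lik

def score_function_gradient(x, mu, log_sigma, n_samples=1000, rng=None):
    """Estimate (dL/dmu, dL/dlog_sigma) with the score function estimator;
    the entropy gradient is added analytically, as in the text."""
    rng = np.random.default_rng() if rng is None else rng
    sigma = np.exp(log_sigma)
    z = rng.normal(mu, sigma, size=n_samples)            # z ~ q(z; theta)
    f = log_joint(x, z)                                   # f(z) = log p(x, z)
    score_mu = (z - mu) / sigma ** 2                      # d log q / d mu
    score_log_sigma = (z - mu) ** 2 / sigma ** 2 - 1.0    # d log q / d log_sigma
    grad_mu = np.mean(f * score_mu)                       # E_q[f(z) d log q / d mu]
    grad_log_sigma = np.mean(f * score_log_sigma) + 1.0   # analytic entropy grad: dH/dlog_sigma = 1
    return grad_mu, grad_log_sigma

print(score_function_gradient(x=2.0, mu=0.0, log_sigma=0.0))
```

Averaging over more samples shrinks the estimator's variance only at the usual Monte Carlo rate of 1/N, which is why the variance-reduction techniques mentioned above matter in practice.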
Reparameterization trick. The reparameterization trick [Salimans and Knowles, 2013, Kingma and Welling, 2014, Price, 1958, Bonnet, 1964] results in a lower-variance estimator compared to the score function, but it is not as generally applicable. It requires that: (i) the latent variables z are continuous; and (ii) we can simulate from q(z; θ) as follows,

z = h(\varepsilon; \theta), \quad \text{with} \quad \varepsilon \sim s(\varepsilon).                    (2)

Here, s(ε) is a distribution that does not depend on the variational parameters; it is typically a standard normal or a standard uniform. Further, h(ε; θ) must be differentiable with respect to θ. In statistics, this is known as a non-centered parameterization and has been shown to be helpful in, e.g., Markov chain Monte Carlo methods [Papaspiliopoulos et al., 2003].

Using (2), we can move the derivative inside the expectation and rewrite the gradient of the ELBO as

\nabla_\theta \mathcal{L}(\theta) = \mathbb{E}_{s(\varepsilon)}\left[ \nabla_z f(h(\varepsilon; \theta)) \nabla_\theta h(\varepsilon; \theta) \right] + \nabla_\theta H[q(z;\theta)].

Empirically, the reparameterization trick has been shown to be beneficial over direct Monte Carlo estimation of the gradient using the score function estimator [Salimans and Knowles, 2013, Kingma and Welling, 2014, Titsias and Lázaro-Gredilla, 2014, Fan et al., 2015]. Unfortunately, many distributions commonly used in variational inference, such as the gamma or Dirichlet, are not amenable to standard reparameterization because samples are generated using a rejection sampler [von Neumann, 1951, Robert and Casella, 2004], introducing discontinuities to the mapping.

Algorithm 1 Reparameterized Rejection Sampling
Input: target q(z; θ), proposal r(z; θ), and constant M_θ, with q(z; θ) ≤ M_θ r(z; θ)
Output: ε such that h(ε; θ) ∼ q(z; θ)
1: i ← 0
2: repeat
3:   i ← i + 1
4:   Propose ε_i ∼ s(ε)
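Algorithm 1's input/output contract (return the accepted auxiliary variable ε, so that h(ε; θ) ∼ q(z; θ)) can be sketched in a few lines of Python. The sketch below assumes the standard rejection-sampling acceptance test, u ≤ q(h(ε; θ); θ) / (M_θ r(h(ε; θ); θ)) with u uniform on [0, 1], and takes the densities, the transformation, and the proposal as generic callables; the function names and the half-normal toy at the end are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: a reparameterized rejection sampler that returns
# the accepted auxiliary variable eps (so that h(eps, theta) ~ q(z; theta))
# rather than z itself.  Names and the acceptance rule u <= q / (M * r) are
# assumptions for this sketch, not the paper's reference implementation.
import numpy as np

def reparameterized_rejection_sample(sample_s, h, q, r, M, theta, rng=None):
    """Return eps such that h(eps, theta) is distributed as q(z; theta).

    sample_s -- draws eps ~ s(eps)
    h        -- transformation z = h(eps, theta)
    q, r     -- target and proposal densities over z, with q <= M * r
    M        -- the constant M_theta from Algorithm 1
    """
    rng = np.random.default_rng() if rng is None else rng
    while True:
        eps = sample_s(rng)                      # step 4: propose eps ~ s(eps)
        z = h(eps, theta)                        # map to z-space
        u = rng.uniform()                        # u ~ Uniform[0, 1]
        if u * M * r(z, theta) <= q(z, theta):   # accept with probability q / (M r)
            return eps                           # return eps, not z

# Toy usage (illustrative): half-normal target with a standard-normal proposal,
# M = 2, and h the identity map; theta is unused in this toy example.
norm_pdf = lambda z: np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi)
eps = reparameterized_rejection_sample(
    sample_s=lambda rng: rng.standard_normal(),
    h=lambda e, th: e,
    q=lambda z, th: 2.0 * norm_pdf(z) if z >= 0.0 else 0.0,
    r=lambda z, th: norm_pdf(z),
    M=2.0,
    theta=None,
)
print(eps)
```

The point of returning ε rather than z is that it exposes the differentiable map h(ε; θ), which is what a reparameterization gradient must differentiate through.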