Distilling Importance Sampling
Dennis Prangle¹

¹School of Mathematics, University of Bristol, United Kingdom

arXiv:1910.03632v3 [stat.CO] 10 Sep 2021

Abstract

Many complicated Bayesian posteriors are difficult to approximate by either sampling or optimisation methods. Therefore we propose a novel approach combining features of both. We use a flexible parameterised family of densities, such as a normalising flow. Given a density from this family approximating the posterior, we use importance sampling to produce a weighted sample from a more accurate posterior approximation. This sample is then used in optimisation to update the parameters of the approximate density, which we view as distilling the importance sampling results. We iterate these steps and gradually improve the quality of the posterior approximation. We illustrate our method in two challenging examples: a queueing model and a stochastic differential equation model.

1 INTRODUCTION

Bayesian inference has had great success in recent decades [Green et al., 2015], but remains challenging in models with a complex posterior dependence structure, e.g. those involving latent variables. Monte Carlo methods are one state-of-the-art approach. These produce samples from the posterior distribution. However, in many settings it remains challenging to design good mechanisms to propose plausible samples, despite many advances (e.g. Cappé et al., 2004, Cornuet et al., 2012, Graham and Storkey, 2017, Whitaker et al., 2017).

We focus on one simple Monte Carlo method: importance sampling (IS). This weights draws from a proposal distribution so that the weighted sample can be viewed as representing a target distribution, such as the posterior. IS can be used in almost any setting, including in the presence of strong posterior dependence or discrete random variables. However, it only achieves a representative weighted sample at a feasible cost if the proposal is a reasonable approximation to the target distribution.

An alternative to Monte Carlo is to use optimisation to find the best approximation to the posterior from a family of distributions. Typically this is done in the framework of variational inference (VI). VI is computationally efficient, but has the drawback that it often produces poor approximations to the posterior distribution, e.g. through over-concentration [Turner et al., 2008, Yao et al., 2018].

A recent improvement in VI is due to the development of a range of flexible and computationally tractable distributional families using normalising flows [Dinh et al., 2016, Papamakarios et al., 2019a]. These transform a simple base random distribution to a complex distribution, using a sequence of learnable transformations.

We propose an alternative to variational inference for training the parameters of an approximate posterior density, typically a normalising flow, which we call the distilled density. This alternates two steps. The first is importance sampling, with the current distilled density as the proposal. The target distribution is an approximate posterior, based on tempering, which is an improvement on the proposal. The second step is to use the resulting weighted sample to train the distilled density further. Following Li et al. [2017], we refer to this as distilling the importance sampling results. By iteratively distilling IS results, we can target increasingly accurate posterior approximations, i.e. reduce the tempering.

Each step of our distilled importance sampling (DIS) method aims to reduce the Kullback–Leibler (KL) divergence of the distilled density from the current tempered posterior. This is known as the inclusive KL divergence, as minimising it tends to produce a density which is over-dispersed compared to the tempered posterior.
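The alternating scheme just described can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: it uses numpy, a one-dimensional toy target, and a Gaussian approximating family (whose weighted maximum-likelihood update, which minimises the inclusive KL, is available in closed form) in place of a normalising flow; tempering is omitted, and all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_p_tilde(x):
    # Unnormalised log target: a toy 1-d "posterior", N(3, 0.5^2).
    return -0.5 * ((x - 3.0) / 0.5) ** 2

# Distilled density: a Gaussian with trainable mean and std,
# standing in for a normalising flow. Start with a broad proposal.
mu, sigma = 0.0, 5.0

for step in range(50):
    # Step 1: importance sampling with the current distilled density
    # as proposal. Constants in log q cancel under self-normalisation.
    x = rng.normal(mu, sigma, size=2000)
    log_q = -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma)
    log_w = log_p_tilde(x) - log_q
    w = np.exp(log_w - log_w.max())
    w /= w.sum()  # self-normalised importance weights

    # Step 2: distill, i.e. train the density on the weighted sample.
    # For a Gaussian, weighted maximum likelihood is just weighted moments.
    mu = np.sum(w * x)
    sigma = max(np.sqrt(np.sum(w * (x - mu) ** 2)), 1e-3)

# (mu, sigma) is typically close to the target's (3.0, 0.5).
```

Each iteration improves the proposal, which in turn improves the weighted sample used for the next update; in the full method the target would also be sharpened between iterations by reducing the tempering.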
Such a distribution is well suited to be an IS proposal distribution.¹ Variational inference, on the other hand, uses the exclusive KL divergence, which tends to produce over-concentrated distributions which cannot easily be corrected using importance sampling.

¹This work was completed while the author was employed at Newcastle University, UK.

In the remainder of the paper, Section 2 presents background material. Sections 3 and 4 describe our method. Section 5 illustrates it on a simple two-dimensional inference task. Sections 6 and 7 give more challenging examples: queueing and time series models. Section 8 concludes with a discussion, including limitations and opportunities for future improvements. Code for the examples can be found at https://github.com/dennisprangle/DistillingImportanceSampling. All examples were run on a 6-core desktop PC.

1.1 RELATED WORK AND NOVELTY

Several recent papers [Müller et al., 2019, Cotter et al., 2020, Duan, 2019] learn a density defined via a transformation to use as an importance sampling proposal. Since the first version of our paper, more work has also looked at optimising the inclusive KL divergence. Dhaka et al. [2021] show that a naive implementation performs poorly. Naesseth et al. [2020] use conditional importance sampling to give convergence guarantees. Jerfel et al. [2021] propose a boosting approach in which a Gaussian mixture density is improved by sequentially adding mixture components. In comparison to the above, a novelty of our work is using a sequential approach based on tempering, and an application to likelihood-free inference.

A related approach is to distill Markov chain Monte Carlo output, but this turns out to be more difficult than for IS. One reason is that optimising the KL divergence typically requires unbiased estimates of it or related quantities (e.g. its gradient), but MCMC only provides unbiased estimates asymptotically. Li et al. [2017] and Parno and Marzouk [2018] proceed by using biased estimates, while Ruiz and Titsias [2019] introduce an alternative, more tractable divergence. However IS, as we shall see, can produce unbiased estimates of the required KL gradient.

Approximate Bayesian computation (ABC) methods [Marin et al., 2012, Del Moral et al., 2012] involve simulating datasets under various parameters to find those which produce close matches to the observations. However, close matches are rare unless the observations are low dimensional. Hence ABC typically uses dimension reduction of the observations through summary statistics, which reduces inference accuracy. Our method can instead learn a joint proposal distribution for the parameters and all the random variables used in simulating a dataset (see Section 6 for details). Hence it can control the simulation process to frequently output data similar to the full observations.

Conditional density estimation methods (see e.g. Le et al., 2017, Papamakarios et al., 2019b, Grazian and Fan, 2019) fit a joint distribution to parameters and data from simulations. Then one can condition on the observed data to approximate its posterior distribution. These methods also sometimes require dimension reduction, and can perform poorly when the observed data is unlike the simulations. Our approach avoids these difficulties by directly finding parameters which can reproduce the full observations.

More broadly, DIS has connections to several inference methods. Concentrating on its IS component, it is closely related to adaptive importance sampling [Cornuet et al., 2012] and sequential Monte Carlo (SMC) [Del Moral et al., 2006]. Concentrating on training an approximate density, it can be seen as a version of the cross-entropy method [Rubinstein, 1999], an estimation of distribution algorithm [Larrañaga and Lozano, 2002], or reweighted wake-sleep [Bornschein and Bengio, 2014].

2 BACKGROUND

2.1 BAYESIAN FRAMEWORK

We observe data y, assumed to be the output of a probability model p(y|θ) under some parameters θ. Given prior density p(θ), we aim to find the corresponding posterior p(θ|y).

Many probability models involve latent variables x, so that p(y|θ) = ∫ p(y|θ,x) p(x|θ) dx. To avoid computing this integral we'll attempt to infer the joint posterior p(θ,x|y), and marginalise to get p(θ|y). For convenience we introduce ξ = (θ,x) to represent the collection of parameters and latent variables. For models without latent variables, ξ = θ.

We now wish to infer p(ξ|y). Typically we can only evaluate an unnormalised version,

    p̃(ξ|y) = p(y|θ,x) p(x|θ) p(θ).

Then p(ξ|y) = p̃(ξ|y)/Z, where Z = ∫ p̃(ξ|y) dξ is an intractable normalising constant.

2.2 TEMPERING

We'll use a tempered target density p_ε(ξ) such that p_0 is the posterior and ε > 0 gives an approximation. As for the posterior, we can often only evaluate an unnormalised version p̃_ε(ξ). Then p_ε(ξ) = p̃_ε(ξ)/Z_ε, where Z_ε = ∫ p̃_ε(ξ) dξ. We use various tempering schemes later in the paper. See the supplement (Section H) for a summary.

2.3 IMPORTANCE SAMPLING

Let p(ξ) be a target density, such as a tempered posterior, where p(ξ) = p̃(ξ)/Z and only p̃(ξ) can be evaluated. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form

    I = E_{ξ∼p}[h(ξ)],

for some function h. Here we give an overview of relevant aspects. For full details see e.g. Robert and Casella [2013] and Rubinstein and Kroese [2016].

IS requires a proposal density λ(ξ) which can easily be sampled from, and must satisfy

    supp(p) ⊆ supp(λ),   (1)

where supp denotes support.

2.4 NORMALISING FLOWS

We focus on real NVP ("non-volume preserving") flows [Dinh et al., 2016]; other flows with similar properties, such as Durkan et al. [2019], could also be used. These compose several transformations of the base random variable z. One type is a coupling layer, which transforms input vector u to output vector v, both of dimension D, by

    v_{1:d} = u_{1:d},
    v_{d+1:D} = μ + exp(σ) ⊙ u_{d+1:D},
    μ = f_μ(u_{1:d}),   σ = f_σ(u_{1:d}),

where ⊙ and exp denote elementwise multiplication and exponentiation. Here the first d elements of u are copied un-
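The coupling layer equations above are straightforward to implement directly. The sketch below, in numpy, follows those equations; `f_mu` and `f_sigma` are toy stand-ins for the neural networks used in practice, and the log-Jacobian line is a standard property of this affine transformation rather than something stated above.

```python
import numpy as np

def coupling_layer(u, d, f_mu, f_sigma):
    """Real NVP coupling layer: copy u[:d] unchanged, and affinely
    transform u[d:] using a shift mu and log-scale sigma computed
    from the copied part u[:d]."""
    mu = f_mu(u[:d])
    sigma = f_sigma(u[:d])
    v = u.copy()
    v[d:] = mu + np.exp(sigma) * u[d:]
    # The Jacobian is triangular with diagonal (1, ..., 1, exp(sigma)),
    # so log |det J| = sum(sigma), which is cheap to evaluate.
    log_det_jac = np.sum(sigma)
    return v, log_det_jac

# Toy example with D = 4, d = 2; in practice f_mu and f_sigma
# would be learnable neural networks.
f_mu = lambda a: np.array([a.sum(), a.mean()])
f_sigma = lambda a: 0.1 * np.array([a[0], a[1]])
u = np.array([1.0, 2.0, 0.5, -0.5])
v, ldj = coupling_layer(u, 2, f_mu, f_sigma)
```

Because v_{1:d} = u_{1:d}, the layer is trivially invertible, u_{d+1:D} = (v_{d+1:D} − μ) ⊙ exp(−σ), no matter how complicated f_μ and f_σ are; composed layers typically alternate which elements are copied so that every coordinate is eventually transformed.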