Distilling Importance Sampling

Dennis Prangle1

1School of Mathematics, University of Bristol, United Kingdom (this work was completed while the author was employed at Newcastle University, UK)

Abstract

Many complicated Bayesian posteriors are difficult to approximate by either sampling or optimisation methods. Therefore we propose a novel approach combining features of both. We use a flexible parameterised family of densities, such as a normalising flow. Given a density from this family approximating the posterior, we use importance sampling to produce a weighted sample from a more accurate posterior approximation. This sample is then used in optimisation to update the parameters of the approximate density, which we view as distilling the importance sampling results. We iterate these steps and gradually improve the quality of the posterior approximation. We illustrate our method in two challenging examples: a queueing model and a stochastic differential equation model.

1 INTRODUCTION

Bayesian inference has had great success in recent decades [Green et al., 2015], but remains challenging in models with a complex posterior dependence structure, e.g. those involving latent variables. Monte Carlo methods are one state-of-the-art approach. These produce samples from the posterior distribution. However in many settings it remains challenging to design good mechanisms to propose plausible samples, despite many advances (e.g. Cappé et al., 2004, Cornuet et al., 2012, Graham and Storkey, 2017, Whitaker et al., 2017).

We focus on one simple method: importance sampling (IS). This weights draws from a proposal distribution so that the weighted sample can be viewed as representing a target distribution, such as the posterior. IS can be used in almost any setting, including in the presence of strong posterior dependence or discrete random variables. However it only achieves a representative weighted sample at a feasible cost if the proposal is a reasonable approximation to the target distribution.

An alternative to Monte Carlo is to use optimisation to find the best approximation to the posterior from a family of distributions. Typically this is done in the framework of variational inference (VI). VI is computationally efficient but has the drawback that it often produces poor approximations to the posterior distribution, e.g. through over-concentration [Turner et al., 2008, Yao et al., 2018].

A recent improvement in VI is due to the development of a range of flexible and computationally tractable distributional families using normalising flows [Dinh et al., 2016, Papamakarios et al., 2019a]. These transform a simple base distribution to a complex distribution, using a sequence of learnable transformations.

We propose an alternative to variational inference for training the parameters of an approximate posterior density, typically a normalising flow, which we call the distilled density. This alternates two steps. The first is importance sampling, with the current distilled density as the proposal. The target distribution is an approximate posterior, based on tempering, which is an improvement on the proposal. The second step is to use the resulting weighted sample to train the distilled density further. Following Li et al. [2017], we refer to this as distilling the importance sampling results. By iteratively distilling IS results, we can target increasingly accurate posterior approximations, i.e. reduce the tempering.

Each step of our distilled importance sampling (DIS) method aims to reduce the Kullback-Leibler (KL) divergence of the distilled density from the current tempered posterior. This is known as the inclusive KL divergence, as minimising it tends to produce a density which is over-dispersed compared to the tempered posterior. Such a distribution is well suited to be an IS proposal distribution. Variational inference, on the other hand, uses the exclusive KL divergence, which tends to produce over-concentrated distributions that cannot easily be corrected using importance sampling.

In the remainder of the paper, Section 2 presents background material. Sections 3 and 4 describe our method. Section 5 illustrates it on a simple two dimensional inference task. Sections 6 and 7 give more challenging examples: queueing and time series models. Section 8 concludes with a discussion, including limitations and opportunities for future improvements. Code for the examples can be found at https://github.com/dennisprangle/DistillingImportanceSampling. All examples were run on a 6-core desktop PC.

1.1 RELATED WORK AND NOVELTY

Several recent papers [Müller et al., 2019, Cotter et al., 2020, Duan, 2019] learn a density defined via a transformation to use as an importance sampling proposal. Since the first version of our paper, more work has also looked at optimising the inclusive KL divergence. Dhaka et al. [2021] show that a naive implementation performs poorly. Naesseth et al. [2020] use conditional importance sampling to give convergence guarantees. Jerfel et al. [2021] propose a boosting approach in which a Gaussian mixture density is improved by sequentially adding mixture components. In comparison to the above, a novelty of our work is using a sequential approach based on tempering, and an application to likelihood-free inference.

A related approach is to distill Markov chain Monte Carlo output, but this turns out to be more difficult than for IS. One reason is that optimising the KL divergence typically requires unbiased estimates of it or related quantities (e.g. its gradient), but MCMC only provides unbiased estimates asymptotically. Li et al. [2017] and Parno and Marzouk [2018] proceed by using biased estimates, while Ruiz and Titsias [2019] introduce an alternative, more tractable divergence. However IS, as we shall see, can produce unbiased estimates of the required KL gradient.

Approximate Bayesian computation (ABC) methods [Marin et al., 2012, Del Moral et al., 2012] involve simulating datasets under various parameters to find those which produce close matches to the observations. However, close matches are rare unless the observations are low dimensional. Hence ABC typically uses dimension reduction of the observations through summary statistics, which reduces inference accuracy. Our method can instead learn a joint proposal distribution for the parameters and all the random variables used in simulating a dataset (see Section 6 for details). Hence it can control the simulation process to frequently output data similar to the full observations.

Conditional density estimation methods (see e.g. Le et al., 2017, Papamakarios et al., 2019b, Grazian and Fan, 2019) fit a joint distribution to parameters and data from simulations. Then one can condition on the observed data to approximate its posterior distribution. These methods also sometimes require dimension reduction, and can perform poorly when the observed data is unlike the simulations. Our approach avoids these difficulties by directly finding parameters which can reproduce the full observations.

More broadly, DIS has connections to several inference methods. Concentrating on its IS component, it is closely related to adaptive importance sampling [Cornuet et al., 2012] and sequential Monte Carlo (SMC) [Del Moral et al., 2006]. Concentrating on training an approximate density, it can be seen as a version of the cross-entropy method [Rubinstein, 1999], an estimation of distribution algorithm [Larrañaga and Lozano, 2002], or reweighted wake-sleep [Bornschein and Bengio, 2014].

2 BACKGROUND

2.1 BAYESIAN FRAMEWORK

We observe data y, assumed to be the output of a probability model p(y|θ) under some parameters θ. Given a prior density π(θ) we aim to find the corresponding posterior p(θ|y).

Many probability models involve latent variables x, so that p(y|θ) = ∫ p(y|θ,x) p(x|θ) dx. To avoid computing this integral we'll attempt to infer the joint posterior p(θ,x|y), and marginalise to get p(θ|y). For convenience we introduce ξ = (θ,x) to represent the collection of parameters and latent variables. For models without latent variables ξ = θ.

We now wish to infer p(ξ|y). Typically we can only evaluate an unnormalised version,

p̃(ξ|y) = p(y|θ,x) p(x|θ) π(θ).

Then p(ξ|y) = p̃(ξ|y)/Z where Z = ∫ p̃(ξ|y) dξ is an intractable normalising constant.

2.2 TEMPERING

We'll use a tempered target density pε(ξ) such that p0 is the posterior and ε > 0 gives an approximation. As for the posterior, we can often only evaluate an unnormalised version p̃ε(ξ). Then pε(ξ) = p̃ε(ξ)/Zε where Zε = ∫ p̃ε(ξ) dξ. We use various tempering schemes later in the paper. See the supplement (Section H) for a summary.

2.3 IMPORTANCE SAMPLING

Let p(ξ) be a target density, such as a tempered posterior, where p(ξ) = p̃(ξ)/Z and only p̃(ξ) can be evaluated. Importance sampling (IS) is a Monte Carlo method to estimate expectations of the form

I = E_{ξ∼p}[h(ξ)],

for some function h. Here we give an overview of relevant aspects. For full details see e.g. Robert and Casella [2013] and Rubinstein and Kroese [2016].

IS requires a proposal density λ(ξ) which can easily be sampled from, and must satisfy

supp(p) ⊆ supp(λ),   (1)

where supp denotes support. Then

I = E_{ξ∼λ}[ (p(ξ)/λ(ξ)) h(ξ) ].   (2)

So an unbiased Monte Carlo estimate of I is

Î1 = (1/(NZ)) ∑_{i=1}^N w_i h(ξ^(i)),   (3)

where ξ^(1), ξ^(2), ..., ξ^(N) are independent samples from λ, and w_i = p̃(ξ^(i))/λ(ξ^(i)) is an importance weight. Typically Z is estimated as (1/N) ∑_{i=1}^N w_i, giving

Î2 = ∑_{i=1}^N w_i h(ξ^(i)) / ∑_{i=1}^N w_i,   (4)

a biased, but consistent, estimate of I. Equivalently

Î2 = ∑_{i=1}^N s_i h(ξ^(i)),

for normalised importance weights s_i = w_i / ∑_{i=1}^N w_i.

A drawback of IS is that it can produce estimates with large, or infinite, variance if λ is a poor approximation to p. Hence diagnostics for the quality of the results are useful. A popular diagnostic is the effective sample size (ESS),

N_ESS = (∑_{i=1}^N w_i)² / ∑_{i=1}^N w_i².   (5)

For most functions h, Var(Î2) roughly equals the variance of an idealised Monte Carlo estimate based on N_ESS independent samples from p(ξ) [Liu, 1996].
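To make the estimator and diagnostic concrete, the following is a minimal numpy sketch of self-normalised importance sampling with the ESS diagnostic (not the paper's actual implementation); log_p_tilde, sample_q, log_q and h are hypothetical placeholders for the unnormalised target, proposal sampler, proposal log density and integrand.

```python
import numpy as np

def snis_estimate(log_p_tilde, sample_q, log_q, h, N, rng):
    """Self-normalised importance sampling (4) with ESS diagnostic (5)."""
    xi = sample_q(N, rng)                    # proposal draws, shape (N, dim)
    log_w = log_p_tilde(xi) - log_q(xi)      # log importance weights
    w = np.exp(log_w - log_w.max())          # rescaled for numerical stability
    s = w / w.sum()                          # normalised weights s_i
    estimate = np.sum(s * h(xi))             # biased but consistent estimate (4)
    ess = w.sum() ** 2 / np.sum(w ** 2)      # effective sample size (5)
    return estimate, ess
```

Note the ESS is invariant to rescaling the weights, so the stabilising subtraction of the maximum log weight does not change it.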

2.4 NORMALISING FLOWS

A normalising flow represents a random vector ξ with a complicated distribution as an invertible transformation of a random vector z with a simple base distribution, typically N(0,I).

Recent research has developed flexible learnable families of normalising flows. See Papamakarios et al. [2019a] for a review. We focus on real NVP ("non-volume preserving") flows [Dinh et al., 2016] (other flows with similar properties, such as Durkan et al., 2019, could also be used). These compose several transformations of z. One type is a coupling layer, which transforms input vector u to output vector v, both of dimension D, by

v_{1:d} = u_{1:d},   v_{d+1:D} = µ + exp(σ) ⊙ u_{d+1:D},
µ = f_µ(u_{1:d}),   σ = f_σ(u_{1:d}),

where ⊙ and exp are elementwise multiplication and exponentiation. Here the first d elements of u are copied unchanged. We typically take d = ⌊D/2⌋. The other elements are scaled by the vector exp(σ) then shifted by the vector µ, where µ and σ are functions of u_{1:d}. This transformation is invertible, and allows quick computation of the density of v from that of u, as the Jacobian determinant is simply ∏_{i=d+1}^D exp(σ_i). Coupling layers are alternated with permutations so that different variables are copied in successive coupling layers. Real NVP typically uses order-reversing or random permutations.

The functions f_µ and f_σ are neural network outputs. Each coupling layer has its own neural network. The collection of all weights and biases, φ, can be trained for particular tasks, e.g. density estimation of images. Permutations are fixed in advance and not learnable.

Real NVP produces a flexible family of densities q(ξ;φ) with two useful properties for this paper. Firstly, samples can be drawn rapidly. Secondly, it is reasonably fast to compute ∇_φ log q(ξ;φ) for any ξ.
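As an illustration, here is a minimal numpy sketch of a single coupling layer and its log Jacobian determinant; f_mu and f_sigma stand in for the layer's neural networks, and this is a sketch of the transformation above rather than a full flow implementation.

```python
import numpy as np

def coupling_layer(u, f_mu, f_sigma):
    """Real NVP coupling layer: copy u[:d], affinely transform u[d:]."""
    D = u.shape[-1]
    d = D // 2
    mu, sigma = f_mu(u[..., :d]), f_sigma(u[..., :d])
    v = np.concatenate([u[..., :d], mu + np.exp(sigma) * u[..., d:]], axis=-1)
    log_det_jacobian = np.sum(sigma, axis=-1)   # log of prod_i exp(sigma_i)
    return v, log_det_jacobian
```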

3 OBJECTIVE AND GRADIENT

Given an approximate family of densities q(ξ;φ), such as normalising flows, this section introduces objective functions to judge how well q approximates a tempered target pε. It then discusses how to estimate the gradient of this objective with respect to φ. Section 4 presents our algorithm using these gradients to update φ while also reducing ε.

3.1 OBJECTIVE

Given pε, we aim to minimise the inclusive Kullback-Leibler (KL) divergence,

KL(pε || q) = E_{ξ∼pε}[log pε(ξ) − log q(ξ;φ)].

This is equivalent to maximising a scaled negative cross-entropy, which we use as our objective,

Jε(φ) = Zε E_{ξ∼pε}[log q(ξ;φ)].

(We scale by Zε to avoid this intractable constant appearing in our gradient estimates below.)

The inclusive KL divergence penalises φ values which produce small q(ξ;φ) when pε(ξ) is large. Hence the optimal φ tends to make q(ξ;φ) non-negligible where pε(ξ) is non-negligible, known as the zero-avoiding property. This is an intuitively attractive feature for importance sampling proposal distributions. Indeed recent theoretical work shows that, under some conditions, the sample size required in importance sampling scales exponentially with the inclusive KL divergence [Chatterjee and Diaconis, 2018].

Our work could be adapted to use the χ² divergence [Dieng et al., 2017, Müller et al., 2019], which also has theoretical links to the sample size needed by IS [Agapiou et al., 2017].

3.2 BASIC GRADIENT ESTIMATE

Assuming standard regularity conditions [Mohamed et al., 2020, Section 4.3.1], the objective has gradient

∇Jε(φ) = Zε E_{ξ∼pε}[∇ log q(ξ;φ)].

Using (2), an importance sampling form is

∇Jε(φ) = E_{ξ∼λ}[ (p̃ε(ξ)/λ(ξ)) ∇ log q(ξ;φ) ],

where λ(ξ) is a proposal density. We will take λ(ξ) = q(ξ;φ*) for some φ*. (In our main algorithm, φ* will be the output of a previous optimisation step.) Note we use choices of q with full support, so (1) is satisfied.

An unbiased Monte Carlo gradient estimate is

g1 = (1/N) ∑_{i=1}^N w_i ∇ log q(ξ^(i);φ),   (6)

where ξ^(i) ∼ λ(ξ) are independent samples and w_i = p̃ε(ξ^(i))/λ(ξ^(i)) are importance sampling weights.

We calculate ∇ log q(ξ^(i);φ) by backpropagation. Note we backpropagate with respect to φ, but not φ*, which is treated as a constant. Hence the ξ^(i) values are themselves constant.
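A sketch of (6) under the same hypothetical placeholder conventions as before; grad_log_q is a stand-in for the backpropagation step and is assumed to return an (N, dim(φ)) array of per-sample gradients.

```python
import numpy as np

def gradient_estimate_g1(xi, log_p_tilde_eps, log_q_star, grad_log_q):
    """Unbiased gradient estimate (6); xi are draws from q(.; phi*)."""
    N = xi.shape[0]
    w = np.exp(log_p_tilde_eps(xi) - log_q_star(xi))   # importance weights
    return (w[:, None] * grad_log_q(xi)).sum(axis=0) / N
```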

3.3 IMPROVED GRADIENT ESTIMATES

Here we discuss reducing the variance and cost of g1.

Clipping Weights   To avoid high variance gradient estimates we apply truncated importance sampling [Ionides, 2008]. This clips the weights at a maximum value ω, producing truncated importance weights w̃_i = min(w_i, ω). The resulting gradient estimate is

g2 = (1/N) ∑_{i=1}^N w̃_i ∇ log q(ξ^(i);φ).

This typically has lower variance than g1, but has some bias. See the supplement (Section A) for more details and discussion, including how we choose ω automatically.

A more sophisticated alternative method is to Pareto smooth the largest importance weights [Yao et al., 2018]. We did not use this as it is more expensive to implement than clipping, but it would be interesting to investigate in future work.

Resampling   Calculating g2 requires evaluating ∇ log q(ξ^(i);φ) for 1 ≤ i ≤ N. Each of these has a computational cost, but often many receive small weights and so contribute little to g2.

To reduce this cost we can discard many low weight samples, by using importance resampling [Smith and Gelfand, 1992] as follows. We sample n ≪ N times, with replacement, from the ξ^(i)s with probabilities s̃_j = w̃_j/S where S = ∑_{i=1}^N w̃_i. Denote the resulting samples as ξ̃^(j). The following is then an unbiased estimate of g2,

g3 = (S/(nN)) ∑_{j=1}^n ∇ log q(ξ̃^(j);φ).   (7)
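The clipping and resampling steps combine into a short sketch of (7), again with a hypothetical grad_log_q placeholder for the backpropagation step.

```python
import numpy as np

def gradient_estimate_g3(xi, w, omega, n, grad_log_q, rng):
    """Truncate weights at omega, resample n points, form estimate (7)."""
    w_tilde = np.minimum(w, omega)                    # clipped weights
    S = w_tilde.sum()
    idx = rng.choice(len(w), size=n, p=w_tilde / S)   # importance resampling
    return (S / (n * len(w))) * grad_log_q(xi[idx]).sum(axis=0)
```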

4 ALGORITHM

Our approach to inference is as follows. Given a current distilled density approximating the posterior, q(ξ;φ_t), we use this as λ(ξ) in (7) to produce a gradient estimate. (In the notation of Section 3.2, we take φ* = φ_t.) We then update φ_t to φ_{t+1} by stochastic gradient ascent, aiming to increase Jε. As t increases, we also reduce the tempering in our target density pε(ξ) by reducing ε, slowly enough to avoid high variance gradient estimates.

Algorithm 1 Distilled importance sampling (DIS)
1: Input: importance sampling size N, target ESS M, batch size n, initial tempering parameter ε0
2: Initialise φ0 (followed by pretraining if necessary).
3: for t = 1, 2, ... do
4:   Sample (ξ_i)_{1≤i≤N} from q(ξ;φ_{t−1}).
5:   Select a new tempering parameter ε_t ≤ ε_{t−1} (see Section 4.2 for details).
6:   Calculate weights w_i = p̃_{ε_t}(ξ^(i)) / q(ξ^(i);φ_{t−1}) and truncate to w̃_is (see the supplement, Section A, for details).
7:   for j = 1, 2, ..., B do
8:     Resample (ξ̃^(j))_{1≤j≤n} from (ξ^(i))_{1≤i≤N} using normalised w̃_is as probabilities, with replacement.
9:     Calculate gradient estimate g3 using (7).
10:    Update φ using stochastic gradient optimisation. We use the Adam algorithm.
11:  end for
12: end for

Algorithm 1 gives our implementation of this approach. The remainder of the section discusses various details of it. See the supplement (Section H) for a summary of tuning choices.

We fix the training batch size to n = 100, and the number of batches B to M/n. So steps 4–6 perform importance sampling with a target ESS of M. Then steps 7–11 use M of its outputs (sampled with replacement) for training. The idea is to avoid overfitting by too much reuse of the same training data.

In our experiments later, we run the algorithm until ε = 0 is reached or for a fixed runtime. Alternatively, the termination decision could be based on approximate inference diagnostics (e.g. Yao et al., 2018, Huggins et al., 2020).

We investigate other tuning choices for the algorithm in Section 6. For now note that N must be reasonably large, since our method to update ε_t relies on making an accurate ESS estimate, as detailed in Section 4.2.
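The outer loop can be summarised in the following schematic sketch under stated assumptions: a hypothetical flow object exposing sample, log_prob and an Adam-based update method, plus select_epsilon and truncate_weights helpers as sketched in Sections 3.3 and 4.2. None of these names come from the paper's released code.

```python
import numpy as np

def distilled_importance_sampling(flow, log_p_tilde, N, M, n, eps0, rng,
                                  max_iters=1000):
    """Schematic of Algorithm 1; flow.update is assumed to take one
    stochastic gradient step increasing the weighted mean of log q."""
    eps = eps0
    for t in range(max_iters):
        xi = flow.sample(N)                                    # step 4
        eps = select_epsilon(xi, flow.log_prob(xi),
                             log_p_tilde, eps, M)              # step 5
        w = np.exp(log_p_tilde(xi, eps) - flow.log_prob(xi))
        w_tilde = truncate_weights(w)                          # step 6
        for _ in range(M // n):                                # B = M/n batches
            idx = rng.choice(N, size=n, p=w_tilde / w_tilde.sum())
            flow.update(xi[idx])                               # steps 8-10
        if eps == 0.0:
            break
    return flow
```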

4.1 INITIALISATION AND PRETRAINING

The initial q should be similar to the initial target p_{ε0}. Otherwise the first gradient estimates produced by importance sampling are likely to be high variance. This can sometimes be achieved by initialising φ to give q a particular distribution, and designing our tempering scheme to have a similar initial target. See the supplement (Section B) for details, and Sections 5–7 for examples.

Pretraining can also be used to improve the match between the initial q and p_{ε0} when it is possible to sample from the latter. This iterates the following steps:

1. Sample (ξ^(i))_{1≤i≤n} from p_{ε0}(ξ).
2. Update φ using gradient (1/n) ∑_{i=1}^n ∇ log q(ξ^(i);φ).

This maximises the negative cross-entropy E_{ξ∼p_{ε0}}[log q(ξ;φ)]. We use n = 100, and terminate once q(ξ;φ) achieves a reasonable ESS (e.g. half the actual sample size) when targeting p_{ε0} in importance sampling.
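A minimal sketch of this pretraining loop, assuming the same hypothetical flow object as above and a hypothetical ess_check helper that computes the ESS when targeting p_{ε0}:

```python
def pretrain(flow, sample_initial_target, ess_check, n=100, max_iters=10000):
    """Hypothetical pretraining loop: take gradient steps on the mean of
    log q over draws from the initial target, stopping once importance
    sampling against p_{eps_0} reaches a reasonable ESS."""
    for _ in range(max_iters):
        xi = sample_initial_target(n)   # step 1
        flow.update(xi)                 # step 2: maximise mean log q
        if ess_check(flow) > n / 2:     # e.g. half the sample size
            break
```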

4.2 SELECTING ε_t

We select ε_t using effective sample size, as in Del Moral et al. [2012]. Given (ξ_i)_{1≤i≤N} sampled from q(ξ;φ_{t−1}), the ESS value for target pε(ξ) is

N_ESS(ε) = [∑_{i=1}^N w(ξ_i, ε)]² / ∑_{i=1}^N w(ξ_i, ε)²,   where   w(ξ, ε) = p̃ε(ξ) / q(ξ;φ_{t−1}).

In step 5 of Algorithm 1 we first check whether N_ESS(ε_{t−1}) < M, a target ESS value. If so we set ε_t = ε_{t−1}. Otherwise we set ε_t to an estimate of the minimal ε such that N_ESS(ε) ≥ M, computed by a bisection algorithm.
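This selection rule is easy to sketch, since N_ESS(ε) typically increases with ε; the following assumes the placeholder conventions used earlier.

```python
import numpy as np

def select_epsilon(xi, log_q_xi, log_p_tilde, eps_prev, M, tol=1e-6):
    """Step 5 of Algorithm 1: keep eps_prev if its ESS is already below M,
    otherwise bisect for (roughly) the smallest eps with ESS(eps) >= M."""
    def ess(eps):
        log_w = log_p_tilde(xi, eps) - log_q_xi
        w = np.exp(log_w - log_w.max())
        return w.sum() ** 2 / np.sum(w ** 2)
    if ess(eps_prev) < M:
        return eps_prev
    lo, hi = 0.0, eps_prev
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        lo, hi = (lo, mid) if ess(mid) >= M else (mid, hi)
    return hi
```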

5 EXAMPLE: SINUSOIDAL DISTRIBUTION

As a simple illustration, consider θ1 ∼ U(−π,π), θ2|θ1 ∼ N(sin(θ1), 1/200), giving unnormalised target density

p̃(θ) = exp(−100[θ2 − sin(θ1)]²) 1[|θ1| < π],

where 1 is an indicator function. (Note earlier sections infer ξ = (θ,x), where x are latent variables. This example has no latent variables, so we simply infer θ.)

We use the unnormalised tempered target

p̃ε(θ) = p1(θ)^ε p̃(θ)^{1−ε},   (8)

initialising ε at 1 and reducing it to 0 during the algorithm.

As initial distribution we take θ1 and θ2 to have independent N(0, σ0²) distributions. We use σ0 = 2 to give a reasonable match to the standard deviation of θ1 under the target. Hence

p1(θ) = (1/(2πσ0²)) exp(−(θ1² + θ2²)/(2σ0²)).

We use real NVP for q(θ;φ), with 4 coupling layers, alternated with permutation layers swapping θ1 and θ2. Each coupling layer uses a neural network with 3 hidden layers of 10 hidden units each and ELU activation. We initialise q close to a N(0,I) distribution, as in the supplement (Section B), then pretrain so q approximates p1.

We use Algorithm 1 with N = 4000 training samples and a target ESS of M = 2000. These values give a clear visual illustration: we investigate efficient tuning choices later.

Figure 1 shows our results. The distilled density quickly adapts to meet the importance sampling results, and ε = 0 is reached by 90 iterations. This took roughly 1.5 minutes.
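The tempered target (8) is straightforward to evaluate in log space; a minimal sketch, assuming θ is an array whose last axis holds (θ1, θ2):

```python
import numpy as np

def log_p_tilde_eps(theta, eps, sigma0=2.0):
    """Log of the tempered sinusoidal target (8)."""
    th1, th2 = theta[..., 0], theta[..., 1]
    log_p1 = (-(th1**2 + th2**2) / (2 * sigma0**2)
              - np.log(2 * np.pi * sigma0**2))          # initial target p1
    log_pt = np.where(np.abs(th1) < np.pi,
                      -100.0 * (th2 - np.sin(th1))**2,
                      -np.inf)                          # final target
    return eps * log_p1 + (1 - eps) * log_pt
```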

[Figure 1: six panels showing samples at iterations 15, 30, 45, 60, 75 and 90, with ε = 0.956, 0.937, 0.877, 0.812, 0.004 and 0.000 respectively. Caption below.]

Figure 1: Sinusoidal example output. Each frame shows 300 samples from the current distilled density. Red crosses show a subsample of 150 targeting pε (θ) for the current ε value, selected by importance resampling (see Section 3.3). Blue dots show the remaining points.

6 EXAMPLE: M/G/1 QUEUE

This section describes an application to likelihood-free inference [Marin et al., 2012, Papamakarios and Murray, 2016]. Here a generative model or simulator is specified, typically by computer code. This maps parameters θ and pseudo-random draws x to data y(θ,x).

Given observations y0, we aim to infer the joint posterior of θ and x, p(ξ|y0). Here we approximate this with a black-box choice of q(ξ|φ), i.e. a generic normalising flow. This approach could be applied to any simulator model without modifying its computer code, instead overriding the random number generator to use x values proposed by q, as in Baydin et al. [2019]. For higher dim(ξ) a black-box approach becomes impractical. Section 7 outlines an alternative – using knowledge of the simulator to inform the choice of q.

6.1 MODEL

We consider an M/G/1 queueing model of a single queue of customers. Times between arrivals at the back of the queue are Exp(θ1). On reaching the front of the queue, a customer's service time is U(θ2,θ3). All these random variables are independent.

We consider a setting where only inter-departure times are observed: times between departures from the queue. This model is a common benchmark for likelihood-free inference, which can often provide fast approximate inference. See Papamakarios et al. [2019b] for a detailed comparison. An advantage of DIS over these methods is that it does not require using low dimensional summary statistics which lose some information from the data. Near-exact posterior inference is also possible for this model using a sophisticated MCMC scheme [Shestopaloff and Neal, 2014].

We sample a synthetic dataset of m = 20 observations from parameter values θ1 = 0.1, θ2 = 4, θ3 = 5. We attempt to infer these parameters under the prior θ1 ∼ U(0,1/3), θ2 ∼ U(0,10), θ3 − θ2 ∼ U(0,10) (all independent).

6.2 DIS IMPLEMENTATION

Latent Variables and Simulator   We introduce ϑ and x, vectors of length 3 and 2m, and take ξ as the collection (ϑ,x). Our simulator transforms these inputs to θ(ϑ) and y(ξ) in a way such that when ξ ∼ N(0,I) then (θ,y) is a sample from the prior and the model. See the supplement (Section C) for details.

Tempered Target   Our unnormalised tempered target is

p̃ε(ξ) = π(ξ) exp[−d(ξ)² / (2ε²)],   for ε > 0,   (9)

where π(ξ) is the N(0,I) density and d(ξ) = ||y(ξ) − y0||₂ is the Euclidean distance between the simulated and observed data. This target is often used in ABC and corresponds to the posterior under the assumption that the data is observed with independent N(0,ε²) errors [Wilkinson, 2013]. We use initial value ε0 = 10: a scale for errors which is large relative to our observations. Note that DIS cannot reach the exact target here. Instead, like ABC, it produces increasingly good posterior approximations as ε → 0.

Approximate Family   We use a real NVP architecture for q(ξ;φ) with 16 coupling layers alternated with random permutations. Each coupling layer uses a neural network with 3 hidden layers of 100, 100, 50 units and ELU activation. We initialise q close to a N(0,I) distribution, as described in the supplement (Section B): close enough to the initial target that no pretraining was needed.
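Evaluating (9) only requires the simulator as a black box; a minimal sketch, where simulator is a hypothetical placeholder for the model code mapping ξ = (ϑ, x) to simulated data:

```python
import numpy as np

def log_p_tilde_eps(xi, eps, simulator, y_obs):
    """Log of the likelihood-free tempered target (9)."""
    log_prior = -0.5 * np.dot(xi, xi) - 0.5 * xi.size * np.log(2 * np.pi)
    d = np.linalg.norm(simulator(xi) - y_obs)   # distance to observations
    return log_prior - d**2 / (2 * eps**2)
```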

6.3 RESULTS

Figure 2 (left) compares different choices of N (number of importance samples) and M (target ESS). It shows that ε reduces more quickly for larger N or smaller M. Our choice of N was restricted by memory requirements: the largest value we used was 50,000. Our choice of M was restricted by numerical stability: values below a few hundred often produced numerical overflow errors in the normalising flow.

This tuning study suggests using large N and small M subject to these restrictions. We use this guidance here and in Section 7. In both examples, the cost of a single evaluation of the target density is low. For more expensive models efficient tuning choices may differ.

Figure 2 (right) shows that DIS results with N = 50,000 and M = 2500 are a close match to near-exact MCMC output using the algorithm of Shestopaloff and Neal [2014]. The main difference is that DIS lacks sharp truncation at θ2 = 4. The DIS results are far more accurate than an ABC baseline, as detailed in the supplement (Section D).

7 EXAMPLE: LORENZ MODEL

Here we consider a time series application. Noisy and infrequent observations y of a time series path x are available. The black-box approach of Section 6 is not practical here: dim(x) is too large to learn (θ,x) directly. Instead we take an amortised approach, exploiting knowledge of the model structure by just learning q(θ) (marginal parameters) and q(x_{i+1}|x_i,θ) (next step of the time series).

7.1 MODEL

We consider the time series model

x_{i+1} = x_i + α(x_i,θ)∆t + √(10∆t) ε_i,   (10)

α(x_i,θ) = [ θ1(x_{i,2} − x_{i,1}),  θ2 x_{i,1} − x_{i,2} − x_{i,1} x_{i,3},  x_{i,1} x_{i,2} − θ3 x_{i,3} ],   (11)

for 0 ≤ i ≤ m. Here x_i is a vector (x_{i,1}, x_{i,2}, x_{i,3}) and ε_i ∼ N(0,I) is also a vector of length 3. This is a stochastic differential equation (SDE) version of the Lorenz 63 dynamical system [Lorenz, 1963] from Vrettas et al. [2015], after applying Euler-Maruyama discretisation.

We take m = 100 and assume independent observations y_i ∼ N(x_i, σ²I) at i = 20, 40, 60, 80, 100. We fix x_0 = (−30, 0, 30) and ∆t = 0.02. The unknown parameters are θ = (θ1, θ2, θ3, σ). We simulate synthetic data from this discretised model for θ = (10, 28, 8/3, 2) – see Figure 3.
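A minimal sketch of the Euler-Maruyama simulation of (10)-(11):

```python
import numpy as np

def simulate_lorenz_sde(theta, m=100, dt=0.02, x0=(-30.0, 0.0, 30.0), rng=None):
    """Euler-Maruyama simulation of the Lorenz SDE (10)-(11)."""
    if rng is None:
        rng = np.random.default_rng()
    th1, th2, th3 = theta[:3]
    x = np.empty((m + 1, 3))
    x[0] = x0
    for i in range(m):
        alpha = np.array([th1 * (x[i, 1] - x[i, 0]),
                          th2 * x[i, 0] - x[i, 1] - x[i, 0] * x[i, 2],
                          x[i, 0] * x[i, 1] - th3 * x[i, 2]])
        x[i + 1] = x[i] + alpha * dt + np.sqrt(10 * dt) * rng.standard_normal(3)
    return x
```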

7.2 DIS IMPLEMENTATION

Tempered Target   Our unnormalised tempered target is

p̃ε(ξ) = π(θ) p(x|θ) p(y|x,θ)^{1−ε}.   (12)

We initialise ε at 1. So the initial target p1 is the prior and time series model unconditioned by y. As ε is reduced, more agreement with y is enforced.

Approximate Family   Following Ryder et al. [2018], we use an autoregressive approximate family

q(ξ;φ) = q(θ;φ_θ) ∏_{i=0}^{m−1} q(x_{i+1}|x_i,θ;φ_x).

We define q(x_{i+1}|x_i,θ;φ_x) generatively as

x_{i+1} = x_i + [α(x_i,θ) + β]∆t + γ√(∆t) ε_i.   (13)

This modifies (10) by adding to α an extra term β and replacing √10 with γ. We take β and γ to be outputs of a neural network with inputs (i, x_i, θ) and weights and biases φ_x. See the supplement (Section E) for more details.

In the limit ∆t → 0, interpreted in an appropriate fashion, (13) with γ = √10, and the correct choice of β, gives the conditional distribution p(x|θ,y) (see e.g. Opper, 2019). However, for ∆t > 0 it is useful to allow γ to vary [Durham and Gallant, 2002]. Intuitively the idea is that the random variation in x_{i+1} − x_i may need to be smaller when i + 1 is close to an observation time, to ensure that the simulation is close to the observed value.

Tuning Choices   We use a real NVP architecture for q(θ;φ_θ), made up of 8 coupling layers, alternated with random permutation layers. Each coupling layer uses a neural network with 3 hidden layers of 30 units each and ELU activation. We can thus initialise q(θ;φ_θ) close to the prior π(θ) via the procedure described in the supplement (Section B).

The network for β and γ has three hidden layers of 80 units each and ELU activation. We initialise the neural network to give x dynamics approximating those of the initial target p1 (the SDE without conditioning on y). See the supplement (Section E) for details.

In the DIS algorithm we take N = 50000, M = 2500, following the general tuning suggestions in Section 6. See the supplement (Section E) for more details of tuning choices, and methods to avoid numerical instability.

Gradients   Our approximating family parameters are φ = (φ_θ, φ_x). We calculate ∇_φ log q(θ,x;φ) by backpropagation. Simulating from q involves looping through (13) m times, each time using the neural network for β and γ. Backpropagation requires unrolling this. See Li et al. [2020] for recent progress on more efficient gradient calculation for SDE models.
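One step of the learned conditional (13) can be sketched as follows, reusing the Lorenz drift from the earlier simulator sketch; beta_gamma_net is a hypothetical placeholder for the neural network mapping (i, x_i, θ) to (β, γ).

```python
import numpy as np

def lorenz_drift(x, theta):
    """Drift alpha(x, theta) from (11)."""
    th1, th2, th3 = theta[:3]
    return np.array([th1 * (x[1] - x[0]),
                     th2 * x[0] - x[1] - x[0] * x[2],
                     x[0] * x[1] - th3 * x[2]])

def proposal_step(x_i, theta, i, beta_gamma_net, dt=0.02, rng=None):
    """One draw from the learned conditional q(x_{i+1}|x_i, theta) in (13)."""
    if rng is None:
        rng = np.random.default_rng()
    beta, gamma = beta_gamma_net(i, x_i, theta)
    drift = lorenz_drift(x_i, theta) + beta
    return x_i + drift * dt + gamma * np.sqrt(dt) * rng.standard_normal(3)
```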

[Figure 2 panels: left — ε against time (seconds) for M/N ∈ {0.05, 0.1, 0.2} and N ∈ {5000, 10000, 20000, 50000}; right — MCMC and DIS histograms over arrival rate, min service and max service. Caption below.]

Figure 2: M/G/1 results. Left: The ε value reached by DIS on the M/G/1 example against computation time, for various choices of N (number of importance samples) and M/N (ratio of target effective sample size to N). Right: Marginal posterior histograms for the M/G/1 example. The DIS output shown is for ε = 0.283, which took 180 minutes to reach.

[Figure 3 panels: MCMC (top) and DIS (bottom) rows; marginal histograms for theta1, theta2, theta3 and sigma, and paths against i = 0..100. Caption below.]

Figure 3: Output for Lorenz example with σ unknown. Left: marginal posterior histograms for parameters θ from particle MCMC and DIS output. Vertical lines show the true values. Right: paths x (red dot-dash x_{i,1}, blue dotted x_{i,2}, green dashed x_{i,3}) and observations (circles). Solid lines show true paths. DIS plots are based on subsamples (1000 for θ, 30 for x) selected by resampling (see Section 3.3) from the final importance sampling output with ε = 0, targeting the posterior distribution.

7.3 RESULTS

First we consider an example which is easy by other methods, to verify that DIS gives correct results. Here we perform inference for θ under independent Exp(0.1) priors on each θi and σ. Figure 3 shows the results using DIS, which takes 49 minutes, and using particle MCMC [Andrieu et al., 2010], as detailed in the supplement (Section G). The two methods produce very similar output, although MCMC is much faster, taking 4 minutes.

Secondly we investigate an example with the observation scale fixed to σ = 0.2, well below the true value of 2, and perform inference for θ1, θ2, θ3. This is a simple illustration of how particle filtering based methods such as PMCMC can become expensive under model misspecification, which occurs often in practice [Akyildiz and Míguez, 2020]. Now DIS takes 443 minutes to target the posterior. Particle MCMC was infeasible with our available memory, but we estimate it would take at least 80,000 minutes. See the supplement (Sections F and G) for details.

These results illustrate that DIS can be implemented in a setting with high dim(x), and it outperforms MCMC methods for some problems. Also, unlike recent work in variational inference for SDEs [Ryder et al., 2018, Opper, 2019, Li et al., 2020], DIS directly targets the posterior distribution, not a variational approximation.

8 CONCLUSION

We've presented distilled importance sampling, and shown its application as an approximate Bayesian inference method for a likelihood-free example, and also as an exact method for a challenging time series model.

There are interesting opportunities to extend DIS. Firstly it's not required that p̃ε is differentiable, so discrete parameter inference is plausible. Secondly, we can use random-weight importance sampling [Fearnhead et al., 2010] in DIS, and replace p̃ε with an unbiased estimate.

Acknowledgements

Thanks to Alex Shestopaloff for providing MCMC code for the M/G/1 model and to Andrew Golightly and Chris Williams for helpful suggestions.

References

Sergios Agapiou, Omiros Papaspiliopoulos, Daniel Sanz-Alonso, and Andrew M. Stuart. Importance sampling: Intrinsic dimension and computational cost. Statistical Science, 32(3):405–431, 2017.
Ömer Deniz Akyildiz and Joaquín Míguez. Nudging the particle filter. Statistics and Computing, 30(2):305–330, 2020.
Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B, 72(3):269–342, 2010.
Atilim Gunes Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Saeid Naderiparizi, Andreas Munk, Jialin Liu, Bradley Gram-Hansen, Gilles Louppe, Lawrence Meadows, Philip Torr, Victor Lee, Kyle Cranmer, Prabhat, and Frank Wood. Efficient probabilistic inference in the quest for physics beyond the standard model. In Advances in Neural Information Processing Systems, 2019.
Mark A. Beaumont, Jean-Marie Cornuet, Jean-Michel Marin, and Christian P. Robert. Adaptive approximate Bayesian computation. Biometrika, pages 2025–2035, 2009.
Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. arXiv preprint arXiv:1406.2751, 2014.
Olivier Cappé, Arnaud Guillin, Jean-Michel Marin, and Christian P. Robert. Population Monte Carlo. Journal of Computational and Graphical Statistics, 13(4):907–929, 2004.
Sourav Chatterjee and Persi Diaconis. The sample size required in importance sampling. The Annals of Applied Probability, 28(2):1099–1135, 2018.
Jean-Marie Cornuet, Jean-Michel Marin, Antonietta Mira, and Christian P. Robert. Adaptive multiple importance sampling. Scandinavian Journal of Statistics, 39(4):798–812, 2012.
Simon L. Cotter, Ioannis G. Kevrekidis, and Paul Russell. Transport map accelerated adaptive importance sampling, and application to inverse problems arising from multiscale stochastic reaction networks. SIAM/ASA Journal on Uncertainty Quantification, 8(4):1383–1413, 2020.
Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B, 68(3):411–436, 2006.
Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. An adaptive sequential Monte Carlo method for approximate Bayesian computation. Statistics and Computing, 22(5):1009–1020, 2012.
Akash Kumar Dhaka, Alejandro Catalina, Manushi Welandawe, Michael Riis Andersen, Jonathan Huggins, and Aki Vehtari. Challenges and opportunities in high-dimensional variational inference. arXiv preprint arXiv:2103.01085, 2021.
Adji Bousso Dieng, Dustin Tran, Rajesh Ranganath, John Paisley, and David Blei. Variational inference via χ upper bound minimization. In Advances in Neural Information Processing Systems, 2017.
Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real NVP. arXiv preprint arXiv:1605.08803, 2016.
Arnaud Doucet, Michael K. Pitt, George Deligiannidis, and Robert Kohn. Efficient implementation of Markov chain Monte Carlo when using an unbiased likelihood estimator. Biometrika, 102(2):295–313, 2015.
Leo L. Duan. Transport Monte Carlo. arXiv preprint arXiv:1907.10448, 2019.
Garland B. Durham and A. Ronald Gallant. Numerical techniques for maximum likelihood estimation of continuous-time diffusion processes. Journal of Business & Economic Statistics, 20(3):297–338, 2002.
Conor Durkan, Artur Bekasov, Iain Murray, and George Papamakarios. Neural spline flows. In Advances in Neural Information Processing Systems, 2019.
Paul Fearnhead, Omiros Papaspiliopoulos, Gareth O. Roberts, and Andrew Stuart. Random-weight particle filtering of continuous time processes. Journal of the Royal Statistical Society: Series B, 72(4):497–512, 2010.
Matthew M. Graham and Amos J. Storkey. Asymptotically exact inference in differentiable generative models. Electronic Journal of Statistics, 11(2):5105–5164, 2017.
Clara Grazian and Yanan Fan. A review of approximate Bayesian computation methods via density estimation: Inference for simulator-models. Wiley Interdisciplinary Reviews: Computational Statistics, 2019.
Peter J. Green, Krzysztof Łatuszyński, Marcelo Pereyra, and Christian P. Robert. Bayesian computation: a summary of the current state, and samples backwards and forwards. Statistics and Computing, 25(4):835–862, 2015.
Jonathan H. Huggins, Mikołaj Kasprzak, Trevor Campbell, and Tamara Broderick. Practical posterior error bounds from variational objectives. In Artificial Intelligence and Statistics, 2020.
Edward L. Ionides. Truncated importance sampling. Journal of Computational and Graphical Statistics, 17(2):295–311, 2008.
Ghassen Jerfel, Serena Wang, Clara Fannjiang, Katherine A. Heller, Yian Ma, and Michael I. Jordan. Variational refinement for importance sampling using the forward Kullback-Leibler divergence. arXiv preprint arXiv:2106.15980, 2021.
Aaron A. King, Dao Nguyen, and Edward L. Ionides. Statistical inference for partially observed Markov processes via the R package pomp. Journal of Statistical Software, 69(12), 2016.
Pedro Larrañaga and Jose A. Lozano. Estimation of distribution algorithms: A new tool for evolutionary computation. Springer, 2002.
Tuan Anh Le, Atilim Gunes Baydin, and Frank Wood. Inference compilation and universal probabilistic programming. In Artificial Intelligence and Statistics, 2017.
Xuechen Li, Ting-Kam Leonard Wong, Ricky T. Q. Chen, and David Duvenaud. Scalable gradients for stochastic differential equations. In Artificial Intelligence and Statistics, 2020.
Yingzhen Li, Richard E. Turner, and Qiang Liu. Approximate inference with amortised MCMC. arXiv preprint arXiv:1702.08343, 2017.
Dennis V. Lindley. The theory of queues with a single server. Mathematical Proceedings of the Cambridge Philosophical Society, 48(2):277–289, 1952.
Jun S. Liu. Metropolized independent sampling with comparisons to rejection sampling and importance sampling. Statistics and Computing, 6(2):113–119, 1996.
Edward N. Lorenz. Deterministic nonperiodic flow. Journal of the Atmospheric Sciences, 20(2):130–141, 1963.
Jean-Michel Marin, Pierre Pudlo, Christian P. Robert, and Robin J. Ryder. Approximate Bayesian computational methods. Statistics and Computing, 22(6):1167–1180, 2012.
Luca Martino, Víctor Elvira, and Francisco Louzada. Effective sample size for importance sampling based on discrepancy measures. Signal Processing, 131:386–401, 2017.
Shakir Mohamed, Mihaela Rosca, Michael Figurnov, and Andriy Mnih. Monte Carlo gradient estimation in machine learning. Journal of Machine Learning Research, 21(132):1–62, 2020.
Thomas Müller, Brian Mcwilliams, Fabrice Rousselle, Markus Gross, and Jan Novák. Neural importance sampling. ACM Transactions on Graphics (TOG), 38(5):1–19, 2019.
Christian A. Naesseth, Fredrik Lindsten, and David Blei. Markovian score climbing: Variational inference with KL(p||q). arXiv preprint arXiv:2003.10374, 2020.
Manfred Opper. Variational inference for stochastic differential equations. Annalen der Physik, 531(3):1800233, 2019.
George Papamakarios and Iain Murray. Fast ε-free inference of simulation models with Bayesian conditional density estimation. In Advances in Neural Information Processing Systems, 2016.
George Papamakarios, Eric Nalisnick, Danilo Jimenez Rezende, Shakir Mohamed, and Balaji Lakshminarayanan. Normalizing flows for probabilistic modeling and inference. arXiv preprint arXiv:1912.02762, 2019a.
George Papamakarios, David Sterratt, and Iain Murray. Sequential neural likelihood: Fast likelihood-free inference with autoregressive flows. In Artificial Intelligence and Statistics, 2019b.
Matthew D. Parno and Youssef M. Marzouk. Transport map accelerated Markov chain Monte Carlo. SIAM/ASA Journal on Uncertainty Quantification, 6(2):645–682, 2018.
Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. On the difficulty of training recurrent neural networks. In International Conference on Machine Learning, 2013.
Dennis Prangle, Richard G. Everitt, and Theodore Kypraios. A rare event approach to high-dimensional approximate Bayesian computation. Statistics and Computing, 28(4):819–834, 2018.
Christian P. Robert and George Casella. Monte Carlo statistical methods. Springer, 2013.
Reuven Rubinstein. The cross-entropy method for combinatorial and continuous optimization. Methodology and Computing in Applied Probability, 1(2):127–190, 1999.
Reuven Y. Rubinstein and Dirk P. Kroese. Simulation and the Monte Carlo method. John Wiley & Sons, 2016.
Francisco Ruiz and Michalis Titsias. A contrastive divergence for combining variational inference and MCMC. In International Conference on Machine Learning, 2019.
Thomas Ryder, Andrew Golightly, A. Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. In International Conference on Machine Learning, 2018.
Chris Sherlock, Alexandre H. Thiery, Gareth O. Roberts, and Jeffrey S. Rosenthal. On the efficiency of pseudo-marginal random walk Metropolis algorithms. The Annals of Statistics, 43(1):238–275, 2015.
Alexander Y. Shestopaloff and Radford M. Neal. On Bayesian inference for the M/G/1 queue with efficient MCMC sampling. arXiv preprint arXiv:1401.5548, 2014.
Scott A. Sisson, Yanan Fan, and Mark M. Tanaka. Correction: Sequential Monte Carlo without likelihoods. Proceedings of the National Academy of Sciences, 106(39):16889–16890, 2009.
Adrian F. M. Smith and Alan E. Gelfand. Bayesian statistics without tears: a sampling–resampling perspective. The American Statistician, 46(2):84–88, 1992.
Tina Toni, David Welch, Natalja Strelkowa, Andreas Ipsen, and Michael Stumpf. Approximate Bayesian computation scheme for parameter inference and model selection in dynamical systems. Journal of The Royal Society Interface, 6(31):187–202, 2009.
Richard E. Turner, Pietro Berkes, Maneesh Sahani, and David J. C. MacKay. Counterexamples to variational free energy compactness folk theorems. Technical report, University College London, 2008.
Dootika Vats, James M. Flegal, and Galin L. Jones. Multivariate output analysis for Markov chain Monte Carlo. Biometrika, 106(2):321–337, 2019.
Michail D. Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91, 2015.
Gavin A. Whitaker, Andrew Golightly, Richard J. Boys, and Chris Sherlock. Improved bridge constructs for stochastic differential equations. Statistics and Computing, 27(4):885–900, 2017.
Richard D. Wilkinson. Approximate Bayesian computation (ABC) gives exact results under the assumption of model error. Statistical Applications in Genetics and Molecular Biology, 12(2):129–141, 2013.
Yuling Yao, Aki Vehtari, Daniel Simpson, and Andrew Gelman. Yes, but did it work?: Evaluating variational inference. In International Conference on Machine Learning, 2018.

SUPPLEMENTARY MATERIAL

A TRUNCATING IMPORTANCE WEIGHTS

Importance sampling estimates can have large or infinite variance if λ is a poor approximation to p. The practical manifestation of this is a small number of importance weights being very large relative to the others. To reduce the variance, Ionides [2008] introduced truncated importance sampling. This replaces each importance weight w_i with w̃_i = min(w_i, ω) given some threshold ω. The truncated weights are then used in estimates (3) or (4). Truncating in this way typically reduces variance, at the price of increasing bias.

Gradient clipping [Pascanu et al., 2013] is common practice in stochastic gradient optimisation of an objective J(φ), to prevent occasional large gradient estimates from destabilising optimisation. Truncating importance weights in Algorithm 1 has a similar effect of reducing the variability of gradient estimates. A potential drawback of either method is that gradients lose the property of unbiasedness, which is theoretically required for convergence to an optimum of the objective. Pascanu et al. [2013] give heuristic arguments for good optimiser performance when using truncated gradients, and we make the following similar case for using clipped weights. Firstly, even after truncation, the gradient is likely to point in a direction increasing the objective. In our approach, it should still increase the q density at ξ^(i) values with large w_i weights, which is desirable. Secondly, we expect there is a region near the optimum for φ where truncation is extremely rare, and therefore gradient estimates have very low bias once this region is reached. Finally, we also observe good empirical behaviour in our examples, showing that optimisation with truncated weights can find very good importance sampling proposals.

We could use gradient clipping directly in Algorithm 1. However we prefer truncating importance weights as there is an automated way to choose the threshold ω, as follows. We select ω to reduce the maximum normalised importance weight, max_i w̃_i / ∑_{i=1}^N w̃_i, to a prespecified value: throughout we use 0.1. The required ω can easily be calculated e.g. by bisection. (Occasionally no such ω exists, i.e. if most w_is are zero. In this case we set ω to the smallest positive w_i.)

B APPROXIMATE DENSITY INITIALISATION

As discussed in Section 4.1, we wish to initialise q(ξ;φ) close to its initial target distribution p_{ε0}. This avoids importance sampling initially producing high variance gradient estimates.

An approach we use in our examples is to select real NVP parameters so that all its coupling layers are approximately the identity transformation. Then the resulting distribution approximates its base distribution N(0,I). We can often design our initial target to equal this, or to be sufficiently similar that only a few iterations of pretraining are needed.

To achieve this we initialise all the neural network weights and biases used in real NVP to be approximately zero. In more detail, we set biases to zero and sample weights from N(0, 0.001²) distributions truncated to two standard deviations from the mean. We also ensure that we use activation functions which map zero to zero. Then the neural network outputs µ and σ are also approximately zero. Thus each coupling layer has shift vector µ ≈ 0 and scale vector exp(σ) ≈ 1.

C M/G/1 MODEL

This section describes the M/G/1 queueing model, in particular how to simulate from it.

Recall that the parameters are θ1, θ2, θ3, with independent prior distributions θ1 ∼ U(0,1/3), θ2 ∼ U(0,10), θ3 − θ2 ∼ U(0,10). We introduce a reparameterised version of our parameters: ϑ1, ϑ2, ϑ3 with independent N(0,1) priors. Then we can take θ1 = Φ(ϑ1)/3, θ2 = 10Φ(ϑ2) and θ3 = θ2 + 10Φ(ϑ3), where Φ is the N(0,1) cumulative distribution function.

The model involves independent latent variables x_i ∼ N(0,1) for 1 ≤ i ≤ 2m, where m is the number of observations. These generate

a_i = −(1/θ1) log Φ(x_i),   (inter-arrival times)
s_i = θ2 + (θ3 − θ2) Φ(x_{i+m}).   (service times)

Inter-departure times can be calculated through the following recursion [Lindley, 1952]

d_i = s_i + max(0, A_i − D_{i−1}),   (inter-departure times)

where A_i = ∑_{j=1}^i a_j (arrival times) and D_i = ∑_{j=1}^i d_j (departure times).
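Putting the reparameterisation and the Lindley recursion together, a minimal sketch of the simulator (assuming the reconstruction above; not the paper's released code):

```python
import numpy as np
from scipy.stats import norm

def simulate_mg1(vartheta, x, m):
    """Map N(0,1) inputs (vartheta, x) to parameters theta and
    inter-departure times, via the Lindley recursion."""
    u = norm.cdf(vartheta)                      # uniforms from N(0,1) draws
    th1, th2 = u[0] / 3, 10 * u[1]
    th3 = th2 + 10 * u[2]
    a = -np.log(norm.cdf(x[:m])) / th1          # inter-arrival times
    s = th2 + (th3 - th2) * norm.cdf(x[m:])     # service times
    A = np.cumsum(a)                            # arrival times
    d = np.empty(m)
    D_prev = 0.0
    for i in range(m):
        d[i] = s[i] + max(0.0, A[i] - D_prev)   # Lindley recursion
        D_prev += d[i]                          # departure time D_i
    return (th1, th2, th3), d
```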

D ABC ANALYSIS OF M/G/1 EXAMPLE

As a baseline comparison, we perform inference for the M/G/1 example using approximate Bayesian computation (ABC).

D.1 ALGORITHM

We implement ABC using Algorithm A below, a version of the popular ABC-PMC approach [Toni et al., 2009, Sisson et al., 2009, Beaumont et al., 2009] modified to target

p̃ε(θ) = π(θ) ∫ exp[−||y(θ,x) − y0||² / (2ε²)] π(x) dx,

where ||·|| is the Euclidean norm, and π(x) is the density of the x variables used in the simulator. This is the target used in the main paper for θ in this example, i.e. it is the θ marginal of the joint target (9).

Each iteration of the algorithm produces a weighted sample (θ_i^t, w_i^t)_{1≤i≤N}. This targets p̃_{εt}(θ) in the same way as importance sampling output. Over the course of the algorithm the value of εt is reduced, and the number of simulated datasets required for an iteration tends to increase.

Algorithm A ABC-PMC
1: Initialise ε1 = ∞.
2: for t = 1, 2, ... do
3:   Let i = 0 (number of acceptances).
4:   while i < N do
5:     Sample θ* from density λt(θ) (see below).
6:     if π(θ*) > 0 then
7:       Sample x* from π(x) and let y* = y(θ*, x*).
8:       Let d* = ||y* − y0||.
9:       Let α = exp[−d*² / (2εt²)].
10:      With probability α accept: let θ_i^t = θ*, d_i^t = d*, w_i^t = π(θ*)/λt(θ*) and increment i by 1.
11:    end if
12:  end while
13:  Calculate εt+1 (see below).
14: end for

Step 5 of the algorithm samples from the density

λt(θ) = π(θ) if t = 1, and λt(θ) = ∑_{i=1}^N w_i^{t−1} Kt(θ|θ_i^{t−1}) / ∑_{i=1}^N w_i^{t−1} otherwise.

In the first iteration λt(θ) is the prior. After this a kernel density estimate is used based on the previous weighted sample. We follow Beaumont et al. [2009] in using

Kt(θ|θ′) = ϕ(θ′, 2Σ_{t−1}),

where ϕ is the density of a normal distribution and Σ_{t−1} is the empirical variance matrix of (θ_i^{t−1})_{1≤i≤N} calculated using weights (w_i^{t−1})_{1≤i≤N}.

To implement step 13 of the algorithm, we select εt+1 so that the acceptance probability of a typical member of the previous weighted sample is reduced by a prespecified factor k. Specifically, we define d̃ as the median of the d_i^t values, and find εt+1 by solving

α(d̃, εt+1) = k α(d̃, εt),   where   α(d, ε) = exp[−d²/(2ε²)].
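The equation defining εt+1 has a closed-form solution, which the following sketch derives and implements (solving exp[−d̃²/(2ε²)] = k exp[−d̃²/(2εt²)] for ε):

```python
import numpy as np

def next_abc_epsilon(d_tilde, eps_t, k=0.7):
    """Closed form behind step 13 of Algorithm A:
    eps_{t+1} = d / sqrt(d^2/eps_t^2 - 2 log k), where d is the median
    distance; k < 1 ensures eps_{t+1} < eps_t."""
    return d_tilde / np.sqrt(d_tilde**2 / eps_t**2 - 2 * np.log(k))
```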

D.2 RESULTS

We ran Algorithm A on the M/G/1 example with k = 0.7 and N = 500. It terminates (in step 2) once the number of simulated datasets used in the run of DIS reported in the main paper (1.65 × 10⁸) is exceeded. The final ABC-PMC ε value is 7.15, much worse than the final value for DIS (ε = 0.28).

Recall that the approximate posterior p̃ε(θ) is the exact posterior under the assumption of extra Gaussian observation error with scale ε [Wilkinson, 2013]. At ε = 7.15 this extra error is large compared to the data (whose values range from 4 to 34), hence it seems likely to produce significant approximation error. This is reflected in very wide posterior marginals – see Figure 4. Further, the asymptotic cost per output sample of ABC is O(ε^{−dim(y)}) [Prangle et al., 2018], suggesting that reaching ε = 0.28 using ABC requires a factor of (0.28/7.15)^{−20} ≈ 10^{28} more simulations, which is computationally infeasible.

To reduce the computational cost of ABC, the data are often replaced by low dimensional summary statistics. We investigate using quartile summaries in the M/G/1 example, following Papamakarios and Murray [2016]. This is done by adding a final step to the simulator which converts the raw data to quartiles, and letting y0 be the observed quartiles. We ran ABC-PMC with quartile statistics using the same tuning details as above. The final ε value is now 0.31. Figure 4 shows that the posterior marginals are much improved. However the maximum service time marginal is still a poor approximation of the near-exact MCMC results, suggesting that the summaries have lost information from the raw data.

E LORENZ EXAMPLE TUNING DETAILS

Here we describe further details of our implementation of DIS for the Lorenz example.

E.1 NUMERICAL STABILITY

The Lorenz model (10) can produce large x_i values which cause numerical difficulties in our implementation. To avoid this our code gives zero importance weight to any simulation where max_{i,j} |x_{i,j}| > 1000. Effectively this is a weak extra prior constraint. Our final posteriors in Figures 3 and 5 show no x_i values near this bound, verifying that this constraint has a negligible effect on the final results.

[Figure 4 panels: rows for MCMC, ABC without summaries, and ABC with quartile summaries; columns for arrival rate, min service and max service. Caption below.]

Figure 4: Marginal posterior histograms for M/G/1 example. Top: MCMC output. Middle: ABC output without summary statistics for ε = 7.14. Bottom: ABC output with quartile summaries for ε = 0.32. The ABC histograms are based on samples drawn from ABC-PMC output by importance resampling.

E.2 NEURAL NETWORK INPUT

The inputs to the neural network for β and γ, used in (13), are:

• Parameters θ
• Current time i
• Current state x_i

and also the following derived features:

• Current α(x_i, θ), from (11)
• Time until next observation
• Next observation value

E.3 NEURAL NETWORK INITIALISATION AND OUTPUT

As discussed in Section 7, we aim to initialise the neural network for β and γ so that (13) produces x dynamics similar to those of the initial target. This requires β ≈ (0,0,0) and γ ≈ √10. To achieve this we initialise as follows.

We initialise the biases to zero and the weights close to zero, as in Section B. Our final neural network layer has 4 outputs. Under our initialisation these will be close to zero. Three are used as β, providing the required initial values. The remaining output of the neural network, η, is used to produce the multiplier γ through

γ = (√10 / log 2) softplus(η).

We apply the softplus transform since γ must be positive. The constant factor ensures that η = 0 produces γ = √10, as required.

F DIS RESULTS FOR FIXED σ LORENZ EXAMPLE

Figure 5 shows the DIS results in the Lorenz example with σ fixed at 0.2. The results shown are after 1500 iterations, taking 652 minutes.

[Figure 5 panels: left — pairwise plots for theta1, theta2 and theta3; right — paths against i. Caption below.]

Figure 5: Output for Lorenz example with σ = 0.2. Left: Parameters θ. Diagonal plots show histograms of marginals, with vertical lines showing the true values. Off-diagonal plots show bivariate scatter-plots. Right: Paths x (red dot-dash xi,1, blue dotted xi,2, green dashed xi,3) and observations (dots). Solid lines show the true paths. Both panels are subsamples (1000 left, 30 right) selected by resampling (see Section 3.3) from the final importance sampling output with ε = 0, targeting the posterior distribution. Darker points/lines indicate more frequent resampling.

G PARTICLE MCMC ANALYSIS OF LORENZ EXAMPLES

As a comparison to DIS, we also investigate inference for the Lorenz example using particle Markov chain Monte Carlo (PMCMC) [Andrieu et al., 2010], a near-exact inference method. This runs a Metropolis-Hastings MCMC algorithm for the model parameters θ. For each proposed θ, the likelihood is estimated by running a particle filter for the x variables with N_PF particles. A particle filter involves forward simulating the unconditioned time series model to the next observation time and weighting the simulated paths based on the density of the observation given the endpoint. We implement PMCMC and particle filters using the pomp R package [King et al., 2016].

One PMCMC tuning choice is the proposal distribution for θ. We use a normal proposal, with variance equal to a posterior covariance matrix estimate from a pilot PMCMC run. Another choice is where to initialise the MCMC chain. For simplicity we use the true parameter values. In a real analysis initialisation is often more difficult.

A key remaining tuning choice is selecting N_PF so that the particle filter produces likelihood estimates which are accurate enough for the MCMC algorithm to be efficient. We follow the theoretically derived tuning advice of Sherlock et al. [2015] and Doucet et al. [2015]: we choose N_PF so that the log-likelihood estimates at a representative parameter value (we use the true parameter values) have a standard deviation s of roughly 1.5.

In our first example σ, the observation noise scale, is an unknown parameter whose true value is 2. For σ = 2 it is relatively common for forward simulations to produce high observation densities. This is reflected in that we need only N_PF = 50 to attain s ≈ 1.5, and so PMCMC was fast to run. We ran 80,000 PMCMC iterations, taking 4 minutes. This produces an effective sample size – calculated using the method of Vats et al. [2019] – of 2328, comparable to the target ESS of 2500 used by DIS.

However in our second example we fix σ = 0.2. At this level of noise it is rare for forward simulations to produce high observation densities, particularly for the observations in Figure 3 which lie furthest from the true paths. Hence even N_PF = 10⁶, near the largest choice we could use with our available memory, produced s ≈ 6. We can estimate a lower bound on the time cost of PMCMC for this example without memory constraints by considering how long 10⁶ particles would take. The computational cost of this particle filter implementation is O(N_PF) [King et al., 2016], and using 50 particles in PMCMC took 4 minutes. So MCMC using 10⁶ particles would take approximately 4 × 10⁶/50 = 80,000 minutes, which is orders of magnitude longer than DIS.

H TUNING RECOMMENDATIONS

Here we summarise recommendations on tuning DIS which appear throughout the paper, and add some further comments.

H.1 HYPERPARAMETERS

For the M/G/1 and Lorenz examples we recommend using:

• N = 50,000 (importance sampling sample size)
which is convenient when the likelihood cannot be evalu- In general we expect optimising these hyperparameter for ated but it is possible to simulate data y(ξ) and calculate particular tasks is likely to be useful. In particular, this paper d(ξ) = ||y(ξ) − y ||, the Euclidean distance to the observa- focuses on models where producing a single evaluation of 0 tions y . It is also necessary to be able to easily train q(ξ;φ) the target density is low. For more expensive models, the 0 to match the prior π(ξ) on the parameters and random vari- optimal tuning choices could be qualitatively different. ables involved in the simulator. Many variations on this target are possible, such as changing the choice of distance H.2 NORMALISING FLOW ARCHITECTURE function or replacing the exponential term with K(ξ/ε) for some density K with mode zero [Wilkinson, 2013]. We tune the normalising flow architecture by trial and error. In the SDE example we used In general we expect more complex and higher dimensional 1−ε targets to require more layers and hidden units. p˜ε (ξ) = π(θ)p(x|θ)p(y|x,θ) . (16) One possible more general approach is to begin with a small Recall y represents observed data, x latent variables and θ number of layers with few hidden units. It may become model parameters. This is convenient when it is straightfor- apparent while running DIS that the decrease in ε reaches a ward to simulate from an unconditioned model for ξ = (θ,x) barrier due to an insufficiently flexible class of distributions and to pretrain q(ξ;φ) to approximate its density, and the available in the approximating family. Whenever this hap- observation density p(y|x,θ) is known. pens, extra layers with more hidden units can be added to the flow, initialised to be close to identity transformations There are close relations between these three tempering as in Section B above. schemes. Firstly, (16) is the special case of (14) where the unconditioned model is used as the initial target. Secondly, note that (15) and (16) are similar. In (16) the model has an H.3 REGULARISATION observation density component, and the tempering scheme inflates the observation error. As ε → 0 the true observation We speculate that L1 regularisation on neural network density is recovered. In (15) an artificial observation error weights may be helpful to encourage simple density approx- density is introduced, which converges to a point mass as imations, which are especially appropriate early in training ε → 0. when ε is large. Exploratory work found that L1 regularisa- tion improved the results in our most complicated example, on the Lorenz model, but not the others.

H.4 SELECTING ε

The method outlined in the main paper to select ε is free of tuning parameters. However, alternative methods to select ε could be used, for instance by using variations on the standard effective sample size [see e.g. Martino et al., 2017].

H.5 TEMPERING SCHEME

We have not compared the performance of different temper- ing schemes and for now recommend choosing one based on convenience. In the sinusoidal example we used

ε 1−ε p˜ε (ξ) = p1(ξ) p˜(ξ) , (14) which is convenient when the initial target p1(ξ) and final unnormalised target p˜(ξ) can be evaluated. Here it is crucial