
Nesting Probabilistic Programs

Tom Rainforth
Department of Statistics, University of Oxford
[email protected]

Abstract

We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that while query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, existing systems take approaches which lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce a new online nested Monte Carlo estimator that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines. We prove the correctness of this online estimator and show that, when using the recommended setup, its asymptotic variance is always better than that of the equivalent fixed estimator, while its bias is always within a factor of two.

1 INTRODUCTION

Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014). Informally, one can think of the generative model as the definition of a prior, the conditioning statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution. Their core philosophy is to decouple model specification and inference, the former corresponding to the user-specified program code and the latter to an inference engine capable of operating on arbitrary programs. Removing the need for users to write inference algorithms significantly reduces the burden of developing new models and makes effective statistical methods accessible to non-experts.

Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014) further allow the definition of models that would be hard, or even impossible, to convey using conventional frameworks such as graphical models. One enticing manner they do this is by allowing arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008), such that it is easy to define and run problems that fall outside the standard inference framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016). This allows the definition of models that could not be encoded without nesting, such as experimental design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014). In particular, models that involve agents reasoning about other agents require, in general, some form of nesting. For example, one might use such nesting to model a poker player reasoning about another player, as shown in Section 3.1. As machine learning increasingly starts to tackle problem domains that require interaction with humans or other external systems, such as the need for self-driving cars to account for the behavior of pedestrians, we believe that such nested problems are likely to become increasingly common and that PPSs will form a powerful tool for encoding them.

However, previous work has, in general, implicitly, and incorrectly, assumed that the convergence results from standard inference schemes carry over directly to the nested setting. In truth, inference for nested queries falls outside the scope of conventional proofs and so additional work is required to prove the consistency of PPS inference engines for nested queries. Such problems constitute special cases of nested estimation. In particular, the use of Monte Carlo (MC) methods by most PPSs means they form particular instances of nested Monte Carlo (NMC) estimation (Hong and Juneja, 2009). Recent work (Rainforth et al., 2016a, 2018; Fort et al., 2017) has demonstrated that NMC is consistent for a general class of models, but also that it entails a convergence rate in the total computational cost which decreases exponentially with the depth of the nesting. Furthermore, additional assumptions are required to achieve this convergence, most noticeably that, except in a few special cases, one needs to drive not only the total number of samples used to infinity, but also the number of samples used at each layer of the estimator, a requirement generally flouted by existing PPSs.

The aim of this work is to formalize the notion of query nesting and use these recent NMC results to investigate the statistical correctness of the resulting procedures carried out by PPS inference engines. To do this, we postulate that there are three distinct ways one might nest one query within another: sampling from the conditional distribution of another query (which we refer to as nested inference), factoring the trace probability of one query with the partition function estimate of another (which we refer to as nested conditioning), and using expectation estimates calculated using one query as first class variables in another. We use the aforementioned NMC results to assess the relative correctness of each of these categories of nesting. In the interest of exposition, we will mostly focus on the PPS Anglican (Tolpin et al., 2016; Wood et al., 2014) (and also occasionally Church (Goodman et al., 2008)) as a basis for our discussion, but note that our results apply more generally. For example, our nested inference case covers the problem of sampling from cut distributions in OpenBugs (Plummer, 2015).

We find that nested inference is statistically challenging and incorrectly handled by existing systems, while nested conditioning is statistically straightforward and done correctly. Using estimates as variables turns out to be exactly equivalent to generic NMC estimation and must thus be dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases.

To assist in the development of consistent approaches, we further introduce a new online NMC (ONMC) scheme that obviates the need to revisit previous samples when refining estimates, thereby simplifying the process of writing consistent online nested estimation schemes, as required by most PPSs. We show that ONMC's convergence rate only varies by a small constant factor relative to conventional NMC: given some weak assumptions and the use of recommended parameter settings, its asymptotic variance is always better than the equivalent NMC estimator with matched total sample budget, while its asymptotic bias is always within a factor of two.

2 BACKGROUND

2.1 NESTED MONTE CARLO

We start by providing a brief introduction to NMC, using similar notation to that of Rainforth et al. (2018). Conventional MC estimation approximates an intractable expectation γ_0 of a function λ using

  γ_0 = E[λ(y^(0))] ≈ I_0 = (1/N_0) Σ_{n=1}^{N_0} λ(y_n^(0)),    (1)

where y_n^(0) ~ p(y^(0)) i.i.d., resulting in a mean squared error (MSE) that decreases at a rate O(1/N_0). For nested estimation problems, λ(y^(0)) is itself intractable, corresponding to a nonlinear mapping of a (nested) estimation. Thus in the single nesting case, λ(y^(0)) = f_0(y^(0), E[f_1(y^(0), y^(1)) | y^(0)]), giving

  γ_0 = E[ f_0( y^(0), E[ f_1(y^(0), y^(1)) | y^(0) ] ) ]
      ≈ I_0 = (1/N_0) Σ_{n=1}^{N_0} f_0( y_n^(0), (1/N_1) Σ_{m=1}^{N_1} f_1(y_n^(0), y_{n,m}^(1)) ),

where each y_{n,m}^(1) ~ p(y^(1) | y_n^(0)) is drawn independently and I_0 is now a NMC estimate using T = N_0 N_1 samples.

More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed integral depth D ≥ 0 (with D = 0 corresponding to conventional estimation), and real-valued functions f_0, ..., f_D. We then recursively define

  γ_D(y^(0:D−1)) = E[ f_D(y^(0:D)) | y^(0:D−1) ],  and
  γ_k(y^(0:k−1)) = E[ f_k( y^(0:k), γ_{k+1}(y^(0:k)) ) | y^(0:k−1) ]

for 0 ≤ k < D. Our goal is to estimate γ_0 = E[ f_0(y^(0), γ_1(y^(0))) ], for which the NMC estimate is I_0, defined recursively using

  I_D(y^(0:D−1)) = (1/N_D) Σ_{n_D=1}^{N_D} f_D( y^(0:D−1), y_{n_D}^(D) )  and
  I_k(y^(0:k−1)) = (1/N_k) Σ_{n_k=1}^{N_k} f_k( y^(0:k−1), y_{n_k}^(k), I_{k+1}(y^(0:k−1), y_{n_k}^(k)) )    (2)

for 0 ≤ k < D, where each y_{n_k}^(k) ~ p(y^(k) | y^(0:k−1)) is drawn independently. Note that there are multiple values of y^(k) for each associated y^(0:k−1) and that I_k(y^(0:k−1)) is still a random variable given y^(0:k−1).

As shown by (Rainforth et al., 2018, Theorem 3), if each f_k is continuously differentiable and

  ς_k² = E[ ( f_k(y^(0:k), γ_{k+1}(y^(0:k))) − γ_k(y^(0:k−1)) )² ] < ∞  ∀k ∈ 0, ..., D,

then the MSE converges at rate

  E[(I_0 − γ_0)²] ≤ ς_0²/N_0 + ( C_0 ς_1²/(2N_1) + Σ_{k=0}^{D−2} ( Π_{d=0}^{k} K_d ) C_{k+1} ς_{k+2}²/(2N_{k+2}) )² + O(ε),    (3)

where K_k and C_k are respectively bounds on the magnitude of the first and second derivatives of f_k, and O(ε) represents asymptotically dominated terms – a convention we will use throughout. Note that the dominant terms in the bound correspond respectively to the variance and the bias squared. Theorem 2 of Rainforth et al. (2018) further shows that the continuously differentiable assumption must hold almost surely, rather than absolutely, for convergence more generally, such that functions with measure-zero discontinuities still converge in general.

We see from (3) that if any of the N_k remain fixed, there is a minimum error that can be achieved: convergence requires each N_k → ∞. As we will later show, many of the shortfalls in dealing with nested queries by existing PPSs revolve around implicitly fixing N_k ∀k ≥ 1.

For a given total sample budget T = N_0 N_1 ... N_D, the bound is tightest when √N_0 ∝ N_1 ∝ ··· ∝ N_D, giving a convergence rate of O(1/T^(2/(D+2))). The intuition behind this potentially surprising optimum setting is that the variance is mostly dictated by N_0 and the bias by the other N_k. We see that the convergence rate diminishes exponentially with D. However, this optimal setting of the N_k still gives a substantially faster rate than the O(1/T^(1/(D+1))) from naïvely setting N_0 ∝ N_1 ∝ ··· ∝ N_D.
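To make the estimator in (2) concrete for the single-nesting case, the following is a minimal sketch in plain Clojure (not an Anglican query); sample-y0, sample-y1-given-y0, f0, and f1 are hypothetical user-supplied functions standing in for p(y^(0)), p(y^(1)|y^(0)), f_0, and f_1.

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn nmc-estimate
  "NMC estimate I0 with N0 outer and N1 inner samples (Equation (2) with D = 1)."
  [sample-y0 sample-y1-given-y0 f0 f1 N0 N1]
  (mean
    (for [_ (range N0)]
      (let [y0    (sample-y0)
            ;; inner MC estimate of E[f1(y0, y1) | y0] using N1 samples
            inner (mean (for [_ (range N1)]
                          (f1 y0 (sample-y1-given-y0 y0))))]
        (f0 y0 inner)))))

Driving both N0 and N1 to infinity is exactly what the convergence results above require; holding N1 fixed leaves an irreducible bias.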
2.2 THE ANGLICAN PPS

Anglican is a universal probabilistic programming language integrated into Clojure (Hickey, 2008), a dialect of Lisp. There are two important ideas to understand for reading Clojure: almost everything is a function and parentheses cause evaluation. For example, a + b is coded as (+ a b), where + is a function taking two arguments and the parentheses cause the function to evaluate.

Anglican inherits most of the syntax of Clojure, but extends it with the key special forms sample and observe (Wood et al., 2014; Tolpin et al., 2015, 2016), between which the distribution of the query is defined. Informally, sample specifies terms in the prior and observe terms in the likelihood. More precisely, sample is used to make random draws from a provided distribution and observe is used to apply conditioning, factoring the probability density of a program trace by a provided density evaluated at an "observed" point.

The syntax of sample is to take a distribution object as its only input and return a sample. observe instead takes a distribution object and an observation and returns nil, while changing the program trace probability in Anglican's back-end. Anglican provides a number of elementary random procedures, i.e. distribution object constructors for common sampling distributions, but also allows users to define their own distribution object constructors using the defdist macro. Distribution objects are generated by calling a class constructor with the required parameters, e.g. (normal 0 1).

Anglican queries are written using the macro defquery. This allows users to define a model using a mixture of sample and observe statements and deterministic code, and bind that model to a variable. As a simple example,

(defquery my-query [data]
  (let [µ (sample (normal 0 1))
        σ (sample (gamma 2 2))
        lik (normal µ σ)]
    (map (fn [obs] (observe lik obs)) data)
    [µ σ]))

corresponds to a model where we are trying to infer the mean and standard deviation of a Gaussian given some data. The syntax of defquery is (defquery name [args] body), such that we are binding the query to my-query here. The query starts by sampling µ ∼ N(0, 1) and σ ∼ Γ(2, 2), before constructing a distribution object lik to use for the observations. It then maps over each datapoint and observes it under the distribution lik. After the observations are made, µ and σ are returned from the variable-binding let block and then by proxy the query itself. Denoting the data as y_{1:S}, this particular query defines the joint distribution

  p(µ, σ, y_{1:S}) = N(µ; 0, 1) Γ(σ; 2, 2) Π_{s=1}^{S} N(y_s; µ, σ).

Inference on a query is performed using the macro doquery, which produces a lazy infinite sequence of approximate samples from the conditional distribution and, for appropriate inference algorithms, an estimate of the partition function. Its calling syntax is (doquery inf-alg model inputs & options).

Key to our purposes is Anglican's ability to nest queries within one another. In particular, the special form conditional takes a query and returns a distribution object constructor, the outputs of which ostensibly correspond to the conditional distribution defined by the query, with the inputs to the query becoming its parameters. However, as we will show in the next section, the true behavior of conditional deviates from this, thereby leading to inconsistent nested inference schemes.
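As a point of reference, the following is a hypothetical usage sketch of doquery with the my-query model above. It assumes the :lmh inference algorithm keyword and the convention that each element of the returned lazy sequence carries the query's return value under a :result key; both of these are assumptions that may differ between Anglican versions.

(def data [0.3 1.2 -0.5 0.9])

;; Draw 1000 approximate posterior samples of [µ σ]; for a weighted inference
;; algorithm the associated log-weights would also need to be retained.
(def posterior-samples
  (->> (doquery :lmh my-query [data])
       (take 1000)
       (map :result)))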

3 NESTED INFERENCE

One of the clearest ways one might want to nest queries is by sampling from the conditional distribution of one query inside another. A number of examples of this are provided for Church in (Stuhlmüller and Goodman, 2014).¹ Such nested inference problems fall under a more general framework of inference for so-called doubly (or multiply) intractable distributions (Murray et al., 2006). The key feature of these problems is that they include terms with unknown, parameter dependent, normalization constants. For nested probabilistic programming queries, this manifests through conditional normalization.

¹ Though their nesting happens within the conditioning predicate, Church's semantics means they constitute nested inference.

Consider the following unnested model using the Anglican function declaration defm

(defm inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal y z) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        z (inner y D)]
    (* y z)))

Here inner is simply an Anglican function: it takes in inputs y and D, effects the trace probability through its observe statement, and returns the random variable z as output. The unnormalized distribution for this model is thus straightforwardly given by

  π_u(y, z, D) = p(y) p(z|y) p(D|y, z) = BETA(y; 2, 3) Γ(z; y, 1) N(D; y, z²),

for which we can use conventional inference schemes.

We can convert this model to a nested inference problem by using defquery and conditional as follows

(defquery inner [y D]
  (let [z (sample (gamma y 1))]
    (observe (normal y z) D)
    z))

(defquery outer [D]
  (let [y (sample (beta 2 3))
        dist (conditional inner)
        z (sample (dist y D))]
    (* y z)))

This is now a nested query: a separate inference procedure is invoked for each call of (sample (dist y D)), returning an approximate sample from the conditional distribution defined by inner when input with the current values of y and D. Mathematically, conditional applies a conditional normalization. Specifically, the component of π_u from the previous example corresponding to inner was p(z|y)p(D|y, z), and conditional locally normalizes this to the probability distribution p(z|D, y). The distribution now defined by outer is thus given by

  π_n(y, z, D) = p(y) p(z|y, D) = p(y) p(z|y) p(D|y, z) / ∫ p(z|y) p(D|y, z) dz
              = p(y) p(z|y) p(D|y, z) / p(D|y) ≠ π_u(y, z, D).

Critically, the partial normalization constant p(D|y) depends on y and so the conditional distribution is doubly intractable: we cannot evaluate π_n(y, z, D) exactly.

Another way of looking at this is that wrapping inner in conditional has "protected" y from the conditioning in inner (noting π_u(y, z, D) ∝ p(y|D) p(z|y, D)), such that its observe statement only affects the probability of z given y and not the marginal probability of y. This is why, when there is only a single layer of nesting, nested inference is equivalent to the notion of sampling from "cut distributions" (Plummer, 2015), whereby the sampling of certain subsets of the variables in a model are made with factors of the overall likelihood omitted.

It is important to note that if we had observed the output of the inner query, rather than sampling from it, this would still constitute a nested inference problem. The key to the nesting is the conditional normalization applied by conditional, not the exact usage of the generated distribution object dist. However, as discussed in Appendix B, actually observing a nested query requires numerous additional computational issues to be overcome, which are beyond the scope of this paper. We thus focus on the nested sampling scenario.
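To make the y-dependence of the local normalizer concrete, the following small plain-Clojure sketch estimates p(D|y) = E_{z∼Γ(y,1)}[N(D; y, z²)] for the example above by simple MC; sample-gamma and normal-pdf are hypothetical helpers (draw z ∼ Γ(shape, rate) and evaluate the normal density given mean and standard deviation), not part of Anglican.

(defn estimate-evidence-given-y
  "Simple MC estimate of p(D|y) for the example model, using n draws of z."
  [sample-gamma normal-pdf y D n]
  (/ (reduce + (repeatedly n #(normal-pdf D y (sample-gamma y 1))))
     n))

Running this for two different values of y (with D fixed) gives two different answers, which is exactly why the local normalization performed by conditional cannot be folded into a single overall normalizing constant.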

3.1 MOTIVATING EXAMPLE

Before jumping into a full formalization of nested inference, we first consider the motivating example of modeling a poker player who reasons about another player. Here each player has access to information the other does not, namely the cards in their hand, and they must perform their own inference to deal with the resulting uncertainty.

Imagine that the first player is deciding whether or not to bet. She could naïvely just make this decision based on the strength of her hand, but more advanced play requires her to reason about actions the other player might take given her own action, e.g. by considering whether a bluff is likely to be successful. She can carry out such reasoning by constructing a model for the other player to try and predict their action given her action and their hand. Again, this nested model could simply be based on a naïve simulation, but we can refine it by adding another layer of meta-reasoning: the other player will themselves try to infer the first player's hand to inform their own decision. These layers of meta-reasoning create a nesting: for the first player to choose an action, they must run multiple simulations for what the other player will do given that action and their hand, each of which requires inference to be carried out. Here adding more levels of meta-reasoning can produce smarter models, but also requires additional layers of nesting. We expand on this example to give a concrete nested inference problem in Appendix E.

3.2 FORMALIZATION

To formalize the nested inference problem more generally, let y and x denote all the random variables of an outer query that are respectively passed or not to the inner query. Further, let z denote all random variables generated in the inner query – for simplicity, we will assume, without loss of generality, that these are all returned to the outer query, but that some may not be used. The unnormalized density for the outer query can now be written in the form

  π_o(x, y, z) = ψ(x, y, z) p_i(z|y),    (4)

where p_i(z|y) is the normalized density of the outputs of the inner query and ψ(x, y, z) encapsulates all other terms influencing the trace probability of the outer query. Now the inner query defines an unnormalized density π_i(y, z) that can be evaluated pointwise, and we have

  p_i(z|y) = π_i(y, z) / ∫ π_i(y, z′) dz′,  giving    (5)

  p_o(x, y, z) ∝ π_o(x, y, z) = ψ(x, y, z) π_i(y, z) / ∫ π_i(y, z′) dz′,    (6)

where p_o(x, y, z) is our target distribution, for which we can directly evaluate the numerator, but the denominator is intractable and must be evaluated separately for each possible value of y. Our previous example is achieved by fixing ψ(x, y, z) = p(y) and π_i(y, z) = p(z|y) p(D|y, z). We can further straightforwardly extend to the setting of multiple layers of nesting by recursively defining π_i(y, z) in the same way as π_o(x, y, z).

3.3 RELATIONSHIP TO NESTED ESTIMATION

To relate the nested inference problem back to the nested estimation formulation from Section 2.1, we consider using a proposal q(x, y, z) = q(x, y) q(z|y) to calculate the expectation of some arbitrary function g(x, y, z) under p_o(x, y, z) as per self-normalized importance sampling

  E_{p_o(x,y,z)}[g(x, y, z)]
    = E_{q(x,y,z)}[ g(x, y, z) π_o(x, y, z) / q(x, y, z) ] / E_{q(x,y,z)}[ π_o(x, y, z) / q(x, y, z) ]
    = E_{q(x,y,z)}[ g(x, y, z) ψ(x, y, z) π_i(y, z) / ( q(x, y, z) E_{z′∼q(z|y)}[ π_i(y, z′)/q(z′|y) ] ) ]
      / E_{q(x,y,z)}[ ψ(x, y, z) π_i(y, z) / ( q(x, y, z) E_{z′∼q(z|y)}[ π_i(y, z′)/q(z′|y) ] ) ].    (7)

Here both the denominator and numerator are nested expectations, with a nonlinearity coming from the fact that we are using the reciprocal of an expectation. A similar reformulation could also be applied in cases with multiple layers of nesting, i.e. where inner itself makes use of another query. The formalization can also be directly extended to the sequential MC (SMC) setting by invoking extended space arguments (Andrieu et al., 2010).

Typically g(x, y, z) is not known upfront and we instead return an empirical measure from the program in the form of weighted samples, which can later be used to estimate an expectation. That is, if we sample (x_n, y_n) ∼ q(x, y) and z_{n,m} ∼ q(z|y_n) and return all samples (x_n, y_n, z_{n,m}) (such that each (x_n, y_n) is duplicated N_1 times in the sample set), then our unnormalized weights are given by

  w_{n,m} = ψ(x_n, y_n, z_{n,m}) π_i(y_n, z_{n,m}) / ( q(x_n, y_n, z_{n,m}) · (1/N_1) Σ_{ℓ=1}^{N_1} π_i(y_n, z_{n,ℓ}) / q(z_{n,ℓ}|y_n) ).    (8)

This, in turn, gives us the empirical measure

  p̂(·) = Σ_{n=1}^{N_0} Σ_{m=1}^{N_1} w_{n,m} δ_{(x_n, y_n, z_{n,m})}(·) / Σ_{n=1}^{N_0} Σ_{m=1}^{N_1} w_{n,m},    (9)

where δ_{(x_n, y_n, z_{n,m})}(·) is a delta function centered on (x_n, y_n, z_{n,m}). By definition, the convergence of this empirical measure to the target requires that expectation estimates calculated using it converge in probability for any integrable g(x, y, z) (presuming our proposal is valid). We thus see that the convergence of the ratio of nested expectations in (7) for any arbitrary g(x, y, z) is equivalent to the produced samples converging to the distribution defined by the program. Informally, the NMC results then tell us this will happen in the limit N_0, N_1 → ∞ provided that ∫ π_i(y, z) dz is strictly positive for all possible y (as otherwise the problem becomes ill-defined). More formally, we have the following result. Its proof, along with all others, is given in Appendix A.

Theorem 1. Let g(x, y, z) be an integrable function, let γ_0 = E_{p_o(x,y,z)}[g(x, y, z)], and let I_0 be a self-normalized MC estimate for γ_0 calculated using p̂(·) as per (9). Assuming that q(x, y, z) forms a valid importance sampling proposal distribution for p_o(x, y, z), then

  E[(I_0 − γ_0)²] = σ²/N_0 + δ²/N_1² + O(ε),    (10)

where σ and δ are constants derived in the proof and, as before, O(ε) represents asymptotically dominated terms.

Note that rather than simply being a bound, this result is an equality and thus provides the exact asymptotic rate. Using the arguments of (Rainforth et al., 2018, Theorem 3), it can be straightforwardly extended to cases of multiple nesting (giving a rate analogous to (3)), though characterizing σ and δ becomes more challenging.

3.4 CONVERGENCE REQUIREMENTS

We have demonstrated that the problem of nested inference is a particular case of nested estimation. This problem equivalence will hold whether we elect to use the aforementioned nested importance sampling based approach or not, while we see that our finite sample estimates must be biased for non-trivial g by the convexity of f_0 and Theorem 4 of Rainforth et al. (2018). Presuming that we cannot produce exact samples from the inner query and that the set of possible inputs to the inner query is not finite (these are respectively considered in Appendix D and Appendix C), we thus see that there is no "silver bullet" that can reduce the problem to a standard estimation.

We now ask, what behavior do we need from Anglican's conditional, and nested inference more generally, to ensure convergence? At a high level, the NMC results show us that we need the computational budget of each call of a nested query to become arbitrarily large, such that we use an infinite number of samples at each layer of the estimator: we require each N_k → ∞.

We have formally demonstrated convergence when this requirement is satisfied and the previously introduced nested importance sampling approach is used. Another possible approach would be to, instead of drawing samples to estimate (7) directly, importance sample N_1 times for each call of the inner query and then return a single sample from these, drawn in proportion to the inner query importance weights. We can think of this as drawing the same raw samples, but then constructing the estimator as

  p̂*(·) = Σ_{n=1}^{N_0} w*_n δ_{(x_n, y_n, z_{n,m*(n)})}(·) / Σ_{n=1}^{N_0} w*_n,    (11)

  where w*_n = ψ(x_n, y_n, z_{n,m*(n)}) / q(x_n, y_n)  and    (12)
  m*(n) ∼ DISCRETE( π_i(y_n, z_{n,m}) / q(z_{n,m}|y_n) / Σ_{ℓ=1}^{N_1} π_i(y_n, z_{n,ℓ}) / q(z_{n,ℓ}|y_n) ).

As demonstrated formally in Appendix A, this approach also converges. However, if we Rao-Blackwellize (Casella and Robert, 1996) the sampling of m*(n), we find that this recovers (9). Consequently, this is a strictly inferior estimator (it has an increased variance relative to (9)). Nonetheless, it may often be a convenient setup from the perspective of the PPS semantics and it will typically have substantially reduced memory requirements: we need only store the single returned sample from the inner query to construct our empirical measure, rather than all of the samples generated within the inner query.

Though one can use the results of Fort et al. (2017) to show the correctness of instead using an MCMC estimator for the outer query, the correctness of using MCMC methods for the inner queries is not explicitly covered by existing results. Here we find that we need to start a new Markov chain for each call of the inner query because each value of y defines a different local inference problem. One would intuitively expect the NMC results to carry over – as N_1 → ∞ all the inner queries will run their Markov chains for an infinitely long time, thereby in principle returning exact samples – but we leave formal proof of this case to future work. We note that such an approach effectively equates to what is referred to as multiple imputation by Plummer (2015).

3.5 SHORTFALLS OF EXISTING SYSTEMS

Using the empirical measure (9) provides one possible manner of producing a consistent estimate of our target by taking N_0, N_1 → ∞, and so we can use this as a gold-standard reference approach (with a large value of N_1) to assess whether Anglican returns samples from the correct target distribution. To this end, we ran Anglican's importance sampling inference engine on the simple model introduced earlier and compared its output to the reference approach using N_0 = 5 × 10⁶ and N_1 = 10³. As shown in Figure 1, the samples produced by Anglican are substantially different to the reference code, demonstrating that the outputs do not match their semantically intended distribution. For reference, we also considered the distribution induced by the aforementioned unnested model and a naïve estimation scheme where a sample budget of N_1 = 1 is used for each call to inner, effectively corresponding to ignoring the observe statement by directly returning the first draw of z.

[Figure 1: Empirical densities produced by running the nested Anglican queries given in the text, a reference NMC estimate, the unnested model, a naïve estimation scheme where N_1 = 1, and the ONMC approach introduced in Section 6, with the same computational budget of T = 5 × 10⁹ and τ_1(n_0) = min(500, √n_0). Note that the results for ONMC and the reference approach overlap.]

We see that the unnested model defines a noticeably different distribution, while the behavior of Anglican is similar, but distinct, to ignoring the observe statement in the inner query. Further investigation shows that the default behavior of conditional in a query nesting context is equivalent to using (11) but with N_1 held fixed to N_1 = 2, inducing a substantial bias. More generally, the Anglican source code shows that conditional defines a Markov chain generated by equalizing the output of the weighted samples generated by running inference on the query. When used to nest queries, this Markov chain is only ever run for a finite length of time, specifically one accept-reject step is carried out, and so does not produce samples from the true conditional distribution.

Plummer (2015) noticed that WinBugs and OpenBugs (Spiegelhalter et al., 1996) similarly do not provide valid inference when using their cut function primitives, which effectively allow the definition of nested inference problems. However, they do not notice the equivalence to the NMC formulation and instead propose a heuristic for reducing the bias that itself has no theoretical guarantees.

4 NESTED CONDITIONING

An alternative way one might wish to nest queries is to use the partition function estimate of one query to factor the trace probability of another. We refer to this as nested conditioning. In its simplest form, we can think about conditioning on the values input to the inner query. In Anglican we can carry this out by using the following custom distribution object constructor

(defdist nest [inner inputs inf-alg M] []
  (sample [this] nil)
  (observe [this _]
    (log-marginal (take M (doquery inf-alg inner inputs)))))

When the resulting distribution object is observed, this will now generate, and factor the trace probability by, a partition function estimate for inner with inputs inputs, constructed using M samples of the inference algorithm inf-alg. For example, if we were to use the query

(defquery outer [D]
  (let [y (sample (beta 2 3))]
    (observe (nest inner [y D] :smc 100) nil)
    y))

with inner from the nested inference example, then this would form a pseudo marginal sampler (Andrieu and Roberts, 2009) for the unnormalized target distribution

  π_c(y, D) = BETA(y; 2, 3) ∫ Γ(z; y, 1) N(D; y, z²) dz.

Unlike the nested inference case, nested conditioning turns out to be valid even if our budget is held fixed, provided that the partition function estimate is unbiased, as is satisfied by, for example, importance sampling and SMC. In fact, it is important to hold the budget fixed to achieve a MC convergence rate. In general, we can define our target density as

  p_o(x, y) ∝ π_o(x, y) = ψ(x, y) p_i(y),    (13)

where ψ(x, y) is as before (except that we no longer have returned variables from the inner query) and p_i(y) is the true partition function of the inner query when given input y. In practice, we cannot evaluate p_i(y) exactly, but instead produce unbiased estimates p̂_i(y). Using an analogous self-normalized importance sampling to the nested inference case leads to the weights

  w_n = ψ(x_n, y_n) p̂_i(y_n) / q(x_n, y_n)    (14)

and corresponding empirical measure

  p̂(·) = Σ_{n=1}^{N_0} w_n δ_{(x_n, y_n)}(·) / Σ_{n=1}^{N_0} w_n,    (15)

such that we are conducting conventional MC estimation, but our weights are now themselves random variables for a given (x_n, y_n) due to the p̂_i(y_n) term. However, the weights are unbiased estimates of the "true weights" ψ(x_n, y_n) p_i(y_n) / q(x_n, y_n), such that we have proper weighting (Naesseth et al., 2015) and thus convergence at the standard MC rate, provided the budget of the inner query remains fixed. This result also follows directly from Theorem 6 of Rainforth et al. (2018), which further ensures no complications arise when conditioning on multiple queries if the corresponding partition function estimates are generated independently. These results further trivially extend to the repeated nesting case by recursion, while, using the idea of pseudo-marginal methods (Andrieu and Roberts, 2009), the results also extend to using MCMC based inference for the outermost query.

Rather than just fixing the inputs to the nested query, one can also consider conditioning on the internally sampled variables in the program taking on certain values. Such a nested conditioning approach has been implicitly carried out by Rainforth et al. (2016b); Zinkov and Shan (2017); Scibior and Ghahramani (2016); Ge et al. (2018), each of which manipulate the original program in some fashion to construct a partition function estimator that is then used within a greater inference scheme, e.g. a PMMH estimator (Andrieu et al., 2010).

5 ESTIMATES AS VARIABLES

Our final case is that one might wish to use estimates as first class variables in another query. In other words, a variable in an outer query is assigned to a MC expectation estimate calculated from the outputs of running inference on another, nested, query. By comparison, the nested inference case (without Rao-Blackwellization) can be thought of as assigning a variable in the outer query to a single approximate sample from the conditional distribution of the inner query, rather than an MC expectation estimate constructed by averaging over multiple samples. Whereas nested inference can only encode a certain class of nested estimation problems – because the only nonlinearity originates from taking the reciprocal of the partition function – using estimates as variables allows, in principle, the encoding of any nested estimation. This is because using the estimate as a first class variable allows arbitrary nonlinear mappings to be applied by the outer query.

An example of this approach is shown in Appendix G, where we construct a generic estimator for Bayesian experimental design problems. Here a partition function estimate is constructed for an inner query and is then used in an outer query. The output of the outer query depends on the logarithm of this estimate, thereby creating the nonlinearity required to form a nested expectation.

Because using estimates as variables allows the encoding of any nested estimation problem, the validity of doing so is equivalent to that of NMC more generally and must thus satisfy the requirements set out in (Rainforth et al., 2018). In particular, one needs to ensure that the budgets used for the inner estimates increase as more samples of the outermost query are taken.

6 ONLINE NESTED MONTE CARLO

NMC will be highly inconvenient to actually implement in a PPS whenever one desires to provide online estimates; for example, a lazy sequence of samples that converges to the target distribution. Suppose that we have already calculated an NMC estimate, but now desire to refine it further. In general, this will require an increase to all N_k for each sample of the outermost estimator. Consequently, the previous samples of the outermost query must be revisited to refine their estimates. This significantly complicates practical implementation, necessitating additional communication between queries, introducing computational overhead, and potentially substantially increasing the memory requirements.

To highlight these shortfalls concretely, consider the nested inference class of problems and, in particular, constructing the un-Rao-Blackwellized estimator (11) in an online fashion. Increasing N_1 requires m*(n) to be redrawn for each n, which in turn necessitates storage of previous samples and weights.² This leads to an overhead cost from the extra computation carried out for revisitation and a memory overhead from having to store information about each call of the inner query.

² Note that not all previous samples and weights need storing – when making the update we can sample whether to change m*(n) or not based on combined weights from all the old samples compared to all the new samples.

Perhaps even more problematically, the need to revisit old samples when drawing new samples can cause substantial complications for implementation. Consider implementing such an approach in Anglican. Anglican is designed to return a lazy infinite sequence of samples converging to the target distribution. Once samples are taken from this sequence, they become external to Anglican and cannot be posthumously updated when further samples are requested. Even when all the output samples remain internal, revisiting samples remains difficult: one either needs to implement some form of memory for nested queries so they can be run further, or, if all information is instead stored at the outermost level, additional non-trivial code is necessary to apply post-processing and to revisit queries with previously tested inputs. The latter of these is likely to necessitate inference-algorithm-specific changes, particularly when there are multiple levels of nesting, thereby hampering the entire language construction.

To alleviate these issues, we propose to only increase the computational budget of new calls to nested queries, such that earlier calls use fewer samples than later calls. This simple adjustment removes the need for communication between different calls and requires only the storage of the number of times the outermost query has previously been sampled to make updates to the overall estimate. We refer to this approach as online NMC (ONMC), which, to the best of our knowledge, has not been previously considered in the literature. As we now show, ONMC only leads to small changes in the convergence rate of the resultant estimator compared to NMC: using recommended parameter settings, the asymptotic root mean squared error for ONMC is never more than twice that of NMC for a matched sample budget and can even be smaller.

Let τ_k(n_0) ∈ ℕ⁺, k = 1, ..., D be monotonically increasing functions dictating the number of samples used by ONMC at depth k for the n_0-th iteration of the outermost estimator. The ONMC estimator is defined as

  J_0 = (1/N_0) Σ_{n_0=1}^{N_0} f_0( y_{n_0}^(0), I_1( y_{n_0}^(0), τ_{1:D}(n_0) ) ),    (16)

where I_1(y_{n_0}^(0), τ_{1:D}(n_0)) is calculated using I_1 in (2), setting y^(0) = y_{n_0}^(0) and N_k = τ_k(n_0), ∀k ∈ 1, ..., D. For reference, the NMC estimator, I_0, is as per (16), except for replacing τ_{1:D}(n_0) with τ_{1:D}(N_0). Algorithmically, the ONMC approach is defined as follows.

Algorithm 1 Online Nested Monte Carlo
1: n_0 ← 0, J_0 ← 0
2: while true do
3:   n_0 ← n_0 + 1, y_{n_0}^(0) ∼ p(y^(0))
4:   Construct I_1(y_{n_0}^(0), τ_{1:D}(n_0)) using N_k = τ_k(n_0) ∀k
5:   J_0 ← ( (n_0 − 1) J_0 + f_0( y_{n_0}^(0), I_1(y_{n_0}^(0), τ_{1:D}(n_0)) ) ) / n_0

We see that ONMC uses fewer samples at inner layers for earlier samples of the outermost level, and that each of the resulting inner estimates is calculated as per an NMC estimator with a reduced sample budget.
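A minimal plain-Clojure sketch of Algorithm 1 for a single level of nesting (D = 1) is given below, reusing the hypothetical samplers from the NMC sketch in Section 2.1; tau is the schedule τ_1(n_0), and default-tau is a sketch of the max(T_min^(1/3), √n_0) recommendation discussed in Section 6.1.

(defn onmc-estimate
  "Runs Algorithm 1 for N0 outermost iterations with schedule tau."
  [sample-y0 sample-y1-given-y0 f0 f1 tau N0]
  (loop [n0 1, J0 0.0]
    (if (> n0 N0)
      J0
      (let [y0    (sample-y0)
            N1    (long (tau n0))
            inner (/ (reduce + (repeatedly N1 #(f1 y0 (sample-y1-given-y0 y0)))) N1)
            ;; running-average update, as in line 5 of Algorithm 1
            J0'   (/ (+ (* (dec n0) J0) (f0 y0 inner)) n0)]
        (recur (inc n0) J0')))))

(defn default-tau
  "τ(n0) = max(Tmin^(1/3), sqrt(n0)), per the recommendation in Section 6.1."
  [Tmin]
  (fn [n0] (max (Math/pow Tmin (/ 1.0 3)) (Math/sqrt n0))))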

to necessitate inference–algorithm–specific changes, par- of τk(n0) provided limn0→∞ τk(n0) = ∞: the require- ticularly when there are multiple levels of nesting, thereby ments on τk(n0) are, for example, much weaker than hampering the entire language construction. requiring a logarithmic or faster rate of growth, which To alleviate these issues, we propose to only increase the would already be an impractically slow rate of increase. computational budget of new calls to nested queries, such In the case where τk(n0) increases at a polynomial rate, that earlier calls use fewer samples than later calls. This we can further quantify the rate of convergence, along simple adjustment removes the need for communication with the relative variance and bias compared to NMC: between different calls and requires only the storage of α Theorem 3. If each τk(n0) ≥ An0 , ∀n0 > B for some the number of times the outermost query has previously constants A, B, α > 0 and each fk is continuously differ- been sampled to make updates to the overall estimate. We entiable, then refer to this approach as online NMC (ONMC), which, to 2  2 h 2i ς0 βg(α, N0) the best of our knowledge, has not been previously con- E (J0 − γ0) ≤ + α + O(), (17) sidered in the literature. As we now show, ONMC only N0 AN0  leads to small changes in the convergence rate of the re- 1/(1 − α), α < 1  sultant estimator compared to NMC: using recommended where g(α, N0) = log(N0) + η, α = 1 ; (18) parameter settings, the asymptotic root mean squared er-  α−1 ζ(α)N0 , α > 1

Theorem 3. If each τ_k(n_0) ≥ A n_0^α, ∀n_0 > B for some constants A, B, α > 0 and each f_k is continuously differentiable, then

  E[(J_0 − γ_0)²] ≤ ς_0²/N_0 + ( β g(α, N_0) / (A N_0^α) )² + O(ε),    (17)

  where g(α, N_0) = 1/(1 − α) for α < 1;  log(N_0) + η for α = 1;  ζ(α) N_0^(α−1) for α > 1;    (18)

  β = C_0 ς_1²/2 + Σ_{k=0}^{D−2} ( Π_{d=0}^{k} K_d ) C_{k+1} ς_{k+2}²/2;    (19)

η ≈ 0.577 is the Euler–Mascheroni constant; ζ is the Riemann–zeta function; and C_k, K_k, and ς_k are constants defined as per the corresponding NMC bound given in (3).

Corollary 1. Let J_0 be an ONMC estimator setup as per Theorem 3 with N_0 outermost samples and let I_0 be an NMC estimator with a matched overall sample budget. Defining c = (1 + αD)^(−1/(1+αD)), then

  Var[J_0] → c Var[I_0]  as N_0 → ∞.

Further, if the NMC bias decreases at a rate proportional to that implied by the bound given in (3), namely

  |E[I_0 − γ_0]| = b / M_0^α + O(ε)    (20)

for some constant b > 0, where M_0 is the number of outermost samples used by the NMC sampler, then

  |E[J_0 − γ_0]| ≤ c^α g(α, N_0) |E[I_0 − γ_0]| + O(ε).
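The constant c above is easy to evaluate numerically; the following plain-Clojure check, for the α = 0.5 setup, reproduces the values quoted in the discussion below.

(defn onmc-cost-factor
  "c = (1 + alpha*D)^(-1/(1 + alpha*D)) from Corollary 1."
  [alpha D]
  (Math/pow (+ 1.0 (* alpha D)) (- (/ 1.0 (+ 1.0 (* alpha D))))))

;; (map #(onmc-cost-factor 0.5 %) [1 2 3 4 5])
;; => approximately (0.763 0.707 0.693 0.693 0.699)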

We expect the assumption that the bias scales as 1/M_0^α to be satisfied in the vast majority of scenarios, but there may be edge cases, e.g. when an f_k gives a constant output, for which faster rates are observed. Critically, the assumption holds for all nested inference problems because the rate given in (10) is an equality.

We see that if α < 1, which will generally be the case in practice for sensible setups, then the convergence rates for ONMC and NMC vary only by a constant factor. Specifically, for a fixed value of N_0, they have the same asymptotic variance and ONMC has a factor of 1/(1−α) higher bias. However, the cost of ONMC is (asymptotically) only c < 1 times that of NMC, so for a fixed overall sample budget it has lower variance.

As the bound varies only in constant factors for α < 1, the asymptotically optimal value of α for ONMC is the same as that for NMC, namely α = 0.5 (Rainforth et al., 2018). For this setup, we have c ∈ {0.763, 0.707, 0.693, 0.693, 0.699, 1} respectively for D ∈ {1, 2, 3, 4, 5, ∞}. Consequently, when α = 0.5, the fixed budget variance of ONMC is always better than NMC, while the bias is no more than 1.75 times larger if D ≤ 13 and no more than 2 times larger more generally.

6.1 EMPIRICAL CONFIRMATION

To test ONMC empirically, we consider the simple analytic model given in Appendix F, setting τ_1(n_0) = max(25, √n_0). The rationale for setting a minimum value for N_1 is to minimize the burn-in effect of ONMC – earlier samples will have larger bias than later samples and we can mitigate this by ensuring a minimum value for N_1. More generally, we recommend setting (in the absence of other information) τ_1(n_0) = τ_2(n_0) = ··· = τ_D(n_0) = max(T_min^(1/3), √n_0), where T_min is the minimum overall budget we expect to spend. In Figure 2, we have chosen to set T_min deliberately low so as to emphasize the differences between NMC and ONMC. Given our value for T_min, the ONMC approach is identical to fixing N_1 = 25 for T < 25³ = 15625, but unlike fixing N_1, it continues to improve beyond this because it is not limited by asymptotic bias. Instead, we see an inflection-point-like behavior around T_min, with the rate recovering to effectively match that of the NMC estimator.

[Figure 2: Convergence of ONMC, NMC, and fixed N_1. Results are averaged over 1000 runs, with solid lines showing the mean and shading the 25-75% quantiles. The theoretical rates for NMC are shown by the dashed lines.]

6.2 USING ONMC IN PPSs

Using ONMC based estimation schemes to ensure consistent estimation for nested inference in PPSs is straightforward – the number of iterations the outermost query has been run for is stored and used to set the number of iterations used for the inner queries. In fact, even this minimal level of communication is not necessary – n_0 can be inferred from the number of times we have previously run inference on the current query, the current depth k, and τ_1(·), ..., τ_{k−1}(·).

As with NMC, for nested inference problems ONMC can either return a single sample from each call of a nested query, or Rao-Blackwellize the drawing of this sample when possible. These respectively produce estimators analogous to (11) and (9), except that N_1 in the definition of the inner weights is now a function of n. Returning to Figure 1, we see that using ONMC with nested importance sampling and only returning a single sample corrects the previous issues with how Anglican deals with nested inference, producing samples indistinguishable from the reference code.

7 CONCLUSIONS

We have formalized the notion of nesting probabilistic program queries and investigated the statistical validity of different categories of nesting. We have found that current systems tend to use methods that lead to asymptotic bias for nested inference problems, but that they are consistent for nested conditioning. We have shown how to carry out the former in a consistent manner and developed a new online estimator that simplifies the construction of algorithms that satisfy the conditions required for convergence.

References

C. Andrieu and G. O. Roberts. The pseudo-marginal approach for efficient Monte Carlo computations. The Annals of Statistics, pages 697–725, 2009.
C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 2010.
G. Casella and C. P. Robert. Rao-Blackwellisation of sampling schemes. Biometrika, 83(1):81–94, 1996.
K. Chaloner and I. Verdinelli. Bayesian experimental design: A review. Statistical Science, 1995.
R. Cornish, F. Wood, and H. Yang. Efficient exact inference in discrete Anglican programs. 2017.
K. Csilléry, M. G. Blum, O. E. Gaggiotti, and O. François. Approximate Bayesian Computation (ABC) in practice. Trends in Ecology & Evolution, 25(7):410–418, 2010.
M. F. Cusumano-Towner and V. K. Mansinghka. Using probabilistic programs as proposals. arXiv preprint arXiv:1801.03612, 2018.
G. Fort, E. Gobet, and E. Moulines. MCMC design-based non-parametric regression for rare-event. Application to nested risk computations. Monte Carlo Methods Appl, 2017.
H. Ge, K. Xu, and Z. Ghahramani. Turing: a language for composable probabilistic inference. In AISTATS, 2018.
N. Goodman, V. Mansinghka, D. M. Roy, K. Bonawitz, and J. B. Tenenbaum. Church: a language for generative models. UAI, 2008.
N. D. Goodman and A. Stuhlmüller. The Design and Implementation of Probabilistic Programming Languages. 2014.
A. D. Gordon, T. A. Henzinger, A. V. Nori, and S. K. Rajamani. Probabilistic programming. In Proceedings of the Future of Software Engineering. ACM, 2014.
R. Hickey. The Clojure programming language. In Proceedings of the 2008 Symposium on Dynamic Languages, page 1. ACM, 2008.
L. J. Hong and S. Juneja. Estimating the mean of a non-linear function of conditional expectation. In Winter Simulation Conference, 2009.
T. A. Le, A. G. Baydin, and F. Wood. Nested compiled inference for hierarchical reinforcement learning. In NIPS Workshop on Bayesian Deep Learning, 2016.
V. Mansinghka, D. Selsam, and Y. Perov. Venture: a higher-order probabilistic programming platform with programmable inference. arXiv preprint arXiv:1404.0099, 2014.
T. Mantadelis and G. Janssens. Nesting probabilistic inference. arXiv preprint arXiv:1112.3785, 2011.
I. Murray, Z. Ghahramani, and D. J. MacKay. MCMC for doubly-intractable distributions. In UAI, 2006.
C. A. Naesseth, F. Lindsten, and T. B. Schön. Nested sequential Monte Carlo methods. In ICML, 2015.
L. Ouyang, M. H. Tessler, D. Ly, and N. Goodman. Practical optimal experiment design with probabilistic programs. arXiv preprint arXiv:1608.05046, 2016.
M. Plummer. Cuts in Bayesian graphical models. Statistics and Computing, 25(1):37–43, 2015.
J. G. Propp and D. B. Wilson. Exact sampling with coupled Markov chains and applications to statistical mechanics. Random Structures and Algorithms, 9(1-2):223–252, 1996.
T. Rainforth. Automating Inference, Learning, and Design using Probabilistic Programming. PhD thesis, 2017.
T. Rainforth, R. Cornish, H. Yang, and F. Wood. On the pitfalls of nested Monte Carlo. NIPS Workshop on Advances in Approximate Bayesian Inference, 2016a.
T. Rainforth, T. A. Le, J.-W. van de Meent, M. A. Osborne, and F. Wood. Bayesian optimization for probabilistic programs. In NIPS, pages 280–288, 2016b.
T. Rainforth, R. Cornish, H. Yang, A. Warrington, and F. Wood. On nesting Monte Carlo estimators. In ICML, 2018.
A. Scibior and Z. Ghahramani. Modular construction of Bayesian inference algorithms. In NIPS Workshop on Advances in Approximate Bayesian Inference, 2016.
D. Spiegelhalter, A. Thomas, N. Best, and W. Gilks. BUGS 0.5: Bayesian inference using Gibbs sampling manual (version ii). MRC Biostatistics Unit, Cambridge, 1996.
A. Stuhlmüller and N. D. Goodman. A dynamic programming algorithm for inference in recursive probabilistic programs. In Second Statistical Relational AI workshop at UAI 2012 (StaRAI-12), 2012.
A. Stuhlmüller and N. D. Goodman. Reasoning about reasoning by nested conditioning: Modeling theory of mind with probabilistic programs. Cognitive Systems Research, 28:80–99, 2014.
D. Tolpin, J.-W. van de Meent, and F. Wood. Probabilistic programming in Anglican. Springer, 2015.
D. Tolpin, J.-W. van de Meent, H. Yang, and F. Wood. Design and implementation of probabilistic programming language Anglican. In Proceedings of the 28th Symposium on the Implementation and Application of Functional Programming Languages. ACM, 2016.
F. Wood, J. W. van de Meent, and V. Mansinghka. A new approach to probabilistic programming inference. In AISTATS, pages 2–46, 2014.
R. Zinkov and C.-C. Shan. Composing inference algorithms as program transformations. In UAI, 2017.