Nesting Probabilistic Programs
Tom Rainforth
Department of Statistics, University of Oxford
[email protected]

Abstract

We formalize the notion of nesting probabilistic programming queries and investigate the resulting statistical implications. We demonstrate that while query nesting allows the definition of models which could not otherwise be expressed, such as those involving agents reasoning about other agents, existing systems take approaches which lead to inconsistent estimates. We show how to correct this by delineating possible ways one might want to nest queries and asserting the respective conditions required for convergence. We further introduce a new online nested Monte Carlo estimator that makes it substantially easier to ensure these conditions are met, thereby providing a simple framework for designing statistically correct inference engines. We prove the correctness of this online estimator and show that, when using the recommended setup, its asymptotic variance is always better than that of the equivalent fixed estimator, while its bias is always within a factor of two.

1 INTRODUCTION

Probabilistic programming systems (PPSs) allow probabilistic models to be represented in the form of a generative model and statements for conditioning on data (Goodman et al., 2008; Gordon et al., 2014). Informally, one can think of the generative model as the definition of a prior, the conditioning statements as the definition of a likelihood, and the output of the program as samples from a posterior distribution. Their core philosophy is to decouple model specification and inference, the former corresponding to the user-specified program code and the latter to an inference engine capable of operating on arbitrary programs. Removing the need for users to write inference algorithms significantly reduces the burden of developing new models and makes effective statistical methods accessible to non-experts.

Some, so-called universal, systems (Goodman et al., 2008; Goodman and Stuhlmüller, 2014; Mansinghka et al., 2014; Wood et al., 2014) further allow the definition of models that would be hard, or even impossible, to convey using conventional frameworks such as graphical models. One enticing way they do this is by allowing arbitrary nesting of models, known in the probabilistic programming literature as queries (Goodman et al., 2008), such that it is easy to define and run problems that fall outside the standard inference framework (Goodman et al., 2008; Mantadelis and Janssens, 2011; Stuhlmüller and Goodman, 2014; Le et al., 2016). This allows the definition of models that could not be encoded without nesting, such as experimental design problems (Ouyang et al., 2016) and various models for theory-of-mind (Stuhlmüller and Goodman, 2014). In particular, models that involve agents reasoning about other agents require, in general, some form of nesting. For example, one might use such nesting to model a poker player reasoning about another player, as shown in Section 3.1. As machine learning increasingly tackles problem domains that require interaction with humans or other external systems, such as the need for self-driving cars to account for the behavior of pedestrians, we believe that such nested problems are likely to become increasingly common and that PPSs will form a powerful tool for encoding them.

However, previous work has, in general, implicitly, and incorrectly, assumed that the convergence results from standard inference schemes carry over directly to the nested setting. In truth, inference for nested queries falls outside the scope of conventional proofs, and so additional work is required to prove the consistency of PPS inference engines for nested queries. Such problems constitute special cases of nested estimation. In particular, the use of Monte Carlo (MC) methods by most PPSs means they form particular instances of nested Monte Carlo (NMC) estimation (Hong and Juneja, 2009). Recent work (Rainforth et al., 2016a, 2018; Fort et al., 2017) has demonstrated that NMC is consistent for a general class of models, but also that it entails a convergence rate in the total computational cost which decreases exponentially with the depth of the nesting. Furthermore, additional assumptions are required to achieve this convergence, most noticeably that, except in a few special cases, one needs to drive not only the total number of samples used to infinity, but also the number of samples used at each layer of the estimator, a requirement generally flouted by existing PPSs.

The aim of this work is to formalize the notion of query nesting and use these recent NMC results to investigate the statistical correctness of the resulting procedures carried out by PPS inference engines. To do this, we postulate that there are three distinct ways one might nest one query within another: sampling from the conditional distribution of another query (which we refer to as nested inference), factoring the trace probability of one query with the partition function estimate of another (which we refer to as nested conditioning), and using expectation estimates calculated using one query as first-class variables in another. We use the aforementioned NMC results to assess the relative correctness of each of these categories of nesting. In the interest of exposition, we will mostly focus on the PPS Anglican (Tolpin et al., 2016; Wood et al., 2014) (and also occasionally Church (Goodman et al., 2008)) as a basis for our discussion, but note that our results apply more generally. For example, our nested inference case covers the problem of sampling from cut distributions in OpenBugs (Plummer, 2015).

We find that nested inference is statistically challenging and incorrectly handled by existing systems, while nested conditioning is statistically straightforward and done correctly. Using estimates as variables turns out to be exactly equivalent to generic NMC estimation and must thus be dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases.

To assist in the development of consistent approaches, we further introduce a new online NMC (ONMC) scheme that obviates the need to revisit previous samples when refining estimates, thereby simplifying the process of writing consistent online nested estimation schemes, as required by most PPSs. We show that ONMC's convergence rate only varies by a small constant factor relative to conventional NMC: given some weak assumptions and the use of recommended parameter settings, its asymptotic variance is always better than that of the equivalent NMC estimator with a matched total sample budget, while its asymptotic bias is always within a factor of two.

2 BACKGROUND

2.1 NESTED MONTE CARLO

We start by providing a brief introduction to NMC, using similar notation to that of Rainforth et al. (2018). Conventional MC estimation approximates an intractable expectation $\gamma_0$ of a function $\lambda$ using

$$\gamma_0 = \mathbb{E}\left[\lambda\left(y^{(0)}\right)\right] \approx I_0 = \frac{1}{N_0} \sum_{n=1}^{N_0} \lambda\left(y_n^{(0)}\right), \tag{1}$$

where $y_n^{(0)} \overset{\text{i.i.d.}}{\sim} p(y^{(0)})$, resulting in a mean squared error (MSE) that decreases at a rate $O(1/N_0)$. For nested estimation problems, $\lambda(y^{(0)})$ is itself intractable, corresponding to a nonlinear mapping of a (nested) estimation. Thus, in the single nesting case, $\lambda(y^{(0)}) = f_0\left(y^{(0)}, \mathbb{E}\left[f_1(y^{(0)}, y^{(1)}) \mid y^{(0)}\right]\right)$, giving

$$\gamma_0 = \mathbb{E}\left[f_0\left(y^{(0)}, \mathbb{E}\left[f_1\left(y^{(0)}, y^{(1)}\right) \mid y^{(0)}\right]\right)\right] \approx I_0 = \frac{1}{N_0} \sum_{n=1}^{N_0} f_0\left(y_n^{(0)}, \frac{1}{N_1} \sum_{m=1}^{N_1} f_1\left(y_n^{(0)}, y_{n,m}^{(1)}\right)\right),$$

where each $y_{n,m}^{(1)} \sim p(y^{(1)} \mid y_n^{(0)})$ is drawn independently and $I_0$ is now an NMC estimate using $T = N_0 N_1$ samples.
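To make the single-nesting estimator concrete, the following is a minimal sketch in Clojure (Anglican's host language). All names and the particular choices of $f_0$, $f_1$, and the sampling distributions are our own illustrative assumptions, not constructs from the paper. We take $f_0(y^{(0)}, z) = z^2$ and $f_1(y^{(0)}, y^{(1)}) = y^{(0)} + y^{(1)}$ with independent standard normal draws, so that $\gamma_1(y^{(0)}) = y^{(0)}$ and the true value is $\gamma_0 = \mathbb{E}[(y^{(0)})^2] = 1$:

```clojure
;; Sketch of the single-nesting NMC estimator (illustrative, not from the paper).
(def ^:private rng (java.util.Random. 42))
(defn- gauss [] (.nextGaussian rng))              ; draw from N(0, 1)

(defn mean [xs] (/ (reduce + xs) (count xs)))

(defn nmc-single
  "I_0 = (1/N0) sum_n f0(y0_n, (1/N1) sum_m f1(y0_n, y1_nm))."
  [f0 f1 sample-y0 sample-y1 N0 N1]
  (mean
   (repeatedly N0
     (fn []
       (let [y0    (sample-y0)
             inner (mean (repeatedly N1 #(f1 y0 (sample-y1 y0))))]
         (f0 y0 inner))))))

;; Example: f0(y0, z) = z^2, f1(y0, y1) = y0 + y1, y0 and y1 ~ N(0, 1).
(nmc-single (fn [_ z] (* z z))
            (fn [y0 y1] (+ y0 y1))
            gauss
            (fn [_] (gauss))
            1000 100)
;; => roughly 1.01: the estimator carries an upward bias of about 1/N1
;;    on the true value 1, vanishing only as N1 -> infinity.
```

The residual $1/N_1$ bias visible in this toy example is exactly the phenomenon discussed above: consistency requires driving the number of samples at the inner layer, not just the total budget, to infinity.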
More generally, one may have multiple layers of nesting. To notate this, we first presume some fixed integral depth $D \geq 0$ (with $D = 0$ corresponding to conventional estimation) and real-valued functions $f_0, \dots, f_D$. We then recursively define

$$\gamma_D\left(y^{(0:D-1)}\right) = \mathbb{E}\left[f_D\left(y^{(0:D)}\right) \mid y^{(0:D-1)}\right], \quad \text{and}$$
$$\gamma_k\left(y^{(0:k-1)}\right) = \mathbb{E}\left[f_k\left(y^{(0:k)}, \gamma_{k+1}\left(y^{(0:k)}\right)\right) \mid y^{(0:k-1)}\right]$$

for $0 \leq k < D$. Our goal is to estimate $\gamma_0 = \mathbb{E}\left[f_0\left(y^{(0)}, \gamma_1\left(y^{(0)}\right)\right)\right]$, for which the NMC estimate is $I_0$, defined recursively using

$$I_D\left(y^{(0:D-1)}\right) = \frac{1}{N_D} \sum_{n_D=1}^{N_D} f_D\left(y^{(0:D-1)}, y_{n_D}^{(D)}\right), \quad \text{and}$$
$$I_k\left(y^{(0:k-1)}\right) = \frac{1}{N_k} \sum_{n_k=1}^{N_k} f_k\left(y^{(0:k-1)}, y_{n_k}^{(k)}, I_{k+1}\left(y^{(0:k-1)}, y_{n_k}^{(k)}\right)\right) \tag{2}$$

for $0 \leq k < D$, where each $y_{n_k}^{(k)} \sim p\left(y^{(k)} \mid y^{(0:k-1)}\right)$ is drawn independently. Note that there are multiple values of $y^{(k)}$ for each associated $y^{(0:k-1)}$ and that $I_k\left(y^{(0:k-1)}\right)$ is still a random variable given $y^{(0:k-1)}$.
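The recursion in (2) translates directly into code. Below is a hedged sketch, again our own illustration rather than anything from the paper: `fs` holds $f_0, \dots, f_D$, `samplers` holds the conditional samplers $p(y^{(k)} \mid y^{(0:k-1)})$, and `Ns` holds the per-level sample counts $N_k$. As a small simplification relative to (2), each $f_k$ here receives the whole trace $y^{(0:k)}$ as a single vector rather than separate arguments; with $D = 1$ it reproduces the single-nesting estimator sketched earlier:

```clojure
;; Sketch of the recursive NMC estimator I_k from Eq. (2) (names our own).
(def ^:private rng (java.util.Random. 42))
(defn- gauss [] (.nextGaussian rng))

(defn nmc
  ;; fs       - vector [f0 ... fD]; fk takes the trace y^(0:k) plus, for k < D,
  ;;            the inner estimate I_{k+1}; fD takes the trace alone.
  ;; samplers - (nth samplers k) maps a trace y^(0:k-1) to a draw of y^(k).
  ;; Ns       - vector of per-level sample counts [N0 ... ND].
  ([fs samplers Ns] (nmc fs samplers Ns [] 0))
  ([fs samplers Ns trace k]
   (let [D    (dec (count fs))
         f    (nth fs k)
         draw (nth samplers k)
         Nk   (nth Ns k)
         term (fn []
                (let [tr (conj trace (draw trace))]
                  (if (= k D)
                    (f tr)                                    ; base case: f_D
                    (f tr (nmc fs samplers Ns tr (inc k)))))) ]
     (/ (reduce + (repeatedly Nk term)) Nk))))

;; The D = 1 example from before (true value gamma_0 = 1):
(nmc [(fn [tr inner] (* inner inner))          ; f_0(y^(0), I_1)
      (fn [tr] (+ (nth tr 0) (nth tr 1)))]     ; f_1(y^(0), y^(1))
     [(fn [_] (gauss))                         ; p(y^(0))
      (fn [_] (gauss))]                        ; p(y^(1) | y^(0))
     [1000 100])
```

Note that the total sample budget is $T = \prod_k N_k$, which is what drives the exponential-in-$D$ convergence cost discussed in the introduction.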
E f0 y ; γ1 y , for which the NMC estimate is We find that nested inference is statistically challenging I0 defined recursively using and incorrectly handled by existing systems, while nested ND 1 X I y(0:D−1) = f y(0:D−1); y(D) conditioning is statistically straightforward and done cor- D D nD and ND rectly. Using estimates as variables turns out to be exactly nD =1 equivalent to generic NMC estimation and must thus be (0:k−1) Ik y (2) dealt with on a case-by-case basis. Consequently, we will focus more on nested inference than the other cases. Nk 1 X (0:k−1) (k) (0:k−1) (k) = fk y ; y ;Ik+1 y ; y To assist in the development of consistent approaches, we N nk nk k n =1 further introduce a new online NMC (ONMC) scheme k (k) (k) (0:k−1) that obviates the need to revisit previous samples when for 0 ≤ k < D, where each yn ∼ p y jy is refining estimates, thereby simplifying the process of writ- drawn independently. Note that there are multiple values (k) (0:k−1) (0:k−1) ing consistent online nested estimation schemes, as re- of y for each associated y and that Ik y quired by most PPSs. We show that ONMC’s convergence is still a random variable given y(0:k−1). rate only varies by a small constant factor relative to con- As shown by (Rainforth et al., 2018, Theorem 3), if each ventional NMC: given some weak assumptions and the fk is continuously differentiable and use of recommended parameter settings, its asymptotic 2 2 (0:k) (0:k) (0:k−1) variance is always better than the equivalent NMC estima- &k = E fk y ; γk+1 y −γk y tor with matched total sample budget, while its asymptotic < 1 8k 2 0;:::;D, then the MSE converges at rate bias is always within a factor of two. 2 h 2i &0 E (I0 − γ0) ≤ + 2 BACKGROUND N0 2 (3) 2 D−2 k ! 2 ! 2.1 NESTED MONTE CARLO C0&1 X Y Ck+1&k+2 + Kd + O() 2N1 2Nk+2 We start by providing a brief introduction to NMC, us- k=0 d=0 ing similar notation to that of Rainforth et al. (2018). where Kk and Ck are respectively bounds on the magni- Conventional MC estimation approximates an intractable tude of the first and second derivatives of fk, and O() Anglican queries are written using the macro defquery. represents asymptotically dominated terms – a convention This allows users to define a model using a mixture of we will use throughout.