Approximate Inference in Discrete Distributions with Monte Carlo Tree Search and Value Functions

Lars Buesing, Theophane Weber, Nicolas Heess
DeepMind

Abstract

Exact probabilistic inference in discrete models is often prohibitively expensive, as it may require evaluating the (unnormalized) target density on its entire domain. Here we consider the setting where only a limited budget of calls to the unnormalized target density oracle is available, raising the challenge of where in the domain to allocate these function calls in order to construct a good approximate solution. We formulate this problem as an instance of sequential decision-making under uncertainty and leverage methods from reinforcement learning for probabilistic inference with budget constraints. In particular, we propose the TREESAMPLE algorithm, an adaptation of Monte Carlo Tree Search to approximate inference. This algorithm caches all previous queries to the density oracle in an explicit search tree, and dynamically allocates new queries based on a "best-first" heuristic for exploration, using existing upper confidence bound methods. Our non-parametric inference method can be effectively combined with neural networks that compile approximate conditionals of the target, which are then used to guide the inference search and enable generalization across multiple target distributions. We show empirically that TREESAMPLE outperforms standard approximate inference methods on synthetic factor graphs.

1 Introduction

Probabilistic models are often easy to specify, e.g. by multiplying non-negative functions that each reflect an independent piece of information, yielding an unnormalized target density (UTD). However, extracting knowledge from the model, such as marginal distributions of variables, is notoriously difficult. For discrete distributions, inference is #P-complete (Roth, 1996), and thus at least as hard as (and suspected to be much harder than) NP-complete problems (Stockmeyer, 1985).

The hardness of exact inference, which often prevents its application in practice, has led to the development of numerous approximate methods such as Markov Chain Monte Carlo (MCMC) and Sequential Monte Carlo (SMC) (Hastings, 1970; Del Moral et al., 2006). Whereas exact inference methods need to evaluate and sum the UTD over its entire domain in the worst case, approximate methods attempt to reduce computation by concentrating evaluations of the UTD on regions of the domain that contribute most to the probability mass. The exact locations of high-probability regions are, however, often unknown a-priori, and different approaches use a variety of means to identify them efficiently. In continuous domains, Hamiltonian Monte Carlo, for instance, guides a set of particles towards high-density regions by using gradients of the target density (Neal et al., 2011). Instead of exclusively relying on a-priori knowledge (such as a gradient oracle), adaptive approximation methods use the outcome of previous evaluations of the UTD to dynamically allocate subsequent evaluations on promising parts of the domain (Mansinghka et al., 2009; Andrieu and Thoms, 2008). This can be formalized as an instance of decision-making under uncertainty, where acting corresponds to evaluating the UTD in order to discover probability mass in the domain (Lu et al., 2018). From this viewpoint, approximate inference methods attempt to explore the target domain based on a-priori information about the target density as well as on partial feedback from previous evaluations of the UTD.

In this work, we propose a new approximate inference method for discrete distributions, termed TREESAMPLE, that is motivated by the correspondence between probabilistic inference and sequential decision-making highlighted previously in the literature, e.g. (Dayan and Hinton, 1997; Rawlik et al., 2013; Weber et al., 2015). TREESAMPLE approximates a joint distribution over multiple discrete variables by a sequential decision-making approach: variables are inferred / sampled one at a time, conditioned on all previous ones in an arbitrary, pre-specified ordering.

An explicit tree-structured cache of all previous UTD evaluations is maintained, and a heuristic inspired by Upper Confidence Bounds on Trees (UCT) (Kocsis and Szepesvári, 2006) is applied to trade off exploring around configurations that were previously found to yield high values of the UTD against exploring regions that have not yet been visited. Algorithmically, TREESAMPLE amounts to a variant of Monte Carlo Tree Search (MCTS) (Browne et al., 2012), modified so that it performs integration rather than optimization. In contrast to other approximate methods, we are interested in compiling a good approximation PX using a fixed computational budget:

Input: Factor oracles ψ1, ..., ψM with known scope(ψm); budget B ∈ N of pointwise evaluations of any ψm
Output: Approximation PX ≈ P∗X that allows tractable sampling

A brute force approach would exhaustively compute all conditionals P∗X1|∅, P∗X2|x1 up to P∗XN|x<N and resort to ancestral sampling.
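To make the oracle interface concrete, here is a minimal Python sketch of this problem setup, with a hard budget counter on pointwise factor evaluations. The names (FactorOracle, evaluate) are ours and purely illustrative, not from the paper:

```python
import numpy as np

class FactorOracle:
    """Pointwise access to factors psi_1, ..., psi_M with a hard budget B."""

    def __init__(self, factors, scopes, budget):
        self.factors = factors   # factors[m]: tuple of states -> non-negative float
        self.scopes = scopes     # scopes[m]: indices of the variables psi_m depends on
        self.budget = budget     # remaining pointwise evaluations (B in the text)

    def evaluate(self, m, x):
        """Evaluate psi_m on the assignment x (a dict: variable index -> state)."""
        if self.budget <= 0:
            raise RuntimeError("evaluation budget B exhausted")
        self.budget -= 1
        return self.factors[m](tuple(x[i] for i in self.scopes[m]))

# Example: a single chain factor psi(x0, x1) = exp(-(x0 - x1)^2) and budget B = 100.
oracle = FactorOracle(
    factors=[lambda v: np.exp(-(v[0] - v[1]) ** 2)],
    scopes=[(0, 1)],
    budget=100,
)
print(oracle.evaluate(0, {0: 1, 1: 2}))  # consumes 1 of the 100 allowed evaluations
```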

3.1 Sequential Decision-Making Representation

We first fix an arbitrary ordering of the variables X1, ..., XN; for now any ordering will do, but see the discussion in sec. 5. We then construct an episodic, maximum-entropy MDP M = ((X≤1, ..., X≤N), A, ◦, R) consisting of episodes of length N. The state space at time step n is X≤n.

Algorithm 1 TREESAMPLE sampling procedure
  1: procedure SAMPLE(tree T, default Qφ)
  2:     x ← ∅
  3:     for n = 1, ..., N do
  4:         if x ∈ T then
  5:             a ∼ softmax Qn(·|x)
  6:         else
  7:             a ∼ softmax Qφn(·|x)
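A direct Python transcription of Algorithm 1 might look as follows. This is a sketch under stated assumptions, using prefix tuples as states and a dict from prefixes to arrays of Q-values as the tree T; the function names are ours, not the paper's:

```python
import numpy as np

def softmax_sample(rng, logits):
    # Sample an index from softmax(logits); the logits play the role of Q-values.
    p = np.exp(logits - logits.max())
    return rng.choice(len(logits), p=p / p.sum())

def sample(tree, q_default, N, rng):
    """Algorithm 1: draw one configuration x ~ P_X from the tree-structured
    approximation. `tree` maps a prefix tuple x_{<=n} to an array of Q-values
    Q_{n+1}(.|x); `q_default(n, x)` is the prior Q^phi used once we leave T."""
    x = ()                                            # the root, i.e. the empty prefix
    for n in range(N):
        if x in tree:
            a = softmax_sample(rng, tree[x])          # in-tree: cached Q-values
        else:
            a = softmax_sample(rng, q_default(n, x))  # off-tree: fall back to Q^phi
        x = x + (a,)                                  # x <- x ∘ a
    return x
```

With a uniform default, e.g. q_default = lambda n, x: np.zeros(K) and rng = np.random.default_rng(), the off-tree fallback samples actions uniformly.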

Sampling x_new from PX is defined in algorithm 1: the tree is traversed from the root ∅ and at each node, a child is sampled from the softmax distribution defined by Q. If at any point a node x≤n is reached that is not in T, the algorithm falls back to a distribution defined by a user-specified, default state-action value function Qφ; we will also refer to Qφ as the prior state-action value function, as it assigns values before any evaluation of the reward. Later, we will discuss using learned, parametric functions for Qφ. In the following, we describe how the partial tree T is constructed using a given budget B of evaluations of the factors ψm.

3.2.1 Tree Construction with Soft-Bellman MCTS

TREESAMPLE leverages the correspondence of approximate inference and decision-making that we have discussed above. It consists of an MCTS-like algorithm that iteratively constructs the tree T underlying the approximation PX. Given a partially-built tree T, the tree is expanded (if budget is still available) using a heuristic inspired by Upper Confidence Bound (UCB) methods (Auer et al., 2002). It aims to expand the tree at branches expected to make large contributions to the probability mass, taking into account how important a branch currently seems, given by its current Q-value estimates, as well as a measure of uncertainty of this estimate. The latter is approximated by a computationally cheap heuristic based on the visit counts of the branch, i.e. how many reward evaluations have been made in this branch. The procedure prefers to explore branches with high Q-values and high uncertainty (low visit counts). The pseudo code of TREESAMPLE is given in full in the supplementary material, but is briefly summarized in the following.

Each node x≤n in T maintains Q-value estimates and visit counts for its actions; actions are selected by a PUCT-like decision rule (eqn. 5), whose positive constants determine the influence of the second term. This term can be seen as a form of exploration bonus and is computed from the inverse visit count of the action a relative to the visit counts of the parent. This rule is inspired by the PUCT variant employed in (Silver et al., 2016), but uses the default value Qφ for the exploration bonus. When a new node x_new ∉ T at depth n is reached, the reward function Rn(x_new) is evaluated, decreasing our budget B. The result is cached and the node is added, T ← T ∪ {x_new}, using Qφ to initialize Qn+1(·|x_new). Then the Q-values are updated: on the path of the tree-traversal that led to x_new, the values are backed up in reverse order using the soft-Bellman equation (both steps are sketched in code below). This constitutes the main difference to standard MCTS methods, which employ max- or averaging backups, and it reflects the difference between sampling / integration and the usual application of MCTS to maximization / optimization problems. Once the entire budget is spent, T with its tree-structured Q is returned.

3.2.2 Consistency

As argued above, the exact conditionals γ∗n+1(·|x≤n) can be computed by exhaustive search in exponential time. Therefore, a reasonable desideratum for any inference algorithm is that, given a large enough budget B ≥ MK^N, the exact distribution is inferred. In the following we show that TREESAMPLE passes this basic sanity check. The first important property of TREESAMPLE is that a tree T has the exact conditional γ∗n+1(·|x≤n) if the unnormalized target density has been evaluated on all states with prefix x≤n during tree construction. To make this statement precise, we define Tx≤n ⊆ T as the sub-tree of T consisting of the node x≤n and all its descendants in T. We call a sub-tree Tx≤n fully expanded, or complete, if all partial states with prefix x≤n are in Tx≤n. With this definition, we have the following lemma (proof in the supplementary material):

Lemma 1. Let Tx≤n be a fully expanded sub-tree of T. Then, for all nodes x′≤m in Tx≤n, i.e. m ≥ n and x′≤n = x≤n, the state-action values are exact:

    Qm+1(·|x′≤m) = Q∗m+1(·|x′≤m).

Furthermore, constructing the full tree T∗N with TREESAMPLE yields an exact approximation to Q∗ ∝ log γ∗.

In principle, an appropriate Qφ can guide the search towards regions in X where probability mass is likely to be found a-priori, by the following two mechanisms: it scales the exploration bonus in the PUCT-like decision rule eqn. 5, and it is used to initialize the state-action values Q for a newly expanded node in the search tree.
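The two per-simulation steps just described, PUCT-like selection with a Qφ-based exploration bonus and soft-Bellman backups along the traversed path, can be sketched as follows. The excerpt does not reproduce eqn. 5, so the selection rule below borrows the AlphaGo-style PUCT shape with softmax(Qφ) as prior; treat this shape, and all names, as assumptions rather than the paper's exact rule:

```python
import numpy as np
from scipy.special import logsumexp

def select_action(q, q_prior, counts, c=1.0):
    """PUCT-like rule: exploit current Q-values plus an exploration bonus that
    scales with the prior softmax(Q^phi) and with the inverse visit count of
    action a relative to the parent's total visits (a guess at eqn. 5's shape)."""
    prior = np.exp(q_prior - logsumexp(q_prior))           # softmax(Q^phi)
    bonus = c * prior * np.sqrt(counts.sum() + 1.0) / (1.0 + counts)
    return int(np.argmax(q + bonus))

def soft_bellman_backup(tree, rewards, path):
    """Back values up in reverse order along the traversal path.

    path = [(), (a1,), (a1, a2), ...]; rewards[x] caches the log-reward R_n(x);
    tree[x] holds the array Q_{n+1}(.|x). Soft-Bellman backup:
        Q(a|x) = R(x + (a,)) + V(x + (a,)),  with  V(y) = logsumexp_a Q(a|y),
    whereas standard MCTS would back up a max (optimization) or an average;
    the logsumexp is what turns the search into integration."""
    for parent, child in reversed(list(zip(path[:-1], path[1:]))):
        a = child[-1]
        v_child = logsumexp(tree[child]) if child in tree else 0.0  # leaf: V = 0
        tree[parent][a] = rewards[child] + v_child
```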

If Qφ comes from an appropriate function class, it can transfer knowledge within the inference problem at hand. Assume we spent some of the available search budget on TREESAMPLE to build an approximation T. Due to the tree-structure, search budget spent in one branch of the tree does not benefit any other sibling branch. For many problems, there is however structure that would allow for generalizing knowledge across branches. This can be achieved via Qφ: e.g. one could train Qφ to approximate the Q-values of the current T, and (under the right inductive bias) knowledge would transfer to newly expanded branches. A similar argument can be made for parametric generalization across problem instances. Assume a given family of distributions {PiX}i∈I for some index set I. If the different distributions PiX share structure, it is possible to leverage search computations performed on PiX for inference in PjX to some degree. A natural example for this is posterior inference in the same underlying model conditioned on different evidence / observations, similar e.g. to amortized inference in variational auto-encoders (Kingma and Welling, 2013). Besides transfer, there is a purely computational reason for learning a parametric Qφ. The memory footprint of TREESAMPLE grows linearly with the search budget B. For large problems with large budgets B ≫ 0, storing the entire search tree in memory might not be feasible. In this case, periodically compiling the current tree into Qφ, rebuilding it from scratch under the prior Qφ and subsequently refining it with TREESAMPLE may be preferable. Concretely, we propose to train Qφ by regression on state-action values.

4 Experiments

As baselines, we compare against sequential importance sampling (SIS), sequential Monte Carlo (SMC), Gibbs sampling (GIBBS) and sampling with loopy belief propagation (BP); details are given in the supplementary material. We use the baseline methods in the following way: we generate a set of particles {xi}i≤I of size I such that we exhaust the budget B, and then return the (potentially weighted) sum of atoms Σi≤I pi δ(x, xi) as the approximation density (a short sketch is given below); here δ is the Kronecker delta, and the pi are set to 1/I for GIBBS and BP, and to the self-normalized importance weights for SIS and SMC. Hyperparameters for all methods were tuned individually for different families of distributions on an initial set of experiments and then kept constant across all reported experiments. For further details, see the supplementary material. For SIS and SMC, the proposal distribution plays a role comparable to the state-action prior in TREESAMPLE. Therefore, for all experiments we used the same parametric family for Qφ for TREESAMPLE, SIS and SMC.

For the sake of simplicity, in the experiments we measured and constrained the inference budget B in terms of reward evaluations, i.e. each pointwise evaluation of Rn incurs a cost of one, instead of factor evaluations.

4.1 TREESAMPLE w/o Parametric Value Function

We first investigated inference without a learned parametric Qφ. Instead, we used the simple heuristic of setting Qφn(a|x≤n) to a constant for all a and n.
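For concreteness, a small sketch (ours, with illustrative names) of the weighted sum of atoms used to report the baseline approximations described above, with uniform weights for GIBBS / BP and self-normalized importance weights for SIS / SMC:

```python
from collections import defaultdict
import numpy as np

def atom_approximation(particles, log_weights=None):
    """Return the weighted sum of atoms sum_i p^i * delta(x, x^i) as a dict x -> p(x).

    log_weights=None gives uniform p^i = 1/I (GIBBS, BP); otherwise the p^i are
    the self-normalized importance weights (SIS, SMC)."""
    if log_weights is None:
        w = np.full(len(particles), 1.0 / len(particles))
    else:
        lw = np.asarray(log_weights, dtype=float)
        w = np.exp(lw - lw.max())
        w /= w.sum()                      # self-normalization
    p = defaultdict(float)
    for x, wi in zip(particles, w):
        p[tuple(x)] += wi                 # merge duplicate atoms
    return dict(p)
```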

Figure 1: Comparison of TREESAMPLE to SMC on inference in 1000 randomly generated Markov chains. Left: Approximation error as a function of inference budget (log-scale), showing that SMC needs more than 30 times the budget of TREESAMPLE to generate comparable approximations. Right: TREESAMPLE finds approximations with both higher entropy and lower energy.

Only for very shallow, bushy problems with log K ≫ N does SMC outperform TREESAMPLE, whereas TREESAMPLE dominates SMC in all other problem configurations; see the figure in the supplementary material.

Next, we considered chain-structured distributions where the indices of the variables Xn do not correspond to the ordering in the chain; we call these PERMUTEDCHAINS (PM-CH). These are in general more difficult to solve, as they exhibit "delayed" rewards, i.e. binary chain potentials ψm(Xσ(n), Xσ(n+1)) depend on non-consecutive variables. This can create "dead-end"-like situations that SMC, not having the ability to backtrack, can easily get stuck in. Indeed, we find that SMC performs only somewhat better on this class of distributions than SIS, whereas TREESAMPLE achieves better results by a wide margin. Results on both families of distributions are shown in tab. 1.

Factor Graphs  We also tested the inference algorithms on two classes of non-chain factor graphs, denoted as FACTORGRAPHS1 (FG1) and FACTORGRAPHS2 (FG2). Distributions in FACTORGRAPHS1 are over N = 10 variables with K = 5 states each. Factors were randomly generated with a maximum degree d of 4, and their K^d values were drawn iid from N(0, 1). Distributions in FACTORGRAPHS2 are over N = 20 binary variables, i.e. K = 2. These distributions are generated by two types of factors: NOT (degree 2) and MAJORITY (max degree 4), both taking values in {0, 1}. Results are shown in tab. 1. For both families of distributions, TREESAMPLE outperforms all considered baselines by a wide margin. We found that GIBBS generally failed to find configurations with high energy due to slow mixing. BP-based sampling was observed to generate samples with high energy but small entropy, yielding results comparable to SIS.

Table 1: Approximation error (lower is better) for different inference methods on four distribution classes.

            CH      PM-CH     FG1       FG2
  SIS       11.61    9.23    -21.97    -31.70
  SMC        1.94    7.08    -24.09    -35.90
  GIBBS       –       –      -18.67    -25.12
  BP        exact   exact    -21.50    -31.48
  TS         0.53    3.41    -28.89    -38.70

4.2 TREESAMPLE with Parametric Value Functions

Next, we investigated the performance of TREESAMPLE, as well as SMC, with additional parametric state-action value functions Qφ (used as proposals for SMC). We focused on inference problems from FACTORGRAPHS2. We implemented the inference algorithm as a distributed architecture consisting of a worker and a learner process, both running simultaneously; a sketch of one worker episode is given below. The worker requests an inference problem instance and performs inference, either with TREESAMPLE or SMC, with a small budget of B = 2500, using the current parametric Qφ. After building the approximation PX, 128 independent samples xi ∼ PX are drawn from it, and the inferred Q-values Qn(·|xi≤n) for i = 1, ..., 128 and n = 1, ..., N are written into a replay buffer as training data; following this, the inference episode is terminated, the tree is flushed, and a new episode starts. The learner process samples data from the replay buffer for updating the parametric Qφ with an SGD step on a minibatch of size 128; the updated model parameters φ are then sent to the worker. We tracked the error of the inference performed on the worker using the unnormalized ∆DKL as a function of the number of completed inference episodes. We expect ∆DKL to decrease as Qφ adapts to the inference problem and therefore becomes better at guiding the search. Separately, we also tracked the inference performance of only using the value function Qφ, without additional search around it, denoted as ∆DφKL. This is a purely parametric approximation to the inference problem, trained by samples from TREESAMPLE and SMC respectively. We observed that ∆DKL as well as ∆DφKL stabilized after roughly 1500 inference episodes for all experiments. Results were then averaged over episodes 2000-4000 and are shown in tab. 2.
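The worker / learner loop can be sketched as follows; all names here are illustrative, `build_tree` stands in for a TREESAMPLE (or SMC) run with budget B = 2500, and `sample` is the procedure sketched in sec. 3.1:

```python
import random

class ReplayBuffer:
    """Minimal FIFO replay; illustrative only, not the paper's implementation."""
    def __init__(self, capacity=100_000):
        self.items, self.capacity = [], capacity

    def add(self, item):
        self.items.append(item)
        if len(self.items) > self.capacity:
            self.items.pop(0)

    def minibatch(self, size=128):
        return random.sample(self.items, min(size, len(self.items)))

def worker_episode(build_tree, q_default, N, replay, rng, n_samples=128):
    """One worker episode: build the approximation under the current Q^phi,
    write Q-value targets into replay, then flush the tree."""
    tree = build_tree()
    for _ in range(n_samples):                 # 128 independent samples x ~ P_X
        x = sample(tree, q_default, N, rng)
        for n in range(N):
            prefix = x[:n]
            if prefix in tree:                 # targets Q_n(.|x_{<n}) read off the tree
                replay.add((n, prefix, tree[prefix].copy()))
    # Episode ends here: the tree is discarded. The learner process concurrently
    # draws minibatches of size 128 from `replay`, takes an SGD step on Q^phi,
    # and ships the updated parameters phi back to the worker.
```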

Table 2: Approximation error for inference in factor graphs for different types of value functions and training regimes.

  value func.                         ∅        MLP      GNN      GNN      GNN      GNN      GNN
  single graph                        N/A      Yes      Yes      No       No       No       No
  Ntrain                              N/A      20       20       20       12       16       24
  Qφ trained by SMC: ∆DφKL            –        +1.63    -0.19    -0.97    -1.00    -1.17    -0.64
  Qφ + SMC: ∆DKL                      +2.72    +2.93    +1.64    +2.56    +2.00    +1.64    +2.10
  Qφ trained by TREESAMPLE: ∆DφKL     –        -3.61    -3.86    -2.05    -2.12    -2.52    -1.83
  Qφ + TREESAMPLE: ∆DKL               0.00     -3.63    -3.87    -2.23    -2.22    -2.64    -2.35

Results in tab. 2 are medians over 20 runs and, to facilitate comparison, are reported relative to the ∆DKL of TREESAMPLE without value functions.

We first performed a simple set of "sanity-check" experiments on TREESAMPLE with parametric value functions in a non-transfer setting, where the worker repeatedly solves the same inference problem arising from a single factor graph. As value function, we used a simple MLP with 4 layers and 256 hidden units each. As shown in the second column of tab. 2, the approximation error ∆DKL decreases significantly compared to plain TREESAMPLE without value functions. This corroborates that the value function can indeed cache part of the previous search trees and facilitate inference if training and testing factor graphs coincide. Furthermore, we observed that once Qφ is fully trained, the inference error ∆DφKL obtained using only Qφ is only marginally worse than the ∆DKL obtained using Qφ plus TREESAMPLE-search on top of it; see rows four and five in tab. 2, respectively. This indicates that the value function was powerful enough in this experiment to almost cache the entire search computation of TREESAMPLE.

Next, we investigated graph neural networks (GNNs) (Battaglia et al., 2018) as value functions Qφ. This function class can make explicit use of the structure of the factor graph instances. Details about the architecture can be found in (Battaglia et al., 2018) and the supplementary material, but are briefly described in the following (a schematic sketch is also given at the end of this subsection). GNNs consist of two types of networks, node blocks and edge blocks (we did not use global networks), that are connected according to the factor graph at hand and executed multiple times, mimicking a message-passing-like procedure. We used three node block networks, one for each type of graph node, i.e. variable nodes (corresponding to a variable Xn), NOT-factors and MAJORITY-factors. We used four edge block networks, namely one for each combination of {incoming, outgoing} × {NOT, MAJORITY}. Empirically, we found that GNNs slightly outperform MLPs in the non-transfer setting; see the third column of tab. 2.

The real advantage of GNNs comes into play in a transfer setting, when the worker performs inference in a new factor graph for each episode. We keep the number of variables fixed (Ntrain = Ntest = 20) but vary the number and configuration of factors across problems. GNNs successfully generalize across graphs; see the fourth column of tab. 2. This is due to their ability to make use of the graph topology of a new factor graph instance, by connecting its constituent node and edge networks accordingly. Furthermore, the node and edge networks evidently learned generic message passing computations for variable nodes as well as NOT / MAJORITY factor nodes. The results show that a suitable Qφ generalizes knowledge across inference problems, leading to smaller approximation error on new distributions. Furthermore, we investigated a transfer setting where the worker solves inference problems on factor graphs of sizes Ntrain = 12, 16 or 24, but performance is tested on graphs of size Ntest = 20; see columns five to seven in tab. 2. Strikingly, we find that the value functions generalize as well across problems of different sizes as they do across problems of the same size. This demonstrates that prior knowledge can successfully guide the search and greatly facilitate inference.

Finally, we investigated the performance of SMC with trained value functions Qφ; see rows one and two in tab. 2. Overall, we found that performance was worse compared to TREESAMPLE: value functions Qφ trained by SMC were found to give worse results ∆DφKL compared to those trained by TREESAMPLE, and the overall inference error was worse compared to TREESAMPLE. Interestingly, we found that once Qφ is fully trained, performing additional SMC on top of it made results worse. Although initially counter-intuitive, these results are sensible in our problem setup. The entropy of SMC approximations, ≈ log I, is essentially given by the number of particles I that SMC produces; this number is limited by the budget B that can be used to compute importance weights. Once a parametric Qφ is trained, it does not need to make any further calls to the ψm factors, and can therefore exhibit much higher entropy, making ∆DφKL smaller than ∆DKL.

5 Related Work

TREESAMPLE is based on the connection between probabilistic inference and maximum-entropy decision-making problems established by previous work. This connection has mostly been used to solve RL problems with inference methods, e.g. (Attias, 2003; Hoffman et al., 2007). Closely related to our approach, this relationship has also been used in the reverse direction, i.e. to solve inference problems using tools from RL (Mnih and Gregor, 2014; Weber et al., 2015; Wingate and Weber, 2013; Schulman et al., 2015; Weber et al., 2019), however without utilizing tree search and without emphasizing the importance of exploration for inference. The latter has been recognized in (Lu et al., 2018) and applied to hierarchical partitioning for inference in continuous spaces; see also (Rainforth et al., 2018). In contrast to this, we focus on discrete domains, with sequential decision-making utilizing MCTS and value functions.

For approximating general probabilistic inference problems, Markov Chain Monte Carlo (MCMC) has proven very successful in practice. MCMC operates on a fully specified, approximate sample which is perturbed iteratively by a transition operator. The latter is usually designed specifically for a family of distributions to leverage problem structure for achieving fast mixing. However, mixing times are difficult to analyze theoretically and hard to monitor in practice (Cowles and Carlin, 1996). TREESAMPLE circumvents the mixing problem by generating a new sample "from scratch" when returning to the root node and then iteratively stepping through the dimensions of the random vector. Furthermore, TREESAMPLE can make use of powerful neural networks for approximating conditionals of the target, thus caching computations for related inference problems. Although adaptive MCMC methods exist, they usually only consider small sets of adaptive parameters (Andrieu and Thoms, 2008). Recently, MCMC methods have been extended to transition operators generated by neural networks, which are trained either by adversarial training, meta learning or mixing time criteria (Song et al., 2017; Levy et al., 2017; Neklyudov et al., 2018; Wang et al., 2018). However, these were formulated for continuous domains and rely on differentiability, and thus do not carry over directly to discrete domains.

Our proposed algorithm is closely related to Sequential Monte Carlo (SMC) methods (Del Moral et al., 2006), another class of broadly applicable inference algorithms. Often, these methods are applied to generate approximate samples by sequentially sampling the dimensions of a random vector, e.g. in particle filtering for temporal inference problems (Doucet and Johansen, 2009). Usually, these methods do not allow for backtracking, i.e. re-visiting previously discarded partial configurations, although a few variants with back-tracking heuristics do exist (Klepal et al., 2008; Grassberger, 2004). In contrast, the TREESAMPLE algorithm decides at every iteration where to expand the current tree based on a full tree-traversal from the root, and therefore allows for backtracking an arbitrary number of steps. Adaptive SMC methods learn parametric proposal distributions (Gu et al., 2015; Kempinska and Shawe-Taylor, 2017), but typically do not marginalize over the future. In the decision-making formulation, this corresponds to learning proposals based on immediate rewards instead of total returns. Recent work in continuous domains has begun to address this (Guarniero et al., 2017; Heng et al., 2017; Lawson et al., 2018; Piché et al., 2018); however, these methods do not make use of guided systematic search.

Furthermore, Weighted Model Counting (WMC) is an exact inference method related to TREESAMPLE. The target probability distribution is represented as Boolean formulas with associated weights, and inference is carried out by summing weights over satisfying assignments of the associated SAT problem (Chavira and Darwiche, 2008). In particular, it has been shown that DPLL-style SAT solvers (Davis et al., 1962) can be extended to exactly solve general discrete inference problems (Sang et al., 2005; Bacchus et al., 2009), often outperforming other standard methods such as the junction tree algorithm (Lauritzen and Spiegelhalter, 1988). Similar to TREESAMPLE, this DPLL-based approach performs inference by search, i.e. it recursively instantiates variables of the SAT problem.

6 Discussion

For the sake of simplicity, we assumed in this paper that the computational cost of inference is dominated by evaluations of the factor oracles. This assumption is well justified e.g. in applications where some factors represent large-scale scientific simulators (Baydin et al., 2019), or in modern deep latent variable models, where some factors are given by deep neural networks that take potentially high-dimensional observations as inputs. If this assumption is violated, i.e. all factors can be evaluated cheaply, the comparison of TREESAMPLE to SMC and other inference methods will become less favourable for the former. TREESAMPLE incurs an overhead for traversing a search tree before expanding it, attempting to use the information of all previous oracle evaluations. If these are cheap, a less sequential and more parallel approach, such as SMC, might become more competitive.

We expect that TREESAMPLE can be improved and extended in many ways. Currently, the topology of the factor graph is only partially used, for the reward definition and potentially for graph net value functions. One obvious way to better leverage it would be to check if, after conditioning on a prefix x≤n, corresponding to a search depth n, the remaining factor graph decomposes into independent sub-problems that could be solved separately.

References

Christophe Andrieu and Johannes Thoms. A tutorial on adaptive MCMC. Statistics and Computing, 18(4):343–373, 2008.

Hagai Attias. Planning by probabilistic inference. In AISTATS, 2003.

Peter Auer, Nicolo Cesa-Bianchi, and Paul Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3):235–256, 2002.

Fahiem Bacchus, Shannon Dalmao, and Toniann Pitassi. Solving #SAT and Bayesian inference with backtracking search. Journal of Artificial Intelligence Research, 34:391–442, 2009.

Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

Atılım Güneş Baydin, Lei Shao, Wahid Bhimji, Lukas Heinrich, Lawrence Meadows, Jialin Liu, Andreas Munk, Saeid Naderiparizi, Bradley Gram-Hansen, Gilles Louppe, et al. Etalumis: Bringing probabilistic programming to scientific simulators at scale. arXiv preprint arXiv:1907.03382, 2019.

Cameron B Browne, Edward Powley, Daniel Whitehouse, Simon M Lucas, Peter I Cowling, Philipp Rohlfshagen, Stephen Tavener, Diego Perez, Spyridon Samothrakis, and Simon Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012.

Mark Chavira and Adnan Darwiche. On probabilistic inference by weighted model counting. Artificial Intelligence, 172(6-7):772–799, 2008.

Mary Kathryn Cowles and Bradley P Carlin. Markov chain Monte Carlo convergence diagnostics: a comparative review. Journal of the American Statistical Association, 91(434):883–904, 1996.

Adnan Darwiche. A logical approach to factoring belief networks. KR, 2:409–420, 2002.

Martin Davis, George Logemann, and Donald Loveland. A machine program for theorem-proving. Communications of the ACM, 5(7):394–397, 1962.

Peter Dayan and Geoffrey E Hinton. Using expectation-maximization for reinforcement learning. Neural Computation, 9(2):271–278, 1997.

Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 68(3):411–436, 2006.

Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of Nonlinear Filtering, 12(656-704):3, 2009.

Peter Grassberger. Sequential Monte Carlo methods for protein folding. arXiv preprint cond-mat/0408571, 2004.

Shixiang Shane Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte Carlo. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.

Pieralberto Guarniero, Adam M Johansen, and Anthony Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.

W Keith Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 1970.

Jeremy Heng, Adrian N Bishop, George Deligiannidis, and Arnaud Doucet. Controlled sequential Monte Carlo. arXiv preprint arXiv:1708.08396, 2017.

Matt Hoffman, Arnaud Doucet, Nando de Freitas, and Ajay Jasra. Trans-dimensional MCMC for Bayesian policy learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems, pages 665–672. Curran Associates Inc., 2007.

Kira Kempinska and John Shawe-Taylor. Adversarial sequential Monte Carlo. In Bayesian Deep Learning (NIPS Workshop), 2017.

Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

Martin Klepal, Stephane Beauregard, et al. A backtracking particle filter for fusing building plans with PDR displacement estimates. In 2008 5th Workshop on Positioning, Navigation and Communication, pages 207–212. IEEE, 2008.

Levente Kocsis and Csaba Szepesvári. Bandit based Monte-Carlo planning. In European Conference on Machine Learning, pages 282–293. Springer, 2006.

Steffen L Lauritzen and David J Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society: Series B (Methodological), 50(2):157–194, 1988.

Dieterich Lawson, George Tucker, Christian A Naesseth, Chris J Maddison, Ryan P Adams, and Yee Whye Teh. Twisted variational sequential Monte Carlo. In Bayesian Deep Learning Workshop, NIPS, 2018.

Daniel Levy, Matthew D Hoffman, and Jascha Sohl-Dickstein. Generalizing Hamiltonian Monte Carlo with neural networks. arXiv preprint arXiv:1711.09268, 2017.

Xiaoyu Lu, Tom Rainforth, Yuan Zhou, Jan-Willem van de Meent, and Yee Whye Teh. On exploration, exploitation and learning in adaptive importance sampling. arXiv preprint arXiv:1810.13296, 2018.

Vikash Mansinghka, Daniel Roy, Eric Jonas, and Joshua Tenenbaum. Exact and approximate sampling by systematic stochastic search. In Artificial Intelligence and Statistics, pages 400–407, 2009.

Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

Radford M Neal et al. MCMC using Hamiltonian dynamics. Handbook of Markov Chain Monte Carlo, 2(11):2, 2011.

Kirill Neklyudov, Pavel Shvechikov, and Dmitry Vetrov. Metropolis-Hastings view on variational inference and adversarial training. arXiv preprint arXiv:1810.07151, 2018.

Alexandre Piché, Valentin Thomas, Cyril Ibrahim, Yoshua Bengio, and Chris Pal. Probabilistic planning with sequential Monte Carlo methods. In International Conference on Learning Representations, 2019.

Tom Rainforth, Yuan Zhou, Xiaoyu Lu, Yee Whye Teh, Frank Wood, Hongseok Yang, and Jan-Willem van de Meent. Inference trees: Adaptive inference with exploration. arXiv preprint arXiv:1806.09550, 2018.

Konrad Rawlik, Marc Toussaint, and Sethu Vijayakumar. On stochastic optimal control and reinforcement learning by approximate inference. In Twenty-Third International Joint Conference on Artificial Intelligence, 2013.

Dan Roth. On the hardness of approximate reasoning. Artificial Intelligence, 82(1-2):273–302, 1996.

Tian Sang, Paul Beame, and Henry Kautz. Solving Bayesian networks by weighted model counting. In Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05), volume 1, pages 475–482. AAAI Press, 2005.

John Schulman, Nicolas Heess, Theophane Weber, and Pieter Abbeel. Gradient estimation using stochastic computation graphs. CoRR, abs/1506.05254, 2015.

David Silver, Aja Huang, Chris J Maddison, Arthur Guez, Laurent Sifre, George Van Den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, et al. Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587):484, 2016.

Jiaming Song, Shengjia Zhao, and Stefano Ermon. A-NICE-MC: Adversarial training for MCMC. In Advances in Neural Information Processing Systems, pages 5140–5150, 2017.

Larry Stockmeyer. On approximation algorithms for #P. SIAM Journal on Computing, 14(4):849–861, 1985.

Tongzhou Wang, Yi Wu, Dave Moore, and Stuart J Russell. Meta-learning MCMC proposals. In Advances in Neural Information Processing Systems, pages 4146–4156, 2018.

Theophane Weber, Nicolas Heess, S. M. Ali Eslami, John Schulman, David Wingate, and David Silver. Reinforced variational inference. In Advances in Neural Information Processing Systems (NIPS) Workshop on Advances in Approximate Bayesian Inference, 2015.

Théophane Weber, Nicolas Heess, Lars Buesing, and David Silver. Credit assignment techniques in stochastic computation graphs. CoRR, abs/1901.01761, 2019.

David Wingate and Theophane Weber. Automated variational inference in probabilistic programming. arXiv preprint arXiv:1301.1299, 2013.