
Belief Propagation for Structured Decision Making

Qiang Liu
Department of Computer Science
University of California, Irvine
Irvine, CA 92697
[email protected]

Alexander Ihler
Department of Computer Science
University of California, Irvine
Irvine, CA 92697
[email protected]

Abstract

Variational inference algorithms, such as belief propagation, have had tremendous impact on our ability to learn and use graphical models, and give many insights for developing or understanding exact and approximate inference. However, variational approaches have not been widely adopted for decision making in graphical models, often formulated through influence diagrams and including both centralized and decentralized (or multi-agent) decisions. In this work, we present a general variational framework for solving structured cooperative decision-making problems, use it to propose several belief propagation-like algorithms, and analyze them both theoretically and empirically.

1 Introduction

Graphical modeling approaches, including Bayesian networks and Markov random fields, have been widely adopted for problems with complicated dependency structures and uncertainties. The problems of learning, i.e., estimating a model from data, and inference, e.g., calculating marginal probabilities or maximum a posteriori (MAP) estimates, have attracted wide attention and are well explored. Variational inference approaches have been widely adopted as a principled way to develop and understand many exact and approximate algorithms. On the other hand, the problem of decision making in graphical models, sometimes formulated via influence diagrams or decision networks and including both sequential centralized decisions and decentralized or multi-agent decisions, is surprisingly less explored in the approximate inference community.

Influence diagrams (IDs), or decision networks [Howard and Matheson, 1985, 2005], are a graphical model representation of structured decision problems under uncertainty; they can be treated as an extension of Bayesian networks, augmented with decision nodes and utility functions. Traditionally, IDs are used to model centralized, sequential decision processes under "perfect recall", which assumes that the decision steps are ordered in time and that all information is remembered across time; limited memory influence diagrams (LIMIDs) [Zhang et al., 1994, Lauritzen and Nilsson, 2001] relax the perfect recall assumption, creating a natural framework for representing decentralized and information-limited decision problems, such as team decision making and multi-agent systems. Despite the close connection and similarity to Bayes nets, IDs have less visibility in the graphical model and automated reasoning community, both in terms of modeling and algorithm development; see Pearl [2005] for an interesting historical perspective.

Solving an ID refers to finding decision rules that maximize the expected utility function (MEU); this task is significantly more difficult than standard inference on a Bayes net. For IDs with perfect recall, MEU can be restated as a dynamic program, and solved with cost exponential in a constrained tree-width of the graph that is subject to the temporal ordering of the decision nodes. The constrained tree-width can be much higher than the tree-width associated with typical inference, making MEU significantly more complex. For LIMIDs, non-convexity issues also arise, since the limited shared information and simultaneous decisions may create locally optimal policies. The most popular algorithm for LIMIDs is based on policy-by-policy improvement [Lauritzen and Nilsson, 2001], and provides only a "person-by-person" notion of optimality.

Surprisingly, the variational ideas that revolutionized inference in Bayes nets have not been adopted for influence diagrams. Although there exists work on transforming MEU problems into sequences of standard marginalization problems [e.g., Zhang, 1998], on which variational methods apply, these methods do not yield general frameworks, and usually only work for IDs with perfect recall. A full variational framework would provide general procedures for developing efficient approximations such as loopy belief propagation (BP), which are crucial for large scale problems, or for providing new theoretical analysis.

In this work, we propose a general variational framework for solving influence diagrams, both with and without perfect recall. Our results on centralized decision making include traditional inference in graphical models as special cases. We propose a spectrum of exact and approximate algorithms for MEU problems based on the variational framework. We give several optimality guarantees, showing that under certain conditions our BP algorithm can find the globally optimal solution for IDs with perfect recall, and can solve LIMIDs in a stronger locally optimal sense than coordinate-wise optimality. We show that a temperature parameter can also be introduced to smooth between MEU tasks and standard (easier) marginalization problems, and can provide good solutions by annealing the temperature or using iterative proximal updates.

This paper is organized as follows. Section 2 sets up background on graphical models, variational methods and influence diagrams. We present our variational framework of MEU in Section 3, and use it to develop several BP algorithms in Section 4. We present numerical experiments in Section 5. Finally, we discuss additional related work in Section 6 and concluding remarks in Section 7. Proofs and additional information can be found in the appendix.

2 Background

2.1 Graphical Models

Let x = {x_1, x_2, ..., x_n} be a random vector in X = X_1 × ... × X_n. Consider a factorized probability on x,

  p(x) = (1/Z) ∏_{α∈I} ψ_α(x_α) = (1/Z) exp[ Σ_{α∈I} θ_α(x_α) ],

where I is a set of variable subsets, and ψ_α : X_α → R+ are positive factors; the θ_α(x_α) = log ψ_α(x_α) are the natural parameters of the exponential family representation; and Z = Σ_x ∏_{α∈I} ψ_α is the normalization constant or partition function, with Φ(θ) = log Z the log-partition function. Let θ = {θ_α | α ∈ I} and θ(x) = Σ_α θ_α(x_α). There are several ways to represent a factorized distribution using graphs (i.e., graphical models), including Markov random fields, Bayesian networks, factor graphs and others.

Given a graphical model, inference refers to the procedure of answering probabilistic queries. Important inference tasks include marginalization, maximum a posteriori (MAP, sometimes called maximum probability of evidence or MPE), and marginal MAP (sometimes simply MAP). All of these are NP-hard in general. Marginalization calculates the marginal probabilities of one or a few variables, or equivalently the normalization constant Z, while MAP/MPE finds the mode of the distribution. More generally, marginal MAP seeks the mode of a marginal probability,

  Marginal MAP:  x*_A = arg max_{x_A} Σ_{x_B} ∏_α ψ_α(x_α),

where A, B are disjoint sets with A ∪ B = V; it reduces to marginalization if A = ∅ and to MAP if B = ∅.

Marginal Polytope. A marginal polytope M is a set of local marginals τ = {τ_α(x_α) : α ∈ I} that are extensible to a global distribution over x, that is, M = {τ | ∃ a distribution p(x), s.t. Σ_{x_{V\α}} p(x) = τ_α(x_α)}. Call P[τ] the set of global distributions consistent with τ ∈ M; there exists a unique distribution in P[τ] that has maximum entropy and follows the exponential family form for some θ. We abuse notation to denote this unique global distribution τ(x).

A basic result for variational methods is that Φ(θ) is convex and can be rewritten into a dual form,

  Φ(θ) = max_{τ∈M} { ⟨θ, τ⟩ + H(x; τ) },   (1)

where ⟨θ, τ⟩ = Σ_α Σ_{x_α} θ_α(x_α) τ_α(x_α) is the point-wise inner product, and H(x; τ) = −Σ_x τ(x) log τ(x) is the entropy of the distribution τ(x); the maximum of (1) is obtained when τ equals the marginals of the original distribution with parameter θ. See Wainwright and Jordan [2008].

Similar dual forms hold for MAP and marginal MAP. Letting Φ_{A,B}(θ) = log max_{x_A} Σ_{x_B} exp(θ(x)), we have [Liu and Ihler, 2011]

  Φ_{A,B}(θ) = max_{τ∈M} { ⟨θ, τ⟩ + H(x_B | x_A ; τ) },   (2)

where H(x_B | x_A ; τ) = −Σ_x τ(x) log τ(x_B | x_A) is the conditional entropy; its appearance corresponds to the sum operators.

The dual forms in (1) and (2) are no easier to compute than the original inference. However, one can approximate the marginal polytope M and the entropy in various ways, yielding a body of approximate inference algorithms, such as loopy belief propagation (BP) and its generalizations [Yedidia et al., 2005, Wainwright et al., 2005], linear programming solvers [e.g., Wainwright et al., 2003b], and recently hybrid message passing algorithms [Liu and Ihler, 2011, Jiang et al., 2011].

Junction Graph BP. Junction graphs provide a procedural framework to approximate the dual (1). A cluster graph is a triple (G, C, S), where G = (V, E) is an undirected graph, with each node k ∈ V associated with a subset of variables c_k ∈ C (clusters), and each edge (kl) ∈ E with a subset s_kl ∈ S (separator) satisfying s_kl ⊆ c_k ∩ c_l. We assume that C subsumes the index set I, that is, for any α ∈ I, there exists a c_k ∈ C, denoted c[α], such that α ⊆ c_k. In this case, we can reparameterize θ = {θ_α | α ∈ I} into θ = {θ_{c_k} | k ∈ V} by taking θ_{c_k} = Σ_{α : c[α]=c_k} θ_α, without changing the distribution. A cluster graph is called a junction graph if it satisfies the running intersection property: for each i ∈ V, the induced sub-graph consisting of the clusters and separators that include i is a connected tree. A junction graph is a junction tree if G is a tree.

To approximate the dual (1), we can replace M with a locally consistent polytope L: the set of local marginals τ = {τ_{c_k}, τ_{s_kl} : k ∈ V, (kl) ∈ E} satisfying Σ_{x_{c_k \ s_kl}} τ_{c_k}(x_{c_k}) = τ_{s_kl}(x_{s_kl}). Clearly, M ⊆ L. We then approximate (1) by

  max_{τ∈L} { ⟨θ, τ⟩ + Σ_{k∈V} H(x_{c_k} ; τ_{c_k}) − Σ_{(kl)∈E} H(x_{s_kl} ; τ_{s_kl}) },

where the joint entropy is approximated by a linear combination of the entropies of the local marginals. The approximate objective can be solved using Lagrange multipliers [Yedidia et al., 2005], leading to a sum-product message passing algorithm that iteratively sends messages between neighboring clusters via

  m_{k→l}(x_{s_kl}) ∝ Σ_{x_{c_k \ s_kl}} ψ_{c_k}(x_{c_k}) m_{∼k\l}(x_{c_k}),   (3)

where ψ_{c_k} = exp(θ_{c_k}), and m_{∼k\l} is the product of messages into k from its neighbors N(k) except l. At convergence, the (locally) optimal marginals are

  τ_{c_k} ∝ ψ_{c_k} m_{∼k}  and  τ_{s_kl} ∝ m_{k→l} m_{l→k},

where m_{∼k} is the product of all messages into k. Max-product and hybrid methods can be derived analogously for MAP and marginal MAP problems.

2.2 Influence Diagrams

Influence diagrams (IDs) or decision networks are extensions of Bayesian networks to represent structured decision problems under uncertainty. Formally, an influence diagram is defined on a directed acyclic graph G = (V, E), where the nodes V are divided into two subsets, V = C ∪ D, where C and D represent respectively the set of chance nodes and decision nodes. Each chance node i ∈ C represents a random variable x_i with a conditional probability table p_i(x_i | x_pa(i)). Each decision node i ∈ D represents a controllable decision variable x_i, whose value is determined by a decision maker via a decision rule (or policy) δ_i : X_pa(i) → X_i, which determines the value of x_i based on the observed values of x_pa(i); we call the collection of policies δ = {δ_i | i ∈ D} a strategy. Finally, a utility function u : X → R+ measures the reward given an instantiation of x = [x_C, x_D], which the decision maker wants to maximize. It is reasonable to assume some decomposition structure on the utility u(x), either additive, u(x) = Σ_{j∈U} u_j(x_{β_j}), or multiplicative, u(x) = ∏_{j∈U} u_j(x_{β_j}). A decomposable utility function can be visualized by augmenting the DAG with a set of leaf nodes U, called utility nodes, each with parent set β_j. See Fig. 1 for a simple example.

[Figure 1: A simple influence diagram for deciding vacation activity [Shachter, 2007]. Chance nodes: Weather Forecast, Weather Condition; decision node: Vacation Activity; utility node: Satisfaction.]

A decision rule δ_i is alternatively represented as a deterministic conditional "probability" p_i^δ(x_i | x_pa(i)), where p_i^δ(x_i | x_pa(i)) = 1 for x_i = δ_i(x_pa(i)) and zero otherwise. It is helpful to allow soft decision rules, where p_i^δ(x_i | x_pa(i)) takes fractional values; these define a randomized strategy in which x_i is determined by randomly drawing from p_i^δ(x_i | x_pa(i)). We denote by ∆° the set of deterministic strategies and ∆ the set of randomized strategies. Note that ∆° is a discrete set, while ∆ is its convex hull.

Given an influence diagram, the optimal strategy should maximize the expected utility function (MEU):

  MEU = max_{δ∈∆} EU(δ) = max_{δ∈∆} E(u(x) | δ)
      = max_{δ∈∆} Σ_x u(x) ∏_{i∈C} p_i(x_i | x_pa(i)) ∏_{i∈D} p_i^δ(x_i | x_pa(i))
      = max_{δ∈∆} Σ_x exp(θ(x)) ∏_{i∈D} p_i^δ(x_i | x_pa(i)),   (4)

where θ(x) = log[u(x) ∏_{i∈C} p_i(x_i | x_pa(i))]; we call the distribution q(x) ∝ exp(θ(x)) the augmented distribution [Bielza et al., 1999]. The concept of the augmented distribution is critical, since it completely specifies a MEU problem without the semantics of the influence diagram; hence one can specify q(x) arbitrarily, e.g., via an undirected MRF, extending the definition of IDs. We can treat MEU as a special sort of "inference" on the augmented distribution, which, as we will show, generalizes more common inference tasks.

In (4) we maximize the expected utility over ∆; this is equivalent to maximizing over ∆°, since:

Lemma 2.1. For any ID, max_{δ∈∆} EU(δ) = max_{δ∈∆°} EU(δ).
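To make the MEU objective (4) concrete, here is a minimal brute-force sketch (not from the paper; the model and all numbers are hypothetical) that evaluates the expected utility of every deterministic decision rule on a toy ID with one chance node c and one decision node d that observes c:

```python
import itertools

# Hypothetical toy influence diagram: chance node c with prior p(c),
# decision node d observing c, and utility u(c, d).
p_c = {0: 0.4, 1: 0.6}
u = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 1.5}
d_states = [0, 1]

def expected_utility(delta):
    # EU(delta) = sum_c p(c) * u(c, delta(c)) for a deterministic rule delta
    return sum(p_c[c] * u[(c, delta[c])] for c in p_c)

# Enumerate all deterministic rules delta: X_c -> X_d, i.e., the discrete
# strategy set over which Eq. (4) maximizes; exact, but exponential in general.
rules = [dict(zip(sorted(p_c), choice))
         for choice in itertools.product(d_states, repeat=len(p_c))]
best = max(rules, key=expected_utility)
meu = expected_utility(best)
```

Because d observes c here, this toy problem has perfect recall; the same enumeration applies to LIMIDs, but the number of rules grows exponentially, which is what the variational treatment below avoids.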
The MEU prob- d1 d2 u(d1, d2) d1 d2 d1 d2 1 1 1 lem can be solved in closed form if the influence di- 0 1 0 agram satisfies a perfect recall assumption (PRA) — 1 0 0 u u there exists a “temporal” ordering over all the deci- 0 0 0.5 sion nodes, say {d1, d2, ··· , dm}, consistent with the (a) Perfect Recall (b) Imperfect Recall (c) Utility Function partial order defined by the DAG G, such that every decision node observes all the earlier decision nodes Figure 2: Illustrating imperfect recall. In (a) d2 ob- serves d ; its optimal decision rule is to equal d ’s state and their parents, that is, {dj} ∪ pa(dj) ⊆ pa(di) for 1 1 any j < i. Intuitively, PRA implies a centralized de- (whatever it is); knowing d2 will follow, d1 can choose cision scenario, where a global decision maker sets all d1 = 1 to achieve the global optimum. In (b) d1 and the decision nodes in a predefined order, with perfect d2 do not know the other’s states; both d1 = d2 = 1 memory of all the past observations and decisions. and d1 = d2 = 0 (suboptimal) become locally optimal strategies and the problem is multi-modal. With PRA, the chance nodes can be grouped by when they are observed. Let ri−1 (i = 1, . . . , m) be the set of chance nodes that are parents of di but not of any laxation causes many difficulties. First, it is no longer dj for j < i; then both decision and chance nodes are possible to eliminate the decision nodes in a sequential ordered by o = {r0, d1, r1, ··· , dm, rm}. The MEU and “sum-max-sum” fashion. Instead, the dependencies of its optimal strategy for IDs with PRA can be calcu- the decision nodes have cycles, formally discussed in lated by a sequential sum-max-sum rule, Koller and Milch[2003] by defining a relevance graph over the decision nodes; the relevance graph is a tree X X X MEU = max ··· max exp(θ(x)), (5) with PRA, but is usually loopy with imperfect recall. 
xd1 xdm xr0 xr1 xrm Thus iterative algorithms are usually required for LIM- X X ∗ ( ) = arg max  ··· max exp( (x)) IDs. Second and more importantly, the incomplete in- δdi xpa(di) θ , xdi xdm formation may cause the agents to behave myopically, xr xr i m selfishly choosing locally optimal strategies due to ig- where the calculation is performed in reverse temporal norance of the global statistics. This breaks the strat- ordering, interleaving marginalizing chance nodes and egy space into many local modes, making the MEU maximizing decision nodes. Eq. (5) generalizes the problem non-convex; see Fig.2 for an illustration. inference tasks in Section 2.1, arbitrarily interleaving The most popular algorithms for LIMIDs are based on the sum and max operators. For example, marginal policy-by-policy improvement, e.g., the single policy MAP can be treated as a blind decision problem, where update (SPU) algorithm [Lauritzen and Nilsson, 2001] no chance nodes are observed by any decision nodes. sequentially optimizes δi with δ¬i = {δj : j 6= i} fixed: As in other inference tasks, the calculation of the sum- max-sum rule can be organized into local computa- δi(xpa(i)) ← arg max E(u(x)|xfam(i) ; δ¬i), (6) xi tions if the augmented distribution q(x) is factorized. X Y δ However, since the max and sum operators are not ex- E(u(x)|xfam(i); δ¬i) = exp(θ(x)) pj (xj|xpa(j)), changeable, the calculation of (5) is restricted to elimi- x¬fam(i) j∈D\{i} nation orders consistent with the “temporal ordering”. where fam(i) = {i}∪pa(i). The update circles through Notoriously, this “constrained” tree-width can be very all i ∈ D in some order, and ties are broken arbitrar- high even for trees. See Koller and Friedman[2009]. ily in case of multiple maxima. The expected util- However, PRA is often unrealistic. 
First, most systems ity in SPU is non-decreasing at each iteration, and it lack enough memory to express arbitrary policies over gives a locally optimal strategy at convergence in the an entire history of observations. Second, many prac- sense that the expected utility can not be improved tical scenarios, like team decision analysis [Detwarasiti by changing any single node’s policy. Unfortunately, and Shachter, 2005] and decentralized sensor networks SPU’s solution is heavily influenced by initialization [Kreidl and Willsky, 2006], are distributed by nature: and can be very suboptimal. a team of agents makes decisions independently based This issue is helped by generalizing SPU to the on sharing limited information with their neighbors. strategy improvement (SI) algorithm [Detwarasiti and In these cases, relaxing PRA is very important. Shachter, 2005], which simultaneously updates sub- Imperfect Recall. General IDs with the perfect groups of decisions nodes. However, the time and recall assumption relaxed are discussed in Zhang et al. space complexity of SI grows exponentially with the [1994], Lauritzen and Nilsson[2001], Koller and Milch sizes of subgroups. In the sequel, we present a novel [2003], and are commonly referred as limited memory variational framework for MEU, and propose BP-like influence diagrams (LIMIDs). Unfortunately, the re- algorithms that go beyond the na¨ıve greedy paradigm. 3 Duality Form of MEU a subset of the marginal polytope that restricts the observation domains of the decision rules; this non- In this section, we derive a duality form for MEU, gen- convex inner subset is similar to the mean field approx- eralizing the duality results of the standard inference imation for partition functions. See Wolpert[2006] for in Section 2.1. Our main result is summarized in the a similar connection to mean field for bounded rational following theorem. game theory. Interestingly, this shows that extending Theorem 3.1. (a). 
For an influence diagram with a LIMID to have perfect recall (by extending the ob- augmented distribution q(x) ∝ exp(θ(x)), its log max- servation domains of the decision nodes) can be con- imum expected utility log MEU(θ) equals sidered a “convex” relaxation of the LIMID. Corollary 3.3. For any , if τ ∗ is global optimum of X  max{hθ, τ i + H(x; τ ) − H(xi|xpa(i); τ )}. (7) τ ∈M i∈D X max{hθ, τ i + H(x) − (1 − ) H(xi|xpa(i))}. (10) τ ∈M Suppose τ ∗ is a maximum of (7), then δ∗ = i∈D ∗ {τ (xi|xpa(i))|i ∈ D} is an optimal strategy. ∗ ∗ and δ = {τ (xi|xpa(i))|i ∈ D} is a deterministic strat- (b). For IDs with perfect recall, (7) reduces to egy, then it is an optimal strategy for MEU. X max{hθ, τ i + H(xoi |xo1:i−1 ; τ )}, (8) The parameter  is a temperature to “anneal” the τ ∈M oi∈C MEU problem, and trades off convexity and optimal- ity. For large , e.g.,  ≥ 1, the objective in (10) is a where o is the temporal ordering of the perfect recall. strictly convex function, while δ∗ is unlikely to be de- terministic nor optimal (if  = 1, (10) reduces to stan- Proof. (a) See appendix; (b) note PRA implies pa(i) = dard marginazation); as  decreases towards zero, δ∗ o1:i−1 (i ∈ D), and apply the entropy chain rule. becomes more deterministic, but (10) becomes more non-convex and is harder to solve. In Section4 we The distinction between (8) and (7) is subtle but im- derive several possible optimization approaches. portant: although (8) (with perfect recall) is always (if not strictly) a convex optimization, (7) (without perfect recall) may be non-convex if the subtracted 4 Algorithms entropy terms overwhelm; this matches the intuition that incomplete information sharing gives rise to mul- The duality results in Section3 offer new perspectives tiple locally optimal strategies. for MEU, allowing us to bring the tools of variational inference to develop new efficient algorithms. 
In this The MEU duality (8) for ID with PRA generalizes ear- section, we present a junction graph framework for BP- lier duality results of inference: with no decision nodes, like MEU algorithms, and provide theoretical analysis. D = ∅ and (8) reduces to (1) for the log-partition In addition, we propose two double-loop algorithms function; when C = ∅, no entropy terms appear and that alleviate the issue of non-convexity in LIMIDs or (8) reduces to the linear program relaxation of MAP. provide convergence guarantees: a deterministic an- Also, (8) reduces to marginal MAP when no chance nealing approach suggested by Corollary 3.3 and a nodes are observed before any decision. As we show method based on the proximal point algorithm. in Section4, this unification suggests a line of unified algorithms for all these different inference tasks. 4.1 A Belief Propagation Algorithm Several corollaries provide additional insights. Corollary 3.2. For an ID with parameter θ, we have We start by formulating the problem (7) into the junction graph framework. Let (G, C, S) be a junc- X log MEU = max{hθ, τ i + H(xoi |xo1:i−1 ; τ )} (9) tion graph for the augmented distribution q(x) ∝ τ ∈I oi∈C exp(θ(x)). For each decision node i ∈ D, we as- sociate it with exactly one cluster ck ∈ C satisfying where = {τ ∈ : xo ⊥ x \ |x , ∀oi ∈ I M i o1:i−1 pa(oi) pa(oi) {i, pa(i)} ⊆ ck; we call such a cluster a decision clus- D}, corresponding to those distributions that respect ter. The clusters C are thus partitioned into decision the imperfect recall constraints; “x ⊥ y | z” denotes clusters D and the other (normal) clusters R. For conditional independence of x and y given z. simplicity, we assume each decision cluster ck ∈ D is associated with exactly one decision node, denoted d . Corollary 3.2 gives another intuitive interpretation of k imperfect recall vs. 
perfect recall: MEU with imper- Following the junction graph framework in Section 2.1, fect recall optimizes same objective function, but over the MEU dual (10) (with temperature parameter ) is approximated by sum (resp. max or hybrid) consistency property yet X X X leaving the joint distribution unchanged [e.g., Wain- max{hθ, τ i + H + H − H }, (11) ck ck skl wright et al., 2003a, Weiss et al., 2007, Liu and Ih- τ ∈L k∈R k∈D (kl)∈E ler, 2011]. We define a set of “MEU-beliefs” b =  {b(xck ), b(xskl )} by b(xck ) ∝ ψck mk for all ck ∈ C, and where Hck = H(xck ), Hskl = H(xskl ) and H (xck ) = b(xskl ) ∝ mk→lml→k; note that the “beliefs” b are dis- H(xck )−(1−)H(xdk |xpa(dk)). The dependence of en- tropies on τ is suppressed for compactness. Eq. (11) tinguished from the “marginals” τ . We can show that is similar to the objective of regular sum-product junc- at each iteration of MEU-BP in (12)-(13), the b satisfy tion graph BP, except the entropy terms of the decision Q b(x ) clusters are replaced by H . k∈V ck ck Reparameterization: q(x) ∝ Q , (17) ∈E b(xskl ) Using a Lagrange multiplier method similar to Yedidia (kl) et al.[2005], a hybrid message passing algorithm can and further, at a fixed point of MEU-BP we have be derived for solving (11): Sum-consistency: X b(x ) = b(x ), (18) Sum messages: X (normal clusters) ck skl mk→l ∝ ψck m∼k\l, (12) (normal clusters) ck\sij xc \s k kl MEU-consistency: X MEU messages: X σk[ψc m∼k; ] σk[b(xck ); ] = b(xskl ). (19) m ∝ k , (13) (decision clusters) k→l c \s (decision clusters) ml→k k ij xck\skl where σk[·] is an operator that solves an annealed local Optimality Guarantees. Optimality guarantees of + MEU problem associated with b(xck ) ∝ ψck m∼k: MEU-BP (with  → 0 ) can be derived via reparam- eterization. Our result is analogous to those of Weiss def 1− σk[b(xck ); ] = b(xck )b(xdk |xpa(dk)) and Freeman[2001] for max-product BP and Liu and Ihler[2011] for marginal-MAP. 
where b(xdk |xpa(dk)) is the “annealed” optimal policy For a junction tree, a tree-order is a partial ordering on b(x , x )1/ ( | ) = dk pa(dk) the nodes with k  l iff the unique path from a special b xdk xpa(dk) P 1/ , b(xdk , xpa(dk)) xdk cluster (called root) to l passes through k; the parent X π(k) is the unique neighbor of k on the path to the b(x , x ) = b(x ), z = c \{d , pa(d )}. dk pa(dk) ck k k k k root. Given a subset of decision nodes D0, a junction x zk tree is said to be consistent for D0 if there exists a + 0 As  → 0 , one can show that b(xdk |xpa(dk)) is exactly tree-order with sk,π(k) ⊆ pa(dk) for any dk ∈ D . an optimal strategy of the local MEU problem with Theorem 4.1. Let (G, C, S) be a consistent junction augmented distribution ( ). 0 b xck tree for a subset of decision nodes D , and b be a set At convergence, the stationary point of (11) is: of MEU-beliefs satisfying the reparameterization and consistency conditions (17)-(19) with  → 0+. Let ∗ ∗ τck ∝ ψck m∼k for normal clusters (14) δ = {bck (xdk |xpa(dk)): dk ∈ D}; then δ is a locally ∗ 0 τck ∝ σk[ψck m∼k; ] for decision clusters (15) optimal strategy in the sense that EU({δD0 , δD\D }) ≤ ∗ EU(δ ) for any δD\D0 . τskl ∝ mk→lml→k for separators (16) This message passing algorithm reduces to sum- A junction tree is said to be globally consistent if it product BP when there are no decision clusters. The is consistent for all the decision nodes, which as im- outgoing messages from decision clusters are the cru- plied by Theorem 4.1, ensures a globally optimal strat- cial ingredient, and correspond to solving local (an- egy; this notation of global consistency is similar to nealed) MEU problems. the strong junction trees in Jensen et al.[1994]. 
For IDs with perfect recall, a globally consistent junction Taking  → 0+ in the MEU message update (13) gives tree can be constructed by a standard procedure which a fixed point algorithm for solving the original objec- triangulates the DAG of the ID along reverse tempo- tive directly. Alternatively, one can adopt a deter- ral order. For IDs without perfect recall, it is usually ministic annealing approach [Rose, 1998] by gradually not possible to construct a globally consistent junction decreasing , e.g., taking t = 1/t at iteration t. tree; this is the case for the toy example in Fig.2b. Reparameterization Properties. BP algorithms, However, coordinate-wise optimality follows as a con- including sum-product, max-product, and hybrid mes- sequence of Theorem 4.1 for general IDs with arbitrary sage passing, can often be interpreted as reparame- junction trees, indicating that MEU-BP is at least as terization operators, with fixed points satisfying some “optimal” as SPU. Theorem 4.2. Let (G, C, S) be an arbitrary junc- We use two choices of coefficients {wt}: (1) wt = 1 tion tree, and b and δ∗ defined in Theorem 4.1. (constant), and (2) wt = 1/t (harmonic). The choice Then δ∗ is a locally person-by-person optimal strategy: wt = 1 is especially interesting because the proximal ∗ ∗ EU({δi , δD\i}) ≤ EU(δ ) for any i ∈ D and δD\i. update reduces to a standard marginalization prob- lem, solvable by standard tools without the MEU’s Additively Decomposable Utilities. Our al- temporal elimination order restrictions. Concretely, gorithms rely on the factorization structure of the the proximal update in this case reduces to augmented distribution q(x). For this reason, mul- tiplicative utilities fit naturally, but additive utilities t+1 t n τi (xi|xpa(i)) ∝ τi (xi|xpa(i))E(u(x)|xfam(i) ; δ¬i) are more difficult (as they also are in exact inference) n [Koller and Friedman, 2009]. To create factorization with E(u(x)|xfam(i) ; δ¬i) as defined in (6). 
structure in additive utility problems, we augment the model with a latent "selector" variable, similar to that used in mixture models. For details, see the appendix.

4.2 Proximal Algorithms

In this section, we present a proximal point approach [e.g., Martinet, 1970, Rockafellar, 1976] for the MEU problem. Similar methods have been applied to standard inference problems, e.g., Ravikumar et al. [2010]. We start with a brief introduction to the proximal point algorithm. Consider an optimization problem min_{τ∈M} f(τ). A proximal method instead iteratively solves a sequence of "proximal" problems

τ^{t+1} = arg min_{τ∈M} { f(τ) + w^t D(τ‖τ^t) },   (20)

where τ^t is the solution at iteration t and w^t is a positive coefficient. D(·‖·) is a distance, called the proximal function; typical choices are Euclidean or Bregman distances or ψ-divergences [e.g., Teboulle, 1992, Iusem and Teboulle, 1993]. Convergence of proximal algorithms has been well studied: the objective series {f(τ^t)} is guaranteed to be non-increasing at each iteration, and {τ^t} converges to an optimal solution (sometimes superlinearly) for convex programs, under some regularity conditions on the coefficients {w^t}. See, e.g., Rockafellar [1976], Tseng and Bertsekas [1993], Iusem and Teboulle [1993].

Here, we use an entropic proximal function that naturally fits the MEU problem:

D(τ‖τ') = Σ_{i∈D} Σ_x τ(x) log[ τ_i(x_i|x_pa(i)) / τ'_i(x_i|x_pa(i)) ],

a sum of conditional KL-divergences. The proximal update for the MEU dual (7) then reduces to

τ^{t+1} = arg max_{τ∈M} { ⟨θ^t, τ⟩ + H(x) − (1 − w^t) Σ_{i∈D} H(x_i|x_pa(i)) },

where θ^t(x) = θ(x) + w^t Σ_{i∈D} log τ^t_i(x_i|x_pa(i)). This has the same form as the annealed problem (10) and can be solved by the message passing scheme (12)-(13). Unlike annealing, the proximal algorithm updates θ^t at each iteration and does not need w^t to approach zero.

This proximal update can be seen as a "soft" and "parallel" version of the greedy update (6): the greedy update makes a hard update at a single decision node, while the proximal update makes a soft modification simultaneously at all decision nodes. The soft update makes it possible to correct earlier suboptimal choices and allows decision nodes to make cooperative movements. However, convergence with w^t = 1 may be slow; using w^t = 1/t takes larger steps but is no longer a standard marginalization.

5 Experiments

We demonstrate our algorithms on several influence diagrams, including randomly generated IDs, large scale IDs constructed from problems in the UAI08 inference challenge, and finally practically motivated IDs for decentralized detection in wireless sensor networks. We find that our algorithms typically find better solutions than SPU with comparable time complexity; for large scale problems with many decision nodes, our algorithms are more computationally efficient than SPU, because one step of SPU requires updating (6) (a global expectation) for all the decision nodes.

In all experiments, we test single policy updating (SPU), our MEU-BP running directly at zero temperature (BP-0+), annealed BP with temperature ε = 1/t (Anneal-BP-1/t), and the proximal versions with w^t = 1 (Prox-BP-one) and w^t = 1/t (Prox-BP-1/t). For the BP-based algorithms, we use two constructions of junction graphs: a standard junction tree obtained by triangulating the DAG in backwards topological order, and a loopy junction graph following Mateescu et al. [2010] that corresponds to Pearl's loopy BP; for SPU, we use the same junction graphs to calculate the inner update (6). The junction trees ensure that the inner updates of SPU and Prox-BP-one are performed exactly, and carry the optimality guarantees of Theorem 4.1, but may be computationally more expensive than the loopy junction graphs. For the proximal versions, we set a maximum of 5 iterations in the inner loop; changing this value did not seem to lead to significantly different results. The BP-based algorithms may return non-deterministic strategies; we round these to deterministic strategies by taking the largest values.
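To make the generic scheme (20) concrete, here is a minimal numerical sketch (not the paper's MEU solver): for a toy linear objective f(τ) = −⟨θ, τ⟩ over the probability simplex, the entropic proximal step has a closed form, and iterating it with fixed weight w^t = 1 gradually sharpens τ toward a deterministic optimum. The function name and the values of θ are our own illustrative choices.

```python
import math

def entropic_prox_step(tau, theta, w):
    """One step of the proximal iteration (20) with an entropic (KL) proximal
    function, for the toy objective f(tau) = -<theta, tau> over the simplex.
    The closed-form solution is tau_new_i proportional to tau_i * exp(theta_i / w)."""
    new = [t * math.exp(th / w) for t, th in zip(tau, theta)]
    z = sum(new)
    return [v / z for v in new]

theta = [0.2, 1.0, 0.5]          # toy parameters (illustrative only)
tau = [1.0 / 3] * 3              # uniform initialization
for t in range(50):
    tau = entropic_prox_step(tau, theta, w=1.0)   # fixed weight w^t = 1
# tau concentrates on argmax_i theta_i (index 1), mirroring how repeated
# proximal steps with a KL penalty sharpen toward a deterministic solution
```

Because each step only reweights the previous iterate multiplicatively, the update never leaves the simplex, and smaller w takes larger steps, matching the w^t = 1/t discussion above.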

Random Bayesian Networks. We test our algorithms on randomly constructed IDs with additive utilities. We first generate a set of random DAGs of size 20 with maximum parent size 3. To create IDs, we take the leaf nodes to be utility nodes, and among the non-leaf nodes we randomly select a fixed percentage to be decision nodes, with the others being chance nodes. We assume the chance and decision variables are discrete with 4 states. The conditional probability tables of the chance nodes are randomly drawn from a symmetric Dirichlet distribution Dir(α), and the entries of the utility functions from a Gamma distribution Γ(α, 1). The relative improvement in log MEU compared to SPU with a junction tree is reported in Fig. 3. We find that when using junction trees, all our BP-based methods dominate SPU; on loopy junction graphs, BP-0+ occasionally performs worse than SPU, but all the annealed and proximal algorithms outperform SPU with the same loopy junction graph, and often even SPU with a junction tree. As the percentage of decision nodes increases, the improvement of the BP-based methods over SPU generally increases. Fig. 4 shows a typical trajectory of the algorithms across iterations. The algorithms were initialized uniformly; random initializations behaved similarly, but are omitted for space.

Figure 3: Results on random IDs of size 20. The y-axes show the log MEU of each algorithm compared to SPU on a junction tree; the x-axes show the percentage of decision nodes (panels (a) & (b), with α = 0.3) or the Dirichlet parameter α (panels (c) & (d), with 70% decision nodes). The left panels ((a) & (c)) correspond to running the algorithms on junction trees, and the right panels ((b) & (d)) on loopy junction graphs. The results are averaged over 20 random models.

Figure 4: A typical trajectory of MEU (of the rounded deterministic strategies) v.s. iterations for the random IDs in Fig. 3. One iteration of the BP-like methods denotes a forward-backward reduction on the junction graph; one step of SPU requires |D| (the number of decision nodes) reductions. SPU and BP-0+ are stuck at a local optimum by the 2nd iteration.

Diagnostic Bayesian networks. We construct larger scale IDs based on two diagnostic Bayes nets with 200-300 nodes and 300-600 edges, taken from the UAI08 inference challenge. To create influence diagrams, we made the leaf nodes utilities, each defined by its conditional probability when clamped to a randomly chosen state, with the total utility being the product of the local utilities (multiplicatively decomposable). The set of decision nodes is again randomly selected among the non-leaf nodes with a fixed percentage. Since the network sizes are large, we only run the algorithms on the loopy junction graphs. Again, our algorithms significantly improve on SPU; see Fig. 5.

Figure 5: Results on IDs constructed from two diagnostic BNs from the UAI08 challenge: (a) Diagnostic BN 1; (b) Diagnostic BN 2. Here all algorithms use the loopy junction graph and are initialized uniformly. Panels (a)-(b) show the log MEU of each algorithm normalized to that of SPU. Averaged over 10 trials.

Decentralized Sensor Network. In this section, we test an influence diagram constructed for decentralized detection in wireless sensor networks [e.g., Viswanathan and Varshney, 1997, Kreidl and Willsky, 2006]. The task is to detect the states of a hidden process p(h) (a pairwise MRF) using a set of distributed sensors; each sensor provides a noisy measurement v_i of the local state h_i, and overall performance is boosted by allowing the sensors to transmit small (1-bit) signals s_i along a directional path, to help the predictions of their downstream sensors. The utility function includes rewards for correct prediction and a cost for sending signals. We construct an ID, as sketched in Fig. 6(a), for addressing the offline policy design task: finding optimal policies of how to predict the states based on the local measurement and received signals, and policies of whether and how to pass signals to downstream nodes; see the appendix for more details.

Figure 6: (a) A sensor network on a 3 × 3 grid; green lines denote the MRF edges of the hidden process p(h), on some of which (red arrows) signals are allowed to pass; each sensor may be accurate (purple) or noisy (black). Optimal strategies should pass signals from accurate sensors to noisy ones but not the reverse. In the diagram, h_i are hidden variables, v_i local measurements, d_i prediction decisions, s_i signal decisions, u_i reward utilities, and c_i signal cost utilities. (b)-(c) The log MEU of algorithms running on (b) a junction tree and (c) a loopy junction graph. As the signal cost increases, all algorithms converge to the communication-free strategy. Results averaged over 10 random trials.

To escape the "all-zero" fixed point, we initialize the proximal algorithms and SPU with 5 random policies, and BP-0+ and Anneal-BP-1/t with 5 random messages. We first test on a sensor network on a 3 × 3 grid, where the algorithms are run on both a junction tree constructed by standard triangulation and a loopy junction graph (see the appendix for construction details). As shown in Fig. 6(b)-(c), SPU performs worst in all cases. Interestingly, Anneal-BP-1/t performs relatively poorly here, because the annealing steps make it insensitive to, and unable to exploit, the random initializations; this can be fixed by a "perturbed" annealed method that injects a random perturbation into the model and gradually decreases the perturbation level across iterations (Anneal-BP-1/t (Perturbed)).

Figure 7: Results on a sensor network on a random graph with 30 nodes (the MRF edges overlap with the signal paths). Averaged over 5 random models.
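The sensor model just described (noisy local measurements with error rate e_i, prediction rewards γ, and signal costs λ) can be sketched in a few lines; this is only an illustration of the model's local terms, with our own helper names, not the code used for the experiments.

```python
import math

PH = 5                      # number of states of each hidden variable (appendix: p_h = 5)
GAMMA = math.log(2)         # reward for a correct prediction d_i == h_i

def measurement_model(e_i, ph=PH):
    """p(v_i | h_i): correct with probability 1 - e_i,
    otherwise uniform over the ph - 1 wrong states."""
    return [[1 - e_i if v == h else e_i / (ph - 1) for v in range(ph)]
            for h in range(ph)]

def local_utility(d_i, h_i, s_i, lam):
    """Additive local utility terms: prediction reward plus signal cost; the
    total utility is multiplicative, u = exp(sum of these local terms)."""
    reward = GAMMA if d_i == h_i else 0.0
    cost = -lam if s_i != 0 else 0.0     # s_i in {0, +1, -1}; 0 means "off"
    return reward + cost

good = measurement_model(0.05)    # a "good" sensor, e_i ~ U([0, .1])
bad = measurement_model(0.75)     # a "bad" sensor, e_i ~ U([.7, .8])
```

Each row of the measurement matrix sums to one, and a larger signal cost λ pushes the optimal strategy toward the communication-free one, consistent with Fig. 6.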
A similar experiment (with only the loopy junction graph) is performed on the larger random graph in Fig. 7; the algorithm performances follow similar trends. SPU performs even worse in this case, since it appears to over-send signals when two "good" sensors connect to one "bad" sensor.

6 Related Work

Many exact algorithms for IDs have been developed, usually in a variable-elimination or message-passing form; see Koller and Friedman [2009] for a recent review. Approximation algorithms are relatively unexplored, and are usually based on separately approximating individual components of exact algorithms [e.g., Sabbadin et al., 2011, Sallans, 2003]; our method instead builds an integrated framework. Other approaches, including MCMC [e.g., Charnes and Shenoy, 2004] and search methods [e.g., Marinescu, 2010], also exist, but are usually more expensive than SPU or our BP-like methods. See the appendix for more discussion.

7 Conclusion

In this work we derive a general variational framework for influence diagrams, covering both the "convex" centralized decisions with perfect recall and the "non-convex" decentralized decisions. We derive several algorithms, but equally importantly open the door for many others that can be applied within our framework. Since these algorithms rely on decomposing the global problems into local ones, they also open the possibility of efficiently distributable algorithms.

Acknowledgements. Work supported in part by NSF IIS-1065618 and a Microsoft Research Fellowship.

References

C. Bielza, P. Muller, and D. R. Insua. Decision analysis by augmented probability simulation. Management Science, 45(7):995-1007, 1999.
J. Charnes and P. Shenoy. Multistage Monte Carlo method for solving influence diagrams using local computation. Management Science, pages 405-418, 2004.
A. Detwarasiti and R. D. Shachter. Influence diagrams for team decision analysis. Decision Analysis, 2, Dec 2005.
R. Howard and J. Matheson. Influence diagrams. In Readings on the Principles and Applications of Decision Analysis, 1985.
R. Howard and J. Matheson. Influence diagrams. Decision Analysis, 2(3):127-143, 2005.
A. Iusem and M. Teboulle. On the convergence rate of entropic proximal optimization methods. Computational and Applied Mathematics, 12:153-168, 1993.
F. Jensen, F. V. Jensen, and S. L. Dittmer. From influence diagrams to junction trees. In UAI, pages 367-373. Morgan Kaufmann, 1994.
J. Jiang, P. Rai, and H. Daume III. Message-passing for approximate MAP inference with latent variables. In NIPS, 2011.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181-221, 2003.
O. P. Kreidl and A. S. Willsky. An efficient message-passing algorithm for optimizing decentralized detection networks. In IEEE Conf. Decision and Control, Dec 2006.
S. Lauritzen and D. Nilsson. Representing and solving decision problems with limited information. Management Science, pages 1235-1251, 2001.
Q. Liu and A. Ihler. Variational algorithms for marginal MAP. In UAI, Barcelona, Spain, July 2011.
R. Marinescu. A new approach to influence diagrams evaluation. In Research and Development in Intelligent Systems XXVI, pages 107-120. Springer London, 2010.
B. Martinet. Regularisation d'inequations variationnelles par approximations successives. Revue Francaise d'Informatique et de Recherche Operationnelle, 4:154-158, 1970.
R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37:279-328, 2010.
J. Pearl. Influence diagrams - historical and personal perspectives. Decision Analysis, 2(4):232-234, Dec 2005.
P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear programs: Proximal projections, convergence, and rounding schemes. Journal of Machine Learning Research, 11:1043-1080, Mar 2010.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877, 1976.
K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86(11):2210-2239, Nov 1998.
R. Sabbadin, N. Peyrard, and N. Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 2011.
B. Sallans. Variational action selection for influence diagrams. Technical Report OEFAI-TR-2003-29, Austrian Research Institute for Artificial Intelligence, 2003.
R. Shachter. Model building with belief networks and influence diagrams. In Advances in Decision Analysis: From Foundations to Applications, page 177, 2007.
M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670-690, 1992.
P. Tseng and D. Bertsekas. On the convergence of the exponential multiplier method for convex programming. Mathematical Programming, 60(1):1-19, 1993.
R. Viswanathan and P. Varshney. Distributed detection with multiple sensors: part I - fundamentals. Proc. IEEE, 85(1):54-63, Jan 1997.
M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313-2335, July 2005.
M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. Info. Theory, 45:1120-1146, 2003a.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. IEEE Trans. Info. Theory, 51(11):3697-3717, Nov 2003b.
Y. Weiss and W. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47(2):736-744, Feb 2001.
Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
D. Wolpert. Information theory - the bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems, pages 262-290, 2006.
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Info. Theory, 51, July 2005.
N. L. Zhang. Probabilistic inference in influence diagrams. In Computational Intelligence, pages 514-522, 1998.
N. L. Zhang, R. Qi, and D. Poole. A computational theory of decision networks. Int. J. Approx. Reasoning, 11:83-158, 1994.

This document contains proofs and other supplemental information for the UAI 2012 submission, "Belief Propagation for Structured Decision Making".

A Randomized vs. Deterministic Strategies

It is a well-known fact in decision theory that no randomized strategy can improve on the utility of the best deterministic strategy, so that:

Lemma 2.1. For any ID, max_{δ∈∆} EU(δ) = max_{δ∈∆^o} EU(δ).

Proof. Since ∆^o ⊂ ∆, we need only show that for any randomized strategy δ ∈ ∆, there exists a deterministic strategy δ' ∈ ∆^o such that EU(δ) ≤ EU(δ'). Note that

EU(δ) = Σ_x q(x) Π_{i∈D} p_i^δ(x_i|x_pa(i)).

Thus, EU(δ) is linear in p_i^δ(x_i|x_pa(i)) for any i ∈ D (with all the other policies fixed); therefore, one can always replace p_i^δ(x_i|x_pa(i)) with some deterministic p_i^δ'(x_i|x_pa(i)) without decreasing EU(δ). Doing so sequentially for all i ∈ D yields a deterministic rule δ' with EU(δ) ≤ EU(δ').

One can further show that any (globally) optimal randomized strategy can be represented as a convex combination of a set of optimal deterministic strategies.

B Duality Form of MEU

Here we give a proof of our main result.

Theorem 3.1. (a). For an influence diagram with augmented distribution q(x) ∝ exp(θ(x)), its log maximum expected utility log MEU(θ) equals

max_{τ∈M} { ⟨θ, τ⟩ + H(x; τ) − Σ_{i∈D} H(x_i|x_pa(i); τ) }.   (1)

Suppose τ* is a maximum of (1); then δ* = {τ*(x_i|x_pa(i)) | i ∈ D} is an optimal strategy.

(b). Under the perfect recall assumption, (1) reduces to

max_{τ∈M} { ⟨θ, τ⟩ + Σ_{o_i∈C} H(x_{o_i}|x_{o_{1:i−1}}; τ) }   (2)

where o_{1:i−1} = {o_j | j = 1, . . . , i−1}.

Proof. (a). Let q^δ(x) = q(x) Π_{i∈D} p_i^δ(x_i|x_pa(i)). Applying the standard duality result for the partition function to q^δ(x) ∝ exp(θ^δ(x)),

log MEU = max_δ log Σ_x exp(θ^δ(x))
        = max_δ max_{τ∈M} { ⟨θ^δ, τ⟩ + H(x; τ) }
        = max_{τ∈M} max_δ { ⟨θ^δ, τ⟩ + H(x; τ) },   (3)

and we have

max_δ { ⟨θ^δ, τ⟩ } = max_δ { ⟨θ + Σ_{i∈D} log p_i^δ(x_i|x_pa(i)), τ⟩ }
                   = ⟨θ, τ⟩ + Σ_{i∈D} max_{p_i^δ} { Σ_x τ(x) log p_i^δ(x_i|x_pa(i)) }
                   =* ⟨θ, τ⟩ + Σ_{i∈D} { Σ_x τ(x) log τ(x_i|x_pa(i)) }
                   = ⟨θ, τ⟩ − Σ_{i∈D} H(x_i|x_pa(i); τ),   (4)

where the equality "=*" holds because the solution of max_{p_i^δ} Σ_x τ(x) log p_i^δ(x_i|x_pa(i)), subject to the normalization constraint Σ_{x_i} p_i^δ(x_i|x_pa(i)) = 1, is p_i^δ = τ(x_i|x_pa(i)). We obtain (1) by plugging (4) into (3).

(b). By the chain rule of entropy, we have

H(x; τ) = Σ_i H(x_{o_i}|x_{o_{1:i−1}}; τ).   (5)

Note that pa(i) = o_{1:i−1} for i ∈ D in an ID with perfect recall. The result follows by substituting (5) into (1).

The following lemma is the "dual" version of Lemma 2.1; it will be helpful for proving Corollary 3.2 and Corollary 3.3.

Lemma B.1. Let M^o be the set of distributions τ(x) in which the τ(x_i|x_pa(i)), i ∈ D, are deterministic. Then the optimization domain M in (1) of Theorem 3.1 can be replaced by M^o without changing the result; that is, log MEU(θ) equals

max_{τ∈M^o} { ⟨θ, τ⟩ + H(x; τ) − Σ_{i∈D} H(x_i|x_pa(i); τ) }.   (6)

Proof. Note that M^o is equivalent to the set of deterministic strategies ∆^o. As shown in the proof of Lemma 2.1, there always exist optimal deterministic strategies; that is, at least one optimal solution of (1) falls in M^o. Therefore, the result follows.

Corollary 3.2. For an ID with parameter θ, we have

log MEU = max_{τ∈I} { ⟨θ, τ⟩ + Σ_{i∈C} H(x_{o_i}|x_{o_{1:i−1}}; τ) }   (7)

where I = {τ ∈ M : x_{o_i} ⊥ x_{o_{1:i−1}\pa(o_i)} | x_pa(o_i)}, corresponding to the distributions respecting the imperfect recall structure; "x ⊥ y | z" means that x and y are conditionally independent given z.

Proof. For any τ ∈ I, we have H(x_i|x_pa(i); τ) = H(x_{o_i}|x_{o_{1:i−1}}; τ); hence, by the entropic chain rule, the objective function in (7) is the same as that in (1). Then, for any τ ∈ M^o and o_i ∈ D, since τ(x_{o_i}|x_pa(o_i)) is deterministic, we have 0 ≤ H(x_{o_i}|x_{o_{1:i−1}}) ≤ H(x_{o_i}|x_pa(o_i)) = 0, which implies I(x_{o_i}; x_{o_{1:i}\pa(o_i)} | x_pa(o_i)) = 0, and hence M^o ⊆ I ⊆ M. We thus have that the optimum in (7) is no larger than (1), and no smaller than (6). The result follows since (1) and (6) are equal by Lemma B.1.

Corollary 3.3. For any ε, let τ* be an optimum of

max_{τ∈M} { ⟨θ, τ⟩ + H(x) − (1 − ε) Σ_{i∈D} H(x_i|x_pa(i)) }.   (8)

If δ* = {τ*(x_i|x_pa(i)) | i ∈ D} is a deterministic strategy, then it is an optimal strategy of the MEU.

Proof. First, we have H(x_i|x_pa(i); τ) = 0 for τ ∈ M^o and i ∈ D, since such τ(x_i|x_pa(i)) are deterministic. Therefore, the objective functions in (8) and (6) are equivalent when the maximization domains are restricted to M^o. The result follows by applying Lemma B.1.

B.1 Derivation of Belief Propagation for MEU

Eq. (11) is similar to the objective of sum-product junction graph BP, except that the entropy terms of the decision clusters are replaced by H_ε(x_ck), which can be thought of as corresponding to some local MEU problem. In the sequel, we derive a similar belief propagation algorithm for (11), in which the decision clusters receive some special consideration. To demonstrate how this can be done, it is helpful to first consider a local optimization problem associated with a single decision cluster.

Lemma B.2. Consider a local optimization problem on a decision cluster c_k,

max_{τ_ck} { ⟨ϑ_ck, τ_ck⟩ + H_ε(x_ck; τ_ck) }.

Its solution is

τ_ck(x_ck) ∝ σ_k[b_ck(x_ck); ε] := b(x_ck) b_ε(x_dk|x_pa(dk))^{1−ε},

where b_ck(x_ck) ∝ exp(ϑ_ck(x_ck)) and b_ε(x_dk|x_pa(dk)) is the "annealed" conditional probability of b_ck,

b_ε(x_dk|x_pa(dk)) = b(x_dk, x_pa(dk))^{1/ε} / Σ_{x_dk} b(x_dk, x_pa(dk))^{1/ε},
b(x_dk, x_pa(dk)) = Σ_{x_zk} b_ck(x_ck),   z_k = c_k \ {d_k, pa(d_k)}.

Proof. The Lagrangian function is

⟨ϑ_ck, τ_ck⟩ + H_ε(x_ck; τ_ck) + λ [ Σ_{x_ck} τ_ck(x_ck) − 1 ].

Its stationary point satisfies ϑ_ck(x_ck) − log τ_ck(x_ck) + (ε − 1) log τ_ck(x_dk|x_pa(dk)) + λ = 0, or equivalently,

τ_ck(x_ck) [τ_ck(x_dk|x_pa(dk))]^{ε−1} = b_ck(x_ck).   (9)

Summing over x_zk on both sides of (9), we have

τ_ck(x_pa(dk)) [τ_ck(x_dk|x_pa(dk))]^{ε} = b(x_dk, x_pa(dk)).   (10)

Raising both sides of (10) to the power 1/ε and summing over x_dk, we have

[τ_ck(x_pa(dk))]^{1/ε} = Σ_{x_dk} [b(x_dk, x_pa(dk))]^{1/ε}.   (11)

Combining (11) with (10), we have

τ_ck(x_dk|x_pa(dk)) = b_ε(x_dk|x_pa(dk)).   (12)

Finally, combining (12) with (9) gives

τ_ck(x_ck) = b_ck(x_ck) b_ε(x_dk|x_pa(dk))^{1−ε}.   (13)

The operator σ_k[b(x_ck); ε] can be treated as imputing b(x_ck) with an "annealed" policy defined by b_ε(x_dk|x_pa(dk)); this can be seen more clearly in the limit as ε → 0+.

Lemma B.3. Consider a local MEU problem on a single decision node d_k with parent nodes pa(d_k) and an augmented probability b_ck(x_ck); let

b*(x_dk|x_pa(dk)) = lim_{ε→0+} b_ε(x_dk|x_pa(dk));

then δ* = {b*(x_dk|x_pa(dk)) : d_k ∈ D} is an optimal strategy.

Proof. Let

δ*_dk(x_pa(dk)) = arg max_{x_dk} { b(x_dk|x_pa(dk)) }.

One can show that, as ε → 0+,

b*(x_dk|x_pa(dk)) = 1/|δ*_dk| if x_dk ∈ δ*_dk, and 0 otherwise;   (14)

thus b*(x_dk|x_pa(dk)) acts as a "maximum operator" on b(x_dk|x_pa(dk)), that is,

Σ_{x_dk} b(x_dk|x_pa(dk)) b*(x_dk|x_pa(dk)) = max_{x_dk} b(x_dk|x_pa(dk)).

Therefore, for any δ ∈ ∆, we have

EU(δ) = Σ_{x_ck} b_ck(x_ck) b^δ(x_dk|x_pa(dk))
      = Σ_{x_pa(dk)} b(x_pa(dk)) Σ_{x_dk} b(x_dk|x_pa(dk)) b^δ(x_dk|x_pa(dk))
      ≤ Σ_{x_pa(dk)} b(x_pa(dk)) max_{x_dk} b(x_dk|x_pa(dk))
      = Σ_{x_pa(dk)} b(x_pa(dk)) Σ_{x_dk} b(x_dk|x_pa(dk)) b*(x_dk|x_pa(dk))
      = EU(δ*),

which concludes the proof.
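The annealed conditional b_ε(x_dk|x_pa(dk)) ∝ b(x_dk, x_pa(dk))^{1/ε} of Lemma B.2, and its ε → 0+ limit (14), can be checked numerically. The sketch below (toy table b, with our own helper name) normalizes each row by its maximum before exponentiating, so that small ε does not underflow to an all-zero row; this is an implementation detail we add, not something stated in the paper.

```python
def annealed_conditional(b_joint, eps):
    """b_eps(x_d | x_pa) proportional to b(x_d, x_pa)^(1/eps),
    computed separately for each row (one row per parent config x_pa).
    Dividing by the row maximum first avoids numerical underflow."""
    out = []
    for row in b_joint:
        m = max(row)
        powed = [(v / m) ** (1.0 / eps) for v in row]
        z = sum(powed)
        out.append([p / z for p in powed])
    return out

# toy joint b(x_d, x_pa), one row per x_pa (illustrative values)
b = [[0.2, 0.5, 0.3],
     [0.4, 0.4, 0.2]]    # a tie: the limit splits mass over the argmax set
soft = annealed_conditional(b, eps=1.0)    # the ordinary conditional b(x_d | x_pa)
hard = annealed_conditional(b, eps=1e-3)   # approximately the zero-temperature limit (14)
```

At ε = 1 the operation reduces to ordinary row normalization, while at small ε the tied row converges to 1/|δ*| on each maximizer, exactly the uniform-over-argmax form in (14).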

Therefore, in the zero temperature limit, the σ_k[·] operator in MEU-BP (12)-(13) can be directly calculated via (14), avoiding the need for power operations.

We now derive the MEU-BP updates (12)-(13) for solving (11), using a Lagrange multiplier method similar to Yedidia et al. [2005]. Consider the Lagrangian of (11),

⟨θ, τ⟩ + Σ_{k∈R} H(x_ck) + Σ_{k∈D} H_ε(x_ck) − Σ_{(kl)∈E} H(x_skl)
   + Σ_{(kl)∈E} Σ_{x_skl} λ_{k→l}(x_skl) [ Σ_{x_ck\skl} τ_ck(x_ck) − τ_skl(x_skl) ],

where the nonnegativity and normalization constraints are not included and are dealt with implicitly. Taking its gradient w.r.t. τ_ck and τ_skl, one has

τ_ck ∝ ψ_ck m_∼k   for normal clusters,   (15)
τ_ck ∝ σ_k[ψ_ck m_∼k; ε]   for decision clusters,   (16)
τ_skl ∝ m_{k→l} m_{l→k}   for separators,   (17)

where ψ_ck = exp(θ_ck), m_{k→l} = exp(λ_{k→l}), and m_∼k = Π_{l∈N(k)} m_{l→k} is the product of the messages sent from the set of neighboring clusters N(k) to c_k. The derivation of Eq. (16) uses the result in Lemma B.2. Finally, substituting the consistency constraints

Σ_{x_ck\skl} τ_ck = τ_skl

into (15)-(17) leads to the fixed point updates in (12)-(13).

B.2 Reparameterization Interpretation

We can give a reparameterization interpretation for the MEU-BP updates (12)-(13), similar to those of the sum-, max- and hybrid-BP algorithms [e.g., Wainwright et al., 2003a, Weiss et al., 2007, Liu and Ihler, 2011]. We start by defining a set of "MEU-beliefs" b = {b(x_ck), b(x_skl)} by b(x_ck) ∝ ψ_ck m_∼k for all c_k ∈ C, and b(x_skl) ∝ m_{k→l} m_{l→k}. Note that we distinguish between the "beliefs" b and the "marginals" τ in (15)-(17). We have:

Lemma B.4. (a). At each iteration of MEU-BP (12)-(13), the MEU-beliefs b satisfy

q(x) ∝ Π_{k∈V} b(x_ck) / Π_{(kl)∈E} b(x_skl),   (18)

where q(x) is the augmented distribution of the ID.

(b). At a fixed point of MEU-BP, we have

Sum-consistency (normal clusters):   Σ_{ck\skl} b(x_ck) = b(x_skl);
MEU-consistency (decision clusters):   Σ_{ck\skl} σ_k[b(x_ck); ε] = b(x_skl).

Proof. (a). By simple algebraic substitution, one can show that

Π_{k∈V} b(x_ck) / Π_{(kl)∈E} b(x_skl) ∝ Π_{ck∈C} ψ_ck(x_ck).

Since q(x) ∝ Π_{ck∈C} ψ_ck(x_ck), the result follows.

(b). Simply substitute the definition of b into the message passing scheme (12)-(13).

B.3 Correctness Guarantees

Theorem 4.1. Let (G, C, S) be a consistent junction tree for a subset of decision nodes D0, and let b be a set of MEU-beliefs satisfying the reparameterization and consistency conditions in Lemma B.4 with ε → 0+. Let δ* = {b(x_dk|x_pa(dk)) : d_k ∈ D}; then δ* is a locally optimal strategy, in the sense that EU({δ*_{D0}, δ_{D\D0}}) ≤ EU(δ*) for any δ_{D\D0}.

Proof. On a junction tree, the reparameterization in (18) can be rewritten as

q(x) = b_0 Π_{k∈V} b(x_ck) / b(x_sk),

where s_k = s_{k,π(k)} (s_k = ∅ for the root node) and b_0 is the normalization constant.

For notational convenience, we only prove the case when D0 = D, i.e., when the junction tree is globally consistent; more general cases follow similarly, by noting that any decision node imputed with a fixed decision rule can simply be treated as a chance node.

First, we can rewrite EU(δ*) as

EU(δ*) = Σ_x q(x) Π_{i∈D} b(x_i|x_pa(i))
       = Σ_x b_0 Π_{k∈V} [ b(x_ck)/b(x_sk) ] Π_{i∈D} b(x_i|x_pa(i))
       = Σ_x b_0 Π_{k∈R} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} [ b(x_ck) b(x_dk|x_pa(dk)) / b(x_sk) ]
       = b_0,

where the last equality follows from the sum- and MEU-consistency conditions (with ε → 0+). To complete the proof, we need only show that EU(δ) ≤ b_0 for any δ ∈ ∆. Note that EU(δ)/b_0 equals

Σ_x Π_{k∈R} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ].

Let z_k = c_k \ s_k; since G is a junction tree, the z_k form a partition of V, i.e., ∪_k z_k = V and z_k ∩ z_l = ∅ for any k ≠ l. We have

EU(δ)/b_0 ≤ Π_{k∈R} max_{x_sk} Σ_{x_zk} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} max_{x_sk} Σ_{x_zk} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]   (*)
          = Π_{k∈D} max_{x_sk} Σ_{x_zk} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]   (19)
          = Π_{k∈D} max_{x_sk} Σ_{x_dk} [ b(x_dk, x_pa(dk)) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]
          ≤ Π_{k∈D} max_{x_sk} Σ_{x_dk} [ b(x_dk, x_pa(dk)) b(x_dk|x_pa(dk)) / b(x_sk) ]
          = 1,

where the inequality (*) holds because {z_k} forms a partition of V, the equality (19) holds due to the sum-consistency condition, and the last inequality follows the proof of Lemma B.3; the final equality uses the MEU-consistency condition. This completes the proof.

Based on Theorem 4.1, we can easily establish person-by-person optimality of BP on an arbitrary junction tree.

Theorem 4.2. Let (G, C, S) be an arbitrary junction tree, with b and δ* defined as in Theorem 4.1. Then δ* is a locally optimal strategy in Nash's sense: EU({δ*_i, δ_{D\i}}) ≤ EU(δ*) for any i ∈ D and δ_{D\i}.

Proof. Following Theorem 4.1, one need only show that any junction tree is consistent for any single decision node i ∈ D; this is easily done by choosing a tree-ordering rooted at i's decision cluster.

C About the Proximal Algorithm

The proximal method can be equivalently interpreted as a majorize-minimize (MM) algorithm [Hunter and Lange, 2004], or as a convex concave procedure (CCCP) [Yuille, 2002]. The MM and CCCP algorithms have been widely applied to standard inference problems to obtain convergence guarantees or better solutions; see e.g., Yuille [2002], Liu and Ihler [2011].

The MM algorithm is a generalization of the EM algorithm, which solves min_{τ∈M} f(τ) by a sequence of surrogate optimization problems

τ^{t+1} = arg min_{τ∈M} f^t(τ),

where f^t(τ), known as a majorizing function, should satisfy f^t(τ) ≥ f(τ) for all τ ∈ M and f^t(τ^t) = f(τ^t). It is straightforward to check that the objective in the proximal update (20) is a majorizing function. Therefore, the proximal algorithm can be treated as a special MM algorithm.

The convex concave procedure (CCCP) [Yuille and Rangarajan, 2003] is a special MM algorithm which decomposes the objective into a difference of two convex functions, that is,

f(τ) = f^+(τ) − f^−(τ),

where f^+(τ) and f^−(τ) are both convex, and constructs a majorizing function by linearizing the negative part, that is,

f^t(τ) = f^+(τ) − ∇f^−(τ^t)^T (τ − τ^t).

One can easily show that f^t(τ) is a majorizing function via Jensen's inequality. To apply CCCP to the MEU dual (1), it is natural to set

f^+(τ) = −[ ⟨θ, τ⟩ + H(x; τ) ]   and   f^−(τ) = −Σ_{i∈D} H(x_i|x_pa(i); τ).

Such a CCCP algorithm is recognizable as equivalent to the proximal algorithm in Section 4.2 with w^t = 1. The convergence results for MM algorithms and CCCP are also well established; see Vaida [2005], Lange et al. [2000], Schifano et al. [2010] for the MM algorithm, and Sriperumbudur and Lanckriet [2009], Yuille and Rangarajan [2003] for CCCP.

D Additively Decomposable Utilities

The algorithms we describe require the augmented distribution q(x) to be factored, or to have low (constrained) tree-width. However, this can easily fail to hold in a direct representation of additively decomposable utilities. To explain, recall that the augmented distribution is q(x) ∝ q0(x) u(x), where q0(x) = Π_{i∈C} p(x_i|x_pa(i)) and u(x) = Σ_{j∈U} u_j(x_βj). In this case, the utility u(x) creates a large factor with variable domain ∪_j β_j, which can easily destroy the factored structure of q(x). Unfortunately, the naive method of calculating the expectation node by node, and the commonly used general variable elimination procedures
[e.g., Jensen et al., 1994] do not appear suitable for Gh = (Vh,Eh), our variational framework. 1 X p(h) = exp[ θ (h , h ))], (21) Instead, we introduce an artificial product structure Z ij i j into the utility function by augmenting the model with (ij)∈Gh a latent “selector” variable, similar to that used for the where h are discrete variables with p states (we take “complete likelihood” in mixture models. Let y be an i h 0 p = 5); we set θ (k, k) = 0 and randomly draw auxiliary random variable taking values in the utility h ij θ (k, l)(k 6= l) from the negative absolute values of a index set U, so that ij standard Gaussian variable N (0, 1). Each sensor gives a noisy measurement vi of the local variable hi with 0 Y q˜(x, y0) = q (x) u˜j(xβj , y0), probability of error ei, that is, p(vi|hi) = 1 − ei for j vi = hi and p(vi|hi) = ei/(ph − 1) (uniformly) for vi 6= hi. whereu ˜ (x , y ) is defined byu ˜ (x , j) =u ˜ (x ) j βj 0 j βj j βj Let G be a DAG that defines the path on which the andu ˜ (x , k) = 1 for j 6= k. It is easy to verify s j βj sensors are allowed to broadcast signals (all the down- that the marginal distribution ofq ˜(x, y0) over y0 is P stream sensors receive the same signal); we assume q(x), that is, q˜(x, y0) = q(x). The tree-width of y0 the channels are noise free. Each sensor is associated q˜(x, y ) is no larger than one plus the tree-width of the 0 with two decision variables: ∈ {0 ±1} represents graph (with utility nodes included) of the ID, which si , the signal send from sensor i, where ±1 represents a is typically much smaller than that of q(x) when the one-bit signal with cost λ and 0 represents “off” with complete utility u(x) is included directly. 
A deriva- no cost; and represents the prediction of based tion similar to that in Theorem 3.1 shows that we di hi on v and the signals s received from i’s upper- can replace θ(x) = log q(x) with θˆ(x) = logq ˆ(x) in i pas(i) stream sensors; a correct prediction ( = ) yields (1) without changing the results. The complexity of di hi a reward (we set = ln 2). Hence, two types of this method may be further improved by exploiting γ γ utility functions are involved, the signal cost utilities the context-specific independence ofq ˆ(x, y), i.e., that uλ, with uλ(si = ±1) = −λ and uλ(si = 0) = 0; qˆ(x|y0) has a different dependency structure for differ- the prediction reward utilities uγ with uγ (di, hi) = γ ent values of y0, but we leave this for future work. if di = hi and uγ (di, hi) = 0 otherwise. The to- tal utility function is constructed multiplicatively via P E Decentralized Sensor Network u = exp[ i uγ (di, hi) + uλ(si)]. We also create two qualities of sensors: “good” sen- In this section, we provide detailed information about sors for which ei are drawn from U([0,.1]) and “bad” the influence diagram constructed for the decentral- sensors (ei ∼ U([.7,.8])), where U is the uniform dis- ized sensor network detection problem in Section5. tribution. Generally speaking, the optimal strategies Let hi, i = 1, . . . , nh, be the hidden variables we want should pass signals from the good sensors to bad sen- to detect using sensors. We assume the distribu- sors to improve their predictive power. See Fig.1 for tion p(h) is an attractive pairwise MRF on a graph the actual influence diagram. uj cj ui ci hi Hidden variables

[Figure 1 appears here.]
(a) The sensor network structure. (b) The influence diagram for the sensor network in (a). Legend: hi: hidden variables; vi: local measurements; di: prediction decisions; si: signal decisions; ui: reward utilities; ci: signal cost utilities.
(c) A loopy junction graph for the ID in (b). Legend: decision clusters; normal clusters; separators.

Figure 1: (a) A node of the sensor network structure; the green lines denote the MRF edges, on some of which (red arrows) signals are allowed to pass. (b) The influence diagram constructed for the sensor network in (a). (c) A loopy junction graph for the ID in (b); π(i) denotes the parent set of i in terms of the signal path Gs, and π′(i) denotes the parent set in terms of the hidden process p(h) (when p(h) is transformed into a Bayesian network by triangularizing reversely along the order o). The decision clusters (black rectangles) are labeled with their corresponding decision variables on their top.

The definition of the ID here is not a standard one, since p(h) is not specified as a Bayesian network; but one could convert p(h) to an equivalent Bayesian network by the standard triangulation procedure. The normalization constant Z in (21) only changes the expected utility function by a constant, and so does not need to be calculated for the purpose of the MEU task.

Without loss of generality, for notation we assume the order [1, . . . , nh] is consistent with the signal path Gs. Let o = [h1, v1, d1, s1; . . . ; h_nh, v_nh, d_nh, s_nh]. The junction tree we used in the experiment is constructed by the standard triangulation procedure, backwards along the order o.
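As a concrete reading of this notation, the order o interleaves each sensor's four variables consecutively, with sensors listed consistently with the signal path. A minimal helper (our own illustration, not code from the paper) that enumerates o for sensors 1, . . . , n:

```python
def make_order(n):
    """Build o = [h1, v1, d1, s1; ...; hn, vn, dn, sn]: each sensor contributes
    its hidden state h_i, measurement v_i, prediction decision d_i, and signal
    decision s_i in a consecutive block, with the sensor indices 1..n assumed
    to already be consistent with the signal path G_s."""
    order = []
    for i in range(1, n + 1):
        order += [f"h{i}", f"v{i}", f"d{i}", f"s{i}"]
    return order
```

Triangulating backwards along this order then yields the junction tree used in the experiments.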
A proper construction of a loopy junction graph is non-trivial; we show in Fig. 1(c) the one we used in Section 5. It is constructed so that the decision structure inside each sensor node is preserved exactly, while at a higher level (among sensors) a standard loopy junction graph (similar to that introduced in Mateescu et al. [2010], which corresponds to Pearl's loopy BP) captures the correlations between the sensor nodes. One can show that a junction graph so constructed reduces to a junction tree when the MRF Gh is a tree and the signal path Gs is an oriented tree.

F Additional Related Work

There exists a large body of work for solving influence diagrams, mostly on exact algorithms with perfect recall; see Koller and Friedman [2009] for a recent review. Our work is most closely connected to the early work of Jensen et al. [1994], who compile an ID to a junction tree structure on which a special message passing algorithm is performed; their notion of strong junction trees is related to our notion of global consistency. However, their framework requires the perfect recall assumption, and it is unclear how to extend it to approximate inference. A somewhat different approach transforms the decision problem into a sequence of standard Bayesian network inference problems [Cooper, 1988, Shachter and Peot, 1992, Zhang, 1998], where each subroutine is a standard inference problem that can be solved using standard algorithms, either exactly or approximately; again, these methods only work under the perfect recall assumption. Other approximation algorithms for IDs are also based on approximating some inner step of exact algorithms: e.g., Sabbadin et al. [2011] and Sallans [2003] approximate the policy update methods by mean field methods, and Nath and Domingos [2010] use adaptive belief propagation to approximate the inner loop of greedy search algorithms. Watthayu [2008] proposed a loopy BP algorithm, but without theoretical justification. To the best of our knowledge, no well-established "direct" approximation methods exist.

For IDs without perfect recall (LIMIDs), backward-induction-like methods do not apply; most algorithms work by optimizing the decision rules node-by-node or group-by-group; see, e.g., Lauritzen and Nilsson [2001], Madsen and Nilsson [2001], Koller and Milch [2003]. These methods reduce to exact backward induction (hence guaranteeing global optimality) if applied to IDs with perfect recall and updated backwards along the temporal ordering. However, they only guarantee local, person-by-person optimality for general LIMIDs, which may be weaker than the optimality guaranteed by our BP-like methods. Other styles of approaches, such as Monte Carlo methods [e.g., Bielza et al., 1999, Cano et al., 2006, Charnes and Shenoy, 2004, Garcia-Sanchez and Druzdzel, 2004] and search-based methods [e.g., Luque et al., 2008, Qi and Poole, 1995, Yuan and Wu, 2010, Marinescu, 2010], have also been proposed. Recently, Maua and Campos [2011] proposed a method for finding the globally optimal strategies of LIMIDs by iteratively pruning non-maximal policies. However, these methods usually appear to have much greater computational complexity than SPU or our BP-like methods.

Finally, some variational inference ideas have been applied to the related problems of reinforcement learning and solving Markov decision processes [e.g., Sallans and Hinton, 2001, Furmston and Barber, 2010, Yoshimoto and Ishii, 2004].

References

C. Bielza, P. Muller, and D. R. Insua. Decision analysis by augmented probability simulation. Management Science, 45(7):995-1007, 1999.
A. Cano, M. Gómez, and S. Moral. A forward-backward Monte Carlo method for solving influence diagrams. Int. J. Approx. Reason., 42(1-2):119-135, 2006.
J. Charnes and P. Shenoy. Multistage Monte Carlo method for solving influence diagrams using local computation. Management Science, pages 405-418, 2004.
G. F. Cooper. A method for using belief networks as influence diagrams. In UAI, pages 55-63, 1988.
A. Detwarasiti and R. D. Shachter. Influence diagrams for team decision analysis. Decision Anal., 2, Dec 2005.
T. Furmston and D. Barber. Variational methods for reinforcement learning. In AISTATS, volume 9, pages 241-248, 2010.
D. Garcia-Sanchez and M. Druzdzel. An efficient sampling algorithm for influence diagrams. In Proc. Second European Workshop on Probabilistic Graphical Models (PGM-04), Leiden, Netherlands, pages 97-104, 2004.
R. Howard and J. Matheson. Influence diagrams. In Readings on Principles & Appl. Decision Analysis, 1985.
R. Howard and J. Matheson. Influence diagrams. Decision Anal., 2(3):127-143, 2005.
D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1), February 2004.
A. Iusem and M. Teboulle. On the convergence rate of entropic proximal optimization methods. Computational and Applied Mathematics, 12:153-168, 1993.
F. Jensen, F. V. Jensen, and S. L. Dittmer. From influence diagrams to junction trees. In UAI, pages 367-373. Morgan Kaufmann, 1994.
J. Jiang, P. Rai, and H. Daumé III. Message-passing for approximate MAP inference with latent variables. In NIPS, 2011.
D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181-221, 2003.
O. P. Kreidl and A. S. Willsky. An efficient message-passing algorithm for optimizing decentralized detection networks. In IEEE Conf. Decision Control, Dec 2006.
K. Lange, D. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, pages 1-20, 2000.
S. Lauritzen and D. Nilsson. Representing and solving decision problems with limited information. Management Sci., pages 1235-1251, 2001.
Q. Liu and A. Ihler. Variational algorithms for marginal MAP. In UAI, Barcelona, Spain, July 2011.
M. Luque, T. Nielsen, and F. Jensen. An anytime algorithm for evaluating unconstrained influence diagrams. In Proc. PGM, pages 177-184, 2008.
A. Madsen and D. Nilsson. Solving influence diagrams using HUGIN, Shafer-Shenoy and lazy propagation. In UAI, volume 17, pages 337-345, 2001.
R. Marinescu. A new approach to influence diagrams evaluation. In Research and Development in Intelligent Systems XXVI, pages 107-120. Springer London, 2010.
B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle, 4:154-158, 1970.
R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37:279-328, 2010.
D. D. Maua and C. P. Campos. Solving decision problems with limited information. In NIPS, 2011.
A. Nath and P. Domingos. Efficient belief propagation for utility maximization and repeated inference. In AAAI Conference on Artificial Intelligence, 2010.
J. Pearl. Influence diagrams - historical and personal perspectives. Decision Anal., 2(4):232-234, Dec 2005.
R. Qi and D. Poole. A new method for influence diagram evaluation. Computational Intelligence, 11(3):498-528, 1995.
P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear programs: Proximal projections, convergence, and rounding schemes. Journal of Machine Learning Research, 11:1043-1080, Mar 2010.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877, 1976.
K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86(11):2210-2239, Nov 1998.
R. Sabbadin, N. Peyrard, and N. Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 2011.
B. Sallans. Variational action selection for influence diagrams. Technical Report OEFAI-TR-2003-29, Austrian Research Institute for Artificial Intelligence, 2003.
B. Sallans and G. Hinton. Using free energies to represent Q-values in a multiagent reinforcement learning task. In NIPS, pages 1075-1081, 2001.
E. Schifano, R. Strawderman, and M. Wells. Majorization-minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258-1299, 2010.
R. Shachter. Model building with belief networks and influence diagrams. Advances in Decision Analysis: From Foundations to Applications, page 177, 2007.
R. Shachter and M. Peot. Decision making using probabilistic inference methods. In UAI, pages 276-283, 1992.
B. Sriperumbudur and G. Lanckriet. On the convergence of the concave-convex procedure. In NIPS, pages 1759-1767, 2009.
M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670-690, 1992.
P. Tseng and D. Bertsekas. On the convergence of the exponential multiplier method for convex programming. Mathematical Programming, 60(1):1-19, 1993.
F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, 15(3):831-840, 2005.
R. Viswanathan and P. Varshney. Distributed detection with multiple sensors: part I - fundamentals. Proc. IEEE, 85(1):54-63, Jan 1997.
M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1-305, 2008.
M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313-2335, July 2005.
M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. Info. Theory, 45:1120-1146, 2003a.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. IEEE Trans. Info. Theory, 51(11):3697-3717, Nov 2003b.
W. Watthayu. Representing and solving influence diagram in multi-criteria decision making: A loopy belief propagation method. In International Symposium on Computer Science and its Applications (CSA '08), pages 118-125, Oct 2008.
Y. Weiss and W. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47(2):736-744, Feb 2001.
Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
D. Wolpert. Information theory - the bridge connecting bounded rational game theory and statistical physics. Complex Engineered Systems, pages 262-290, 2006.
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized BP algorithms. IEEE Trans. Info. Theory, 51, July 2005.
J. Yoshimoto and S. Ishii. A solving method for MDPs by minimizing variational free energy. In International Joint Conference on Neural Networks, volume 3, pages 1817-1822. IEEE, 2004.
C. Yuan and X. Wu. Solving influence diagrams using heuristic search. In Proc. Symp. AI & Math, 2010.
A. L. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 2002.
A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915-936, April 2003.
N. L. Zhang. Probabilistic inference in influence diagrams. In Computational Intelligence, pages 514-522, 1998.
N. L. Zhang, R. Qi, and D. Poole. A computational theory of decision networks. Int. J. Approx. Reason., 11:83-158, 1994.