
Belief Propagation for Structured Decision Making

Qiang Liu
Department of Computer Science
University of California, Irvine
Irvine, CA 92697
[email protected]

Alexander Ihler
Department of Computer Science
University of California, Irvine
Irvine, CA 92697
[email protected]

Abstract

Variational inference algorithms, such as belief propagation, have had tremendous impact on our ability to learn and use graphical models, and give many insights for developing or understanding exact and approximate inference. However, variational approaches have not been widely adopted for decision making in graphical models, often formulated through influence diagrams and including both centralized and decentralized (or multi-agent) decisions. In this work, we present a general variational framework for solving structured cooperative decision-making problems, use it to propose several belief propagation-like algorithms, and analyze them both theoretically and empirically.

1 Introduction

Graphical modeling approaches, including Bayesian networks and Markov random fields, have been widely adopted for problems with complicated dependency structures and uncertainties. The problems of learning, i.e., estimating a model from data, and inference, e.g., calculating marginal probabilities or maximum a posteriori (MAP) estimates, have attracted wide attention and are well explored. Variational inference approaches have been widely adopted as a principled way to develop and understand many exact and approximate algorithms. On the other hand, the problem of decision making in graphical models, sometimes formulated via influence diagrams or decision networks and including both sequential centralized decisions and decentralized or multi-agent decisions, is surprisingly less explored in the approximate inference community.

Influence diagrams (IDs), or decision networks [Howard and Matheson, 1985, 2005], are a graphical model representation of structured decision problems under uncertainty; they can be treated as an extension of Bayesian networks, augmented with decision nodes and utility functions. Traditionally, IDs are used to model centralized, sequential decision processes under "perfect recall", which assumes that the decision steps are ordered in time and that all information is remembered across time; limited memory influence diagrams (LIMIDs) [Zhang et al., 1994, Lauritzen and Nilsson, 2001] relax the perfect recall assumption, creating a natural framework for representing decentralized and information-limited decision problems, such as team decision making and multi-agent systems. Despite the close connection and similarity to Bayes nets, IDs have less visibility in the graphical model and automated reasoning community, both in terms of modeling and algorithm development; see Pearl [2005] for an interesting historical perspective.

Solving an ID refers to finding decision rules that maximize the expected utility function (MEU); this task is significantly more difficult than standard inference on a Bayes net. For IDs with perfect recall, MEU can be restated as a dynamic program, and solved with cost exponential in a constrained tree-width of the graph that is subject to the temporal ordering of the decision nodes. The constrained tree-width can be much higher than the tree-width associated with typical inference, making MEU significantly more complex. For LIMIDs, non-convexity issues also arise, since the limited shared information and simultaneous decisions may create locally optimal policies. The most popular algorithm for LIMIDs is based on policy-by-policy improvement [Lauritzen and Nilsson, 2001], and provides only a "person-by-person" notion of optimality.

Surprisingly, the variational ideas that revolutionized inference in Bayes nets have not been adopted for influence diagrams. Although there exists work on transforming MEU problems into sequences of standard marginalization problems [e.g., Zhang, 1998], on which variational methods apply, these methods do not yield general frameworks, and usually only work for IDs with perfect recall. A full variational framework would provide general procedures for developing efficient approximations such as loopy belief propagation (BP), which are crucial for large scale problems, or for providing new theoretical analysis.

In this work, we propose a general variational framework for solving influence diagrams, both with and without perfect recall. Our results on centralized decision making include traditional inference in graphical models as special cases. We propose a spectrum of exact and approximate algorithms for MEU problems based on the variational framework. We give several optimality guarantees, showing that under certain conditions our BP algorithm can find the globally optimal solution for IDs with perfect recall, and can solve LIMIDs in a stronger locally optimal sense than coordinate-wise optimality. We show that a temperature parameter can also be introduced to smooth between MEU tasks and standard (easier) marginalization problems, and can provide good solutions by annealing the temperature or using iterative proximal updates.

This paper is organized as follows. Section 2 sets up background on graphical models, variational methods and influence diagrams. We present our variational framework of MEU in Section 3, and use it to develop several BP algorithms in Section 4. We present numerical experiments in Section 5. Finally, we discuss additional related work in Section 6 and concluding remarks in Section 7. Proofs and additional information can be found in the appendix.

2 Background

2.1 Graphical Models

Let x = {x_1, x_2, ..., x_n} be a random vector in X = X_1 × ... × X_n. Consider a factorized probability on x,

  p(x) = (1/Z) ∏_{α∈I} ψ_α(x_α) = (1/Z) exp[ Σ_{α∈I} θ_α(x_α) ],

where I is a set of variable subsets, and ψ_α : X_α → R+ are positive factors; the θ_α(x_α) = log ψ_α(x_α) are the natural parameters of the exponential family representation; and Z = Σ_x ∏_{α∈I} ψ_α is the normalization constant or partition function, with Φ(θ) = log Z the log-partition function. Let θ = {θ_α | α ∈ I} and θ(x) = Σ_α θ_α(x_α). There are several ways to represent a factorized distribution using graphs (i.e., graphical models), including Markov random fields, Bayesian networks, factor graphs and others.

Given a graphical model, inference refers to the procedure of answering probabilistic queries. Important inference tasks include marginalization, maximum a posteriori (MAP, sometimes called maximum probability of evidence or MPE), and marginal MAP (sometimes simply MAP). All of these are NP-hard in general. Marginalization calculates the marginal probabilities of one or a few variables, or equivalently the normalization constant Z, while MAP/MPE finds the mode of the distribution. More generally, marginal MAP seeks the mode of a marginal probability,

  Marginal MAP:  x*_A = arg max_{x_A} Σ_{x_B} ∏_α ψ_α(x_α),

where A, B are disjoint sets with A ∪ B = V; it reduces to marginalization if A = ∅ and to MAP if B = ∅.

Marginal Polytope. A marginal polytope M is a set of local marginals τ = {τ_α(x_α) : α ∈ I} that are extensible to a global distribution over x, that is, M = {τ | ∃ a distribution p(x), s.t. Σ_{x_{V\α}} p(x) = τ_α(x_α)}. Call P[τ] the set of global distributions consistent with τ ∈ M; there exists a unique distribution in P[τ] that has maximum entropy and follows the exponential family form for some θ. We abuse notation to denote this unique global distribution τ(x).

A basic result for variational methods is that Φ(θ) is convex and can be rewritten into a dual form,

  Φ(θ) = max_{τ∈M} { ⟨θ, τ⟩ + H(x; τ) },   (1)

where ⟨θ, τ⟩ = Σ_α Σ_{x_α} θ_α(x_α) τ_α(x_α) is the point-wise inner product, and H(x; τ) = −Σ_x τ(x) log τ(x) is the entropy of the distribution τ(x); the maximum of (1) is obtained when τ equals the marginals of the original distribution with parameter θ. See Wainwright and Jordan [2008].

Similar dual forms hold for MAP and marginal MAP. Letting Φ_{A,B}(θ) = log max_{x_A} Σ_{x_B} exp(θ(x)), we have [Liu and Ihler, 2011]

  Φ_{A,B}(θ) = max_{τ∈M} { ⟨θ, τ⟩ + H(x_B | x_A ; τ) },   (2)

where H(x_B | x_A ; τ) = −Σ_x τ(x) log τ(x_B | x_A) is the conditional entropy; its appearance corresponds to the sum operators.

The dual forms in (1) and (2) are no easier to compute than the original inference. However, one can approximate the marginal polytope M and the entropy in various ways, yielding a body of approximate inference algorithms, such as loopy belief propagation (BP) and its generalizations [Yedidia et al., 2005, Wainwright et al., 2005], linear programming solvers [e.g., Wainwright et al., 2003b], and recently hybrid message passing algorithms [Liu and Ihler, 2011, Jiang et al., 2011].

Junction Graph BP. Junction graphs provide a procedural framework to approximate the dual (1). A cluster graph is a triple (G, C, S), where G = (V, E) is an undirected graph, with each node k ∈ V associated with a subset of variables c_k ∈ C (clusters), and each edge (kl) ∈ E with a subset s_kl ∈ S (separator) satisfying s_kl ⊆ c_k ∩ c_l. We assume that C subsumes the index set I, that is, for any α ∈ I, there exists a c_k ∈ C, denoted c[α], such that α ⊆ c_k. In this case, we can reparameterize θ = {θ_α | α ∈ I} into θ = {θ_{c_k} | k ∈ V} by taking θ_{c_k} = Σ_{α : c[α]=c_k} θ_α, without changing the distribution. A cluster graph is called a junction graph if it satisfies the running intersection property: for each i ∈ V, the induced sub-graph consisting of the clusters and separators that include i is a connected tree. A junction graph is a junction tree if G is a tree.

To approximate the dual (1), we can replace M with a locally consistent polytope L: the set of local marginals τ = {τ_{c_k}, τ_{s_kl} : k ∈ V, (kl) ∈ E} satisfying Σ_{x_{c_k \ s_kl}} τ_{c_k}(x_{c_k}) = τ_{s_kl}(x_{s_kl}). Clearly, M ⊆ L. We then approximate (1) by

  max_{τ∈L} { ⟨θ, τ⟩ + Σ_{k∈V} H(x_{c_k} ; τ_{c_k}) − Σ_{(kl)∈E} H(x_{s_kl} ; τ_{s_kl}) },

where the joint entropy is approximated by a linear combination of the entropies of the local marginals. The approximate objective can be solved using Lagrange multipliers [Yedidia et al., 2005], leading to a sum-product message passing algorithm that iteratively sends messages between neighboring clusters via

  m_{k→l}(x_{s_kl}) ∝ Σ_{x_{c_k \ s_kl}} ψ_{c_k}(x_{c_k}) m_{∼k\l}(x_{c_k}),   (3)

where ψ_{c_k} = exp(θ_{c_k}), and m_{∼k\l} is the product of messages into k from its neighbors N(k) except l. At convergence, the (locally) optimal marginals are

  τ_{c_k} ∝ ψ_{c_k} m_{∼k}  and  τ_{s_kl} ∝ m_{k→l} m_{l→k},

where m_{∼k} is the product of all messages into k. Max-product and hybrid methods can be derived analogously for MAP and marginal MAP problems.

2.2 Influence Diagrams

Influence diagrams (IDs) or decision networks are extensions of Bayesian networks to represent structured decision problems under uncertainty. Formally, an influence diagram is defined on a directed acyclic graph G = (V, E), where the nodes V are divided into two subsets, V = C ∪ D, where C and D represent respectively the set of chance nodes and decision nodes. Each chance node i ∈ C represents a random variable x_i with a conditional probability table p_i(x_i | x_pa(i)). Each decision node i ∈ D represents a controllable decision variable x_i, whose value is determined by a decision maker via a decision rule (or policy) δ_i : X_pa(i) → X_i, which determines the value of x_i based on the observed values of x_pa(i); we call the collection of policies δ = {δ_i | i ∈ D} a strategy. Finally, a utility function u : X → R+ measures the reward given an instantiation of x = [x_C, x_D], which the decision maker wants to maximize. It is reasonable to assume some decomposition structure on the utility u(x), either additive, u(x) = Σ_{j∈U} u_j(x_{β_j}), or multiplicative, u(x) = ∏_{j∈U} u_j(x_{β_j}). A decomposable utility function can be visualized by augmenting the DAG with a set of leaf nodes U, called utility nodes, each with parent set β_j. See Fig. 1 for a simple example.

[Figure 1: A simple influence diagram for deciding vacation activity [Shachter, 2007]. Chance nodes: Weather Forecast, Weather Condition; decision node: Vacation Activity; utility node: Satisfaction.]

A decision rule δ_i is alternatively represented as a deterministic conditional "probability" p_i^δ(x_i | x_pa(i)), where p_i^δ(x_i | x_pa(i)) = 1 for x_i = δ_i(x_pa(i)) and zero otherwise. It is helpful to allow soft decision rules, where p_i^δ(x_i | x_pa(i)) takes fractional values; these define a randomized strategy in which x_i is determined by randomly drawing from p_i^δ(x_i | x_pa(i)). We denote by ∆° the set of deterministic strategies and ∆ the set of randomized strategies. Note that ∆° is a discrete set, while ∆ is its convex hull.

Given an influence diagram, the optimal strategy should maximize the expected utility function (MEU):

  MEU = max_{δ∈∆} EU(δ) = max_{δ∈∆} E(u(x) | δ)
      = max_{δ∈∆} Σ_x u(x) ∏_{i∈C} p_i(x_i | x_pa(i)) ∏_{i∈D} p_i^δ(x_i | x_pa(i))
      = max_{δ∈∆} Σ_x exp(θ(x)) ∏_{i∈D} p_i^δ(x_i | x_pa(i)),   (4)

where θ(x) = log[u(x) ∏_{i∈C} p_i(x_i | x_pa(i))]; we call the distribution q(x) ∝ exp(θ(x)) the augmented distribution [Bielza et al., 1999]. The concept of the augmented distribution is critical, since it completely specifies a MEU problem without the semantics of the influence diagram; hence one can specify q(x) arbitrarily, e.g., via an undirected MRF, extending the definition of IDs. We can treat MEU as a special sort of "inference" on the augmented distribution, which, as we will show, generalizes more common inference tasks.

In (4) we maximize the expected utility over ∆; this is equivalent to maximizing over ∆°, since:

Lemma 2.1. For any ID, max_{δ∈∆} EU(δ) = max_{δ∈∆°} EU(δ).
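To make the MEU objective (4) concrete, here is a minimal brute-force sketch (not from the paper; the model and all numbers are hypothetical) that evaluates the expected utility of every deterministic decision rule on a toy ID with one chance node c and one decision node d that observes c:

```python
import itertools

# Hypothetical toy influence diagram: chance node c with prior p(c),
# decision node d observing c, and utility u(c, d).
p_c = {0: 0.4, 1: 0.6}
u = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.3, (1, 1): 1.5}
d_states = [0, 1]

def expected_utility(delta):
    # EU(delta) = sum_c p(c) * u(c, delta(c)) for a deterministic rule delta
    return sum(p_c[c] * u[(c, delta[c])] for c in p_c)

# Enumerate all deterministic rules delta: X_c -> X_d, i.e., the discrete
# strategy set over which Eq. (4) maximizes; exact, but exponential in general.
rules = [dict(zip(sorted(p_c), choice))
         for choice in itertools.product(d_states, repeat=len(p_c))]
best = max(rules, key=expected_utility)
meu = expected_utility(best)
```

Because d observes c here, this toy problem has perfect recall; the same enumeration applies to LIMIDs, but the number of rules grows exponentially, which is what the variational treatment below avoids.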
The MEU prob- d1 d2 u(d1, d2) d1 d2 d1 d2 1 1 1 lem can be solved in closed form if the influence di- 0 1 0 agram satisfies a perfect recall assumption (PRA) — 1 0 0 u u there exists a “temporal” ordering over all the deci- 0 0 0.5 sion nodes, say {d1, d2, ··· , dm}, consistent with the (a) Perfect Recall (b) Imperfect Recall (c) Utility Function partial order defined by the DAG G, such that every decision node observes all the earlier decision nodes Figure 2: Illustrating imperfect recall. In (a) d2 ob- serves d ; its optimal decision rule is to equal d ’s state and their parents, that is, {dj} ∪ pa(dj) ⊆ pa(di) for 1 1 any j < i. Intuitively, PRA implies a centralized de- (whatever it is); knowing d2 will follow, d1 can choose cision scenario, where a global decision maker sets all d1 = 1 to achieve the global optimum. In (b) d1 and the decision nodes in a predefined order, with perfect d2 do not know the other’s states; both d1 = d2 = 1 memory of all the past observations and decisions. and d1 = d2 = 0 (suboptimal) become locally optimal strategies and the problem is multi-modal. With PRA, the chance nodes can be grouped by when they are observed. Let ri−1 (i = 1, . . . , m) be the set of chance nodes that are parents of di but not of any laxation causes many difficulties. First, it is no longer dj for j < i; then both decision and chance nodes are possible to eliminate the decision nodes in a sequential ordered by o = {r0, d1, r1, ··· , dm, rm}. The MEU and “sum-max-sum” fashion. Instead, the dependencies of its optimal strategy for IDs with PRA can be calcu- the decision nodes have cycles, formally discussed in lated by a sequential sum-max-sum rule, Koller and Milch[2003] by defining a relevance graph over the decision nodes; the relevance graph is a tree X X X MEU = max ··· max exp(θ(x)), (5) with PRA, but is usually loopy with imperfect recall. 
xd1 xdm xr0 xr1 xrm Thus iterative algorithms are usually required for LIM- X X ∗ ( ) = arg max  ··· max exp( (x)) IDs. Second and more importantly, the incomplete in- δdi xpa(di) θ , xdi xdm formation may cause the agents to behave myopically, xr xr i m selfishly choosing locally optimal strategies due to ig- where the calculation is performed in reverse temporal norance of the global statistics. This breaks the strat- ordering, interleaving marginalizing chance nodes and egy space into many local modes, making the MEU maximizing decision nodes. Eq. (5) generalizes the problem non-convex; see Fig.2 for an illustration. inference tasks in Section 2.1, arbitrarily interleaving The most popular algorithms for LIMIDs are based on the sum and max operators. For example, marginal policy-by-policy improvement, e.g., the single policy MAP can be treated as a blind decision problem, where update (SPU) algorithm [Lauritzen and Nilsson, 2001] no chance nodes are observed by any decision nodes. sequentially optimizes δi with δ¬i = {δj : j 6= i} fixed: As in other inference tasks, the calculation of the sum- max-sum rule can be organized into local computa- δi(xpa(i)) ← arg max E(u(x)|xfam(i) ; δ¬i), (6) xi tions if the augmented distribution q(x) is factorized. X Y δ However, since the max and sum operators are not ex- E(u(x)|xfam(i); δ¬i) = exp(θ(x)) pj (xj|xpa(j)), changeable, the calculation of (5) is restricted to elimi- x¬fam(i) j∈D\{i} nation orders consistent with the “temporal ordering”. where fam(i) = {i}∪pa(i). The update circles through Notoriously, this “constrained” tree-width can be very all i ∈ D in some order, and ties are broken arbitrar- high even for trees. See Koller and Friedman[2009]. ily in case of multiple maxima. The expected util- However, PRA is often unrealistic. 
First, most systems ity in SPU is non-decreasing at each iteration, and it lack enough memory to express arbitrary policies over gives a locally optimal strategy at convergence in the an entire history of observations. Second, many prac- sense that the expected utility can not be improved tical scenarios, like team decision analysis [Detwarasiti by changing any single node’s policy. Unfortunately, and Shachter, 2005] and decentralized sensor networks SPU’s solution is heavily influenced by initialization [Kreidl and Willsky, 2006], are distributed by nature: and can be very suboptimal. a team of agents makes decisions independently based This issue is helped by generalizing SPU to the on sharing limited information with their neighbors. strategy improvement (SI) algorithm [Detwarasiti and In these cases, relaxing PRA is very important. Shachter, 2005], which simultaneously updates sub- Imperfect Recall. General IDs with the perfect groups of decisions nodes. However, the time and recall assumption relaxed are discussed in Zhang et al. space complexity of SI grows exponentially with the [1994], Lauritzen and Nilsson[2001], Koller and Milch sizes of subgroups. In the sequel, we present a novel [2003], and are commonly referred as limited memory variational framework for MEU, and propose BP-like influence diagrams (LIMIDs). Unfortunately, the re- algorithms that go beyond the na¨ıve greedy paradigm. 3 Duality Form of MEU a subset of the marginal polytope that restricts the observation domains of the decision rules; this non- In this section, we derive a duality form for MEU, gen- convex inner subset is similar to the mean field approx- eralizing the duality results of the standard inference imation for partition functions. See Wolpert[2006] for in Section 2.1. Our main result is summarized in the a similar connection to mean field for bounded rational following theorem. game theory. Interestingly, this shows that extending Theorem 3.1. (a). 
For an influence diagram with a LIMID to have perfect recall (by extending the ob- augmented distribution q(x) ∝ exp(θ(x)), its log max- servation domains of the decision nodes) can be con- imum expected utility log MEU(θ) equals sidered a “convex” relaxation of the LIMID. Corollary 3.3. For any , if τ ∗ is global optimum of X  max{hθ, τ i + H(x; τ ) − H(xi|xpa(i); τ )}. (7) τ ∈M i∈D X max{hθ, τ i + H(x) − (1 − ) H(xi|xpa(i))}. (10) τ ∈M Suppose τ ∗ is a maximum of (7), then δ∗ = i∈D ∗ {τ (xi|xpa(i))|i ∈ D} is an optimal strategy. ∗ ∗ and δ = {τ (xi|xpa(i))|i ∈ D} is a deterministic strat- (b). For IDs with perfect recall, (7) reduces to egy, then it is an optimal strategy for MEU. X max{hθ, τ i + H(xoi |xo1:i−1 ; τ )}, (8) The parameter  is a temperature to “anneal” the τ ∈M oi∈C MEU problem, and trades off convexity and optimal- ity. For large , e.g.,  ≥ 1, the objective in (10) is a where o is the temporal ordering of the perfect recall. strictly convex function, while δ∗ is unlikely to be de- terministic nor optimal (if  = 1, (10) reduces to stan- Proof. (a) See appendix; (b) note PRA implies pa(i) = dard marginazation); as  decreases towards zero, δ∗ o1:i−1 (i ∈ D), and apply the entropy chain rule. becomes more deterministic, but (10) becomes more non-convex and is harder to solve. In Section4 we The distinction between (8) and (7) is subtle but im- derive several possible optimization approaches. portant: although (8) (with perfect recall) is always (if not strictly) a convex optimization, (7) (without perfect recall) may be non-convex if the subtracted 4 Algorithms entropy terms overwhelm; this matches the intuition that incomplete information sharing gives rise to mul- The duality results in Section3 offer new perspectives tiple locally optimal strategies. for MEU, allowing us to bring the tools of variational inference to develop new efficient algorithms. 
In this The MEU duality (8) for ID with PRA generalizes ear- section, we present a junction graph framework for BP- lier duality results of inference: with no decision nodes, like MEU algorithms, and provide theoretical analysis. D = ∅ and (8) reduces to (1) for the log-partition In addition, we propose two double-loop algorithms function; when C = ∅, no entropy terms appear and that alleviate the issue of non-convexity in LIMIDs or (8) reduces to the linear program relaxation of MAP. provide convergence guarantees: a deterministic an- Also, (8) reduces to marginal MAP when no chance nealing approach suggested by Corollary 3.3 and a nodes are observed before any decision. As we show method based on the proximal point algorithm. in Section4, this unification suggests a line of unified algorithms for all these different inference tasks. 4.1 A Belief Propagation Algorithm Several corollaries provide additional insights. Corollary 3.2. For an ID with parameter θ, we have We start by formulating the problem (7) into the junction graph framework. Let (G, C, S) be a junc- X log MEU = max{hθ, τ i + H(xoi |xo1:i−1 ; τ )} (9) tion graph for the augmented distribution q(x) ∝ τ ∈I oi∈C exp(θ(x)). For each decision node i ∈ D, we as- sociate it with exactly one cluster ck ∈ C satisfying where = {τ ∈ : xo ⊥ x \ |x , ∀oi ∈ I M i o1:i−1 pa(oi) pa(oi) {i, pa(i)} ⊆ ck; we call such a cluster a decision clus- D}, corresponding to those distributions that respect ter. The clusters C are thus partitioned into decision the imperfect recall constraints; “x ⊥ y | z” denotes clusters D and the other (normal) clusters R. For conditional independence of x and y given z. simplicity, we assume each decision cluster ck ∈ D is associated with exactly one decision node, denoted d . Corollary 3.2 gives another intuitive interpretation of k imperfect recall vs. 
perfect recall: MEU with imper- Following the junction graph framework in Section 2.1, fect recall optimizes same objective function, but over the MEU dual (10) (with temperature parameter ) is approximated by sum (resp. max or hybrid) consistency property yet X X X leaving the joint distribution unchanged [e.g., Wain- max{hθ, τ i + H + H − H }, (11) ck ck skl wright et al., 2003a, Weiss et al., 2007, Liu and Ih- τ ∈L k∈R k∈D (kl)∈E ler, 2011]. We define a set of “MEU-beliefs” b =  {b(xck ), b(xskl )} by b(xck ) ∝ ψck mk for all ck ∈ C, and where Hck = H(xck ), Hskl = H(xskl ) and H (xck ) = b(xskl ) ∝ mk→lml→k; note that the “beliefs” b are dis- H(xck )−(1−)H(xdk |xpa(dk)). The dependence of en- tropies on τ is suppressed for compactness. Eq. (11) tinguished from the “marginals” τ . We can show that is similar to the objective of regular sum-product junc- at each iteration of MEU-BP in (12)-(13), the b satisfy tion graph BP, except the entropy terms of the decision Q b(x ) clusters are replaced by H . k∈V ck ck Reparameterization: q(x) ∝ Q , (17) ∈E b(xskl ) Using a Lagrange multiplier method similar to Yedidia (kl) et al.[2005], a hybrid message passing algorithm can and further, at a fixed point of MEU-BP we have be derived for solving (11): Sum-consistency: X b(x ) = b(x ), (18) Sum messages: X (normal clusters) ck skl mk→l ∝ ψck m∼k\l, (12) (normal clusters) ck\sij xc \s k kl MEU-consistency: X MEU messages: X σk[ψc m∼k; ] σk[b(xck ); ] = b(xskl ). (19) m ∝ k , (13) (decision clusters) k→l c \s (decision clusters) ml→k k ij xck\skl where σk[·] is an operator that solves an annealed local Optimality Guarantees. Optimality guarantees of + MEU problem associated with b(xck ) ∝ ψck m∼k: MEU-BP (with  → 0 ) can be derived via reparam- eterization. Our result is analogous to those of Weiss def 1− σk[b(xck ); ] = b(xck )b(xdk |xpa(dk)) and Freeman[2001] for max-product BP and Liu and Ihler[2011] for marginal-MAP. 
where b(xdk |xpa(dk)) is the “annealed” optimal policy For a junction tree, a tree-order is a partial ordering on b(x , x )1/ ( | ) = dk pa(dk) the nodes with k  l iff the unique path from a special b xdk xpa(dk) P 1/ , b(xdk , xpa(dk)) xdk cluster (called root) to l passes through k; the parent X π(k) is the unique neighbor of k on the path to the b(x , x ) = b(x ), z = c \{d , pa(d )}. dk pa(dk) ck k k k k root. Given a subset of decision nodes D0, a junction x zk tree is said to be consistent for D0 if there exists a + 0 As  → 0 , one can show that b(xdk |xpa(dk)) is exactly tree-order with sk,π(k) ⊆ pa(dk) for any dk ∈ D . an optimal strategy of the local MEU problem with Theorem 4.1. Let (G, C, S) be a consistent junction augmented distribution ( ). 0 b xck tree for a subset of decision nodes D , and b be a set At convergence, the stationary point of (11) is: of MEU-beliefs satisfying the reparameterization and consistency conditions (17)-(19) with  → 0+. Let ∗ ∗ τck ∝ ψck m∼k for normal clusters (14) δ = {bck (xdk |xpa(dk)): dk ∈ D}; then δ is a locally ∗ 0 τck ∝ σk[ψck m∼k; ] for decision clusters (15) optimal strategy in the sense that EU({δD0 , δD\D }) ≤ ∗ EU(δ ) for any δD\D0 . τskl ∝ mk→lml→k for separators (16) This message passing algorithm reduces to sum- A junction tree is said to be globally consistent if it product BP when there are no decision clusters. The is consistent for all the decision nodes, which as im- outgoing messages from decision clusters are the cru- plied by Theorem 4.1, ensures a globally optimal strat- cial ingredient, and correspond to solving local (an- egy; this notation of global consistency is similar to nealed) MEU problems. the strong junction trees in Jensen et al.[1994]. 
For IDs with perfect recall, a globally consistent junction Taking  → 0+ in the MEU message update (13) gives tree can be constructed by a standard procedure which a fixed point algorithm for solving the original objec- triangulates the DAG of the ID along reverse tempo- tive directly. Alternatively, one can adopt a deter- ral order. For IDs without perfect recall, it is usually ministic annealing approach [Rose, 1998] by gradually not possible to construct a globally consistent junction decreasing , e.g., taking t = 1/t at iteration t. tree; this is the case for the toy example in Fig.2b. Reparameterization Properties. BP algorithms, However, coordinate-wise optimality follows as a con- including sum-product, max-product, and hybrid mes- sequence of Theorem 4.1 for general IDs with arbitrary sage passing, can often be interpreted as reparame- junction trees, indicating that MEU-BP is at least as terization operators, with fixed points satisfying some “optimal” as SPU. Theorem 4.2. Let (G, C, S) be an arbitrary junc- We use two choices of coefficients {wt}: (1) wt = 1 tion tree, and b and δ∗ defined in Theorem 4.1. (constant), and (2) wt = 1/t (harmonic). The choice Then δ∗ is a locally person-by-person optimal strategy: wt = 1 is especially interesting because the proximal ∗ ∗ EU({δi , δD\i}) ≤ EU(δ ) for any i ∈ D and δD\i. update reduces to a standard marginalization prob- lem, solvable by standard tools without the MEU’s Additively Decomposable Utilities. Our al- temporal elimination order restrictions. Concretely, gorithms rely on the factorization structure of the the proximal update in this case reduces to augmented distribution q(x). For this reason, mul- tiplicative utilities fit naturally, but additive utilities t+1 t n τi (xi|xpa(i)) ∝ τi (xi|xpa(i))E(u(x)|xfam(i) ; δ¬i) are more difficult (as they also are in exact inference) n [Koller and Friedman, 2009]. To create factorization with E(u(x)|xfam(i) ; δ¬i) as defined in (6). 
structure in additive utility problems, we augment the model with a latent "selector" variable, similar to that used in mixture models. For details, see the appendix.

4.2 Proximal Algorithms

In this section, we present a proximal point approach [e.g., Martinet, 1970, Rockafellar, 1976] for the MEU problem. Similar methods have been applied to standard inference problems, e.g., Ravikumar et al. [2010]. We start with a brief introduction to the proximal point algorithm. Consider an optimization problem min_{τ∈M} f(τ). A proximal method instead iteratively solves a sequence of "proximal" problems

τ^{t+1} = arg min_{τ∈M} { f(τ) + w^t D(τ‖τ^t) },   (20)

where τ^t is the solution at iteration t and w^t is a positive coefficient. D(·‖·) is a distance, called the proximal function; typical choices are Euclidean or Bregman distances or ψ-divergences [e.g., Teboulle, 1992, Iusem and Teboulle, 1993]. Convergence of proximal algorithms has been well studied: the objective series {f(τ^t)} is guaranteed to be non-increasing at each iteration, and {τ^t} converges to an optimal solution (sometimes superlinearly) for convex programs, under some regularity conditions on the coefficients {w^t}. See, e.g., Rockafellar [1976], Tseng and Bertsekas [1993], Iusem and Teboulle [1993].

Here, we use an entropic proximal function that naturally fits the MEU problem:

D(τ‖τ') = Σ_{i∈D} Σ_x τ(x) log[ τ_i(x_i|x_pa(i)) / τ'_i(x_i|x_pa(i)) ],

a sum of conditional KL-divergences. The proximal update for the MEU dual (7) then reduces to

τ^{t+1} = arg max_{τ∈M} { ⟨θ^t, τ⟩ + H(x) − (1 − w^t) Σ_{i∈D} H(x_i|x_pa(i)) },

where θ^t(x) = θ(x) + w^t Σ_{i∈D} log τ^t_i(x_i|x_pa(i)). This has the same form as the annealed problem (10) and can be solved by the message passing scheme (12)-(13). Unlike annealing, the proximal algorithm updates θ^t at each iteration and does not need w^t to approach zero.

This proximal update can be seen as a "soft" and "parallel" version of the greedy update (6): the greedy update makes a hard update at a single decision node, while the proximal update makes a soft modification simultaneously at all decision nodes. The soft update makes it possible to correct earlier suboptimal choices and allows decision nodes to make cooperative movements. However, convergence with w^t = 1 may be slow; using w^t = 1/t takes larger steps but is no longer a standard marginalization.

5 Experiments

We demonstrate our algorithms on several influence diagrams, including randomly generated IDs, large scale IDs constructed from problems in the UAI08 inference challenge, and finally practically motivated IDs for decentralized detection in wireless sensor networks. We find that our algorithms typically find better solutions than SPU with comparable time complexity; for large scale problems with many decision nodes, our algorithms are more computationally efficient than SPU, because one step of SPU requires updating (6) (a global expectation) for all the decision nodes.

In all experiments, we test single policy updating (SPU), our MEU-BP running directly at zero temperature (BP-0+), annealed BP with temperature ε = 1/t (Anneal-BP-1/t), and the proximal versions with w^t = 1 (Prox-BP-one) and w^t = 1/t (Prox-BP-1/t). For the BP-based algorithms, we use two constructions of junction graphs: a standard junction tree obtained by triangulating the DAG in backwards topological order, and a loopy junction graph following Mateescu et al. [2010] that corresponds to Pearl's loopy BP; for SPU, we use the same junction graphs to calculate the inner update (6). The junction trees ensure that the inner updates of SPU and Prox-BP-one are performed exactly, and carry the optimality guarantees of Theorem 4.1, but may be computationally more expensive than the loopy junction graphs. For the proximal versions, we set a maximum of 5 iterations in the inner loop; changing this value did not seem to lead to significantly different results. The BP-based algorithms may return non-deterministic strategies; we round these to deterministic strategies by taking the largest values.
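To make the generic scheme (20) concrete, here is a minimal numerical sketch (not the paper's MEU solver): for a toy linear objective f(τ) = −⟨θ, τ⟩ over the probability simplex, the entropic proximal step has a closed form, and iterating it with fixed weight w^t = 1 gradually sharpens τ toward a deterministic optimum. The function name and the values of θ are our own illustrative choices.

```python
import math

def entropic_prox_step(tau, theta, w):
    """One step of the proximal iteration (20) with an entropic (KL) proximal
    function, for the toy objective f(tau) = -<theta, tau> over the simplex.
    The closed-form solution is tau_new_i proportional to tau_i * exp(theta_i / w)."""
    new = [t * math.exp(th / w) for t, th in zip(tau, theta)]
    z = sum(new)
    return [v / z for v in new]

theta = [0.2, 1.0, 0.5]          # toy parameters (illustrative only)
tau = [1.0 / 3] * 3              # uniform initialization
for t in range(50):
    tau = entropic_prox_step(tau, theta, w=1.0)   # fixed weight w^t = 1
# tau concentrates on argmax_i theta_i (index 1), mirroring how repeated
# proximal steps with a KL penalty sharpen toward a deterministic solution
```

Because each step only reweights the previous iterate multiplicatively, the update never leaves the simplex, and smaller w takes larger steps, matching the w^t = 1/t discussion above.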

Random Bayesian Networks. We test our algorithms on randomly constructed IDs with additive utilities. We first generate a set of random DAGs of size 20 with maximum parent size 3. To create IDs, we take the leaf nodes to be utility nodes, and among the non-leaf nodes we randomly select a fixed percentage to be decision nodes, with the others being chance nodes. We assume the chance and decision variables are discrete with 4 states. The conditional probability tables of the chance nodes are randomly drawn from a symmetric Dirichlet distribution Dir(α), and the entries of the utility functions from a Gamma distribution Γ(α, 1). The relative improvement in log MEU compared to SPU with a junction tree is reported in Fig. 3. We find that when using junction trees, all our BP-based methods dominate SPU; on loopy junction graphs, BP-0+ occasionally performs worse than SPU, but all the annealed and proximal algorithms outperform SPU with the same loopy junction graph, and often even SPU with a junction tree. As the percentage of decision nodes increases, the improvement of the BP-based methods over SPU generally increases. Fig. 4 shows a typical trajectory of the algorithms across iterations. The algorithms were initialized uniformly; random initializations behaved similarly, but are omitted for space.

Figure 3: Results on random IDs of size 20. The y-axes show the log MEU of each algorithm compared to SPU on a junction tree; the x-axes show the percentage of decision nodes (panels (a) & (b), with α = 0.3) or the Dirichlet parameter α (panels (c) & (d), with 70% decision nodes). The left panels ((a) & (c)) correspond to running the algorithms on junction trees, and the right panels ((b) & (d)) on loopy junction graphs. The results are averaged over 20 random models.

Figure 4: A typical trajectory of MEU (of the rounded deterministic strategies) v.s. iterations for the random IDs in Fig. 3. One iteration of the BP-like methods denotes a forward-backward reduction on the junction graph; one step of SPU requires |D| (the number of decision nodes) reductions. SPU and BP-0+ are stuck at a local optimum by the 2nd iteration.

Diagnostic Bayesian networks. We construct larger scale IDs based on two diagnostic Bayes nets with 200-300 nodes and 300-600 edges, taken from the UAI08 inference challenge. To create influence diagrams, we made the leaf nodes utilities, each defined by its conditional probability when clamped to a randomly chosen state, with the total utility being the product of the local utilities (multiplicatively decomposable). The set of decision nodes is again randomly selected among the non-leaf nodes with a fixed percentage. Since the network sizes are large, we only run the algorithms on the loopy junction graphs. Again, our algorithms significantly improve on SPU; see Fig. 5.

Figure 5: Results on IDs constructed from two diagnostic BNs from the UAI08 challenge: (a) Diagnostic BN 1; (b) Diagnostic BN 2. Here all algorithms use the loopy junction graph and are initialized uniformly. Panels (a)-(b) show the log MEU of each algorithm normalized to that of SPU. Averaged over 10 trials.

Decentralized Sensor Network. In this section, we test an influence diagram constructed for decentralized detection in wireless sensor networks [e.g., Viswanathan and Varshney, 1997, Kreidl and Willsky, 2006]. The task is to detect the states of a hidden process p(h) (a pairwise MRF) using a set of distributed sensors; each sensor provides a noisy measurement v_i of the local state h_i, and overall performance is boosted by allowing the sensors to transmit small (1-bit) signals s_i along a directional path, to help the predictions of their downstream sensors. The utility function includes rewards for correct prediction and a cost for sending signals. We construct an ID, as sketched in Fig. 6(a), for addressing the offline policy design task: finding optimal policies of how to predict the states based on the local measurement and received signals, and policies of whether and how to pass signals to downstream nodes; see the appendix for more details.

Figure 6: (a) A sensor network on a 3 × 3 grid; green lines denote the MRF edges of the hidden process p(h), on some of which (red arrows) signals are allowed to pass; each sensor may be accurate (purple) or noisy (black). Optimal strategies should pass signals from accurate sensors to noisy ones but not the reverse. In the diagram, h_i are hidden variables, v_i local measurements, d_i prediction decisions, s_i signal decisions, u_i reward utilities, and c_i signal cost utilities. (b)-(c) The log MEU of algorithms running on (b) a junction tree and (c) a loopy junction graph. As the signal cost increases, all algorithms converge to the communication-free strategy. Results averaged over 10 random trials.

To escape the "all-zero" fixed point, we initialize the proximal algorithms and SPU with 5 random policies, and BP-0+ and Anneal-BP-1/t with 5 random messages. We first test on a sensor network on a 3 × 3 grid, where the algorithms are run on both a junction tree constructed by standard triangulation and a loopy junction graph (see the appendix for construction details). As shown in Fig. 6(b)-(c), SPU performs worst in all cases. Interestingly, Anneal-BP-1/t performs relatively poorly here, because the annealing steps make it insensitive to, and unable to exploit, the random initializations; this can be fixed by a "perturbed" annealed method that injects a random perturbation into the model and gradually decreases the perturbation level across iterations (Anneal-BP-1/t (Perturbed)).

Figure 7: Results on a sensor network on a random graph with 30 nodes (the MRF edges overlap with the signal paths). Averaged over 5 random models.
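The sensor model just described (noisy local measurements with error rate e_i, prediction rewards γ, and signal costs λ) can be sketched in a few lines; this is only an illustration of the model's local terms, with our own helper names, not the code used for the experiments.

```python
import math

PH = 5                      # number of states of each hidden variable (appendix: p_h = 5)
GAMMA = math.log(2)         # reward for a correct prediction d_i == h_i

def measurement_model(e_i, ph=PH):
    """p(v_i | h_i): correct with probability 1 - e_i,
    otherwise uniform over the ph - 1 wrong states."""
    return [[1 - e_i if v == h else e_i / (ph - 1) for v in range(ph)]
            for h in range(ph)]

def local_utility(d_i, h_i, s_i, lam):
    """Additive local utility terms: prediction reward plus signal cost; the
    total utility is multiplicative, u = exp(sum of these local terms)."""
    reward = GAMMA if d_i == h_i else 0.0
    cost = -lam if s_i != 0 else 0.0     # s_i in {0, +1, -1}; 0 means "off"
    return reward + cost

good = measurement_model(0.05)    # a "good" sensor, e_i ~ U([0, .1])
bad = measurement_model(0.75)     # a "bad" sensor, e_i ~ U([.7, .8])
```

Each row of the measurement matrix sums to one, and a larger signal cost λ pushes the optimal strategy toward the communication-free one, consistent with Fig. 6.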
A similar experiment (with only the loopy junction graph) is performed on the larger random graph in Fig. 7; the algorithm performances follow similar trends. SPU performs even worse in this case, since it appears to over-send signals when two "good" sensors connect to one "bad" sensor.

6 Related Work

Many exact algorithms for IDs have been developed, usually in a variable-elimination or message-passing form; see Koller and Friedman [2009] for a recent review. Approximation algorithms are relatively unexplored, and are usually based on separately approximating individual components of exact algorithms [e.g., Sabbadin et al., 2011, Sallans, 2003]; our method instead builds an integrated framework. Other approaches, including MCMC [e.g., Charnes and Shenoy, 2004] and search methods [e.g., Marinescu, 2010], also exist, but are usually more expensive than SPU or our BP-like methods. See the appendix for more discussion.

7 Conclusion

In this work we derive a general variational framework for influence diagrams, covering both the "convex" centralized decisions with perfect recall and the "non-convex" decentralized decisions. We derive several algorithms, but equally importantly open the door for many others that can be applied within our framework. Since these algorithms rely on decomposing the global problems into local ones, they also open the possibility of efficiently distributable algorithms.

Acknowledgements. Work supported in part by NSF IIS-1065618 and a Microsoft Research Fellowship.

References

C. Bielza, P. Muller, and D. R. Insua. Decision analysis by augmented probability simulation. Management Science, 45(7):995-1007, 1999.
J. Charnes and P. Shenoy. Multistage Monte Carlo method for solving influence diagrams using local computation. Management Science, pages 405-418, 2004.
A. Detwarasiti and R. D. Shachter. Influence diagrams for team decision analysis. Decision Analysis, 2, Dec 2005.
R. Howard and J. Matheson. Influence diagrams. In Readings on the Principles and Applications of Decision Analysis, 1985.
R. Howard and J. Matheson. Influence diagrams. Decision Analysis, 2(3):127-143, 2005.
A. Iusem and M. Teboulle. On the convergence rate of entropic proximal optimization methods. Computational and Applied Mathematics, 12:153-168, 1993.
F. Jensen, F. V. Jensen, and S. L. Dittmer. From influence diagrams to junction trees. In UAI, pages 367-373. Morgan Kaufmann, 1994.
J. Jiang, P. Rai, and H. Daume III. Message-passing for approximate MAP inference with latent variables. In NIPS, 2011.
D. Koller and N. Friedman. Probabilistic Graphical Models: Principles and Techniques. MIT Press, 2009.
D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181-221, 2003.
O. P. Kreidl and A. S. Willsky. An efficient message-passing algorithm for optimizing decentralized detection networks. In IEEE Conf. Decision and Control, Dec 2006.
S. Lauritzen and D. Nilsson. Representing and solving decision problems with limited information. Management Science, pages 1235-1251, 2001.
Q. Liu and A. Ihler. Variational algorithms for marginal MAP. In UAI, Barcelona, Spain, July 2011.
R. Marinescu. A new approach to influence diagrams evaluation. In Research and Development in Intelligent Systems XXVI, pages 107-120. Springer London, 2010.
B. Martinet. Regularisation d'inequations variationnelles par approximations successives. Revue Francaise d'Informatique et de Recherche Operationnelle, 4:154-158, 1970.
R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37:279-328, 2010.
J. Pearl. Influence diagrams - historical and personal perspectives. Decision Analysis, 2(4):232-234, Dec 2005.
P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear programs: Proximal projections, convergence, and rounding schemes. Journal of Machine Learning Research, 11:1043-1080, Mar 2010.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877, 1976.
K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86(11):2210-2239, Nov 1998.
R. Sabbadin, N. Peyrard, and N. Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 2011.
B. Sallans. Variational action selection for influence diagrams. Technical Report OEFAI-TR-2003-29, Austrian Research Institute for Artificial Intelligence, 2003.
R. Shachter. Model building with belief networks and influence diagrams. In Advances in Decision Analysis: From Foundations to Applications, page 177, 2007.
M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670-690, 1992.
P. Tseng and D. Bertsekas. On the convergence of the exponential multiplier method for convex programming. Mathematical Programming, 60(1):1-19, 1993.
R. Viswanathan and P. Varshney. Distributed detection with multiple sensors: part I - fundamentals. Proc. IEEE, 85(1):54-63, Jan 1997.
M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1-305, 2008.
M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313-2335, July 2005.
M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. Info. Theory, 45:1120-1146, 2003a.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. IEEE Trans. Info. Theory, 51(11):3697-3717, Nov 2003b.
Y. Weiss and W. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47(2):736-744, Feb 2001.
Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
D. Wolpert. Information theory - the bridge connecting bounded rational game theory and statistical physics. In Complex Engineered Systems, pages 262-290, 2006.
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Trans. Info. Theory, 51, July 2005.
N. L. Zhang. Probabilistic inference in influence diagrams. In Computational Intelligence, pages 514-522, 1998.
N. L. Zhang, R. Qi, and D. Poole. A computational theory of decision networks. Int. J. Approx. Reasoning, 11:83-158, 1994.

This document contains proofs and other supplemental information for the UAI 2012 submission, "Belief Propagation for Structured Decision Making".

A Randomized vs. Deterministic Strategies

It is a well-known fact in decision theory that no randomized strategy can improve on the utility of the best deterministic strategy, so that:

Lemma 2.1. For any ID, max_{δ∈∆} EU(δ) = max_{δ∈∆^o} EU(δ).

Proof. Since ∆^o ⊂ ∆, we need only show that for any randomized strategy δ ∈ ∆, there exists a deterministic strategy δ' ∈ ∆^o such that EU(δ) ≤ EU(δ'). Note that

EU(δ) = Σ_x q(x) Π_{i∈D} p_i^δ(x_i|x_pa(i)).

Thus, EU(δ) is linear in p_i^δ(x_i|x_pa(i)) for any i ∈ D (with all the other policies fixed); therefore, one can always replace p_i^δ(x_i|x_pa(i)) with some deterministic p_i^δ'(x_i|x_pa(i)) without decreasing EU(δ). Doing so sequentially for all i ∈ D yields a deterministic rule δ' with EU(δ) ≤ EU(δ').

One can further show that any (globally) optimal randomized strategy can be represented as a convex combination of a set of optimal deterministic strategies.

B Duality Form of MEU

Here we give a proof of our main result.

Theorem 3.1. (a). For an influence diagram with augmented distribution q(x) ∝ exp(θ(x)), its log maximum expected utility log MEU(θ) equals

max_{τ∈M} { ⟨θ, τ⟩ + H(x; τ) − Σ_{i∈D} H(x_i|x_pa(i); τ) }.   (1)

Suppose τ* is a maximum of (1); then δ* = {τ*(x_i|x_pa(i)) | i ∈ D} is an optimal strategy.

(b). Under the perfect recall assumption, (1) reduces to

max_{τ∈M} { ⟨θ, τ⟩ + Σ_{o_i∈C} H(x_{o_i}|x_{o_{1:i−1}}; τ) }   (2)

where o_{1:i−1} = {o_j | j = 1, . . . , i−1}.

Proof. (a). Let q^δ(x) = q(x) Π_{i∈D} p_i^δ(x_i|x_pa(i)). Applying the standard duality result for the partition function to q^δ(x) ∝ exp(θ^δ(x)),

log MEU = max_δ log Σ_x exp(θ^δ(x))
        = max_δ max_{τ∈M} { ⟨θ^δ, τ⟩ + H(x; τ) }
        = max_{τ∈M} max_δ { ⟨θ^δ, τ⟩ + H(x; τ) },   (3)

and we have

max_δ { ⟨θ^δ, τ⟩ } = max_δ { ⟨θ + Σ_{i∈D} log p_i^δ(x_i|x_pa(i)), τ⟩ }
                   = ⟨θ, τ⟩ + Σ_{i∈D} max_{p_i^δ} { Σ_x τ(x) log p_i^δ(x_i|x_pa(i)) }
                   =* ⟨θ, τ⟩ + Σ_{i∈D} { Σ_x τ(x) log τ(x_i|x_pa(i)) }
                   = ⟨θ, τ⟩ − Σ_{i∈D} H(x_i|x_pa(i); τ),   (4)

where the equality "=*" holds because the solution of max_{p_i^δ} Σ_x τ(x) log p_i^δ(x_i|x_pa(i)), subject to the normalization constraint Σ_{x_i} p_i^δ(x_i|x_pa(i)) = 1, is p_i^δ = τ(x_i|x_pa(i)). We obtain (1) by plugging (4) into (3).

(b). By the chain rule of entropy, we have

H(x; τ) = Σ_i H(x_{o_i}|x_{o_{1:i−1}}; τ).   (5)

Note that pa(i) = o_{1:i−1} for i ∈ D in an ID with perfect recall. The result follows by substituting (5) into (1).

The following lemma is the "dual" version of Lemma 2.1; it will be helpful for proving Corollary 3.2 and Corollary 3.3.

Lemma B.1. Let M^o be the set of distributions τ(x) in which the τ(x_i|x_pa(i)), i ∈ D, are deterministic. Then the optimization domain M in (1) of Theorem 3.1 can be replaced by M^o without changing the result; that is, log MEU(θ) equals

max_{τ∈M^o} { ⟨θ, τ⟩ + H(x; τ) − Σ_{i∈D} H(x_i|x_pa(i); τ) }.   (6)

Proof. Note that M^o is equivalent to the set of deterministic strategies ∆^o. As shown in the proof of Lemma 2.1, there always exist optimal deterministic strategies; that is, at least one optimal solution of (1) falls in M^o. Therefore, the result follows.

Corollary 3.2. For an ID with parameter θ, we have

log MEU = max_{τ∈I} { ⟨θ, τ⟩ + Σ_{i∈C} H(x_{o_i}|x_{o_{1:i−1}}; τ) }   (7)

where I = {τ ∈ M : x_{o_i} ⊥ x_{o_{1:i−1}\pa(o_i)} | x_pa(o_i)}, corresponding to the distributions respecting the imperfect recall structure; "x ⊥ y | z" means that x and y are conditionally independent given z.

Proof. For any τ ∈ I, we have H(x_i|x_pa(i); τ) = H(x_{o_i}|x_{o_{1:i−1}}; τ); hence, by the entropic chain rule, the objective function in (7) is the same as that in (1). Then, for any τ ∈ M^o and o_i ∈ D, since τ(x_{o_i}|x_pa(o_i)) is deterministic, we have 0 ≤ H(x_{o_i}|x_{o_{1:i−1}}) ≤ H(x_{o_i}|x_pa(o_i)) = 0, which implies I(x_{o_i}; x_{o_{1:i}\pa(o_i)} | x_pa(o_i)) = 0, and hence M^o ⊆ I ⊆ M. We thus have that the optimum in (7) is no larger than (1), and no smaller than (6). The result follows since (1) and (6) are equal by Lemma B.1.

Corollary 3.3. For any ε, let τ* be an optimum of

max_{τ∈M} { ⟨θ, τ⟩ + H(x) − (1 − ε) Σ_{i∈D} H(x_i|x_pa(i)) }.   (8)

If δ* = {τ*(x_i|x_pa(i)) | i ∈ D} is a deterministic strategy, then it is an optimal strategy of the MEU.

Proof. First, we have H(x_i|x_pa(i); τ) = 0 for τ ∈ M^o and i ∈ D, since such τ(x_i|x_pa(i)) are deterministic. Therefore, the objective functions in (8) and (6) are equivalent when the maximization domains are restricted to M^o. The result follows by applying Lemma B.1.

B.1 Derivation of Belief Propagation for MEU

Eq. (11) is similar to the objective of sum-product junction graph BP, except that the entropy terms of the decision clusters are replaced by H_ε(x_ck), which can be thought of as corresponding to some local MEU problem. In the sequel, we derive a similar belief propagation algorithm for (11), in which the decision clusters receive some special consideration. To demonstrate how this can be done, it is helpful to first consider a local optimization problem associated with a single decision cluster.

Lemma B.2. Consider a local optimization problem on a decision cluster c_k,

max_{τ_ck} { ⟨ϑ_ck, τ_ck⟩ + H_ε(x_ck; τ_ck) }.

Its solution is

τ_ck(x_ck) ∝ σ_k[b_ck(x_ck); ε] := b(x_ck) b_ε(x_dk|x_pa(dk))^{1−ε},

where b_ck(x_ck) ∝ exp(ϑ_ck(x_ck)) and b_ε(x_dk|x_pa(dk)) is the "annealed" conditional probability of b_ck,

b_ε(x_dk|x_pa(dk)) = b(x_dk, x_pa(dk))^{1/ε} / Σ_{x_dk} b(x_dk, x_pa(dk))^{1/ε},
b(x_dk, x_pa(dk)) = Σ_{x_zk} b_ck(x_ck),   z_k = c_k \ {d_k, pa(d_k)}.

Proof. The Lagrangian function is

⟨ϑ_ck, τ_ck⟩ + H_ε(x_ck; τ_ck) + λ [ Σ_{x_ck} τ_ck(x_ck) − 1 ].

Its stationary point satisfies ϑ_ck(x_ck) − log τ_ck(x_ck) + (ε − 1) log τ_ck(x_dk|x_pa(dk)) + λ = 0, or equivalently,

τ_ck(x_ck) [τ_ck(x_dk|x_pa(dk))]^{ε−1} = b_ck(x_ck).   (9)

Summing over x_zk on both sides of (9), we have

τ_ck(x_pa(dk)) [τ_ck(x_dk|x_pa(dk))]^{ε} = b(x_dk, x_pa(dk)).   (10)

Raising both sides of (10) to the power 1/ε and summing over x_dk, we have

[τ_ck(x_pa(dk))]^{1/ε} = Σ_{x_dk} [b(x_dk, x_pa(dk))]^{1/ε}.   (11)

Combining (11) with (10), we have

τ_ck(x_dk|x_pa(dk)) = b_ε(x_dk|x_pa(dk)).   (12)

Finally, combining (12) with (9) gives

τ_ck(x_ck) = b_ck(x_ck) b_ε(x_dk|x_pa(dk))^{1−ε}.   (13)

The operator σ_k[b(x_ck); ε] can be treated as imputing b(x_ck) with an "annealed" policy defined by b_ε(x_dk|x_pa(dk)); this can be seen more clearly in the limit as ε → 0+.

Lemma B.3. Consider a local MEU problem on a single decision node d_k with parent nodes pa(d_k) and an augmented probability b_ck(x_ck); let

b*(x_dk|x_pa(dk)) = lim_{ε→0+} b_ε(x_dk|x_pa(dk));

then δ* = {b*(x_dk|x_pa(dk)) : d_k ∈ D} is an optimal strategy.

Proof. Let

δ*_dk(x_pa(dk)) = arg max_{x_dk} { b(x_dk|x_pa(dk)) }.

One can show that, as ε → 0+,

b*(x_dk|x_pa(dk)) = 1/|δ*_dk| if x_dk ∈ δ*_dk, and 0 otherwise;   (14)

thus b*(x_dk|x_pa(dk)) acts as a "maximum operator" on b(x_dk|x_pa(dk)), that is,

Σ_{x_dk} b(x_dk|x_pa(dk)) b*(x_dk|x_pa(dk)) = max_{x_dk} b(x_dk|x_pa(dk)).

Therefore, for any δ ∈ ∆, we have

EU(δ) = Σ_{x_ck} b_ck(x_ck) b^δ(x_dk|x_pa(dk))
      = Σ_{x_pa(dk)} b(x_pa(dk)) Σ_{x_dk} b(x_dk|x_pa(dk)) b^δ(x_dk|x_pa(dk))
      ≤ Σ_{x_pa(dk)} b(x_pa(dk)) max_{x_dk} b(x_dk|x_pa(dk))
      = Σ_{x_pa(dk)} b(x_pa(dk)) Σ_{x_dk} b(x_dk|x_pa(dk)) b*(x_dk|x_pa(dk))
      = EU(δ*),

which concludes the proof.
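The annealed conditional b_ε(x_dk|x_pa(dk)) ∝ b(x_dk, x_pa(dk))^{1/ε} of Lemma B.2, and its ε → 0+ limit (14), can be checked numerically. The sketch below (toy table b, with our own helper name) normalizes each row by its maximum before exponentiating, so that small ε does not underflow to an all-zero row; this is an implementation detail we add, not something stated in the paper.

```python
def annealed_conditional(b_joint, eps):
    """b_eps(x_d | x_pa) proportional to b(x_d, x_pa)^(1/eps),
    computed separately for each row (one row per parent config x_pa).
    Dividing by the row maximum first avoids numerical underflow."""
    out = []
    for row in b_joint:
        m = max(row)
        powed = [(v / m) ** (1.0 / eps) for v in row]
        z = sum(powed)
        out.append([p / z for p in powed])
    return out

# toy joint b(x_d, x_pa), one row per x_pa (illustrative values)
b = [[0.2, 0.5, 0.3],
     [0.4, 0.4, 0.2]]    # a tie: the limit splits mass over the argmax set
soft = annealed_conditional(b, eps=1.0)    # the ordinary conditional b(x_d | x_pa)
hard = annealed_conditional(b, eps=1e-3)   # approximately the zero-temperature limit (14)
```

At ε = 1 the operation reduces to ordinary row normalization, while at small ε the tied row converges to 1/|δ*| on each maximizer, exactly the uniform-over-argmax form in (14).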

Therefore, in the zero temperature limit, the σ_k[·] operator in MEU-BP (12)-(13) can be directly calculated via (14), avoiding the need for power operations.

We now derive the MEU-BP updates (12)-(13) for solving (11), using a Lagrange multiplier method similar to Yedidia et al. [2005]. Consider the Lagrangian of (11),

⟨θ, τ⟩ + Σ_{k∈R} H(x_ck) + Σ_{k∈D} H_ε(x_ck) − Σ_{(kl)∈E} H(x_skl)
   + Σ_{(kl)∈E} Σ_{x_skl} λ_{k→l}(x_skl) [ Σ_{x_ck\skl} τ_ck(x_ck) − τ_skl(x_skl) ],

where the nonnegativity and normalization constraints are not included and are dealt with implicitly. Taking its gradient w.r.t. τ_ck and τ_skl, one has

τ_ck ∝ ψ_ck m_∼k   for normal clusters,   (15)
τ_ck ∝ σ_k[ψ_ck m_∼k; ε]   for decision clusters,   (16)
τ_skl ∝ m_{k→l} m_{l→k}   for separators,   (17)

where ψ_ck = exp(θ_ck), m_{k→l} = exp(λ_{k→l}), and m_∼k = Π_{l∈N(k)} m_{l→k} is the product of the messages sent from the set of neighboring clusters N(k) to c_k. The derivation of Eq. (16) uses the result in Lemma B.2. Finally, substituting the consistency constraints

Σ_{x_ck\skl} τ_ck = τ_skl

into (15)-(17) leads to the fixed point updates in (12)-(13).

B.2 Reparameterization Interpretation

We can give a reparameterization interpretation for the MEU-BP updates (12)-(13), similar to those of the sum-, max- and hybrid-BP algorithms [e.g., Wainwright et al., 2003a, Weiss et al., 2007, Liu and Ihler, 2011]. We start by defining a set of "MEU-beliefs" b = {b(x_ck), b(x_skl)} by b(x_ck) ∝ ψ_ck m_∼k for all c_k ∈ C, and b(x_skl) ∝ m_{k→l} m_{l→k}. Note that we distinguish between the "beliefs" b and the "marginals" τ in (15)-(17). We have:

Lemma B.4. (a). At each iteration of MEU-BP (12)-(13), the MEU-beliefs b satisfy

q(x) ∝ Π_{k∈V} b(x_ck) / Π_{(kl)∈E} b(x_skl),   (18)

where q(x) is the augmented distribution of the ID.

(b). At a fixed point of MEU-BP, we have

Sum-consistency (normal clusters):   Σ_{ck\skl} b(x_ck) = b(x_skl);
MEU-consistency (decision clusters):   Σ_{ck\skl} σ_k[b(x_ck); ε] = b(x_skl).

Proof. (a). By simple algebraic substitution, one can show that

Π_{k∈V} b(x_ck) / Π_{(kl)∈E} b(x_skl) ∝ Π_{ck∈C} ψ_ck(x_ck).

Since q(x) ∝ Π_{ck∈C} ψ_ck(x_ck), the result follows.

(b). Simply substitute the definition of b into the message passing scheme (12)-(13).

B.3 Correctness Guarantees

Theorem 4.1. Let (G, C, S) be a consistent junction tree for a subset of decision nodes D0, and let b be a set of MEU-beliefs satisfying the reparameterization and consistency conditions in Lemma B.4 with ε → 0+. Let δ* = {b(x_dk|x_pa(dk)) : d_k ∈ D}; then δ* is a locally optimal strategy, in the sense that EU({δ*_{D0}, δ_{D\D0}}) ≤ EU(δ*) for any δ_{D\D0}.

Proof. On a junction tree, the reparameterization in (18) can be rewritten as

q(x) = b_0 Π_{k∈V} b(x_ck) / b(x_sk),

where s_k = s_{k,π(k)} (s_k = ∅ for the root node) and b_0 is the normalization constant.

For notational convenience, we only prove the case when D0 = D, i.e., when the junction tree is globally consistent; more general cases follow similarly, by noting that any decision node imputed with a fixed decision rule can simply be treated as a chance node.

First, we can rewrite EU(δ*) as

EU(δ*) = Σ_x q(x) Π_{i∈D} b(x_i|x_pa(i))
       = Σ_x b_0 Π_{k∈V} [ b(x_ck)/b(x_sk) ] Π_{i∈D} b(x_i|x_pa(i))
       = Σ_x b_0 Π_{k∈R} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} [ b(x_ck) b(x_dk|x_pa(dk)) / b(x_sk) ]
       = b_0,

where the last equality follows from the sum- and MEU-consistency conditions (with ε → 0+). To complete the proof, we need only show that EU(δ) ≤ b_0 for any δ ∈ ∆. Note that EU(δ)/b_0 equals

Σ_x Π_{k∈R} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ].

Let z_k = c_k \ s_k; since G is a junction tree, the z_k form a partition of V, i.e., ∪_k z_k = V and z_k ∩ z_l = ∅ for any k ≠ l. We have

EU(δ)/b_0 ≤ Π_{k∈R} max_{x_sk} Σ_{x_zk} [ b(x_ck)/b(x_sk) ] · Π_{k∈D} max_{x_sk} Σ_{x_zk} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]   (*)
          = Π_{k∈D} max_{x_sk} Σ_{x_zk} [ b(x_ck) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]   (19)
          = Π_{k∈D} max_{x_sk} Σ_{x_dk} [ b(x_dk, x_pa(dk)) p^δ(x_dk|x_pa(dk)) / b(x_sk) ]
          ≤ Π_{k∈D} max_{x_sk} Σ_{x_dk} [ b(x_dk, x_pa(dk)) b(x_dk|x_pa(dk)) / b(x_sk) ]
          = 1,

where the inequality (*) holds because {z_k} forms a partition of V, the equality (19) holds due to the sum-consistency condition, and the last inequality follows the proof of Lemma B.3; the final equality uses the MEU-consistency condition. This completes the proof.

Based on Theorem 4.1, we can easily establish person-by-person optimality of BP on an arbitrary junction tree.

Theorem 4.2. Let (G, C, S) be an arbitrary junction tree, with b and δ* defined as in Theorem 4.1. Then δ* is a locally optimal strategy in Nash's sense: EU({δ*_i, δ_{D\i}}) ≤ EU(δ*) for any i ∈ D and δ_{D\i}.

Proof. Following Theorem 4.1, one need only show that any junction tree is consistent for any single decision node i ∈ D; this is easily done by choosing a tree-ordering rooted at i's decision cluster.

C About the Proximal Algorithm

The proximal method can be equivalently interpreted as a majorize-minimize (MM) algorithm [Hunter and Lange, 2004], or as a convex concave procedure (CCCP) [Yuille, 2002]. The MM and CCCP algorithms have been widely applied to standard inference problems to obtain convergence guarantees or better solutions; see e.g., Yuille [2002], Liu and Ihler [2011].

The MM algorithm is a generalization of the EM algorithm, which solves min_{τ∈M} f(τ) by a sequence of surrogate optimization problems

τ^{t+1} = arg min_{τ∈M} f^t(τ),

where f^t(τ), known as a majorizing function, should satisfy f^t(τ) ≥ f(τ) for all τ ∈ M and f^t(τ^t) = f(τ^t). It is straightforward to check that the objective in the proximal update (20) is a majorizing function. Therefore, the proximal algorithm can be treated as a special MM algorithm.

The convex concave procedure (CCCP) [Yuille and Rangarajan, 2003] is a special MM algorithm which decomposes the objective into a difference of two convex functions, that is,

f(τ) = f^+(τ) − f^−(τ),

where f^+(τ) and f^−(τ) are both convex, and constructs a majorizing function by linearizing the negative part, that is,

f^t(τ) = f^+(τ) − ∇f^−(τ^t)^T (τ − τ^t).

One can easily show that f^t(τ) is a majorizing function via Jensen's inequality. To apply CCCP to the MEU dual (1), it is natural to set

f^+(τ) = −[ ⟨θ, τ⟩ + H(x; τ) ]   and   f^−(τ) = −Σ_{i∈D} H(x_i|x_pa(i); τ).

Such a CCCP algorithm is recognizable as equivalent to the proximal algorithm in Section 4.2 with w^t = 1. The convergence results for MM algorithms and CCCP are also well established; see Vaida [2005], Lange et al. [2000], Schifano et al. [2010] for the MM algorithm, and Sriperumbudur and Lanckriet [2009], Yuille and Rangarajan [2003] for CCCP.

D Additively Decomposable Utilities

The algorithms we describe require the augmented distribution q(x) to be factored, or to have low (constrained) tree-width. However, this can easily fail to hold in a direct representation of additively decomposable utilities. To explain, recall that the augmented distribution is q(x) ∝ q0(x) u(x), where q0(x) = Π_{i∈C} p(x_i|x_pa(i)) and u(x) = Σ_{j∈U} u_j(x_βj). In this case, the utility u(x) creates a large factor with variable domain ∪_j β_j, which can easily destroy the factored structure of q(x). Unfortunately, the naive method of calculating the expectation node by node, and the commonly used general variable elimination procedures
[e.g., Jensen et al., 1994] do not appear suitable for Gh = (Vh,Eh), our variational framework. 1 X p(h) = exp[ θ (h , h ))], (21) Instead, we introduce an artificial product structure Z ij i j into the utility function by augmenting the model with (ij)∈Gh a latent “selector” variable, similar to that used for the where h are discrete variables with p states (we take “complete likelihood” in mixture models. Let y be an i h 0 p = 5); we set θ (k, k) = 0 and randomly draw auxiliary random variable taking values in the utility h ij θ (k, l)(k 6= l) from the negative absolute values of a index set U, so that ij standard Gaussian variable N (0, 1). Each sensor gives a noisy measurement vi of the local variable hi with 0 Y q˜(x, y0) = q (x) u˜j(xβj , y0), probability of error ei, that is, p(vi|hi) = 1 − ei for j vi = hi and p(vi|hi) = ei/(ph − 1) (uniformly) for vi 6= hi. whereu ˜ (x , y ) is defined byu ˜ (x , j) =u ˜ (x ) j βj 0 j βj j βj Let G be a DAG that defines the path on which the andu ˜ (x , k) = 1 for j 6= k. It is easy to verify s j βj sensors are allowed to broadcast signals (all the down- that the marginal distribution ofq ˜(x, y0) over y0 is P stream sensors receive the same signal); we assume q(x), that is, q˜(x, y0) = q(x). The tree-width of y0 the channels are noise free. Each sensor is associated q˜(x, y ) is no larger than one plus the tree-width of the 0 with two decision variables: ∈ {0 ±1} represents graph (with utility nodes included) of the ID, which si , the signal send from sensor i, where ±1 represents a is typically much smaller than that of q(x) when the one-bit signal with cost λ and 0 represents “off” with complete utility u(x) is included directly. 
A deriva- no cost; and represents the prediction of based tion similar to that in Theorem 3.1 shows that we di hi on v and the signals s received from i’s upper- can replace θ(x) = log q(x) with θˆ(x) = logq ˆ(x) in i pas(i) stream sensors; a correct prediction ( = ) yields (1) without changing the results. The complexity of di hi a reward (we set = ln 2). Hence, two types of this method may be further improved by exploiting γ γ utility functions are involved, the signal cost utilities the context-specific independence ofq ˆ(x, y), i.e., that uλ, with uλ(si = ±1) = −λ and uλ(si = 0) = 0; qˆ(x|y0) has a different dependency structure for differ- the prediction reward utilities uγ with uγ (di, hi) = γ ent values of y0, but we leave this for future work. if di = hi and uγ (di, hi) = 0 otherwise. The to- tal utility function is constructed multiplicatively via P E Decentralized Sensor Network u = exp[ i uγ (di, hi) + uλ(si)]. We also create two qualities of sensors: “good” sen- In this section, we provide detailed information about sors for which ei are drawn from U([0,.1]) and “bad” the influence diagram constructed for the decentral- sensors (ei ∼ U([.7,.8])), where U is the uniform dis- ized sensor network detection problem in Section5. tribution. Generally speaking, the optimal strategies Let hi, i = 1, . . . , nh, be the hidden variables we want should pass signals from the good sensors to bad sen- to detect using sensors. We assume the distribu- sors to improve their predictive power. See Fig.1 for tion p(h) is an attractive pairwise MRF on a graph the actual influence diagram. uj cj ui ci hi Hidden variables

[Figure 1 appears here.]
(a) The sensor network structure. (b) The influence diagram for the sensor network in (a). Legend: hi: hidden variables; vi: local measurements; di: prediction decisions; si: signal decisions; ui: reward utilities; ci: signal cost utilities.
(c) A loopy junction graph for the ID in (b). Legend: decision clusters; normal clusters; separators.

Figure 1: (a) A node of the sensor network structure; the green lines denote the MRF edges, on some of which (red arrows) signals are allowed to pass. (b) The influence diagram constructed for the sensor network in (a). (c) A loopy junction graph for the ID in (b); π(i) denotes the parent set of i in terms of the signal path Gs, and π′(i) denotes the parent set in terms of the hidden process p(h) (when p(h) is transformed into a Bayesian network by triangularizing reversely along the order o). The decision clusters (black rectangles) are labeled with their corresponding decision variables on their top.

The definition of the ID here is not a standard one, since p(h) is not specified as a Bayesian network; but one could convert p(h) to an equivalent Bayesian network by the standard triangulation procedure. The normalization constant Z in (21) only changes the expected utility function by a constant, and so does not need to be calculated for the purpose of the MEU task.

Without loss of generality, for notation we assume the order [1, . . . , nh] is consistent with the signal path Gs. Let o = [h1, v1, d1, s1; . . . ; h_nh, v_nh, d_nh, s_nh]. The junction tree we used in the experiment is constructed by the standard triangulation procedure, backwards along the order o.
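As a concrete reading of this notation, the order o interleaves each sensor's four variables consecutively, with sensors listed consistently with the signal path. A minimal helper (our own illustration, not code from the paper) that enumerates o for sensors 1, . . . , n:

```python
def make_order(n):
    """Build o = [h1, v1, d1, s1; ...; hn, vn, dn, sn]: each sensor contributes
    its hidden state h_i, measurement v_i, prediction decision d_i, and signal
    decision s_i in a consecutive block, with the sensor indices 1..n assumed
    to already be consistent with the signal path G_s."""
    order = []
    for i in range(1, n + 1):
        order += [f"h{i}", f"v{i}", f"d{i}", f"s{i}"]
    return order
```

Triangulating backwards along this order then yields the junction tree used in the experiments.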
A proper construction of a loopy junction graph is non-trivial; we show in Fig. 1(c) the one we used in Section 5. It is constructed so that the decision structure inside each sensor node is preserved exactly, while at a higher level (among sensors) a standard loopy junction graph (similar to that introduced in Mateescu et al. [2010], which corresponds to Pearl's loopy BP) captures the correlations between the sensor nodes. One can show that a junction graph so constructed reduces to a junction tree when the MRF Gh is a tree and the signal path Gs is an oriented tree.

F Additional Related Work

There exists a large body of work for solving influence diagrams, mostly on exact algorithms with perfect recall; see Koller and Friedman [2009] for a recent review. Our work is most closely connected to the early work of Jensen et al. [1994], who compile an ID to a junction tree structure on which a special message passing algorithm is performed; their notion of strong junction trees is related to our notion of global consistency. However, their framework requires the perfect recall assumption, and it is unclear how to extend it to approximate inference. A somewhat different approach transforms the decision problem into a sequence of standard Bayesian network inference problems [Cooper, 1988, Shachter and Peot, 1992, Zhang, 1998], where each subroutine is a standard inference problem that can be solved using standard algorithms, either exactly or approximately; again, these methods only work under the perfect recall assumption. Other approximation algorithms for IDs are also based on approximating some inner step of exact algorithms: e.g., Sabbadin et al. [2011] and Sallans [2003] approximate the policy update methods by mean field methods, and Nath and Domingos [2010] use adaptive belief propagation to approximate the inner loop of greedy search algorithms. Watthayu [2008] proposed a loopy BP algorithm, but without theoretical justification. To the best of our knowledge, no well-established "direct" approximation methods exist.

For IDs without perfect recall (LIMIDs), backward-induction-like methods do not apply; most algorithms work by optimizing the decision rules node-by-node or group-by-group; see, e.g., Lauritzen and Nilsson [2001], Madsen and Nilsson [2001], Koller and Milch [2003]. These methods reduce to exact backward induction (hence guaranteeing global optimality) if applied to IDs with perfect recall and updated backwards along the temporal ordering. However, they only guarantee local, person-by-person optimality for general LIMIDs, which may be weaker than the optimality guaranteed by our BP-like methods. Other styles of approaches, such as Monte Carlo methods [e.g., Bielza et al., 1999, Cano et al., 2006, Charnes and Shenoy, 2004, Garcia-Sanchez and Druzdzel, 2004] and search-based methods [e.g., Luque et al., 2008, Qi and Poole, 1995, Yuan and Wu, 2010, Marinescu, 2010], have also been proposed. Recently, Maua and Campos [2011] proposed a method for finding the globally optimal strategies of LIMIDs by iteratively pruning non-maximal policies. However, these methods usually appear to have much greater computational complexity than SPU or our BP-like methods.

Finally, some variational inference ideas have been applied to the related problems of reinforcement learning and solving Markov decision processes [e.g., Sallans and Hinton, 2001, Furmston and Barber, 2010, Yoshimoto and Ishii, 2004].

References

C. Bielza, P. Muller, and D. R. Insua. Decision analysis by augmented probability simulation. Management Science, 45(7):995-1007, 1999.
A. Cano, M. Gómez, and S. Moral. A forward-backward Monte Carlo method for solving influence diagrams. Int. J. Approx. Reason., 42(1-2):119-135, 2006.
J. Charnes and P. Shenoy. Multistage Monte Carlo method for solving influence diagrams using local computation. Management Science, pages 405-418, 2004.
G. F. Cooper. A method for using belief networks as influence diagrams. In UAI, pages 55-63, 1988.
A. Detwarasiti and R. D. Shachter. Influence diagrams for team decision analysis. Decision Anal., 2, Dec 2005.
T. Furmston and D. Barber. Variational methods for reinforcement learning. In AISTATS, volume 9, pages 241-248, 2010.
D. Garcia-Sanchez and M. Druzdzel. An efficient sampling algorithm for influence diagrams. In Proc. Second European Workshop on Probabilistic Graphical Models (PGM-04), Leiden, Netherlands, pages 97-104, 2004.
R. Howard and J. Matheson. Influence diagrams. In Readings on Principles & Appl. Decision Analysis, 1985.
R. Howard and J. Matheson. Influence diagrams. Decision Anal., 2(3):127-143, 2005.
D. R. Hunter and K. Lange. A tutorial on MM algorithms. The American Statistician, 58(1), February 2004.
A. Iusem and M. Teboulle. On the convergence rate of entropic proximal optimization methods. Computational and Applied Mathematics, 12:153-168, 1993.
F. Jensen, F. V. Jensen, and S. L. Dittmer. From influence diagrams to junction trees. In UAI, pages 367-373. Morgan Kaufmann, 1994.
J. Jiang, P. Rai, and H. Daumé III. Message-passing for approximate MAP inference with latent variables. In NIPS, 2011.
D. Koller and N. Friedman. Probabilistic graphical models: principles and techniques. MIT Press, 2009.
D. Koller and B. Milch. Multi-agent influence diagrams for representing and solving games. Games and Economic Behavior, 45(1):181-221, 2003.
O. P. Kreidl and A. S. Willsky. An efficient message-passing algorithm for optimizing decentralized detection networks. In IEEE Conf. Decision Control, Dec 2006.
K. Lange, D. Hunter, and I. Yang. Optimization transfer using surrogate objective functions. Journal of Computational and Graphical Statistics, pages 1-20, 2000.
S. Lauritzen and D. Nilsson. Representing and solving decision problems with limited information. Management Sci., pages 1235-1251, 2001.
Q. Liu and A. Ihler. Variational algorithms for marginal MAP. In UAI, Barcelona, Spain, July 2011.
M. Luque, T. Nielsen, and F. Jensen. An anytime algorithm for evaluating unconstrained influence diagrams. In Proc. PGM, pages 177-184, 2008.
A. Madsen and D. Nilsson. Solving influence diagrams using HUGIN, Shafer-Shenoy and lazy propagation. In UAI, volume 17, pages 337-345, 2001.
R. Marinescu. A new approach to influence diagrams evaluation. In Research and Development in Intelligent Systems XXVI, pages 107-120. Springer London, 2010.
B. Martinet. Régularisation d'inéquations variationnelles par approximations successives. Revue Française d'Informatique et de Recherche Opérationnelle, 4:154-158, 1970.
R. Mateescu, K. Kask, V. Gogate, and R. Dechter. Join-graph propagation algorithms. Journal of Artificial Intelligence Research, 37:279-328, 2010.
D. D. Maua and C. P. Campos. Solving decision problems with limited information. In NIPS, 2011.
A. Nath and P. Domingos. Efficient belief propagation for utility maximization and repeated inference. In AAAI Conference on Artificial Intelligence, 2010.
J. Pearl. Influence diagrams - historical and personal perspectives. Decision Anal., 2(4):232-234, Dec 2005.
R. Qi and D. Poole. A new method for influence diagram evaluation. Computational Intelligence, 11(3):498-528, 1995.
P. Ravikumar, A. Agarwal, and M. J. Wainwright. Message-passing for graph-structured linear programs: Proximal projections, convergence, and rounding schemes. Journal of Machine Learning Research, 11:1043-1080, Mar 2010.
R. T. Rockafellar. Monotone operators and the proximal point algorithm. SIAM Journal on Control and Optimization, 14(5):877, 1976.
K. Rose. Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc. IEEE, 86(11):2210-2239, Nov 1998.
R. Sabbadin, N. Peyrard, and N. Forsell. A framework and a mean-field algorithm for the local control of spatial processes. International Journal of Approximate Reasoning, 2011.
B. Sallans. Variational action selection for influence diagrams. Technical Report OEFAI-TR-2003-29, Austrian Research Institute for Artificial Intelligence, 2003.
B. Sallans and G. Hinton. Using free energies to represent Q-values in a multiagent reinforcement learning task. In NIPS, pages 1075-1081, 2001.
E. Schifano, R. Strawderman, and M. Wells. Majorization-minimization algorithms for nonsmoothly penalized objective functions. Electronic Journal of Statistics, 4:1258-1299, 2010.
R. Shachter. Model building with belief networks and influence diagrams. Advances in Decision Analysis: From Foundations to Applications, page 177, 2007.
R. Shachter and M. Peot. Decision making using probabilistic inference methods. In UAI, pages 276-283, 1992.
B. Sriperumbudur and G. Lanckriet. On the convergence of the concave-convex procedure. In NIPS, pages 1759-1767, 2009.
M. Teboulle. Entropic proximal mappings with applications to nonlinear programming. Mathematics of Operations Research, 17(3):670-690, 1992.
P. Tseng and D. Bertsekas. On the convergence of the exponential multiplier method for convex programming. Mathematical Programming, 60(1):1-19, 1993.
F. Vaida. Parameter convergence for EM and MM algorithms. Statistica Sinica, 15(3):831-840, 2005.
R. Viswanathan and P. Varshney. Distributed detection with multiple sensors: part I - fundamentals. Proc. IEEE, 85(1):54-63, Jan 1997.
M. Wainwright and M. Jordan. Graphical models, exponential families, and variational inference. Found. Trends Mach. Learn., 1(1-2):1-305, 2008.
M. Wainwright, T. Jaakkola, and A. Willsky. A new class of upper bounds on the log partition function. IEEE Trans. Info. Theory, 51(7):2313-2335, July 2005.
M. J. Wainwright, T. Jaakkola, and A. S. Willsky. Tree-based reparameterization framework for analysis of sum-product and related algorithms. IEEE Trans. Info. Theory, 45:1120-1146, 2003a.
M. J. Wainwright, T. S. Jaakkola, and A. S. Willsky. MAP estimation via agreement on (hyper)trees: Message-passing and linear programming approaches. IEEE Trans. Info. Theory, 51(11):3697-3717, Nov 2003b.
W. Watthayu. Representing and solving influence diagram in multi-criteria decision making: A loopy belief propagation method. In International Symposium on Computer Science and its Applications (CSA '08), pages 118-125, Oct 2008.
Y. Weiss and W. Freeman. On the optimality of solutions of the max-product belief-propagation algorithm in arbitrary graphs. IEEE Trans. Info. Theory, 47(2):736-744, Feb 2001.
Y. Weiss, C. Yanover, and T. Meltzer. MAP estimation, linear programming and belief propagation with convex free energies. In UAI, 2007.
D. Wolpert. Information theory - the bridge connecting bounded rational game theory and statistical physics. Complex Engineered Systems, pages 262-290, 2006.
J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized BP algorithms. IEEE Trans. Info. Theory, 51, July 2005.
J. Yoshimoto and S. Ishii. A solving method for MDPs by minimizing variational free energy. In International Joint Conference on Neural Networks, volume 3, pages 1817-1822. IEEE, 2004.
C. Yuan and X. Wu. Solving influence diagrams using heuristic search. In Proc. Symp. AI & Math, 2010.
A. L. Yuille. CCCP algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation. Neural Computation, 14, 2002.
A. L. Yuille and A. Rangarajan. The concave-convex procedure. Neural Computation, 15:915-936, April 2003.
N. L. Zhang. Probabilistic inference in influence diagrams. In Computational Intelligence, pages 514-522, 1998.
N. L. Zhang, R. Qi, and D. Poole. A computational theory of decision networks. Int. J. Approx. Reason., 11:83-158, 1994.