
Logarithmic Time Parallel Bayesian Inference

David M. Pennock
University of Michigan AI Laboratory
1101 Beal Avenue
Ann Arbor, MI 48109-2110 USA
dpennock@umich.edu

Abstract

I present a parallel algorithm for exact probabilistic inference in Bayesian networks. For polytree networks with n variables, the worst-case time complexity is O(log n) on a CREW PRAM (concurrent-read, exclusive-write parallel random-access machine) with n processors, for any constant number of evidence variables. For arbitrary networks, the time complexity is O(r^{3w} log n) for n processors, or O(w log n) for r^{3w} n processors, where r is the maximum range of any variable, and w is the induced width (the maximum clique size), after moralizing and triangulating the network.

[Figure 1: Comparing the Opportunities for Parallelism in Two Bayesian Networks. Network (a) seems to have more "natural" parallelism.]

1 INTRODUCTION

Two key breakthroughs make representation of and reasoning with probabilities practical, and have led to a proliferation of related research within the artificial intelligence community. Bayesian networks exploit conditional independence to represent joint probability distributions compactly, and associated inference algorithms evaluate arbitrary conditional probabilities implied by the network representation (Neapolitan 1990; Pearl 1988). Exact inference is known to be NP-hard (more specifically, #P-complete) (Cooper 1990), and even approximate inference is NP-hard (Dagum and Luby 1993). Nonetheless, clever algorithms exploit network topology to make exact probabilistic reasoning practical for a wide variety of real-world applications. Although no algorithm, including any parallel algorithm with a polynomial number of processors, can avoid worst-case exponential time complexity (unless P = NP), polynomial time speed-ups are still of great interest, to allow evaluation of larger, more densely connected Bayesian networks.

Some researchers have proposed parallel algorithms for Bayesian inference (D'Ambrosio 1992; Diez and Mira 1994; Kozlov 1996; Kozlov and Singh 1994; Shachter, Andersen, and Szolovits 1994). In fact, Pearl's original algorithm for inference in singly connected networks was conceived as a distributed algorithm, where each variable could be associated with a separate processor, passing and receiving messages to and from only local, neighboring processors (Neapolitan 1990; Pearl 1988). The popular junction tree algorithm for multiply connected networks (Jensen, Lauritzen, and Olesen 1990; Lauritzen and Spiegelhalter 1988; Neapolitan 1990; Spiegelhalter, Dawid, Lauritzen, and Cowell 1993) can be considered a distributed algorithm in the same sense, with one processor per clique. However, to my knowledge, there are no derivations of improved worst-case complexity bounds for parallel inference, mainly because previous attempts at parallelization only exploit concurrency in the form of independent computations. For example, consider the two networks in Figure 1. In both (a) and (b), the two computations

    Pr(B = b_1) ← Σ_A Pr(B = b_1 | A) Pr(A)   and   Pr(B = b_2) ← Σ_A Pr(B = b_2 | A) Pr(A)

are independent and easily parallelized. D'Ambrosio (1992) calls this a conformal product parallelization, while Kozlov and Singh (1996, 1994) call it a within-clique parallelization. In network (a), the computations Pr(B) ← Σ_A Pr(B | A) Pr(A) and Pr(D) ← Σ_C Pr(D | C) Pr(C) are also independent; this has been called topological parallelism.

However, in the chain network (b), the computation Pr(D) ← Σ_C Pr(D | C) Pr(C) must wait for the computation of Pr(C), which must in turn wait for the computation of Pr(B), etc. Because of the possibility of such chains, any implementation with one processor per variable (or one per clique) that only exploits obvious independent computations will have the same worst-case running time as a serial algorithm. This paper describes a parallelization with improved time complexity, regardless of the network topology. To propagate probabilities in chains, it employs a procedure similar to pointer jumping, a standard "trick" used in parallel algorithms (Cormen, Leiserson, and Rivest 1992; Gibbons and Spirakis 1993; JaJa 1992). Polytrees are handled with a variant of parallel tree contraction (Abrahamson, Dadoun, Kirkpatrick, and Przytycka 1989; JaJa 1992; Miller and Reif 1985). Evidence propagation is achieved with a parallel version of Shachter's arc reversal and evidence absorption (Shachter 1988; Shachter 1990). The computational model employed is the CREW PRAM, or the concurrent-read, exclusive-write parallel random-access machine; this model assumes a shared, global memory that processors can read from, but not write to, simultaneously (Cormen, Leiserson, and Rivest 1992; Gibbons and Spirakis 1993; JaJa 1992). The algorithm is termed PHISHFOOD, for Parallel algoritHm for Inference in BayeSian networks witH a (really) FOOrceD acronym. For Multiply-Connected networks, the name McPHISHFOOD is tastelessly coined.

The next section reviews Bayesian networks and probabilistic inference, and the necessary notation and terminology. For clarity of exposition, the PHISHFOOD algorithm is introduced in four stages, each successively more general. Section 3 presents a key parallelization component via a simple example, describing how to find all marginal probabilities in a chain network. Sections 4, 5, and 6 generalize the algorithm to find conditional probabilities in tree, polytree, and arbitrary networks, respectively. In all cases, for any constant number of evidence variables, the running time is O(r^{3w} log n) with n processors, where r is the maximum range of any variable, and w is the induced width after a fill-in computation. Note that, for trees and polytrees, w is a constant. The running time in multiply connected networks can be improved to O(w log n) if exponentially many (r^{3w} n) processors are employed. Section 7 surveys some related work; Section 8 summarizes, and addresses several open questions.

2 BAYESIAN NETWORKS

Consider a set of n random variables, W = {A_1, A_2, ..., A_n}. The probability that variable A_j takes on the value a_i is Pr(A_j = a_i), where a_i ∈ {1, 2, ..., r}, and r is a constant. A joint probability distribution, Pr(W) = Pr(A_1, A_2, ..., A_n), assigns a probability to every possible combination of values of the variables; it is a mapping from the cross-product of the ranges of the variables to the unit interval [0, 1] (with a normalization constraint). The joint probability distribution thus consists of r^n probabilities, an unmanageably large number even for relatively small r and n. To simplify notation, r is not explicitly indexed, even though it may vary across variables.

A Bayesian network exploits conditional independence to represent a joint distribution more compactly. Consider the variable A_j. Suppose that, given the values of a set of preceding variables pa(A_j), called A_j's parents, A_j is conditionally independent of all other preceding variables. Then we can rewrite the joint probability distribution in a (usually) more compact form:

    Pr(W) = Pr(A_1, A_2, ..., A_n) = ∏_{j=1}^{n} Pr(A_j | pa(A_j)).

For each variable A_j, a conditional probability table (CPT) records the conditional probabilities Pr(A_j | pa(A_j)), for all possible combinations of values for A_j and pa(A_j). (A small concrete sketch of this factorization appears at the end of this section.)

A Bayesian network can be represented graphically as a directed acyclic graph (DAG). Each variable is a node or vertex in the graph, and directed edges encode parent relationships. There is a directed edge from A_i to A_j if and only if A_i is a parent of A_j. We may also refer to A_j as the child of A_i, and ch(A_i) as the set of children of A_i. We use gp(A_j) = pa(pa(A_j)) to denote the set of grandparents (parents of parents) of A_j. A DAG has no directed cycles and thus defines a partial order over its vertices. We assume without loss of generality that the variable indices are consistent with this partial ordering: if A_i is an ancestor of A_j then i < j. We denote a total ordering of the nodes, which may or may not be consistent with the DAG's partial order, as α = {α_1, α_2, ..., α_n}, where α is a permutation of {1, 2, ..., n}. Thus A_{α_3} is the third node according to ordering α, etc.

A polyforest is a DAG with no undirected cycles. A polytree is a connected polyforest. A forest is a polyforest where every node has at most one parent. A tree is a connected forest. Since each tree in a forest can be handled independently, we need not explicitly consider forests. A chain is a tree where each node has at most one child. For complexity analysis, it is generally assumed that the maximum number of parents for any variable is constant, since the size of the input Bayesian network itself is exponential in the parent set size.¹ Given this assumption, standard serial algorithms for inference in trees and polytrees run in O(n) time.

A set C of nodes is called a clique, or a maximal complete set, if all of its nodes are fully connected, and no proper superset of C is fully connected. Within a clique, inference takes O(r^{|C|}) time, since we must essentially sum over C's entire joint distribution.

¹This assumption breaks down for some more compact conditional probability encodings, not treated here, e.g., noisy-OR.
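To make the factorization concrete, here is a minimal sketch in Python with numpy. The three-variable chain A_1 → A_2 → A_3, the value r = 2, and all of the numbers are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical chain A1 -> A2 -> A3 with r = 2; all numbers illustrative.
# pa(A2) = {A1}, pa(A3) = {A2}.
pr_a1 = np.array([0.6, 0.4])                    # Pr(A1)
pr_a2_given_a1 = np.array([[0.9, 0.1],
                           [0.2, 0.8]])         # Pr(A2 | A1), rows index A1
pr_a3_given_a2 = np.array([[0.7, 0.3],
                           [0.5, 0.5]])         # Pr(A3 | A2), rows index A2

# Pr(W) = prod_j Pr(A_j | pa(A_j)): the CPTs store 2 + 4 + 4 numbers,
# while the explicit joint has r**n = 8 entries.
joint = (pr_a1[:, None, None]
         * pr_a2_given_a1[:, :, None]
         * pr_a3_given_a2[None, :, :])
assert np.isclose(joint.sum(), 1.0)             # a valid distribution

# An inference query answered by brute force, for later comparison with
# the parallel algorithms developed below:
pr_a3 = joint.sum(axis=(0, 1))                  # Pr(A3)
```

The brute-force route through the joint is exactly what the compact representation lets us avoid; the algorithms below work directly on the CPTs.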

Most methods for Bayesian inference in multiply connected networks first convert to a (possibly exponentially larger) cycle-free representation, called variously a cluster tree, junction tree, clique tree, or join tree (Jensen, Lauritzen, and Olesen 1990; Lauritzen and Spiegelhalter 1988; Neapolitan 1990; Shachter, Andersen, and Szolovits 1994; Spiegelhalter, Dawid, Lauritzen, and Cowell 1993). First, we moralize the graph by fully connecting ("marrying") the parents of every node, and dropping edge directionality. Next, the graph is filled in according to a given total ordering of nodes α: starting with A_{α_n}, and continuing for j = n − 1 down to j = 1, all preceding neighbors of A_{α_j} (according to ordering α) are fully connected. This fill-in computation ensures that the graph is triangulated, or has no four-node cycles without a chord. Each clique is then clustered into a supernode that can take on all possible combinations of values of its constituents. Supernode clusters, denoted C_j, are indexed according to their highest numbered constituent (according to α). A triangulated graph has the running intersection property: for every j, there exists an i < j such that

    C_j ∩ (C_1 ∪ C_2 ∪ ⋯ ∪ C_{j−1}) ⊆ C_i.

For every j, we choose one i for which this property holds, and connect C_j to C_i. The result is an undirected tree of cliques. In the final step, for every two adjacent cliques, their intersection (called a separating set) is clustered into a new supernode, and placed between the two. The induced width w of a network, relative to an ordering α, is defined as the maximum number of nodes in any clique, after moralization and fill-in. Computing the optimal α, or equivalently the optimal triangulation, that yields the minimum induced width is itself an NP-complete problem (Kloks 1994); for this reason, heuristic methods are usually employed. Serial algorithms for inference in cluster trees run in O(r^w n) time.

If we now reintroduce directionality to the cluster tree edges, consistent with the original DAG partial order, the result is a polytree of supernodes; I call this a directed cluster polytree. Each clique's CPT is simply the product of its constituents' CPTs. Any polytree inference algorithm is directly applicable.

The general problem of Bayesian inference is to find the probability of each variable given a subset E of evidence variables that have been instantiated with values. That is, compute Pr(A_j | E) for every j, where E = {A_{e_1} = a_{e_1}, A_{e_2} = a_{e_2}, ..., A_{e_c} = a_{e_c}}, and each e_k ∈ {1, 2, ..., n}.

3 AN EXAMPLE: FINDING MARGINALS IN CHAINS

Consider the problem of finding all marginal probabilities, Pr(A_j), in the chain network of Figure 2(a), using n processors.

[Figure 2: Propagating Probabilities in a Chain Network. At time step t, the first 2^t marginals, darkened in the figure, have been computed.]

Processor j is assigned to variable A_j, and "owns" the conditional probability table (CPT) for Pr(A_j | A_{j−1}). Note that processor #1 already holds the marginal probability Pr(A_1); it simply records a flag indicating that it is done. At the first time step, each processor j rewrites its CPT in terms of its grandparent, given its own current CPT and its parent's current CPT. Mathematically, processor j computes:

    Pr(A_j | A_{j−2}) ← Σ_{A_{j−1}} Pr(A_j | A_{j−1}) Pr(A_{j−1} | A_{j−2}),

for each combination of values of A_j and A_{j−2}, taking constant (O(r^3)) time. After time step one, processor #2 holds the marginal probability Pr(A_2), so it marks itself done (it knows that it is done because its parent was done at the previous time step). The new network topology after step one is pictured in Figure 2(b). The second time step proceeds exactly as the first: each variable rewrites its CPT in terms of its (new) grandparent:

    Pr(A_j | A_{j−4}) ← Σ_{A_{j−2}} Pr(A_j | A_{j−2}) Pr(A_{j−2} | A_{j−4}).

After time step two, the first four processors hold their respective marginals and mark themselves done. After time step three, the first eight processors are done, etc. After log_2 n time steps, all n processors are done, and all marginal probabilities have been computed; the entire process is conveyed in Figure 2. This procedure of iteratively recomputing CPTs in terms of grandparents is analogous to a widely used method in parallel algorithms to achieve O(log n) computations in linked lists, called pointer jumping. In chain networks, the algorithm can operate in an exclusive-read fashion, though we will use concurrent reads later, to process trees and polytrees.
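With CPTs stored as (r × r) matrices whose rows index the parent's value, the grandparent-rewriting step is exactly a matrix product, so the whole procedure is easy to simulate serially. The sketch below (mine, not the paper's implementation) plays one synchronous parallel time step per while-iteration:

```python
import numpy as np

def chain_marginals(prior, cpts):
    """Simulate PHISHFOOD's pointer-jumping pass on a chain A1 -> ... -> An.

    prior: Pr(A1), shape (r,).  cpts[j]: Pr(A_{j+2} | A_{j+1}) as an
    (r, r) matrix, rows indexing the parent's value.  Each while-iteration
    plays the role of one synchronous parallel time step.
    """
    n = len(cpts) + 1
    table = [prior] + [m.copy() for m in cpts]   # node 0 holds Pr(A1)
    parent = [None] + list(range(n - 1))         # parent pointers
    done = [True] + [False] * (n - 1)            # the root is already done
    while not all(done):
        new = [(table[j], parent[j], done[j]) for j in range(n)]
        for j in range(n):                       # "for each node in parallel"
            if done[j]:
                continue
            p = parent[j]
            # Rewrite this node's CPT in terms of its grandparent:
            # Pr(A_j | A_gp) <- sum_p Pr(A_p | A_gp) Pr(A_j | A_p).
            jumped = table[p] @ table[j]
            if done[p]:                          # parent held its marginal,
                new[j] = (jumped, None, True)    # so 'jumped' is Pr(A_j)
            else:
                new[j] = (jumped, parent[p], False)
        table, parent, done = map(list, zip(*new))
    return table                                 # table[j] == Pr(A_{j+1})
```

For the hypothetical chain of the Section 2 sketch, chain_marginals(pr_a1, [pr_a2_given_a1, pr_a3_given_a2]) returns [Pr(A_1), Pr(A_2), Pr(A_3)], matching the brute-force joint, after two simulated time steps.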

4 INFERENCE IN TREES

This section generalizes PHISHFOOD to compute inference queries in tree networks. The next subsection begins with computing marginal probabilities only; Section 4.2 presents a parallel arc reversal procedure to propagate evidence.

4.1 FINDING MARGINALS IN TREES

Marginal computation in trees proceeds in much the same way as for chains. Again, the central step is for each variable to rewrite its CPT in terms of its grandparent:

    Pr(A_j | gp(A_j)) ← Σ_{pa(A_j)} Pr(A_j | pa(A_j)) · Pr(pa(A_j) | gp(A_j)).   (1)

Because the graph is a tree, each "set" pa(A_j) and gp(A_j) is actually a singleton. The procedure pseudocode is given in Table 1. This simple implementation uses concurrent reads, since each variable may have several grandchildren at a given step. If we consider the root to be at depth one, then, after time step two, all marginals for variables at depth two have been computed; after step three, all marginals at depth four have been computed, etc. All marginals in the tree are computed in O(log d) time, where d is its depth and, in the worst case, is O(n). Notice that, if the tree is balanced (i.e., is of depth O(log n)), then the running time is O(log log n).

PHISH-TREE-MARGINALS(T)
INPUT: Bayesian network tree T
OUTPUT: Pr(A_j) for all j
1. mark root node A_1 done
2. while there is a node not marked done
3.    for each node A_j in parallel
4.       compute (1)
5.       if pa(A_j) is marked done
6.          then mark A_j done
7.          else pa(A_j) ← gp(A_j)

Table 1: Computing all Marginal Probabilities in a Tree Network.

4.2 EVIDENCE PROPAGATION IN TREES

This subsection describes how PHISHFOOD propagates evidence, using a parallel version of arc reversal (a form of Bayes's rule) and evidence absorption, the standard versions of which were developed by Shachter (1988, 1990). We assume that the graph topology is given as an adjacency list, so that each node has pointers to its children as well as to its parent. We utilize a standard parallel subroutine that finds a preorder walk of a tree, starting from any given root node. A preordering numbers the nodes according to the order that they are first reached in a depth-first traversal of the underlying undirected tree. The computation can be done in O(log n) time with n processors, using what is called the Euler tour technique (Gibbons and Spirakis 1993; JaJa 1992; Tarjan and Vishkin 1985).

In Bayesian networks, any edge can be reversed such that the transformed network represents the same underlying joint distribution, as long as the reversal does not create a directed cycle. When an edge between A_i and A_j is reversed, both variables must share each other's parents, which, in general, may add significantly to the number of arcs in the graph. However, a tree can be rerooted at any node without adding any arcs. Consider rerooting the tree at the first evidence variable, A_{e_1}. Let α be a preorder walk of the underlying undirected tree, starting from A_{e_1}. Then α encodes the desired new edge directions; that is, an edge in the rerooted tree points from A_{α_i} to a neighbor A_{α_j} if and only if i < j. A serial arc reversal procedure would begin at the original root A_1, which has no parent, reversing any of A_1's child arcs inconsistent with α. A_1's children have no parents other than A_1 itself, and thus no new arcs are created. Next, any child arcs from ch(A_1) inconsistent with α are reversed, then any child arcs from ch(ch(A_1)), etc. The result is a new tree, rooted at A_{e_1}, with the same underlying undirected topology as the original tree.

Suppose that all marginal probabilities are known, computed as in Section 4.1. Then, since the tree topology is invariant, it is not necessary to reverse arcs in any particular order. We can simply reverse any edges inconsistent with α simultaneously, in constant parallel time, by applying Bayes's rule. To be concrete, assign one processor to every edge (since the graph is a tree, there are n − 1 edges). If an arc from A_i to A_j is inconsistent with the ordering α, then the processor reads the CPT associated with A_j and rewrites the CPT associated with A_i according to Bayes's rule:

    Pr(A_i | A_j) ← Pr(A_j | A_i) · Pr(A_i) / Pr(A_j).   (2)

Since the original and new structures are both trees, each processor reads from and writes over at most one CPT, and each CPT is rewritten by at most one processor. (A concrete sketch of this per-edge update appears at the end of this section.)

In general, evidence absorption (Shachter 1988; Shachter 1990) involves a single update at each of A_{e_1}'s children, and a series of arc reversals propagating to all of A_{e_1}'s ancestors. However, by construction, the evidence variable is the root and has no ancestors. Each of A_{e_1}'s children A_j simply absorbs the evidence in constant time, with the update:

    Pr(A_j) ← Pr(A_j | A_{e_1} = a_{e_1}).   (3)

Next, all of A_{e_1}'s child arcs are eliminated, creating disconnected trees rooted at each of A_{e_1}'s children. The evidence propagation and absorption process is depicted in Figure 3. The implicit marginal probabilities in these new trees are actually posterior probabilities, given the evidence A_{e_1} = a_{e_1}, in the original representation. Note that evidence given in the form of a likelihood function could also be absorbed at A_{e_1}; in this case, A_{e_1}'s child arcs would in general not be eliminated. The algorithm repeats by computing the new marginals as described in Section 4.1, rerooting the tree at A_{e_2}, and absorbing the second evidence variable, etc. The full pseudocode is given in Table 2. For any constant number of evidence variables, its running time is O(log n) with n processors.

[Figure 3: Propagating Evidence in a Tree Network. (a) The initial tree. (b) A preordering α of nodes, starting from A_{e_1}, with inconsistent arcs reversed. (c) Evidence at the new root, A_{e_1}, is absorbed.]

PHISHFOOD-TREE(T, E)
INPUT: Bayesian network tree T, adjacency list, and evidence variables and values: E = {A_{e_1} = a_{e_1}, ..., A_{e_c} = a_{e_c}}
OUTPUT: Pr(A_j | E) for all j
1. PHISH-TREE-MARGINALS(T)
2. for each evidence variable A_{e_k}
3.    α ← PREORDER(T, A_{e_k})
4.    for each edge (A_i, A_j) in parallel
5.       if (A_i, A_j) is inconsistent with α then
6.          compute (2) and store in A_i's CPT
7.    absorb evidence (3) at new root A_{e_k}
8.    PHISH-TREE-MARGINALS(T)

Table 2: Parallel Bayesian Inference in a Tree Network.
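Here is the promised sketch of the per-edge update (2). It is an illustration, not the paper's code, and it assumes the same rows-index-the-parent CPT convention as the earlier sketches:

```python
import numpy as np

def reverse_arc(cpt_j_given_i, marg_i, marg_j):
    """Reverse an arc A_i -> A_j by Bayes's rule, as in (2); a sketch.

    cpt_j_given_i[x_i, x_j] = Pr(A_j = x_j | A_i = x_i); marg_i and marg_j
    are the already computed marginals Pr(A_i) and Pr(A_j).  Returns
    Pr(A_i | A_j), rows indexing A_j's value, i.e., the CPT now stored at
    A_i in the rerooted tree.  Assumes Pr(A_j) > 0 everywhere.
    """
    joint = cpt_j_given_i * marg_i[:, None]   # Pr(A_i, A_j)
    return (joint / marg_j[None, :]).T        # Pr(A_i | A_j), rows = A_j
```

One processor per edge can apply this update independently: each reads one CPT and two marginals, and overwrites one CPT, so the whole reversal pass is constant parallel time, as claimed.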

5 INFERENCE IN POLYTREES

This section describes a parallel algorithm for Bayesian inference in polytree networks. Once again, discussion is divided into two stages: (1) computing marginal probabilities, and (2) propagating evidence.

5.1 FINDING MARGINALS IN POLYTREES

In trees, PHISHFOOD's key to parallelization is for each variable to iteratively rewrite its CPT in terms of its grandparent. In polytrees, a node may have more than one parent, each of which has more than one parent, etc. Since CPT size is exponential in the number of parents, a simple application of PHISH-TREE-MARGINALS would be worst-case intractable, so a modified approach is necessary.

The structure of a polytree implies many independencies between variables, which can be exploited to speed inference (Neapolitan 1990; Pearl 1988; Russell and Norvig 1995). In particular, given no evidence at A_j or at any of A_j's descendents, the parents of A_j are independent. The same holds given any set of evidence variables E that does not contain A_j or any of A_j's descendents. Consider an arbitrary node A_j. If its parents' marginal probabilities are known, then, since its parents are independent, A_j's marginal can be computed. Similarly, its parents' marginals can be computed once its grandparents' marginals are known. Let A_j's induced ancestral polytree² (IAP) be the subnetwork induced by A_j and its ancestors, that is, the network consisting of A_j, pa(A_j), pa(pa(A_j)), pa(pa(pa(A_j))), etc., and the edges between them. Note that this subnetwork contains all of the necessary information to compute Pr(A_j).

I will make use of the following terms: a polynode is a variable with more than one parent, a treenode is a variable with exactly one parent, and a rootnode is a variable with no parents.

I adopt a strategy similar in style to that employed in Miller and Reif's parallel tree contraction algorithm (1985). The algorithm consists of two main steps: the first is called rootnode absorption (analogous to Miller and Reif's RAKE procedure), and the second is treenode jumping (analogous to their COMPRESS procedure). The main conceptual difficulty in directly mapping the current problem to parallel tree contraction is that each node can have both multiple parent and multiple child relationships, each with very different semantics. The key insight is that each node's IAP can be contracted in more or less the standard way, by exploiting the independence properties mentioned above.

We initialize the algorithm by marking all rootnodes as done. In the first step, each node absorbs all of its rootnode parents (if any) in constant parallel time. Consider node A_j. Denote R_j = {A_{i_1}, ..., A_{i_k}} as the set of A_j's rootnode parents, and I_j as the set of A_j's other, internal node parents (thus pa(A_j) = R_j ∪ I_j). Then the update at A_j is as follows:

    Pr(A_j | I_j) ← Σ_{R_j} Pr(A_j | R_j, I_j) Pr(R_j | I_j)
                  = Σ_{R_j} Pr(A_j | R_j, I_j) Pr(A_{i_1}) ⋯ Pr(A_{i_k}),   (4)

where Σ_{R_j} denotes the sum over all combinations of values of the variables in R_j, and the equality follows from the independence properties of a polytree. Any new rootnodes that are formed after absorption now hold their marginal probabilities, and are marked as done. This first phase is illustrated in Figure 4.

²Indiscriminate use of this phrase discouraged.
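A sketch of the update (4) for a single node follows; it is illustrative rather than the paper's implementation, and it assumes a CPT layout with the rootnode-parent axes first and A_j's own value last:

```python
import numpy as np

def absorb_rootnode_parents(cpt, root_marginals):
    """Rootnode absorption, as in (4); a sketch under one layout assumption.

    cpt: Pr(A_j | pa(A_j)), with one leading axis per rootnode parent in
    R_j (in the order of root_marginals), then the internal-parent axes
    I_j, then A_j's own value on the last axis.
    root_marginals: 1-D arrays Pr(A_{i_1}), ..., Pr(A_{i_k}).
    Because the rootnode parents are independent, Pr(R_j | I_j) factors
    into the product of their marginals, so each rootnode parent can be
    summed out by a single contraction.
    """
    out = cpt
    for marg in root_marginals:
        # Sum out the current leading rootnode axis, weighted by Pr(A_i).
        out = np.tensordot(marg, out, axes=([0], [0]))
    return out                                  # Pr(A_j | I_j)
```

If A_j has no internal parents, the result is the marginal Pr(A_j) itself, and A_j becomes a new done rootnode.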

[Figure 4: Rootnode Absorption.]

PHISH-POLY-MARGINALS(P)
INPUT: Bayesian network polytree P
OUTPUT: Pr(A_j) for all j
1. mark all rootnodes done
2. while there is a node not marked done
3.    for each node A_j with rootnode parent(s) in parallel
4.       absorb rootnode parents (4)
5.       if A_j has no more parents
6.          then mark A_j done
7.    for each treenode A_j in parallel
8.       compute (1)
9.       if pa(A_j) is marked done
10.         then mark A_j done
11.         else pa(A_j) ← gp(A_j)
12.   identify new rootnodes, treenodes, polynodes

Table 3: Computing all Marginal Probabilities in a Polytree Network.

[Figure 5: Treenode Jumping.]

Iterative application of rootnode absorption alone would require O(n) time to compute all marginals in an unbalanced polytree. We interleave the second step, treenode jumping, to compress long treenode chains in the network. We accomplish this with one step of the pointer jumping procedure described in Section 3, applied only at treenodes. That is, each treenode CPT is rewritten in terms of its grandparent(s) by computing (1). Since treenodes have exactly one parent, this step does not introduce a combinatorial growth in parent set size. This step will essentially reduce the length of treenode chains in each IAP by half. As before, if the treenode's parent is marked done at time step t − 1, then the treenode itself can be marked done at time step t. This second phase is pictured in Figure 5. After both phases, the algorithm iterates on the transformed polytree: we absorb each new rootnode, rewrite each new treenode CPT in terms of its grandparent(s), and repeat until all nodes are marked done. The algorithm pseudocode is given in Table 3.

For the time complexity analysis, we show that, for each node, the number of "active" nodes (those not marked done) in its induced ancestral polytree (IAP) is reduced by at least half at each iteration. Consider A_j's IAP at time step t − 1, consisting of rootnodes R, treenodes T, and polynodes P. Note that because the network is acyclic, |P| ≤ |R|. (Each polynode implies a path between at least two rootnodes; if there are more polynodes than rootnodes, then there must be a cycle in the IAP.) At time step t, A_j's new IAP consists of the nodes T' ∪ P', where T' ⊆ T and P' ⊆ P, since all rootnodes have been absorbed. For each node in T', exactly one node in T ∪ P has been "jumped", and is thus no longer in A_j's IAP. Thus, |T'| = |T| − |T'| + |P| − |P'|. Combining this equality with the previous inequality, we find that:

    |T'| + |P'| ≤ |R| + |T| − |T'| + |P| − |P'|
    |T'| + |P'| ≤ (1/2)(|R| + |T| + |P|).

Thus, after O(log n) iterations, each node's IAP is reduced to a singleton, and all marginals have been computed.
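Spelling out how the displayed equality and |P'| ≤ |P| ≤ |R| combine (a worked restatement added for clarity, not in the original):

```latex
\begin{align*}
|T'| + |P'| &= \bigl(|T| - |T'| + |P| - |P'|\bigr) + |P'|
    && \text{substituting the equality for } |T'| \\
            &\le |T| - |T'| + |P| - |P'| + |R|
    && \text{since } |P'| \le |P| \le |R|,
\end{align*}
% Adding |T'| + |P'| to both sides:
%   2(|T'| + |P'|) <= |R| + |T| + |P|,
% so the active portion of the IAP at least halves at every iteration,
% and O(log n) iterations suffice.
```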

5.2 EVIDENCE PROPAGATION IN POLYTREES

As with trees, evidence is propagated with arc reversals. The first step is to convert the polytree into a directed cluster polytree, as described in Section 2. Since the moral graph of a polytree is triangulated, the supernodes simply consist of each variable and its parents, and the cluster polytree's size is of the same order as the original polytree. A cluster tree, like an ordinary tree, can be rooted at any node, without changing its underlying undirected topology (Chyu 1991; Shachter, Andersen, and Poh 1991). Once all marginal probabilities are known, we can root the cluster tree with the method described in Section 4.2 in O(log n) time. The algorithm PHISHFOOD-TREE is then applicable, unmodified, for inference in the rooted cluster tree.

6 INFERENCE IN ARBITRARY NETWORKS

In the general case, we first convert to a directed cluster polytree, as described in Section 2. The parallel inference algorithm, called McPHISHFOOD (Multiply Connected PHISHFOOD), is exactly that for polytrees, and has a new acronym only to accommodate any future corporate sponsorship.

Assuming that the fill-in ordering α is chosen with a heuristic method, all of the graph-theoretic computations can be done in polynomial time. Consider the complexity of computing the supernodes' CPTs. After separation sets are introduced, each supernode CPT contains O(r^w) entries, and each entry is computed by multiplying w of the original CPTs' entries. For this task, parallelization is trivial. Each entry of each supernode CPT can be computed independently, taking O(r^w w) time with one processor per clique, or O(log w) time with r^w w processors per clique.³

In both the tree and polytree algorithms, the complexity bottleneck is (1), because this computation involves three supernodes. The summation is evaluated for each possible combination of values of C_i and gp(C_i). A conservative bound for computing the new CPT is O(r^{3w}) time with one processor, or O(log r^w) = O(w) time with r^{3w} processors. We can do slightly better by exploiting separation sets. First, each separation set CPT is rewritten in terms of its separation set grandparent(s), taking O(r^w) time with one processor, or O(1) time with r^w processors. The marginal-finding algorithm is then applied only to the network of separation sets. Once all separation set marginals are known, remaining marginals can be computed in O(r^w) or O(1) time, with one or r^w processors, respectively.

Any cluster tree can be rooted at any of its cliques, without changing its topology (Chyu 1991; Shachter, Andersen, and Poh 1991). For each evidence variable, we root the cluster tree at the supernode that contains it. Each application of Bayes's rule (2) takes O(r^w) time with one processor, or O(1) time with r^w processors. Evidence is absorbed at the new root by instantiating the variable, then renormalizing the remaining constituents of the supernode, in O(r^w) or O(w) time, with one or r^w processors, respectively.

The full McPHISHFOOD algorithm then runs in O(r^{3w} log n) time with n processors, or O(w log n) time with r^{3w} n processors, for any constant number of evidence variables. Let s be the maximum separation set size. If we exploit separation sets as described above, then inference takes O(r^{3s} log n) time with n processors, or O(w + s log n) time with r^{3s} n processors. These improved bounds assume that 3s > w.

³Multiplication (or addition) of m numbers can be done in O(log m) parallel time with m processors.
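The footnote's O(log m) claim is the standard balanced-reduction schedule; a serial simulation in Python (a sketch, not part of the paper):

```python
def parallel_reduce(values, op):
    """Combine m >= 1 numbers in ceil(log2 m) synchronous rounds.

    Each round pairs adjacent elements; with one processor per pair a
    round costs constant parallel time, so products (or sums) of m
    numbers take O(log m) parallel time with m processors.
    """
    vals = list(values)
    while len(vals) > 1:
        # One parallel round: all pairwise combinations are independent.
        nxt = [op(vals[i], vals[i + 1]) for i in range(0, len(vals) - 1, 2)]
        if len(vals) % 2:
            nxt.append(vals[-1])        # odd element carries over
        vals = nxt
    return vals[0]

# E.g., each supernode CPT entry is a product of w factors:
# parallel_reduce(factors, lambda x, y: x * y) finishes in O(log w)
# rounds, and each sum in (1) reduces r**w terms in O(log r**w) = O(w).
```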
7 RELATED WORK

D'Ambrosio (1992) examines the possibility of parallelizing the symbolic probabilistic inference (SPI) algorithm on hypercube parallel architectures. He identifies two sources of concurrency: topological or "evaluation tree" parallelism, and conformal product parallelism; in the former case, he only considers exploiting independent computations, stating that "a lower bound on the running time of an algorithm exploiting evaluation tree parallelism can be calculated by summing the [computations] in the longest path of the evaluation tree." Given several different modeling assumptions, he empirically evaluates predicted computational, communication, and storage costs, and parallel speedup and efficiency. He concludes that good opportunities exist for parallelizing independent conformal product evaluations, but not much natural concurrency is present at the topological level.

Kozlov and Singh (1996, 1994) describe experimental results for a parallel version of a junction tree algorithm, implemented on a Stanford DASH multiprocessor and an SGI Challenge XL. They identify the same two types of concurrency (topological and in-clique), and also only consider parallelizing independent operations: "computations for cliques that do not have an ancestor-descendent relationship can be done in parallel" (Kozlov 1996).

Delcher et al. (1996) give an O(log n) time serial algorithm for inference queries in tree networks, after a linear time preprocessing step to generate a special data structure. This data structure is actually based on the tree contraction sequence described by Abrahamson et al. (1989) in the context of a parallel algorithm.

8 CONCLUSION AND FUTURE WORK

I have presented PHISHFOOD, a parallel algorithm for exact Bayesian inference that, for any constant number of evidence variables, runs in O(log n) time for polytrees, O(r^{3w} log n) time for cluster trees with n processors, and O(w log n) time for cluster trees with r^{3w} n processors. Concurrent memory reads, but not writes, are required.

There are several possible directions for improving the algorithm. An O(log n) exclusive-read polytree algorithm may be possible, perhaps based on Abrahamson et al.'s EREW parallel tree contraction algorithm (1989).⁴ The work done by an algorithm is defined as its running time multiplied by the number of processors used. There may exist a work-efficient version of PHISHFOOD that uses only n/log n processors.

Future work might also pursue parallelizations for other types of inference queries (e.g., most probable explanation, maximum expected utility, value of information), and for other conditional probability encodings (e.g., noisy-OR). Certainly, an implementation of these algorithms is called for, on a specific parallel architecture, along with empirical comparisons with existing serial and parallel algorithms.

⁴Any CREW (or CRCW) algorithm can run on an EREW PRAM with an O(log n) factor slowdown (Cormen, Leiserson, and Rivest 1992; Gibbons and Spirakis 1993; JaJa 1992).

Acknowledgments

Thanks to David Pynadath and to the anonymous reviewers.

References

Abrahamson, K., N. Dadoun, D. G. Kirkpatrick, and T. Przytycka (1989). A simple parallel tree contraction algorithm. Journal of Algorithms 10, 287-302.

Chyu, C. C. (1991). Decomposable probabilistic influence diagrams. Probability in the Engineering and Informational Sciences 5, 229-243.

Cooper, G. (1990). The computational complexity of probabilistic inference using Bayes belief networks. Artificial Intelligence 42, 393-405.

Cormen, T. H., C. E. Leiserson, and R. L. Rivest (1992). Introduction to Algorithms. New York: McGraw-Hill.

Dagum, P. and M. Luby (1993). Approximating probabilistic inference in Bayesian belief networks is NP-hard. Artificial Intelligence 60, 141-153.

D'Ambrosio, B. (1992). Parallelizing probabilistic inference: Some early explorations. In Proceedings of the Eighth Conference on Uncertainty in Artificial Intelligence (UAI-92), Portland, OR, USA, pp. 59-66.

Delcher, A. L., A. J. Grove, S. Kasif, and J. Pearl (1996, February). Logarithmic-time updates and queries in probabilistic networks. Journal of Artificial Intelligence Research 4, 37-59.

Diez, F. J. and J. Mira (1994, January). Distributed inference in Bayesian networks. Cybernetics and Systems 25(1), 39-61.

Gibbons, A. and P. Spirakis (1993). Lectures on Parallel Computation. Cambridge, UK: Cambridge University Press.

JaJa, J. (1992). An Introduction to Parallel Algorithms. Reading, MA, USA: Addison-Wesley.

Jensen, F. V., S. L. Lauritzen, and K. G. Olesen (1990). Bayesian updating in causal probabilistic networks by local computations. Computational Statistics Quarterly 4, 269-282.

Kloks, T. (1994). Treewidth: Computations and Approximations. Berlin: Springer-Verlag.

Kozlov, A. V. (1996, December). Parallel implementations of probabilistic inference. Computer 29(12), 33-40.

Kozlov, A. V. and J. P. Singh (1994, November). A parallel Lauritzen-Spiegelhalter algorithm for probabilistic inference. In Proceedings of Supercomputing '94, Washington, DC.

Lauritzen, S. L. and D. J. Spiegelhalter (1988). Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B 50, 157-224.

Miller, G. L. and J. Reif (1985). Parallel tree contraction and its applications. In Proceedings of the 26th IEEE Symposium on Foundations of Computer Science, pp. 478-489.

Neapolitan, R. E. (1990). Probabilistic Reasoning in Expert Systems: Theory and Algorithms. New York: John Wiley and Sons.

Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems. Morgan Kaufmann.

Russell, S. J. and P. Norvig (1995). Artificial Intelligence: A Modern Approach. New Jersey: Prentice-Hall.

Shachter, R. D. (1988). Probabilistic inference and influence diagrams. Operations Research 36(4), 589-604.

Shachter, R. D. (1990). Evidence absorption and propagation through evidence reversals. In M. Henrion, R. D. Shachter, L. N. Kanal, and J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence, Volume 5, pp. 173-190. North Holland.

Shachter, R. D., S. K. Andersen, and K. L. Poh (1991). Directed reduction algorithms and decomposable graphs. In P. P. Bonissone, M. Henrion, L. N. Kanal, and J. F. Lemmer (Eds.), Uncertainty in Artificial Intelligence, Volume 6, pp. 197-208. Amsterdam: North Holland.

Shachter, R. D., S. K. Andersen, and P. Szolovits (1994). Global conditioning for probabilistic inference in belief networks. In Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94), Portland, OR, USA, pp. 514-522.

Spiegelhalter, D. J., A. P. Dawid, S. L. Lauritzen, and R. G. Cowell (1993). Bayesian analysis in expert systems. Statistical Science 8(3), 219-247.

Tarjan, R. E. and U. Vishkin (1985, November). An efficient parallel biconnectivity algorithm. SIAM Journal on Computing 14(4), 862-874.