Exploring Multiple Dimensions of Parallelism in Junction Tree Message Passing

Lu Zheng and Ole J. Mengshoel
Electrical and Computer Engineering, Carnegie Mellon University

Abstract

Belief propagation over junction trees is known to be computationally challenging in the general case. One way of addressing this computational challenge is to use node-level parallel computing, and parallelize the computation associated with each separator potential table cell. However, this approach is not efficient for junction trees that mainly contain small separators. In this paper, we analyze this problem, and address it by studying a new dimension of node-level parallelism, namely arithmetic parallelism. In addition, on the graph level, we use a clique merging technique to further adapt junction trees to parallel computing platforms. We apply our parallel approach to both marginal and most probable explanation (MPE) inference in junction trees. In experiments with a Graphics Processing Unit (GPU), we obtain for marginal inference an average speedup of 5.44x and a maximum speedup of 11.94x; speedups for MPE inference are similar.

1 INTRODUCTION

Bayesian networks (BNs) are frequently used to represent and reason about uncertainty. The junction tree is a secondary data structure that can be compiled from a BN [2, 4, 5, 9, 10, 19]. Junction trees can be used for both marginal and most probable explanation (MPE) inference in BNs. Sum-product belief propagation on a junction tree is perhaps the most popular exact marginal inference algorithm [8], and max-product belief propagation can be used to compute the most probable explanations [2, 15]. However, belief propagation is computationally hard, and the computational difficulty increases dramatically with the density of the BN, the number of states of each network node, and the treewidth of the BN, which is upper bounded by the generated junction tree [13]. This computational challenge may hinder the application of BNs in cases where real-time inference is required.

Parallelization of computation is a feasible way of addressing this computational challenge [1, 6, 7, 9, 11, 12, 14, 19, 20]. A data parallel implementation for junction tree inference has been developed for a cache-coherent shared-address-space machine with physically distributed main memory [9]. Parallelism in the basic sum-product computation has been investigated for Graphics Processing Units (GPUs) [19]. The efficiency of using disk memory for exact inference, using parallelism and other techniques, has been improved [7]. An algorithm for parallel BN inference using pointer jumping has been developed [14]. Both parallelization based on graph structure [12] and node-level primitives for parallel computing based on a table extension idea [20] have been introduced; the latter idea was later implemented on a GPU [6]. Gonzalez et al. developed a parallel belief propagation algorithm based on parallel graph traversal to accelerate the computation [3].

A parallel message computation algorithm for junction tree belief propagation, based on the cluster-sepset mapping method [4], has also been introduced [22]. Cluster-sepset based node-level parallelism (denoted element-wise parallelism in this paper) can accelerate the junction tree algorithm [22]; unfortunately, the performance varies substantially between different junction trees. In particular, for small separators in junction trees, element-wise parallelism [22] provides limited parallel opportunity, as explained in this paper.

Our work aims at addressing the small-separator issue. Specifically, this paper makes the following contributions, which further speed up computation and make performance more robust over different BNs from applications:

• We discuss another dimension of parallelism, namely arithmetic parallelism (Section 3.1). Integrating arithmetic parallelism with element-wise parallelism, we develop an improved parallel sum-product propagation algorithm, as discussed in Section 3.2.

• We also develop and test a parallel max-product propagation algorithm (Section 3.3) based on the two dimensions of parallelism.

• On the graph level, we use a clique merging technique (Section 4), which leverages the two dimensions of parallelism, to adapt the various Bayesian networks to the parallel computing platform.

In our GPU experiments, we test the novel two-dimensional parallel approach for both regular sum-propagation and max-propagation. Results show that our techniques improve the performance of both kinds of belief propagation significantly.

Our paper is organized as follows: In Section 2, we review BNs, junction trees, parallel computing using GPUs, and the small-separator problem. In Section 3 and Section 4, we describe our parallel approach to message computation for belief propagation in junction trees. Theoretical analysis of our approach is in Section 5. Experimental results are discussed in Section 6, while Section 7 concludes and outlines future research.

2 BACKGROUND

2.1 Belief Propagation in Junction Trees

A BN is a compact representation of a joint distribution over a set of random variables X. A BN is structured as a directed acyclic graph (DAG) whose vertices are the random variables. The directed edges induce dependence and independence relationships among the random variables. The evidence in a Bayesian network consists of instantiated random variables.

The junction tree algorithm propagates beliefs (or posteriors) over a derived graph called a junction tree. A junction tree is generated from a BN by means of moralization and triangulation [10]. Each vertex Ci of the junction tree contains a subset of the random variables, denoted Xi ⊆ X, and forms a clique in the moralized and triangulated BN. Each vertex of the junction tree has a potential table φXi. With the above notation, a junction tree can be defined as J = (T, Φ), where T represents a tree and Φ represents all the potential tables associated with this tree. Assuming Ci and Cj are adjacent, a separator Sij is induced on the connecting edge. The variables contained in Sij are defined to be Xi ∩ Xj.

The junction tree size, and hence also junction tree computation, can be lower bounded by treewidth, which is defined to be the minimal size of the largest junction tree clique minus one. Considering a junction tree with treewidth tw, the amount of computation is lower-bounded by O(exp(c·tw)), where c is a constant.

Belief propagation is invoked when we get new evidence e for a set of variables E ⊆ X. We need to update the potential tables Φ to reflect this new information. To do this, belief propagation over the junction tree is used. This is a two-phase procedure: evidence collection and evidence distribution. In the evidence collection phase, messages are collected from the leaf vertices all the way up to a designated root vertex. In the evidence distribution phase, messages are distributed from the root vertex back to the leaf vertices.

Message passing from Ci to an adjacent vertex Ck, with separator Sik, involves two steps:

1. Reduction step. In sum-propagation, the potential table φSik of the separator is updated to φ*Sik by reducing the potential table φXi:

\[ \phi^*_{S_{ik}} = \sum_{X_i \setminus S_{ik}} \phi_{X_i} \qquad (1) \]

2. Scattering step. The potential table of Ck is updated using both the old and the new table of Sik:

\[ \phi^*_{X_k} = \phi_{X_k} \, \frac{\phi^*_{S_{ik}}}{\phi_{S_{ik}}} \qquad (2) \]

We define 0/0 = 0 in this case, that is, if the denominator in (2) is zero, then we simply set the corresponding φ*Xk to zero.

2.2 Junction Trees and Parallelism

Current emerging many-core platforms, like the recent Graphics Processing Units (GPUs) from NVIDIA and Intel's Knights Ferry, are built around an array of processors running many threads of execution in parallel. These chips employ a Single Instruction Multiple Data (SIMD) architecture. Threads are grouped using a SIMD structure, and each group shares a multithreaded instruction unit. The key to good performance on such platforms is finding enough parallel opportunities.

We now consider opportunities for parallel computing in junction trees. Associated with each junction tree vertex Ci and its variables Xi, there is a potential table φXi containing non-negative real numbers that are proportional to the joint distribution of Xi. If the j-th variable contains sj states, the minimal size of the potential table is |φXi| = ∏_{j=1}^{|Xi|} sj, where |Xi| is the cardinality of Xi.

Equations (1) and (2) reveal two dimensions of parallelism opportunity. The first dimension, which we return to in Section 3, is arithmetic parallelism. The second dimension is element-wise parallelism [22]. Element-wise parallelism in junction trees is based on the fact that the computations related to the individual separator potential table cells are independent, and it takes advantage of an index mapping table; see Figure 2. In Figure 2, this independence is illustrated by the white and grey coloring of cells in the cliques, the separator, and the index mapping tables. More formally, an index mapping table µXi,S stores the index mappings from φXi to φSik [4]. We create |φSik| such mapping tables. In each mapping table µXi,φSik(j) we store the indices of the elements of φXi mapping to the j-th separator table element. Mathematically,

\[ \mu_{X_i,\phi_{S_{ik}}}(j) = \{\, r \in [0, |\phi_{X_i}| - 1] \mid \phi_{X_i}(r) \text{ is mapped to } \phi_{S_{ik}}(j) \,\}. \]

With the index mapping tables, element-wise parallelism is obtained by assigning one thread per mapping table of a separator potential table, as illustrated in Figure 2 and Figure 3. Consequently, belief propagation over junction trees can often be sped up by using a hybrid CPU/GPU approach [22].

2.3 Small Separator Problem

Figure 1: Histograms of the separator potential table sizes of junction trees Pigs and Munin3. For both junction trees, the great majority of the separator tables contain 20 or fewer elements.

Figure 1 contains the histograms of two real-world BNs, Pigs and Munin3.¹ We see that most separators in these two BNs are quite small and have a potential table size of less than 20.

In general, small separators can be found in three scenarios: (i) due to two small neighboring cliques (we call it the S-S-S pattern); (ii) due to a small intersection set of two big neighboring cliques (the B-S-B pattern); and (iii) due to one small neighboring clique and one big neighboring clique (the B-S-S pattern).² Due to parallel computing issues, detailed next, these three patterns characterize what we call the small-separator problem.

Figure 2: Due to the small separator in the B-S-S pattern, a long index mapping table is produced. If only element-wise parallelism is used, there is just one thread per index mapping table, resulting in slow sequential computation.

Unfortunately, element-wise parallelism may not provide enough parallel opportunities when a separator is very small and the mapping tables to one or both cliques are very long. A small-scale example, reflecting the B-S-S pattern, is shown in Figure 2.³ While the mapping tables may be processed in parallel, the long mapping tables result in a significant amount of sequential computation within each mapping table. A state-of-the-art GPU typically supports more than one thousand concurrent threads, so message passing through small separators will leave most of the GPU resources idle. This is a major performance bottleneck, which we address next.

In this paper, to handle the small-separator problem, we use clique merging (see Section 4) to eliminate small separators resulting from the S-S-S pattern, and arithmetic parallelism (see Section 3) to attack the B-S-S and B-S-B patterns.

¹These BNs can be downloaded at http://bndg.cs.aau.dk/html/bayesian_networks.html
²These patterns, when read left-to-right, describe the size of a clique, the size of a neighboring separator, and the size of a neighboring clique (different from the first) as found in a junction tree. For example, the pattern B-S-S describes a big clique, a small separator, and a small clique.
³In fact, both the "long" and "short" index mapping tables have for presentation purposes been made short; in a realistic junction tree a "long" table can have more than 10,000 entries.
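To make the mapping tables and the small-separator effect concrete, here is a self-contained host-side sketch that builds the mapping tables for a toy clique over three binary variables and performs the reduction of Equation (1) with them. The example and all names in it are ours, not the authors' implementation; note how a small separator yields few but long mapping tables, which is exactly the pattern that starves element-wise parallelism.

// mapping.cu: host-side sketch (our toy example, not the paper's code) of
// the cluster-sepset index mapping tables [4] for clique Xi = {A,B,C}
// over binary variables and separator S = {B}. There are |phi_S| = 2
// tables, each of length |phi_Xi| / |phi_S| = 4.
#include <cstdio>
#include <vector>

int main() {
    const int cliqueSize = 8;                 // |phi_Xi| = 2*2*2
    const int sepSize = 2;                    // |phi_S|, separator {B}

    // mu[j] lists the clique-table indices mapped to separator entry j.
    std::vector<std::vector<int>> mu(sepSize);
    for (int r = 0; r < cliqueSize; ++r) {
        int b = (r / 2) % 2;                  // value of B in configuration r = a*4 + b*2 + c
        mu[b].push_back(r);
    }

    // Reduction, Equation (1): phi*_S(j) sums the entries listed in mu(j).
    const float phiXi[8] = {0.1f, 0.2f, 0.3f, 0.4f, 0.5f, 0.6f, 0.7f, 0.8f};
    for (int j = 0; j < sepSize; ++j) {
        float s = 0.0f;
        for (int r : mu[j]) s += phiXi[r];
        printf("phi*_S(%d) = %.1f over %zu clique entries\n", j, s, mu[j].size());
    }
    return 0;
}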

3 PARALLEL MESSAGE PASSING IN JUNCTION TREES

Figure 3: Example of data structure for two GPU cliques and their separator. Arithmetic parallelism (the reverse pyramid at the bottom) is integrated with element-wise parallelism (mapping tables at the top).

In order to handle the small-separator problem, we discuss another dimension of parallelism in addition to element-wise parallelism, namely arithmetic parallelism. Arithmetic parallelism exploits the parallel opportunities in the summation of (1) and in the multiplication of (2); the parallel summation adds 2^d terms in d cycles. By also considering arithmetic parallelism, we can better match the junction tree to the many-core GPU platform by optimizing the computing resources allocated to the two dimensions of parallelism. Mathematically, this optimization can be modeled as a computation time minimization problem:

\[ \min_{p_e, p_a} \; T(p_e, p_a, C_i, C_j, \Phi) \quad \text{subject to:} \quad p_e + p_a \le p_{tot} \qquad (3) \]

where T(·) is the time consumed by a message passing from clique Ci to clique Cj; pe and pa are the numbers of parallel threads allocated to the element-wise and arithmetic dimensions, respectively; ptot is the total number of parallel threads available on the GPU; and Φ is a collection of GPU-related parameters, such as the cache size. Equation (3) is a formulation of optimizing algorithm performance on a parallel computing platform. Unfortunately, traditional optimization techniques can typically not be applied to this optimization problem, because the analytical form of T(·) is usually not available, due to the complexity of the hardware platform. In our work we therefore choose pe and pa empirically for our implementation. In the rest of this section, we describe our algorithm design, seeking to exploit both element-wise and arithmetic parallelism.

3.1 Arithmetic Parallelism

Arithmetic parallelism needs to be explored in different ways for reduction and scattering, and also integrated with element-wise parallelism, as we discuss now.

For reduction, given a certain fixed element j, Equation (1) is essentially a summation over all the clique potential table φXi elements indicated by the corresponding mapping table µXi,φSik(j). The number of summands is |µXi,φSik(j)|. We compute the summation in parallel by using the approach illustrated in Figure 3. This improves the handling of long index mapping tables induced, for example, by the B-S-B and B-S-S patterns. The summation is done in several iterations. In each iteration, the numbers are divided into two groups, and the corresponding two numbers in each group are added in parallel. At the end of this recursive process, the sum is obtained, as shown in Algorithm 1. The input parameter d is discussed in Section 3.2; the remaining inputs are an array of floats.

Algorithm 1 ParAdd(d, op(0), ..., op(2^d − 1))
Input: d, op(0), ..., op(2^d − 1).
  for i = 1 to d do
    for j = 0 to 2^(d−i) − 1 in parallel do
      op(j) = op(j) + op(j + 2^(d−i))
    end for
  end for
  return op(0)

For scattering, Equation (2) updates the elements of φXk independently, even though φSik and φ*Sik are re-used to update different elements. Therefore, we can compute each multiplication in (2) with a single thread. The parallel multiplication algorithm is given in Algorithm 2. The input parameter p is discussed in Section 3.2; the remaining inputs are an array of floats.

Algorithm 2 ParMul(p, op1, op2(0), ..., op2(p − 1))
Input: p, op1, op2(0), ..., op2(p − 1).
  for j = 0 to p − 1 in parallel do
    op2(j) = op1 ∗ op2(j)
  end for
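To make Algorithm 1 concrete, the following CUDA sketch implements the d-round tree summation with one thread block of 2^d threads. Kernel and variable names are ours, and the code is a minimal illustration under the assumption that the operands fit in one block, not the paper's implementation.

// paradd.cu: minimal CUDA sketch of Algorithm 1 (names are ours): one
// thread block of 2^d threads sums 2^d operands in d parallel rounds.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void parAddKernel(float* op, int d) {
    int j = threadIdx.x;                    // thread index, 0 .. 2^d - 1
    for (int i = 1; i <= d; ++i) {
        int half = 1 << (d - i);            // 2^(d-i) additions this round
        if (j < half) op[j] += op[j + half];
        __syncthreads();                    // finish round i before round i+1
    }
}

int main() {
    const int d = 3;
    const int n = 1 << d;                   // 8 operands, summed in 3 rounds
    float h[8] = {1, 2, 3, 4, 5, 6, 7, 8};  // expected sum: 36
    float* dev;
    cudaMalloc(&dev, n * sizeof(float));
    cudaMemcpy(dev, h, n * sizeof(float), cudaMemcpyHostToDevice);
    parAddKernel<<<1, n>>>(dev, d);
    cudaMemcpy(h, dev, sizeof(float), cudaMemcpyDeviceToHost);
    printf("sum = %g\n", h[0]);
    cudaFree(dev);
    return 0;
}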

3.2 Parallel Belief Propagation

Combining element-wise and arithmetic parallelism, we design the reduction and scattering operations as shown in Algorithm 3 and Algorithm 4. Both of these algorithms take advantage of arithmetic parallelism. In Algorithm 3, the parameter d is the number of cycles used to compute the reduction (summation), while in Algorithm 4, p is the number of threads operating in parallel. Both p and d are parameters that determine the degree of arithmetic parallelism. They can be viewed as a special form of pa in (3).

Algorithm 3 Reduction(d, φXi, φSik, µXi,Sik)
Input: d, φXi, φSik, µXi,Sik.
  for n = 1 to |φSik| in parallel do
    sum = 0
    for j = 0 to ⌈|µXi,Sik(n)| / 2^d⌉ do
      sum = sum + ParAdd(d, φXi(µXi,Sik(n)(j · 2^d)), ..., φXi(µXi,Sik(n)((j + 1) · 2^d − 1)))
    end for
    φ*Sik(n) = sum
  end for

Algorithm 4 Scattering(p, φXk, φSik, µXk,Sik)
Input: p, φXk, φSik, µXk,Sik.
  for n = 1 to |φSik| in parallel do
    for j = 0 to ⌈|µXk,Sik(n)| / p⌉ do
      ParMul(p, φ*Sik(n)/φSik(n), φXk(µXk,Sik(n)(j · p)), ..., φXk(µXk,Sik(n)((j + 1) · p − 1)))
    end for
  end for

Based on Algorithm 3 and Algorithm 4, junction tree message passing can be written as shown in Algorithm 5.

Algorithm 5 PassMessage(p, d, φXi, φXk, φSik, µXi,Sik, µXk,Sik)
Input: p, d, φXi, φXk, φSik, µXi,Sik, µXk,Sik.
  Reduction(d, φXi, φSik, µXi,Sik)
  Scattering(p, φXk, φSik, µXk,Sik)

Belief propagation can be done using both breadth-first and depth-first traversal over a junction tree. We use the Hugin algorithm [5], which adopts depth-first belief propagation. Given a junction tree J with root vertex Croot, we first initialize the junction tree by multiplying together the Bayesian network potential tables (CPTs). Then, two-phase belief propagation is adopted [10]: collect evidence and then distribute evidence [22].

3.3 Max-product Belief Propagation

In this paper, we also apply our parallel techniques to max-product propagation (or, in short, max-propagation), which is also referred to as the Viterbi algorithm. Max-propagation solves the problem of computing a most probable explanation. For max-propagation, the Σ in (1) is replaced by max [2, 15]. In this paper we use sum-product propagation to explain our parallel algorithms; the explanation can generally be changed to cover max-propagation by replacing add with max.
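To sketch how Algorithm 3 maps onto a GPU, the following CUDA kernel (kernel and variable names are ours; an illustrative sketch, not the paper's implementation) assigns one thread block per separator entry (element-wise parallelism) and lets the 2^d threads of the block combine their partial sums with the tree summation of Algorithm 1 (arithmetic parallelism). We assume blockDim.x = 2^d and that all mapping tables of the separator have the same length |φXi|/|φSik|, as produced by the cluster-sepset construction.

// reduction.cu: sketch of Algorithm 3 on the GPU (names are ours).
#include <cuda_runtime.h>

__global__ void reductionKernel(const float* phiX,  // clique potential table
                                const int* mu,      // |phi_S| x muLen mapping tables
                                int muLen,          // entries per mapping table
                                float* phiSstar) {  // updated separator table
    extern __shared__ float buf[];
    int n = blockIdx.x;                   // separator table entry for this block
    int t = threadIdx.x;

    // Serial j-loop of Algorithm 3: each thread accumulates a strided
    // share of mapping table mu(n).
    float partial = 0.0f;
    for (int j = t; j < muLen; j += blockDim.x)
        partial += phiX[mu[n * muLen + j]];
    buf[t] = partial;
    __syncthreads();

    // ParAdd (Algorithm 1): d rounds of pairwise additions in shared memory.
    for (int half = blockDim.x >> 1; half > 0; half >>= 1) {
        if (t < half) buf[t] += buf[t + half];
        __syncthreads();
    }
    if (t == 0) phiSstar[n] = buf[0];     // Equation (1) for entry n
}

// Example launch: one block per separator entry, 2^d threads per block:
//   reductionKernel<<<sepSize, 1 << d, (1 << d) * sizeof(float)>>>(
//       phiX, mu, muLen, phiSstar);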

4 CLIQUE MERGING FOR JUNCTION TREES

The performance of our parallel algorithm is to a large extent determined by the degree of parallelism available in message passing, which intuitively can be measured by the separator size |φSik|, which determines the element-wise parallelism, and the mapping table size |µXi,Sik(n)|, which upper bounds the arithmetic parallelism. In other words, the larger |φSik| and |µXi,Sik(n)| are, the greater is the parallelism opportunity. Therefore, message passing between small cliques (the S-S-S pattern), where |φSik| and |µXi,Sik(n)| are small, is not expected to have good performance. There is not enough parallelism to make full use of a GPU's computing resources.

In order to better use the GPU computing power, we propose to remove small separators (which follow the S-S-S pattern) by selectively merging neighboring cliques. This increases the length of the mapping tables; however, the arithmetic parallelism techniques introduced in Section 3 can handle this. Clique merging can be done offline according to the following theorem [8].

Theorem 4.1 Two neighboring cliques Ci and Cj in a junction tree J with the separator Sij can be merged together into an equivalent new clique Cij with the potential function

\[ \phi(x_{C_{ij}}) = \frac{\phi(x_{C_i}) \, \phi(x_{C_j})}{\phi(x_{S_{ij}})}, \qquad (4) \]

while keeping all other parts of the junction tree unchanged.

The result of merging cliques is three-fold: (i) it produces larger clique nodes and thus longer mapping tables; (ii) it eliminates small separators; and (iii) it reduces the number of cliques. Larger clique nodes will result in more computation and therefore longer processing time for each single thread, but getting rid of small separators will improve utilization of the GPU and reduce computation time. We have to manage these two conflicting objectives to improve the overall performance of our parallel junction tree algorithm.

Our algorithm for clique merging is shown in Algorithm 6. It uses two heuristic thresholds, the separator size threshold τs and the index mapping table size threshold τµ, to control the two effects mentioned above. We only merge two neighboring cliques Ci and Cj into a new clique Cij when |φSij| < τs and |µXj,φS| < τµ.

Algorithm 6 MergeCliques(J, τs, τµ)
  merge_flag = 1
  while merge_flag do
    merge_flag = 0
    for each adjacent clique pair (Ci, Cj) in J do
      if |φSij| < τs and |µXj,φS| < τµ then
        Merge(J, Ci, Cj)
        merge_flag = 1
      end if
    end for
  end while

Given an S-S-S-S-S pattern, Algorithm 6 may merge two S cliques and produce an S-B-S pattern. Here, B is the merged clique. Note that the B-S sub-pattern creates a long index mapping table, which is exactly what arithmetic parallelism handles. There is, in other words, potential synergy between clique merging and arithmetic parallelism, as is further explored in the experiments in Section 6.
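As a worked instance of Theorem 4.1, the following host-side sketch (a toy example of ours, with binary variables and made-up potential values) merges Ci = {A, B} and Cj = {B, C} over the separator S = {B} using Equation (4).

// merge.cu: host-side illustration of Theorem 4.1 (toy example):
// phi(a,b,c) = phi(a,b) * phi(b,c) / phi(b), all variables binary.
#include <cstdio>

int main() {
    const float phiCi[4] = {0.2f, 0.8f, 0.5f, 0.5f};  // phi(a,b), index a*2 + b
    const float phiCj[4] = {0.3f, 0.7f, 0.9f, 0.1f};  // phi(b,c), index b*2 + c
    const float phiS[2]  = {0.7f, 1.3f};              // phi(b)

    float phiCij[8];                                  // merged table, index a*4 + b*2 + c
    for (int a = 0; a < 2; ++a)
        for (int b = 0; b < 2; ++b)
            for (int c = 0; c < 2; ++c)
                phiCij[a * 4 + b * 2 + c] =
                    phiCi[a * 2 + b] * phiCj[b * 2 + c] / phiS[b];

    for (int r = 0; r < 8; ++r)
        printf("phi_Cij(%d) = %g\n", r, phiCij[r]);
    return 0;
}

The merged potential table has |φCi| · |φCj| / |φSij| = 8 entries here; the separator S disappears, and subsequent messages through Cij use longer mapping tables, which is what arithmetic parallelism is designed to absorb.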

5 ANALYSIS AND DISCUSSION

In this section, we analyze the theoretical speedup of our two-dimensional parallel junction tree inference algorithm under the idealized assumption that there is an unlimited number of parallel threads available from the many-core computing platform.

The degree of parallelism opportunity is jointly determined by the size of the separator's potential table, |φS|, and the size of the index mapping table, |µX,φS|. Consider a message passed from Ci to Ck. Since we employ separator table element-wise parallelism in our algorithm, we only need to focus on the computation related to one particular separator table element. With the assumption of unlimited parallel threads, we can choose d = ⌈log |µXi,φS|⌉. The time complexity of the reduction is then ⌈log |µXi,φS|⌉, due to our use of summation parallelism.⁴ Note that since |µXi,φS| = |φXi|/|φS|, the time complexity can be written as ⌈log |φXi| − log |φS|⌉. For the scattering phase, we choose p = |µXk,φS|, and the time complexity is given by |µXk,φS|/p + 1 = 2, due to the multiplication parallelism. Thus the overall time complexity of the two-dimensional belief propagation algorithm is

\[ \lceil \log |\phi_{X_i}| - \log |\phi_S| \rceil + 2, \qquad (5) \]

which is the theoretically optimal time complexity under the assumption of an infinite number of threads. Nevertheless, this value is hard to achieve in practice, since the values of d and p are subject to the concurrency limit of the computing platform. For example, in the above-mentioned BN Pigs, some message passing requires p = 1120, while the GTX460 GPU supports at most 1024 threads per thread block.

Belief propagation is a sequence of messages passed in a certain order [10], for both CPU and GPU [22]. Let Ne(Ci) denote the neighbors of Ci in the junction tree. The time complexity for belief propagation is

\[ \sum_i \sum_{k \in Ne(C_i)} \left( \lceil \log |\phi_{X_i}| - \log |\phi_S| \rceil + 2 \right). \]

Kernel invocation overhead, incurred each time Algorithm 5 is invoked, turns out to be an important performance factor. If we model the invocation overhead of each kernel call as a constant τ, then the time complexity becomes

\[ \sum_i d_i \tau + \sum_i \sum_{k \in Ne(C_i)} \left( \lceil \log |\phi_{X_i}| - \log |\phi_S| \rceil + 2 \right), \]

where d_i is the degree of node C_i. In a tree structure, Σ_i d_i = 2(n − 1). Thus the GPU time complexity is

\[ 2(n-1)\tau + \sum_i \sum_{k \in Ne(C_i)} \left( \lceil \log |\phi_{X_i}| - \log |\phi_S| \rceil + 2 \right). \]

From this equation, we can see that the junction tree topology impacts GPU performance in at least two ways: the total invocation overhead is proportional to the number of nodes in the junction tree, while the separator and clique table sizes determine the degree of parallelism.

The overall speedup of our parallel belief propagation approach is determined by the equation

\[ \text{Speedup} = \frac{\sum_i \sum_{k \in Ne(C_i)} \left( |\phi_{X_i}| + |\phi_{X_k}| \right)}{2(n-1)\tau + \sum_i \sum_{k \in Ne(C_i)} \left( \left\lceil \log \frac{|\phi_{X_i}|}{|\phi_S|} \right\rceil + 2 \right)}. \]

Clearly, the speedup depends on the distribution of the sizes of the separators' and cliques' potential tables. That is the reason we propose the clique merging technique. Using clique merging, we change the number of nodes in the junction tree, and the distribution of the sizes of the separators' and cliques' potential tables as well, adapting the junction tree to the specific parallel computing platform.

From the equations above, we can estimate the overall belief propagation speedup. However, taking into account that the CPU/GPU platform incurs invocation overhead and long memory latency when loading data from the slow device memory to the fast shared memory, the theoretical speedup is hard to achieve in practice. We take an experimental approach to studying how the structure of the junction trees affects the performance of our parallel technique in the CPU/GPU setting in Section 6.

⁴We assume, for simplicity, sum-propagation. The analysis for max-propagation is similar.

6 EXPERIMENTAL RESULTS

In the experiments, we study Bayesian networks compiled into junction trees. We not only want to compare the two-dimensional parallel junction tree algorithm to the sequential algorithm, but also to study how effective the arithmetic parallelism and clique merging methods are. Consequently, we experiment with different combinations of element-wise parallelism (EP), arithmetic parallelism (AP), and clique merging (CM).

6.1 Computing Platform

We use the NVIDIA GeForce GTX460 as the platform for our implementation. This GPU consists of seven multiprocessors, and each multiprocessor consists of 48 cores and 48K of on-chip shared memory per thread block. The peak thread-level parallelism achieves 907 GFlop/s. In addition to the fast shared memory, a much larger but slower off-chip global memory (785 MB), which is shared by all multiprocessors, is provided. The bandwidth between the global and shared memories is about 90 Gbps. In the junction tree computations we use single precision for the GPU, and the thread block size is set to 256.

6.2 Methods and Data

For the purpose of comparison, we use the same set of BNs as used previously [22] (see http://bndg.cs.aau.dk/html/bayesian_networks.html). They are from different domains, with varying structures and state spaces. These differences lead to very different junction trees, see Table 1, resulting in varying opportunities for element-wise and arithmetic parallelism. Thus, we use clique merging to carefully control our two dimensions of parallelism to optimize performance. The Bayesian networks are compiled into junction trees and merged offline, and then junction tree propagation is performed.

6.3 GPU Optimization: Arithmetic Parallelism

Arithmetic parallelism gives us more freedom to match the parallelism in message passing to the concurrency provided by a GPU: when there is not enough potential table element-wise parallelism available, we can increase the degree of arithmetic parallelism. The number of threads assigned to arithmetic parallelism affects the performance significantly. The parameter p in parallel scattering and the parameter 2^d in parallel reduction should be chosen carefully (see Algorithms 1 and 2). Since the GPU can provide only limited concurrency, we need to balance the arithmetic parallelism and the element-wise parallelism for each message passing to get the best performance.

Consider message passing between big cliques, for example according to the B-S-B pattern. Intuitively, the values of the arithmetic parallelism parameters p and d should be set higher than for message passing between smaller cliques. Thus, based on extensive experimentation, we currently employ a simple heuristic parameter selection scheme for the scattering parameter p

\[ p = \begin{cases} 4 & \text{if } |\mu_{X_i,S_{ik}}(n)| \le 100 \\ 128 & \text{if } |\mu_{X_i,S_{ik}}(n)| > 100 \end{cases} \qquad (6) \]

and the reduction parameter d

\[ d = \begin{cases} 2 & \text{if } |\mu_{X_i,S_{ik}}(n)| \le 100 \\ 7 & \text{if } |\mu_{X_i,S_{ik}}(n)| > 100 \end{cases} \qquad (7) \]
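In code, the heuristics (6) and (7) amount to a simple threshold test per message; the function names below are ours, and the constants are those reported above.

// Heuristic kernel-parameter selection (our sketch of Equations (6)-(7)).
static inline int scatterP(int muLen) { return muLen <= 100 ? 4 : 128; }
static inline int reduceD(int muLen)  { return muLen <= 100 ? 2 : 7;   }

// Example use with the reduction kernel sketched in Section 3:
//   int d = reduceD(muLen);
//   reductionKernel<<<sepSize, 1 << d, (1 << d) * sizeof(float)>>>(
//       phiX, mu, muLen, phiSstar);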

Dataset                       Pigs   Munin2  Munin3  Munin4   Mildew    Water   Barley  Diabetes
# of original JT nodes         368      860     904     872       28       20       36       337
# of JT nodes after merge      162      553     653     564       22       18       35       334
Avg. CPT size before merge   1,972    5,653   3,443  16,444  341,651  173,297  512,044    32,443
Avg. CPT size after merge    5,393   10,191   7,374  26,720  447,268  192,870  527,902    33,445
Avg. SPT size before merge     339      713     533   2,099    9,273   26,065   39,318     1,845
Avg. SPT size after merge      757    1,104     865   3,214   11,883   29,129   40,475     1,860
GPU time (sum-prop) [ms]     22.61    86.40   74.99  141.08    41.31    16.33    81.82     68.26
GPU time (max-prop) [ms]      22.8     86.8    72.6   114.9     38.6     12.1     94.3      78.3
CPU time (sum-prop) [ms]        51      210     137     473      355      120      974       397
CPU time (max-prop) [ms]        59      258     119     505      259      133      894       415
Speedup (sum-prop)           2.26x    2.43x   1.82x   3.35x    8.59x    7.35x   11.94x     5.81x
Speedup (max-prop)           2.58x    2.97x   1.64x   4.39x    6.71x   10.99x    9.48x     5.30x

Table 1: Junction tree (JT) statistics and belief propagation (BP) performance for various junction trees, with speedups for our GPU approach (GPU EP + AP + CM) compared to CPU-only in the two bottom rows. The row "CPU time (sum-prop)" gives previous results [22].

Figure 5: GPU execution times with CM (GPU EP + AP + CM) and without CM (GPU EP + AP) for junction trees compiled from sparse BNs representing electrical power systems.


We compare the execution time when using element-wise parallelism alone with the execution time when element-wise parallelism is used in combination with arithmetic parallelism. Results, for both sum-propagation and max-propagation, are shown in Figure 4(a) and Figure 4(b). In all cases, GPU EP + AP + CM outperforms all the other approaches.⁵

Figure 4: Comparison of combinations of junction tree optimization techniques CM, AP, and EP for (a) sum-propagation and (b) max-propagation. Best performance is achieved for GPU EP + AP + CM.

6.4 GPU Optimization: Clique Merging

Clique merging is based on the observation that many junction trees mostly consist of small cliques.

5The GPU EP + CM combination is included for com- pleteness, but as expected it often performs very poorly. The reason for this is that CM, by merging cliques, creates larger mapping tables that EP is not equipped to handle. use of the available computing resources, since a single GeNie/SMILE. message passing will not be able to occupy the GPU. In Table 1, the bottom six rows give the execution Merging neighboring small cliques, found in the S-S-S time comparison for our CPU/GPU hybrid versus a pattern, can help to increase the average size of sep- traditional CPU implementation. The CPU/GPU hy- arators and cliques. Clique merging also reduces the brid uses arithmetic parallelism, element-wise paral- total number of nodes in the junction tree, which in lelism and clique merging. The obtained speedup for turn reduces the invocation overhead. sum-propagation ranges from 1.82x to 11.94x, with an We use two parameters to determine which cliques arithmetic average of 5.44x and a geometric average of should be merged–one is τs, the threshold for separa- 4.42. tors’ potential table size and the other is τ , the thresh- µ The speedup for max-propagation is similar to, but dif- old for the size of the index mapping table. These pa- ferent from sum-propagation in non-trivial ways. The rameters are set manually in this paper, however in performance is an overall effect of many factors such a companion paper [21] this parameter optimization as parallelism, memory latency, kernel invocation over- process is automated by means of . head, etc. Those factors, in turn, are closely correlated In the experiments, we used both arithmetic paral- with the underlying structures of the junction trees. lelism and element-wise parallelism. This experiment The speedup for max-propagation ranges from 1.64x presents how much extra speedup can be obtained by to 10.99x, with an arithmetic average of 5.51x and a using clique merging and arithmetic parallelism. The geometric average of 4.61x. experimental results can be found in Table 1, in the rows showing the GPU execution times for both sum- 6.6 Performance Comparison: Previous GPU propagation and max-propagation. In junction trees Pigs, Munin2, Munin3 and Munin4, a considerable We now compare the GPU EP + AP + CM tech- fraction of cliques (and consequently, separators) are nique introduced in this paper with our previous GPU small, in other words the S-S-S pattern is common. EP approach [22]. From results in Table 1, compared By merging cliques, we can significantly increase the with the GPU EP approach [22], the arithmetic av- average separators’ and cliques’ potential size and thus erage cross platform speedup increases from 3.38x (or provide more parallelism. 338%) to 5.44x (or 544%) for sum-propagation. For max-propagation8 the speedup increases from 3.22x Comparing GPU EP + AP with GPU EP + AP + (or 322%) to 5.51x (or 551%). CM, the speedup for these junction trees ranged from 10% to 36% when clique merging is used. However, in junction trees Mildew, Water, Barley and Diabetes, 7 CONCLUSION AND FUTURE clique merging does not help much since they mainly WORK consist of large cliques to start with. In this paper, we identified small separators as bottle- Another set of experiments were performed with junc- necks for parallel computing in junction trees and de- tion trees that represent an electrical power system veloped a novel two-dimensional parallel approach for ADAPT [16–18]. These junction trees contain many belief propagation over junction trees. 
We enhanced small cliques, due to their underlying BNs being rel- these two dimensions of parallelism by careful clique atively sparsely connected.6 The experimental results merging in order to make better use of the parallel are shown in Figure 5. Using clique merging, the GPU computing resources of a given platform. execution times are shortened by 30%-50% for these BNs compared to not using clique merging. In experiments with a CUDA implementation on an NVIDIA GeForce GTX460 GPU, we explored how the 6.5 Performance Comparison: CPU performance of our approach varies with different junc- tion trees from applications and how clique merging As a baseline, we implemented a sequential program can improve the performance for junction trees that on an Intel Core 2 Quad CPU with 8MB cache and contains many small cliques. For sum-propagation, the a 2.5GHz clock. The execution time of the program average speedup is 5.44x and the maximum speedup is comparable to that of GeNie/SMILE, a widely used is 11.94x. The average speedup for max-propagation C++ software package for BN inference.7 We do not is 5.51x while the maximum speedup is 10.99x. directly use GeNie/SMILE as the baseline here, be- In the future, we would like to see research on parame- cause we do not know the implementation details of ter optimization for both clique merging and message

6http://works.bepress.com/ole_mengshoel/ 8We implemented max-propagation based on the ap- 7http://genie.sis.pitt.edu/ proach developed previously [22]. passing. It would be useful to automatically change Bayesian network learning using heterogeneous multi- the merging parameters for different junction trees core computers. In Proc. of the 24th ACM Interna- based on the size distribution of the cliques and sep- tional Conference on Supercomputing, pages 95–104, 2010. arators. In addition, we also want to automatically change the kernel running parameters for each single [12] Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, message passing according to the size of a message. In C. Guestrin, and J. Hellerstein. GraphLab: A new fact, we have already made progress along these lines, framework for parallel machine learning. In Proc. of the 26th Annual Conference on Uncertainty in Artifi- taking a machine learning approach [21]. cial Intelligence (UAI-10), pages 340–349, 2010.

Acknowledgments [13] O. J. Mengshoel. Understanding the scalability of Bayesian network inference using clique tree growth This material is based, in part, upon work supported curves. Artificial Intelligence, 174:984–1006, 2010. by NSF awards CCF0937044 and ECCS0931978. [14] V. K. Namasivayam and V. K. Prasanna. Scalable parallel implementation of exact inference in Bayesian networks. In Proc. of the 12th International Confer- References ence on Parallel and Distributed System, pages 143– 150, 2006. [1] R. Bekkerman, M. Bilenko, and J. Langford, edi- tors. Scaling up Machine Learning: Parallel and [15] D. Nilsson. An efficient algorithm for finding the m Distributed Approaches. Cambridge University Press, most probable configurations in probablistic expert 2011. system. Statistics and Computing, pages 159–173, 1998. [2] A. P. Dawid. Applications of a general propagation algorithm for probabilistic expert systems. Statistics [16] B. W. Ricks and O. J. Mengshoel. The diagnostic chal- and Computing, 2:25–36, 1992. lenge competition: Probabilistic techniques for fault diagnosis in electrical power systems. In Proc. of the [3] J. Gonzalez, Y. Low, C. Guestrin, and D. O’Hallaron. 20th International Workshop on Principles of Diag- Distributed parallel inference on large factor graphs. nosis (DX-09), pages 415–422, Stockholm, Sweden, In Proc. of the 25th Conference in Uncertainty in Ar- 2009. tificial Intelligence (UAI-09), 2009. [17] B. W. Ricks and O. J. Mengshoel. Methods for prob- [4] C. Huang and A. Darwiche. Inference in belief net- abilistic fault diagnosis: An electrical power system works: A procedural guide. International Journal of case study. In Proc. of Annual Conference of the PHM Approximate Reasoning, (3):225–263, 1994. Society, 2009 (PHM-09), San Diego, CA, 2009.

[5] F. V. Jensen, S. L. Lauritzen, and K. G. Olesen. [18] B. W. Ricks and O. J. Mengshoel. Diagnosing in- Bayesian updating in causal probabilistic network by termittent and persistent faults using static bayesian local computations. Computational Statistics, pages networks. In Proc. of the 21st International Work- 269–282, 1990. shop on Principles of Diagnosis (DX-10), Portland, OR, 2010. [6] H. Jeon, Y. Xia, and V. K. Prasanna. Parallel exact inference on a CPU-GPGPU heterogenous system. In [19] M. Silberstein, A. Schuster, D. Geiger, A. Patney, and Proc. of the 39th International Conference on Parallel J. D. Owens. Efficient computation of sum-products Processing, pages 61–70, 2010. on GPUs through software-managed cache. In Proc. of the 22nd ACM International Conference on Super- [7] K. Kask, R. Dechter, and A. Gelfand. BEEM: bucket computing, pages 309–318, 2008. elimination with external memory. In Proc. of the 26th Annual Conference on Uncertainty in Artificial [20] Y. Xia and V. K. Prasanna. Node level primitives for Intelligence (UAI-10), pages 268–276, 2010. parallel exact inference. In Proc. of the 19th Inter- national Symposium on Computer Architechture and [8] D. Koller and N. Friedman, editors. Probablistic High Performance Computing, pages 221–228, 2007. graphic model: principle and techniques. The MIT Press, 2009. [21] L. Zheng and O. J. Mengshoel. Optimizing parallel belief propagation in junction trees using regression. [9] A. V. Kozlov and J. P. Singh. A parallel Lauritzen- In Proc. of 19th ACM SIGKDD Conference on Knowl- Spiegelhalter algorithm for probabilistic inference. In edge Discovery and Data Mining (KDD-13), Chicago, Proc. of the 1994 ACM/IEEE conference on Super- IL, August 2013. computing, pages 320–329, 1994. [22] L. Zheng, O. J. Mengshoel, and J. Chong. Belief prop- [10] S. L. Lauritzen and D. J. Spiegelhalter. Local compu- agation by message passing in junction trees: Com- tations with probabilities on graphical structures and puting each message faster using GPU paralleliza- their application to expert systems. Journal of the tion. In Proc. of the 27th Conference in Uncertainty Royal Statistical Society, 50(2):157–224, 1988. in Artificial Intelligence (UAI-11), pages 822–830, Barcelona, Spain, 2011. [11] M. D. Linderman, R. Bruggner, V. Athalye, T. H. Meng, N. B. Asadi, and G. P. Nolan. High-throughput