
On the optimality of solutions of the max-product belief propagation in arbitrary graphs.

Yair Weiss and William T. Freeman

Abstract

Graphical models, such as Bayesian networks and Markov random fields, represent statistical dependencies of variables by a graph. The max-product "belief propagation" algorithm is a local message-passing algorithm on this graph that is known to converge to a unique fixed point when the graph is a tree. Furthermore, when the graph is a tree, the assignment based on the fixed point yields the most probable a posteriori (MAP) values of the unobserved variables given the observed ones.

Recently, good empirical performance has been obtained by running the max-product algorithm (or the equivalent min-sum algorithm) on graphs with loops, for applications including the decoding of "turbo" codes. Except for two simple graphs (cycle codes and single loop graphs) there has been little theoretical understanding of the max-product algorithm on graphs with loops. Here we prove a result on the fixed points of max-product on a graph with arbitrary topology and with arbitrary probability distributions (discrete- or continuous-valued nodes). We show that the assignment based on a fixed point is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. The region includes all assignments that differ from the max-product assignment in any subset of nodes that form no more than a single loop in the graph. In some graphs this neighborhood is exponentially large. We illustrate the analysis with examples.

keywords: Belief propagation, max-product, min-sum, Bayesian networks, Markov random fields, MAP estimate.

Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and image understanding. Typically, a probability distribution is assumed over a set of variables and the task is to infer the values of the unobserved variables given the observed ones. The assumed probability distribution is described using a graphical model [14]; the qualitative aspects of the distribution are specified by a graph structure. The graph may either be directed, as in a Bayesian network [18], [12], or undirected, as in a Markov Random Field [18], [10].

Here we focus on the problem of finding an assignment for the unobserved variables that is most probable given the observed ones. In general, this problem is NP-hard [19], but if the graph is singly connected (i.e. there is only one path between any two given nodes) then there exist efficient local message-passing schemes to perform this task. Pearl [18] derived such a scheme for singly connected Bayesian networks. The algorithm, which he called "belief revision", is identical to his algorithm for finding posterior marginals over nodes except that the summation operator is replaced with a maximization. Aji et al. [2] have shown that both of Pearl's algorithms can be seen as special cases of generalized distributive laws over particular semirings. In particular, Pearl's algorithm for finding maximum a posteriori (MAP) assignments can be seen as a generalized distributive law over the max-product semiring. We will henceforth refer to it as the "max-product" algorithm.

Pearl showed that for singly connected networks, the max-product algorithm is guaranteed to converge and that the assignment based on the messages at convergence is guaranteed to give the optimal assignment, the values corresponding to the MAP solution.

Several groups have recently reported excellent experimental results by running the max-product algorithm on graphs with loops [23], [6], [3], [20], [11]. Benedetto et al. used the max-product algorithm to decode "turbo" codes and obtained excellent results that were slightly inferior to the original turbo decoding algorithm (which is equivalent to the sum-product algorithm). Weiss [20] compared the performance of sum-product and max-product on a "toy" turbo code problem while distinguishing between converged and unconverged cases. He found that if one considers only the convergent cases, the performance of max-product decoding is significantly better than sum-product decoding. However, the max-product algorithm converges less often, so its overall performance (including both convergent and nonconvergent cases) is inferior.

Progress in the analysis of the max-product algorithm has been made for two special topologies: single loop graphs, and "cycle codes". For graphs with a single loop [23], [20], [21], [5], [2], it can be shown that the algorithm converges to a stable fixed point or a periodic oscillation. If it converges to a stable fixed point, then the assignment based on the fixed-point messages is the optimal assignment. For graphs that correspond to cycle codes (low density parity check codes in which each bit is checked by exactly two check nodes), Wiberg [23] gave sufficient conditions for max-product to converge to the transmitted codeword and Horn [11] gave sufficient conditions for convergence to the MAP assignment.

In this paper we analyze the max-product algorithm in graphs of arbitrary topology. We show that at a fixed point of the algorithm, the assignment is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. These results motivate using this powerful algorithm in a broader class of networks.

Y. Weiss is with the Computer Science Division, 485 Soda Hall, UC Berkeley, Berkeley, CA 94720-1776. E-mail: [email protected]. Support by MURI-ARO-DAAH04-96-1-0341, MURI N00014-00-1-0637 and NSF IIS-9988642 is gratefully acknowledged.

W. T. Freeman is with MERL, Mitsubishi Electric Research Labs., 201 Broadway, Cambridge, MA 02139. E-mail: [email protected].


Fig. 1. Any Bayesian network can be converted into an undirected graph with pairwise cliques by adding cluster nodes for all parents that share a common child. (a) A Bayesian network. (b) The corresponding undirected graph with pairwise cliques. A cluster node for (x_2, x_3) has been added. The potentials can be set so that the joint probability in the undirected graph is identical to that in the Bayesian network. In this case the update rules presented in this paper reduce to Pearl's propagation rules in the original Bayesian network [21].

I. The max-product algorithm in pairwise Markov Random Fields

Pearl's original algorithm was described for directed graphs, but in this paper we focus on undirected graphs. Every directed graphical model can be transformed into an undirected graphical model before doing inference (see figure 1). An undirected graphical model (or a Markov random field) is a graph in which the nodes represent variables and arcs represent compatibility relations between them. Assuming all probabilities are nonzero, the Hammersley-Clifford theorem (e.g. [18]) guarantees that the probability distribution will factorize into a product of functions of the maximal cliques of the graph.

Denoting by x the values of all unobserved variables in the graph, the factorization has the form:

P(x) = \prod_c \Psi_c(x_c)    (1)

where x_c is a subset of x that forms a clique in the graph and \Psi_c is the potential function for the clique.

We will assume, without loss of generality, that each x_i node has a corresponding y_i node that is connected only to x_i. Thus:

P(x, y) = \prod_c \Psi_c(x_c) \prod_i \Psi_{ii}(x_i, y_i)    (2)

The restriction that all the y_i variables are observed and none of the x_i variables are is just to make the notation simple: \Psi_{ii}(x_i, y_i) may be independent of y_i (equivalent to y_i being unobserved), or \Psi_{ii}(x_i, y_i) may be \delta(x_i - x_o) (equivalent to x_i being observed, with value x_o).

In describing and analyzing belief propagation, we assume the graphical model has been preprocessed so that all the cliques consist of pairs of units. Any graphical model can be converted into this form before doing inference through a suitable clustering of nodes into large nodes [21]. Figure 1 shows an example: a Bayesian network is converted into an MRF in which all the cliques are pairs of units.

Equation 2 becomes:

P(x, y) = \prod_{(i,j)} \Psi_{ij}(x_i, x_j) \prod_i \Psi_{ii}(x_i, y_i)    (3)

where the first product is over connected pairs of nodes.
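To make the pairwise notation concrete, here is a minimal Python sketch of the factorization in equation 3 on a toy two-node chain; the potential tables and all numbers in it are our own illustrative choices, not values from the paper.

```python
import numpy as np

# A toy instance of equation 3: a two-node chain x1 - x2, each binary,
# each with one observed neighbor. All tables below are invented values.
psi_12 = np.array([[0.9, 0.1],
                   [0.1, 0.9]])          # Psi(x1, x2): favors agreement
psi_obs = {1: np.array([0.8, 0.2]),      # Psi(x1, y1) for the observed y1
           2: np.array([0.3, 0.7])}      # Psi(x2, y2) for the observed y2

def joint(x1, x2):
    """Unnormalized P(x, y) = Psi(x1, x2) Psi(x1, y1) Psi(x2, y2)."""
    return psi_12[x1, x2] * psi_obs[1][x1] * psi_obs[2][x2]

# On a graph this small the MAP assignment can be found by enumeration:
print(max(((x1, x2) for x1 in (0, 1) for x2 in (0, 1)), key=lambda a: joint(*a)))
```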

The important property of MRFs that we will use throughout this paper is the Markov blanket property. The probability of a subset of nodes S given all other nodes in the graph S^C depends only on the values of the nodes that immediately neighbor S. Furthermore, the probability of S given all other nodes is proportional to the product of all clique potentials within S and all clique potentials between S and its immediate neighbors:

P(x_S | x_{S^C}) = P(x_S | N(x_S))    (4)

P(x_S | N(x_S)) \propto \prod_{(x_i, x_j) \in S} \Psi_{ij}(x_i, x_j) \prod_{x_i \in S, x_j \in N(S)} \Psi_{ij}(x_i, x_j)    (5)

The advantage of preprocessing the graph into one with pairwise cliques is that the description and the analysis of belief propagation become simpler. For completeness, we review the belief propagation scheme used in [21]. As we discuss in the appendix, this belief propagation scheme is equivalent to Pearl's belief propagation algorithm in directed graphs, the Generalized Distributive Law algorithm of [1] and the factor graph propagation algorithm of [13]. These three algorithms correspond to the algorithm presented here with a particular way of preprocessing the graph in order to obtain pairwise potentials.

At every iteration, each node sends a (different) message to each of its neighbors and receives a message from each neighbor. Let x_i and x_j be two neighboring nodes in the graph. We denote by m_{ij}(x_j) the message that node x_i sends to node x_j, by m_{ii}(x_i) the message that y_i sends to x_i, and by b(x_i) the belief at node x_i.

The max-product update rules are:

m_{ij}(x_j) \leftarrow \alpha \max_{x_i} \Psi_{ij}(x_i, x_j) m_{ii}(x_i) \prod_{x_k \in N(x_i) \setminus x_j} m_{ki}(x_i)    (6)

b(x_i) \leftarrow \alpha m_{ii}(x_i) \prod_{x_k \in N(x_i)} m_{ki}(x_i)    (7)

where \alpha denotes a normalization constant and N(x_i) \setminus x_j means all nodes neighboring x_i, except x_j.

The procedure is initialized with all message vectors set to constant functions. Observed nodes do not receive messages and they always transmit the same vector: if y_i is observed to have value y_i^* then m_{ii}(x_i) = \Psi_{ii}(x_i, y_i^*). The normalization of m_{ij} in equation 6 is not necessary; whether or not the messages are normalized, the beliefs b_i will be identical. However, normalizing the messages avoids numerical underflow and adds to the stability of the algorithm. We assume throughout this paper that all nodes simultaneously update their messages in parallel.
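As a concrete rendering of equations 6 and 7, the following Python sketch implements the parallel updates; it is our own illustration (not the authors' code), and the `psi_pair`/`psi_obs` dictionary layout is an assumed encoding of the pairwise and observation potentials.

```python
import numpy as np

def max_product(psi_pair, psi_obs, n_states, n_iter=100):
    """Parallel max-product updates of equations 6-7 (a sketch).

    psi_pair: dict mapping each directed edge (i, j) to an
              (n_states x n_states) array with entries Psi_ij[x_i, x_j];
              both (i, j) and (j, i) are present, with
              psi_pair[(j, i)] equal to psi_pair[(i, j)].T.
    psi_obs:  dict mapping node i to the vector m_ii(x_i) = Psi(x_i, y_i)
              sent by the observed node y_i.
    """
    nodes = list(psi_obs)
    nbrs = {i: [j for (a, j) in psi_pair if a == i] for i in nodes}
    # initialize all messages to constant functions
    m = {edge: np.ones(n_states) / n_states for edge in psi_pair}
    for _ in range(n_iter):
        new_m = {}
        for (i, j) in psi_pair:
            # m_ii(x_i) times messages from all neighbors of i except j
            prod = psi_obs[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= m[(k, i)]
            # eq. 6: m_ij(x_j) = alpha * max_{x_i} Psi_ij(x_i, x_j) * prod(x_i)
            msg = np.max(psi_pair[(i, j)] * prod[:, None], axis=0)
            new_m[(i, j)] = msg / msg.sum()   # normalize to avoid underflow
        m = new_m                             # simultaneous, parallel update
    # eq. 7: b(x_i) = alpha * m_ii(x_i) * prod_k m_ki(x_i)
    beliefs = {}
    for i in nodes:
        b = psi_obs[i].copy()
        for k in nbrs[i]:
            b *= m[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs, {i: int(np.argmax(beliefs[i])) for i in nodes}
```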

For singly connected graphs it is easy to show that:
• The algorithm converges to a unique fixed point, regardless of initial conditions, in a finite number of iterations.
• At convergence, the belief for any value x_i of a node i is the maximum of the posterior, conditioned on that node having the value x_i: b(x_i) = \max_x P(x | y, x_i).
• Define the max-product assignment, x^*, by x_i^* = \arg\max_{x_i} b(x_i) (assuming a unique maximizing value exists). Then x^* is the MAP assignment.

The max-product assignment assumes there are no "ties", that is, a unique maximizing x_i exists for all b(x_i). Ties can arise when the MAP assignment is not unique, e.g. when there are two assignments that have identical posterior and both maximize the posterior. For singly connected graphs, the converse is also true: if there are no ties in b(x_i) then the MAP assignment is unique. In what follows, we assume a unique MAP assignment.

In particular applications, it might be easier to work in the log domain, so that the product operations in equations 6-7 are replaced by sum operations. Thus the max-product algorithm is sometimes referred to as the max-sum algorithm or the min-sum algorithm [23], [5]. If the graph is a chain, max-product is a two-way version of the Viterbi algorithm in Hidden Markov Models and is closely related to concurrent dynamic programming [4]. Despite this connection to well studied algorithms, there has been very little analytical success in characterizing the solutions of the max-product algorithm on arbitrary graphs with loops.

II. What are the fixed points of the max-product algorithm?

Each iteration of the max-product algorithm can be thought of as an operator F that inputs a list of messages m^t and outputs a list of messages m^{t+1} = F(m^t). Thus running belief propagation can be thought of as an iterative way of finding a solution to the fixed point equation F(m) = m with an initial guess m^0 in which all messages are constant functions.

Note that this is not the only way of finding fixed points. Murphy et al. [17] describe an alternative method for finding fixed points of F. They suggested iterating:

m^{t+1} = \lambda F(m^t) + (1 - \lambda) m^{t-1}    (8)
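A minimal sketch of this damped iteration, treating F as a black-box map on messages stacked into a single NumPy vector (the function name, the bootstrap step, and the default lam = 0.5 are our own choices):

```python
import numpy as np

def damped_iteration(F, m0, lam=0.5, tol=1e-5, max_iter=100):
    """Iterate m[t+1] = lam*F(m[t]) + (1-lam)*m[t-1], as in equation 8.

    F is any map from a stacked message vector to a stacked message
    vector. Returns the final messages and a convergence flag.
    """
    m_prev, m_curr = m0, F(m0)  # bootstrap: one plain update supplies m[1]
    for _ in range(max_iter):
        m_next = lam * F(m_curr) + (1.0 - lam) * m_prev
        if np.max(np.abs(m_next - m_curr)) < tol:
            return m_next, True   # m satisfies m = F(m) up to tolerance
        m_prev, m_curr = m_curr, m_next
    return m_curr, False
```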

Fig. 2. (a) A 25x25 grid of points. (b)-(c) Examples of subsets of nodes that form no more than a single loop. In this paper we prove that changing such a subset of nodes from the max-product assignment will always decrease the posterior probability.

Here again, if the iterations of equation 8 converge to m^*, then m^* satisfies m^* = F(m^*).

The equation m^* = F(m^*) is a highly nonlinear equation and it is not obvious how many solutions exist or how to characterize them. Horn [11] showed that in single-loop graphs, F can be considered as matrix multiplication over the max-product semiring, and fixed points correspond to eigenvectors of that matrix. Thus even in single loop graphs, one can construct examples with any number of fixed points.

The main result of this paper is a characterization of how well the max-product assignment approximates the MAP assignment. We show that the assignment x^* must be a neighborhood maximum of P(x|y): that is, P(x^*|y) > P(x|y) for all x in a particular large region around x^*. This condition is weaker than a global maximum but stronger than a local maximum.



To be more precise we de ne the Single Loops and Trees SLT neighborhood of an assignment x in a graphical



mo del G to include all assignments x that can b e obtained from x by:

 Cho osing an arbitrary subset S of no des in G that consists of disconnected combinations of trees and single lo ops.

 Assigning arbitrary values to x | the chosen subset of no des. The other no des have the same assignmentasin

S



x .



Claim 1: For an arbitrary graphical model with arbitrary potentials, if m^* is a fixed point of the max-product algorithm and x^* is the assignment based on m^*, then P(x^*|y) > P(x|y) for all x \neq x^* in the SLT neighborhood of x^*.

Figure 2 illustrates example configurations within the SLT neighborhood of the max-product assignment. It shows examples of subsets of nodes that could be changed to arbitrary values; the posterior probability of the resulting assignment is guaranteed to be worse than that of the max-product assignment.

To build intuition, we first describe the proof for a specific case, the diamond graph of figure 3. The general proof is given in section II-B.

A. Specific Example

We start by giving an overview of the proof for the diamond graph shown in figure 3a. The proof is based on the unwrapped tree, the graphical model that loopy belief propagation is solving exactly when applying the belief propagation rules in a loopy network [9], [23], [21], [22]. In error-correcting codes, the unwrapped tree is referred to as the "computation tree": it is based on the idea that the computation of a message sent by a node at time t depends on messages it received from its neighbors at time t-1, and those messages depend on the messages the neighbors received at time t-2, etc.

Figure 3 shows an unwrapped tree around node x_1 for the diamond shaped graph on the left. Each node has a (shaded) observed node attached to it that is not shown for simplicity.

To simplify notation, we assume that x^*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x^* = 0. The periodic assignment lemma from [22] guarantees that we can modify \tilde\Psi_{ii}(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

P(\tilde{x} = 0 | \tilde{y}) = \max_{\tilde{x}} P(\tilde{x} | \tilde{y})    (9)

The \tilde\Psi_{ii}(x_i, y_i) are modified to include the messages from the nodes to be added at the next stage of the unwrapping.


Fig. 3. (a) A Markov Random Field with multiple loops. (b) The unwrapped graph corresponding to this structure. The unwrapped graphs are constructed by replicating the potentials \Psi(x_i, x_j) and observations y_i while preserving the local connectivity of the loopy graph. They are constructed so that the messages received by node x_1 after t iterations in the loopy graph are equivalent to those that would be received by x_1 in the unwrapped graph. An observed node, y_i, not shown, is connected to each depicted node.

We now show that the global optimality of \tilde{x} = 0 and the method of construction of the unwrapped tree guarantee that P(x = 0 | y) > P(x | y) for all x in the SLT neighborhood of 0.

Referring to figure 3a, suppose that P(x = 10000 | y) > P(x = 00000 | y). By the Markov property of the diamond figure, this means that:

P(x_1 = 1 | x_{2..4} = 0) > P(x_1 = 0 | x_{2..4} = 0)    (10)

Note that node \tilde{x}_1 has exactly the same neighbors in the unwrapped graph as x_1 has in the loopy graph. Furthermore, by the method of construction, the potentials between x_1 and each of its neighbors are the same as the potentials between \tilde{x}_1 and its neighbors. Thus equation 10 implies that:

P(\tilde{x}_1 = 1 | \tilde{x}_{2..4} = 0) > P(\tilde{x}_1 = 0 | \tilde{x}_{2..4} = 0)    (11)

in contradiction to equation 9. Thus no change of a single x_i can improve the posterior probability.

What about changing two x_i at a time? If we change a pair that is not connected in the graph, say x_1 and x_5, then by the Markov property this is equivalent to changing one at a time. Thus suppose P(x = 10001 | y) > P(x = 00000 | y); this again implies that P(x_1 = 1 | x_{2..4} = 0) > P(x_1 = 0 | x_{2..4} = 0), and we have shown earlier that this leads to a contradiction. Thus no change of assignment in two unconnected nodes can improve the posterior probability.

If the two are connected, say x_1 and x_2, then the same argument holds with respect to the pair of nodes (x_1, x_2). Note that the subgraph \tilde{x}_{12} is isomorphic to the subgraph x_{12} and the two subgraphs have the same neighbors. Hence:

P(x_{12} = 1 | x_{3..5} = 0) > P(x_{12} = 0 | x_{3..5} = 0)    (12)

implies that:

P(\tilde{x}_{12} = 1 | \tilde{x}_{3..5} = 0) > P(\tilde{x}_{12} = 0 | \tilde{x}_{3..5} = 0)    (13)

and this is again in contradiction to equation 9. Thus no change of assignment in two connected nodes can improve the posterior probability.

Similar arguments show that no change of assignment in any subtree of the graph can improve the posterior probability (e.g. changing the values of x_{1..4}, or changing the values of x_{12} and x_{45}).

These arguments no longer hold, however, when we change a subset of nodes that forms a loopy subgraph of G. For example, the subgraph x_1, x_2, x_3, x_5 is not isomorphic to the subgraph \tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_5. Indeed, since the unwrapped tree is a tree, it cannot have a loopy subgraph. Hence we cannot equate the probabilities of the two subgraphs given their neighbors.

If the subset forms a single loop, however, then there exists an arbitrarily long chain in the unwrapped tree that corresponds to the unwrapping of that loop. For example, note the chain \tilde{x}_1, \tilde{x}_2, \tilde{x}_5, \tilde{x}_3', ... in figure 3b. Using a similar argument to that used in proving the optimality of max-product in a single loop graph [23], [20], [2], [5], we can show that if we can improve the posterior in the loopy graph by changing the values of x_1, x_2, x_3, x_5, then we can also improve the posterior in the unwrapped graph by changing the values of the arbitrarily long chain. This again leads to a contradiction.

B. Proof of Claim 1:

We denote by G the original graph and by \tilde{G} the unwrapped graph. We use x_i for nodes in the original graph and \tilde{x}_i for nodes in the unwrapped graph. We define a mapping C from nodes in \tilde{G} to nodes in G. This mapping says, for each node in \tilde{G}, what the corresponding node in G is: C(\tilde{x}_1) = x_1.

The unwrapped tree is therefore a graph \tilde{G} and a correspondence map. We now give the method of constructing both; a code sketch of the construction appears after equation 14. Pick an arbitrary node in G, say x_1. Set C(\tilde{x}_1) = x_1. Iterate t times:
• Find all leaves of \tilde{G} (start with the root).
• For each leaf \tilde{x}_i find all k nodes in G that neighbor C(\tilde{x}_i).
• Add k - 1 nodes as children to \tilde{x}_i, corresponding to all neighbors of C(\tilde{x}_i) except C(\tilde{x}_j), where \tilde{x}_j is the parent of \tilde{x}_i.

The potential matrices and observations for each node in the unwrapped network are copied from the corresponding nodes in the loopy graph. That is, if x_j = C(\tilde{x}_i) then \tilde{y}_i = y_j, and:

\tilde\Psi(\tilde{x}_i, \tilde{x}_j) = \Psi(C(\tilde{x}_i), C(\tilde{x}_j))    (14)
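The construction just described can be written compactly; the following Python sketch (our own, with hypothetical names) builds the unwrapped tree and the correspondence map C, after which potentials and observations would be copied via equation 14. The example adjacency is our reading of the diamond graph of figure 3a.

```python
def unwrap(adj, root, t):
    """Build the unwrapped (computation) tree of depth t around `root`.

    adj maps each node of the loopy graph G to its list of neighbors.
    Returns the tree edges and the correspondence map C from tree nodes
    (integers, root = 0) back to nodes of G; potentials and observations
    are then copied through C as in equation 14.
    """
    C = {0: root}              # C(x~_0) = root
    parent_node = {0: None}    # the graph node the parent corresponds to
    edges, frontier, next_id = [], [0], 1
    for _ in range(t):
        new_frontier = []
        for leaf in frontier:
            for nbr in adj[C[leaf]]:
                if nbr == parent_node[leaf]:
                    continue   # k-1 children: skip the parent's graph node
                C[next_id] = nbr
                parent_node[next_id] = C[leaf]
                edges.append((leaf, next_id))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return edges, C

# The diamond graph of figure 3a, as we read it: x1 and x5 each join x2, x3, x4.
adj = {1: [2, 3, 4], 2: [1, 5], 3: [1, 5], 4: [1, 5], 5: [2, 3, 4]}
edges, C = unwrap(adj, root=1, t=2)
```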

Note that the unwrapped tree around x_i after t_1 iterations is a subtree of the unwrapped tree around x_j after t_2 > t_1 iterations. If we let the number of iterations t go to infinity, then the unwrapped tree \tilde{G} becomes a well-studied object in topology: the universal covering of G [16]. Roughly speaking, it is a topology that preserves the local topology of the graph G but is singly connected. It is precisely this fact, that max-product gives the global optimum on a graph that has the same local topology as G, that makes sure that x^* is a neighborhood maximum of P(x|y).

We now state some properties of \tilde{G}.

1. Equal neighbors property: Every non-leaf node in \tilde{G} has the same number of neighbors as the corresponding node in G, and the neighbors are in one-to-one correspondence. If C(\tilde{x}_i) = x_i, then for each \tilde{x}_j \in N(\tilde{x}_i), C(\tilde{x}_j) \in N(x_i), and for each x_j \in N(x_i) there exists \tilde{x}_k \in N(\tilde{x}_i) such that C(\tilde{x}_k) = x_j. This follows directly from the method of constructing \tilde{G}.

2. Equal conditional probability property: The probability of a nonleaf node \tilde{x}_j given its neighbors in \tilde{G} is equal to the probability of C(\tilde{x}_j) given its neighbors. Formally, if C(\tilde{x}_j) = x_i then:

P(\tilde{x}_j | N(\tilde{x}_j) = z, \tilde{y}) = P(x_i | N(x_i) = z, y)    (15)

This follows from the Markov blanket property of MRFs (eq. 5) and the equal neighbors property.

3. Isomorphic subtree property: For any subtree T \subseteq G and sufficiently large unwrapping count t, there exists an isomorphic subtree \tilde{T} \subseteq \tilde{G}. The nodes of the subtrees are in one-to-one correspondence: for each x_i \in T there exists an \tilde{x}_i \in \tilde{T} such that C(\tilde{x}_i) = x_i, and for each \tilde{x}_j \in \tilde{T}, C(\tilde{x}_j) \in T. To prove this we pick the root node of T, x_i, as the initial node around which to expand the unwrapped tree. By the method of construction, the unwrapped tree after a number of iterations equal to the depth of T will be isomorphic to T and in one-to-one correspondence. This gives us \tilde{T}. For any other choice of initial node for \tilde{G}, the unwrapped tree starting with x_i is a subtree of \tilde{G}.

4. Infinite chain property: For any loop L \subseteq G there exists an arbitrarily long chain \tilde{L} \subseteq \tilde{G} that is the unwrapping of L. Furthermore, if we denote by n the length of the chain divided by the length of the loop, and by \tilde{x}_1 and \tilde{x}_N the two endpoints of the chain, then:

P(\tilde{x}_{\tilde{L}} | N(\tilde{x}_{\tilde{L}})) = \alpha P^n(x_L | N(x_L))    (16)

with the "boundary potentials":

\alpha = \prod_{\tilde{x}_i \in N(\tilde{x}_1) \setminus \tilde{x}_2} \Psi(\tilde{x}_i, \tilde{x}_1) \prod_{\tilde{x}_i \in N(\tilde{x}_N) \setminus \tilde{x}_{N-1}} \Psi(\tilde{x}_i, \tilde{x}_N)    (17)

Note that \alpha is independent of n. x_L and \tilde{x}_{\tilde{L}} refer to all node variables in the sets L and \tilde{L}, respectively.

The existence of the arbitrarily long chain follows from the equal neighbors property, while equation 16 follows from the Markov blanket property for MRFs (equation 5).

Another property of the unwrapped tree that we will need was proven in [22]:

5. Periodic assignment lemma: Let m^* be a fixed point of the max-product algorithm and x^* the max-product assignment in G. Let \tilde{G} be the unwrapped tree. Suppose we modify the observation potentials \tilde\Psi_{ii} at the leaf nodes to include the messages from the nodes to be added at the next stage of unwrapping. Then the MAP assignment \tilde{x}^* in \tilde{G} is a replication of x^*: if x_i = C(\tilde{x}_j) then \tilde{x}_j^* = x_i^*.

Using these properties, we can prove the main claim. To simplify notation, we again assume that x^*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x^* = 0. The periodic assignment property guarantees that we can modify \tilde\Psi_{ii}(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

P(\tilde{x} = 0 | \tilde{y}) = \max_{\tilde{x}} P(\tilde{x} | \tilde{y})    (18)

Now, assume that we can choose a subtree T \subseteq G, change the assignment of its nodes x_T to another value, and increase the posterior. Again, to simplify notation, assume that the maximizing value is x_T = 1. By the Markov property, this means that:

P(x_T = 1 | N(x_T) = 0, y) > P(x_T = 0 | N(x_T) = 0, y)    (19)

Now, by the isomorphic subtree property we know that there exists \tilde{T} \subseteq \tilde{G} that is isomorphic to T. We also know that \tilde{T} has the same conditional probability given its neighbors as T does. Thus equation 19 implies that:

P(\tilde{x}_{\tilde{T}} = 1 | N(\tilde{x}_{\tilde{T}}) = 0, \tilde{y}) > P(\tilde{x}_{\tilde{T}} = 0 | N(\tilde{x}_{\tilde{T}}) = 0, \tilde{y})    (20)

in contradiction to equation 18. Hence changing the values of a subtree from the converged values of the max-product algorithm cannot increase the posterior probability.

Now, assume we change the values of a single loop L \subseteq G and increase the posterior probability. This means that:

P(x_L = 1 | N(x_L) = 0, y) > P(x_L = 0 | N(x_L) = 0, y)    (21)

By the infinite chain property, we know that for arbitrarily large n:

P(\tilde{x}_{\tilde{L}} = 1 | N(\tilde{x}_{\tilde{L}}) = 0) = \alpha P^n(x_L = 1 | N(x_L) = 0)    (22)

Since \alpha is independent of n, the n-th power of the ratio in equation 21 eventually dominates the boundary terms, so equation 21 implies that:

P(\tilde{x}_{\tilde{L}} = 1 | N(\tilde{x}_{\tilde{L}}) = 0) > P(\tilde{x}_{\tilde{L}} = 0 | N(\tilde{x}_{\tilde{L}}) = 0)    (23)

in contradiction to the optimality of \tilde{x}^* = 0. Hence changing the values of a single loop cannot improve the posterior probability over x^*.

Now we take two disconnected components C_1, C_2 \subseteq G and assume that changing the values of x_{C_1}, x_{C_2} improves the posterior probability. Again, by the Markov property:

P(x_{C_1} = 1, x_{C_2} = 1 | N(x_{C_1}, x_{C_2}) = 0, y) > P(x_{C_1} = 0, x_{C_2} = 0 | N(x_{C_1}, x_{C_2}) = 0, y)    (24)

but since C_1 and C_2 are not connected this implies:

P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y) P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (25)

which implies that either:

P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y)    (26)

or:

P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (27)

Thus if C_1 and C_2 are each either a tree or a single loop, this leads to a contradiction. Hence we cannot simultaneously change the values of two subtrees, or of a subtree and a loop, or of two disconnected loops, and increase the posterior probability. Similarly, we can show that changing the values of any finite number of disconnected trees or single loops will not increase the posterior. This proves claim 1.

III. Examples

Claim 1 holds for arbitrary topologies and arbitrary potentials (both discrete and continuous nodes). We illustrate the implications of claim 1 for specific networks.

A. Gaussian graphical models

A Gaussian graphical model is one in which the joint distribution over x is Gaussian. Weiss and Freeman [22] have analyzed belief propagation on such graphs. One of the results given there can also be proved using our claim 1.

Corollary 1: For a Gaussian graphical model of arbitrary topology, if belief propagation converges, then the posterior marginal means calculated using belief propagation are exact.

Proof: For Gaussians, max-product and sum-product are identical. The posterior means calculated by belief propagation are therefore identical to the max-product assignment. By claim 1, we know that this must be a neighborhood maximum of the posterior probability. But Gaussians are unimodal, hence it must be a global maximum of the posterior probability. Thus the max-product assignment must equal the MAP assignment, and the posterior means calculated using belief propagation are exact.


Fig. 4. Turbo code structure.


Fig. 5. Results of small turbo-code simulation. (a) Percent correct decodings for the max-product algorithm, compared with greedy gradient ascent in the posterior probability. (b) Comparison of convergence results of the two algorithms.

B. Turbo-codes

Figure 4 shows the pairwise Markov network corresponding to the decoding of a turbo code with 7 unknown bits. The top and bottom nodes represent the two transmitted messages (one for each constituent code). Thus in this example, the top and bottom nodes can take on 128 possible values. The potentials between the message nodes and their observations give the posterior probability of the transmitted word given one message, and the potentials between the message nodes and the bit nodes impose consistency. For example, \Psi(bit_1, message_1) = 1 if the first bit of message_1 is equal to bit_1, and zero otherwise. It is easy to show [21] that sum-product belief propagation on this graph gives the turbo decoding algorithm and max-product belief propagation gives the modified turbo decoding algorithm of [3].

Corollary 2: For a turbo code with arbitrary constituent codes, let x^* be a fixed-point max-product decoding. Then P(x^*|y) > P(x|y) for all x within Hamming distance 2 of x^*.

Proof: This follows from the main claim. Note that whenever we change any bits in the graph we also have to change the two message nodes, so changing more than two bits would give two loops.

An obvious consequence of corollary 2 is that for a turbo code with arbitrary constituent codes, the max-product decoding is either the MAP decoding or at least Hamming distance 3 from the MAP decoding. In other words, the max-product algorithm cannot converge to a decoding that is "almost" right: if it is wrong it must be wrong in at least three bits. In order for max-product to converge to a wrong decoding, there must exist a decoding that is at least distance 3 from the MAP decoding, and that decoding must have higher posterior probability than anything in its neighborhood. If no such wrong decoding exists, the max-product algorithm must either converge to the MAP decoding or fail to converge to a fixed point.

This b ehavior can be contrasted with the b ehavior of \greedy" iterative deco ding which increases the p osterior

probability of the deco ding at every iteration. Greedy iterative deco ding checks all bits and compares the p osterior

probability with the currentvalue of that bit versus ipping that bit. If the p osterior probability improved with ipping,

the algorithm ips it this is equivalent to free energy deco ding [15] at zero temp erature. This greedy deco ding algorithm

is guaranteed to converge to a local maximum of the p osterior probability.
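A minimal sketch of this greedy bit-flipping scheme (our own illustration; the `posterior` callable is an assumed interface to the decoder's posterior evaluation):

```python
def greedy_decode(posterior, x_init):
    """Greedy iterative decoding by single-bit flips.

    posterior(x) returns an (unnormalized) P(x|y) for a tuple of bits x.
    Repeatedly flips any single bit that increases the posterior; since
    each accepted flip strictly increases the posterior over a finite
    state space, the loop terminates at a local maximum.
    """
    x = list(x_init)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            flipped = x.copy()
            flipped[i] ^= 1
            if posterior(tuple(flipped)) > posterior(tuple(x)):
                x = flipped            # flip the bit; posterior increased
                improved = True
    return tuple(x)
```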

To illustrate these properties we ran the following simulations. We simulated transmitting turbo-encoded codewords of length 7 bits over a Gaussian channel. We compared the max-product decoding and the greedy decoding to the MAP decoding (since we were dealing with such short block lengths, we could calculate the MAP decoding using exhaustive search). We varied the noise \sigma of the Gaussian channel.

Figure 5 shows the results. Figure 5a shows the probability of convergence for both algorithms. Convergence was determined numerically for both algorithms: if the standard deviation of the messages over 10 successive iterations was less than 10^{-5} we declared convergence. If this criterion was not achieved in 100 iterations, we called that run a failure to converge.

Figure 5b shows the probability of a correct decoding (i.e. a decoding equal to the MAP decoding) for the two algorithms on the cases for which both converged. When max-product converges, it always finds the MAP decoding. In contrast, greedy decoding very frequently converges to a wrong decoding.

C. 2D grids

Figure 2a shows a two dimensional grid. For two dimensional grids, it is easy to show the following corollaries:

Corollary 3: For two dimensional grids of arbitrary size and arbitrary potentials, any configuration is either (1) in the SLT region of the max-product assignment or (2) in the SLT region of an assignment that is in the SLT region of the max-product assignment.

Corollary 4: For two dimensional grids of size n, the size of the SLT neighborhood increases exponentially with n.

Both corollaries follow from the fact that we can change the values of all the even or odd rows to an arbitrary value.

We compared greedy decoding to max-product decoding on the 3x3 grid. Max-product found the MAP decoding in 99% of the runs, and when it was wrong, its assignment was always the second best. In comparison, greedy decoding found the MAP assignment only on 44% of the runs, and the ranking was anywhere between 4 and 61.

IV. Discussion

The idea of using the computation tree to prove properties of the max-product assignment was also used in [23], [20], [8], [11]. The main tool in those analyses was the fact that the max-product assignment was the global optimum in the unwrapped tree. There are two problems with generalizing this approach to arbitrary topologies.

First, the global optimum depends on the numerosity of node replicas in the unwrapped tree. That is, different nodes in G may have different numbers of replicas in \tilde{G}. This leads to a distinction between balanced (or nonskewed) graphs [20], [8] and unbalanced (or skewed) graphs. Balanced graphs are those for which all nodes in G have asymptotically the same number of replicas in \tilde{G}. For unbalanced graphs, it is much more difficult to relate global optimality in \tilde{G} to optimality in G.

Second, the global optimum in the unwrapped tree contains contributions from the interior nodes, which have exactly the same neighbors in \tilde{G} as do their corresponding nodes in G, and contributions from the leaf nodes, which are missing some of their neighbors in \tilde{G}. Unfortunately, for most graphs G the number of leaf nodes grows at the same rate as the non-leaf nodes and cannot be neglected in the analysis.

In this analysis, on the other hand, we used primarily the local properties of the computation tree. No matter what the topology of G is, it is always the case that the local structure of \tilde{G} is the same as the local structure of G. Thus the numerosity of the nodes in \tilde{G} and the ratio of leaf nodes to non-leaf nodes are irrelevant. In this way, we can analyze the max-product assignment in arbitrary topologies.

Although we exploited the local properties, we would like to extend our analysis using the global properties as well. Our simulation results indicate that the max-product assignment is better than our analytical results guarantee. For example, in the turbo code simulations we found that the posterior probability often contained two SLT maxima, but for all these cases, max-product found the global maximum and not the second SLT maximum. In current work, we are looking into using the global properties of the computation tree to extend our analysis.

Appendix

Relationship between belief propagation schemes

A. Converting a factor graph to a pairwise Markov graph

A factor graph [7] is a bipartite graph with function nodes f_i (denoted by filled squares) and variable nodes x_i (denoted by unfilled circles). The function nodes denote a decomposition of a "global" function g(x) into a product of "local" functions f_i(x). We will assume that g represents a joint distribution over the variable nodes. The method of converting a factor graph into a pairwise Markov graph is to remove the function nodes (a code sketch follows the two steps below). Specifically:

• Find all f_i of degree 2. For each such f_i, remove it from the graph and directly connect the two variable nodes that were connected to that function node. That is, if f_1 is a degree 2 node connected to x_1, x_2, we directly connect x_1 and x_2 and set \Psi(x_1, x_2) = f_1(x_1, x_2).
• For all f_i of degree d > 2, replace the node f_i with a new variable node x_{N+i}. The variable node x_{N+i} represents the joint configuration of all the x_j that were connected to f_i. That is, if f_2 is connected to x_2, x_3, x_4, then the new variable node is a vector with three components, (x_2', x_3', x_4') (node x_5 in figure 6). Set the potentials \Psi(x_{N+i}, x_j) = \delta(x_j' - x_j). Finally, add an observation node y_{N+i} and set \Psi(x_{N+i}, y_{N+i}) = f_i(x).
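The two steps above can be sketched as follows; the data layout (`variables`, `factors`, and the `('cluster', idx)` node naming) is our own assumption for illustration.

```python
from itertools import product

def factor_to_pairwise(variables, factors):
    """Convert a factor graph to a pairwise Markov graph (a sketch).

    variables: dict mapping each variable name to its number of states.
    factors:   list of (scope, table) pairs, where scope is a tuple of
               variable names and table maps joint states to values.
    """
    pair_pot, obs_pot = {}, {}
    for idx, (scope, table) in enumerate(factors):
        if len(scope) == 2:
            # degree-2 factor: becomes an ordinary edge potential
            pair_pot[scope] = table
        else:
            # higher-degree factor: becomes a cluster variable node whose
            # states are joint configurations of the factor's scope
            cluster = ('cluster', idx)
            states = list(product(*[range(variables[v]) for v in scope]))
            # the "observation" potential Psi(x_c, y_c) carries f_i itself
            obs_pot[cluster] = {s: table[s] for s in states}
            # agreement potentials Psi(x_c, x_j) = delta(x_j' - x_j)
            for pos, v in enumerate(scope):
                pair_pot[(cluster, v)] = {
                    (s, xv): 1.0 if s[pos] == xv else 0.0
                    for s in states for xv in range(variables[v])}
    return pair_pot, obs_pot
```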

It is easy to show that (1) the joint distribution over the variables x_1 ... x_N in the pairwise Markov graph is exactly g(x), and (2) the belief propagation algorithm in the pairwise Markov graph is equivalent to the belief propagation algorithm in [7].


Fig. 6. Any factor graph can be converted into a Markov random field with pairwise potentials that represents exactly the same probability distribution over variables. When this conversion is done, the belief propagation algorithm for the pairwise Markov graph is equivalent to the belief propagation algorithm on the factor graph.

B. Converting a junction graph to a pairwise Markov graph

A junction graph [1] is a graph in which vertices s_i represent "local domains" of a global function. Thus if g(x_1, x_2, x_3) factorizes so that:

g(x_1, x_2, x_3) = f_1(x_1, x_2) f_2(x_2, x_3)    (28)

then the two local domains are s_1 = (x_1, x_2) and s_2 = (x_2, x_3). Edges between these vertices correspond to "communication links" in a message passing scheme for calculating marginals of g.

Aji et al. showed that for such a message passing algorithm to exist, the junction graph must possess the "running intersection property": the subset of nodes whose domains include x_i, together with the edges connecting these nodes, must form a connected graph. We now show that junction graphs are equivalent to pairwise Markov graphs.

To show this we leave the graph between the s_i unchanged and add "observation" nodes y_i such that \Psi(s_i, y_i) = f_i(s_i). We set \Psi(s_i, s_j) = 1 if the two nodes agree on the value of any x_i that exists in both domains, and zero otherwise. Note that the running intersection property guarantees that any two nodes (not necessarily neighboring) must agree on the value of a common x_i for the joint distribution to be nonzero. When the potentials are set in this way, it is easy to see that the joint distribution over x in the pairwise Markov graph is exactly g(x) and that the belief propagation algorithm in the Markov graph is equivalent to the GDL algorithm in [1].
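For illustration, a minimal sketch of the agreement potential \Psi(s_i, s_j) just described, with local-domain configurations encoded as dicts (an encoding of our own choosing):

```python
def junction_edge_potential(s_i, s_j):
    """Psi(s_i, s_j) for a junction-graph edge: 1 if the two local-domain
    configurations agree on every variable they share, else 0.

    s_i, s_j: dicts mapping variable names to the values they take in the
    two configurations.
    """
    shared = set(s_i) & set(s_j)
    return 1.0 if all(s_i[v] == s_j[v] for v in shared) else 0.0

# e.g. for the domains s1 = (x1, x2) and s2 = (x2, x3) of equation 28:
print(junction_edge_potential({'x1': 0, 'x2': 1}, {'x2': 1, 'x3': 0}))  # 1.0
print(junction_edge_potential({'x1': 0, 'x2': 1}, {'x2': 0, 'x3': 0}))  # 0.0
```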

References

[1] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 2000. In press.
[2] S. M. Aji, G. B. Horn, and R. J. McEliece. On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT, 1998.
[3] S. Benedetto, G. Montorsi, D. Divsalar, and F. Pollara. Soft-output decoding algorithms in iterative decoding of Turbo codes. Technical Report 42-124, JPL TDA, 1996.
[4] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, 1987.
[5] G. D. Forney, F. R. Kschischang, and B. Marcus. Iterative decoding of tail-biting trellises. Preprint presented at the 1998 Information Theory Workshop in San Diego, 1998.
[6] W. T. Freeman and E. C. Pasztor. Learning to estimate scenes from images. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems 11. MIT Press, 1999.
[7] B. J. Frey, R. Koetter, and A. Vardy. Skewness and pseudocodewords in iterative decoding. In Proc. 1998 ISIT, 1998.
[8] B. J. Frey, R. Koetter, and A. Vardy. Signal space characterization of iterative decoding. IEEE Trans. Info. Theory, 2001. Special issue on Codes on Graphs and Iterative Algorithms.
[9] R. G. Gallager. Low Density Parity Check Codes. MIT Press, 1963.
[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721-741, November 1984.
[11] G. B. Horn. Iterative Decoding and Pseudocodewords. PhD thesis, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, May 1999.
[12] F. V. Jensen. An Introduction to Bayesian Networks. Springer, 1996.
[13] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 1998. Submitted.
[14] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[15] D. J. C. MacKay and R. M. Neal. Good error-correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 1999.
[16] J. R. Munkres. Topology: A First Course. Prentice Hall, 1975.
[17] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, 1999.
[18] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[19] S. E. Shimony. Finding MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399-410, 1994.
[20] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT AI Lab, 1997.
[21] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 2000. To appear.
[22] Y. Weiss and W. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In S. Solla, T. K. Leen, and K. R. Muller, editors, Advances in Neural Information Processing Systems 12, 2000.
[23] N. Wiberg. Codes and Decoding on General Graphs. PhD thesis, Department of Electrical Engineering, U. Linkoping, Sweden, 1996.

Biographies of the Authors

A. Yair Weiss

Yair Weiss is a postdoctoral scientist at the UC Berkeley Department of Electrical Engineering and Computer Science. He received his PhD from MIT in 1998, where he worked with Ted Adelson on Bayesian models for vision. His research interests include human and machine vision, probabilistic inference and learning, and neural computation.

B. Wil liam T. Freeman

William T. Freeman is a Senior Research Scientist at MERL, Mitsubishi Electric Research Labs. He received his PhD in computer vision in 1992 from the Massachusetts Institute of Technology, a BS in physics and an MS in electrical engineering from Stanford in 1979, and an MS in applied physics from Cornell in 1981.

His current research interests include machine learning applied to computer vision, Bayesian models of visual perception, and interactive applications of computer vision. In 1997, he received the Outstanding Paper prize at the Conference on Computer Vision and Pattern Recognition for work on applying bilinear models to "separating style and content".

From 1981 to 1987, Dr. Freeman worked in electronic imaging at the Polaroid Corporation, developing algorithms for color image reconstruction used in Polaroid's digital camera. In 1987-88, he was a Foreign Expert at the Taiyuan University of Technology, P. R. China. Dr. Freeman is an Associate Editor of IEEE-PAMI.