On the optimality of solutions of the max-product belief propagation algorithm in arbitrary graphs.
Yair Weiss and William T. Freeman
Abstract
Graphical models, such as Bayesian networks and Markov random fields, represent statistical dependencies of variables by a graph. The max-product "belief propagation" algorithm is a local message-passing algorithm on this graph that is known to converge to a unique fixed point when the graph is a tree. Furthermore, when the graph is a tree, the assignment based on the fixed point yields the most probable a posteriori (MAP) values of the unobserved variables given the observed ones.
Recently, good empirical performance has been obtained by running the max-product algorithm (or the equivalent min-sum algorithm) on graphs with loops, for applications including the decoding of "turbo" codes. Except for two simple graphs (cycle codes and single loop graphs) there has been little theoretical understanding of the max-product algorithm on graphs with loops. Here we prove a result on the fixed points of max-product on a graph with arbitrary topology and with arbitrary probability distributions (discrete or continuous valued nodes). We show that the assignment based on a fixed point is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. The region includes all assignments that differ from the max-product assignment in any subset of nodes that form no more than a single loop in the graph. In some graphs this neighborhood is exponentially large. We illustrate the analysis with examples.
Keywords: Belief propagation, max-product, min-sum, Bayesian networks, Markov random fields, MAP estimate.
Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and image understanding. Typically, a probability distribution is assumed over a set of variables and the task is to infer the values of the unobserved variables given the observed ones. The assumed probability distribution is described using a graphical model [14]: the qualitative aspects of the distribution are specified by a graph structure. The graph may either be directed, as in a Bayesian network [18], [12], or undirected, as in a Markov Random Field [18], [10].
Here we focus on the problem of finding an assignment for the unobserved variables that is most probable given the observed ones. In general, this problem is NP-hard [19], but if the graph is singly connected (i.e. there is only one path between any two given nodes) then there exist efficient local message-passing schemes to perform this task. Pearl [18] derived such a scheme for singly connected Bayesian networks. The algorithm, which he called "belief revision", is identical to his algorithm for finding posterior marginals over nodes, except that the summation operator is replaced with a maximization. Aji et al. [2] have shown that both of Pearl's algorithms can be seen as special cases of generalized distributive laws over particular semirings. In particular, Pearl's algorithm for finding maximum a posteriori (MAP) assignments can be seen as a generalized distributive law over the max-product semiring. We will henceforth refer to it as the "max-product" algorithm.
Pearl showed that for singly connected networks, the max-product algorithm is guaranteed to converge, and that the assignment based on the messages at convergence is guaranteed to give the optimal assignment: the values corresponding to the MAP solution.
Several groups have recently reported excellent experimental results by running the max-product algorithm on graphs with loops [23], [6], [3], [20], [11]. Benedetto et al. used the max-product algorithm to decode "turbo" codes and obtained excellent results that were slightly inferior to the original turbo decoding algorithm (which is equivalent to the sum-product algorithm). Weiss [20] compared the performance of sum-product and max-product on a "toy" turbo code problem while distinguishing between converged and unconverged cases. He found that if one considers only the convergent cases, the performance of max-product decoding is significantly better than sum-product decoding. However, the max-product algorithm converges less often, so its overall performance (including both convergent and nonconvergent cases) is inferior.
Progress in the analysis of the max-product algorithm has been made for two special topologies: single loop graphs, and "cycle codes". For graphs with a single loop [23], [20], [21], [5], [2], it can be shown that the algorithm converges to a stable fixed point or a periodic oscillation. If it converges to a stable fixed point, then the assignment based on the fixed-point messages is the optimal assignment. For graphs that correspond to cycle codes (low density parity check codes in which each bit is checked by exactly two check nodes), Wiberg [23] gave sufficient conditions for max-product to converge to the transmitted codeword, and Horn [11] gave sufficient conditions for convergence to the MAP assignment.
In this paper we analyze the max-product algorithm in graphs of arbitrary topology. We show that at a fixed point of the algorithm, the assignment is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. These results motivate using this powerful algorithm in a broader class of networks.
Y. Weiss is with Computer Science Division, 485 Soda Hall, UC Berkeley, Berkeley, CA 94720-1776. E-mail: [email protected]. Support by MURI-ARO-DAAH04-96-1-0341, MURI N00014-00-1-0637 and NSF IIS-9988642 is gratefully acknowledged.
W. T. Freeman is with MERL, Mitsubishi Electric Research Labs., 201 Broadway, Cambridge, MA 02139. E-mail: [email protected].
Fig. 1. Any Bayesian network can be converted into an undirected graph with pairwise cliques by adding cluster nodes for all parents that share a common child. (a) A Bayesian network. (b) The corresponding undirected graph with pairwise cliques. A cluster node for (x_2, x_3) has been added. The potentials can be set so that the joint probability in the undirected graph is identical to that in the Bayesian network. In this case the update rules presented in this paper reduce to Pearl's propagation rules in the original Bayesian network [21].
I. The max-product algorithm in pairwise Markov Random Fields
Pearl's original algorithm was described for directed graphs, but in this paper we focus on undirected graphs. Every directed graphical model can be transformed into an undirected graphical model before doing inference (see figure 1). An undirected graphical model (or a Markov Random Field) is a graph in which the nodes represent variables and arcs represent compatibility relations between them. Assuming all probabilities are nonzero, the Hammersley-Clifford theorem (e.g. [18]) guarantees that the probability distribution will factorize into a product of functions of the maximal cliques of the graph.
Denoting by x the values of all unobserved variables in the graph, the factorization has the form:

    P(x) = ∏_c Ψ_c(x_c)    (1)

where x_c is a subset of x that forms a clique in the graph and Ψ_c is the potential function for the clique.
We will assume, without loss of generality, that each node x_i has a corresponding node y_i that is connected only to x_i. Thus:

    P(x, y) = ∏_c Ψ_c(x_c) ∏_i Ψ_ii(x_i, y_i)    (2)

The restriction that all the y_i variables are observed and none of the x_i variables are is just to make the notation simple: Ψ_ii(x_i, y_i) may be independent of y_i (equivalent to y_i being unobserved) or Ψ_ii(x_i, y_i) may be δ(x_i − x_o) (equivalent to x_i being observed, with value x_o).
In describing and analyzing belief propagation, we assume the graphical model has been preprocessed so that all the cliques consist of pairs of units. Any graphical model can be converted into this form before doing inference through a suitable clustering of nodes into large nodes [21]. Figure 1 shows an example: a Bayesian network is converted into an MRF in which all the cliques are pairs of units.
Equation 2 becomes

    P(x, y) = ∏_(i,j) Ψ_ij(x_i, x_j) ∏_i Ψ_ii(x_i, y_i)    (3)

where the first product is over connected pairs of nodes.
The important property of MRFs that we will use throughout this paper is the Markov blanket property. The probability of a subset of nodes S, given all other nodes in the graph S^C, depends only on the values of the nodes that immediately neighbor S. Furthermore, the probability of S given all other nodes is proportional to the product of all clique potentials within S and all clique potentials between S and its immediate neighbors:

    P(x_S | x_{S^C}) = P(x_S | N(x_S))    (4)
                     = α ∏_{(x_i, x_j) ∈ S} Ψ_ij(x_i, x_j) ∏_{x_i ∈ S, x_j ∈ N(S)} Ψ_ij(x_i, x_j)    (5)
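The Markov blanket property is easy to verify numerically. The sketch below builds a hypothetical pairwise MRF (a square of four binary nodes; all potentials and values are invented for illustration) and checks that the conditional of a subset S computed from the full joint equals the normalized product of only those potentials within S or between S and its neighbors:

```python
import itertools

# A numerical check of the Markov blanket property (equations 4-5) on a
# hypothetical square of four binary nodes; all potentials are invented.
edges = {(0, 1): [[1.0, 2.0], [3.0, 1.0]],
         (1, 2): [[2.0, 1.0], [1.0, 4.0]],
         (2, 3): [[1.0, 3.0], [2.0, 1.0]],
         (0, 3): [[5.0, 1.0], [1.0, 2.0]]}
S = (0, 1)              # the subset of nodes
rest = {2: 1, 3: 0}     # fixed values of all other nodes

def joint(x):  # unnormalized P(x): product over all clique potentials
    p = 1.0
    for (i, j), psi in edges.items():
        p *= psi[x[i]][x[j]]
    return p

def full_conditional(xs):  # P(x_S | rest) computed from the full joint
    num = joint({**rest, 0: xs[0], 1: xs[1]})
    den = sum(joint({**rest, 0: a, 1: b}) for a in (0, 1) for b in (0, 1))
    return num / den

def local(x):  # only potentials within S or between S and its neighbors
    p = 1.0
    for (i, j), psi in edges.items():
        if i in S or j in S:
            p *= psi[x[i]][x[j]]
    return p

def blanket_conditional(xs):  # equation 5: potentials touching S suffice
    num = local({**rest, 0: xs[0], 1: xs[1]})
    den = sum(local({**rest, 0: a, 1: b}) for a in (0, 1) for b in (0, 1))
    return num / den

for xs in itertools.product((0, 1), repeat=2):
    assert abs(full_conditional(xs) - blanket_conditional(xs)) < 1e-12
print("Markov blanket property verified")
```

The potentials not touching S cancel between numerator and denominator, which is exactly why equation 5 holds.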
The advantage of preprocessing the graph into one with pairwise cliques is that the description and the analysis of belief propagation become simpler. For completeness, we review the belief propagation scheme used in [21]. As we discuss in the appendix, this belief propagation scheme is equivalent to Pearl's belief propagation algorithm in directed graphs, the Generalized Distributive Law algorithm of [1] and the factor graph propagation algorithm of [13]. These three algorithms correspond to the algorithm presented here with a particular way of preprocessing the graph in order to obtain pairwise potentials.
At every iteration, each node sends a (different) message to each of its neighbors and receives a message from each neighbor. Let x_i and x_j be two neighboring nodes in the graph. We denote by m_ij(x_j) the message that node x_i sends to node x_j, by m_ii(x_i) the message that y_i sends to x_i, and by b_i(x_i) the belief at node x_i.

The max-product update rules are:

    m_ij(x_j) ← α max_{x_i} Ψ_ij(x_i, x_j) m_ii(x_i) ∏_{x_k ∈ N(x_i)\x_j} m_ki(x_i)    (6)

    b_i(x_i) ← α m_ii(x_i) ∏_{x_k ∈ N(x_i)} m_ki(x_i)    (7)
where α denotes a normalization constant and N(x_i)\x_j means all nodes neighboring x_i, except x_j. The procedure is initialized with all message vectors set to constant functions. Observed nodes do not receive messages and they always transmit the same vector: if y_i is observed to have value y_i then m_ii(x_i) = Ψ_ii(x_i, y_i). The normalization of m_ij in equation 6 is not necessary: whether or not the messages are normalized, the beliefs b_i will be identical. However, normalizing the messages avoids numerical underflow and adds to the stability of the algorithm. We assume throughout this paper that all nodes simultaneously update their messages in parallel.
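Update rules 6 and 7 can be sketched in a few lines of Python. This is a minimal illustrative implementation for binary nodes, not the authors' code; the chain graph, potentials, and variable names are all invented for the example. On this singly connected graph the resulting assignment matches the MAP found by exhaustive search:

```python
import itertools, math

# A minimal max-product sketch on a hypothetical chain of 4 binary nodes.
# psi[(i, j)] is the pairwise potential Psi_ij; phi[i] plays the role of
# Psi_ii(x_i, y_i), the local evidence from the observed node y_i.
psi = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
       (1, 2): [[2.0, 1.0], [1.0, 2.0]],
       (2, 3): [[2.0, 1.0], [1.0, 2.0]]}
phi = [[1.5, 0.5], [1.2, 0.8], [0.4, 1.6], [0.9, 1.1]]
n = 4
nbrs = {i: [j for j in range(n)
            if (i, j) in psi or (j, i) in psi] for i in range(n)}

def pot(i, j, xi, xj):
    return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]

# Equation 6: parallel message updates, normalized for stability.
m = {(i, j): [1.0, 1.0] for i in range(n) for j in nbrs[i]}
for _ in range(20):
    new = {}
    for (i, j) in m:
        vec = [max(pot(i, j, xi, xj) * phi[i][xi] *
                   math.prod(m[k, i][xi] for k in nbrs[i] if k != j)
                   for xi in (0, 1)) for xj in (0, 1)]
        z = sum(vec)
        new[i, j] = [v / z for v in vec]
    m = new

# Equation 7: beliefs, and the max-product assignment.
belief = [[phi[i][xi] * math.prod(m[k, i][xi] for k in nbrs[i])
           for xi in (0, 1)] for i in range(n)]
assignment = [max((0, 1), key=lambda xi: b[xi]) for b in belief]

# On this singly connected graph the assignment is exactly the MAP.
def posterior(x):  # unnormalized P(x|y)
    s = math.prod(phi[i][x[i]] for i in range(n))
    return s * math.prod(psi[e][x[e[0]]][x[e[1]]] for e in psi)
map_x = max(itertools.product((0, 1), repeat=n), key=posterior)
print(assignment, list(map_x))  # both [0, 0, 1, 1]
```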
For singly connected graphs it is easy to show that:
• The algorithm converges to a unique fixed point regardless of initial conditions in a finite number of iterations.
• At convergence, the belief for any value x_i of a node i is the maximum of the posterior, conditioned on that node having the value x_i: b_i(x_i) = α max_x P(x | y, x_i).
• Define the max-product assignment, x*, by x_i* = arg max_{x_i} b_i(x_i) (assuming a unique maximizing value exists). Then x* is the MAP assignment.

The max-product assignment assumes there are no "ties", i.e. that a unique maximizing x_i exists for all b(x_i). Ties can arise when the MAP assignment is not unique, e.g. when there are two assignments that have identical posterior and both maximize the posterior. For singly connected graphs, the converse is also true: if there are no ties in b(x_i) then the MAP assignment is unique. In what follows, we assume a unique MAP assignment.
In particular applications, it might be easier to work in the log domain, so that the product operations in equations 6 and 7 are replaced by sum operations. Thus the max-product algorithm is sometimes referred to as the max-sum algorithm or the min-sum algorithm [23], [5]. If the graph is a chain, max-product is a two-way version of the Viterbi algorithm in Hidden Markov Models and is closely related to concurrent dynamic programming [4]. Despite this connection to well studied algorithms, there has been very little analytical success in characterizing the solutions of the max-product algorithm on arbitrary graphs with loops.
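As a small worked example of the log-domain view, the sketch below runs a Viterbi-style min-sum dynamic program on a hypothetical 3-node chain. The chain, costs (playing the role of −log Ψ) and local terms are invented for illustration:

```python
# Min-sum on a chain: with costs -log(psi), max-product reduces to a
# Viterbi-style dynamic program. All numbers below are invented.
n = 3
cost = {(0, 1): [[0.1, 1.0], [1.0, 0.1]],     # pairwise costs
        (1, 2): [[0.5, 0.2], [0.2, 0.9]]}
local = [[0.0, 0.7], [0.3, 0.0], [0.6, 0.1]]  # -log of local evidence

# forward pass: f[i][x] = minimal cost of the prefix ending in x_i = x
f = [[local[0][x] for x in (0, 1)]]
for i in range(1, n):
    f.append([local[i][x] + min(f[i - 1][xp] + cost[(i - 1, i)][xp][x]
                                for xp in (0, 1)) for x in (0, 1)])

# backward pass: backtrack the minimizing (i.e. MAP) assignment
x = [0] * n
x[n - 1] = min((0, 1), key=lambda v: f[n - 1][v])
for i in range(n - 2, -1, -1):
    x[i] = min((0, 1), key=lambda v: f[i][v] + cost[(i, i + 1)][v][x[i + 1]])
print(x)  # [0, 0, 1]: the minimum-cost (maximum-posterior) assignment
```

Running the max-product messages of equation 6 in the log domain on the same chain yields exactly this computation, which is why the algorithm is also called min-sum.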
II. What are the fixed points of the max-product algorithm?
Each iteration of the max-product algorithm can be thought of as an operator F that inputs a list of messages m^t and outputs a list of messages m^{t+1} = F(m^t). Thus running belief propagation can be thought of as an iterative way of finding a solution to the fixed point equations F(m) = m, with an initial guess m^0 in which all messages are constant functions.
Note that this is not the only way of finding fixed points. Murphy et al. [17] describe an alternative method for finding fixed points of F. They suggested iterating:

    m^{t+1} = λ F(m^t) + (1 − λ) m^t    (8)
Fig. 2. (a) A 25x25 grid of points. (b)-(c) Examples of subsets of nodes that form no more than a single loop. In this paper we prove that changing such a subset of nodes from the max-product assignment will always decrease the posterior probability.
Here again, if iterations of equation 8 converge to m*, then m* satisfies m* = F(m*).
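A toy illustration of the damped iteration in equation 8, using a scalar stand-in F(m) = 1 − m for the message operator (invented for the example). Plain iteration oscillates with period 2 and never converges; the damped update contracts to the fixed point m = 0.5:

```python
# F is a toy scalar stand-in for the message-update operator: plain
# iteration m <- F(m) oscillates forever between two values, while
# the damped iteration of equation 8 converges to the fixed point 0.5.
def F(m):
    return 1.0 - m

lam = 0.7
m_plain, m_damped = 0.9, 0.9
for _ in range(100):
    m_plain = F(m_plain)                                   # oscillates
    m_damped = lam * F(m_damped) + (1.0 - lam) * m_damped  # equation 8

print(m_plain, m_damped)  # ~0.9 (stuck on a 2-cycle) vs ~0.5
```

Any fixed point of the damped map is also a fixed point of F, since λF(m) + (1 − λ)m = m implies F(m) = m for λ > 0.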
The equation m = F(m) is a highly nonlinear equation and it is not obvious how many solutions exist or how to characterize them. Horn [11] showed that in single-loop graphs, F can be considered as matrix multiplication over the max-product semiring, and fixed points correspond to eigenvectors of that matrix. Thus even in single loop graphs, one can construct examples with any number of fixed points.
The main result of this paper is a characterization of how well the max-product assignment approximates the MAP assignment. We show that the assignment x* must be a neighborhood maximum of P(x | y): that is, P(x* | y) > P(x | y) for all x in a particular large region around x*. This condition is weaker than a global maximum but stronger than a local maximum.
To be more precise, we define the Single Loops and Trees (SLT) neighborhood of an assignment x* in a graphical model G to include all assignments x that can be obtained from x* by:
• Choosing an arbitrary subset S of nodes in G that consists of disconnected combinations of trees and single loops.
• Assigning arbitrary values to x_S, the chosen subset of nodes. The other nodes have the same assignment as in x*.

Claim 1: For an arbitrary graphical model with arbitrary potentials, if m* is a fixed point of the max-product algorithm and x* is the assignment based on m*, then P(x* | y) > P(x | y) for all x ≠ x* in the SLT neighborhood of x*.
Figure 2 illustrates example configurations within the SLT neighborhood of the max-product assignment. It shows examples of subsets of nodes that could be changed to arbitrary values; the posterior probability of the changed assignment is guaranteed to be worse than that of the max-product assignment.
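Membership in the SLT neighborhood is easy to test programmatically: the set S of nodes where x differs from x* must induce components that each contain at most one cycle, i.e. at most as many edges as nodes. The sketch below uses an invented 2x3 grid graph, loosely in the spirit of figure 2:

```python
# SLT membership test sketch: every connected component of the subgraph
# induced by S must be a tree (E = V - 1) or a single loop (E = V).
# The grid graph and the subsets are invented for illustration.
def is_slt_subset(adj, S):
    S = set(S)
    seen = set()
    for s in S:
        if s in seen:
            continue
        comp, stack, end_count = set(), [s], 0
        while stack:
            u = stack.pop()
            if u in comp:
                continue
            comp.add(u)
            for v in adj[u]:
                if v in S:
                    end_count += 1  # each induced edge counted twice
                    if v not in comp:
                        stack.append(v)
        seen |= comp
        if end_count // 2 > len(comp):  # more than one independent cycle
            return False
    return True

# 2x3 grid: nodes 0,1,2 on the top row and 3,4,5 below them
adj = {0: [1, 3], 1: [0, 2, 4], 2: [1, 5],
       3: [0, 4], 4: [1, 3, 5], 5: [2, 4]}
print(is_slt_subset(adj, {0, 1, 3}))           # a tree      -> True
print(is_slt_subset(adj, {0, 1, 3, 4}))        # single loop -> True
print(is_slt_subset(adj, {0, 1, 2, 3, 4, 5}))  # two loops   -> False
```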
To build intuition, we first describe the proof for a specific case, the diamond graph of figure 3. The general proof is given in section II-B.
A. Specific Example
We start by giving an overview of the proof for the diamond graph shown in figure 3a. The proof is based on the unwrapped tree: the graphical model that loopy belief propagation is solving exactly when applying the belief propagation rules in a loopy network [9], [23], [21], [22]. In error-correcting codes, the unwrapped tree is referred to as the "computation tree". It is based on the idea that the computation of a message sent by a node at time t depends on messages it received from its neighbors at time t − 1, and those messages depend on the messages the neighbors received at time t − 2, etc.

Figure 3 shows an unwrapped tree around node x_1 for the diamond shaped graph on the left. Each node has a shaded observed node attached to it that is not shown for simplicity.
To simplify notation, we assume that x*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x* = 0. The periodic assignment lemma from [22] guarantees that we can modify Ψ̃_ii(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

    P(x̃ = 0 | ỹ) = max_x̃ P(x̃ | ỹ)    (9)

The Ψ̃_ii(x_i, y_i) are modified to include the messages from the nodes to be added at the next stage of the unwrapping.
Fig. 3. (a) A Markov Random Field with multiple loops. (b) The unwrapped graph corresponding to this structure. The unwrapped graphs are constructed by replicating the potentials Ψ(x_i, x_j) and observations y_i while preserving the local connectivity of the loopy graph. They are constructed so that the messages received by node x_1 after t iterations in the loopy graph are equivalent to those that would be received by x̃_1 in the unwrapped graph. An observed node y_i (not shown) is connected to each depicted node.
We now show that the global optimality of x̃ = 0 and the method of construction of the unwrapped tree guarantee that P(x = 0 | y) > P(x | y) for all x in the SLT neighborhood of 0.

Referring to figure 3a, suppose that P(x = 10000 | y) > P(x = 00000 | y). By the Markov property of the diamond figure, this means that:

    P(x_1 = 1 | x_{2-4} = 0) > P(x_1 = 0 | x_{2-4} = 0)    (10)

Note that node x̃_1 has exactly the same neighbors in the unwrapped graph as x_1 has in the loopy graph. Furthermore, by the method of construction, the potentials between x_1 and each of its neighbors are the same as the potentials between x̃_1 and its neighbors. Thus equation 10 implies that:

    P(x̃_1 = 1 | x̃_{2-4} = 0) > P(x̃_1 = 0 | x̃_{2-4} = 0)    (11)

in contradiction to equation 9. Thus no change of a single x_i can improve the posterior probability.
What about changing two x_i at a time? If we change a pair that is not connected in the graph, say x_1 and x_5, then by the Markov property this is equivalent to changing one at a time. Thus suppose P(10001) > P(00000); this again implies that P(x_1 = 1 | x_{2-4} = 0) > P(x_1 = 0 | x_{2-4} = 0), and we have shown earlier that this leads to a contradiction. Thus no change of assignment in two unconnected nodes can improve the posterior probability.

If the two are connected, say (x_1, x_2), then the same argument holds with respect to the pair of nodes (x_1, x_2). Note that the subgraph (x̃_1, x̃_2) is isomorphic to the subgraph (x_1, x_2) and the two subgraphs have the same neighbors. Hence:

    P(x_{1-2} = 1 | x_{3-5} = 0) > P(x_{1-2} = 0 | x_{3-5} = 0)    (12)

implies that:

    P(x̃_{1-2} = 1 | x̃_{3-5} = 0) > P(x̃_{1-2} = 0 | x̃_{3-5} = 0)    (13)

and this is again in contradiction to equation 9. Thus no change of assignment in two connected nodes can improve the posterior probability.

Similar arguments show that no change of assignment in any subtree of the graph can improve the posterior probability (e.g. changing the values of x_{1-4}, or changing the values of x_{1-2} and x_{4-5}).
These arguments no longer hold, however, when we change a subset of nodes that forms a loopy subgraph of G. For example, the subgraph (x_1, x_2, x_3, x_5) is not isomorphic to the subgraph (x̃_1, x̃_2, x̃_3, x̃_5). Indeed, since the unwrapped tree is a tree, it cannot have a loopy subgraph. Hence we cannot equate the probabilities of the two subgraphs given their neighbors.

If the subset forms a single loop, however, then there exists an arbitrarily long chain in the unwrapped tree that corresponds to the unwrapping of that loop. For example, note the chain (x̃_1, x̃_2, x̃_5, x̃_3') in figure 3b. Using a similar argument to that used in proving optimality of max-product in a single loop graph [23], [20], [2], [5], we can show that if we can improve the posterior in the loopy graph by changing the values of (x_1, x_2, x_3, x_5), then we can also improve the posterior in the unwrapped graph by changing the values of the arbitrarily long chain. This again leads to a contradiction.
B. Proof of Claim 1:
We denote by G the original graph and by G̃ the unwrapped graph. We use x_i for nodes in the original graph and x̃_i for nodes in the unwrapped graph. We define a mapping C from nodes in G̃ to nodes in G. This mapping will say, for each node in G̃, what the corresponding node in G is: C(x̃_1) = x_1.

The unwrapped tree is therefore a graph G̃ and a correspondence map. We now give the method of constructing both. Pick an arbitrary node in G, say x_1. Set C(x̃_1) = x_1. Iterate t times:
• Find all leaves of G̃ (start with the root).
• For each leaf x̃_i, find all k nodes in G that neighbor C(x̃_i).
• Add k − 1 nodes as children to x̃_i, corresponding to all neighbors of C(x̃_i) except C(x̃_j), where x̃_j is the parent of x̃_i.

The potential matrices and observations for each node in the unwrapped network are copied from the corresponding nodes in the loopy graph. That is, if x_j = C(x̃_i) then ỹ_i = y_j, and:

    Ψ̃(x̃_i, x̃_j) = Ψ(C(x̃_i), C(x̃_j))    (14)
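The three construction steps above can be sketched directly. The code below grows the computation tree and the correspondence map C for t iterations; the triangle graph and all names are invented for the example:

```python
# Unwrapping a loopy graph into its computation tree. Tree nodes are
# integer ids; C maps each tree node to its graph node. The graph here
# is a made-up triangle.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

def unwrap(adj, root, t):
    tree = {}              # tree node id -> list of children ids
    C = {0: root}          # correspondence map
    parent_of = {0: None}
    leaves, next_id = [0], 1
    for _ in range(t):
        new_leaves = []
        for leaf in leaves:
            p = parent_of[leaf]
            exclude = C[p] if p is not None else None
            tree[leaf] = []
            for g in adj[C[leaf]]:
                if g == exclude:
                    continue  # add only k-1 children: skip the parent
                C[next_id] = g
                parent_of[next_id] = leaf
                tree[leaf].append(next_id)
                new_leaves.append(next_id)
                next_id += 1
        leaves = new_leaves
    return tree, C

tree, C = unwrap(adj, 0, 3)
# the root has k = 2 neighbors and every later level adds k-1 = 1 child
# per leaf, so unwrapping the triangle for t = 3 gives 1+2+2+2 = 7 nodes
print(len(C))  # 7
```

Letting t grow without bound produces the universal cover of the triangle: two infinite chains hanging off the root, which is the single-loop special case analyzed in [23], [20].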
Note that the unwrapped tree around x_i after t_1 iterations is a subtree of the unwrapped tree around x_j after t_2 > t_1 iterations. If we let the number of iterations t → ∞, then the unwrapped tree G̃ becomes a well-studied object in topology: the universal covering of G [16]. Roughly speaking, it is a space that preserves the local topology of the graph G but is singly connected. It is precisely this fact, that max-product gives the global optimum on a graph that has the same local topology as G, that ensures that x* is a neighborhood maximum of P(x | y).

We now state some properties of G̃.
1. Equal neighbors property: Every non-leaf node in G̃ has the same number of neighbors as the corresponding node in G, and the neighbors are in one-to-one correspondence. If C(x̃_i) = x_i then for each x̃_j ∈ N(x̃_i), C(x̃_j) ∈ N(x_i), and for each x_j ∈ N(x_i) there exists an x̃_k ∈ N(x̃_i) such that C(x̃_k) = x_j. This follows directly from the method of constructing G̃.

2. Equal conditional probability property: The probability of a non-leaf node x̃_j given its neighbors in G̃ is equal to the probability of C(x̃_j) given its neighbors. Formally, if C(x̃_j) = x_i then

    P(x̃_j | N(x̃_j) = z, ỹ) = P(x_i | N(x_i) = z, y)    (15)

This follows from the Markov blanket property of MRFs (eq. 5) and the equal neighbors property.
3. Isomorphic subtree property: For any subtree T ⊂ G and for a sufficiently large unwrapping count t, there exists an isomorphic subtree T̃ ⊂ G̃. The nodes of the subtrees are in one-to-one correspondence: for each x_i ∈ T there exists an x̃_i ∈ T̃ such that C(x̃_i) = x_i, and for each x̃_j ∈ T̃, C(x̃_j) ∈ T. To prove this we pick the root node of T, x_i, as the initial node around which to expand the unwrapped tree. By the method of construction, the unwrapped tree after a number of iterations equal to the depth of T will be isomorphic to T and in one-to-one correspondence. This gives us T̃. For any other choice of initial node for G̃, the unwrapped tree starting with x_i is a subtree of G̃.
4. Infinite chain property: For any loop L ⊂ G there exists an arbitrarily long chain L̃ ⊂ G̃ that is the unwrapping of L. Furthermore, if we denote by n the length of the chain divided by the length of the loop, and by x̃_1 and x̃_N the two endpoints of the chain, then:

    P(x̃_L̃ | N(x̃_L̃)) = α β P^n(x_L | N(x_L))    (16)

with the "boundary potentials"

    β = ∏_{x̃_i ∈ N(x̃_1)\x̃_2} Ψ̃(x̃_i, x̃_1) ∏_{x̃_i ∈ N(x̃_N)\x̃_{N−1}} Ψ̃(x̃_i, x̃_N)    (17)

Note that β is independent of n. x_L and x̃_L̃ refer to all node variables in the sets L and L̃, respectively. The existence of the arbitrarily long chain follows from the equal neighbors property, while equation 16 follows from the Markov blanket property for MRFs (equation 5).
Another property of the unwrapped tree that we will need was proven in [22]:

5. Periodic assignment lemma: Let m* be a fixed point of the max-product algorithm and x* the max-product assignment in G. Let G̃ be the unwrapped tree. Suppose we modify the observation potentials Ψ̃_ii at the leaf nodes to include the messages from the nodes to be added at the next stage of unwrapping. Then the MAP assignment x̃* in G̃ is a replication of x*: if x_i = C(x̃_j) then x̃*_j = x*_i.
Using these properties, we can prove the main claim. To simplify notation, we again assume that x*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x* = 0. The periodic assignment property guarantees that we can modify Ψ̃_ii(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

    P(x̃ = 0 | ỹ) = max_x̃ P(x̃ | ỹ)    (18)
Now, assume that we can choose a subtree T ⊂ G, change the assignment of its nodes x_T to another value, and increase the posterior. Again, to simplify notation, assume that the maximizing value is x_T = 1. By the Markov property, this means that:

    P(x_T = 1 | N(x_T) = 0, y) > P(x_T = 0 | N(x_T) = 0, y)    (19)

Now, by the isomorphic subtree property we know that there exists a T̃ ⊂ G̃ that is isomorphic to T. We also know that T̃ has the same conditional probability given its neighbors as T does. Thus equation 19 implies that:

    P(x̃_T̃ = 1 | N(x̃_T̃) = 0, ỹ) > P(x̃_T̃ = 0 | N(x̃_T̃) = 0, ỹ)    (20)

in contradiction to equation 18. Hence changing the value of a subtree from the converged values of the max-product algorithm cannot increase the posterior probability.
Now, assume we change the value of a single loop L ⊂ G and increase the posterior probability. This means that:

    P(x_L = 1 | N(x_L) = 0, y) > P(x_L = 0 | N(x_L) = 0, y)    (21)

By the infinite chain property, we know that for arbitrarily large n:

    P(x̃_L̃ = 1 | N(x̃_L̃) = 0) = α β P^n(x_L = 1 | N(x_L) = 0)    (22)

Therefore, since the boundary factor β is independent of n while the conditional probability is raised to the power n, equation 21 implies that for sufficiently large n:

    P(x̃_L̃ = 1 | N(x̃_L̃) = 0) > P(x̃_L̃ = 0 | N(x̃_L̃) = 0)    (23)

in contradiction to the optimality of x̃ = 0. Hence changing the value of a single loop cannot improve the posterior probability over x*.
Now we take two disconnected components C_1, C_2 ⊂ G and assume that changing the values of x_{C_1}, x_{C_2} improves the posterior probability. Again, by the Markov property:

    P(x_{C_1} = 1, x_{C_2} = 1 | N(x_{C_1}, x_{C_2}) = 0, y) > P(x_{C_1} = 0, x_{C_2} = 0 | N(x_{C_1}, x_{C_2}) = 0, y)    (24)

but since C_1 and C_2 are not connected, this implies:

    P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y) P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (25)

which implies that either

    P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y)    (26)

or:

    P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (27)

Thus if C_1 and C_2 are each either a tree or a single loop, this leads to a contradiction. Hence we cannot simultaneously change the values of two subtrees, or of a subtree and a loop, or of two disconnected loops, and increase the posterior probability. Similarly, we can show that changing the value of any finite number of disconnected trees or single loops will not increase the posterior. This proves claim 1.
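Claim 1 can also be checked empirically on small examples. The sketch below (all potentials invented) runs max-product on a four-node single-loop MRF, where the SLT neighborhood of x* is the entire assignment space, and verifies by exhaustive enumeration that the fixed-point assignment beats every other assignment:

```python
import itertools, math

# Empirical check of claim 1 on a single-loop MRF: on a single loop,
# every other assignment is in the SLT neighborhood of x*, so a stable
# fixed point must be the global MAP. Potentials are invented.
psi = {(0, 1): [[2.0, 1.0], [1.0, 2.0]],
       (1, 2): [[2.0, 1.0], [1.0, 2.0]],
       (2, 3): [[2.0, 1.0], [1.0, 2.0]],
       (3, 0): [[2.0, 1.0], [1.0, 2.0]]}
phi = [[10.0, 1.0], [10.0, 1.0], [1.0, 10.0], [1.0, 10.0]]
n = 4
nbrs = {i: [j for j in range(n) if (i, j) in psi or (j, i) in psi]
        for i in range(n)}

def pot(i, j, xi, xj):
    return psi[(i, j)][xi][xj] if (i, j) in psi else psi[(j, i)][xj][xi]

m = {(i, j): [0.5, 0.5] for i in range(n) for j in nbrs[i]}
delta = 1.0
for _ in range(100):  # parallel max-product updates (equation 6)
    new = {}
    for (i, j) in m:
        vec = [max(pot(i, j, xi, xj) * phi[i][xi] *
                   math.prod(m[k, i][xi] for k in nbrs[i] if k != j)
                   for xi in (0, 1)) for xj in (0, 1)]
        z = sum(vec)
        new[i, j] = [v / z for v in vec]
    delta = max(abs(new[e][v] - m[e][v]) for e in m for v in (0, 1))
    m = new
assert delta < 1e-9  # reached a fixed point

belief = [[phi[i][v] * math.prod(m[k, i][v] for k in nbrs[i])
           for v in (0, 1)] for i in range(n)]
x_star = tuple(max((0, 1), key=lambda v: b[v]) for b in belief)

def posterior(x):  # unnormalized P(x|y)
    return (math.prod(phi[i][x[i]] for i in range(n)) *
            math.prod(psi[e][x[e[0]]][x[e[1]]] for e in psi))

# every other assignment (the whole SLT neighborhood here) scores lower
assert all(posterior(x) < posterior(x_star)
           for x in itertools.product((0, 1), repeat=n) if x != x_star)
print(x_star)  # (0, 0, 1, 1)
```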
III. Examples
Claim 1 holds for arbitrary topologies and arbitrary potentials (both discrete and continuous nodes). We illustrate the implications of claim 1 for specific networks.
A. Gaussian graphical models
A Gaussian graphical model is one in which the joint distribution over x is Gaussian. Weiss and Freeman [22] have analyzed belief propagation on such graphs. One of the results given there can also be proved using our claim 1.

Corollary 1: For a Gaussian graphical model of arbitrary topology, if belief propagation converges, then the posterior marginal means calculated using belief propagation are exact.

Proof: For Gaussians, max-product and sum-product are identical. The posterior means calculated by belief propagation are therefore identical to the max-product assignment. By claim 1, we know that this must be a neighborhood maximum of the posterior probability. But Gaussians are unimodal, hence it must be a global maximum of the posterior probability. Thus the max-product assignment must equal the MAP assignment, and the posterior means calculated using belief propagation are exact.
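A numerical illustration of corollary 1 (the three-node model, matrix A and vector b are invented, and the message parameterization is the standard Gaussian belief propagation one, not code from the paper): on a loopy Gaussian model p(x) ∝ exp(−x'Ax/2 + b'x), the converged means match the exact solution of A x = b:

```python
# Gaussian BP on a fully connected (loopy) 3-node model. Messages are
# parameterized by a precision P and a linear term h; at convergence
# the means are exact (corollary 1), i.e. equal to A^{-1} b.
A = [[2.0, 0.5, 0.5],
     [0.5, 2.0, 0.5],
     [0.5, 0.5, 2.0]]
b = [1.0, 2.0, 3.0]
n = 3
edges = [(i, j) for i in range(n) for j in range(n)
         if i != j and A[i][j] != 0.0]

P = {e: 0.0 for e in edges}
h = {e: 0.0 for e in edges}
for _ in range(200):
    newP, newh = {}, {}
    for (i, j) in edges:
        # combine node i's own term with messages from neighbors != j
        Pi = A[i][i] + sum(P[k, i] for (k, t) in edges
                           if t == i and k != j)
        hi = b[i] + sum(h[k, i] for (k, t) in edges
                        if t == i and k != j)
        newP[i, j] = -A[i][j] * A[i][j] / Pi
        newh[i, j] = -A[i][j] * hi / Pi
    P, h = newP, newh

means = [(b[i] + sum(h[k, t] for (k, t) in edges if t == i)) /
         (A[i][i] + sum(P[k, t] for (k, t) in edges if t == i))
         for i in range(n)]
print(means)  # ~[0.0, 0.6667, 1.3333], the solution of A x = b
```

Here A is diagonally dominant, so the iteration converges; the variances computed this way would generally be wrong on a loopy graph, but the means are exact, as the corollary states.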
Fig. 4. Turbo code structure.
Fig. 5. Results of small turbo-code simulation. (a) Percent correct decodings for the max-product algorithm, compared with greedy gradient ascent in the posterior probability. (b) Comparison of convergence results of the two algorithms. (Axes: percent correct / percent converged versus noise σ.)
B. Turbo-codes
Figure 4 shows the pairwise Markov network corresponding to the decoding of a turbo code with 7 unknown bits. The top and bottom nodes represent the two transmitted messages (one for each constituent code). Thus in this example, the top and bottom nodes can take on 128 possible values. The potentials between the message nodes and their observations give the posterior probability of the transmitted word given one message, and the potentials between the message nodes and the bit nodes impose consistency. For example, Ψ(bit_1, message_1) = 1 if the first bit of message_1 is equal to bit_1, and zero otherwise. It is easy to show [21] that sum-product belief propagation on this graph gives the turbo decoding algorithm and max-product belief propagation gives the modified turbo decoding algorithm of [3].
Corollary 2: For a turbo code with arbitrary constituent codes, let x* be a fixed-point max-product decoding. Then P(x* | y) > P(x | y) for all x within Hamming distance 2 of x*.

Proof: This follows from the main claim. Note that whenever we change any bits in the graph we also have to change the two message nodes, so changing more than two bits will give two loops.
An obvious consequence of corollary 2 is that for a turbo code with arbitrary constituent codes, the max-product decoding is either the MAP decoding or at least Hamming distance 3 from the MAP decoding. In other words, the max-product algorithm cannot converge to a decoding that is "almost" right: if it is wrong, it must be wrong in at least three bits. In order for max-product to converge to a wrong decoding, there must exist a decoding that is at least distance 3 from the MAP decoding, and that decoding must have higher posterior probability than anything in its neighborhood. If no such wrong decoding exists, the max-product algorithm must either converge to the MAP decoding or fail to converge to a fixed point.
This behavior can be contrasted with the behavior of "greedy" iterative decoding, which increases the posterior probability of the decoding at every iteration. Greedy iterative decoding checks all bits and compares the posterior probability with the current value of that bit versus flipping that bit. If the posterior probability improves with flipping, the algorithm flips it (this is equivalent to free energy decoding [15] at zero temperature). This greedy decoding algorithm is guaranteed to converge to a local maximum of the posterior probability.
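The greedy decoder described above amounts to coordinate ascent on the posterior. A sketch with an invented toy "posterior" (soft bit evidence plus a single parity bonus; not an actual turbo-code likelihood):

```python
# Sketch of the greedy zero-temperature decoder: repeatedly flip any
# single bit that strictly increases the posterior, until no flip helps.
# The scoring function is an invented toy, NOT a turbo-code likelihood.
def greedy_decode(x, log_posterior):
    x = list(x)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            cur = log_posterior(x)
            x[i] ^= 1                    # tentatively flip bit i
            if log_posterior(x) > cur:
                improved = True          # keep the flip
            else:
                x[i] ^= 1                # revert
    return x

obs = [0.9, -0.2, 0.3, -0.8]             # invented soft evidence per bit
def log_posterior(x):
    score = sum(o if v else -o for o, v in zip(obs, x))
    return score + (1.0 if sum(x) % 2 == 0 else 0.0)  # parity bonus

print(greedy_decode([0, 0, 0, 0], log_posterior))  # [1, 1, 0, 0]
```

From the all-zeros start this toy example stops at [1, 1, 0, 0] even though [1, 0, 1, 0] scores higher: the greedy decoder only guarantees a local maximum, in contrast to the neighborhood guarantee of corollary 2.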
To illustrate these properties we ran the following simulations. We simulated transmitting turbo-encoded codewords of length 7 bits over a Gaussian channel. We compared the max-product decoding and the greedy decoding to the MAP decoding (since we were dealing with such short block lengths, we could calculate the MAP decoding using exhaustive search). We varied the noise of the Gaussian channel.

Figure 5 shows the results. Figure 5a shows the probability of convergence for both algorithms. Convergence was determined numerically for both algorithms: if the standard deviation of the messages over 10 successive iterations was